{ "cells": [ { "cell_type": "markdown", "id": "3963da99-70f9-4d44-88a9-031723c27f7a", "metadata": {}, "source": [
"# Task 2 [5 pts]\n",
"Choose one of the publicly available datasets. You will work on it until the end of the year (you can of course switch datasets along the way, but that means repeating some steps; they are not very costly, but still).\n",
"\n",
"The dataset should be:\n",
"* not too large (max ~200 MB)\n",
"* not too small (e.g. IRIS is too small ;))\n",
"* unique (each person in the group works on a different dataset). To keep things in sync, please record the dataset you chose here: https://uam.sharepoint.com/:x:/r/sites/2024SL06-S4IN01-F01005LABInynieriauczeniamaszynowego-Grupa11/Materiay%20z%20zaj/IUM-2024.xlsx?d=w23a1cad8c73a4fe183404d1b0671af36&csf=1&web=1&e=zUvrxN\n",
"* ideally, a dataset containing data in text form that can be used for a classification or regression task; such a dataset is easier to work with than, e.g., a set of images or sounds, and it lets you focus on the essence of this course.\n",
"\n",
"Write a script that:\n",
"1. Downloads the dataset you chose\n",
"2. Splits it into train/dev/test subsets if the dataset does not already come with such a split\n",
"3. Collects and prints statistics for the dataset and its subsets, such as:\n",
"   * the size of the dataset and its subsets\n",
"   * the mean, minimum, maximum, standard deviation, and median of each parameter\n",
"   * the frequency distribution of examples across classes\n",
"4. Normalizes the data in the dataset (e.g. scaling float values to the range 0.0 - 1.0)\n",
"5. Cleans the dataset of artifacts (e.g. empty lines, examples with invalid values)\n",
"\n",
"You can write the script in your favorite language. The only restriction: it must run on Linux.\n",
"A Jupyter notebook is a convenient way to do this (although it is not required).\n",
"Create a repository named \"ium_nrindeksu\" (where nrindeksu is your student index number) on the faculty git server (http://git.wmi.amu.edu.pl/) and put the script in it.\n",
"Paste the link to the repository into the dataset spreadsheet: https://uam.sharepoint.com/:x:/r/sites/2024SL06-S4IN01-F01005LABInynieriauczeniamaszynowego-Grupa11/Materiay%20z%20zaj/IUM-2024.xlsx?d=w23a1cad8c73a4fe183404d1b0671af36&csf=1&web=1&e=zUvrxN"
] }, { "cell_type": "code", "execution_count": 47, "id": "15654f1c-682b-422d-b2d5-398b93a92518", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Defaulting to user installation because normal site-packages is not writeable\n", "Collecting kaggle\n", " Downloading kaggle-1.6.6.tar.gz (84 kB)\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m84.6/84.6 kB\u001b[0m \u001b[31m1.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0ma \u001b[36m0:00:01\u001b[0m\n", "\u001b[?25h Preparing metadata (setup.py) ... 
\u001b[?25ldone\n", "\u001b[?25hRequirement already satisfied: bleach in /usr/local/lib/python3.9/dist-packages (from kaggle) (5.0.1)\n", "Requirement already satisfied: certifi in /usr/local/lib/python3.9/dist-packages (from kaggle) (2022.9.14)\n", "Requirement already satisfied: python-dateutil in /usr/local/lib/python3.9/dist-packages (from kaggle) (2.8.2)\n", "Collecting python-slugify (from kaggle)\n", " Downloading python_slugify-8.0.4-py2.py3-none-any.whl.metadata (8.5 kB)\n", "Requirement already satisfied: requests in /usr/local/lib/python3.9/dist-packages (from kaggle) (2.28.1)\n", "Requirement already satisfied: six>=1.10 in /usr/lib/python3/dist-packages (from kaggle) (1.16.0)\n", "Requirement already satisfied: tqdm in /usr/local/lib/python3.9/dist-packages (from kaggle) (4.64.1)\n", "Requirement already satisfied: urllib3 in /usr/local/lib/python3.9/dist-packages (from kaggle) (1.26.12)\n", "Requirement already satisfied: webencodings in /usr/local/lib/python3.9/dist-packages (from bleach->kaggle) (0.5.1)\n", "Collecting text-unidecode>=1.3 (from python-slugify->kaggle)\n", " Downloading text_unidecode-1.3-py2.py3-none-any.whl.metadata (2.4 kB)\n", "Requirement already satisfied: charset-normalizer<3,>=2 in /usr/local/lib/python3.9/dist-packages (from requests->kaggle) (2.1.1)\n", "Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.9/dist-packages (from requests->kaggle) (3.4)\n", "Downloading python_slugify-8.0.4-py2.py3-none-any.whl (10 kB)\n", "Downloading text_unidecode-1.3-py2.py3-none-any.whl (78 kB)\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m78.2/78.2 kB\u001b[0m \u001b[31m3.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[?25hBuilding wheels for collected packages: kaggle\n", " Building wheel for kaggle (setup.py) ... 
\u001b[?25ldone\n", "\u001b[?25h Created wheel for kaggle: filename=kaggle-1.6.6-py3-none-any.whl size=111949 sha256=7e53aec9a3b77513258bc4a5e64c8024f55ccc060eee29fae9b6a5dcdb06b381\n", " Stored in directory: /home/students/s464953/.cache/pip/wheels/46/aa/c3/b3e421522fb5acdd7c366a05c5fc80787615bdeed207e7f79b\n", "Successfully built kaggle\n", "Installing collected packages: text-unidecode, python-slugify, kaggle\n", "\u001b[33m WARNING: The script slugify is installed in '/home/students/s464953/.local/bin' which is not on PATH.\n", " Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.\u001b[0m\u001b[33m\n", "\u001b[0m\u001b[33m WARNING: The script kaggle is installed in '/home/students/s464953/.local/bin' which is not on PATH.\n", " Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.\u001b[0m\u001b[33m\n", "\u001b[0mSuccessfully installed kaggle-1.6.6 python-slugify-8.0.4 text-unidecode-1.3\n" ] } ], "source": [ "!pip install kaggle" ] },
{ "cell_type": "code", "execution_count": 55, "id": "37ed37d7-40fd-4f79-8d12-4a54cb62540c", "metadata": {}, "outputs": [], "source": [
"# Library imports\n",
"\n",
"import os\n",
"import shutil\n",
"import pandas as pd\n",
"from sklearn.model_selection import train_test_split\n",
"import requests\n",
"from sklearn.preprocessing import MinMaxScaler\n",
"from kaggle.api.kaggle_api_extended import KaggleApi"
] },
{ "cell_type": "code", "execution_count": 56, "id": "393cb55e-404b-4f67-a363-9ea94fca8aea", "metadata": {}, "outputs": [], "source": [
"# Function that downloads the dataset\n",
"\n",
"def download_file(url, filename, destination_folder):\n",
"    # Version for Kaggle datasets: the dataset slug is hard-coded, so url and filename are unused here\n",
"    api = KaggleApi()\n",
"    api.authenticate()\n",
"\n",
"    # Unzip straight into the destination folder, which is where the CSV is read from later\n",
"    api.dataset_download_files('brunoalarcon123/top-200-spotify-songs-dataset', path=destination_folder, unzip=True)\n",
"    # Nothing is left to move afterwards, so no file path is returned\n",
"    return None\n",
"\n",
"    # Version for datasets hosted outside Kaggle\n",
"    # response = requests.get(url)\n",
"    # if response.status_code == 200:\n",
"    #     # Path to the file in the destination folder\n",
"    #     filepath = os.path.join(destination_folder, filename)\n",
"    #     with open(filepath, 'wb') as f:\n",
"    #         f.write(response.content)\n",
"    #     print(f\"Downloaded file: {filename}\")\n",
"    #     return filepath\n",
"    # else:\n",
"    #     print(\"An error occurred while downloading the file.\")\n"
] },
{ "cell_type": "code", "execution_count": 57, "id": "368a3d3c-31c1-4131-b0f8-d6b374781251", "metadata": {}, "outputs": [], "source": [
"# Function that splits the dataset\n",
"\n",
"def split_dataset(data, test_size=0.2, val_size=0.1, random_state=42):\n",
"    # Split off the test set first\n",
"    train_data, test_data = train_test_split(data, test_size=test_size, random_state=random_state)\n",
"    # Split the remainder into training and validation sets; val_size is a fraction of the\n",
"    # whole dataset, so it is rescaled to a fraction of what is left after the test split\n",
"    train_data, val_data = train_test_split(train_data, test_size=val_size/(1-test_size), random_state=random_state)\n",
"\n",
"    return train_data, val_data, test_data"
] },
{ "cell_type": "code", "execution_count": 58, "id": "2089aa48-9898-411b-86dd-d728fced1dc0", "metadata": {}, "outputs": [], "source": [
"# Printing dataset statistics\n",
"\n",
"def print_dataset_stats(data, subset_name):\n",
"    # Subset size\n",
"    print(f\"Statystyki dla zbioru {subset_name}:\")\n",
"    print(f\"Wielkość zbioru {subset_name}: {len(data)}\")\n",
"\n",
"    # Summary statistics of the numeric columns (the 50% row is the median)\n",
"    print(\"\\nStatystyki wartości poszczególnych parametrów:\")\n",
"    print(data.describe())\n",
"\n",
"    # Frequency distribution of the values in every column\n",
"    for column in data.columns:\n",
"        print(f\"Rozkład częstości dla kolumny '{column}':\")\n",
"        print(data[column].value_counts())\n",
"        print(\"\\n\")"
] },
{ "cell_type": "code", "execution_count": 59, "id": "eb057bf2-8848-4544-a2b0-13f0edd3a107", "metadata": {}, "outputs": [], "source": [
"# Data normalization\n",
"\n",
"def normalize_data(data):\n",
"    # Scale every numeric column to the range [0, 1]; other columns are left untouched\n",
"    scaler = MinMaxScaler()\n",
"    numeric_columns = data.select_dtypes(include=['int', 'float']).columns\n",
"    df_normalized = data.copy()\n",
"    df_normalized[numeric_columns] = scaler.fit_transform(df_normalized[numeric_columns])\n",
"    return df_normalized"
] },
{ "cell_type": "code", "execution_count": 60, "id": "5bb10125-17a6-40b9-87de-c2d237f50ff5", "metadata": {}, "outputs": [], "source": [
"# Data cleaning\n",
"\n",
"def clean_dataset(data):\n",
"    # Drop rows with missing values and exact duplicates without mutating the caller's frame\n",
"    return data.dropna().drop_duplicates()"
] },
{ "cell_type": "code", "execution_count": 61, "id": "54eead00-348a-40c2-b56d-2dc82b269269", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Statystyki dla zbioru treningowego:\n", "Wielkość zbioru treningowego: 456354\n", "\n", "Statystyki wartości poszczególnych parametrów:\n", " Rank Danceability Energy Loudness \\\n", "count 456354.000000 456354.000000 456354.000000 456354.000000 \n", "mean 100.358899 0.697782 0.652136 -5293.064991 \n", "std 57.398624 0.133183 0.155760 2783.750842 \n", "min 1.000000 0.073000 0.005000 -34475.000000 \n", "25% 51.000000 0.617000 0.549000 -6825.000000 \n", "50% 100.000000 0.719000 0.671000 -5206.000000 \n", "75% 150.000000 0.793000 0.771000 -3875.000000 \n", "max 200.000000 0.985000 0.996000 1509.000000 \n", "\n", " Speechiness Acousticness Instrumentalness Valence \\\n", "count 456354.000000 456354.000000 456354.000000 456354.000000 \n", "mean 0.109976 0.230610 0.007728 0.523162 \n", "std 0.096896 0.230671 0.055278 0.223983 \n", "min 0.022000 0.000000 0.000000 0.032000 \n", "25% 0.045000 0.048000 0.000000 0.356000 \n", "50% 0.068000 0.152000 0.000000 0.521000 \n", "75% 0.136000 0.349000 0.000000 0.696000 \n", "max 0.966000 0.994000 0.956000 0.982000 \n", "\n", " Points (Total) Points (Ind for each Artist/Nat) \n", "count 456354.000000 456354.000000 \n", "mean 100.641101 72.473544 \n", "std 57.398624 54.254094 \n", "min 1.000000 0.200000 \n", "25% 51.000000 28.000000 \n", "50% 101.000000 60.000000 \n", "75% 150.000000 104.000000 \n", "max 200.000000 200.000000 \n", "Rozkład częstości dla kolumny 
'Rank':\n", "88 2428\n", "57 2416\n", "56 2416\n", "83 2409\n", "78 2407\n", " ... \n", "11 2161\n", "1 2147\n", "4 2129\n", "13 2101\n", "3 2060\n", "Name: Rank, Length: 200, dtype: int64\n", "\n", "\n", "Rozkład częstości dla kolumny 'Title':\n", "Sunflower - Spider-Man: Into the Spider-Verse 2280\n", "One Dance 2107\n", "Something Just Like This 1795\n", "Shallow 1750\n", "Closer 1745\n", " ... \n", "Science Fiction 1\n", "Sweet Caroline 1\n", "Open It Up 1\n", "Lovin' Me (feat. Phoebe Bridgers) 1\n", "You Don't Do It For Me Anymore 1\n", "Name: Title, Length: 6954, dtype: int64\n", "\n", "\n", "Rozkład częstości dla kolumny 'Artists':\n", "Ed Sheeran 8532\n", "Post Malone 5442\n", "XXXTENTACION 4826\n", "Billie Eilish 4786\n", "Bad Bunny 4050\n", " ... \n", "Daði Freyr 1\n", "Justin Bieber, - 1\n", "Brent Faiyaz, Alicia Keys 1\n", "CHVRCHES 1\n", "NCT DREAM 1\n", "Name: Artists, Length: 2849, dtype: int64\n", "\n", "\n", "Rozkład częstości dla kolumny 'Date':\n", "22/01/2022 254\n", "05/01/2022 253\n", "28/01/2022 249\n", "07/01/2022 248\n", "19/03/2022 247\n", " ... \n", "13/12/2020 162\n", "26/12/2020 162\n", "23/07/2017 160\n", "04/02/2021 160\n", "23/03/2020 158\n", "Name: Date, Length: 2336, dtype: int64\n", "\n", "\n", "Rozkład częstości dla kolumny 'Danceability':\n", "0.795 4708\n", "0.791 3910\n", "0.671 3493\n", "0.755 3417\n", "0.807 3282\n", " ... \n", "0.223 1\n", "0.270 1\n", "0.185 1\n", "0.204 1\n", "0.373 1\n", "Name: Danceability, Length: 726, dtype: int64\n", "\n", "\n", "Rozkład częstości dla kolumny 'Energy':\n", "0.771 2721\n", "0.648 2632\n", "0.522 2613\n", "0.631 2508\n", "0.726 2453\n", " ... \n", "0.120 1\n", "0.981 1\n", "0.162 1\n", "0.096 1\n", "0.220 1\n", "Name: Energy, Length: 852, dtype: int64\n", "\n", "\n", "Rozkład częstości dla kolumny 'Loudness':\n", "-4368.00 1873\n", "-6769.00 1795\n", "-10109.00 1794\n", "-6362.00 1751\n", "-5599.00 1745\n", " ... 
\n", "-4455.00 1\n", "-2886.00 1\n", "-4446.00 1\n", "-15.34 1\n", "-1627.00 1\n", "Name: Loudness, Length: 5090, dtype: int64\n", "\n", "\n", "Rozkład częstości dla kolumny 'Speechiness':\n", "0.032 11398\n", "0.048 9246\n", "0.036 8197\n", "0.045 8167\n", "0.034 7596\n", " ... \n", "0.515 1\n", "0.314 1\n", "0.388 1\n", "0.888 1\n", "0.399 1\n", "Name: Speechiness, Length: 525, dtype: int64\n", "\n", "\n", "Rozkład częstości dla kolumny 'Acousticness':\n", "0.008 5145\n", "0.003 4843\n", "0.017 4482\n", "0.002 4137\n", "0.005 4012\n", " ... \n", "0.502 1\n", "0.715 1\n", "0.992 1\n", "0.854 1\n", "0.787 1\n", "Name: Acousticness, Length: 942, dtype: int64\n", "\n", "\n", "Rozkład częstości dla kolumny 'Instrumentalness':\n", "0.000 399153\n", "0.001 15897\n", "0.002 5440\n", "0.004 4680\n", "0.003 4503\n", " ... \n", "0.153 1\n", "0.468 1\n", "0.736 1\n", "0.295 1\n", "0.939 1\n", "Name: Instrumentalness, Length: 289, dtype: int64\n", "\n", "\n", "Rozkład częstości dla kolumny 'Valence':\n", "0.446 4102\n", "0.437 3247\n", "0.580 2704\n", "0.494 2353\n", "0.661 2238\n", " ... \n", "0.047 1\n", "0.070 1\n", "0.935 1\n", "0.936 1\n", "0.057 1\n", "Name: Valence, Length: 933, dtype: int64\n", "\n", "\n", "Rozkład częstości dla kolumny '# of Artist':\n", "Artist 1 329460\n", "Artist 2 89046\n", "Artist 3 24054\n", "Artist 4 6748\n", "Artist 5 4185\n", "Artist 6 1865\n", "Artist 7 754\n", "Artist 8 175\n", "Artist 9 67\n", "Name: # of Artist, dtype: int64\n", "\n", "\n", "Rozkład częstości dla kolumny 'Artist (Ind.)':\n", "Bad Bunny 11664\n", "Ed Sheeran 8928\n", "Post Malone 8057\n", "J Balvin 7359\n", "Drake 7067\n", " ... 
\n", "Wilbur Soot 1\n", "Hayley Williams 1\n", "Malik Monta 1\n", "Kollegah 1\n", "NCT DREAM 1\n", "Name: Artist (Ind.), Length: 2115, dtype: int64\n", "\n", "\n", "Rozkład częstości dla kolumny '# of Nationality':\n", "Nationality 1 329460\n", "Nationality 2 89046\n", "Nationality 3 24054\n", "Nationality 4 6748\n", "Nationality 5 4185\n", "Nationality 6 1865\n", "Nationality 7 754\n", "Nationality 8 175\n", "Nationality 9 67\n", "Name: # of Nationality, dtype: int64\n", "\n", "\n", "Rozkład częstości dla kolumny 'Nationality':\n", "United States 192533\n", "United Kingdom 59024\n", "Puerto Rico 53571\n", "Canada 27768\n", "Colombia 24259\n", " ... \n", "Bonaire 1\n", "Senegal 1\n", "Sri Lanka 1\n", "Suecia 1\n", "Azerbaijan 1\n", "Name: Nationality, Length: 72, dtype: int64\n", "\n", "\n", "Rozkład częstości dla kolumny 'Continent':\n", "Anglo-America 222778\n", "Latin-America 108347\n", "Europe 103046\n", "Asia 9688\n", "Oceania 9319\n", "Africa 2861\n", "Unknown 315\n", "Name: Continent, dtype: int64\n", "\n", "\n", "Rozkład częstości dla kolumny 'Points (Total)':\n", "113 2428\n", "144 2416\n", "145 2416\n", "118 2409\n", "123 2407\n", " ... \n", "190 2161\n", "200 2147\n", "197 2129\n", "188 2101\n", "198 2060\n", "Name: Points (Total), Length: 200, dtype: int64\n", "\n", "\n", "Rozkład częstości dla kolumny 'Points (Ind for each Artist/Nat)':\n", "18.000000 5180\n", "12.000000 4958\n", "2.000000 4868\n", "14.000000 4864\n", "24.000000 4839\n", " ... \n", "19.250000 2\n", "9.666667 1\n", "55.333333 1\n", "35.600000 1\n", "4.250000 1\n", "Name: Points (Ind for each Artist/Nat), Length: 476, dtype: int64\n", "\n", "\n", "Rozkład częstości dla kolumny 'id':\n", "0RiRZpuVRbi7oqRdSMwhQY 1805\n", "6RUKPb4LETWmmr3iAEQktW 1795\n", "7BKLCZ1jbUBVqRi2FVlTVw 1745\n", "2VxeLyX666F8uXCJ0dZF8B 1742\n", "7qiZfU4dY1lWllzX7mPBI3 1573\n", " ... 
\n", "5S7FewmYYyLNdMOfeEcB6P 1\n", "6mPZVis3gEGSSR2rhxlehT 1\n", "12VqHHz4wvVcnEdSivjLeQ 1\n", "2O1qYJTA2BI5ypFFqEZhh4 1\n", "4I39irD0xSyfRA099AsWow 1\n", "Name: id, Length: 8521, dtype: int64\n", "\n", "\n", "Rozkład częstości dla kolumny 'Song URL':\n", "https://open.spotify.com/track/0RiRZpuVRbi7oqRdSMwhQY 1805\n", "https://open.spotify.com/track/6RUKPb4LETWmmr3iAEQktW 1795\n", "https://open.spotify.com/track/7BKLCZ1jbUBVqRi2FVlTVw 1745\n", "https://open.spotify.com/track/2VxeLyX666F8uXCJ0dZF8B 1742\n", "https://open.spotify.com/track/7qiZfU4dY1lWllzX7mPBI3 1573\n", " ... \n", "https://open.spotify.com/track/5S7FewmYYyLNdMOfeEcB6P 1\n", "https://open.spotify.com/track/6mPZVis3gEGSSR2rhxlehT 1\n", "https://open.spotify.com/track/12VqHHz4wvVcnEdSivjLeQ 1\n", "https://open.spotify.com/track/2O1qYJTA2BI5ypFFqEZhh4 1\n", "https://open.spotify.com/track/4I39irD0xSyfRA099AsWow 1\n", "Name: Song URL, Length: 8521, dtype: int64\n", "\n", "\n", "\n", "\n", "Statystyki dla zbioru walidacyjnego:\n", "Wielkość zbioru walidacyjnego: 65194\n", "\n", "Statystyki wartości poszczególnych parametrów:\n", " Rank Danceability Energy Loudness Speechiness \\\n", "count 65194.000000 65194.000000 65194.000000 65194.000000 65194.000000 \n", "mean 100.435684 0.697871 0.652172 -5285.533729 0.109990 \n", "std 57.438715 0.133125 0.155214 2794.797372 0.097208 \n", "min 1.000000 0.150000 0.005000 -23023.000000 0.023000 \n", "25% 51.000000 0.618000 0.548000 -6825.000000 0.044000 \n", "50% 100.000000 0.719000 0.671000 -5211.000000 0.068000 \n", "75% 150.000000 0.793000 0.771000 -3872.000000 0.135000 \n", "max 200.000000 0.985000 0.989000 1509.000000 0.966000 \n", "\n", " Acousticness Instrumentalness Valence Points (Total) \\\n", "count 65194.000000 65194.000000 65194.000000 65194.000000 \n", "mean 0.230139 0.007433 0.523956 100.564316 \n", "std 0.230539 0.053366 0.224228 57.438715 \n", "min 0.000000 0.000000 0.032000 1.000000 \n", "25% 0.048000 0.000000 0.356000 51.000000 \n", "50% 0.152000 
0.000000 0.524000 101.000000 \n", "75% 0.348000 0.000000 0.697000 150.000000 \n", "max 0.994000 0.919000 0.982000 200.000000 \n", "\n", " Points (Ind for each Artist/Nat) \n", "count 65194.000000 \n", "mean 72.301296 \n", "std 54.263761 \n", "min 0.250000 \n", "25% 28.000000 \n", "50% 59.000000 \n", "75% 103.000000 \n", "max 200.000000 \n", "Rozkład częstości dla kolumny 'Rank':\n", "79 372\n", "58 370\n", "96 368\n", "25 362\n", "77 362\n", " ... \n", "29 286\n", "48 285\n", "3 285\n", "193 284\n", "144 268\n", "Name: Rank, Length: 200, dtype: int64\n", "\n", "\n", "Rozkład częstości dla kolumny 'Title':\n", "Sunflower - Spider-Man: Into the Spider-Verse 347\n", "One Dance 301\n", "Shallow 256\n", "Closer 252\n", "Something Just Like This 247\n", " ... \n", "How to Talk 1\n", "Black Tux, White Collar 1\n", "Shot in the Dark 1\n", "Until I Bleed Out 1\n", "Don't Shoot (feat. Rick Ross, 2 Chainz, Diddy, Fabolous, Wale, DJ Khaled, Swizz Beatz, Yo Gotti, Currensy, Problem, King Pharaoh & TGT) 1\n", "Name: Title, Length: 4548, dtype: int64\n", "\n", "\n", "Rozkład częstości dla kolumny 'Artists':\n", "Ed Sheeran 1279\n", "Post Malone 779\n", "XXXTENTACION 709\n", "Billie Eilish 620\n", "Taylor Swift 561\n", " ... \n", "Louis Armstrong, The Commanders 1\n", "Nas 1\n", "Kungs, Olly Murs, Coely 1\n", "Paloma Mami 1\n", "The Game, Curren$y 1\n", "Name: Artists, Length: 2261, dtype: int64\n", "\n", "\n", "Rozkład częstości dla kolumny 'Date':\n", "18/01/2022 48\n", "11/06/2017 48\n", "14/01/2022 48\n", "26/03/2022 46\n", "17/08/2019 45\n", " ..\n", "25/07/2020 15\n", "07/03/2022 14\n", "20/07/2020 14\n", "12/10/2020 12\n", "19/08/2020 12\n", "Name: Date, Length: 2336, dtype: int64\n", "\n", "\n", "Rozkład częstości dla kolumny 'Danceability':\n", "0.795 656\n", "0.791 554\n", "0.755 511\n", "0.807 499\n", "0.671 483\n", " ... 
\n", "0.233 1\n", "0.270 1\n", "0.408 1\n", "0.419 1\n", "0.269 1\n", "Name: Danceability, Length: 679, dtype: int64\n", "\n", "\n", "Rozkład częstości dla kolumny 'Energy':\n", "0.522 410\n", "0.771 407\n", "0.648 404\n", "0.715 383\n", "0.633 357\n", " ... \n", "0.326 1\n", "0.228 1\n", "0.333 1\n", "0.135 1\n", "0.005 1\n", "Name: Energy, Length: 782, dtype: int64\n", "\n", "\n", "Rozkład częstości dla kolumny 'Loudness':\n", "-4368.0 289\n", "-6312.0 256\n", "-6362.0 256\n", "-5599.0 252\n", "-10109.0 248\n", " ... \n", "-7496.0 1\n", "-14542.0 1\n", "-4969.0 1\n", "-4396.0 1\n", "-9807.0 1\n", "Name: Loudness, Length: 3739, dtype: int64\n", "\n", "\n", "Rozkład częstości dla kolumny 'Speechiness':\n", "0.032 1635\n", "0.048 1247\n", "0.036 1229\n", "0.045 1135\n", "0.034 1090\n", " ... \n", "0.488 1\n", "0.553 1\n", "0.642 1\n", "0.394 1\n", "0.503 1\n", "Name: Speechiness, Length: 469, dtype: int64\n", "\n", "\n", "Rozkład częstości dla kolumny 'Acousticness':\n", "0.008 728\n", "0.003 672\n", "0.017 609\n", "0.002 594\n", "0.001 574\n", " ... \n", "0.857 1\n", "0.637 1\n", "0.763 1\n", "0.608 1\n", "0.845 1\n", "Name: Acousticness, Length: 861, dtype: int64\n", "\n", "\n", "Rozkład częstości dla kolumny 'Instrumentalness':\n", "0.000 57090\n", "0.001 2269\n", "0.002 745\n", "0.003 676\n", "0.004 646\n", " ... \n", "0.504 1\n", "0.349 1\n", "0.066 1\n", "0.375 1\n", "0.269 1\n", "Name: Instrumentalness, Length: 203, dtype: int64\n", "\n", "\n", "Rozkład częstości dla kolumny 'Valence':\n", "0.446 542\n", "0.437 489\n", "0.580 367\n", "0.609 324\n", "0.323 322\n", " ... 
\n", "0.065 1\n", "0.929 1\n", "0.977 1\n", "0.117 1\n", "0.103 1\n", "Name: Valence, Length: 921, dtype: int64\n", "\n", "\n", "Rozkład częstości dla kolumny '# of Artist':\n", "Artist 1 47052\n", "Artist 2 12765\n", "Artist 3 3374\n", "Artist 4 1004\n", "Artist 5 583\n", "Artist 6 261\n", "Artist 7 127\n", "Artist 8 22\n", "Artist 9 6\n", "Name: # of Artist, dtype: int64\n", "\n", "\n", "Rozkład częstości dla kolumny 'Artist (Ind.)':\n", "Bad Bunny 1677\n", "Ed Sheeran 1328\n", "Post Malone 1191\n", "J Balvin 1087\n", "Drake 994\n", " ... \n", "Hit-Boy 1\n", "Luis Miguel 1\n", "Ludacris 1\n", "Stargate 1\n", "The Game 1\n", "Name: Artist (Ind.), Length: 1676, dtype: int64\n", "\n", "\n", "Rozkład częstości dla kolumny '# of Nationality':\n", "Nationality 1 47052\n", "Nationality 2 12765\n", "Nationality 3 3374\n", "Nationality 4 1004\n", "Nationality 5 583\n", "Nationality 6 261\n", "Nationality 7 127\n", "Nationality 8 22\n", "Nationality 9 6\n", "Name: # of Nationality, dtype: int64\n", "\n", "\n", "Rozkład częstości dla kolumny 'Nationality':\n", "United States 27308\n", "United Kingdom 8416\n", "Puerto Rico 7625\n", "Canada 3935\n", "Colombia 3546\n", " ... \n", "China 1\n", "Sri Lanka 1\n", "Haiti 1\n", "Ivory Coast 1\n", "Greece 1\n", "Name: Nationality, Length: 62, dtype: int64\n", "\n", "\n", "Rozkład częstości dla kolumny 'Continent':\n", "Anglo-America 31593\n", "Latin-America 15516\n", "Europe 14734\n", "Asia 1483\n", "Oceania 1385\n", "Africa 430\n", "Unknown 53\n", "Name: Continent, dtype: int64\n", "\n", "\n", "Rozkład częstości dla kolumny 'Points (Total)':\n", "122 372\n", "143 370\n", "105 368\n", "176 362\n", "124 362\n", " ... \n", "172 286\n", "153 285\n", "198 285\n", "8 284\n", "57 268\n", "Name: Points (Total), Length: 200, dtype: int64\n", "\n", "\n", "Rozkład częstości dla kolumny 'Points (Ind for each Artist/Nat)':\n", "18.000000 746\n", "24.000000 725\n", "12.000000 716\n", "28.000000 707\n", "26.000000 701\n", " ... 
\n", "23.333333 1\n", "15.333333 1\n", "47.333333 1\n", "12.500000 1\n", "64.666667 1\n", "Name: Points (Ind for each Artist/Nat), Length: 412, dtype: int64\n", "\n", "\n", "Rozkład częstości dla kolumny 'id':\n", "0RiRZpuVRbi7oqRdSMwhQY 282\n", "2VxeLyX666F8uXCJ0dZF8B 253\n", "7BKLCZ1jbUBVqRi2FVlTVw 252\n", "6RUKPb4LETWmmr3iAEQktW 247\n", "0tgVpDi06FyKpA1z0VMD4v 235\n", " ... \n", "5egD7A5x9AHdVO2fMo3Wbo 1\n", "7oTE1KmtU2ml9zBhv9Reao 1\n", "0TFVOjSvPTjFHkiZZekK5k 1\n", "2yBWnRKj0Zx7AF1kufajvW 1\n", "53uF5QwHADVXKv383qXnXd 1\n", "Name: id, Length: 5441, dtype: int64\n", "\n", "\n", "Rozkład częstości dla kolumny 'Song URL':\n", "https://open.spotify.com/track/0RiRZpuVRbi7oqRdSMwhQY 282\n", "https://open.spotify.com/track/2VxeLyX666F8uXCJ0dZF8B 253\n", "https://open.spotify.com/track/7BKLCZ1jbUBVqRi2FVlTVw 252\n", "https://open.spotify.com/track/6RUKPb4LETWmmr3iAEQktW 247\n", "https://open.spotify.com/track/0tgVpDi06FyKpA1z0VMD4v 235\n", " ... \n", "https://open.spotify.com/track/5egD7A5x9AHdVO2fMo3Wbo 1\n", "https://open.spotify.com/track/7oTE1KmtU2ml9zBhv9Reao 1\n", "https://open.spotify.com/track/0TFVOjSvPTjFHkiZZekK5k 1\n", "https://open.spotify.com/track/2yBWnRKj0Zx7AF1kufajvW 1\n", "https://open.spotify.com/track/53uF5QwHADVXKv383qXnXd 1\n", "Name: Song URL, Length: 5441, dtype: int64\n", "\n", "\n", "\n", "\n", "Statystyki dla zbioru testowego:\n", "Wielkość zbioru testowego: 130388\n", "\n", "Statystyki wartości poszczególnych parametrów:\n", " Rank Danceability Energy Loudness \\\n", "count 130388.000000 130388.000000 130388.000000 130388.000000 \n", "mean 100.564922 0.697480 0.651683 -5309.731673 \n", "std 57.418920 0.133228 0.155650 2785.742733 \n", "min 1.000000 0.150000 0.022000 -34475.000000 \n", "25% 51.000000 0.617000 0.548000 -6827.000000 \n", "50% 100.000000 0.718000 0.671000 -5224.000000 \n", "75% 150.000000 0.792000 0.770000 -3912.000000 \n", "max 200.000000 0.985000 0.989000 1509.000000 \n", "\n", " Speechiness Acousticness Instrumentalness 
Valence \\\n", "count 130388.000000 130388.000000 130388.000000 130388.000000 \n", "mean 0.109818 0.231263 0.007470 0.522624 \n", "std 0.096464 0.230932 0.053432 0.223576 \n", "min 0.023000 0.000000 0.000000 0.026000 \n", "25% 0.045000 0.048000 0.000000 0.356000 \n", "50% 0.068000 0.152500 0.000000 0.520000 \n", "75% 0.136000 0.352000 0.000000 0.695000 \n", "max 0.966000 0.994000 0.942000 0.982000 \n", "\n", " Points (Total) Points (Ind for each Artist/Nat) \n", "count 130388.000000 130388.000000 \n", "mean 100.435078 72.147657 \n", "std 57.418920 54.117721 \n", "min 1.000000 0.250000 \n", "25% 51.000000 28.000000 \n", "50% 101.000000 59.000000 \n", "75% 150.000000 103.000000 \n", "max 200.000000 200.000000 \n", "Rozkład częstości dla kolumny 'Rank':\n", "77 729\n", "54 724\n", "50 717\n", "63 709\n", "137 706\n", " ... \n", "4 595\n", "133 593\n", "121 590\n", "1 586\n", "3 582\n", "Name: Rank, Length: 200, dtype: int64\n", "\n", "\n", "Rozkład częstości dla kolumny 'Title':\n", "Sunflower - Spider-Man: Into the Spider-Verse 635\n", "One Dance 562\n", "Something Just Like This 560\n", "Closer 535\n", "Shallow 512\n", " ... \n", "What Makes A Woman 1\n", "Te Amo Demais 1\n", "Sharp Edges 1\n", "Bella ciao - HUGEL Remix 1\n", "Real Baby Pluto 1\n", "Name: Title, Length: 5361, dtype: int64\n", "\n", "\n", "Rozkład częstości dla kolumny 'Artists':\n", "Ed Sheeran 2478\n", "Post Malone 1503\n", "XXXTENTACION 1389\n", "Billie Eilish 1350\n", "The Weeknd 1169\n", " ... \n", "Leo Lewis, Avicii 1\n", "Peggy Lee 1\n", "Benjamin Ingrosso 1\n", "Dr. 
Dre, Eminem 1\n", "blink-182 1\n", "Name: Artists, Length: 2487, dtype: int64\n", "\n", "\n", "Rozkład częstości dla kolumny 'Date':\n", "03/02/2018 82\n", "27/01/2022 81\n", "08/09/2019 80\n", "01/05/2022 79\n", "07/03/2022 79\n", " ..\n", "13/03/2020 36\n", "05/02/2020 36\n", "13/08/2020 35\n", "14/09/2017 32\n", "06/01/2017 32\n", "Name: Date, Length: 2336, dtype: int64\n", "\n", "\n", "Rozkład częstości dla kolumny 'Danceability':\n", "0.795 1295\n", "0.791 1119\n", "0.671 978\n", "0.647 955\n", "0.807 932\n", " ... \n", "0.172 1\n", "0.272 1\n", "0.326 1\n", "0.304 1\n", "0.208 1\n", "Name: Danceability, Length: 697, dtype: int64\n", "\n", "\n", "Rozkład częstości dla kolumny 'Energy':\n", "0.771 753\n", "0.648 746\n", "0.522 740\n", "0.715 734\n", "0.726 714\n", " ... \n", "0.949 1\n", "0.212 1\n", "0.345 1\n", "0.220 1\n", "0.067 1\n", "Name: Energy, Length: 804, dtype: int64\n", "\n", "\n", "Rozkład częstości dla kolumny 'Loudness':\n", "-6769.00 560\n", "-5599.00 535\n", "-4368.00 525\n", "-10109.00 516\n", "-6362.00 512\n", " ... \n", "-13.04 1\n", "-13778.00 1\n", "-6.15 1\n", "-11667.00 1\n", "-14104.00 1\n", "Name: Loudness, Length: 4220, dtype: int64\n", "\n", "\n", "Rozkład częstości dla kolumny 'Speechiness':\n", "0.032 3269\n", "0.048 2646\n", "0.045 2406\n", "0.036 2341\n", "0.034 2262\n", " ... \n", "0.870 1\n", "0.505 1\n", "0.747 1\n", "0.522 1\n", "0.663 1\n", "Name: Speechiness, Length: 492, dtype: int64\n", "\n", "\n", "Rozkład częstości dla kolumny 'Acousticness':\n", "0.008 1457\n", "0.003 1332\n", "0.002 1195\n", "0.001 1169\n", "0.017 1165\n", " ... \n", "0.879 1\n", "0.638 1\n", "0.710 1\n", "0.754 1\n", "0.983 1\n", "Name: Acousticness, Length: 905, dtype: int64\n", "\n", "\n", "Rozkład częstości dla kolumny 'Instrumentalness':\n", "0.000 114104\n", "0.001 4499\n", "0.002 1512\n", "0.003 1343\n", "0.004 1336\n", " ... 
\n", "0.168 1\n", "0.295 1\n", "0.759 1\n", "0.079 1\n", "0.444 1\n", "Name: Instrumentalness, Length: 226, dtype: int64\n", "\n", "\n", "Rozkład częstości dla kolumny 'Valence':\n", "0.446 1168\n", "0.437 973\n", "0.580 778\n", "0.661 678\n", "0.494 670\n", " ... \n", "0.949 1\n", "0.117 1\n", "0.043 1\n", "0.054 1\n", "0.034 1\n", "Name: Valence, Length: 925, dtype: int64\n", "\n", "\n", "Rozkład częstości dla kolumny '# of Artist':\n", "Artist 1 93924\n", "Artist 2 25612\n", "Artist 3 6913\n", "Artist 4 1971\n", "Artist 5 1161\n", "Artist 6 516\n", "Artist 7 235\n", "Artist 8 39\n", "Artist 9 17\n", "Name: # of Artist, dtype: int64\n", "\n", "\n", "Rozkład częstości dla kolumny 'Artist (Ind.)':\n", "Bad Bunny 3283\n", "Ed Sheeran 2581\n", "Post Malone 2257\n", "J Balvin 2213\n", "The Weeknd 2061\n", " ... \n", "Peggy Lee 1\n", "Rashmi Virag 1\n", "The Game 1\n", "MC WM 1\n", "blink-182 1\n", "Name: Artist (Ind.), Length: 1833, dtype: int64\n", "\n", "\n", "Rozkład częstości dla kolumny '# of Nationality':\n", "Nationality 1 93924\n", "Nationality 2 25612\n", "Nationality 3 6913\n", "Nationality 4 1971\n", "Nationality 5 1161\n", "Nationality 6 516\n", "Nationality 7 235\n", "Nationality 8 39\n", "Nationality 9 17\n", "Name: # of Nationality, dtype: int64\n", "\n", "\n", "Rozkład częstości dla kolumny 'Nationality':\n", "United States 54963\n", "United Kingdom 16896\n", "Puerto Rico 15404\n", "Canada 7899\n", "Colombia 7084\n", " ... \n", "Sri Lanka 2\n", "Czech Republic 2\n", "Malta 1\n", "Ecuador 1\n", "Moldavia 1\n", "Name: Nationality, Length: 66, dtype: int64\n", "\n", "\n", "Rozkład częstości dla kolumny 'Continent':\n", "Anglo-America 63521\n", "Latin-America 31219\n", "Europe 29388\n", "Asia 2745\n", "Oceania 2602\n", "Africa 816\n", "Unknown 97\n", "Name: Continent, dtype: int64\n", "\n", "\n", "Rozkład częstości dla kolumny 'Points (Total)':\n", "124 729\n", "147 724\n", "151 717\n", "138 709\n", "64 706\n", " ... 
\n", "197 595\n", "68 593\n", "80 590\n", "200 586\n", "198 582\n", "Name: Points (Total), Length: 200, dtype: int64\n", "\n", "\n", "Rozkład częstości dla kolumny 'Points (Ind for each Artist/Nat)':\n", "18.000000 1529\n", "14.000000 1444\n", "12.000000 1435\n", "26.000000 1413\n", "32.000000 1410\n", " ... \n", "4.250000 1\n", "47.333333 1\n", "56.666667 1\n", "46.666667 1\n", "4.666667 1\n", "Name: Points (Ind for each Artist/Nat), Length: 451, dtype: int64\n", "\n", "\n", "Rozkład częstości dla kolumny 'id':\n", "6RUKPb4LETWmmr3iAEQktW 560\n", "7BKLCZ1jbUBVqRi2FVlTVw 535\n", "2VxeLyX666F8uXCJ0dZF8B 509\n", "0RiRZpuVRbi7oqRdSMwhQY 509\n", "0tgVpDi06FyKpA1z0VMD4v 472\n", " ... \n", "52Rfxu5AUNMV1qhhC2ZCkb 1\n", "5LHHKZOwV8XW4LJP2C64mw 1\n", "1EWkw4Fa6IlnsAihLUlFFM 1\n", "3CNbrXrUrEARw8zeKNCdYo 1\n", "1rP5gAqMlm8d6UnfseuzSm 1\n", "Name: id, Length: 6486, dtype: int64\n", "\n", "\n", "Rozkład częstości dla kolumny 'Song URL':\n", "https://open.spotify.com/track/6RUKPb4LETWmmr3iAEQktW 560\n", "https://open.spotify.com/track/7BKLCZ1jbUBVqRi2FVlTVw 535\n", "https://open.spotify.com/track/2VxeLyX666F8uXCJ0dZF8B 509\n", "https://open.spotify.com/track/0RiRZpuVRbi7oqRdSMwhQY 509\n", "https://open.spotify.com/track/0tgVpDi06FyKpA1z0VMD4v 472\n", " ... 
\n", "https://open.spotify.com/track/52Rfxu5AUNMV1qhhC2ZCkb 1\n", "https://open.spotify.com/track/5LHHKZOwV8XW4LJP2C64mw 1\n", "https://open.spotify.com/track/1EWkw4Fa6IlnsAihLUlFFM 1\n", "https://open.spotify.com/track/3CNbrXrUrEARw8zeKNCdYo 1\n", "https://open.spotify.com/track/1rP5gAqMlm8d6UnfseuzSm 1\n", "Name: Song URL, Length: 6486, dtype: int64\n", "\n", "\n" ] } ], "source": [
"# main\n",
"\n",
"# Note: this URL is only used by the non-Kaggle branch of download_file\n",
"url = \"https://www.kaggle.com/datasets/asaniczka/top-spotify-songs-in-73-countries-daily-updated?select=universal_top_spotify_songs.csv\"\n",
"filename = \"dataset.csv\"\n",
"destination_folder = \"datasets\"\n",
"\n",
"# Make sure the destination folder exists, then download only if it is still empty\n",
"os.makedirs(destination_folder, exist_ok=True)\n",
"if len(os.listdir(destination_folder)) == 0:\n",
"    # Download the file\n",
"    filepath = download_file(url, filename, destination_folder)\n",
"\n",
"    # Move the downloaded file to the destination folder (only when a path was returned)\n",
"    if filepath:\n",
"        print(\"Moving the file to the destination folder...\")\n",
"        shutil.move(filepath, os.path.join(destination_folder, filename))\n",
"        print(\"File moved.\")\n",
"\n",
"\n",
"# Load the data from the CSV file\n",
"data = pd.read_csv(\"datasets/Spotify_Dataset_V3.csv\", sep=\";\")\n",
"\n",
"# Split the dataset into training, validation and test sets\n",
"train_data, val_data, test_data = split_dataset(data)\n",
"\n",
"# Save the subsets to separate CSV files\n",
"train_data.to_csv(\"datasets/train.csv\", index=False)\n",
"val_data.to_csv(\"datasets/val.csv\", index=False)\n",
"test_data.to_csv(\"datasets/test.csv\", index=False)\n",
"\n",
"# Print statistics for each subset\n",
"print_dataset_stats(train_data, \"treningowego\")\n",
"print(\"\\n\")\n",
"print_dataset_stats(val_data, \"walidacyjnego\")\n",
"print(\"\\n\")\n",
"print_dataset_stats(test_data, \"testowego\")\n",
"\n",
"# Normalize and clean each subset (each one from its own data, not from the training set)\n",
"train_data = normalize_data(train_data)\n",
"train_data = clean_dataset(train_data)\n",
"val_data = normalize_data(val_data)\n",
"val_data = clean_dataset(val_data)\n",
"test_data = normalize_data(test_data)\n",
"test_data = clean_dataset(test_data)"
] }, { "cell_type": "code", "execution_count": null, "id": "e88874c0-afac-488a-9051-ae2537dea531", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.2" } }, "nbformat": 4, "nbformat_minor": 5 }