1272 lines
49 KiB
Plaintext
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "3963da99-70f9-4d44-88a9-031723c27f7a",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Zadanie 2 [5pkt]\n",
|
|
"Wybierz jeden z ogólnodostępnych zbiorów danych. Będziesz na nim pracował do końca roku (oczywiście, zbiór można zmienić w trakcie, ale będzie się to wiązało z powtarzeniem pewnych działań, co prawdwa niezbyt kosztownych, ale jednak).\n",
|
|
"\n",
|
|
"Zbiór powinien być:\n",
|
|
"* nie za duży (max ~200 MB)\n",
|
|
"* nie za mały (np. IRIS jest za mały ;))\n",
|
|
"* unikalny (każda osoba w grupie pracuje na innym zbiorze). W celu synchronizacji, wybrany przez siebie zbiór proszę zapisać tutaj: https://uam.sharepoint.com/❌/r/sites/2024SL06-S4IN01-F01005LABInynieriauczeniamaszynowego-Grupa11/Materiay%20z%20zaj/IUM-2024.xlsx?d=w23a1cad8c73a4fe183404d1b0671af36&csf=1&web=1&e=zUvrxN\n",
|
|
"najlepiej, żeby był to zbiór zawierający dane w formie tekstowej, mogący posłużyć do zadania klasyfikacji lub rergesji - na takim zbiorze będzie łatwiej pracować niż np. na zbiorze obrazów albo dźwięków. Dzięki temu będziesz się mogła/mógł skupić na istocie tych zajęć.\n",
|
|
"\n",
|
|
"Napisz skrypt, który:\n",
|
|
"1. Pobierze wybrany przez Ciebie zbiór\n",
|
|
"2. Jeśli brak w zbiorze gotowego podziału na podzbiory train/dev/test, to dokona takiego podziału\n",
|
|
"3. Zbierze i wydrukuje statystyki dla tego zbioru i jego podzbiorów, takie jak np.:\n",
|
|
"* wielkość zbioru i podzbiorów\n",
|
|
"* średnią, minimum, maksimum, odchylenia standardowe, medianę wartości poszczególnych parametrów)\n",
|
|
"* rozkład częstości przykładów dla poszczególnych klas\n",
|
|
"4. Dokona normalizacji danych w zbiorze (np. normalizacja wartości float do zakresu 0.0 - 1.0)\n",
|
|
"5. Wyczyści zbiór z artefaktów (np. puste linie, przykłady z niepoprawnymi wartościami)\n",
|
|
"\n",
|
|
"Skrypt możesz napisać w swoim ulubionym języku. Jedyne ograniczenie: ma działać pod Linuxem\n",
|
|
"Wygodnie będzie stworzyć zeszyt Jupyter w tym celu (choć nie jest to konieczne)\n",
|
|
"Stwórz na wydziałowym serwerze git (http://git.wmi.amu.edu.pl/) repozytorium \"ium_nrindeksu\" i umieść w nim stworzony skrypt\n",
|
|
"Link do repozytorium wklej do arkusza ze zbiorami: https://uam.sharepoint.com/❌/r/sites/2024SL06-S4IN01-F01005LABInynieriauczeniamaszynowego-Grupa11/Materiay%20z%20zaj/IUM-2024.xlsx?d=w23a1cad8c73a4fe183404d1b0671af36&csf=1&web=1&e=zUvrxN"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 47,
|
|
"id": "15654f1c-682b-422d-b2d5-398b93a92518",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Defaulting to user installation because normal site-packages is not writeable\n",
|
|
"Collecting kaggle\n",
|
|
" Downloading kaggle-1.6.6.tar.gz (84 kB)\n",
|
|
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m84.6/84.6 kB\u001b[0m \u001b[31m1.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0ma \u001b[36m0:00:01\u001b[0m\n",
|
|
"\u001b[?25h Preparing metadata (setup.py) ... \u001b[?25ldone\n",
|
|
"\u001b[?25hRequirement already satisfied: bleach in /usr/local/lib/python3.9/dist-packages (from kaggle) (5.0.1)\n",
|
|
"Requirement already satisfied: certifi in /usr/local/lib/python3.9/dist-packages (from kaggle) (2022.9.14)\n",
|
|
"Requirement already satisfied: python-dateutil in /usr/local/lib/python3.9/dist-packages (from kaggle) (2.8.2)\n",
|
|
"Collecting python-slugify (from kaggle)\n",
|
|
" Downloading python_slugify-8.0.4-py2.py3-none-any.whl.metadata (8.5 kB)\n",
|
|
"Requirement already satisfied: requests in /usr/local/lib/python3.9/dist-packages (from kaggle) (2.28.1)\n",
|
|
"Requirement already satisfied: six>=1.10 in /usr/lib/python3/dist-packages (from kaggle) (1.16.0)\n",
|
|
"Requirement already satisfied: tqdm in /usr/local/lib/python3.9/dist-packages (from kaggle) (4.64.1)\n",
|
|
"Requirement already satisfied: urllib3 in /usr/local/lib/python3.9/dist-packages (from kaggle) (1.26.12)\n",
|
|
"Requirement already satisfied: webencodings in /usr/local/lib/python3.9/dist-packages (from bleach->kaggle) (0.5.1)\n",
|
|
"Collecting text-unidecode>=1.3 (from python-slugify->kaggle)\n",
|
|
" Downloading text_unidecode-1.3-py2.py3-none-any.whl.metadata (2.4 kB)\n",
|
|
"Requirement already satisfied: charset-normalizer<3,>=2 in /usr/local/lib/python3.9/dist-packages (from requests->kaggle) (2.1.1)\n",
|
|
"Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.9/dist-packages (from requests->kaggle) (3.4)\n",
|
|
"Downloading python_slugify-8.0.4-py2.py3-none-any.whl (10 kB)\n",
|
|
"Downloading text_unidecode-1.3-py2.py3-none-any.whl (78 kB)\n",
|
|
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m78.2/78.2 kB\u001b[0m \u001b[31m3.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
|
|
"\u001b[?25hBuilding wheels for collected packages: kaggle\n",
|
|
" Building wheel for kaggle (setup.py) ... \u001b[?25ldone\n",
|
|
"\u001b[?25h Created wheel for kaggle: filename=kaggle-1.6.6-py3-none-any.whl size=111949 sha256=7e53aec9a3b77513258bc4a5e64c8024f55ccc060eee29fae9b6a5dcdb06b381\n",
|
|
" Stored in directory: /home/students/s464953/.cache/pip/wheels/46/aa/c3/b3e421522fb5acdd7c366a05c5fc80787615bdeed207e7f79b\n",
|
|
"Successfully built kaggle\n",
|
|
"Installing collected packages: text-unidecode, python-slugify, kaggle\n",
|
|
"\u001b[33m WARNING: The script slugify is installed in '/home/students/s464953/.local/bin' which is not on PATH.\n",
|
|
" Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.\u001b[0m\u001b[33m\n",
|
|
"\u001b[0m\u001b[33m WARNING: The script kaggle is installed in '/home/students/s464953/.local/bin' which is not on PATH.\n",
|
|
" Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.\u001b[0m\u001b[33m\n",
|
|
"\u001b[0mSuccessfully installed kaggle-1.6.6 python-slugify-8.0.4 text-unidecode-1.3\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"!pip install kaggle"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 55,
|
|
"id": "37ed37d7-40fd-4f79-8d12-4a54cb62540c",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Import bibliotek \n",
|
|
"\n",
|
|
"import os\n",
|
|
"import shutil\n",
|
|
"import pandas as pd\n",
|
|
"from sklearn.model_selection import train_test_split\n",
|
|
"import requests\n",
|
|
"from sklearn.preprocessing import MinMaxScaler\n",
|
|
"from kaggle.api.kaggle_api_extended import KaggleApi"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 56,
|
|
"id": "393cb55e-404b-4f67-a363-9ea94fca8aea",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"#funkcja pobierająca plik \n",
|
|
"\n",
|
|
"def download_file(url, filename, destination_folder):\n",
|
|
" # Wersja dla datasetów kaggle\n",
|
|
" api = KaggleApi()\n",
|
|
" api.authenticate()\n",
|
|
"\n",
|
|
" api.dataset_download_files('brunoalarcon123/top-200-spotify-songs-dataset', path='temp', unzip=True)\n",
|
|
"\n",
|
|
" # Wersja dla datasetów nie z kaggle\n",
|
|
" # response = requests.get(url)\n",
|
|
" # if response.status_code == 200:\n",
|
|
" # # Ścieżka do pliku w folderze docelowym\n",
|
|
" # filepath = os.path.join(destination_folder, filename)\n",
|
|
" # with open(filepath, 'wb') as f:\n",
|
|
" # f.write(response.content)\n",
|
|
" # print(f\"Pobrano plik: {filename}\")\n",
|
|
" # return filepath\n",
|
|
" # else:\n",
|
|
" # print(\"Wystąpił błąd podczas pobierania pliku.\")\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 57,
|
|
"id": "368a3d3c-31c1-4131-b0f8-d6b374781251",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# funkcja dzieląca zbiór\n",
|
|
"\n",
|
|
"def split_dataset(data, test_size=0.2, val_size=0.1, random_state=42):\n",
|
|
" #Podział na test i trening\n",
|
|
" train_data, test_data = train_test_split(data, test_size=test_size, random_state=random_state)\n",
|
|
" #Podział na walidacje i trening\n",
|
|
" train_data, val_data = train_test_split(train_data, test_size=val_size/(1-test_size), random_state=random_state)\n",
|
|
" \n",
|
|
" return train_data, val_data, test_data"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 58,
|
|
"id": "2089aa48-9898-411b-86dd-d728fced1dc0",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Wyświetlanie statystyk zbioru \n",
|
|
"\n",
|
|
"def print_dataset_stats(data, subset_name):\n",
|
|
" print(f\"Statystyki dla zbioru {subset_name}:\")\n",
|
|
" print(f\"Wielkość zbioru {subset_name}: {len(data)}\")\n",
|
|
"\n",
|
|
" print(\"\\nStatystyki wartości poszczególnych parametrów:\")\n",
|
|
" print(data.describe())\n",
|
|
"\n",
|
|
" for column in data.columns:\n",
|
|
" print(f\"Rozkład częstości dla kolumny '{column}':\")\n",
|
|
" print(data[column].value_counts())\n",
|
|
" print(\"\\n\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 59,
|
|
"id": "eb057bf2-8848-4544-a2b0-13f0edd3a107",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Normalizacja danych \n",
|
|
"\n",
|
|
"def normalize_data(data):\n",
|
|
" scaler = MinMaxScaler()\n",
|
|
" numeric_columns = data.select_dtypes(include=['int', 'float']).columns\n",
|
|
" scaler.fit(data[numeric_columns])\n",
|
|
" df_normalized = data.copy()\n",
|
|
" df_normalized[numeric_columns] = scaler.transform(df_normalized[numeric_columns])\n",
|
|
" return df_normalized"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 60,
|
|
"id": "5bb10125-17a6-40b9-87de-c2d237f50ff5",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"#Czyszczenie danych \n",
|
|
"\n",
|
|
"def clean_dataset(data):\n",
|
|
" data.dropna(inplace=True)\n",
|
|
" data.drop_duplicates(inplace=True)\n",
|
|
" return data"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 61,
|
|
"id": "54eead00-348a-40c2-b56d-2dc82b269269",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Statystyki dla zbioru treningowego:\n",
|
|
"Wielkość zbioru treningowego: 456354\n",
|
|
"\n",
|
|
"Statystyki wartości poszczególnych parametrów:\n",
|
|
" Rank Danceability Energy Loudness \\\n",
|
|
"count 456354.000000 456354.000000 456354.000000 456354.000000 \n",
|
|
"mean 100.358899 0.697782 0.652136 -5293.064991 \n",
|
|
"std 57.398624 0.133183 0.155760 2783.750842 \n",
|
|
"min 1.000000 0.073000 0.005000 -34475.000000 \n",
|
|
"25% 51.000000 0.617000 0.549000 -6825.000000 \n",
|
|
"50% 100.000000 0.719000 0.671000 -5206.000000 \n",
|
|
"75% 150.000000 0.793000 0.771000 -3875.000000 \n",
|
|
"max 200.000000 0.985000 0.996000 1509.000000 \n",
|
|
"\n",
|
|
" Speechiness Acousticness Instrumentalness Valence \\\n",
|
|
"count 456354.000000 456354.000000 456354.000000 456354.000000 \n",
|
|
"mean 0.109976 0.230610 0.007728 0.523162 \n",
|
|
"std 0.096896 0.230671 0.055278 0.223983 \n",
|
|
"min 0.022000 0.000000 0.000000 0.032000 \n",
|
|
"25% 0.045000 0.048000 0.000000 0.356000 \n",
|
|
"50% 0.068000 0.152000 0.000000 0.521000 \n",
|
|
"75% 0.136000 0.349000 0.000000 0.696000 \n",
|
|
"max 0.966000 0.994000 0.956000 0.982000 \n",
|
|
"\n",
|
|
" Points (Total) Points (Ind for each Artist/Nat) \n",
|
|
"count 456354.000000 456354.000000 \n",
|
|
"mean 100.641101 72.473544 \n",
|
|
"std 57.398624 54.254094 \n",
|
|
"min 1.000000 0.200000 \n",
|
|
"25% 51.000000 28.000000 \n",
|
|
"50% 101.000000 60.000000 \n",
|
|
"75% 150.000000 104.000000 \n",
|
|
"max 200.000000 200.000000 \n",
|
|
"Rozkład częstości dla kolumny 'Rank':\n",
|
|
"88 2428\n",
|
|
"57 2416\n",
|
|
"56 2416\n",
|
|
"83 2409\n",
|
|
"78 2407\n",
|
|
" ... \n",
|
|
"11 2161\n",
|
|
"1 2147\n",
|
|
"4 2129\n",
|
|
"13 2101\n",
|
|
"3 2060\n",
|
|
"Name: Rank, Length: 200, dtype: int64\n",
|
|
"\n",
|
|
"\n",
|
|
"Rozkład częstości dla kolumny 'Title':\n",
|
|
"Sunflower - Spider-Man: Into the Spider-Verse 2280\n",
|
|
"One Dance 2107\n",
|
|
"Something Just Like This 1795\n",
|
|
"Shallow 1750\n",
|
|
"Closer 1745\n",
|
|
" ... \n",
|
|
"Science Fiction 1\n",
|
|
"Sweet Caroline 1\n",
|
|
"Open It Up 1\n",
|
|
"Lovin' Me (feat. Phoebe Bridgers) 1\n",
|
|
"You Don't Do It For Me Anymore 1\n",
|
|
"Name: Title, Length: 6954, dtype: int64\n",
|
|
"\n",
|
|
"\n",
|
|
"Rozkład częstości dla kolumny 'Artists':\n",
|
|
"Ed Sheeran 8532\n",
|
|
"Post Malone 5442\n",
|
|
"XXXTENTACION 4826\n",
|
|
"Billie Eilish 4786\n",
|
|
"Bad Bunny 4050\n",
|
|
" ... \n",
|
|
"Daði Freyr 1\n",
|
|
"Justin Bieber, - 1\n",
|
|
"Brent Faiyaz, Alicia Keys 1\n",
|
|
"CHVRCHES 1\n",
|
|
"NCT DREAM 1\n",
|
|
"Name: Artists, Length: 2849, dtype: int64\n",
|
|
"\n",
|
|
"\n",
|
|
"Rozkład częstości dla kolumny 'Date':\n",
|
|
"22/01/2022 254\n",
|
|
"05/01/2022 253\n",
|
|
"28/01/2022 249\n",
|
|
"07/01/2022 248\n",
|
|
"19/03/2022 247\n",
|
|
" ... \n",
|
|
"13/12/2020 162\n",
|
|
"26/12/2020 162\n",
|
|
"23/07/2017 160\n",
|
|
"04/02/2021 160\n",
|
|
"23/03/2020 158\n",
|
|
"Name: Date, Length: 2336, dtype: int64\n",
|
|
"\n",
|
|
"\n",
|
|
"Rozkład częstości dla kolumny 'Danceability':\n",
|
|
"0.795 4708\n",
|
|
"0.791 3910\n",
|
|
"0.671 3493\n",
|
|
"0.755 3417\n",
|
|
"0.807 3282\n",
|
|
" ... \n",
|
|
"0.223 1\n",
|
|
"0.270 1\n",
|
|
"0.185 1\n",
|
|
"0.204 1\n",
|
|
"0.373 1\n",
|
|
"Name: Danceability, Length: 726, dtype: int64\n",
|
|
"\n",
|
|
"\n",
|
|
"Rozkład częstości dla kolumny 'Energy':\n",
|
|
"0.771 2721\n",
|
|
"0.648 2632\n",
|
|
"0.522 2613\n",
|
|
"0.631 2508\n",
|
|
"0.726 2453\n",
|
|
" ... \n",
|
|
"0.120 1\n",
|
|
"0.981 1\n",
|
|
"0.162 1\n",
|
|
"0.096 1\n",
|
|
"0.220 1\n",
|
|
"Name: Energy, Length: 852, dtype: int64\n",
|
|
"\n",
|
|
"\n",
|
|
"Rozkład częstości dla kolumny 'Loudness':\n",
|
|
"-4368.00 1873\n",
|
|
"-6769.00 1795\n",
|
|
"-10109.00 1794\n",
|
|
"-6362.00 1751\n",
|
|
"-5599.00 1745\n",
|
|
" ... \n",
|
|
"-4455.00 1\n",
|
|
"-2886.00 1\n",
|
|
"-4446.00 1\n",
|
|
"-15.34 1\n",
|
|
"-1627.00 1\n",
|
|
"Name: Loudness, Length: 5090, dtype: int64\n",
|
|
"\n",
|
|
"\n",
|
|
"Rozkład częstości dla kolumny 'Speechiness':\n",
|
|
"0.032 11398\n",
|
|
"0.048 9246\n",
|
|
"0.036 8197\n",
|
|
"0.045 8167\n",
|
|
"0.034 7596\n",
|
|
" ... \n",
|
|
"0.515 1\n",
|
|
"0.314 1\n",
|
|
"0.388 1\n",
|
|
"0.888 1\n",
|
|
"0.399 1\n",
|
|
"Name: Speechiness, Length: 525, dtype: int64\n",
|
|
"\n",
|
|
"\n",
|
|
"Rozkład częstości dla kolumny 'Acousticness':\n",
|
|
"0.008 5145\n",
|
|
"0.003 4843\n",
|
|
"0.017 4482\n",
|
|
"0.002 4137\n",
|
|
"0.005 4012\n",
|
|
" ... \n",
|
|
"0.502 1\n",
|
|
"0.715 1\n",
|
|
"0.992 1\n",
|
|
"0.854 1\n",
|
|
"0.787 1\n",
|
|
"Name: Acousticness, Length: 942, dtype: int64\n",
|
|
"\n",
|
|
"\n",
|
|
"Rozkład częstości dla kolumny 'Instrumentalness':\n",
|
|
"0.000 399153\n",
|
|
"0.001 15897\n",
|
|
"0.002 5440\n",
|
|
"0.004 4680\n",
|
|
"0.003 4503\n",
|
|
" ... \n",
|
|
"0.153 1\n",
|
|
"0.468 1\n",
|
|
"0.736 1\n",
|
|
"0.295 1\n",
|
|
"0.939 1\n",
|
|
"Name: Instrumentalness, Length: 289, dtype: int64\n",
|
|
"\n",
|
|
"\n",
|
|
"Rozkład częstości dla kolumny 'Valence':\n",
|
|
"0.446 4102\n",
|
|
"0.437 3247\n",
|
|
"0.580 2704\n",
|
|
"0.494 2353\n",
|
|
"0.661 2238\n",
|
|
" ... \n",
|
|
"0.047 1\n",
|
|
"0.070 1\n",
|
|
"0.935 1\n",
|
|
"0.936 1\n",
|
|
"0.057 1\n",
|
|
"Name: Valence, Length: 933, dtype: int64\n",
|
|
"\n",
|
|
"\n",
|
|
"Rozkład częstości dla kolumny '# of Artist':\n",
|
|
"Artist 1 329460\n",
|
|
"Artist 2 89046\n",
|
|
"Artist 3 24054\n",
|
|
"Artist 4 6748\n",
|
|
"Artist 5 4185\n",
|
|
"Artist 6 1865\n",
|
|
"Artist 7 754\n",
|
|
"Artist 8 175\n",
|
|
"Artist 9 67\n",
|
|
"Name: # of Artist, dtype: int64\n",
|
|
"\n",
|
|
"\n",
|
|
"Rozkład częstości dla kolumny 'Artist (Ind.)':\n",
|
|
"Bad Bunny 11664\n",
|
|
"Ed Sheeran 8928\n",
|
|
"Post Malone 8057\n",
|
|
"J Balvin 7359\n",
|
|
"Drake 7067\n",
|
|
" ... \n",
|
|
"Wilbur Soot 1\n",
|
|
"Hayley Williams 1\n",
|
|
"Malik Monta 1\n",
|
|
"Kollegah 1\n",
|
|
"NCT DREAM 1\n",
|
|
"Name: Artist (Ind.), Length: 2115, dtype: int64\n",
|
|
"\n",
|
|
"\n",
|
|
"Rozkład częstości dla kolumny '# of Nationality':\n",
|
|
"Nationality 1 329460\n",
|
|
"Nationality 2 89046\n",
|
|
"Nationality 3 24054\n",
|
|
"Nationality 4 6748\n",
|
|
"Nationality 5 4185\n",
|
|
"Nationality 6 1865\n",
|
|
"Nationality 7 754\n",
|
|
"Nationality 8 175\n",
|
|
"Nationality 9 67\n",
|
|
"Name: # of Nationality, dtype: int64\n",
|
|
"\n",
|
|
"\n",
|
|
"Rozkład częstości dla kolumny 'Nationality':\n",
|
|
"United States 192533\n",
|
|
"United Kingdom 59024\n",
|
|
"Puerto Rico 53571\n",
|
|
"Canada 27768\n",
|
|
"Colombia 24259\n",
|
|
" ... \n",
|
|
"Bonaire 1\n",
|
|
"Senegal 1\n",
|
|
"Sri Lanka 1\n",
|
|
"Suecia 1\n",
|
|
"Azerbaijan 1\n",
|
|
"Name: Nationality, Length: 72, dtype: int64\n",
|
|
"\n",
|
|
"\n",
|
|
"Rozkład częstości dla kolumny 'Continent':\n",
|
|
"Anglo-America 222778\n",
|
|
"Latin-America 108347\n",
|
|
"Europe 103046\n",
|
|
"Asia 9688\n",
|
|
"Oceania 9319\n",
|
|
"Africa 2861\n",
|
|
"Unknown 315\n",
|
|
"Name: Continent, dtype: int64\n",
|
|
"\n",
|
|
"\n",
|
|
"Rozkład częstości dla kolumny 'Points (Total)':\n",
|
|
"113 2428\n",
|
|
"144 2416\n",
|
|
"145 2416\n",
|
|
"118 2409\n",
|
|
"123 2407\n",
|
|
" ... \n",
|
|
"190 2161\n",
|
|
"200 2147\n",
|
|
"197 2129\n",
|
|
"188 2101\n",
|
|
"198 2060\n",
|
|
"Name: Points (Total), Length: 200, dtype: int64\n",
|
|
"\n",
|
|
"\n",
|
|
"Rozkład częstości dla kolumny 'Points (Ind for each Artist/Nat)':\n",
|
|
"18.000000 5180\n",
|
|
"12.000000 4958\n",
|
|
"2.000000 4868\n",
|
|
"14.000000 4864\n",
|
|
"24.000000 4839\n",
|
|
" ... \n",
|
|
"19.250000 2\n",
|
|
"9.666667 1\n",
|
|
"55.333333 1\n",
|
|
"35.600000 1\n",
|
|
"4.250000 1\n",
|
|
"Name: Points (Ind for each Artist/Nat), Length: 476, dtype: int64\n",
|
|
"\n",
|
|
"\n",
|
|
"Rozkład częstości dla kolumny 'id':\n",
|
|
"0RiRZpuVRbi7oqRdSMwhQY 1805\n",
|
|
"6RUKPb4LETWmmr3iAEQktW 1795\n",
|
|
"7BKLCZ1jbUBVqRi2FVlTVw 1745\n",
|
|
"2VxeLyX666F8uXCJ0dZF8B 1742\n",
|
|
"7qiZfU4dY1lWllzX7mPBI3 1573\n",
|
|
" ... \n",
|
|
"5S7FewmYYyLNdMOfeEcB6P 1\n",
|
|
"6mPZVis3gEGSSR2rhxlehT 1\n",
|
|
"12VqHHz4wvVcnEdSivjLeQ 1\n",
|
|
"2O1qYJTA2BI5ypFFqEZhh4 1\n",
|
|
"4I39irD0xSyfRA099AsWow 1\n",
|
|
"Name: id, Length: 8521, dtype: int64\n",
|
|
"\n",
|
|
"\n",
|
|
"Rozkład częstości dla kolumny 'Song URL':\n",
|
|
"https://open.spotify.com/track/0RiRZpuVRbi7oqRdSMwhQY 1805\n",
|
|
"https://open.spotify.com/track/6RUKPb4LETWmmr3iAEQktW 1795\n",
|
|
"https://open.spotify.com/track/7BKLCZ1jbUBVqRi2FVlTVw 1745\n",
|
|
"https://open.spotify.com/track/2VxeLyX666F8uXCJ0dZF8B 1742\n",
|
|
"https://open.spotify.com/track/7qiZfU4dY1lWllzX7mPBI3 1573\n",
|
|
" ... \n",
|
|
"https://open.spotify.com/track/5S7FewmYYyLNdMOfeEcB6P 1\n",
|
|
"https://open.spotify.com/track/6mPZVis3gEGSSR2rhxlehT 1\n",
|
|
"https://open.spotify.com/track/12VqHHz4wvVcnEdSivjLeQ 1\n",
|
|
"https://open.spotify.com/track/2O1qYJTA2BI5ypFFqEZhh4 1\n",
|
|
"https://open.spotify.com/track/4I39irD0xSyfRA099AsWow 1\n",
|
|
"Name: Song URL, Length: 8521, dtype: int64\n",
|
|
"\n",
|
|
"\n",
|
|
"\n",
|
|
"\n",
|
|
"Statystyki dla zbioru walidacyjnego:\n",
|
|
"Wielkość zbioru walidacyjnego: 65194\n",
|
|
"\n",
|
|
"Statystyki wartości poszczególnych parametrów:\n",
|
|
" Rank Danceability Energy Loudness Speechiness \\\n",
|
|
"count 65194.000000 65194.000000 65194.000000 65194.000000 65194.000000 \n",
|
|
"mean 100.435684 0.697871 0.652172 -5285.533729 0.109990 \n",
|
|
"std 57.438715 0.133125 0.155214 2794.797372 0.097208 \n",
|
|
"min 1.000000 0.150000 0.005000 -23023.000000 0.023000 \n",
|
|
"25% 51.000000 0.618000 0.548000 -6825.000000 0.044000 \n",
|
|
"50% 100.000000 0.719000 0.671000 -5211.000000 0.068000 \n",
|
|
"75% 150.000000 0.793000 0.771000 -3872.000000 0.135000 \n",
|
|
"max 200.000000 0.985000 0.989000 1509.000000 0.966000 \n",
|
|
"\n",
|
|
" Acousticness Instrumentalness Valence Points (Total) \\\n",
|
|
"count 65194.000000 65194.000000 65194.000000 65194.000000 \n",
|
|
"mean 0.230139 0.007433 0.523956 100.564316 \n",
|
|
"std 0.230539 0.053366 0.224228 57.438715 \n",
|
|
"min 0.000000 0.000000 0.032000 1.000000 \n",
|
|
"25% 0.048000 0.000000 0.356000 51.000000 \n",
|
|
"50% 0.152000 0.000000 0.524000 101.000000 \n",
|
|
"75% 0.348000 0.000000 0.697000 150.000000 \n",
|
|
"max 0.994000 0.919000 0.982000 200.000000 \n",
|
|
"\n",
|
|
" Points (Ind for each Artist/Nat) \n",
|
|
"count 65194.000000 \n",
|
|
"mean 72.301296 \n",
|
|
"std 54.263761 \n",
|
|
"min 0.250000 \n",
|
|
"25% 28.000000 \n",
|
|
"50% 59.000000 \n",
|
|
"75% 103.000000 \n",
|
|
"max 200.000000 \n",
|
|
"Rozkład częstości dla kolumny 'Rank':\n",
|
|
"79 372\n",
|
|
"58 370\n",
|
|
"96 368\n",
|
|
"25 362\n",
|
|
"77 362\n",
|
|
" ... \n",
|
|
"29 286\n",
|
|
"48 285\n",
|
|
"3 285\n",
|
|
"193 284\n",
|
|
"144 268\n",
|
|
"Name: Rank, Length: 200, dtype: int64\n",
|
|
"\n",
|
|
"\n",
|
|
"Rozkład częstości dla kolumny 'Title':\n",
|
|
"Sunflower - Spider-Man: Into the Spider-Verse 347\n",
|
|
"One Dance 301\n",
|
|
"Shallow 256\n",
|
|
"Closer 252\n",
|
|
"Something Just Like This 247\n",
|
|
" ... \n",
|
|
"How to Talk 1\n",
|
|
"Black Tux, White Collar 1\n",
|
|
"Shot in the Dark 1\n",
|
|
"Until I Bleed Out 1\n",
|
|
"Don't Shoot (feat. Rick Ross, 2 Chainz, Diddy, Fabolous, Wale, DJ Khaled, Swizz Beatz, Yo Gotti, Currensy, Problem, King Pharaoh & TGT) 1\n",
|
|
"Name: Title, Length: 4548, dtype: int64\n",
|
|
"\n",
|
|
"\n",
|
|
"Rozkład częstości dla kolumny 'Artists':\n",
|
|
"Ed Sheeran 1279\n",
|
|
"Post Malone 779\n",
|
|
"XXXTENTACION 709\n",
|
|
"Billie Eilish 620\n",
|
|
"Taylor Swift 561\n",
|
|
" ... \n",
|
|
"Louis Armstrong, The Commanders 1\n",
|
|
"Nas 1\n",
|
|
"Kungs, Olly Murs, Coely 1\n",
|
|
"Paloma Mami 1\n",
|
|
"The Game, Curren$y 1\n",
|
|
"Name: Artists, Length: 2261, dtype: int64\n",
|
|
"\n",
|
|
"\n",
|
|
"Rozkład częstości dla kolumny 'Date':\n",
|
|
"18/01/2022 48\n",
|
|
"11/06/2017 48\n",
|
|
"14/01/2022 48\n",
|
|
"26/03/2022 46\n",
|
|
"17/08/2019 45\n",
|
|
" ..\n",
|
|
"25/07/2020 15\n",
|
|
"07/03/2022 14\n",
|
|
"20/07/2020 14\n",
|
|
"12/10/2020 12\n",
|
|
"19/08/2020 12\n",
|
|
"Name: Date, Length: 2336, dtype: int64\n",
|
|
"\n",
|
|
"\n",
|
|
"Rozkład częstości dla kolumny 'Danceability':\n",
|
|
"0.795 656\n",
|
|
"0.791 554\n",
|
|
"0.755 511\n",
|
|
"0.807 499\n",
|
|
"0.671 483\n",
|
|
" ... \n",
|
|
"0.233 1\n",
|
|
"0.270 1\n",
|
|
"0.408 1\n",
|
|
"0.419 1\n",
|
|
"0.269 1\n",
|
|
"Name: Danceability, Length: 679, dtype: int64\n",
|
|
"\n",
|
|
"\n",
|
|
"Rozkład częstości dla kolumny 'Energy':\n",
|
|
"0.522 410\n",
|
|
"0.771 407\n",
|
|
"0.648 404\n",
|
|
"0.715 383\n",
|
|
"0.633 357\n",
|
|
" ... \n",
|
|
"0.326 1\n",
|
|
"0.228 1\n",
|
|
"0.333 1\n",
|
|
"0.135 1\n",
|
|
"0.005 1\n",
|
|
"Name: Energy, Length: 782, dtype: int64\n",
|
|
"\n",
|
|
"\n",
|
|
"Rozkład częstości dla kolumny 'Loudness':\n",
|
|
"-4368.0 289\n",
|
|
"-6312.0 256\n",
|
|
"-6362.0 256\n",
|
|
"-5599.0 252\n",
|
|
"-10109.0 248\n",
|
|
" ... \n",
|
|
"-7496.0 1\n",
|
|
"-14542.0 1\n",
|
|
"-4969.0 1\n",
|
|
"-4396.0 1\n",
|
|
"-9807.0 1\n",
|
|
"Name: Loudness, Length: 3739, dtype: int64\n",
|
|
"\n",
|
|
"\n",
|
|
"Rozkład częstości dla kolumny 'Speechiness':\n",
|
|
"0.032 1635\n",
|
|
"0.048 1247\n",
|
|
"0.036 1229\n",
|
|
"0.045 1135\n",
|
|
"0.034 1090\n",
|
|
" ... \n",
|
|
"0.488 1\n",
|
|
"0.553 1\n",
|
|
"0.642 1\n",
|
|
"0.394 1\n",
|
|
"0.503 1\n",
|
|
"Name: Speechiness, Length: 469, dtype: int64\n",
|
|
"\n",
|
|
"\n",
|
|
"Rozkład częstości dla kolumny 'Acousticness':\n",
|
|
"0.008 728\n",
|
|
"0.003 672\n",
|
|
"0.017 609\n",
|
|
"0.002 594\n",
|
|
"0.001 574\n",
|
|
" ... \n",
|
|
"0.857 1\n",
|
|
"0.637 1\n",
|
|
"0.763 1\n",
|
|
"0.608 1\n",
|
|
"0.845 1\n",
|
|
"Name: Acousticness, Length: 861, dtype: int64\n",
|
|
"\n",
|
|
"\n",
|
|
"Rozkład częstości dla kolumny 'Instrumentalness':\n",
|
|
"0.000 57090\n",
|
|
"0.001 2269\n",
|
|
"0.002 745\n",
|
|
"0.003 676\n",
|
|
"0.004 646\n",
|
|
" ... \n",
|
|
"0.504 1\n",
|
|
"0.349 1\n",
|
|
"0.066 1\n",
|
|
"0.375 1\n",
|
|
"0.269 1\n",
|
|
"Name: Instrumentalness, Length: 203, dtype: int64\n",
|
|
"\n",
|
|
"\n",
|
|
"Rozkład częstości dla kolumny 'Valence':\n",
|
|
"0.446 542\n",
|
|
"0.437 489\n",
|
|
"0.580 367\n",
|
|
"0.609 324\n",
|
|
"0.323 322\n",
|
|
" ... \n",
|
|
"0.065 1\n",
|
|
"0.929 1\n",
|
|
"0.977 1\n",
|
|
"0.117 1\n",
|
|
"0.103 1\n",
|
|
"Name: Valence, Length: 921, dtype: int64\n",
|
|
"\n",
|
|
"\n",
|
|
"Rozkład częstości dla kolumny '# of Artist':\n",
|
|
"Artist 1 47052\n",
|
|
"Artist 2 12765\n",
|
|
"Artist 3 3374\n",
|
|
"Artist 4 1004\n",
|
|
"Artist 5 583\n",
|
|
"Artist 6 261\n",
|
|
"Artist 7 127\n",
|
|
"Artist 8 22\n",
|
|
"Artist 9 6\n",
|
|
"Name: # of Artist, dtype: int64\n",
|
|
"\n",
|
|
"\n",
|
|
"Rozkład częstości dla kolumny 'Artist (Ind.)':\n",
|
|
"Bad Bunny 1677\n",
|
|
"Ed Sheeran 1328\n",
|
|
"Post Malone 1191\n",
|
|
"J Balvin 1087\n",
|
|
"Drake 994\n",
|
|
" ... \n",
|
|
"Hit-Boy 1\n",
|
|
"Luis Miguel 1\n",
|
|
"Ludacris 1\n",
|
|
"Stargate 1\n",
|
|
"The Game 1\n",
|
|
"Name: Artist (Ind.), Length: 1676, dtype: int64\n",
|
|
"\n",
|
|
"\n",
|
|
"Rozkład częstości dla kolumny '# of Nationality':\n",
|
|
"Nationality 1 47052\n",
|
|
"Nationality 2 12765\n",
|
|
"Nationality 3 3374\n",
|
|
"Nationality 4 1004\n",
|
|
"Nationality 5 583\n",
|
|
"Nationality 6 261\n",
|
|
"Nationality 7 127\n",
|
|
"Nationality 8 22\n",
|
|
"Nationality 9 6\n",
|
|
"Name: # of Nationality, dtype: int64\n",
|
|
"\n",
|
|
"\n",
|
|
"Rozkład częstości dla kolumny 'Nationality':\n",
|
|
"United States 27308\n",
|
|
"United Kingdom 8416\n",
|
|
"Puerto Rico 7625\n",
|
|
"Canada 3935\n",
|
|
"Colombia 3546\n",
|
|
" ... \n",
|
|
"China 1\n",
|
|
"Sri Lanka 1\n",
|
|
"Haiti 1\n",
|
|
"Ivory Coast 1\n",
|
|
"Greece 1\n",
|
|
"Name: Nationality, Length: 62, dtype: int64\n",
|
|
"\n",
|
|
"\n",
|
|
"Rozkład częstości dla kolumny 'Continent':\n",
|
|
"Anglo-America 31593\n",
|
|
"Latin-America 15516\n",
|
|
"Europe 14734\n",
|
|
"Asia 1483\n",
|
|
"Oceania 1385\n",
|
|
"Africa 430\n",
|
|
"Unknown 53\n",
|
|
"Name: Continent, dtype: int64\n",
|
|
"\n",
|
|
"\n",
|
|
"Rozkład częstości dla kolumny 'Points (Total)':\n",
|
|
"122 372\n",
|
|
"143 370\n",
|
|
"105 368\n",
|
|
"176 362\n",
|
|
"124 362\n",
|
|
" ... \n",
|
|
"172 286\n",
|
|
"153 285\n",
|
|
"198 285\n",
|
|
"8 284\n",
|
|
"57 268\n",
|
|
"Name: Points (Total), Length: 200, dtype: int64\n",
|
|
"\n",
|
|
"\n",
|
|
"Rozkład częstości dla kolumny 'Points (Ind for each Artist/Nat)':\n",
|
|
"18.000000 746\n",
|
|
"24.000000 725\n",
|
|
"12.000000 716\n",
|
|
"28.000000 707\n",
|
|
"26.000000 701\n",
|
|
" ... \n",
|
|
"23.333333 1\n",
|
|
"15.333333 1\n",
|
|
"47.333333 1\n",
|
|
"12.500000 1\n",
|
|
"64.666667 1\n",
|
|
"Name: Points (Ind for each Artist/Nat), Length: 412, dtype: int64\n",
|
|
"\n",
|
|
"\n",
|
|
"Rozkład częstości dla kolumny 'id':\n",
|
|
"0RiRZpuVRbi7oqRdSMwhQY 282\n",
|
|
"2VxeLyX666F8uXCJ0dZF8B 253\n",
|
|
"7BKLCZ1jbUBVqRi2FVlTVw 252\n",
|
|
"6RUKPb4LETWmmr3iAEQktW 247\n",
|
|
"0tgVpDi06FyKpA1z0VMD4v 235\n",
|
|
" ... \n",
|
|
"5egD7A5x9AHdVO2fMo3Wbo 1\n",
|
|
"7oTE1KmtU2ml9zBhv9Reao 1\n",
|
|
"0TFVOjSvPTjFHkiZZekK5k 1\n",
|
|
"2yBWnRKj0Zx7AF1kufajvW 1\n",
|
|
"53uF5QwHADVXKv383qXnXd 1\n",
|
|
"Name: id, Length: 5441, dtype: int64\n",
|
|
"\n",
|
|
"\n",
|
|
"Rozkład częstości dla kolumny 'Song URL':\n",
|
|
"https://open.spotify.com/track/0RiRZpuVRbi7oqRdSMwhQY 282\n",
|
|
"https://open.spotify.com/track/2VxeLyX666F8uXCJ0dZF8B 253\n",
|
|
"https://open.spotify.com/track/7BKLCZ1jbUBVqRi2FVlTVw 252\n",
|
|
"https://open.spotify.com/track/6RUKPb4LETWmmr3iAEQktW 247\n",
|
|
"https://open.spotify.com/track/0tgVpDi06FyKpA1z0VMD4v 235\n",
|
|
" ... \n",
|
|
"https://open.spotify.com/track/5egD7A5x9AHdVO2fMo3Wbo 1\n",
|
|
"https://open.spotify.com/track/7oTE1KmtU2ml9zBhv9Reao 1\n",
|
|
"https://open.spotify.com/track/0TFVOjSvPTjFHkiZZekK5k 1\n",
|
|
"https://open.spotify.com/track/2yBWnRKj0Zx7AF1kufajvW 1\n",
|
|
"https://open.spotify.com/track/53uF5QwHADVXKv383qXnXd 1\n",
|
|
"Name: Song URL, Length: 5441, dtype: int64\n",
|
|
"\n",
|
|
"\n",
|
|
"\n",
|
|
"\n",
|
|
"Statystyki dla zbioru testowego:\n",
|
|
"Wielkość zbioru testowego: 130388\n",
|
|
"\n",
|
|
"Statystyki wartości poszczególnych parametrów:\n",
|
|
" Rank Danceability Energy Loudness \\\n",
|
|
"count 130388.000000 130388.000000 130388.000000 130388.000000 \n",
|
|
"mean 100.564922 0.697480 0.651683 -5309.731673 \n",
|
|
"std 57.418920 0.133228 0.155650 2785.742733 \n",
|
|
"min 1.000000 0.150000 0.022000 -34475.000000 \n",
|
|
"25% 51.000000 0.617000 0.548000 -6827.000000 \n",
|
|
"50% 100.000000 0.718000 0.671000 -5224.000000 \n",
|
|
"75% 150.000000 0.792000 0.770000 -3912.000000 \n",
|
|
"max 200.000000 0.985000 0.989000 1509.000000 \n",
|
|
"\n",
|
|
" Speechiness Acousticness Instrumentalness Valence \\\n",
|
|
"count 130388.000000 130388.000000 130388.000000 130388.000000 \n",
|
|
"mean 0.109818 0.231263 0.007470 0.522624 \n",
|
|
"std 0.096464 0.230932 0.053432 0.223576 \n",
|
|
"min 0.023000 0.000000 0.000000 0.026000 \n",
|
|
"25% 0.045000 0.048000 0.000000 0.356000 \n",
|
|
"50% 0.068000 0.152500 0.000000 0.520000 \n",
|
|
"75% 0.136000 0.352000 0.000000 0.695000 \n",
|
|
"max 0.966000 0.994000 0.942000 0.982000 \n",
|
|
"\n",
|
|
" Points (Total) Points (Ind for each Artist/Nat) \n",
|
|
"count 130388.000000 130388.000000 \n",
|
|
"mean 100.435078 72.147657 \n",
|
|
"std 57.418920 54.117721 \n",
|
|
"min 1.000000 0.250000 \n",
|
|
"25% 51.000000 28.000000 \n",
|
|
"50% 101.000000 59.000000 \n",
|
|
"75% 150.000000 103.000000 \n",
|
|
"max 200.000000 200.000000 \n",
|
|
"Rozkład częstości dla kolumny 'Rank':\n",
|
|
"77 729\n",
|
|
"54 724\n",
|
|
"50 717\n",
|
|
"63 709\n",
|
|
"137 706\n",
|
|
" ... \n",
|
|
"4 595\n",
|
|
"133 593\n",
|
|
"121 590\n",
|
|
"1 586\n",
|
|
"3 582\n",
|
|
"Name: Rank, Length: 200, dtype: int64\n",
|
|
"\n",
|
|
"\n",
|
|
"Rozkład częstości dla kolumny 'Title':\n",
|
|
"Sunflower - Spider-Man: Into the Spider-Verse 635\n",
|
|
"One Dance 562\n",
|
|
"Something Just Like This 560\n",
|
|
"Closer 535\n",
|
|
"Shallow 512\n",
|
|
" ... \n",
|
|
"What Makes A Woman 1\n",
|
|
"Te Amo Demais 1\n",
|
|
"Sharp Edges 1\n",
|
|
"Bella ciao - HUGEL Remix 1\n",
|
|
"Real Baby Pluto 1\n",
|
|
"Name: Title, Length: 5361, dtype: int64\n",
|
|
"\n",
|
|
"\n",
|
|
"Rozkład częstości dla kolumny 'Artists':\n",
|
|
"Ed Sheeran 2478\n",
|
|
"Post Malone 1503\n",
|
|
"XXXTENTACION 1389\n",
|
|
"Billie Eilish 1350\n",
|
|
"The Weeknd 1169\n",
|
|
" ... \n",
|
|
"Leo Lewis, Avicii 1\n",
|
|
"Peggy Lee 1\n",
|
|
"Benjamin Ingrosso 1\n",
|
|
"Dr. Dre, Eminem 1\n",
|
|
"blink-182 1\n",
|
|
"Name: Artists, Length: 2487, dtype: int64\n",
|
|
"\n",
|
|
"\n",
|
|
"Rozkład częstości dla kolumny 'Date':\n",
|
|
"03/02/2018 82\n",
|
|
"27/01/2022 81\n",
|
|
"08/09/2019 80\n",
|
|
"01/05/2022 79\n",
|
|
"07/03/2022 79\n",
|
|
" ..\n",
|
|
"13/03/2020 36\n",
|
|
"05/02/2020 36\n",
|
|
"13/08/2020 35\n",
|
|
"14/09/2017 32\n",
|
|
"06/01/2017 32\n",
|
|
"Name: Date, Length: 2336, dtype: int64\n",
|
|
"\n",
|
|
"\n",
|
|
"Rozkład częstości dla kolumny 'Danceability':\n",
|
|
"0.795 1295\n",
|
|
"0.791 1119\n",
|
|
"0.671 978\n",
|
|
"0.647 955\n",
|
|
"0.807 932\n",
|
|
" ... \n",
|
|
"0.172 1\n",
|
|
"0.272 1\n",
|
|
"0.326 1\n",
|
|
"0.304 1\n",
|
|
"0.208 1\n",
|
|
"Name: Danceability, Length: 697, dtype: int64\n",
|
|
"\n",
|
|
"\n",
|
|
"Rozkład częstości dla kolumny 'Energy':\n",
|
|
"0.771 753\n",
|
|
"0.648 746\n",
|
|
"0.522 740\n",
|
|
"0.715 734\n",
|
|
"0.726 714\n",
|
|
" ... \n",
|
|
"0.949 1\n",
|
|
"0.212 1\n",
|
|
"0.345 1\n",
|
|
"0.220 1\n",
|
|
"0.067 1\n",
|
|
"Name: Energy, Length: 804, dtype: int64\n",
|
|
"\n",
|
|
"\n",
|
|
"Rozkład częstości dla kolumny 'Loudness':\n",
|
|
"-6769.00 560\n",
|
|
"-5599.00 535\n",
|
|
"-4368.00 525\n",
|
|
"-10109.00 516\n",
|
|
"-6362.00 512\n",
|
|
" ... \n",
|
|
"-13.04 1\n",
|
|
"-13778.00 1\n",
|
|
"-6.15 1\n",
|
|
"-11667.00 1\n",
|
|
"-14104.00 1\n",
|
|
"Name: Loudness, Length: 4220, dtype: int64\n",
|
|
"\n",
|
|
"\n",
|
|
"Rozkład częstości dla kolumny 'Speechiness':\n",
|
|
"0.032 3269\n",
|
|
"0.048 2646\n",
|
|
"0.045 2406\n",
|
|
"0.036 2341\n",
|
|
"0.034 2262\n",
|
|
" ... \n",
|
|
"0.870 1\n",
|
|
"0.505 1\n",
|
|
"0.747 1\n",
|
|
"0.522 1\n",
|
|
"0.663 1\n",
|
|
"Name: Speechiness, Length: 492, dtype: int64\n",
|
|
"\n",
|
|
"\n",
|
|
"Rozkład częstości dla kolumny 'Acousticness':\n",
|
|
"0.008 1457\n",
|
|
"0.003 1332\n",
|
|
"0.002 1195\n",
|
|
"0.001 1169\n",
|
|
"0.017 1165\n",
|
|
" ... \n",
|
|
"0.879 1\n",
|
|
"0.638 1\n",
|
|
"0.710 1\n",
|
|
"0.754 1\n",
|
|
"0.983 1\n",
|
|
"Name: Acousticness, Length: 905, dtype: int64\n",
|
|
"\n",
|
|
"\n",
|
|
"Rozkład częstości dla kolumny 'Instrumentalness':\n",
|
|
"0.000 114104\n",
|
|
"0.001 4499\n",
|
|
"0.002 1512\n",
|
|
"0.003 1343\n",
|
|
"0.004 1336\n",
|
|
" ... \n",
|
|
"0.168 1\n",
|
|
"0.295 1\n",
|
|
"0.759 1\n",
|
|
"0.079 1\n",
|
|
"0.444 1\n",
|
|
"Name: Instrumentalness, Length: 226, dtype: int64\n",
|
|
"\n",
|
|
"\n",
|
|
"Rozkład częstości dla kolumny 'Valence':\n",
|
|
"0.446 1168\n",
|
|
"0.437 973\n",
|
|
"0.580 778\n",
|
|
"0.661 678\n",
|
|
"0.494 670\n",
|
|
" ... \n",
|
|
"0.949 1\n",
|
|
"0.117 1\n",
|
|
"0.043 1\n",
|
|
"0.054 1\n",
|
|
"0.034 1\n",
|
|
"Name: Valence, Length: 925, dtype: int64\n",
|
|
"\n",
|
|
"\n",
|
|
"Rozkład częstości dla kolumny '# of Artist':\n",
|
|
"Artist 1 93924\n",
|
|
"Artist 2 25612\n",
|
|
"Artist 3 6913\n",
|
|
"Artist 4 1971\n",
|
|
"Artist 5 1161\n",
|
|
"Artist 6 516\n",
|
|
"Artist 7 235\n",
|
|
"Artist 8 39\n",
|
|
"Artist 9 17\n",
|
|
"Name: # of Artist, dtype: int64\n",
|
|
"\n",
|
|
"\n",
|
|
"Rozkład częstości dla kolumny 'Artist (Ind.)':\n",
|
|
"Bad Bunny 3283\n",
|
|
"Ed Sheeran 2581\n",
|
|
"Post Malone 2257\n",
|
|
"J Balvin 2213\n",
|
|
"The Weeknd 2061\n",
|
|
" ... \n",
|
|
"Peggy Lee 1\n",
|
|
"Rashmi Virag 1\n",
|
|
"The Game 1\n",
|
|
"MC WM 1\n",
|
|
"blink-182 1\n",
|
|
"Name: Artist (Ind.), Length: 1833, dtype: int64\n",
|
|
"\n",
|
|
"\n",
|
|
"Rozkład częstości dla kolumny '# of Nationality':\n",
|
|
"Nationality 1 93924\n",
|
|
"Nationality 2 25612\n",
|
|
"Nationality 3 6913\n",
|
|
"Nationality 4 1971\n",
|
|
"Nationality 5 1161\n",
|
|
"Nationality 6 516\n",
|
|
"Nationality 7 235\n",
|
|
"Nationality 8 39\n",
|
|
"Nationality 9 17\n",
|
|
"Name: # of Nationality, dtype: int64\n",
|
|
"\n",
|
|
"\n",
|
|
"Rozkład częstości dla kolumny 'Nationality':\n",
|
|
"United States 54963\n",
|
|
"United Kingdom 16896\n",
|
|
"Puerto Rico 15404\n",
|
|
"Canada 7899\n",
|
|
"Colombia 7084\n",
|
|
" ... \n",
|
|
"Sri Lanka 2\n",
|
|
"Czech Republic 2\n",
|
|
"Malta 1\n",
|
|
"Ecuador 1\n",
|
|
"Moldavia 1\n",
|
|
"Name: Nationality, Length: 66, dtype: int64\n",
|
|
"\n",
|
|
"\n",
|
|
"Rozkład częstości dla kolumny 'Continent':\n",
|
|
"Anglo-America 63521\n",
|
|
"Latin-America 31219\n",
|
|
"Europe 29388\n",
|
|
"Asia 2745\n",
|
|
"Oceania 2602\n",
|
|
"Africa 816\n",
|
|
"Unknown 97\n",
|
|
"Name: Continent, dtype: int64\n",
|
|
"\n",
|
|
"\n",
|
|
"Rozkład częstości dla kolumny 'Points (Total)':\n",
|
|
"124 729\n",
|
|
"147 724\n",
|
|
"151 717\n",
|
|
"138 709\n",
|
|
"64 706\n",
|
|
" ... \n",
|
|
"197 595\n",
|
|
"68 593\n",
|
|
"80 590\n",
|
|
"200 586\n",
|
|
"198 582\n",
|
|
"Name: Points (Total), Length: 200, dtype: int64\n",
|
|
"\n",
|
|
"\n",
|
|
"Rozkład częstości dla kolumny 'Points (Ind for each Artist/Nat)':\n",
|
|
"18.000000 1529\n",
|
|
"14.000000 1444\n",
|
|
"12.000000 1435\n",
|
|
"26.000000 1413\n",
|
|
"32.000000 1410\n",
|
|
" ... \n",
|
|
"4.250000 1\n",
|
|
"47.333333 1\n",
|
|
"56.666667 1\n",
|
|
"46.666667 1\n",
|
|
"4.666667 1\n",
|
|
"Name: Points (Ind for each Artist/Nat), Length: 451, dtype: int64\n",
|
|
"\n",
|
|
"\n",
|
|
"Rozkład częstości dla kolumny 'id':\n",
|
|
"6RUKPb4LETWmmr3iAEQktW 560\n",
|
|
"7BKLCZ1jbUBVqRi2FVlTVw 535\n",
|
|
"2VxeLyX666F8uXCJ0dZF8B 509\n",
|
|
"0RiRZpuVRbi7oqRdSMwhQY 509\n",
|
|
"0tgVpDi06FyKpA1z0VMD4v 472\n",
|
|
" ... \n",
|
|
"52Rfxu5AUNMV1qhhC2ZCkb 1\n",
|
|
"5LHHKZOwV8XW4LJP2C64mw 1\n",
|
|
"1EWkw4Fa6IlnsAihLUlFFM 1\n",
|
|
"3CNbrXrUrEARw8zeKNCdYo 1\n",
|
|
"1rP5gAqMlm8d6UnfseuzSm 1\n",
|
|
"Name: id, Length: 6486, dtype: int64\n",
|
|
"\n",
|
|
"\n",
|
|
"Rozkład częstości dla kolumny 'Song URL':\n",
|
|
"https://open.spotify.com/track/6RUKPb4LETWmmr3iAEQktW 560\n",
|
|
"https://open.spotify.com/track/7BKLCZ1jbUBVqRi2FVlTVw 535\n",
|
|
"https://open.spotify.com/track/2VxeLyX666F8uXCJ0dZF8B 509\n",
|
|
"https://open.spotify.com/track/0RiRZpuVRbi7oqRdSMwhQY 509\n",
|
|
"https://open.spotify.com/track/0tgVpDi06FyKpA1z0VMD4v 472\n",
|
|
" ... \n",
|
|
"https://open.spotify.com/track/52Rfxu5AUNMV1qhhC2ZCkb 1\n",
|
|
"https://open.spotify.com/track/5LHHKZOwV8XW4LJP2C64mw 1\n",
|
|
"https://open.spotify.com/track/1EWkw4Fa6IlnsAihLUlFFM 1\n",
|
|
"https://open.spotify.com/track/3CNbrXrUrEARw8zeKNCdYo 1\n",
|
|
"https://open.spotify.com/track/1rP5gAqMlm8d6UnfseuzSm 1\n",
|
|
"Name: Song URL, Length: 6486, dtype: int64\n",
|
|
"\n",
|
|
"\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"# main \n",
|
|
"\n",
|
|
"url = \"https://www.kaggle.com/datasets/asaniczka/top-spotify-songs-in-73-countries-daily-updated?select=universal_top_spotify_songs.csv\"\n",
|
|
"filename = \"dataset.csv\"\n",
|
|
"destination_folder = \"datasets\"\n",
|
|
"\n",
|
|
"# Download only if the destination folder is still empty\n",
|
|
"os.makedirs(destination_folder, exist_ok=True)  # os.listdir raises if the folder is missing\n",
|
|
"if len(os.listdir(destination_folder)) == 0:\n",
|
|
"    # Download the file\n",
|
|
"    filepath = download_file(url, filename, destination_folder)\n",
|
|
"\n",
|
|
"    # Move the downloaded file to the destination folder\n",
|
|
"    if filepath:\n",
|
|
"        print(\"Przenoszenie pliku do wskazanego folderu...\")\n",
|
|
"        shutil.move(filepath, os.path.join(destination_folder, filename))\n",
|
|
"        print(\"Plik przeniesiony.\")\n",
|
|
"\n",
|
|
"\n",
|
|
"# Load the semicolon-separated CSV (note: this path differs from `filename` above)\n",
|
|
"data = pd.read_csv(\"datasets/Spotify_Dataset_V3.csv\", sep=\";\")\n",
|
|
"\n",
|
|
"# Split the dataset into training, validation, and test sets\n",
|
|
"train_data, val_data, test_data = split_dataset(data)\n",
|
|
"\n",
|
|
"# Save the splits to separate CSV files\n",
|
|
"train_data.to_csv(\"datasets/train.csv\", index=False)\n",
|
|
"val_data.to_csv(\"datasets/val.csv\", index=False)\n",
|
|
"test_data.to_csv(\"datasets/test.csv\", index=False)\n",
|
|
"\n",
|
|
"# Print statistics for each split\n",
|
|
"print_dataset_stats(train_data, \"treningowego\")\n",
|
|
"print(\"\\n\")\n",
|
|
"print_dataset_stats(val_data, \"walidacyjnego\")\n",
|
|
"print(\"\\n\")\n",
|
|
"print_dataset_stats(test_data, \"testowego\")\n",
|
|
"\n",
|
|
"# Normalize and clean each split\n",
|
|
"train_data = normalize_data(train_data)\n",
|
|
"train_data = clean_dataset(train_data)\n",
|
|
"val_data = normalize_data(val_data)\n",
|
|
"val_data = clean_dataset(val_data)\n",
|
|
"test_data = normalize_data(test_data)\n",
|
|
"test_data = clean_dataset(test_data)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "e88874c0-afac-488a-9051-ae2537dea531",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": []
|
|
}
|
|
],
|
|
"metadata": {
|
|
"kernelspec": {
|
|
"display_name": "Python 3 (ipykernel)",
|
|
"language": "python",
|
|
"name": "python3"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.9.2"
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 5
|
|
}
|