ium_478855/notebooks/02_Dane.ipynb
2022-04-24 20:51:38 +02:00


!pip install kaggle
!pip install pandas
!pip install seaborn
!pip install torch
Requirement already satisfied: kaggle in c:\programy\anaconda3\envs\ium\lib\site-packages (1.5.12)
Requirement already satisfied: tqdm in c:\programy\anaconda3\envs\ium\lib\site-packages (from kaggle) (4.63.0)
Requirement already satisfied: certifi in c:\programy\anaconda3\envs\ium\lib\site-packages (from kaggle) (2021.10.8)
Requirement already satisfied: six>=1.10 in c:\programy\anaconda3\envs\ium\lib\site-packages (from kaggle) (1.16.0)
Requirement already satisfied: requests in c:\programy\anaconda3\envs\ium\lib\site-packages (from kaggle) (2.27.1)
Requirement already satisfied: python-slugify in c:\programy\anaconda3\envs\ium\lib\site-packages (from kaggle) (6.1.1)
Requirement already satisfied: urllib3 in c:\programy\anaconda3\envs\ium\lib\site-packages (from kaggle) (1.26.9)
Requirement already satisfied: python-dateutil in c:\programy\anaconda3\envs\ium\lib\site-packages (from kaggle) (2.8.2)
Requirement already satisfied: text-unidecode>=1.3 in c:\programy\anaconda3\envs\ium\lib\site-packages (from python-slugify->kaggle) (1.3)
Requirement already satisfied: charset-normalizer~=2.0.0 in c:\programy\anaconda3\envs\ium\lib\site-packages (from requests->kaggle) (2.0.12)
Requirement already satisfied: idna<4,>=2.5 in c:\programy\anaconda3\envs\ium\lib\site-packages (from requests->kaggle) (3.3)
Requirement already satisfied: colorama in c:\programy\anaconda3\envs\ium\lib\site-packages (from tqdm->kaggle) (0.4.4)
Requirement already satisfied: pandas in c:\programy\anaconda3\envs\ium\lib\site-packages (1.3.5)
Requirement already satisfied: pytz>=2017.3 in c:\programy\anaconda3\envs\ium\lib\site-packages (from pandas) (2022.1)
Requirement already satisfied: python-dateutil>=2.7.3 in c:\programy\anaconda3\envs\ium\lib\site-packages (from pandas) (2.8.2)
Requirement already satisfied: numpy>=1.17.3 in c:\programy\anaconda3\envs\ium\lib\site-packages (from pandas) (1.21.5)
Requirement already satisfied: six>=1.5 in c:\programy\anaconda3\envs\ium\lib\site-packages (from python-dateutil>=2.7.3->pandas) (1.16.0)
Requirement already satisfied: seaborn in c:\programy\anaconda3\envs\ium\lib\site-packages (0.11.2)
Requirement already satisfied: scipy>=1.0 in c:\programy\anaconda3\envs\ium\lib\site-packages (from seaborn) (1.7.3)
Requirement already satisfied: numpy>=1.15 in c:\programy\anaconda3\envs\ium\lib\site-packages (from seaborn) (1.21.5)
Requirement already satisfied: matplotlib>=2.2 in c:\programy\anaconda3\envs\ium\lib\site-packages (from seaborn) (3.5.1)
Requirement already satisfied: pandas>=0.23 in c:\programy\anaconda3\envs\ium\lib\site-packages (from seaborn) (1.3.5)
Requirement already satisfied: fonttools>=4.22.0 in c:\programy\anaconda3\envs\ium\lib\site-packages (from matplotlib>=2.2->seaborn) (4.31.1)
Requirement already satisfied: pyparsing>=2.2.1 in c:\programy\anaconda3\envs\ium\lib\site-packages (from matplotlib>=2.2->seaborn) (3.0.7)
Requirement already satisfied: cycler>=0.10 in c:\programy\anaconda3\envs\ium\lib\site-packages (from matplotlib>=2.2->seaborn) (0.11.0)
Requirement already satisfied: packaging>=20.0 in c:\programy\anaconda3\envs\ium\lib\site-packages (from matplotlib>=2.2->seaborn) (21.3)
Requirement already satisfied: python-dateutil>=2.7 in c:\programy\anaconda3\envs\ium\lib\site-packages (from matplotlib>=2.2->seaborn) (2.8.2)
Requirement already satisfied: pillow>=6.2.0 in c:\programy\anaconda3\envs\ium\lib\site-packages (from matplotlib>=2.2->seaborn) (9.0.1)
Requirement already satisfied: kiwisolver>=1.0.1 in c:\programy\anaconda3\envs\ium\lib\site-packages (from matplotlib>=2.2->seaborn) (1.4.0)
Requirement already satisfied: typing-extensions in c:\programy\anaconda3\envs\ium\lib\site-packages (from kiwisolver>=1.0.1->matplotlib>=2.2->seaborn) (4.1.1)
Requirement already satisfied: pytz>=2017.3 in c:\programy\anaconda3\envs\ium\lib\site-packages (from pandas>=0.23->seaborn) (2022.1)
Requirement already satisfied: six>=1.5 in c:\programy\anaconda3\envs\ium\lib\site-packages (from python-dateutil>=2.7->matplotlib>=2.2->seaborn) (1.16.0)
Requirement already satisfied: torch in c:\programy\anaconda3\envs\ium\lib\site-packages (1.11.0)
Requirement already satisfied: typing-extensions in c:\programy\anaconda3\envs\ium\lib\site-packages (from torch) (4.1.1)
# 1. Downloading the dataset
!kaggle datasets download -d joniarroba/noshowappointments
401 - Unauthorized
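The `401 - Unauthorized` response suggests that the Kaggle CLI did not find valid API credentials. A minimal sketch of a pre-flight check, assuming the two usual credential sources (the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables, or a `kaggle.json` token file in `~/.kaggle`). The helper name `kaggle_credentials_present` is hypothetical and not part of the kaggle package, and the check only confirms that credentials exist, not that they are valid.

```python
import os
from pathlib import Path


def kaggle_credentials_present(config_dir=None, environ=None):
    """Return True if Kaggle API credentials appear to be configured.

    Hypothetical helper: it only checks for presence, not validity.
    """
    environ = os.environ if environ is None else environ
    # Option 1: both environment variables are set to non-empty values.
    if environ.get("KAGGLE_USERNAME") and environ.get("KAGGLE_KEY"):
        return True
    # Option 2: a kaggle.json token file exists in the config directory
    # (assumed default location: ~/.kaggle).
    config_dir = Path(config_dir) if config_dir else Path.home() / ".kaggle"
    return (config_dir / "kaggle.json").is_file()
```

Calling `kaggle_credentials_present()` before the download cell gives an early hint about whether the request is likely to be rejected.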
!unzip -o noshowappointments.zip
'unzip' is not recognized as an internal or external command,
operable program or batch file.
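The `unzip` command is not available in this Windows shell. One portable alternative is Python's standard `zipfile` module, sketched below. The helper name `extract_archive` is an assumption made for illustration. Like `unzip -o`, `extractall` overwrites files that already exist.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, target_dir="."):
    """Extract every member of zip_path into target_dir and return the member names."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)  # create the destination if needed
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)  # existing files are overwritten
        return archive.namelist()
```

In this notebook the call would be `extract_archive("noshowappointments.zip")`, provided the download step succeeded.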
import pandas as pd
no_shows = pd.read_csv('KaggleV2-May-2016.csv')
no_shows
PatientId AppointmentID Gender ScheduledDay AppointmentDay Age Neighbourhood Scholarship Hipertension Diabetes Alcoholism Handcap SMS_received No-show
0 2.987250e+13 5642903 F 2016-04-29T18:38:08Z 2016-04-29T00:00:00Z 62 JARDIM DA PENHA 0 1 0 0 0 0 No
1 5.589978e+14 5642503 M 2016-04-29T16:08:27Z 2016-04-29T00:00:00Z 56 JARDIM DA PENHA 0 0 0 0 0 0 No
2 4.262962e+12 5642549 F 2016-04-29T16:19:04Z 2016-04-29T00:00:00Z 62 MATA DA PRAIA 0 0 0 0 0 0 No
3 8.679512e+11 5642828 F 2016-04-29T17:29:31Z 2016-04-29T00:00:00Z 8 PONTAL DE CAMBURI 0 0 0 0 0 0 No
4 8.841186e+12 5642494 F 2016-04-29T16:07:23Z 2016-04-29T00:00:00Z 56 JARDIM DA PENHA 0 1 1 0 0 0 No
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
110522 2.572134e+12 5651768 F 2016-05-03T09:15:35Z 2016-06-07T00:00:00Z 56 MARIA ORTIZ 0 0 0 0 0 1 No
110523 3.596266e+12 5650093 F 2016-05-03T07:27:33Z 2016-06-07T00:00:00Z 51 MARIA ORTIZ 0 0 0 0 0 1 No
110524 1.557663e+13 5630692 F 2016-04-27T16:03:52Z 2016-06-07T00:00:00Z 21 MARIA ORTIZ 0 0 0 0 0 1 No
110525 9.213493e+13 5630323 F 2016-04-27T15:09:23Z 2016-06-07T00:00:00Z 38 MARIA ORTIZ 0 0 0 0 0 1 No
110526 3.775115e+14 5629448 F 2016-04-27T13:30:56Z 2016-06-07T00:00:00Z 54 MARIA ORTIZ 0 0 0 0 0 1 No

110527 rows × 14 columns

# 2. Train/test split
import torch

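# 80% of the rows go to the training subset, the remainder to the test subset
train_size = int(0.8 * len(no_shows))
test_size = len(no_shows) - train_size
# Note: random_split returns torch.utils.data.Subset objects that hold row positions
# into no_shows; they are not DataFrames themselves
no_shows_train, no_shows_test = torch.utils.data.random_split(no_shows, [train_size, test_size])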
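Because `random_split` wraps the DataFrame in `Subset` objects, indexing a subset by position falls through to `DataFrame.__getitem__`, which looks up column labels rather than rows. A possible alternative, shown here only as a sketch, is to draw the row positions directly and pass them to `DataFrame.iloc`. The helper `split_positions` and its `seed` parameter are assumptions, not part of the original notebook.

```python
import random


def split_positions(n_rows, train_fraction=0.8, seed=0):
    """Return two disjoint lists of row positions: (train, test)."""
    positions = list(range(n_rows))
    random.Random(seed).shuffle(positions)  # seeded so the split is reproducible
    train_size = int(train_fraction * n_rows)  # same rounding as in the cell above
    return positions[:train_size], positions[train_size:]
```

With `train_idx, test_idx = split_positions(len(no_shows))`, the subsets become ordinary DataFrames via `no_shows.iloc[train_idx]` and `no_shows.iloc[test_idx]`.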
# 3. Statistics
# Size of the dataset and its subsets
print(f"Dataset size: {len(no_shows)}, train subset: {train_size}, test subset: {test_size}.")
# Description of the parameters
no_shows.describe(include='all')
Dataset size: 110527, train subset: 88421, test subset: 22106.
PatientId AppointmentID Gender ScheduledDay AppointmentDay Age Neighbourhood Scholarship Hipertension Diabetes Alcoholism Handcap SMS_received No-show
count 1.105270e+05 1.105270e+05 110527 110527 110527 110527.000000 110527 110527.000000 110527.000000 110527.000000 110527.000000 110527.000000 110527.000000 110527
unique NaN NaN 2 103549 27 NaN 81 NaN NaN NaN NaN NaN NaN 2
top NaN NaN F 2016-05-06T07:09:54Z 2016-06-06T00:00:00Z NaN JARDIM CAMBURI NaN NaN NaN NaN NaN NaN No
freq NaN NaN 71840 24 4692 NaN 7717 NaN NaN NaN NaN NaN NaN 88208
mean 1.474963e+14 5.675305e+06 NaN NaN NaN 37.088874 NaN 0.098266 0.197246 0.071865 0.030400 0.022248 0.321026 NaN
std 2.560949e+14 7.129575e+04 NaN NaN NaN 23.110205 NaN 0.297675 0.397921 0.258265 0.171686 0.161543 0.466873 NaN
min 3.921784e+04 5.030230e+06 NaN NaN NaN -1.000000 NaN 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 NaN
25% 4.172614e+12 5.640286e+06 NaN NaN NaN 18.000000 NaN 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 NaN
50% 3.173184e+13 5.680573e+06 NaN NaN NaN 37.000000 NaN 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 NaN
75% 9.439172e+13 5.725524e+06 NaN NaN NaN 55.000000 NaN 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 NaN
max 9.999816e+14 5.790484e+06 NaN NaN NaN 115.000000 NaN 1.000000 1.000000 1.000000 1.000000 4.000000 1.000000 NaN
# Class frequency distribution
no_shows["No-show"].value_counts().plot(kind="bar", title="No-show")
<AxesSubplot:title={'center':'No-show'}>
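The bar chart can be complemented with the class shares as numbers. The sketch below uses only figures from the `describe()` output above (110527 rows, of which 88208 are `No`). The helper name `class_shares` is hypothetical.

```python
def class_shares(counts):
    """Convert absolute class counts into fractions of the total."""
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Counts taken from the describe() output: top = "No", freq = 88208, count = 110527.
counts = {"No": 88208, "Yes": 110527 - 88208}
shares = class_shares(counts)
print({label: round(share, 3) for label, share in shares.items()})
# prints {'No': 0.798, 'Yes': 0.202}
```

Roughly four out of five appointments are kept, so the classes are noticeably imbalanced. On the DataFrame itself, `no_shows["No-show"].value_counts(normalize=True)` should give the same shares.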
# Cleaning the dataset
# Remove rows with a negative age
no_shows = no_shows.drop(no_shows[no_shows["Age"] < 0].index)

# Remove rows with an unknown age (depends on the application)
# no_shows = no_shows.drop(no_shows[no_shows["Age"] == 0].index)
# Data normalization

# Remove the PatientId and AppointmentID columns
no_shows.drop(["PatientId", "AppointmentID"], inplace=True, axis=1)

# Convert the No-show column from Yes/No to a binary value (1/0)
no_shows["No-show"] = no_shows["No-show"].map({'Yes': 1, 'No': 0})

# Min-max normalization of the Age column
no_shows["Age"] = (no_shows["Age"] - no_shows["Age"].min()) / (no_shows["Age"].max() - no_shows["Age"].min())
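The Age normalization is a min-max scaling: each value is shifted by the column minimum and divided by the range, so the result lies in [0, 1]. A minimal plain-Python sketch of the same formula follows. The helper `min_max_scale` and its handling of a constant column are assumptions added for illustration.

```python
def min_max_scale(values):
    """Scale values linearly so the smallest maps to 0.0 and the largest to 1.0."""
    low, high = min(values), max(values)
    if high == low:
        # A constant column has no spread; return zeros to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]
```

For example, with ages spanning 0 to 115 (the range left after dropping the negative value), an age of 23 maps to 23 / 115 = 0.2.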