!pip install kaggle
!pip install pandas
!pip install seaborn
!pip install torch
Requirement already satisfied: kaggle in c:\programy\anaconda3\envs\ium\lib\site-packages (1.5.12)
Requirement already satisfied: tqdm in c:\programy\anaconda3\envs\ium\lib\site-packages (from kaggle) (4.63.0)
Requirement already satisfied: certifi in c:\programy\anaconda3\envs\ium\lib\site-packages (from kaggle) (2021.10.8)
Requirement already satisfied: six>=1.10 in c:\programy\anaconda3\envs\ium\lib\site-packages (from kaggle) (1.16.0)
Requirement already satisfied: requests in c:\programy\anaconda3\envs\ium\lib\site-packages (from kaggle) (2.27.1)
Requirement already satisfied: python-slugify in c:\programy\anaconda3\envs\ium\lib\site-packages (from kaggle) (6.1.1)
Requirement already satisfied: urllib3 in c:\programy\anaconda3\envs\ium\lib\site-packages (from kaggle) (1.26.9)
Requirement already satisfied: python-dateutil in c:\programy\anaconda3\envs\ium\lib\site-packages (from kaggle) (2.8.2)
Requirement already satisfied: text-unidecode>=1.3 in c:\programy\anaconda3\envs\ium\lib\site-packages (from python-slugify->kaggle) (1.3)
Requirement already satisfied: charset-normalizer~=2.0.0 in c:\programy\anaconda3\envs\ium\lib\site-packages (from requests->kaggle) (2.0.12)
Requirement already satisfied: idna<4,>=2.5 in c:\programy\anaconda3\envs\ium\lib\site-packages (from requests->kaggle) (3.3)
Requirement already satisfied: colorama in c:\programy\anaconda3\envs\ium\lib\site-packages (from tqdm->kaggle) (0.4.4)
Requirement already satisfied: pandas in c:\programy\anaconda3\envs\ium\lib\site-packages (1.3.5)
Requirement already satisfied: pytz>=2017.3 in c:\programy\anaconda3\envs\ium\lib\site-packages (from pandas) (2022.1)
Requirement already satisfied: python-dateutil>=2.7.3 in c:\programy\anaconda3\envs\ium\lib\site-packages (from pandas) (2.8.2)
Requirement already satisfied: numpy>=1.17.3 in c:\programy\anaconda3\envs\ium\lib\site-packages (from pandas) (1.21.5)
Requirement already satisfied: six>=1.5 in c:\programy\anaconda3\envs\ium\lib\site-packages (from python-dateutil>=2.7.3->pandas) (1.16.0)
Requirement already satisfied: seaborn in c:\programy\anaconda3\envs\ium\lib\site-packages (0.11.2)
Requirement already satisfied: scipy>=1.0 in c:\programy\anaconda3\envs\ium\lib\site-packages (from seaborn) (1.7.3)
Requirement already satisfied: numpy>=1.15 in c:\programy\anaconda3\envs\ium\lib\site-packages (from seaborn) (1.21.5)
Requirement already satisfied: matplotlib>=2.2 in c:\programy\anaconda3\envs\ium\lib\site-packages (from seaborn) (3.5.1)
Requirement already satisfied: pandas>=0.23 in c:\programy\anaconda3\envs\ium\lib\site-packages (from seaborn) (1.3.5)
Requirement already satisfied: fonttools>=4.22.0 in c:\programy\anaconda3\envs\ium\lib\site-packages (from matplotlib>=2.2->seaborn) (4.31.1)
Requirement already satisfied: pyparsing>=2.2.1 in c:\programy\anaconda3\envs\ium\lib\site-packages (from matplotlib>=2.2->seaborn) (3.0.7)
Requirement already satisfied: cycler>=0.10 in c:\programy\anaconda3\envs\ium\lib\site-packages (from matplotlib>=2.2->seaborn) (0.11.0)
Requirement already satisfied: packaging>=20.0 in c:\programy\anaconda3\envs\ium\lib\site-packages (from matplotlib>=2.2->seaborn) (21.3)
Requirement already satisfied: python-dateutil>=2.7 in c:\programy\anaconda3\envs\ium\lib\site-packages (from matplotlib>=2.2->seaborn) (2.8.2)
Requirement already satisfied: pillow>=6.2.0 in c:\programy\anaconda3\envs\ium\lib\site-packages (from matplotlib>=2.2->seaborn) (9.0.1)
Requirement already satisfied: kiwisolver>=1.0.1 in c:\programy\anaconda3\envs\ium\lib\site-packages (from matplotlib>=2.2->seaborn) (1.4.0)
Requirement already satisfied: typing-extensions in c:\programy\anaconda3\envs\ium\lib\site-packages (from kiwisolver>=1.0.1->matplotlib>=2.2->seaborn) (4.1.1)
Requirement already satisfied: pytz>=2017.3 in c:\programy\anaconda3\envs\ium\lib\site-packages (from pandas>=0.23->seaborn) (2022.1)
Requirement already satisfied: six>=1.5 in c:\programy\anaconda3\envs\ium\lib\site-packages (from python-dateutil>=2.7->matplotlib>=2.2->seaborn) (1.16.0)
Requirement already satisfied: torch in c:\programy\anaconda3\envs\ium\lib\site-packages (1.11.0)
Requirement already satisfied: typing-extensions in c:\programy\anaconda3\envs\ium\lib\site-packages (from torch) (4.1.1)
# 1. Downloading the dataset
!kaggle datasets download -d joniarroba/noshowappointments
401 - Unauthorized
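A `401 - Unauthorized` response from the Kaggle CLI usually points to missing or invalid API credentials. The sketch below is one possible pre-flight check, assuming the CLI reads either the `KAGGLE_USERNAME`/`KAGGLE_KEY` environment variables or a `kaggle.json` token file in `~/.kaggle` (or in the directory named by `KAGGLE_CONFIG_DIR`). The helper name `kaggle_credentials_found` is hypothetical, and the check only tests for presence, not validity.

```python
import os
from pathlib import Path


def kaggle_credentials_found(env=None, home=None):
    """Return True if Kaggle API credentials appear to be configured.

    Hypothetical helper: it only checks that credentials are present,
    not that Kaggle accepts them.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)

    # Option 1: credentials passed through environment variables.
    if env.get("KAGGLE_USERNAME") and env.get("KAGGLE_KEY"):
        return True

    # Option 2: a kaggle.json token file in the configuration directory.
    config_dir = Path(env.get("KAGGLE_CONFIG_DIR", home / ".kaggle"))
    return (config_dir / "kaggle.json").is_file()


if not kaggle_credentials_found():
    print("No Kaggle credentials found; the download is likely to fail.")
```

If the check passes and the download still returns 401, the token itself may be expired or revoked.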
!unzip -o noshowappointments.zip
'unzip' is not recognized as an internal or external command, operable program or batch file.
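The `unzip` command is not available in a default Windows shell, which is why the cell above fails. A portable alternative is Python's standard `zipfile` module. The sketch below assumes the archive has already been downloaded to the working directory; the helper name `extract_archive` is hypothetical.

```python
import zipfile
from pathlib import Path


def extract_archive(archive_path, target_dir="."):
    """Extract a zip archive and return the names of its members."""
    archive = Path(archive_path)
    if not archive.exists():
        raise FileNotFoundError(f"Archive not found: {archive}")
    with zipfile.ZipFile(archive) as zf:
        # Existing files with the same names are overwritten, like `unzip -o`.
        zf.extractall(target_dir)
        return zf.namelist()


# Only attempt extraction if the download actually succeeded.
if Path("noshowappointments.zip").exists():
    print(extract_archive("noshowappointments.zip"))
```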
import pandas as pd
no_shows = pd.read_csv('KaggleV2-May-2016.csv')
no_shows
 | PatientId | AppointmentID | Gender | ScheduledDay | AppointmentDay | Age | Neighbourhood | Scholarship | Hipertension | Diabetes | Alcoholism | Handcap | SMS_received | No-show |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2.987250e+13 | 5642903 | F | 2016-04-29T18:38:08Z | 2016-04-29T00:00:00Z | 62 | JARDIM DA PENHA | 0 | 1 | 0 | 0 | 0 | 0 | No |
1 | 5.589978e+14 | 5642503 | M | 2016-04-29T16:08:27Z | 2016-04-29T00:00:00Z | 56 | JARDIM DA PENHA | 0 | 0 | 0 | 0 | 0 | 0 | No |
2 | 4.262962e+12 | 5642549 | F | 2016-04-29T16:19:04Z | 2016-04-29T00:00:00Z | 62 | MATA DA PRAIA | 0 | 0 | 0 | 0 | 0 | 0 | No |
3 | 8.679512e+11 | 5642828 | F | 2016-04-29T17:29:31Z | 2016-04-29T00:00:00Z | 8 | PONTAL DE CAMBURI | 0 | 0 | 0 | 0 | 0 | 0 | No |
4 | 8.841186e+12 | 5642494 | F | 2016-04-29T16:07:23Z | 2016-04-29T00:00:00Z | 56 | JARDIM DA PENHA | 0 | 1 | 1 | 0 | 0 | 0 | No |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
110522 | 2.572134e+12 | 5651768 | F | 2016-05-03T09:15:35Z | 2016-06-07T00:00:00Z | 56 | MARIA ORTIZ | 0 | 0 | 0 | 0 | 0 | 1 | No |
110523 | 3.596266e+12 | 5650093 | F | 2016-05-03T07:27:33Z | 2016-06-07T00:00:00Z | 51 | MARIA ORTIZ | 0 | 0 | 0 | 0 | 0 | 1 | No |
110524 | 1.557663e+13 | 5630692 | F | 2016-04-27T16:03:52Z | 2016-06-07T00:00:00Z | 21 | MARIA ORTIZ | 0 | 0 | 0 | 0 | 0 | 1 | No |
110525 | 9.213493e+13 | 5630323 | F | 2016-04-27T15:09:23Z | 2016-06-07T00:00:00Z | 38 | MARIA ORTIZ | 0 | 0 | 0 | 0 | 0 | 1 | No |
110526 | 3.775115e+14 | 5629448 | F | 2016-04-27T13:30:56Z | 2016-06-07T00:00:00Z | 54 | MARIA ORTIZ | 0 | 0 | 0 | 0 | 0 | 1 | No |
110527 rows × 14 columns
# 2. Train/test split
import torch
train_size = int(0.8 * len(no_shows))
test_size = (len(no_shows) - train_size)
no_shows_train, no_shows_test = torch.utils.data.random_split(no_shows, [train_size, test_size])
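`torch.utils.data.random_split` returns `Subset` objects that wrap the original DataFrame through integer indices, so they are not DataFrames themselves and may not behave as expected when indexed by row. Also note that this split is taken before the cleaning steps further down. A pandas-only alternative, shown here as a sketch rather than the notebook's method, keeps both parts as ordinary DataFrames. The helper name `split_frame` and the `seed` parameter are hypothetical, and the approach assumes the index has no duplicate labels.

```python
import pandas as pd


def split_frame(df, train_frac=0.8, seed=0):
    """Randomly split a DataFrame into train and test DataFrames."""
    train_size = int(train_frac * len(df))
    # Draw the training rows at random, without replacement.
    train = df.sample(n=train_size, random_state=seed)
    # Everything not drawn becomes the test set (requires a unique index).
    test = df.drop(train.index)
    return train, test


# Toy data standing in for the real dataset.
demo = pd.DataFrame({"Age": range(10), "No-show": ["No", "Yes"] * 5})
demo_train, demo_test = split_frame(demo)
print(len(demo_train), len(demo_test))
```

Fixing `seed` makes the split reproducible between notebook runs.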
# 3. Statistics
# Size of the dataset and its subsets
print(f"Dataset size: {len(no_shows)}, train subset: {train_size}, test subset: {test_size}.")
# Description of the columns
no_shows.describe(include='all')
Dataset size: 110527, train subset: 88421, test subset: 22106.
 | PatientId | AppointmentID | Gender | ScheduledDay | AppointmentDay | Age | Neighbourhood | Scholarship | Hipertension | Diabetes | Alcoholism | Handcap | SMS_received | No-show |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 1.105270e+05 | 1.105270e+05 | 110527 | 110527 | 110527 | 110527.000000 | 110527 | 110527.000000 | 110527.000000 | 110527.000000 | 110527.000000 | 110527.000000 | 110527.000000 | 110527 |
unique | NaN | NaN | 2 | 103549 | 27 | NaN | 81 | NaN | NaN | NaN | NaN | NaN | NaN | 2 |
top | NaN | NaN | F | 2016-05-06T07:09:54Z | 2016-06-06T00:00:00Z | NaN | JARDIM CAMBURI | NaN | NaN | NaN | NaN | NaN | NaN | No |
freq | NaN | NaN | 71840 | 24 | 4692 | NaN | 7717 | NaN | NaN | NaN | NaN | NaN | NaN | 88208 |
mean | 1.474963e+14 | 5.675305e+06 | NaN | NaN | NaN | 37.088874 | NaN | 0.098266 | 0.197246 | 0.071865 | 0.030400 | 0.022248 | 0.321026 | NaN |
std | 2.560949e+14 | 7.129575e+04 | NaN | NaN | NaN | 23.110205 | NaN | 0.297675 | 0.397921 | 0.258265 | 0.171686 | 0.161543 | 0.466873 | NaN |
min | 3.921784e+04 | 5.030230e+06 | NaN | NaN | NaN | -1.000000 | NaN | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | NaN |
25% | 4.172614e+12 | 5.640286e+06 | NaN | NaN | NaN | 18.000000 | NaN | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | NaN |
50% | 3.173184e+13 | 5.680573e+06 | NaN | NaN | NaN | 37.000000 | NaN | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | NaN |
75% | 9.439172e+13 | 5.725524e+06 | NaN | NaN | NaN | 55.000000 | NaN | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | NaN |
max | 9.999816e+14 | 5.790484e+06 | NaN | NaN | NaN | 115.000000 | NaN | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 4.000000 | 1.000000 | NaN |
# Class frequency distribution
no_shows["No-show"].value_counts().plot(kind="bar", title="No-show")
<AxesSubplot:title={'center':'No-show'}>
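The bar chart shows absolute counts. According to the `describe` output above, "No" occurs 88208 times out of 110527 rows, which is roughly 79.8%, so the classes are imbalanced. Relative frequencies can be read off directly with `value_counts(normalize=True)`. The sketch below uses toy labels, and the helper name `class_shares` is hypothetical.

```python
import pandas as pd


def class_shares(labels):
    """Return the relative frequency of each class in a label Series."""
    return labels.value_counts(normalize=True)


# Toy labels standing in for no_shows["No-show"].
demo_labels = pd.Series(["No", "No", "No", "Yes"], name="No-show")
shares = class_shares(demo_labels)
print(shares)
```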
# Cleaning the dataset
# Remove rows with a negative age
no_shows = no_shows.drop(no_shows[no_shows["Age"] < 0].index)
# Remove rows with an unknown age (depends on the use case)
# no_shows = no_shows.drop(no_shows[no_shows["Age"] == 0].index)
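The same filter can be written with a boolean mask, which also makes it easy to report how many rows were discarded. This is a sketch on toy data; the helper name `drop_invalid_ages` and its `min_age` parameter are hypothetical. Setting `min_age=1` would correspond to the commented-out variant that also drops rows with age 0.

```python
import pandas as pd


def drop_invalid_ages(df, min_age=0):
    """Keep rows with Age >= min_age; return the result and the drop count."""
    mask = df["Age"] >= min_age
    # Rows with a missing Age compare as False and are dropped as well.
    removed = int((~mask).sum())
    return df[mask].copy(), removed


# Toy data standing in for the real dataset.
demo_ages = pd.DataFrame({"Age": [62, -1, 0, 8]})
cleaned, removed = drop_invalid_ages(demo_ages)
print(f"Removed {removed} row(s), {len(cleaned)} remaining.")
```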
# Data normalization
# Drop the PatientId and AppointmentID columns
no_shows.drop(["PatientId", "AppointmentID"], inplace=True, axis=1)
# Convert the No-show column from Yes/No to an integer flag (1 = Yes, 0 = No)
no_shows["No-show"] = no_shows["No-show"].map({'Yes': 1, 'No': 0})
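`Series.map` with a dictionary turns every value missing from the dictionary into `NaN`. In particular, running the cell above a second time would map the already converted 0/1 values to `NaN`. A guarded variant, sketched below with the hypothetical helper name `encode_no_show`, fails loudly instead.

```python
import pandas as pd


def encode_no_show(labels):
    """Map 'Yes'/'No' labels to 1/0 and reject any other value."""
    mapping = {"Yes": 1, "No": 0}
    encoded = labels.map(mapping)
    if encoded.isna().any():
        unexpected = sorted(labels[encoded.isna()].astype(str).unique())
        raise ValueError(f"Unexpected No-show values: {unexpected}")
    return encoded.astype("int64")


# Toy labels standing in for no_shows["No-show"].
demo_flags = encode_no_show(pd.Series(["No", "Yes", "No"]))
print(demo_flags.tolist())
```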
# Min-max normalization of the Age column
age_min, age_max = no_shows["Age"].min(), no_shows["Age"].max()
no_shows["Age"] = (no_shows["Age"] - age_min) / (age_max - age_min)
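The formula applied here is min-max scaling, x' = (x - min) / (max - min), which maps the column onto the range [0, 1]. The cell computes min and max over the whole dataset. A common alternative, shown below as a sketch and not as the notebook's method, takes both statistics from the training part only and reuses them for the test part, so that no information from the test set influences preprocessing. The helper names `fit_min_max` and `apply_min_max` are hypothetical.

```python
import pandas as pd


def fit_min_max(values):
    """Return the (min, max) pair used for scaling."""
    return float(values.min()), float(values.max())


def apply_min_max(values, lo, hi):
    """Scale values with x' = (x - lo) / (hi - lo)."""
    if hi == lo:
        raise ValueError("Cannot scale a constant column (max equals min).")
    return (values - lo) / (hi - lo)


# Toy ages standing in for the train and test parts.
train_age = pd.Series([0, 50, 100])
test_age = pd.Series([25, 115])

lo, hi = fit_min_max(train_age)
train_scaled = apply_min_max(train_age, lo, hi)
# Test values outside the training range fall outside [0, 1].
test_scaled = apply_min_max(test_age, lo, hi)
print(train_scaled.tolist(), test_scaled.tolist())
```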