ium_478855/02_Dane.ipynb
Michał Ulaniuk 88c3784118 Jenkinsfile
2022-03-27 17:07:12 +02:00


!pip install kaggle
!pip install pandas
!pip install seaborn
!pip install torch
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: kaggle in \\\\files\students\s478855\.appdata\python\python38\site-packages (1.5.12)
Requirement already satisfied: tqdm in c:\software\python3\lib\site-packages (from kaggle) (4.62.1)
Requirement already satisfied: urllib3 in c:\software\python3\lib\site-packages (from kaggle) (1.26.6)
Requirement already satisfied: certifi in c:\software\python3\lib\site-packages (from kaggle) (2021.5.30)
Requirement already satisfied: python-dateutil in c:\software\python3\lib\site-packages (from kaggle) (2.8.2)
Requirement already satisfied: six>=1.10 in c:\software\python3\lib\site-packages (from kaggle) (1.15.0)
Requirement already satisfied: python-slugify in \\\\files\students\s478855\.appdata\python\python38\site-packages (from kaggle) (6.1.1)
Requirement already satisfied: requests in c:\software\python3\lib\site-packages (from kaggle) (2.26.0)
Requirement already satisfied: text-unidecode>=1.3 in \\\\files\students\s478855\.appdata\python\python38\site-packages (from python-slugify->kaggle) (1.3)
Requirement already satisfied: idna<4,>=2.5 in c:\software\python3\lib\site-packages (from requests->kaggle) (3.2)
Requirement already satisfied: charset-normalizer~=2.0.0 in c:\software\python3\lib\site-packages (from requests->kaggle) (2.0.4)
Requirement already satisfied: colorama in c:\software\python3\lib\site-packages (from tqdm->kaggle) (0.4.4)
WARNING: You are using pip version 21.2.4; however, version 22.0.4 is available.
You should consider upgrading via the 'c:\software\python3\python3.exe -m pip install --upgrade pip' command.
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: pandas in c:\software\python3\lib\site-packages (1.3.2)
Requirement already satisfied: pytz>=2017.3 in c:\software\python3\lib\site-packages (from pandas) (2021.1)
Requirement already satisfied: python-dateutil>=2.7.3 in c:\software\python3\lib\site-packages (from pandas) (2.8.2)
Requirement already satisfied: numpy>=1.17.3 in c:\software\python3\lib\site-packages (from pandas) (1.19.5)
Requirement already satisfied: six>=1.5 in c:\software\python3\lib\site-packages (from python-dateutil>=2.7.3->pandas) (1.15.0)
WARNING: You are using pip version 21.2.4; however, version 22.0.4 is available.
You should consider upgrading via the 'c:\software\python3\python3.exe -m pip install --upgrade pip' command.
Defaulting to user installation because normal site-packages is not writeable
WARNING: You are using pip version 21.2.4; however, version 22.0.4 is available.
You should consider upgrading via the 'c:\software\python3\python3.exe -m pip install --upgrade pip' command.
Requirement already satisfied: seaborn in \\\\files\students\s478855\.appdata\python\python38\site-packages (0.11.2)
Requirement already satisfied: numpy>=1.15 in c:\software\python3\lib\site-packages (from seaborn) (1.19.5)
Requirement already satisfied: pandas>=0.23 in c:\software\python3\lib\site-packages (from seaborn) (1.3.2)
Requirement already satisfied: matplotlib>=2.2 in c:\software\python3\lib\site-packages (from seaborn) (3.4.3)
Requirement already satisfied: scipy>=1.0 in c:\software\python3\lib\site-packages (from seaborn) (1.7.1)
Requirement already satisfied: python-dateutil>=2.7 in c:\software\python3\lib\site-packages (from matplotlib>=2.2->seaborn) (2.8.2)
Requirement already satisfied: pyparsing>=2.2.1 in c:\software\python3\lib\site-packages (from matplotlib>=2.2->seaborn) (2.4.7)
Requirement already satisfied: pillow>=6.2.0 in c:\software\python3\lib\site-packages (from matplotlib>=2.2->seaborn) (8.3.1)
Requirement already satisfied: kiwisolver>=1.0.1 in c:\software\python3\lib\site-packages (from matplotlib>=2.2->seaborn) (1.3.1)
Requirement already satisfied: cycler>=0.10 in c:\software\python3\lib\site-packages (from matplotlib>=2.2->seaborn) (0.10.0)
Requirement already satisfied: six in c:\software\python3\lib\site-packages (from cycler>=0.10->matplotlib>=2.2->seaborn) (1.15.0)
Requirement already satisfied: pytz>=2017.3 in c:\software\python3\lib\site-packages (from pandas>=0.23->seaborn) (2021.1)
Defaulting to user installation because normal site-packages is not writeable
Collecting torch
  Downloading torch-1.11.0-cp38-cp38-win_amd64.whl (158.0 MB)
Requirement already satisfied: typing-extensions in c:\software\python3\lib\site-packages (from torch) (3.7.4.3)
Installing collected packages: torch
  WARNING: The scripts convert-caffe2-to-onnx.exe, convert-onnx-to-caffe2.exe and torchrun.exe are installed in 'j:\.AppData\Python\Python38\Scripts' which is not on PATH.
  Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
WARNING: You are using pip version 21.2.4; however, version 22.0.4 is available.
You should consider upgrading via the 'c:\software\python3\python3.exe -m pip install --upgrade pip' command.
Successfully installed torch-1.11.0
# 1. Downloading the dataset
!kaggle datasets download -d joniarroba/noshowappointments
'kaggle' is not recognized as an internal or external command,
operable program or batch file.
!unzip -o noshowappointments.zip
Archive:  noshowappointments.zip
  inflating: KaggleV2-May-2016.csv   
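The extraction step can also be done from Python, without relying on a shell tool being on `PATH`. A minimal sketch using the standard-library `zipfile` module; the helper name `extract_archive` is hypothetical and not part of the original notebook.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir="."):
    """Extract every member of zip_path into dest_dir, overwriting existing files.

    Returns the list of member names, similar to what `unzip -o` reports.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return archive.namelist()
```

Calling `extract_archive("noshowappointments.zip")` in the notebook's working directory should list `KaggleV2-May-2016.csv`, going by the `unzip` output above.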
import pandas as pd

# Load the extracted CSV into a DataFrame
no_shows = pd.read_csv('KaggleV2-May-2016.csv')
no_shows
| | PatientId | AppointmentID | Gender | ScheduledDay | AppointmentDay | Age | Neighbourhood | Scholarship | Hipertension | Diabetes | Alcoholism | Handcap | SMS_received | No-show |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.987250e+13 | 5642903 | F | 2016-04-29T18:38:08Z | 2016-04-29T00:00:00Z | 62 | JARDIM DA PENHA | 0 | 1 | 0 | 0 | 0 | 0 | No |
| 1 | 5.589978e+14 | 5642503 | M | 2016-04-29T16:08:27Z | 2016-04-29T00:00:00Z | 56 | JARDIM DA PENHA | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 2 | 4.262962e+12 | 5642549 | F | 2016-04-29T16:19:04Z | 2016-04-29T00:00:00Z | 62 | MATA DA PRAIA | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 3 | 8.679512e+11 | 5642828 | F | 2016-04-29T17:29:31Z | 2016-04-29T00:00:00Z | 8 | PONTAL DE CAMBURI | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 4 | 8.841186e+12 | 5642494 | F | 2016-04-29T16:07:23Z | 2016-04-29T00:00:00Z | 56 | JARDIM DA PENHA | 0 | 1 | 1 | 0 | 0 | 0 | No |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 110522 | 2.572134e+12 | 5651768 | F | 2016-05-03T09:15:35Z | 2016-06-07T00:00:00Z | 56 | MARIA ORTIZ | 0 | 0 | 0 | 0 | 0 | 1 | No |
| 110523 | 3.596266e+12 | 5650093 | F | 2016-05-03T07:27:33Z | 2016-06-07T00:00:00Z | 51 | MARIA ORTIZ | 0 | 0 | 0 | 0 | 0 | 1 | No |
| 110524 | 1.557663e+13 | 5630692 | F | 2016-04-27T16:03:52Z | 2016-06-07T00:00:00Z | 21 | MARIA ORTIZ | 0 | 0 | 0 | 0 | 0 | 1 | No |
| 110525 | 9.213493e+13 | 5630323 | F | 2016-04-27T15:09:23Z | 2016-06-07T00:00:00Z | 38 | MARIA ORTIZ | 0 | 0 | 0 | 0 | 0 | 1 | No |
| 110526 | 3.775115e+14 | 5629448 | F | 2016-04-27T13:30:56Z | 2016-06-07T00:00:00Z | 54 | MARIA ORTIZ | 0 | 0 | 0 | 0 | 0 | 1 | No |

110527 rows × 14 columns

# 2. Train/test split
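import torch

# Split the dataset 80/20 into train and test subsets
train_size = int(0.8 * len(no_shows))
test_size = len(no_shows) - train_size
train_subset, test_subset = torch.utils.data.random_split(no_shows, [train_size, test_size])
# random_split returns Subset objects; recover DataFrames from their positional indices
no_shows_train = no_shows.iloc[train_subset.indices]
no_shows_test = no_shows.iloc[test_subset.indices]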
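The split above is drawn at random on every run. A sketch of a reproducible 80/20 split done directly in pandas, offered as an alternative to `random_split` rather than as the notebook's method; the function name `split_frame`, the seed value, and the toy frame are assumptions for illustration. It relies on the DataFrame having a unique index.

```python
import pandas as pd


def split_frame(df, train_frac=0.8, seed=42):
    """Return (train, test) DataFrames; seed fixes the shuffle for reproducibility."""
    train = df.sample(frac=train_frac, random_state=seed)
    # Everything not sampled into train becomes the test subset.
    test = df.drop(train.index)
    return train, test


# Toy data standing in for the real dataset (values are made up).
toy = pd.DataFrame({"Age": range(10), "No-show": ["No"] * 8 + ["Yes"] * 2})
toy_train, toy_test = split_frame(toy)
print(len(toy_train), len(toy_test))
```

Note that the split in this notebook happens before the cleaning step, so rows removed later are still counted in `train_size` and `test_size`; cleaning first and splitting afterwards would keep the sizes consistent.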
# 3. Statistics
# Size of the dataset and its subsets
print(f"Dataset size: {len(no_shows)}, train subset: {train_size}, test subset: {test_size}.")
# Description of the columns
no_shows.describe(include='all')
Dataset size: 110527, train subset: 88421, test subset: 22106.
| | PatientId | AppointmentID | Gender | ScheduledDay | AppointmentDay | Age | Neighbourhood | Scholarship | Hipertension | Diabetes | Alcoholism | Handcap | SMS_received | No-show |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1.105270e+05 | 1.105270e+05 | 110527 | 110527 | 110527 | 110527.000000 | 110527 | 110527.000000 | 110527.000000 | 110527.000000 | 110527.000000 | 110527.000000 | 110527.000000 | 110527 |
| unique | NaN | NaN | 2 | 103549 | 27 | NaN | 81 | NaN | NaN | NaN | NaN | NaN | NaN | 2 |
| top | NaN | NaN | F | 2016-05-06T07:09:54Z | 2016-06-06T00:00:00Z | NaN | JARDIM CAMBURI | NaN | NaN | NaN | NaN | NaN | NaN | No |
| freq | NaN | NaN | 71840 | 24 | 4692 | NaN | 7717 | NaN | NaN | NaN | NaN | NaN | NaN | 88208 |
| mean | 1.474963e+14 | 5.675305e+06 | NaN | NaN | NaN | 37.088874 | NaN | 0.098266 | 0.197246 | 0.071865 | 0.030400 | 0.022248 | 0.321026 | NaN |
| std | 2.560949e+14 | 7.129575e+04 | NaN | NaN | NaN | 23.110205 | NaN | 0.297675 | 0.397921 | 0.258265 | 0.171686 | 0.161543 | 0.466873 | NaN |
| min | 3.921784e+04 | 5.030230e+06 | NaN | NaN | NaN | -1.000000 | NaN | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | NaN |
| 25% | 4.172614e+12 | 5.640286e+06 | NaN | NaN | NaN | 18.000000 | NaN | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | NaN |
| 50% | 3.173184e+13 | 5.680573e+06 | NaN | NaN | NaN | 37.000000 | NaN | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | NaN |
| 75% | 9.439172e+13 | 5.725524e+06 | NaN | NaN | NaN | 55.000000 | NaN | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | NaN |
| max | 9.999816e+14 | 5.790484e+06 | NaN | NaN | NaN | 115.000000 | NaN | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 4.000000 | 1.000000 | NaN |
# Class frequency distribution
no_shows["No-show"].value_counts().plot(kind="bar", title="No-show")
<AxesSubplot:title={'center':'No-show'}>
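The bar chart shows absolute counts; the class imbalance is easier to quote as proportions. In the `describe` output above, 88208 of 110527 rows are `No`, roughly 80%. A sketch using `value_counts(normalize=True)` on a toy series (the labels below are made up for illustration, not taken from the dataset):

```python
import pandas as pd

# Toy labels standing in for no_shows["No-show"] (made-up values).
labels = pd.Series(["No", "No", "No", "No", "Yes"], name="No-show")

counts = labels.value_counts()                 # absolute counts per class
shares = labels.value_counts(normalize=True)   # fractions that sum to 1

# Combine both views into one small table, indexed by class label.
summary = pd.DataFrame({"count": counts, "share": shares})
print(summary)
```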
# Cleaning the dataset
# Remove rows with a negative age
no_shows = no_shows.drop(no_shows[no_shows["Age"] < 0].index)

# Remove rows with an unknown age (depends on the use case)
# no_shows = no_shows.drop(no_shows[no_shows["Age"] == 0].index)
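The same cleaning can be written as a boolean-mask filter, which makes the optional `Age == 0` rule a single switch. A minimal sketch; the helper name `drop_invalid_ages` and its `drop_zero` parameter are hypothetical, and treating a zero age as "unknown" follows the commented-out line above (it could equally mean an infant).

```python
import pandas as pd


def drop_invalid_ages(df, drop_zero=False):
    """Return a copy of df without negative ages; optionally drop Age == 0 too."""
    mask = df["Age"] >= 0
    if drop_zero:
        mask &= df["Age"] != 0
    return df[mask].copy()


# Made-up ages, including one negative and one zero value.
toy = pd.DataFrame({"Age": [-1, 0, 8, 62]})
print(len(drop_invalid_ages(toy)))                  # negative age removed
print(len(drop_invalid_ages(toy, drop_zero=True)))  # zero age removed as well
```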
# Data normalization

# Drop the PatientId and AppointmentID columns
no_shows.drop(["PatientId", "AppointmentID"], inplace=True, axis=1)

# Map the No-show column from Yes/No to a binary value (1/0)
no_shows["No-show"] = no_shows["No-show"].map({'Yes': 1, 'No': 0})

# Min-max normalization of the Age column to the [0, 1] range
no_shows["Age"] = (no_shows["Age"] - no_shows["Age"].min()) / (no_shows["Age"].max() - no_shows["Age"].min())
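The last line applies min-max scaling, `(x - min) / (max - min)`, using bounds taken from the whole dataset. A sketch of the same formula as a reusable function that can also take bounds from elsewhere, for example from the train subset so that the test subset is scaled consistently; the function name `min_max_scale` and its parameters are assumptions, not part of the notebook.

```python
import pandas as pd


def min_max_scale(values, lo=None, hi=None):
    """Scale values to [0, 1] using (x - lo) / (hi - lo).

    lo and hi default to the min and max of values; pass the training
    subset's bounds to scale a test subset consistently.
    """
    lo = values.min() if lo is None else lo
    hi = values.max() if hi is None else hi
    if hi == lo:
        raise ValueError("cannot scale a constant column")
    return (values - lo) / (hi - lo)


# Made-up ages; 115 matches the maximum age in the describe output.
ages = pd.Series([0, 23, 115])
print(min_max_scale(ages).tolist())
```

With bounds fitted on the train subset, test values outside that range fall slightly outside [0, 1], which is usually acceptable and avoids leaking test statistics into preprocessing.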