ium_452487/dane.ipynb at 88a23b201b3a1b8d116a5b70dd3e0cb7d61d56f7

1. Downloading the dataset

!pip install --user kaggle

Collecting kaggle
  Using cached kaggle-1.6.6.tar.gz (84 kB)
Requirement already satisfied: six>=1.10 in c:\users\adrian\appdata\roaming\python\python39\site-packages (from kaggle) (1.16.0)
Requirement already satisfied: certifi in c:\users\adrian\appdata\roaming\python\python39\site-packages (from kaggle) (2022.6.15)
Requirement already satisfied: python-dateutil in c:\users\adrian\appdata\roaming\python\python39\site-packages (from kaggle) (2.8.2)
Requirement already satisfied: requests in c:\users\adrian\appdata\roaming\python\python39\site-packages (from kaggle) (2.28.1)
Requirement already satisfied: tqdm in c:\users\adrian\appdata\roaming\python\python39\site-packages (from kaggle) (4.64.1)
Collecting python-slugify
  Using cached python_slugify-8.0.4-py2.py3-none-any.whl (10 kB)
Requirement already satisfied: urllib3 in c:\users\adrian\appdata\roaming\python\python39\site-packages (from kaggle) (1.26.11)
Requirement already satisfied: bleach in c:\users\adrian\miniconda3\lib\site-packages (from kaggle) (4.1.0)
Requirement already satisfied: webencodings in c:\users\adrian\miniconda3\lib\site-packages (from bleach->kaggle) (0.5.1)
Requirement already satisfied: packaging in c:\users\adrian\appdata\roaming\python\python39\site-packages (from bleach->kaggle) (22.0)
Collecting text-unidecode>=1.3
  Using cached text_unidecode-1.3-py2.py3-none-any.whl (78 kB)
Requirement already satisfied: idna<4,>=2.5 in c:\users\adrian\appdata\roaming\python\python39\site-packages (from requests->kaggle) (2.10)
Requirement already satisfied: charset-normalizer<3,>=2 in c:\users\adrian\appdata\roaming\python\python39\site-packages (from requests->kaggle) (2.1.0)
Requirement already satisfied: colorama in c:\users\adrian\appdata\roaming\python\python39\site-packages (from tqdm->kaggle) (0.4.5)
Building wheels for collected packages: kaggle
  Building wheel for kaggle (setup.py): started
  Building wheel for kaggle (setup.py): finished with status 'done'
  Created wheel for kaggle: filename=kaggle-1.6.6-py3-none-any.whl size=111961 sha256=3aa19c7655c19d77b65c2542567b6e34a57813227f1f2df6d0fd84accad6824f
  Stored in directory: c:\users\adrian\appdata\local\pip\cache\wheels\46\aa\c3\b3e421522fb5acdd7c366a05c5fc80787615bdeed207e7f79b
Successfully built kaggle
Installing collected packages: text-unidecode, python-slugify, kaggle
Successfully installed kaggle-1.6.6 python-slugify-8.0.4 text-unidecode-1.3

!kaggle datasets download -d kamilpytlak/personal-key-indicators-of-heart-disease/

Downloading personal-key-indicators-of-heart-disease.zip to C:\Users\Adrian\Desktop\Semestr 1 (II ST)\ML\zadania

  0%|          | 0.00/21.4M [00:00<?, ?B/s]
  5%|4         | 1.00M/21.4M [00:00<00:11, 1.80MB/s]
  9%|9         | 2.00M/21.4M [00:00<00:05, 3.51MB/s]
 19%|#8        | 4.00M/21.4M [00:00<00:02, 7.25MB/s]
 33%|###2      | 7.00M/21.4M [00:00<00:01, 12.0MB/s]
 51%|#####1    | 11.0M/21.4M [00:01<00:00, 17.9MB/s]
 65%|######5   | 14.0M/21.4M [00:01<00:00, 20.7MB/s]
 79%|#######9  | 17.0M/21.4M [00:01<00:00, 20.6MB/s]
 93%|#########3| 20.0M/21.4M [00:01<00:00, 22.7MB/s]
100%|##########| 21.4M/21.4M [00:01<00:00, 14.9MB/s]

#!unzip -o personal-key-indicators-of-heart-disease.zip  # does not work on Windows, so I use the zipfile module instead

'unzip' is not recognized as an internal or external command,
operable program or batch file.

import zipfile
with zipfile.ZipFile("personal-key-indicators-of-heart-disease.zip", 'r') as zip_ref:
    zip_ref.extractall("dataset_extracted")
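Before extracting, it can help to check which CSV files the archive actually contains. The following is a minimal sketch; the helper names `list_csv_members` and `extract_if_needed` are hypothetical and not part of the original notebook.

```python
import zipfile
from pathlib import Path


def list_csv_members(archive_path):
    """Return the names of the CSV files stored in a zip archive."""
    with zipfile.ZipFile(archive_path, "r") as zip_ref:
        return [name for name in zip_ref.namelist() if name.lower().endswith(".csv")]


def extract_if_needed(archive_path, target_dir):
    """Extract the archive only when the target directory does not exist yet."""
    target = Path(target_dir)
    if not target.exists():
        with zipfile.ZipFile(archive_path, "r") as zip_ref:
            zip_ref.extractall(target)
    return target
```

Calling `list_csv_members("personal-key-indicators-of-heart-disease.zip")` would list the available subsets without unpacking anything.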

import pandas as pd
# The downloaded dataset contains several subsets; I deliberately open the one with NaNs to clean it manually for practice
df = pd.read_csv("dataset_extracted/2022/heart_2022_with_nans.csv")

Exploring the uncleaned dataset

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 445132 entries, 0 to 445131
Data columns (total 40 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   State                      445132 non-null  object 
 1   Sex                        445132 non-null  object 
 2   GeneralHealth              443934 non-null  object 
 3   PhysicalHealthDays         434205 non-null  float64
 4   MentalHealthDays           436065 non-null  float64
 5   LastCheckupTime            436824 non-null  object 
 6   PhysicalActivities         444039 non-null  object 
 7   SleepHours                 439679 non-null  float64
 8   RemovedTeeth               433772 non-null  object 
 9   HadHeartAttack             442067 non-null  object 
 10  HadAngina                  440727 non-null  object 
 11  HadStroke                  443575 non-null  object 
 12  HadAsthma                  443359 non-null  object 
 13  HadSkinCancer              441989 non-null  object 
 14  HadCOPD                    442913 non-null  object 
 15  HadDepressiveDisorder      442320 non-null  object 
 16  HadKidneyDisease           443206 non-null  object 
 17  HadArthritis               442499 non-null  object 
 18  HadDiabetes                444045 non-null  object 
 19  DeafOrHardOfHearing        424485 non-null  object 
 20  BlindOrVisionDifficulty    423568 non-null  object 
 21  DifficultyConcentrating    420892 non-null  object 
 22  DifficultyWalking          421120 non-null  object 
 23  DifficultyDressingBathing  421217 non-null  object 
 24  DifficultyErrands          419476 non-null  object 
 25  SmokerStatus               409670 non-null  object 
 26  ECigaretteUsage            409472 non-null  object 
 27  ChestScan                  389086 non-null  object 
 28  RaceEthnicityCategory      431075 non-null  object 
 29  AgeCategory                436053 non-null  object 
 30  HeightInMeters             416480 non-null  float64
 31  WeightInKilograms          403054 non-null  float64
 32  BMI                        396326 non-null  float64
 33  AlcoholDrinkers            398558 non-null  object 
 34  HIVTesting                 379005 non-null  object 
 35  FluVaxLast12               398011 non-null  object 
 36  PneumoVaxEver              368092 non-null  object 
 37  TetanusLast10Tdap          362616 non-null  object 
 38  HighRiskLastYear           394509 non-null  object 
 39  CovidPos                   394368 non-null  object 
dtypes: float64(6), object(34)
memory usage: 135.8+ MB

df.head()

	State	Sex	GeneralHealth	PhysicalHealthDays	MentalHealthDays	LastCheckupTime	PhysicalActivities	SleepHours	RemovedTeeth	HadHeartAttack	...	HeightInMeters	WeightInKilograms	BMI	AlcoholDrinkers	HIVTesting	FluVaxLast12	PneumoVaxEver	TetanusLast10Tdap	HighRiskLastYear	CovidPos
0	Alabama	Female	Very good	0.0	0.0	Within past year (anytime less than 12 months ...	No	8.0	NaN	No	...	NaN	NaN	NaN	No	No	Yes	No	Yes, received tetanus shot but not sure what type	No	No
1	Alabama	Female	Excellent	0.0	0.0	NaN	No	6.0	NaN	No	...	1.60	68.04	26.57	No	No	No	No	No, did not receive any tetanus shot in the pa...	No	No
2	Alabama	Female	Very good	2.0	3.0	Within past year (anytime less than 12 months ...	Yes	5.0	NaN	No	...	1.57	63.50	25.61	No	No	No	No	NaN	No	Yes
3	Alabama	Female	Excellent	0.0	0.0	Within past year (anytime less than 12 months ...	Yes	7.0	NaN	No	...	1.65	63.50	23.30	No	No	Yes	Yes	No, did not receive any tetanus shot in the pa...	No	No
4	Alabama	Female	Fair	2.0	0.0	Within past year (anytime less than 12 months ...	Yes	9.0	NaN	No	...	1.57	53.98	21.77	Yes	No	No	Yes	No, did not receive any tetanus shot in the pa...	No	No

5 rows × 40 columns

df.describe()

	PhysicalHealthDays	MentalHealthDays	SleepHours	HeightInMeters	WeightInKilograms	BMI
count	434205.000000	436065.000000	439679.000000	416480.000000	403054.000000	396326.000000
mean	4.347919	4.382649	7.022983	1.702691	83.074470	28.529842
std	8.688912	8.387475	1.502425	0.107177	21.448173	6.554889
min	0.000000	0.000000	1.000000	0.910000	22.680000	12.020000
25%	0.000000	0.000000	6.000000	1.630000	68.040000	24.130000
50%	0.000000	0.000000	7.000000	1.700000	80.740000	27.440000
75%	3.000000	5.000000	8.000000	1.780000	95.250000	31.750000
max	30.000000	30.000000	24.000000	2.410000	292.570000	99.640000

Only 6 columns are numeric at this point, so the summary omits statistics for the remaining 34 columns
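The text columns can still be summarized by asking `describe()` to skip the numeric ones. This is a small sketch on a toy frame (the values are illustrative, not taken from the dataset); on the real data the same call would be `df.describe(exclude=["number"])`.

```python
import pandas as pd

# Toy frame standing in for df; the values are illustrative only.
toy = pd.DataFrame({
    "Sex": ["Female", "Male", "Female", None],
    "GeneralHealth": ["Good", "Good", "Fair", "Poor"],
    "BMI": [26.57, 25.61, None, 23.30],
})

# Summarize only the non-numeric columns: count, unique, top (most frequent value), freq.
text_summary = toy.describe(exclude=["number"])
print(text_summary)
```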

The dataset is imbalanced: the variable we want to predict, HadHeartAttack, is negative ("No", encoded as 0 later on) in the vast majority of cases:

df["HadHeartAttack"].value_counts().plot(kind="pie")

<AxesSubplot:ylabel='HadHeartAttack'>

df["HadHeartAttack"].value_counts()

No     416959
Yes     25108
Name: HadHeartAttack, dtype: int64
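To put a number on the imbalance, the counts printed above can be turned into shares. The sketch below copies the two counts by hand; on the real frame, `df["HadHeartAttack"].value_counts(normalize=True)` should give the same shares directly.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({"No": 416959, "Yes": 25108}, name="HadHeartAttack")

# Share of each class among the non-null answers.
shares = counts / counts.sum()
print(shares.round(4))

# Imbalance ratio: how many "No" rows there are per "Yes" row.
ratio = counts["No"] / counts["Yes"]
print(round(ratio, 1))
```

Roughly 5.7% of the non-null answers are "Yes", which is why accuracy alone would be a misleading metric here.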

2. Splitting into subsets (train / dev / test, 8:1:1) and oversampling

from sklearn.model_selection import train_test_split
# The sklearn function has to be called twice because it only splits into two subsets
train, test_and_valid = train_test_split(df, test_size=0.2)  # 0.8 train, 0.2 test & valid

test, valid = train_test_split(test_and_valid, test_size=0.5)  # 0.1 test, 0.1 valid
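The split above is random and unseeded, so the class ratio can drift slightly between subsets and the result changes on every run. A possible variant, shown here on a toy frame, uses the `stratify` and `random_state` parameters of `train_test_split`. Dropping rows with a missing target first is an assumption of this sketch (stratification needs a complete label column); the notebook itself keeps those rows.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for df (10% positives); values are illustrative only.
toy = pd.DataFrame({
    "BMI": [20.0 + i * 0.1 for i in range(100)],
    "HadHeartAttack": ["Yes"] * 10 + ["No"] * 90,
})

# stratify requires a target without missing values, so drop such rows first.
toy = toy.dropna(subset=["HadHeartAttack"])

# First split: 80% train, 20% held out; stratify keeps the Yes/No ratio.
train_part, held_out = train_test_split(
    toy, test_size=0.2, stratify=toy["HadHeartAttack"], random_state=42
)
# Second split: halve the held-out part into test and validation sets.
test_part, valid_part = train_test_split(
    held_out, test_size=0.5, stratify=held_out["HadHeartAttack"], random_state=42
)
```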

train["HadHeartAttack"].value_counts()

No     333641
Yes     20042
Name: HadHeartAttack, dtype: int64

The training set is still imbalanced, so I apply simple oversampling by duplicating the minority class until the two classes are nearly equal in size

def oversample(dataset):
    num_true = len(dataset[dataset["HadHeartAttack"]=="Yes"])
    num_false = len(dataset[dataset["HadHeartAttack"]=="No"])
    num_oversampling_steps = num_false//num_true
    oversampled = dataset.copy()
    for x in range(num_oversampling_steps):
        oversampled = pd.concat([oversampled, dataset[dataset["HadHeartAttack"]=="Yes"]], ignore_index=True)
    return oversampled

train = oversample(train)
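With the training counts shown earlier, `num_false // num_true` is 333641 // 20042 = 16, so the minority class ends up with 17 × 20042 = 340714 rows against 333641 majority rows: close to balanced, with a slight overshoot. If an exact balance were wanted, one option is to draw the missing rows with replacement. This is only a sketch; `oversample_to_balance` and its `seed` parameter are hypothetical names, not part of the notebook.

```python
import pandas as pd


def oversample_to_balance(dataset, target="HadHeartAttack", seed=0):
    """Resample the minority class with replacement until both classes match."""
    positives = dataset[dataset[target] == "Yes"]
    negatives = dataset[dataset[target] == "No"]
    minority, majority = sorted([positives, negatives], key=len)
    # Draw exactly as many extra minority rows as are missing.
    extra = minority.sample(
        n=len(majority) - len(minority), replace=True, random_state=seed
    )
    # Rows with a missing target stay in the result untouched.
    return pd.concat([dataset, extra], ignore_index=True)
```

Like the original function, this should only be applied to the training set, so that the test and validation sets keep the real class distribution.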

train["HadHeartAttack"].value_counts().plot(kind="pie")

<AxesSubplot:ylabel='HadHeartAttack'>

test["HadHeartAttack"].value_counts().plot(kind="pie")

<AxesSubplot:ylabel='HadHeartAttack'>

valid["HadHeartAttack"].value_counts().plot(kind="pie")

<AxesSubplot:ylabel='HadHeartAttack'>

Proportions of smokers / non-smokers in the original dataset:

df["SmokerStatus"].value_counts().plot(kind="pie")

<AxesSubplot:ylabel='SmokerStatus'>

df["ECigaretteUsage"].value_counts().plot(kind="pie")

<AxesSubplot:ylabel='ECigaretteUsage'>

COVID statistics

df["CovidPos"].value_counts().plot(kind="pie")

<AxesSubplot:ylabel='CovidPos'>

Normalization, part 1: converting to numeric and categorical columns

I tried to convert columns that describe health status and similar features on a "poor/fair/good/excellent" scale into numbers in a sensible way, increasing with the positive aspect of the given health factor. I did the same for how often a person smoked. I converted some columns to the categorical type. I converted the sex column to a numeric one so that the model can use it later, although I had doubts about doing so in terms of political correctness.

df["Sex"].unique()

array(['Female', 'Male'], dtype=object)

df["GeneralHealth"].unique()

array(['Very good', 'Excellent', 'Fair', 'Poor', 'Good', nan],
      dtype=object)

health_map = {
    "Excellent": 5,
    "Very good": 4,
    "Good": 3,
    "Fair": 2,
    "Poor": 1
}
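As a quick illustration of how such a dictionary is applied, `Series.map` looks up every value and leaves missing answers as NaN. The sample answers below are made up for the example.

```python
import pandas as pd

health_map = {"Excellent": 5, "Very good": 4, "Good": 3, "Fair": 2, "Poor": 1}

# Made-up answers, including one missing value.
answers = pd.Series(["Very good", "Poor", None, "Excellent"])

# Series.map looks each value up in the dict; missing or unknown values become NaN.
encoded = answers.map(health_map)
print(encoded.tolist())
```

Because NaN forces a float dtype, the encoded column holds 4.0, 1.0 and 5.0 rather than integers.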

for col in df:
    print(f"{col}:")
    print(df[col].unique())

State:
['Alabama' 'Alaska' 'Arizona' 'Arkansas' 'California' 'Colorado'
 'Connecticut' 'Delaware' 'District of Columbia' 'Florida' 'Georgia'
 'Hawaii' 'Idaho' 'Illinois' 'Indiana' 'Iowa' 'Kansas' 'Kentucky'
 'Louisiana' 'Maine' 'Maryland' 'Massachusetts' 'Michigan' 'Minnesota'
 'Mississippi' 'Missouri' 'Montana' 'Nebraska' 'Nevada' 'New Hampshire'
 'New Jersey' 'New Mexico' 'New York' 'North Carolina' 'North Dakota'
 'Ohio' 'Oklahoma' 'Oregon' 'Pennsylvania' 'Rhode Island' 'South Carolina'
 'South Dakota' 'Tennessee' 'Texas' 'Utah' 'Vermont' 'Virginia'
 'Washington' 'West Virginia' 'Wisconsin' 'Wyoming' 'Guam' 'Puerto Rico'
 'Virgin Islands']
Sex:
['Female' 'Male']
GeneralHealth:
['Very good' 'Excellent' 'Fair' 'Poor' 'Good' nan]
PhysicalHealthDays:
[ 0.  2.  1.  8.  5. 30.  4. 23. 14. nan 15.  3. 10.  7. 25.  6. 21. 20.
 29. 16.  9. 27. 28. 12. 13. 11. 26. 17. 24. 19. 18. 22.]
MentalHealthDays:
[ 0.  3.  9.  5. 15. 20. 14. 10. 18.  1. nan  2. 30.  4.  6.  7. 25.  8.
 22. 29. 27. 21. 12. 28. 16. 13. 26. 17. 11. 23. 19. 24.]
LastCheckupTime:
['Within past year (anytime less than 12 months ago)' nan
 'Within past 2 years (1 year but less than 2 years ago)'
 'Within past 5 years (2 years but less than 5 years ago)'
 '5 or more years ago']
PhysicalActivities:
['No' 'Yes' nan]
SleepHours:
[ 8.  6.  5.  7.  9.  4. 10.  1. 12. nan 18.  3.  2. 11. 16. 15. 13. 14.
 20. 23. 17. 24. 22. 19. 21.]
RemovedTeeth:
[nan 'None of them' '1 to 5' '6 or more, but not all' 'All']
HadHeartAttack:
['No' 'Yes' nan]
HadAngina:
['No' 'Yes' nan]
HadStroke:
['No' 'Yes' nan]
HadAsthma:
['No' 'Yes' nan]
HadSkinCancer:
['No' 'Yes' nan]
HadCOPD:
['No' 'Yes' nan]
HadDepressiveDisorder:
['No' 'Yes' nan]
HadKidneyDisease:
['No' 'Yes' nan]
HadArthritis:
['No' 'Yes' nan]
HadDiabetes:
['Yes' 'No' 'No, pre-diabetes or borderline diabetes' nan
 'Yes, but only during pregnancy (female)']
DeafOrHardOfHearing:
['No' nan 'Yes']
BlindOrVisionDifficulty:
['No' 'Yes' nan]
DifficultyConcentrating:
['No' nan 'Yes']
DifficultyWalking:
['No' 'Yes' nan]
DifficultyDressingBathing:
['No' nan 'Yes']
DifficultyErrands:
['No' 'Yes' nan]
SmokerStatus:
['Never smoked' 'Current smoker - now smokes some days' 'Former smoker'
 nan 'Current smoker - now smokes every day']
ECigaretteUsage:
['Not at all (right now)' 'Never used e-cigarettes in my entire life' nan
 'Use them every day' 'Use them some days']
ChestScan:
['No' 'Yes' nan]
RaceEthnicityCategory:
['White only, Non-Hispanic' 'Black only, Non-Hispanic'
 'Other race only, Non-Hispanic' 'Multiracial, Non-Hispanic' nan
 'Hispanic']
AgeCategory:
['Age 80 or older' 'Age 55 to 59' nan 'Age 40 to 44' 'Age 75 to 79'
 'Age 70 to 74' 'Age 65 to 69' 'Age 60 to 64' 'Age 50 to 54'
 'Age 45 to 49' 'Age 35 to 39' 'Age 25 to 29' 'Age 30 to 34'
 'Age 18 to 24']
HeightInMeters:
[ nan 1.6  1.57 1.65 1.8  1.63 1.7  1.68 1.73 1.55 1.93 1.88 1.78 1.85
 1.75 1.52 1.83 1.91 1.96 1.5  1.45 1.42 1.24 1.47 1.22 1.98 2.03 2.01
 1.3  1.4  1.35 1.82 1.67 1.76 2.11 1.37 1.64 1.71 2.16 2.26 0.91 2.06
 1.14 1.74 1.51 1.53 1.69 1.56 1.84 1.9  1.54 1.72 1.87 1.61 1.49 1.59
 1.58 1.62 1.79 1.46 1.89 2.13 0.99 2.08 2.21 1.32 2.18 1.77 2.36 1.25
 1.66 1.86 1.95 1.19 1.05 1.48 1.03 1.18 1.81 1.38 1.44 1.07 1.27 1.2
 1.17 1.04 2.24 1.1  1.43 1.92 2.05 1.12 2.41 2.34 0.97 1.06 1.15 2.29
 1.16 1.09 0.92 2.07 1.   1.08 1.02 1.33 2.   2.02 1.94 0.95]
WeightInKilograms:
[   nan  68.04  63.5   53.98  84.82  62.6   73.48  81.65  74.84  59.42
  85.28 106.59  71.21  64.41  61.23  90.72  65.77  66.22  80.29  86.18
  47.63 107.05  57.15 105.23  77.11  56.7   79.38 113.4  102.06  59.87
 104.33  53.52  61.69 136.08  34.47  99.79 127.01  78.93  95.25  58.97
  92.08  72.57  83.91  49.9  117.93  71.67 102.97  62.14  83.46  54.43
  94.35  60.78 117.03  65.32  76.66  88.45  89.81  74.39  68.95  79.83
 108.41  90.26  55.79  91.63  47.17  78.02  50.8   91.17  84.37 145.15
  93.89 122.47  48.99  73.94  88.9   80.74  81.19 158.76  97.52  51.71
  82.55  76.2   68.49  75.3   70.31  63.05  60.33 115.67  86.64 108.86
  92.53 124.74  43.09  58.51  63.96  92.99  44.45 128.82  98.88  45.36
 110.68  46.72  58.06  73.03  95.71 131.09  78.47  69.4   85.73  67.59
 103.87 120.2   88.    54.88 111.58  52.16  77.56 126.55  94.8  123.83
  89.36  75.75  69.85 112.49  82.1  106.14  57.61  70.76 148.78  96.16
  67.13  48.08 163.29 109.77 100.7  142.88  64.86 111.13 121.11  55.34
 101.6   93.44 117.48 120.66  66.68  44.91 132.   107.5  107.95  36.29
 103.42  87.09  83.01  56.25  96.62 134.26  97.07  34.93  99.34  72.12
  49.44 122.02  98.43 129.73 181.44  52.62 121.56 110.22  48.53 140.61
 156.49 116.57  87.54  44.   114.31  31.75  97.98 101.15 112.04 100.24
 113.85 154.22 118.39 133.81 149.69  41.73 119.75 138.35 151.95 129.27
 131.54 104.78 132.45 102.51 116.12  40.37 105.69 136.98 195.04  53.07
 132.9  124.28 112.94 114.76  45.81 119.29 167.83  51.26 172.37 162.39
  46.27 127.91 123.38  38.56 130.63 143.34 115.21 166.92 135.17 109.32
 135.62 204.12 127.46 118.84 139.25 126.1  122.92 151.5  133.36  42.64
  50.35  80.   190.51  37.19 147.87  35.38 144.24 149.23  37.65  86.
 147.42 281.   165.56 162.84 155.58  70.   137.89 189.6  206.38 148.32
  42.18 153.77  38.1   90.   176.9  191.87 249.48  67.    95.    82.
 170.1   62.    40.82  53.   139.71 130.18 100.   165.11  64.    43.54
  24.   134.72 141.52 125.19  75.    60.    34.02 164.65  30.84 250.
  58.    76.    73.   112.    74.    55.   200.    54.    66.    72.
 152.41  39.46 220.    41.28 168.28 188.24  59.    46.   265.   238.14
 168.74 145.   190.    93.   159.66  78.    50.   185.07  91.   104.
 165.   183.7   33.57 161.93  68.   125.65 134.   130.    32.21 143.79
  69.   179.17  63.   105.   210.92  65.    32.   292.57 280.    85.
 174.63  56.   128.37  87.    39.92  83.   169.64 156.04 177.   121.
 151.05  89.   146.96 146.06  98.   166.47  36.74 171.46 227.25  29.48
 190.06 161.03  35.83 226.8  175.09 138.8  240.4  158.3  170.55  61.
 137.44 145.6  141.07 155.13  52.   120.    57.    77.    27.22  25.4
 240.    96.    47.   115.    41.    45.   170.   150.59 272.16  26.31
  48.    39.01 236.    92.   197.31 156.    84.    94.    29.03  49.
  79.   157.85 192.78 255.   108.   185.   222.26 229.97 180.    81.
  24.95  71.    26.   107.   101.   208.65 140.   175.   111.   110.
 141.97  22.68 284.86 136.53 210.   103.   185.97 140.16 146.51  24.49
  25.85 150.   102.   229.52  23.59 125.   163.    38.   135.   176.45
 185.52 152.86 232.69 124.   192.32 186.88 118.   160.12 160.   193.68
 201.85 144.7  184.16 142.43 169.   166.01  32.66 180.53 196.41  51.
  40.   171.91 195.95  33.11 153.31 159.21 164.2  219.99 215.46 182.34
  30.   160.57 173.27 158.   213.19 276.24 199.58 175.99 235.87 217.72
 200.03 230.88 146.    24.04 178.72 150.14 157.4  163.75 191.42 174.18
  28.58  97.   256.28 205.48 161.48 178.26 179.62 205.02 254.01 154.68
 209.56 201.4  234.96 177.81 200.49 231.79 227.7  273.52 189.15 173.73
 183.25 167.38 211.83 223.62 228.61  30.39 197.77 184.61 250.38 181.89
  31.3  290.3  285.   113.   242.67 231.33 180.08 202.76 176.   188.69
 206.84 164.   156.94 114.   122.   222.   137.   166.   180.98 272.
 172.82 274.42 234.51 199.13 244.94 203.21  23.13 265.35 198.22 263.08
 216.82 154.   169.19 239.04 177.35 210.47 224.98 117.    37.   126.
 273.06 203.66 252.2  238.59 194.59 187.33 221.35 162.   224.53  23.
 223.17 187.79 212.73 152.   233.6  193.23 205.   229.06 230.   247.21
  99.    28.12 230.42 175.54 205.93 171.    26.76 212.28 217.   280.32
 281.68 248.57 195.    42.   258.55 215.   116.    28.   123.   186.43
 228.16 119.   219.09 214.55 278.96 182.8  138.   217.27 246.3  189.  ]
BMI:
[  nan 26.57 25.61 ... 13.51 28.39 48.63]
AlcoholDrinkers:
['No' 'Yes' nan]
HIVTesting:
['No' 'Yes' nan]
FluVaxLast12:
['Yes' 'No' nan]
PneumoVaxEver:
['No' 'Yes' nan]
TetanusLast10Tdap:
['Yes, received tetanus shot but not sure what type'
 'No, did not receive any tetanus shot in the past 10 years' nan
 'Yes, received Tdap' 'Yes, received tetanus shot, but not Tdap']
HighRiskLastYear:
['No' nan 'Yes']
CovidPos:
['No' 'Yes' nan
 'Tested positive using home test without a health professional']

from collections import defaultdict
def normalize_dataset(dataset):
    dataset["GeneralHealth"] = dataset["GeneralHealth"].map(defaultdict(lambda: float('NaN'), health_map), na_action='ignore')
    dataset["Sex"] = dataset["Sex"].map({"Female":0,"Male":1}).astype(float)  # Convert the text column to a numeric one
    dataset.rename(columns ={"Sex":"Male"},inplace=True)
    dataset["State"] = dataset["State"].astype('category')
    dataset["PhysicalHealthDays"] = dataset["PhysicalHealthDays"].astype(float)
    dataset["MentalHealthDays"] = dataset["MentalHealthDays"].astype(float)
    dataset["LastCheckupTime"] = dataset["LastCheckupTime"].fillna("Unknown").astype('category')  # Later I fill NaNs with the median, which does not work on categorical columns, so missing values are filled here before the conversion
    dataset["PhysicalActivities"]= dataset["PhysicalActivities"].map({"No":0,"Yes":1})
    dataset["SleepHours"] = dataset["SleepHours"].astype(float)
    dataset["RemovedTeeth"] = dataset["RemovedTeeth"].map(defaultdict(lambda: float('NaN'), {"None of them":0,"1 to 5":1, "6 or more, but not all":2, "All":3}), na_action='ignore')
    dataset["HadHeartAttack"]= dataset["HadHeartAttack"].map({"No":0,"Yes":1})
    dataset["HadAngina"]= dataset["HadAngina"].map({"No":0,"Yes":1})
    dataset["HadStroke"]= dataset["HadStroke"].map({"No":0,"Yes":1})
    dataset["HadAsthma"]= dataset["HadAsthma"].map({"No":0,"Yes":1})
    dataset["HadSkinCancer"]= dataset["HadSkinCancer"].map({"No":0,"Yes":1})
    dataset["HadCOPD"]= dataset["HadCOPD"].map({"No":0,"Yes":1})
    dataset["HadDepressiveDisorder"]= dataset["HadDepressiveDisorder"].map({"No":0,"Yes":1})
    dataset["HadKidneyDisease"]= dataset["HadKidneyDisease"].map({"No":0,"Yes":1})
    dataset["HadArthritis"]= dataset["HadArthritis"].map({"No":0,"Yes":1})
    dataset["HadDiabetes"]= dataset["HadDiabetes"].map({"No":0,"Yes, but only during pregnancy (female)":1,"No, pre-diabetes or borderline diabetes":2,"Yes":3})

    dataset["DeafOrHardOfHearing"]= dataset["DeafOrHardOfHearing"].map({"No":0,"Yes":1})
    dataset["BlindOrVisionDifficulty"]= dataset["BlindOrVisionDifficulty"].map({"No":0,"Yes":1})
    dataset["DifficultyConcentrating"]= dataset["DifficultyConcentrating"].map({"No":0,"Yes":1})
    dataset["DifficultyWalking"]= dataset["DifficultyWalking"].map({"No":0,"Yes":1})
    dataset["DifficultyDressingBathing"]= dataset["DifficultyDressingBathing"].map({"No":0,"Yes":1})
    dataset["DifficultyErrands"]= dataset["DifficultyErrands"].map({"No":0,"Yes":1})
    dataset["SmokerStatus"]= dataset["SmokerStatus"].map({"Never smoked":0,"Current smoker - now smokes some days":1,"Former smoker":2,"Current smoker - now smokes every day":3})
    dataset["ECigaretteUsage"]= dataset["ECigaretteUsage"].map({"Never used e-cigarettes in my entire life":0,"Not at all (right now)":1,"Use them some days":2,"Use them every day":3})
    dataset["ChestScan"]= dataset["ChestScan"].map({"No":0,"Yes":1})
    dataset["RaceEthnicityCategory"] = dataset["RaceEthnicityCategory"].fillna("Unknown").astype('category')
    dataset["AgeCategory"] = dataset["AgeCategory"].fillna("Unknown").astype('category')
    dataset["HeightInMeters"] = dataset["HeightInMeters"].astype(float)
    dataset["WeightInKilograms"] = dataset["WeightInKilograms"].astype(float)
    dataset["BMI"] = dataset["BMI"].astype(float)
    dataset["AlcoholDrinkers"]= dataset["AlcoholDrinkers"].map({"No":0,"Yes":1})
    dataset["HIVTesting"]= dataset["HIVTesting"].map({"No":0,"Yes":1})
    dataset["FluVaxLast12"]= dataset["FluVaxLast12"].map({"No":0,"Yes":1})
    dataset["PneumoVaxEver"]= dataset["PneumoVaxEver"].map({"No":0,"Yes":1})
    # "Yes, ..." answers map to 1.0 and "No, ..." answers to 0.0; anything else becomes NaN
    dataset["TetanusLast10Tdap"]= dataset["TetanusLast10Tdap"].apply(lambda x: float('NaN') if type(x)!=str else 1.0 if 'Yes,' in x else 0.0 if 'No,' in x else float('NaN'))
    dataset["HighRiskLastYear"]= dataset["HighRiskLastYear"].map({"No":0,"Yes":1})
    dataset["CovidPos"]= dataset["CovidPos"].map({"No":0,"Yes":1})  # the home-test answer is not in the map, so it becomes NaN
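Most lines of `normalize_dataset` repeat the same "No"/"Yes" mapping. One way to shorten it would be a loop over column names, sketched below on a toy frame; `encode_yes_no` and `YES_NO` are hypothetical names introduced only for this illustration.

```python
import pandas as pd

YES_NO = {"No": 0, "Yes": 1}


def encode_yes_no(dataset, columns):
    """Map "No"/"Yes" to 0/1 in the given columns; other values become NaN."""
    for col in columns:
        dataset[col] = dataset[col].map(YES_NO)
    return dataset


# Toy frame with two of the binary columns; values are illustrative only.
toy = pd.DataFrame({
    "HadAngina": ["No", "Yes", None],
    "HadStroke": ["Yes", "No", "No"],
})
toy = encode_yes_no(toy, ["HadAngina", "HadStroke"])
```

Columns with more than two answers, such as HadDiabetes or SmokerStatus, would still need their own dictionaries.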

The test set before the data type conversion

test.head()

	State	Sex	GeneralHealth	PhysicalHealthDays	MentalHealthDays	LastCheckupTime	PhysicalActivities	SleepHours	RemovedTeeth	HadHeartAttack	...	HeightInMeters	WeightInKilograms	BMI	AlcoholDrinkers	HIVTesting	FluVaxLast12	PneumoVaxEver	TetanusLast10Tdap	HighRiskLastYear	CovidPos
339824	South Dakota	Female	Good	3.0	21.0	Within past year (anytime less than 12 months ...	Yes	8.0	6 or more, but not all	No	...	1.60	52.16	20.37	No	Yes	Yes	Yes	Yes, received Tdap	No	No
127927	Kansas	Female	Good	30.0	0.0	Within past year (anytime less than 12 months ...	Yes	10.0	1 to 5	No	...	1.68	97.52	34.70	No	No	Yes	Yes	NaN	No	No
362523	Utah	Male	Excellent	0.0	0.0	Within past year (anytime less than 12 months ...	Yes	7.0	1 to 5	No	...	1.83	113.85	34.04	No	No	No	No	Yes, received Tdap	No	Yes
183687	Michigan	Male	Good	0.0	7.0	Within past year (anytime less than 12 months ...	Yes	8.0	None of them	No	...	1.78	83.91	26.54	Yes	No	Yes	Yes	Yes, received Tdap	No	No
191905	Michigan	Female	Very good	0.0	0.0	Within past year (anytime less than 12 months ...	Yes	7.0	None of them	No	...	1.57	68.04	27.44	Yes	Yes	No	Yes	Yes, received tetanus shot but not sure what type	No	No

5 rows × 40 columns

The test set after the data type conversion

normalize_dataset(test)
test.head()

	State	Male	GeneralHealth	PhysicalHealthDays	MentalHealthDays	LastCheckupTime	PhysicalActivities	SleepHours	RemovedTeeth	...	HeightInMeters	WeightInKilograms	BMI	AlcoholDrinkers	HIVTesting	FluVaxLast12	PneumoVaxEver	TetanusLast10Tdap	CovidPos
339824	South Dakota	0.0	3.0	3.0	21.0	Within past year (anytime less than 12 months ...	1.0	8.0	2.0	...	1.60	52.16	20.37	0.0	1.0	1.0	1.0	1.0	0.0
127927	Kansas	0.0	3.0	30.0	0.0	Within past year (anytime less than 12 months ...	1.0	10.0	1.0	...	1.68	97.52	34.70	0.0	0.0	1.0	1.0	NaN	0.0
362523	Utah	1.0	5.0	0.0	0.0	Within past year (anytime less than 12 months ...	1.0	7.0	1.0	...	1.83	113.85	34.04	0.0	0.0	0.0	0.0	1.0	1.0
183687	Michigan	1.0	3.0	0.0	7.0	Within past year (anytime less than 12 months ...	1.0	8.0	0.0	...	1.78	83.91	26.54	1.0	0.0	1.0	1.0	1.0	0.0
191905	Michigan	0.0	4.0	0.0	0.0	Within past year (anytime less than 12 months ...	1.0	7.0	0.0	...	1.57	68.04	27.44	1.0	1.0	0.0	1.0	1.0	0.0

5 rows × 40 columns

test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 44513 entries, 339824 to 52161
Data columns (total 40 columns):
 #   Column                     Non-Null Count  Dtype   
---  ------                     --------------  -----   
 0   State                      44513 non-null  category
 1   Male                       44513 non-null  float64 
 2   GeneralHealth              44393 non-null  float64 
 3   PhysicalHealthDays         43469 non-null  float64 
 4   MentalHealthDays           43622 non-null  float64 
 5   LastCheckupTime            44513 non-null  category
 6   PhysicalActivities         44408 non-null  float64 
 7   SleepHours                 44008 non-null  float64 
 8   RemovedTeeth               43413 non-null  float64 
 9   HadHeartAttack             44182 non-null  float64 
 10  HadAngina                  44074 non-null  float64 
 11  HadStroke                  44368 non-null  float64 
 12  HadAsthma                  44339 non-null  float64 
 13  HadSkinCancer              44184 non-null  float64 
 14  HadCOPD                    44299 non-null  float64 
 15  HadDepressiveDisorder      44218 non-null  float64 
 16  HadKidneyDisease           44320 non-null  float64 
 17  HadArthritis               44243 non-null  float64 
 18  HadDiabetes                44411 non-null  float64 
 19  DeafOrHardOfHearing        42485 non-null  float64 
 20  BlindOrVisionDifficulty    42387 non-null  float64 
 21  DifficultyConcentrating    42169 non-null  float64 
 22  DifficultyWalking          42172 non-null  float64 
 23  DifficultyDressingBathing  42182 non-null  float64 
 24  DifficultyErrands          41999 non-null  float64 
 25  SmokerStatus               41005 non-null  float64 
 26  ECigaretteUsage            41003 non-null  float64 
 27  ChestScan                  38958 non-null  float64 
 28  RaceEthnicityCategory      44513 non-null  category
 29  AgeCategory                44513 non-null  category
 30  HeightInMeters             41714 non-null  float64 
 31  WeightInKilograms          40397 non-null  float64 
 32  BMI                        39724 non-null  float64 
 33  AlcoholDrinkers            39956 non-null  float64 
 34  HIVTesting                 38018 non-null  float64 
 35  FluVaxLast12               39886 non-null  float64 
 36  PneumoVaxEver              36860 non-null  float64 
 37  TetanusLast10Tdap          36315 non-null  float64 
 38  HighRiskLastYear           39538 non-null  float64 
 39  CovidPos                   38114 non-null  float64 
dtypes: category(4), float64(36)
memory usage: 12.7 MB

normalize_dataset(train)
normalize_dataset(valid)

Statistics for the subsets after conversion to numeric columns

The 50th percentile is the median.
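A tiny check of that statement, using made-up sleep-hour values: the "50%" row of `describe()` and `median()` should return the same number.

```python
import pandas as pd

# Made-up values; only the relationship between the two calls matters here.
values = pd.Series([1.0, 6.0, 7.0, 8.0, 24.0], name="SleepHours")

# describe() reports quartiles; its "50%" row is the same number as median().
summary = values.describe()
assert summary["50%"] == values.median()
print(summary["50%"])
```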

train.describe()

	Male	GeneralHealth	PhysicalHealthDays	MentalHealthDays	PhysicalActivities	SleepHours	RemovedTeeth	HadHeartAttack	HadAngina	HadStroke	...	HeightInMeters	WeightInKilograms	BMI	AlcoholDrinkers	HIVTesting	FluVaxLast12	PneumoVaxEver	TetanusLast10Tdap	HighRiskLastYear	CovidPos
count	676777.000000	674433.000000	655630.000000	660417.000000	674814.000000	665609.000000	654359.000000	674355.000000	657726.000000	672927.000000	...	637313.000000	619546.000000	611024.000000	606636.000000	572562.000000	605920.000000	570114.000000	554467.0	600540.000000	585150.000000
mean	0.538139	3.055519	6.737547	4.863972	0.690146	7.032336	0.983081	0.505244	0.264549	0.117193	...	1.707193	84.657015	28.917363	0.456819	0.325787	0.569879	0.527672	1.0	0.035087	0.273055
std	0.498544	1.137862	10.713287	9.115863	0.462434	1.726387	1.019679	0.499973	0.441093	0.321650	...	0.108002	21.753692	6.607455	0.498132	0.468668	0.495093	0.499234	0.0	0.183999	0.445529
min	0.000000	1.000000	0.000000	0.000000	0.000000	1.000000	0.000000	0.000000	0.000000	0.000000	...	0.910000	22.680000	12.050000	0.000000	0.000000	0.000000	0.000000	1.0	0.000000	0.000000
25%	0.000000	2.000000	0.000000	0.000000	0.000000	6.000000	0.000000	0.000000	0.000000	0.000000	...	1.630000	69.400000	24.410000	0.000000	0.000000	0.000000	0.000000	1.0	0.000000	0.000000
50%	1.000000	3.000000	0.000000	0.000000	1.000000	7.000000	1.000000	1.000000	0.000000	0.000000	...	1.700000	81.650000	27.890000	0.000000	0.000000	1.000000	1.000000	1.0	0.000000	0.000000
75%	1.000000	4.000000	10.000000	5.000000	1.000000	8.000000	2.000000	1.000000	1.000000	0.000000	...	1.780000	96.160000	32.260000	1.000000	1.000000	1.000000	1.000000	1.0	0.000000	1.000000
max	1.000000	5.000000	30.000000	30.000000	1.000000	24.000000	3.000000	1.000000	1.000000	1.000000	...	2.410000	292.570000	97.650000	1.000000	1.000000	1.000000	1.000000	1.0	1.000000	1.000000

8 rows × 36 columns

test.describe()

	Male	GeneralHealth	PhysicalHealthDays	MentalHealthDays	PhysicalActivities	SleepHours	RemovedTeeth	HadHeartAttack	HadAngina	HadStroke	...	HeightInMeters	WeightInKilograms	BMI	AlcoholDrinkers	HIVTesting	FluVaxLast12	PneumoVaxEver	TetanusLast10Tdap	HighRiskLastYear	CovidPos
count	44513.000000	44393.000000	43469.000000	43622.000000	44408.000000	44008.000000	43413.000000	44182.000000	44074.000000	44368.000000	...	41714.000000	40397.000000	39724.000000	39956.000000	38018.000000	39886.000000	36860.000000	36315.0	39538.000000	38114.000000
mean	0.471593	3.441511	4.275001	4.298221	0.760021	7.036584	0.685302	0.057241	0.061056	0.042869	...	1.703194	83.021746	28.512326	0.529532	0.342233	0.527002	0.413592	1.0	0.043427	0.289500
std	0.499198	1.050924	8.588663	8.299250	0.427075	1.512667	0.884912	0.232304	0.239436	0.202563	...	0.107438	21.551394	6.596149	0.499133	0.474463	0.499277	0.492484	0.0	0.203818	0.453536
min	0.000000	1.000000	0.000000	0.000000	0.000000	1.000000	0.000000	0.000000	0.000000	0.000000	...	0.910000	22.680000	12.020000	0.000000	0.000000	0.000000	0.000000	1.0	0.000000	0.000000
25%	0.000000	3.000000	0.000000	0.000000	1.000000	6.000000	0.000000	0.000000	0.000000	0.000000	...	1.630000	68.040000	24.030000	0.000000	0.000000	0.000000	0.000000	1.0	0.000000	0.000000
50%	0.000000	3.000000	0.000000	0.000000	1.000000	7.000000	0.000000	0.000000	0.000000	0.000000	...	1.700000	80.740000	27.410000	1.000000	0.000000	1.000000	0.000000	1.0	0.000000	0.000000
75%	1.000000	4.000000	3.000000	4.000000	1.000000	8.000000	1.000000	0.000000	0.000000	0.000000	...	1.780000	95.250000	31.650000	1.000000	1.000000	1.000000	1.000000	1.0	0.000000	1.000000
max	1.000000	5.000000	30.000000	30.000000	1.000000	24.000000	3.000000	1.000000	1.000000	1.000000	...	2.340000	290.300000	97.650000	1.000000	1.000000	1.000000	1.000000	1.0	1.000000	1.000000

8 rows × 36 columns

valid.describe()

	Male	GeneralHealth	PhysicalHealthDays	MentalHealthDays	PhysicalActivities	SleepHours	RemovedTeeth	HadHeartAttack	HadAngina	HadStroke	...	HeightInMeters	WeightInKilograms	BMI	AlcoholDrinkers	HIVTesting	FluVaxLast12	PneumoVaxEver	TetanusLast10Tdap	HighRiskLastYear	CovidPos
count	44514.000000	44388.000000	43458.000000	43578.000000	44401.000000	43966.000000	43360.000000	44202.000000	44031.000000	44344.000000	...	41677.000000	40327.000000	39626.000000	39950.000000	38041.000000	39885.000000	36926.000000	36250.0	39535.000000	38212.000000
mean	0.469043	3.434554	4.355470	4.379022	0.758361	7.013010	0.684732	0.057396	0.058936	0.042463	...	1.702146	82.981070	28.521370	0.527910	0.342525	0.525285	0.412771	1.0	0.044846	0.291872
std	0.499046	1.051996	8.718506	8.383576	0.428081	1.491967	0.882396	0.232600	0.235507	0.201646	...	0.106978	21.512676	6.622255	0.499227	0.474560	0.499367	0.492339	0.0	0.206969	0.454630
min	0.000000	1.000000	0.000000	0.000000	0.000000	1.000000	0.000000	0.000000	0.000000	0.000000	...	0.910000	22.680000	12.160000	0.000000	0.000000	0.000000	0.000000	1.0	0.000000	0.000000
25%	0.000000	3.000000	0.000000	0.000000	1.000000	6.000000	0.000000	0.000000	0.000000	0.000000	...	1.630000	68.040000	24.030000	0.000000	0.000000	0.000000	0.000000	1.0	0.000000	0.000000
50%	0.000000	3.000000	0.000000	0.000000	1.000000	7.000000	0.000000	0.000000	0.000000	0.000000	...	1.700000	79.830000	27.400000	1.000000	0.000000	1.000000	0.000000	1.0	0.000000	0.000000
75%	1.000000	4.000000	3.000000	5.000000	1.000000	8.000000	1.000000	0.000000	0.000000	0.000000	...	1.780000	95.250000	31.750000	1.000000	1.000000	1.000000	1.000000	1.0	0.000000	1.000000
max	1.000000	5.000000	30.000000	30.000000	1.000000	24.000000	3.000000	1.000000	1.000000	1.000000	...	2.360000	263.080000	99.640000	1.000000	1.000000	1.000000	1.000000	1.0	1.000000	1.000000

8 rows × 36 columns

There appears to be a correlation between body mass and heart attacks:

import seaborn as sns
sns.set_theme()
g = sns.catplot(
    data=train, kind="bar",
    x="GeneralHealth", y="WeightInKilograms", hue="HadHeartAttack",
    errorbar="sd", palette="dark", alpha=.6, height=6
)
g.despine(left=True)
g.set_axis_labels("General health index", "Body mass (kg)")
g.legend.set_title("Had heart attack")
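The impression from the plot can be checked numerically. The sketch below is only an illustration: `toy` is a hypothetical frame with made-up values standing in for `train`, and the column names are taken from the notebook. It compares the mean body mass in each outcome group and computes the Pearson correlation, which for a binary column is the point-biserial correlation.

```python
import pandas as pd

# Hypothetical toy frame standing in for `train`; the values are made up.
toy = pd.DataFrame({
    "WeightInKilograms": [60.0, 72.5, 81.0, 95.0, 110.0, 68.0],
    "HadHeartAttack":    [0.0,  0.0,  1.0,  1.0,  1.0,   0.0],
})

# Mean body mass in each outcome group.
mean_weight = toy.groupby("HadHeartAttack")["WeightInKilograms"].mean()

# Pearson correlation between a continuous and a binary column
# (equivalent to the point-biserial correlation).
corr = toy["WeightInKilograms"].corr(toy["HadHeartAttack"])

print(mean_weight)
print(f"correlation: {corr:.3f}")
```

On the real data the same two lines would be run on `train` instead of `toy`.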

Smokers had heart attacks more often:

valid.groupby('SmokerStatus', as_index=False)['HadHeartAttack'].mean()

	SmokerStatus	HadHeartAttack
0	0.0	0.037883
1	1.0	0.072598
2	2.0	0.088887
3	3.0	0.090192

People with a worse "GeneralHealth" score in this dataset had heart attacks more often:

valid.groupby('GeneralHealth', as_index=False)['HadHeartAttack'].mean()

	GeneralHealth	HadHeartAttack
0	1.0	0.228411
1	2.0	0.129270
2	3.0	0.056693
3	4.0	0.027336
4	5.0	0.014743

valid.pivot_table('HadHeartAttack',index='GeneralHealth', columns='SmokerStatus')

SmokerStatus	0.0	1.0	2.0	3.0
GeneralHealth
1.0	0.194640	0.310680	0.257100	0.222222
2.0	0.090772	0.146429	0.184443	0.155059
3.0	0.039989	0.031068	0.091645	0.060469
4.0	0.021611	0.032070	0.035265	0.048292
5.0	0.011078	0.012579	0.026298	0.018315
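Some cells in a table like the one above may rest on few respondents, so it can help to look at group sizes next to the rates. This is a minimal sketch on a hypothetical toy frame (made-up values, column names from the notebook) that builds one pivot table of means and one of counts.

```python
import pandas as pd

# Hypothetical toy frame standing in for `valid`; the values are made up.
toy = pd.DataFrame({
    "GeneralHealth":  [1.0, 1.0, 1.0, 5.0, 5.0, 5.0, 5.0],
    "SmokerStatus":   [0.0, 3.0, 3.0, 0.0, 0.0, 3.0, 3.0],
    "HadHeartAttack": [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0],
})

# Heart attack rate per (GeneralHealth, SmokerStatus) cell.
rates = toy.pivot_table(
    values="HadHeartAttack", index="GeneralHealth",
    columns="SmokerStatus", aggfunc="mean",
)

# Number of rows behind each rate.
counts = toy.pivot_table(
    values="HadHeartAttack", index="GeneralHealth",
    columns="SmokerStatus", aggfunc="count",
)

print(rates)
print(counts)
```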

Normalization part 2 - Scaling numeric columns to 0-1

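from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
def scale_float_columns(dataset):
    # Scale every float64 column of `dataset` to the 0-1 range, in place.
    # fit_transform refits the scaler on each call, so every dataset is
    # scaled using its own minimum and maximum.
    numerical_columns = list(dataset.select_dtypes(include=['float64']).columns)
    dataset[numerical_columns] = scaler.fit_transform(dataset[numerical_columns])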

test.head()

	State	Male	GeneralHealth	PhysicalHealthDays	MentalHealthDays	LastCheckupTime	PhysicalActivities	SleepHours	RemovedTeeth	...	HeightInMeters	WeightInKilograms	BMI	AlcoholDrinkers	HIVTesting	FluVaxLast12	PneumoVaxEver	TetanusLast10Tdap	CovidPos
339824	South Dakota	0.0	3.0	3.0	21.0	Within past year (anytime less than 12 months ...	1.0	8.0	2.0	...	1.60	52.16	20.37	0.0	1.0	1.0	1.0	1.0	0.0
127927	Kansas	0.0	3.0	30.0	0.0	Within past year (anytime less than 12 months ...	1.0	10.0	1.0	...	1.68	97.52	34.70	0.0	0.0	1.0	1.0	NaN	0.0
362523	Utah	1.0	5.0	0.0	0.0	Within past year (anytime less than 12 months ...	1.0	7.0	1.0	...	1.83	113.85	34.04	0.0	0.0	0.0	0.0	1.0	1.0
183687	Michigan	1.0	3.0	0.0	7.0	Within past year (anytime less than 12 months ...	1.0	8.0	0.0	...	1.78	83.91	26.54	1.0	0.0	1.0	1.0	1.0	0.0
191905	Michigan	0.0	4.0	0.0	0.0	Within past year (anytime less than 12 months ...	1.0	7.0	0.0	...	1.57	68.04	27.44	1.0	1.0	0.0	1.0	1.0	0.0

5 rows × 40 columns

scale_float_columns(test)
scale_float_columns(train)
scale_float_columns(valid)
test.head()

	State	Male	GeneralHealth	PhysicalHealthDays	MentalHealthDays	LastCheckupTime	PhysicalActivities	SleepHours	RemovedTeeth	...	HeightInMeters	WeightInKilograms	BMI	AlcoholDrinkers	HIVTesting	FluVaxLast12	PneumoVaxEver	TetanusLast10Tdap	CovidPos
339824	South Dakota	0.0	0.50	0.1	0.700000	Within past year (anytime less than 12 months ...	1.0	0.304348	0.666667	...	0.482517	0.110156	0.097513	0.0	1.0	1.0	1.0	0.0	0.0
127927	Kansas	0.0	0.50	1.0	0.000000	Within past year (anytime less than 12 months ...	1.0	0.391304	0.333333	...	0.538462	0.279650	0.264860	0.0	0.0	1.0	1.0	NaN	0.0
362523	Utah	1.0	1.00	0.0	0.000000	Within past year (anytime less than 12 months ...	1.0	0.260870	0.333333	...	0.643357	0.340670	0.257153	0.0	0.0	0.0	0.0	0.0	1.0
183687	Michigan	1.0	0.50	0.0	0.233333	Within past year (anytime less than 12 months ...	1.0	0.304348	0.000000	...	0.608392	0.228795	0.169567	1.0	0.0	1.0	1.0	0.0	0.0
191905	Michigan	0.0	0.75	0.0	0.000000	Within past year (anytime less than 12 months ...	1.0	0.260870	0.000000	...	0.461538	0.169494	0.180077	1.0	1.0	0.0	1.0	0.0	0.0

5 rows × 40 columns
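Because `scale_float_columns` refits the scaler on every call, the three subsets are scaled with different minima and maxima. A common alternative is to fit on the training subset only and reuse those statistics for the others. The sketch below assumes that approach; `scale_with_train_statistics` and the toy frames are hypothetical names and made-up values, not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def scale_with_train_statistics(train, others):
    """Fit MinMaxScaler on the float64 columns of `train`, then transform
    `train` and every frame in `others` in place with the same statistics."""
    numerical_columns = list(train.select_dtypes(include=["float64"]).columns)
    scaler = MinMaxScaler()
    train[numerical_columns] = scaler.fit_transform(train[numerical_columns])
    for dataset in others:
        dataset[numerical_columns] = scaler.transform(dataset[numerical_columns])
    return scaler

# Hypothetical toy frames; the values are made up.
toy_train = pd.DataFrame({"SleepHours": [4.0, 8.0, 12.0],
                          "State": ["Utah", "Kansas", "Utah"]})
toy_test = pd.DataFrame({"SleepHours": [6.0, 10.0],
                         "State": ["Kansas", "Utah"]})

scale_with_train_statistics(toy_train, [toy_test])
print(toy_test)
```

With this variant, values in the other subsets can fall slightly outside 0-1 when they exceed the range seen in training.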

5. Cleaning missing fields

We cannot use .dropna() because a large share of the rows (199110 of 445132, about 45%) have missing values:

print(df.shape[0])
print(df.shape[0] - df.dropna().shape[0])

445132
199110
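The same check can also report the share of affected rows and which columns contribute the most gaps. A minimal sketch, using a hypothetical toy frame with made-up values in place of `df`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `df`; the values are made up.
toy = pd.DataFrame({
    "BMI": [24.0, np.nan, 31.5, 27.4],
    "SleepHours": [7.0, 8.0, np.nan, 6.0],
    "State": ["Utah", "Kansas", "Michigan", None],
})

# Rows that contain at least one missing value, as a count and as a share.
rows_with_missing = int(toy.isna().any(axis=1).sum())
share_with_missing = rows_with_missing / len(toy)

# Missing values per column, largest first.
missing_per_column = toy.isna().sum().sort_values(ascending=False)

print(rows_with_missing, share_with_missing)
print(missing_per_column)
```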

test.head()

	State	Male	GeneralHealth	PhysicalHealthDays	MentalHealthDays	LastCheckupTime	PhysicalActivities	SleepHours	RemovedTeeth	...	HeightInMeters	WeightInKilograms	BMI	AlcoholDrinkers	HIVTesting	FluVaxLast12	PneumoVaxEver	TetanusLast10Tdap	CovidPos
339824	South Dakota	0.0	0.50	0.1	0.700000	Within past year (anytime less than 12 months ...	1.0	0.304348	0.666667	...	0.482517	0.110156	0.097513	0.0	1.0	1.0	1.0	0.0	0.0
127927	Kansas	0.0	0.50	1.0	0.000000	Within past year (anytime less than 12 months ...	1.0	0.391304	0.333333	...	0.538462	0.279650	0.264860	0.0	0.0	1.0	1.0	NaN	0.0
362523	Utah	1.0	1.00	0.0	0.000000	Within past year (anytime less than 12 months ...	1.0	0.260870	0.333333	...	0.643357	0.340670	0.257153	0.0	0.0	0.0	0.0	0.0	1.0
183687	Michigan	1.0	0.50	0.0	0.233333	Within past year (anytime less than 12 months ...	1.0	0.304348	0.000000	...	0.608392	0.228795	0.169567	1.0	0.0	1.0	1.0	0.0	0.0
191905	Michigan	0.0	0.75	0.0	0.000000	Within past year (anytime less than 12 months ...	1.0	0.260870	0.000000	...	0.461538	0.169494	0.180077	1.0	1.0	0.0	1.0	0.0	0.0

5 rows × 40 columns

I fill in the missing values with the median:

#test.dropna(inplace=True)
#train.dropna(inplace=True)
#valid.dropna(inplace=True)
test.fillna(test.median(),inplace=True)
train.fillna(train.median(),inplace=True)
valid.fillna(valid.median(),inplace=True)

C:\Users\Adrian\AppData\Local\Temp\ipykernel_18732\896322512.py:4: FutureWarning: Dropping of nuisance columns in DataFrame reductions (with 'numeric_only=None') is deprecated; in a future version this will raise TypeError.  Select only valid columns before calling the reduction.
  test.fillna(test.median(),inplace=True)
C:\Users\Adrian\AppData\Local\Temp\ipykernel_18732\896322512.py:5: FutureWarning: Dropping of nuisance columns in DataFrame reductions (with 'numeric_only=None') is deprecated; in a future version this will raise TypeError.  Select only valid columns before calling the reduction.
  train.fillna(train.median(),inplace=True)
C:\Users\Adrian\AppData\Local\Temp\ipykernel_18732\896322512.py:6: FutureWarning: Dropping of nuisance columns in DataFrame reductions (with 'numeric_only=None') is deprecated; in a future version this will raise TypeError.  Select only valid columns before calling the reduction.
  valid.fillna(valid.median(),inplace=True)
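The FutureWarning above comes from calling `median()` on a frame that also has non-numeric columns. Passing `numeric_only=True` restricts the reduction to numeric columns, as the warning suggests. A minimal sketch on a hypothetical toy frame with made-up values:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one numeric and one categorical column.
toy = pd.DataFrame({
    "BMI": [24.0, np.nan, 30.0],
    "State": pd.Categorical(["Utah", "Kansas", "Utah"]),
})

# Compute medians for numeric columns only, then fill column by column.
medians = toy.median(numeric_only=True)
toy = toy.fillna(medians)

print(toy)
```

Columns that are absent from `medians`, here `State`, are left unchanged by `fillna`.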

test.head()

	State	Male	GeneralHealth	PhysicalHealthDays	MentalHealthDays	LastCheckupTime	PhysicalActivities	SleepHours	RemovedTeeth	...	HeightInMeters	WeightInKilograms	BMI	AlcoholDrinkers	HIVTesting	FluVaxLast12	PneumoVaxEver	CovidPos
339824	South Dakota	0.0	0.50	0.1	0.700000	Within past year (anytime less than 12 months ...	1.0	0.304348	0.666667	...	0.482517	0.110156	0.097513	0.0	1.0	1.0	1.0	0.0
127927	Kansas	0.0	0.50	1.0	0.000000	Within past year (anytime less than 12 months ...	1.0	0.391304	0.333333	...	0.538462	0.279650	0.264860	0.0	0.0	1.0	1.0	0.0
362523	Utah	1.0	1.00	0.0	0.000000	Within past year (anytime less than 12 months ...	1.0	0.260870	0.333333	...	0.643357	0.340670	0.257153	0.0	0.0	0.0	0.0	1.0
183687	Michigan	1.0	0.50	0.0	0.233333	Within past year (anytime less than 12 months ...	1.0	0.304348	0.000000	...	0.608392	0.228795	0.169567	1.0	0.0	1.0	1.0	0.0
191905	Michigan	0.0	0.75	0.0	0.000000	Within past year (anytime less than 12 months ...	1.0	0.260870	0.000000	...	0.461538	0.169494	0.180077	1.0	1.0	0.0	1.0	0.0

5 rows × 40 columns

I filled the categorical columns with "Unknown" values during normalization, because filling with the median via fillna does not work for this data type (https://stackoverflow.com/questions/49127897/python-pandas-fillna-median-not-working)
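One way such a fill can be written for a column of dtype `category` is sketched below. This is an assumption about the general technique, not necessarily the code used earlier in the notebook: the new label has to be registered as a category before `fillna` can use it. The series `col` and its values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with one missing entry.
col = pd.Series(["Utah", np.nan, "Kansas"], dtype="category")

# A categorical column only accepts values from its category list,
# so "Unknown" is added first and used as the fill value afterwards.
if "Unknown" not in col.cat.categories:
    col = col.cat.add_categories("Unknown")
col = col.fillna("Unknown")

print(col)
```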

test["HighRiskLastYear"].value_counts()

0.0    42796
1.0     1717
Name: HighRiskLastYear, dtype: int64

test["HighRiskLastYear"].isna().sum()

No null values remain:

test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 44513 entries, 339824 to 52161
Data columns (total 40 columns):
 #   Column                     Non-Null Count  Dtype   
---  ------                     --------------  -----   
 0   State                      44513 non-null  category
 1   Male                       44513 non-null  float64 
 2   GeneralHealth              44513 non-null  float64 
 3   PhysicalHealthDays         44513 non-null  float64 
 4   MentalHealthDays           44513 non-null  float64 
 5   LastCheckupTime            44513 non-null  category
 6   PhysicalActivities         44513 non-null  float64 
 7   SleepHours                 44513 non-null  float64 
 8   RemovedTeeth               44513 non-null  float64 
 9   HadHeartAttack             44513 non-null  float64 
 10  HadAngina                  44513 non-null  float64 
 11  HadStroke                  44513 non-null  float64 
 12  HadAsthma                  44513 non-null  float64 
 13  HadSkinCancer              44513 non-null  float64 
 14  HadCOPD                    44513 non-null  float64 
 15  HadDepressiveDisorder      44513 non-null  float64 
 16  HadKidneyDisease           44513 non-null  float64 
 17  HadArthritis               44513 non-null  float64 
 18  HadDiabetes                44513 non-null  float64 
 19  DeafOrHardOfHearing        44513 non-null  float64 
 20  BlindOrVisionDifficulty    44513 non-null  float64 
 21  DifficultyConcentrating    44513 non-null  float64 
 22  DifficultyWalking          44513 non-null  float64 
 23  DifficultyDressingBathing  44513 non-null  float64 
 24  DifficultyErrands          44513 non-null  float64 
 25  SmokerStatus               44513 non-null  float64 
 26  ECigaretteUsage            44513 non-null  float64 
 27  ChestScan                  44513 non-null  float64 
 28  RaceEthnicityCategory      44513 non-null  category
 29  AgeCategory                44513 non-null  category
 30  HeightInMeters             44513 non-null  float64 
 31  WeightInKilograms          44513 non-null  float64 
 32  BMI                        44513 non-null  float64 
 33  AlcoholDrinkers            44513 non-null  float64 
 34  HIVTesting                 44513 non-null  float64 
 35  FluVaxLast12               44513 non-null  float64 
 36  PneumoVaxEver              44513 non-null  float64 
 37  TetanusLast10Tdap          44513 non-null  float64 
 38  HighRiskLastYear           44513 non-null  float64 
 39  CovidPos                   44513 non-null  float64 
dtypes: category(4), float64(36)
memory usage: 12.7 MB

train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 676777 entries, 0 to 676776
Data columns (total 40 columns):
 #   Column                     Non-Null Count   Dtype   
---  ------                     --------------   -----   
 0   State                      676777 non-null  category
 1   Male                       676777 non-null  float64 
 2   GeneralHealth              676777 non-null  float64 
 3   PhysicalHealthDays         676777 non-null  float64 
 4   MentalHealthDays           676777 non-null  float64 
 5   LastCheckupTime            676777 non-null  category
 6   PhysicalActivities         676777 non-null  float64 
 7   SleepHours                 676777 non-null  float64 
 8   RemovedTeeth               676777 non-null  float64 
 9   HadHeartAttack             676777 non-null  float64 
 10  HadAngina                  676777 non-null  float64 
 11  HadStroke                  676777 non-null  float64 
 12  HadAsthma                  676777 non-null  float64 
 13  HadSkinCancer              676777 non-null  float64 
 14  HadCOPD                    676777 non-null  float64 
 15  HadDepressiveDisorder      676777 non-null  float64 
 16  HadKidneyDisease           676777 non-null  float64 
 17  HadArthritis               676777 non-null  float64 
 18  HadDiabetes                676777 non-null  float64 
 19  DeafOrHardOfHearing        676777 non-null  float64 
 20  BlindOrVisionDifficulty    676777 non-null  float64 
 21  DifficultyConcentrating    676777 non-null  float64 
 22  DifficultyWalking          676777 non-null  float64 
 23  DifficultyDressingBathing  676777 non-null  float64 
 24  DifficultyErrands          676777 non-null  float64 
 25  SmokerStatus               676777 non-null  float64 
 26  ECigaretteUsage            676777 non-null  float64 
 27  ChestScan                  676777 non-null  float64 
 28  RaceEthnicityCategory      676777 non-null  category
 29  AgeCategory                676777 non-null  category
 30  HeightInMeters             676777 non-null  float64 
 31  WeightInKilograms          676777 non-null  float64 
 32  BMI                        676777 non-null  float64 
 33  AlcoholDrinkers            676777 non-null  float64 
 34  HIVTesting                 676777 non-null  float64 
 35  FluVaxLast12               676777 non-null  float64 
 36  PneumoVaxEver              676777 non-null  float64 
 37  TetanusLast10Tdap          676777 non-null  float64 
 38  HighRiskLastYear           676777 non-null  float64 
 39  CovidPos                   676777 non-null  float64 
dtypes: category(4), float64(36)
memory usage: 188.5 MB

valid.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 44514 entries, 66965 to 224311
Data columns (total 40 columns):
 #   Column                     Non-Null Count  Dtype   
---  ------                     --------------  -----   
 0   State                      44514 non-null  category
 1   Male                       44514 non-null  float64 
 2   GeneralHealth              44514 non-null  float64 
 3   PhysicalHealthDays         44514 non-null  float64 
 4   MentalHealthDays           44514 non-null  float64 
 5   LastCheckupTime            44514 non-null  category
 6   PhysicalActivities         44514 non-null  float64 
 7   SleepHours                 44514 non-null  float64 
 8   RemovedTeeth               44514 non-null  float64 
 9   HadHeartAttack             44514 non-null  float64 
 10  HadAngina                  44514 non-null  float64 
 11  HadStroke                  44514 non-null  float64 
 12  HadAsthma                  44514 non-null  float64 
 13  HadSkinCancer              44514 non-null  float64 
 14  HadCOPD                    44514 non-null  float64 
 15  HadDepressiveDisorder      44514 non-null  float64 
 16  HadKidneyDisease           44514 non-null  float64 
 17  HadArthritis               44514 non-null  float64 
 18  HadDiabetes                44514 non-null  float64 
 19  DeafOrHardOfHearing        44514 non-null  float64 
 20  BlindOrVisionDifficulty    44514 non-null  float64 
 21  DifficultyConcentrating    44514 non-null  float64 
 22  DifficultyWalking          44514 non-null  float64 
 23  DifficultyDressingBathing  44514 non-null  float64 
 24  DifficultyErrands          44514 non-null  float64 
 25  SmokerStatus               44514 non-null  float64 
 26  ECigaretteUsage            44514 non-null  float64 
 27  ChestScan                  44514 non-null  float64 
 28  RaceEthnicityCategory      44514 non-null  category
 29  AgeCategory                44514 non-null  category
 30  HeightInMeters             44514 non-null  float64 
 31  WeightInKilograms          44514 non-null  float64 
 32  BMI                        44514 non-null  float64 
 33  AlcoholDrinkers            44514 non-null  float64 
 34  HIVTesting                 44514 non-null  float64 
 35  FluVaxLast12               44514 non-null  float64 
 36  PneumoVaxEver              44514 non-null  float64 
 37  TetanusLast10Tdap          44514 non-null  float64 
 38  HighRiskLastYear           44514 non-null  float64 
 39  CovidPos                   44514 non-null  float64 
dtypes: category(4), float64(36)
memory usage: 12.7 MB
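Instead of reading three `info()` listings, the same condition can be checked with a single count per subset. A minimal sketch; `count_missing` and the toy frames are hypothetical and the values are made up:

```python
import numpy as np
import pandas as pd

def count_missing(datasets):
    """Return the total number of missing cells for each named frame."""
    return {name: int(frame.isna().sum().sum()) for name, frame in datasets.items()}

# Hypothetical toy frames standing in for the real subsets.
toy_splits = {
    "train": pd.DataFrame({"BMI": [24.0, 27.0]}),
    "valid": pd.DataFrame({"BMI": [30.0, np.nan]}),
}

missing = count_missing(toy_splits)
print(missing)
```

In the notebook the dictionary would hold `train`, `valid` and `test`, and every count would be expected to be zero after the fill.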
