ium_452487/dane.ipynb


1. Downloading the dataset

!pip install --user kaggle
Requirement already satisfied: kaggle in c:\users\adrian\appdata\roaming\python\python39\site-packages (1.6.6)
Requirement already satisfied: bleach in c:\users\adrian\miniconda3\lib\site-packages (from kaggle) (4.1.0)
Requirement already satisfied: python-slugify in c:\users\adrian\appdata\roaming\python\python39\site-packages (from kaggle) (8.0.4)
Requirement already satisfied: python-dateutil in c:\users\adrian\appdata\roaming\python\python39\site-packages (from kaggle) (2.8.2)
Requirement already satisfied: tqdm in c:\users\adrian\appdata\roaming\python\python39\site-packages (from kaggle) (4.64.1)
Requirement already satisfied: requests in c:\users\adrian\appdata\roaming\python\python39\site-packages (from kaggle) (2.28.1)
Requirement already satisfied: certifi in c:\users\adrian\appdata\roaming\python\python39\site-packages (from kaggle) (2022.6.15)
Requirement already satisfied: six>=1.10 in c:\users\adrian\appdata\roaming\python\python39\site-packages (from kaggle) (1.16.0)
Requirement already satisfied: urllib3 in c:\users\adrian\appdata\roaming\python\python39\site-packages (from kaggle) (1.26.11)
Requirement already satisfied: webencodings in c:\users\adrian\miniconda3\lib\site-packages (from bleach->kaggle) (0.5.1)
Requirement already satisfied: packaging in c:\users\adrian\appdata\roaming\python\python39\site-packages (from bleach->kaggle) (22.0)
Requirement already satisfied: text-unidecode>=1.3 in c:\users\adrian\appdata\roaming\python\python39\site-packages (from python-slugify->kaggle) (1.3)
Requirement already satisfied: idna<4,>=2.5 in c:\users\adrian\appdata\roaming\python\python39\site-packages (from requests->kaggle) (2.10)
Requirement already satisfied: charset-normalizer<3,>=2 in c:\users\adrian\appdata\roaming\python\python39\site-packages (from requests->kaggle) (2.1.0)
Requirement already satisfied: colorama in c:\users\adrian\appdata\roaming\python\python39\site-packages (from tqdm->kaggle) (0.4.5)
!kaggle datasets download -d kamilpytlak/personal-key-indicators-of-heart-disease/
personal-key-indicators-of-heart-disease.zip: Skipping, found more recently modified local copy (use --force to force download)
#!unzip -o personal-key-indicators-of-heart-disease.zip  # does not work on Windows, so I use the zipfile module instead
import zipfile
with zipfile.ZipFile("personal-key-indicators-of-heart-disease.zip", 'r') as zip_ref:
    zip_ref.extractall("dataset_extracted")
import pandas as pd
# The downloaded dataset contains several subsets, so I deliberately open the one with NaNs to clean it manually for practice
df = pd.read_csv("dataset_extracted/2022/heart_2022_with_nans.csv")

Exploring the uncleaned dataset

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 445132 entries, 0 to 445131
Data columns (total 40 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   State                      445132 non-null  object 
 1   Sex                        445132 non-null  object 
 2   GeneralHealth              443934 non-null  object 
 3   PhysicalHealthDays         434205 non-null  float64
 4   MentalHealthDays           436065 non-null  float64
 5   LastCheckupTime            436824 non-null  object 
 6   PhysicalActivities         444039 non-null  object 
 7   SleepHours                 439679 non-null  float64
 8   RemovedTeeth               433772 non-null  object 
 9   HadHeartAttack             442067 non-null  object 
 10  HadAngina                  440727 non-null  object 
 11  HadStroke                  443575 non-null  object 
 12  HadAsthma                  443359 non-null  object 
 13  HadSkinCancer              441989 non-null  object 
 14  HadCOPD                    442913 non-null  object 
 15  HadDepressiveDisorder      442320 non-null  object 
 16  HadKidneyDisease           443206 non-null  object 
 17  HadArthritis               442499 non-null  object 
 18  HadDiabetes                444045 non-null  object 
 19  DeafOrHardOfHearing        424485 non-null  object 
 20  BlindOrVisionDifficulty    423568 non-null  object 
 21  DifficultyConcentrating    420892 non-null  object 
 22  DifficultyWalking          421120 non-null  object 
 23  DifficultyDressingBathing  421217 non-null  object 
 24  DifficultyErrands          419476 non-null  object 
 25  SmokerStatus               409670 non-null  object 
 26  ECigaretteUsage            409472 non-null  object 
 27  ChestScan                  389086 non-null  object 
 28  RaceEthnicityCategory      431075 non-null  object 
 29  AgeCategory                436053 non-null  object 
 30  HeightInMeters             416480 non-null  float64
 31  WeightInKilograms          403054 non-null  float64
 32  BMI                        396326 non-null  float64
 33  AlcoholDrinkers            398558 non-null  object 
 34  HIVTesting                 379005 non-null  object 
 35  FluVaxLast12               398011 non-null  object 
 36  PneumoVaxEver              368092 non-null  object 
 37  TetanusLast10Tdap          362616 non-null  object 
 38  HighRiskLastYear           394509 non-null  object 
 39  CovidPos                   394368 non-null  object 
dtypes: float64(6), object(34)
memory usage: 135.8+ MB
df.head()
State Sex GeneralHealth PhysicalHealthDays MentalHealthDays LastCheckupTime PhysicalActivities SleepHours RemovedTeeth HadHeartAttack ... HeightInMeters WeightInKilograms BMI AlcoholDrinkers HIVTesting FluVaxLast12 PneumoVaxEver TetanusLast10Tdap HighRiskLastYear CovidPos
0 Alabama Female Very good 0.0 0.0 Within past year (anytime less than 12 months ... No 8.0 NaN No ... NaN NaN NaN No No Yes No Yes, received tetanus shot but not sure what type No No
1 Alabama Female Excellent 0.0 0.0 NaN No 6.0 NaN No ... 1.60 68.04 26.57 No No No No No, did not receive any tetanus shot in the pa... No No
2 Alabama Female Very good 2.0 3.0 Within past year (anytime less than 12 months ... Yes 5.0 NaN No ... 1.57 63.50 25.61 No No No No NaN No Yes
3 Alabama Female Excellent 0.0 0.0 Within past year (anytime less than 12 months ... Yes 7.0 NaN No ... 1.65 63.50 23.30 No No Yes Yes No, did not receive any tetanus shot in the pa... No No
4 Alabama Female Fair 2.0 0.0 Within past year (anytime less than 12 months ... Yes 9.0 NaN No ... 1.57 53.98 21.77 Yes No No Yes No, did not receive any tetanus shot in the pa... No No

5 rows × 40 columns

df.describe()
PhysicalHealthDays MentalHealthDays SleepHours HeightInMeters WeightInKilograms BMI
count 434205.000000 436065.000000 439679.000000 416480.000000 403054.000000 396326.000000
mean 4.347919 4.382649 7.022983 1.702691 83.074470 28.529842
std 8.688912 8.387475 1.502425 0.107177 21.448173 6.554889
min 0.000000 0.000000 1.000000 0.910000 22.680000 12.020000
25% 0.000000 0.000000 6.000000 1.630000 68.040000 24.130000
50% 0.000000 0.000000 7.000000 1.700000 80.740000 27.440000
75% 3.000000 5.000000 8.000000 1.780000 95.250000 31.750000
max 30.000000 30.000000 24.000000 2.410000 292.570000 99.640000

Only 6 columns are numeric at this point, so describe() leaves the remaining 34 columns out of this summary.

The dataset is imbalanced: the variable we want to predict, HadHeartAttack, is "No" in the vast majority of cases:

df["HadHeartAttack"].value_counts().plot(kind="pie")
<AxesSubplot:ylabel='count'>
df["HadHeartAttack"].value_counts()
HadHeartAttack
No     416959
Yes     25108
Name: count, dtype: int64
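
To put a number on the imbalance instead of reading it off the pie chart, the class shares can be computed directly. This is a minimal sketch that rebuilds the counts from the output above into a small Series; the names `counts`, `shares` and `imbalance_ratio` are illustrative and not part of the notebook.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series({"No": 416959, "Yes": 25108}, name="HadHeartAttack")

# Share of each class among the non-missing answers.
shares = counts / counts.sum()
print(shares.round(4))

# Ratio of the majority class to the minority class.
imbalance_ratio = counts["No"] / counts["Yes"]
print(f"No:Yes ratio is about {imbalance_ratio:.1f}:1")
```

On the full DataFrame the same shares should be available through `df["HadHeartAttack"].value_counts(normalize=True)`.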

2. Splitting into subsets (train / dev / test - 8:1:1) and oversampling

from sklearn.model_selection import train_test_split
# The sklearn function has to be called twice because it only splits into two subsets
train, test_and_valid = train_test_split(df, test_size=0.2)  # 0.8 train, 0.2 test & valid

test, valid = train_test_split(test_and_valid, test_size=0.5)  # 0.1 test, 0.1 valid
train["HadHeartAttack"].value_counts()
HadHeartAttack
No     333640
Yes     20032
Name: count, dtype: int64
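
The two-step split above is random, so the share of "Yes" rows in each subset can drift slightly from run to run. A possible variant, sketched here on a toy frame, passes `stratify` and `random_state` to `train_test_split` so that every subset keeps the same class ratio and the split is reproducible. The toy data, the `toy_*` names and the seed 42 are assumptions for illustration. On the real data this would also assume that rows with a missing `HadHeartAttack` are dropped first, since the stratification column should not contain NaN.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for df: 1000 rows, 10% positives.
toy = pd.DataFrame({
    "feature": range(1000),
    "HadHeartAttack": ["Yes"] * 100 + ["No"] * 900,
})

# First split: 80% train, 20% held out; stratify keeps the Yes/No ratio.
toy_train, toy_rest = train_test_split(
    toy, test_size=0.2, stratify=toy["HadHeartAttack"], random_state=42
)
# Second split: halve the held-out part into test and validation.
toy_test, toy_valid = train_test_split(
    toy_rest, test_size=0.5, stratify=toy_rest["HadHeartAttack"], random_state=42
)
print(len(toy_train), len(toy_test), len(toy_valid))
```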

The training set is still imbalanced, so I apply simple oversampling: the minority class is copied until both classes are almost equal in size.

def oversample(dataset):
    num_true = len(dataset[dataset["HadHeartAttack"]=="Yes"])
    num_false = len(dataset[dataset["HadHeartAttack"]=="No"])
    num_oversampling_steps = num_false//num_true
    oversampled = dataset.copy()
    for x in range(num_oversampling_steps):
        oversampled = pd.concat([oversampled, dataset[dataset["HadHeartAttack"]=="Yes"]], ignore_index=True)
    return oversampled
train = oversample(train)
train["HadHeartAttack"].value_counts().plot(kind="pie")
<AxesSubplot:ylabel='count'>
test["HadHeartAttack"].value_counts().plot(kind="pie")
<AxesSubplot:ylabel='count'>
valid["HadHeartAttack"].value_counts().plot(kind="pie")
<AxesSubplot:ylabel='count'>
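
The effect of the oversampling step can be checked on a tiny example. The sketch below restates the same logic as `oversample` in a compact form (append the "Yes" rows `num_no // num_yes` more times) and runs it on a hypothetical 13-row frame; `toy` and `balanced` are illustrative names.

```python
import pandas as pd

def oversample(dataset):
    # Same logic as the notebook function: append the "Yes" rows
    # floor(num_no / num_yes) more times to the original data.
    yes_rows = dataset[dataset["HadHeartAttack"] == "Yes"]
    num_no = (dataset["HadHeartAttack"] == "No").sum()
    steps = num_no // len(yes_rows)
    return pd.concat([dataset] + [yes_rows] * steps, ignore_index=True)

# 3 positives and 10 negatives: steps = 10 // 3 = 3.
toy = pd.DataFrame({"HadHeartAttack": ["Yes"] * 3 + ["No"] * 10})
balanced = oversample(toy)
print(balanced["HadHeartAttack"].value_counts())
```

Because the original "Yes" rows are kept and then appended `steps` more times, the minority class can end up slightly larger than the majority class (12 against 10 in this toy case).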

Proportions of smokers and non-smokers in the original dataset:

df["SmokerStatus"].value_counts().plot(kind="pie")
<AxesSubplot:ylabel='count'>
df["ECigaretteUsage"].value_counts().plot(kind="pie")
<AxesSubplot:ylabel='count'>

COVID statistics

df["CovidPos"].value_counts().plot(kind="pie")
<AxesSubplot:ylabel='count'>

Normalization part 1 - converting to numeric and categorical columns

Columns that describe health status and similar traits on a "poor/fair/good/excellent" scale were converted to numeric columns in a sensible way, with values increasing with the positive aspect of the given health factor. I did the same for how often a person smoked. Some columns were converted to categorical ones. The sex column was converted to a numeric one so that the model can use it later, although I had doubts about doing this from a political-correctness standpoint.

df["Sex"].unique()
array(['Female', 'Male'], dtype=object)
df["GeneralHealth"].unique()
array(['Very good', 'Excellent', 'Fair', 'Poor', 'Good', nan],
      dtype=object)
health_map = {
    "Excellent": 5,
    "Very good": 4,
    "Good": 3,
    "Fair": 2,
    "Poor": 1
}
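
The ordinal encoding described above can be tried on a few values before applying it to the whole column. A minimal sketch, assuming a short hand-made Series named `sample` (not part of the notebook):

```python
import pandas as pd

health_map = {"Excellent": 5, "Very good": 4, "Good": 3, "Fair": 2, "Poor": 1}

sample = pd.Series(["Very good", "Poor", None, "Excellent"], name="GeneralHealth")

# Series.map with a plain dict leaves missing and unmapped values as NaN.
encoded = sample.map(health_map)
print(encoded.tolist())
```

A plain dict should be enough for this column, because `Series.map` already turns values without a key into NaN.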
for col in df:
    print(f"{col}:")
    print(df[col].unique())
State:
['Alabama' 'Alaska' 'Arizona' 'Arkansas' 'California' 'Colorado'
 'Connecticut' 'Delaware' 'District of Columbia' 'Florida' 'Georgia'
 'Hawaii' 'Idaho' 'Illinois' 'Indiana' 'Iowa' 'Kansas' 'Kentucky'
 'Louisiana' 'Maine' 'Maryland' 'Massachusetts' 'Michigan' 'Minnesota'
 'Mississippi' 'Missouri' 'Montana' 'Nebraska' 'Nevada' 'New Hampshire'
 'New Jersey' 'New Mexico' 'New York' 'North Carolina' 'North Dakota'
 'Ohio' 'Oklahoma' 'Oregon' 'Pennsylvania' 'Rhode Island' 'South Carolina'
 'South Dakota' 'Tennessee' 'Texas' 'Utah' 'Vermont' 'Virginia'
 'Washington' 'West Virginia' 'Wisconsin' 'Wyoming' 'Guam' 'Puerto Rico'
 'Virgin Islands']
Sex:
['Female' 'Male']
GeneralHealth:
['Very good' 'Excellent' 'Fair' 'Poor' 'Good' nan]
PhysicalHealthDays:
[ 0.  2.  1.  8.  5. 30.  4. 23. 14. nan 15.  3. 10.  7. 25.  6. 21. 20.
 29. 16.  9. 27. 28. 12. 13. 11. 26. 17. 24. 19. 18. 22.]
MentalHealthDays:
[ 0.  3.  9.  5. 15. 20. 14. 10. 18.  1. nan  2. 30.  4.  6.  7. 25.  8.
 22. 29. 27. 21. 12. 28. 16. 13. 26. 17. 11. 23. 19. 24.]
LastCheckupTime:
['Within past year (anytime less than 12 months ago)' nan
 'Within past 2 years (1 year but less than 2 years ago)'
 'Within past 5 years (2 years but less than 5 years ago)'
 '5 or more years ago']
PhysicalActivities:
['No' 'Yes' nan]
SleepHours:
[ 8.  6.  5.  7.  9.  4. 10.  1. 12. nan 18.  3.  2. 11. 16. 15. 13. 14.
 20. 23. 17. 24. 22. 19. 21.]
RemovedTeeth:
[nan 'None of them' '1 to 5' '6 or more, but not all' 'All']
HadHeartAttack:
['No' 'Yes' nan]
HadAngina:
['No' 'Yes' nan]
HadStroke:
['No' 'Yes' nan]
HadAsthma:
['No' 'Yes' nan]
HadSkinCancer:
['No' 'Yes' nan]
HadCOPD:
['No' 'Yes' nan]
HadDepressiveDisorder:
['No' 'Yes' nan]
HadKidneyDisease:
['No' 'Yes' nan]
HadArthritis:
['No' 'Yes' nan]
HadDiabetes:
['Yes' 'No' 'No, pre-diabetes or borderline diabetes' nan
 'Yes, but only during pregnancy (female)']
DeafOrHardOfHearing:
['No' nan 'Yes']
BlindOrVisionDifficulty:
['No' 'Yes' nan]
DifficultyConcentrating:
['No' nan 'Yes']
DifficultyWalking:
['No' 'Yes' nan]
DifficultyDressingBathing:
['No' nan 'Yes']
DifficultyErrands:
['No' 'Yes' nan]
SmokerStatus:
['Never smoked' 'Current smoker - now smokes some days' 'Former smoker'
 nan 'Current smoker - now smokes every day']
ECigaretteUsage:
['Not at all (right now)' 'Never used e-cigarettes in my entire life' nan
 'Use them every day' 'Use them some days']
ChestScan:
['No' 'Yes' nan]
RaceEthnicityCategory:
['White only, Non-Hispanic' 'Black only, Non-Hispanic'
 'Other race only, Non-Hispanic' 'Multiracial, Non-Hispanic' nan
 'Hispanic']
AgeCategory:
['Age 80 or older' 'Age 55 to 59' nan 'Age 40 to 44' 'Age 75 to 79'
 'Age 70 to 74' 'Age 65 to 69' 'Age 60 to 64' 'Age 50 to 54'
 'Age 45 to 49' 'Age 35 to 39' 'Age 25 to 29' 'Age 30 to 34'
 'Age 18 to 24']
HeightInMeters:
[ nan 1.6  1.57 1.65 1.8  1.63 1.7  1.68 1.73 1.55 1.93 1.88 1.78 1.85
 1.75 1.52 1.83 1.91 1.96 1.5  1.45 1.42 1.24 1.47 1.22 1.98 2.03 2.01
 1.3  1.4  1.35 1.82 1.67 1.76 2.11 1.37 1.64 1.71 2.16 2.26 0.91 2.06
 1.14 1.74 1.51 1.53 1.69 1.56 1.84 1.9  1.54 1.72 1.87 1.61 1.49 1.59
 1.58 1.62 1.79 1.46 1.89 2.13 0.99 2.08 2.21 1.32 2.18 1.77 2.36 1.25
 1.66 1.86 1.95 1.19 1.05 1.48 1.03 1.18 1.81 1.38 1.44 1.07 1.27 1.2
 1.17 1.04 2.24 1.1  1.43 1.92 2.05 1.12 2.41 2.34 0.97 1.06 1.15 2.29
 1.16 1.09 0.92 2.07 1.   1.08 1.02 1.33 2.   2.02 1.94 0.95]
WeightInKilograms:
[   nan  68.04  63.5   53.98  84.82  62.6   73.48  81.65  74.84  59.42
  85.28 106.59  71.21  64.41  61.23  90.72  65.77  66.22  80.29  86.18
  47.63 107.05  57.15 105.23  77.11  56.7   79.38 113.4  102.06  59.87
 104.33  53.52  61.69 136.08  34.47  99.79 127.01  78.93  95.25  58.97
  92.08  72.57  83.91  49.9  117.93  71.67 102.97  62.14  83.46  54.43
  94.35  60.78 117.03  65.32  76.66  88.45  89.81  74.39  68.95  79.83
 108.41  90.26  55.79  91.63  47.17  78.02  50.8   91.17  84.37 145.15
  93.89 122.47  48.99  73.94  88.9   80.74  81.19 158.76  97.52  51.71
  82.55  76.2   68.49  75.3   70.31  63.05  60.33 115.67  86.64 108.86
  92.53 124.74  43.09  58.51  63.96  92.99  44.45 128.82  98.88  45.36
 110.68  46.72  58.06  73.03  95.71 131.09  78.47  69.4   85.73  67.59
 103.87 120.2   88.    54.88 111.58  52.16  77.56 126.55  94.8  123.83
  89.36  75.75  69.85 112.49  82.1  106.14  57.61  70.76 148.78  96.16
  67.13  48.08 163.29 109.77 100.7  142.88  64.86 111.13 121.11  55.34
 101.6   93.44 117.48 120.66  66.68  44.91 132.   107.5  107.95  36.29
 103.42  87.09  83.01  56.25  96.62 134.26  97.07  34.93  99.34  72.12
  49.44 122.02  98.43 129.73 181.44  52.62 121.56 110.22  48.53 140.61
 156.49 116.57  87.54  44.   114.31  31.75  97.98 101.15 112.04 100.24
 113.85 154.22 118.39 133.81 149.69  41.73 119.75 138.35 151.95 129.27
 131.54 104.78 132.45 102.51 116.12  40.37 105.69 136.98 195.04  53.07
 132.9  124.28 112.94 114.76  45.81 119.29 167.83  51.26 172.37 162.39
  46.27 127.91 123.38  38.56 130.63 143.34 115.21 166.92 135.17 109.32
 135.62 204.12 127.46 118.84 139.25 126.1  122.92 151.5  133.36  42.64
  50.35  80.   190.51  37.19 147.87  35.38 144.24 149.23  37.65  86.
 147.42 281.   165.56 162.84 155.58  70.   137.89 189.6  206.38 148.32
  42.18 153.77  38.1   90.   176.9  191.87 249.48  67.    95.    82.
 170.1   62.    40.82  53.   139.71 130.18 100.   165.11  64.    43.54
  24.   134.72 141.52 125.19  75.    60.    34.02 164.65  30.84 250.
  58.    76.    73.   112.    74.    55.   200.    54.    66.    72.
 152.41  39.46 220.    41.28 168.28 188.24  59.    46.   265.   238.14
 168.74 145.   190.    93.   159.66  78.    50.   185.07  91.   104.
 165.   183.7   33.57 161.93  68.   125.65 134.   130.    32.21 143.79
  69.   179.17  63.   105.   210.92  65.    32.   292.57 280.    85.
 174.63  56.   128.37  87.    39.92  83.   169.64 156.04 177.   121.
 151.05  89.   146.96 146.06  98.   166.47  36.74 171.46 227.25  29.48
 190.06 161.03  35.83 226.8  175.09 138.8  240.4  158.3  170.55  61.
 137.44 145.6  141.07 155.13  52.   120.    57.    77.    27.22  25.4
 240.    96.    47.   115.    41.    45.   170.   150.59 272.16  26.31
  48.    39.01 236.    92.   197.31 156.    84.    94.    29.03  49.
  79.   157.85 192.78 255.   108.   185.   222.26 229.97 180.    81.
  24.95  71.    26.   107.   101.   208.65 140.   175.   111.   110.
 141.97  22.68 284.86 136.53 210.   103.   185.97 140.16 146.51  24.49
  25.85 150.   102.   229.52  23.59 125.   163.    38.   135.   176.45
 185.52 152.86 232.69 124.   192.32 186.88 118.   160.12 160.   193.68
 201.85 144.7  184.16 142.43 169.   166.01  32.66 180.53 196.41  51.
  40.   171.91 195.95  33.11 153.31 159.21 164.2  219.99 215.46 182.34
  30.   160.57 173.27 158.   213.19 276.24 199.58 175.99 235.87 217.72
 200.03 230.88 146.    24.04 178.72 150.14 157.4  163.75 191.42 174.18
  28.58  97.   256.28 205.48 161.48 178.26 179.62 205.02 254.01 154.68
 209.56 201.4  234.96 177.81 200.49 231.79 227.7  273.52 189.15 173.73
 183.25 167.38 211.83 223.62 228.61  30.39 197.77 184.61 250.38 181.89
  31.3  290.3  285.   113.   242.67 231.33 180.08 202.76 176.   188.69
 206.84 164.   156.94 114.   122.   222.   137.   166.   180.98 272.
 172.82 274.42 234.51 199.13 244.94 203.21  23.13 265.35 198.22 263.08
 216.82 154.   169.19 239.04 177.35 210.47 224.98 117.    37.   126.
 273.06 203.66 252.2  238.59 194.59 187.33 221.35 162.   224.53  23.
 223.17 187.79 212.73 152.   233.6  193.23 205.   229.06 230.   247.21
  99.    28.12 230.42 175.54 205.93 171.    26.76 212.28 217.   280.32
 281.68 248.57 195.    42.   258.55 215.   116.    28.   123.   186.43
 228.16 119.   219.09 214.55 278.96 182.8  138.   217.27 246.3  189.  ]
BMI:
[  nan 26.57 25.61 ... 13.51 28.39 48.63]
AlcoholDrinkers:
['No' 'Yes' nan]
HIVTesting:
['No' 'Yes' nan]
FluVaxLast12:
['Yes' 'No' nan]
PneumoVaxEver:
['No' 'Yes' nan]
TetanusLast10Tdap:
['Yes, received tetanus shot but not sure what type'
 'No, did not receive any tetanus shot in the past 10 years' nan
 'Yes, received Tdap' 'Yes, received tetanus shot, but not Tdap']
HighRiskLastYear:
['No' nan 'Yes']
CovidPos:
['No' 'Yes' nan
 'Tested positive using home test without a health professional']
from collections import defaultdict
def normalize_dataset(dataset):
    dataset["GeneralHealth"] = dataset["GeneralHealth"].map(defaultdict(lambda: float('NaN'), health_map), na_action='ignore')
    dataset["Sex"] = dataset["Sex"].map({"Female":0,"Male":1}).astype(float)  # convert text columns to numeric ones
    dataset.rename(columns ={"Sex":"Male"},inplace=True)
    dataset["State"] = dataset["State"].astype('category')
    dataset["PhysicalHealthDays"] = dataset["PhysicalHealthDays"].astype(float)  # astype returns a new Series, so the result has to be assigned
    dataset["MentalHealthDays"] = dataset["MentalHealthDays"].astype(float)
    dataset["LastCheckupTime"] = dataset["LastCheckupTime"].fillna("Unknown").astype('category')  # Later I use fillna with the median, but that does not work on categorical columns, so I fill missing values before the conversion
    dataset["PhysicalActivities"]= dataset["PhysicalActivities"].map({"No":0,"Yes":1})
    dataset["SleepHours"] = dataset["SleepHours"].astype(float)
    dataset["RemovedTeeth"] = dataset["RemovedTeeth"].map(defaultdict(lambda: float('NaN'), {"None of them":0,"1 to 5":1, "6 or more, but not all":2, "All":3}), na_action='ignore')
    dataset["HadHeartAttack"]= dataset["HadHeartAttack"].map({"No":0,"Yes":1})
    dataset["HadAngina"]= dataset["HadAngina"].map({"No":0,"Yes":1})
    dataset["HadStroke"]= dataset["HadStroke"].map({"No":0,"Yes":1})
    dataset["HadAsthma"]= dataset["HadAsthma"].map({"No":0,"Yes":1})
    dataset["HadSkinCancer"]= dataset["HadSkinCancer"].map({"No":0,"Yes":1})
    dataset["HadCOPD"]= dataset["HadCOPD"].map({"No":0,"Yes":1})
    dataset["HadDepressiveDisorder"]= dataset["HadDepressiveDisorder"].map({"No":0,"Yes":1})
    dataset["HadKidneyDisease"]= dataset["HadKidneyDisease"].map({"No":0,"Yes":1})
    dataset["HadArthritis"]= dataset["HadArthritis"].map({"No":0,"Yes":1})
    dataset["HadDiabetes"]= dataset["HadDiabetes"].map({"No":0,"Yes, but only during pregnancy (female)":1,"No, pre-diabetes or borderline diabetes":2,"Yes":3})

    dataset["DeafOrHardOfHearing"]= dataset["DeafOrHardOfHearing"].map({"No":0,"Yes":1})
    dataset["BlindOrVisionDifficulty"]= dataset["BlindOrVisionDifficulty"].map({"No":0,"Yes":1})
    dataset["DifficultyConcentrating"]= dataset["DifficultyConcentrating"].map({"No":0,"Yes":1})
    dataset["DifficultyWalking"]= dataset["DifficultyWalking"].map({"No":0,"Yes":1})
    dataset["DifficultyDressingBathing"]= dataset["DifficultyDressingBathing"].map({"No":0,"Yes":1})
    dataset["DifficultyErrands"]= dataset["DifficultyErrands"].map({"No":0,"Yes":1})
    dataset["SmokerStatus"]= dataset["SmokerStatus"].map({"Never smoked":0,"Current smoker - now smokes some days":1,"Former smoker":2,"Current smoker - now smokes every day":3})
    dataset["ECigaretteUsage"]= dataset["ECigaretteUsage"].map({"Never used e-cigarettes in my entire life":0,"Not at all (right now)":1,"Use them some days":2,"Use them every day":3})
    dataset["ChestScan"]= dataset["ChestScan"].map({"No":0,"Yes":1})
    dataset["RaceEthnicityCategory"] = dataset["RaceEthnicityCategory"].fillna("Unknown").astype('category')
    dataset["AgeCategory"] = dataset["AgeCategory"].fillna("Unknown").astype('category')
    dataset["HeightInMeters"] = dataset["HeightInMeters"].astype(float)
    dataset["WeightInKilograms"] = dataset["WeightInKilograms"].astype(float)
    dataset["BMI"] = dataset["BMI"].astype(float)
    dataset["AlcoholDrinkers"]= dataset["AlcoholDrinkers"].map({"No":0,"Yes":1})
    dataset["HIVTesting"]= dataset["HIVTesting"].map({"No":0,"Yes":1})
    dataset["FluVaxLast12"]= dataset["FluVaxLast12"].map({"No":0,"Yes":1})
    dataset["PneumoVaxEver"]= dataset["PneumoVaxEver"].map({"No":0,"Yes":1})
    dataset["TetanusLast10Tdap"]= dataset["TetanusLast10Tdap"].apply(lambda x: float('NaN') if type(x)!=str else 1.0 if 'Yes,' in x else 0.0 if 'No,' in x else float('NaN'))  # "Yes, ..." answers -> 1.0, the "No, ..." answer -> 0.0
    dataset["HighRiskLastYear"]= dataset["HighRiskLastYear"].map({"No":0,"Yes":1})
    dataset["CovidPos"]= dataset["CovidPos"].map({"No":0,"Yes":1})  # the home-test answer has no key in this mapping, so it becomes NaN
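
Most lines of `normalize_dataset` repeat the same `{"No":0,"Yes":1}` mapping. One way to shorten them is a small loop over the column names. This is only a sketch: `YES_NO` and `map_yes_no` are hypothetical names, and the toy frame stands in for the real data.

```python
import pandas as pd

YES_NO = {"No": 0, "Yes": 1}

def map_yes_no(dataset, columns):
    # Hypothetical helper: encode each listed Yes/No column as 0/1 in place.
    for col in columns:
        dataset[col] = dataset[col].map(YES_NO)

toy = pd.DataFrame({
    "HadAngina": ["No", "Yes", None],
    "HadStroke": ["Yes", "No", "No"],
})
map_yes_no(toy, ["HadAngina", "HadStroke"])
print(toy)
```

Columns with more than two answers, such as `HadDiabetes` or `SmokerStatus`, would still need their own mappings.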

The test set before the data type conversion

test.head()
State Sex GeneralHealth PhysicalHealthDays MentalHealthDays LastCheckupTime PhysicalActivities SleepHours RemovedTeeth HadHeartAttack ... HeightInMeters WeightInKilograms BMI AlcoholDrinkers HIVTesting FluVaxLast12 PneumoVaxEver TetanusLast10Tdap HighRiskLastYear CovidPos
276058 New York Male Good 2.0 0.0 Within past 2 years (1 year but less than 2 ye... NaN 7.0 None of them No ... 1.55 NaN NaN No No No NaN No, did not receive any tetanus shot in the pa... No No
189605 Michigan Female Fair 20.0 15.0 Within past year (anytime less than 12 months ... Yes 5.0 All No ... 1.68 70.31 25.02 No NaN Yes Yes NaN No No
59234 Delaware Female Very good 0.0 0.0 Within past year (anytime less than 12 months ... Yes 6.0 None of them No ... 1.50 64.41 28.68 No No Yes NaN No, did not receive any tetanus shot in the pa... No No
255322 New Mexico Male Good 0.0 0.0 5 or more years ago Yes 6.0 None of them No ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
226504 Montana Female Very good 6.0 0.0 Within past year (anytime less than 12 months ... Yes 8.0 None of them No ... 1.73 90.72 30.41 Yes No No Yes NaN No Yes

5 rows × 40 columns

The test set after the data type conversion

normalize_dataset(test)
test.head()
State Male GeneralHealth PhysicalHealthDays MentalHealthDays LastCheckupTime PhysicalActivities SleepHours RemovedTeeth HadHeartAttack ... HeightInMeters WeightInKilograms BMI AlcoholDrinkers HIVTesting FluVaxLast12 PneumoVaxEver TetanusLast10Tdap HighRiskLastYear CovidPos
276058 New York 1.0 3.0 2.0 0.0 Within past 2 years (1 year but less than 2 ye... NaN 7.0 0.0 0.0 ... 1.55 NaN NaN 0.0 0.0 0.0 NaN 1.0 0.0 0.0
189605 Michigan 0.0 2.0 20.0 15.0 Within past year (anytime less than 12 months ... 1.0 5.0 3.0 0.0 ... 1.68 70.31 25.02 0.0 NaN 1.0 1.0 NaN 0.0 0.0
59234 Delaware 0.0 4.0 0.0 0.0 Within past year (anytime less than 12 months ... 1.0 6.0 0.0 0.0 ... 1.50 64.41 28.68 0.0 0.0 1.0 NaN 1.0 0.0 0.0
255322 New Mexico 1.0 3.0 0.0 0.0 5 or more years ago 1.0 6.0 0.0 0.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
226504 Montana 0.0 4.0 6.0 0.0 Within past year (anytime less than 12 months ... 1.0 8.0 0.0 0.0 ... 1.73 90.72 30.41 1.0 0.0 0.0 1.0 NaN 0.0 1.0

5 rows × 40 columns

test.info()
<class 'pandas.core.frame.DataFrame'>
Index: 44513 entries, 276058 to 196692
Data columns (total 40 columns):
 #   Column                     Non-Null Count  Dtype   
---  ------                     --------------  -----   
 0   State                      44513 non-null  category
 1   Male                       44513 non-null  float64 
 2   GeneralHealth              44380 non-null  float64 
 3   PhysicalHealthDays         43374 non-null  float64 
 4   MentalHealthDays           43620 non-null  float64 
 5   LastCheckupTime            44513 non-null  category
 6   PhysicalActivities         44383 non-null  float64 
 7   SleepHours                 43982 non-null  float64 
 8   RemovedTeeth               43364 non-null  float64 
 9   HadHeartAttack             44220 non-null  float64 
 10  HadAngina                  44117 non-null  float64 
 11  HadStroke                  44352 non-null  float64 
 12  HadAsthma                  44348 non-null  float64 
 13  HadSkinCancer              44192 non-null  float64 
 14  HadCOPD                    44283 non-null  float64 
 15  HadDepressiveDisorder      44197 non-null  float64 
 16  HadKidneyDisease           44342 non-null  float64 
 17  HadArthritis               44231 non-null  float64 
 18  HadDiabetes                44377 non-null  float64 
 19  DeafOrHardOfHearing        42456 non-null  float64 
 20  BlindOrVisionDifficulty    42338 non-null  float64 
 21  DifficultyConcentrating    42066 non-null  float64 
 22  DifficultyWalking          42090 non-null  float64 
 23  DifficultyDressingBathing  42111 non-null  float64 
 24  DifficultyErrands          41923 non-null  float64 
 25  SmokerStatus               40967 non-null  float64 
 26  ECigaretteUsage            40964 non-null  float64 
 27  ChestScan                  38930 non-null  float64 
 28  RaceEthnicityCategory      44513 non-null  category
 29  AgeCategory                44513 non-null  category
 30  HeightInMeters             41634 non-null  float64 
 31  WeightInKilograms          40303 non-null  float64 
 32  BMI                        39648 non-null  float64 
 33  AlcoholDrinkers            39882 non-null  float64 
 34  HIVTesting                 37870 non-null  float64 
 35  FluVaxLast12               39814 non-null  float64 
 36  PneumoVaxEver              36760 non-null  float64 
 37  TetanusLast10Tdap          36287 non-null  float64 
 38  HighRiskLastYear           39445 non-null  float64 
 39  CovidPos                   38063 non-null  float64 
dtypes: category(4), float64(36)
memory usage: 12.7 MB
normalize_dataset(train)
normalize_dataset(valid)

Statistics for the subsets after conversion to numeric columns

The 50th percentile is the median.

train.describe()
Male GeneralHealth PhysicalHealthDays MentalHealthDays PhysicalActivities SleepHours RemovedTeeth HadHeartAttack HadAngina HadStroke ... HeightInMeters WeightInKilograms BMI AlcoholDrinkers HIVTesting FluVaxLast12 PneumoVaxEver TetanusLast10Tdap HighRiskLastYear CovidPos
count 676617.000000 674189.000000 655653.000000 660103.000000 674547.000000 665806.000000 654146.000000 674184.000000 657382.000000 672884.000000 ... 637479.000000 620141.000000 611530.000000 607591.000000 573999.000000 606624.000000 571259.000000 554407.0 601115.000000 585931.000000
mean 0.539397 3.056503 6.720248 4.819231 0.689765 7.039463 0.978094 0.505120 0.264342 0.116472 ... 1.707316 84.660193 28.918429 0.455838 0.326018 0.571211 0.527326 1.0 0.034534 0.273136
std 0.498446 1.138185 10.708463 9.058480 0.462590 1.726591 1.017700 0.499974 0.440983 0.320790 ... 0.108041 21.748490 6.631906 0.498046 0.468754 0.494903 0.499253 0.0 0.182597 0.445571
min 0.000000 1.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 ... 0.910000 22.680000 12.020000 0.000000 0.000000 0.000000 0.000000 1.0 0.000000 0.000000
25% 0.000000 2.000000 0.000000 0.000000 0.000000 6.000000 0.000000 0.000000 0.000000 0.000000 ... 1.630000 69.400000 24.410000 0.000000 0.000000 0.000000 0.000000 1.0 0.000000 0.000000
50% 1.000000 3.000000 0.000000 0.000000 1.000000 7.000000 1.000000 1.000000 0.000000 0.000000 ... 1.700000 81.650000 27.890000 0.000000 0.000000 1.000000 1.000000 1.0 0.000000 0.000000
75% 1.000000 4.000000 10.000000 5.000000 1.000000 8.000000 2.000000 1.000000 1.000000 0.000000 ... 1.780000 96.160000 32.220000 1.000000 1.000000 1.000000 1.000000 1.0 0.000000 1.000000
max 1.000000 5.000000 30.000000 30.000000 1.000000 24.000000 3.000000 1.000000 1.000000 1.000000 ... 2.410000 292.570000 99.640000 1.000000 1.000000 1.000000 1.000000 1.0 1.000000 1.000000

8 rows × 36 columns

test.describe()
Male GeneralHealth PhysicalHealthDays MentalHealthDays PhysicalActivities SleepHours RemovedTeeth HadHeartAttack HadAngina HadStroke ... HeightInMeters WeightInKilograms BMI AlcoholDrinkers HIVTesting FluVaxLast12 PneumoVaxEver TetanusLast10Tdap HighRiskLastYear CovidPos
count 44513.000000 44380.000000 43374.000000 43620.000000 44383.000000 43982.000000 43364.000000 44220.000000 44117.000000 44352.000000 ... 41634.000000 40303.000000 39648.000000 39882.000000 37870.000000 39814.000000 36760.00000 36287.0 39445.000000 38063.000000
mean 0.467347 3.433551 4.304353 4.470839 0.759119 7.012414 0.687644 0.058684 0.060816 0.043155 ... 1.701734 82.990520 28.545288 0.532621 0.342382 0.526348 0.41420 1.0 0.043174 0.293461
std 0.498938 1.049691 8.629763 8.472884 0.427623 1.493726 0.883372 0.235035 0.238994 0.203208 ... 0.106604 21.462338 6.574508 0.498941 0.474513 0.499312 0.49259 0.0 0.203251 0.455354
min 0.000000 1.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 ... 0.910000 22.680000 12.690000 0.000000 0.000000 0.000000 0.00000 1.0 0.000000 0.000000
25% 0.000000 3.000000 0.000000 0.000000 1.000000 6.000000 0.000000 0.000000 0.000000 0.000000 ... 1.630000 68.040000 24.130000 0.000000 0.000000 0.000000 0.00000 1.0 0.000000 0.000000
50% 0.000000 3.000000 0.000000 0.000000 1.000000 7.000000 0.000000 0.000000 0.000000 0.000000 ... 1.700000 80.740000 27.440000 1.000000 0.000000 1.000000 0.00000 1.0 0.000000 0.000000
75% 1.000000 4.000000 3.000000 5.000000 1.000000 8.000000 1.000000 0.000000 0.000000 0.000000 ... 1.780000 95.250000 31.750000 1.000000 1.000000 1.000000 1.00000 1.0 0.000000 1.000000
max 1.000000 5.000000 30.000000 30.000000 1.000000 24.000000 3.000000 1.000000 1.000000 1.000000 ... 2.260000 276.240000 97.650000 1.000000 1.000000 1.000000 1.00000 1.0 1.000000 1.000000

8 rows × 36 columns

valid.describe()
Male GeneralHealth PhysicalHealthDays MentalHealthDays PhysicalActivities SleepHours RemovedTeeth HadHeartAttack HadAngina HadStroke ... HeightInMeters WeightInKilograms BMI AlcoholDrinkers HIVTesting FluVaxLast12 PneumoVaxEver TetanusLast10Tdap HighRiskLastYear CovidPos
count 44514.000000 44405.000000 43450.000000 43622.000000 44421.000000 43955.000000 43350.000000 44175.000000 44060.000000 44339.000000 ... 41591.000000 40226.000000 39516.000000 39789.000000 37856.000000 39749.000000 36681.000000 36210.0 39453.000000 38058.000000
mean 0.466887 3.427835 4.354799 4.398171 0.760271 7.031760 0.684060 0.056163 0.060236 0.043506 ... 1.702198 83.013436 28.522226 0.529945 0.340501 0.522831 0.414983 1.0 0.045903 0.290609
std 0.498908 1.056506 8.691768 8.406697 0.426923 1.513703 0.881616 0.230239 0.237926 0.203995 ... 0.107066 21.464497 6.564679 0.499109 0.473884 0.499485 0.492726 0.0 0.209277 0.454049
min 0.000000 1.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 ... 0.910000 22.680000 12.190000 0.000000 0.000000 0.000000 0.000000 1.0 0.000000 0.000000
25% 0.000000 3.000000 0.000000 0.000000 1.000000 6.000000 0.000000 0.000000 0.000000 0.000000 ... 1.630000 68.040000 24.130000 0.000000 0.000000 0.000000 0.000000 1.0 0.000000 0.000000
50% 0.000000 3.000000 0.000000 0.000000 1.000000 7.000000 0.000000 0.000000 0.000000 0.000000 ... 1.700000 80.740000 27.440000 1.000000 0.000000 1.000000 0.000000 1.0 0.000000 0.000000
75% 1.000000 4.000000 3.000000 5.000000 1.000000 8.000000 1.000000 0.000000 0.000000 0.000000 ... 1.780000 95.250000 31.750000 1.000000 1.000000 1.000000 1.000000 1.0 0.000000 1.000000
max 1.000000 5.000000 30.000000 30.000000 1.000000 24.000000 3.000000 1.000000 1.000000 1.000000 ... 2.360000 284.860000 96.200000 1.000000 1.000000 1.000000 1.000000 1.0 1.000000 1.000000

8 rows × 36 columns

There appears to be a correlation between body weight and heart attacks:

import seaborn as sns
sns.set_theme()
g = sns.catplot(
    data=train, kind="bar",
    x="GeneralHealth", y="WeightInKilograms", hue="HadHeartAttack",
    errorbar="sd", palette="dark", alpha=.6, height=6
)
g.despine(left=True)
g.set_axis_labels("General health index", "Body mass (kg)")
g.legend.set_title("Had heart attack")
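
The bar plot gives only a visual impression. One way to put a number on it is the Pearson correlation between the weight column and the 0/1 heart attack flag. The sketch below runs on a small made-up frame; the column names follow the notebook, but the values are hypothetical and not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy data: column names follow the notebook, values are made up.
toy = pd.DataFrame({
    "WeightInKilograms": [60.0, 72.5, 80.0, 95.0, 110.0, 68.0],
    "HadHeartAttack":    [0.0,  0.0,  0.0,  1.0,  1.0,   0.0],
})

# Pearson correlation between a continuous column and a binary flag.
r = toy["WeightInKilograms"].corr(toy["HadHeartAttack"])

# Mean weight per outcome group, the kind of quantity the bar heights summarize.
group_means = toy.groupby("HadHeartAttack")["WeightInKilograms"].mean()

print(round(r, 3))
print(group_means)
```

On the real data the same two lines could be applied to train directly, since pandas skips NaN pairs when computing the correlation.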

Smokers had heart attacks more often; in the validation set the rate rises with each SmokerStatus level:

valid.groupby('SmokerStatus', as_index=False)['HadHeartAttack'].mean()
SmokerStatus HadHeartAttack
0 0.0 0.037162
1 1.0 0.069817
2 2.0 0.082760
3 3.0 0.093980

In this dataset, people with a worse "GeneralHealth" score had heart attacks more often:

valid.groupby('GeneralHealth', as_index=False)['HadHeartAttack'].mean()
GeneralHealth HadHeartAttack
0 1.0 0.219401
1 2.0 0.118330
2 3.0 0.056664
3 4.0 0.028686
4 5.0 0.014112
valid.pivot_table('HadHeartAttack',index='GeneralHealth', columns='SmokerStatus')
SmokerStatus 0.0 1.0 2.0 3.0
GeneralHealth
1.0 0.163180 0.242991 0.259740 0.250000
2.0 0.085862 0.120438 0.158195 0.146465
3.0 0.038882 0.059574 0.083070 0.076079
4.0 0.023638 0.022901 0.039315 0.032688
5.0 0.011113 0.017544 0.020365 0.025316

Normalization, part 2: scaling numeric columns to the 0-1 range

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
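def scale_float_columns(dataset):
    # Rescale every float64 column of the given frame to [0, 1], in place.
    # fit_transform refits the shared scaler on each call, so each split uses its own min/max.
    numerical_columns = list(dataset.select_dtypes(include=['float64']).columns)
    dataset[numerical_columns] = scaler.fit_transform(dataset[numerical_columns])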
test.head()
State Male GeneralHealth PhysicalHealthDays MentalHealthDays LastCheckupTime PhysicalActivities SleepHours RemovedTeeth HadHeartAttack ... HeightInMeters WeightInKilograms BMI AlcoholDrinkers HIVTesting FluVaxLast12 PneumoVaxEver TetanusLast10Tdap HighRiskLastYear CovidPos
276058 New York 1.0 3.0 2.0 0.0 Within past 2 years (1 year but less than 2 ye... NaN 7.0 0.0 0.0 ... 1.55 NaN NaN 0.0 0.0 0.0 NaN 1.0 0.0 0.0
189605 Michigan 0.0 2.0 20.0 15.0 Within past year (anytime less than 12 months ... 1.0 5.0 3.0 0.0 ... 1.68 70.31 25.02 0.0 NaN 1.0 1.0 NaN 0.0 0.0
59234 Delaware 0.0 4.0 0.0 0.0 Within past year (anytime less than 12 months ... 1.0 6.0 0.0 0.0 ... 1.50 64.41 28.68 0.0 0.0 1.0 NaN 1.0 0.0 0.0
255322 New Mexico 1.0 3.0 0.0 0.0 5 or more years ago 1.0 6.0 0.0 0.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
226504 Montana 0.0 4.0 6.0 0.0 Within past year (anytime less than 12 months ... 1.0 8.0 0.0 0.0 ... 1.73 90.72 30.41 1.0 0.0 0.0 1.0 NaN 0.0 1.0

5 rows × 40 columns

scale_float_columns(test)
scale_float_columns(train)
scale_float_columns(valid)
test.head()
State Male GeneralHealth PhysicalHealthDays MentalHealthDays LastCheckupTime PhysicalActivities SleepHours RemovedTeeth HadHeartAttack ... HeightInMeters WeightInKilograms BMI AlcoholDrinkers HIVTesting FluVaxLast12 PneumoVaxEver TetanusLast10Tdap HighRiskLastYear CovidPos
276058 New York 1.0 0.50 0.066667 0.0 Within past 2 years (1 year but less than 2 ye... NaN 0.260870 0.0 0.0 ... 0.474074 NaN NaN 0.0 0.0 0.0 NaN 0.0 0.0 0.0
189605 Michigan 0.0 0.25 0.666667 0.5 Within past year (anytime less than 12 months ... 1.0 0.173913 1.0 0.0 ... 0.570370 0.187845 0.145127 0.0 NaN 1.0 1.0 NaN 0.0 0.0
59234 Delaware 0.0 0.75 0.000000 0.0 Within past year (anytime less than 12 months ... 1.0 0.217391 0.0 0.0 ... 0.437037 0.164576 0.188206 0.0 0.0 1.0 NaN 0.0 0.0 0.0
255322 New Mexico 1.0 0.50 0.000000 0.0 5 or more years ago 1.0 0.217391 0.0 0.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
226504 Montana 0.0 0.75 0.200000 0.0 Within past year (anytime less than 12 months ... 1.0 0.304348 0.0 0.0 ... 0.607407 0.268339 0.208569 1.0 0.0 0.0 1.0 NaN 0.0 1.0

5 rows × 40 columns
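
Because scale_float_columns calls fit_transform on every split, the same raw value can map to different scaled values in train, valid and test. A common alternative is to fit the scaler once on the training split and reuse it for the others. The sketch below assumes that train should define the scale; the helper name fit_and_scale and the toy frames are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def fit_and_scale(train, others):
    """Fit MinMaxScaler on train's float64 columns, then transform all splits in place."""
    cols = list(train.select_dtypes(include=["float64"]).columns)
    scaler = MinMaxScaler()
    train[cols] = scaler.fit_transform(train[cols])   # learn min/max from train only
    for frame in others:
        frame[cols] = scaler.transform(frame[cols])   # reuse the training min/max
    return scaler

# Hypothetical toy splits with made-up values.
toy_train = pd.DataFrame({"SleepHours": [4.0, 6.0, 8.0]})
toy_test = pd.DataFrame({"SleepHours": [6.0, 10.0]})

fit_and_scale(toy_train, [toy_test])
print(toy_train["SleepHours"].tolist())
print(toy_test["SleepHours"].tolist())
```

With this approach a value outside the training range ends up outside 0-1 in the other splits, which is the price of a consistent scale.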

5. Cleaning missing fields

We cannot simply use .dropna(), because a large share of the rows (199110 of 445132, about 45%) contain at least one missing value:

print(df.shape[0])
print(df.shape[0] - df.dropna().shape[0])
445132
199110
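
The same count can be read directly from a boolean mask, which also shows which columns the gaps come from. This is a minimal sketch on a made-up frame; the values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few gaps.
toy = pd.DataFrame({
    "BMI": [25.0, np.nan, 31.2, 27.4],
    "SleepHours": [7.0, 6.0, np.nan, 8.0],
    "Male": [1.0, 0.0, 0.0, 1.0],
})

rows_with_gaps = toy.isna().any(axis=1).sum()   # rows that dropna() would discard
share_of_rows = rows_with_gaps / len(toy)       # fraction of the frame that would be lost
missing_per_column = toy.isna().mean()          # fraction of missing values in each column

print(rows_with_gaps, share_of_rows)
print(missing_per_column)
```
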
test.head()
State Male GeneralHealth PhysicalHealthDays MentalHealthDays LastCheckupTime PhysicalActivities SleepHours RemovedTeeth HadHeartAttack ... HeightInMeters WeightInKilograms BMI AlcoholDrinkers HIVTesting FluVaxLast12 PneumoVaxEver TetanusLast10Tdap HighRiskLastYear CovidPos
276058 New York 1.0 0.50 0.066667 0.0 Within past 2 years (1 year but less than 2 ye... NaN 0.260870 0.0 0.0 ... 0.474074 NaN NaN 0.0 0.0 0.0 NaN 0.0 0.0 0.0
189605 Michigan 0.0 0.25 0.666667 0.5 Within past year (anytime less than 12 months ... 1.0 0.173913 1.0 0.0 ... 0.570370 0.187845 0.145127 0.0 NaN 1.0 1.0 NaN 0.0 0.0
59234 Delaware 0.0 0.75 0.000000 0.0 Within past year (anytime less than 12 months ... 1.0 0.217391 0.0 0.0 ... 0.437037 0.164576 0.188206 0.0 0.0 1.0 NaN 0.0 0.0 0.0
255322 New Mexico 1.0 0.50 0.000000 0.0 5 or more years ago 1.0 0.217391 0.0 0.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
226504 Montana 0.0 0.75 0.200000 0.0 Within past year (anytime less than 12 months ... 1.0 0.304348 0.0 0.0 ... 0.607407 0.268339 0.208569 1.0 0.0 0.0 1.0 NaN 0.0 1.0

5 rows × 40 columns

I fill in the missing values with the median:

numeric_columns = train.select_dtypes(include=['number']).columns
# .median() returns one value per column; .iloc[0] keeps only the first column's median,
# so every gap in a split is filled with that single scalar.
test[numeric_columns] = test[numeric_columns].fillna(test[numeric_columns].median().iloc[0])
train[numeric_columns] = train[numeric_columns].fillna(train[numeric_columns].median().iloc[0])
valid[numeric_columns] = valid[numeric_columns].fillna(valid[numeric_columns].median().iloc[0])
test.head()
State Male GeneralHealth PhysicalHealthDays MentalHealthDays LastCheckupTime PhysicalActivities SleepHours RemovedTeeth HadHeartAttack ... HeightInMeters WeightInKilograms BMI AlcoholDrinkers HIVTesting FluVaxLast12 PneumoVaxEver TetanusLast10Tdap HighRiskLastYear CovidPos
276058 New York 1.0 0.50 0.066667 0.0 Within past 2 years (1 year but less than 2 ye... 0.0 0.260870 0.0 0.0 ... 0.474074 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
189605 Michigan 0.0 0.25 0.666667 0.5 Within past year (anytime less than 12 months ... 1.0 0.173913 1.0 0.0 ... 0.570370 0.187845 0.145127 0.0 0.0 1.0 1.0 0.0 0.0 0.0
59234 Delaware 0.0 0.75 0.000000 0.0 Within past year (anytime less than 12 months ... 1.0 0.217391 0.0 0.0 ... 0.437037 0.164576 0.188206 0.0 0.0 1.0 0.0 0.0 0.0 0.0
255322 New Mexico 1.0 0.50 0.000000 0.0 5 or more years ago 1.0 0.217391 0.0 0.0 ... 0.000000 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
226504 Montana 0.0 0.75 0.200000 0.0 Within past year (anytime less than 12 months ... 1.0 0.304348 0.0 0.0 ... 0.607407 0.268339 0.208569 1.0 0.0 0.0 1.0 0.0 0.0 1.0

5 rows × 40 columns
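
DataFrame.median() returns one median per column, and passing that whole Series to fillna fills each column with its own median. The sketch below shows that per-column variant. It assumes the medians are taken from the training split and reused for the other splits, which is a design choice rather than something the notebook states; the toy frames are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy splits with made-up values.
toy_train = pd.DataFrame({
    "Male": [0.0, 0.0, 1.0],
    "BMI": [22.0, 30.0, np.nan],
})
toy_test = pd.DataFrame({
    "Male": [np.nan, 1.0],
    "BMI": [np.nan, 27.0],
})

numeric_columns = toy_train.select_dtypes(include=["number"]).columns
medians = toy_train[numeric_columns].median()   # Series: one median per column

# fillna with a Series aligns on column names, so each column gets its own median.
toy_train[numeric_columns] = toy_train[numeric_columns].fillna(medians)
toy_test[numeric_columns] = toy_test[numeric_columns].fillna(medians)

print(toy_test)
```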

I filled the categorical columns with the value "Unknown" during normalization, because fillna with the median does not work for this type of data (https://stackoverflow.com/questions/49127897/python-pandas-fillna-median-not-working)
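
The normalization step itself is not shown in this part of the notebook, so the following is only a sketch of one way such a fill could look. With the category dtype, "Unknown" has to be registered as a category before fillna can use it. The toy column and its labels are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing entry.
toy = pd.DataFrame({
    "AgeCategory": pd.Categorical(["Age 18 to 24", np.nan, "Age 65 to 69"]),
})

# "Unknown" must be a declared category before it can be used as a fill value.
toy["AgeCategory"] = toy["AgeCategory"].cat.add_categories("Unknown").fillna("Unknown")

print(toy["AgeCategory"].tolist())
```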

test["HighRiskLastYear"].value_counts()
HighRiskLastYear
0.0    42810
1.0     1703
Name: count, dtype: int64
test["HighRiskLastYear"].isna().sum()
0

No null values remain:

test.info()
<class 'pandas.core.frame.DataFrame'>
Index: 44513 entries, 276058 to 196692
Data columns (total 40 columns):
 #   Column                     Non-Null Count  Dtype   
---  ------                     --------------  -----   
 0   State                      44513 non-null  category
 1   Male                       44513 non-null  float64 
 2   GeneralHealth              44513 non-null  float64 
 3   PhysicalHealthDays         44513 non-null  float64 
 4   MentalHealthDays           44513 non-null  float64 
 5   LastCheckupTime            44513 non-null  category
 6   PhysicalActivities         44513 non-null  float64 
 7   SleepHours                 44513 non-null  float64 
 8   RemovedTeeth               44513 non-null  float64 
 9   HadHeartAttack             44513 non-null  float64 
 10  HadAngina                  44513 non-null  float64 
 11  HadStroke                  44513 non-null  float64 
 12  HadAsthma                  44513 non-null  float64 
 13  HadSkinCancer              44513 non-null  float64 
 14  HadCOPD                    44513 non-null  float64 
 15  HadDepressiveDisorder      44513 non-null  float64 
 16  HadKidneyDisease           44513 non-null  float64 
 17  HadArthritis               44513 non-null  float64 
 18  HadDiabetes                44513 non-null  float64 
 19  DeafOrHardOfHearing        44513 non-null  float64 
 20  BlindOrVisionDifficulty    44513 non-null  float64 
 21  DifficultyConcentrating    44513 non-null  float64 
 22  DifficultyWalking          44513 non-null  float64 
 23  DifficultyDressingBathing  44513 non-null  float64 
 24  DifficultyErrands          44513 non-null  float64 
 25  SmokerStatus               44513 non-null  float64 
 26  ECigaretteUsage            44513 non-null  float64 
 27  ChestScan                  44513 non-null  float64 
 28  RaceEthnicityCategory      44513 non-null  category
 29  AgeCategory                44513 non-null  category
 30  HeightInMeters             44513 non-null  float64 
 31  WeightInKilograms          44513 non-null  float64 
 32  BMI                        44513 non-null  float64 
 33  AlcoholDrinkers            44513 non-null  float64 
 34  HIVTesting                 44513 non-null  float64 
 35  FluVaxLast12               44513 non-null  float64 
 36  PneumoVaxEver              44513 non-null  float64 
 37  TetanusLast10Tdap          44513 non-null  float64 
 38  HighRiskLastYear           44513 non-null  float64 
 39  CovidPos                   44513 non-null  float64 
dtypes: category(4), float64(36)
memory usage: 12.7 MB
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 676617 entries, 0 to 676616
Data columns (total 40 columns):
 #   Column                     Non-Null Count   Dtype   
---  ------                     --------------   -----   
 0   State                      676617 non-null  category
 1   Male                       676617 non-null  float64 
 2   GeneralHealth              676617 non-null  float64 
 3   PhysicalHealthDays         676617 non-null  float64 
 4   MentalHealthDays           676617 non-null  float64 
 5   LastCheckupTime            676617 non-null  category
 6   PhysicalActivities         676617 non-null  float64 
 7   SleepHours                 676617 non-null  float64 
 8   RemovedTeeth               676617 non-null  float64 
 9   HadHeartAttack             676617 non-null  float64 
 10  HadAngina                  676617 non-null  float64 
 11  HadStroke                  676617 non-null  float64 
 12  HadAsthma                  676617 non-null  float64 
 13  HadSkinCancer              676617 non-null  float64 
 14  HadCOPD                    676617 non-null  float64 
 15  HadDepressiveDisorder      676617 non-null  float64 
 16  HadKidneyDisease           676617 non-null  float64 
 17  HadArthritis               676617 non-null  float64 
 18  HadDiabetes                676617 non-null  float64 
 19  DeafOrHardOfHearing        676617 non-null  float64 
 20  BlindOrVisionDifficulty    676617 non-null  float64 
 21  DifficultyConcentrating    676617 non-null  float64 
 22  DifficultyWalking          676617 non-null  float64 
 23  DifficultyDressingBathing  676617 non-null  float64 
 24  DifficultyErrands          676617 non-null  float64 
 25  SmokerStatus               676617 non-null  float64 
 26  ECigaretteUsage            676617 non-null  float64 
 27  ChestScan                  676617 non-null  float64 
 28  RaceEthnicityCategory      676617 non-null  category
 29  AgeCategory                676617 non-null  category
 30  HeightInMeters             676617 non-null  float64 
 31  WeightInKilograms          676617 non-null  float64 
 32  BMI                        676617 non-null  float64 
 33  AlcoholDrinkers            676617 non-null  float64 
 34  HIVTesting                 676617 non-null  float64 
 35  FluVaxLast12               676617 non-null  float64 
 36  PneumoVaxEver              676617 non-null  float64 
 37  TetanusLast10Tdap          676617 non-null  float64 
 38  HighRiskLastYear           676617 non-null  float64 
 39  CovidPos                   676617 non-null  float64 
dtypes: category(4), float64(36)
memory usage: 188.4 MB
valid.info()
<class 'pandas.core.frame.DataFrame'>
Index: 44514 entries, 127295 to 418173
Data columns (total 40 columns):
 #   Column                     Non-Null Count  Dtype   
---  ------                     --------------  -----   
 0   State                      44514 non-null  category
 1   Male                       44514 non-null  float64 
 2   GeneralHealth              44514 non-null  float64 
 3   PhysicalHealthDays         44514 non-null  float64 
 4   MentalHealthDays           44514 non-null  float64 
 5   LastCheckupTime            44514 non-null  category
 6   PhysicalActivities         44514 non-null  float64 
 7   SleepHours                 44514 non-null  float64 
 8   RemovedTeeth               44514 non-null  float64 
 9   HadHeartAttack             44514 non-null  float64 
 10  HadAngina                  44514 non-null  float64 
 11  HadStroke                  44514 non-null  float64 
 12  HadAsthma                  44514 non-null  float64 
 13  HadSkinCancer              44514 non-null  float64 
 14  HadCOPD                    44514 non-null  float64 
 15  HadDepressiveDisorder      44514 non-null  float64 
 16  HadKidneyDisease           44514 non-null  float64 
 17  HadArthritis               44514 non-null  float64 
 18  HadDiabetes                44514 non-null  float64 
 19  DeafOrHardOfHearing        44514 non-null  float64 
 20  BlindOrVisionDifficulty    44514 non-null  float64 
 21  DifficultyConcentrating    44514 non-null  float64 
 22  DifficultyWalking          44514 non-null  float64 
 23  DifficultyDressingBathing  44514 non-null  float64 
 24  DifficultyErrands          44514 non-null  float64 
 25  SmokerStatus               44514 non-null  float64 
 26  ECigaretteUsage            44514 non-null  float64 
 27  ChestScan                  44514 non-null  float64 
 28  RaceEthnicityCategory      44514 non-null  category
 29  AgeCategory                44514 non-null  category
 30  HeightInMeters             44514 non-null  float64 
 31  WeightInKilograms          44514 non-null  float64 
 32  BMI                        44514 non-null  float64 
 33  AlcoholDrinkers            44514 non-null  float64 
 34  HIVTesting                 44514 non-null  float64 
 35  FluVaxLast12               44514 non-null  float64 
 36  PneumoVaxEver              44514 non-null  float64 
 37  TetanusLast10Tdap          44514 non-null  float64 
 38  HighRiskLastYear           44514 non-null  float64 
 39  CovidPos                   44514 non-null  float64 
dtypes: category(4), float64(36)
memory usage: 12.7 MB

Saving to CSV

cat_columns = test.select_dtypes(['category']).columns
print(cat_columns)
Index(['State', 'LastCheckupTime', 'RaceEthnicityCategory', 'AgeCategory'], dtype='object')
#test[cat_columns] = test[cat_columns].apply(lambda x: pd.factorize(x)[0])
#train[cat_columns] = train[cat_columns].apply(lambda x: pd.factorize(x)[0])
#valid[cat_columns] = valid[cat_columns].apply(lambda x: pd.factorize(x)[0])
test.to_csv("test.csv")
train.to_csv("train.csv")
valid.to_csv("valid.csv")
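
to_csv writes the row index as the first column by default, and a CSV file does not store the category dtype. A sketch of reading such a file back, using an in-memory buffer in place of a file on disk and a made-up frame:

```python
import io
import pandas as pd

# Hypothetical toy frame with a non-default index and a categorical column.
toy = pd.DataFrame(
    {"State": pd.Categorical(["Montana", "Delaware"]), "BMI": [0.25, 0.5]},
    index=[226504, 59234],
)

buffer = io.StringIO()
toy.to_csv(buffer)    # the index is written as the first, unnamed column
buffer.seek(0)

# index_col=0 restores the original row labels; dtype restores the category type.
restored = pd.read_csv(buffer, index_col=0, dtype={"State": "category"})

print(restored.dtypes)
```

The same read_csv arguments would apply to "test.csv", "train.csv" and "valid.csv", with the four categorical column names listed in dtype.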