1. Downloading the dataset
!pip install --user kaggle
Requirement already satisfied: kaggle in c:\users\adrian\appdata\roaming\python\python39\site-packages (1.6.6) Requirement already satisfied: bleach in c:\users\adrian\miniconda3\lib\site-packages (from kaggle) (4.1.0) Requirement already satisfied: python-slugify in c:\users\adrian\appdata\roaming\python\python39\site-packages (from kaggle) (8.0.4) Requirement already satisfied: python-dateutil in c:\users\adrian\appdata\roaming\python\python39\site-packages (from kaggle) (2.8.2) Requirement already satisfied: tqdm in c:\users\adrian\appdata\roaming\python\python39\site-packages (from kaggle) (4.64.1) Requirement already satisfied: requests in c:\users\adrian\appdata\roaming\python\python39\site-packages (from kaggle) (2.28.1) Requirement already satisfied: certifi in c:\users\adrian\appdata\roaming\python\python39\site-packages (from kaggle) (2022.6.15) Requirement already satisfied: six>=1.10 in c:\users\adrian\appdata\roaming\python\python39\site-packages (from kaggle) (1.16.0) Requirement already satisfied: urllib3 in c:\users\adrian\appdata\roaming\python\python39\site-packages (from kaggle) (1.26.11) Requirement already satisfied: webencodings in c:\users\adrian\miniconda3\lib\site-packages (from bleach->kaggle) (0.5.1) Requirement already satisfied: packaging in c:\users\adrian\appdata\roaming\python\python39\site-packages (from bleach->kaggle) (22.0) Requirement already satisfied: text-unidecode>=1.3 in c:\users\adrian\appdata\roaming\python\python39\site-packages (from python-slugify->kaggle) (1.3) Requirement already satisfied: idna<4,>=2.5 in c:\users\adrian\appdata\roaming\python\python39\site-packages (from requests->kaggle) (2.10) Requirement already satisfied: charset-normalizer<3,>=2 in c:\users\adrian\appdata\roaming\python\python39\site-packages (from requests->kaggle) (2.1.0) Requirement already satisfied: colorama in c:\users\adrian\appdata\roaming\python\python39\site-packages (from tqdm->kaggle) (0.4.5)
!kaggle datasets download -d kamilpytlak/personal-key-indicators-of-heart-disease/
personal-key-indicators-of-heart-disease.zip: Skipping, found more recently modified local copy (use --force to force download)
#!unzip -o personal-key-indicators-of-heart-disease.zip  # does not work on Windows, so I use the zipfile module instead
import zipfile
with zipfile.ZipFile("personal-key-indicators-of-heart-disease.zip", 'r') as zip_ref:
    zip_ref.extractall("dataset_extracted")
import pandas as pd
# The downloaded dataset contains several subsets; I deliberately open the one with NaNs to clean it manually for practice
df = pd.read_csv("dataset_extracted/2022/heart_2022_with_nans.csv")
Exploring the uncleaned dataset
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 445132 entries, 0 to 445131
Data columns (total 40 columns):
 #   Column                     Non-Null Count   Dtype
---  ------                     --------------   -----
 0   State                      445132 non-null  object
 1   Sex                        445132 non-null  object
 2   GeneralHealth              443934 non-null  object
 3   PhysicalHealthDays         434205 non-null  float64
 4   MentalHealthDays           436065 non-null  float64
 5   LastCheckupTime            436824 non-null  object
 6   PhysicalActivities         444039 non-null  object
 7   SleepHours                 439679 non-null  float64
 8   RemovedTeeth               433772 non-null  object
 9   HadHeartAttack             442067 non-null  object
 10  HadAngina                  440727 non-null  object
 11  HadStroke                  443575 non-null  object
 12  HadAsthma                  443359 non-null  object
 13  HadSkinCancer              441989 non-null  object
 14  HadCOPD                    442913 non-null  object
 15  HadDepressiveDisorder      442320 non-null  object
 16  HadKidneyDisease           443206 non-null  object
 17  HadArthritis               442499 non-null  object
 18  HadDiabetes                444045 non-null  object
 19  DeafOrHardOfHearing        424485 non-null  object
 20  BlindOrVisionDifficulty    423568 non-null  object
 21  DifficultyConcentrating    420892 non-null  object
 22  DifficultyWalking          421120 non-null  object
 23  DifficultyDressingBathing  421217 non-null  object
 24  DifficultyErrands          419476 non-null  object
 25  SmokerStatus               409670 non-null  object
 26  ECigaretteUsage            409472 non-null  object
 27  ChestScan                  389086 non-null  object
 28  RaceEthnicityCategory      431075 non-null  object
 29  AgeCategory                436053 non-null  object
 30  HeightInMeters             416480 non-null  float64
 31  WeightInKilograms          403054 non-null  float64
 32  BMI                        396326 non-null  float64
 33  AlcoholDrinkers            398558 non-null  object
 34  HIVTesting                 379005 non-null  object
 35  FluVaxLast12               398011 non-null  object
 36  PneumoVaxEver              368092 non-null  object
 37  TetanusLast10Tdap          362616 non-null  object
 38  HighRiskLastYear           394509 non-null  object
 39  CovidPos                   394368 non-null  object
dtypes: float64(6), object(34)
memory usage: 135.8+ MB
df.head()
|  | State | Sex | GeneralHealth | PhysicalHealthDays | MentalHealthDays | LastCheckupTime | PhysicalActivities | SleepHours | RemovedTeeth | HadHeartAttack | ... | HeightInMeters | WeightInKilograms | BMI | AlcoholDrinkers | HIVTesting | FluVaxLast12 | PneumoVaxEver | TetanusLast10Tdap | HighRiskLastYear | CovidPos |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Alabama | Female | Very good | 0.0 | 0.0 | Within past year (anytime less than 12 months ... | No | 8.0 | NaN | No | ... | NaN | NaN | NaN | No | No | Yes | No | Yes, received tetanus shot but not sure what type | No | No |
1 | Alabama | Female | Excellent | 0.0 | 0.0 | NaN | No | 6.0 | NaN | No | ... | 1.60 | 68.04 | 26.57 | No | No | No | No | No, did not receive any tetanus shot in the pa... | No | No |
2 | Alabama | Female | Very good | 2.0 | 3.0 | Within past year (anytime less than 12 months ... | Yes | 5.0 | NaN | No | ... | 1.57 | 63.50 | 25.61 | No | No | No | No | NaN | No | Yes |
3 | Alabama | Female | Excellent | 0.0 | 0.0 | Within past year (anytime less than 12 months ... | Yes | 7.0 | NaN | No | ... | 1.65 | 63.50 | 23.30 | No | No | Yes | Yes | No, did not receive any tetanus shot in the pa... | No | No |
4 | Alabama | Female | Fair | 2.0 | 0.0 | Within past year (anytime less than 12 months ... | Yes | 9.0 | NaN | No | ... | 1.57 | 53.98 | 21.77 | Yes | No | No | Yes | No, did not receive any tetanus shot in the pa... | No | No |
5 rows × 40 columns
df.describe()
|  | PhysicalHealthDays | MentalHealthDays | SleepHours | HeightInMeters | WeightInKilograms | BMI |
---|---|---|---|---|---|---|
count | 434205.000000 | 436065.000000 | 439679.000000 | 416480.000000 | 403054.000000 | 396326.000000 |
mean | 4.347919 | 4.382649 | 7.022983 | 1.702691 | 83.074470 | 28.529842 |
std | 8.688912 | 8.387475 | 1.502425 | 0.107177 | 21.448173 | 6.554889 |
min | 0.000000 | 0.000000 | 1.000000 | 0.910000 | 22.680000 | 12.020000 |
25% | 0.000000 | 0.000000 | 6.000000 | 1.630000 | 68.040000 | 24.130000 |
50% | 0.000000 | 0.000000 | 7.000000 | 1.700000 | 80.740000 | 27.440000 |
75% | 3.000000 | 5.000000 | 8.000000 | 1.780000 | 95.250000 | 31.750000 |
max | 30.000000 | 30.000000 | 24.000000 | 2.410000 | 292.570000 | 99.640000 |
Only 6 columns are numeric at this point, so the summary omits statistics for the remaining 34.
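By default `describe()` summarises only numeric columns, which is why just six appear above. As a minimal sketch on a small made-up frame (the `toy` values are assumptions, not rows from the dataset), excluding numeric types instead yields count, number of unique values, most frequent value, and its frequency for the text columns:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df; the values are illustrative only.
toy = pd.DataFrame({
    "Sex": ["Female", "Male", "Female", None],
    "GeneralHealth": ["Good", "Good", "Fair", "Poor"],
    "BMI": [26.57, 25.61, None, 23.30],
})

# Default behaviour: numeric columns only.
numeric_summary = toy.describe()

# Non-numeric columns: count, unique, top (most frequent value), freq.
text_summary = toy.describe(exclude=[np.number])

print(numeric_summary)
print(text_summary)
```

The same call on the full frame would be `df.describe(exclude=[np.number])`.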
The dataset is imbalanced: the variable we want to predict, HadHeartAttack, is "No" (later encoded as 0) in the vast majority of cases:
df["HadHeartAttack"].value_counts().plot(kind="pie")
<AxesSubplot:ylabel='HadHeartAttack'>
df["HadHeartAttack"].value_counts()
No     416959
Yes     25108
Name: HadHeartAttack, dtype: int64
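To put a number on the imbalance, the counts printed above can be turned into class shares. This is a small sketch that re-enters the two counts by hand; on the real frame, `df["HadHeartAttack"].value_counts(normalize=True)` should give the same shares directly.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series({"No": 416959, "Yes": 25108}, name="HadHeartAttack")

# Share of each class among rows that have a label.
shares = counts / counts.sum()

# Imbalance ratio: negative cases per positive case.
ratio = counts["No"] / counts["Yes"]

print(shares.round(4))
print(f"{ratio:.1f} negatives per positive")
```

Roughly 5.7% of the labelled rows are positive, which is the motivation for the oversampling step later on.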
2. Splitting into subsets (train / dev / test, 8:1:1) and oversampling
from sklearn.model_selection import train_test_split
# The sklearn function has to be called twice, because it only splits into two subsets
train, test_and_valid = train_test_split(df, test_size=0.2) #0.8 train, 0.2 test&valid
test, valid = train_test_split(test_and_valid, test_size=0.5) #0.1 test, 0.1 valid
train["HadHeartAttack"].value_counts()
No     333634
Yes     20016
Name: HadHeartAttack, dtype: int64
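The split above is purely random, so the share of positive cases can drift slightly between subsets and changes on every run. A possible variant, shown here on a toy frame, passes `stratify` and `random_state` to `train_test_split`. The seed value 42, the toy data, and the `*_part` names are assumptions for illustration, not part of the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame: 1000 rows, 6% positives (illustrative numbers only).
toy = pd.DataFrame({
    "feature": range(1000),
    "HadHeartAttack": ["Yes"] * 60 + ["No"] * 940,
})

# Stratification needs a label for every row, so rows with a missing
# target would have to be dropped first (a no-op on this toy frame).
toy = toy.dropna(subset=["HadHeartAttack"])

# 80% train, 20% held out; stratify keeps the class ratio in both parts.
train_part, rest = train_test_split(
    toy, test_size=0.2, stratify=toy["HadHeartAttack"], random_state=42
)
# Split the held-out 20% in half: 10% test, 10% validation.
test_part, valid_part = train_test_split(
    rest, test_size=0.5, stratify=rest["HadHeartAttack"], random_state=42
)

print(len(train_part), len(test_part), len(valid_part))
```

Whether dropping rows without a label is acceptable depends on how those rows are meant to be used later; the notebook keeps them at this stage.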
The training set is still imbalanced, so I apply simple oversampling: the minority class is duplicated until the two classes are nearly equal in size.
def oversample(dataset):
    # Count the rows of each class (rows with a missing label match neither filter)
    num_true = len(dataset[dataset["HadHeartAttack"]=="Yes"])
    num_false = len(dataset[dataset["HadHeartAttack"]=="No"])
    # Number of extra copies of the minority class to append
    num_oversampling_steps = num_false//num_true
    oversampled = dataset.copy()
    for x in range(num_oversampling_steps):
        oversampled = pd.concat([oversampled, dataset[dataset["HadHeartAttack"]=="Yes"]], ignore_index=True)
    return oversampled
train = oversample(train)
train["HadHeartAttack"].value_counts().plot(kind="pie")
<AxesSubplot:ylabel='HadHeartAttack'>
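The `oversample` function appends whole copies of the minority class, so the classes end up nearly, but not exactly, equal. An alternative sketch is to draw minority rows with replacement until the counts match. The helper `oversample_exact` and its `seed` parameter are hypothetical, and the code assumes "Yes" is the smaller class.

```python
import pandas as pd

def oversample_exact(dataset, target="HadHeartAttack", seed=0):
    """Resample the minority class with replacement up to the majority size."""
    positives = dataset[dataset[target] == "Yes"]
    negatives = dataset[dataset[target] == "No"]
    # Extra minority rows needed to match the majority class
    # (assumes "Yes" is the minority, so this is not negative).
    missing = len(negatives) - len(positives)
    extra = positives.sample(n=missing, replace=True, random_state=seed)
    return pd.concat([dataset, extra], ignore_index=True)

# Toy data, illustrative only.
toy = pd.DataFrame({
    "BMI": [22.0, 31.5, 27.1, 24.8, 29.9, 35.2],
    "HadHeartAttack": ["No", "No", "No", "No", "Yes", "Yes"],
})
balanced = oversample_exact(toy)
print(balanced["HadHeartAttack"].value_counts())
```

Either way, oversampling should touch only the training subset, as in the notebook, so that the test and validation sets keep the original class ratio.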
test["HadHeartAttack"].value_counts().plot(kind="pie")
<AxesSubplot:ylabel='HadHeartAttack'>
valid["HadHeartAttack"].value_counts().plot(kind="pie")
<AxesSubplot:ylabel='HadHeartAttack'>
Proportions of smokers and non-smokers in the original dataset:
df["SmokerStatus"].value_counts().plot(kind="pie")
<AxesSubplot:ylabel='SmokerStatus'>
df["ECigaretteUsage"].value_counts().plot(kind="pie")
<AxesSubplot:ylabel='ECigaretteUsage'>
COVID statistics
df["CovidPos"].value_counts().plot(kind="pie")
<AxesSubplot:ylabel='CovidPos'>
Normalization, part 1: converting to numeric and categorical columns
Columns that describe health status and similar traits on a "poor/fair/good/excellent" scale are converted to numbers in a sensible order, increasing with the positive aspect of the given health factor. The same applies to how often a person smoked. Some columns are converted to categorical ones. The sex column is converted to a numeric one so that the model can use it later, although I had doubts about doing so in terms of political correctness.
df["Sex"].unique()
array(['Female', 'Male'], dtype=object)
df["GeneralHealth"].unique()
array(['Very good', 'Excellent', 'Fair', 'Poor', 'Good', nan], dtype=object)
health_map = {
"Excellent": 5,
"Very good": 4,
"Good": 3,
"Fair": 2,
"Poor": 1
}
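As a quick illustration of how such a dictionary is applied, the sketch below uses `Series.map` on a handful of made-up answers (the `answers` series is an assumption, not data from the notebook). Values absent from the dictionary, including missing ones, come out as NaN.

```python
import pandas as pd

health_map = {"Excellent": 5, "Very good": 4, "Good": 3, "Fair": 2, "Poor": 1}

# Illustrative answers, including one missing value.
answers = pd.Series(["Very good", "Excellent", "Fair", None, "Poor", "Good"])

# Series.map looks each value up in the dict; the missing answer stays NaN,
# which also turns the result into a float column.
encoded = answers.map(health_map)

print(encoded.tolist())
```

Because the codes grow with better health, comparisons such as `encoded >= 4` select the "Very good" and "Excellent" answers.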
for col in df:
    print(f"{col}:")
    print(df[col].unique())
State: ['Alabama' 'Alaska' 'Arizona' 'Arkansas' 'California' 'Colorado' 'Connecticut' 'Delaware' 'District of Columbia' 'Florida' 'Georgia' 'Hawaii' 'Idaho' 'Illinois' 'Indiana' 'Iowa' 'Kansas' 'Kentucky' 'Louisiana' 'Maine' 'Maryland' 'Massachusetts' 'Michigan' 'Minnesota' 'Mississippi' 'Missouri' 'Montana' 'Nebraska' 'Nevada' 'New Hampshire' 'New Jersey' 'New Mexico' 'New York' 'North Carolina' 'North Dakota' 'Ohio' 'Oklahoma' 'Oregon' 'Pennsylvania' 'Rhode Island' 'South Carolina' 'South Dakota' 'Tennessee' 'Texas' 'Utah' 'Vermont' 'Virginia' 'Washington' 'West Virginia' 'Wisconsin' 'Wyoming' 'Guam' 'Puerto Rico' 'Virgin Islands'] Sex: ['Female' 'Male'] GeneralHealth: ['Very good' 'Excellent' 'Fair' 'Poor' 'Good' nan] PhysicalHealthDays: [ 0. 2. 1. 8. 5. 30. 4. 23. 14. nan 15. 3. 10. 7. 25. 6. 21. 20. 29. 16. 9. 27. 28. 12. 13. 11. 26. 17. 24. 19. 18. 22.] MentalHealthDays: [ 0. 3. 9. 5. 15. 20. 14. 10. 18. 1. nan 2. 30. 4. 6. 7. 25. 8. 22. 29. 27. 21. 12. 28. 16. 13. 26. 17. 11. 23. 19. 24.] LastCheckupTime: ['Within past year (anytime less than 12 months ago)' nan 'Within past 2 years (1 year but less than 2 years ago)' 'Within past 5 years (2 years but less than 5 years ago)' '5 or more years ago'] PhysicalActivities: ['No' 'Yes' nan] SleepHours: [ 8. 6. 5. 7. 9. 4. 10. 1. 12. nan 18. 3. 2. 11. 16. 15. 13. 14. 20. 23. 17. 24. 22. 19. 21.] 
RemovedTeeth: [nan 'None of them' '1 to 5' '6 or more, but not all' 'All'] HadHeartAttack: ['No' 'Yes' nan] HadAngina: ['No' 'Yes' nan] HadStroke: ['No' 'Yes' nan] HadAsthma: ['No' 'Yes' nan] HadSkinCancer: ['No' 'Yes' nan] HadCOPD: ['No' 'Yes' nan] HadDepressiveDisorder: ['No' 'Yes' nan] HadKidneyDisease: ['No' 'Yes' nan] HadArthritis: ['No' 'Yes' nan] HadDiabetes: ['Yes' 'No' 'No, pre-diabetes or borderline diabetes' nan 'Yes, but only during pregnancy (female)'] DeafOrHardOfHearing: ['No' nan 'Yes'] BlindOrVisionDifficulty: ['No' 'Yes' nan] DifficultyConcentrating: ['No' nan 'Yes'] DifficultyWalking: ['No' 'Yes' nan] DifficultyDressingBathing: ['No' nan 'Yes'] DifficultyErrands: ['No' 'Yes' nan] SmokerStatus: ['Never smoked' 'Current smoker - now smokes some days' 'Former smoker' nan 'Current smoker - now smokes every day'] ECigaretteUsage: ['Not at all (right now)' 'Never used e-cigarettes in my entire life' nan 'Use them every day' 'Use them some days'] ChestScan: ['No' 'Yes' nan] RaceEthnicityCategory: ['White only, Non-Hispanic' 'Black only, Non-Hispanic' 'Other race only, Non-Hispanic' 'Multiracial, Non-Hispanic' nan 'Hispanic'] AgeCategory: ['Age 80 or older' 'Age 55 to 59' nan 'Age 40 to 44' 'Age 75 to 79' 'Age 70 to 74' 'Age 65 to 69' 'Age 60 to 64' 'Age 50 to 54' 'Age 45 to 49' 'Age 35 to 39' 'Age 25 to 29' 'Age 30 to 34' 'Age 18 to 24'] HeightInMeters: [ nan 1.6 1.57 1.65 1.8 1.63 1.7 1.68 1.73 1.55 1.93 1.88 1.78 1.85 1.75 1.52 1.83 1.91 1.96 1.5 1.45 1.42 1.24 1.47 1.22 1.98 2.03 2.01 1.3 1.4 1.35 1.82 1.67 1.76 2.11 1.37 1.64 1.71 2.16 2.26 0.91 2.06 1.14 1.74 1.51 1.53 1.69 1.56 1.84 1.9 1.54 1.72 1.87 1.61 1.49 1.59 1.58 1.62 1.79 1.46 1.89 2.13 0.99 2.08 2.21 1.32 2.18 1.77 2.36 1.25 1.66 1.86 1.95 1.19 1.05 1.48 1.03 1.18 1.81 1.38 1.44 1.07 1.27 1.2 1.17 1.04 2.24 1.1 1.43 1.92 2.05 1.12 2.41 2.34 0.97 1.06 1.15 2.29 1.16 1.09 0.92 2.07 1. 1.08 1.02 1.33 2. 
2.02 1.94 0.95] WeightInKilograms: [ nan 68.04 63.5 53.98 84.82 62.6 73.48 81.65 74.84 59.42 85.28 106.59 71.21 64.41 61.23 90.72 65.77 66.22 80.29 86.18 47.63 107.05 57.15 105.23 77.11 56.7 79.38 113.4 102.06 59.87 104.33 53.52 61.69 136.08 34.47 99.79 127.01 78.93 95.25 58.97 92.08 72.57 83.91 49.9 117.93 71.67 102.97 62.14 83.46 54.43 94.35 60.78 117.03 65.32 76.66 88.45 89.81 74.39 68.95 79.83 108.41 90.26 55.79 91.63 47.17 78.02 50.8 91.17 84.37 145.15 93.89 122.47 48.99 73.94 88.9 80.74 81.19 158.76 97.52 51.71 82.55 76.2 68.49 75.3 70.31 63.05 60.33 115.67 86.64 108.86 92.53 124.74 43.09 58.51 63.96 92.99 44.45 128.82 98.88 45.36 110.68 46.72 58.06 73.03 95.71 131.09 78.47 69.4 85.73 67.59 103.87 120.2 88. 54.88 111.58 52.16 77.56 126.55 94.8 123.83 89.36 75.75 69.85 112.49 82.1 106.14 57.61 70.76 148.78 96.16 67.13 48.08 163.29 109.77 100.7 142.88 64.86 111.13 121.11 55.34 101.6 93.44 117.48 120.66 66.68 44.91 132. 107.5 107.95 36.29 103.42 87.09 83.01 56.25 96.62 134.26 97.07 34.93 99.34 72.12 49.44 122.02 98.43 129.73 181.44 52.62 121.56 110.22 48.53 140.61 156.49 116.57 87.54 44. 114.31 31.75 97.98 101.15 112.04 100.24 113.85 154.22 118.39 133.81 149.69 41.73 119.75 138.35 151.95 129.27 131.54 104.78 132.45 102.51 116.12 40.37 105.69 136.98 195.04 53.07 132.9 124.28 112.94 114.76 45.81 119.29 167.83 51.26 172.37 162.39 46.27 127.91 123.38 38.56 130.63 143.34 115.21 166.92 135.17 109.32 135.62 204.12 127.46 118.84 139.25 126.1 122.92 151.5 133.36 42.64 50.35 80. 190.51 37.19 147.87 35.38 144.24 149.23 37.65 86. 147.42 281. 165.56 162.84 155.58 70. 137.89 189.6 206.38 148.32 42.18 153.77 38.1 90. 176.9 191.87 249.48 67. 95. 82. 170.1 62. 40.82 53. 139.71 130.18 100. 165.11 64. 43.54 24. 134.72 141.52 125.19 75. 60. 34.02 164.65 30.84 250. 58. 76. 73. 112. 74. 55. 200. 54. 66. 72. 152.41 39.46 220. 41.28 168.28 188.24 59. 46. 265. 238.14 168.74 145. 190. 93. 159.66 78. 50. 185.07 91. 104. 165. 183.7 33.57 161.93 68. 125.65 134. 130. 32.21 143.79 69. 
179.17 63. 105. 210.92 65. 32. 292.57 280. 85. 174.63 56. 128.37 87. 39.92 83. 169.64 156.04 177. 121. 151.05 89. 146.96 146.06 98. 166.47 36.74 171.46 227.25 29.48 190.06 161.03 35.83 226.8 175.09 138.8 240.4 158.3 170.55 61. 137.44 145.6 141.07 155.13 52. 120. 57. 77. 27.22 25.4 240. 96. 47. 115. 41. 45. 170. 150.59 272.16 26.31 48. 39.01 236. 92. 197.31 156. 84. 94. 29.03 49. 79. 157.85 192.78 255. 108. 185. 222.26 229.97 180. 81. 24.95 71. 26. 107. 101. 208.65 140. 175. 111. 110. 141.97 22.68 284.86 136.53 210. 103. 185.97 140.16 146.51 24.49 25.85 150. 102. 229.52 23.59 125. 163. 38. 135. 176.45 185.52 152.86 232.69 124. 192.32 186.88 118. 160.12 160. 193.68 201.85 144.7 184.16 142.43 169. 166.01 32.66 180.53 196.41 51. 40. 171.91 195.95 33.11 153.31 159.21 164.2 219.99 215.46 182.34 30. 160.57 173.27 158. 213.19 276.24 199.58 175.99 235.87 217.72 200.03 230.88 146. 24.04 178.72 150.14 157.4 163.75 191.42 174.18 28.58 97. 256.28 205.48 161.48 178.26 179.62 205.02 254.01 154.68 209.56 201.4 234.96 177.81 200.49 231.79 227.7 273.52 189.15 173.73 183.25 167.38 211.83 223.62 228.61 30.39 197.77 184.61 250.38 181.89 31.3 290.3 285. 113. 242.67 231.33 180.08 202.76 176. 188.69 206.84 164. 156.94 114. 122. 222. 137. 166. 180.98 272. 172.82 274.42 234.51 199.13 244.94 203.21 23.13 265.35 198.22 263.08 216.82 154. 169.19 239.04 177.35 210.47 224.98 117. 37. 126. 273.06 203.66 252.2 238.59 194.59 187.33 221.35 162. 224.53 23. 223.17 187.79 212.73 152. 233.6 193.23 205. 229.06 230. 247.21 99. 28.12 230.42 175.54 205.93 171. 26.76 212.28 217. 280.32 281.68 248.57 195. 42. 258.55 215. 116. 28. 123. 186.43 228.16 119. 219.09 214.55 278.96 182.8 138. 217.27 246.3 189. ] BMI: [ nan 26.57 25.61 ... 
13.51 28.39 48.63] AlcoholDrinkers: ['No' 'Yes' nan] HIVTesting: ['No' 'Yes' nan] FluVaxLast12: ['Yes' 'No' nan] PneumoVaxEver: ['No' 'Yes' nan] TetanusLast10Tdap: ['Yes, received tetanus shot but not sure what type' 'No, did not receive any tetanus shot in the past 10 years' nan 'Yes, received Tdap' 'Yes, received tetanus shot, but not Tdap'] HighRiskLastYear: ['No' nan 'Yes'] CovidPos: ['No' 'Yes' nan 'Tested positive using home test without a health professional']
from collections import defaultdict
def normalize_dataset(dataset):
    # Ordinal text scale -> numbers; values missing from health_map become NaN
    dataset["GeneralHealth"] = dataset["GeneralHealth"].map(defaultdict(lambda: float('NaN'), health_map), na_action='ignore')
    dataset["Sex"] = dataset["Sex"].map({"Female":0,"Male":1}).astype(float) # Convert text columns to numeric ones
    dataset.rename(columns ={"Sex":"Male"},inplace=True)
    dataset["State"] = dataset["State"].astype('category')
    dataset["PhysicalHealthDays"] = dataset["PhysicalHealthDays"].astype(float)
    dataset["MentalHealthDays"] = dataset["MentalHealthDays"].astype(float)
    dataset["LastCheckupTime"] = dataset["LastCheckupTime"].fillna("Unknown").astype('category') # Later I use fillna with the median, which does not work on categorical columns, so the gaps are filled here, before the conversion
    dataset["PhysicalActivities"]= dataset["PhysicalActivities"].map({"No":0,"Yes":1})
    dataset["SleepHours"] = dataset["SleepHours"].astype(float)
    dataset["RemovedTeeth"] = dataset["RemovedTeeth"].map(defaultdict(lambda: float('NaN'), {"None of them":0,"1 to 5":1, "6 or more, but not all":2, "All":3}), na_action='ignore')
    dataset["HadHeartAttack"]= dataset["HadHeartAttack"].map({"No":0,"Yes":1})
    dataset["HadAngina"]= dataset["HadAngina"].map({"No":0,"Yes":1})
    dataset["HadStroke"]= dataset["HadStroke"].map({"No":0,"Yes":1})
    dataset["HadAsthma"]= dataset["HadAsthma"].map({"No":0,"Yes":1})
    dataset["HadSkinCancer"]= dataset["HadSkinCancer"].map({"No":0,"Yes":1})
    dataset["HadCOPD"]= dataset["HadCOPD"].map({"No":0,"Yes":1})
    dataset["HadDepressiveDisorder"]= dataset["HadDepressiveDisorder"].map({"No":0,"Yes":1})
    dataset["HadKidneyDisease"]= dataset["HadKidneyDisease"].map({"No":0,"Yes":1})
    dataset["HadArthritis"]= dataset["HadArthritis"].map({"No":0,"Yes":1})
    dataset["HadDiabetes"]= dataset["HadDiabetes"].map({"No":0,"Yes, but only during pregnancy (female)":1,"No, pre-diabetes or borderline diabetes":2,"Yes":3})
    dataset["DeafOrHardOfHearing"]= dataset["DeafOrHardOfHearing"].map({"No":0,"Yes":1})
    dataset["BlindOrVisionDifficulty"]= dataset["BlindOrVisionDifficulty"].map({"No":0,"Yes":1})
    dataset["DifficultyConcentrating"]= dataset["DifficultyConcentrating"].map({"No":0,"Yes":1})
    dataset["DifficultyWalking"]= dataset["DifficultyWalking"].map({"No":0,"Yes":1})
    dataset["DifficultyDressingBathing"]= dataset["DifficultyDressingBathing"].map({"No":0,"Yes":1})
    dataset["DifficultyErrands"]= dataset["DifficultyErrands"].map({"No":0,"Yes":1})
    dataset["SmokerStatus"]= dataset["SmokerStatus"].map({"Never smoked":0,"Current smoker - now smokes some days":1,"Former smoker":2,"Current smoker - now smokes every day":3})
    dataset["ECigaretteUsage"]= dataset["ECigaretteUsage"].map({"Never used e-cigarettes in my entire life":0,"Not at all (right now)":1,"Use them some days":2,"Use them every day":3})
    dataset["ChestScan"]= dataset["ChestScan"].map({"No":0,"Yes":1})
    dataset["RaceEthnicityCategory"] = dataset["RaceEthnicityCategory"].fillna("Unknown").astype('category')
    dataset["AgeCategory"] = dataset["AgeCategory"].fillna("Unknown").astype('category')
    dataset["HeightInMeters"] = dataset["HeightInMeters"].astype(float)
    dataset["WeightInKilograms"] = dataset["WeightInKilograms"].astype(float)
    dataset["BMI"] = dataset["BMI"].astype(float)
    dataset["AlcoholDrinkers"]= dataset["AlcoholDrinkers"].map({"No":0,"Yes":1})
    dataset["HIVTesting"]= dataset["HIVTesting"].map({"No":0,"Yes":1})
    dataset["FluVaxLast12"]= dataset["FluVaxLast12"].map({"No":0,"Yes":1})
    dataset["PneumoVaxEver"]= dataset["PneumoVaxEver"].map({"No":0,"Yes":1})
    # NOTE: both the 'Yes,' and the 'No,' branch return 1.0, so this column ends up constant
    # (mean 1.0, std 0.0 in the describe() outputs); the 'No,' branch was presumably meant to return 0.0
    dataset["TetanusLast10Tdap"]= dataset["TetanusLast10Tdap"].apply(lambda x: float('NaN') if type(x)!=str else 1.0 if 'Yes,' in x else 1.0 if 'No,' in x else float('NaN'))
    dataset["HighRiskLastYear"]= dataset["HighRiskLastYear"].map({"No":0,"Yes":1})
    # 'Tested positive using home test without a health professional' is not in the map, so it becomes NaN
    dataset["CovidPos"]= dataset["CovidPos"].map({"No":0,"Yes":1})
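In `normalize_dataset`, the `TetanusLast10Tdap` lambda returns 1.0 for both the "Yes, ..." and the "No, ..." answers, so the encoded column carries no information. A possible correction is sketched below; `encode_tetanus` is a hypothetical helper, not part of the notebook, and it assumes every "No, ..." answer should become 0.0.

```python
import pandas as pd

def encode_tetanus(answer):
    """Map an answer to 1.0 ('Yes, ...'), 0.0 ('No, ...') or NaN (anything else)."""
    if not isinstance(answer, str):
        return float("nan")
    if answer.startswith("Yes,"):
        return 1.0
    if answer.startswith("No,"):
        return 0.0
    return float("nan")

# The four answer texts listed in the unique() output, plus a missing value.
answers = pd.Series([
    "Yes, received tetanus shot but not sure what type",
    "No, did not receive any tetanus shot in the past 10 years",
    None,
    "Yes, received Tdap",
    "Yes, received tetanus shot, but not Tdap",
])
encoded = answers.apply(encode_tetanus)
print(encoded.tolist())
```

With this helper the line in `normalize_dataset` could read `dataset["TetanusLast10Tdap"].apply(encode_tetanus)`; the outputs recorded in this notebook were produced without it.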
The test set before the data type conversion
test.head()
|  | State | Sex | GeneralHealth | PhysicalHealthDays | MentalHealthDays | LastCheckupTime | PhysicalActivities | SleepHours | RemovedTeeth | HadHeartAttack | ... | HeightInMeters | WeightInKilograms | BMI | AlcoholDrinkers | HIVTesting | FluVaxLast12 | PneumoVaxEver | TetanusLast10Tdap | HighRiskLastYear | CovidPos |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
166151 | Maryland | Male | Good | 0.0 | 0.0 | Within past 2 years (1 year but less than 2 ye... | Yes | 7.0 | None of them | No | ... | 1.78 | 90.72 | 28.70 | Yes | No | No | Yes | No, did not receive any tetanus shot in the pa... | No | No |
359753 | Texas | Male | Good | 0.0 | 0.0 | NaN | No | 8.0 | None of them | No | ... | 1.60 | 77.11 | 30.12 | No | No | No | No | No, did not receive any tetanus shot in the pa... | No | No |
286723 | Ohio | Male | Fair | 2.0 | 0.0 | Within past year (anytime less than 12 months ... | No | 8.0 | All | No | ... | 1.75 | 80.29 | 26.14 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
305100 | Oklahoma | Male | Very good | 0.0 | 0.0 | Within past year (anytime less than 12 months ... | Yes | 7.0 | None of them | No | ... | 1.80 | 97.52 | 29.99 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
199077 | Minnesota | Female | Excellent | 0.0 | 0.0 | Within past year (anytime less than 12 months ... | No | 8.0 | 1 to 5 | No | ... | 1.52 | 67.13 | 28.90 | Yes | Yes | No | No | NaN | No | Yes |
5 rows × 40 columns
The test set after the data type conversion
normalize_dataset(test)
test.head()
|  | State | Male | GeneralHealth | PhysicalHealthDays | MentalHealthDays | LastCheckupTime | PhysicalActivities | SleepHours | RemovedTeeth | HadHeartAttack | ... | HeightInMeters | WeightInKilograms | BMI | AlcoholDrinkers | HIVTesting | FluVaxLast12 | PneumoVaxEver | TetanusLast10Tdap | HighRiskLastYear | CovidPos |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
166151 | Maryland | 1.0 | 3.0 | 0.0 | 0.0 | Within past 2 years (1 year but less than 2 ye... | 1.0 | 7.0 | 0.0 | 0.0 | ... | 1.78 | 90.72 | 28.70 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
359753 | Texas | 1.0 | 3.0 | 0.0 | 0.0 | Unknown | 0.0 | 8.0 | 0.0 | 0.0 | ... | 1.60 | 77.11 | 30.12 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
286723 | Ohio | 1.0 | 2.0 | 2.0 | 0.0 | Within past year (anytime less than 12 months ... | 0.0 | 8.0 | 3.0 | 0.0 | ... | 1.75 | 80.29 | 26.14 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
305100 | Oklahoma | 1.0 | 4.0 | 0.0 | 0.0 | Within past year (anytime less than 12 months ... | 1.0 | 7.0 | 0.0 | 0.0 | ... | 1.80 | 97.52 | 29.99 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
199077 | Minnesota | 0.0 | 5.0 | 0.0 | 0.0 | Within past year (anytime less than 12 months ... | 0.0 | 8.0 | 1.0 | 0.0 | ... | 1.52 | 67.13 | 28.90 | 1.0 | 1.0 | 0.0 | 0.0 | NaN | 0.0 | 1.0 |
5 rows × 40 columns
test.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 44513 entries, 166151 to 306609
Data columns (total 40 columns):
 #   Column                     Non-Null Count  Dtype
---  ------                     --------------  -----
 0   State                      44513 non-null  category
 1   Male                       44513 non-null  float64
 2   GeneralHealth              44400 non-null  float64
 3   PhysicalHealthDays         43445 non-null  float64
 4   MentalHealthDays           43614 non-null  float64
 5   LastCheckupTime            44513 non-null  category
 6   PhysicalActivities         44402 non-null  float64
 7   SleepHours                 43952 non-null  float64
 8   RemovedTeeth               43394 non-null  float64
 9   HadHeartAttack             44227 non-null  float64
 10  HadAngina                  44101 non-null  float64
 11  HadStroke                  44360 non-null  float64
 12  HadAsthma                  44323 non-null  float64
 13  HadSkinCancer              44186 non-null  float64
 14  HadCOPD                    44276 non-null  float64
 15  HadDepressiveDisorder      44239 non-null  float64
 16  HadKidneyDisease           44336 non-null  float64
 17  HadArthritis               44254 non-null  float64
 18  HadDiabetes                44386 non-null  float64
 19  DeafOrHardOfHearing        42470 non-null  float64
 20  BlindOrVisionDifficulty    42355 non-null  float64
 21  DifficultyConcentrating    42104 non-null  float64
 22  DifficultyWalking          42121 non-null  float64
 23  DifficultyDressingBathing  42141 non-null  float64
 24  DifficultyErrands          41986 non-null  float64
 25  SmokerStatus               40996 non-null  float64
 26  ECigaretteUsage            40955 non-null  float64
 27  ChestScan                  38863 non-null  float64
 28  RaceEthnicityCategory      44513 non-null  category
 29  AgeCategory                44513 non-null  category
 30  HeightInMeters             41620 non-null  float64
 31  WeightInKilograms          40212 non-null  float64
 32  BMI                        39520 non-null  float64
 33  AlcoholDrinkers            39817 non-null  float64
 34  HIVTesting                 37853 non-null  float64
 35  FluVaxLast12               39779 non-null  float64
 36  PneumoVaxEver              36821 non-null  float64
 37  TetanusLast10Tdap          36315 non-null  float64
 38  HighRiskLastYear           39453 non-null  float64
 39  CovidPos                   38048 non-null  float64
dtypes: category(4), float64(36)
memory usage: 12.7 MB
normalize_dataset(train)
normalize_dataset(valid)
Statistics for the subsets after conversion to numeric columns
The 50th percentile is the median.
train.describe()
|  | Male | GeneralHealth | PhysicalHealthDays | MentalHealthDays | PhysicalActivities | SleepHours | RemovedTeeth | HadHeartAttack | HadAngina | HadStroke | ... | HeightInMeters | WeightInKilograms | BMI | AlcoholDrinkers | HIVTesting | FluVaxLast12 | PneumoVaxEver | TetanusLast10Tdap | HighRiskLastYear | CovidPos |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 676361.000000 | 674074.000000 | 655378.000000 | 660076.000000 | 674264.000000 | 665524.000000 | 653726.000000 | 673906.000000 | 657289.000000 | 672572.000000 | ... | 636986.000000 | 619364.000000 | 610744.000000 | 607296.000000 | 573796.000000 | 606661.000000 | 571290.000000 | 554109.0 | 601088.000000 | 585564.000000 |
mean | 0.538461 | 3.055939 | 6.729019 | 4.851376 | 0.689958 | 7.038852 | 0.978907 | 0.504925 | 0.263980 | 0.117729 | ... | 1.707282 | 84.678030 | 28.935880 | 0.456237 | 0.324913 | 0.571250 | 0.528798 | 1.0 | 0.035005 | 0.274667 |
std | 0.498519 | 1.138040 | 10.712884 | 9.106112 | 0.462511 | 1.729860 | 1.018776 | 0.499976 | 0.440789 | 0.322287 | ... | 0.108041 | 21.671753 | 6.616632 | 0.498082 | 0.468343 | 0.494898 | 0.499170 | 0.0 | 0.183792 | 0.446347 |
min | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.910000 | 22.680000 | 12.050000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.0 | 0.000000 | 0.000000 |
25% | 0.000000 | 2.000000 | 0.000000 | 0.000000 | 0.000000 | 6.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 1.630000 | 69.400000 | 24.410000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.0 | 0.000000 | 0.000000 |
50% | 1.000000 | 3.000000 | 0.000000 | 0.000000 | 1.000000 | 7.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | ... | 1.700000 | 81.650000 | 27.890000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.0 | 0.000000 | 0.000000 |
75% | 1.000000 | 4.000000 | 10.000000 | 5.000000 | 1.000000 | 8.000000 | 2.000000 | 1.000000 | 1.000000 | 0.000000 | ... | 1.780000 | 96.160000 | 32.280000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.0 | 0.000000 | 1.000000 |
max | 1.000000 | 5.000000 | 30.000000 | 30.000000 | 1.000000 | 24.000000 | 3.000000 | 1.000000 | 1.000000 | 1.000000 | ... | 2.410000 | 292.570000 | 99.640000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.0 | 1.000000 | 1.000000 |
8 rows × 36 columns
test.describe()
|  | Male | GeneralHealth | PhysicalHealthDays | MentalHealthDays | PhysicalActivities | SleepHours | RemovedTeeth | HadHeartAttack | HadAngina | HadStroke | ... | HeightInMeters | WeightInKilograms | BMI | AlcoholDrinkers | HIVTesting | FluVaxLast12 | PneumoVaxEver | TetanusLast10Tdap | HighRiskLastYear | CovidPos |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 44513.000000 | 44400.000000 | 43445.000000 | 43614.000000 | 44402.000000 | 43952.000000 | 43394.000000 | 44227.000000 | 44101.000000 | 44360.000000 | ... | 41620.000000 | 40212.000000 | 39520.000000 | 39817.000000 | 37853.000000 | 39779.000000 | 36821.000000 | 36315.0 | 39453.000000 | 38048.000000 |
mean | 0.468402 | 3.437815 | 4.266222 | 4.363989 | 0.761092 | 7.016746 | 0.683505 | 0.056775 | 0.059817 | 0.042178 | ... | 1.702194 | 83.020179 | 28.509078 | 0.531281 | 0.343539 | 0.529123 | 0.417615 | 1.0 | 0.043774 | 0.290896 |
std | 0.499006 | 1.047836 | 8.570562 | 8.340355 | 0.426421 | 1.490779 | 0.884697 | 0.231415 | 0.237151 | 0.200997 | ... | 0.107023 | 21.645999 | 6.571327 | 0.499027 | 0.474896 | 0.499157 | 0.493173 | 0.0 | 0.204594 | 0.454181 |
min | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.910000 | 22.680000 | 12.340000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.0 | 0.000000 | 0.000000 |
25% | 0.000000 | 3.000000 | 0.000000 | 0.000000 | 1.000000 | 6.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 1.630000 | 68.040000 | 24.030000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.0 | 0.000000 | 0.000000 |
50% | 0.000000 | 3.000000 | 0.000000 | 0.000000 | 1.000000 | 7.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 1.700000 | 80.740000 | 27.410000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 1.0 | 0.000000 | 0.000000 |
75% | 1.000000 | 4.000000 | 3.000000 | 5.000000 | 1.000000 | 8.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 1.780000 | 95.250000 | 31.660000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.0 | 0.000000 | 1.000000 |
max | 1.000000 | 5.000000 | 30.000000 | 30.000000 | 1.000000 | 24.000000 | 3.000000 | 1.000000 | 1.000000 | 1.000000 | ... | 2.240000 | 290.300000 | 94.660000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.0 | 1.000000 | 1.000000 |
8 rows × 36 columns
valid.describe()
|  | Male | GeneralHealth | PhysicalHealthDays | MentalHealthDays | PhysicalActivities | SleepHours | RemovedTeeth | HadHeartAttack | HadAngina | HadStroke | ... | HeightInMeters | WeightInKilograms | BMI | AlcoholDrinkers | HIVTesting | FluVaxLast12 | PneumoVaxEver | TetanusLast10Tdap | HighRiskLastYear | CovidPos |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 44514.000000 | 44388.000000 | 43430.000000 | 43655.000000 | 44413.000000 | 44011.000000 | 43420.000000 | 44190.000000 | 44057.000000 | 44339.000000 | ... | 41618.000000 | 40342.000000 | 39646.000000 | 39861.000000 | 37852.000000 | 39827.000000 | 36781.000000 | 36224.0 | 39424.000000 | 38088.000000 |
mean | 0.470571 | 3.429733 | 4.332006 | 4.394823 | 0.759215 | 7.014360 | 0.684938 | 0.058407 | 0.059605 | 0.041747 | ... | 1.702956 | 83.157638 | 28.558016 | 0.530493 | 0.341118 | 0.524795 | 0.410810 | 1.0 | 0.042639 | 0.288542 |
std | 0.499139 | 1.052177 | 8.681714 | 8.423078 | 0.427565 | 1.510916 | 0.880082 | 0.234514 | 0.236755 | 0.200012 | ... | 0.107780 | 21.489722 | 6.623124 | 0.499076 | 0.474091 | 0.499391 | 0.491988 | 0.0 | 0.202044 | 0.453091 |
min | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.910000 | 24.950000 | 12.020000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.0 | 0.000000 | 0.000000 |
25% | 0.000000 | 3.000000 | 0.000000 | 0.000000 | 1.000000 | 6.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 1.630000 | 68.040000 | 24.130000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.0 | 0.000000 | 0.000000 |
50% | 0.000000 | 3.000000 | 0.000000 | 0.000000 | 1.000000 | 7.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 1.700000 | 80.740000 | 27.415000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 1.0 | 0.000000 | 0.000000 |
75% | 1.000000 | 4.000000 | 3.000000 | 4.500000 | 1.000000 | 8.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 1.780000 | 95.250000 | 31.750000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.0 | 0.000000 | 1.000000 |
max | 1.000000 | 5.000000 | 30.000000 | 30.000000 | 1.000000 | 24.000000 | 3.000000 | 1.000000 | 1.000000 | 1.000000 | ... | 2.290000 | 285.000000 | 99.340000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.0 | 1.000000 | 1.000000 |
8 rows × 36 columns
There appears to be a correlation between body weight and heart attack:
import seaborn as sns
sns.set_theme()
g = sns.catplot(
    data=train, kind="bar",
    x="GeneralHealth", y="WeightInKilograms", hue="HadHeartAttack",
    errorbar="sd", palette="dark", alpha=.6, height=6
)
g.despine(left=True)
g.set_axis_labels("General health index", "Body mass (kg)")
g.legend.set_title("Had heart attack")
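The bar plot only suggests a relationship, so it may help to put a number on it. Below is a minimal sketch, assuming a small made-up frame `toy` in place of `train` (the values are invented for illustration). It computes the Pearson correlation between the weight column and the 0/1 outcome, and the mean weight per outcome group.

```python
import pandas as pd

# Hypothetical toy data standing in for `train`; the values are made up.
toy = pd.DataFrame({
    "WeightInKilograms": [60.0, 72.5, 85.0, 95.0, 110.0, 68.0],
    "HadHeartAttack":    [0.0,  0.0,  1.0,  0.0,  1.0,   0.0],
})

# Pearson correlation between a continuous column and a 0/1 column
# (this is the same quantity as the point-biserial correlation).
r = toy["WeightInKilograms"].corr(toy["HadHeartAttack"])

# Mean weight per outcome group, i.e. the comparison the bar plot draws.
mean_weight = toy.groupby("HadHeartAttack")["WeightInKilograms"].mean()

print(r)
print(mean_weight)
```

On the real data the same two calls would be applied to `train`; a value of `r` close to zero would mean the visual impression is weak.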
Smokers had heart attacks more often:
valid.groupby('SmokerStatus', as_index=False)['HadHeartAttack'].mean()
| | SmokerStatus | HadHeartAttack |
---|---|---|
0 | 0.0 | 0.036784 |
1 | 1.0 | 0.072586 |
2 | 2.0 | 0.092351 |
3 | 3.0 | 0.090198 |
People with a worse "GeneralHealth" index in this dataset had heart attacks more often:
valid.groupby('GeneralHealth', as_index=False)['HadHeartAttack'].mean()
| | GeneralHealth | HadHeartAttack |
---|---|---|
0 | 1.0 | 0.228482 |
1 | 2.0 | 0.125921 |
2 | 3.0 | 0.058758 |
3 | 4.0 | 0.028679 |
4 | 5.0 | 0.014917 |
valid.pivot_table('HadHeartAttack',index='GeneralHealth', columns='SmokerStatus')
GeneralHealth / SmokerStatus | 0.0 | 1.0 | 2.0 | 3.0 |
---|---|---|---|---|
1.0 | 0.176024 | 0.258065 | 0.257364 | 0.245614 |
2.0 | 0.085981 | 0.123404 | 0.178412 | 0.146520 |
3.0 | 0.041363 | 0.055769 | 0.090743 | 0.063766 |
4.0 | 0.019886 | 0.021918 | 0.048122 | 0.034739 |
5.0 | 0.012267 | 0.028736 | 0.020260 | 0.021978 |
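A mean in a pivot cell says nothing about how many rows it is based on, and sparsely populated cells can be noisy. As a rough sketch, assuming a made-up frame `toy` with the same three column names, the same `pivot_table` call with `aggfunc="count"` shows the number of rows behind each rate.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration.
toy = pd.DataFrame({
    "GeneralHealth":  [1.0, 1.0, 1.0, 5.0, 5.0, 5.0, 5.0],
    "SmokerStatus":   [0.0, 0.0, 3.0, 0.0, 0.0, 0.0, 3.0],
    "HadHeartAttack": [1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0],
})

# Heart attack rate per (GeneralHealth, SmokerStatus) cell (default aggfunc is the mean).
rates = toy.pivot_table("HadHeartAttack", index="GeneralHealth", columns="SmokerStatus")

# Number of rows each of those rates is computed from.
counts = toy.pivot_table("HadHeartAttack", index="GeneralHealth",
                         columns="SmokerStatus", aggfunc="count")

print(rates)
print(counts)
```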
Normalization part 2 - Scaling numerical columns to 0-1
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
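def scale_float_columns(dataset):
    # Scale every float64 column of `dataset` to the 0-1 range, in place.
    # Note: fit_transform refits the scaler on every call, so each dataset
    # is scaled by its own minimum and maximum.
    numerical_columns = list(dataset.select_dtypes(include=['float64']).columns)
    dataset[numerical_columns] = scaler.fit_transform(dataset[numerical_columns])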
test.head()
| | State | Male | GeneralHealth | PhysicalHealthDays | MentalHealthDays | LastCheckupTime | PhysicalActivities | SleepHours | RemovedTeeth | HadHeartAttack | ... | HeightInMeters | WeightInKilograms | BMI | AlcoholDrinkers | HIVTesting | FluVaxLast12 | PneumoVaxEver | TetanusLast10Tdap | HighRiskLastYear | CovidPos |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
166151 | Maryland | 1.0 | 3.0 | 0.0 | 0.0 | Within past 2 years (1 year but less than 2 ye... | 1.0 | 7.0 | 0.0 | 0.0 | ... | 1.78 | 90.72 | 28.70 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
359753 | Texas | 1.0 | 3.0 | 0.0 | 0.0 | Unknown | 0.0 | 8.0 | 0.0 | 0.0 | ... | 1.60 | 77.11 | 30.12 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
286723 | Ohio | 1.0 | 2.0 | 2.0 | 0.0 | Within past year (anytime less than 12 months ... | 0.0 | 8.0 | 3.0 | 0.0 | ... | 1.75 | 80.29 | 26.14 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
305100 | Oklahoma | 1.0 | 4.0 | 0.0 | 0.0 | Within past year (anytime less than 12 months ... | 1.0 | 7.0 | 0.0 | 0.0 | ... | 1.80 | 97.52 | 29.99 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
199077 | Minnesota | 0.0 | 5.0 | 0.0 | 0.0 | Within past year (anytime less than 12 months ... | 0.0 | 8.0 | 1.0 | 0.0 | ... | 1.52 | 67.13 | 28.90 | 1.0 | 1.0 | 0.0 | 0.0 | NaN | 0.0 | 1.0 |
5 rows × 40 columns
scale_float_columns(test)
scale_float_columns(train)
scale_float_columns(valid)
test.head()
| | State | Male | GeneralHealth | PhysicalHealthDays | MentalHealthDays | LastCheckupTime | PhysicalActivities | SleepHours | RemovedTeeth | HadHeartAttack | ... | HeightInMeters | WeightInKilograms | BMI | AlcoholDrinkers | HIVTesting | FluVaxLast12 | PneumoVaxEver | TetanusLast10Tdap | HighRiskLastYear | CovidPos |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
166151 | Maryland | 1.0 | 0.50 | 0.000000 | 0.0 | Within past 2 years (1 year but less than 2 ye... | 1.0 | 0.260870 | 0.000000 | 0.0 | ... | 0.654135 | 0.254241 | 0.198737 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
359753 | Texas | 1.0 | 0.50 | 0.000000 | 0.0 | Unknown | 0.0 | 0.304348 | 0.000000 | 0.0 | ... | 0.518797 | 0.203385 | 0.215986 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
286723 | Ohio | 1.0 | 0.25 | 0.066667 | 0.0 | Within past year (anytime less than 12 months ... | 0.0 | 0.304348 | 1.000000 | 0.0 | ... | 0.631579 | 0.215268 | 0.167638 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
305100 | Oklahoma | 1.0 | 0.75 | 0.000000 | 0.0 | Within past year (anytime less than 12 months ... | 1.0 | 0.260870 | 0.000000 | 0.0 | ... | 0.669173 | 0.279650 | 0.214407 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
199077 | Minnesota | 0.0 | 1.00 | 0.000000 | 0.0 | Within past year (anytime less than 12 months ... | 0.0 | 0.304348 | 0.333333 | 0.0 | ... | 0.458647 | 0.166094 | 0.201166 | 1.0 | 1.0 | 0.0 | 0.0 | NaN | 0.0 | 1.0 |
5 rows × 40 columns
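Calling `fit_transform` separately on each split gives every split its own minimum and maximum, so the same raw value can map to different scaled values in train, valid and test. A common alternative, sketched here under the assumption of two small made-up frames `train_toy` and `valid_toy`, is to fit the scaler on the training split only and reuse it for the others.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy splits; the values are made up for illustration.
train_toy = pd.DataFrame({"SleepHours": [4.0, 8.0, 12.0],
                          "State": ["Ohio", "Texas", "Ohio"]})
valid_toy = pd.DataFrame({"SleepHours": [6.0, 14.0],
                          "State": ["Texas", "Ohio"]})

# Only float64 columns are scaled, as in scale_float_columns.
float_cols = list(train_toy.select_dtypes(include=["float64"]).columns)

scaler = MinMaxScaler()
# Learn min/max from the training split only...
train_toy[float_cols] = scaler.fit_transform(train_toy[float_cols])
# ...and apply the same mapping to the other split.
valid_toy[float_cols] = scaler.transform(valid_toy[float_cols])

print(train_toy)
print(valid_toy)
```

With this variant, values in the other splits can fall slightly outside 0-1 when they exceed the range seen in training (14 hours in the toy example).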
5. Cleaning missing fields
We cannot simply use .dropna(), because a large share of rows has at least one missing value (199110 of 445132, about 45%):
print(df.shape[0])
print(df.shape[0] - df.dropna().shape[0])
445132
199110
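The row count alone does not show which columns cause the loss. A small sketch, assuming a made-up frame `toy` in place of `df`, that reports the share of missing values per column next to the number of affected rows:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for `df`; the values are made up.
toy = pd.DataFrame({
    "BMI":        [28.7, np.nan, 26.1, 30.0],
    "HIVTesting": [0.0, np.nan, np.nan, 1.0],
    "State":      ["Ohio", "Texas", "Ohio", "Texas"],
})

# Rows that dropna() would remove (same computation as in the cell above).
rows_with_missing = toy.shape[0] - toy.dropna().shape[0]

# Fraction of missing values in each column, worst first.
missing_share = toy.isna().mean().sort_values(ascending=False)

print(rows_with_missing)
print(missing_share)
```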
test.head()
| | State | Male | GeneralHealth | PhysicalHealthDays | MentalHealthDays | LastCheckupTime | PhysicalActivities | SleepHours | RemovedTeeth | HadHeartAttack | ... | HeightInMeters | WeightInKilograms | BMI | AlcoholDrinkers | HIVTesting | FluVaxLast12 | PneumoVaxEver | TetanusLast10Tdap | HighRiskLastYear | CovidPos |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
166151 | Maryland | 1.0 | 0.50 | 0.000000 | 0.0 | Within past 2 years (1 year but less than 2 ye... | 1.0 | 0.260870 | 0.000000 | 0.0 | ... | 0.654135 | 0.254241 | 0.198737 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
359753 | Texas | 1.0 | 0.50 | 0.000000 | 0.0 | Unknown | 0.0 | 0.304348 | 0.000000 | 0.0 | ... | 0.518797 | 0.203385 | 0.215986 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
286723 | Ohio | 1.0 | 0.25 | 0.066667 | 0.0 | Within past year (anytime less than 12 months ... | 0.0 | 0.304348 | 1.000000 | 0.0 | ... | 0.631579 | 0.215268 | 0.167638 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
305100 | Oklahoma | 1.0 | 0.75 | 0.000000 | 0.0 | Within past year (anytime less than 12 months ... | 1.0 | 0.260870 | 0.000000 | 0.0 | ... | 0.669173 | 0.279650 | 0.214407 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
199077 | Minnesota | 0.0 | 1.00 | 0.000000 | 0.0 | Within past year (anytime less than 12 months ... | 0.0 | 0.304348 | 0.333333 | 0.0 | ... | 0.458647 | 0.166094 | 0.201166 | 1.0 | 1.0 | 0.0 | 0.0 | NaN | 0.0 | 1.0 |
5 rows × 40 columns
I fill the missing values with the median:
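numeric_columns = train.select_dtypes(include=['number']).columns
# Note: .median().iloc[0] is a single scalar (the median of the first numeric
# column, Male), so every numeric column is filled with that one value.
test[numeric_columns] = test[numeric_columns].fillna(test[numeric_columns].median().iloc[0])
train[numeric_columns] = train[numeric_columns].fillna(train[numeric_columns].median().iloc[0])
# Note: there is no .median() here; each column is filled from the first row
# of valid, so a column that is NaN in that row keeps its missing values.
valid[numeric_columns] = valid[numeric_columns].fillna(valid[numeric_columns].iloc[0])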
test.head()
| | State | Male | GeneralHealth | PhysicalHealthDays | MentalHealthDays | LastCheckupTime | PhysicalActivities | SleepHours | RemovedTeeth | HadHeartAttack | ... | HeightInMeters | WeightInKilograms | BMI | AlcoholDrinkers | HIVTesting | FluVaxLast12 | PneumoVaxEver | TetanusLast10Tdap | HighRiskLastYear | CovidPos |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
166151 | Maryland | 1.0 | 0.50 | 0.000000 | 0.0 | Within past 2 years (1 year but less than 2 ye... | 1.0 | 0.260870 | 0.000000 | 0.0 | ... | 0.654135 | 0.254241 | 0.198737 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
359753 | Texas | 1.0 | 0.50 | 0.000000 | 0.0 | Unknown | 0.0 | 0.304348 | 0.000000 | 0.0 | ... | 0.518797 | 0.203385 | 0.215986 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
286723 | Ohio | 1.0 | 0.25 | 0.066667 | 0.0 | Within past year (anytime less than 12 months ... | 0.0 | 0.304348 | 1.000000 | 0.0 | ... | 0.631579 | 0.215268 | 0.167638 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
305100 | Oklahoma | 1.0 | 0.75 | 0.000000 | 0.0 | Within past year (anytime less than 12 months ... | 1.0 | 0.260870 | 0.000000 | 0.0 | ... | 0.669173 | 0.279650 | 0.214407 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
199077 | Minnesota | 0.0 | 1.00 | 0.000000 | 0.0 | Within past year (anytime less than 12 months ... | 0.0 | 0.304348 | 0.333333 | 0.0 | ... | 0.458647 | 0.166094 | 0.201166 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
5 rows × 40 columns
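For comparison, a per-column median fill can be sketched as follows. This is an illustration under assumptions, not the cell that produced the output above: the frames `train_toy` and `test_toy` are made up, and taking the medians from the training split only is one possible design choice.

```python
import numpy as np
import pandas as pd

# Hypothetical toy splits; the values are made up for illustration.
train_toy = pd.DataFrame({"BMI": [20.0, 30.0, np.nan, 40.0],
                          "AlcoholDrinkers": [1.0, 1.0, 0.0, np.nan]})
test_toy = pd.DataFrame({"BMI": [np.nan, 25.0],
                         "AlcoholDrinkers": [np.nan, 0.0]})

numeric_cols = train_toy.select_dtypes(include=["number"]).columns

# A Series of medians indexed by column name; passing it to fillna
# fills each column with its own median.
medians = train_toy[numeric_cols].median()

train_toy[numeric_cols] = train_toy[numeric_cols].fillna(medians)
test_toy[numeric_cols] = test_toy[numeric_cols].fillna(medians)

print(medians)
print(test_toy)
```

Passing the whole Series (without `.iloc[0]`) is what makes the fill column-specific.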
I filled the categorical columns with "Unknown" during normalization, because fillna with the median does not work for this data type (https://stackoverflow.com/questions/49127897/python-pandas-fillna-median-not-working)
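The normalization cell itself is not shown in this part, so the following is only a sketch of how such a fill can look for a `category` column. The frame `toy` and its labels `label_a` and `label_b` are placeholders, not values from the dataset. The fill value has to be registered as a category first.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column; "label_a" and "label_b" are placeholder labels.
toy = pd.DataFrame({
    "LastCheckupTime": pd.Categorical(["label_a", np.nan, "label_b"]),
})

col = toy["LastCheckupTime"]

# fillna on a categorical column only accepts existing categories,
# so "Unknown" is added to the category list before filling.
if "Unknown" not in col.cat.categories:
    col = col.cat.add_categories("Unknown")

toy["LastCheckupTime"] = col.fillna("Unknown")

print(toy["LastCheckupTime"].tolist())
```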
test["HighRiskLastYear"].value_counts()
0.0    42786
1.0    1727
Name: HighRiskLastYear, dtype: int64
test["HighRiskLastYear"].isna().sum()
0
No null values remain in test and train; in valid, PneumoVaxEver still has missing values (36781 non-null out of 44514):
test.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 44513 entries, 166151 to 306609
Data columns (total 40 columns):
# | Column | Non-Null Count | Dtype |
---|---|---|---|
0 | State | 44513 non-null | category |
1 | Male | 44513 non-null | float64 |
2 | GeneralHealth | 44513 non-null | float64 |
3 | PhysicalHealthDays | 44513 non-null | float64 |
4 | MentalHealthDays | 44513 non-null | float64 |
5 | LastCheckupTime | 44513 non-null | category |
6 | PhysicalActivities | 44513 non-null | float64 |
7 | SleepHours | 44513 non-null | float64 |
8 | RemovedTeeth | 44513 non-null | float64 |
9 | HadHeartAttack | 44513 non-null | float64 |
10 | HadAngina | 44513 non-null | float64 |
11 | HadStroke | 44513 non-null | float64 |
12 | HadAsthma | 44513 non-null | float64 |
13 | HadSkinCancer | 44513 non-null | float64 |
14 | HadCOPD | 44513 non-null | float64 |
15 | HadDepressiveDisorder | 44513 non-null | float64 |
16 | HadKidneyDisease | 44513 non-null | float64 |
17 | HadArthritis | 44513 non-null | float64 |
18 | HadDiabetes | 44513 non-null | float64 |
19 | DeafOrHardOfHearing | 44513 non-null | float64 |
20 | BlindOrVisionDifficulty | 44513 non-null | float64 |
21 | DifficultyConcentrating | 44513 non-null | float64 |
22 | DifficultyWalking | 44513 non-null | float64 |
23 | DifficultyDressingBathing | 44513 non-null | float64 |
24 | DifficultyErrands | 44513 non-null | float64 |
25 | SmokerStatus | 44513 non-null | float64 |
26 | ECigaretteUsage | 44513 non-null | float64 |
27 | ChestScan | 44513 non-null | float64 |
28 | RaceEthnicityCategory | 44513 non-null | category |
29 | AgeCategory | 44513 non-null | category |
30 | HeightInMeters | 44513 non-null | float64 |
31 | WeightInKilograms | 44513 non-null | float64 |
32 | BMI | 44513 non-null | float64 |
33 | AlcoholDrinkers | 44513 non-null | float64 |
34 | HIVTesting | 44513 non-null | float64 |
35 | FluVaxLast12 | 44513 non-null | float64 |
36 | PneumoVaxEver | 44513 non-null | float64 |
37 | TetanusLast10Tdap | 44513 non-null | float64 |
38 | HighRiskLastYear | 44513 non-null | float64 |
39 | CovidPos | 44513 non-null | float64 |
dtypes: category(4), float64(36)
memory usage: 12.7 MB
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 676361 entries, 0 to 676360
Data columns (total 40 columns):
# | Column | Non-Null Count | Dtype |
---|---|---|---|
0 | State | 676361 non-null | category |
1 | Male | 676361 non-null | float64 |
2 | GeneralHealth | 676361 non-null | float64 |
3 | PhysicalHealthDays | 676361 non-null | float64 |
4 | MentalHealthDays | 676361 non-null | float64 |
5 | LastCheckupTime | 676361 non-null | category |
6 | PhysicalActivities | 676361 non-null | float64 |
7 | SleepHours | 676361 non-null | float64 |
8 | RemovedTeeth | 676361 non-null | float64 |
9 | HadHeartAttack | 676361 non-null | float64 |
10 | HadAngina | 676361 non-null | float64 |
11 | HadStroke | 676361 non-null | float64 |
12 | HadAsthma | 676361 non-null | float64 |
13 | HadSkinCancer | 676361 non-null | float64 |
14 | HadCOPD | 676361 non-null | float64 |
15 | HadDepressiveDisorder | 676361 non-null | float64 |
16 | HadKidneyDisease | 676361 non-null | float64 |
17 | HadArthritis | 676361 non-null | float64 |
18 | HadDiabetes | 676361 non-null | float64 |
19 | DeafOrHardOfHearing | 676361 non-null | float64 |
20 | BlindOrVisionDifficulty | 676361 non-null | float64 |
21 | DifficultyConcentrating | 676361 non-null | float64 |
22 | DifficultyWalking | 676361 non-null | float64 |
23 | DifficultyDressingBathing | 676361 non-null | float64 |
24 | DifficultyErrands | 676361 non-null | float64 |
25 | SmokerStatus | 676361 non-null | float64 |
26 | ECigaretteUsage | 676361 non-null | float64 |
27 | ChestScan | 676361 non-null | float64 |
28 | RaceEthnicityCategory | 676361 non-null | category |
29 | AgeCategory | 676361 non-null | category |
30 | HeightInMeters | 676361 non-null | float64 |
31 | WeightInKilograms | 676361 non-null | float64 |
32 | BMI | 676361 non-null | float64 |
33 | AlcoholDrinkers | 676361 non-null | float64 |
34 | HIVTesting | 676361 non-null | float64 |
35 | FluVaxLast12 | 676361 non-null | float64 |
36 | PneumoVaxEver | 676361 non-null | float64 |
37 | TetanusLast10Tdap | 676361 non-null | float64 |
38 | HighRiskLastYear | 676361 non-null | float64 |
39 | CovidPos | 676361 non-null | float64 |
dtypes: category(4), float64(36)
memory usage: 188.4 MB
valid.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 44514 entries, 399706 to 77011
Data columns (total 40 columns):
# | Column | Non-Null Count | Dtype |
---|---|---|---|
0 | State | 44514 non-null | category |
1 | Male | 44514 non-null | float64 |
2 | GeneralHealth | 44514 non-null | float64 |
3 | PhysicalHealthDays | 44514 non-null | float64 |
4 | MentalHealthDays | 44514 non-null | float64 |
5 | LastCheckupTime | 44514 non-null | category |
6 | PhysicalActivities | 44514 non-null | float64 |
7 | SleepHours | 44514 non-null | float64 |
8 | RemovedTeeth | 44514 non-null | float64 |
9 | HadHeartAttack | 44514 non-null | float64 |
10 | HadAngina | 44514 non-null | float64 |
11 | HadStroke | 44514 non-null | float64 |
12 | HadAsthma | 44514 non-null | float64 |
13 | HadSkinCancer | 44514 non-null | float64 |
14 | HadCOPD | 44514 non-null | float64 |
15 | HadDepressiveDisorder | 44514 non-null | float64 |
16 | HadKidneyDisease | 44514 non-null | float64 |
17 | HadArthritis | 44514 non-null | float64 |
18 | HadDiabetes | 44514 non-null | float64 |
19 | DeafOrHardOfHearing | 44514 non-null | float64 |
20 | BlindOrVisionDifficulty | 44514 non-null | float64 |
21 | DifficultyConcentrating | 44514 non-null | float64 |
22 | DifficultyWalking | 44514 non-null | float64 |
23 | DifficultyDressingBathing | 44514 non-null | float64 |
24 | DifficultyErrands | 44514 non-null | float64 |
25 | SmokerStatus | 44514 non-null | float64 |
26 | ECigaretteUsage | 44514 non-null | float64 |
27 | ChestScan | 44514 non-null | float64 |
28 | RaceEthnicityCategory | 44514 non-null | category |
29 | AgeCategory | 44514 non-null | category |
30 | HeightInMeters | 44514 non-null | float64 |
31 | WeightInKilograms | 44514 non-null | float64 |
32 | BMI | 44514 non-null | float64 |
33 | AlcoholDrinkers | 44514 non-null | float64 |
34 | HIVTesting | 44514 non-null | float64 |
35 | FluVaxLast12 | 44514 non-null | float64 |
36 | PneumoVaxEver | 36781 non-null | float64 |
37 | TetanusLast10Tdap | 44514 non-null | float64 |
38 | HighRiskLastYear | 44514 non-null | float64 |
39 | CovidPos | 44514 non-null | float64 |
dtypes: category(4), float64(36)
memory usage: 12.7 MB
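Reading three 40-row listings by eye makes it easy to miss a column that still has gaps. A shorter check, sketched here on a made-up frame `toy` (on the real data it would be applied to `test`, `train` and `valid` in turn), lists only the columns with remaining nulls.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are made up for illustration.
toy = pd.DataFrame({
    "FluVaxLast12":  [0.0, 1.0, 0.0],
    "PneumoVaxEver": [np.nan, 1.0, np.nan],
})

# Number of missing values per column.
null_counts = toy.isna().sum()

# Keep only the columns that still contain missing values.
remaining = null_counts[null_counts > 0]

print(remaining)
```

An empty result means the split is fully imputed.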