1. Downloading the dataset
!pip install --user kaggle
Requirement already satisfied: kaggle in c:\users\adrian\appdata\roaming\python\python39\site-packages (1.6.6) Requirement already satisfied: bleach in c:\users\adrian\miniconda3\lib\site-packages (from kaggle) (4.1.0) Requirement already satisfied: python-slugify in c:\users\adrian\appdata\roaming\python\python39\site-packages (from kaggle) (8.0.4) Requirement already satisfied: python-dateutil in c:\users\adrian\appdata\roaming\python\python39\site-packages (from kaggle) (2.8.2) Requirement already satisfied: tqdm in c:\users\adrian\appdata\roaming\python\python39\site-packages (from kaggle) (4.64.1) Requirement already satisfied: requests in c:\users\adrian\appdata\roaming\python\python39\site-packages (from kaggle) (2.28.1) Requirement already satisfied: certifi in c:\users\adrian\appdata\roaming\python\python39\site-packages (from kaggle) (2022.6.15) Requirement already satisfied: six>=1.10 in c:\users\adrian\appdata\roaming\python\python39\site-packages (from kaggle) (1.16.0) Requirement already satisfied: urllib3 in c:\users\adrian\appdata\roaming\python\python39\site-packages (from kaggle) (1.26.11) Requirement already satisfied: webencodings in c:\users\adrian\miniconda3\lib\site-packages (from bleach->kaggle) (0.5.1) Requirement already satisfied: packaging in c:\users\adrian\appdata\roaming\python\python39\site-packages (from bleach->kaggle) (22.0) Requirement already satisfied: text-unidecode>=1.3 in c:\users\adrian\appdata\roaming\python\python39\site-packages (from python-slugify->kaggle) (1.3) Requirement already satisfied: idna<4,>=2.5 in c:\users\adrian\appdata\roaming\python\python39\site-packages (from requests->kaggle) (2.10) Requirement already satisfied: charset-normalizer<3,>=2 in c:\users\adrian\appdata\roaming\python\python39\site-packages (from requests->kaggle) (2.1.0) Requirement already satisfied: colorama in c:\users\adrian\appdata\roaming\python\python39\site-packages (from tqdm->kaggle) (0.4.5)
!kaggle datasets download -d kamilpytlak/personal-key-indicators-of-heart-disease/
personal-key-indicators-of-heart-disease.zip: Skipping, found more recently modified local copy (use --force to force download)
# !unzip -o personal-key-indicators-of-heart-disease.zip  # does not work on Windows, so I use the zipfile module instead
import zipfile
with zipfile.ZipFile("personal-key-indicators-of-heart-disease.zip", 'r') as zip_ref:
    zip_ref.extractall("dataset_extracted")
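Before extracting, it can be useful to check which files an archive contains. A minimal sketch, assuming nothing about the real archive layout: it builds a tiny in-memory archive (the member name `example/data.csv` is a placeholder) and lists its contents with `ZipFile.namelist()`.

```python
import io
import zipfile

# Build a tiny in-memory archive so the example runs without the Kaggle download.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("example/data.csv", "a,b\n1,2\n")  # placeholder member

# namelist() returns the member paths without extracting anything.
with zipfile.ZipFile(buffer, "r") as zf:
    members = zf.namelist()

print(members)
```

The same `namelist()` call on the downloaded archive would show which CSV files are available before choosing one to load.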
import pandas as pd
# The downloaded dataset contains several subsets, so I deliberately open the one with NaNs to clean it manually for practice
df = pd.read_csv("dataset_extracted/2022/heart_2022_with_nans.csv")
Browsing the uncleaned dataset
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 445132 entries, 0 to 445131
Data columns (total 40 columns):
 #   Column                     Non-Null Count   Dtype
---  ------                     --------------   -----
 0   State                      445132 non-null  object
 1   Sex                        445132 non-null  object
 2   GeneralHealth              443934 non-null  object
 3   PhysicalHealthDays         434205 non-null  float64
 4   MentalHealthDays           436065 non-null  float64
 5   LastCheckupTime            436824 non-null  object
 6   PhysicalActivities         444039 non-null  object
 7   SleepHours                 439679 non-null  float64
 8   RemovedTeeth               433772 non-null  object
 9   HadHeartAttack             442067 non-null  object
 10  HadAngina                  440727 non-null  object
 11  HadStroke                  443575 non-null  object
 12  HadAsthma                  443359 non-null  object
 13  HadSkinCancer              441989 non-null  object
 14  HadCOPD                    442913 non-null  object
 15  HadDepressiveDisorder      442320 non-null  object
 16  HadKidneyDisease           443206 non-null  object
 17  HadArthritis               442499 non-null  object
 18  HadDiabetes                444045 non-null  object
 19  DeafOrHardOfHearing        424485 non-null  object
 20  BlindOrVisionDifficulty    423568 non-null  object
 21  DifficultyConcentrating    420892 non-null  object
 22  DifficultyWalking          421120 non-null  object
 23  DifficultyDressingBathing  421217 non-null  object
 24  DifficultyErrands          419476 non-null  object
 25  SmokerStatus               409670 non-null  object
 26  ECigaretteUsage            409472 non-null  object
 27  ChestScan                  389086 non-null  object
 28  RaceEthnicityCategory      431075 non-null  object
 29  AgeCategory                436053 non-null  object
 30  HeightInMeters             416480 non-null  float64
 31  WeightInKilograms          403054 non-null  float64
 32  BMI                        396326 non-null  float64
 33  AlcoholDrinkers            398558 non-null  object
 34  HIVTesting                 379005 non-null  object
 35  FluVaxLast12               398011 non-null  object
 36  PneumoVaxEver              368092 non-null  object
 37  TetanusLast10Tdap          362616 non-null  object
 38  HighRiskLastYear           394509 non-null  object
 39  CovidPos                   394368 non-null  object
dtypes: float64(6), object(34)
memory usage: 135.8+ MB
df.head()
State | Sex | GeneralHealth | PhysicalHealthDays | MentalHealthDays | LastCheckupTime | PhysicalActivities | SleepHours | RemovedTeeth | HadHeartAttack | ... | HeightInMeters | WeightInKilograms | BMI | AlcoholDrinkers | HIVTesting | FluVaxLast12 | PneumoVaxEver | TetanusLast10Tdap | HighRiskLastYear | CovidPos | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Alabama | Female | Very good | 0.0 | 0.0 | Within past year (anytime less than 12 months ... | No | 8.0 | NaN | No | ... | NaN | NaN | NaN | No | No | Yes | No | Yes, received tetanus shot but not sure what type | No | No |
1 | Alabama | Female | Excellent | 0.0 | 0.0 | NaN | No | 6.0 | NaN | No | ... | 1.60 | 68.04 | 26.57 | No | No | No | No | No, did not receive any tetanus shot in the pa... | No | No |
2 | Alabama | Female | Very good | 2.0 | 3.0 | Within past year (anytime less than 12 months ... | Yes | 5.0 | NaN | No | ... | 1.57 | 63.50 | 25.61 | No | No | No | No | NaN | No | Yes |
3 | Alabama | Female | Excellent | 0.0 | 0.0 | Within past year (anytime less than 12 months ... | Yes | 7.0 | NaN | No | ... | 1.65 | 63.50 | 23.30 | No | No | Yes | Yes | No, did not receive any tetanus shot in the pa... | No | No |
4 | Alabama | Female | Fair | 2.0 | 0.0 | Within past year (anytime less than 12 months ... | Yes | 9.0 | NaN | No | ... | 1.57 | 53.98 | 21.77 | Yes | No | No | Yes | No, did not receive any tetanus shot in the pa... | No | No |
5 rows × 40 columns
df.describe()
PhysicalHealthDays | MentalHealthDays | SleepHours | HeightInMeters | WeightInKilograms | BMI | |
---|---|---|---|---|---|---|
count | 434205.000000 | 436065.000000 | 439679.000000 | 416480.000000 | 403054.000000 | 396326.000000 |
mean | 4.347919 | 4.382649 | 7.022983 | 1.702691 | 83.074470 | 28.529842 |
std | 8.688912 | 8.387475 | 1.502425 | 0.107177 | 21.448173 | 6.554889 |
min | 0.000000 | 0.000000 | 1.000000 | 0.910000 | 22.680000 | 12.020000 |
25% | 0.000000 | 0.000000 | 6.000000 | 1.630000 | 68.040000 | 24.130000 |
50% | 0.000000 | 0.000000 | 7.000000 | 1.700000 | 80.740000 | 27.440000 |
75% | 3.000000 | 5.000000 | 8.000000 | 1.780000 | 95.250000 | 31.750000 |
max | 30.000000 | 30.000000 | 24.000000 | 2.410000 | 292.570000 | 99.640000 |
Only 6 columns are numeric so far, so many statistics are not shown in this summary.
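By default `describe()` summarizes numeric columns only. A small sketch on made-up data (the `toy` frame is hypothetical, not the real dataset) showing how `include="all"` also reports `count`, `unique`, `top` and `freq` for text columns:

```python
import pandas as pd

# Hypothetical toy frame with one text and one numeric column.
toy = pd.DataFrame({
    "Sex": ["Female", "Male", "Female", None],
    "SleepHours": [8.0, 6.0, None, 7.0],
})

# include="all" adds unique/top/freq rows for non-numeric columns.
summary = toy.describe(include="all")
print(summary)
```

Statistics that do not apply to a column's type (for example `mean` for `Sex`) are shown as NaN.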
The dataset is imbalanced: the variable we want to predict is negative ("No", encoded as 0 later on) in the vast majority of cases:
df["HadHeartAttack"].value_counts().plot(kind="pie")
<AxesSubplot:ylabel='count'>
df["HadHeartAttack"].value_counts()
HadHeartAttack
No     416959
Yes     25108
Name: count, dtype: int64
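To put a number on the imbalance, the share of positive answers can be computed from the counts above. A short worked example (the variable names are chosen here for illustration):

```python
# Counts copied from the value_counts() output above.
no_count, yes_count = 416959, 25108

# Share of "Yes" among rows with a known answer (NaN rows are not counted).
positive_share = yes_count / (no_count + yes_count)
print(f"{positive_share:.2%}")
```

On the DataFrame itself, `df["HadHeartAttack"].value_counts(normalize=True)` should give the same proportions directly.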
2. Splitting into subsets (train / dev / test, 8:1:1) and oversampling
from sklearn.model_selection import train_test_split
# The sklearn function has to be called twice because it only splits into two subsets
train, test_and_valid = train_test_split(df, test_size=0.2) #0.8 train, 0.2 test&valid
test, valid = train_test_split(test_and_valid, test_size=0.5) #0.1 test, 0.1 valid
train["HadHeartAttack"].value_counts()
HadHeartAttack
No     333640
Yes     20032
Name: count, dtype: int64
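The split above is purely random, so the class ratio can drift slightly between subsets. One possible variation, not used in this notebook, is a stratified split. The sketch below runs on a made-up frame (`toy`, `train_part`, `test_part` and `valid_part` are hypothetical names) and ignores missing targets, which would have to be dropped or filled before stratifying.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with a 10% positive class; the labels mimic the real column.
toy = pd.DataFrame({
    "feature": range(200),
    "HadHeartAttack": ["Yes"] * 20 + ["No"] * 180,
})

# stratify keeps the class ratio the same in both parts;
# random_state makes the split reproducible.
train_part, rest = train_test_split(
    toy, test_size=0.2, stratify=toy["HadHeartAttack"], random_state=0
)
test_part, valid_part = train_test_split(
    rest, test_size=0.5, stratify=rest["HadHeartAttack"], random_state=0
)
print(len(train_part), len(test_part), len(valid_part))
```

Fixing `random_state` also makes the subset sizes and class counts repeatable between notebook runs.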
The training set is still imbalanced, so I will do simple oversampling by copying the minority class until the two classes are almost equal in size.
def oversample(dataset):
    # Count the rows of each class in the target column
    num_true = len(dataset[dataset["HadHeartAttack"]=="Yes"])
    num_false = len(dataset[dataset["HadHeartAttack"]=="No"])
    # Number of extra copies of the minority class to append
    num_oversampling_steps = num_false//num_true
    oversampled = dataset.copy()
    for x in range(num_oversampling_steps):
        oversampled = pd.concat([oversampled, dataset[dataset["HadHeartAttack"]=="Yes"]], ignore_index=True)
    return oversampled
train = oversample(train)
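The effect of `oversample` on the class counts can be checked by hand. A worked example using the training-set counts printed earlier (the variable names `steps`, `yes_after` and `share_after` are introduced here for illustration):

```python
# Counts copied from the training-set value_counts() output.
num_false, num_true = 333640, 20032

steps = num_false // num_true        # extra copies appended by the loop
yes_after = num_true * (steps + 1)   # original "Yes" rows plus appended copies
share_after = yes_after / (yes_after + num_false)
print(steps, yes_after, round(share_after, 4))
```

The resulting share of positives is slightly above one half, which agrees with the `HadHeartAttack` mean reported by `train.describe()` later in the notebook.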
train["HadHeartAttack"].value_counts().plot(kind="pie")
<AxesSubplot:ylabel='count'>
test["HadHeartAttack"].value_counts().plot(kind="pie")
<AxesSubplot:ylabel='count'>
valid["HadHeartAttack"].value_counts().plot(kind="pie")
<AxesSubplot:ylabel='count'>
Proportions of smokers / non-smokers in the original dataset:
df["SmokerStatus"].value_counts().plot(kind="pie")
<AxesSubplot:ylabel='count'>
df["ECigaretteUsage"].value_counts().plot(kind="pie")
<AxesSubplot:ylabel='count'>
COVID statistics
df["CovidPos"].value_counts().plot(kind="pie")
<AxesSubplot:ylabel='count'>
Normalization, part 1: conversion to numeric and categorical columns
Columns that describe health status and similar features on a "poor/fair/good/excellent" scale were converted to numbers in a sensible way, so that the value increases with the positive aspect of the given health factor. How often a person smoked was handled similarly. Some columns were converted to categorical ones. The sex column was converted to a numeric one so the model can use it later, although I had doubts about doing so in terms of political correctness.
df["Sex"].unique()
array(['Female', 'Male'], dtype=object)
df["GeneralHealth"].unique()
array(['Very good', 'Excellent', 'Fair', 'Poor', 'Good', nan], dtype=object)
health_map = {
    "Excellent": 5,
    "Very good": 4,
    "Good": 3,
    "Fair": 2,
    "Poor": 1
}
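The dictionary above encodes an ordinal scale by hand. As an alternative sketch (not what the notebook uses), pandas can derive the same 1 to 5 codes from an ordered categorical. The `levels` list and the sample `answers` below are assumptions made for the example.

```python
import pandas as pd

levels = ["Poor", "Fair", "Good", "Very good", "Excellent"]  # worst to best
answers = pd.Series(["Very good", "Poor", None, "Excellent"])  # made-up sample

cat = pd.Categorical(answers, categories=levels, ordered=True)
codes = pd.Series(cat.codes, index=answers.index).astype(float)
codes[codes < 0] = float("nan")  # pandas uses code -1 for missing values
codes = codes + 1                # shift 0..4 to the 1..5 scale of health_map
print(codes.tolist())
```

Keeping the order in a single list makes it harder to assign two levels the same number by accident.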
for col in df:
    print(f"{col}:")
    print(df[col].unique())
State: ['Alabama' 'Alaska' 'Arizona' 'Arkansas' 'California' 'Colorado' 'Connecticut' 'Delaware' 'District of Columbia' 'Florida' 'Georgia' 'Hawaii' 'Idaho' 'Illinois' 'Indiana' 'Iowa' 'Kansas' 'Kentucky' 'Louisiana' 'Maine' 'Maryland' 'Massachusetts' 'Michigan' 'Minnesota' 'Mississippi' 'Missouri' 'Montana' 'Nebraska' 'Nevada' 'New Hampshire' 'New Jersey' 'New Mexico' 'New York' 'North Carolina' 'North Dakota' 'Ohio' 'Oklahoma' 'Oregon' 'Pennsylvania' 'Rhode Island' 'South Carolina' 'South Dakota' 'Tennessee' 'Texas' 'Utah' 'Vermont' 'Virginia' 'Washington' 'West Virginia' 'Wisconsin' 'Wyoming' 'Guam' 'Puerto Rico' 'Virgin Islands'] Sex: ['Female' 'Male'] GeneralHealth: ['Very good' 'Excellent' 'Fair' 'Poor' 'Good' nan] PhysicalHealthDays: [ 0. 2. 1. 8. 5. 30. 4. 23. 14. nan 15. 3. 10. 7. 25. 6. 21. 20. 29. 16. 9. 27. 28. 12. 13. 11. 26. 17. 24. 19. 18. 22.] MentalHealthDays: [ 0. 3. 9. 5. 15. 20. 14. 10. 18. 1. nan 2. 30. 4. 6. 7. 25. 8. 22. 29. 27. 21. 12. 28. 16. 13. 26. 17. 11. 23. 19. 24.] LastCheckupTime: ['Within past year (anytime less than 12 months ago)' nan 'Within past 2 years (1 year but less than 2 years ago)' 'Within past 5 years (2 years but less than 5 years ago)' '5 or more years ago'] PhysicalActivities: ['No' 'Yes' nan] SleepHours: [ 8. 6. 5. 7. 9. 4. 10. 1. 12. nan 18. 3. 2. 11. 16. 15. 13. 14. 20. 23. 17. 24. 22. 19. 21.] 
RemovedTeeth: [nan 'None of them' '1 to 5' '6 or more, but not all' 'All'] HadHeartAttack: ['No' 'Yes' nan] HadAngina: ['No' 'Yes' nan] HadStroke: ['No' 'Yes' nan] HadAsthma: ['No' 'Yes' nan] HadSkinCancer: ['No' 'Yes' nan] HadCOPD: ['No' 'Yes' nan] HadDepressiveDisorder: ['No' 'Yes' nan] HadKidneyDisease: ['No' 'Yes' nan] HadArthritis: ['No' 'Yes' nan] HadDiabetes: ['Yes' 'No' 'No, pre-diabetes or borderline diabetes' nan 'Yes, but only during pregnancy (female)'] DeafOrHardOfHearing: ['No' nan 'Yes'] BlindOrVisionDifficulty: ['No' 'Yes' nan] DifficultyConcentrating: ['No' nan 'Yes'] DifficultyWalking: ['No' 'Yes' nan] DifficultyDressingBathing: ['No' nan 'Yes'] DifficultyErrands: ['No' 'Yes' nan] SmokerStatus: ['Never smoked' 'Current smoker - now smokes some days' 'Former smoker' nan 'Current smoker - now smokes every day'] ECigaretteUsage: ['Not at all (right now)' 'Never used e-cigarettes in my entire life' nan 'Use them every day' 'Use them some days'] ChestScan: ['No' 'Yes' nan] RaceEthnicityCategory: ['White only, Non-Hispanic' 'Black only, Non-Hispanic' 'Other race only, Non-Hispanic' 'Multiracial, Non-Hispanic' nan 'Hispanic'] AgeCategory: ['Age 80 or older' 'Age 55 to 59' nan 'Age 40 to 44' 'Age 75 to 79' 'Age 70 to 74' 'Age 65 to 69' 'Age 60 to 64' 'Age 50 to 54' 'Age 45 to 49' 'Age 35 to 39' 'Age 25 to 29' 'Age 30 to 34' 'Age 18 to 24'] HeightInMeters: [ nan 1.6 1.57 1.65 1.8 1.63 1.7 1.68 1.73 1.55 1.93 1.88 1.78 1.85 1.75 1.52 1.83 1.91 1.96 1.5 1.45 1.42 1.24 1.47 1.22 1.98 2.03 2.01 1.3 1.4 1.35 1.82 1.67 1.76 2.11 1.37 1.64 1.71 2.16 2.26 0.91 2.06 1.14 1.74 1.51 1.53 1.69 1.56 1.84 1.9 1.54 1.72 1.87 1.61 1.49 1.59 1.58 1.62 1.79 1.46 1.89 2.13 0.99 2.08 2.21 1.32 2.18 1.77 2.36 1.25 1.66 1.86 1.95 1.19 1.05 1.48 1.03 1.18 1.81 1.38 1.44 1.07 1.27 1.2 1.17 1.04 2.24 1.1 1.43 1.92 2.05 1.12 2.41 2.34 0.97 1.06 1.15 2.29 1.16 1.09 0.92 2.07 1. 1.08 1.02 1.33 2. 
2.02 1.94 0.95] WeightInKilograms: [ nan 68.04 63.5 53.98 84.82 62.6 73.48 81.65 74.84 59.42 85.28 106.59 71.21 64.41 61.23 90.72 65.77 66.22 80.29 86.18 47.63 107.05 57.15 105.23 77.11 56.7 79.38 113.4 102.06 59.87 104.33 53.52 61.69 136.08 34.47 99.79 127.01 78.93 95.25 58.97 92.08 72.57 83.91 49.9 117.93 71.67 102.97 62.14 83.46 54.43 94.35 60.78 117.03 65.32 76.66 88.45 89.81 74.39 68.95 79.83 108.41 90.26 55.79 91.63 47.17 78.02 50.8 91.17 84.37 145.15 93.89 122.47 48.99 73.94 88.9 80.74 81.19 158.76 97.52 51.71 82.55 76.2 68.49 75.3 70.31 63.05 60.33 115.67 86.64 108.86 92.53 124.74 43.09 58.51 63.96 92.99 44.45 128.82 98.88 45.36 110.68 46.72 58.06 73.03 95.71 131.09 78.47 69.4 85.73 67.59 103.87 120.2 88. 54.88 111.58 52.16 77.56 126.55 94.8 123.83 89.36 75.75 69.85 112.49 82.1 106.14 57.61 70.76 148.78 96.16 67.13 48.08 163.29 109.77 100.7 142.88 64.86 111.13 121.11 55.34 101.6 93.44 117.48 120.66 66.68 44.91 132. 107.5 107.95 36.29 103.42 87.09 83.01 56.25 96.62 134.26 97.07 34.93 99.34 72.12 49.44 122.02 98.43 129.73 181.44 52.62 121.56 110.22 48.53 140.61 156.49 116.57 87.54 44. 114.31 31.75 97.98 101.15 112.04 100.24 113.85 154.22 118.39 133.81 149.69 41.73 119.75 138.35 151.95 129.27 131.54 104.78 132.45 102.51 116.12 40.37 105.69 136.98 195.04 53.07 132.9 124.28 112.94 114.76 45.81 119.29 167.83 51.26 172.37 162.39 46.27 127.91 123.38 38.56 130.63 143.34 115.21 166.92 135.17 109.32 135.62 204.12 127.46 118.84 139.25 126.1 122.92 151.5 133.36 42.64 50.35 80. 190.51 37.19 147.87 35.38 144.24 149.23 37.65 86. 147.42 281. 165.56 162.84 155.58 70. 137.89 189.6 206.38 148.32 42.18 153.77 38.1 90. 176.9 191.87 249.48 67. 95. 82. 170.1 62. 40.82 53. 139.71 130.18 100. 165.11 64. 43.54 24. 134.72 141.52 125.19 75. 60. 34.02 164.65 30.84 250. 58. 76. 73. 112. 74. 55. 200. 54. 66. 72. 152.41 39.46 220. 41.28 168.28 188.24 59. 46. 265. 238.14 168.74 145. 190. 93. 159.66 78. 50. 185.07 91. 104. 165. 183.7 33.57 161.93 68. 125.65 134. 130. 32.21 143.79 69. 
179.17 63. 105. 210.92 65. 32. 292.57 280. 85. 174.63 56. 128.37 87. 39.92 83. 169.64 156.04 177. 121. 151.05 89. 146.96 146.06 98. 166.47 36.74 171.46 227.25 29.48 190.06 161.03 35.83 226.8 175.09 138.8 240.4 158.3 170.55 61. 137.44 145.6 141.07 155.13 52. 120. 57. 77. 27.22 25.4 240. 96. 47. 115. 41. 45. 170. 150.59 272.16 26.31 48. 39.01 236. 92. 197.31 156. 84. 94. 29.03 49. 79. 157.85 192.78 255. 108. 185. 222.26 229.97 180. 81. 24.95 71. 26. 107. 101. 208.65 140. 175. 111. 110. 141.97 22.68 284.86 136.53 210. 103. 185.97 140.16 146.51 24.49 25.85 150. 102. 229.52 23.59 125. 163. 38. 135. 176.45 185.52 152.86 232.69 124. 192.32 186.88 118. 160.12 160. 193.68 201.85 144.7 184.16 142.43 169. 166.01 32.66 180.53 196.41 51. 40. 171.91 195.95 33.11 153.31 159.21 164.2 219.99 215.46 182.34 30. 160.57 173.27 158. 213.19 276.24 199.58 175.99 235.87 217.72 200.03 230.88 146. 24.04 178.72 150.14 157.4 163.75 191.42 174.18 28.58 97. 256.28 205.48 161.48 178.26 179.62 205.02 254.01 154.68 209.56 201.4 234.96 177.81 200.49 231.79 227.7 273.52 189.15 173.73 183.25 167.38 211.83 223.62 228.61 30.39 197.77 184.61 250.38 181.89 31.3 290.3 285. 113. 242.67 231.33 180.08 202.76 176. 188.69 206.84 164. 156.94 114. 122. 222. 137. 166. 180.98 272. 172.82 274.42 234.51 199.13 244.94 203.21 23.13 265.35 198.22 263.08 216.82 154. 169.19 239.04 177.35 210.47 224.98 117. 37. 126. 273.06 203.66 252.2 238.59 194.59 187.33 221.35 162. 224.53 23. 223.17 187.79 212.73 152. 233.6 193.23 205. 229.06 230. 247.21 99. 28.12 230.42 175.54 205.93 171. 26.76 212.28 217. 280.32 281.68 248.57 195. 42. 258.55 215. 116. 28. 123. 186.43 228.16 119. 219.09 214.55 278.96 182.8 138. 217.27 246.3 189. ] BMI: [ nan 26.57 25.61 ... 
13.51 28.39 48.63] AlcoholDrinkers: ['No' 'Yes' nan] HIVTesting: ['No' 'Yes' nan] FluVaxLast12: ['Yes' 'No' nan] PneumoVaxEver: ['No' 'Yes' nan] TetanusLast10Tdap: ['Yes, received tetanus shot but not sure what type' 'No, did not receive any tetanus shot in the past 10 years' nan 'Yes, received Tdap' 'Yes, received tetanus shot, but not Tdap'] HighRiskLastYear: ['No' nan 'Yes'] CovidPos: ['No' 'Yes' nan 'Tested positive using home test without a health professional']
from collections import defaultdict
def normalize_dataset(dataset):
    # Modifies the passed DataFrame in place and returns nothing
    dataset["GeneralHealth"] = dataset["GeneralHealth"].map(defaultdict(lambda: float('NaN'), health_map), na_action='ignore')
    dataset["Sex"] = dataset["Sex"].map({"Female":0,"Male":1}).astype(float) # Convert text columns to numeric
    dataset.rename(columns ={"Sex":"Male"},inplace=True)
    dataset["State"] = dataset["State"].astype('category')
    dataset["PhysicalHealthDays"] = dataset["PhysicalHealthDays"].astype(float)
    dataset["MentalHealthDays"] = dataset["MentalHealthDays"].astype(float)
    dataset["LastCheckupTime"] = dataset["LastCheckupTime"].fillna("Unknown").astype('category') # Later I use fillna --> median, but that does not work on categorical columns, so the missing values are filled before the conversion
    dataset["PhysicalActivities"]= dataset["PhysicalActivities"].map({"No":0,"Yes":1})
    dataset["SleepHours"] = dataset["SleepHours"].astype(float)
    dataset["RemovedTeeth"] = dataset["RemovedTeeth"].map(defaultdict(lambda: float('NaN'), {"None of them":0,"1 to 5":1, "6 or more, but not all":2, "All":3}), na_action='ignore')
    dataset["HadHeartAttack"]= dataset["HadHeartAttack"].map({"No":0,"Yes":1})
    dataset["HadAngina"]= dataset["HadAngina"].map({"No":0,"Yes":1})
    dataset["HadStroke"]= dataset["HadStroke"].map({"No":0,"Yes":1})
    dataset["HadAsthma"]= dataset["HadAsthma"].map({"No":0,"Yes":1})
    dataset["HadSkinCancer"]= dataset["HadSkinCancer"].map({"No":0,"Yes":1})
    dataset["HadCOPD"]= dataset["HadCOPD"].map({"No":0,"Yes":1})
    dataset["HadDepressiveDisorder"]= dataset["HadDepressiveDisorder"].map({"No":0,"Yes":1})
    dataset["HadKidneyDisease"]= dataset["HadKidneyDisease"].map({"No":0,"Yes":1})
    dataset["HadArthritis"]= dataset["HadArthritis"].map({"No":0,"Yes":1})
    dataset["HadDiabetes"]= dataset["HadDiabetes"].map({"No":0,"Yes, but only during pregnancy (female)":1,"No, pre-diabetes or borderline diabetes":2,"Yes":3})
    dataset["DeafOrHardOfHearing"]= dataset["DeafOrHardOfHearing"].map({"No":0,"Yes":1})
    dataset["BlindOrVisionDifficulty"]= dataset["BlindOrVisionDifficulty"].map({"No":0,"Yes":1})
    dataset["DifficultyConcentrating"]= dataset["DifficultyConcentrating"].map({"No":0,"Yes":1})
    dataset["DifficultyWalking"]= dataset["DifficultyWalking"].map({"No":0,"Yes":1})
    dataset["DifficultyDressingBathing"]= dataset["DifficultyDressingBathing"].map({"No":0,"Yes":1})
    dataset["DifficultyErrands"]= dataset["DifficultyErrands"].map({"No":0,"Yes":1})
    dataset["SmokerStatus"]= dataset["SmokerStatus"].map({"Never smoked":0,"Current smoker - now smokes some days":1,"Former smoker":2,"Current smoker - now smokes every day":3})
    dataset["ECigaretteUsage"]= dataset["ECigaretteUsage"].map({"Never used e-cigarettes in my entire life":0,"Not at all (right now)":1,"Use them some days":2,"Use them every day":3})
    dataset["ChestScan"]= dataset["ChestScan"].map({"No":0,"Yes":1})
    dataset["RaceEthnicityCategory"] = dataset["RaceEthnicityCategory"].fillna("Unknown").astype('category')
    dataset["AgeCategory"] = dataset["AgeCategory"].fillna("Unknown").astype('category')
    dataset["HeightInMeters"] = dataset["HeightInMeters"].astype(float)
    dataset["WeightInKilograms"] = dataset["WeightInKilograms"].astype(float)
    dataset["BMI"] = dataset["BMI"].astype(float)
    dataset["AlcoholDrinkers"]= dataset["AlcoholDrinkers"].map({"No":0,"Yes":1})
    dataset["HIVTesting"]= dataset["HIVTesting"].map({"No":0,"Yes":1})
    dataset["FluVaxLast12"]= dataset["FluVaxLast12"].map({"No":0,"Yes":1})
    dataset["PneumoVaxEver"]= dataset["PneumoVaxEver"].map({"No":0,"Yes":1})
    # "Yes, ..." answers map to 1.0 and "No, ..." answers to 0.0 (the outputs shown in this notebook were produced when both mapped to 1.0)
    dataset["TetanusLast10Tdap"]= dataset["TetanusLast10Tdap"].apply(lambda x: float('NaN') if type(x)!=str else 1.0 if 'Yes,' in x else 0.0 if 'No,' in x else float('NaN'))
    dataset["HighRiskLastYear"]= dataset["HighRiskLastYear"].map({"No":0,"Yes":1})
    # The home-test answer is not in the mapping, so it becomes NaN
    dataset["CovidPos"]= dataset["CovidPos"].map({"No":0,"Yes":1})
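Many lines of `normalize_dataset` apply the same No/Yes mapping. A compact sketch of that pattern on made-up data, assuming a hand-picked list of binary columns (`binary_columns` and `toy` are hypothetical and cover only two columns for the demo):

```python
import pandas as pd

toy = pd.DataFrame({
    "HadAngina": ["No", "Yes", None],
    "HadStroke": ["Yes", "No", "No"],
})

yes_no = {"No": 0, "Yes": 1}
binary_columns = ["HadAngina", "HadStroke"]  # hypothetical subset for the demo

for column in binary_columns:
    # Values missing from the dict (including NaN) become NaN.
    toy[column] = toy[column].map(yes_no)
print(toy)
```

Because `Series.map` with a dict turns unmapped values into NaN, any unexpected answer text shows up as a missing value rather than raising an error.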
The test set before the data type conversion
test.head()
State | Sex | GeneralHealth | PhysicalHealthDays | MentalHealthDays | LastCheckupTime | PhysicalActivities | SleepHours | RemovedTeeth | HadHeartAttack | ... | HeightInMeters | WeightInKilograms | BMI | AlcoholDrinkers | HIVTesting | FluVaxLast12 | PneumoVaxEver | TetanusLast10Tdap | HighRiskLastYear | CovidPos | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
276058 | New York | Male | Good | 2.0 | 0.0 | Within past 2 years (1 year but less than 2 ye... | NaN | 7.0 | None of them | No | ... | 1.55 | NaN | NaN | No | No | No | NaN | No, did not receive any tetanus shot in the pa... | No | No |
189605 | Michigan | Female | Fair | 20.0 | 15.0 | Within past year (anytime less than 12 months ... | Yes | 5.0 | All | No | ... | 1.68 | 70.31 | 25.02 | No | NaN | Yes | Yes | NaN | No | No |
59234 | Delaware | Female | Very good | 0.0 | 0.0 | Within past year (anytime less than 12 months ... | Yes | 6.0 | None of them | No | ... | 1.50 | 64.41 | 28.68 | No | No | Yes | NaN | No, did not receive any tetanus shot in the pa... | No | No |
255322 | New Mexico | Male | Good | 0.0 | 0.0 | 5 or more years ago | Yes | 6.0 | None of them | No | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
226504 | Montana | Female | Very good | 6.0 | 0.0 | Within past year (anytime less than 12 months ... | Yes | 8.0 | None of them | No | ... | 1.73 | 90.72 | 30.41 | Yes | No | No | Yes | NaN | No | Yes |
5 rows × 40 columns
The test set after the data type conversion
normalize_dataset(test)
test.head()
State | Male | GeneralHealth | PhysicalHealthDays | MentalHealthDays | LastCheckupTime | PhysicalActivities | SleepHours | RemovedTeeth | HadHeartAttack | ... | HeightInMeters | WeightInKilograms | BMI | AlcoholDrinkers | HIVTesting | FluVaxLast12 | PneumoVaxEver | TetanusLast10Tdap | HighRiskLastYear | CovidPos | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
276058 | New York | 1.0 | 3.0 | 2.0 | 0.0 | Within past 2 years (1 year but less than 2 ye... | NaN | 7.0 | 0.0 | 0.0 | ... | 1.55 | NaN | NaN | 0.0 | 0.0 | 0.0 | NaN | 1.0 | 0.0 | 0.0 |
189605 | Michigan | 0.0 | 2.0 | 20.0 | 15.0 | Within past year (anytime less than 12 months ... | 1.0 | 5.0 | 3.0 | 0.0 | ... | 1.68 | 70.31 | 25.02 | 0.0 | NaN | 1.0 | 1.0 | NaN | 0.0 | 0.0 |
59234 | Delaware | 0.0 | 4.0 | 0.0 | 0.0 | Within past year (anytime less than 12 months ... | 1.0 | 6.0 | 0.0 | 0.0 | ... | 1.50 | 64.41 | 28.68 | 0.0 | 0.0 | 1.0 | NaN | 1.0 | 0.0 | 0.0 |
255322 | New Mexico | 1.0 | 3.0 | 0.0 | 0.0 | 5 or more years ago | 1.0 | 6.0 | 0.0 | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
226504 | Montana | 0.0 | 4.0 | 6.0 | 0.0 | Within past year (anytime less than 12 months ... | 1.0 | 8.0 | 0.0 | 0.0 | ... | 1.73 | 90.72 | 30.41 | 1.0 | 0.0 | 0.0 | 1.0 | NaN | 0.0 | 1.0 |
5 rows × 40 columns
test.info()
<class 'pandas.core.frame.DataFrame'>
Index: 44513 entries, 276058 to 196692
Data columns (total 40 columns):
 #   Column                     Non-Null Count  Dtype
---  ------                     --------------  -----
 0   State                      44513 non-null  category
 1   Male                       44513 non-null  float64
 2   GeneralHealth              44380 non-null  float64
 3   PhysicalHealthDays         43374 non-null  float64
 4   MentalHealthDays           43620 non-null  float64
 5   LastCheckupTime            44513 non-null  category
 6   PhysicalActivities         44383 non-null  float64
 7   SleepHours                 43982 non-null  float64
 8   RemovedTeeth               43364 non-null  float64
 9   HadHeartAttack             44220 non-null  float64
 10  HadAngina                  44117 non-null  float64
 11  HadStroke                  44352 non-null  float64
 12  HadAsthma                  44348 non-null  float64
 13  HadSkinCancer              44192 non-null  float64
 14  HadCOPD                    44283 non-null  float64
 15  HadDepressiveDisorder      44197 non-null  float64
 16  HadKidneyDisease           44342 non-null  float64
 17  HadArthritis               44231 non-null  float64
 18  HadDiabetes                44377 non-null  float64
 19  DeafOrHardOfHearing        42456 non-null  float64
 20  BlindOrVisionDifficulty    42338 non-null  float64
 21  DifficultyConcentrating    42066 non-null  float64
 22  DifficultyWalking          42090 non-null  float64
 23  DifficultyDressingBathing  42111 non-null  float64
 24  DifficultyErrands          41923 non-null  float64
 25  SmokerStatus               40967 non-null  float64
 26  ECigaretteUsage            40964 non-null  float64
 27  ChestScan                  38930 non-null  float64
 28  RaceEthnicityCategory      44513 non-null  category
 29  AgeCategory                44513 non-null  category
 30  HeightInMeters             41634 non-null  float64
 31  WeightInKilograms          40303 non-null  float64
 32  BMI                        39648 non-null  float64
 33  AlcoholDrinkers            39882 non-null  float64
 34  HIVTesting                 37870 non-null  float64
 35  FluVaxLast12               39814 non-null  float64
 36  PneumoVaxEver              36760 non-null  float64
 37  TetanusLast10Tdap          36287 non-null  float64
 38  HighRiskLastYear           39445 non-null  float64
 39  CovidPos                   38063 non-null  float64
dtypes: category(4), float64(36)
memory usage: 12.7 MB
normalize_dataset(train)
normalize_dataset(valid)
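A comment in `normalize_dataset` mentions that missing values are later filled with the median. That step is not shown in this part of the notebook. One possible sketch on made-up data, assuming the medians are computed on the training subset only and then reused for the other subsets (a choice made here, not confirmed by the source):

```python
import pandas as pd

# Hypothetical miniature subsets; column names mimic the real ones.
train_toy = pd.DataFrame({"BMI": [20.0, 30.0, None, 40.0], "State": ["A", "B", "A", "B"]})
test_toy = pd.DataFrame({"BMI": [None, 25.0], "State": ["A", "B"]})

# Only numeric columns have a median; categorical ones are skipped.
numeric_cols = train_toy.select_dtypes(include="number").columns
medians = train_toy[numeric_cols].median()  # computed on the training data only

train_toy[numeric_cols] = train_toy[numeric_cols].fillna(medians)
test_toy[numeric_cols] = test_toy[numeric_cols].fillna(medians)
print(test_toy)
```

Reusing the training medians keeps information from the test and validation subsets out of the preprocessing.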
Statistics for the subsets after conversion to numeric columns
The 50th percentile is the median.
train.describe()
Male | GeneralHealth | PhysicalHealthDays | MentalHealthDays | PhysicalActivities | SleepHours | RemovedTeeth | HadHeartAttack | HadAngina | HadStroke | ... | HeightInMeters | WeightInKilograms | BMI | AlcoholDrinkers | HIVTesting | FluVaxLast12 | PneumoVaxEver | TetanusLast10Tdap | HighRiskLastYear | CovidPos | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 676617.000000 | 674189.000000 | 655653.000000 | 660103.000000 | 674547.000000 | 665806.000000 | 654146.000000 | 674184.000000 | 657382.000000 | 672884.000000 | ... | 637479.000000 | 620141.000000 | 611530.000000 | 607591.000000 | 573999.000000 | 606624.000000 | 571259.000000 | 554407.0 | 601115.000000 | 585931.000000 |
mean | 0.539397 | 3.056503 | 6.720248 | 4.819231 | 0.689765 | 7.039463 | 0.978094 | 0.505120 | 0.264342 | 0.116472 | ... | 1.707316 | 84.660193 | 28.918429 | 0.455838 | 0.326018 | 0.571211 | 0.527326 | 1.0 | 0.034534 | 0.273136 |
std | 0.498446 | 1.138185 | 10.708463 | 9.058480 | 0.462590 | 1.726591 | 1.017700 | 0.499974 | 0.440983 | 0.320790 | ... | 0.108041 | 21.748490 | 6.631906 | 0.498046 | 0.468754 | 0.494903 | 0.499253 | 0.0 | 0.182597 | 0.445571 |
min | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.910000 | 22.680000 | 12.020000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.0 | 0.000000 | 0.000000 |
25% | 0.000000 | 2.000000 | 0.000000 | 0.000000 | 0.000000 | 6.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 1.630000 | 69.400000 | 24.410000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.0 | 0.000000 | 0.000000 |
50% | 1.000000 | 3.000000 | 0.000000 | 0.000000 | 1.000000 | 7.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | ... | 1.700000 | 81.650000 | 27.890000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.0 | 0.000000 | 0.000000 |
75% | 1.000000 | 4.000000 | 10.000000 | 5.000000 | 1.000000 | 8.000000 | 2.000000 | 1.000000 | 1.000000 | 0.000000 | ... | 1.780000 | 96.160000 | 32.220000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.0 | 0.000000 | 1.000000 |
max | 1.000000 | 5.000000 | 30.000000 | 30.000000 | 1.000000 | 24.000000 | 3.000000 | 1.000000 | 1.000000 | 1.000000 | ... | 2.410000 | 292.570000 | 99.640000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.0 | 1.000000 | 1.000000 |
8 rows × 36 columns
test.describe()
Male | GeneralHealth | PhysicalHealthDays | MentalHealthDays | PhysicalActivities | SleepHours | RemovedTeeth | HadHeartAttack | HadAngina | HadStroke | ... | HeightInMeters | WeightInKilograms | BMI | AlcoholDrinkers | HIVTesting | FluVaxLast12 | PneumoVaxEver | TetanusLast10Tdap | HighRiskLastYear | CovidPos | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 44513.000000 | 44380.000000 | 43374.000000 | 43620.000000 | 44383.000000 | 43982.000000 | 43364.000000 | 44220.000000 | 44117.000000 | 44352.000000 | ... | 41634.000000 | 40303.000000 | 39648.000000 | 39882.000000 | 37870.000000 | 39814.000000 | 36760.00000 | 36287.0 | 39445.000000 | 38063.000000 |
mean | 0.467347 | 3.433551 | 4.304353 | 4.470839 | 0.759119 | 7.012414 | 0.687644 | 0.058684 | 0.060816 | 0.043155 | ... | 1.701734 | 82.990520 | 28.545288 | 0.532621 | 0.342382 | 0.526348 | 0.41420 | 1.0 | 0.043174 | 0.293461 |
std | 0.498938 | 1.049691 | 8.629763 | 8.472884 | 0.427623 | 1.493726 | 0.883372 | 0.235035 | 0.238994 | 0.203208 | ... | 0.106604 | 21.462338 | 6.574508 | 0.498941 | 0.474513 | 0.499312 | 0.49259 | 0.0 | 0.203251 | 0.455354 |
min | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.910000 | 22.680000 | 12.690000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 1.0 | 0.000000 | 0.000000 |
25% | 0.000000 | 3.000000 | 0.000000 | 0.000000 | 1.000000 | 6.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 1.630000 | 68.040000 | 24.130000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 1.0 | 0.000000 | 0.000000 |
50% | 0.000000 | 3.000000 | 0.000000 | 0.000000 | 1.000000 | 7.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 1.700000 | 80.740000 | 27.440000 | 1.000000 | 0.000000 | 1.000000 | 0.00000 | 1.0 | 0.000000 | 0.000000 |
75% | 1.000000 | 4.000000 | 3.000000 | 5.000000 | 1.000000 | 8.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 1.780000 | 95.250000 | 31.750000 | 1.000000 | 1.000000 | 1.000000 | 1.00000 | 1.0 | 0.000000 | 1.000000 |
max | 1.000000 | 5.000000 | 30.000000 | 30.000000 | 1.000000 | 24.000000 | 3.000000 | 1.000000 | 1.000000 | 1.000000 | ... | 2.260000 | 276.240000 | 97.650000 | 1.000000 | 1.000000 | 1.000000 | 1.00000 | 1.0 | 1.000000 | 1.000000 |
8 rows × 36 columns
valid.describe()
Male | GeneralHealth | PhysicalHealthDays | MentalHealthDays | PhysicalActivities | SleepHours | RemovedTeeth | HadHeartAttack | HadAngina | HadStroke | ... | HeightInMeters | WeightInKilograms | BMI | AlcoholDrinkers | HIVTesting | FluVaxLast12 | PneumoVaxEver | TetanusLast10Tdap | HighRiskLastYear | CovidPos | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 44514.000000 | 44405.000000 | 43450.000000 | 43622.000000 | 44421.000000 | 43955.000000 | 43350.000000 | 44175.000000 | 44060.000000 | 44339.000000 | ... | 41591.000000 | 40226.000000 | 39516.000000 | 39789.000000 | 37856.000000 | 39749.000000 | 36681.000000 | 36210.0 | 39453.000000 | 38058.000000 |
mean | 0.466887 | 3.427835 | 4.354799 | 4.398171 | 0.760271 | 7.031760 | 0.684060 | 0.056163 | 0.060236 | 0.043506 | ... | 1.702198 | 83.013436 | 28.522226 | 0.529945 | 0.340501 | 0.522831 | 0.414983 | 1.0 | 0.045903 | 0.290609 |
std | 0.498908 | 1.056506 | 8.691768 | 8.406697 | 0.426923 | 1.513703 | 0.881616 | 0.230239 | 0.237926 | 0.203995 | ... | 0.107066 | 21.464497 | 6.564679 | 0.499109 | 0.473884 | 0.499485 | 0.492726 | 0.0 | 0.209277 | 0.454049 |
min | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.910000 | 22.680000 | 12.190000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.0 | 0.000000 | 0.000000 |
25% | 0.000000 | 3.000000 | 0.000000 | 0.000000 | 1.000000 | 6.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 1.630000 | 68.040000 | 24.130000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.0 | 0.000000 | 0.000000 |
50% | 0.000000 | 3.000000 | 0.000000 | 0.000000 | 1.000000 | 7.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 1.700000 | 80.740000 | 27.440000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 1.0 | 0.000000 | 0.000000 |
75% | 1.000000 | 4.000000 | 3.000000 | 5.000000 | 1.000000 | 8.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 1.780000 | 95.250000 | 31.750000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.0 | 0.000000 | 1.000000 |
max | 1.000000 | 5.000000 | 30.000000 | 30.000000 | 1.000000 | 24.000000 | 3.000000 | 1.000000 | 1.000000 | 1.000000 | ... | 2.360000 | 284.860000 | 96.200000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.0 | 1.000000 | 1.000000 |
8 rows × 36 columns
Body mass appears to be correlated with having had a heart attack:
import seaborn as sns
sns.set_theme()
g = sns.catplot(
data=train, kind="bar",
x="GeneralHealth", y="WeightInKilograms", hue="HadHeartAttack",
errorbar="sd", palette="dark", alpha=.6, height=6
)
g.despine(left=True)
g.set_axis_labels("General health index", "Body mass (kg)")
g.legend.set_title("Had heart attack")
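The bar chart only hints at the relationship. One hedged way to put a number on it is the point-biserial correlation, which for a 0/1 column is simply Pearson's r. The sketch below runs on a tiny made-up frame (`demo` and its values are illustrative, not rows from the dataset); on the real data the same two calls could be applied to `train`.

```python
import pandas as pd

# Illustrative toy data, not rows from the real dataset.
demo = pd.DataFrame({
    "WeightInKilograms": [60.0, 70.0, 80.0, 90.0, 100.0, 110.0],
    "HadHeartAttack":    [0.0, 0.0, 0.0, 1.0, 0.0, 1.0],
})

# Pearson's r between a continuous column and a 0/1 column
# is the point-biserial correlation.
r = demo["WeightInKilograms"].corr(demo["HadHeartAttack"])

# Mean weight per outcome group shows the direction of the difference.
group_means = demo.groupby("HadHeartAttack")["WeightInKilograms"].mean()

print(f"point-biserial r = {r:.3f}")
print(group_means)
```

A value of r close to 0 would mean the visual impression from the chart is weak, even if the bars differ slightly.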
People who smoke had a heart attack more often:
valid.groupby('SmokerStatus', as_index=False)['HadHeartAttack'].mean()
| SmokerStatus | HadHeartAttack |
---|---|---|
0 | 0.0 | 0.037162 |
1 | 1.0 | 0.069817 |
2 | 2.0 | 0.082760 |
3 | 3.0 | 0.093980 |
In this dataset, people with a worse "GeneralHealth" score had a heart attack more often:
valid.groupby('GeneralHealth', as_index=False)['HadHeartAttack'].mean()
| GeneralHealth | HadHeartAttack |
---|---|---|
0 | 1.0 | 0.219401 |
1 | 2.0 | 0.118330 |
2 | 3.0 | 0.056664 |
3 | 4.0 | 0.028686 |
4 | 5.0 | 0.014112 |
valid.pivot_table('HadHeartAttack',index='GeneralHealth', columns='SmokerStatus')
SmokerStatus | 0.0 | 1.0 | 2.0 | 3.0 |
---|---|---|---|---|
GeneralHealth | ||||
1.0 | 0.163180 | 0.242991 | 0.259740 | 0.250000 |
2.0 | 0.085862 | 0.120438 | 0.158195 | 0.146465 |
3.0 | 0.038882 | 0.059574 | 0.083070 | 0.076079 |
4.0 | 0.023638 | 0.022901 | 0.039315 | 0.032688 |
5.0 | 0.011113 | 0.017544 | 0.020365 | 0.025316 |
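The rates above are group means, and some cells of the pivot table may rest on relatively few respondents. A minimal sketch, on made-up data, of reporting the group size next to each rate so that small groups are easy to spot (`demo` and `summary` are illustrative names):

```python
import pandas as pd

# Illustrative toy data, not rows from the real dataset.
demo = pd.DataFrame({
    "GeneralHealth":  [1.0, 1.0, 1.0, 5.0, 5.0, 5.0, 5.0, 5.0],
    "HadHeartAttack": [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
})

# Named aggregation: heart attack rate and number of non-missing answers per group.
summary = demo.groupby("GeneralHealth")["HadHeartAttack"].agg(rate="mean", n="count")

print(summary)
```

On the real data the same aggregation could be run on `valid` in place of `demo`.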
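Normalization, part 2: scaling numeric columns to the 0-1 range
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
def scale_float_columns(dataset):
    # Scale every float64 column of `dataset` to [0, 1], modifying it in place.
    # fit_transform refits the scaler on each call, so each split
    # is scaled by its own per-column min and max.
    numerical_columns = list(dataset.select_dtypes(include=['float64']).columns)
    dataset[numerical_columns] = scaler.fit_transform(dataset[numerical_columns])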
test.head()
| State | Male | GeneralHealth | PhysicalHealthDays | MentalHealthDays | LastCheckupTime | PhysicalActivities | SleepHours | RemovedTeeth | HadHeartAttack | ... | HeightInMeters | WeightInKilograms | BMI | AlcoholDrinkers | HIVTesting | FluVaxLast12 | PneumoVaxEver | TetanusLast10Tdap | HighRiskLastYear | CovidPos |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
276058 | New York | 1.0 | 3.0 | 2.0 | 0.0 | Within past 2 years (1 year but less than 2 ye... | NaN | 7.0 | 0.0 | 0.0 | ... | 1.55 | NaN | NaN | 0.0 | 0.0 | 0.0 | NaN | 1.0 | 0.0 | 0.0 |
189605 | Michigan | 0.0 | 2.0 | 20.0 | 15.0 | Within past year (anytime less than 12 months ... | 1.0 | 5.0 | 3.0 | 0.0 | ... | 1.68 | 70.31 | 25.02 | 0.0 | NaN | 1.0 | 1.0 | NaN | 0.0 | 0.0 |
59234 | Delaware | 0.0 | 4.0 | 0.0 | 0.0 | Within past year (anytime less than 12 months ... | 1.0 | 6.0 | 0.0 | 0.0 | ... | 1.50 | 64.41 | 28.68 | 0.0 | 0.0 | 1.0 | NaN | 1.0 | 0.0 | 0.0 |
255322 | New Mexico | 1.0 | 3.0 | 0.0 | 0.0 | 5 or more years ago | 1.0 | 6.0 | 0.0 | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
226504 | Montana | 0.0 | 4.0 | 6.0 | 0.0 | Within past year (anytime less than 12 months ... | 1.0 | 8.0 | 0.0 | 0.0 | ... | 1.73 | 90.72 | 30.41 | 1.0 | 0.0 | 0.0 | 1.0 | NaN | 0.0 | 1.0 |
5 rows × 40 columns
scale_float_columns(test)
scale_float_columns(train)
scale_float_columns(valid)
test.head()
| State | Male | GeneralHealth | PhysicalHealthDays | MentalHealthDays | LastCheckupTime | PhysicalActivities | SleepHours | RemovedTeeth | HadHeartAttack | ... | HeightInMeters | WeightInKilograms | BMI | AlcoholDrinkers | HIVTesting | FluVaxLast12 | PneumoVaxEver | TetanusLast10Tdap | HighRiskLastYear | CovidPos |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
276058 | New York | 1.0 | 0.50 | 0.066667 | 0.0 | Within past 2 years (1 year but less than 2 ye... | NaN | 0.260870 | 0.0 | 0.0 | ... | 0.474074 | NaN | NaN | 0.0 | 0.0 | 0.0 | NaN | 0.0 | 0.0 | 0.0 |
189605 | Michigan | 0.0 | 0.25 | 0.666667 | 0.5 | Within past year (anytime less than 12 months ... | 1.0 | 0.173913 | 1.0 | 0.0 | ... | 0.570370 | 0.187845 | 0.145127 | 0.0 | NaN | 1.0 | 1.0 | NaN | 0.0 | 0.0 |
59234 | Delaware | 0.0 | 0.75 | 0.000000 | 0.0 | Within past year (anytime less than 12 months ... | 1.0 | 0.217391 | 0.0 | 0.0 | ... | 0.437037 | 0.164576 | 0.188206 | 0.0 | 0.0 | 1.0 | NaN | 0.0 | 0.0 | 0.0 |
255322 | New Mexico | 1.0 | 0.50 | 0.000000 | 0.0 | 5 or more years ago | 1.0 | 0.217391 | 0.0 | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
226504 | Montana | 0.0 | 0.75 | 0.200000 | 0.0 | Within past year (anytime less than 12 months ... | 1.0 | 0.304348 | 0.0 | 0.0 | ... | 0.607407 | 0.268339 | 0.208569 | 1.0 | 0.0 | 0.0 | 1.0 | NaN | 0.0 | 1.0 |
5 rows × 40 columns
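Because `scale_float_columns` calls `fit_transform` on every split, the same raw value can map to different scaled values in train, valid and test. A common alternative, sketched here under the assumption that the training split should define the scale, is to fit the scaler once on train and only `transform` the other splits. The helper names `fit_scaler` and `apply_scaler` and the toy frames are hypothetical, not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Illustrative toy splits, not rows from the real dataset.
train_demo = pd.DataFrame({"BMI": [20.0, 30.0, 40.0], "SleepHours": [4.0, 8.0, 12.0]})
test_demo = pd.DataFrame({"BMI": [25.0, 40.0], "SleepHours": [6.0, 8.0]})

def fit_scaler(train_df):
    """Fit a MinMaxScaler on the float64 columns of the training split only."""
    cols = list(train_df.select_dtypes(include=["float64"]).columns)
    return MinMaxScaler().fit(train_df[cols]), cols

def apply_scaler(df, scaler, cols):
    """Return a scaled copy of df using min/max learned from the training split."""
    out = df.copy()
    out[cols] = scaler.transform(out[cols])
    return out

scaler_demo, float_cols = fit_scaler(train_demo)
train_scaled = apply_scaler(train_demo, scaler_demo, float_cols)
test_scaled = apply_scaler(test_demo, scaler_demo, float_cols)

print(test_scaled)
```

With this design, values in valid or test that lie outside the training range end up slightly below 0 or above 1, which is expected.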
5. Cleaning missing fields
We cannot use .dropna(), because a large share of the rows (199110 of 445132, about 45%) have missing values:
print(df.shape[0])
print(df.shape[0] - df.dropna().shape[0])
445132
199110
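Before choosing a fill strategy it can help to see which columns carry the gaps. A minimal sketch on made-up data (the `demo` frame is illustrative); on the real data the same calls could be applied to `df`:

```python
import numpy as np
import pandas as pd

# Illustrative toy data, not rows from the real dataset.
demo = pd.DataFrame({
    "BMI": [25.0, np.nan, 30.1, np.nan],
    "SleepHours": [7.0, 6.0, np.nan, 8.0],
    "State": ["Ohio", "Texas", "Utah", "Iowa"],
})

# Number of rows that have at least one missing value.
rows_with_gaps = int(demo.isna().any(axis=1).sum())

# Share of missing values per column, largest first.
missing_share = demo.isna().mean().sort_values(ascending=False)

print(rows_with_gaps)
print(missing_share)
```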
test.head()
| State | Male | GeneralHealth | PhysicalHealthDays | MentalHealthDays | LastCheckupTime | PhysicalActivities | SleepHours | RemovedTeeth | HadHeartAttack | ... | HeightInMeters | WeightInKilograms | BMI | AlcoholDrinkers | HIVTesting | FluVaxLast12 | PneumoVaxEver | TetanusLast10Tdap | HighRiskLastYear | CovidPos |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
276058 | New York | 1.0 | 0.50 | 0.066667 | 0.0 | Within past 2 years (1 year but less than 2 ye... | NaN | 0.260870 | 0.0 | 0.0 | ... | 0.474074 | NaN | NaN | 0.0 | 0.0 | 0.0 | NaN | 0.0 | 0.0 | 0.0 |
189605 | Michigan | 0.0 | 0.25 | 0.666667 | 0.5 | Within past year (anytime less than 12 months ... | 1.0 | 0.173913 | 1.0 | 0.0 | ... | 0.570370 | 0.187845 | 0.145127 | 0.0 | NaN | 1.0 | 1.0 | NaN | 0.0 | 0.0 |
59234 | Delaware | 0.0 | 0.75 | 0.000000 | 0.0 | Within past year (anytime less than 12 months ... | 1.0 | 0.217391 | 0.0 | 0.0 | ... | 0.437037 | 0.164576 | 0.188206 | 0.0 | 0.0 | 1.0 | NaN | 0.0 | 0.0 | 0.0 |
255322 | New Mexico | 1.0 | 0.50 | 0.000000 | 0.0 | 5 or more years ago | 1.0 | 0.217391 | 0.0 | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
226504 | Montana | 0.0 | 0.75 | 0.200000 | 0.0 | Within past year (anytime less than 12 months ... | 1.0 | 0.304348 | 0.0 | 0.0 | ... | 0.607407 | 0.268339 | 0.208569 | 1.0 | 0.0 | 0.0 | 1.0 | NaN | 0.0 | 1.0 |
5 rows × 40 columns
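I fill the missing values with the median. Note that `.median().iloc[0]` reduces the per-column medians to a single scalar, the median of the first numeric column (Male), so every gap in every column is filled with that one value (0.0 in the output below):
numeric_columns = train.select_dtypes(include=['number']).columns
# .median() returns one median per column; .iloc[0] keeps only the first of them (Male).
test[numeric_columns] = test[numeric_columns].fillna(test[numeric_columns].median().iloc[0])
train[numeric_columns] = train[numeric_columns].fillna(train[numeric_columns].median().iloc[0])
# The original line omitted .median() here and filled from the first row of valid instead.
valid[numeric_columns] = valid[numeric_columns].fillna(valid[numeric_columns].median().iloc[0])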
test.head()
| State | Male | GeneralHealth | PhysicalHealthDays | MentalHealthDays | LastCheckupTime | PhysicalActivities | SleepHours | RemovedTeeth | HadHeartAttack | ... | HeightInMeters | WeightInKilograms | BMI | AlcoholDrinkers | HIVTesting | FluVaxLast12 | PneumoVaxEver | TetanusLast10Tdap | HighRiskLastYear | CovidPos |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
276058 | New York | 1.0 | 0.50 | 0.066667 | 0.0 | Within past 2 years (1 year but less than 2 ye... | 0.0 | 0.260870 | 0.0 | 0.0 | ... | 0.474074 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
189605 | Michigan | 0.0 | 0.25 | 0.666667 | 0.5 | Within past year (anytime less than 12 months ... | 1.0 | 0.173913 | 1.0 | 0.0 | ... | 0.570370 | 0.187845 | 0.145127 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
59234 | Delaware | 0.0 | 0.75 | 0.000000 | 0.0 | Within past year (anytime less than 12 months ... | 1.0 | 0.217391 | 0.0 | 0.0 | ... | 0.437037 | 0.164576 | 0.188206 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
255322 | New Mexico | 1.0 | 0.50 | 0.000000 | 0.0 | 5 or more years ago | 1.0 | 0.217391 | 0.0 | 0.0 | ... | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
226504 | Montana | 0.0 | 0.75 | 0.200000 | 0.0 | Within past year (anytime less than 12 months ... | 1.0 | 0.304348 | 0.0 | 0.0 | ... | 0.607407 | 0.268339 | 0.208569 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
5 rows × 40 columns
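In the output above, missing weights and BMI values come back as 0.000000, which is what a single scalar fill produces. If the intent is a per-column median, passing the whole Series of medians to `fillna` fills each column with its own value. The sketch below also takes the medians from the training split only, which is an assumption on my part rather than something the notebook does; the toy frames and names are illustrative.

```python
import numpy as np
import pandas as pd

# Illustrative toy splits, not rows from the real dataset.
train_demo = pd.DataFrame({
    "BMI": [20.0, np.nan, 30.0, 40.0],
    "SleepHours": [6.0, 7.0, np.nan, 9.0],
    "State": ["Ohio", "Texas", "Utah", "Iowa"],
})
test_demo = pd.DataFrame({
    "BMI": [np.nan, 22.0],
    "SleepHours": [5.0, np.nan],
    "State": ["Ohio", "Utah"],
})

numeric_cols = train_demo.select_dtypes(include=["number"]).columns

# One median per numeric column (a Series indexed by column name).
medians = train_demo[numeric_cols].median()

# fillna with a Series matches on column labels, so each column gets its own median.
train_filled = train_demo.copy()
train_filled[numeric_cols] = train_filled[numeric_cols].fillna(medians)
test_filled = test_demo.copy()
test_filled[numeric_cols] = test_filled[numeric_cols].fillna(medians)

print(medians)
print(test_filled)
```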
I filled the categorical columns with the value "Unknown" during normalization, because fillna with the median does not work for this type of data (https://stackoverflow.com/questions/49127897/python-pandas-fillna-median-not-working)
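The normalization step itself is not shown in this section. As a minimal sketch of how a `category` column can be filled with a placeholder: the new label has to be registered as a category first, otherwise `fillna` rejects it. The Series below is made up for illustration.

```python
import pandas as pd

# Illustrative toy column, not values from the real dataset.
state = pd.Series(["Alabama", None, "Ohio"], dtype="category")

# "Unknown" must exist as a category before it can be used as a fill value.
state_filled = state.cat.add_categories(["Unknown"]).fillna("Unknown")

print(state_filled)
```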
test["HighRiskLastYear"].value_counts()
HighRiskLastYear
0.0    42810
1.0     1703
Name: count, dtype: int64
test["HighRiskLastYear"].isna().sum()
0
No null values remain:
test.info()
<class 'pandas.core.frame.DataFrame'>
Index: 44513 entries, 276058 to 196692
Data columns (total 40 columns):
 #   Column                     Non-Null Count  Dtype
---  ------                     --------------  -----
 0   State                      44513 non-null  category
 1   Male                       44513 non-null  float64
 2   GeneralHealth              44513 non-null  float64
 3   PhysicalHealthDays         44513 non-null  float64
 4   MentalHealthDays           44513 non-null  float64
 5   LastCheckupTime            44513 non-null  category
 6   PhysicalActivities         44513 non-null  float64
 7   SleepHours                 44513 non-null  float64
 8   RemovedTeeth               44513 non-null  float64
 9   HadHeartAttack             44513 non-null  float64
 10  HadAngina                  44513 non-null  float64
 11  HadStroke                  44513 non-null  float64
 12  HadAsthma                  44513 non-null  float64
 13  HadSkinCancer              44513 non-null  float64
 14  HadCOPD                    44513 non-null  float64
 15  HadDepressiveDisorder      44513 non-null  float64
 16  HadKidneyDisease           44513 non-null  float64
 17  HadArthritis               44513 non-null  float64
 18  HadDiabetes                44513 non-null  float64
 19  DeafOrHardOfHearing        44513 non-null  float64
 20  BlindOrVisionDifficulty    44513 non-null  float64
 21  DifficultyConcentrating    44513 non-null  float64
 22  DifficultyWalking          44513 non-null  float64
 23  DifficultyDressingBathing  44513 non-null  float64
 24  DifficultyErrands          44513 non-null  float64
 25  SmokerStatus               44513 non-null  float64
 26  ECigaretteUsage            44513 non-null  float64
 27  ChestScan                  44513 non-null  float64
 28  RaceEthnicityCategory      44513 non-null  category
 29  AgeCategory                44513 non-null  category
 30  HeightInMeters             44513 non-null  float64
 31  WeightInKilograms          44513 non-null  float64
 32  BMI                        44513 non-null  float64
 33  AlcoholDrinkers            44513 non-null  float64
 34  HIVTesting                 44513 non-null  float64
 35  FluVaxLast12               44513 non-null  float64
 36  PneumoVaxEver              44513 non-null  float64
 37  TetanusLast10Tdap          44513 non-null  float64
 38  HighRiskLastYear           44513 non-null  float64
 39  CovidPos                   44513 non-null  float64
dtypes: category(4), float64(36)
memory usage: 12.7 MB
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 676617 entries, 0 to 676616
Data columns (total 40 columns):
 #   Column                     Non-Null Count   Dtype
---  ------                     --------------   -----
 0   State                      676617 non-null  category
 1   Male                       676617 non-null  float64
 2   GeneralHealth              676617 non-null  float64
 3   PhysicalHealthDays         676617 non-null  float64
 4   MentalHealthDays           676617 non-null  float64
 5   LastCheckupTime            676617 non-null  category
 6   PhysicalActivities         676617 non-null  float64
 7   SleepHours                 676617 non-null  float64
 8   RemovedTeeth               676617 non-null  float64
 9   HadHeartAttack             676617 non-null  float64
 10  HadAngina                  676617 non-null  float64
 11  HadStroke                  676617 non-null  float64
 12  HadAsthma                  676617 non-null  float64
 13  HadSkinCancer              676617 non-null  float64
 14  HadCOPD                    676617 non-null  float64
 15  HadDepressiveDisorder      676617 non-null  float64
 16  HadKidneyDisease           676617 non-null  float64
 17  HadArthritis               676617 non-null  float64
 18  HadDiabetes                676617 non-null  float64
 19  DeafOrHardOfHearing        676617 non-null  float64
 20  BlindOrVisionDifficulty    676617 non-null  float64
 21  DifficultyConcentrating    676617 non-null  float64
 22  DifficultyWalking          676617 non-null  float64
 23  DifficultyDressingBathing  676617 non-null  float64
 24  DifficultyErrands          676617 non-null  float64
 25  SmokerStatus               676617 non-null  float64
 26  ECigaretteUsage            676617 non-null  float64
 27  ChestScan                  676617 non-null  float64
 28  RaceEthnicityCategory      676617 non-null  category
 29  AgeCategory                676617 non-null  category
 30  HeightInMeters             676617 non-null  float64
 31  WeightInKilograms          676617 non-null  float64
 32  BMI                        676617 non-null  float64
 33  AlcoholDrinkers            676617 non-null  float64
 34  HIVTesting                 676617 non-null  float64
 35  FluVaxLast12               676617 non-null  float64
 36  PneumoVaxEver              676617 non-null  float64
 37  TetanusLast10Tdap          676617 non-null  float64
 38  HighRiskLastYear           676617 non-null  float64
 39  CovidPos                   676617 non-null  float64
dtypes: category(4), float64(36)
memory usage: 188.4 MB
valid.info()
<class 'pandas.core.frame.DataFrame'>
Index: 44514 entries, 127295 to 418173
Data columns (total 40 columns):
 #   Column                     Non-Null Count  Dtype
---  ------                     --------------  -----
 0   State                      44514 non-null  category
 1   Male                       44514 non-null  float64
 2   GeneralHealth              44514 non-null  float64
 3   PhysicalHealthDays         44514 non-null  float64
 4   MentalHealthDays           44514 non-null  float64
 5   LastCheckupTime            44514 non-null  category
 6   PhysicalActivities         44514 non-null  float64
 7   SleepHours                 44514 non-null  float64
 8   RemovedTeeth               44514 non-null  float64
 9   HadHeartAttack             44514 non-null  float64
 10  HadAngina                  44514 non-null  float64
 11  HadStroke                  44514 non-null  float64
 12  HadAsthma                  44514 non-null  float64
 13  HadSkinCancer              44514 non-null  float64
 14  HadCOPD                    44514 non-null  float64
 15  HadDepressiveDisorder      44514 non-null  float64
 16  HadKidneyDisease           44514 non-null  float64
 17  HadArthritis               44514 non-null  float64
 18  HadDiabetes                44514 non-null  float64
 19  DeafOrHardOfHearing        44514 non-null  float64
 20  BlindOrVisionDifficulty    44514 non-null  float64
 21  DifficultyConcentrating    44514 non-null  float64
 22  DifficultyWalking          44514 non-null  float64
 23  DifficultyDressingBathing  44514 non-null  float64
 24  DifficultyErrands          44514 non-null  float64
 25  SmokerStatus               44514 non-null  float64
 26  ECigaretteUsage            44514 non-null  float64
 27  ChestScan                  44514 non-null  float64
 28  RaceEthnicityCategory      44514 non-null  category
 29  AgeCategory                44514 non-null  category
 30  HeightInMeters             44514 non-null  float64
 31  WeightInKilograms          44514 non-null  float64
 32  BMI                        44514 non-null  float64
 33  AlcoholDrinkers            44514 non-null  float64
 34  HIVTesting                 44514 non-null  float64
 35  FluVaxLast12               44514 non-null  float64
 36  PneumoVaxEver              44514 non-null  float64
 37  TetanusLast10Tdap          44514 non-null  float64
 38  HighRiskLastYear           44514 non-null  float64
 39  CovidPos                   44514 non-null  float64
dtypes: category(4), float64(36)
memory usage: 12.7 MB
Saving to CSV
cat_columns = test.select_dtypes(['category']).columns
print(cat_columns)
Index(['State', 'LastCheckupTime', 'RaceEthnicityCategory', 'AgeCategory'], dtype='object')
#test[cat_columns] = test[cat_columns].apply(lambda x: pd.factorize(x)[0])
#train[cat_columns] = train[cat_columns].apply(lambda x: pd.factorize(x)[0])
#valid[cat_columns] = valid[cat_columns].apply(lambda x: pd.factorize(x)[0])
test.to_csv("test.csv")
train.to_csv("train.csv")
valid.to_csv("valid.csv")
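Two details of the CSV round trip are worth keeping in mind: `to_csv` writes the row index as an extra unnamed column by default, and the `category` dtype is not stored in the file. The sketch below, on a made-up frame and an in-memory buffer instead of a file, shows one way to handle both, assuming the original row labels are not needed later. If they are needed, keeping the default `index=True` and reading back with `index_col=0` is the alternative.

```python
import io
import pandas as pd

# Illustrative toy frame, not rows from the real dataset.
demo = pd.DataFrame(
    {"State": pd.Categorical(["Ohio", "Texas"]), "BMI": [0.2, 0.4]},
    index=[276058, 189605],
)

# Write without the index column (an in-memory buffer stands in for a file path).
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
buffer.seek(0)

# Restore the categorical dtype explicitly when reading back.
restored = pd.read_csv(buffer, dtype={"State": "category"})

print(restored.dtypes)
```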