1. Downloading the dataset
!pip install --user kaggle
Collecting kaggle Using cached kaggle-1.6.6.tar.gz (84 kB) Requirement already satisfied: six>=1.10 in c:\users\adrian\appdata\roaming\python\python39\site-packages (from kaggle) (1.16.0) Requirement already satisfied: certifi in c:\users\adrian\appdata\roaming\python\python39\site-packages (from kaggle) (2022.6.15) Requirement already satisfied: python-dateutil in c:\users\adrian\appdata\roaming\python\python39\site-packages (from kaggle) (2.8.2) Requirement already satisfied: requests in c:\users\adrian\appdata\roaming\python\python39\site-packages (from kaggle) (2.28.1) Requirement already satisfied: tqdm in c:\users\adrian\appdata\roaming\python\python39\site-packages (from kaggle) (4.64.1) Collecting python-slugify Using cached python_slugify-8.0.4-py2.py3-none-any.whl (10 kB) Requirement already satisfied: urllib3 in c:\users\adrian\appdata\roaming\python\python39\site-packages (from kaggle) (1.26.11) Requirement already satisfied: bleach in c:\users\adrian\miniconda3\lib\site-packages (from kaggle) (4.1.0) Requirement already satisfied: webencodings in c:\users\adrian\miniconda3\lib\site-packages (from bleach->kaggle) (0.5.1) Requirement already satisfied: packaging in c:\users\adrian\appdata\roaming\python\python39\site-packages (from bleach->kaggle) (22.0) Collecting text-unidecode>=1.3 Using cached text_unidecode-1.3-py2.py3-none-any.whl (78 kB) Requirement already satisfied: idna<4,>=2.5 in c:\users\adrian\appdata\roaming\python\python39\site-packages (from requests->kaggle) (2.10) Requirement already satisfied: charset-normalizer<3,>=2 in c:\users\adrian\appdata\roaming\python\python39\site-packages (from requests->kaggle) (2.1.0) Requirement already satisfied: colorama in c:\users\adrian\appdata\roaming\python\python39\site-packages (from tqdm->kaggle) (0.4.5) Building wheels for collected packages: kaggle Building wheel for kaggle (setup.py): started Building wheel for kaggle (setup.py): finished with status 'done' Created wheel for kaggle: 
filename=kaggle-1.6.6-py3-none-any.whl size=111961 sha256=3aa19c7655c19d77b65c2542567b6e34a57813227f1f2df6d0fd84accad6824f Stored in directory: c:\users\adrian\appdata\local\pip\cache\wheels\46\aa\c3\b3e421522fb5acdd7c366a05c5fc80787615bdeed207e7f79b Successfully built kaggle Installing collected packages: text-unidecode, python-slugify, kaggle Successfully installed kaggle-1.6.6 python-slugify-8.0.4 text-unidecode-1.3
!kaggle datasets download -d kamilpytlak/personal-key-indicators-of-heart-disease/
Downloading personal-key-indicators-of-heart-disease.zip to C:\Users\Adrian\Desktop\Semestr 1 (II ST)\ML\zadania
0%| | 0.00/21.4M [00:00<?, ?B/s] 5%|4 | 1.00M/21.4M [00:00<00:11, 1.80MB/s] 9%|9 | 2.00M/21.4M [00:00<00:05, 3.51MB/s] 19%|#8 | 4.00M/21.4M [00:00<00:02, 7.25MB/s] 33%|###2 | 7.00M/21.4M [00:00<00:01, 12.0MB/s] 51%|#####1 | 11.0M/21.4M [00:01<00:00, 17.9MB/s] 65%|######5 | 14.0M/21.4M [00:01<00:00, 20.7MB/s] 79%|#######9 | 17.0M/21.4M [00:01<00:00, 20.6MB/s] 93%|#########3| 20.0M/21.4M [00:01<00:00, 22.7MB/s] 100%|##########| 21.4M/21.4M [00:01<00:00, 14.9MB/s]
#!unzip -o personal-key-indicators-of-heart-disease.zip # does not work on Windows, so I use the zipfile module instead
'unzip' is not recognized as an internal or external command, operable program or batch file.
import zipfile
with zipfile.ZipFile("personal-key-indicators-of-heart-disease.zip", 'r') as zip_ref:
    zip_ref.extractall("dataset_extracted")
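Since the archive holds several files, it can help to look at the member names while extracting. Below is a minimal sketch of that idea using only the standard library; the helper name `extract_archive` and the demo file names are hypothetical and unrelated to the Kaggle archive.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_archive(zip_path, target_dir):
    """Extract zip_path into target_dir and return the sorted member names."""
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        names = sorted(zip_ref.namelist())
        zip_ref.extractall(target_dir)
    return names


# Demo on a throwaway archive (hypothetical file names, not the Kaggle data).
with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp) / "demo.zip"
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr("2022/a.csv", "x,y\n1,2\n")
        zf.writestr("2020/b.csv", "x,y\n3,4\n")
    members = extract_archive(archive, Path(tmp) / "out")
    extracted_ok = (Path(tmp) / "out" / "2022" / "a.csv").exists()

print(members)
```

The same code runs unchanged on Windows, Linux and macOS, which is the reason for preferring `zipfile` over the `unzip` command here.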
import pandas as pd
# The downloaded dataset contains several subsets; I deliberately open the one with NaNs to clean it manually for practice
df = pd.read_csv("dataset_extracted/2022/heart_2022_with_nans.csv")
Inspecting the uncleaned dataset
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 445132 entries, 0 to 445131
Data columns (total 40 columns):
 #   Column                     Non-Null Count   Dtype
---  ------                     --------------   -----
 0   State                      445132 non-null  object
 1   Sex                        445132 non-null  object
 2   GeneralHealth              443934 non-null  object
 3   PhysicalHealthDays         434205 non-null  float64
 4   MentalHealthDays           436065 non-null  float64
 5   LastCheckupTime            436824 non-null  object
 6   PhysicalActivities         444039 non-null  object
 7   SleepHours                 439679 non-null  float64
 8   RemovedTeeth               433772 non-null  object
 9   HadHeartAttack             442067 non-null  object
 10  HadAngina                  440727 non-null  object
 11  HadStroke                  443575 non-null  object
 12  HadAsthma                  443359 non-null  object
 13  HadSkinCancer              441989 non-null  object
 14  HadCOPD                    442913 non-null  object
 15  HadDepressiveDisorder      442320 non-null  object
 16  HadKidneyDisease           443206 non-null  object
 17  HadArthritis               442499 non-null  object
 18  HadDiabetes                444045 non-null  object
 19  DeafOrHardOfHearing        424485 non-null  object
 20  BlindOrVisionDifficulty    423568 non-null  object
 21  DifficultyConcentrating    420892 non-null  object
 22  DifficultyWalking          421120 non-null  object
 23  DifficultyDressingBathing  421217 non-null  object
 24  DifficultyErrands          419476 non-null  object
 25  SmokerStatus               409670 non-null  object
 26  ECigaretteUsage            409472 non-null  object
 27  ChestScan                  389086 non-null  object
 28  RaceEthnicityCategory      431075 non-null  object
 29  AgeCategory                436053 non-null  object
 30  HeightInMeters             416480 non-null  float64
 31  WeightInKilograms          403054 non-null  float64
 32  BMI                        396326 non-null  float64
 33  AlcoholDrinkers            398558 non-null  object
 34  HIVTesting                 379005 non-null  object
 35  FluVaxLast12               398011 non-null  object
 36  PneumoVaxEver              368092 non-null  object
 37  TetanusLast10Tdap          362616 non-null  object
 38  HighRiskLastYear           394509 non-null  object
 39  CovidPos                   394368 non-null  object
dtypes: float64(6), object(34)
memory usage: 135.8+ MB
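The non-null counts reported by `info()` can also be expressed as a share of missing values per column, which is easier to compare across columns. The sketch below shows this on a small made-up frame; the toy values are assumptions, only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; only the column names come from the dataset.
toy = pd.DataFrame({
    "State": ["Alabama", "Alaska", "Utah", "Ohio"],
    "BMI": [26.57, np.nan, 23.30, np.nan],
    "ChestScan": ["No", None, None, None],
})

# isna() gives a boolean frame; the mean of booleans is the fraction of True values.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Applied to the full frame, the same expression would be `df.isna().mean()`. For example, TetanusLast10Tdap has 362616 non-null values out of 445132 rows, so roughly 18.5% of its values are missing.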
df.head()
State | Sex | GeneralHealth | PhysicalHealthDays | MentalHealthDays | LastCheckupTime | PhysicalActivities | SleepHours | RemovedTeeth | HadHeartAttack | ... | HeightInMeters | WeightInKilograms | BMI | AlcoholDrinkers | HIVTesting | FluVaxLast12 | PneumoVaxEver | TetanusLast10Tdap | HighRiskLastYear | CovidPos | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Alabama | Female | Very good | 0.0 | 0.0 | Within past year (anytime less than 12 months ... | No | 8.0 | NaN | No | ... | NaN | NaN | NaN | No | No | Yes | No | Yes, received tetanus shot but not sure what type | No | No |
1 | Alabama | Female | Excellent | 0.0 | 0.0 | NaN | No | 6.0 | NaN | No | ... | 1.60 | 68.04 | 26.57 | No | No | No | No | No, did not receive any tetanus shot in the pa... | No | No |
2 | Alabama | Female | Very good | 2.0 | 3.0 | Within past year (anytime less than 12 months ... | Yes | 5.0 | NaN | No | ... | 1.57 | 63.50 | 25.61 | No | No | No | No | NaN | No | Yes |
3 | Alabama | Female | Excellent | 0.0 | 0.0 | Within past year (anytime less than 12 months ... | Yes | 7.0 | NaN | No | ... | 1.65 | 63.50 | 23.30 | No | No | Yes | Yes | No, did not receive any tetanus shot in the pa... | No | No |
4 | Alabama | Female | Fair | 2.0 | 0.0 | Within past year (anytime less than 12 months ... | Yes | 9.0 | NaN | No | ... | 1.57 | 53.98 | 21.77 | Yes | No | No | Yes | No, did not receive any tetanus shot in the pa... | No | No |
5 rows × 40 columns
df.describe()
PhysicalHealthDays | MentalHealthDays | SleepHours | HeightInMeters | WeightInKilograms | BMI | |
---|---|---|---|---|---|---|
count | 434205.000000 | 436065.000000 | 439679.000000 | 416480.000000 | 403054.000000 | 396326.000000 |
mean | 4.347919 | 4.382649 | 7.022983 | 1.702691 | 83.074470 | 28.529842 |
std | 8.688912 | 8.387475 | 1.502425 | 0.107177 | 21.448173 | 6.554889 |
min | 0.000000 | 0.000000 | 1.000000 | 0.910000 | 22.680000 | 12.020000 |
25% | 0.000000 | 0.000000 | 6.000000 | 1.630000 | 68.040000 | 24.130000 |
50% | 0.000000 | 0.000000 | 7.000000 | 1.700000 | 80.740000 | 27.440000 |
75% | 3.000000 | 5.000000 | 8.000000 | 1.780000 | 95.250000 | 31.750000 |
max | 30.000000 | 30.000000 | 24.000000 | 2.410000 | 292.570000 | 99.640000 |
Only 6 columns are numeric at this point, so describe() leaves the remaining 34 text columns out of this summary.
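The text columns can still be summarized if `describe()` is told to look at the non-numeric ones. This is a small sketch on made-up data (the toy values are assumptions); for text columns pandas reports count, number of unique values, the most frequent value and its frequency.

```python
import pandas as pd

# Toy frame with made-up values; only the column names come from the dataset.
toy = pd.DataFrame({
    "Sex": ["Female", "Male", "Female", "Female"],
    "GeneralHealth": ["Good", "Fair", "Good", None],
    "SleepHours": [8.0, 6.0, 7.0, 5.0],
})

# Columns that the default describe() would cover.
numeric_cols = toy.select_dtypes(include=["number"]).columns.tolist()

# Summary of everything that is not numeric: count, unique, top, freq.
text_summary = toy.describe(exclude=["number"])
print(numeric_cols)
print(text_summary)
```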
The dataset is imbalanced: the target variable HadHeartAttack is "No" (later encoded as 0) in the vast majority of cases:
df["HadHeartAttack"].value_counts().plot(kind="pie")
<AxesSubplot:ylabel='HadHeartAttack'>
df["HadHeartAttack"].value_counts()
No     416959
Yes     25108
Name: HadHeartAttack, dtype: int64
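To put a number on the imbalance, the counts above can be turned into class shares and a majority-to-minority ratio. The sketch below only reuses the two counts printed by `value_counts()`.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series({"No": 416959, "Yes": 25108}, name="HadHeartAttack")

shares = counts / counts.sum()                   # fraction of each class among non-missing answers
imbalance_ratio = counts["No"] / counts["Yes"]   # how many "No" rows per "Yes" row

print(shares.round(4))
print(round(imbalance_ratio, 2))
```

On the DataFrame itself, `df["HadHeartAttack"].value_counts(normalize=True)` should give the same shares directly.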
2. Splitting into subsets (train / dev / test, 8:1:1) and oversampling
from sklearn.model_selection import train_test_split
# The sklearn function has to be called twice because it only splits into two subsets
train, test_and_valid = train_test_split(df, test_size=0.2) #0.8 train, 0.2 test&valid
test, valid = train_test_split(test_and_valid, test_size=0.5) #0.1 test, 0.1 valid
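With a target this imbalanced, a purely random split can give the small subsets slightly different class proportions. A possible variant, not used in this notebook, is to pass `stratify` and a fixed `random_state` to `train_test_split`. The sketch below assumes a toy frame without missing targets (in the real data the NaN targets would have to be handled before stratifying).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data with made-up values: 10% positive class, no missing targets.
toy = pd.DataFrame({
    "feature": range(1000),
    "HadHeartAttack": ["Yes"] * 100 + ["No"] * 900,
})

# First split: 80% train, 20% held out; class proportions are preserved.
train_part, rest = train_test_split(
    toy, test_size=0.2, random_state=0, stratify=toy["HadHeartAttack"]
)
# Second split: the held-out 20% is halved into test and validation parts.
test_part, valid_part = train_test_split(
    rest, test_size=0.5, random_state=0, stratify=rest["HadHeartAttack"]
)

parts = (train_part, test_part, valid_part)
sizes = tuple(len(part) for part in parts)
yes_counts = tuple(int((part["HadHeartAttack"] == "Yes").sum()) for part in parts)
print(sizes, yes_counts)
```

Fixing `random_state` also makes the split reproducible between notebook runs.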
train["HadHeartAttack"].value_counts()
No     333641
Yes     20042
Name: HadHeartAttack, dtype: int64
The training set is still imbalanced, so I apply simple oversampling: the minority class is duplicated until the two classes are roughly equal in size.
def oversample(dataset):
    num_true = len(dataset[dataset["HadHeartAttack"]=="Yes"])
    num_false = len(dataset[dataset["HadHeartAttack"]=="No"])
    num_oversampling_steps = num_false//num_true
    oversampled = dataset.copy()
    for x in range(num_oversampling_steps):
        oversampled = pd.concat([oversampled, dataset[dataset["HadHeartAttack"]=="Yes"]], ignore_index=True)
    return oversampled
train = oversample(train)
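The arithmetic of this oversampling is easy to check on a tiny example. The sketch below is a hypothetical rewrite (`oversample_minority` and its parameters are not from the notebook) that uses the same number of steps but a single `pd.concat` call; it assumes at least one minority row exists.

```python
import pandas as pd


def oversample_minority(dataset, target="HadHeartAttack", minority="Yes", majority="No"):
    """Append whole copies of the minority rows, using the notebook's step count."""
    minority_rows = dataset[dataset[target] == minority]
    n_major = int((dataset[target] == majority).sum())
    copies = n_major // len(minority_rows)  # assumes len(minority_rows) > 0
    # One concat call instead of growing the frame inside a loop.
    return pd.concat([dataset] + [minority_rows] * copies, ignore_index=True)


# Toy data with made-up values: 10 "No" rows and 3 "Yes" rows.
toy = pd.DataFrame({"HadHeartAttack": ["No"] * 10 + ["Yes"] * 3, "x": range(13)})
balanced = oversample_minority(toy)
counts = balanced["HadHeartAttack"].value_counts()
print(counts)
```

Here 10 // 3 = 3 copies are appended, so the result has 3 + 9 = 12 "Yes" rows against 10 "No" rows. With the training counts shown above, 333641 // 20042 = 16 copies are appended, giving 17 × 20042 = 340714 "Yes" rows, slightly more than the 333641 "No" rows.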
train["HadHeartAttack"].value_counts().plot(kind="pie")
<AxesSubplot:ylabel='HadHeartAttack'>
test["HadHeartAttack"].value_counts().plot(kind="pie")
<AxesSubplot:ylabel='HadHeartAttack'>
valid["HadHeartAttack"].value_counts().plot(kind="pie")
<AxesSubplot:ylabel='HadHeartAttack'>
Proportions of smokers / non-smokers in the original dataset:
df["SmokerStatus"].value_counts().plot(kind="pie")
<AxesSubplot:ylabel='SmokerStatus'>
df["ECigaretteUsage"].value_counts().plot(kind="pie")
<AxesSubplot:ylabel='ECigaretteUsage'>
COVID statistics
df["CovidPos"].value_counts().plot(kind="pie")
<AxesSubplot:ylabel='CovidPos'>
Normalization, part 1: conversion to numeric and categorical columns
Columns that describe health status and similar features on a scale such as "poor/fair/good/excellent" were converted to numbers in a sensible order, increasing with the positive aspect of the given health factor. The same was done for how often a person smoked. Some columns were converted to the categorical type. The sex column was converted to numeric so that the model can use it later, although I had doubts about doing so in terms of political correctness.
df["Sex"].unique()
array(['Female', 'Male'], dtype=object)
df["GeneralHealth"].unique()
array(['Very good', 'Excellent', 'Fair', 'Poor', 'Good', nan], dtype=object)
health_map = {
    "Excellent": 5,
    "Very good": 4,
    "Good": 3,
    "Fair": 2,
    "Poor": 1
}
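The ordinal encoding described above comes down to `Series.map` with a dictionary. The short sketch below uses made-up answers (the label "Unknown label" is hypothetical) and shows that a plain dict already turns both missing values and labels absent from the map into NaN.

```python
import numpy as np
import pandas as pd

health_map = {"Excellent": 5, "Very good": 4, "Good": 3, "Fair": 2, "Poor": 1}

# Made-up answers, including a missing value and a label that is not in the map.
answers = pd.Series(["Very good", "Poor", np.nan, "Excellent", "Unknown label"])

# Values without an entry in the dict come out as NaN.
encoded = answers.map(health_map)
print(encoded)
```

Because of the NaN entries the result is a float column, which is why the encoded columns show up as float64 later on.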
for col in df:
    print(f"{col}:")
    print(df[col].unique())
State: ['Alabama' 'Alaska' 'Arizona' 'Arkansas' 'California' 'Colorado' 'Connecticut' 'Delaware' 'District of Columbia' 'Florida' 'Georgia' 'Hawaii' 'Idaho' 'Illinois' 'Indiana' 'Iowa' 'Kansas' 'Kentucky' 'Louisiana' 'Maine' 'Maryland' 'Massachusetts' 'Michigan' 'Minnesota' 'Mississippi' 'Missouri' 'Montana' 'Nebraska' 'Nevada' 'New Hampshire' 'New Jersey' 'New Mexico' 'New York' 'North Carolina' 'North Dakota' 'Ohio' 'Oklahoma' 'Oregon' 'Pennsylvania' 'Rhode Island' 'South Carolina' 'South Dakota' 'Tennessee' 'Texas' 'Utah' 'Vermont' 'Virginia' 'Washington' 'West Virginia' 'Wisconsin' 'Wyoming' 'Guam' 'Puerto Rico' 'Virgin Islands'] Sex: ['Female' 'Male'] GeneralHealth: ['Very good' 'Excellent' 'Fair' 'Poor' 'Good' nan] PhysicalHealthDays: [ 0. 2. 1. 8. 5. 30. 4. 23. 14. nan 15. 3. 10. 7. 25. 6. 21. 20. 29. 16. 9. 27. 28. 12. 13. 11. 26. 17. 24. 19. 18. 22.] MentalHealthDays: [ 0. 3. 9. 5. 15. 20. 14. 10. 18. 1. nan 2. 30. 4. 6. 7. 25. 8. 22. 29. 27. 21. 12. 28. 16. 13. 26. 17. 11. 23. 19. 24.] LastCheckupTime: ['Within past year (anytime less than 12 months ago)' nan 'Within past 2 years (1 year but less than 2 years ago)' 'Within past 5 years (2 years but less than 5 years ago)' '5 or more years ago'] PhysicalActivities: ['No' 'Yes' nan] SleepHours: [ 8. 6. 5. 7. 9. 4. 10. 1. 12. nan 18. 3. 2. 11. 16. 15. 13. 14. 20. 23. 17. 24. 22. 19. 21.] 
RemovedTeeth: [nan 'None of them' '1 to 5' '6 or more, but not all' 'All'] HadHeartAttack: ['No' 'Yes' nan] HadAngina: ['No' 'Yes' nan] HadStroke: ['No' 'Yes' nan] HadAsthma: ['No' 'Yes' nan] HadSkinCancer: ['No' 'Yes' nan] HadCOPD: ['No' 'Yes' nan] HadDepressiveDisorder: ['No' 'Yes' nan] HadKidneyDisease: ['No' 'Yes' nan] HadArthritis: ['No' 'Yes' nan] HadDiabetes: ['Yes' 'No' 'No, pre-diabetes or borderline diabetes' nan 'Yes, but only during pregnancy (female)'] DeafOrHardOfHearing: ['No' nan 'Yes'] BlindOrVisionDifficulty: ['No' 'Yes' nan] DifficultyConcentrating: ['No' nan 'Yes'] DifficultyWalking: ['No' 'Yes' nan] DifficultyDressingBathing: ['No' nan 'Yes'] DifficultyErrands: ['No' 'Yes' nan] SmokerStatus: ['Never smoked' 'Current smoker - now smokes some days' 'Former smoker' nan 'Current smoker - now smokes every day'] ECigaretteUsage: ['Not at all (right now)' 'Never used e-cigarettes in my entire life' nan 'Use them every day' 'Use them some days'] ChestScan: ['No' 'Yes' nan] RaceEthnicityCategory: ['White only, Non-Hispanic' 'Black only, Non-Hispanic' 'Other race only, Non-Hispanic' 'Multiracial, Non-Hispanic' nan 'Hispanic'] AgeCategory: ['Age 80 or older' 'Age 55 to 59' nan 'Age 40 to 44' 'Age 75 to 79' 'Age 70 to 74' 'Age 65 to 69' 'Age 60 to 64' 'Age 50 to 54' 'Age 45 to 49' 'Age 35 to 39' 'Age 25 to 29' 'Age 30 to 34' 'Age 18 to 24'] HeightInMeters: [ nan 1.6 1.57 1.65 1.8 1.63 1.7 1.68 1.73 1.55 1.93 1.88 1.78 1.85 1.75 1.52 1.83 1.91 1.96 1.5 1.45 1.42 1.24 1.47 1.22 1.98 2.03 2.01 1.3 1.4 1.35 1.82 1.67 1.76 2.11 1.37 1.64 1.71 2.16 2.26 0.91 2.06 1.14 1.74 1.51 1.53 1.69 1.56 1.84 1.9 1.54 1.72 1.87 1.61 1.49 1.59 1.58 1.62 1.79 1.46 1.89 2.13 0.99 2.08 2.21 1.32 2.18 1.77 2.36 1.25 1.66 1.86 1.95 1.19 1.05 1.48 1.03 1.18 1.81 1.38 1.44 1.07 1.27 1.2 1.17 1.04 2.24 1.1 1.43 1.92 2.05 1.12 2.41 2.34 0.97 1.06 1.15 2.29 1.16 1.09 0.92 2.07 1. 1.08 1.02 1.33 2. 
2.02 1.94 0.95] WeightInKilograms: [ nan 68.04 63.5 53.98 84.82 62.6 73.48 81.65 74.84 59.42 85.28 106.59 71.21 64.41 61.23 90.72 65.77 66.22 80.29 86.18 47.63 107.05 57.15 105.23 77.11 56.7 79.38 113.4 102.06 59.87 104.33 53.52 61.69 136.08 34.47 99.79 127.01 78.93 95.25 58.97 92.08 72.57 83.91 49.9 117.93 71.67 102.97 62.14 83.46 54.43 94.35 60.78 117.03 65.32 76.66 88.45 89.81 74.39 68.95 79.83 108.41 90.26 55.79 91.63 47.17 78.02 50.8 91.17 84.37 145.15 93.89 122.47 48.99 73.94 88.9 80.74 81.19 158.76 97.52 51.71 82.55 76.2 68.49 75.3 70.31 63.05 60.33 115.67 86.64 108.86 92.53 124.74 43.09 58.51 63.96 92.99 44.45 128.82 98.88 45.36 110.68 46.72 58.06 73.03 95.71 131.09 78.47 69.4 85.73 67.59 103.87 120.2 88. 54.88 111.58 52.16 77.56 126.55 94.8 123.83 89.36 75.75 69.85 112.49 82.1 106.14 57.61 70.76 148.78 96.16 67.13 48.08 163.29 109.77 100.7 142.88 64.86 111.13 121.11 55.34 101.6 93.44 117.48 120.66 66.68 44.91 132. 107.5 107.95 36.29 103.42 87.09 83.01 56.25 96.62 134.26 97.07 34.93 99.34 72.12 49.44 122.02 98.43 129.73 181.44 52.62 121.56 110.22 48.53 140.61 156.49 116.57 87.54 44. 114.31 31.75 97.98 101.15 112.04 100.24 113.85 154.22 118.39 133.81 149.69 41.73 119.75 138.35 151.95 129.27 131.54 104.78 132.45 102.51 116.12 40.37 105.69 136.98 195.04 53.07 132.9 124.28 112.94 114.76 45.81 119.29 167.83 51.26 172.37 162.39 46.27 127.91 123.38 38.56 130.63 143.34 115.21 166.92 135.17 109.32 135.62 204.12 127.46 118.84 139.25 126.1 122.92 151.5 133.36 42.64 50.35 80. 190.51 37.19 147.87 35.38 144.24 149.23 37.65 86. 147.42 281. 165.56 162.84 155.58 70. 137.89 189.6 206.38 148.32 42.18 153.77 38.1 90. 176.9 191.87 249.48 67. 95. 82. 170.1 62. 40.82 53. 139.71 130.18 100. 165.11 64. 43.54 24. 134.72 141.52 125.19 75. 60. 34.02 164.65 30.84 250. 58. 76. 73. 112. 74. 55. 200. 54. 66. 72. 152.41 39.46 220. 41.28 168.28 188.24 59. 46. 265. 238.14 168.74 145. 190. 93. 159.66 78. 50. 185.07 91. 104. 165. 183.7 33.57 161.93 68. 125.65 134. 130. 32.21 143.79 69. 
179.17 63. 105. 210.92 65. 32. 292.57 280. 85. 174.63 56. 128.37 87. 39.92 83. 169.64 156.04 177. 121. 151.05 89. 146.96 146.06 98. 166.47 36.74 171.46 227.25 29.48 190.06 161.03 35.83 226.8 175.09 138.8 240.4 158.3 170.55 61. 137.44 145.6 141.07 155.13 52. 120. 57. 77. 27.22 25.4 240. 96. 47. 115. 41. 45. 170. 150.59 272.16 26.31 48. 39.01 236. 92. 197.31 156. 84. 94. 29.03 49. 79. 157.85 192.78 255. 108. 185. 222.26 229.97 180. 81. 24.95 71. 26. 107. 101. 208.65 140. 175. 111. 110. 141.97 22.68 284.86 136.53 210. 103. 185.97 140.16 146.51 24.49 25.85 150. 102. 229.52 23.59 125. 163. 38. 135. 176.45 185.52 152.86 232.69 124. 192.32 186.88 118. 160.12 160. 193.68 201.85 144.7 184.16 142.43 169. 166.01 32.66 180.53 196.41 51. 40. 171.91 195.95 33.11 153.31 159.21 164.2 219.99 215.46 182.34 30. 160.57 173.27 158. 213.19 276.24 199.58 175.99 235.87 217.72 200.03 230.88 146. 24.04 178.72 150.14 157.4 163.75 191.42 174.18 28.58 97. 256.28 205.48 161.48 178.26 179.62 205.02 254.01 154.68 209.56 201.4 234.96 177.81 200.49 231.79 227.7 273.52 189.15 173.73 183.25 167.38 211.83 223.62 228.61 30.39 197.77 184.61 250.38 181.89 31.3 290.3 285. 113. 242.67 231.33 180.08 202.76 176. 188.69 206.84 164. 156.94 114. 122. 222. 137. 166. 180.98 272. 172.82 274.42 234.51 199.13 244.94 203.21 23.13 265.35 198.22 263.08 216.82 154. 169.19 239.04 177.35 210.47 224.98 117. 37. 126. 273.06 203.66 252.2 238.59 194.59 187.33 221.35 162. 224.53 23. 223.17 187.79 212.73 152. 233.6 193.23 205. 229.06 230. 247.21 99. 28.12 230.42 175.54 205.93 171. 26.76 212.28 217. 280.32 281.68 248.57 195. 42. 258.55 215. 116. 28. 123. 186.43 228.16 119. 219.09 214.55 278.96 182.8 138. 217.27 246.3 189. ] BMI: [ nan 26.57 25.61 ... 
13.51 28.39 48.63] AlcoholDrinkers: ['No' 'Yes' nan] HIVTesting: ['No' 'Yes' nan] FluVaxLast12: ['Yes' 'No' nan] PneumoVaxEver: ['No' 'Yes' nan] TetanusLast10Tdap: ['Yes, received tetanus shot but not sure what type' 'No, did not receive any tetanus shot in the past 10 years' nan 'Yes, received Tdap' 'Yes, received tetanus shot, but not Tdap'] HighRiskLastYear: ['No' nan 'Yes'] CovidPos: ['No' 'Yes' nan 'Tested positive using home test without a health professional']
from collections import defaultdict
def normalize_dataset(dataset):
    # Converts text columns to numeric or categorical dtypes; modifies dataset in place
    yes_no = {"No": 0, "Yes": 1}
    dataset["GeneralHealth"] = dataset["GeneralHealth"].map(defaultdict(lambda: float('NaN'), health_map), na_action='ignore')
    dataset["Sex"] = dataset["Sex"].map({"Female": 0, "Male": 1}).astype(float)  # text column converted to numeric
    dataset.rename(columns={"Sex": "Male"}, inplace=True)
    dataset["State"] = dataset["State"].astype('category')
    # astype returns a new Series, so the result has to be assigned back
    for col in ["PhysicalHealthDays", "MentalHealthDays", "SleepHours", "HeightInMeters", "WeightInKilograms", "BMI"]:
        dataset[col] = dataset[col].astype(float)
    # Later I fill missing values with the median, which does not work on categorical columns, so they are filled here before the conversion
    for col in ["LastCheckupTime", "RaceEthnicityCategory", "AgeCategory"]:
        dataset[col] = dataset[col].fillna("Unknown").astype('category')
    dataset["RemovedTeeth"] = dataset["RemovedTeeth"].map(defaultdict(lambda: float('NaN'), {"None of them": 0, "1 to 5": 1, "6 or more, but not all": 2, "All": 3}), na_action='ignore')
    # Plain No/Yes columns; any other answer (e.g. the home-test answer in CovidPos) is not in the map and becomes NaN
    for col in ["PhysicalActivities", "HadHeartAttack", "HadAngina", "HadStroke", "HadAsthma", "HadSkinCancer",
                "HadCOPD", "HadDepressiveDisorder", "HadKidneyDisease", "HadArthritis", "DeafOrHardOfHearing",
                "BlindOrVisionDifficulty", "DifficultyConcentrating", "DifficultyWalking", "DifficultyDressingBathing",
                "DifficultyErrands", "ChestScan", "AlcoholDrinkers", "HIVTesting", "FluVaxLast12", "PneumoVaxEver",
                "HighRiskLastYear", "CovidPos"]:
        dataset[col] = dataset[col].map(yes_no)
    dataset["HadDiabetes"] = dataset["HadDiabetes"].map({"No": 0, "Yes, but only during pregnancy (female)": 1, "No, pre-diabetes or borderline diabetes": 2, "Yes": 3})
    dataset["SmokerStatus"] = dataset["SmokerStatus"].map({"Never smoked": 0, "Current smoker - now smokes some days": 1, "Former smoker": 2, "Current smoker - now smokes every day": 3})
    dataset["ECigaretteUsage"] = dataset["ECigaretteUsage"].map({"Never used e-cigarettes in my entire life": 0, "Not at all (right now)": 1, "Use them some days": 2, "Use them every day": 3})
    # Any "Yes, ..." answer becomes 1.0 and any "No, ..." answer 0.0; missing answers stay NaN
    dataset["TetanusLast10Tdap"] = dataset["TetanusLast10Tdap"].apply(lambda x: float('NaN') if not isinstance(x, str) else 1.0 if 'Yes,' in x else 0.0 if 'No,' in x else float('NaN'))
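A dictionary passed to `map` silently turns every answer it does not list into NaN, so it can be worth checking which answers a map leaves out before applying it. The helper below is a hypothetical sketch (`unmapped_values` is not part of the notebook). The example uses the CovidPos answers, where the listing of unique values above shows a third option besides "No" and "Yes".

```python
import pandas as pd


def unmapped_values(series, mapping):
    """Return the distinct non-missing values of series that mapping does not cover."""
    present = series.dropna().unique()
    return sorted(value for value in present if value not in mapping)


# Made-up rows; the answer texts are the ones listed for CovidPos above.
covid = pd.Series([
    "No",
    "Yes",
    None,
    "Tested positive using home test without a health professional",
])
leftover = unmapped_values(covid, {"No": 0, "Yes": 1})
print(leftover)
```

An empty result would mean that the map covers every answer present in the column.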
The test set before the data type conversion
test.head()
State | Sex | GeneralHealth | PhysicalHealthDays | MentalHealthDays | LastCheckupTime | PhysicalActivities | SleepHours | RemovedTeeth | HadHeartAttack | ... | HeightInMeters | WeightInKilograms | BMI | AlcoholDrinkers | HIVTesting | FluVaxLast12 | PneumoVaxEver | TetanusLast10Tdap | HighRiskLastYear | CovidPos | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
339824 | South Dakota | Female | Good | 3.0 | 21.0 | Within past year (anytime less than 12 months ... | Yes | 8.0 | 6 or more, but not all | No | ... | 1.60 | 52.16 | 20.37 | No | Yes | Yes | Yes | Yes, received Tdap | No | No |
127927 | Kansas | Female | Good | 30.0 | 0.0 | Within past year (anytime less than 12 months ... | Yes | 10.0 | 1 to 5 | No | ... | 1.68 | 97.52 | 34.70 | No | No | Yes | Yes | NaN | No | No |
362523 | Utah | Male | Excellent | 0.0 | 0.0 | Within past year (anytime less than 12 months ... | Yes | 7.0 | 1 to 5 | No | ... | 1.83 | 113.85 | 34.04 | No | No | No | No | Yes, received Tdap | No | Yes |
183687 | Michigan | Male | Good | 0.0 | 7.0 | Within past year (anytime less than 12 months ... | Yes | 8.0 | None of them | No | ... | 1.78 | 83.91 | 26.54 | Yes | No | Yes | Yes | Yes, received Tdap | No | No |
191905 | Michigan | Female | Very good | 0.0 | 0.0 | Within past year (anytime less than 12 months ... | Yes | 7.0 | None of them | No | ... | 1.57 | 68.04 | 27.44 | Yes | Yes | No | Yes | Yes, received tetanus shot but not sure what type | No | No |
5 rows × 40 columns
The test set after the data type conversion
normalize_dataset(test)
test.head()
State | Male | GeneralHealth | PhysicalHealthDays | MentalHealthDays | LastCheckupTime | PhysicalActivities | SleepHours | RemovedTeeth | HadHeartAttack | ... | HeightInMeters | WeightInKilograms | BMI | AlcoholDrinkers | HIVTesting | FluVaxLast12 | PneumoVaxEver | TetanusLast10Tdap | HighRiskLastYear | CovidPos | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
339824 | South Dakota | 0.0 | 3.0 | 3.0 | 21.0 | Within past year (anytime less than 12 months ... | 1.0 | 8.0 | 2.0 | 0.0 | ... | 1.60 | 52.16 | 20.37 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 |
127927 | Kansas | 0.0 | 3.0 | 30.0 | 0.0 | Within past year (anytime less than 12 months ... | 1.0 | 10.0 | 1.0 | 0.0 | ... | 1.68 | 97.52 | 34.70 | 0.0 | 0.0 | 1.0 | 1.0 | NaN | 0.0 | 0.0 |
362523 | Utah | 1.0 | 5.0 | 0.0 | 0.0 | Within past year (anytime less than 12 months ... | 1.0 | 7.0 | 1.0 | 0.0 | ... | 1.83 | 113.85 | 34.04 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
183687 | Michigan | 1.0 | 3.0 | 0.0 | 7.0 | Within past year (anytime less than 12 months ... | 1.0 | 8.0 | 0.0 | 0.0 | ... | 1.78 | 83.91 | 26.54 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 |
191905 | Michigan | 0.0 | 4.0 | 0.0 | 0.0 | Within past year (anytime less than 12 months ... | 1.0 | 7.0 | 0.0 | 0.0 | ... | 1.57 | 68.04 | 27.44 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
5 rows × 40 columns
test.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 44513 entries, 339824 to 52161
Data columns (total 40 columns):
 #   Column                     Non-Null Count  Dtype
---  ------                     --------------  -----
 0   State                      44513 non-null  category
 1   Male                       44513 non-null  float64
 2   GeneralHealth              44393 non-null  float64
 3   PhysicalHealthDays         43469 non-null  float64
 4   MentalHealthDays           43622 non-null  float64
 5   LastCheckupTime            44513 non-null  category
 6   PhysicalActivities         44408 non-null  float64
 7   SleepHours                 44008 non-null  float64
 8   RemovedTeeth               43413 non-null  float64
 9   HadHeartAttack             44182 non-null  float64
 10  HadAngina                  44074 non-null  float64
 11  HadStroke                  44368 non-null  float64
 12  HadAsthma                  44339 non-null  float64
 13  HadSkinCancer              44184 non-null  float64
 14  HadCOPD                    44299 non-null  float64
 15  HadDepressiveDisorder      44218 non-null  float64
 16  HadKidneyDisease           44320 non-null  float64
 17  HadArthritis               44243 non-null  float64
 18  HadDiabetes                44411 non-null  float64
 19  DeafOrHardOfHearing        42485 non-null  float64
 20  BlindOrVisionDifficulty    42387 non-null  float64
 21  DifficultyConcentrating    42169 non-null  float64
 22  DifficultyWalking          42172 non-null  float64
 23  DifficultyDressingBathing  42182 non-null  float64
 24  DifficultyErrands          41999 non-null  float64
 25  SmokerStatus               41005 non-null  float64
 26  ECigaretteUsage            41003 non-null  float64
 27  ChestScan                  38958 non-null  float64
 28  RaceEthnicityCategory      44513 non-null  category
 29  AgeCategory                44513 non-null  category
 30  HeightInMeters             41714 non-null  float64
 31  WeightInKilograms          40397 non-null  float64
 32  BMI                        39724 non-null  float64
 33  AlcoholDrinkers            39956 non-null  float64
 34  HIVTesting                 38018 non-null  float64
 35  FluVaxLast12               39886 non-null  float64
 36  PneumoVaxEver              36860 non-null  float64
 37  TetanusLast10Tdap          36315 non-null  float64
 38  HighRiskLastYear           39538 non-null  float64
 39  CovidPos                   38114 non-null  float64
dtypes: category(4), float64(36)
memory usage: 12.7 MB
normalize_dataset(train)
normalize_dataset(valid)
Statistics for the subsets after conversion to numeric columns
The 50th percentile is the median.
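As a quick check of that remark, the `50%` row of `describe()` can be compared with `median()` on a small series. The values below are made up; the single outlier shows why the median is often more informative than the mean for columns such as SleepHours.

```python
import pandas as pd

# Made-up values with one extreme entry.
sleep = pd.Series([5.0, 6.0, 7.0, 8.0, 24.0], name="SleepHours")
summary = sleep.describe()

print(summary["50%"], sleep.median(), summary["mean"])
```

The median stays at 7.0 while the outlier pulls the mean up to 10.0.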
train.describe()
Male | GeneralHealth | PhysicalHealthDays | MentalHealthDays | PhysicalActivities | SleepHours | RemovedTeeth | HadHeartAttack | HadAngina | HadStroke | ... | HeightInMeters | WeightInKilograms | BMI | AlcoholDrinkers | HIVTesting | FluVaxLast12 | PneumoVaxEver | TetanusLast10Tdap | HighRiskLastYear | CovidPos | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 676777.000000 | 674433.000000 | 655630.000000 | 660417.000000 | 674814.000000 | 665609.000000 | 654359.000000 | 674355.000000 | 657726.000000 | 672927.000000 | ... | 637313.000000 | 619546.000000 | 611024.000000 | 606636.000000 | 572562.000000 | 605920.000000 | 570114.000000 | 554467.0 | 600540.000000 | 585150.000000 |
mean | 0.538139 | 3.055519 | 6.737547 | 4.863972 | 0.690146 | 7.032336 | 0.983081 | 0.505244 | 0.264549 | 0.117193 | ... | 1.707193 | 84.657015 | 28.917363 | 0.456819 | 0.325787 | 0.569879 | 0.527672 | 1.0 | 0.035087 | 0.273055 |
std | 0.498544 | 1.137862 | 10.713287 | 9.115863 | 0.462434 | 1.726387 | 1.019679 | 0.499973 | 0.441093 | 0.321650 | ... | 0.108002 | 21.753692 | 6.607455 | 0.498132 | 0.468668 | 0.495093 | 0.499234 | 0.0 | 0.183999 | 0.445529 |
min | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.910000 | 22.680000 | 12.050000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.0 | 0.000000 | 0.000000 |
25% | 0.000000 | 2.000000 | 0.000000 | 0.000000 | 0.000000 | 6.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 1.630000 | 69.400000 | 24.410000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.0 | 0.000000 | 0.000000 |
50% | 1.000000 | 3.000000 | 0.000000 | 0.000000 | 1.000000 | 7.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | ... | 1.700000 | 81.650000 | 27.890000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.0 | 0.000000 | 0.000000 |
75% | 1.000000 | 4.000000 | 10.000000 | 5.000000 | 1.000000 | 8.000000 | 2.000000 | 1.000000 | 1.000000 | 0.000000 | ... | 1.780000 | 96.160000 | 32.260000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.0 | 0.000000 | 1.000000 |
max | 1.000000 | 5.000000 | 30.000000 | 30.000000 | 1.000000 | 24.000000 | 3.000000 | 1.000000 | 1.000000 | 1.000000 | ... | 2.410000 | 292.570000 | 97.650000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.0 | 1.000000 | 1.000000 |
8 rows × 36 columns
test.describe()
Male | GeneralHealth | PhysicalHealthDays | MentalHealthDays | PhysicalActivities | SleepHours | RemovedTeeth | HadHeartAttack | HadAngina | HadStroke | ... | HeightInMeters | WeightInKilograms | BMI | AlcoholDrinkers | HIVTesting | FluVaxLast12 | PneumoVaxEver | TetanusLast10Tdap | HighRiskLastYear | CovidPos | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 44513.000000 | 44393.000000 | 43469.000000 | 43622.000000 | 44408.000000 | 44008.000000 | 43413.000000 | 44182.000000 | 44074.000000 | 44368.000000 | ... | 41714.000000 | 40397.000000 | 39724.000000 | 39956.000000 | 38018.000000 | 39886.000000 | 36860.000000 | 36315.0 | 39538.000000 | 38114.000000 |
mean | 0.471593 | 3.441511 | 4.275001 | 4.298221 | 0.760021 | 7.036584 | 0.685302 | 0.057241 | 0.061056 | 0.042869 | ... | 1.703194 | 83.021746 | 28.512326 | 0.529532 | 0.342233 | 0.527002 | 0.413592 | 1.0 | 0.043427 | 0.289500 |
std | 0.499198 | 1.050924 | 8.588663 | 8.299250 | 0.427075 | 1.512667 | 0.884912 | 0.232304 | 0.239436 | 0.202563 | ... | 0.107438 | 21.551394 | 6.596149 | 0.499133 | 0.474463 | 0.499277 | 0.492484 | 0.0 | 0.203818 | 0.453536 |
min | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.910000 | 22.680000 | 12.020000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.0 | 0.000000 | 0.000000 |
25% | 0.000000 | 3.000000 | 0.000000 | 0.000000 | 1.000000 | 6.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 1.630000 | 68.040000 | 24.030000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.0 | 0.000000 | 0.000000 |
50% | 0.000000 | 3.000000 | 0.000000 | 0.000000 | 1.000000 | 7.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 1.700000 | 80.740000 | 27.410000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 1.0 | 0.000000 | 0.000000 |
75% | 1.000000 | 4.000000 | 3.000000 | 4.000000 | 1.000000 | 8.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 1.780000 | 95.250000 | 31.650000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.0 | 0.000000 | 1.000000 |
max | 1.000000 | 5.000000 | 30.000000 | 30.000000 | 1.000000 | 24.000000 | 3.000000 | 1.000000 | 1.000000 | 1.000000 | ... | 2.340000 | 290.300000 | 97.650000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.0 | 1.000000 | 1.000000 |
8 rows × 36 columns
valid.describe()
Male | GeneralHealth | PhysicalHealthDays | MentalHealthDays | PhysicalActivities | SleepHours | RemovedTeeth | HadHeartAttack | HadAngina | HadStroke | ... | HeightInMeters | WeightInKilograms | BMI | AlcoholDrinkers | HIVTesting | FluVaxLast12 | PneumoVaxEver | TetanusLast10Tdap | HighRiskLastYear | CovidPos | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 44514.000000 | 44388.000000 | 43458.000000 | 43578.000000 | 44401.000000 | 43966.000000 | 43360.000000 | 44202.000000 | 44031.000000 | 44344.000000 | ... | 41677.000000 | 40327.000000 | 39626.000000 | 39950.000000 | 38041.000000 | 39885.000000 | 36926.000000 | 36250.0 | 39535.000000 | 38212.000000 |
mean | 0.469043 | 3.434554 | 4.355470 | 4.379022 | 0.758361 | 7.013010 | 0.684732 | 0.057396 | 0.058936 | 0.042463 | ... | 1.702146 | 82.981070 | 28.521370 | 0.527910 | 0.342525 | 0.525285 | 0.412771 | 1.0 | 0.044846 | 0.291872 |
std | 0.499046 | 1.051996 | 8.718506 | 8.383576 | 0.428081 | 1.491967 | 0.882396 | 0.232600 | 0.235507 | 0.201646 | ... | 0.106978 | 21.512676 | 6.622255 | 0.499227 | 0.474560 | 0.499367 | 0.492339 | 0.0 | 0.206969 | 0.454630 |
min | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.910000 | 22.680000 | 12.160000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.0 | 0.000000 | 0.000000 |
25% | 0.000000 | 3.000000 | 0.000000 | 0.000000 | 1.000000 | 6.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 1.630000 | 68.040000 | 24.030000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.0 | 0.000000 | 0.000000 |
50% | 0.000000 | 3.000000 | 0.000000 | 0.000000 | 1.000000 | 7.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 1.700000 | 79.830000 | 27.400000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 1.0 | 0.000000 | 0.000000 |
75% | 1.000000 | 4.000000 | 3.000000 | 5.000000 | 1.000000 | 8.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 1.780000 | 95.250000 | 31.750000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.0 | 0.000000 | 1.000000 |
max | 1.000000 | 5.000000 | 30.000000 | 30.000000 | 1.000000 | 24.000000 | 3.000000 | 1.000000 | 1.000000 | 1.000000 | ... | 2.360000 | 263.080000 | 99.640000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.0 | 1.000000 | 1.000000 |
8 rows × 36 columns
There appears to be a correlation between body mass and heart attacks:
import seaborn as sns
sns.set_theme()
g = sns.catplot(
    data=train, kind="bar",
    x="GeneralHealth", y="WeightInKilograms", hue="HadHeartAttack",
    errorbar="sd", palette="dark", alpha=.6, height=6
)
g.despine(left=True)
g.set_axis_labels("General health index", "Body mass (kg)")
g.legend.set_title("Had heart attack")
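As a rough numeric check of the impression the bar plot gives, the sketch below computes the Pearson correlation between a 0/1 outcome and a continuous variable (the point-biserial correlation) together with the mean weight per outcome group. The small DataFrame `toy` is made-up illustration data, not values from this dataset; only the column names `WeightInKilograms` and `HadHeartAttack` come from the notebook.

```python
import pandas as pd

# Hypothetical toy data; in the notebook this would be the `train` DataFrame.
toy = pd.DataFrame({
    "WeightInKilograms": [60.0, 72.0, 85.0, 90.0, 110.0, 68.0],
    "HadHeartAttack":    [0.0, 0.0, 1.0, 0.0, 1.0, 0.0],
})

# Pearson correlation with a 0/1 variable is the point-biserial correlation.
corr = toy["WeightInKilograms"].corr(toy["HadHeartAttack"])

# Mean weight within each outcome group.
group_means = toy.groupby("HadHeartAttack")["WeightInKilograms"].mean()

print(f"correlation: {corr:.3f}")
print(group_means)
```

A coefficient close to zero would suggest that the visible difference between the bars is small relative to the spread shown by the `errorbar="sd"` whiskers.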
Smokers had heart attacks more often:
valid.groupby('SmokerStatus', as_index=False)['HadHeartAttack'].mean()
| | SmokerStatus | HadHeartAttack |
---|---|---|
0 | 0.0 | 0.037883 |
1 | 1.0 | 0.072598 |
2 | 2.0 | 0.088887 |
3 | 3.0 | 0.090192 |
In this dataset, people with a worse "GeneralHealth" score had heart attacks more often:
valid.groupby('GeneralHealth', as_index=False)['HadHeartAttack'].mean()
| | GeneralHealth | HadHeartAttack |
---|---|---|
0 | 1.0 | 0.228411 |
1 | 2.0 | 0.129270 |
2 | 3.0 | 0.056693 |
3 | 4.0 | 0.027336 |
4 | 5.0 | 0.014743 |
valid.pivot_table('HadHeartAttack',index='GeneralHealth', columns='SmokerStatus')
SmokerStatus | 0.0 | 1.0 | 2.0 | 3.0 |
---|---|---|---|---|
GeneralHealth | ||||
1.0 | 0.194640 | 0.310680 | 0.257100 | 0.222222 |
2.0 | 0.090772 | 0.146429 | 0.184443 | 0.155059 |
3.0 | 0.039989 | 0.031068 | 0.091645 | 0.060469 |
4.0 | 0.021611 | 0.032070 | 0.035265 | 0.048292 |
5.0 | 0.011078 | 0.012579 | 0.026298 | 0.018315 |
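Each cell in the pivot table is a group mean, so cells backed by few respondents are noisier than the rest. A minimal sketch of how the cell sizes could be inspected next to the rates, using `pivot_table` with `aggfunc="count"`; the DataFrame `toy` is hypothetical illustration data, and in the notebook `valid` would take its place.

```python
import pandas as pd

# Hypothetical toy data standing in for the `valid` DataFrame.
toy = pd.DataFrame({
    "GeneralHealth":  [1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0],
    "SmokerStatus":   [0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0],
    "HadHeartAttack": [0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0],
})

# Heart attack rate per (GeneralHealth, SmokerStatus) cell.
rates = toy.pivot_table("HadHeartAttack", index="GeneralHealth",
                        columns="SmokerStatus", aggfunc="mean")

# Number of non-missing observations behind each rate.
counts = toy.pivot_table("HadHeartAttack", index="GeneralHealth",
                         columns="SmokerStatus", aggfunc="count")

print(rates)
print(counts)
```

Reading the two tables side by side shows which rates rest on only a handful of rows.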
Normalization, part 2 - scaling numeric columns to the 0-1 range
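from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
def scale_float_columns(dataset):
    # Scale every float64 column to the 0-1 range, in place.
    # Note: the scaler is refit on each dataset this function is called with.
    numerical_columns = list(dataset.select_dtypes(include=['float64']).columns)
    dataset[numerical_columns] = scaler.fit_transform(dataset[numerical_columns])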
test.head()
| | State | Male | GeneralHealth | PhysicalHealthDays | MentalHealthDays | LastCheckupTime | PhysicalActivities | SleepHours | RemovedTeeth | HadHeartAttack | ... | HeightInMeters | WeightInKilograms | BMI | AlcoholDrinkers | HIVTesting | FluVaxLast12 | PneumoVaxEver | TetanusLast10Tdap | HighRiskLastYear | CovidPos |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
339824 | South Dakota | 0.0 | 3.0 | 3.0 | 21.0 | Within past year (anytime less than 12 months ... | 1.0 | 8.0 | 2.0 | 0.0 | ... | 1.60 | 52.16 | 20.37 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 |
127927 | Kansas | 0.0 | 3.0 | 30.0 | 0.0 | Within past year (anytime less than 12 months ... | 1.0 | 10.0 | 1.0 | 0.0 | ... | 1.68 | 97.52 | 34.70 | 0.0 | 0.0 | 1.0 | 1.0 | NaN | 0.0 | 0.0 |
362523 | Utah | 1.0 | 5.0 | 0.0 | 0.0 | Within past year (anytime less than 12 months ... | 1.0 | 7.0 | 1.0 | 0.0 | ... | 1.83 | 113.85 | 34.04 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
183687 | Michigan | 1.0 | 3.0 | 0.0 | 7.0 | Within past year (anytime less than 12 months ... | 1.0 | 8.0 | 0.0 | 0.0 | ... | 1.78 | 83.91 | 26.54 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 |
191905 | Michigan | 0.0 | 4.0 | 0.0 | 0.0 | Within past year (anytime less than 12 months ... | 1.0 | 7.0 | 0.0 | 0.0 | ... | 1.57 | 68.04 | 27.44 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
5 rows × 40 columns
scale_float_columns(test)
scale_float_columns(train)
scale_float_columns(valid)
test.head()
| | State | Male | GeneralHealth | PhysicalHealthDays | MentalHealthDays | LastCheckupTime | PhysicalActivities | SleepHours | RemovedTeeth | HadHeartAttack | ... | HeightInMeters | WeightInKilograms | BMI | AlcoholDrinkers | HIVTesting | FluVaxLast12 | PneumoVaxEver | TetanusLast10Tdap | HighRiskLastYear | CovidPos |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
339824 | South Dakota | 0.0 | 0.50 | 0.1 | 0.700000 | Within past year (anytime less than 12 months ... | 1.0 | 0.304348 | 0.666667 | 0.0 | ... | 0.482517 | 0.110156 | 0.097513 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
127927 | Kansas | 0.0 | 0.50 | 1.0 | 0.000000 | Within past year (anytime less than 12 months ... | 1.0 | 0.391304 | 0.333333 | 0.0 | ... | 0.538462 | 0.279650 | 0.264860 | 0.0 | 0.0 | 1.0 | 1.0 | NaN | 0.0 | 0.0 |
362523 | Utah | 1.0 | 1.00 | 0.0 | 0.000000 | Within past year (anytime less than 12 months ... | 1.0 | 0.260870 | 0.333333 | 0.0 | ... | 0.643357 | 0.340670 | 0.257153 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
183687 | Michigan | 1.0 | 0.50 | 0.0 | 0.233333 | Within past year (anytime less than 12 months ... | 1.0 | 0.304348 | 0.000000 | 0.0 | ... | 0.608392 | 0.228795 | 0.169567 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
191905 | Michigan | 0.0 | 0.75 | 0.0 | 0.000000 | Within past year (anytime less than 12 months ... | 1.0 | 0.260870 | 0.000000 | 0.0 | ... | 0.461538 | 0.169494 | 0.180077 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
5 rows × 40 columns
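Min-max scaling maps each value x to (x - min) / (max - min). For example, a height of 1.60 with a column minimum of 0.91 and maximum of 2.34 becomes 0.69 / 1.43, about 0.4825, which is consistent with the first row shown above. Because the scaler is refit on every call, the same raw value can map to different scaled values in `train`, `valid` and `test`. The sketch below shows a common alternative, not what the notebook does: fit `MinMaxScaler` once on the training frame and reuse it for the others. The helper `scale_with_train_stats` and the toy frames are hypothetical names introduced for illustration.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy frames; the notebook's `train` and `test` would go here.
train_toy = pd.DataFrame({"HeightInMeters": [1.50, 1.70, 1.90],
                          "State": ["Utah", "Kansas", "Utah"]})
test_toy = pd.DataFrame({"HeightInMeters": [1.60, 1.80],
                         "State": ["Kansas", "Utah"]})

def scale_with_train_stats(train_df, other_dfs):
    """Fit MinMaxScaler on train_df's float64 columns and apply it to every frame.

    Returns scaled copies in the order [train_df, *other_dfs].
    """
    cols = list(train_df.select_dtypes(include=["float64"]).columns)
    scaler = MinMaxScaler().fit(train_df[cols])
    scaled = []
    for df in [train_df, *other_dfs]:
        out = df.copy()
        out[cols] = scaler.transform(df[cols])
        scaled.append(out)
    return scaled

train_scaled, test_scaled = scale_with_train_stats(train_toy, [test_toy])
print(test_scaled)
```

With this approach, values outside the training range can fall below 0 or above 1 in the other frames, which is expected.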
5. Cleaning missing fields
We cannot use .dropna(), because a large share of rows (199110 of 445132, about 45%) have missing values:
print(df.shape[0])
print(df.shape[0] - df.dropna().shape[0])
445132
199110
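Before choosing an imputation strategy it can help to see where the gaps are. A minimal sketch that counts rows with at least one missing value and the number of missing values per column; the DataFrame `toy` is hypothetical illustration data, and in the notebook `df` would take its place.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the notebook's `df`.
toy = pd.DataFrame({
    "BMI": [27.4, np.nan, 31.7, 24.0],
    "SleepHours": [7.0, 8.0, np.nan, 6.0],
    "HadHeartAttack": [0.0, 0.0, 1.0, 0.0],
})

# Rows that would be lost by dropna(): those with any missing cell.
rows_with_missing = int(toy.isna().any(axis=1).sum())
share_with_missing = float(toy.isna().any(axis=1).mean())

# Missing cells per column, largest first.
missing_per_column = toy.isna().sum().sort_values(ascending=False)

print(rows_with_missing, share_with_missing)
print(missing_per_column)
```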
test.head()
| | State | Male | GeneralHealth | PhysicalHealthDays | MentalHealthDays | LastCheckupTime | PhysicalActivities | SleepHours | RemovedTeeth | HadHeartAttack | ... | HeightInMeters | WeightInKilograms | BMI | AlcoholDrinkers | HIVTesting | FluVaxLast12 | PneumoVaxEver | TetanusLast10Tdap | HighRiskLastYear | CovidPos |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
339824 | South Dakota | 0.0 | 0.50 | 0.1 | 0.700000 | Within past year (anytime less than 12 months ... | 1.0 | 0.304348 | 0.666667 | 0.0 | ... | 0.482517 | 0.110156 | 0.097513 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
127927 | Kansas | 0.0 | 0.50 | 1.0 | 0.000000 | Within past year (anytime less than 12 months ... | 1.0 | 0.391304 | 0.333333 | 0.0 | ... | 0.538462 | 0.279650 | 0.264860 | 0.0 | 0.0 | 1.0 | 1.0 | NaN | 0.0 | 0.0 |
362523 | Utah | 1.0 | 1.00 | 0.0 | 0.000000 | Within past year (anytime less than 12 months ... | 1.0 | 0.260870 | 0.333333 | 0.0 | ... | 0.643357 | 0.340670 | 0.257153 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
183687 | Michigan | 1.0 | 0.50 | 0.0 | 0.233333 | Within past year (anytime less than 12 months ... | 1.0 | 0.304348 | 0.000000 | 0.0 | ... | 0.608392 | 0.228795 | 0.169567 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
191905 | Michigan | 0.0 | 0.75 | 0.0 | 0.000000 | Within past year (anytime less than 12 months ... | 1.0 | 0.260870 | 0.000000 | 0.0 | ... | 0.461538 | 0.169494 | 0.180077 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
5 rows × 40 columns
I fill the missing values with the median:
#test.dropna(inplace=True)
#train.dropna(inplace=True)
#valid.dropna(inplace=True)
# numeric_only=True restricts the median to numeric columns, which avoids the
# pandas FutureWarning about dropping nuisance (non-numeric) columns.
test.fillna(test.median(numeric_only=True), inplace=True)
train.fillna(train.median(numeric_only=True), inplace=True)
valid.fillna(valid.median(numeric_only=True), inplace=True)
test.head()
| | State | Male | GeneralHealth | PhysicalHealthDays | MentalHealthDays | LastCheckupTime | PhysicalActivities | SleepHours | RemovedTeeth | HadHeartAttack | ... | HeightInMeters | WeightInKilograms | BMI | AlcoholDrinkers | HIVTesting | FluVaxLast12 | PneumoVaxEver | TetanusLast10Tdap | HighRiskLastYear | CovidPos |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
339824 | South Dakota | 0.0 | 0.50 | 0.1 | 0.700000 | Within past year (anytime less than 12 months ... | 1.0 | 0.304348 | 0.666667 | 0.0 | ... | 0.482517 | 0.110156 | 0.097513 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
127927 | Kansas | 0.0 | 0.50 | 1.0 | 0.000000 | Within past year (anytime less than 12 months ... | 1.0 | 0.391304 | 0.333333 | 0.0 | ... | 0.538462 | 0.279650 | 0.264860 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
362523 | Utah | 1.0 | 1.00 | 0.0 | 0.000000 | Within past year (anytime less than 12 months ... | 1.0 | 0.260870 | 0.333333 | 0.0 | ... | 0.643357 | 0.340670 | 0.257153 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
183687 | Michigan | 1.0 | 0.50 | 0.0 | 0.233333 | Within past year (anytime less than 12 months ... | 1.0 | 0.304348 | 0.000000 | 0.0 | ... | 0.608392 | 0.228795 | 0.169567 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
191905 | Michigan | 0.0 | 0.75 | 0.0 | 0.000000 | Within past year (anytime less than 12 months ... | 1.0 | 0.260870 | 0.000000 | 0.0 | ... | 0.461538 | 0.169494 | 0.180077 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
5 rows × 40 columns
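The cell above fills each split with its own medians. A common alternative, shown here only as a sketch and not as the notebook's method, is to compute the medians on the training frame and reuse them for the other splits, so that no statistics from `valid` or `test` enter the preprocessing. The toy frames and variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames; the notebook's `train` and `test` would go here.
train_toy = pd.DataFrame({"BMI": [22.0, 28.0, np.nan, 30.0],
                          "State": ["Utah", "Kansas", "Utah", "Kansas"]})
test_toy = pd.DataFrame({"BMI": [np.nan, 35.0],
                         "State": ["Utah", "Kansas"]})

# Medians of the numeric columns, computed on the training data only.
train_medians = train_toy.median(numeric_only=True)

# fillna with a Series fills each column using the entry with the matching name.
train_filled = train_toy.fillna(train_medians)
test_filled = test_toy.fillna(train_medians)

print(test_filled)
```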
I filled the categorical columns with the value "Unknown" during normalization, because fillna with the median does not work for this type of data (https://stackoverflow.com/questions/49127897/python-pandas-fillna-median-not-working)
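The normalization step itself is outside this section, so the following is only a sketch of one way such a fill could look for a column of dtype `category`. The Series `s` is hypothetical illustration data. The point it shows is that a new label has to be added to the categories before it can be used as a fill value.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with one missing entry.
s = pd.Series(["Utah", np.nan, "Kansas"], dtype="category")

# "Unknown" must be registered as a category first; filling a categorical
# column with a label that is not among its categories raises an error.
filled = s.cat.add_categories("Unknown").fillna("Unknown")

print(filled)
```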
test["HighRiskLastYear"].value_counts()
0.0    42796
1.0     1717
Name: HighRiskLastYear, dtype: int64
test["HighRiskLastYear"].isna().sum()
0
No missing values remain:
test.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 44513 entries, 339824 to 52161
Data columns (total 40 columns):
 #   Column                     Non-Null Count  Dtype
---  ------                     --------------  -----
 0   State                      44513 non-null  category
 1   Male                       44513 non-null  float64
 2   GeneralHealth              44513 non-null  float64
 3   PhysicalHealthDays         44513 non-null  float64
 4   MentalHealthDays           44513 non-null  float64
 5   LastCheckupTime            44513 non-null  category
 6   PhysicalActivities         44513 non-null  float64
 7   SleepHours                 44513 non-null  float64
 8   RemovedTeeth               44513 non-null  float64
 9   HadHeartAttack             44513 non-null  float64
 10  HadAngina                  44513 non-null  float64
 11  HadStroke                  44513 non-null  float64
 12  HadAsthma                  44513 non-null  float64
 13  HadSkinCancer              44513 non-null  float64
 14  HadCOPD                    44513 non-null  float64
 15  HadDepressiveDisorder      44513 non-null  float64
 16  HadKidneyDisease           44513 non-null  float64
 17  HadArthritis               44513 non-null  float64
 18  HadDiabetes                44513 non-null  float64
 19  DeafOrHardOfHearing        44513 non-null  float64
 20  BlindOrVisionDifficulty    44513 non-null  float64
 21  DifficultyConcentrating    44513 non-null  float64
 22  DifficultyWalking          44513 non-null  float64
 23  DifficultyDressingBathing  44513 non-null  float64
 24  DifficultyErrands          44513 non-null  float64
 25  SmokerStatus               44513 non-null  float64
 26  ECigaretteUsage            44513 non-null  float64
 27  ChestScan                  44513 non-null  float64
 28  RaceEthnicityCategory      44513 non-null  category
 29  AgeCategory                44513 non-null  category
 30  HeightInMeters             44513 non-null  float64
 31  WeightInKilograms          44513 non-null  float64
 32  BMI                        44513 non-null  float64
 33  AlcoholDrinkers            44513 non-null  float64
 34  HIVTesting                 44513 non-null  float64
 35  FluVaxLast12               44513 non-null  float64
 36  PneumoVaxEver              44513 non-null  float64
 37  TetanusLast10Tdap          44513 non-null  float64
 38  HighRiskLastYear           44513 non-null  float64
 39  CovidPos                   44513 non-null  float64
dtypes: category(4), float64(36)
memory usage: 12.7 MB
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 676777 entries, 0 to 676776
Data columns (total 40 columns):
 #   Column                     Non-Null Count   Dtype
---  ------                     --------------   -----
 0   State                      676777 non-null  category
 1   Male                       676777 non-null  float64
 2   GeneralHealth              676777 non-null  float64
 3   PhysicalHealthDays         676777 non-null  float64
 4   MentalHealthDays           676777 non-null  float64
 5   LastCheckupTime            676777 non-null  category
 6   PhysicalActivities         676777 non-null  float64
 7   SleepHours                 676777 non-null  float64
 8   RemovedTeeth               676777 non-null  float64
 9   HadHeartAttack             676777 non-null  float64
 10  HadAngina                  676777 non-null  float64
 11  HadStroke                  676777 non-null  float64
 12  HadAsthma                  676777 non-null  float64
 13  HadSkinCancer              676777 non-null  float64
 14  HadCOPD                    676777 non-null  float64
 15  HadDepressiveDisorder      676777 non-null  float64
 16  HadKidneyDisease           676777 non-null  float64
 17  HadArthritis               676777 non-null  float64
 18  HadDiabetes                676777 non-null  float64
 19  DeafOrHardOfHearing        676777 non-null  float64
 20  BlindOrVisionDifficulty    676777 non-null  float64
 21  DifficultyConcentrating    676777 non-null  float64
 22  DifficultyWalking          676777 non-null  float64
 23  DifficultyDressingBathing  676777 non-null  float64
 24  DifficultyErrands          676777 non-null  float64
 25  SmokerStatus               676777 non-null  float64
 26  ECigaretteUsage            676777 non-null  float64
 27  ChestScan                  676777 non-null  float64
 28  RaceEthnicityCategory      676777 non-null  category
 29  AgeCategory                676777 non-null  category
 30  HeightInMeters             676777 non-null  float64
 31  WeightInKilograms          676777 non-null  float64
 32  BMI                        676777 non-null  float64
 33  AlcoholDrinkers            676777 non-null  float64
 34  HIVTesting                 676777 non-null  float64
 35  FluVaxLast12               676777 non-null  float64
 36  PneumoVaxEver              676777 non-null  float64
 37  TetanusLast10Tdap          676777 non-null  float64
 38  HighRiskLastYear           676777 non-null  float64
 39  CovidPos                   676777 non-null  float64
dtypes: category(4), float64(36)
memory usage: 188.5 MB
valid.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 44514 entries, 66965 to 224311
Data columns (total 40 columns):
 #   Column                     Non-Null Count  Dtype
---  ------                     --------------  -----
 0   State                      44514 non-null  category
 1   Male                       44514 non-null  float64
 2   GeneralHealth              44514 non-null  float64
 3   PhysicalHealthDays         44514 non-null  float64
 4   MentalHealthDays           44514 non-null  float64
 5   LastCheckupTime            44514 non-null  category
 6   PhysicalActivities         44514 non-null  float64
 7   SleepHours                 44514 non-null  float64
 8   RemovedTeeth               44514 non-null  float64
 9   HadHeartAttack             44514 non-null  float64
 10  HadAngina                  44514 non-null  float64
 11  HadStroke                  44514 non-null  float64
 12  HadAsthma                  44514 non-null  float64
 13  HadSkinCancer              44514 non-null  float64
 14  HadCOPD                    44514 non-null  float64
 15  HadDepressiveDisorder      44514 non-null  float64
 16  HadKidneyDisease           44514 non-null  float64
 17  HadArthritis               44514 non-null  float64
 18  HadDiabetes                44514 non-null  float64
 19  DeafOrHardOfHearing        44514 non-null  float64
 20  BlindOrVisionDifficulty    44514 non-null  float64
 21  DifficultyConcentrating    44514 non-null  float64
 22  DifficultyWalking          44514 non-null  float64
 23  DifficultyDressingBathing  44514 non-null  float64
 24  DifficultyErrands          44514 non-null  float64
 25  SmokerStatus               44514 non-null  float64
 26  ECigaretteUsage            44514 non-null  float64
 27  ChestScan                  44514 non-null  float64
 28  RaceEthnicityCategory      44514 non-null  category
 29  AgeCategory                44514 non-null  category
 30  HeightInMeters             44514 non-null  float64
 31  WeightInKilograms          44514 non-null  float64
 32  BMI                        44514 non-null  float64
 33  AlcoholDrinkers            44514 non-null  float64
 34  HIVTesting                 44514 non-null  float64
 35  FluVaxLast12               44514 non-null  float64
 36  PneumoVaxEver              44514 non-null  float64
 37  TetanusLast10Tdap          44514 non-null  float64
 38  HighRiskLastYear           44514 non-null  float64
 39  CovidPos                   44514 non-null  float64
dtypes: category(4), float64(36)
memory usage: 12.7 MB
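Reading three `info()` listings by eye is error-prone. As a compact alternative, the sketch below counts the missing cells in each split in one step. The helper `count_missing` and the toy frames are hypothetical names introduced for illustration; in the notebook the dictionary would hold `train`, `valid` and `test`.

```python
import numpy as np
import pandas as pd

def count_missing(frames):
    """Return the total number of missing cells for each named DataFrame."""
    return {name: int(df.isna().sum().sum()) for name, df in frames.items()}

# Hypothetical toy frames; one of them still contains a gap.
toy_frames = {
    "train": pd.DataFrame({"BMI": [22.0, 28.0]}),
    "test": pd.DataFrame({"BMI": [np.nan, 35.0]}),
}

missing = count_missing(toy_frames)
print(missing)
```

After a complete imputation, every count in the returned dictionary should be zero.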