1. Downloading the dataset
!pip install kaggle
!pip install pandas
!kaggle datasets download -d open-powerlifting/powerlifting-database
Downloading powerlifting-database.zip to /Users/szymonbartanowicz/studia/mag_1/inzynieria_uczenia_maszynowego/ium_464937
100%|████████████████████████████████████████| 176M/176M [00:06<00:00, 30.1MB/s]
!unzip -o powerlifting-database.zip
Archive:  powerlifting-database.zip
  inflating: openpowerlifting-2024-01-06-4c732975.csv
  inflating: openpowerlifting.csv
!pip install seaborn
import pandas as pd
data = pd.read_csv('openpowerlifting.csv')
/var/folders/82/g0638vys2hs3rk916hlpkdrr0000gn/T/ipykernel_47077/3909872695.py:2: DtypeWarning: Columns (35) have mixed types. Specify dtype option on import or set low_memory=False.
  data = pd.read_csv('openpowerlifting.csv')
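The warning points at column 35, which the `data.info()` listing in the next section shows to be `MeetState`. A minimal sketch of the two remedies the warning itself names, shown on a tiny in-memory CSV; the sample rows are made up for illustration and are not taken from the dataset:

```python
import io

import pandas as pd

# Made-up sample standing in for openpowerlifting.csv (illustration only).
sample_csv = io.StringIO(
    "Name,MeetState\n"
    "Lifter A,CA\n"
    "Lifter B,\n"
    "Lifter C,12\n"
)

# Option 1: declare the column's dtype so pandas does not infer it chunk by chunk.
df = pd.read_csv(sample_csv, dtype={"MeetState": str})
print(df["MeetState"].tolist())

# Option 2 (for the real file): let pandas read the file in a single pass.
# data = pd.read_csv("openpowerlifting.csv", low_memory=False)
```

Option 1 keeps memory usage lower on a file of this size, while option 2 needs no knowledge of the column names.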
2. Statistics
data.head()
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1423354 entries, 0 to 1423353
Data columns (total 37 columns):
 #   Column           Non-Null Count    Dtype
---  ------           --------------    -----
 0   Name             1423354 non-null  object
 1   Sex              1423354 non-null  object
 2   Event            1423354 non-null  object
 3   Equipment        1423354 non-null  object
 4   Age              757527 non-null   float64
 5   AgeClass         786800 non-null   object
 6   Division         1415176 non-null  object
 7   BodyweightKg     1406622 non-null  float64
 8   WeightClassKg    1410042 non-null  object
 9   Squat1Kg         337580 non-null   float64
 10  Squat2Kg         333349 non-null   float64
 11  Squat3Kg         323842 non-null   float64
 12  Squat4Kg         3696 non-null     float64
 13  Best3SquatKg     1031450 non-null  float64
 14  Bench1Kg         499779 non-null   float64
 15  Bench2Kg         493486 non-null   float64
 16  Bench3Kg         478485 non-null   float64
 17  Bench4Kg         9505 non-null     float64
 18  Best3BenchKg     1276181 non-null  float64
 19  Deadlift1Kg      363544 non-null   float64
 20  Deadlift2Kg      356023 non-null   float64
 21  Deadlift3Kg      339947 non-null   float64
 22  Deadlift4Kg      9246 non-null     float64
 23  Best3DeadliftKg  1081808 non-null  float64
 24  TotalKg          1313184 non-null  float64
 25  Place            1423354 non-null  object
 26  Wilks            1304407 non-null  float64
 27  McCulloch        1304254 non-null  float64
 28  Glossbrenner     1304407 non-null  float64
 29  IPFPoints        1273286 non-null  float64
 30  Tested           1093892 non-null  object
 31  Country          388884 non-null   object
 32  Federation       1423354 non-null  object
 33  Date             1423354 non-null  object
 34  MeetCountry      1423354 non-null  object
 35  MeetState        941545 non-null   object
 36  MeetName         1423354 non-null  object
dtypes: float64(22), object(15)
memory usage: 401.8+ MB
data['Sex'].value_counts()
Sex
M    1060189
F     363165
Name: count, dtype: int64
print(f"Minimum: {data['Best3SquatKg'].min()}")
print(f"Maximum: {data['Best3SquatKg'].max()}")
print(f"Standard deviation: {data['Best3SquatKg'].std()}")
print(f"Median: {data['Best3SquatKg'].median()}")
data['Best3SquatKg'].value_counts()
Minimum: -477.5
Maximum: 575.0
Standard deviation: 69.23931149707244
Median: 167.83
Best3SquatKg
200.00    15211
136.08    12626
190.00    12044
160.00    12043
170.00    11993
          ...
277.30        1
143.20        1
129.60        1
131.80        1
309.58        1
Name: count, Length: 1907, dtype: int64
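The reported minimum is negative. The sketch below assumes that negative weights mark failed attempts, which is how the OpenPowerlifting data is usually described but is not confirmed anywhere in this notebook. It shows `describe()` as a one-call summary and the same statistics restricted to positive values; the numbers are made up for illustration:

```python
import pandas as pd

# Made-up values standing in for data['Best3SquatKg'] (illustration only).
squat = pd.Series([200.0, 136.08, -477.5, 190.0, None, 160.0])

# describe() reports count, mean, std, min, quartiles and max in one call.
print(squat.describe())

# Assumption: negative entries are failed attempts, so leave them out here.
successful = squat[squat > 0]
print(successful.min(), successful.max(), successful.median())
```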
3. Cleaning the dataset
The Country column is empty in about 73% of rows (only 388884 of 1423354 values are non-null), so I drop it.
data.drop(columns=['Country'])
| | Name | Sex | Event | Equipment | Age | AgeClass | Division | BodyweightKg | WeightClassKg | Squat1Kg | ... | Wilks | McCulloch | Glossbrenner | IPFPoints | Tested | Federation | Date | MeetCountry | MeetState | MeetName |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
63986 | Kylie Beutler | F | SBD | Wraps | 23.0 | 20-23 | Juniors 20-23 | 56.00 | 56 | 83.91 | ... | 338.90 | 338.90 | 299.77 | 523.61 | Yes | WPA | 2011-05-21 | USA | CA | World Championships |
66457 | Kaitlynn Naert | F | SBD | Wraps | 13.0 | 13-15 | Teen 13-15 | 103.69 | 90+ | 43.09 | ... | 205.66 | 263.24 | 175.84 | 359.25 | Yes | APA | 2015-09-19 | USA | MI | Wolverine Open |
67030 | Carol Moorhead | F | SBD | Wraps | 55.0 | 55-59 | Open | 74.39 | 75 | 79.38 | ... | 223.22 | 273.44 | 196.41 | 374.65 | Yes | APA | 2017-04-22 | USA | MO | ShowMe State Raw Championships |
67031 | Nancy Lowther | F | SBD | Wraps | 58.0 | 55-59 | Open | 87.09 | 90 | 90.72 | ... | 260.41 | 336.19 | 227.02 | 449.73 | Yes | APA | 2017-04-22 | USA | MO | ShowMe State Raw Championships |
69557 | Roger Shaw | M | SBD | Wraps | 73.0 | 70-74 | Masters 70-79 | 74.12 | 75 | 147.42 | ... | 311.23 | 546.53 | 301.14 | 443.10 | Yes | APA | 2018-11-17 | USA | MO | Midwest Raw Championships |
646493 | Ryan Lapadat | M | SBD | Raw | 26.0 | 24-34 | Open | 81.70 | 82.5 | 142.50 | ... | 347.04 | 347.04 | 335.10 | 505.60 | Yes | CPO | 2008-05-17 | Canada | ON | Canadian Championships |
646495 | Denis Pronin | M | SBD | Multi-ply | 20.0 | 20-23 | Juniors 20-23 | 76.30 | 82.5 | 125.00 | ... | 313.37 | 322.77 | 303.01 | 411.52 | Yes | CPO | 2008-05-17 | Canada | ON | Canadian Championships |
652230 | Brooke Zak | F | SBD | Raw | 12.5 | 13-15 | Teen 12-13 | 44.54 | 48 | 40.00 | ... | 223.45 | 286.01 | 200.16 | 303.20 | Yes | RAW | 2018-11-09 | USA | NC | OBX Open |
658136 | Brooke Zak | F | SBD | Raw | 12.0 | 5-12 | Teen 12-13 | 43.27 | 44 | 37.50 | ... | 234.91 | 312.43 | 211.11 | 318.59 | Yes | RAW | 2018-08-03 | USA | NC | Southern Open |
658137 | Brooke Zak | F | SBD | Raw | 12.0 | 5-12 | Open | 43.27 | 44 | 37.50 | ... | 234.91 | 312.43 | 211.11 | 318.59 | Yes | RAW | 2018-08-03 | USA | NC | Southern Open |
658150 | Frank Ferchland | M | SBD | Raw | 50.5 | 50-54 | Open | 106.87 | 110 | 142.50 | ... | 286.58 | 323.84 | 274.82 | 390.62 | Yes | RAW | 2018-08-03 | USA | NC | Southern Open |
658151 | Frank Ferchland | M | SBD | Raw | 50.5 | 50-54 | Law/Fire/Military | 106.87 | 110 | 142.50 | ... | 286.58 | 323.84 | 274.82 | 390.62 | Yes | RAW | 2018-08-03 | USA | NC | Southern Open |
658152 | Frank Ferchland | M | SBD | Raw | 50.5 | 50-54 | Masters 50-54 | 106.87 | 110 | 142.50 | ... | 286.58 | 323.84 | 274.82 | 390.62 | Yes | RAW | 2018-08-03 | USA | NC | Southern Open |
919233 | Michael Trentin | M | SBD | Multi-ply | 18.0 | 18-19 | MO-MP | 64.45 | 67.5 | 117.50 | ... | 298.34 | 316.24 | 289.83 | 394.62 | Yes | CAPO | 2001-08-18 | Australia | WA | Nationals |
14 rows × 36 columns
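`DataFrame.drop` returns a new frame and leaves the original untouched unless the result is assigned back. The normalization output in section 5 still lists a Country column and 37 columns, which suggests the call above was not stored in `data`. A minimal sketch on a made-up frame, showing how the missing share can be computed and how the drop can be made to persist:

```python
import pandas as pd

# Made-up frame standing in for the real data (illustration only).
df = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Country": [None, "Poland", None, None],
})

# Share of missing values in the column.
missing_share = df["Country"].isna().mean()
print(f"Missing share: {missing_share:.0%}")

# Without assignment the result is discarded and df keeps the column.
df.drop(columns=["Country"])
print("Country" in df.columns)

# Assigning the result back makes the removal persist.
df = df.drop(columns=["Country"])
print("Country" in df.columns)
```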
I replace NaN values with 0.
data.fillna(0, inplace=True)
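Calling `fillna(0)` on the whole frame also writes 0 into text columns; the normalization output in section 5 shows 0 in Tested, Country and MeetState. If that is not intended, one possible variant (an assumption about the goal, not the notebook's method) fills only the numeric columns. The frame below is made up for illustration:

```python
import pandas as pd

# Made-up frame with one numeric and one text column (illustration only).
df = pd.DataFrame({
    "Best3SquatKg": [200.0, None],
    "Tested": ["Yes", None],
})

# Select numeric columns and fill missing values only there.
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].fillna(0)
print(df)
```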
4. Splitting the dataset into subsets
I use an 8:1:1 ratio (train:dev:test).
!pip install scikit-learn
from sklearn.model_selection import train_test_split
# Hold out 10% of all rows as the test set
openpowerlifting_train, openpowerlifting_test = train_test_split(data, test_size=0.1, random_state=1)
# Take 1/9 of the remaining 90% (10% of all rows) as the dev set
openpowerlifting_train, openpowerlifting_dev = train_test_split(openpowerlifting_train, test_size=1/9, random_state=1)
print("Train set size: ", len(openpowerlifting_train))
print("Dev set size: ", len(openpowerlifting_dev))
print("Test set size: ", len(openpowerlifting_test))
Train set size: 1138682
Dev set size: 142336
Test set size: 142336
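The second split uses `test_size=1/9` because it is applied to the 90% left after the first split, and 1/9 of 90% is 10% of the whole. The arithmetic can be checked without the data; the sketch relies on scikit-learn rounding a fractional `test_size` up to a whole number of rows:

```python
import math

n_total = 1423354  # rows reported by data.info()

# First split: 10% of all rows go to test (rounded up).
n_test = math.ceil(n_total * 0.1)
n_rest = n_total - n_test

# Second split: 1/9 of the remaining rows go to dev (rounded up).
n_dev = math.ceil(n_rest / 9)
n_train = n_rest - n_dev

print(n_train, n_dev, n_test)
```

The three numbers match the sizes printed above.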
5. Normalization
from sklearn.preprocessing import StandardScaler
# Work on a copy so that data itself stays unscaled
scaled_features = data.copy()
col_names = ['Age', 'BodyweightKg', 'Squat1Kg', 'Squat2Kg', 'Squat3Kg', 'Squat4Kg', 'Best3SquatKg', 'Bench1Kg', 'Bench2Kg', 'Bench3Kg', 'Bench4Kg', 'Best3BenchKg', 'Deadlift1Kg', 'Deadlift2Kg', 'Deadlift3Kg', 'Deadlift4Kg', 'Best3DeadliftKg', 'TotalKg']
features = scaled_features[col_names]
# Standardize each selected column to zero mean and unit variance
scaler = StandardScaler().fit(features.values)
features = scaler.transform(features.values)
scaled_features[col_names] = features
scaled_features
| | Name | Sex | Event | Equipment | Age | AgeClass | Division | BodyweightKg | WeightClassKg | Squat1Kg | ... | McCulloch | Glossbrenner | IPFPoints | Tested | Country | Federation | Date | MeetCountry | MeetState | MeetName |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Abbie Murphy | F | SBD | Wraps | 0.661353 | 24-34 | F-OR | -0.944800 | 60 | 0.611664 | ... | 324.16 | 286.42 | 511.15 | 0 | 0 | GPC-AUS | 2018-10-27 | Australia | VIC | Melbourne Cup |
1 | Abbie Tuong | F | SBD | Wraps | 0.661353 | 24-34 | F-OR | -0.997210 | 60 | 0.842750 | ... | 378.07 | 334.16 | 595.65 | 0 | 0 | GPC-AUS | 2018-10-27 | Australia | VIC | Melbourne Cup |
2 | Ainslee Hooper | F | B | Raw | 1.255975 | 40-44 | F-OR | -1.122189 | 56 | -0.312682 | ... | 38.56 | 34.12 | 313.97 | 0 | 0 | GPC-AUS | 2018-10-27 | Australia | VIC | Melbourne Cup |
3 | Amy Moldenhauer | F | SBD | Wraps | 0.337014 | 20-23 | F-OR | -0.936736 | 60 | -1.525886 | ... | 345.61 | 305.37 | 547.04 | 0 | 0 | GPC-AUS | 2018-10-27 | Australia | VIC | Melbourne Cup |
4 | Andrea Rowan | F | SBD | Wraps | 1.526258 | 45-49 | F-OR | 0.837161 | 110 | 1.073837 | ... | 338.91 | 274.56 | 550.08 | 0 | 0 | GPC-AUS | 2018-10-27 | Australia | VIC | Melbourne Cup |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1423349 | Marian Cafalik | M | SBD | Raw | 2.364134 | 60-64 | Masters 2 | -0.392472 | 74 | 1.536010 | ... | 438.27 | 316.52 | 469.67 | Yes | 0 | PZKFiTS | 2017-04-01 | Poland | 0 | Polish Classic Powerlifting Cup |
1423350 | Marian Piwowarczyk | M | SBD | Raw | 2.093852 | 55-59 | Masters 2 | -0.795631 | 66 | 0.727207 | ... | 372.60 | 295.66 | 423.03 | Yes | Poland | PZKFiTS | 2017-04-01 | Poland | 0 | Polish Classic Powerlifting Cup |
1423351 | Andrzej Bryniarski | M | SBD | Raw | 2.472248 | 60-64 | Masters 2 | 0.450129 | 105 | 1.304923 | ... | 382.36 | 264.22 | 378.84 | Yes | 0 | PZKFiTS | 2017-04-01 | Poland | 0 | Polish Classic Powerlifting Cup |
1423352 | Stanisław Goroczko | M | SBD | Raw | 2.526304 | 60-64 | Masters 2 | -0.098167 | 83 | -2.219146 | ... | 0.00 | 0.00 | 0.00 | Yes | 0 | PZKFiTS | 2017-04-01 | Poland | 0 | Polish Classic Powerlifting Cup |
1423353 | Jan Sowa | M | SBD | Raw | 2.904700 | 70-74 | Masters 2 | -0.049788 | 83 | -1.641430 | ... | 0.00 | 0.00 | 0.00 | Yes | 0 | PZKFiTS | 2017-04-01 | Poland | 0 | Polish Classic Powerlifting Cup |
1423354 rows × 37 columns
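The scaler above is fitted on the full `data` frame, so the train, dev and test frames from section 4 remain unscaled and the scaling statistics include rows from all three. A common alternative, sketched here as an assumption rather than as the notebook's method, fits the scaler on the training split only and reuses it for the other splits. The frame and its two columns are made up for illustration:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Made-up frame standing in for the real data (illustration only).
df = pd.DataFrame({
    "Age": [20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 25.0, 35.0, 45.0, 55.0],
    "TotalKg": [300.0, 450.0, 500.0, 520.0, 400.0, 350.0, 480.0, 510.0, 530.0, 420.0],
})
cols = ["Age", "TotalKg"]

train, test = train_test_split(df, test_size=0.2, random_state=1)
train, test = train.copy(), test.copy()

# Fit on the training rows only, then apply the same transform to both splits.
scaler = StandardScaler().fit(train[cols])
train[cols] = scaler.transform(train[cols])
test[cols] = scaler.transform(test[cols])

print(train[cols].mean())
```

With this arrangement the training columns have zero mean by construction, while the test columns are scaled with the training statistics and generally do not.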