ium_464937/02.ipynb
Szymon Bartanowicz f3ca0153eb fix
2024-03-18 00:30:44 +01:00


1. Downloading the dataset

!pip install kaggle
!pip install pandas
Requirement already satisfied: kaggle in ./venv/lib/python3.11/site-packages (1.6.6)
Requirement already satisfied: six>=1.10 in ./venv/lib/python3.11/site-packages (from kaggle) (1.16.0)
Requirement already satisfied: certifi in ./venv/lib/python3.11/site-packages (from kaggle) (2024.2.2)
Requirement already satisfied: python-dateutil in ./venv/lib/python3.11/site-packages (from kaggle) (2.9.0.post0)
Requirement already satisfied: requests in ./venv/lib/python3.11/site-packages (from kaggle) (2.31.0)
Requirement already satisfied: tqdm in ./venv/lib/python3.11/site-packages (from kaggle) (4.66.2)
Requirement already satisfied: python-slugify in ./venv/lib/python3.11/site-packages (from kaggle) (8.0.4)
Requirement already satisfied: urllib3 in ./venv/lib/python3.11/site-packages (from kaggle) (2.2.1)
Requirement already satisfied: bleach in ./venv/lib/python3.11/site-packages (from kaggle) (6.1.0)
Requirement already satisfied: webencodings in ./venv/lib/python3.11/site-packages (from bleach->kaggle) (0.5.1)
Requirement already satisfied: text-unidecode>=1.3 in ./venv/lib/python3.11/site-packages (from python-slugify->kaggle) (1.3)
Requirement already satisfied: charset-normalizer<4,>=2 in ./venv/lib/python3.11/site-packages (from requests->kaggle) (3.3.2)
Requirement already satisfied: idna<4,>=2.5 in ./venv/lib/python3.11/site-packages (from requests->kaggle) (3.6)

[notice] A new release of pip available: 22.3.1 -> 24.0
[notice] To update, run: pip install --upgrade pip
Requirement already satisfied: pandas in ./venv/lib/python3.11/site-packages (2.2.1)
Requirement already satisfied: numpy<2,>=1.23.2 in ./venv/lib/python3.11/site-packages (from pandas) (1.26.4)
Requirement already satisfied: python-dateutil>=2.8.2 in ./venv/lib/python3.11/site-packages (from pandas) (2.9.0.post0)
Requirement already satisfied: pytz>=2020.1 in ./venv/lib/python3.11/site-packages (from pandas) (2024.1)
Requirement already satisfied: tzdata>=2022.7 in ./venv/lib/python3.11/site-packages (from pandas) (2024.1)
Requirement already satisfied: six>=1.5 in ./venv/lib/python3.11/site-packages (from python-dateutil>=2.8.2->pandas) (1.16.0)

!kaggle datasets download -d open-powerlifting/powerlifting-database
Downloading powerlifting-database.zip to /Users/szymonbartanowicz/studia/mag_1/inzynieria_uczenia_maszynowego/ium_464937
100%|███████████████████████████████████████▉| 176M/176M [00:06<00:00, 35.7MB/s]
100%|████████████████████████████████████████| 176M/176M [00:06<00:00, 30.1MB/s]
!unzip -o powerlifting-database.zip
Archive:  powerlifting-database.zip
  inflating: openpowerlifting-2024-01-06-4c732975.csv  
  inflating: openpowerlifting.csv    
!pip install seaborn
Requirement already satisfied: seaborn in ./venv/lib/python3.11/site-packages (0.13.2)
Requirement already satisfied: numpy!=1.24.0,>=1.20 in ./venv/lib/python3.11/site-packages (from seaborn) (1.26.4)
Requirement already satisfied: pandas>=1.2 in ./venv/lib/python3.11/site-packages (from seaborn) (2.2.1)
Requirement already satisfied: matplotlib!=3.6.1,>=3.4 in ./venv/lib/python3.11/site-packages (from seaborn) (3.8.3)
Requirement already satisfied: contourpy>=1.0.1 in ./venv/lib/python3.11/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (1.2.0)
Requirement already satisfied: cycler>=0.10 in ./venv/lib/python3.11/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (0.12.1)
Requirement already satisfied: fonttools>=4.22.0 in ./venv/lib/python3.11/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (4.50.0)
Requirement already satisfied: kiwisolver>=1.3.1 in ./venv/lib/python3.11/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (1.4.5)
Requirement already satisfied: packaging>=20.0 in ./venv/lib/python3.11/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (24.0)
Requirement already satisfied: pillow>=8 in ./venv/lib/python3.11/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (10.2.0)
Requirement already satisfied: pyparsing>=2.3.1 in ./venv/lib/python3.11/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (3.1.2)
Requirement already satisfied: python-dateutil>=2.7 in ./venv/lib/python3.11/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (2.9.0.post0)
Requirement already satisfied: pytz>=2020.1 in ./venv/lib/python3.11/site-packages (from pandas>=1.2->seaborn) (2024.1)
Requirement already satisfied: tzdata>=2022.7 in ./venv/lib/python3.11/site-packages (from pandas>=1.2->seaborn) (2024.1)
Requirement already satisfied: six>=1.5 in ./venv/lib/python3.11/site-packages (from python-dateutil>=2.7->matplotlib!=3.6.1,>=3.4->seaborn) (1.16.0)

import pandas as pd
data = pd.read_csv('openpowerlifting.csv')
/var/folders/82/g0638vys2hs3rk916hlpkdrr0000gn/T/ipykernel_47077/3909872695.py:2: DtypeWarning: Columns (35) have mixed types. Specify dtype option on import or set low_memory=False.
  data = pd.read_csv('openpowerlifting.csv')
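The DtypeWarning above reports mixed types in column 35, which the `data.info()` output lists as `MeetState`. A minimal sketch of one way to avoid the warning, assuming it is acceptable to read that column as plain text; the helper name `read_openpowerlifting` and the sample rows are hypothetical:

```python
import io

import pandas as pd


def read_openpowerlifting(source):
    """Read the CSV with an explicit dtype for the column pandas flagged as mixed."""
    # Assumption: MeetState holds region codes, so plain strings are acceptable.
    return pd.read_csv(source, dtype={"MeetState": str})


# Tiny in-memory stand-in for openpowerlifting.csv (made-up rows).
sample = io.StringIO("Name,MeetState\nA,CA\nB,12\nC,\n")
frame = read_openpowerlifting(sample)
print(frame["MeetState"].tolist())
```

Passing `low_memory=False`, as the warning itself suggests, is another option; it lets pandas infer a single type from the whole column instead of chunk by chunk.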

2. Statistics

data.head()
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1423354 entries, 0 to 1423353
Data columns (total 37 columns):
 #   Column           Non-Null Count    Dtype  
---  ------           --------------    -----  
 0   Name             1423354 non-null  object 
 1   Sex              1423354 non-null  object 
 2   Event            1423354 non-null  object 
 3   Equipment        1423354 non-null  object 
 4   Age              757527 non-null   float64
 5   AgeClass         786800 non-null   object 
 6   Division         1415176 non-null  object 
 7   BodyweightKg     1406622 non-null  float64
 8   WeightClassKg    1410042 non-null  object 
 9   Squat1Kg         337580 non-null   float64
 10  Squat2Kg         333349 non-null   float64
 11  Squat3Kg         323842 non-null   float64
 12  Squat4Kg         3696 non-null     float64
 13  Best3SquatKg     1031450 non-null  float64
 14  Bench1Kg         499779 non-null   float64
 15  Bench2Kg         493486 non-null   float64
 16  Bench3Kg         478485 non-null   float64
 17  Bench4Kg         9505 non-null     float64
 18  Best3BenchKg     1276181 non-null  float64
 19  Deadlift1Kg      363544 non-null   float64
 20  Deadlift2Kg      356023 non-null   float64
 21  Deadlift3Kg      339947 non-null   float64
 22  Deadlift4Kg      9246 non-null     float64
 23  Best3DeadliftKg  1081808 non-null  float64
 24  TotalKg          1313184 non-null  float64
 25  Place            1423354 non-null  object 
 26  Wilks            1304407 non-null  float64
 27  McCulloch        1304254 non-null  float64
 28  Glossbrenner     1304407 non-null  float64
 29  IPFPoints        1273286 non-null  float64
 30  Tested           1093892 non-null  object 
 31  Country          388884 non-null   object 
 32  Federation       1423354 non-null  object 
 33  Date             1423354 non-null  object 
 34  MeetCountry      1423354 non-null  object 
 35  MeetState        941545 non-null   object 
 36  MeetName         1423354 non-null  object 
dtypes: float64(22), object(15)
memory usage: 401.8+ MB
data['Sex'].value_counts()
Sex
M    1060189
F     363165
Name: count, dtype: int64
print(f"Minimum: {data['Best3SquatKg'].min()}")
print(f"Maximum: {data['Best3SquatKg'].max()}")
print(f"Standard deviation: {data['Best3SquatKg'].std()}")
print(f"Median: {data['Best3SquatKg'].median()}")
data['Best3SquatKg'].value_counts()
Minimum: -477.5
Maximum: 575.0
Standard deviation: 69.23931149707244
Median: 167.83
Best3SquatKg
200.00    15211
136.08    12626
190.00    12044
160.00    12043
170.00    11993
          ...  
277.30        1
143.20        1
129.60        1
131.80        1
309.58        1
Name: count, Length: 1907, dtype: int64
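The four statistics above can also be gathered in a single call. A minimal sketch using `Series.agg` on a small made-up series (the values are illustrative and do not reproduce the dataset's statistics):

```python
import pandas as pd

# Illustrative values only; in the notebook this would be data['Best3SquatKg'].
squat = pd.Series([200.0, 136.08, -477.5, 575.0, None])

# agg with a list of function names returns one labelled value per statistic;
# NaN entries are skipped, as with the individual min/max/std/median calls.
summary = squat.agg(["min", "max", "std", "median"])
print(summary)
```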

3. Cleaning the dataset

The Country column is empty in about 73% of rows, so I drop it.

data.drop(columns=['Country'])
        Name             Sex  Event  Equipment  Age   AgeClass  Division           BodyweightKg  WeightClassKg  Squat1Kg  ...  Wilks   McCulloch  Glossbrenner  IPFPoints  Tested  Federation  Date        MeetCountry  MeetState  MeetName
63986   Kylie Beutler    F    SBD    Wraps      23.0  20-23     Juniors 20-23      56.00         56             83.91     ...  338.90  338.90     299.77        523.61     Yes     WPA         2011-05-21  USA          CA         World Championships
66457   Kaitlynn Naert   F    SBD    Wraps      13.0  13-15     Teen 13-15         103.69        90+            43.09     ...  205.66  263.24     175.84        359.25     Yes     APA         2015-09-19  USA          MI         Wolverine Open
67030   Carol Moorhead   F    SBD    Wraps      55.0  55-59     Open               74.39         75             79.38     ...  223.22  273.44     196.41        374.65     Yes     APA         2017-04-22  USA          MO         ShowMe State Raw Championships
67031   Nancy Lowther    F    SBD    Wraps      58.0  55-59     Open               87.09         90             90.72     ...  260.41  336.19     227.02        449.73     Yes     APA         2017-04-22  USA          MO         ShowMe State Raw Championships
69557   Roger Shaw       M    SBD    Wraps      73.0  70-74     Masters 70-79      74.12         75             147.42    ...  311.23  546.53     301.14        443.10     Yes     APA         2018-11-17  USA          MO         Midwest Raw Championships
646493  Ryan Lapadat     M    SBD    Raw        26.0  24-34     Open               81.70         82.5           142.50    ...  347.04  347.04     335.10        505.60     Yes     CPO         2008-05-17  Canada       ON         Canadian Championships
646495  Denis Pronin     M    SBD    Multi-ply  20.0  20-23     Juniors 20-23      76.30         82.5           125.00    ...  313.37  322.77     303.01        411.52     Yes     CPO         2008-05-17  Canada       ON         Canadian Championships
652230  Brooke Zak       F    SBD    Raw        12.5  13-15     Teen 12-13         44.54         48             40.00     ...  223.45  286.01     200.16        303.20     Yes     RAW         2018-11-09  USA          NC         OBX Open
658136  Brooke Zak       F    SBD    Raw        12.0  5-12      Teen 12-13         43.27         44             37.50     ...  234.91  312.43     211.11        318.59     Yes     RAW         2018-08-03  USA          NC         Southern Open
658137  Brooke Zak       F    SBD    Raw        12.0  5-12      Open               43.27         44             37.50     ...  234.91  312.43     211.11        318.59     Yes     RAW         2018-08-03  USA          NC         Southern Open
658150  Frank Ferchland  M    SBD    Raw        50.5  50-54     Open               106.87        110            142.50    ...  286.58  323.84     274.82        390.62     Yes     RAW         2018-08-03  USA          NC         Southern Open
658151  Frank Ferchland  M    SBD    Raw        50.5  50-54     Law/Fire/Military  106.87        110            142.50    ...  286.58  323.84     274.82        390.62     Yes     RAW         2018-08-03  USA          NC         Southern Open
658152  Frank Ferchland  M    SBD    Raw        50.5  50-54     Masters 50-54      106.87        110            142.50    ...  286.58  323.84     274.82        390.62     Yes     RAW         2018-08-03  USA          NC         Southern Open
919233  Michael Trentin  M    SBD    Multi-ply  18.0  18-19     MO-MP              64.45         67.5           117.50    ...  298.34  316.24     289.83        394.62     Yes     CAPO        2001-08-18  Australia    WA         Nationals

14 rows × 36 columns
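`DataFrame.drop` returns a new DataFrame and leaves `data` untouched unless the result is assigned back, which is consistent with `Country` still appearing in the normalization output in section 5. A minimal sketch, on a made-up frame, of checking the share of missing values and persisting the removal; the 0.7 threshold and the helper name `drop_sparse_columns` are assumptions for illustration:

```python
import pandas as pd


def drop_sparse_columns(frame, threshold=0.7):
    """Return a copy of frame without columns whose missing share exceeds threshold."""
    missing_share = frame.isna().mean()           # fraction of NaN per column
    sparse = missing_share[missing_share > threshold].index
    return frame.drop(columns=sparse)             # drop returns a new DataFrame


toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Country": [None, None, None, "Poland"],      # 75% missing
})
toy = drop_sparse_columns(toy)                    # assign the result to keep the change
print(list(toy.columns))
```

In the notebook the equivalent would be `data = data.drop(columns=['Country'])`.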

I replace NaN values with 0.

data.fillna(0, inplace=True)
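Filling every column with 0 also writes 0 into text columns, which is why `Tested`, `Country`, and `MeetState` show 0 in the table in section 5. One possible alternative, sketched here under the assumption that numeric gaps should become 0 and text gaps a placeholder such as "Unknown" (both choices and the helper name `fill_missing` are illustrative, not the notebook's method):

```python
import pandas as pd


def fill_missing(frame, text_placeholder="Unknown"):
    """Fill numeric NaN with 0 and non-numeric NaN with a text placeholder."""
    numeric_cols = frame.select_dtypes(include="number").columns
    other_cols = frame.columns.difference(numeric_cols)
    # fillna accepts a mapping from column name to fill value.
    values = {col: 0 for col in numeric_cols}
    values.update({col: text_placeholder for col in other_cols})
    return frame.fillna(value=values)


toy = pd.DataFrame({"Squat4Kg": [None, 210.0], "MeetState": ["VIC", None]})
result = fill_missing(toy)
print(result)
```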

4. Splitting the dataset into subsets

I use an 8:1:1 split (train:dev:test).

!pip install scikit-learn
Collecting scikit-learn
  Downloading scikit_learn-1.4.1.post1-cp311-cp311-macosx_10_9_x86_64.whl (11.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 11.6/11.6 MB 33.5 MB/s eta 0:00:00
Requirement already satisfied: numpy<2.0,>=1.19.5 in ./venv/lib/python3.11/site-packages (from scikit-learn) (1.26.4)
Collecting scipy>=1.6.0
  Downloading scipy-1.12.0-cp311-cp311-macosx_10_9_x86_64.whl (38.9 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 38.9/38.9 MB 14.1 MB/s eta 0:00:00
Collecting joblib>=1.2.0
  Using cached joblib-1.3.2-py3-none-any.whl (302 kB)
Collecting threadpoolctl>=2.0.0
  Downloading threadpoolctl-3.3.0-py3-none-any.whl (17 kB)
Installing collected packages: threadpoolctl, scipy, joblib, scikit-learn
Successfully installed joblib-1.3.2 scikit-learn-1.4.1.post1 scipy-1.12.0 threadpoolctl-3.3.0

from sklearn.model_selection import train_test_split

# Hold out 10% of all rows as the test set.
openpowerlifting_train, openpowerlifting_test = train_test_split(data, test_size=0.1, random_state=1)
# Take 1/9 of the remaining 90% as the dev set, i.e. 10% of the full dataset.
openpowerlifting_train, openpowerlifting_dev = train_test_split(openpowerlifting_train, test_size=1/9, random_state=1)
print("Train set size: ", len(openpowerlifting_train))
print("Dev set size: ", len(openpowerlifting_dev))
print("Test set size: ", len(openpowerlifting_test))
Train set size:  1138682
Dev set size:  142336
Test set size:  142336
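The second call uses `test_size=1/9` because the dev set is carved out of the 90% that remains after the test split: one ninth of 90% is 10% of the full dataset. A small sketch of that arithmetic, assuming the held-out count is rounded up (an assumption that reproduces the sizes printed above); the function name `split_sizes` is hypothetical:

```python
import math


def split_sizes(n_rows, test_frac=0.1, dev_frac_of_rest=1 / 9):
    """Return (train, dev, test) row counts for a two-step hold-out split."""
    n_test = math.ceil(n_rows * test_frac)          # first split: hold out test
    n_rest = n_rows - n_test
    n_dev = math.ceil(n_rest * dev_frac_of_rest)    # second split: hold out dev
    return n_rest - n_dev, n_dev, n_test


train_size, dev_size, test_size = split_sizes(1423354)
print(train_size, dev_size, test_size)
```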

5. Normalization

from sklearn.preprocessing import StandardScaler

# Work on a copy so the original frame keeps its unscaled values.
scaled_features = data.copy()
# Numeric columns to standardize (zero mean, unit variance).
col_names = [
    'Age', 'BodyweightKg', 'Squat1Kg', 'Squat2Kg', 'Squat3Kg', 'Squat4Kg', 'Best3SquatKg',
    'Bench1Kg', 'Bench2Kg', 'Bench3Kg', 'Bench4Kg', 'Best3BenchKg',
    'Deadlift1Kg', 'Deadlift2Kg', 'Deadlift3Kg', 'Deadlift4Kg', 'Best3DeadliftKg', 'TotalKg',
]
features = scaled_features[col_names]
# Fit the scaler on the selected columns, then transform them.
scaler = StandardScaler().fit(features.values)
features = scaler.transform(features.values)
scaled_features[col_names] = features
scaled_features
         Name                Sex  Event  Equipment  Age       AgeClass  Division   BodyweightKg  WeightClassKg  Squat1Kg   ...  McCulloch  Glossbrenner  IPFPoints  Tested  Country  Federation  Date        MeetCountry  MeetState  MeetName
0        Abbie Murphy        F    SBD    Wraps      0.661353  24-34     F-OR       -0.944800     60             0.611664   ...  324.16     286.42        511.15     0       0        GPC-AUS     2018-10-27  Australia    VIC        Melbourne Cup
1        Abbie Tuong         F    SBD    Wraps      0.661353  24-34     F-OR       -0.997210     60             0.842750   ...  378.07     334.16        595.65     0       0        GPC-AUS     2018-10-27  Australia    VIC        Melbourne Cup
2        Ainslee Hooper      F    B      Raw        1.255975  40-44     F-OR       -1.122189     56             -0.312682  ...  38.56      34.12         313.97     0       0        GPC-AUS     2018-10-27  Australia    VIC        Melbourne Cup
3        Amy Moldenhauer     F    SBD    Wraps      0.337014  20-23     F-OR       -0.936736     60             -1.525886  ...  345.61     305.37        547.04     0       0        GPC-AUS     2018-10-27  Australia    VIC        Melbourne Cup
4        Andrea Rowan        F    SBD    Wraps      1.526258  45-49     F-OR       0.837161      110            1.073837   ...  338.91     274.56        550.08     0       0        GPC-AUS     2018-10-27  Australia    VIC        Melbourne Cup
...      ...                 ...  ...    ...        ...       ...       ...        ...           ...            ...        ...  ...        ...           ...        ...     ...      ...         ...         ...          ...        ...
1423349  Marian Cafalik      M    SBD    Raw        2.364134  60-64     Masters 2  -0.392472     74             1.536010   ...  438.27     316.52        469.67     Yes     0        PZKFiTS     2017-04-01  Poland       0          Polish Classic Powerlifting Cup
1423350  Marian Piwowarczyk  M    SBD    Raw        2.093852  55-59     Masters 2  -0.795631     66             0.727207   ...  372.60     295.66        423.03     Yes     Poland   PZKFiTS     2017-04-01  Poland       0          Polish Classic Powerlifting Cup
1423351  Andrzej Bryniarski  M    SBD    Raw        2.472248  60-64     Masters 2  0.450129      105            1.304923   ...  382.36     264.22        378.84     Yes     0        PZKFiTS     2017-04-01  Poland       0          Polish Classic Powerlifting Cup
1423352  Stanisław Goroczko  M    SBD    Raw        2.526304  60-64     Masters 2  -0.098167     83             -2.219146  ...  0.00       0.00          0.00       Yes     0        PZKFiTS     2017-04-01  Poland       0          Polish Classic Powerlifting Cup
1423353  Jan Sowa            M    SBD    Raw        2.904700  70-74     Masters 2  -0.049788     83             -1.641430  ...  0.00       0.00          0.00       Yes     0        PZKFiTS     2017-04-01  Poland       0          Polish Classic Powerlifting Cup

1423354 rows × 37 columns
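The scaler above is fitted on the full dataset. A common alternative is to compute the scaling statistics on the training split only and reuse them for dev and test, so that held-out rows do not influence the transformation. A minimal sketch with NumPy, assuming z-score scaling with the population standard deviation (the convention `StandardScaler` uses); the arrays are made up and the helper names `fit_standardizer` and `standardize` are hypothetical:

```python
import numpy as np


def fit_standardizer(train):
    """Return per-column mean and std computed from the training rows only."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)           # ddof=0, i.e. population standard deviation
    std[std == 0] = 1.0               # leave constant columns unscaled
    return mean, std


def standardize(values, mean, std):
    """Apply previously fitted statistics to any split."""
    return (values - mean) / std


# Made-up rows standing in for two numeric columns of the train and test splits.
train = np.array([[100.0, 60.0], [200.0, 80.0], [300.0, 100.0]])
test = np.array([[250.0, 90.0]])

mean, std = fit_standardizer(train)
train_scaled = standardize(train, mean, std)
test_scaled = standardize(test, mean, std)    # reuses the training statistics
print(test_scaled)
```

With scikit-learn the same idea would be to call `fit` on the training columns and `transform` on each split separately.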