ium_478831/IUM_main.ipynb
JulianZablonski ec907266c8 ex1
2022-03-20 22:02:36 +01:00


!pip install kaggle
!pip install pandas
!pip install seaborn
Requirement already satisfied: kaggle in c:\users\user\anaconda3\lib\site-packages (1.5.12)
Requirement already satisfied: urllib3 in c:\users\user\anaconda3\lib\site-packages (from kaggle) (1.26.7)
Requirement already satisfied: python-dateutil in c:\users\user\anaconda3\lib\site-packages (from kaggle) (2.8.2)
Requirement already satisfied: python-slugify in c:\users\user\anaconda3\lib\site-packages (from kaggle) (5.0.2)
Requirement already satisfied: requests in c:\users\user\anaconda3\lib\site-packages (from kaggle) (2.26.0)
Requirement already satisfied: six>=1.10 in c:\users\user\anaconda3\lib\site-packages (from kaggle) (1.16.0)
Requirement already satisfied: tqdm in c:\users\user\anaconda3\lib\site-packages (from kaggle) (4.62.3)
Requirement already satisfied: certifi in c:\users\user\anaconda3\lib\site-packages (from kaggle) (2021.10.8)
Requirement already satisfied: text-unidecode>=1.3 in c:\users\user\anaconda3\lib\site-packages (from python-slugify->kaggle) (1.3)
Requirement already satisfied: charset-normalizer~=2.0.0 in c:\users\user\anaconda3\lib\site-packages (from requests->kaggle) (2.0.4)
Requirement already satisfied: idna<4,>=2.5 in c:\users\user\anaconda3\lib\site-packages (from requests->kaggle) (3.2)
Requirement already satisfied: colorama in c:\users\user\anaconda3\lib\site-packages (from tqdm->kaggle) (0.4.4)
Requirement already satisfied: pandas in c:\users\user\anaconda3\lib\site-packages (1.3.4)
Requirement already satisfied: pytz>=2017.3 in c:\users\user\anaconda3\lib\site-packages (from pandas) (2021.3)
Requirement already satisfied: python-dateutil>=2.7.3 in c:\users\user\anaconda3\lib\site-packages (from pandas) (2.8.2)
Requirement already satisfied: numpy>=1.17.3 in c:\users\user\anaconda3\lib\site-packages (from pandas) (1.20.3)
Requirement already satisfied: six>=1.5 in c:\users\user\anaconda3\lib\site-packages (from python-dateutil>=2.7.3->pandas) (1.16.0)
Requirement already satisfied: seaborn in c:\users\user\anaconda3\lib\site-packages (0.11.2)
Requirement already satisfied: numpy>=1.15 in c:\users\user\anaconda3\lib\site-packages (from seaborn) (1.20.3)
Requirement already satisfied: matplotlib>=2.2 in c:\users\user\anaconda3\lib\site-packages (from seaborn) (3.4.3)
Requirement already satisfied: scipy>=1.0 in c:\users\user\anaconda3\lib\site-packages (from seaborn) (1.7.1)
Requirement already satisfied: pandas>=0.23 in c:\users\user\anaconda3\lib\site-packages (from seaborn) (1.3.4)
Requirement already satisfied: cycler>=0.10 in c:\users\user\anaconda3\lib\site-packages (from matplotlib>=2.2->seaborn) (0.10.0)
Requirement already satisfied: pillow>=6.2.0 in c:\users\user\anaconda3\lib\site-packages (from matplotlib>=2.2->seaborn) (8.4.0)
Requirement already satisfied: pyparsing>=2.2.1 in c:\users\user\anaconda3\lib\site-packages (from matplotlib>=2.2->seaborn) (3.0.4)
Requirement already satisfied: kiwisolver>=1.0.1 in c:\users\user\anaconda3\lib\site-packages (from matplotlib>=2.2->seaborn) (1.3.1)
Requirement already satisfied: python-dateutil>=2.7 in c:\users\user\anaconda3\lib\site-packages (from matplotlib>=2.2->seaborn) (2.8.2)
Requirement already satisfied: six in c:\users\user\anaconda3\lib\site-packages (from cycler>=0.10->matplotlib>=2.2->seaborn) (1.16.0)
Requirement already satisfied: pytz>=2017.3 in c:\users\user\anaconda3\lib\site-packages (from pandas>=0.23->seaborn) (2021.3)
!kaggle datasets download -d wenruliu/adult-income-dataset

    
adult-income-dataset.zip: Skipping, found more recently modified local copy (use --force to force download)
!unzip -o adult-income-dataset.zip
'unzip' is not recognized as an internal or external command,
operable program or batch file.
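The `unzip` command is not available in a default Windows shell, which is why the cell above fails. A minimal sketch of a portable alternative, assuming the archive sits in the working directory, uses Python's standard `zipfile` module. The helper name `extract_archive` is hypothetical and not part of the original notebook.

```python
import zipfile
from pathlib import Path


def extract_archive(archive_path, target_dir="."):
    """Extract every member of a ZIP archive and return the member names."""
    with zipfile.ZipFile(archive_path) as archive:
        # extractall overwrites existing files, similar to `unzip -o`
        archive.extractall(target_dir)
        return archive.namelist()


# Only attempt the extraction when the downloaded archive is present.
archive = Path("adult-income-dataset.zip")
if archive.exists():
    print(extract_archive(archive))
```

Printing the member names also shows which CSV file name to pass to `pd.read_csv`.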
import pandas as pd
# Load the extracted CSV into a DataFrame and display it
df = pd.read_csv('adult-income-dataset.csv')
df
       age  workclass     fnlwgt  education     educational-num  marital-status      occupation         relationship  race   gender  capital-gain  capital-loss  hours-per-week  native-country  income
0      25   Private       226802  11th          7                Never-married       Machine-op-inspct  Own-child     Black  Male    0             0             40              United-States   <=50K
1      38   Private       89814   HS-grad       9                Married-civ-spouse  Farming-fishing    Husband       White  Male    0             0             50              United-States   <=50K
2      28   Local-gov     336951  Assoc-acdm    12               Married-civ-spouse  Protective-serv    Husband       White  Male    0             0             40              United-States   >50K
3      44   Private       160323  Some-college  10               Married-civ-spouse  Machine-op-inspct  Husband       Black  Male    7688          0             40              United-States   >50K
4      18   ?             103497  Some-college  10               Never-married       ?                  Own-child     White  Female  0             0             30              United-States   <=50K
...    ...  ...           ...     ...           ...              ...                 ...                ...           ...    ...     ...           ...           ...             ...             ...
48837  27   Private       257302  Assoc-acdm    12               Married-civ-spouse  Tech-support       Wife          White  Female  0             0             38              United-States   <=50K
48838  40   Private       154374  HS-grad       9                Married-civ-spouse  Machine-op-inspct  Husband       White  Male    0             0             40              United-States   >50K
48839  58   Private       151910  HS-grad       9                Widowed             Adm-clerical       Unmarried     White  Female  0             0             40              United-States   <=50K
48840  22   Private       201490  HS-grad       9                Never-married       Adm-clerical       Own-child     White  Male    0             0             20              United-States   <=50K
48841  52   Self-emp-inc  287927  HS-grad       9                Married-civ-spouse  Exec-managerial    Wife          White  Female  15024         0             40              United-States   >50K

48842 rows × 15 columns

# Remove incomplete records (workclass given as '?')
df = df[df.workclass != '?']
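The filter above only checks `workclass`, while the preview shows `?` in `occupation` as well. A minimal sketch of a filter that drops a row when the marker appears in any column follows. The helper name `drop_incomplete_rows` and the toy frame are hypothetical, used here only for illustration.

```python
import pandas as pd


def drop_incomplete_rows(frame: pd.DataFrame, marker: str = "?") -> pd.DataFrame:
    """Return a copy of `frame` without rows that contain `marker` in any column."""
    # isin flags every cell equal to the marker; any(axis=1) collapses that per row
    has_marker = frame.isin([marker]).any(axis=1)
    return frame.loc[~has_marker].reset_index(drop=True)


# Toy data (made up) mimicking the '?' placeholders seen in the preview
toy = pd.DataFrame({
    "age": [25, 18, 40],
    "workclass": ["Private", "?", "Private"],
    "occupation": ["Sales", "?", "?"],
})
clean = drop_incomplete_rows(toy)
print(len(toy), "->", len(clean))
```

On the real data this would be called as `df = drop_incomplete_rows(df)`. Whether dropping such rows is preferable to imputing them depends on the later modelling step.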
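import torch

# 80/20 train/test split
train_size = int(0.8 * len(df))
test_size = len(df) - train_size
# random_split applied to the DataFrame itself returns Subset objects that cannot be
# indexed by row, so split the row positions and select the rows with iloc
train_idx, test_idx = torch.utils.data.random_split(range(len(df)), [train_size, test_size])
df_train, df_test = df.iloc[train_idx.indices], df.iloc[test_idx.indices]
print(f"Dataset size: {len(df)}, train subset: {train_size}, test subset: {test_size}.")
df.describe(include='all')
Dataset size: 48842, train subset: 39073, test subset: 9769.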
        age           workclass  fnlwgt        education  educational-num  marital-status      occupation      relationship  race   gender  capital-gain  capital-loss  hours-per-week  native-country  income
count   48842.000000  48842      4.884200e+04  48842      48842.000000     48842               48842           48842         48842  48842   48842.000000  48842.000000  48842.000000    48842           48842
unique  NaN           9          NaN           16         NaN              7                   15              6             5      2       NaN           NaN           NaN             42              2
top     NaN           Private    NaN           HS-grad    NaN              Married-civ-spouse  Prof-specialty  Husband       White  Male    NaN           NaN           NaN             United-States   <=50K
freq    NaN           33906      NaN           15784      NaN              22379               6172            19716         41762  32650   NaN           NaN           NaN             43832           37155
mean    38.643585     NaN        1.896641e+05  NaN        10.078089        NaN                 NaN             NaN           NaN    NaN     1079.067626   87.502314     40.422382       NaN             NaN
std     13.710510     NaN        1.056040e+05  NaN        2.570973         NaN                 NaN             NaN           NaN    NaN     7452.019058   403.004552    12.391444       NaN             NaN
min     17.000000     NaN        1.228500e+04  NaN        1.000000         NaN                 NaN             NaN           NaN    NaN     0.000000      0.000000      1.000000        NaN             NaN
25%     28.000000     NaN        1.175505e+05  NaN        9.000000         NaN                 NaN             NaN           NaN    NaN     0.000000      0.000000      40.000000       NaN             NaN
50%     37.000000     NaN        1.781445e+05  NaN        10.000000        NaN                 NaN             NaN           NaN    NaN     0.000000      0.000000      40.000000       NaN             NaN
75%     48.000000     NaN        2.376420e+05  NaN        12.000000        NaN                 NaN             NaN           NaN    NaN     0.000000      0.000000      45.000000       NaN             NaN
max     90.000000     NaN        1.490400e+06  NaN        16.000000        NaN                 NaN             NaN           NaN    NaN     99999.000000  4356.000000   99.000000       NaN             NaN
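The 80/20 split above can also be done with pandas alone, which keeps both parts as DataFrames and makes the shuffle reproducible. This is a sketch under assumptions: the helper name `split_frame`, the seed value 42, and the toy frame are hypothetical. Note that `DataFrame.sample(frac=...)` rounds the row count, so the sizes may differ by one row from `int(0.8 * len(df))`.

```python
import pandas as pd


def split_frame(frame: pd.DataFrame, train_frac: float = 0.8, seed: int = 42):
    """Shuffle rows and return (train, test) DataFrames that share no rows."""
    # Draw the training rows at random; a fixed seed makes the split repeatable
    train = frame.sample(frac=train_frac, random_state=seed)
    # Everything not drawn goes to the test part (requires a unique index)
    test = frame.drop(train.index)
    return train, test


# Toy data (made up) standing in for the real DataFrame
toy = pd.DataFrame({"age": range(10), "income": ["<=50K"] * 7 + [">50K"] * 3})
toy_train, toy_test = split_frame(toy)
print(len(toy_train), len(toy_test))
```

On the notebook's data the call would be `df_train, df_test = split_frame(df)`.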
df["income"].value_counts().plot(kind="bar", title="income")
<AxesSubplot:title={'center':'income'}>
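The bar chart shows absolute counts. According to the summary table, 37155 of the 48842 records fall in the `<=50K` class, roughly 76%. A minimal sketch of reading such shares directly with `value_counts(normalize=True)` follows. The helper name `class_shares` and the toy series are hypothetical.

```python
import pandas as pd


def class_shares(labels: pd.Series) -> pd.Series:
    """Return the fraction of rows belonging to each class label."""
    return labels.value_counts(normalize=True)


# Toy labels (made up) with a 3:1 imbalance
toy_labels = pd.Series(["<=50K", "<=50K", "<=50K", ">50K"], name="income")
shares = class_shares(toy_labels)
print(shares)
```

On the notebook's data this would be `class_shares(df["income"])`. An imbalance of this size is worth keeping in mind when choosing evaluation metrics later.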