!pip install kaggle
!pip install pandas
!pip install seaborn
Requirement already satisfied: kaggle in c:\users\user\anaconda3\lib\site-packages (1.5.12)
Requirement already satisfied: urllib3 in c:\users\user\anaconda3\lib\site-packages (from kaggle) (1.26.7)
Requirement already satisfied: python-dateutil in c:\users\user\anaconda3\lib\site-packages (from kaggle) (2.8.2)
Requirement already satisfied: python-slugify in c:\users\user\anaconda3\lib\site-packages (from kaggle) (5.0.2)
Requirement already satisfied: requests in c:\users\user\anaconda3\lib\site-packages (from kaggle) (2.26.0)
Requirement already satisfied: six>=1.10 in c:\users\user\anaconda3\lib\site-packages (from kaggle) (1.16.0)
Requirement already satisfied: tqdm in c:\users\user\anaconda3\lib\site-packages (from kaggle) (4.62.3)
Requirement already satisfied: certifi in c:\users\user\anaconda3\lib\site-packages (from kaggle) (2021.10.8)
Requirement already satisfied: text-unidecode>=1.3 in c:\users\user\anaconda3\lib\site-packages (from python-slugify->kaggle) (1.3)
Requirement already satisfied: charset-normalizer~=2.0.0 in c:\users\user\anaconda3\lib\site-packages (from requests->kaggle) (2.0.4)
Requirement already satisfied: idna<4,>=2.5 in c:\users\user\anaconda3\lib\site-packages (from requests->kaggle) (3.2)
Requirement already satisfied: colorama in c:\users\user\anaconda3\lib\site-packages (from tqdm->kaggle) (0.4.4)
Requirement already satisfied: pandas in c:\users\user\anaconda3\lib\site-packages (1.3.4)
Requirement already satisfied: pytz>=2017.3 in c:\users\user\anaconda3\lib\site-packages (from pandas) (2021.3)
Requirement already satisfied: python-dateutil>=2.7.3 in c:\users\user\anaconda3\lib\site-packages (from pandas) (2.8.2)
Requirement already satisfied: numpy>=1.17.3 in c:\users\user\anaconda3\lib\site-packages (from pandas) (1.20.3)
Requirement already satisfied: six>=1.5 in c:\users\user\anaconda3\lib\site-packages (from python-dateutil>=2.7.3->pandas) (1.16.0)
Requirement already satisfied: seaborn in c:\users\user\anaconda3\lib\site-packages (0.11.2)
Requirement already satisfied: numpy>=1.15 in c:\users\user\anaconda3\lib\site-packages (from seaborn) (1.20.3)
Requirement already satisfied: matplotlib>=2.2 in c:\users\user\anaconda3\lib\site-packages (from seaborn) (3.4.3)
Requirement already satisfied: scipy>=1.0 in c:\users\user\anaconda3\lib\site-packages (from seaborn) (1.7.1)
Requirement already satisfied: pandas>=0.23 in c:\users\user\anaconda3\lib\site-packages (from seaborn) (1.3.4)
Requirement already satisfied: cycler>=0.10 in c:\users\user\anaconda3\lib\site-packages (from matplotlib>=2.2->seaborn) (0.10.0)
Requirement already satisfied: pillow>=6.2.0 in c:\users\user\anaconda3\lib\site-packages (from matplotlib>=2.2->seaborn) (8.4.0)
Requirement already satisfied: pyparsing>=2.2.1 in c:\users\user\anaconda3\lib\site-packages (from matplotlib>=2.2->seaborn) (3.0.4)
Requirement already satisfied: kiwisolver>=1.0.1 in c:\users\user\anaconda3\lib\site-packages (from matplotlib>=2.2->seaborn) (1.3.1)
Requirement already satisfied: python-dateutil>=2.7 in c:\users\user\anaconda3\lib\site-packages (from matplotlib>=2.2->seaborn) (2.8.2)
Requirement already satisfied: six in c:\users\user\anaconda3\lib\site-packages (from cycler>=0.10->matplotlib>=2.2->seaborn) (1.16.0)
Requirement already satisfied: pytz>=2017.3 in c:\users\user\anaconda3\lib\site-packages (from pandas>=0.23->seaborn) (2021.3)
!kaggle datasets download -d wenruliu/adult-income-dataset
adult-income-dataset.zip: Skipping, found more recently modified local copy (use --force to force download)
!unzip -o adult-income-dataset.zip
'unzip' is not recognized as an internal or external command, operable program or batch file.
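The `unzip` command is not available in a default Windows shell, which is why the cell above fails. A minimal sketch of a portable alternative using only the standard library, assuming the archive sits in the working directory under the name used above; the helper name `extract_archive` is hypothetical and not part of the notebook:

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, target_dir="."):
    """Extract every member of zip_path into target_dir and return their names."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
        return archive.namelist()


if __name__ == "__main__":
    archive_path = Path("adult-income-dataset.zip")
    # Only attempt extraction when the download step actually produced the file.
    if archive_path.exists():
        print(extract_archive(archive_path))
```

The returned list shows the file names inside the archive, which is a quick way to check the name later passed to `pd.read_csv`.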
import pandas as pd
df = pd.read_csv('adult-income-dataset.csv')
df
| | age | workclass | fnlwgt | education | educational-num | marital-status | occupation | relationship | race | gender | capital-gain | capital-loss | hours-per-week | native-country | income |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 25 | Private | 226802 | 11th | 7 | Never-married | Machine-op-inspct | Own-child | Black | Male | 0 | 0 | 40 | United-States | <=50K |
1 | 38 | Private | 89814 | HS-grad | 9 | Married-civ-spouse | Farming-fishing | Husband | White | Male | 0 | 0 | 50 | United-States | <=50K |
2 | 28 | Local-gov | 336951 | Assoc-acdm | 12 | Married-civ-spouse | Protective-serv | Husband | White | Male | 0 | 0 | 40 | United-States | >50K |
3 | 44 | Private | 160323 | Some-college | 10 | Married-civ-spouse | Machine-op-inspct | Husband | Black | Male | 7688 | 0 | 40 | United-States | >50K |
4 | 18 | ? | 103497 | Some-college | 10 | Never-married | ? | Own-child | White | Female | 0 | 0 | 30 | United-States | <=50K |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
48837 | 27 | Private | 257302 | Assoc-acdm | 12 | Married-civ-spouse | Tech-support | Wife | White | Female | 0 | 0 | 38 | United-States | <=50K |
48838 | 40 | Private | 154374 | HS-grad | 9 | Married-civ-spouse | Machine-op-inspct | Husband | White | Male | 0 | 0 | 40 | United-States | >50K |
48839 | 58 | Private | 151910 | HS-grad | 9 | Widowed | Adm-clerical | Unmarried | White | Female | 0 | 0 | 40 | United-States | <=50K |
48840 | 22 | Private | 201490 | HS-grad | 9 | Never-married | Adm-clerical | Own-child | White | Male | 0 | 0 | 20 | United-States | <=50K |
48841 | 52 | Self-emp-inc | 287927 | HS-grad | 9 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 15024 | 0 | 40 | United-States | >50K |
48842 rows × 15 columns
# drop incomplete rows (workclass marked '?')
df = df[df.workclass != '?']
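The filter above only checks `workclass`, while the preview also shows `?` in `occupation` (row 4). A minimal sketch of dropping rows that contain `?` in any column, assuming `?` is the only missing-value marker in the file; the helper name `drop_incomplete` and the toy frame are illustrative and not part of the notebook:

```python
import numpy as np
import pandas as pd


def drop_incomplete(frame, marker="?"):
    """Return a copy of frame without rows that contain marker in any column."""
    # Turn the marker into NaN so that dropna() can remove the affected rows.
    cleaned = frame.replace(marker, np.nan).dropna()
    return cleaned.reset_index(drop=True)


# Toy data shaped like a few columns of the dataset preview.
toy = pd.DataFrame({
    "age": [25, 18, 38],
    "workclass": ["Private", "?", "Private"],
    "occupation": ["Machine-op-inspct", "?", "Farming-fishing"],
})
print(drop_incomplete(toy))
```

On the real data this would be called as `df = drop_incomplete(df)` in place of the single-column comparison.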
import torch
train_size = int(0.8 * len(df))
test_size = len(df) - train_size
# random_split returns torch Subset objects, so map their indices back onto the DataFrame
train_subset, test_subset = torch.utils.data.random_split(df, [train_size, test_size])
df_train, df_test = df.iloc[train_subset.indices], df.iloc[test_subset.indices]
print(f"Dataset size: {len(df)}, train subset: {train_size}, test subset: {test_size}.")
df.describe(include='all')
Dataset size: 48842, train subset: 39073, test subset: 9769.
| | age | workclass | fnlwgt | education | educational-num | marital-status | occupation | relationship | race | gender | capital-gain | capital-loss | hours-per-week | native-country | income |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 48842.000000 | 48842 | 4.884200e+04 | 48842 | 48842.000000 | 48842 | 48842 | 48842 | 48842 | 48842 | 48842.000000 | 48842.000000 | 48842.000000 | 48842 | 48842 |
unique | NaN | 9 | NaN | 16 | NaN | 7 | 15 | 6 | 5 | 2 | NaN | NaN | NaN | 42 | 2 |
top | NaN | Private | NaN | HS-grad | NaN | Married-civ-spouse | Prof-specialty | Husband | White | Male | NaN | NaN | NaN | United-States | <=50K |
freq | NaN | 33906 | NaN | 15784 | NaN | 22379 | 6172 | 19716 | 41762 | 32650 | NaN | NaN | NaN | 43832 | 37155 |
mean | 38.643585 | NaN | 1.896641e+05 | NaN | 10.078089 | NaN | NaN | NaN | NaN | NaN | 1079.067626 | 87.502314 | 40.422382 | NaN | NaN |
std | 13.710510 | NaN | 1.056040e+05 | NaN | 2.570973 | NaN | NaN | NaN | NaN | NaN | 7452.019058 | 403.004552 | 12.391444 | NaN | NaN |
min | 17.000000 | NaN | 1.228500e+04 | NaN | 1.000000 | NaN | NaN | NaN | NaN | NaN | 0.000000 | 0.000000 | 1.000000 | NaN | NaN |
25% | 28.000000 | NaN | 1.175505e+05 | NaN | 9.000000 | NaN | NaN | NaN | NaN | NaN | 0.000000 | 0.000000 | 40.000000 | NaN | NaN |
50% | 37.000000 | NaN | 1.781445e+05 | NaN | 10.000000 | NaN | NaN | NaN | NaN | NaN | 0.000000 | 0.000000 | 40.000000 | NaN | NaN |
75% | 48.000000 | NaN | 2.376420e+05 | NaN | 12.000000 | NaN | NaN | NaN | NaN | NaN | 0.000000 | 0.000000 | 45.000000 | NaN | NaN |
max | 90.000000 | NaN | 1.490400e+06 | NaN | 16.000000 | NaN | NaN | NaN | NaN | NaN | 99999.000000 | 4356.000000 | 99.000000 | NaN | NaN |
df["income"].value_counts().plot(kind="bar", title="income")
<AxesSubplot:title={'center':'income'}>
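The `describe` output above reports 37155 rows with income `<=50K` out of 48842, so the two bars in the chart differ by roughly three to one. A minimal sketch of reading the class shares as fractions, using a toy series rebuilt from those two counts instead of the real file:

```python
import pandas as pd

# Counts taken from the describe() output: 37155 of 48842 rows are "<=50K".
total, low = 48842, 37155
income = pd.Series(["<=50K"] * low + [">50K"] * (total - low), name="income")

# normalize=True returns fractions of the total instead of raw counts.
shares = income.value_counts(normalize=True)
print(shares.round(4))
```

On the notebook's own data the same call would be `df["income"].value_counts(normalize=True)`.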