ium_444354/lab2.ipynb

!pip install opendatasets

Requirement already satisfied: opendatasets in c:\users\riraa\anaconda3\lib\site-packages (0.1.20)
Requirement already satisfied: click in c:\users\riraa\anaconda3\lib\site-packages (from opendatasets) (7.1.2)
Requirement already satisfied: kaggle in c:\users\riraa\appdata\roaming\python\python38\site-packages (from opendatasets) (1.5.12)
Requirement already satisfied: tqdm in c:\users\riraa\anaconda3\lib\site-packages (from opendatasets) (4.59.0)
Requirement already satisfied: python-slugify in c:\users\riraa\appdata\roaming\python\python38\site-packages (from kaggle->opendatasets) (6.1.1)
Requirement already satisfied: python-dateutil in c:\users\riraa\anaconda3\lib\site-packages (from kaggle->opendatasets) (2.8.1)
Requirement already satisfied: requests in c:\users\riraa\anaconda3\lib\site-packages (from kaggle->opendatasets) (2.25.1)
Requirement already satisfied: urllib3 in c:\users\riraa\anaconda3\lib\site-packages (from kaggle->opendatasets) (1.26.4)
Requirement already satisfied: certifi in c:\users\riraa\anaconda3\lib\site-packages (from kaggle->opendatasets) (2020.12.5)
Requirement already satisfied: six>=1.10 in c:\users\riraa\anaconda3\lib\site-packages (from kaggle->opendatasets) (1.15.0)
Requirement already satisfied: text-unidecode>=1.3 in c:\users\riraa\appdata\roaming\python\python38\site-packages (from python-slugify->kaggle->opendatasets) (1.3)
Requirement already satisfied: idna<3,>=2.5 in c:\users\riraa\anaconda3\lib\site-packages (from requests->kaggle->opendatasets) (2.10)
Requirement already satisfied: chardet<5,>=3.0.2 in c:\users\riraa\anaconda3\lib\site-packages (from requests->kaggle->opendatasets) (4.0.0)
import opendatasets as od
od.download('https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009')
Skipping, found downloaded files in ".\red-wine-quality-cortez-et-al-2009" (use force=True to force download)
import pandas as pd
wine = pd.read_csv('./red-wine-quality-cortez-et-al-2009/winequality-red.csv')
wine
      fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  alcohol  quality
   0            7.4             0.700         0.00             1.9      0.076                 11.0                  34.0  0.99780  3.51       0.56      9.4        5
   1            7.8             0.880         0.00             2.6      0.098                 25.0                  67.0  0.99680  3.20       0.68      9.8        5
   2            7.8             0.760         0.04             2.3      0.092                 15.0                  54.0  0.99700  3.26       0.65      9.8        5
   3           11.2             0.280         0.56             1.9      0.075                 17.0                  60.0  0.99800  3.16       0.58      9.8        6
   4            7.4             0.700         0.00             1.9      0.076                 11.0                  34.0  0.99780  3.51       0.56      9.4        5
 ...            ...               ...          ...             ...        ...                  ...                   ...      ...   ...        ...      ...      ...
1594            6.2             0.600         0.08             2.0      0.090                 32.0                  44.0  0.99490  3.45       0.58     10.5        5
1595            5.9             0.550         0.10             2.2      0.062                 39.0                  51.0  0.99512  3.52       0.76     11.2        6
1596            6.3             0.510         0.13             2.3      0.076                 29.0                  40.0  0.99574  3.42       0.75     11.0        6
1597            5.9             0.645         0.12             2.0      0.075                 32.0                  44.0  0.99547  3.57       0.71     10.2        5
1598            6.0             0.310         0.47             3.6      0.067                 18.0                  42.0  0.99549  3.39       0.66     11.0        6

1599 rows × 12 columns

from sklearn.model_selection import train_test_split
wine_train, wine_test = train_test_split(wine, test_size=50, random_state=1, stratify=wine["quality"])
wine_train["quality"].value_counts().sort_index(ascending=False)
8     17
7    193
6    618
5    660
4     51
3     10
Name: quality, dtype: int64

Size of the dataset and its subsets

For the full dataset

wine.head()
      fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  alcohol  quality
   0            7.4              0.70         0.00             1.9      0.076                 11.0                  34.0   0.9978  3.51       0.56      9.4        5
   1            7.8              0.88         0.00             2.6      0.098                 25.0                  67.0   0.9968  3.20       0.68      9.8        5
   2            7.8              0.76         0.04             2.3      0.092                 15.0                  54.0   0.9970  3.26       0.65      9.8        5
   3           11.2              0.28         0.56             1.9      0.075                 17.0                  60.0   0.9980  3.16       0.58      9.8        6
   4            7.4              0.70         0.00             1.9      0.076                 11.0                  34.0   0.9978  3.51       0.56      9.4        5
wine.describe()
       fixed acidity  volatile acidity  citric acid  residual sugar    chlorides  free sulfur dioxide  total sulfur dioxide      density           pH    sulphates      alcohol      quality
count    1599.000000       1599.000000  1599.000000     1599.000000  1599.000000          1599.000000           1599.000000  1599.000000  1599.000000  1599.000000  1599.000000  1599.000000
mean        8.319637          0.527821     0.270976        2.538806     0.087467            15.874922             46.467792     0.996747     3.311113     0.658149    10.422983     5.636023
std         1.741096          0.179060     0.194801        1.409928     0.047065            10.460157             32.895324     0.001887     0.154386     0.169507     1.065668     0.807569
min         4.600000          0.120000     0.000000        0.900000     0.012000             1.000000              6.000000     0.990070     2.740000     0.330000     8.400000     3.000000
25%         7.100000          0.390000     0.090000        1.900000     0.070000             7.000000             22.000000     0.995600     3.210000     0.550000     9.500000     5.000000
50%         7.900000          0.520000     0.260000        2.200000     0.079000            14.000000             38.000000     0.996750     3.310000     0.620000    10.200000     6.000000
75%         9.200000          0.640000     0.420000        2.600000     0.090000            21.000000             62.000000     0.997835     3.400000     0.730000    11.100000     6.000000
max        15.900000          1.580000     1.000000       15.500000     0.611000            72.000000            289.000000     1.003690     4.010000     2.000000    14.900000     8.000000
wine["quality"].value_counts().sort_index(ascending=False)
8     18
7    199
6    638
5    681
4     53
3     10
Name: quality, dtype: int64
wine["quality"].value_counts().sort_index(ascending=False).plot(kind="bar")
<AxesSubplot:>

For the _train subset

wine_train.head()
      fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  alcohol  quality
1453            7.6              0.49         0.33             1.9      0.074                 27.0                  85.0  0.99706  3.41       0.58      9.0        5
1295            6.6              0.63         0.00             4.3      0.093                 51.0                  77.5  0.99558  3.20       0.45      9.5        5
 778            8.3              0.43         0.30             3.4      0.079                  7.0                  34.0  0.99788  3.36       0.61     10.5        5
 692            8.6              0.49         0.51             2.0      0.422                 16.0                  62.0  0.99790  3.03       1.17      9.0        5
 166            6.8              0.64         0.10             2.1      0.085                 18.0                 101.0  0.99560  3.34       0.52     10.2        5
wine_train.describe()
       fixed acidity  volatile acidity  citric acid  residual sugar    chlorides  free sulfur dioxide  total sulfur dioxide      density           pH    sulphates      alcohol      quality
count    1549.000000       1549.000000  1549.000000     1549.000000  1549.000000          1549.000000           1549.000000  1549.000000  1549.000000  1549.000000  1549.000000  1549.000000
mean        8.327566          0.528128     0.271252        2.529987     0.086944            15.832150             46.415107     0.996746     3.310484     0.656727    10.419141     5.635249
std         1.744692          0.180152     0.194249        1.380202     0.043732            10.450522             32.884454     0.001877     0.154269     0.166558     1.067245     0.807313
min         4.600000          0.120000     0.000000        0.900000     0.012000             1.000000              6.000000     0.990070     2.860000     0.330000     8.400000     3.000000
25%         7.100000          0.390000     0.090000        1.900000     0.070000             7.000000             22.000000     0.995600     3.210000     0.550000     9.500000     5.000000
50%         7.900000          0.520000     0.260000        2.200000     0.079000            13.000000             38.000000     0.996750     3.310000     0.620000    10.100000     6.000000
75%         9.200000          0.640000     0.430000        2.600000     0.090000            21.000000             62.000000     0.997860     3.400000     0.730000    11.100000     6.000000
max        15.900000          1.580000     0.790000       15.500000     0.467000            72.000000            289.000000     1.003690     4.010000     1.980000    14.900000     8.000000
wine_train["quality"].value_counts().sort_index(ascending=False)  # the index values are wine quality scores
8     17
7    193
6    618
5    660
4     51
3     10
Name: quality, dtype: int64

The counts are sorted by index so that the classes run from best to worst quality rather than from the most to the fewest samples.
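To make the difference between the two orderings concrete, here is a minimal sketch on a toy Series. The values in `labels` are made up for illustration and are not taken from the wine data; only `value_counts()` and `sort_index()` come from the notebook.

```python
import pandas as pd

# Toy labels (hypothetical, not taken from the wine data).
labels = pd.Series([5, 5, 5, 6, 6, 7])

# Default ordering: the most frequent label comes first.
by_count = labels.value_counts()

# Ordering used in this notebook: the highest label comes first.
by_label = labels.value_counts().sort_index(ascending=False)

print(by_count.index.tolist())
print(by_label.index.tolist())
```

With quality scores as labels, the second ordering keeps the bar chart's x-axis in a fixed best-to-worst order regardless of how many samples each class has.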

wine_train["quality"].value_counts().sort_index(ascending=False).plot(kind="bar")
<AxesSubplot:>

For the _test subset

wine_test.head()
      fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  alcohol  quality
 856            9.3              0.36         0.39             1.5      0.080                 41.0                  55.0  0.99652  3.47       0.73     10.9        6
1142            6.9              0.45         0.11             2.4      0.043                  6.0                  12.0  0.99354  3.30       0.65     11.4        6
 538           12.9              0.35         0.49             5.8      0.066                  5.0                  35.0  1.00140  3.20       0.66     12.0        7
1324            6.7              0.46         0.24             1.7      0.077                 18.0                  34.0  0.99480  3.39       0.60     10.6        6
 288            8.7              0.52         0.09             2.5      0.091                 20.0                  49.0  0.99760  3.34       0.86     10.6        7
wine_test.describe()
       fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide    density         pH  sulphates    alcohol    quality
count      50.000000         50.000000    50.000000       50.000000   50.00000            50.000000             50.000000  50.000000  50.000000  50.000000  50.000000  50.000000
mean        8.074000          0.518300     0.262400        2.812000    0.10364            17.200000             48.100000   0.996779   3.330600   0.702200  10.542000   5.660000
std         1.622899          0.142197     0.213155        2.137769    0.10746            10.777906             33.525653   0.002199   0.158338   0.242035   1.018621   0.823383
min         5.600000          0.310000     0.000000        1.500000    0.03800             3.000000              8.000000   0.992920   2.740000   0.370000   9.000000   4.000000
25%         6.900000          0.402500     0.095000        1.900000    0.07325            10.000000             25.250000   0.995445   3.260000   0.590000   9.725000   5.000000
50%         7.650000          0.500000     0.245000        2.200000    0.08000            15.000000             36.500000   0.996560   3.320000   0.655000  10.350000   6.000000
75%         9.150000          0.625000     0.400000        2.675000    0.08625            23.750000             62.000000   0.997600   3.400000   0.770000  11.175000   6.000000
max        12.900000          0.980000     1.000000       15.400000    0.61100            55.000000            143.000000   1.003690   3.710000   2.000000  12.800000   8.000000
wine_test["quality"].value_counts().sort_index(ascending=False)  # the index values are wine quality scores
8     1
7     6
6    20
5    21
4     2
Name: quality, dtype: int64
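As a rough sanity check on the stratified split, the class counts printed above can be compared with the ideal allocation, where each class contributes `count * 50 / 1599` rows to the test subset. The sketch below only redoes that arithmetic on the counts copied from the outputs in this notebook; the variable names are illustrative.

```python
# Class counts copied from the value_counts() outputs in this notebook.
full_counts = {8: 18, 7: 199, 6: 638, 5: 681, 4: 53, 3: 10}
test_counts = {8: 1, 7: 6, 6: 20, 5: 21, 4: 2}  # quality 3 does not appear in the test subset

n_full = sum(full_counts.values())  # 1599 rows in the full dataset
n_test = sum(test_counts.values())  # 50 rows in the test subset

# Ideal (fractional) number of test rows per class under perfect stratification.
expected = {q: count * n_test / n_full for q, count in full_counts.items()}

for q in sorted(full_counts, reverse=True):
    observed = test_counts.get(q, 0)
    print(f"quality {q}: expected {expected[q]:.2f}, observed {observed}")
```

Every observed count is within one row of the ideal value, which is as close as an integer allocation of 50 rows can get. The ideal share for quality 3 is only about 0.31 of a row, so its absence from the test subset is expected.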
wine_test["quality"].value_counts().sort_index(ascending=False).plot(kind="bar")
<AxesSubplot:>

Split into train and remainder subsets

X_train, X_rem, y_train, y_rem = train_test_split(wine.iloc[:, :-1], wine.iloc[:, -1], test_size=0.2, random_state=1, stratify=wine["quality"])
y_train.value_counts().sum()
1279
y_rem.value_counts().sum()
320

We now have an 8:2 split but want 8:1:1, so we split the remainder in half.

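# Fix the seed so that the validation/test split is reproducible, like the first split.
X_valid, X_test, y_valid, y_test = train_test_split(X_rem, y_rem, test_size=0.5, random_state=1)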

# One print per statement: a trailing "print(...), print(...)" expression evaluates to a (None, None) tuple.
print(X_train.shape)
print(y_train.shape)
print(X_valid.shape)
print(y_valid.shape)
print(X_test.shape)
print(y_test.shape)
(1279, 11)
(1279,)
(160, 11)
(160,)
(160, 11)
(160,)
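The two-step 8:1:1 split above can be wrapped in a small helper. This is only a sketch under assumptions: the function name `split_8_1_1`, the `seed` parameter, and the toy arrays are hypothetical. It also stratifies the second split, which the notebook does not do.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def split_8_1_1(X, y, seed=1):
    """Split into train/valid/test at roughly 8:1:1, stratified on y (hypothetical helper)."""
    # Step 1: hold out 20% of the rows as the remainder.
    X_train, X_rem, y_train, y_rem = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y
    )
    # Step 2: cut the remainder in half to get validation and test sets.
    X_valid, X_test, y_valid, y_test = train_test_split(
        X_rem, y_rem, test_size=0.5, random_state=seed, stratify=y_rem
    )
    return (X_train, y_train), (X_valid, y_valid), (X_test, y_test)


# Toy data: 100 rows, two features, two classes (70 vs 30 rows).
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 70 + [1] * 30)

train, valid, test = split_8_1_1(X, y)
sizes = [len(part[1]) for part in (train, valid, test)]
print(sizes)
```

Stratifying the second step requires at least two rows of every class in the remainder, so on data with very rare classes it may need to be dropped.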

Normalization

from sklearn.preprocessing import MinMaxScaler
norm = MinMaxScaler()
norm_fit = norm.fit(X_train)
norm_X_train = norm_fit.transform(X_train)
norm_X_test = norm_fit.transform(X_test)
norm_X_valid = norm_fit.transform(X_valid)
After normalization, every training feature lies in the range [0, 1].
norm_X_train[1]
array([0.26548673, 0.14049587, 0.62025316, 0.12328767, 0.17582418,
       0.33802817, 0.19081272, 0.51615272, 0.39130435, 0.16969697,
       0.26153846])
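Because the scaler is fitted on the training data only, the [0, 1] range is guaranteed only for the training rows. The sketch below uses made-up one-column arrays (not the wine data) to show that validation or test values can fall outside that range, and that the result matches the formula `(x - min) / (max - min)` computed from the training column.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy one-feature data (hypothetical values).
train = np.array([[1.0], [3.0], [5.0]])
held_out = np.array([[0.0], [7.0]])  # values below and above the training range

# Fit on the training data only, then reuse the same scaler everywhere.
scaler = MinMaxScaler().fit(train)
train_scaled = scaler.transform(train)
held_out_scaled = scaler.transform(held_out)

# The same transformation written out by hand from the training min and max.
manual = (held_out - train.min(axis=0)) / (train.max(axis=0) - train.min(axis=0))

print(train_scaled.ravel())
print(held_out_scaled.ravel())
```

Fitting on the training set alone keeps information from the validation and test sets out of preprocessing, at the cost of occasional scaled values below 0 or above 1.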

There are no null values to fill.

wine.isnull().sum()
fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
dtype: int64
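For comparison, here is a minimal sketch of what the same check looks like when values are missing, and one common way to fill them. The toy frame and the choice of median imputation are assumptions for illustration; the wine data needs none of this.

```python
import pandas as pd

# Toy frame with one missing value (hypothetical, not the wine data).
toy = pd.DataFrame({"a": [1.0, None, 3.0], "b": [4.0, 5.0, 6.0]})

# Same check as above: number of nulls per column.
missing = toy.isnull().sum()
print(missing)

# Fill each column's nulls with that column's median.
filled = toy.fillna(toy.median())
print(filled)
```

In a train/validation/test setup, the fill values would normally be computed on the training subset only and then reused for the other subsets, for the same reason the scaler is fitted on `X_train` alone.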