76 KiB
76 KiB
!pip install opendatasets
Requirement already satisfied: opendatasets in c:\users\riraa\anaconda3\lib\site-packages (0.1.20) Requirement already satisfied: tqdm in c:\users\riraa\anaconda3\lib\site-packages (from opendatasets) (4.59.0) Requirement already satisfied: click in c:\users\riraa\anaconda3\lib\site-packages (from opendatasets) (7.1.2) Requirement already satisfied: kaggle in c:\users\riraa\appdata\roaming\python\python38\site-packages (from opendatasets) (1.5.12) Requirement already satisfied: requests in c:\users\riraa\anaconda3\lib\site-packages (from kaggle->opendatasets) (2.25.1) Requirement already satisfied: six>=1.10 in c:\users\riraa\anaconda3\lib\site-packages (from kaggle->opendatasets) (1.15.0) Requirement already satisfied: certifi in c:\users\riraa\anaconda3\lib\site-packages (from kaggle->opendatasets) (2020.12.5) Requirement already satisfied: urllib3 in c:\users\riraa\anaconda3\lib\site-packages (from kaggle->opendatasets) (1.26.4) Requirement already satisfied: python-slugify in c:\users\riraa\appdata\roaming\python\python38\site-packages (from kaggle->opendatasets) (6.1.1) Requirement already satisfied: python-dateutil in c:\users\riraa\anaconda3\lib\site-packages (from kaggle->opendatasets) (2.8.1) Requirement already satisfied: text-unidecode>=1.3 in c:\users\riraa\appdata\roaming\python\python38\site-packages (from python-slugify->kaggle->opendatasets) (1.3) Requirement already satisfied: idna<3,>=2.5 in c:\users\riraa\anaconda3\lib\site-packages (from requests->kaggle->opendatasets) (2.10) Requirement already satisfied: chardet<5,>=3.0.2 in c:\users\riraa\anaconda3\lib\site-packages (from requests->kaggle->opendatasets) (4.0.0)
import opendatasets as od
od.download('https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009')
100%|█████████████████████████████████████████████████████████████████████████████| 25.6k/25.6k [00:00<00:00, 1.68MB/s]
Downloading red-wine-quality-cortez-et-al-2009.zip to .\red-wine-quality-cortez-et-al-2009
import pandas as pd
wine=pd.read_csv('./red-wine-quality-cortez-et-al-2009/winequality-red.csv')
wine
fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 7.4 | 0.700 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.99780 | 3.51 | 0.56 | 9.4 | 5 |
1 | 7.8 | 0.880 | 0.00 | 2.6 | 0.098 | 25.0 | 67.0 | 0.99680 | 3.20 | 0.68 | 9.8 | 5 |
2 | 7.8 | 0.760 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.99700 | 3.26 | 0.65 | 9.8 | 5 |
3 | 11.2 | 0.280 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.99800 | 3.16 | 0.58 | 9.8 | 6 |
4 | 7.4 | 0.700 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.99780 | 3.51 | 0.56 | 9.4 | 5 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1594 | 6.2 | 0.600 | 0.08 | 2.0 | 0.090 | 32.0 | 44.0 | 0.99490 | 3.45 | 0.58 | 10.5 | 5 |
1595 | 5.9 | 0.550 | 0.10 | 2.2 | 0.062 | 39.0 | 51.0 | 0.99512 | 3.52 | 0.76 | 11.2 | 6 |
1596 | 6.3 | 0.510 | 0.13 | 2.3 | 0.076 | 29.0 | 40.0 | 0.99574 | 3.42 | 0.75 | 11.0 | 6 |
1597 | 5.9 | 0.645 | 0.12 | 2.0 | 0.075 | 32.0 | 44.0 | 0.99547 | 3.57 | 0.71 | 10.2 | 5 |
1598 | 6.0 | 0.310 | 0.47 | 3.6 | 0.067 | 18.0 | 42.0 | 0.99549 | 3.39 | 0.66 | 11.0 | 6 |
1599 rows × 12 columns
from sklearn.model_selection import train_test_split
wine_train, wine_test = train_test_split(wine, test_size=50, random_state=1,stratify=wine["quality"])
wine_train["quality"].value_counts().sort_index(ascending=False)
8 17 7 193 6 618 5 660 4 51 3 10 Name: quality, dtype: int64
Wielkość zbioru i podzbiorów
Dla całego zbioru
wine.head()
fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
1 | 7.8 | 0.88 | 0.00 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.20 | 0.68 | 9.8 | 5 |
2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.9970 | 3.26 | 0.65 | 9.8 | 5 |
3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.9980 | 3.16 | 0.58 | 9.8 | 6 |
4 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
wine.describe()
fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 |
mean | 8.319637 | 0.527821 | 0.270976 | 2.538806 | 0.087467 | 15.874922 | 46.467792 | 0.996747 | 3.311113 | 0.658149 | 10.422983 | 5.636023 |
std | 1.741096 | 0.179060 | 0.194801 | 1.409928 | 0.047065 | 10.460157 | 32.895324 | 0.001887 | 0.154386 | 0.169507 | 1.065668 | 0.807569 |
min | 4.600000 | 0.120000 | 0.000000 | 0.900000 | 0.012000 | 1.000000 | 6.000000 | 0.990070 | 2.740000 | 0.330000 | 8.400000 | 3.000000 |
25% | 7.100000 | 0.390000 | 0.090000 | 1.900000 | 0.070000 | 7.000000 | 22.000000 | 0.995600 | 3.210000 | 0.550000 | 9.500000 | 5.000000 |
50% | 7.900000 | 0.520000 | 0.260000 | 2.200000 | 0.079000 | 14.000000 | 38.000000 | 0.996750 | 3.310000 | 0.620000 | 10.200000 | 6.000000 |
75% | 9.200000 | 0.640000 | 0.420000 | 2.600000 | 0.090000 | 21.000000 | 62.000000 | 0.997835 | 3.400000 | 0.730000 | 11.100000 | 6.000000 |
max | 15.900000 | 1.580000 | 1.000000 | 15.500000 | 0.611000 | 72.000000 | 289.000000 | 1.003690 | 4.010000 | 2.000000 | 14.900000 | 8.000000 |
wine["quality"].value_counts().sort_index(ascending=False)
8 18 7 199 6 638 5 681 4 53 3 10 Name: quality, dtype: int64
wine["quality"].value_counts().sort_index(ascending=False).plot(kind="bar")
<AxesSubplot:>
Dla podzbioru _train
wine_train.head()
fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1453 | 7.6 | 0.49 | 0.33 | 1.9 | 0.074 | 27.0 | 85.0 | 0.99706 | 3.41 | 0.58 | 9.0 | 5 |
1295 | 6.6 | 0.63 | 0.00 | 4.3 | 0.093 | 51.0 | 77.5 | 0.99558 | 3.20 | 0.45 | 9.5 | 5 |
778 | 8.3 | 0.43 | 0.30 | 3.4 | 0.079 | 7.0 | 34.0 | 0.99788 | 3.36 | 0.61 | 10.5 | 5 |
692 | 8.6 | 0.49 | 0.51 | 2.0 | 0.422 | 16.0 | 62.0 | 0.99790 | 3.03 | 1.17 | 9.0 | 5 |
166 | 6.8 | 0.64 | 0.10 | 2.1 | 0.085 | 18.0 | 101.0 | 0.99560 | 3.34 | 0.52 | 10.2 | 5 |
wine_train.describe()
fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 1549.000000 | 1549.000000 | 1549.000000 | 1549.000000 | 1549.000000 | 1549.000000 | 1549.000000 | 1549.000000 | 1549.000000 | 1549.000000 | 1549.000000 | 1549.000000 |
mean | 8.327566 | 0.528128 | 0.271252 | 2.529987 | 0.086944 | 15.832150 | 46.415107 | 0.996746 | 3.310484 | 0.656727 | 10.419141 | 5.635249 |
std | 1.744692 | 0.180152 | 0.194249 | 1.380202 | 0.043732 | 10.450522 | 32.884454 | 0.001877 | 0.154269 | 0.166558 | 1.067245 | 0.807313 |
min | 4.600000 | 0.120000 | 0.000000 | 0.900000 | 0.012000 | 1.000000 | 6.000000 | 0.990070 | 2.860000 | 0.330000 | 8.400000 | 3.000000 |
25% | 7.100000 | 0.390000 | 0.090000 | 1.900000 | 0.070000 | 7.000000 | 22.000000 | 0.995600 | 3.210000 | 0.550000 | 9.500000 | 5.000000 |
50% | 7.900000 | 0.520000 | 0.260000 | 2.200000 | 0.079000 | 13.000000 | 38.000000 | 0.996750 | 3.310000 | 0.620000 | 10.100000 | 6.000000 |
75% | 9.200000 | 0.640000 | 0.430000 | 2.600000 | 0.090000 | 21.000000 | 62.000000 | 0.997860 | 3.400000 | 0.730000 | 11.100000 | 6.000000 |
max | 15.900000 | 1.580000 | 0.790000 | 15.500000 | 0.467000 | 72.000000 | 289.000000 | 1.003690 | 4.010000 | 1.980000 | 14.900000 | 8.000000 |
wine_train["quality"].value_counts().sort_index(ascending=False) #indexy oznaczają jakość wina
8 17 7 193 6 618 5 660 4 51 3 10 Name: quality, dtype: int64
wine_train["quality"].value_counts().sort_index(ascending=False).plot(kind="bar")
<AxesSubplot:>
Dla podzbioru _test
wine_test.head()
fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
856 | 9.3 | 0.36 | 0.39 | 1.5 | 0.080 | 41.0 | 55.0 | 0.99652 | 3.47 | 0.73 | 10.9 | 6 |
1142 | 6.9 | 0.45 | 0.11 | 2.4 | 0.043 | 6.0 | 12.0 | 0.99354 | 3.30 | 0.65 | 11.4 | 6 |
538 | 12.9 | 0.35 | 0.49 | 5.8 | 0.066 | 5.0 | 35.0 | 1.00140 | 3.20 | 0.66 | 12.0 | 7 |
1324 | 6.7 | 0.46 | 0.24 | 1.7 | 0.077 | 18.0 | 34.0 | 0.99480 | 3.39 | 0.60 | 10.6 | 6 |
288 | 8.7 | 0.52 | 0.09 | 2.5 | 0.091 | 20.0 | 49.0 | 0.99760 | 3.34 | 0.86 | 10.6 | 7 |
wine_test.describe()
fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50.000000 | 50.000000 | 50.000000 | 50.000000 | 50.00000 | 50.000000 | 50.000000 | 50.000000 | 50.000000 | 50.000000 | 50.000000 | 50.000000 |
mean | 8.074000 | 0.518300 | 0.262400 | 2.812000 | 0.10364 | 17.200000 | 48.100000 | 0.996779 | 3.330600 | 0.702200 | 10.542000 | 5.660000 |
std | 1.622899 | 0.142197 | 0.213155 | 2.137769 | 0.10746 | 10.777906 | 33.525653 | 0.002199 | 0.158338 | 0.242035 | 1.018621 | 0.823383 |
min | 5.600000 | 0.310000 | 0.000000 | 1.500000 | 0.03800 | 3.000000 | 8.000000 | 0.992920 | 2.740000 | 0.370000 | 9.000000 | 4.000000 |
25% | 6.900000 | 0.402500 | 0.095000 | 1.900000 | 0.07325 | 10.000000 | 25.250000 | 0.995445 | 3.260000 | 0.590000 | 9.725000 | 5.000000 |
50% | 7.650000 | 0.500000 | 0.245000 | 2.200000 | 0.08000 | 15.000000 | 36.500000 | 0.996560 | 3.320000 | 0.655000 | 10.350000 | 6.000000 |
75% | 9.150000 | 0.625000 | 0.400000 | 2.675000 | 0.08625 | 23.750000 | 62.000000 | 0.997600 | 3.400000 | 0.770000 | 11.175000 | 6.000000 |
max | 12.900000 | 0.980000 | 1.000000 | 15.400000 | 0.61100 | 55.000000 | 143.000000 | 1.003690 | 3.710000 | 2.000000 | 12.800000 | 8.000000 |
wine_test["quality"].value_counts().sort_index(ascending=False) #indexy oznaczają jakość wina
8 1 7 6 6 20 5 21 4 2 Name: quality, dtype: int64
wine_test["quality"].value_counts().sort_index(ascending=False).plot(kind="bar")
<AxesSubplot:>
Normalizacja
Podział z wyróżnieniem data/target
x_train,x_test,y_train,y_test = train_test_split(wine.iloc[:,:-1],wine.iloc[:,-1], test_size=0.2, random_state=1,stratify=wine["quality"])
y_train.value_counts().sum()
1279
y_test.value_counts().sum()
320
Normalizacja
from sklearn.preprocessing import MinMaxScaler
norm = MinMaxScaler()
norm_fit = norm.fit(x_train)
norm_x_train = norm_fit.transform(x_train)
norm_x_test = norm_fit.transform(x_test)
norm_x_train[:5]
array([[0.31858407, 0.15702479, 0.50632911, 0.0890411 , 0.1010989 , 0.07042254, 0.01413428, 0.38839941, 0.39130435, 0.21212121, 0.43076923], [0.26548673, 0.14049587, 0.62025316, 0.12328767, 0.17582418, 0.33802817, 0.19081272, 0.51615272, 0.39130435, 0.16969697, 0.26153846], [0.23893805, 0.17355372, 0.59493671, 0.08219178, 0.14285714, 0.05633803, 0.01766784, 0.42070485, 0.40869565, 0.12121212, 0.29230769], [0.19469027, 0.31404959, 0.13924051, 0.04109589, 0.13846154, 0.21126761, 0.15194346, 0.39500734, 0.43478261, 0.27878788, 0.16923077], [0.27433628, 0.65702479, 0.15189873, 0.0890411 , 0.28791209, 0.08450704, 0.06007067, 0.46475771, 0.42608696, 0.19393939, 0.27692308]])
Nie ma żadnych null'i do wypełnienia
wine.isnull().sum()
fixed acidity 0 volatile acidity 0 citric acid 0 residual sugar 0 chlorides 0 free sulfur dioxide 0 total sulfur dioxide 0 density 0 pH 0 sulphates 0 alcohol 0 quality 0 dtype: int64