1. Downloading the dataset from the repository
!curl -OL https://git.wmi.amu.edu.pl/s434788/ium_434788/raw/branch/master/winequality-red.csv
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   98k    0   98k    0     0   282k      0 --:--:-- --:--:-- --:--:--  281k
import pandas as pd
wine=pd.read_csv('winequality-red.csv')
wine
index | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 7.4 | 0.700 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.99780 | 3.51 | 0.56 | 9.4 | 5 |
1 | 7.8 | 0.880 | 0.00 | 2.6 | 0.098 | 25.0 | 67.0 | 0.99680 | 3.20 | 0.68 | 9.8 | 5 |
2 | 7.8 | 0.760 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.99700 | 3.26 | 0.65 | 9.8 | 5 |
3 | 11.2 | 0.280 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.99800 | 3.16 | 0.58 | 9.8 | 6 |
4 | 7.4 | 0.700 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.99780 | 3.51 | 0.56 | 9.4 | 5 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1594 | 6.2 | 0.600 | 0.08 | 2.0 | 0.090 | 32.0 | 44.0 | 0.99490 | 3.45 | 0.58 | 10.5 | 5 |
1595 | 5.9 | 0.550 | 0.10 | 2.2 | 0.062 | 39.0 | 51.0 | 0.99512 | 3.52 | 0.76 | 11.2 | 6 |
1596 | 6.3 | 0.510 | 0.13 | 2.3 | 0.076 | 29.0 | 40.0 | 0.99574 | 3.42 | 0.75 | 11.0 | 6 |
1597 | 5.9 | 0.645 | 0.12 | 2.0 | 0.075 | 32.0 | 44.0 | 0.99547 | 3.57 | 0.71 | 10.2 | 5 |
1598 | 6.0 | 0.310 | 0.47 | 3.6 | 0.067 | 18.0 | 42.0 | 0.99549 | 3.39 | 0.66 | 11.0 | 6 |
1599 rows × 12 columns
2. Splitting into test/train sets with scikit-learn
I also tried splitting the data into Train:Dev:Test subsets in a 6:2:2 ratio using bash, but found "train_test_split()" more convenient. Ultimately the split will be made into 4 variables, X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42), but here I wanted to keep the convention from the class example.
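As a minimal sketch of the four-variable form mentioned above: the choice of "quality" as the target, the use of every other column as a feature, and the small synthetic frame `demo` are assumptions made for illustration, not the notebook's data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the wine frame (hypothetical values, two features only).
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "alcohol": rng.normal(10.4, 1.0, size=300),
    "pH": rng.normal(3.3, 0.15, size=300),
    "quality": rng.integers(3, 9, size=300),
})

# Assumption: "quality" is the target, every other column is a feature.
X = demo.drop(columns="quality")
y = demo["quality"]

# Features and target are shuffled together, so rows stay aligned.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)
print(X_train.shape, X_test.shape)
```

With the real data, `demo` would simply be replaced by `wine`.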
from sklearn.model_selection import train_test_split
wine_train, wine_test = train_test_split(wine, test_size=360,train_size=959, random_state=1)
wine_test["quality"].value_counts()
5    155
6    149
7     37
4     16
8      2
3      1
Name: quality, dtype: int64
wine_train["quality"].value_counts()
5    400
6    388
7    125
4     30
8     11
3      5
Name: quality, dtype: int64
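For the 6:2:2 Train:Dev:Test split mentioned above, one possible approach is to call `train_test_split()` twice. The sketch below is only an illustration: the helper name `split_train_dev_test`, its default fractions, and the synthetic frame are assumptions. Note that the call in this section uses train_size=959 and test_size=360, which together cover 1319 of the 1599 rows, so 280 rows end up in neither subset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

def split_train_dev_test(df, dev_frac=0.2, test_frac=0.2, seed=1):
    """Hypothetical helper: split df into train/dev/test frames."""
    # First carve off the test set.
    rest, test = train_test_split(df, test_size=test_frac, random_state=seed)
    # dev_frac refers to the full frame, so rescale it for the remainder.
    train, dev = train_test_split(
        rest, test_size=dev_frac / (1 - test_frac), random_state=seed
    )
    return train, dev, test

# Synthetic stand-in for the wine frame (hypothetical values).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "alcohol": rng.normal(10.4, 1.0, size=1000),
    "quality": rng.integers(3, 9, size=1000),
})

train, dev, test = split_train_dev_test(demo)
print(len(train), len(dev), len(test))
```

Every row lands in exactly one of the three subsets, which is the main difference from passing both `train_size` and `test_size` as absolute counts.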
3. Statistics for the sets
from matplotlib import pyplot as plt
import seaborn as sns
3.1. Train set
wine_train
index | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1589 | 6.6 | 0.725 | 0.20 | 7.8 | 0.073 | 29.0 | 79.0 | 0.99770 | 3.29 | 0.54 | 9.2 | 5 |
854 | 9.3 | 0.360 | 0.39 | 1.5 | 0.080 | 41.0 | 55.0 | 0.99652 | 3.47 | 0.73 | 10.9 | 6 |
83 | 7.3 | 0.670 | 0.26 | 1.8 | 0.401 | 16.0 | 51.0 | 0.99690 | 3.16 | 1.14 | 9.4 | 5 |
1106 | 8.2 | 0.230 | 0.42 | 1.9 | 0.069 | 9.0 | 17.0 | 0.99376 | 3.21 | 0.54 | 12.3 | 6 |
650 | 10.7 | 0.430 | 0.39 | 2.2 | 0.106 | 8.0 | 32.0 | 0.99860 | 2.89 | 0.50 | 9.6 | 5 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
526 | 7.3 | 0.365 | 0.49 | 2.5 | 0.088 | 39.0 | 106.0 | 0.99660 | 3.36 | 0.78 | 11.0 | 5 |
583 | 12.0 | 0.280 | 0.49 | 1.9 | 0.074 | 10.0 | 21.0 | 0.99760 | 2.98 | 0.66 | 9.9 | 7 |
975 | 7.2 | 0.410 | 0.30 | 2.1 | 0.083 | 35.0 | 72.0 | 0.99700 | 3.44 | 0.52 | 9.4 | 5 |
566 | 8.7 | 0.700 | 0.24 | 2.5 | 0.226 | 5.0 | 15.0 | 0.99910 | 3.32 | 0.60 | 9.0 | 6 |
1232 | 7.6 | 0.430 | 0.29 | 2.1 | 0.075 | 19.0 | 66.0 | 0.99718 | 3.40 | 0.64 | 9.5 | 5 |
959 rows × 12 columns
wine_train["quality"].value_counts()
5    400
6    388
7    125
4     30
8     11
3      5
Name: quality, dtype: int64
wine_train.describe(include='all')
statistic | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 959.000000 | 959.000000 | 959.000000 | 959.000000 | 959.000000 | 959.000000 | 959.000000 | 959.000000 | 959.00000 | 959.000000 | 959.000000 | 959.000000 |
mean | 8.329093 | 0.526809 | 0.269864 | 2.493743 | 0.088230 | 15.883733 | 45.738790 | 0.996736 | 3.31048 | 0.661481 | 10.433160 | 5.657977 |
std | 1.808394 | 0.175221 | 0.198377 | 1.262329 | 0.050555 | 10.485739 | 31.897095 | 0.001925 | 0.15462 | 0.171639 | 1.084349 | 0.805654 |
min | 4.600000 | 0.120000 | 0.000000 | 0.900000 | 0.012000 | 1.000000 | 6.000000 | 0.990070 | 2.74000 | 0.370000 | 8.400000 | 3.000000 |
25% | 7.100000 | 0.400000 | 0.090000 | 1.900000 | 0.070000 | 7.000000 | 22.000000 | 0.995540 | 3.21000 | 0.550000 | 9.500000 | 5.000000 |
50% | 7.900000 | 0.520000 | 0.250000 | 2.200000 | 0.079000 | 14.000000 | 37.000000 | 0.996770 | 3.31000 | 0.620000 | 10.100000 | 6.000000 |
75% | 9.300000 | 0.635000 | 0.430000 | 2.600000 | 0.090000 | 22.000000 | 61.000000 | 0.997870 | 3.40000 | 0.730000 | 11.100000 | 6.000000 |
max | 15.900000 | 1.330000 | 1.000000 | 15.400000 | 0.610000 | 72.000000 | 278.000000 | 1.003690 | 4.01000 | 2.000000 | 14.900000 | 8.000000 |
Test plot (quality, volatile acidity)
fig = plt.figure(figsize = (10,6))
sns.barplot(x = 'quality', y = 'volatile acidity', data = wine_train)
<matplotlib.axes._subplots.AxesSubplot at 0x7f8a1e433c10>
3.2. Test set
wine_test
index | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
---|---|---|---|---|---|---|---|---|---|---|---|---|
75 | 8.8 | 0.410 | 0.64 | 2.2 | 0.093 | 9.0 | 42.0 | 0.99860 | 3.54 | 0.66 | 10.5 | 5 |
1283 | 8.7 | 0.630 | 0.28 | 2.7 | 0.096 | 17.0 | 69.0 | 0.99734 | 3.26 | 0.63 | 10.2 | 6 |
408 | 10.4 | 0.340 | 0.58 | 3.7 | 0.174 | 6.0 | 16.0 | 0.99700 | 3.19 | 0.70 | 11.3 | 6 |
1281 | 7.1 | 0.460 | 0.20 | 1.9 | 0.077 | 28.0 | 54.0 | 0.99560 | 3.37 | 0.64 | 10.4 | 6 |
1118 | 7.1 | 0.390 | 0.12 | 2.1 | 0.065 | 14.0 | 24.0 | 0.99252 | 3.30 | 0.53 | 13.3 | 6 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1461 | 6.2 | 0.785 | 0.00 | 2.1 | 0.060 | 6.0 | 13.0 | 0.99664 | 3.59 | 0.61 | 10.0 | 4 |
1016 | 8.9 | 0.380 | 0.40 | 2.2 | 0.068 | 12.0 | 28.0 | 0.99486 | 3.27 | 0.75 | 12.6 | 7 |
1412 | 8.2 | 0.240 | 0.34 | 5.1 | 0.062 | 8.0 | 22.0 | 0.99740 | 3.22 | 0.94 | 10.9 | 6 |
424 | 7.7 | 0.960 | 0.20 | 2.0 | 0.047 | 15.0 | 60.0 | 0.99550 | 3.36 | 0.44 | 10.9 | 5 |
120 | 7.3 | 1.070 | 0.09 | 1.7 | 0.178 | 10.0 | 89.0 | 0.99620 | 3.30 | 0.57 | 9.0 | 5 |
360 rows × 12 columns
wine_test["quality"].value_counts()
5    155
6    149
7     37
4     16
8      2
3      1
Name: quality, dtype: int64
wine_test.describe(include='all')
statistic | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 360.000000 | 360.000000 | 360.000000 | 360.000000 | 360.000000 | 360.000000 | 360.000000 | 360.000000 | 360.000000 | 360.000000 | 360.000000 | 360.000000 |
mean | 8.348611 | 0.518764 | 0.275444 | 2.542222 | 0.086114 | 16.093056 | 48.777778 | 0.996747 | 3.301083 | 0.653833 | 10.368889 | 5.586111 |
std | 1.580574 | 0.182554 | 0.182508 | 1.528465 | 0.043445 | 10.421097 | 35.005778 | 0.001792 | 0.145379 | 0.168306 | 1.041729 | 0.767245 |
min | 5.000000 | 0.120000 | 0.000000 | 0.900000 | 0.042000 | 3.000000 | 6.000000 | 0.990070 | 2.870000 | 0.370000 | 8.700000 | 3.000000 |
25% | 7.200000 | 0.380000 | 0.120000 | 1.900000 | 0.070000 | 8.000000 | 23.000000 | 0.995760 | 3.210000 | 0.550000 | 9.500000 | 5.000000 |
50% | 8.000000 | 0.500000 | 0.270000 | 2.150000 | 0.079000 | 14.000000 | 40.000000 | 0.996645 | 3.300000 | 0.620000 | 10.100000 | 6.000000 |
75% | 9.200000 | 0.640000 | 0.420000 | 2.600000 | 0.090000 | 21.000000 | 65.750000 | 0.997683 | 3.390000 | 0.720000 | 11.000000 | 6.000000 |
max | 15.600000 | 1.115000 | 0.790000 | 15.500000 | 0.611000 | 68.000000 | 289.000000 | 1.003690 | 3.750000 | 1.950000 | 14.000000 | 8.000000 |
Test plot (quality, volatile acidity)
fig = plt.figure(figsize = (10,6))
sns.barplot(x = 'quality', y = 'volatile acidity', data = wine_test)
<matplotlib.axes._subplots.AxesSubplot at 0x7f8a0bec96d0>
3.3. Whole set
wine
index | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 7.4 | 0.700 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.99780 | 3.51 | 0.56 | 9.4 | 5 |
1 | 7.8 | 0.880 | 0.00 | 2.6 | 0.098 | 25.0 | 67.0 | 0.99680 | 3.20 | 0.68 | 9.8 | 5 |
2 | 7.8 | 0.760 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.99700 | 3.26 | 0.65 | 9.8 | 5 |
3 | 11.2 | 0.280 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.99800 | 3.16 | 0.58 | 9.8 | 6 |
4 | 7.4 | 0.700 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.99780 | 3.51 | 0.56 | 9.4 | 5 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1594 | 6.2 | 0.600 | 0.08 | 2.0 | 0.090 | 32.0 | 44.0 | 0.99490 | 3.45 | 0.58 | 10.5 | 5 |
1595 | 5.9 | 0.550 | 0.10 | 2.2 | 0.062 | 39.0 | 51.0 | 0.99512 | 3.52 | 0.76 | 11.2 | 6 |
1596 | 6.3 | 0.510 | 0.13 | 2.3 | 0.076 | 29.0 | 40.0 | 0.99574 | 3.42 | 0.75 | 11.0 | 6 |
1597 | 5.9 | 0.645 | 0.12 | 2.0 | 0.075 | 32.0 | 44.0 | 0.99547 | 3.57 | 0.71 | 10.2 | 5 |
1598 | 6.0 | 0.310 | 0.47 | 3.6 | 0.067 | 18.0 | 42.0 | 0.99549 | 3.39 | 0.66 | 11.0 | 6 |
1599 rows × 12 columns
wine["quality"].value_counts()
5    681
6    638
7    199
4     53
8     18
3     10
Name: quality, dtype: int64
wine.describe(include='all')
statistic | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 |
mean | 8.319637 | 0.527821 | 0.270976 | 2.538806 | 0.087467 | 15.874922 | 46.467792 | 0.996747 | 3.311113 | 0.658149 | 10.422983 | 5.636023 |
std | 1.741096 | 0.179060 | 0.194801 | 1.409928 | 0.047065 | 10.460157 | 32.895324 | 0.001887 | 0.154386 | 0.169507 | 1.065668 | 0.807569 |
min | 4.600000 | 0.120000 | 0.000000 | 0.900000 | 0.012000 | 1.000000 | 6.000000 | 0.990070 | 2.740000 | 0.330000 | 8.400000 | 3.000000 |
25% | 7.100000 | 0.390000 | 0.090000 | 1.900000 | 0.070000 | 7.000000 | 22.000000 | 0.995600 | 3.210000 | 0.550000 | 9.500000 | 5.000000 |
50% | 7.900000 | 0.520000 | 0.260000 | 2.200000 | 0.079000 | 14.000000 | 38.000000 | 0.996750 | 3.310000 | 0.620000 | 10.200000 | 6.000000 |
75% | 9.200000 | 0.640000 | 0.420000 | 2.600000 | 0.090000 | 21.000000 | 62.000000 | 0.997835 | 3.400000 | 0.730000 | 11.100000 | 6.000000 |
max | 15.900000 | 1.580000 | 1.000000 | 15.500000 | 0.611000 | 72.000000 | 289.000000 | 1.003690 | 4.010000 | 2.000000 | 14.900000 | 8.000000 |
Test plot (quality, volatile acidity)
fig = plt.figure(figsize = (10,6))
sns.barplot(x = 'quality', y = 'volatile acidity', data = wine)
<matplotlib.axes._subplots.AxesSubplot at 0x7f8a0be6e9d0>
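The raw `value_counts()` outputs above are hard to compare because the three sets differ in size. A small sketch of one way to put them side by side as per-class shares follows; the counts are copied from the outputs in this section, while the frame names `counts` and `shares` are my own.

```python
import pandas as pd

# Counts copied from the value_counts() outputs shown in this notebook.
counts = pd.DataFrame({
    "train": {5: 400, 6: 388, 7: 125, 4: 30, 8: 11, 3: 5},
    "test": {5: 155, 6: 149, 7: 37, 4: 16, 8: 2, 3: 1},
    "whole": {5: 681, 6: 638, 7: 199, 4: 53, 8: 18, 3: 10},
}).sort_index()

# Divide each column by its total to get the share of every quality class.
shares = counts / counts.sum()
print(shares.round(3))
```

On the live frames the same shares should be available directly via `wine_train["quality"].value_counts(normalize=True)`.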
4. Normalization
Normalizing the 'quality' column to values from 0 to 20. This is not necessary, but was done for demonstration purposes.
wine["quality"]=((wine["quality"]-wine["quality"].min())/(wine["quality"].max()-wine["quality"].min()))*20
wine
index | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 7.4 | 0.700 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.99780 | 3.51 | 0.56 | 9.4 | 8.0 |
1 | 7.8 | 0.880 | 0.00 | 2.6 | 0.098 | 25.0 | 67.0 | 0.99680 | 3.20 | 0.68 | 9.8 | 8.0 |
2 | 7.8 | 0.760 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.99700 | 3.26 | 0.65 | 9.8 | 8.0 |
3 | 11.2 | 0.280 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.99800 | 3.16 | 0.58 | 9.8 | 12.0 |
4 | 7.4 | 0.700 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.99780 | 3.51 | 0.56 | 9.4 | 8.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1594 | 6.2 | 0.600 | 0.08 | 2.0 | 0.090 | 32.0 | 44.0 | 0.99490 | 3.45 | 0.58 | 10.5 | 8.0 |
1595 | 5.9 | 0.550 | 0.10 | 2.2 | 0.062 | 39.0 | 51.0 | 0.99512 | 3.52 | 0.76 | 11.2 | 12.0 |
1596 | 6.3 | 0.510 | 0.13 | 2.3 | 0.076 | 29.0 | 40.0 | 0.99574 | 3.42 | 0.75 | 11.0 | 12.0 |
1597 | 5.9 | 0.645 | 0.12 | 2.0 | 0.075 | 32.0 | 44.0 | 0.99547 | 3.57 | 0.71 | 10.2 | 8.0 |
1598 | 6.0 | 0.310 | 0.47 | 3.6 | 0.067 | 18.0 | 42.0 | 0.99549 | 3.39 | 0.66 | 11.0 | 12.0 |
1599 rows × 12 columns
wine["quality"].value_counts()
8.0     681
12.0    638
16.0    199
4.0      53
20.0     18
0.0      10
Name: quality, dtype: int64
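The rescaling used in this section is plain min-max scaling, (x - min) / (max - min) * 20. A minimal sketch of it as a reusable function is shown below; the helper name `min_max_scale`, the `new_max` parameter, and the handling of a constant column are assumptions of mine, not part of the notebook.

```python
import pandas as pd

def min_max_scale(series, new_max=20.0):
    """Rescale a numeric Series linearly so that min -> 0 and max -> new_max."""
    span = series.max() - series.min()
    if span == 0:
        # A constant column has no range to rescale; map everything to 0.
        return pd.Series(0.0, index=series.index, name=series.name)
    return (series - series.min()) / span * new_max

# The six quality grades that occur in the dataset (3 to 8).
quality = pd.Series([3, 4, 5, 6, 7, 8], name="quality")
scaled = min_max_scale(quality)
print(scaled.tolist())  # [0.0, 4.0, 8.0, 12.0, 16.0, 20.0]
```

This matches the table above, where grade 5 becomes 8.0 and grade 6 becomes 12.0. As far as I can tell, the cell in this section rescales only `wine`, so the `wine_train` and `wine_test` frames created earlier keep the original 3 to 8 scale unless they are rescaled as well.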
5. Removing artifacts
Fortunately, my dataset contains neither empty lines nor examples with invalid values.
# Look for an empty line:
! grep -P "^$" -n winequality-red.csv
Searching for "NA" values: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html
wine.isnull().sum()
fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
dtype: int64
wine.dropna(inplace=True)
wine
index | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 7.4 | 0.700 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.99780 | 3.51 | 0.56 | 9.4 | 8.0 |
1 | 7.8 | 0.880 | 0.00 | 2.6 | 0.098 | 25.0 | 67.0 | 0.99680 | 3.20 | 0.68 | 9.8 | 8.0 |
2 | 7.8 | 0.760 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.99700 | 3.26 | 0.65 | 9.8 | 8.0 |
3 | 11.2 | 0.280 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.99800 | 3.16 | 0.58 | 9.8 | 12.0 |
4 | 7.4 | 0.700 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.99780 | 3.51 | 0.56 | 9.4 | 8.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1594 | 6.2 | 0.600 | 0.08 | 2.0 | 0.090 | 32.0 | 44.0 | 0.99490 | 3.45 | 0.58 | 10.5 | 8.0 |
1595 | 5.9 | 0.550 | 0.10 | 2.2 | 0.062 | 39.0 | 51.0 | 0.99512 | 3.52 | 0.76 | 11.2 | 12.0 |
1596 | 6.3 | 0.510 | 0.13 | 2.3 | 0.076 | 29.0 | 40.0 | 0.99574 | 3.42 | 0.75 | 11.0 | 12.0 |
1597 | 5.9 | 0.645 | 0.12 | 2.0 | 0.075 | 32.0 | 44.0 | 0.99547 | 3.57 | 0.71 | 10.2 | 8.0 |
1598 | 6.0 | 0.310 | 0.47 | 3.6 | 0.067 | 18.0 | 42.0 | 0.99549 | 3.39 | 0.66 | 11.0 | 12.0 |
1599 rows × 12 columns
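Since the real file turned out to be clean, the checks above never actually fire. The sketch below runs the same three steps (empty-line search, NA count, `dropna`) on a tiny in-memory CSV with defects planted on purpose. The rows are hypothetical, and the final "no negative values" rule is only an assumed example of what an invalid value might look like.

```python
import io
import pandas as pd

# Tiny in-memory CSV standing in for winequality-red.csv (hypothetical rows,
# with one empty line and one missing value planted on purpose).
raw = "pH,alcohol,quality\n3.51,9.4,5\n\n3.20,,5\n3.26,9.8,6\n"

# Empty lines: the same idea as grep "^$" -n, done on the raw text.
blank_lines = [n for n, line in enumerate(raw.splitlines(), start=1) if line == ""]

# read_csv skips blank lines by default, so they never reach the frame.
df = pd.read_csv(io.StringIO(raw))

# Missing values per column, then drop the affected rows.
missing_per_column = df.isnull().sum()
cleaned = df.dropna()

# Assumed validity rule: none of these measurements should be negative.
has_negative = bool((cleaned < 0).any().any())

print(blank_lines, missing_per_column.to_dict(), len(cleaned), has_negative)
```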