ium_434788/IUM_1_434788.ipynb


1. Downloading the dataset from the repository

!curl -OL https://git.wmi.amu.edu.pl/s434788/ium_434788/raw/branch/master/winequality-red.csv
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   98k    0   98k    0     0   282k      0 --:--:-- --:--:-- --:--:--  281k
import pandas as pd
wine=pd.read_csv('winequality-red.csv')
wine
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
0 7.4 0.700 0.00 1.9 0.076 11.0 34.0 0.99780 3.51 0.56 9.4 5
1 7.8 0.880 0.00 2.6 0.098 25.0 67.0 0.99680 3.20 0.68 9.8 5
2 7.8 0.760 0.04 2.3 0.092 15.0 54.0 0.99700 3.26 0.65 9.8 5
3 11.2 0.280 0.56 1.9 0.075 17.0 60.0 0.99800 3.16 0.58 9.8 6
4 7.4 0.700 0.00 1.9 0.076 11.0 34.0 0.99780 3.51 0.56 9.4 5
... ... ... ... ... ... ... ... ... ... ... ... ...
1594 6.2 0.600 0.08 2.0 0.090 32.0 44.0 0.99490 3.45 0.58 10.5 5
1595 5.9 0.550 0.10 2.2 0.062 39.0 51.0 0.99512 3.52 0.76 11.2 6
1596 6.3 0.510 0.13 2.3 0.076 29.0 40.0 0.99574 3.42 0.75 11.0 6
1597 5.9 0.645 0.12 2.0 0.075 32.0 44.0 0.99547 3.57 0.71 10.2 5
1598 6.0 0.310 0.47 3.6 0.067 18.0 42.0 0.99549 3.39 0.66 11.0 6

1599 rows × 12 columns

2. Splitting into test/train sets with scikit-learn

I also tried splitting the data into Train:Dev:Test subsets in a 6:2:2 ratio using bash, but found "train_test_split()" more convenient. Eventually the split will produce four variables, X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42), but for now I wanted to keep the convention from the example shown in the exercises.

from sklearn.model_selection import train_test_split

wine_train, wine_test = train_test_split(wine, test_size=360,train_size=959, random_state=1)
wine_test["quality"].value_counts()
5    155
6    149
7     37
4     16
8      2
3      1
Name: quality, dtype: int64
wine_train["quality"].value_counts()
5    400
6    388
7    125
4     30
8     11
3      5
Name: quality, dtype: int64
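
The 6:2:2 Train:Dev:Test split mentioned in this section can also be obtained without bash by calling train_test_split() twice. The following is only a sketch under stated assumptions: `demo` is a synthetic stand-in frame with made-up values and the same 1599 rows as the real file, and the names `demo`, `rest`, `demo_dev`, `X` and `y` are illustrative, not part of the original notebook. Note also that the call above, with test_size=360 and train_size=959, leaves 1599 - 959 - 360 = 280 rows in neither subset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for winequality-red.csv (made-up values, same row count).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "alcohol": rng.uniform(8.4, 14.9, size=1599),
    "quality": rng.integers(3, 9, size=1599),
})

# Step 1: hold out 20% of all rows as the test set.
rest, demo_test = train_test_split(demo, test_size=0.2, random_state=1)
# Step 2: take 25% of the remaining 80% (that is, 20% of the total) as the dev set.
demo_train, demo_dev = train_test_split(rest, test_size=0.25, random_state=1)

print(len(demo_train), len(demo_dev), len(demo_test))

# The four-variable form: features X and target y are split together in one call,
# so the rows of X_train and y_train stay aligned.
X = demo.drop(columns="quality")
y = demo["quality"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)
```

With 1599 rows this two-step split should give 959 train, 320 dev and 320 test rows, because a fractional test_size is rounded up to a whole number of rows.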

3. Statistics for the sets

from matplotlib import pyplot as plt
import seaborn as sns

3.1. Train set

wine_train
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
1589 6.6 0.725 0.20 7.8 0.073 29.0 79.0 0.99770 3.29 0.54 9.2 5
854 9.3 0.360 0.39 1.5 0.080 41.0 55.0 0.99652 3.47 0.73 10.9 6
83 7.3 0.670 0.26 1.8 0.401 16.0 51.0 0.99690 3.16 1.14 9.4 5
1106 8.2 0.230 0.42 1.9 0.069 9.0 17.0 0.99376 3.21 0.54 12.3 6
650 10.7 0.430 0.39 2.2 0.106 8.0 32.0 0.99860 2.89 0.50 9.6 5
... ... ... ... ... ... ... ... ... ... ... ... ...
526 7.3 0.365 0.49 2.5 0.088 39.0 106.0 0.99660 3.36 0.78 11.0 5
583 12.0 0.280 0.49 1.9 0.074 10.0 21.0 0.99760 2.98 0.66 9.9 7
975 7.2 0.410 0.30 2.1 0.083 35.0 72.0 0.99700 3.44 0.52 9.4 5
566 8.7 0.700 0.24 2.5 0.226 5.0 15.0 0.99910 3.32 0.60 9.0 6
1232 7.6 0.430 0.29 2.1 0.075 19.0 66.0 0.99718 3.40 0.64 9.5 5

959 rows × 12 columns

wine_train["quality"].value_counts()
5    400
6    388
7    125
4     30
8     11
3      5
Name: quality, dtype: int64
wine_train.describe(include='all')
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
count 959.000000 959.000000 959.000000 959.000000 959.000000 959.000000 959.000000 959.000000 959.00000 959.000000 959.000000 959.000000
mean 8.329093 0.526809 0.269864 2.493743 0.088230 15.883733 45.738790 0.996736 3.31048 0.661481 10.433160 5.657977
std 1.808394 0.175221 0.198377 1.262329 0.050555 10.485739 31.897095 0.001925 0.15462 0.171639 1.084349 0.805654
min 4.600000 0.120000 0.000000 0.900000 0.012000 1.000000 6.000000 0.990070 2.74000 0.370000 8.400000 3.000000
25% 7.100000 0.400000 0.090000 1.900000 0.070000 7.000000 22.000000 0.995540 3.21000 0.550000 9.500000 5.000000
50% 7.900000 0.520000 0.250000 2.200000 0.079000 14.000000 37.000000 0.996770 3.31000 0.620000 10.100000 6.000000
75% 9.300000 0.635000 0.430000 2.600000 0.090000 22.000000 61.000000 0.997870 3.40000 0.730000 11.100000 6.000000
max 15.900000 1.330000 1.000000 15.400000 0.610000 72.000000 278.000000 1.003690 4.01000 2.000000 14.900000 8.000000

Test plot (quality, volatile acidity)

fig = plt.figure(figsize = (10,6))
sns.barplot(x = 'quality', y = 'volatile acidity', data = wine_train)
<matplotlib.axes._subplots.AxesSubplot at 0x7f8a1e433c10>

3.2. Test set

wine_test
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
75 8.8 0.410 0.64 2.2 0.093 9.0 42.0 0.99860 3.54 0.66 10.5 5
1283 8.7 0.630 0.28 2.7 0.096 17.0 69.0 0.99734 3.26 0.63 10.2 6
408 10.4 0.340 0.58 3.7 0.174 6.0 16.0 0.99700 3.19 0.70 11.3 6
1281 7.1 0.460 0.20 1.9 0.077 28.0 54.0 0.99560 3.37 0.64 10.4 6
1118 7.1 0.390 0.12 2.1 0.065 14.0 24.0 0.99252 3.30 0.53 13.3 6
... ... ... ... ... ... ... ... ... ... ... ... ...
1461 6.2 0.785 0.00 2.1 0.060 6.0 13.0 0.99664 3.59 0.61 10.0 4
1016 8.9 0.380 0.40 2.2 0.068 12.0 28.0 0.99486 3.27 0.75 12.6 7
1412 8.2 0.240 0.34 5.1 0.062 8.0 22.0 0.99740 3.22 0.94 10.9 6
424 7.7 0.960 0.20 2.0 0.047 15.0 60.0 0.99550 3.36 0.44 10.9 5
120 7.3 1.070 0.09 1.7 0.178 10.0 89.0 0.99620 3.30 0.57 9.0 5

360 rows × 12 columns

wine_test["quality"].value_counts()
5    155
6    149
7     37
4     16
8      2
3      1
Name: quality, dtype: int64
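
Because the train and test sets differ in size (959 vs. 360 rows), the raw counts are hard to compare directly. As a small sketch, the counts printed by the two value_counts() calls can be turned into percentage shares; the names `train_counts`, `test_counts` and `shares` are illustrative only. On the live frames, `value_counts(normalize=True)` would give the same fractions without copying numbers by hand.

```python
import pandas as pd

# Counts copied from the value_counts() outputs for the train and test sets.
train_counts = pd.Series({5: 400, 6: 388, 7: 125, 4: 30, 8: 11, 3: 5})
test_counts = pd.Series({5: 155, 6: 149, 7: 37, 4: 16, 8: 2, 3: 1})

# Convert raw counts to percentage shares so sets of different size are comparable.
shares = pd.DataFrame({
    "train_%": 100 * train_counts / train_counts.sum(),
    "test_%": 100 * test_counts / test_counts.sum(),
}).sort_index()

print(shares.round(2))
```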
wine_test.describe(include='all')
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
count 360.000000 360.000000 360.000000 360.000000 360.000000 360.000000 360.000000 360.000000 360.000000 360.000000 360.000000 360.000000
mean 8.348611 0.518764 0.275444 2.542222 0.086114 16.093056 48.777778 0.996747 3.301083 0.653833 10.368889 5.586111
std 1.580574 0.182554 0.182508 1.528465 0.043445 10.421097 35.005778 0.001792 0.145379 0.168306 1.041729 0.767245
min 5.000000 0.120000 0.000000 0.900000 0.042000 3.000000 6.000000 0.990070 2.870000 0.370000 8.700000 3.000000
25% 7.200000 0.380000 0.120000 1.900000 0.070000 8.000000 23.000000 0.995760 3.210000 0.550000 9.500000 5.000000
50% 8.000000 0.500000 0.270000 2.150000 0.079000 14.000000 40.000000 0.996645 3.300000 0.620000 10.100000 6.000000
75% 9.200000 0.640000 0.420000 2.600000 0.090000 21.000000 65.750000 0.997683 3.390000 0.720000 11.000000 6.000000
max 15.600000 1.115000 0.790000 15.500000 0.611000 68.000000 289.000000 1.003690 3.750000 1.950000 14.000000 8.000000

Test plot (quality, volatile acidity)

fig = plt.figure(figsize = (10,6))
sns.barplot(x = 'quality', y = 'volatile acidity', data = wine_test)
<matplotlib.axes._subplots.AxesSubplot at 0x7f8a0bec96d0>

3.3. Whole dataset

wine
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
0 7.4 0.700 0.00 1.9 0.076 11.0 34.0 0.99780 3.51 0.56 9.4 5
1 7.8 0.880 0.00 2.6 0.098 25.0 67.0 0.99680 3.20 0.68 9.8 5
2 7.8 0.760 0.04 2.3 0.092 15.0 54.0 0.99700 3.26 0.65 9.8 5
3 11.2 0.280 0.56 1.9 0.075 17.0 60.0 0.99800 3.16 0.58 9.8 6
4 7.4 0.700 0.00 1.9 0.076 11.0 34.0 0.99780 3.51 0.56 9.4 5
... ... ... ... ... ... ... ... ... ... ... ... ...
1594 6.2 0.600 0.08 2.0 0.090 32.0 44.0 0.99490 3.45 0.58 10.5 5
1595 5.9 0.550 0.10 2.2 0.062 39.0 51.0 0.99512 3.52 0.76 11.2 6
1596 6.3 0.510 0.13 2.3 0.076 29.0 40.0 0.99574 3.42 0.75 11.0 6
1597 5.9 0.645 0.12 2.0 0.075 32.0 44.0 0.99547 3.57 0.71 10.2 5
1598 6.0 0.310 0.47 3.6 0.067 18.0 42.0 0.99549 3.39 0.66 11.0 6

1599 rows × 12 columns

wine["quality"].value_counts()
5    681
6    638
7    199
4     53
8     18
3     10
Name: quality, dtype: int64
wine.describe(include='all')
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
count 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000
mean 8.319637 0.527821 0.270976 2.538806 0.087467 15.874922 46.467792 0.996747 3.311113 0.658149 10.422983 5.636023
std 1.741096 0.179060 0.194801 1.409928 0.047065 10.460157 32.895324 0.001887 0.154386 0.169507 1.065668 0.807569
min 4.600000 0.120000 0.000000 0.900000 0.012000 1.000000 6.000000 0.990070 2.740000 0.330000 8.400000 3.000000
25% 7.100000 0.390000 0.090000 1.900000 0.070000 7.000000 22.000000 0.995600 3.210000 0.550000 9.500000 5.000000
50% 7.900000 0.520000 0.260000 2.200000 0.079000 14.000000 38.000000 0.996750 3.310000 0.620000 10.200000 6.000000
75% 9.200000 0.640000 0.420000 2.600000 0.090000 21.000000 62.000000 0.997835 3.400000 0.730000 11.100000 6.000000
max 15.900000 1.580000 1.000000 15.500000 0.611000 72.000000 289.000000 1.003690 4.010000 2.000000 14.900000 8.000000

Test plot (quality, volatile acidity)

fig = plt.figure(figsize = (10,6))
sns.barplot(x = 'quality', y = 'volatile acidity', data = wine)
<matplotlib.axes._subplots.AxesSubplot at 0x7f8a0be6e9d0>

4. Normalization

Normalization of the 'quality' column to the range 0 to 20. It is not necessary; it is included for demonstration purposes only.

wine["quality"]=((wine["quality"]-wine["quality"].min())/(wine["quality"].max()-wine["quality"].min()))*20
wine
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
0 7.4 0.700 0.00 1.9 0.076 11.0 34.0 0.99780 3.51 0.56 9.4 8.0
1 7.8 0.880 0.00 2.6 0.098 25.0 67.0 0.99680 3.20 0.68 9.8 8.0
2 7.8 0.760 0.04 2.3 0.092 15.0 54.0 0.99700 3.26 0.65 9.8 8.0
3 11.2 0.280 0.56 1.9 0.075 17.0 60.0 0.99800 3.16 0.58 9.8 12.0
4 7.4 0.700 0.00 1.9 0.076 11.0 34.0 0.99780 3.51 0.56 9.4 8.0
... ... ... ... ... ... ... ... ... ... ... ... ...
1594 6.2 0.600 0.08 2.0 0.090 32.0 44.0 0.99490 3.45 0.58 10.5 8.0
1595 5.9 0.550 0.10 2.2 0.062 39.0 51.0 0.99512 3.52 0.76 11.2 12.0
1596 6.3 0.510 0.13 2.3 0.076 29.0 40.0 0.99574 3.42 0.75 11.0 12.0
1597 5.9 0.645 0.12 2.0 0.075 32.0 44.0 0.99547 3.57 0.71 10.2 8.0
1598 6.0 0.310 0.47 3.6 0.067 18.0 42.0 0.99549 3.39 0.66 11.0 12.0

1599 rows × 12 columns

wine["quality"].value_counts()
8.0     681
12.0    638
16.0    199
4.0      53
20.0     18
0.0      10
Name: quality, dtype: int64
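
The expression used in this section is a min-max rescaling, x' = (x - min) / (max - min) * 20. With quality ranging from 3 to 8, a score of 5 maps to (5 - 3) / (8 - 3) * 20 = 8.0, which agrees with the output above. A minimal sketch that wraps the same formula in a reusable helper follows; the function name `min_max_scale` and its `upper` parameter are hypothetical and not part of the original notebook.

```python
import pandas as pd

def min_max_scale(values: pd.Series, upper: float = 20.0) -> pd.Series:
    """Linearly rescale values so the minimum maps to 0 and the maximum to `upper`."""
    lo, hi = values.min(), values.max()
    if hi == lo:
        # A constant column would cause a division by zero.
        raise ValueError("cannot rescale a constant column")
    return (values - lo) / (hi - lo) * upper

# The six quality scores that occur in the dataset.
quality = pd.Series([3, 4, 5, 6, 7, 8])
scaled = min_max_scale(quality)
print(scaled.tolist())
```

The guard for a constant column is an addition for safety; the one-line expression in the notebook does not need it because quality takes several distinct values.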

5. Removing artifacts

Fortunately, my dataset contains neither empty lines nor examples with invalid values.

# Let's look for an empty line:
! grep -P "^$" -n winequality-red.csv
wine.isnull().sum()
fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
dtype: int64
wine.dropna(inplace=True) 
wine
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
0 7.4 0.700 0.00 1.9 0.076 11.0 34.0 0.99780 3.51 0.56 9.4 8.0
1 7.8 0.880 0.00 2.6 0.098 25.0 67.0 0.99680 3.20 0.68 9.8 8.0
2 7.8 0.760 0.04 2.3 0.092 15.0 54.0 0.99700 3.26 0.65 9.8 8.0
3 11.2 0.280 0.56 1.9 0.075 17.0 60.0 0.99800 3.16 0.58 9.8 12.0
4 7.4 0.700 0.00 1.9 0.076 11.0 34.0 0.99780 3.51 0.56 9.4 8.0
... ... ... ... ... ... ... ... ... ... ... ... ...
1594 6.2 0.600 0.08 2.0 0.090 32.0 44.0 0.99490 3.45 0.58 10.5 8.0
1595 5.9 0.550 0.10 2.2 0.062 39.0 51.0 0.99512 3.52 0.76 11.2 12.0
1596 6.3 0.510 0.13 2.3 0.076 29.0 40.0 0.99574 3.42 0.75 11.0 12.0
1597 5.9 0.645 0.12 2.0 0.075 32.0 44.0 0.99547 3.57 0.71 10.2 8.0
1598 6.0 0.310 0.47 3.6 0.067 18.0 42.0 0.99549 3.39 0.66 11.0 12.0

1599 rows × 12 columns
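
The checks from this section can be gathered in one place. The sketch below runs on a tiny made-up CSV held in memory, with a blank line, a missing value and a repeated row injected on purpose; the names `raw`, `sample`, `empty_lines` and `cleaned` are illustrative only. The duplicate check is an addition that the original notebook does not perform: rows 0 and 4 in the tables above are identical, so it may report repeated rows on the real file, and whether to drop them is left open here.

```python
import io
import pandas as pd

# Made-up three-column sample with a blank line, a missing value and a repeated row.
raw = (
    "fixed acidity,pH,quality\n"
    "7.4,3.51,5\n"
    "\n"
    "7.8,,5\n"
    "7.4,3.51,5\n"
)

# Python counterpart of `grep -P "^$" -n`: 1-based numbers of the empty lines.
empty_lines = [n for n, line in enumerate(raw.splitlines(), start=1) if line == ""]

# read_csv skips blank lines by default (skip_blank_lines=True).
sample = pd.read_csv(io.StringIO(raw))

# Missing values per column, as in wine.isnull().sum().
missing_per_column = sample.isnull().sum()
# Number of rows that repeat an earlier row exactly.
duplicate_rows = int(sample.duplicated().sum())
# Drop incomplete rows first, then exact repeats.
cleaned = sample.dropna().drop_duplicates()

print(empty_lines, duplicate_rows, len(cleaned))
```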