ium_434788/Jupyter_Notebooks/Zadanie_02_434788.ipynb
2021-05-13 12:23:15 +02:00

150 KiB
Raw Permalink Blame History

1. Pobranie zbioru danych z Repozytorium

!curl -OL https://git.wmi.amu.edu.pl/s434788/ium_434788/raw/branch/master/winequality-red.csv
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   98k    0   98k    0     0  74502      0 --:--:--  0:00:01 --:--:-- 74502
import pandas as pd
wine=pd.read_csv('winequality-red.csv')
wine
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
0 7.4 0.700 0.00 1.9 0.076 11.0 34.0 0.99780 3.51 0.56 9.4 5
1 7.8 0.880 0.00 2.6 0.098 25.0 67.0 0.99680 3.20 0.68 9.8 5
2 7.8 0.760 0.04 2.3 0.092 15.0 54.0 0.99700 3.26 0.65 9.8 5
3 11.2 0.280 0.56 1.9 0.075 17.0 60.0 0.99800 3.16 0.58 9.8 6
4 7.4 0.700 0.00 1.9 0.076 11.0 34.0 0.99780 3.51 0.56 9.4 5
... ... ... ... ... ... ... ... ... ... ... ... ...
1594 6.2 0.600 0.08 2.0 0.090 32.0 44.0 0.99490 3.45 0.58 10.5 5
1595 5.9 0.550 0.10 2.2 0.062 39.0 51.0 0.99512 3.52 0.76 11.2 6
1596 6.3 0.510 0.13 2.3 0.076 29.0 40.0 0.99574 3.42 0.75 11.0 6
1597 5.9 0.645 0.12 2.0 0.075 32.0 44.0 0.99547 3.57 0.71 10.2 5
1598 6.0 0.310 0.47 3.6 0.067 18.0 42.0 0.99549 3.39 0.66 11.0 6

1599 rows × 12 columns

2. Podział na zbiory test/train przy pomocy SciKit + (poprawka z 26.03.2021 przy pomocy basha)

2.1 SciKit

Próbowałem również podzielić na podzbiory Train:Dev:Test 6:2:2 Przy pomocy basha ale uznałem, że wygodniejsze jest korzystanie z "train_test_split()". Docelowo podział będzie dokonywany na 4 zmienne X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42), jednak chciałem zachować konwencje z przykładu, z ćwiczeń.

from sklearn.model_selection import train_test_split

wine_train, wine_test = train_test_split(wine, test_size=360,train_size=959, random_state=1)
wine_test["quality"].value_counts()
5    155
6    149
7     37
4     16
8      2
3      1
Name: quality, dtype: int64
wine_train["quality"].value_counts()
5    400
6    388
7    125
4     30
8     11
3      5
Name: quality, dtype: int64

2.2 Bash

!head -n 1 winequality-red.csv > header.csv
!tail -n +2 winequality-red.csv | shuf > data.shuffled

!head -n 266 data.shuffled > wine.data.test
!head -n 532 data.shuffled | tail -n 266 > wine.data.dev
!tail -n +333 data.shuffled > wine.data.train

!cat header.csv wine.data.test > test.csv
!cat header.csv wine.data.dev > dev.csv
!cat header.csv wine.data.train > train.csv
!wc -l test.csv
!wc -l dev.csv
!wc -l train.csv
267 test.csv
267 dev.csv
1268 train.csv
wine_test_bash=pd.read_csv('test.csv')
wine_dev_bash=pd.read_csv('dev.csv')
wine_train_bash=pd.read_csv('train.csv')

3. Statystyki dla zbiorów

from matplotlib import pyplot as plt
import seaborn as sns

3.1. Zbiór Train (bash)

wine_train_bash
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
0 10.0 0.380 0.38 1.6 0.169 27.0 90.0 0.99914 3.15 0.65 8.5 5
1 6.7 0.460 0.24 1.7 0.077 18.0 34.0 0.99480 3.39 0.60 10.6 6
2 7.2 0.695 0.13 2.0 0.076 12.0 20.0 0.99546 3.29 0.54 10.1 5
3 12.5 0.600 0.49 4.3 0.100 5.0 14.0 1.00100 3.25 0.74 11.9 6
4 8.3 0.560 0.22 2.4 0.082 10.0 86.0 0.99830 3.37 0.62 9.5 5
... ... ... ... ... ... ... ... ... ... ... ... ...
1262 7.8 0.560 0.12 2.0 0.082 7.0 28.0 0.99700 3.37 0.50 9.4 6
1263 5.8 0.680 0.02 1.8 0.087 21.0 94.0 0.99440 3.54 0.52 10.0 5
1264 7.7 0.630 0.08 1.9 0.076 15.0 27.0 0.99670 3.32 0.54 9.5 6
1265 7.1 0.600 0.00 1.8 0.074 16.0 34.0 0.99720 3.47 0.70 9.9 6
1266 10.4 0.610 0.49 2.1 0.200 5.0 16.0 0.99940 3.16 0.63 8.4 3

1267 rows × 12 columns

wine_train_bash["quality"].value_counts()
5    550
6    498
7    157
4     39
8     15
3      8
Name: quality, dtype: int64
wine_train_bash.describe(include='all')
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
count 1267.000000 1267.000000 1267.000000 1267.000000 1267.000000 1267.000000 1267.000000 1267.000000 1267.000000 1267.000000 1267.000000 1267.000000
mean 8.344199 0.525888 0.273891 2.574033 0.087419 15.889897 46.146014 0.996799 3.310016 0.655730 10.396725 5.632991
std 1.789253 0.177804 0.196141 1.453463 0.046754 10.603674 32.734818 0.001893 0.154047 0.166206 1.042353 0.806931
min 4.700000 0.120000 0.000000 0.900000 0.012000 1.000000 6.000000 0.990070 2.740000 0.370000 8.400000 3.000000
25% 7.100000 0.390000 0.090000 1.900000 0.071000 7.000000 22.000000 0.995660 3.210000 0.550000 9.500000 5.000000
50% 7.900000 0.520000 0.260000 2.200000 0.080000 13.000000 37.000000 0.996800 3.310000 0.620000 10.200000 6.000000
75% 9.300000 0.640000 0.430000 2.600000 0.090000 22.000000 62.000000 0.997870 3.400000 0.730000 11.000000 6.000000
max 15.900000 1.580000 1.000000 15.500000 0.611000 72.000000 278.000000 1.003690 4.010000 2.000000 14.900000 8.000000

Testowy Wykres (quality, volatile acidity)

fig = plt.figure(figsize = (10,6))
sns.barplot(x = 'quality', y = 'volatile acidity', data = wine_train_bash)
<matplotlib.axes._subplots.AxesSubplot at 0x7f2504f98950>

3.2. Zbiór Test (bash)

wine_test_bash
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
0 7.1 0.60 0.01 2.3 0.079 24.0 37.0 0.99514 3.40 0.61 10.9 6
1 7.8 0.61 0.29 1.6 0.114 9.0 29.0 0.99740 3.26 1.56 9.1 5
2 7.1 0.63 0.06 2.0 0.083 8.0 29.0 0.99855 3.67 0.73 9.6 5
3 9.1 0.30 0.41 2.0 0.068 10.0 24.0 0.99523 3.27 0.85 11.7 7
4 9.0 0.46 0.31 2.8 0.093 19.0 98.0 0.99815 3.32 0.63 9.5 6
... ... ... ... ... ... ... ... ... ... ... ... ...
261 7.2 0.60 0.04 2.5 0.076 18.0 88.0 0.99745 3.53 0.55 9.5 5
262 8.4 0.67 0.19 2.2 0.093 11.0 75.0 0.99736 3.20 0.59 9.2 4
263 8.8 0.61 0.19 4.0 0.094 30.0 69.0 0.99787 3.22 0.50 10.0 6
264 9.6 0.68 0.24 2.2 0.087 5.0 28.0 0.99880 3.14 0.60 10.2 5
265 10.5 0.43 0.35 3.3 0.092 24.0 70.0 0.99798 3.21 0.69 10.5 6

266 rows × 12 columns

wine_test_bash["quality"].value_counts()
6    109
5    108
7     37
4      8
8      2
3      2
Name: quality, dtype: int64
wine_test_bash.describe(include='all')
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
count 266.000000 266.000000 266.000000 266.000000 266.000000 266.000000 266.000000 266.000000 266.000000 266.000000 266.000000 266.000000
mean 8.245865 0.529455 0.266203 2.373308 0.086823 15.840226 47.447368 0.996499 3.313195 0.676241 10.569925 5.665414
std 1.526175 0.181583 0.191968 1.005345 0.046159 10.163096 34.610379 0.001772 0.158871 0.187786 1.149728 0.808497
min 4.600000 0.180000 0.000000 1.200000 0.039000 1.000000 7.000000 0.990840 2.880000 0.390000 9.000000 3.000000
25% 7.200000 0.392500 0.100000 1.900000 0.068000 7.000000 22.250000 0.995318 3.200000 0.560000 9.500000 5.000000
50% 8.000000 0.520000 0.260000 2.100000 0.078000 14.000000 40.000000 0.996520 3.310000 0.640000 10.250000 6.000000
75% 9.100000 0.630000 0.400000 2.500000 0.092000 21.000000 62.750000 0.997600 3.400000 0.750000 11.400000 6.000000
max 13.300000 1.330000 0.740000 8.800000 0.467000 51.000000 289.000000 1.002600 3.900000 1.980000 14.000000 8.000000

Testowy Wykres (quality, volatile acidity)

fig = plt.figure(figsize = (10,6))
sns.barplot(x = 'quality', y = 'volatile acidity', data = wine_test_bash)
<matplotlib.axes._subplots.AxesSubplot at 0x7f2504747ad0>

3.3. Cały zbiór

wine
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
0 7.4 0.700 0.00 1.9 0.076 11.0 34.0 0.99780 3.51 0.56 9.4 5
1 7.8 0.880 0.00 2.6 0.098 25.0 67.0 0.99680 3.20 0.68 9.8 5
2 7.8 0.760 0.04 2.3 0.092 15.0 54.0 0.99700 3.26 0.65 9.8 5
3 11.2 0.280 0.56 1.9 0.075 17.0 60.0 0.99800 3.16 0.58 9.8 6
4 7.4 0.700 0.00 1.9 0.076 11.0 34.0 0.99780 3.51 0.56 9.4 5
... ... ... ... ... ... ... ... ... ... ... ... ...
1594 6.2 0.600 0.08 2.0 0.090 32.0 44.0 0.99490 3.45 0.58 10.5 5
1595 5.9 0.550 0.10 2.2 0.062 39.0 51.0 0.99512 3.52 0.76 11.2 6
1596 6.3 0.510 0.13 2.3 0.076 29.0 40.0 0.99574 3.42 0.75 11.0 6
1597 5.9 0.645 0.12 2.0 0.075 32.0 44.0 0.99547 3.57 0.71 10.2 5
1598 6.0 0.310 0.47 3.6 0.067 18.0 42.0 0.99549 3.39 0.66 11.0 6

1599 rows × 12 columns

wine["quality"].value_counts()
5    681
6    638
7    199
4     53
8     18
3     10
Name: quality, dtype: int64
wine.describe(include='all')
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
count 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000
mean 8.319637 0.527821 0.270976 2.538806 0.087467 15.874922 46.467792 0.996747 3.311113 0.658149 10.422983 5.636023
std 1.741096 0.179060 0.194801 1.409928 0.047065 10.460157 32.895324 0.001887 0.154386 0.169507 1.065668 0.807569
min 4.600000 0.120000 0.000000 0.900000 0.012000 1.000000 6.000000 0.990070 2.740000 0.330000 8.400000 3.000000
25% 7.100000 0.390000 0.090000 1.900000 0.070000 7.000000 22.000000 0.995600 3.210000 0.550000 9.500000 5.000000
50% 7.900000 0.520000 0.260000 2.200000 0.079000 14.000000 38.000000 0.996750 3.310000 0.620000 10.200000 6.000000
75% 9.200000 0.640000 0.420000 2.600000 0.090000 21.000000 62.000000 0.997835 3.400000 0.730000 11.100000 6.000000
max 15.900000 1.580000 1.000000 15.500000 0.611000 72.000000 289.000000 1.003690 4.010000 2.000000 14.900000 8.000000

Testowy Wykres (quality, volatile acidity)

fig = plt.figure(figsize = (10,6))
sns.barplot(x = 'quality', y = 'volatile acidity', data = wine)
<matplotlib.axes._subplots.AxesSubplot at 0x7f2504262bd0>

3.4. zbiór Dev (bash)

wine_dev_bash
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
0 8.0 0.705 0.05 1.9 0.074 8.0 19.0 0.99620 3.34 0.95 10.5 6
1 7.6 0.665 0.10 1.5 0.066 27.0 55.0 0.99655 3.39 0.51 9.3 5
2 7.8 0.550 0.35 2.2 0.074 21.0 66.0 0.99740 3.25 0.56 9.2 5
3 13.0 0.320 0.65 2.6 0.093 15.0 47.0 0.99960 3.05 0.61 10.6 5
4 8.8 0.610 0.30 2.8 0.088 17.0 46.0 0.99760 3.26 0.51 9.3 4
... ... ... ... ... ... ... ... ... ... ... ... ...
261 13.8 0.490 0.67 3.0 0.093 6.0 15.0 0.99860 3.02 0.93 12.0 6
262 7.1 0.750 0.01 2.2 0.059 11.0 18.0 0.99242 3.39 0.40 12.8 6
263 9.9 0.350 0.41 2.3 0.083 11.0 61.0 0.99820 3.21 0.50 9.5 5
264 6.5 0.520 0.11 1.8 0.073 13.0 38.0 0.99550 3.34 0.52 9.3 5
265 6.8 0.670 0.00 1.9 0.080 22.0 39.0 0.99701 3.40 0.74 9.7 5

266 rows × 12 columns

wine_dev_bash["quality"].value_counts()
5    115
6    113
7     24
4      9
8      3
3      2
Name: quality, dtype: int64
wine_dev_bash.describe(include='all')
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
count 266.000000 266.000000 266.000000 266.000000 266.000000 266.000000 266.000000 266.000000 266.000000 266.000000 266.000000 266.000000
mean 8.273684 0.540075 0.253008 2.523308 0.088620 15.398496 43.973684 0.996749 3.317895 0.649774 10.453321 5.590226
std 1.720592 0.193856 0.190330 1.380498 0.055825 10.002219 30.518712 0.001930 0.152003 0.176930 1.058010 0.777841
min 4.900000 0.120000 0.000000 1.300000 0.012000 1.000000 8.000000 0.990640 2.870000 0.330000 8.500000 3.000000
25% 7.100000 0.396250 0.080000 1.900000 0.068250 8.000000 20.000000 0.995525 3.210000 0.542500 9.500000 5.000000
50% 7.900000 0.520000 0.240000 2.200000 0.079000 13.000000 37.000000 0.996720 3.320000 0.620000 10.200000 6.000000
75% 9.200000 0.648750 0.390000 2.600000 0.090000 20.000000 60.000000 0.997877 3.430000 0.720000 11.200000 6.000000
max 15.600000 1.580000 0.760000 13.800000 0.611000 66.000000 141.000000 1.003150 3.720000 1.950000 14.000000 8.000000
fig = plt.figure(figsize = (10,6))
sns.barplot(x = 'quality', y = 'volatile acidity', data = wine_dev_bash)
<matplotlib.axes._subplots.AxesSubplot at 0x7f2504166f50>

4. Normalizacja

Normalizacja kolumny 'quality' na wartości od 0 do 20. Nie jest ona konieczna ale została stworzona w celach demonstracyjnych

wine["quality"]=((wine["quality"]-wine["quality"].min())/(wine["quality"].max()-wine["quality"].min()))*20
wine
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
0 7.4 0.700 0.00 1.9 0.076 11.0 34.0 0.99780 3.51 0.56 9.4 8.0
1 7.8 0.880 0.00 2.6 0.098 25.0 67.0 0.99680 3.20 0.68 9.8 8.0
2 7.8 0.760 0.04 2.3 0.092 15.0 54.0 0.99700 3.26 0.65 9.8 8.0
3 11.2 0.280 0.56 1.9 0.075 17.0 60.0 0.99800 3.16 0.58 9.8 12.0
4 7.4 0.700 0.00 1.9 0.076 11.0 34.0 0.99780 3.51 0.56 9.4 8.0
... ... ... ... ... ... ... ... ... ... ... ... ...
1594 6.2 0.600 0.08 2.0 0.090 32.0 44.0 0.99490 3.45 0.58 10.5 8.0
1595 5.9 0.550 0.10 2.2 0.062 39.0 51.0 0.99512 3.52 0.76 11.2 12.0
1596 6.3 0.510 0.13 2.3 0.076 29.0 40.0 0.99574 3.42 0.75 11.0 12.0
1597 5.9 0.645 0.12 2.0 0.075 32.0 44.0 0.99547 3.57 0.71 10.2 8.0
1598 6.0 0.310 0.47 3.6 0.067 18.0 42.0 0.99549 3.39 0.66 11.0 12.0

1599 rows × 12 columns

wine["quality"].value_counts()
8.0     681
12.0    638
16.0    199
4.0      53
20.0     18
0.0      10
Name: quality, dtype: int64

5. Usuwanie artefaktów

Całe szczęscie nie ma w moim zbiorze ani pustych linijek, ani przykładów z niepoprawnymi wartościami

# Znajdźmy pustą linijkę:
! grep -P "^$" -n winequality-red.csv
wine.isnull().sum()
fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
dtype: int64
wine.dropna(inplace=True) 
wine
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
0 7.4 0.700 0.00 1.9 0.076 11.0 34.0 0.99780 3.51 0.56 9.4 8.0
1 7.8 0.880 0.00 2.6 0.098 25.0 67.0 0.99680 3.20 0.68 9.8 8.0
2 7.8 0.760 0.04 2.3 0.092 15.0 54.0 0.99700 3.26 0.65 9.8 8.0
3 11.2 0.280 0.56 1.9 0.075 17.0 60.0 0.99800 3.16 0.58 9.8 12.0
4 7.4 0.700 0.00 1.9 0.076 11.0 34.0 0.99780 3.51 0.56 9.4 8.0
... ... ... ... ... ... ... ... ... ... ... ... ...
1594 6.2 0.600 0.08 2.0 0.090 32.0 44.0 0.99490 3.45 0.58 10.5 8.0
1595 5.9 0.550 0.10 2.2 0.062 39.0 51.0 0.99512 3.52 0.76 11.2 12.0
1596 6.3 0.510 0.13 2.3 0.076 29.0 40.0 0.99574 3.42 0.75 11.0 12.0
1597 5.9 0.645 0.12 2.0 0.075 32.0 44.0 0.99547 3.57 0.71 10.2 8.0
1598 6.0 0.310 0.47 3.6 0.067 18.0 42.0 0.99549 3.39 0.66 11.0 12.0

1599 rows × 12 columns