150 KiB
1. Pobranie zbioru danych z Repozytorium
!curl -OL https://git.wmi.amu.edu.pl/s434788/ium_434788/raw/branch/master/winequality-red.csv
% Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 98k 0 98k 0 0 74502 0 --:--:-- 0:00:01 --:--:-- 74502
import pandas as pd
wine=pd.read_csv('winequality-red.csv')
wine
fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 7.4 | 0.700 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.99780 | 3.51 | 0.56 | 9.4 | 5 |
1 | 7.8 | 0.880 | 0.00 | 2.6 | 0.098 | 25.0 | 67.0 | 0.99680 | 3.20 | 0.68 | 9.8 | 5 |
2 | 7.8 | 0.760 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.99700 | 3.26 | 0.65 | 9.8 | 5 |
3 | 11.2 | 0.280 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.99800 | 3.16 | 0.58 | 9.8 | 6 |
4 | 7.4 | 0.700 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.99780 | 3.51 | 0.56 | 9.4 | 5 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1594 | 6.2 | 0.600 | 0.08 | 2.0 | 0.090 | 32.0 | 44.0 | 0.99490 | 3.45 | 0.58 | 10.5 | 5 |
1595 | 5.9 | 0.550 | 0.10 | 2.2 | 0.062 | 39.0 | 51.0 | 0.99512 | 3.52 | 0.76 | 11.2 | 6 |
1596 | 6.3 | 0.510 | 0.13 | 2.3 | 0.076 | 29.0 | 40.0 | 0.99574 | 3.42 | 0.75 | 11.0 | 6 |
1597 | 5.9 | 0.645 | 0.12 | 2.0 | 0.075 | 32.0 | 44.0 | 0.99547 | 3.57 | 0.71 | 10.2 | 5 |
1598 | 6.0 | 0.310 | 0.47 | 3.6 | 0.067 | 18.0 | 42.0 | 0.99549 | 3.39 | 0.66 | 11.0 | 6 |
1599 rows × 12 columns
2. Podział na zbiory test/train przy pomocy SciKit + (poprawka z 26.03.2021 przy pomocy basha)
2.1 SciKit
Próbowałem również podzielić na podzbiory Train:Dev:Test 6:2:2 Przy pomocy basha ale uznałem, że wygodniejsze jest korzystanie z "train_test_split()". Docelowo podział będzie dokonywany na 4 zmienne X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
, jednak chciałem zachować konwencje z przykładu, z ćwiczeń.
from sklearn.model_selection import train_test_split
wine_train, wine_test = train_test_split(wine, test_size=360,train_size=959, random_state=1)
wine_test["quality"].value_counts()
5 155 6 149 7 37 4 16 8 2 3 1 Name: quality, dtype: int64
wine_train["quality"].value_counts()
5 400 6 388 7 125 4 30 8 11 3 5 Name: quality, dtype: int64
2.2 Bash
!head -n 1 winequality-red.csv > header.csv
!tail -n +2 winequality-red.csv | shuf > data.shuffled
!head -n 266 data.shuffled > wine.data.test
!head -n 532 data.shuffled | tail -n 266 > wine.data.dev
!tail -n +333 data.shuffled > wine.data.train
!cat header.csv wine.data.test > test.csv
!cat header.csv wine.data.dev > dev.csv
!cat header.csv wine.data.train > train.csv
!wc -l test.csv
!wc -l dev.csv
!wc -l train.csv
267 test.csv 267 dev.csv 1268 train.csv
wine_test_bash=pd.read_csv('test.csv')
wine_dev_bash=pd.read_csv('dev.csv')
wine_train_bash=pd.read_csv('train.csv')
3. Statystyki dla zbiorów
from matplotlib import pyplot as plt
import seaborn as sns
3.1. Zbiór Train (bash)
wine_train_bash
fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 10.0 | 0.380 | 0.38 | 1.6 | 0.169 | 27.0 | 90.0 | 0.99914 | 3.15 | 0.65 | 8.5 | 5 |
1 | 6.7 | 0.460 | 0.24 | 1.7 | 0.077 | 18.0 | 34.0 | 0.99480 | 3.39 | 0.60 | 10.6 | 6 |
2 | 7.2 | 0.695 | 0.13 | 2.0 | 0.076 | 12.0 | 20.0 | 0.99546 | 3.29 | 0.54 | 10.1 | 5 |
3 | 12.5 | 0.600 | 0.49 | 4.3 | 0.100 | 5.0 | 14.0 | 1.00100 | 3.25 | 0.74 | 11.9 | 6 |
4 | 8.3 | 0.560 | 0.22 | 2.4 | 0.082 | 10.0 | 86.0 | 0.99830 | 3.37 | 0.62 | 9.5 | 5 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1262 | 7.8 | 0.560 | 0.12 | 2.0 | 0.082 | 7.0 | 28.0 | 0.99700 | 3.37 | 0.50 | 9.4 | 6 |
1263 | 5.8 | 0.680 | 0.02 | 1.8 | 0.087 | 21.0 | 94.0 | 0.99440 | 3.54 | 0.52 | 10.0 | 5 |
1264 | 7.7 | 0.630 | 0.08 | 1.9 | 0.076 | 15.0 | 27.0 | 0.99670 | 3.32 | 0.54 | 9.5 | 6 |
1265 | 7.1 | 0.600 | 0.00 | 1.8 | 0.074 | 16.0 | 34.0 | 0.99720 | 3.47 | 0.70 | 9.9 | 6 |
1266 | 10.4 | 0.610 | 0.49 | 2.1 | 0.200 | 5.0 | 16.0 | 0.99940 | 3.16 | 0.63 | 8.4 | 3 |
1267 rows × 12 columns
wine_train_bash["quality"].value_counts()
5 550 6 498 7 157 4 39 8 15 3 8 Name: quality, dtype: int64
wine_train_bash.describe(include='all')
fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 1267.000000 | 1267.000000 | 1267.000000 | 1267.000000 | 1267.000000 | 1267.000000 | 1267.000000 | 1267.000000 | 1267.000000 | 1267.000000 | 1267.000000 | 1267.000000 |
mean | 8.344199 | 0.525888 | 0.273891 | 2.574033 | 0.087419 | 15.889897 | 46.146014 | 0.996799 | 3.310016 | 0.655730 | 10.396725 | 5.632991 |
std | 1.789253 | 0.177804 | 0.196141 | 1.453463 | 0.046754 | 10.603674 | 32.734818 | 0.001893 | 0.154047 | 0.166206 | 1.042353 | 0.806931 |
min | 4.700000 | 0.120000 | 0.000000 | 0.900000 | 0.012000 | 1.000000 | 6.000000 | 0.990070 | 2.740000 | 0.370000 | 8.400000 | 3.000000 |
25% | 7.100000 | 0.390000 | 0.090000 | 1.900000 | 0.071000 | 7.000000 | 22.000000 | 0.995660 | 3.210000 | 0.550000 | 9.500000 | 5.000000 |
50% | 7.900000 | 0.520000 | 0.260000 | 2.200000 | 0.080000 | 13.000000 | 37.000000 | 0.996800 | 3.310000 | 0.620000 | 10.200000 | 6.000000 |
75% | 9.300000 | 0.640000 | 0.430000 | 2.600000 | 0.090000 | 22.000000 | 62.000000 | 0.997870 | 3.400000 | 0.730000 | 11.000000 | 6.000000 |
max | 15.900000 | 1.580000 | 1.000000 | 15.500000 | 0.611000 | 72.000000 | 278.000000 | 1.003690 | 4.010000 | 2.000000 | 14.900000 | 8.000000 |
Testowy Wykres (quality, volatile acidity)
fig = plt.figure(figsize = (10,6))
sns.barplot(x = 'quality', y = 'volatile acidity', data = wine_train_bash)
<matplotlib.axes._subplots.AxesSubplot at 0x7f2504f98950>
3.2. Zbiór Test (bash)
wine_test_bash
fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 7.1 | 0.60 | 0.01 | 2.3 | 0.079 | 24.0 | 37.0 | 0.99514 | 3.40 | 0.61 | 10.9 | 6 |
1 | 7.8 | 0.61 | 0.29 | 1.6 | 0.114 | 9.0 | 29.0 | 0.99740 | 3.26 | 1.56 | 9.1 | 5 |
2 | 7.1 | 0.63 | 0.06 | 2.0 | 0.083 | 8.0 | 29.0 | 0.99855 | 3.67 | 0.73 | 9.6 | 5 |
3 | 9.1 | 0.30 | 0.41 | 2.0 | 0.068 | 10.0 | 24.0 | 0.99523 | 3.27 | 0.85 | 11.7 | 7 |
4 | 9.0 | 0.46 | 0.31 | 2.8 | 0.093 | 19.0 | 98.0 | 0.99815 | 3.32 | 0.63 | 9.5 | 6 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
261 | 7.2 | 0.60 | 0.04 | 2.5 | 0.076 | 18.0 | 88.0 | 0.99745 | 3.53 | 0.55 | 9.5 | 5 |
262 | 8.4 | 0.67 | 0.19 | 2.2 | 0.093 | 11.0 | 75.0 | 0.99736 | 3.20 | 0.59 | 9.2 | 4 |
263 | 8.8 | 0.61 | 0.19 | 4.0 | 0.094 | 30.0 | 69.0 | 0.99787 | 3.22 | 0.50 | 10.0 | 6 |
264 | 9.6 | 0.68 | 0.24 | 2.2 | 0.087 | 5.0 | 28.0 | 0.99880 | 3.14 | 0.60 | 10.2 | 5 |
265 | 10.5 | 0.43 | 0.35 | 3.3 | 0.092 | 24.0 | 70.0 | 0.99798 | 3.21 | 0.69 | 10.5 | 6 |
266 rows × 12 columns
wine_test_bash["quality"].value_counts()
6 109 5 108 7 37 4 8 8 2 3 2 Name: quality, dtype: int64
wine_test_bash.describe(include='all')
fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 266.000000 | 266.000000 | 266.000000 | 266.000000 | 266.000000 | 266.000000 | 266.000000 | 266.000000 | 266.000000 | 266.000000 | 266.000000 | 266.000000 |
mean | 8.245865 | 0.529455 | 0.266203 | 2.373308 | 0.086823 | 15.840226 | 47.447368 | 0.996499 | 3.313195 | 0.676241 | 10.569925 | 5.665414 |
std | 1.526175 | 0.181583 | 0.191968 | 1.005345 | 0.046159 | 10.163096 | 34.610379 | 0.001772 | 0.158871 | 0.187786 | 1.149728 | 0.808497 |
min | 4.600000 | 0.180000 | 0.000000 | 1.200000 | 0.039000 | 1.000000 | 7.000000 | 0.990840 | 2.880000 | 0.390000 | 9.000000 | 3.000000 |
25% | 7.200000 | 0.392500 | 0.100000 | 1.900000 | 0.068000 | 7.000000 | 22.250000 | 0.995318 | 3.200000 | 0.560000 | 9.500000 | 5.000000 |
50% | 8.000000 | 0.520000 | 0.260000 | 2.100000 | 0.078000 | 14.000000 | 40.000000 | 0.996520 | 3.310000 | 0.640000 | 10.250000 | 6.000000 |
75% | 9.100000 | 0.630000 | 0.400000 | 2.500000 | 0.092000 | 21.000000 | 62.750000 | 0.997600 | 3.400000 | 0.750000 | 11.400000 | 6.000000 |
max | 13.300000 | 1.330000 | 0.740000 | 8.800000 | 0.467000 | 51.000000 | 289.000000 | 1.002600 | 3.900000 | 1.980000 | 14.000000 | 8.000000 |
Testowy Wykres (quality, volatile acidity)
fig = plt.figure(figsize = (10,6))
sns.barplot(x = 'quality', y = 'volatile acidity', data = wine_test_bash)
<matplotlib.axes._subplots.AxesSubplot at 0x7f2504747ad0>
3.3. Cały zbiór
wine
fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 7.4 | 0.700 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.99780 | 3.51 | 0.56 | 9.4 | 5 |
1 | 7.8 | 0.880 | 0.00 | 2.6 | 0.098 | 25.0 | 67.0 | 0.99680 | 3.20 | 0.68 | 9.8 | 5 |
2 | 7.8 | 0.760 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.99700 | 3.26 | 0.65 | 9.8 | 5 |
3 | 11.2 | 0.280 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.99800 | 3.16 | 0.58 | 9.8 | 6 |
4 | 7.4 | 0.700 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.99780 | 3.51 | 0.56 | 9.4 | 5 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1594 | 6.2 | 0.600 | 0.08 | 2.0 | 0.090 | 32.0 | 44.0 | 0.99490 | 3.45 | 0.58 | 10.5 | 5 |
1595 | 5.9 | 0.550 | 0.10 | 2.2 | 0.062 | 39.0 | 51.0 | 0.99512 | 3.52 | 0.76 | 11.2 | 6 |
1596 | 6.3 | 0.510 | 0.13 | 2.3 | 0.076 | 29.0 | 40.0 | 0.99574 | 3.42 | 0.75 | 11.0 | 6 |
1597 | 5.9 | 0.645 | 0.12 | 2.0 | 0.075 | 32.0 | 44.0 | 0.99547 | 3.57 | 0.71 | 10.2 | 5 |
1598 | 6.0 | 0.310 | 0.47 | 3.6 | 0.067 | 18.0 | 42.0 | 0.99549 | 3.39 | 0.66 | 11.0 | 6 |
1599 rows × 12 columns
wine["quality"].value_counts()
5 681 6 638 7 199 4 53 8 18 3 10 Name: quality, dtype: int64
wine.describe(include='all')
fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 |
mean | 8.319637 | 0.527821 | 0.270976 | 2.538806 | 0.087467 | 15.874922 | 46.467792 | 0.996747 | 3.311113 | 0.658149 | 10.422983 | 5.636023 |
std | 1.741096 | 0.179060 | 0.194801 | 1.409928 | 0.047065 | 10.460157 | 32.895324 | 0.001887 | 0.154386 | 0.169507 | 1.065668 | 0.807569 |
min | 4.600000 | 0.120000 | 0.000000 | 0.900000 | 0.012000 | 1.000000 | 6.000000 | 0.990070 | 2.740000 | 0.330000 | 8.400000 | 3.000000 |
25% | 7.100000 | 0.390000 | 0.090000 | 1.900000 | 0.070000 | 7.000000 | 22.000000 | 0.995600 | 3.210000 | 0.550000 | 9.500000 | 5.000000 |
50% | 7.900000 | 0.520000 | 0.260000 | 2.200000 | 0.079000 | 14.000000 | 38.000000 | 0.996750 | 3.310000 | 0.620000 | 10.200000 | 6.000000 |
75% | 9.200000 | 0.640000 | 0.420000 | 2.600000 | 0.090000 | 21.000000 | 62.000000 | 0.997835 | 3.400000 | 0.730000 | 11.100000 | 6.000000 |
max | 15.900000 | 1.580000 | 1.000000 | 15.500000 | 0.611000 | 72.000000 | 289.000000 | 1.003690 | 4.010000 | 2.000000 | 14.900000 | 8.000000 |
Testowy Wykres (quality, volatile acidity)
fig = plt.figure(figsize = (10,6))
sns.barplot(x = 'quality', y = 'volatile acidity', data = wine)
<matplotlib.axes._subplots.AxesSubplot at 0x7f2504262bd0>
3.4. zbiór Dev (bash)
wine_dev_bash
fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 8.0 | 0.705 | 0.05 | 1.9 | 0.074 | 8.0 | 19.0 | 0.99620 | 3.34 | 0.95 | 10.5 | 6 |
1 | 7.6 | 0.665 | 0.10 | 1.5 | 0.066 | 27.0 | 55.0 | 0.99655 | 3.39 | 0.51 | 9.3 | 5 |
2 | 7.8 | 0.550 | 0.35 | 2.2 | 0.074 | 21.0 | 66.0 | 0.99740 | 3.25 | 0.56 | 9.2 | 5 |
3 | 13.0 | 0.320 | 0.65 | 2.6 | 0.093 | 15.0 | 47.0 | 0.99960 | 3.05 | 0.61 | 10.6 | 5 |
4 | 8.8 | 0.610 | 0.30 | 2.8 | 0.088 | 17.0 | 46.0 | 0.99760 | 3.26 | 0.51 | 9.3 | 4 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
261 | 13.8 | 0.490 | 0.67 | 3.0 | 0.093 | 6.0 | 15.0 | 0.99860 | 3.02 | 0.93 | 12.0 | 6 |
262 | 7.1 | 0.750 | 0.01 | 2.2 | 0.059 | 11.0 | 18.0 | 0.99242 | 3.39 | 0.40 | 12.8 | 6 |
263 | 9.9 | 0.350 | 0.41 | 2.3 | 0.083 | 11.0 | 61.0 | 0.99820 | 3.21 | 0.50 | 9.5 | 5 |
264 | 6.5 | 0.520 | 0.11 | 1.8 | 0.073 | 13.0 | 38.0 | 0.99550 | 3.34 | 0.52 | 9.3 | 5 |
265 | 6.8 | 0.670 | 0.00 | 1.9 | 0.080 | 22.0 | 39.0 | 0.99701 | 3.40 | 0.74 | 9.7 | 5 |
266 rows × 12 columns
wine_dev_bash["quality"].value_counts()
5 115 6 113 7 24 4 9 8 3 3 2 Name: quality, dtype: int64
wine_dev_bash.describe(include='all')
fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 266.000000 | 266.000000 | 266.000000 | 266.000000 | 266.000000 | 266.000000 | 266.000000 | 266.000000 | 266.000000 | 266.000000 | 266.000000 | 266.000000 |
mean | 8.273684 | 0.540075 | 0.253008 | 2.523308 | 0.088620 | 15.398496 | 43.973684 | 0.996749 | 3.317895 | 0.649774 | 10.453321 | 5.590226 |
std | 1.720592 | 0.193856 | 0.190330 | 1.380498 | 0.055825 | 10.002219 | 30.518712 | 0.001930 | 0.152003 | 0.176930 | 1.058010 | 0.777841 |
min | 4.900000 | 0.120000 | 0.000000 | 1.300000 | 0.012000 | 1.000000 | 8.000000 | 0.990640 | 2.870000 | 0.330000 | 8.500000 | 3.000000 |
25% | 7.100000 | 0.396250 | 0.080000 | 1.900000 | 0.068250 | 8.000000 | 20.000000 | 0.995525 | 3.210000 | 0.542500 | 9.500000 | 5.000000 |
50% | 7.900000 | 0.520000 | 0.240000 | 2.200000 | 0.079000 | 13.000000 | 37.000000 | 0.996720 | 3.320000 | 0.620000 | 10.200000 | 6.000000 |
75% | 9.200000 | 0.648750 | 0.390000 | 2.600000 | 0.090000 | 20.000000 | 60.000000 | 0.997877 | 3.430000 | 0.720000 | 11.200000 | 6.000000 |
max | 15.600000 | 1.580000 | 0.760000 | 13.800000 | 0.611000 | 66.000000 | 141.000000 | 1.003150 | 3.720000 | 1.950000 | 14.000000 | 8.000000 |
fig = plt.figure(figsize = (10,6))
sns.barplot(x = 'quality', y = 'volatile acidity', data = wine_dev_bash)
<matplotlib.axes._subplots.AxesSubplot at 0x7f2504166f50>
4. Normalizacja
Normalizacja kolumny 'quality' na wartości od 0 do 20. Nie jest ona konieczna ale została stworzona w celach demonstracyjnych
wine["quality"]=((wine["quality"]-wine["quality"].min())/(wine["quality"].max()-wine["quality"].min()))*20
wine
fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 7.4 | 0.700 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.99780 | 3.51 | 0.56 | 9.4 | 8.0 |
1 | 7.8 | 0.880 | 0.00 | 2.6 | 0.098 | 25.0 | 67.0 | 0.99680 | 3.20 | 0.68 | 9.8 | 8.0 |
2 | 7.8 | 0.760 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.99700 | 3.26 | 0.65 | 9.8 | 8.0 |
3 | 11.2 | 0.280 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.99800 | 3.16 | 0.58 | 9.8 | 12.0 |
4 | 7.4 | 0.700 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.99780 | 3.51 | 0.56 | 9.4 | 8.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1594 | 6.2 | 0.600 | 0.08 | 2.0 | 0.090 | 32.0 | 44.0 | 0.99490 | 3.45 | 0.58 | 10.5 | 8.0 |
1595 | 5.9 | 0.550 | 0.10 | 2.2 | 0.062 | 39.0 | 51.0 | 0.99512 | 3.52 | 0.76 | 11.2 | 12.0 |
1596 | 6.3 | 0.510 | 0.13 | 2.3 | 0.076 | 29.0 | 40.0 | 0.99574 | 3.42 | 0.75 | 11.0 | 12.0 |
1597 | 5.9 | 0.645 | 0.12 | 2.0 | 0.075 | 32.0 | 44.0 | 0.99547 | 3.57 | 0.71 | 10.2 | 8.0 |
1598 | 6.0 | 0.310 | 0.47 | 3.6 | 0.067 | 18.0 | 42.0 | 0.99549 | 3.39 | 0.66 | 11.0 | 12.0 |
1599 rows × 12 columns
wine["quality"].value_counts()
8.0 681 12.0 638 16.0 199 4.0 53 20.0 18 0.0 10 Name: quality, dtype: int64
5. Usuwanie artefaktów
Całe szczęscie nie ma w moim zbiorze ani pustych linijek, ani przykładów z niepoprawnymi wartościami
# Znajdźmy pustą linijkę:
! grep -P "^$" -n winequality-red.csv
Szukanie wartości "NA": https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html
wine.isnull().sum()
fixed acidity 0 volatile acidity 0 citric acid 0 residual sugar 0 chlorides 0 free sulfur dioxide 0 total sulfur dioxide 0 density 0 pH 0 sulphates 0 alcohol 0 quality 0 dtype: int64
wine.dropna(inplace=True)
wine
fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 7.4 | 0.700 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.99780 | 3.51 | 0.56 | 9.4 | 8.0 |
1 | 7.8 | 0.880 | 0.00 | 2.6 | 0.098 | 25.0 | 67.0 | 0.99680 | 3.20 | 0.68 | 9.8 | 8.0 |
2 | 7.8 | 0.760 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.99700 | 3.26 | 0.65 | 9.8 | 8.0 |
3 | 11.2 | 0.280 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.99800 | 3.16 | 0.58 | 9.8 | 12.0 |
4 | 7.4 | 0.700 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.99780 | 3.51 | 0.56 | 9.4 | 8.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1594 | 6.2 | 0.600 | 0.08 | 2.0 | 0.090 | 32.0 | 44.0 | 0.99490 | 3.45 | 0.58 | 10.5 | 8.0 |
1595 | 5.9 | 0.550 | 0.10 | 2.2 | 0.062 | 39.0 | 51.0 | 0.99512 | 3.52 | 0.76 | 11.2 | 12.0 |
1596 | 6.3 | 0.510 | 0.13 | 2.3 | 0.076 | 29.0 | 40.0 | 0.99574 | 3.42 | 0.75 | 11.0 | 12.0 |
1597 | 5.9 | 0.645 | 0.12 | 2.0 | 0.075 | 32.0 | 44.0 | 0.99547 | 3.57 | 0.71 | 10.2 | 8.0 |
1598 | 6.0 | 0.310 | 0.47 | 3.6 | 0.067 | 18.0 | 42.0 | 0.99549 | 3.39 | 0.66 | 11.0 | 12.0 |
1599 rows × 12 columns