ium_464979/IUM_02.ipynb

Downloading the dataset and packages

%pip install kaggle
%pip install pandas
%pip install numpy
%pip install scikit-learn
%pip install seaborn
Collecting kaggle
  Downloading kaggle-1.6.6.tar.gz (84 kB)
     ---------------------------------------- 84.6/84.6 kB 2.4 MB/s eta 0:00:00
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Requirement already satisfied: six>=1.10 in c:\users\adamw\anaconda3\lib\site-packages (from kaggle) (1.16.0)
Requirement already satisfied: certifi in c:\users\adamw\anaconda3\lib\site-packages (from kaggle) (2022.12.7)
Requirement already satisfied: python-dateutil in c:\users\adamw\anaconda3\lib\site-packages (from kaggle) (2.8.2)
Requirement already satisfied: requests in c:\users\adamw\anaconda3\lib\site-packages (from kaggle) (2.28.1)
Requirement already satisfied: tqdm in c:\users\adamw\anaconda3\lib\site-packages (from kaggle) (4.64.1)
Requirement already satisfied: python-slugify in c:\users\adamw\anaconda3\lib\site-packages (from kaggle) (5.0.2)
Requirement already satisfied: urllib3 in c:\users\adamw\anaconda3\lib\site-packages (from kaggle) (1.26.14)
Requirement already satisfied: bleach in c:\users\adamw\anaconda3\lib\site-packages (from kaggle) (4.1.0)
Requirement already satisfied: webencodings in c:\users\adamw\anaconda3\lib\site-packages (from bleach->kaggle) (0.5.1)
Requirement already satisfied: packaging in c:\users\adamw\anaconda3\lib\site-packages (from bleach->kaggle) (22.0)
Requirement already satisfied: text-unidecode>=1.3 in c:\users\adamw\anaconda3\lib\site-packages (from python-slugify->kaggle) (1.3)
Requirement already satisfied: idna<4,>=2.5 in c:\users\adamw\anaconda3\lib\site-packages (from requests->kaggle) (3.4)
Requirement already satisfied: charset-normalizer<3,>=2 in c:\users\adamw\anaconda3\lib\site-packages (from requests->kaggle) (2.0.4)
Requirement already satisfied: colorama in c:\users\adamw\anaconda3\lib\site-packages (from tqdm->kaggle) (0.4.6)
Building wheels for collected packages: kaggle
  Building wheel for kaggle (setup.py): started
  Building wheel for kaggle (setup.py): finished with status 'done'
  Created wheel for kaggle: filename=kaggle-1.6.6-py3-none-any.whl size=111955 sha256=23592736409344e3027e92f5ac103680cd5efb348835a123a68118e729e02b66
  Stored in directory: c:\users\adamw\appdata\local\pip\cache\wheels\54\6e\ff\d5ab6af2287a2d0c5b8cea9328fb14940ca253fe60214a99c8
Successfully built kaggle
Installing collected packages: kaggle
Successfully installed kaggle-1.6.6
Note: you may need to restart the kernel to use updated packages.
Requirement already satisfied: pandas in c:\users\adamw\anaconda3\lib\site-packages (1.5.3)
Requirement already satisfied: pytz>=2020.1 in c:\users\adamw\anaconda3\lib\site-packages (from pandas) (2022.7)
Requirement already satisfied: numpy>=1.21.0 in c:\users\adamw\anaconda3\lib\site-packages (from pandas) (1.23.5)
Requirement already satisfied: python-dateutil>=2.8.1 in c:\users\adamw\anaconda3\lib\site-packages (from pandas) (2.8.2)
Requirement already satisfied: six>=1.5 in c:\users\adamw\anaconda3\lib\site-packages (from python-dateutil>=2.8.1->pandas) (1.16.0)
Note: you may need to restart the kernel to use updated packages.
Requirement already satisfied: numpy in c:\users\adamw\anaconda3\lib\site-packages (1.23.5)
Note: you may need to restart the kernel to use updated packages.
Requirement already satisfied: scikit-learn in c:\users\adamw\anaconda3\lib\site-packages (1.2.1)
Requirement already satisfied: numpy>=1.17.3 in c:\users\adamw\anaconda3\lib\site-packages (from scikit-learn) (1.23.5)
Requirement already satisfied: threadpoolctl>=2.0.0 in c:\users\adamw\anaconda3\lib\site-packages (from scikit-learn) (2.2.0)
Requirement already satisfied: joblib>=1.1.1 in c:\users\adamw\anaconda3\lib\site-packages (from scikit-learn) (1.1.1)
Requirement already satisfied: scipy>=1.3.2 in c:\users\adamw\anaconda3\lib\site-packages (from scikit-learn) (1.10.0)
Note: you may need to restart the kernel to use updated packages.
Requirement already satisfied: seaborn in c:\users\adamw\anaconda3\lib\site-packages (0.12.2)
Requirement already satisfied: numpy!=1.24.0,>=1.17 in c:\users\adamw\anaconda3\lib\site-packages (from seaborn) (1.23.5)
Requirement already satisfied: matplotlib!=3.6.1,>=3.1 in c:\users\adamw\anaconda3\lib\site-packages (from seaborn) (3.7.0)
Requirement already satisfied: pandas>=0.25 in c:\users\adamw\anaconda3\lib\site-packages (from seaborn) (1.5.3)
Requirement already satisfied: pyparsing>=2.3.1 in c:\users\adamw\anaconda3\lib\site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (3.0.9)
Requirement already satisfied: contourpy>=1.0.1 in c:\users\adamw\anaconda3\lib\site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (1.0.5)
Requirement already satisfied: fonttools>=4.22.0 in c:\users\adamw\anaconda3\lib\site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (4.25.0)
Requirement already satisfied: pillow>=6.2.0 in c:\users\adamw\anaconda3\lib\site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (9.4.0)
Requirement already satisfied: python-dateutil>=2.7 in c:\users\adamw\anaconda3\lib\site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (2.8.2)
Requirement already satisfied: packaging>=20.0 in c:\users\adamw\anaconda3\lib\site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (22.0)
Requirement already satisfied: cycler>=0.10 in c:\users\adamw\anaconda3\lib\site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (0.11.0)
Requirement already satisfied: kiwisolver>=1.0.1 in c:\users\adamw\anaconda3\lib\site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (1.4.4)
Requirement already satisfied: pytz>=2020.1 in c:\users\adamw\anaconda3\lib\site-packages (from pandas>=0.25->seaborn) (2022.7)
Requirement already satisfied: six>=1.5 in c:\users\adamw\anaconda3\lib\site-packages (from python-dateutil>=2.7->matplotlib!=3.6.1,>=3.1->seaborn) (1.16.0)
Note: you may need to restart the kernel to use updated packages.
!kaggle datasets download -d thedevastator/1-5-million-beer-reviews-from-beer-advocate
Downloading 1-5-million-beer-reviews-from-beer-advocate.zip to C:\Users\adamw\REPOS\ium_464979

100%|##########| 32.5M/32.5M [00:05<00:00, 6.00MB/s]
!unzip -o 1-5-million-beer-reviews-from-beer-advocate.zip
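The paths in the install output suggest a Windows environment, where the `unzip` shell command may not be available. A minimal sketch of a portable alternative using only the standard library follows; the helper name `extract_archive` is an assumption made for illustration and is not part of the original notebook.

```python
import zipfile
from pathlib import Path


def extract_archive(archive_path, target_dir="."):
    """Extract every member of a zip archive and return the member names."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        # Existing files are overwritten, similar to `unzip -o`.
        archive.extractall(target)
        return archive.namelist()


# Usage in the notebook (archive name taken from the download step):
# extract_archive("1-5-million-beer-reviews-from-beer-advocate.zip")
```

The function returns the member names, which makes it easy to confirm that the expected CSV file was extracted before calling `pd.read_csv`.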
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler

# Show floats in fixed-point notation instead of scientific notation
pd.set_option('display.float_format', '{:f}'.format)

Loading the data

beers = pd.read_csv('beer_reviews.csv')

beers.head()
   index  brewery_id             brewery_name  review_time  review_overall  review_aroma  review_appearance  review_profilename                      beer_style  review_palate  review_taste               beer_name  beer_abv  beer_beerid
0      0       10325          Vecchio Birraio   1234817823        1.500000      2.000000           2.500000             stcules                      Hefeweizen       1.500000      1.500000            Sausa Weizen  5.000000        47986
1      1       10325          Vecchio Birraio   1235915097        3.000000      2.500000           3.000000             stcules              English Strong Ale       3.000000      3.000000                Red Moon  6.200000        48213
2      2       10325          Vecchio Birraio   1235916604        3.000000      2.500000           3.000000             stcules          Foreign / Export Stout       3.000000      3.000000  Black Horse Black Beer  6.500000        48215
3      3       10325          Vecchio Birraio   1234725145        3.000000      3.000000           3.500000             stcules                 German Pilsener       2.500000      3.000000              Sausa Pils  5.000000        47969
4      4        1075  Caldera Brewing Company   1293735206        4.000000      4.500000           4.000000      johnmichaelsen  American Double / Imperial IPA       4.000000      4.500000           Cauldron DIPA  7.700000        64883
beers.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1586614 entries, 0 to 1586613
Data columns (total 14 columns):
 #   Column              Non-Null Count    Dtype  
---  ------              --------------    -----  
 0   index               1586614 non-null  int64  
 1   brewery_id          1586614 non-null  int64  
 2   brewery_name        1586599 non-null  object 
 3   review_time         1586614 non-null  int64  
 4   review_overall      1586614 non-null  float64
 5   review_aroma        1586614 non-null  float64
 6   review_appearance   1586614 non-null  float64
 7   review_profilename  1586266 non-null  object 
 8   beer_style          1586614 non-null  object 
 9   review_palate       1586614 non-null  float64
 10  review_taste        1586614 non-null  float64
 11  beer_name           1586614 non-null  object 
 12  beer_abv            1518829 non-null  float64
 13  beer_beerid         1586614 non-null  int64  
dtypes: float64(6), int64(4), object(4)
memory usage: 169.5+ MB

Cleaning

# Drop rows with a missing value in any of the three incomplete columns
beers.dropna(subset=['brewery_name', 'review_profilename', 'beer_abv'], inplace=True)

beers.isnull().sum()
index                 0
brewery_id            0
brewery_name          0
review_time           0
review_overall        0
review_aroma          0
review_appearance     0
review_profilename    0
beer_style            0
review_palate         0
review_taste          0
beer_name             0
beer_abv              0
beer_beerid           0
dtype: int64
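Comparing the `info()` output with the later record count, the cleaning step removes 1586614 - 1518478 = 68136 rows (about 4.3%), nearly all of them because of missing `beer_abv`. A minimal sketch of how that cost could be inspected before dropping anything is shown below; the helper `missing_summary` and the toy frame are assumptions made for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Return missing counts and percentages per column, most affected first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": 100 * counts / len(df),
    })
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy frame standing in for `beers` (values are made up for illustration).
toy = pd.DataFrame({
    "brewery_name": ["A", None, "C", "D"],
    "beer_abv": [5.0, np.nan, np.nan, 7.7],
    "beer_style": ["Hefeweizen", "Stout", "IPA", "Pils"],
})

summary = missing_summary(toy)
print(summary)

cleaned = toy.dropna(subset=["brewery_name", "beer_abv"])
print(len(toy) - len(cleaned), "rows dropped")
```

On the real data the same call would be `missing_summary(beers)`, run before the `dropna` cell.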

Normalization

# Columns rescaled to the [0, 1] range
cols_to_scale = ['review_overall', 'review_aroma', 'review_appearance', 'review_palate', 'review_taste', 'beer_abv', 'beer_beerid']

scaler = MinMaxScaler()
beers[cols_to_scale] = scaler.fit_transform(beers[cols_to_scale])
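With its default `feature_range` of (0, 1), `MinMaxScaler` maps each column through x' = (x - min) / (max - min). The sketch below checks that formula by hand against the scaler, using the five `review_overall` values shown by `beers.head()`; it is an illustration only and assumes those five values are the whole column.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# review_overall values from the first five rows shown by beers.head()
ratings = np.array([[1.5], [3.0], [3.0], [3.0], [4.0]])

# Manual min-max scaling, column by column
col_min = ratings.min(axis=0)
col_max = ratings.max(axis=0)
manual = (ratings - col_min) / (col_max - col_min)

# The same transformation through scikit-learn
scaled = MinMaxScaler().fit_transform(ratings)

assert np.allclose(manual, scaled)
print(scaled.ravel())
```

Because the minimum and maximum come from the data the scaler was fitted on, the result depends on which rows are passed to `fit`.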

Splitting into subsets

beers_train, beers_dev_test = train_test_split(beers, test_size=0.2, random_state=1234)
beers_dev, beers_test = train_test_split(beers_dev_test, test_size=0.5, random_state=1234)
print(f"Number of columns in each set: {beers.shape[1]}")
print(f"Total: {beers.shape[0]} records")
print(f"Train: {beers_train.shape[0]} records")
print(f"Dev: {beers_dev.shape[0]} records")
print(f"Test: {beers_test.shape[0]} records")
Number of columns in each set: 14
Total: 1518478 records
Train: 1214782 records
Dev: 151848 records
Test: 151848 records
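The notebook fits the scaler on the full dataset and splits afterwards, so the dev and test rows contribute to the scaling parameters. A common alternative, sketched here under the assumption that leakage between subsets matters for the downstream model, is to split first and fit the scaler on the training rows only. The synthetic frame `toy` and the helper `scale` are hypothetical stand-ins and do not appear in the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for `beers`: column names mirror the notebook, values are random.
rng = np.random.default_rng(1234)
toy = pd.DataFrame({
    "review_overall": rng.choice(np.arange(1.0, 5.5, 0.5), size=1000),
    "beer_abv": rng.uniform(3.0, 12.0, size=1000),
})
cols = ["review_overall", "beer_abv"]

# Same 80/10/10 scheme as in the notebook
train, dev_test = train_test_split(toy, test_size=0.2, random_state=1234)
dev, test = train_test_split(dev_test, test_size=0.5, random_state=1234)

# Fit on the training rows only, then reuse the fitted scaler everywhere
scaler = MinMaxScaler().fit(train[cols])


def scale(df):
    """Apply the train-fitted scaler, keeping the original index and column names."""
    return pd.DataFrame(scaler.transform(df[cols]), columns=cols, index=df.index)


train_scaled, dev_scaled, test_scaled = scale(train), scale(dev), scale(test)
print(train_scaled.describe().loc[["min", "max"]])
```

With this ordering, values in the dev and test sets can fall slightly outside [0, 1] whenever they exceed the range seen in training.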

Data overview

print(f"Number of distinct beer names: {beers['beer_name'].nunique()}")
print(f"Number of distinct styles: {beers['beer_style'].nunique()}")
print(f"Number of distinct brewery names: {beers['brewery_name'].nunique()}")
Number of distinct beer names: 44075
Number of distinct styles: 104
Number of distinct brewery names: 5155
# Each row of `beers` is one review, so value_counts() counts reviews per style
style_counts = beers['beer_style'].value_counts()

top_15_styles = style_counts.head(15)

plt.bar(top_15_styles.index, top_15_styles.values)
plt.xlabel('Style')
plt.ylabel('Number of reviews')
plt.title('Number of reviews for the 15 most common styles')
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()
# Mean (normalized) overall rating and number of reviews per beer name
reviews = beers.groupby('beer_name')[['review_overall']].mean()
reviews['Number of reviews'] = beers.groupby('beer_name')['review_overall'].count()
reviews = reviews.sort_values(by='Number of reviews', ascending=False)
reviews.head()
                                     review_overall  Number of reviews
beer_name
90 Minute IPA                              0.829097               3289
Old Rasputin Russian Imperial Stout        0.834823               3110
Sierra Nevada Celebration Ale              0.833711               2999
India Pale Ale                             0.770777               2960
Two Hearted Ale                            0.866043               2727
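Grouping by `beer_name` alone pools the reviews of any beers that happen to share a name, which may be the case for a generic entry such as "India Pale Ale" in the table above. A minimal sketch of grouping by brewery and beer name together, with a minimum review count, is shown below. The toy data and the `min_reviews` threshold are assumptions chosen for illustration.

```python
import pandas as pd

# Toy reviews: two breweries sell a beer with the same name (values are made up).
toy = pd.DataFrame({
    "brewery_name": ["A", "A", "B", "B", "B", "C"],
    "beer_name": ["India Pale Ale"] * 5 + ["Stout"],
    "review_overall": [0.8, 0.6, 1.0, 0.9, 0.8, 0.5],
})

# One row per (brewery, beer) pair instead of one row per beer name
per_beer = (
    toy.groupby(["brewery_name", "beer_name"])["review_overall"]
    .agg(mean_overall="mean", review_count="count")
    .sort_values("review_count", ascending=False)
)

min_reviews = 2  # hypothetical threshold, not taken from the notebook
popular = per_beer[per_beer["review_count"] >= min_reviews]
print(popular)
```

On the real data, `toy` would be replaced by `beers`; the threshold keeps beers with only a handful of reviews from dominating a ranking by mean rating.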
beers[['review_overall', 'review_aroma', 'review_appearance', 'review_palate', 'review_taste', 'beer_abv', 'beer_beerid']].describe().applymap(lambda x: f"{x:0.3f}")
       review_overall  review_aroma  review_appearance  review_palate  review_taste     beer_abv  beer_beerid
count     1518478.000   1518478.000        1518478.000    1518478.000   1518478.000  1518478.000  1518478.000
mean            0.765         0.687              0.770          0.688         0.701        0.122        0.277
std             0.143         0.174              0.123          0.170         0.182        0.040        0.282
min             0.000         0.000              0.000          0.000         0.000        0.000        0.000
25%             0.700         0.625              0.700          0.625         0.625        0.090        0.021
50%             0.800         0.750              0.800          0.750         0.750        0.112        0.166
75%             0.900         0.750              0.800          0.750         0.875        0.147        0.507
max             1.000         1.000              1.000          1.000         1.000        1.000        1.000
beers_train[['review_overall', 'review_aroma', 'review_appearance', 'review_palate', 'review_taste', 'beer_abv', 'beer_beerid']].describe().applymap(lambda x: f"{x:0.1f}")
       review_overall  review_aroma  review_appearance  review_palate  review_taste   beer_abv  beer_beerid
count       1214782.0     1214782.0          1214782.0      1214782.0     1214782.0  1214782.0    1214782.0
mean              0.8           0.7                0.8            0.7           0.7        0.1          0.3
std               0.1           0.2                0.1            0.2           0.2        0.0          0.3
min               0.0           0.0                0.0            0.0           0.0        0.0          0.0
25%               0.7           0.6                0.7            0.6           0.6        0.1          0.0
50%               0.8           0.8                0.8            0.8           0.8        0.1          0.2
75%               0.9           0.8                0.8            0.8           0.9        0.1          0.5
max               1.0           1.0                1.0            1.0           1.0        1.0          1.0
beers_dev[['review_overall', 'review_aroma', 'review_appearance', 'review_palate', 'review_taste', 'beer_abv', 'beer_beerid']].describe().applymap(lambda x: f"{x:0.1f}")
       review_overall  review_aroma  review_appearance  review_palate  review_taste  beer_abv  beer_beerid
count        151848.0      151848.0           151848.0       151848.0      151848.0  151848.0     151848.0
mean              0.8           0.7                0.8            0.7           0.7       0.1          0.3
std               0.1           0.2                0.1            0.2           0.2       0.0          0.3
min               0.0           0.0                0.0            0.0           0.0       0.0          0.0
25%               0.7           0.6                0.7            0.6           0.6       0.1          0.0
50%               0.8           0.8                0.8            0.8           0.8       0.1          0.2
75%               0.9           0.8                0.8            0.8           0.9       0.1          0.5
max               1.0           1.0                1.0            1.0           1.0       0.7          1.0
beers_test[['review_overall', 'review_aroma', 'review_appearance', 'review_palate', 'review_taste', 'beer_abv', 'beer_beerid']].describe().applymap(lambda x: f"{x:0.1f}")
       review_overall  review_aroma  review_appearance  review_palate  review_taste  beer_abv  beer_beerid
count        151848.0      151848.0           151848.0       151848.0      151848.0  151848.0     151848.0
mean              0.8           0.7                0.8            0.7           0.7       0.1          0.3
std               0.1           0.2                0.1            0.2           0.2       0.0          0.3
min               0.0           0.0                0.0            0.0           0.0       0.0          0.0
25%               0.7           0.6                0.7            0.6           0.6       0.1          0.0
50%               0.8           0.8                0.8            0.8           0.8       0.1          0.2
75%               0.9           0.8                0.8            0.8           0.9       0.1          0.5
max               1.0           1.0                1.0            1.0           1.0       0.7          1.0
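Rounded to one decimal place, the three `describe()` tables are nearly identical; the only visible difference is the `beer_abv` maximum (1.0 in train, 0.7 in dev and test). A minimal sketch of putting one statistic for all splits side by side, at full precision, is given below. The helper `compare_splits` and the random toy data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def compare_splits(splits, columns, stat="mean"):
    """Return one row per split holding `stat` of every listed column."""
    return pd.DataFrame({name: df[columns].agg(stat) for name, df in splits.items()}).T


# Random stand-in for the normalized `beers` frame
rng = np.random.default_rng(1234)
toy = pd.DataFrame({
    "review_overall": rng.uniform(0.0, 1.0, size=1000),
    "beer_abv": rng.uniform(0.0, 1.0, size=1000),
})
train, dev_test = train_test_split(toy, test_size=0.2, random_state=1234)
dev, test = train_test_split(dev_test, test_size=0.5, random_state=1234)

splits = {"train": train, "dev": dev, "test": test}
cols = ["review_overall", "beer_abv"]

means = compare_splits(splits, cols, "mean")
print(means.round(3))

# Largest gap between any split mean and the train mean, per column
print((means - means.loc["train"]).abs().max())
```

In the notebook, the dictionary would hold `beers_train`, `beers_dev` and `beers_test`, and `stat` could be any aggregation name accepted by pandas, such as `"std"` or `"max"`.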