Downloading the dataset and packages
%pip install kaggle
%pip install pandas
%pip install numpy
%pip install scikit-learn
%pip install seaborn
Collecting kaggle
Downloading kaggle-1.6.6.tar.gz (84 kB)
Successfully built kaggle
Installing collected packages: kaggle
Successfully installed kaggle-1.6.6
Requirement already satisfied: pandas in c:\users\adamw\anaconda3\lib\site-packages (1.5.3)
Requirement already satisfied: numpy in c:\users\adamw\anaconda3\lib\site-packages (1.23.5)
Requirement already satisfied: scikit-learn in c:\users\adamw\anaconda3\lib\site-packages (1.2.1)
Requirement already satisfied: seaborn in c:\users\adamw\anaconda3\lib\site-packages (0.12.2)
Note: you may need to restart the kernel to use updated packages.
!kaggle datasets download -d thedevastator/1-5-million-beer-reviews-from-beer-advocate
Downloading 1-5-million-beer-reviews-from-beer-advocate.zip to C:\Users\adamw\REPOS\ium_464979
100%|##########| 32.5M/32.5M [00:05<00:00, 6.00MB/s]
!unzip -o 1-5-million-beer-reviews-from-beer-advocate.zip
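The `!unzip` call relies on an `unzip` executable being on the PATH, which is not guaranteed on every system. As a hedged alternative, the archive can be unpacked with Python's standard `zipfile` module. The helper name `extract_archive` is an illustration only and is not part of the original notebook.

```python
import zipfile
from pathlib import Path


def extract_archive(archive_path, target_dir="."):
    """Extract every member of a ZIP archive, overwriting existing files."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target)
        # Return the member names so the caller can check what was unpacked.
        return archive.namelist()


# Example call (assumes the archive sits in the working directory):
# extract_archive("1-5-million-beer-reviews-from-beer-advocate.zip")
```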
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
# Print floats in fixed-point notation instead of scientific notation.
pd.set_option('display.float_format', '{:f}'.format)
Loading the data
beers = pd.read_csv('beer_reviews.csv')
beers.head()
| | index | brewery_id | brewery_name | review_time | review_overall | review_aroma | review_appearance | review_profilename | beer_style | review_palate | review_taste | beer_name | beer_abv | beer_beerid |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 10325 | Vecchio Birraio | 1234817823 | 1.500000 | 2.000000 | 2.500000 | stcules | Hefeweizen | 1.500000 | 1.500000 | Sausa Weizen | 5.000000 | 47986 |
1 | 1 | 10325 | Vecchio Birraio | 1235915097 | 3.000000 | 2.500000 | 3.000000 | stcules | English Strong Ale | 3.000000 | 3.000000 | Red Moon | 6.200000 | 48213 |
2 | 2 | 10325 | Vecchio Birraio | 1235916604 | 3.000000 | 2.500000 | 3.000000 | stcules | Foreign / Export Stout | 3.000000 | 3.000000 | Black Horse Black Beer | 6.500000 | 48215 |
3 | 3 | 10325 | Vecchio Birraio | 1234725145 | 3.000000 | 3.000000 | 3.500000 | stcules | German Pilsener | 2.500000 | 3.000000 | Sausa Pils | 5.000000 | 47969 |
4 | 4 | 1075 | Caldera Brewing Company | 1293735206 | 4.000000 | 4.500000 | 4.000000 | johnmichaelsen | American Double / Imperial IPA | 4.000000 | 4.500000 | Cauldron DIPA | 7.700000 | 64883 |
beers.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1586614 entries, 0 to 1586613
Data columns (total 14 columns):
 #   Column              Non-Null Count    Dtype
---  ------              --------------    -----
 0   index               1586614 non-null  int64
 1   brewery_id          1586614 non-null  int64
 2   brewery_name        1586599 non-null  object
 3   review_time         1586614 non-null  int64
 4   review_overall      1586614 non-null  float64
 5   review_aroma        1586614 non-null  float64
 6   review_appearance   1586614 non-null  float64
 7   review_profilename  1586266 non-null  object
 8   beer_style          1586614 non-null  object
 9   review_palate       1586614 non-null  float64
 10  review_taste        1586614 non-null  float64
 11  beer_name           1586614 non-null  object
 12  beer_abv            1518829 non-null  float64
 13  beer_beerid         1586614 non-null  int64
dtypes: float64(6), int64(4), object(4)
memory usage: 169.5+ MB
Cleaning
# Drop every row that lacks a brewery name, a reviewer profile name, or an ABV value.
beers.dropna(subset=['brewery_name', 'review_profilename', 'beer_abv'], inplace=True)
beers.isnull().sum()
index                 0
brewery_id            0
brewery_name          0
review_time           0
review_overall        0
review_aroma          0
review_appearance     0
review_profilename    0
beer_style            0
review_palate         0
review_taste          0
beer_name             0
beer_abv              0
beer_beerid           0
dtype: int64
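The `info()` output above implies 15 missing brewery names, 348 missing profile names, and 67785 missing ABV values (68148 gaps in total), while the row count falls from 1586614 to the 1518478 reported later in the notebook, a loss of 68136 rows (about 4.3%). The difference suggests that a handful of rows lack more than one of these fields. A minimal sketch of that bookkeeping on a toy frame with hypothetical values:

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in the same three columns (hypothetical values).
toy = pd.DataFrame({
    "brewery_name": ["A", None, "C", "D"],
    "review_profilename": ["u1", "u2", None, "u4"],
    "beer_abv": [5.0, 6.2, np.nan, 7.7],
})
required = ["brewery_name", "review_profilename", "beer_abv"]

# Gaps counted per column; one row can contribute to several columns.
missing_per_column = toy[required].isnull().sum()

# Rows removed when any of the required fields is missing.
cleaned = toy.dropna(subset=required)
rows_dropped = len(toy) - len(cleaned)

print(missing_per_column.sum(), rows_dropped)
```

Here the third row has two gaps, so the per-column total exceeds the number of rows dropped, which mirrors the pattern in the real data.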
Normalization
numeric_cols = ['review_overall', 'review_aroma', 'review_appearance', 'review_palate', 'review_taste', 'beer_abv', 'beer_beerid']
# The scaler is fitted on the full dataset, i.e. before the train/dev/test split.
scaler = MinMaxScaler()
beers[numeric_cols] = scaler.fit_transform(beers[numeric_cols])
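With its default settings, `MinMaxScaler` maps each column to the range [0, 1] through (x - min) / (max - min). The sketch below reproduces that formula with plain pandas on toy values, to show what the scaled numbers in the later tables mean. The function name `min_max_scale` is hypothetical, and unlike the scikit-learn class this simplified version yields NaN for a constant column.

```python
import pandas as pd


def min_max_scale(frame: pd.DataFrame) -> pd.DataFrame:
    """Rescale each column to [0, 1] via (x - min) / (max - min)."""
    col_min = frame.min()
    col_range = frame.max() - col_min
    return (frame - col_min) / col_range


# Toy values, not taken from the dataset.
toy = pd.DataFrame({
    "review_overall": [1.0, 3.0, 5.0],
    "beer_abv": [4.0, 5.0, 8.0],
})
scaled = min_max_scale(toy)
print(scaled)
```

A scaled value is therefore a position between the column's smallest and largest observed value, not a rating on the original scale.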
Splitting into subsets
beers_train, beers_dev_test = train_test_split(beers, test_size=0.2, random_state=1234)
beers_dev, beers_test = train_test_split(beers_dev_test, test_size=0.5, random_state=1234)
print(f"Number of columns in each subset: {beers.shape[1]}")
print(f"Total: {beers.shape[0]} records")
print(f"Train: {beers_train.shape[0]} records")
print(f"Dev: {beers_dev.shape[0]} records")
print(f"Test: {beers_test.shape[0]} records")
Number of columns in each subset: 14
Total: 1518478 records
Train: 1214782 records
Dev: 151848 records
Test: 151848 records
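The two `train_test_split` calls give an 80/10/10 split: 20% of the rows are held out first and then halved into dev and test. The notebook scales the data before splitting. A common alternative, sketched here on synthetic data, is to fit the scaler on the training rows only and reuse it for dev and test, so that held-out rows do not influence the scaling. This is an illustration under assumed toy values, not the notebook's own procedure.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the reviews table (hypothetical values).
rng = np.random.default_rng(1234)
toy = pd.DataFrame({
    "review_overall": rng.uniform(1, 5, size=1000),
    "beer_abv": rng.uniform(3, 12, size=1000),
})
cols = ["review_overall", "beer_abv"]

# 80% train, then split the remaining 20% in half: 10% dev, 10% test.
train, dev_test = train_test_split(toy, test_size=0.2, random_state=1234)
dev, test = train_test_split(dev_test, test_size=0.5, random_state=1234)

# Fit on the training rows only, then apply the same transform everywhere.
scaler = MinMaxScaler().fit(train[cols])
train_scaled = pd.DataFrame(scaler.transform(train[cols]), columns=cols, index=train.index)
dev_scaled = pd.DataFrame(scaler.transform(dev[cols]), columns=cols, index=dev.index)
test_scaled = pd.DataFrame(scaler.transform(test[cols]), columns=cols, index=test.index)

print(len(train), len(dev), len(test))
```

With this approach, dev and test values can fall slightly outside [0, 1] when they exceed the range seen in training.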
Data overview
print(f"Number of distinct beers: {beers['beer_name'].nunique()}")
print(f"Number of distinct styles: {beers['beer_style'].nunique()}")
print(f"Number of distinct breweries: {beers['brewery_name'].nunique()}")
Number of distinct beers: 44075
Number of distinct styles: 104
Number of distinct breweries: 5155
style_counts = beers['beer_style'].value_counts()
top_15_styles = style_counts.head(15)
plt.bar(top_15_styles.index, top_15_styles.values)
plt.xlabel('Style')
plt.ylabel('Number of reviews')
plt.title('Number of reviews for the 15 most common styles')
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()
# Mean (scaled) overall rating and number of reviews per beer name.
reviews = pd.DataFrame(beers.groupby('beer_name')['review_overall'].mean())
reviews['Number of reviews'] = beers.groupby('beer_name')['review_overall'].count()
reviews = reviews.sort_values(by='Number of reviews', ascending=False)
reviews.head()
beer_name | review_overall | Number of reviews |
---|---|---|
90 Minute IPA | 0.829097 | 3289 |
Old Rasputin Russian Imperial Stout | 0.834823 | 3110 |
Sierra Nevada Celebration Ale | 0.833711 | 2999 |
India Pale Ale | 0.770777 | 2960 |
Two Hearted Ale | 0.866043 | 2727 |
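The same per-beer summary can be computed in a single pass with named aggregation, which avoids grouping twice. This is a minimal sketch on toy reviews. The output column names `mean_rating` and `review_count` are assumptions chosen for illustration.

```python
import pandas as pd

# Toy reviews, not taken from the dataset.
toy = pd.DataFrame({
    "beer_name": ["A", "A", "A", "B", "B", "C"],
    "review_overall": [4.0, 5.0, 3.0, 2.0, 4.0, 5.0],
})

# One groupby, two aggregates, sorted by how often each beer was reviewed.
summary = (
    toy.groupby("beer_name")["review_overall"]
    .agg(mean_rating="mean", review_count="count")
    .sort_values("review_count", ascending=False)
)
print(summary)
```

Because grouping is by name only, beers from different breweries that share a name (for example a generic "India Pale Ale") are pooled together. Grouping by `beer_beerid` instead would keep them apart.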
beers[['review_overall', 'review_aroma', 'review_appearance', 'review_palate', 'review_taste', 'beer_abv', 'beer_beerid']].describe().applymap(lambda x: f"{x:0.3f}")
| | review_overall | review_aroma | review_appearance | review_palate | review_taste | beer_abv | beer_beerid |
---|---|---|---|---|---|---|---|
count | 1518478.000 | 1518478.000 | 1518478.000 | 1518478.000 | 1518478.000 | 1518478.000 | 1518478.000 |
mean | 0.765 | 0.687 | 0.770 | 0.688 | 0.701 | 0.122 | 0.277 |
std | 0.143 | 0.174 | 0.123 | 0.170 | 0.182 | 0.040 | 0.282 |
min | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
25% | 0.700 | 0.625 | 0.700 | 0.625 | 0.625 | 0.090 | 0.021 |
50% | 0.800 | 0.750 | 0.800 | 0.750 | 0.750 | 0.112 | 0.166 |
75% | 0.900 | 0.750 | 0.800 | 0.750 | 0.875 | 0.147 | 0.507 |
max | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
beers_train[['review_overall', 'review_aroma', 'review_appearance', 'review_palate', 'review_taste', 'beer_abv', 'beer_beerid']].describe().applymap(lambda x: f"{x:0.1f}")
| | review_overall | review_aroma | review_appearance | review_palate | review_taste | beer_abv | beer_beerid |
---|---|---|---|---|---|---|---|
count | 1214782.0 | 1214782.0 | 1214782.0 | 1214782.0 | 1214782.0 | 1214782.0 | 1214782.0 |
mean | 0.8 | 0.7 | 0.8 | 0.7 | 0.7 | 0.1 | 0.3 |
std | 0.1 | 0.2 | 0.1 | 0.2 | 0.2 | 0.0 | 0.3 |
min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
25% | 0.7 | 0.6 | 0.7 | 0.6 | 0.6 | 0.1 | 0.0 |
50% | 0.8 | 0.8 | 0.8 | 0.8 | 0.8 | 0.1 | 0.2 |
75% | 0.9 | 0.8 | 0.8 | 0.8 | 0.9 | 0.1 | 0.5 |
max | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
beers_dev[['review_overall', 'review_aroma', 'review_appearance', 'review_palate', 'review_taste', 'beer_abv', 'beer_beerid']].describe().applymap(lambda x: f"{x:0.1f}")
| | review_overall | review_aroma | review_appearance | review_palate | review_taste | beer_abv | beer_beerid |
---|---|---|---|---|---|---|---|
count | 151848.0 | 151848.0 | 151848.0 | 151848.0 | 151848.0 | 151848.0 | 151848.0 |
mean | 0.8 | 0.7 | 0.8 | 0.7 | 0.7 | 0.1 | 0.3 |
std | 0.1 | 0.2 | 0.1 | 0.2 | 0.2 | 0.0 | 0.3 |
min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
25% | 0.7 | 0.6 | 0.7 | 0.6 | 0.6 | 0.1 | 0.0 |
50% | 0.8 | 0.8 | 0.8 | 0.8 | 0.8 | 0.1 | 0.2 |
75% | 0.9 | 0.8 | 0.8 | 0.8 | 0.9 | 0.1 | 0.5 |
max | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.7 | 1.0 |
beers_test[['review_overall', 'review_aroma', 'review_appearance', 'review_palate', 'review_taste', 'beer_abv', 'beer_beerid']].describe().applymap(lambda x: f"{x:0.1f}")
| | review_overall | review_aroma | review_appearance | review_palate | review_taste | beer_abv | beer_beerid |
---|---|---|---|---|---|---|---|
count | 151848.0 | 151848.0 | 151848.0 | 151848.0 | 151848.0 | 151848.0 | 151848.0 |
mean | 0.8 | 0.7 | 0.8 | 0.7 | 0.7 | 0.1 | 0.3 |
std | 0.1 | 0.2 | 0.1 | 0.2 | 0.2 | 0.0 | 0.3 |
min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
25% | 0.7 | 0.6 | 0.7 | 0.6 | 0.6 | 0.1 | 0.0 |
50% | 0.8 | 0.8 | 0.8 | 0.8 | 0.8 | 0.1 | 0.2 |
75% | 0.9 | 0.8 | 0.8 | 0.8 | 0.9 | 0.1 | 0.5 |
max | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.7 | 1.0 |
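The three per-split tables are nearly identical. The one visible difference is the `beer_abv` maximum, which is 1.0 in train but 0.7 in dev and test, so the single strongest beer landed in the training set. Placing the statistics for one column side by side makes such differences easier to spot. The helper `compare_splits` below is a hypothetical sketch, shown on toy frames.

```python
import pandas as pd


def compare_splits(splits, column):
    """Collect describe() for one column from every split into a single table."""
    return pd.DataFrame(
        {name: frame[column].describe() for name, frame in splits.items()}
    )


# Toy splits with hypothetical values.
toy_splits = {
    "train": pd.DataFrame({"beer_abv": [0.0, 0.1, 1.0]}),
    "dev": pd.DataFrame({"beer_abv": [0.1, 0.7]}),
}
comparison = compare_splits(toy_splits, "beer_abv")
print(comparison)
```

In the notebook this could be called as `compare_splits({"train": beers_train, "dev": beers_dev, "test": beers_test}, "beer_abv")`.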