Downloading the dataset and packages
%pip install kaggle
%pip install pandas
%pip install numpy
%pip install scikit-learn
%pip install seaborn
Defaulting to user installation because normal site-packages is not writeable Requirement already satisfied: kaggle in /home/students/s464979/.local/lib/python3.9/site-packages (1.6.6) Requirement already satisfied: bleach in /usr/local/lib/python3.9/dist-packages (from kaggle) (5.0.1) Requirement already satisfied: certifi in /usr/local/lib/python3.9/dist-packages (from kaggle) (2022.9.14) Requirement already satisfied: python-dateutil in /usr/local/lib/python3.9/dist-packages (from kaggle) (2.8.2) Requirement already satisfied: python-slugify in /home/students/s464979/.local/lib/python3.9/site-packages (from kaggle) (8.0.4) Requirement already satisfied: requests in /usr/local/lib/python3.9/dist-packages (from kaggle) (2.28.1) Requirement already satisfied: six>=1.10 in /usr/lib/python3/dist-packages (from kaggle) (1.16.0) Requirement already satisfied: tqdm in /usr/local/lib/python3.9/dist-packages (from kaggle) (4.64.1) Requirement already satisfied: urllib3 in /usr/local/lib/python3.9/dist-packages (from kaggle) (1.26.12) Requirement already satisfied: webencodings in /usr/local/lib/python3.9/dist-packages (from bleach->kaggle) (0.5.1) Requirement already satisfied: text-unidecode>=1.3 in /home/students/s464979/.local/lib/python3.9/site-packages (from python-slugify->kaggle) (1.3) Requirement already satisfied: charset-normalizer<3,>=2 in /usr/local/lib/python3.9/dist-packages (from requests->kaggle) (2.1.1) Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.9/dist-packages (from requests->kaggle) (3.4) Note: you may need to restart the kernel to use updated packages. 
Defaulting to user installation because normal site-packages is not writeable Requirement already satisfied: pandas in /usr/local/lib/python3.9/dist-packages (1.3.5) Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.9/dist-packages (from pandas) (2.8.2) Requirement already satisfied: pytz>=2017.3 in /usr/lib/python3/dist-packages (from pandas) (2021.1) Requirement already satisfied: numpy>=1.17.3 in /home/students/s464979/.local/lib/python3.9/site-packages (from pandas) (1.26.4) Requirement already satisfied: six>=1.5 in /usr/lib/python3/dist-packages (from python-dateutil>=2.7.3->pandas) (1.16.0) Note: you may need to restart the kernel to use updated packages. Defaulting to user installation because normal site-packages is not writeable Requirement already satisfied: numpy in /home/students/s464979/.local/lib/python3.9/site-packages (1.26.4) Note: you may need to restart the kernel to use updated packages. Defaulting to user installation because normal site-packages is not writeable Requirement already satisfied: scikit-learn in /usr/lib/python3/dist-packages (0.23.2) Note: you may need to restart the kernel to use updated packages. 
Defaulting to user installation because normal site-packages is not writeable Requirement already satisfied: seaborn in /usr/local/lib/python3.9/dist-packages (0.12.0) Requirement already satisfied: numpy>=1.17 in /home/students/s464979/.local/lib/python3.9/site-packages (from seaborn) (1.26.4) Requirement already satisfied: pandas>=0.25 in /usr/local/lib/python3.9/dist-packages (from seaborn) (1.3.5) Requirement already satisfied: matplotlib>=3.1 in /usr/local/lib/python3.9/dist-packages (from seaborn) (3.6.2) Requirement already satisfied: contourpy>=1.0.1 in /usr/local/lib/python3.9/dist-packages (from matplotlib>=3.1->seaborn) (1.0.6) Requirement already satisfied: cycler>=0.10 in /usr/lib/python3/dist-packages (from matplotlib>=3.1->seaborn) (0.10.0) Requirement already satisfied: fonttools>=4.22.0 in /usr/local/lib/python3.9/dist-packages (from matplotlib>=3.1->seaborn) (4.38.0) Requirement already satisfied: kiwisolver>=1.0.1 in /usr/lib/python3/dist-packages (from matplotlib>=3.1->seaborn) (1.3.1) Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.9/dist-packages (from matplotlib>=3.1->seaborn) (21.3) Requirement already satisfied: pillow>=6.2.0 in /home/students/s464979/.local/lib/python3.9/site-packages (from matplotlib>=3.1->seaborn) (10.2.0) Requirement already satisfied: pyparsing>=2.2.1 in /usr/lib/python3/dist-packages (from matplotlib>=3.1->seaborn) (2.4.7) Requirement already satisfied: python-dateutil>=2.7 in /usr/local/lib/python3.9/dist-packages (from matplotlib>=3.1->seaborn) (2.8.2) Requirement already satisfied: pytz>=2017.3 in /usr/lib/python3/dist-packages (from pandas>=0.25->seaborn) (2021.1) Requirement already satisfied: six>=1.5 in /usr/lib/python3/dist-packages (from python-dateutil>=2.7->matplotlib>=3.1->seaborn) (1.16.0) Note: you may need to restart the kernel to use updated packages.
!kaggle datasets download -d thedevastator/1-5-million-beer-reviews-from-beer-advocate
!unzip -o 1-5-million-beer-reviews-from-beer-advocate.zip
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
pd.set_option('display.float_format', '{:f}'.format)
Loading the data
beers=pd.read_csv('beer_reviews.csv')
beers.head()
| | index | brewery_id | brewery_name | review_time | review_overall | review_aroma | review_appearance | review_profilename | beer_style | review_palate | review_taste | beer_name | beer_abv | beer_beerid |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 10325 | Vecchio Birraio | 1234817823 | 1.500000 | 2.000000 | 2.500000 | stcules | Hefeweizen | 1.500000 | 1.500000 | Sausa Weizen | 5.000000 | 47986 |
1 | 1 | 10325 | Vecchio Birraio | 1235915097 | 3.000000 | 2.500000 | 3.000000 | stcules | English Strong Ale | 3.000000 | 3.000000 | Red Moon | 6.200000 | 48213 |
2 | 2 | 10325 | Vecchio Birraio | 1235916604 | 3.000000 | 2.500000 | 3.000000 | stcules | Foreign / Export Stout | 3.000000 | 3.000000 | Black Horse Black Beer | 6.500000 | 48215 |
3 | 3 | 10325 | Vecchio Birraio | 1234725145 | 3.000000 | 3.000000 | 3.500000 | stcules | German Pilsener | 2.500000 | 3.000000 | Sausa Pils | 5.000000 | 47969 |
4 | 4 | 1075 | Caldera Brewing Company | 1293735206 | 4.000000 | 4.500000 | 4.000000 | johnmichaelsen | American Double / Imperial IPA | 4.000000 | 4.500000 | Cauldron DIPA | 7.700000 | 64883 |
beers.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1586614 entries, 0 to 1586613
Data columns (total 14 columns):
 #   Column              Non-Null Count    Dtype
---  ------              --------------    -----
 0   index               1586614 non-null  int64
 1   brewery_id          1586614 non-null  int64
 2   brewery_name        1586599 non-null  object
 3   review_time         1586614 non-null  int64
 4   review_overall      1586614 non-null  float64
 5   review_aroma        1586614 non-null  float64
 6   review_appearance   1586614 non-null  float64
 7   review_profilename  1586266 non-null  object
 8   beer_style          1586614 non-null  object
 9   review_palate       1586614 non-null  float64
 10  review_taste        1586614 non-null  float64
 11  beer_name           1586614 non-null  object
 12  beer_abv            1518829 non-null  float64
 13  beer_beerid         1586614 non-null  int64
dtypes: float64(6), int64(4), object(4)
memory usage: 169.5+ MB
Cleaning
# Drop rows with a missing brewery name, reviewer profile name or ABV
beers.dropna(subset=['brewery_name', 'review_profilename', 'beer_abv'], inplace=True)
beers.isnull().sum()
index                 0
brewery_id            0
brewery_name          0
review_time           0
review_overall        0
review_aroma          0
review_appearance     0
review_profilename    0
beer_style            0
review_palate         0
review_taste          0
beer_name             0
beer_abv              0
beer_beerid           0
dtype: int64
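According to the `info()` output, 15 brewery names, 348 reviewer profile names and 67785 ABV values are missing before cleaning. The effect of the cleaning step can be checked by counting missing values per column beforehand and the rows removed afterwards. A minimal sketch on a small made-up frame (the `toy` data and the `required` list are illustrative assumptions, not part of the notebook):

```python
import numpy as np
import pandas as pd

# Toy stand-in for beer_reviews.csv (hypothetical values).
toy = pd.DataFrame({
    'brewery_name': ['A', None, 'C', 'D'],
    'review_profilename': ['u1', 'u2', None, 'u4'],
    'beer_abv': [5.0, 6.2, 4.8, np.nan],
    'review_overall': [3.0, 4.0, 4.5, 2.5],
})

required = ['brewery_name', 'review_profilename', 'beer_abv']

# Missing values per required column before cleaning.
missing_before = toy[required].isnull().sum()

# Drop every row that lacks any of the required fields.
cleaned = toy.dropna(subset=required)
rows_dropped = len(toy) - len(cleaned)

print(missing_before)
print(f"Dropped {rows_dropped} of {len(toy)} rows")
```

Because one row can have several missing fields, the number of dropped rows may be smaller than the sum of the per-column counts.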
Normalization
scaler = MinMaxScaler()
# Columns scaled to [0, 1]; note that beer_beerid is an identifier, not a measurement
numeric_cols = ['review_overall', 'review_aroma', 'review_appearance', 'review_palate', 'review_taste', 'beer_abv', 'beer_beerid']
beers[numeric_cols] = scaler.fit_transform(beers[numeric_cols])
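Fitting the scaler on the whole table means its minimum and maximum are computed from rows that later end up in the dev and test subsets. A common alternative is to split first and fit only on the training part. The sketch below illustrates that on made-up data; `toy`, `cols` and the values are assumptions, and it is not the approach used in this notebook:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical miniature dataset.
toy = pd.DataFrame({
    'review_overall': [1.0, 2.0, 3.0, 4.0, 5.0, 3.5, 4.5, 2.5, 4.0, 3.0],
    'beer_abv': [4.5, 5.0, 6.2, 7.7, 9.0, 5.5, 8.0, 4.8, 6.5, 5.2],
})
cols = ['review_overall', 'beer_abv']

# Split first, then copy so the assignments below do not touch `toy`.
train, rest = train_test_split(toy, test_size=0.2, random_state=1234)
train, rest = train.copy(), rest.copy()

scaler = MinMaxScaler()
train[cols] = scaler.fit_transform(train[cols])  # min/max learned from train only
rest[cols] = scaler.transform(rest[cols])        # reuse the train min/max

print(train[cols].describe().loc[['min', 'max']])
```

With this variant, values in the held-out rows that lie outside the training range map outside [0, 1], which is expected.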
Splitting into subsets
beers_train, beers_dev_test = train_test_split(beers, test_size=0.2, random_state=1234)
beers_dev, beers_test = train_test_split(beers_dev_test, test_size=0.5, random_state=1234)
print(f"Number of columns in each set: {beers.shape[1]}")
print(f"Total: {beers.shape[0]} records")
print(f"Train: {beers_train.shape[0]} records")
print(f"Dev: {beers_dev.shape[0]} records")
print(f"Test: {beers_test.shape[0]} records")
Number of columns in each set: 14
Total: 1518478 records
Train: 1214782 records
Dev: 151848 records
Test: 151848 records
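The subset sizes follow from the two `test_size` values: 20% of the rows are held out, and that held-out part is halved into dev and test, giving roughly an 80/10/10 split. The arithmetic can be reproduced as below, assuming that the held-out fraction is rounded up and the remainder goes to the first part, which matches the counts reported above:

```python
import math

n_total = 1518478  # rows left after cleaning, taken from the output above

# First split: test_size=0.2, held-out part rounded up.
n_dev_test = math.ceil(0.2 * n_total)
n_train = n_total - n_dev_test

# Second split: test_size=0.5 of the held-out part.
n_test = math.ceil(0.5 * n_dev_test)
n_dev = n_dev_test - n_test

print(n_train, n_dev, n_test)  # 1214782 151848 151848
```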
Data overview
print(f"Number of distinct beers: {beers['beer_name'].nunique()}")
print(f"Number of distinct styles: {beers['beer_style'].nunique()}")
print(f"Number of distinct breweries: {beers['brewery_name'].nunique()}")
Number of distinct beers: 44075
Number of distinct styles: 104
Number of distinct breweries: 5155
style_counts = beers['beer_style'].value_counts()
top_15_styles = style_counts.head(15)
plt.bar(top_15_styles.index, top_15_styles.values)
plt.xlabel('Style')
# value_counts() counts rows, i.e. reviews, not distinct beers
plt.ylabel('Number of reviews')
plt.title('Number of reviews for the 15 most common styles')
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()
# Mean (normalized) overall rating and number of reviews per beer name
reviews = pd.DataFrame(beers.groupby('beer_name')['review_overall'].mean())
reviews['Number of reviews'] = beers.groupby('beer_name')['review_overall'].count()
reviews = reviews.sort_values(by='Number of reviews', ascending=False)
reviews.head()
beer_name | review_overall | Number of reviews |
---|---|---|
90 Minute IPA | 0.829097 | 3289 |
Old Rasputin Russian Imperial Stout | 0.834823 | 3110 |
Sierra Nevada Celebration Ale | 0.833711 | 2999 |
India Pale Ale | 0.770777 | 2960 |
Two Hearted Ale | 0.866043 | 2727 |
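Grouping by `beer_name` alone pools the reviews of all beers that share a name, and a generic name such as "India Pale Ale" may be used by more than one brewery. One way to keep such beers apart, sketched here on made-up rows, is to group by brewery and beer name together and compute both statistics in a single `agg` call (the result column names `mean_overall` and `n_reviews` are illustrative):

```python
import pandas as pd

# Hypothetical reviews: two breweries share the name "India Pale Ale".
toy = pd.DataFrame({
    'brewery_name': ['Brewery A', 'Brewery A', 'Brewery B', 'Brewery B', 'Brewery B'],
    'beer_name': ['India Pale Ale', 'India Pale Ale', 'India Pale Ale', 'Stout', 'Stout'],
    'review_overall': [4.0, 5.0, 2.0, 3.0, 4.0],
})

# Mean rating and review count per (brewery, beer) pair.
per_beer = (
    toy.groupby(['brewery_name', 'beer_name'])['review_overall']
       .agg(mean_overall='mean', n_reviews='count')
       .sort_values('n_reviews', ascending=False)
)
print(per_beer)
```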
beers[['review_overall', 'review_aroma', 'review_appearance', 'review_palate', 'review_taste', 'beer_abv', 'beer_beerid']].describe().applymap(lambda x: f"{x:0.3f}")
| | review_overall | review_aroma | review_appearance | review_palate | review_taste | beer_abv | beer_beerid |
---|---|---|---|---|---|---|---|
count | 1518478.000 | 1518478.000 | 1518478.000 | 1518478.000 | 1518478.000 | 1518478.000 | 1518478.000 |
mean | 0.765 | 0.687 | 0.770 | 0.688 | 0.701 | 0.122 | 0.277 |
std | 0.143 | 0.174 | 0.123 | 0.170 | 0.182 | 0.040 | 0.282 |
min | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
25% | 0.700 | 0.625 | 0.700 | 0.625 | 0.625 | 0.090 | 0.021 |
50% | 0.800 | 0.750 | 0.800 | 0.750 | 0.750 | 0.112 | 0.166 |
75% | 0.900 | 0.750 | 0.800 | 0.750 | 0.875 | 0.147 | 0.507 |
max | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
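All statistics in this table are on the normalized [0, 1] scale, so a mean of 0.765 is not a rating in the original units. A fitted `MinMaxScaler` can map such values back with `inverse_transform`. The sketch below shows the idea on a single made-up rating column; the `ratings` values are assumptions and do not come from the dataset:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical ratings on the original scale, one column.
ratings = np.array([[1.0], [3.0], [4.0], [5.0]])

scaler = MinMaxScaler()
scaled = scaler.fit_transform(ratings)

# Mean on the normalized scale, kept 2-D because the scaler expects 2-D input.
scaled_mean = scaled.mean(axis=0, keepdims=True)

# Map the normalized mean back to the original units.
original_mean = scaler.inverse_transform(scaled_mean)

print(scaled_mean[0, 0], original_mean[0, 0])
```

Because min-max scaling is a linear transformation, the back-transformed mean equals the mean of the original values.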
beers_train[['review_overall', 'review_aroma', 'review_appearance', 'review_palate', 'review_taste', 'beer_abv', 'beer_beerid']].describe().applymap(lambda x: f"{x:0.1f}")
| | review_overall | review_aroma | review_appearance | review_palate | review_taste | beer_abv | beer_beerid |
---|---|---|---|---|---|---|---|
count | 1214782.0 | 1214782.0 | 1214782.0 | 1214782.0 | 1214782.0 | 1214782.0 | 1214782.0 |
mean | 0.8 | 0.7 | 0.8 | 0.7 | 0.7 | 0.1 | 0.3 |
std | 0.1 | 0.2 | 0.1 | 0.2 | 0.2 | 0.0 | 0.3 |
min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
25% | 0.7 | 0.6 | 0.7 | 0.6 | 0.6 | 0.1 | 0.0 |
50% | 0.8 | 0.8 | 0.8 | 0.8 | 0.8 | 0.1 | 0.2 |
75% | 0.9 | 0.8 | 0.8 | 0.8 | 0.9 | 0.1 | 0.5 |
max | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
beers_dev[['review_overall', 'review_aroma', 'review_appearance', 'review_palate', 'review_taste', 'beer_abv', 'beer_beerid']].describe().applymap(lambda x: f"{x:0.1f}")
| | review_overall | review_aroma | review_appearance | review_palate | review_taste | beer_abv | beer_beerid |
---|---|---|---|---|---|---|---|
count | 151848.0 | 151848.0 | 151848.0 | 151848.0 | 151848.0 | 151848.0 | 151848.0 |
mean | 0.8 | 0.7 | 0.8 | 0.7 | 0.7 | 0.1 | 0.3 |
std | 0.1 | 0.2 | 0.1 | 0.2 | 0.2 | 0.0 | 0.3 |
min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
25% | 0.7 | 0.6 | 0.7 | 0.6 | 0.6 | 0.1 | 0.0 |
50% | 0.8 | 0.8 | 0.8 | 0.8 | 0.8 | 0.1 | 0.2 |
75% | 0.9 | 0.8 | 0.8 | 0.8 | 0.9 | 0.1 | 0.5 |
max | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.7 | 1.0 |
beers_test[['review_overall', 'review_aroma', 'review_appearance', 'review_palate', 'review_taste', 'beer_abv', 'beer_beerid']].describe().applymap(lambda x: f"{x:0.1f}")
| | review_overall | review_aroma | review_appearance | review_palate | review_taste | beer_abv | beer_beerid |
---|---|---|---|---|---|---|---|
count | 151848.0 | 151848.0 | 151848.0 | 151848.0 | 151848.0 | 151848.0 | 151848.0 |
mean | 0.8 | 0.7 | 0.8 | 0.7 | 0.7 | 0.1 | 0.3 |
std | 0.1 | 0.2 | 0.1 | 0.2 | 0.2 | 0.0 | 0.3 |
min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
25% | 0.7 | 0.6 | 0.7 | 0.6 | 0.6 | 0.1 | 0.0 |
50% | 0.8 | 0.8 | 0.8 | 0.8 | 0.8 | 0.1 | 0.2 |
75% | 0.9 | 0.8 | 0.8 | 0.8 | 0.9 | 0.1 | 0.5 |
max | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.7 | 1.0 |
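Rounded to one decimal place, the three subset tables are nearly identical; the only visible difference is the `beer_abv` maximum (1.0 in train, 0.7 in dev and test). Placing the subsets side by side for one column makes such differences easier to spot. A minimal sketch with made-up miniature subsets (the `splits` dictionary and its values are assumptions):

```python
import pandas as pd

# Hypothetical miniature subsets with an already normalized beer_abv column.
splits = {
    'train': pd.DataFrame({'beer_abv': [0.10, 0.12, 0.15, 1.00]}),
    'dev': pd.DataFrame({'beer_abv': [0.09, 0.11, 0.70]}),
    'test': pd.DataFrame({'beer_abv': [0.08, 0.13, 0.65]}),
}

# One describe() column per subset, aligned on the statistic name.
summary = pd.concat(
    {name: part['beer_abv'].describe() for name, part in splits.items()},
    axis=1,
)
print(summary.round(3))
```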