ium_464979/IUM_02.ipynb at e559387df3016d487796f87d72647f31bcc86933

Downloading the dataset and packages

%pip install kaggle
%pip install pandas
%pip install numpy
%pip install scikit-learn
%pip install seaborn

Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: kaggle in /home/students/s464979/.local/lib/python3.9/site-packages (1.6.6)
Requirement already satisfied: bleach in /usr/local/lib/python3.9/dist-packages (from kaggle) (5.0.1)
Requirement already satisfied: certifi in /usr/local/lib/python3.9/dist-packages (from kaggle) (2022.9.14)
Requirement already satisfied: python-dateutil in /usr/local/lib/python3.9/dist-packages (from kaggle) (2.8.2)
Requirement already satisfied: python-slugify in /home/students/s464979/.local/lib/python3.9/site-packages (from kaggle) (8.0.4)
Requirement already satisfied: requests in /usr/local/lib/python3.9/dist-packages (from kaggle) (2.28.1)
Requirement already satisfied: six>=1.10 in /usr/lib/python3/dist-packages (from kaggle) (1.16.0)
Requirement already satisfied: tqdm in /usr/local/lib/python3.9/dist-packages (from kaggle) (4.64.1)
Requirement already satisfied: urllib3 in /usr/local/lib/python3.9/dist-packages (from kaggle) (1.26.12)
Requirement already satisfied: webencodings in /usr/local/lib/python3.9/dist-packages (from bleach->kaggle) (0.5.1)
Requirement already satisfied: text-unidecode>=1.3 in /home/students/s464979/.local/lib/python3.9/site-packages (from python-slugify->kaggle) (1.3)
Requirement already satisfied: charset-normalizer<3,>=2 in /usr/local/lib/python3.9/dist-packages (from requests->kaggle) (2.1.1)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.9/dist-packages (from requests->kaggle) (3.4)
Note: you may need to restart the kernel to use updated packages.
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: pandas in /usr/local/lib/python3.9/dist-packages (1.3.5)
Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.9/dist-packages (from pandas) (2.8.2)
Requirement already satisfied: pytz>=2017.3 in /usr/lib/python3/dist-packages (from pandas) (2021.1)
Requirement already satisfied: numpy>=1.17.3 in /home/students/s464979/.local/lib/python3.9/site-packages (from pandas) (1.26.4)
Requirement already satisfied: six>=1.5 in /usr/lib/python3/dist-packages (from python-dateutil>=2.7.3->pandas) (1.16.0)
Note: you may need to restart the kernel to use updated packages.
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: numpy in /home/students/s464979/.local/lib/python3.9/site-packages (1.26.4)
Note: you may need to restart the kernel to use updated packages.
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: scikit-learn in /usr/lib/python3/dist-packages (0.23.2)
Note: you may need to restart the kernel to use updated packages.
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: seaborn in /usr/local/lib/python3.9/dist-packages (0.12.0)
Requirement already satisfied: numpy>=1.17 in /home/students/s464979/.local/lib/python3.9/site-packages (from seaborn) (1.26.4)
Requirement already satisfied: pandas>=0.25 in /usr/local/lib/python3.9/dist-packages (from seaborn) (1.3.5)
Requirement already satisfied: matplotlib>=3.1 in /usr/local/lib/python3.9/dist-packages (from seaborn) (3.6.2)
Requirement already satisfied: contourpy>=1.0.1 in /usr/local/lib/python3.9/dist-packages (from matplotlib>=3.1->seaborn) (1.0.6)
Requirement already satisfied: cycler>=0.10 in /usr/lib/python3/dist-packages (from matplotlib>=3.1->seaborn) (0.10.0)
Requirement already satisfied: fonttools>=4.22.0 in /usr/local/lib/python3.9/dist-packages (from matplotlib>=3.1->seaborn) (4.38.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/lib/python3/dist-packages (from matplotlib>=3.1->seaborn) (1.3.1)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.9/dist-packages (from matplotlib>=3.1->seaborn) (21.3)
Requirement already satisfied: pillow>=6.2.0 in /home/students/s464979/.local/lib/python3.9/site-packages (from matplotlib>=3.1->seaborn) (10.2.0)
Requirement already satisfied: pyparsing>=2.2.1 in /usr/lib/python3/dist-packages (from matplotlib>=3.1->seaborn) (2.4.7)
Requirement already satisfied: python-dateutil>=2.7 in /usr/local/lib/python3.9/dist-packages (from matplotlib>=3.1->seaborn) (2.8.2)
Requirement already satisfied: pytz>=2017.3 in /usr/lib/python3/dist-packages (from pandas>=0.25->seaborn) (2021.1)
Requirement already satisfied: six>=1.5 in /usr/lib/python3/dist-packages (from python-dateutil>=2.7->matplotlib>=3.1->seaborn) (1.16.0)
Note: you may need to restart the kernel to use updated packages.

!kaggle datasets download -d thedevastator/1-5-million-beer-reviews-from-beer-advocate

!unzip -o 1-5-million-beer-reviews-from-beer-advocate.zip

import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler

pd.set_option('float_format', '{:f}'.format)

Loading the data

beers = pd.read_csv('beer_reviews.csv')

beers.head()

	index	brewery_id	brewery_name	review_time	review_overall	review_aroma	review_appearance	review_profilename	beer_style	review_palate	review_taste	beer_name	beer_abv	beer_beerid
0	0	10325	Vecchio Birraio	1234817823	1.500000	2.000000	2.500000	stcules	Hefeweizen	1.500000	1.500000	Sausa Weizen	5.000000	47986
1	1	10325	Vecchio Birraio	1235915097	3.000000	2.500000	3.000000	stcules	English Strong Ale	3.000000	3.000000	Red Moon	6.200000	48213
2	2	10325	Vecchio Birraio	1235916604	3.000000	2.500000	3.000000	stcules	Foreign / Export Stout	3.000000	3.000000	Black Horse Black Beer	6.500000	48215
3	3	10325	Vecchio Birraio	1234725145	3.000000	3.000000	3.500000	stcules	German Pilsener	2.500000	3.000000	Sausa Pils	5.000000	47969
4	4	1075	Caldera Brewing Company	1293735206	4.000000	4.500000	4.000000	johnmichaelsen	American Double / Imperial IPA	4.000000	4.500000	Cauldron DIPA	7.700000	64883

beers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1586614 entries, 0 to 1586613
Data columns (total 14 columns):
 #   Column              Non-Null Count    Dtype  
---  ------              --------------    -----  
 0   index               1586614 non-null  int64  
 1   brewery_id          1586614 non-null  int64  
 2   brewery_name        1586599 non-null  object 
 3   review_time         1586614 non-null  int64  
 4   review_overall      1586614 non-null  float64
 5   review_aroma        1586614 non-null  float64
 6   review_appearance   1586614 non-null  float64
 7   review_profilename  1586266 non-null  object 
 8   beer_style          1586614 non-null  object 
 9   review_palate       1586614 non-null  float64
 10  review_taste        1586614 non-null  float64
 11  beer_name           1586614 non-null  object 
 12  beer_abv            1518829 non-null  float64
 13  beer_beerid         1586614 non-null  int64  
dtypes: float64(6), int64(4), object(4)
memory usage: 169.5+ MB
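The `info()` output above shows three columns with fewer non-null entries than rows: `brewery_name` (15 missing), `review_profilename` (348 missing) and `beer_abv` (67785 missing, about 4.3% of rows). A minimal sketch of how the same figures could be tabulated directly; the `toy` frame and its values are made up for illustration and stand in for `beers`:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `beers`; the values are illustrative only.
toy = pd.DataFrame({
    "brewery_name": ["A", None, "C", "D"],
    "beer_abv": [5.0, 6.2, np.nan, np.nan],
    "review_overall": [1.5, 3.0, 3.0, 4.0],
})

# Count and share of missing values per column, largest share first.
missing = pd.DataFrame({
    "missing": toy.isna().sum(),
    "share": toy.isna().mean(),
}).sort_values("share", ascending=False)

print(missing)
```

On the real data, replacing `toy` with `beers` should give the per-column counts that `info()` only shows indirectly.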

Cleaning

# Drop rows with a missing value in any of the three incomplete columns.
beers.dropna(subset=['brewery_name', 'review_profilename', 'beer_abv'], inplace=True)

beers.isnull().sum()

index                 0
brewery_id            0
brewery_name          0
review_time           0
review_overall        0
review_aroma          0
review_appearance     0
review_profilename    0
beer_style            0
review_palate         0
review_taste          0
beer_name             0
beer_abv              0
beer_beerid           0
dtype: int64
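Cleaning reduces the dataset from 1586614 to 1518478 rows, so 68136 rows (about 4.3%) are removed, almost all of them because of a missing `beer_abv`. The sketch below shows one way to report that loss alongside the drop; it uses a made-up `toy` frame and a non-in-place `dropna`, which is an assumption of this example rather than what the notebook does:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `beers`; the values are illustrative only.
toy = pd.DataFrame({
    "brewery_name": ["A", None, "C", "D", "E"],
    "review_profilename": ["u1", "u2", None, "u4", "u5"],
    "beer_abv": [5.0, 6.2, 6.5, np.nan, 7.7],
})

required = ["brewery_name", "review_profilename", "beer_abv"]

before = len(toy)
cleaned = toy.dropna(subset=required)  # returns a new frame, `toy` is unchanged
dropped = before - len(cleaned)

print(f"Dropped {dropped} of {before} rows ({dropped / before:.1%})")
```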

Normalization

scaler = MinMaxScaler()

# Columns rescaled to the [0, 1] range. beer_beerid is an identifier, so its scaled value has no numeric meaning.
num_cols = ['review_overall', 'review_aroma', 'review_appearance', 'review_palate', 'review_taste', 'beer_abv', 'beer_beerid']
beers[num_cols] = scaler.fit_transform(beers[num_cols])
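With its default `feature_range=(0, 1)`, `MinMaxScaler` maps each column to `(x - min) / (max - min)`. The sketch below checks this on a few hypothetical ABV values (not taken from the dataset) and shows how `inverse_transform` could be used to read scaled statistics back in the original units:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical ABV values in percent, one column; not taken from the dataset.
abv = np.array([[4.0], [5.0], [6.5], [10.0]])

scaler = MinMaxScaler()
scaled = scaler.fit_transform(abv)

# The default scaling is (x - min) / (max - min), computed per column.
manual = (abv - abv.min()) / (abv.max() - abv.min())
print(np.allclose(scaled, manual))

# inverse_transform maps scaled values back to the original units.
restored = scaler.inverse_transform(scaled)
print(np.allclose(restored, abv))
```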

Splitting into subsets

beers_train, beers_dev_test = train_test_split(beers, test_size=0.2, random_state=1234)
beers_dev, beers_test = train_test_split(beers_dev_test, test_size=0.5, random_state=1234)

print(f"Number of columns in each set: {beers.shape[1]}")
print(f"Total: {beers.shape[0]} records")
print(f"Train: {beers_train.shape[0]} records")
print(f"Dev: {beers_dev.shape[0]} records")
print(f"Test: {beers_test.shape[0]} records")

Number of columns in each set: 14
Total: 1518478 records
Train: 1214782 records
Dev: 151848 records
Test: 151848 records
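The two `train_test_split` calls give an 80/10/10 split: 20% of 1518478 rows is held out (303696 rows) and then halved into dev and test. Because the notebook fits the scaler on the full dataset before splitting, the minimum and maximum used for scaling also come from dev and test rows. A common alternative, sketched below on made-up data, is to split first and fit the scaler on the training part only; this is an illustration of that ordering, not what the notebook does:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for `beers`; the values are illustrative only.
rng = np.random.default_rng(1234)
toy = pd.DataFrame({
    "review_overall": rng.choice(np.arange(1.0, 5.5, 0.5), size=1000),
    "beer_abv": rng.uniform(3.0, 12.0, size=1000),
})
cols = ["review_overall", "beer_abv"]

# Same 80/10/10 scheme as in the notebook: split first...
train, dev_test = train_test_split(toy, test_size=0.2, random_state=1234)
dev, test = train_test_split(dev_test, test_size=0.5, random_state=1234)
train, dev, test = train.copy(), dev.copy(), test.copy()

# ...then learn min and max from the training part only and reuse them.
scaler = MinMaxScaler().fit(train[cols])
for part in (train, dev, test):
    part[cols] = scaler.transform(part[cols])

print(len(train), len(dev), len(test))
```

With this ordering, dev and test values can fall slightly outside the [0, 1] range when they exceed the training minimum or maximum.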

Data overview

print(f"Number of distinct beer names: {beers['beer_name'].nunique()}")
print(f"Number of distinct styles: {beers['beer_style'].nunique()}")
print(f"Number of distinct brewery names: {beers['brewery_name'].nunique()}")

Number of distinct beer names: 44075
Number of distinct styles: 104
Number of distinct brewery names: 5155

style_counts = beers['beer_style'].value_counts()

top_15_styles = style_counts.head(15) 

plt.bar(top_15_styles.index, top_15_styles.values)
plt.xlabel('Style')
plt.ylabel('Number of reviews')
plt.title('Number of reviews for the 15 most common styles')
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()

# Mean (scaled) overall rating and number of reviews per beer name, most reviewed first.
reviews = pd.DataFrame(beers.groupby('beer_name')['review_overall'].mean())
reviews['Number of reviews'] = beers.groupby('beer_name')['review_overall'].count()
reviews = reviews.sort_values(by=['Number of reviews'], ascending=False)
reviews.head()

	review_overall	Number of reviews
beer_name
90 Minute IPA	0.829097	3289
Old Rasputin Russian Imperial Stout	0.834823	3110
Sierra Nevada Celebration Ale	0.833711	2999
India Pale Ale	0.770777	2960
Two Hearted Ale	0.866043	2727
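The mean and the count above come from two separate `groupby` passes. A minimal sketch of computing both in one pass with named aggregation, on made-up data; the column names `mean_rating` and `n_reviews` and the `min_reviews` threshold are hypothetical choices for this example:

```python
import pandas as pd

# Toy frame standing in for `beers`; the values are illustrative only.
toy = pd.DataFrame({
    "beer_name": ["A", "A", "A", "B", "B", "C"],
    "review_overall": [4.0, 4.5, 5.0, 3.0, 4.0, 5.0],
})

# Mean rating and review count per beer in a single groupby pass.
summary = (
    toy.groupby("beer_name")["review_overall"]
    .agg(mean_rating="mean", n_reviews="count")
    .sort_values("n_reviews", ascending=False)
)

# Hypothetical threshold: keep only beers with enough reviews for a stable mean.
min_reviews = 2
popular = summary[summary["n_reviews"] >= min_reviews]

print(popular)
```

Grouping by `beer_name` alone may pool different beers that share a generic name, such as "India Pale Ale" in the table above; grouping by `beer_beerid` would be one way to keep them apart.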

beers[['review_overall', 'review_aroma', 'review_appearance', 'review_palate', 'review_taste', 'beer_abv', 'beer_beerid']].describe().applymap(lambda x: f"{x:0.3f}")

	review_overall	review_aroma	review_appearance	review_palate	review_taste	beer_abv	beer_beerid
count	1518478.000	1518478.000	1518478.000	1518478.000	1518478.000	1518478.000	1518478.000
mean	0.765	0.687	0.770	0.688	0.701	0.122	0.277
std	0.143	0.174	0.123	0.170	0.182	0.040	0.282
min	0.000	0.000	0.000	0.000	0.000	0.000	0.000
25%	0.700	0.625	0.700	0.625	0.625	0.090	0.021
50%	0.800	0.750	0.800	0.750	0.750	0.112	0.166
75%	0.900	0.750	0.800	0.750	0.875	0.147	0.507
max	1.000	1.000	1.000	1.000	1.000	1.000	1.000

beers_train[['review_overall', 'review_aroma', 'review_appearance', 'review_palate', 'review_taste', 'beer_abv', 'beer_beerid']].describe().applymap(lambda x: f"{x:0.1f}")

	review_overall	review_aroma	review_appearance	review_palate	review_taste	beer_abv	beer_beerid
count	1214782.0	1214782.0	1214782.0	1214782.0	1214782.0	1214782.0	1214782.0
mean	0.8	0.7	0.8	0.7	0.7	0.1	0.3
std	0.1	0.2	0.1	0.2	0.2	0.0	0.3
min	0.0	0.0	0.0	0.0	0.0	0.0	0.0
25%	0.7	0.6	0.7	0.6	0.6	0.1	0.0
50%	0.8	0.8	0.8	0.8	0.8	0.1	0.2
75%	0.9	0.8	0.8	0.8	0.9	0.1	0.5
max	1.0	1.0	1.0	1.0	1.0	1.0	1.0

beers_dev[['review_overall', 'review_aroma', 'review_appearance', 'review_palate', 'review_taste', 'beer_abv', 'beer_beerid']].describe().applymap(lambda x: f"{x:0.1f}")

	review_overall	review_aroma	review_appearance	review_palate	review_taste	beer_abv	beer_beerid
count	151848.0	151848.0	151848.0	151848.0	151848.0	151848.0	151848.0
mean	0.8	0.7	0.8	0.7	0.7	0.1	0.3
std	0.1	0.2	0.1	0.2	0.2	0.0	0.3
min	0.0	0.0	0.0	0.0	0.0	0.0	0.0
25%	0.7	0.6	0.7	0.6	0.6	0.1	0.0
50%	0.8	0.8	0.8	0.8	0.8	0.1	0.2
75%	0.9	0.8	0.8	0.8	0.9	0.1	0.5
max	1.0	1.0	1.0	1.0	1.0	0.7	1.0

beers_test[['review_overall', 'review_aroma', 'review_appearance', 'review_palate', 'review_taste', 'beer_abv', 'beer_beerid']].describe().applymap(lambda x: f"{x:0.1f}")

	review_overall	review_aroma	review_appearance	review_palate	review_taste	beer_abv	beer_beerid
count	151848.0	151848.0	151848.0	151848.0	151848.0	151848.0	151848.0
mean	0.8	0.7	0.8	0.7	0.7	0.1	0.3
std	0.1	0.2	0.1	0.2	0.2	0.0	0.3
min	0.0	0.0	0.0	0.0	0.0	0.0	0.0
25%	0.7	0.6	0.7	0.6	0.6	0.1	0.0
50%	0.8	0.8	0.8	0.8	0.8	0.1	0.2
75%	0.9	0.8	0.8	0.8	0.9	0.1	0.5
max	1.0	1.0	1.0	1.0	1.0	0.7	1.0
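At one decimal place the three subset tables are almost identical; the only visible difference is the maximum of `beer_abv` (1.0 in train, 0.7 in dev and test), and the `std` of `beer_abv` rounds to 0.0. A sketch of placing the statistics of one column side by side at three decimals, which may make such comparisons easier; the data is synthetic and the `comparison` name is an assumption of this example:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the scaled `beer_abv` column; illustrative only.
rng = np.random.default_rng(1234)
toy = pd.DataFrame({"beer_abv": rng.uniform(0.0, 1.0, size=1000)})

train, dev_test = train_test_split(toy, test_size=0.2, random_state=1234)
dev, test = train_test_split(dev_test, test_size=0.5, random_state=1234)

# One column of summary statistics per subset, aligned on the statistic name.
comparison = pd.concat(
    {
        "train": train["beer_abv"].describe(),
        "dev": dev["beer_abv"].describe(),
        "test": test["beer_abv"].describe(),
    },
    axis=1,
)

print(comparison.round(3))
```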
