ium_464979/IUM_02.ipynb at 1abd793e3cf5178e2997449c3a7ff4b95d421c3b

Downloading the dataset and packages

%pip install kaggle
%pip install pandas
%pip install numpy
%pip install scikit-learn
%pip install seaborn

Collecting kaggle
  Downloading kaggle-1.6.6.tar.gz (84 kB)
     ---------------------------------------- 84.6/84.6 kB 2.4 MB/s eta 0:00:00
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Requirement already satisfied: six>=1.10 in c:\users\adamw\anaconda3\lib\site-packages (from kaggle) (1.16.0)
Requirement already satisfied: certifi in c:\users\adamw\anaconda3\lib\site-packages (from kaggle) (2022.12.7)
Requirement already satisfied: python-dateutil in c:\users\adamw\anaconda3\lib\site-packages (from kaggle) (2.8.2)
Requirement already satisfied: requests in c:\users\adamw\anaconda3\lib\site-packages (from kaggle) (2.28.1)
Requirement already satisfied: tqdm in c:\users\adamw\anaconda3\lib\site-packages (from kaggle) (4.64.1)
Requirement already satisfied: python-slugify in c:\users\adamw\anaconda3\lib\site-packages (from kaggle) (5.0.2)
Requirement already satisfied: urllib3 in c:\users\adamw\anaconda3\lib\site-packages (from kaggle) (1.26.14)
Requirement already satisfied: bleach in c:\users\adamw\anaconda3\lib\site-packages (from kaggle) (4.1.0)
Requirement already satisfied: webencodings in c:\users\adamw\anaconda3\lib\site-packages (from bleach->kaggle) (0.5.1)
Requirement already satisfied: packaging in c:\users\adamw\anaconda3\lib\site-packages (from bleach->kaggle) (22.0)
Requirement already satisfied: text-unidecode>=1.3 in c:\users\adamw\anaconda3\lib\site-packages (from python-slugify->kaggle) (1.3)
Requirement already satisfied: idna<4,>=2.5 in c:\users\adamw\anaconda3\lib\site-packages (from requests->kaggle) (3.4)
Requirement already satisfied: charset-normalizer<3,>=2 in c:\users\adamw\anaconda3\lib\site-packages (from requests->kaggle) (2.0.4)
Requirement already satisfied: colorama in c:\users\adamw\anaconda3\lib\site-packages (from tqdm->kaggle) (0.4.6)
Building wheels for collected packages: kaggle
  Building wheel for kaggle (setup.py): started
  Building wheel for kaggle (setup.py): finished with status 'done'
  Created wheel for kaggle: filename=kaggle-1.6.6-py3-none-any.whl size=111955 sha256=23592736409344e3027e92f5ac103680cd5efb348835a123a68118e729e02b66
  Stored in directory: c:\users\adamw\appdata\local\pip\cache\wheels\54\6e\ff\d5ab6af2287a2d0c5b8cea9328fb14940ca253fe60214a99c8
Successfully built kaggle
Installing collected packages: kaggle
Successfully installed kaggle-1.6.6
Note: you may need to restart the kernel to use updated packages.
Requirement already satisfied: pandas in c:\users\adamw\anaconda3\lib\site-packages (1.5.3)
Requirement already satisfied: pytz>=2020.1 in c:\users\adamw\anaconda3\lib\site-packages (from pandas) (2022.7)
Requirement already satisfied: numpy>=1.21.0 in c:\users\adamw\anaconda3\lib\site-packages (from pandas) (1.23.5)
Requirement already satisfied: python-dateutil>=2.8.1 in c:\users\adamw\anaconda3\lib\site-packages (from pandas) (2.8.2)
Requirement already satisfied: six>=1.5 in c:\users\adamw\anaconda3\lib\site-packages (from python-dateutil>=2.8.1->pandas) (1.16.0)
Note: you may need to restart the kernel to use updated packages.
Requirement already satisfied: numpy in c:\users\adamw\anaconda3\lib\site-packages (1.23.5)
Note: you may need to restart the kernel to use updated packages.
Requirement already satisfied: scikit-learn in c:\users\adamw\anaconda3\lib\site-packages (1.2.1)
Requirement already satisfied: numpy>=1.17.3 in c:\users\adamw\anaconda3\lib\site-packages (from scikit-learn) (1.23.5)
Requirement already satisfied: threadpoolctl>=2.0.0 in c:\users\adamw\anaconda3\lib\site-packages (from scikit-learn) (2.2.0)
Requirement already satisfied: joblib>=1.1.1 in c:\users\adamw\anaconda3\lib\site-packages (from scikit-learn) (1.1.1)
Requirement already satisfied: scipy>=1.3.2 in c:\users\adamw\anaconda3\lib\site-packages (from scikit-learn) (1.10.0)
Note: you may need to restart the kernel to use updated packages.
Requirement already satisfied: seaborn in c:\users\adamw\anaconda3\lib\site-packages (0.12.2)
Requirement already satisfied: numpy!=1.24.0,>=1.17 in c:\users\adamw\anaconda3\lib\site-packages (from seaborn) (1.23.5)
Requirement already satisfied: matplotlib!=3.6.1,>=3.1 in c:\users\adamw\anaconda3\lib\site-packages (from seaborn) (3.7.0)
Requirement already satisfied: pandas>=0.25 in c:\users\adamw\anaconda3\lib\site-packages (from seaborn) (1.5.3)
Requirement already satisfied: pyparsing>=2.3.1 in c:\users\adamw\anaconda3\lib\site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (3.0.9)
Requirement already satisfied: contourpy>=1.0.1 in c:\users\adamw\anaconda3\lib\site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (1.0.5)
Requirement already satisfied: fonttools>=4.22.0 in c:\users\adamw\anaconda3\lib\site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (4.25.0)
Requirement already satisfied: pillow>=6.2.0 in c:\users\adamw\anaconda3\lib\site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (9.4.0)
Requirement already satisfied: python-dateutil>=2.7 in c:\users\adamw\anaconda3\lib\site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (2.8.2)
Requirement already satisfied: packaging>=20.0 in c:\users\adamw\anaconda3\lib\site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (22.0)
Requirement already satisfied: cycler>=0.10 in c:\users\adamw\anaconda3\lib\site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (0.11.0)
Requirement already satisfied: kiwisolver>=1.0.1 in c:\users\adamw\anaconda3\lib\site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (1.4.4)
Requirement already satisfied: pytz>=2020.1 in c:\users\adamw\anaconda3\lib\site-packages (from pandas>=0.25->seaborn) (2022.7)
Requirement already satisfied: six>=1.5 in c:\users\adamw\anaconda3\lib\site-packages (from python-dateutil>=2.7->matplotlib!=3.6.1,>=3.1->seaborn) (1.16.0)
Note: you may need to restart the kernel to use updated packages.

!kaggle datasets download -d thedevastator/1-5-million-beer-reviews-from-beer-advocate

Downloading 1-5-million-beer-reviews-from-beer-advocate.zip to C:\Users\adamw\REPOS\ium_464979

100%|##########| 32.5M/32.5M [00:05<00:00, 6.00MB/s]

!unzip -o 1-5-million-beer-reviews-from-beer-advocate.zip
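
The `unzip` shell command is not available on every system (the paths in the output above suggest a Windows machine). A minimal sketch of the same extraction step using Python's standard `zipfile` module, assuming the archive sits in the working directory. The helper name `extract_archive` is hypothetical and not part of the original notebook.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, target_dir="."):
    """Extract every member of zip_path into target_dir, overwriting existing files."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
        # Return the member names so the caller can see what was unpacked.
        return archive.namelist()


# Usage (assumes the archive downloaded above is in the working directory):
# extract_archive("1-5-million-beer-reviews-from-beer-advocate.zip")
```

Like `unzip -o`, `extractall` overwrites files that already exist in the target directory.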

import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler

# Show floats in fixed-point notation (six decimals) instead of scientific notation.
pd.set_option('display.float_format', '{:f}'.format)

Loading the data

beers = pd.read_csv('beer_reviews.csv')

beers.head()

	index	brewery_id	brewery_name	review_time	review_overall	review_aroma	review_appearance	review_profilename	beer_style	review_palate	review_taste	beer_name	beer_abv	beer_beerid
0	0	10325	Vecchio Birraio	1234817823	1.500000	2.000000	2.500000	stcules	Hefeweizen	1.500000	1.500000	Sausa Weizen	5.000000	47986
1	1	10325	Vecchio Birraio	1235915097	3.000000	2.500000	3.000000	stcules	English Strong Ale	3.000000	3.000000	Red Moon	6.200000	48213
2	2	10325	Vecchio Birraio	1235916604	3.000000	2.500000	3.000000	stcules	Foreign / Export Stout	3.000000	3.000000	Black Horse Black Beer	6.500000	48215
3	3	10325	Vecchio Birraio	1234725145	3.000000	3.000000	3.500000	stcules	German Pilsener	2.500000	3.000000	Sausa Pils	5.000000	47969
4	4	1075	Caldera Brewing Company	1293735206	4.000000	4.500000	4.000000	johnmichaelsen	American Double / Imperial IPA	4.000000	4.500000	Cauldron DIPA	7.700000	64883
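
The `review_time` column holds ten-digit integers that look like Unix timestamps. A short sketch of converting them to readable datetimes, under the assumption (not confirmed by the notebook) that they are seconds since 1970-01-01 UTC. The column name `review_datetime` is introduced here for illustration only.

```python
import pandas as pd

# First two review_time values from the head() output above.
sample = pd.DataFrame({"review_time": [1234817823, 1235915097]})

# Assumption: the integers are seconds since the Unix epoch (UTC).
sample["review_datetime"] = pd.to_datetime(sample["review_time"], unit="s")
print(sample)
```

Under that assumption, the first review would date from February 2009.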

beers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1586614 entries, 0 to 1586613
Data columns (total 14 columns):
 #   Column              Non-Null Count    Dtype  
---  ------              --------------    -----  
 0   index               1586614 non-null  int64  
 1   brewery_id          1586614 non-null  int64  
 2   brewery_name        1586599 non-null  object 
 3   review_time         1586614 non-null  int64  
 4   review_overall      1586614 non-null  float64
 5   review_aroma        1586614 non-null  float64
 6   review_appearance   1586614 non-null  float64
 7   review_profilename  1586266 non-null  object 
 8   beer_style          1586614 non-null  object 
 9   review_palate       1586614 non-null  float64
 10  review_taste        1586614 non-null  float64
 11  beer_name           1586614 non-null  object 
 12  beer_abv            1518829 non-null  float64
 13  beer_beerid         1586614 non-null  int64  
dtypes: float64(6), int64(4), object(4)
memory usage: 169.5+ MB

Cleaning

# Drop every row that is missing a value in any of the three incomplete columns.
beers.dropna(subset=['brewery_name', 'review_profilename', 'beer_abv'], inplace=True)

beers.isnull().sum()

index                 0
brewery_id            0
brewery_name          0
review_time           0
review_overall        0
review_aroma          0
review_appearance     0
review_profilename    0
beer_style            0
review_palate         0
review_taste          0
beer_name             0
beer_abv              0
beer_beerid           0
dtype: int64
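
To see how much data the cleaning step removes, the counts printed by `beers.info()` can be turned into missing-value totals. The sketch below only redoes the arithmetic on numbers already shown in this notebook (the non-null counts, and the 1518478 rows reported after cleaning); the variable names are illustrative.

```python
# Row and non-null counts copied from the beers.info() output above.
total_rows = 1586614
non_null = {
    "brewery_name": 1586599,
    "review_profilename": 1586266,
    "beer_abv": 1518829,
}

# Missing values per column = total rows minus non-null entries.
missing = {col: total_rows - n for col, n in non_null.items()}
for col, n in missing.items():
    print(f"{col}: {n} missing ({n / total_rows:.3%})")

# Rows remaining after cleaning, as reported later in the notebook.
rows_after_cleaning = 1518478
dropped = total_rows - rows_after_cleaning

# A row with several missing fields is dropped once but counted once per field.
extra = sum(missing.values()) - dropped
print(f"dropped rows: {dropped}, surplus missing values in multi-gap rows: {extra}")
```

Almost all of the removed rows come from `beer_abv`, which is missing in about 4.3% of the reviews.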

Normalization

# Rescale the numeric columns to the [0, 1] range.
# Note: the scaler is fitted on the full dataset, before the train/dev/test split.
cols_to_scale = ['review_overall', 'review_aroma', 'review_appearance', 'review_palate', 'review_taste', 'beer_abv', 'beer_beerid']

scaler = MinMaxScaler()

beers[cols_to_scale] = scaler.fit_transform(beers[cols_to_scale])
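
The cell above fits the scaler on the whole dataset, so the minimum and maximum of the later dev and test rows influence the transform. A common alternative is to fit on the training portion only and reuse those parameters elsewhere. The sketch below shows that pattern on made-up toy data; it is not what the notebook does, and the `toy` frame and `cols` list are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Toy stand-in for the beers frame; the values are made up for illustration.
toy = pd.DataFrame({
    "review_overall": [1.5, 3.0, 3.0, 4.0, 4.5, 5.0, 2.0, 3.5],
    "beer_abv": [5.0, 6.2, 6.5, 5.0, 7.7, 9.0, 4.2, 5.5],
})
cols = ["review_overall", "beer_abv"]

train, test = train_test_split(toy, test_size=0.25, random_state=1234)
train, test = train.copy(), test.copy()  # work on independent copies

scaler = MinMaxScaler()
train[cols] = scaler.fit_transform(train[cols])  # min/max learned from train only
test[cols] = scaler.transform(test[cols])        # reuse the train min/max

print(train[cols].agg(["min", "max"]))
print(test[cols].agg(["min", "max"]))
```

With this variant the training columns span exactly [0, 1], while test values can fall slightly outside that range if they exceed the training extremes.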

Splitting into subsets

# Keep 80% for training, then split the remaining 20% evenly into dev and test (10% each).
beers_train, beers_dev_test = train_test_split(beers, test_size=0.2, random_state=1234)
beers_dev, beers_test = train_test_split(beers_dev_test, test_size=0.5, random_state=1234)

print(f"Number of columns in each subset: {beers.shape[1]}")
print(f"Total: {beers.shape[0]} records")
print(f"Train: {beers_train.shape[0]} records")
print(f"Dev: {beers_dev.shape[0]} records")
print(f"Test: {beers_test.shape[0]} records")

Number of columns in each subset: 14
Total: 1518478 records
Train: 1214782 records
Dev: 151848 records
Test: 151848 records
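
The subset sizes follow directly from the two `test_size` fractions. The sketch below reproduces them by hand, assuming that `train_test_split` rounds the held-out share up and assigns the remaining rows to the other part when only `test_size` is given.

```python
import math

n_total = 1518478  # rows after cleaning, from the printout above

# First split: 20% held out (rounded up), the rest goes to train.
n_dev_test = math.ceil(0.2 * n_total)
n_train = n_total - n_dev_test

# Second split: half of the held-out rows become test, the rest dev.
n_test = math.ceil(0.5 * n_dev_test)
n_dev = n_dev_test - n_test

print(n_train, n_dev, n_test)
```

The three numbers match the record counts printed above, which confirms the intended 80/10/10 proportions.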

Data overview

print(f"Distinct beer names: {beers['beer_name'].nunique()}")
print(f"Distinct styles: {beers['beer_style'].nunique()}")
print(f"Distinct brewery names: {beers['brewery_name'].nunique()}")

Distinct beer names: 44075
Distinct styles: 104
Distinct brewery names: 5155

# Count reviews per style; each row of the dataset is one review.
style_counts = beers['beer_style'].value_counts()

top_15_styles = style_counts.head(15)

plt.bar(top_15_styles.index, top_15_styles.values)
plt.xlabel('Style')
plt.ylabel('Number of reviews')
plt.title('Number of reviews for the most common styles')
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()

# Mean overall rating and number of reviews per beer name, most-reviewed first.
reviews = beers.groupby('beer_name')['review_overall'].mean().to_frame()
reviews['Number of reviews'] = beers.groupby('beer_name')['review_overall'].count()
reviews = reviews.sort_values(by='Number of reviews', ascending=False)
reviews.head()

	review_overall	Number of reviews
beer_name
90 Minute IPA	0.829097	3289
Old Rasputin Russian Imperial Stout	0.834823	3110
Sierra Nevada Celebration Ale	0.833711	2999
India Pale Ale	0.770777	2960
Two Hearted Ale	0.866043	2727
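
Grouping by `beer_name` alone merges all beers that share a name. A generic entry such as "India Pale Ale" in the table above may therefore combine products from several breweries (an assumption, not verified here). The sketch below uses made-up rows to show how grouping by brewery and name together keeps such beers apart.

```python
import pandas as pd

# Made-up rows: two breweries that both sell a beer called "India Pale Ale".
toy = pd.DataFrame({
    "brewery_name": ["Brewery A", "Brewery A", "Brewery B", "Brewery B"],
    "beer_name": ["India Pale Ale", "India Pale Ale", "India Pale Ale", "Porter"],
    "review_overall": [4.0, 5.0, 2.0, 3.0],
})

# Grouping by name only pools the reviews of both breweries.
by_name = toy.groupby("beer_name")["review_overall"].agg(["mean", "count"])

# Grouping by brewery and name keeps the two beers separate.
by_brewery_and_name = (
    toy.groupby(["brewery_name", "beer_name"])["review_overall"]
    .agg(["mean", "count"])
    .sort_values("count", ascending=False)
)

print(by_name)
print(by_brewery_and_name)
```

In the toy data the pooled "India Pale Ale" row counts three reviews, while the per-brewery view splits them into two and one.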

beers[['review_overall', 'review_aroma', 'review_appearance', 'review_palate', 'review_taste', 'beer_abv', 'beer_beerid']].describe().applymap(lambda x: f"{x:0.3f}")

	review_overall	review_aroma	review_appearance	review_palate	review_taste	beer_abv	beer_beerid
count	1518478.000	1518478.000	1518478.000	1518478.000	1518478.000	1518478.000	1518478.000
mean	0.765	0.687	0.770	0.688	0.701	0.122	0.277
std	0.143	0.174	0.123	0.170	0.182	0.040	0.282
min	0.000	0.000	0.000	0.000	0.000	0.000	0.000
25%	0.700	0.625	0.700	0.625	0.625	0.090	0.021
50%	0.800	0.750	0.800	0.750	0.750	0.112	0.166
75%	0.900	0.750	0.800	0.750	0.875	0.147	0.507
max	1.000	1.000	1.000	1.000	1.000	1.000	1.000

beers_train[['review_overall', 'review_aroma', 'review_appearance', 'review_palate', 'review_taste', 'beer_abv', 'beer_beerid']].describe().applymap(lambda x: f"{x:0.1f}")

	review_overall	review_aroma	review_appearance	review_palate	review_taste	beer_abv	beer_beerid
count	1214782.0	1214782.0	1214782.0	1214782.0	1214782.0	1214782.0	1214782.0
mean	0.8	0.7	0.8	0.7	0.7	0.1	0.3
std	0.1	0.2	0.1	0.2	0.2	0.0	0.3
min	0.0	0.0	0.0	0.0	0.0	0.0	0.0
25%	0.7	0.6	0.7	0.6	0.6	0.1	0.0
50%	0.8	0.8	0.8	0.8	0.8	0.1	0.2
75%	0.9	0.8	0.8	0.8	0.9	0.1	0.5
max	1.0	1.0	1.0	1.0	1.0	1.0	1.0

beers_dev[['review_overall', 'review_aroma', 'review_appearance', 'review_palate', 'review_taste', 'beer_abv', 'beer_beerid']].describe().applymap(lambda x: f"{x:0.1f}")

	review_overall	review_aroma	review_appearance	review_palate	review_taste	beer_abv	beer_beerid
count	151848.0	151848.0	151848.0	151848.0	151848.0	151848.0	151848.0
mean	0.8	0.7	0.8	0.7	0.7	0.1	0.3
std	0.1	0.2	0.1	0.2	0.2	0.0	0.3
min	0.0	0.0	0.0	0.0	0.0	0.0	0.0
25%	0.7	0.6	0.7	0.6	0.6	0.1	0.0
50%	0.8	0.8	0.8	0.8	0.8	0.1	0.2
75%	0.9	0.8	0.8	0.8	0.9	0.1	0.5
max	1.0	1.0	1.0	1.0	1.0	0.7	1.0

beers_test[['review_overall', 'review_aroma', 'review_appearance', 'review_palate', 'review_taste', 'beer_abv', 'beer_beerid']].describe().applymap(lambda x: f"{x:0.1f}")

	review_overall	review_aroma	review_appearance	review_palate	review_taste	beer_abv	beer_beerid
count	151848.0	151848.0	151848.0	151848.0	151848.0	151848.0	151848.0
mean	0.8	0.7	0.8	0.7	0.7	0.1	0.3
std	0.1	0.2	0.1	0.2	0.2	0.0	0.3
min	0.0	0.0	0.0	0.0	0.0	0.0	0.0
25%	0.7	0.6	0.7	0.6	0.6	0.1	0.0
50%	0.8	0.8	0.8	0.8	0.8	0.1	0.2
75%	0.9	0.8	0.8	0.8	0.9	0.1	0.5
max	1.0	1.0	1.0	1.0	1.0	0.7	1.0
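
The three per-subset tables above are nearly identical once rounded to one decimal; the only visible difference is the `beer_abv` maximum (1.0 in train, 0.7 in dev and test). A sketch of stacking the summaries so that one statistic can be compared across subsets directly, using made-up stand-ins for `beers_train`, `beers_dev` and `beers_test`:

```python
import pandas as pd

# Made-up stand-ins for the three subsets; values are for illustration only.
splits = {
    "train": pd.DataFrame({"review_overall": [0.7, 0.8, 0.9, 1.0],
                           "beer_abv": [0.09, 0.11, 0.15, 1.0]}),
    "dev": pd.DataFrame({"review_overall": [0.7, 0.9],
                         "beer_abv": [0.10, 0.70]}),
    "test": pd.DataFrame({"review_overall": [0.8, 0.8],
                          "beer_abv": [0.12, 0.14]}),
}

# One row per (subset, statistic); the columns are the features.
summary = pd.concat({name: df.describe() for name, df in splits.items()})

# Compare a single statistic across subsets, here the maximum.
print(summary.xs("max", level=1))
```

Swapping `"max"` for `"mean"` or `"50%"` gives the same side-by-side view for the other rows of `describe()`.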
