dl_projekt/dataset analyze and clean.ipynb
2024-06-03 16:32:11 +02:00

82 KiB
Raw Permalink Blame History

Analiza zbioru danych, czyszczenie i podział na train/valid/test

import pandas as pd
import numpy as np

dataset = pd.read_csv("dataset.csv")
dataset.tail(20)
app_id app_name review_text review_score review_votes
6417086 99910 Puzzle Pirates Reminds me of the games I played in elementary... -1 0
6417087 99910 Puzzle Pirates I dont like this game -1 0
6417088 99910 Puzzle Pirates The actual game play of Puzzle Pirates is grea... -1 0
6417089 99910 Puzzle Pirates Rating based on current state of play, as per ... -1 0
6417090 99910 Puzzle Pirates This is just appalling. -1 0
6417091 99910 Puzzle Pirates Set my age as less than 5 by mistake. Apparent... -1 0
6417092 99910 Puzzle Pirates It is terrible because I cant ge on because of... -1 0
6417093 99910 Puzzle Pirates Was fun for the first 30 minutes or so, got bo... -1 0
6417094 99910 Puzzle Pirates The game is very awefull and strange. I think ... -1 0
6417095 99910 Puzzle Pirates A very good game, got sick of it after a while... -1 1
6417096 99910 Puzzle Pirates Imagine Bejeweled with a heavy grind based eco... -1 0
6417097 99910 Puzzle Pirates This game has some serious problems. First of ... -1 0
6417098 99910 Puzzle Pirates This game is good but also horrible. Its fun t... -1 0
6417099 99910 Puzzle Pirates A very good game, got sick of it after a while... -1 1
6417100 99910 Puzzle Pirates This game is good but also horrible. Its fun t... -1 0
6417101 99910 Puzzle Pirates I really ove this game but it needs somethings... -1 0
6417102 99910 Puzzle Pirates Used to play Puzzel Pirates 'way back when', b... -1 0
6417103 99910 Puzzle Pirates This game was aright, though a bit annoying. W... -1 0
6417104 99910 Puzzle Pirates I had a nice review to recommend this game, bu... -1 0
6417105 99910 Puzzle Pirates The puzzles in this game are fun, but you have... -1 0
dataset.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6417106 entries, 0 to 6417105
Data columns (total 5 columns):
 #   Column        Dtype 
---  ------        ----- 
 0   app_id        int64 
 1   app_name      object
 2   review_text   object
 3   review_score  int64 
 4   review_votes  int64 
dtypes: int64(3), object(2)
memory usage: 244.8+ MB
dataset.describe()
app_id review_score review_votes
count 6.417106e+06 6.417106e+06 6.417106e+06
mean 2.274695e+05 6.394992e-01 1.472446e-01
std 1.260451e+05 7.687918e-01 3.543496e-01
min 1.000000e+01 -1.000000e+00 0.000000e+00
25% 2.018100e+05 1.000000e+00 0.000000e+00
50% 2.391600e+05 1.000000e+00 0.000000e+00
75% 3.056200e+05 1.000000e+00 0.000000e+00
max 5.653400e+05 1.000000e+00 1.000000e+00

~5x więcej pozytywnych recenzji niż negatywnych

Usuwanie pustych wartości

dataset["review_score"].value_counts()
review_score
 1    5260420
-1    1156686
Name: count, dtype: int64
dataset["review_score"].value_counts().plot(kind="bar")
<AxesSubplot:xlabel='review_score'>
dataset_without_na = dataset.dropna()
dataset_without_na
app_id app_name review_text review_score review_votes
0 10 Counter-Strike Ruined my life. 1 0
1 10 Counter-Strike This will be more of a ''my experience with th... 1 1
2 10 Counter-Strike This game saved my virginity. 1 0
3 10 Counter-Strike • Do you like original games? • Do you like ga... 1 0
4 10 Counter-Strike Easy to learn, hard to master. 1 1
... ... ... ... ... ...
6417101 99910 Puzzle Pirates I really ove this game but it needs somethings... -1 0
6417102 99910 Puzzle Pirates Used to play Puzzel Pirates 'way back when', b... -1 0
6417103 99910 Puzzle Pirates This game was aright, though a bit annoying. W... -1 0
6417104 99910 Puzzle Pirates I had a nice review to recommend this game, bu... -1 0
6417105 99910 Puzzle Pirates The puzzles in this game are fun, but you have... -1 0

6226728 rows × 5 columns

dataset = dataset_without_na
dataset["review_score"].value_counts()
review_score
 1    5126132
-1    1100596
Name: count, dtype: int64

Gry z największą liczbą recenzji w zbiorze danych

dataset['app_name'].value_counts().nlargest(20)
app_name
DayZ                        88850
PAYDAY 2                    88783
Terraria                    84702
Rust                        77037
Dota 2                      73433
Rocket League               54188
Undertale                   51878
Left 4 Dead 2               50863
Warframe                    48164
Grand Theft Auto V          42323
Robocraft                   41596
Starbound                   41141
Portal 2                    38796
Space Engineers             37453
Fallout: New Vegas          32918
Arma 3                      32262
The Witcher 3: Wild Hunt    31830
Heroes & Generals           31303
BioShock Infinite           31076
The Forest                  29998
Name: count, dtype: int64

Recenzje gier dostępnych we wczesnym dostępie wyświetlają się jako "Early Access Review", bez tekstu

dataset['review_text'].value_counts().nlargest(20)
review_text
 Early Access Review      977399
   Early Access Review     10571
10/10                       6050
.                           4769
Great game                  3662
great game                  3554
Great game!                 2440
:)                          2093
Nice game                   1793
Great Game                  1659
♥♥♥♥                        1645
Great game.                 1633
cool                        1502
...                         1247
its good                     974
Great Game!                  924
9/10                         889
8/10                         747
Great                        746
i love this game             720
Name: count, dtype: int64
dataset = dataset[dataset['review_text'].str.contains("Early Access Review")==False]
dataset['review_text'].value_counts().nlargest(5)
review_text
10/10          6050
.              4769
Great game     3662
great game     3554
Great game!    2440
Name: count, dtype: int64
dataset["review_score"].value_counts()
review_score
 1    4341259
-1     897431
Name: count, dtype: int64

Zbiór danych nadal jest dosyć duży więc obetnę jego większość w celu szybszego treningu

dataset = dataset.sample(frac=0.03)
dataset["review_score"].value_counts()
review_score
 1    130102
-1     27059
Name: count, dtype: int64

Usunięcie niepotrzebnych kolumn

dataset = dataset.drop(columns=["app_id", "review_votes", "app_name"])
dataset.info()
<class 'pandas.core.frame.DataFrame'>
Index: 157161 entries, 1260671 to 6268511
Data columns (total 2 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   review_text   157161 non-null  object
 1   review_score  157161 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 3.6+ MB

Podział na zbiory train/test/validate

from sklearn.model_selection import train_test_split

train, test_and_valid = train_test_split(dataset, test_size=0.2)
test, valid = train_test_split(test_and_valid, test_size=0.5)

Downsampling klasy pozytywnej dla zbioru treningowego

train["review_score"].value_counts()
review_score
 1    104113
-1     21615
Name: count, dtype: int64
dataset_positive_reviews = train[train["review_score"]==1]
dataset_negative_reviews = train[train["review_score"]==-1]

dataset_positive_reviews = dataset_positive_reviews.sample(21615)
train = pd.concat([dataset_positive_reviews,dataset_negative_reviews])
train = train.sample(frac=1.0) # Losowanie kolejności przykładów
train = train.reset_index(drop=True)
train["review_score"].value_counts().plot(kind="pie")
<AxesSubplot:ylabel='count'>
test["review_score"].value_counts().plot(kind="pie")
<AxesSubplot:ylabel='count'>
valid["review_score"].value_counts().plot(kind="pie")
<AxesSubplot:ylabel='count'>
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 43230 entries, 0 to 43229
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   review_text   43230 non-null  object
 1   review_score  43230 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 675.6+ KB
test.info()
<class 'pandas.core.frame.DataFrame'>
Index: 15716 entries, 1265039 to 5454569
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   review_text   15716 non-null  object
 1   review_score  15716 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 368.3+ KB
valid.info()
<class 'pandas.core.frame.DataFrame'>
Index: 15717 entries, 5012153 to 4962820
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   review_text   15717 non-null  object
 1   review_score  15717 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 368.4+ KB

Przykłady z każdego podzbioru

train.head()
review_text review_score
0 I'm the biggest fan you will ever meet of tile... -1
1 Really an improvement on the old game (Which w... 1
2 celebrating the four year birthday of payday w... -1
3 Only fun when playing with friends. Can't join... -1
4 While smashing planets together can be fun, th... -1
test.head()
review_text review_score
1265039 I love the Fact you can do what EVER you want ... 1
3132003 Tony Hawk's without the Pro Skater. Finding ou... 1
880195 It's pretty good. 1
717128 This the best dungeon game I have played since... 1
5221356 Totally awesome game alone or with a friend. I... 1
valid.head()
review_text review_score
5012153 ..it's like nights into dreams and treasures o... 1
5818758 As someone who mostly just likes making cool s... 1
4582102 What can I say about this game the story is sh... 1
5242842 A very unique and enjoyable puzzle solving str... 1
5400923 A very adorable, charming game. 1

Zapis do csv

train.to_csv("train.csv")
test.to_csv("test.csv")
valid.to_csv("valid.csv")