Dataset analysis, cleaning, and train/valid/test split
|  | app_id | app_name | review_text | review_score | review_votes |
|---|---|---|---|---|---|
| 6417086 | 99910 | Puzzle Pirates | Reminds me of the games I played in elementary... | -1 | 0 |
| 6417087 | 99910 | Puzzle Pirates | I dont like this game | -1 | 0 |
| 6417088 | 99910 | Puzzle Pirates | The actual game play of Puzzle Pirates is grea... | -1 | 0 |
| 6417089 | 99910 | Puzzle Pirates | Rating based on current state of play, as per ... | -1 | 0 |
| 6417090 | 99910 | Puzzle Pirates | This is just appalling. | -1 | 0 |
| 6417091 | 99910 | Puzzle Pirates | Set my age as less than 5 by mistake. Apparent... | -1 | 0 |
| 6417092 | 99910 | Puzzle Pirates | It is terrible because I cant ge on because of... | -1 | 0 |
| 6417093 | 99910 | Puzzle Pirates | Was fun for the first 30 minutes or so, got bo... | -1 | 0 |
| 6417094 | 99910 | Puzzle Pirates | The game is very awefull and strange. I think ... | -1 | 0 |
| 6417095 | 99910 | Puzzle Pirates | A very good game, got sick of it after a while... | -1 | 1 |
| 6417096 | 99910 | Puzzle Pirates | Imagine Bejeweled with a heavy grind based eco... | -1 | 0 |
| 6417097 | 99910 | Puzzle Pirates | This game has some serious problems. First of ... | -1 | 0 |
| 6417098 | 99910 | Puzzle Pirates | This game is good but also horrible. Its fun t... | -1 | 0 |
| 6417099 | 99910 | Puzzle Pirates | A very good game, got sick of it after a while... | -1 | 1 |
| 6417100 | 99910 | Puzzle Pirates | This game is good but also horrible. Its fun t... | -1 | 0 |
| 6417101 | 99910 | Puzzle Pirates | I really ove this game but it needs somethings... | -1 | 0 |
| 6417102 | 99910 | Puzzle Pirates | Used to play Puzzel Pirates 'way back when', b... | -1 | 0 |
| 6417103 | 99910 | Puzzle Pirates | This game was aright, though a bit annoying. W... | -1 | 0 |
| 6417104 | 99910 | Puzzle Pirates | I had a nice review to recommend this game, bu... | -1 | 0 |
| 6417105 | 99910 | Puzzle Pirates | The puzzles in this game are fun, but you have... | -1 | 0 |
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6417106 entries, 0 to 6417105
Data columns (total 5 columns):
# Column Dtype
--- ------ -----
0 app_id int64
1 app_name object
2 review_text object
3 review_score int64
4 review_votes int64
dtypes: int64(3), object(2)
memory usage: 244.8+ MB
|  | app_id | review_score | review_votes |
|---|---|---|---|
| count | 6.417106e+06 | 6.417106e+06 | 6.417106e+06 |
| mean | 2.274695e+05 | 6.394992e-01 | 1.472446e-01 |
| std | 1.260451e+05 | 7.687918e-01 | 3.543496e-01 |
| min | 1.000000e+01 | -1.000000e+00 | 0.000000e+00 |
| 25% | 2.018100e+05 | 1.000000e+00 | 0.000000e+00 |
| 50% | 2.391600e+05 | 1.000000e+00 | 0.000000e+00 |
| 75% | 3.056200e+05 | 1.000000e+00 | 0.000000e+00 |
| max | 5.653400e+05 | 1.000000e+00 | 1.000000e+00 |
About 4.5x more positive reviews than negative ones (5,260,420 vs. 1,156,686)
Removing empty values
review_score
1 5260420
-1 1156686
Name: count, dtype: int64
[Figure: plot with x-axis review_score]
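The imbalance noted above can be checked with a few lines of pandas. This is a minimal sketch on a toy frame (the toy values are made up for illustration; only the two reported counts come from the output above).

```python
import pandas as pd

# Toy stand-in for the reviews frame; the real data is not reproduced here.
df = pd.DataFrame({"review_score": [1, 1, 1, 1, -1, 1, 1, 1, 1, 1, -1]})

# Count reviews per label and compute positives per negative.
counts = df["review_score"].value_counts()
ratio = counts.loc[1] / counts.loc[-1]
print(counts)
print(f"positive/negative ratio: {ratio:.2f}")

# The same arithmetic applied to the counts reported in the output above.
reported_ratio = 5260420 / 1156686
print(f"reported ratio: {reported_ratio:.2f}")
```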
|  | app_id | app_name | review_text | review_score | review_votes |
|---|---|---|---|---|---|
| 0 | 10 | Counter-Strike | Ruined my life. | 1 | 0 |
| 1 | 10 | Counter-Strike | This will be more of a ''my experience with th... | 1 | 1 |
| 2 | 10 | Counter-Strike | This game saved my virginity. | 1 | 0 |
| 3 | 10 | Counter-Strike | • Do you like original games? • Do you like ga... | 1 | 0 |
| 4 | 10 | Counter-Strike | Easy to learn, hard to master. | 1 | 1 |
| ... | ... | ... | ... | ... | ... |
| 6417101 | 99910 | Puzzle Pirates | I really ove this game but it needs somethings... | -1 | 0 |
| 6417102 | 99910 | Puzzle Pirates | Used to play Puzzel Pirates 'way back when', b... | -1 | 0 |
| 6417103 | 99910 | Puzzle Pirates | This game was aright, though a bit annoying. W... | -1 | 0 |
| 6417104 | 99910 | Puzzle Pirates | I had a nice review to recommend this game, bu... | -1 | 0 |
| 6417105 | 99910 | Puzzle Pirates | The puzzles in this game are fun, but you have... | -1 | 0 |

6226728 rows × 5 columns
review_score
1 5126132
-1 1100596
Name: count, dtype: int64
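The row count falls from 6,417,106 to 6,226,728 here, which is consistent with dropping rows that contain missing values. A minimal sketch of that step, assuming `dropna()` is applied across all columns (the original cell is not shown, so the exact call and the toy rows below are assumptions):

```python
import pandas as pd

# Toy frame with missing values in two different columns.
df = pd.DataFrame({
    "app_name": ["Counter-Strike", None, "Puzzle Pirates"],
    "review_text": ["Ruined my life.", "ok", None],
    "review_score": [1, 1, -1],
})

before = len(df)
df = df.dropna()  # drop every row that has at least one missing value
print(f"dropped {before - len(df)} rows")
print(df["review_score"].value_counts())
```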
Games with the most reviews in the dataset
app_name
DayZ 88850
PAYDAY 2 88783
Terraria 84702
Rust 77037
Dota 2 73433
Rocket League 54188
Undertale 51878
Left 4 Dead 2 50863
Warframe 48164
Grand Theft Auto V 42323
Robocraft 41596
Starbound 41141
Portal 2 38796
Space Engineers 37453
Fallout: New Vegas 32918
Arma 3 32262
The Witcher 3: Wild Hunt 31830
Heroes & Generals 31303
BioShock Infinite 31076
The Forest 29998
Name: count, dtype: int64
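A ranking like the one above can be produced with `value_counts`, which sorts by frequency in descending order. A small sketch on toy data (the cutoff of 20 matches the length of the list above; the rows are illustrative):

```python
import pandas as pd

# Toy frame: one row per review, keyed by game name.
df = pd.DataFrame(
    {"app_name": ["DayZ", "Rust", "DayZ", "Terraria", "DayZ", "Rust"]}
)

# Number of reviews per game, most reviewed first, top 20 only.
top_games = df["app_name"].value_counts().head(20)
print(top_games)
```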
Reviews of games available in early access show up as "Early Access Review", with no actual text
review_text
Early Access Review 977399
Early Access Review 10571
10/10 6050
. 4769
Great game 3662
great game 3554
Great game! 2440
:) 2093
Nice game 1793
Great Game 1659
♥♥♥♥ 1645
Great game. 1633
cool 1502
... 1247
its good 974
Great Game! 924
9/10 889
8/10 747
Great 746
i love this game 720
Name: count, dtype: int64
review_text
10/10 6050
. 4769
Great game 3662
great game 3554
Great game! 2440
Name: count, dtype: int64
review_score
1 4341259
-1 897431
Name: count, dtype: int64
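The placeholder appears twice in the counts above, which suggests two variants that differ only in surrounding whitespace. That is an assumption, as is the exact filter the notebook used. One way to remove both variants is to compare the stripped text against the placeholder:

```python
import pandas as pd

# Toy frame; the second row mimics a whitespace variant of the placeholder.
df = pd.DataFrame({
    "review_text": ["Early Access Review", " Early Access Review", "10/10", "Great game"],
    "review_score": [1, -1, 1, 1],
})

PLACEHOLDER = "Early Access Review"

# Strip surrounding whitespace before comparing so both variants match.
is_placeholder = df["review_text"].str.strip() == PLACEHOLDER
df = df[~is_placeholder]
print(df["review_text"].value_counts())
```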
The dataset is still fairly large, so I will cut most of it to speed up training (keeping 157,161 of the 5,238,690 rows, about 3%)
review_score
1 130102
-1 27059
Name: count, dtype: int64
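The counts above add up to 157,161 rows, about 3% of the 5,238,690 rows left after cleaning. A sketch of such a reduction with `DataFrame.sample`; the fraction is inferred from the counts, and the seed and toy data are assumptions:

```python
import pandas as pd

# Toy frame with 1000 reviews, roughly 4:1 positive to negative.
df = pd.DataFrame({
    "review_text": [f"review {i}" for i in range(1000)],
    "review_score": [1 if i % 5 else -1 for i in range(1000)],
})

# Keep a random 3% of the rows; random_state makes the draw repeatable.
sample = df.sample(frac=0.03, random_state=42)
print(len(sample))
print(sample["review_score"].value_counts())
```

Plain random sampling keeps the class ratio only approximately, which matches the slightly shifted proportions in the output above.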
Dropping unneeded columns
<class 'pandas.core.frame.DataFrame'>
Index: 157161 entries, 1260671 to 6268511
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 review_text 157161 non-null object
1 review_score 157161 non-null int64
dtypes: int64(1), object(1)
memory usage: 3.6+ MB
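Only `review_text` and `review_score` remain in the summary above. A short sketch of one way to get there, by selecting the columns to keep (dropping the other three with `drop(columns=...)` would be equivalent):

```python
import pandas as pd

# Toy frame with the same five columns as the original data.
df = pd.DataFrame({
    "app_id": [10, 99910],
    "app_name": ["Counter-Strike", "Puzzle Pirates"],
    "review_text": ["Ruined my life.", "This is just appalling."],
    "review_score": [1, -1],
    "review_votes": [0, 0],
})

# Keep only the text and its label.
df = df[["review_text", "review_score"]]
df.info()
```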
Splitting into train/test/validation sets
Downsampling the positive class in the training set
review_score
1 104113
-1 21615
Name: count, dtype: int64
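The sizes in this notebook (125,728 training rows, with 15,716 and 15,717 rows in the two held-out subsets) correspond to roughly an 80/10/10 split. The sketch below reproduces that proportion with pandas only. The original cell is not shown and may well use scikit-learn's `train_test_split` instead; the seed and toy data are assumptions.

```python
import pandas as pd

# Toy frame with 100 labelled reviews.
df = pd.DataFrame({
    "review_text": [f"review {i}" for i in range(100)],
    "review_score": [1 if i % 5 else -1 for i in range(100)],
})

# Shuffle once, then cut into 80% train and two equal held-out parts.
shuffled = df.sample(frac=1.0, random_state=42)
n_train = int(0.8 * len(shuffled))
n_valid = (len(shuffled) - n_train) // 2

train = shuffled.iloc[:n_train]
valid = shuffled.iloc[n_train:n_train + n_valid]
test = shuffled.iloc[n_train + n_valid:]
print(len(train), len(valid), len(test))
```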
[Figures: three plots, each with y-axis count]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 43230 entries, 0 to 43229
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 review_text 43230 non-null object
1 review_score 43230 non-null int64
dtypes: int64(1), object(1)
memory usage: 675.6+ KB
<class 'pandas.core.frame.DataFrame'>
Index: 15716 entries, 1265039 to 5454569
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 review_text 15716 non-null object
1 review_score 15716 non-null int64
dtypes: int64(1), object(1)
memory usage: 368.3+ KB
<class 'pandas.core.frame.DataFrame'>
Index: 15717 entries, 5012153 to 4962820
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 review_text 15717 non-null object
1 review_score 15717 non-null int64
dtypes: int64(1), object(1)
memory usage: 368.4+ KB
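The training set shrinks from 125,728 rows to 43,230, which is exactly twice the 21,615 negative reviews, so the positive class appears to be cut down to the size of the negative one. A sketch of that downsampling (the seed, the final shuffle, and the toy data are assumptions; the `RangeIndex` in the summary above suggests the index was reset):

```python
import pandas as pd

# Toy training set: 8 positive and 2 negative reviews.
train = pd.DataFrame({
    "review_text": [f"review {i}" for i in range(10)],
    "review_score": [1] * 8 + [-1] * 2,
})

neg = train[train["review_score"] == -1]
# Draw as many positives as there are negatives.
pos = train[train["review_score"] == 1].sample(n=len(neg), random_state=42)

# Recombine, shuffle, and renumber the rows from 0.
balanced = (
    pd.concat([pos, neg])
    .sample(frac=1.0, random_state=42)
    .reset_index(drop=True)
)
print(balanced["review_score"].value_counts())
```

Only the training set is balanced this way; the held-out subsets keep the original class distribution, so evaluation reflects the real imbalance.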
Examples from each subset
|  | review_text | review_score |
|---|---|---|
| 0 | I'm the biggest fan you will ever meet of tile... | -1 |
| 1 | Really an improvement on the old game (Which w... | 1 |
| 2 | celebrating the four year birthday of payday w... | -1 |
| 3 | Only fun when playing with friends. Can't join... | -1 |
| 4 | While smashing planets together can be fun, th... | -1 |

|  | review_text | review_score |
|---|---|---|
| 1265039 | I love the Fact you can do what EVER you want ... | 1 |
| 3132003 | Tony Hawk's without the Pro Skater. Finding ou... | 1 |
| 880195 | It's pretty good. | 1 |
| 717128 | This the best dungeon game I have played since... | 1 |
| 5221356 | Totally awesome game alone or with a friend. I... | 1 |

|  | review_text | review_score |
|---|---|---|
| 5012153 | ..it's like nights into dreams and treasures o... | 1 |
| 5818758 | As someone who mostly just likes making cool s... | 1 |
| 4582102 | What can I say about this game the story is sh... | 1 |
| 5242842 | A very unique and enjoyable puzzle solving str... | 1 |
| 5400923 | A very adorable, charming game. | 1 |

Saving to CSV
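A sketch of the export step, writing one CSV file per subset. The file names (`train.csv`, `valid.csv`, `test.csv`) and the temporary output directory are hypothetical, since the original cell is not shown.

```python
import os
import tempfile

import pandas as pd

# Toy subsets standing in for the real train/valid/test frames.
splits = {
    "train": pd.DataFrame({"review_text": ["Great game, loved it"], "review_score": [1]}),
    "valid": pd.DataFrame({"review_text": ["It's pretty good."], "review_score": [1]}),
    "test": pd.DataFrame({"review_text": ["This is just appalling."], "review_score": [-1]}),
}

out_dir = tempfile.mkdtemp()  # hypothetical output location
paths = {}
for name, frame in splits.items():
    paths[name] = os.path.join(out_dir, f"{name}.csv")
    # index=False leaves out the original row numbers.
    frame.to_csv(paths[name], index=False)

# Read one file back to check the round trip.
reloaded = pd.read_csv(paths["train"])
print(reloaded)
```

Writing with `index=False` keeps the files down to the two columns; pandas quotes any review text that contains commas, so the files read back cleanly.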