Google Play Store data exploration
Kamila Bobkowska s444517
Data link: https://www.kaggle.com/datasets/lava18/google-play-store-apps
To download the dataset from Kaggle, you need to create an account and download an API token, which is required for the API to work. The token file then has to be placed in the appropriate location, which differs between Windows and Linux.
For this task I downloaded the data with the kaggle tool on Windows, because my only access to Linux is the faculty computer, and the commands from the kaggle library do not work for me there.
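As a rough sketch of the token placement described above: the helper below, `expected_kaggle_token_path`, is a hypothetical name and not part of the kaggle library. It assumes the client looks for `kaggle.json` in a `.kaggle` folder inside the user's home directory (for example `~/.kaggle/kaggle.json` on Linux or `C:\Users\<user>\.kaggle\kaggle.json` on Windows), unless the `KAGGLE_CONFIG_DIR` environment variable points elsewhere.

```python
import os
from pathlib import Path


def expected_kaggle_token_path(env=None, home=None):
    """Return where the Kaggle client is expected to look for kaggle.json.

    Assumption: the client honours KAGGLE_CONFIG_DIR and otherwise falls
    back to a ".kaggle" folder in the user's home directory.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    config_dir = env.get("KAGGLE_CONFIG_DIR")
    base = Path(config_dir) if config_dir else home / ".kaggle"
    return base / "kaggle.json"


token = expected_kaggle_token_path()
print(f"Expected token location: {token}")
print(f"Token present: {token.exists()}")
```

Running this before the download cell is a quick way to check whether the token is where the client would search for it on the current machine.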
!kaggle datasets download -d lava18/google-play-store-apps
google-play-store-apps.zip: Skipping, found more recently modified local copy (use --force to force download)
!unzip -o google-play-store-apps.zip
Archive: google-play-store-apps.zip
inflating: googleplaystore.csv
inflating: googleplaystore_user_reviews.csv
inflating: license.txt
import pandas as pd
data = pd.read_csv('googleplaystore.csv')
data
| | App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Photo Editor & Candy Camera & Grid & ScrapBook | ART_AND_DESIGN | 4.1 | 159 | 19M | 10,000+ | Free | 0 | Everyone | Art & Design | January 7, 2018 | 1.0.0 | 4.0.3 and up |
1 | Coloring book moana | ART_AND_DESIGN | 3.9 | 967 | 14M | 500,000+ | Free | 0 | Everyone | Art & Design;Pretend Play | January 15, 2018 | 2.0.0 | 4.0.3 and up |
2 | U Launcher Lite – FREE Live Cool Themes, Hide ... | ART_AND_DESIGN | 4.7 | 87510 | 8.7M | 5,000,000+ | Free | 0 | Everyone | Art & Design | August 1, 2018 | 1.2.4 | 4.0.3 and up |
3 | Sketch - Draw & Paint | ART_AND_DESIGN | 4.5 | 215644 | 25M | 50,000,000+ | Free | 0 | Teen | Art & Design | June 8, 2018 | Varies with device | 4.2 and up |
4 | Pixel Draw - Number Art Coloring Book | ART_AND_DESIGN | 4.3 | 967 | 2.8M | 100,000+ | Free | 0 | Everyone | Art & Design;Creativity | June 20, 2018 | 1.1 | 4.4 and up |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
10836 | Sya9a Maroc - FR | FAMILY | 4.5 | 38 | 53M | 5,000+ | Free | 0 | Everyone | Education | July 25, 2017 | 1.48 | 4.1 and up |
10837 | Fr. Mike Schmitz Audio Teachings | FAMILY | 5.0 | 4 | 3.6M | 100+ | Free | 0 | Everyone | Education | July 6, 2018 | 1.0 | 4.1 and up |
10838 | Parkinson Exercices FR | MEDICAL | NaN | 3 | 9.5M | 1,000+ | Free | 0 | Everyone | Medical | January 20, 2017 | 1.0 | 2.2 and up |
10839 | The SCP Foundation DB fr nn5n | BOOKS_AND_REFERENCE | 4.5 | 114 | Varies with device | 1,000+ | Free | 0 | Mature 17+ | Books & Reference | January 19, 2015 | Varies with device | Varies with device |
10840 | iHoroscope - 2018 Daily Horoscope & Astrology | LIFESTYLE | 4.5 | 398307 | 19M | 10,000,000+ | Free | 0 | Everyone | Lifestyle | July 25, 2018 | Varies with device | Varies with device |
10841 rows × 13 columns
Data exploration
data.columns
Index(['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'], dtype='object')
data.dtypes
App object
Category object
Rating float64
Reviews object
Size object
Installs object
Type object
Price object
Content Rating object
Genres object
Last Updated object
Current Ver object
Android Ver object
dtype: object
data.describe(include='all')
| | App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 10841 | 10841 | 9367.000000 | 10841 | 10841 | 10841 | 10840 | 10841 | 10840 | 10841 | 10841 | 10833 | 10838 |
unique | 9660 | 34 | NaN | 6002 | 462 | 22 | 3 | 93 | 6 | 120 | 1378 | 2832 | 33 |
top | ROBLOX | FAMILY | NaN | 0 | Varies with device | 1,000,000+ | Free | 0 | Everyone | Tools | August 3, 2018 | Varies with device | 4.1 and up |
freq | 9 | 1972 | NaN | 596 | 1695 | 1579 | 10039 | 10040 | 8714 | 842 | 326 | 1459 | 2451 |
mean | NaN | NaN | 4.193338 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
std | NaN | NaN | 0.537431 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
min | NaN | NaN | 1.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
25% | NaN | NaN | 4.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
50% | NaN | NaN | 4.300000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
75% | NaN | NaN | 4.500000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
max | NaN | NaN | 19.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
data['Category'].value_counts()
FAMILY 1972
GAME 1144
TOOLS 843
MEDICAL 463
BUSINESS 460
PRODUCTIVITY 424
PERSONALIZATION 392
COMMUNICATION 387
SPORTS 384
LIFESTYLE 382
FINANCE 366
HEALTH_AND_FITNESS 341
PHOTOGRAPHY 335
SOCIAL 295
NEWS_AND_MAGAZINES 283
SHOPPING 260
TRAVEL_AND_LOCAL 258
DATING 234
BOOKS_AND_REFERENCE 231
VIDEO_PLAYERS 175
EDUCATION 156
ENTERTAINMENT 149
MAPS_AND_NAVIGATION 137
FOOD_AND_DRINK 127
HOUSE_AND_HOME 88
LIBRARIES_AND_DEMO 85
AUTO_AND_VEHICLES 85
WEATHER 82
ART_AND_DESIGN 65
EVENTS 64
PARENTING 60
COMICS 60
BEAUTY 53
1.9 1
Name: Category, dtype: int64
data["Content Rating"].value_counts()
Everyone 8714
Teen 1208
Mature 17+ 499
Everyone 10+ 414
Adults only 18+ 3
Unrated 2
Name: Content Rating, dtype: int64
data['Genres'].value_counts()
Tools 842
Entertainment 623
Education 549
Medical 463
Business 460
...
Parenting;Brain Games 1
Health & Fitness;Education 1
Role Playing;Education 1
Puzzle;Education 1
Travel & Local;Action & Adventure 1
Name: Genres, Length: 120, dtype: int64
data['Price'].value_counts()
0 10040
$0.99 148
$2.99 129
$1.99 73
$4.99 72
...
$3.02 1
$2.95 1
$1.61 1
$14.00 1
$1.29 1
Name: Price, Length: 93, dtype: int64
data.isnull().sum()
App 0
Category 0
Rating 1474
Reviews 0
Size 0
Installs 0
Type 1
Price 0
Content Rating 1
Genres 0
Last Updated 0
Current Ver 8
Android Ver 3
dtype: int64
data.dropna(subset=['Rating', 'Type','Content Rating','Current Ver','Android Ver'], inplace=True)
data.reset_index(drop=True, inplace=True)
data
| | App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Photo Editor & Candy Camera & Grid & ScrapBook | ART_AND_DESIGN | 4.1 | 159 | 19M | 10,000+ | Free | 0 | Everyone | Art & Design | January 7, 2018 | 1.0.0 | 4.0.3 and up |
1 | Coloring book moana | ART_AND_DESIGN | 3.9 | 967 | 14M | 500,000+ | Free | 0 | Everyone | Art & Design;Pretend Play | January 15, 2018 | 2.0.0 | 4.0.3 and up |
2 | U Launcher Lite – FREE Live Cool Themes, Hide ... | ART_AND_DESIGN | 4.7 | 87510 | 8.7M | 5,000,000+ | Free | 0 | Everyone | Art & Design | August 1, 2018 | 1.2.4 | 4.0.3 and up |
3 | Sketch - Draw & Paint | ART_AND_DESIGN | 4.5 | 215644 | 25M | 50,000,000+ | Free | 0 | Teen | Art & Design | June 8, 2018 | Varies with device | 4.2 and up |
4 | Pixel Draw - Number Art Coloring Book | ART_AND_DESIGN | 4.3 | 967 | 2.8M | 100,000+ | Free | 0 | Everyone | Art & Design;Creativity | June 20, 2018 | 1.1 | 4.4 and up |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
9355 | FR Calculator | FAMILY | 4.0 | 7 | 2.6M | 500+ | Free | 0 | Everyone | Education | June 18, 2017 | 1.0.0 | 4.1 and up |
9356 | Sya9a Maroc - FR | FAMILY | 4.5 | 38 | 53M | 5,000+ | Free | 0 | Everyone | Education | July 25, 2017 | 1.48 | 4.1 and up |
9357 | Fr. Mike Schmitz Audio Teachings | FAMILY | 5.0 | 4 | 3.6M | 100+ | Free | 0 | Everyone | Education | July 6, 2018 | 1.0 | 4.1 and up |
9358 | The SCP Foundation DB fr nn5n | BOOKS_AND_REFERENCE | 4.5 | 114 | Varies with device | 1,000+ | Free | 0 | Mature 17+ | Books & Reference | January 19, 2015 | Varies with device | Varies with device |
9359 | iHoroscope - 2018 Daily Horoscope & Astrology | LIFESTYLE | 4.5 | 398307 | 19M | 10,000,000+ | Free | 0 | Everyone | Lifestyle | July 25, 2018 | Varies with device | Varies with device |
9360 rows × 13 columns
data.isnull().sum()
App 0
Category 0
Rating 0
Reviews 0
Size 0
Installs 0
Type 0
Price 0
Content Rating 0
Genres 0
Last Updated 0
Current Ver 0
Android Ver 0
dtype: int64
Simple visualizations
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(20,5))
sns.distplot(data['Rating']).set(title='Ratings')
plt.show()
data["Price"] = data["Price"].replace({r'\$': ''}, regex=True)
plt.figure(figsize=(20,5))
# Price is still a string column here, so convert it to numbers only for plotting
sns.distplot(pd.to_numeric(data['Price'])).set(title='Price')
plt.show()
The "Size" column
Although this column may matter when processing the data, it will be omitted because it contains the value "Varies with device", which would be hard to handle. Moreover, the rows containing it cannot simply be removed, because the value appears in more than 1500 rows.
print(data["Size"].unique())
['19M' '14M' '8.7M' '25M' '2.8M' '5.6M' '29M' '33M' '3.1M' '28M' '12M' '20M' '21M' '37M' '5.5M' '17M' '39M' '31M' '4.2M' '23M' '6.0M' '6.1M' '4.6M' '9.2M' '5.2M' '11M' '24M' 'Varies with device' '9.4M' '15M' '10M' '1.2M' '26M' '8.0M' '7.9M' '56M' '57M' '35M' '54M' '201k' '3.6M' '5.7M' '8.6M' '2.4M' '27M' '2.7M' '2.5M' '7.0M' '16M' '3.4M' '8.9M' '3.9M' '2.9M' '38M' '32M' '5.4M' '18M' '1.1M' '2.2M' '4.5M' '9.8M' '52M' '9.0M' '6.7M' '30M' '2.6M' '7.1M' '22M' '6.4M' '3.2M' '8.2M' '4.9M' '9.5M' '5.0M' '5.9M' '13M' '73M' '6.8M' '3.5M' '4.0M' '2.3M' '2.1M' '42M' '9.1M' '55M' '23k' '7.3M' '6.5M' '1.5M' '7.5M' '51M' '41M' '48M' '8.5M' '46M' '8.3M' '4.3M' '4.7M' '3.3M' '40M' '7.8M' '8.8M' '6.6M' '5.1M' '61M' '66M' '79k' '8.4M' '3.7M' '118k' '44M' '695k' '1.6M' '6.2M' '53M' '1.4M' '3.0M' '7.2M' '5.8M' '3.8M' '9.6M' '45M' '63M' '49M' '77M' '4.4M' '70M' '9.3M' '8.1M' '36M' '6.9M' '7.4M' '84M' '97M' '2.0M' '1.9M' '1.8M' '5.3M' '47M' '556k' '526k' '76M' '7.6M' '59M' '9.7M' '78M' '72M' '43M' '7.7M' '6.3M' '334k' '93M' '65M' '79M' '100M' '58M' '50M' '68M' '64M' '34M' '67M' '60M' '94M' '9.9M' '232k' '99M' '624k' '95M' '8.5k' '41k' '292k' '80M' '1.7M' '10.0M' '74M' '62M' '69M' '75M' '98M' '85M' '82M' '96M' '87M' '71M' '86M' '91M' '81M' '92M' '83M' '88M' '704k' '862k' '899k' '378k' '4.8M' '266k' '375k' '1.3M' '975k' '980k' '4.1M' '89M' '696k' '544k' '525k' '920k' '779k' '853k' '720k' '713k' '772k' '318k' '58k' '241k' '196k' '857k' '51k' '953k' '865k' '251k' '930k' '540k' '313k' '746k' '203k' '26k' '314k' '239k' '371k' '220k' '730k' '756k' '91k' '293k' '17k' '74k' '14k' '317k' '78k' '924k' '818k' '81k' '939k' '169k' '45k' '965k' '90M' '545k' '61k' '283k' '655k' '714k' '93k' '872k' '121k' '322k' '976k' '206k' '954k' '444k' '717k' '210k' '609k' '308k' '306k' '175k' '350k' '383k' '454k' '1.0M' '70k' '812k' '442k' '842k' '417k' '412k' '459k' '478k' '335k' '782k' '721k' '430k' '429k' '192k' '460k' '728k' '496k' '816k' '414k' '506k' '887k' '613k' '778k' '683k' '592k' '186k' '840k' '647k' 
'373k' '437k' '598k' '716k' '585k' '982k' '219k' '55k' '323k' '691k' '511k' '951k' '963k' '25k' '554k' '351k' '27k' '82k' '208k' '551k' '29k' '103k' '116k' '153k' '209k' '499k' '173k' '597k' '809k' '122k' '411k' '400k' '801k' '787k' '50k' '643k' '986k' '516k' '837k' '780k' '20k' '498k' '600k' '656k' '221k' '228k' '176k' '34k' '259k' '164k' '458k' '629k' '28k' '288k' '775k' '785k' '636k' '916k' '994k' '309k' '485k' '914k' '903k' '608k' '500k' '54k' '562k' '847k' '948k' '811k' '270k' '48k' '523k' '784k' '280k' '24k' '892k' '154k' '18k' '33k' '860k' '364k' '387k' '626k' '161k' '879k' '39k' '170k' '141k' '160k' '144k' '143k' '190k' '376k' '193k' '473k' '246k' '73k' '253k' '957k' '420k' '72k' '404k' '470k' '226k' '240k' '89k' '234k' '257k' '861k' '467k' '676k' '552k' '582k' '619k']
data[data.Size == 'Varies with device'].shape[0]
1637
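The `Size` column is dropped in this notebook. Purely as an illustration of what keeping it would involve, here is a minimal sketch with a hypothetical helper `size_to_mb` that is not part of the original analysis. It assumes the `M` and `k` suffixes mean megabytes and kilobytes, and it treats every other value, including "Varies with device", as missing. The sample values are taken from the unique values printed above.

```python
import numpy as np
import pandas as pd


def size_to_mb(value):
    """Convert strings such as '19M' or '201k' to megabytes (float).

    Anything that does not end in 'M' or 'k' (for example
    'Varies with device') becomes NaN.
    Assumption: 'k' means kilobytes and 1 MB = 1024 kB.
    """
    text = str(value).strip()
    if text.endswith("M"):
        return float(text[:-1])
    if text.endswith("k"):
        return float(text[:-1]) / 1024
    return np.nan


# A few values copied from the unique sizes listed above
sizes = pd.Series(["19M", "8.7M", "201k", "Varies with device"])
size_mb = sizes.map(size_to_mb)
print(size_mb)
print("Share of missing sizes:", size_mb.isna().mean())
```

Even with such a conversion, the rows marked "Varies with device" would still need to be imputed or excluded, which is the difficulty described above.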
data = data.drop(columns=["Size", "Android Ver", "Current Ver", "Last Updated"])
to_lowercase = ['App', 'Category', 'Type', 'Content Rating', 'Genres']
for column in to_lowercase:
data[column] = data[column].apply(str.lower)
data["Installs"] = data["Installs"].replace({r'\+': ''}, regex=True)
data["Installs"] = data["Installs"].replace({',': ''}, regex=True)
data
| | App | Category | Rating | Reviews | Installs | Type | Price | Content Rating | Genres |
---|---|---|---|---|---|---|---|---|---|
0 | photo editor & candy camera & grid & scrapbook | art_and_design | 4.1 | 2.021538e-06 | 10000 | free | 0 | everyone | art & design |
1 | coloring book moana | art_and_design | 3.9 | 1.235953e-05 | 500000 | free | 0 | everyone | art & design;pretend play |
2 | u launcher lite – free live cool themes, hide ... | art_and_design | 4.7 | 1.119638e-03 | 5000000 | free | 0 | everyone | art & design |
3 | sketch - draw & paint | art_and_design | 4.5 | 2.759054e-03 | 50000000 | free | 0 | teen | art & design |
4 | pixel draw - number art coloring book | art_and_design | 4.3 | 1.235953e-05 | 100000 | free | 0 | everyone | art & design;creativity |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
9355 | fr calculator | family | 4.0 | 7.676727e-08 | 500 | free | 0 | everyone | education |
9356 | sya9a maroc - fr | family | 4.5 | 4.733982e-07 | 5000 | free | 0 | everyone | education |
9357 | fr. mike schmitz audio teachings | family | 5.0 | 3.838364e-08 | 100 | free | 0 | everyone | education |
9358 | the scp foundation db fr nn5n | books_and_reference | 4.5 | 1.445784e-06 | 1000 | free | 0 | mature 17+ | books & reference |
9359 | ihoroscope - 2018 daily horoscope & astrology | lifestyle | 4.5 | 5.096144e-03 | 10000000 | free | 0 | everyone | lifestyle |
9360 rows × 9 columns
data["Reviews"] = pd.to_numeric(data["Reviews"], errors='coerce')
max_value = data["Reviews"].max()
min_value = data["Reviews"].min()
data["Reviews"] = (data["Reviews"] - min_value) / (max_value - min_value)
data["Installs"] = pd.to_numeric(data["Installs"], errors='coerce')
max_value = data["Installs"].max()
min_value = data["Installs"].min()
data["Installs"] = (data["Installs"] - min_value) / (max_value - min_value)
data
| | App | Category | Rating | Reviews | Installs | Type | Price | Content Rating | Genres |
---|---|---|---|---|---|---|---|---|---|
0 | photo editor & candy camera & grid & scrapbook | art_and_design | 4.1 | 2.021538e-06 | 9.999000e-06 | free | 0 | everyone | art & design |
1 | coloring book moana | art_and_design | 3.9 | 1.235953e-05 | 4.999990e-04 | free | 0 | everyone | art & design;pretend play |
2 | u launcher lite – free live cool themes, hide ... | art_and_design | 4.7 | 1.119638e-03 | 4.999999e-03 | free | 0 | everyone | art & design |
3 | sketch - draw & paint | art_and_design | 4.5 | 2.759054e-03 | 5.000000e-02 | free | 0 | teen | art & design |
4 | pixel draw - number art coloring book | art_and_design | 4.3 | 1.235953e-05 | 9.999900e-05 | free | 0 | everyone | art & design;creativity |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
9355 | fr calculator | family | 4.0 | 7.676727e-08 | 4.990000e-07 | free | 0 | everyone | education |
9356 | sya9a maroc - fr | family | 4.5 | 4.733982e-07 | 4.999000e-06 | free | 0 | everyone | education |
9357 | fr. mike schmitz audio teachings | family | 5.0 | 3.838364e-08 | 9.900000e-08 | free | 0 | everyone | education |
9358 | the scp foundation db fr nn5n | books_and_reference | 4.5 | 1.445784e-06 | 9.990000e-07 | free | 0 | mature 17+ | books & reference |
9359 | ihoroscope - 2018 daily horoscope & astrology | lifestyle | 4.5 | 5.096144e-03 | 9.999999e-03 | free | 0 | everyone | lifestyle |
9360 rows × 9 columns
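The two normalization cells above apply the same min-max formula, (x - min) / (max - min), to `Reviews` and `Installs`. As a sketch only, the repeated steps could be wrapped in one helper; `min_max_scale` and the `toy` frame are hypothetical names introduced here, and the constant-column guard is an added assumption rather than something the notebook does.

```python
import pandas as pd


def min_max_scale(series):
    """Rescale a numeric Series linearly so its minimum maps to 0 and its maximum to 1."""
    low, high = series.min(), series.max()
    if high == low:
        # Constant column: avoid division by zero and return all zeros.
        return series * 0.0
    return (series - low) / (high - low)


# Toy frame with string-typed numbers, mimicking the raw columns
toy = pd.DataFrame({"Reviews": ["159", "967", "87510"],
                    "Installs": ["10000", "500000", "5000000"]})
for column in ["Reviews", "Installs"]:
    toy[column] = min_max_scale(pd.to_numeric(toy[column], errors="coerce"))
print(toy)
```

Because the formula depends on the column's own minimum and maximum, a single very large value (such as an app with a billion installs) pushes most other values close to zero, which matches the small numbers visible in the table above.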
data.describe(include='all')
| | App | Category | Rating | Reviews | Installs | Type | Price | Content Rating | Genres |
---|---|---|---|---|---|---|---|---|---|
count | 9360 | 9360 | 9360.000000 | 9360.000000 | 9360.000000 | 9360 | 9360 | 9360 | 9360 |
unique | 8174 | 33 | NaN | NaN | NaN | 2 | 73 | 6 | 115 |
top | roblox | family | NaN | NaN | NaN | free | 0 | everyone | tools |
freq | 9 | 1746 | NaN | NaN | NaN | 8715 | 8715 | 7414 | 732 |
mean | NaN | NaN | 4.191838 | 0.006581 | 0.017909 | NaN | NaN | NaN | NaN |
std | NaN | NaN | 0.515263 | 0.040239 | 0.091266 | NaN | NaN | NaN | NaN |
min | NaN | NaN | 1.000000 | 0.000000 | 0.000000 | NaN | NaN | NaN | NaN |
25% | NaN | NaN | 4.000000 | 0.000002 | 0.000010 | NaN | NaN | NaN | NaN |
50% | NaN | NaN | 4.300000 | 0.000076 | 0.000500 | NaN | NaN | NaN | NaN |
75% | NaN | NaN | 4.500000 | 0.001044 | 0.005000 | NaN | NaN | NaN | NaN |
max | NaN | NaN | 5.000000 | 1.000000 | 1.000000 | NaN | NaN | NaN | NaN |
Splitting into test, train, validation sets
import numpy as np
np.random.seed(123)
train, validate, test = np.split(data.sample(frac=1, random_state=42), [int(.6*len(data)), int(.8*len(data))])
print(f"Data shape: {data.shape}\nTrain shape: {train.shape}\nTest shape: {test.shape}\nValidation shape: {validate.shape}")
Data shape: (9360, 9)
Train shape: (5616, 9)
Test shape: (1872, 9)
Validation shape: (1872, 9)
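The same 60/20/20 split can also be written with plain positional slicing, which may be preferable because passing a DataFrame to `np.split` can emit deprecation warnings in newer pandas versions. This is a minimal sketch; `split_train_validate_test` and its parameters are hypothetical names, and the cut points mirror the ones used above.

```python
import pandas as pd


def split_train_validate_test(frame, train_cut=0.6, validate_cut=0.8, seed=42):
    """Shuffle rows once, then cut them into train / validate / test parts.

    train_cut and validate_cut are cumulative fractions of the row count.
    """
    shuffled = frame.sample(frac=1, random_state=seed)
    train_end = int(train_cut * len(shuffled))
    validate_end = int(validate_cut * len(shuffled))
    return (shuffled.iloc[:train_end],
            shuffled.iloc[train_end:validate_end],
            shuffled.iloc[validate_end:])


# Toy frame standing in for the cleaned data
toy = pd.DataFrame({"x": range(10)})
train_part, validate_part, test_part = split_train_validate_test(toy)
print(len(train_part), len(validate_part), len(test_part))
```

The three parts are disjoint by construction, since each row of the shuffled frame falls into exactly one slice.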