ium_444517/data_exploration.ipynb

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Google Play Store data exploration\n",
"### Kamila Bobkowska s444517\n",
"\n",
"Dataset link: https://www.kaggle.com/datasets/lava18/google-play-store-apps"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To download the dataset from Kaggle, you need to create an account and download an API token, which is required for the API to work. The token must then be placed in the right location, which depends on the operating system: `~/.kaggle/kaggle.json` on Linux and `C:\\Users\\<username>\\.kaggle\\kaggle.json` on Windows.\n",
"\n",
"*For this assignment I downloaded the data with kaggle on Windows, because the only Linux machine I have access to is the faculty computer, and the kaggle library commands do not work for me there.*"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"google-play-store-apps.zip: Skipping, found more recently modified local copy (use --force to force download)\n"
]
}
],
"source": [
"!kaggle datasets download -d lava18/google-play-store-apps"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Archive: google-play-store-apps.zip\n",
" inflating: googleplaystore.csv \n",
" inflating: googleplaystore_user_reviews.csv \n",
" inflating: license.txt \n"
]
}
],
"source": [
"!unzip -o google-play-store-apps.zip"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>App</th>\n",
" <th>Category</th>\n",
" <th>Rating</th>\n",
" <th>Reviews</th>\n",
" <th>Size</th>\n",
" <th>Installs</th>\n",
" <th>Type</th>\n",
" <th>Price</th>\n",
" <th>Content Rating</th>\n",
" <th>Genres</th>\n",
" <th>Last Updated</th>\n",
" <th>Current Ver</th>\n",
" <th>Android Ver</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Photo Editor &amp; Candy Camera &amp; Grid &amp; ScrapBook</td>\n",
" <td>ART_AND_DESIGN</td>\n",
" <td>4.1</td>\n",
" <td>159</td>\n",
" <td>19M</td>\n",
" <td>10,000+</td>\n",
" <td>Free</td>\n",
" <td>0</td>\n",
" <td>Everyone</td>\n",
" <td>Art &amp; Design</td>\n",
" <td>January 7, 2018</td>\n",
" <td>1.0.0</td>\n",
" <td>4.0.3 and up</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Coloring book moana</td>\n",
" <td>ART_AND_DESIGN</td>\n",
" <td>3.9</td>\n",
" <td>967</td>\n",
" <td>14M</td>\n",
" <td>500,000+</td>\n",
" <td>Free</td>\n",
" <td>0</td>\n",
" <td>Everyone</td>\n",
" <td>Art &amp; Design;Pretend Play</td>\n",
" <td>January 15, 2018</td>\n",
" <td>2.0.0</td>\n",
" <td>4.0.3 and up</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>U Launcher Lite FREE Live Cool Themes, Hide ...</td>\n",
" <td>ART_AND_DESIGN</td>\n",
" <td>4.7</td>\n",
" <td>87510</td>\n",
" <td>8.7M</td>\n",
" <td>5,000,000+</td>\n",
" <td>Free</td>\n",
" <td>0</td>\n",
" <td>Everyone</td>\n",
" <td>Art &amp; Design</td>\n",
" <td>August 1, 2018</td>\n",
" <td>1.2.4</td>\n",
" <td>4.0.3 and up</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Sketch - Draw &amp; Paint</td>\n",
" <td>ART_AND_DESIGN</td>\n",
" <td>4.5</td>\n",
" <td>215644</td>\n",
" <td>25M</td>\n",
" <td>50,000,000+</td>\n",
" <td>Free</td>\n",
" <td>0</td>\n",
" <td>Teen</td>\n",
" <td>Art &amp; Design</td>\n",
" <td>June 8, 2018</td>\n",
" <td>Varies with device</td>\n",
" <td>4.2 and up</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Pixel Draw - Number Art Coloring Book</td>\n",
" <td>ART_AND_DESIGN</td>\n",
" <td>4.3</td>\n",
" <td>967</td>\n",
" <td>2.8M</td>\n",
" <td>100,000+</td>\n",
" <td>Free</td>\n",
" <td>0</td>\n",
" <td>Everyone</td>\n",
" <td>Art &amp; Design;Creativity</td>\n",
" <td>June 20, 2018</td>\n",
" <td>1.1</td>\n",
" <td>4.4 and up</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10836</th>\n",
" <td>Sya9a Maroc - FR</td>\n",
" <td>FAMILY</td>\n",
" <td>4.5</td>\n",
" <td>38</td>\n",
" <td>53M</td>\n",
" <td>5,000+</td>\n",
" <td>Free</td>\n",
" <td>0</td>\n",
" <td>Everyone</td>\n",
" <td>Education</td>\n",
" <td>July 25, 2017</td>\n",
" <td>1.48</td>\n",
" <td>4.1 and up</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10837</th>\n",
" <td>Fr. Mike Schmitz Audio Teachings</td>\n",
" <td>FAMILY</td>\n",
" <td>5.0</td>\n",
" <td>4</td>\n",
" <td>3.6M</td>\n",
" <td>100+</td>\n",
" <td>Free</td>\n",
" <td>0</td>\n",
" <td>Everyone</td>\n",
" <td>Education</td>\n",
" <td>July 6, 2018</td>\n",
" <td>1.0</td>\n",
" <td>4.1 and up</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10838</th>\n",
" <td>Parkinson Exercices FR</td>\n",
" <td>MEDICAL</td>\n",
" <td>NaN</td>\n",
" <td>3</td>\n",
" <td>9.5M</td>\n",
" <td>1,000+</td>\n",
" <td>Free</td>\n",
" <td>0</td>\n",
" <td>Everyone</td>\n",
" <td>Medical</td>\n",
" <td>January 20, 2017</td>\n",
" <td>1.0</td>\n",
" <td>2.2 and up</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10839</th>\n",
" <td>The SCP Foundation DB fr nn5n</td>\n",
" <td>BOOKS_AND_REFERENCE</td>\n",
" <td>4.5</td>\n",
" <td>114</td>\n",
" <td>Varies with device</td>\n",
" <td>1,000+</td>\n",
" <td>Free</td>\n",
" <td>0</td>\n",
" <td>Mature 17+</td>\n",
" <td>Books &amp; Reference</td>\n",
" <td>January 19, 2015</td>\n",
" <td>Varies with device</td>\n",
" <td>Varies with device</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10840</th>\n",
" <td>iHoroscope - 2018 Daily Horoscope &amp; Astrology</td>\n",
" <td>LIFESTYLE</td>\n",
" <td>4.5</td>\n",
" <td>398307</td>\n",
" <td>19M</td>\n",
" <td>10,000,000+</td>\n",
" <td>Free</td>\n",
" <td>0</td>\n",
" <td>Everyone</td>\n",
" <td>Lifestyle</td>\n",
" <td>July 25, 2018</td>\n",
" <td>Varies with device</td>\n",
" <td>Varies with device</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>10841 rows × 13 columns</p>\n",
"</div>"
],
"text/plain": [
" App Category \\\n",
"0 Photo Editor & Candy Camera & Grid & ScrapBook ART_AND_DESIGN \n",
"1 Coloring book moana ART_AND_DESIGN \n",
"2 U Launcher Lite FREE Live Cool Themes, Hide ... ART_AND_DESIGN \n",
"3 Sketch - Draw & Paint ART_AND_DESIGN \n",
"4 Pixel Draw - Number Art Coloring Book ART_AND_DESIGN \n",
"... ... ... \n",
"10836 Sya9a Maroc - FR FAMILY \n",
"10837 Fr. Mike Schmitz Audio Teachings FAMILY \n",
"10838 Parkinson Exercices FR MEDICAL \n",
"10839 The SCP Foundation DB fr nn5n BOOKS_AND_REFERENCE \n",
"10840 iHoroscope - 2018 Daily Horoscope & Astrology LIFESTYLE \n",
"\n",
" Rating Reviews Size Installs Type Price \\\n",
"0 4.1 159 19M 10,000+ Free 0 \n",
"1 3.9 967 14M 500,000+ Free 0 \n",
"2 4.7 87510 8.7M 5,000,000+ Free 0 \n",
"3 4.5 215644 25M 50,000,000+ Free 0 \n",
"4 4.3 967 2.8M 100,000+ Free 0 \n",
"... ... ... ... ... ... ... \n",
"10836 4.5 38 53M 5,000+ Free 0 \n",
"10837 5.0 4 3.6M 100+ Free 0 \n",
"10838 NaN 3 9.5M 1,000+ Free 0 \n",
"10839 4.5 114 Varies with device 1,000+ Free 0 \n",
"10840 4.5 398307 19M 10,000,000+ Free 0 \n",
"\n",
" Content Rating Genres Last Updated \\\n",
"0 Everyone Art & Design January 7, 2018 \n",
"1 Everyone Art & Design;Pretend Play January 15, 2018 \n",
"2 Everyone Art & Design August 1, 2018 \n",
"3 Teen Art & Design June 8, 2018 \n",
"4 Everyone Art & Design;Creativity June 20, 2018 \n",
"... ... ... ... \n",
"10836 Everyone Education July 25, 2017 \n",
"10837 Everyone Education July 6, 2018 \n",
"10838 Everyone Medical January 20, 2017 \n",
"10839 Mature 17+ Books & Reference January 19, 2015 \n",
"10840 Everyone Lifestyle July 25, 2018 \n",
"\n",
" Current Ver Android Ver \n",
"0 1.0.0 4.0.3 and up \n",
"1 2.0.0 4.0.3 and up \n",
"2 1.2.4 4.0.3 and up \n",
"3 Varies with device 4.2 and up \n",
"4 1.1 4.4 and up \n",
"... ... ... \n",
"10836 1.48 4.1 and up \n",
"10837 1.0 4.1 and up \n",
"10838 1.0 2.2 and up \n",
"10839 Varies with device Varies with device \n",
"10840 Varies with device Varies with device \n",
"\n",
"[10841 rows x 13 columns]"
]
},
"execution_count": 47,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"\n",
"data = pd.read_csv('googleplaystore.csv')\n",
"data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data exploration"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index(['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type',\n",
" 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver',\n",
" 'Android Ver'],\n",
" dtype='object')"
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.columns"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"App object\n",
"Category object\n",
"Rating float64\n",
"Reviews object\n",
"Size object\n",
"Installs object\n",
"Type object\n",
"Price object\n",
"Content Rating object\n",
"Genres object\n",
"Last Updated object\n",
"Current Ver object\n",
"Android Ver object\n",
"dtype: object"
]
},
"execution_count": 49,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.dtypes"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>App</th>\n",
" <th>Category</th>\n",
" <th>Rating</th>\n",
" <th>Reviews</th>\n",
" <th>Size</th>\n",
" <th>Installs</th>\n",
" <th>Type</th>\n",
" <th>Price</th>\n",
" <th>Content Rating</th>\n",
" <th>Genres</th>\n",
" <th>Last Updated</th>\n",
" <th>Current Ver</th>\n",
" <th>Android Ver</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>10841</td>\n",
" <td>10841</td>\n",
" <td>9367.000000</td>\n",
" <td>10841</td>\n",
" <td>10841</td>\n",
" <td>10841</td>\n",
" <td>10840</td>\n",
" <td>10841</td>\n",
" <td>10840</td>\n",
" <td>10841</td>\n",
" <td>10841</td>\n",
" <td>10833</td>\n",
" <td>10838</td>\n",
" </tr>\n",
" <tr>\n",
" <th>unique</th>\n",
" <td>9660</td>\n",
" <td>34</td>\n",
" <td>NaN</td>\n",
" <td>6002</td>\n",
" <td>462</td>\n",
" <td>22</td>\n",
" <td>3</td>\n",
" <td>93</td>\n",
" <td>6</td>\n",
" <td>120</td>\n",
" <td>1378</td>\n",
" <td>2832</td>\n",
" <td>33</td>\n",
" </tr>\n",
" <tr>\n",
" <th>top</th>\n",
" <td>ROBLOX</td>\n",
" <td>FAMILY</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>Varies with device</td>\n",
" <td>1,000,000+</td>\n",
" <td>Free</td>\n",
" <td>0</td>\n",
" <td>Everyone</td>\n",
" <td>Tools</td>\n",
" <td>August 3, 2018</td>\n",
" <td>Varies with device</td>\n",
" <td>4.1 and up</td>\n",
" </tr>\n",
" <tr>\n",
" <th>freq</th>\n",
" <td>9</td>\n",
" <td>1972</td>\n",
" <td>NaN</td>\n",
" <td>596</td>\n",
" <td>1695</td>\n",
" <td>1579</td>\n",
" <td>10039</td>\n",
" <td>10040</td>\n",
" <td>8714</td>\n",
" <td>842</td>\n",
" <td>326</td>\n",
" <td>1459</td>\n",
" <td>2451</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mean</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>4.193338</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>std</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>0.537431</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>min</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>1.000000</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25%</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>4.000000</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50%</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>4.300000</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75%</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>4.500000</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>max</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>19.000000</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" App Category Rating Reviews Size Installs \\\n",
"count 10841 10841 9367.000000 10841 10841 10841 \n",
"unique 9660 34 NaN 6002 462 22 \n",
"top ROBLOX FAMILY NaN 0 Varies with device 1,000,000+ \n",
"freq 9 1972 NaN 596 1695 1579 \n",
"mean NaN NaN 4.193338 NaN NaN NaN \n",
"std NaN NaN 0.537431 NaN NaN NaN \n",
"min NaN NaN 1.000000 NaN NaN NaN \n",
"25% NaN NaN 4.000000 NaN NaN NaN \n",
"50% NaN NaN 4.300000 NaN NaN NaN \n",
"75% NaN NaN 4.500000 NaN NaN NaN \n",
"max NaN NaN 19.000000 NaN NaN NaN \n",
"\n",
" Type Price Content Rating Genres Last Updated \\\n",
"count 10840 10841 10840 10841 10841 \n",
"unique 3 93 6 120 1378 \n",
"top Free 0 Everyone Tools August 3, 2018 \n",
"freq 10039 10040 8714 842 326 \n",
"mean NaN NaN NaN NaN NaN \n",
"std NaN NaN NaN NaN NaN \n",
"min NaN NaN NaN NaN NaN \n",
"25% NaN NaN NaN NaN NaN \n",
"50% NaN NaN NaN NaN NaN \n",
"75% NaN NaN NaN NaN NaN \n",
"max NaN NaN NaN NaN NaN \n",
"\n",
" Current Ver Android Ver \n",
"count 10833 10838 \n",
"unique 2832 33 \n",
"top Varies with device 4.1 and up \n",
"freq 1459 2451 \n",
"mean NaN NaN \n",
"std NaN NaN \n",
"min NaN NaN \n",
"25% NaN NaN \n",
"50% NaN NaN \n",
"75% NaN NaN \n",
"max NaN NaN "
]
},
"execution_count": 50,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.describe(include='all')"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"FAMILY 1972\n",
"GAME 1144\n",
"TOOLS 843\n",
"MEDICAL 463\n",
"BUSINESS 460\n",
"PRODUCTIVITY 424\n",
"PERSONALIZATION 392\n",
"COMMUNICATION 387\n",
"SPORTS 384\n",
"LIFESTYLE 382\n",
"FINANCE 366\n",
"HEALTH_AND_FITNESS 341\n",
"PHOTOGRAPHY 335\n",
"SOCIAL 295\n",
"NEWS_AND_MAGAZINES 283\n",
"SHOPPING 260\n",
"TRAVEL_AND_LOCAL 258\n",
"DATING 234\n",
"BOOKS_AND_REFERENCE 231\n",
"VIDEO_PLAYERS 175\n",
"EDUCATION 156\n",
"ENTERTAINMENT 149\n",
"MAPS_AND_NAVIGATION 137\n",
"FOOD_AND_DRINK 127\n",
"HOUSE_AND_HOME 88\n",
"LIBRARIES_AND_DEMO 85\n",
"AUTO_AND_VEHICLES 85\n",
"WEATHER 82\n",
"ART_AND_DESIGN 65\n",
"EVENTS 64\n",
"PARENTING 60\n",
"COMICS 60\n",
"BEAUTY 53\n",
"1.9 1\n",
"Name: Category, dtype: int64"
]
},
"execution_count": 51,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data['Category'].value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Everyone 8714\n",
"Teen 1208\n",
"Mature 17+ 499\n",
"Everyone 10+ 414\n",
"Adults only 18+ 3\n",
"Unrated 2\n",
"Name: Content Rating, dtype: int64"
]
},
"execution_count": 52,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data[\"Content Rating\"].value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Tools 842\n",
"Entertainment 623\n",
"Education 549\n",
"Medical 463\n",
"Business 460\n",
" ... \n",
"Parenting;Brain Games 1\n",
"Health & Fitness;Education 1\n",
"Role Playing;Education 1\n",
"Puzzle;Education 1\n",
"Travel & Local;Action & Adventure 1\n",
"Name: Genres, Length: 120, dtype: int64"
]
},
"execution_count": 53,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data['Genres'].value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 10040\n",
"$0.99 148\n",
"$2.99 129\n",
"$1.99 73\n",
"$4.99 72\n",
" ... \n",
"$3.02 1\n",
"$2.95 1\n",
"$1.61 1\n",
"$14.00 1\n",
"$1.29 1\n",
"Name: Price, Length: 93, dtype: int64"
]
},
"execution_count": 54,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data['Price'].value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"App 0\n",
"Category 0\n",
"Rating 1474\n",
"Reviews 0\n",
"Size 0\n",
"Installs 0\n",
"Type 1\n",
"Price 0\n",
"Content Rating 1\n",
"Genres 0\n",
"Last Updated 0\n",
"Current Ver 8\n",
"Android Ver 3\n",
"dtype: int64"
]
},
"execution_count": 55,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.isnull().sum()"
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>App</th>\n",
" <th>Category</th>\n",
" <th>Rating</th>\n",
" <th>Reviews</th>\n",
" <th>Size</th>\n",
" <th>Installs</th>\n",
" <th>Type</th>\n",
" <th>Price</th>\n",
" <th>Content Rating</th>\n",
" <th>Genres</th>\n",
" <th>Last Updated</th>\n",
" <th>Current Ver</th>\n",
" <th>Android Ver</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Photo Editor &amp; Candy Camera &amp; Grid &amp; ScrapBook</td>\n",
" <td>ART_AND_DESIGN</td>\n",
" <td>4.1</td>\n",
" <td>159</td>\n",
" <td>19M</td>\n",
" <td>10,000+</td>\n",
" <td>Free</td>\n",
" <td>0</td>\n",
" <td>Everyone</td>\n",
" <td>Art &amp; Design</td>\n",
" <td>January 7, 2018</td>\n",
" <td>1.0.0</td>\n",
" <td>4.0.3 and up</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Coloring book moana</td>\n",
" <td>ART_AND_DESIGN</td>\n",
" <td>3.9</td>\n",
" <td>967</td>\n",
" <td>14M</td>\n",
" <td>500,000+</td>\n",
" <td>Free</td>\n",
" <td>0</td>\n",
" <td>Everyone</td>\n",
" <td>Art &amp; Design;Pretend Play</td>\n",
" <td>January 15, 2018</td>\n",
" <td>2.0.0</td>\n",
" <td>4.0.3 and up</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>U Launcher Lite FREE Live Cool Themes, Hide ...</td>\n",
" <td>ART_AND_DESIGN</td>\n",
" <td>4.7</td>\n",
" <td>87510</td>\n",
" <td>8.7M</td>\n",
" <td>5,000,000+</td>\n",
" <td>Free</td>\n",
" <td>0</td>\n",
" <td>Everyone</td>\n",
" <td>Art &amp; Design</td>\n",
" <td>August 1, 2018</td>\n",
" <td>1.2.4</td>\n",
" <td>4.0.3 and up</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Sketch - Draw &amp; Paint</td>\n",
" <td>ART_AND_DESIGN</td>\n",
" <td>4.5</td>\n",
" <td>215644</td>\n",
" <td>25M</td>\n",
" <td>50,000,000+</td>\n",
" <td>Free</td>\n",
" <td>0</td>\n",
" <td>Teen</td>\n",
" <td>Art &amp; Design</td>\n",
" <td>June 8, 2018</td>\n",
" <td>Varies with device</td>\n",
" <td>4.2 and up</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Pixel Draw - Number Art Coloring Book</td>\n",
" <td>ART_AND_DESIGN</td>\n",
" <td>4.3</td>\n",
" <td>967</td>\n",
" <td>2.8M</td>\n",
" <td>100,000+</td>\n",
" <td>Free</td>\n",
" <td>0</td>\n",
" <td>Everyone</td>\n",
" <td>Art &amp; Design;Creativity</td>\n",
" <td>June 20, 2018</td>\n",
" <td>1.1</td>\n",
" <td>4.4 and up</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9355</th>\n",
" <td>FR Calculator</td>\n",
" <td>FAMILY</td>\n",
" <td>4.0</td>\n",
" <td>7</td>\n",
" <td>2.6M</td>\n",
" <td>500+</td>\n",
" <td>Free</td>\n",
" <td>0</td>\n",
" <td>Everyone</td>\n",
" <td>Education</td>\n",
" <td>June 18, 2017</td>\n",
" <td>1.0.0</td>\n",
" <td>4.1 and up</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9356</th>\n",
" <td>Sya9a Maroc - FR</td>\n",
" <td>FAMILY</td>\n",
" <td>4.5</td>\n",
" <td>38</td>\n",
" <td>53M</td>\n",
" <td>5,000+</td>\n",
" <td>Free</td>\n",
" <td>0</td>\n",
" <td>Everyone</td>\n",
" <td>Education</td>\n",
" <td>July 25, 2017</td>\n",
" <td>1.48</td>\n",
" <td>4.1 and up</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9357</th>\n",
" <td>Fr. Mike Schmitz Audio Teachings</td>\n",
" <td>FAMILY</td>\n",
" <td>5.0</td>\n",
" <td>4</td>\n",
" <td>3.6M</td>\n",
" <td>100+</td>\n",
" <td>Free</td>\n",
" <td>0</td>\n",
" <td>Everyone</td>\n",
" <td>Education</td>\n",
" <td>July 6, 2018</td>\n",
" <td>1.0</td>\n",
" <td>4.1 and up</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9358</th>\n",
" <td>The SCP Foundation DB fr nn5n</td>\n",
" <td>BOOKS_AND_REFERENCE</td>\n",
" <td>4.5</td>\n",
" <td>114</td>\n",
" <td>Varies with device</td>\n",
" <td>1,000+</td>\n",
" <td>Free</td>\n",
" <td>0</td>\n",
" <td>Mature 17+</td>\n",
" <td>Books &amp; Reference</td>\n",
" <td>January 19, 2015</td>\n",
" <td>Varies with device</td>\n",
" <td>Varies with device</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9359</th>\n",
" <td>iHoroscope - 2018 Daily Horoscope &amp; Astrology</td>\n",
" <td>LIFESTYLE</td>\n",
" <td>4.5</td>\n",
" <td>398307</td>\n",
" <td>19M</td>\n",
" <td>10,000,000+</td>\n",
" <td>Free</td>\n",
" <td>0</td>\n",
" <td>Everyone</td>\n",
" <td>Lifestyle</td>\n",
" <td>July 25, 2018</td>\n",
" <td>Varies with device</td>\n",
" <td>Varies with device</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>9360 rows × 13 columns</p>\n",
"</div>"
],
"text/plain": [
" App Category \\\n",
"0 Photo Editor & Candy Camera & Grid & ScrapBook ART_AND_DESIGN \n",
"1 Coloring book moana ART_AND_DESIGN \n",
"2 U Launcher Lite FREE Live Cool Themes, Hide ... ART_AND_DESIGN \n",
"3 Sketch - Draw & Paint ART_AND_DESIGN \n",
"4 Pixel Draw - Number Art Coloring Book ART_AND_DESIGN \n",
"... ... ... \n",
"9355 FR Calculator FAMILY \n",
"9356 Sya9a Maroc - FR FAMILY \n",
"9357 Fr. Mike Schmitz Audio Teachings FAMILY \n",
"9358 The SCP Foundation DB fr nn5n BOOKS_AND_REFERENCE \n",
"9359 iHoroscope - 2018 Daily Horoscope & Astrology LIFESTYLE \n",
"\n",
" Rating Reviews Size Installs Type Price \\\n",
"0 4.1 159 19M 10,000+ Free 0 \n",
"1 3.9 967 14M 500,000+ Free 0 \n",
"2 4.7 87510 8.7M 5,000,000+ Free 0 \n",
"3 4.5 215644 25M 50,000,000+ Free 0 \n",
"4 4.3 967 2.8M 100,000+ Free 0 \n",
"... ... ... ... ... ... ... \n",
"9355 4.0 7 2.6M 500+ Free 0 \n",
"9356 4.5 38 53M 5,000+ Free 0 \n",
"9357 5.0 4 3.6M 100+ Free 0 \n",
"9358 4.5 114 Varies with device 1,000+ Free 0 \n",
"9359 4.5 398307 19M 10,000,000+ Free 0 \n",
"\n",
" Content Rating Genres Last Updated \\\n",
"0 Everyone Art & Design January 7, 2018 \n",
"1 Everyone Art & Design;Pretend Play January 15, 2018 \n",
"2 Everyone Art & Design August 1, 2018 \n",
"3 Teen Art & Design June 8, 2018 \n",
"4 Everyone Art & Design;Creativity June 20, 2018 \n",
"... ... ... ... \n",
"9355 Everyone Education June 18, 2017 \n",
"9356 Everyone Education July 25, 2017 \n",
"9357 Everyone Education July 6, 2018 \n",
"9358 Mature 17+ Books & Reference January 19, 2015 \n",
"9359 Everyone Lifestyle July 25, 2018 \n",
"\n",
" Current Ver Android Ver \n",
"0 1.0.0 4.0.3 and up \n",
"1 2.0.0 4.0.3 and up \n",
"2 1.2.4 4.0.3 and up \n",
"3 Varies with device 4.2 and up \n",
"4 1.1 4.4 and up \n",
"... ... ... \n",
"9355 1.0.0 4.1 and up \n",
"9356 1.48 4.1 and up \n",
"9357 1.0 4.1 and up \n",
"9358 Varies with device Varies with device \n",
"9359 Varies with device Varies with device \n",
"\n",
"[9360 rows x 13 columns]"
]
},
"execution_count": 56,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.dropna(subset=['Rating', 'Type','Content Rating','Current Ver','Android Ver'], inplace=True)\n",
"data.reset_index(drop=True, inplace=True)\n",
"data"
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"App 0\n",
"Category 0\n",
"Rating 0\n",
"Reviews 0\n",
"Size 0\n",
"Installs 0\n",
"Type 0\n",
"Price 0\n",
"Content Rating 0\n",
"Genres 0\n",
"Last Updated 0\n",
"Current Ver 0\n",
"Android Ver 0\n",
"dtype: int64"
]
},
"execution_count": 57,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.isnull().sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Simple visualizations"
]
},
{
"cell_type": "code",
"execution_count": 58,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 1440x360 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
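],
"source": [
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"\n",
"plt.figure(figsize=(20,5))\n",
"# histplot replaces the deprecated distplot; kde=True and stat='density' give a comparable density plot\n",
"sns.histplot(data['Rating'], kde=True, stat='density').set(title='Ratings')\n",
"plt.show()"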
]
},
{
"cell_type": "code",
"execution_count": 74,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 1440x360 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
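],
"source": [
"# Strip the currency symbol and convert to float so the prices are plotted as numbers\n",
"data['Price'] = data['Price'].astype(str).str.replace('$', '', regex=False).astype(float)\n",
"plt.figure(figsize=(20,5))\n",
"sns.histplot(data['Price'], kde=True, stat='density').set(title='Prices')\n",
"plt.show()"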
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# The \"Size\" column\n",
"Although this column could be relevant to the analysis, it will be dropped because it contains the value \"Varies with device\", which would be hard to handle. Simply removing every row with that value is not an option either, because it occurs in more than 1,500 rows."
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['19M' '14M' '8.7M' '25M' '2.8M' '5.6M' '29M' '33M' '3.1M' '28M' '12M'\n",
" '20M' '21M' '37M' '5.5M' '17M' '39M' '31M' '4.2M' '23M' '6.0M' '6.1M'\n",
" '4.6M' '9.2M' '5.2M' '11M' '24M' 'Varies with device' '9.4M' '15M' '10M'\n",
" '1.2M' '26M' '8.0M' '7.9M' '56M' '57M' '35M' '54M' '201k' '3.6M' '5.7M'\n",
" '8.6M' '2.4M' '27M' '2.7M' '2.5M' '7.0M' '16M' '3.4M' '8.9M' '3.9M'\n",
" '2.9M' '38M' '32M' '5.4M' '18M' '1.1M' '2.2M' '4.5M' '9.8M' '52M' '9.0M'\n",
" '6.7M' '30M' '2.6M' '7.1M' '22M' '6.4M' '3.2M' '8.2M' '4.9M' '9.5M'\n",
" '5.0M' '5.9M' '13M' '73M' '6.8M' '3.5M' '4.0M' '2.3M' '2.1M' '42M' '9.1M'\n",
" '55M' '23k' '7.3M' '6.5M' '1.5M' '7.5M' '51M' '41M' '48M' '8.5M' '46M'\n",
" '8.3M' '4.3M' '4.7M' '3.3M' '40M' '7.8M' '8.8M' '6.6M' '5.1M' '61M' '66M'\n",
" '79k' '8.4M' '3.7M' '118k' '44M' '695k' '1.6M' '6.2M' '53M' '1.4M' '3.0M'\n",
" '7.2M' '5.8M' '3.8M' '9.6M' '45M' '63M' '49M' '77M' '4.4M' '70M' '9.3M'\n",
" '8.1M' '36M' '6.9M' '7.4M' '84M' '97M' '2.0M' '1.9M' '1.8M' '5.3M' '47M'\n",
" '556k' '526k' '76M' '7.6M' '59M' '9.7M' '78M' '72M' '43M' '7.7M' '6.3M'\n",
" '334k' '93M' '65M' '79M' '100M' '58M' '50M' '68M' '64M' '34M' '67M' '60M'\n",
" '94M' '9.9M' '232k' '99M' '624k' '95M' '8.5k' '41k' '292k' '80M' '1.7M'\n",
" '10.0M' '74M' '62M' '69M' '75M' '98M' '85M' '82M' '96M' '87M' '71M' '86M'\n",
" '91M' '81M' '92M' '83M' '88M' '704k' '862k' '899k' '378k' '4.8M' '266k'\n",
" '375k' '1.3M' '975k' '980k' '4.1M' '89M' '696k' '544k' '525k' '920k'\n",
" '779k' '853k' '720k' '713k' '772k' '318k' '58k' '241k' '196k' '857k'\n",
" '51k' '953k' '865k' '251k' '930k' '540k' '313k' '746k' '203k' '26k'\n",
" '314k' '239k' '371k' '220k' '730k' '756k' '91k' '293k' '17k' '74k' '14k'\n",
" '317k' '78k' '924k' '818k' '81k' '939k' '169k' '45k' '965k' '90M' '545k'\n",
" '61k' '283k' '655k' '714k' '93k' '872k' '121k' '322k' '976k' '206k'\n",
" '954k' '444k' '717k' '210k' '609k' '308k' '306k' '175k' '350k' '383k'\n",
" '454k' '1.0M' '70k' '812k' '442k' '842k' '417k' '412k' '459k' '478k'\n",
" '335k' '782k' '721k' '430k' '429k' '192k' '460k' '728k' '496k' '816k'\n",
" '414k' '506k' '887k' '613k' '778k' '683k' '592k' '186k' '840k' '647k'\n",
" '373k' '437k' '598k' '716k' '585k' '982k' '219k' '55k' '323k' '691k'\n",
" '511k' '951k' '963k' '25k' '554k' '351k' '27k' '82k' '208k' '551k' '29k'\n",
" '103k' '116k' '153k' '209k' '499k' '173k' '597k' '809k' '122k' '411k'\n",
" '400k' '801k' '787k' '50k' '643k' '986k' '516k' '837k' '780k' '20k'\n",
" '498k' '600k' '656k' '221k' '228k' '176k' '34k' '259k' '164k' '458k'\n",
" '629k' '28k' '288k' '775k' '785k' '636k' '916k' '994k' '309k' '485k'\n",
" '914k' '903k' '608k' '500k' '54k' '562k' '847k' '948k' '811k' '270k'\n",
" '48k' '523k' '784k' '280k' '24k' '892k' '154k' '18k' '33k' '860k' '364k'\n",
" '387k' '626k' '161k' '879k' '39k' '170k' '141k' '160k' '144k' '143k'\n",
" '190k' '376k' '193k' '473k' '246k' '73k' '253k' '957k' '420k' '72k'\n",
" '404k' '470k' '226k' '240k' '89k' '234k' '257k' '861k' '467k' '676k'\n",
" '552k' '582k' '619k']\n"
]
}
],
"source": [
"print(data[\"Size\"].unique())"
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1637"
]
},
"execution_count": 63,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data[data.Size == 'Varies with device'].shape[0]"
]
},
{
"cell_type": "code",
"execution_count": 64,
"metadata": {},
"outputs": [],
"source": [
"data = data.drop(columns=[\"Size\", \"Android Ver\", \"Current Ver\", \"Last Updated\"])"
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {},
"outputs": [],
"source": [
"# Normalize the text columns to lowercase\n",
"to_lowercase = ['App', 'Category', 'Type', 'Content Rating', 'Genres']\n",
"for column in to_lowercase:\n",
"    data[column] = data[column].str.lower()"
]
},
{
"cell_type": "code",
"execution_count": 71,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>App</th>\n",
" <th>Category</th>\n",
" <th>Rating</th>\n",
" <th>Reviews</th>\n",
" <th>Installs</th>\n",
" <th>Type</th>\n",
" <th>Price</th>\n",
" <th>Content Rating</th>\n",
" <th>Genres</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>photo editor &amp; candy camera &amp; grid &amp; scrapbook</td>\n",
" <td>art_and_design</td>\n",
" <td>4.1</td>\n",
" <td>2.021538e-06</td>\n",
" <td>10000</td>\n",
" <td>free</td>\n",
" <td>0</td>\n",
" <td>everyone</td>\n",
" <td>art &amp; design</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>coloring book moana</td>\n",
" <td>art_and_design</td>\n",
" <td>3.9</td>\n",
" <td>1.235953e-05</td>\n",
" <td>500000</td>\n",
" <td>free</td>\n",
" <td>0</td>\n",
" <td>everyone</td>\n",
" <td>art &amp; design;pretend play</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>u launcher lite free live cool themes, hide ...</td>\n",
" <td>art_and_design</td>\n",
" <td>4.7</td>\n",
" <td>1.119638e-03</td>\n",
" <td>5000000</td>\n",
" <td>free</td>\n",
" <td>0</td>\n",
" <td>everyone</td>\n",
" <td>art &amp; design</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>sketch - draw &amp; paint</td>\n",
" <td>art_and_design</td>\n",
" <td>4.5</td>\n",
" <td>2.759054e-03</td>\n",
" <td>50000000</td>\n",
" <td>free</td>\n",
" <td>0</td>\n",
" <td>teen</td>\n",
" <td>art &amp; design</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>pixel draw - number art coloring book</td>\n",
" <td>art_and_design</td>\n",
" <td>4.3</td>\n",
" <td>1.235953e-05</td>\n",
" <td>100000</td>\n",
" <td>free</td>\n",
" <td>0</td>\n",
" <td>everyone</td>\n",
" <td>art &amp; design;creativity</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9355</th>\n",
" <td>fr calculator</td>\n",
" <td>family</td>\n",
" <td>4.0</td>\n",
" <td>7.676727e-08</td>\n",
" <td>500</td>\n",
" <td>free</td>\n",
" <td>0</td>\n",
" <td>everyone</td>\n",
" <td>education</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9356</th>\n",
" <td>sya9a maroc - fr</td>\n",
" <td>family</td>\n",
" <td>4.5</td>\n",
" <td>4.733982e-07</td>\n",
" <td>5000</td>\n",
" <td>free</td>\n",
" <td>0</td>\n",
" <td>everyone</td>\n",
" <td>education</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9357</th>\n",
" <td>fr. mike schmitz audio teachings</td>\n",
" <td>family</td>\n",
" <td>5.0</td>\n",
" <td>3.838364e-08</td>\n",
" <td>100</td>\n",
" <td>free</td>\n",
" <td>0</td>\n",
" <td>everyone</td>\n",
" <td>education</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9358</th>\n",
" <td>the scp foundation db fr nn5n</td>\n",
" <td>books_and_reference</td>\n",
" <td>4.5</td>\n",
" <td>1.445784e-06</td>\n",
" <td>1000</td>\n",
" <td>free</td>\n",
" <td>0</td>\n",
" <td>mature 17+</td>\n",
" <td>books &amp; reference</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9359</th>\n",
" <td>ihoroscope - 2018 daily horoscope &amp; astrology</td>\n",
" <td>lifestyle</td>\n",
" <td>4.5</td>\n",
" <td>5.096144e-03</td>\n",
" <td>10000000</td>\n",
" <td>free</td>\n",
" <td>0</td>\n",
" <td>everyone</td>\n",
" <td>lifestyle</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>9360 rows × 9 columns</p>\n",
"</div>"
],
"text/plain": [
" App Category \\\n",
"0 photo editor & candy camera & grid & scrapbook art_and_design \n",
"1 coloring book moana art_and_design \n",
"2 u launcher lite free live cool themes, hide ... art_and_design \n",
"3 sketch - draw & paint art_and_design \n",
"4 pixel draw - number art coloring book art_and_design \n",
"... ... ... \n",
"9355 fr calculator family \n",
"9356 sya9a maroc - fr family \n",
"9357 fr. mike schmitz audio teachings family \n",
"9358 the scp foundation db fr nn5n books_and_reference \n",
"9359 ihoroscope - 2018 daily horoscope & astrology lifestyle \n",
"\n",
" Rating Reviews Installs Type Price Content Rating \\\n",
"0 4.1 2.021538e-06 10000 free 0 everyone \n",
"1 3.9 1.235953e-05 500000 free 0 everyone \n",
"2 4.7 1.119638e-03 5000000 free 0 everyone \n",
"3 4.5 2.759054e-03 50000000 free 0 teen \n",
"4 4.3 1.235953e-05 100000 free 0 everyone \n",
"... ... ... ... ... ... ... \n",
"9355 4.0 7.676727e-08 500 free 0 everyone \n",
"9356 4.5 4.733982e-07 5000 free 0 everyone \n",
"9357 5.0 3.838364e-08 100 free 0 everyone \n",
"9358 4.5 1.445784e-06 1000 free 0 mature 17+ \n",
"9359 4.5 5.096144e-03 10000000 free 0 everyone \n",
"\n",
" Genres \n",
"0 art & design \n",
"1 art & design;pretend play \n",
"2 art & design \n",
"3 art & design \n",
"4 art & design;creativity \n",
"... ... \n",
"9355 education \n",
"9356 education \n",
"9357 education \n",
"9358 books & reference \n",
"9359 lifestyle \n",
"\n",
"[9360 rows x 9 columns]"
]
},
"execution_count": 71,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Strip the trailing '+' and the thousands separators so that Installs can be parsed as a number.\n",
"data[\"Installs\"] = data[\"Installs\"].replace({r'[+,]': ''}, regex=True)\n",
"data"
]
},
{
"cell_type": "code",
"execution_count": 72,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>App</th>\n",
" <th>Category</th>\n",
" <th>Rating</th>\n",
" <th>Reviews</th>\n",
" <th>Installs</th>\n",
" <th>Type</th>\n",
" <th>Price</th>\n",
" <th>Content Rating</th>\n",
" <th>Genres</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>photo editor &amp; candy camera &amp; grid &amp; scrapbook</td>\n",
" <td>art_and_design</td>\n",
" <td>4.1</td>\n",
" <td>2.021538e-06</td>\n",
" <td>9.999000e-06</td>\n",
" <td>free</td>\n",
" <td>0</td>\n",
" <td>everyone</td>\n",
" <td>art &amp; design</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>coloring book moana</td>\n",
" <td>art_and_design</td>\n",
" <td>3.9</td>\n",
" <td>1.235953e-05</td>\n",
" <td>4.999990e-04</td>\n",
" <td>free</td>\n",
" <td>0</td>\n",
" <td>everyone</td>\n",
" <td>art &amp; design;pretend play</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>u launcher lite free live cool themes, hide ...</td>\n",
" <td>art_and_design</td>\n",
" <td>4.7</td>\n",
" <td>1.119638e-03</td>\n",
" <td>4.999999e-03</td>\n",
" <td>free</td>\n",
" <td>0</td>\n",
" <td>everyone</td>\n",
" <td>art &amp; design</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>sketch - draw &amp; paint</td>\n",
" <td>art_and_design</td>\n",
" <td>4.5</td>\n",
" <td>2.759054e-03</td>\n",
" <td>5.000000e-02</td>\n",
" <td>free</td>\n",
" <td>0</td>\n",
" <td>teen</td>\n",
" <td>art &amp; design</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>pixel draw - number art coloring book</td>\n",
" <td>art_and_design</td>\n",
" <td>4.3</td>\n",
" <td>1.235953e-05</td>\n",
" <td>9.999900e-05</td>\n",
" <td>free</td>\n",
" <td>0</td>\n",
" <td>everyone</td>\n",
" <td>art &amp; design;creativity</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9355</th>\n",
" <td>fr calculator</td>\n",
" <td>family</td>\n",
" <td>4.0</td>\n",
" <td>7.676727e-08</td>\n",
" <td>4.990000e-07</td>\n",
" <td>free</td>\n",
" <td>0</td>\n",
" <td>everyone</td>\n",
" <td>education</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9356</th>\n",
" <td>sya9a maroc - fr</td>\n",
" <td>family</td>\n",
" <td>4.5</td>\n",
" <td>4.733982e-07</td>\n",
" <td>4.999000e-06</td>\n",
" <td>free</td>\n",
" <td>0</td>\n",
" <td>everyone</td>\n",
" <td>education</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9357</th>\n",
" <td>fr. mike schmitz audio teachings</td>\n",
" <td>family</td>\n",
" <td>5.0</td>\n",
" <td>3.838364e-08</td>\n",
" <td>9.900000e-08</td>\n",
" <td>free</td>\n",
" <td>0</td>\n",
" <td>everyone</td>\n",
" <td>education</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9358</th>\n",
" <td>the scp foundation db fr nn5n</td>\n",
" <td>books_and_reference</td>\n",
" <td>4.5</td>\n",
" <td>1.445784e-06</td>\n",
" <td>9.990000e-07</td>\n",
" <td>free</td>\n",
" <td>0</td>\n",
" <td>mature 17+</td>\n",
" <td>books &amp; reference</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9359</th>\n",
" <td>ihoroscope - 2018 daily horoscope &amp; astrology</td>\n",
" <td>lifestyle</td>\n",
" <td>4.5</td>\n",
" <td>5.096144e-03</td>\n",
" <td>9.999999e-03</td>\n",
" <td>free</td>\n",
" <td>0</td>\n",
" <td>everyone</td>\n",
" <td>lifestyle</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>9360 rows × 9 columns</p>\n",
"</div>"
],
"text/plain": [
" App Category \\\n",
"0 photo editor & candy camera & grid & scrapbook art_and_design \n",
"1 coloring book moana art_and_design \n",
"2 u launcher lite free live cool themes, hide ... art_and_design \n",
"3 sketch - draw & paint art_and_design \n",
"4 pixel draw - number art coloring book art_and_design \n",
"... ... ... \n",
"9355 fr calculator family \n",
"9356 sya9a maroc - fr family \n",
"9357 fr. mike schmitz audio teachings family \n",
"9358 the scp foundation db fr nn5n books_and_reference \n",
"9359 ihoroscope - 2018 daily horoscope & astrology lifestyle \n",
"\n",
" Rating Reviews Installs Type Price Content Rating \\\n",
"0 4.1 2.021538e-06 9.999000e-06 free 0 everyone \n",
"1 3.9 1.235953e-05 4.999990e-04 free 0 everyone \n",
"2 4.7 1.119638e-03 4.999999e-03 free 0 everyone \n",
"3 4.5 2.759054e-03 5.000000e-02 free 0 teen \n",
"4 4.3 1.235953e-05 9.999900e-05 free 0 everyone \n",
"... ... ... ... ... ... ... \n",
"9355 4.0 7.676727e-08 4.990000e-07 free 0 everyone \n",
"9356 4.5 4.733982e-07 4.999000e-06 free 0 everyone \n",
"9357 5.0 3.838364e-08 9.900000e-08 free 0 everyone \n",
"9358 4.5 1.445784e-06 9.990000e-07 free 0 mature 17+ \n",
"9359 4.5 5.096144e-03 9.999999e-03 free 0 everyone \n",
"\n",
" Genres \n",
"0 art & design \n",
"1 art & design;pretend play \n",
"2 art & design \n",
"3 art & design \n",
"4 art & design;creativity \n",
"... ... \n",
"9355 education \n",
"9356 education \n",
"9357 education \n",
"9358 books & reference \n",
"9359 lifestyle \n",
"\n",
"[9360 rows x 9 columns]"
]
},
"execution_count": 72,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Min-max scale Reviews and Installs to the [0, 1] range.\n",
"for column in [\"Reviews\", \"Installs\"]:\n",
"    # Values that cannot be parsed as numbers become NaN.\n",
"    data[column] = pd.to_numeric(data[column], errors='coerce')\n",
"    min_value, max_value = data[column].min(), data[column].max()\n",
"    data[column] = (data[column] - min_value) / (max_value - min_value)\n",
"data"
]
},
{
"cell_type": "code",
"execution_count": 75,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>App</th>\n",
" <th>Category</th>\n",
" <th>Rating</th>\n",
" <th>Reviews</th>\n",
" <th>Installs</th>\n",
" <th>Type</th>\n",
" <th>Price</th>\n",
" <th>Content Rating</th>\n",
" <th>Genres</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>9360</td>\n",
" <td>9360</td>\n",
" <td>9360.000000</td>\n",
" <td>9360.000000</td>\n",
" <td>9360.000000</td>\n",
" <td>9360</td>\n",
" <td>9360</td>\n",
" <td>9360</td>\n",
" <td>9360</td>\n",
" </tr>\n",
" <tr>\n",
" <th>unique</th>\n",
" <td>8174</td>\n",
" <td>33</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>2</td>\n",
" <td>73</td>\n",
" <td>6</td>\n",
" <td>115</td>\n",
" </tr>\n",
" <tr>\n",
" <th>top</th>\n",
" <td>roblox</td>\n",
" <td>family</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>free</td>\n",
" <td>0</td>\n",
" <td>everyone</td>\n",
" <td>tools</td>\n",
" </tr>\n",
" <tr>\n",
" <th>freq</th>\n",
" <td>9</td>\n",
" <td>1746</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>8715</td>\n",
" <td>8715</td>\n",
" <td>7414</td>\n",
" <td>732</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mean</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>4.191838</td>\n",
" <td>0.006581</td>\n",
" <td>0.017909</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>std</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>0.515263</td>\n",
" <td>0.040239</td>\n",
" <td>0.091266</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>min</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>1.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25%</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>4.000000</td>\n",
" <td>0.000002</td>\n",
" <td>0.000010</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50%</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>4.300000</td>\n",
" <td>0.000076</td>\n",
" <td>0.000500</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75%</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>4.500000</td>\n",
" <td>0.001044</td>\n",
" <td>0.005000</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>max</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>5.000000</td>\n",
" <td>1.000000</td>\n",
" <td>1.000000</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" App Category Rating Reviews Installs Type Price \\\n",
"count 9360 9360 9360.000000 9360.000000 9360.000000 9360 9360 \n",
"unique 8174 33 NaN NaN NaN 2 73 \n",
"top roblox family NaN NaN NaN free 0 \n",
"freq 9 1746 NaN NaN NaN 8715 8715 \n",
"mean NaN NaN 4.191838 0.006581 0.017909 NaN NaN \n",
"std NaN NaN 0.515263 0.040239 0.091266 NaN NaN \n",
"min NaN NaN 1.000000 0.000000 0.000000 NaN NaN \n",
"25% NaN NaN 4.000000 0.000002 0.000010 NaN NaN \n",
"50% NaN NaN 4.300000 0.000076 0.000500 NaN NaN \n",
"75% NaN NaN 4.500000 0.001044 0.005000 NaN NaN \n",
"max NaN NaN 5.000000 1.000000 1.000000 NaN NaN \n",
"\n",
" Content Rating Genres \n",
"count 9360 9360 \n",
"unique 6 115 \n",
"top everyone tools \n",
"freq 7414 732 \n",
"mean NaN NaN \n",
"std NaN NaN \n",
"min NaN NaN \n",
"25% NaN NaN \n",
"50% NaN NaN \n",
"75% NaN NaN \n",
"max NaN NaN "
]
},
"execution_count": 75,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.describe(include='all')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Splitting into train, validation, and test sets"
]
},
{
"cell_type": "code",
"execution_count": 68,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Data shape: (9360, 9)\n",
"Train shape: (5616, 9)\n",
"Test shape: (1872, 9)\n",
"Validation shape:(1872, 9)\n"
]
}
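],
"source": [
"# Shuffle the rows once with a fixed seed, then cut at 60% and 80% to get\n",
"# train (60%), validation (20%), and test (20%) sets.\n",
"shuffled = data.sample(frac=1, random_state=42)\n",
"train_end, validate_end = int(.6 * len(data)), int(.8 * len(data))\n",
"train = shuffled.iloc[:train_end]\n",
"validate = shuffled.iloc[train_end:validate_end]\n",
"test = shuffled.iloc[validate_end:]\n",
"print(f\"Data shape: {data.shape}\\nTrain shape: {train.shape}\\nTest shape: {test.shape}\\nValidation shape:{validate.shape}\")"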
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.1"
}
},
"nbformat": 4,
"nbformat_minor": 4
}