ium_440058/dataset.ipynb
2021-03-21 23:49:49 +01:00

31 KiB
Raw Blame History

Finance & Accounting Courses in udemy.com

Includes:

  • id
  • title
  • is_paid
  • num_subscribers
  • rating
  • num_reviews
  • created
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
import kaggle

kaggle.api.authenticate()
kaggle.api.dataset_download_files('jilkothari/finance-accounting-courses-udemy-13k-course', path='.', unzip=True)

courses = pd.read_csv('courses.csv')

Dataset

courses
id title url is_paid num_subscribers rating num_reviews created
0 762616 the_complete_sql_bootcamp_2020:_go_from_zero_t... /course/the-complete-sql-bootcamp/ True 295509 4.7 78006 2016-02-14T22:57:48Z
1 937678 tableau_2020_a-z:_hands-on_tableau_training_fo... /course/tableau10/ True 209070 4.6 54581 2016-08-22T12:10:18Z
2 1361790 pmp_exam_prep_seminar_-__pmbok_guide_6 /course/pmp-pmbok6-35-pdus/ True 155282 4.6 52653 2017-09-26T16:32:48Z
3 648826 the_complete_financial_analyst_course_2020 /course/the-complete-financial-analyst-course/ True 245860 4.5 46447 2015-10-23T13:34:35Z
4 637930 an_entire_mba_in_1_course:award_winning_busine... /course/an-entire-mba-in-1-courseaward-winning... True 374836 4.5 41630 2015-10-12T06:39:46Z
... ... ... ... ... ... ... ... ...
13531 3171702 máster_en_inversión_bursátil,_completo_análisi... /course/master-en-inversion-bursatil-completo-... False 485 4.4 11 2020-05-26T17:34:49Z
13532 2925096 curso_do_zero_a_investidor_em_ações_na_bolsa /course/curso-do-zero-a-investidor-em-acoes-na... False 260 4.2 11 2020-03-28T18:39:36Z
13533 3146788 day_trading_kumo-méthode_de_trading_range-_for... /course/day-trading-kumo-methode-de-trading-ra... False 121 4.1 10 2020-05-19T17:08:48Z
13534 2400574 investindo_do_zero_com_tesouro_direto /course/investindo-do-zero-com-tesouro-direto-... False 233 3.6 10 2019-06-05T23:08:57Z
13535 2888390 acabou_a_previdência_e_agora?_-_volume_01 /course/acabou-a-previdencia-e-agora-volume-01/ False 175 4.5 10 2020-03-20T01:41:25Z

9501 rows × 8 columns

Delete redundant columns

imp_col = ['id', 'title', 'url', 'is_paid', 'num_subscribers', 'rating', 'num_reviews', 'created']
courses = courses[imp_col]
courses.to_csv("courses.csv", index=False)
courses = pd.read_csv('courses.csv')

Delete empty rows of rating column and number of reviews less than 10

rating_col = 'rating'
num_reviews_col = 'num_reviews'
courses = courses.drop(courses[courses.rating == 0].index)
courses = courses.drop(courses[courses.num_reviews < 10].index)

Simplify numbers to one decimal place and format 'title' column to specifc schema

courses = courses.round(1)
courses['title'] = courses['title'].str.lower()
courses['title'] = courses['title'].str.replace(" ", "_")

Delete artifacts

courses = courses.dropna()

Split dataset into 60% 20% 20% - train, valid, test

courses_train, courses_validate, courses_test = np.split(courses.sample(frac=1), [int(.6*len(courses)), int(.8*len(courses))])

Summary of train, valid, test

print("Courses: ".ljust(20), np.size(courses))
print("Courses (train) : ".ljust(20), np.size(courses_train))
print("Courses (validate): ".ljust(20), np.size(courses_validate))
print("Courses (test) ".ljust(20), np.size(courses_test))
Courses:             76008
Courses (train) :    45600
Courses (validate):  15200
Courses (test)       15208

Describe numeric columns

courses.describe().round(1)
id num_subscribers rating num_reviews
count 9501.0 9501.0 9501.0 9501.0
mean 1484700.3 3953.9 4.1 346.6
std 887299.7 11103.9 0.4 1882.7
min 2762.0 13.0 1.5 10.0
25% 718252.0 261.0 3.9 21.0
50% 1413712.0 1170.0 4.2 49.0
75% 2193058.0 3644.0 4.4 157.0
max 3477486.0 374836.0 5.0 78006.0

Distribution of 'is_paid' column

pd.value_counts(courses['is_paid']).plot(kind="bar")
<AxesSubplot:>

Current dataset

courses
id title url is_paid num_subscribers rating num_reviews created
0 762616 the_complete_sql_bootcamp_2020:_go_from_zero_t... /course/the-complete-sql-bootcamp/ True 295509 4.7 78006 2016-02-14T22:57:48Z
1 937678 tableau_2020_a-z:_hands-on_tableau_training_fo... /course/tableau10/ True 209070 4.6 54581 2016-08-22T12:10:18Z
2 1361790 pmp_exam_prep_seminar_-__pmbok_guide_6 /course/pmp-pmbok6-35-pdus/ True 155282 4.6 52653 2017-09-26T16:32:48Z
3 648826 the_complete_financial_analyst_course_2020 /course/the-complete-financial-analyst-course/ True 245860 4.5 46447 2015-10-23T13:34:35Z
4 637930 an_entire_mba_in_1_course:award_winning_busine... /course/an-entire-mba-in-1-courseaward-winning... True 374836 4.5 41630 2015-10-12T06:39:46Z
... ... ... ... ... ... ... ... ...
13531 3171702 máster_en_inversión_bursátil,_completo_análisi... /course/master-en-inversion-bursatil-completo-... False 485 4.4 11 2020-05-26T17:34:49Z
13532 2925096 curso_do_zero_a_investidor_em_ações_na_bolsa /course/curso-do-zero-a-investidor-em-acoes-na... False 260 4.2 11 2020-03-28T18:39:36Z
13533 3146788 day_trading_kumo-méthode_de_trading_range-_for... /course/day-trading-kumo-methode-de-trading-ra... False 121 4.1 10 2020-05-19T17:08:48Z
13534 2400574 investindo_do_zero_com_tesouro_direto /course/investindo-do-zero-com-tesouro-direto-... False 233 3.6 10 2019-06-05T23:08:57Z
13535 2888390 acabou_a_previdência_e_agora?_-_volume_01 /course/acabou-a-previdencia-e-agora-volume-01/ False 175 4.5 10 2020-03-20T01:41:25Z

9501 rows × 8 columns