ium_425850/IUM_zadanie.ipynb


!pip install --user kaggle
!pip install --user pandas
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Requirement already satisfied: kaggle in /usr/local/lib/python3.9/dist-packages (1.5.13)
Requirement already satisfied: requests in /usr/local/lib/python3.9/dist-packages (from kaggle) (2.27.1)
Requirement already satisfied: python-slugify in /usr/local/lib/python3.9/dist-packages (from kaggle) (8.0.1)
Requirement already satisfied: six>=1.10 in /usr/local/lib/python3.9/dist-packages (from kaggle) (1.16.0)
Requirement already satisfied: tqdm in /usr/local/lib/python3.9/dist-packages (from kaggle) (4.65.0)
Requirement already satisfied: certifi in /usr/local/lib/python3.9/dist-packages (from kaggle) (2022.12.7)
Requirement already satisfied: python-dateutil in /usr/local/lib/python3.9/dist-packages (from kaggle) (2.8.2)
Requirement already satisfied: urllib3 in /usr/local/lib/python3.9/dist-packages (from kaggle) (1.26.15)
Requirement already satisfied: text-unidecode>=1.3 in /usr/local/lib/python3.9/dist-packages (from python-slugify->kaggle) (1.3)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.9/dist-packages (from requests->kaggle) (3.4)
Requirement already satisfied: charset-normalizer~=2.0.0 in /usr/local/lib/python3.9/dist-packages (from requests->kaggle) (2.0.12)
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Requirement already satisfied: pandas in /usr/local/lib/python3.9/dist-packages (1.4.4)
Requirement already satisfied: python-dateutil>=2.8.1 in /usr/local/lib/python3.9/dist-packages (from pandas) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.9/dist-packages (from pandas) (2022.7.1)
Requirement already satisfied: numpy>=1.18.5 in /usr/local/lib/python3.9/dist-packages (from pandas) (1.22.4)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.9/dist-packages (from python-dateutil>=2.8.1->pandas) (1.16.0)
!kaggle datasets download -d dylanjcastillo/7k-books-with-metadata
Traceback (most recent call last):
  File "/usr/local/bin/kaggle", line 5, in <module>
    from kaggle.cli import main
  File "/usr/local/lib/python3.9/dist-packages/kaggle/__init__.py", line 23, in <module>
    api.authenticate()
  File "/usr/local/lib/python3.9/dist-packages/kaggle/api/kaggle_api_extended.py", line 164, in authenticate
    raise IOError('Could not find {}. Make sure it\'s located in'
OSError: Could not find kaggle.json. Make sure it's located in /root/.kaggle. Or use the environment method.
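The traceback above says the Kaggle client found no credentials: it looks for a `kaggle.json` file in `/root/.kaggle` or, per the message, for "the environment method". A minimal sketch of a pre-flight check follows. It assumes the environment method means the `KAGGLE_USERNAME` and `KAGGLE_KEY` variables; the helper name `kaggle_credentials_available` is hypothetical and not part of the `kaggle` package.

```python
import os
from pathlib import Path


def kaggle_credentials_available(config_dir=None, environ=None):
    """Return True if the Kaggle CLI is likely to find credentials.

    Hypothetical helper: checks for a kaggle.json file in the config
    directory (default ~/.kaggle) or for the KAGGLE_USERNAME and
    KAGGLE_KEY environment variables (assumed to be the "environment
    method" the error message refers to).
    """
    environ = os.environ if environ is None else environ
    config_dir = Path(config_dir) if config_dir else Path.home() / ".kaggle"
    has_file = (config_dir / "kaggle.json").is_file()
    has_env = bool(environ.get("KAGGLE_USERNAME")) and bool(environ.get("KAGGLE_KEY"))
    return has_file or has_env


if not kaggle_credentials_available():
    print("No Kaggle credentials found; the download cell will fail.")
```

Running a check like this before the download cell makes the failure explicit instead of leaving a traceback followed by an `unzip` call on an archive that was obtained some other way.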
!unzip -o 7k-books-with-metadata.zip
Archive:  7k-books-with-metadata.zip
  inflating: books.csv               
import pandas as pd
books = pd.read_csv('books.csv')
books
isbn13 isbn10 title subtitle authors categories thumbnail description published_year average_rating num_pages ratings_count
0 9780002005883 0002005883 Gilead NaN Marilynne Robinson Fiction http://books.google.com/books/content?id=KQZCP... A NOVEL THAT READERS and critics have been eag... 2004.0 3.85 247.0 361.0
1 9780002261982 0002261987 Spider's Web A Novel Charles Osborne;Agatha Christie Detective and mystery stories http://books.google.com/books/content?id=gA5GP... A new 'Christie for Christmas' -- a full-lengt... 2000.0 3.83 241.0 5164.0
2 9780006163831 0006163831 The One Tree NaN Stephen R. Donaldson American fiction http://books.google.com/books/content?id=OmQaw... Volume Two of Stephen Donaldson's acclaimed se... 1982.0 3.97 479.0 172.0
3 9780006178736 0006178731 Rage of angels NaN Sidney Sheldon Fiction http://books.google.com/books/content?id=FKo2T... A memorable, mesmerizing heroine Jennifer -- b... 1993.0 3.93 512.0 29532.0
4 9780006280897 0006280897 The Four Loves NaN Clive Staples Lewis Christian life http://books.google.com/books/content?id=XhQ5X... Lewis' work on the nature of love divides love... 2002.0 4.15 170.0 33684.0
... ... ... ... ... ... ... ... ... ... ... ... ...
6805 9788185300535 8185300534 I Am that Talks with Sri Nisargadatta Maharaj Sri Nisargadatta Maharaj;Sudhakar S. Dikshit Philosophy http://books.google.com/books/content?id=Fv_JP... This collection of the timeless teachings of o... 1999.0 4.51 531.0 104.0
6806 9788185944609 8185944601 Secrets Of The Heart NaN Khalil Gibran Mysticism http://books.google.com/books/content?id=XcrVp... NaN 1993.0 4.08 74.0 324.0
6807 9788445074879 8445074873 Fahrenheit 451 NaN Ray Bradbury Book burning NaN NaN 2004.0 3.98 186.0 5733.0
6808 9789027712059 9027712050 The Berlin Phenomenology NaN Georg Wilhelm Friedrich Hegel History http://books.google.com/books/content?id=Vy7Sk... Since the three volume edition ofHegel's Philo... 1981.0 0.00 210.0 0.0
6809 9789042003408 9042003405 'I'm Telling You Stories' Jeanette Winterson and the Politics of Reading Helena Grice;Tim Woods Literary Criticism http://books.google.com/books/content?id=2lVyR... This is a jubilant and rewarding collection of... 1998.0 3.70 136.0 10.0

6810 rows × 12 columns

books.describe(include='all')
isbn13 isbn10 title subtitle authors categories thumbnail description published_year average_rating num_pages ratings_count
count 6.810000e+03 6810 6810 2381 6738 6711 6481 6548 6804.000000 6767.000000 6767.000000 6.767000e+03
unique NaN 6810 6398 2009 3780 567 6481 6474 NaN NaN NaN NaN
top NaN 0786282258 The Lord of the Rings A Novel Agatha Christie Fiction http://books.google.com/books/content?id=6dVAW... This is a reproduction of the original artefac... NaN NaN NaN NaN
freq NaN 1 11 226 37 2588 1 6 NaN NaN NaN NaN
mean 9.780677e+12 NaN NaN NaN NaN NaN NaN NaN 1998.630364 3.933284 348.181026 2.106910e+04
std 6.068911e+08 NaN NaN NaN NaN NaN NaN NaN 10.484257 0.331352 242.376783 1.376207e+05
min 9.780002e+12 NaN NaN NaN NaN NaN NaN NaN 1853.000000 0.000000 0.000000 0.000000e+00
25% 9.780330e+12 NaN NaN NaN NaN NaN NaN NaN 1996.000000 3.770000 208.000000 1.590000e+02
50% 9.780553e+12 NaN NaN NaN NaN NaN NaN NaN 2002.000000 3.960000 304.000000 1.018000e+03
75% 9.780810e+12 NaN NaN NaN NaN NaN NaN NaN 2005.000000 4.130000 420.000000 5.992500e+03
max 9.789042e+12 NaN NaN NaN NaN NaN NaN NaN 2019.000000 5.000000 3342.000000 5.629932e+06
books.isnull().sum()
isbn13               0
isbn10               0
title                0
subtitle          4429
authors             72
categories          99
thumbnail          329
description        262
published_year       6
average_rating      43
num_pages           43
ratings_count       43
dtype: int64
# Drop identifier and free-text columns that are not needed further on.
books.drop(columns=['thumbnail', 'subtitle', 'description', 'isbn13', 'isbn10'], inplace=True)
books.isnull().sum()
title              0
authors           72
categories        99
published_year     6
average_rating    43
num_pages         43
ratings_count     43
dtype: int64
books.dropna(inplace=True)
books
title authors categories published_year average_rating num_pages ratings_count
0 Gilead Marilynne Robinson Fiction 2004.0 3.85 247.0 361.0
1 Spider's Web Charles Osborne;Agatha Christie Detective and mystery stories 2000.0 3.83 241.0 5164.0
2 The One Tree Stephen R. Donaldson American fiction 1982.0 3.97 479.0 172.0
3 Rage of angels Sidney Sheldon Fiction 1993.0 3.93 512.0 29532.0
4 The Four Loves Clive Staples Lewis Christian life 2002.0 4.15 170.0 33684.0
... ... ... ... ... ... ... ...
6805 I Am that Sri Nisargadatta Maharaj;Sudhakar S. Dikshit Philosophy 1999.0 4.51 531.0 104.0
6806 Secrets Of The Heart Khalil Gibran Mysticism 1993.0 4.08 74.0 324.0
6807 Fahrenheit 451 Ray Bradbury Book burning 2004.0 3.98 186.0 5733.0
6808 The Berlin Phenomenology Georg Wilhelm Friedrich Hegel History 1981.0 0.00 210.0 0.0
6809 'I'm Telling You Stories' Helena Grice;Tim Woods Literary Criticism 1998.0 3.70 136.0 10.0

6599 rows × 7 columns
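Between the two displayed frames the row count falls from 6810 to 6599, so `dropna` discarded 211 rows (about 3%). A small sketch of how that loss could be reported before committing to it is shown below; the helper name `rows_lost_to_dropna` and the sample frame are illustrative assumptions, not part of the notebook.

```python
import pandas as pd


def rows_lost_to_dropna(df):
    """Return (rows_before, rows_after, rows_lost) for a plain dropna()."""
    before = len(df)
    after = len(df.dropna())  # does not modify df
    return before, after, before - after


# Tiny made-up frame standing in for `books`.
sample = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "authors": ["x", None, "y", "z"],
    "num_pages": [100.0, 200.0, None, 50.0],
})
summary = rows_lost_to_dropna(sample)
print(summary)
```

Because the helper works on a copy returned by `dropna()`, it can be called on `books` before the in-place call without changing the data.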

books.describe(include='all')
title authors categories published_year average_rating num_pages ratings_count
count 6599 6599 6599 6599.000000 6599.000000 6599.000000 6.599000e+03
unique 6216 3728 563 NaN NaN NaN NaN
top The Lord of the Rings Agatha Christie Fiction NaN NaN NaN NaN
freq 9 37 2561 NaN NaN NaN NaN
mean NaN NaN NaN 1998.750417 3.931367 348.296863 2.143083e+04
std NaN NaN NaN 10.168465 0.331173 239.199411 1.392929e+05
min NaN NaN NaN 1876.000000 0.000000 0.000000 0.000000e+00
25% NaN NaN NaN 1997.000000 3.770000 208.000000 1.630000e+02
50% NaN NaN NaN 2002.000000 3.950000 304.000000 1.032000e+03
75% NaN NaN NaN 2005.000000 4.130000 420.000000 6.105500e+03
max NaN NaN NaN 2019.000000 5.000000 3342.000000 5.629932e+06
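The summary above reports a minimum of 0 for `average_rating`, `num_pages` and `ratings_count`, and the row for "The Berlin Phenomenology" pairs a 0.00 rating with 0 ratings. Such zeros may be placeholders rather than real measurements; `dropna` does not catch them because they are not NaN. The sketch below counts them per column. The helper name `count_zero_placeholders` and the sample values are assumptions made for illustration.

```python
import pandas as pd


def count_zero_placeholders(df, columns=("average_rating", "num_pages", "ratings_count")):
    """Count exact zeros in each of the given numeric columns."""
    return {col: int((df[col] == 0).sum()) for col in columns}


# Made-up rows; the second one mimics a book nobody has rated.
sample = pd.DataFrame({
    "average_rating": [3.85, 0.0, 4.10],
    "num_pages": [247.0, 210.0, 0.0],
    "ratings_count": [361.0, 0.0, 12.0],
})
zeros = count_zero_placeholders(sample)
print(zeros)
```

Whether to drop or keep such rows is a modelling decision; the count only shows how many rows are affected.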
books["categories"].value_counts()
Fiction                            2561
Juvenile Fiction                    524
Biography & Autobiography           398
History                             261
Literary Criticism                  165
                                   ... 
Child analysis                        1
Illinois                              1
Erinyes (Greek mythology)             1
Exorcism                              1
People with social disabilities       1
Name: categories, Length: 563, dtype: int64
books["published_year"].value_counts()
2006.0    877
2005.0    681
2004.0    605
2003.0    569
2002.0    470
         ... 
1928.0      1
1904.0      1
1938.0      1
1936.0      1
1947.0      1
Name: published_year, Length: 91, dtype: int64
books["authors"].value_counts()
Agatha Christie               37
Stephen King                  36
William Shakespeare           29
John Ronald Reuel Tolkien     25
Sandra Brown                  23
                              ..
Aeg                            1
Pauline Reage                  1
Tim Flannery                   1
Saint Augustine (of Hippo)     1
Michael S. Reynolds            1
Name: authors, Length: 3728, dtype: int64
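The `authors` column stores co-authors in one semicolon-separated string (for example `Charles Osborne;Agatha Christie`), so `value_counts()` above counts each combination as a distinct value. A minimal sketch of counting individual names instead is shown below; the helper name `count_individual_authors` is hypothetical, and it assumes `;` is the only separator used.

```python
import pandas as pd


def count_individual_authors(df, column="authors", sep=";"):
    """Split multi-author strings and count each name separately."""
    names = df[column].dropna().str.split(sep).explode().str.strip()
    return names.value_counts()


# Made-up sample using names that appear in the notebook output.
sample = pd.DataFrame({
    "authors": [
        "Charles Osborne;Agatha Christie",
        "Agatha Christie",
        "Stephen King",
    ]
})
counts = count_individual_authors(sample)
print(counts)
```

Applied to `books`, this would credit a co-authored title to every listed author, which can change the ranking shown above.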
books["average_rating"].value_counts()
4.00    125
3.93    110
3.95    109
3.99    108
3.96    104
       ... 
4.64      1
4.68      1
4.72      1
2.44      1
4.78      1
Name: average_rating, Length: 200, dtype: int64
from sklearn.model_selection import train_test_split

# Hold out 20% of the rows as the test set.
books_train, books_test = train_test_split(books, test_size=0.2, random_state=1)
# Split the remaining 80% in half: roughly 40% train and 40% validation overall.
books_train, books_val = train_test_split(books_train, test_size=0.5, random_state=1)
books_train
title authors categories published_year average_rating num_pages ratings_count
915 The Autobiography of Alice B. Toklas Gertrude Stein Biography & Autobiography 2001.0 3.59 272.0 233.0
4493 Never Far from Nowhere Andrea Levy Blacks 1996.0 3.68 282.0 601.0
1983 Year's Happy Ending Betty Neels Fiction 2001.0 3.95 216.0 128.0
2196 Wrinkles in Time George Smoot;Keay Davidson Science 1994.0 3.99 360.0 985.0
4011 Dispatches Michael Herr History 1991.0 4.23 260.0 12590.0
... ... ... ... ... ... ... ...
2841 Magic Bites Ilona Andrews Fiction 2007.0 4.07 260.0 82231.0
1713 High Five Janet Evanovich Bail bond agents 2000.0 4.18 336.0 99172.0
3469 A Brief History of Time Stephen Hawking Science 1998.0 4.16 212.0 214520.0
1657 The Magus John Fowles Fiction 2001.0 4.05 656.0 36909.0
3986 The Complete Monty Python's Flying Circus Graham Chapman;Monty Python (Comedy troupe);Te... Humor 1989.0 4.44 384.0 1191.0

2639 rows × 7 columns

books_train.describe(include='all')
title authors categories published_year average_rating num_pages ratings_count
count 2639 2639 2639 2639.000000 2639.000000 2639.000000 2.639000e+03
unique 2547 1827 286 NaN NaN NaN NaN
top One Hundred Years of Solitude Stephen King Fiction NaN NaN NaN NaN
freq 4 18 1027 NaN NaN NaN NaN
mean NaN NaN NaN 1999.032967 3.929807 349.534672 2.363199e+04
std NaN NaN NaN 9.865320 0.358919 244.871090 1.452470e+05
min NaN NaN NaN 1876.000000 0.000000 0.000000 0.000000e+00
25% NaN NaN NaN 1997.000000 3.770000 208.000000 1.745000e+02
50% NaN NaN NaN 2002.000000 3.950000 304.000000 1.066000e+03
75% NaN NaN NaN 2005.000000 4.130000 429.000000 6.084500e+03
max NaN NaN NaN 2019.000000 5.000000 3020.000000 4.367341e+06
books_test.describe(include='all')
title authors categories published_year average_rating num_pages ratings_count
count 1320 1320 1320 1320.000000 1320.000000 1320.000000 1.320000e+03
unique 1303 1064 185 NaN NaN NaN NaN
top 20,000 Leagues Under the Sea Stephen King Fiction NaN NaN NaN NaN
freq 3 7 540 NaN NaN NaN NaN
mean NaN NaN NaN 1998.590909 3.925470 339.346970 1.588767e+04
std NaN NaN NaN 10.119569 0.299805 219.560964 7.877064e+04
min NaN NaN NaN 1942.000000 2.330000 0.000000 0.000000e+00
25% NaN NaN NaN 1996.000000 3.750000 208.000000 1.510000e+02
50% NaN NaN NaN 2002.000000 3.950000 304.000000 1.068000e+03
75% NaN NaN NaN 2005.000000 4.130000 401.000000 6.360000e+03
max NaN NaN NaN 2017.000000 5.000000 3342.000000 2.115562e+06
books_val.describe(include='all')
title authors categories published_year average_rating num_pages ratings_count
count 2640 2640 2640 2640.000000 2640.000000 2640.000000 2.640000e+03
unique 2562 1850 313 NaN NaN NaN NaN
top Three Complete Novels Agatha Christie Fiction NaN NaN NaN NaN
freq 6 14 994 NaN NaN NaN NaN
mean NaN NaN NaN 1998.547727 3.935875 351.534470 2.200209e+04
std NaN NaN NaN 10.483752 0.316971 242.829463 1.558830e+05
min NaN NaN NaN 1901.000000 0.000000 4.000000 0.000000e+00
25% NaN NaN NaN 1996.000000 3.770000 208.000000 1.557500e+02
50% NaN NaN NaN 2002.000000 3.950000 309.500000 9.555000e+02
75% NaN NaN NaN 2005.000000 4.130000 430.250000 5.980750e+03
max NaN NaN NaN 2019.000000 5.000000 2965.000000 5.629932e+06
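The three summaries above show 2639 training, 2640 validation and 1320 test rows, which is roughly a 40/40/20 split of the 6599 cleaned rows. If a different ratio is wanted, the two-step split can be wrapped so that both fractions are stated relative to the full frame. The sketch below is one way to do this; the helper name `three_way_split` and the 60/20/20 defaults are assumptions for illustration, not the notebook's settings.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def three_way_split(df, test_size=0.2, val_size=0.2, random_state=1):
    """Split df into train/val/test; both sizes are fractions of the full frame."""
    n_total = len(df)
    # Convert fractions to absolute row counts so the second split
    # does not depend on a rescaled fraction.
    n_test = round(test_size * n_total)
    n_val = round(val_size * n_total)
    train_val, test = train_test_split(df, test_size=n_test, random_state=random_state)
    train, val = train_test_split(train_val, test_size=n_val, random_state=random_state)
    return train, val, test


# Made-up frame with 100 rows standing in for `books`.
sample = pd.DataFrame({"x": range(100)})
train, val, test = three_way_split(sample)
print(len(train), len(val), len(test))  # 60 20 20
```

Passing `test_size=0.2, val_size=0.4` should give approximately the proportions seen in the notebook, although the exact row counts can differ by one because of rounding.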