meh/recommender-systems-class-master/class_2_numpy_pandas.ipynb
2021-07-07 20:03:54 +02:00

%matplotlib inline
%load_ext autoreload
%autoreload 2

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import Markdown, display, HTML

# Fix the dying kernel problem (only a problem in some installations - you can remove it, if it works without it)
import os
os.environ['KMP_DUPLICATE_LIB_OK'] = 'True'

Numpy

For a detailed reference check out: https://numpy.org/doc/stable/reference/arrays.indexing.html.

Creating numpy arrays

Directly

a = np.array(
    [[1.0, 2.0, 3.0], 
     [4.0, 5.0, 6.0], 
     [7.0, 8.0, 9.0]]
)

print(a)
[[1. 2. 3.]
 [4. 5. 6.]
 [7. 8. 9.]]

From a list

a = [[1.0, 2.0, 3.0], 
     [4.0, 5.0, 6.0], 
     [7.0, 8.0, 9.0]]

print(a)
print()

a = np.array(a)

print(a)
[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]

[[1. 2. 3.]
 [4. 5. 6.]
 [7. 8. 9.]]

From a list comprehension

a = [i**2 for i in range(10)]

print(a)
print()
print(np.array(a))
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

[ 0  1  4  9 16 25 36 49 64 81]

Ready-made functions in numpy

# All zeros
a = np.zeros((3, 4))
print("All zeros")
print(a)
print()

# All a chosen value
a = np.full((3, 4), 7.0)
print("All chosen value (variant 1)")
print(a)
print()

# or

a = np.zeros((3, 4))
a[:] = 7.0
print("All chosen value (variant 2)")
print(a)
print()

# Random integers

a = np.random.randint(low=0, high=10, size=(3, 2))
print("Random integers")
print(a)
print()

# Random values from the normal distribution (Gaussian)

print("Random values from the normal distribution")
a = np.random.normal(loc=0, scale=10, size=(3, 2))
print(a)
All zeros
[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]

All chosen value (variant 1)
[[7. 7. 7. 7.]
 [7. 7. 7. 7.]
 [7. 7. 7. 7.]]

All chosen value (variant 2)
[[7. 7. 7. 7.]
 [7. 7. 7. 7.]
 [7. 7. 7. 7.]]

Random integers
[[7 5]
 [9 8]
 [6 3]]

Random values from the normal distribution
[[  3.88109518 -15.30896612]
 [  7.88779281   7.67458172]
 [ -9.81026963  -6.02098263]]

Slicing numpy arrays

Slicing in 1D

To retrieve only selected values from a numpy array (or a Python list), use slicing. It has the form

arr[low:high:step]

where low is the first index included, high is the first index excluded, and step is the stride (every step-th element is taken). Any of the three can be omitted; negative values of low and high count from the end, and a negative step reverses the direction.

a = [i**2 for i in range(10)]

print("Original: ", a)
print("First 5 elements:", a[:5])
print("Elements from index 3 to index 5:", a[3:6])
print("Last 3 elements (negative indexing):", a[-3:])
print("Printed in reverse order:", a[::-1])
print("Every second element:", a[::2])
Original:  [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
First 5 elements: [0, 1, 4, 9, 16]
Elements from index 3 to index 5: [9, 16, 25]
Last 3 elements (negative indexing): [49, 64, 81]
Printed in reverse order: [81, 64, 49, 36, 25, 16, 9, 4, 1, 0]
Every second element: [0, 4, 16, 36, 64]
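
The cell above slices a plain Python list. The same syntax works on a numpy array, with one difference worth knowing: a basic numpy slice is a view on the original data, not a copy. A minimal sketch (the variable names here are illustrative only):

```python
import numpy as np

a = np.array([i**2 for i in range(10)])

first_five = a[:5]        # indices 0..4
every_second = a[::2]     # indices 0, 2, 4, 6, 8
reversed_a = a[::-1]      # all elements, last to first

# A numpy slice is a view: writing through it changes the original array.
view = a[3:6]
view[0] = -1
print(a[3])               # -1

# Use .copy() when an independent array is needed.
independent = a[3:6].copy()
independent[1] = -2
print(a[4])               # 16, the original is untouched
```

Slicing a Python list, by contrast, always produces a new list.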

Slicing in 2D

In two dimensions slicing works the same way, with a separate slice for each dimension: the first selects rows, the second selects columns.

a = np.array([i for i in range(25)]).reshape(5, 5)

print("Original: ")
print(a)
print()
print("First 2 elements of the first 3 rows:")
print(a[:3, :2])
print()
print("Middle 3 elements from the middle 3 rows:")
print(a[1:4, 1:4])
print()
print("Bottom-right 3 by 3 submatrix (negative indexing):")
print(a[-3:, -3:])
print()
print("Reversed columns:")
print(a[:, ::-1])
print()
Original: 
[[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]
 [15 16 17 18 19]
 [20 21 22 23 24]]

First 2 elements of the first 3 rows:
[[ 0  1]
 [ 5  6]
 [10 11]]

Middle 3 elements from the middle 3 rows:
[[ 6  7  8]
 [11 12 13]
 [16 17 18]]

Bottom-right 3 by 3 submatrix (negative indexing):
[[12 13 14]
 [17 18 19]
 [22 23 24]]

Reversed columns:
[[ 4  3  2  1  0]
 [ 9  8  7  6  5]
 [14 13 12 11 10]
 [19 18 17 16 15]
 [24 23 22 21 20]]
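
One detail that often causes shape bugs: indexing a dimension with a single integer removes that dimension, while a slice of length one keeps it. A short sketch on the same 5 by 5 matrix:

```python
import numpy as np

a = np.arange(25).reshape(5, 5)

row_1d = a[1, :]      # integer index drops the dimension -> shape (5,)
row_2d = a[1:2, :]    # length-1 slice keeps the dimension -> shape (1, 5)
col_1d = a[:, 2]      # shape (5,)
col_2d = a[:, 2:3]    # shape (5, 1)

print(row_1d.shape, row_2d.shape, col_1d.shape, col_2d.shape)
```

The 2D variants are convenient when the result has to be combined with other matrices later.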

Setting numpy array field values

a = np.array([i for i in range(25)]).reshape(5, 5)

print("Original: ")
print(a)
print()

a[1:4, 1:4] = 5.0

print("Middle values changed to 5")
print(a)
print()

b = np.array([i**2 - i for i in range(9)]).reshape(3, 3)

print("Second matrix")
print(b)
print()

a[1:4, 1:4] = b

print("Second matrix substituted into the middle of the first matrix")
print(a)
Original: 
[[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]
 [15 16 17 18 19]
 [20 21 22 23 24]]

Middle values changed to 5
[[ 0  1  2  3  4]
 [ 5  5  5  5  9]
 [10  5  5  5 14]
 [15  5  5  5 19]
 [20 21 22 23 24]]

Second matrix
[[ 0  0  2]
 [ 6 12 20]
 [30 42 56]]

Second matrix substituted into the middle of the first matrix
[[ 0  1  2  3  4]
 [ 5  0  0  2  9]
 [10  6 12 20 14]
 [15 30 42 56 19]
 [20 21 22 23 24]]

Operations on numpy arrays

Arithmetic operators such as +, -, * and / act element-wise on numpy arrays. In particular, * is not matrix multiplication; use np.matmul for that.

a = np.array([i**2 for i in range(9)]).reshape((3, 3))
print(a)
print()

b = np.array([i**0.5 for i in range(9)]).reshape((3, 3))
print(b)
print()
[[ 0  1  4]
 [ 9 16 25]
 [36 49 64]]

[[0.         1.         1.41421356]
 [1.73205081 2.         2.23606798]
 [2.44948974 2.64575131 2.82842712]]

Element-wise sum

print(a + b)
[[ 0.          2.          5.41421356]
 [10.73205081 18.         27.23606798]
 [38.44948974 51.64575131 66.82842712]]

Element-wise multiplication

print(a * b)
[[  0.           1.           5.65685425]
 [ 15.58845727  32.          55.90169944]
 [ 88.18163074 129.64181424 181.01933598]]

Matrix multiplication

print(np.matmul(a, b))
print()

# Multiplication by the identity matrix (to check it works as expected)
id_matrix = np.array([[1.0, 0.0, 0.0], 
                      [0.0, 1.0, 0.0], 
                      [0.0, 0.0, 1.0]])

print(np.matmul(id_matrix, a))
[[ 11.53000978  12.58300524  13.54977648]
 [ 88.95005649 107.14378278 119.21568782]
 [241.63783311 303.32808391 341.49835513]]

[[ 0.  1.  4.]
 [ 9. 16. 25.]
 [36. 49. 64.]]
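
As a side note, Python's @ operator is a shorthand for np.matmul on 2D arrays, and np.eye builds an identity matrix without typing it out. The sketch below reuses the matrices a and b defined earlier in this section and checks these claims numerically:

```python
import numpy as np

a = np.array([i**2 for i in range(9)]).reshape((3, 3))
b = np.array([i**0.5 for i in range(9)]).reshape((3, 3))

product_matmul = np.matmul(a, b)
product_operator = a @ b          # same result as np.matmul for 2D arrays

identity = np.eye(3)              # 3x3 identity matrix of floats

print(np.allclose(product_matmul, product_operator))  # True
print(np.allclose(identity @ a, a))                   # True
print(np.allclose(a * b, a @ b))                      # False: element-wise vs matrix product
```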

Calculating the mean

a = np.random.randint(low=0, high=10, size=5)

print(a)
print()

print("Mean (by sum): ", np.sum(a) / len(a))
print("Mean (by mean):", np.mean(a))
[1 4 0 6 4]

Mean (by sum):  3.0
Mean (by mean): 3.0

Calculating the mean of every row

a = np.random.randint(low=0, high=10, size=(5, 3))

print(a)
print()
print(a.shape)
print()

print("Mean:", np.sum(a, axis=1) / a.shape[1])

print("Mean in the original matrix form:")
print((np.sum(a, axis=1) / a.shape[1]).reshape(-1, 1))  # -1 calculates the right size to use all elements
[[4 9 5]
 [8 9 1]
 [5 6 4]
 [3 7 8]
 [2 1 5]]

(5, 3)

Mean: [6.         6.         5.         6.         2.66666667]
Mean in the original matrix form:
[[6.        ]
 [6.        ]
 [5.        ]
 [6.        ]
 [2.66666667]]
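
The same row means can be obtained more directly with np.mean and its axis argument; keepdims=True keeps the result as a column, which replaces the reshape(-1, 1) step. A sketch using the matrix printed above (hard-coded here so the example is reproducible):

```python
import numpy as np

a = np.array([[4, 9, 5],
              [8, 9, 1],
              [5, 6, 4],
              [3, 7, 8],
              [2, 1, 5]])

row_means = np.mean(a, axis=1)                         # shape (5,)
row_means_column = np.mean(a, axis=1, keepdims=True)   # shape (5, 1)

# Broadcasting subtracts each row's mean from every element of that row.
centered = a - row_means_column

print(row_means)
print(row_means_column.shape)
```

Here axis=1 means "aggregate along the columns", so one value per row remains; axis=0 would give one value per column instead.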

More complex operations

a = [1.0, 2.0, 3.0]

print("Vector to power 2 (element-wise)")
print(np.power(a, 2))
print()
print("Euler number to the power a (element-wise)")
print(np.exp(a))
print()
print("An even more complex expression")
print((np.power(a, 2) + np.exp(a)) / np.sum(a))
Vector to power 2 (element-wise)
[1. 4. 9.]

Euler number to the power a (element-wise)
[ 2.71828183  7.3890561  20.08553692]

An even more complex expression
[0.61971364 1.89817602 4.84758949]

Numpy tasks

Task 1. Calculate the sigmoid (logistic) function on every element of the following numpy array [0.3, 1.2, -1.4, 0.2, -0.1, 0.1, 0.8, -0.25] and print the last 5 elements. Use only vector operations.

# Write your code here

Task 2. Calculate the dot product of the following two vectors:
$x = [3, 1, 4, 2, 6, 1, 4, 8]$
$y = [5, 2, 3, 12, 2, 4, 17, 11]$
a) by using element-wise multiplication and np.sum,
b) by using np.dot,
c) by using np.matmul and transposition (x.T).

# Write your code here

Task 3. Calculate the following expression
$$\frac{1}{1 + e^{-x_0 \theta_0 - \ldots - x_9 \theta_9 - \theta_{10}}}$$ for
$x = [1.2, 2.3, 3.4, -0.7, 4.2, 2.7, -0.5, -2.1, -3.3, 0.2]$
$\theta = [7.7, 0.33, -2.12, -1.73, 2.9, -5.8, -0.9, 12.11, 3.43, -0.5, 1.65]$
and print the result. Use only vector operations.

# Write your code here

Pandas

steam_df = pd.read_csv(os.path.join("data", "steam", "steam-200k.csv"), 
                       names=['user-id', 'game-title', 'behavior-name', 'value', 'zero'])

ml_ratings_df = pd.read_csv(os.path.join("data", "movielens_small", "ratings.csv"))
ml_movies_df = pd.read_csv(os.path.join("data", "movielens_small", "movies.csv"))

Peek into the datasets

steam_df.head(10)
user-id game-title behavior-name value zero
0 151603712 The Elder Scrolls V Skyrim purchase 1.0 0
1 151603712 The Elder Scrolls V Skyrim play 273.0 0
2 151603712 Fallout 4 purchase 1.0 0
3 151603712 Fallout 4 play 87.0 0
4 151603712 Spore purchase 1.0 0
5 151603712 Spore play 14.9 0
6 151603712 Fallout New Vegas purchase 1.0 0
7 151603712 Fallout New Vegas play 12.1 0
8 151603712 Left 4 Dead 2 purchase 1.0 0
9 151603712 Left 4 Dead 2 play 8.9 0
ml_ratings_df.head(10)
userId movieId rating timestamp
0 1 1 4.0 964982703
1 1 3 4.0 964981247
2 1 6 4.0 964982224
3 1 47 5.0 964983815
4 1 50 5.0 964982931
5 1 70 3.0 964982400
6 1 101 5.0 964980868
7 1 110 4.0 964982176
8 1 151 5.0 964984041
9 1 157 5.0 964984100
ml_movies_df.head(10)
movieId title genres
0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
1 2 Jumanji (1995) Adventure|Children|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance
3 4 Waiting to Exhale (1995) Comedy|Drama|Romance
4 5 Father of the Bride Part II (1995) Comedy
5 6 Heat (1995) Action|Crime|Thriller
6 7 Sabrina (1995) Comedy|Romance
7 8 Tom and Huck (1995) Adventure|Children
8 9 Sudden Death (1995) Action
9 10 GoldenEye (1995) Action|Adventure|Thriller

Merge both MovieLens DataFrames into one

ml_df = pd.merge(ml_ratings_df, ml_movies_df, on='movieId')
ml_df.head(10)
userId movieId rating timestamp title genres
0 1 1 4.0 964982703 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
1 5 1 4.0 847434962 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
2 7 1 4.5 1106635946 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
3 15 1 2.5 1510577970 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
4 17 1 4.5 1305696483 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
5 18 1 3.5 1455209816 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
6 19 1 4.0 965705637 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
7 21 1 3.5 1407618878 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
8 27 1 3.0 962685262 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
9 31 1 5.0 850466616 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
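
pd.merge performs an inner join by default, so ratings without a matching movie (and movies without ratings) are dropped. The sketch below shows the difference between the default and how='left' on two tiny, made-up frames (ratings and movies are hypothetical stand-ins, not the MovieLens data):

```python
import pandas as pd

# Hypothetical toy data: movieId 30 has no title, movieId 40 has no rating.
ratings = pd.DataFrame({'userId': [1, 1, 2],
                        'movieId': [10, 20, 30],
                        'rating': [4.0, 3.5, 5.0]})
movies = pd.DataFrame({'movieId': [10, 20, 40],
                       'title': ['A', 'B', 'D']})

inner = pd.merge(ratings, movies, on='movieId')              # how='inner' is the default
left = pd.merge(ratings, movies, on='movieId', how='left')   # keeps every rating

print(len(inner), len(left))
print(left)
```

In the left join the rating for movieId 30 is kept and its title is NaN.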

Choosing a row, a column or several columns

display(HTML(steam_df.head(10).to_html()))

# Choosing rows by index (label-based .loc includes the end label)
chosen_df = steam_df.loc[3:5]

print("Choosing rows by index")
display(HTML(chosen_df.head(10).to_html()))

# Choosing rows by position
chosen_df = steam_df.iloc[3:6]

print("Choosing rows by position")
display(HTML(chosen_df.head(10).to_html()))
user-id game-title behavior-name value zero
0 151603712 The Elder Scrolls V Skyrim purchase 1.0 0
1 151603712 The Elder Scrolls V Skyrim play 273.0 0
2 151603712 Fallout 4 purchase 1.0 0
3 151603712 Fallout 4 play 87.0 0
4 151603712 Spore purchase 1.0 0
5 151603712 Spore play 14.9 0
6 151603712 Fallout New Vegas purchase 1.0 0
7 151603712 Fallout New Vegas play 12.1 0
8 151603712 Left 4 Dead 2 purchase 1.0 0
9 151603712 Left 4 Dead 2 play 8.9 0
Choosing rows by index
user-id game-title behavior-name value zero
3 151603712 Fallout 4 play 87.0 0
4 151603712 Spore purchase 1.0 0
5 151603712 Spore play 14.9 0
Choosing rows by position
user-id game-title behavior-name value zero
3 151603712 Fallout 4 play 87.0 0
4 151603712 Spore purchase 1.0 0
5 151603712 Spore play 14.9 0
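
With the default 0, 1, 2, ... index, labels and positions coincide, which is why both selections above return the same rows. The distinction becomes visible once the index is something else. A small sketch on a made-up frame (df and its index values are illustrative assumptions):

```python
import pandas as pd

df = pd.DataFrame({'game': ['A', 'B', 'C', 'D']}, index=[10, 20, 30, 40])

by_position = df.iloc[1:3]     # positions 1 and 2; the end position is excluded
by_label = df.loc[20:30]       # labels 20 through 30; the end label is included

# After reordering rows, position 0 and label 10 point at different rows.
reordered = df.iloc[[2, 0, 3, 1]]
first_by_position = reordered.iloc[0]['game']   # 'C'
row_with_label_10 = reordered.loc[10]['game']   # 'A'

print(by_position)
print(by_label)
print(first_by_position, row_with_label_10)
```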
# Choosing a column
chosen_df = steam_df['game-title']

print(chosen_df.head(10))
0    The Elder Scrolls V Skyrim
1    The Elder Scrolls V Skyrim
2                     Fallout 4
3                     Fallout 4
4                         Spore
5                         Spore
6             Fallout New Vegas
7             Fallout New Vegas
8                 Left 4 Dead 2
9                 Left 4 Dead 2
Name: game-title, dtype: object
# Choosing several columns
chosen_df = steam_df[['user-id', 'game-title']]

display(HTML(chosen_df.head(10).to_html()))
user-id game-title
0 151603712 The Elder Scrolls V Skyrim
1 151603712 The Elder Scrolls V Skyrim
2 151603712 Fallout 4
3 151603712 Fallout 4
4 151603712 Spore
5 151603712 Spore
6 151603712 Fallout New Vegas
7 151603712 Fallout New Vegas
8 151603712 Left 4 Dead 2
9 151603712 Left 4 Dead 2

Splitting the dataset into training and test set

shuffle = np.array(list(range(len(steam_df))))

# alternatively

shuffle = np.arange(len(steam_df))

np.random.shuffle(shuffle)
# shuffle = list(shuffle)
print("Shuffled range of indices")
print(shuffle[:20])
print()

train_test_split = 0.8
split_index = int(len(steam_df) * train_test_split)

training_set = steam_df.iloc[shuffle[:split_index]]
test_set = steam_df.iloc[shuffle[split_index:]]

display(HTML(training_set.head(10).to_html()))

display(HTML(test_set.head(10).to_html()))

print(len(training_set))
print(len(test_set))
Shuffled range of indices
[ 88886  27084  35588  56116 183664  34019 190384 138109  48325  94171
 163304  35071  45875 187591 107927  62332  97588   3784    669  75931]

user-id game-title behavior-name value zero
88886 173434036 Mortal Kombat X purchase 1.0 0
27084 80779496 Sins of a Solar Empire Trinity play 0.6 0
35588 109669093 Killing Floor play 225.0 0
56116 94269421 Fallout 4 play 10.1 0
183664 279406744 BLOCKADE 3D purchase 1.0 0
34019 126269125 Grand Theft Auto San Andreas purchase 1.0 0
190384 71335402 7 Days to Die play 8.2 0
138109 156818121 Half-Life 2 play 22.0 0
48325 114617787 Garry's Mod play 1.2 0
94171 156615447 LEGO MARVEL Super Heroes play 1.7 0
user-id game-title behavior-name value zero
170080 81591317 Warframe purchase 1.0 0
85279 44472980 Serious Sam Double D XXL purchase 1.0 0
132916 45592640 Penumbra Black Plague purchase 1.0 0
12193 64787956 Always Sometimes Monsters purchase 1.0 0
46374 192538478 Heroes & Generals play 0.4 0
89823 1936551 Castle Crashers purchase 1.0 0
179113 132196353 Knights and Merchants purchase 1.0 0
144002 13190476 Blood Bowl 2 play 6.3 0
35416 60296891 Mirror's Edge purchase 1.0 0
120786 62990992 Rome Total War purchase 1.0 0
160000
40000
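
The split above changes on every run because the shuffle is not seeded. When a reproducible split is needed, one option is a seeded numpy Generator. This is a sketch on a small made-up frame; the seed value and the frame df are arbitrary assumptions:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': range(10)})      # hypothetical stand-in for steam_df

rng = np.random.default_rng(seed=6789)   # fixed seed -> same permutation every run
permutation = rng.permutation(len(df))

split_index = int(len(df) * 0.8)
training_set = df.iloc[permutation[:split_index]]
test_set = df.iloc[permutation[split_index:]]

print(len(training_set), len(test_set))
```

Because the permutation contains every position exactly once, the two sets never overlap and together cover the whole frame.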

Filtering

Filtering columns

chosen_df = steam_df.loc[:, ['user-id', 'game-title']]

display(HTML(chosen_df.head(10).to_html()))
user-id game-title
0 151603712 The Elder Scrolls V Skyrim
1 151603712 The Elder Scrolls V Skyrim
2 151603712 Fallout 4
3 151603712 Fallout 4
4 151603712 Spore
5 151603712 Spore
6 151603712 Fallout New Vegas
7 151603712 Fallout New Vegas
8 151603712 Left 4 Dead 2
9 151603712 Left 4 Dead 2

Filtering rows

condition = steam_df['game-title'] == 'Fallout 4'

print(condition.head(10))

chosen_df = steam_df.loc[condition]

display(HTML(chosen_df.head(10).to_html()))
0    False
1    False
2     True
3     True
4    False
5    False
6    False
7    False
8    False
9    False
Name: game-title, dtype: bool
user-id game-title behavior-name value zero
2 151603712 Fallout 4 purchase 1.0 0
3 151603712 Fallout 4 play 87.0 0
3187 87445402 Fallout 4 purchase 1.0 0
3188 87445402 Fallout 4 play 83.0 0
5683 25096601 Fallout 4 purchase 1.0 0
5684 25096601 Fallout 4 play 1.6 0
6219 211925330 Fallout 4 purchase 1.0 0
6220 211925330 Fallout 4 play 133.0 0
7300 115396529 Fallout 4 purchase 1.0 0
7301 115396529 Fallout 4 play 17.9 0

Filtering rows and columns at once

condition = (steam_df['game-title'] == 'Fallout 4') & (steam_df['behavior-name'] == 'play')

chosen_df = steam_df.loc[condition, ['user-id', 'game-title', 'value']]

display(HTML(chosen_df.head(10).to_html()))
user-id game-title value
3 151603712 Fallout 4 87.0
3188 87445402 Fallout 4 83.0
5684 25096601 Fallout 4 1.6
6220 211925330 Fallout 4 133.0
7301 115396529 Fallout 4 17.9
7527 4834220 Fallout 4 19.8
7617 65229865 Fallout 4 0.5
7712 65958466 Fallout 4 123.0
9963 91800733 Fallout 4 63.0
10700 43913966 Fallout 4 65.0

Simple operations on columns

Multiply a column by 2

steam_df_copy = steam_df.copy()

display(HTML(steam_df_copy.head(10).to_html()))

steam_df_copy.loc[:, 'value'] = steam_df_copy['value'] * 2

display(HTML(steam_df_copy.head(10).to_html()))
user-id game-title behavior-name value zero
0 151603712 The Elder Scrolls V Skyrim purchase 1.0 0
1 151603712 The Elder Scrolls V Skyrim play 273.0 0
2 151603712 Fallout 4 purchase 1.0 0
3 151603712 Fallout 4 play 87.0 0
4 151603712 Spore purchase 1.0 0
5 151603712 Spore play 14.9 0
6 151603712 Fallout New Vegas purchase 1.0 0
7 151603712 Fallout New Vegas play 12.1 0
8 151603712 Left 4 Dead 2 purchase 1.0 0
9 151603712 Left 4 Dead 2 play 8.9 0
user-id game-title behavior-name value zero
0 151603712 The Elder Scrolls V Skyrim purchase 2.0 0
1 151603712 The Elder Scrolls V Skyrim play 546.0 0
2 151603712 Fallout 4 purchase 2.0 0
3 151603712 Fallout 4 play 174.0 0
4 151603712 Spore purchase 2.0 0
5 151603712 Spore play 29.8 0
6 151603712 Fallout New Vegas purchase 2.0 0
7 151603712 Fallout New Vegas play 24.2 0
8 151603712 Left 4 Dead 2 purchase 2.0 0
9 151603712 Left 4 Dead 2 play 17.8 0

Choose the first n letters of a string

ml_movies_df_copy = ml_movies_df.copy()

display(HTML(ml_movies_df_copy.head(10).to_html()))

ml_movies_df_copy.loc[:, 'title'] = ml_movies_df_copy['title'].str[:6]

display(HTML(ml_movies_df_copy.head(10).to_html()))
movieId title genres
0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
1 2 Jumanji (1995) Adventure|Children|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance
3 4 Waiting to Exhale (1995) Comedy|Drama|Romance
4 5 Father of the Bride Part II (1995) Comedy
5 6 Heat (1995) Action|Crime|Thriller
6 7 Sabrina (1995) Comedy|Romance
7 8 Tom and Huck (1995) Adventure|Children
8 9 Sudden Death (1995) Action
9 10 GoldenEye (1995) Action|Adventure|Thriller
movieId title genres
0 1 Toy St Adventure|Animation|Children|Comedy|Fantasy
1 2 Jumanj Adventure|Children|Fantasy
2 3 Grumpi Comedy|Romance
3 4 Waitin Comedy|Drama|Romance
4 5 Father Comedy
5 6 Heat ( Action|Crime|Thriller
6 7 Sabrin Comedy|Romance
7 8 Tom an Adventure|Children
8 9 Sudden Action
9 10 Golden Action|Adventure|Thriller

Take the mean of a column

# Option 1
print(steam_df['value'].mean())

# Option 2
print(np.mean(steam_df['value']))
17.874384000000475
17.874384000000475

Simple operation on filtered data

steam_df_copy = steam_df.loc[((steam_df['game-title'] == 'Fallout 4') | (steam_df['game-title'] == 'The Elder Scrolls V Skyrim')) 
                             & (steam_df['behavior-name'] == 'play')].copy()

display(HTML(steam_df_copy.head(10).to_html()))

condition = (steam_df_copy['game-title'] == 'Fallout 4') & (steam_df_copy['behavior-name'] == 'play')

steam_df_copy.loc[condition, 'value'] = steam_df_copy.loc[condition, 'value'] * 2

display(HTML(steam_df_copy.head(10).to_html()))
user-id game-title behavior-name value zero
1 151603712 The Elder Scrolls V Skyrim play 273.0 0
3 151603712 Fallout 4 play 87.0 0
73 59945701 The Elder Scrolls V Skyrim play 58.0 0
1066 92107940 The Elder Scrolls V Skyrim play 110.0 0
1168 250006052 The Elder Scrolls V Skyrim play 465.0 0
1388 11373749 The Elder Scrolls V Skyrim play 220.0 0
2065 54103616 The Elder Scrolls V Skyrim play 35.0 0
2569 56038151 The Elder Scrolls V Skyrim play 14.6 0
3188 87445402 Fallout 4 play 83.0 0
3233 94088853 The Elder Scrolls V Skyrim play 320.0 0
user-id game-title behavior-name value zero
1 151603712 The Elder Scrolls V Skyrim play 273.0 0
3 151603712 Fallout 4 play 174.0 0
73 59945701 The Elder Scrolls V Skyrim play 58.0 0
1066 92107940 The Elder Scrolls V Skyrim play 110.0 0
1168 250006052 The Elder Scrolls V Skyrim play 465.0 0
1388 11373749 The Elder Scrolls V Skyrim play 220.0 0
2065 54103616 The Elder Scrolls V Skyrim play 35.0 0
2569 56038151 The Elder Scrolls V Skyrim play 14.6 0
3188 87445402 Fallout 4 play 166.0 0
3233 94088853 The Elder Scrolls V Skyrim play 320.0 0
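
When a column is compared against several allowed values, chaining == with | gets long quickly. Series.isin expresses the same condition more compactly. A sketch on a small made-up frame with the same column names as the Steam data:

```python
import pandas as pd

# Hypothetical toy rows using the Steam column names.
df = pd.DataFrame({
    'game-title': ['Fallout 4', 'Spore', 'The Elder Scrolls V Skyrim', 'Fallout 4'],
    'behavior-name': ['play', 'play', 'play', 'purchase'],
    'value': [87.0, 14.9, 273.0, 1.0],
})

titles = ['Fallout 4', 'The Elder Scrolls V Skyrim']
condition = df['game-title'].isin(titles) & (df['behavior-name'] == 'play')
chosen = df.loc[condition]

print(chosen)
```

Note that each comparison still needs its own parentheses when combined with & or |, because these operators bind more tightly than ==.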

Advanced operations on columns

def reduce_outliers(x):
    return min(np.log(1 + x), 4)

steam_df_copy = steam_df.loc[steam_df['behavior-name'] == 'play'].copy()

display(HTML(steam_df_copy.head(10).to_html()))

steam_df_copy.loc[:, 'value'] = steam_df_copy['value'].apply(reduce_outliers)

display(HTML(steam_df_copy.head(10).to_html()))
user-id game-title behavior-name value zero
1 151603712 The Elder Scrolls V Skyrim play 273.0 0
3 151603712 Fallout 4 play 87.0 0
5 151603712 Spore play 14.9 0
7 151603712 Fallout New Vegas play 12.1 0
9 151603712 Left 4 Dead 2 play 8.9 0
11 151603712 HuniePop play 8.5 0
13 151603712 Path of Exile play 8.1 0
15 151603712 Poly Bridge play 7.5 0
17 151603712 Left 4 Dead play 3.3 0
19 151603712 Team Fortress 2 play 2.8 0
user-id game-title behavior-name value zero
1 151603712 The Elder Scrolls V Skyrim play 4.000000 0
3 151603712 Fallout 4 play 4.000000 0
5 151603712 Spore play 2.766319 0
7 151603712 Fallout New Vegas play 2.572612 0
9 151603712 Left 4 Dead 2 play 2.292535 0
11 151603712 HuniePop play 2.251292 0
13 151603712 Path of Exile play 2.208274 0
15 151603712 Poly Bridge play 2.140066 0
17 151603712 Left 4 Dead play 1.458615 0
19 151603712 Team Fortress 2 play 1.335001 0

The same apply operation can be achieved with the use of a lambda function

steam_df_copy = steam_df.loc[steam_df['behavior-name'] == 'play'].copy()

display(HTML(steam_df_copy.head(10).to_html()))

steam_df_copy.loc[:, 'value'] = steam_df_copy['value'].apply(lambda x: min(np.log(1 + x), 4))

display(HTML(steam_df_copy.head(10).to_html()))
user-id game-title behavior-name value zero
1 151603712 The Elder Scrolls V Skyrim play 273.0 0
3 151603712 Fallout 4 play 87.0 0
5 151603712 Spore play 14.9 0
7 151603712 Fallout New Vegas play 12.1 0
9 151603712 Left 4 Dead 2 play 8.9 0
11 151603712 HuniePop play 8.5 0
13 151603712 Path of Exile play 8.1 0
15 151603712 Poly Bridge play 7.5 0
17 151603712 Left 4 Dead play 3.3 0
19 151603712 Team Fortress 2 play 2.8 0
user-id game-title behavior-name value zero
1 151603712 The Elder Scrolls V Skyrim play 4.000000 0
3 151603712 Fallout 4 play 4.000000 0
5 151603712 Spore play 2.766319 0
7 151603712 Fallout New Vegas play 2.572612 0
9 151603712 Left 4 Dead 2 play 2.292535 0
11 151603712 HuniePop play 2.251292 0
13 151603712 Path of Exile play 2.208274 0
15 151603712 Poly Bridge play 2.140066 0
17 151603712 Left 4 Dead play 1.458615 0
19 151603712 Team Fortress 2 play 1.335001 0
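
For a simple numeric transformation like this one, apply can usually be replaced by vectorized numpy calls, which tends to be much faster on large columns. A sketch comparing both on a few of the values shown above (values is an illustrative Series, not the full dataset):

```python
import numpy as np
import pandas as pd

values = pd.Series([273.0, 87.0, 14.9, 2.8])

# Row-by-row version, as in the cell above.
with_apply = values.apply(lambda x: min(np.log(1 + x), 4))

# Vectorized versions: log(1 + x) for the whole column, then cap at 4.
vectorized = np.minimum(np.log1p(values), 4)
clipped = np.log1p(values).clip(upper=4)

print(vectorized)
```

All three produce the same numbers; apply remains useful when the function cannot be expressed with array operations.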

Apply on two columns at once

steam_df_copy = steam_df.loc[steam_df['behavior-name'] == 'play'].copy()

display(HTML(steam_df_copy.head(10).to_html()))

steam_df_copy.loc[:, 'value_2'] = steam_df_copy['value'].apply(lambda x: min(np.log(1 + x), 4))

display(HTML(steam_df_copy.head(10).to_html()))

steam_df_copy.loc[:, 'value'] = steam_df_copy[['value', 'value_2']].apply(lambda x: x['value'] * x['value_2'], axis=1)

display(HTML(steam_df_copy.head(10).to_html()))
user-id game-title behavior-name value zero
1 151603712 The Elder Scrolls V Skyrim play 273.0 0
3 151603712 Fallout 4 play 87.0 0
5 151603712 Spore play 14.9 0
7 151603712 Fallout New Vegas play 12.1 0
9 151603712 Left 4 Dead 2 play 8.9 0
11 151603712 HuniePop play 8.5 0
13 151603712 Path of Exile play 8.1 0
15 151603712 Poly Bridge play 7.5 0
17 151603712 Left 4 Dead play 3.3 0
19 151603712 Team Fortress 2 play 2.8 0
user-id game-title behavior-name value zero value_2
1 151603712 The Elder Scrolls V Skyrim play 273.0 0 4.000000
3 151603712 Fallout 4 play 87.0 0 4.000000
5 151603712 Spore play 14.9 0 2.766319
7 151603712 Fallout New Vegas play 12.1 0 2.572612
9 151603712 Left 4 Dead 2 play 8.9 0 2.292535
11 151603712 HuniePop play 8.5 0 2.251292
13 151603712 Path of Exile play 8.1 0 2.208274
15 151603712 Poly Bridge play 7.5 0 2.140066
17 151603712 Left 4 Dead play 3.3 0 1.458615
19 151603712 Team Fortress 2 play 2.8 0 1.335001
user-id game-title behavior-name value zero value_2
1 151603712 The Elder Scrolls V Skyrim play 1092.000000 0 4.000000
3 151603712 Fallout 4 play 348.000000 0 4.000000
5 151603712 Spore play 41.218155 0 2.766319
7 151603712 Fallout New Vegas play 31.128608 0 2.572612
9 151603712 Left 4 Dead 2 play 20.403559 0 2.292535
11 151603712 HuniePop play 19.135980 0 2.251292
13 151603712 Path of Exile play 17.887023 0 2.208274
15 151603712 Poly Bridge play 16.050496 0 2.140066
17 151603712 Left 4 Dead play 4.813430 0 1.458615
19 151603712 Team Fortress 2 play 3.738003 0 1.335001
ml_movies_df_copy = ml_movies_df.copy()

display(HTML(ml_movies_df_copy.head(10).to_html()))

ml_movies_df_copy.loc[:, 'title|genres'] = ml_movies_df_copy[['title', 'genres']].apply(lambda x: x['title'] + "|" + x['genres'], axis=1)

display(HTML(ml_movies_df_copy.head(10).to_html()))
movieId title genres
0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
1 2 Jumanji (1995) Adventure|Children|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance
3 4 Waiting to Exhale (1995) Comedy|Drama|Romance
4 5 Father of the Bride Part II (1995) Comedy
5 6 Heat (1995) Action|Crime|Thriller
6 7 Sabrina (1995) Comedy|Romance
7 8 Tom and Huck (1995) Adventure|Children
8 9 Sudden Death (1995) Action
9 10 GoldenEye (1995) Action|Adventure|Thriller
movieId title genres title|genres
0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy Toy Story (1995)|Adventure|Animation|Children|Comedy|Fantasy
1 2 Jumanji (1995) Adventure|Children|Fantasy Jumanji (1995)|Adventure|Children|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance Grumpier Old Men (1995)|Comedy|Romance
3 4 Waiting to Exhale (1995) Comedy|Drama|Romance Waiting to Exhale (1995)|Comedy|Drama|Romance
4 5 Father of the Bride Part II (1995) Comedy Father of the Bride Part II (1995)|Comedy
5 6 Heat (1995) Action|Crime|Thriller Heat (1995)|Action|Crime|Thriller
6 7 Sabrina (1995) Comedy|Romance Sabrina (1995)|Comedy|Romance
7 8 Tom and Huck (1995) Adventure|Children Tom and Huck (1995)|Adventure|Children
8 9 Sudden Death (1995) Action Sudden Death (1995)|Action
9 10 GoldenEye (1995) Action|Adventure|Thriller GoldenEye (1995)|Action|Adventure|Thriller
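
Both row-wise apply calls in this section combine columns with operators that pandas already supports on whole columns, so they can be written without apply at all. A sketch on a made-up frame (the rows and the product column name are illustrative assumptions):

```python
import pandas as pd

df = pd.DataFrame({
    'title': ['Heat (1995)', 'Sabrina (1995)'],
    'genres': ['Action|Crime|Thriller', 'Comedy|Romance'],
    'value': [2.0, 3.0],
    'value_2': [1.5, 0.5],
})

# Numeric columns multiply element-wise, string columns concatenate element-wise.
df['product'] = df['value'] * df['value_2']
df['title|genres'] = df['title'] + '|' + df['genres']

print(df[['product', 'title|genres']])
```

Row-wise apply with axis=1 is best kept for logic that genuinely needs Python code per row.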

Grouping and aggregating

steam_grouped = steam_df.loc[steam_df['behavior-name'] == 'purchase', ['game-title', 'value']]
steam_grouped = steam_grouped.groupby('game-title').sum()
display(HTML(steam_grouped.head(10).to_html()))

steam_grouped = steam_grouped.sort_values(by='value', ascending=False).reset_index()

display(HTML(steam_grouped.head(10).to_html()))
value
game-title
007 Legends 1.0
0RBITALIS 3.0
1... 2... 3... KICK IT! (Drop That Beat Like an Ugly Baby) 7.0
10 Second Ninja 6.0
10,000,000 1.0
100% Orange Juice 10.0
1000 Amps 2.0
12 Labours of Hercules 10.0
12 Labours of Hercules II The Cretan Bull 12.0
12 Labours of Hercules III Girl Power 6.0
game-title value
0 Dota 2 4841.0
1 Team Fortress 2 2323.0
2 Unturned 1563.0
3 Counter-Strike Global Offensive 1412.0
4 Half-Life 2 Lost Coast 981.0
5 Counter-Strike Source 978.0
6 Left 4 Dead 2 951.0
7 Counter-Strike 856.0
8 Warframe 847.0
9 Half-Life 2 Deathmatch 823.0
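
groupby is not limited to a single aggregate. With agg, several statistics can be computed per group in one pass and given explicit column names. A sketch on made-up play records (plays and the output column names total, average and n_rows are illustrative assumptions):

```python
import pandas as pd

plays = pd.DataFrame({
    'game-title': ['A', 'A', 'B', 'B', 'B'],
    'value': [1.0, 3.0, 2.0, 4.0, 6.0],
})

summary = (
    plays.groupby('game-title')['value']
    .agg(total='sum', average='mean', n_rows='count')   # named aggregations
    .sort_values(by='total', ascending=False)
    .reset_index()
)

print(summary)
```

Selecting the 'value' column before agg keeps the result flat, with one output column per named aggregation.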

Iterating over a DataFrame (if possible, use column operations instead)

i = 0
for idx, row in steam_df.iterrows():
    print("[{}, {}, {}, {}]".format(idx, row['user-id'], row['game-title'], row['behavior-name']))
    i += 1
    if i == 10:
        break
[0, 151603712, The Elder Scrolls V Skyrim, purchase]
[1, 151603712, The Elder Scrolls V Skyrim, play]
[2, 151603712, Fallout 4, purchase]
[3, 151603712, Fallout 4, play]
[4, 151603712, Spore, purchase]
[5, 151603712, Spore, play]
[6, 151603712, Fallout New Vegas, purchase]
[7, 151603712, Fallout New Vegas, play]
[8, 151603712, Left 4 Dead 2, purchase]
[9, 151603712, Left 4 Dead 2, play]
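
If a loop is unavoidable, itertuples is generally faster than iterrows because it does not build a Series for every row, and head(n) avoids the manual counter. One caveat: column names that are not valid Python identifiers (such as the hyphenated Steam columns) are replaced by positional names, so this sketch uses a made-up frame with underscore names:

```python
import pandas as pd

# Hypothetical frame; underscores make the columns usable as tuple attributes.
df = pd.DataFrame({
    'user_id': [1, 2, 3],
    'game_title': ['A', 'B', 'C'],
})

lines = []
for row in df.head(2).itertuples():
    # row.Index holds the index label; columns are available as attributes.
    lines.append("[{}, {}, {}]".format(row.Index, row.user_id, row.game_title))

print("\n".join(lines))
```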

Pandas tasks - Steam dataset

Task 4. How many people made a purchase in the Steam dataset? Remember that a person could buy many games, but you need to count every person once.

# Write your code here

Task 5. How many people made a purchase of "The Elder Scrolls V Skyrim"?

# Write your code here

Task 6. How many purchases did people make on average?

# Write your code here

Task 7. Who bought the most games?

# Write your code here

Task 8. How many hours on average did people play "The Elder Scrolls V Skyrim"?

# Write your code here

Task 9. Which games were played the most (in terms of the number of hours played)? Print the first 10 titles and respective numbers of hours.

# Write your code here

Task 10. Which games are the most consistently played (in terms of the average number of hours played)? Print the first 10 titles and respective numbers of hours.

# Write your code here

Task 11**. Correct the above for the fact that users with 0 hours played have no play record; in that case only the purchase is recorded.

# Write your code here

Task 12. Apply the sigmoid function $$f(x) = \frac{1}{1 + e^{-\frac{1}{100}x}}$$ to hours played and print the first 10 rows from the entire Steam dataset after this change.

# Write your code here

Pandas tasks - MovieLens dataset

Task 13*. Calculate popularity (by the number of users who watched a movie) of all genres.

# Write your code here

Task 14*. Calculate the average rating for each genre.

# Write your code here

Task 15. Calculate each movie's rating bias (the deviation of its average rating from the mean of all movies' average ratings). Print the first 10 in the form: title, average rating, bias.

# Write your code here

Task 16. Calculate each user's rating bias (the deviation of their average rating from the mean of all users' average ratings). Print the first 10 in the form: user_id, average rating, bias.

# Write your code here

Task 17. Randomly choose 10 movies and 10 users and print their interaction matrix in the form of a DataFrame with user_id as index and movie titles as columns (use HTML Display for that). You can iterate over the DataFrame in this task.

# Write your code here

Pandas + numpy tasks

Task 18. Create the entire interaction matrix for the MovieLens dataset.

# Write your code here

Task 19. Calculate the matrix of size (n_users, n_users) where the entry at position (i, j) is the number of movies watched by both user i and user j. Print the submatrix of the first 10 rows and 10 columns.

# Write your code here

Task 20. Calculate the matrix of size (n_items, n_items) where the entry at position (i, j) is the number of users who watched both movie i and movie j. Print the submatrix of the first 10 rows and 10 columns.

# Write your code here