%matplotlib inline
%load_ext autoreload
%autoreload 2

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import Markdown, display, HTML

# Workaround for the dying kernel problem seen in some installations; the os.environ line can be removed if the kernel works without it
import os
os.environ['KMP_DUPLICATE_LIB_OK'] = 'True'

Numpy

For a detailed reference check out: https://numpy.org/doc/stable/reference/arrays.indexing.html.

Creating numpy arrays

Directly

a = np.array(
    [[1.0, 2.0, 3.0], 
     [4.0, 5.0, 6.0], 
     [7.0, 8.0, 9.0]]
)

print(a)

[[1. 2. 3.]
 [4. 5. 6.]
 [7. 8. 9.]]

From a list

a = [[1.0, 2.0, 3.0], 
     [4.0, 5.0, 6.0], 
     [7.0, 8.0, 9.0]]

print(a)
print()

a = np.array(a)

print(a)

[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]

[[1. 2. 3.]
 [4. 5. 6.]
 [7. 8. 9.]]

From a list comprehension

a = [i**2 for i in range(10)]

print(a)
print()
print(np.array(a))

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

[ 0  1  4  9 16 25 36 49 64 81]

Ready-made functions in numpy

# All zeros
a = np.zeros((3, 4))
print("All zeros")
print(a)
print()

# All a chosen value
a = np.full((3, 4), 7.0)
print("All chosen value (variant 1)")
print(a)
print()

# or

a = np.zeros((3, 4))
a[:] = 7.0
print("All chosen value (variant 2)")
print(a)
print()

# Random integers

a = np.random.randint(low=0, high=10, size=(3, 2))
print("Random integers")
print(a)
print()

# Random values from the normal distribution (Gaussian)

print("Random values from the normal distribution")
a = np.random.normal(loc=0, scale=10, size=(3, 2))
print(a)

All zeros
[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]

All chosen value (variant 1)
[[7. 7. 7. 7.]
 [7. 7. 7. 7.]
 [7. 7. 7. 7.]]

All chosen value (variant 2)
[[7. 7. 7. 7.]
 [7. 7. 7. 7.]
 [7. 7. 7. 7.]]

Random integers
[[7 5]
 [9 8]
 [6 3]]

Random values from the normal distribution
[[  3.88109518 -15.30896612]
 [  7.88779281   7.67458172]
 [ -9.81026963  -6.02098263]]
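The random arrays above change on every run. As a minimal sketch, assuming NumPy 1.17 or newer, a seeded generator makes such draws reproducible. The seed 42 and the names rng_a and rng_b are arbitrary choices for illustration.

```python
import numpy as np

# Two generators created with the same seed produce identical draws.
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)

# Random integers from 0 to 9 (high is exclusive)
ints_a = rng_a.integers(low=0, high=10, size=(3, 2))
ints_b = rng_b.integers(low=0, high=10, size=(3, 2))

# Random values from the normal distribution
normal_a = rng_a.normal(loc=0, scale=10, size=(3, 2))
normal_b = rng_b.normal(loc=0, scale=10, size=(3, 2))

print(np.array_equal(ints_a, ints_b))      # True
print(np.array_equal(normal_a, normal_b))  # True
```

The older np.random.randint and np.random.normal calls used above keep working; the generator object is only one way to pin the results down.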

Slicing numpy arrays

Slicing in 1D

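To obtain only specific values from a numpy array, one can use so-called slicing. It has the form

arr[low:high:step]

where low is the first index to be retrieved, high is the first index not to be retrieved, and step means that every step-th element is taken. Each of the three parts can be omitted, and negative indices count from the end. The same syntax works for Python lists, which the example below uses.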

a = [i**2 for i in range(10)]

print("Original: ", a)
print("First 5 elements:", a[:5])
print("Elements from index 3 to index 5:", a[3:6])
print("Last 3 elements (negative indexing):", a[-3:])
print("Printed in reverse order:", a[::-1])
print("Every second element:", a[::2])

Original:  [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
First 5 elements: [0, 1, 4, 9, 16]
Elements from index 3 to index 5: [9, 16, 25]
Last 3 elements (negative indexing): [49, 64, 81]
Printed in reverse order: [81, 64, 49, 36, 25, 16, 9, 4, 1, 0]
Every second element: [0, 4, 16, 36, 64]

Slicing in 2D

In two dimensions it works similarly, just the slicing is separate for every dimension.

a = np.array([i for i in range(25)]).reshape(5, 5)

print("Original: ")
print(a)
print()
print("First 2 elements of the first 3 rows:")
print(a[:3, :2])
print()
print("Middle 3 elements from the middle 3 rows:")
print(a[1:4, 1:4])
print()
print("Bottom-right 3 by 3 submatrix (negative indexing):")
print(a[-3:, -3:])
print()
print("Reversed columns:")
print(a[:, ::-1])
print()

Original: 
[[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]
 [15 16 17 18 19]
 [20 21 22 23 24]]

First 2 elements of the first 3 rows:
[[ 0  1]
 [ 5  6]
 [10 11]]

Middle 3 elements from the middle 3 rows:
[[ 6  7  8]
 [11 12 13]
 [16 17 18]]

Bottom-right 3 by 3 submatrix (negative indexing):
[[12 13 14]
 [17 18 19]
 [22 23 24]]

Reversed columns:
[[ 4  3  2  1  0]
 [ 9  8  7  6  5]
 [14 13 12 11 10]
 [19 18 17 16 15]
 [24 23 22 21 20]]

Setting numpy array field values

a = np.array([i for i in range(25)]).reshape(5, 5)

print("Original: ")
print(a)
print()

a[1:4, 1:4] = 5.0

print("Middle values changed to 5")
print(a)
print()

b = np.array([i**2 - i for i in range(9)]).reshape(3, 3)

print("Second matrix")
print(b)
print()

a[1:4, 1:4] = b

print("Second matrix substituted into the middle of the first matrix")
print(a)

Original: 
[[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]
 [15 16 17 18 19]
 [20 21 22 23 24]]

Middle values changed to 5
[[ 0  1  2  3  4]
 [ 5  5  5  5  9]
 [10  5  5  5 14]
 [15  5  5  5 19]
 [20 21 22 23 24]]

Second matrix
[[ 0  0  2]
 [ 6 12 20]
 [30 42 56]]

Second matrix substituted into the middle of the first matrix
[[ 0  1  2  3  4]
 [ 5  0  0  2  9]
 [10  6 12 20 14]
 [15 30 42 56 19]
 [20 21 22 23 24]]
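Assignment through a slice works because basic slicing returns a view of the original array rather than a copy. A minimal sketch of the difference follows; the variable names view and copy are illustrative only.

```python
import numpy as np

a = np.arange(25).reshape(5, 5)

view = a[1:4, 1:4]          # basic slicing returns a view of a
copy = a[1:4, 1:4].copy()   # an independent copy of the same values

view[:] = -1                # writes through to a
copy[:] = 99                # does not touch a

print(a)
print(np.shares_memory(a, view))  # True
print(np.shares_memory(a, copy))  # False
```

Calling .copy() is the usual way to protect the original array when a sliced part will be modified later.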

Operations on numpy arrays

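It is important to remember that the arithmetic operators on numpy arrays (+, -, *, /, **) work element-wise. Matrix multiplication is a separate operation, available as np.matmul or the @ operator.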

a = np.array([i**2 for i in range(9)]).reshape((3, 3))
print(a)
print()

b = np.array([i**0.5 for i in range(9)]).reshape((3, 3))
print(b)
print()

[[ 0  1  4]
 [ 9 16 25]
 [36 49 64]]

[[0.         1.         1.41421356]
 [1.73205081 2.         2.23606798]
 [2.44948974 2.64575131 2.82842712]]

Element-wise sum

print(a + b)

[[ 0.          2.          5.41421356]
 [10.73205081 18.         27.23606798]
 [38.44948974 51.64575131 66.82842712]]

Element-wise multiplication

print(a * b)

[[  0.           1.           5.65685425]
 [ 15.58845727  32.          55.90169944]
 [ 88.18163074 129.64181424 181.01933598]]

Matrix multiplication

print(np.matmul(a, b))
print()

# Multiplication by the identity matrix (to check it works as expected)
id_matrix = np.array([[1.0, 0.0, 0.0], 
                      [0.0, 1.0, 0.0], 
                      [0.0, 0.0, 1.0]])

print(np.matmul(id_matrix, a))

[[ 11.53000978  12.58300524  13.54977648]
 [ 88.95005649 107.14378278 119.21568782]
 [241.63783311 303.32808391 341.49835513]]

[[ 0.  1.  4.]
 [ 9. 16. 25.]
 [36. 49. 64.]]
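The contrast between the two kinds of multiplication is easiest to see on a small case. The sketch below uses made-up 2 by 2 matrices and the @ operator, which is equivalent to np.matmul for 2D arrays.

```python
import numpy as np

a = np.array([[1.0, 2.0],
              [3.0, 4.0]])
b = np.array([[0.0, 1.0],
              [1.0, 0.0]])

elementwise = a * b        # multiplies matching entries
matrix_product = a @ b     # rows of a times columns of b

print(elementwise)
print(matrix_product)
print(np.array_equal(matrix_product, np.matmul(a, b)))  # True
```

Here b swaps the two columns of a under matrix multiplication, while the element-wise product simply zeroes the diagonal.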

Calculating the mean

a = np.random.randint(low=0, high=10, size=5)

print(a)
print()

print("Mean (by sum): ", np.sum(a) / len(a))
print("Mean (by mean):", np.mean(a))

[1 4 0 6 4]

Mean (by sum):  3.0
Mean (by mean): 3.0

Calculating the mean of every row

a = np.random.randint(low=0, high=10, size=(5, 3))

print(a)
print()
print(a.shape)
print()

print("Mean:", np.sum(a, axis=1) / a.shape[1])

print("Mean in the original matrix form:")
print((np.sum(a, axis=1) / a.shape[1]).reshape(-1, 1))  # -1 calculates the right size to use all elements

[[4 9 5]
 [8 9 1]
 [5 6 4]
 [3 7 8]
 [2 1 5]]

(5, 3)

Mean: [6.         6.         5.         6.         2.66666667]
Mean in the original matrix form:
[[6.        ]
 [6.        ]
 [5.        ]
 [6.        ]
 [2.66666667]]
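The same row means can be obtained without the manual division and reshape. The sketch below, on a small made-up matrix, uses np.mean with the axis and keepdims arguments.

```python
import numpy as np

a = np.array([[4, 9, 5],
              [8, 9, 1],
              [2, 1, 5]])

row_means = np.mean(a, axis=1)                     # shape (3,)
row_means_col = np.mean(a, axis=1, keepdims=True)  # shape (3, 1)

print(row_means)
print(row_means_col.shape)

# The (3, 1) shape broadcasts against a, e.g. to center every row.
centered = a - row_means_col
print(centered.sum(axis=1))   # every row now sums to approximately zero
```

Keeping the reduced dimension is mostly useful when the result is combined with the original matrix again, as in the centering step.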

More complex operations

a = [1.0, 2.0, 3.0]

print("Vector to power 2 (element-wise)")
print(np.power(a, 2))
print()
print("Euler number to the power a (element-wise)")
print(np.exp(a))
print()
print("An even more complex expression")
print((np.power(a, 2) + np.exp(a)) / np.sum(a))

Vector to power 2 (element-wise)
[1. 4. 9.]

Euler number to the power a (element-wise)
[ 2.71828183  7.3890561  20.08553692]

An even more complex expression
[0.61971364 1.89817602 4.84758949]

Numpy tasks

Task 1. Calculate the sigmoid (logistic) function on every element of the following numpy array [0.3, 1.2, -1.4, 0.2, -0.1, 0.1, 0.8, -0.25] and print the last 5 elements. Use only vector operations.

# Write your code here

Task 2. Calculate the dot product of the following two vectors:
$x = [3, 1, 4, 2, 6, 1, 4, 8]$
$y = [5, 2, 3, 12, 2, 4, 17, 11]$
a) by using element-wise multiplication and np.sum,
b) by using np.dot,
c) by using np.matmul and transposition (x.T).

# Write your code here

Task 3. Calculate the following expression
$$\frac{1}{1 + e^{-x_0 \theta_0 - \ldots - x_9 \theta_9 - \theta_{10}}}$$ for
$x = [1.2, 2.3, 3.4, -0.7, 4.2, 2.7, -0.5, -2.1, -3.3, 0.2]$
$\theta = [7.7, 0.33, -2.12, -1.73, 2.9, -5.8, -0.9, 12.11, 3.43, -0.5, 1.65]$
and print the result. Use only vector operations.

# Write your code here

Pandas

Load datasets

Steam (https://www.kaggle.com/tamber/steam-video-games)
MovieLens (https://grouplens.org/datasets/movielens/)

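steam_df = pd.read_csv(os.path.join("data", "steam", "steam-200k.csv"), 
                       names=['user-id', 'game-title', 'behavior-name', 'value', 'zero'])
# The fifth column is not shown in any output below, so it is dropped here
steam_df = steam_df.drop(columns='zero')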

ml_ratings_df = pd.read_csv(os.path.join("data", "movielens_small", "ratings.csv"))
ml_movies_df = pd.read_csv(os.path.join("data", "movielens_small", "movies.csv"))

Peek into the datasets

steam_df.head(10)

	user-id	game-title	behavior-name	value
0	151603712	The Elder Scrolls V Skyrim	purchase	1.0
1	151603712	The Elder Scrolls V Skyrim	play	273.0
2	151603712	Fallout 4	purchase	1.0
3	151603712	Fallout 4	play	87.0
4	151603712	Spore	purchase	1.0
5	151603712	Spore	play	14.9
6	151603712	Fallout New Vegas	purchase	1.0
7	151603712	Fallout New Vegas	play	12.1
8	151603712	Left 4 Dead 2	purchase	1.0
9	151603712	Left 4 Dead 2	play	8.9

ml_ratings_df.head(10)

	userId	movieId	rating	timestamp
0	1	1	4.0	964982703
1	1	3	4.0	964981247
2	1	6	4.0	964982224
3	1	47	5.0	964983815
4	1	50	5.0	964982931
5	1	70	3.0	964982400
6	1	101	5.0	964980868
7	1	110	4.0	964982176
8	1	151	5.0	964984041
9	1	157	5.0	964984100

ml_movies_df.head(10)

	movieId	title	genres
0	1	Toy Story (1995)	Adventure|Animation|Children|Comedy|Fantasy
1	2	Jumanji (1995)	Adventure|Children|Fantasy
2	3	Grumpier Old Men (1995)	Comedy|Romance
3	4	Waiting to Exhale (1995)	Comedy|Drama|Romance
4	5	Father of the Bride Part II (1995)	Comedy
5	6	Heat (1995)	Action|Crime|Thriller
6	7	Sabrina (1995)	Comedy|Romance
7	8	Tom and Huck (1995)	Adventure|Children
8	9	Sudden Death (1995)	Action
9	10	GoldenEye (1995)	Action|Adventure|Thriller

Merge both MovieLens DataFrames into one

ml_df = pd.merge(ml_ratings_df, ml_movies_df, on='movieId')
ml_df.head(10)

	userId	movieId	rating	timestamp	title	genres
0	1	1	4.0	964982703	Toy Story (1995)	Adventure|Animation|Children|Comedy|Fantasy
1	5	1	4.0	847434962	Toy Story (1995)	Adventure|Animation|Children|Comedy|Fantasy
2	7	1	4.5	1106635946	Toy Story (1995)	Adventure|Animation|Children|Comedy|Fantasy
3	15	1	2.5	1510577970	Toy Story (1995)	Adventure|Animation|Children|Comedy|Fantasy
4	17	1	4.5	1305696483	Toy Story (1995)	Adventure|Animation|Children|Comedy|Fantasy
5	18	1	3.5	1455209816	Toy Story (1995)	Adventure|Animation|Children|Comedy|Fantasy
6	19	1	4.0	965705637	Toy Story (1995)	Adventure|Animation|Children|Comedy|Fantasy
7	21	1	3.5	1407618878	Toy Story (1995)	Adventure|Animation|Children|Comedy|Fantasy
8	27	1	3.0	962685262	Toy Story (1995)	Adventure|Animation|Children|Comedy|Fantasy
9	31	1	5.0	850466616	Toy Story (1995)	Adventure|Animation|Children|Comedy|Fantasy
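pd.merge performs an inner join by default, so ratings for a movieId missing from the movies table would be dropped silently. The sketch below uses two made-up toy tables (not the MovieLens data) to show the how and validate parameters, which make the join type and the expected key uniqueness explicit.

```python
import pandas as pd

# Toy stand-ins for the ratings and movies tables (made-up values)
ratings = pd.DataFrame({"userId": [1, 1, 2],
                        "movieId": [10, 20, 10],
                        "rating": [4.0, 3.5, 5.0]})
movies = pd.DataFrame({"movieId": [10, 20, 30],
                       "title": ["A", "B", "C"]})

# how="left" keeps every rating row;
# validate="many_to_one" raises an error if movieId is not unique in movies
merged = pd.merge(ratings, movies, on="movieId", how="left", validate="many_to_one")

print(merged)
print(len(merged) == len(ratings))  # True
```

Comparing the row counts before and after a merge is a cheap sanity check against accidental duplication or loss of rows.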

Choosing a row, a column or several columns

display(HTML(steam_df.head(10).to_html()))

# Choosing rows with a plain slice (an integer slice inside [] is positional, so with the default index it matches iloc)
chosen_df = steam_df[3:6]

print("Choosing rows by index")
display(HTML(chosen_df.head(10).to_html()))

# Choosing rows by position
chosen_df = steam_df.iloc[3:6]

print("Choosing rows by position")
display(HTML(chosen_df.head(10).to_html()))

	user-id	game-title	behavior-name	value
0	151603712	The Elder Scrolls V Skyrim	purchase	1.0
1	151603712	The Elder Scrolls V Skyrim	play	273.0
2	151603712	Fallout 4	purchase	1.0
3	151603712	Fallout 4	play	87.0
4	151603712	Spore	purchase	1.0
5	151603712	Spore	play	14.9
6	151603712	Fallout New Vegas	purchase	1.0
7	151603712	Fallout New Vegas	play	12.1
8	151603712	Left 4 Dead 2	purchase	1.0
9	151603712	Left 4 Dead 2	play	8.9

Choosing rows by index

	user-id	game-title	behavior-name	value
3	151603712	Fallout 4	play	87.0
4	151603712	Spore	purchase	1.0
5	151603712	Spore	play	14.9

Choosing rows by position

	user-id	game-title	behavior-name	value
3	151603712	Fallout 4	play	87.0
4	151603712	Spore	purchase	1.0
5	151603712	Spore	play	14.9
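Both selections above return the same rows because the default index labels coincide with the row positions. The sketch below, on a made-up frame with a non-default index, shows where label-based (.loc) and position-based (.iloc) selection differ.

```python
import pandas as pd

df = pd.DataFrame({"game": ["A", "B", "C", "D"]}, index=[10, 20, 30, 40])

by_position = df.iloc[1:3]   # positions 1 and 2, end excluded
by_label = df.loc[10:20]     # labels 10 through 20, end included
plain_slice = df[1:3]        # an integer slice inside [] is positional as well

print(by_position)
print(by_label)
print(plain_slice.equals(by_position))  # True
```

After shuffling or filtering a DataFrame the index is no longer 0, 1, 2, ..., so the choice between .loc and .iloc starts to matter.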

# Choosing a column
chosen_df = steam_df['game-title']

print(chosen_df.head(10))

0    The Elder Scrolls V Skyrim
1    The Elder Scrolls V Skyrim
2                     Fallout 4
3                     Fallout 4
4                         Spore
5                         Spore
6             Fallout New Vegas
7             Fallout New Vegas
8                 Left 4 Dead 2
9                 Left 4 Dead 2
Name: game-title, dtype: object

# Choosing several columns
chosen_df = steam_df[['user-id', 'game-title']]

display(HTML(chosen_df.head(10).to_html()))

	user-id	game-title
0	151603712	The Elder Scrolls V Skyrim
1	151603712	The Elder Scrolls V Skyrim
2	151603712	Fallout 4
3	151603712	Fallout 4
4	151603712	Spore
5	151603712	Spore
6	151603712	Fallout New Vegas
7	151603712	Fallout New Vegas
8	151603712	Left 4 Dead 2
9	151603712	Left 4 Dead 2

Splitting the dataset into a training and a test set

shuffle = np.array(list(range(len(steam_df))))

# alternatively

shuffle = np.arange(len(steam_df))

np.random.shuffle(shuffle)
# shuffle = list(shuffle)
print("Shuffled range of indices")
print(shuffle[:20])
print()

train_test_split = 0.8
split_index = int(len(steam_df) * train_test_split)

training_set = steam_df.iloc[shuffle[:split_index]]
test_set = steam_df.iloc[shuffle[split_index:]]

display(HTML(training_set.head(10).to_html()))

display(HTML(test_set.head(10).to_html()))

print(len(training_set))
print(len(test_set))

Shuffled range of indices
[ 88886  27084  35588  56116 183664  34019 190384 138109  48325  94171
 163304  35071  45875 187591 107927  62332  97588   3784    669  75931]

	user-id	game-title	behavior-name	value
88886	173434036	Mortal Kombat X	purchase	1.0
27084	80779496	Sins of a Solar Empire Trinity	play	0.6
35588	109669093	Killing Floor	play	225.0
56116	94269421	Fallout 4	play	10.1
183664	279406744	BLOCKADE 3D	purchase	1.0
34019	126269125	Grand Theft Auto San Andreas	purchase	1.0
190384	71335402	7 Days to Die	play	8.2
138109	156818121	Half-Life 2	play	22.0
48325	114617787	Garry's Mod	play	1.2
94171	156615447	LEGO MARVEL Super Heroes	play	1.7

	user-id	game-title	behavior-name	value
170080	81591317	Warframe	purchase	1.0
85279	44472980	Serious Sam Double D XXL	purchase	1.0
132916	45592640	Penumbra Black Plague	purchase	1.0
12193	64787956	Always Sometimes Monsters	purchase	1.0
46374	192538478	Heroes & Generals	play	0.4
89823	1936551	Castle Crashers	purchase	1.0
179113	132196353	Knights and Merchants	purchase	1.0
144002	13190476	Blood Bowl 2	play	6.3
35416	60296891	Mirror's Edge	purchase	1.0
120786	62990992	Rome Total War	purchase	1.0

160000
40000
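An equivalent split can be written with pandas alone. The sketch below uses a small made-up frame and assumes a unique index; the random_state value 0 is an arbitrary choice that makes the split reproducible.

```python
import pandas as pd

df = pd.DataFrame({"user": range(10), "value": range(10, 20)})

# Draw 80% of the rows at random; random_state fixes the draw
train = df.sample(frac=0.8, random_state=0)
# Every row that was not drawn goes to the test set
test = df.drop(train.index)

print(len(train), len(test))  # 8 2
```

The manual version with a shuffled index array does the same thing and makes the mechanics more visible, which is why it is shown first.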

Filtering

Filtering columns

chosen_df = steam_df.loc[:, ['user-id', 'game-title']]

display(HTML(chosen_df.head(10).to_html()))

	user-id	game-title
0	151603712	The Elder Scrolls V Skyrim
1	151603712	The Elder Scrolls V Skyrim
2	151603712	Fallout 4
3	151603712	Fallout 4
4	151603712	Spore
5	151603712	Spore
6	151603712	Fallout New Vegas
7	151603712	Fallout New Vegas
8	151603712	Left 4 Dead 2
9	151603712	Left 4 Dead 2

Filtering rows

condition = steam_df['game-title'] == 'Fallout 4'

print(condition.head(10))

chosen_df = steam_df.loc[condition]

display(HTML(chosen_df.head(10).to_html()))

0    False
1    False
2     True
3     True
4    False
5    False
6    False
7    False
8    False
9    False
Name: game-title, dtype: bool

	user-id	game-title	behavior-name	value
2	151603712	Fallout 4	purchase	1.0
3	151603712	Fallout 4	play	87.0
3187	87445402	Fallout 4	purchase	1.0
3188	87445402	Fallout 4	play	83.0
5683	25096601	Fallout 4	purchase	1.0
5684	25096601	Fallout 4	play	1.6
6219	211925330	Fallout 4	purchase	1.0
6220	211925330	Fallout 4	play	133.0
7300	115396529	Fallout 4	purchase	1.0
7301	115396529	Fallout 4	play	17.9

Filtering rows and columns at once

condition = (steam_df['game-title'] == 'Fallout 4') & (steam_df['behavior-name'] == 'play')

chosen_df = steam_df.loc[condition, ['user-id', 'game-title', 'value']]

display(HTML(chosen_df.head(10).to_html()))

	user-id	game-title	value
3	151603712	Fallout 4	87.0
3188	87445402	Fallout 4	83.0
5684	25096601	Fallout 4	1.6
6220	211925330	Fallout 4	133.0
7301	115396529	Fallout 4	17.9
7527	4834220	Fallout 4	19.8
7617	65229865	Fallout 4	0.5
7712	65958466	Fallout 4	123.0
9963	91800733	Fallout 4	63.0
10700	43913966	Fallout 4	65.0

Simple operations on columns

Multiply a column by 2

steam_df_copy = steam_df.copy()

display(HTML(steam_df_copy.head(10).to_html()))

steam_df_copy.loc[:, 'value'] = steam_df_copy['value'] * 2

display(HTML(steam_df_copy.head(10).to_html()))

	user-id	game-title	behavior-name	value
0	151603712	The Elder Scrolls V Skyrim	purchase	1.0
1	151603712	The Elder Scrolls V Skyrim	play	273.0
2	151603712	Fallout 4	purchase	1.0
3	151603712	Fallout 4	play	87.0
4	151603712	Spore	purchase	1.0
5	151603712	Spore	play	14.9
6	151603712	Fallout New Vegas	purchase	1.0
7	151603712	Fallout New Vegas	play	12.1
8	151603712	Left 4 Dead 2	purchase	1.0
9	151603712	Left 4 Dead 2	play	8.9

	user-id	game-title	behavior-name	value
0	151603712	The Elder Scrolls V Skyrim	purchase	2.0
1	151603712	The Elder Scrolls V Skyrim	play	546.0
2	151603712	Fallout 4	purchase	2.0
3	151603712	Fallout 4	play	174.0
4	151603712	Spore	purchase	2.0
5	151603712	Spore	play	29.8
6	151603712	Fallout New Vegas	purchase	2.0
7	151603712	Fallout New Vegas	play	24.2
8	151603712	Left 4 Dead 2	purchase	2.0
9	151603712	Left 4 Dead 2	play	17.8

Choose the first n letters of a string

ml_movies_df_copy = ml_movies_df.copy()

display(HTML(ml_movies_df_copy.head(10).to_html()))

ml_movies_df_copy.loc[:, 'title'] = ml_movies_df_copy['title'].str[:6]

display(HTML(ml_movies_df_copy.head(10).to_html()))

	movieId	title	genres
0	1	Toy Story (1995)	Adventure|Animation|Children|Comedy|Fantasy
1	2	Jumanji (1995)	Adventure|Children|Fantasy
2	3	Grumpier Old Men (1995)	Comedy|Romance
3	4	Waiting to Exhale (1995)	Comedy|Drama|Romance
4	5	Father of the Bride Part II (1995)	Comedy
5	6	Heat (1995)	Action|Crime|Thriller
6	7	Sabrina (1995)	Comedy|Romance
7	8	Tom and Huck (1995)	Adventure|Children
8	9	Sudden Death (1995)	Action
9	10	GoldenEye (1995)	Action|Adventure|Thriller

	movieId	title	genres
0	1	Toy St	Adventure|Animation|Children|Comedy|Fantasy
1	2	Jumanj	Adventure|Children|Fantasy
2	3	Grumpi	Comedy|Romance
3	4	Waitin	Comedy|Drama|Romance
4	5	Father	Comedy
5	6	Heat (	Action|Crime|Thriller
6	7	Sabrin	Comedy|Romance
7	8	Tom an	Adventure|Children
8	9	Sudden	Action
9	10	Golden	Action|Adventure|Thriller

Take the mean of a column

# Option 1
print(steam_df['value'].mean())

# Option 2
print(np.mean(steam_df['value']))

17.874384000000475
17.874384000000475

Simple operation on filtered data

steam_df_copy = steam_df.loc[((steam_df['game-title'] == 'Fallout 4') | (steam_df['game-title'] == 'The Elder Scrolls V Skyrim')) 
                             & (steam_df['behavior-name'] == 'play')].copy()

display(HTML(steam_df_copy.head(10).to_html()))

condition = (steam_df_copy['game-title'] == 'Fallout 4') & (steam_df_copy['behavior-name'] == 'play')

steam_df_copy.loc[condition, 'value'] = steam_df_copy.loc[condition, 'value'] * 2

display(HTML(steam_df_copy.head(10).to_html()))

	user-id	game-title	behavior-name	value
1	151603712	The Elder Scrolls V Skyrim	play	273.0
3	151603712	Fallout 4	play	87.0
73	59945701	The Elder Scrolls V Skyrim	play	58.0
1066	92107940	The Elder Scrolls V Skyrim	play	110.0
1168	250006052	The Elder Scrolls V Skyrim	play	465.0
1388	11373749	The Elder Scrolls V Skyrim	play	220.0
2065	54103616	The Elder Scrolls V Skyrim	play	35.0
2569	56038151	The Elder Scrolls V Skyrim	play	14.6
3188	87445402	Fallout 4	play	83.0
3233	94088853	The Elder Scrolls V Skyrim	play	320.0

	user-id	game-title	behavior-name	value
1	151603712	The Elder Scrolls V Skyrim	play	273.0
3	151603712	Fallout 4	play	174.0
73	59945701	The Elder Scrolls V Skyrim	play	58.0
1066	92107940	The Elder Scrolls V Skyrim	play	110.0
1168	250006052	The Elder Scrolls V Skyrim	play	465.0
1388	11373749	The Elder Scrolls V Skyrim	play	220.0
2065	54103616	The Elder Scrolls V Skyrim	play	35.0
2569	56038151	The Elder Scrolls V Skyrim	play	14.6
3188	87445402	Fallout 4	play	166.0
3233	94088853	The Elder Scrolls V Skyrim	play	320.0
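When a condition compares one column against several allowed values, Series.isin is a compact alternative to chaining == comparisons with |. The sketch below uses a small made-up frame with the same column names as the Steam data.

```python
import pandas as pd

df = pd.DataFrame({
    "game-title": ["Fallout 4", "Spore", "The Elder Scrolls V Skyrim", "Fallout 4"],
    "behavior-name": ["play", "play", "play", "purchase"],
    "value": [87.0, 14.9, 273.0, 1.0],
})

titles = ["Fallout 4", "The Elder Scrolls V Skyrim"]
condition = df["game-title"].isin(titles) & (df["behavior-name"] == "play")

selected = df.loc[condition].copy()
print(selected)
```

The parentheses around each comparison are still required when combining conditions with & or |, because these operators bind more tightly than ==.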

Advanced operations on columns

def reduce_outliers(x):
    return min(np.log(1 + x), 4)

steam_df_copy = steam_df.loc[steam_df['behavior-name'] == 'play'].copy()

display(HTML(steam_df_copy.head(10).to_html()))

steam_df_copy.loc[:, 'value'] = steam_df_copy['value'].apply(reduce_outliers)

display(HTML(steam_df_copy.head(10).to_html()))

	user-id	game-title	behavior-name	value
1	151603712	The Elder Scrolls V Skyrim	play	273.0
3	151603712	Fallout 4	play	87.0
5	151603712	Spore	play	14.9
7	151603712	Fallout New Vegas	play	12.1
9	151603712	Left 4 Dead 2	play	8.9
11	151603712	HuniePop	play	8.5
13	151603712	Path of Exile	play	8.1
15	151603712	Poly Bridge	play	7.5
17	151603712	Left 4 Dead	play	3.3
19	151603712	Team Fortress 2	play	2.8

	user-id	game-title	behavior-name	value
1	151603712	The Elder Scrolls V Skyrim	play	4.000000
3	151603712	Fallout 4	play	4.000000
5	151603712	Spore	play	2.766319
7	151603712	Fallout New Vegas	play	2.572612
9	151603712	Left 4 Dead 2	play	2.292535
11	151603712	HuniePop	play	2.251292
13	151603712	Path of Exile	play	2.208274
15	151603712	Poly Bridge	play	2.140066
17	151603712	Left 4 Dead	play	1.458615
19	151603712	Team Fortress 2	play	1.335001

The same apply operation can be achieved with the use of a lambda function

steam_df_copy = steam_df.loc[steam_df['behavior-name'] == 'play'].copy()

display(HTML(steam_df_copy.head(10).to_html()))

steam_df_copy.loc[:, 'value'] = steam_df_copy['value'].apply(lambda x: min(np.log(1 + x), 4))

display(HTML(steam_df_copy.head(10).to_html()))

	user-id	game-title	behavior-name	value
1	151603712	The Elder Scrolls V Skyrim	play	273.0
3	151603712	Fallout 4	play	87.0
5	151603712	Spore	play	14.9
7	151603712	Fallout New Vegas	play	12.1
9	151603712	Left 4 Dead 2	play	8.9
11	151603712	HuniePop	play	8.5
13	151603712	Path of Exile	play	8.1
15	151603712	Poly Bridge	play	7.5
17	151603712	Left 4 Dead	play	3.3
19	151603712	Team Fortress 2	play	2.8

	user-id	game-title	behavior-name	value
1	151603712	The Elder Scrolls V Skyrim	play	4.000000
3	151603712	Fallout 4	play	4.000000
5	151603712	Spore	play	2.766319
7	151603712	Fallout New Vegas	play	2.572612
9	151603712	Left 4 Dead 2	play	2.292535
11	151603712	HuniePop	play	2.251292
13	151603712	Path of Exile	play	2.208274
15	151603712	Poly Bridge	play	2.140066
17	151603712	Left 4 Dead	play	1.458615
19	151603712	Team Fortress 2	play	1.335001
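For a transformation this simple, apply can often be replaced by vectorized numpy functions, which is usually faster on large columns. The sketch below uses a few made-up hour values and checks that both variants agree.

```python
import numpy as np
import pandas as pd

hours = pd.Series([273.0, 87.0, 14.9, 2.8])

# Row-by-row version, as in the cells above
with_apply = hours.apply(lambda x: min(np.log(1 + x), 4))

# Vectorized version: np.log1p(x) computes log(1 + x),
# np.minimum caps every element at 4
vectorized = np.minimum(np.log1p(hours), 4)

print(vectorized)
print(np.allclose(with_apply, vectorized))  # True
```

apply remains the more general tool when the function cannot be expressed with array operations.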

Apply on two columns at once

steam_df_copy = steam_df.loc[steam_df['behavior-name'] == 'play'].copy()

display(HTML(steam_df_copy.head(10).to_html()))

steam_df_copy.loc[:, 'value_2'] = steam_df_copy['value'].apply(lambda x: min(np.log(1 + x), 4))

display(HTML(steam_df_copy.head(10).to_html()))

steam_df_copy.loc[:, 'value'] = steam_df_copy[['value', 'value_2']].apply(lambda x: x['value'] * x['value_2'], axis=1)

display(HTML(steam_df_copy.head(10).to_html()))

	user-id	game-title	behavior-name	value
1	151603712	The Elder Scrolls V Skyrim	play	273.0
3	151603712	Fallout 4	play	87.0
5	151603712	Spore	play	14.9
7	151603712	Fallout New Vegas	play	12.1
9	151603712	Left 4 Dead 2	play	8.9
11	151603712	HuniePop	play	8.5
13	151603712	Path of Exile	play	8.1
15	151603712	Poly Bridge	play	7.5
17	151603712	Left 4 Dead	play	3.3
19	151603712	Team Fortress 2	play	2.8

	user-id	game-title	behavior-name	value	value_2
1	151603712	The Elder Scrolls V Skyrim	play	273.0	4.000000
3	151603712	Fallout 4	play	87.0	4.000000
5	151603712	Spore	play	14.9	2.766319
7	151603712	Fallout New Vegas	play	12.1	2.572612
9	151603712	Left 4 Dead 2	play	8.9	2.292535
11	151603712	HuniePop	play	8.5	2.251292
13	151603712	Path of Exile	play	8.1	2.208274
15	151603712	Poly Bridge	play	7.5	2.140066
17	151603712	Left 4 Dead	play	3.3	1.458615
19	151603712	Team Fortress 2	play	2.8	1.335001

	user-id	game-title	behavior-name	value	value_2
1	151603712	The Elder Scrolls V Skyrim	play	1092.000000	4.000000
3	151603712	Fallout 4	play	348.000000	4.000000
5	151603712	Spore	play	41.218155	2.766319
7	151603712	Fallout New Vegas	play	31.128608	2.572612
9	151603712	Left 4 Dead 2	play	20.403559	2.292535
11	151603712	HuniePop	play	19.135980	2.251292
13	151603712	Path of Exile	play	17.887023	2.208274
15	151603712	Poly Bridge	play	16.050496	2.140066
17	151603712	Left 4 Dead	play	4.813430	1.458615
19	151603712	Team Fortress 2	play	3.738003	1.335001

ml_movies_df_copy = ml_movies_df.copy()

display(HTML(ml_movies_df_copy.head(10).to_html()))

ml_movies_df_copy.loc[:, 'title|genres'] = ml_movies_df_copy[['title', 'genres']].apply(lambda x: x['title'] + "|" + x['genres'], axis=1)

display(HTML(ml_movies_df_copy.head(10).to_html()))

	movieId	title	genres
0	1	Toy Story (1995)	Adventure|Animation|Children|Comedy|Fantasy
1	2	Jumanji (1995)	Adventure|Children|Fantasy
2	3	Grumpier Old Men (1995)	Comedy|Romance
3	4	Waiting to Exhale (1995)	Comedy|Drama|Romance
4	5	Father of the Bride Part II (1995)	Comedy
5	6	Heat (1995)	Action|Crime|Thriller
6	7	Sabrina (1995)	Comedy|Romance
7	8	Tom and Huck (1995)	Adventure|Children
8	9	Sudden Death (1995)	Action
9	10	GoldenEye (1995)	Action|Adventure|Thriller

	movieId	title	genres	title|genres
0	1	Toy Story (1995)	Adventure|Animation|Children|Comedy|Fantasy	Toy Story (1995)|Adventure|Animation|Children|Comedy|Fantasy
1	2	Jumanji (1995)	Adventure|Children|Fantasy	Jumanji (1995)|Adventure|Children|Fantasy
2	3	Grumpier Old Men (1995)	Comedy|Romance	Grumpier Old Men (1995)|Comedy|Romance
3	4	Waiting to Exhale (1995)	Comedy|Drama|Romance	Waiting to Exhale (1995)|Comedy|Drama|Romance
4	5	Father of the Bride Part II (1995)	Comedy	Father of the Bride Part II (1995)|Comedy
5	6	Heat (1995)	Action|Crime|Thriller	Heat (1995)|Action|Crime|Thriller
6	7	Sabrina (1995)	Comedy|Romance	Sabrina (1995)|Comedy|Romance
7	8	Tom and Huck (1995)	Adventure|Children	Tom and Huck (1995)|Adventure|Children
8	9	Sudden Death (1995)	Action	Sudden Death (1995)|Action
9	10	GoldenEye (1995)	Action|Adventure|Thriller	GoldenEye (1995)|Action|Adventure|Thriller
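Both two-column operations above can also be written as plain column arithmetic, without apply. The sketch below uses a made-up two-row frame; the column name product is chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "title": ["Heat (1995)", "Sabrina (1995)"],
    "genres": ["Action|Crime", "Comedy|Romance"],
    "value": [2.0, 3.0],
    "value_2": [0.5, 4.0],
})

# Element-wise product of two numeric columns
df["product"] = df["value"] * df["value_2"]
# Element-wise concatenation of two string columns
df["title|genres"] = df["title"] + "|" + df["genres"]

print(df[["product", "title|genres"]])
```

apply with axis=1 calls a Python function once per row, so the column form is typically much faster on datasets of this size.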

Grouping and aggregating

Find the most popular games (in terms of purchases)

steam_grouped = steam_df.loc[steam_df['behavior-name'] == 'purchase', ['game-title', 'value']]
steam_grouped = steam_grouped.groupby('game-title').sum()
display(HTML(steam_grouped.head(10).to_html()))

steam_grouped = steam_grouped.sort_values(by='value', ascending=False).reset_index()

display(HTML(steam_grouped.head(10).to_html()))

	value
game-title
007 Legends	1.0
0RBITALIS	3.0
1... 2... 3... KICK IT! (Drop That Beat Like an Ugly Baby)	7.0
10 Second Ninja	6.0
10,000,000	1.0
100% Orange Juice	10.0
1000 Amps	2.0
12 Labours of Hercules	10.0
12 Labours of Hercules II The Cretan Bull	12.0
12 Labours of Hercules III Girl Power	6.0

	game-title	value
0	Dota 2	4841.0
1	Team Fortress 2	2323.0
2	Unturned	1563.0
3	Counter-Strike Global Offensive	1412.0
4	Half-Life 2 Lost Coast	981.0
5	Counter-Strike Source	978.0
6	Left 4 Dead 2	951.0
7	Counter-Strike	856.0
8	Warframe	847.0
9	Half-Life 2 Deathmatch	823.0
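groupby can compute several aggregates in one pass. The sketch below, on a small made-up frame, uses named aggregation (available since pandas 0.25); the output column names total_value and n_rows are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "game-title": ["Dota 2", "Spore", "Dota 2", "Spore", "Dota 2"],
    "behavior-name": ["purchase", "purchase", "purchase", "play", "play"],
    "value": [1.0, 1.0, 1.0, 14.9, 3.0],
})

purchases = df.loc[df["behavior-name"] == "purchase"]

# Named aggregation: one output column per (input column, function) pair
summary = (
    purchases.groupby("game-title")
    .agg(total_value=("value", "sum"), n_rows=("value", "size"))
    .sort_values(by="total_value", ascending=False)
    .reset_index()
)

print(summary)
```

Because every purchase row carries the value 1.0, the sum and the row count coincide here; on other columns they would differ.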

Iterating over a DataFrame (if possible, use column operations instead)

for idx, row in steam_df.head(10).iterrows():
    print("[{}, {}, {}, {}]".format(idx, row['user-id'], row['game-title'], row['behavior-name']))

[0, 151603712, The Elder Scrolls V Skyrim, purchase]
[1, 151603712, The Elder Scrolls V Skyrim, play]
[2, 151603712, Fallout 4, purchase]
[3, 151603712, Fallout 4, play]
[4, 151603712, Spore, purchase]
[5, 151603712, Spore, play]
[6, 151603712, Fallout New Vegas, purchase]
[7, 151603712, Fallout New Vegas, play]
[8, 151603712, Left 4 Dead 2, purchase]
[9, 151603712, Left 4 Dead 2, play]
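When a loop over rows cannot be avoided, itertuples is generally a lighter alternative to iterrows. The sketch below uses a made-up frame whose column names are valid Python identifiers; this is an assumption, since itertuples renames columns such as user-id to positional names.

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 1, 2],
    "game_title": ["Spore", "Fallout 4", "Spore"],
})

lines = []
for row in df.head(2).itertuples(index=True):
    # Columns are available as attributes; Index holds the row label
    lines.append("[{}, {}, {}]".format(row.Index, row.user_id, row.game_title))

print("\n".join(lines))
```

With the Steam columns as named in this notebook, renaming them first (for example replacing hyphens with underscores) would be needed for attribute access.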

Pandas tasks - Steam dataset

Task 4. How many people made a purchase in the Steam dataset? Remember that a person could buy many games, but you need to count every person once.

# Write your code here

Task 5. How many people made a purchase of "The Elder Scrolls V Skyrim"?

# Write your code here

Task 6. How many purchases did people make on average?

# Write your code here

Task 7. Who bought the most games?

# Write your code here

Task 8. How many hours on average did people play "The Elder Scrolls V Skyrim"?

# Write your code here

Task 9. Which games were played the most (in terms of the number of hours played)? Print the first 10 titles and respective numbers of hours.

# Write your code here

Task 10. Which games are the most consistently played (in terms of the average number of hours played)? Print the first 10 titles and respective numbers of hours.

# Write your code here

Task 11**. Correct the result of Task 10 for the fact that zero hours played is never listed: in such a case only the purchase is recorded.

# Write your code here

Task 12. Apply the sigmoid function $$f(x) = \frac{1}{1 + e^{-\frac{1}{100}x}}$$ to hours played and print the first 10 rows from the entire Steam dataset after this change.

# Write your code here

Pandas tasks - MovieLens dataset

Task 13*. Calculate popularity (by the number of users who watched a movie) of all genres.

# Write your code here

Task 14*. Calculate the average rating for all genres.

# Write your code here

Task 15. Calculate each movie's rating bias (the deviation of its average rating from the mean of all movies' average ratings). Print the first 10 in the form: title, average rating, bias.

# Write your code here

Task 16. Calculate each user's rating bias (the deviation of their average rating from the mean of all users' average ratings). Print the first 10 in the form: user_id, average rating, bias.

# Write your code here

Task 17. Randomly choose 10 movies and 10 users and print their interaction matrix in the form of a DataFrame with user_id as index and movie titles as columns (use HTML Display for that). You can iterate over the DataFrame in this task.

# Write your code here

Pandas + numpy tasks

Task 18. Create the entire interaction matrix for the MovieLens dataset.

# Write your code here

Task 19. Calculate the matrix of size (n_users, n_users) where position (i, j) holds the number of movies watched by both user i and user j. Print the submatrix of the first 10 rows and 10 columns.

# Write your code here

Task 20. Calculate the matrix of size (n_items, n_items) where position (i, j) holds the number of users who watched both movie i and movie j. Print the submatrix of the first 10 rows and 10 columns.

# Write your code here

