Codes added

commit 67d6328405
Author: Robert
Date: 2020-05-21 13:42:50 +02:00
35 changed files with 913321 additions and 0 deletions

Datasets/ml-100k/README (new file, 157 lines)

@@ -0,0 +1,157 @@
SUMMARY & USAGE LICENSE
=============================================
MovieLens data sets were collected by the GroupLens Research Project
at the University of Minnesota.
This data set consists of:
* 100,000 ratings (1-5) from 943 users on 1682 movies.
* Each user has rated at least 20 movies.
* Simple demographic info for the users (age, gender, occupation, zip)
The data was collected through the MovieLens web site
(movielens.umn.edu) during the seven-month period from September 19th,
1997 through April 22nd, 1998. This data has been cleaned up - users
who had fewer than 20 ratings or did not have complete demographic
information were removed from this data set. Detailed descriptions of
the data files can be found at the end of this file.
Neither the University of Minnesota nor any of the researchers
involved can guarantee the correctness of the data, its suitability
for any particular purpose, or the validity of results based on the
use of the data set. The data set may be used for any research
purposes under the following conditions:
* The user may not state or imply any endorsement from the
University of Minnesota or the GroupLens Research Group.
* The user must acknowledge the use of the data set in
publications resulting from the use of the data set
(see below for citation information).
* The user may not redistribute the data without separate
permission.
* The user may not use this information for any commercial or
revenue-bearing purposes without first obtaining permission
from a faculty member of the GroupLens Research Project at the
University of Minnesota.
If you have any further questions or comments, please contact GroupLens
<grouplens-info@cs.umn.edu>.
CITATION
==============================================
To acknowledge use of the dataset in publications, please cite the
following paper:
F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets:
History and Context. ACM Transactions on Interactive Intelligent
Systems (TiiS) 5, 4, Article 19 (December 2015), 19 pages.
DOI=http://dx.doi.org/10.1145/2827872
ACKNOWLEDGEMENTS
==============================================
Thanks to Al Borchers for cleaning up this data and writing the
accompanying scripts.
PUBLISHED WORK THAT HAS USED THIS DATASET
==============================================
Herlocker, J., Konstan, J., Borchers, A., and Riedl, J. An Algorithmic
Framework for Performing Collaborative Filtering. Proceedings of the
1999 Conference on Research and Development in Information
Retrieval. Aug. 1999.
FURTHER INFORMATION ABOUT THE GROUPLENS RESEARCH PROJECT
==============================================
The GroupLens Research Project is a research group in the Department
of Computer Science and Engineering at the University of Minnesota.
Members of the GroupLens Research Project are involved in many
research projects related to the fields of information filtering,
collaborative filtering, and recommender systems. The project is led
by professors John Riedl and Joseph Konstan. The project began to
explore automated collaborative filtering in 1992, but is best known
for its worldwide trial of an automated collaborative filtering
system for Usenet news in 1996. The technology developed in the
Usenet trial formed the basis for Net Perceptions, Inc., which was
founded by members of GroupLens Research. Since then the project has
expanded its scope to research overall information filtering
solutions, integrating content-based methods as well as
improving current collaborative filtering technology.
Further information on the GroupLens Research project, including
research publications, can be found at the following web site:
http://www.grouplens.org/
GroupLens Research currently operates a movie recommender based on
collaborative filtering:
http://www.movielens.org/
DETAILED DESCRIPTIONS OF DATA FILES
==============================================
Here are brief descriptions of the data.
ml-data.tar.gz -- Compressed tar file. To rebuild the u data files do this:
gunzip ml-data.tar.gz
tar xvf ml-data.tar
mku.sh
u.data -- The full u data set, 100000 ratings by 943 users on 1682 items.
Each user has rated at least 20 movies. Users and items are
numbered consecutively from 1. The data is randomly
ordered. This is a tab separated list of
user id | item id | rating | timestamp.
The timestamps are unix seconds since 1/1/1970 UTC.
u.info -- The number of users, items, and ratings in the u data set.
u.item -- Information about the items (movies); this is a tab separated
list of
movie id | movie title | release date | video release date |
IMDb URL | unknown | Action | Adventure | Animation |
Children's | Comedy | Crime | Documentary | Drama | Fantasy |
Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi |
Thriller | War | Western |
The last 19 fields are the genres, a 1 indicates the movie
is of that genre, a 0 indicates it is not; movies can be in
several genres at once.
The movie ids are the ones used in the u.data data set.
u.genre -- A list of the genres.
u.user -- Demographic information about the users; this is a tab
separated list of
user id | age | gender | occupation | zip code
The user ids are the ones used in the u.data data set.
u.occupation -- A list of the occupations.
u1.base -- The data sets u1.base and u1.test through u5.base and u5.test
u1.test are 80%/20% splits of the u data into training and test data.
u2.base Each of u1, ..., u5 has a disjoint test set; this is for
u2.test 5-fold cross validation (where you repeat your experiment
u3.base with each training and test set and average the results).
u3.test These data sets can be generated from u.data by mku.sh.
u4.base
u4.test
u5.base
u5.test
ua.base -- The data sets ua.base, ua.test, ub.base, and ub.test
ua.test split the u data into a training set and a test set with
ub.base exactly 10 ratings per user in the test set. The sets
ub.test ua.test and ub.test are disjoint. These data sets can
be generated from u.data by mku.sh.
allbut.pl -- The script that generates training and test sets where
all but n of a user's ratings are in the training data.
mku.sh -- A shell script to generate all the u data sets from u.data.
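The tab-separated u.data layout described above loads directly with pandas. A minimal sketch; the two sample rows and the column names here are illustrative choices, not part of the dataset files:

```python
import io

import pandas as pd

# Two rows in u.data's layout: user id, item id, rating, unix timestamp,
# separated by tabs (sample values chosen for illustration).
sample = "196\t242\t3\t881250949\n186\t302\t3\t891717742\n"

ratings = pd.read_csv(
    io.StringIO(sample),  # in practice, pass the path to u.data instead
    sep="\t",
    names=["user_id", "item_id", "rating", "timestamp"],
)
# The timestamps are seconds since 1/1/1970 UTC.
ratings["timestamp"] = pd.to_datetime(ratings["timestamp"], unit="s")
```

The same pattern applies to u.item and u.user by switching `sep` to `"|"`.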

Datasets/ml-100k/allbut.pl (new file, 34 lines)

@@ -0,0 +1,34 @@
#!/usr/local/bin/perl
# get args
# all four positional arguments are required
if (@ARGV < 4) {
print STDERR "Usage: $0 base_name start stop max_test [ratings ...]\n";
exit 1;
}
$basename = shift;
$start = shift;
$stop = shift;
$maxtest = shift;
# open files
open( TESTFILE, ">$basename.test" ) or die "Cannot open $basename.test for writing\n";
open( BASEFILE, ">$basename.base" ) or die "Cannot open $basename.base for writing\n";
# init variables
$testcnt = 0;
while (<>) {
($user) = split;
if (! defined $ratingcnt{$user}) {
$ratingcnt{$user} = 0;
}
++$ratingcnt{$user};
if (($testcnt < $maxtest || $maxtest <= 0)
&& $ratingcnt{$user} >= $start && $ratingcnt{$user} <= $stop) {
++$testcnt;
print TESTFILE;
}
else {
print BASEFILE;
}
}
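The script's core rule, sending each user's start-th through stop-th ratings to the test set, can be mirrored in pandas. A sketch with a made-up two-user table, ignoring the max_test cap:

```python
import pandas as pd

# Tiny made-up ratings table standing in for u.data.
ratings = pd.DataFrame({
    "user_id": [1] * 12 + [2] * 12,
    "item_id": list(range(1, 13)) * 2,
    "rating":  [3] * 24,
})

# Each user's start..stop-th ratings (here 1..10, the values mku.sh uses
# for "ua") go to the test set; everything else goes to the base set.
start, stop = 1, 10
nth = ratings.groupby("user_id").cumcount() + 1  # 1-based count per user
test = ratings[(nth >= start) & (nth <= stop)]
base = ratings.drop(test.index)
```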

Datasets/ml-100k/mku.sh (new file, 25 lines)

@@ -0,0 +1,25 @@
#!/bin/sh
trap 'rm -f tmp.$$; exit 1' 1 2 15
# u.data is tab separated, so sort on a literal tab rather than a space
TAB=`printf '\t'`
for i in 1 2 3 4 5
do
head -`expr $i \* 20000` u.data | tail -20000 > tmp.$$
sort -t"$TAB" -k 1,1n -k 2,2n tmp.$$ > u$i.test
head -`expr \( $i - 1 \) \* 20000` u.data > tmp.$$
tail -`expr \( 5 - $i \) \* 20000` u.data >> tmp.$$
sort -t"$TAB" -k 1,1n -k 2,2n tmp.$$ > u$i.base
done
allbut.pl ua 1 10 100000 u.data
sort -t"$TAB" -k 1,1n -k 2,2n ua.base > tmp.$$
mv tmp.$$ ua.base
sort -t"$TAB" -k 1,1n -k 2,2n ua.test > tmp.$$
mv tmp.$$ ua.test
allbut.pl ub 11 20 100000 u.data
sort -t"$TAB" -k 1,1n -k 2,2n ub.base > tmp.$$
mv tmp.$$ ub.base
sort -t"$TAB" -k 1,1n -k 2,2n ub.test > tmp.$$
mv tmp.$$ ub.test
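In pandas terms, the loop in mku.sh slices the (randomly ordered) u.data into five consecutive 20% blocks, each serving once as a test set. A sketch with synthetic stand-in data:

```python
import numpy as np
import pandas as pd

# Made-up stand-in for u.data: 100 ratings in random order.
rng = np.random.default_rng(0)
ratings = pd.DataFrame({
    "user_id": rng.integers(1, 11, size=100),
    "item_id": rng.integers(1, 21, size=100),
    "rating":  rng.integers(1, 6, size=100),
})

# Fold i's test set is the i-th consecutive 20% block; the remaining
# 80% is its base (training) set.
fold = 3  # u3, for example
n = len(ratings) // 5
test = ratings.iloc[(fold - 1) * n : fold * n]
base = ratings.drop(test.index)
```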

Datasets/ml-100k/movies.csv (new file, 1683 lines)
File diff suppressed because it is too large.

Datasets/ml-100k/test.csv (new file, 20000 lines)
File diff suppressed because it is too large.

Datasets/ml-100k/train.csv (new file, 80000 lines)
File diff suppressed because it is too large.

Datasets/ml-100k/u.data (new file, 100000 lines)
File diff suppressed because it is too large.

Datasets/ml-100k/u.genre (new file, 20 lines)

@@ -0,0 +1,20 @@
unknown|0
Action|1
Adventure|2
Animation|3
Children's|4
Comedy|5
Crime|6
Documentary|7
Drama|8
Fantasy|9
Film-Noir|10
Horror|11
Musical|12
Mystery|13
Romance|14
Sci-Fi|15
Thriller|16
War|17
Western|18
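The index order above matches the last 19 fields of each u.item row, so the genre flags can be decoded positionally. A sketch using a hand-written row in that layout (the IMDb URL field is left empty here rather than invented):

```python
# Genre names in the index order given by u.genre.
genres = ["unknown", "Action", "Adventure", "Animation", "Children's",
          "Comedy", "Crime", "Documentary", "Drama", "Fantasy",
          "Film-Noir", "Horror", "Musical", "Mystery", "Romance",
          "Sci-Fi", "Thriller", "War", "Western"]

# A u.item-style row: id | title | release date | video release date |
# IMDb URL | 19 genre flags. The URL field is left empty in this example.
row = ("1|Toy Story (1995)|01-Jan-1995||"
       "|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0")
fields = row.split("|")
flags = fields[-19:]  # the 19 trailing 0/1 genre indicators
movie_genres = [g for g, f in zip(genres, flags) if f == "1"]
```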

Datasets/ml-100k/u.info (new file, 3 lines)

@@ -0,0 +1,3 @@
943 users
1682 items
100000 ratings

Datasets/ml-100k/u.item (new file, 1682 lines)
File diff suppressed because it is too large.

Datasets/ml-100k/u.occupation (new file, 21 lines)

@@ -0,0 +1,21 @@
administrator
artist
doctor
educator
engineer
entertainment
executive
healthcare
homemaker
lawyer
librarian
marketing
none
other
programmer
retired
salesman
scientist
student
technician
writer

Datasets/ml-100k/u.user (new file, 943 lines)

@@ -0,0 +1,943 @@
1|24|M|technician|85711
2|53|F|other|94043
3|23|M|writer|32067
4|24|M|technician|43537
5|33|F|other|15213
6|42|M|executive|98101
7|57|M|administrator|91344
8|36|M|administrator|05201
9|29|M|student|01002
10|53|M|lawyer|90703
11|39|F|other|30329
12|28|F|other|06405
13|47|M|educator|29206
14|45|M|scientist|55106
15|49|F|educator|97301
16|21|M|entertainment|10309
17|30|M|programmer|06355
18|35|F|other|37212
19|40|M|librarian|02138
20|42|F|homemaker|95660
21|26|M|writer|30068
22|25|M|writer|40206
23|30|F|artist|48197
24|21|F|artist|94533
25|39|M|engineer|55107
26|49|M|engineer|21044
27|40|F|librarian|30030
28|32|M|writer|55369
29|41|M|programmer|94043
30|7|M|student|55436
31|24|M|artist|10003
32|28|F|student|78741
33|23|M|student|27510
34|38|F|administrator|42141
35|20|F|homemaker|42459
36|19|F|student|93117
37|23|M|student|55105
38|28|F|other|54467
39|41|M|entertainment|01040
40|38|M|scientist|27514
41|33|M|engineer|80525
42|30|M|administrator|17870
43|29|F|librarian|20854
44|26|M|technician|46260
45|29|M|programmer|50233
46|27|F|marketing|46538
47|53|M|marketing|07102
48|45|M|administrator|12550
49|23|F|student|76111
50|21|M|writer|52245
51|28|M|educator|16509
52|18|F|student|55105
53|26|M|programmer|55414
54|22|M|executive|66315
55|37|M|programmer|01331
56|25|M|librarian|46260
57|16|M|none|84010
58|27|M|programmer|52246
59|49|M|educator|08403
60|50|M|healthcare|06472
61|36|M|engineer|30040
62|27|F|administrator|97214
63|31|M|marketing|75240
64|32|M|educator|43202
65|51|F|educator|48118
66|23|M|student|80521
67|17|M|student|60402
68|19|M|student|22904
69|24|M|engineer|55337
70|27|M|engineer|60067
71|39|M|scientist|98034
72|48|F|administrator|73034
73|24|M|student|41850
74|39|M|scientist|T8H1N
75|24|M|entertainment|08816
76|20|M|student|02215
77|30|M|technician|29379
78|26|M|administrator|61801
79|39|F|administrator|03755
80|34|F|administrator|52241
81|21|M|student|21218
82|50|M|programmer|22902
83|40|M|other|44133
84|32|M|executive|55369
85|51|M|educator|20003
86|26|M|administrator|46005
87|47|M|administrator|89503
88|49|F|librarian|11701
89|43|F|administrator|68106
90|60|M|educator|78155
91|55|M|marketing|01913
92|32|M|entertainment|80525
93|48|M|executive|23112
94|26|M|student|71457
95|31|M|administrator|10707
96|25|F|artist|75206
97|43|M|artist|98006
98|49|F|executive|90291
99|20|M|student|63129
100|36|M|executive|90254
101|15|M|student|05146
102|38|M|programmer|30220
103|26|M|student|55108
104|27|M|student|55108
105|24|M|engineer|94043
106|61|M|retired|55125
107|39|M|scientist|60466
108|44|M|educator|63130
109|29|M|other|55423
110|19|M|student|77840
111|57|M|engineer|90630
112|30|M|salesman|60613
113|47|M|executive|95032
114|27|M|programmer|75013
115|31|M|engineer|17110
116|40|M|healthcare|97232
117|20|M|student|16125
118|21|M|administrator|90210
119|32|M|programmer|67401
120|47|F|other|06260
121|54|M|librarian|99603
122|32|F|writer|22206
123|48|F|artist|20008
124|34|M|student|60615
125|30|M|lawyer|22202
126|28|F|lawyer|20015
127|33|M|none|73439
128|24|F|marketing|20009
129|36|F|marketing|07039
130|20|M|none|60115
131|59|F|administrator|15237
132|24|M|other|94612
133|53|M|engineer|78602
134|31|M|programmer|80236
135|23|M|student|38401
136|51|M|other|97365
137|50|M|educator|84408
138|46|M|doctor|53211
139|20|M|student|08904
140|30|F|student|32250
141|49|M|programmer|36117
142|13|M|other|48118
143|42|M|technician|08832
144|53|M|programmer|20910
145|31|M|entertainment|V3N4P
146|45|M|artist|83814
147|40|F|librarian|02143
148|33|M|engineer|97006
149|35|F|marketing|17325
150|20|F|artist|02139
151|38|F|administrator|48103
152|33|F|educator|68767
153|25|M|student|60641
154|25|M|student|53703
155|32|F|other|11217
156|25|M|educator|08360
157|57|M|engineer|70808
158|50|M|educator|27606
159|23|F|student|55346
160|27|M|programmer|66215
161|50|M|lawyer|55104
162|25|M|artist|15610
163|49|M|administrator|97212
164|47|M|healthcare|80123
165|20|F|other|53715
166|47|M|educator|55113
167|37|M|other|L9G2B
168|48|M|other|80127
169|52|F|other|53705
170|53|F|healthcare|30067
171|48|F|educator|78750
172|55|M|marketing|22207
173|56|M|other|22306
174|30|F|administrator|52302
175|26|F|scientist|21911
176|28|M|scientist|07030
177|20|M|programmer|19104
178|26|M|other|49512
179|15|M|entertainment|20755
180|22|F|administrator|60202
181|26|M|executive|21218
182|36|M|programmer|33884
183|33|M|scientist|27708
184|37|M|librarian|76013
185|53|F|librarian|97403
186|39|F|executive|00000
187|26|M|educator|16801
188|42|M|student|29440
189|32|M|artist|95014
190|30|M|administrator|95938
191|33|M|administrator|95161
192|42|M|educator|90840
193|29|M|student|49931
194|38|M|administrator|02154
195|42|M|scientist|93555
196|49|M|writer|55105
197|55|M|technician|75094
198|21|F|student|55414
199|30|M|writer|17604
200|40|M|programmer|93402
201|27|M|writer|E2A4H
202|41|F|educator|60201
203|25|F|student|32301
204|52|F|librarian|10960
205|47|M|lawyer|06371
206|14|F|student|53115
207|39|M|marketing|92037
208|43|M|engineer|01720
209|33|F|educator|85710
210|39|M|engineer|03060
211|66|M|salesman|32605
212|49|F|educator|61401
213|33|M|executive|55345
214|26|F|librarian|11231
215|35|M|programmer|63033
216|22|M|engineer|02215
217|22|M|other|11727
218|37|M|administrator|06513
219|32|M|programmer|43212
220|30|M|librarian|78205
221|19|M|student|20685
222|29|M|programmer|27502
223|19|F|student|47906
224|31|F|educator|43512
225|51|F|administrator|58202
226|28|M|student|92103
227|46|M|executive|60659
228|21|F|student|22003
229|29|F|librarian|22903
230|28|F|student|14476
231|48|M|librarian|01080
232|45|M|scientist|99709
233|38|M|engineer|98682
234|60|M|retired|94702
235|37|M|educator|22973
236|44|F|writer|53214
237|49|M|administrator|63146
238|42|F|administrator|44124
239|39|M|artist|95628
240|23|F|educator|20784
241|26|F|student|20001
242|33|M|educator|31404
243|33|M|educator|60201
244|28|M|technician|80525
245|22|M|student|55109
246|19|M|student|28734
247|28|M|engineer|20770
248|25|M|student|37235
249|25|M|student|84103
250|29|M|executive|95110
251|28|M|doctor|85032
252|42|M|engineer|07733
253|26|F|librarian|22903
254|44|M|educator|42647
255|23|M|entertainment|07029
256|35|F|none|39042
257|17|M|student|77005
258|19|F|student|77801
259|21|M|student|48823
260|40|F|artist|89801
261|28|M|administrator|85202
262|19|F|student|78264
263|41|M|programmer|55346
264|36|F|writer|90064
265|26|M|executive|84601
266|62|F|administrator|78756
267|23|M|engineer|83716
268|24|M|engineer|19422
269|31|F|librarian|43201
270|18|F|student|63119
271|51|M|engineer|22932
272|33|M|scientist|53706
273|50|F|other|10016
274|20|F|student|55414
275|38|M|engineer|92064
276|21|M|student|95064
277|35|F|administrator|55406
278|37|F|librarian|30033
279|33|M|programmer|85251
280|30|F|librarian|22903
281|15|F|student|06059
282|22|M|administrator|20057
283|28|M|programmer|55305
284|40|M|executive|92629
285|25|M|programmer|53713
286|27|M|student|15217
287|21|M|salesman|31211
288|34|M|marketing|23226
289|11|M|none|94619
290|40|M|engineer|93550
291|19|M|student|44106
292|35|F|programmer|94703
293|24|M|writer|60804
294|34|M|technician|92110
295|31|M|educator|50325
296|43|F|administrator|16803
297|29|F|educator|98103
298|44|M|executive|01581
299|29|M|doctor|63108
300|26|F|programmer|55106
301|24|M|student|55439
302|42|M|educator|77904
303|19|M|student|14853
304|22|F|student|71701
305|23|M|programmer|94086
306|45|M|other|73132
307|25|M|student|55454
308|60|M|retired|95076
309|40|M|scientist|70802
310|37|M|educator|91711
311|32|M|technician|73071
312|48|M|other|02110
313|41|M|marketing|60035
314|20|F|student|08043
315|31|M|educator|18301
316|43|F|other|77009
317|22|M|administrator|13210
318|65|M|retired|06518
319|38|M|programmer|22030
320|19|M|student|24060
321|49|F|educator|55413
322|20|M|student|50613
323|21|M|student|19149
324|21|F|student|02176
325|48|M|technician|02139
326|41|M|administrator|15235
327|22|M|student|11101
328|51|M|administrator|06779
329|48|M|educator|01720
330|35|F|educator|33884
331|33|M|entertainment|91344
332|20|M|student|40504
333|47|M|other|V0R2M
334|32|M|librarian|30002
335|45|M|executive|33775
336|23|M|salesman|42101
337|37|M|scientist|10522
338|39|F|librarian|59717
339|35|M|lawyer|37901
340|46|M|engineer|80123
341|17|F|student|44405
342|25|F|other|98006
343|43|M|engineer|30093
344|30|F|librarian|94117
345|28|F|librarian|94143
346|34|M|other|76059
347|18|M|student|90210
348|24|F|student|45660
349|68|M|retired|61455
350|32|M|student|97301
351|61|M|educator|49938
352|37|F|programmer|55105
353|25|M|scientist|28480
354|29|F|librarian|48197
355|25|M|student|60135
356|32|F|homemaker|92688
357|26|M|executive|98133
358|40|M|educator|10022
359|22|M|student|61801
360|51|M|other|98027
361|22|M|student|44074
362|35|F|homemaker|85233
363|20|M|student|87501
364|63|M|engineer|01810
365|29|M|lawyer|20009
366|20|F|student|50670
367|17|M|student|37411
368|18|M|student|92113
369|24|M|student|91335
370|52|M|writer|08534
371|36|M|engineer|99206
372|25|F|student|66046
373|24|F|other|55116
374|36|M|executive|78746
375|17|M|entertainment|37777
376|28|F|other|10010
377|22|M|student|18015
378|35|M|student|02859
379|44|M|programmer|98117
380|32|M|engineer|55117
381|33|M|artist|94608
382|45|M|engineer|01824
383|42|M|administrator|75204
384|52|M|programmer|45218
385|36|M|writer|10003
386|36|M|salesman|43221
387|33|M|entertainment|37412
388|31|M|other|36106
389|44|F|writer|83702
390|42|F|writer|85016
391|23|M|student|84604
392|52|M|writer|59801
393|19|M|student|83686
394|25|M|administrator|96819
395|43|M|other|44092
396|57|M|engineer|94551
397|17|M|student|27514
398|40|M|other|60008
399|25|M|other|92374
400|33|F|administrator|78213
401|46|F|healthcare|84107
402|30|M|engineer|95129
403|37|M|other|06811
404|29|F|programmer|55108
405|22|F|healthcare|10019
406|52|M|educator|93109
407|29|M|engineer|03261
408|23|M|student|61755
409|48|M|administrator|98225
410|30|F|artist|94025
411|34|M|educator|44691
412|25|M|educator|15222
413|55|M|educator|78212
414|24|M|programmer|38115
415|39|M|educator|85711
416|20|F|student|92626
417|27|F|other|48103
418|55|F|none|21206
419|37|M|lawyer|43215
420|53|M|educator|02140
421|38|F|programmer|55105
422|26|M|entertainment|94533
423|64|M|other|91606
424|36|F|marketing|55422
425|19|M|student|58644
426|55|M|educator|01602
427|51|M|doctor|85258
428|28|M|student|55414
429|27|M|student|29205
430|38|M|scientist|98199
431|24|M|marketing|92629
432|22|M|entertainment|50311
433|27|M|artist|11211
434|16|F|student|49705
435|24|M|engineer|60007
436|30|F|administrator|17345
437|27|F|other|20009
438|51|F|administrator|43204
439|23|F|administrator|20817
440|30|M|other|48076
441|50|M|technician|55013
442|22|M|student|85282
443|35|M|salesman|33308
444|51|F|lawyer|53202
445|21|M|writer|92653
446|57|M|educator|60201
447|30|M|administrator|55113
448|23|M|entertainment|10021
449|23|M|librarian|55021
450|35|F|educator|11758
451|16|M|student|48446
452|35|M|administrator|28018
453|18|M|student|06333
454|57|M|other|97330
455|48|M|administrator|83709
456|24|M|technician|31820
457|33|F|salesman|30011
458|47|M|technician|Y1A6B
459|22|M|student|29201
460|44|F|other|60630
461|15|M|student|98102
462|19|F|student|02918
463|48|F|healthcare|75218
464|60|M|writer|94583
465|32|M|other|05001
466|22|M|student|90804
467|29|M|engineer|91201
468|28|M|engineer|02341
469|60|M|educator|78628
470|24|M|programmer|10021
471|10|M|student|77459
472|24|M|student|87544
473|29|M|student|94708
474|51|M|executive|93711
475|30|M|programmer|75230
476|28|M|student|60440
477|23|F|student|02125
478|29|M|other|10019
479|30|M|educator|55409
480|57|M|retired|98257
481|73|M|retired|37771
482|18|F|student|40256
483|29|M|scientist|43212
484|27|M|student|21208
485|44|F|educator|95821
486|39|M|educator|93101
487|22|M|engineer|92121
488|48|M|technician|21012
489|55|M|other|45218
490|29|F|artist|V5A2B
491|43|F|writer|53711
492|57|M|educator|94618
493|22|M|engineer|60090
494|38|F|administrator|49428
495|29|M|engineer|03052
496|21|F|student|55414
497|20|M|student|50112
498|26|M|writer|55408
499|42|M|programmer|75006
500|28|M|administrator|94305
501|22|M|student|10025
502|22|M|student|23092
503|50|F|writer|27514
504|40|F|writer|92115
505|27|F|other|20657
506|46|M|programmer|03869
507|18|F|writer|28450
508|27|M|marketing|19382
509|23|M|administrator|10011
510|34|M|other|98038
511|22|M|student|21250
512|29|M|other|20090
513|43|M|administrator|26241
514|27|M|programmer|20707
515|53|M|marketing|49508
516|53|F|librarian|10021
517|24|M|student|55454
518|49|F|writer|99709
519|22|M|other|55320
520|62|M|healthcare|12603
521|19|M|student|02146
522|36|M|engineer|55443
523|50|F|administrator|04102
524|56|M|educator|02159
525|27|F|administrator|19711
526|30|M|marketing|97124
527|33|M|librarian|12180
528|18|M|student|55104
529|47|F|administrator|44224
530|29|M|engineer|94040
531|30|F|salesman|97408
532|20|M|student|92705
533|43|M|librarian|02324
534|20|M|student|05464
535|45|F|educator|80302
536|38|M|engineer|30078
537|36|M|engineer|22902
538|31|M|scientist|21010
539|53|F|administrator|80303
540|28|M|engineer|91201
541|19|F|student|84302
542|21|M|student|60515
543|33|M|scientist|95123
544|44|F|other|29464
545|27|M|technician|08052
546|36|M|executive|22911
547|50|M|educator|14534
548|51|M|writer|95468
549|42|M|scientist|45680
550|16|F|student|95453
551|25|M|programmer|55414
552|45|M|other|68147
553|58|M|educator|62901
554|32|M|scientist|62901
555|29|F|educator|23227
556|35|F|educator|30606
557|30|F|writer|11217
558|56|F|writer|63132
559|69|M|executive|10022
560|32|M|student|10003
561|23|M|engineer|60005
562|54|F|administrator|20879
563|39|F|librarian|32707
564|65|M|retired|94591
565|40|M|student|55422
566|20|M|student|14627
567|24|M|entertainment|10003
568|39|M|educator|01915
569|34|M|educator|91903
570|26|M|educator|14627
571|34|M|artist|01945
572|51|M|educator|20003
573|68|M|retired|48911
574|56|M|educator|53188
575|33|M|marketing|46032
576|48|M|executive|98281
577|36|F|student|77845
578|31|M|administrator|M7A1A
579|32|M|educator|48103
580|16|M|student|17961
581|37|M|other|94131
582|17|M|student|93003
583|44|M|engineer|29631
584|25|M|student|27511
585|69|M|librarian|98501
586|20|M|student|79508
587|26|M|other|14216
588|18|F|student|93063
589|21|M|lawyer|90034
590|50|M|educator|82435
591|57|F|librarian|92093
592|18|M|student|97520
593|31|F|educator|68767
594|46|M|educator|M4J2K
595|25|M|programmer|31909
596|20|M|artist|77073
597|23|M|other|84116
598|40|F|marketing|43085
599|22|F|student|R3T5K
600|34|M|programmer|02320
601|19|F|artist|99687
602|47|F|other|34656
603|21|M|programmer|47905
604|39|M|educator|11787
605|33|M|engineer|33716
606|28|M|programmer|63044
607|49|F|healthcare|02154
608|22|M|other|10003
609|13|F|student|55106
610|22|M|student|21227
611|46|M|librarian|77008
612|36|M|educator|79070
613|37|F|marketing|29678
614|54|M|educator|80227
615|38|M|educator|27705
616|55|M|scientist|50613
617|27|F|writer|11201
618|15|F|student|44212
619|17|M|student|44134
620|18|F|writer|81648
621|17|M|student|60402
622|25|M|programmer|14850
623|50|F|educator|60187
624|19|M|student|30067
625|27|M|programmer|20723
626|23|M|scientist|19807
627|24|M|engineer|08034
628|13|M|none|94306
629|46|F|other|44224
630|26|F|healthcare|55408
631|18|F|student|38866
632|18|M|student|55454
633|35|M|programmer|55414
634|39|M|engineer|T8H1N
635|22|M|other|23237
636|47|M|educator|48043
637|30|M|other|74101
638|45|M|engineer|01940
639|42|F|librarian|12065
640|20|M|student|61801
641|24|M|student|60626
642|18|F|student|95521
643|39|M|scientist|55122
644|51|M|retired|63645
645|27|M|programmer|53211
646|17|F|student|51250
647|40|M|educator|45810
648|43|M|engineer|91351
649|20|M|student|39762
650|42|M|engineer|83814
651|65|M|retired|02903
652|35|M|other|22911
653|31|M|executive|55105
654|27|F|student|78739
655|50|F|healthcare|60657
656|48|M|educator|10314
657|26|F|none|78704
658|33|M|programmer|92626
659|31|M|educator|54248
660|26|M|student|77380
661|28|M|programmer|98121
662|55|M|librarian|19102
663|26|M|other|19341
664|30|M|engineer|94115
665|25|M|administrator|55412
666|44|M|administrator|61820
667|35|M|librarian|01970
668|29|F|writer|10016
669|37|M|other|20009
670|30|M|technician|21114
671|21|M|programmer|91919
672|54|F|administrator|90095
673|51|M|educator|22906
674|13|F|student|55337
675|34|M|other|28814
676|30|M|programmer|32712
677|20|M|other|99835
678|50|M|educator|61462
679|20|F|student|54302
680|33|M|lawyer|90405
681|44|F|marketing|97208
682|23|M|programmer|55128
683|42|M|librarian|23509
684|28|M|student|55414
685|32|F|librarian|55409
686|32|M|educator|26506
687|31|F|healthcare|27713
688|37|F|administrator|60476
689|25|M|other|45439
690|35|M|salesman|63304
691|34|M|educator|60089
692|34|M|engineer|18053
693|43|F|healthcare|85210
694|60|M|programmer|06365
695|26|M|writer|38115
696|55|M|other|94920
697|25|M|other|77042
698|28|F|programmer|06906
699|44|M|other|96754
700|17|M|student|76309
701|51|F|librarian|56321
702|37|M|other|89104
703|26|M|educator|49512
704|51|F|librarian|91105
705|21|F|student|54494
706|23|M|student|55454
707|56|F|librarian|19146
708|26|F|homemaker|96349
709|21|M|other|N4T1A
710|19|M|student|92020
711|22|F|student|15203
712|22|F|student|54901
713|42|F|other|07204
714|26|M|engineer|55343
715|21|M|technician|91206
716|36|F|administrator|44265
717|24|M|technician|84105
718|42|M|technician|64118
719|37|F|other|V0R2H
720|49|F|administrator|16506
721|24|F|entertainment|11238
722|50|F|homemaker|17331
723|26|M|executive|94403
724|31|M|executive|40243
725|21|M|student|91711
726|25|F|administrator|80538
727|25|M|student|78741
728|58|M|executive|94306
729|19|M|student|56567
730|31|F|scientist|32114
731|41|F|educator|70403
732|28|F|other|98405
733|44|F|other|60630
734|25|F|other|63108
735|29|F|healthcare|85719
736|48|F|writer|94618
737|30|M|programmer|98072
738|35|M|technician|95403
739|35|M|technician|73162
740|25|F|educator|22206
741|25|M|writer|63108
742|35|M|student|29210
743|31|M|programmer|92660
744|35|M|marketing|47024
745|42|M|writer|55113
746|25|M|engineer|19047
747|19|M|other|93612
748|28|M|administrator|94720
749|33|M|other|80919
750|28|M|administrator|32303
751|24|F|other|90034
752|60|M|retired|21201
753|56|M|salesman|91206
754|59|F|librarian|62901
755|44|F|educator|97007
756|30|F|none|90247
757|26|M|student|55104
758|27|M|student|53706
759|20|F|student|68503
760|35|F|other|14211
761|17|M|student|97302
762|32|M|administrator|95050
763|27|M|scientist|02113
764|27|F|educator|62903
765|31|M|student|33066
766|42|M|other|10960
767|70|M|engineer|00000
768|29|M|administrator|12866
769|39|M|executive|06927
770|28|M|student|14216
771|26|M|student|15232
772|50|M|writer|27105
773|20|M|student|55414
774|30|M|student|80027
775|46|M|executive|90036
776|30|M|librarian|51157
777|63|M|programmer|01810
778|34|M|student|01960
779|31|M|student|K7L5J
780|49|M|programmer|94560
781|20|M|student|48825
782|21|F|artist|33205
783|30|M|marketing|77081
784|47|M|administrator|91040
785|32|M|engineer|23322
786|36|F|engineer|01754
787|18|F|student|98620
788|51|M|administrator|05779
789|29|M|other|55420
790|27|M|technician|80913
791|31|M|educator|20064
792|40|M|programmer|12205
793|22|M|student|85281
794|32|M|educator|57197
795|30|M|programmer|08610
796|32|F|writer|33755
797|44|F|other|62522
798|40|F|writer|64131
799|49|F|administrator|19716
800|25|M|programmer|55337
801|22|M|writer|92154
802|35|M|administrator|34105
803|70|M|administrator|78212
804|39|M|educator|61820
805|27|F|other|20009
806|27|M|marketing|11217
807|41|F|healthcare|93555
808|45|M|salesman|90016
809|50|F|marketing|30803
810|55|F|other|80526
811|40|F|educator|73013
812|22|M|technician|76234
813|14|F|student|02136
814|30|M|other|12345
815|32|M|other|28806
816|34|M|other|20755
817|19|M|student|60152
818|28|M|librarian|27514
819|59|M|administrator|40205
820|22|M|student|37725
821|37|M|engineer|77845
822|29|F|librarian|53144
823|27|M|artist|50322
824|31|M|other|15017
825|44|M|engineer|05452
826|28|M|artist|77048
827|23|F|engineer|80228
828|28|M|librarian|85282
829|48|M|writer|80209
830|46|M|programmer|53066
831|21|M|other|33765
832|24|M|technician|77042
833|34|M|writer|90019
834|26|M|other|64153
835|44|F|executive|11577
836|44|M|artist|10018
837|36|F|artist|55409
838|23|M|student|01375
839|38|F|entertainment|90814
840|39|M|artist|55406
841|45|M|doctor|47401
842|40|M|writer|93055
843|35|M|librarian|44212
844|22|M|engineer|95662
845|64|M|doctor|97405
846|27|M|lawyer|47130
847|29|M|student|55417
848|46|M|engineer|02146
849|15|F|student|25652
850|34|M|technician|78390
851|18|M|other|29646
852|46|M|administrator|94086
853|49|M|writer|40515
854|29|F|student|55408
855|53|M|librarian|04988
856|43|F|marketing|97215
857|35|F|administrator|V1G4L
858|63|M|educator|09645
859|18|F|other|06492
860|70|F|retired|48322
861|38|F|student|14085
862|25|M|executive|13820
863|17|M|student|60089
864|27|M|programmer|63021
865|25|M|artist|11231
866|45|M|other|60302
867|24|M|scientist|92507
868|21|M|programmer|55303
869|30|M|student|10025
870|22|M|student|65203
871|31|M|executive|44648
872|19|F|student|74078
873|48|F|administrator|33763
874|36|M|scientist|37076
875|24|F|student|35802
876|41|M|other|20902
877|30|M|other|77504
878|50|F|educator|98027
879|33|F|administrator|55337
880|13|M|student|83702
881|39|M|marketing|43017
882|35|M|engineer|40503
883|49|M|librarian|50266
884|44|M|engineer|55337
885|30|F|other|95316
886|20|M|student|61820
887|14|F|student|27249
888|41|M|scientist|17036
889|24|M|technician|78704
890|32|M|student|97301
891|51|F|administrator|03062
892|36|M|other|45243
893|25|M|student|95823
894|47|M|educator|74075
895|31|F|librarian|32301
896|28|M|writer|91505
897|30|M|other|33484
898|23|M|homemaker|61755
899|32|M|other|55116
900|60|M|retired|18505
901|38|M|executive|L1V3W
902|45|F|artist|97203
903|28|M|educator|20850
904|17|F|student|61073
905|27|M|other|30350
906|45|M|librarian|70124
907|25|F|other|80526
908|44|F|librarian|68504
909|50|F|educator|53171
910|28|M|healthcare|29301
911|37|F|writer|53210
912|51|M|other|06512
913|27|M|student|76201
914|44|F|other|08105
915|50|M|entertainment|60614
916|27|M|engineer|N2L5N
917|22|F|student|20006
918|40|M|scientist|70116
919|25|M|other|14216
920|30|F|artist|90008
921|20|F|student|98801
922|29|F|administrator|21114
923|21|M|student|E2E3R
924|29|M|other|11753
925|18|F|salesman|49036
926|49|M|entertainment|01701
927|23|M|programmer|55428
928|21|M|student|55408
929|44|M|scientist|53711
930|28|F|scientist|07310
931|60|M|educator|33556
932|58|M|educator|06437
933|28|M|student|48105
934|61|M|engineer|22902
935|42|M|doctor|66221
936|24|M|other|32789
937|48|M|educator|98072
938|38|F|technician|55038
939|26|F|student|33319
940|32|M|administrator|02215
941|20|M|student|97229
942|48|F|librarian|78209
943|22|M|student|77841
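Each u.user row above is pipe-separated: user id | age | gender | occupation | zip code. A minimal parsing sketch for the first row; note that zip codes must stay strings, since some keep leading zeros (e.g. 05201) and some are Canadian postal prefixes (e.g. T8H1N):

```python
# Parse one u.user row into a dict (row taken from the data above).
line = "1|24|M|technician|85711"
user_id, age, gender, occupation, zip_code = line.split("|")
user = {
    "user_id": int(user_id),
    "age": int(age),
    "gender": gender,
    "occupation": occupation,
    "zip": zip_code,  # kept as a string, never converted to int
}
```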

80000
Datasets/ml-100k/u1.base Normal file

File diff suppressed because it is too large Load Diff

20000
Datasets/ml-100k/u1.test Normal file

File diff suppressed because it is too large Load Diff

80000
Datasets/ml-100k/u2.base Normal file

File diff suppressed because it is too large Load Diff

20000
Datasets/ml-100k/u2.test Normal file

File diff suppressed because it is too large Load Diff

80000
Datasets/ml-100k/u3.base Normal file

File diff suppressed because it is too large

20000
Datasets/ml-100k/u3.test Normal file

File diff suppressed because it is too large

80000
Datasets/ml-100k/u4.base Normal file

File diff suppressed because it is too large

20000
Datasets/ml-100k/u4.test Normal file

File diff suppressed because it is too large

80000
Datasets/ml-100k/u5.base Normal file

File diff suppressed because it is too large

20000
Datasets/ml-100k/u5.test Normal file

File diff suppressed because it is too large

90570
Datasets/ml-100k/ua.base Normal file

File diff suppressed because it is too large

9430
Datasets/ml-100k/ua.test Normal file

File diff suppressed because it is too large

90570
Datasets/ml-100k/ub.base Normal file

File diff suppressed because it is too large

9430
Datasets/ml-100k/ub.test Normal file

File diff suppressed because it is too large

725
P0. Data preparation.ipynb Normal file

File diff suppressed because one or more lines are too long

1256
P1. Baseline.ipynb Normal file

File diff suppressed because it is too large

2414
P2. Evaluation.ipynb Normal file

File diff suppressed because it is too large


@ -0,0 +1,957 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Self-made simplified I-KNN"
]
},
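  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For reference, the class below uses cosine similarity between item rating vectors (the notation here is ours, added for clarity):\n",
    "$$\\mathrm{sim}(i,j)=\\frac{\\sum_{u} r_{ui}\\, r_{uj}}{\\lVert r_{\\cdot i}\\rVert\\, \\lVert r_{\\cdot j}\\rVert}$$\n",
    "and estimates a rating as a similarity-weighted average over the items the user has rated:\n",
    "$$\\hat{r}_{ui}=\\frac{\\sum_{j:\\ r_{uj}>0} \\mathrm{sim}(i,j)\\, r_{uj}}{\\sum_{j:\\ r_{uj}>0} \\mathrm{sim}(i,j)}$$"
   ]
  },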
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"import helpers\n",
"import pandas as pd\n",
"import numpy as np\n",
"import scipy.sparse as sparse\n",
"from collections import defaultdict\n",
"from itertools import chain\n",
"import random\n",
"\n",
"train_read=pd.read_csv('./Datasets/ml-100k/train.csv', sep='\\t', header=None)\n",
"test_read=pd.read_csv('./Datasets/ml-100k/test.csv', sep='\\t', header=None)\n",
"train_ui, test_ui, user_code_id, user_id_code, item_code_id, item_id_code = helpers.data_to_csr(train_read, test_read)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"class IKNN():\n",
" \n",
" def fit(self, train_ui):\n",
" self.train_ui=train_ui\n",
" \n",
" train_iu=train_ui.transpose()\n",
    "        norms=np.linalg.norm(train_iu.A, axis=1) # here we compute the length of each item's ratings vector\n",
" norms=np.vectorize(lambda x: max(x,1))(norms[:,None]) # to avoid dividing by zero\n",
"\n",
" normalized_train_iu=sparse.csr_matrix(train_iu/norms)\n",
"\n",
" self.similarity_matrix_ii=normalized_train_iu*normalized_train_iu.transpose()\n",
" \n",
    "        self.estimations=np.array(train_ui*self.similarity_matrix_ii/((train_ui>0)*self.similarity_matrix_ii)) # similarity-weighted average of the user's ratings\n",
" \n",
" def recommend(self, user_code_id, item_code_id, topK=10):\n",
" \n",
" top_k = defaultdict(list)\n",
" for nb_user, user in enumerate(self.estimations):\n",
" \n",
" user_rated=self.train_ui.indices[self.train_ui.indptr[nb_user]:self.train_ui.indptr[nb_user+1]]\n",
" for item, score in enumerate(user):\n",
" if item not in user_rated and not np.isnan(score):\n",
" top_k[user_code_id[nb_user]].append((item_code_id[item], score))\n",
" result=[]\n",
" # Let's choose k best items in the format: (user, item1, score1, item2, score2, ...)\n",
" for uid, item_scores in top_k.items():\n",
" item_scores.sort(key=lambda x: x[1], reverse=True)\n",
" result.append([uid]+list(chain(*item_scores[:topK])))\n",
" return result\n",
" \n",
" def estimate(self, user_code_id, item_code_id, test_ui):\n",
" result=[]\n",
" for user, item in zip(*test_ui.nonzero()):\n",
" result.append([user_code_id[user], item_code_id[item], \n",
" self.estimations[user,item] if not np.isnan(self.estimations[user,item]) else 1])\n",
" return result"
]
},
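  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal in-memory sanity check of the class above (the matrix below is made up for illustration): the similarity matrix should come out symmetric with ones on the diagonal."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# hypothetical 3 users x 4 items rating matrix, just to exercise fit()\n",
    "tiny_ui=sparse.csr_matrix(np.array([[5, 3, 0, 1],\n",
    "                                    [4, 0, 0, 1],\n",
    "                                    [0, 0, 5, 4]]))\n",
    "tiny_model=IKNN()\n",
    "tiny_model.fit(tiny_ui)\n",
    "tiny_model.similarity_matrix_ii.A"
   ]
  },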
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"toy train ui:\n"
]
},
{
"data": {
"text/plain": [
"array([[3, 4, 0, 0, 5, 0, 0, 4],\n",
" [0, 1, 2, 3, 0, 0, 0, 0],\n",
" [0, 0, 0, 5, 0, 3, 4, 0]], dtype=int64)"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"similarity matrix:\n"
]
},
{
"data": {
"text/plain": [
"array([[1. , 0.9701425 , 0. , 0. , 1. ,\n",
" 0. , 0. , 1. ],\n",
" [0.9701425 , 1. , 0.24253563, 0.12478355, 0.9701425 ,\n",
" 0. , 0. , 0.9701425 ],\n",
" [0. , 0.24253563, 1. , 0.51449576, 0. ,\n",
" 0. , 0. , 0. ],\n",
" [0. , 0.12478355, 0.51449576, 1. , 0. ,\n",
" 0.85749293, 0.85749293, 0. ],\n",
" [1. , 0.9701425 , 0. , 0. , 1. ,\n",
" 0. , 0. , 1. ],\n",
" [0. , 0. , 0. , 0.85749293, 0. ,\n",
" 1. , 1. , 0. ],\n",
" [0. , 0. , 0. , 0.85749293, 0. ,\n",
" 1. , 1. , 0. ],\n",
" [1. , 0.9701425 , 0. , 0. , 1. ,\n",
" 0. , 0. , 1. ]])"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"estimations matrix:\n"
]
},
{
"data": {
"text/plain": [
"array([[4. , 4. , 4. , 4. , 4. ,\n",
" nan, nan, 4. ],\n",
" [1. , 1.35990333, 2.15478388, 2.53390319, 1. ,\n",
" 3. , 3. , 1. ],\n",
" [ nan, 5. , 5. , 4.05248907, nan,\n",
" 3.95012863, 3.95012863, nan]])"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"[[0, 20, 4.0, 30, 4.0],\n",
" [10, 50, 3.0, 60, 3.0, 0, 1.0, 40, 1.0, 70, 1.0],\n",
" [20, 10, 5.0, 20, 5.0]]"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# toy example\n",
"toy_train_read=pd.read_csv('./Datasets/toy-example/train.csv', sep='\\t', header=None, names=['user', 'item', 'rating', 'timestamp'])\n",
"toy_test_read=pd.read_csv('./Datasets/toy-example/test.csv', sep='\\t', header=None, names=['user', 'item', 'rating', 'timestamp'])\n",
"\n",
"toy_train_ui, toy_test_ui, toy_user_code_id, toy_user_id_code, \\\n",
"toy_item_code_id, toy_item_id_code = helpers.data_to_csr(toy_train_read, toy_test_read)\n",
"\n",
"\n",
"model=IKNN()\n",
"model.fit(toy_train_ui)\n",
"\n",
"print('toy train ui:')\n",
"display(toy_train_ui.A)\n",
"\n",
"print('similarity matrix:')\n",
"display(model.similarity_matrix_ii.A)\n",
"\n",
"print('estimations matrix:')\n",
"display(model.estimations)\n",
"\n",
"model.recommend(toy_user_code_id, toy_item_code_id)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"model=IKNN()\n",
"model.fit(train_ui)\n",
"\n",
"top_n=pd.DataFrame(model.recommend(user_code_id, item_code_id, topK=10))\n",
"\n",
"top_n.to_csv('Recommendations generated/ml-100k/Self_IKNN_reco.csv', index=False, header=False)\n",
"\n",
"estimations=pd.DataFrame(model.estimate(user_code_id, item_code_id, test_ui))\n",
"estimations.to_csv('Recommendations generated/ml-100k/Self_IKNN_estimations.csv', index=False, header=False)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"943it [00:00, 8845.73it/s]\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>RMSE</th>\n",
" <th>MAE</th>\n",
" <th>precision</th>\n",
" <th>recall</th>\n",
" <th>F_1</th>\n",
" <th>F_05</th>\n",
" <th>precision_super</th>\n",
" <th>recall_super</th>\n",
" <th>NDCG</th>\n",
" <th>mAP</th>\n",
" <th>MRR</th>\n",
" <th>LAUC</th>\n",
" <th>HR</th>\n",
" <th>Reco in test</th>\n",
" <th>Test coverage</th>\n",
" <th>Shannon</th>\n",
" <th>Gini</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1.018363</td>\n",
" <td>0.808793</td>\n",
" <td>0.000318</td>\n",
" <td>0.000108</td>\n",
" <td>0.00014</td>\n",
" <td>0.000189</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000214</td>\n",
" <td>0.000037</td>\n",
" <td>0.000368</td>\n",
" <td>0.496391</td>\n",
" <td>0.003181</td>\n",
" <td>0.392153</td>\n",
" <td>0.11544</td>\n",
" <td>4.174741</td>\n",
" <td>0.965327</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" RMSE MAE precision recall F_1 F_05 \\\n",
"0 1.018363 0.808793 0.000318 0.000108 0.00014 0.000189 \n",
"\n",
" precision_super recall_super NDCG mAP MRR LAUC \\\n",
"0 0.0 0.0 0.000214 0.000037 0.000368 0.496391 \n",
"\n",
" HR Reco in test Test coverage Shannon Gini \n",
"0 0.003181 0.392153 0.11544 4.174741 0.965327 "
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import evaluation_measures as ev\n",
"estimations_df=pd.read_csv('Recommendations generated/ml-100k/Self_IKNN_estimations.csv', header=None)\n",
"reco=np.loadtxt('Recommendations generated/ml-100k/Self_IKNN_reco.csv', delimiter=',')\n",
"\n",
"ev.evaluate(test=pd.read_csv('./Datasets/ml-100k/test.csv', sep='\\t', header=None),\n",
" estimations_df=estimations_df, \n",
" reco=reco,\n",
" super_reactions=[4,5])"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"943it [00:00, 7423.18it/s]\n",
"943it [00:00, 7890.87it/s]\n",
"943it [00:00, 7370.82it/s]\n",
"943it [00:00, 8035.93it/s]\n",
"943it [00:00, 8071.70it/s]\n",
"943it [00:00, 7893.80it/s]\n",
"943it [00:00, 8159.55it/s]\n",
"943it [00:00, 7982.77it/s]\n",
"943it [00:00, 7514.53it/s]\n",
"943it [00:00, 8047.34it/s]\n",
"943it [00:00, 7874.80it/s]\n",
"943it [00:00, 7657.62it/s]\n",
"943it [00:00, 8281.73it/s]\n",
"943it [00:00, 8253.33it/s]\n",
"943it [00:00, 8332.31it/s]\n",
"943it [00:00, 8348.73it/s]\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Model</th>\n",
" <th>RMSE</th>\n",
" <th>MAE</th>\n",
" <th>precision</th>\n",
" <th>recall</th>\n",
" <th>F_1</th>\n",
" <th>F_05</th>\n",
" <th>precision_super</th>\n",
" <th>recall_super</th>\n",
" <th>NDCG</th>\n",
" <th>mAP</th>\n",
" <th>MRR</th>\n",
" <th>LAUC</th>\n",
" <th>HR</th>\n",
" <th>Reco in test</th>\n",
" <th>Test coverage</th>\n",
" <th>Shannon</th>\n",
" <th>Gini</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Self_RP3Beta</td>\n",
" <td>3.702446</td>\n",
" <td>3.527273</td>\n",
" <td>0.282185</td>\n",
" <td>0.192092</td>\n",
" <td>0.186749</td>\n",
" <td>0.216980</td>\n",
" <td>0.204185</td>\n",
" <td>0.240096</td>\n",
" <td>0.339114</td>\n",
" <td>0.204905</td>\n",
" <td>0.572157</td>\n",
" <td>0.593544</td>\n",
" <td>0.875928</td>\n",
" <td>1.000000</td>\n",
" <td>0.077201</td>\n",
" <td>3.875892</td>\n",
" <td>0.974947</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Self_TopPop</td>\n",
" <td>2.508258</td>\n",
" <td>2.217909</td>\n",
" <td>0.188865</td>\n",
" <td>0.116919</td>\n",
" <td>0.118732</td>\n",
" <td>0.141584</td>\n",
" <td>0.130472</td>\n",
" <td>0.137473</td>\n",
" <td>0.214651</td>\n",
" <td>0.111707</td>\n",
" <td>0.400939</td>\n",
" <td>0.555546</td>\n",
" <td>0.765642</td>\n",
" <td>1.000000</td>\n",
" <td>0.038961</td>\n",
" <td>3.159079</td>\n",
" <td>0.987317</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Ready_SVD</td>\n",
" <td>0.952784</td>\n",
" <td>0.750597</td>\n",
" <td>0.095228</td>\n",
" <td>0.047497</td>\n",
" <td>0.053142</td>\n",
" <td>0.067082</td>\n",
" <td>0.084871</td>\n",
" <td>0.076457</td>\n",
" <td>0.109075</td>\n",
" <td>0.050124</td>\n",
" <td>0.241366</td>\n",
" <td>0.520459</td>\n",
" <td>0.499470</td>\n",
" <td>0.992047</td>\n",
" <td>0.217893</td>\n",
" <td>4.405246</td>\n",
" <td>0.953484</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Self_SVDBaseline</td>\n",
" <td>0.930321</td>\n",
" <td>0.734643</td>\n",
" <td>0.092683</td>\n",
" <td>0.042046</td>\n",
" <td>0.048568</td>\n",
" <td>0.063218</td>\n",
" <td>0.082940</td>\n",
" <td>0.068730</td>\n",
" <td>0.098937</td>\n",
" <td>0.044405</td>\n",
" <td>0.203936</td>\n",
" <td>0.517696</td>\n",
" <td>0.469777</td>\n",
" <td>1.000000</td>\n",
" <td>0.058442</td>\n",
" <td>3.085857</td>\n",
" <td>0.988824</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Ready_SVDBiased</td>\n",
" <td>0.940375</td>\n",
" <td>0.742264</td>\n",
" <td>0.092153</td>\n",
" <td>0.039645</td>\n",
" <td>0.046804</td>\n",
" <td>0.061886</td>\n",
" <td>0.079399</td>\n",
" <td>0.055967</td>\n",
" <td>0.102017</td>\n",
" <td>0.047972</td>\n",
" <td>0.216876</td>\n",
" <td>0.516515</td>\n",
" <td>0.441145</td>\n",
" <td>0.997455</td>\n",
" <td>0.167388</td>\n",
" <td>4.235348</td>\n",
" <td>0.962085</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Ready_Baseline</td>\n",
" <td>0.949459</td>\n",
" <td>0.752487</td>\n",
" <td>0.091410</td>\n",
" <td>0.037652</td>\n",
" <td>0.046030</td>\n",
" <td>0.061286</td>\n",
" <td>0.079614</td>\n",
" <td>0.056463</td>\n",
" <td>0.095957</td>\n",
" <td>0.043178</td>\n",
" <td>0.198193</td>\n",
" <td>0.515501</td>\n",
" <td>0.437964</td>\n",
" <td>1.000000</td>\n",
" <td>0.033911</td>\n",
" <td>2.836513</td>\n",
" <td>0.991139</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Self_SVD</td>\n",
" <td>0.939326</td>\n",
" <td>0.740022</td>\n",
" <td>0.074549</td>\n",
" <td>0.031755</td>\n",
" <td>0.038425</td>\n",
" <td>0.050562</td>\n",
" <td>0.065665</td>\n",
" <td>0.050602</td>\n",
" <td>0.077117</td>\n",
" <td>0.031574</td>\n",
" <td>0.165509</td>\n",
" <td>0.512485</td>\n",
" <td>0.414634</td>\n",
" <td>0.981866</td>\n",
" <td>0.080087</td>\n",
" <td>3.858982</td>\n",
" <td>0.975271</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Self_GlobalAvg</td>\n",
" <td>1.125760</td>\n",
" <td>0.943534</td>\n",
" <td>0.061188</td>\n",
" <td>0.025968</td>\n",
" <td>0.031383</td>\n",
" <td>0.041343</td>\n",
" <td>0.040558</td>\n",
" <td>0.032107</td>\n",
" <td>0.067695</td>\n",
" <td>0.027470</td>\n",
" <td>0.171187</td>\n",
" <td>0.509546</td>\n",
" <td>0.384942</td>\n",
" <td>1.000000</td>\n",
" <td>0.025974</td>\n",
" <td>2.711772</td>\n",
" <td>0.992003</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Ready_Random</td>\n",
" <td>1.518551</td>\n",
" <td>1.218784</td>\n",
" <td>0.050583</td>\n",
" <td>0.024085</td>\n",
" <td>0.027323</td>\n",
" <td>0.034826</td>\n",
" <td>0.031223</td>\n",
" <td>0.026436</td>\n",
" <td>0.054902</td>\n",
" <td>0.020652</td>\n",
" <td>0.137928</td>\n",
" <td>0.508570</td>\n",
" <td>0.353128</td>\n",
" <td>0.987699</td>\n",
" <td>0.183261</td>\n",
" <td>5.093805</td>\n",
" <td>0.908215</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Ready_I-KNN</td>\n",
" <td>1.030386</td>\n",
" <td>0.813067</td>\n",
" <td>0.026087</td>\n",
" <td>0.006908</td>\n",
" <td>0.010593</td>\n",
" <td>0.016046</td>\n",
" <td>0.021137</td>\n",
" <td>0.009522</td>\n",
" <td>0.024214</td>\n",
" <td>0.008958</td>\n",
" <td>0.048068</td>\n",
" <td>0.499885</td>\n",
" <td>0.154825</td>\n",
" <td>0.402333</td>\n",
" <td>0.434343</td>\n",
" <td>5.133650</td>\n",
" <td>0.877999</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Ready_U-KNNBaseline</td>\n",
" <td>0.935327</td>\n",
" <td>0.737424</td>\n",
" <td>0.002545</td>\n",
" <td>0.000755</td>\n",
" <td>0.001105</td>\n",
" <td>0.001602</td>\n",
" <td>0.002253</td>\n",
" <td>0.000930</td>\n",
" <td>0.003444</td>\n",
" <td>0.001362</td>\n",
" <td>0.011760</td>\n",
" <td>0.496724</td>\n",
" <td>0.021209</td>\n",
" <td>0.482821</td>\n",
" <td>0.059885</td>\n",
" <td>2.232578</td>\n",
" <td>0.994487</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Ready_I-KNNBaseline</td>\n",
" <td>0.935327</td>\n",
" <td>0.737424</td>\n",
" <td>0.002545</td>\n",
" <td>0.000755</td>\n",
" <td>0.001105</td>\n",
" <td>0.001602</td>\n",
" <td>0.002253</td>\n",
" <td>0.000930</td>\n",
" <td>0.003444</td>\n",
" <td>0.001362</td>\n",
" <td>0.011760</td>\n",
" <td>0.496724</td>\n",
" <td>0.021209</td>\n",
" <td>0.482821</td>\n",
" <td>0.059885</td>\n",
" <td>2.232578</td>\n",
" <td>0.994487</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Ready_U-KNN</td>\n",
" <td>1.023495</td>\n",
" <td>0.807913</td>\n",
" <td>0.000742</td>\n",
" <td>0.000205</td>\n",
" <td>0.000305</td>\n",
" <td>0.000449</td>\n",
" <td>0.000536</td>\n",
" <td>0.000198</td>\n",
" <td>0.000845</td>\n",
" <td>0.000274</td>\n",
" <td>0.002744</td>\n",
" <td>0.496441</td>\n",
" <td>0.007423</td>\n",
" <td>0.602121</td>\n",
" <td>0.010823</td>\n",
" <td>2.089186</td>\n",
" <td>0.995706</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Self_TopRated</td>\n",
" <td>1.033085</td>\n",
" <td>0.822057</td>\n",
" <td>0.000954</td>\n",
" <td>0.000188</td>\n",
" <td>0.000298</td>\n",
" <td>0.000481</td>\n",
" <td>0.000644</td>\n",
" <td>0.000223</td>\n",
" <td>0.001043</td>\n",
" <td>0.000335</td>\n",
" <td>0.003348</td>\n",
" <td>0.496433</td>\n",
" <td>0.009544</td>\n",
" <td>0.699046</td>\n",
" <td>0.005051</td>\n",
" <td>1.945910</td>\n",
" <td>0.995669</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Self_BaselineUI</td>\n",
" <td>0.967585</td>\n",
" <td>0.762740</td>\n",
" <td>0.000954</td>\n",
" <td>0.000170</td>\n",
" <td>0.000278</td>\n",
" <td>0.000463</td>\n",
" <td>0.000644</td>\n",
" <td>0.000189</td>\n",
" <td>0.000752</td>\n",
" <td>0.000168</td>\n",
" <td>0.001677</td>\n",
" <td>0.496424</td>\n",
" <td>0.009544</td>\n",
" <td>0.600530</td>\n",
" <td>0.005051</td>\n",
" <td>1.803126</td>\n",
" <td>0.996380</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Self_IKNN</td>\n",
" <td>1.018363</td>\n",
" <td>0.808793</td>\n",
" <td>0.000318</td>\n",
" <td>0.000108</td>\n",
" <td>0.000140</td>\n",
" <td>0.000189</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000214</td>\n",
" <td>0.000037</td>\n",
" <td>0.000368</td>\n",
" <td>0.496391</td>\n",
" <td>0.003181</td>\n",
" <td>0.392153</td>\n",
" <td>0.115440</td>\n",
" <td>4.174741</td>\n",
" <td>0.965327</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Model RMSE MAE precision recall F_1 \\\n",
"0 Self_RP3Beta 3.702446 3.527273 0.282185 0.192092 0.186749 \n",
"0 Self_TopPop 2.508258 2.217909 0.188865 0.116919 0.118732 \n",
"0 Ready_SVD 0.952784 0.750597 0.095228 0.047497 0.053142 \n",
"0 Self_SVDBaseline 0.930321 0.734643 0.092683 0.042046 0.048568 \n",
"0 Ready_SVDBiased 0.940375 0.742264 0.092153 0.039645 0.046804 \n",
"0 Ready_Baseline 0.949459 0.752487 0.091410 0.037652 0.046030 \n",
"0 Self_SVD 0.939326 0.740022 0.074549 0.031755 0.038425 \n",
"0 Self_GlobalAvg 1.125760 0.943534 0.061188 0.025968 0.031383 \n",
"0 Ready_Random 1.518551 1.218784 0.050583 0.024085 0.027323 \n",
"0 Ready_I-KNN 1.030386 0.813067 0.026087 0.006908 0.010593 \n",
"0 Ready_U-KNNBaseline 0.935327 0.737424 0.002545 0.000755 0.001105 \n",
"0 Ready_I-KNNBaseline 0.935327 0.737424 0.002545 0.000755 0.001105 \n",
"0 Ready_U-KNN 1.023495 0.807913 0.000742 0.000205 0.000305 \n",
"0 Self_TopRated 1.033085 0.822057 0.000954 0.000188 0.000298 \n",
"0 Self_BaselineUI 0.967585 0.762740 0.000954 0.000170 0.000278 \n",
"0 Self_IKNN 1.018363 0.808793 0.000318 0.000108 0.000140 \n",
"\n",
" F_05 precision_super recall_super NDCG mAP MRR \\\n",
"0 0.216980 0.204185 0.240096 0.339114 0.204905 0.572157 \n",
"0 0.141584 0.130472 0.137473 0.214651 0.111707 0.400939 \n",
"0 0.067082 0.084871 0.076457 0.109075 0.050124 0.241366 \n",
"0 0.063218 0.082940 0.068730 0.098937 0.044405 0.203936 \n",
"0 0.061886 0.079399 0.055967 0.102017 0.047972 0.216876 \n",
"0 0.061286 0.079614 0.056463 0.095957 0.043178 0.198193 \n",
"0 0.050562 0.065665 0.050602 0.077117 0.031574 0.165509 \n",
"0 0.041343 0.040558 0.032107 0.067695 0.027470 0.171187 \n",
"0 0.034826 0.031223 0.026436 0.054902 0.020652 0.137928 \n",
"0 0.016046 0.021137 0.009522 0.024214 0.008958 0.048068 \n",
"0 0.001602 0.002253 0.000930 0.003444 0.001362 0.011760 \n",
"0 0.001602 0.002253 0.000930 0.003444 0.001362 0.011760 \n",
"0 0.000449 0.000536 0.000198 0.000845 0.000274 0.002744 \n",
"0 0.000481 0.000644 0.000223 0.001043 0.000335 0.003348 \n",
"0 0.000463 0.000644 0.000189 0.000752 0.000168 0.001677 \n",
"0 0.000189 0.000000 0.000000 0.000214 0.000037 0.000368 \n",
"\n",
" LAUC HR Reco in test Test coverage Shannon Gini \n",
"0 0.593544 0.875928 1.000000 0.077201 3.875892 0.974947 \n",
"0 0.555546 0.765642 1.000000 0.038961 3.159079 0.987317 \n",
"0 0.520459 0.499470 0.992047 0.217893 4.405246 0.953484 \n",
"0 0.517696 0.469777 1.000000 0.058442 3.085857 0.988824 \n",
"0 0.516515 0.441145 0.997455 0.167388 4.235348 0.962085 \n",
"0 0.515501 0.437964 1.000000 0.033911 2.836513 0.991139 \n",
"0 0.512485 0.414634 0.981866 0.080087 3.858982 0.975271 \n",
"0 0.509546 0.384942 1.000000 0.025974 2.711772 0.992003 \n",
"0 0.508570 0.353128 0.987699 0.183261 5.093805 0.908215 \n",
"0 0.499885 0.154825 0.402333 0.434343 5.133650 0.877999 \n",
"0 0.496724 0.021209 0.482821 0.059885 2.232578 0.994487 \n",
"0 0.496724 0.021209 0.482821 0.059885 2.232578 0.994487 \n",
"0 0.496441 0.007423 0.602121 0.010823 2.089186 0.995706 \n",
"0 0.496433 0.009544 0.699046 0.005051 1.945910 0.995669 \n",
"0 0.496424 0.009544 0.600530 0.005051 1.803126 0.996380 \n",
"0 0.496391 0.003181 0.392153 0.115440 4.174741 0.965327 "
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
    "import imp\n",
    "import evaluation_measures as ev\n",
    "imp.reload(ev) # reload to pick up any changes to the module\n",
    "\n",
"dir_path=\"Recommendations generated/ml-100k/\"\n",
"super_reactions=[4,5]\n",
"test=pd.read_csv('./Datasets/ml-100k/test.csv', sep='\\t', header=None)\n",
"\n",
"ev.evaluate_all(test, dir_path, super_reactions)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Ready-made KNNs - Surprise implementation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### I-KNN - basic"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Computing the cosine similarity matrix...\n",
"Done computing similarity matrix.\n",
"Generating predictions...\n",
"Generating top N recommendations...\n",
"Generating predictions...\n"
]
}
],
"source": [
"import helpers\n",
"import surprise as sp\n",
"import imp\n",
"imp.reload(helpers)\n",
"\n",
"sim_options = {'name': 'cosine',\n",
" 'user_based': False} # compute similarities between items\n",
"algo = sp.KNNBasic(sim_options=sim_options)\n",
"\n",
"helpers.ready_made(algo, reco_path='Recommendations generated/ml-100k/Ready_I-KNN_reco.csv',\n",
" estimations_path='Recommendations generated/ml-100k/Ready_Baseline_I-KNN_estimations.csv')"
]
},
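  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The neighbourhood size can be tuned as well; a sketch (the parameter values here are illustrative, not tuned):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# KNNBasic also accepts k (max number of neighbours, default 40) and min_k (minimum, default 1)\n",
    "algo_k20 = sp.KNNBasic(k=20, min_k=2, sim_options={'name': 'cosine', 'user_based': False})"
   ]
  },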
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### U-KNN - basic"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Computing the cosine similarity matrix...\n",
"Done computing similarity matrix.\n",
"Generating predictions...\n"
]
},
{
"ename": "KeyboardInterrupt",
"evalue": "",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mKeyboardInterrupt\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m<ipython-input-10-dd4f59625a08>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m 9\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 10\u001b[0m helpers.ready_made(algo, reco_path='Recommendations generated/ml-100k/Ready_U-KNN_reco.csv',\n\u001b[0;32m---> 11\u001b[0;31m estimations_path='Recommendations generated/ml-100k/Ready_Baseline_U-KNN_estimations.csv')\n\u001b[0m",
"\u001b[0;32m/mnt/c/Users/rkwie/Repositories/Warsztaty z uczenia maszynowego - systemy rekomendacyjne/helpers.py\u001b[0m in \u001b[0;36mready_made\u001b[0;34m(algo, reco_path, estimations_path)\u001b[0m\n\u001b[1;32m 61\u001b[0m \u001b[0mantitrainset\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mtrainset\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mbuild_anti_testset\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;31m# We want to predict ratings of pairs (user, item) which are not in train set\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 62\u001b[0m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'Generating predictions...'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 63\u001b[0;31m \u001b[0mpredictions\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0malgo\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtest\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mantitrainset\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 64\u001b[0m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'Generating top N recommendations...'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 65\u001b[0m \u001b[0mtop_n\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mget_top_n\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mpredictions\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mn\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m10\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m~/.local/lib/python3.6/site-packages/surprise/prediction_algorithms/algo_base.py\u001b[0m in \u001b[0;36mtest\u001b[0;34m(self, testset, verbose)\u001b[0m\n\u001b[1;32m 165\u001b[0m \u001b[0mr_ui_trans\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 166\u001b[0m verbose=verbose)\n\u001b[0;32m--> 167\u001b[0;31m for (uid, iid, r_ui_trans) in testset]\n\u001b[0m\u001b[1;32m 168\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mpredictions\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 169\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m~/.local/lib/python3.6/site-packages/surprise/prediction_algorithms/algo_base.py\u001b[0m in \u001b[0;36m<listcomp>\u001b[0;34m(.0)\u001b[0m\n\u001b[1;32m 165\u001b[0m \u001b[0mr_ui_trans\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 166\u001b[0m verbose=verbose)\n\u001b[0;32m--> 167\u001b[0;31m for (uid, iid, r_ui_trans) in testset]\n\u001b[0m\u001b[1;32m 168\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mpredictions\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 169\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m~/.local/lib/python3.6/site-packages/surprise/prediction_algorithms/algo_base.py\u001b[0m in \u001b[0;36mpredict\u001b[0;34m(self, uid, iid, r_ui, clip, verbose)\u001b[0m\n\u001b[1;32m 103\u001b[0m \u001b[0mdetails\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m{\u001b[0m\u001b[0;34m}\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 104\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 105\u001b[0;31m \u001b[0mest\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mestimate\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0miuid\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0miiid\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 106\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 107\u001b[0m \u001b[0;31m# If the details dict was also returned\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m~/.local/lib/python3.6/site-packages/surprise/prediction_algorithms/knns.py\u001b[0m in \u001b[0;36mestimate\u001b[0;34m(self, u, i)\u001b[0m\n\u001b[1;32m 109\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 110\u001b[0m \u001b[0mneighbors\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msim\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mx\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mx2\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mr\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0;34m(\u001b[0m\u001b[0mx2\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mr\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0myr\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0my\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 111\u001b[0;31m \u001b[0mk_neighbors\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mheapq\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mnlargest\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mk\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mneighbors\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mlambda\u001b[0m \u001b[0mt\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mt\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 112\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 113\u001b[0m \u001b[0;31m# compute weighted average\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m/usr/lib/python3.6/heapq.py\u001b[0m in \u001b[0;36mnlargest\u001b[0;34m(n, iterable, key)\u001b[0m\n\u001b[1;32m 567\u001b[0m \u001b[0;31m# General case, slowest method\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 568\u001b[0m \u001b[0mit\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0miter\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0miterable\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 569\u001b[0;31m \u001b[0mresult\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0melem\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mi\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0melem\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mi\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0melem\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mzip\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mrange\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m-\u001b[0m\u001b[0mn\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m-\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mit\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 570\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mresult\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 571\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mresult\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m/usr/lib/python3.6/heapq.py\u001b[0m in \u001b[0;36m<listcomp>\u001b[0;34m(.0)\u001b[0m\n\u001b[1;32m 567\u001b[0m \u001b[0;31m# General case, slowest method\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 568\u001b[0m \u001b[0mit\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0miter\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0miterable\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 569\u001b[0;31m \u001b[0mresult\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0melem\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mi\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0melem\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mi\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0melem\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mzip\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mrange\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m-\u001b[0m\u001b[0mn\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m-\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mit\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 570\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mresult\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 571\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mresult\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m~/.local/lib/python3.6/site-packages/surprise/prediction_algorithms/knns.py\u001b[0m in \u001b[0;36m<lambda>\u001b[0;34m(t)\u001b[0m\n\u001b[1;32m 109\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 110\u001b[0m \u001b[0mneighbors\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msim\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mx\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mx2\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mr\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0;34m(\u001b[0m\u001b[0mx2\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mr\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0myr\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0my\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 111\u001b[0;31m \u001b[0mk_neighbors\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mheapq\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mnlargest\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mk\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mneighbors\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mlambda\u001b[0m \u001b[0mt\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mt\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 112\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 113\u001b[0m \u001b[0;31m# compute weighted average\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;31mKeyboardInterrupt\u001b[0m: "
]
}
],
"source": [
"import helpers\n",
"import surprise as sp\n",
"import imp\n",
"imp.reload(helpers)\n",
"\n",
"sim_options = {'name': 'cosine',\n",
" 'user_based': True} # compute similarities between users\n",
"algo = sp.KNNBasic(sim_options=sim_options)\n",
"\n",
"helpers.ready_made(algo, reco_path='Recommendations generated/ml-100k/Ready_U-KNN_reco.csv',\n",
" estimations_path='Recommendations generated/ml-100k/Ready_Baseline_U-KNN_estimations.csv')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### I-KNN - on top baseline"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import helpers\n",
"import surprise as sp\n",
"import imp\n",
"imp.reload(helpers)\n",
"\n",
"sim_options = {'name': 'cosine',\n",
" 'user_based': False} # compute similarities between items\n",
"algo = sp.KNNBaseline()\n",
"\n",
"helpers.ready_made(algo, reco_path='Recommendations generated/ml-100k/Ready_I-KNNBaseline_reco.csv',\n",
" estimations_path='Recommendations generated/ml-100k/Ready_I-KNNBaseline_estimations.csv')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# project task 4: use a version of your choice of Surprise KNNalgorithm"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"# read the docs and try to find best parameter configuration (let say in terms of RMSE)\n",
"# https://surprise.readthedocs.io/en/stable/knn_inspired.html##surprise.prediction_algorithms.knns.KNNBaseline\n",
"# the solution here can be similar to examples above\n",
"# please save the output in 'Recommendations generated/ml-100k/Self_KNNSurprisetask_reco.csv' and\n",
"# 'Recommendations generated/ml-100k/Self_KNNSurprisetask_estimations.csv'"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
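The U-KNN run above was interrupted (`KeyboardInterrupt`) inside `heapq.nlargest`, the call Surprise's `KNNBasic` makes at knns.py line 111 to pick the k most similar neighbours for every single prediction; over the full anti-testset this adds up, which is why the cell is slow. A minimal sketch of that selection step on made-up `(similarity, rating)` pairs:

```python
import heapq

# toy neighbours as (similarity, rating) pairs, the shape built in surprise's knns.py
neighbors = [(0.9, 4), (0.1, 2), (0.5, 5), (0.7, 3)]

# keep the 2 pairs with the highest similarity t[0], as KNNBasic does with its k
k_neighbors = heapq.nlargest(2, neighbors, key=lambda t: t[0])
print(k_neighbors)  # → [(0.9, 4), (0.7, 3)]
```

`nlargest` falls back to the "general case, slowest method" branch shown in the traceback whenever a `key` is given, so its cost is paid once per (user, item) pair being predicted.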


@ -0,0 +1,80 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['dimensions: 1, cases when observation is the nearest: 0.0%',\n",
" 'dimensions: 2, cases when observation is the nearest: 0.0%',\n",
" 'dimensions: 3, cases when observation is the nearest: 0.0%',\n",
" 'dimensions: 10, cases when observation is the nearest: 10.0%',\n",
" 'dimensions: 20, cases when observation is the nearest: 61.0%',\n",
" 'dimensions: 30, cases when observation is the nearest: 96.0%',\n",
" 'dimensions: 40, cases when observation is the nearest: 98.0%',\n",
" 'dimensions: 50, cases when observation is the nearest: 100.0%',\n",
" 'dimensions: 60, cases when observation is the nearest: 100.0%',\n",
" 'dimensions: 70, cases when observation is the nearest: 100.0%',\n",
" 'dimensions: 80, cases when observation is the nearest: 100.0%',\n",
" 'dimensions: 90, cases when observation is the nearest: 100.0%']"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"import random\n",
"from numpy.linalg import norm\n",
"\n",
"dimensions=[1,2,3]+[10*i for i in range(1,10)]\n",
"nb_vectors=10000\n",
"trials=100\n",
"k=1 # by setting k=1 we want to check how often the closest vector to the avarage of 2 random vectors is one of these 2 vectors\n",
"\n",
"result=[]\n",
"for dimension in dimensions:\n",
" vectors=np.random.normal(0,1,size=(nb_vectors, dimension))\n",
" successes=0\n",
" for i in range(trials):\n",
" i1,i2=random.sample(range(nb_vectors),2)\n",
" target=(vectors[i1]+vectors[i2])/2\n",
"\n",
" distances=pd.DataFrame(enumerate(np.dot(target, vectors.transpose())/norm(target)/norm(vectors.transpose(), axis=0)))\n",
" distances=distances.sort_values(by=[1], ascending=False)\n",
" if (i1 in (list(distances[0][:k]))) | (i2 in (list(distances[0][:k]))):\n",
" successes+=1\n",
" result.append(successes/trials)\n",
" \n",
"[f'dimensions: {i}, cases when observation is the nearest: {100*round(j,3)}%' for i,j in zip(dimensions, result)]"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
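The experiment above can be reproduced in a smaller, seeded form; the vector counts and trial numbers below are arbitrary choices for this sketch, and a direct `argmax` over cosine similarities replaces the pandas sort:

```python
import numpy as np

rng = np.random.default_rng(0)
nb_vectors, trials = 1000, 50

def nearest_hits(dimension):
    # fraction of trials in which one of the two averaged vectors
    # is the cosine-nearest neighbour of their average
    vectors = rng.normal(0, 1, size=(nb_vectors, dimension))
    norms = np.linalg.norm(vectors, axis=1)
    hits = 0
    for _ in range(trials):
        i1, i2 = rng.choice(nb_vectors, size=2, replace=False)
        target = (vectors[i1] + vectors[i2]) / 2
        sims = vectors @ target / (norms * np.linalg.norm(target))
        if sims.argmax() in (i1, i2):
            hits += 1
    return hits / trials

low, high = nearest_hits(2), nearest_hits(50)
print(low, high)  # the hit rate grows sharply with dimensionality
```

In low dimensions the average of two random vectors is almost never their nearest neighbour; from a few dozen dimensions up it almost always is, which is the concentration effect the notebook measures.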

File diff suppressed because one or more lines are too long

1391
P5. Graph-based.ipynb Normal file

File diff suppressed because one or more lines are too long

214
evaluation_measures.py Normal file

@ -0,0 +1,214 @@
import os
import sys
import numpy as np
import pandas as pd
import math
from sklearn.preprocessing import normalize
from tqdm import tqdm
from datetime import datetime, date
import random
import scipy.sparse as sparse
from os import listdir
from os.path import isfile, join
from collections import defaultdict


def evaluate(test,
estimations_df,
reco,
super_reactions=[4,5],
topK=10):
estimations_df=estimations_df.copy()
reco=reco.copy()
test_df=test.copy()
# prepare testset
test_df.columns=['user', 'item', 'rating', 'timestamp']
test_df['user_code'] = test_df['user'].astype("category").cat.codes
test_df['item_code'] = test_df['item'].astype("category").cat.codes
user_code_id = dict(enumerate(test_df['user'].astype("category").cat.categories))
user_id_code = dict((v, k) for k, v in user_code_id.items())
item_code_id = dict(enumerate(test_df['item'].astype("category").cat.categories))
item_id_code = dict((v, k) for k, v in item_code_id.items())
test_ui = sparse.csr_matrix((test_df['rating'], (test_df['user_code'], test_df['item_code'])))
#prepare estimations
estimations_df.columns=['user', 'item' ,'score']
estimations_df['user_code']=[user_id_code[user] for user in estimations_df['user']]
estimations_df['item_code']=[item_id_code[item] for item in estimations_df['item']]
estimations=sparse.csr_matrix((estimations_df['score'], (estimations_df['user_code'], estimations_df['item_code'])), shape=test_ui.shape)
#compute_estimations
estimations_df=estimations_metrics(test_ui, estimations)
#prepare reco
users=reco[:,:1]
items=reco[:,1::2]
# Let's use inner ids instead of real ones
users=np.vectorize(lambda x: user_id_code.setdefault(x, -1))(users) # maybe users we recommend are not in test set
items=np.vectorize(lambda x: item_id_code.setdefault(x, -1))(items) # maybe items we recommend are not in test set
# Let's put them into one array
reco=np.concatenate((users, items), axis=1)
#compute ranking metrics
ranking_df=ranking_metrics(test_ui, reco, super_reactions=super_reactions, topK=topK)
#compute diversity metrics
diversity_df=diversity_metrics(test_ui, reco, topK)
result=pd.concat([estimations_df, ranking_df, diversity_df], axis=1)
return(result)


def ranking_metrics(test_ui, reco, super_reactions=[], topK=10):
nb_items=test_ui.shape[1]
relevant_users, super_relevant_users, prec, rec, F_1, F_05, prec_super, rec_super, ndcg, mAP, MRR, LAUC, HR=\
0,0,0,0,0,0,0,0,0,0,0,0,0
cg = (1.0 / np.log2(np.arange(2, topK + 2)))
cg_sum = np.cumsum(cg)
for (nb_user, user) in tqdm(enumerate(reco[:,0])):
u_rated_items=test_ui.indices[test_ui.indptr[user]:test_ui.indptr[user+1]]
nb_u_rated_items=len(u_rated_items)
if nb_u_rated_items>0: # skip users with no items in test set (still possible that there will be no super items)
relevant_users+=1
u_super_items=u_rated_items[np.vectorize(lambda x: x in super_reactions)\
(test_ui.data[test_ui.indptr[user]:test_ui.indptr[user+1]])]
# more natural seems u_super_items=[item for item in u_rated_items if test_ui[user,item] in super_reactions]
            # but accessing test_ui[user,item] is expensive - we should avoid doing it
if len(u_super_items)>0:
super_relevant_users+=1
user_successes=np.zeros(topK)
nb_user_successes=0
user_super_successes=np.zeros(topK)
nb_user_super_successes=0
# evaluation
for (item_position,item) in enumerate(reco[nb_user,1:topK+1]):
if item in u_rated_items:
user_successes[item_position]=1
nb_user_successes+=1
if item in u_super_items:
user_super_successes[item_position]=1
nb_user_super_successes+=1
prec_u=nb_user_successes/topK
prec+=prec_u
rec_u=nb_user_successes/nb_u_rated_items
rec+=rec_u
F_1+=2*(prec_u*rec_u)/(prec_u+rec_u) if prec_u+rec_u>0 else 0
F_05+=(0.5**2+1)*(prec_u*rec_u)/(0.5**2*prec_u+rec_u) if prec_u+rec_u>0 else 0
prec_super+=nb_user_super_successes/topK
rec_super+=nb_user_super_successes/max(len(u_super_items),1)
ndcg+=np.dot(user_successes,cg)/cg_sum[min(topK, nb_u_rated_items)-1]
cumsum_successes=np.cumsum(user_successes)
mAP+=np.dot(cumsum_successes/np.arange(1,topK+1), user_successes)/min(topK, nb_u_rated_items)
MRR+=1/(user_successes.nonzero()[0][0]+1) if user_successes.nonzero()[0].size>0 else 0
LAUC+=(np.dot(cumsum_successes, 1-user_successes)+\
(nb_user_successes+nb_u_rated_items)/2*((nb_items-nb_u_rated_items)-(topK-nb_user_successes)))/\
((nb_items-nb_u_rated_items)*nb_u_rated_items)
HR+=nb_user_successes>0
result=[]
result.append(('precision', prec/relevant_users))
result.append(('recall', rec/relevant_users))
result.append(('F_1', F_1/relevant_users))
result.append(('F_05', F_05/relevant_users))
result.append(('precision_super', prec_super/super_relevant_users))
result.append(('recall_super', rec_super/super_relevant_users))
result.append(('NDCG', ndcg/relevant_users))
result.append(('mAP', mAP/relevant_users))
result.append(('MRR', MRR/relevant_users))
result.append(('LAUC', LAUC/relevant_users))
result.append(('HR', HR/relevant_users))
df_result=pd.DataFrame()
if len(result)>0:
df_result=(pd.DataFrame(list(zip(*result))[1])).T
df_result.columns=list(zip(*result))[0]
return df_result


def estimations_metrics(test_ui, estimations):
result=[]
RMSE=(np.sum((estimations.data-test_ui.data)**2)/estimations.nnz)**(1/2)
result.append(['RMSE', RMSE])
MAE=np.sum(abs(estimations.data-test_ui.data))/estimations.nnz
result.append(['MAE', MAE])
df_result=pd.DataFrame()
if len(result)>0:
df_result=(pd.DataFrame(list(zip(*result))[1])).T
df_result.columns=list(zip(*result))[0]
return df_result


def diversity_metrics(test_ui, reco, topK=10):
frequencies=defaultdict(int)
for item in list(set(test_ui.indices)):
frequencies[item]=0
for item in reco[:,1:].flat:
frequencies[item]+=1
nb_reco_outside_test=frequencies[-1]
del frequencies[-1]
frequencies=np.array(list(frequencies.values()))
nb_rec_items=len(frequencies[frequencies>0])
nb_reco_inside_test=np.sum(frequencies)
frequencies=frequencies/np.sum(frequencies)
frequencies=np.sort(frequencies)
    with np.errstate(divide='ignore'): # let's put zeros for items with 0 frequency and ignore the division warning
log_frequencies=np.nan_to_num(np.log(frequencies), posinf=0, neginf=0)
result=[]
result.append(('Reco in test', nb_reco_inside_test/(nb_reco_inside_test+nb_reco_outside_test)))
result.append(('Test coverage', nb_rec_items/test_ui.shape[1]))
result.append(('Shannon', -np.dot(frequencies, log_frequencies)))
result.append(('Gini', np.dot(frequencies, np.arange(1-len(frequencies), len(frequencies), 2))/(len(frequencies)-1)))
df_result=(pd.DataFrame(list(zip(*result))[1])).T
df_result.columns=list(zip(*result))[0]
return df_result


def evaluate_all(test,
dir_path="Recommendations generated/ml-100k/",
super_reactions=[4,5],
topK=10):
models = list(set(['_'.join(f.split('_')[:2]) for f in listdir(dir_path)
if isfile(dir_path+f)]))
result=[]
for model in models:
estimations_df=pd.read_csv('{}{}_estimations.csv'.format(dir_path, model), header=None)
reco=np.loadtxt('{}{}_reco.csv'.format(dir_path, model), delimiter=',')
to_append=evaluate(test, estimations_df, reco, super_reactions, topK)
to_append.insert(0, "Model", model)
result.append(to_append)
result=pd.concat(result)
result=result.sort_values(by='recall', ascending=False)
return result
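The Shannon and Gini expressions in `diversity_metrics` can be sanity-checked on a toy frequency vector: perfectly uniform recommendation frequencies should give maximal entropy (ln n) and a Gini index of 0. The snippet below re-states the two formulas standalone rather than importing the module:

```python
import numpy as np

# toy recommendation frequencies: each of 4 items recommended 5 times out of 20,
# normalised and (trivially) sorted, as diversity_metrics prepares them
frequencies = np.array([5, 5, 5, 5]) / 20

shannon = -np.dot(frequencies, np.log(frequencies))
gini = np.dot(frequencies,
              np.arange(1 - len(frequencies), len(frequencies), 2)) / (len(frequencies) - 1)

print(round(shannon, 4), round(gini, 4))  # → 1.3863 0.0
```

Skewing the frequencies lowers the entropy and pushes the Gini index toward 1, so both metrics reward recommenders that spread recommendations across the catalogue instead of repeating a few popular items.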

75
helpers.py Normal file

@ -0,0 +1,75 @@
import pandas as pd
import numpy as np
import scipy.sparse as sparse
import surprise as sp
import time
from collections import defaultdict
from itertools import chain


def data_to_csr(train_read, test_read):
train_read.columns=['user', 'item', 'rating', 'timestamp']
test_read.columns=['user', 'item', 'rating', 'timestamp']
# Let's build whole dataset
train_and_test=pd.concat([train_read, test_read], axis=0, ignore_index=True)
train_and_test['user_code'] = train_and_test['user'].astype("category").cat.codes
train_and_test['item_code'] = train_and_test['item'].astype("category").cat.codes
user_code_id = dict(enumerate(train_and_test['user'].astype("category").cat.categories))
user_id_code = dict((v, k) for k, v in user_code_id.items())
item_code_id = dict(enumerate(train_and_test['item'].astype("category").cat.categories))
item_id_code = dict((v, k) for k, v in item_code_id.items())
train_df=pd.merge(train_read, train_and_test, on=list(train_read.columns))
test_df=pd.merge(test_read, train_and_test, on=list(train_read.columns))
# Take number of users and items
(U,I)=(train_and_test['user_code'].max()+1, train_and_test['item_code'].max()+1)
# Create sparse csr matrices
train_ui = sparse.csr_matrix((train_df['rating'], (train_df['user_code'], train_df['item_code'])), shape=(U, I))
test_ui = sparse.csr_matrix((test_df['rating'], (test_df['user_code'], test_df['item_code'])), shape=(U, I))
return train_ui, test_ui, user_code_id, user_id_code, item_code_id, item_id_code


def get_top_n(predictions, n=10):
# Here we create a dictionary which items are lists of pairs (item, score)
top_n = defaultdict(list)
for uid, iid, true_r, est, _ in predictions:
top_n[uid].append((iid, est))
result=[]
# Let's choose k best items in the format: (user, item1, score1, item2, score2, ...)
for uid, user_ratings in top_n.items():
user_ratings.sort(key=lambda x: x[1], reverse=True)
result.append([uid]+list(chain(*user_ratings[:n])))
return result


def ready_made(algo, reco_path, estimations_path):
reader = sp.Reader(line_format='user item rating timestamp', sep='\t')
trainset = sp.Dataset.load_from_file('./Datasets/ml-100k/train.csv', reader=reader)
trainset = trainset.build_full_trainset() # <class 'surprise.trainset.Trainset'> -> it is needed for using Surprise package
testset = sp.Dataset.load_from_file('./Datasets/ml-100k/test.csv', reader=reader)
testset = sp.Trainset.build_testset(testset.build_full_trainset())
algo.fit(trainset)
antitrainset = trainset.build_anti_testset() # We want to predict ratings of pairs (user, item) which are not in train set
print('Generating predictions...')
predictions = algo.test(antitrainset)
print('Generating top N recommendations...')
top_n = get_top_n(predictions, n=10)
top_n=pd.DataFrame(top_n)
top_n.to_csv(reco_path, index=False, header=False)
    print('Generating estimations...')
predictions = algo.test(testset)
predictions_df=[]
for uid, iid, true_r, est, _ in predictions:
predictions_df.append([uid, iid, est])
predictions_df=pd.DataFrame(predictions_df)
predictions_df.to_csv(estimations_path, index=False, header=False)
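`get_top_n` operates on plain `(user, item, true_rating, estimate, details)` tuples, so it can be exercised without fitting any Surprise model; the function body is repeated below so the snippet runs standalone, and the toy predictions are made up for illustration:

```python
from collections import defaultdict
from itertools import chain

def get_top_n(predictions, n=10):
    # same logic as helpers.get_top_n: bucket (item, score) pairs per user,
    # then flatten the n highest-scored items after the user id
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))
    result = []
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        result.append([uid] + list(chain(*user_ratings[:n])))
    return result

# toy Surprise-style predictions: (user, item, true rating, estimate, details)
predictions = [('u1', 'i1', 4, 3.5, None), ('u1', 'i2', 5, 4.8, None),
               ('u2', 'i1', 3, 2.1, None)]
print(get_top_n(predictions, n=1))  # → [['u1', 'i2', 4.8], ['u2', 'i1', 2.1]]
```

Each output row has the `(user, item1, score1, item2, score2, ...)` shape that `ready_made` writes to the `*_reco.csv` files.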