Deep Learning – Text Processing – Lab
1. TF–IDF
import numpy as np
import re
Document collection
documents = ['Ala lubi zwierzęta i ma kota oraz psa!',
             'Ola lubi zwierzęta oraz ma kota a także chomika!',
             'I Jan jeździ na rowerze.',
             '2 wojna światowa była wielkim konfliktem zbrojnym',
             'Tomek lubi psy, ma psa i jeździ na motorze i rowerze.',
             ]
What do we need?
- We want to turn the texts into collections of words.
❔ Questions
- Can we use document.split(' ') to tokenize the text?
- What difficulties might we run into?
Preprocessing
def get_str_cleaned(str_dirty):
    punctuation = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
    new_str = str_dirty.lower()
    new_str = re.sub(' +', ' ', new_str)
    for char in punctuation:
        new_str = new_str.replace(char, '')
    return new_str
sample_document = get_str_cleaned(documents[0])
sample_document
'ala lubi zwierzęta i ma kota oraz psa'
Tokenization
def tokenize_str(document):
    return document.split(' ')
tokenize_str(sample_document)
['ala', 'lubi', 'zwierzęta', 'i', 'ma', 'kota', 'oraz', 'psa']
documents_cleaned = [get_str_cleaned(document) for document in documents]
documents_cleaned
['ala lubi zwierzęta i ma kota oraz psa', 'ola lubi zwierzęta oraz ma kota a także chomika', 'i jan jeździ na rowerze', '2 wojna światowa była wielkim konfliktem zbrojnym', 'tomek lubi psy ma psa i jeździ na motorze i rowerze']
documents_tokenized = [tokenize_str(d) for d in documents_cleaned]
documents_tokenized
[['ala', 'lubi', 'zwierzęta', 'i', 'ma', 'kota', 'oraz', 'psa'], ['ola', 'lubi', 'zwierzęta', 'oraz', 'ma', 'kota', 'a', 'także', 'chomika'], ['i', 'jan', 'jeździ', 'na', 'rowerze'], ['2', 'wojna', 'światowa', 'była', 'wielkim', 'konfliktem', 'zbrojnym'], ['tomek', 'lubi', 'psy', 'ma', 'psa', 'i', 'jeździ', 'na', 'motorze', 'i', 'rowerze']]
❔ Questions
- What is the next step towards building TF or TF–IDF vectors?
- What size will a TF or TF–IDF vector have?
Building the vocabulary
vocabulary = []
for document in documents_tokenized:
    for word in document:
        vocabulary.append(word)
vocabulary = sorted(set(vocabulary))
vocabulary
['2', 'a', 'ala', 'była', 'chomika', 'i', 'jan', 'jeździ', 'konfliktem', 'kota', 'lubi', 'ma', 'motorze', 'na', 'ola', 'oraz', 'psa', 'psy', 'rowerze', 'także', 'tomek', 'wielkim', 'wojna', 'zbrojnym', 'zwierzęta', 'światowa']
📝 Task 1.1 (1 pt)
Write a function word_to_index(word: str) that, for a given word, returns a one-hot vector as a numpy.array.
Assume the vocabulary is given by the global variable vocabulary.
def word_to_index(word: str) -> np.array:
    # TODO
    pass
word_to_index('psa')
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
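One way the TODO above could be filled in, as a non-authoritative sketch (it assumes the global `vocabulary` built earlier; the vocabulary is repeated here to keep the snippet self-contained):

```python
import numpy as np

# Vocabulary as built above (sorted set of all tokens).
vocabulary = ['2', 'a', 'ala', 'była', 'chomika', 'i', 'jan', 'jeździ',
              'konfliktem', 'kota', 'lubi', 'ma', 'motorze', 'na', 'ola',
              'oraz', 'psa', 'psy', 'rowerze', 'także', 'tomek', 'wielkim',
              'wojna', 'zbrojnym', 'zwierzęta', 'światowa']

def word_to_index(word: str) -> np.array:
    # One-hot vector: 1.0 at the word's position in the vocabulary, 0.0 elsewhere.
    one_hot = np.zeros(len(vocabulary))
    if word in vocabulary:
        one_hot[vocabulary.index(word)] = 1.0
    return one_hot
```

Unknown words simply yield an all-zero vector here; whether that is the desired behaviour is a design choice left open by the task statement.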
📝 Task 1.2 (1 pt)
Write a function that converts a list of words into a TF vector.
def tf(document: list) -> np.array:
    # TODO
    pass
tf(documents_tokenized[0])
array([0., 0., 1., 0., 0., 1., 0., 0., 0., 1., 1., 1., 0., 0., 0., 1., 1., 0., 0., 0., 0., 0., 0., 0., 1., 0.])
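A possible sketch of the TF function (one of several equivalent formulations; raw counts, no normalization, matching the outputs shown in this notebook):

```python
import numpy as np

# Same vocabulary as above, repeated so the snippet runs on its own.
vocabulary = ['2', 'a', 'ala', 'była', 'chomika', 'i', 'jan', 'jeździ',
              'konfliktem', 'kota', 'lubi', 'ma', 'motorze', 'na', 'ola',
              'oraz', 'psa', 'psy', 'rowerze', 'także', 'tomek', 'wielkim',
              'wojna', 'zbrojnym', 'zwierzęta', 'światowa']

def tf(document: list) -> np.array:
    # Count each token's occurrences at its vocabulary position.
    vec = np.zeros(len(vocabulary))
    for word in document:
        if word in vocabulary:
            vec[vocabulary.index(word)] += 1.0
    return vec
```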
documents_vectorized = list()
for document in documents_tokenized:
    document_vector = tf(document)
    documents_vectorized.append(document_vector)
documents_vectorized
[array([0., 0., 1., 0., 0., 1., 0., 0., 0., 1., 1., 1., 0., 0., 0., 1., 1., 0., 0., 0., 0., 0., 0., 0., 1., 0.]), array([0., 1., 0., 0., 1., 0., 0., 0., 0., 1., 1., 1., 0., 0., 1., 1., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0.]), array([0., 0., 0., 0., 0., 1., 1., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.]), array([1., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 1., 0., 1.]), array([0., 0., 0., 0., 0., 2., 0., 1., 0., 0., 1., 1., 1., 1., 0., 0., 1., 1., 1., 0., 1., 0., 0., 0., 0., 0.])]
IDF
# Raw-ratio IDF: number of documents divided by each term's document frequency
idf = len(documents_vectorized) / np.sum(np.array(documents_vectorized) != 0, axis=0)
display(idf)
array([5. , 5. , 5. , 5. , 5. , 1.66666667, 5. , 2.5 , 5. , 2.5 , 1.66666667, 1.66666667, 5. , 2.5 , 5. , 2.5 , 2.5 , 5. , 2.5 , 5. , 5. , 5. , 5. , 5. , 2.5 , 5. ])
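The raw-ratio IDF above (N divided by document frequency, no logarithm) can be checked on a toy TF matrix:

```python
import numpy as np

# Toy TF matrix: 2 documents over 3 vocabulary terms.
tf_matrix = np.array([[0., 1., 2.],
                      [1., 0., 1.]])

# IDF as in the cell above: term appearing in 1 of 2 docs -> 2.0, in both -> 1.0.
idf = len(tf_matrix) / np.sum(tf_matrix != 0, axis=0)
```

Note that this variant is undefined for a term that occurs in no document; classic formulations add a logarithm and smoothing, which is worth keeping in mind when comparing with library implementations.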
📝 Task 1.3 (1 pt)
Write a function that returns the cosine similarity between two documents in vectorized form.
def similarity(query: np.array, document: np.array) -> float:
    # TODO
    pass
documents[0]
'Ala lubi zwierzęta i ma kota oraz psa!'
documents_vectorized[0]
array([0., 0., 1., 0., 0., 1., 0., 0., 0., 1., 1., 1., 0., 0., 0., 1., 1., 0., 0., 0., 0., 0., 0., 0., 1., 0.])
documents[1]
'Ola lubi zwierzęta oraz ma kota a także chomika!'
documents_vectorized[1]
array([0., 1., 0., 0., 1., 0., 0., 0., 0., 1., 1., 1., 0., 0., 1., 1., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0.])
similarity(documents_vectorized[0], documents_vectorized[1])
0.5892556509887895
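The similarity function could be sketched as follows (a straightforward cosine similarity; the zero-vector guard is an assumption, since the task does not say how to handle empty queries):

```python
import numpy as np

def similarity(query: np.array, document: np.array) -> float:
    # Cosine similarity: dot product divided by the product of the norms.
    denom = np.linalg.norm(query) * np.linalg.norm(document)
    return float(np.dot(query, document) / denom) if denom else 0.0
```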
A simple search engine
def transform_query(query):
    """Cleans and tokenizes the query, then converts it to a TF vector."""
    query_vector = tf(tokenize_str(get_str_cleaned(query)))
    return query_vector
similarity(transform_query('psa kota'), documents_vectorized[0])
0.4999999999999999
query = 'psa kota'
for i in range(len(documents)):
    display(documents[i])
    display(similarity(transform_query(query), documents_vectorized[i]))
'Ala lubi zwierzęta i ma kota oraz psa!'
0.4999999999999999
'Ola lubi zwierzęta oraz ma kota a także chomika!'
0.2357022603955158
'I Jan jeździ na rowerze.'
0.0
'2 wojna światowa była wielkim konfliktem zbrojnym'
0.0
'Tomek lubi psy, ma psa i jeździ na motorze i rowerze.'
0.19611613513818402
# this is why cosine similarity needs the denominator (length normalization)
query = 'rowerze'
for i in range(len(documents)):
    display(documents[i])
    display(similarity(transform_query(query), documents_vectorized[i]))
'Ala lubi zwierzęta i ma kota oraz psa!'
0.0
'Ola lubi zwierzęta oraz ma kota a także chomika!'
0.0
'I Jan jeździ na rowerze.'
0.4472135954999579
'2 wojna światowa była wielkim konfliktem zbrojnym'
0.0
'Tomek lubi psy, ma psa i jeździ na motorze i rowerze.'
0.2773500981126146
# this is why we need term frequency: more occurrences mean a better-matching document
query = 'i'
for i in range(len(documents)):
    display(documents[i])
    display(similarity(transform_query(query), documents_vectorized[i]))
'Ala lubi zwierzęta i ma kota oraz psa!'
0.35355339059327373
'Ola lubi zwierzęta oraz ma kota a także chomika!'
0.0
'I Jan jeździ na rowerze.'
0.4472135954999579
'2 wojna światowa była wielkim konfliktem zbrojnym'
0.0
'Tomek lubi psy, ma psa i jeździ na motorze i rowerze.'
0.5547001962252291
# this is why we need IDF: so that more informative words get higher weight
query = 'i chomika'
for i in range(len(documents)):
    display(documents[i])
    display(similarity(transform_query(query), documents_vectorized[i]))
'Ala lubi zwierzęta i ma kota oraz psa!'
0.24999999999999994
'Ola lubi zwierzęta oraz ma kota a także chomika!'
0.2357022603955158
'I Jan jeździ na rowerze.'
0.31622776601683794
'2 wojna światowa była wielkim konfliktem zbrojnym'
0.0
'Tomek lubi psy, ma psa i jeździ na motorze i rowerze.'
0.39223227027636803
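The three observations above (normalize by length, count repeated terms, down-weight common terms) combine into a full TF–IDF search. A compact end-to-end sketch on a toy corpus, using the same raw-ratio IDF as in this notebook (function names here are illustrative, not part of the lab API):

```python
import numpy as np

# Toy corpus of tokenized documents.
corpus = [['ala', 'ma', 'kota'],
          ['ola', 'ma', 'psa'],
          ['jan', 'ma', 'kota', 'i', 'psa']]
vocabulary = sorted({w for doc in corpus for w in doc})

def tf(document):
    vec = np.zeros(len(vocabulary))
    for word in document:
        if word in vocabulary:
            vec[vocabulary.index(word)] += 1.0
    return vec

# TF matrix weighted column-wise by IDF.
tf_matrix = np.array([tf(doc) for doc in corpus])
idf = len(tf_matrix) / np.sum(tf_matrix != 0, axis=0)
tfidf_matrix = tf_matrix * idf

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def search(query_tokens):
    # Weight the query the same way, then rank documents best-first.
    q = tf(query_tokens) * idf
    scores = [cosine(q, d) for d in tfidf_matrix]
    return sorted(range(len(corpus)), key=lambda i: -scores[i])
```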
Libraries
import numpy as np
import sklearn.metrics
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
newsgroups = fetch_20newsgroups()['data']
len(newsgroups)
11314
print(newsgroups[0])
From: lerxst@wam.umd.edu (where's my thing) Subject: WHAT car is this!? Nntp-Posting-Host: rac3.wam.umd.edu Organization: University of Maryland, College Park Lines: 15 I was wondering if anyone out there could enlighten me on this car I saw the other day. It was a 2-door sports car, looked to be from the late 60s/ early 70s. It was called a Bricklin. The doors were really small. In addition, the front bumper was separate from the rest of the body. This is all I know. If anyone can tellme a model name, engine specs, years of production, where this car is made, history, or whatever info you have on this funky looking car, please e-mail. Thanks, - IL ---- brought to you by your neighborhood Lerxst ----
Naive search
all_documents = list()
for document in newsgroups:
    if 'car' in document:
        all_documents.append(document)
print(all_documents[0])
From: lerxst@wam.umd.edu (where's my thing) Subject: WHAT car is this!? Nntp-Posting-Host: rac3.wam.umd.edu Organization: University of Maryland, College Park Lines: 15 I was wondering if anyone out there could enlighten me on this car I saw the other day. It was a 2-door sports car, looked to be from the late 60s/ early 70s. It was called a Bricklin. The doors were really small. In addition, the front bumper was separate from the rest of the body. This is all I know. If anyone can tellme a model name, engine specs, years of production, where this car is made, history, or whatever info you have on this funky looking car, please e-mail. Thanks, - IL ---- brought to you by your neighborhood Lerxst ----
print(all_documents[1])
From: guykuo@carson.u.washington.edu (Guy Kuo) Subject: SI Clock Poll - Final Call Summary: Final call for SI clock reports Keywords: SI,acceleration,clock,upgrade Article-I.D.: shelley.1qvfo9INNc3s Organization: University of Washington Lines: 11 NNTP-Posting-Host: carson.u.washington.edu A fair number of brave souls who upgraded their SI clock oscillator have shared their experiences for this poll. Please send a brief message detailing your experiences with the procedure. Top speed attained, CPU rated speed, add on cards and adapters, heat sinks, hour of usage per day, floppy disk functionality with 800 and 1.4 m floppies are especially requested. I will be summarizing in the next two days, so please add to the network knowledge base if you have done the clock upgrade and haven't answered this poll. Thanks. Guy Kuo <guykuo@u.washington.edu>
❔ Question
What are the problems with this approach?
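One concrete problem can be seen in the second hit above: substring matching fires on unrelated words, since 'car' also appears inside 'carson', 'card', or 'scar'. A minimal illustration (toy strings, not from the dataset):

```python
# Substring matching has false positives: every document below "contains" 'car',
# but only one is actually about a car.
docs = ['guykuo@carson.u.washington.edu posted a poll',
        'I mailed a birthday card',
        'my car would not start']
hits = [d for d in docs if 'car' in d]
```

It also misses morphological variants ('cars', 'Car' with different casing) and gives no ranking: every match counts the same regardless of how relevant the document is.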
TF–IDF and cosine similarity
vectorizer = TfidfVectorizer()
document_vectors = vectorizer.fit_transform(newsgroups)
document_vectors
<11314x130107 sparse matrix of type '<class 'numpy.float64'>' with 1787565 stored elements in Compressed Sparse Row format>
document_vectors[0]
<1x130107 sparse matrix of type '<class 'numpy.float64'>' with 89 stored elements in Compressed Sparse Row format>
document_vectors[0].todense()
matrix([[0., 0., 0., ..., 0., 0., 0.]])
document_vectors[0:4].todense()
matrix([[0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.]])
query_str = 'speed'
#query_str = 'speed car'
#query_str = 'spider man'
query_vector = vectorizer.transform([query_str])
similarities = sklearn.metrics.pairwise.cosine_similarity(query_vector, document_vectors)
print(np.sort(similarities)[0][-4:])
print(similarities.argsort()[0][-4:])
for i in range(1, 5):
    print(newsgroups[similarities.argsort()[0][-i]])
    print(np.sort(similarities)[0, -i])
    print('-'*100)
    print('-'*100)
    print('-'*100)
[0.17360013 0.22933014 0.28954818 0.45372239] [ 2455 8920 5497 11031] From: keiths@spider.co.uk (Keith Smith) Subject: win/NT file systems Organization: Spider Systems Limited, Edinburgh, UK. Lines: 6 Nntp-Posting-Host: trapdoor.spider.co.uk OK will some one out there tell me why / how DOS 5 can read (I havn't tried writing in case it breaks something) the Win/NT NTFS file system. I thought NTFS was supposed to be better than the FAT system keith 0.4537223924558256 ---------------------------------------------------------------------------------------------------- ---------------------------------------------------------------------------------------------------- ---------------------------------------------------------------------------------------------------- From: brandt@cs.unc.edu (Andrew Brandt) Subject: Seeking good Alfa Romeo mechanic. Organization: The University of North Carolina at Chapel Hill Lines: 14 NNTP-Posting-Host: axon.cs.unc.edu Keywords: alfa, romeo, spider, mechanic I am looking for recommendations for a good (great?) Alfa Romeo mechanic in South Jersey or Philadelphia or nearby. I have a '78 Alfa Spider that needs some engine, tranny, steering work done. The body is in quite good shape. The car is awful in cold weather, won't start if below freezing (I know, I know, why drive a Spider if there's snow on the ground ...). It has Bosch *mechanical* fuel injection that I am sure needs adjustment. Any opinions are welcome on what to look for or who to call. Email or post (to rec.autos), I will summarize if people want. Thx, Andy (brandt@cs.unc.edu) 0.28954817869991817 ---------------------------------------------------------------------------------------------------- ---------------------------------------------------------------------------------------------------- ---------------------------------------------------------------------------------------------------- From: michaelr@spider.co.uk (Michael S. A. 
Robb) Subject: Re: Honors Degrees: Do they mean anything? Organization: Spider Systems Limited, Edinburgh, UK. Lines: 44 In article <TKLD.93Apr2123341@burns.cogsci.ed.ac.uk> tkld@cogsci.ed.ac.uk (Kevin Davidson) writes: > >> In my opinion, a programming degree is still worth having. > > Yes, but a CS degree is *not* a programming degree. Does anybody know of >a computing course where *programming* is taught ? Computer Science is >a branch of maths (or the course I did was). > I've also done a Software Engineering course - much more practical and likely >to be the sort of thing an employer really wants, rather than what they think >they want, but also did not teach programming. The ability to program was >an entry requirement. At Robert Gordon University, programming was the main (most time-consuming) start of the course. The first two years consisted of five subjects: Software Engineering (Pascal/C/UNIX), Computer Engineering (6502/6809/68000 assembler), Computer Theory (LISP/Prolog), Mathematics/Statistics and Communication Skills (How to pass interviews/intelligence tests and group discussions e.g. How to survive a helicopter crash in the North Sea). The third year (Industrial placement) was spent working for a computer company for a year. The company could be anywhere in Europe (there was a special Travel Allowance Scheme to cover the visiting costs of professors). The fourth year included Operating Systems(C/Modula-2), Software Engineering (C/8086 assembler), Real Time Laboratory (C/68000 assembler) and Computing Theory (LISP). There were also Group Projects in 2nd and 4th Years, where students worked in teams to select their own project or decide to work for an outside company (the only disadvantage being that specifications would change suddenly). In the first four years, there was a 50%:50% weighting between courseworks and exams for most subjects. 
However in the Honours year, this was reduced to a 30%:70% split between an Individual Project and final exams (no coursework assessment) - are all Computer Science courses like this? BTW - we started off with 22 students in our first year and were left with 8 by Honours year. Also, every course is tutored separately. Not easy trying to sleep when you are in 8 student class :-). Cheers, Michael -- | Michael S. A. Robb | Tel: +44 31 554 9424 | "..The problem with bolt-on | Software Engineer | Fax: +44 31 554 0649 | software is making sure the | Spider Systems Limited | E-mail: | bolts are the right size.." | Edinburgh, EH6 5NG | michaelr@spider.co.uk | - Anonymous 0.22933013891071233 ---------------------------------------------------------------------------------------------------- ---------------------------------------------------------------------------------------------------- ---------------------------------------------------------------------------------------------------- From: jrm@elm.circa.ufl.edu (Jeff Mason) Subject: AUCTION: Marvel, DC, Valiant, Image, Dark Horse, etc... Organization: Univ. of Florida Psychology Dept. Lines: 59 NNTP-Posting-Host: elm.circa.ufl.edu I am auctioning off the following comics. These minimum bids are set below what I would normally sell them for. Make an offer, and I will accept the highest bid after the auction has been completed. TITLE Minimum/Current -------------------------------------------------------------- Alpha Flight 51 (Jim Lee's first work at Marvel) $ 5.00 Aliens 1 (1st app Aliens in comics, 1st prnt, May 1988) $20.00 Amazing Spider-Man 136 (Intro new Green Goblin) $20.00 Amazing Spider-Man 238 (1st appearance Hobgoblin) $50.00 Archer and Armstrong 1 (Frank Miller/Smith/Layton) $ 7.50 Avengers 263 (1st appearance X-factor) $ 3.50 Bloodshot 1 (Chromium cover, BWSmith Cover/Poster) $ 5.00 Daredevil 158 (Frank Miller art begins) $35.00 Dark Horse Presents 1 (1st app Concrete, 1st printing) $ 7.50 H.A.R.D. 
Corps 1 $ 5.00 Incredible Hulk 324 (1st app Grey Hulk since #1, 1962) $ 7.50 Incredible Hulk 330 (1st McFarlane issue) $15.00 Incredible Hulk 331 (Grey Hulk series begins) $11.20 Incredible Hulk 367 (1st Dale Keown art in Hulk) $15.00 Incredible Hulk 377 (1st all new hulk, 1st prnt, Keown) $15.00 Marvel Comics Presents 1 (Wolverine, Silver Surfer) $ 7.50 Maxx Limited Ashcan (4000 copies exist, blue cover) $30.00 New Mutants 86 (McFarlane cover, 1st app Cable - cameo) $10.00 New Mutants 100 (1st app X-Force) $ 5.00 New Mutants Annual 5 (1st Liefeld art on New Mutants) $10.00 Omega Men 3 (1st appearance Lobo) $ 7.50 Omega Men 10 (1st full Lobo story) $ 7.50 Power Man & Iron Fist 78 (3rd appearance Sabretooth) $25.00 84 (4th appearance Sabretooth) $20.00 Simpsons Comics and Stories 1 (Polybagged special ed.) $ 7.50 Spectacular Spider-Man 147 (1st app New Hobgoblin) $12.50 Star Trek the Next Generation 1 (Feb 1988, DC mini) $ 7.50 Star Trek the Next Generation 1 (Oct 1989, DC comics) $ 7.50 Web of Spider-Man 29 (Hobgoblin, Wolverine appear) $10.00 Web of Spider-Man 30 (Origin Rose, Hobgoblin appears) $ 7.50 Wolverine 10 (Before claws, 1st battle with Sabretooth) $15.00 Wolverine 41 (Sabretooth claims to be Wolverine's dad) $ 5.00 Wolverine 42 (Sabretooth proven not to be his dad) $ 3.50 Wolverine 43 (Sabretooth/Wolverine saga concludes) $ 3.00 Wolverine 1 (1982 mini-series, Miller art) $20.00 Wonder Woman 267 (Return of Animal Man) $12.50 X-Force 1 (Signed by Liefeld, Bagged, X-Force card) $20.00 X-Force 1 (Signed by Liefeld, Bagged, Shatterstar card) $10.00 X-Force 1 (Signed by Liefeld, Bagged, Deadpool card) $10.00 X-Force 1 (Signed by Liefeld, Bagged, Sunspot/Gideon) $10.00 All comics are in near mint to mint condition, are bagged in shiny polypropylene bags, and backed with white acid free boards. Shipping is $1.50 for one book, $3.00 for more than one book, or free if you order a large enough amount of stuff. I am willing to haggle. 
I have thousands and thousands of other comics, so please let me know what you've been looking for, and maybe I can help. Some titles I have posted here don't list every issue I have of that title, I tried to save space. -- Geoffrey R. Mason | jrm@elm.circa.ufl.edu Department of Psychology | mason@webb.psych.ufl.edu University of Florida | prothan@maple.circa.ufl.edu 0.17360012846950526 ---------------------------------------------------------------------------------------------------- ---------------------------------------------------------------------------------------------------- ----------------------------------------------------------------------------------------------------
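The query cell above can be wrapped into a small reusable search function. A sketch on a toy in-memory corpus (the corpus and the name `top_k_search` are illustrative; in the lab you would pass `newsgroups` and its fitted vectors instead):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus standing in for the 20 Newsgroups data.
corpus = ['the quick brown fox',
          'a fast brown dog',
          'the lazy dog sleeps',
          'quick thinking wins']

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(corpus)

def top_k_search(query, k=2):
    # Vectorize the query with the same fitted vectorizer, score all documents,
    # and return the k best (document, score) pairs sorted best-first.
    query_vector = vectorizer.transform([query])
    sims = cosine_similarity(query_vector, doc_vectors)[0]
    top = np.argsort(sims)[::-1][:k]
    return [(corpus[i], float(sims[i])) for i in top]
```

This is essentially what Task 1.4 asks for, minus the larger corpus.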
📝 Task 1.4 (4 pts)
Choose a text corpus with at least 10,000 documents (different from the one in this example). Using it, build a search engine that uses TF–IDF and cosine similarity to score document relevance. The search engine should return several best-matching documents, sorted, together with their scores.