ugb/1_TFIDF.ipynb
Paweł Skórzewski 3e045c950b Lab. 1
2024-04-29 08:56:39 +02:00


Deep Learning for Text Processing – Lab Sessions

1. TFIDF

import numpy as np
import re

Document collection

documents = ['Ala lubi zwierzęta i ma kota oraz psa!',
             'Ola lubi zwierzęta oraz ma kota a także chomika!',
             'I Jan jeździ na rowerze.',
             '2 wojna światowa była wielkim konfliktem zbrojnym',
             'Tomek lubi psy, ma psa  i jeździ na motorze i rowerze.',
            ]

What do we need?

  • We want to turn the texts into a set of words.

Questions

  • Can we tokenize the text with document.split(' ')?
  • What difficulties might we run into?

Preprocessing

def get_str_cleaned(str_dirty):
    """Lowercase the text, collapse repeated spaces, and strip punctuation."""
    punctuation = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
    new_str = str_dirty.lower()
    new_str = re.sub(' +', ' ', new_str)
    for char in punctuation:
        new_str = new_str.replace(char, '')
    return new_str
sample_document = get_str_cleaned(documents[0])
sample_document
'ala lubi zwierzęta i ma kota oraz psa'

Tokenization

def tokenize_str(document):
    return document.split(' ')
tokenize_str(sample_document)
['ala', 'lubi', 'zwierzęta', 'i', 'ma', 'kota', 'oraz', 'psa']
documents_cleaned = [get_str_cleaned(document) for document in documents]
documents_cleaned
['ala lubi zwierzęta i ma kota oraz psa',
 'ola lubi zwierzęta oraz ma kota a także chomika',
 'i jan jeździ na rowerze',
 '2 wojna światowa była wielkim konfliktem zbrojnym',
 'tomek lubi psy ma psa i jeździ na motorze i rowerze']
documents_tokenized = [tokenize_str(d) for d in documents_cleaned]
documents_tokenized
[['ala', 'lubi', 'zwierzęta', 'i', 'ma', 'kota', 'oraz', 'psa'],
 ['ola', 'lubi', 'zwierzęta', 'oraz', 'ma', 'kota', 'a', 'także', 'chomika'],
 ['i', 'jan', 'jeździ', 'na', 'rowerze'],
 ['2', 'wojna', 'światowa', 'była', 'wielkim', 'konfliktem', 'zbrojnym'],
 ['tomek',
  'lubi',
  'psy',
  'ma',
  'psa',
  'i',
  'jeździ',
  'na',
  'motorze',
  'i',
  'rowerze']]

Questions

  • What is the next step toward building TF or TFIDF vectors?
  • What size will a TF or TFIDF vector be?

Building the vocabulary

vocabulary = []
for document in documents_tokenized:
    for word in document:
        vocabulary.append(word)
vocabulary = sorted(set(vocabulary))
vocabulary
['2',
 'a',
 'ala',
 'była',
 'chomika',
 'i',
 'jan',
 'jeździ',
 'konfliktem',
 'kota',
 'lubi',
 'ma',
 'motorze',
 'na',
 'ola',
 'oraz',
 'psa',
 'psy',
 'rowerze',
 'także',
 'tomek',
 'wielkim',
 'wojna',
 'zbrojnym',
 'zwierzęta',
 'światowa']

📝 Exercise 1.1 (1 pt)

Write a function word_to_index(word: str) that returns the one-hot vector for the given word as a numpy.array.

Assume the vocabulary is given by the global variable vocabulary.

def word_to_index(word: str) -> np.ndarray:
    # TODO
    pass
word_to_index('psa')
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,
       0., 0., 0., 0., 0., 0., 0., 0., 0.])
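One possible solution sketch for Exercise 1.1. The vocabulary built above is reproduced here so the cell runs standalone; note that list.index raises ValueError for a word outside the vocabulary, which matches the exercise's assumption that only known words are queried:

```python
import numpy as np

# the vocabulary built above, reproduced so this cell runs standalone
vocabulary = ['2', 'a', 'ala', 'była', 'chomika', 'i', 'jan', 'jeździ',
              'konfliktem', 'kota', 'lubi', 'ma', 'motorze', 'na', 'ola',
              'oraz', 'psa', 'psy', 'rowerze', 'także', 'tomek', 'wielkim',
              'wojna', 'zbrojnym', 'zwierzęta', 'światowa']

def word_to_index(word: str) -> np.ndarray:
    """Return the one-hot vector for `word` over the vocabulary."""
    one_hot = np.zeros(len(vocabulary))
    one_hot[vocabulary.index(word)] = 1.0
    return one_hot
```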

📝 Exercise 1.2 (1 pt)

Write a function that converts a list of words into a TF vector.

def tf(document: list) -> np.ndarray:
    # TODO
    pass
tf(documents_tokenized[0])
array([0., 0., 1., 0., 0., 1., 0., 0., 0., 1., 1., 1., 0., 0., 0., 1., 1.,
       0., 0., 0., 0., 0., 0., 0., 1., 0.])
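One possible solution sketch for Exercise 1.2, again with the vocabulary inlined so the cell is self-contained; words outside the vocabulary are simply skipped (an assumption, since the exercise does not say how to handle them):

```python
import numpy as np

# the vocabulary built above, reproduced so this cell runs standalone
vocabulary = ['2', 'a', 'ala', 'była', 'chomika', 'i', 'jan', 'jeździ',
              'konfliktem', 'kota', 'lubi', 'ma', 'motorze', 'na', 'ola',
              'oraz', 'psa', 'psy', 'rowerze', 'także', 'tomek', 'wielkim',
              'wojna', 'zbrojnym', 'zwierzęta', 'światowa']

def tf(document: list) -> np.ndarray:
    """Count how many times each vocabulary word occurs in the tokenized document."""
    vector = np.zeros(len(vocabulary))
    for word in document:
        if word in vocabulary:
            vector[vocabulary.index(word)] += 1.0
    return vector
```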
documents_vectorized = list()
for document in documents_tokenized:
    document_vector = tf(document)
    documents_vectorized.append(document_vector)
documents_vectorized
[array([0., 0., 1., 0., 0., 1., 0., 0., 0., 1., 1., 1., 0., 0., 0., 1., 1.,
        0., 0., 0., 0., 0., 0., 0., 1., 0.]),
 array([0., 1., 0., 0., 1., 0., 0., 0., 0., 1., 1., 1., 0., 0., 1., 1., 0.,
        0., 0., 1., 0., 0., 0., 0., 1., 0.]),
 array([0., 0., 0., 0., 0., 1., 1., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0.,
        0., 1., 0., 0., 0., 0., 0., 0., 0.]),
 array([1., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 1., 1., 1., 0., 1.]),
 array([0., 0., 0., 0., 0., 2., 0., 1., 0., 0., 1., 1., 1., 1., 0., 0., 1.,
        1., 1., 0., 1., 0., 0., 0., 0., 0.])]

IDF

# IDF as the raw ratio N/df: total document count divided by the number
# of documents each word occurs in (no logarithm applied here)
idf = len(documents_vectorized) / np.sum(np.array(documents_vectorized) != 0, axis=0)
display(idf)
array([5.        , 5.        , 5.        , 5.        , 5.        ,
       1.66666667, 5.        , 2.5       , 5.        , 2.5       ,
       1.66666667, 1.66666667, 5.        , 2.5       , 5.        ,
       2.5       , 2.5       , 5.        , 2.5       , 5.        ,
       5.        , 5.        , 5.        , 5.        , 2.5       ,
       5.        ])
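The cell above uses the raw ratio N/df. Textbook TFIDF usually damps this with a logarithm and then multiplies each TF vector elementwise by the IDF weights. A minimal sketch on a toy 3×3 count matrix (the numbers are illustrative assumptions, not the notebook's corpus):

```python
import numpy as np

# toy document-term count matrix (3 documents, 3 vocabulary words);
# in the notebook these rows would come from tf()
tf_matrix = np.array([
    [1., 0., 2.],
    [0., 1., 1.],
    [1., 1., 0.],
])

n_docs = tf_matrix.shape[0]
df = np.sum(tf_matrix != 0, axis=0)   # document frequency of each word
idf_raw = n_docs / df                 # raw N/df ratio, as in the cell above
idf_log = np.log(n_docs / df) + 1.0   # a common log-damped variant

tfidf = tf_matrix * idf_log           # elementwise TF × IDF weighting
```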

📝 Exercise 1.3 (1 pt)

Write a function that returns the cosine similarity between two documents given in vectorized form.

def similarity(query: np.ndarray, document: np.ndarray) -> float:
    # TODO
    pass
documents[0]
'Ala lubi zwierzęta i ma kota oraz psa!'
documents_vectorized[0]
array([0., 0., 1., 0., 0., 1., 0., 0., 0., 1., 1., 1., 0., 0., 0., 1., 1.,
       0., 0., 0., 0., 0., 0., 0., 1., 0.])
documents[1]
'Ola lubi zwierzęta oraz ma kota a także chomika!'
documents_vectorized[1]
array([0., 1., 0., 0., 1., 0., 0., 0., 0., 1., 1., 1., 0., 0., 1., 1., 0.,
       0., 0., 1., 0., 0., 0., 0., 1., 0.])
similarity(documents_vectorized[0], documents_vectorized[1])
0.5892556509887895
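One possible solution sketch for Exercise 1.3: the dot product divided by the product of the vector norms, with a guard so a zero vector (e.g. an all-out-of-vocabulary query) returns 0.0 instead of dividing by zero:

```python
import numpy as np

def similarity(query: np.ndarray, document: np.ndarray) -> float:
    """Cosine similarity between two vectorized documents."""
    denom = np.linalg.norm(query) * np.linalg.norm(document)
    if denom == 0:
        return 0.0
    return float(np.dot(query, document) / denom)
```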

A simple search engine

def transform_query(query):
    """Clean, tokenize, and vectorize the query the same way as the documents."""
    query_vector = tf(tokenize_str(get_str_cleaned(query)))
    return query_vector
similarity(transform_query('psa kota'), documents_vectorized[0])
0.4999999999999999
query = 'psa kota'
for i in range(len(documents)):
    display(documents[i])
    display(similarity(transform_query(query), documents_vectorized[i]))
'Ala lubi zwierzęta i ma kota oraz psa!'
0.4999999999999999
'Ola lubi zwierzęta oraz ma kota a także chomika!'
0.2357022603955158
'I Jan jeździ na rowerze.'
0.0
'2 wojna światowa była wielkim konfliktem zbrojnym'
0.0
'Tomek lubi psy, ma psa  i jeździ na motorze i rowerze.'
0.19611613513818402
# this is why cosine similarity needs the denominator (normalization by vector length)
query = 'rowerze'
for i in range(len(documents)):
    display(documents[i])
    display(similarity(transform_query(query), documents_vectorized[i]))
'Ala lubi zwierzęta i ma kota oraz psa!'
0.0
'Ola lubi zwierzęta oraz ma kota a także chomika!'
0.0
'I Jan jeździ na rowerze.'
0.4472135954999579
'2 wojna światowa była wielkim konfliktem zbrojnym'
0.0
'Tomek lubi psy, ma psa  i jeździ na motorze i rowerze.'
0.2773500981126146
# this is why we need term frequency → more occurrences mean a better-matching document
query = 'i'
for i in range(len(documents)):
    display(documents[i])
    display(similarity(transform_query(query), documents_vectorized[i]))
'Ala lubi zwierzęta i ma kota oraz psa!'
0.35355339059327373
'Ola lubi zwierzęta oraz ma kota a także chomika!'
0.0
'I Jan jeździ na rowerze.'
0.4472135954999579
'2 wojna światowa była wielkim konfliktem zbrojnym'
0.0
'Tomek lubi psy, ma psa  i jeździ na motorze i rowerze.'
0.5547001962252291
# this is why we need IDF – so that more informative words get a higher weight
query = 'i chomika'
for i in range(len(documents)):
    display(documents[i])
    display(similarity(transform_query(query), documents_vectorized[i]))
'Ala lubi zwierzęta i ma kota oraz psa!'
0.24999999999999994
'Ola lubi zwierzęta oraz ma kota a także chomika!'
0.2357022603955158
'I Jan jeździ na rowerze.'
0.31622776601683794
'2 wojna światowa była wielkim konfliktem zbrojnym'
0.0
'Tomek lubi psy, ma psa  i jeździ na motorze i rowerze.'
0.39223227027636803
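The effect the comments point at can be sketched on a toy example: with plain TF counts, the frequent word 'i' and the rare word 'chomika' pull equally, but after multiplying both query and documents by the IDF weights computed above (5/3 for 'i', 5 for 'chomika'), the rare word dominates the ranking. The two-word vocabulary and count vectors below are illustrative assumptions, not the full 26-word vocabulary:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity with a zero-vector guard."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

# toy 2-term vocabulary: ['i', 'chomika']; IDF weights from the cell above
idf = np.array([5 / 3, 5.0])
query = np.array([1., 1.])   # query 'i chomika'
doc_a = np.array([2., 0.])   # document with two 'i', no 'chomika'
doc_b = np.array([0., 1.])   # document containing 'chomika'

plain_a, plain_b = cosine(query, doc_a), cosine(query, doc_b)
weighted_a = cosine(query * idf, doc_a * idf)
weighted_b = cosine(query * idf, doc_b * idf)
```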

Libraries

import numpy as np
import sklearn.metrics

from sklearn.datasets import fetch_20newsgroups

from sklearn.feature_extraction.text import TfidfVectorizer
newsgroups = fetch_20newsgroups()['data']
len(newsgroups)
11314
print(newsgroups[0])
From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
   ---- brought to you by your neighborhood Lerxst ----





Naive search

all_documents = list() 
for document in newsgroups:
    if 'car' in document:
        all_documents.append(document)
print(all_documents[0])
From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
   ---- brought to you by your neighborhood Lerxst ----





print(all_documents[1])
From: guykuo@carson.u.washington.edu (Guy Kuo)
Subject: SI Clock Poll - Final Call
Summary: Final call for SI clock reports
Keywords: SI,acceleration,clock,upgrade
Article-I.D.: shelley.1qvfo9INNc3s
Organization: University of Washington
Lines: 11
NNTP-Posting-Host: carson.u.washington.edu

A fair number of brave souls who upgraded their SI clock oscillator have
shared their experiences for this poll. Please send a brief message detailing
your experiences with the procedure. Top speed attained, CPU rated speed,
add on cards and adapters, heat sinks, hour of usage per day, floppy disk
functionality with 800 and 1.4 m floppies are especially requested.

I will be summarizing in the next two days, so please add to the network
knowledge base if you have done the clock upgrade and haven't answered this
poll. Thanks.

Guy Kuo <guykuo@u.washington.edu>

Question

What are the problems with this approach?

TFIDF and cosine similarity

vectorizer = TfidfVectorizer()
document_vectors = vectorizer.fit_transform(newsgroups)
document_vectors
<11314x130107 sparse matrix of type '<class 'numpy.float64'>'
	with 1787565 stored elements in Compressed Sparse Row format>
document_vectors[0]
<1x130107 sparse matrix of type '<class 'numpy.float64'>'
	with 89 stored elements in Compressed Sparse Row format>
document_vectors[0].todense()
matrix([[0., 0., 0., ..., 0., 0., 0.]])
document_vectors[0:4].todense()
matrix([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]])
query_str = 'speed'
#query_str = 'speed car'
#query_str = 'spider man'
query_vector = vectorizer.transform([query_str])
similarities = sklearn.metrics.pairwise.cosine_similarity(query_vector, document_vectors)
print(np.sort(similarities)[0][-4:])
print(similarities.argsort()[0][-4:])

for i in range(1, 5):
    print(newsgroups[similarities.argsort()[0][-i]])
    print(np.sort(similarities)[0, -i])
    print('-' * 100)
    print('-' * 100)
    print('-' * 100)
[0.17360013 0.22933014 0.28954818 0.45372239]
[ 2455  8920  5497 11031]
From: keiths@spider.co.uk (Keith Smith)
Subject: win/NT file systems
Organization: Spider Systems Limited, Edinburgh, UK.
Lines: 6
Nntp-Posting-Host: trapdoor.spider.co.uk

OK will some one out there tell me why / how DOS 5
can read (I havn't tried writing in case it breaks something)
the Win/NT NTFS file system.
I thought NTFS was supposed to be better than the FAT system

keith

0.4537223924558256
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
From: brandt@cs.unc.edu (Andrew Brandt)
Subject: Seeking good Alfa Romeo mechanic.
Organization: The University of North Carolina at Chapel Hill
Lines: 14
NNTP-Posting-Host: axon.cs.unc.edu
Keywords: alfa, romeo, spider, mechanic

I am looking for recommendations for a good (great?) Alfa Romeo
mechanic in South Jersey or Philadelphia or nearby.

I have a '78 Alfa Spider that needs some engine, tranny, steering work
done.  The body is in quite good shape.  The car is awful in cold
weather, won't start if below freezing (I know, I know, why drive a
Spider if there's snow on the ground ...).  It has Bosch *mechanical*
fuel injection that I am sure needs adjustment.

Any opinions are welcome on what to look for or who to call.

Email or post (to rec.autos), I will summarize if people want.

Thx, Andy (brandt@cs.unc.edu)

0.28954817869991817
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
From: michaelr@spider.co.uk (Michael S. A. Robb)
Subject: Re: Honors Degrees: Do they mean anything?
Organization: Spider Systems Limited, Edinburgh, UK.
Lines: 44

In article <TKLD.93Apr2123341@burns.cogsci.ed.ac.uk> tkld@cogsci.ed.ac.uk (Kevin Davidson) writes:
>
>>   In my opinion, a programming degree is still worth having.
>
> Yes, but a CS degree is *not* a programming degree. Does anybody know of
>a computing course where *programming* is taught ? Computer Science is
>a branch of maths (or the course I did was).
> I've also done a Software Engineering course - much more practical and likely
>to be the sort of thing an employer really wants, rather than what they think
>they want, but also did not teach programming. The ability to program was
>an entry requirement.

At Robert Gordon University, programming was the main (most time-consuming) 
start of the course. The first two years consisted of five subjects:
Software Engineering (Pascal/C/UNIX), Computer Engineering (6502/6809/68000 
assembler), Computer Theory (LISP/Prolog), Mathematics/Statistics and 
Communication Skills (How to pass interviews/intelligence tests and group
discussions e.g. How to survive a helicopter crash in the North Sea).
The third year (Industrial placement) was spent working for a computer company 
for a year. The company could be anywhere in Europe (there was a special 
Travel Allowance Scheme to cover the visiting costs of professors).  
The fourth year included Operating Systems(C/Modula-2), Software Engineering 
(C/8086 assembler), Real Time Laboratory (C/68000 assembler) and Computing 
Theory (LISP).  There were also Group Projects in 2nd and 4th Years, where 
students worked in teams to select their own project or decide to work for an 
outside company (the only disadvantage being that specifications would change 
suddenly).
 
In the first four years, there was a 50%:50% weighting between courseworks and 
exams for most subjects. However in the Honours year, this was reduced to a 
30%:70% split between an Individual Project and final exams (no coursework 
assessment) - are all Computer Science courses like this?

BTW - we started off with 22 students in our first year and were left with 8 by
Honours year. Also, every course is tutored separately. Not easy trying
to sleep when you are in 8 student class :-). 

Cheers,
  Michael 
-- 
| Michael S. A. Robb     | Tel: +44 31 554 9424  | "..The problem with bolt-on
| Software Engineer      | Fax: +44 31 554 0649  |  software is making sure the
| Spider Systems Limited | E-mail:               |  bolts are the right size.."
| Edinburgh, EH6 5NG     | michaelr@spider.co.uk |             - Anonymous

0.22933013891071233
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
From: jrm@elm.circa.ufl.edu (Jeff Mason)
Subject: AUCTION: Marvel, DC, Valiant, Image, Dark Horse, etc...
Organization: Univ. of Florida Psychology Dept.
Lines: 59
NNTP-Posting-Host: elm.circa.ufl.edu

I am auctioning off the following comics.  These minimum bids are set
below what I would normally sell them for.  Make an offer, and I will
accept the highest bid after the auction has been completed.

TITLE                                                   Minimum/Current 
--------------------------------------------------------------
Alpha Flight 51 (Jim Lee's first work at Marvel)	$ 5.00
Aliens 1 (1st app Aliens in comics, 1st prnt, May 1988)	$20.00
Amazing Spider-Man 136 (Intro new Green Goblin)         $20.00
Amazing Spider-Man 238 (1st appearance Hobgoblin)	$50.00
Archer and Armstrong 1 (Frank Miller/Smith/Layton)	$ 7.50
Avengers 263 (1st appearance X-factor)                  $ 3.50
Bloodshot 1 (Chromium cover, BWSmith Cover/Poster)	$ 5.00
Daredevil 158 (Frank Miller art begins)                 $35.00
Dark Horse Presents 1 (1st app Concrete, 1st printing)	$ 7.50 
H.A.R.D. Corps 1 					$ 5.00
Incredible Hulk 324 (1st app Grey Hulk since #1, 1962)	$ 7.50
Incredible Hulk 330 (1st McFarlane issue)		$15.00
Incredible Hulk 331 (Grey Hulk series begins)		$11.20	
Incredible Hulk 367 (1st Dale Keown art in Hulk)        $15.00
Incredible Hulk 377 (1st all new hulk, 1st prnt, Keown) $15.00
Marvel Comics Presents 1 (Wolverine, Silver Surfer)     $ 7.50
Maxx Limited Ashcan (4000 copies exist, blue cover)	$30.00
New Mutants 86 (McFarlane cover, 1st app Cable - cameo)	$10.00
New Mutants 100 (1st app X-Force)                       $ 5.00
New Mutants Annual 5 (1st Liefeld art on New Mutants)	$10.00
Omega Men 3 (1st appearance Lobo)                       $ 7.50
Omega Men 10 (1st full Lobo story)                      $ 7.50
Power Man & Iron Fist 78 (3rd appearance Sabretooth)    $25.00
                      84 (4th appearance Sabretooth)    $20.00
Simpsons Comics and Stories 1 (Polybagged special ed.)	$ 7.50
Spectacular Spider-Man 147 (1st app New Hobgoblin)      $12.50
Star Trek the Next Generation 1 (Feb 1988, DC mini)     $ 7.50
Star Trek the Next Generation 1 (Oct 1989, DC comics)   $ 7.50
Web of Spider-Man 29 (Hobgoblin, Wolverine appear)      $10.00 
Web of Spider-Man 30 (Origin Rose, Hobgoblin appears)   $ 7.50
Wolverine 10 (Before claws, 1st battle with Sabretooth)	$15.00
Wolverine 41 (Sabretooth claims to be Wolverine's dad)	$ 5.00
Wolverine 42 (Sabretooth proven not to be his dad)	$ 3.50
Wolverine 43 (Sabretooth/Wolverine saga concludes)	$ 3.00
Wolverine 1 (1982 mini-series, Miller art)		$20.00
Wonder Woman 267 (Return of Animal Man)                 $12.50
X-Force 1 (Signed by Liefeld, Bagged, X-Force card)     $20.00
X-Force 1 (Signed by Liefeld, Bagged, Shatterstar card) $10.00
X-Force 1 (Signed by Liefeld, Bagged, Deadpool card)    $10.00
X-Force 1 (Signed by Liefeld, Bagged, Sunspot/Gideon)   $10.00

All comics are in near mint to mint condition, are bagged in shiny 
polypropylene bags, and backed with white acid free boards.  Shipping is
$1.50 for one book, $3.00 for more than one book, or free if you order 
a large enough amount of stuff.  I am willing to haggle.

I have thousands and thousands of other comics, so please let me know what 
you've been looking for, and maybe I can help.  Some titles I have posted
here don't list every issue I have of that title, I tried to save space.
-- 
Geoffrey R. Mason		|	jrm@elm.circa.ufl.edu
Department of Psychology	|	mason@webb.psych.ufl.edu
University of Florida		|	prothan@maple.circa.ufl.edu

0.17360012846950526
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------

📝 Exercise 1.4 (4 pts)

Choose a text corpus with at least 10,000 documents (different from the one used in this example). Build a search engine on top of it that uses TFIDF and cosine similarity to score how well documents match a query. The search engine should return several top-matching documents, sorted, together with their scores.
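A possible skeleton for such a search engine, demonstrated on a tiny inline corpus so the cell runs standalone; for the exercise you would pass in your own ≥10,000-document dataset instead. `build_search_engine` and `toy_corpus` are names introduced here for illustration, not part of scikit-learn:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def build_search_engine(corpus):
    """Fit a TFIDF vectorizer on the corpus and return a search function."""
    vectorizer = TfidfVectorizer()
    doc_vectors = vectorizer.fit_transform(corpus)

    def search(query: str, top_k: int = 5):
        """Return the top_k best-matching documents with their cosine scores."""
        query_vector = vectorizer.transform([query])
        scores = cosine_similarity(query_vector, doc_vectors)[0]
        best = scores.argsort()[::-1][:top_k]
        return [(corpus[i], float(scores[i])) for i in best]

    return search

# toy corpus just to demonstrate the interface
toy_corpus = ['the cat sat on the mat', 'dogs bark loudly', 'a cat and a dog']
search = build_search_engine(toy_corpus)
```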