aitech-eks-pub/cw/03b_tfidf_newsgroup.ipynb
Jakub Pokrywka 5acffc0265 add 03
2021-09-27 14:02:30 +02:00


Information Extraction

3. TF-IDF (2) [exercises]

Jakub Pokrywka (2021)


Imports

import numpy as np
import sklearn.metrics

from sklearn.datasets import fetch_20newsgroups

from sklearn.feature_extraction.text import TfidfVectorizer

Dataset

newsgroups = fetch_20newsgroups()['data']
len(newsgroups)
11314
print(newsgroups[0])
From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
   ---- brought to you by your neighborhood Lerxst ----





Naive search

all_documents = list() 
for document in newsgroups:
    if 'car' in document:
        all_documents.append(document)
print(all_documents[0])
From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
   ---- brought to you by your neighborhood Lerxst ----





print(all_documents[1])
From: guykuo@carson.u.washington.edu (Guy Kuo)
Subject: SI Clock Poll - Final Call
Summary: Final call for SI clock reports
Keywords: SI,acceleration,clock,upgrade
Article-I.D.: shelley.1qvfo9INNc3s
Organization: University of Washington
Lines: 11
NNTP-Posting-Host: carson.u.washington.edu

A fair number of brave souls who upgraded their SI clock oscillator have
shared their experiences for this poll. Please send a brief message detailing
your experiences with the procedure. Top speed attained, CPU rated speed,
add on cards and adapters, heat sinks, hour of usage per day, floppy disk
functionality with 800 and 1.4 m floppies are especially requested.

I will be summarizing in the next two days, so please add to the network
knowledge base if you have done the clock upgrade and haven't answered this
poll. Thanks.

Guy Kuo <guykuo@u.washington.edu>

What are the problems with this approach?
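
One problem shows up in the second result above: `'car' in document` is a substring test, so it also matches `carson` and `cards`, and it gives no ranking of the hits. Below is a minimal sketch of a whole-word check, assuming simple regex word boundaries are good enough. The helper name `contains_word` and the sample string are hypothetical, made up for illustration.

```python
import re


def contains_word(document: str, word: str) -> bool:
    """Return True if `word` occurs in `document` as a whole token (case-insensitive)."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return re.search(pattern, document, flags=re.IGNORECASE) is not None


# Hypothetical sample built from fragments of the second matched post.
sample = "From: guykuo@carson.u.washington.edu, add on cards and adapters"

substring_hit = "car" in sample          # True: matches inside "carson" and "cards"
word_hit = contains_word(sample, "car")  # False: there is no standalone token "car"
print(substring_hit, word_hit)
```

Even with whole-word matching the search stays boolean: every matching document is equally good, which is what the TF-IDF weighting below addresses.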

TF-IDF and cosine distance: off-the-shelf libraries

vectorizer = TfidfVectorizer()
# vectorizer = TfidfVectorizer(use_idf=False, ngram_range=(1, 2))
document_vectors = vectorizer.fit_transform(newsgroups)
document_vectors
<11314x130107 sparse matrix of type '<class 'numpy.float64'>'
	with 1787565 stored elements in Compressed Sparse Row format>
document_vectors[0]
<1x130107 sparse matrix of type '<class 'numpy.float64'>'
	with 89 stored elements in Compressed Sparse Row format>
document_vectors[0].todense()
matrix([[0., 0., 0., ..., 0., 0., 0.]])
document_vectors[0:4].todense()
matrix([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]])
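
The dense view above shows mostly zeros because each document uses only a small part of the vocabulary. To see what the non-zero weights are, here is a pure-Python sketch of TF-IDF on a toy corpus. It assumes the documented scikit-learn defaults (raw term counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, L2 normalization, tokens of at least two word characters). The function names and the toy corpus are hypothetical, and this is not the library's actual implementation.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep tokens of 2+ word characters (assumed default token pattern).
    return re.findall(r"\b\w\w+\b", text.lower())


def tfidf_vectors(corpus):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [tokenize(doc) for doc in corpus]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency.
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {term: freq * idf[term] for term, freq in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({term: w / norm for term, w in weights.items()} if norm else {})
    return vectors


toy_corpus = [
    "the car is fast",
    "the clock speed of the cpu",
    "the car has a top speed",
]
toy_vectors = tfidf_vectors(toy_corpus)
for vector in toy_vectors:
    print({term: round(weight, 3) for term, weight in vector.items()})
```

In the first toy document, `fast` gets a higher weight than `the`, because `the` occurs in every document and therefore has the lowest idf.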
query_str = 'speed'
# query_str = 'speed car'
# query_str = 'spider man'
query_vector = vectorizer.transform([query_str])
document_vectors
<11314x130107 sparse matrix of type '<class 'numpy.float64'>'
	with 1787565 stored elements in Compressed Sparse Row format>
query_vector
<1x130107 sparse matrix of type '<class 'numpy.float64'>'
	with 1 stored elements in Compressed Sparse Row format>
similarities = sklearn.metrics.pairwise.cosine_similarity(query_vector, document_vectors)
np.sort(similarities)[0][-4:]
array([0.26949927, 0.3491801 , 0.44292083, 0.47784165])
similarities.argsort()[0][-4:]
array([4517, 5509, 2116, 9921])
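
Cosine similarity is the dot product of two vectors divided by the product of their lengths. `argsort` sorts in ascending order, so in the slices above the best match is the last element. The sketch below shows the same two steps in pure Python on sparse `{term: weight}` dictionaries. The helper names and the toy weights are hypothetical and not taken from the newsgroups data.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {term: weight} dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k(query, documents, k):
    """Return (doc_index, score) pairs for the k best matches, best first."""
    scored = [(i, cosine(query, doc)) for i, doc in enumerate(documents)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical toy weights for three documents.
toy_docs = [
    {"car": 0.8, "fast": 0.6},
    {"clock": 0.6, "speed": 0.8},
    {"car": 0.8, "speed": 0.6},
]
toy_query = {"speed": 1.0}

ranking = top_k(toy_query, toy_docs, k=2)
for doc_index, score in ranking:
    print(doc_index, round(score, 3))
```

Unlike the `[-4:]` slices, `top_k` returns the best match first, which is usually the order a search result list is shown in.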
ranking = similarities.argsort()[0]  # document indices, ascending by similarity
for i in range(1, 5):
    doc_id = ranking[-i]  # i-th best match
    print(newsgroups[doc_id])
    print(similarities[0, doc_id])
    print('-' * 100)
    print('-' * 100)
    print('-' * 100)
From: ray@netcom.com (Ray Fischer)
Subject: Re: x86 ~= 680x0 ??  (How do they compare?)
Organization: Netcom. San Jose, California
Distribution: usa
Lines: 36

dhk@ubbpc.uucp (Dave Kitabjian) writes ...
>I'm sure Intel and Motorola are competing neck-and-neck for 
>crunch-power, but for a given clock speed, how do we rank the
>following (from 1st to 6th):
>  486		68040
>  386		68030
>  286		68020

040 486 030 386 020 286

>While you're at it, where will the following fit into the list:
>  68060
>  Pentium
>  PowerPC

060 fastest, then Pentium, with the first versions of the PowerPC
somewhere in the vicinity.

>And about clock speed:  Does doubling the clock speed double the
>overall processor speed?  And fill in the __'s below:
>  68030 @ __ MHz = 68040 @ __ MHz

No.  Computer speed is only partly dependent of processor/clock speed.
Memory system speed play a large role as does video system speed and
I/O speed.  As processor clock rates go up, the speed of the memory
system becomes the greatest factor in the overall system speed.  If
you have a 50MHz processor, it can be reading another word from memory
every 20ns.  Sure, you can put all 20ns memory in your computer, but
it will cost 10 times as much as the slower 80ns SIMMs.

And roughly, the 68040 is twice as fast at a given clock
speed as is the 68030.

-- 
Ray Fischer                   "Convictions are more dangerous enemies of truth
ray@netcom.com                 than lies."  -- Friedrich Nietzsche

0.4778416465020907
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
From: rvenkate@ux4.cso.uiuc.edu (Ravikuma Venkateswar)
Subject: Re: x86 ~= 680x0 ?? (How do they compare?)
Distribution: usa
Organization: University of Illinois at Urbana
Lines: 59

ray@netcom.com (Ray Fischer) writes:

>dhk@ubbpc.uucp (Dave Kitabjian) writes ...
>>I'm sure Intel and Motorola are competing neck-and-neck for 
>>crunch-power, but for a given clock speed, how do we rank the
>>following (from 1st to 6th):
>>  486		68040
>>  386		68030
>>  286		68020

>040 486 030 386 020 286

How about some numbers here? Some kind of benchmark?
If you want, let me start it - 486DX2-66 - 32 SPECint92, 16 SPECfp92 .

>>While you're at it, where will the following fit into the list:
>>  68060
>>  Pentium
>>  PowerPC

>060 fastest, then Pentium, with the first versions of the PowerPC
>somewhere in the vicinity.

Numbers? Pentium @66MHz - 65 SPECint92, 57 SPECfp92 .
	 PowerPC @66MHz - 50 SPECint92, 80 SPECfp92 . (Note this is the 601)
        (Alpha @150MHz  - 74 SPECint92,126 SPECfp92 - just for comparison)

>>And about clock speed:  Does doubling the clock speed double the
>>overall processor speed?  And fill in the __'s below:
>>  68030 @ __ MHz = 68040 @ __ MHz

>No.  Computer speed is only partly dependent of processor/clock speed.
>Memory system speed play a large role as does video system speed and
>I/O speed.  As processor clock rates go up, the speed of the memory
>system becomes the greatest factor in the overall system speed.  If
>you have a 50MHz processor, it can be reading another word from memory
>every 20ns.  Sure, you can put all 20ns memory in your computer, but
>it will cost 10 times as much as the slower 80ns SIMMs.

Not in a clock-doubled system. There isn't a doubling in performance, but
it _is_ quite significant. Maybe about a 70% increase in performance.

Besides, for 0 wait state performance, you'd need a cache anyway. I mean,
who uses a processor that runs at the speed of 80ns SIMMs? Note that this
memory speed corresponds to a clock speed of 12.5 MHz.

>And roughly, the 68040 is twice as fast at a given clock
>speed as is the 68030.

Numbers?

>-- 
>Ray Fischer                   "Convictions are more dangerous enemies of truth
>ray@netcom.com                 than lies."  -- Friedrich Nietzsche
-- 
Ravikumar Venkateswar
rvenkate@uiuc.edu

A pun is a no' blessed form of whit.

0.44292082969477664
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
From: ray@netcom.com (Ray Fischer)
Subject: Re: x86 ~= 680x0 ?? (How do they compare?)
Organization: Netcom. San Jose, California
Distribution: usa
Lines: 30

rvenkate@ux4.cso.uiuc.edu (Ravikuma Venkateswar) writes ...
>ray@netcom.com (Ray Fischer) writes:
>>040 486 030 386 020 286
>
>How about some numbers here? Some kind of benchmark?

Benchmarks are for marketing dweebs and CPU envy.  OK, if it will make
you happy, the 486 is faster than the 040.  BFD.  Both architectures
are nearing then end of their lifetimes.  And especially with the x86
architecture: good riddance.

>Besides, for 0 wait state performance, you'd need a cache anyway. I mean,
>who uses a processor that runs at the speed of 80ns SIMMs? Note that this
>memory speed corresponds to a clock speed of 12.5 MHz.

The point being the processor speed is only one of many aspects of a
computers performance.  Clock speed, processor, memory speed, CPU
architecture, I/O systems, even the application program all contribute 
to the overall system performance.

>>And roughly, the 68040 is twice as fast at a given clock
>>speed as is the 68030.
>
>Numbers?

Look them up yourself.

-- 
Ray Fischer                   "Convictions are more dangerous enemies of truth
ray@netcom.com                 than lies."  -- Friedrich Nietzsche

0.3491800997095306
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
From: mb4008@cehp11 (Morgan J Bullard)
Subject: Re: speeding up windows
Keywords: speed
Organization: University of Illinois at Urbana
Lines: 30

djserian@flash.LakeheadU.Ca (Reincarnation of Elvis) writes:

>I have a 386/33 with 8 megs of memory

>I have noticed that lately when I use programs like WpfW or Corel Draw
>my computer "boggs" down and becomes really sluggish!

>What can I do to increase performance?  What should I turn on or off

>Will not loading wallpapers or stuff like that help when it comes to
>the running speed of windows and the programs that run under it?

>Thanx in advance

>Derek

1) make sure your hard drive is defragmented. This will speed up more than 
   just windows BTW.  Use something like Norton's or PC Tools.
2) I _think_ that leaving the wall paper out will use less RAM and therefore
   will speed up your machine but I could very will be wrong on this.
There's a good chance you've already done this but if not it may speed things
up.  good luck
				Morgan Bullard mb4008@coewl.cen.uiuc.edu
					  or   mjbb@uxa.cso.uiuc.edu

>--
>$_    /|$Derek J.P. Serianni $ E-Mail : djserian@flash.lakeheadu.ca           $ 
>$\'o.O' $Sociologist         $ It's 106 miles to Chicago,we've got a full tank$
>$=(___)=$Lakehead University $ of gas, half a pack of cigarettes,it's dark,and$
>$   U   $Thunder Bay, Ontario$ we're wearing sunglasses. -Elwood Blues        $  

0.26949927393886913
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------

Homework

  • Choose a text collection with at least 10,000 documents (other than the one used in this example).

  • Build a search engine on it based on Okapi BM25, i.e. a system that, for a given query, returns a few (5-10) best-matching documents sorted by score, together with their scores. Also print the number of returned documents, i.e. those with a non-zero score. You may use existing libraries to vectorize the documents, but you must implement Okapi BM25 yourself.

  • Find a query for which the results are unsatisfactory.

  • Improve the search engine (e.g. by changing the text preprocessing, the vectorizer, the parameters of the ranking algorithm, or the algorithm itself) so that it returns satisfactory results for that query. The change must differ from the one shown in this example; come up with something of your own.

  • Present your work at the next class (14.04), answering the following questions:

      • what the collection and the search system look like before the changes

      • for which query the results are unsatisfactory (show the results)

      • what changes were made

      • what the search results look like after the changes

      • how the changes affected the results (1-2 sentences)

    The presentation should be as simple as possible and take at most 2-3 minutes. Points available: 60
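
As a starting point for the scoring step, here is a minimal pure-Python sketch of Okapi BM25 on a toy corpus. Several things in it are assumptions rather than part of the assignment: the parameter values `k1=1.5` and `b=0.75`, the idf variant `ln((N - n + 0.5) / (n + 0.5) + 1)` (which keeps idf non-negative), the regex tokenizer, and all class and function names.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: lowercase runs of word characters.
    return re.findall(r"\w+", text.lower())


class BM25Index:
    def __init__(self, corpus, k1=1.5, b=0.75):
        self.k1 = k1  # term-frequency saturation (assumed default)
        self.b = b    # strength of document-length normalization (assumed default)
        self.doc_term_counts = [Counter(tokenize(doc)) for doc in corpus]
        self.doc_lengths = [sum(counts.values()) for counts in self.doc_term_counts]
        self.avg_length = sum(self.doc_lengths) / len(self.doc_lengths)
        n_docs = len(corpus)
        # Document frequency: the number of documents containing each term.
        df = Counter(term for counts in self.doc_term_counts for term in counts)
        self.idf = {t: math.log((n_docs - n + 0.5) / (n + 0.5) + 1) for t, n in df.items()}

    def score(self, query, doc_id):
        counts = self.doc_term_counts[doc_id]
        length_ratio = self.doc_lengths[doc_id] / self.avg_length
        total = 0.0
        for term in tokenize(query):
            freq = counts.get(term, 0)
            if freq == 0:
                continue  # a term absent from the document contributes nothing
            denom = freq + self.k1 * (1 - self.b + self.b * length_ratio)
            total += self.idf[term] * freq * (self.k1 + 1) / denom
        return total

    def search(self, query, top_n=5):
        """Return (top hits as (doc_id, score), number of documents with non-zero score)."""
        scored = [(i, self.score(query, i)) for i in range(len(self.doc_term_counts))]
        hits = [(i, s) for i, s in scored if s > 0]
        hits.sort(key=lambda pair: pair[1], reverse=True)
        return hits[:top_n], len(hits)


toy_corpus = [
    "clock speed and memory speed",
    "a funky looking sports car",
    "top speed of the car",
]
index = BM25Index(toy_corpus)
results, n_matching = index.search("speed")
print(n_matching, results)
```

On a 10,000-document collection, scoring every document in a Python loop is slow. An inverted index or a sparse term-count matrix from a vectorizer would be the natural next step.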