Information extraction
3. tfidf (2) [exercises]
Jakub Pokrywka (2021)
Imports
import numpy as np
import sklearn.metrics
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
Dataset
newsgroups = fetch_20newsgroups()['data']
len(newsgroups)
11314
print(newsgroups[0])
From: lerxst@wam.umd.edu (where's my thing) Subject: WHAT car is this!? Nntp-Posting-Host: rac3.wam.umd.edu Organization: University of Maryland, College Park Lines: 15 I was wondering if anyone out there could enlighten me on this car I saw the other day. It was a 2-door sports car, looked to be from the late 60s/ early 70s. It was called a Bricklin. The doors were really small. In addition, the front bumper was separate from the rest of the body. This is all I know. If anyone can tellme a model name, engine specs, years of production, where this car is made, history, or whatever info you have on this funky looking car, please e-mail. Thanks, - IL ---- brought to you by your neighborhood Lerxst ----
Naive search
all_documents = list()
for document in newsgroups:
    if 'car' in document:
        all_documents.append(document)
print(all_documents[0])
From: lerxst@wam.umd.edu (where's my thing) Subject: WHAT car is this!? Nntp-Posting-Host: rac3.wam.umd.edu Organization: University of Maryland, College Park Lines: 15 I was wondering if anyone out there could enlighten me on this car I saw the other day. It was a 2-door sports car, looked to be from the late 60s/ early 70s. It was called a Bricklin. The doors were really small. In addition, the front bumper was separate from the rest of the body. This is all I know. If anyone can tellme a model name, engine specs, years of production, where this car is made, history, or whatever info you have on this funky looking car, please e-mail. Thanks, - IL ---- brought to you by your neighborhood Lerxst ----
print(all_documents[1])
From: guykuo@carson.u.washington.edu (Guy Kuo) Subject: SI Clock Poll - Final Call Summary: Final call for SI clock reports Keywords: SI,acceleration,clock,upgrade Article-I.D.: shelley.1qvfo9INNc3s Organization: University of Washington Lines: 11 NNTP-Posting-Host: carson.u.washington.edu A fair number of brave souls who upgraded their SI clock oscillator have shared their experiences for this poll. Please send a brief message detailing your experiences with the procedure. Top speed attained, CPU rated speed, add on cards and adapters, heat sinks, hour of usage per day, floppy disk functionality with 800 and 1.4 m floppies are especially requested. I will be summarizing in the next two days, so please add to the network knowledge base if you have done the clock upgrade and haven't answered this poll. Thanks. Guy Kuo <guykuo@u.washington.edu>
What are the problems with this approach?
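One issue is visible in the second match printed above: `'car' in document` is a substring test, so it also fires on strings such as "carson" or "cards", and it is case-sensitive. The sketch below contrasts it with whole-word matching. The helper names and sample strings are made up for illustration and are not part of the original notebook.

```python
import re

def contains_substring(text, term):
    """Naive check used above: true if `term` occurs anywhere, even inside another word."""
    return term in text

def contains_word(text, term):
    """Whole-word, case-insensitive check based on regex word boundaries."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

# Illustrative snippets (the second one mimics the false positive seen above).
samples = [
    "WHAT car is this!?",
    "guykuo@carson.u.washington.edu add on cards and adapters",
    "My CAR broke down",
]

for text in samples:
    print(contains_substring(text, "car"), contains_word(text, "car"), text)
```

The second sample is a false positive for the substring test, and the third is a false negative because of letter case. Neither variant ranks the matches, which is what the TF-IDF approach in the next part addresses.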
TF-IDF and cosine similarity: ready-made libraries
vectorizer = TfidfVectorizer()
#vectorizer = TfidfVectorizer(use_idf = False, ngram_range=(1,2))
document_vectors = vectorizer.fit_transform(newsgroups)
document_vectors
<11314x130107 sparse matrix of type '<class 'numpy.float64'>' with 1787565 stored elements in Compressed Sparse Row format>
document_vectors[0]
<1x130107 sparse matrix of type '<class 'numpy.float64'>' with 89 stored elements in Compressed Sparse Row format>
document_vectors[0].todense()
matrix([[0., 0., 0., ..., 0., 0., 0.]])
document_vectors[0:4].todense()
matrix([[0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.]])
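The sparse rows above store one weight per distinct term of a document (89 for the first post), and everything else is zero. As a rough illustration of where those weights come from, here is a pure-Python sketch of TF-IDF on a toy corpus. It is meant to resemble what I understand to be the `TfidfVectorizer` defaults (lowercasing, tokens of two or more word characters, smoothed idf `ln((1 + n) / (1 + df)) + 1`, L2 normalization), but it has not been checked against the library here. The function names and the toy corpus are assumptions.

```python
import math
import re
from collections import Counter

TOKEN_RE = re.compile(r"\b\w\w+\b")  # tokens of 2+ word characters

def tokenize(text):
    """Lowercase the text and split it into tokens."""
    return TOKEN_RE.findall(text.lower())

def fit_idf(corpus):
    """Smoothed inverse document frequency for every term in the corpus."""
    n = len(corpus)
    df = Counter()
    for doc in corpus:
        df.update(set(tokenize(doc)))  # count each term once per document
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}

def tfidf_vector(text, idf):
    """Sparse TF-IDF vector as {term: weight}, scaled to unit L2 length."""
    counts = Counter(t for t in tokenize(text) if t in idf)  # ignore unseen terms
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}

# Toy corpus, invented for this example.
toy_corpus = [
    "the car is fast",
    "the clock speed of the processor",
    "speed up windows",
]
idf = fit_idf(toy_corpus)
vec = tfidf_vector(toy_corpus[1], idf)
print(sorted(vec.items(), key=lambda kv: -kv[1]))
```

Terms that occur in fewer documents get a larger idf, and a term repeated within a document gets a proportionally larger weight before normalization.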
query_str = 'speed'
#query_str = 'speed car'
#query_str = 'spider man'
query_vector = vectorizer.transform([query_str])
document_vectors
<11314x130107 sparse matrix of type '<class 'numpy.float64'>' with 1787565 stored elements in Compressed Sparse Row format>
query_vector
<1x130107 sparse matrix of type '<class 'numpy.float64'>' with 1 stored elements in Compressed Sparse Row format>
similarities = sklearn.metrics.pairwise.cosine_similarity(query_vector,document_vectors)
np.sort(similarities)[0][-4:]
array([0.26949927, 0.3491801 , 0.44292083, 0.47784165])
similarities.argsort()[0][-4:]
array([4517, 5509, 2116, 9921])
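Note that `np.sort` and `argsort` order ascending, so the best matches sit at the end of the arrays above. For intuition about what the similarity itself measures, the following is a minimal pure-Python sketch of cosine similarity between sparse vectors stored as dictionaries, followed by a descending ranking. The function name and the weights are made up for illustration and are not taken from the newsgroups data.

```python
import math

def cosine_similarity_sparse(a, b):
    """Cosine of the angle between two sparse vectors given as {term: weight} dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)

# Invented weights: a one-term query and three tiny "documents".
query = {"speed": 1.0}
docs = [
    {"car": 0.8, "door": 0.6},
    {"speed": 0.6, "clock": 0.8},
    {"speed": 1.0},
]

scores = [cosine_similarity_sparse(query, d) for d in docs]
# Document indices ordered from best to worst match.
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(scores)
print(ranking)
```

A document that shares no term with the query scores exactly zero, which is why a single-word query such as 'speed' can only retrieve documents containing that word.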
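# Rank the documents once (best match first) instead of re-sorting in every iteration.
ranking = similarities.argsort()[0][::-1]
for doc_idx in ranking[:4]:
    print(newsgroups[doc_idx])
    print(similarities[0, doc_idx])
    print('-'*100)
    print('-'*100)
    print('-'*100)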
From: ray@netcom.com (Ray Fischer) Subject: Re: x86 ~= 680x0 ?? (How do they compare?) Organization: Netcom. San Jose, California Distribution: usa Lines: 36 dhk@ubbpc.uucp (Dave Kitabjian) writes ... >I'm sure Intel and Motorola are competing neck-and-neck for >crunch-power, but for a given clock speed, how do we rank the >following (from 1st to 6th): > 486 68040 > 386 68030 > 286 68020 040 486 030 386 020 286 >While you're at it, where will the following fit into the list: > 68060 > Pentium > PowerPC 060 fastest, then Pentium, with the first versions of the PowerPC somewhere in the vicinity. >And about clock speed: Does doubling the clock speed double the >overall processor speed? And fill in the __'s below: > 68030 @ __ MHz = 68040 @ __ MHz No. Computer speed is only partly dependent of processor/clock speed. Memory system speed play a large role as does video system speed and I/O speed. As processor clock rates go up, the speed of the memory system becomes the greatest factor in the overall system speed. If you have a 50MHz processor, it can be reading another word from memory every 20ns. Sure, you can put all 20ns memory in your computer, but it will cost 10 times as much as the slower 80ns SIMMs. And roughly, the 68040 is twice as fast at a given clock speed as is the 68030. -- Ray Fischer "Convictions are more dangerous enemies of truth ray@netcom.com than lies." -- Friedrich Nietzsche 0.4778416465020907 ---------------------------------------------------------------------------------------------------- ---------------------------------------------------------------------------------------------------- ---------------------------------------------------------------------------------------------------- From: rvenkate@ux4.cso.uiuc.edu (Ravikuma Venkateswar) Subject: Re: x86 ~= 680x0 ?? (How do they compare?) 
Distribution: usa Organization: University of Illinois at Urbana Lines: 59 ray@netcom.com (Ray Fischer) writes: >dhk@ubbpc.uucp (Dave Kitabjian) writes ... >>I'm sure Intel and Motorola are competing neck-and-neck for >>crunch-power, but for a given clock speed, how do we rank the >>following (from 1st to 6th): >> 486 68040 >> 386 68030 >> 286 68020 >040 486 030 386 020 286 How about some numbers here? Some kind of benchmark? If you want, let me start it - 486DX2-66 - 32 SPECint92, 16 SPECfp92 . >>While you're at it, where will the following fit into the list: >> 68060 >> Pentium >> PowerPC >060 fastest, then Pentium, with the first versions of the PowerPC >somewhere in the vicinity. Numbers? Pentium @66MHz - 65 SPECint92, 57 SPECfp92 . PowerPC @66MHz - 50 SPECint92, 80 SPECfp92 . (Note this is the 601) (Alpha @150MHz - 74 SPECint92,126 SPECfp92 - just for comparison) >>And about clock speed: Does doubling the clock speed double the >>overall processor speed? And fill in the __'s below: >> 68030 @ __ MHz = 68040 @ __ MHz >No. Computer speed is only partly dependent of processor/clock speed. >Memory system speed play a large role as does video system speed and >I/O speed. As processor clock rates go up, the speed of the memory >system becomes the greatest factor in the overall system speed. If >you have a 50MHz processor, it can be reading another word from memory >every 20ns. Sure, you can put all 20ns memory in your computer, but >it will cost 10 times as much as the slower 80ns SIMMs. Not in a clock-doubled system. There isn't a doubling in performance, but it _is_ quite significant. Maybe about a 70% increase in performance. Besides, for 0 wait state performance, you'd need a cache anyway. I mean, who uses a processor that runs at the speed of 80ns SIMMs? Note that this memory speed corresponds to a clock speed of 12.5 MHz. >And roughly, the 68040 is twice as fast at a given clock >speed as is the 68030. Numbers? 
>-- >Ray Fischer "Convictions are more dangerous enemies of truth >ray@netcom.com than lies." -- Friedrich Nietzsche -- Ravikumar Venkateswar rvenkate@uiuc.edu A pun is a no' blessed form of whit. 0.44292082969477664 ---------------------------------------------------------------------------------------------------- ---------------------------------------------------------------------------------------------------- ---------------------------------------------------------------------------------------------------- From: ray@netcom.com (Ray Fischer) Subject: Re: x86 ~= 680x0 ?? (How do they compare?) Organization: Netcom. San Jose, California Distribution: usa Lines: 30 rvenkate@ux4.cso.uiuc.edu (Ravikuma Venkateswar) writes ... >ray@netcom.com (Ray Fischer) writes: >>040 486 030 386 020 286 > >How about some numbers here? Some kind of benchmark? Benchmarks are for marketing dweebs and CPU envy. OK, if it will make you happy, the 486 is faster than the 040. BFD. Both architectures are nearing then end of their lifetimes. And especially with the x86 architecture: good riddance. >Besides, for 0 wait state performance, you'd need a cache anyway. I mean, >who uses a processor that runs at the speed of 80ns SIMMs? Note that this >memory speed corresponds to a clock speed of 12.5 MHz. The point being the processor speed is only one of many aspects of a computers performance. Clock speed, processor, memory speed, CPU architecture, I/O systems, even the application program all contribute to the overall system performance. >>And roughly, the 68040 is twice as fast at a given clock >>speed as is the 68030. > >Numbers? Look them up yourself. -- Ray Fischer "Convictions are more dangerous enemies of truth ray@netcom.com than lies." 
-- Friedrich Nietzsche 0.3491800997095306 ---------------------------------------------------------------------------------------------------- ---------------------------------------------------------------------------------------------------- ---------------------------------------------------------------------------------------------------- From: mb4008@cehp11 (Morgan J Bullard) Subject: Re: speeding up windows Keywords: speed Organization: University of Illinois at Urbana Lines: 30 djserian@flash.LakeheadU.Ca (Reincarnation of Elvis) writes: >I have a 386/33 with 8 megs of memory >I have noticed that lately when I use programs like WpfW or Corel Draw >my computer "boggs" down and becomes really sluggish! >What can I do to increase performance? What should I turn on or off >Will not loading wallpapers or stuff like that help when it comes to >the running speed of windows and the programs that run under it? >Thanx in advance >Derek 1) make sure your hard drive is defragmented. This will speed up more than just windows BTW. Use something like Norton's or PC Tools. 2) I _think_ that leaving the wall paper out will use less RAM and therefore will speed up your machine but I could very will be wrong on this. There's a good chance you've already done this but if not it may speed things up. good luck Morgan Bullard mb4008@coewl.cen.uiuc.edu or mjbb@uxa.cso.uiuc.edu >-- >$_ /|$Derek J.P. Serianni $ E-Mail : djserian@flash.lakeheadu.ca $ >$\'o.O' $Sociologist $ It's 106 miles to Chicago,we've got a full tank$ >$=(___)=$Lakehead University $ of gas, half a pack of cigarettes,it's dark,and$ >$ U $Thunder Bay, Ontario$ we're wearing sunglasses. 
-Elwood Blues $ 0.26949927393886913 ---------------------------------------------------------------------------------------------------- ---------------------------------------------------------------------------------------------------- ----------------------------------------------------------------------------------------------------
Homework
Choose a text collection with at least 10000 documents (different from the one used in this example).
Build a search engine on top of it based on Okapi BM25, i.e. a system that, for a given query, returns several (5-10) best-matching documents in ranked order together with their scores. It should also print the number of documents returned, i.e. those with a non-zero score. You may use existing libraries to vectorize the documents, but you must implement Okapi BM25 yourself.
Find a query for which the results are not satisfactory.
Improve the search engine (e.g. by changing the text preprocessing, the vectorizer, the parameters of the ranking algorithm, or the algorithm itself) so that it returns satisfactory results for that query. The change must differ from the one in this example; come up with something of your own.
Present your work in the next class (14.04), answering the following questions:
what the collection and the search system look like before the changes
for which query the results are unsatisfactory (show the results)
what changes were made
what the search results look like after the changes
how the changes affected the results (1-2 sentences)
The presentation should be as simple as possible and take at most 2-3 minutes.
Points available: 60
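As a starting point for thinking about the scoring function, here is a minimal sketch of one common Okapi BM25 formulation on a toy corpus. Everything in it is an assumption made for illustration: the class name `BM25Sketch`, the simple `\w+` tokenizer, the parameter values `k1=1.5` and `b=0.75` (frequently used defaults), and the idf variant with `+1` inside the logarithm, which keeps idf non-negative. Other formulations exist, and repeated query terms are counted only once here.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split on runs of word characters (a deliberately simple choice)."""
    return re.findall(r"\w+", text.lower())

class BM25Sketch:
    """Toy BM25 index; k1 and b are illustrative defaults, not values from the course."""

    def __init__(self, corpus, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.doc_counts = [Counter(tokenize(doc)) for doc in corpus]
        self.doc_len = [sum(c.values()) for c in self.doc_counts]
        self.n_docs = len(corpus)
        self.avgdl = sum(self.doc_len) / self.n_docs
        self.df = Counter()  # number of documents containing each term
        for counts in self.doc_counts:
            self.df.update(counts.keys())

    def idf(self, term):
        n = self.df.get(term, 0)
        return math.log((self.n_docs - n + 0.5) / (n + 0.5) + 1.0)

    def score(self, query, doc_id):
        counts, length = self.doc_counts[doc_id], self.doc_len[doc_id]
        total = 0.0
        for term in set(tokenize(query)):
            f = counts.get(term, 0)
            if f == 0:
                continue
            # Term-frequency saturation with document-length normalization.
            denom = f + self.k1 * (1 - self.b + self.b * length / self.avgdl)
            total += self.idf(term) * f * (self.k1 + 1) / denom
        return total

    def search(self, query, top_k=5):
        """Return (number of documents with non-zero score, top_k (score, doc_id) pairs)."""
        scored = [(self.score(query, i), i) for i in range(self.n_docs)]
        hits = sorted((s for s in scored if s[0] > 0), reverse=True)
        return len(hits), hits[:top_k]

# Toy corpus, invented for this example.
toy_corpus = [
    "the car is fast",
    "clock speed and processor speed",
    "speed up windows",
    "front bumper of the car",
]
index = BM25Sketch(toy_corpus)
n_hits, top = index.search("speed")
print("documents with non-zero score:", n_hits)
for score, doc_id in top:
    print(round(score, 4), toy_corpus[doc_id])
```

The `k1` parameter controls how quickly repeated occurrences of a term stop adding to the score, and `b` controls how strongly long documents are penalized; both are natural knobs to experiment with on a real collection.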