aitech-eks-pub/cw/03b_tfidf_newsgroup.ipynb
Jakub Pokrywka 5acffc0265 add 03
2021-09-27 14:02:30 +02:00

730 lines
33 KiB
Plaintext

{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"![Logo 1](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech1.jpg)\n",
"<div class=\"alert alert-block alert-info\">\n",
"<h1> Ekstrakcja informacji </h1>\n",
"<h2> 3. <i>tfidf (2)</i> [\u0107wiczenia]</h2> \n",
"<h3> Jakub Pokrywka (2021)</h3>\n",
"</div>\n",
"\n",
"![Logo 2](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech2.jpg)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Zajecia 2\n",
"\n",
"Przydatne materia\u0142y:\n",
"\n",
"https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html\n",
"\n",
"https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Importy"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import sklearn.metrics\n",
"\n",
"from sklearn.datasets import fetch_20newsgroups\n",
"\n",
"from sklearn.feature_extraction.text import TfidfVectorizer"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Zbi\u00f3r danych"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"newsgroups = fetch_20newsgroups()['data']"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"11314"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(newsgroups)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"From: lerxst@wam.umd.edu (where's my thing)\n",
"Subject: WHAT car is this!?\n",
"Nntp-Posting-Host: rac3.wam.umd.edu\n",
"Organization: University of Maryland, College Park\n",
"Lines: 15\n",
"\n",
" I was wondering if anyone out there could enlighten me on this car I saw\n",
"the other day. It was a 2-door sports car, looked to be from the late 60s/\n",
"early 70s. It was called a Bricklin. The doors were really small. In addition,\n",
"the front bumper was separate from the rest of the body. This is \n",
"all I know. If anyone can tellme a model name, engine specs, years\n",
"of production, where this car is made, history, or whatever info you\n",
"have on this funky looking car, please e-mail.\n",
"\n",
"Thanks,\n",
"- IL\n",
" ---- brought to you by your neighborhood Lerxst ----\n",
"\n",
"\n",
"\n",
"\n",
"\n"
]
}
],
"source": [
"print(newsgroups[0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Naiwne przeszukiwanie"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"all_documents = list() \n",
"for document in newsgroups:\n",
" if 'car' in document:\n",
" all_documents.append(document)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"From: lerxst@wam.umd.edu (where's my thing)\n",
"Subject: WHAT car is this!?\n",
"Nntp-Posting-Host: rac3.wam.umd.edu\n",
"Organization: University of Maryland, College Park\n",
"Lines: 15\n",
"\n",
" I was wondering if anyone out there could enlighten me on this car I saw\n",
"the other day. It was a 2-door sports car, looked to be from the late 60s/\n",
"early 70s. It was called a Bricklin. The doors were really small. In addition,\n",
"the front bumper was separate from the rest of the body. This is \n",
"all I know. If anyone can tellme a model name, engine specs, years\n",
"of production, where this car is made, history, or whatever info you\n",
"have on this funky looking car, please e-mail.\n",
"\n",
"Thanks,\n",
"- IL\n",
" ---- brought to you by your neighborhood Lerxst ----\n",
"\n",
"\n",
"\n",
"\n",
"\n"
]
}
],
"source": [
"print(all_documents[0])"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"From: guykuo@carson.u.washington.edu (Guy Kuo)\n",
"Subject: SI Clock Poll - Final Call\n",
"Summary: Final call for SI clock reports\n",
"Keywords: SI,acceleration,clock,upgrade\n",
"Article-I.D.: shelley.1qvfo9INNc3s\n",
"Organization: University of Washington\n",
"Lines: 11\n",
"NNTP-Posting-Host: carson.u.washington.edu\n",
"\n",
"A fair number of brave souls who upgraded their SI clock oscillator have\n",
"shared their experiences for this poll. Please send a brief message detailing\n",
"your experiences with the procedure. Top speed attained, CPU rated speed,\n",
"add on cards and adapters, heat sinks, hour of usage per day, floppy disk\n",
"functionality with 800 and 1.4 m floppies are especially requested.\n",
"\n",
"I will be summarizing in the next two days, so please add to the network\n",
"knowledge base if you have done the clock upgrade and haven't answered this\n",
"poll. Thanks.\n",
"\n",
"Guy Kuo <guykuo@u.washington.edu>\n",
"\n"
]
}
],
"source": [
"print(all_documents[1])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### jakie s\u0105 problemy z takim podej\u015bciem?\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## TFIDF i odleg\u0142o\u015b\u0107 cosinusowa- gotowe biblioteki"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"vectorizer = TfidfVectorizer()\n",
"#vectorizer = TfidfVectorizer(use_idf = False, ngram_range=(1,2))"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"document_vectors = vectorizer.fit_transform(newsgroups)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<11314x130107 sparse matrix of type '<class 'numpy.float64'>'\n",
"\twith 1787565 stored elements in Compressed Sparse Row format>"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"document_vectors"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<1x130107 sparse matrix of type '<class 'numpy.float64'>'\n",
"\twith 89 stored elements in Compressed Sparse Row format>"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"document_vectors[0]"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"matrix([[0., 0., 0., ..., 0., 0., 0.]])"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"document_vectors[0].todense()"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"matrix([[0., 0., 0., ..., 0., 0., 0.],\n",
" [0., 0., 0., ..., 0., 0., 0.],\n",
" [0., 0., 0., ..., 0., 0., 0.],\n",
" [0., 0., 0., ..., 0., 0., 0.]])"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"document_vectors[0:4].todense()"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"query_str = 'speed'\n",
"#query_str = 'speed car'\n",
"#query_str = 'spider man'"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"query_vector = vectorizer.transform([query_str])"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<11314x130107 sparse matrix of type '<class 'numpy.float64'>'\n",
"\twith 1787565 stored elements in Compressed Sparse Row format>"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"document_vectors"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<1x130107 sparse matrix of type '<class 'numpy.float64'>'\n",
"\twith 1 stored elements in Compressed Sparse Row format>"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"query_vector"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"similarities = sklearn.metrics.pairwise.cosine_similarity(query_vector,document_vectors)"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([0.26949927, 0.3491801 , 0.44292083, 0.47784165])"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.sort(similarities)[0][-4:]"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([4517, 5509, 2116, 9921])"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"similarities.argsort()[0][-4:]"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"From: ray@netcom.com (Ray Fischer)\n",
"Subject: Re: x86 ~= 680x0 ?? (How do they compare?)\n",
"Organization: Netcom. San Jose, California\n",
"Distribution: usa\n",
"Lines: 36\n",
"\n",
"dhk@ubbpc.uucp (Dave Kitabjian) writes ...\n",
">I'm sure Intel and Motorola are competing neck-and-neck for \n",
">crunch-power, but for a given clock speed, how do we rank the\n",
">following (from 1st to 6th):\n",
"> 486\t\t68040\n",
"> 386\t\t68030\n",
"> 286\t\t68020\n",
"\n",
"040 486 030 386 020 286\n",
"\n",
">While you're at it, where will the following fit into the list:\n",
"> 68060\n",
"> Pentium\n",
"> PowerPC\n",
"\n",
"060 fastest, then Pentium, with the first versions of the PowerPC\n",
"somewhere in the vicinity.\n",
"\n",
">And about clock speed: Does doubling the clock speed double the\n",
">overall processor speed? And fill in the __'s below:\n",
"> 68030 @ __ MHz = 68040 @ __ MHz\n",
"\n",
"No. Computer speed is only partly dependent of processor/clock speed.\n",
"Memory system speed play a large role as does video system speed and\n",
"I/O speed. As processor clock rates go up, the speed of the memory\n",
"system becomes the greatest factor in the overall system speed. If\n",
"you have a 50MHz processor, it can be reading another word from memory\n",
"every 20ns. Sure, you can put all 20ns memory in your computer, but\n",
"it will cost 10 times as much as the slower 80ns SIMMs.\n",
"\n",
"And roughly, the 68040 is twice as fast at a given clock\n",
"speed as is the 68030.\n",
"\n",
"-- \n",
"Ray Fischer \"Convictions are more dangerous enemies of truth\n",
"ray@netcom.com than lies.\" -- Friedrich Nietzsche\n",
"\n",
"0.4778416465020907\n",
"----------------------------------------------------------------------------------------------------\n",
"----------------------------------------------------------------------------------------------------\n",
"----------------------------------------------------------------------------------------------------\n",
"From: rvenkate@ux4.cso.uiuc.edu (Ravikuma Venkateswar)\n",
"Subject: Re: x86 ~= 680x0 ?? (How do they compare?)\n",
"Distribution: usa\n",
"Organization: University of Illinois at Urbana\n",
"Lines: 59\n",
"\n",
"ray@netcom.com (Ray Fischer) writes:\n",
"\n",
">dhk@ubbpc.uucp (Dave Kitabjian) writes ...\n",
">>I'm sure Intel and Motorola are competing neck-and-neck for \n",
">>crunch-power, but for a given clock speed, how do we rank the\n",
">>following (from 1st to 6th):\n",
">> 486\t\t68040\n",
">> 386\t\t68030\n",
">> 286\t\t68020\n",
"\n",
">040 486 030 386 020 286\n",
"\n",
"How about some numbers here? Some kind of benchmark?\n",
"If you want, let me start it - 486DX2-66 - 32 SPECint92, 16 SPECfp92 .\n",
"\n",
">>While you're at it, where will the following fit into the list:\n",
">> 68060\n",
">> Pentium\n",
">> PowerPC\n",
"\n",
">060 fastest, then Pentium, with the first versions of the PowerPC\n",
">somewhere in the vicinity.\n",
"\n",
"Numbers? Pentium @66MHz - 65 SPECint92, 57 SPECfp92 .\n",
"\t PowerPC @66MHz - 50 SPECint92, 80 SPECfp92 . (Note this is the 601)\n",
" (Alpha @150MHz - 74 SPECint92,126 SPECfp92 - just for comparison)\n",
"\n",
">>And about clock speed: Does doubling the clock speed double the\n",
">>overall processor speed? And fill in the __'s below:\n",
">> 68030 @ __ MHz = 68040 @ __ MHz\n",
"\n",
">No. Computer speed is only partly dependent of processor/clock speed.\n",
">Memory system speed play a large role as does video system speed and\n",
">I/O speed. As processor clock rates go up, the speed of the memory\n",
">system becomes the greatest factor in the overall system speed. If\n",
">you have a 50MHz processor, it can be reading another word from memory\n",
">every 20ns. Sure, you can put all 20ns memory in your computer, but\n",
">it will cost 10 times as much as the slower 80ns SIMMs.\n",
"\n",
"Not in a clock-doubled system. There isn't a doubling in performance, but\n",
"it _is_ quite significant. Maybe about a 70% increase in performance.\n",
"\n",
"Besides, for 0 wait state performance, you'd need a cache anyway. I mean,\n",
"who uses a processor that runs at the speed of 80ns SIMMs? Note that this\n",
"memory speed corresponds to a clock speed of 12.5 MHz.\n",
"\n",
">And roughly, the 68040 is twice as fast at a given clock\n",
">speed as is the 68030.\n",
"\n",
"Numbers?\n",
"\n",
">-- \n",
">Ray Fischer \"Convictions are more dangerous enemies of truth\n",
">ray@netcom.com than lies.\" -- Friedrich Nietzsche\n",
"-- \n",
"Ravikumar Venkateswar\n",
"rvenkate@uiuc.edu\n",
"\n",
"A pun is a no' blessed form of whit.\n",
"\n",
"0.44292082969477664\n",
"----------------------------------------------------------------------------------------------------\n",
"----------------------------------------------------------------------------------------------------\n",
"----------------------------------------------------------------------------------------------------\n",
"From: ray@netcom.com (Ray Fischer)\n",
"Subject: Re: x86 ~= 680x0 ?? (How do they compare?)\n",
"Organization: Netcom. San Jose, California\n",
"Distribution: usa\n",
"Lines: 30\n",
"\n",
"rvenkate@ux4.cso.uiuc.edu (Ravikuma Venkateswar) writes ...\n",
">ray@netcom.com (Ray Fischer) writes:\n",
">>040 486 030 386 020 286\n",
">\n",
">How about some numbers here? Some kind of benchmark?\n",
"\n",
"Benchmarks are for marketing dweebs and CPU envy. OK, if it will make\n",
"you happy, the 486 is faster than the 040. BFD. Both architectures\n",
"are nearing then end of their lifetimes. And especially with the x86\n",
"architecture: good riddance.\n",
"\n",
">Besides, for 0 wait state performance, you'd need a cache anyway. I mean,\n",
">who uses a processor that runs at the speed of 80ns SIMMs? Note that this\n",
">memory speed corresponds to a clock speed of 12.5 MHz.\n",
"\n",
"The point being the processor speed is only one of many aspects of a\n",
"computers performance. Clock speed, processor, memory speed, CPU\n",
"architecture, I/O systems, even the application program all contribute \n",
"to the overall system performance.\n",
"\n",
">>And roughly, the 68040 is twice as fast at a given clock\n",
">>speed as is the 68030.\n",
">\n",
">Numbers?\n",
"\n",
"Look them up yourself.\n",
"\n",
"-- \n",
"Ray Fischer \"Convictions are more dangerous enemies of truth\n",
"ray@netcom.com than lies.\" -- Friedrich Nietzsche\n",
"\n",
"0.3491800997095306\n",
"----------------------------------------------------------------------------------------------------\n",
"----------------------------------------------------------------------------------------------------\n",
"----------------------------------------------------------------------------------------------------\n",
"From: mb4008@cehp11 (Morgan J Bullard)\n",
"Subject: Re: speeding up windows\n",
"Keywords: speed\n",
"Organization: University of Illinois at Urbana\n",
"Lines: 30\n",
"\n",
"djserian@flash.LakeheadU.Ca (Reincarnation of Elvis) writes:\n",
"\n",
">I have a 386/33 with 8 megs of memory\n",
"\n",
">I have noticed that lately when I use programs like WpfW or Corel Draw\n",
">my computer \"boggs\" down and becomes really sluggish!\n",
"\n",
">What can I do to increase performance? What should I turn on or off\n",
"\n",
">Will not loading wallpapers or stuff like that help when it comes to\n",
">the running speed of windows and the programs that run under it?\n",
"\n",
">Thanx in advance\n",
"\n",
">Derek\n",
"\n",
"1) make sure your hard drive is defragmented. This will speed up more than \n",
" just windows BTW. Use something like Norton's or PC Tools.\n",
"2) I _think_ that leaving the wall paper out will use less RAM and therefore\n",
" will speed up your machine but I could very will be wrong on this.\n",
"There's a good chance you've already done this but if not it may speed things\n",
"up. good luck\n",
"\t\t\t\tMorgan Bullard mb4008@coewl.cen.uiuc.edu\n",
"\t\t\t\t\t or mjbb@uxa.cso.uiuc.edu\n",
"\n",
">--\n",
">$_ /|$Derek J.P. Serianni $ E-Mail : djserian@flash.lakeheadu.ca $ \n",
">$\\'o.O' $Sociologist $ It's 106 miles to Chicago,we've got a full tank$\n",
">$=(___)=$Lakehead University $ of gas, half a pack of cigarettes,it's dark,and$\n",
">$ U $Thunder Bay, Ontario$ we're wearing sunglasses. -Elwood Blues $ \n",
"\n",
"0.26949927393886913\n",
"----------------------------------------------------------------------------------------------------\n",
"----------------------------------------------------------------------------------------------------\n",
"----------------------------------------------------------------------------------------------------\n"
]
}
],
"source": [
"for i in range (1,5):\n",
" print(newsgroups[similarities.argsort()[0][-i]])\n",
" print(np.sort(similarities)[0,-i])\n",
" print('-'*100)\n",
" print('-'*100)\n",
" print('-'*100)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Zadanie domowe\n",
"\n",
"\n",
"- Wybra\u0107 zbi\u00f3r tekstowy, kt\u00f3ry ma conajmniej 10000 dokument\u00f3w (inny ni\u017c w tym przyk\u0142adzie).\n",
"- Na jego podstawie stworzy\u0107 wyszukiwark\u0119 bazuj\u0105c\u0105 na OKAPI BM25, tzn. system kt\u00f3ry dla podanej frazy podaje kilka (5-10) posortowanych najbardziej pasuj\u0105cych dokument\u00f3w razem ze scorami. Nale\u017cy wypisywa\u0107 te\u017c ilo\u015b\u0107 zwracanych dokument\u00f3w, czyli takich z niezerowym scorem. Mo\u017cna korzysta\u0107 z gotowych bibliotek do wektoryzacji dokument\u00f3w, nale\u017cy jednak samemu zaimplementowa\u0107 OKAPI BM25. \n",
"- Znale\u017a\u0107 fraz\u0119 (query), dla kt\u00f3rej wynik nie jest satysfakcjonuj\u0105cy.\n",
"- Poprawi\u0107 wyszukiwark\u0119 (np. poprzez zmian\u0119 preprocessingu tekstu, wektoryzer, zmian\u0119 parametr\u00f3w algorytmu rankuj\u0105cego lub sam algorytm) tak, \u017ceby zwraca\u0142a satysfakcjonuj\u0105ce wyniki dla poprzedniej frazy. Nale\u017cy zrobi\u0107 inn\u0105 zmian\u0119 ni\u017c w tym przyk\u0142adzie, tylko wymy\u015bli\u0107 co\u015b w\u0142asnego.\n",
"- prezentowa\u0107 prac\u0119 na nast\u0119pnych zaj\u0119ciach (14.04) odpowiadaj\u0105c na pytania:\n",
" - jak wygl\u0105da zbi\u00f3r i system wyszukiwania przed zmianami\n",
" - dla jakiej frazy wyniki s\u0105 niesatysfakcjonuj\u0105ce (pokaza\u0107 wyniki)\n",
" - jakie zmiany zosta\u0142y naniesione\n",
" - jak wygl\u0105daj\u0105 wyniki wyszukiwania po zmianach\n",
" - jak zmiany wp\u0142yn\u0119\u0142y na wyniki (1-2 zdania)\n",
" \n",
"Prezentacja powinna by\u0107 maksymalnie prosta i trwa\u0107 maksymalnie 2-3 minuty.\n",
"punkt\u00f3w do zdobycia: 60\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.3"
},
"author": "Jakub Pokrywka",
"email": "kubapok@wmi.amu.edu.pl",
"lang": "pl",
"subtitle": "3.tfidf (2)[\u0107wiczenia]",
"title": "Ekstrakcja informacji",
"year": "2021"
},
"nbformat": 4,
"nbformat_minor": 4
}