\n",
"\n",
"![Logo 2](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech2.jpg)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Zajecia 3\n",
"\n",
"Przydatne materiały:\n",
"\n",
"https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html\n",
"\n",
"https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Importy"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import sklearn.metrics\n",
"\n",
"from sklearn.datasets import fetch_20newsgroups\n",
"\n",
"from sklearn.feature_extraction.text import TfidfVectorizer"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Zbiór danych"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"newsgroups = fetch_20newsgroups()['data']"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"11314"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(newsgroups)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"From: lerxst@wam.umd.edu (where's my thing)\n",
"Subject: WHAT car is this!?\n",
"Nntp-Posting-Host: rac3.wam.umd.edu\n",
"Organization: University of Maryland, College Park\n",
"Lines: 15\n",
"\n",
" I was wondering if anyone out there could enlighten me on this car I saw\n",
"the other day. It was a 2-door sports car, looked to be from the late 60s/\n",
"early 70s. It was called a Bricklin. The doors were really small. In addition,\n",
"the front bumper was separate from the rest of the body. This is \n",
"all I know. If anyone can tellme a model name, engine specs, years\n",
"of production, where this car is made, history, or whatever info you\n",
"have on this funky looking car, please e-mail.\n",
"\n",
"Thanks,\n",
"- IL\n",
" ---- brought to you by your neighborhood Lerxst ----\n",
"\n",
"\n",
"\n",
"\n",
"\n"
]
}
],
"source": [
"print(newsgroups[0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Naiwne przeszukiwanie"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"all_documents = list() \n",
"for document in newsgroups:\n",
" if 'car' in document:\n",
" all_documents.append(document)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"From: lerxst@wam.umd.edu (where's my thing)\n",
"Subject: WHAT car is this!?\n",
"Nntp-Posting-Host: rac3.wam.umd.edu\n",
"Organization: University of Maryland, College Park\n",
"Lines: 15\n",
"\n",
" I was wondering if anyone out there could enlighten me on this car I saw\n",
"the other day. It was a 2-door sports car, looked to be from the late 60s/\n",
"early 70s. It was called a Bricklin. The doors were really small. In addition,\n",
"the front bumper was separate from the rest of the body. This is \n",
"all I know. If anyone can tellme a model name, engine specs, years\n",
"of production, where this car is made, history, or whatever info you\n",
"have on this funky looking car, please e-mail.\n",
"\n",
"Thanks,\n",
"- IL\n",
" ---- brought to you by your neighborhood Lerxst ----\n",
"\n",
"\n",
"\n",
"\n",
"\n"
]
}
],
"source": [
"print(all_documents[0])"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"From: guykuo@carson.u.washington.edu (Guy Kuo)\n",
"Subject: SI Clock Poll - Final Call\n",
"Summary: Final call for SI clock reports\n",
"Keywords: SI,acceleration,clock,upgrade\n",
"Article-I.D.: shelley.1qvfo9INNc3s\n",
"Organization: University of Washington\n",
"Lines: 11\n",
"NNTP-Posting-Host: carson.u.washington.edu\n",
"\n",
"A fair number of brave souls who upgraded their SI clock oscillator have\n",
"shared their experiences for this poll. Please send a brief message detailing\n",
"your experiences with the procedure. Top speed attained, CPU rated speed,\n",
"add on cards and adapters, heat sinks, hour of usage per day, floppy disk\n",
"functionality with 800 and 1.4 m floppies are especially requested.\n",
"\n",
"I will be summarizing in the next two days, so please add to the network\n",
"knowledge base if you have done the clock upgrade and haven't answered this\n",
"poll. Thanks.\n",
"\n",
"Guy Kuo \n",
"\n"
]
}
],
"source": [
"print(all_documents[1])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### jakie są problemy z takim podejściem?\n"
]
},
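{
"cell_type": "markdown",
"metadata": {},
"source": [
"One problem is already visible in the outputs above: `'car' in document` is a substring test, so the second hit contains no word `car` at all and matches only through strings such as `carson` and `cards`. The test is also case-sensitive. Below is a minimal sketch of the difference between substring matching and whole-token matching. The `docs` list and the `tokenize` helper are made up for illustration and are not part of the exercise.\n",
"\n",
"```python\n",
"import re\n",
"\n",
"# Made-up documents for illustration only.\n",
"docs = [\n",
"    'I saw a sports car yesterday',\n",
"    'Posted from carson.u.washington.edu about add-on cards',\n",
"    'CAR for sale',\n",
"]\n",
"\n",
"def tokenize(text):\n",
"    # Lowercase the text and split it into alphanumeric tokens.\n",
"    return re.findall('[a-z0-9]+', text.lower())\n",
"\n",
"# Substring test, as in the naive search above.\n",
"substring_hits = [d for d in docs if 'car' in d]\n",
"\n",
"# Whole-token test on lowercased text.\n",
"word_hits = [d for d in docs if 'car' in tokenize(d)]\n",
"\n",
"print(substring_hits)\n",
"print(word_hits)\n",
"```\n",
"\n",
"In this toy case the substring test returns the `carson` document and misses `CAR for sale`, while the token test does the opposite. Neither variant ranks the hits, which is what the TF-IDF approach in the next section adds."
]
},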
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## TFIDF i odległość cosinusowa- gotowe biblioteki"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"vectorizer = TfidfVectorizer()\n",
"#vectorizer = TfidfVectorizer(use_idf = False, ngram_range=(1,2))"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"document_vectors = vectorizer.fit_transform(newsgroups)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<11314x130107 sparse matrix of type ''\n",
"\twith 1787565 stored elements in Compressed Sparse Row format>"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"document_vectors"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<1x130107 sparse matrix of type ''\n",
"\twith 89 stored elements in Compressed Sparse Row format>"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"document_vectors[0]"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"matrix([[0., 0., 0., ..., 0., 0., 0.]])"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"document_vectors[0].todense()"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"matrix([[0., 0., 0., ..., 0., 0., 0.],\n",
" [0., 0., 0., ..., 0., 0., 0.],\n",
" [0., 0., 0., ..., 0., 0., 0.],\n",
" [0., 0., 0., ..., 0., 0., 0.]])"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"document_vectors[0:4].todense()"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"#query_str = 'speed'\n",
"#query_str = 'speed car'\n",
"query_str = 'spider man'"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"query_vector = vectorizer.transform([query_str])"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<11314x130107 sparse matrix of type ''\n",
"\twith 1787565 stored elements in Compressed Sparse Row format>"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"document_vectors"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<1x130107 sparse matrix of type ''\n",
"\twith 2 stored elements in Compressed Sparse Row format>"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"query_vector"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"similarities = sklearn.metrics.pairwise.cosine_similarity(query_vector,document_vectors)"
]
},
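{
"cell_type": "markdown",
"metadata": {},
"source": [
"`cosine_similarity` compares the query vector with every row of the document matrix. As a rough illustration of what it computes, the sketch below evaluates the dot product divided by the product of the vector norms by hand with NumPy. The dense toy vectors and all variable names are assumptions made for this example; the real matrices above are sparse and far larger.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# Toy dense vectors: one query and three documents over a 3-term vocabulary.\n",
"toy_query = np.array([1.0, 1.0, 0.0])\n",
"toy_documents = np.array([\n",
"    [2.0, 2.0, 0.0],  # same direction as the query\n",
"    [1.0, 0.0, 0.0],  # shares one term with the query\n",
"    [0.0, 0.0, 3.0],  # shares no term with the query\n",
"])\n",
"\n",
"# Cosine similarity: dot product divided by the product of the vector norms.\n",
"dots = toy_documents @ toy_query\n",
"norms = np.linalg.norm(toy_documents, axis=1) * np.linalg.norm(toy_query)\n",
"toy_similarities = dots / norms\n",
"\n",
"# Document indices ordered from most to least similar.\n",
"toy_ranking = np.argsort(toy_similarities)[::-1]\n",
"\n",
"print(toy_similarities)\n",
"print(toy_ranking)\n",
"```\n",
"\n",
"Because the measure depends only on the angle between vectors, the first toy document scores about 1 even though it is twice as long as the query, and a document sharing no terms with the query scores 0."
]
},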
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([0.17360013, 0.22933014, 0.28954818, 0.45372239])"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.sort(similarities)[0][-4:]"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([ 2455, 8920, 5497, 11031])"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"similarities.argsort()[0][-4:]"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"From: keiths@spider.co.uk (Keith Smith)\n",
"Subject: win/NT file systems\n",
"Organization: Spider Systems Limited, Edinburgh, UK.\n",
"Lines: 6\n",
"Nntp-Posting-Host: trapdoor.spider.co.uk\n",
"\n",
"OK will some one out there tell me why / how DOS 5\n",
"can read (I havn't tried writing in case it breaks something)\n",
"the Win/NT NTFS file system.\n",
"I thought NTFS was supposed to be better than the FAT system\n",
"\n",
"keith\n",
"\n",
"0.4537223924558256\n",
"----------------------------------------------------------------------------------------------------\n",
"----------------------------------------------------------------------------------------------------\n",
"----------------------------------------------------------------------------------------------------\n",
"From: brandt@cs.unc.edu (Andrew Brandt)\n",
"Subject: Seeking good Alfa Romeo mechanic.\n",
"Organization: The University of North Carolina at Chapel Hill\n",
"Lines: 14\n",
"NNTP-Posting-Host: axon.cs.unc.edu\n",
"Keywords: alfa, romeo, spider, mechanic\n",
"\n",
"I am looking for recommendations for a good (great?) Alfa Romeo\n",
"mechanic in South Jersey or Philadelphia or nearby.\n",
"\n",
"I have a '78 Alfa Spider that needs some engine, tranny, steering work\n",
"done. The body is in quite good shape. The car is awful in cold\n",
"weather, won't start if below freezing (I know, I know, why drive a\n",
"Spider if there's snow on the ground ...). It has Bosch *mechanical*\n",
"fuel injection that I am sure needs adjustment.\n",
"\n",
"Any opinions are welcome on what to look for or who to call.\n",
"\n",
"Email or post (to rec.autos), I will summarize if people want.\n",
"\n",
"Thx, Andy (brandt@cs.unc.edu)\n",
"\n",
"0.28954817869991817\n",
"----------------------------------------------------------------------------------------------------\n",
"----------------------------------------------------------------------------------------------------\n",
"----------------------------------------------------------------------------------------------------\n",
"From: michaelr@spider.co.uk (Michael S. A. Robb)\n",
"Subject: Re: Honors Degrees: Do they mean anything?\n",
"Organization: Spider Systems Limited, Edinburgh, UK.\n",
"Lines: 44\n",
"\n",
"In article tkld@cogsci.ed.ac.uk (Kevin Davidson) writes:\n",
">\n",
">> In my opinion, a programming degree is still worth having.\n",
">\n",
"> Yes, but a CS degree is *not* a programming degree. Does anybody know of\n",
">a computing course where *programming* is taught ? Computer Science is\n",
">a branch of maths (or the course I did was).\n",
"> I've also done a Software Engineering course - much more practical and likely\n",
">to be the sort of thing an employer really wants, rather than what they think\n",
">they want, but also did not teach programming. The ability to program was\n",
">an entry requirement.\n",
"\n",
"At Robert Gordon University, programming was the main (most time-consuming) \n",
"start of the course. The first two years consisted of five subjects:\n",
"Software Engineering (Pascal/C/UNIX), Computer Engineering (6502/6809/68000 \n",
"assembler), Computer Theory (LISP/Prolog), Mathematics/Statistics and \n",
"Communication Skills (How to pass interviews/intelligence tests and group\n",
"discussions e.g. How to survive a helicopter crash in the North Sea).\n",
"The third year (Industrial placement) was spent working for a computer company \n",
"for a year. The company could be anywhere in Europe (there was a special \n",
"Travel Allowance Scheme to cover the visiting costs of professors). \n",
"The fourth year included Operating Systems(C/Modula-2), Software Engineering \n",
"(C/8086 assembler), Real Time Laboratory (C/68000 assembler) and Computing \n",
"Theory (LISP). There were also Group Projects in 2nd and 4th Years, where \n",
"students worked in teams to select their own project or decide to work for an \n",
"outside company (the only disadvantage being that specifications would change \n",
"suddenly).\n",
" \n",
"In the first four years, there was a 50%:50% weighting between courseworks and \n",
"exams for most subjects. However in the Honours year, this was reduced to a \n",
"30%:70% split between an Individual Project and final exams (no coursework \n",
"assessment) - are all Computer Science courses like this?\n",
"\n",
"BTW - we started off with 22 students in our first year and were left with 8 by\n",
"Honours year. Also, every course is tutored separately. Not easy trying\n",
"to sleep when you are in 8 student class :-). \n",
"\n",
"Cheers,\n",
" Michael \n",
"-- \n",
"| Michael S. A. Robb | Tel: +44 31 554 9424 | \"..The problem with bolt-on\n",
"| Software Engineer | Fax: +44 31 554 0649 | software is making sure the\n",
"| Spider Systems Limited | E-mail: | bolts are the right size..\"\n",
"| Edinburgh, EH6 5NG | michaelr@spider.co.uk | - Anonymous\n",
"\n",
"0.22933013891071233\n",
"----------------------------------------------------------------------------------------------------\n",
"----------------------------------------------------------------------------------------------------\n",
"----------------------------------------------------------------------------------------------------\n",
"From: jrm@elm.circa.ufl.edu (Jeff Mason)\n",
"Subject: AUCTION: Marvel, DC, Valiant, Image, Dark Horse, etc...\n",
"Organization: Univ. of Florida Psychology Dept.\n",
"Lines: 59\n",
"NNTP-Posting-Host: elm.circa.ufl.edu\n",
"\n",
"I am auctioning off the following comics. These minimum bids are set\n",
"below what I would normally sell them for. Make an offer, and I will\n",
"accept the highest bid after the auction has been completed.\n",
"\n",
"TITLE Minimum/Current \n",
"--------------------------------------------------------------\n",
"Alpha Flight 51 (Jim Lee's first work at Marvel)\t$ 5.00\n",
"Aliens 1 (1st app Aliens in comics, 1st prnt, May 1988)\t$20.00\n",
"Amazing Spider-Man 136 (Intro new Green Goblin) $20.00\n",
"Amazing Spider-Man 238 (1st appearance Hobgoblin)\t$50.00\n",
"Archer and Armstrong 1 (Frank Miller/Smith/Layton)\t$ 7.50\n",
"Avengers 263 (1st appearance X-factor) $ 3.50\n",
"Bloodshot 1 (Chromium cover, BWSmith Cover/Poster)\t$ 5.00\n",
"Daredevil 158 (Frank Miller art begins) $35.00\n",
"Dark Horse Presents 1 (1st app Concrete, 1st printing)\t$ 7.50 \n",
"H.A.R.D. Corps 1 \t\t\t\t\t$ 5.00\n",
"Incredible Hulk 324 (1st app Grey Hulk since #1, 1962)\t$ 7.50\n",
"Incredible Hulk 330 (1st McFarlane issue)\t\t$15.00\n",
"Incredible Hulk 331 (Grey Hulk series begins)\t\t$11.20\t\n",
"Incredible Hulk 367 (1st Dale Keown art in Hulk) $15.00\n",
"Incredible Hulk 377 (1st all new hulk, 1st prnt, Keown) $15.00\n",
"Marvel Comics Presents 1 (Wolverine, Silver Surfer) $ 7.50\n",
"Maxx Limited Ashcan (4000 copies exist, blue cover)\t$30.00\n",
"New Mutants 86 (McFarlane cover, 1st app Cable - cameo)\t$10.00\n",
"New Mutants 100 (1st app X-Force) $ 5.00\n",
"New Mutants Annual 5 (1st Liefeld art on New Mutants)\t$10.00\n",
"Omega Men 3 (1st appearance Lobo) $ 7.50\n",
"Omega Men 10 (1st full Lobo story) $ 7.50\n",
"Power Man & Iron Fist 78 (3rd appearance Sabretooth) $25.00\n",
" 84 (4th appearance Sabretooth) $20.00\n",
"Simpsons Comics and Stories 1 (Polybagged special ed.)\t$ 7.50\n",
"Spectacular Spider-Man 147 (1st app New Hobgoblin) $12.50\n",
"Star Trek the Next Generation 1 (Feb 1988, DC mini) $ 7.50\n",
"Star Trek the Next Generation 1 (Oct 1989, DC comics) $ 7.50\n",
"Web of Spider-Man 29 (Hobgoblin, Wolverine appear) $10.00 \n",
"Web of Spider-Man 30 (Origin Rose, Hobgoblin appears) $ 7.50\n",
"Wolverine 10 (Before claws, 1st battle with Sabretooth)\t$15.00\n",
"Wolverine 41 (Sabretooth claims to be Wolverine's dad)\t$ 5.00\n",
"Wolverine 42 (Sabretooth proven not to be his dad)\t$ 3.50\n",
"Wolverine 43 (Sabretooth/Wolverine saga concludes)\t$ 3.00\n",
"Wolverine 1 (1982 mini-series, Miller art)\t\t$20.00\n",
"Wonder Woman 267 (Return of Animal Man) $12.50\n",
"X-Force 1 (Signed by Liefeld, Bagged, X-Force card) $20.00\n",
"X-Force 1 (Signed by Liefeld, Bagged, Shatterstar card) $10.00\n",
"X-Force 1 (Signed by Liefeld, Bagged, Deadpool card) $10.00\n",
"X-Force 1 (Signed by Liefeld, Bagged, Sunspot/Gideon) $10.00\n",
"\n",
"All comics are in near mint to mint condition, are bagged in shiny \n",
"polypropylene bags, and backed with white acid free boards. Shipping is\n",
"$1.50 for one book, $3.00 for more than one book, or free if you order \n",
"a large enough amount of stuff. I am willing to haggle.\n",
"\n",
"I have thousands and thousands of other comics, so please let me know what \n",
"you've been looking for, and maybe I can help. Some titles I have posted\n",
"here don't list every issue I have of that title, I tried to save space.\n",
"-- \n",
"Geoffrey R. Mason\t\t|\tjrm@elm.circa.ufl.edu\n",
"Department of Psychology\t|\tmason@webb.psych.ufl.edu\n",
"University of Florida\t\t|\tprothan@maple.circa.ufl.edu\n",
"\n",
"0.17360012846950526\n",
"----------------------------------------------------------------------------------------------------\n",
"----------------------------------------------------------------------------------------------------\n",
"----------------------------------------------------------------------------------------------------\n"
]
}
],
"source": [
"for i in range (1,5):\n",
" print(newsgroups[similarities.argsort()[0][-i]])\n",
" print(np.sort(similarities)[0,-i])\n",
" print('-'*100)\n",
" print('-'*100)\n",
" print('-'*100)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### analiza\n",
"Dla frazy \"spider man\" (komórka 14) wynik zapytania jest niesatysfakcjonujący, ponieważ pierwszy artykuł nie jest o spider-man'ie, ale zawiera tylko słowa \"spider\". Po zmianie metody wektoryzacji (komórka 8) jako pierwszy wynik pojawia się istotnie film o spider manie (proszę to sprawdzić samodzielnie). Wynika to z faktu, że używamy również bigramów. W ten sposób poprawiliśmy wyszukiwarkę dla tego konkretnego przykładu (chociaż nie wiemy czy nie popsuliśmy wyszukiwarki w innym przypadku- w tym ćwiczeniu nie przejmujemy się tym)"
]
},
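{
"cell_type": "markdown",
"metadata": {},
"source": [
"The effect of bigrams can be reproduced on a tiny scale. The sketch below is an illustration under stated assumptions: the two documents are made up, and `best_match` is a hypothetical helper, not part of the exercise. Both vectorizers use `use_idf=False` so that the only difference between them is `ngram_range`.\n",
"\n",
"```python\n",
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"from sklearn.metrics.pairwise import cosine_similarity\n",
"\n",
"# Made-up documents: only the second one contains 'spider' directly followed by 'man'.\n",
"docs = [\n",
"    'spider systems spider support',\n",
"    'the amazing spider man comic is for sale',\n",
"]\n",
"\n",
"def best_match(vectorizer, documents, query):\n",
"    # Fit on the documents, vectorize the query, return the index of the top-scoring document.\n",
"    doc_vectors = vectorizer.fit_transform(documents)\n",
"    query_vector = vectorizer.transform([query])\n",
"    scores = cosine_similarity(query_vector, doc_vectors)[0]\n",
"    return int(scores.argmax())\n",
"\n",
"# Unigrams only: the repeated word 'spider' dominates the score.\n",
"best_unigram = best_match(TfidfVectorizer(use_idf=False), docs, 'spider man')\n",
"\n",
"# Unigrams and bigrams: the feature 'spider man' rewards the adjacent word pair.\n",
"best_bigram = best_match(TfidfVectorizer(use_idf=False, ngram_range=(1, 2)), docs, 'spider man')\n",
"\n",
"print(best_unigram, best_bigram)\n",
"```\n",
"\n",
"With unigrams the first document wins because it repeats `spider`; with bigrams added, the second document wins because it also matches the `spider man` feature. This mirrors the behaviour described above, but it says nothing about how the change affects other queries."
]
},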
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Zadanie domowe\n",
"\n",
"\n",
"- Wybrać zbiór tekstowy, który ma conajmniej 10_000 dokumentów (inny niż w tym przykładzie).\n",
"- Na jego podstawie stworzyć wyszukiwarkę bazującą na OKAPI BM25, tzn. system który dla podanej frazy podaje kilka (5-10) posortowanych najbardziej pasujących dokumentów razem ze scorami. Należy wypisywać też ilość zwracanych dokumentów, czyli takich z niezerowym scorem. Można korzystać z gotowych bibliotek do wektoryzacji dokumentów, należy jednak samemu zaimplementować OKAPI BM25. Można użyć dowolnych parametrów TF-IDF\n",
"- Znaleźć frazę (query), dla której wynik nie jest satysfakcjonujący.\n",
"- Poprawić wyszukiwarkę (np. poprzez zmianę preprocessingu tekstu, wektoryzer, zmianę parametrów algorytmu rankującego lub sam algorytm) tak, żeby zwracała satysfakcjonujące wyniki dla poprzedniej frazy. Należy zrobić inną zmianę niż w powyższym przykładzie (czyli coś innego niż użycie bigramów), tylko wymyślić coś własnego.\n",
"- prezentować pracę na zajęciach (06.04) odpowiadając na pytania:\n",
" - jak wygląda zbiór i system wyszukiwania przed zmianami\n",
" - dla jakiej frazy wyniki są niesatysfakcjonujące (pokazać wyniki)\n",
" - jakie zmiany zostały naniesione\n",
" - jak wyglądają wyniki wyszukiwania po zmianach\n",
" - jak zmiany wpłynęły na wyniki (1-2 zdania)\n",
" \n",
"Prezentacja powinna być maksymalnie prosta i trwać maksymalnie 2-3 minuty.\n",
"punktów do zdobycia: 70\n"
]
},
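{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, one commonly used formulation of the Okapi BM25 score of a document $D$ for a query $Q = (q_1, \\dots, q_n)$ is given below. The notation is an assumption of this note rather than part of the assignment: $f(q_i, D)$ is the number of occurrences of $q_i$ in $D$, $|D|$ is the document length in tokens, $\\mathrm{avgdl}$ is the average document length in the collection, $N$ is the number of documents, and $n(q_i)$ is the number of documents containing $q_i$. The free parameters are often set to roughly $k_1 \\in [1.2, 2.0]$ and $b = 0.75$; other IDF variants exist.\n",
"\n",
"$$\\mathrm{score}(D, Q) = \\sum_{i=1}^{n} \\mathrm{IDF}(q_i) \\cdot \\frac{f(q_i, D)\\,(k_1 + 1)}{f(q_i, D) + k_1 \\left(1 - b + b \\cdot \\frac{|D|}{\\mathrm{avgdl}}\\right)}$$\n",
"\n",
"$$\\mathrm{IDF}(q_i) = \\ln\\left(\\frac{N - n(q_i) + 0.5}{n(q_i) + 0.5} + 1\\right)$$"
]
},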
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"author": "Jakub Pokrywka",
"email": "kubapok@wmi.amu.edu.pl",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"lang": "pl",
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.3"
},
"subtitle": "3.tfidf (2)[ćwiczenia]",
"title": "Ekstrakcja informacji",
"year": "2021"
},
"nbformat": 4,
"nbformat_minor": 4
}