{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Logo 1](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech1.jpg)\n",
"<div class=\"alert alert-block alert-info\">\n",
"<h1> Information Extraction </h1>\n",
"<h2> 3. <i>tfidf (1)</i> [exercises]</h2> \n",
"<h3> Jakub Pokrywka (2021)</h3>\n",
"</div>\n",
"\n",
"![Logo 2](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech2.jpg)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Class 3\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import re"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## document collection"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"documents = ['Ala lubi zwierzęta i ma kota oraz psa!',\n",
"             'Ola lubi zwierzęta oraz ma kota a także chomika!',\n",
"             'I Jan jeździ na rowerze.',\n",
"             '2 wojna światowa była wielkim konfliktem zbrojnym',\n",
"             'Tomek lubi psy, ma psa i jeździ na motorze i rowerze.',\n",
"             ]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### WHAT DO WE WANT?\n",
"- we want to turn the texts into a collection of words\n",
"\n",
"\n",
"### QUESTION\n",
"- can we tokenize a text with, e.g., `documents[0].split(' ')`? What problems would that cause?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### ANSWER\n",
"- it is better to preprocess the text first and tokenize it only afterwards"
]
},
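{
"cell_type": "markdown",
"metadata": {},
"source": [
"A small sketch of the problem behind the answer above. Splitting the raw sentence on spaces keeps capital letters and leaves punctuation glued to the tokens, so `psa!` and `psa` would count as different words. The names `raw_sentence` and `naive_tokens` are introduced here only for illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"raw_sentence = 'Ala lubi zwierzęta i ma kota oraz psa!'\n",
"naive_tokens = raw_sentence.split(' ')\n",
"\n",
"# punctuation stays attached to the last token and the capital letter is kept\n",
"print(naive_tokens)\n",
"print('psa' in naive_tokens)  # the stored token is 'psa!'\n",
"print('ala' in naive_tokens)  # the stored token is 'Ala'"
]
},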
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## preprocessing"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"def get_str_cleaned(str_dirty):\n",
"    # lowercase, drop ASCII punctuation, then collapse repeated spaces\n",
"    punctuation = '!\"#$%&\\'()*+,-./:;<=>?@[\\\\]^_`{|}~'\n",
"    new_str = str_dirty.lower()\n",
"    for char in punctuation:\n",
"        new_str = new_str.replace(char, '')\n",
"    # collapse spaces last, because removing punctuation can leave double spaces\n",
"    new_str = re.sub(' +', ' ', new_str).strip()\n",
"    return new_str\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"sample_document = get_str_cleaned(documents[0])"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'ala lubi zwierzęta i ma kota oraz psa'"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sample_document"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## tokenization"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"def tokenize_str(document):\n",
"    return document.split(' ')"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['ala', 'lubi', 'zwierzęta', 'i', 'ma', 'kota', 'oraz', 'psa']"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tokenize_str(sample_document)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"documents_cleaned = [get_str_cleaned(d) for d in documents]"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['ala lubi zwierzęta i ma kota oraz psa',\n",
" 'ola lubi zwierzęta oraz ma kota a także chomika',\n",
" 'i jan jeździ na rowerze',\n",
" '2 wojna światowa była wielkim konfliktem zbrojnym',\n",
" 'tomek lubi psy ma psa i jeździ na motorze i rowerze']"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"documents_cleaned"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"documents_tokenized = [tokenize_str(d) for d in documents_cleaned]"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[['ala', 'lubi', 'zwierzęta', 'i', 'ma', 'kota', 'oraz', 'psa'],\n",
" ['ola', 'lubi', 'zwierzęta', 'oraz', 'ma', 'kota', 'a', 'także', 'chomika'],\n",
" ['i', 'jan', 'jeździ', 'na', 'rowerze'],\n",
" ['2', 'wojna', 'światowa', 'była', 'wielkim', 'konfliktem', 'zbrojnym'],\n",
" ['tomek',\n",
" 'lubi',\n",
" 'psy',\n",
" 'ma',\n",
" 'psa',\n",
" 'i',\n",
" 'jeździ',\n",
" 'na',\n",
" 'motorze',\n",
" 'i',\n",
" 'rowerze']]"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"documents_tokenized"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## QUESTIONS\n",
"- what is the next step towards building TF or TF-IDF vectors?\n",
"- what size will a TF or TF-IDF vector have?\n"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"vocabulary = []\n",
"for document in documents_tokenized:\n",
"    for word in document:\n",
"        vocabulary.append(word)\n",
"vocabulary = sorted(set(vocabulary))"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['2',\n",
" 'a',\n",
" 'ala',\n",
" 'była',\n",
" 'chomika',\n",
" 'i',\n",
" 'jan',\n",
" 'jeździ',\n",
" 'konfliktem',\n",
" 'kota',\n",
" 'lubi',\n",
" 'ma',\n",
" 'motorze',\n",
" 'na',\n",
" 'ola',\n",
" 'oraz',\n",
" 'psa',\n",
" 'psy',\n",
" 'rowerze',\n",
" 'także',\n",
" 'tomek',\n",
" 'wielkim',\n",
" 'wojna',\n",
" 'zbrojnym',\n",
" 'zwierzęta',\n",
" 'światowa']"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"vocabulary"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### TASK 1: write a function word_to_index(word: str) that returns a one-hot vector as a numpy array"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"def word_to_index(word):\n",
"    pass"
]
},
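{
"cell_type": "markdown",
"metadata": {},
"source": [
"One possible sketch for Task 1, not necessarily the intended solution. To stay self-contained it uses a small hypothetical vocabulary `toy_vocabulary` and a separately named function `one_hot_sketch`, so it does not overwrite the notebook's `vocabulary` or `word_to_index`. Raising a `KeyError` for unknown words is an assumption of this sketch."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"toy_vocabulary = ['kota', 'lubi', 'ma', 'psa']  # hypothetical, sorted like the real vocabulary\n",
"toy_index = {word: i for i, word in enumerate(toy_vocabulary)}\n",
"\n",
"def one_hot_sketch(word):\n",
"    # vector of zeros with a single 1 at the position of the word\n",
"    vector = np.zeros(len(toy_vocabulary))\n",
"    vector[toy_index[word]] = 1  # KeyError for out-of-vocabulary words (assumption)\n",
"    return vector\n",
"\n",
"one_hot_sketch('psa')"
]
},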
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,\n",
" 0., 0., 0., 0., 0., 0., 0., 0., 0.])"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"word_to_index('psa')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### TASK 2: write a function that takes a list of words and turns it into a TF vector\n"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"def tf(document):\n",
"    pass"
]
},
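{
"cell_type": "markdown",
"metadata": {},
"source": [
"A possible sketch for Task 2 under stated assumptions: a hypothetical `toy_vocabulary`, a separate name `tf_sketch`, and raw counts as the term frequency. Ignoring out-of-vocabulary words is an assumption of this sketch, not something the notebook specifies."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"toy_vocabulary = ['i', 'kota', 'ma', 'psa']  # hypothetical vocabulary\n",
"toy_index = {word: i for i, word in enumerate(toy_vocabulary)}\n",
"\n",
"def tf_sketch(tokens):\n",
"    # count how many times each vocabulary word occurs in the token list\n",
"    vector = np.zeros(len(toy_vocabulary))\n",
"    for token in tokens:\n",
"        if token in toy_index:  # assumption: unknown words are ignored\n",
"            vector[toy_index[token]] += 1\n",
"    return vector\n",
"\n",
"tf_sketch(['ma', 'psa', 'i', 'kota', 'i'])"
]
},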
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([0., 0., 1., 0., 0., 1., 0., 0., 0., 1., 1., 1., 0., 0., 0., 1., 1.,\n",
" 0., 0., 0., 0., 0., 0., 0., 1., 0.])"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tf(documents_tokenized[0])"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"documents_vectorized = list()\n",
"for document in documents_tokenized:\n",
"    document_vector = tf(document)\n",
"    documents_vectorized.append(document_vector)"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"[array([0., 0., 1., 0., 0., 1., 0., 0., 0., 1., 1., 1., 0., 0., 0., 1., 1.,\n",
" 0., 0., 0., 0., 0., 0., 0., 1., 0.]),\n",
" array([0., 1., 0., 0., 1., 0., 0., 0., 0., 1., 1., 1., 0., 0., 1., 1., 0.,\n",
" 0., 0., 1., 0., 0., 0., 0., 1., 0.]),\n",
" array([0., 0., 0., 0., 0., 1., 1., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0.,\n",
" 0., 1., 0., 0., 0., 0., 0., 0., 0.]),\n",
" array([1., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,\n",
" 0., 0., 0., 0., 1., 1., 1., 0., 1.]),\n",
" array([0., 0., 0., 0., 0., 2., 0., 1., 0., 0., 1., 1., 1., 1., 0., 0., 1.,\n",
" 1., 1., 0., 1., 0., 0., 0., 0., 0.])]"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"documents_vectorized"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### IDF"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Version without any normalization\n",
"\n",
"\n",
"$idf_i = \\Large\\frac{|D|}{|\\{d : t_i \\in d \\}|}$\n",
"\n",
"\n",
"$|D|$ - the number of documents in the corpus\n",
"\n",
"$|\\{d : t_i \\in d \\}|$ - the number of documents in the corpus in which the term $t_i$ occurs at least once"
]
},
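{
"cell_type": "markdown",
"metadata": {},
"source": [
"The formula above can be checked by hand on a tiny example. This is only a sketch: `toy_corpus` and `idf_sketch` are hypothetical names, and the function assumes the term occurs in at least one document (otherwise the division by zero would fail)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"toy_corpus = [['i', 'ma', 'kota'], ['ma', 'psa'], ['i', 'jan'], ['wojna'], ['i', 'psa']]\n",
"\n",
"def idf_sketch(term):\n",
"    # |D| divided by the number of documents that contain the term\n",
"    containing = sum(1 for tokens in toy_corpus if term in tokens)\n",
"    return len(toy_corpus) / containing\n",
"\n",
"# 'i' occurs in 3 of 5 documents, 'wojna' in 1 of 5\n",
"idf_sketch('i'), idf_sketch('wojna')"
]
},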
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([5. , 5. , 5. , 5. , 5. ,\n",
" 1.66666667, 5. , 2.5 , 5. , 2.5 ,\n",
" 1.66666667, 1.66666667, 5. , 2.5 , 5. ,\n",
" 2.5 , 2.5 , 5. , 2.5 , 5. ,\n",
" 5. , 5. , 5. , 5. , 2.5 ,\n",
" 5. ])"
]
},
"metadata": {},
"output_type": "display_data"
}
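],
"source": [
"# document frequency: the number of documents in which each term occurs at least once\n",
"document_frequency = np.sum(np.array(documents_vectorized) != 0, axis=0)\n",
"idf = len(documents_vectorized) / document_frequency\n",
"display(idf)"
]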
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [],
"source": [
"# weight every raw TF vector by IDF to obtain TF-IDF vectors\n",
"for i in range(len(documents_vectorized)):\n",
"    documents_vectorized[i] = documents_vectorized[i] * idf"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### TASK 3: write a function similarity that returns the cosine similarity between two vectorized documents"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def similarity(query, document):\n",
"    pass"
]
},
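{
"cell_type": "markdown",
"metadata": {},
"source": [
"A possible sketch for Task 3, named `cosine_similarity_sketch` to keep it apart from the notebook's `similarity`. It divides the dot product of the two vectors by the product of their Euclidean norms. Returning 0.0 when either vector is all zeros is an assumption of this sketch."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def cosine_similarity_sketch(query, document):\n",
"    # cos(q, d) = (q . d) / (|q| * |d|)\n",
"    denominator = np.linalg.norm(query) * np.linalg.norm(document)\n",
"    if denominator == 0:\n",
"        return 0.0  # assumption: an all-zero vector is similar to nothing\n",
"    return float(np.dot(query, document) / denominator)\n",
"\n",
"cosine_similarity_sketch(np.array([1.0, 1.0, 0.0]), np.array([1.0, 0.0, 0.0]))"
]
},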
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Ala lubi zwierzęta i ma kota oraz psa!'"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"documents[0]"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([0., 0., 1., 0., 0., 1., 0., 0., 0., 1., 1., 1., 0., 0., 0., 1., 1.,\n",
" 0., 0., 0., 0., 0., 0., 0., 1., 0.])"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"documents_vectorized[0]"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Ola lubi zwierzęta oraz ma kota a także chomika!'"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"documents[1]"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([0., 1., 0., 0., 1., 0., 0., 0., 0., 1., 1., 1., 0., 0., 1., 1., 0.,\n",
" 0., 0., 1., 0., 0., 0., 0., 1., 0.])"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"documents_vectorized[1]"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.5892556509887895"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"similarity(documents_vectorized[0],documents_vectorized[1])"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [],
"source": [
"def transform_query(query):\n",
"    query_vector = tf(tokenize_str(get_str_cleaned(query)))\n",
"    return query_vector"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,\n",
" 0., 0., 0., 0., 0., 0., 0., 0., 0.])"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"transform_query('psa')"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.4999999999999999"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"similarity(transform_query('psa kota'), documents_vectorized[0])"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Ala lubi zwierzęta i ma kota oraz psa!'"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"0.4999999999999999"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"'Ola lubi zwierzęta oraz ma kota a także chomika!'"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"0.2357022603955158"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"'I Jan jeździ na rowerze.'"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"0.0"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"'2 wojna światowa była wielkim konfliktem zbrojnym'"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"0.0"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"'Tomek lubi psy, ma psa i jeździ na motorze i rowerze.'"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"0.19611613513818402"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# this is how a two-word query is handled\n",
"query = 'psa kota'\n",
"for i in range(len(documents)):\n",
"    display(documents[i])\n",
"    display(similarity(transform_query(query), documents_vectorized[i]))"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Ala lubi zwierzęta i ma kota oraz psa!'"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"0.0"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"'Ola lubi zwierzęta oraz ma kota a także chomika!'"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"0.0"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"'I Jan jeździ na rowerze.'"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"0.4472135954999579"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"'2 wojna światowa była wielkim konfliktem zbrojnym'"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"0.0"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"'Tomek lubi psy, ma psa i jeździ na motorze i rowerze.'"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"0.2773500981126146"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# this is why cosine similarity needs the denominator:\n",
"# a longer document in which the word 'rowerze' occurs once scores lower than\n",
"# a shorter one. If the word occurs in a very short document, the document\n",
"# is more likely to be about bikes\n",
"query = 'rowerze'\n",
"for i in range(len(documents)):\n",
"    display(documents[i])\n",
"    display(similarity(transform_query(query), documents_vectorized[i]))"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"'Ala lubi zwierzęta i ma kota oraz psa!'"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"0.35355339059327373"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"'Ola lubi zwierzęta oraz ma kota a także chomika!'"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"0.0"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"'I Jan jeździ na rowerze.'"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"0.4472135954999579"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"'2 wojna światowa była wielkim konfliktem zbrojnym'"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"0.0"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"'Tomek lubi psy, ma psa i jeździ na motorze i rowerze.'"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"0.5547001962252291"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# this is why we need term frequency: more occurrences of a word in a document\n",
"# mean a better-matching document\n",
"query = 'i'\n",
"for i in range(len(documents)):\n",
"    display(documents[i])\n",
"    display(similarity(transform_query(query), documents_vectorized[i]))"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"scrolled": false
},
"outputs": [
{
"ename": "NameError",
"evalue": "name 'documents' is not defined",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m<ipython-input-3-ca637083c8f1>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m 2\u001b[0m \u001b[0;31m# słowo chomik ma większą wagę od i, ponieważ występuje w mniejszej ilości dokumentów\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 3\u001b[0m \u001b[0mquery\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m'i chomika'\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 4\u001b[0;31m \u001b[0;32mfor\u001b[0m \u001b[0mi\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mrange\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mlen\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdocuments\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 5\u001b[0m \u001b[0mdisplay\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdocuments\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mi\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 6\u001b[0m \u001b[0mdisplay\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0msimilarity\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtransform_query\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mquery\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdocuments_vectorized\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mi\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;31mNameError\u001b[0m: name 'documents' is not defined"
]
}
],
"source": [
"# this is why we need IDF: more informative words should get a larger weight\n",
"# the word 'chomika' weighs more than 'i' because it occurs in fewer documents\n",
"query = 'i chomika'\n",
"for i in range(len(documents)):\n",
"    display(documents[i])\n",
"    display(similarity(transform_query(query), documents_vectorized[i]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Note\n",
"The examples above show the score of each document. To build a search engine, we should sort the documents by score (highest first) and present them in that order."
]
}
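,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The ranking step described in the note can be sketched as follows. The scores are hypothetical placeholder values, and `toy_documents`, `toy_scores` and `ranking` are names introduced only for this sketch."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"toy_documents = ['doc a', 'doc b', 'doc c']\n",
"toy_scores = np.array([0.1, 0.7, 0.0])  # hypothetical similarity scores\n",
"\n",
"# argsort sorts ascending, so reverse it to put the highest score first\n",
"ranking = np.argsort(toy_scores)[::-1]\n",
"for position in ranking:\n",
"    print(float(toy_scores[position]), toy_documents[position])"
]
}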
],
"metadata": {
"author": "Jakub Pokrywka",
"email": "kubapok@wmi.amu.edu.pl",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"lang": "pl",
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.3"
},
"subtitle": "3.tfidf (1)[ćwiczenia]",
"title": "Ekstrakcja informacji",
"year": "2021"
},
"nbformat": 4,
"nbformat_minor": 4
}