forked from filipg/aitech-eks-pub
1120 lines
38 KiB
Plaintext
1120 lines
38 KiB
Plaintext
|
{
|
||
|
"cells": [
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {
|
||
|
"collapsed": false
|
||
|
},
|
||
|
"source": [
|
||
|
"![Logo 1](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech1.jpg)\n",
|
||
|
"<div class=\"alert alert-block alert-info\">\n",
|
||
|
"<h1> Ekstrakcja informacji </h1>\n",
|
||
|
"<h2> 3. <i>tfidf (1)</i> [\u0107wiczenia]</h2> \n",
|
||
|
"<h3> Jakub Pokrywka (2021)</h3>\n",
|
||
|
"</div>\n",
|
||
|
"\n",
|
||
|
"![Logo 2](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech2.jpg)"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"# Zaj\u0119cia 2\n",
|
||
|
"\n",
|
||
|
"Na tych zaj\u0119ciach za aktywno\u015bc mo\u017cna otrzyma\u0107 po 5 punkt\u00f3w za warto\u015bciow\u0105 wypowied\u017a. Maksymalnie jedna osoba mo\u017ce zdoby\u0107 na tych \u0107wiczeniach do 15 punkt\u00f3w."
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 1,
|
||
|
"metadata": {},
|
||
|
"outputs": [],
|
||
|
"source": [
|
||
|
"import numpy as np\n",
|
||
|
"import re"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"## zbi\u00f3r dokument\u00f3w"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 2,
|
||
|
"metadata": {},
|
||
|
"outputs": [],
|
||
|
"source": [
|
||
|
"documents = ['Ala lubi zwierz\u0119ta i ma kota oraz psa!',\n",
|
||
|
" 'Ola lubi zwierz\u0119ta oraz ma kota a tak\u017ce chomika!',\n",
|
||
|
" 'I Jan je\u017adzi na rowerze.',\n",
|
||
|
" '2 wojna \u015bwiatowa by\u0142a wielkim konfliktem zbrojnym',\n",
|
||
|
" 'Tomek lubi psy, ma psa i je\u017adzi na motorze i rowerze.',\n",
|
||
|
" ]"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"### CZEGO CHCEMY?\n",
|
||
|
"- chcemy zamieni\u0107 teksty na zbi\u00f3r s\u0142\u00f3w\n",
|
||
|
"\n",
|
||
|
"\n",
|
||
|
"### PYTANIE\n",
|
||
|
"- czy mo\u017cemy ztokenizowa\u0107 tekst np. documents.split(' ') jakie wyst\u0105pi\u0105 wtedy problemy?"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"## preprocessing"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 3,
|
||
|
"metadata": {},
|
||
|
"outputs": [],
|
||
|
"source": [
|
||
|
"def get_str_cleaned(str_dirty):\n",
|
||
|
" punctuation = '!\"#$%&\\'()*+,-./:;<=>?@[\\\\]^_`{|}~'\n",
|
||
|
" new_str = str_dirty.lower()\n",
|
||
|
" new_str = re.sub(' +', ' ', new_str)\n",
|
||
|
" for char in punctuation:\n",
|
||
|
" new_str = new_str.replace(char,'')\n",
|
||
|
" return new_str\n"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 4,
|
||
|
"metadata": {},
|
||
|
"outputs": [],
|
||
|
"source": [
|
||
|
"sample_document = get_str_cleaned(documents[0])"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 5,
|
||
|
"metadata": {},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"'ala lubi zwierz\u0119ta i ma kota oraz psa'"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": 5,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"sample_document"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"## tokenizacja"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 6,
|
||
|
"metadata": {},
|
||
|
"outputs": [],
|
||
|
"source": [
|
||
|
"def tokenize_str(document):\n",
|
||
|
" return document.split(' ')"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 7,
|
||
|
"metadata": {},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"['ala', 'lubi', 'zwierz\u0119ta', 'i', 'ma', 'kota', 'oraz', 'psa']"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": 7,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"tokenize_str(sample_document)"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 8,
|
||
|
"metadata": {},
|
||
|
"outputs": [],
|
||
|
"source": [
|
||
|
"documents_cleaned = [get_str_cleaned(d) for d in documents]"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 9,
|
||
|
"metadata": {},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"['ala lubi zwierz\u0119ta i ma kota oraz psa',\n",
|
||
|
" 'ola lubi zwierz\u0119ta oraz ma kota a tak\u017ce chomika',\n",
|
||
|
" 'i jan je\u017adzi na rowerze',\n",
|
||
|
" '2 wojna \u015bwiatowa by\u0142a wielkim konfliktem zbrojnym',\n",
|
||
|
" 'tomek lubi psy ma psa i je\u017adzi na motorze i rowerze']"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": 9,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"documents_cleaned"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 10,
|
||
|
"metadata": {},
|
||
|
"outputs": [],
|
||
|
"source": [
|
||
|
"documents_tokenized = [tokenize_str(d) for d in documents_cleaned]"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 11,
|
||
|
"metadata": {},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"[['ala', 'lubi', 'zwierz\u0119ta', 'i', 'ma', 'kota', 'oraz', 'psa'],\n",
|
||
|
" ['ola', 'lubi', 'zwierz\u0119ta', 'oraz', 'ma', 'kota', 'a', 'tak\u017ce', 'chomika'],\n",
|
||
|
" ['i', 'jan', 'je\u017adzi', 'na', 'rowerze'],\n",
|
||
|
" ['2', 'wojna', '\u015bwiatowa', 'by\u0142a', 'wielkim', 'konfliktem', 'zbrojnym'],\n",
|
||
|
" ['tomek',\n",
|
||
|
" 'lubi',\n",
|
||
|
" 'psy',\n",
|
||
|
" 'ma',\n",
|
||
|
" 'psa',\n",
|
||
|
" 'i',\n",
|
||
|
" 'je\u017adzi',\n",
|
||
|
" 'na',\n",
|
||
|
" 'motorze',\n",
|
||
|
" 'i',\n",
|
||
|
" 'rowerze']]"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": 11,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"documents_tokenized"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"## PYTANIA\n",
|
||
|
"- jaki jest nast\u0119pny krok w celu stworzenia wekt\u00f3r\u00f3w TF lub TF-IDF\n",
|
||
|
"- jakie wielko\u015bci b\u0119dzie wektor TF lub TF-IDF?\n"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 12,
|
||
|
"metadata": {},
|
||
|
"outputs": [],
|
||
|
"source": [
|
||
|
"vocabulary = []\n",
|
||
|
"for document in documents_tokenized:\n",
|
||
|
" for word in document:\n",
|
||
|
" vocabulary.append(word)\n",
|
||
|
"vocabulary = sorted(set(vocabulary))"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 13,
|
||
|
"metadata": {},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"['2',\n",
|
||
|
" 'a',\n",
|
||
|
" 'ala',\n",
|
||
|
" 'by\u0142a',\n",
|
||
|
" 'chomika',\n",
|
||
|
" 'i',\n",
|
||
|
" 'jan',\n",
|
||
|
" 'je\u017adzi',\n",
|
||
|
" 'konfliktem',\n",
|
||
|
" 'kota',\n",
|
||
|
" 'lubi',\n",
|
||
|
" 'ma',\n",
|
||
|
" 'motorze',\n",
|
||
|
" 'na',\n",
|
||
|
" 'ola',\n",
|
||
|
" 'oraz',\n",
|
||
|
" 'psa',\n",
|
||
|
" 'psy',\n",
|
||
|
" 'rowerze',\n",
|
||
|
" 'tak\u017ce',\n",
|
||
|
" 'tomek',\n",
|
||
|
" 'wielkim',\n",
|
||
|
" 'wojna',\n",
|
||
|
" 'zbrojnym',\n",
|
||
|
" 'zwierz\u0119ta',\n",
|
||
|
" '\u015bwiatowa']"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": 13,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"vocabulary"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"## PYTANIA\n",
|
||
|
"\n",
|
||
|
"jak b\u0119dzie s\u0142owo \"jak\" w reprezentacji wektorowej TF?"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"### ZADANIE 1 stworzy\u0107 funkcj\u0119 word_to_index(word:str), funkcja ma zwara\u0107 one-hot vector w postaciu numpy array"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 14,
|
||
|
"metadata": {},
|
||
|
"outputs": [],
|
||
|
"source": [
|
||
|
"def word_to_index(word):\n",
|
||
|
" pass"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 16,
|
||
|
"metadata": {},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,\n",
|
||
|
" 0., 0., 0., 0., 0., 0., 0., 0., 0.])"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": 16,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"word_to_index('psa')"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"### ZADANIE 2 NAPISAC FUNKCJ\u0118, kt\u00f3ra bierze list\u0119 s\u0142\u00f3w i zamienia na wetktor TF\n"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 17,
|
||
|
"metadata": {},
|
||
|
"outputs": [],
|
||
|
"source": [
|
||
|
"def tf(document):\n",
|
||
|
" pass"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 18,
|
||
|
"metadata": {},
|
||
|
"outputs": [],
|
||
|
"source": []
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 19,
|
||
|
"metadata": {},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"array([0., 0., 1., 0., 0., 1., 0., 0., 0., 1., 1., 1., 0., 0., 0., 1., 1.,\n",
|
||
|
" 0., 0., 0., 0., 0., 0., 0., 1., 0.])"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": 19,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"tf(documents_tokenized[0])"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 20,
|
||
|
"metadata": {},
|
||
|
"outputs": [],
|
||
|
"source": [
|
||
|
"documents_vectorized = list()\n",
|
||
|
"for document in documents_tokenized:\n",
|
||
|
" document_vector = tf(document)\n",
|
||
|
" documents_vectorized.append(document_vector)"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 21,
|
||
|
"metadata": {
|
||
|
"scrolled": true
|
||
|
},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"[array([0., 0., 1., 0., 0., 1., 0., 0., 0., 1., 1., 1., 0., 0., 0., 1., 1.,\n",
|
||
|
" 0., 0., 0., 0., 0., 0., 0., 1., 0.]),\n",
|
||
|
" array([0., 1., 0., 0., 1., 0., 0., 0., 0., 1., 1., 1., 0., 0., 1., 1., 0.,\n",
|
||
|
" 0., 0., 1., 0., 0., 0., 0., 1., 0.]),\n",
|
||
|
" array([0., 0., 0., 0., 0., 1., 1., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0.,\n",
|
||
|
" 0., 1., 0., 0., 0., 0., 0., 0., 0.]),\n",
|
||
|
" array([1., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,\n",
|
||
|
" 0., 0., 0., 0., 1., 1., 1., 0., 1.]),\n",
|
||
|
" array([0., 0., 0., 0., 0., 2., 0., 1., 0., 0., 1., 1., 1., 1., 0., 0., 1.,\n",
|
||
|
" 1., 1., 0., 1., 0., 0., 0., 0., 0.])]"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": 21,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"documents_vectorized"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"### IDF"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 22,
|
||
|
"metadata": {},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"array([5. , 5. , 5. , 5. , 5. ,\n",
|
||
|
" 1.66666667, 5. , 2.5 , 5. , 2.5 ,\n",
|
||
|
" 1.66666667, 1.66666667, 5. , 2.5 , 5. ,\n",
|
||
|
" 2.5 , 2.5 , 5. , 2.5 , 5. ,\n",
|
||
|
" 5. , 5. , 5. , 5. , 2.5 ,\n",
|
||
|
" 5. ])"
|
||
|
]
|
||
|
},
|
||
|
"metadata": {},
|
||
|
"output_type": "display_data"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"idf = np.zeros(len(vocabulary))\n",
|
||
|
"idf = len(documents_vectorized) / np.sum(np.array(documents_vectorized) != 0,axis=0)\n",
|
||
|
"display(idf)"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 23,
|
||
|
"metadata": {},
|
||
|
"outputs": [],
|
||
|
"source": [
|
||
|
"for i in range(len(documents_vectorized)):\n",
|
||
|
" documents_vectorized[i] = documents_vectorized[i]# * idf"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"### ZADANIE 3 Napisa\u0107 funkcj\u0119 similarity, kt\u00f3ra zwraca podobie\u0144stwo kosinusowe mi\u0119dzy dwoma dokumentami w postaci zwektoryzowanej"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": null,
|
||
|
"metadata": {},
|
||
|
"outputs": [],
|
||
|
"source": [
|
||
|
"def similarity(query, document):\n",
|
||
|
" pass"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 25,
|
||
|
"metadata": {},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"'Ala lubi zwierz\u0119ta i ma kota oraz psa!'"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": 25,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"documents[0]"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 26,
|
||
|
"metadata": {},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"array([0., 0., 1., 0., 0., 1., 0., 0., 0., 1., 1., 1., 0., 0., 0., 1., 1.,\n",
|
||
|
" 0., 0., 0., 0., 0., 0., 0., 1., 0.])"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": 26,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"documents_vectorized[0]"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 27,
|
||
|
"metadata": {},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"'Ola lubi zwierz\u0119ta oraz ma kota a tak\u017ce chomika!'"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": 27,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"documents[1]"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 28,
|
||
|
"metadata": {},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"array([0., 1., 0., 0., 1., 0., 0., 0., 0., 1., 1., 1., 0., 0., 1., 1., 0.,\n",
|
||
|
" 0., 0., 1., 0., 0., 0., 0., 1., 0.])"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": 28,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"documents_vectorized[1]"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 29,
|
||
|
"metadata": {},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"0.5892556509887895"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": 29,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"similarity(documents_vectorized[0],documents_vectorized[1])"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 30,
|
||
|
"metadata": {},
|
||
|
"outputs": [],
|
||
|
"source": [
|
||
|
"def transform_query(query):\n",
|
||
|
" query_vector = tf(tokenize_str(get_str_cleaned(query)))\n",
|
||
|
" return query_vector"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 31,
|
||
|
"metadata": {},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,\n",
|
||
|
" 0., 0., 0., 0., 0., 0., 0., 0., 0.])"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": 31,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"transform_query('psa')"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 32,
|
||
|
"metadata": {},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"0.4999999999999999"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": 32,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"similarity(transform_query('psa kota'), documents_vectorized[0])"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 33,
|
||
|
"metadata": {},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"'Ala lubi zwierz\u0119ta i ma kota oraz psa!'"
|
||
|
]
|
||
|
},
|
||
|
"metadata": {},
|
||
|
"output_type": "display_data"
|
||
|
},
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"0.4999999999999999"
|
||
|
]
|
||
|
},
|
||
|
"metadata": {},
|
||
|
"output_type": "display_data"
|
||
|
},
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"'Ola lubi zwierz\u0119ta oraz ma kota a tak\u017ce chomika!'"
|
||
|
]
|
||
|
},
|
||
|
"metadata": {},
|
||
|
"output_type": "display_data"
|
||
|
},
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"0.2357022603955158"
|
||
|
]
|
||
|
},
|
||
|
"metadata": {},
|
||
|
"output_type": "display_data"
|
||
|
},
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"'I Jan je\u017adzi na rowerze.'"
|
||
|
]
|
||
|
},
|
||
|
"metadata": {},
|
||
|
"output_type": "display_data"
|
||
|
},
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"0.0"
|
||
|
]
|
||
|
},
|
||
|
"metadata": {},
|
||
|
"output_type": "display_data"
|
||
|
},
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"'2 wojna \u015bwiatowa by\u0142a wielkim konfliktem zbrojnym'"
|
||
|
]
|
||
|
},
|
||
|
"metadata": {},
|
||
|
"output_type": "display_data"
|
||
|
},
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"0.0"
|
||
|
]
|
||
|
},
|
||
|
"metadata": {},
|
||
|
"output_type": "display_data"
|
||
|
},
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"'Tomek lubi psy, ma psa i je\u017adzi na motorze i rowerze.'"
|
||
|
]
|
||
|
},
|
||
|
"metadata": {},
|
||
|
"output_type": "display_data"
|
||
|
},
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"0.19611613513818402"
|
||
|
]
|
||
|
},
|
||
|
"metadata": {},
|
||
|
"output_type": "display_data"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"# tak s\u0105 obs\u0142ugiwane 2 s\u0142owa\n",
|
||
|
"query = 'psa kota'\n",
|
||
|
"for i in range(len(documents)):\n",
|
||
|
" display(documents[i])\n",
|
||
|
" display(similarity(transform_query(query), documents_vectorized[i]))"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 34,
|
||
|
"metadata": {},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"'Ala lubi zwierz\u0119ta i ma kota oraz psa!'"
|
||
|
]
|
||
|
},
|
||
|
"metadata": {},
|
||
|
"output_type": "display_data"
|
||
|
},
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"0.0"
|
||
|
]
|
||
|
},
|
||
|
"metadata": {},
|
||
|
"output_type": "display_data"
|
||
|
},
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"'Ola lubi zwierz\u0119ta oraz ma kota a tak\u017ce chomika!'"
|
||
|
]
|
||
|
},
|
||
|
"metadata": {},
|
||
|
"output_type": "display_data"
|
||
|
},
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"0.0"
|
||
|
]
|
||
|
},
|
||
|
"metadata": {},
|
||
|
"output_type": "display_data"
|
||
|
},
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"'I Jan je\u017adzi na rowerze.'"
|
||
|
]
|
||
|
},
|
||
|
"metadata": {},
|
||
|
"output_type": "display_data"
|
||
|
},
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"0.4472135954999579"
|
||
|
]
|
||
|
},
|
||
|
"metadata": {},
|
||
|
"output_type": "display_data"
|
||
|
},
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"'2 wojna \u015bwiatowa by\u0142a wielkim konfliktem zbrojnym'"
|
||
|
]
|
||
|
},
|
||
|
"metadata": {},
|
||
|
"output_type": "display_data"
|
||
|
},
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"0.0"
|
||
|
]
|
||
|
},
|
||
|
"metadata": {},
|
||
|
"output_type": "display_data"
|
||
|
},
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"'Tomek lubi psy, ma psa i je\u017adzi na motorze i rowerze.'"
|
||
|
]
|
||
|
},
|
||
|
"metadata": {},
|
||
|
"output_type": "display_data"
|
||
|
},
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"0.2773500981126146"
|
||
|
]
|
||
|
},
|
||
|
"metadata": {},
|
||
|
"output_type": "display_data"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"# dlatego potrzebujemy mianownik w cosine similarity\n",
|
||
|
"query = 'rowerze'\n",
|
||
|
"for i in range(len(documents)):\n",
|
||
|
" display(documents[i])\n",
|
||
|
" display(similarity(transform_query(query), documents_vectorized[i]))"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 35,
|
||
|
"metadata": {
|
||
|
"scrolled": true
|
||
|
},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"'Ala lubi zwierz\u0119ta i ma kota oraz psa!'"
|
||
|
]
|
||
|
},
|
||
|
"metadata": {},
|
||
|
"output_type": "display_data"
|
||
|
},
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"0.35355339059327373"
|
||
|
]
|
||
|
},
|
||
|
"metadata": {},
|
||
|
"output_type": "display_data"
|
||
|
},
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"'Ola lubi zwierz\u0119ta oraz ma kota a tak\u017ce chomika!'"
|
||
|
]
|
||
|
},
|
||
|
"metadata": {},
|
||
|
"output_type": "display_data"
|
||
|
},
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"0.0"
|
||
|
]
|
||
|
},
|
||
|
"metadata": {},
|
||
|
"output_type": "display_data"
|
||
|
},
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"'I Jan je\u017adzi na rowerze.'"
|
||
|
]
|
||
|
},
|
||
|
"metadata": {},
|
||
|
"output_type": "display_data"
|
||
|
},
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"0.4472135954999579"
|
||
|
]
|
||
|
},
|
||
|
"metadata": {},
|
||
|
"output_type": "display_data"
|
||
|
},
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"'2 wojna \u015bwiatowa by\u0142a wielkim konfliktem zbrojnym'"
|
||
|
]
|
||
|
},
|
||
|
"metadata": {},
|
||
|
"output_type": "display_data"
|
||
|
},
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"0.0"
|
||
|
]
|
||
|
},
|
||
|
"metadata": {},
|
||
|
"output_type": "display_data"
|
||
|
},
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"'Tomek lubi psy, ma psa i je\u017adzi na motorze i rowerze.'"
|
||
|
]
|
||
|
},
|
||
|
"metadata": {},
|
||
|
"output_type": "display_data"
|
||
|
},
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"0.5547001962252291"
|
||
|
]
|
||
|
},
|
||
|
"metadata": {},
|
||
|
"output_type": "display_data"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"# dlatego potrzebujemy term frequency \u2192 wiecej znaczy bardziej dopasowany dokument\n",
|
||
|
"query = 'i'\n",
|
||
|
"for i in range(len(documents)):\n",
|
||
|
" display(documents[i])\n",
|
||
|
" display(similarity(transform_query(query), documents_vectorized[i]))"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 36,
|
||
|
"metadata": {},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"'Ala lubi zwierz\u0119ta i ma kota oraz psa!'"
|
||
|
]
|
||
|
},
|
||
|
"metadata": {},
|
||
|
"output_type": "display_data"
|
||
|
},
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"0.24999999999999994"
|
||
|
]
|
||
|
},
|
||
|
"metadata": {},
|
||
|
"output_type": "display_data"
|
||
|
},
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"'Ola lubi zwierz\u0119ta oraz ma kota a tak\u017ce chomika!'"
|
||
|
]
|
||
|
},
|
||
|
"metadata": {},
|
||
|
"output_type": "display_data"
|
||
|
},
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"0.2357022603955158"
|
||
|
]
|
||
|
},
|
||
|
"metadata": {},
|
||
|
"output_type": "display_data"
|
||
|
},
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"'I Jan je\u017adzi na rowerze.'"
|
||
|
]
|
||
|
},
|
||
|
"metadata": {},
|
||
|
"output_type": "display_data"
|
||
|
},
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"0.31622776601683794"
|
||
|
]
|
||
|
},
|
||
|
"metadata": {},
|
||
|
"output_type": "display_data"
|
||
|
},
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"'2 wojna \u015bwiatowa by\u0142a wielkim konfliktem zbrojnym'"
|
||
|
]
|
||
|
},
|
||
|
"metadata": {},
|
||
|
"output_type": "display_data"
|
||
|
},
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"0.0"
|
||
|
]
|
||
|
},
|
||
|
"metadata": {},
|
||
|
"output_type": "display_data"
|
||
|
},
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"'Tomek lubi psy, ma psa i je\u017adzi na motorze i rowerze.'"
|
||
|
]
|
||
|
},
|
||
|
"metadata": {},
|
||
|
"output_type": "display_data"
|
||
|
},
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"0.39223227027636803"
|
||
|
]
|
||
|
},
|
||
|
"metadata": {},
|
||
|
"output_type": "display_data"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"# dlatego IDF - \u017ceby wa\u017cniejsze s\u0142owa mia\u0142 wi\u0119ksz\u0105 wag\u0119\n",
|
||
|
"query = 'i chomika'\n",
|
||
|
"for i in range(len(documents)):\n",
|
||
|
" display(documents[i])\n",
|
||
|
" display(similarity(transform_query(query), documents_vectorized[i]))"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"### ZADANIE 4 NAPISA\u0106 IDF w celu zmiany wag z TF na TF- IDF \n",
|
||
|
"\n",
|
||
|
"Prosz\u0119 u\u017cy\u0107 wersj\u0119 bez \u017cadnej normalizacji\n",
|
||
|
"\n",
|
||
|
"\n",
|
||
|
"$idf_i = \\Large\\frac{|D|}{|\\{d : t_i \\in d \\}|}$\n",
|
||
|
"\n",
|
||
|
"\n",
|
||
|
"$|D|$ - ilo\u015b\u0107 dokument\u00f3w w korpusie\n",
|
||
|
"$|\\{d : t_i \\in d \\}|$ - ilo\u015b\u0107 dokument\u00f3w w korpusie, gdzie dany term wyst\u0119puje chocia\u017c jeden raz"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": null,
|
||
|
"metadata": {},
|
||
|
"outputs": [],
|
||
|
"source": []
|
||
|
}
|
||
|
],
|
||
|
"metadata": {
|
||
|
"kernelspec": {
|
||
|
"display_name": "Python 3",
|
||
|
"language": "python",
|
||
|
"name": "python3"
|
||
|
},
|
||
|
"language_info": {
|
||
|
"codemirror_mode": {
|
||
|
"name": "ipython",
|
||
|
"version": 3
|
||
|
},
|
||
|
"file_extension": ".py",
|
||
|
"mimetype": "text/x-python",
|
||
|
"name": "python",
|
||
|
"nbconvert_exporter": "python",
|
||
|
"pygments_lexer": "ipython3",
|
||
|
"version": "3.8.3"
|
||
|
},
|
||
|
"author": "Jakub Pokrywka",
|
||
|
"email": "kubapok@wmi.amu.edu.pl",
|
||
|
"lang": "pl",
|
||
|
"subtitle": "3.tfidf (1)[\u0107wiczenia]",
|
||
|
"title": "Ekstrakcja informacji",
|
||
|
"year": "2021"
|
||
|
},
|
||
|
"nbformat": 4,
|
||
|
"nbformat_minor": 4
|
||
|
}
|