1573 lines
40 KiB
Plaintext
1573 lines
40 KiB
Plaintext
{
|
||
"cells": [
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"## Uczenie głębokie – przetwarzanie tekstu – laboratoria\n",
|
||
"# 1. TF–IDF"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 1,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"import numpy as np\n",
|
||
"import re"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"## Zbiór dokumentów"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 2,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"documents = ['Ala lubi zwierzęta i ma kota oraz psa!',\n",
|
||
" 'Ola lubi zwierzęta oraz ma kota a także chomika!',\n",
|
||
" 'I Jan jeździ na rowerze.',\n",
|
||
" '2 wojna światowa była wielkim konfliktem zbrojnym',\n",
|
||
" 'Tomek lubi psy, ma psa i jeździ na motorze i rowerze.',\n",
|
||
" ]"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"Czego potrzebujemy?\n",
|
||
"\n",
|
||
"- Chcemy zamienić teksty na zbiór słów."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"### ❔ Pytania\n",
|
||
"\n",
|
||
"- Czy do stokenizowania tekstu możemy użyć `document.split(' ')`?\n",
|
||
"- Jakie trudności możemy napotkać?"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"## Preprocessing"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 3,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"def get_str_cleaned(str_dirty):\n",
|
||
" punctuation = '!\"#$%&\\'()*+,-./:;<=>?@[\\\\]^_`{|}~'\n",
|
||
" new_str = str_dirty.lower()\n",
|
||
" new_str = re.sub(' +', ' ', new_str)\n",
|
||
" for char in punctuation:\n",
|
||
" new_str = new_str.replace(char,'')\n",
|
||
" return new_str"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 4,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"sample_document = get_str_cleaned(documents[0])"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 11,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"'ala lubi zwierzęta i ma kota oraz psa'"
|
||
]
|
||
},
|
||
"execution_count": 11,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"sample_document"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"## Tokenizacja"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 6,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"def tokenize_str(document):\n",
|
||
" return document.split(' ')"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 7,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"['ala', 'lubi', 'zwierzęta', 'i', 'ma', 'kota', 'oraz', 'psa']"
|
||
]
|
||
},
|
||
"execution_count": 7,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"tokenize_str(sample_document)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 8,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"documents_cleaned = [get_str_cleaned(document) for document in documents]"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 10,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"['ala lubi zwierzęta i ma kota oraz psa',\n",
|
||
" 'ola lubi zwierzęta oraz ma kota a także chomika',\n",
|
||
" 'i jan jeździ na rowerze',\n",
|
||
" '2 wojna światowa była wielkim konfliktem zbrojnym',\n",
|
||
" 'tomek lubi psy ma psa i jeździ na motorze i rowerze']"
|
||
]
|
||
},
|
||
"execution_count": 10,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"documents_cleaned"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 12,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"documents_tokenized = [tokenize_str(d) for d in documents_cleaned]"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 13,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"[['ala', 'lubi', 'zwierzęta', 'i', 'ma', 'kota', 'oraz', 'psa'],\n",
|
||
" ['ola', 'lubi', 'zwierzęta', 'oraz', 'ma', 'kota', 'a', 'także', 'chomika'],\n",
|
||
" ['i', 'jan', 'jeździ', 'na', 'rowerze'],\n",
|
||
" ['2', 'wojna', 'światowa', 'była', 'wielkim', 'konfliktem', 'zbrojnym'],\n",
|
||
" ['tomek',\n",
|
||
" 'lubi',\n",
|
||
" 'psy',\n",
|
||
" 'ma',\n",
|
||
" 'psa',\n",
|
||
" 'i',\n",
|
||
" 'jeździ',\n",
|
||
" 'na',\n",
|
||
" 'motorze',\n",
|
||
" 'i',\n",
|
||
" 'rowerze']]"
|
||
]
|
||
},
|
||
"execution_count": 13,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"documents_tokenized"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"### ❔ Pytania\n",
|
||
"\n",
|
||
"- Jaki jest następny krok w celu stworzenia wektórów TF lub TF–IDF?\n",
|
||
"- Jakie wielkości będzie wektor TF lub TF–IDF?"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"## Stworzenie słownika"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 14,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"vocabulary = []\n",
|
||
"for document in documents_tokenized:\n",
|
||
" for word in document:\n",
|
||
" vocabulary.append(word)\n",
|
||
"vocabulary = sorted(set(vocabulary))"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 15,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"['2',\n",
|
||
" 'a',\n",
|
||
" 'ala',\n",
|
||
" 'była',\n",
|
||
" 'chomika',\n",
|
||
" 'i',\n",
|
||
" 'jan',\n",
|
||
" 'jeździ',\n",
|
||
" 'konfliktem',\n",
|
||
" 'kota',\n",
|
||
" 'lubi',\n",
|
||
" 'ma',\n",
|
||
" 'motorze',\n",
|
||
" 'na',\n",
|
||
" 'ola',\n",
|
||
" 'oraz',\n",
|
||
" 'psa',\n",
|
||
" 'psy',\n",
|
||
" 'rowerze',\n",
|
||
" 'także',\n",
|
||
" 'tomek',\n",
|
||
" 'wielkim',\n",
|
||
" 'wojna',\n",
|
||
" 'zbrojnym',\n",
|
||
" 'zwierzęta',\n",
|
||
" 'światowa']"
|
||
]
|
||
},
|
||
"execution_count": 15,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"vocabulary"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"## 📝 Zadanie **1.1** *(1 pkt)*\n",
|
||
"\n",
|
||
"Napisz funkcję `word_to_index(word: str)`, która dla danego słowa zwraca wektor jednostkowy (*one-hot vector*) w postaci `numpy.array`.\n",
|
||
"\n",
|
||
"Przyjmij, że słownik dany jest za pomocą zmiennej globalnej `vocabulary`."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 66,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"def word_to_index(word: str) -> np.array:\n",
|
||
" # TODO\n",
|
||
" pass"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 23,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,\n",
|
||
" 0., 0., 0., 0., 0., 0., 0., 0., 0.])"
|
||
]
|
||
},
|
||
"execution_count": 23,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"word_to_index('psa')"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"## 📝 Zadanie **1.2** *(1 pkt)*\n",
|
||
"\n",
|
||
"Napisz funkcję, która zamienia listę słów na wektor TF. "
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 68,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"def tf(document: list) -> np.array:\n",
|
||
" # TODO\n",
|
||
" pass"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 26,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"array([0., 0., 1., 0., 0., 1., 0., 0., 0., 1., 1., 1., 0., 0., 0., 1., 1.,\n",
|
||
" 0., 0., 0., 0., 0., 0., 0., 1., 0.])"
|
||
]
|
||
},
|
||
"execution_count": 26,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"tf(documents_tokenized[0])"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 27,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"documents_vectorized = list()\n",
|
||
"for document in documents_tokenized:\n",
|
||
" document_vector = tf(document)\n",
|
||
" documents_vectorized.append(document_vector)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 28,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"[array([0., 0., 1., 0., 0., 1., 0., 0., 0., 1., 1., 1., 0., 0., 0., 1., 1.,\n",
|
||
" 0., 0., 0., 0., 0., 0., 0., 1., 0.]),\n",
|
||
" array([0., 1., 0., 0., 1., 0., 0., 0., 0., 1., 1., 1., 0., 0., 1., 1., 0.,\n",
|
||
" 0., 0., 1., 0., 0., 0., 0., 1., 0.]),\n",
|
||
" array([0., 0., 0., 0., 0., 1., 1., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0.,\n",
|
||
" 0., 1., 0., 0., 0., 0., 0., 0., 0.]),\n",
|
||
" array([1., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,\n",
|
||
" 0., 0., 0., 0., 1., 1., 1., 0., 1.]),\n",
|
||
" array([0., 0., 0., 0., 0., 2., 0., 1., 0., 0., 1., 1., 1., 1., 0., 0., 1.,\n",
|
||
" 1., 1., 0., 1., 0., 0., 0., 0., 0.])]"
|
||
]
|
||
},
|
||
"execution_count": 28,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"documents_vectorized"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"## IDF"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 29,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"array([5. , 5. , 5. , 5. , 5. ,\n",
|
||
" 1.66666667, 5. , 2.5 , 5. , 2.5 ,\n",
|
||
" 1.66666667, 1.66666667, 5. , 2.5 , 5. ,\n",
|
||
" 2.5 , 2.5 , 5. , 2.5 , 5. ,\n",
|
||
" 5. , 5. , 5. , 5. , 2.5 ,\n",
|
||
" 5. ])"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
}
|
||
],
|
||
"source": [
|
||
"idf = np.zeros(len(vocabulary))\n",
|
||
"idf = len(documents_vectorized) / np.sum(np.array(documents_vectorized) != 0,axis=0)\n",
|
||
"display(idf)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"## 📝 Zadanie **1.3** *(1 pkt)*\n",
|
||
"\n",
|
||
"Napisz funkcję, która zwraca podobieństwo kosinusowe między dwoma dokumentami w postaci zwektoryzowanej."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 67,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"def similarity(query: np.array, document: np.array) -> float:\n",
|
||
" # TODO\n",
|
||
" pass"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 32,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"'Ala lubi zwierzęta i ma kota oraz psa!'"
|
||
]
|
||
},
|
||
"execution_count": 32,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"documents[0]"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 33,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"array([0., 0., 1., 0., 0., 1., 0., 0., 0., 1., 1., 1., 0., 0., 0., 1., 1.,\n",
|
||
" 0., 0., 0., 0., 0., 0., 0., 1., 0.])"
|
||
]
|
||
},
|
||
"execution_count": 33,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"documents_vectorized[0]"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 34,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"'Ola lubi zwierzęta oraz ma kota a także chomika!'"
|
||
]
|
||
},
|
||
"execution_count": 34,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"documents[1]"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 36,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"array([0., 1., 0., 0., 1., 0., 0., 0., 0., 1., 1., 1., 0., 0., 1., 1., 0.,\n",
|
||
" 0., 0., 1., 0., 0., 0., 0., 1., 0.])"
|
||
]
|
||
},
|
||
"execution_count": 36,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"documents_vectorized[1]"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 38,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"0.5892556509887895"
|
||
]
|
||
},
|
||
"execution_count": 38,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"similarity(documents_vectorized[0], documents_vectorized[1])"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"## Prosta wyszukiwarka"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"def transform_query(query):\n",
|
||
" \"\"\"Funkcja, która czyści i tokenizuje zapytanie\"\"\"\n",
|
||
" query_vector = tf(tokenize_str(get_str_cleaned(query)))\n",
|
||
" return query_vector"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"0.4999999999999999"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
}
|
||
],
|
||
"source": [
|
||
"similarity(transform_query('psa kota'), documents_vectorized[0])"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 41,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"'Ala lubi zwierzęta i ma kota oraz psa!'"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"0.4999999999999999"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"'Ola lubi zwierzęta oraz ma kota a także chomika!'"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"0.2357022603955158"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"'I Jan jeździ na rowerze.'"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"0.0"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"'2 wojna światowa była wielkim konfliktem zbrojnym'"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"0.0"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"'Tomek lubi psy, ma psa i jeździ na motorze i rowerze.'"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"0.19611613513818402"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
}
|
||
],
|
||
"source": [
|
||
"query = 'psa kota'\n",
|
||
"for i in range(len(documents)):\n",
|
||
" display(documents[i])\n",
|
||
" display(similarity(transform_query(query), documents_vectorized[i]))"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 42,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"'Ala lubi zwierzęta i ma kota oraz psa!'"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"0.0"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"'Ola lubi zwierzęta oraz ma kota a także chomika!'"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"0.0"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"'I Jan jeździ na rowerze.'"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"0.4472135954999579"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"'2 wojna światowa była wielkim konfliktem zbrojnym'"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"0.0"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"'Tomek lubi psy, ma psa i jeździ na motorze i rowerze.'"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"0.2773500981126146"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
}
|
||
],
|
||
"source": [
|
||
"# dlatego potrzebujemy mianownik w cosine similarity\n",
|
||
"query = 'rowerze'\n",
|
||
"for i in range(len(documents)):\n",
|
||
" display(documents[i])\n",
|
||
" display(similarity(transform_query(query), documents_vectorized[i]))"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 43,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"'Ala lubi zwierzęta i ma kota oraz psa!'"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"0.35355339059327373"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"'Ola lubi zwierzęta oraz ma kota a także chomika!'"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"0.0"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"'I Jan jeździ na rowerze.'"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"0.4472135954999579"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"'2 wojna światowa była wielkim konfliktem zbrojnym'"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"0.0"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"'Tomek lubi psy, ma psa i jeździ na motorze i rowerze.'"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"0.5547001962252291"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
}
|
||
],
|
||
"source": [
|
||
"# dlatego potrzebujemy term frequency → wiecej znaczy bardziej dopasowany dokument\n",
|
||
"query = 'i'\n",
|
||
"for i in range(len(documents)):\n",
|
||
" display(documents[i])\n",
|
||
" display(similarity(transform_query(query), documents_vectorized[i]))"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 44,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"'Ala lubi zwierzęta i ma kota oraz psa!'"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"0.24999999999999994"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"'Ola lubi zwierzęta oraz ma kota a także chomika!'"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"0.2357022603955158"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"'I Jan jeździ na rowerze.'"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"0.31622776601683794"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"'2 wojna światowa była wielkim konfliktem zbrojnym'"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"0.0"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"'Tomek lubi psy, ma psa i jeździ na motorze i rowerze.'"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"0.39223227027636803"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
}
|
||
],
|
||
"source": [
|
||
"# dlatego IDF - żeby ważniejsze słowa miał większą wagę\n",
|
||
"query = 'i chomika'\n",
|
||
"for i in range(len(documents)):\n",
|
||
" display(documents[i])\n",
|
||
" display(similarity(transform_query(query), documents_vectorized[i]))"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"## Biblioteki"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 48,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"import numpy as np\n",
|
||
"import sklearn.metrics\n",
|
||
"\n",
|
||
"from sklearn.datasets import fetch_20newsgroups\n",
|
||
"\n",
|
||
"from sklearn.feature_extraction.text import TfidfVectorizer"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 49,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"newsgroups = fetch_20newsgroups()['data']"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 50,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"11314"
|
||
]
|
||
},
|
||
"execution_count": 50,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"len(newsgroups)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 51,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"From: lerxst@wam.umd.edu (where's my thing)\n",
|
||
"Subject: WHAT car is this!?\n",
|
||
"Nntp-Posting-Host: rac3.wam.umd.edu\n",
|
||
"Organization: University of Maryland, College Park\n",
|
||
"Lines: 15\n",
|
||
"\n",
|
||
" I was wondering if anyone out there could enlighten me on this car I saw\n",
|
||
"the other day. It was a 2-door sports car, looked to be from the late 60s/\n",
|
||
"early 70s. It was called a Bricklin. The doors were really small. In addition,\n",
|
||
"the front bumper was separate from the rest of the body. This is \n",
|
||
"all I know. If anyone can tellme a model name, engine specs, years\n",
|
||
"of production, where this car is made, history, or whatever info you\n",
|
||
"have on this funky looking car, please e-mail.\n",
|
||
"\n",
|
||
"Thanks,\n",
|
||
"- IL\n",
|
||
" ---- brought to you by your neighborhood Lerxst ----\n",
|
||
"\n",
|
||
"\n",
|
||
"\n",
|
||
"\n",
|
||
"\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"print(newsgroups[0])"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"### Naiwne przeszukiwanie"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 52,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"all_documents = list() \n",
|
||
"for document in newsgroups:\n",
|
||
" if 'car' in document:\n",
|
||
" all_documents.append(document)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 53,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"From: lerxst@wam.umd.edu (where's my thing)\n",
|
||
"Subject: WHAT car is this!?\n",
|
||
"Nntp-Posting-Host: rac3.wam.umd.edu\n",
|
||
"Organization: University of Maryland, College Park\n",
|
||
"Lines: 15\n",
|
||
"\n",
|
||
" I was wondering if anyone out there could enlighten me on this car I saw\n",
|
||
"the other day. It was a 2-door sports car, looked to be from the late 60s/\n",
|
||
"early 70s. It was called a Bricklin. The doors were really small. In addition,\n",
|
||
"the front bumper was separate from the rest of the body. This is \n",
|
||
"all I know. If anyone can tellme a model name, engine specs, years\n",
|
||
"of production, where this car is made, history, or whatever info you\n",
|
||
"have on this funky looking car, please e-mail.\n",
|
||
"\n",
|
||
"Thanks,\n",
|
||
"- IL\n",
|
||
" ---- brought to you by your neighborhood Lerxst ----\n",
|
||
"\n",
|
||
"\n",
|
||
"\n",
|
||
"\n",
|
||
"\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"print(all_documents[0])"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 54,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"From: guykuo@carson.u.washington.edu (Guy Kuo)\n",
|
||
"Subject: SI Clock Poll - Final Call\n",
|
||
"Summary: Final call for SI clock reports\n",
|
||
"Keywords: SI,acceleration,clock,upgrade\n",
|
||
"Article-I.D.: shelley.1qvfo9INNc3s\n",
|
||
"Organization: University of Washington\n",
|
||
"Lines: 11\n",
|
||
"NNTP-Posting-Host: carson.u.washington.edu\n",
|
||
"\n",
|
||
"A fair number of brave souls who upgraded their SI clock oscillator have\n",
|
||
"shared their experiences for this poll. Please send a brief message detailing\n",
|
||
"your experiences with the procedure. Top speed attained, CPU rated speed,\n",
|
||
"add on cards and adapters, heat sinks, hour of usage per day, floppy disk\n",
|
||
"functionality with 800 and 1.4 m floppies are especially requested.\n",
|
||
"\n",
|
||
"I will be summarizing in the next two days, so please add to the network\n",
|
||
"knowledge base if you have done the clock upgrade and haven't answered this\n",
|
||
"poll. Thanks.\n",
|
||
"\n",
|
||
"Guy Kuo <guykuo@u.washington.edu>\n",
|
||
"\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"print(all_documents[1])"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"#### ❔ Pytanie\n",
|
||
"\n",
|
||
"Jakie są problemy z takim podejściem?"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"### TF–IDF i odległość kosinusowa"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 55,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"vectorizer = TfidfVectorizer()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 56,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"document_vectors = vectorizer.fit_transform(newsgroups)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 57,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"<11314x130107 sparse matrix of type '<class 'numpy.float64'>'\n",
|
||
"\twith 1787565 stored elements in Compressed Sparse Row format>"
|
||
]
|
||
},
|
||
"execution_count": 57,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"document_vectors"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 58,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"<1x130107 sparse matrix of type '<class 'numpy.float64'>'\n",
|
||
"\twith 89 stored elements in Compressed Sparse Row format>"
|
||
]
|
||
},
|
||
"execution_count": 58,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"document_vectors[0]"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 59,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"matrix([[0., 0., 0., ..., 0., 0., 0.]])"
|
||
]
|
||
},
|
||
"execution_count": 59,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"document_vectors[0].todense()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 60,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"matrix([[0., 0., 0., ..., 0., 0., 0.],\n",
|
||
" [0., 0., 0., ..., 0., 0., 0.],\n",
|
||
" [0., 0., 0., ..., 0., 0., 0.],\n",
|
||
" [0., 0., 0., ..., 0., 0., 0.]])"
|
||
]
|
||
},
|
||
"execution_count": 60,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"document_vectors[0:4].todense()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"query_str = 'speed'\n",
|
||
"#query_str = 'speed car'\n",
|
||
"#query_str = 'spider man'"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 65,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"[0.17360013 0.22933014 0.28954818 0.45372239]\n",
|
||
"[ 2455 8920 5497 11031]\n",
|
||
"From: keiths@spider.co.uk (Keith Smith)\n",
|
||
"Subject: win/NT file systems\n",
|
||
"Organization: Spider Systems Limited, Edinburgh, UK.\n",
|
||
"Lines: 6\n",
|
||
"Nntp-Posting-Host: trapdoor.spider.co.uk\n",
|
||
"\n",
|
||
"OK will some one out there tell me why / how DOS 5\n",
|
||
"can read (I havn't tried writing in case it breaks something)\n",
|
||
"the Win/NT NTFS file system.\n",
|
||
"I thought NTFS was supposed to be better than the FAT system\n",
|
||
"\n",
|
||
"keith\n",
|
||
"\n",
|
||
"0.4537223924558256\n",
|
||
"----------------------------------------------------------------------------------------------------\n",
|
||
"----------------------------------------------------------------------------------------------------\n",
|
||
"----------------------------------------------------------------------------------------------------\n",
|
||
"From: brandt@cs.unc.edu (Andrew Brandt)\n",
|
||
"Subject: Seeking good Alfa Romeo mechanic.\n",
|
||
"Organization: The University of North Carolina at Chapel Hill\n",
|
||
"Lines: 14\n",
|
||
"NNTP-Posting-Host: axon.cs.unc.edu\n",
|
||
"Keywords: alfa, romeo, spider, mechanic\n",
|
||
"\n",
|
||
"I am looking for recommendations for a good (great?) Alfa Romeo\n",
|
||
"mechanic in South Jersey or Philadelphia or nearby.\n",
|
||
"\n",
|
||
"I have a '78 Alfa Spider that needs some engine, tranny, steering work\n",
|
||
"done. The body is in quite good shape. The car is awful in cold\n",
|
||
"weather, won't start if below freezing (I know, I know, why drive a\n",
|
||
"Spider if there's snow on the ground ...). It has Bosch *mechanical*\n",
|
||
"fuel injection that I am sure needs adjustment.\n",
|
||
"\n",
|
||
"Any opinions are welcome on what to look for or who to call.\n",
|
||
"\n",
|
||
"Email or post (to rec.autos), I will summarize if people want.\n",
|
||
"\n",
|
||
"Thx, Andy (brandt@cs.unc.edu)\n",
|
||
"\n",
|
||
"0.28954817869991817\n",
|
||
"----------------------------------------------------------------------------------------------------\n",
|
||
"----------------------------------------------------------------------------------------------------\n",
|
||
"----------------------------------------------------------------------------------------------------\n",
|
||
"From: michaelr@spider.co.uk (Michael S. A. Robb)\n",
|
||
"Subject: Re: Honors Degrees: Do they mean anything?\n",
|
||
"Organization: Spider Systems Limited, Edinburgh, UK.\n",
|
||
"Lines: 44\n",
|
||
"\n",
|
||
"In article <TKLD.93Apr2123341@burns.cogsci.ed.ac.uk> tkld@cogsci.ed.ac.uk (Kevin Davidson) writes:\n",
|
||
">\n",
|
||
">> In my opinion, a programming degree is still worth having.\n",
|
||
">\n",
|
||
"> Yes, but a CS degree is *not* a programming degree. Does anybody know of\n",
|
||
">a computing course where *programming* is taught ? Computer Science is\n",
|
||
">a branch of maths (or the course I did was).\n",
|
||
"> I've also done a Software Engineering course - much more practical and likely\n",
|
||
">to be the sort of thing an employer really wants, rather than what they think\n",
|
||
">they want, but also did not teach programming. The ability to program was\n",
|
||
">an entry requirement.\n",
|
||
"\n",
|
||
"At Robert Gordon University, programming was the main (most time-consuming) \n",
|
||
"start of the course. The first two years consisted of five subjects:\n",
|
||
"Software Engineering (Pascal/C/UNIX), Computer Engineering (6502/6809/68000 \n",
|
||
"assembler), Computer Theory (LISP/Prolog), Mathematics/Statistics and \n",
|
||
"Communication Skills (How to pass interviews/intelligence tests and group\n",
|
||
"discussions e.g. How to survive a helicopter crash in the North Sea).\n",
|
||
"The third year (Industrial placement) was spent working for a computer company \n",
|
||
"for a year. The company could be anywhere in Europe (there was a special \n",
|
||
"Travel Allowance Scheme to cover the visiting costs of professors). \n",
|
||
"The fourth year included Operating Systems(C/Modula-2), Software Engineering \n",
|
||
"(C/8086 assembler), Real Time Laboratory (C/68000 assembler) and Computing \n",
|
||
"Theory (LISP). There were also Group Projects in 2nd and 4th Years, where \n",
|
||
"students worked in teams to select their own project or decide to work for an \n",
|
||
"outside company (the only disadvantage being that specifications would change \n",
|
||
"suddenly).\n",
|
||
" \n",
|
||
"In the first four years, there was a 50%:50% weighting between courseworks and \n",
|
||
"exams for most subjects. However in the Honours year, this was reduced to a \n",
|
||
"30%:70% split between an Individual Project and final exams (no coursework \n",
|
||
"assessment) - are all Computer Science courses like this?\n",
|
||
"\n",
|
||
"BTW - we started off with 22 students in our first year and were left with 8 by\n",
|
||
"Honours year. Also, every course is tutored separately. Not easy trying\n",
|
||
"to sleep when you are in 8 student class :-). \n",
|
||
"\n",
|
||
"Cheers,\n",
|
||
" Michael \n",
|
||
"-- \n",
|
||
"| Michael S. A. Robb | Tel: +44 31 554 9424 | \"..The problem with bolt-on\n",
|
||
"| Software Engineer | Fax: +44 31 554 0649 | software is making sure the\n",
|
||
"| Spider Systems Limited | E-mail: | bolts are the right size..\"\n",
|
||
"| Edinburgh, EH6 5NG | michaelr@spider.co.uk | - Anonymous\n",
|
||
"\n",
|
||
"0.22933013891071233\n",
|
||
"----------------------------------------------------------------------------------------------------\n",
|
||
"----------------------------------------------------------------------------------------------------\n",
|
||
"----------------------------------------------------------------------------------------------------\n",
|
||
"From: jrm@elm.circa.ufl.edu (Jeff Mason)\n",
|
||
"Subject: AUCTION: Marvel, DC, Valiant, Image, Dark Horse, etc...\n",
|
||
"Organization: Univ. of Florida Psychology Dept.\n",
|
||
"Lines: 59\n",
|
||
"NNTP-Posting-Host: elm.circa.ufl.edu\n",
|
||
"\n",
|
||
"I am auctioning off the following comics. These minimum bids are set\n",
|
||
"below what I would normally sell them for. Make an offer, and I will\n",
|
||
"accept the highest bid after the auction has been completed.\n",
|
||
"\n",
|
||
"TITLE Minimum/Current \n",
|
||
"--------------------------------------------------------------\n",
|
||
"Alpha Flight 51 (Jim Lee's first work at Marvel)\t$ 5.00\n",
|
||
"Aliens 1 (1st app Aliens in comics, 1st prnt, May 1988)\t$20.00\n",
|
||
"Amazing Spider-Man 136 (Intro new Green Goblin) $20.00\n",
|
||
"Amazing Spider-Man 238 (1st appearance Hobgoblin)\t$50.00\n",
|
||
"Archer and Armstrong 1 (Frank Miller/Smith/Layton)\t$ 7.50\n",
|
||
"Avengers 263 (1st appearance X-factor) $ 3.50\n",
|
||
"Bloodshot 1 (Chromium cover, BWSmith Cover/Poster)\t$ 5.00\n",
|
||
"Daredevil 158 (Frank Miller art begins) $35.00\n",
|
||
"Dark Horse Presents 1 (1st app Concrete, 1st printing)\t$ 7.50 \n",
|
||
"H.A.R.D. Corps 1 \t\t\t\t\t$ 5.00\n",
|
||
"Incredible Hulk 324 (1st app Grey Hulk since #1, 1962)\t$ 7.50\n",
|
||
"Incredible Hulk 330 (1st McFarlane issue)\t\t$15.00\n",
|
||
"Incredible Hulk 331 (Grey Hulk series begins)\t\t$11.20\t\n",
|
||
"Incredible Hulk 367 (1st Dale Keown art in Hulk) $15.00\n",
|
||
"Incredible Hulk 377 (1st all new hulk, 1st prnt, Keown) $15.00\n",
|
||
"Marvel Comics Presents 1 (Wolverine, Silver Surfer) $ 7.50\n",
|
||
"Maxx Limited Ashcan (4000 copies exist, blue cover)\t$30.00\n",
|
||
"New Mutants 86 (McFarlane cover, 1st app Cable - cameo)\t$10.00\n",
|
||
"New Mutants 100 (1st app X-Force) $ 5.00\n",
|
||
"New Mutants Annual 5 (1st Liefeld art on New Mutants)\t$10.00\n",
|
||
"Omega Men 3 (1st appearance Lobo) $ 7.50\n",
|
||
"Omega Men 10 (1st full Lobo story) $ 7.50\n",
|
||
"Power Man & Iron Fist 78 (3rd appearance Sabretooth) $25.00\n",
|
||
" 84 (4th appearance Sabretooth) $20.00\n",
|
||
"Simpsons Comics and Stories 1 (Polybagged special ed.)\t$ 7.50\n",
|
||
"Spectacular Spider-Man 147 (1st app New Hobgoblin) $12.50\n",
|
||
"Star Trek the Next Generation 1 (Feb 1988, DC mini) $ 7.50\n",
|
||
"Star Trek the Next Generation 1 (Oct 1989, DC comics) $ 7.50\n",
|
||
"Web of Spider-Man 29 (Hobgoblin, Wolverine appear) $10.00 \n",
|
||
"Web of Spider-Man 30 (Origin Rose, Hobgoblin appears) $ 7.50\n",
|
||
"Wolverine 10 (Before claws, 1st battle with Sabretooth)\t$15.00\n",
|
||
"Wolverine 41 (Sabretooth claims to be Wolverine's dad)\t$ 5.00\n",
|
||
"Wolverine 42 (Sabretooth proven not to be his dad)\t$ 3.50\n",
|
||
"Wolverine 43 (Sabretooth/Wolverine saga concludes)\t$ 3.00\n",
|
||
"Wolverine 1 (1982 mini-series, Miller art)\t\t$20.00\n",
|
||
"Wonder Woman 267 (Return of Animal Man) $12.50\n",
|
||
"X-Force 1 (Signed by Liefeld, Bagged, X-Force card) $20.00\n",
|
||
"X-Force 1 (Signed by Liefeld, Bagged, Shatterstar card) $10.00\n",
|
||
"X-Force 1 (Signed by Liefeld, Bagged, Deadpool card) $10.00\n",
|
||
"X-Force 1 (Signed by Liefeld, Bagged, Sunspot/Gideon) $10.00\n",
|
||
"\n",
|
||
"All comics are in near mint to mint condition, are bagged in shiny \n",
|
||
"polypropylene bags, and backed with white acid free boards. Shipping is\n",
|
||
"$1.50 for one book, $3.00 for more than one book, or free if you order \n",
|
||
"a large enough amount of stuff. I am willing to haggle.\n",
|
||
"\n",
|
||
"I have thousands and thousands of other comics, so please let me know what \n",
|
||
"you've been looking for, and maybe I can help. Some titles I have posted\n",
|
||
"here don't list every issue I have of that title, I tried to save space.\n",
|
||
"-- \n",
|
||
"Geoffrey R. Mason\t\t|\tjrm@elm.circa.ufl.edu\n",
|
||
"Department of Psychology\t|\tmason@webb.psych.ufl.edu\n",
|
||
"University of Florida\t\t|\tprothan@maple.circa.ufl.edu\n",
|
||
"\n",
|
||
"0.17360012846950526\n",
|
||
"----------------------------------------------------------------------------------------------------\n",
|
||
"----------------------------------------------------------------------------------------------------\n",
|
||
"----------------------------------------------------------------------------------------------------\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"query_vector = vectorizer.transform([query_str])\n",
|
||
"similarities = sklearn.metrics.pairwise.cosine_similarity(query_vector,document_vectors)\n",
|
||
"print(np.sort(similarities)[0][-4:])\n",
|
||
"print(similarities.argsort()[0][-4:])\n",
|
||
"\n",
|
||
"for i in range (1,5):\n",
|
||
" print(newsgroups[similarities.argsort()[0][-i]])\n",
|
||
" print(np.sort(similarities)[0,-i])\n",
|
||
" print('-'*100)\n",
|
||
" print('-'*100)\n",
|
||
" print('-'*100)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"## 📝 Zadanie **1.4** *(4 pkt.)*\n",
|
||
"\n",
|
||
"Wybierz zbiór tekstowy, który ma conajmniej 10000 dokumentów (inny niż w tym przykładzie).\n",
|
||
"Na jego podstawie stwórz wyszukiwarkę wykorzystującą TF–IDF i podobieństwo kosinusowe do oceny podobieństwa dokumentów. Wyszukiwarka powinna zwracać kilka posortowanych najbardziej pasujących dokumentów razem ze score'ami."
|
||
]
|
||
}
|
||
],
|
||
"metadata": {
|
||
"kernelspec": {
|
||
"display_name": "Python 3",
|
||
"language": "python",
|
||
"name": "python3"
|
||
},
|
||
"language_info": {
|
||
"codemirror_mode": {
|
||
"name": "ipython",
|
||
"version": 3
|
||
},
|
||
"file_extension": ".py",
|
||
"mimetype": "text/x-python",
|
||
"name": "python",
|
||
"nbconvert_exporter": "python",
|
||
"pygments_lexer": "ipython3",
|
||
"version": "3.10.12"
|
||
}
|
||
},
|
||
"nbformat": 4,
|
||
"nbformat_minor": 2
|
||
}
|