aitech-eks-pub/wyk/03_Tfidf.ipynb

{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"![Logo 1](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech1.jpg)\n",
"<div class=\"alert alert-block alert-info\">\n",
"<h1> Ekstrakcja informacji </h1>\n",
"<h2> 3. <i>Wyszukiwarki \u2014 TF-IDF</i> [wyk\u0142ad]</h2> \n",
"<h3> Filip Grali\u0144ski (2021)</h3>\n",
"</div>\n",
"\n",
"![Logo 2](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech2.jpg)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Wyszukiwarka - szybka i sensowna"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Roboczy przyk\u0142ad\n",
"\n",
"Zak\u0142adamy, \u017ce mamy pewn\u0105 kolekcj\u0119 dokument\u00f3w $D = {d_1, \\ldots, d_N}$. ($N$ - liczba dokument\u00f3w w kolekcji)."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Ala ma kota."
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"{-# LANGUAGE OverloadedStrings #-}\n",
"\n",
"import Data.Text hiding(map, filter, zip)\n",
"import Prelude hiding(words, take)\n",
"\n",
"collectionD :: [Text]\n",
"collectionD = [\"Ala ma kota.\", \"Podobno jest kot w butach.\", \"Ty chyba masz kota!\", \"But chyba zgubi\u0142em.\", \"Kot ma kota.\"]\n",
"\n",
"-- Operator (!!) zwraca element listy o podanym indeksie\n",
"-- (Przy wi\u0119kszych listach b\u0119dzie nieefektywne, ale nie b\u0119dziemy komplikowa\u0107)\n",
"Prelude.head collectionD"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Wydobycie tekstu\n",
"\n",
"Przyk\u0142adowe narz\u0119dzia:\n",
"\n",
"* pdftotext\n",
"* antiword\n",
"* Tesseract OCR\n",
"* Apache Tika - uniwersalne narz\u0119dzie do wydobywania tekstu z r\u00f3\u017cnych format\u00f3w\n",
"\n",
"## Normalizacja tekstu\n",
"\n",
"Cokolwiek robimy z tekstem, najpierw musimy go _znormalizowa\u0107_."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Tokenizacja\n",
"\n",
"Po pierwsze musimy podzieli\u0107 tekst na _tokeny_, czyli wyrazapodobne jednostki.\n",
"Mo\u017ce po prostu podzieli\u0107 po spacjach?"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Ala"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ma"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"kota."
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"tokenizeStupidly :: Text -> [Text]\n",
"-- words to funkcja z Data.Text, kt\u00f3ra dzieli po spacjach\n",
"tokenizeStupidly = words\n",
"\n",
"tokenizeStupidly $ Prelude.head collectionD"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A, trzeba _chocia\u017c_ odsun\u0105\u0107 znaki interpunkcyjne. Najpro\u015bciej u\u017cy\u0107 wyra\u017cenia regularnego. Warto u\u017cy\u0107 [unikodowych w\u0142asno\u015bci](https://en.wikipedia.org/wiki/Unicode_character_property) znak\u00f3w i konstrukcji `\\p{...}`. "
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"But"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"chyba"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"zgubi\u0142em"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"."
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"{-# LANGUAGE QuasiQuotes #-}\n",
"\n",
"import Text.Regex.PCRE.Heavy\n",
"\n",
"tokenize :: Text -> [Text]\n",
"tokenize = map fst . scan [re|C\\+\\+|[\\p{L}0-9]+|\\p{P}|]\n",
"\n",
"tokenize $ collectionD !! 3\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Ca\u0142a kolekcja stokenizowana:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Ala"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ma"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"kota"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"."
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"Podobno"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"jest"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"kot"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"w"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"butach"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"."
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"Ty"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"chyba"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"masz"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"kota"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"!"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"But"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"chyba"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"zgubi\u0142em"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"."
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"Kot"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ma"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"kota"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"."
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"map tokenize collectionD"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Problemy z tokenizacj\u0105\n",
"\n",
"##### J\u0119zyk angielski"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"I"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"use"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"a"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"data"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"-"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"base"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"tokenize \"I use a data-base\""
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"I"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"use"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"a"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"database"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"tokenize \"I use a database\""
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"I"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"use"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"a"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"data"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"base"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"tokenize \"I use a data base\""
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"I"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"don"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"'"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"t"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"like"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"Python"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"'"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"tokenize \"'I don't like Python'\""
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"I"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"can"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"see"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"the"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"Johnes"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"'"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"house"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"tokenize \"I can see the Johnes' house\""
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"I"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"do"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"not"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"like"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"Python"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"tokenize \"I do not like Python\""
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0018"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"555"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"-"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"555"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"-"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"122"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"tokenize \"+0018 555-555-122\""
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0018555555122"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"tokenize \"+0018555555122\""
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Which"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"one"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"is"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"better"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
":"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"C++"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"or"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"C"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"#"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"?"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"tokenize \"Which one is better: C++ or C#?\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Inne j\u0119zyki?"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Rechtsschutzversicherungsgesellschaften"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"wie"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"die"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"HUK"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"-"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"Coburg"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"machen"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"es"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"bereits"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"seit"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"geraumer"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"Zeit"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"vor"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
":"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"tokenize \"Rechtsschutzversicherungsgesellschaften wie die HUK-Coburg machen es bereits seit geraumer Zeit vor:\""
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"\u4eca\u65e5\u6ce2\u5179\u5357\u662f\u8d38\u6613"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"\u3001"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"\u5de5\u4e1a\u53ca\u6559\u80b2\u7684\u4e2d\u5fc3"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"\u3002"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"\u6ce2\u5179\u5357\u662f\u6ce2\u5170\u7b2c\u4e94\u5927\u7684\u57ce\u5e02\u53ca\u7b2c\u56db\u5927\u7684\u5de5\u4e1a\u4e2d\u5fc3"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"\uff0c"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"\u6ce2\u5179\u5357\u4ea6\u662f\u5927\u6ce2\u5170\u7701\u7684\u884c\u653f\u9996\u5e9c"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"\u3002"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"\u4e5f\u8209\u8fa6\u6709\u4e0d\u5c11\u5c55\u89bd\u6703"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"\u3002"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"\u662f\u6ce2\u862d\u897f\u90e8\u91cd\u8981\u7684\u4ea4\u901a\u4e2d\u5fc3\u90fd\u5e02"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"\u3002"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"tokenize \"\u4eca\u65e5\u6ce2\u5179\u5357\u662f\u8d38\u6613\u3001\u5de5\u4e1a\u53ca\u6559\u80b2\u7684\u4e2d\u5fc3\u3002\u6ce2\u5179\u5357\u662f\u6ce2\u5170\u7b2c\u4e94\u5927\u7684\u57ce\u5e02\u53ca\u7b2c\u56db\u5927\u7684\u5de5\u4e1a\u4e2d\u5fc3\uff0c\u6ce2\u5179\u5357\u4ea6\u662f\u5927\u6ce2\u5170\u7701\u7684\u884c\u653f\u9996\u5e9c\u3002\u4e5f\u8209\u8fa6\u6709\u4e0d\u5c11\u5c55\u89bd\u6703\u3002\u662f\u6ce2\u862d\u897f\u90e8\u91cd\u8981\u7684\u4ea4\u901a\u4e2d\u5fc3\u90fd\u5e02\u3002\""
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"l"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"'"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ordinateur"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"tokenize \"l'ordinateur\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Lematyzacja"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"_Lematyzacja_ to sprowadzenie do formy podstawowej (_lematu_), np. \"krze\u015ble\" do \"krzes\u0142o\", \"zrobimy\" do \"zrobi\u0107\" dla j\u0119zyka polskiego, \"chairs\" do \"chair\", \"made\" do \"make\" dla j\u0119zyka angielskiego."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Lematyzacja dla j\u0119zyka polskiego jest bardzo trudna, praktycznie nie spos\u00f3b wykona\u0107 j\u0105 regu\u0142owo, po prostu musimy si\u0119 postara\u0107 o bardzo obszerny _s\u0142ownik form fleksyjnych_.\n",
"\n",
"Na potrzeby tego wyk\u0142adu stw\u00f3rzmy sobie ma\u0142y s\u0142ownik form fleksyjnych w postaci tablicy asocjacyjnej (haszuj\u0105cej)."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<style>/* Styles used for the Hoogle display in the pager */\n",
".hoogle-doc {\n",
"display: block;\n",
"padding-bottom: 1.3em;\n",
"padding-left: 0.4em;\n",
"}\n",
".hoogle-code {\n",
"display: block;\n",
"font-family: monospace;\n",
"white-space: pre;\n",
"}\n",
".hoogle-text {\n",
"display: block;\n",
"}\n",
".hoogle-name {\n",
"color: green;\n",
"font-weight: bold;\n",
"}\n",
".hoogle-head {\n",
"font-weight: bold;\n",
"}\n",
".hoogle-sub {\n",
"display: block;\n",
"margin-left: 0.4em;\n",
"}\n",
".hoogle-package {\n",
"font-weight: bold;\n",
"font-style: italic;\n",
"}\n",
".hoogle-module {\n",
"font-weight: bold;\n",
"}\n",
".hoogle-class {\n",
"font-weight: bold;\n",
"}\n",
".get-type {\n",
"color: green;\n",
"font-weight: bold;\n",
"font-family: monospace;\n",
"display: block;\n",
"white-space: pre-wrap;\n",
"}\n",
".show-type {\n",
"color: green;\n",
"font-weight: bold;\n",
"font-family: monospace;\n",
"margin-left: 1em;\n",
"}\n",
".mono {\n",
"font-family: monospace;\n",
"display: block;\n",
"}\n",
".err-msg {\n",
"color: red;\n",
"font-style: italic;\n",
"font-family: monospace;\n",
"white-space: pre;\n",
"display: block;\n",
"}\n",
"#unshowable {\n",
"color: red;\n",
"font-weight: bold;\n",
"}\n",
".err-msg.in.collapse {\n",
"padding-top: 0.7em;\n",
"}\n",
".highlight-code {\n",
"white-space: pre;\n",
"font-family: monospace;\n",
"}\n",
".suggestion-warning { \n",
"font-weight: bold;\n",
"color: rgb(200, 130, 0);\n",
"}\n",
".suggestion-error { \n",
"font-weight: bold;\n",
"color: red;\n",
"}\n",
".suggestion-name {\n",
"font-weight: bold;\n",
"}\n",
"</style><div class=\"suggestion-name\" style=\"clear:both;\">Use head</div><div class=\"suggestion-row\" style=\"float: left;\"><div class=\"suggestion-warning\">Found:</div><div class=\"highlight-code\" id=\"haskell\">collectionD !! 0</div></div><div class=\"suggestion-row\" style=\"float: left;\"><div class=\"suggestion-warning\">Why Not:</div><div class=\"highlight-code\" id=\"haskell\">head collectionD</div></div>"
],
"text/plain": [
"Line 22: Use head\n",
"Found:\n",
"collectionD !! 0\n",
"Why not:\n",
"head collectionD"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"but"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"butami"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"Ala"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"mie\u0107"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"kot"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"."
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"Wczoraj"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"kupi\u0142em"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"kot"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"."
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import Data.Map as Map hiding(take, map, filter)\n",
"\n",
"mockInflectionDictionary :: Map Text Text\n",
"mockInflectionDictionary = Map.fromList [\n",
" (\"kota\", \"kot\"),\n",
" (\"butach\", \"but\"),\n",
" (\"masz\", \"mie\u0107\"),\n",
" (\"ma\", \"mie\u0107\"),\n",
" (\"buta\", \"but\"),\n",
" (\"zgubi\u0142em\", \"zgubi\u0107\")]\n",
"\n",
"lemmatizeWord :: Map Text Text -> Text -> Text\n",
"lemmatizeWord dict w = findWithDefault w w dict\n",
"\n",
"lemmatizeWord mockInflectionDictionary \"butach\"\n",
"-- a tego nie ma w naszym s\u0142owniczku, wi\u0119c zwracamy to samo\n",
"lemmatizeWord mockInflectionDictionary \"butami\"\n",
"\n",
"lemmatize :: Map Text Text -> [Text] -> [Text]\n",
"lemmatize dict = map (lemmatizeWord dict)\n",
"\n",
"lemmatize mockInflectionDictionary $ tokenize $ collectionD !! 0 \n",
"\n",
"lemmatize mockInflectionDictionary $ tokenize \"Wczoraj kupi\u0142em kota.\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Pytanie**: Nawet w naszym s\u0142owniczku mamy problemy z niejednoznaczno\u015bci\u0105 lematyzacji. Jakie?\n",
"\n",
"Obszerny s\u0142ownik form fleksyjnych dla j\u0119zyka polskiego: http://zil.ipipan.waw.pl/PoliMorf?action=AttachFile&do=view&target=PoliMorf-0.6.7.tab.gz"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Stemowanie\n",
"\n",
"Stemowanie (rdzeniowanie) obcina wyraz do _rdzenia_ niekoniecznie b\u0119d\u0105cego sensownym wyrazem, np. \"krze\u015ble\" mo\u017ce by\u0107 rdzeniowane do \"krze\u015bl\", \"krze\u015b\" albo \"krzes\", \"zrobimy\" do \"zrobi\".\n",
"\n",
"* stemowanie nie jest tak dobrze okre\u015blone jak lematyzacja (mo\u017cna robi\u0107 na wiele sposob\u00f3w)\n",
"* bardziej podatne na metody regu\u0142owe (cho\u0107 dla polskiego i tak trudno)\n",
"* dla angielskiego istniej\u0105 znane algorytmy stemowania, np. [algorytm Portera](https://tartarus.org/martin/PorterStemmer/def.txt)\n",
"* zob. te\u017c [program Snowball](https://snowballstem.org/) z regu\u0142ami dla wielu j\u0119zyk\u00f3w\n",
"\n",
"Prosty stemmer \"dla ubogich\" dla j\u0119zyka polskiego to obcinanie do sze\u015bciu znak\u00f3w."
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"zrobim"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"komput"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"butach"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"\u017ad\u017ab\u0142a"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"poorMansStemming :: Text -> Text\n",
"poorMansStemming = Data.Text.take 6\n",
"\n",
"poorMansStemming \"zrobimy\"\n",
"poorMansStemming \"komputerami\"\n",
"poorMansStemming \"butach\"\n",
"poorMansStemming \"\u017ad\u017ab\u0142ami\"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### _Stop words_\n",
"\n",
"Cz\u0119sto wyszukiwarki pomijaj\u0105 kr\u00f3tkie, cz\u0119ste i nienios\u0105ce znaczenia s\u0142owa - _stop words_ (_s\u0142owa przestankowe_)."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"False"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"True"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"isStopWord :: Text -> Bool\n",
"isStopWord \"w\" = True\n",
"isStopWord \"jest\" = True\n",
"isStopWord \"\u017ce\" = True\n",
"-- przy okazji mo\u017cemy pozby\u0107 si\u0119 znak\u00f3w interpunkcyjnych\n",
"isStopWord w = w \u2248 [re|^\\p{P}+$|]\n",
"\n",
"isStopWord \"kot\"\n",
"isStopWord \"!\"\n"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Ala"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ma"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"kota"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"removeStopWords :: [Text] -> [Text]\n",
"removeStopWords = filter (not . isStopWord)\n",
"\n",
"removeStopWords $ tokenize $ Prelude.head collectionD "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Pytanie**: Jakim zapytaniom usuwanie _stop words_ mo\u017ce szkodzi\u0107? Poda\u0107 przyk\u0142ady dla j\u0119zyka polskiego i angielskiego. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Normalizacja - r\u00f3\u017cno\u015bci\n",
"\n",
"W sk\u0142ad normalizacji mo\u017ce te\u017c wchodzi\u0107:\n",
"\n",
"* poprawianie b\u0142\u0119d\u00f3w literowych\n",
"* sprowadzanie do ma\u0142ych liter (lower-casing czy raczej case-folding)\n",
"* usuwanie znak\u00f3w diakrytycznych\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"\u017cd\u017ab\u0142o"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"toLower \"\u017bD\u0179B\u0141O\""
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"\u017ad\u017ab\u0142o"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"toCaseFold \"\u0179D\u0179B\u0141O\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Pytanie:** Kiedy _case-folding_ da inny wynik ni\u017c _lower-casing_? Jakie to ma praktyczne znaczenie?"
]
},
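{
"cell_type": "markdown",
"metadata": {},
"source": [
"A small illustration (our own addition, not part of the original lecture): for German text, `toCaseFold` and `toLower` can give different results, because full Unicode case folding maps \"\u00df\" to \"ss\"."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"-- a minimal sketch: lower-casing vs case-folding of the German sharp s\n",
"toLower \"Ma\u00dfe\"    -- stays \"ma\u00dfe\"\n",
"toCaseFold \"Ma\u00dfe\"  -- becomes \"masse\""
]
},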
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Normalizacja jako ca\u0142o\u015bciowy proces\n",
"\n",
"Najwa\u017cniejsza zasada: dokumenty w naszej kolekcji powinny by\u0107 normalizowane w dok\u0142adnie taki spos\u00f3b, jak zapytania.\n",
"\n",
"Efektem normalizacji jest zamiana dokumentu na ci\u0105g _term\u00f3w_ (ang. _terms_), czyli znormalizowanych wyraz\u00f3w.\n",
"\n",
"Innymi s\u0142owy po normalizacji dokument $d_i$ traktujemy jako ci\u0105g term\u00f3w $t_i^1,\\dots,t_i^{|d_i|}$."
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"ala"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"mie\u0107"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"kot"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"podobn"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"kot"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"but"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ty"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"chyba"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"mie\u0107"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"kot"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"but"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"chyba"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"zgubi\u0107"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"kot"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"mie\u0107"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"kot"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"normalize :: Text -> [Text]\n",
"normalize = map poorMansStemming . removeStopWords . map toLower . lemmatize mockInflectionDictionary . tokenize\n",
"\n",
"map normalize collectionD"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Zbi\u00f3r wszystkich term\u00f3w w kolekcji dokument\u00f3w nazywamy s\u0142ownikiem (ang. _vocabulary_), nie myli\u0107 ze s\u0142ownikiem jako struktur\u0105 danych w Pythonie (_dictionary_).\n",
"\n",
"$$V = \\bigcup_{i=1}^N \\{t_i^1,\\dots,t_i^{|d_i|}\\}$$\n",
"\n",
"(To zbi\u00f3r, wi\u0119c liczymy bez powt\u00f3rze\u0144!)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"fromList [\"ala\",\"but\",\"chyba\",\"kot\",\"mie\\263\",\"podobn\",\"ty\",\"zgubi\\263\"]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import Data.Set as Set hiding(map)\n",
"\n",
"getVocabulary :: [Text] -> Set Text \n",
"getVocabulary = Set.unions . map (Set.fromList . normalize) \n",
"\n",
"getVocabulary collectionD"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Jak wyszukiwarka mo\u017ce by\u0107 szybka?\n",
"\n",
"_Odwr\u00f3cony indeks_ (ang. _inverted index_) pozwala wyszukiwarce szybko szuka\u0107 w milionach dokument\u00f3w. Odwr\u00f3cony indeks to prostu... indeks, jaki znamy z ksi\u0105\u017cek (mapowanie s\u0142\u00f3w na numery stron/dokument\u00f3w).\n",
"\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<style>/* Styles used for the Hoogle display in the pager */\n",
".hoogle-doc {\n",
"display: block;\n",
"padding-bottom: 1.3em;\n",
"padding-left: 0.4em;\n",
"}\n",
".hoogle-code {\n",
"display: block;\n",
"font-family: monospace;\n",
"white-space: pre;\n",
"}\n",
".hoogle-text {\n",
"display: block;\n",
"}\n",
".hoogle-name {\n",
"color: green;\n",
"font-weight: bold;\n",
"}\n",
".hoogle-head {\n",
"font-weight: bold;\n",
"}\n",
".hoogle-sub {\n",
"display: block;\n",
"margin-left: 0.4em;\n",
"}\n",
".hoogle-package {\n",
"font-weight: bold;\n",
"font-style: italic;\n",
"}\n",
".hoogle-module {\n",
"font-weight: bold;\n",
"}\n",
".hoogle-class {\n",
"font-weight: bold;\n",
"}\n",
".get-type {\n",
"color: green;\n",
"font-weight: bold;\n",
"font-family: monospace;\n",
"display: block;\n",
"white-space: pre-wrap;\n",
"}\n",
".show-type {\n",
"color: green;\n",
"font-weight: bold;\n",
"font-family: monospace;\n",
"margin-left: 1em;\n",
"}\n",
".mono {\n",
"font-family: monospace;\n",
"display: block;\n",
"}\n",
".err-msg {\n",
"color: red;\n",
"font-style: italic;\n",
"font-family: monospace;\n",
"white-space: pre;\n",
"display: block;\n",
"}\n",
"#unshowable {\n",
"color: red;\n",
"font-weight: bold;\n",
"}\n",
".err-msg.in.collapse {\n",
"padding-top: 0.7em;\n",
"}\n",
".highlight-code {\n",
"white-space: pre;\n",
"font-family: monospace;\n",
"}\n",
".suggestion-warning { \n",
"font-weight: bold;\n",
"color: rgb(200, 130, 0);\n",
"}\n",
".suggestion-error { \n",
"font-weight: bold;\n",
"color: red;\n",
"}\n",
".suggestion-name {\n",
"font-weight: bold;\n",
"}\n",
"</style><div class=\"suggestion-name\" style=\"clear:both;\">Use tuple-section</div><div class=\"suggestion-row\" style=\"float: left;\"><div class=\"suggestion-warning\">Found:</div><div class=\"highlight-code\" id=\"haskell\">\\ t -> (t, ix)</div></div><div class=\"suggestion-row\" style=\"float: left;\"><div class=\"suggestion-warning\">Why Not:</div><div class=\"highlight-code\" id=\"haskell\">(, ix)</div></div>"
],
"text/plain": [
"Line 4: Use tuple-section\n",
"Found:\n",
"\\ t -> (t, ix)\n",
"Why not:\n",
"(, ix)"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"fromList [(\"chyba\",2),(\"kot\",2),(\"mie\\263\",2),(\"ty\",2)]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"collectionDNormalized = map normalize collectionD\n",
"\n",
"documentToPostings :: ([Text], Int) -> Set (Text, Int)\n",
"documentToPostings (d, ix) = Set.fromList $ map (\\t -> (t, ix)) d\n",
"\n",
"documentToPostings (collectionDNormalized !! 2, 2) \n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<style>/* Styles used for the Hoogle display in the pager */\n",
".hoogle-doc {\n",
"display: block;\n",
"padding-bottom: 1.3em;\n",
"padding-left: 0.4em;\n",
"}\n",
".hoogle-code {\n",
"display: block;\n",
"font-family: monospace;\n",
"white-space: pre;\n",
"}\n",
".hoogle-text {\n",
"display: block;\n",
"}\n",
".hoogle-name {\n",
"color: green;\n",
"font-weight: bold;\n",
"}\n",
".hoogle-head {\n",
"font-weight: bold;\n",
"}\n",
".hoogle-sub {\n",
"display: block;\n",
"margin-left: 0.4em;\n",
"}\n",
".hoogle-package {\n",
"font-weight: bold;\n",
"font-style: italic;\n",
"}\n",
".hoogle-module {\n",
"font-weight: bold;\n",
"}\n",
".hoogle-class {\n",
"font-weight: bold;\n",
"}\n",
".get-type {\n",
"color: green;\n",
"font-weight: bold;\n",
"font-family: monospace;\n",
"display: block;\n",
"white-space: pre-wrap;\n",
"}\n",
".show-type {\n",
"color: green;\n",
"font-weight: bold;\n",
"font-family: monospace;\n",
"margin-left: 1em;\n",
"}\n",
".mono {\n",
"font-family: monospace;\n",
"display: block;\n",
"}\n",
".err-msg {\n",
"color: red;\n",
"font-style: italic;\n",
"font-family: monospace;\n",
"white-space: pre;\n",
"display: block;\n",
"}\n",
"#unshowable {\n",
"color: red;\n",
"font-weight: bold;\n",
"}\n",
".err-msg.in.collapse {\n",
"padding-top: 0.7em;\n",
"}\n",
".highlight-code {\n",
"white-space: pre;\n",
"font-family: monospace;\n",
"}\n",
".suggestion-warning { \n",
"font-weight: bold;\n",
"color: rgb(200, 130, 0);\n",
"}\n",
".suggestion-error { \n",
"font-weight: bold;\n",
"color: red;\n",
"}\n",
".suggestion-name {\n",
"font-weight: bold;\n",
"}\n",
"</style><div class=\"suggestion-name\" style=\"clear:both;\">Use zipWith</div><div class=\"suggestion-row\" style=\"float: left;\"><div class=\"suggestion-warning\">Found:</div><div class=\"highlight-code\" id=\"haskell\">map documentToPostings $ Prelude.zip coll [0 .. ]</div></div><div class=\"suggestion-row\" style=\"float: left;\"><div class=\"suggestion-warning\">Why Not:</div><div class=\"highlight-code\" id=\"haskell\">zipWith (curry documentToPostings) coll [0 .. ]</div></div>"
],
"text/plain": [
"Line 2: Use zipWith\n",
"Found:\n",
"map documentToPostings $ Prelude.zip coll [0 .. ]\n",
"Why not:\n",
"zipWith (curry documentToPostings) coll [0 .. ]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"fromList [(\"ala\",0),(\"but\",1),(\"but\",3),(\"chyba\",2),(\"chyba\",3),(\"kot\",0),(\"kot\",1),(\"kot\",2),(\"kot\",4),(\"mie\\263\",0),(\"mie\\263\",2),(\"mie\\263\",4),(\"podobn\",1),(\"ty\",2),(\"zgubi\\263\",3)]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"collectionToPostings :: [[Text]] -> Set (Text, Int)\n",
"collectionToPostings coll = Set.unions $ map documentToPostings $ Prelude.zip coll [0..]\n",
"\n",
"collectionToPostings collectionDNormalized"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<style>/* Styles used for the Hoogle display in the pager */\n",
".hoogle-doc {\n",
"display: block;\n",
"padding-bottom: 1.3em;\n",
"padding-left: 0.4em;\n",
"}\n",
".hoogle-code {\n",
"display: block;\n",
"font-family: monospace;\n",
"white-space: pre;\n",
"}\n",
".hoogle-text {\n",
"display: block;\n",
"}\n",
".hoogle-name {\n",
"color: green;\n",
"font-weight: bold;\n",
"}\n",
".hoogle-head {\n",
"font-weight: bold;\n",
"}\n",
".hoogle-sub {\n",
"display: block;\n",
"margin-left: 0.4em;\n",
"}\n",
".hoogle-package {\n",
"font-weight: bold;\n",
"font-style: italic;\n",
"}\n",
".hoogle-module {\n",
"font-weight: bold;\n",
"}\n",
".hoogle-class {\n",
"font-weight: bold;\n",
"}\n",
".get-type {\n",
"color: green;\n",
"font-weight: bold;\n",
"font-family: monospace;\n",
"display: block;\n",
"white-space: pre-wrap;\n",
"}\n",
".show-type {\n",
"color: green;\n",
"font-weight: bold;\n",
"font-family: monospace;\n",
"margin-left: 1em;\n",
"}\n",
".mono {\n",
"font-family: monospace;\n",
"display: block;\n",
"}\n",
".err-msg {\n",
"color: red;\n",
"font-style: italic;\n",
"font-family: monospace;\n",
"white-space: pre;\n",
"display: block;\n",
"}\n",
"#unshowable {\n",
"color: red;\n",
"font-weight: bold;\n",
"}\n",
".err-msg.in.collapse {\n",
"padding-top: 0.7em;\n",
"}\n",
".highlight-code {\n",
"white-space: pre;\n",
"font-family: monospace;\n",
"}\n",
".suggestion-warning { \n",
"font-weight: bold;\n",
"color: rgb(200, 130, 0);\n",
"}\n",
".suggestion-error { \n",
"font-weight: bold;\n",
"color: red;\n",
"}\n",
".suggestion-name {\n",
"font-weight: bold;\n",
"}\n",
"</style><div class=\"suggestion-name\" style=\"clear:both;\">Eta reduce</div><div class=\"suggestion-row\" style=\"float: left;\"><div class=\"suggestion-warning\">Found:</div><div class=\"highlight-code\" id=\"haskell\">updateInvertedIndex (t, ix) invIndex\n",
" = insertWith (++) t [ix] invIndex</div></div><div class=\"suggestion-row\" style=\"float: left;\"><div class=\"suggestion-warning\">Why Not:</div><div class=\"highlight-code\" id=\"haskell\">updateInvertedIndex (t, ix) = insertWith (++) t [ix]</div></div>"
],
"text/plain": [
"Line 2: Eta reduce\n",
"Found:\n",
"updateInvertedIndex (t, ix) invIndex\n",
" = insertWith (++) t [ix] invIndex\n",
"Why not:\n",
"updateInvertedIndex (t, ix) = insertWith (++) t [ix]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"fromList [(\"ala\",[0]),(\"but\",[1,3]),(\"chyba\",[2,3]),(\"kot\",[0,1,2,4]),(\"mie\\263\",[0,2,4]),(\"podobn\",[1]),(\"ty\",[2]),(\"zgubi\\263\",[3])]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"[0,1,2,4]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"updateInvertedIndex :: (Text, Int) -> Map Text [Int] -> Map Text [Int]\n",
"updateInvertedIndex (t, ix) invIndex = insertWith (++) t [ix] invIndex\n",
"\n",
"getInvertedIndex :: [[Text]] -> Map Text [Int]\n",
"getInvertedIndex = Prelude.foldr updateInvertedIndex Map.empty . Set.toList . collectionToPostings\n",
"\n",
"ind = getInvertedIndex collectionDNormalized\n",
"ind\n",
"ind ! \"kot\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Relewantno\u015b\u0107\n",
"\n",
"Potrafimy szybko przeszukiwa\u0107 znormalizowane dokumenty, ale kt\u00f3re dokumenty s\u0105 wa\u017cne (_relewantne_) wzgl\u0119dem potrzeby informacyjnej u\u017cytkownika?\n",
"\n",
"### Zapytania boole'owskie\n",
"\n",
"* `pizzeria Pozna\u0144 dow\u00f3z` to `pizzeria AND Pozna\u0144 AND dow\u00f3z` czy `pizzeria OR Pozna\u0144 OR dow\u00f3z`\n",
"* `(pizzeria OR pizza OR tratoria) AND Pozna\u0144 AND dow\u00f3z\n",
"* `pizzeria AND Pozna\u0144 AND dow\u00f3z AND NOT golonka`\n",
"\n",
"Jak domy\u015blnie interpretowa\u0107 zapytanie?\n",
"\n",
"* jako zapytanie AND -- by\u0107 mo\u017ce za ma\u0142o dokument\u00f3w\n",
"* rozwi\u0105zanie po\u015brednie?\n",
"* jako zapytanie OR -- by\u0107 mo\u017ce za du\u017co dokument\u00f3w\n",
"\n",
"Mo\u017cemy jakie\u015b miary dopasowania dokumentu do zapytania, \u017ceby m\u00f3c posortowa\u0107 dokumenty...\n",
"\n",
"### Mierzenie dopasowania dokumentu do zapytania\n",
"\n",
"Potrzebujemy jakie\u015b funkcji $\\sigma : Q x D \\rightarrow \\mathbb{R}$. \n"
]
},
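{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before we move on to ranking, here is a minimal sketch (our own addition, assuming query terms are normalized exactly like the documents) of how a Boolean AND query can be answered with the inverted index `ind` built above: we simply intersect the posting lists of the query terms."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import Data.List (intersect)\n",
"\n",
"-- AND query: intersect the posting lists of all query terms\n",
"-- (a term absent from the index gets an empty posting list)\n",
"andQuery :: [Text] -> [Int]\n",
"andQuery = Prelude.foldr (\\t acc -> findWithDefault [] t ind `intersect` acc) [0..Prelude.length collectionD - 1]\n",
"\n",
"-- expected: [2], as only \"Ty chyba masz kota!\" contains both terms\n",
"andQuery [\"kot\", \"chyba\"]"
]
},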
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Musimy jako\u015b zamieni\u0107 dokumenty na liczby, tj. dokumenty na wektory liczb, a ca\u0142\u0105 kolekcj\u0119 na macierz.\n",
"\n",
"Po pierwsze ponumerujmy wszystkie termy ze s\u0142ownika."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"fromList [(0,\"ala\"),(1,\"but\"),(2,\"chyba\"),(3,\"kot\"),(4,\"mie\\263\"),(5,\"podobn\"),(6,\"ty\"),(7,\"zgubi\\263\")]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"fromList [(\"ala\",0),(\"but\",1),(\"chyba\",2),(\"kot\",3),(\"mie\\263\",4),(\"podobn\",5),(\"ty\",6),(\"zgubi\\263\",7)]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ala"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"2"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"voc = getVocabulary collectionD\n",
"\n",
"vocD :: Map Int Text\n",
"vocD = Map.fromList $ zip [0..] $ Set.toList voc\n",
"\n",
"invvocD :: Map Text Int\n",
"invvocD = Map.fromList $ zip (Set.toList voc) [0..]\n",
"\n",
"vocD\n",
"\n",
"invvocD\n",
"\n",
"vocD ! 0\n",
"invvocD ! \"chyba\"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Napiszmy funkcj\u0119, kt\u00f3ra _wektoryzuje_ znormalizowany dokument.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<style>/* Styles used for the Hoogle display in the pager */\n",
".hoogle-doc {\n",
"display: block;\n",
"padding-bottom: 1.3em;\n",
"padding-left: 0.4em;\n",
"}\n",
".hoogle-code {\n",
"display: block;\n",
"font-family: monospace;\n",
"white-space: pre;\n",
"}\n",
".hoogle-text {\n",
"display: block;\n",
"}\n",
".hoogle-name {\n",
"color: green;\n",
"font-weight: bold;\n",
"}\n",
".hoogle-head {\n",
"font-weight: bold;\n",
"}\n",
".hoogle-sub {\n",
"display: block;\n",
"margin-left: 0.4em;\n",
"}\n",
".hoogle-package {\n",
"font-weight: bold;\n",
"font-style: italic;\n",
"}\n",
".hoogle-module {\n",
"font-weight: bold;\n",
"}\n",
".hoogle-class {\n",
"font-weight: bold;\n",
"}\n",
".get-type {\n",
"color: green;\n",
"font-weight: bold;\n",
"font-family: monospace;\n",
"display: block;\n",
"white-space: pre-wrap;\n",
"}\n",
".show-type {\n",
"color: green;\n",
"font-weight: bold;\n",
"font-family: monospace;\n",
"margin-left: 1em;\n",
"}\n",
".mono {\n",
"font-family: monospace;\n",
"display: block;\n",
"}\n",
".err-msg {\n",
"color: red;\n",
"font-style: italic;\n",
"font-family: monospace;\n",
"white-space: pre;\n",
"display: block;\n",
"}\n",
"#unshowable {\n",
"color: red;\n",
"font-weight: bold;\n",
"}\n",
".err-msg.in.collapse {\n",
"padding-top: 0.7em;\n",
"}\n",
".highlight-code {\n",
"white-space: pre;\n",
"font-family: monospace;\n",
"}\n",
".suggestion-warning { \n",
"font-weight: bold;\n",
"color: rgb(200, 130, 0);\n",
"}\n",
".suggestion-error { \n",
"font-weight: bold;\n",
"color: red;\n",
"}\n",
".suggestion-name {\n",
"font-weight: bold;\n",
"}\n",
"</style><div class=\"suggestion-name\" style=\"clear:both;\">Redundant $</div><div class=\"suggestion-row\" style=\"float: left;\"><div class=\"suggestion-warning\">Found:</div><div class=\"highlight-code\" id=\"haskell\">map (\\ i -> count (v ! i) doc) $ [0 .. (vecSize - 1)]</div></div><div class=\"suggestion-row\" style=\"float: left;\"><div class=\"suggestion-warning\">Why Not:</div><div class=\"highlight-code\" id=\"haskell\">map (\\ i -> count (v ! i) doc) [0 .. (vecSize - 1)]</div></div><div class=\"suggestion-name\" style=\"clear:both;\">Redundant bracket</div><div class=\"suggestion-row\" style=\"float: left;\"><div class=\"suggestion-warning\">Found:</div><div class=\"highlight-code\" id=\"haskell\">(collectionDNormalized !! 2)</div></div><div class=\"suggestion-row\" style=\"float: left;\"><div class=\"suggestion-warning\">Why Not:</div><div class=\"highlight-code\" id=\"haskell\">collectionDNormalized !! 2</div></div>"
],
"text/plain": [
"Line 2: Redundant $\n",
"Found:\n",
"map (\\ i -> count (v ! i) doc) $ [0 .. (vecSize - 1)]\n",
"Why not:\n",
"map (\\ i -> count (v ! i) doc) [0 .. (vecSize - 1)]Line 9: Redundant bracket\n",
"Found:\n",
"(collectionDNormalized !! 2)\n",
"Why not:\n",
"collectionDNormalized !! 2"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ty"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"chyba"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"mie\u0107"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"kot"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"[0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"vectorize :: Int -> Map Int Text -> [Text] -> [Double]\n",
"vectorize vecSize v doc = map (\\i -> count (v ! i) doc) $ [0..(vecSize-1)]\n",
" where count t doc \n",
" | t `elem` doc = 1.0\n",
" | otherwise = 0.0\n",
" \n",
"vocSize = Set.size voc\n",
"\n",
"(collectionDNormalized !! 2)\n",
"vectorize vocSize vocD (collectionDNormalized !! 2)\n",
"\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" ![image](./macierz.png)"
]
},
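{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick sketch of ours (not part of the original lecture): the whole collection as a term-document matrix, one row per document, using the 0/1 weighting from `vectorize`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"-- one row per document, one column per vocabulary term\n",
"map (vectorize vocSize vocD) collectionDNormalized"
]
},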
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Jak inaczej uwzgl\u0119dni\u0107 cz\u0119sto\u015b\u0107 wyraz\u00f3w?\n",
"\n",
"<div style=\"display:none\">\n",
" $\n",
" \\newcommand{\\idf}{\\mathop{\\rm idf}\\nolimits}\n",
" \\newcommand{\\tf}{\\mathop{\\rm tf}\\nolimits}\n",
" \\newcommand{\\df}{\\mathop{\\rm df}\\nolimits}\n",
" \\newcommand{\\tfidf}{\\mathop{\\rm tfidf}\\nolimits}\n",
" $\n",
"</div>\n",
"\n",
"* $\\tf_{t,d}$ - term frequency\n",
"\n",
"* $1+\\log(\\tf_{t,d})$\n",
"\n",
"* $0.5 + \\frac{0.5 \\times \\tf_{t,d}}{max_t(\\tf_{t,d})}$"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<style>/* Styles used for the Hoogle display in the pager */\n",
".hoogle-doc {\n",
"display: block;\n",
"padding-bottom: 1.3em;\n",
"padding-left: 0.4em;\n",
"}\n",
".hoogle-code {\n",
"display: block;\n",
"font-family: monospace;\n",
"white-space: pre;\n",
"}\n",
".hoogle-text {\n",
"display: block;\n",
"}\n",
".hoogle-name {\n",
"color: green;\n",
"font-weight: bold;\n",
"}\n",
".hoogle-head {\n",
"font-weight: bold;\n",
"}\n",
".hoogle-sub {\n",
"display: block;\n",
"margin-left: 0.4em;\n",
"}\n",
".hoogle-package {\n",
"font-weight: bold;\n",
"font-style: italic;\n",
"}\n",
".hoogle-module {\n",
"font-weight: bold;\n",
"}\n",
".hoogle-class {\n",
"font-weight: bold;\n",
"}\n",
".get-type {\n",
"color: green;\n",
"font-weight: bold;\n",
"font-family: monospace;\n",
"display: block;\n",
"white-space: pre-wrap;\n",
"}\n",
".show-type {\n",
"color: green;\n",
"font-weight: bold;\n",
"font-family: monospace;\n",
"margin-left: 1em;\n",
"}\n",
".mono {\n",
"font-family: monospace;\n",
"display: block;\n",
"}\n",
".err-msg {\n",
"color: red;\n",
"font-style: italic;\n",
"font-family: monospace;\n",
"white-space: pre;\n",
"display: block;\n",
"}\n",
"#unshowable {\n",
"color: red;\n",
"font-weight: bold;\n",
"}\n",
".err-msg.in.collapse {\n",
"padding-top: 0.7em;\n",
"}\n",
".highlight-code {\n",
"white-space: pre;\n",
"font-family: monospace;\n",
"}\n",
".suggestion-warning { \n",
"font-weight: bold;\n",
"color: rgb(200, 130, 0);\n",
"}\n",
".suggestion-error { \n",
"font-weight: bold;\n",
"color: red;\n",
"}\n",
".suggestion-name {\n",
"font-weight: bold;\n",
"}\n",
"</style><div class=\"suggestion-name\" style=\"clear:both;\">Redundant $</div><div class=\"suggestion-row\" style=\"float: left;\"><div class=\"suggestion-warning\">Found:</div><div class=\"highlight-code\" id=\"haskell\">map (\\ i -> count (v ! i) doc) $ [0 .. (vecSize - 1)]</div></div><div class=\"suggestion-row\" style=\"float: left;\"><div class=\"suggestion-warning\">Why Not:</div><div class=\"highlight-code\" id=\"haskell\">map (\\ i -> count (v ! i) doc) [0 .. (vecSize - 1)]</div></div><div class=\"suggestion-name\" style=\"clear:both;\">Redundant bracket</div><div class=\"suggestion-row\" style=\"float: left;\"><div class=\"suggestion-warning\">Found:</div><div class=\"highlight-code\" id=\"haskell\">(collectionDNormalized !! 4)</div></div><div class=\"suggestion-row\" style=\"float: left;\"><div class=\"suggestion-warning\">Why Not:</div><div class=\"highlight-code\" id=\"haskell\">collectionDNormalized !! 4</div></div>"
],
"text/plain": [
"Line 2: Redundant $\n",
"Found:\n",
"map (\\ i -> count (v ! i) doc) $ [0 .. (vecSize - 1)]\n",
"Why not:\n",
"map (\\ i -> count (v ! i) doc) [0 .. (vecSize - 1)]Line 7: Redundant bracket\n",
"Found:\n",
"(collectionDNormalized !! 4)\n",
"Why not:\n",
"collectionDNormalized !! 4"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"kot"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"mie\u0107"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"kot"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"[0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"[0.0,0.0,0.0,2.0,1.0,0.0,0.0,0.0]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"vectorizeTf :: Int -> Map Int Text -> [Text] -> [Double]\n",
"vectorizeTf vecSize v doc = map (\\i -> count (v ! i) doc) $ [0..(vecSize-1)]\n",
" where count t doc = fromIntegral $ (Prelude.length . Prelude.filter (== t)) doc\n",
"\n",
"vocSize = Set.size voc\n",
"\n",
"(collectionDNormalized !! 4)\n",
"vectorize vocSize vocD (collectionDNormalized !! 4)\n",
"vectorizeTf vocSize vocD (collectionDNormalized !! 4)"
]
},
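{
"cell_type": "markdown",
"metadata": {},
"source": [
"The list above also mentions the sublinear scaling $1+\\log(\\tf_{t,d})$; here is a minimal sketch of ours (not part of the original lecture) of a vectorizer using it (a count of zero stays zero)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"-- sublinear (1 + log tf) weighting; tf = 0 is kept as 0\n",
"vectorizeTfLog :: Int -> Map Int Text -> [Text] -> [Double]\n",
"vectorizeTfLog vecSize v doc = map (\\i -> wf (count (v ! i) doc)) [0..(vecSize-1)]\n",
"  where count t d = fromIntegral $ (Prelude.length . Prelude.filter (== t)) d\n",
"        wf 0.0 = 0.0\n",
"        wf tf = 1.0 + log tf\n",
"\n",
"-- e.g. for \"Kot ma kota.\": the entry for \"kot\" becomes 1 + log 2\n",
"vectorizeTfLog vocSize vocD (collectionDNormalized !! 4)"
]
},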
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div style=\"display:none\">\n",
" $\n",
" \\newcommand{\\idf}{\\mathop{\\rm idf}\\nolimits}\n",
" \\newcommand{\\tf}{\\mathop{\\rm tf}\\nolimits}\n",
" \\newcommand{\\df}{\\mathop{\\rm df}\\nolimits}\n",
" \\newcommand{\\tfidf}{\\mathop{\\rm tfidf}\\nolimits}\n",
" $\n",
"</div>\n",
"\n",
"### Odwrotna cz\u0119sto\u015b\u0107 dokumentowa\n",
"\n",
"Czy wszystkie wyrazy s\u0105 tak samo wa\u017cne?\n",
"\n",
"**NIE.** Wyrazy pojawiaj\u0105ce si\u0119 w wielu dokumentach s\u0105 mniej wa\u017cne.\n",
"\n",
"Aby to uwzgl\u0119dni\u0107, przemna\u017camy frekwencj\u0119 wyrazu przez _odwrotn\u0105\n",
" cz\u0119sto\u015b\u0107 w dokumentach_ (_inverse document frequency_):\n",
"\n",
"$$\\idf_t = \\log \\frac{N}{\\df_t},$$\n",
"\n",
"gdzie:\n",
"\n",
"* $\\idf_t$ - odwrotna cz\u0119sto\u015b\u0107 wyrazu $t$ w dokumentach\n",
"\n",
"* $N$ - liczba dokument\u00f3w w kolekcji\n",
"\n",
"* $\\df_f$ - w ilu dokumentach wyst\u0105pi\u0142 wyraz $t$?\n",
"\n",
"#### Dlaczego idf?\n",
"\n",
"term $t$ wyst\u0105pi\u0142...\n",
"\n",
"* w 1 dokumencie, $\\idf_t = \\log N/1 = \\log N$\n",
"* 2 razy w kolekcji, $\\idf_t = \\log N/2$ lub $\\log N$\n",
"* w po\u0142owie dokument\u00f3w kolekcji, $\\idf_t = \\log N/(N/2) = \\log 2$\n",
"* we wszystkich dokumentach, $\\idf_t = \\log N/N = \\log 1 = 0$\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.22314355131420976"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"idf :: [[Text]] -> Text -> Double\n",
"idf coll t = log (fromIntegral n / fromIntegral df)\n",
" where df = Prelude.length $ Prelude.filter (\\d -> t `elem` d) coll\n",
" n = Prelude.length coll\n",
" \n",
"idf collectionDNormalized \"kot\" "
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.9162907318741551"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"idf collectionDNormalized \"chyba\" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Co z tego wynika?\n",
"\n",
"Zamiast $\\tf_{t,d}$ b\u0119dziemy w wektorach rozpatrywa\u0107 warto\u015bci:\n",
"\n",
"$$\\tfidf_{t,d} = \\tf_{t,d} \\times \\idf_{t}$$\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"kot"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"mie\u0107"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"kot"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"[0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"[0.0,0.0,0.0,2.0,1.0,0.0,0.0,0.0]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"[0.0,0.0,0.0,0.44628710262841953,0.5108256237659907,0.0,0.0,0.0]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"vectorizeTfIdf :: Int -> [[Text]] -> Map Int Text -> [Text] -> [Double]\n",
"vectorizeTfIdf vecSize coll v doc = map (\\i -> count (v ! i) doc * idf coll (v ! i)) [0..(vecSize-1)]\n",
" where count t doc = fromIntegral $ (Prelude.length . Prelude.filter (== t)) doc\n",
"\n",
"vocSize = Set.size voc\n",
"\n",
"collectionDNormalized !! 4\n",
"vectorize vocSize vocD (collectionDNormalized !! 4)\n",
"vectorizeTf vocSize vocD (collectionDNormalized !! 4)\n",
"vectorizeTfIdf vocSize collectionDNormalized vocD (collectionDNormalized !! 4)"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[[1.6094379124341003,0.0,0.0,0.22314355131420976,0.5108256237659907,0.0,0.0,0.0],[0.0,0.9162907318741551,0.0,0.22314355131420976,0.0,1.6094379124341003,0.0,0.0],[0.0,0.0,0.9162907318741551,0.22314355131420976,0.5108256237659907,0.0,1.6094379124341003,0.0],[0.0,0.9162907318741551,0.9162907318741551,0.0,0.0,0.0,0.0,1.6094379124341003],[0.0,0.0,0.0,0.44628710262841953,0.5108256237659907,0.0,0.0,0.0]]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"map (vectorizeTfIdf vocSize collectionDNormalized vocD) collectionDNormalized"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Teraz zdefiniujemy _overlap score measure_:\n",
"\n",
"$$\\sigma(q,d) = \\sum_{t \\in q} \\tfidf_{t,d}$$"
]
},
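{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of ours (not part of the original lecture), assuming the query is normalized with the same `normalize` function as the documents and reusing the `idf` defined earlier:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"-- overlap score: sum of tf-idf values over the (distinct) query terms\n",
"overlapScore :: [[Text]] -> Text -> [Text] -> Double\n",
"overlapScore coll q d = sum [tf t * idf coll t | t <- Set.toList $ Set.fromList $ normalize q]\n",
"  where tf t = fromIntegral $ (Prelude.length . Prelude.filter (== t)) d\n",
"\n",
"-- score of the query \"kot ma kota\" against document 4 (\"Kot ma kota.\")\n",
"overlapScore collectionDNormalized \"kot ma kota\" (collectionDNormalized !! 4)"
]
},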
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Podobie\u0144stwo kosinusowe\n",
"\n",
"_Overlap score measure_ nie jest jedyn\u0105 mo\u017cliw\u0105 metryk\u0105, za pomoc\u0105 kt\u00f3rej mo\u017cemy mierzy\u0107 dopasowanie dokumentu do zapytania. Mo\u017cemy r\u00f3wnie\u017c si\u0119gn\u0105\u0107 po intuicje geometryczne (skoro mamy do czynienia z wektorami).\n",
"\n",
"**Pytanie**: Ile wymiar\u00f3w maj\u0105 wektory, na kt\u00f3rych operujemy? Jak \"wygl\u0105daj\u0105\" te wektory? Czy mo\u017cemy wykonywa\u0107 na nich standardowe operacje geometryczne czy te, kt\u00f3re znamy z geometrii liniowej?\n",
"\n",
"#### Podobie\u0144stwo mi\u0119dzy dokumentami\n",
"\n",
"Zajmijmy si\u0119 teraz poszukiwaniem miary mierz\u0105cej podobie\u0144stwo mi\u0119dzy dokumentami $d_1$ i $d_2$ (czyli poszukujemy sensownej funkcji $\\sigma : D x D \\rightarrow \\mathbb{R}$).\n",
"\n",
"**Uwaga** Poj\u0119cia \"miary\" u\u017cywamy nieformalnie, nie spe\u0142nia ona za\u0142o\u017ce\u0144 znanych z teorii miary.\n",
"\n",
"Rozpatrzmy zbiorek tekst\u00f3w legend miejskich z <git://gonito.net/polish-urban-legends>.\n",
"\n",
"(To autentyczne teksty z Internentu, z j\u0119zykiem potocznym, wulgarnym itd.)\n",
"\n",
"```\n",
" git clone git://gonito.net/polish-urban-legends\n",
" paste polish-urban-legends/dev-0/expected.tsv polish-urban-legends/dev-0/in.tsv > legendy.txt\n",
"``` "
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"na_ak"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"Opowie\u015b\u0107 prawdziwa... Olsztyn, akademik, 7 pi\u0119tro, […]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import System.IO\n",
"import Data.List.Split as SP\n",
"\n",
"legendsh <- openFile \"legendy.txt\" ReadMode\n",
"hSetEncoding legendsh utf8\n",
"contents <- hGetContents legendsh\n",
"ls = Prelude.lines contents\n",
"items = map (map pack . SP.splitOn \"\\t\") ls\n",
"Prelude.head items"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"87"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"nbOfLegends = Prelude.length items\n",
"nbOfLegends"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"na_ak"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"w_lud"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ba_hy"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"w_lap"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ne_dz"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"be_wy"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"zw_oz"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"mo_zu"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"be_wy"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ba_hy"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"mo_zu"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"be_wy"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"w_lud"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ne_dz"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ta_ab"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ta_ab"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ta_ab"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"w_lap"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ba_hy"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ne_dz"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ba_hy"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"tr_su"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ne_dz"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ba_hy"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"mo_zu"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"tr_su"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"zw_oz"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ne_dz"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ne_dz"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"w_lud"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"zw_oz"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"zw_oz"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"zw_oz"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ne_dz"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ta_ab"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"zw_oz"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"w_lud"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"na_ak"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"zw_oz"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"w_lap"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"be_wy"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"na_ak"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"zw_oz"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"w_lap"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"na_ak"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ba_hy"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"zw_oz"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"w_lud"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"zw_oz"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"zw_oz"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"mo_zu"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ba_hy"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"zw_oz"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"tr_su"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"na_ak"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ba_hy"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"w_lud"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"w_lud"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"zw_oz"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"tr_su"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"zw_oz"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"w_lud"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"zw_oz"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"zw_oz"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"be_wy"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"tr_su"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"zw_oz"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"na_ak"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ba_hy"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"zw_oz"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ne_dz"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ba_hy"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"na_ak"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"zw_oz"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"w_lud"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"mo_zu"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"mo_zu"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"na_ak"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"w_lap"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ne_dz"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ba_hy"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"mo_zu"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ba_hy"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ne_dz"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"zw_oz"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"tr_su"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ne_dz"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"w_lud"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"Ja podejrzewam \u017ce o polowaniu nie by\u0142o mowy, po prostu znalaz\u0142 martwego szczupaka i skorzysta\u0142 z okazji! Mnie mocno zdziwi\u0142a jego si\u0142a \u017ceby taki p\u00f3\u0142 kilogramowy okaz szczupaka przesuwa\u0107 o par\u0119 metr\u00f3w i to w trzcinach! Szacuneczek. Przypomniala mi sie historia kt\u00f3r\u0105 kiedys zaslyszalem o wlascicielce pytona, ktory nagle polozyl sie wzdluz jej \u0142\u00f3\u017cka. Le\u017ca\u0142 tak wyci\u0105gniety jak struna d\u0142u\u017cszy czas jak nie\u017cywy (a by\u0142 d\u0142ugo\u015bci \u0142\u00f3\u017cka), wi\u0119c kobitka zadzonila do weterynarza co ma robi\u0107. Us\u0142ysza\u0142a \u017ce ma szybko zamkn\u0105\u0107 si\u0119 w \u0142azience i poczeka\u0107 na niego bo pyton j\u0105 mierzy jako potencjaln\u0105 ofiar\u0119 (czy mu si\u0119 zmie\u015bci w brzuchu...). Wierzy\u0107, nie wierzy\u0107? Kiedy\u015b nie wierzy\u0142em ale od kilku dni mam w\u0105tpliwosci... Pozdrawiam"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"labelsL = map Prelude.head items\n",
"labelsL\n",
"collectionL = map (!!1) items\n",
"items !! 1"
]
},
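  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It may help to see how the texts are spread over the legend topics. A minimal sketch, assuming `labelsL` from the cell above and the `containers` package:\n",
    "\n",
    "```haskell\n",
    "import qualified Data.Map.Strict as MS\n",
    "\n",
    "-- count how many texts belong to each topic label\n",
    "labelCounts :: MS.Map Text Int\n",
    "labelCounts = MS.fromListWith (+) [(l, 1) | l <- labelsL]\n",
    "\n",
    "labelCounts\n",
    "```"
   ]
  },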
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"348"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"collectionLNormalized = map normalize collectionL\n",
"voc' = getVocabulary collectionL\n",
"\n",
"vocLSize = Prelude.length voc'\n",
"\n",
"vocL :: Map Int Text\n",
"vocL = Map.fromList $ zip [0..] $ Set.toList voc'\n",
"\n",
"invvocL :: Map Text Int\n",
"invvocL = Map.fromList $ zip (Set.toList voc') [0..]\n",
"\n",
"vocL ! 0\n",
"invvocL ! \"chyba\"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Wektoryzujemy ca\u0142\u0105 kolekcj\u0119:"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.38837067474886433,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.752336051950276,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0647107369924282,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.7727609380946383,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.2078115806331018,0.0,0.0,0.0,0.0,0.0,1.247032293786383,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5947071077466928,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.268683541318364,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.2078115806331018,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.7578579175523736,0.0,0.0,0.0,0.0,0.0,0.3550342544812725,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.7727609380946383,3.367295829986474,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.9395475940384223,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.21437689194643514,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.2878542883066382,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"lVectorized = map (vectorizeTfIdf vocLSize collectionLNormalized vocL) collectionLNormalized\n",
"lVectorized !! 1"
]
},
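  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The vector above is almost entirely zeros. To see which terms actually received a non-zero TF-IDF weight, we can pair the components with the vocabulary. A minimal sketch, assuming `lVectorized` and `vocL` from the cells above:\n",
    "\n",
    "```haskell\n",
    "-- (term, weight) pairs for the non-zero components of a document vector\n",
    "nonZeroTerms :: [Double] -> [(Text, Double)]\n",
    "nonZeroTerms v = [(vocL ! ix, w) | (ix, w) <- zip [0..] v, w > 0.0]\n",
    "\n",
    "nonZeroTerms (lVectorized !! 1)\n",
    "```"
   ]
  },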
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Szukamy funkcji $sigma$, kt\u00f3ra da wysok\u0105 warto\u015b\u0107 dla tekst\u00f3w dotycz\u0105cych tego samego w\u0105tku legendowego (np. $d_1$ i $d_2$ m\u00f3wi\u0105 o w\u0119\u017cu przymierzaj\u0105cym si\u0119 do zjedzenia swojej w\u0142a\u015bcicielki) i nisk\u0105 dla tekst\u00f3w z r\u00f3\u017cnych w\u0105tk\u00f3w (np. $d_1$ opowiada o w\u0119\u017cu ludojadzie, $d_2$ - ba\u0142wanku na hydrancie)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Mo\u017ce po prostu odleg\u0142o\u015b\u0107 euklidesowa, skoro to punkty w wielowymiarowej przestrzeni?"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" 0.00 79.93 78.37 76.57 87.95 81.15 82.77 127.50 124.54 76.42 84.19 78.90 90.90"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import Text.Printf\n",
"import Data.List (take)\n",
"\n",
"formatNumber :: Double -> String\n",
"formatNumber x = printf \"% 7.2f\" x\n",
"\n",
"similarTo :: ([Double] -> [Double] -> Double) -> [[Double]] -> Int -> Text\n",
"similarTo simFun vs ix = pack $ Prelude.unwords $ map (formatNumber . ((vs !! ix) `simFun`)) vs\n",
"\n",
"euclDistance :: [Double] -> [Double] -> Double\n",
"euclDistance v1 v2 = sqrt $ sum $ Prelude.zipWith (\\x1 x2 -> (x1 - x2)**2) v1 v2\n",
"\n",
"limit = 13\n",
"labelsLimited = Data.List.take limit labelsL\n",
"limitedL = Data.List.take limit lVectorized\n",
"\n",
"similarTo euclDistance limitedL 0\n",
"\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" na_ak w_lud ba_hy w_lap ne_dz be_wy zw_oz mo_zu be_wy ba_hy mo_zu be_wy w_lud\n",
"na_ak 0.00 79.93 78.37 76.57 87.95 81.15 82.77 127.50 124.54 76.42 84.19 78.90 90.90\n",
"w_lud 79.93 0.00 38.92 34.35 56.48 44.89 47.21 109.24 104.82 35.33 49.88 39.98 60.20\n",
"ba_hy 78.37 38.92 0.00 30.37 54.23 40.93 43.83 108.15 102.91 27.37 46.95 35.81 58.99\n",
"w_lap 76.57 34.35 30.37 0.00 51.54 37.46 40.86 107.43 103.22 25.22 43.66 32.10 56.53\n",
"ne_dz 87.95 56.48 54.23 51.54 0.00 57.98 60.32 113.66 109.59 50.96 62.17 54.84 70.70\n",
"be_wy 81.15 44.89 40.93 37.46 57.98 0.00 49.55 110.37 100.50 37.77 51.54 37.09 62.92\n",
"zw_oz 82.77 47.21 43.83 40.86 60.32 49.55 0.00 111.11 107.57 41.02 54.07 45.23 64.65\n",
"mo_zu 127.50 109.24 108.15 107.43 113.66 110.37 111.11 0.00 139.57 107.38 109.91 108.20 117.07\n",
"be_wy 124.54 104.82 102.91 103.22 109.59 100.50 107.57 139.57 0.00 102.69 108.32 99.06 113.25\n",
"ba_hy 76.42 35.33 27.37 25.22 50.96 37.77 41.02 107.38 102.69 0.00 43.83 32.08 56.68\n",
"mo_zu 84.19 49.88 46.95 43.66 62.17 51.54 54.07 109.91 108.32 43.83 0.00 47.87 66.40\n",
"be_wy 78.90 39.98 35.81 32.10 54.84 37.09 45.23 108.20 99.06 32.08 47.87 0.00 59.66\n",
"w_lud 90.90 60.20 58.99 56.53 70.70 62.92 64.65 117.07 113.25 56.68 66.40 59.66 0.00"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"paintMatrix :: ([Double] -> [Double] -> Double) -> [Text] -> [[Double]] -> Text\n",
"paintMatrix simFun labels vs = header <> \"\\n\" <> (Data.Text.unlines $ map (\\(lab, ix) -> lab <> \" \" <> similarTo simFun vs ix) $ zip labels [0..(Prelude.length vs - 1)])\n",
" where header = \" \" <> (Data.Text.unwords $ map (\\l -> pack $ printf \"% 7s\" l) labels)\n",
" \n",
"paintMatrix euclDistance labelsLimited limitedL"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Problem: za du\u017co zale\u017cy od d\u0142ugo\u015bci tekstu.\n",
"\n",
"Rozwi\u0105zanie: znormalizowa\u0107 wektor $v$ do wektora jednostkowego.\n",
"\n",
"$$ \\vec{1}(v) = \\frac{v}{|v|} $$\n",
"\n",
"Taki wektor ma d\u0142ugo\u015b\u0107 1!"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1.0"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
" na_ak w_lud ba_hy w_lap ne_dz be_wy zw_oz mo_zu be_wy ba_hy mo_zu be_wy w_lud\n",
"na_ak 10.00 0.67 0.66 0.66 0.67 0.67 0.67 0.67 0.67 0.67 0.66 0.67 0.67\n",
"w_lud 0.67 10.00 0.67 0.68 0.67 0.66 0.67 0.67 0.68 0.66 0.67 0.67 0.68\n",
"ba_hy 0.66 0.67 10.00 0.66 0.67 0.67 0.67 0.67 0.69 0.74 0.66 0.67 0.66\n",
"w_lap 0.66 0.68 0.66 10.00 0.66 0.66 0.66 0.66 0.67 0.66 0.66 0.66 0.66\n",
"ne_dz 0.67 0.67 0.67 0.66 10.00 0.67 0.67 0.68 0.69 0.68 0.67 0.67 0.68\n",
"be_wy 0.67 0.66 0.67 0.66 0.67 10.00 0.66 0.67 0.74 0.66 0.67 0.76 0.66\n",
"zw_oz 0.67 0.67 0.67 0.66 0.67 0.66 10.00 0.67 0.67 0.66 0.66 0.67 0.67\n",
"mo_zu 0.67 0.67 0.67 0.66 0.68 0.67 0.67 10.00 0.69 0.67 0.69 0.68 0.67\n",
"be_wy 0.67 0.68 0.69 0.67 0.69 0.74 0.67 0.69 10.00 0.68 0.67 0.75 0.67\n",
"ba_hy 0.67 0.66 0.74 0.66 0.68 0.66 0.66 0.67 0.68 10.00 0.66 0.67 0.66\n",
"mo_zu 0.66 0.67 0.66 0.66 0.67 0.67 0.66 0.69 0.67 0.66 10.00 0.67 0.67\n",
"be_wy 0.67 0.67 0.67 0.66 0.67 0.76 0.67 0.68 0.75 0.67 0.67 10.00 0.67\n",
"w_lud 0.67 0.68 0.66 0.66 0.68 0.66 0.67 0.67 0.67 0.66 0.67 0.67 10.00"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"vectorNorm :: [Double] -> Double\n",
"vectorNorm vs = sqrt $ sum $ map (\\x -> x * x) vs\n",
"\n",
"toUnitVector :: [Double] -> [Double]\n",
"toUnitVector vs = map (/ n) vs\n",
" where n = vectorNorm vs\n",
"\n",
"vectorNorm (toUnitVector [3.0, 4.0])\n",
"\n",
"euclDistanceNormalized :: [Double] -> [Double] -> Double\n",
"euclDistanceNormalized v1 v2 = toUnitVector v1 `euclDistance` toUnitVector v2\n",
"\n",
"euclSim v1 v2 = 1 / (d + 0.1)\n",
" where d = euclDistanceNormalized v1 v2\n",
"\n",
"paintMatrix euclSim labelsLimited limitedL"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Podobie\u0144stwo kosinusowe\n",
"\n",
"Cz\u0119\u015bciej zamiast odleg\u0142o\u015bci euklidesowej stosuje si\u0119 podobie\u0144stwo kosinusowe, czyli kosinus k\u0105ta mi\u0119dzy wektorami.\n",
"\n",
"Wektor dokumentu ($\\vec{V}(d)$) - wektor, kt\u00f3rego sk\u0142adowe odpowiadaj\u0105 wyrazom.\n",
"\n",
"$$\\sigma(d_1,d_2) = \\cos\\theta(\\vec{V}(d_1),\\vec{V}(d_2)) = \\frac{\\vec{V}(d_1) \\cdot \\vec{V}(d_2)}{|\\vec{V}(d_1)||\\vec{V}(d_2)|} $$\n",
"\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Zauwa\u017cmy, \u017ce jest to iloczyn skalarny znormalizowanych wektor\u00f3w!\n",
"\n",
"$$\\sigma(d_1,d_2) = \\vec{1}(\\vec{V}(d_1)) \\times \\vec{1}(\\vec{V}(d_2)) $$"
]
},
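  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For intuition, a toy example with made-up 3-dimensional count vectors (not taken from our collection):\n",
    "\n",
    "$$\\cos\\theta = \\frac{(1,2,0) \\cdot (2,3,1)}{|(1,2,0)|\\,|(2,3,1)|} = \\frac{8}{\\sqrt{5}\\sqrt{14}} \\approx 0.96$$"
   ]
  },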
{
"cell_type": "code",
"execution_count": 55,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1.0"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"(\u2715) :: [Double] -> [Double] -> Double\n",
"(\u2715) v1 v2 = sum $ Prelude.zipWith (*) v1 v2\n",
"\n",
"[2, 1, 0] \u2715 [-2, 5, 10]"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" na_ak w_lud ba_hy w_lap ne_dz be_wy zw_oz mo_zu be_wy ba_hy mo_zu be_wy w_lud\n",
"na_ak 1.00 0.02 0.01 0.01 0.03 0.02 0.02 0.04 0.03 0.02 0.01 0.02 0.03\n",
"w_lud 0.02 1.00 0.02 0.05 0.04 0.01 0.03 0.04 0.06 0.01 0.02 0.03 0.06\n",
"ba_hy 0.01 0.02 1.00 0.01 0.02 0.03 0.03 0.04 0.08 0.22 0.01 0.04 0.01\n",
"w_lap 0.01 0.05 0.01 1.00 0.01 0.01 0.00 0.01 0.02 0.00 0.00 0.00 0.00\n",
"ne_dz 0.03 0.04 0.02 0.01 1.00 0.04 0.03 0.07 0.08 0.06 0.03 0.03 0.05\n",
"be_wy 0.02 0.01 0.03 0.01 0.04 1.00 0.01 0.03 0.21 0.01 0.02 0.25 0.01\n",
"zw_oz 0.02 0.03 0.03 0.00 0.03 0.01 1.00 0.04 0.03 0.00 0.01 0.02 0.02\n",
"mo_zu 0.04 0.04 0.04 0.01 0.07 0.03 0.04 1.00 0.10 0.02 0.09 0.05 0.04\n",
"be_wy 0.03 0.06 0.08 0.02 0.08 0.21 0.03 0.10 1.00 0.05 0.03 0.24 0.04\n",
"ba_hy 0.02 0.01 0.22 0.00 0.06 0.01 0.00 0.02 0.05 1.00 0.01 0.02 0.00\n",
"mo_zu 0.01 0.02 0.01 0.00 0.03 0.02 0.01 0.09 0.03 0.01 1.00 0.01 0.02\n",
"be_wy 0.02 0.03 0.04 0.00 0.03 0.25 0.02 0.05 0.24 0.02 0.01 1.00 0.02\n",
"w_lud 0.03 0.06 0.01 0.00 0.05 0.01 0.02 0.04 0.04 0.00 0.02 0.02 1.00"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"cosineSim v1 v2 = toUnitVector v1 \u2715 toUnitVector v2\n",
"\n",
"paintMatrix cosineSim labelsLimited limitedL"
]
},
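  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a sanity check, the dot product of the unit vectors should agree with the direct definition $\\frac{\\vec{V}(d_1) \\cdot \\vec{V}(d_2)}{|\\vec{V}(d_1)||\\vec{V}(d_2)|}$. A minimal sketch, assuming `cosineSim`, the `(\u2715)` operator, `vectorNorm` and `lVectorized` from the cells above:\n",
    "\n",
    "```haskell\n",
    "-- direct definition: dot product divided by the product of the norms\n",
    "cosineSimDirect :: [Double] -> [Double] -> Double\n",
    "cosineSimDirect v1 v2 = (v1 \u2715 v2) / (vectorNorm v1 * vectorNorm v2)\n",
    "\n",
    "cosineSim (lVectorized !! 0) (lVectorized !! 1)\n",
    "cosineSimDirect (lVectorized !! 0) (lVectorized !! 1)\n",
    "```"
   ]
  },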
{
"cell_type": "code",
"execution_count": 140,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"na tylnym siedzeniu w autobusie siedzi matka z 7-8 letnim synkiem. […]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"collectionL !! 5"
]
},
{
"cell_type": "code",
"execution_count": 141,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Kr\u00f3tko zwi\u0119\u017ale i na temat. Zastanawia mnie jak ludzie wychowuj\u0105 dzieci. […]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"collectionL !! 8"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Z powrotem do wyszukiwarek\n",
"\n",
"Mo\u017cemy potraktowa\u0107 zapytanie jako bardzo kr\u00f3tki dokument, dokona\u0107 jego wektoryzacji i policzy\u0107 cosinus k\u0105ta mi\u0119dzy zapytaniem a dokumentem."
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"ja za to znam przypadek, \u017ce kole\u017canka mieszkala w bloku par\u0119 lat temu, pewnego razu wchodzi do \u0142azienki w samej bieli\u017anie a tam ogromny w\u0105\u017c na pod\u0142odze i tak si\u0119 wystraszy\u0142a \u017ce wybieg\u0142a z wrzaskiem z mieszkania i wylecia\u0142a przed blok w samej bieli\u017anie i uciek\u0142a do babci swojej, kt\u00f3ra mieszkala gdzie\u015b niedaleko. a potem si\u0119 okaza\u0142o, \u017ce jej s\u0105siad z do\u0142u hodowa\u0142 sobie w\u0119\u017ca i tak w\u0142a\u015bnie swobodnie go \"pasa\u0142\" po mieszkaniu i w\u0105\u017c mu spierdzieli\u0142 przez rur\u0119 w \u0142azience :cool :"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"Pewna dziewczyna, wieku mi nieznanego, w mie\u015bcie sto\u0142ecznym - rozwiod\u0142a si\u0119. By\u0142a sama i samotna, wi\u0119c zapragn\u0119\u0142a kupi\u0107 sobie zwierz\u0119, […]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"Anakonda. Czy to kolejna miejska legenda? Jaki\u015b czas temu kole\u017canka na jednej z imprez towarzyskich opowiedzia\u0142a mro\u017c\u0105c\u0105 krew w \u017cy\u0142ach histori\u0119 o dziewczynie ze swojej pracy, kt\u00f3ra w Warszawie na dyskotece w Dekadzie pozna\u0142a ch\u0142opaka. […]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import Data.Ord\n",
"import Data.List\n",
"\n",
"legendVectorizer = vectorizeTfIdf vocLSize collectionLNormalized vocL . normalize\n",
"\n",
"\n",
"query vs vzer q = map ((collectionL !!) . snd) $ Data.List.take 3 $ sortBy (\\a b -> fst b `compare` fst a) $ zip (map (`cosineSim` qvec) vs) [0..] \n",
" where qvec = vzer q \n",
"\n",
"query lVectorized legendVectorizer \"w\u0105\u017c przymierza si\u0119 do zjedzenia w\u0142a\u015bcicielki\"\n",
"\n"
]
},
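  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The same machinery can also return topic labels together with the scores, which makes it easier to check whether the ranking is sensible. A minimal sketch, assuming `lVectorized`, `legendVectorizer`, `labelsL` and `cosineSim` from the cells above (the helper name `queryWithScores` is made up for this example):\n",
    "\n",
    "```haskell\n",
    "-- top-k (cosine score, topic label) pairs for a query\n",
    "queryWithScores :: Int -> Text -> [(Double, Text)]\n",
    "queryWithScores k q = Data.List.take k\n",
    "                      $ sortBy (\\a b -> fst b `compare` fst a)\n",
    "                      $ zip (map (`cosineSim` qvec) lVectorized) labelsL\n",
    "  where qvec = legendVectorizer q\n",
    "\n",
    "queryWithScores 5 \"w\u0105\u017c przymierza si\u0119 do zjedzenia w\u0142a\u015bcicielki\"\n",
    "```"
   ]
  },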
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Haskell",
"language": "haskell",
"name": "haskell"
},
"language_info": {
"codemirror_mode": "ihaskell",
"file_extension": ".hs",
"mimetype": "text/x-haskell",
"name": "haskell",
"pygments_lexer": "Haskell",
"version": "8.10.4"
},
"author": "Filip Grali\u0144ski",
"email": "filipg@amu.edu.pl",
"lang": "pl",
"subtitle": "3.Wyszukiwarki \u2014 TF-IDF[wyk\u0142ad]",
"title": "Ekstrakcja informacji",
"year": "2021"
},
"nbformat": 4,
"nbformat_minor": 4
}