aitech-eks-pub/wyk/03_Tfidf.ipynb

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Wyszukiwarka - szybka i sensowna"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Roboczy przykład\n",
"\n",
"Zakładamy, że mamy pewną kolekcję dokumentów $D = {d_1, \\ldots, d_N}$. ($N$ - liczba dokumentów w kolekcji)."
]
},
{
"cell_type": "code",
"execution_count": 90,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Podobno jest kot w butach."
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"{-# LANGUAGE OverloadedStrings #-}\n",
"\n",
"import Data.Text hiding(map, filter, zip)\n",
"import Prelude hiding(words, take)\n",
"\n",
"collectionD :: [Text]\n",
"collectionD = [\"Ala ma kota.\", \"Podobno jest kot w butach.\", \"Ty chyba masz kota!\", \"But chyba zgubiłem.\"]\n",
"\n",
"-- Operator (!!) zwraca element listy o podanym indeksie\n",
"-- (Przy większych listach będzie nieefektywne, ale nie będziemy komplikować)\n",
"collectionD !! 1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Wydobycie tekstu\n",
"\n",
"Przykładowe narzędzia:\n",
"\n",
"* pdftotext\n",
"* antiword\n",
"* Tesseract OCR\n",
"* Apache Tika - uniwersalne narzędzie do wydobywania tekstu z różnych formatów\n",
"\n",
"## Normalizacja tekstu\n",
"\n",
"Cokolwiek robimy z tekstem, najpierw musimy go _znormalizować_."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Tokenizacja\n",
"\n",
"Po pierwsze musimy podzielić tekst na _tokeny_, czyli wyrazapodobne jednostki.\n",
"Może po prostu podzielić po spacjach?"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Ala"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ma"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"kota."
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"tokenizeStupidly :: Text -> [Text]\n",
"-- words to funkcja z Data.Text, która dzieli po spacjach\n",
"tokenizeStupidly = words\n",
"\n",
"tokenizeStupidly $ Prelude.head collectionD"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A, trzeba _chociaż_ odsunąć znaki interpunkcyjne. Najprościej użyć wyrażenia regularnego. Warto użyć [unikodowych własności](https://en.wikipedia.org/wiki/Unicode_character_property) znaków i konstrukcji `\\p{...}`. "
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Ala"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ma"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"kota"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"."
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"{-# LANGUAGE QuasiQuotes #-}\n",
"\n",
"import Text.Regex.PCRE.Heavy\n",
"\n",
"tokenize :: Text -> [Text]\n",
"tokenize = map fst . scan [re|[\\p{L}0-9]+|\\p{P}|]\n",
"\n",
"tokenize $ Prelude.head collectionD"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Cała kolekcja stokenizowana:"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Ala"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ma"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"kota"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"."
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"Podobno"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"jest"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"kot"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"w"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"butach"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"."
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"Ty"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"chyba"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"masz"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"kota"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"!"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"But"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"chyba"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"zgubiłem"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"."
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"map tokenize collectionD"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Problemy z tokenizacją\n",
"\n",
"##### Język angielski"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"I"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"use"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"a"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"data"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"-"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"base"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"tokenize \"I use a data-base\""
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"I"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"use"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"a"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"database"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"tokenize \"I use a database\""
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"I"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"use"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"a"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"data"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"base"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"tokenize \"I use a data base\""
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"I"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"don"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"'"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"t"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"like"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"Python"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"tokenize \"I don't like Python\""
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0018"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"555"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"555"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"122"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"tokenize \"+0018 555 555 122\""
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0018555555122"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"tokenize \"+0018555555122\""
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Which"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"one"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"is"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"better"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
":"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"C"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"or"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"C"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"#"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"?"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"tokenize \"Which one is better: C++ or C#?\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Inne języki?"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Rechtsschutzversicherungsgesellschaften"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"wie"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"die"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"HUK"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"-"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"Coburg"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"machen"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"es"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"bereits"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"seit"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"geraumer"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"Zeit"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"vor"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
":"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"tokenize \"Rechtsschutzversicherungsgesellschaften wie die HUK-Coburg machen es bereits seit geraumer Zeit vor:\""
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"今日波兹南是贸易"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"、"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"工业及教育的中心"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"。"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"波兹南是波兰第五大的城市及第四大的工业中心"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"波兹南亦是大波兰省的行政首府"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"。"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"也舉辦有不少展覽會"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"。"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"是波蘭西部重要的交通中心都市"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"。"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"tokenize \"今日波兹南是贸易、工业及教育的中心。波兹南是波兰第五大的城市及第四大的工业中心,波兹南亦是大波兰省的行政首府。也舉辦有不少展覽會。是波蘭西部重要的交通中心都市。\""
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"l"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"'"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ordinateur"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"tokenize \"l'ordinateur\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Lematyzacja"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"_Lematyzacja_ to sprowadzenie do formy podstawowej (_lematu_), np. \"krześle\" do \"krzesło\", \"zrobimy\" do \"zrobić\" dla języka polskiego, \"chairs\" do \"chair\", \"made\" do \"make\" dla języka angielskiego."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Lematyzacja dla języka polskiego jest bardzo trudna, praktycznie nie sposób wykonać ją regułowo, po prostu musimy się postarać o bardzo obszerny _słownik form fleksyjnych_.\n",
"\n",
"Na potrzeby tego wykładu stwórzmy sobie mały słownik form fleksyjnych w postaci tablicy asocjacyjnej (haszującej)."
]
},
{
"cell_type": "code",
"execution_count": 80,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<style>/* Styles used for the Hoogle display in the pager */\n",
".hoogle-doc {\n",
"display: block;\n",
"padding-bottom: 1.3em;\n",
"padding-left: 0.4em;\n",
"}\n",
".hoogle-code {\n",
"display: block;\n",
"font-family: monospace;\n",
"white-space: pre;\n",
"}\n",
".hoogle-text {\n",
"display: block;\n",
"}\n",
".hoogle-name {\n",
"color: green;\n",
"font-weight: bold;\n",
"}\n",
".hoogle-head {\n",
"font-weight: bold;\n",
"}\n",
".hoogle-sub {\n",
"display: block;\n",
"margin-left: 0.4em;\n",
"}\n",
".hoogle-package {\n",
"font-weight: bold;\n",
"font-style: italic;\n",
"}\n",
".hoogle-module {\n",
"font-weight: bold;\n",
"}\n",
".hoogle-class {\n",
"font-weight: bold;\n",
"}\n",
".get-type {\n",
"color: green;\n",
"font-weight: bold;\n",
"font-family: monospace;\n",
"display: block;\n",
"white-space: pre-wrap;\n",
"}\n",
".show-type {\n",
"color: green;\n",
"font-weight: bold;\n",
"font-family: monospace;\n",
"margin-left: 1em;\n",
"}\n",
".mono {\n",
"font-family: monospace;\n",
"display: block;\n",
"}\n",
".err-msg {\n",
"color: red;\n",
"font-style: italic;\n",
"font-family: monospace;\n",
"white-space: pre;\n",
"display: block;\n",
"}\n",
"#unshowable {\n",
"color: red;\n",
"font-weight: bold;\n",
"}\n",
".err-msg.in.collapse {\n",
"padding-top: 0.7em;\n",
"}\n",
".highlight-code {\n",
"white-space: pre;\n",
"font-family: monospace;\n",
"}\n",
".suggestion-warning { \n",
"font-weight: bold;\n",
"color: rgb(200, 130, 0);\n",
"}\n",
".suggestion-error { \n",
"font-weight: bold;\n",
"color: red;\n",
"}\n",
".suggestion-name {\n",
"font-weight: bold;\n",
"}\n",
"</style><div class=\"suggestion-name\" style=\"clear:both;\">Use head</div><div class=\"suggestion-row\" style=\"float: left;\"><div class=\"suggestion-warning\">Found:</div><div class=\"highlight-code\" id=\"haskell\">collectionD !! 0</div></div><div class=\"suggestion-row\" style=\"float: left;\"><div class=\"suggestion-warning\">Why Not:</div><div class=\"highlight-code\" id=\"haskell\">head collectionD</div></div>"
],
"text/plain": [
"Line 22: Use head\n",
"Found:\n",
"collectionD !! 0\n",
"Why not:\n",
"head collectionD"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"but"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"butami"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"Ala"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"mieć"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"kot"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"."
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import Data.Map as Map hiding(take, map, filter)\n",
"\n",
"mockInflectionDictionary :: Map Text Text\n",
"mockInflectionDictionary = Map.fromList [\n",
" (\"kota\", \"kot\"),\n",
" (\"butach\", \"but\"),\n",
" (\"masz\", \"mieć\"),\n",
" (\"ma\", \"mieć\"),\n",
" (\"buta\", \"but\"),\n",
" (\"zgubiłem\", \"zgubić\")]\n",
"\n",
"lemmatizeWord :: Map Text Text -> Text -> Text\n",
"lemmatizeWord dict w = findWithDefault w w dict\n",
"\n",
"lemmatizeWord mockInflectionDictionary \"butach\"\n",
"-- a tego nie ma w naszym słowniczku, więc zwracamy to samo\n",
"lemmatizeWord mockInflectionDictionary \"butami\"\n",
"\n",
"lemmatize :: Map Text Text -> [Text] -> [Text]\n",
"lemmatize dict = map (lemmatizeWord dict)\n",
"\n",
"lemmatize mockInflectionDictionary $ tokenize $ collectionD !! 0 \n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Pytanie**: Nawet w naszym słowniczku mamy problemy z niejednoznacznością lematyzacji. Jakie?\n",
"\n",
"Obszerny słownik form fleksyjnych dla języka polskiego: http://zil.ipipan.waw.pl/PoliMorf?action=AttachFile&do=view&target=PoliMorf-0.6.7.tab.gz"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Stemowanie\n",
"\n",
"Stemowanie (rdzeniowanie) obcina wyraz do _rdzenia_ niekoniecznie będącego sensownym wyrazem, np. \"krześle\" może być rdzeniowane do \"krześl\", \"krześ\" albo \"krzes\", \"zrobimy\" do \"zrobi\".\n",
"\n",
"* stemowanie nie jest tak dobrze określone jak lematyzacja (można robić na wiele sposobów)\n",
"* bardziej podatne na metody regułowe (choć dla polskiego i tak trudno)\n",
"* dla angielskiego istnieją znane algorytmy stemowania, np. [algorytm Portera](https://tartarus.org/martin/PorterStemmer/def.txt)\n",
"* zob. też [program Snowball](https://snowballstem.org/) z regułami dla wielu języków\n",
"\n",
"Prosty stemmer \"dla ubogich\" dla języka polskiego to obcinanie do sześciu znaków."
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"zrobim"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"komput"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"butach"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"poorMansStemming :: Text -> Text\n",
"poorMansStemming = take 6\n",
"\n",
"poorMansStemming \"zrobimy\"\n",
"poorMansStemming \"komputerami\"\n",
"poorMansStemming \"butach\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### _Stop words_\n",
"\n",
"Często wyszukiwarki pomijają krótkie, częste i nieniosące znaczenia słowa - _stop words_ (_słowa przestankowe_)."
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"False"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"True"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"isStopWord :: Text -> Bool\n",
"isStopWord \"w\" = True\n",
"isStopWord \"jest\" = True\n",
"isStopWord \"że\" = True\n",
"-- przy okazji możemy pozbyć się znaków interpunkcyjnych\n",
"isStopWord w = w ≈ [re|^\\p{P}+$|]\n",
"\n",
"isStopWord \"kot\"\n",
"isStopWord \"!\"\n"
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Ala"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ma"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"kota"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"removeStopWords :: [Text] -> [Text]\n",
"removeStopWords = filter (not . isStopWord)\n",
"\n",
"removeStopWords $ tokenize $ Prelude.head collectionD "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Pytanie**: Jakim zapytaniom usuwanie _stop words_ może szkodzić? Podać przykłady dla języka polskiego i angielskiego. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Normalizacja - różności\n",
"\n",
"W skład normalizacji może też wchodzić:\n",
"\n",
"* poprawianie błędów literowych\n",
"* sprowadzanie do małych liter (lower-casing czy raczej case-folding)\n",
"* usuwanie znaków diakrytycznych\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"żdźbło"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"toLower \"ŻDŹBŁO\""
]
},
{
"cell_type": "code",
"execution_count": 58,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"źdźbło"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"toCaseFold \"ŹDŹBŁO\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Pytanie:** Kiedy _case-folding_ da inny wynik niż _lower-casing_? Jakie to ma praktyczne znaczenie?"
]
},
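{
"cell_type": "markdown",
"metadata": {},
"source": [
"A hint, as a minimal sketch (this cell is not part of the original lecture code): for most text `toLower` and `toCaseFold` agree, but full case folding may expand characters, e.g. the German letter ß folds to \"ss\", while lower-casing leaves it unchanged."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"-- lower-casing keeps ß as it is\n",
"toLower \"Straße\"\n",
"-- case folding expands ß to ss, so both spellings compare equal after folding\n",
"toCaseFold \"Straße\"\n",
"toCaseFold \"Straße\" == toCaseFold \"STRASSE\""
]
},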
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Normalizacja jako całościowy proces\n",
"\n",
"Najważniejsza zasada: dokumenty w naszej kolekcji powinny być normalizowane w dokładnie taki sposób, jak zapytania.\n",
"\n",
"Efektem normalizacji jest zamiana dokumentu na ciąg _termów_ (ang. _terms_), czyli znormalizowanych wyrazów.\n",
"\n",
"Innymi słowy po normalizacji dokument $d_i$ traktujemy jako ciąg termów $t_i^1,\\dots,t_i^{|d_i|}$."
]
},
{
"cell_type": "code",
"execution_count": 82,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"but"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"chyba"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"zgubić"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"normalize :: Text -> [Text]\n",
"normalize = removeStopWords . map toLower . lemmatize mockInflectionDictionary . tokenize\n",
"\n",
"normalize $ collectionD !! 3"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Zbiór wszystkich termów w kolekcji dokumentów nazywamy słownikiem (ang. _vocabulary_), nie mylić ze słownikiem jako strukturą danych w Pythonie (_dictionary_).\n",
"\n",
"$$V = \\bigcup_{i=1}^N \\{t_i^1,\\dots,t_i^{|d_i|}\\}$$\n",
"\n",
"(To zbiór, więc liczymy bez powtórzeń!)"
]
},
{
"cell_type": "code",
"execution_count": 84,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"fromList [\"ala\",\"but\",\"chyba\",\"kot\",\"mie\\263\",\"podobno\",\"ty\",\"zgubi\\263\"]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import Data.Set as Set hiding(map)\n",
"\n",
"getVocabulary :: [Text] -> Set Text \n",
"getVocabulary = Set.unions . map (Set.fromList . normalize) \n",
"\n",
"getVocabulary collectionD"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Jak wyszukiwarka może być szybka?\n",
"\n",
"_Odwrócony indeks_ (ang. _inverted index_) pozwala wyszukiwarce szybko szukać w milionach dokumentów. Odwrócoy indeks to prostu... indeks, jaki znamy z książek (mapowanie słów na numery stron/dokumentów).\n",
"\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 88,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<style>/* Styles used for the Hoogle display in the pager */\n",
".hoogle-doc {\n",
"display: block;\n",
"padding-bottom: 1.3em;\n",
"padding-left: 0.4em;\n",
"}\n",
".hoogle-code {\n",
"display: block;\n",
"font-family: monospace;\n",
"white-space: pre;\n",
"}\n",
".hoogle-text {\n",
"display: block;\n",
"}\n",
".hoogle-name {\n",
"color: green;\n",
"font-weight: bold;\n",
"}\n",
".hoogle-head {\n",
"font-weight: bold;\n",
"}\n",
".hoogle-sub {\n",
"display: block;\n",
"margin-left: 0.4em;\n",
"}\n",
".hoogle-package {\n",
"font-weight: bold;\n",
"font-style: italic;\n",
"}\n",
".hoogle-module {\n",
"font-weight: bold;\n",
"}\n",
".hoogle-class {\n",
"font-weight: bold;\n",
"}\n",
".get-type {\n",
"color: green;\n",
"font-weight: bold;\n",
"font-family: monospace;\n",
"display: block;\n",
"white-space: pre-wrap;\n",
"}\n",
".show-type {\n",
"color: green;\n",
"font-weight: bold;\n",
"font-family: monospace;\n",
"margin-left: 1em;\n",
"}\n",
".mono {\n",
"font-family: monospace;\n",
"display: block;\n",
"}\n",
".err-msg {\n",
"color: red;\n",
"font-style: italic;\n",
"font-family: monospace;\n",
"white-space: pre;\n",
"display: block;\n",
"}\n",
"#unshowable {\n",
"color: red;\n",
"font-weight: bold;\n",
"}\n",
".err-msg.in.collapse {\n",
"padding-top: 0.7em;\n",
"}\n",
".highlight-code {\n",
"white-space: pre;\n",
"font-family: monospace;\n",
"}\n",
".suggestion-warning { \n",
"font-weight: bold;\n",
"color: rgb(200, 130, 0);\n",
"}\n",
".suggestion-error { \n",
"font-weight: bold;\n",
"color: red;\n",
"}\n",
".suggestion-name {\n",
"font-weight: bold;\n",
"}\n",
"</style><div class=\"suggestion-name\" style=\"clear:both;\">Use tuple-section</div><div class=\"suggestion-row\" style=\"float: left;\"><div class=\"suggestion-warning\">Found:</div><div class=\"highlight-code\" id=\"haskell\">\\ t -> (t, ix)</div></div><div class=\"suggestion-row\" style=\"float: left;\"><div class=\"suggestion-warning\">Why Not:</div><div class=\"highlight-code\" id=\"haskell\">(, ix)</div></div>"
],
"text/plain": [
"Line 4: Use tuple-section\n",
"Found:\n",
"\\ t -> (t, ix)\n",
"Why not:\n",
"(, ix)"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"fromList [(\"chyba\",2),(\"kot\",2),(\"mie\\263\",2),(\"ty\",2)]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"collectionDNormalized = map normalize collectionD\n",
"\n",
"documentToPostings :: ([Text], Int) -> Set (Text, Int)\n",
"documentToPostings (d, ix) = Set.fromList $ map (\\t -> (t, ix)) d\n",
"\n",
"documentToPostings (collectionDNormalized !! 2, 2) \n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 91,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<style>/* Styles used for the Hoogle display in the pager */\n",
".hoogle-doc {\n",
"display: block;\n",
"padding-bottom: 1.3em;\n",
"padding-left: 0.4em;\n",
"}\n",
".hoogle-code {\n",
"display: block;\n",
"font-family: monospace;\n",
"white-space: pre;\n",
"}\n",
".hoogle-text {\n",
"display: block;\n",
"}\n",
".hoogle-name {\n",
"color: green;\n",
"font-weight: bold;\n",
"}\n",
".hoogle-head {\n",
"font-weight: bold;\n",
"}\n",
".hoogle-sub {\n",
"display: block;\n",
"margin-left: 0.4em;\n",
"}\n",
".hoogle-package {\n",
"font-weight: bold;\n",
"font-style: italic;\n",
"}\n",
".hoogle-module {\n",
"font-weight: bold;\n",
"}\n",
".hoogle-class {\n",
"font-weight: bold;\n",
"}\n",
".get-type {\n",
"color: green;\n",
"font-weight: bold;\n",
"font-family: monospace;\n",
"display: block;\n",
"white-space: pre-wrap;\n",
"}\n",
".show-type {\n",
"color: green;\n",
"font-weight: bold;\n",
"font-family: monospace;\n",
"margin-left: 1em;\n",
"}\n",
".mono {\n",
"font-family: monospace;\n",
"display: block;\n",
"}\n",
".err-msg {\n",
"color: red;\n",
"font-style: italic;\n",
"font-family: monospace;\n",
"white-space: pre;\n",
"display: block;\n",
"}\n",
"#unshowable {\n",
"color: red;\n",
"font-weight: bold;\n",
"}\n",
".err-msg.in.collapse {\n",
"padding-top: 0.7em;\n",
"}\n",
".highlight-code {\n",
"white-space: pre;\n",
"font-family: monospace;\n",
"}\n",
".suggestion-warning { \n",
"font-weight: bold;\n",
"color: rgb(200, 130, 0);\n",
"}\n",
".suggestion-error { \n",
"font-weight: bold;\n",
"color: red;\n",
"}\n",
".suggestion-name {\n",
"font-weight: bold;\n",
"}\n",
"</style><div class=\"suggestion-name\" style=\"clear:both;\">Use zipWith</div><div class=\"suggestion-row\" style=\"float: left;\"><div class=\"suggestion-warning\">Found:</div><div class=\"highlight-code\" id=\"haskell\">map documentToPostings $ zip coll [0 .. ]</div></div><div class=\"suggestion-row\" style=\"float: left;\"><div class=\"suggestion-warning\">Why Not:</div><div class=\"highlight-code\" id=\"haskell\">zipWith (curry documentToPostings) coll [0 .. ]</div></div>"
],
"text/plain": [
"Line 2: Use zipWith\n",
"Found:\n",
"map documentToPostings $ zip coll [0 .. ]\n",
"Why not:\n",
"zipWith (curry documentToPostings) coll [0 .. ]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"fromList [(\"ala\",0),(\"but\",1),(\"but\",3),(\"chyba\",2),(\"chyba\",3),(\"kot\",0),(\"kot\",1),(\"kot\",2),(\"mie\\263\",0),(\"mie\\263\",2),(\"podobno\",1),(\"ty\",2),(\"zgubi\\263\",3)]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"collectionToPostings :: [[Text]] -> Set (Text, Int)\n",
"collectionToPostings coll = Set.unions $ map documentToPostings $ zip coll [0..]\n",
"\n",
"collectionToPostings collectionDNormalized"
]
},
{
"cell_type": "code",
"execution_count": 102,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<style>/* Styles used for the Hoogle display in the pager */\n",
".hoogle-doc {\n",
"display: block;\n",
"padding-bottom: 1.3em;\n",
"padding-left: 0.4em;\n",
"}\n",
".hoogle-code {\n",
"display: block;\n",
"font-family: monospace;\n",
"white-space: pre;\n",
"}\n",
".hoogle-text {\n",
"display: block;\n",
"}\n",
".hoogle-name {\n",
"color: green;\n",
"font-weight: bold;\n",
"}\n",
".hoogle-head {\n",
"font-weight: bold;\n",
"}\n",
".hoogle-sub {\n",
"display: block;\n",
"margin-left: 0.4em;\n",
"}\n",
".hoogle-package {\n",
"font-weight: bold;\n",
"font-style: italic;\n",
"}\n",
".hoogle-module {\n",
"font-weight: bold;\n",
"}\n",
".hoogle-class {\n",
"font-weight: bold;\n",
"}\n",
".get-type {\n",
"color: green;\n",
"font-weight: bold;\n",
"font-family: monospace;\n",
"display: block;\n",
"white-space: pre-wrap;\n",
"}\n",
".show-type {\n",
"color: green;\n",
"font-weight: bold;\n",
"font-family: monospace;\n",
"margin-left: 1em;\n",
"}\n",
".mono {\n",
"font-family: monospace;\n",
"display: block;\n",
"}\n",
".err-msg {\n",
"color: red;\n",
"font-style: italic;\n",
"font-family: monospace;\n",
"white-space: pre;\n",
"display: block;\n",
"}\n",
"#unshowable {\n",
"color: red;\n",
"font-weight: bold;\n",
"}\n",
".err-msg.in.collapse {\n",
"padding-top: 0.7em;\n",
"}\n",
".highlight-code {\n",
"white-space: pre;\n",
"font-family: monospace;\n",
"}\n",
".suggestion-warning { \n",
"font-weight: bold;\n",
"color: rgb(200, 130, 0);\n",
"}\n",
".suggestion-error { \n",
"font-weight: bold;\n",
"color: red;\n",
"}\n",
".suggestion-name {\n",
"font-weight: bold;\n",
"}\n",
"</style><div class=\"suggestion-name\" style=\"clear:both;\">Eta reduce</div><div class=\"suggestion-row\" style=\"float: left;\"><div class=\"suggestion-warning\">Found:</div><div class=\"highlight-code\" id=\"haskell\">updateInvertedIndex (t, ix) invIndex\n",
" = insertWith (++) t [ix] invIndex</div></div><div class=\"suggestion-row\" style=\"float: left;\"><div class=\"suggestion-warning\">Why Not:</div><div class=\"highlight-code\" id=\"haskell\">updateInvertedIndex (t, ix) = insertWith (++) t [ix]</div></div>"
],
"text/plain": [
"Line 2: Eta reduce\n",
"Found:\n",
"updateInvertedIndex (t, ix) invIndex\n",
" = insertWith (++) t [ix] invIndex\n",
"Why not:\n",
"updateInvertedIndex (t, ix) = insertWith (++) t [ix]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"fromList [(\"ala\",[0]),(\"but\",[1,3]),(\"chyba\",[2,3]),(\"kot\",[0,1,2]),(\"mie\\263\",[0,2]),(\"podobno\",[1]),(\"ty\",[2]),(\"zgubi\\263\",[3])]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"updateInvertedIndex :: (Text, Int) -> Map Text [Int] -> Map Text [Int]\n",
"updateInvertedIndex (t, ix) invIndex = insertWith (++) t [ix] invIndex\n",
"\n",
"getInvertedIndex :: [[Text]] -> Map Text [Int]\n",
"getInvertedIndex = Prelude.foldr updateInvertedIndex Map.empty . Set.toList . collectionToPostings\n",
"\n",
"getInvertedIndex collectionDNormalized"
]
},
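{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough illustration of how the inverted index is used (this cell is not part of the original lecture code; `postingsFor` and `andQuery` are ad-hoc names), an AND query can be answered by fetching the posting list of every query term and intersecting the lists."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"-- posting list of a single term (empty if the term is unknown)\n",
"postingsFor :: Map Text [Int] -> Text -> [Int]\n",
"postingsFor invIndex t = Map.findWithDefault [] t invIndex\n",
"\n",
"-- AND query: indices of documents that contain every term of the normalized query\n",
"-- (we assume the query contains at least one term after normalization)\n",
"andQuery :: Map Text [Int] -> Text -> [Int]\n",
"andQuery invIndex q = Set.toList $ Prelude.foldl1 Set.intersection $ map (Set.fromList . postingsFor invIndex) (normalize q)\n",
"\n",
"andQuery (getInvertedIndex collectionDNormalized) \"Ala ma kota\""
]
},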
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Relewantność\n",
"\n",
"Potrafimy szybko przeszukiwać znormalizowane dokumenty, ale które dokumenty są ważne (_relewantne_) względem potrzeby informacyjnej użytkownika?\n",
"\n",
"### Zapytania boole'owskie\n",
"\n",
"* `pizzeria Poznań dowóz` to `pizzeria AND Poznań AND dowóz` czy `pizzera OR POZNAŃ OR dowóz`\n",
"* `(pizzeria OR pizza OR tratoria) AND Poznań AND dowóz\n",
"* `pizzeria AND Poznań AND dowóz AND NOT golonka`\n",
"\n",
"Jak domyślnie interpretować zapytanie?\n",
"\n",
"* jako zapytanie AND -- być może za mało dokumentów\n",
"* rozwiązanie pośrednie?\n",
"* jako zapytanie OR -- być może za dużo dokumentów\n",
"\n",
"Możemy jakieś miary dopasowania dokumentu do zapytania, żeby móc posortować dokumenty...\n",
"\n",
"### Mierzenie dopasowania dokumentu do zapytania\n",
"\n",
"Potrzebujemy jakieś funkcji $\\sigma : Q x D \\rightarrow \\mathbb{R}$. \n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Musimy jakoś zamienić dokumenty na liczby, tj. dokumenty na wektory liczb, a całą kolekcję na macierz.\n",
"\n",
"Po pierwsze ponumerujmy wszystkie termy ze słownika."
]
},
{
"cell_type": "code",
"execution_count": 115,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"fromList [(0,\"ala\"),(1,\"but\"),(2,\"chyba\"),(3,\"kot\"),(4,\"mie\\263\"),(5,\"podobno\"),(6,\"ty\"),(7,\"zgubi\\263\")]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"fromList [(\"ala\",0),(\"but\",1),(\"chyba\",2),(\"kot\",3),(\"mie\\263\",4),(\"podobno\",5),(\"ty\",6),(\"zgubi\\263\",7)]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ala"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"2"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"voc = getVocabulary collectionD\n",
"\n",
"vocD :: Map Int Text\n",
"vocD = Map.fromList $ zip [0..] $ Set.toList voc\n",
"\n",
"invvocD :: Map Text Int\n",
"invvocD = Map.fromList $ zip (Set.toList voc) [0..]\n",
"\n",
"vocD\n",
"\n",
"invvocD\n",
"\n",
"vocD ! 0\n",
"invvocD ! \"chyba\"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Napiszmy funkcję, która _wektoryzuje_ znormalizowany dokument.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 125,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<style>/* Styles used for the Hoogle display in the pager */\n",
".hoogle-doc {\n",
"display: block;\n",
"padding-bottom: 1.3em;\n",
"padding-left: 0.4em;\n",
"}\n",
".hoogle-code {\n",
"display: block;\n",
"font-family: monospace;\n",
"white-space: pre;\n",
"}\n",
".hoogle-text {\n",
"display: block;\n",
"}\n",
".hoogle-name {\n",
"color: green;\n",
"font-weight: bold;\n",
"}\n",
".hoogle-head {\n",
"font-weight: bold;\n",
"}\n",
".hoogle-sub {\n",
"display: block;\n",
"margin-left: 0.4em;\n",
"}\n",
".hoogle-package {\n",
"font-weight: bold;\n",
"font-style: italic;\n",
"}\n",
".hoogle-module {\n",
"font-weight: bold;\n",
"}\n",
".hoogle-class {\n",
"font-weight: bold;\n",
"}\n",
".get-type {\n",
"color: green;\n",
"font-weight: bold;\n",
"font-family: monospace;\n",
"display: block;\n",
"white-space: pre-wrap;\n",
"}\n",
".show-type {\n",
"color: green;\n",
"font-weight: bold;\n",
"font-family: monospace;\n",
"margin-left: 1em;\n",
"}\n",
".mono {\n",
"font-family: monospace;\n",
"display: block;\n",
"}\n",
".err-msg {\n",
"color: red;\n",
"font-style: italic;\n",
"font-family: monospace;\n",
"white-space: pre;\n",
"display: block;\n",
"}\n",
"#unshowable {\n",
"color: red;\n",
"font-weight: bold;\n",
"}\n",
".err-msg.in.collapse {\n",
"padding-top: 0.7em;\n",
"}\n",
".highlight-code {\n",
"white-space: pre;\n",
"font-family: monospace;\n",
"}\n",
".suggestion-warning { \n",
"font-weight: bold;\n",
"color: rgb(200, 130, 0);\n",
"}\n",
".suggestion-error { \n",
"font-weight: bold;\n",
"color: red;\n",
"}\n",
".suggestion-name {\n",
"font-weight: bold;\n",
"}\n",
"</style><div class=\"suggestion-name\" style=\"clear:both;\">Redundant $</div><div class=\"suggestion-row\" style=\"float: left;\"><div class=\"suggestion-warning\">Found:</div><div class=\"highlight-code\" id=\"haskell\">map (\\ i -> count (v ! i) doc) $ [0 .. (vecSize - 1)]</div></div><div class=\"suggestion-row\" style=\"float: left;\"><div class=\"suggestion-warning\">Why Not:</div><div class=\"highlight-code\" id=\"haskell\">map (\\ i -> count (v ! i) doc) [0 .. (vecSize - 1)]</div></div><div class=\"suggestion-name\" style=\"clear:both;\">Redundant bracket</div><div class=\"suggestion-row\" style=\"float: left;\"><div class=\"suggestion-warning\">Found:</div><div class=\"highlight-code\" id=\"haskell\">(collectionDNormalized !! 2)</div></div><div class=\"suggestion-row\" style=\"float: left;\"><div class=\"suggestion-warning\">Why Not:</div><div class=\"highlight-code\" id=\"haskell\">collectionDNormalized !! 2</div></div>"
],
"text/plain": [
"Line 2: Redundant $\n",
"Found:\n",
"map (\\ i -> count (v ! i) doc) $ [0 .. (vecSize - 1)]\n",
"Why not:\n",
"map (\\ i -> count (v ! i) doc) [0 .. (vecSize - 1)]Line 9: Redundant bracket\n",
"Found:\n",
"(collectionDNormalized !! 2)\n",
"Why not:\n",
"collectionDNormalized !! 2"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ty"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"chyba"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"mieć"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"kot"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"[0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"vectorize :: Int -> Map Int Text -> [Text] -> [Double]\n",
"vectorize vecSize v doc = map (\\i -> count (v ! i) doc) $ [0..(vecSize-1)]\n",
" where count t doc \n",
" | t `elem` doc = 1.0\n",
" | otherwise = 0.0\n",
" \n",
"vocSize = Set.size voc\n",
"\n",
"(collectionDNormalized !! 2)\n",
"vectorize vocSize vocD (collectionDNormalized !! 2)\n",
"\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" ![image](./macierz.png)"
]
},
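{
"cell_type": "markdown",
"metadata": {},
"source": [
"To obtain the whole term-document matrix pictured above, we can simply vectorize every normalized document (a small sketch; `termDocumentMatrix` is an ad-hoc name):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"-- one row per document, one column per term of the vocabulary\n",
"termDocumentMatrix :: [[Double]]\n",
"termDocumentMatrix = map (vectorize vocSize vocD) collectionDNormalized\n",
"\n",
"termDocumentMatrix"
]
},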
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Jak inaczej uwzględnić częstość wyrazów?\n",
"\n",
"<div style=\"display:none\">\n",
" $\n",
" \\newcommand{\\idf}{\\mathop{\\rm idf}\\nolimits}\n",
" \\newcommand{\\tf}{\\mathop{\\rm tf}\\nolimits}\n",
" \\newcommand{\\df}{\\mathop{\\rm df}\\nolimits}\n",
" \\newcommand{\\tfidf}{\\mathop{\\rm tfidf}\\nolimits}\n",
" $\n",
"</div>\n",
"\n",
"* $\\tf_{t,d}$\n",
"\n",
"* $1+\\log(\\tf_{t,d})$\n",
"\n",
"* $0.5 + \\frac{0.5 \\times \\tf_{t,d}}{max_t(\\tf_{t,d})}$"
]
},
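{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of the last two weighting schemes from the list above (this cell is not part of the original lecture code; `logTf` and `augmentedTf` are ad-hoc names):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"-- sublinear scaling: 1 + log tf (0 for absent terms)\n",
"logTf :: Double -> Double\n",
"logTf tfRaw = if tfRaw > 0 then 1 + log tfRaw else 0\n",
"\n",
"-- augmented frequency: normalized by the largest tf in the document\n",
"augmentedTf :: Double -> Double -> Double\n",
"augmentedTf maxTf tfRaw = 0.5 + 0.5 * tfRaw / maxTf\n",
"\n",
"map logTf [0, 1, 2, 10]\n",
"map (augmentedTf 10) [1, 2, 10]"
]
},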
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div style=\"display:none\">\n",
" $\n",
" \\newcommand{\\idf}{\\mathop{\\rm idf}\\nolimits}\n",
" \\newcommand{\\tf}{\\mathop{\\rm tf}\\nolimits}\n",
" \\newcommand{\\df}{\\mathop{\\rm df}\\nolimits}\n",
" \\newcommand{\\tfidf}{\\mathop{\\rm tfidf}\\nolimits}\n",
" $\n",
"</div>\n",
"\n",
"### Odwrotna częstość dokumentowa\n",
"\n",
"Czy wszystkie wyrazy są tak samo ważne?\n",
"\n",
"**NIE.** Wyrazy pojawiające się w wielu dokumentach są mniej ważne.\n",
"\n",
"Aby to uwzględnić, przemnażamy frekwencję wyrazu przez _odwrotną\n",
" częstość w dokumentach_ (_inverse document frequency_):\n",
"\n",
"$$\\idf_t = \\log \\frac{N}{\\df_t},$$\n",
"\n",
"gdzie:\n",
"\n",
"* $\\idf_t$ - odwrotna częstość wyrazu $t$ w dokumentach\n",
"\n",
"* $N$ - liczba dokumentów w kolekcji\n",
"\n",
"* $\\df_f$ - w ilu dokumentach wystąpił wyraz $t$?\n",
"\n",
"#### Dlaczego idf?\n",
"\n",
"term $t$ wystąpił...\n",
"\n",
"* w 1 dokumencie, $\\idf_t = \\log N/1 = \\log N$\n",
"* 2 razy w kolekcji, $\\idf_t = \\log N/2$ lub $\\log N$\n",
"* 3 razy w kolekcji, $\\idf_t = \\log N/(N/2) = \\log 2$\n",
"* we wszystkich dokumentach, $\\idf_t = \\log N/N = \\log 1 = 0$\n",
"\n",
"#### Co z tego wynika?\n",
"\n",
"Zamiast $\\tf_{t,d}$ będziemy w wektorach rozpatrywać wartości:\n",
"\n",
"$$\\tfidf_{t,d} = \\tf_{t,d} \\times \\idf_{t}$$\n",
"\n",
"Teraz zdefiniujemy _overlap score measure_:\n",
"\n",
"$$\\sigma(q,d) = \\sum_{t \\in q} \\tfidf_{t,d}$$\n",
"\n",
"\n",
"\n"
]
},
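{
"cell_type": "markdown",
"metadata": {},
"source": [
"The definitions above can be plugged into Haskell directly. Below is a minimal, unoptimized sketch (not part of the original lecture code) that reuses `getInvertedIndex`, `normalize` and `collectionDNormalized`; the names `documentFrequency`, `idf`, `tf`, `tfidf` and `score` are ad hoc, and we assume every query term occurs somewhere in the collection (otherwise the idf would divide by zero)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"-- df_t: in how many documents does term t occur?\n",
"documentFrequency :: Text -> Int\n",
"documentFrequency t = Prelude.length $ Map.findWithDefault [] t (getInvertedIndex collectionDNormalized)\n",
"\n",
"-- idf_t = log (N / df_t)\n",
"idf :: Text -> Double\n",
"idf t = log (fromIntegral (Prelude.length collectionDNormalized) / fromIntegral (documentFrequency t))\n",
"\n",
"-- tf_{t,d}: raw frequency of term t in a normalized document d\n",
"tf :: Text -> [Text] -> Double\n",
"tf t d = fromIntegral $ Prelude.length $ Prelude.filter (== t) d\n",
"\n",
"tfidf :: Text -> [Text] -> Double\n",
"tfidf t d = tf t d * idf t\n",
"\n",
"-- the overlap score measure: sum of tf-idf over the (normalized) query terms\n",
"score :: Text -> [Text] -> Double\n",
"score q d = sum $ map (`tfidf` d) (normalize q)\n",
"\n",
"idf \"kot\"\n",
"idf \"ala\"\n",
"score \"kota zgubiłem\" (collectionDNormalized !! 3)\n",
"score \"kota zgubiłem\" (collectionDNormalized !! 0)"
]
},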
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Haskell",
"language": "haskell",
"name": "haskell"
},
"language_info": {
"codemirror_mode": "ihaskell",
"file_extension": ".hs",
"mimetype": "text/x-haskell",
"name": "haskell",
"pygments_lexer": "Haskell",
"version": "8.10.4"
}
},
"nbformat": 4,
"nbformat_minor": 4
}