{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# A search engine - fast and sensible"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## A working example\n",
"\n",
"Assume we have a collection of documents $D = \\{d_1, \\ldots, d_N\\}$ ($N$ - the number of documents in the collection)."
]
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 90,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"Podobno jest kot w butach."
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"{-# LANGUAGE OverloadedStrings #-}\n",
|
|||
|
"\n",
|
|||
|
"import Data.Text hiding(map, filter, zip)\n",
|
|||
|
"import Prelude hiding(words, take)\n",
|
|||
|
"\n",
|
|||
|
"collectionD :: [Text]\n",
|
|||
|
"collectionD = [\"Ala ma kota.\", \"Podobno jest kot w butach.\", \"Ty chyba masz kota!\", \"But chyba zgubiłem.\"]\n",
|
|||
|
"\n",
|
|||
|
"-- Operator (!!) zwraca element listy o podanym indeksie\n",
|
|||
|
"-- (Przy większych listach będzie nieefektywne, ale nie będziemy komplikować)\n",
|
|||
|
"collectionD !! 1"
|
|||
|
]
|
|||
|
},
|
|||
|
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Text extraction\n",
"\n",
"Example tools:\n",
"\n",
"* pdftotext\n",
"* antiword\n",
"* Tesseract OCR\n",
"* Apache Tika - a universal tool for extracting text from various formats\n",
"\n"
]
},
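{
"cell_type": "markdown",
"metadata": {},
"source": [
"Below is a minimal sketch (not part of the original lecture) of calling one of these tools from Haskell: we shell out to `pdftotext` via `System.Process` and read the plain text from its standard output. It assumes `pdftotext` is installed and that some `sample.pdf` exists; the helper name `extractPdfText` is ours."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import System.Process (readProcess)\n",
"import qualified Data.Text as T\n",
"\n",
"-- Run the external pdftotext tool; the \"-\" argument makes it print plain text to stdout.\n",
"-- (A sketch: assumes pdftotext is on the PATH.)\n",
"extractPdfText :: FilePath -> IO T.Text\n",
"extractPdfText path = T.pack <$> readProcess \"pdftotext\" [path, \"-\"] \"\"\n",
"\n",
"-- example (requires an actual PDF file):\n",
"-- extractPdfText \"sample.pdf\""
]
},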
"## Normalizacja tekstu\n",
|
|||
|
"\n",
|
|||
|
"Cokolwiek robimy z tekstem, najpierw musimy go _znormalizować_."
|
|||
|
]
|
|||
|
},
|
|||
|
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Tokenization\n",
"\n",
"First, we have to split the text into _tokens_, that is, word-like units.\n",
"Maybe just split on spaces?"
]
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 4,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"Ala"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"ma"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"kota."
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"tokenizeStupidly :: Text -> [Text]\n",
|
|||
|
"-- words to funkcja z Data.Text, która dzieli po spacjach\n",
|
|||
|
"tokenizeStupidly = words\n",
|
|||
|
"\n",
|
|||
|
"tokenizeStupidly $ Prelude.head collectionD"
|
|||
|
]
|
|||
|
},
|
|||
|
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Ah, we should _at least_ split off the punctuation marks. The simplest way is a regular expression. It is worth using [Unicode character properties](https://en.wikipedia.org/wiki/Unicode_character_property) and the `\\p{...}` construct. "
]
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 16,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"Ala"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"ma"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"kota"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"."
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"{-# LANGUAGE QuasiQuotes #-}\n",
|
|||
|
"\n",
|
|||
|
"import Text.Regex.PCRE.Heavy\n",
|
|||
|
"\n",
|
|||
|
"tokenize :: Text -> [Text]\n",
|
|||
|
"tokenize = map fst . scan [re|[\\p{L}0-9]+|\\p{P}|]\n",
|
|||
|
"\n",
|
|||
|
"tokenize $ Prelude.head collectionD"
|
|||
|
]
|
|||
|
},
|
|||
|
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The whole collection tokenized:"
]
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 17,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"Ala"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"ma"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"kota"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"."
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"Podobno"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"jest"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"kot"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"w"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"butach"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"."
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"Ty"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"chyba"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"masz"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"kota"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"!"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"But"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"chyba"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"zgubiłem"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"."
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"map tokenize collectionD"
|
|||
|
]
|
|||
|
},
|
|||
|
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Problems with tokenization\n",
"\n",
"##### English"
]
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 19,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"I"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"use"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"a"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"data"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"-"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"base"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"tokenize \"I use a data-base\""
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 20,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"I"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"use"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"a"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"database"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"tokenize \"I use a database\""
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 21,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"I"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"use"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"a"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"data"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"base"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"tokenize \"I use a data base\""
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 22,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"I"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"don"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"'"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"t"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"like"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"Python"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"tokenize \"I don't like Python\""
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 23,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"0018"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"555"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"555"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"122"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"tokenize \"+0018 555 555 122\""
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 24,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"0018555555122"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"tokenize \"+0018555555122\""
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 25,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"Which"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"one"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"is"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"better"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
":"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"C"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"or"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"C"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"#"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"?"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"tokenize \"Which one is better: C++ or C#?\""
|
|||
|
]
|
|||
|
},
|
|||
|
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Other languages?"
]
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 28,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"Rechtsschutzversicherungsgesellschaften"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"wie"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"die"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"HUK"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"-"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"Coburg"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"machen"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"es"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"bereits"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"seit"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"geraumer"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"Zeit"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"vor"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
":"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"tokenize \"Rechtsschutzversicherungsgesellschaften wie die HUK-Coburg machen es bereits seit geraumer Zeit vor:\""
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 29,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"今日波兹南是贸易"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"、"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"工业及教育的中心"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"。"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"波兹南是波兰第五大的城市及第四大的工业中心"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
","
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"波兹南亦是大波兰省的行政首府"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"。"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"也舉辦有不少展覽會"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"。"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"是波蘭西部重要的交通中心都市"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"。"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"tokenize \"今日波兹南是贸易、工业及教育的中心。波兹南是波兰第五大的城市及第四大的工业中心,波兹南亦是大波兰省的行政首府。也舉辦有不少展覽會。是波蘭西部重要的交通中心都市。\""
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 30,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"l"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"'"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"ordinateur"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"tokenize \"l'ordinateur\""
|
|||
|
]
|
|||
|
},
|
|||
|
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Lemmatization"
]
},
|
|||
|
{
"cell_type": "markdown",
"metadata": {},
"source": [
"_Lemmatization_ means reducing a word to its base form (its _lemma_), e.g. \"krześle\" to \"krzesło\" and \"zrobimy\" to \"zrobić\" for Polish, \"chairs\" to \"chair\" and \"made\" to \"make\" for English."
]
},
|
|||
|
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Lemmatization for Polish is very hard; it is practically impossible to do it with rules alone, so we simply have to obtain a very large _dictionary of inflected forms_.\n",
"\n",
"For the purposes of this lecture, let us create a small dictionary of inflected forms in the form of an associative (hash) map."
]
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 80,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/html": [
|
|||
|
"<style>/* Styles used for the Hoogle display in the pager */\n",
|
|||
|
".hoogle-doc {\n",
|
|||
|
"display: block;\n",
|
|||
|
"padding-bottom: 1.3em;\n",
|
|||
|
"padding-left: 0.4em;\n",
|
|||
|
"}\n",
|
|||
|
".hoogle-code {\n",
|
|||
|
"display: block;\n",
|
|||
|
"font-family: monospace;\n",
|
|||
|
"white-space: pre;\n",
|
|||
|
"}\n",
|
|||
|
".hoogle-text {\n",
|
|||
|
"display: block;\n",
|
|||
|
"}\n",
|
|||
|
".hoogle-name {\n",
|
|||
|
"color: green;\n",
|
|||
|
"font-weight: bold;\n",
|
|||
|
"}\n",
|
|||
|
".hoogle-head {\n",
|
|||
|
"font-weight: bold;\n",
|
|||
|
"}\n",
|
|||
|
".hoogle-sub {\n",
|
|||
|
"display: block;\n",
|
|||
|
"margin-left: 0.4em;\n",
|
|||
|
"}\n",
|
|||
|
".hoogle-package {\n",
|
|||
|
"font-weight: bold;\n",
|
|||
|
"font-style: italic;\n",
|
|||
|
"}\n",
|
|||
|
".hoogle-module {\n",
|
|||
|
"font-weight: bold;\n",
|
|||
|
"}\n",
|
|||
|
".hoogle-class {\n",
|
|||
|
"font-weight: bold;\n",
|
|||
|
"}\n",
|
|||
|
".get-type {\n",
|
|||
|
"color: green;\n",
|
|||
|
"font-weight: bold;\n",
|
|||
|
"font-family: monospace;\n",
|
|||
|
"display: block;\n",
|
|||
|
"white-space: pre-wrap;\n",
|
|||
|
"}\n",
|
|||
|
".show-type {\n",
|
|||
|
"color: green;\n",
|
|||
|
"font-weight: bold;\n",
|
|||
|
"font-family: monospace;\n",
|
|||
|
"margin-left: 1em;\n",
|
|||
|
"}\n",
|
|||
|
".mono {\n",
|
|||
|
"font-family: monospace;\n",
|
|||
|
"display: block;\n",
|
|||
|
"}\n",
|
|||
|
".err-msg {\n",
|
|||
|
"color: red;\n",
|
|||
|
"font-style: italic;\n",
|
|||
|
"font-family: monospace;\n",
|
|||
|
"white-space: pre;\n",
|
|||
|
"display: block;\n",
|
|||
|
"}\n",
|
|||
|
"#unshowable {\n",
|
|||
|
"color: red;\n",
|
|||
|
"font-weight: bold;\n",
|
|||
|
"}\n",
|
|||
|
".err-msg.in.collapse {\n",
|
|||
|
"padding-top: 0.7em;\n",
|
|||
|
"}\n",
|
|||
|
".highlight-code {\n",
|
|||
|
"white-space: pre;\n",
|
|||
|
"font-family: monospace;\n",
|
|||
|
"}\n",
|
|||
|
".suggestion-warning { \n",
|
|||
|
"font-weight: bold;\n",
|
|||
|
"color: rgb(200, 130, 0);\n",
|
|||
|
"}\n",
|
|||
|
".suggestion-error { \n",
|
|||
|
"font-weight: bold;\n",
|
|||
|
"color: red;\n",
|
|||
|
"}\n",
|
|||
|
".suggestion-name {\n",
|
|||
|
"font-weight: bold;\n",
|
|||
|
"}\n",
|
|||
|
"</style><div class=\"suggestion-name\" style=\"clear:both;\">Use head</div><div class=\"suggestion-row\" style=\"float: left;\"><div class=\"suggestion-warning\">Found:</div><div class=\"highlight-code\" id=\"haskell\">collectionD !! 0</div></div><div class=\"suggestion-row\" style=\"float: left;\"><div class=\"suggestion-warning\">Why Not:</div><div class=\"highlight-code\" id=\"haskell\">head collectionD</div></div>"
|
|||
|
],
|
|||
|
"text/plain": [
|
|||
|
"Line 22: Use head\n",
|
|||
|
"Found:\n",
|
|||
|
"collectionD !! 0\n",
|
|||
|
"Why not:\n",
|
|||
|
"head collectionD"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"but"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"butami"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"Ala"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"mieć"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"kot"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"."
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"import Data.Map as Map hiding(take, map, filter)\n",
|
|||
|
"\n",
|
|||
|
"mockInflectionDictionary :: Map Text Text\n",
|
|||
|
"mockInflectionDictionary = Map.fromList [\n",
|
|||
|
" (\"kota\", \"kot\"),\n",
|
|||
|
" (\"butach\", \"but\"),\n",
|
|||
|
" (\"masz\", \"mieć\"),\n",
|
|||
|
" (\"ma\", \"mieć\"),\n",
|
|||
|
" (\"buta\", \"but\"),\n",
|
|||
|
" (\"zgubiłem\", \"zgubić\")]\n",
|
|||
|
"\n",
|
|||
|
"lemmatizeWord :: Map Text Text -> Text -> Text\n",
|
|||
|
"lemmatizeWord dict w = findWithDefault w w dict\n",
|
|||
|
"\n",
|
|||
|
"lemmatizeWord mockInflectionDictionary \"butach\"\n",
|
|||
|
"-- a tego nie ma w naszym słowniczku, więc zwracamy to samo\n",
|
|||
|
"lemmatizeWord mockInflectionDictionary \"butami\"\n",
|
|||
|
"\n",
|
|||
|
"lemmatize :: Map Text Text -> [Text] -> [Text]\n",
|
|||
|
"lemmatize dict = map (lemmatizeWord dict)\n",
|
|||
|
"\n",
|
|||
|
"lemmatize mockInflectionDictionary $ tokenize $ collectionD !! 0 \n",
|
|||
|
"\n",
|
|||
|
"\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Question**: Even with our little dictionary we run into problems with the ambiguity of lemmatization. Which ones?\n",
"\n",
"A large dictionary of inflected forms for Polish: http://zil.ipipan.waw.pl/PoliMorf?action=AttachFile&do=view&target=PoliMorf-0.6.7.tab.gz"
]
},
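{
"cell_type": "markdown",
"metadata": {},
"source": [
"Below is a minimal sketch (not part of the original lecture) of loading such a dictionary file into a `Map Text Text` like the one above. It assumes the file has been unpacked and that each line looks like `form<TAB>lemma<TAB>...`; check the actual PoliMorf format before relying on this. The helper name `loadInflectionDictionary` is ours."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import qualified Data.Text as T\n",
"import qualified Data.Text.IO as TIO\n",
"import qualified Data.Map as M\n",
"\n",
"-- Read a tab-separated list of (inflected form, lemma, ...) lines into a form -> lemma map.\n",
"-- (A sketch under the column-order assumption stated above.)\n",
"loadInflectionDictionary :: FilePath -> IO (M.Map T.Text T.Text)\n",
"loadInflectionDictionary path = do\n",
"  contents <- TIO.readFile path\n",
"  let entries = [ (form, lemma)\n",
"                | line <- T.lines contents\n",
"                , (form:lemma:_) <- [T.splitOn \"\\t\" line] ]\n",
"  return $ M.fromList entries\n",
"\n",
"-- usage (requires the unpacked file):\n",
"-- dict <- loadInflectionDictionary \"PoliMorf-0.6.7.tab\"\n",
"-- lemmatizeWord dict \"krześle\""
]
},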
|
|||
|
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Stemming\n",
"\n",
"Stemming truncates a word to a _stem_, which does not have to be a meaningful word itself; e.g. \"krześle\" may be stemmed to \"krześl\", \"krześ\" or \"krzes\", and \"zrobimy\" to \"zrobi\".\n",
"\n",
"* stemming is not as well defined as lemmatization (it can be done in many ways)\n",
"* it is more amenable to rule-based methods (though still hard for Polish)\n",
"* for English there are well-known stemming algorithms, e.g. the [Porter algorithm](https://tartarus.org/martin/PorterStemmer/def.txt); a toy suffix-stripping sketch follows below\n",
"* see also [the Snowball project](https://snowballstem.org/) with rules for many languages\n",
"\n",
"A simple \"poor man's\" stemmer for Polish is truncation to six characters."
]
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 41,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"zrobim"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"komput"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"butach"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"poorMansStemming :: Text -> Text\n",
|
|||
|
"poorMansStemming = take 6\n",
|
|||
|
"\n",
|
|||
|
"poorMansStemming \"zrobimy\"\n",
|
|||
|
"poorMansStemming \"komputerami\"\n",
|
|||
|
"poorMansStemming \"butach\""
|
|||
|
]
|
|||
|
},
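{
"cell_type": "markdown",
"metadata": {},
"source": [
"For English, rule-based stemmers strip suffixes. Below is a toy sketch of that idea (not the actual Porter algorithm and not part of the original lecture); the suffix list is arbitrary and the name `toyEnglishStem` is ours."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import qualified Data.Text as T\n",
"\n",
"-- A toy suffix-stripping stemmer, only to illustrate the rule-based idea;\n",
"-- real systems use the full Porter/Snowball rule sets.\n",
"toyEnglishStem :: T.Text -> T.Text\n",
"toyEnglishStem w = go [\"ing\", \"ed\", \"es\", \"s\"]\n",
"  where\n",
"    go [] = w\n",
"    go (suf : rest)\n",
"      | suf `T.isSuffixOf` w && T.length w > T.length suf + 2 = T.dropEnd (T.length suf) w\n",
"      | otherwise = go rest\n",
"\n",
"toyEnglishStem \"chairs\"\n",
"toyEnglishStem \"making\""
]
},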
|
|||
|
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### _Stop words_\n",
"\n",
"Search engines often skip short, frequent words that carry little meaning - _stop words_."
]
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 42,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"False"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"True"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"isStopWord :: Text -> Bool\n",
|
|||
|
"isStopWord \"w\" = True\n",
|
|||
|
"isStopWord \"jest\" = True\n",
|
|||
|
"isStopWord \"że\" = True\n",
|
|||
|
"-- przy okazji możemy pozbyć się znaków interpunkcyjnych\n",
|
|||
|
"isStopWord w = w ≈ [re|^\\p{P}+$|]\n",
|
|||
|
"\n",
|
|||
|
"isStopWord \"kot\"\n",
|
|||
|
"isStopWord \"!\"\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 55,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"Ala"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"ma"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"kota"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"removeStopWords :: [Text] -> [Text]\n",
|
|||
|
"removeStopWords = filter (not . isStopWord)\n",
|
|||
|
"\n",
|
|||
|
"removeStopWords $ tokenize $ Prelude.head collectionD "
|
|||
|
]
|
|||
|
},
|
|||
|
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Question**: Which queries can be hurt by removing _stop words_? Give examples for Polish and English. "
]
},
|
|||
|
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Normalization - odds and ends\n",
"\n",
"Normalization may also include:\n",
"\n",
"* correcting typos\n",
"* mapping to lowercase (lower-casing, or rather case-folding)\n",
"* removing diacritics (a small sketch follows below)\n",
"\n"
]
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 56,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"żdźbło"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"toLower \"ŻDŹBŁO\""
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 58,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"źdźbło"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"toCaseFold \"ŹDŹBŁO\""
|
|||
|
]
|
|||
|
},
|
|||
|
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Question:** When does _case-folding_ give a different result than _lower-casing_? What practical significance does this have?"
]
},
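{
"cell_type": "markdown",
"metadata": {},
"source": [
"Below is a minimal sketch (not part of the original lecture) of the third point above, removing diacritics, using a hand-written table for Polish characters; a production system would rather apply Unicode normalization (NFD) and drop the combining marks. The names `polishDiacritics` and `removeDiacritics` are ours."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import qualified Data.Text as T\n",
"import qualified Data.Map as M\n",
"\n",
"-- Hand-written mapping for Polish diacritics (a sketch, not exhaustive for other languages).\n",
"polishDiacritics :: M.Map Char Char\n",
"polishDiacritics = M.fromList $ zip \"ąćęłńóśźżĄĆĘŁŃÓŚŹŻ\" \"acelnoszzACELNOSZZ\"\n",
"\n",
"removeDiacritics :: T.Text -> T.Text\n",
"removeDiacritics = T.map (\\c -> M.findWithDefault c c polishDiacritics)\n",
"\n",
"removeDiacritics \"żdźbło\""
]
},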
|
|||
|
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Normalization as a whole process\n",
"\n",
"The most important rule: the documents in our collection should be normalized in exactly the same way as the queries.\n",
"\n",
"The result of normalization is turning a document into a sequence of _terms_, i.e. normalized words.\n",
"\n",
"In other words, after normalization we treat document $d_i$ as a sequence of terms $t_i^1,\\dots,t_i^{|d_i|}$."
]
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 82,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"but"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"chyba"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"zgubić"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"normalize :: Text -> [Text]\n",
|
|||
|
"normalize = removeStopWords . map toLower . lemmatize mockInflectionDictionary . tokenize\n",
|
|||
|
"\n",
|
|||
|
"normalize $ collectionD !! 3"
|
|||
|
]
|
|||
|
},
|
|||
|
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The set of all terms in the document collection is called the vocabulary (not to be confused with a dictionary as a data structure, as in Python's _dictionary_).\n",
"\n",
"$$V = \\bigcup_{i=1}^N \\{t_i^1,\\dots,t_i^{|d_i|}\\}$$\n",
"\n",
"(It is a set, so we count without repetitions!)"
]
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 84,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"fromList [\"ala\",\"but\",\"chyba\",\"kot\",\"mie\\263\",\"podobno\",\"ty\",\"zgubi\\263\"]"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"import Data.Set as Set hiding(map)\n",
|
|||
|
"\n",
|
|||
|
"getVocabulary :: [Text] -> Set Text \n",
|
|||
|
"getVocabulary = Set.unions . map (Set.fromList . normalize) \n",
|
|||
|
"\n",
|
|||
|
"getVocabulary collectionD"
|
|||
|
]
|
|||
|
},
|
|||
|
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## How can a search engine be fast?\n",
"\n",
"An _inverted index_ lets a search engine search through millions of documents quickly. An inverted index is simply... an index as we know it from books (a mapping from words to page/document numbers).\n",
"\n",
"\n",
"\n"
]
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 88,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/html": [
|
|||
|
"<style>/* Styles used for the Hoogle display in the pager */\n",
|
|||
|
".hoogle-doc {\n",
|
|||
|
"display: block;\n",
|
|||
|
"padding-bottom: 1.3em;\n",
|
|||
|
"padding-left: 0.4em;\n",
|
|||
|
"}\n",
|
|||
|
".hoogle-code {\n",
|
|||
|
"display: block;\n",
|
|||
|
"font-family: monospace;\n",
|
|||
|
"white-space: pre;\n",
|
|||
|
"}\n",
|
|||
|
".hoogle-text {\n",
|
|||
|
"display: block;\n",
|
|||
|
"}\n",
|
|||
|
".hoogle-name {\n",
|
|||
|
"color: green;\n",
|
|||
|
"font-weight: bold;\n",
|
|||
|
"}\n",
|
|||
|
".hoogle-head {\n",
|
|||
|
"font-weight: bold;\n",
|
|||
|
"}\n",
|
|||
|
".hoogle-sub {\n",
|
|||
|
"display: block;\n",
|
|||
|
"margin-left: 0.4em;\n",
|
|||
|
"}\n",
|
|||
|
".hoogle-package {\n",
|
|||
|
"font-weight: bold;\n",
|
|||
|
"font-style: italic;\n",
|
|||
|
"}\n",
|
|||
|
".hoogle-module {\n",
|
|||
|
"font-weight: bold;\n",
|
|||
|
"}\n",
|
|||
|
".hoogle-class {\n",
|
|||
|
"font-weight: bold;\n",
|
|||
|
"}\n",
|
|||
|
".get-type {\n",
|
|||
|
"color: green;\n",
|
|||
|
"font-weight: bold;\n",
|
|||
|
"font-family: monospace;\n",
|
|||
|
"display: block;\n",
|
|||
|
"white-space: pre-wrap;\n",
|
|||
|
"}\n",
|
|||
|
".show-type {\n",
|
|||
|
"color: green;\n",
|
|||
|
"font-weight: bold;\n",
|
|||
|
"font-family: monospace;\n",
|
|||
|
"margin-left: 1em;\n",
|
|||
|
"}\n",
|
|||
|
".mono {\n",
|
|||
|
"font-family: monospace;\n",
|
|||
|
"display: block;\n",
|
|||
|
"}\n",
|
|||
|
".err-msg {\n",
|
|||
|
"color: red;\n",
|
|||
|
"font-style: italic;\n",
|
|||
|
"font-family: monospace;\n",
|
|||
|
"white-space: pre;\n",
|
|||
|
"display: block;\n",
|
|||
|
"}\n",
|
|||
|
"#unshowable {\n",
|
|||
|
"color: red;\n",
|
|||
|
"font-weight: bold;\n",
|
|||
|
"}\n",
|
|||
|
".err-msg.in.collapse {\n",
|
|||
|
"padding-top: 0.7em;\n",
|
|||
|
"}\n",
|
|||
|
".highlight-code {\n",
|
|||
|
"white-space: pre;\n",
|
|||
|
"font-family: monospace;\n",
|
|||
|
"}\n",
|
|||
|
".suggestion-warning { \n",
|
|||
|
"font-weight: bold;\n",
|
|||
|
"color: rgb(200, 130, 0);\n",
|
|||
|
"}\n",
|
|||
|
".suggestion-error { \n",
|
|||
|
"font-weight: bold;\n",
|
|||
|
"color: red;\n",
|
|||
|
"}\n",
|
|||
|
".suggestion-name {\n",
|
|||
|
"font-weight: bold;\n",
|
|||
|
"}\n",
|
|||
|
"</style><div class=\"suggestion-name\" style=\"clear:both;\">Use tuple-section</div><div class=\"suggestion-row\" style=\"float: left;\"><div class=\"suggestion-warning\">Found:</div><div class=\"highlight-code\" id=\"haskell\">\\ t -> (t, ix)</div></div><div class=\"suggestion-row\" style=\"float: left;\"><div class=\"suggestion-warning\">Why Not:</div><div class=\"highlight-code\" id=\"haskell\">(, ix)</div></div>"
|
|||
|
],
|
|||
|
"text/plain": [
|
|||
|
"Line 4: Use tuple-section\n",
|
|||
|
"Found:\n",
|
|||
|
"\\ t -> (t, ix)\n",
|
|||
|
"Why not:\n",
|
|||
|
"(, ix)"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"fromList [(\"chyba\",2),(\"kot\",2),(\"mie\\263\",2),(\"ty\",2)]"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"collectionDNormalized = map normalize collectionD\n",
|
|||
|
"\n",
|
|||
|
"documentToPostings :: ([Text], Int) -> Set (Text, Int)\n",
|
|||
|
"documentToPostings (d, ix) = Set.fromList $ map (\\t -> (t, ix)) d\n",
|
|||
|
"\n",
|
|||
|
"documentToPostings (collectionDNormalized !! 2, 2) \n",
|
|||
|
"\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 91,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/html": [
|
|||
|
"<style>/* Styles used for the Hoogle display in the pager */\n",
|
|||
|
".hoogle-doc {\n",
|
|||
|
"display: block;\n",
|
|||
|
"padding-bottom: 1.3em;\n",
|
|||
|
"padding-left: 0.4em;\n",
|
|||
|
"}\n",
|
|||
|
".hoogle-code {\n",
|
|||
|
"display: block;\n",
|
|||
|
"font-family: monospace;\n",
|
|||
|
"white-space: pre;\n",
|
|||
|
"}\n",
|
|||
|
".hoogle-text {\n",
|
|||
|
"display: block;\n",
|
|||
|
"}\n",
|
|||
|
".hoogle-name {\n",
|
|||
|
"color: green;\n",
|
|||
|
"font-weight: bold;\n",
|
|||
|
"}\n",
|
|||
|
".hoogle-head {\n",
|
|||
|
"font-weight: bold;\n",
|
|||
|
"}\n",
|
|||
|
".hoogle-sub {\n",
|
|||
|
"display: block;\n",
|
|||
|
"margin-left: 0.4em;\n",
|
|||
|
"}\n",
|
|||
|
".hoogle-package {\n",
|
|||
|
"font-weight: bold;\n",
|
|||
|
"font-style: italic;\n",
|
|||
|
"}\n",
|
|||
|
".hoogle-module {\n",
|
|||
|
"font-weight: bold;\n",
|
|||
|
"}\n",
|
|||
|
".hoogle-class {\n",
|
|||
|
"font-weight: bold;\n",
|
|||
|
"}\n",
|
|||
|
".get-type {\n",
|
|||
|
"color: green;\n",
|
|||
|
"font-weight: bold;\n",
|
|||
|
"font-family: monospace;\n",
|
|||
|
"display: block;\n",
|
|||
|
"white-space: pre-wrap;\n",
|
|||
|
"}\n",
|
|||
|
".show-type {\n",
|
|||
|
"color: green;\n",
|
|||
|
"font-weight: bold;\n",
|
|||
|
"font-family: monospace;\n",
|
|||
|
"margin-left: 1em;\n",
|
|||
|
"}\n",
|
|||
|
".mono {\n",
|
|||
|
"font-family: monospace;\n",
|
|||
|
"display: block;\n",
|
|||
|
"}\n",
|
|||
|
".err-msg {\n",
|
|||
|
"color: red;\n",
|
|||
|
"font-style: italic;\n",
|
|||
|
"font-family: monospace;\n",
|
|||
|
"white-space: pre;\n",
|
|||
|
"display: block;\n",
|
|||
|
"}\n",
|
|||
|
"#unshowable {\n",
|
|||
|
"color: red;\n",
|
|||
|
"font-weight: bold;\n",
|
|||
|
"}\n",
|
|||
|
".err-msg.in.collapse {\n",
|
|||
|
"padding-top: 0.7em;\n",
|
|||
|
"}\n",
|
|||
|
".highlight-code {\n",
|
|||
|
"white-space: pre;\n",
|
|||
|
"font-family: monospace;\n",
|
|||
|
"}\n",
|
|||
|
".suggestion-warning { \n",
|
|||
|
"font-weight: bold;\n",
|
|||
|
"color: rgb(200, 130, 0);\n",
|
|||
|
"}\n",
|
|||
|
".suggestion-error { \n",
|
|||
|
"font-weight: bold;\n",
|
|||
|
"color: red;\n",
|
|||
|
"}\n",
|
|||
|
".suggestion-name {\n",
|
|||
|
"font-weight: bold;\n",
|
|||
|
"}\n",
|
|||
|
"</style><div class=\"suggestion-name\" style=\"clear:both;\">Use zipWith</div><div class=\"suggestion-row\" style=\"float: left;\"><div class=\"suggestion-warning\">Found:</div><div class=\"highlight-code\" id=\"haskell\">map documentToPostings $ zip coll [0 .. ]</div></div><div class=\"suggestion-row\" style=\"float: left;\"><div class=\"suggestion-warning\">Why Not:</div><div class=\"highlight-code\" id=\"haskell\">zipWith (curry documentToPostings) coll [0 .. ]</div></div>"
|
|||
|
],
|
|||
|
"text/plain": [
|
|||
|
"Line 2: Use zipWith\n",
|
|||
|
"Found:\n",
|
|||
|
"map documentToPostings $ zip coll [0 .. ]\n",
|
|||
|
"Why not:\n",
|
|||
|
"zipWith (curry documentToPostings) coll [0 .. ]"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"fromList [(\"ala\",0),(\"but\",1),(\"but\",3),(\"chyba\",2),(\"chyba\",3),(\"kot\",0),(\"kot\",1),(\"kot\",2),(\"mie\\263\",0),(\"mie\\263\",2),(\"podobno\",1),(\"ty\",2),(\"zgubi\\263\",3)]"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"collectionToPostings :: [[Text]] -> Set (Text, Int)\n",
|
|||
|
"collectionToPostings coll = Set.unions $ map documentToPostings $ zip coll [0..]\n",
|
|||
|
"\n",
|
|||
|
"collectionToPostings collectionDNormalized"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 102,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/html": [
|
|||
|
"<style>/* Styles used for the Hoogle display in the pager */\n",
|
|||
|
".hoogle-doc {\n",
|
|||
|
"display: block;\n",
|
|||
|
"padding-bottom: 1.3em;\n",
|
|||
|
"padding-left: 0.4em;\n",
|
|||
|
"}\n",
|
|||
|
".hoogle-code {\n",
|
|||
|
"display: block;\n",
|
|||
|
"font-family: monospace;\n",
|
|||
|
"white-space: pre;\n",
|
|||
|
"}\n",
|
|||
|
".hoogle-text {\n",
|
|||
|
"display: block;\n",
|
|||
|
"}\n",
|
|||
|
".hoogle-name {\n",
|
|||
|
"color: green;\n",
|
|||
|
"font-weight: bold;\n",
|
|||
|
"}\n",
|
|||
|
".hoogle-head {\n",
|
|||
|
"font-weight: bold;\n",
|
|||
|
"}\n",
|
|||
|
".hoogle-sub {\n",
|
|||
|
"display: block;\n",
|
|||
|
"margin-left: 0.4em;\n",
|
|||
|
"}\n",
|
|||
|
".hoogle-package {\n",
|
|||
|
"font-weight: bold;\n",
|
|||
|
"font-style: italic;\n",
|
|||
|
"}\n",
|
|||
|
".hoogle-module {\n",
|
|||
|
"font-weight: bold;\n",
|
|||
|
"}\n",
|
|||
|
".hoogle-class {\n",
|
|||
|
"font-weight: bold;\n",
|
|||
|
"}\n",
|
|||
|
".get-type {\n",
|
|||
|
"color: green;\n",
|
|||
|
"font-weight: bold;\n",
|
|||
|
"font-family: monospace;\n",
|
|||
|
"display: block;\n",
|
|||
|
"white-space: pre-wrap;\n",
|
|||
|
"}\n",
|
|||
|
".show-type {\n",
|
|||
|
"color: green;\n",
|
|||
|
"font-weight: bold;\n",
|
|||
|
"font-family: monospace;\n",
|
|||
|
"margin-left: 1em;\n",
|
|||
|
"}\n",
|
|||
|
".mono {\n",
|
|||
|
"font-family: monospace;\n",
|
|||
|
"display: block;\n",
|
|||
|
"}\n",
|
|||
|
".err-msg {\n",
|
|||
|
"color: red;\n",
|
|||
|
"font-style: italic;\n",
|
|||
|
"font-family: monospace;\n",
|
|||
|
"white-space: pre;\n",
|
|||
|
"display: block;\n",
|
|||
|
"}\n",
|
|||
|
"#unshowable {\n",
|
|||
|
"color: red;\n",
|
|||
|
"font-weight: bold;\n",
|
|||
|
"}\n",
|
|||
|
".err-msg.in.collapse {\n",
|
|||
|
"padding-top: 0.7em;\n",
|
|||
|
"}\n",
|
|||
|
".highlight-code {\n",
|
|||
|
"white-space: pre;\n",
|
|||
|
"font-family: monospace;\n",
|
|||
|
"}\n",
|
|||
|
".suggestion-warning { \n",
|
|||
|
"font-weight: bold;\n",
|
|||
|
"color: rgb(200, 130, 0);\n",
|
|||
|
"}\n",
|
|||
|
".suggestion-error { \n",
|
|||
|
"font-weight: bold;\n",
|
|||
|
"color: red;\n",
|
|||
|
"}\n",
|
|||
|
".suggestion-name {\n",
|
|||
|
"font-weight: bold;\n",
|
|||
|
"}\n",
|
|||
|
"</style><div class=\"suggestion-name\" style=\"clear:both;\">Eta reduce</div><div class=\"suggestion-row\" style=\"float: left;\"><div class=\"suggestion-warning\">Found:</div><div class=\"highlight-code\" id=\"haskell\">updateInvertedIndex (t, ix) invIndex\n",
|
|||
|
" = insertWith (++) t [ix] invIndex</div></div><div class=\"suggestion-row\" style=\"float: left;\"><div class=\"suggestion-warning\">Why Not:</div><div class=\"highlight-code\" id=\"haskell\">updateInvertedIndex (t, ix) = insertWith (++) t [ix]</div></div>"
|
|||
|
],
|
|||
|
"text/plain": [
|
|||
|
"Line 2: Eta reduce\n",
|
|||
|
"Found:\n",
|
|||
|
"updateInvertedIndex (t, ix) invIndex\n",
|
|||
|
" = insertWith (++) t [ix] invIndex\n",
|
|||
|
"Why not:\n",
|
|||
|
"updateInvertedIndex (t, ix) = insertWith (++) t [ix]"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"fromList [(\"ala\",[0]),(\"but\",[1,3]),(\"chyba\",[2,3]),(\"kot\",[0,1,2]),(\"mie\\263\",[0,2]),(\"podobno\",[1]),(\"ty\",[2]),(\"zgubi\\263\",[3])]"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"updateInvertedIndex :: (Text, Int) -> Map Text [Int] -> Map Text [Int]\n",
|
|||
|
"updateInvertedIndex (t, ix) invIndex = insertWith (++) t [ix] invIndex\n",
|
|||
|
"\n",
|
|||
|
"getInvertedIndex :: [[Text]] -> Map Text [Int]\n",
|
|||
|
"getInvertedIndex = Prelude.foldr updateInvertedIndex Map.empty . Set.toList . collectionToPostings\n",
|
|||
|
"\n",
|
|||
|
"getInvertedIndex collectionDNormalized"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": []
|
|||
|
},
|
|||
|
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Relevance\n",
"\n",
"We can now search normalized documents quickly, but which documents are important (_relevant_) with respect to the user's information need?\n",
"\n",
"### Boolean queries\n",
"\n",
"* is `pizzeria Poznań dowóz` meant as `pizzeria AND Poznań AND dowóz` or as `pizzeria OR Poznań OR dowóz`?\n",
"* `(pizzeria OR pizza OR tratoria) AND Poznań AND dowóz`\n",
"* `pizzeria AND Poznań AND dowóz AND NOT golonka`\n",
"\n",
"How should a query be interpreted by default?\n",
"\n",
"* as an AND query - possibly too few documents\n",
"* some middle ground?\n",
"* as an OR query - possibly too many documents\n"
]
},
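{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch (not part of the original lecture), boolean AND/OR queries can be answered directly from the inverted index built above, by intersecting or concatenating the postings lists. The helper names `invIndexD`, `postingsOf`, `andQuery` and `orQuery` are ours."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import qualified Data.List as L\n",
"\n",
"invIndexD :: Map Text [Int]\n",
"invIndexD = getInvertedIndex collectionDNormalized\n",
"\n",
"postingsOf :: Text -> [Int]\n",
"postingsOf t = Map.findWithDefault [] t invIndexD\n",
"\n",
"-- documents that contain ALL of the query terms\n",
"andQuery :: [Text] -> [Int]\n",
"andQuery [] = []\n",
"andQuery ts = L.foldl1' L.intersect (map postingsOf ts)\n",
"\n",
"-- documents that contain ANY of the query terms\n",
"orQuery :: [Text] -> [Int]\n",
"orQuery ts = L.nub (L.concatMap postingsOf ts)\n",
"\n",
"andQuery [\"kot\", \"mieć\"]\n",
"orQuery [\"but\", \"ala\"]"
]
},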
"Możemy jakieś miary dopasowania dokumentu do zapytania, żeby móc posortować dokumenty...\n",
|
|||
|
"\n",
|
|||
|
"### Mierzenie dopasowania dokumentu do zapytania\n",
|
|||
|
"\n",
|
|||
|
"Potrzebujemy jakieś funkcji $\\sigma : Q x D \\rightarrow \\mathbb{R}$. \n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We have to somehow turn documents into numbers, that is, documents into vectors of numbers, and the whole collection into a matrix.\n",
"\n",
"First, let us number all the terms in the vocabulary."
]
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 115,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"fromList [(0,\"ala\"),(1,\"but\"),(2,\"chyba\"),(3,\"kot\"),(4,\"mie\\263\"),(5,\"podobno\"),(6,\"ty\"),(7,\"zgubi\\263\")]"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"fromList [(\"ala\",0),(\"but\",1),(\"chyba\",2),(\"kot\",3),(\"mie\\263\",4),(\"podobno\",5),(\"ty\",6),(\"zgubi\\263\",7)]"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"ala"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"2"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"voc = getVocabulary collectionD\n",
|
|||
|
"\n",
|
|||
|
"vocD :: Map Int Text\n",
|
|||
|
"vocD = Map.fromList $ zip [0..] $ Set.toList voc\n",
|
|||
|
"\n",
|
|||
|
"invvocD :: Map Text Int\n",
|
|||
|
"invvocD = Map.fromList $ zip (Set.toList voc) [0..]\n",
|
|||
|
"\n",
|
|||
|
"vocD\n",
|
|||
|
"\n",
|
|||
|
"invvocD\n",
|
|||
|
"\n",
|
|||
|
"vocD ! 0\n",
|
|||
|
"invvocD ! \"chyba\"\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let us write a function that _vectorizes_ a normalized document.\n",
"\n"
]
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 125,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/html": [
|
|||
|
"<style>/* Styles used for the Hoogle display in the pager */\n",
|
|||
|
".hoogle-doc {\n",
|
|||
|
"display: block;\n",
|
|||
|
"padding-bottom: 1.3em;\n",
|
|||
|
"padding-left: 0.4em;\n",
|
|||
|
"}\n",
|
|||
|
".hoogle-code {\n",
|
|||
|
"display: block;\n",
|
|||
|
"font-family: monospace;\n",
|
|||
|
"white-space: pre;\n",
|
|||
|
"}\n",
|
|||
|
".hoogle-text {\n",
|
|||
|
"display: block;\n",
|
|||
|
"}\n",
|
|||
|
".hoogle-name {\n",
|
|||
|
"color: green;\n",
|
|||
|
"font-weight: bold;\n",
|
|||
|
"}\n",
|
|||
|
".hoogle-head {\n",
|
|||
|
"font-weight: bold;\n",
|
|||
|
"}\n",
|
|||
|
".hoogle-sub {\n",
|
|||
|
"display: block;\n",
|
|||
|
"margin-left: 0.4em;\n",
|
|||
|
"}\n",
|
|||
|
".hoogle-package {\n",
|
|||
|
"font-weight: bold;\n",
|
|||
|
"font-style: italic;\n",
|
|||
|
"}\n",
|
|||
|
".hoogle-module {\n",
|
|||
|
"font-weight: bold;\n",
|
|||
|
"}\n",
|
|||
|
".hoogle-class {\n",
|
|||
|
"font-weight: bold;\n",
|
|||
|
"}\n",
|
|||
|
".get-type {\n",
|
|||
|
"color: green;\n",
|
|||
|
"font-weight: bold;\n",
|
|||
|
"font-family: monospace;\n",
|
|||
|
"display: block;\n",
|
|||
|
"white-space: pre-wrap;\n",
|
|||
|
"}\n",
|
|||
|
".show-type {\n",
|
|||
|
"color: green;\n",
|
|||
|
"font-weight: bold;\n",
|
|||
|
"font-family: monospace;\n",
|
|||
|
"margin-left: 1em;\n",
|
|||
|
"}\n",
|
|||
|
".mono {\n",
|
|||
|
"font-family: monospace;\n",
|
|||
|
"display: block;\n",
|
|||
|
"}\n",
|
|||
|
".err-msg {\n",
|
|||
|
"color: red;\n",
|
|||
|
"font-style: italic;\n",
|
|||
|
"font-family: monospace;\n",
|
|||
|
"white-space: pre;\n",
|
|||
|
"display: block;\n",
|
|||
|
"}\n",
|
|||
|
"#unshowable {\n",
|
|||
|
"color: red;\n",
|
|||
|
"font-weight: bold;\n",
|
|||
|
"}\n",
|
|||
|
".err-msg.in.collapse {\n",
|
|||
|
"padding-top: 0.7em;\n",
|
|||
|
"}\n",
|
|||
|
".highlight-code {\n",
|
|||
|
"white-space: pre;\n",
|
|||
|
"font-family: monospace;\n",
|
|||
|
"}\n",
|
|||
|
".suggestion-warning { \n",
|
|||
|
"font-weight: bold;\n",
|
|||
|
"color: rgb(200, 130, 0);\n",
|
|||
|
"}\n",
|
|||
|
".suggestion-error { \n",
|
|||
|
"font-weight: bold;\n",
|
|||
|
"color: red;\n",
|
|||
|
"}\n",
|
|||
|
".suggestion-name {\n",
|
|||
|
"font-weight: bold;\n",
|
|||
|
"}\n",
|
|||
|
"</style><div class=\"suggestion-name\" style=\"clear:both;\">Redundant $</div><div class=\"suggestion-row\" style=\"float: left;\"><div class=\"suggestion-warning\">Found:</div><div class=\"highlight-code\" id=\"haskell\">map (\\ i -> count (v ! i) doc) $ [0 .. (vecSize - 1)]</div></div><div class=\"suggestion-row\" style=\"float: left;\"><div class=\"suggestion-warning\">Why Not:</div><div class=\"highlight-code\" id=\"haskell\">map (\\ i -> count (v ! i) doc) [0 .. (vecSize - 1)]</div></div><div class=\"suggestion-name\" style=\"clear:both;\">Redundant bracket</div><div class=\"suggestion-row\" style=\"float: left;\"><div class=\"suggestion-warning\">Found:</div><div class=\"highlight-code\" id=\"haskell\">(collectionDNormalized !! 2)</div></div><div class=\"suggestion-row\" style=\"float: left;\"><div class=\"suggestion-warning\">Why Not:</div><div class=\"highlight-code\" id=\"haskell\">collectionDNormalized !! 2</div></div>"
|
|||
|
],
|
|||
|
"text/plain": [
|
|||
|
"Line 2: Redundant $\n",
|
|||
|
"Found:\n",
|
|||
|
"map (\\ i -> count (v ! i) doc) $ [0 .. (vecSize - 1)]\n",
|
|||
|
"Why not:\n",
|
|||
|
"map (\\ i -> count (v ! i) doc) [0 .. (vecSize - 1)]Line 9: Redundant bracket\n",
|
|||
|
"Found:\n",
|
|||
|
"(collectionDNormalized !! 2)\n",
|
|||
|
"Why not:\n",
|
|||
|
"collectionDNormalized !! 2"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"ty"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"chyba"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"mieć"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"kot"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"[0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0]"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"vectorize :: Int -> Map Int Text -> [Text] -> [Double]\n",
|
|||
|
"vectorize vecSize v doc = map (\\i -> count (v ! i) doc) $ [0..(vecSize-1)]\n",
|
|||
|
" where count t doc \n",
|
|||
|
" | t `elem` doc = 1.0\n",
|
|||
|
" | otherwise = 0.0\n",
|
|||
|
" \n",
|
|||
|
"vocSize = Set.size voc\n",
|
|||
|
"\n",
|
|||
|
"(collectionDNormalized !! 2)\n",
|
|||
|
"vectorize vocSize vocD (collectionDNormalized !! 2)\n",
|
|||
|
"\n",
|
|||
|
"\n",
|
|||
|
"\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
" ![image](./macierz.png)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### How else can we account for word frequency?\n",
"\n",
"<div style=\"display:none\">\n",
" $\n",
" \\newcommand{\\idf}{\\mathop{\\rm idf}\\nolimits}\n",
" \\newcommand{\\tf}{\\mathop{\\rm tf}\\nolimits}\n",
" \\newcommand{\\df}{\\mathop{\\rm df}\\nolimits}\n",
" \\newcommand{\\tfidf}{\\mathop{\\rm tfidf}\\nolimits}\n",
" $\n",
"</div>\n",
"\n",
"* $\\tf_{t,d}$\n",
"\n",
"* $1+\\log(\\tf_{t,d})$\n",
"\n",
"* $0.5 + \\frac{0.5 \\times \\tf_{t,d}}{\\max_t(\\tf_{t,d})}$\n",
"\n",
"(A small sketch of these weighting variants follows below.)"
]
},
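{
"cell_type": "markdown",
"metadata": {},
"source": [
"Below is a minimal sketch (not part of the original lecture) of the three weighting variants listed above, computed for a single normalized document; `countOccurrences`, `rawTf`, `logTf` and `augmentedTf` are our own helper names."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"-- term frequency of t in a normalized document\n",
"countOccurrences :: Text -> [Text] -> Double\n",
"countOccurrences t = fromIntegral . Prelude.length . Prelude.filter (== t)\n",
"\n",
"rawTf :: Text -> [Text] -> Double\n",
"rawTf = countOccurrences\n",
"\n",
"-- 1 + log tf (and 0 when the term does not occur at all)\n",
"logTf :: Text -> [Text] -> Double\n",
"logTf t d = if tf > 0 then 1 + log tf else 0\n",
"  where tf = countOccurrences t d\n",
"\n",
"-- 0.5 + 0.5 * tf / (maximum tf in the document); assumes a non-empty document\n",
"augmentedTf :: Text -> [Text] -> Double\n",
"augmentedTf t d = 0.5 + 0.5 * countOccurrences t d / Prelude.maximum [countOccurrences t' d | t' <- d]\n",
"\n",
"rawTf \"kot\" (collectionDNormalized !! 2)\n",
"logTf \"kot\" (collectionDNormalized !! 2)\n",
"augmentedTf \"kot\" (collectionDNormalized !! 2)"
]
},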
|
|||
|
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div style=\"display:none\">\n",
" $\n",
" \\newcommand{\\idf}{\\mathop{\\rm idf}\\nolimits}\n",
" \\newcommand{\\tf}{\\mathop{\\rm tf}\\nolimits}\n",
" \\newcommand{\\df}{\\mathop{\\rm df}\\nolimits}\n",
" \\newcommand{\\tfidf}{\\mathop{\\rm tfidf}\\nolimits}\n",
" $\n",
"</div>\n",
"\n",
"### Inverse document frequency\n",
"\n",
"Are all words equally important?\n",
"\n",
"**NO.** Words that appear in many documents are less important.\n",
"\n",
"To account for this, we multiply the word frequency by the _inverse document frequency_:\n",
"\n",
"$$\\idf_t = \\log \\frac{N}{\\df_t},$$\n",
"\n",
"where:\n",
"\n",
"* $\\idf_t$ - the inverse document frequency of term $t$\n",
"\n",
"* $N$ - the number of documents in the collection\n",
"\n",
"* $\\df_t$ - in how many documents did term $t$ occur?\n",
"\n",
"#### Why idf?\n",
"\n",
"term $t$ occurred...\n",
"\n",
"* in 1 document: $\\idf_t = \\log N/1 = \\log N$\n",
"* 2 times in the collection: $\\idf_t = \\log N/2$ or $\\log N$\n",
"* in half of the documents: $\\idf_t = \\log N/(N/2) = \\log 2$\n",
"* in all documents: $\\idf_t = \\log N/N = \\log 1 = 0$\n",
"\n",
"#### What follows from this?\n",
"\n",
"Instead of $\\tf_{t,d}$ we will use the following values in our vectors:\n",
"\n",
"$$\\tfidf_{t,d} = \\tf_{t,d} \\times \\idf_{t}$$\n",
"\n",
"Now let us define the _overlap score measure_:\n",
"\n",
"$$\\sigma(q,d) = \\sum_{t \\in q} \\tfidf_{t,d}$$\n",
"\n",
"\n",
"\n"
]
},
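{
"cell_type": "markdown",
"metadata": {},
"source": [
"Below is a minimal sketch (not part of the original lecture) of idf, tf-idf and the overlap score measure computed over the normalized collection defined above; `tf'`, `idf'`, `tfidf'` and `overlapScore` are our own helper names, and we assume every query term occurs in at least one document."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"-- term frequency of t in a single normalized document\n",
"tf' :: Text -> [Text] -> Double\n",
"tf' t = fromIntegral . Prelude.length . Prelude.filter (== t)\n",
"\n",
"-- inverse document frequency over the whole (tiny) collection\n",
"idf' :: Text -> Double\n",
"idf' t = log (fromIntegral n / fromIntegral df)\n",
"  where n  = Prelude.length collectionDNormalized\n",
"        df = Prelude.length (Prelude.filter (t `elem`) collectionDNormalized)\n",
"\n",
"tfidf' :: Text -> [Text] -> Double\n",
"tfidf' t d = tf' t d * idf' t\n",
"\n",
"-- overlap score: the sum of tf-idf over the query terms\n",
"overlapScore :: [Text] -> [Text] -> Double\n",
"overlapScore q d = sum [tfidf' t d | t <- q]\n",
"\n",
"-- rank all documents for the normalized query \"kot w butach\"\n",
"[(ix, overlapScore (normalize \"kot w butach\") d) | (ix, d) <- zip [0..] collectionDNormalized]"
]
},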
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": null,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [],
|
|||
|
"source": []
|
|||
|
}
|
|||
|
],
|
|||
|
"metadata": {
|
|||
|
"kernelspec": {
|
|||
|
"display_name": "Haskell",
|
|||
|
"language": "haskell",
|
|||
|
"name": "haskell"
|
|||
|
},
|
|||
|
"language_info": {
|
|||
|
"codemirror_mode": "ihaskell",
|
|||
|
"file_extension": ".hs",
|
|||
|
"mimetype": "text/x-haskell",
|
|||
|
"name": "haskell",
|
|||
|
"pygments_lexer": "Haskell",
|
|||
|
"version": "8.10.4"
|
|||
|
}
|
|||
|
},
|
|||
|
"nbformat": 4,
|
|||
|
"nbformat_minor": 4
|
|||
|
}
|