{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# A search engine: fast and sensible" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Working example\n", "\n", "We assume that we have a collection of documents $D = \\{d_1, \\ldots, d_N\\}$, where $N$ is the number of documents in the collection." ] }, { "cell_type": "code", "execution_count": 90, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Podobno jest kot w butach." ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "{-# LANGUAGE OverloadedStrings #-}\n", "\n", "import Data.Text hiding(map, filter, zip)\n", "import Prelude hiding(words, take)\n", "\n", "collectionD :: [Text]\n", "collectionD = [\"Ala ma kota.\", \"Podobno jest kot w butach.\", \"Ty chyba masz kota!\", \"But chyba zgubiłem.\"]\n", "\n", "-- The (!!) operator returns the list element at the given (zero-based) index\n", "-- (It takes linear time, so it is inefficient for larger lists, but we will not complicate things here)\n", "collectionD !! 1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Text extraction\n", "\n", "Example tools:\n", "\n", "* pdftotext\n", "* antiword\n", "* Tesseract OCR\n", "* Apache Tika - a general-purpose tool for extracting text from various formats\n", "\n", "## Text normalization\n", "\n", "Whatever we do with text, we first have to _normalize_ it." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Tokenization\n", "\n", "First, we have to split the text into _tokens_, that is, word-like units.\n", "Maybe we could simply split on spaces?" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Ala" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "ma" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "kota." 
] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "tokenizeStupidly :: Text -> [Text]\n", "-- words is a function from Data.Text that splits on whitespace\n", "tokenizeStupidly = words\n", "\n", "tokenizeStupidly $ Prelude.head collectionD" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ah, we need to _at least_ separate the punctuation marks from the words. The simplest way is to use a regular expression. It is worth using the [Unicode properties](https://en.wikipedia.org/wiki/Unicode_character_property) of characters and the `\\p{...}` construct." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Ala" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "ma" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "kota" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "." ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "{-# LANGUAGE QuasiQuotes #-}\n", "\n", "import Text.Regex.PCRE.Heavy\n", "\n", "tokenize :: Text -> [Text]\n", "tokenize = map fst . scan [re|[\\p{L}0-9]+|\\p{P}|]\n", "\n", "tokenize $ Prelude.head collectionD" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The whole collection, tokenized:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Ala" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "ma" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "kota" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "." 
] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "Podobno" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "jest" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "kot" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "w" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "butach" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "." ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "Ty" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "chyba" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "masz" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "kota" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "!" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "But" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "chyba" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "zgubiłem" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "." 
] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "map tokenize collectionD" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Tokenization problems\n", "\n", "##### English" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "I" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "use" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "a" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "data" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "-" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "base" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "tokenize \"I use a data-base\"" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "I" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "use" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "a" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "database" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "tokenize \"I use a database\"" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "I" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "use" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "a" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "data" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "base" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "tokenize \"I use a data base\"" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { 
"data": { "text/plain": [ "I" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "don" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "t" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "like" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "Python" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "tokenize \"I don't like Python\"" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0018" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "555" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "555" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "122" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "tokenize \"+0018 555 555 122\"" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0018555555122" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "tokenize \"+0018555555122\"" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Which" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "one" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "is" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "better" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ ":" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "C" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "or" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ 
"C" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "#" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "?" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "tokenize \"Which one is better: C++ or C#?\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Inne języki?" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Rechtsschutzversicherungsgesellschaften" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "wie" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "die" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "HUK" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "-" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "Coburg" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "machen" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "es" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "bereits" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "seit" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "geraumer" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "Zeit" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "vor" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ ":" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "tokenize \"Rechtsschutzversicherungsgesellschaften wie die HUK-Coburg machen es bereits seit geraumer Zeit vor:\"" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "今日波兹南是贸易" ] }, "metadata": 
{}, "output_type": "display_data" }, { "data": { "text/plain": [ "、" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "工业及教育的中心" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "。" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "波兹南是波兰第五大的城市及第四大的工业中心" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "," ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "波兹南亦是大波兰省的行政首府" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "。" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "也舉辦有不少展覽會" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "。" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "是波蘭西部重要的交通中心都市" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "。" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "tokenize \"今日波兹南是贸易、工业及教育的中心。波兹南是波兰第五大的城市及第四大的工业中心,波兹南亦是大波兰省的行政首府。也舉辦有不少展覽會。是波蘭西部重要的交通中心都市。\"" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "l" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "ordinateur" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "tokenize \"l'ordinateur\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Lemmatization" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "_Lemmatization_ means reducing a word to its base form (the _lemma_), e.g. \"krześle\" to \"krzesło\" and \"zrobimy\" to \"zrobić\" in Polish, or \"chairs\" to \"chair\" and \"made\" to \"make\" in English." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lemmatization for Polish is very hard. It is practically impossible to do with rules alone, so we simply have to obtain a very large _dictionary of inflected forms_.\n", "\n", "For the purposes of this lecture, let us build a small dictionary of inflected forms as an associative (hash) array." ] }, { "cell_type": "code", "execution_count": 80, "metadata": {}, "outputs": [ { "data": { "text/html": [ "