{ "cells": [ { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "![Logo 1](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech1.jpg)\n", "
\n", "

Information Extraction

\n", "

3. Search Engines: TF-IDF [lecture]

\n", "

Filip Grali\u0144ski (2021)

\n", "
\n", "\n", "![Logo 2](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech2.jpg)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# A search engine: fast and sensible" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Working example\n", "\n", "We assume that we have a collection of documents $D = \\{d_1, \\ldots, d_N\\}$, where $N$ is the number of documents in the collection." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Ala ma kota." ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "{-# LANGUAGE OverloadedStrings #-}\n", "\n", "import Data.Text hiding(map, filter, zip)\n", "import Prelude hiding(words, take)\n", "\n", "collectionD :: [Text]\n", "collectionD = [\"Ala ma kota.\", \"Podobno jest kot w butach.\", \"Ty chyba masz kota!\", \"But chyba zgubi\u0142em.\", \"Kot ma kota.\"]\n", "\n", "-- Prelude.head returns the first element of the list;\n", "-- later cells use the (!!) operator, which returns the list element at a given index\n", "-- (this is inefficient for larger lists, but we will not complicate things)\n", "Prelude.head collectionD" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Text extraction\n", "\n", "Example tools:\n", "\n", "* pdftotext\n", "* antiword\n", "* Tesseract OCR\n", "* Apache Tika - a universal tool for extracting text from various formats\n", "\n", "## Text normalization\n", "\n", "Whatever we do with text, we first have to _normalize_ it." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Tokenization\n", "\n", "First, we have to split the text into _tokens_, i.e. word-like units.\n", "Could we simply split on spaces?" 
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Ala" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "ma" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "kota." ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "tokenizeStupidly :: Text -> [Text]\n", "-- words is a function from Data.Text that splits on whitespace\n", "tokenizeStupidly = words\n", "\n", "tokenizeStupidly $ Prelude.head collectionD" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Well, we need to _at least_ separate the punctuation marks. The simplest way is to use a regular expression. It is worth using [Unicode character properties](https://en.wikipedia.org/wiki/Unicode_character_property) and the `\\p{...}` construct. " ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "But" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "chyba" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "zgubi\u0142em" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "." ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "{-# LANGUAGE QuasiQuotes #-}\n", "\n", "import Text.Regex.PCRE.Heavy\n", "\n", "tokenize :: Text -> [Text]\n", "tokenize = map fst . scan [re|C\\+\\+|[\\p{L}0-9]+|\\p{P}|]\n", "\n", "tokenize $ collectionD !! 3\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The whole collection, tokenized:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Ala" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "ma" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "kota" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "." ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "Podobno" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "jest" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "kot" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "w" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "butach" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "." ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "Ty" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "chyba" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "masz" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "kota" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "!" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "But" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "chyba" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "zgubi\u0142em" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "." 
] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "Kot" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "ma" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "kota" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "." ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "map tokenize collectionD" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Problems with tokenization\n", "\n", "##### English" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "I" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "use" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "a" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "data" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "-" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "base" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "tokenize \"I use a data-base\"" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "I" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "use" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "a" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "database" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "tokenize \"I use a database\"" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "I" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "use" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "a" ] }, 
"metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "data" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "base" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "tokenize \"I use a data base\"" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "I" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "don" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "t" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "like" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "Python" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "tokenize \"'I don't like Python'\"" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "I" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "can" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "see" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "the" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "Johnes" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "house" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "tokenize \"I can see the Johnes' house\"" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "I" ] }, "metadata": {}, 
"output_type": "display_data" }, { "data": { "text/plain": [ "do" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "not" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "like" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "Python" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "tokenize \"I do not like Python\"" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0018" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "555" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "-" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "555" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "-" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "122" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "tokenize \"+0018 555-555-122\"" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0018555555122" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "tokenize \"+0018555555122\"" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Which" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "one" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "is" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "better" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ ":" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "C++" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "or" ] }, "metadata": {}, "output_type": 
"display_data" }, { "data": { "text/plain": [ "C" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "#" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "?" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "tokenize \"Which one is better: C++ or C#?\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Other languages?" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Rechtsschutzversicherungsgesellschaften" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "wie" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "die" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "HUK" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "-" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "Coburg" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "machen" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "es" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "bereits" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "seit" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "geraumer" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "Zeit" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "vor" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ ":" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "tokenize \"Rechtsschutzversicherungsgesellschaften wie die HUK-Coburg machen es bereits seit geraumer Zeit vor:\"" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { 
"data": { "text/plain": [ "\u4eca\u65e5\u6ce2\u5179\u5357\u662f\u8d38\u6613" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "\u3001" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "\u5de5\u4e1a\u53ca\u6559\u80b2\u7684\u4e2d\u5fc3" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "\u3002" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "\u6ce2\u5179\u5357\u662f\u6ce2\u5170\u7b2c\u4e94\u5927\u7684\u57ce\u5e02\u53ca\u7b2c\u56db\u5927\u7684\u5de5\u4e1a\u4e2d\u5fc3" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "\uff0c" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "\u6ce2\u5179\u5357\u4ea6\u662f\u5927\u6ce2\u5170\u7701\u7684\u884c\u653f\u9996\u5e9c" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "\u3002" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "\u4e5f\u8209\u8fa6\u6709\u4e0d\u5c11\u5c55\u89bd\u6703" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "\u3002" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "\u662f\u6ce2\u862d\u897f\u90e8\u91cd\u8981\u7684\u4ea4\u901a\u4e2d\u5fc3\u90fd\u5e02" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "\u3002" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "tokenize \"\u4eca\u65e5\u6ce2\u5179\u5357\u662f\u8d38\u6613\u3001\u5de5\u4e1a\u53ca\u6559\u80b2\u7684\u4e2d\u5fc3\u3002\u6ce2\u5179\u5357\u662f\u6ce2\u5170\u7b2c\u4e94\u5927\u7684\u57ce\u5e02\u53ca\u7b2c\u56db\u5927\u7684\u5de5\u4e1a\u4e2d\u5fc3\uff0c\u6ce2\u5179\u5357\u4ea6\u662f\u5927\u6ce2\u5170\u7701\u7684\u884c\u653f\u9996\u5e9c\u3002\u4e5f\u8209\u8fa6\u6709\u4e0d\u5c11\u5c55\u89bd\u6703\u3002\u662f\u6ce2\u862d\u897f\u90e8\u91cd\u8981\u7684\u4ea4\u901a\u4e2d\u5fc3\u90fd\u5e02\u3002\"" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "l" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "ordinateur" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "tokenize \"l'ordinateur\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Lemmatization" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "_Lemmatization_ is the reduction of a word to its base form (the _lemma_), e.g. \"krze\u015ble\" to \"krzes\u0142o\" and \"zrobimy\" to \"zrobi\u0107\" in Polish, or \"chairs\" to \"chair\" and \"made\" to \"make\" in English." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lemmatization for Polish is very hard; it is practically impossible to do with rules alone, so we simply have to obtain a very comprehensive _dictionary of inflected forms_.\n", "\n", "For the purposes of this lecture, let us create a small dictionary of inflected forms as an associative array (here `Data.Map`, which is a balanced tree rather than a hash table)." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "Line 22: Use head\n", "Found:\n", "collectionD !! 0\n", "Why not:\n", "head collectionD" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "but" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "butami" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "Ala" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "mie\u0107" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "kot" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "." ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "Wczoraj" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "kupi\u0142em" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "kot" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "." ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import Data.Map as Map hiding(take, map, filter)\n", "\n", "mockInflectionDictionary :: Map Text Text\n", "mockInflectionDictionary = Map.fromList [\n", " (\"kota\", \"kot\"),\n", " (\"butach\", \"but\"),\n", " (\"masz\", \"mie\u0107\"),\n", " (\"ma\", \"mie\u0107\"),\n", " (\"buta\", \"but\"),\n", " (\"zgubi\u0142em\", \"zgubi\u0107\")]\n", "\n", "lemmatizeWord :: Map Text Text -> Text -> Text\n", "lemmatizeWord dict w = findWithDefault w w dict\n", "\n", "lemmatizeWord mockInflectionDictionary \"butach\"\n", "-- this form is not in our little dictionary, so the word is returned unchanged\n", "lemmatizeWord mockInflectionDictionary \"butami\"\n", "\n", "lemmatize :: Map Text Text -> [Text] -> [Text]\n", "lemmatize dict = map (lemmatizeWord dict)\n", "\n", "lemmatize mockInflectionDictionary $ tokenize $ collectionD !! 0 \n", "\n", "lemmatize mockInflectionDictionary $ tokenize \"Wczoraj kupi\u0142em kota.\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question**: Even in our little dictionary we run into the ambiguity of lemmatization. In what way?\n", "\n", "A comprehensive dictionary of inflected forms for Polish: http://zil.ipipan.waw.pl/PoliMorf?action=AttachFile&do=view&target=PoliMorf-0.6.7.tab.gz" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Stemming\n", "\n", "Stemming cuts a word down to a _stem_, which need not be a meaningful word, e.g. \"krze\u015ble\" may be stemmed to \"krze\u015bl\", \"krze\u015b\" or \"krzes\", and \"zrobimy\" to \"zrobi\".\n", "\n", "* stemming is not as well defined as lemmatization (it can be done in many ways)\n", "* it is more amenable to rule-based methods (although still hard for Polish)\n", "* for English there are well-known stemming algorithms, e.g. [the Porter algorithm](https://tartarus.org/martin/PorterStemmer/def.txt)\n", "* see also [the Snowball project](https://snowballstem.org/) with rules for many languages\n", "\n", "A simple \"poor man's\" stemmer for Polish is truncation to six characters." 
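, "\n", "\n",
"The remark about rule-based methods can be made concrete with a minimal sketch of suffix stripping. It is only an illustration under stated assumptions: the names `toySuffixes` and `stripSuffixStem` and the short list of endings are hypothetical, they are not part of the lecture code and are far from a real Polish stemmer.\n",
"\n",
"```haskell\n",
"{-# LANGUAGE OverloadedStrings #-}\n",
"import Data.List (find)\n",
"import Data.Maybe (fromMaybe)\n",
"import qualified Data.Text as T\n",
"\n",
"-- Hypothetical, very incomplete list of Polish inflectional endings,\n",
"-- ordered so that longer endings are tried first.\n",
"toySuffixes :: [T.Text]\n",
"toySuffixes = [\"ami\", \"ach\", \"om\", \"em\", \"a\", \"e\", \"y\"]\n",
"\n",
"-- Strip the first listed ending that matches, provided that at least\n",
"-- three characters of the word remain; otherwise return the word unchanged.\n",
"stripSuffixStem :: T.Text -> T.Text\n",
"stripSuffixStem w = fromMaybe w $ do\n",
"  suffix <- find fits toySuffixes\n",
"  T.stripSuffix suffix w\n",
"  where\n",
"    fits s = s `T.isSuffixOf` w && T.length w - T.length s >= 3\n",
"\n",
"main :: IO ()\n",
"main = mapM_ (print . stripSuffixStem) [\"butach\", \"butami\", \"kota\", \"kot\"]\n",
"```\n",
"\n",
"Unlike truncation, such rules depend on a hand-made list of endings, so every omission or ambiguity in that list shows up directly in the stems."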
] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "zrobim" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "komput" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "butach" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "\u017ad\u017ab\u0142a" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "poorMansStemming :: Text -> Text\n", "poorMansStemming = Data.Text.take 6\n", "\n", "poorMansStemming \"zrobimy\"\n", "poorMansStemming \"komputerami\"\n", "poorMansStemming \"butach\"\n", "poorMansStemming \"\u017ad\u017ab\u0142ami\"\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### _Stop words_\n", "\n", "Search engines often skip words that are short, frequent and carry no meaning of their own: the so-called _stop words_ (Polish: _s\u0142owa przestankowe_)." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "False" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "True" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "isStopWord :: Text -> Bool\n", "isStopWord \"w\" = True\n", "isStopWord \"jest\" = True\n", "isStopWord \"\u017ce\" = True\n", "-- while we are at it, we can also get rid of punctuation marks\n", "isStopWord w = w \u2248 [re|^\\p{P}+$|]\n", "\n", "isStopWord \"kot\"\n", "isStopWord \"!\"\n" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Ala" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "ma" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "kota" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "removeStopWords :: [Text] -> [Text]\n", "removeStopWords = filter (not . 
isStopWord)\n", "\n", "removeStopWords $ tokenize $ Prelude.head collectionD " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question**: Which queries can be harmed by removing _stop words_? Give examples for Polish and for English. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Normalization: miscellanea\n", "\n", "Normalization may also include:\n", "\n", "* correcting spelling errors\n", "* conversion to lower case (lower-casing, or rather case-folding)\n", "* removing diacritics\n", "\n" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\u017cd\u017ab\u0142o" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "toLower \"\u017bD\u0179B\u0141O\"" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\u017ad\u017ab\u0142o" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "toCaseFold \"\u0179D\u0179B\u0141O\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question:** When does _case-folding_ give a different result than _lower-casing_? What is the practical significance of this?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Normalization as an end-to-end process\n", "\n", "The most important rule: the documents in our collection should be normalized in exactly the same way as the queries.\n", "\n", "The result of normalization is the conversion of a document into a sequence of _terms_, i.e. normalized words.\n", "\n", "In other words, after normalization we treat a document $d_i$ as a sequence of terms $t_i^1,\\dots,t_i^{|d_i|}$." 
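, "\n", "\n",
"A minimal sketch of why the rule about identical normalization matters, under stated assumptions: `toyNormalize` and `matches` are hypothetical helpers (case folding plus dropping punctuation), not the normalization used in the rest of this lecture.\n",
"\n",
"```haskell\n",
"{-# LANGUAGE OverloadedStrings #-}\n",
"import Data.Char (isPunctuation)\n",
"import qualified Data.Text as T\n",
"\n",
"-- Hypothetical toy normalizer: case folding, then dropping punctuation,\n",
"-- then splitting on whitespace.\n",
"toyNormalize :: T.Text -> [T.Text]\n",
"toyNormalize = T.words . T.filter (not . isPunctuation) . T.toCaseFold\n",
"\n",
"-- A document matches a query if every query term occurs among its terms.\n",
"matches :: [T.Text] -> [T.Text] -> Bool\n",
"matches queryTerms docTerms = all (`elem` docTerms) queryTerms\n",
"\n",
"main :: IO ()\n",
"main = do\n",
"  let doc = toyNormalize \"Kot ma kota.\"\n",
"  -- query normalized in the same way as the document\n",
"  print (matches (toyNormalize \"KOTA!\") doc)\n",
"  -- raw, unnormalized query against the normalized document\n",
"  print (matches [\"KOTA!\"] doc)\n",
"```\n",
"\n",
"With the same normalization on both sides the first check succeeds, while the raw query `KOTA!` in the second check cannot match any normalized term."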
] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ala" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "mie\u0107" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "kot" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "podobn" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "kot" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "but" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "ty" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "chyba" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "mie\u0107" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "kot" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "but" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "chyba" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "zgubi\u0107" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "kot" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "mie\u0107" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "kot" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "normalize :: Text -> [Text]\n", "normalize = map poorMansStemming . removeStopWords . map toLower . lemmatize mockInflectionDictionary . tokenize\n", "\n", "map normalize collectionD" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The set of all terms in a document collection is called the _vocabulary_ (Polish: _s\u0142ownik_), not to be confused with a dictionary as a data structure in Python (_dictionary_).\n", "\n", "$$V = \\bigcup_{i=1}^N \\{t_i^1,\\dots,t_i^{|d_i|}\\}$$\n", "\n", "(It is a set, so we count without repetitions!)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "fromList [\"ala\",\"but\",\"chyba\",\"kot\",\"mie\\263\",\"podobn\",\"ty\",\"zgubi\\263\"]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import Data.Set as Set hiding(map)\n", "\n", "getVocabulary :: [Text] -> Set Text \n", "getVocabulary = Set.unions . map (Set.fromList . normalize) \n", "\n", "getVocabulary collectionD" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## How can a search engine be fast?\n", "\n", "An _inverted index_ lets a search engine search quickly through millions of documents. An inverted index is simply... an index of the kind we know from books (a mapping from words to page or document numbers).\n", "\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "Line 4: Use tuple-section\n", "Found:\n", "\\ t -> (t, ix)\n", "Why not:\n", "(, ix)" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "fromList [(\"chyba\",2),(\"kot\",2),(\"mie\\263\",2),(\"ty\",2)]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "collectionDNormalized = map normalize collectionD\n", "\n", "documentToPostings :: ([Text], Int) -> Set (Text, Int)\n", "documentToPostings (d, ix) = Set.fromList $ map (\\t -> (t, ix)) d\n", "\n", "documentToPostings (collectionDNormalized !! 2, 2) \n", "\n" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "Line 2: Use zipWith\n", "Found:\n", "map documentToPostings $ Prelude.zip coll [0 .. ]\n", "Why not:\n", "zipWith (curry documentToPostings) coll [0 .. ]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "fromList [(\"ala\",0),(\"but\",1),(\"but\",3),(\"chyba\",2),(\"chyba\",3),(\"kot\",0),(\"kot\",1),(\"kot\",2),(\"kot\",4),(\"mie\\263\",0),(\"mie\\263\",2),(\"mie\\263\",4),(\"podobn\",1),(\"ty\",2),(\"zgubi\\263\",3)]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "collectionToPostings :: [[Text]] -> Set (Text, Int)\n", "collectionToPostings coll = Set.unions $ map documentToPostings $ Prelude.zip coll [0..]\n", "\n", "collectionToPostings collectionDNormalized" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "Line 2: Eta reduce\n", "Found:\n", "updateInvertedIndex (t, ix) invIndex\n", " = insertWith (++) t [ix] invIndex\n", "Why not:\n", "updateInvertedIndex (t, ix) = insertWith (++) t [ix]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "fromList [(\"ala\",[0]),(\"but\",[1,3]),(\"chyba\",[2,3]),(\"kot\",[0,1,2,4]),(\"mie\\263\",[0,2,4]),(\"podobn\",[1]),(\"ty\",[2]),(\"zgubi\\263\",[3])]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "[0,1,2,4]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "updateInvertedIndex :: (Text, Int) -> Map Text [Int] -> Map Text [Int]\n", "updateInvertedIndex (t, ix) invIndex = insertWith (++) t [ix] invIndex\n", "\n", "getInvertedIndex :: [[Text]] -> Map Text [Int]\n", "getInvertedIndex = Prelude.foldr updateInvertedIndex Map.empty . Set.toList . collectionToPostings\n", "\n", "ind = getInvertedIndex collectionDNormalized\n", "ind\n", "ind ! \"kot\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Relevance\n", "\n", "We can quickly search normalized documents, but which documents are important (_relevant_) with respect to the user's information need?\n", "\n", "### Boolean queries\n", "\n", "* does `pizzeria Pozna\u0144 dow\u00f3z` mean `pizzeria AND Pozna\u0144 AND dow\u00f3z` or `pizzeria OR Pozna\u0144 OR dow\u00f3z`? (_dow\u00f3z_ is Polish for delivery)\n", "* `(pizzeria OR pizza OR tratoria) AND Pozna\u0144 AND dow\u00f3z`\n", "* `pizzeria AND Pozna\u0144 AND dow\u00f3z AND NOT golonka`\n", "\n", "How should a query be interpreted by default?\n", "\n", "* as an AND query: possibly too few documents\n", "* some intermediate solution?\n", "* as an OR query: possibly too many documents\n", "\n", "We can define some measures of how well a document matches a query, so that we are able to sort the documents...\n", "\n", "### Measuring how well a document matches a query\n", "\n", "We need some function $\\sigma : Q \\times D \\rightarrow \\mathbb{R}$. \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have to somehow turn documents into numbers, i.e. documents into vectors of numbers and the whole collection into a matrix.\n", "\n", "First, let us number all the terms in the vocabulary." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "fromList [(0,\"ala\"),(1,\"but\"),(2,\"chyba\"),(3,\"kot\"),(4,\"mie\\263\"),(5,\"podobn\"),(6,\"ty\"),(7,\"zgubi\\263\")]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "fromList [(\"ala\",0),(\"but\",1),(\"chyba\",2),(\"kot\",3),(\"mie\\263\",4),(\"podobn\",5),(\"ty\",6),(\"zgubi\\263\",7)]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "ala" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "2" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "voc = getVocabulary collectionD\n", "\n", "vocD :: Map Int Text\n", "vocD = Map.fromList $ zip [0..] $ Set.toList voc\n", "\n", "invvocD :: Map Text Int\n", "invvocD = Map.fromList $ zip (Set.toList voc) [0..]\n", "\n", "vocD\n", "\n", "invvocD\n", "\n", "vocD ! 0\n", "invvocD ! \"chyba\"\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us write a function that _vectorizes_ a normalized document.\n", "\n" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "Line 2: Redundant $\n", "Found:\n", "map (\\ i -> count (v ! i) doc) $ [0 .. (vecSize - 1)]\n", "Why not:\n", "map (\\ i -> count (v ! i) doc) [0 .. (vecSize - 1)]Line 9: Redundant bracket\n", "Found:\n", "(collectionDNormalized !! 2)\n", "Why not:\n", "collectionDNormalized !! 2" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "ty" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "chyba" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "mie\u0107" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "kot" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "[0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "vectorize :: Int -> Map Int Text -> [Text] -> [Double]\n", "vectorize vecSize v doc = map (\\i -> count (v ! i) doc) $ [0..(vecSize-1)]\n", " where count t doc \n", " | t `elem` doc = 1.0\n", " | otherwise = 0.0\n", " \n", "vocSize = Set.size voc\n", "\n", "(collectionDNormalized !! 2)\n", "vectorize vocSize vocD (collectionDNormalized !! 2)\n", "\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " ![image](./macierz.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Jak inaczej uwzgl\u0119dni\u0107 cz\u0119sto\u015b\u0107 wyraz\u00f3w?\n", "\n", "
\n", " $\n", " \\newcommand{\\idf}{\\mathop{\\rm idf}\\nolimits}\n", " \\newcommand{\\tf}{\\mathop{\\rm tf}\\nolimits}\n", " \\newcommand{\\df}{\\mathop{\\rm df}\\nolimits}\n", " \\newcommand{\\tfidf}{\\mathop{\\rm tfidf}\\nolimits}\n", " $\n", "
\n", "\n", "* $\\tf_{t,d}$ - term frequency\n", "\n", "* $1+\\log(\\tf_{t,d})$\n", "\n", "* $0.5 + \\frac{0.5 \\times \\tf_{t,d}}{max_t(\\tf_{t,d})}$" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "Line 2: Redundant $\n", "Found:\n", "map (\\ i -> count (v ! i) doc) $ [0 .. (vecSize - 1)]\n", "Why not:\n", "map (\\ i -> count (v ! i) doc) [0 .. (vecSize - 1)]Line 7: Redundant bracket\n", "Found:\n", "(collectionDNormalized !! 4)\n", "Why not:\n", "collectionDNormalized !! 4" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "kot" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "mie\u0107" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "kot" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "[0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "[0.0,0.0,0.0,2.0,1.0,0.0,0.0,0.0]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "vectorizeTf :: Int -> Map Int Text -> [Text] -> [Double]\n", "vectorizeTf vecSize v doc = map (\\i -> count (v ! i) doc) $ [0..(vecSize-1)]\n", " where count t doc = fromIntegral $ (Prelude.length . Prelude.filter (== t)) doc\n", "\n", "vocSize = Set.size voc\n", "\n", "(collectionDNormalized !! 4)\n", "vectorize vocSize vocD (collectionDNormalized !! 4)\n", "vectorizeTf vocSize vocD (collectionDNormalized !! 4)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " $\n", " \\newcommand{\\idf}{\\mathop{\\rm idf}\\nolimits}\n", " \\newcommand{\\tf}{\\mathop{\\rm tf}\\nolimits}\n", " \\newcommand{\\df}{\\mathop{\\rm df}\\nolimits}\n", " \\newcommand{\\tfidf}{\\mathop{\\rm tfidf}\\nolimits}\n", " $\n", "
\n", "\n", "### Odwrotna cz\u0119sto\u015b\u0107 dokumentowa\n", "\n", "Czy wszystkie wyrazy s\u0105 tak samo wa\u017cne?\n", "\n", "**NIE.** Wyrazy pojawiaj\u0105ce si\u0119 w wielu dokumentach s\u0105 mniej wa\u017cne.\n", "\n", "Aby to uwzgl\u0119dni\u0107, przemna\u017camy frekwencj\u0119 wyrazu przez _odwrotn\u0105\n", " cz\u0119sto\u015b\u0107 w dokumentach_ (_inverse document frequency_):\n", "\n", "$$\\idf_t = \\log \\frac{N}{\\df_t},$$\n", "\n", "gdzie:\n", "\n", "* $\\idf_t$ - odwrotna cz\u0119sto\u015b\u0107 wyrazu $t$ w dokumentach\n", "\n", "* $N$ - liczba dokument\u00f3w w kolekcji\n", "\n", "* $\\df_f$ - w ilu dokumentach wyst\u0105pi\u0142 wyraz $t$?\n", "\n", "#### Dlaczego idf?\n", "\n", "term $t$ wyst\u0105pi\u0142...\n", "\n", "* w 1 dokumencie, $\\idf_t = \\log N/1 = \\log N$\n", "* 2 razy w kolekcji, $\\idf_t = \\log N/2$ lub $\\log N$\n", "* w po\u0142owie dokument\u00f3w kolekcji, $\\idf_t = \\log N/(N/2) = \\log 2$\n", "* we wszystkich dokumentach, $\\idf_t = \\log N/N = \\log 1 = 0$\n", "\n" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.22314355131420976" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "idf :: [[Text]] -> Text -> Double\n", "idf coll t = log (fromIntegral n / fromIntegral df)\n", " where df = Prelude.length $ Prelude.filter (\\d -> t `elem` d) coll\n", " n = Prelude.length coll\n", " \n", "idf collectionDNormalized \"kot\" " ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.9162907318741551" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "idf collectionDNormalized \"chyba\" " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Co z tego wynika?\n", "\n", "Zamiast $\\tf_{t,d}$ b\u0119dziemy w wektorach rozpatrywa\u0107 warto\u015bci:\n", "\n", "$$\\tfidf_{t,d} = \\tf_{t,d} \\times \\idf_{t}$$\n", "\n" ] }, { "cell_type": "code", 
"execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "kot" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "mie\u0107" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "kot" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "[0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "[0.0,0.0,0.0,2.0,1.0,0.0,0.0,0.0]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "[0.0,0.0,0.0,0.44628710262841953,0.5108256237659907,0.0,0.0,0.0]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "vectorizeTfIdf :: Int -> [[Text]] -> Map Int Text -> [Text] -> [Double]\n", "vectorizeTfIdf vecSize coll v doc = map (\\i -> count (v ! i) doc * idf coll (v ! i)) [0..(vecSize-1)]\n", " where count t doc = fromIntegral $ (Prelude.length . Prelude.filter (== t)) doc\n", "\n", "vocSize = Set.size voc\n", "\n", "collectionDNormalized !! 4\n", "vectorize vocSize vocD (collectionDNormalized !! 4)\n", "vectorizeTf vocSize vocD (collectionDNormalized !! 4)\n", "vectorizeTfIdf vocSize collectionDNormalized vocD (collectionDNormalized !! 
4)" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[[1.6094379124341003,0.0,0.0,0.22314355131420976,0.5108256237659907,0.0,0.0,0.0],[0.0,0.9162907318741551,0.0,0.22314355131420976,0.0,1.6094379124341003,0.0,0.0],[0.0,0.0,0.9162907318741551,0.22314355131420976,0.5108256237659907,0.0,1.6094379124341003,0.0],[0.0,0.9162907318741551,0.9162907318741551,0.0,0.0,0.0,0.0,1.6094379124341003],[0.0,0.0,0.0,0.44628710262841953,0.5108256237659907,0.0,0.0,0.0]]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "map (vectorizeTfIdf vocSize collectionDNormalized vocD) collectionDNormalized" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Teraz zdefiniujemy _overlap score measure_:\n", "\n", "$$\\sigma(q,d) = \\sum_{t \\in q} \\tfidf_{t,d}$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Podobie\u0144stwo kosinusowe\n", "\n", "_Overlap score measure_ nie jest jedyn\u0105 mo\u017cliw\u0105 metryk\u0105, za pomoc\u0105 kt\u00f3rej mo\u017cemy mierzy\u0107 dopasowanie dokumentu do zapytania. Mo\u017cemy r\u00f3wnie\u017c si\u0119gn\u0105\u0107 po intuicje geometryczne (skoro mamy do czynienia z wektorami).\n", "\n", "**Pytanie**: Ile wymiar\u00f3w maj\u0105 wektory, na kt\u00f3rych operujemy? Jak \"wygl\u0105daj\u0105\" te wektory? 
Czy mo\u017cemy wykonywa\u0107 na nich standardowe operacje geometryczne czy te, kt\u00f3re znamy z geometrii liniowej?\n", "\n", "#### Podobie\u0144stwo mi\u0119dzy dokumentami\n", "\n", "Zajmijmy si\u0119 teraz poszukiwaniem miary mierz\u0105cej podobie\u0144stwo mi\u0119dzy dokumentami $d_1$ i $d_2$ (czyli poszukujemy sensownej funkcji $\\sigma : D x D \\rightarrow \\mathbb{R}$).\n", "\n", "**Uwaga** Poj\u0119cia \"miary\" u\u017cywamy nieformalnie, nie spe\u0142nia ona za\u0142o\u017ce\u0144 znanych z teorii miary.\n", "\n", "Rozpatrzmy zbiorek tekst\u00f3w legend miejskich z .\n", "\n", "(To autentyczne teksty z Internentu, z j\u0119zykiem potocznym, wulgarnym itd.)\n", "\n", "```\n", " git clone git://gonito.net/polish-urban-legends\n", " paste polish-urban-legends/dev-0/expected.tsv polish-urban-legends/dev-0/in.tsv > legendy.txt\n", "``` " ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "na_ak" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "Opowie\u015b\u0107 prawdziwa... Olsztyn, akademik, 7 pi\u0119tro, impreza u Mariusza, jak to na polskiej najebce bywa kto\u015b rzuci\u0142 tekstem: \"Mariusz nie zjedziesz na nartach po schodach\". Sprawa ucichla, studencii wrocili do tego co lubia i w sumie umiej\u0105 najbardziej czyli picia, lecz nad ranem kolo godziny 6.00 ludzia przypomnialo sie ze Mariusz mia\u0142 zjecha\u0107 na nartach po schodach. Tu warto wspomnie\u0107 \u017ce Mariusz by\u0142 zapalonym narciarzem st\u0105d w\u0142a\u015bnie w jego pokoju znalezc mo\u017cna bylo narty, bo po ki huj komu\u015b narty w Olsztynie! Tak wracajac do historii nasz bohater odzia\u0142 si\u0119 w sprzet, podszed do schodow i niestety da\u0142 rad\u0119 zjecha\u0107 jedynie w po\u0142owie, gdy\u017c jak to powiedzial \"no kurwa potkn\u0105\u0142em sie\", ale nieustraszoony Mariusz pr\u00f3bowal dalej. 
Nastepny zjazd byl perfekcyjny, jedno pietro zanim, niestety pomiedzy 6 a 5 pietrem Mariusza natrafil na Pania sprz\u0105taczke, kt\u00f3ra potr\u0105ci\u0142 i zwia\u0142 z miejsca wypadku. Ok godziny 10.00 nastopilo przebudzenie Mariusza, ktory zaraz po obudzeniu uslyszal co narobi\u0142, mianowicie o skutkach potracenia, Pani sprzataczka z\u0142amala r\u0119k\u0119 i trafi\u0142a do szpitala. Mog\u0142y powsta\u0107 przez to cie\u017ckie konsekwencje, Mariusz m\u00f3g\u0142 wyleciec z akademika je\u017celi kierownik dowie sie o calym zaj\u015bciu. Wiec koledzy poradzili narcia\u017cowi, aby kupi\u0142 kwiaty i bombonierk\u0119 i poszed\u0142 do szpitala z przeprosinami. Po szybkich zakupach w sasiedniej Biedr\u0105ce, Mariusz byl przygotowany na konfrontacje z Pania sprz\u0105taczka, ale nie mog\u0142o poj\u015b\u0107 pi\u0119knie i g\u0142adko. Po wej\u015bciu do szpitala nasz bohater skierowal swoje kroki do recepcji pytajac si\u0119 o cioci\u0119, kt\u00f3ra mia\u0142a wypadek w akademiku, recepcjonistka skierowa\u0142a go do lekarza, gdzie czeka\u0142 na jego wyj\u015bcie ok 15 minut, gdy lekarz ju\u017c wyszed\u0142 ten odrazu podlecia\u0142 do niego, \u017ceby spyta\u0107 si\u0119 o stan zdrowia Pani sprz\u0105taczki. Wnet uslyszla od lekarz, niestety Pani teraz jest u psychiatry po twierdzi, \u017ce kto\u015b potracil ja zje\u017cdzajac na nartach w akademiku. Po uslyszeniu tej wiadomosci Mariusz odwroci\u0142 si\u0119, wybieg\u0142, kupi\u0142 piecie i szybko pobieg\u0142 do akademika pi\u0107 dalej! Mora\u0142... student potrafi!" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import System.IO\n", "import Data.List.Split as SP\n", "\n", "legendsh <- openFile \"legendy.txt\" ReadMode\n", "hSetEncoding legendsh utf8\n", "contents <- hGetContents legendsh\n", "ls = Prelude.lines contents\n", "items = map (map pack . 
SP.splitOn \"\\t\") ls\n", "Prelude.head items" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "87" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "nbOfLegends = Prelude.length items\n", "nbOfLegends" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "na_ak" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "w_lud" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "ba_hy" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "w_lap" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "ne_dz" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "be_wy" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "zw_oz" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "mo_zu" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "be_wy" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "ba_hy" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "mo_zu" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "be_wy" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "w_lud" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "ne_dz" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "ta_ab" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "ta_ab" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "ta_ab" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "w_lap" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { 
"text/plain": [ "ba_hy" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "ne_dz" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "ba_hy" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "tr_su" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "ne_dz" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "ba_hy" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "mo_zu" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "tr_su" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "zw_oz" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "ne_dz" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "ne_dz" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "w_lud" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "zw_oz" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "zw_oz" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "zw_oz" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "ne_dz" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "ta_ab" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "zw_oz" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "w_lud" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "na_ak" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "zw_oz" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "w_lap" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ 
"be_wy" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "na_ak" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "zw_oz" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "w_lap" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "na_ak" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "ba_hy" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "zw_oz" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "w_lud" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "zw_oz" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "zw_oz" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "mo_zu" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "ba_hy" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "zw_oz" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "tr_su" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "na_ak" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "ba_hy" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "w_lud" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "w_lud" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "zw_oz" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "tr_su" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "zw_oz" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "w_lud" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "zw_oz" ] }, 
"metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "zw_oz" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "be_wy" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "tr_su" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "zw_oz" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "na_ak" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "ba_hy" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "zw_oz" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "ne_dz" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "ba_hy" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "na_ak" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "zw_oz" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "w_lud" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "mo_zu" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "mo_zu" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "na_ak" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "w_lap" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "ne_dz" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "ba_hy" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "mo_zu" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "ba_hy" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "ne_dz" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "zw_oz" ] }, "metadata": {}, 
"output_type": "display_data" }, { "data": { "text/plain": [ "tr_su" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "ne_dz" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "w_lud" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "Ja podejrzewam \u017ce o polowaniu nie by\u0142o mowy, po prostu znalaz\u0142 martwego szczupaka i skorzysta\u0142 z okazji! Mnie mocno zdziwi\u0142a jego si\u0142a \u017ceby taki p\u00f3\u0142 kilogramowy okaz szczupaka przesuwa\u0107 o par\u0119 metr\u00f3w i to w trzcinach! Szacuneczek. Przypomniala mi sie historia kt\u00f3r\u0105 kiedys zaslyszalem o wlascicielce pytona, ktory nagle polozyl sie wzdluz jej \u0142\u00f3\u017cka. Le\u017ca\u0142 tak wyci\u0105gniety jak struna d\u0142u\u017cszy czas jak nie\u017cywy (a by\u0142 d\u0142ugo\u015bci \u0142\u00f3\u017cka), wi\u0119c kobitka zadzonila do weterynarza co ma robi\u0107. Us\u0142ysza\u0142a \u017ce ma szybko zamkn\u0105\u0107 si\u0119 w \u0142azience i poczeka\u0107 na niego bo pyton j\u0105 mierzy jako potencjaln\u0105 ofiar\u0119 (czy mu si\u0119 zmie\u015bci w brzuchu...). Wierzy\u0107, nie wierzy\u0107? Kiedy\u015b nie wierzy\u0142em ale od kilku dni mam w\u0105tpliwosci... Pozdrawiam" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "labelsL = map Prelude.head items\n", "labelsL\n", "collectionL = map (!!1) items\n", "items !! 1" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "348" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "collectionLNormalized = map normalize collectionL\n", "voc' = getVocabulary collectionL\n", "\n", "vocLSize = Prelude.length voc'\n", "\n", "vocL :: Map Int Text\n", "vocL = Map.fromList $ zip [0..] 
$ Set.toList voc'\n", "\n", "invvocL :: Map Text Int\n", "invvocL = Map.fromList $ zip (Set.toList voc') [0..]\n", "\n", "vocL ! 0\n", "invvocL ! \"chyba\"\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Wektoryzujemy ca\u0142\u0105 kolekcj\u0119:" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.38837067474886433,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.752336051950276,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0647107369924282,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.7727609380946383,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.2078115806331018,0.0,0.0,0.0,0.0,0.0,1.247032293786383,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,
0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5947071077466928,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.268683541318364,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.2078115806331018,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.7578579175523736,0.0,0.0,0.0,0.0,0.0,0.3550342544812725,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.7727609380946383,3.367295829986474,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.9395475940384223,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.21437689194643514,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.2878542883066382,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.2745334443309775,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.079613757534693,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.330413902725434,0.0,1.247032293786383,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.330413902725434,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.465908118654584,2.5199979695992702,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.6741486494265287,0.0,0.0,4.465908118654584,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.7727609380946383,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.5199979695992702,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.6741486494265287,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.079613757534693,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.386466576974748,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.856470206220483,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.367295829986474,0.0,1.0319209141694374,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.367295829986474,0.0,0.0,0.0,0.0,2.340142505300509,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.7578579175523736,0.0,0.0,0.0,0.0,0.0,3.367295829986474,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.7727609380946383,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.5214691394881432,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,8.388148398070203e-2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.9810014688665833,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.6096847248398047,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.575536360758419,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.079613757534693,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.1847155011136463,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0319209141694374,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.367295829986474,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.465908118654584,0.0,0.0,0.0,2.856470206220483,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.079613757534693,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.322773392263051,0.0,0.0,0.0,0.0,
0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.367295829986474,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.079613757534693,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.367295829986474,3.367295829986474,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.465908118654584,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.367295829986474,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.163323025660538,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.465908118654584,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.900958761193047,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.465908118654584,3.079613757534693,0.0,0.0,0.0,0.0,3.367295829986474,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,
0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.7727609380946383,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.340142505300509,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.710068508962545,4.465908118654584,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.465908118654584,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.367295829986474,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.465908118654584,0.0,0.0,0.0,0.0,0.0,0.0,0
.0,0.0,0.0,0.0,0.0,0.0,8.931816237309167,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.5199979695992702,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0319209141694374,0.0,2.163323025660538,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.26121549926361765,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.465908118654584,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.6741486494265287,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.386466576974748,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,9.238841272604079,0.0,0.0,0.
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.330413902725434,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.7727609380946383,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.163323025660538,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.465908118654584,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.367295829986474,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.12210269680089991,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.465908118654584,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.7727609380946383,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.465908118654584,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,
0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.068012845856213,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.856470206220483,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.856470206220483,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.079613757534693,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.712940412440966,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.068012845856213,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "lVectorized = map (vectorizeTfIdf vocLSize collectionLNormalized vocL) collectionLNormalized\n", "lVectorized !! 1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Szukamy funkcji $sigma$, kt\u00f3ra da wysok\u0105 warto\u015b\u0107 dla tekst\u00f3w dotycz\u0105cych tego samego w\u0105tku legendowego (np. $d_1$ i $d_2$ m\u00f3wi\u0105 o w\u0119\u017cu przymierzaj\u0105cym si\u0119 do zjedzenia swojej w\u0142a\u015bcicielki) i nisk\u0105 dla tekst\u00f3w z r\u00f3\u017cnych w\u0105tk\u00f3w (np. 
$d_1$ opowiada o w\u0119\u017cu ludojadzie, $d_2$ - ba\u0142wanku na hydrancie)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Perhaps simply the Euclidean distance, since these are points in a multidimensional space?" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
Eta reduce
Found:
formatNumber x = printf \"% 7.2f\" x
Why Not:
formatNumber = printf \"% 7.2f\"
" ], "text/plain": [ "Line 5: Eta reduce\n", "Found:\n", "formatNumber x = printf \"% 7.2f\" x\n", "Why not:\n", "formatNumber = printf \"% 7.2f\"" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ " 0.00 79.93 78.37 76.57 87.95 81.15 82.77 127.50 124.54 76.42 84.19 78.90 90.90" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import Text.Printf\n", "import Data.List (take)\n", "\n", "formatNumber :: Double -> String\n", "formatNumber x = printf \"% 7.2f\" x\n", "\n", "similarTo :: ([Double] -> [Double] -> Double) -> [[Double]] -> Int -> Text\n", "similarTo simFun vs ix = pack $ Prelude.unwords $ map (formatNumber . ((vs !! ix) `simFun`)) vs\n", "\n", "euclDistance :: [Double] -> [Double] -> Double\n", "euclDistance v1 v2 = sqrt $ sum $ Prelude.zipWith (\\x1 x2 -> (x1 - x2)**2) v1 v2\n", "\n", "limit = 13\n", "labelsLimited = Data.List.take limit labelsL\n", "limitedL = Data.List.take limit lVectorized\n", "\n", "similarTo euclDistance limitedL 0\n", "\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
Move brackets to avoid $
Found:
\"\\n\"\n", " <>\n", " (Data.Text.unlines\n", " $ map (\\ (lab, ix) -> lab <> \" \" <> similarTo simFun vs ix)\n", " $ zip labels [0 .. (Prelude.length vs - 1)])
Why Not:
\"\\n\"\n", " <>\n", " Data.Text.unlines\n", " (map (\\ (lab, ix) -> lab <> \" \" <> similarTo simFun vs ix)\n", " $ zip labels [0 .. (Prelude.length vs - 1)])
Use zipWith
Found:
map (\\ (lab, ix) -> lab <> \" \" <> similarTo simFun vs ix)\n", " $ zip labels [0 .. (Prelude.length vs - 1)]
Why Not:
zipWith\n", " (curry (\\ (lab, ix) -> lab <> \" \" <> similarTo simFun vs ix))\n", " labels [0 .. (Prelude.length vs - 1)]
Move brackets to avoid $
Found:
\" \"\n", " <> (Data.Text.unwords $ map (\\ l -> pack $ printf \"% 7s\" l) labels)
Why Not:
\" \"\n", " <> Data.Text.unwords (map (\\ l -> pack $ printf \"% 7s\" l) labels)
Avoid lambda
Found:
\\ l -> pack $ printf \"% 7s\" l
Why Not:
pack . printf \"% 7s\"
" ], "text/plain": [ "Line 2: Move brackets to avoid $\n", "Found:\n", "\"\\n\"\n", " <>\n", " (Data.Text.unlines\n", " $ map (\\ (lab, ix) -> lab <> \" \" <> similarTo simFun vs ix)\n", " $ zip labels [0 .. (Prelude.length vs - 1)])\n", "Why not:\n", "\"\\n\"\n", " <>\n", " Data.Text.unlines\n", " (map (\\ (lab, ix) -> lab <> \" \" <> similarTo simFun vs ix)\n", " $ zip labels [0 .. (Prelude.length vs - 1)])Line 2: Use zipWith\n", "Found:\n", "map (\\ (lab, ix) -> lab <> \" \" <> similarTo simFun vs ix)\n", " $ zip labels [0 .. (Prelude.length vs - 1)]\n", "Why not:\n", "zipWith\n", " (curry (\\ (lab, ix) -> lab <> \" \" <> similarTo simFun vs ix))\n", " labels [0 .. (Prelude.length vs - 1)]Line 3: Move brackets to avoid $\n", "Found:\n", "\" \"\n", " <> (Data.Text.unwords $ map (\\ l -> pack $ printf \"% 7s\" l) labels)\n", "Why not:\n", "\" \"\n", " <> Data.Text.unwords (map (\\ l -> pack $ printf \"% 7s\" l) labels)Line 3: Avoid lambda\n", "Found:\n", "\\ l -> pack $ printf \"% 7s\" l\n", "Why not:\n", "pack . 
printf \"% 7s\"" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ " na_ak w_lud ba_hy w_lap ne_dz be_wy zw_oz mo_zu be_wy ba_hy mo_zu be_wy w_lud\n", "na_ak 0.00 79.93 78.37 76.57 87.95 81.15 82.77 127.50 124.54 76.42 84.19 78.90 90.90\n", "w_lud 79.93 0.00 38.92 34.35 56.48 44.89 47.21 109.24 104.82 35.33 49.88 39.98 60.20\n", "ba_hy 78.37 38.92 0.00 30.37 54.23 40.93 43.83 108.15 102.91 27.37 46.95 35.81 58.99\n", "w_lap 76.57 34.35 30.37 0.00 51.54 37.46 40.86 107.43 103.22 25.22 43.66 32.10 56.53\n", "ne_dz 87.95 56.48 54.23 51.54 0.00 57.98 60.32 113.66 109.59 50.96 62.17 54.84 70.70\n", "be_wy 81.15 44.89 40.93 37.46 57.98 0.00 49.55 110.37 100.50 37.77 51.54 37.09 62.92\n", "zw_oz 82.77 47.21 43.83 40.86 60.32 49.55 0.00 111.11 107.57 41.02 54.07 45.23 64.65\n", "mo_zu 127.50 109.24 108.15 107.43 113.66 110.37 111.11 0.00 139.57 107.38 109.91 108.20 117.07\n", "be_wy 124.54 104.82 102.91 103.22 109.59 100.50 107.57 139.57 0.00 102.69 108.32 99.06 113.25\n", "ba_hy 76.42 35.33 27.37 25.22 50.96 37.77 41.02 107.38 102.69 0.00 43.83 32.08 56.68\n", "mo_zu 84.19 49.88 46.95 43.66 62.17 51.54 54.07 109.91 108.32 43.83 0.00 47.87 66.40\n", "be_wy 78.90 39.98 35.81 32.10 54.84 37.09 45.23 108.20 99.06 32.08 47.87 0.00 59.66\n", "w_lud 90.90 60.20 58.99 56.53 70.70 62.92 64.65 117.07 113.25 56.68 66.40 59.66 0.00" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "paintMatrix :: ([Double] -> [Double] -> Double) -> [Text] -> [[Double]] -> Text\n", "paintMatrix simFun labels vs = header <> \"\\n\" <> (Data.Text.unlines $ map (\\(lab, ix) -> lab <> \" \" <> similarTo simFun vs ix) $ zip labels [0..(Prelude.length vs - 1)])\n", " where header = \" \" <> (Data.Text.unwords $ map (\\l -> pack $ printf \"% 7s\" l) labels)\n", " \n", "paintMatrix euclDistance labelsLimited limitedL" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Problem: the distance depends too much on the length of the text.\n", "\n", 
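"To make the problem concrete, here is a minimal sketch on made-up toy vectors (the names `euclid`, `short`, `long` and `other` are illustrative assumptions, not part of the lecture code or corpus). A vector that points in exactly the same direction as `short` but is ten times longer ends up much farther away than an unrelated vector of similar length:\n",
"\n",
"```haskell\n",
"-- Illustrative sketch: toy vectors, not taken from the legend corpus.\n",
"euclid :: [Double] -> [Double] -> Double\n",
"euclid v1 v2 = sqrt . sum . map (^ 2) $ zipWith (-) v1 v2\n",
"\n",
"short, long, other :: [Double]\n",
"short = [1, 2, 0]\n",
"long = map (* 10) short -- same direction as short, ten times longer\n",
"other = [0, 2, 1] -- different direction, similar length\n",
"\n",
"main :: IO ()\n",
"main = do\n",
"  print (euclid short long) -- sqrt 405, roughly 20.12\n",
"  print (euclid short other) -- sqrt 2, roughly 1.41\n",
"```\n",
"\n",
"In this toy case the length of the vector, not its direction, dominates the distance.\n",
"\n",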
"Rozwi\u0105zanie: znormalizowa\u0107 wektor $v$ do wektora jednostkowego.\n", "\n", "$$ \\vec{1}(v) = \\frac{v}{|v|} $$\n", "\n", "Taki wektor ma d\u0142ugo\u015b\u0107 1!" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1.0" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ " na_ak w_lud ba_hy w_lap ne_dz be_wy zw_oz mo_zu be_wy ba_hy mo_zu be_wy w_lud\n", "na_ak 10.00 0.67 0.66 0.66 0.67 0.67 0.67 0.67 0.67 0.67 0.66 0.67 0.67\n", "w_lud 0.67 10.00 0.67 0.68 0.67 0.66 0.67 0.67 0.68 0.66 0.67 0.67 0.68\n", "ba_hy 0.66 0.67 10.00 0.66 0.67 0.67 0.67 0.67 0.69 0.74 0.66 0.67 0.66\n", "w_lap 0.66 0.68 0.66 10.00 0.66 0.66 0.66 0.66 0.67 0.66 0.66 0.66 0.66\n", "ne_dz 0.67 0.67 0.67 0.66 10.00 0.67 0.67 0.68 0.69 0.68 0.67 0.67 0.68\n", "be_wy 0.67 0.66 0.67 0.66 0.67 10.00 0.66 0.67 0.74 0.66 0.67 0.76 0.66\n", "zw_oz 0.67 0.67 0.67 0.66 0.67 0.66 10.00 0.67 0.67 0.66 0.66 0.67 0.67\n", "mo_zu 0.67 0.67 0.67 0.66 0.68 0.67 0.67 10.00 0.69 0.67 0.69 0.68 0.67\n", "be_wy 0.67 0.68 0.69 0.67 0.69 0.74 0.67 0.69 10.00 0.68 0.67 0.75 0.67\n", "ba_hy 0.67 0.66 0.74 0.66 0.68 0.66 0.66 0.67 0.68 10.00 0.66 0.67 0.66\n", "mo_zu 0.66 0.67 0.66 0.66 0.67 0.67 0.66 0.69 0.67 0.66 10.00 0.67 0.67\n", "be_wy 0.67 0.67 0.67 0.66 0.67 0.76 0.67 0.68 0.75 0.67 0.67 10.00 0.67\n", "w_lud 0.67 0.68 0.66 0.66 0.68 0.66 0.67 0.67 0.67 0.66 0.67 0.67 10.00" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "vectorNorm :: [Double] -> Double\n", "vectorNorm vs = sqrt $ sum $ map (\\x -> x * x) vs\n", "\n", "toUnitVector :: [Double] -> [Double]\n", "toUnitVector vs = map (/ n) vs\n", " where n = vectorNorm vs\n", "\n", "vectorNorm (toUnitVector [3.0, 4.0])\n", "\n", "euclDistanceNormalized :: [Double] -> [Double] -> Double\n", "euclDistanceNormalized v1 v2 = toUnitVector v1 `euclDistance` toUnitVector v2\n", "\n", "euclSim v1 v2 = 1 / (d + 0.1)\n", " where d = 
euclDistanceNormalized v1 v2\n", "\n", "paintMatrix euclSim labelsLimited limitedL" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Cosine similarity\n", "\n", "Cosine similarity, i.e. the cosine of the angle between the vectors, is used more often than Euclidean distance.\n", "\n", "Document vector ($\\vec{V}(d)$): a vector whose components correspond to words.\n", "\n", "$$\\sigma(d_1,d_2) = \\cos\\theta(\\vec{V}(d_1),\\vec{V}(d_2)) = \\frac{\\vec{V}(d_1) \\cdot \\vec{V}(d_2)}{|\\vec{V}(d_1)||\\vec{V}(d_2)|} $$\n", "\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that this is the dot product of the normalized vectors!\n", "\n", "$$\\sigma(d_1,d_2) = \\vec{1}(\\vec{V}(d_1)) \\cdot \\vec{1}(\\vec{V}(d_2)) $$" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1.0" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "(\u2715) :: [Double] -> [Double] -> Double\n", "(\u2715) v1 v2 = sum $ Prelude.zipWith (*) v1 v2\n", "\n", "[2, 1, 0] \u2715 [-2, 5, 10]" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/plain": [ " na_ak w_lud ba_hy w_lap ne_dz be_wy zw_oz mo_zu be_wy ba_hy mo_zu be_wy w_lud\n", "na_ak 1.00 0.02 0.01 0.01 0.03 0.02 0.02 0.04 0.03 0.02 0.01 0.02 0.03\n", "w_lud 0.02 1.00 0.02 0.05 0.04 0.01 0.03 0.04 0.06 0.01 0.02 0.03 0.06\n", "ba_hy 0.01 0.02 1.00 0.01 0.02 0.03 0.03 0.04 0.08 0.22 0.01 0.04 0.01\n", "w_lap 0.01 0.05 0.01 1.00 0.01 0.01 0.00 0.01 0.02 0.00 0.00 0.00 0.00\n", "ne_dz 0.03 0.04 0.02 0.01 1.00 0.04 0.03 0.07 0.08 0.06 0.03 0.03 0.05\n", "be_wy 0.02 0.01 0.03 0.01 0.04 1.00 0.01 0.03 0.21 0.01 0.02 0.25 0.01\n", "zw_oz 0.02 0.03 0.03 0.00 0.03 0.01 1.00 0.04 0.03 0.00 0.01 0.02 0.02\n", "mo_zu 0.04 0.04 0.04 0.01 0.07 0.03 0.04 1.00 0.10 0.02 0.09 0.05 0.04\n", "be_wy 0.03 0.06 
0.08 0.02 0.08 0.21 0.03 0.10 1.00 0.05 0.03 0.24 0.04\n", "ba_hy 0.02 0.01 0.22 0.00 0.06 0.01 0.00 0.02 0.05 1.00 0.01 0.02 0.00\n", "mo_zu 0.01 0.02 0.01 0.00 0.03 0.02 0.01 0.09 0.03 0.01 1.00 0.01 0.02\n", "be_wy 0.02 0.03 0.04 0.00 0.03 0.25 0.02 0.05 0.24 0.02 0.01 1.00 0.02\n", "w_lud 0.03 0.06 0.01 0.00 0.05 0.01 0.02 0.04 0.04 0.00 0.02 0.02 1.00" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "cosineSim v1 v2 = toUnitVector v1 \u2715 toUnitVector v2\n", "\n", "paintMatrix cosineSim labelsLimited limitedL" ] }, { "cell_type": "code", "execution_count": 140, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "na tylnym siedzeniu w autobusie siedzi matka z 7-8 letnim synkiem. naprzeciwko synka siedzi kobieta (zwr\u00f3cona twarz\u0105 do dzieciaka). synek co chwile wymachuje nogami i kopie kobiet\u0119, matka widz\u0105c to nie reaguje na to wog\u00f3le. wreszcie kobieta zwraca uwag\u0119 matce, \u017ceby ta powiedzia\u0142a co\u015b synowi a matka do niej: nie mog\u0119, bo wychowuj\u0119 syna bezstresowo!!! ...ch\u0142opak, kt\u00f3ry sta\u0142 w pobli\u017cu i widzia\u0142 i s\u0142ysza\u0142 ca\u0142e to zaj\u015bcie wyplu\u0142 z ust gum\u0119 do \u017cucia i przyklei\u0142 matce na czo\u0142o i powiedzia\u0142: ja te\u017c by\u0142em bezstresowo wychowywany... autentyczny przypadek w londy\u0144skim autobusie (a tym co przyklei\u0142 matce gum\u0119 na czo\u0142o by\u0142 chyba nawet m\u0142ody Polak)" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "collectionL !! 5" ] }, { "cell_type": "code", "execution_count": 141, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Kr\u00f3tko zwi\u0119\u017ale i na temat. Zastanawia mnie jak ludzie wychowuj\u0105 dzieci. 
Co prawda sam nie mam potomstwa i nie zamierzam mie\u0107 jak narazie (bo to troch\u0119 g\u0142upie mie\u0107 17-letniego tatusia), ale niestety mam przyjemno\u015b\u0107 ogl\u0105da\u0107 efekty wychowawcze niekt\u00f3rych par (dzi\u0119ki znajomym rodzic\u00f3w w r\u00f3\u017cnym wieku). S\u0105 trzy najbardziej znane mi modele wychowania. Surowe, bezstresowe (w moim znaczeniu) i \"bezstresowe\" w mowie potocznej. Zaczynam od tego pierwszego. Jak nazwa wskazuje, jest to surowe wychowanie, oparte na karach cielesnych lub torturach umys\u0142owych. Nie uwa\u017cam tego za dobre wychowanie, bo dziecko jak b\u0119dzie nieco starsze b\u0119dzie si\u0119 ba\u0142o wszystkiego, bo uzna, \u017c jak zrobi co\u015b \u017cle to spotka je kara. Wi\u0119c bicie za r\u00f3\u017cne rzeczy odpada (no chyba, \u017ce dzieciak na serio nabroi to oczywi\u015bcie). Wychowanie bezstresowe z mojego s\u0142ownika oznacza nienara\u017canie dziecka na stresy, pocieszanie w trudnych sytuacjach, za\u0142atwianie problem\u00f3w przez rozmow\u0119 oraz sta\u0142y kontakt z dzieckiem. I to chyba najlepsze. Sam zosta\u0142em tak wychowany i ciesz\u0119 si\u0119 z tego powodu. I oczywi\u015bcie \"wychowanie bezstresowe\". A tu si\u0119 normalnie rozpisz\u0119. Po pierwsze geneza. Wi\u0119c jak dochodzi do takiego wychowania? Odpowied\u017a. Mamusi i tatusiowi si\u0119 zachcia\u0142o bobaska bo to takie malutkie fajniutkie i ooo. Oboje zazdroszcz\u0105 innym parom bo one maj\u0105, a oni nie, wi\u0119c oni te\u017c chc\u0105. No wi\u0119c rodzi im si\u0119 bobasek, chuchaj\u0105 dmuchaj\u0105 na niego p\u00f3ki ma\u0142e. Ale przychodzi ten okres, kiedy dziecko trzeba wychowa\u0107 i kiedy ma si\u0119 na dzieciaka najwi\u0119kszy wp\u0142yw. I tu si\u0119 zaczynaj\u0105 schody. Nagle oboje nie maj\u0105 czasu i m\u00f3wi\u0105 \"Wychowamy go/j\u0105/ich (niepotrzebne skre\u015bli\u0107) bezstresowo.\" Po drugie. Decyzja o sposobie wychowania podj\u0119ta. A wi\u0119c jak to wygl\u0105da? 
Odpowied\u017a. Totalna olewka! Mama i tata baluj\u0105, a dzieciaka zostawiaj\u0105 samemu sobie, albo pod opiek\u0119 babci, kt\u00f3ra r\u00f3wnie\u017c leje na dziecko ciep\u0142ym moczem. Dzieciak ro\u015bnie i ro\u015bnie, nie wie co dobre a co z\u0142e. Przypomnia\u0142a mi si\u0119 pewna, podobno autentyczna scenka. Ch\u0142opak jedzie ze szwagrem autobusem czy tam tramwajem. Na jednym miejscu siedzi starowinka, a na przeciwko niej siedzi lafirynda z brzd\u0105cem na kolanach. No i sobie dzieciak macha n\u00f3\u017ckami i tu ciach i kopn\u0105\u0142 staruszk\u0119 w nog\u0119. Babcia nic sobie z tego nie zrobi\u0142a, a dzieciak nie widz\u0105c reakcji zacz\u0105\u0142 j\u0105 ju\u017c celowo kopa\u0107. Staruszka: Mo\u017ce pani powiedzie\u0107 co\u015b synkowi \u017ceby mnie nie kopa\u0142. Matka: Nie bo ja go wychowuj\u0119 bezstresowo. Szwagier wyci\u0105ga z ust gum\u0119 do \u017cucia i przykleja mamusi na czo\u0142o m\u00f3wi\u0105c: Moja mama te\u017c mnie wychowa\u0142a bezstresowo. Ciekaw jestem ile w tym prawdy by\u0142o, a je\u017celi 100% to czy mamusi si\u0119 odmieni\u0142y pogl\u0105dy. Kto go wie? Po trzecie. Doros\u0142y wychowany bezstresowo. Jaki on jest? Odpowied\u017a. Zupe\u0142nie inny. My\u015bli, \u017ce jest p\u0119pkiem \u015bwiata i \u017ce wszystko musi by\u0107 pod jego dyktando. Pracuj\u0105c w Szwajcarii przy piel\u0119gnacji winogron, syn polskiego kolegi taty zacz\u0105\u0142 rzuca\u0107 we mnie winogronami. Mia\u0142em ochot\u0119 wbi\u0107 mu no\u017cyczki (kt\u00f3rymi podcina\u0142em li\u015bcie) w oczy. A to by\u0142by ciekawy widok. Dzieciak o bia\u0142ych w\u0142osach, sk\u00f3rze i niebieskich oczach sta\u0142by sie albinosem (bo z niebieskich oczu sta\u0142yby sie czerwone jak u bia\u0142ych szczur\u00f3w i myszek). Ojciec sie co prawda na niego wydziera\u0142, \u017ceby nie przeszkadza\u0142, ale jak wida\u0107 dzieciak mia\u0142 to po prostu w dupie. 
Wi\u0119c skoro dziecko nie s\u0142ucha si\u0119 nawet rodzica, to jak w szkole pos\u0142ucha nauczyciela? Jak znajdzie prac\u0119, w kt\u00f3rej b\u0119dzie jaki\u015b szef (chyba, \u017ce sam sobie b\u0119dzie szefem)? W ten oto spos\u00f3b jak dowiaduj\u0119 si\u0119 o tym, \u017ce kto\u015b wychowuje dzieciaka bezstresowo, ciary przechodz\u0105 mi po plecach, a tego\u017c rodzica mam ochot\u0119 paln\u0105\u0107 mu w \u0142eb tak \u017ceby si\u0119 przekr\u0119ci\u0142 (zar\u00f3wno \u0142eb jak i pogl\u0105dy). A jak mnie wychowano? By\u0142em cz\u0119sto sam sobie zostawiany. Ale nie oznacza \u017ce to byla wspomniana olewka. Jako, \u017ce rodzice pracowali, a rodze\u0144stwo chodzi\u0142o do szko\u0142y, podrzucali mnie do babci. A wieczorami si\u0119 mn\u0105 opiekowali. Gadali jak mia\u0142em problemy i nie bili bo pono\u0107 by\u0142em spokojnym dzieckiem. No i tyle. Do 17 urodzin 2 dni, a szczura chyba nie dostan\u0119. A sam nie kupi\u0119!;(" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "collectionL !! 8" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Back to search engines\n", "\n", "We can treat the query as a very short document, vectorize it, and compute the cosine of the angle between the query and each document." ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ja za to znam przypadek, \u017ce kole\u017canka mieszkala w bloku par\u0119 lat temu, pewnego razu wchodzi do \u0142azienki w samej bieli\u017anie a tam ogromny w\u0105\u017c na pod\u0142odze i tak si\u0119 wystraszy\u0142a \u017ce wybieg\u0142a z wrzaskiem z mieszkania i wylecia\u0142a przed blok w samej bieli\u017anie i uciek\u0142a do babci swojej, kt\u00f3ra mieszkala gdzie\u015b niedaleko. 
a potem si\u0119 okaza\u0142o, \u017ce jej s\u0105siad z do\u0142u hodowa\u0142 sobie w\u0119\u017ca i tak w\u0142a\u015bnie swobodnie go \"pasa\u0142\" po mieszkaniu i w\u0105\u017c mu spierdzieli\u0142 przez rur\u0119 w \u0142azience :cool :" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "Pewna dziewczyna, wieku mi nieznanego, w mie\u015bcie sto\u0142ecznym - rozwiod\u0142a si\u0119. By\u0142a sama i samotna, wi\u0119c zapragn\u0119\u0142a kupi\u0107 sobie zwierz\u0119, aby sw\u0105 mi\u0142\u0105 obecno\u015bci\u0105 rozja\u015bnia\u0142o jej puste wieczory i takie\u017c poranki. Dziewczyna by\u0142a najwyra\u017aniej ekscentryczk\u0105, bo zamiast rozkosznego, mi\u0119kkiego kociaka z czerwonym k\u0142\u0119buszkiem we\u0142enki lub kud\u0142atego pieska , co sika na parkiet i gryzie skarpetki - kupi\u0142a sobie ... w\u0119\u017ca. W\u0105\u017c zamieszka\u0142 z dziewczyn\u0105, i dobrze im by\u0142o. Gad jad\u0142, spa\u0142 i r\u00f3s\u0142, a po pierwszym okresie oboj\u0119tno\u015bci ( zw\u0142aszcza ze strony w\u0119\u017ca ) nawi\u0105za\u0142a si\u0119 mi\u0119dzy nimi ni\u0107 porozumienia. Przynajmniej dziewczyna odczuwa\u0142a t\u0119 ni\u0107 wyra\u017anie, gdy\u017c w\u0105\u017c reagowa\u0142 na jej obecno\u015b\u0107, a noc\u0105 spa\u0142 zwini\u0119ty w k\u0142\u0119bek w nogach jej \u0142\u00f3\u017cka. Po dw\u00f3ch latach wsp\u00f3lnego bytowania, nie przerywanych \u017cadnym znacz\u0105cym wydarzeniem w ich wzajemnych relacjach, dziewczyna zauwa\u017cy\u0142a, \u017ce w\u0105\u017c sta\u0142 si\u0119 osowia\u0142y. Przesta\u0142 je\u015b\u0107, chowa\u0142 si\u0119 po k\u0105tach, a nocami, zamiast w nogach \u0142\u00f3\u017cka - sypia\u0142 wyci\u0105gni\u0119ty wzd\u0142u\u017c jej boku. Martwi\u0142a si\u0119 o swojego gada i posz\u0142a z nim do weterynarza. Weterynarz zbada\u0142 go, zapisa\u0142 leki na popraw\u0119 apetytu ( ciekawe, jak si\u0119 bada w\u0119\u017ca ? 
) i odes\u0142a\u0142 do domu. Zdrowie \u015bliskiego pacjenta nie poprawi\u0142o si\u0119, wi\u0119c troskliwa dziewczyna postanowi\u0142a zasi\u0119gn\u0105\u0107 porady u znawcy gad\u00f3w i gadzich obyczaj\u00f3w. Znawca wys\u0142ucha\u0142 opisu niepokoj\u0105cych objaw\u00f3w, i powiedzia\u0142 : - Prosz\u0119 pani. Ten w\u0105\u017c nie jest chory. On teraz po\u015bci. A le\u017cy wzd\u0142u\u017c pani noc\u0105, bo sprawdza, czy pani si\u0119 zmie\u015bci. To prawdziwa historia. Opowiedzia\u0142a nam j\u0105 dzi\u015b klientka. Le\u017c\u0119 na \u0142\u00f3\u017cku, pisze tego posta, i patrz\u0119 na drzemi\u0105c\u0105 obok mnie kotk\u0119. Troch\u0119 ma\u0142a jest. Raczej nie ma szans, \u017cebym sie zmie\u015bci\u0142a, jakby co.." ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "Anakonda. Czy to kolejna miejska legenda? Jaki\u015b czas temu kole\u017canka na jednej z imprez towarzyskich opowiedzia\u0142a mro\u017c\u0105c\u0105 krew w \u017cy\u0142ach histori\u0119 o dziewczynie ze swojej pracy, kt\u00f3ra w Warszawie na dyskotece w Dekadzie pozna\u0142a ch\u0142opaka. Spotyka\u0142a si\u0119 z nim na kaw\u0119 i po drugiej randce dosz\u0142o do poca\u0142unk\u00f3w. Um\u00f3wi\u0142a si\u0119 na trzeci\u0105 randk\u0119, ale zanim do niej dosz\u0142o wyskoczy\u0142 jej jaki\u015b pryszcz na twarzy. Posz\u0142a do lekarza, a ten... zawiadomi\u0142 policj\u0119, prokuratur\u0119 itd. , bo rozpozna\u0142 zara\u017cenie... jadem trupim! Rozpocz\u0119to przes\u0142uchanie dziewczyny i po wyja\u015bnieniach trafiono do ch\u0142opaka, z kt\u00f3rym si\u0119 ca\u0142owa\u0142a. W jego domu odkryto rozk\u0142adaj\u0105ce si\u0119 zw\u0142oki dw\u00f3ch dziewczyn. By\u0142am ta histori\u0105 wstrz\u0105\u015bni\u0119ta. Nast\u0119pnego dnia opowiedzia\u0142am j\u0105 w pracy, a kole\u017canka Justyna przyzna\u0142a, \u017ce ju\u017c o tym slysza\u0142a. To mnie utwierdzi\u0142o, \u017ce historia jest prawdziwa, ale... 
tylko do wieczora. Co\u015b mi nie dawa\u0142o spokoju. Uwaga TVN nic? Interwencja Polsatu - nic? Nasz rodzimy Telekurier nic? Zacz\u0119\u0142am sprawdza\u0107 w internecie co to jest jad trupi, opryszczka od zaka\u017cenia tym\u017ce jadem i tak... trafi\u0142am na miejsk\u0105 legend\u0119. Historia wydarzy\u0142a si\u0119 nie tylko w Warszawie, ale i w Olsztynie, Toruniu, Wroc\u0142awiu i Krakowie, a by\u0107 mo\u017ce w og\u00f3le za granic\u0105. Cho\u0107 prawdopodobne jest, \u017ce nie wydarzy\u0142a si\u0119 nigdy. G\u0142o\u015bno o niej by\u0142o na miejskch forach. Za ka\u017cdym razem ofiara by\u0142a czyj\u0105\u015b znajom\u0105. Po przeczytaniu kolejnej wersji historii zadzwoni\u0142am do kole\u017canki, kt\u00f3ra opowiedzia\u0142a mi t\u0119 histori\u0119 i skl\u0119\u0142am czym \u015bwiat stoi. Dlatego kiedy kilka dni temu inna kole\u017canka opowiedzia\u0142a kolejn\u0105 mro\u017c\u0105c\u0105 krew w \u017cy\u0142ach histori\u0119 - tym razem o anakondzie - rozpocz\u0119\u0142am poszukiwania w internecie czy to nie jest nast\u0119pna miejska legenda. Nic nie znalaz\u0142am. Jednak co\u015b mi nie pasuje, cho\u0107 ta historia mo\u017ce brzmie\u0107 wielce prawdopodobnie. Zw\u0142aszcza, gdy kto\u015b ogl\u0105da\u0142 g\u0142upawy film z J. Lo. Zainteresowa\u0142o mnie to, bo siedz\u0105c nad powie\u015bci\u0105 \"Dzika\" poczyta\u0142am troch\u0119 o w\u0119\u017cach. A o jak\u0105 histori\u0119 mi chodzi? Pewna kobieta (podobno s\u0105siadka tej mojej kole\u017canki z pracy, kt\u00f3ra histori\u0119 opowiada\u0142a) hodowa\u0142a w domu w\u0119\u017ca - anakond\u0119. Hodowa\u0142a j\u0105 pi\u0119\u0107 lat i nie trzyma\u0142a w terrarium. Anakonda chodzi\u0142a (pe\u0142za\u0142a) samopas po domu i co kilka dni dostawa\u0142a chomika, szczura, mysz lub kr\u00f3lika do zjedzenia. Pewnego dnia przesta\u0142a je\u015b\u0107 i zacz\u0119\u0142a si\u0119 dziwnie zachowywa\u0107. 
Ka\u017cdego ranka po przebudzeniu w\u0142a\u015bcicielka znajdowa\u0142a j\u0105 w swoim \u0142\u00f3\u017cku wyprostowan\u0105 jak struna. Po dw\u00f3ch tygodniach takich zachowa\u0144 ze strony anakondy w\u0142a\u015bcicielka zaniepokojona stanem zdrowia ukochanego w\u0119\u017ca posz\u0142a z nim do lekarza. Ten wys\u0142ucha\u0142 objaw\u00f3w \"choroby\" i powiedzia\u0142, \u017ce anakonda g\u0142odzi\u0142a si\u0119, by zje\u015b\u0107... w\u0142ascicielk\u0119. K\u0142adzenie si\u0119 ko\u0142o niej by\u0142o mierzeniem ile jeszcze g\u0142odzi\u0107 si\u0119 trzeba, by w\u0142a\u015bcicielka zmie\u015bci\u0142a si\u0119 w pysku no i badaniem od kt\u00f3rej strony trzeba j\u0105 zaatakowa\u0107. W\u0119\u017cowi chodzi\u0142o bowiem o to, by smakowity i du\u017cy obiad si\u0119 za bardzo nie broni\u0142. Ja domy\u015bli\u0142am si\u0119 od razu do czego zmierza ta historia (lektura artyku\u0142\u00f3w o w\u0119\u017cach zrobi\u0142a swoje), ale dla reszty, kt\u00f3rzy s\u0142uchali by\u0142o to szokiem. Mnie szokuje co innego. Po co trzyma\u0107 w\u0119\u017ca skoro nie ma z nim cz\u0142owiek \u017cadnego kontaktu? To nie pies, kot czy inny ssak. To nie ptak. W\u0105\u017c to w\u0105\u017c! Nie przyjdzie na zawo\u0142anie. Jaby kto\u015b nie wiedzia\u0142 to... W\u0119\u017ce s\u0105 mi\u0119so\u017cerne. Po\u0142ykaj\u0105 ofiary w ca\u0142o\u015bci, mimo \u017ce cz\u0119sto wielokrotnie s\u0105 one wi\u0119ksze od samego w\u0119\u017ca. Po\u0142ykanie polega na nasuwaniu si\u0119 w\u0119\u017ca na swoj\u0105 ofiar\u0119. A anakonda... \u017cyje zwykle w wodzie i na drzewach, \u017cywi\u0105c si\u0119 ssakami (m.in. tapiry, dziki, kapibary, jelenie!, gryzonie, niekiedy nawet jaguary), gadami (kajmany), rybami i ptakami, poluj\u0105c zazwyczaj w nocy. Jest w stanie po\u0142kn\u0105\u0107 ofiar\u0119 znacznie szersz\u0105 od swojego cia\u0142a, co jest mo\u017cliwe dzi\u0119ki rozci\u0105gni\u0119ciu szcz\u0119k. 
Trawienie jest bardzo powolne - po posi\u0142ku w\u0105\u017c trawi wi\u0119ksz\u0105 ofiar\u0119 przez wiele dni, a potem mo\u017ce po\u015bci\u0107 przez szereg tygodni lub miesi\u0119cy. Zanotowany rekord postu, w przypadku anakondy znajduj\u0105cej si\u0119 w niewoli, wynosi 2 lata. Z historii wynika, \u017ce gdyby nie interwencja u weterynarza mog\u0142aby rodzina przez kilka lat szuka\u0107 w\u0142a\u015bcicielki anakondy. My\u015bleliby, \u017ce jest na wycieczce a ona w brzuszku w postaci obiadku. Jest tylko jedno ale. Nigdzie nie znalaz\u0142am jednak \u015bladu, ani nawet wzmianki o tym, \u017ce anakonda zjad\u0142a cz\u0142owieka. I dlatego ci\u0105gle w sumie mam w\u0105tpliwo\u015bci. ps. Dalszy los anakondy \"s\u0105siadki\" kole\u017canki nie jest mi znany." ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import Data.Ord\n", "import Data.List\n", "\n", "-- Normalize the raw query text, then turn it into a TF-IDF vector over the legend vocabulary.\n", "legendVectorizer = vectorizeTfIdf vocLSize collectionLNormalized vocL . normalize\n", "\n", "-- Return the 3 documents most similar to the query q.\n", "-- vs: document vectors; vzer: function turning query text into a vector.\n", "-- Each cosine score is paired with its document index and sorted in descending order.\n", "query vs vzer q = map ((collectionL !!) . snd) $ Data.List.take 3 $ sortBy (\\a b -> fst b `compare` fst a) $ zip (map (`cosineSim` qvec) vs) [0..]\n", " where qvec = vzer q\n", "\n", "-- The Polish query below means: a snake is getting ready to eat its owner.\n", "query lVectorized legendVectorizer \"w\u0105\u017c przymierza si\u0119 do zjedzenia w\u0142a\u015bcicielki\"\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Haskell", "language": "haskell", "name": "haskell" }, "language_info": { "codemirror_mode": "ihaskell", "file_extension": ".hs", "mimetype": "text/x-haskell", "name": "haskell", "pygments_lexer": "Haskell", "version": "8.10.4" }, "author": "Filip Grali\u0144ski", "email": "filipg@amu.edu.pl", "lang": "pl", "subtitle": "3.Wyszukiwarki \u2014 TF-IDF[wyk\u0142ad]", "title": "Ekstrakcja informacji", "year": "2021" }, "nbformat": 4, "nbformat_minor": 4 }