aitech-eks-pub/wyk/03_Tfidf.ipynb

{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"![Logo 1](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech1.jpg)\n",
"<div class=\"alert alert-block alert-info\">\n",
"<h1> Ekstrakcja informacji </h1>\n",
"<h2> 3. <i>Wyszukiwarki \u2014 TF-IDF</i> [wyk\u0142ad]</h2> \n",
"<h3> Filip Grali\u0144ski (2021)</h3>\n",
"</div>\n",
"\n",
"![Logo 2](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech2.jpg)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Wyszukiwarka - szybka i sensowna"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Roboczy przyk\u0142ad\n",
"\n",
"Zak\u0142adamy, \u017ce mamy pewn\u0105 kolekcj\u0119 dokument\u00f3w $D = {d_1, \\ldots, d_N}$. ($N$ - liczba dokument\u00f3w w kolekcji)."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Ala ma kota."
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"{-# LANGUAGE OverloadedStrings #-}\n",
"\n",
"import Data.Text hiding(map, filter, zip)\n",
"import Prelude hiding(words, take)\n",
"\n",
"collectionD :: [Text]\n",
"collectionD = [\"Ala ma kota.\", \"Podobno jest kot w butach.\", \"Ty chyba masz kota!\", \"But chyba zgubi\u0142em.\", \"Kot ma kota.\"]\n",
"\n",
"-- Operator (!!) zwraca element listy o podanym indeksie\n",
"-- (Przy wi\u0119kszych listach b\u0119dzie nieefektywne, ale nie b\u0119dziemy komplikowa\u0107)\n",
"Prelude.head collectionD"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Wydobycie tekstu\n",
"\n",
"Przyk\u0142adowe narz\u0119dzia:\n",
"\n",
"* pdftotext\n",
"* antiword\n",
"* Tesseract OCR\n",
"* Apache Tika - uniwersalne narz\u0119dzie do wydobywania tekstu z r\u00f3\u017cnych format\u00f3w\n",
"\n",
"## Normalizacja tekstu\n",
"\n",
"Cokolwiek robimy z tekstem, najpierw musimy go _znormalizowa\u0107_."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Tokenizacja\n",
"\n",
"Po pierwsze musimy podzieli\u0107 tekst na _tokeny_, czyli wyrazapodobne jednostki.\n",
"Mo\u017ce po prostu podzieli\u0107 po spacjach?"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Ala"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ma"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"kota."
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"tokenizeStupidly :: Text -> [Text]\n",
"-- words to funkcja z Data.Text, kt\u00f3ra dzieli po spacjach\n",
"tokenizeStupidly = words\n",
"\n",
"tokenizeStupidly $ Prelude.head collectionD"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A, trzeba _chocia\u017c_ odsun\u0105\u0107 znaki interpunkcyjne. Najpro\u015bciej u\u017cy\u0107 wyra\u017cenia regularnego. Warto u\u017cy\u0107 [unikodowych w\u0142asno\u015bci](https://en.wikipedia.org/wiki/Unicode_character_property) znak\u00f3w i konstrukcji `\\p{...}`. "
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"But"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"chyba"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"zgubi\u0142em"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"."
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"{-# LANGUAGE QuasiQuotes #-}\n",
"\n",
"import Text.Regex.PCRE.Heavy\n",
"\n",
"tokenize :: Text -> [Text]\n",
"tokenize = map fst . scan [re|C\\+\\+|[\\p{L}0-9]+|\\p{P}|]\n",
"\n",
"tokenize $ collectionD !! 3\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Ca\u0142a kolekcja stokenizowana:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Ala"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ma"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"kota"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"."
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"Podobno"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"jest"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"kot"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"w"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"butach"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"."
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"Ty"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"chyba"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"masz"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"kota"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"!"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"But"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"chyba"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"zgubi\u0142em"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"."
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"Kot"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ma"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"kota"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"."
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"map tokenize collectionD"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Problemy z tokenizacj\u0105\n",
"\n",
"##### J\u0119zyk angielski"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"I"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"use"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"a"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"data"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"-"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"base"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"tokenize \"I use a data-base\""
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"I"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"use"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"a"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"database"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"tokenize \"I use a database\""
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"I"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"use"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"a"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"data"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"base"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"tokenize \"I use a data base\""
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"I"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"don"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"'"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"t"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"like"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"Python"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"'"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"tokenize \"'I don't like Python'\""
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"I"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"can"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"see"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"the"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"Johnes"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"'"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"house"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"tokenize \"I can see the Johnes' house\""
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"I"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"do"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"not"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"like"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"Python"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"tokenize \"I do not like Python\""
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0018"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"555"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"-"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"555"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"-"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"122"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"tokenize \"+0018 555-555-122\""
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0018555555122"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"tokenize \"+0018555555122\""
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Which"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"one"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"is"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"better"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
":"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"C++"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"or"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"C"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"#"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"?"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"tokenize \"Which one is better: C++ or C#?\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Inne j\u0119zyki?"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Rechtsschutzversicherungsgesellschaften"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"wie"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"die"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"HUK"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"-"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"Coburg"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"machen"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"es"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"bereits"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"seit"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"geraumer"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"Zeit"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"vor"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
":"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"tokenize \"Rechtsschutzversicherungsgesellschaften wie die HUK-Coburg machen es bereits seit geraumer Zeit vor:\""
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"\u4eca\u65e5\u6ce2\u5179\u5357\u662f\u8d38\u6613"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"\u3001"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"\u5de5\u4e1a\u53ca\u6559\u80b2\u7684\u4e2d\u5fc3"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"\u3002"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"\u6ce2\u5179\u5357\u662f\u6ce2\u5170\u7b2c\u4e94\u5927\u7684\u57ce\u5e02\u53ca\u7b2c\u56db\u5927\u7684\u5de5\u4e1a\u4e2d\u5fc3"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"\uff0c"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"\u6ce2\u5179\u5357\u4ea6\u662f\u5927\u6ce2\u5170\u7701\u7684\u884c\u653f\u9996\u5e9c"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"\u3002"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"\u4e5f\u8209\u8fa6\u6709\u4e0d\u5c11\u5c55\u89bd\u6703"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"\u3002"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"\u662f\u6ce2\u862d\u897f\u90e8\u91cd\u8981\u7684\u4ea4\u901a\u4e2d\u5fc3\u90fd\u5e02"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"\u3002"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"tokenize \"\u4eca\u65e5\u6ce2\u5179\u5357\u662f\u8d38\u6613\u3001\u5de5\u4e1a\u53ca\u6559\u80b2\u7684\u4e2d\u5fc3\u3002\u6ce2\u5179\u5357\u662f\u6ce2\u5170\u7b2c\u4e94\u5927\u7684\u57ce\u5e02\u53ca\u7b2c\u56db\u5927\u7684\u5de5\u4e1a\u4e2d\u5fc3\uff0c\u6ce2\u5179\u5357\u4ea6\u662f\u5927\u6ce2\u5170\u7701\u7684\u884c\u653f\u9996\u5e9c\u3002\u4e5f\u8209\u8fa6\u6709\u4e0d\u5c11\u5c55\u89bd\u6703\u3002\u662f\u6ce2\u862d\u897f\u90e8\u91cd\u8981\u7684\u4ea4\u901a\u4e2d\u5fc3\u90fd\u5e02\u3002\""
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"l"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"'"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ordinateur"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"tokenize \"l'ordinateur\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Lematyzacja"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"_Lematyzacja_ to sprowadzenie do formy podstawowej (_lematu_), np. \"krze\u015ble\" do \"krzes\u0142o\", \"zrobimy\" do \"zrobi\u0107\" dla j\u0119zyka polskiego, \"chairs\" do \"chair\", \"made\" do \"make\" dla j\u0119zyka angielskiego."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Lematyzacja dla j\u0119zyka polskiego jest bardzo trudna, praktycznie nie spos\u00f3b wykona\u0107 j\u0105 regu\u0142owo, po prostu musimy si\u0119 postara\u0107 o bardzo obszerny _s\u0142ownik form fleksyjnych_.\n",
"\n",
"Na potrzeby tego wyk\u0142adu stw\u00f3rzmy sobie ma\u0142y s\u0142ownik form fleksyjnych w postaci tablicy asocjacyjnej (haszuj\u0105cej)."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<style>/* Styles used for the Hoogle display in the pager */\n",
".hoogle-doc {\n",
"display: block;\n",
"padding-bottom: 1.3em;\n",
"padding-left: 0.4em;\n",
"}\n",
".hoogle-code {\n",
"display: block;\n",
"font-family: monospace;\n",
"white-space: pre;\n",
"}\n",
".hoogle-text {\n",
"display: block;\n",
"}\n",
".hoogle-name {\n",
"color: green;\n",
"font-weight: bold;\n",
"}\n",
".hoogle-head {\n",
"font-weight: bold;\n",
"}\n",
".hoogle-sub {\n",
"display: block;\n",
"margin-left: 0.4em;\n",
"}\n",
".hoogle-package {\n",
"font-weight: bold;\n",
"font-style: italic;\n",
"}\n",
".hoogle-module {\n",
"font-weight: bold;\n",
"}\n",
".hoogle-class {\n",
"font-weight: bold;\n",
"}\n",
".get-type {\n",
"color: green;\n",
"font-weight: bold;\n",
"font-family: monospace;\n",
"display: block;\n",
"white-space: pre-wrap;\n",
"}\n",
".show-type {\n",
"color: green;\n",
"font-weight: bold;\n",
"font-family: monospace;\n",
"margin-left: 1em;\n",
"}\n",
".mono {\n",
"font-family: monospace;\n",
"display: block;\n",
"}\n",
".err-msg {\n",
"color: red;\n",
"font-style: italic;\n",
"font-family: monospace;\n",
"white-space: pre;\n",
"display: block;\n",
"}\n",
"#unshowable {\n",
"color: red;\n",
"font-weight: bold;\n",
"}\n",
".err-msg.in.collapse {\n",
"padding-top: 0.7em;\n",
"}\n",
".highlight-code {\n",
"white-space: pre;\n",
"font-family: monospace;\n",
"}\n",
".suggestion-warning { \n",
"font-weight: bold;\n",
"color: rgb(200, 130, 0);\n",
"}\n",
".suggestion-error { \n",
"font-weight: bold;\n",
"color: red;\n",
"}\n",
".suggestion-name {\n",
"font-weight: bold;\n",
"}\n",
"</style><div class=\"suggestion-name\" style=\"clear:both;\">Use head</div><div class=\"suggestion-row\" style=\"float: left;\"><div class=\"suggestion-warning\">Found:</div><div class=\"highlight-code\" id=\"haskell\">collectionD !! 0</div></div><div class=\"suggestion-row\" style=\"float: left;\"><div class=\"suggestion-warning\">Why Not:</div><div class=\"highlight-code\" id=\"haskell\">head collectionD</div></div>"
],
"text/plain": [
"Line 22: Use head\n",
"Found:\n",
"collectionD !! 0\n",
"Why not:\n",
"head collectionD"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"but"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"butami"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"Ala"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"mie\u0107"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"kot"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"."
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"Wczoraj"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"kupi\u0142em"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"kot"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"."
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import Data.Map as Map hiding(take, map, filter)\n",
"\n",
"mockInflectionDictionary :: Map Text Text\n",
"mockInflectionDictionary = Map.fromList [\n",
" (\"kota\", \"kot\"),\n",
" (\"butach\", \"but\"),\n",
" (\"masz\", \"mie\u0107\"),\n",
" (\"ma\", \"mie\u0107\"),\n",
" (\"buta\", \"but\"),\n",
" (\"zgubi\u0142em\", \"zgubi\u0107\")]\n",
"\n",
"lemmatizeWord :: Map Text Text -> Text -> Text\n",
"lemmatizeWord dict w = findWithDefault w w dict\n",
"\n",
"lemmatizeWord mockInflectionDictionary \"butach\"\n",
"-- a tego nie ma w naszym s\u0142owniczku, wi\u0119c zwracamy to samo\n",
"lemmatizeWord mockInflectionDictionary \"butami\"\n",
"\n",
"lemmatize :: Map Text Text -> [Text] -> [Text]\n",
"lemmatize dict = map (lemmatizeWord dict)\n",
"\n",
"lemmatize mockInflectionDictionary $ tokenize $ collectionD !! 0 \n",
"\n",
"lemmatize mockInflectionDictionary $ tokenize \"Wczoraj kupi\u0142em kota.\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Pytanie**: Nawet w naszym s\u0142owniczku mamy problemy z niejednoznaczno\u015bci\u0105 lematyzacji. Jakie?\n",
"\n",
"Obszerny s\u0142ownik form fleksyjnych dla j\u0119zyka polskiego: http://zil.ipipan.waw.pl/PoliMorf?action=AttachFile&do=view&target=PoliMorf-0.6.7.tab.gz"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Stemowanie\n",
"\n",
"Stemowanie (rdzeniowanie) obcina wyraz do _rdzenia_ niekoniecznie b\u0119d\u0105cego sensownym wyrazem, np. \"krze\u015ble\" mo\u017ce by\u0107 rdzeniowane do \"krze\u015bl\", \"krze\u015b\" albo \"krzes\", \"zrobimy\" do \"zrobi\".\n",
"\n",
"* stemowanie nie jest tak dobrze okre\u015blone jak lematyzacja (mo\u017cna robi\u0107 na wiele sposob\u00f3w)\n",
"* bardziej podatne na metody regu\u0142owe (cho\u0107 dla polskiego i tak trudno)\n",
"* dla angielskiego istniej\u0105 znane algorytmy stemowania, np. [algorytm Portera](https://tartarus.org/martin/PorterStemmer/def.txt)\n",
"* zob. te\u017c [program Snowball](https://snowballstem.org/) z regu\u0142ami dla wielu j\u0119zyk\u00f3w\n",
"\n",
"Prosty stemmer \"dla ubogich\" dla j\u0119zyka polskiego to obcinanie do sze\u015bciu znak\u00f3w."
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"zrobim"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"komput"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"butach"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"\u017ad\u017ab\u0142a"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"poorMansStemming :: Text -> Text\n",
"poorMansStemming = Data.Text.take 6\n",
"\n",
"poorMansStemming \"zrobimy\"\n",
"poorMansStemming \"komputerami\"\n",
"poorMansStemming \"butach\"\n",
"poorMansStemming \"\u017ad\u017ab\u0142ami\"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### _Stop words_\n",
"\n",
"Cz\u0119sto wyszukiwarki pomijaj\u0105 kr\u00f3tkie, cz\u0119ste i nienios\u0105ce znaczenia s\u0142owa - _stop words_ (_s\u0142owa przestankowe_)."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"False"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"True"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"isStopWord :: Text -> Bool\n",
"isStopWord \"w\" = True\n",
"isStopWord \"jest\" = True\n",
"isStopWord \"\u017ce\" = True\n",
"-- przy okazji mo\u017cemy pozby\u0107 si\u0119 znak\u00f3w interpunkcyjnych\n",
"isStopWord w = w \u2248 [re|^\\p{P}+$|]\n",
"\n",
"isStopWord \"kot\"\n",
"isStopWord \"!\"\n"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Ala"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ma"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"kota"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"removeStopWords :: [Text] -> [Text]\n",
"removeStopWords = filter (not . isStopWord)\n",
"\n",
"removeStopWords $ tokenize $ Prelude.head collectionD "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Pytanie**: Jakim zapytaniom usuwanie _stop words_ mo\u017ce szkodzi\u0107? Poda\u0107 przyk\u0142ady dla j\u0119zyka polskiego i angielskiego. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Normalizacja - r\u00f3\u017cno\u015bci\n",
"\n",
"W sk\u0142ad normalizacji mo\u017ce te\u017c wchodzi\u0107:\n",
"\n",
"* poprawianie b\u0142\u0119d\u00f3w literowych\n",
"* sprowadzanie do ma\u0142ych liter (lower-casing czy raczej case-folding)\n",
"* usuwanie znak\u00f3w diakrytycznych\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"\u017cd\u017ab\u0142o"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"toLower \"\u017bD\u0179B\u0141O\""
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"\u017ad\u017ab\u0142o"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"toCaseFold \"\u0179D\u0179B\u0141O\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Pytanie:** Kiedy _case-folding_ da inny wynik ni\u017c _lower-casing_? Jakie to ma praktyczne znaczenie?"
]
},
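{
"cell_type": "markdown",
"metadata": {},
"source": [
"A small illustration (our own addition, not part of the original lecture): for German text, `toCaseFold` and `toLower` can give different results, because full Unicode case folding maps \"\u00df\" to \"ss\"."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"-- a minimal sketch: lower-casing vs case-folding of the German sharp s\n",
"toLower \"Ma\u00dfe\"    -- stays \"ma\u00dfe\"\n",
"toCaseFold \"Ma\u00dfe\"  -- becomes \"masse\""
]
},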
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Normalizacja jako ca\u0142o\u015bciowy proces\n",
"\n",
"Najwa\u017cniejsza zasada: dokumenty w naszej kolekcji powinny by\u0107 normalizowane w dok\u0142adnie taki spos\u00f3b, jak zapytania.\n",
"\n",
"Efektem normalizacji jest zamiana dokumentu na ci\u0105g _term\u00f3w_ (ang. _terms_), czyli znormalizowanych wyraz\u00f3w.\n",
"\n",
"Innymi s\u0142owy po normalizacji dokument $d_i$ traktujemy jako ci\u0105g term\u00f3w $t_i^1,\\dots,t_i^{|d_i|}$."
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"ala"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"mie\u0107"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"kot"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"podobn"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"kot"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"but"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ty"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"chyba"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"mie\u0107"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"kot"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"but"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"chyba"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"zgubi\u0107"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"kot"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"mie\u0107"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"kot"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"normalize :: Text -> [Text]\n",
"normalize = map poorMansStemming . removeStopWords . map toLower . lemmatize mockInflectionDictionary . tokenize\n",
"\n",
"map normalize collectionD"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Zbi\u00f3r wszystkich term\u00f3w w kolekcji dokument\u00f3w nazywamy s\u0142ownikiem (ang. _vocabulary_), nie myli\u0107 ze s\u0142ownikiem jako struktur\u0105 danych w Pythonie (_dictionary_).\n",
"\n",
"$$V = \\bigcup_{i=1}^N \\{t_i^1,\\dots,t_i^{|d_i|}\\}$$\n",
"\n",
"(To zbi\u00f3r, wi\u0119c liczymy bez powt\u00f3rze\u0144!)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"fromList [\"ala\",\"but\",\"chyba\",\"kot\",\"mie\\263\",\"podobn\",\"ty\",\"zgubi\\263\"]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import Data.Set as Set hiding(map)\n",
"\n",
"getVocabulary :: [Text] -> Set Text \n",
"getVocabulary = Set.unions . map (Set.fromList . normalize) \n",
"\n",
"getVocabulary collectionD"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Jak wyszukiwarka mo\u017ce by\u0107 szybka?\n",
"\n",
"_Odwr\u00f3cony indeks_ (ang. _inverted index_) pozwala wyszukiwarce szybko szuka\u0107 w milionach dokument\u00f3w. Odwr\u00f3cony indeks to prostu... indeks, jaki znamy z ksi\u0105\u017cek (mapowanie s\u0142\u00f3w na numery stron/dokument\u00f3w).\n",
"\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<style>/* Styles used for the Hoogle display in the pager */\n",
".hoogle-doc {\n",
"display: block;\n",
"padding-bottom: 1.3em;\n",
"padding-left: 0.4em;\n",
"}\n",
".hoogle-code {\n",
"display: block;\n",
"font-family: monospace;\n",
"white-space: pre;\n",
"}\n",
".hoogle-text {\n",
"display: block;\n",
"}\n",
".hoogle-name {\n",
"color: green;\n",
"font-weight: bold;\n",
"}\n",
".hoogle-head {\n",
"font-weight: bold;\n",
"}\n",
".hoogle-sub {\n",
"display: block;\n",
"margin-left: 0.4em;\n",
"}\n",
".hoogle-package {\n",
"font-weight: bold;\n",
"font-style: italic;\n",
"}\n",
".hoogle-module {\n",
"font-weight: bold;\n",
"}\n",
".hoogle-class {\n",
"font-weight: bold;\n",
"}\n",
".get-type {\n",
"color: green;\n",
"font-weight: bold;\n",
"font-family: monospace;\n",
"display: block;\n",
"white-space: pre-wrap;\n",
"}\n",
".show-type {\n",
"color: green;\n",
"font-weight: bold;\n",
"font-family: monospace;\n",
"margin-left: 1em;\n",
"}\n",
".mono {\n",
"font-family: monospace;\n",
"display: block;\n",
"}\n",
".err-msg {\n",
"color: red;\n",
"font-style: italic;\n",
"font-family: monospace;\n",
"white-space: pre;\n",
"display: block;\n",
"}\n",
"#unshowable {\n",
"color: red;\n",
"font-weight: bold;\n",
"}\n",
".err-msg.in.collapse {\n",
"padding-top: 0.7em;\n",
"}\n",
".highlight-code {\n",
"white-space: pre;\n",
"font-family: monospace;\n",
"}\n",
".suggestion-warning { \n",
"font-weight: bold;\n",
"color: rgb(200, 130, 0);\n",
"}\n",
".suggestion-error { \n",
"font-weight: bold;\n",
"color: red;\n",
"}\n",
".suggestion-name {\n",
"font-weight: bold;\n",
"}\n",
"</style><div class=\"suggestion-name\" style=\"clear:both;\">Use tuple-section</div><div class=\"suggestion-row\" style=\"float: left;\"><div class=\"suggestion-warning\">Found:</div><div class=\"highlight-code\" id=\"haskell\">\\ t -> (t, ix)</div></div><div class=\"suggestion-row\" style=\"float: left;\"><div class=\"suggestion-warning\">Why Not:</div><div class=\"highlight-code\" id=\"haskell\">(, ix)</div></div>"
],
"text/plain": [
"Line 4: Use tuple-section\n",
"Found:\n",
"\\ t -> (t, ix)\n",
"Why not:\n",
"(, ix)"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"fromList [(\"chyba\",2),(\"kot\",2),(\"mie\\263\",2),(\"ty\",2)]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"collectionDNormalized = map normalize collectionD\n",
"\n",
"documentToPostings :: ([Text], Int) -> Set (Text, Int)\n",
"documentToPostings (d, ix) = Set.fromList $ map (\\t -> (t, ix)) d\n",
"\n",
"documentToPostings (collectionDNormalized !! 2, 2) \n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<style>/* Styles used for the Hoogle display in the pager */\n",
".hoogle-doc {\n",
"display: block;\n",
"padding-bottom: 1.3em;\n",
"padding-left: 0.4em;\n",
"}\n",
".hoogle-code {\n",
"display: block;\n",
"font-family: monospace;\n",
"white-space: pre;\n",
"}\n",
".hoogle-text {\n",
"display: block;\n",
"}\n",
".hoogle-name {\n",
"color: green;\n",
"font-weight: bold;\n",
"}\n",
".hoogle-head {\n",
"font-weight: bold;\n",
"}\n",
".hoogle-sub {\n",
"display: block;\n",
"margin-left: 0.4em;\n",
"}\n",
".hoogle-package {\n",
"font-weight: bold;\n",
"font-style: italic;\n",
"}\n",
".hoogle-module {\n",
"font-weight: bold;\n",
"}\n",
".hoogle-class {\n",
"font-weight: bold;\n",
"}\n",
".get-type {\n",
"color: green;\n",
"font-weight: bold;\n",
"font-family: monospace;\n",
"display: block;\n",
"white-space: pre-wrap;\n",
"}\n",
".show-type {\n",
"color: green;\n",
"font-weight: bold;\n",
"font-family: monospace;\n",
"margin-left: 1em;\n",
"}\n",
".mono {\n",
"font-family: monospace;\n",
"display: block;\n",
"}\n",
".err-msg {\n",
"color: red;\n",
"font-style: italic;\n",
"font-family: monospace;\n",
"white-space: pre;\n",
"display: block;\n",
"}\n",
"#unshowable {\n",
"color: red;\n",
"font-weight: bold;\n",
"}\n",
".err-msg.in.collapse {\n",
"padding-top: 0.7em;\n",
"}\n",
".highlight-code {\n",
"white-space: pre;\n",
"font-family: monospace;\n",
"}\n",
".suggestion-warning { \n",
"font-weight: bold;\n",
"color: rgb(200, 130, 0);\n",
"}\n",
".suggestion-error { \n",
"font-weight: bold;\n",
"color: red;\n",
"}\n",
".suggestion-name {\n",
"font-weight: bold;\n",
"}\n",
"</style><div class=\"suggestion-name\" style=\"clear:both;\">Use zipWith</div><div class=\"suggestion-row\" style=\"float: left;\"><div class=\"suggestion-warning\">Found:</div><div class=\"highlight-code\" id=\"haskell\">map documentToPostings $ Prelude.zip coll [0 .. ]</div></div><div class=\"suggestion-row\" style=\"float: left;\"><div class=\"suggestion-warning\">Why Not:</div><div class=\"highlight-code\" id=\"haskell\">zipWith (curry documentToPostings) coll [0 .. ]</div></div>"
],
"text/plain": [
"Line 2: Use zipWith\n",
"Found:\n",
"map documentToPostings $ Prelude.zip coll [0 .. ]\n",
"Why not:\n",
"zipWith (curry documentToPostings) coll [0 .. ]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"fromList [(\"ala\",0),(\"but\",1),(\"but\",3),(\"chyba\",2),(\"chyba\",3),(\"kot\",0),(\"kot\",1),(\"kot\",2),(\"kot\",4),(\"mie\\263\",0),(\"mie\\263\",2),(\"mie\\263\",4),(\"podobn\",1),(\"ty\",2),(\"zgubi\\263\",3)]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"collectionToPostings :: [[Text]] -> Set (Text, Int)\n",
"collectionToPostings coll = Set.unions $ map documentToPostings $ Prelude.zip coll [0..]\n",
"\n",
"collectionToPostings collectionDNormalized"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<style>/* Styles used for the Hoogle display in the pager */\n",
".hoogle-doc {\n",
"display: block;\n",
"padding-bottom: 1.3em;\n",
"padding-left: 0.4em;\n",
"}\n",
".hoogle-code {\n",
"display: block;\n",
"font-family: monospace;\n",
"white-space: pre;\n",
"}\n",
".hoogle-text {\n",
"display: block;\n",
"}\n",
".hoogle-name {\n",
"color: green;\n",
"font-weight: bold;\n",
"}\n",
".hoogle-head {\n",
"font-weight: bold;\n",
"}\n",
".hoogle-sub {\n",
"display: block;\n",
"margin-left: 0.4em;\n",
"}\n",
".hoogle-package {\n",
"font-weight: bold;\n",
"font-style: italic;\n",
"}\n",
".hoogle-module {\n",
"font-weight: bold;\n",
"}\n",
".hoogle-class {\n",
"font-weight: bold;\n",
"}\n",
".get-type {\n",
"color: green;\n",
"font-weight: bold;\n",
"font-family: monospace;\n",
"display: block;\n",
"white-space: pre-wrap;\n",
"}\n",
".show-type {\n",
"color: green;\n",
"font-weight: bold;\n",
"font-family: monospace;\n",
"margin-left: 1em;\n",
"}\n",
".mono {\n",
"font-family: monospace;\n",
"display: block;\n",
"}\n",
".err-msg {\n",
"color: red;\n",
"font-style: italic;\n",
"font-family: monospace;\n",
"white-space: pre;\n",
"display: block;\n",
"}\n",
"#unshowable {\n",
"color: red;\n",
"font-weight: bold;\n",
"}\n",
".err-msg.in.collapse {\n",
"padding-top: 0.7em;\n",
"}\n",
".highlight-code {\n",
"white-space: pre;\n",
"font-family: monospace;\n",
"}\n",
".suggestion-warning { \n",
"font-weight: bold;\n",
"color: rgb(200, 130, 0);\n",
"}\n",
".suggestion-error { \n",
"font-weight: bold;\n",
"color: red;\n",
"}\n",
".suggestion-name {\n",
"font-weight: bold;\n",
"}\n",
"</style><div class=\"suggestion-name\" style=\"clear:both;\">Eta reduce</div><div class=\"suggestion-row\" style=\"float: left;\"><div class=\"suggestion-warning\">Found:</div><div class=\"highlight-code\" id=\"haskell\">updateInvertedIndex (t, ix) invIndex\n",
" = insertWith (++) t [ix] invIndex</div></div><div class=\"suggestion-row\" style=\"float: left;\"><div class=\"suggestion-warning\">Why Not:</div><div class=\"highlight-code\" id=\"haskell\">updateInvertedIndex (t, ix) = insertWith (++) t [ix]</div></div>"
],
"text/plain": [
"Line 2: Eta reduce\n",
"Found:\n",
"updateInvertedIndex (t, ix) invIndex\n",
" = insertWith (++) t [ix] invIndex\n",
"Why not:\n",
"updateInvertedIndex (t, ix) = insertWith (++) t [ix]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"fromList [(\"ala\",[0]),(\"but\",[1,3]),(\"chyba\",[2,3]),(\"kot\",[0,1,2,4]),(\"mie\\263\",[0,2,4]),(\"podobn\",[1]),(\"ty\",[2]),(\"zgubi\\263\",[3])]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"[0,1,2,4]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"updateInvertedIndex :: (Text, Int) -> Map Text [Int] -> Map Text [Int]\n",
"updateInvertedIndex (t, ix) invIndex = insertWith (++) t [ix] invIndex\n",
"\n",
"getInvertedIndex :: [[Text]] -> Map Text [Int]\n",
"getInvertedIndex = Prelude.foldr updateInvertedIndex Map.empty . Set.toList . collectionToPostings\n",
"\n",
"ind = getInvertedIndex collectionDNormalized\n",
"ind\n",
"ind ! \"kot\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Relewantno\u015b\u0107\n",
"\n",
"Potrafimy szybko przeszukiwa\u0107 znormalizowane dokumenty, ale kt\u00f3re dokumenty s\u0105 wa\u017cne (_relewantne_) wzgl\u0119dem potrzeby informacyjnej u\u017cytkownika?\n",
"\n",
"### Zapytania boole'owskie\n",
"\n",
"* `pizzeria Pozna\u0144 dow\u00f3z` to `pizzeria AND Pozna\u0144 AND dow\u00f3z` czy `pizzeria OR Pozna\u0144 OR dow\u00f3z`\n",
"* `(pizzeria OR pizza OR tratoria) AND Pozna\u0144 AND dow\u00f3z\n",
"* `pizzeria AND Pozna\u0144 AND dow\u00f3z AND NOT golonka`\n",
"\n",
"Jak domy\u015blnie interpretowa\u0107 zapytanie?\n",
"\n",
"* jako zapytanie AND -- by\u0107 mo\u017ce za ma\u0142o dokument\u00f3w\n",
"* rozwi\u0105zanie po\u015brednie?\n",
"* jako zapytanie OR -- by\u0107 mo\u017ce za du\u017co dokument\u00f3w\n",
"\n",
"Mo\u017cemy jakie\u015b miary dopasowania dokumentu do zapytania, \u017ceby m\u00f3c posortowa\u0107 dokumenty...\n",
"\n",
"### Mierzenie dopasowania dokumentu do zapytania\n",
"\n",
"Potrzebujemy jakie\u015b funkcji $\\sigma : Q x D \\rightarrow \\mathbb{R}$. \n"
]
},
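{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before we move on to ranking, here is a minimal sketch (our own addition, assuming query terms are normalized exactly like the documents) of how a Boolean AND query can be answered with the inverted index `ind` built above: we simply intersect the posting lists of the query terms."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import Data.List (intersect)\n",
"\n",
"-- AND query: intersect the posting lists of all query terms\n",
"-- (a term absent from the index gets an empty posting list)\n",
"andQuery :: [Text] -> [Int]\n",
"andQuery = Prelude.foldr (\\t acc -> findWithDefault [] t ind `intersect` acc) [0..Prelude.length collectionD - 1]\n",
"\n",
"-- expected: [2], as only \"Ty chyba masz kota!\" contains both terms\n",
"andQuery [\"kot\", \"chyba\"]"
]
},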
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Musimy jako\u015b zamieni\u0107 dokumenty na liczby, tj. dokumenty na wektory liczb, a ca\u0142\u0105 kolekcj\u0119 na macierz.\n",
"\n",
"Po pierwsze ponumerujmy wszystkie termy ze s\u0142ownika."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"fromList [(0,\"ala\"),(1,\"but\"),(2,\"chyba\"),(3,\"kot\"),(4,\"mie\\263\"),(5,\"podobn\"),(6,\"ty\"),(7,\"zgubi\\263\")]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"fromList [(\"ala\",0),(\"but\",1),(\"chyba\",2),(\"kot\",3),(\"mie\\263\",4),(\"podobn\",5),(\"ty\",6),(\"zgubi\\263\",7)]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ala"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"2"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"voc = getVocabulary collectionD\n",
"\n",
"vocD :: Map Int Text\n",
"vocD = Map.fromList $ zip [0..] $ Set.toList voc\n",
"\n",
"invvocD :: Map Text Int\n",
"invvocD = Map.fromList $ zip (Set.toList voc) [0..]\n",
"\n",
"vocD\n",
"\n",
"invvocD\n",
"\n",
"vocD ! 0\n",
"invvocD ! \"chyba\"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Napiszmy funkcj\u0119, kt\u00f3ra _wektoryzuje_ znormalizowany dokument.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<style>/* Styles used for the Hoogle display in the pager */\n",
".hoogle-doc {\n",
"display: block;\n",
"padding-bottom: 1.3em;\n",
"padding-left: 0.4em;\n",
"}\n",
".hoogle-code {\n",
"display: block;\n",
"font-family: monospace;\n",
"white-space: pre;\n",
"}\n",
".hoogle-text {\n",
"display: block;\n",
"}\n",
".hoogle-name {\n",
"color: green;\n",
"font-weight: bold;\n",
"}\n",
".hoogle-head {\n",
"font-weight: bold;\n",
"}\n",
".hoogle-sub {\n",
"display: block;\n",
"margin-left: 0.4em;\n",
"}\n",
".hoogle-package {\n",
"font-weight: bold;\n",
"font-style: italic;\n",
"}\n",
".hoogle-module {\n",
"font-weight: bold;\n",
"}\n",
".hoogle-class {\n",
"font-weight: bold;\n",
"}\n",
".get-type {\n",
"color: green;\n",
"font-weight: bold;\n",
"font-family: monospace;\n",
"display: block;\n",
"white-space: pre-wrap;\n",
"}\n",
".show-type {\n",
"color: green;\n",
"font-weight: bold;\n",
"font-family: monospace;\n",
"margin-left: 1em;\n",
"}\n",
".mono {\n",
"font-family: monospace;\n",
"display: block;\n",
"}\n",
".err-msg {\n",
"color: red;\n",
"font-style: italic;\n",
"font-family: monospace;\n",
"white-space: pre;\n",
"display: block;\n",
"}\n",
"#unshowable {\n",
"color: red;\n",
"font-weight: bold;\n",
"}\n",
".err-msg.in.collapse {\n",
"padding-top: 0.7em;\n",
"}\n",
".highlight-code {\n",
"white-space: pre;\n",
"font-family: monospace;\n",
"}\n",
".suggestion-warning { \n",
"font-weight: bold;\n",
"color: rgb(200, 130, 0);\n",
"}\n",
".suggestion-error { \n",
"font-weight: bold;\n",
"color: red;\n",
"}\n",
".suggestion-name {\n",
"font-weight: bold;\n",
"}\n",
"</style><div class=\"suggestion-name\" style=\"clear:both;\">Redundant $</div><div class=\"suggestion-row\" style=\"float: left;\"><div class=\"suggestion-warning\">Found:</div><div class=\"highlight-code\" id=\"haskell\">map (\\ i -> count (v ! i) doc) $ [0 .. (vecSize - 1)]</div></div><div class=\"suggestion-row\" style=\"float: left;\"><div class=\"suggestion-warning\">Why Not:</div><div class=\"highlight-code\" id=\"haskell\">map (\\ i -> count (v ! i) doc) [0 .. (vecSize - 1)]</div></div><div class=\"suggestion-name\" style=\"clear:both;\">Redundant bracket</div><div class=\"suggestion-row\" style=\"float: left;\"><div class=\"suggestion-warning\">Found:</div><div class=\"highlight-code\" id=\"haskell\">(collectionDNormalized !! 2)</div></div><div class=\"suggestion-row\" style=\"float: left;\"><div class=\"suggestion-warning\">Why Not:</div><div class=\"highlight-code\" id=\"haskell\">collectionDNormalized !! 2</div></div>"
],
"text/plain": [
"Line 2: Redundant $\n",
"Found:\n",
"map (\\ i -> count (v ! i) doc) $ [0 .. (vecSize - 1)]\n",
"Why not:\n",
"map (\\ i -> count (v ! i) doc) [0 .. (vecSize - 1)]Line 9: Redundant bracket\n",
"Found:\n",
"(collectionDNormalized !! 2)\n",
"Why not:\n",
"collectionDNormalized !! 2"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ty"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"chyba"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"mie\u0107"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"kot"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"[0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"vectorize :: Int -> Map Int Text -> [Text] -> [Double]\n",
"vectorize vecSize v doc = map (\\i -> count (v ! i) doc) $ [0..(vecSize-1)]\n",
" where count t doc \n",
" | t `elem` doc = 1.0\n",
" | otherwise = 0.0\n",
" \n",
"vocSize = Set.size voc\n",
"\n",
"(collectionDNormalized !! 2)\n",
"vectorize vocSize vocD (collectionDNormalized !! 2)\n",
"\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" ![image](./macierz.png)"
]
},
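{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick sketch of ours (not part of the original lecture): the whole collection as a term-document matrix, one row per document, using the 0/1 weighting from `vectorize`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"-- one row per document, one column per vocabulary term\n",
"map (vectorize vocSize vocD) collectionDNormalized"
]
},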
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Jak inaczej uwzgl\u0119dni\u0107 cz\u0119sto\u015b\u0107 wyraz\u00f3w?\n",
"\n",
"<div style=\"display:none\">\n",
" $\n",
" \\newcommand{\\idf}{\\mathop{\\rm idf}\\nolimits}\n",
" \\newcommand{\\tf}{\\mathop{\\rm tf}\\nolimits}\n",
" \\newcommand{\\df}{\\mathop{\\rm df}\\nolimits}\n",
" \\newcommand{\\tfidf}{\\mathop{\\rm tfidf}\\nolimits}\n",
" $\n",
"</div>\n",
"\n",
"* $\\tf_{t,d}$ - term frequency\n",
"\n",
"* $1+\\log(\\tf_{t,d})$\n",
"\n",
"* $0.5 + \\frac{0.5 \\times \\tf_{t,d}}{max_t(\\tf_{t,d})}$"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<style>/* Styles used for the Hoogle display in the pager */\n",
".hoogle-doc {\n",
"display: block;\n",
"padding-bottom: 1.3em;\n",
"padding-left: 0.4em;\n",
"}\n",
".hoogle-code {\n",
"display: block;\n",
"font-family: monospace;\n",
"white-space: pre;\n",
"}\n",
".hoogle-text {\n",
"display: block;\n",
"}\n",
".hoogle-name {\n",
"color: green;\n",
"font-weight: bold;\n",
"}\n",
".hoogle-head {\n",
"font-weight: bold;\n",
"}\n",
".hoogle-sub {\n",
"display: block;\n",
"margin-left: 0.4em;\n",
"}\n",
".hoogle-package {\n",
"font-weight: bold;\n",
"font-style: italic;\n",
"}\n",
".hoogle-module {\n",
"font-weight: bold;\n",
"}\n",
".hoogle-class {\n",
"font-weight: bold;\n",
"}\n",
".get-type {\n",
"color: green;\n",
"font-weight: bold;\n",
"font-family: monospace;\n",
"display: block;\n",
"white-space: pre-wrap;\n",
"}\n",
".show-type {\n",
"color: green;\n",
"font-weight: bold;\n",
"font-family: monospace;\n",
"margin-left: 1em;\n",
"}\n",
".mono {\n",
"font-family: monospace;\n",
"display: block;\n",
"}\n",
".err-msg {\n",
"color: red;\n",
"font-style: italic;\n",
"font-family: monospace;\n",
"white-space: pre;\n",
"display: block;\n",
"}\n",
"#unshowable {\n",
"color: red;\n",
"font-weight: bold;\n",
"}\n",
".err-msg.in.collapse {\n",
"padding-top: 0.7em;\n",
"}\n",
".highlight-code {\n",
"white-space: pre;\n",
"font-family: monospace;\n",
"}\n",
".suggestion-warning { \n",
"font-weight: bold;\n",
"color: rgb(200, 130, 0);\n",
"}\n",
".suggestion-error { \n",
"font-weight: bold;\n",
"color: red;\n",
"}\n",
".suggestion-name {\n",
"font-weight: bold;\n",
"}\n",
"</style><div class=\"suggestion-name\" style=\"clear:both;\">Redundant $</div><div class=\"suggestion-row\" style=\"float: left;\"><div class=\"suggestion-warning\">Found:</div><div class=\"highlight-code\" id=\"haskell\">map (\\ i -> count (v ! i) doc) $ [0 .. (vecSize - 1)]</div></div><div class=\"suggestion-row\" style=\"float: left;\"><div class=\"suggestion-warning\">Why Not:</div><div class=\"highlight-code\" id=\"haskell\">map (\\ i -> count (v ! i) doc) [0 .. (vecSize - 1)]</div></div><div class=\"suggestion-name\" style=\"clear:both;\">Redundant bracket</div><div class=\"suggestion-row\" style=\"float: left;\"><div class=\"suggestion-warning\">Found:</div><div class=\"highlight-code\" id=\"haskell\">(collectionDNormalized !! 4)</div></div><div class=\"suggestion-row\" style=\"float: left;\"><div class=\"suggestion-warning\">Why Not:</div><div class=\"highlight-code\" id=\"haskell\">collectionDNormalized !! 4</div></div>"
],
"text/plain": [
"Line 2: Redundant $\n",
"Found:\n",
"map (\\ i -> count (v ! i) doc) $ [0 .. (vecSize - 1)]\n",
"Why not:\n",
"map (\\ i -> count (v ! i) doc) [0 .. (vecSize - 1)]Line 7: Redundant bracket\n",
"Found:\n",
"(collectionDNormalized !! 4)\n",
"Why not:\n",
"collectionDNormalized !! 4"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"kot"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"mie\u0107"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"kot"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"[0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"[0.0,0.0,0.0,2.0,1.0,0.0,0.0,0.0]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"vectorizeTf :: Int -> Map Int Text -> [Text] -> [Double]\n",
"vectorizeTf vecSize v doc = map (\\i -> count (v ! i) doc) $ [0..(vecSize-1)]\n",
" where count t doc = fromIntegral $ (Prelude.length . Prelude.filter (== t)) doc\n",
"\n",
"vocSize = Set.size voc\n",
"\n",
"(collectionDNormalized !! 4)\n",
"vectorize vocSize vocD (collectionDNormalized !! 4)\n",
"vectorizeTf vocSize vocD (collectionDNormalized !! 4)"
]
},
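{
"cell_type": "markdown",
"metadata": {},
"source": [
"The list above also mentions the sublinear scaling $1+\\log(\\tf_{t,d})$; here is a minimal sketch of ours (not part of the original lecture) of a vectorizer using it (a count of zero stays zero)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"-- sublinear (1 + log tf) weighting; tf = 0 is kept as 0\n",
"vectorizeTfLog :: Int -> Map Int Text -> [Text] -> [Double]\n",
"vectorizeTfLog vecSize v doc = map (\\i -> wf (count (v ! i) doc)) [0..(vecSize-1)]\n",
"  where count t d = fromIntegral $ (Prelude.length . Prelude.filter (== t)) d\n",
"        wf 0.0 = 0.0\n",
"        wf tf = 1.0 + log tf\n",
"\n",
"-- e.g. for \"Kot ma kota.\": the entry for \"kot\" becomes 1 + log 2\n",
"vectorizeTfLog vocSize vocD (collectionDNormalized !! 4)"
]
},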
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div style=\"display:none\">\n",
" $\n",
" \\newcommand{\\idf}{\\mathop{\\rm idf}\\nolimits}\n",
" \\newcommand{\\tf}{\\mathop{\\rm tf}\\nolimits}\n",
" \\newcommand{\\df}{\\mathop{\\rm df}\\nolimits}\n",
" \\newcommand{\\tfidf}{\\mathop{\\rm tfidf}\\nolimits}\n",
" $\n",
"</div>\n",
"\n",
"### Odwrotna cz\u0119sto\u015b\u0107 dokumentowa\n",
"\n",
"Czy wszystkie wyrazy s\u0105 tak samo wa\u017cne?\n",
"\n",
"**NIE.** Wyrazy pojawiaj\u0105ce si\u0119 w wielu dokumentach s\u0105 mniej wa\u017cne.\n",
"\n",
"Aby to uwzgl\u0119dni\u0107, przemna\u017camy frekwencj\u0119 wyrazu przez _odwrotn\u0105\n",
" cz\u0119sto\u015b\u0107 w dokumentach_ (_inverse document frequency_):\n",
"\n",
"$$\\idf_t = \\log \\frac{N}{\\df_t},$$\n",
"\n",
"gdzie:\n",
"\n",
"* $\\idf_t$ - odwrotna cz\u0119sto\u015b\u0107 wyrazu $t$ w dokumentach\n",
"\n",
"* $N$ - liczba dokument\u00f3w w kolekcji\n",
"\n",
"* $\\df_f$ - w ilu dokumentach wyst\u0105pi\u0142 wyraz $t$?\n",
"\n",
"#### Dlaczego idf?\n",
"\n",
"term $t$ wyst\u0105pi\u0142...\n",
"\n",
"* w 1 dokumencie, $\\idf_t = \\log N/1 = \\log N$\n",
"* 2 razy w kolekcji, $\\idf_t = \\log N/2$ lub $\\log N$\n",
"* w po\u0142owie dokument\u00f3w kolekcji, $\\idf_t = \\log N/(N/2) = \\log 2$\n",
"* we wszystkich dokumentach, $\\idf_t = \\log N/N = \\log 1 = 0$\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.22314355131420976"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"idf :: [[Text]] -> Text -> Double\n",
"idf coll t = log (fromIntegral n / fromIntegral df)\n",
" where df = Prelude.length $ Prelude.filter (\\d -> t `elem` d) coll\n",
" n = Prelude.length coll\n",
" \n",
"idf collectionDNormalized \"kot\" "
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.9162907318741551"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"idf collectionDNormalized \"chyba\" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Co z tego wynika?\n",
"\n",
"Zamiast $\\tf_{t,d}$ b\u0119dziemy w wektorach rozpatrywa\u0107 warto\u015bci:\n",
"\n",
"$$\\tfidf_{t,d} = \\tf_{t,d} \\times \\idf_{t}$$\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"kot"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"mie\u0107"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"kot"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"[0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"[0.0,0.0,0.0,2.0,1.0,0.0,0.0,0.0]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"[0.0,0.0,0.0,0.44628710262841953,0.5108256237659907,0.0,0.0,0.0]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"vectorizeTfIdf :: Int -> [[Text]] -> Map Int Text -> [Text] -> [Double]\n",
"vectorizeTfIdf vecSize coll v doc = map (\\i -> count (v ! i) doc * idf coll (v ! i)) [0..(vecSize-1)]\n",
" where count t doc = fromIntegral $ (Prelude.length . Prelude.filter (== t)) doc\n",
"\n",
"vocSize = Set.size voc\n",
"\n",
"collectionDNormalized !! 4\n",
"vectorize vocSize vocD (collectionDNormalized !! 4)\n",
"vectorizeTf vocSize vocD (collectionDNormalized !! 4)\n",
"vectorizeTfIdf vocSize collectionDNormalized vocD (collectionDNormalized !! 4)"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[[1.6094379124341003,0.0,0.0,0.22314355131420976,0.5108256237659907,0.0,0.0,0.0],[0.0,0.9162907318741551,0.0,0.22314355131420976,0.0,1.6094379124341003,0.0,0.0],[0.0,0.0,0.9162907318741551,0.22314355131420976,0.5108256237659907,0.0,1.6094379124341003,0.0],[0.0,0.9162907318741551,0.9162907318741551,0.0,0.0,0.0,0.0,1.6094379124341003],[0.0,0.0,0.0,0.44628710262841953,0.5108256237659907,0.0,0.0,0.0]]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"map (vectorizeTfIdf vocSize collectionDNormalized vocD) collectionDNormalized"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Teraz zdefiniujemy _overlap score measure_:\n",
"\n",
"$$\\sigma(q,d) = \\sum_{t \\in q} \\tfidf_{t,d}$$"
]
},
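{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of ours (not part of the original lecture), assuming the query is normalized with the same `normalize` function as the documents and reusing the `idf` defined earlier:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"-- overlap score: sum of tf-idf values over the (distinct) query terms\n",
"overlapScore :: [[Text]] -> Text -> [Text] -> Double\n",
"overlapScore coll q d = sum [tf t * idf coll t | t <- Set.toList $ Set.fromList $ normalize q]\n",
"  where tf t = fromIntegral $ (Prelude.length . Prelude.filter (== t)) d\n",
"\n",
"-- score of the query \"kot ma kota\" against document 4 (\"Kot ma kota.\")\n",
"overlapScore collectionDNormalized \"kot ma kota\" (collectionDNormalized !! 4)"
]
},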
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Podobie\u0144stwo kosinusowe\n",
"\n",
"_Overlap score measure_ nie jest jedyn\u0105 mo\u017cliw\u0105 metryk\u0105, za pomoc\u0105 kt\u00f3rej mo\u017cemy mierzy\u0107 dopasowanie dokumentu do zapytania. Mo\u017cemy r\u00f3wnie\u017c si\u0119gn\u0105\u0107 po intuicje geometryczne (skoro mamy do czynienia z wektorami).\n",
"\n",
"**Pytanie**: Ile wymiar\u00f3w maj\u0105 wektory, na kt\u00f3rych operujemy? Jak \"wygl\u0105daj\u0105\" te wektory? Czy mo\u017cemy wykonywa\u0107 na nich standardowe operacje geometryczne czy te, kt\u00f3re znamy z geometrii liniowej?\n",
"\n",
"#### Podobie\u0144stwo mi\u0119dzy dokumentami\n",
"\n",
"Zajmijmy si\u0119 teraz poszukiwaniem miary mierz\u0105cej podobie\u0144stwo mi\u0119dzy dokumentami $d_1$ i $d_2$ (czyli poszukujemy sensownej funkcji $\\sigma : D x D \\rightarrow \\mathbb{R}$).\n",
"\n",
"**Uwaga** Poj\u0119cia \"miary\" u\u017cywamy nieformalnie, nie spe\u0142nia ona za\u0142o\u017ce\u0144 znanych z teorii miary.\n",
"\n",
"Rozpatrzmy zbiorek tekst\u00f3w legend miejskich z <git://gonito.net/polish-urban-legends>.\n",
"\n",
"(To autentyczne teksty z Internentu, z j\u0119zykiem potocznym, wulgarnym itd.)\n",
"\n",
"```\n",
" git clone git://gonito.net/polish-urban-legends\n",
" paste polish-urban-legends/dev-0/expected.tsv polish-urban-legends/dev-0/in.tsv > legendy.txt\n",
"``` "
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"na_ak"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"Opowie\u015b\u0107 prawdziwa... Olsztyn, akademik, 7 pi\u0119tro, […]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import System.IO\n",
"import Data.List.Split as SP\n",
"\n",
"legendsh <- openFile \"legendy.txt\" ReadMode\n",
"hSetEncoding legendsh utf8\n",
"contents <- hGetContents legendsh\n",
"ls = Prelude.lines contents\n",
"items = map (map pack . SP.splitOn \"\\t\") ls\n",
"Prelude.head items"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"87"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"nbOfLegends = Prelude.length items\n",
"nbOfLegends"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"na_ak"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"w_lud"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ba_hy"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"w_lap"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ne_dz"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"be_wy"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"zw_oz"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"mo_zu"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"be_wy"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ba_hy"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"mo_zu"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"be_wy"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"w_lud"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ne_dz"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ta_ab"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ta_ab"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ta_ab"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"w_lap"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ba_hy"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ne_dz"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ba_hy"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"tr_su"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ne_dz"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ba_hy"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"mo_zu"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"tr_su"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"zw_oz"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ne_dz"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ne_dz"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"w_lud"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"zw_oz"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"zw_oz"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"zw_oz"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ne_dz"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ta_ab"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"zw_oz"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"w_lud"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"na_ak"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"zw_oz"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"w_lap"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"be_wy"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"na_ak"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"zw_oz"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"w_lap"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"na_ak"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ba_hy"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"zw_oz"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"w_lud"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"zw_oz"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"zw_oz"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"mo_zu"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ba_hy"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"zw_oz"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"tr_su"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"na_ak"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ba_hy"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"w_lud"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"w_lud"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"zw_oz"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"tr_su"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"zw_oz"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"w_lud"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"zw_oz"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"zw_oz"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"be_wy"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"tr_su"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"zw_oz"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"na_ak"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ba_hy"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"zw_oz"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ne_dz"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ba_hy"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"na_ak"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"zw_oz"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"w_lud"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"mo_zu"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"mo_zu"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"na_ak"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"w_lap"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ne_dz"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ba_hy"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"mo_zu"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ba_hy"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ne_dz"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"zw_oz"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"tr_su"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ne_dz"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"w_lud"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"Ja podejrzewam \u017ce o polowaniu nie by\u0142o mowy, po prostu znalaz\u0142 martwego szczupaka i skorzysta\u0142 z okazji! Mnie mocno zdziwi\u0142a jego si\u0142a \u017ceby taki p\u00f3\u0142 kilogramowy okaz szczupaka przesuwa\u0107 o par\u0119 metr\u00f3w i to w trzcinach! Szacuneczek. Przypomniala mi sie historia kt\u00f3r\u0105 kiedys zaslyszalem o wlascicielce pytona, ktory nagle polozyl sie wzdluz jej \u0142\u00f3\u017cka. Le\u017ca\u0142 tak wyci\u0105gniety jak struna d\u0142u\u017cszy czas jak nie\u017cywy (a by\u0142 d\u0142ugo\u015bci \u0142\u00f3\u017cka), wi\u0119c kobitka zadzonila do weterynarza co ma robi\u0107. Us\u0142ysza\u0142a \u017ce ma szybko zamkn\u0105\u0107 si\u0119 w \u0142azience i poczeka\u0107 na niego bo pyton j\u0105 mierzy jako potencjaln\u0105 ofiar\u0119 (czy mu si\u0119 zmie\u015bci w brzuchu...). Wierzy\u0107, nie wierzy\u0107? Kiedy\u015b nie wierzy\u0142em ale od kilku dni mam w\u0105tpliwosci... Pozdrawiam"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"labelsL = map Prelude.head items\n",
"labelsL\n",
"collectionL = map (!!1) items\n",
"items !! 1"
]
},
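  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It may help to see how the texts are spread over the legend topics. A minimal sketch, assuming `labelsL` from the cell above and the `containers` package:\n",
    "\n",
    "```haskell\n",
    "import qualified Data.Map.Strict as MS\n",
    "\n",
    "-- count how many texts belong to each topic label\n",
    "labelCounts :: MS.Map Text Int\n",
    "labelCounts = MS.fromListWith (+) [(l, 1) | l <- labelsL]\n",
    "\n",
    "labelCounts\n",
    "```"
   ]
  },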
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"348"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"collectionLNormalized = map normalize collectionL\n",
"voc' = getVocabulary collectionL\n",
"\n",
"vocLSize = Prelude.length voc'\n",
"\n",
"vocL :: Map Int Text\n",
"vocL = Map.fromList $ zip [0..] $ Set.toList voc'\n",
"\n",
"invvocL :: Map Text Int\n",
"invvocL = Map.fromList $ zip (Set.toList voc') [0..]\n",
"\n",
"vocL ! 0\n",
"invvocL ! \"chyba\"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Wektoryzujemy ca\u0142\u0105 kolekcj\u0119:"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.38837067474886433,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.752336051950276,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0647107369924282,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.7727609380946383,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.2078115806331018,0.0,0.0,0.0,0.0,0.0,1.247032293786383,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5947071077466928,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.268683541318364,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.2078115806331018,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.7578579175523736,0.0,0.0,0.0,0.0,0.0,0.3550342544812725,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.7727609380946383,3.367295829986474,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.9395475940384223,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.21437689194643514,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.2878542883066382,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"lVectorized = map (vectorizeTfIdf vocLSize collectionLNormalized vocL) collectionLNormalized\n",
"lVectorized !! 1"
]
},
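  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The vector above is almost entirely zeros. To see which terms actually received a non-zero TF-IDF weight, we can pair the components with the vocabulary. A minimal sketch, assuming `lVectorized` and `vocL` from the cells above:\n",
    "\n",
    "```haskell\n",
    "-- (term, weight) pairs for the non-zero components of a document vector\n",
    "nonZeroTerms :: [Double] -> [(Text, Double)]\n",
    "nonZeroTerms v = [(vocL ! ix, w) | (ix, w) <- zip [0..] v, w > 0.0]\n",
    "\n",
    "nonZeroTerms (lVectorized !! 1)\n",
    "```"
   ]
  },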
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Szukamy funkcji $sigma$, kt\u00f3ra da wysok\u0105 warto\u015b\u0107 dla tekst\u00f3w dotycz\u0105cych tego samego w\u0105tku legendowego (np. $d_1$ i $d_2$ m\u00f3wi\u0105 o w\u0119\u017cu przymierzaj\u0105cym si\u0119 do zjedzenia swojej w\u0142a\u015bcicielki) i nisk\u0105 dla tekst\u00f3w z r\u00f3\u017cnych w\u0105tk\u00f3w (np. $d_1$ opowiada o w\u0119\u017cu ludojadzie, $d_2$ - ba\u0142wanku na hydrancie)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Mo\u017ce po prostu odleg\u0142o\u015b\u0107 euklidesowa, skoro to punkty w wielowymiarowej przestrzeni?"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" 0.00 79.93 78.37 76.57 87.95 81.15 82.77 127.50 124.54 76.42 84.19 78.90 90.90"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import Text.Printf\n",
"import Data.List (take)\n",
"\n",
"formatNumber :: Double -> String\n",
"formatNumber x = printf \"% 7.2f\" x\n",
"\n",
"similarTo :: ([Double] -> [Double] -> Double) -> [[Double]] -> Int -> Text\n",
"similarTo simFun vs ix = pack $ Prelude.unwords $ map (formatNumber . ((vs !! ix) `simFun`)) vs\n",
"\n",
"euclDistance :: [Double] -> [Double] -> Double\n",
"euclDistance v1 v2 = sqrt $ sum $ Prelude.zipWith (\\x1 x2 -> (x1 - x2)**2) v1 v2\n",
"\n",
"limit = 13\n",
"labelsLimited = Data.List.take limit labelsL\n",
"limitedL = Data.List.take limit lVectorized\n",
"\n",
"similarTo euclDistance limitedL 0\n",
"\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" na_ak w_lud ba_hy w_lap ne_dz be_wy zw_oz mo_zu be_wy ba_hy mo_zu be_wy w_lud\n",
"na_ak 0.00 79.93 78.37 76.57 87.95 81.15 82.77 127.50 124.54 76.42 84.19 78.90 90.90\n",
"w_lud 79.93 0.00 38.92 34.35 56.48 44.89 47.21 109.24 104.82 35.33 49.88 39.98 60.20\n",
"ba_hy 78.37 38.92 0.00 30.37 54.23 40.93 43.83 108.15 102.91 27.37 46.95 35.81 58.99\n",
"w_lap 76.57 34.35 30.37 0.00 51.54 37.46 40.86 107.43 103.22 25.22 43.66 32.10 56.53\n",
"ne_dz 87.95 56.48 54.23 51.54 0.00 57.98 60.32 113.66 109.59 50.96 62.17 54.84 70.70\n",
"be_wy 81.15 44.89 40.93 37.46 57.98 0.00 49.55 110.37 100.50 37.77 51.54 37.09 62.92\n",
"zw_oz 82.77 47.21 43.83 40.86 60.32 49.55 0.00 111.11 107.57 41.02 54.07 45.23 64.65\n",
"mo_zu 127.50 109.24 108.15 107.43 113.66 110.37 111.11 0.00 139.57 107.38 109.91 108.20 117.07\n",
"be_wy 124.54 104.82 102.91 103.22 109.59 100.50 107.57 139.57 0.00 102.69 108.32 99.06 113.25\n",
"ba_hy 76.42 35.33 27.37 25.22 50.96 37.77 41.02 107.38 102.69 0.00 43.83 32.08 56.68\n",
"mo_zu 84.19 49.88 46.95 43.66 62.17 51.54 54.07 109.91 108.32 43.83 0.00 47.87 66.40\n",
"be_wy 78.90 39.98 35.81 32.10 54.84 37.09 45.23 108.20 99.06 32.08 47.87 0.00 59.66\n",
"w_lud 90.90 60.20 58.99 56.53 70.70 62.92 64.65 117.07 113.25 56.68 66.40 59.66 0.00"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"paintMatrix :: ([Double] -> [Double] -> Double) -> [Text] -> [[Double]] -> Text\n",
"paintMatrix simFun labels vs = header <> \"\\n\" <> (Data.Text.unlines $ map (\\(lab, ix) -> lab <> \" \" <> similarTo simFun vs ix) $ zip labels [0..(Prelude.length vs - 1)])\n",
" where header = \" \" <> (Data.Text.unwords $ map (\\l -> pack $ printf \"% 7s\" l) labels)\n",
" \n",
"paintMatrix euclDistance labelsLimited limitedL"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Problem: za du\u017co zale\u017cy od d\u0142ugo\u015bci tekstu.\n",
"\n",
"Rozwi\u0105zanie: znormalizowa\u0107 wektor $v$ do wektora jednostkowego.\n",
"\n",
"$$ \\vec{1}(v) = \\frac{v}{|v|} $$\n",
"\n",
"Taki wektor ma d\u0142ugo\u015b\u0107 1!"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1.0"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
" na_ak w_lud ba_hy w_lap ne_dz be_wy zw_oz mo_zu be_wy ba_hy mo_zu be_wy w_lud\n",
"na_ak 10.00 0.67 0.66 0.66 0.67 0.67 0.67 0.67 0.67 0.67 0.66 0.67 0.67\n",
"w_lud 0.67 10.00 0.67 0.68 0.67 0.66 0.67 0.67 0.68 0.66 0.67 0.67 0.68\n",
"ba_hy 0.66 0.67 10.00 0.66 0.67 0.67 0.67 0.67 0.69 0.74 0.66 0.67 0.66\n",
"w_lap 0.66 0.68 0.66 10.00 0.66 0.66 0.66 0.66 0.67 0.66 0.66 0.66 0.66\n",
"ne_dz 0.67 0.67 0.67 0.66 10.00 0.67 0.67 0.68 0.69 0.68 0.67 0.67 0.68\n",
"be_wy 0.67 0.66 0.67 0.66 0.67 10.00 0.66 0.67 0.74 0.66 0.67 0.76 0.66\n",
"zw_oz 0.67 0.67 0.67 0.66 0.67 0.66 10.00 0.67 0.67 0.66 0.66 0.67 0.67\n",
"mo_zu 0.67 0.67 0.67 0.66 0.68 0.67 0.67 10.00 0.69 0.67 0.69 0.68 0.67\n",
"be_wy 0.67 0.68 0.69 0.67 0.69 0.74 0.67 0.69 10.00 0.68 0.67 0.75 0.67\n",
"ba_hy 0.67 0.66 0.74 0.66 0.68 0.66 0.66 0.67 0.68 10.00 0.66 0.67 0.66\n",
"mo_zu 0.66 0.67 0.66 0.66 0.67 0.67 0.66 0.69 0.67 0.66 10.00 0.67 0.67\n",
"be_wy 0.67 0.67 0.67 0.66 0.67 0.76 0.67 0.68 0.75 0.67 0.67 10.00 0.67\n",
"w_lud 0.67 0.68 0.66 0.66 0.68 0.66 0.67 0.67 0.67 0.66 0.67 0.67 10.00"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"vectorNorm :: [Double] -> Double\n",
"vectorNorm vs = sqrt $ sum $ map (\\x -> x * x) vs\n",
"\n",
"toUnitVector :: [Double] -> [Double]\n",
"toUnitVector vs = map (/ n) vs\n",
" where n = vectorNorm vs\n",
"\n",
"vectorNorm (toUnitVector [3.0, 4.0])\n",
"\n",
"euclDistanceNormalized :: [Double] -> [Double] -> Double\n",
"euclDistanceNormalized v1 v2 = toUnitVector v1 `euclDistance` toUnitVector v2\n",
"\n",
"euclSim v1 v2 = 1 / (d + 0.1)\n",
" where d = euclDistanceNormalized v1 v2\n",
"\n",
"paintMatrix euclSim labelsLimited limitedL"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Podobie\u0144stwo kosinusowe\n",
"\n",
"Cz\u0119\u015bciej zamiast odleg\u0142o\u015bci euklidesowej stosuje si\u0119 podobie\u0144stwo kosinusowe, czyli kosinus k\u0105ta mi\u0119dzy wektorami.\n",
"\n",
"Wektor dokumentu ($\\vec{V}(d)$) - wektor, kt\u00f3rego sk\u0142adowe odpowiadaj\u0105 wyrazom.\n",
"\n",
"$$\\sigma(d_1,d_2) = \\cos\\theta(\\vec{V}(d_1),\\vec{V}(d_2)) = \\frac{\\vec{V}(d_1) \\cdot \\vec{V}(d_2)}{|\\vec{V}(d_1)||\\vec{V}(d_2)|} $$\n",
"\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Zauwa\u017cmy, \u017ce jest to iloczyn skalarny znormalizowanych wektor\u00f3w!\n",
"\n",
"$$\\sigma(d_1,d_2) = \\vec{1}(\\vec{V}(d_1)) \\times \\vec{1}(\\vec{V}(d_2)) $$"
]
},
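  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For intuition, a toy example with made-up 3-dimensional count vectors (not taken from our collection):\n",
    "\n",
    "$$\\cos\\theta = \\frac{(1,2,0) \\cdot (2,3,1)}{|(1,2,0)|\\,|(2,3,1)|} = \\frac{8}{\\sqrt{5}\\sqrt{14}} \\approx 0.96$$"
   ]
  },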
{
"cell_type": "code",
"execution_count": 55,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1.0"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"(\u2715) :: [Double] -> [Double] -> Double\n",
"(\u2715) v1 v2 = sum $ Prelude.zipWith (*) v1 v2\n",
"\n",
"[2, 1, 0] \u2715 [-2, 5, 10]"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" na_ak w_lud ba_hy w_lap ne_dz be_wy zw_oz mo_zu be_wy ba_hy mo_zu be_wy w_lud\n",
"na_ak 1.00 0.02 0.01 0.01 0.03 0.02 0.02 0.04 0.03 0.02 0.01 0.02 0.03\n",
"w_lud 0.02 1.00 0.02 0.05 0.04 0.01 0.03 0.04 0.06 0.01 0.02 0.03 0.06\n",
"ba_hy 0.01 0.02 1.00 0.01 0.02 0.03 0.03 0.04 0.08 0.22 0.01 0.04 0.01\n",
"w_lap 0.01 0.05 0.01 1.00 0.01 0.01 0.00 0.01 0.02 0.00 0.00 0.00 0.00\n",
"ne_dz 0.03 0.04 0.02 0.01 1.00 0.04 0.03 0.07 0.08 0.06 0.03 0.03 0.05\n",
"be_wy 0.02 0.01 0.03 0.01 0.04 1.00 0.01 0.03 0.21 0.01 0.02 0.25 0.01\n",
"zw_oz 0.02 0.03 0.03 0.00 0.03 0.01 1.00 0.04 0.03 0.00 0.01 0.02 0.02\n",
"mo_zu 0.04 0.04 0.04 0.01 0.07 0.03 0.04 1.00 0.10 0.02 0.09 0.05 0.04\n",
"be_wy 0.03 0.06 0.08 0.02 0.08 0.21 0.03 0.10 1.00 0.05 0.03 0.24 0.04\n",
"ba_hy 0.02 0.01 0.22 0.00 0.06 0.01 0.00 0.02 0.05 1.00 0.01 0.02 0.00\n",
"mo_zu 0.01 0.02 0.01 0.00 0.03 0.02 0.01 0.09 0.03 0.01 1.00 0.01 0.02\n",
"be_wy 0.02 0.03 0.04 0.00 0.03 0.25 0.02 0.05 0.24 0.02 0.01 1.00 0.02\n",
"w_lud 0.03 0.06 0.01 0.00 0.05 0.01 0.02 0.04 0.04 0.00 0.02 0.02 1.00"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"cosineSim v1 v2 = toUnitVector v1 \u2715 toUnitVector v2\n",
"\n",
"paintMatrix cosineSim labelsLimited limitedL"
]
},
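  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a sanity check, the dot product of the unit vectors should agree with the direct definition $\\frac{\\vec{V}(d_1) \\cdot \\vec{V}(d_2)}{|\\vec{V}(d_1)||\\vec{V}(d_2)|}$. A minimal sketch, assuming `cosineSim`, the `(\u2715)` operator, `vectorNorm` and `lVectorized` from the cells above:\n",
    "\n",
    "```haskell\n",
    "-- direct definition: dot product divided by the product of the norms\n",
    "cosineSimDirect :: [Double] -> [Double] -> Double\n",
    "cosineSimDirect v1 v2 = (v1 \u2715 v2) / (vectorNorm v1 * vectorNorm v2)\n",
    "\n",
    "cosineSim (lVectorized !! 0) (lVectorized !! 1)\n",
    "cosineSimDirect (lVectorized !! 0) (lVectorized !! 1)\n",
    "```"
   ]
  },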
{
"cell_type": "code",
"execution_count": 140,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"na tylnym siedzeniu w autobusie siedzi matka z 7-8 letnim synkiem. […]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"collectionL !! 5"
]
},
{
"cell_type": "code",
"execution_count": 141,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Kr\u00f3tko zwi\u0119\u017ale i na temat. Zastanawia mnie jak ludzie wychowuj\u0105 dzieci. […]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"collectionL !! 8"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Z powrotem do wyszukiwarek\n",
"\n",
"Mo\u017cemy potraktowa\u0107 zapytanie jako bardzo kr\u00f3tki dokument, dokona\u0107 jego wektoryzacji i policzy\u0107 cosinus k\u0105ta mi\u0119dzy zapytaniem a dokumentem."
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"ja za to znam przypadek, \u017ce kole\u017canka mieszkala w bloku par\u0119 lat temu, pewnego razu wchodzi do \u0142azienki w samej bieli\u017anie a tam ogromny w\u0105\u017c na pod\u0142odze i tak si\u0119 wystraszy\u0142a \u017ce wybieg\u0142a z wrzaskiem z mieszkania i wylecia\u0142a przed blok w samej bieli\u017anie i uciek\u0142a do babci swojej, kt\u00f3ra mieszkala gdzie\u015b niedaleko. a potem si\u0119 okaza\u0142o, \u017ce jej s\u0105siad z do\u0142u hodowa\u0142 sobie w\u0119\u017ca i tak w\u0142a\u015bnie swobodnie go \"pasa\u0142\" po mieszkaniu i w\u0105\u017c mu spierdzieli\u0142 przez rur\u0119 w \u0142azience :cool :"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"Pewna dziewczyna, wieku mi nieznanego, w mie\u015bcie sto\u0142ecznym - rozwiod\u0142a si\u0119. By\u0142a sama i samotna, wi\u0119c zapragn\u0119\u0142a kupi\u0107 sobie zwierz\u0119, […]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"Anakonda. Czy to kolejna miejska legenda? Jaki\u015b czas temu kole\u017canka na jednej z imprez towarzyskich opowiedzia\u0142a mro\u017c\u0105c\u0105 krew w \u017cy\u0142ach histori\u0119 o dziewczynie ze swojej pracy, kt\u00f3ra w Warszawie na dyskotece w Dekadzie pozna\u0142a ch\u0142opaka. […]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import Data.Ord\n",
"import Data.List\n",
"\n",
"legendVectorizer = vectorizeTfIdf vocLSize collectionLNormalized vocL . normalize\n",
"\n",
"\n",
"query vs vzer q = map ((collectionL !!) . snd) $ Data.List.take 3 $ sortBy (\\a b -> fst b `compare` fst a) $ zip (map (`cosineSim` qvec) vs) [0..] \n",
" where qvec = vzer q \n",
"\n",
"query lVectorized legendVectorizer \"w\u0105\u017c przymierza si\u0119 do zjedzenia w\u0142a\u015bcicielki\"\n",
"\n"
]
},
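  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The same machinery can also return topic labels together with the scores, which makes it easier to check whether the ranking is sensible. A minimal sketch, assuming `lVectorized`, `legendVectorizer`, `labelsL` and `cosineSim` from the cells above (the helper name `queryWithScores` is made up for this example):\n",
    "\n",
    "```haskell\n",
    "-- top-k (cosine score, topic label) pairs for a query\n",
    "queryWithScores :: Int -> Text -> [(Double, Text)]\n",
    "queryWithScores k q = Data.List.take k\n",
    "                      $ sortBy (\\a b -> fst b `compare` fst a)\n",
    "                      $ zip (map (`cosineSim` qvec) lVectorized) labelsL\n",
    "  where qvec = legendVectorizer q\n",
    "\n",
    "queryWithScores 5 \"w\u0105\u017c przymierza si\u0119 do zjedzenia w\u0142a\u015bcicielki\"\n",
    "```"
   ]
  },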
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Haskell",
"language": "haskell",
"name": "haskell"
},
"language_info": {
"codemirror_mode": "ihaskell",
"file_extension": ".hs",
"mimetype": "text/x-haskell",
"name": "haskell",
"pygments_lexer": "Haskell",
"version": "8.10.4"
},
"author": "Filip Grali\u0144ski",
"email": "filipg@amu.edu.pl",
"lang": "pl",
"subtitle": "3.Wyszukiwarki \u2014 TF-IDF[wyk\u0142ad]",
"title": "Ekstrakcja informacji",
"year": "2021"
},
"nbformat": 4,
"nbformat_minor": 4
}