aitech-eks-pub/cw/03a_tfidf.ipynb

{
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {
                "collapsed": false
            },
            "source": [
                "![Logo 1](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech1.jpg)\n",
                "<div class=\"alert alert-block alert-info\">\n",
                "<h1> Information Extraction </h1>\n",
                "<h2> 3. <i>tfidf (1)</i>  [exercises]</h2> \n",
                "<h3> Jakub Pokrywka (2021)</h3>\n",
                "</div>\n",
                "\n",
                "![Logo 2](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech2.jpg)"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "# Class 2\n",
                "\n",
                "In this class you can earn 5 points for each valuable contribution to the discussion. One person can earn at most 15 points during these exercises."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "outputs": [],
            "source": [
                "import numpy as np\n",
                "import re"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## Document collection"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 2,
            "metadata": {},
            "outputs": [],
            "source": [
                "documents = ['Ala lubi zwierz\u0119ta i ma kota oraz psa!',\n",
                "             'Ola lubi zwierz\u0119ta oraz ma kota a tak\u017ce chomika!',\n",
                "             'I Jan je\u017adzi na rowerze.',\n",
                "             '2 wojna \u015bwiatowa by\u0142a wielkim konfliktem zbrojnym',\n",
                "             'Tomek lubi psy, ma psa  i je\u017adzi na motorze i rowerze.',\n",
                "            ]"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "### WHAT DO WE WANT?\n",
                "- we want to turn the texts into a set of words\n",
                "\n",
                "\n",
                "### QUESTION\n",
                "- can we tokenize a text with, e.g., `document.split(' ')`? What problems would that cause?"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## preprocessing"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 3,
            "metadata": {},
            "outputs": [],
            "source": [
                "def get_str_cleaned(str_dirty):\n",
                "    punctuation = '!\"#$%&\\'()*+,-./:;<=>?@[\\\\]^_`{|}~'\n",
                "    new_str = str_dirty.lower()\n",
                "    # strip punctuation first, then collapse any runs of spaces left behind\n",
                "    for char in punctuation:\n",
                "        new_str = new_str.replace(char, '')\n",
                "    new_str = re.sub(' +', ' ', new_str)\n",
                "    return new_str\n"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 4,
            "metadata": {},
            "outputs": [],
            "source": [
                "sample_document = get_str_cleaned(documents[0])"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 5,
            "metadata": {},
            "outputs": [
                {
                    "data": {
                        "text/plain": [
                            "'ala lubi zwierz\u0119ta i ma kota oraz psa'"
                        ]
                    },
                    "execution_count": 5,
                    "metadata": {},
                    "output_type": "execute_result"
                }
            ],
            "source": [
                "sample_document"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## Tokenization"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 6,
            "metadata": {},
            "outputs": [],
            "source": [
                "def tokenize_str(document):\n",
                "    return document.split(' ')"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 7,
            "metadata": {},
            "outputs": [
                {
                    "data": {
                        "text/plain": [
                            "['ala', 'lubi', 'zwierz\u0119ta', 'i', 'ma', 'kota', 'oraz', 'psa']"
                        ]
                    },
                    "execution_count": 7,
                    "metadata": {},
                    "output_type": "execute_result"
                }
            ],
            "source": [
                "tokenize_str(sample_document)"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 8,
            "metadata": {},
            "outputs": [],
            "source": [
                "documents_cleaned = [get_str_cleaned(d) for d in documents]"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 9,
            "metadata": {},
            "outputs": [
                {
                    "data": {
                        "text/plain": [
                            "['ala lubi zwierz\u0119ta i ma kota oraz psa',\n",
                            " 'ola lubi zwierz\u0119ta oraz ma kota a tak\u017ce chomika',\n",
                            " 'i jan je\u017adzi na rowerze',\n",
                            " '2 wojna \u015bwiatowa by\u0142a wielkim konfliktem zbrojnym',\n",
                            " 'tomek lubi psy ma psa i je\u017adzi na motorze i rowerze']"
                        ]
                    },
                    "execution_count": 9,
                    "metadata": {},
                    "output_type": "execute_result"
                }
            ],
            "source": [
                "documents_cleaned"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 10,
            "metadata": {},
            "outputs": [],
            "source": [
                "documents_tokenized = [tokenize_str(d) for d in documents_cleaned]"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 11,
            "metadata": {},
            "outputs": [
                {
                    "data": {
                        "text/plain": [
                            "[['ala', 'lubi', 'zwierz\u0119ta', 'i', 'ma', 'kota', 'oraz', 'psa'],\n",
                            " ['ola', 'lubi', 'zwierz\u0119ta', 'oraz', 'ma', 'kota', 'a', 'tak\u017ce', 'chomika'],\n",
                            " ['i', 'jan', 'je\u017adzi', 'na', 'rowerze'],\n",
                            " ['2', 'wojna', '\u015bwiatowa', 'by\u0142a', 'wielkim', 'konfliktem', 'zbrojnym'],\n",
                            " ['tomek',\n",
                            "  'lubi',\n",
                            "  'psy',\n",
                            "  'ma',\n",
                            "  'psa',\n",
                            "  'i',\n",
                            "  'je\u017adzi',\n",
                            "  'na',\n",
                            "  'motorze',\n",
                            "  'i',\n",
                            "  'rowerze']]"
                        ]
                    },
                    "execution_count": 11,
                    "metadata": {},
                    "output_type": "execute_result"
                }
            ],
            "source": [
                "documents_tokenized"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## QUESTIONS\n",
                "- what is the next step towards building TF or TF-IDF vectors?\n",
                "- what size will a TF or TF-IDF vector have?\n"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 12,
            "metadata": {},
            "outputs": [],
            "source": [
                "vocabulary = []\n",
                "for document in documents_tokenized:\n",
                "    for word in document:\n",
                "        vocabulary.append(word)\n",
                "vocabulary = sorted(set(vocabulary))"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 13,
            "metadata": {},
            "outputs": [
                {
                    "data": {
                        "text/plain": [
                            "['2',\n",
                            " 'a',\n",
                            " 'ala',\n",
                            " 'by\u0142a',\n",
                            " 'chomika',\n",
                            " 'i',\n",
                            " 'jan',\n",
                            " 'je\u017adzi',\n",
                            " 'konfliktem',\n",
                            " 'kota',\n",
                            " 'lubi',\n",
                            " 'ma',\n",
                            " 'motorze',\n",
                            " 'na',\n",
                            " 'ola',\n",
                            " 'oraz',\n",
                            " 'psa',\n",
                            " 'psy',\n",
                            " 'rowerze',\n",
                            " 'tak\u017ce',\n",
                            " 'tomek',\n",
                            " 'wielkim',\n",
                            " 'wojna',\n",
                            " 'zbrojnym',\n",
                            " 'zwierz\u0119ta',\n",
                            " '\u015bwiatowa']"
                        ]
                    },
                    "execution_count": 13,
                    "metadata": {},
                    "output_type": "execute_result"
                }
            ],
            "source": [
                "vocabulary"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## QUESTION\n",
                "\n",
                "what will the word \"jak\" look like in the TF vector representation?"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "### TASK 1 Write a function word_to_index(word: str) that returns a one-hot vector as a numpy array"
            ]
        },
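        {
            "cell_type": "code",
            "execution_count": 14,
            "metadata": {},
            "outputs": [],
            "source": [
                "def word_to_index(word):\n",
                "    # one-hot vector over the sorted vocabulary; unknown words give all zeros\n",
                "    return (np.array(vocabulary) == word).astype(float)"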
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 16,
            "metadata": {},
            "outputs": [
                {
                    "data": {
                        "text/plain": [
                            "array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,\n",
                            "       0., 0., 0., 0., 0., 0., 0., 0., 0.])"
                        ]
                    },
                    "execution_count": 16,
                    "metadata": {},
                    "output_type": "execute_result"
                }
            ],
            "source": [
                "word_to_index('psa')"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "### TASK 2 Write a function that takes a list of words and converts it into a TF vector\n"
            ]
        },
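        {
            "cell_type": "code",
            "execution_count": 17,
            "metadata": {},
            "outputs": [],
            "source": [
                "def tf(document):\n",
                "    # raw term counts: the sum of the one-hot vectors of all tokens\n",
                "    return sum((word_to_index(w) for w in document), np.zeros(len(vocabulary)))"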
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 18,
            "metadata": {},
            "outputs": [],
            "source": []
        },
        {
            "cell_type": "code",
            "execution_count": 19,
            "metadata": {},
            "outputs": [
                {
                    "data": {
                        "text/plain": [
                            "array([0., 0., 1., 0., 0., 1., 0., 0., 0., 1., 1., 1., 0., 0., 0., 1., 1.,\n",
                            "       0., 0., 0., 0., 0., 0., 0., 1., 0.])"
                        ]
                    },
                    "execution_count": 19,
                    "metadata": {},
                    "output_type": "execute_result"
                }
            ],
            "source": [
                "tf(documents_tokenized[0])"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 20,
            "metadata": {},
            "outputs": [],
            "source": [
                "documents_vectorized = list()\n",
                "for document in documents_tokenized:\n",
                "    document_vector = tf(document)\n",
                "    documents_vectorized.append(document_vector)"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 21,
            "metadata": {
                "scrolled": true
            },
            "outputs": [
                {
                    "data": {
                        "text/plain": [
                            "[array([0., 0., 1., 0., 0., 1., 0., 0., 0., 1., 1., 1., 0., 0., 0., 1., 1.,\n",
                            "        0., 0., 0., 0., 0., 0., 0., 1., 0.]),\n",
                            " array([0., 1., 0., 0., 1., 0., 0., 0., 0., 1., 1., 1., 0., 0., 1., 1., 0.,\n",
                            "        0., 0., 1., 0., 0., 0., 0., 1., 0.]),\n",
                            " array([0., 0., 0., 0., 0., 1., 1., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0.,\n",
                            "        0., 1., 0., 0., 0., 0., 0., 0., 0.]),\n",
                            " array([1., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,\n",
                            "        0., 0., 0., 0., 1., 1., 1., 0., 1.]),\n",
                            " array([0., 0., 0., 0., 0., 2., 0., 1., 0., 0., 1., 1., 1., 1., 0., 0., 1.,\n",
                            "        1., 1., 0., 1., 0., 0., 0., 0., 0.])]"
                        ]
                    },
                    "execution_count": 21,
                    "metadata": {},
                    "output_type": "execute_result"
                }
            ],
            "source": [
                "documents_vectorized"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "### IDF"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 22,
            "metadata": {},
            "outputs": [
                {
                    "data": {
                        "text/plain": [
                            "array([5.        , 5.        , 5.        , 5.        , 5.        ,\n",
                            "       1.66666667, 5.        , 2.5       , 5.        , 2.5       ,\n",
                            "       1.66666667, 1.66666667, 5.        , 2.5       , 5.        ,\n",
                            "       2.5       , 2.5       , 5.        , 2.5       , 5.        ,\n",
                            "       5.        , 5.        , 5.        , 5.        , 2.5       ,\n",
                            "       5.        ])"
                        ]
                    },
                    "metadata": {},
                    "output_type": "display_data"
                }
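            ],
            "source": [
                "# IDF here is the plain ratio N / df (no logarithm):\n",
                "# number of documents divided by the number of documents containing the term\n",
                "idf = len(documents_vectorized) / np.sum(np.array(documents_vectorized) != 0, axis=0)\n",
                "display(idf)"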
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 23,
            "metadata": {},
            "outputs": [],
            "source": [
                "for i in range(len(documents_vectorized)):\n",
                "    documents_vectorized[i] = documents_vectorized[i]# * idf"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "### ZADANIE 3 Napisa\u0107 funkcj\u0119 similarity, kt\u00f3ra zwraca podobie\u0144stwo kosinusowe mi\u0119dzy dwoma dokumentami w postaci zwektoryzowanej"
            ]
        },
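        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "def similarity(query, document):\n",
                "    numerator = np.sum(query * document)\n",
                "    denominator = np.sqrt(np.sum(query * query)) * np.sqrt(np.sum(document * document))\n",
                "    return numerator / denominator"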
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 25,
            "metadata": {},
            "outputs": [
                {
                    "data": {
                        "text/plain": [
                            "'Ala lubi zwierz\u0119ta i ma kota oraz psa!'"
                        ]
                    },
                    "execution_count": 25,
                    "metadata": {},
                    "output_type": "execute_result"
                }
            ],
            "source": [
                "documents[0]"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 26,
            "metadata": {},
            "outputs": [
                {
                    "data": {
                        "text/plain": [
                            "array([0., 0., 1., 0., 0., 1., 0., 0., 0., 1., 1., 1., 0., 0., 0., 1., 1.,\n",
                            "       0., 0., 0., 0., 0., 0., 0., 1., 0.])"
                        ]
                    },
                    "execution_count": 26,
                    "metadata": {},
                    "output_type": "execute_result"
                }
            ],
            "source": [
                "documents_vectorized[0]"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 27,
            "metadata": {},
            "outputs": [
                {
                    "data": {
                        "text/plain": [
                            "'Ola lubi zwierz\u0119ta oraz ma kota a tak\u017ce chomika!'"
                        ]
                    },
                    "execution_count": 27,
                    "metadata": {},
                    "output_type": "execute_result"
                }
            ],
            "source": [
                "documents[1]"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 28,
            "metadata": {},
            "outputs": [
                {
                    "data": {
                        "text/plain": [
                            "array([0., 1., 0., 0., 1., 0., 0., 0., 0., 1., 1., 1., 0., 0., 1., 1., 0.,\n",
                            "       0., 0., 1., 0., 0., 0., 0., 1., 0.])"
                        ]
                    },
                    "execution_count": 28,
                    "metadata": {},
                    "output_type": "execute_result"
                }
            ],
            "source": [
                "documents_vectorized[1]"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 29,
            "metadata": {},
            "outputs": [
                {
                    "data": {
                        "text/plain": [
                            "0.5892556509887895"
                        ]
                    },
                    "execution_count": 29,
                    "metadata": {},
                    "output_type": "execute_result"
                }
            ],
            "source": [
                "similarity(documents_vectorized[0],documents_vectorized[1])"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 30,
            "metadata": {},
            "outputs": [],
            "source": [
                "def transform_query(query):\n",
                "    query_vector = tf(tokenize_str(get_str_cleaned(query)))\n",
                "    return query_vector"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 31,
            "metadata": {},
            "outputs": [
                {
                    "data": {
                        "text/plain": [
                            "array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,\n",
                            "       0., 0., 0., 0., 0., 0., 0., 0., 0.])"
                        ]
                    },
                    "execution_count": 31,
                    "metadata": {},
                    "output_type": "execute_result"
                }
            ],
            "source": [
                "transform_query('psa')"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 32,
            "metadata": {},
            "outputs": [
                {
                    "data": {
                        "text/plain": [
                            "0.4999999999999999"
                        ]
                    },
                    "execution_count": 32,
                    "metadata": {},
                    "output_type": "execute_result"
                }
            ],
            "source": [
                "similarity(transform_query('psa kota'), documents_vectorized[0])"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 33,
            "metadata": {},
            "outputs": [
                {
                    "data": {
                        "text/plain": [
                            "'Ala lubi zwierz\u0119ta i ma kota oraz psa!'"
                        ]
                    },
                    "metadata": {},
                    "output_type": "display_data"
                },
                {
                    "data": {
                        "text/plain": [
                            "0.4999999999999999"
                        ]
                    },
                    "metadata": {},
                    "output_type": "display_data"
                },
                {
                    "data": {
                        "text/plain": [
                            "'Ola lubi zwierz\u0119ta oraz ma kota a tak\u017ce chomika!'"
                        ]
                    },
                    "metadata": {},
                    "output_type": "display_data"
                },
                {
                    "data": {
                        "text/plain": [
                            "0.2357022603955158"
                        ]
                    },
                    "metadata": {},
                    "output_type": "display_data"
                },
                {
                    "data": {
                        "text/plain": [
                            "'I Jan je\u017adzi na rowerze.'"
                        ]
                    },
                    "metadata": {},
                    "output_type": "display_data"
                },
                {
                    "data": {
                        "text/plain": [
                            "0.0"
                        ]
                    },
                    "metadata": {},
                    "output_type": "display_data"
                },
                {
                    "data": {
                        "text/plain": [
                            "'2 wojna \u015bwiatowa by\u0142a wielkim konfliktem zbrojnym'"
                        ]
                    },
                    "metadata": {},
                    "output_type": "display_data"
                },
                {
                    "data": {
                        "text/plain": [
                            "0.0"
                        ]
                    },
                    "metadata": {},
                    "output_type": "display_data"
                },
                {
                    "data": {
                        "text/plain": [
                            "'Tomek lubi psy, ma psa  i je\u017adzi na motorze i rowerze.'"
                        ]
                    },
                    "metadata": {},
                    "output_type": "display_data"
                },
                {
                    "data": {
                        "text/plain": [
                            "0.19611613513818402"
                        ]
                    },
                    "metadata": {},
                    "output_type": "display_data"
                }
            ],
            "source": [
                "# this is how a two-word query is handled\n",
                "query = 'psa kota'\n",
                "for i in range(len(documents)):\n",
                "    display(documents[i])\n",
                "    display(similarity(transform_query(query), documents_vectorized[i]))"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 34,
            "metadata": {},
            "outputs": [
                {
                    "data": {
                        "text/plain": [
                            "'Ala lubi zwierz\u0119ta i ma kota oraz psa!'"
                        ]
                    },
                    "metadata": {},
                    "output_type": "display_data"
                },
                {
                    "data": {
                        "text/plain": [
                            "0.0"
                        ]
                    },
                    "metadata": {},
                    "output_type": "display_data"
                },
                {
                    "data": {
                        "text/plain": [
                            "'Ola lubi zwierz\u0119ta oraz ma kota a tak\u017ce chomika!'"
                        ]
                    },
                    "metadata": {},
                    "output_type": "display_data"
                },
                {
                    "data": {
                        "text/plain": [
                            "0.0"
                        ]
                    },
                    "metadata": {},
                    "output_type": "display_data"
                },
                {
                    "data": {
                        "text/plain": [
                            "'I Jan je\u017adzi na rowerze.'"
                        ]
                    },
                    "metadata": {},
                    "output_type": "display_data"
                },
                {
                    "data": {
                        "text/plain": [
                            "0.4472135954999579"
                        ]
                    },
                    "metadata": {},
                    "output_type": "display_data"
                },
                {
                    "data": {
                        "text/plain": [
                            "'2 wojna \u015bwiatowa by\u0142a wielkim konfliktem zbrojnym'"
                        ]
                    },
                    "metadata": {},
                    "output_type": "display_data"
                },
                {
                    "data": {
                        "text/plain": [
                            "0.0"
                        ]
                    },
                    "metadata": {},
                    "output_type": "display_data"
                },
                {
                    "data": {
                        "text/plain": [
                            "'Tomek lubi psy, ma psa  i je\u017adzi na motorze i rowerze.'"
                        ]
                    },
                    "metadata": {},
                    "output_type": "display_data"
                },
                {
                    "data": {
                        "text/plain": [
                            "0.2773500981126146"
                        ]
                    },
                    "metadata": {},
                    "output_type": "display_data"
                }
            ],
            "source": [
                "# this is why we need the denominator in cosine similarity\n",
                "query = 'rowerze'\n",
                "for i in range(len(documents)):\n",
                "    display(documents[i])\n",
                "    display(similarity(transform_query(query), documents_vectorized[i]))"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 35,
            "metadata": {
                "scrolled": true
            },
            "outputs": [
                {
                    "data": {
                        "text/plain": [
                            "'Ala lubi zwierz\u0119ta i ma kota oraz psa!'"
                        ]
                    },
                    "metadata": {},
                    "output_type": "display_data"
                },
                {
                    "data": {
                        "text/plain": [
                            "0.35355339059327373"
                        ]
                    },
                    "metadata": {},
                    "output_type": "display_data"
                },
                {
                    "data": {
                        "text/plain": [
                            "'Ola lubi zwierz\u0119ta oraz ma kota a tak\u017ce chomika!'"
                        ]
                    },
                    "metadata": {},
                    "output_type": "display_data"
                },
                {
                    "data": {
                        "text/plain": [
                            "0.0"
                        ]
                    },
                    "metadata": {},
                    "output_type": "display_data"
                },
                {
                    "data": {
                        "text/plain": [
                            "'I Jan je\u017adzi na rowerze.'"
                        ]
                    },
                    "metadata": {},
                    "output_type": "display_data"
                },
                {
                    "data": {
                        "text/plain": [
                            "0.4472135954999579"
                        ]
                    },
                    "metadata": {},
                    "output_type": "display_data"
                },
                {
                    "data": {
                        "text/plain": [
                            "'2 wojna \u015bwiatowa by\u0142a wielkim konfliktem zbrojnym'"
                        ]
                    },
                    "metadata": {},
                    "output_type": "display_data"
                },
                {
                    "data": {
                        "text/plain": [
                            "0.0"
                        ]
                    },
                    "metadata": {},
                    "output_type": "display_data"
                },
                {
                    "data": {
                        "text/plain": [
                            "'Tomek lubi psy, ma psa  i je\u017adzi na motorze i rowerze.'"
                        ]
                    },
                    "metadata": {},
                    "output_type": "display_data"
                },
                {
                    "data": {
                        "text/plain": [
                            "0.5547001962252291"
                        ]
                    },
                    "metadata": {},
                    "output_type": "display_data"
                }
            ],
            "source": [
                "# this is why we need term frequency \u2192 more occurrences mean a better-matching document\n",
                "query = 'i'\n",
                "for i in range(len(documents)):\n",
                "    display(documents[i])\n",
                "    display(similarity(transform_query(query), documents_vectorized[i]))"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 36,
            "metadata": {},
            "outputs": [
                {
                    "data": {
                        "text/plain": [
                            "'Ala lubi zwierz\u0119ta i ma kota oraz psa!'"
                        ]
                    },
                    "metadata": {},
                    "output_type": "display_data"
                },
                {
                    "data": {
                        "text/plain": [
                            "0.24999999999999994"
                        ]
                    },
                    "metadata": {},
                    "output_type": "display_data"
                },
                {
                    "data": {
                        "text/plain": [
                            "'Ola lubi zwierz\u0119ta oraz ma kota a tak\u017ce chomika!'"
                        ]
                    },
                    "metadata": {},
                    "output_type": "display_data"
                },
                {
                    "data": {
                        "text/plain": [
                            "0.2357022603955158"
                        ]
                    },
                    "metadata": {},
                    "output_type": "display_data"
                },
                {
                    "data": {
                        "text/plain": [
                            "'I Jan je\u017adzi na rowerze.'"
                        ]
                    },
                    "metadata": {},
                    "output_type": "display_data"
                },
                {
                    "data": {
                        "text/plain": [
                            "0.31622776601683794"
                        ]
                    },
                    "metadata": {},
                    "output_type": "display_data"
                },
                {
                    "data": {
                        "text/plain": [
                            "'2 wojna \u015bwiatowa by\u0142a wielkim konfliktem zbrojnym'"
                        ]
                    },
                    "metadata": {},
                    "output_type": "display_data"
                },
                {
                    "data": {
                        "text/plain": [
                            "0.0"
                        ]
                    },
                    "metadata": {},
                    "output_type": "display_data"
                },
                {
                    "data": {
                        "text/plain": [
                            "'Tomek lubi psy, ma psa  i je\u017adzi na motorze i rowerze.'"
                        ]
                    },
                    "metadata": {},
                    "output_type": "display_data"
                },
                {
                    "data": {
                        "text/plain": [
                            "0.39223227027636803"
                        ]
                    },
                    "metadata": {},
                    "output_type": "display_data"
                }
            ],
            "source": [
                "# this is why we need IDF: so that more important words get a higher weight\n",
                "query = 'i chomika'\n",
                "for i in range(len(documents)):\n",
                "    display(documents[i])\n",
                "    display(similarity(transform_query(query), documents_vectorized[i]))"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "### TASK 4: WRITE IDF to change the weights from TF to TF-IDF\n",
                "\n",
                "Please use the version without any normalization:\n",
                "\n",
                "\n",
                "$idf_i = \\Large\\frac{|D|}{|\\{d : t_i \\in d \\}|}$\n",
                "\n",
                "\n",
                "$|D|$ - the number of documents in the corpus\n",
                "\n",
                "$|\\{d : t_i \\in d \\}|$ - the number of documents in the corpus in which the given term occurs at least once"
            ]
        },
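        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "A rough worked example of the formula above, not a solution to the task. It assumes that the corpus is exactly the five documents displayed in the outputs above, and that tokens are lowercased with punctuation stripped, which the similarity scores for the query `i` suggest. Under these assumptions $|D| = 5$, the term `i` occurs in three documents and `chomika` in one:\n",
                "\n",
                "$idf_{\\text{i}} = \\Large\\frac{5}{3}\\normalsize \\approx 1.67$\n",
                "\n",
                "$idf_{\\text{chomika}} = \\Large\\frac{5}{1}\\normalsize = 5$\n",
                "\n",
                "With these weights, each occurrence of `chomika` would count three times as much as an occurrence of `i`, so for the query `i chomika` the document that contains `chomika` should move up in the ranking."
            ]
        },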
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": []
        }
    ],
    "metadata": {
        "kernelspec": {
            "display_name": "Python 3",
            "language": "python",
            "name": "python3"
        },
        "language_info": {
            "codemirror_mode": {
                "name": "ipython",
                "version": 3
            },
            "file_extension": ".py",
            "mimetype": "text/x-python",
            "name": "python",
            "nbconvert_exporter": "python",
            "pygments_lexer": "ipython3",
            "version": "3.8.3"
        },
        "author": "Jakub Pokrywka",
        "email": "kubapok@wmi.amu.edu.pl",
        "lang": "pl",
        "subtitle": "3.tfidf (1)[\u0107wiczenia]",
        "title": "Ekstrakcja informacji",
        "year": "2021"
    },
    "nbformat": 4,
    "nbformat_minor": 4
}