aitech-eks-pub/cw/03a_tfidf.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![Logo 1](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech1.jpg)\n",
    "<div class=\"alert alert-block alert-info\">\n",
    "<h1> Ekstrakcja informacji </h1>\n",
    "<h2> 3. <i>tfidf (1)</i>  [ćwiczenia]</h2> \n",
    "<h3> Jakub Pokrywka (2021)</h3>\n",
    "</div>\n",
    "\n",
    "![Logo 2](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech2.jpg)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Zajęcia 2\n",
    "\n",
    "Na tych zajęciach za aktywnośc można otrzymać po 5 punktów za wartościową wypowiedź. Maksymalnie jedna osoba może zdobyć na tych ćwiczeniach do 15 punktów."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import re"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## zbiór dokumentów"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "documents = ['Ala lubi zwierzęta i ma kota oraz psa!',\n",
    "             'Ola lubi zwierzęta oraz ma kota a także chomika!',\n",
    "             'I Jan jeździ na rowerze.',\n",
    "             '2 wojna światowa była wielkim konfliktem zbrojnym',\n",
    "             'Tomek lubi psy, ma psa  i jeździ na motorze i rowerze.',\n",
    "            ]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### CZEGO CHCEMY?\n",
    "- chcemy zamienić teksty na zbiór słów\n",
    "\n",
    "\n",
    "### PYTANIE\n",
    "- czy możemy ztokenizować tekst np. documents.split(' ') jakie wystąpią wtedy problemy?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## preprocessing"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_str_cleaned(str_dirty):\n",
    "    punctuation = '!\"#$%&\\'()*+,-./:;<=>?@[\\\\]^_`{|}~'\n",
    "    new_str = str_dirty.lower()\n",
    "    new_str = re.sub(' +', ' ', new_str)\n",
    "    for char in punctuation:\n",
    "        new_str = new_str.replace(char,'')\n",
    "    return new_str\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "sample_document = get_str_cleaned(documents[0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'ala lubi zwierzęta i ma kota oraz psa'"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sample_document"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## tokenizacja"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "def tokenize_str(document):\n",
    "    return document.split(' ')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['ala', 'lubi', 'zwierzęta', 'i', 'ma', 'kota', 'oraz', 'psa']"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tokenize_str(sample_document)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "documents_cleaned = [get_str_cleaned(d) for d in documents]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['ala lubi zwierzęta i ma kota oraz psa',\n",
       " 'ola lubi zwierzęta oraz ma kota a także chomika',\n",
       " 'i jan jeździ na rowerze',\n",
       " '2 wojna światowa była wielkim konfliktem zbrojnym',\n",
       " 'tomek lubi psy ma psa i jeździ na motorze i rowerze']"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "documents_cleaned"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "documents_tokenized = [tokenize_str(d) for d in documents_cleaned]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[['ala', 'lubi', 'zwierzęta', 'i', 'ma', 'kota', 'oraz', 'psa'],\n",
       " ['ola', 'lubi', 'zwierzęta', 'oraz', 'ma', 'kota', 'a', 'także', 'chomika'],\n",
       " ['i', 'jan', 'jeździ', 'na', 'rowerze'],\n",
       " ['2', 'wojna', 'światowa', 'była', 'wielkim', 'konfliktem', 'zbrojnym'],\n",
       " ['tomek',\n",
       "  'lubi',\n",
       "  'psy',\n",
       "  'ma',\n",
       "  'psa',\n",
       "  'i',\n",
       "  'jeździ',\n",
       "  'na',\n",
       "  'motorze',\n",
       "  'i',\n",
       "  'rowerze']]"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "documents_tokenized"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## PYTANIA\n",
    "- jaki jest następny krok w celu stworzenia wektórów TF lub TF-IDF\n",
    "- jakie wielkości będzie wektor TF lub TF-IDF?\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "vocabulary = []\n",
    "for document in documents_tokenized:\n",
    "    for word in document:\n",
    "        vocabulary.append(word)\n",
    "vocabulary = sorted(set(vocabulary))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['2',\n",
       " 'a',\n",
       " 'ala',\n",
       " 'była',\n",
       " 'chomika',\n",
       " 'i',\n",
       " 'jan',\n",
       " 'jeździ',\n",
       " 'konfliktem',\n",
       " 'kota',\n",
       " 'lubi',\n",
       " 'ma',\n",
       " 'motorze',\n",
       " 'na',\n",
       " 'ola',\n",
       " 'oraz',\n",
       " 'psa',\n",
       " 'psy',\n",
       " 'rowerze',\n",
       " 'także',\n",
       " 'tomek',\n",
       " 'wielkim',\n",
       " 'wojna',\n",
       " 'zbrojnym',\n",
       " 'zwierzęta',\n",
       " 'światowa']"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "vocabulary"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## PYTANIA\n",
    "\n",
    "jak będzie słowo \"jak\" w reprezentacji wektorowej TF?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### ZADANIE 1 stworzyć funkcję word_to_index(word:str), funkcja ma zwarać one-hot vector w postaciu numpy array"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "def word_to_index(word):\n",
    "    pass"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,\n",
       "       0., 0., 0., 0., 0., 0., 0., 0., 0.])"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "word_to_index('psa')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### ZADANIE 2 NAPISAC FUNKCJĘ, która bierze listę słów i zamienia na wetktor TF\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "def tf(document):\n",
    "    pass"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([0., 0., 1., 0., 0., 1., 0., 0., 0., 1., 1., 1., 0., 0., 0., 1., 1.,\n",
       "       0., 0., 0., 0., 0., 0., 0., 1., 0.])"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tf(documents_tokenized[0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "documents_vectorized = list()\n",
    "for document in documents_tokenized:\n",
    "    document_vector = tf(document)\n",
    "    documents_vectorized.append(document_vector)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[array([0., 0., 1., 0., 0., 1., 0., 0., 0., 1., 1., 1., 0., 0., 0., 1., 1.,\n",
       "        0., 0., 0., 0., 0., 0., 0., 1., 0.]),\n",
       " array([0., 1., 0., 0., 1., 0., 0., 0., 0., 1., 1., 1., 0., 0., 1., 1., 0.,\n",
       "        0., 0., 1., 0., 0., 0., 0., 1., 0.]),\n",
       " array([0., 0., 0., 0., 0., 1., 1., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0.,\n",
       "        0., 1., 0., 0., 0., 0., 0., 0., 0.]),\n",
       " array([1., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,\n",
       "        0., 0., 0., 0., 1., 1., 1., 0., 1.]),\n",
       " array([0., 0., 0., 0., 0., 2., 0., 1., 0., 0., 1., 1., 1., 1., 0., 0., 1.,\n",
       "        1., 1., 0., 1., 0., 0., 0., 0., 0.])]"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "documents_vectorized"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### IDF"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([5.        , 5.        , 5.        , 5.        , 5.        ,\n",
       "       1.66666667, 5.        , 2.5       , 5.        , 2.5       ,\n",
       "       1.66666667, 1.66666667, 5.        , 2.5       , 5.        ,\n",
       "       2.5       , 2.5       , 5.        , 2.5       , 5.        ,\n",
       "       5.        , 5.        , 5.        , 5.        , 2.5       ,\n",
       "       5.        ])"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "idf = np.zeros(len(vocabulary))\n",
    "idf = len(documents_vectorized) / np.sum(np.array(documents_vectorized) != 0,axis=0)\n",
    "display(idf)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [],
   "source": [
    "for i in range(len(documents_vectorized)):\n",
    "    documents_vectorized[i] = documents_vectorized[i]# * idf"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### ZADANIE 3 Napisać funkcję similarity, która zwraca podobieństwo kosinusowe między dwoma dokumentami w postaci zwektoryzowanej"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def similarity(query, document):\n",
    "    pass"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Ala lubi zwierzęta i ma kota oraz psa!'"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "documents[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([0., 0., 1., 0., 0., 1., 0., 0., 0., 1., 1., 1., 0., 0., 0., 1., 1.,\n",
       "       0., 0., 0., 0., 0., 0., 0., 1., 0.])"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "documents_vectorized[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Ola lubi zwierzęta oraz ma kota a także chomika!'"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "documents[1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([0., 1., 0., 0., 1., 0., 0., 0., 0., 1., 1., 1., 0., 0., 1., 1., 0.,\n",
       "       0., 0., 1., 0., 0., 0., 0., 1., 0.])"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "documents_vectorized[1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.5892556509887895"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "similarity(documents_vectorized[0],documents_vectorized[1])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [],
   "source": [
    "def transform_query(query):\n",
    "    query_vector = tf(tokenize_str(get_str_cleaned(query)))\n",
    "    return query_vector"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,\n",
       "       0., 0., 0., 0., 0., 0., 0., 0., 0.])"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "transform_query('psa')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.4999999999999999"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "similarity(transform_query('psa kota'), documents_vectorized[0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Ala lubi zwierzęta i ma kota oraz psa!'"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "0.4999999999999999"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "'Ola lubi zwierzęta oraz ma kota a także chomika!'"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "0.2357022603955158"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "'I Jan jeździ na rowerze.'"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "0.0"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "'2 wojna światowa była wielkim konfliktem zbrojnym'"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "0.0"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "'Tomek lubi psy, ma psa  i jeździ na motorze i rowerze.'"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "0.19611613513818402"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# tak są obsługiwane 2 słowa\n",
    "query = 'psa kota'\n",
    "for i in range(len(documents)):\n",
    "    display(documents[i])\n",
    "    display(similarity(transform_query(query), documents_vectorized[i]))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Ala lubi zwierzęta i ma kota oraz psa!'"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "0.0"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "'Ola lubi zwierzęta oraz ma kota a także chomika!'"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "0.0"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "'I Jan jeździ na rowerze.'"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "0.4472135954999579"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "'2 wojna światowa była wielkim konfliktem zbrojnym'"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "0.0"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "'Tomek lubi psy, ma psa  i jeździ na motorze i rowerze.'"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "0.2773500981126146"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# dlatego potrzebujemy mianownik w cosine similarity\n",
    "query = 'rowerze'\n",
    "for i in range(len(documents)):\n",
    "    display(documents[i])\n",
    "    display(similarity(transform_query(query), documents_vectorized[i]))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Ala lubi zwierzęta i ma kota oraz psa!'"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "0.35355339059327373"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "'Ola lubi zwierzęta oraz ma kota a także chomika!'"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "0.0"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "'I Jan jeździ na rowerze.'"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "0.4472135954999579"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "'2 wojna światowa była wielkim konfliktem zbrojnym'"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "0.0"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "'Tomek lubi psy, ma psa  i jeździ na motorze i rowerze.'"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "0.5547001962252291"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# dlatego potrzebujemy term frequency → wiecej znaczy bardziej dopasowany dokument\n",
    "query = 'i'\n",
    "for i in range(len(documents)):\n",
    "    display(documents[i])\n",
    "    display(similarity(transform_query(query), documents_vectorized[i]))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Ala lubi zwierzęta i ma kota oraz psa!'"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "0.24999999999999994"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "'Ola lubi zwierzęta oraz ma kota a także chomika!'"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "0.2357022603955158"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "'I Jan jeździ na rowerze.'"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "0.31622776601683794"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "'2 wojna światowa była wielkim konfliktem zbrojnym'"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "0.0"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "'Tomek lubi psy, ma psa  i jeździ na motorze i rowerze.'"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "0.39223227027636803"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# dlatego IDF - żeby ważniejsze słowa miał większą wagę\n",
    "query = 'i chomika'\n",
    "for i in range(len(documents)):\n",
    "    display(documents[i])\n",
    "    display(similarity(transform_query(query), documents_vectorized[i]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### ZADANIE 4 NAPISAĆ IDF w celu zmiany wag z TF na TF- IDF \n",
    "\n",
    "Proszę użyć wersję bez żadnej normalizacji\n",
    "\n",
    "\n",
    "$idf_i = \\Large\\frac{|D|}{|\\{d : t_i \\in d \\}|}$\n",
    "\n",
    "\n",
    "$|D|$ - ilość dokumentów w korpusie\n",
    "$|\\{d : t_i \\in d \\}|$ - ilość dokumentów w korpusie, gdzie dany term występuje chociaż jeden raz"
   ]
  }
 ],
 "metadata": {
  "author": "Jakub Pokrywka",
  "email": "kubapok@wmi.amu.edu.pl",
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "lang": "pl",
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.3"
  },
  "subtitle": "3.tfidf (1)[ćwiczenia]",
  "title": "Ekstrakcja informacji",
  "year": "2021"
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
add 03 2021-09-27 14:02:30 +02:00			`{`
reformat 2021-10-05 15:04:58 +02:00			`"cells": [`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"![Logo 1](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech1.jpg)\n",`
			`"<div class=\"alert alert-block alert-info\">\n",`
			`"<h1> Ekstrakcja informacji </h1>\n",`
			`"<h2> 3. <i>tfidf (1)</i> [ćwiczenia]</h2> \n",`
			`"<h3> Jakub Pokrywka (2021)</h3>\n",`
			`"</div>\n",`
			`"\n",`
			`"![Logo 2](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech2.jpg)"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"# Zajęcia 2\n",`
			`"\n",`
			`"Na tych zajęciach za aktywnośc można otrzymać po 5 punktów za wartościową wypowiedź. Maksymalnie jedna osoba może zdobyć na tych ćwiczeniach do 15 punktów."`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 1,`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"import numpy as np\n",`
			`"import re"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"## zbiór dokumentów"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 2,`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"documents = ['Ala lubi zwierzęta i ma kota oraz psa!',\n",`
			`" 'Ola lubi zwierzęta oraz ma kota a także chomika!',\n",`
			`" 'I Jan jeździ na rowerze.',\n",`
			`" '2 wojna światowa była wielkim konfliktem zbrojnym',\n",`
			`" 'Tomek lubi psy, ma psa i jeździ na motorze i rowerze.',\n",`
			`" ]"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"### CZEGO CHCEMY?\n",`
			`"- chcemy zamienić teksty na zbiór słów\n",`
			`"\n",`
			`"\n",`
			`"### PYTANIE\n",`
			`"- czy możemy ztokenizować tekst np. documents.split(' ') jakie wystąpią wtedy problemy?"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"## preprocessing"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 3,`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"def get_str_cleaned(str_dirty):\n",`
			" punctuation = '!\"#$%&\\'()*+,-./:;<=>?@[\\\\]^_`{\|}~'\n",
			`" new_str = str_dirty.lower()\n",`
			`" new_str = re.sub(' +', ' ', new_str)\n",`
			`" for char in punctuation:\n",`
			`" new_str = new_str.replace(char,'')\n",`
			`" return new_str\n"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 4,`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"sample_document = get_str_cleaned(documents[0])"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 5,`
			`"metadata": {},`
			`"outputs": [`
			`{`
			`"data": {`
			`"text/plain": [`
			`"'ala lubi zwierzęta i ma kota oraz psa'"`
			`]`
			`},`
			`"execution_count": 5,`
			`"metadata": {},`
			`"output_type": "execute_result"`
			`}`
			`],`
			`"source": [`
			`"sample_document"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"## tokenizacja"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 6,`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"def tokenize_str(document):\n",`
			`" return document.split(' ')"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 7,`
			`"metadata": {},`
			`"outputs": [`
			`{`
			`"data": {`
			`"text/plain": [`
			`"['ala', 'lubi', 'zwierzęta', 'i', 'ma', 'kota', 'oraz', 'psa']"`
			`]`
			`},`
			`"execution_count": 7,`
			`"metadata": {},`
			`"output_type": "execute_result"`
			`}`
			`],`
			`"source": [`
			`"tokenize_str(sample_document)"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 8,`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"documents_cleaned = [get_str_cleaned(d) for d in documents]"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 9,`
			`"metadata": {},`
			`"outputs": [`
			`{`
			`"data": {`
			`"text/plain": [`
			`"['ala lubi zwierzęta i ma kota oraz psa',\n",`
			`" 'ola lubi zwierzęta oraz ma kota a także chomika',\n",`
			`" 'i jan jeździ na rowerze',\n",`
			`" '2 wojna światowa była wielkim konfliktem zbrojnym',\n",`
			`" 'tomek lubi psy ma psa i jeździ na motorze i rowerze']"`
			`]`
			`},`
			`"execution_count": 9,`
			`"metadata": {},`
			`"output_type": "execute_result"`
			`}`
			`],`
			`"source": [`
			`"documents_cleaned"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 10,`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"documents_tokenized = [tokenize_str(d) for d in documents_cleaned]"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 11,`
			`"metadata": {},`
			`"outputs": [`
			`{`
			`"data": {`
			`"text/plain": [`
			`"[['ala', 'lubi', 'zwierzęta', 'i', 'ma', 'kota', 'oraz', 'psa'],\n",`
			`" ['ola', 'lubi', 'zwierzęta', 'oraz', 'ma', 'kota', 'a', 'także', 'chomika'],\n",`
			`" ['i', 'jan', 'jeździ', 'na', 'rowerze'],\n",`
			`" ['2', 'wojna', 'światowa', 'była', 'wielkim', 'konfliktem', 'zbrojnym'],\n",`
			`" ['tomek',\n",`
			`" 'lubi',\n",`
			`" 'psy',\n",`
			`" 'ma',\n",`
			`" 'psa',\n",`
			`" 'i',\n",`
			`" 'jeździ',\n",`
			`" 'na',\n",`
			`" 'motorze',\n",`
			`" 'i',\n",`
			`" 'rowerze']]"`
			`]`
			`},`
			`"execution_count": 11,`
			`"metadata": {},`
			`"output_type": "execute_result"`
			`}`
			`],`
			`"source": [`
			`"documents_tokenized"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"## PYTANIA\n",`
			`"- jaki jest następny krok w celu stworzenia wektórów TF lub TF-IDF\n",`
			`"- jakie wielkości będzie wektor TF lub TF-IDF?\n"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 12,`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"vocabulary = []\n",`
			`"for document in documents_tokenized:\n",`
			`" for word in document:\n",`
			`" vocabulary.append(word)\n",`
			`"vocabulary = sorted(set(vocabulary))"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 13,`
			`"metadata": {},`
			`"outputs": [`
			`{`
			`"data": {`
			`"text/plain": [`
			`"['2',\n",`
			`" 'a',\n",`
			`" 'ala',\n",`
			`" 'była',\n",`
			`" 'chomika',\n",`
			`" 'i',\n",`
			`" 'jan',\n",`
			`" 'jeździ',\n",`
			`" 'konfliktem',\n",`
			`" 'kota',\n",`
			`" 'lubi',\n",`
			`" 'ma',\n",`
			`" 'motorze',\n",`
			`" 'na',\n",`
			`" 'ola',\n",`
			`" 'oraz',\n",`
			`" 'psa',\n",`
			`" 'psy',\n",`
			`" 'rowerze',\n",`
			`" 'także',\n",`
			`" 'tomek',\n",`
			`" 'wielkim',\n",`
			`" 'wojna',\n",`
			`" 'zbrojnym',\n",`
			`" 'zwierzęta',\n",`
			`" 'światowa']"`
			`]`
			`},`
			`"execution_count": 13,`
			`"metadata": {},`
			`"output_type": "execute_result"`
			`}`
			`],`
			`"source": [`
			`"vocabulary"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"## PYTANIA\n",`
			`"\n",`
			`"jak będzie słowo \"jak\" w reprezentacji wektorowej TF?"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"### ZADANIE 1 stworzyć funkcję word_to_index(word:str), funkcja ma zwarać one-hot vector w postaciu numpy array"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 14,`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"def word_to_index(word):\n",`
			`" pass"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 16,`
			`"metadata": {},`
			`"outputs": [`
			`{`
			`"data": {`
			`"text/plain": [`
			`"array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,\n",`
			`" 0., 0., 0., 0., 0., 0., 0., 0., 0.])"`
			`]`
			`},`
			`"execution_count": 16,`
			`"metadata": {},`
			`"output_type": "execute_result"`
			`}`
			`],`
			`"source": [`
			`"word_to_index('psa')"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"### ZADANIE 2 NAPISAC FUNKCJĘ, która bierze listę słów i zamienia na wetktor TF\n"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 17,`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"def tf(document):\n",`
			`" pass"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 18,`
			`"metadata": {},`
			`"outputs": [],`
			`"source": []`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 19,`
			`"metadata": {},`
			`"outputs": [`
			`{`
			`"data": {`
			`"text/plain": [`
			`"array([0., 0., 1., 0., 0., 1., 0., 0., 0., 1., 1., 1., 0., 0., 0., 1., 1.,\n",`
			`" 0., 0., 0., 0., 0., 0., 0., 1., 0.])"`
			`]`
			`},`
			`"execution_count": 19,`
			`"metadata": {},`
			`"output_type": "execute_result"`
			`}`
			`],`
			`"source": [`
			`"tf(documents_tokenized[0])"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 20,`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"documents_vectorized = list()\n",`
			`"for document in documents_tokenized:\n",`
			`" document_vector = tf(document)\n",`
			`" documents_vectorized.append(document_vector)"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 21,`
			`"metadata": {`
			`"scrolled": true`
			`},`
			`"outputs": [`
			`{`
			`"data": {`
			`"text/plain": [`
			`"[array([0., 0., 1., 0., 0., 1., 0., 0., 0., 1., 1., 1., 0., 0., 0., 1., 1.,\n",`
			`" 0., 0., 0., 0., 0., 0., 0., 1., 0.]),\n",`
			`" array([0., 1., 0., 0., 1., 0., 0., 0., 0., 1., 1., 1., 0., 0., 1., 1., 0.,\n",`
			`" 0., 0., 1., 0., 0., 0., 0., 1., 0.]),\n",`
			`" array([0., 0., 0., 0., 0., 1., 1., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0.,\n",`
			`" 0., 1., 0., 0., 0., 0., 0., 0., 0.]),\n",`
			`" array([1., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,\n",`
			`" 0., 0., 0., 0., 1., 1., 1., 0., 1.]),\n",`
			`" array([0., 0., 0., 0., 0., 2., 0., 1., 0., 0., 1., 1., 1., 1., 0., 0., 1.,\n",`
			`" 1., 1., 0., 1., 0., 0., 0., 0., 0.])]"`
			`]`
			`},`
			`"execution_count": 21,`
			`"metadata": {},`
			`"output_type": "execute_result"`
			`}`
			`],`
			`"source": [`
			`"documents_vectorized"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"### IDF"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 22,`
			`"metadata": {},`
			`"outputs": [`
			`{`
			`"data": {`
			`"text/plain": [`
			`"array([5. , 5. , 5. , 5. , 5. ,\n",`
			`" 1.66666667, 5. , 2.5 , 5. , 2.5 ,\n",`
			`" 1.66666667, 1.66666667, 5. , 2.5 , 5. ,\n",`
			`" 2.5 , 2.5 , 5. , 2.5 , 5. ,\n",`
			`" 5. , 5. , 5. , 5. , 2.5 ,\n",`
			`" 5. ])"`
			`]`
			`},`
			`"metadata": {},`
			`"output_type": "display_data"`
			`}`
			`],`
			`"source": [`
			`"idf = np.zeros(len(vocabulary))\n",`
			`"idf = len(documents_vectorized) / np.sum(np.array(documents_vectorized) != 0,axis=0)\n",`
			`"display(idf)"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 23,`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"for i in range(len(documents_vectorized)):\n",`
			`" documents_vectorized[i] = documents_vectorized[i]# * idf"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"### ZADANIE 3 Napisać funkcję similarity, która zwraca podobieństwo kosinusowe między dwoma dokumentami w postaci zwektoryzowanej"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": null,`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"def similarity(query, document):\n",`
			`" pass"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 25,`
			`"metadata": {},`
			`"outputs": [`
			`{`
			`"data": {`
			`"text/plain": [`
			`"'Ala lubi zwierzęta i ma kota oraz psa!'"`
			`]`
			`},`
			`"execution_count": 25,`
			`"metadata": {},`
			`"output_type": "execute_result"`
			`}`
			`],`
			`"source": [`
			`"documents[0]"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 26,`
			`"metadata": {},`
			`"outputs": [`
			`{`
			`"data": {`
			`"text/plain": [`
			`"array([0., 0., 1., 0., 0., 1., 0., 0., 0., 1., 1., 1., 0., 0., 0., 1., 1.,\n",`
			`" 0., 0., 0., 0., 0., 0., 0., 1., 0.])"`
			`]`
			`},`
			`"execution_count": 26,`
			`"metadata": {},`
			`"output_type": "execute_result"`
			`}`
			`],`
			`"source": [`
			`"documents_vectorized[0]"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 27,`
			`"metadata": {},`
			`"outputs": [`
			`{`
			`"data": {`
			`"text/plain": [`
			`"'Ola lubi zwierzęta oraz ma kota a także chomika!'"`
			`]`
			`},`
			`"execution_count": 27,`
			`"metadata": {},`
			`"output_type": "execute_result"`
			`}`
			`],`
			`"source": [`
			`"documents[1]"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 28,`
			`"metadata": {},`
			`"outputs": [`
			`{`
			`"data": {`
			`"text/plain": [`
			`"array([0., 1., 0., 0., 1., 0., 0., 0., 0., 1., 1., 1., 0., 0., 1., 1., 0.,\n",`
			`" 0., 0., 1., 0., 0., 0., 0., 1., 0.])"`
			`]`
			`},`
			`"execution_count": 28,`
			`"metadata": {},`
			`"output_type": "execute_result"`
			`}`
			`],`
			`"source": [`
			`"documents_vectorized[1]"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 29,`
			`"metadata": {},`
			`"outputs": [`
			`{`
			`"data": {`
			`"text/plain": [`
			`"0.5892556509887895"`
			`]`
			`},`
			`"execution_count": 29,`
			`"metadata": {},`
			`"output_type": "execute_result"`
			`}`
			`],`
			`"source": [`
			`"similarity(documents_vectorized[0],documents_vectorized[1])"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 30,`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"def transform_query(query):\n",`
			`" query_vector = tf(tokenize_str(get_str_cleaned(query)))\n",`
			`" return query_vector"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 31,`
			`"metadata": {},`
			`"outputs": [`
			`{`
			`"data": {`
			`"text/plain": [`
			`"array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,\n",`
			`" 0., 0., 0., 0., 0., 0., 0., 0., 0.])"`
			`]`
			`},`
			`"execution_count": 31,`
			`"metadata": {},`
			`"output_type": "execute_result"`
			`}`
			`],`
			`"source": [`
			`"transform_query('psa')"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 32,`
			`"metadata": {},`
			`"outputs": [`
			`{`
			`"data": {`
			`"text/plain": [`
			`"0.4999999999999999"`
			`]`
			`},`
			`"execution_count": 32,`
			`"metadata": {},`
			`"output_type": "execute_result"`
			`}`
			`],`
			`"source": [`
			`"similarity(transform_query('psa kota'), documents_vectorized[0])"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 33,`
			`"metadata": {},`
			`"outputs": [`
			`{`
			`"data": {`
			`"text/plain": [`
			`"'Ala lubi zwierzęta i ma kota oraz psa!'"`
			`]`
			`},`
			`"metadata": {},`
			`"output_type": "display_data"`
add 03 2021-09-27 14:02:30 +02:00			`},`
reformat 2021-10-05 15:04:58 +02:00			`{`
			`"data": {`
			`"text/plain": [`
			`"0.4999999999999999"`
			`]`
			`},`
			`"metadata": {},`
			`"output_type": "display_data"`
			`},`
			`{`
			`"data": {`
			`"text/plain": [`
			`"'Ola lubi zwierzęta oraz ma kota a także chomika!'"`
			`]`
			`},`
			`"metadata": {},`
			`"output_type": "display_data"`
			`},`
			`{`
			`"data": {`
			`"text/plain": [`
			`"0.2357022603955158"`
			`]`
			`},`
			`"metadata": {},`
			`"output_type": "display_data"`
			`},`
			`{`
			`"data": {`
			`"text/plain": [`
			`"'I Jan jeździ na rowerze.'"`
			`]`
			`},`
			`"metadata": {},`
			`"output_type": "display_data"`
			`},`
			`{`
			`"data": {`
			`"text/plain": [`
			`"0.0"`
			`]`
			`},`
			`"metadata": {},`
			`"output_type": "display_data"`
			`},`
			`{`
			`"data": {`
			`"text/plain": [`
			`"'2 wojna światowa była wielkim konfliktem zbrojnym'"`
			`]`
			`},`
			`"metadata": {},`
			`"output_type": "display_data"`
			`},`
			`{`
			`"data": {`
			`"text/plain": [`
			`"0.0"`
			`]`
			`},`
			`"metadata": {},`
			`"output_type": "display_data"`
			`},`
			`{`
			`"data": {`
			`"text/plain": [`
			`"'Tomek lubi psy, ma psa i jeździ na motorze i rowerze.'"`
			`]`
			`},`
			`"metadata": {},`
			`"output_type": "display_data"`
			`},`
			`{`
			`"data": {`
			`"text/plain": [`
			`"0.19611613513818402"`
			`]`
			`},`
			`"metadata": {},`
			`"output_type": "display_data"`
			`}`
			`],`
			`"source": [`
			`"# tak są obsługiwane 2 słowa\n",`
			`"query = 'psa kota'\n",`
			`"for i in range(len(documents)):\n",`
			`" display(documents[i])\n",`
			`" display(similarity(transform_query(query), documents_vectorized[i]))"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 34,`
			`"metadata": {},`
			`"outputs": [`
			`{`
			`"data": {`
			`"text/plain": [`
			`"'Ala lubi zwierzęta i ma kota oraz psa!'"`
			`]`
			`},`
			`"metadata": {},`
			`"output_type": "display_data"`
			`},`
			`{`
			`"data": {`
			`"text/plain": [`
			`"0.0"`
			`]`
			`},`
			`"metadata": {},`
			`"output_type": "display_data"`
			`},`
			`{`
			`"data": {`
			`"text/plain": [`
			`"'Ola lubi zwierzęta oraz ma kota a także chomika!'"`
			`]`
			`},`
			`"metadata": {},`
			`"output_type": "display_data"`
			`},`
			`{`
			`"data": {`
			`"text/plain": [`
			`"0.0"`
			`]`
			`},`
			`"metadata": {},`
			`"output_type": "display_data"`
			`},`
			`{`
			`"data": {`
			`"text/plain": [`
			`"'I Jan jeździ na rowerze.'"`
			`]`
			`},`
			`"metadata": {},`
			`"output_type": "display_data"`
			`},`
			`{`
			`"data": {`
			`"text/plain": [`
			`"0.4472135954999579"`
			`]`
			`},`
			`"metadata": {},`
			`"output_type": "display_data"`
			`},`
			`{`
			`"data": {`
			`"text/plain": [`
			`"'2 wojna światowa była wielkim konfliktem zbrojnym'"`
			`]`
			`},`
			`"metadata": {},`
			`"output_type": "display_data"`
			`},`
			`{`
			`"data": {`
			`"text/plain": [`
			`"0.0"`
			`]`
			`},`
			`"metadata": {},`
			`"output_type": "display_data"`
			`},`
			`{`
			`"data": {`
			`"text/plain": [`
			`"'Tomek lubi psy, ma psa i jeździ na motorze i rowerze.'"`
			`]`
			`},`
			`"metadata": {},`
			`"output_type": "display_data"`
			`},`
			`{`
			`"data": {`
			`"text/plain": [`
			`"0.2773500981126146"`
			`]`
			`},`
			`"metadata": {},`
			`"output_type": "display_data"`
			`}`
			`],`
			`"source": [`
			`"# dlatego potrzebujemy mianownik w cosine similarity\n",`
			`"query = 'rowerze'\n",`
			`"for i in range(len(documents)):\n",`
			`" display(documents[i])\n",`
			`" display(similarity(transform_query(query), documents_vectorized[i]))"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 35,`
			`"metadata": {`
			`"scrolled": true`
			`},`
			`"outputs": [`
			`{`
			`"data": {`
			`"text/plain": [`
			`"'Ala lubi zwierzęta i ma kota oraz psa!'"`
			`]`
			`},`
			`"metadata": {},`
			`"output_type": "display_data"`
			`},`
			`{`
			`"data": {`
			`"text/plain": [`
			`"0.35355339059327373"`
			`]`
			`},`
			`"metadata": {},`
			`"output_type": "display_data"`
			`},`
			`{`
			`"data": {`
			`"text/plain": [`
			`"'Ola lubi zwierzęta oraz ma kota a także chomika!'"`
			`]`
			`},`
			`"metadata": {},`
			`"output_type": "display_data"`
			`},`
			`{`
			`"data": {`
			`"text/plain": [`
			`"0.0"`
			`]`
			`},`
			`"metadata": {},`
			`"output_type": "display_data"`
			`},`
			`{`
			`"data": {`
			`"text/plain": [`
			`"'I Jan jeździ na rowerze.'"`
			`]`
			`},`
			`"metadata": {},`
			`"output_type": "display_data"`
			`},`
			`{`
			`"data": {`
			`"text/plain": [`
			`"0.4472135954999579"`
			`]`
			`},`
			`"metadata": {},`
			`"output_type": "display_data"`
			`},`
			`{`
			`"data": {`
			`"text/plain": [`
			`"'2 wojna światowa była wielkim konfliktem zbrojnym'"`
			`]`
			`},`
			`"metadata": {},`
			`"output_type": "display_data"`
			`},`
			`{`
			`"data": {`
			`"text/plain": [`
			`"0.0"`
			`]`
			`},`
			`"metadata": {},`
			`"output_type": "display_data"`
			`},`
			`{`
			`"data": {`
			`"text/plain": [`
			`"'Tomek lubi psy, ma psa i jeździ na motorze i rowerze.'"`
			`]`
			`},`
			`"metadata": {},`
			`"output_type": "display_data"`
			`},`
			`{`
			`"data": {`
			`"text/plain": [`
			`"0.5547001962252291"`
			`]`
			`},`
			`"metadata": {},`
			`"output_type": "display_data"`
			`}`
			`],`
			`"source": [`
			`"# dlatego potrzebujemy term frequency → wiecej znaczy bardziej dopasowany dokument\n",`
			`"query = 'i'\n",`
			`"for i in range(len(documents)):\n",`
			`" display(documents[i])\n",`
			`" display(similarity(transform_query(query), documents_vectorized[i]))"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 36,`
			`"metadata": {},`
			`"outputs": [`
			`{`
			`"data": {`
			`"text/plain": [`
			`"'Ala lubi zwierzęta i ma kota oraz psa!'"`
			`]`
			`},`
			`"metadata": {},`
			`"output_type": "display_data"`
			`},`
			`{`
			`"data": {`
			`"text/plain": [`
			`"0.24999999999999994"`
			`]`
			`},`
			`"metadata": {},`
			`"output_type": "display_data"`
			`},`
			`{`
			`"data": {`
			`"text/plain": [`
			`"'Ola lubi zwierzęta oraz ma kota a także chomika!'"`
			`]`
			`},`
			`"metadata": {},`
			`"output_type": "display_data"`
			`},`
			`{`
			`"data": {`
			`"text/plain": [`
			`"0.2357022603955158"`
			`]`
			`},`
			`"metadata": {},`
			`"output_type": "display_data"`
			`},`
			`{`
			`"data": {`
			`"text/plain": [`
			`"'I Jan jeździ na rowerze.'"`
			`]`
			`},`
			`"metadata": {},`
			`"output_type": "display_data"`
			`},`
			`{`
			`"data": {`
			`"text/plain": [`
			`"0.31622776601683794"`
			`]`
			`},`
			`"metadata": {},`
			`"output_type": "display_data"`
			`},`
			`{`
			`"data": {`
			`"text/plain": [`
			`"'2 wojna światowa była wielkim konfliktem zbrojnym'"`
			`]`
			`},`
			`"metadata": {},`
			`"output_type": "display_data"`
			`},`
			`{`
			`"data": {`
			`"text/plain": [`
			`"0.0"`
			`]`
			`},`
			`"metadata": {},`
			`"output_type": "display_data"`
			`},`
			`{`
			`"data": {`
			`"text/plain": [`
			`"'Tomek lubi psy, ma psa i jeździ na motorze i rowerze.'"`
			`]`
			`},`
			`"metadata": {},`
			`"output_type": "display_data"`
			`},`
			`{`
			`"data": {`
			`"text/plain": [`
			`"0.39223227027636803"`
			`]`
			`},`
			`"metadata": {},`
			`"output_type": "display_data"`
			`}`
			`],`
			`"source": [`
			`"# dlatego IDF - żeby ważniejsze słowa miał większą wagę\n",`
			`"query = 'i chomika'\n",`
			`"for i in range(len(documents)):\n",`
			`" display(documents[i])\n",`
			`" display(similarity(transform_query(query), documents_vectorized[i]))"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"### ZADANIE 4 NAPISAĆ IDF w celu zmiany wag z TF na TF- IDF \n",`
			`"\n",`
			`"Proszę użyć wersję bez żadnej normalizacji\n",`
			`"\n",`
			`"\n",`
			`"$idf_i = \\Large\\frac{\|D\|}{\|\\{d : t_i \\in d \\}\|}$\n",`
			`"\n",`
			`"\n",`
			`"$\|D\|$ - ilość dokumentów w korpusie\n",`
			`"$\|\\{d : t_i \\in d \\}\|$ - ilość dokumentów w korpusie, gdzie dany term występuje chociaż jeden raz"`
			`]`
			`}`
			`],`
			`"metadata": {`
			`"author": "Jakub Pokrywka",`
			`"email": "kubapok@wmi.amu.edu.pl",`
			`"kernelspec": {`
			`"display_name": "Python 3",`
			`"language": "python",`
			`"name": "python3"`
			`},`
			`"lang": "pl",`
			`"language_info": {`
			`"codemirror_mode": {`
			`"name": "ipython",`
			`"version": 3`
			`},`
			`"file_extension": ".py",`
			`"mimetype": "text/x-python",`
			`"name": "python",`
			`"nbconvert_exporter": "python",`
			`"pygments_lexer": "ipython3",`
			`"version": "3.8.3"`
			`},`
			`"subtitle": "3.tfidf (1)[ćwiczenia]",`
			`"title": "Ekstrakcja informacji",`
			`"year": "2021"`
			`},`
			`"nbformat": 4,`
			`"nbformat_minor": 4`
			`}`