diff --git a/cw/00_Informacje_na_temat_przedmiotu.ipynb b/cw/00_Informacje_na_temat_przedmiotu.ipynb
new file mode 100644
index 0000000..bf44929
--- /dev/null
+++ b/cw/00_Informacje_na_temat_przedmiotu.ipynb
@@ -0,0 +1,81 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# General information"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Contacting the instructor\n",
+    "\n",
+    "Instructor: mgr inż. Jakub Pokrywka\n",
+    "\n",
+    "The best way to reach me is via MS TEAMS, either on the group channel (general matters) or in private messages. I reply every 2-3 days. You can also arrange a call during my office hours (Tue 12:00-13:00) or agree on another time.\n",
+    "\n",
+    "\n",
+    "## Literature\n",
+    "Recommended reading for the course:\n",
+    "\n",
+    "\n",
+    "- https://www.manning.com/books/relevant-search#toc (free). I recommend at least skimming it.\n",
+    "- Marie-Francine Moens. 2006. Information Extraction: Algorithms and Prospects in a Retrieval Context. Springer. (recommended less strongly; it is somewhat outdated)\n",
+    "- Alex Graves. 2012. Supervised sequence labelling. Studies in Computational Intelligence, vol 385. Springer. Berlin, Heidelberg. \n",
+    "\n",
+    "- Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. North American Chapter of the Association for Computational Linguistics (NAACL). \n",
+    "\n",
+    "- Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu. 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research vol 21, number 140, pages 1-67.  \n",
+    "\n",
+    "- Filip Graliński, Tomasz Stanisławek, Anna Wróblewska, Dawid Lipiński, Agnieszka Kaliska, Paulina Rosalska, Bartosz Topolski, Przemysław Biecek. 2020. Kleister: A novel task for information extraction involving long documents with complex layout. URL https://arxiv.org/abs/2003.02356 \n",
+    "\n",
+    "- Łukasz Garncarek, Rafał Powalski, Tomasz Stanisławek, Bartosz Topolski, Piotr Halama, Filip Graliński. 2020. LAMBERT: Layout-Aware (Language) Modeling using BERT. URL https://arxiv.org/pdf/2002.08087 \n",
+    "\n",
+    "## Grading\n",
+    "\n",
+    "\n",
+    "\n",
+    "At least 500 points will be available.\n",
+    "\n",
+    "Grade:\n",
+    "\n",
+    "- up to 299: 2\n",
+    "\n",
+    "- 300-349: 3\n",
+    "\n",
+    "- 350-399: 3+\n",
+    "\n",
+    "- 400-449: 4\n",
+    "\n",
+    "- 450-499: 4+\n",
+    "\n",
+    "- 500 and above: 5\n",
+    "\n"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.5"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
diff --git a/cw/01_Wyszukiwarki-wprowadzenie.ipynb b/cw/01_Wyszukiwarki-wprowadzenie.ipynb
new file mode 100644
index 0000000..f25a297
--- /dev/null
+++ b/cw/01_Wyszukiwarki-wprowadzenie.ipynb
@@ -0,0 +1,257 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Class 1\n",
+    "\n",
+    "In this class you can earn 5 points for each valuable contribution to the discussion. One person can earn at most 15 points during this class."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Useful materials:\n",
+    "\n",
+    "https://www.google.com/advanced_search\n",
+    "\n",
+    "https://www.google.pl/advanced_image_search\n",
+    "\n",
+    "https://support.google.com/websearch/answer/2466433?hl=en\n",
+    "\n",
+    "https://allegro.pl/pomoc/dla-kupujacych/wyszukiwanie-i-obserwowanie/jak-korzystac-z-wyszukiwarki-i-znalezc-przedmiot-mGwAg2jRrU7\n",
+    "\n",
+    "https://allegro.pl/dla-sprzedajacych/trafnosc-xGmVjoPwOTo\n",
+    "\n",
+    "https://developer.allegro.pl/about/\n",
+    "\n",
+    "https://serpapi.com/"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We will discuss:\n",
+    "- General-purpose search engines (Google, Bing, ...)\n",
+    "- Platform-specific search engines (Amazon, Allegro, OLX, Spar, ...)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Google advanced search\n",
+    "\n",
+    "- \"job steve\"\n",
+    "- poduszka |/OR drzwi \n",
+    "- poduszka -biała\n",
+    "- poduszka * drzwi\n",
+    "- define:pillow\n",
+    "- cache:wp.pl\n",
+    "- poduszka filetype:pdf\n",
+    "- poduszka site:allegro.pl\n",
+    "- related:allegro.pl\n",
+    "- intitle:poduszka\n",
+    "- allintitle:poduszka biała\n",
+    "- inurl:poduszka\n",
+    "- allinurl:poduszka biała\n",
+    "- poduszka AROUND(4) drzwi\n",
+    "- weather:poznan\n",
+    "- stocks:gme\n",
+    "- map:poznań\n",
+    "- $329 in pln\n",
+    "- euro 1990..2000\n",
+    "- 15*30\n",
+    "- color picker\n",
+    "- elon musk @twitter\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Components of the Google search engine\n",
+    "- text input field and search button\n",
+    "- query suggestions while typing\n",
+    "- ghosting\n",
+    "- autocorrection, e.g. for pdouszka\n",
+    "- the reported number of results\n",
+    "- additional elements shown after entering a phrase (answers to general questions, related searches, etc.)\n",
+    "- list of results (split into pages)\n",
+    "- how do the pages work on mobile devices?\n",
+    "- presentation of results: the page title, with matched terms in bold (does Google have the right to show such text snippets on its own page?)\n",
+    "- other components, e.g. for the query best games for nintendo switch\n",
+    "- ads"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Components of a specialized search engine, using Allegro as an example\n",
+    "\n",
+    "- text search or direct navigation through categories\n",
+    "- every user has their own unique way of searching\n",
+    "- search box\n",
+    "- suggestions while typing a phrase\n",
+    "- ghosting (e.g. santander.pl)\n",
+    "- autocorrection (suggestion and redirect)\n",
+    "- you can also specify that the search should cover descriptions, parameters, etc.\n",
+    "- comment: here we type in some phrase\n",
+    "- we get a set of documents, sorted in some way (although it does not have to be)\n",
+    "- how does document retrieval work?\n",
+    " - stop words\n",
+    " - normalization to lowercase\n",
+    " - synonym lists and inflection (including reduction to a single form → wielka poduszka / wielki poduszka, kubek / kubki)\n",
+    "- sorting (discuss the available sort orders), a feature Google does not have\n",
+    "https://allegro.pl/dla-sprzedajacych/trafnosc-xGmVjoPwOTo#moja-oferta-ma-duza-sprzedaz-a-mimo-tego-jest-ona-nizej-w-sortowaniu-po-trafnosci-niz-inne-nowe-oferty-dlaczego-\n",
+    "- relevance can mean something different to each person\n",
+    "- default sorting: what does it mean?\n",
+    "- other kinds of sorting\n",
+    "- reranking\n",
+    "- on the left there is narrowing by category and filters (faceted search), which Google does not have\n",
+    "- there are also sponsored and promoted offers; the dilemma: what matters more, the business or the user?\n",
+    "- recommendations for users at the bottom; strictly speaking this is a separate field\n",
+    "- other options (*szukaj wielu*, i.e. searching for many items at once)\n",
+    "- advanced search: https://allegro.pl/pomoc/dla-kupujacych/wyszukiwanie-i-obserwowanie/jak-korzystac-z-wyszukiwarki-i-znalezc-przedmiot-mGwAg2jRrU7\n",
+    "- evaluating search engine quality: discussion, who would choose what, and where does machine learning fit in?\n",
+    "- what goals must a relevance engineer meet?\n",
+    "- how should search engines be evaluated?"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Search engine APIs\n",
+    "- https://developer.allegro.pl/listing/"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Google Trends"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## SEO (Search Engine Optimization)\n",
+    "- for Google\n",
+    "- for search engines such as Allegro and OLX\n",
+    "- https://allegro.pl/pomoc/dla-kupujacych/wyszukiwanie-i-obserwowanie/jak-korzystac-z-wyszukiwarki-i-znalezc-przedmiot-mGwAg2jRrU7"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Homework\n",
+    "\n",
+    "----------------------\n",
+    "Maximum points for task 100: 30\n",
+    "\n",
+    "Maximum points for tasks 101-107: 50\n",
+    "\n",
+    "\n",
+    "Please submit the tasks as a PDF file in MS TEAMS (group channel → assignments) by the end of 17.03.2021.\n",
+    "\n",
+    "Besides the solution itself, please describe how you arrived at it (e.g. the phrases you typed into the search engine).\n",
+    "\n",
+    "## Task 100\n",
+    "\n",
+    "Find examples of research \"challenges\": monetary rewards for\n",
+    "finding some piece of information, the earliest occurrence of a word, etc.\n",
+    "The challenge must consist in finding information in publicly available sources (the Internet, libraries).\n",
+    "So, for example, rewards for information about a murderer do not count.\n",
+    "We are interested only in \"open\" challenges. The language the challenge concerns is arbitrary.\n",
+    "\n",
+    "Present the challenges as a table: reward, link, short description.\n",
+    "\n",
+    "Points for each challenge found: min( 30, 5*log_10(reward in dollars) )\n",
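+    "\n",
+    "As a worked instance of the logarithmic term (an illustration added here, using the 250-dollar reward from the example in this task):\n",
+    "\n",
+    "$$5 \\cdot \\log_{10}(250) \\approx 5 \\cdot 2.398 = 11.99$$\n",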
+    "\n",
+    "Example: [a $250 reward for finding a mention of the chupacabra\n",
+    "(a monster) from before 1990](http://www.cryptozoonews.com/chupa-250/).\n",
+    "\n",
+    "Maximum number of points: 30.\n",
+    "\n",
+    "\n",
+    "## Task 101\n",
+    "\n",
+    "Give 3 examples of Allegro queries that return surprising or unsatisfying results. Explain what might cause such results.\n",
+    "\n",
+    "Maximum number of points: 20.\n",
+    "\n",
+    "## Task 102\n",
+    "    \n",
+    "Find a French-language PDF published on the Internet before\n",
+    "March 10, 2021 with the largest number of pages.\n",
+    "\n",
+    "Points: 30 (for the largest file).\n",
+    "    \n",
+    "## Task 103\n",
+    "\n",
+    "Find the earliest attestation of the word \"coronavirus\" in English.\n",
+    "\n",
+    "Points: 35\n",
+    "\n",
+    "## Task 104\n",
+    "\n",
+    "Find the earliest attestation of the word \"SARS-CoV-2\" in English.\n",
+    "\n",
+    "Points: 35\n",
+    "    \n",
+    "    \n",
+    "## Task 105\n",
+    "    \n",
+    "Give 3 examples of offers on marketplaces (Allegro, OLX, others) whose titles are non-obvious so that they show up\n",
+    "for as many queries as possible. They should illustrate 3 different reasons. State the reason next to each offer.\n",
+    "\n",
+    "Points: 20\n",
+    "\n",
+    "\n",
+    "## Task 106\n",
+    "\n",
+    "Find a Google Trends chart showing interest in one phrase rising while interest in another phrase\n",
+    "falls. Both phrases should be at least somewhat popular. A causal link between them is not\n",
+    "required, but if there is one, all the better. Use the trend comparison option.\n",
+    "\n",
+    "Points: 20\n",
+    "\n",
+    "## Task 107\n",
+    "\n",
+    "Find a Google Trends query that is popular in some regions of Poland but not in others. What could explain these differences?\n",
+    "\n",
+    "Points: 20\n",
+    "    \n",
+    "    \n"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.5"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
diff --git a/cw/02a_tfidf_tasks.ipynb b/cw/02a_tfidf_tasks.ipynb
new file mode 100644
index 0000000..24b36fa
--- /dev/null
+++ b/cw/02a_tfidf_tasks.ipynb
@@ -0,0 +1,1125 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Class 2\n",
+    "\n",
+    "In this class you can earn 5 points for each valuable contribution to the discussion. One person can earn at most 15 points during this class."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "import re"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## document collection"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "documents = ['Ala lubi zwierzęta i ma kota oraz psa!',\n",
+    "             'Ola lubi zwierzęta oraz ma kota a także chomika!',\n",
+    "             'I Jan jeździ na rowerze.',\n",
+    "             '2 wojna światowa była wielkim konfliktem zbrojnym',\n",
+    "             'Tomek lubi psy, ma psa  i jeździ na motorze i rowerze.',\n",
+    "            ]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### WHAT DO WE WANT?\n",
+    "- we want to turn the texts into a set of words\n",
+    "\n",
+    "\n",
+    "### QUESTION\n",
+    "- can we tokenize a text with e.g. `documents[0].split(' ')`? What problems would that cause?"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## preprocessing"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [],
+   "source": [
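+    "def get_str_cleaned(str_dirty):\n",
+    "    # Lowercase, strip ASCII punctuation, then collapse repeated spaces.\n",
+    "    punctuation = '!\"#$%&\\'()*+,-./:;<=>?@[\\\\]^_`{|}~'\n",
+    "    new_str = str_dirty.lower()\n",
+    "    for char in punctuation:\n",
+    "        new_str = new_str.replace(char, '')\n",
+    "    # Collapse spaces last, so gaps left by removed punctuation are merged too.\n",
+    "    new_str = re.sub(' +', ' ', new_str).strip()\n",
+    "    return new_str\n"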
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "sample_document = get_str_cleaned(documents[0])"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "'ala lubi zwierzęta i ma kota oraz psa'"
+      ]
+     },
+     "execution_count": 5,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "sample_document"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## tokenization"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def tokenize_str(document):\n",
+    "    return document.split(' ')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "['ala', 'lubi', 'zwierzęta', 'i', 'ma', 'kota', 'oraz', 'psa']"
+      ]
+     },
+     "execution_count": 7,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "tokenize_str(sample_document)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "documents_cleaned = [get_str_cleaned(d) for d in documents]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "['ala lubi zwierzęta i ma kota oraz psa',\n",
+       " 'ola lubi zwierzęta oraz ma kota a także chomika',\n",
+       " 'i jan jeździ na rowerze',\n",
+       " '2 wojna światowa była wielkim konfliktem zbrojnym',\n",
+       " 'tomek lubi psy ma psa i jeździ na motorze i rowerze']"
+      ]
+     },
+     "execution_count": 9,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "documents_cleaned"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "documents_tokenized = [tokenize_str(d) for d in documents_cleaned]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "[['ala', 'lubi', 'zwierzęta', 'i', 'ma', 'kota', 'oraz', 'psa'],\n",
+       " ['ola', 'lubi', 'zwierzęta', 'oraz', 'ma', 'kota', 'a', 'także', 'chomika'],\n",
+       " ['i', 'jan', 'jeździ', 'na', 'rowerze'],\n",
+       " ['2', 'wojna', 'światowa', 'była', 'wielkim', 'konfliktem', 'zbrojnym'],\n",
+       " ['tomek',\n",
+       "  'lubi',\n",
+       "  'psy',\n",
+       "  'ma',\n",
+       "  'psa',\n",
+       "  'i',\n",
+       "  'jeździ',\n",
+       "  'na',\n",
+       "  'motorze',\n",
+       "  'i',\n",
+       "  'rowerze']]"
+      ]
+     },
+     "execution_count": 11,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "documents_tokenized"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## QUESTIONS\n",
+    "- what is the next step toward building TF or TF-IDF vectors?\n",
+    "- what size will a TF or TF-IDF vector have?\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 12,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "vocabulary = []\n",
+    "for document in documents_tokenized:\n",
+    "    for word in document:\n",
+    "        vocabulary.append(word)\n",
+    "vocabulary = sorted(set(vocabulary))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 13,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "['2',\n",
+       " 'a',\n",
+       " 'ala',\n",
+       " 'była',\n",
+       " 'chomika',\n",
+       " 'i',\n",
+       " 'jan',\n",
+       " 'jeździ',\n",
+       " 'konfliktem',\n",
+       " 'kota',\n",
+       " 'lubi',\n",
+       " 'ma',\n",
+       " 'motorze',\n",
+       " 'na',\n",
+       " 'ola',\n",
+       " 'oraz',\n",
+       " 'psa',\n",
+       " 'psy',\n",
+       " 'rowerze',\n",
+       " 'także',\n",
+       " 'tomek',\n",
+       " 'wielkim',\n",
+       " 'wojna',\n",
+       " 'zbrojnym',\n",
+       " 'zwierzęta',\n",
+       " 'światowa']"
+      ]
+     },
+     "execution_count": 13,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "vocabulary"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## QUESTIONS\n",
+    "\n",
+    "what will the word \"jak\" look like in the TF vector representation?"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### TASK 1: write a function word_to_index(word: str) that returns a one-hot vector as a numpy array"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 14,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def word_to_index(word):\n",
+    "    pass"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 15,
+   "metadata": {},
+   "outputs": [],
+   "source": [
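+    "def word_to_index(word):\n",
+    "    # One-hot vector over `vocabulary`; an out-of-vocabulary word maps to all zeros\n",
+    "    # (setting vec[-1] would wrongly count it as the last vocabulary word).\n",
+    "    vec = np.zeros(len(vocabulary))\n",
+    "    if word in vocabulary:\n",
+    "        idx = vocabulary.index(word)\n",
+    "        vec[idx] = 1\n",
+    "    return vec"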
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 16,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,\n",
+       "       0., 0., 0., 0., 0., 0., 0., 0., 0.])"
+      ]
+     },
+     "execution_count": 16,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "word_to_index('psa')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### TASK 2: write a function that takes a list of words and turns it into a TF vector\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 17,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def tf(document):\n",
+    "    pass"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 18,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def tf(document):\n",
+    "    # Sum the one-hot vectors of all words: entry i counts occurrences of vocabulary[i].\n",
+    "    # Starting from zeros also gives a valid vector for an empty document.\n",
+    "    document_vector = np.zeros(len(vocabulary))\n",
+    "    for word in document:\n",
+    "        document_vector += word_to_index(word)\n",
+    "    return document_vector"
+   ]
+  },
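+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "As a compact restatement of what `tf` computes (a sketch in notation introduced for this note, not taken from the course materials): let $e(w)$ be the one-hot vector returned by `word_to_index(w)`, let $v_i$ be the $i$-th entry of `vocabulary`, and let $d = (w_1, \\dots, w_n)$ be a tokenized document whose words all occur in the vocabulary.\n",
+    "\n",
+    "$$\\operatorname{tf}(d) = \\sum_{j=1}^{n} e(w_j), \\qquad \\operatorname{tf}(d)_i = \\bigl|\\{\\, j : w_j = v_i \\,\\}\\bigr|$$\n",
+    "\n",
+    "For instance, the word `i` appears twice in `documents_tokenized[4]`, so that document's vector holds 2 at the position of `i`."
+   ]
+  },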
+  {
+   "cell_type": "code",
+   "execution_count": 19,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "array([0., 0., 1., 0., 0., 1., 0., 0., 0., 1., 1., 1., 0., 0., 0., 1., 1.,\n",
+       "       0., 0., 0., 0., 0., 0., 0., 1., 0.])"
+      ]
+     },
+     "execution_count": 19,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "tf(documents_tokenized[0])"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 20,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "documents_vectorized = list()\n",
+    "for document in documents_tokenized:\n",
+    "    document_vector = tf(document)\n",
+    "    documents_vectorized.append(document_vector)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 21,
+   "metadata": {
+    "scrolled": true
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "[array([0., 0., 1., 0., 0., 1., 0., 0., 0., 1., 1., 1., 0., 0., 0., 1., 1.,\n",
+       "        0., 0., 0., 0., 0., 0., 0., 1., 0.]),\n",
+       " array([0., 1., 0., 0., 1., 0., 0., 0., 0., 1., 1., 1., 0., 0., 1., 1., 0.,\n",
+       "        0., 0., 1., 0., 0., 0., 0., 1., 0.]),\n",
+       " array([0., 0., 0., 0., 0., 1., 1., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0.,\n",
+       "        0., 1., 0., 0., 0., 0., 0., 0., 0.]),\n",
+       " array([1., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,\n",
+       "        0., 0., 0., 0., 1., 1., 1., 0., 1.]),\n",
+       " array([0., 0., 0., 0., 0., 2., 0., 1., 0., 0., 1., 1., 1., 1., 0., 0., 1.,\n",
+       "        1., 1., 0., 1., 0., 0., 0., 0., 0.])]"
+      ]
+     },
+     "execution_count": 21,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "documents_vectorized"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### IDF"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 22,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "array([5.        , 5.        , 5.        , 5.        , 5.        ,\n",
+       "       1.66666667, 5.        , 2.5       , 5.        , 2.5       ,\n",
+       "       1.66666667, 1.66666667, 5.        , 2.5       , 5.        ,\n",
+       "       2.5       , 2.5       , 5.        , 2.5       , 5.        ,\n",
+       "       5.        , 5.        , 5.        , 5.        , 2.5       ,\n",
+       "       5.        ])"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    }
+   ],
+   "source": [
+    "# Inverse document frequency without a logarithm:\n",
+    "# number of documents / number of documents containing the term.\n",
+    "idf = len(documents_vectorized) / np.sum(np.array(documents_vectorized) != 0,axis=0)\n",
+    "display(idf)"
+   ]
+  },
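+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Written out as a formula, the cell above computes a plain ratio. The symbols are assumptions made for this note: $N$ is the number of documents and $\\mathrm{df}_i$ is the number of documents whose TF vector is nonzero at index $i$.\n",
+    "\n",
+    "$$\\mathrm{idf}_i = \\frac{N}{\\mathrm{df}_i}$$\n",
+    "\n",
+    "With $N = 5$, a word that occurs in 3 documents gets $5/3 \\approx 1.67$ and a word that occurs in a single document gets $5$, which agrees with the displayed array. Many textbook definitions use $\\log(N / \\mathrm{df}_i)$ instead, which is a different weighting from the one computed here."
+   ]
+  },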
+  {
+   "cell_type": "code",
+   "execution_count": 23,
+   "metadata": {},
+   "outputs": [],
+   "source": [
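+    "# IDF weighting is left disabled here; remove the `#` before `* idf` to scale the TF vectors by idf.\n",
+    "for i in range(len(documents_vectorized)):\n",
+    "    documents_vectorized[i] = documents_vectorized[i]# * idf"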
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### TASK 3: write a function similarity that returns the cosine similarity between two documents in vectorized form"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 24,
+   "metadata": {},
+   "outputs": [],
+   "source": [
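+    "def similarity(query, document):\n",
+    "    # Cosine similarity: dot product divided by the product of the Euclidean norms.\n",
+    "    numerator = np.sum(query * document)\n",
+    "    denominator = np.sqrt(np.sum(query*query)) * np.sqrt(np.sum(document*document))\n",
+    "    if denominator == 0:\n",
+    "        return 0.0  # a zero vector has no direction; treat it as matching nothing\n",
+    "    return numerator / denominator"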
+   ]
+  },
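+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "For reference, the quantity computed by `similarity` can be written as follows, where $q$ and $d$ denote the query and document vectors (symbols chosen for this note):\n",
+    "\n",
+    "$$\\operatorname{sim}(q, d) = \\frac{q \\cdot d}{\\lVert q \\rVert_2 \\, \\lVert d \\rVert_2} = \\frac{\\sum_i q_i d_i}{\\sqrt{\\sum_i q_i^2} \\, \\sqrt{\\sum_i d_i^2}}$$\n",
+    "\n",
+    "As a worked check on the TF vectors above: documents 0 and 1 share five words (kota, lubi, ma, oraz, zwierzęta), each with count 1, so the numerator is 5. Their norms are $\\sqrt{8}$ and $\\sqrt{9} = 3$, which gives $5 / (3\\sqrt{8}) \\approx 0.589$, the value the notebook reports for `similarity(documents_vectorized[0], documents_vectorized[1])`."
+   ]
+  },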
+  {
+   "cell_type": "code",
+   "execution_count": 25,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "'Ala lubi zwierzęta i ma kota oraz psa!'"
+      ]
+     },
+     "execution_count": 25,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "documents[0]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 26,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "array([0., 0., 1., 0., 0., 1., 0., 0., 0., 1., 1., 1., 0., 0., 0., 1., 1.,\n",
+       "       0., 0., 0., 0., 0., 0., 0., 1., 0.])"
+      ]
+     },
+     "execution_count": 26,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "documents_vectorized[0]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 27,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "'Ola lubi zwierzęta oraz ma kota a także chomika!'"
+      ]
+     },
+     "execution_count": 27,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "documents[1]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 28,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "array([0., 1., 0., 0., 1., 0., 0., 0., 0., 1., 1., 1., 0., 0., 1., 1., 0.,\n",
+       "       0., 0., 1., 0., 0., 0., 0., 1., 0.])"
+      ]
+     },
+     "execution_count": 28,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "documents_vectorized[1]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 29,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "0.5892556509887895"
+      ]
+     },
+     "execution_count": 29,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "similarity(documents_vectorized[0],documents_vectorized[1])"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 30,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def transform_query(query):\n",
+    "    query_vector = tf(tokenize_str(get_str_cleaned(query)))\n",
+    "    return query_vector"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 31,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,\n",
+       "       0., 0., 0., 0., 0., 0., 0., 0., 0.])"
+      ]
+     },
+     "execution_count": 31,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "transform_query('psa')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 32,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "0.4999999999999999"
+      ]
+     },
+     "execution_count": 32,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "similarity(transform_query('psa kota'), documents_vectorized[0])"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 33,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "'Ala lubi zwierzęta i ma kota oraz psa!'"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/plain": [
+       "0.4999999999999999"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/plain": [
+       "'Ola lubi zwierzęta oraz ma kota a także chomika!'"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/plain": [
+       "0.2357022603955158"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/plain": [
+       "'I Jan jeździ na rowerze.'"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/plain": [
+       "0.0"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/plain": [
+       "'2 wojna światowa była wielkim konfliktem zbrojnym'"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/plain": [
+       "0.0"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/plain": [
+       "'Tomek lubi psy, ma psa  i jeździ na motorze i rowerze.'"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/plain": [
+       "0.19611613513818402"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    }
+   ],
+   "source": [
+    "# this is how a two-word query is handled\n",
+    "query = 'psa kota'\n",
+    "for i in range(len(documents)):\n",
+    "    display(documents[i])\n",
+    "    display(similarity(transform_query(query), documents_vectorized[i]))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 34,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "'Ala lubi zwierzęta i ma kota oraz psa!'"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/plain": [
+       "0.0"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/plain": [
+       "'Ola lubi zwierzęta oraz ma kota a także chomika!'"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/plain": [
+       "0.0"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/plain": [
+       "'I Jan jeździ na rowerze.'"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/plain": [
+       "0.4472135954999579"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/plain": [
+       "'2 wojna światowa była wielkim konfliktem zbrojnym'"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/plain": [
+       "0.0"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/plain": [
+       "'Tomek lubi psy, ma psa  i jeździ na motorze i rowerze.'"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/plain": [
+       "0.2773500981126146"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    }
+   ],
+   "source": [
+    "# this is why cosine similarity needs the denominator (length normalization)\n",
+    "query = 'rowerze'\n",
+    "for i in range(len(documents)):\n",
+    "    display(documents[i])\n",
+    "    display(similarity(transform_query(query), documents_vectorized[i]))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 35,
+   "metadata": {
+    "scrolled": true
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "'Ala lubi zwierzęta i ma kota oraz psa!'"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/plain": [
+       "0.35355339059327373"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/plain": [
+       "'Ola lubi zwierzęta oraz ma kota a także chomika!'"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/plain": [
+       "0.0"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/plain": [
+       "'I Jan jeździ na rowerze.'"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/plain": [
+       "0.4472135954999579"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/plain": [
+       "'2 wojna światowa była wielkim konfliktem zbrojnym'"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/plain": [
+       "0.0"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/plain": [
+       "'Tomek lubi psy, ma psa  i jeździ na motorze i rowerze.'"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/plain": [
+       "0.5547001962252291"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    }
+   ],
+   "source": [
+    "# this is why we need term frequency: more occurrences mean a better-matching document\n",
+    "query = 'i'\n",
+    "for i in range(len(documents)):\n",
+    "    display(documents[i])\n",
+    "    display(similarity(transform_query(query), documents_vectorized[i]))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 36,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "'Ala lubi zwierzęta i ma kota oraz psa!'"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/plain": [
+       "0.24999999999999994"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/plain": [
+       "'Ola lubi zwierzęta oraz ma kota a także chomika!'"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/plain": [
+       "0.2357022603955158"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/plain": [
+       "'I Jan jeździ na rowerze.'"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/plain": [
+       "0.31622776601683794"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/plain": [
+       "'2 wojna światowa była wielkim konfliktem zbrojnym'"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/plain": [
+       "0.0"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/plain": [
+       "'Tomek lubi psy, ma psa  i jeździ na motorze i rowerze.'"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/plain": [
+       "0.39223227027636803"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    }
+   ],
+   "source": [
+    "# this is why we need IDF: so that more important words get a larger weight\n",
+    "query = 'i chomika'\n",
+    "for i in range(len(documents)):\n",
+    "    display(documents[i])\n",
+    "    display(similarity(transform_query(query), documents_vectorized[i]))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### TASK 4: IMPLEMENT IDF to change the weights from TF to TF-IDF\n",
+    "\n",
+    "Please use the version without any normalization.\n",
+    "\n",
+    "\n",
+    "$idf_i = \\Large\\frac{|D|}{|\\{d : t_i \\in d \\}|}$\n",
+    "\n",
+    "$|D|$ - the number of documents in the corpus\n",
+    "\n",
+    "$|\\{d : t_i \\in d \\}|$ - the number of documents in the corpus in which the given term occurs at least once"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.5"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
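
Task 4 in the notebook above defines an unnormalized IDF as the number of documents divided by the number of documents containing the term. A minimal sketch of how that could be computed is shown here. The function name `compute_idf`, the lowercase whitespace tokenization and the three-sentence toy corpus are illustrative assumptions and do not come from the notebook, whose own tokenizer and variable names may differ.

```python
from collections import Counter


def compute_idf(tokenized_docs):
    """Unnormalized IDF: |D| / (number of documents that contain the term)."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        # set() makes sure a term is counted at most once per document
        doc_freq.update(set(tokens))
    return {term: n_docs / df for term, df in doc_freq.items()}


# Hypothetical toy corpus; plain whitespace tokenization is an assumption.
docs = ["ala ma kota i psa", "ola ma kota i chomika", "jan jeździ na rowerze"]
tokenized = [doc.split() for doc in docs]
idf = compute_idf(tokenized)

# A rare term ("chomika") gets a larger weight than a frequent one ("i").
print(idf["chomika"], idf["i"])
```

Multiplying each term-frequency entry by the matching IDF value would turn TF vectors into TF-IDF vectors, so in a query such as 'i chomika' the rare word would dominate the score.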
diff --git a/cw/02a_tfidf_tasks_ODPOWIEDZI.ipynb b/cw/02a_tfidf_tasks_ODPOWIEDZI.ipynb
new file mode 100644
index 0000000..956cbd9
--- /dev/null
+++ b/cw/02a_tfidf_tasks_ODPOWIEDZI.ipynb
@@ -0,0 +1,32 @@
+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.5"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
diff --git a/cw/02b_tfidf_newsgroup.ipynb b/cw/02b_tfidf_newsgroup.ipynb
new file mode 100644
index 0000000..3961462
--- /dev/null
+++ b/cw/02b_tfidf_newsgroup.ipynb
@@ -0,0 +1,708 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Class 2\n",
+    "\n",
+    "Useful materials:\n",
+    "\n",
+    "https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html\n",
+    "\n",
+    "https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Imports"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "import sklearn.metrics\n",
+    "\n",
+    "from sklearn.datasets import fetch_20newsgroups\n",
+    "\n",
+    "from sklearn.feature_extraction.text import TfidfVectorizer"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Dataset"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "newsgroups = fetch_20newsgroups()['data']"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "11314"
+      ]
+     },
+     "execution_count": 3,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "len(newsgroups)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "From: lerxst@wam.umd.edu (where's my thing)\n",
+      "Subject: WHAT car is this!?\n",
+      "Nntp-Posting-Host: rac3.wam.umd.edu\n",
+      "Organization: University of Maryland, College Park\n",
+      "Lines: 15\n",
+      "\n",
+      " I was wondering if anyone out there could enlighten me on this car I saw\n",
+      "the other day. It was a 2-door sports car, looked to be from the late 60s/\n",
+      "early 70s. It was called a Bricklin. The doors were really small. In addition,\n",
+      "the front bumper was separate from the rest of the body. This is \n",
+      "all I know. If anyone can tellme a model name, engine specs, years\n",
+      "of production, where this car is made, history, or whatever info you\n",
+      "have on this funky looking car, please e-mail.\n",
+      "\n",
+      "Thanks,\n",
+      "- IL\n",
+      "   ---- brought to you by your neighborhood Lerxst ----\n",
+      "\n",
+      "\n",
+      "\n",
+      "\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(newsgroups[0])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Naive search"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "all_documents = []\n",
+    "for document in newsgroups:\n",
+    "    if 'car' in document:\n",
+    "        all_documents.append(document)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "From: lerxst@wam.umd.edu (where's my thing)\n",
+      "Subject: WHAT car is this!?\n",
+      "Nntp-Posting-Host: rac3.wam.umd.edu\n",
+      "Organization: University of Maryland, College Park\n",
+      "Lines: 15\n",
+      "\n",
+      " I was wondering if anyone out there could enlighten me on this car I saw\n",
+      "the other day. It was a 2-door sports car, looked to be from the late 60s/\n",
+      "early 70s. It was called a Bricklin. The doors were really small. In addition,\n",
+      "the front bumper was separate from the rest of the body. This is \n",
+      "all I know. If anyone can tellme a model name, engine specs, years\n",
+      "of production, where this car is made, history, or whatever info you\n",
+      "have on this funky looking car, please e-mail.\n",
+      "\n",
+      "Thanks,\n",
+      "- IL\n",
+      "   ---- brought to you by your neighborhood Lerxst ----\n",
+      "\n",
+      "\n",
+      "\n",
+      "\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(all_documents[0])"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "From: guykuo@carson.u.washington.edu (Guy Kuo)\n",
+      "Subject: SI Clock Poll - Final Call\n",
+      "Summary: Final call for SI clock reports\n",
+      "Keywords: SI,acceleration,clock,upgrade\n",
+      "Article-I.D.: shelley.1qvfo9INNc3s\n",
+      "Organization: University of Washington\n",
+      "Lines: 11\n",
+      "NNTP-Posting-Host: carson.u.washington.edu\n",
+      "\n",
+      "A fair number of brave souls who upgraded their SI clock oscillator have\n",
+      "shared their experiences for this poll. Please send a brief message detailing\n",
+      "your experiences with the procedure. Top speed attained, CPU rated speed,\n",
+      "add on cards and adapters, heat sinks, hour of usage per day, floppy disk\n",
+      "functionality with 800 and 1.4 m floppies are especially requested.\n",
+      "\n",
+      "I will be summarizing in the next two days, so please add to the network\n",
+      "knowledge base if you have done the clock upgrade and haven't answered this\n",
+      "poll. Thanks.\n",
+      "\n",
+      "Guy Kuo <guykuo@u.washington.edu>\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(all_documents[1])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### What are the problems with this approach?\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## TF-IDF and cosine similarity: ready-made libraries"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "vectorizer = TfidfVectorizer()\n",
+    "#vectorizer = TfidfVectorizer(use_idf = False, ngram_range=(1,2))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "document_vectors = vectorizer.fit_transform(newsgroups)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "<11314x130107 sparse matrix of type '<class 'numpy.float64'>'\n",
+       "\twith 1787565 stored elements in Compressed Sparse Row format>"
+      ]
+     },
+     "execution_count": 10,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "document_vectors"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "<1x130107 sparse matrix of type '<class 'numpy.float64'>'\n",
+       "\twith 89 stored elements in Compressed Sparse Row format>"
+      ]
+     },
+     "execution_count": 11,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "document_vectors[0]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 12,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "matrix([[0., 0., 0., ..., 0., 0., 0.]])"
+      ]
+     },
+     "execution_count": 12,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "document_vectors[0].todense()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 13,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "matrix([[0., 0., 0., ..., 0., 0., 0.],\n",
+       "        [0., 0., 0., ..., 0., 0., 0.],\n",
+       "        [0., 0., 0., ..., 0., 0., 0.],\n",
+       "        [0., 0., 0., ..., 0., 0., 0.]])"
+      ]
+     },
+     "execution_count": 13,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "document_vectors[0:4].todense()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 14,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "query_str = 'speed'\n",
+    "#query_str = 'speed car'\n",
+    "#query_str = 'spider man'"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 15,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "query_vector = vectorizer.transform([query_str])"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 16,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "<11314x130107 sparse matrix of type '<class 'numpy.float64'>'\n",
+       "\twith 1787565 stored elements in Compressed Sparse Row format>"
+      ]
+     },
+     "execution_count": 16,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "document_vectors"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 17,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "<1x130107 sparse matrix of type '<class 'numpy.float64'>'\n",
+       "\twith 1 stored elements in Compressed Sparse Row format>"
+      ]
+     },
+     "execution_count": 17,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "query_vector"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 18,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "similarities = sklearn.metrics.pairwise.cosine_similarity(query_vector, document_vectors)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 19,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "array([0.26949927, 0.3491801 , 0.44292083, 0.47784165])"
+      ]
+     },
+     "execution_count": 19,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "np.sort(similarities)[0][-4:]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 20,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "array([4517, 5509, 2116, 9921])"
+      ]
+     },
+     "execution_count": 20,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "similarities.argsort()[0][-4:]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 21,
+   "metadata": {
+    "scrolled": false
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "From: ray@netcom.com (Ray Fischer)\n",
+      "Subject: Re: x86 ~= 680x0 ??  (How do they compare?)\n",
+      "Organization: Netcom. San Jose, California\n",
+      "Distribution: usa\n",
+      "Lines: 36\n",
+      "\n",
+      "dhk@ubbpc.uucp (Dave Kitabjian) writes ...\n",
+      ">I'm sure Intel and Motorola are competing neck-and-neck for \n",
+      ">crunch-power, but for a given clock speed, how do we rank the\n",
+      ">following (from 1st to 6th):\n",
+      ">  486\t\t68040\n",
+      ">  386\t\t68030\n",
+      ">  286\t\t68020\n",
+      "\n",
+      "040 486 030 386 020 286\n",
+      "\n",
+      ">While you're at it, where will the following fit into the list:\n",
+      ">  68060\n",
+      ">  Pentium\n",
+      ">  PowerPC\n",
+      "\n",
+      "060 fastest, then Pentium, with the first versions of the PowerPC\n",
+      "somewhere in the vicinity.\n",
+      "\n",
+      ">And about clock speed:  Does doubling the clock speed double the\n",
+      ">overall processor speed?  And fill in the __'s below:\n",
+      ">  68030 @ __ MHz = 68040 @ __ MHz\n",
+      "\n",
+      "No.  Computer speed is only partly dependent of processor/clock speed.\n",
+      "Memory system speed play a large role as does video system speed and\n",
+      "I/O speed.  As processor clock rates go up, the speed of the memory\n",
+      "system becomes the greatest factor in the overall system speed.  If\n",
+      "you have a 50MHz processor, it can be reading another word from memory\n",
+      "every 20ns.  Sure, you can put all 20ns memory in your computer, but\n",
+      "it will cost 10 times as much as the slower 80ns SIMMs.\n",
+      "\n",
+      "And roughly, the 68040 is twice as fast at a given clock\n",
+      "speed as is the 68030.\n",
+      "\n",
+      "-- \n",
+      "Ray Fischer                   \"Convictions are more dangerous enemies of truth\n",
+      "ray@netcom.com                 than lies.\"  -- Friedrich Nietzsche\n",
+      "\n",
+      "0.4778416465020907\n",
+      "----------------------------------------------------------------------------------------------------\n",
+      "----------------------------------------------------------------------------------------------------\n",
+      "----------------------------------------------------------------------------------------------------\n",
+      "From: rvenkate@ux4.cso.uiuc.edu (Ravikuma Venkateswar)\n",
+      "Subject: Re: x86 ~= 680x0 ?? (How do they compare?)\n",
+      "Distribution: usa\n",
+      "Organization: University of Illinois at Urbana\n",
+      "Lines: 59\n",
+      "\n",
+      "ray@netcom.com (Ray Fischer) writes:\n",
+      "\n",
+      ">dhk@ubbpc.uucp (Dave Kitabjian) writes ...\n",
+      ">>I'm sure Intel and Motorola are competing neck-and-neck for \n",
+      ">>crunch-power, but for a given clock speed, how do we rank the\n",
+      ">>following (from 1st to 6th):\n",
+      ">>  486\t\t68040\n",
+      ">>  386\t\t68030\n",
+      ">>  286\t\t68020\n",
+      "\n",
+      ">040 486 030 386 020 286\n",
+      "\n",
+      "How about some numbers here? Some kind of benchmark?\n",
+      "If you want, let me start it - 486DX2-66 - 32 SPECint92, 16 SPECfp92 .\n",
+      "\n",
+      ">>While you're at it, where will the following fit into the list:\n",
+      ">>  68060\n",
+      ">>  Pentium\n",
+      ">>  PowerPC\n",
+      "\n",
+      ">060 fastest, then Pentium, with the first versions of the PowerPC\n",
+      ">somewhere in the vicinity.\n",
+      "\n",
+      "Numbers? Pentium @66MHz - 65 SPECint92, 57 SPECfp92 .\n",
+      "\t PowerPC @66MHz - 50 SPECint92, 80 SPECfp92 . (Note this is the 601)\n",
+      "        (Alpha @150MHz  - 74 SPECint92,126 SPECfp92 - just for comparison)\n",
+      "\n",
+      ">>And about clock speed:  Does doubling the clock speed double the\n",
+      ">>overall processor speed?  And fill in the __'s below:\n",
+      ">>  68030 @ __ MHz = 68040 @ __ MHz\n",
+      "\n",
+      ">No.  Computer speed is only partly dependent of processor/clock speed.\n",
+      ">Memory system speed play a large role as does video system speed and\n",
+      ">I/O speed.  As processor clock rates go up, the speed of the memory\n",
+      ">system becomes the greatest factor in the overall system speed.  If\n",
+      ">you have a 50MHz processor, it can be reading another word from memory\n",
+      ">every 20ns.  Sure, you can put all 20ns memory in your computer, but\n",
+      ">it will cost 10 times as much as the slower 80ns SIMMs.\n",
+      "\n",
+      "Not in a clock-doubled system. There isn't a doubling in performance, but\n",
+      "it _is_ quite significant. Maybe about a 70% increase in performance.\n",
+      "\n",
+      "Besides, for 0 wait state performance, you'd need a cache anyway. I mean,\n",
+      "who uses a processor that runs at the speed of 80ns SIMMs? Note that this\n",
+      "memory speed corresponds to a clock speed of 12.5 MHz.\n",
+      "\n",
+      ">And roughly, the 68040 is twice as fast at a given clock\n",
+      ">speed as is the 68030.\n",
+      "\n",
+      "Numbers?\n",
+      "\n",
+      ">-- \n",
+      ">Ray Fischer                   \"Convictions are more dangerous enemies of truth\n",
+      ">ray@netcom.com                 than lies.\"  -- Friedrich Nietzsche\n",
+      "-- \n",
+      "Ravikumar Venkateswar\n",
+      "rvenkate@uiuc.edu\n",
+      "\n",
+      "A pun is a no' blessed form of whit.\n",
+      "\n",
+      "0.44292082969477664\n",
+      "----------------------------------------------------------------------------------------------------\n",
+      "----------------------------------------------------------------------------------------------------\n",
+      "----------------------------------------------------------------------------------------------------\n",
+      "From: ray@netcom.com (Ray Fischer)\n",
+      "Subject: Re: x86 ~= 680x0 ?? (How do they compare?)\n",
+      "Organization: Netcom. San Jose, California\n",
+      "Distribution: usa\n",
+      "Lines: 30\n",
+      "\n",
+      "rvenkate@ux4.cso.uiuc.edu (Ravikuma Venkateswar) writes ...\n",
+      ">ray@netcom.com (Ray Fischer) writes:\n",
+      ">>040 486 030 386 020 286\n",
+      ">\n",
+      ">How about some numbers here? Some kind of benchmark?\n",
+      "\n",
+      "Benchmarks are for marketing dweebs and CPU envy.  OK, if it will make\n",
+      "you happy, the 486 is faster than the 040.  BFD.  Both architectures\n",
+      "are nearing then end of their lifetimes.  And especially with the x86\n",
+      "architecture: good riddance.\n",
+      "\n",
+      ">Besides, for 0 wait state performance, you'd need a cache anyway. I mean,\n",
+      ">who uses a processor that runs at the speed of 80ns SIMMs? Note that this\n",
+      ">memory speed corresponds to a clock speed of 12.5 MHz.\n",
+      "\n",
+      "The point being the processor speed is only one of many aspects of a\n",
+      "computers performance.  Clock speed, processor, memory speed, CPU\n",
+      "architecture, I/O systems, even the application program all contribute \n",
+      "to the overall system performance.\n",
+      "\n",
+      ">>And roughly, the 68040 is twice as fast at a given clock\n",
+      ">>speed as is the 68030.\n",
+      ">\n",
+      ">Numbers?\n",
+      "\n",
+      "Look them up yourself.\n",
+      "\n",
+      "-- \n",
+      "Ray Fischer                   \"Convictions are more dangerous enemies of truth\n",
+      "ray@netcom.com                 than lies.\"  -- Friedrich Nietzsche\n",
+      "\n",
+      "0.3491800997095306\n",
+      "----------------------------------------------------------------------------------------------------\n",
+      "----------------------------------------------------------------------------------------------------\n",
+      "----------------------------------------------------------------------------------------------------\n",
+      "From: mb4008@cehp11 (Morgan J Bullard)\n",
+      "Subject: Re: speeding up windows\n",
+      "Keywords: speed\n",
+      "Organization: University of Illinois at Urbana\n",
+      "Lines: 30\n",
+      "\n",
+      "djserian@flash.LakeheadU.Ca (Reincarnation of Elvis) writes:\n",
+      "\n",
+      ">I have a 386/33 with 8 megs of memory\n",
+      "\n",
+      ">I have noticed that lately when I use programs like WpfW or Corel Draw\n",
+      ">my computer \"boggs\" down and becomes really sluggish!\n",
+      "\n",
+      ">What can I do to increase performance?  What should I turn on or off\n",
+      "\n",
+      ">Will not loading wallpapers or stuff like that help when it comes to\n",
+      ">the running speed of windows and the programs that run under it?\n",
+      "\n",
+      ">Thanx in advance\n",
+      "\n",
+      ">Derek\n",
+      "\n",
+      "1) make sure your hard drive is defragmented. This will speed up more than \n",
+      "   just windows BTW.  Use something like Norton's or PC Tools.\n",
+      "2) I _think_ that leaving the wall paper out will use less RAM and therefore\n",
+      "   will speed up your machine but I could very will be wrong on this.\n",
+      "There's a good chance you've already done this but if not it may speed things\n",
+      "up.  good luck\n",
+      "\t\t\t\tMorgan Bullard mb4008@coewl.cen.uiuc.edu\n",
+      "\t\t\t\t\t  or   mjbb@uxa.cso.uiuc.edu\n",
+      "\n",
+      ">--\n",
+      ">$_    /|$Derek J.P. Serianni $ E-Mail : djserian@flash.lakeheadu.ca           $ \n",
+      ">$\\'o.O' $Sociologist         $ It's 106 miles to Chicago,we've got a full tank$\n",
+      ">$=(___)=$Lakehead University $ of gas, half a pack of cigarettes,it's dark,and$\n",
+      ">$   U   $Thunder Bay, Ontario$ we're wearing sunglasses. -Elwood Blues        $  \n",
+      "\n",
+      "0.26949927393886913\n",
+      "----------------------------------------------------------------------------------------------------\n",
+      "----------------------------------------------------------------------------------------------------\n",
+      "----------------------------------------------------------------------------------------------------\n"
+     ]
+    }
+   ],
+   "source": [
+    "for i in range(1, 5):\n",
+    "    print(newsgroups[similarities.argsort()[0][-i]])\n",
+    "    print(np.sort(similarities)[0,-i])\n",
+    "    print('-'*100)\n",
+    "    print('-'*100)\n",
+    "    print('-'*100)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Homework\n",
+    "\n",
+    "\n",
+    "- Choose a text collection that contains at least 5000 documents.\n",
+    "- Build a search engine on top of it based on Okapi BM25, i.e. a system that, for a given query, returns a few (5-10) best-matching documents in ranked order. You may use existing libraries to vectorize the documents, but you must implement Okapi BM25 yourself.\n",
+    "- Find a query for which the results are unsatisfactory.\n",
+    "- Improve the search engine (e.g. by changing the text preprocessing, the vectorizer, the parameters of the ranking algorithm, or the algorithm itself) so that it returns satisfactory results for that query.\n",
+    "- Present your work at the next class (15.03), answering the following questions:\n",
+    " - what the collection and the search system look like before the changes\n",
+    " - for which query the results are unsatisfactory (show the results)\n",
+    " - what changes were made\n",
+    " - what the search results look like after the changes\n",
+    " - how the changes affected the results (1-2 sentences)\n",
+    " \n",
+    "The presentation should be as simple as possible and take at most 2-3 minutes.\n",
+    "Points available: 40\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.5"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
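
The homework in the notebook above asks for a hand-written Okapi BM25 ranker. As a possible starting point, the sketch below shows only the scoring step on a tiny made-up corpus. The function name `bm25_scores`, the defaults `k1=1.5` and `b=0.75`, the whitespace tokenizer and the smoothed IDF variant `log(1 + (N - n + 0.5) / (n + 0.5))` are assumptions picked for illustration. Other BM25 variants exist, and nothing here is taken from the course materials.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    """Return one Okapi BM25 score per document for the given query tokens."""
    n_docs = len(tokenized_docs)
    avg_len = sum(len(doc) for doc in tokenized_docs) / n_docs

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))

    scores = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        # Longer-than-average documents are penalized, shorter ones boosted.
        length_norm = 1 - b + b * len(doc) / avg_len
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            n_t = doc_freq[term]
            # Smoothed IDF variant (an assumption); it is always positive.
            idf = math.log(1 + (n_docs - n_t + 0.5) / (n_t + 0.5))
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


# Hypothetical toy corpus, not the 20 newsgroups data.
docs = [
    "clock speed and memory speed",
    "a car with a small door",
    "speed",
]
tokenized = [doc.lower().split() for doc in docs]
scores = bm25_scores("speed".split(), tokenized)

# Document indices ordered from best to worst match.
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

In this toy example the one-word document outranks the longer one that mentions the term twice, which illustrates how the `b` parameter controls length normalization; lowering `b` toward 0 would weaken that effect.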