diff --git a/cw/00_Informacje_na_temat_przedmiotu.ipynb b/cw/00_Informacje_na_temat_przedmiotu.ipynb new file mode 100644 index 0000000..bf44929 --- /dev/null +++ b/cw/00_Informacje_na_temat_przedmiotu.ipynb @@ -0,0 +1,81 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# General information" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Contacting the instructor\n", + "\n", + "Instructor: mgr inż. Jakub Pokrywka\n", + "\n", + "The best way to reach me is via MS Teams, either on the group channel (general matters) or by private message. I reply every 2-3 days. You can also arrange a call during office hours (Tue 12:00-13:00) or at another agreed time.\n", + "\n", + "\n", + "## Literature\n", + "Recommended reading for the course:\n", + "\n", + "\n", + "- https://www.manning.com/books/relevant-search#toc (free). Worth at least skimming.\n", + "- Marie-Francine Moens. 2006. Information Extraction: Algorithms and Prospects in a Retrieval Context. Springer. (recommended less strongly; it is somewhat outdated)\n", + "- Alex Graves. 2012. Supervised sequence labelling. Studies in Computational Intelligence, vol 385. Springer. Berlin, Heidelberg. \n", + "\n", + "- Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. North American Chapter of the Association for Computational Linguistics (NAACL). \n", + "\n", + "- Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu. 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research vol 21, number 140, pages 1-67. \n", + "\n", + "- Filip Graliński, Tomasz Stanisławek, Anna Wróblewska, Dawid Lipiński, Agnieszka Kaliska, Paulina Rosalska, Bartosz Topolski, Przemysław Biecek. 2020. 
Kleister: A novel task for information extraction involving long documents with complex layout. URL https://arxiv.org/abs/2003.02356 \n", + "\n", + "- Łukasz Garncarek, Rafał Powalski, Tomasz Stanisławek, Bartosz Topolski, Piotr Halama, Filip Graliński. 2020. LAMBERT: Layout-Aware (Language) Modeling using BERT. URL https://arxiv.org/pdf/2002.08087 \n", + "\n", + "## Grading\n", + "\n", + "\n", + "\n", + "At least 500 points will be available.\n", + "\n", + "Grades:\n", + "\n", + "- up to 299: 2\n", + "\n", + "- 300-349: 3\n", + "\n", + "- 350-399: 3+\n", + "\n", + "- 400-449: 4\n", + "\n", + "- 450-499: 4+\n", + "\n", + "- 500 and above: 5\n", + "\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.5" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/cw/01_Wyszukiwarki-wprowadzenie.ipynb b/cw/01_Wyszukiwarki-wprowadzenie.ipynb new file mode 100644 index 0000000..f25a297 --- /dev/null +++ b/cw/01_Wyszukiwarki-wprowadzenie.ipynb @@ -0,0 +1,257 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Class 1\n", + "\n", + "In this class you can earn 5 points for each valuable contribution to the discussion, up to a maximum of 15 points per person."
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Useful materials:\n", + "\n", + "https://www.google.com/advanced_search\n", + "\n", + "https://www.google.pl/advanced_image_search\n", + "\n", + "https://support.google.com/websearch/answer/2466433?hl=en\n", + "\n", + "https://allegro.pl/pomoc/dla-kupujacych/wyszukiwanie-i-obserwowanie/jak-korzystac-z-wyszukiwarki-i-znalezc-przedmiot-mGwAg2jRrU7\n", + "\n", + "https://allegro.pl/dla-sprzedajacych/trafnosc-xGmVjoPwOTo\n", + "\n", + "https://developer.allegro.pl/about/\n", + "\n", + "https://serpapi.com/" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We will discuss: \n", + "- General-purpose search engines (Google, Bing, ...)\n", + "- Platform-specific search engines (Amazon, Allegro, OLX, Spar, ...)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Google advanced search\n", + "\n", + "- \"job steve\"\n", + "- poduszka |/OR drzwi \n", + "- poduszka -biała\n", + "- poduszka * drzwi\n", + "- define:pillow\n", + "- cache:wp.pl\n", + "- poduszka filetype:pdf\n", + "- poduszka site:allegro.pl\n", + "- related:allegro.pl\n", + "- intitle:poduszka\n", + "- allintitle:poduszka biała\n", + "- inurl:poduszka\n", + "- allinurl:poduszka biała\n", + "- poduszka AROUND(4) drzwi\n", + "- weather:poznan\n", + "- stocks:gme\n", + "- map:poznań\n", + "- $329 in pln\n", + "- euro 1990..2000\n", + "- 15*30\n", + "- color picker\n", + "- elon musk @twitter\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Components of the Google search engine\n", + "- text input field and search button \n", + "- query suggestions while typing\n", + "- ghosting\n", + "- autocorrection, e.g. 
pdouszka\n", + "- number of results displayed \n", + "- additional elements shown after entering a phrase (answers to general questions, related searches, etc.)\n", + "- list of results (split into pages)\n", + "- how do the pages work on mobile devices?\n", + "- result presentation: page title plus bolding where the match occurs (does Google have the right to display such text on its own page?)\n", + "- other components, e.g. for the query best games for nintendo switch\n", + "- ads" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Components of a specialized search engine, using Allegro as an example\n", + "\n", + "- text search or direct navigation through categories\n", + "- every user has their own unique way of searching\n", + "- search box\n", + "- suggestions while typing a phrase\n", + "- ghosting (e.g. santander.pl)\n", + "- autocorrection (suggestion and redirect)\n", + "- you can also choose to search in descriptions, parameters, etc.\n", + "- note: this is where we type a phrase\n", + "- we get a set of documents sorted in some way (though this does not have to be the case)\n", + "- how does document retrieval work?\n", + " - stop words \n", + " - lowercase normalization\n", + " - synonym lists, inflection (including normalization to a single form → wielka poduszka / wielki poduszka, kubek / kubki)\n", + "- sorting (discuss the available sort orders), a feature Google does not have\n", + "https://allegro.pl/dla-sprzedajacych/trafnosc-xGmVjoPwOTo#moja-oferta-ma-duza-sprzedaz-a-mimo-tego-jest-ona-nizej-w-sortowaniu-po-trafnosci-niz-inne-nowe-oferty-dlaczego-\n", + "- relevance can mean something different to each user\n", + "- default sorting: what does it mean?\n", + "- other kinds of sorting\n", + "- reranking \n", + "- on the left there is category narrowing and filters (faceted search), which Google does not have\n", + "- there are also sponsored and promoted offers; the dilemma: what matters more, 
the business or the user\n", + "- recommendations for users at the bottom (strictly speaking, a separate field) \n", + "- other options (search for multiple items)\n", + "- advanced search: https://allegro.pl/pomoc/dla-kupujacych/wyszukiwanie-i-obserwowanie/jak-korzystac-z-wyszukiwarki-i-znalezc-przedmiot-mGwAg2jRrU7\n", + "- evaluating search quality: discussion of who would choose what, and where machine learning fits in\n", + "- what goals does a relevance engineer have to meet?\n", + "- how should search engines be evaluated?" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Search engine APIs\n", + "- https://developer.allegro.pl/listing/" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Google Trends" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## SEO (Search Engine Optimization)\n", + "- for Google\n", + "- for search engines such as Allegro or OLX \n", + "- https://allegro.pl/pomoc/dla-kupujacych/wyszukiwanie-i-obserwowanie/jak-korzystac-z-wyszukiwarki-i-znalezc-przedmiot-mGwAg2jRrU7" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Homework\n", + "\n", + "----------------------\n", + "Maximum points for task 100: 30\n", + "\n", + "Maximum points for tasks 101-107: 50\n", + "\n", + "\n", + "Please submit the tasks as a PDF file in MS Teams (group channel → Assignments) by the end of 17.03.2021.\n", + "\n", + "Besides the solution itself, please describe how you arrived at it (e.g. the phrases you typed into the search engine).\n", + "\n", + "## Task 100\n", + "\n", + "Find examples of research “challenges”: cash prizes for\n", + "finding some piece of information, the earliest occurrence of a word, etc.\n", + "The challenge must consist of finding information in publicly available sources (the internet, libraries).\n", + "This excludes, for example, 
rewards for information about a murderer, etc.\n", + "We are only interested in “open” challenges. The challenge may concern any language.\n", + "\n", + "List the challenges in a table: prize, link, short description.\n", + "\n", + "Points for each challenge found: min( 30, 5*log_10(prize in dollars) )\n", + "\n", + "Example: [a $250 prize for finding a mention of the chupacabra\n", + "(a monster) from before 1990](http://www.cryptozoonews.com/chupa-250/).\n", + "\n", + "Maximum points: 30.\n", + "\n", + "\n", + "## Task 101\n", + "\n", + "Give 3 examples of Allegro queries that return surprising or unsatisfying results. Explain what might cause such results.\n", + "\n", + "Maximum points: 20.\n", + "\n", + "## Task 102\n", + " \n", + "Find the French-language PDF published on the Internet before\n", + "10 March 2021 that has the largest number of pages.\n", + "\n", + "Points: 30 (for the largest file).\n", + " \n", + "## Task 103\n", + "\n", + "Find the earliest attestation of the word \"coronavirus\" in English.\n", + "\n", + "Points: 35\n", + "\n", + "## Task 104\n", + "\n", + "Find the earliest attestation of the word \"SARS-CoV-2\" in English.\n", + "Points: 35\n", + " \n", + " \n", + "## Task 105\n", + " \n", + "Give 3 examples of offers on marketplaces (Allegro, OLX, others) whose titles are deliberately unusual so that they show up\n", + "for as many queries as possible. Each should illustrate a different reason. State the reason next to each offer.\n", + "\n", + "Points: 20\n", + "\n", + "\n", + "## Task 106\n", + "\n", + "Find a Google Trends chart that shows interest in one phrase rising while\n", + "interest in another phrase falls. Both phrases should be at least somewhat popular. A causal link \n", + "is not required, but if there is one, all the better. 
Use the trend comparison option.\n", + "\n", + "Points: 20\n", + "\n", + "## Task 107\n", + "\n", + "Find a Google Trends query that is popular in some regions of Poland but not in others. What might explain the differences?\n", + "\n", + "Points: 20\n", + " \n", + " \n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.5" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/cw/02a_tfidf_tasks.ipynb b/cw/02a_tfidf_tasks.ipynb new file mode 100644 index 0000000..24b36fa --- /dev/null +++ b/cw/02a_tfidf_tasks.ipynb @@ -0,0 +1,1125 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Class 2\n", + "\n", + "In this class you can earn 5 points for each valuable contribution to the discussion, up to a maximum of 15 points per person."
+ ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "import numpy as np\n", + "import re" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Document collection" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "documents = ['Ala lubi zwierzęta i ma kota oraz psa!',\n", + " 'Ola lubi zwierzęta oraz ma kota a także chomika!',\n", + " 'I Jan jeździ na rowerze.',\n", + " '2 wojna światowa była wielkim konfliktem zbrojnym',\n", + " 'Tomek lubi psy, ma psa i jeździ na motorze i rowerze.',\n", + " ]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### WHAT DO WE WANT?\n", + "- we want to turn the texts into a set of words\n", + "\n", + "\n", + "### QUESTION\n", + "- can we tokenize the text with, e.g., document.split(' ')? What problems would that cause?" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Preprocessing" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "def get_str_cleaned(str_dirty):\n", + "    punctuation = '!\"#$%&\\'()*+,-./:;<=>?@[\\\\]^_`{|}~'\n", + "    new_str = str_dirty.lower()\n", + "    for char in punctuation:\n", + "        new_str = new_str.replace(char,'')\n", + "    new_str = re.sub(' +', ' ', new_str)  # collapse repeated spaces, including those left behind by removed punctuation\n", + "    return new_str\n" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "sample_document = get_str_cleaned(documents[0])" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'ala lubi zwierzęta i ma kota oraz psa'" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "sample_document" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Tokenization" + ] + }, + { + "cell_type": "code", + 
"execution_count": 6, + "metadata": {}, + "outputs": [], + "source": [ + "def tokenize_str(document):\n", + "    return document.split(' ')" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "['ala', 'lubi', 'zwierzęta', 'i', 'ma', 'kota', 'oraz', 'psa']" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "tokenize_str(sample_document)" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [], + "source": [ + "documents_cleaned = [get_str_cleaned(d) for d in documents]" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "['ala lubi zwierzęta i ma kota oraz psa',\n", + " 'ola lubi zwierzęta oraz ma kota a także chomika',\n", + " 'i jan jeździ na rowerze',\n", + " '2 wojna światowa była wielkim konfliktem zbrojnym',\n", + " 'tomek lubi psy ma psa i jeździ na motorze i rowerze']" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "documents_cleaned" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [], + "source": [ + "documents_tokenized = [tokenize_str(d) for d in documents_cleaned]" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[['ala', 'lubi', 'zwierzęta', 'i', 'ma', 'kota', 'oraz', 'psa'],\n", + " ['ola', 'lubi', 'zwierzęta', 'oraz', 'ma', 'kota', 'a', 'także', 'chomika'],\n", + " ['i', 'jan', 'jeździ', 'na', 'rowerze'],\n", + " ['2', 'wojna', 'światowa', 'była', 'wielkim', 'konfliktem', 'zbrojnym'],\n", + " ['tomek',\n", + " 'lubi',\n", + " 'psy',\n", + " 'ma',\n", + " 'psa',\n", + " 'i',\n", + " 'jeździ',\n", + " 'na',\n", + " 'motorze',\n", + " 'i',\n", + " 'rowerze']]" + ] + }, + "execution_count": 11, + "metadata": {}, + 
"output_type": "execute_result" + } + ], + "source": [ + "documents_tokenized" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## QUESTIONS\n", + "- what is the next step towards building TF or TF-IDF vectors?\n", + "- what size will a TF or TF-IDF vector have?\n" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [], + "source": [ + "vocabulary = []\n", + "for document in documents_tokenized:\n", + "    for word in document:\n", + "        vocabulary.append(word)\n", + "vocabulary = sorted(set(vocabulary))" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "['2',\n", + " 'a',\n", + " 'ala',\n", + " 'była',\n", + " 'chomika',\n", + " 'i',\n", + " 'jan',\n", + " 'jeździ',\n", + " 'konfliktem',\n", + " 'kota',\n", + " 'lubi',\n", + " 'ma',\n", + " 'motorze',\n", + " 'na',\n", + " 'ola',\n", + " 'oraz',\n", + " 'psa',\n", + " 'psy',\n", + " 'rowerze',\n", + " 'także',\n", + " 'tomek',\n", + " 'wielkim',\n", + " 'wojna',\n", + " 'zbrojnym',\n", + " 'zwierzęta',\n", + " 'światowa']" + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "vocabulary" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## QUESTIONS\n", + "\n", + "how will the word \"jak\" be represented as a TF vector?"
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### TASK 1 Write a function word_to_index(word: str) that returns a one-hot vector as a numpy array" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [], + "source": [ + "def word_to_index(word):\n", + "    pass" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [], + "source": [ + "def word_to_index(word):\n", + "    vec = np.zeros(len(vocabulary))\n", + "    if word in vocabulary:\n", + "        idx = vocabulary.index(word)\n", + "        vec[idx] = 1\n", + "    # Out-of-vocabulary words keep the all-zero vector; marking the last slot\n", + "    # would make them indistinguishable from the last vocabulary word.\n", + "    return vec" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,\n", + " 0., 0., 0., 0., 0., 0., 0., 0., 0.])" + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "word_to_index('psa')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### TASK 2 Write a function that takes a list of words and converts it into a TF vector\n" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [], + "source": [ + "def tf(document):\n", + "    pass" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [], + "source": [ + "def tf(document):\n", + "    document_vector = None\n", + "    for word in document:\n", + "        if document_vector is None:\n", + "            document_vector = word_to_index(word)\n", + "        else:\n", + "            document_vector += word_to_index(word)\n", + "    return document_vector" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([0., 0., 1., 0., 0., 1., 0., 0., 0., 1., 1., 1., 0., 0., 0., 1., 1.,\n", + " 0., 0., 0., 0., 0., 0., 0., 1., 0.])" + ] + }, + "execution_count": 
19, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "tf(documents_tokenized[0])" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": {}, + "outputs": [], + "source": [ + "documents_vectorized = list()\n", + "for document in documents_tokenized:\n", + " document_vector = tf(document)\n", + " documents_vectorized.append(document_vector)" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "data": { + "text/plain": [ + "[array([0., 0., 1., 0., 0., 1., 0., 0., 0., 1., 1., 1., 0., 0., 0., 1., 1.,\n", + " 0., 0., 0., 0., 0., 0., 0., 1., 0.]),\n", + " array([0., 1., 0., 0., 1., 0., 0., 0., 0., 1., 1., 1., 0., 0., 1., 1., 0.,\n", + " 0., 0., 1., 0., 0., 0., 0., 1., 0.]),\n", + " array([0., 0., 0., 0., 0., 1., 1., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0.,\n", + " 0., 1., 0., 0., 0., 0., 0., 0., 0.]),\n", + " array([1., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,\n", + " 0., 0., 0., 0., 1., 1., 1., 0., 1.]),\n", + " array([0., 0., 0., 0., 0., 2., 0., 1., 0., 0., 1., 1., 1., 1., 0., 0., 1.,\n", + " 1., 1., 0., 1., 0., 0., 0., 0., 0.])]" + ] + }, + "execution_count": 21, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "documents_vectorized" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### IDF" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([5. , 5. , 5. , 5. , 5. ,\n", + " 1.66666667, 5. , 2.5 , 5. , 2.5 ,\n", + " 1.66666667, 1.66666667, 5. , 2.5 , 5. ,\n", + " 2.5 , 2.5 , 5. , 2.5 , 5. ,\n", + " 5. , 5. , 5. , 5. , 2.5 ,\n", + " 5. 
])" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "idf = np.zeros(len(vocabulary))\n", + "idf = len(documents_vectorized) / np.sum(np.array(documents_vectorized) != 0,axis=0)\n", + "display(idf)" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": {}, + "outputs": [], + "source": [ + "for i in range(len(documents_vectorized)):\n", + "    documents_vectorized[i] = documents_vectorized[i]# * idf" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### TASK 3 Write a function similarity that returns the cosine similarity between two documents in vectorized form" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": {}, + "outputs": [], + "source": [ + "def similarity(query, document):\n", + "    numerator = np.sum(query * document)\n", + "    denominator = np.sqrt(np.sum(query*query)) * np.sqrt(np.sum(document*document)) \n", + "    return numerator / denominator" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'Ala lubi zwierzęta i ma kota oraz psa!'" + ] + }, + "execution_count": 25, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "documents[0]" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([0., 0., 1., 0., 0., 1., 0., 0., 0., 1., 1., 1., 0., 0., 0., 1., 1.,\n", + " 0., 0., 0., 0., 0., 0., 0., 1., 0.])" + ] + }, + "execution_count": 26, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "documents_vectorized[0]" + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'Ola lubi zwierzęta oraz ma kota a także chomika!'" + ] + }, + "execution_count": 27, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "documents[1]" + ] + }, + { + 
"cell_type": "code", + "execution_count": 28, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([0., 1., 0., 0., 1., 0., 0., 0., 0., 1., 1., 1., 0., 0., 1., 1., 0.,\n", + " 0., 0., 1., 0., 0., 0., 0., 1., 0.])" + ] + }, + "execution_count": 28, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "documents_vectorized[1]" + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0.5892556509887895" + ] + }, + "execution_count": 29, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "similarity(documents_vectorized[0],documents_vectorized[1])" + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "metadata": {}, + "outputs": [], + "source": [ + "def transform_query(query):\n", + "    query_vector = tf(tokenize_str(get_str_cleaned(query)))\n", + "    return query_vector" + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,\n", + " 0., 0., 0., 0., 0., 0., 0., 0., 0.])" + ] + }, + "execution_count": 31, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "transform_query('psa')" + ] + }, + { + "cell_type": "code", + "execution_count": 32, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0.4999999999999999" + ] + }, + "execution_count": 32, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "similarity(transform_query('psa kota'), documents_vectorized[0])" + ] + }, + { + "cell_type": "code", + "execution_count": 33, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'Ala lubi zwierzęta i ma kota oraz psa!'" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "0.4999999999999999" + ] + }, + "metadata": {}, + "output_type": 
"display_data" + }, + { + "data": { + "text/plain": [ + "'Ola lubi zwierzęta oraz ma kota a także chomika!'" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "0.2357022603955158" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "'I Jan jeździ na rowerze.'" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "0.0" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "'2 wojna światowa była wielkim konfliktem zbrojnym'" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "0.0" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "'Tomek lubi psy, ma psa i jeździ na motorze i rowerze.'" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "0.19611613513818402" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# this is how a two-word query is handled\n", + "query = 'psa kota'\n", + "for i in range(len(documents)):\n", + "    display(documents[i])\n", + "    display(similarity(transform_query(query), documents_vectorized[i]))" + ] + }, + { + "cell_type": "code", + "execution_count": 34, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'Ala lubi zwierzęta i ma kota oraz psa!'" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "0.0" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "'Ola lubi zwierzęta oraz ma kota a także chomika!'" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "0.0" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "'I Jan jeździ na rowerze.'" + ] + }, + "metadata": {}, + 
"output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "0.4472135954999579" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "'2 wojna światowa była wielkim konfliktem zbrojnym'" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "0.0" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "'Tomek lubi psy, ma psa i jeździ na motorze i rowerze.'" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "0.2773500981126146" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# this is why cosine similarity needs the denominator\n", + "query = 'rowerze'\n", + "for i in range(len(documents)):\n", + "    display(documents[i])\n", + "    display(similarity(transform_query(query), documents_vectorized[i]))" + ] + }, + { + "cell_type": "code", + "execution_count": 35, + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "data": { + "text/plain": [ + "'Ala lubi zwierzęta i ma kota oraz psa!'" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "0.35355339059327373" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "'Ola lubi zwierzęta oraz ma kota a także chomika!'" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "0.0" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "'I Jan jeździ na rowerze.'" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "0.4472135954999579" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "'2 wojna światowa była wielkim konfliktem zbrojnym'" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": 
{ + "text/plain": [ + "0.0" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "'Tomek lubi psy, ma psa i jeździ na motorze i rowerze.'" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "0.5547001962252291" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# dlatego potrzebujemy term frequency → wiecej znaczy bardziej dopasowany dokument\n", + "query = 'i'\n", + "for i in range(len(documents)):\n", + " display(documents[i])\n", + " display(similarity(transform_query(query), documents_vectorized[i]))" + ] + }, + { + "cell_type": "code", + "execution_count": 36, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'Ala lubi zwierzęta i ma kota oraz psa!'" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "0.24999999999999994" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "'Ola lubi zwierzęta oraz ma kota a także chomika!'" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "0.2357022603955158" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "'I Jan jeździ na rowerze.'" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "0.31622776601683794" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "'2 wojna światowa była wielkim konfliktem zbrojnym'" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "0.0" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "'Tomek lubi psy, ma psa i jeździ na motorze i rowerze.'" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + 
"0.39223227027636803" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# dlatego IDF - żeby ważniejsze słowa miał większą wagę\n", + "query = 'i chomika'\n", + "for i in range(len(documents)):\n", + " display(documents[i])\n", + " display(similarity(transform_query(query), documents_vectorized[i]))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### ZADANIE 4 NAPISAĆ IDF w celu zmiany wag z TF na TF- IDF \n", + "\n", + "Proszę użyć wersję bez żadnej normalizacji\n", + "\n", + "\n", + "$idf_i = \\Large\\frac{|D|}{|\\{d : t_i \\in d \\}|}$\n", + "\n", + "\n", + "$|D|$ - ilość dokumentów w korpusie\n", + "$|\\{d : t_i \\in d \\}|$ - ilość dokumentów w korpusie, gdzie dany term występuje chociaż jeden raz" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.5" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/cw/02a_tfidf_tasks_ODPOWIEDZI.ipynb b/cw/02a_tfidf_tasks_ODPOWIEDZI.ipynb new file mode 100644 index 0000000..956cbd9 --- /dev/null +++ b/cw/02a_tfidf_tasks_ODPOWIEDZI.ipynb @@ -0,0 +1,32 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": 
"3.8.5" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/cw/02b_tfidf_newsgroup.ipynb b/cw/02b_tfidf_newsgroup.ipynb new file mode 100644 index 0000000..3961462 --- /dev/null +++ b/cw/02b_tfidf_newsgroup.ipynb @@ -0,0 +1,708 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Zajecia 2\n", + "\n", + "Przydatne materiały:\n", + "\n", + "https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html\n", + "\n", + "https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Importy" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "import numpy as np\n", + "import sklearn.metrics\n", + "\n", + "from sklearn.datasets import fetch_20newsgroups\n", + "\n", + "from sklearn.feature_extraction.text import TfidfVectorizer" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Zbiór danych" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "newsgroups = fetch_20newsgroups()['data']" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "11314" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "len(newsgroups)" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "From: lerxst@wam.umd.edu (where's my thing)\n", + "Subject: WHAT car is this!?\n", + "Nntp-Posting-Host: rac3.wam.umd.edu\n", + "Organization: University of Maryland, College Park\n", + "Lines: 15\n", + "\n", + " I was wondering if anyone out there could enlighten me on this car I saw\n", + "the other day. 
It was a 2-door sports car, looked to be from the late 60s/\n", + "early 70s. It was called a Bricklin. The doors were really small. In addition,\n", + "the front bumper was separate from the rest of the body. This is \n", + "all I know. If anyone can tellme a model name, engine specs, years\n", + "of production, where this car is made, history, or whatever info you\n", + "have on this funky looking car, please e-mail.\n", + "\n", + "Thanks,\n", + "- IL\n", + " ---- brought to you by your neighborhood Lerxst ----\n", + "\n", + "\n", + "\n", + "\n", + "\n" + ] + } + ], + "source": [ + "print(newsgroups[0])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Naive search" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [], + "source": [ + "all_documents = list() \n", + "for document in newsgroups:\n", + "    if 'car' in document:\n", + "        all_documents.append(document)" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "From: lerxst@wam.umd.edu (where's my thing)\n", + "Subject: WHAT car is this!?\n", + "Nntp-Posting-Host: rac3.wam.umd.edu\n", + "Organization: University of Maryland, College Park\n", + "Lines: 15\n", + "\n", + " I was wondering if anyone out there could enlighten me on this car I saw\n", + "the other day. It was a 2-door sports car, looked to be from the late 60s/\n", + "early 70s. It was called a Bricklin. The doors were really small. In addition,\n", + "the front bumper was separate from the rest of the body. This is \n", + "all I know. 
If anyone can tellme a model name, engine specs, years\n", + "of production, where this car is made, history, or whatever info you\n", + "have on this funky looking car, please e-mail.\n", + "\n", + "Thanks,\n", + "- IL\n", + " ---- brought to you by your neighborhood Lerxst ----\n", + "\n", + "\n", + "\n", + "\n", + "\n" + ] + } + ], + "source": [ + "print(all_documents[0])" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "From: guykuo@carson.u.washington.edu (Guy Kuo)\n", + "Subject: SI Clock Poll - Final Call\n", + "Summary: Final call for SI clock reports\n", + "Keywords: SI,acceleration,clock,upgrade\n", + "Article-I.D.: shelley.1qvfo9INNc3s\n", + "Organization: University of Washington\n", + "Lines: 11\n", + "NNTP-Posting-Host: carson.u.washington.edu\n", + "\n", + "A fair number of brave souls who upgraded their SI clock oscillator have\n", + "shared their experiences for this poll. Please send a brief message detailing\n", + "your experiences with the procedure. Top speed attained, CPU rated speed,\n", + "add on cards and adapters, heat sinks, hour of usage per day, floppy disk\n", + "functionality with 800 and 1.4 m floppies are especially requested.\n", + "\n", + "I will be summarizing in the next two days, so please add to the network\n", + "knowledge base if you have done the clock upgrade and haven't answered this\n", + "poll. 
Thanks.\n", + "\n", + "Guy Kuo \n", + "\n" + ] + } + ], + "source": [ + "print(all_documents[1])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### What are the problems with this approach?\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## TF-IDF and cosine similarity: ready-made libraries" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [], + "source": [ + "vectorizer = TfidfVectorizer()\n", + "#vectorizer = TfidfVectorizer(use_idf = False, ngram_range=(1,2))" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [], + "source": [ + "document_vectors = vectorizer.fit_transform(newsgroups)" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "<11314x130107 sparse matrix of type '<class 'numpy.float64'>'\n", + "\twith 1787565 stored elements in Compressed Sparse Row format>" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "document_vectors" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "<1x130107 sparse matrix of type '<class 'numpy.float64'>'\n", + "\twith 89 stored elements in Compressed Sparse Row format>" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "document_vectors[0]" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "matrix([[0., 0., 0., ..., 0., 0., 0.]])" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "document_vectors[0].todense()" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "matrix([[0., 0., 0., ..., 0., 0., 0.],\n", + " [0., 0., 0., ..., 0., 0., 0.],\n", +
" [0., 0., 0., ..., 0., 0., 0.],\n", + " [0., 0., 0., ..., 0., 0., 0.]])" + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "document_vectors[0:4].todense()" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [], + "source": [ + "query_str = 'speed'\n", + "#query_str = 'speed car'\n", + "#query_str = 'spider man'" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [], + "source": [ + "query_vector = vectorizer.transform([query_str])" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "<11314x130107 sparse matrix of type ''\n", + "\twith 1787565 stored elements in Compressed Sparse Row format>" + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "document_vectors" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "<1x130107 sparse matrix of type ''\n", + "\twith 1 stored elements in Compressed Sparse Row format>" + ] + }, + "execution_count": 17, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "query_vector" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [], + "source": [ + "similarities = sklearn.metrics.pairwise.cosine_similarity(query_vector,document_vectors)" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([0.26949927, 0.3491801 , 0.44292083, 0.47784165])" + ] + }, + "execution_count": 19, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "np.sort(similarities)[0][-4:]" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([4517, 5509, 
2116, 9921])" + ] + }, + "execution_count": 20, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "similarities.argsort()[0][-4:]" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": { + "scrolled": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "From: ray@netcom.com (Ray Fischer)\n", + "Subject: Re: x86 ~= 680x0 ?? (How do they compare?)\n", + "Organization: Netcom. San Jose, California\n", + "Distribution: usa\n", + "Lines: 36\n", + "\n", + "dhk@ubbpc.uucp (Dave Kitabjian) writes ...\n", + ">I'm sure Intel and Motorola are competing neck-and-neck for \n", + ">crunch-power, but for a given clock speed, how do we rank the\n", + ">following (from 1st to 6th):\n", + "> 486\t\t68040\n", + "> 386\t\t68030\n", + "> 286\t\t68020\n", + "\n", + "040 486 030 386 020 286\n", + "\n", + ">While you're at it, where will the following fit into the list:\n", + "> 68060\n", + "> Pentium\n", + "> PowerPC\n", + "\n", + "060 fastest, then Pentium, with the first versions of the PowerPC\n", + "somewhere in the vicinity.\n", + "\n", + ">And about clock speed: Does doubling the clock speed double the\n", + ">overall processor speed? And fill in the __'s below:\n", + "> 68030 @ __ MHz = 68040 @ __ MHz\n", + "\n", + "No. Computer speed is only partly dependent of processor/clock speed.\n", + "Memory system speed play a large role as does video system speed and\n", + "I/O speed. As processor clock rates go up, the speed of the memory\n", + "system becomes the greatest factor in the overall system speed. If\n", + "you have a 50MHz processor, it can be reading another word from memory\n", + "every 20ns. 
Sure, you can put all 20ns memory in your computer, but\n", + "it will cost 10 times as much as the slower 80ns SIMMs.\n", + "\n", + "And roughly, the 68040 is twice as fast at a given clock\n", + "speed as is the 68030.\n", + "\n", + "-- \n", + "Ray Fischer \"Convictions are more dangerous enemies of truth\n", + "ray@netcom.com than lies.\" -- Friedrich Nietzsche\n", + "\n", + "0.4778416465020907\n", + "----------------------------------------------------------------------------------------------------\n", + "----------------------------------------------------------------------------------------------------\n", + "----------------------------------------------------------------------------------------------------\n", + "From: rvenkate@ux4.cso.uiuc.edu (Ravikuma Venkateswar)\n", + "Subject: Re: x86 ~= 680x0 ?? (How do they compare?)\n", + "Distribution: usa\n", + "Organization: University of Illinois at Urbana\n", + "Lines: 59\n", + "\n", + "ray@netcom.com (Ray Fischer) writes:\n", + "\n", + ">dhk@ubbpc.uucp (Dave Kitabjian) writes ...\n", + ">>I'm sure Intel and Motorola are competing neck-and-neck for \n", + ">>crunch-power, but for a given clock speed, how do we rank the\n", + ">>following (from 1st to 6th):\n", + ">> 486\t\t68040\n", + ">> 386\t\t68030\n", + ">> 286\t\t68020\n", + "\n", + ">040 486 030 386 020 286\n", + "\n", + "How about some numbers here? Some kind of benchmark?\n", + "If you want, let me start it - 486DX2-66 - 32 SPECint92, 16 SPECfp92 .\n", + "\n", + ">>While you're at it, where will the following fit into the list:\n", + ">> 68060\n", + ">> Pentium\n", + ">> PowerPC\n", + "\n", + ">060 fastest, then Pentium, with the first versions of the PowerPC\n", + ">somewhere in the vicinity.\n", + "\n", + "Numbers? Pentium @66MHz - 65 SPECint92, 57 SPECfp92 .\n", + "\t PowerPC @66MHz - 50 SPECint92, 80 SPECfp92 . 
(Note this is the 601)\n", + " (Alpha @150MHz - 74 SPECint92,126 SPECfp92 - just for comparison)\n", + "\n", + ">>And about clock speed: Does doubling the clock speed double the\n", + ">>overall processor speed? And fill in the __'s below:\n", + ">> 68030 @ __ MHz = 68040 @ __ MHz\n", + "\n", + ">No. Computer speed is only partly dependent of processor/clock speed.\n", + ">Memory system speed play a large role as does video system speed and\n", + ">I/O speed. As processor clock rates go up, the speed of the memory\n", + ">system becomes the greatest factor in the overall system speed. If\n", + ">you have a 50MHz processor, it can be reading another word from memory\n", + ">every 20ns. Sure, you can put all 20ns memory in your computer, but\n", + ">it will cost 10 times as much as the slower 80ns SIMMs.\n", + "\n", + "Not in a clock-doubled system. There isn't a doubling in performance, but\n", + "it _is_ quite significant. Maybe about a 70% increase in performance.\n", + "\n", + "Besides, for 0 wait state performance, you'd need a cache anyway. I mean,\n", + "who uses a processor that runs at the speed of 80ns SIMMs? 
Note that this\n", + "memory speed corresponds to a clock speed of 12.5 MHz.\n", + "\n", + ">And roughly, the 68040 is twice as fast at a given clock\n", + ">speed as is the 68030.\n", + "\n", + "Numbers?\n", + "\n", + ">-- \n", + ">Ray Fischer \"Convictions are more dangerous enemies of truth\n", + ">ray@netcom.com than lies.\" -- Friedrich Nietzsche\n", + "-- \n", + "Ravikumar Venkateswar\n", + "rvenkate@uiuc.edu\n", + "\n", + "A pun is a no' blessed form of whit.\n", + "\n", + "0.44292082969477664\n", + "----------------------------------------------------------------------------------------------------\n", + "----------------------------------------------------------------------------------------------------\n", + "----------------------------------------------------------------------------------------------------\n", + "From: ray@netcom.com (Ray Fischer)\n", + "Subject: Re: x86 ~= 680x0 ?? (How do they compare?)\n", + "Organization: Netcom. San Jose, California\n", + "Distribution: usa\n", + "Lines: 30\n", + "\n", + "rvenkate@ux4.cso.uiuc.edu (Ravikuma Venkateswar) writes ...\n", + ">ray@netcom.com (Ray Fischer) writes:\n", + ">>040 486 030 386 020 286\n", + ">\n", + ">How about some numbers here? Some kind of benchmark?\n", + "\n", + "Benchmarks are for marketing dweebs and CPU envy. OK, if it will make\n", + "you happy, the 486 is faster than the 040. BFD. Both architectures\n", + "are nearing then end of their lifetimes. And especially with the x86\n", + "architecture: good riddance.\n", + "\n", + ">Besides, for 0 wait state performance, you'd need a cache anyway. I mean,\n", + ">who uses a processor that runs at the speed of 80ns SIMMs? Note that this\n", + ">memory speed corresponds to a clock speed of 12.5 MHz.\n", + "\n", + "The point being the processor speed is only one of many aspects of a\n", + "computers performance. 
Clock speed, processor, memory speed, CPU\n", + "architecture, I/O systems, even the application program all contribute \n", + "to the overall system performance.\n", + "\n", + ">>And roughly, the 68040 is twice as fast at a given clock\n", + ">>speed as is the 68030.\n", + ">\n", + ">Numbers?\n", + "\n", + "Look them up yourself.\n", + "\n", + "-- \n", + "Ray Fischer \"Convictions are more dangerous enemies of truth\n", + "ray@netcom.com than lies.\" -- Friedrich Nietzsche\n", + "\n", + "0.3491800997095306\n", + "----------------------------------------------------------------------------------------------------\n", + "----------------------------------------------------------------------------------------------------\n", + "----------------------------------------------------------------------------------------------------\n", + "From: mb4008@cehp11 (Morgan J Bullard)\n", + "Subject: Re: speeding up windows\n", + "Keywords: speed\n", + "Organization: University of Illinois at Urbana\n", + "Lines: 30\n", + "\n", + "djserian@flash.LakeheadU.Ca (Reincarnation of Elvis) writes:\n", + "\n", + ">I have a 386/33 with 8 megs of memory\n", + "\n", + ">I have noticed that lately when I use programs like WpfW or Corel Draw\n", + ">my computer \"boggs\" down and becomes really sluggish!\n", + "\n", + ">What can I do to increase performance? What should I turn on or off\n", + "\n", + ">Will not loading wallpapers or stuff like that help when it comes to\n", + ">the running speed of windows and the programs that run under it?\n", + "\n", + ">Thanx in advance\n", + "\n", + ">Derek\n", + "\n", + "1) make sure your hard drive is defragmented. This will speed up more than \n", + " just windows BTW. 
Use something like Norton's or PC Tools.\n", + "2) I _think_ that leaving the wall paper out will use less RAM and therefore\n", + " will speed up your machine but I could very will be wrong on this.\n", + "There's a good chance you've already done this but if not it may speed things\n", + "up. good luck\n", + "\t\t\t\tMorgan Bullard mb4008@coewl.cen.uiuc.edu\n", + "\t\t\t\t\t or mjbb@uxa.cso.uiuc.edu\n", + "\n", + ">--\n", + ">$_ /|$Derek J.P. Serianni $ E-Mail : djserian@flash.lakeheadu.ca $ \n", + ">$\\'o.O' $Sociologist $ It's 106 miles to Chicago,we've got a full tank$\n", + ">$=(___)=$Lakehead University $ of gas, half a pack of cigarettes,it's dark,and$\n", + ">$ U $Thunder Bay, Ontario$ we're wearing sunglasses. -Elwood Blues $ \n", + "\n", + "0.26949927393886913\n", + "----------------------------------------------------------------------------------------------------\n", + "----------------------------------------------------------------------------------------------------\n", + "----------------------------------------------------------------------------------------------------\n" + ] + } + ], + "source": [ + "for i in range(1, 5):\n", + "    print(newsgroups[similarities.argsort()[0][-i]])\n", + "    print(np.sort(similarities)[0,-i])\n", + "    print('-'*100)\n", + "    print('-'*100)\n", + "    print('-'*100)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Homework\n", + "\n", + "\n", + "- Choose a text collection that contains at least 5000 documents.\n", + "- Build a search engine on top of it based on Okapi BM25, i.e. a system that, for a given query, returns a few (5-10) best-matching documents in ranked order. You may use existing libraries to vectorize the documents, but you must implement Okapi BM25 yourself.\n", + "- Find a query for which the results are unsatisfactory.\n", + "- Improve the search engine (e.g. 
by changing the text preprocessing, the vectorizer, the parameters of the ranking algorithm, or the algorithm itself) so that it returns satisfactory results for that query.\n", + "- Present your work at the next class (15.03), answering the following questions:\n", + " - what the collection and the search system look like before the changes\n", + " - for which query the results are unsatisfactory (show the results)\n", + " - what changes were made\n", + " - what the search results look like after the changes\n", + " - how the changes affected the results (1-2 sentences)\n", + " \n", + "The presentation should be as simple as possible and last no more than 2-3 minutes.\n", + "Points available: 40\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.5" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +}