{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "![Logo 1](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech1.jpg)\n", "
\n", "

Ekstrakcja informacji

\n", "

3. tfidf (1) [ćwiczenia]

\n", "

Jakub Pokrywka (2021)

\n", "
\n", "\n", "![Logo 2](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech2.jpg)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Zajęcia 3\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import re" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## zbiór dokumentów" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "documents = ['Ala lubi zwierzęta i ma kota oraz psa!',\n", " 'Ola lubi zwierzęta oraz ma kota a także chomika!',\n", " 'I Jan jeździ na rowerze.',\n", " '2 wojna światowa była wielkim konfliktem zbrojnym',\n", " 'Tomek lubi psy, ma psa i jeździ na motorze i rowerze.',\n", " ]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### CZEGO CHCEMY?\n", "- chcemy zamienić teksty na zbiór słów\n", "\n", "\n", "### PYTANIE\n", "- czy możemy ztokenizować tekst np. documents.split(' ') jakie wystąpią wtedy problemy?" 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### ODPOWIEDŹ\n", "- lepiej użyć preprocessingu i dopiero później tokenizacji" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## preprocessing" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "def get_str_cleaned(str_dirty):\n", " punctuation = '!\"#$%&\\'()*+,-./:;<=>?@[\\\\]^_`{|}~'\n", " new_str = str_dirty.lower()\n", " new_str = re.sub(' +', ' ', new_str)\n", " for char in punctuation:\n", " new_str = new_str.replace(char,'')\n", " return new_str\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "sample_document = get_str_cleaned(documents[0])" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'ala lubi zwierzęta i ma kota oraz psa'" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample_document" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## tokenizacja" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "def tokenize_str(document):\n", " return document.split(' ')" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['ala', 'lubi', 'zwierzęta', 'i', 'ma', 'kota', 'oraz', 'psa']" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tokenize_str(sample_document)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "documents_cleaned = [get_str_cleaned(d) for d in documents]" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['ala lubi zwierzęta i ma kota oraz psa',\n", " 'ola lubi zwierzęta oraz ma kota a także chomika',\n", " 'i jan jeździ na rowerze',\n", " '2 wojna światowa była wielkim konfliktem zbrojnym',\n", " 'tomek lubi psy ma psa i 
jeździ na motorze i rowerze']" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "documents_cleaned" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "documents_tokenized = [tokenize_str(d) for d in documents_cleaned]" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[['ala', 'lubi', 'zwierzęta', 'i', 'ma', 'kota', 'oraz', 'psa'],\n", " ['ola', 'lubi', 'zwierzęta', 'oraz', 'ma', 'kota', 'a', 'także', 'chomika'],\n", " ['i', 'jan', 'jeździ', 'na', 'rowerze'],\n", " ['2', 'wojna', 'światowa', 'była', 'wielkim', 'konfliktem', 'zbrojnym'],\n", " ['tomek',\n", " 'lubi',\n", " 'psy',\n", " 'ma',\n", " 'psa',\n", " 'i',\n", " 'jeździ',\n", " 'na',\n", " 'motorze',\n", " 'i',\n", " 'rowerze']]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "documents_tokenized" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## PYTANIA\n", "- jaki jest następny krok w celu stworzenia wektórów TF lub TF-IDF\n", "- jakie wielkości będzie wektor TF lub TF-IDF?\n" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "vocabulary = []\n", "for document in documents_tokenized:\n", " for word in document:\n", " vocabulary.append(word)\n", "vocabulary = sorted(set(vocabulary))" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['2',\n", " 'a',\n", " 'ala',\n", " 'była',\n", " 'chomika',\n", " 'i',\n", " 'jan',\n", " 'jeździ',\n", " 'konfliktem',\n", " 'kota',\n", " 'lubi',\n", " 'ma',\n", " 'motorze',\n", " 'na',\n", " 'ola',\n", " 'oraz',\n", " 'psa',\n", " 'psy',\n", " 'rowerze',\n", " 'także',\n", " 'tomek',\n", " 'wielkim',\n", " 'wojna',\n", " 'zbrojnym',\n", " 'zwierzęta',\n", " 'światowa']" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ 
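"# note (added): the vocabulary length is also the TF/TF-IDF vector size,\n",
"# which answers the size question above (26 for this toy corpus)\n",
"assert len(vocabulary) == 26\n",
"\n",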
"vocabulary" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### ZADANIE 1 stworzyć funkcję word_to_index(word:str), funkcja ma zwarać one-hot vector w postaciu numpy array" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "def word_to_index(word):\n", " pass" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,\n", " 0., 0., 0., 0., 0., 0., 0., 0., 0.])" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "word_to_index('psa')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### ZADANIE 2 NAPISAC FUNKCJĘ, która bierze listę słów i zamienia na wetktor TF\n" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "def tf(document):\n", " pass" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0., 0., 1., 0., 0., 1., 0., 0., 0., 1., 1., 1., 0., 0., 0., 1., 1.,\n", " 0., 0., 0., 0., 0., 0., 0., 1., 0.])" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tf(documents_tokenized[0])" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "documents_vectorized = list()\n", "for document in documents_tokenized:\n", " document_vector = tf(document)\n", " documents_vectorized.append(document_vector)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "[array([0., 0., 1., 0., 0., 1., 0., 0., 0., 1., 1., 1., 0., 0., 0., 1., 1.,\n", " 0., 0., 0., 0., 0., 0., 0., 1., 0.]),\n", " array([0., 1., 0., 0., 1., 0., 0., 0., 0., 1., 1., 1., 0., 0., 1., 1., 0.,\n", " 0., 0., 1., 0., 0., 0., 0., 1., 0.]),\n", " array([0., 0., 0., 0., 0., 1., 1., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0.,\n", " 
0., 1., 0., 0., 0., 0., 0., 0., 0.]),\n", " array([1., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,\n", " 0., 0., 0., 0., 1., 1., 1., 0., 1.]),\n", " array([0., 0., 0., 0., 0., 2., 0., 1., 0., 0., 1., 1., 1., 1., 0., 0., 1.,\n", " 1., 1., 0., 1., 0., 0., 0., 0., 0.])]" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "documents_vectorized" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### IDF" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Wersja bez żadnej normalizacji\n", "\n", "\n", "$idf_i = \\Large\\frac{|D|}{|\\{d : t_i \\in d \\}|}$\n", "\n", "\n", "$|D|$ - ilość dokumentów w korpusie\n", "$|\\{d : t_i \\in d \\}|$ - ilość dokumentów w korpusie, gdzie dany term występuje chociaż jeden raz" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([5. , 5. , 5. , 5. , 5. ,\n", " 1.66666667, 5. , 2.5 , 5. , 2.5 ,\n", " 1.66666667, 1.66666667, 5. , 2.5 , 5. ,\n", " 2.5 , 2.5 , 5. , 2.5 , 5. ,\n", " 5. , 5. , 5. , 5. , 2.5 ,\n", " 5. 
])" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "idf = np.zeros(len(vocabulary))\n", "idf = len(documents_vectorized) / np.sum(np.array(documents_vectorized) != 0,axis=0)\n", "display(idf)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "for i in range(len(documents_vectorized)):\n", " documents_vectorized[i] = documents_vectorized[i] * idf" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### ZADANIE 3 Napisać funkcję similarity, która zwraca podobieństwo kosinusowe między dwoma dokumentami w postaci zwektoryzowanej" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def similarity(query, document):\n", " pass" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Ala lubi zwierzęta i ma kota oraz psa!'" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "documents[0]" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0., 0., 1., 0., 0., 1., 0., 0., 0., 1., 1., 1., 0., 0., 0., 1., 1.,\n", " 0., 0., 0., 0., 0., 0., 0., 1., 0.])" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "documents_vectorized[0]" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Ola lubi zwierzęta oraz ma kota a także chomika!'" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "documents[1]" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0., 1., 0., 0., 1., 0., 0., 0., 0., 1., 1., 1., 0., 0., 1., 1., 0.,\n", " 0., 0., 1., 0., 0., 0., 0., 1., 0.])" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "documents_vectorized[1]" ] }, { "cell_type": "code", 
"execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.5892556509887895" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "similarity(documents_vectorized[0],documents_vectorized[1])" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "def transform_query(query):\n", " query_vector = tf(tokenize_str(get_str_cleaned(query)))\n", " return query_vector" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,\n", " 0., 0., 0., 0., 0., 0., 0., 0., 0.])" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "transform_query('psa')" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.4999999999999999" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "similarity(transform_query('psa kota'), documents_vectorized[0])" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Ala lubi zwierzęta i ma kota oraz psa!'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "0.4999999999999999" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'Ola lubi zwierzęta oraz ma kota a także chomika!'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "0.2357022603955158" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'I Jan jeździ na rowerze.'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "0.0" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'2 wojna światowa była wielkim konfliktem zbrojnym'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { 
"text/plain": [ "0.0" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'Tomek lubi psy, ma psa i jeździ na motorze i rowerze.'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "0.19611613513818402" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# tak są obsługiwane 2 słowa\n", "query = 'psa kota'\n", "for i in range(len(documents)):\n", " display(documents[i])\n", " display(similarity(transform_query(query), documents_vectorized[i]))" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Ala lubi zwierzęta i ma kota oraz psa!'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "0.0" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'Ola lubi zwierzęta oraz ma kota a także chomika!'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "0.0" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'I Jan jeździ na rowerze.'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "0.4472135954999579" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'2 wojna światowa była wielkim konfliktem zbrojnym'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "0.0" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'Tomek lubi psy, ma psa i jeździ na motorze i rowerze.'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "0.2773500981126146" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# dlatego potrzebujemy mianownik w cosine similarity\n", "# dłuższe dokumenty, w który raz wystąpie słowo rower są gorzej punktowane od\n", "# krótszych. 
Jeżeli słowo rower wystąpiło w bardzo krótki dokumencie, to znaczy\n", "# że jest większe prawdopodobieństwo że dokument jest o rowerze\n", "query = 'rowerze'\n", "for i in range(len(documents)):\n", " display(documents[i])\n", " display(similarity(transform_query(query), documents_vectorized[i]))" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "'Ala lubi zwierzęta i ma kota oraz psa!'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "0.35355339059327373" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'Ola lubi zwierzęta oraz ma kota a także chomika!'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "0.0" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'I Jan jeździ na rowerze.'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "0.4472135954999579" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'2 wojna światowa była wielkim konfliktem zbrojnym'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "0.0" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'Tomek lubi psy, ma psa i jeździ na motorze i rowerze.'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "0.5547001962252291" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# dlatego potrzebujemy term frequency → wiecej wystąpień słowa w dokumencie\n", "# znaczy bardziej dopasowany dokument\n", "query = 'i'\n", "for i in range(len(documents)):\n", " display(documents[i])\n", " display(similarity(transform_query(query), documents_vectorized[i]))" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "scrolled": false }, "outputs": [ { "ename": "NameError", "evalue": "name 'documents' is not defined", 
"output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[1;32m 2\u001b[0m \u001b[0;31m# słowo chomik ma większą wagę od i, ponieważ występuje w mniejszej ilości dokumentów\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 3\u001b[0m \u001b[0mquery\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m'i chomika'\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 4\u001b[0;31m \u001b[0;32mfor\u001b[0m \u001b[0mi\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mrange\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mlen\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdocuments\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 5\u001b[0m \u001b[0mdisplay\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdocuments\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mi\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 6\u001b[0m \u001b[0mdisplay\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0msimilarity\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtransform_query\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mquery\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdocuments_vectorized\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mi\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mNameError\u001b[0m: name 'documents' is not defined" ] } ], "source": [ "# dlatego IDF - żeby ważniejsze słowa miał większą wagę\n", "# słowo chomik ma większą wagę od i, ponieważ występuje w mniejszej ilości dokumentów\n", "query = 'i chomika'\n", "for i in range(len(documents)):\n", " display(documents[i])\n", " 
display(similarity(transform_query(query), documents_vectorized[i]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Uwaga\n", "Powyższe przykłady pokazują score dokuemntu. Aby zrobić wyszukiwarkę, powinniśmy posortować te dokumenty po score (od największego) i zaprezentwoać w tej kolejności." ] } ], "metadata": { "author": "Jakub Pokrywka", "email": "kubapok@wmi.amu.edu.pl", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "lang": "pl", "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.3" }, "subtitle": "3.tfidf (1)[ćwiczenia]", "title": "Ekstrakcja informacji", "year": "2021" }, "nbformat": 4, "nbformat_minor": 4 }