{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "![Logo 1](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech1.jpg)\n", "
\n", "

Information Extraction

\n", "

3. tfidf (1) [exercises]

\n", "

Jakub Pokrywka (2021)

\n", "
\n", "\n", "![Logo 2](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech2.jpg)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Class 2\n", "\n", "In this class you can earn 5 points for each valuable contribution to the discussion. One person can earn at most 15 points during these exercises." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import re" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## document collection" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "documents = ['Ala lubi zwierzęta i ma kota oraz psa!',\n", "             'Ola lubi zwierzęta oraz ma kota a także chomika!',\n", "             'I Jan jeździ na rowerze.',\n", "             '2 wojna światowa była wielkim konfliktem zbrojnym',\n", "             'Tomek lubi psy, ma psa i jeździ na motorze i rowerze.',\n", "            ]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### WHAT DO WE WANT?\n", "- we want to turn each text into a collection of words\n", "\n", "\n", "### QUESTION\n", "- can we tokenize a text with, e.g., `document.split(' ')`? What problems would that cause?" 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## preprocessing" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "def get_str_cleaned(str_dirty):\n", "    # lowercase, collapse repeated spaces, then strip ASCII punctuation\n", "    punctuation = '!\"#$%&\\'()*+,-./:;<=>?@[\\\\]^_`{|}~'\n", "    new_str = str_dirty.lower()\n", "    new_str = re.sub(' +', ' ', new_str)\n", "    for char in punctuation:\n", "        new_str = new_str.replace(char,'')\n", "    return new_str\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "sample_document = get_str_cleaned(documents[0])" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'ala lubi zwierzęta i ma kota oraz psa'" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample_document" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## tokenization" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "def tokenize_str(document):\n", "    return document.split(' ')" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['ala', 'lubi', 'zwierzęta', 'i', 'ma', 'kota', 'oraz', 'psa']" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tokenize_str(sample_document)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "documents_cleaned = [get_str_cleaned(d) for d in documents]" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['ala lubi zwierzęta i ma kota oraz psa',\n", " 'ola lubi zwierzęta oraz ma kota a także chomika',\n", " 'i jan jeździ na rowerze',\n", " '2 wojna światowa była wielkim konfliktem zbrojnym',\n", " 'tomek lubi psy ma psa i jeździ na motorze i rowerze']" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "documents_cleaned" ] 
}, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "documents_tokenized = [tokenize_str(d) for d in documents_cleaned]" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[['ala', 'lubi', 'zwierzęta', 'i', 'ma', 'kota', 'oraz', 'psa'],\n", " ['ola', 'lubi', 'zwierzęta', 'oraz', 'ma', 'kota', 'a', 'także', 'chomika'],\n", " ['i', 'jan', 'jeździ', 'na', 'rowerze'],\n", " ['2', 'wojna', 'światowa', 'była', 'wielkim', 'konfliktem', 'zbrojnym'],\n", " ['tomek',\n", " 'lubi',\n", " 'psy',\n", " 'ma',\n", " 'psa',\n", " 'i',\n", " 'jeździ',\n", " 'na',\n", " 'motorze',\n", " 'i',\n", " 'rowerze']]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "documents_tokenized" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## QUESTIONS\n", "- what is the next step towards building TF or TF-IDF vectors?\n", "- what size will a TF or TF-IDF vector have?\n" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "vocabulary = []\n", "for document in documents_tokenized:\n", "    for word in document:\n", "        vocabulary.append(word)\n", "vocabulary = sorted(set(vocabulary))" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['2',\n", " 'a',\n", " 'ala',\n", " 'była',\n", " 'chomika',\n", " 'i',\n", " 'jan',\n", " 'jeździ',\n", " 'konfliktem',\n", " 'kota',\n", " 'lubi',\n", " 'ma',\n", " 'motorze',\n", " 'na',\n", " 'ola',\n", " 'oraz',\n", " 'psa',\n", " 'psy',\n", " 'rowerze',\n", " 'także',\n", " 'tomek',\n", " 'wielkim',\n", " 'wojna',\n", " 'zbrojnym',\n", " 'zwierzęta',\n", " 'światowa']" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vocabulary" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## QUESTIONS\n", "\n", "How will the word \"jak\" be represented in the TF vector?" 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### TASK 1 Write a function word_to_index(word: str) that returns a one-hot vector as a numpy array" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "def word_to_index(word):\n", "    pass" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,\n", " 0., 0., 0., 0., 0., 0., 0., 0., 0.])" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "word_to_index('psa')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### TASK 2 Write a function that takes a list of words and converts it into a TF vector\n" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "def tf(document):\n", "    pass" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0., 0., 1., 0., 0., 1., 0., 0., 0., 1., 1., 1., 0., 0., 0., 1., 1.,\n", " 0., 0., 0., 0., 0., 0., 0., 1., 0.])" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tf(documents_tokenized[0])" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "documents_vectorized = list()\n", "for document in documents_tokenized:\n", "    document_vector = tf(document)\n", "    documents_vectorized.append(document_vector)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "[array([0., 0., 1., 0., 0., 1., 0., 0., 0., 1., 1., 1., 0., 0., 0., 1., 1.,\n", " 0., 0., 0., 0., 0., 0., 0., 1., 0.]),\n", " array([0., 1., 0., 0., 1., 0., 0., 0., 0., 1., 1., 1., 0., 0., 1., 1., 0.,\n", " 0., 0., 1., 0., 0., 0., 0., 1., 0.]),\n", " 
array([0., 0., 0., 0., 0., 1., 1., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0.,\n", " 0., 1., 0., 0., 0., 0., 0., 0., 0.]),\n", " array([1., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,\n", " 0., 0., 0., 0., 1., 1., 1., 0., 1.]),\n", " array([0., 0., 0., 0., 0., 2., 0., 1., 0., 0., 1., 1., 1., 1., 0., 0., 1.,\n", " 1., 1., 0., 1., 0., 0., 0., 0., 0.])]" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "documents_vectorized" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### IDF" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([5. , 5. , 5. , 5. , 5. ,\n", " 1.66666667, 5. , 2.5 , 5. , 2.5 ,\n", " 1.66666667, 1.66666667, 5. , 2.5 , 5. ,\n", " 2.5 , 2.5 , 5. , 2.5 , 5. ,\n", " 5. , 5. , 5. , 5. , 2.5 ,\n", " 5. ])" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "idf = np.zeros(len(vocabulary))\n", "idf = len(documents_vectorized) / np.sum(np.array(documents_vectorized) != 0,axis=0)\n", "display(idf)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "for i in range(len(documents_vectorized)):\n", "    documents_vectorized[i] = documents_vectorized[i]# * idf" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### TASK 3 Write a function similarity that returns the cosine similarity between two documents in vectorized form" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def similarity(query, document):\n", "    pass" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Ala lubi zwierzęta i ma kota oraz psa!'" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "documents[0]" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0., 0., 1., 0., 
0., 1., 0., 0., 0., 1., 1., 1., 0., 0., 0., 1., 1.,\n", " 0., 0., 0., 0., 0., 0., 0., 1., 0.])" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "documents_vectorized[0]" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Ola lubi zwierzęta oraz ma kota a także chomika!'" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "documents[1]" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0., 1., 0., 0., 1., 0., 0., 0., 0., 1., 1., 1., 0., 0., 1., 1., 0.,\n", " 0., 0., 1., 0., 0., 0., 0., 1., 0.])" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "documents_vectorized[1]" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.5892556509887895" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "similarity(documents_vectorized[0],documents_vectorized[1])" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "def transform_query(query):\n", "    query_vector = tf(tokenize_str(get_str_cleaned(query)))\n", "    return query_vector" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,\n", " 0., 0., 0., 0., 0., 0., 0., 0., 0.])" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "transform_query('psa')" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.4999999999999999" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "similarity(transform_query('psa kota'), documents_vectorized[0])" ] }, { "cell_type": "code", 
"execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Ala lubi zwierzęta i ma kota oraz psa!'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "0.4999999999999999" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'Ola lubi zwierzęta oraz ma kota a także chomika!'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "0.2357022603955158" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'I Jan jeździ na rowerze.'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "0.0" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'2 wojna światowa była wielkim konfliktem zbrojnym'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "0.0" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'Tomek lubi psy, ma psa i jeździ na motorze i rowerze.'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "0.19611613513818402" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# this is how a two-word query is handled\n", "query = 'psa kota'\n", "for i in range(len(documents)):\n", "    display(documents[i])\n", "    display(similarity(transform_query(query), documents_vectorized[i]))" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Ala lubi zwierzęta i ma kota oraz psa!'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "0.0" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'Ola lubi zwierzęta oraz ma kota a także chomika!'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "0.0" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'I Jan jeździ na rowerze.'" ] }, "metadata": {}, 
"output_type": "display_data" }, { "data": { "text/plain": [ "0.4472135954999579" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'2 wojna światowa była wielkim konfliktem zbrojnym'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "0.0" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'Tomek lubi psy, ma psa i jeździ na motorze i rowerze.'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "0.2773500981126146" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# this is why we need the denominator in cosine similarity\n", "query = 'rowerze'\n", "for i in range(len(documents)):\n", "    display(documents[i])\n", "    display(similarity(transform_query(query), documents_vectorized[i]))" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "'Ala lubi zwierzęta i ma kota oraz psa!'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "0.35355339059327373" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'Ola lubi zwierzęta oraz ma kota a także chomika!'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "0.0" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'I Jan jeździ na rowerze.'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "0.4472135954999579" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'2 wojna światowa była wielkim konfliktem zbrojnym'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "0.0" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'Tomek lubi psy, ma psa i jeździ na motorze i rowerze.'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ 
"0.5547001962252291" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# this is why we need term frequency: more occurrences means a better-matching document\n", "query = 'i'\n", "for i in range(len(documents)):\n", "    display(documents[i])\n", "    display(similarity(transform_query(query), documents_vectorized[i]))" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Ala lubi zwierzęta i ma kota oraz psa!'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "0.24999999999999994" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'Ola lubi zwierzęta oraz ma kota a także chomika!'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "0.2357022603955158" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'I Jan jeździ na rowerze.'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "0.31622776601683794" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'2 wojna światowa była wielkim konfliktem zbrojnym'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "0.0" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'Tomek lubi psy, ma psa i jeździ na motorze i rowerze.'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "0.39223227027636803" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# this is why we need IDF: so that more important words get a higher weight\n", "query = 'i chomika'\n", "for i in range(len(documents)):\n", "    display(documents[i])\n", "    display(similarity(transform_query(query), documents_vectorized[i]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### TASK 4 Implement IDF to change the weights from TF to TF-IDF\n", "\n", "Please use the version without any normalization\n", "\n", "\n", "$idf_i = 
\\Large\\frac{|D|}{|\\{d : t_i \\in d \\}|}$\n", "\n", "\n", "$|D|$ - the number of documents in the corpus\n", "\n", "$|\\{d : t_i \\in d \\}|$ - the number of documents in the corpus in which the given term occurs at least once" ] } ], "metadata": { "author": "Jakub Pokrywka", "email": "kubapok@wmi.amu.edu.pl", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "lang": "pl", "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.3" }, "subtitle": "3.tfidf (1)[ćwiczenia]", "title": "Ekstrakcja informacji", "year": "2021" }, "nbformat": 4, "nbformat_minor": 4 }