{ "cells": [ { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "![Logo 1](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech1.jpg)\n", "
\n", "

Information Extraction

\n", "

3. tfidf (1) [exercises]

\n", "

Jakub Pokrywka (2021)

\n", "
\n", "\n", "![Logo 2](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech2.jpg)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Class 2\n", "\n", "In this class you can earn 5 points for each valuable contribution to the discussion. Each person can earn at most 15 points during these exercises." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import re" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## document collection" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "documents = ['Ala lubi zwierz\u0119ta i ma kota oraz psa!',\n", "             'Ola lubi zwierz\u0119ta oraz ma kota a tak\u017ce chomika!',\n", "             'I Jan je\u017adzi na rowerze.',\n", "             '2 wojna \u015bwiatowa by\u0142a wielkim konfliktem zbrojnym',\n", "             'Tomek lubi psy, ma psa i je\u017adzi na motorze i rowerze.',\n", "            ]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### WHAT DO WE WANT?\n", "- we want to turn the texts into a set of words\n", "\n", "\n", "### QUESTION\n", "- can we tokenize the text with, e.g., documents.split(' ')? What problems would that cause?" 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## preprocessing" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "def get_str_cleaned(str_dirty):\n", "    punctuation = '!\"#$%&\\'()*+,-./:;<=>?@[\\\\]^_`{|}~'\n", "    new_str = str_dirty.lower()\n", "    new_str = re.sub(' +', ' ', new_str)\n", "    for char in punctuation:\n", "        new_str = new_str.replace(char,'')\n", "    return new_str\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "sample_document = get_str_cleaned(documents[0])" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'ala lubi zwierz\u0119ta i ma kota oraz psa'" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample_document" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## tokenization" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "def tokenize_str(document):\n", "    return document.split(' ')" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['ala', 'lubi', 'zwierz\u0119ta', 'i', 'ma', 'kota', 'oraz', 'psa']" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tokenize_str(sample_document)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "documents_cleaned = [get_str_cleaned(d) for d in documents]" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['ala lubi zwierz\u0119ta i ma kota oraz psa',\n", " 'ola lubi zwierz\u0119ta oraz ma kota a tak\u017ce chomika',\n", " 'i jan je\u017adzi na rowerze',\n", " '2 wojna \u015bwiatowa by\u0142a wielkim konfliktem zbrojnym',\n", " 'tomek lubi psy ma psa i je\u017adzi na motorze i rowerze']" ] }, "execution_count": 9, "metadata": {}, "output_type": 
"execute_result" } ], "source": [ "documents_cleaned" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "documents_tokenized = [tokenize_str(d) for d in documents_cleaned]" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[['ala', 'lubi', 'zwierz\u0119ta', 'i', 'ma', 'kota', 'oraz', 'psa'],\n", " ['ola', 'lubi', 'zwierz\u0119ta', 'oraz', 'ma', 'kota', 'a', 'tak\u017ce', 'chomika'],\n", " ['i', 'jan', 'je\u017adzi', 'na', 'rowerze'],\n", " ['2', 'wojna', '\u015bwiatowa', 'by\u0142a', 'wielkim', 'konfliktem', 'zbrojnym'],\n", " ['tomek',\n", " 'lubi',\n", " 'psy',\n", " 'ma',\n", " 'psa',\n", " 'i',\n", " 'je\u017adzi',\n", " 'na',\n", " 'motorze',\n", " 'i',\n", " 'rowerze']]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "documents_tokenized" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## PYTANIA\n", "- jaki jest nast\u0119pny krok w celu stworzenia wekt\u00f3r\u00f3w TF lub TF-IDF\n", "- jakie wielko\u015bci b\u0119dzie wektor TF lub TF-IDF?\n" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "vocabulary = []\n", "for document in documents_tokenized:\n", " for word in document:\n", " vocabulary.append(word)\n", "vocabulary = sorted(set(vocabulary))" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['2',\n", " 'a',\n", " 'ala',\n", " 'by\u0142a',\n", " 'chomika',\n", " 'i',\n", " 'jan',\n", " 'je\u017adzi',\n", " 'konfliktem',\n", " 'kota',\n", " 'lubi',\n", " 'ma',\n", " 'motorze',\n", " 'na',\n", " 'ola',\n", " 'oraz',\n", " 'psa',\n", " 'psy',\n", " 'rowerze',\n", " 'tak\u017ce',\n", " 'tomek',\n", " 'wielkim',\n", " 'wojna',\n", " 'zbrojnym',\n", " 'zwierz\u0119ta',\n", " '\u015bwiatowa']" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vocabulary" 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## QUESTIONS\n", "\n", "what will the word \"jak\" look like in the TF vector representation?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### TASK 1 create a function word_to_index(word:str); the function should return a one-hot vector as a numpy array" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "def word_to_index(word):\n", "    pass" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,\n", " 0., 0., 0., 0., 0., 0., 0., 0., 0.])" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "word_to_index('psa')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### TASK 2 WRITE A FUNCTION that takes a list of words and converts it into a TF vector\n" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "def tf(document):\n", "    pass" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0., 0., 1., 0., 0., 1., 0., 0., 0., 1., 1., 1., 0., 0., 0., 1., 1.,\n", " 0., 0., 0., 0., 0., 0., 0., 1., 0.])" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tf(documents_tokenized[0])" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "documents_vectorized = list()\n", "for document in documents_tokenized:\n", "    document_vector = tf(document)\n", "    documents_vectorized.append(document_vector)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "[array([0., 0., 1., 0., 0., 1., 0., 0., 0., 1., 1., 1., 0., 0., 
0., 1., 1.,\n", " 0., 0., 0., 0., 0., 0., 0., 1., 0.]),\n", " array([0., 1., 0., 0., 1., 0., 0., 0., 0., 1., 1., 1., 0., 0., 1., 1., 0.,\n", " 0., 0., 1., 0., 0., 0., 0., 1., 0.]),\n", " array([0., 0., 0., 0., 0., 1., 1., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0.,\n", " 0., 1., 0., 0., 0., 0., 0., 0., 0.]),\n", " array([1., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,\n", " 0., 0., 0., 0., 1., 1., 1., 0., 1.]),\n", " array([0., 0., 0., 0., 0., 2., 0., 1., 0., 0., 1., 1., 1., 1., 0., 0., 1.,\n", " 1., 1., 0., 1., 0., 0., 0., 0., 0.])]" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "documents_vectorized" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### IDF" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([5. , 5. , 5. , 5. , 5. ,\n", " 1.66666667, 5. , 2.5 , 5. , 2.5 ,\n", " 1.66666667, 1.66666667, 5. , 2.5 , 5. ,\n", " 2.5 , 2.5 , 5. , 2.5 , 5. ,\n", " 5. , 5. , 5. , 5. , 2.5 ,\n", " 5. 
])" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "idf = np.zeros(len(vocabulary))\n", "idf = len(documents_vectorized) / np.sum(np.array(documents_vectorized) != 0,axis=0)\n", "display(idf)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "for i in range(len(documents_vectorized)):\n", "    documents_vectorized[i] = documents_vectorized[i]# * idf" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### TASK 3 Write a function similarity that returns the cosine similarity between two documents in vectorized form" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def similarity(query, document):\n", "    pass" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Ala lubi zwierz\u0119ta i ma kota oraz psa!'" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "documents[0]" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0., 0., 1., 0., 0., 1., 0., 0., 0., 1., 1., 1., 0., 0., 0., 1., 1.,\n", " 0., 0., 0., 0., 0., 0., 0., 1., 0.])" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "documents_vectorized[0]" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Ola lubi zwierz\u0119ta oraz ma kota a tak\u017ce chomika!'" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "documents[1]" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0., 1., 0., 0., 1., 0., 0., 0., 0., 1., 1., 1., 0., 0., 1., 1., 0.,\n", " 0., 0., 1., 0., 0., 0., 0., 1., 0.])" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ 
"documents_vectorized[1]" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.5892556509887895" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "similarity(documents_vectorized[0],documents_vectorized[1])" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "def transform_query(query):\n", " query_vector = tf(tokenize_str(get_str_cleaned(query)))\n", " return query_vector" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,\n", " 0., 0., 0., 0., 0., 0., 0., 0., 0.])" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "transform_query('psa')" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.4999999999999999" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "similarity(transform_query('psa kota'), documents_vectorized[0])" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Ala lubi zwierz\u0119ta i ma kota oraz psa!'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "0.4999999999999999" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'Ola lubi zwierz\u0119ta oraz ma kota a tak\u017ce chomika!'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "0.2357022603955158" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'I Jan je\u017adzi na rowerze.'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "0.0" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'2 wojna \u015bwiatowa by\u0142a wielkim 
konfliktem zbrojnym'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "0.0" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'Tomek lubi psy, ma psa i je\u017adzi na motorze i rowerze.'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "0.19611613513818402" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# this is how a two-word query is handled\n", "query = 'psa kota'\n", "for i in range(len(documents)):\n", "    display(documents[i])\n", "    display(similarity(transform_query(query), documents_vectorized[i]))" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Ala lubi zwierz\u0119ta i ma kota oraz psa!'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "0.0" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'Ola lubi zwierz\u0119ta oraz ma kota a tak\u017ce chomika!'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "0.0" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'I Jan je\u017adzi na rowerze.'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "0.4472135954999579" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'2 wojna \u015bwiatowa by\u0142a wielkim konfliktem zbrojnym'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "0.0" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'Tomek lubi psy, ma psa i je\u017adzi na motorze i rowerze.'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "0.2773500981126146" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# this is why we need the denominator in cosine similarity\n", "query = 'rowerze'\n", "for i in 
range(len(documents)):\n", "    display(documents[i])\n", "    display(similarity(transform_query(query), documents_vectorized[i]))" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "'Ala lubi zwierz\u0119ta i ma kota oraz psa!'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "0.35355339059327373" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'Ola lubi zwierz\u0119ta oraz ma kota a tak\u017ce chomika!'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "0.0" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'I Jan je\u017adzi na rowerze.'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "0.4472135954999579" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'2 wojna \u015bwiatowa by\u0142a wielkim konfliktem zbrojnym'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "0.0" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'Tomek lubi psy, ma psa i je\u017adzi na motorze i rowerze.'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "0.5547001962252291" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# this is why we need term frequency \u2192 more occurrences mean a better-matching document\n", "query = 'i'\n", "for i in range(len(documents)):\n", "    display(documents[i])\n", "    display(similarity(transform_query(query), documents_vectorized[i]))" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Ala lubi zwierz\u0119ta i ma kota oraz psa!'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "0.24999999999999994" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ 
"'Ola lubi zwierz\u0119ta oraz ma kota a tak\u017ce chomika!'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "0.2357022603955158" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'I Jan je\u017adzi na rowerze.'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "0.31622776601683794" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'2 wojna \u015bwiatowa by\u0142a wielkim konfliktem zbrojnym'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "0.0" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'Tomek lubi psy, ma psa i je\u017adzi na motorze i rowerze.'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "0.39223227027636803" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# dlatego IDF - \u017ceby wa\u017cniejsze s\u0142owa mia\u0142 wi\u0119ksz\u0105 wag\u0119\n", "query = 'i chomika'\n", "for i in range(len(documents)):\n", " display(documents[i])\n", " display(similarity(transform_query(query), documents_vectorized[i]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### ZADANIE 4 NAPISA\u0106 IDF w celu zmiany wag z TF na TF- IDF \n", "\n", "Prosz\u0119 u\u017cy\u0107 wersj\u0119 bez \u017cadnej normalizacji\n", "\n", "\n", "$idf_i = \\Large\\frac{|D|}{|\\{d : t_i \\in d \\}|}$\n", "\n", "\n", "$|D|$ - ilo\u015b\u0107 dokument\u00f3w w korpusie\n", "$|\\{d : t_i \\in d \\}|$ - ilo\u015b\u0107 dokument\u00f3w w korpusie, gdzie dany term wyst\u0119puje chocia\u017c jeden raz" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": 
"text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.3" }, "author": "Jakub Pokrywka", "email": "kubapok@wmi.amu.edu.pl", "lang": "pl", "subtitle": "3.tfidf (1)[\u0107wiczenia]", "title": "Ekstrakcja informacji", "year": "2021" }, "nbformat": 4, "nbformat_minor": 4 }