aitech-eks-pub/wyk/05_Geste_wektory.ipynb

1645 lines
79 KiB
Plaintext
Raw Normal View History

2021-04-07 08:43:05 +02:00
{
2021-09-27 07:57:37 +02:00
"cells": [
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"![Logo 1](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech1.jpg)\n",
"<div class=\"alert alert-block alert-info\">\n",
"<h1> Ekstrakcja informacji </h1>\n",
"<h2> 5. <i>G\u0119ste reprezentacje wektorowe</i> [wyk\u0142ad]</h2> \n",
"<h3> Filip Grali\u0144ski (2021)</h3>\n",
"</div>\n",
"\n",
"![Logo 2](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech2.jpg)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Zag\u0119szczamy wektory\n",
"\n",
"Podstawowy problem z wektorow\u0105 reprezentacj\u0105 typu tf-idf polega na tym, \u017ce wektory dokument\u00f3w (i macierz ca\u0142ej kolekcji dokument\u00f3w) s\u0105 _rzadkie_, tzn. zawieraj\u0105 du\u017co zer. W praktyce potrzebujemy bardziej \"g\u0119stej\" czy \"kompaktowej\" reprezentacji numerycznej dokument\u00f3w. \n",
"\n",
"## _Hashing trick_\n",
"\n",
"Powierzchownie problem mo\u017cemy rozwi\u0105za\u0107 przez u\u017cycie tzw. _sztuczki z haszowaniem_ (_hashing trick_). B\u0119dziemy potrzebowa\u0107 funkcji mieszaj\u0105cej (haszuj\u0105cej) $H$, kt\u00f3ra rzutuje napisy na liczby, kt\u00f3rych reprezentacja binarna sk\u0142ada si\u0119 z $b$ bit\u00f3w:\n",
"\n",
"$$H : \\Sigma^{*} \\rightarrow \\{0,\\dots,2^b-1\\}$$\n",
"\n",
"($\\Sigma^{*}$ to zbi\u00f3r wszystkich napis\u00f3w.)\n",
"\n",
"**Pytanie:** Czy funkcja $H$ mo\u017ce by\u0107 r\u00f3\u017cnowarto\u015bciowa?\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Jako funkcji $H$ mo\u017cemy np. u\u017cy\u0107 funkcji MurmurHash2 lub 3."
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Hash64 0x4a80abc136f926e7"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"Hash64 0x6c3a641663470e2c"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"Hash64 0x6c3a641663470e2c"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"Hash64 0xa714568917576314"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"Hash64 0x875d9e7e413747c8"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"Hash64 0x13ce831936ebc69e"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"Hash64 0xb04ce6229407c882"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"Hash64 0x6ecd7bae29ae0450"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import Data.Digest.Murmur64\n",
"\n",
"hash64 \"Komputer\"\n",
"hash64 \"komputer\"\n",
"hash64 \"komputer\"\n",
"hash64 \"komputerze\"\n",
"hash64 \"komputerek\"\n",
"hash64 \"abrakadabra\"\n",
"hash64 \"\"\n",
"hash64 \" \"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Pytanie:** podobne napisy maj\u0105 zupe\u0142nie r\u00f3\u017cne warto\u015bci funkcji haszuj\u0105cej, czy to dobrze, czy to \u017ale?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Musimy tylko sparametryzowa\u0107 nasz\u0105 funkcj\u0119 rozmiarem \"odcisku\" (parametr $b$)."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"3628"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"25364"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"2877"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"50846"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"12"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"{-# LANGUAGE OverloadedStrings #-}\n",
"\n",
"import Data.Text\n",
"\n",
"-- pomocnicza funkcja, kt\u00f3ra konwertuje warto\u015b\u0107 specjalnego\n",
"-- typu Hash64 do zwyk\u0142ej liczby ca\u0142kowitej\n",
"hashValueAsInteger :: Hash64 -> Integer\n",
"hashValueAsInteger = toInteger . asWord64\n",
"\n",
"-- unpack to funkcja, kt\u00f3ra warto\u015b\u0107 typu String konwertuje do Text\n",
"hash :: Integer -> Text -> Integer\n",
"hash b t = hashValueAsInteger (hash64 $ unpack t) `mod` (2 ^ b)\n",
"\n",
"hash 16 \"komputer\"\n",
"hash 16 \"komputerze\"\n",
"hash 16 \"komputerem\"\n",
"hash 16 \"abrakadabra\"\n",
"hash 4 \"komputer\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Pytanie:** Jakie warto\u015bci $b$ b\u0119d\u0105 bezsensowne?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Sztuczka z haszowaniem polega na tym, \u017ce zamiast numerowa\u0107 s\u0142owa korzystaj\u0105c ze s\u0142ownika, po prostu u\u017cywamy funkcji haszuj\u0105cej. W ten spos\u00f3b wektor b\u0119dzie _zawsze_ rozmiar $2^b$ - bez wzgl\u0119du na rozmiar s\u0142ownika."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Zacznijmy od przywo\u0142ania wszystkich potrzebnych definicji."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"{-# LANGUAGE OverloadedStrings #-}\n",
"{-# LANGUAGE QuasiQuotes #-}\n",
"\n",
"import Data.Text hiding(map, filter, zip)\n",
"import Text.Regex.PCRE.Heavy\n",
"\n",
"isStopWord :: Text -> Bool\n",
"isStopWord \"w\" = True\n",
"isStopWord \"jest\" = True\n",
"isStopWord \"\u017ce\" = True\n",
"isStopWord w = w \u2248 [re|^\\p{P}+$|]\n",
"\n",
"\n",
"removeStopWords :: [Text] -> [Text]\n",
"removeStopWords = filter (not . isStopWord)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"{-# LANGUAGE OverloadedStrings #-}\n",
"{-# LANGUAGE QuasiQuotes #-}\n",
"{-# LANGUAGE FlexibleContexts #-}\n",
"\n",
"import Data.Text hiding(map, filter, zip)\n",
"import Prelude hiding(words, take)\n",
"import Text.Regex.PCRE.Heavy\n",
"import Data.Map as Map hiding(take, map, filter)\n",
"import Data.Set as Set hiding(map)\n",
"\n",
"tokenize :: Text -> [Text]\n",
"tokenize = map fst . scan [re|C\\+\\+|[\\p{L}0-9]+|\\p{P}|]\n",
"\n",
"\n",
"mockInflectionDictionary :: Map Text Text\n",
"mockInflectionDictionary = Map.fromList [\n",
" (\"kota\", \"kot\"),\n",
" (\"butach\", \"but\"),\n",
" (\"masz\", \"mie\u0107\"),\n",
" (\"ma\", \"mie\u0107\"),\n",
" (\"buta\", \"but\"),\n",
" (\"zgubi\u0142em\", \"zgubi\u0107\")]\n",
"\n",
"lemmatizeWord :: Map Text Text -> Text -> Text\n",
"lemmatizeWord dict w = findWithDefault w w dict\n",
"\n",
"lemmatize :: Map Text Text -> [Text] -> [Text]\n",
"lemmatize dict = map (lemmatizeWord dict)\n",
"\n",
"\n",
"poorMansStemming = Data.Text.take 6\n",
"\n",
"normalize :: Text -> [Text]\n",
"normalize = map poorMansStemming . removeStopWords . map toLower . lemmatize mockInflectionDictionary . tokenize\n",
"\n",
"getVocabulary :: [Text] -> Set Text \n",
"getVocabulary = Set.unions . map (Set.fromList . normalize) \n",
" \n",
"idf :: [[Text]] -> Text -> Double\n",
"idf coll t = log (fromIntegral n / fromIntegral df)\n",
" where df = Prelude.length $ Prelude.filter (\\d -> t `elem` d) coll\n",
" n = Prelude.length coll\n",
" \n",
"vectorizeTfIdf :: Int -> [[Text]] -> Map Int Text -> [Text] -> [Double]\n",
"vectorizeTfIdf vecSize coll v doc = map (\\i -> count (v ! i) doc * idf coll (v ! i)) [0..(vecSize-1)]\n",
" where count t doc = fromIntegral $ (Prelude.length . Prelude.filter (== t)) doc "
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"import System.IO\n",
"import Data.List.Split as SP\n",
"\n",
"legendsh <- openFile \"legendy.txt\" ReadMode\n",
"hSetEncoding legendsh utf8\n",
"contents <- hGetContents legendsh\n",
"ls = Prelude.lines contents\n",
"items = map (map pack . SP.splitOn \"\\t\") ls\n",
"\n",
"labelsL = map Prelude.head items\n",
"collectionL = map (!!1) items\n",
"\n",
"collectionLNormalized = map normalize collectionL\n",
"voc' = getVocabulary collectionL\n",
"\n",
"vocLSize = Prelude.length voc'\n",
"\n",
"vocL :: Map Int Text\n",
"vocL = Map.fromList $ zip [0..] $ Set.toList voc'\n",
"\n",
"invvocL :: Map Text Int\n",
"invvocL = Map.fromList $ zip (Set.toList voc') [0..]\n",
"\n",
"lVectorized = map (vectorizeTfIdf vocLSize collectionLNormalized vocL) collectionLNormalized\n"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<style>/* Styles used for the Hoogle display in the pager */\n",
".hoogle-doc {\n",
"display: block;\n",
"padding-bottom: 1.3em;\n",
"padding-left: 0.4em;\n",
"}\n",
".hoogle-code {\n",
"display: block;\n",
"font-family: monospace;\n",
"white-space: pre;\n",
"}\n",
".hoogle-text {\n",
"display: block;\n",
"}\n",
".hoogle-name {\n",
"color: green;\n",
"font-weight: bold;\n",
"}\n",
".hoogle-head {\n",
"font-weight: bold;\n",
"}\n",
".hoogle-sub {\n",
"display: block;\n",
"margin-left: 0.4em;\n",
"}\n",
".hoogle-package {\n",
"font-weight: bold;\n",
"font-style: italic;\n",
"}\n",
".hoogle-module {\n",
"font-weight: bold;\n",
"}\n",
".hoogle-class {\n",
"font-weight: bold;\n",
"}\n",
".get-type {\n",
"color: green;\n",
"font-weight: bold;\n",
"font-family: monospace;\n",
"display: block;\n",
"white-space: pre-wrap;\n",
"}\n",
".show-type {\n",
"color: green;\n",
"font-weight: bold;\n",
"font-family: monospace;\n",
"margin-left: 1em;\n",
"}\n",
".mono {\n",
"font-family: monospace;\n",
"display: block;\n",
"}\n",
".err-msg {\n",
"color: red;\n",
"font-style: italic;\n",
"font-family: monospace;\n",
"white-space: pre;\n",
"display: block;\n",
"}\n",
"#unshowable {\n",
"color: red;\n",
"font-weight: bold;\n",
"}\n",
".err-msg.in.collapse {\n",
"padding-top: 0.7em;\n",
"}\n",
".highlight-code {\n",
"white-space: pre;\n",
"font-family: monospace;\n",
"}\n",
".suggestion-warning { \n",
"font-weight: bold;\n",
"color: rgb(200, 130, 0);\n",
"}\n",
".suggestion-error { \n",
"font-weight: bold;\n",
"color: red;\n",
"}\n",
".suggestion-name {\n",
"font-weight: bold;\n",
"}\n",
"</style><div class=\"suggestion-name\" style=\"clear:both;\">Eta reduce</div><div class=\"suggestion-row\" style=\"float: left;\"><div class=\"suggestion-warning\">Found:</div><div class=\"highlight-code\" id=\"haskell\">formatNumber x = printf \"% 7.2f\" x</div></div><div class=\"suggestion-row\" style=\"float: left;\"><div class=\"suggestion-warning\">Why Not:</div><div class=\"highlight-code\" id=\"haskell\">formatNumber = printf \"% 7.2f\"</div></div><div class=\"suggestion-name\" style=\"clear:both;\">Use zipWith</div><div class=\"suggestion-row\" style=\"float: left;\"><div class=\"suggestion-warning\">Found:</div><div class=\"highlight-code\" id=\"haskell\">map (\\ (lab, ix) -> lab <> \" \" <> similarTo simFun vs ix)\n",
" $ zip labels [0 .. (Prelude.length vs - 1)]</div></div><div class=\"suggestion-row\" style=\"float: left;\"><div class=\"suggestion-warning\">Why Not:</div><div class=\"highlight-code\" id=\"haskell\">zipWith\n",
" (curry (\\ (lab, ix) -> lab <> \" \" <> similarTo simFun vs ix))\n",
" labels [0 .. (Prelude.length vs - 1)]</div></div><div class=\"suggestion-name\" style=\"clear:both;\">Avoid lambda</div><div class=\"suggestion-row\" style=\"float: left;\"><div class=\"suggestion-warning\">Found:</div><div class=\"highlight-code\" id=\"haskell\">\\ l -> pack $ printf \"% 7s\" l</div></div><div class=\"suggestion-row\" style=\"float: left;\"><div class=\"suggestion-warning\">Why Not:</div><div class=\"highlight-code\" id=\"haskell\">pack . printf \"% 7s\"</div></div>"
],
"text/plain": [
"Line 5: Eta reduce\n",
"Found:\n",
"formatNumber x = printf \"% 7.2f\" x\n",
"Why not:\n",
"formatNumber = printf \"% 7.2f\"Line 11: Use zipWith\n",
"Found:\n",
"map (\\ (lab, ix) -> lab <> \" \" <> similarTo simFun vs ix)\n",
" $ zip labels [0 .. (Prelude.length vs - 1)]\n",
"Why not:\n",
"zipWith\n",
" (curry (\\ (lab, ix) -> lab <> \" \" <> similarTo simFun vs ix))\n",
" labels [0 .. (Prelude.length vs - 1)]Line 12: Avoid lambda\n",
"Found:\n",
"\\ l -> pack $ printf \"% 7s\" l\n",
"Why not:\n",
"pack . printf \"% 7s\""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import Text.Printf\n",
"import Data.List (take)\n",
"\n",
"formatNumber :: Double -> String\n",
"formatNumber x = printf \"% 7.2f\" x\n",
"\n",
"similarTo :: ([Double] -> [Double] -> Double) -> [[Double]] -> Int -> Text\n",
"similarTo simFun vs ix = pack $ Prelude.unwords $ map (formatNumber . ((vs !! ix) `simFun`)) vs\n",
"\n",
"paintMatrix :: ([Double] -> [Double] -> Double) -> [Text] -> [[Double]] -> Text\n",
"paintMatrix simFun labels vs = header <> \"\\n\" <> Data.Text.unlines (map (\\(lab, ix) -> lab <> \" \" <> similarTo simFun vs ix) $ zip labels [0..(Prelude.length vs - 1)])\n",
" where header = \" \" <> Data.Text.unwords (map (\\l -> pack $ printf \"% 7s\" l) labels)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" na_ak w_lud ba_hy w_lap ne_dz be_wy zw_oz mo_zu be_wy ba_hy mo_zu be_wy w_lud\n",
"na_ak 1.00 0.02 0.01 0.01 0.03 0.02 0.02 0.04 0.03 0.02 0.01 0.02 0.03\n",
"w_lud 0.02 1.00 0.02 0.05 0.04 0.01 0.03 0.04 0.06 0.01 0.02 0.03 0.06\n",
"ba_hy 0.01 0.02 1.00 0.01 0.02 0.03 0.03 0.04 0.08 0.22 0.01 0.04 0.01\n",
"w_lap 0.01 0.05 0.01 1.00 0.01 0.01 0.00 0.01 0.02 0.00 0.00 0.00 0.00\n",
"ne_dz 0.03 0.04 0.02 0.01 1.00 0.04 0.03 0.07 0.08 0.06 0.03 0.03 0.05\n",
"be_wy 0.02 0.01 0.03 0.01 0.04 1.00 0.01 0.03 0.21 0.01 0.02 0.25 0.01\n",
"zw_oz 0.02 0.03 0.03 0.00 0.03 0.01 1.00 0.04 0.03 0.00 0.01 0.02 0.02\n",
"mo_zu 0.04 0.04 0.04 0.01 0.07 0.03 0.04 1.00 0.10 0.02 0.09 0.05 0.04\n",
"be_wy 0.03 0.06 0.08 0.02 0.08 0.21 0.03 0.10 1.00 0.05 0.03 0.24 0.04\n",
"ba_hy 0.02 0.01 0.22 0.00 0.06 0.01 0.00 0.02 0.05 1.00 0.01 0.02 0.00\n",
"mo_zu 0.01 0.02 0.01 0.00 0.03 0.02 0.01 0.09 0.03 0.01 1.00 0.01 0.02\n",
"be_wy 0.02 0.03 0.04 0.00 0.03 0.25 0.02 0.05 0.24 0.02 0.01 1.00 0.02\n",
"w_lud 0.03 0.06 0.01 0.00 0.05 0.01 0.02 0.04 0.04 0.00 0.02 0.02 1.00"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"limit = 13\n",
"labelsLimited = Data.List.take limit labelsL\n",
"limitedL = Data.List.take limit lVectorized\n",
"\n",
"vectorNorm :: [Double] -> Double\n",
"vectorNorm vs = sqrt $ sum $ map (\\x -> x * x) vs\n",
"\n",
"toUnitVector :: [Double] -> [Double]\n",
"toUnitVector vs = map (/ n) vs\n",
" where n = vectorNorm vs\n",
"\n",
"\n",
"(\u2715) :: [Double] -> [Double] -> Double\n",
"(\u2715) v1 v2 = sum $ Prelude.zipWith (*) v1 v2\n",
"\n",
"cosineSim v1 v2 = toUnitVector v1 \u2715 toUnitVector v2\n",
"\n",
"paintMatrix cosineSim labelsLimited limitedL"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Powy\u017csza macierz reprezentuje por\u00f3wnanie przy u\u017cyciu podobie\u0144stwa kosinusowego. Spr\u00f3bujmy teraz u\u017cy\u0107 g\u0119stszych wektor\u00f3w przy u\u017cyciu hashing trick. Jako warto\u015b\u0107 $b$ przyjmijmy 6.\n",
"\n",
"Zobaczmy najpierw, w kt\u00f3re \"przegr\u00f3dki\" b\u0119d\u0105 wpada\u0142y poszczeg\u00f3lne wyrazy s\u0142ownika.\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[(\"0\",32),(\"00\",4),(\"01\",4),(\"07\",40),(\"09\",44),(\"1\",1),(\"10\",61),(\"100\",27),(\"12\",58),(\"13\",51),(\"131\",37),(\"15\",30),(\"16\",21),(\"17\",58),(\"18\",55),(\"19\",35),(\"1997r\",61),(\"2\",62),(\"20\",28),(\"2006\",44),(\"2008\",19),(\"2009\",4),(\"2010\",3),(\"22\",27),(\"23\",34),(\"24\",7),(\"25\",29),(\"26\",35),(\"27\",44),(\"28\",61),(\"29\",30),(\"3\",56),(\"30\",55),(\"300\",38),(\"31\",45),(\"4\",53),(\"40\",39),(\"42\",43),(\"48\",53),(\"49\",13),(\"5\",31),(\"50\",32),(\"56\",38),(\"57\",55),(\"6\",59),(\"7\",27),(\"8\",34),(\"a\",27),(\"aaa\",33),(\"absolu\",11),(\"absurd\",18),(\"aby\",12),(\"adnym\",10),(\"adres\",15),(\"adrese\",62),(\"afroam\",3),(\"afryce\",46),(\"agresy\",57),(\"ah\",37),(\"aha\",42),(\"aig\",56),(\"akadem\",18),(\"akcja\",0),(\"akcje\",21),(\"akompa\",13),(\"aktor\",26),(\"akurat\",7),(\"albino\",27),(\"albo\",44),(\"ale\",7),(\"alfa\",58),(\"alkoho\",56),(\"altern\",38),(\"ameryk\",11),(\"amp\",62),(\"anakon\",34),(\"analiz\",62),(\"andrze\",63),(\"anegdo\",43),(\"ang\",37),(\"anga\\380o\",27),(\"anglii\",33),(\"ani\",22),(\"anonsu\",36),(\"antono\",3),(\"antykr\",41),(\"apetyt\",16),(\"apolit\",39),(\"apropo\",54),(\"apteki\",20),(\"aqua\",59),(\"archit\",61),(\"aromat\",44),(\"artyku\",31),(\"asami\",22),(\"astron\",59),(\"asy\\347ci\",60),(\"atmosf\",37),(\"audycj\",50),(\"auta\",38)]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"map (\\t -> (t, hash 6 t)) $ Data.List.take 100 $ Set.toList voc'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Pytanie:** Czy jakie\u015b dwa termy wpad\u0142y do jednej przegr\u00f3dki?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Stw\u00f3rzmy najpierw funkcj\u0119, kt\u00f3ra b\u0119dzie wektoryzowa\u0142a pojedynczy term $t$. Po prostu stworzymy wektor, kt\u00f3re b\u0119dzie mia\u0142 rozmiar $2^b$, wsz\u0119dzie b\u0119dzie mia\u0142 0 z wyj\u0105tkiem pozycji o numerze $H_b(t)$ - tam wpiszmy odwrotn\u0105 cz\u0119sto\u015b\u0107 dokumentow\u0105.\n",
"\n",
"$$\\vec{t} = [0,\\dots,\\idf_c t,\\dots,0]$$\n",
"\n",
"Teraz dla dokumentu $d = (t_1,\\dots,t_n)$ i dla schematu wa\u017cenia tf-idf:\n",
"\n",
"$$\\vec{d} = \\sum \\vec{t_i}$$"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.465908118654584,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.268683541318364,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"[3.367295829986474,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"wordVector :: Integer -> [[Text]] -> Text -> [Double]\n",
"wordVector b coll term = map selector [0..vecSize]\n",
" where vecSize = 2^b - 1\n",
" wordFingerprint = hash b term\n",
" selector i \n",
" | i == wordFingerprint = idf coll term\n",
" | otherwise = 0.0\n",
"\n",
"wordVector 6 collectionLNormalized \"aromat\"\n",
"wordVector 6 collectionLNormalized \"albo\"\n",
"wordVector 6 collectionLNormalized \"akcja\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Teraz wystarczy zsumowa\u0107 wektory dla poszczeg\u00f3lnych s\u0142\u00f3w, \u017ceby otrzyma\u0107 wektor dokumentu. Najpierw zdefiniujmy sobie sum\u0119 wektorow\u0105."
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[1.2,4.0,3.0]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"(+++) :: [Double] -> [Double] -> [Double]\n",
"(+++) = Prelude.zipWith (+)\n",
"\n",
"[0.2, 0.5, 1.0] +++ [1.0, 3.5, 2.0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Przydatna b\u0119dzie jeszcze funkcja, kt\u00f3ra tworzy wektor z samymi zerami o zadanej d\u0142ugo\u015bci:"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"zero :: Int -> [Double]\n",
"zero s = Prelude.replicate s 0.0\n",
"\n",
"zero (2^6)"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<style>/* Styles used for the Hoogle display in the pager */\n",
".hoogle-doc {\n",
"display: block;\n",
"padding-bottom: 1.3em;\n",
"padding-left: 0.4em;\n",
"}\n",
".hoogle-code {\n",
"display: block;\n",
"font-family: monospace;\n",
"white-space: pre;\n",
"}\n",
".hoogle-text {\n",
"display: block;\n",
"}\n",
".hoogle-name {\n",
"color: green;\n",
"font-weight: bold;\n",
"}\n",
".hoogle-head {\n",
"font-weight: bold;\n",
"}\n",
".hoogle-sub {\n",
"display: block;\n",
"margin-left: 0.4em;\n",
"}\n",
".hoogle-package {\n",
"font-weight: bold;\n",
"font-style: italic;\n",
"}\n",
".hoogle-module {\n",
"font-weight: bold;\n",
"}\n",
".hoogle-class {\n",
"font-weight: bold;\n",
"}\n",
".get-type {\n",
"color: green;\n",
"font-weight: bold;\n",
"font-family: monospace;\n",
"display: block;\n",
"white-space: pre-wrap;\n",
"}\n",
".show-type {\n",
"color: green;\n",
"font-weight: bold;\n",
"font-family: monospace;\n",
"margin-left: 1em;\n",
"}\n",
".mono {\n",
"font-family: monospace;\n",
"display: block;\n",
"}\n",
".err-msg {\n",
"color: red;\n",
"font-style: italic;\n",
"font-family: monospace;\n",
"white-space: pre;\n",
"display: block;\n",
"}\n",
"#unshowable {\n",
"color: red;\n",
"font-weight: bold;\n",
"}\n",
".err-msg.in.collapse {\n",
"padding-top: 0.7em;\n",
"}\n",
".highlight-code {\n",
"white-space: pre;\n",
"font-family: monospace;\n",
"}\n",
".suggestion-warning { \n",
"font-weight: bold;\n",
"color: rgb(200, 130, 0);\n",
"}\n",
".suggestion-error { \n",
"font-weight: bold;\n",
"color: red;\n",
"}\n",
".suggestion-name {\n",
"font-weight: bold;\n",
"}\n",
"</style><div class=\"suggestion-name\" style=\"clear:both;\">Eta reduce</div><div class=\"suggestion-row\" style=\"float: left;\"><div class=\"suggestion-warning\">Found:</div><div class=\"highlight-code\" id=\"haskell\">vectorizeWithHashingTrick b coll doc\n",
" = Prelude.foldr ((+++) . wordVector b coll) (zero $ 2 ^ b) doc</div></div><div class=\"suggestion-row\" style=\"float: left;\"><div class=\"suggestion-warning\">Why Not:</div><div class=\"highlight-code\" id=\"haskell\">vectorizeWithHashingTrick b coll\n",
" = Prelude.foldr ((+++) . wordVector b coll) (zero $ 2 ^ b)</div></div>"
],
"text/plain": [
"Line 3: Eta reduce\n",
"Found:\n",
"vectorizeWithHashingTrick b coll doc\n",
" = Prelude.foldr ((+++) . wordVector b coll) (zero $ 2 ^ b) doc\n",
"Why not:\n",
"vectorizeWithHashingTrick b coll\n",
" = Prelude.foldr ((+++) . wordVector b coll) (zero $ 2 ^ b)"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"[5.242936783195232,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.856470206220483,0.0,0.0,1.1700712526502546,0.5947071077466928,0.0,5.712940412440966,3.0708470981669183,0.0,0.0,4.465908118654584,0.0,3.7727609380946383,0.0,0.0,0.0,0.0,4.788681510917635,0.0,3.7727609380946383,0.0,1.575536360758419,0.0,3.079613757534693,0.0,4.465908118654584,0.0,4.588010815455483,4.465908118654584,0.0,1.5214691394881432,0.0,0.0,0.0,0.0,4.465908118654584,2.5199979695992702,0.0,1.5214691394881432,8.388148398070203e-2,0.0,4.465908118654584,0.0,0.0,3.367295829986474,0.0,3.7727609380946383,0.0,1.5214691394881432,0.0,3.7727609380946383,0.0,0.0,0.0,3.367295829986474,0.0]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"[3.367295829986474,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6.734591659972947,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"[3.367295829986474,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6.734591659972947,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"[3.367295829986474,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,9.003275201291313,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"[3.367295829986474,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,8.931816237309167,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"\n",
"vectorizeWithHashingTrick :: Integer -> [[Text]] -> [Text] -> [Double]\n",
"vectorizeWithHashingTrick b coll doc = Prelude.foldr ((+++) . wordVector b coll) (zero $ 2^b) doc\n",
"\n",
"vectorizeWithHashingTrick 6 collectionLNormalized $ collectionLNormalized !! 3\n",
"vectorizeWithHashingTrick 6 collectionLNormalized [\"aromat\", \"albo\", \"akcja\"]\n",
"vectorizeWithHashingTrick 6 collectionLNormalized [\"akcja\", \"aromat\", \"albo\"]\n",
"vectorizeWithHashingTrick 6 collectionLNormalized [\"akcja\", \"aromat\", \"albo\", \"albo\"]\n",
"vectorizeWithHashingTrick 6 collectionLNormalized [\"akcja\", \"aromat\", \"09\"]\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Zobaczmy, jak zag\u0119szczenie wp\u0142ywa na macierz podobie\u0144stwa."
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" na_ak w_lud ba_hy w_lap ne_dz be_wy zw_oz mo_zu be_wy ba_hy mo_zu be_wy w_lud\n",
"na_ak 1.00 0.37 0.21 0.28 0.35 0.22 0.32 0.45 0.47 0.21 0.25 0.20 0.39\n",
"w_lud 0.37 1.00 0.28 0.18 0.38 0.15 0.20 0.35 0.36 0.14 0.17 0.19 0.33\n",
"ba_hy 0.21 0.28 1.00 0.08 0.20 0.18 0.24 0.29 0.30 0.27 0.17 0.15 0.24\n",
"w_lap 0.28 0.18 0.08 1.00 0.10 0.11 0.11 0.30 0.17 0.06 0.07 0.13 0.21\n",
"ne_dz 0.35 0.38 0.20 0.10 1.00 0.32 0.30 0.52 0.44 0.27 0.36 0.26 0.41\n",
"be_wy 0.22 0.15 0.18 0.11 0.32 1.00 0.26 0.26 0.39 0.15 0.23 0.43 0.22\n",
"zw_oz 0.32 0.20 0.24 0.11 0.30 0.26 1.00 0.38 0.36 0.06 0.18 0.20 0.29\n",
"mo_zu 0.45 0.35 0.29 0.30 0.52 0.26 0.38 1.00 0.54 0.23 0.39 0.38 0.51\n",
"be_wy 0.47 0.36 0.30 0.17 0.44 0.39 0.36 0.54 1.00 0.26 0.37 0.42 0.48\n",
"ba_hy 0.21 0.14 0.27 0.06 0.27 0.15 0.06 0.23 0.26 1.00 0.24 0.10 0.27\n",
"mo_zu 0.25 0.17 0.17 0.07 0.36 0.23 0.18 0.39 0.37 0.24 1.00 0.20 0.34\n",
"be_wy 0.20 0.19 0.15 0.13 0.26 0.43 0.20 0.38 0.42 0.10 0.20 1.00 0.29\n",
"w_lud 0.39 0.33 0.24 0.21 0.41 0.22 0.29 0.51 0.48 0.27 0.34 0.29 1.00"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"lVectorized' = map (vectorizeWithHashingTrick 8 collectionLNormalized) collectionLNormalized\n",
"limitedL' = Data.List.take limit lVectorized'\n",
"\n",
"paintMatrix cosineSim labelsLimited limitedL'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Pytanie:** Co si\u0119 stanie, gdy zwi\u0119kszymy $b$, a co je\u015bli zmniejszymi?\n",
"\n",
"Zalety sztuczki z haszowaniem:\n",
"\n",
"* zagwarantowany sta\u0142y rozmiar wektora\n",
"* szybsze obliczenia\n",
"* w naturalny spos\u00f3b uwzgl\u0119dniamy termy, kt\u00f3rych nie by\u0142o w pocz\u0105tkowej kolekcji (ale uwaga na idf!)\n",
"* nie musimy pami\u0119ta\u0107 odzworowania rzutuj\u0105cego s\u0142owa na ich numery\n",
"\n",
"Wady:\n",
"\n",
"* dwa r\u00f3\u017cne s\u0142owa mog\u0105 wpa\u015b\u0107 do jednej przegr\u00f3dki (szczeg\u00f3lnie cz\u0119ste, je\u015bli $b$ jest za ma\u0142e)\n",
"* je\u015bli $b$ ustawimy za du\u017ce, wektory mog\u0105 by\u0107 nawet wi\u0119ksze ni\u017c w przypadku standardowego podej\u015bcia\n",
"\n",
"\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Word2vec\n",
"\n",
"A mo\u017ce istnieje dobra wr\u00f3\u017cka, kt\u00f3ra da\u0142aby nam dobre wektory s\u0142\u00f3w (z kt\u00f3rych b\u0119dziemy sk\u0142adali proste wektory dokument\u00f3w przez sumowanie)?\n",
"\n",
"**Pytanie:** Jakie w\u0142asno\u015bci powinny mie\u0107 dobre wektory s\u0142\u00f3w?\n",
"\n",
"Tak! Istniej\u0105 gotowe \"bazy danych\" wektor\u00f3w. Jedn\u0105 z najpopularniejszych (i najstarszych) metod uzyskiwania takich wektor\u00f3w jest Word2vec. Jak dok\u0142adnie Word2vec, dowiemy si\u0119 p\u00f3\u017aniej, na dzisiaj po prostu u\u017cyjmy tych wektor\u00f3w.\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Najpierw wprowad\u017amy alternatywn\u0105 normalizacj\u0119 zgodn\u0105 z tym, jak zosta\u0142 wygenerowany model."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"ala"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ma"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"kota"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"normalize' :: Text -> [Text]\n",
"normalize' = removeStopWords . map toLower . tokenize\n",
"\n",
"normalize' \"Ala ma kota.\""
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"mam"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"kumpla"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ktory"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"zdawal"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"walentynki"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"i"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"polozyl"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"koperte"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"dla"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"laski"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"z"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"kartka"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"na"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"desce"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"rozdzielczej"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"egzaminator"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"wziol"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ta"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"karteke"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"i"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"powiedzial"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ze"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ma"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"znade"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"wypisal"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"mu"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"papierek"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"i"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"po"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"egzaminie"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"hehe"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"filmik"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"dobry"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"collectionLNormalized' = map normalize' collectionL\n",
"collectionLNormalized' !! 3"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[-2.305081844329834,0.3418600857257843,4.44999361038208,0.9008448719978333,-2.1629886627197266,1.0206516981124878,4.157524108886719,2.5060904026031494,-0.17275184392929077,4.085052967071533,2.236677408218384,-2.3315281867980957,0.5224806070327759,0.15804219245910645,-1.5636622905731201,-1.2624900341033936,-0.3161393105983734,-1.971177101135254,1.4859644174575806,-0.1742715835571289,1.209444284439087,4.063786193728447e-2,-0.2808700501918793,-0.5895432233810425,-4.126195430755615,-2.690922260284424,1.4975452423095703,-0.25380706787109375,-4.5767364501953125,-1.7726246118545532,2.938936710357666,-0.7173141837120056,-2.4317402839660645,-4.206724643707275,0.6768773198127747,2.236821413040161,4.1044291108846664e-2,1.6991114616394043,1.2354476377367973e-2,-3.079916000366211,-1.7430219650268555,1.8969229459762573,-0.4897139072418213,1.1981141567230225,2.431124687194824,0.39453181624412537,1.9735784530639648,2.124225378036499,-4.338796138763428,-0.954145610332489,3.3927927017211914,0.8821511268615723,5.120451096445322e-3,2.917816638946533,-2.035374164581299,3.3221969604492188,-4.981880187988281,-1.105080008506775,-4.093905448913574,-1.5998111963272095,0.6372298002243042,-0.7565107345581055,0.4038744270801544,0.685226321220398,2.137610912322998,-0.4390018582344055,1.007287859916687,0.19681350886821747,-2.598611354827881,-1.8872140645980835,1.6989527940750122,1.6458508968353271,-5.091184616088867,1.4902764558792114,-0.4839307367801666,-2.840092420578003,1.0180696249008179,0.7615311741828918,1.8135554790496826,-0.30493396520614624,3.5879104137420654,1.4585649967193604,3.2775094509124756,-1.1610190868377686,-2.3159284591674805,4.1530327796936035,-4.67172384262085,-0.8594478964805603,-0.860812783241272,-0.31788957118988037,0.7260096669197083,0.1879102736711502,-0.15789580345153809,1.9434200525283813,-1.9945732355117798,1.8799400329589844,-0.5253798365592957,-0.2834266722202301,-0.8012301921844482,1.5093021392822266]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"100"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"{-# LANGUAGE OverloadedStrings #-}\n",
"{-# LANGUAGE BangPatterns #-}\n",
"\n",
"import Data.Word2Vec.Model\n",
"import Data.Maybe (catMaybes, fromJust)\n",
"import qualified Data.Vector.Storable as V\n",
"\n",
"model <- readWord2VecModel \"tiny.bin\"\n",
"\n",
"toOurVector :: WVector -> [Double]\n",
"toOurVector (WVector v _) = map realToFrac $ V.toList v\n",
"\n",
"balwanV = toOurVector $ fromJust $ getVector model \"ba\u0142wan\"\n",
"balwanV\n",
"Prelude.length balwanV\n",
"\n",
"vectorizeWord2vec model d = Prelude.foldr (+++) (zero 100) $ map toOurVector $ catMaybes $ map (getVector model) d\n",
"\n",
"collectionLVectorized'' = map (vectorizeWord2vec model) collectionLNormalized'"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[-26.834667675197124,2.568521626293659,37.66925026476383,9.381511189043522,-32.04328362643719,-19.734033070504665,55.21128339320421,14.215368987061083,23.60182836651802,38.74189975857735,0.16257449332624674,-47.983866568654776,-36.917382495012134,36.08420217037201,13.996580198407173,-30.473296120762825,21.28328724205494,30.601420499384403,-40.5945385559462,16.043263137340546,-8.694086126983166,-41.90418399870396,-10.448782376945019,-0.21028679609298706,9.586350612342358,-46.172676257789135,46.27567541599274,11.25023115798831,9.00947591662407,-43.525397814810276,22.09978771582246,56.93886440992355,-23.428963833488524,-1.4649565666913986,21.969609811902046,-21.504647210240364,24.955158293247223,-8.328911297023296,-31.118815276771784,0.22846409678459167,12.212224327027798,-28.337586268782616,-24.105730276554823,3.36764569953084,8.270942151546478,33.71851025521755,30.665825616568327,-24.134687054902315,-31.72916578501463,35.20022106170654,71.15121555328369,-15.448215141892433,-41.27439119666815,3.0322337672114372,9.768462024629116,38.911416467279196,-9.848581969738007,-20.030757322907448,6.734442539513111,-84.9070791369304,38.147536396980286,4.3607237339019775,-25.426255017518997,5.240264508873224,-32.71464269608259,2.095752328634262,2.4292337521910667,32.93906496465206,-51.44473773613572,0.5551527962088585,-6.1982685178518295,20.187213011085987,-52.809339098632336,-10.458874322474003,13.979218572378159,-38.16066548228264,27.336308609694242,5.3437707126140594,-32.01269288826734,-38.117460787296295,-9.337415304034948,38.90077601373196,-2.158842660486698,-44.878454223275185,23.69188129901886,-54.10413733869791,-41.30505630373955,-37.28948371112347,-65.8488347530365,32.51569982431829,3.781733974814415,72.77320172637701,6.847739472985268,63.77478001266718,24.26227615773678,7.260737741366029,10.931276574730873,-17.388786104973406,9.978045962750912,5.968699499964714]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"collectionLVectorized'' !! 3"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" na_ak w_lud ba_hy w_lap ne_dz be_wy zw_oz mo_zu be_wy ba_hy mo_zu be_wy w_lud\n",
"na_ak 1.00 0.92 0.85 0.77 0.87 0.90 0.92 0.88 0.87 0.87 0.89 0.89 0.89\n",
"w_lud 0.92 1.00 0.92 0.72 0.93 0.93 0.91 0.94 0.95 0.86 0.94 0.94 0.96\n",
"ba_hy 0.85 0.92 1.00 0.69 0.89 0.91 0.83 0.89 0.95 0.86 0.87 0.94 0.90\n",
"w_lap 0.77 0.72 0.69 1.00 0.60 0.74 0.67 0.65 0.68 0.58 0.68 0.73 0.66\n",
"ne_dz 0.87 0.93 0.89 0.60 1.00 0.90 0.87 0.95 0.94 0.86 0.93 0.90 0.95\n",
"be_wy 0.90 0.93 0.91 0.74 0.90 1.00 0.89 0.89 0.91 0.85 0.91 0.96 0.94\n",
"zw_oz 0.92 0.91 0.83 0.67 0.87 0.89 1.00 0.89 0.86 0.86 0.91 0.85 0.90\n",
"mo_zu 0.88 0.94 0.89 0.65 0.95 0.89 0.89 1.00 0.97 0.85 0.95 0.91 0.96\n",
"be_wy 0.87 0.95 0.95 0.68 0.94 0.91 0.86 0.97 1.00 0.84 0.93 0.95 0.95\n",
"ba_hy 0.87 0.86 0.86 0.58 0.86 0.85 0.86 0.85 0.84 1.00 0.83 0.85 0.84\n",
"mo_zu 0.89 0.94 0.87 0.68 0.93 0.91 0.91 0.95 0.93 0.83 1.00 0.91 0.96\n",
"be_wy 0.89 0.94 0.94 0.73 0.90 0.96 0.85 0.91 0.95 0.85 0.91 1.00 0.94\n",
"w_lud 0.89 0.96 0.90 0.66 0.95 0.94 0.90 0.96 0.95 0.84 0.96 0.94 1.00"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"limitedL'' = Data.List.take limit collectionLVectorized''\n",
"\n",
"paintMatrix cosineSim labelsLimited limitedL''"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Mo\u017cemy pr\u00f3bowa\u0107 mno\u017cy\u0107 wektory z modelu Word2vec z idf. Najpierw zdefiniujmy mno\u017cenie przez skalar."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[2.5,0.0,5.0]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"(***) :: Double -> [Double] -> [Double]\n",
"(***) s = map (*s)\n",
"\n",
"2.5 *** [1.0, 0.0, 2.0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Teraz b\u0119dziemy przemna\u017cali wektory Word2vec przez idf (jako skalar)."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<style>/* Styles used for the Hoogle display in the pager */\n",
".hoogle-doc {\n",
"display: block;\n",
"padding-bottom: 1.3em;\n",
"padding-left: 0.4em;\n",
"}\n",
".hoogle-code {\n",
"display: block;\n",
"font-family: monospace;\n",
"white-space: pre;\n",
"}\n",
".hoogle-text {\n",
"display: block;\n",
"}\n",
".hoogle-name {\n",
"color: green;\n",
"font-weight: bold;\n",
"}\n",
".hoogle-head {\n",
"font-weight: bold;\n",
"}\n",
".hoogle-sub {\n",
"display: block;\n",
"margin-left: 0.4em;\n",
"}\n",
".hoogle-package {\n",
"font-weight: bold;\n",
"font-style: italic;\n",
"}\n",
".hoogle-module {\n",
"font-weight: bold;\n",
"}\n",
".hoogle-class {\n",
"font-weight: bold;\n",
"}\n",
".get-type {\n",
"color: green;\n",
"font-weight: bold;\n",
"font-family: monospace;\n",
"display: block;\n",
"white-space: pre-wrap;\n",
"}\n",
".show-type {\n",
"color: green;\n",
"font-weight: bold;\n",
"font-family: monospace;\n",
"margin-left: 1em;\n",
"}\n",
".mono {\n",
"font-family: monospace;\n",
"display: block;\n",
"}\n",
".err-msg {\n",
"color: red;\n",
"font-style: italic;\n",
"font-family: monospace;\n",
"white-space: pre;\n",
"display: block;\n",
"}\n",
"#unshowable {\n",
"color: red;\n",
"font-weight: bold;\n",
"}\n",
".err-msg.in.collapse {\n",
"padding-top: 0.7em;\n",
"}\n",
".highlight-code {\n",
"white-space: pre;\n",
"font-family: monospace;\n",
"}\n",
".suggestion-warning { \n",
"font-weight: bold;\n",
"color: rgb(200, 130, 0);\n",
"}\n",
".suggestion-error { \n",
"font-weight: bold;\n",
"color: red;\n",
"}\n",
".suggestion-name {\n",
"font-weight: bold;\n",
"}\n",
"</style><div class=\"suggestion-name\" style=\"clear:both;\">Fuse foldr/map</div><div class=\"suggestion-row\" style=\"float: left;\"><div class=\"suggestion-warning\">Found:</div><div class=\"highlight-code\" id=\"haskell\">Prelude.foldr (+++) (zero 100)\n",
" $ map (\\ (t, Just v) -> idf coll t *** toOurVector v)\n",
" $ Prelude.filter (\\ (_, v) -> isJust v)\n",
" $ map (\\ t -> (t, getVector model t)) d</div></div><div class=\"suggestion-row\" style=\"float: left;\"><div class=\"suggestion-warning\">Why Not:</div><div class=\"highlight-code\" id=\"haskell\">foldr\n",
" ((+++) . (\\ (t, Just v) -> idf coll t *** toOurVector v))\n",
" (zero 100)\n",
" (Prelude.filter (\\ (_, v) -> isJust v)\n",
" $ map (\\ t -> (t, getVector model t)) d)</div></div>"
],
"text/plain": [
"Line 4: Fuse foldr/map\n",
"Found:\n",
"Prelude.foldr (+++) (zero 100)\n",
" $ map (\\ (t, Just v) -> idf coll t *** toOurVector v)\n",
" $ Prelude.filter (\\ (_, v) -> isJust v)\n",
" $ map (\\ t -> (t, getVector model t)) d\n",
"Why not:\n",
"foldr\n",
" ((+++) . (\\ (t, Just v) -> idf coll t *** toOurVector v))\n",
" (zero 100)\n",
" (Prelude.filter (\\ (_, v) -> isJust v)\n",
" $ map (\\ t -> (t, getVector model t)) d)"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import Data.Maybe (isJust)\n",
"\n",
"vectorizeWord2vecIdf model coll d = \n",
" Prelude.foldr (+++) (zero 100) \n",
" $ map (\\(t, Just v) -> idf coll t *** toOurVector v) \n",
" $ Prelude.filter (\\(_, v) -> isJust v)\n",
" $ map (\\t -> (t, getVector model t)) d\n",
"\n",
"collectionLVectorized''' = map (vectorizeWord2vecIdf model collectionLNormalized') collectionLNormalized'"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[-35.63830397762308,32.606312678971506,102.20663646169147,56.00417395285867,-130.56709475346878,-14.916644370325773,55.15817632053957,83.2241937686228,26.432875116296394,48.94350344147367,11.370669191277202,-59.54579267200742,-116.01687192456801,60.53824040579282,39.84659684249884,-34.37377085402866,104.53525319069323,45.53363024094972,-34.25020197907558,-43.9007702604392,35.36538495508536,-59.81737728971619,-1.5823889595648828,-50.211106838043655,14.83789867297237,-109.45917608219175,86.56767915592452,-32.170794763065615,29.559930839016644,-126.81686726526162,-9.918908360030228,47.14965938694648,5.955083439147183,41.24417782948478,3.592410260515919,72.10649687523313,61.374776273461855,60.28687760276824,-28.886499026001676,-8.710633131022206,-68.73464623080284,-37.95272838994007,-26.390548039392165,-14.241950251566944,74.6286124718925,46.21889022510431,72.23999508751568,-19.597547074284556,-20.160749174807382,99.49036127458763,131.98057386978817,-23.842794956628147,-62.381675411749846,-19.366936151725387,1.4839595614144327,60.40520721416763,-7.70311857607342,-31.75784386529525,48.71818084466781,-202.41827342135582,138.5639100010709,12.447619757719652,-39.38375639132277,27.877688543771935,-87.00559882214534,56.45689362090545,37.89098984507379,103.78465196444151,-166.10094891357176,-50.83382060940457,11.574060187412977,74.00519869734406,-97.00170731343235,32.18159534728971,-11.280059681646494,-40.701643971890256,74.64230137346699,0.7613112917269982,-6.103424218278271,-150.47551072570587,-21.714627635239918,91.26690441786137,62.91576955719526,-92.35700140312395,-25.421583980267307,-67.87480813505826,-120.16245846953592,-68.89155479679258,-122.00206448376261,35.263603445401785,6.416282520155956,203.41225708856086,-62.42983953251155,59.36113672119048,40.00275897200196,-62.55633545667429,89.66866371308245,-42.287712072353834,-72.59490110281287,52.23637641217955]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"collectionLVectorized''' !! 3"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" na_ak w_lud ba_hy w_lap ne_dz be_wy zw_oz mo_zu be_wy ba_hy mo_zu be_wy w_lud\n",
"na_ak 1.00 0.83 0.78 0.63 0.78 0.81 0.83 0.76 0.77 0.80 0.77 0.79 0.79\n",
"w_lud 0.83 1.00 0.82 0.60 0.84 0.84 0.84 0.85 0.86 0.74 0.86 0.83 0.90\n",
"ba_hy 0.78 0.82 1.00 0.57 0.78 0.84 0.77 0.79 0.90 0.75 0.74 0.89 0.85\n",
"w_lap 0.63 0.60 0.57 1.00 0.38 0.60 0.50 0.43 0.52 0.45 0.55 0.65 0.47\n",
"ne_dz 0.78 0.84 0.78 0.38 1.00 0.81 0.79 0.90 0.89 0.77 0.81 0.81 0.90\n",
"be_wy 0.81 0.84 0.84 0.60 0.81 1.00 0.82 0.76 0.83 0.74 0.81 0.92 0.88\n",
"zw_oz 0.83 0.84 0.77 0.50 0.79 0.82 1.00 0.77 0.77 0.74 0.82 0.75 0.83\n",
"mo_zu 0.76 0.85 0.79 0.43 0.90 0.76 0.77 1.00 0.93 0.74 0.87 0.80 0.90\n",
"be_wy 0.77 0.86 0.90 0.52 0.89 0.83 0.77 0.93 1.00 0.72 0.81 0.89 0.92\n",
"ba_hy 0.80 0.74 0.75 0.45 0.77 0.74 0.74 0.74 0.72 1.00 0.66 0.73 0.72\n",
"mo_zu 0.77 0.86 0.74 0.55 0.81 0.81 0.82 0.87 0.81 0.66 1.00 0.80 0.88\n",
"be_wy 0.79 0.83 0.89 0.65 0.81 0.92 0.75 0.80 0.89 0.73 0.80 1.00 0.87\n",
"w_lud 0.79 0.90 0.85 0.47 0.90 0.88 0.83 0.90 0.92 0.72 0.88 0.87 1.00"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"limitedL''' = Data.List.take limit collectionLVectorized'''\n",
"\n",
"paintMatrix cosineSim labelsLimited limitedL'''"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Haskell",
"language": "haskell",
"name": "haskell"
},
"language_info": {
"codemirror_mode": "ihaskell",
"file_extension": ".hs",
"mimetype": "text/x-haskell",
"name": "haskell",
"pygments_lexer": "Haskell",
"version": "8.10.4"
},
"author": "Filip Grali\u0144ski",
"email": "filipg@amu.edu.pl",
"lang": "pl",
"subtitle": "5.G\u0119ste reprezentacje wektorowe[wyk\u0142ad]",
"title": "Ekstrakcja informacji",
"year": "2021"
2021-04-07 20:37:40 +02:00
},
2021-09-27 07:57:37 +02:00
"nbformat": 4,
"nbformat_minor": 4
}