aitech-eks-pub/wyk/05_Geste_wektory.ipynb

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Zagęszczamy wektory\n",
"\n",
"Podstawowy problem z wektorową reprezentacją typu tf-idf polega na tym, że wektory dokumentów (i macierz całej kolekcji dokumentów) są _rzadkie_, tzn. zawierają dużo zer. W praktyce potrzebujemy bardziej \"gęstej\" czy \"kompaktowej\" reprezentacji numerycznej dokumentów. \n",
"\n",
"## _Hashing trick_\n",
"\n",
"Powierzchownie problem możemy rozwiązać przez użycie tzw. _sztuczki z haszowaniem_ (_hashing trick_). Będziemy potrzebować funkcji mieszającej (haszującej) $H$, która rzutuje na napisy na liczby, których reprezentacja binarna składa się z $b$ bitów:\n",
"\n",
"$$H : V \\rightarrow \\{0,\\dots,2^b-1\\}$$\n",
"\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Jako funkcji $H$ możemy np. użyć funkcji MurmurHash3."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Hash64 0x6c3a641663470e2c"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"Hash64 0xa714568917576314"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"Hash64 0x875d9e7e413747c8"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"Hash64 0x13ce831936ebc69e"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import Data.Digest.Murmur64\n",
"\n",
"hash64 \"komputer\"\n",
"hash64 \"komputerze\"\n",
"hash64 \"komputerek\"\n",
"hash64 \"abrakadabra\"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Pytanie:** podobne napisy mają zupełnie różne wartości funkcji haszującej, czy to dobrze, czy to źle?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Musimy tylko sparametryzować naszą funkcję rozmiarem \"odcisku\" (parametr $b$)."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"3628"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"25364"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"2877"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"50846"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"12"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"{-# LANGUAGE OverloadedStrings #-}\n",
"\n",
"import Data.Text\n",
"\n",
"-- pomocnicza funkcja, która konwertuje wartość specjalnego\n",
"-- typu Hash64 do zwykłej liczby całkowitej\n",
"hashValueAsInteger :: Hash64 -> Integer\n",
"hashValueAsInteger = toInteger . asWord64\n",
"\n",
"-- unpack to funkcja, która wartość typu String konwertuje do Text\n",
"hash :: Integer -> Text -> Integer\n",
"hash b t = hashValueAsInteger (hash64 $ unpack t) `mod` (2 ^ b)\n",
"\n",
"hash 16 \"komputer\"\n",
"hash 16 \"komputerze\"\n",
"hash 16 \"komputerem\"\n",
"hash 16 \"abrakadabra\"\n",
"hash 4 \"komputer\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Pytanie:** Jakie wartości $b$ będą bezsensowne?"
]
},
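{
"cell_type": "markdown",
"metadata": {},
"source": [
"For instance, with a very small fingerprint such as $b = 2$ there are only $2^2 = 4$ buckets available, so distinct words are bound to collide almost immediately. A quick sketch using the `hash` function defined above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"{-# LANGUAGE OverloadedStrings #-}\n",
"\n",
"-- with only 4 buckets, these four words can occupy at most 4 distinct\n",
"-- slots; as the vocabulary grows, collisions become unavoidable\n",
"map (hash 2) [\"komputer\", \"komputerze\", \"komputerem\", \"abrakadabra\"]"
]
},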
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Sztuczka z haszowaniem polega na tym, że zamiast numerować słowa korzystając ze słownika, po prostu używamy funkcji haszującej. W ten sposób wektor będzie _zawsze_ rozmiar $2^b$ - bez względu na rozmiar słownika."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Zacznijmy od przywołania wszystkich potrzebnych definicji."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"{-# LANGUAGE OverloadedStrings #-}\n",
"{-# LANGUAGE QuasiQuotes #-}\n",
"\n",
"import Data.Text hiding(map, filter, zip)\n",
"import Text.Regex.PCRE.Heavy\n",
"\n",
"isStopWord :: Text -> Bool\n",
"isStopWord \"w\" = True\n",
"isStopWord \"jest\" = True\n",
"isStopWord \"że\" = True\n",
"isStopWord w = w ≈ [re|^\\p{P}+$|]\n",
"\n",
"\n",
"removeStopWords :: [Text] -> [Text]\n",
"removeStopWords = filter (not . isStopWord)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"{-# LANGUAGE OverloadedStrings #-}\n",
"{-# LANGUAGE QuasiQuotes #-}\n",
"{-# LANGUAGE FlexibleContexts #-}\n",
"\n",
"import Data.Text hiding(map, filter, zip)\n",
"import Prelude hiding(words, take)\n",
"import Text.Regex.PCRE.Heavy\n",
"import Data.Map as Map hiding(take, map, filter)\n",
"import Data.Set as Set hiding(map)\n",
"\n",
"tokenize :: Text -> [Text]\n",
"tokenize = map fst . scan [re|C\\+\\+|[\\p{L}0-9]+|\\p{P}|]\n",
"\n",
"\n",
"mockInflectionDictionary :: Map Text Text\n",
"mockInflectionDictionary = Map.fromList [\n",
" (\"kota\", \"kot\"),\n",
" (\"butach\", \"but\"),\n",
" (\"masz\", \"mieć\"),\n",
" (\"ma\", \"mieć\"),\n",
" (\"buta\", \"but\"),\n",
" (\"zgubiłem\", \"zgubić\")]\n",
"\n",
"lemmatizeWord :: Map Text Text -> Text -> Text\n",
"lemmatizeWord dict w = findWithDefault w w dict\n",
"\n",
"lemmatize :: Map Text Text -> [Text] -> [Text]\n",
"lemmatize dict = map (lemmatizeWord dict)\n",
"\n",
"\n",
"poorMansStemming = Data.Text.take 6\n",
"\n",
"normalize :: Text -> [Text]\n",
"normalize = map poorMansStemming . removeStopWords . map toLower . lemmatize mockInflectionDictionary . tokenize\n",
"\n",
"getVocabulary :: [Text] -> Set Text \n",
"getVocabulary = Set.unions . map (Set.fromList . normalize) \n",
" \n",
"idf :: [[Text]] -> Text -> Double\n",
"idf coll t = log (fromIntegral n / fromIntegral df)\n",
" where df = Prelude.length $ Prelude.filter (\\d -> t `elem` d) coll\n",
" n = Prelude.length coll\n",
" \n",
"vectorizeTfIdf :: Int -> [[Text]] -> Map Int Text -> [Text] -> [Double]\n",
"vectorizeTfIdf vecSize coll v doc = map (\\i -> count (v ! i) doc * idf coll (v ! i)) [0..(vecSize-1)]\n",
" where count t doc = fromIntegral $ (Prelude.length . Prelude.filter (== t)) doc "
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"import System.IO\n",
"import Data.List.Split as SP\n",
"\n",
"legendsh <- openFile \"legendy.txt\" ReadMode\n",
"hSetEncoding legendsh utf8\n",
"contents <- hGetContents legendsh\n",
"ls = Prelude.lines contents\n",
"items = map (map pack . SP.splitOn \"\\t\") ls\n",
"\n",
"labelsL = map Prelude.head items\n",
"collectionL = map (!!1) items\n",
"\n",
"collectionLNormalized = map normalize collectionL\n",
"voc' = getVocabulary collectionL\n",
"\n",
"vocLSize = Prelude.length voc'\n",
"\n",
"vocL :: Map Int Text\n",
"vocL = Map.fromList $ zip [0..] $ Set.toList voc'\n",
"\n",
"invvocL :: Map Text Int\n",
"invvocL = Map.fromList $ zip (Set.toList voc') [0..]\n",
"\n",
"lVectorized = map (vectorizeTfIdf vocLSize collectionLNormalized vocL) collectionLNormalized\n"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<style>/* Styles used for the Hoogle display in the pager */\n",
".hoogle-doc {\n",
"display: block;\n",
"padding-bottom: 1.3em;\n",
"padding-left: 0.4em;\n",
"}\n",
".hoogle-code {\n",
"display: block;\n",
"font-family: monospace;\n",
"white-space: pre;\n",
"}\n",
".hoogle-text {\n",
"display: block;\n",
"}\n",
".hoogle-name {\n",
"color: green;\n",
"font-weight: bold;\n",
"}\n",
".hoogle-head {\n",
"font-weight: bold;\n",
"}\n",
".hoogle-sub {\n",
"display: block;\n",
"margin-left: 0.4em;\n",
"}\n",
".hoogle-package {\n",
"font-weight: bold;\n",
"font-style: italic;\n",
"}\n",
".hoogle-module {\n",
"font-weight: bold;\n",
"}\n",
".hoogle-class {\n",
"font-weight: bold;\n",
"}\n",
".get-type {\n",
"color: green;\n",
"font-weight: bold;\n",
"font-family: monospace;\n",
"display: block;\n",
"white-space: pre-wrap;\n",
"}\n",
".show-type {\n",
"color: green;\n",
"font-weight: bold;\n",
"font-family: monospace;\n",
"margin-left: 1em;\n",
"}\n",
".mono {\n",
"font-family: monospace;\n",
"display: block;\n",
"}\n",
".err-msg {\n",
"color: red;\n",
"font-style: italic;\n",
"font-family: monospace;\n",
"white-space: pre;\n",
"display: block;\n",
"}\n",
"#unshowable {\n",
"color: red;\n",
"font-weight: bold;\n",
"}\n",
".err-msg.in.collapse {\n",
"padding-top: 0.7em;\n",
"}\n",
".highlight-code {\n",
"white-space: pre;\n",
"font-family: monospace;\n",
"}\n",
".suggestion-warning { \n",
"font-weight: bold;\n",
"color: rgb(200, 130, 0);\n",
"}\n",
".suggestion-error { \n",
"font-weight: bold;\n",
"color: red;\n",
"}\n",
".suggestion-name {\n",
"font-weight: bold;\n",
"}\n",
"</style><div class=\"suggestion-name\" style=\"clear:both;\">Eta reduce</div><div class=\"suggestion-row\" style=\"float: left;\"><div class=\"suggestion-warning\">Found:</div><div class=\"highlight-code\" id=\"haskell\">formatNumber x = printf \"% 7.2f\" x</div></div><div class=\"suggestion-row\" style=\"float: left;\"><div class=\"suggestion-warning\">Why Not:</div><div class=\"highlight-code\" id=\"haskell\">formatNumber = printf \"% 7.2f\"</div></div><div class=\"suggestion-name\" style=\"clear:both;\">Use zipWith</div><div class=\"suggestion-row\" style=\"float: left;\"><div class=\"suggestion-warning\">Found:</div><div class=\"highlight-code\" id=\"haskell\">map (\\ (lab, ix) -> lab <> \" \" <> similarTo simFun vs ix)\n",
" $ zip labels [0 .. (Prelude.length vs - 1)]</div></div><div class=\"suggestion-row\" style=\"float: left;\"><div class=\"suggestion-warning\">Why Not:</div><div class=\"highlight-code\" id=\"haskell\">zipWith\n",
" (curry (\\ (lab, ix) -> lab <> \" \" <> similarTo simFun vs ix))\n",
" labels [0 .. (Prelude.length vs - 1)]</div></div><div class=\"suggestion-name\" style=\"clear:both;\">Avoid lambda</div><div class=\"suggestion-row\" style=\"float: left;\"><div class=\"suggestion-warning\">Found:</div><div class=\"highlight-code\" id=\"haskell\">\\ l -> pack $ printf \"% 7s\" l</div></div><div class=\"suggestion-row\" style=\"float: left;\"><div class=\"suggestion-warning\">Why Not:</div><div class=\"highlight-code\" id=\"haskell\">pack . printf \"% 7s\"</div></div>"
],
"text/plain": [
"Line 5: Eta reduce\n",
"Found:\n",
"formatNumber x = printf \"% 7.2f\" x\n",
"Why not:\n",
"formatNumber = printf \"% 7.2f\"Line 11: Use zipWith\n",
"Found:\n",
"map (\\ (lab, ix) -> lab <> \" \" <> similarTo simFun vs ix)\n",
" $ zip labels [0 .. (Prelude.length vs - 1)]\n",
"Why not:\n",
"zipWith\n",
" (curry (\\ (lab, ix) -> lab <> \" \" <> similarTo simFun vs ix))\n",
" labels [0 .. (Prelude.length vs - 1)]Line 12: Avoid lambda\n",
"Found:\n",
"\\ l -> pack $ printf \"% 7s\" l\n",
"Why not:\n",
"pack . printf \"% 7s\""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import Text.Printf\n",
"import Data.List (take)\n",
"\n",
"formatNumber :: Double -> String\n",
"formatNumber x = printf \"% 7.2f\" x\n",
"\n",
"similarTo :: ([Double] -> [Double] -> Double) -> [[Double]] -> Int -> Text\n",
"similarTo simFun vs ix = pack $ Prelude.unwords $ map (formatNumber . ((vs !! ix) `simFun`)) vs\n",
"\n",
"paintMatrix :: ([Double] -> [Double] -> Double) -> [Text] -> [[Double]] -> Text\n",
"paintMatrix simFun labels vs = header <> \"\\n\" <> Data.Text.unlines (map (\\(lab, ix) -> lab <> \" \" <> similarTo simFun vs ix) $ zip labels [0..(Prelude.length vs - 1)])\n",
" where header = \" \" <> Data.Text.unwords (map (\\l -> pack $ printf \"% 7s\" l) labels)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" na_ak w_lud ba_hy w_lap ne_dz be_wy zw_oz mo_zu be_wy ba_hy mo_zu be_wy w_lud\n",
"na_ak 1.00 0.02 0.01 0.01 0.03 0.02 0.02 0.04 0.03 0.02 0.01 0.02 0.03\n",
"w_lud 0.02 1.00 0.02 0.05 0.04 0.01 0.03 0.04 0.06 0.01 0.02 0.03 0.06\n",
"ba_hy 0.01 0.02 1.00 0.01 0.02 0.03 0.03 0.04 0.08 0.22 0.01 0.04 0.01\n",
"w_lap 0.01 0.05 0.01 1.00 0.01 0.01 0.00 0.01 0.02 0.00 0.00 0.00 0.00\n",
"ne_dz 0.03 0.04 0.02 0.01 1.00 0.04 0.03 0.07 0.08 0.06 0.03 0.03 0.05\n",
"be_wy 0.02 0.01 0.03 0.01 0.04 1.00 0.01 0.03 0.21 0.01 0.02 0.25 0.01\n",
"zw_oz 0.02 0.03 0.03 0.00 0.03 0.01 1.00 0.04 0.03 0.00 0.01 0.02 0.02\n",
"mo_zu 0.04 0.04 0.04 0.01 0.07 0.03 0.04 1.00 0.10 0.02 0.09 0.05 0.04\n",
"be_wy 0.03 0.06 0.08 0.02 0.08 0.21 0.03 0.10 1.00 0.05 0.03 0.24 0.04\n",
"ba_hy 0.02 0.01 0.22 0.00 0.06 0.01 0.00 0.02 0.05 1.00 0.01 0.02 0.00\n",
"mo_zu 0.01 0.02 0.01 0.00 0.03 0.02 0.01 0.09 0.03 0.01 1.00 0.01 0.02\n",
"be_wy 0.02 0.03 0.04 0.00 0.03 0.25 0.02 0.05 0.24 0.02 0.01 1.00 0.02\n",
"w_lud 0.03 0.06 0.01 0.00 0.05 0.01 0.02 0.04 0.04 0.00 0.02 0.02 1.00"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"limit = 13\n",
"labelsLimited = Data.List.take limit labelsL\n",
"limitedL = Data.List.take limit lVectorized\n",
"\n",
"vectorNorm :: [Double] -> Double\n",
"vectorNorm vs = sqrt $ sum $ map (\\x -> x * x) vs\n",
"\n",
"toUnitVector :: [Double] -> [Double]\n",
"toUnitVector vs = map (/ n) vs\n",
" where n = vectorNorm vs\n",
"\n",
"\n",
"(✕) :: [Double] -> [Double] -> Double\n",
"(✕) v1 v2 = sum $ Prelude.zipWith (*) v1 v2\n",
"\n",
"cosineSim v1 v2 = toUnitVector v1 ✕ toUnitVector v2\n",
"\n",
"paintMatrix cosineSim labelsLimited limitedL"
]
},
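{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check of `cosineSim` itself: vectors pointing in the same direction should give 1.0 (up to floating-point rounding), and orthogonal vectors should give 0.0."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cosineSim [1.0, 1.0] [2.0, 2.0]\n",
"cosineSim [1.0, 0.0] [0.0, 1.0]"
]
},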
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Powyższa macierz reprezentuje porównanie przy użyciu podobieństwa kosinusowego. Spróbujmy teraz użyć gęstszych wektorów przy użyciu hashing trick. Jako wartość $b$ przyjmijmy 6.\n",
"\n",
"Zobaczmy najpierw, w które \"przegródki\" będą wpadały poszczególne wyrazy słownika.\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[(\"0\",32),(\"00\",4),(\"01\",4),(\"07\",40),(\"09\",44),(\"1\",1),(\"10\",61),(\"100\",27),(\"12\",58),(\"13\",51),(\"131\",37),(\"15\",30),(\"16\",21),(\"17\",58),(\"18\",55),(\"19\",35),(\"1997r\",61),(\"2\",62),(\"20\",28),(\"2006\",44),(\"2008\",19),(\"2009\",4),(\"2010\",3),(\"22\",27),(\"23\",34),(\"24\",7),(\"25\",29),(\"26\",35),(\"27\",44),(\"28\",61),(\"29\",30),(\"3\",56),(\"30\",55),(\"300\",38),(\"31\",45),(\"4\",53),(\"40\",39),(\"42\",43),(\"48\",53),(\"49\",13),(\"5\",31),(\"50\",32),(\"56\",38),(\"57\",55),(\"6\",59),(\"7\",27),(\"8\",34),(\"a\",27),(\"aaa\",33),(\"absolu\",11),(\"absurd\",18),(\"aby\",12),(\"adnym\",10),(\"adres\",15),(\"adrese\",62),(\"afroam\",3),(\"afryce\",46),(\"agresy\",57),(\"ah\",37),(\"aha\",42),(\"aig\",56),(\"akadem\",18),(\"akcja\",0),(\"akcje\",21),(\"akompa\",13),(\"aktor\",26),(\"akurat\",7),(\"albino\",27),(\"albo\",44),(\"ale\",7),(\"alfa\",58),(\"alkoho\",56),(\"altern\",38),(\"ameryk\",11),(\"amp\",62),(\"anakon\",34),(\"analiz\",62),(\"andrze\",63),(\"anegdo\",43),(\"ang\",37),(\"anga\\380o\",27),(\"anglii\",33),(\"ani\",22),(\"anonsu\",36),(\"antono\",3),(\"antykr\",41),(\"apetyt\",16),(\"apolit\",39),(\"apropo\",54),(\"apteki\",20),(\"aqua\",59),(\"archit\",61),(\"aromat\",44),(\"artyku\",31),(\"asami\",22),(\"astron\",59),(\"asy\\347ci\",60),(\"atmosf\",37),(\"audycj\",50),(\"auta\",38)]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"map (\\t -> (t, hash 6 t)) $ Data.List.take 100 $ Set.toList voc'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Pytanie:** Czy jakieś dwa termy wpadły do jednej przegródki?"
]
},
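{
"cell_type": "markdown",
"metadata": {},
"source": [
"Instead of eyeballing the list, we can check programmatically. A small sketch that groups the same 100 terms by bucket number and keeps only the buckets holding more than one term:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"-- group terms by their bucket number (fromListWith collects the terms\n",
"-- sharing a bucket) and keep only buckets with more than one term\n",
"Prelude.filter ((> 1) . Prelude.length . snd)\n",
"  $ Map.toList\n",
"  $ Map.fromListWith (++)\n",
"  $ map (\\t -> (hash 6 t, [t]))\n",
"  $ Data.List.take 100\n",
"  $ Set.toList voc'"
]
},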
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Stwórzmy najpierw funkcję, która będzie wektoryzowała pojedynczy term $t$. Po prostu stworzymy wektor, które będzie miał rozmiar $2^b$, wszędzie będzie miał 0 z wyjątkiem pozycji o numerze $H_b(t)$ - tam wpiszmy odwrotną częstość dokumentową."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.7727609380946383,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"wordVector :: Integer -> [[Text]] -> Text -> [Double]\n",
"wordVector b coll term = map selector [0..vecSize]\n",
" where vecSize = 2^b - 1\n",
" wordFingerprint = hash b term\n",
" selector i \n",
" | i == wordFingerprint = idf coll term\n",
" | otherwise = 0.0\n",
"\n",
"wordVector 6 collectionLNormalized \"ameryk\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Teraz wystarczy zsumować wektory dla poszczególnych słów, żeby otrzymać wektor dokumentu. Najpierw zdefiniujmy sobie sumę wektorową."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[1.2,4.0]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"(+++) :: [Double] -> [Double] -> [Double]\n",
"(+++) = Prelude.zipWith (+)\n",
"\n",
"[0.2, 0.5] +++ [1.0, 3.5]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Przydatna będzie jeszcze funkcja, która tworzy wektor z samymi zerami o zadanej długości:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"zero :: Int -> [Double]\n",
"zero s = Prelude.replicate s 0.0\n",
"\n",
"zero (2^6)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[5.242936783195232,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.856470206220483,0.0,0.0,1.1700712526502546,0.5947071077466928,0.0,5.712940412440966,3.0708470981669183,0.0,0.0,4.465908118654584,0.0,3.7727609380946383,0.0,0.0,0.0,0.0,4.788681510917635,0.0,3.7727609380946383,0.0,1.575536360758419,0.0,3.079613757534693,0.0,4.465908118654584,0.0,4.588010815455483,4.465908118654584,0.0,1.5214691394881432,0.0,0.0,0.0,0.0,4.465908118654584,2.5199979695992702,0.0,1.5214691394881432,8.388148398070203e-2,0.0,4.465908118654584,0.0,0.0,3.367295829986474,0.0,3.7727609380946383,0.0,1.5214691394881432,0.0,3.7727609380946383,0.0,0.0,0.0,3.367295829986474,0.0]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"\n",
"vectorizeWithHashingTrick :: Integer -> [[Text]] -> [Text] -> [Double]\n",
"vectorizeWithHashingTrick b coll = Prelude.foldr ((+++) . wordVector b coll) (zero $ 2^b)\n",
"\n",
"vectorizeWithHashingTrick 6 collectionLNormalized $ collectionLNormalized !! 3\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Zobaczmy, jak zagęszczenie wpływa na macierz podobieństwa."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" na_ak w_lud ba_hy w_lap ne_dz be_wy zw_oz mo_zu be_wy ba_hy mo_zu be_wy w_lud\n",
"na_ak 1.00 0.66 0.40 0.58 0.65 0.52 0.66 0.79 0.76 0.41 0.59 0.44 0.72\n",
"w_lud 0.66 1.00 0.51 0.53 0.66 0.43 0.57 0.76 0.68 0.38 0.47 0.42 0.62\n",
"ba_hy 0.40 0.51 1.00 0.42 0.55 0.29 0.41 0.54 0.58 0.54 0.47 0.24 0.50\n",
"w_lap 0.58 0.53 0.42 1.00 0.41 0.35 0.54 0.59 0.53 0.19 0.47 0.34 0.53\n",
"ne_dz 0.65 0.66 0.55 0.41 1.00 0.56 0.56 0.79 0.74 0.55 0.68 0.57 0.69\n",
"be_wy 0.52 0.43 0.29 0.35 0.56 1.00 0.51 0.54 0.64 0.28 0.59 0.61 0.49\n",
"zw_oz 0.66 0.57 0.41 0.54 0.56 0.51 1.00 0.72 0.61 0.29 0.55 0.48 0.63\n",
"mo_zu 0.79 0.76 0.54 0.59 0.79 0.54 0.72 1.00 0.79 0.49 0.73 0.58 0.79\n",
"be_wy 0.76 0.68 0.58 0.53 0.74 0.64 0.61 0.79 1.00 0.49 0.72 0.61 0.74\n",
"ba_hy 0.41 0.38 0.54 0.19 0.55 0.28 0.29 0.49 0.49 1.00 0.37 0.32 0.48\n",
"mo_zu 0.59 0.47 0.47 0.47 0.68 0.59 0.55 0.73 0.72 0.37 1.00 0.53 0.71\n",
"be_wy 0.44 0.42 0.24 0.34 0.57 0.61 0.48 0.58 0.61 0.32 0.53 1.00 0.54\n",
"w_lud 0.72 0.62 0.50 0.53 0.69 0.49 0.63 0.79 0.74 0.48 0.71 0.54 1.00"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"lVectorized' = map (vectorizeWithHashingTrick 6 collectionLNormalized) collectionLNormalized\n",
"limitedL' = Data.List.take limit lVectorized'\n",
"\n",
"paintMatrix cosineSim labelsLimited limitedL'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Pytanie:** Co się stanie, gdy zwiększymy $b$, a co jeśli zmniejszymi?\n",
"\n",
"Zalety sztuczki z haszowaniem:\n",
"\n",
"* zagwarantowany stały rozmiar wektora\n",
"* szybsze obliczenia\n",
"* w naturalny sposób uwzględniamy termy, których nie było w początkowej kolekcji (ale uwaga na idf!)\n",
"* nie musimy pamiętać odzworowania rzutującego słowa na ich numery\n",
"\n",
"Wady:\n",
"\n",
"* dwa różne słowa mogą wpaść do jednej przegródki (szczególnie częste, jeśli $b$ jest za małe)\n",
"* jeśli $b$ ustawimy za duże, wektory mogą być nawet większe niż w przypadku standardowego podejścia\n",
"\n",
"\n",
"\n",
"\n"
]
},
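{
"cell_type": "markdown",
"metadata": {},
"source": [
"To experiment with the question above, it is enough to rerun the pipeline with a different fingerprint size. A sketch for $b = 4$ (only $2^4 = 16$ buckets, so we would expect the extra collisions to inflate the similarities):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"lVectorized4 = map (vectorizeWithHashingTrick 4 collectionLNormalized) collectionLNormalized\n",
"limitedL4 = Data.List.take limit lVectorized4\n",
"\n",
"paintMatrix cosineSim labelsLimited limitedL4"
]
},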
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Word2vec\n",
"\n",
"A może istnieje dobra wróżka, która dałaby nam dobre wektory słów (z których będziemy składali proste wektory dokumentów przez sumowanie)?\n",
"\n",
"**Pytanie:** Jakie własności powinny mieć dobre wektory słów?\n",
"\n",
"Tak! Istnieją gotowe \"bazy danych\" wektorów. Jedną z najpopularniejszych (i najstarszych) metod uzyskiwania takich wektorów jest Word2vec. Jak dokładnie Word2vec, dowiemy się później, na dzisiaj po prostu użyjmy tych wektorów.\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Najpierw wprowadźmy alternatywną normalizację zgodną z tym, jak został wygenerowany model."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"ala"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ma"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"kota"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"normalize' :: Text -> [Text]\n",
"normalize' = removeStopWords . map toLower . tokenize\n",
"\n",
"normalize' \"Ala ma kota.\""
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"mam"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"kumpla"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ktory"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"zdawal"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"walentynki"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"i"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"polozyl"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"koperte"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"dla"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"laski"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"z"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"kartka"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"na"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"desce"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"rozdzielczej"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"egzaminator"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"wziol"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ta"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"karteke"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"i"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"powiedzial"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ze"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"ma"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"znade"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"wypisal"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"mu"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"papierek"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"i"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"po"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"egzaminie"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"hehe"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"filmik"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"dobry"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"collectionLNormalized' = map normalize' collectionL\n",
"collectionLNormalized' !! 3"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[-2.305081844329834,0.3418600857257843,4.44999361038208,0.9008448719978333,-2.1629886627197266,1.0206516981124878,4.157524108886719,2.5060904026031494,-0.17275184392929077,4.085052967071533,2.236677408218384,-2.3315281867980957,0.5224806070327759,0.15804219245910645,-1.5636622905731201,-1.2624900341033936,-0.3161393105983734,-1.971177101135254,1.4859644174575806,-0.1742715835571289,1.209444284439087,4.063786193728447e-2,-0.2808700501918793,-0.5895432233810425,-4.126195430755615,-2.690922260284424,1.4975452423095703,-0.25380706787109375,-4.5767364501953125,-1.7726246118545532,2.938936710357666,-0.7173141837120056,-2.4317402839660645,-4.206724643707275,0.6768773198127747,2.236821413040161,4.1044291108846664e-2,1.6991114616394043,1.2354476377367973e-2,-3.079916000366211,-1.7430219650268555,1.8969229459762573,-0.4897139072418213,1.1981141567230225,2.431124687194824,0.39453181624412537,1.9735784530639648,2.124225378036499,-4.338796138763428,-0.954145610332489,3.3927927017211914,0.8821511268615723,5.120451096445322e-3,2.917816638946533,-2.035374164581299,3.3221969604492188,-4.981880187988281,-1.105080008506775,-4.093905448913574,-1.5998111963272095,0.6372298002243042,-0.7565107345581055,0.4038744270801544,0.685226321220398,2.137610912322998,-0.4390018582344055,1.007287859916687,0.19681350886821747,-2.598611354827881,-1.8872140645980835,1.6989527940750122,1.6458508968353271,-5.091184616088867,1.4902764558792114,-0.4839307367801666,-2.840092420578003,1.0180696249008179,0.7615311741828918,1.8135554790496826,-0.30493396520614624,3.5879104137420654,1.4585649967193604,3.2775094509124756,-1.1610190868377686,-2.3159284591674805,4.1530327796936035,-4.67172384262085,-0.8594478964805603,-0.860812783241272,-0.31788957118988037,0.7260096669197083,0.1879102736711502,-0.15789580345153809,1.9434200525283813,-1.9945732355117798,1.8799400329589844,-0.5253798365592957,-0.2834266722202301,-0.8012301921844482,1.5093021392822266]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"100"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"{-# LANGUAGE OverloadedStrings #-}\n",
"{-# LANGUAGE BangPatterns #-}\n",
"\n",
"import Data.Word2Vec.Model\n",
"import Data.Maybe (catMaybes, fromJust)\n",
"import qualified Data.Vector.Storable as V\n",
"\n",
"model <- readWord2VecModel \"tiny.bin\"\n",
"\n",
"toOurVector :: WVector -> [Double]\n",
"toOurVector (WVector v _) = map realToFrac $ V.toList v\n",
"\n",
"balwanV = toOurVector $ fromJust $ getVector model \"bałwan\"\n",
"balwanV\n",
"Prelude.length balwanV\n",
"\n",
"vectorizeWord2vec model d = Prelude.foldr (+++) (zero 100) $ map toOurVector $ catMaybes $ map (getVector model) d\n",
"\n",
"collectionLVectorized'' = map (vectorizeWord2vec model) collectionLNormalized'"
]
},
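{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before moving on to whole documents, let us sanity-check the word vectors themselves: good word vectors should place related words (say, two animals) closer together than unrelated ones. A small sketch; it assumes that the words `kot`, `pies` and `komputer` actually occur in the `tiny.bin` vocabulary, otherwise the partial `fromJust` will fail:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"{-# LANGUAGE OverloadedStrings #-}\n",
"\n",
"-- cosine similarity of two word vectors; fromJust is partial, so both\n",
"-- words must be present in the model's vocabulary\n",
"wordSim :: Text -> Text -> Double\n",
"wordSim w1 w2 = cosineSim (toOurVector $ fromJust $ getVector model w1)\n",
"                          (toOurVector $ fromJust $ getVector model w2)\n",
"\n",
"wordSim \"kot\" \"pies\"\n",
"wordSim \"kot\" \"komputer\""
]
},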
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[-26.834667675197124,2.568521626293659,37.66925026476383,9.381511189043522,-32.04328362643719,-19.734033070504665,55.21128339320421,14.215368987061083,23.60182836651802,38.74189975857735,0.16257449332624674,-47.983866568654776,-36.917382495012134,36.08420217037201,13.996580198407173,-30.473296120762825,21.28328724205494,30.601420499384403,-40.5945385559462,16.043263137340546,-8.694086126983166,-41.90418399870396,-10.448782376945019,-0.21028679609298706,9.586350612342358,-46.172676257789135,46.27567541599274,11.25023115798831,9.00947591662407,-43.525397814810276,22.09978771582246,56.93886440992355,-23.428963833488524,-1.4649565666913986,21.969609811902046,-21.504647210240364,24.955158293247223,-8.328911297023296,-31.118815276771784,0.22846409678459167,12.212224327027798,-28.337586268782616,-24.105730276554823,3.36764569953084,8.270942151546478,33.71851025521755,30.665825616568327,-24.134687054902315,-31.72916578501463,35.20022106170654,71.15121555328369,-15.448215141892433,-41.27439119666815,3.0322337672114372,9.768462024629116,38.911416467279196,-9.848581969738007,-20.030757322907448,6.734442539513111,-84.9070791369304,38.147536396980286,4.3607237339019775,-25.426255017518997,5.240264508873224,-32.71464269608259,2.095752328634262,2.4292337521910667,32.93906496465206,-51.44473773613572,0.5551527962088585,-6.1982685178518295,20.187213011085987,-52.809339098632336,-10.458874322474003,13.979218572378159,-38.16066548228264,27.336308609694242,5.3437707126140594,-32.01269288826734,-38.117460787296295,-9.337415304034948,38.90077601373196,-2.158842660486698,-44.878454223275185,23.69188129901886,-54.10413733869791,-41.30505630373955,-37.28948371112347,-65.8488347530365,32.51569982431829,3.781733974814415,72.77320172637701,6.847739472985268,63.77478001266718,24.26227615773678,7.260737741366029,10.931276574730873,-17.388786104973406,9.978045962750912,5.968699499964714]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"collectionLVectorized'' !! 3"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" na_ak w_lud ba_hy w_lap ne_dz be_wy zw_oz mo_zu be_wy ba_hy mo_zu be_wy w_lud\n",
"na_ak 1.00 0.92 0.85 0.77 0.87 0.90 0.92 0.88 0.87 0.87 0.89 0.89 0.89\n",
"w_lud 0.92 1.00 0.92 0.72 0.93 0.93 0.91 0.94 0.95 0.86 0.94 0.94 0.96\n",
"ba_hy 0.85 0.92 1.00 0.69 0.89 0.91 0.83 0.89 0.95 0.86 0.87 0.94 0.90\n",
"w_lap 0.77 0.72 0.69 1.00 0.60 0.74 0.67 0.65 0.68 0.58 0.68 0.73 0.66\n",
"ne_dz 0.87 0.93 0.89 0.60 1.00 0.90 0.87 0.95 0.94 0.86 0.93 0.90 0.95\n",
"be_wy 0.90 0.93 0.91 0.74 0.90 1.00 0.89 0.89 0.91 0.85 0.91 0.96 0.94\n",
"zw_oz 0.92 0.91 0.83 0.67 0.87 0.89 1.00 0.89 0.86 0.86 0.91 0.85 0.90\n",
"mo_zu 0.88 0.94 0.89 0.65 0.95 0.89 0.89 1.00 0.97 0.85 0.95 0.91 0.96\n",
"be_wy 0.87 0.95 0.95 0.68 0.94 0.91 0.86 0.97 1.00 0.84 0.93 0.95 0.95\n",
"ba_hy 0.87 0.86 0.86 0.58 0.86 0.85 0.86 0.85 0.84 1.00 0.83 0.85 0.84\n",
"mo_zu 0.89 0.94 0.87 0.68 0.93 0.91 0.91 0.95 0.93 0.83 1.00 0.91 0.96\n",
"be_wy 0.89 0.94 0.94 0.73 0.90 0.96 0.85 0.91 0.95 0.85 0.91 1.00 0.94\n",
"w_lud 0.89 0.96 0.90 0.66 0.95 0.94 0.90 0.96 0.95 0.84 0.96 0.94 1.00"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"limitedL'' = Data.List.take limit collectionLVectorized''\n",
"\n",
"paintMatrix cosineSim labelsLimited limitedL''"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Haskell",
"language": "haskell",
"name": "haskell"
},
"language_info": {
"codemirror_mode": "ihaskell",
"file_extension": ".hs",
"mimetype": "text/x-haskell",
"name": "haskell",
"pygments_lexer": "Haskell",
"version": "8.10.4"
}
},
"nbformat": 4,
"nbformat_minor": 4
}