Additions

Filip Gralinski 2021-04-07 20:37:40 +02:00
parent 73286e2803
commit 7f9c8d2f7c
2 changed files with 466 additions and 38 deletions


@@ -44,12 +44,12 @@
"source": [
"# Homework assignment (maximum 160 points)\n",
"\n",
"Please build a search engine for a document collection of your choice (other than the sample collections in Solr).\n",
"\n",
"## Requirements for passing the assignment\n",
" \n",
" * use of an off-the-shelf search engine (e.g. Solr or Elasticsearch)\n",
" * indexing of at least 40,000 documents\n",
" * a frontend implemented as a web application. Users are not meant to work through the admin panel. The web application may be written in any language/framework.\n",
" * the documents should form a sensible, real-world collection. Please do not generate random documents, duplicate them, etc.\n",
" \n",
@@ -63,10 +63,11 @@
" * indexing > 0.5 million documents: +20 points; > 5 million: +40 points\n",
" * a visualization (chart, map): +20 points\n",
" * use of an interesting feature not mentioned above: +20 points\n",
" * submission by 21.04: +10 points\n",
" * maximum obtainable for this assignment: 160 points\n",
" \n",
"## Grading\n",
" * the deadline is 21.04 or 28.04 (in class)\n",
" * please mark in MS Teams (Assignments) that you have completed the task\n",
" * assignments will be presented in class. Please prepare a presentation of up to 5 minutes"
]


@@ -10,11 +10,13 @@
"\n",
"## _Hashing trick_\n",
"\n",
"On the surface, we can solve the problem by using the so-called _hashing trick_. We will need a hash function $H$ that maps strings to numbers whose binary representation consists of $b$ bits:\n",
"\n",
"$$H : \\Sigma^{*} \\rightarrow \\{0,\\dots,2^b-1\\}$$\n",
"\n",
"($\\Sigma^{*}$ is the set of all strings.)\n",
"\n",
"**Question:** Can the function $H$ be injective?\n",
"\n"
]
},
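{
"cell_type": "markdown",
"metadata": {},
"source": [
"(An added side note, not from the original notebook: one plausible way to obtain such a $b$-bit hash from MurmurHash is to take the 64-bit value modulo $2^b$; the helper name `hashB` below is assumed. And since $\\Sigma^{*}$ is infinite while the codomain has only $2^b$ elements, no such $H$ can be injective.)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import Data.Digest.Murmur64\n",
"\n",
"-- hashB is a hypothetical helper (not defined in the notebook): it reduces\n",
"-- a 64-bit MurmurHash fingerprint to b bits by taking it modulo 2^b.\n",
"hashB :: Integer -> String -> Integer\n",
"hashB b t = fromIntegral (asWord64 (hash64 t)) `mod` (2 ^ b)\n",
"\n",
"hashB 6 \"komputer\"  -- a value in {0,...,63}"
]
},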
@@ -22,14 +24,32 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"As $H$ we can use, for example, MurmurHash2 or MurmurHash3."
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Hash64 0x4a80abc136f926e7"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"Hash64 0x6c3a641663470e2c"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
@@ -65,15 +85,37 @@
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"Hash64 0xb04ce6229407c882"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"Hash64 0x6ecd7bae29ae0450"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import Data.Digest.Murmur64\n",
"\n",
"hash64 \"Komputer\"\n",
"hash64 \"komputer\"\n",
"hash64 \"komputer\"\n", "hash64 \"komputer\"\n",
"hash64 \"komputerze\"\n", "hash64 \"komputerze\"\n",
"hash64 \"komputerek\"\n", "hash64 \"komputerek\"\n",
"hash64 \"abrakadabra\"\n" "hash64 \"abrakadabra\"\n",
"hash64 \"\"\n",
"hash64 \" \"\n"
]
},
{
@@ -514,18 +556,42 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Stwórzmy najpierw funkcję, która będzie wektoryzowała pojedynczy term $t$. Po prostu stworzymy wektor, które będzie miał rozmiar $2^b$, wszędzie będzie miał 0 z wyjątkiem pozycji o numerze $H_b(t)$ - tam wpiszmy odwrotną częstość dokumentową." "Stwórzmy najpierw funkcję, która będzie wektoryzowała pojedynczy term $t$. Po prostu stworzymy wektor, które będzie miał rozmiar $2^b$, wszędzie będzie miał 0 z wyjątkiem pozycji o numerze $H_b(t)$ - tam wpiszmy odwrotną częstość dokumentową.\n",
"\n",
"$$\\vec{t} = [0,\\dots,\\idf_c t,\\dots,0]$$\n",
"\n",
"Teraz dla dokumentu $d = (t_1,\\dots,t_n)$ i dla schematu ważenia tf-idf:\n",
"\n",
"$$\\vec{d} = \\sum \\vec{t_i}$$"
] ]
}, },
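{
"cell_type": "markdown",
"metadata": {},
"source": [
"(A small worked example, added for clarity: for $d = (t_1, t_2, t_1)$ with $H_b(t_1) \\neq H_b(t_2)$, the sum puts $2 \\cdot \\operatorname{idf}_c t_1$ at position $H_b(t_1)$ and $\\operatorname{idf}_c t_2$ at position $H_b(t_2)$; a repeated term contributes its idf once per occurrence.)"
]
},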
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.465908118654584,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.268683541318364,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"[3.367295829986474,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0]"
]
},
"metadata": {},
@@ -541,7 +607,9 @@
" | i == wordFingerprint = idf coll term\n",
" | otherwise = 0.0\n",
"\n",
"wordVector 6 collectionLNormalized \"aromat\"\n",
"wordVector 6 collectionLNormalized \"albo\"\n",
"wordVector 6 collectionLNormalized \"akcja\""
]
},
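{
"cell_type": "markdown",
"metadata": {},
"source": [
"(The hunk above shows only the guards of `wordVector`; a plausible full definition, with names assumed from context, is sketched below.)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"-- A sketch only; getWordFingerprint stands for whatever b-bit hash the\n",
"-- notebook actually uses (e.g. a MurmurHash value taken modulo 2^b).\n",
"wordVector' :: Integer -> [[Text]] -> Text -> [Double]\n",
"wordVector' b coll term = map component [0 .. 2^b - 1]\n",
"  where wordFingerprint = getWordFingerprint b term\n",
"        component i | i == wordFingerprint = idf coll term\n",
"                    | otherwise = 0.0"
]
},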
{
@@ -553,13 +621,13 @@
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[1.2,4.0,3.0]"
]
},
"metadata": {},
@@ -570,7 +638,7 @@
"(+++) :: [Double] -> [Double] -> [Double]\n",
"(+++) = Prelude.zipWith (+)\n",
"\n",
"[0.2, 0.5, 1.0] +++ [1.0, 3.5, 2.0]"
]
},
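{
"cell_type": "markdown",
"metadata": {},
"source": [
"(An added note: `Prelude.zipWith (+)` truncates to the shorter of its two arguments, so e.g. `[1.0, 2.0] +++ [1.0, 2.0, 3.0]` yields `[2.0, 4.0]`. This is harmless here, because all our vectors have the same length $2^b$.)"
]
},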
{
@@ -582,7 +650,7 @@
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
@@ -604,9 +672,110 @@
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<style>/* Styles used for the Hoogle display in the pager */\n",
".hoogle-doc {\n",
"display: block;\n",
"padding-bottom: 1.3em;\n",
"padding-left: 0.4em;\n",
"}\n",
".hoogle-code {\n",
"display: block;\n",
"font-family: monospace;\n",
"white-space: pre;\n",
"}\n",
".hoogle-text {\n",
"display: block;\n",
"}\n",
".hoogle-name {\n",
"color: green;\n",
"font-weight: bold;\n",
"}\n",
".hoogle-head {\n",
"font-weight: bold;\n",
"}\n",
".hoogle-sub {\n",
"display: block;\n",
"margin-left: 0.4em;\n",
"}\n",
".hoogle-package {\n",
"font-weight: bold;\n",
"font-style: italic;\n",
"}\n",
".hoogle-module {\n",
"font-weight: bold;\n",
"}\n",
".hoogle-class {\n",
"font-weight: bold;\n",
"}\n",
".get-type {\n",
"color: green;\n",
"font-weight: bold;\n",
"font-family: monospace;\n",
"display: block;\n",
"white-space: pre-wrap;\n",
"}\n",
".show-type {\n",
"color: green;\n",
"font-weight: bold;\n",
"font-family: monospace;\n",
"margin-left: 1em;\n",
"}\n",
".mono {\n",
"font-family: monospace;\n",
"display: block;\n",
"}\n",
".err-msg {\n",
"color: red;\n",
"font-style: italic;\n",
"font-family: monospace;\n",
"white-space: pre;\n",
"display: block;\n",
"}\n",
"#unshowable {\n",
"color: red;\n",
"font-weight: bold;\n",
"}\n",
".err-msg.in.collapse {\n",
"padding-top: 0.7em;\n",
"}\n",
".highlight-code {\n",
"white-space: pre;\n",
"font-family: monospace;\n",
"}\n",
".suggestion-warning { \n",
"font-weight: bold;\n",
"color: rgb(200, 130, 0);\n",
"}\n",
".suggestion-error { \n",
"font-weight: bold;\n",
"color: red;\n",
"}\n",
".suggestion-name {\n",
"font-weight: bold;\n",
"}\n",
"</style><div class=\"suggestion-name\" style=\"clear:both;\">Eta reduce</div><div class=\"suggestion-row\" style=\"float: left;\"><div class=\"suggestion-warning\">Found:</div><div class=\"highlight-code\" id=\"haskell\">vectorizeWithHashingTrick b coll doc\n",
" = Prelude.foldr ((+++) . wordVector b coll) (zero $ 2 ^ b) doc</div></div><div class=\"suggestion-row\" style=\"float: left;\"><div class=\"suggestion-warning\">Why Not:</div><div class=\"highlight-code\" id=\"haskell\">vectorizeWithHashingTrick b coll\n",
" = Prelude.foldr ((+++) . wordVector b coll) (zero $ 2 ^ b)</div></div>"
],
"text/plain": [
"Line 3: Eta reduce\n",
"Found:\n",
"vectorizeWithHashingTrick b coll doc\n",
" = Prelude.foldr ((+++) . wordVector b coll) (zero $ 2 ^ b) doc\n",
"Why not:\n",
"vectorizeWithHashingTrick b coll\n",
" = Prelude.foldr ((+++) . wordVector b coll) (zero $ 2 ^ b)"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
@@ -615,14 +784,54 @@
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"[3.367295829986474,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6.734591659972947,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"[3.367295829986474,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6.734591659972947,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"[3.367295829986474,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,9.003275201291313,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"[3.367295829986474,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,8.931816237309167,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"\n",
"vectorizeWithHashingTrick :: Integer -> [[Text]] -> [Text] -> [Double]\n",
"vectorizeWithHashingTrick b coll doc = Prelude.foldr ((+++) . wordVector b coll) (zero $ 2^b) doc\n",
"\n",
"vectorizeWithHashingTrick 6 collectionLNormalized $ collectionLNormalized !! 3\n",
"vectorizeWithHashingTrick 6 collectionLNormalized [\"aromat\", \"albo\", \"akcja\"]\n",
"vectorizeWithHashingTrick 6 collectionLNormalized [\"akcja\", \"aromat\", \"albo\"]\n",
"vectorizeWithHashingTrick 6 collectionLNormalized [\"akcja\", \"aromat\", \"albo\", \"albo\"]\n",
"vectorizeWithHashingTrick 6 collectionLNormalized [\"akcja\", \"aromat\", \"09\"]\n"
]
},
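{
"cell_type": "markdown",
"metadata": {},
"source": [
"(Note, added for emphasis: the two permuted documents above receive exactly the same vector, and the repeated \"albo\" simply adds its idf to the same position once more; like bag-of-words, the hashing-trick representation ignores word order.)"
]
},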
{
@@ -634,26 +843,26 @@
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" na_ak w_lud ba_hy w_lap ne_dz be_wy zw_oz mo_zu be_wy ba_hy mo_zu be_wy w_lud\n",
"na_ak 1.00 0.37 0.21 0.28 0.35 0.22 0.32 0.45 0.47 0.21 0.25 0.20 0.39\n",
"w_lud 0.37 1.00 0.28 0.18 0.38 0.15 0.20 0.35 0.36 0.14 0.17 0.19 0.33\n",
"ba_hy 0.21 0.28 1.00 0.08 0.20 0.18 0.24 0.29 0.30 0.27 0.17 0.15 0.24\n",
"w_lap 0.28 0.18 0.08 1.00 0.10 0.11 0.11 0.30 0.17 0.06 0.07 0.13 0.21\n",
"ne_dz 0.35 0.38 0.20 0.10 1.00 0.32 0.30 0.52 0.44 0.27 0.36 0.26 0.41\n",
"be_wy 0.22 0.15 0.18 0.11 0.32 1.00 0.26 0.26 0.39 0.15 0.23 0.43 0.22\n",
"zw_oz 0.32 0.20 0.24 0.11 0.30 0.26 1.00 0.38 0.36 0.06 0.18 0.20 0.29\n",
"mo_zu 0.45 0.35 0.29 0.30 0.52 0.26 0.38 1.00 0.54 0.23 0.39 0.38 0.51\n",
"be_wy 0.47 0.36 0.30 0.17 0.44 0.39 0.36 0.54 1.00 0.26 0.37 0.42 0.48\n",
"ba_hy 0.21 0.14 0.27 0.06 0.27 0.15 0.06 0.23 0.26 1.00 0.24 0.10 0.27\n",
"mo_zu 0.25 0.17 0.17 0.07 0.36 0.23 0.18 0.39 0.37 0.24 1.00 0.20 0.34\n",
"be_wy 0.20 0.19 0.15 0.13 0.26 0.43 0.20 0.38 0.42 0.10 0.20 1.00 0.29\n",
"w_lud 0.39 0.33 0.24 0.21 0.41 0.22 0.29 0.51 0.48 0.27 0.34 0.29 1.00"
]
},
"metadata": {},
@@ -661,7 +870,7 @@
}
],
"source": [
"lVectorized' = map (vectorizeWithHashingTrick 8 collectionLNormalized) collectionLNormalized\n",
"limitedL' = Data.List.take limit lVectorized'\n",
"\n",
"paintMatrix cosineSim labelsLimited limitedL'"
@@ -1071,7 +1280,7 @@
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
@@ -1117,7 +1326,7 @@
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
@@ -1136,7 +1345,7 @@
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
@@ -1168,6 +1377,224 @@
"paintMatrix cosineSim labelsLimited limitedL''"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Możemy próbować mnożyć wektory z modelu Word2vec z idf. Najpierw zdefiniujmy mnożenie przez skalar."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[2.5,0.0,5.0]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"(***) :: Double -> [Double] -> [Double]\n",
"(***) s = map (*s)\n",
"\n",
"2.5 *** [1.0, 0.0, 2.0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Teraz będziemy przemnażali wektory Word2vec przez idf (jako skalar)."
]
},
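{
"cell_type": "markdown",
"metadata": {},
"source": [
"Spelling out what the next cell computes: for a document $d = (t_1,\\dots,t_n)$ we take\n",
"\n",
"$$\\vec{d} = \\sum_{i \\,:\\, t_i \\in V} \\operatorname{idf}_c t_i \\cdot \\vec{w}(t_i),$$\n",
"\n",
"where $\\vec{w}(t_i)$ is the Word2vec vector of $t_i$ and $V$ is the Word2vec vocabulary (out-of-vocabulary terms are simply skipped)."
]
},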
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<style>/* Styles used for the Hoogle display in the pager */\n",
".hoogle-doc {\n",
"display: block;\n",
"padding-bottom: 1.3em;\n",
"padding-left: 0.4em;\n",
"}\n",
".hoogle-code {\n",
"display: block;\n",
"font-family: monospace;\n",
"white-space: pre;\n",
"}\n",
".hoogle-text {\n",
"display: block;\n",
"}\n",
".hoogle-name {\n",
"color: green;\n",
"font-weight: bold;\n",
"}\n",
".hoogle-head {\n",
"font-weight: bold;\n",
"}\n",
".hoogle-sub {\n",
"display: block;\n",
"margin-left: 0.4em;\n",
"}\n",
".hoogle-package {\n",
"font-weight: bold;\n",
"font-style: italic;\n",
"}\n",
".hoogle-module {\n",
"font-weight: bold;\n",
"}\n",
".hoogle-class {\n",
"font-weight: bold;\n",
"}\n",
".get-type {\n",
"color: green;\n",
"font-weight: bold;\n",
"font-family: monospace;\n",
"display: block;\n",
"white-space: pre-wrap;\n",
"}\n",
".show-type {\n",
"color: green;\n",
"font-weight: bold;\n",
"font-family: monospace;\n",
"margin-left: 1em;\n",
"}\n",
".mono {\n",
"font-family: monospace;\n",
"display: block;\n",
"}\n",
".err-msg {\n",
"color: red;\n",
"font-style: italic;\n",
"font-family: monospace;\n",
"white-space: pre;\n",
"display: block;\n",
"}\n",
"#unshowable {\n",
"color: red;\n",
"font-weight: bold;\n",
"}\n",
".err-msg.in.collapse {\n",
"padding-top: 0.7em;\n",
"}\n",
".highlight-code {\n",
"white-space: pre;\n",
"font-family: monospace;\n",
"}\n",
".suggestion-warning { \n",
"font-weight: bold;\n",
"color: rgb(200, 130, 0);\n",
"}\n",
".suggestion-error { \n",
"font-weight: bold;\n",
"color: red;\n",
"}\n",
".suggestion-name {\n",
"font-weight: bold;\n",
"}\n",
"</style><div class=\"suggestion-name\" style=\"clear:both;\">Fuse foldr/map</div><div class=\"suggestion-row\" style=\"float: left;\"><div class=\"suggestion-warning\">Found:</div><div class=\"highlight-code\" id=\"haskell\">Prelude.foldr (+++) (zero 100)\n",
" $ map (\\ (t, Just v) -> idf coll t *** toOurVector v)\n",
" $ Prelude.filter (\\ (_, v) -> isJust v)\n",
" $ map (\\ t -> (t, getVector model t)) d</div></div><div class=\"suggestion-row\" style=\"float: left;\"><div class=\"suggestion-warning\">Why Not:</div><div class=\"highlight-code\" id=\"haskell\">foldr\n",
" ((+++) . (\\ (t, Just v) -> idf coll t *** toOurVector v))\n",
" (zero 100)\n",
" (Prelude.filter (\\ (_, v) -> isJust v)\n",
" $ map (\\ t -> (t, getVector model t)) d)</div></div>"
],
"text/plain": [
"Line 4: Fuse foldr/map\n",
"Found:\n",
"Prelude.foldr (+++) (zero 100)\n",
" $ map (\\ (t, Just v) -> idf coll t *** toOurVector v)\n",
" $ Prelude.filter (\\ (_, v) -> isJust v)\n",
" $ map (\\ t -> (t, getVector model t)) d\n",
"Why not:\n",
"foldr\n",
" ((+++) . (\\ (t, Just v) -> idf coll t *** toOurVector v))\n",
" (zero 100)\n",
" (Prelude.filter (\\ (_, v) -> isJust v)\n",
" $ map (\\ t -> (t, getVector model t)) d)"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import Data.Maybe (isJust)\n",
"\n",
"vectorizeWord2vecIdf model coll d = \n",
" Prelude.foldr (+++) (zero 100) \n",
" $ map (\\(t, Just v) -> idf coll t *** toOurVector v) \n",
" $ Prelude.filter (\\(_, v) -> isJust v)\n",
" $ map (\\t -> (t, getVector model t)) d\n",
"\n",
"collectionLVectorized''' = map (vectorizeWord2vecIdf model collectionLNormalized') collectionLNormalized'"
]
},
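{
"cell_type": "markdown",
"metadata": {},
"source": [
"(An equivalent formulation, added as a sketch: `Data.Maybe.mapMaybe` skips out-of-vocabulary terms without the partial pattern `\\(t, Just v) -> ...` used above.)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import Data.Maybe (mapMaybe)\n",
"\n",
"-- Same result as vectorizeWord2vecIdf above, but total: terms for which\n",
"-- getVector returns Nothing are simply dropped by mapMaybe.\n",
"vectorizeWord2vecIdf' model coll d =\n",
"  Prelude.foldr (+++) (zero 100)\n",
"    $ mapMaybe (\\t -> fmap ((idf coll t ***) . toOurVector) (getVector model t)) d"
]
},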
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[-35.63830397762308,32.606312678971506,102.20663646169147,56.00417395285867,-130.56709475346878,-14.916644370325773,55.15817632053957,83.2241937686228,26.432875116296394,48.94350344147367,11.370669191277202,-59.54579267200742,-116.01687192456801,60.53824040579282,39.84659684249884,-34.37377085402866,104.53525319069323,45.53363024094972,-34.25020197907558,-43.9007702604392,35.36538495508536,-59.81737728971619,-1.5823889595648828,-50.211106838043655,14.83789867297237,-109.45917608219175,86.56767915592452,-32.170794763065615,29.559930839016644,-126.81686726526162,-9.918908360030228,47.14965938694648,5.955083439147183,41.24417782948478,3.592410260515919,72.10649687523313,61.374776273461855,60.28687760276824,-28.886499026001676,-8.710633131022206,-68.73464623080284,-37.95272838994007,-26.390548039392165,-14.241950251566944,74.6286124718925,46.21889022510431,72.23999508751568,-19.597547074284556,-20.160749174807382,99.49036127458763,131.98057386978817,-23.842794956628147,-62.381675411749846,-19.366936151725387,1.4839595614144327,60.40520721416763,-7.70311857607342,-31.75784386529525,48.71818084466781,-202.41827342135582,138.5639100010709,12.447619757719652,-39.38375639132277,27.877688543771935,-87.00559882214534,56.45689362090545,37.89098984507379,103.78465196444151,-166.10094891357176,-50.83382060940457,11.574060187412977,74.00519869734406,-97.00170731343235,32.18159534728971,-11.280059681646494,-40.701643971890256,74.64230137346699,0.7613112917269982,-6.103424218278271,-150.47551072570587,-21.714627635239918,91.26690441786137,62.91576955719526,-92.35700140312395,-25.421583980267307,-67.87480813505826,-120.16245846953592,-68.89155479679258,-122.00206448376261,35.263603445401785,6.416282520155956,203.41225708856086,-62.42983953251155,59.36113672119048,40.00275897200196,-62.55633545667429,89.66866371308245,-42.287712072353834,-72.59490110281287,52.23637641217955]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"collectionLVectorized''' !! 3"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" na_ak w_lud ba_hy w_lap ne_dz be_wy zw_oz mo_zu be_wy ba_hy mo_zu be_wy w_lud\n",
"na_ak 1.00 0.83 0.78 0.63 0.78 0.81 0.83 0.76 0.77 0.80 0.77 0.79 0.79\n",
"w_lud 0.83 1.00 0.82 0.60 0.84 0.84 0.84 0.85 0.86 0.74 0.86 0.83 0.90\n",
"ba_hy 0.78 0.82 1.00 0.57 0.78 0.84 0.77 0.79 0.90 0.75 0.74 0.89 0.85\n",
"w_lap 0.63 0.60 0.57 1.00 0.38 0.60 0.50 0.43 0.52 0.45 0.55 0.65 0.47\n",
"ne_dz 0.78 0.84 0.78 0.38 1.00 0.81 0.79 0.90 0.89 0.77 0.81 0.81 0.90\n",
"be_wy 0.81 0.84 0.84 0.60 0.81 1.00 0.82 0.76 0.83 0.74 0.81 0.92 0.88\n",
"zw_oz 0.83 0.84 0.77 0.50 0.79 0.82 1.00 0.77 0.77 0.74 0.82 0.75 0.83\n",
"mo_zu 0.76 0.85 0.79 0.43 0.90 0.76 0.77 1.00 0.93 0.74 0.87 0.80 0.90\n",
"be_wy 0.77 0.86 0.90 0.52 0.89 0.83 0.77 0.93 1.00 0.72 0.81 0.89 0.92\n",
"ba_hy 0.80 0.74 0.75 0.45 0.77 0.74 0.74 0.74 0.72 1.00 0.66 0.73 0.72\n",
"mo_zu 0.77 0.86 0.74 0.55 0.81 0.81 0.82 0.87 0.81 0.66 1.00 0.80 0.88\n",
"be_wy 0.79 0.83 0.89 0.65 0.81 0.92 0.75 0.80 0.89 0.73 0.80 1.00 0.87\n",
"w_lud 0.79 0.90 0.85 0.47 0.90 0.88 0.83 0.90 0.92 0.72 0.88 0.87 1.00"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"limitedL''' = Data.List.take limit collectionLVectorized'''\n",
"\n",
"paintMatrix cosineSim labelsLimited limitedL'''"
]
},
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null, "execution_count": null,