Remove unused files

2023-06-27 19:15:34 +02:00
commit b5b575bd45
parent 14d3dc0e04
16 changed files with 0 additions and 47572 deletions
--- a/09_Zanurzenia_slow.ipynb
+++ b/09_Zanurzenia_slow.ipynb
@@ -1,634 +0,0 @@
-{
- "cells": [
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "![Logo 1](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech1.jpg)\n",
-    "<div class=\"alert alert-block alert-info\">\n",
-    "<h1> Language Modeling</h1>\n",
-    "<h2> 09. <i>Word embeddings (Word2vec)</i>  [lecture]</h2> \n",
-    "<h3> Filip Graliński (2022)</h3>\n",
-    "</div>\n",
-    "\n",
-    "![Logo 2](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech2.jpg)\n",
-    "\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## Word embeddings (Word2vec)\n",
-    "\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "In practice, the applicability of wordnets turned out to be surprisingly\n",
-    "limited. A bigger breakthrough in natural language processing came from\n",
-    "multidimensional word representations, also known as word embeddings.\n",
-    "\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### “Dimensions” of words\n",
-    "\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "We could embed words in a multidimensional space, i.e. define a mapping\n",
-    "$E \\colon V \\rightarrow \\mathbb{R}^m$ for some $m$, and choose a way of estimating the\n",
-    "probabilities $P(u|v)$ such that whenever $E(v)$ is close to $E(v')$ and $E(u)$ is close to $E(u')$\n",
-    "(according to some distance metric, for example the ordinary Euclidean distance), we have:\n",
-    "\n",
-    "$$P(u|v) \\approx P(u'|v').$$\n",
-    "\n",
-    "$E(u)$ is called the embedding of the word $u$.\n",
-    "\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "#### Dimensions fixed in advance?\n",
-    "\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "One could imagine that the $m$ dimensions might be fixed in advance\n",
-    "by a linguist. These dimensions would correspond to the typical\n",
-    "“axes” considered in linguistics, for example:\n",
-    "\n",
-    "-   is the word vulgar, common, colloquial, neutral, or literary?\n",
-    "-   is the word archaic, falling out of use, or a neologism?\n",
-    "-   does the word refer to women or to men (in the sense of grammatical gender and/or\n",
-    "    in the sociolinguistic sense)?\n",
-    "-   is the word singular or plural?\n",
-    "-   is the word a noun or a verb?\n",
-    "-   is the word native or a loanword?\n",
-    "-   is the word a proper name or a common word?\n",
-    "-   does the word describe a concrete thing or an abstract concept?\n",
-    "-   …\n",
-    "\n",
-    "In practice, however, it turned out to be better to let the computer learn\n",
-    "the possible dimensions on its own; we fix only $m$ (the number of dimensions) in advance.\n",
-    "\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### A bigram language model based on embeddings\n",
-    "\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "We will now build the simplest language model based on embeddings. It will in fact be the simplest\n",
-    "**neural language model**, since the resulting model can be treated as a simple neural network.\n",
-    "\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "#### Vocabulary\n",
-    "\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "In a typical neural language model the vocabulary size must be limited\n",
-    "in advance. Usually it is on the order of tens of thousands of words:\n",
-    "we simply consider the $|V|$ most frequent words and replace the rest\n",
-    "with a special `<unk>` token representing an unknown word.\n",
-    "\n",
-    "To build such a vocabulary, we will use the ready-made `Vocab` class from the torchtext package (created here with `build_vocab_from_iterator`):\n",
-    "\n"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 4,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "from itertools import islice\n",
-    "import regex as re\n",
-    "import sys\n",
-    "from torchtext.vocab import build_vocab_from_iterator\n",
-    "import pickle\n",
-    "import lzma"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 2,
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "text/plain": [
-       "1027"
-      ]
-     },
-     "execution_count": 2,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "from itertools import islice\n",
-    "import regex as re\n",
-    "import sys\n",
-    "from torchtext.vocab import build_vocab_from_iterator\n",
-    "import lzma\n",
-    "\n",
-    "\n",
-    "def get_words_from_line(line):\n",
-    "  line = line.rstrip()\n",
-    "  yield '<s>'\n",
-    "  for m in re.finditer(r'[\\p{L}0-9\\*]+|\\p{P}+', line):\n",
-    "     yield m.group(0).lower()\n",
-    "  yield '</s>'\n",
-    "\n",
-    "\n",
-    "def get_word_lines_from_file(file_name):\n",
-    "  with lzma.open(file_name, 'r') as fh:\n",
-    "    for line in fh:\n",
-    "       yield get_words_from_line(line.decode('utf-8'))\n",
-    "\n",
-    "vocab_size = 20000\n",
-    "\n",
-    "vocab = build_vocab_from_iterator(\n",
-    "    get_word_lines_from_file('train/in.tsv.xz'),\n",
-    "    max_tokens = vocab_size,\n",
-    "    specials = ['<unk>'])\n",
-    "\n",
-    "vocab['human']"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 3,
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "text/plain": [
-       "['<unk>', '\\\\', 'the', '-\\\\', 'nmighty']"
-      ]
-     },
-     "execution_count": 3,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "vocab.lookup_tokens([0, 1, 2, 10, 12345])"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 5,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "with open('vocabulary.pickle', 'wb') as fh:\n",
-    "  pickle.dump(vocab, fh)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "#### Network definition\n",
-    "\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "We will implement our simple neural network using the PyTorch framework.\n",
-    "\n"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 10,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stderr",
-     "output_type": "stream",
-     "text": [
-      "/Users/jacob/opt/anaconda3/lib/python3.9/site-packages/torch/nn/modules/container.py:217: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.\n",
-      "  input = module(input)\n"
-     ]
-    },
-    {
-     "data": {
-      "text/plain": [
-       "tensor(2.9869e-05, grad_fn=<SelectBackward0>)"
-      ]
-     },
-     "execution_count": 10,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "from torch import nn\n",
-    "import torch\n",
-    "\n",
-    "embed_size = 100\n",
-    "\n",
-    "class SimpleBigramNeuralLanguageModel(nn.Module):\n",
-    "  def __init__(self, vocabulary_size, embedding_size):\n",
-    "      super(SimpleBigramNeuralLanguageModel, self).__init__()\n",
-    "      self.model = nn.Sequential(\n",
-    "          nn.Embedding(vocabulary_size, embedding_size),\n",
-    "          nn.Linear(embedding_size, vocabulary_size),\n",
-    "          nn.Softmax(dim=1)\n",
-    "      )\n",
-    "\n",
-    "  def forward(self, x):\n",
-    "      return self.model(x)\n",
-    "\n",
-    "model = SimpleBigramNeuralLanguageModel(vocab_size, embed_size)\n",
-    "\n",
-    "vocab.set_default_index(vocab['<unk>'])\n",
-    "ixs = torch.tensor(vocab.forward(['is']))\n",
-    "out = model(ixs)\n",
-    "out[0][vocab['the']]"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Now let's train the model. First, though, let's shuffle our file:\n",
-    "\n",
-    "    shuf < opensubtitlesA.pl.txt > opensubtitlesA.pl.shuf.txt\n",
-    "\n"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 8,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "from torch.utils.data import IterableDataset\n",
-    "import itertools\n",
-    "\n",
-    "def look_ahead_iterator(gen):\n",
-    "   prev = None\n",
-    "   for item in gen:\n",
-    "      if prev is not None:\n",
-    "         yield (prev, item)\n",
-    "      prev = item\n",
-    "\n",
-    "class Bigrams(IterableDataset):\n",
-    "  def __init__(self, text_file, vocabulary_size):\n",
-    "      self.vocab = build_vocab_from_iterator(\n",
-    "         get_word_lines_from_file(text_file),\n",
-    "         max_tokens = vocabulary_size,\n",
-    "         specials = ['<unk>'])\n",
-    "      self.vocab.set_default_index(self.vocab['<unk>'])\n",
-    "      self.vocabulary_size = vocabulary_size\n",
-    "      self.text_file = text_file\n",
-    "\n",
-    "  def __iter__(self):\n",
-    "     return look_ahead_iterator(\n",
-    "         (self.vocab[t] for t in itertools.chain.from_iterable(get_word_lines_from_file(self.text_file))))\n",
-    "\n",
-    "train_dataset = Bigrams('train/in.tsv.xz', vocab_size)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 9,
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "text/plain": [
-       "(43, 0)"
-      ]
-     },
-     "execution_count": 9,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "from torch.utils.data import DataLoader\n",
-    "\n",
-    "next(iter(train_dataset))"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 1,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "[tensor([   2,    5,   51, 3481,  231]), tensor([   5,   51, 3481,  231,    4])]"
-     ]
-    }
-   ],
-   "source": [
-    "from torch.utils.data import DataLoader\n",
-    "\n",
-    "next(iter(DataLoader(train_dataset, batch_size=5)))"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 1,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "None"
-     ]
-    }
-   ],
-   "source": [
-    "device = 'cpu'\n",
-    "model = SimpleBigramNeuralLanguageModel(vocab_size, embed_size).to(device)\n",
-    "data = DataLoader(train_dataset, batch_size=5000)\n",
-    "optimizer = torch.optim.Adam(model.parameters())\n",
-    "criterion = torch.nn.NLLLoss()\n",
-    "\n",
-    "model.train()\n",
-    "step = 0\n",
-    "for x, y in data:\n",
-    "   x = x.to(device)\n",
-    "   y = y.to(device)\n",
-    "   optimizer.zero_grad()\n",
-    "   ypredicted = model(x)\n",
-    "   loss = criterion(torch.log(ypredicted), y)\n",
-    "   if step % 100 == 0:\n",
-    "      print(step, loss)\n",
-    "   step += 1\n",
-    "   loss.backward()\n",
-    "   optimizer.step()\n",
-    "\n",
-    "torch.save(model.state_dict(), 'model1.bin')"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Let's compute the most probable continuations for a given word:\n",
-    "\n"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 1,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "[('ciebie', 73, 0.1580502986907959), ('mnie', 26, 0.15395283699035645), ('<unk>', 0, 0.12862136960029602), ('nas', 83, 0.0410110242664814), ('niego', 172, 0.03281523287296295), ('niej', 245, 0.02104802615940571), ('siebie', 181, 0.020788608118891716), ('którego', 365, 0.019379809498786926), ('was', 162, 0.013852755539119244), ('wszystkich', 235, 0.01381855271756649)]"
-     ]
-    }
-   ],
-   "source": [
-    "device = 'cuda'\n",
-    "model = SimpleBigramNeuralLanguageModel(vocab_size, embed_size).to(device)\n",
-    "model.load_state_dict(torch.load('model1.bin'))\n",
-    "model.eval()\n",
-    "\n",
-    "ixs = torch.tensor(vocab.forward(['dla'])).to(device)\n",
-    "\n",
-    "out = model(ixs)\n",
-    "top = torch.topk(out[0], 10)\n",
-    "top_indices = top.indices.tolist()\n",
-    "top_probs = top.values.tolist()\n",
-    "top_words = vocab.lookup_tokens(top_indices)\n",
-    "list(zip(top_words, top_indices, top_probs))"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Now let's examine the most similar embeddings for a given word:\n",
-    "\n"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 1,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "[('.', 3, 0.404473215341568), (',', 4, 0.14222915470600128), ('z', 14, 0.10945753753185272), ('?', 6, 0.09583134204149246), ('w', 10, 0.050338443368673325), ('na', 12, 0.020703863352537155), ('i', 11, 0.016762692481279373), ('<unk>', 0, 0.014571071602404118), ('...', 15, 0.01453721895813942), ('</s>', 1, 0.011769450269639492)]"
-     ]
-    }
-   ],
-   "source": [
-    "vocab = train_dataset.vocab\n",
-    "ixs = torch.tensor(vocab.forward(['kłopot'])).to(device)\n",
-    "\n",
-    "out = model(ixs)\n",
-    "top = torch.topk(out[0], 10)\n",
-    "top_indices = top.indices.tolist()\n",
-    "top_probs = top.values.tolist()\n",
-    "top_words = vocab.lookup_tokens(top_indices)\n",
-    "list(zip(top_words, top_indices, top_probs))"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 1,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "[('poszedł', 1087, 1.0), ('idziesz', 1050, 0.4907470941543579), ('przyjeżdża', 4920, 0.45242372155189514), ('pojechałam', 12784, 0.4342481195926666), ('wrócił', 1023, 0.431664377450943), ('dobrać', 10351, 0.4312002956867218), ('stałeś', 5738, 0.4258835017681122), ('poszła', 1563, 0.41979148983955383), ('trafiłam', 18857, 0.4109022617340088), ('jedzie', 1674, 0.4091658890247345)]"
-     ]
-    }
-   ],
-   "source": [
-    "cos = nn.CosineSimilarity(dim=1, eps=1e-6)\n",
-    "\n",
-    "embeddings = model.model[0].weight\n",
-    "\n",
-    "vec = embeddings[vocab['poszedł']]\n",
-    "\n",
-    "similarities = cos(vec, embeddings)\n",
-    "\n",
-    "top = torch.topk(similarities, 10)\n",
-    "\n",
-    "top_indices = top.indices.tolist()\n",
-    "top_probs = top.values.tolist()\n",
-    "top_words = vocab.lookup_tokens(top_indices)\n",
-    "list(zip(top_words, top_indices, top_probs))"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "#### Notation as a mathematical formula\n",
-    "\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "The neural network programmed above can be described by the following formula:\n",
-    "\n",
-    "$$\\vec{y} = \\operatorname{softmax}(CE(w_{i-1})),$$\n",
-    "\n",
-    "where:\n",
-    "\n",
-    "-   $w_{i-1}$ is the first word of the bigram (the preceding word),\n",
-    "-   $E(w)$ is the embedding of the word $w$, a vector of size $m$,\n",
-    "-   $C$ is a matrix of size $|V| \\times m$ that projects the embedding vector onto a vector of the size of the vocabulary,\n",
-    "-   $\\vec{y}$ is the output vector of probabilities, of size $|V|$.\n",
-    "\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "##### Hyperparameters\n",
-    "\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Note that our model has two hyperparameters:\n",
-    "\n",
-    "-   $m$, the embedding size,\n",
-    "-   $|V|$, the vocabulary size, assuming that we can control the size\n",
-    "    of the vocabulary (e.g. by truncating it to a given number of the most\n",
-    "    frequent words and replacing the rest with a special token, say, `<UNK>`).\n",
-    "\n",
-    "Of course, we can try adjusting the values of $m$ and $|V|$ in order to\n",
-    "improve the results of our model.\n",
-    "\n",
-    "**Question**: why does a value of $m \\approx |V|$ make no sense? Why does $m = 1$ make no sense?\n",
-    "\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "#### Network diagram\n",
-    "\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Since multiplication by a matrix ($C$) simply amounts to applying a\n",
-    "linear layer, our network can be interpreted as a single-layer\n",
-    "neural network, which can be illustrated with the following diagram:\n",
-    "\n",
-    "![img](./09_Zanurzenia_slow/bigram1.drawio.png \"Diagram of a simple bigram neural language model\")\n",
-    "\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "#### Embedding as matrix multiplication\n",
-    "\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Obtaining the embedding ($E(w)$) is usually implemented as a table\n",
-    "*look-up*. Interestingly, the embedding can also be interpreted as\n",
-    "multiplication by the embedding matrix $E$ of size $m \\times |V|$, provided that the input word is encoded as\n",
-    "a one-hot vector (*one-hot encoding*), i.e. the word $w$ is\n",
-    "given on input as the vector $\\vec{1_V}(w) = [0,\\ldots,0,1,0,\\ldots,0]$ of size $|V|$\n",
-    "consisting of zeros only, except for a one at the position corresponding to the index of the word $w$ in the vocabulary $V$.\n",
-    "\n",
-    "The formula then takes the form:\n",
-    "\n",
-    "$$\\vec{y} = \\operatorname{softmax}(CE\\vec{1_V}(w_{i-1})),$$\n",
-    "\n",
-    "where $E$ is this time an $m \\times |V|$ matrix.\n",
-    "\n",
-    "**Question**: do we interpret $\\vec{1_V}(w)$ as a row vector or a column vector?\n",
-    "\n",
-    "This interpretation can be illustrated with the following diagram:\n",
-    "\n",
-    "![img](./09_Zanurzenia_slow/bigram2.drawio.png \"Diagram of a simple bigram neural language model with one-hot input\")\n",
-    "\n"
-   ]
-  }
- ],
- "metadata": {
-  "kernelspec": {
-   "display_name": "Python 3 (ipykernel)",
-   "language": "python",
-   "name": "python3"
-  },
-  "language_info": {
-   "codemirror_mode": {
-    "name": "ipython",
-    "version": 3
-   },
-   "file_extension": ".py",
-   "mimetype": "text/x-python",
-   "name": "python",
-   "nbconvert_exporter": "python",
-   "pygments_lexer": "ipython3",
-   "version": "3.9.7"
-  },
-  "org": null
- },
- "nbformat": 4,
- "nbformat_minor": 1
-}
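Note on `09_Zanurzenia_slow.ipynb`: the notebook's last section reads an embedding look-up as multiplying the embedding matrix $E$ (of size $m \times |V|$) by a one-hot vector. The following is a minimal sketch of that equivalence in plain Python, not code from the notebook. The 2 × 3 matrix and the helper names `one_hot` and `mat_vec` are made up for illustration, and the one-hot vector is treated as a column vector.

```python
def one_hot(index, size):
    """Return a list of `size` zeros with a single 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a column vector (list)."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


# Hypothetical embedding matrix E with m = 2 rows and |V| = 3 columns;
# column j holds the embedding of the word with index j.
E = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
]

w = 1  # index of the word in the vocabulary

lookup = [row[w] for row in E]      # direct look-up of column w
product = mat_vec(E, one_hot(w, 3))  # E multiplied by the one-hot vector

print(lookup, product)
```

Both routes select the same column of $E$. Real implementations use the look-up because it avoids multiplying by a vector that is almost entirely zeros.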
--- a/dev-0/out-embed-100.tsv
+++ b/dev-0/out-embed-100.tsv
--- a/dev-0/out-embed-500.tsv
+++ b/dev-0/out-embed-500.tsv
--- a/dev-0/out.tsv
+++ b/dev-0/out.tsv
--- a/gonito.yaml
+++ b/gonito.yaml
@@ -1,13 +0,0 @@
-description: nn, trigram, previous and next
-tags:
-  - neural-network
-  - trigram
-params:
-  epochs: 1
-  vocab-size: 20000
-  batch-size: 10000
-  embed-size:
-    - 100
-    - 500
-    - 1000
-  topk: 10
--- a/lm0.py
+++ b/lm0.py
@@ -1,15 +0,0 @@
-import sys
-import random
-
-distribs = [
-    'a:0.6 the:0.2 :0.2',
-    'the:0.7 a:0.2 :0.1',
-    'the:0.9 :0.1',
-    'the:0.3 be:0.2 to:0.15 of:0.15 and:0.05 a:0.05 in:0.05 :0.05',
-]
-
-for line in sys.stdin:
-    ctx = line.split('\t')[6:]
-
-    i = random.randint(0, len(distribs) - 1)
-    print(distribs[i])
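Note on `lm0.py`: each output line is a space-separated list of `word:probability` entries ending with an entry whose word is empty. The sketch below parses such a line and checks that the four hard-coded distributions sum to 1. It assumes the empty-word entry is meant to hold the probability mass of all remaining words, which is how `run.py` appears to use it. The helper name `parse_distribution` is hypothetical.

```python
import math


def parse_distribution(line):
    """Parse 'word:prob ... :rest' into a dict; the empty key holds the leftover mass."""
    dist = {}
    for entry in line.split():
        # rpartition splits on the last ':' so ':0.2' yields an empty word
        word, _, prob = entry.rpartition(':')
        dist[word] = float(prob)
    return dist


distribs = [
    'a:0.6 the:0.2 :0.2',
    'the:0.7 a:0.2 :0.1',
    'the:0.9 :0.1',
    'the:0.3 be:0.2 to:0.15 of:0.15 and:0.05 a:0.05 in:0.05 :0.05',
]

totals = [sum(parse_distribution(d).values()) for d in distribs]
print(totals)
```

A check like this is cheap to run over an output file before submitting it, since a distribution that does not sum to 1 is usually a formatting mistake.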
--- a/lm1.py
+++ b/lm1.py
@@ -1,56 +0,0 @@
-import sys
-import random
-from tqdm import tqdm
-from collections import defaultdict
-import pickle
-import os
-
-corpus = []
-
-with open('train/in.tsv', 'r') as f:
-    print('Reading corpus...')
-    for line in tqdm(f):
-        ctx = line.split('\t')[6:]
-
-        corpus.append(ctx[0] + 'BLANK' + ctx[1])
-
-corpus = ' '.join(corpus)
-corpus = corpus.replace('-\n', '')
-corpus = corpus.replace('\\n', ' ')
-corpus = corpus.replace('\n', ' ')
-corpus = corpus.split(' ')
-
-if (os.path.exists('distrib.pkl')):
-    print('Loading distribution...')
-    distrib = pickle.load(open('distrib.pkl', 'rb'))
-else:
-    print('Generating distribution...')
-    distrib = defaultdict(lambda: defaultdict(int))
-    for i in tqdm(range(len(corpus) - 1)):
-        distrib[corpus[i]][corpus[i+1]] += 1
-
-    with open('distrib.pkl', 'wb') as f:
-        print('Saving distribution...')
-        pickle.dump(dict(distrib), f)
-
-results = []
-with open('dev-0/in.tsv', 'r') as f:
-    print('Generating output...')
-    for line in tqdm(f):
-        ctx = line.split('\t')[6:]
-        last_word = ctx[0].split(' ')[-1]
-        try:
-            blank_word = max(distrib[last_word], key=distrib[last_word].get)
-        except (KeyError, ValueError):
-            blank_word = 'NONE'
-        results.append(blank_word)
-
-with open('dev-0/out.tsv', 'w') as f:
-    print('Writing output...')
-    for result in tqdm(results):
-        if result == 'NONE':
-            f.write('a:0.6 the:0.2 :0.2\n')
-        else:
-            f.write(f'{result}:0.9 :0.1\n')
-
-            
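Note on `lm1.py`: the script counts which word follows which and predicts the most frequent follower of the last word before the gap. The same idea can be sketched on a toy corpus with `collections.Counter`. This is an illustration under simplified assumptions (whitespace tokenization, a made-up sentence), and the function names `build_followers` and `most_likely_next` are hypothetical.

```python
from collections import Counter, defaultdict


def build_followers(tokens):
    """Map each word to a Counter of the words that directly follow it."""
    followers = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        followers[prev][nxt] += 1
    return followers


def most_likely_next(followers, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    counts = followers.get(word)  # .get avoids creating empty entries
    if not counts:
        return None
    return counts.most_common(1)[0][0]


tokens = 'the cat sat on the mat and the cat slept'.split()
followers = build_followers(tokens)

print(most_likely_next(followers, 'the'))
print(most_likely_next(followers, 'dog'))
```

Returning `None` for unseen words makes the fallback case explicit, where the script relies on catching an exception instead.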
--- a/lmn.py
+++ b/lmn.py
@@ -1,83 +0,0 @@
-from tqdm import tqdm
-from numpy import argmax
-
-def preprocess(corpus):
-    corpus = corpus.replace('-\n', '')
-    corpus = corpus.replace('\\n', ' ')
-    corpus = corpus.replace('\n', ' ')
-    corpus = corpus.replace('.', ' EOS')
-
-    return corpus
-
-def generate_freq(tokens):
-    tokens_freq = {}
-    for token in tqdm(tokens):
-        if token not in tokens_freq:
-            tokens_freq[token] = 1
-        else:
-            tokens_freq[token] += 1
-
-    return tokens_freq
-
-def generate_ngrams(tokens, n):
-    ngrams = []
-    for i in tqdm(range(len(tokens) - n + 1)):
-        ngrams.append(tuple(tokens[i:i+n]))  # tuples are hashable, so they can be dict keys
-
-    return ngrams
-
-def generate_distribution(unique_tokens, tokens_freq, bigrams_freq):
-    """Build an n x n table of bigram probabilities from unigram and bigram counts."""
-    n = len(unique_tokens)
-    # distribution[i][j] estimates P(unique_tokens[j] | unique_tokens[i])
-    distribution = [[0.0] * n for _ in range(n)]
-    for i in tqdm(range(n)):
-        denominator = tokens_freq[unique_tokens[i]]
-        for j in range(n):
-            # bigram keys are (first, second) tuples; unseen bigrams count as 0
-            numerator = bigrams_freq.get((unique_tokens[i], unique_tokens[j]), 0)
-            distribution[i][j] = numerator / denominator
-
-    return distribution
-
-with open('train/in.tsv', 'r') as f:
-    print('Reading corpus...')
-    corpus = []
-    for line in tqdm(f):
-        ctx = line.split('\t')[6:]
-        corpus.append(ctx[0] + 'BLANK' + ctx[1])
-
-print('Preprocessing corpus...')
-corpus = preprocess(' '.join(corpus))
-
-tokens = corpus.split()
-unique_tokens = sorted(set(tokens))  # indexable list of distinct tokens
-print('Generating tokens frequency...')
-tokens_freq = generate_freq(tokens)
-print('Generating n-grams...')
-bigrams = generate_ngrams(tokens, 2)
-print('Generating bigrams frequency...')
-bigrams_freq = generate_freq(bigrams)
-print('Generate distribution...')
-distribution = generate_distribution(unique_tokens, tokens_freq, bigrams_freq)
-
-
-with open('dev-0/in.tsv', 'r') as f:
-    print('Generating output...')
-    results = []
-    for line in tqdm(f):
-        ctx = line.split('\t')[6:]
-        last_word = preprocess(ctx[0]).split(' ')[-1]
-        try:
-            blank_word = unique_tokens[argmax(distribution[unique_tokens.index(last_word)])]
-        except:
-            blank_word = 'NONE'
-        results.append(blank_word)
-
-with open('dev-0/out.tsv', 'w') as f:
-    print('Writing output...')
-    for result in tqdm(results):
-        if result == 'NONE':
-            f.write('a:0.6 the:0.2 :0.2\n')
-        else:
-            f.write(f'{result}:0.9 :0.1\n')
--- a/ripped.py
+++ b/ripped.py
@@ -1,153 +0,0 @@
-import lzma
-import matplotlib.pyplot as plt
-from math import log
-from collections import OrderedDict
-from collections import Counter
-import regex as re
-from itertools import islice
-
-def freq_list(g, top=None):
-    c = Counter(g)
-
-    if top is None:
-       items = c.items()
-    else:
-       items = c.most_common(top)
-
-    return OrderedDict(sorted(items, key=lambda t: -t[1]))
-
-def get_words(t):
-    for m in re.finditer(r'[\p{L}0-9-\*]+', t):
-        yield m.group(0)
-
-def ngrams(iter, size):
-  ngram = []
-  for item in iter:
-    ngram.append(item)
-    if len(ngram) == size:
-        yield tuple(ngram)
-        ngram = ngram[1:]
-
-PREFIX_TRAIN = 'train' 
-words = []
-
-counter_lines = 0
-with lzma.open(f'{PREFIX_TRAIN}/in.tsv.xz', 'r') as train, open(f'{PREFIX_TRAIN}/expected.tsv', 'r') as expected:
-    for t_line, e_line in zip(train, expected):
-        t_line = t_line.decode("utf-8")
-
-        t_line = t_line.rstrip()
-        e_line = e_line.rstrip()
-
-        t_line_splitted_by_tab = t_line.split('\t')
-        
-        t_line_cleared = t_line_splitted_by_tab[-2] + ' ' + e_line + ' ' + t_line_splitted_by_tab[-1]
-
-        words += t_line_cleared.split()
-
-        counter_lines+=1
-        if counter_lines > 90000:
-            break
-
-# lzmaFile = lzma.open('dev-0/in.tsv.xz', 'rb')
-
-# content = lzmaFile.read().decode("utf-8")
-# words = get_words(trainset)
-
-ngrams_ = ngrams(words, 2)
-
-
-def create_probabilities_bigrams(w_c, b_c):
-    probabilities_bigrams = {}
-    for bigram, bigram_amount in b_c.items():
-        if bigram_amount <=2:
-            continue
-        p_word_before = bigram_amount / w_c[bigram[0]] 
-        p_word_after = bigram_amount / w_c[bigram[1]]
-        probabilities_bigrams[bigram] = (p_word_before, p_word_after)
-
-    return probabilities_bigrams
-
-words_c = Counter(words)
-word_=''
-bigram_c = Counter(ngrams_)
-ngrams_=''
-probabilities = create_probabilities_bigrams(words_c, bigram_c)
-
-
-items = probabilities.items()
-probabilities = OrderedDict(sorted(items, key=lambda t:t[1], reverse=True))
-items=''
-# sorted_by_freq = freq_list(ngrams)
-
-PREFIX_VALID = 'dev-0'
-
-def count_probabilities(w_b, w_a, probs, w_c, b_c):
-    results_before = {}
-    results_after = {}
-    for bigram, probses in probs.items():
-        if len(results_before) > 20 or len(results_after) > 20:
-            break
-        if w_b == bigram[0]:
-            results_before[bigram] = probses[0]
-        if w_a == bigram[1]:
-            results_after[bigram] = probses[1]
-    a=1
-    best_ = {}
-
-    for bigram, probses in results_before.items():
-        for bigram_2, probses_2 in results_after.items():
-            best_[bigram[1]] = probses * probses_2
-
-    for bigram, probses in results_after.items():
-            for bigram_2, probses_2 in results_before.items():
-                if bigram[0] in best_:
-                    if probses * probses_2 < probses_2:
-                        continue
-                best_[bigram[0]] = probses * probses_2
-
-    items = best_.items()
-    return OrderedDict(sorted(items, key=lambda t:t[1], reverse=True))
-
-
-with lzma.open(f'{PREFIX_VALID}/in.tsv.xz', 'r') as train:
-    for t_line in train:
-        t_line = t_line.decode("utf-8")
-
-        t_line = t_line.rstrip()
-        t_line = t_line.replace('\\n', ' ')
-
-
-        t_line_splitted_by_tab = t_line.split('\t')
-        
-
-        words_pre = t_line_splitted_by_tab[-2].split()
-
-        words_po = t_line_splitted_by_tab[-1].split()
-
-        w_pre = words_pre[-1]
-        w_po = words_po[0]
-
-        probs_ordered = count_probabilities(w_pre, w_po,probabilities, words_c, bigram_c)
-        if len(probs_ordered) ==0:
-            print(f"the:0.5 a:0.3 :0.2")
-            continue
-        result_string = ''
-        counter_ = 0
-        for word_, p in probs_ordered.items():
-            if counter_>4:
-                break
-            re_ = re.search(r'\p{L}+', word_)
-            if re_:
-                word_cleared = re_.group(0)
-                result_string += f"{word_cleared}:{str(p)} "
-
-            else:
-                if result_string == '':
-                    result_string = f"the:0.5 a:0.3 "
-                continue
-
-            counter_+=1
-        result_string += ':0.1'
-        print(result_string)
-        a=1
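Note on `ripped.py`: the script scores candidates for the gap by combining two bigram ratios, count(left, w) / count(left) and count(w, right) / count(right). The sketch below shows that combination on a toy token list. It is a simplified illustration, not a reimplementation: it omits the script's frequency threshold and candidate cap, and the name `score_gap_candidates` is hypothetical.

```python
from collections import Counter


def score_gap_candidates(tokens, left, right):
    """Rank words w seen in both 'left w' and 'w right' by the product of the two ratios."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (first, second), n_left in bigrams.items():
        if first != left:
            continue
        n_right = bigrams.get((second, right), 0)
        if n_right:
            forward = n_left / unigrams[left]     # count(left, w) / count(left)
            backward = n_right / unigrams[right]  # count(w, right) / count(right)
            scores[second] = forward * backward
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


tokens = 'a b c a b c a d c'.split()
ranked = score_gap_candidates(tokens, left='a', right='c')
print(ranked)
```

Requiring a candidate to be attested on both sides is stricter than the script, which also keeps candidates seen on one side only.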
--- a/run.py
+++ b/run.py
@@ -1,233 +0,0 @@
-import lzma
-import regex as re
-from torchtext.vocab import build_vocab_from_iterator
-from torch import nn
-import pickle
-from os.path import exists
-from torch.utils.data import IterableDataset
-import itertools
-from torch.utils.data import DataLoader
-import torch
-from matplotlib import pyplot as plt
-from tqdm import tqdm
-
-
-def get_words_from_line(line):
-    line = line.rstrip()
-    line = line.split("\t")
-    text = line[-2] + " " + line[-1]
-    text = re.sub(r"\\\\+n", " ", text)
-    text = re.sub('[^A-Za-z ]+', '', text)
-    for t in text.split():
-        yield t
-
-
-def get_word_lines_from_file(file_name):
-    with lzma.open(file_name, "r") as fh:
-        for line in fh:
-            yield get_words_from_line(line.decode("utf-8"))
-
-
-def look_ahead_iterator(gen):
-    first = None
-    second = None
-    for item in gen:
-        if first is not None and second is not None:
-            yield (first, second, item)
-        first = second
-        second = item
-
-
-class Trigrams(IterableDataset):
-    def __init__(self, text_file, vocabulary_size):
-        self.vocab = build_vocab_from_iterator(
-            get_word_lines_from_file(text_file),
-            max_tokens=vocabulary_size,
-            specials=["<unk>"],
-        )
-        self.vocab.set_default_index(self.vocab["<unk>"])
-        self.vocabulary_size = vocabulary_size
-        self.text_file = text_file
-
-    def __iter__(self):
-        return look_ahead_iterator(
-            (
-                self.vocab[t]
-                for t in itertools.chain.from_iterable(
-                    get_word_lines_from_file(self.text_file)
-                )
-            )
-        )
-
-
-class TrigramModel(nn.Module):
-    def __init__(self, vocab_size, embedding_dim, hidden_dim):
-        super(TrigramModel, self).__init__()
-        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
-        self.hidden = nn.Linear(embedding_dim * 2, hidden_dim)
-        self.output = nn.Linear(hidden_dim, vocab_size)
-        self.softmax = nn.Softmax(dim=1)
-
-    def forward(self, x, y):
-        x = self.embeddings(x)
-        y = self.embeddings(y)
-        z = self.hidden(torch.cat([x, y], dim=1))
-        z = self.output(z)
-        z = self.softmax(z)
-        return z
-
-
-embed_size = 500
-vocab_size = 20000
-vocab_path = "vocabulary.pickle"
-if exists(vocab_path):
-    print("Loading vocabulary from file...")
-    with open(vocab_path, "rb") as fh:
-        vocab = pickle.load(fh)
-else:
-    print("Building vocabulary...")
-    vocab = build_vocab_from_iterator(
-        get_word_lines_from_file("train/in.tsv.xz"),
-        max_tokens=vocab_size,
-        specials=["<unk>"],
-    )
-
-    with open(vocab_path, "wb") as fh:
-        pickle.dump(vocab, fh)
-
-device = "cuda" if torch.cuda.is_available() else "cpu"
-
-print("Using device:", device)
-dataset_path = 'train/dataset.pickle'
-if exists(dataset_path):
-    print("Loading dataset from file...")
-    with open(dataset_path, "rb") as fh:
-        train_dataset = pickle.load(fh)
-else:
-    print("Building dataset...")
-    train_dataset = Trigrams("train/in.tsv.xz", vocab_size)
-    with open(dataset_path, "wb") as fh:
-        pickle.dump(train_dataset, fh)
-
-print("Building model...")
-model = TrigramModel(vocab_size, embed_size, 64).to(device)
-data = DataLoader(train_dataset, batch_size=10000)
-optimizer = torch.optim.Adam(model.parameters())
-criterion = torch.nn.NLLLoss()
-
-print("Training model...")
-model.train()
-losses = []
-step = 0
-max_steps = 1000
-
-for x, y, z in tqdm(data):
-    x = x.to(device)
-    y = y.to(device)
-    z = z.to(device)
-
-    optimizer.zero_grad()
-    ypredicted = model(x, z)
-    loss = criterion(torch.log(ypredicted), y)
-    losses.append(loss.item())
-    loss.backward()
-    optimizer.step()
-    step += 1
-    if step > max_steps:
-        break
-
-plt.plot(losses)
-plt.show()
-
-torch.save(model.state_dict(), f"trigram_model-embed_{embed_size}.bin")
-
-vocab_unique = set(train_dataset.vocab.get_stoi().keys())
-
-output = []
-print('Predicting dev...')
-with lzma.open("dev-0/in.tsv.xz", encoding='utf8', mode="rt") as file:
-    for line in tqdm(file):
-        line = line.split("\t")
-
-        first_word = re.sub(r"\\\\+n", " ", line[-2]).split()[-1]
-        first_word = re.sub('[^A-Za-z]+', '', first_word)
-
-        next_word = re.sub(r"\\\\+n", " ", line[-1]).split()[0]
-        next_word = re.sub('[^A-Za-z]+', '', next_word)
-
-        if first_word not in vocab_unique:
-            first_word = "<unk>"
-        if next_word not in vocab_unique:
-            next_word = "<unk>"
-
-        first_word = torch.tensor(train_dataset.vocab.forward([first_word])).to(device)
-        next_word = torch.tensor(train_dataset.vocab.forward([next_word])).to(device)
-
-        out = model(first_word, next_word)
-
-        top = torch.topk(out[0], 10)
-        top_indices = top.indices.tolist()
-        top_probs = top.values.tolist()
-        unk_bonus = 1 - sum(top_probs)
-        top_words = vocab.lookup_tokens(top_indices)
-        top_zipped = list(zip(top_words, top_probs))
-
-        res = ""
-        for w, p in top_zipped:
-            if w == "<unk>":
-                res += f":{(p + unk_bonus):.4f} "
-            else:
-                res += f"{w}:{p:.4f} "
-
-        res = res[:-1]
-        res += "\n"
-        output.append(res)
-
-with open(f"dev-0/out-embed-{embed_size}.tsv", mode="w") as file:
-    file.writelines(output)
-
-
-model.eval()
-
-output = []
-print('Predicting test...')
-with lzma.open("test-A/in.tsv.xz", encoding='utf8', mode="rt") as file:
-    for line in tqdm(file):
-        line = line.split("\t")
-
-        first_word = re.sub(r"\\\\+n", " ", line[-2]).split()[-1]
-        first_word = re.sub('[^A-Za-z]+', '', first_word)
-
-        next_word = re.sub(r"\\\\+n", " ", line[-1]).split()[0]
-        next_word = re.sub('[^A-Za-z]+', '', next_word)
-
-        if first_word not in vocab_unique:
-            first_word = "<unk>"
-        if next_word not in vocab_unique:
-            next_word = "<unk>"
-
-        first_word = torch.tensor(train_dataset.vocab.forward([first_word])).to(device)
-        next_word = torch.tensor(train_dataset.vocab.forward([next_word])).to(device)
-
-        out = model(first_word, next_word)
-
-        top = torch.topk(out[0], 10)
-        top_indices = top.indices.tolist()
-        top_probs = top.values.tolist()
-        unk_bonus = 1 - sum(top_probs)
-        top_words = vocab.lookup_tokens(top_indices)
-        top_zipped = list(zip(top_words, top_probs))
-
-        res = ""
-        for w, p in top_zipped:
-            if w == "<unk>":
-                res += f":{(p + unk_bonus):.4f} "
-            else:
-                res += f"{w}:{p:.4f} "
-
-        res = res[:-1]
-        res += "\n"
-        output.append(res)
-
-with open(f"test-A/out-embed-{embed_size}.tsv", mode="w") as file:
-    file.writelines(output)
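Note on `run.py`: the prediction loops turn the top-k softmax outputs into a `word:prob` line and add the leftover mass to the `<unk>` entry, which is written with an empty word. As written, the leftover mass is only emitted when `<unk>` happens to be among the top k. The sketch below always emits a final empty-word entry. It works on a plain dictionary rather than a tensor, and the function name `format_topk` and the sample probabilities are hypothetical.

```python
def format_topk(probs, k=10, unk='<unk>'):
    """Format the k most probable words as 'word:prob ... :rest'."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    # probability mass not covered by the top-k entries
    rest = 1.0 - sum(p for _, p in top)
    parts = []
    for word, p in top:
        if word == unk:
            rest += p  # fold the unknown-word mass into the empty-word entry
        else:
            parts.append(f'{word}:{p:.4f}')
    parts.append(f':{max(rest, 0.0):.4f}')
    return ' '.join(parts)


sample = {'the': 0.5, 'a': 0.25, '<unk>': 0.125, 'of': 0.125}
line = format_topk(sample, k=3)
print(line)
```

Clamping the leftover at zero guards against small negative values caused by floating-point rounding in the sum.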
--- a/test-A/in.tsv.xz
+++ b/test-A/in.tsv.xz
--- a/test-A/out-embed-100.tsv
+++ b/test-A/out-embed-100.tsv
--- a/test-A/out-embed-500.tsv
+++ b/test-A/out-embed-500.tsv
--- a/trigram_model-50_steps-embed_100.bin
+++ b/trigram_model-50_steps-embed_100.bin
--- a/trigram_model-embed_100.bin
+++ b/trigram_model-embed_100.bin
--- a/trigram_model-embed_500.bin
+++ b/trigram_model-embed_500.bin