Remove unused files

2023-06-27 19:15:34 +02:00
commit b5b575bd45
parent 14d3dc0e04
16 changed files with 0 additions and 47572 deletions
--- a/09_Zanurzenia_slow.ipynb
+++ b/09_Zanurzenia_slow.ipynb
@@ -1,634 +0,0 @@
 {
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![Logo 1](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech1.jpg)\n",
    "<div class=\"alert alert-block alert-info\">\n",
    "<h1> Language modeling</h1>\n",
    "<h2> 09. <i>Word embeddings (Word2vec)</i>  [lecture]</h2> \n",
    "<h3> Filip Graliński (2022)</h3>\n",
    "</div>\n",
    "\n",
    "![Logo 2](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech2.jpg)\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Word embeddings (Word2vec)\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In practice, the applicability of wordnets turned out to be surprisingly\n",
    "limited. A greater breakthrough in natural language processing came from\n",
    "multidimensional word representations, also known as word embeddings.\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Word “dimensions”\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We could embed words in a multidimensional space, i.e. define a mapping\n",
    "$E \\colon V \\rightarrow \\mathbb{R}^m$ for some $m$, and choose a way of estimating\n",
    "the probabilities $P(u|v)$ such that, whenever $E(v)$ and $E(v')$ as well as $E(u)$ and $E(u')$ lie close to each other\n",
    "(according to some distance metric, for example the ordinary Euclidean distance), we have:\n",
    "\n",
    "$$P(u|v) \\approx P(u'|v').$$\n",
    "\n",
    "$E(u)$ is called the embedding of the word $u$.\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Dimensions fixed in advance?\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One could imagine that the $m$ dimensions might be fixed in advance\n",
    "by a linguist. These dimensions would correspond to typical\n",
    "“axes” considered in linguistics, for example:\n",
    "\n",
    "-   is the word vulgar, coarse, colloquial, neutral or literary?\n",
    "-   is the word archaic, falling out of use, or a neologism?\n",
    "-   does the word refer to women or to men (in the sense of grammatical gender and/or\n",
    "    sociolinguistically)?\n",
    "-   is the word singular or plural?\n",
    "-   is the word a noun or a verb?\n",
    "-   is the word native or a loanword?\n",
    "-   is the word a proper name or a common word?\n",
    "-   does the word describe a concrete thing or an abstract concept?\n",
    "-   …\n",
    "\n",
    "In practice, however, it turned out to be better to let the computer learn\n",
    "the possible dimensions by itself; we fix in advance only $m$ (the number of dimensions).\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### A bigram language model based on embeddings\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will now build the simplest language model based on embeddings. It will in fact be the simplest\n",
    "**neural language model**, since the resulting model can be treated as a simple neural network.\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Vocabulary\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In a typical neural language model the vocabulary size must be bounded\n",
    "in advance. Usually it is on the order of tens of thousands of words:\n",
    "we simply consider the $|V|$ most frequent words and replace the rest\n",
    "with a special `<unk>` token representing an unknown word.\n",
    "\n",
    "To build such a vocabulary, we will use the ready-made `Vocab` class from the torchtext package:\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "from itertools import islice\n",
    "import regex as re\n",
    "import sys\n",
    "from torchtext.vocab import build_vocab_from_iterator\n",
    "import pickle\n",
    "import lzma"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1027"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from itertools import islice\n",
    "import regex as re\n",
    "import sys\n",
    "from torchtext.vocab import build_vocab_from_iterator\n",
    "import lzma\n",
    "\n",
    "\n",
    "def get_words_from_line(line):\n",
    "  line = line.rstrip()\n",
    "  yield '<s>'\n",
    "  for m in re.finditer(r'[\\p{L}0-9\\*]+|\\p{P}+', line):\n",
    "     yield m.group(0).lower()\n",
    "  yield '</s>'\n",
    "\n",
    "\n",
    "def get_word_lines_from_file(file_name):\n",
    "  with lzma.open(file_name, 'r') as fh:\n",
    "    for line in fh:\n",
    "       yield get_words_from_line(line.decode('utf-8'))\n",
    "\n",
    "vocab_size = 20000\n",
    "\n",
    "vocab = build_vocab_from_iterator(\n",
    "    get_word_lines_from_file('train/in.tsv.xz'),\n",
    "    max_tokens = vocab_size,\n",
    "    specials = ['<unk>'])\n",
    "\n",
    "vocab['human']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['<unk>', '\\\\', 'the', '-\\\\', 'nmighty']"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "vocab.lookup_tokens([0, 1, 2, 10, 12345])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "with open('vocabulary.pickle', 'wb') as fh:\n",
    "  pickle.dump(vocab, fh)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Network definition\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will implement our simple neural network using the PyTorch framework.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tensor(2.9869e-05, grad_fn=<SelectBackward0>)"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from torch import nn\n",
    "import torch\n",
    "\n",
    "embed_size = 100\n",
    "\n",
    "class SimpleBigramNeuralLanguageModel(nn.Module):\n",
    "  def __init__(self, vocabulary_size, embedding_size):\n",
    "      super(SimpleBigramNeuralLanguageModel, self).__init__()\n",
    "      self.model = nn.Sequential(\n",
    "          nn.Embedding(vocabulary_size, embedding_size),\n",
    "          nn.Linear(embedding_size, vocabulary_size),\n",
    "          nn.Softmax(dim=1)\n",
    "      )\n",
    "\n",
    "  def forward(self, x):\n",
    "      return self.model(x)\n",
    "\n",
    "model = SimpleBigramNeuralLanguageModel(vocab_size, embed_size)\n",
    "\n",
    "vocab.set_default_index(vocab['<unk>'])\n",
    "ixs = torch.tensor(vocab.forward(['is']))\n",
    "out = model(ixs)\n",
    "out[0][vocab['the']]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let us train the model. First, let us just shuffle our file:\n",
    "\n",
    "    shuf < opensubtitlesA.pl.txt > opensubtitlesA.pl.shuf.txt\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "from torch.utils.data import IterableDataset\n",
    "import itertools\n",
    "\n",
    "def look_ahead_iterator(gen):\n",
    "   prev = None\n",
    "   for item in gen:\n",
    "      if prev is not None:\n",
    "         yield (prev, item)\n",
    "      prev = item\n",
    "\n",
    "class Bigrams(IterableDataset):\n",
    "  def __init__(self, text_file, vocabulary_size):\n",
    "      self.vocab = build_vocab_from_iterator(\n",
    "         get_word_lines_from_file(text_file),\n",
    "         max_tokens = vocabulary_size,\n",
    "         specials = ['<unk>'])\n",
    "      self.vocab.set_default_index(self.vocab['<unk>'])\n",
    "      self.vocabulary_size = vocabulary_size\n",
    "      self.text_file = text_file\n",
    "\n",
    "  def __iter__(self):\n",
    "     return look_ahead_iterator(\n",
    "         (self.vocab[t] for t in itertools.chain.from_iterable(get_word_lines_from_file(self.text_file))))\n",
    "\n",
    "train_dataset = Bigrams('train/in.tsv.xz', vocab_size)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(43, 0)"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from torch.utils.data import DataLoader\n",
    "\n",
    "next(iter(train_dataset))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[tensor([   2,    5,   51, 3481,  231]), tensor([   5,   51, 3481,  231,    4])]"
     ]
    }
   ],
   "source": [
    "from torch.utils.data import DataLoader\n",
    "\n",
    "next(iter(DataLoader(train_dataset, batch_size=5)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "None"
     ]
    }
   ],
   "source": [
    "device = 'cpu'\n",
    "model = SimpleBigramNeuralLanguageModel(vocab_size, embed_size).to(device)\n",
    "data = DataLoader(train_dataset, batch_size=5000)\n",
    "optimizer = torch.optim.Adam(model.parameters())\n",
    "criterion = torch.nn.NLLLoss()\n",
    "\n",
    "model.train()\n",
    "step = 0\n",
    "for x, y in data:\n",
    "   x = x.to(device)\n",
    "   y = y.to(device)\n",
    "   optimizer.zero_grad()\n",
    "   ypredicted = model(x)\n",
    "   loss = criterion(torch.log(ypredicted), y)\n",
    "   if step % 100 == 0:\n",
    "      print(step, loss)\n",
    "   step += 1\n",
    "   loss.backward()\n",
    "   optimizer.step()\n",
    "\n",
    "torch.save(model.state_dict(), 'model1.bin')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let us compute the most probable continuations for a given word:\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[('ciebie', 73, 0.1580502986907959), ('mnie', 26, 0.15395283699035645), ('<unk>', 0, 0.12862136960029602), ('nas', 83, 0.0410110242664814), ('niego', 172, 0.03281523287296295), ('niej', 245, 0.02104802615940571), ('siebie', 181, 0.020788608118891716), ('którego', 365, 0.019379809498786926), ('was', 162, 0.013852755539119244), ('wszystkich', 235, 0.01381855271756649)]"
     ]
    }
   ],
   "source": [
    "device = 'cuda'\n",
    "model = SimpleBigramNeuralLanguageModel(vocab_size, embed_size).to(device)\n",
    "model.load_state_dict(torch.load('model1.bin'))\n",
    "model.eval()\n",
    "\n",
    "ixs = torch.tensor(vocab.forward(['dla'])).to(device)\n",
    "\n",
    "out = model(ixs)\n",
    "top = torch.topk(out[0], 10)\n",
    "top_indices = top.indices.tolist()\n",
    "top_probs = top.values.tolist()\n",
    "top_words = vocab.lookup_tokens(top_indices)\n",
    "list(zip(top_words, top_indices, top_probs))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let us examine the most similar embeddings for a given word:\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[('.', 3, 0.404473215341568), (',', 4, 0.14222915470600128), ('z', 14, 0.10945753753185272), ('?', 6, 0.09583134204149246), ('w', 10, 0.050338443368673325), ('na', 12, 0.020703863352537155), ('i', 11, 0.016762692481279373), ('<unk>', 0, 0.014571071602404118), ('...', 15, 0.01453721895813942), ('</s>', 1, 0.011769450269639492)]"
     ]
    }
   ],
   "source": [
    "vocab = train_dataset.vocab\n",
    "ixs = torch.tensor(vocab.forward(['kłopot'])).to(device)\n",
    "\n",
    "out = model(ixs)\n",
    "top = torch.topk(out[0], 10)\n",
    "top_indices = top.indices.tolist()\n",
    "top_probs = top.values.tolist()\n",
    "top_words = vocab.lookup_tokens(top_indices)\n",
    "list(zip(top_words, top_indices, top_probs))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[('poszedł', 1087, 1.0), ('idziesz', 1050, 0.4907470941543579), ('przyjeżdża', 4920, 0.45242372155189514), ('pojechałam', 12784, 0.4342481195926666), ('wrócił', 1023, 0.431664377450943), ('dobrać', 10351, 0.4312002956867218), ('stałeś', 5738, 0.4258835017681122), ('poszła', 1563, 0.41979148983955383), ('trafiłam', 18857, 0.4109022617340088), ('jedzie', 1674, 0.4091658890247345)]"
     ]
    }
   ],
   "source": [
    "cos = nn.CosineSimilarity(dim=1, eps=1e-6)\n",
    "\n",
    "embeddings = model.model[0].weight\n",
    "\n",
    "vec = embeddings[vocab['poszedł']]\n",
    "\n",
    "similarities = cos(vec, embeddings)\n",
    "\n",
    "top = torch.topk(similarities, 10)\n",
    "\n",
    "top_indices = top.indices.tolist()\n",
    "top_probs = top.values.tolist()\n",
    "top_words = vocab.lookup_tokens(top_indices)\n",
    "list(zip(top_words, top_indices, top_probs))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Written as a mathematical formula\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The neural network programmed above can be described by the following formula:\n",
    "\n",
    "$$\\vec{y} = \\operatorname{softmax}(CE(w_{i-1})),$$\n",
    "\n",
    "where:\n",
    "\n",
    "-   $w_{i-1}$ is the first word of the bigram (the preceding word),\n",
    "-   $E(w)$ is the embedding of the word $w$, a vector of size $m$,\n",
    "-   $C$ is a matrix of size $|V| \\times m$ that projects the embedding vector onto a vector of vocabulary size,\n",
    "-   $\\vec{y}$ is the output probability vector of size $|V|$.\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Hyperparameters\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that our model has two hyperparameters:\n",
    "\n",
    "-   $m$: the embedding size,\n",
    "-   $|V|$: the vocabulary size, assuming that we can control\n",
    "    the vocabulary size (e.g. by truncating the vocabulary to a given number\n",
    "    of the most frequent words and replacing the rest with a special token, say `<UNK>`).\n",
    "\n",
    "Of course, we can try to tune the values of $m$ and $|V|$ in order to\n",
    "improve the results of our model.\n",
    "\n",
    "**Question**: why does a value $m \\approx |V|$ make no sense? Why does $m = 1$ make no sense?\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Network diagram\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since multiplication by the matrix ($C$) simply means applying\n",
    "a linear layer, our network can be interpreted as a single-layer\n",
    "neural network, which can be illustrated with the following diagram:\n",
    "\n",
    "![img](./09_Zanurzenia_slow/bigram1.drawio.png \"Diagram of a simple bigram neural language model\")\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Embedding as matrix multiplication\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Obtaining the embedding ($E(w)$) is usually implemented as\n",
    "a *look-up*. Interestingly, the embedding can also be interpreted as\n",
    "multiplication by an embedding matrix $E$ of size $m \\times |V|$, provided that the input word is encoded as\n",
    "a one-hot vector (*one-hot encoding*), i.e. the word $w$ is\n",
    "given at the input as a vector $\\vec{1_V}(w) = [0,\\ldots,0,1,0\\ldots,0]$ of size $|V|$\n",
    "consisting of zeros except for a one at the position corresponding to the index of the word $w$ in the vocabulary $V$.\n",
    "\n",
    "The formula then takes the form:\n",
    "\n",
    "$$\\vec{y} = \\operatorname{softmax}(CE\\vec{1_V}(w_{i-1})),$$\n",
    "\n",
    "where $E$ is this time an $m \\times |V|$ matrix.\n",
    "\n",
    "**Question**: do we interpret $\\vec{1_V}(w)$ as a row vector or a column vector?\n",
    "\n",
    "This interpretation can be illustrated with the following diagram:\n",
    "\n",
    "![img](./09_Zanurzenia_slow/bigram2.drawio.png \"Diagram of a simple bigram neural language model with one-hot input\")\n",
    "\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.7"
  },
  "org": null
 },
 "nbformat": 4,
 "nbformat_minor": 1
 }
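A note on the last two sections of the removed notebook: the formula with the look-up, softmax(C E(w)), and the one with the one-hot input vector describe the same computation. The sketch below checks this on a toy example in plain Python (no PyTorch, to keep it dependency-free). The vocabulary, the matrix values, and the helper names (`one_hot`, `matvec`, `lookup`, `next_word_distribution`) are made up for illustration and are not taken from the trained model.

```python
import math

# Toy sizes (assumed for illustration): |V| = 4 words, m = 2 dimensions.
VOCAB = ["<unk>", "the", "cat", "sat"]

# E has shape m x |V|: column j holds the embedding of word j.
E = [
    [0.1, 0.5, -0.3, 0.8],
    [0.0, -0.2, 0.7, 0.4],
]

# C has shape |V| x m: it projects an embedding back to vocabulary size.
C = [
    [0.2, -0.1],
    [0.4, 0.3],
    [-0.5, 0.6],
    [0.1, 0.9],
]


def one_hot(index, size):
    """Vector of zeros with a single one at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a column vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lookup(index):
    """Embedding look-up: read column `index` of E directly."""
    return [row[index] for row in E]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def next_word_distribution(word):
    """softmax(C E(w)); unknown words fall back to index 0 (<unk>)."""
    index = VOCAB.index(word) if word in VOCAB else 0
    return softmax(matvec(C, lookup(index)))


idx = VOCAB.index("cat")
via_lookup = lookup(idx)
via_matmul = matvec(E, one_hot(idx, len(VOCAB)))
probs = next_word_distribution("cat")
print(via_lookup, via_matmul, round(sum(probs), 6))
```

The look-up is preferred in practice because it reads one column instead of multiplying by a mostly-zero vector of size |V|.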
--- a/dev-0/out-embed-100.tsv
+++ b/dev-0/out-embed-100.tsv
--- a/dev-0/out-embed-500.tsv
+++ b/dev-0/out-embed-500.tsv
--- a/dev-0/out.tsv
+++ b/dev-0/out.tsv
--- a/gonito.yaml
+++ b/gonito.yaml
@@ -1,13 +0,0 @@
 description: nn, trigram, previous and next
 tags:
  - neural-network
  - trigram
 params:
  epochs: 1
  vocab-size: 20000
  batch-size: 10000
  embed-size:
    - 100
    - 500
    - 1000
  topk: 10
--- a/lm0.py
+++ b/lm0.py
@@ -1,15 +0,0 @@
 import sys
 import random
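 # Hand-written distributions in the 'word:probability' output format;
 # the entry with an empty word holds the mass left for all other words.
 distribs = [
    'a:0.6 the:0.2 :0.2',
    'the:0.7 a:0.2 :0.1',
    'the:0.9 :0.1',
    'the:0.3 be:0.2 to:0.15 of:0.15 and:0.05 a:0.05 in:0.05 :0.05',
 ]
 # Baseline: ignore the input context and print a randomly chosen distribution per input line.
 for line in sys.stdin:
    print(random.choice(distribs))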
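The strings in `distribs` appear to follow a `word:probability` format in which the entry with an empty word carries the probability mass left for all other words. That reading is an assumption based on the literals above and on the `unk_bonus` handling in `run.py`, not on a format specification. Under this assumption, a small checker can confirm that a line is a proper distribution; `parse_distribution` and `is_normalised` are hypothetical helper names.

```python
def parse_distribution(line):
    """Split 'a:0.6 the:0.2 :0.2' into ({'a': 0.6, 'the': 0.2}, 0.2)."""
    words = {}
    rest = 0.0
    for entry in line.split():
        # rpartition splits on the last colon, so ':0.2' yields an empty word.
        word, _, prob = entry.rpartition(":")
        if word:
            words[word] = float(prob)
        else:
            rest += float(prob)
    return words, rest


def is_normalised(line, tolerance=1e-6):
    """True if the listed probabilities plus the remainder sum to one."""
    words, rest = parse_distribution(line)
    return abs(sum(words.values()) + rest - 1.0) <= tolerance


words, rest = parse_distribution("a:0.6 the:0.2 :0.2")
print(words, rest, is_normalised("a:0.6 the:0.2 :0.2"))
```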
--- a/lm1.py
+++ b/lm1.py
@@ -1,56 +0,0 @@
 import sys
 import random
 from tqdm import tqdm
 from collections import defaultdict
 import pickle
 import os
 corpus = []
 with open('train/in.tsv', 'r') as f:
    print('Reading corpus...')
    for line in tqdm(f):
        ctx = line.split('\t')[6:]
        corpus.append(ctx[0] + 'BLANK' + ctx[1])
 corpus = ' '.join(corpus)
 corpus = corpus.replace('-\n', '')
 corpus = corpus.replace('\\n', ' ')
 corpus = corpus.replace('\n', ' ')
 corpus = corpus.split(' ')
 if os.path.exists('distrib.pkl'):
    print('Loading distribution...')
    # Use a context manager so the file handle is closed after loading.
    with open('distrib.pkl', 'rb') as f:
        distrib = pickle.load(f)
 else:
    print('Generating distribution...')
    distrib = defaultdict(lambda: defaultdict(int))
    for i in tqdm(range(len(corpus) - 1)):
        distrib[corpus[i]][corpus[i+1]] += 1
    with open('distrib.pkl', 'wb') as f:
        print('Saving distribution...')
        pickle.dump(dict(distrib), f)
 results = []
 with open('dev-0/in.tsv', 'r') as f:
    print('Generating output...')
    for line in tqdm(f):
        ctx = line.split('\t')[6:]
        last_word = ctx[0].split(' ')[-1]
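        try:
            blank_word = max(distrib[last_word], key=distrib[last_word].get)
        except (KeyError, ValueError):
            # KeyError: unseen word in a loaded plain dict; ValueError: max() over no successors.
            blank_word = 'NONE'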
        results.append(blank_word)
 with open('dev-0/out.tsv', 'w') as f:
    print('Writing output...')
    for result in tqdm(results):
        # Terminate each prediction with a newline so every input row gets its own output row.
        if result == 'NONE':
            f.write('a:0.6 the:0.2 :0.2\n')
        else:
            f.write(f'{result}:0.9 :0.1\n')
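The core of `lm1.py` is a table of successor counts and a look-up of the most frequent successor. A minimal sketch of that idea on a toy token list follows. It uses `collections.Counter` instead of nested `defaultdict(int)`, which is a substitution made here for brevity; `count_successors` and `most_likely_successor` are hypothetical names.

```python
from collections import Counter, defaultdict


def count_successors(tokens):
    """Map each word to a Counter of the words that directly follow it."""
    successors = defaultdict(Counter)
    for current, following in zip(tokens, tokens[1:]):
        successors[current][following] += 1
    return successors


def most_likely_successor(successors, word, fallback=None):
    """Return the most frequent successor of `word`, or `fallback` if unseen."""
    counts = successors.get(word)  # .get() does not create an empty entry
    if not counts:
        return fallback
    return counts.most_common(1)[0][0]


tokens = "the cat sat on the mat and the cat slept".split()
successors = count_successors(tokens)
print(most_likely_successor(successors, "the"))
print(most_likely_successor(successors, "dog", fallback="NONE"))
```

Returning an explicit fallback avoids a broad `try`/`except` around the look-up.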
--- a/lmn.py
+++ b/lmn.py
@@ -1,83 +0,0 @@
 from tqdm import tqdm
 def preprocess(corpus):
    corpus = corpus.replace('-\n', '')
    corpus = corpus.replace('\\n', ' ')
    corpus = corpus.replace('\n', ' ')
    corpus = corpus.replace('.', ' EOS')
    return corpus
 def generate_freq(tokens):
    tokens_freq = {}
    for token in tqdm(tokens):
        if token not in tokens_freq:
            tokens_freq[token] = 1
        else:
            tokens_freq[token] += 1
    return tokens_freq
 def generate_ngrams(tokens, n):
    ngrams = []
    for i in tqdm(range(len(tokens) - n + 1)):
        # Tuples are hashable, so n-grams can be counted as dictionary keys.
        ngrams.append(tuple(tokens[i:i+n]))
    return ngrams
 def generate_distribution(tokens_freq, bigrams_freq):
    # Sparse conditional distribution: distribution[first][second] = count(first, second) / count(first).
    distribution = {}
    for (first, second), count in tqdm(bigrams_freq.items()):
        distribution.setdefault(first, {})[second] = count / tokens_freq[first]
    return distribution
 with open('train/in.tsv', 'r') as f:
    print('Reading corpus...')
    corpus = []
    for line in tqdm(f):
        ctx = line.split('\t')[6:]
        corpus.append(ctx[0] + 'BLANK' + ctx[1])
 print('Preprocessing corpus...')
 corpus = preprocess(' '.join(corpus))
 tokens = corpus.split()
 print('Generating tokens frequency...')
 tokens_freq = generate_freq(tokens)
 print('Generating n-grams...')
 bigrams = generate_ngrams(tokens, 2)
 print('Generating bigrams frequency...')
 bigrams_freq = generate_freq(bigrams)
 print('Generate distribution...')
 distribution = generate_distribution(tokens_freq, bigrams_freq)
 with open('dev-0/in.tsv', 'r') as f:
    print('Generating output...')
    results = []
    for line in tqdm(f):
        ctx = line.split('\t')[6:]
        last_word = preprocess(ctx[0]).split(' ')[-1]
        try:
            # Pick the most probable successor of the last word before the gap.
            successors = distribution[last_word]
            blank_word = max(successors, key=successors.get)
        except KeyError:
            blank_word = 'NONE'
        results.append(blank_word)
 with open('dev-0/out.tsv', 'w') as f:
    print('Writing output...')
    for result in tqdm(results):
        # Terminate each prediction with a newline so every input row gets its own output row.
        if result == 'NONE':
            f.write('a:0.6 the:0.2 :0.2\n')
        else:
            f.write(f'{result}:0.9 :0.1\n')
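`lmn.py` estimates the probability of a word given the previous one as a ratio of counts. The same relative-frequency idea extends to any n when the first n-1 tokens are treated as the context. The sketch below is an illustration on a toy sequence. Unlike the script, it normalises by the number of n-grams that start with the context, so each conditional distribution sums to one; `conditional_distribution` is a hypothetical name.

```python
from collections import Counter, defaultdict


def ngrams(tokens, n):
    """All contiguous n-grams of `tokens` as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def conditional_distribution(tokens, n):
    """Return {context: {word: P(word | context)}} from relative frequencies."""
    counts = defaultdict(Counter)
    for gram in ngrams(tokens, n):
        # The context is the first n-1 tokens, the predicted word is the last one.
        counts[gram[:-1]][gram[-1]] += 1
    distribution = {}
    for context, followers in counts.items():
        total = sum(followers.values())
        distribution[context] = {w: c / total for w, c in followers.items()}
    return distribution


tokens = "a b a c a b".split()
bigram_dist = conditional_distribution(tokens, 2)
trigram_dist = conditional_distribution(tokens, 3)
print(bigram_dist[("a",)])
print(trigram_dist[("a", "b")])
```

A dictionary keyed by context stores only the n-grams that occur, which avoids the |V| x |V| table a dense matrix would need.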
--- a/ripped.py
+++ b/ripped.py
@@ -1,153 +0,0 @@
 import lzma
 import matplotlib.pyplot as plt
 from math import log
 from collections import OrderedDict
 from collections import Counter
 import regex as re
 from itertools import islice
 def freq_list(g, top=None):
    c = Counter(g)
    if top is None:
       items = c.items()
    else:
       items = c.most_common(top)
    return OrderedDict(sorted(items, key=lambda t: -t[1]))
 def get_words(t):
    for m in re.finditer(r'[\p{L}0-9-\*]+', t):
        yield m.group(0)
 def ngrams(iter, size):
  ngram = []
  for item in iter:
    ngram.append(item)
    if len(ngram) == size:
        yield tuple(ngram)
        ngram = ngram[1:]
 PREFIX_TRAIN = 'train' 
 words = []
 counter_lines = 0
 with lzma.open(f'{PREFIX_TRAIN}/in.tsv.xz', 'r') as train, open(f'{PREFIX_TRAIN}/expected.tsv', 'r') as expected:
    for t_line, e_line in zip(train, expected):
        t_line = t_line.decode("utf-8")
        t_line = t_line.rstrip()
        e_line = e_line.rstrip()
        t_line_splitted_by_tab = t_line.split('\t')
        t_line_cleared = t_line_splitted_by_tab[-2] + ' ' + e_line + ' ' + t_line_splitted_by_tab[-1]
        words += t_line_cleared.split()
        counter_lines+=1
        if counter_lines > 90000:
            break
 # lzmaFile = lzma.open('dev-0/in.tsv.xz', 'rb')
 # content = lzmaFile.read().decode("utf-8")
 # words = get_words(trainset)
 ngrams_ = ngrams(words, 2)
 def create_probabilities_bigrams(w_c, b_c):
    probabilities_bigrams = {}
    for bigram, bigram_amount in b_c.items():
        if bigram_amount <=2:
            continue
        p_word_before = bigram_amount / w_c[bigram[0]] 
        p_word_after = bigram_amount / w_c[bigram[1]]
        probabilities_bigrams[bigram] = (p_word_before, p_word_after)
    return probabilities_bigrams
 words_c = Counter(words)
 word_=''
 bigram_c = Counter(ngrams_)
 ngrams_=''
 probabilities = create_probabilities_bigrams(words_c, bigram_c)
 items = probabilities.items()
 probabilities = OrderedDict(sorted(items, key=lambda t:t[1], reverse=True))
 items=''
 # sorted_by_freq = freq_list(ngrams)
 PREFIX_VALID = 'dev-0'
 def count_probabilities(w_b, w_a, probs, w_c, b_c):
    results_before = {}
    results_after = {}
    for bigram, probses in probs.items():
        if len(results_before) > 20 or len(results_after) > 20:
            break
        if w_b == bigram[0]:
            results_before[bigram] = probses[0]
        if w_a == bigram[1]:
            results_after[bigram] = probses[1]
    a=1
    best_ = {}
    for bigram, probses in results_before.items():
        for bigram_2, probses_2 in results_after.items():
            best_[bigram[1]] = probses * probses_2
    for bigram, probses in results_after.items():
            for bigram_2, probses_2 in results_before.items():
                if bigram[0] in best_:
                    if probses * probses_2 < probses_2:
                        continue
                best_[bigram[0]] = probses * probses_2
    items = best_.items()
    return OrderedDict(sorted(items, key=lambda t:t[1], reverse=True))
 with lzma.open(f'{PREFIX_VALID}/in.tsv.xz', 'r') as train:
    for t_line in train:
        t_line = t_line.decode("utf-8")
        t_line = t_line.rstrip()
        t_line = t_line.replace('\\n', ' ')
        t_line_splitted_by_tab = t_line.split('\t')
        words_pre = t_line_splitted_by_tab[-2].split()
        words_po = t_line_splitted_by_tab[-1].split()
        w_pre = words_pre[-1]
        w_po = words_po[0]
        probs_ordered = count_probabilities(w_pre, w_po,probabilities, words_c, bigram_c)
        if len(probs_ordered) ==0:
            print(f"the:0.5 a:0.3 :0.2")
            continue
        result_string = ''
        counter_ = 0
        for word_, p in probs_ordered.items():
            if counter_>4:
                break
            re_ = re.search(r'\p{L}+', word_)
            if re_:
                word_cleared = re_.group(0)
                result_string += f"{word_cleared}:{str(p)} "
            else:
                if result_string == '':
                    result_string = f"the:0.5 a:0.3 "
                continue
            counter_+=1
        result_string += ':0.1'
        print(result_string)
        a=1
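`ripped.py` fills a gap by combining evidence from the word before it and the word after it. The sketch below is a simplified reading of that idea, not a reproduction of the script's exact logic. A candidate `c` is scored by the relative frequency of the bigram (left, c) among occurrences of `left`, multiplied by the relative frequency of (c, right) among occurrences of `right`. The names `train` and `score_candidates` and the toy sentence are assumptions made for illustration.

```python
from collections import Counter


def train(tokens):
    """Count unigrams and adjacent bigrams in a token list."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def score_candidates(left, right, unigrams, bigrams):
    """Rank words seen both directly after `left` and directly before `right`."""
    scores = {}
    for (first, candidate), count in bigrams.items():
        if first != left:
            continue
        joint = bigrams.get((candidate, right), 0)
        if joint == 0:
            continue
        forward = count / unigrams[left]     # share of `left` followed by the candidate
        backward = joint / unigrams[right]   # share of `right` preceded by the candidate
        scores[candidate] = forward * backward
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


tokens = "i like green tea . i like black tea . i drink green tea".split()
unigrams, bigrams = train(tokens)
ranking = score_candidates("like", "tea", unigrams, bigrams)
print(ranking)
```

Candidates supported by only one side are dropped here. A real submission would still need a fallback distribution for gaps with no surviving candidate, as the script provides.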
--- a/run.py
+++ b/run.py
@@ -1,233 +0,0 @@
 import lzma
 import regex as re
 from torchtext.vocab import build_vocab_from_iterator
 from torch import nn
 import pickle
 from os.path import exists
 from torch.utils.data import IterableDataset
 import itertools
 from torch.utils.data import DataLoader
 import torch
 from matplotlib import pyplot as plt
 from tqdm import tqdm
 def get_words_from_line(line):
    line = line.rstrip()
    line = line.split("\t")
    text = line[-2] + " " + line[-1]
    text = re.sub(r"\\\\+n", " ", text)
    text = re.sub('[^A-Za-z ]+', '', text)
    for t in text.split():
        yield t
 def get_word_lines_from_file(file_name):
    with lzma.open(file_name, "r") as fh:
        for line in fh:
            yield get_words_from_line(line.decode("utf-8"))
 def look_ahead_iterator(gen):
    first = None
    second = None
    for item in gen:
        if first is not None and second is not None:
            yield (first, second, item)
        first = second
        second = item
 class Trigrams(IterableDataset):
    def __init__(self, text_file, vocabulary_size):
        self.vocab = build_vocab_from_iterator(
            get_word_lines_from_file(text_file),
            max_tokens=vocabulary_size,
            specials=["<unk>"],
        )
        self.vocab.set_default_index(self.vocab["<unk>"])
        self.vocabulary_size = vocabulary_size
        self.text_file = text_file
    def __iter__(self):
        return look_ahead_iterator(
            (
                self.vocab[t]
                for t in itertools.chain.from_iterable(
                    get_word_lines_from_file(self.text_file)
                )
            )
        )
 class TrigramModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim):
        super(TrigramModel, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.hidden = nn.Linear(embedding_dim * 2, hidden_dim)
        self.output = nn.Linear(hidden_dim, vocab_size)
        # Normalise over the vocabulary axis; an implicit dim is deprecated in PyTorch.
        self.softmax = nn.Softmax(dim=1)
    def forward(self, x, y):
        # x and y are the indices of the words before and after the predicted position.
        x = self.embeddings(x)
        y = self.embeddings(y)
        z = self.hidden(torch.cat([x, y], dim=1))
        z = self.output(z)
        z = self.softmax(z)
        return z
 embed_size = 500
 vocab_size = 20000
 vocab_path = "vocabulary.pickle"
 if exists(vocab_path):
    print("Loading vocabulary from file...")
    with open(vocab_path, "rb") as fh:
        vocab = pickle.load(fh)
 else:
    print("Building vocabulary...")
    vocab = build_vocab_from_iterator(
        get_word_lines_from_file("train/in.tsv.xz"),
        max_tokens=vocab_size,
        specials=["<unk>"],
    )
    with open(vocab_path, "wb") as fh:
        pickle.dump(vocab, fh)
 device = "cuda" if torch.cuda.is_available() else "cpu"
 print("Using device:", device)
 dataset_path = 'train/dataset.pickle'
 if exists(dataset_path):
    print("Loading dataset from file...")
    with open(dataset_path, "rb") as fh:
        train_dataset = pickle.load(fh)
 else:
    print("Building dataset...")
    train_dataset = Trigrams("train/in.tsv.xz", vocab_size)
    with open(dataset_path, "wb") as fh:
        pickle.dump(train_dataset, fh)
 print("Building model...")
 model = TrigramModel(vocab_size, embed_size, 64).to(device)
 data = DataLoader(train_dataset, batch_size=10000)
 optimizer = torch.optim.Adam(model.parameters())
 criterion = torch.nn.NLLLoss()
 print("Training model...")
 model.train()
 losses = []
 step = 0
 max_steps = 1000
 for x, y, z in tqdm(data):
    x = x.to(device)
    y = y.to(device)
    z = z.to(device)
    optimizer.zero_grad()
    ypredicted = model(x, z)
    loss = criterion(torch.log(ypredicted), y)
    losses.append(loss.item())
    loss.backward()
    optimizer.step()
    step += 1
    if step > max_steps:
        break
 plt.plot(losses)
 plt.show()
 torch.save(model.state_dict(), f"trigram_model-embed_{embed_size}.bin")
 def clean_word(word):
    # Keep ASCII letters only, as in get_words_from_line.
    return re.sub('[^A-Za-z]+', '', word)
 def predict(in_path, out_path):
    output = []
    # The model was trained on indices from train_dataset.vocab, so decode with the same vocabulary.
    model_vocab = train_dataset.vocab
    with lzma.open(in_path, encoding='utf8', mode="rt") as file, torch.no_grad():
        for line in tqdm(file):
            line = line.split("\t")
            first_word = clean_word(re.sub(r"\\\\+n", " ", line[-2]).split()[-1])
            next_word = clean_word(re.sub(r"\\\\+n", " ", line[-1]).split()[0])
            # Out-of-vocabulary words map to <unk> through the vocabulary's default index.
            first_ix = torch.tensor(model_vocab.forward([first_word])).to(device)
            next_ix = torch.tensor(model_vocab.forward([next_word])).to(device)
            out = model(first_ix, next_ix)
            top = torch.topk(out[0], 10)
            top_words = model_vocab.lookup_tokens(top.indices.tolist())
            top_probs = top.values.tolist()
            # Mass outside the top 10, plus <unk> if present, goes to the empty-word entry.
            rest = max(0.0, 1 - sum(top_probs))
            parts = []
            for w, p in zip(top_words, top_probs):
                if w == "<unk>":
                    rest += p
                else:
                    parts.append(f"{w}:{p:.4f}")
            parts.append(f":{rest:.4f}")
            output.append(" ".join(parts) + "\n")
    with open(out_path, mode="w") as file:
        file.writelines(output)
 # Switch to evaluation mode before any prediction.
 model.eval()
 print('Predicting dev...')
 predict("dev-0/in.tsv.xz", f"dev-0/out-embed-{embed_size}.tsv")
 print('Predicting test...')
 predict("test-A/in.tsv.xz", f"test-A/out-embed-{embed_size}.tsv")
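In `run.py` the training loop unpacks each trigram as `(x, y, z)` and calls `model(x, z)` to predict `y`, so the network learns the middle word from its two neighbours. A dependency-free sketch of how such training pairs can be produced from a token stream is shown below. It uses `collections.deque` as the sliding window, which is a choice made here rather than the script's own `look_ahead_iterator`; `sliding_windows` and `gap_examples` are hypothetical names.

```python
from collections import deque


def sliding_windows(items, size):
    """Yield every run of `size` consecutive items as a tuple."""
    window = deque(maxlen=size)  # the oldest item is dropped automatically
    for item in items:
        window.append(item)
        if len(window) == size:
            yield tuple(window)


def gap_examples(token_ids):
    """Turn a token stream into ((left, right), middle) training pairs."""
    for left, middle, right in sliding_windows(token_ids, 3):
        yield (left, right), middle


examples = list(gap_examples([10, 11, 12, 13, 14]))
print(examples)
```

Checking the window length rather than comparing items with `None` keeps the generator correct for any item values.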
--- a/test-A/in.tsv.xz
+++ b/test-A/in.tsv.xz
--- a/test-A/out-embed-100.tsv
+++ b/test-A/out-embed-100.tsv
--- a/test-A/out-embed-500.tsv
+++ b/test-A/out-embed-500.tsv
--- a/trigram_model-50_steps-embed_100.bin
+++ b/trigram_model-50_steps-embed_100.bin
--- a/trigram_model-embed_100.bin
+++ b/trigram_model-embed_100.bin
--- a/trigram_model-embed_500.bin
+++ b/trigram_model-embed_500.bin