x output

Fix output
Fixed output
2023-06-28 11:20:09 +02:00 · 2023-06-28 11:14:56 +02:00 · 2023-06-28 11:13:36 +02:00 · 2023-06-28 11:10:47 +02:00 · 2023-06-28 11:09:05 +02:00 · 2023-06-28 11:06:17 +02:00
18 changed files with 7671 additions and 47572 deletions
--- a/09_Zanurzenia_slow.ipynb
+++ b/09_Zanurzenia_slow.ipynb
@@ -1,634 +0,0 @@
-{
- "cells": [
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "![Logo 1](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech1.jpg)\n",
-    "<div class=\"alert alert-block alert-info\">\n",
-    "<h1> Language Modeling</h1>\n",
-    "<h2> 09. <i>Word Embeddings (Word2vec)</i>  [lecture]</h2> \n",
-    "<h3> Filip Graliński (2022)</h3>\n",
-    "</div>\n",
-    "\n",
-    "![Logo 2](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech2.jpg)\n",
-    "\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## Word Embeddings (Word2vec)\n",
-    "\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "In practice, the applicability of wordnets turned out to be surprisingly\n",
-    "limited. A bigger breakthrough in natural language processing came from\n",
-    "multidimensional representations of words, also known as word embeddings.\n",
-    "\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### Word “dimensions”\n",
-    "\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "We could embed words in a multidimensional space, i.e. define a mapping\n",
-    "$E \\colon V \\rightarrow \\mathbb{R}^m$ for some $m$, and choose a way of estimating the\n",
-    "probabilities $P(u|v)$ such that, whenever $E(v)$ and $E(v')$ as well as $E(u)$ and $E(u')$ lie close to each other\n",
-    "(according to some distance metric, for example the ordinary Euclidean distance), we have:\n",
-    "\n",
-    "$$P(u|v) \\approx P(u'|v').$$\n",
-    "\n",
-    "$E(u)$ is called the embedding of the word $u$.\n",
-    "\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "#### Dimensions fixed in advance?\n",
-    "\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "One could imagine that the $m$ dimensions might be defined in advance\n",
-    "by a linguist. These dimensions would correspond to the typical\n",
-    "“axes” considered in linguistics, for example:\n",
-    "\n",
-    "-   is the word vulgar, coarse, colloquial, neutral, or literary?\n",
-    "-   is the word archaic, falling out of use, or a neologism?\n",
-    "-   does the word refer to women or to men (in the sense of grammatical gender and/or\n",
-    "    in the sociolinguistic sense)?\n",
-    "-   is the word singular or plural?\n",
-    "-   is the word a noun or a verb?\n",
-    "-   is the word native or a loanword?\n",
-    "-   is the word a proper name or a common word?\n",
-    "-   does the word describe a concrete thing or an abstract concept?\n",
-    "-   …\n",
-    "\n",
-    "In practice, however, it turned out to be better to let the computer learn\n",
-    "the possible dimensions by itself; we fix in advance only $m$ (the number of dimensions).\n",
-    "\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### A bigram language model based on embeddings\n",
-    "\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "We will now build the simplest language model based on embeddings. It will in fact be the simplest\n",
-    "**neural language model**, since the resulting model can be treated as a simple neural network.\n",
-    "\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "#### Vocabulary\n",
-    "\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "In a typical neural language model, the vocabulary size must be limited\n",
-    "in advance. Usually it is on the order of tens of thousands of words:\n",
-    "we simply consider the $|V|$ most frequent words and replace the rest\n",
-    "with a special `<unk>` token representing an unknown word.\n",
-    "\n",
-    "To build such a vocabulary, we will use the ready-made `Vocab` class from the torchtext package:\n",
-    "\n"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 4,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "from itertools import islice\n",
-    "import regex as re\n",
-    "import sys\n",
-    "from torchtext.vocab import build_vocab_from_iterator\n",
-    "import pickle\n",
-    "import lzma"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 2,
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "text/plain": [
-       "1027"
-      ]
-     },
-     "execution_count": 2,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "from itertools import islice\n",
-    "import regex as re\n",
-    "import sys\n",
-    "from torchtext.vocab import build_vocab_from_iterator\n",
-    "import lzma\n",
-    "\n",
-    "\n",
-    "def get_words_from_line(line):\n",
-    "  line = line.rstrip()\n",
-    "  yield '<s>'\n",
-    "  for m in re.finditer(r'[\\p{L}0-9\\*]+|\\p{P}+', line):\n",
-    "     yield m.group(0).lower()\n",
-    "  yield '</s>'\n",
-    "\n",
-    "\n",
-    "def get_word_lines_from_file(file_name):\n",
-    "  with lzma.open(file_name, 'r') as fh:\n",
-    "    for line in fh:\n",
-    "       yield get_words_from_line(line.decode('utf-8'))\n",
-    "\n",
-    "vocab_size = 20000\n",
-    "\n",
-    "vocab = build_vocab_from_iterator(\n",
-    "    get_word_lines_from_file('train/in.tsv.xz'),\n",
-    "    max_tokens = vocab_size,\n",
-    "    specials = ['<unk>'])\n",
-    "\n",
-    "vocab['human']"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 3,
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "text/plain": [
-       "['<unk>', '\\\\', 'the', '-\\\\', 'nmighty']"
-      ]
-     },
-     "execution_count": 3,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "vocab.lookup_tokens([0, 1, 2, 10, 12345])"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 5,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "with open('vocabulary.pickle', 'wb') as fh:\n",
-    "  pickle.dump(vocab, fh)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "#### Network definition\n",
-    "\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "We will implement our simple neural network using the PyTorch framework.\n",
-    "\n"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 10,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stderr",
-     "output_type": "stream",
-     "text": [
-      "/Users/jacob/opt/anaconda3/lib/python3.9/site-packages/torch/nn/modules/container.py:217: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.\n",
-      "  input = module(input)\n"
-     ]
-    },
-    {
-     "data": {
-      "text/plain": [
-       "tensor(2.9869e-05, grad_fn=<SelectBackward0>)"
-      ]
-     },
-     "execution_count": 10,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "from torch import nn\n",
-    "import torch\n",
-    "\n",
-    "embed_size = 100\n",
-    "\n",
-    "class SimpleBigramNeuralLanguageModel(nn.Module):\n",
-    "  def __init__(self, vocabulary_size, embedding_size):\n",
-    "      super(SimpleBigramNeuralLanguageModel, self).__init__()\n",
-    "      self.model = nn.Sequential(\n",
-    "          nn.Embedding(vocabulary_size, embedding_size),\n",
-    "          nn.Linear(embedding_size, vocabulary_size),\n",
-    "          nn.Softmax()\n",
-    "      )\n",
-    "\n",
-    "  def forward(self, x):\n",
-    "      return self.model(x)\n",
-    "\n",
-    "model = SimpleBigramNeuralLanguageModel(vocab_size, embed_size)\n",
-    "\n",
-    "vocab.set_default_index(vocab['<unk>'])\n",
-    "ixs = torch.tensor(vocab.forward(['is']))\n",
-    "out = model(ixs)\n",
-    "out[0][vocab['the']]"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Now let us train the model. First, though, let us shuffle our file:\n",
-    "\n",
-    "    shuf < opensubtitlesA.pl.txt > opensubtitlesA.pl.shuf.txt\n",
-    "\n"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 8,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "from torch.utils.data import IterableDataset\n",
-    "import itertools\n",
-    "\n",
-    "def look_ahead_iterator(gen):\n",
-    "   prev = None\n",
-    "   for item in gen:\n",
-    "      if prev is not None:\n",
-    "         yield (prev, item)\n",
-    "      prev = item\n",
-    "\n",
-    "class Bigrams(IterableDataset):\n",
-    "  def __init__(self, text_file, vocabulary_size):\n",
-    "      self.vocab = build_vocab_from_iterator(\n",
-    "         get_word_lines_from_file(text_file),\n",
-    "         max_tokens = vocabulary_size,\n",
-    "         specials = ['<unk>'])\n",
-    "      self.vocab.set_default_index(self.vocab['<unk>'])\n",
-    "      self.vocabulary_size = vocabulary_size\n",
-    "      self.text_file = text_file\n",
-    "\n",
-    "  def __iter__(self):\n",
-    "     return look_ahead_iterator(\n",
-    "         (self.vocab[t] for t in itertools.chain.from_iterable(get_word_lines_from_file(self.text_file))))\n",
-    "\n",
-    "train_dataset = Bigrams('train/in.tsv.xz', vocab_size)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 9,
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "text/plain": [
-       "(43, 0)"
-      ]
-     },
-     "execution_count": 9,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "from torch.utils.data import DataLoader\n",
-    "\n",
-    "next(iter(train_dataset))"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 1,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "[tensor([   2,    5,   51, 3481,  231]), tensor([   5,   51, 3481,  231,    4])]"
-     ]
-    }
-   ],
-   "source": [
-    "from torch.utils.data import DataLoader\n",
-    "\n",
-    "next(iter(DataLoader(train_dataset, batch_size=5)))"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 1,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "None"
-     ]
-    }
-   ],
-   "source": [
-    "device = 'cpu'\n",
-    "model = SimpleBigramNeuralLanguageModel(vocab_size, embed_size).to(device)\n",
-    "data = DataLoader(train_dataset, batch_size=5000)\n",
-    "optimizer = torch.optim.Adam(model.parameters())\n",
-    "criterion = torch.nn.NLLLoss()\n",
-    "\n",
-    "model.train()\n",
-    "step = 0\n",
-    "for x, y in data:\n",
-    "   x = x.to(device)\n",
-    "   y = y.to(device)\n",
-    "   optimizer.zero_grad()\n",
-    "   ypredicted = model(x)\n",
-    "   loss = criterion(torch.log(ypredicted), y)\n",
-    "   if step % 100 == 0:\n",
-    "      print(step, loss)\n",
-    "   step += 1\n",
-    "   loss.backward()\n",
-    "   optimizer.step()\n",
-    "\n",
-    "torch.save(model.state_dict(), 'model1.bin')"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Let us compute the most probable continuations for a given word:\n",
-    "\n"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 1,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "[('ciebie', 73, 0.1580502986907959), ('mnie', 26, 0.15395283699035645), ('<unk>', 0, 0.12862136960029602), ('nas', 83, 0.0410110242664814), ('niego', 172, 0.03281523287296295), ('niej', 245, 0.02104802615940571), ('siebie', 181, 0.020788608118891716), ('którego', 365, 0.019379809498786926), ('was', 162, 0.013852755539119244), ('wszystkich', 235, 0.01381855271756649)]"
-     ]
-    }
-   ],
-   "source": [
-    "device = 'cuda'\n",
-    "model = SimpleBigramNeuralLanguageModel(vocab_size, embed_size).to(device)\n",
-    "model.load_state_dict(torch.load('model1.bin'))\n",
-    "model.eval()\n",
-    "\n",
-    "ixs = torch.tensor(vocab.forward(['dla'])).to(device)\n",
-    "\n",
-    "out = model(ixs)\n",
-    "top = torch.topk(out[0], 10)\n",
-    "top_indices = top.indices.tolist()\n",
-    "top_probs = top.values.tolist()\n",
-    "top_words = vocab.lookup_tokens(top_indices)\n",
-    "list(zip(top_words, top_indices, top_probs))"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Now let us examine the most similar embeddings for a given word:\n",
-    "\n"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 1,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "[('.', 3, 0.404473215341568), (',', 4, 0.14222915470600128), ('z', 14, 0.10945753753185272), ('?', 6, 0.09583134204149246), ('w', 10, 0.050338443368673325), ('na', 12, 0.020703863352537155), ('i', 11, 0.016762692481279373), ('<unk>', 0, 0.014571071602404118), ('...', 15, 0.01453721895813942), ('</s>', 1, 0.011769450269639492)]"
-     ]
-    }
-   ],
-   "source": [
-    "vocab = train_dataset.vocab\n",
-    "ixs = torch.tensor(vocab.forward(['kłopot'])).to(device)\n",
-    "\n",
-    "out = model(ixs)\n",
-    "top = torch.topk(out[0], 10)\n",
-    "top_indices = top.indices.tolist()\n",
-    "top_probs = top.values.tolist()\n",
-    "top_words = vocab.lookup_tokens(top_indices)\n",
-    "list(zip(top_words, top_indices, top_probs))"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 1,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "[('poszedł', 1087, 1.0), ('idziesz', 1050, 0.4907470941543579), ('przyjeżdża', 4920, 0.45242372155189514), ('pojechałam', 12784, 0.4342481195926666), ('wrócił', 1023, 0.431664377450943), ('dobrać', 10351, 0.4312002956867218), ('stałeś', 5738, 0.4258835017681122), ('poszła', 1563, 0.41979148983955383), ('trafiłam', 18857, 0.4109022617340088), ('jedzie', 1674, 0.4091658890247345)]"
-     ]
-    }
-   ],
-   "source": [
-    "cos = nn.CosineSimilarity(dim=1, eps=1e-6)\n",
-    "\n",
-    "embeddings = model.model[0].weight\n",
-    "\n",
-    "vec = embeddings[vocab['poszedł']]\n",
-    "\n",
-    "similarities = cos(vec, embeddings)\n",
-    "\n",
-    "top = torch.topk(similarities, 10)\n",
-    "\n",
-    "top_indices = top.indices.tolist()\n",
-    "top_probs = top.values.tolist()\n",
-    "top_words = vocab.lookup_tokens(top_indices)\n",
-    "list(zip(top_words, top_indices, top_probs))"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "#### Notation as a mathematical formula\n",
-    "\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "The neural network programmed above can be described by the following formula:\n",
-    "\n",
-    "$$\\vec{y} = \\operatorname{softmax}(CE(w_{i-1})),$$\n",
-    "\n",
-    "where:\n",
-    "\n",
-    "-   $w_{i-1}$ is the first word of the bigram (the preceding word),\n",
-    "-   $E(w)$ is the embedding of the word $w$, a vector of size $m$,\n",
-    "-   $C$ is a matrix of size $|V| \\times m$ that projects the embedding vector onto a vector of vocabulary size,\n",
-    "-   $\\vec{y}$ is the output vector of probabilities, of size $|V|$.\n",
-    "\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "##### Hyperparameters\n",
-    "\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Note that our model has two hyperparameters:\n",
-    "\n",
-    "-   $m$: the embedding size,\n",
-    "-   $|V|$: the vocabulary size, assuming that we can control\n",
-    "    the vocabulary size (e.g. by truncating the vocabulary to a given number\n",
-    "    of the most frequent words and replacing the rest with a special token, say, `<UNK>`).\n",
-    "\n",
-    "We can of course try to manipulate the values of $m$ and $|V|$ in order to\n",
-    "improve the results of our model.\n",
-    "\n",
-    "**Question**: why does a value of $m \\approx |V|$ make no sense? Why does $m = 1$ make no sense?\n",
-    "\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "#### Network diagram\n",
-    "\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Since multiplication by a matrix ($C$) simply means applying\n",
-    "a linear layer, our network can be interpreted as a single-layer\n",
-    "neural network, which can be illustrated with the following diagram:\n",
-    "\n",
-    "![img](./09_Zanurzenia_slow/bigram1.drawio.png \"Diagram of a simple bigram neural language model\")\n",
-    "\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "#### Embedding as matrix multiplication\n",
-    "\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Obtaining an embedding ($E(w)$) is usually implemented as a\n",
-    "*look-up*. Interestingly, an embedding can also be interpreted as\n",
-    "multiplication by an embedding matrix $E$ of size $m \\times |V|$, provided that the input word is encoded as a\n",
-    "*one-hot* vector, i.e. the word $w$ is\n",
-    "given on input as a vector $\\vec{1_V}(w) = [0,\\ldots,0,1,0,\\ldots,0]$ of size $|V|$\n",
-    "consisting only of zeros except for a one at the position corresponding to the index of the word $w$ in the vocabulary $V$.\n",
-    "\n",
-    "The formula then takes the form:\n",
-    "\n",
-    "$$\\vec{y} = \\operatorname{softmax}(CE\\vec{1_V}(w_{i-1})),$$\n",
-    "\n",
-    "where $E$ is this time an $m \\times |V|$ matrix.\n",
-    "\n",
-    "**Question**: do we interpret $\\vec{1_V}(w)$ as a row vector or a column vector?\n",
-    "\n",
-    "This interpretation can be illustrated with the following diagram:\n",
-    "\n",
-    "![img](./09_Zanurzenia_slow/bigram2.drawio.png \"Diagram of a simple bigram neural language model with one-hot input\")\n",
-    "\n"
-   ]
-  }
- ],
- "metadata": {
-  "kernelspec": {
-   "display_name": "Python 3 (ipykernel)",
-   "language": "python",
-   "name": "python3"
-  },
-  "language_info": {
-   "codemirror_mode": {
-    "name": "ipython",
-    "version": 3
-   },
-   "file_extension": ".py",
-   "mimetype": "text/x-python",
-   "name": "python",
-   "nbconvert_exporter": "python",
-   "pygments_lexer": "ipython3",
-   "version": "3.9.7"
-  },
-  "org": null
- },
- "nbformat": 4,
- "nbformat_minor": 1
-}
--- a/dev-0/out-embed-100.tsv
+++ b/dev-0/out-embed-100.tsv
--- a/dev-0/out-embed-500.tsv
+++ b/dev-0/out-embed-500.tsv
--- a/dev-0/out.tsv
+++ b/dev-0/out.tsv
--- a/flan-t5.ipynb
+++ b/flan-t5.ipynb
@@ -0,0 +1,257 @@
+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "/Users/kubakaczmarek/anaconda3/lib/python3.10/site-packages/tqdm/auto.py:22: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
+      "  from .autonotebook import tqdm as notebook_tqdm\n"
+     ]
+    }
+   ],
+   "source": [
+    "from transformers import T5Tokenizer, T5Config, T5ForConditionalGeneration\n",
+    "import torch\n",
+    "import lzma\n",
+    "from tqdm import tqdm"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "DEVICE = torch.device(\"mps\") if torch.backends.mps.is_available() else torch.device(\"cpu\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "T5_PATH = 't5-base'"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "/Users/kubakaczmarek/anaconda3/lib/python3.10/site-packages/transformers/models/t5/tokenization_t5.py:164: FutureWarning: This tokenizer was incorrectly instantiated with a model max length of 512 which will be corrected in Transformers v5.\n",
+      "For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.\n",
+      "- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.\n",
+      "- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.\n",
+      "- To avoid this warning, please instantiate this tokenizer with `model_max_length` set to your preferred value.\n",
+      "  warnings.warn(\n"
+     ]
+    }
+   ],
+   "source": [
+    "t5_tokenizer = T5Tokenizer.from_pretrained(T5_PATH)\n",
+    "t5_config = T5Config.from_pretrained(T5_PATH)\n",
+    "t5_mlm = T5ForConditionalGeneration.from_pretrained(T5_PATH, config=t5_config).to(DEVICE)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def preprocess_data(X):\n",
+    "    parsed_data = []\n",
+    "\n",
+    "    for line in X:\n",
+    "        left = line.strip().split('\\t')[6].replace('\\\\n', ' ').split(' ')\n",
+    "        right = line.strip().split('\\t')[7].replace('\\\\n', ' ').split(' ')\n",
+    "\n",
+    "        if len(left) + len(right) > 330:\n",
+    "            text = f\"{' '.join(left[-100:])} <extra_id_0> {' '.join(right[:100])}\"\n",
+    "        else:\n",
+    "            text = f\"{' '.join(left)} <extra_id_0> {' '.join(right)}\"\n",
+    "\n",
+    "        parsed_data.append(text)\n",
+    "\n",
+    "    return parsed_data"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 48,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def decode(output):\n",
+    "    _txt = t5_tokenizer.decode(output[2:], skip_special_tokens=False, clean_up_tokenization_spaces=False)\n",
+    "    end = _txt.index('<extra_id_1>')\n",
+    "    \n",
+    "    return _txt[:end]\n",
+    "\n",
+    "def parse_output(outputs):\n",
+    "    parsed = list(dict.fromkeys(decode(output) for output in outputs))  # dedupe, keep beam order\n",
+    "    res = ''\n",
+    "    total = 0\n",
+    "    for i, token in enumerate(parsed):\n",
+    "        res += f\"{token}:{1 / (i + 4)} \"\n",
+    "        total += 1 / (i + 4)\n",
+    "\n",
+    "    res += f\":{1-total}\"\n",
+    "    \n",
+    "    return res"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "with lzma.open('test-A/in.tsv.xz', 'rt') as f:\n",
+    "    X = f.readlines()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 12,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "X = preprocess_data(X)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 55,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "Token indices sequence length is longer than the specified maximum sequence length for this model (556 > 512). Running this sequence through the model will result in indexing errors\n"
+     ]
+    }
+   ],
+   "source": [
+    "with open('test-A/out.tsv', mode='wt', encoding='utf-8') as f:\n",
+    "    for line in tqdm(X):\n",
+    "        try:\n",
+    "            encoded = t5_tokenizer.encode_plus(line, add_special_tokens=True, return_tensors='pt')\n",
+    "            input_ids = encoded['input_ids'].to(DEVICE)\n",
+    "            outputs = t5_mlm.generate(input_ids=input_ids, \n",
+    "                          num_beams=5, num_return_sequences=5,\n",
+    "                          max_length=5)\n",
+    "            f.write(parse_output(outputs) + '\\n')\n",
+    "        except Exception:\n",
+    "            f.write('the:0.9 :0.1\\n')\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 49,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "input_text = \"anywhere un-\\nless its somewhere. Well, I says,\\nI'm glad to hear that, but, accord-\\ning to your figures, I left myself\\nwhere 1 was, which is five miles near-\\ner to myself than I was when we\\nwere where we are now.\\nWe have now reached Slidell.\\nThat's a fine place. The people\\ndown there remind me of bananas-\\nthey come and go in bunches. 811-\\ndell used to be noted for her tough\\npeople. Now she is noted for be,\\ntough steaks. Well, I certainly got\\none there. When the waiter brought\\nit in it was so small I thought. It\\nwas a crack in the plate. I skid,\\nwaiter what else have you got? +He\\nbrought me in two codfish and one\\nsmelt. I said, waiter have you got\\npigs feet? He said no, rheumatism\\nmakes me walk that way. I sald,\\nhow is the pumpkin pie?\tsaid\\nit's all squash. The best I could get\\nin that hotel was a soup sandwich.\\nAfter the table battle the waiter and\\nI signed an armistice. I then went\\nover to the hotel clerk and asked for\\na room. He said with or without a\\nbed? I said, with a bed. He said,\\nI don't think I 'have' a bed long\\nenough for you. I said, well, I'll\\naddtwo feettoitwhenIgetinit.\\nHe gave me a lovely room on the\\ntop floor. It was one of those rooms\\nthat stands on each side. If you\\nhappen to get up in the middle of\\nthe night you want to be sure and\\nget up in the middle of the room.\\nThat night I dreamt I was eating\\nflannel cakes. When I woke up half\\nof the blanket was gone. I must\\nhave got up on the wrong side of the\\nbed, for next morning I had an awful\\nheadache. I told the manager about\\nit. He said, you have rheumatic\\npains. I said, no, I think it is on,\\nof those attic room pains. I nad to\\ngetupat5a.m.inthemorningso\\nthey could use the sheet to set the\\nbreakfast table.\".replace('\\n', ' ').split('\\t')\n",
+    "input_text = f\"{input_text[0]} <extra_id_0> {input_text[1]}\""
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 50,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "encoded = t5_tokenizer.encode_plus(input_text, add_special_tokens=True, return_tensors='pt')\n",
+    "input_ids = encoded['input_ids'].to(DEVICE)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 51,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "outputs = t5_mlm.generate(input_ids=input_ids, \n",
+    "                          num_beams=10, num_return_sequences=5,\n",
+    "                          max_length=5)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 52,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "parsed = parse_output(outputs)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 53,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "'+He:0.25 He:0.2 I:0.16666666666666666 :0.3833333333333333'"
+      ]
+     },
+     "execution_count": 53,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "parsed"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "base",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.9"
+  },
+  "orig_nbformat": 4
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
--- a/gonito.yaml
+++ b/gonito.yaml
@@ -1,13 +0,0 @@
-description: nn, trigram, previous and next
-tags:
-  - neural-network
-  - trigram
-params:
-  epochs: 1
-  vocab-size: 20000
-  batch-size: 10000
-  embed-size:
-    - 100
-    - 500
-    - 1000
-  topk: 10
--- a/lm0.py
+++ b/lm0.py
@@ -1,15 +0,0 @@
-import sys
-import random
-
-distribs = [
-    'a:0.6 the:0.2 :0.2',
-    'the:0.7 a:0.2 :0.1',
-    'the:0.9 :0.1',
-    'the:0.3 be:0.2 to:0.15 of:0.15 and:0.05 a:0.05 in:0.05 :0.05',
-]
-
-for line in sys.stdin:
-    ctx = line.split('\t')[6:]
-
-    i = random.randint(0, len(distribs) - 1)
-    print(distribs[i])
--- a/lm1.py
+++ b/lm1.py
@@ -1,56 +0,0 @@
-import sys
-import random
-from tqdm import tqdm
-from collections import defaultdict
-import pickle
-import os
-
-corpus = []
-
-with open('train/in.tsv', 'r') as f:
-    print('Reading corpus...')
-    for line in tqdm(f):
-        ctx = line.split('\t')[6:]
-
-        corpus.append(ctx[0] + 'BLANK' + ctx[1])
-
-corpus = ' '.join(corpus)
-corpus = corpus.replace('-\n', '')
-corpus = corpus.replace('\\n', ' ')
-corpus = corpus.replace('\n', ' ')
-corpus = corpus.split(' ')
-
-if (os.path.exists('distrib.pkl')):
-    print('Loading distribution...')
-    distrib = pickle.load(open('distrib.pkl', 'rb'))
-else:
-    print('Generating distribution...')
-    distrib = defaultdict(lambda: defaultdict(int))
-    for i in tqdm(range(len(corpus) - 1)):
-        distrib[corpus[i]][corpus[i+1]] += 1
-
-    with open('distrib.pkl', 'wb') as f:
-        print('Saving distribution...')
-        pickle.dump(dict(distrib), f)
-
-results = []
-with open('dev-0/in.tsv', 'r') as f:
-    print('Generating output...')
-    for line in tqdm(f):
-        ctx = line.split('\t')[6:]
-        last_word = ctx[0].split(' ')[-1]
-        try:
-            blank_word = max(distrib[last_word], key=distrib[last_word].get)
-        except (KeyError, ValueError):
-            blank_word = 'NONE'
-        results.append(blank_word)
-
-with open('dev-0/out.tsv', 'w') as f:
-    print('Writing output...')
-    for result in tqdm(results):
-        if result == 'NONE':
-            f.write('a:0.6 the:0.2 :0.2\n')
-        else:
-            f.write(f'{result}:0.9 :0.1\n')
-
-            
--- a/lmn.py
+++ b/lmn.py
@@ -1,83 +0,0 @@
-from tqdm import tqdm
-from numpy import argmax
-
-def preprocess(corpus):
-    corpus = corpus.replace('-\n', '')
-    corpus = corpus.replace('\\n', ' ')
-    corpus = corpus.replace('\n', ' ')
-    corpus = corpus.replace('.', ' EOS')
-
-    return corpus
-
-def generate_freq(tokens):
-    tokens_freq = {}
-    for token in tqdm(tokens):
-        if token not in tokens_freq:
-            tokens_freq[token] = 1
-        else:
-            tokens_freq[token] += 1
-
-    return tokens_freq
-
-def generate_ngrams(tokens, n):
-    ngrams = []
-    for i in tqdm(range(len(tokens) - n + 1)):
-        ngrams.append(tuple(tokens[i:i+n]))  # tuples are hashable, so they can be counted
-
-    return ngrams
-
-def generate_distribution(unique_tokens, tokens_freq, bigrams_freq):
-    n = len(unique_tokens)
-    distribution = [[0.0] * n for _ in range(n)]  # independent rows, one per first token
-    for i in tqdm(range(n)):
-        denominator = tokens_freq[unique_tokens[i]]
-        for j in range(n):
-            try:
-                numerator = bigrams_freq[(unique_tokens[i], unique_tokens[j])]
-            except KeyError:
-                numerator = 0
-            distribution[i][j] = numerator / denominator
-
-    return distribution
-
-with open('train/in.tsv', 'r') as f:
-    print('Reading corpus...')
-    corpus = []
-    for line in tqdm(f):
-        ctx = line.split('\t')[6:]
-        corpus.append(ctx[0] + 'BLANK' + ctx[1])
-
-print('Preprocessing corpus...')
-corpus = preprocess(' '.join(corpus))
-
-tokens = corpus.split()
-unique_tokens = sorted(set(tokens))
-print('Generating tokens frequency...')
-tokens_freq = generate_freq(tokens)
-print('Generating n-grams...')
-bigrams = generate_ngrams(tokens, 2)
-print('Generating bigrams frequency...')
-bigrams_freq = generate_freq(bigrams)
-print('Generate distribution...')
-distribution = generate_distribution(unique_tokens, tokens_freq, bigrams_freq)
-
-
-with open('dev-0/in.tsv', 'r') as f:
-    print('Generating output...')
-    results = []
-    for line in tqdm(f):
-        ctx = line.split('\t')[6:]
-        last_word = preprocess(ctx[0]).split(' ')[-1]
-        try:
-            blank_word = unique_tokens[argmax(distribution[unique_tokens.index(last_word)])]
-        except:
-            blank_word = 'NONE'
-        results.append(blank_word)
-
-with open('dev-0/out.tsv', 'w') as f:
-    print('Writing output...')
-    for result in tqdm(results):
-        if result == 'NONE':
-            f.write('a:0.6 the:0.2 :0.2\n')
-        else:
-            f.write(f'{result}:0.9 :0.1\n')
--- a/ripped.py
+++ b/ripped.py
@ -1,153 +0,0 @@
-import lzma
-import matplotlib.pyplot as plt
-from math import log
-from collections import OrderedDict
-from collections import Counter
-import regex as re
-from itertools import islice
-
-def freq_list(g, top=None):
-    c = Counter(g)
-
-    if top is None:
-       items = c.items()
-    else:
-       items = c.most_common(top)
-
-    return OrderedDict(sorted(items, key=lambda t: -t[1]))
-
-def get_words(t):
-    for m in re.finditer(r'[\p{L}0-9-\*]+', t):
-        yield m.group(0)
-
-def ngrams(iter, size):
-  ngram = []
-  for item in iter:
-    ngram.append(item)
-    if len(ngram) == size:
-        yield tuple(ngram)
-        ngram = ngram[1:]
-
-PREFIX_TRAIN = 'train' 
-words = []
-
-counter_lines = 0
-with lzma.open(f'{PREFIX_TRAIN}/in.tsv.xz', 'r') as train, open(f'{PREFIX_TRAIN}/expected.tsv', 'r') as expected:
-    for t_line, e_line in zip(train, expected):
-        t_line = t_line.decode("utf-8")
-
-        t_line = t_line.rstrip()
-        e_line = e_line.rstrip()
-
-        t_line_splitted_by_tab = t_line.split('\t')
-        
-        t_line_cleared = t_line_splitted_by_tab[-2] + ' ' + e_line + ' ' + t_line_splitted_by_tab[-1]
-
-        words += t_line_cleared.split()
-
-        counter_lines+=1
-        if counter_lines > 90000:
-            break
-
-# lzmaFile = lzma.open('dev-0/in.tsv.xz', 'rb')
-
-# content = lzmaFile.read().decode("utf-8")
-# words = get_words(trainset)
-
-ngrams_ = ngrams(words, 2)
-
-
-def create_probabilities_bigrams(w_c, b_c):
-    probabilities_bigrams = {}
-    for bigram, bigram_amount in b_c.items():
-        if bigram_amount <=2:
-            continue
-        p_word_before = bigram_amount / w_c[bigram[0]] 
-        p_word_after = bigram_amount / w_c[bigram[1]]
-        probabilities_bigrams[bigram] = (p_word_before, p_word_after)
-
-    return probabilities_bigrams
-
-words_c = Counter(words)
-word_=''
-bigram_c = Counter(ngrams_)
-ngrams_=''
-probabilities = create_probabilities_bigrams(words_c, bigram_c)
-
-
-items = probabilities.items()
-probabilities = OrderedDict(sorted(items, key=lambda t:t[1], reverse=True))
-items=''
-# sorted_by_freq = freq_list(ngrams)
-
-PREFIX_VALID = 'dev-0'
-
-def count_probabilities(w_b, w_a, probs, w_c, b_c):
-    results_before = {}
-    results_after = {}
-    for bigram, probses in probs.items():
-        if len(results_before) > 20 or len(results_after) > 20:
-            break
-        if w_b == bigram[0]:
-            results_before[bigram] = probses[0]
-        if w_a == bigram[1]:
-            results_after[bigram] = probses[1]
-    a=1
-    best_ = {}
-
-    for bigram, probses in results_before.items():
-        for bigram_2, probses_2 in results_after.items():
-            best_[bigram[1]] = probses * probses_2
-
-    for bigram, probses in results_after.items():
-            for bigram_2, probses_2 in results_before.items():
-                if bigram[0] in best_:
-                    if probses * probses_2 < probses_2:
-                        continue
-                best_[bigram[0]] = probses * probses_2
-
-    items = best_.items()
-    return OrderedDict(sorted(items, key=lambda t:t[1], reverse=True))
-
-
-with lzma.open(f'{PREFIX_VALID}/in.tsv.xz', 'r') as train:
-    for t_line in train:
-        t_line = t_line.decode("utf-8")
-
-        t_line = t_line.rstrip()
-        t_line = t_line.replace('\\n', ' ')
-
-
-        t_line_splitted_by_tab = t_line.split('\t')
-        
-
-        words_pre = t_line_splitted_by_tab[-2].split()
-
-        words_po = t_line_splitted_by_tab[-1].split()
-
-        w_pre = words_pre[-1]
-        w_po = words_po[0]
-
-        probs_ordered = count_probabilities(w_pre, w_po,probabilities, words_c, bigram_c)
-        if len(probs_ordered) ==0:
-            print(f"the:0.5 a:0.3 :0.2")
-            continue
-        result_string = ''
-        counter_ = 0
-        for word_, p in probs_ordered.items():
-            if counter_>4:
-                break
-            re_ = re.search(r'\p{L}+', word_)
-            if re_:
-                word_cleared = re_.group(0)
-                result_string += f"{word_cleared}:{str(p)} "
-
-            else:
-                if result_string == '':
-                    result_string = f"the:0.5 a:0.3 "
-                continue
-
-            counter_+=1
-        result_string += ':0.1'
-        print(result_string)
-        a=1
--- a/run.py
+++ b/run.py
@ -1,233 +0,0 @@
-import lzma
-import regex as re
-from torchtext.vocab import build_vocab_from_iterator
-from torch import nn
-import pickle
-from os.path import exists
-from torch.utils.data import IterableDataset
-import itertools
-from torch.utils.data import DataLoader
-import torch
-from matplotlib import pyplot as plt
-from tqdm import tqdm
-
-
-def get_words_from_line(line):
-    line = line.rstrip()
-    line = line.split("\t")
-    text = line[-2] + " " + line[-1]
-    text = re.sub(r"\\\\+n", " ", text)
-    text = re.sub('[^A-Za-z ]+', '', text)
-    for t in text.split():
-        yield t
-
-
-def get_word_lines_from_file(file_name):
-    with lzma.open(file_name, "r") as fh:
-        for line in fh:
-            yield get_words_from_line(line.decode("utf-8"))
-
-
-def look_ahead_iterator(gen):
-    first = None
-    second = None
-    for item in gen:
-        if first is not None and second is not None:
-            yield (first, second, item)
-        first = second
-        second = item
-
-
-class Trigrams(IterableDataset):
-    def __init__(self, text_file, vocabulary_size):
-        self.vocab = build_vocab_from_iterator(
-            get_word_lines_from_file(text_file),
-            max_tokens=vocabulary_size,
-            specials=["<unk>"],
-        )
-        self.vocab.set_default_index(self.vocab["<unk>"])
-        self.vocabulary_size = vocabulary_size
-        self.text_file = text_file
-
-    def __iter__(self):
-        return look_ahead_iterator(
-            (
-                self.vocab[t]
-                for t in itertools.chain.from_iterable(
-                    get_word_lines_from_file(self.text_file)
-                )
-            )
-        )
-
-
-class TrigramModel(nn.Module):
-    def __init__(self, vocab_size, embedding_dim, hidden_dim):
-        super(TrigramModel, self).__init__()
-        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
-        self.hidden = nn.Linear(embedding_dim * 2, hidden_dim)
-        self.output = nn.Linear(hidden_dim, vocab_size)
-        self.softmax = nn.Softmax(dim=1)
-
-    def forward(self, x, y):
-        x = self.embeddings(x)
-        y = self.embeddings(y)
-        z = self.hidden(torch.cat([x, y], dim=1))
-        z = self.output(z)
-        z = self.softmax(z)
-        return z
-
-
-embed_size = 500
-vocab_size = 20000
-vocab_path = "vocabulary.pickle"
-if exists(vocab_path):
-    print("Loading vocabulary from file...")
-    with open(vocab_path, "rb") as fh:
-        vocab = pickle.load(fh)
-else:
-    print("Building vocabulary...")
-    vocab = build_vocab_from_iterator(
-        get_word_lines_from_file("train/in.tsv.xz"),
-        max_tokens=vocab_size,
-        specials=["<unk>"],
-    )
-
-    with open(vocab_path, "wb") as fh:
-        pickle.dump(vocab, fh)
-
-device = "cuda" if torch.cuda.is_available() else "cpu"
-
-print("Using device:", device)
-dataset_path = 'train/dataset.pickle'
-if exists(dataset_path):
-    print("Loading dataset from file...")
-    with open(dataset_path, "rb") as fh:
-        train_dataset = pickle.load(fh)
-else:
-    print("Building dataset...")
-    train_dataset = Trigrams("train/in.tsv.xz", vocab_size)
-    with open(dataset_path, "wb") as fh:
-        pickle.dump(train_dataset, fh)
-
-print("Building model...")
-model = TrigramModel(vocab_size, embed_size, 64).to(device)
-data = DataLoader(train_dataset, batch_size=10000)
-optimizer = torch.optim.Adam(model.parameters())
-criterion = torch.nn.NLLLoss()
-
-print("Training model...")
-model.train()
-losses = []
-step = 0
-max_steps = 1000
-
-for x, y, z in tqdm(data):
-    x = x.to(device)
-    y = y.to(device)
-    z = z.to(device)
-
-    optimizer.zero_grad()
-    ypredicted = model(x, z)
-    loss = criterion(torch.log(ypredicted), y)
-    losses.append(loss.item())
-    loss.backward()
-    optimizer.step()
-    step += 1
-    if step > max_steps:
-        break
-
-plt.plot(losses)
-plt.show()
-
-torch.save(model.state_dict(), f"trigram_model-embed_{embed_size}.bin")
-
-vocab_unique = set(train_dataset.vocab.get_stoi().keys())
-
-output = []
-print('Predicting dev...')
-with lzma.open("dev-0/in.tsv.xz", encoding='utf8', mode="rt") as file:
-    for line in tqdm(file):
-        line = line.split("\t")
-
-        first_word = re.sub(r"\\\\+n", " ", line[-2]).split()[-1]
-        first_word = re.sub('[^A-Za-z]+', '', first_word)
-
-        next_word = re.sub(r"\\\\+n", " ", line[-1]).split()[0]
-        next_word = re.sub('[^A-Za-z]+', '', next_word)
-
-        if first_word not in vocab_unique:
-            first_word = "<unk>"
-        if next_word not in vocab_unique:
-            next_word = "<unk>"
-
-        first_word = torch.tensor(train_dataset.vocab.forward([first_word])).to(device)
-        next_word = torch.tensor(train_dataset.vocab.forward([next_word])).to(device)
-
-        out = model(first_word, next_word)
-
-        top = torch.topk(out[0], 10)
-        top_indices = top.indices.tolist()
-        top_probs = top.values.tolist()
-        unk_bonus = 1 - sum(top_probs)
-        top_words = vocab.lookup_tokens(top_indices)
-        top_zipped = list(zip(top_words, top_probs))
-
-        res = ""
-        for w, p in top_zipped:
-            if w == "<unk>":
-                res += f":{(p + unk_bonus):.4f} "
-            else:
-                res += f"{w}:{p:.4f} "
-
-        res = res[:-1]
-        res += "\n"
-        output.append(res)
-
-with open(f"dev-0/out-embed-{embed_size}.tsv", mode="w") as file:
-    file.writelines(output)
-
-
-model.eval()
-
-output = []
-print('Predicting test...')
-with lzma.open("test-A/in.tsv.xz", encoding='utf8', mode="rt") as file:
-    for line in tqdm(file):
-        line = line.split("\t")
-
-        first_word = re.sub(r"\\\\+n", " ", line[-2]).split()[-1]
-        first_word = re.sub('[^A-Za-z]+', '', first_word)
-
-        next_word = re.sub(r"\\\\+n", " ", line[-1]).split()[0]
-        next_word = re.sub('[^A-Za-z]+', '', next_word)
-
-        if first_word not in vocab_unique:
-            first_word = "<unk>"
-        if next_word not in vocab_unique:
-            next_word = "<unk>"
-
-        first_word = torch.tensor(train_dataset.vocab.forward([first_word])).to(device)
-        next_word = torch.tensor(train_dataset.vocab.forward([next_word])).to(device)
-
-        out = model(first_word, next_word)
-
-        top = torch.topk(out[0], 10)
-        top_indices = top.indices.tolist()
-        top_probs = top.values.tolist()
-        unk_bonus = 1 - sum(top_probs)
-        top_words = vocab.lookup_tokens(top_indices)
-        top_zipped = list(zip(top_words, top_probs))
-
-        res = ""
-        for w, p in top_zipped:
-            if w == "<unk>":
-                res += f":{(p + unk_bonus):.4f} "
-            else:
-                res += f"{w}:{p:.4f} "
-
-        res = res[:-1]
-        res += "\n"
-        output.append(res)
-
-with open(f"test-A/out-embed-{embed_size}.tsv", mode="w") as file:
-    file.writelines(output)
--- a/test-A/in.tsv.xz
+++ b/test-A/in.tsv.xz
--- a/test-A/out-embed-100.tsv
+++ b/test-A/out-embed-100.tsv
--- a/test-A/out-embed-500.tsv
+++ b/test-A/out-embed-500.tsv
--- a/test-A/out.tsv
+++ b/test-A/out.tsv
--- a/trigram_model-50_steps-embed_100.bin
+++ b/trigram_model-50_steps-embed_100.bin
--- a/trigram_model-embed_100.bin
+++ b/trigram_model-embed_100.bin
--- a/trigram_model-embed_500.bin
+++ b/trigram_model-embed_500.bin
Author	SHA1	Message	Date
Jakub Kaczmarek	5ca3bc7f46	x oputput	2023-06-28 11:20:09 +02:00
Jakub Kaczmarek	daaf54ee51	Fix output	2023-06-28 11:14:56 +02:00
Jakub Kaczmarek	a1b385c10f	Fixed output	2023-06-28 11:13:36 +02:00
Jakub Kaczmarek	377d57bce9	Fixed output	2023-06-28 11:10:47 +02:00
Jakub Kaczmarek	8d2fdfedb7	Fixed output	2023-06-28 11:09:05 +02:00
Jakub Kaczmarek	c5dec351ea	Updated inference	2023-06-28 11:06:17 +02:00
Jakub Kaczmarek	0e2008e6fa	Fix formatting	2023-06-28 10:37:29 +02:00
Jakub Kaczmarek	92c425b111	Fix output	2023-06-28 10:36:22 +02:00
Jakub Kaczmarek	13e3392879	Fix output	2023-06-28 10:35:13 +02:00
Jakub Kaczmarek	89d89385da	Remove double words at teh beginning	2023-06-28 10:32:40 +02:00
Jakub Kaczmarek	bbcd5f2f0c	Remove double words	2023-06-28 10:28:12 +02:00
Jakub Kaczmarek	bb9f531b39	Flan t-5 reults	2023-06-28 10:19:19 +02:00
Jakub Kaczmarek	96d7b0d1b4	Updated input truncation	2023-06-27 22:05:55 +02:00
Jakub Kaczmarek	206dc89e55	Fix formatting	2023-06-27 21:08:44 +02:00
Jakub Kaczmarek	35176820c3	Truncated inputs	2023-06-27 21:07:15 +02:00
Jakub Kaczmarek	0f39b4566e	Fix out formatting v3	2023-06-27 19:51:59 +02:00
Jakub Kaczmarek	4ac652d175	Fix out formatting	2023-06-27 19:38:34 +02:00
Jakub Kaczmarek	938d3654d7	Fix output formatting	2023-06-27 19:35:00 +02:00
Jakub Kaczmarek	28da46d28f	First finetuned model	2023-06-27 19:16:04 +02:00
Jakub Kaczmarek	b5b575bd45	Remove unused files	2023-06-27 19:15:34 +02:00