wip

2023-05-08 18:54:03 +02:00 · 51bcff674a
commit 51bcff674a
21 changed files with 895030 additions and 0 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -0,0 +1,8 @@
+
+*~
+*.swp
+*.bak
+*.pyc
+*.o
+.DS_Store
+.token
--- a/09_Zanurzenia_slow.ipynb
+++ b/09_Zanurzenia_slow.ipynb
@@ -0,0 +1,919 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "![Logo 1](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech1.jpg)\n",
+    "<div class=\"alert alert-block alert-info\">\n",
+    "<h1> Language Modeling</h1>\n",
+    "<h2> 09. <i>Word embeddings (Word2vec)</i>  [lecture]</h2> \n",
+    "<h3> Filip Graliński (2022)</h3>\n",
+    "</div>\n",
+    "\n",
+    "![Logo 2](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech2.jpg)\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Word embeddings (Word2vec)\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "In practice, the applicability of wordnets turned out to be surprisingly\n",
+    "limited. A bigger breakthrough in natural language processing came from\n",
+    "multidimensional representations of words, also known as word embeddings.\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### “Dimensions” of words\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We could embed words in a multidimensional space, i.e. define a mapping\n",
+    "$E \\colon V \\rightarrow \\mathbb{R}^m$ for some $m$, and choose a way of estimating\n",
+    "the probabilities $P(u|v)$ such that, whenever $E(v)$ and $E(v')$ are close to each other and $E(u)$ and $E(u')$ are close to each other\n",
+    "(under some distance metric, for example the ordinary Euclidean distance), we have\n",
+    "\n",
+    "$$P(u|v) \\approx P(u'|v').$$\n",
+    "\n",
+    "$E(u)$ is called the embedding of the word $u$.\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### Dimensions defined in advance?\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "One could imagine the $m$ dimensions being defined in advance\n",
+    "by a linguist. These dimensions would correspond to the typical\n",
+    "“axes” considered in linguistics, for example:\n",
+    "\n",
+    "-   is the word vulgar, common, colloquial, neutral, or literary?\n",
+    "-   is the word archaic, falling out of use, or a neologism?\n",
+    "-   does the word refer to women or to men (in the sense of grammatical gender and/or\n",
+    "    sociolinguistically)?\n",
+    "-   is the word singular or plural?\n",
+    "-   is the word a noun or a verb?\n",
+    "-   is the word native or a loanword?\n",
+    "-   is the word a proper name or a common word?\n",
+    "-   does the word describe a concrete thing or an abstract concept?\n",
+    "-   …\n",
+    "\n",
+    "In practice, however, it turned out to be better to let the computer learn\n",
+    "the possible dimensions by itself; we only fix $m$ (the number of dimensions) in advance.\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Bigram language model based on embeddings\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We will now build the simplest language model based on embeddings. It will in fact be the simplest\n",
+    "**neural language model**, since the resulting model can be treated as a simple neural network.\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### Vocabulary\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "In a typical neural language model, the vocabulary size must be limited\n",
+    "in advance. It is usually on the order of tens of thousands of words:\n",
+    "we simply keep the $|V|$ most frequent words and replace all the others\n",
+    "with a special `<unk>` token representing an unknown word.\n",
+    "\n",
+    "To build such a vocabulary, we will use the ready-made `Vocab` class from the torchtext package:\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "12531"
+      ]
+     },
+     "execution_count": 8,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "from itertools import islice\n",
+    "import regex as re\n",
+    "import sys\n",
+    "from torchtext.vocab import build_vocab_from_iterator\n",
+    "import lzma\n",
+    "\n",
+    "def get_words_from_line(line):\n",
+    "    # Yield the lowercased tokens of one line, wrapped in <s> ... </s> markers.\n",
+    "    line = line.rstrip()\n",
+    "    yield '<s>'\n",
+    "    # A token is a run of letters, digits or '*', or a run of punctuation.\n",
+    "    for m in re.finditer(r'[\\p{L}0-9\\*]+|\\p{P}+', line):\n",
+    "        yield m.group(0).lower()\n",
+    "    yield '</s>'\n",
+    "\n",
+    "\n",
+    "def get_word_lines_from_file(file_name):\n",
+    "    # Yield one token generator per line of an xz-compressed text file.\n",
+    "    # Only the first 99,999 lines are read, as in the original counter-based loop.\n",
+    "    with lzma.open(file_name, 'r') as fh:\n",
+    "        for line in islice(fh, 99999):\n",
+    "            yield get_words_from_line(line.decode(\"utf-8\"))\n",
+    "\n",
+    "vocab_size = 20000\n",
+    "\n",
+    "# Keep the vocab_size most frequent tokens; <unk> stands for every other word.\n",
+    "vocab = build_vocab_from_iterator(\n",
+    "    get_word_lines_from_file('train/in.tsv.xz'),\n",
+    "    max_tokens = vocab_size,\n",
+    "    specials = ['<unk>'])\n",
+    "\n",
+    "vocab['jest']"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pickle\n",
+    "with open(\"vocab.pickle\", 'wb') as handle:\n",
+    "    pickle.dump(vocab, handle)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "with open(\"vocab.pickle\", 'rb') as handle:\n",
+    "    vocab = pickle.load( handle)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 16,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "838"
+      ]
+     },
+     "execution_count": 16,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "vocab['love']"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "vocab.lookup_token(838)  # inverse lookup: index -> token (838 is the index of 'love' above)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Defaulting to user installation because normal site-packages is not writeable\n",
+      "Collecting torchtext\n",
+      "  Downloading torchtext-0.15.1-cp38-cp38-manylinux1_x86_64.whl (2.0 MB)\n",
+      "Collecting torchdata==0.6.0\n",
+      "  Downloading torchdata-0.6.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.6 MB)\n",
+      "Collecting torch==2.0.0\n",
+      "  Downloading torch-2.0.0-cp38-cp38-manylinux1_x86_64.whl (619.9 MB)\n",
+      "  Attempting uninstall: torch\n",
+      "    Found existing installation: torch 1.10.0\n",
+      "    Uninstalling torch-1.10.0:\n",
+      "      Successfully uninstalled torch-1.10.0\n",
+      "ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.\n",
+      "laserembeddings 1.1.2 requires sacremoses==0.0.35, which is not installed.\n",
+      "unbabel-comet 1.1.0 requires numpy>=1.20.0, but you have numpy 1.19.3 which is incompatible.\n",
+      "unbabel-comet 1.1.0 requires scipy>=1.5.4, but you have scipy 1.4.1 which is incompatible.\n",
+      "unbabel-comet 1.1.0 requires torch<=1.10.0,>=1.6.0, but you have torch 2.0.0 which is incompatible.\n",
+      "torchvision 0.11.1 requires torch==1.10.0, but you have torch 2.0.0 which is incompatible.\n",
+      "torchaudio 0.10.0 requires torch==1.10.0, but you have torch 2.0.0 which is incompatible.\n",
+      "laserembeddings 1.1.2 requires torch<2.0.0,>=1.0.1.post2, but you have torch 2.0.0 which is incompatible.\n",
+      "Successfully installed cmake-3.26.3 lit-16.0.1 mpmath-1.3.0 nvidia-cublas-cu11-11.10.3.66 nvidia-cuda-cupti-cu11-11.7.101 nvidia-cuda-nvrtc-cu11-11.7.99 nvidia-cuda-runtime-cu11-11.7.99 nvidia-cudnn-cu11-8.5.0.96 nvidia-cufft-cu11-10.9.0.58 nvidia-curand-cu11-10.2.10.91 nvidia-cusolver-cu11-11.4.0.1 nvidia-cusparse-cu11-11.7.4.91 nvidia-nccl-cu11-2.14.3 nvidia-nvtx-cu11-11.7.91 sympy-1.11.1 torch-2.0.0 torchdata-0.6.0 torchtext-0.15.1 triton-2.0.0\n"
+     ]
+    }
+   ],
+   "source": [
+    "!pip3 install torchtext"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 17,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "['<unk>', '\\\\', 'the', '-\\\\', 'wno']"
+      ]
+     },
+     "execution_count": 17,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "vocab.lookup_tokens([0, 1, 2, 10, 12345])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### Network definition\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We will implement our simple neural network using the PyTorch framework.\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 18,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def look_ahead_iterator(gen):\n",
+    "    # Yield consecutive pairs (previous item, current item) from gen.\n",
+    "    prev = None\n",
+    "    for item in gen:\n",
+    "        if prev is not None:\n",
+    "            yield (prev, item)\n",
+    "        prev = item"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 23,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "[(1, 2), (2, 3), (3, 4), (4, 5), (5, 'X'), ('X', 6)]"
+      ]
+     },
+     "execution_count": 23,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "list(look_ahead_iterator([1,2,3,4,5, 'X', 6 ]))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 34,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from torch import nn\n",
+    "import torch\n",
+    "\n",
+    "embed_size = 100\n",
+    "\n",
+    "class SimpleBigramNeuralLanguageModel(nn.Module):\n",
+    "  def __init__(self, vocabulary_size, embedding_size):\n",
+    "      super(SimpleBigramNeuralLanguageModel, self).__init__()\n",
+    "      self.model = nn.Sequential(\n",
+    "          nn.Embedding(vocabulary_size, embedding_size),  # word id -> embedding E(v)\n",
+    "          nn.Linear(embedding_size, vocabulary_size),     # embedding -> one score per word\n",
+    "          nn.Softmax(dim=-1)                              # scores -> probabilities P(u|v)\n",
+    "      )\n",
+    "\n",
+    "  def forward(self, x):\n",
+    "      return self.model(x)\n",
+    "\n",
+    "model = SimpleBigramNeuralLanguageModel(vocab_size, embed_size)\n",
+    "\n",
+    "# Map out-of-vocabulary words to <unk> instead of raising an error.\n",
+    "vocab.set_default_index(vocab['<unk>'])\n",
+    "ixs = torch.tensor(vocab.forward(['pies']))\n",
+    "out = model(ixs)  # shape (1, vocab_size)\n",
+    "# out[0][vocab['jest']] is the (still untrained) probability of 'jest' following 'pies'"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Now let us train the model. First, just shuffle our file:\n",
+    "\n",
+    "    shuf < opensubtitlesA.pl.txt > opensubtitlesA.pl.shuf.txt\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 25,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from torch.utils.data import IterableDataset\n",
+    "import itertools\n",
+    "\n",
+    "class Bigrams(IterableDataset):\n",
+    "    # Streams (previous word id, next word id) pairs from a text file.\n",
+    "    # Relies on look_ahead_iterator defined in an earlier cell.\n",
+    "    def __init__(self, text_file, vocabulary_size):\n",
+    "        self.vocab = build_vocab_from_iterator(\n",
+    "            get_word_lines_from_file(text_file),\n",
+    "            max_tokens = vocabulary_size,\n",
+    "            specials = ['<unk>'])\n",
+    "        # Words outside the vocabulary are mapped to <unk>.\n",
+    "        self.vocab.set_default_index(self.vocab['<unk>'])\n",
+    "        self.vocabulary_size = vocabulary_size\n",
+    "        self.text_file = text_file\n",
+    "\n",
+    "    def __iter__(self):\n",
+    "        # Flatten all lines into one token stream, map tokens to ids,\n",
+    "        # then pair each id with its successor. Pairs also span line\n",
+    "        # boundaries (</s> followed by <s>).\n",
+    "        token_ids = (self.vocab[t] for t in itertools.chain.from_iterable(\n",
+    "            get_word_lines_from_file(self.text_file)))\n",
+    "        return look_ahead_iterator(token_ids)\n",
+    "\n",
+    "train_dataset = Bigrams('train/in.tsv.xz', vocab_size)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 32,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from torch.utils.data import DataLoader\n",
+    "\n",
+    "next(iter(train_dataset))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 29,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "['<s>', '<unk>']"
+      ]
+     },
+     "execution_count": 29,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "vocab.lookup_tokens([43, 0])"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "[tensor([   2,    5,   51, 3481,  231]), tensor([   5,   51, 3481,  231,    4])]"
+     ]
+    }
+   ],
+   "source": [
+    "from torch.utils.data import DataLoader\n",
+    "\n",
+    "next(iter(DataLoader(train_dataset, batch_size=5)))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 35,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "0 tensor(10.0877, grad_fn=<NllLossBackward0>)\n"
+     ]
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "100 tensor(8.4388, grad_fn=<NllLossBackward0>)\n",
+      "200 tensor(7.7335, grad_fn=<NllLossBackward0>)\n",
+      "300 tensor(7.1300, grad_fn=<NllLossBackward0>)\n",
+      "400 tensor(6.7325, grad_fn=<NllLossBackward0>)\n",
+      "500 tensor(6.4705, grad_fn=<NllLossBackward0>)\n",
+      "600 tensor(6.0460, grad_fn=<NllLossBackward0>)\n",
+      "700 tensor(5.8104, grad_fn=<NllLossBackward0>)\n",
+      "800 tensor(5.8110, grad_fn=<NllLossBackward0>)\n",
+      "900 tensor(5.7169, grad_fn=<NllLossBackward0>)\n",
+      "1000 tensor(5.7580, grad_fn=<NllLossBackward0>)\n",
+      "1100 tensor(5.6787, grad_fn=<NllLossBackward0>)\n",
+      "1200 tensor(5.4501, grad_fn=<NllLossBackward0>)\n"
+     ]
+    },
+    {
+     "ename": "KeyboardInterrupt",
+     "evalue": "",
+     "output_type": "error",
+     "traceback": [
+      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+      "\u001b[0;31mKeyboardInterrupt\u001b[0m                         Traceback (most recent call last)",
+      "\u001b[0;32m/tmp/ipykernel_12664/1293343661.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m     11\u001b[0m    \u001b[0my\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0my\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mto\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdevice\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     12\u001b[0m    \u001b[0moptimizer\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mzero_grad\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 13\u001b[0;31m    \u001b[0mypredicted\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mmodel\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mx\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m     14\u001b[0m    \u001b[0mloss\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mcriterion\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtorch\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mlog\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mypredicted\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0my\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     15\u001b[0m    \u001b[0;32mif\u001b[0m \u001b[0mstep\u001b[0m \u001b[0;34m%\u001b[0m \u001b[0;36m100\u001b[0m \u001b[0;34m==\u001b[0m \u001b[0;36m0\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+      "\u001b[0;32m~/.local/lib/python3.8/site-packages/torch/nn/modules/module.py\u001b[0m in \u001b[0;36m_call_impl\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m   1499\u001b[0m                 \u001b[0;32mor\u001b[0m \u001b[0m_global_backward_pre_hooks\u001b[0m \u001b[0;32mor\u001b[0m \u001b[0m_global_backward_hooks\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   1500\u001b[0m                 or _global_forward_hooks or _global_forward_pre_hooks):\n\u001b[0;32m-> 1501\u001b[0;31m             \u001b[0;32mreturn\u001b[0m \u001b[0mforward_call\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   1502\u001b[0m         \u001b[0;31m# Do not call functions when jit is used\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   1503\u001b[0m         \u001b[0mfull_backward_hooks\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnon_full_backward_hooks\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+      "\u001b[0;32m/tmp/ipykernel_12664/517511851.py\u001b[0m in \u001b[0;36mforward\u001b[0;34m(self, x)\u001b[0m\n\u001b[1;32m     14\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     15\u001b[0m   \u001b[0;32mdef\u001b[0m \u001b[0mforward\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mx\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 16\u001b[0;31m       \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmodel\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mx\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m     17\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     18\u001b[0m \u001b[0mmodel\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mSimpleBigramNeuralLanguageModel\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mvocab_size\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0membed_size\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+      "\u001b[0;32m~/.local/lib/python3.8/site-packages/torch/nn/modules/module.py\u001b[0m in \u001b[0;36m_call_impl\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m   1499\u001b[0m                 \u001b[0;32mor\u001b[0m \u001b[0m_global_backward_pre_hooks\u001b[0m \u001b[0;32mor\u001b[0m \u001b[0m_global_backward_hooks\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   1500\u001b[0m                 or _global_forward_hooks or _global_forward_pre_hooks):\n\u001b[0;32m-> 1501\u001b[0;31m             \u001b[0;32mreturn\u001b[0m \u001b[0mforward_call\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   1502\u001b[0m         \u001b[0;31m# Do not call functions when jit is used\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   1503\u001b[0m         \u001b[0mfull_backward_hooks\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnon_full_backward_hooks\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+      "\u001b[0;32m~/.local/lib/python3.8/site-packages/torch/nn/modules/container.py\u001b[0m in \u001b[0;36mforward\u001b[0;34m(self, input)\u001b[0m\n\u001b[1;32m    215\u001b[0m     \u001b[0;32mdef\u001b[0m \u001b[0mforward\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0minput\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    216\u001b[0m         \u001b[0;32mfor\u001b[0m \u001b[0mmodule\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 217\u001b[0;31m             \u001b[0minput\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mmodule\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0minput\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    218\u001b[0m         \u001b[0;32mreturn\u001b[0m \u001b[0minput\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    219\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
+      "\u001b[0;32m~/.local/lib/python3.8/site-packages/torch/nn/modules/module.py\u001b[0m in \u001b[0;36m_call_impl\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m   1499\u001b[0m                 \u001b[0;32mor\u001b[0m \u001b[0m_global_backward_pre_hooks\u001b[0m \u001b[0;32mor\u001b[0m \u001b[0m_global_backward_hooks\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   1500\u001b[0m                 or _global_forward_hooks or _global_forward_pre_hooks):\n\u001b[0;32m-> 1501\u001b[0;31m             \u001b[0;32mreturn\u001b[0m \u001b[0mforward_call\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   1502\u001b[0m         \u001b[0;31m# Do not call functions when jit is used\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   1503\u001b[0m         \u001b[0mfull_backward_hooks\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnon_full_backward_hooks\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+      "\u001b[0;32m~/.local/lib/python3.8/site-packages/torch/nn/modules/activation.py\u001b[0m in \u001b[0;36mforward\u001b[0;34m(self, input)\u001b[0m\n\u001b[1;32m   1457\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   1458\u001b[0m     \u001b[0;32mdef\u001b[0m \u001b[0mforward\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0minput\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mTensor\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;34m->\u001b[0m \u001b[0mTensor\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1459\u001b[0;31m         \u001b[0;32mreturn\u001b[0m \u001b[0mF\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msoftmax\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0minput\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdim\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0m_stacklevel\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m5\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   1460\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   1461\u001b[0m     \u001b[0;32mdef\u001b[0m \u001b[0mextra_repr\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;34m->\u001b[0m \u001b[0mstr\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+      "\u001b[0;32m~/.local/lib/python3.8/site-packages/torch/nn/functional.py\u001b[0m in \u001b[0;36msoftmax\u001b[0;34m(input, dim, _stacklevel, dtype)\u001b[0m\n\u001b[1;32m   1841\u001b[0m         \u001b[0mdim\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0m_get_softmax_dim\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"softmax\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0minput\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdim\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0m_stacklevel\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   1842\u001b[0m     \u001b[0;32mif\u001b[0m \u001b[0mdtype\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1843\u001b[0;31m         \u001b[0mret\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0minput\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msoftmax\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdim\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   1844\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   1845\u001b[0m         \u001b[0mret\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0minput\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msoftmax\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdim\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mdtype\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+      "\u001b[0;31mKeyboardInterrupt\u001b[0m: "
+     ]
+    }
+   ],
+   "source": [
+    "device = 'cpu'\n",
+    "model = SimpleBigramNeuralLanguageModel(vocab_size, embed_size).to(device)\n",
+    "data = DataLoader(train_dataset, batch_size=5000)\n",
+    "optimizer = torch.optim.Adam(model.parameters())\n",
+    "criterion = torch.nn.NLLLoss()\n",
+    "\n",
+    "model.train()\n",
+    "step = 0\n",
+    "for x, y in data:\n",
+    "   x = x.to(device)\n",
+    "   y = y.to(device)\n",
+    "   optimizer.zero_grad()\n",
+    "   ypredicted = model(x)\n",
+    "   loss = criterion(torch.log(ypredicted), y)\n",
+    "   if step % 100 == 0:\n",
+    "      print(step, loss)\n",
+    "   step += 1\n",
+    "   loss.backward()\n",
+    "   optimizer.step()\n",
+    "\n",
+    "torch.save(model.state_dict(), 'model1.bin')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 36,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "torch.save(model.state_dict(), 'model1.bin')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Let us compute the most probable continuations of a given word:\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 37,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "[(',', 3, 0.12514805793762207),\n",
+       " ('\\\\', 1, 0.07237359136343002),\n",
+       " ('<unk>', 0, 0.06839419901371002),\n",
+       " ('.', 4, 0.06109621003270149),\n",
+       " ('of', 5, 0.04557998105883598),\n",
+       " ('and', 6, 0.03565318509936333),\n",
+       " ('the', 2, 0.029342489317059517),\n",
+       " ('to', 7, 0.02185475267469883),\n",
+       " ('-\\\\', 10, 0.018097609281539917),\n",
+       " ('in', 9, 0.016023961827158928)]"
+      ]
+     },
+     "execution_count": 37,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "device = 'cpu'\n",
+    "model = SimpleBigramNeuralLanguageModel(vocab_size, embed_size).to(device)\n",
+    "model.load_state_dict(torch.load('model1.bin'))\n",
+    "model.eval()\n",
+    "\n",
+    "ixs = torch.tensor(vocab.forward(['dla'])).to(device)\n",
+    "\n",
+    "out = model(ixs)\n",
+    "top = torch.topk(out[0], 10)\n",
+    "top_indices = top.indices.tolist()\n",
+    "top_probs = top.values.tolist()\n",
+    "top_words = vocab.lookup_tokens(top_indices)\n",
+    "list(zip(top_words, top_indices, top_probs))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Now let us examine the embeddings most similar to that of a given word:\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 38,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "[(',', 3, 0.12514805793762207),\n",
+       " ('\\\\', 1, 0.07237359136343002),\n",
+       " ('<unk>', 0, 0.06839419901371002),\n",
+       " ('.', 4, 0.06109621003270149),\n",
+       " ('of', 5, 0.04557998105883598),\n",
+       " ('and', 6, 0.03565318509936333),\n",
+       " ('the', 2, 0.029342489317059517),\n",
+       " ('to', 7, 0.02185475267469883),\n",
+       " ('-\\\\', 10, 0.018097609281539917),\n",
+       " ('in', 9, 0.016023961827158928)]"
+      ]
+     },
+     "execution_count": 38,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "vocab = train_dataset.vocab\n",
+    "ixs = torch.tensor(vocab.forward(['kłopot'])).to(device)\n",
+    "\n",
+    "out = model(ixs)\n",
+    "top = torch.topk(out[0], 10)\n",
+    "top_indices = top.indices.tolist()\n",
+    "top_probs = top.values.tolist()\n",
+    "top_words = vocab.lookup_tokens(top_indices)\n",
+    "list(zip(top_words, top_indices, top_probs))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 39,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "[('<unk>', 0, 1.0000001192092896),\n",
+       " ('nb', 1958, 0.4407886266708374),\n",
+       " ('refrain', 14092, 0.4395471513271332),\n",
+       " ('cat', 3391, 0.4154242277145386),\n",
+       " ('enjoying', 7521, 0.3915165066719055),\n",
+       " ('active', 1383, 0.38935279846191406),\n",
+       " ('stewart', 4816, 0.3806381821632385),\n",
+       " ('omit', 15600, 0.380504310131073),\n",
+       " ('2041095573313', 11912, 0.37909239530563354),\n",
+       " ('shut', 3863, 0.3778260052204132)]"
+      ]
+     },
+     "execution_count": 39,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "cos = nn.CosineSimilarity(dim=1, eps=1e-6)\n",
+    "\n",
+    "embeddings = model.model[0].weight\n",
+    "\n",
+    "vec = embeddings[vocab['poszedł']]\n",
+    "\n",
+    "similarities = cos(vec, embeddings)\n",
+    "\n",
+    "top = torch.topk(similarities, 10)\n",
+    "\n",
+    "top_indices = top.indices.tolist()\n",
+    "top_probs = top.values.tolist()\n",
+    "top_words = vocab.lookup_tokens(top_indices)\n",
+    "list(zip(top_words, top_indices, top_probs))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### Expressing the model as a mathematical formula\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The neural network programmed above can be described with the following formula:\n",
+    "\n",
+    "$$\\vec{y} = \\operatorname{softmax}(CE(w_{i-1})),$$\n",
+    "\n",
+    "where:\n",
+    "\n",
+    "-   $w_{i-1}$ is the first word of the bigram (the preceding word),\n",
+    "-   $E(w)$ is the embedding of the word $w$, a vector of size $m$,\n",
+    "-   $C$ is a matrix of size $|V| \\times m$ that projects the embedding vector onto a vector of vocabulary size,\n",
+    "-   $\\vec{y}$ is the output probability vector of size $|V|$.\n",
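+    "\n",
+    "As a rough sketch of the shapes involved (the intermediate symbol $\\vec{z}$ is introduced here only for illustration and is not part of the formula above), the computation can be unrolled as:\n",
+    "\n",
+    "$$E(w_{i-1}) \\in \\mathcal{R}^m, \\qquad \\vec{z} = CE(w_{i-1}) \\in \\mathcal{R}^{|V|}, \\qquad y_j = \\frac{\\exp(z_j)}{\\sum_{k=1}^{|V|} \\exp(z_k)},$$\n",
+    "\n",
+    "so every $y_j$ is positive and the components of $\\vec{y}$ sum to 1.\n",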
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "##### Hyperparameters\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Note that our model has two hyperparameters:\n",
+    "\n",
+    "-   $m$: the embedding size,\n",
+    "-   $|V|$: the vocabulary size, assuming we can control it\n",
+    "    (e.g. by truncating the vocabulary to a given number of the most\n",
+    "    frequent words and replacing the remaining ones with a special token, say, `<UNK>`).\n",
+    "\n",
+    "Naturally, we can try tuning the values of $m$ and $|V|$ to\n",
+    "improve the results of our model.\n",
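+    "\n",
+    "To get a feel for the scale, here is a back-of-the-envelope parameter count. It assumes the model consists only of the two matrices $E$ and $C$ (no bias terms) and takes $|V| = 20000$, the vocabulary size set in `create_vocab.py`; the value $m = 100$ is purely an illustrative assumption:\n",
+    "\n",
+    "$$\\underbrace{m \\cdot |V|}_{E} + \\underbrace{|V| \\cdot m}_{C} = 2m|V| = 2 \\cdot 100 \\cdot 20000 = 4 \\cdot 10^6.$$\n",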
+    "\n",
+    "**Question**: why does a value of $m \\approx |V|$ make no sense? Why does $m = 1$ make no sense either?\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### Network diagram\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Since multiplying by a matrix ($C$) simply amounts to applying\n",
+    "a linear layer, our network can be interpreted as a single-layer\n",
+    "neural network, which can be illustrated with the following diagram:\n",
+    "\n",
+    "![img](./09_Zanurzenia_slow/bigram1.drawio.png \"Diagram of a simple bigram neural language model\")\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### Embedding as matrix multiplication\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Obtaining the embedding ($E(w)$) is usually implemented as\n",
+    "a *look-up*. Interestingly, the embedding can be interpreted as\n",
+    "multiplication by an embedding matrix $E$ of size $m \\times |V|$, provided the input word is encoded as\n",
+    "a one-hot vector (*one-hot encoding*), i.e. the word $w$ is\n",
+    "fed to the input as a vector $\\vec{1_V}(w) = [0,\\ldots,0,1,0,\\ldots,0]$ of size $|V|$\n",
+    "consisting of zeros only, except for a one at the position corresponding to the index of the word $w$ in the vocabulary $V$.\n",
+    "\n",
+    "The formula then takes the form:\n",
+    "\n",
+    "$$\\vec{y} = \\operatorname{softmax}(CE\\vec{1_V}(w_{i-1})),$$\n",
+    "\n",
+    "where $E$ is now an $m \\times |V|$ matrix.\n",
+    "\n",
+    "**Question**: do we interpret $\\vec{1_V}(w)$ as a row vector or a column vector?\n",
+    "\n",
+    "This interpretation can be illustrated with the following diagram:\n",
+    "\n",
+    "![img](./09_Zanurzenia_slow/bigram2.drawio.png \"Diagram of a simple bigram neural language model with one-hot input\")\n",
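+    "\n",
+    "A tiny worked example may help here. It assumes, purely for illustration, $|V| = 3$, $m = 2$, a word $w$ with index 2 in the vocabulary (counting from 1), and a column-vector convention for $\\vec{1_V}(w)$:\n",
+    "\n",
+    "$$E\\vec{1_V}(w) = \\begin{bmatrix} e_{11} & e_{12} & e_{13} \\\\ e_{21} & e_{22} & e_{23} \\end{bmatrix} \\begin{bmatrix} 0 \\\\ 1 \\\\ 0 \\end{bmatrix} = \\begin{bmatrix} e_{12} \\\\ e_{22} \\end{bmatrix},$$\n",
+    "\n",
+    "so the product simply selects the column of $E$ that corresponds to $w$, which is the same result a look-up would return.\n",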
+    "\n"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3.8.12 64-bit",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.12"
+  },
+  "org": null,
+  "vscode": {
+   "interpreter": {
+    "hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6"
+   }
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 1
+}
--- a/README.md
+++ b/README.md
@ -0,0 +1,9 @@
+Challenging America word-gap prediction
+===================================
+
+Guess a word in a gap.
+
+Evaluation metric
+-----------------
+
+LikelihoodHashed is the metric
--- a/config.txt
+++ b/config.txt
@ -0,0 +1 @@
+--metric PerplexityHashed --precision 2  --in-header in-header.tsv  --out-header out-header.tsv
--- a/create_vocab.py
+++ b/create_vocab.py
@ -0,0 +1,35 @@
+from itertools import islice
+import regex as re
+import sys
+from torchtext.vocab import build_vocab_from_iterator
+import lzma
+
+def get_words_from_line(line):
+  line = line.rstrip()
+  yield '<s>'
+  for m in re.finditer(r'[\p{L}0-9\*]+|\p{P}+', line):
+     yield m.group(0).lower()
+  yield '</s>'
+
+
+def get_word_lines_from_file(file_name):
+  counter=0
+  with lzma.open(file_name, 'r') as fh:
+    for line in fh:
+      counter+=1
+      # if counter == 10000:
+      #   break
+      line = line.decode("utf-8")
+      yield get_words_from_line(line)
+
+
+vocab_size = 20000
+
+vocab = build_vocab_from_iterator(
+    get_word_lines_from_file('train/in.tsv.xz'),
+    max_tokens = vocab_size,
+    specials = ['<unk>'])
+
+import pickle
+with open("vocab.pickle", 'wb') as handle:
+    pickle.dump(vocab, handle)
--- a/dev-0/expected.tsv
+++ b/dev-0/expected.tsv
--- a/dev-0/hate-speech-info.tsv
+++ b/dev-0/hate-speech-info.tsv
--- a/dev-0/in.tsv.xz
+++ b/dev-0/in.tsv.xz
--- a/dev-0/out.tsv
+++ b/dev-0/out.tsv
--- a/BIN
+++ b/BIN
--- a/in-header.tsv
+++ b/in-header.tsv
@ -0,0 +1 @@
+FileId	Year	LeftContext	RightContext
--- a/in-header.tsv.1
+++ b/in-header.tsv.1
@ -0,0 +1,657 @@
+<!DOCTYPE html>
+<html lang="en-US" class="theme-">
+<head>
+	<meta charset="utf-8">
+	<meta name="viewport" content="width=device-width, initial-scale=1">
+	<title>challenging-america-word-gap-prediction/in-header.tsv at 4gram -  challenging-america-word-gap-prediction - Gitea: Git with a cup of tea</title>
+	<link rel="manifest" href="data:application/json;base64,eyJuYW1lIjoiR2l0ZWE6IEdpdCB3aXRoIGEgY3VwIG9mIHRlYSIsInNob3J0X25hbWUiOiJHaXRlYTogR2l0IHdpdGggYSBjdXAgb2YgdGVhIiwic3RhcnRfdXJsIjoiaHR0cHM6Ly9naXQud21pLmFtdS5lZHUucGwvIiwiaWNvbnMiOlt7InNyYyI6Imh0dHBzOi8vZ2l0LndtaS5hbXUuZWR1LnBsL2Fzc2V0cy9pbWcvbG9nby5wbmciLCJ0eXBlIjoiaW1hZ2UvcG5nIiwic2l6ZXMiOiI1MTJ4NTEyIn0seyJzcmMiOiJodHRwczovL2dpdC53bWkuYW11LmVkdS5wbC9hc3NldHMvaW1nL2xvZ28uc3ZnIiwidHlwZSI6ImltYWdlL3N2Zyt4bWwiLCJzaXplcyI6IjUxMng1MTIifV19"/>
+	<meta name="theme-color" content="#6cc644">
+	<meta name="default-theme" content="auto" />
+	<meta name="author" content="s444463" />
+	<meta name="description" content="challenging-america-word-gap-prediction" />
+	<meta name="keywords" content="go,git,self-hosted,gitea">
+	<meta name="referrer" content="no-referrer" />
+
+	<script>
+		<!--   -->
+		window.config = {
+			appVer: '1.16.4',
+			appSubUrl: '',
+			assetUrlPrefix: '\/assets',
+			runModeIsProd:  true ,
+			customEmojis: {"codeberg":":codeberg:","git":":git:","gitea":":gitea:","github":":github:","gitlab":":gitlab:","gogs":":gogs:"},
+			useServiceWorker:  true ,
+			csrfToken: 'p3_cvG3N1UMZPleED8GVFsIG_NE6MTY4MTMyODUyMDA1OTE2ODE0Nw',
+			pageData: {},
+			requireTribute:  null ,
+			notificationSettings: {"EventSourceUpdateTime":10000,"MaxTimeout":60000,"MinTimeout":10000,"TimeoutStep":10000}, 
+			enableTimeTracking:  true ,
+			
+			mermaidMaxSourceCharacters:  5000 ,
+			
+			i18n: {
+				copy_success: 'Copied!',
+				copy_error: 'Copy failed',
+				error_occurred: 'An error occurred',
+				network_error: 'Network error',
+			},
+		};
+		
+		window.config.pageData = window.config.pageData || {};
+	</script>
+	<link rel="icon" href="/assets/img/logo.svg" type="image/svg+xml">
+	<link rel="alternate icon" href="/assets/img/favicon.png" type="image/png">
+	<link rel="stylesheet" href="/assets/css/index.css?v=88278c3fba7f4dbe8d14a7ab5c4cfc9c">
+	<noscript>
+		<style>
+			.dropdown:hover > .menu { display: block; }
+			.ui.secondary.menu .dropdown.item > .menu { margin-top: 0; }
+		</style>
+	</noscript>
+
+	
+		<meta property="og:title" content="challenging-america-word-gap-prediction" />
+		<meta property="og:url" content="https://git.wmi.amu.edu.pl/s444463/challenging-america-word-gap-prediction" />
+		
+	
+	<meta property="og:type" content="object" />
+	
+		<meta property="og:image" content="https://git.wmi.amu.edu.pl/avatars/a6fe95f301e02b3472a0560d70cc307a" />
+	
+
+<meta property="og:site_name" content="Gitea: Git with a cup of tea" />
+
+	<link rel="stylesheet" href="/assets/css/theme-auto.css?v=88278c3fba7f4dbe8d14a7ab5c4cfc9c">
+
+
+<link rel="stylesheet/less" type="text/css" href="/assets/css/jupyter.less" />
+
+<script src="/assets/js/less.js" ></script>
+
+
+
+
+
+
+
+
+
+
+    <script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.7/latest.js?config=TeX-MML-AM_CHTML-full,Safe"> </script>
+    
+    <script type="text/x-mathjax-config">
+    init_mathjax = function() {
+        if (window.MathJax) {
+        // MathJax loaded
+            MathJax.Hub.Config({
+                TeX: {
+                    equationNumbers: {
+                    autoNumber: "AMS",
+                    useLabelIds: true
+                    }
+                },
+                tex2jax: {
+                    inlineMath: [ ['$','$'], ["\\(","\\)"] ],
+                    displayMath: [ ['$$','$$'], ["\\[","\\]"] ],
+                    processEscapes: true,
+                    processEnvironments: true
+                },
+                displayAlign: 'center',
+                CommonHTML: {
+                    linebreaks: { 
+                    automatic: true 
+                    }
+                },
+                "HTML-CSS": {
+                    linebreaks: { 
+                    automatic: true 
+                    }
+                }
+            });
+        
+            MathJax.Hub.Queue(["Typeset", MathJax.Hub]);
+        }
+    }
+    init_mathjax();
+    </script>
+    
+
+<script type="text/javascript">
+function giteaSanitizerHack() {
+    var imgs = document.getElementsByClassName("nb-image-output");
+    var i;
+    for (i = 0; i < imgs.length; i++) {
+        imgs[i].src=imgs[i].src.replace('https://gitea.sanitizer.hack/','');
+    }
+}
+window.onload = giteaSanitizerHack;
+</script>
+
+</head>
+<body>
+	
+
+	<div class="full height">
+		<noscript>This website works better with JavaScript.</noscript>
+
+		
+
+		
+			<div class="ui top secondary stackable main menu following bar light">
+				<div class="ui container" id="navbar">
+	<div class="item brand" style="justify-content: space-between;">
+		<a href="/" data-content="Home">
+			<img class="ui mini image" width="30" height="30" src="/assets/img/logo.svg">
+		</a>
+		<div class="ui basic icon button mobile-only" id="navbar-expand-toggle">
+			<i class="sidebar icon"></i>
+		</div>
+	</div>
+
+	
+		<a class="item " href="/explore/repos">Explore</a>
+	
+
+	
+
+	
+
+
+	
+		<a class="item" target="_blank" rel="noopener noreferrer" href="https://docs.gitea.io">Help</a>
+		<div class="right stackable menu">
+			
+			<a class="item" rel="nofollow" href="/user/login?redirect_to=%2fs444463%2fchallenging-america-word-gap-prediction%2fsrc%2fbranch%2f4gram%2fin-header.tsv">
+				<svg viewBox="0 0 16 16" class="svg octicon-sign-in" width="16" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M2 2.75C2 1.784 2.784 1 3.75 1h2.5a.75.75 0 0 1 0 1.5h-2.5a.25.25 0 0 0-.25.25v10.5c0 .138.112.25.25.25h2.5a.75.75 0 0 1 0 1.5h-2.5A1.75 1.75 0 0 1 2 13.25V2.75zm6.56 4.5 1.97-1.97a.75.75 0 1 0-1.06-1.06L6.22 7.47a.75.75 0 0 0 0 1.06l3.25 3.25a.75.75 0 1 0 1.06-1.06L8.56 8.75h5.69a.75.75 0 0 0 0-1.5H8.56z"/></svg> Sign In
+			</a>
+		</div>
+	
+</div>
+
+			</div>
+		
+
+
+
+<div class="page-content repository file list ">
+	<div class="header-wrapper">
+
+	<div class="ui container">
+		<div class="repo-header">
+			<div class="repo-title-wrap df fc">
+				<div class="repo-title">
+					
+					
+						<div class="repo-icon mr-3">
+	
+		
+			<svg viewBox="0 0 16 16" class="svg octicon-repo" width="32" height="32" aria-hidden="true"><path fill-rule="evenodd" d="M2 2.5A2.5 2.5 0 0 1 4.5 0h8.75a.75.75 0 0 1 .75.75v12.5a.75.75 0 0 1-.75.75h-2.5a.75.75 0 1 1 0-1.5h1.75v-2h-8a1 1 0 0 0-.714 1.7.75.75 0 0 1-1.072 1.05A2.495 2.495 0 0 1 2 11.5v-9zm10.5-1V9h-8c-.356 0-.694.074-1 .208V2.5a1 1 0 0 1 1-1h8zM5 12.25v3.25a.25.25 0 0 0 .4.2l1.45-1.087a.25.25 0 0 1 .3 0L8.6 15.7a.25.25 0 0 0 .4-.2v-3.25a.25.25 0 0 0-.25-.25h-3.5a.25.25 0 0 0-.25.25z"/></svg>
+		
+	
+</div>
+
+					
+					<a href="/s444463">s444463</a>
+					<div class="mx-2">/</div>
+					<a href="/s444463/challenging-america-word-gap-prediction">challenging-america-word-gap-prediction</a>
+					<div class="labels df ac fw">
+						
+							
+								
+							
+						
+						
+					</div>
+				</div>
+				
+				
+				
+			</div>
+			
+				<div class="repo-buttons">
+					
+					<form method="post" action="/s444463/challenging-america-word-gap-prediction/action/watch?redirect_to=%2fs444463%2fchallenging-america-word-gap-prediction%2fsrc%2fbranch%2f4gram%2fin-header.tsv">
+						<input type="hidden" name="_csrf" value="p3_cvG3N1UMZPleED8GVFsIG_NE6MTY4MTMyODUyMDA1OTE2ODE0Nw">
+						<div class="ui labeled button tooltip" tabindex="0" data-content="Sign in to watch this repository." data-position="top center">
+							<button type="submit" class="ui compact small basic button" disabled>
+								<svg viewBox="0 0 16 16" class="svg octicon-eye" width="16" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M1.679 7.932c.412-.621 1.242-1.75 2.366-2.717C5.175 4.242 6.527 3.5 8 3.5c1.473 0 2.824.742 3.955 1.715 1.124.967 1.954 2.096 2.366 2.717a.119.119 0 0 1 0 .136c-.412.621-1.242 1.75-2.366 2.717C10.825 11.758 9.473 12.5 8 12.5c-1.473 0-2.824-.742-3.955-1.715C2.92 9.818 2.09 8.69 1.679 8.068a.119.119 0 0 1 0-.136zM8 2c-1.981 0-3.67.992-4.933 2.078C1.797 5.169.88 6.423.43 7.1a1.619 1.619 0 0 0 0 1.798c.45.678 1.367 1.932 2.637 3.024C4.329 13.008 6.019 14 8 14c1.981 0 3.67-.992 4.933-2.078 1.27-1.091 2.187-2.345 2.637-3.023a1.619 1.619 0 0 0 0-1.798c-.45-.678-1.367-1.932-2.637-3.023C11.671 2.992 9.981 2 8 2zm0 8a2 2 0 1 0 0-4 2 2 0 0 0 0 4z"/></svg>Watch
+							</button>
+							<a class="ui basic label" href="/s444463/challenging-america-word-gap-prediction/watchers">
+								1
+							</a>
+						</div>
+					</form>
+					
+						<form method="post" action="/s444463/challenging-america-word-gap-prediction/action/star?redirect_to=%2fs444463%2fchallenging-america-word-gap-prediction%2fsrc%2fbranch%2f4gram%2fin-header.tsv">
+							<input type="hidden" name="_csrf" value="p3_cvG3N1UMZPleED8GVFsIG_NE6MTY4MTMyODUyMDA1OTE2ODE0Nw">
+							<div class="ui labeled button tooltip" tabindex="0" data-content="Sign in to star this repository." data-position="top center">
+								<button type="submit" class="ui compact small basic button" disabled>
+									<svg viewBox="0 0 16 16" class="svg octicon-star" width="16" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M8 .25a.75.75 0 0 1 .673.418l1.882 3.815 4.21.612a.75.75 0 0 1 .416 1.279l-3.046 2.97.719 4.192a.75.75 0 0 1-1.088.791L8 12.347l-3.766 1.98a.75.75 0 0 1-1.088-.79l.72-4.194L.818 6.374a.75.75 0 0 1 .416-1.28l4.21-.611L7.327.668A.75.75 0 0 1 8 .25zm0 2.445L6.615 5.5a.75.75 0 0 1-.564.41l-3.097.45 2.24 2.184a.75.75 0 0 1 .216.664l-.528 3.084 2.769-1.456a.75.75 0 0 1 .698 0l2.77 1.456-.53-3.084a.75.75 0 0 1 .216-.664l2.24-2.183-3.096-.45a.75.75 0 0 1-.564-.41L8 2.694v.001z"/></svg>Star
+								</button>
+								<a class="ui basic label" href="/s444463/challenging-america-word-gap-prediction/stars">
+									0
+								</a>
+							</div>
+						</form>
+					
+					
+						<div class="ui labeled button
+							
+								tooltip disabled
+							"
+							
+								data-content="Sign in to fork this repository."
+							
+						data-position="top center" data-variation="tiny" tabindex="0">
+							<a class="ui compact small basic button"
+								
+									
+								
+							>
+								<svg viewBox="0 0 16 16" class="svg octicon-repo-forked" width="16" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M5 3.25a.75.75 0 1 1-1.5 0 .75.75 0 0 1 1.5 0zm0 2.122a2.25 2.25 0 1 0-1.5 0v.878A2.25 2.25 0 0 0 5.75 8.5h1.5v2.128a2.251 2.251 0 1 0 1.5 0V8.5h1.5a2.25 2.25 0 0 0 2.25-2.25v-.878a2.25 2.25 0 1 0-1.5 0v.878a.75.75 0 0 1-.75.75h-4.5A.75.75 0 0 1 5 6.25v-.878zm3.75 7.378a.75.75 0 1 1-1.5 0 .75.75 0 0 1 1.5 0zm3-8.75a.75.75 0 1 0 0-1.5.75.75 0 0 0 0 1.5z"/></svg>Fork
+							</a>
+							<div class="ui small modal" id="fork-repo-modal">
+								<svg viewBox="0 0 16 16" class="close inside svg octicon-x" width="16" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M3.72 3.72a.75.75 0 0 1 1.06 0L8 6.94l3.22-3.22a.75.75 0 1 1 1.06 1.06L9.06 8l3.22 3.22a.75.75 0 1 1-1.06 1.06L8 9.06l-3.22 3.22a.75.75 0 0 1-1.06-1.06L6.94 8 3.72 4.78a.75.75 0 0 1 0-1.06z"/></svg>
+								<div class="header">
+									You&#39;ve already forked challenging-america-word-gap-prediction
+								</div>
+								<div class="content tl">
+									<div class="ui list">
+										
+									</div>
+									
+								</div>
+							</div>
+							<a class="ui basic label" href="/s444463/challenging-america-word-gap-prediction/forks">
+								0
+							</a>
+						</div>
+					
+				</div>
+			
+		</div>
+	</div>
+
+	<div class="ui tabs container">
+		
+			<div class="ui tabular stackable menu navbar">
+				
+				<a class="active item" href="/s444463/challenging-america-word-gap-prediction/src/branch/4gram">
+					<svg viewBox="0 0 16 16" class="svg octicon-code" width="16" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M4.72 3.22a.75.75 0 0 1 1.06 1.06L2.06 8l3.72 3.72a.75.75 0 1 1-1.06 1.06L.47 8.53a.75.75 0 0 1 0-1.06l4.25-4.25zm6.56 0a.75.75 0 1 0-1.06 1.06L13.94 8l-3.72 3.72a.75.75 0 1 0 1.06 1.06l4.25-4.25a.75.75 0 0 0 0-1.06l-4.25-4.25z"/></svg> Code
+				</a>
+				
+
+				
+					<a class=" item" href="/s444463/challenging-america-word-gap-prediction/issues">
+						<svg viewBox="0 0 16 16" class="svg octicon-issue-opened" width="16" height="16" aria-hidden="true"><path d="M8 9.5a1.5 1.5 0 1 0 0-3 1.5 1.5 0 0 0 0 3z"/><path fill-rule="evenodd" d="M8 0a8 8 0 1 0 0 16A8 8 0 0 0 8 0zM1.5 8a6.5 6.5 0 1 1 13 0 6.5 6.5 0 0 1-13 0z"/></svg> Issues
+						
+					</a>
+				
+
+				
+
+				
+					<a class=" item" href="/s444463/challenging-america-word-gap-prediction/pulls">
+						<svg viewBox="0 0 16 16" class="svg octicon-git-pull-request" width="16" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M7.177 3.073 9.573.677A.25.25 0 0 1 10 .854v4.792a.25.25 0 0 1-.427.177L7.177 3.427a.25.25 0 0 1 0-.354zM3.75 2.5a.75.75 0 1 0 0 1.5.75.75 0 0 0 0-1.5zm-2.25.75a2.25 2.25 0 1 1 3 2.122v5.256a2.251 2.251 0 1 1-1.5 0V5.372A2.25 2.25 0 0 1 1.5 3.25zM11 2.5h-1V4h1a1 1 0 0 1 1 1v5.628a2.251 2.251 0 1 0 1.5 0V5A2.5 2.5 0 0 0 11 2.5zm1 10.25a.75.75 0 1 1 1.5 0 .75.75 0 0 1-1.5 0zM3.75 12a.75.75 0 1 0 0 1.5.75.75 0 0 0 0-1.5z"/></svg> Pull Requests
+						
+					</a>
+				
+
+				
+					<a href="/s444463/challenging-america-word-gap-prediction/projects" class=" item">
+						<svg viewBox="0 0 16 16" class="svg octicon-project" width="16" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M1.75 0A1.75 1.75 0 0 0 0 1.75v12.5C0 15.216.784 16 1.75 16h12.5A1.75 1.75 0 0 0 16 14.25V1.75A1.75 1.75 0 0 0 14.25 0H1.75zM1.5 1.75a.25.25 0 0 1 .25-.25h12.5a.25.25 0 0 1 .25.25v12.5a.25.25 0 0 1-.25.25H1.75a.25.25 0 0 1-.25-.25V1.75zM11.75 3a.75.75 0 0 0-.75.75v7.5a.75.75 0 0 0 1.5 0v-7.5a.75.75 0 0 0-.75-.75zm-8.25.75a.75.75 0 0 1 1.5 0v5.5a.75.75 0 0 1-1.5 0v-5.5zM8 3a.75.75 0 0 0-.75.75v3.5a.75.75 0 0 0 1.5 0v-3.5A.75.75 0 0 0 8 3z"/></svg> Projects
+						
+					</a>
+				
+
+				
+				<a class=" item" href="/s444463/challenging-america-word-gap-prediction/releases">
+					<svg viewBox="0 0 16 16" class="svg octicon-tag" width="16" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M2.5 7.775V2.75a.25.25 0 0 1 .25-.25h5.025a.25.25 0 0 1 .177.073l6.25 6.25a.25.25 0 0 1 0 .354l-5.025 5.025a.25.25 0 0 1-.354 0l-6.25-6.25a.25.25 0 0 1-.073-.177zm-1.5 0V2.75C1 1.784 1.784 1 2.75 1h5.025c.464 0 .91.184 1.238.513l6.25 6.25a1.75 1.75 0 0 1 0 2.474l-5.026 5.026a1.75 1.75 0 0 1-2.474 0l-6.25-6.25A1.75 1.75 0 0 1 1 7.775zM6 5a1 1 0 1 0 0 2 1 1 0 0 0 0-2z"/></svg> Releases
+					
+				</a>
+				
+
+				
+					<a class=" item" href="/s444463/challenging-america-word-gap-prediction/wiki" >
+						<svg viewBox="0 0 16 16" class="svg octicon-book" width="16" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M0 1.75A.75.75 0 0 1 .75 1h4.253c1.227 0 2.317.59 3 1.501A3.744 3.744 0 0 1 11.006 1h4.245a.75.75 0 0 1 .75.75v10.5a.75.75 0 0 1-.75.75h-4.507a2.25 2.25 0 0 0-1.591.659l-.622.621a.75.75 0 0 1-1.06 0l-.622-.621A2.25 2.25 0 0 0 5.258 13H.75a.75.75 0 0 1-.75-.75V1.75zm8.755 3a2.25 2.25 0 0 1 2.25-2.25H14.5v9h-3.757c-.71 0-1.4.201-1.992.572l.004-7.322zm-1.504 7.324.004-5.073-.002-2.253A2.25 2.25 0 0 0 5.003 2.5H1.5v9h3.757a3.75 3.75 0 0 1 1.994.574z"/></svg> Wiki
+					</a>
+				
+
+				
+					<a class=" item" href="/s444463/challenging-america-word-gap-prediction/activity">
+						<svg viewBox="0 0 16 16" class="svg octicon-pulse" width="16" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M6 2a.75.75 0 0 1 .696.471L10 10.731l1.304-3.26A.75.75 0 0 1 12 7h3.25a.75.75 0 0 1 0 1.5h-2.742l-1.812 4.528a.75.75 0 0 1-1.392 0L6 4.77 4.696 8.03A.75.75 0 0 1 4 8.5H.75a.75.75 0 0 1 0-1.5h2.742l1.812-4.529A.75.75 0 0 1 6 2z"/></svg> Activity
+					</a>
+				
+
+				
+
+				
+			</div>
+		
+	</div>
+	<div class="ui tabs divider"></div>
+</div>
+
+	<div class="ui container ">
+		
+
+
+
+		<div class="ui repo-description">
+			<div id="repo-desc">
+				
+				<a class="link" href=""></a>
+			</div>
+			
+		</div>
+		<div class="mt-3" id="repo-topics">
+		
+		
+		</div>
+		
+		<div class="hide" id="validate_prompt">
+			<span id="count_prompt">You can not select more than 25 topics</span>
+			<span id="format_prompt">Topics must start with a letter or number, can include dashes (&#39;-&#39;) and can be up to 35 characters long.</span>
+		</div>
+		
+		<div class="ui segments repository-summary repository-summary-language-stats mt-3">
+	<div class="ui segment sub-menu repository-menu">
+		<div class="ui two horizontal center link list">
+			
+				<div class="item">
+					<a class="ui" href="/s444463/challenging-america-word-gap-prediction/commits/branch/4gram"><svg viewBox="0 0 16 16" class="svg octicon-history" width="16" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M1.643 3.143.427 1.927A.25.25 0 0 0 0 2.104V5.75c0 .138.112.25.25.25h3.646a.25.25 0 0 0 .177-.427L2.715 4.215a6.5 6.5 0 1 1-1.18 4.458.75.75 0 1 0-1.493.154 8.001 8.001 0 1 0 1.6-5.684zM7.75 4a.75.75 0 0 1 .75.75v2.992l2.028.812a.75.75 0 0 1-.557 1.392l-2.5-1A.75.75 0 0 1 7 8.25v-3.5A.75.75 0 0 1 7.75 4z"/></svg> <b>4</b> Commits</a>
+				</div>
+				<div class="item">
+					<a class="ui" href="/s444463/challenging-america-word-gap-prediction/branches"><svg viewBox="0 0 16 16" class="svg octicon-git-branch" width="16" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M11.75 2.5a.75.75 0 1 0 0 1.5.75.75 0 0 0 0-1.5zm-2.25.75a2.25 2.25 0 1 1 3 2.122V6A2.5 2.5 0 0 1 10 8.5H6a1 1 0 0 0-1 1v1.128a2.251 2.251 0 1 1-1.5 0V5.372a2.25 2.25 0 1 1 1.5 0v1.836A2.492 2.492 0 0 1 6 7h4a1 1 0 0 0 1-1v-.628A2.25 2.25 0 0 1 9.5 3.25zM4.25 12a.75.75 0 1 0 0 1.5.75.75 0 0 0 0-1.5zM3.5 3.25a.75.75 0 1 1 1.5 0 .75.75 0 0 1-1.5 0z"/></svg> <b>2</b> Branches</a>
+				</div>
+				
+					<div class="item">
+						<a class="ui" href="/s444463/challenging-america-word-gap-prediction/tags"><svg viewBox="0 0 16 16" class="svg octicon-tag" width="16" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M2.5 7.775V2.75a.25.25 0 0 1 .25-.25h5.025a.25.25 0 0 1 .177.073l6.25 6.25a.25.25 0 0 1 0 .354l-5.025 5.025a.25.25 0 0 1-.354 0l-6.25-6.25a.25.25 0 0 1-.073-.177zm-1.5 0V2.75C1 1.784 1.784 1 2.75 1h5.025c.464 0 .91.184 1.238.513l6.25 6.25a1.75 1.75 0 0 1 0 2.474l-5.026 5.026a1.75 1.75 0 0 1-2.474 0l-6.25-6.25A1.75 1.75 0 0 1 1 7.775zM6 5a1 1 0 1 0 0 2 1 1 0 0 0 0-2z"/></svg> <b>0</b> Tags</a>
+					</div>
+				
+				<div class="item">
+					<span class="ui"><svg viewBox="0 0 16 16" class="svg octicon-database" width="16" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M2.5 3.5c0-.133.058-.318.282-.55.227-.237.592-.484 1.1-.708C4.899 1.795 6.354 1.5 8 1.5c1.647 0 3.102.295 4.117.742.51.224.874.47 1.101.707.224.233.282.418.282.551 0 .133-.058.318-.282.55-.227.237-.592.484-1.1.708C11.101 5.205 9.646 5.5 8 5.5c-1.647 0-3.102-.295-4.117-.742-.51-.224-.874-.47-1.101-.707-.224-.233-.282-.418-.282-.551zM1 3.5c0-.626.292-1.165.7-1.59.406-.422.956-.767 1.579-1.041C4.525.32 6.195 0 8 0c1.805 0 3.475.32 4.722.869.622.274 1.172.62 1.578 1.04.408.426.7.965.7 1.591v9c0 .626-.292 1.165-.7 1.59-.406.422-.956.767-1.579 1.041C11.476 15.68 9.806 16 8 16c-1.805 0-3.475-.32-4.721-.869-.623-.274-1.173-.62-1.579-1.04-.408-.426-.7-.965-.7-1.591v-9zM2.5 8V5.724c.241.15.503.286.779.407C4.525 6.68 6.195 7 8 7c1.805 0 3.475-.32 4.722-.869.275-.121.537-.257.778-.407V8c0 .133-.058.318-.282.55-.227.237-.592.484-1.1.708C11.101 9.705 9.646 10 8 10c-1.647 0-3.102-.295-4.117-.742-.51-.224-.874-.47-1.101-.707C2.558 8.318 2.5 8.133 2.5 8zm0 2.225V12.5c0 .133.058.318.282.55.227.237.592.484 1.1.708 1.016.447 2.471.742 4.118.742 1.647 0 3.102-.295 4.117-.742.51-.224.874-.47 1.101-.707.224-.233.282-.418.282-.551v-2.275c-.241.15-.503.285-.778.406-1.247.549-2.917.869-4.722.869-1.805 0-3.475-.32-4.721-.869a6.236 6.236 0 0 1-.779-.406z"/></svg> <b>284 MiB</b></span>
+				</div>
+			
+		</div>
+	</div>
+	
+	<div class="ui segment sub-menu language-stats-details" style="display: none">
+		<div class="ui horizontal center link list">
+			
+			<div class="item df ac jc">
+				<i class="color-icon mr-3" style="background-color: #3572A5"></i>
+				<span class="bold mr-3">
+					
+						Python
+					
+				</span>
+				100%
+			</div>
+			
+		</div>
+	</div>
+	<a class="ui segment language-stats">
+		
+		<div class="bar" style="width: 100%; background-color: #3572A5">&nbsp;</div>
+		
+	</a>
+	
+</div>
+
+		<div class="ui stackable secondary menu mobile--margin-between-items mobile--no-negative-margins">
+			
+
+<div class="fitted item choose reference mr-1">
+	<div class="ui floating filter dropdown custom" data-can-create-branch="false" data-no-results="No results found.">
+		<div class="ui basic small compact button" @click="menuVisible = !menuVisible" @keyup.enter="menuVisible = !menuVisible">
+			<span class="text">
+				
+					<svg viewBox="0 0 16 16" class="svg octicon-git-branch" width="16" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M11.75 2.5a.75.75 0 1 0 0 1.5.75.75 0 0 0 0-1.5zm-2.25.75a2.25 2.25 0 1 1 3 2.122V6A2.5 2.5 0 0 1 10 8.5H6a1 1 0 0 0-1 1v1.128a2.251 2.251 0 1 1-1.5 0V5.372a2.25 2.25 0 1 1 1.5 0v1.836A2.492 2.492 0 0 1 6 7h4a1 1 0 0 0 1-1v-.628A2.25 2.25 0 0 1 9.5 3.25zM4.25 12a.75.75 0 1 0 0 1.5.75.75 0 0 0 0-1.5zM3.5 3.25a.75.75 0 1 1 1.5 0 .75.75 0 0 1-1.5 0z"/></svg>
+					Branch:
+					<strong>4gram</strong>
+				
+			</span>
+			<svg viewBox="0 0 16 16" class="dropdown icon svg octicon-triangle-down" width="14" height="14" aria-hidden="true"><path d="m4.427 7.427 3.396 3.396a.25.25 0 0 0 .354 0l3.396-3.396A.25.25 0 0 0 11.396 7H4.604a.25.25 0 0 0-.177.427z"/></svg>
+		</div>
+		<div class="data" style="display: none" data-mode="branches">
+			
+				
+					<div class="item branch selected" data-url="/s444463/challenging-america-word-gap-prediction/src/branch/4gram/in-header.tsv">4gram</div>
+				
+					<div class="item branch " data-url="/s444463/challenging-america-word-gap-prediction/src/branch/master/in-header.tsv">master</div>
+				
+			
+			
+		</div>
+		<div class="menu transition" :class="{visible: menuVisible}" v-if="menuVisible" v-cloak>
+			<div class="ui icon search input">
+				<i class="icon df ac jc m-0"><svg viewBox="0 0 16 16" class="svg octicon-filter" width="16" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M.75 3a.75.75 0 0 0 0 1.5h14.5a.75.75 0 0 0 0-1.5H.75zM3 7.75A.75.75 0 0 1 3.75 7h8.5a.75.75 0 0 1 0 1.5h-8.5A.75.75 0 0 1 3 7.75zm3 4a.75.75 0 0 1 .75-.75h2.5a.75.75 0 0 1 0 1.5h-2.5a.75.75 0 0 1-.75-.75z"/></svg></i>
+				<input name="search" ref="searchField" autocomplete="off" v-model="searchTerm" @keydown="keydown($event)" placeholder="Filter branch or tag...">
+			</div>
+			
+				<div class="header branch-tag-choice">
+					<div class="ui grid">
+						<div class="two column row">
+							<a class="reference column" href="#" @click="createTag = false; mode = 'branches'; focusSearchField()">
+								<span class="text" :class="{black: mode == 'branches'}">
+									<svg viewBox="0 0 16 16" class="mr-2 svg octicon-git-branch" width="16" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M11.75 2.5a.75.75 0 1 0 0 1.5.75.75 0 0 0 0-1.5zm-2.25.75a2.25 2.25 0 1 1 3 2.122V6A2.5 2.5 0 0 1 10 8.5H6a1 1 0 0 0-1 1v1.128a2.251 2.251 0 1 1-1.5 0V5.372a2.25 2.25 0 1 1 1.5 0v1.836A2.492 2.492 0 0 1 6 7h4a1 1 0 0 0 1-1v-.628A2.25 2.25 0 0 1 9.5 3.25zM4.25 12a.75.75 0 1 0 0 1.5.75.75 0 0 0 0-1.5zM3.5 3.25a.75.75 0 1 1 1.5 0 .75.75 0 0 1-1.5 0z"/></svg>Branches
+								</span>
+							</a>
+							<a class="reference column" href="#" @click="createTag = true; mode = 'tags'; focusSearchField()">
+								<span class="text" :class="{black: mode == 'tags'}">
+									<svg viewBox="0 0 16 16" class="mr-2 svg octicon-tag" width="16" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M2.5 7.775V2.75a.25.25 0 0 1 .25-.25h5.025a.25.25 0 0 1 .177.073l6.25 6.25a.25.25 0 0 1 0 .354l-5.025 5.025a.25.25 0 0 1-.354 0l-6.25-6.25a.25.25 0 0 1-.073-.177zm-1.5 0V2.75C1 1.784 1.784 1 2.75 1h5.025c.464 0 .91.184 1.238.513l6.25 6.25a1.75 1.75 0 0 1 0 2.474l-5.026 5.026a1.75 1.75 0 0 1-2.474 0l-6.25-6.25A1.75 1.75 0 0 1 1 7.775zM6 5a1 1 0 1 0 0 2 1 1 0 0 0 0-2z"/></svg>Tags
+								</span>
+							</a>
+						</div>
+					</div>
+				</div>
+			
+			<div class="scrolling menu" ref="scrollContainer">
+				<div v-for="(item, index) in filteredItems" :key="item.name" class="item" :class="{selected: item.selected, active: active == index}" @click="selectItem(item)" :ref="'listItem' + index">${ item.name }</div>
+				<div class="item" v-if="showCreateNewBranch" :class="{active: active == filteredItems.length}" :ref="'listItem' + filteredItems.length">
+					<a href="#" @click="createNewBranch()">
+						<div v-show="createTag">
+							<i class="reference tags icon"></i>
+							Create tag <strong>${ searchTerm }</strong>
+						</div>
+						<div v-show="!createTag">
+							<svg viewBox="0 0 16 16" class="svg octicon-git-branch" width="16" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M11.75 2.5a.75.75 0 1 0 0 1.5.75.75 0 0 0 0-1.5zm-2.25.75a2.25 2.25 0 1 1 3 2.122V6A2.5 2.5 0 0 1 10 8.5H6a1 1 0 0 0-1 1v1.128a2.251 2.251 0 1 1-1.5 0V5.372a2.25 2.25 0 1 1 1.5 0v1.836A2.492 2.492 0 0 1 6 7h4a1 1 0 0 0 1-1v-.628A2.25 2.25 0 0 1 9.5 3.25zM4.25 12a.75.75 0 1 0 0 1.5.75.75 0 0 0 0-1.5zM3.5 3.25a.75.75 0 1 1 1.5 0 .75.75 0 0 1-1.5 0z"/></svg>
+							Create branch <strong>${ searchTerm }</strong>
+						</div>
+						<div class="text small">
+							
+								from &#39;4gram&#39;
+							
+						</div>
+					</a>
+					<form ref="newBranchForm" action="/s444463/challenging-america-word-gap-prediction/branches/_new/branch/4gram" method="post">
+						<input type="hidden" name="_csrf" value="p3_cvG3N1UMZPleED8GVFsIG_NE6MTY4MTMyODUyMDA1OTE2ODE0Nw">
+						<input type="hidden" name="new_branch_name" v-model="searchTerm">
+						<input type="hidden" name="create_tag" v-model="createTag">
+					</form>
+				</div>
+			</div>
+			<div class="message" v-if="showNoResults">${ noResults }</div>
+		</div>
+	</div>
+</div>
+
+			
+			
+			
+			
+				<div class="fitted item"><span class="ui breadcrumb repo-path"><a class="section" href="/s444463/challenging-america-word-gap-prediction/src/branch/4gram" title="challenging-america-word-gap-prediction">challenging-america-word-ga...</a><span class="divider">/</span><span class="active section" title="in-header.tsv">in-header.tsv</span></span></div>
+			
+			<div class="right fitted item mr-0" id="file-buttons">
+				<div class="ui tiny primary buttons">
+					
+						
+						
+					
+					
+				</div>
+
+			</div>
+			<div class="fitted item">
+				
+			</div>
+			<div class="fitted item">
+				
+				
+			</div>
+		</div>
+		
+			<div class="tab-size-8 non-diff-file-content">
+	<h4 class="file-header ui top attached header df ac sb">
+		<div class="file-header-left df ac">
+			
+				<div class="file-info text grey normal mono">
+					
+					
+					
+						<div class="file-info-entry">
+							37 B
+						</div>
+					
+					
+				</div>
+			
+		</div>
+		<div class="file-header-right file-actions df ac">
+			
+			
+				<div class="ui buttons mr-2">
+					<a class="ui mini basic button" href="/s444463/challenging-america-word-gap-prediction/raw/branch/4gram/in-header.tsv">Raw</a>
+					
+						<a class="ui mini basic button" href="/s444463/challenging-america-word-gap-prediction/src/commit/65d889d6525d3949dd1ace045393124a7afb1f0e/in-header.tsv">Permalink</a>
+					
+					
+						<a class="ui mini basic button" href="/s444463/challenging-america-word-gap-prediction/blame/branch/4gram/in-header.tsv">Blame</a>
+					
+					<a class="ui mini basic button" href="/s444463/challenging-america-word-gap-prediction/commits/branch/4gram/in-header.tsv">History</a>
+					
+				</div>
+				<a download href="/s444463/challenging-america-word-gap-prediction/raw/branch/4gram/in-header.tsv"><span class="btn-octicon tooltip" data-content="Download file" data-position="bottom center"><svg viewBox="0 0 16 16" class="svg octicon-download" width="16" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M7.47 10.78a.75.75 0 0 0 1.06 0l3.75-3.75a.75.75 0 0 0-1.06-1.06L8.75 8.44V1.75a.75.75 0 0 0-1.5 0v6.69L4.78 5.97a.75.75 0 0 0-1.06 1.06l3.75 3.75zM3.75 13a.75.75 0 0 0 0 1.5h8.5a.75.75 0 0 0 0-1.5h-8.5z"/></svg></span></a>
+				
+					
+						<span class="btn-octicon tooltip disabled" data-content="You must fork this repository to make or propose changes to this file." data-position="bottom center"><svg viewBox="0 0 16 16" class="svg octicon-pencil" width="16" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M11.013 1.427a1.75 1.75 0 0 1 2.474 0l1.086 1.086a1.75 1.75 0 0 1 0 2.474l-8.61 8.61c-.21.21-.47.364-.756.445l-3.251.93a.75.75 0 0 1-.927-.928l.929-3.25a1.75 1.75 0 0 1 .445-.758l8.61-8.61zm1.414 1.06a.25.25 0 0 0-.354 0L10.811 3.75l1.439 1.44 1.263-1.263a.25.25 0 0 0 0-.354l-1.086-1.086zM11.189 6.25 9.75 4.81l-6.286 6.287a.25.25 0 0 0-.064.108l-.558 1.953 1.953-.558a.249.249 0 0 0 .108-.064l6.286-6.286z"/></svg></span>
+					
+					
+						<span class="btn-octicon tooltip disabled" data-content="You must have write access to make or propose changes to this file." data-position="bottom center"><svg viewBox="0 0 16 16" class="svg octicon-trash" width="16" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M6.5 1.75a.25.25 0 0 1 .25-.25h2.5a.25.25 0 0 1 .25.25V3h-3V1.75zm4.5 0V3h2.25a.75.75 0 0 1 0 1.5H2.75a.75.75 0 0 1 0-1.5H5V1.75C5 .784 5.784 0 6.75 0h2.5C10.216 0 11 .784 11 1.75zM4.496 6.675a.75.75 0 1 0-1.492.15l.66 6.6A1.75 1.75 0 0 0 5.405 15h5.19c.9 0 1.652-.681 1.741-1.576l.66-6.6a.75.75 0 0 0-1.492-.149l-.66 6.6a.25.25 0 0 1-.249.225h-5.19a.25.25 0 0 1-.249-.225l-.66-6.6z"/></svg></span>
+					
+				
+			
+		</div>
+	</h4>
+	<div class="ui attached table unstackable segment">
+		
+	
+
+
+		<div class="file-view markup csv">
+			
+				<table class="data-table"><tr><th class="line-num">1</th><th>FileId</th><th>Year</th><th>LeftContext</th><th>RightContext</th></tr></table>
+			
+		</div>
+	</div>
+</div>
+
+		
+	</div>
+</div>
+
+
+	
+
+	</div>
+
+	
+
+	<footer>
+	<div class="ui container">
+		<div class="ui left">
+			Powered by Gitea  
+		</div>
+		<div class="ui right links">
+			
+			<div class="ui language bottom floating slide up dropdown link item">
+				<svg viewBox="0 0 16 16" class="svg octicon-globe" width="16" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M1.543 7.25h2.733c.144-2.074.866-3.756 1.58-4.948.12-.197.237-.381.353-.552a6.506 6.506 0 0 0-4.666 5.5zm2.733 1.5H1.543a6.506 6.506 0 0 0 4.666 5.5 11.13 11.13 0 0 1-.352-.552c-.715-1.192-1.437-2.874-1.581-4.948zm1.504 0h4.44a9.637 9.637 0 0 1-1.363 4.177c-.306.51-.612.919-.857 1.215a9.978 9.978 0 0 1-.857-1.215A9.637 9.637 0 0 1 5.78 8.75zm4.44-1.5H5.78a9.637 9.637 0 0 1 1.363-4.177c.306-.51.612-.919.857-1.215.245.296.55.705.857 1.215A9.638 9.638 0 0 1 10.22 7.25zm1.504 1.5c-.144 2.074-.866 3.756-1.58 4.948-.12.197-.237.381-.353.552a6.506 6.506 0 0 0 4.666-5.5h-2.733zm2.733-1.5h-2.733c-.144-2.074-.866-3.756-1.58-4.948a11.738 11.738 0 0 0-.353-.552 6.506 6.506 0 0 1 4.666 5.5zM8 0a8 8 0 1 0 0 16A8 8 0 0 0 8 0z"/></svg>
+				<div class="text">English</div>
+				<div class="menu language-menu">
+					
+						<a lang="id-ID" data-url="/?lang=id-ID" class="item ">bahasa Indonesia</a>
+					
+						<a lang="de-DE" data-url="/?lang=de-DE" class="item ">Deutsch</a>
+					
+						<a lang="en-US" data-url="/?lang=en-US" class="item active selected">English</a>
+					
+						<a lang="es-ES" data-url="/?lang=es-ES" class="item ">español</a>
+					
+						<a lang="fr-FR" data-url="/?lang=fr-FR" class="item ">français</a>
+					
+						<a lang="it-IT" data-url="/?lang=it-IT" class="item ">italiano</a>
+					
+						<a lang="lv-LV" data-url="/?lang=lv-LV" class="item ">latviešu</a>
+					
+						<a lang="hu-HU" data-url="/?lang=hu-HU" class="item ">magyar nyelv</a>
+					
+						<a lang="nl-NL" data-url="/?lang=nl-NL" class="item ">Nederlands</a>
+					
+						<a lang="pl-PL" data-url="/?lang=pl-PL" class="item ">polski</a>
+					
+						<a lang="pt-PT" data-url="/?lang=pt-PT" class="item ">Português de Portugal</a>
+					
+						<a lang="pt-BR" data-url="/?lang=pt-BR" class="item ">português do Brasil</a>
+					
+						<a lang="fi-FI" data-url="/?lang=fi-FI" class="item ">suomi</a>
+					
+						<a lang="sv-SE" data-url="/?lang=sv-SE" class="item ">svenska</a>
+					
+						<a lang="tr-TR" data-url="/?lang=tr-TR" class="item ">Türkçe</a>
+					
+						<a lang="cs-CZ" data-url="/?lang=cs-CZ" class="item ">čeština</a>
+					
+						<a lang="el-GR" data-url="/?lang=el-GR" class="item ">ελληνικά</a>
+					
+						<a lang="bg-BG" data-url="/?lang=bg-BG" class="item ">български</a>
+					
+						<a lang="ru-RU" data-url="/?lang=ru-RU" class="item ">русский</a>
+					
+						<a lang="sr-SP" data-url="/?lang=sr-SP" class="item ">српски</a>
+					
+						<a lang="uk-UA" data-url="/?lang=uk-UA" class="item ">Українська</a>
+					
+						<a lang="fa-IR" data-url="/?lang=fa-IR" class="item ">فارسی</a>
+					
+						<a lang="ml-IN" data-url="/?lang=ml-IN" class="item ">മലയാളം</a>
+					
+						<a lang="ja-JP" data-url="/?lang=ja-JP" class="item ">日本語</a>
+					
+						<a lang="zh-CN" data-url="/?lang=zh-CN" class="item ">简体中文</a>
+					
+						<a lang="zh-TW" data-url="/?lang=zh-TW" class="item ">繁體中文（台灣）</a>
+					
+						<a lang="zh-HK" data-url="/?lang=zh-HK" class="item ">繁體中文（香港）</a>
+					
+						<a lang="ko-KR" data-url="/?lang=ko-KR" class="item ">한국어</a>
+					
+				</div>
+			</div>
+			<a href="/assets/js/licenses.txt">Licenses</a>
+			<a href="/api/swagger">API</a>
+			<a target="_blank" rel="noopener noreferrer" href="https://gitea.io">Website</a>
+			
+			
+		</div>
+	</div>
+</footer>
+
+
+
+
+	<script src="/assets/js/index.js?v=88278c3fba7f4dbe8d14a7ab5c4cfc9c"></script>
+
+</body>
+</html>
+
--- a/inference.py
+++ b/inference.py
@ -0,0 +1,89 @@
+from torch import nn
+import torch
+
+
+from torch.utils.data import IterableDataset
+import itertools
+import lzma
+import regex as re
+import pickle
+
+class SimpleTrigramNeuralLanguageModel(nn.Module):
+  def __init__(self, vocabulary_size, embedding_size):
+      super(SimpleTrigramNeuralLanguageModel, self).__init__()
+      self.embedings = nn.Embedding(vocabulary_size, embedding_size)
+      self.linear = nn.Linear(embedding_size*2, vocabulary_size)
+      self.softmax = nn.Softmax(dim=1)  # normalize over the vocabulary axis
+
+    #   self.model = nn.Sequential(
+    #       nn.Embedding(vocabulary_size, embedding_size),
+    #       nn.Linear(embedding_size, vocabulary_size),
+    #       nn.Softmax()
+    #   )
+
+  def forward(self, x):
+      emb_1 = self.embedings(x[0])
+      emb_2 = self.embedings(x[1])
+
+      concated = self.linear(torch.cat((emb_1, emb_2), dim=1))
+      y = self.softmax(concated)
+
+      return y
+vocab_size = 20000
+embed_size = 100
+model = SimpleTrigramNeuralLanguageModel(vocab_size, embed_size)
+
+model.load_state_dict(torch.load('model1_5400.bin'))
+model.eval()
+
+with open("vocab.pickle", 'rb') as handle:
+    vocab = pickle.load(handle)
+vocab.set_default_index(vocab['<unk>'])
+
+device = 'cpu'
+# data = DataLoader(train_dataset, batch_size=5000)
+optimizer = torch.optim.Adam(model.parameters())
+criterion = torch.nn.NLLLoss()
+
+test_pred = ['ala', 'has', 'cat']
+
+step = 0
+
+with lzma.open('dev-0/in.tsv.xz', 'rb') as file:
+    for line in file:
+        line = line.decode('utf-8')
+        line = line.rstrip()
+        
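+        line_splitted = line.split('\t')[-2:]
+
+        # Last word of the left context and first word of the right context.
+        prev_word = line_splitted[0].split(' ')[-1]
+        next_word = line_splitted[1].split(' ')[0]
+
+        x = torch.tensor(vocab.forward([prev_word]))
+        z = torch.tensor(vocab.forward([next_word]))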
+        x = x.to(device)
+        z = z.to(device)
+        ypredicted = model([x, z])
+
+        top = torch.topk(ypredicted[0], 5000)
+
+        top_indices = top.indices.tolist()
+        top_probs = top.values.tolist()
+        top_words = vocab.lookup_tokens(top_indices)
+
+        string_to_print = ''
+        sum_probs = 0
+        for w, p in zip(top_words, top_probs):
+            if '<unk>' in w:
+                continue
+            if re.search(r'\p{L}+', w):
+                string_to_print += f"{w}:{p} "
+                sum_probs += p
+        if string_to_print == '':
+            print("the:0.5 a:0.3 :0.2")
+            continue
+        unknown_prob = 1 - sum_probs
+        string_to_print += f":{unknown_prob}"
+
+        print(string_to_print)
--- a/out-header.tsv
+++ b/out-header.tsv
@ -0,0 +1 @@
+Word
--- a/test-A/hate-speech-info.tsv
+++ b/test-A/hate-speech-info.tsv
--- a/test-A/in.tsv.xz
+++ b/test-A/in.tsv.xz
--- a/test-A/out.tsv
+++ b/test-A/out.tsv
--- a/train/expected.tsv
+++ b/train/expected.tsv
--- a/train/hate-speech-info.tsv
+++ b/train/hate-speech-info.tsv
--- a/train/in.tsv.xz
+++ b/train/in.tsv.xz
--- a/trigrams.py
+++ b/trigrams.py
@ -0,0 +1,124 @@
+
+
+from torch import nn
+import torch
+
+
+from torch.utils.data import IterableDataset
+import itertools
+import lzma
+import regex as re
+import pickle
+
+def look_ahead_iterator(gen):
+    prev = None
+    current = None
+    next = None
+    for next in gen:
+        if prev is not None and current is not None:
+            yield (prev, current, next)
+        prev = current
+        current = next
+
+def get_words_from_line(line):
+  line = line.rstrip()
+  yield '<s>'
+  for m in re.finditer(r'[\p{L}0-9\*]+|\p{P}+', line):
+     yield m.group(0).lower()
+  yield '</s>'
+
+
+def get_word_lines_from_file(file_name):
+  counter=0
+  with lzma.open(file_name, 'r') as fh:
+    for line in fh:
+      counter+=1
+      if counter == 100000:
+        break
+      line = line.decode("utf-8")
+      yield get_words_from_line(line)
+
+
+
+class Trigrams(IterableDataset):
+  def load_vocab(self):
+    with open("vocab.pickle", 'rb') as handle:
+        vocab = pickle.load( handle)
+    return vocab
+
+  def __init__(self, text_file, vocabulary_size):
+      self.vocab = self.load_vocab()
+      self.vocab.set_default_index(self.vocab['<unk>'])
+      self.vocabulary_size = vocabulary_size
+      self.text_file = text_file
+
+  def __iter__(self):
+     return look_ahead_iterator(
+         (self.vocab[t] for t in itertools.chain.from_iterable(get_word_lines_from_file(self.text_file))))
+
+vocab_size = 20000
+
+train_dataset = Trigrams('train/in.tsv.xz', vocab_size)
+
+
+
+#=== training
+from torch import nn
+import torch
+from torch.utils.data import DataLoader
+embed_size = 100
+
+class SimpleTrigramNeuralLanguageModel(nn.Module):
+  def __init__(self, vocabulary_size, embedding_size):
+      super(SimpleTrigramNeuralLanguageModel, self).__init__()
+      self.embedings = nn.Embedding(vocabulary_size, embedding_size)
+      self.linear = nn.Linear(embedding_size*2, vocabulary_size)
+      self.softmax = nn.Softmax(dim=1)  # normalize over the vocabulary axis
+
+    #   self.model = nn.Sequential(
+    #       nn.Embedding(vocabulary_size, embedding_size),
+    #       nn.Linear(embedding_size, vocabulary_size),
+    #       nn.Softmax()
+    #   )
+
+  def forward(self, x):
+      emb_1 = self.embedings(x[0])
+      emb_2 = self.embedings(x[1])
+
+      concated = self.linear(torch.cat((emb_1, emb_2), dim=1))
+      y = self.softmax(concated)
+
+      return y
+
+model = SimpleTrigramNeuralLanguageModel(vocab_size, embed_size)
+
+vocab = train_dataset.vocab
+
+vocab.set_default_index(vocab['<unk>'])
+ixs = torch.tensor(vocab.forward(['pies']))
+# out[0][vocab['jest']]
+
+
+device = 'cpu'
+model = SimpleTrigramNeuralLanguageModel(vocab_size, embed_size).to(device)
+data = DataLoader(train_dataset, batch_size=5000)
+optimizer = torch.optim.Adam(model.parameters())
+criterion = torch.nn.NLLLoss()
+
+model.train()
+step = 0
+for x, y, z in data:
+   x = x.to(device)
+   y = y.to(device)
+   z = z.to(device)
+   optimizer.zero_grad()
+   ypredicted = model([x, z])
+   loss = criterion(torch.log(ypredicted), y)
+   if step % 100 == 0:
+      print(step, loss.item())
+      torch.save(model.state_dict(), f'model1_{step}.bin')
+   step += 1
+   loss.backward()
+   optimizer.step()
+
+torch.save(model.state_dict(), 'model_tri1.bin')
				`@ -0,0 +1 @@`
				`--metric PerplexityHashed --precision 2 --in-header in-header.tsv --out-header out-header.tsv`
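
The output lines printed by `inference.py` follow a `word:prob word:prob ... :rest` layout, where the trailing entry with an empty word carries the probability mass not assigned to any listed word. The sketch below isolates that formatting step so it can be checked without a trained model. It is only an illustration: `format_prediction` and its `fallback` parameter are hypothetical names, and `str.isalpha` stands in for the `\p{L}` check done with the `regex` package in the script.

```python
def format_prediction(candidates, fallback="the:0.5 a:0.3 :0.2"):
    """Render (word, probability) pairs as 'word:prob ... :rest'.

    Hypothetical helper mirroring the printing loop in inference.py.
    """
    # Keep only real words: drop the <unk> token and tokens without any letter.
    kept = [
        (word, prob)
        for word, prob in candidates
        if "<unk>" not in word and any(ch.isalpha() for ch in word)
    ]
    if not kept:
        # Nothing usable was predicted, so fall back to a fixed guess.
        return fallback

    # Mass left over after the listed words goes to the empty-word entry.
    # Clamping at 0.0 is an added safeguard against rounding below zero.
    rest = max(0.0, 1.0 - sum(prob for _, prob in kept))
    parts = [f"{word}:{prob}" for word, prob in kept]
    parts.append(f":{rest}")
    return " ".join(parts)


if __name__ == "__main__":
    print(format_prediction([("the", 0.5), ("<unk>", 0.25), (",", 0.125)]))
```

Keeping this step in a separate function would make it possible to verify that each output line is well formed before running the full evaluation.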