This commit is contained in:
Mikołaj Pokrywka 2023-05-08 18:54:03 +02:00
commit 51bcff674a
21 changed files with 895030 additions and 0 deletions

8
.gitignore vendored Normal file
View File

@ -0,0 +1,8 @@
*~
*.swp
*.bak
*.pyc
*.o
.DS_Store
.token

919
09_Zanurzenia_slow.ipynb Normal file
View File

@ -0,0 +1,919 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Logo 1](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech1.jpg)\n",
"<div class=\"alert alert-block alert-info\">\n",
"<h1> Modelowanie języka</h1>\n",
"<h2> 09. <i>Zanurzenia słów (Word2vec)</i> [wykład]</h2> \n",
"<h3> Filip Graliński (2022)</h3>\n",
"</div>\n",
"\n",
"![Logo 2](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech2.jpg)\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Zanurzenia słów (Word2vec)\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"W praktyce stosowalność słowosieci okazała się zaskakująco\n",
"ograniczona. Większy przełom w przetwarzaniu języka naturalnego przyniosły\n",
"wielowymiarowe reprezentacje słów, inaczej: zanurzenia słów.\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### „Wymiary” słów\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Moglibyśmy zanurzyć (ang. *embed*) w wielowymiarowej przestrzeni, tzn. zdefiniować odwzorowanie\n",
"$E \\colon V \\rightarrow \\mathcal{R}^m$ dla pewnego $m$ i określić taki sposób estymowania\n",
"prawdopodobieństw $P(u|v)$, by dla par $E(v)$ i $E(v')$ oraz $E(u)$ i $E(u')$ znajdujących się w pobliżu\n",
"(według jakiejś metryki odległości, na przykład zwykłej odległości euklidesowej):\n",
"\n",
"$$P(u|v) \\approx P(u'|v').$$\n",
"\n",
"$E(u)$ nazywamy zanurzeniem (embeddingiem) słowa.\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Wymiary określone z góry?\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Można by sobie wyobrazić, że $m$ wymiarów mogłoby być z góry\n",
"określonych przez lingwistę. Wymiary te byłyby związane z typowymi\n",
"„osiami” rozpatrywanymi w językoznawstwie, na przykład:\n",
"\n",
"- czy słowo jest wulgarne, pospolite, potoczne, neutralne czy książkowe?\n",
"- czy słowo jest archaiczne, wychodzące z użycia czy jest neologizmem?\n",
"- czy słowo dotyczy kobiet, czy mężczyzn (w sensie rodzaju gramatycznego i/lub\n",
" socjolingwistycznym)?\n",
"- czy słowo jest w liczbie pojedynczej czy mnogiej?\n",
"- czy słowo jest rzeczownikiem czy czasownikiem?\n",
"- czy słowo jest rdzennym słowem czy zapożyczeniem?\n",
"- czy słowo jest nazwą czy słowem pospolitym?\n",
"- czy słowo opisuje konkretną rzecz czy pojęcie abstrakcyjne?\n",
"- …\n",
"\n",
"W praktyce okazało się jednak, że lepiej, żeby komputer uczył się sam\n",
"możliwych wymiarów — z góry określamy tylko $m$ (liczbę wymiarów).\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Bigramowy model języka oparty na zanurzeniach\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Zbudujemy teraz najprostszy model język oparty na zanurzeniach. Będzie to właściwie najprostszy\n",
"**neuronowy model języka**, jako że zbudowany model można traktować jako prostą sieć neuronową.\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Słownik\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"W typowym neuronowym modelu języka rozmiar słownika musi być z góry\n",
"ograniczony. Zazwyczaj jest to liczba rzędu kilkudziesięciu wyrazów —\n",
"po prostu będziemy rozpatrywać $|V|$ najczęstszych wyrazów, pozostałe zamienimy\n",
"na specjalny token `<unk>` reprezentujący nieznany (*unknown*) wyraz.\n",
"\n",
"Aby utworzyć taki słownik, użyjemy gotowej klasy `Vocab` z pakietu torchtext:\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"12531"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from itertools import islice\n",
"import regex as re\n",
"import sys\n",
"from torchtext.vocab import build_vocab_from_iterator\n",
"import lzma\n",
"\n",
"def get_words_from_line(line):\n",
" line = line.rstrip()\n",
" yield '<s>'\n",
" for m in re.finditer(r'[\\p{L}0-9\\*]+|\\p{P}+', line):\n",
" yield m.group(0).lower()\n",
" yield '</s>'\n",
"\n",
"\n",
"def get_word_lines_from_file(file_name):\n",
" counter=0\n",
" with lzma.open(file_name, 'r') as fh:\n",
" for line in fh:\n",
" counter+=1\n",
" if counter == 100000:\n",
" break\n",
" line = line.decode(\"utf-8\")\n",
" yield get_words_from_line(line)\n",
"\n",
"vocab_size = 20000\n",
"\n",
"vocab = build_vocab_from_iterator(\n",
" get_word_lines_from_file('train/in.tsv.xz'),\n",
" max_tokens = vocab_size,\n",
" specials = ['<unk>'])\n",
"\n",
"vocab['jest']"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"import pickle\n",
"with open(\"vocab.pickle\", 'wb') as handle:\n",
" pickle.dump(vocab, handle)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"with open(\"vocab.pickle\", 'rb') as handle:\n",
" vocab = pickle.load( handle)"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"838"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"vocab['love']"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"vocab.lookup_token()"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"WARNING: pip is being invoked by an old script wrapper. This will fail in a future version of pip.\n",
"Please see https://github.com/pypa/pip/issues/5599 for advice on fixing the underlying issue.\n",
"To avoid this problem you can invoke Python with '-m pip' instead of running pip directly.\n",
"Defaulting to user installation because normal site-packages is not writeable\n",
"Collecting torchtext\n",
" Downloading torchtext-0.15.1-cp38-cp38-manylinux1_x86_64.whl (2.0 MB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m2.0/2.0 MB\u001b[0m \u001b[31m661.8 kB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
"\u001b[?25hRequirement already satisfied: numpy in /home/mikolaj/.local/lib/python3.8/site-packages (from torchtext) (1.19.3)\n",
"Requirement already satisfied: tqdm in /home/mikolaj/.local/lib/python3.8/site-packages (from torchtext) (4.64.0)\n",
"Collecting torchdata==0.6.0\n",
" Downloading torchdata-0.6.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.6 MB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m4.6/4.6 MB\u001b[0m \u001b[31m1.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m0m\n",
"\u001b[?25hRequirement already satisfied: requests in /home/mikolaj/.local/lib/python3.8/site-packages (from torchtext) (2.26.0)\n",
"Collecting torch==2.0.0\n",
" Downloading torch-2.0.0-cp38-cp38-manylinux1_x86_64.whl (619.9 MB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m619.9/619.9 MB\u001b[0m \u001b[31m2.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:02\u001b[0m\n",
"\u001b[?25hCollecting nvidia-nccl-cu11==2.14.3\n",
" Downloading nvidia_nccl_cu11-2.14.3-py3-none-manylinux1_x86_64.whl (177.1 MB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m177.1/177.1 MB\u001b[0m \u001b[31m6.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
"\u001b[?25hCollecting nvidia-cuda-runtime-cu11==11.7.99\n",
" Using cached nvidia_cuda_runtime_cu11-11.7.99-py3-none-manylinux1_x86_64.whl (849 kB)\n",
"Collecting nvidia-cuda-nvrtc-cu11==11.7.99\n",
" Using cached nvidia_cuda_nvrtc_cu11-11.7.99-2-py3-none-manylinux1_x86_64.whl (21.0 MB)\n",
"Collecting nvidia-cudnn-cu11==8.5.0.96\n",
" Using cached nvidia_cudnn_cu11-8.5.0.96-2-py3-none-manylinux1_x86_64.whl (557.1 MB)\n",
"Requirement already satisfied: networkx in /home/mikolaj/.local/lib/python3.8/site-packages (from torch==2.0.0->torchtext) (2.8)\n",
"Collecting nvidia-cublas-cu11==11.10.3.66\n",
" Using cached nvidia_cublas_cu11-11.10.3.66-py3-none-manylinux1_x86_64.whl (317.1 MB)\n",
"Collecting nvidia-cusolver-cu11==11.4.0.1\n",
" Downloading nvidia_cusolver_cu11-11.4.0.1-2-py3-none-manylinux1_x86_64.whl (102.6 MB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m102.6/102.6 MB\u001b[0m \u001b[31m5.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
"\u001b[?25hRequirement already satisfied: typing-extensions in /home/mikolaj/.local/lib/python3.8/site-packages (from torch==2.0.0->torchtext) (4.2.0)\n",
"Collecting nvidia-cusparse-cu11==11.7.4.91\n",
" Downloading nvidia_cusparse_cu11-11.7.4.91-py3-none-manylinux1_x86_64.whl (173.2 MB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m173.2/173.2 MB\u001b[0m \u001b[31m6.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
"\u001b[?25hCollecting nvidia-cuda-cupti-cu11==11.7.101\n",
" Downloading nvidia_cuda_cupti_cu11-11.7.101-py3-none-manylinux1_x86_64.whl (11.8 MB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m11.8/11.8 MB\u001b[0m \u001b[31m7.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0mm\n",
"\u001b[?25hCollecting nvidia-curand-cu11==10.2.10.91\n",
" Downloading nvidia_curand_cu11-10.2.10.91-py3-none-manylinux1_x86_64.whl (54.6 MB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m54.6/54.6 MB\u001b[0m \u001b[31m10.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
"\u001b[?25hRequirement already satisfied: jinja2 in /home/mikolaj/.local/lib/python3.8/site-packages (from torch==2.0.0->torchtext) (3.0.3)\n",
"Collecting nvidia-cufft-cu11==10.9.0.58\n",
" Downloading nvidia_cufft_cu11-10.9.0.58-py3-none-manylinux1_x86_64.whl (168.4 MB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m168.4/168.4 MB\u001b[0m \u001b[31m3.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
"\u001b[?25hCollecting sympy\n",
" Downloading sympy-1.11.1-py3-none-any.whl (6.5 MB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m6.5/6.5 MB\u001b[0m \u001b[31m1.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
"\u001b[?25hCollecting triton==2.0.0\n",
" Downloading triton-2.0.0-1-cp38-cp38-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (63.2 MB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m63.2/63.2 MB\u001b[0m \u001b[31m2.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
"\u001b[?25hRequirement already satisfied: filelock in /home/mikolaj/.local/lib/python3.8/site-packages (from torch==2.0.0->torchtext) (3.6.0)\n",
"Collecting nvidia-nvtx-cu11==11.7.91\n",
" Downloading nvidia_nvtx_cu11-11.7.91-py3-none-manylinux1_x86_64.whl (98 kB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m98.6/98.6 kB\u001b[0m \u001b[31m3.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hRequirement already satisfied: urllib3>=1.25 in /home/mikolaj/.local/lib/python3.8/site-packages (from torchdata==0.6.0->torchtext) (1.26.9)\n",
"Requirement already satisfied: setuptools in /home/mikolaj/.local/lib/python3.8/site-packages (from nvidia-cublas-cu11==11.10.3.66->torch==2.0.0->torchtext) (63.4.2)\n",
"Requirement already satisfied: wheel in /home/mikolaj/.local/lib/python3.8/site-packages (from nvidia-cublas-cu11==11.10.3.66->torch==2.0.0->torchtext) (0.37.1)\n",
"Collecting lit\n",
" Downloading lit-16.0.1.tar.gz (137 kB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m137.9/137.9 kB\u001b[0m \u001b[31m1.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0ma \u001b[36m0:00:01\u001b[0m\n",
"\u001b[?25h Preparing metadata (setup.py) ... \u001b[?25ldone\n",
"\u001b[?25hCollecting cmake\n",
" Using cached cmake-3.26.3-py2.py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (24.0 MB)\n",
"Requirement already satisfied: charset-normalizer~=2.0.0 in /home/mikolaj/.local/lib/python3.8/site-packages (from requests->torchtext) (2.0.10)\n",
"Requirement already satisfied: idna<4,>=2.5 in /home/mikolaj/.local/lib/python3.8/site-packages (from requests->torchtext) (2.10)\n",
"Requirement already satisfied: certifi>=2017.4.17 in /home/mikolaj/.local/lib/python3.8/site-packages (from requests->torchtext) (2022.12.7)\n",
"Requirement already satisfied: MarkupSafe>=2.0 in /home/mikolaj/.local/lib/python3.8/site-packages (from jinja2->torch==2.0.0->torchtext) (2.0.1)\n",
"Collecting mpmath>=0.19\n",
" Downloading mpmath-1.3.0-py3-none-any.whl (536 kB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m536.2/536.2 kB\u001b[0m \u001b[31m768.5 kB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0ma \u001b[36m0:00:01\u001b[0m\n",
"\u001b[?25hBuilding wheels for collected packages: lit\n",
" Building wheel for lit (setup.py) ... \u001b[?25ldone\n",
"\u001b[?25h Created wheel for lit: filename=lit-16.0.1-py3-none-any.whl size=88173 sha256=fca0dda7f2dc27a2885356559af2c2b6bc26994156ad1efae9f15f63d3866468\n",
" Stored in directory: /home/mikolaj/.cache/pip/wheels/12/14/ba/87be46a564f97692e6cd1f6d7a1deeb5bff2821d45a52e8d7a\n",
"Successfully built lit\n",
"Installing collected packages: mpmath, lit, cmake, sympy, nvidia-nvtx-cu11, nvidia-nccl-cu11, nvidia-cusparse-cu11, nvidia-curand-cu11, nvidia-cufft-cu11, nvidia-cuda-runtime-cu11, nvidia-cuda-nvrtc-cu11, nvidia-cuda-cupti-cu11, nvidia-cublas-cu11, nvidia-cusolver-cu11, nvidia-cudnn-cu11, triton, torch, torchdata, torchtext\n",
" Attempting uninstall: torch\n",
" Found existing installation: torch 1.10.0\n",
" Uninstalling torch-1.10.0:\n",
" Successfully uninstalled torch-1.10.0\n",
"\u001b[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.\n",
"laserembeddings 1.1.2 requires sacremoses==0.0.35, which is not installed.\n",
"unbabel-comet 1.1.0 requires numpy>=1.20.0, but you have numpy 1.19.3 which is incompatible.\n",
"unbabel-comet 1.1.0 requires scipy>=1.5.4, but you have scipy 1.4.1 which is incompatible.\n",
"unbabel-comet 1.1.0 requires torch<=1.10.0,>=1.6.0, but you have torch 2.0.0 which is incompatible.\n",
"torchvision 0.11.1 requires torch==1.10.0, but you have torch 2.0.0 which is incompatible.\n",
"torchaudio 0.10.0 requires torch==1.10.0, but you have torch 2.0.0 which is incompatible.\n",
"laserembeddings 1.1.2 requires torch<2.0.0,>=1.0.1.post2, but you have torch 2.0.0 which is incompatible.\u001b[0m\u001b[31m\n",
"\u001b[0mSuccessfully installed cmake-3.26.3 lit-16.0.1 mpmath-1.3.0 nvidia-cublas-cu11-11.10.3.66 nvidia-cuda-cupti-cu11-11.7.101 nvidia-cuda-nvrtc-cu11-11.7.99 nvidia-cuda-runtime-cu11-11.7.99 nvidia-cudnn-cu11-8.5.0.96 nvidia-cufft-cu11-10.9.0.58 nvidia-curand-cu11-10.2.10.91 nvidia-cusolver-cu11-11.4.0.1 nvidia-cusparse-cu11-11.7.4.91 nvidia-nccl-cu11-2.14.3 nvidia-nvtx-cu11-11.7.91 sympy-1.11.1 torch-2.0.0 torchdata-0.6.0 torchtext-0.15.1 triton-2.0.0\n",
"\n",
"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip available: \u001b[0m\u001b[31;49m22.2.2\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m23.0.1\u001b[0m\n",
"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n"
]
}
],
"source": [
"!pip3 install torchtext"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['<unk>', '\\\\', 'the', '-\\\\', 'wno']"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"vocab.lookup_tokens([0, 1, 2, 10, 12345])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Definicja sieci\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Naszą prostą sieć neuronową zaimplementujemy używając frameworku PyTorch.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"def look_ahead_iterator(gen):\n",
" prev = None\n",
" for item in gen:\n",
" if prev is not None:\n",
" yield (prev, item)\n",
" prev = item"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[(1, 2), (2, 3), (3, 4), (4, 5), (5, 'X'), ('X', 6)]"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"list(look_ahead_iterator([1,2,3,4,5, 'X', 6 ]))"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [],
"source": [
"import itertools"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [],
"source": [
"from torch import nn\n",
"import torch\n",
"\n",
"embed_size = 100\n",
"\n",
"class SimpleBigramNeuralLanguageModel(nn.Module):\n",
" def __init__(self, vocabulary_size, embedding_size):\n",
" super(SimpleBigramNeuralLanguageModel, self).__init__()\n",
" self.model = nn.Sequential(\n",
" nn.Embedding(vocabulary_size, embedding_size),\n",
" nn.Linear(embedding_size, vocabulary_size),\n",
" nn.Softmax()\n",
" )\n",
"\n",
" def forward(self, x):\n",
" return self.model(x)\n",
"\n",
"model = SimpleBigramNeuralLanguageModel(vocab_size, embed_size)\n",
"\n",
"vocab.set_default_index(vocab['<unk>'])\n",
"ixs = torch.tensor(vocab.forward(['pies']))\n",
"# out[0][vocab['jest']]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Teraz wyuczmy model. Wpierw tylko potasujmy nasz plik:\n",
"\n",
" shuf < opensubtitlesA.pl.txt > opensubtitlesA.pl.shuf.txt\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [],
"source": [
"from torch.utils.data import IterableDataset\n",
"import itertools\n",
"\n",
"def look_ahead_iterator(gen):\n",
" prev = None\n",
" for item in gen:\n",
" if prev is not None:\n",
" yield (prev, item)\n",
" prev = item\n",
"\n",
"class Bigrams(IterableDataset):\n",
" def __init__(self, text_file, vocabulary_size):\n",
" self.vocab = build_vocab_from_iterator(\n",
" get_word_lines_from_file(text_file),\n",
" max_tokens = vocabulary_size,\n",
" specials = ['<unk>'])\n",
" self.vocab.set_default_index(self.vocab['<unk>'])\n",
" self.vocabulary_size = vocabulary_size\n",
" self.text_file = text_file\n",
"\n",
" def __iter__(self):\n",
" return look_ahead_iterator(\n",
" (self.vocab[t] for t in itertools.chain.from_iterable(get_word_lines_from_file(self.text_file))))\n",
"\n",
"train_dataset = Bigrams('train/in.tsv.xz', vocab_size)"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"ename": "TypeError",
"evalue": "'tuple' object is not an iterator",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m/tmp/ipykernel_12664/602008184.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0mtorch\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mutils\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdata\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mDataLoader\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 3\u001b[0;31m \u001b[0mnext\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mnext\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0miter\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtrain_dataset\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[0;31mTypeError\u001b[0m: 'tuple' object is not an iterator"
]
}
],
"source": [
"from torch.utils.data import DataLoader\n",
"\n",
"next(iter(train_dataset))"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['<s>', '<unk>']"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"vocab.lookup_tokens([43, 0])"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[tensor([ 2, 5, 51, 3481, 231]), tensor([ 5, 51, 3481, 231, 4])]"
]
}
],
"source": [
"from torch.utils.data import DataLoader\n",
"\n",
"next(iter(DataLoader(train_dataset, batch_size=5)))"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/mikolaj/.local/lib/python3.8/site-packages/torch/nn/modules/container.py:217: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.\n",
" input = module(input)\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"0 tensor(10.0877, grad_fn=<NllLossBackward0>)\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/mikolaj/.local/lib/python3.8/site-packages/torch/autograd/__init__.py:200: UserWarning: CUDA initialization: The NVIDIA driver on your system is too old (found version 10020). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:109.)\n",
" Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"100 tensor(8.4388, grad_fn=<NllLossBackward0>)\n",
"200 tensor(7.7335, grad_fn=<NllLossBackward0>)\n",
"300 tensor(7.1300, grad_fn=<NllLossBackward0>)\n",
"400 tensor(6.7325, grad_fn=<NllLossBackward0>)\n",
"500 tensor(6.4705, grad_fn=<NllLossBackward0>)\n",
"600 tensor(6.0460, grad_fn=<NllLossBackward0>)\n",
"700 tensor(5.8104, grad_fn=<NllLossBackward0>)\n",
"800 tensor(5.8110, grad_fn=<NllLossBackward0>)\n",
"900 tensor(5.7169, grad_fn=<NllLossBackward0>)\n",
"1000 tensor(5.7580, grad_fn=<NllLossBackward0>)\n",
"1100 tensor(5.6787, grad_fn=<NllLossBackward0>)\n",
"1200 tensor(5.4501, grad_fn=<NllLossBackward0>)\n"
]
},
{
"ename": "KeyboardInterrupt",
"evalue": "",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mKeyboardInterrupt\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m/tmp/ipykernel_12664/1293343661.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m 11\u001b[0m \u001b[0my\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0my\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mto\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdevice\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 12\u001b[0m \u001b[0moptimizer\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mzero_grad\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 13\u001b[0;31m \u001b[0mypredicted\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mmodel\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mx\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 14\u001b[0m \u001b[0mloss\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mcriterion\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtorch\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mlog\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mypredicted\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0my\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 15\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mstep\u001b[0m \u001b[0;34m%\u001b[0m \u001b[0;36m100\u001b[0m \u001b[0;34m==\u001b[0m \u001b[0;36m0\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m~/.local/lib/python3.8/site-packages/torch/nn/modules/module.py\u001b[0m in \u001b[0;36m_call_impl\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m 1499\u001b[0m \u001b[0;32mor\u001b[0m \u001b[0m_global_backward_pre_hooks\u001b[0m \u001b[0;32mor\u001b[0m \u001b[0m_global_backward_hooks\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1500\u001b[0m or _global_forward_hooks or _global_forward_pre_hooks):\n\u001b[0;32m-> 1501\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mforward_call\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1502\u001b[0m \u001b[0;31m# Do not call functions when jit is used\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1503\u001b[0m \u001b[0mfull_backward_hooks\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnon_full_backward_hooks\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m/tmp/ipykernel_12664/517511851.py\u001b[0m in \u001b[0;36mforward\u001b[0;34m(self, x)\u001b[0m\n\u001b[1;32m 14\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 15\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mforward\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mx\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 16\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmodel\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mx\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 17\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 18\u001b[0m \u001b[0mmodel\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mSimpleBigramNeuralLanguageModel\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mvocab_size\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0membed_size\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m~/.local/lib/python3.8/site-packages/torch/nn/modules/module.py\u001b[0m in \u001b[0;36m_call_impl\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m 1499\u001b[0m \u001b[0;32mor\u001b[0m \u001b[0m_global_backward_pre_hooks\u001b[0m \u001b[0;32mor\u001b[0m \u001b[0m_global_backward_hooks\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1500\u001b[0m or _global_forward_hooks or _global_forward_pre_hooks):\n\u001b[0;32m-> 1501\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mforward_call\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1502\u001b[0m \u001b[0;31m# Do not call functions when jit is used\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1503\u001b[0m \u001b[0mfull_backward_hooks\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnon_full_backward_hooks\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m~/.local/lib/python3.8/site-packages/torch/nn/modules/container.py\u001b[0m in \u001b[0;36mforward\u001b[0;34m(self, input)\u001b[0m\n\u001b[1;32m 215\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mforward\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0minput\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 216\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mmodule\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 217\u001b[0;31m \u001b[0minput\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mmodule\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0minput\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 218\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0minput\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 219\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m~/.local/lib/python3.8/site-packages/torch/nn/modules/module.py\u001b[0m in \u001b[0;36m_call_impl\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m 1499\u001b[0m \u001b[0;32mor\u001b[0m \u001b[0m_global_backward_pre_hooks\u001b[0m \u001b[0;32mor\u001b[0m \u001b[0m_global_backward_hooks\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1500\u001b[0m or _global_forward_hooks or _global_forward_pre_hooks):\n\u001b[0;32m-> 1501\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mforward_call\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1502\u001b[0m \u001b[0;31m# Do not call functions when jit is used\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1503\u001b[0m \u001b[0mfull_backward_hooks\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnon_full_backward_hooks\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m~/.local/lib/python3.8/site-packages/torch/nn/modules/activation.py\u001b[0m in \u001b[0;36mforward\u001b[0;34m(self, input)\u001b[0m\n\u001b[1;32m 1457\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1458\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mforward\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0minput\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mTensor\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;34m->\u001b[0m \u001b[0mTensor\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1459\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mF\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msoftmax\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0minput\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdim\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0m_stacklevel\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m5\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1460\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1461\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mextra_repr\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;34m->\u001b[0m \u001b[0mstr\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m~/.local/lib/python3.8/site-packages/torch/nn/functional.py\u001b[0m in \u001b[0;36msoftmax\u001b[0;34m(input, dim, _stacklevel, dtype)\u001b[0m\n\u001b[1;32m 1841\u001b[0m \u001b[0mdim\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0m_get_softmax_dim\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"softmax\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0minput\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdim\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0m_stacklevel\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1842\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mdtype\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1843\u001b[0;31m \u001b[0mret\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0minput\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msoftmax\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdim\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1844\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1845\u001b[0m \u001b[0mret\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0minput\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msoftmax\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdim\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mdtype\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;31mKeyboardInterrupt\u001b[0m: "
]
}
],
"source": [
"device = 'cpu'\n",
"model = SimpleBigramNeuralLanguageModel(vocab_size, embed_size).to(device)\n",
"data = DataLoader(train_dataset, batch_size=5000)\n",
"optimizer = torch.optim.Adam(model.parameters())\n",
"criterion = torch.nn.NLLLoss()\n",
"\n",
"model.train()\n",
"step = 0\n",
"for x, y in data:\n",
" x = x.to(device)\n",
" y = y.to(device)\n",
" optimizer.zero_grad()\n",
" ypredicted = model(x)\n",
" loss = criterion(torch.log(ypredicted), y)\n",
" if step % 100 == 0:\n",
" print(step, loss)\n",
" step += 1\n",
" loss.backward()\n",
" optimizer.step()\n",
"\n",
"torch.save(model.state_dict(), 'model1.bin')"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [],
"source": [
"torch.save(model.state_dict(), 'model1.bin')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Policzmy najbardziej prawdopodobne kontynuacje dla zadanego słowa:\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[(',', 3, 0.12514805793762207),\n",
" ('\\\\', 1, 0.07237359136343002),\n",
" ('<unk>', 0, 0.06839419901371002),\n",
" ('.', 4, 0.06109621003270149),\n",
" ('of', 5, 0.04557998105883598),\n",
" ('and', 6, 0.03565318509936333),\n",
" ('the', 2, 0.029342489317059517),\n",
" ('to', 7, 0.02185475267469883),\n",
" ('-\\\\', 10, 0.018097609281539917),\n",
" ('in', 9, 0.016023961827158928)]"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"device = 'cpu'\n",
"model = SimpleBigramNeuralLanguageModel(vocab_size, embed_size).to(device)\n",
"model.load_state_dict(torch.load('model1.bin'))\n",
"model.eval()\n",
"\n",
"ixs = torch.tensor(vocab.forward(['dla'])).to(device)\n",
"\n",
"out = model(ixs)\n",
"top = torch.topk(out[0], 10)\n",
"top_indices = top.indices.tolist()\n",
"top_probs = top.values.tolist()\n",
"top_words = vocab.lookup_tokens(top_indices)\n",
"list(zip(top_words, top_indices, top_probs))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Teraz zbadajmy najbardziej podobne zanurzenia dla zadanego słowa:\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[(',', 3, 0.12514805793762207),\n",
" ('\\\\', 1, 0.07237359136343002),\n",
" ('<unk>', 0, 0.06839419901371002),\n",
" ('.', 4, 0.06109621003270149),\n",
" ('of', 5, 0.04557998105883598),\n",
" ('and', 6, 0.03565318509936333),\n",
" ('the', 2, 0.029342489317059517),\n",
" ('to', 7, 0.02185475267469883),\n",
" ('-\\\\', 10, 0.018097609281539917),\n",
" ('in', 9, 0.016023961827158928)]"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"vocab = train_dataset.vocab\n",
"ixs = torch.tensor(vocab.forward(['kłopot'])).to(device)\n",
"\n",
"out = model(ixs)\n",
"top = torch.topk(out[0], 10)\n",
"top_indices = top.indices.tolist()\n",
"top_probs = top.values.tolist()\n",
"top_words = vocab.lookup_tokens(top_indices)\n",
"list(zip(top_words, top_indices, top_probs))"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('<unk>', 0, 1.0000001192092896),\n",
" ('nb', 1958, 0.4407886266708374),\n",
" ('refrain', 14092, 0.4395471513271332),\n",
" ('cat', 3391, 0.4154242277145386),\n",
" ('enjoying', 7521, 0.3915165066719055),\n",
" ('active', 1383, 0.38935279846191406),\n",
" ('stewart', 4816, 0.3806381821632385),\n",
" ('omit', 15600, 0.380504310131073),\n",
" ('2041095573313', 11912, 0.37909239530563354),\n",
" ('shut', 3863, 0.3778260052204132)]"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cos = nn.CosineSimilarity(dim=1, eps=1e-6)\n",
"\n",
"embeddings = model.model[0].weight\n",
"\n",
"vec = embeddings[vocab['poszedł']]\n",
"\n",
"similarities = cos(vec, embeddings)\n",
"\n",
"top = torch.topk(similarities, 10)\n",
"\n",
"top_indices = top.indices.tolist()\n",
"top_probs = top.values.tolist()\n",
"top_words = vocab.lookup_tokens(top_indices)\n",
"list(zip(top_words, top_indices, top_probs))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Zapis przy użyciu wzoru matematycznego\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Powyżej zaprogramowaną sieć neuronową można opisać następującym wzorem:\n",
"\n",
"$$\\vec{y} = \\operatorname{softmax}(CE(w_{i-1}),$$\n",
"\n",
"gdzie:\n",
"\n",
"- $w_{i-1}$ to pierwszy wyraz w bigramie (poprzedzający wyraz),\n",
"- $E(w)$ to zanurzenie (embedding) wyrazy $w$ — wektor o rozmiarze $m$,\n",
"- $C$ to macierz o rozmiarze $|V| \\times m$, która rzutuje wektor zanurzenia w wektor o rozmiarze słownika,\n",
"- $\\vec{y}$ to wyjściowy wektor prawdopodobieństw o rozmiarze $|V|$.\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Hiperparametry\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Zauważmy, że nasz model ma dwa hiperparametry:\n",
"\n",
"- $m$ — rozmiar zanurzenia,\n",
"- $|V|$ — rozmiar słownika, jeśli zakładamy, że możemy sterować\n",
" rozmiarem słownika (np. przez obcinanie słownika do zadanej liczby\n",
" najczęstszych wyrazów i zamiany pozostałych na specjalny token, powiedzmy, `<UNK>`.\n",
"\n",
"Oczywiście możemy próbować manipulować wartościami $m$ i $|V|$ w celu\n",
"polepszenia wyników naszego modelu.\n",
"\n",
"**Pytanie**: dlaczego nie ma sensu wartość $m \\approx |V|$ ? dlaczego nie ma sensu wartość $m = 1$?\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Diagram sieci\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Jako że mnożenie przez macierz ($C$) oznacza po prostu zastosowanie\n",
"warstwy liniowej, naszą sieć możemy interpretować jako jednowarstwową\n",
"sieć neuronową, co można zilustrować za pomocą następującego diagramu:\n",
"\n",
"![img](./09_Zanurzenia_slow/bigram1.drawio.png \"Diagram prostego bigramowego neuronowego modelu języka\")\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Zanurzenie jako mnożenie przez macierz\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Uzyskanie zanurzenia ($E(w)$) zazwyczaj realizowane jest na zasadzie\n",
"odpytania (<sub>look</sub>-up\\_). Co ciekawe, zanurzenie można intepretować jako\n",
"mnożenie przez macierz zanurzeń (embeddingów) $E$ o rozmiarze $m \\times |V|$ — jeśli słowo będziemy na wejściu kodowali przy użyciu\n",
"wektora z gorącą jedynką (<sub>one</sub>-hot encoding\\_), tzn. słowo $w$ zostanie\n",
"podane na wejściu jako wektor $\\vec{1_V}(w) = [0,\\ldots,0,1,0\\ldots,0]$ o rozmiarze $|V|$\n",
"złożony z samych zer z wyjątkiem jedynki na pozycji odpowiadającej indeksowi wyrazu $w$ w słowniku $V$.\n",
"\n",
"Wówczas wzór przyjmie postać:\n",
"\n",
"$$\\vec{y} = \\operatorname{softmax}(CE\\vec{1_V}(w_{i-1})),$$\n",
"\n",
"gdzie $E$ będzie tym razem macierzą $m \\times |V|$.\n",
"\n",
"**Pytanie**: czy $\\vec{1_V}(w)$ intepretujemy jako wektor wierszowy czy kolumnowy?\n",
"\n",
"W postaci diagramu można tę interpretację zilustrować w następujący sposób:\n",
"\n",
"![img](./09_Zanurzenia_slow/bigram2.drawio.png \"Diagram prostego bigramowego neuronowego modelu języka z wejściem w postaci one-hot\")\n",
"\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.8.12 64-bit",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.12"
},
"org": null,
"vscode": {
"interpreter": {
"hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6"
}
}
},
"nbformat": 4,
"nbformat_minor": 1
}

9
README.md Normal file
View File

@ -0,0 +1,9 @@
Challenging America word-gap prediction
===================================
Guess a word in a gap.
Evaluation metric
-----------------
LikelihoodHashed is the metric

1
config.txt Normal file
View File

@ -0,0 +1 @@
--metric PerplexityHashed --precision 2 --in-header in-header.tsv --out-header out-header.tsv

35
create_vocab.py Normal file
View File

@ -0,0 +1,35 @@
from itertools import islice
import regex as re
import sys
from torchtext.vocab import build_vocab_from_iterator
import lzma
def get_words_from_line(line):
line = line.rstrip()
yield '<s>'
for m in re.finditer(r'[\p{L}0-9\*]+|\p{P}+', line):
yield m.group(0).lower()
yield '</s>'
def get_word_lines_from_file(file_name):
counter=0
with lzma.open(file_name, 'r') as fh:
for line in fh:
counter+=1
# if counter == 10000:
# break
line = line.decode("utf-8")
yield get_words_from_line(line)
vocab_size = 20000
vocab = build_vocab_from_iterator(
get_word_lines_from_file('train/in.tsv.xz'),
max_tokens = vocab_size,
specials = ['<unk>'])
import pickle
with open("vocab.pickle", 'wb') as handle:
pickle.dump(vocab, handle)

10519
dev-0/expected.tsv Normal file

File diff suppressed because it is too large Load Diff

10519
dev-0/hate-speech-info.tsv Normal file

File diff suppressed because it is too large Load Diff

BIN
dev-0/in.tsv.xz Normal file

Binary file not shown.

690
dev-0/out.tsv Normal file

File diff suppressed because one or more lines are too long

BIN
geval Executable file

Binary file not shown.

1
in-header.tsv Normal file
View File

@ -0,0 +1 @@
FileId Year LeftContext RightContext
1 FileId Year LeftContext RightContext

657
in-header.tsv.1 Normal file
View File

@ -0,0 +1,657 @@
<!DOCTYPE html>
<html lang="en-US" class="theme-">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>challenging-america-word-gap-prediction/in-header.tsv at 4gram - challenging-america-word-gap-prediction - Gitea: Git with a cup of tea</title>
<link rel="manifest" href="data:application/json;base64,eyJuYW1lIjoiR2l0ZWE6IEdpdCB3aXRoIGEgY3VwIG9mIHRlYSIsInNob3J0X25hbWUiOiJHaXRlYTogR2l0IHdpdGggYSBjdXAgb2YgdGVhIiwic3RhcnRfdXJsIjoiaHR0cHM6Ly9naXQud21pLmFtdS5lZHUucGwvIiwiaWNvbnMiOlt7InNyYyI6Imh0dHBzOi8vZ2l0LndtaS5hbXUuZWR1LnBsL2Fzc2V0cy9pbWcvbG9nby5wbmciLCJ0eXBlIjoiaW1hZ2UvcG5nIiwic2l6ZXMiOiI1MTJ4NTEyIn0seyJzcmMiOiJodHRwczovL2dpdC53bWkuYW11LmVkdS5wbC9hc3NldHMvaW1nL2xvZ28uc3ZnIiwidHlwZSI6ImltYWdlL3N2Zyt4bWwiLCJzaXplcyI6IjUxMng1MTIifV19"/>
<meta name="theme-color" content="#6cc644">
<meta name="default-theme" content="auto" />
<meta name="author" content="s444463" />
<meta name="description" content="challenging-america-word-gap-prediction" />
<meta name="keywords" content="go,git,self-hosted,gitea">
<meta name="referrer" content="no-referrer" />
<script>
<!-- -->
window.config = {
appVer: '1.16.4',
appSubUrl: '',
assetUrlPrefix: '\/assets',
runModeIsProd: true ,
customEmojis: {"codeberg":":codeberg:","git":":git:","gitea":":gitea:","github":":github:","gitlab":":gitlab:","gogs":":gogs:"},
useServiceWorker: true ,
csrfToken: 'p3_cvG3N1UMZPleED8GVFsIG_NE6MTY4MTMyODUyMDA1OTE2ODE0Nw',
pageData: {},
requireTribute: null ,
notificationSettings: {"EventSourceUpdateTime":10000,"MaxTimeout":60000,"MinTimeout":10000,"TimeoutStep":10000},
enableTimeTracking: true ,
mermaidMaxSourceCharacters: 5000 ,
i18n: {
copy_success: 'Copied!',
copy_error: 'Copy failed',
error_occurred: 'An error occurred',
network_error: 'Network error',
},
};
window.config.pageData = window.config.pageData || {};
</script>
<link rel="icon" href="/assets/img/logo.svg" type="image/svg+xml">
<link rel="alternate icon" href="/assets/img/favicon.png" type="image/png">
<link rel="stylesheet" href="/assets/css/index.css?v=88278c3fba7f4dbe8d14a7ab5c4cfc9c">
<noscript>
<style>
.dropdown:hover > .menu { display: block; }
.ui.secondary.menu .dropdown.item > .menu { margin-top: 0; }
</style>
</noscript>
<meta property="og:title" content="challenging-america-word-gap-prediction" />
<meta property="og:url" content="https://git.wmi.amu.edu.pl/s444463/challenging-america-word-gap-prediction" />
<meta property="og:type" content="object" />
<meta property="og:image" content="https://git.wmi.amu.edu.pl/avatars/a6fe95f301e02b3472a0560d70cc307a" />
<meta property="og:site_name" content="Gitea: Git with a cup of tea" />
<link rel="stylesheet" href="/assets/css/theme-auto.css?v=88278c3fba7f4dbe8d14a7ab5c4cfc9c">
<link rel="stylesheet/less" type="text/css" href="/assets/css/jupyter.less" />
<script src="/assets/js/less.js" ></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.7/latest.js?config=TeX-MML-AM_CHTML-full,Safe"> </script>
<script type="text/x-mathjax-config">
init_mathjax = function() {
if (window.MathJax) {
// MathJax loaded
MathJax.Hub.Config({
TeX: {
equationNumbers: {
autoNumber: "AMS",
useLabelIds: true
}
},
tex2jax: {
inlineMath: [ ['$','$'], ["\\(","\\)"] ],
displayMath: [ ['$$','$$'], ["\\[","\\]"] ],
processEscapes: true,
processEnvironments: true
},
displayAlign: 'center',
CommonHTML: {
linebreaks: {
automatic: true
}
},
"HTML-CSS": {
linebreaks: {
automatic: true
}
}
});
MathJax.Hub.Queue(["Typeset", MathJax.Hub]);
}
}
init_mathjax();
</script>
<script type="text/javascript">
function giteaSanitizerHack() {
var imgs = document.getElementsByClassName("nb-image-output");
var i;
for (i = 0; i < imgs.length; i++) {
imgs[i].src=imgs[i].src.replace('https://gitea.sanitizer.hack/','');
}
}
window.onload = giteaSanitizerHack;
</script>
</head>
<body>
<div class="full height">
<noscript>This website works better with JavaScript.</noscript>
<div class="ui top secondary stackable main menu following bar light">
<div class="ui container" id="navbar">
<div class="item brand" style="justify-content: space-between;">
<a href="/" data-content="Home">
<img class="ui mini image" width="30" height="30" src="/assets/img/logo.svg">
</a>
<div class="ui basic icon button mobile-only" id="navbar-expand-toggle">
<i class="sidebar icon"></i>
</div>
</div>
<a class="item " href="/explore/repos">Explore</a>
<a class="item" target="_blank" rel="noopener noreferrer" href="https://docs.gitea.io">Help</a>
<div class="right stackable menu">
<a class="item" rel="nofollow" href="/user/login?redirect_to=%2fs444463%2fchallenging-america-word-gap-prediction%2fsrc%2fbranch%2f4gram%2fin-header.tsv">
<svg viewBox="0 0 16 16" class="svg octicon-sign-in" width="16" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M2 2.75C2 1.784 2.784 1 3.75 1h2.5a.75.75 0 0 1 0 1.5h-2.5a.25.25 0 0 0-.25.25v10.5c0 .138.112.25.25.25h2.5a.75.75 0 0 1 0 1.5h-2.5A1.75 1.75 0 0 1 2 13.25V2.75zm6.56 4.5 1.97-1.97a.75.75 0 1 0-1.06-1.06L6.22 7.47a.75.75 0 0 0 0 1.06l3.25 3.25a.75.75 0 1 0 1.06-1.06L8.56 8.75h5.69a.75.75 0 0 0 0-1.5H8.56z"/></svg> Sign In
</a>
</div>
</div>
</div>
<div class="page-content repository file list ">
<div class="header-wrapper">
<div class="ui container">
<div class="repo-header">
<div class="repo-title-wrap df fc">
<div class="repo-title">
<div class="repo-icon mr-3">
<svg viewBox="0 0 16 16" class="svg octicon-repo" width="32" height="32" aria-hidden="true"><path fill-rule="evenodd" d="M2 2.5A2.5 2.5 0 0 1 4.5 0h8.75a.75.75 0 0 1 .75.75v12.5a.75.75 0 0 1-.75.75h-2.5a.75.75 0 1 1 0-1.5h1.75v-2h-8a1 1 0 0 0-.714 1.7.75.75 0 0 1-1.072 1.05A2.495 2.495 0 0 1 2 11.5v-9zm10.5-1V9h-8c-.356 0-.694.074-1 .208V2.5a1 1 0 0 1 1-1h8zM5 12.25v3.25a.25.25 0 0 0 .4.2l1.45-1.087a.25.25 0 0 1 .3 0L8.6 15.7a.25.25 0 0 0 .4-.2v-3.25a.25.25 0 0 0-.25-.25h-3.5a.25.25 0 0 0-.25.25z"/></svg>
</div>
<a href="/s444463">s444463</a>
<div class="mx-2">/</div>
<a href="/s444463/challenging-america-word-gap-prediction">challenging-america-word-gap-prediction</a>
<div class="labels df ac fw">
</div>
</div>
</div>
<div class="repo-buttons">
<form method="post" action="/s444463/challenging-america-word-gap-prediction/action/watch?redirect_to=%2fs444463%2fchallenging-america-word-gap-prediction%2fsrc%2fbranch%2f4gram%2fin-header.tsv">
<input type="hidden" name="_csrf" value="p3_cvG3N1UMZPleED8GVFsIG_NE6MTY4MTMyODUyMDA1OTE2ODE0Nw">
<div class="ui labeled button tooltip" tabindex="0" data-content="Sign in to watch this repository." data-position="top center">
<button type="submit" class="ui compact small basic button" disabled>
<svg viewBox="0 0 16 16" class="svg octicon-eye" width="16" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M1.679 7.932c.412-.621 1.242-1.75 2.366-2.717C5.175 4.242 6.527 3.5 8 3.5c1.473 0 2.824.742 3.955 1.715 1.124.967 1.954 2.096 2.366 2.717a.119.119 0 0 1 0 .136c-.412.621-1.242 1.75-2.366 2.717C10.825 11.758 9.473 12.5 8 12.5c-1.473 0-2.824-.742-3.955-1.715C2.92 9.818 2.09 8.69 1.679 8.068a.119.119 0 0 1 0-.136zM8 2c-1.981 0-3.67.992-4.933 2.078C1.797 5.169.88 6.423.43 7.1a1.619 1.619 0 0 0 0 1.798c.45.678 1.367 1.932 2.637 3.024C4.329 13.008 6.019 14 8 14c1.981 0 3.67-.992 4.933-2.078 1.27-1.091 2.187-2.345 2.637-3.023a1.619 1.619 0 0 0 0-1.798c-.45-.678-1.367-1.932-2.637-3.023C11.671 2.992 9.981 2 8 2zm0 8a2 2 0 1 0 0-4 2 2 0 0 0 0 4z"/></svg>Watch
</button>
<a class="ui basic label" href="/s444463/challenging-america-word-gap-prediction/watchers">
1
</a>
</div>
</form>
<form method="post" action="/s444463/challenging-america-word-gap-prediction/action/star?redirect_to=%2fs444463%2fchallenging-america-word-gap-prediction%2fsrc%2fbranch%2f4gram%2fin-header.tsv">
<input type="hidden" name="_csrf" value="p3_cvG3N1UMZPleED8GVFsIG_NE6MTY4MTMyODUyMDA1OTE2ODE0Nw">
<div class="ui labeled button tooltip" tabindex="0" data-content="Sign in to star this repository." data-position="top center">
<button type="submit" class="ui compact small basic button" disabled>
<svg viewBox="0 0 16 16" class="svg octicon-star" width="16" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M8 .25a.75.75 0 0 1 .673.418l1.882 3.815 4.21.612a.75.75 0 0 1 .416 1.279l-3.046 2.97.719 4.192a.75.75 0 0 1-1.088.791L8 12.347l-3.766 1.98a.75.75 0 0 1-1.088-.79l.72-4.194L.818 6.374a.75.75 0 0 1 .416-1.28l4.21-.611L7.327.668A.75.75 0 0 1 8 .25zm0 2.445L6.615 5.5a.75.75 0 0 1-.564.41l-3.097.45 2.24 2.184a.75.75 0 0 1 .216.664l-.528 3.084 2.769-1.456a.75.75 0 0 1 .698 0l2.77 1.456-.53-3.084a.75.75 0 0 1 .216-.664l2.24-2.183-3.096-.45a.75.75 0 0 1-.564-.41L8 2.694v.001z"/></svg>Star
</button>
<a class="ui basic label" href="/s444463/challenging-america-word-gap-prediction/stars">
0
</a>
</div>
</form>
<div class="ui labeled button
tooltip disabled
"
data-content="Sign in to fork this repository."
data-position="top center" data-variation="tiny" tabindex="0">
<a class="ui compact small basic button"
>
<svg viewBox="0 0 16 16" class="svg octicon-repo-forked" width="16" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M5 3.25a.75.75 0 1 1-1.5 0 .75.75 0 0 1 1.5 0zm0 2.122a2.25 2.25 0 1 0-1.5 0v.878A2.25 2.25 0 0 0 5.75 8.5h1.5v2.128a2.251 2.251 0 1 0 1.5 0V8.5h1.5a2.25 2.25 0 0 0 2.25-2.25v-.878a2.25 2.25 0 1 0-1.5 0v.878a.75.75 0 0 1-.75.75h-4.5A.75.75 0 0 1 5 6.25v-.878zm3.75 7.378a.75.75 0 1 1-1.5 0 .75.75 0 0 1 1.5 0zm3-8.75a.75.75 0 1 0 0-1.5.75.75 0 0 0 0 1.5z"/></svg>Fork
</a>
<div class="ui small modal" id="fork-repo-modal">
<svg viewBox="0 0 16 16" class="close inside svg octicon-x" width="16" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M3.72 3.72a.75.75 0 0 1 1.06 0L8 6.94l3.22-3.22a.75.75 0 1 1 1.06 1.06L9.06 8l3.22 3.22a.75.75 0 1 1-1.06 1.06L8 9.06l-3.22 3.22a.75.75 0 0 1-1.06-1.06L6.94 8 3.72 4.78a.75.75 0 0 1 0-1.06z"/></svg>
<div class="header">
You&#39;ve already forked challenging-america-word-gap-prediction
</div>
<div class="content tl">
<div class="ui list">
</div>
</div>
</div>
<a class="ui basic label" href="/s444463/challenging-america-word-gap-prediction/forks">
0
</a>
</div>
</div>
</div>
</div>
<div class="ui tabs container">
<div class="ui tabular stackable menu navbar">
<a class="active item" href="/s444463/challenging-america-word-gap-prediction/src/branch/4gram">
<svg viewBox="0 0 16 16" class="svg octicon-code" width="16" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M4.72 3.22a.75.75 0 0 1 1.06 1.06L2.06 8l3.72 3.72a.75.75 0 1 1-1.06 1.06L.47 8.53a.75.75 0 0 1 0-1.06l4.25-4.25zm6.56 0a.75.75 0 1 0-1.06 1.06L13.94 8l-3.72 3.72a.75.75 0 1 0 1.06 1.06l4.25-4.25a.75.75 0 0 0 0-1.06l-4.25-4.25z"/></svg> Code
</a>
<a class=" item" href="/s444463/challenging-america-word-gap-prediction/issues">
<svg viewBox="0 0 16 16" class="svg octicon-issue-opened" width="16" height="16" aria-hidden="true"><path d="M8 9.5a1.5 1.5 0 1 0 0-3 1.5 1.5 0 0 0 0 3z"/><path fill-rule="evenodd" d="M8 0a8 8 0 1 0 0 16A8 8 0 0 0 8 0zM1.5 8a6.5 6.5 0 1 1 13 0 6.5 6.5 0 0 1-13 0z"/></svg> Issues
</a>
<a class=" item" href="/s444463/challenging-america-word-gap-prediction/pulls">
<svg viewBox="0 0 16 16" class="svg octicon-git-pull-request" width="16" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M7.177 3.073 9.573.677A.25.25 0 0 1 10 .854v4.792a.25.25 0 0 1-.427.177L7.177 3.427a.25.25 0 0 1 0-.354zM3.75 2.5a.75.75 0 1 0 0 1.5.75.75 0 0 0 0-1.5zm-2.25.75a2.25 2.25 0 1 1 3 2.122v5.256a2.251 2.251 0 1 1-1.5 0V5.372A2.25 2.25 0 0 1 1.5 3.25zM11 2.5h-1V4h1a1 1 0 0 1 1 1v5.628a2.251 2.251 0 1 0 1.5 0V5A2.5 2.5 0 0 0 11 2.5zm1 10.25a.75.75 0 1 1 1.5 0 .75.75 0 0 1-1.5 0zM3.75 12a.75.75 0 1 0 0 1.5.75.75 0 0 0 0-1.5z"/></svg> Pull Requests
</a>
<a href="/s444463/challenging-america-word-gap-prediction/projects" class=" item">
<svg viewBox="0 0 16 16" class="svg octicon-project" width="16" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M1.75 0A1.75 1.75 0 0 0 0 1.75v12.5C0 15.216.784 16 1.75 16h12.5A1.75 1.75 0 0 0 16 14.25V1.75A1.75 1.75 0 0 0 14.25 0H1.75zM1.5 1.75a.25.25 0 0 1 .25-.25h12.5a.25.25 0 0 1 .25.25v12.5a.25.25 0 0 1-.25.25H1.75a.25.25 0 0 1-.25-.25V1.75zM11.75 3a.75.75 0 0 0-.75.75v7.5a.75.75 0 0 0 1.5 0v-7.5a.75.75 0 0 0-.75-.75zm-8.25.75a.75.75 0 0 1 1.5 0v5.5a.75.75 0 0 1-1.5 0v-5.5zM8 3a.75.75 0 0 0-.75.75v3.5a.75.75 0 0 0 1.5 0v-3.5A.75.75 0 0 0 8 3z"/></svg> Projects
</a>
<a class=" item" href="/s444463/challenging-america-word-gap-prediction/releases">
<svg viewBox="0 0 16 16" class="svg octicon-tag" width="16" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M2.5 7.775V2.75a.25.25 0 0 1 .25-.25h5.025a.25.25 0 0 1 .177.073l6.25 6.25a.25.25 0 0 1 0 .354l-5.025 5.025a.25.25 0 0 1-.354 0l-6.25-6.25a.25.25 0 0 1-.073-.177zm-1.5 0V2.75C1 1.784 1.784 1 2.75 1h5.025c.464 0 .91.184 1.238.513l6.25 6.25a1.75 1.75 0 0 1 0 2.474l-5.026 5.026a1.75 1.75 0 0 1-2.474 0l-6.25-6.25A1.75 1.75 0 0 1 1 7.775zM6 5a1 1 0 1 0 0 2 1 1 0 0 0 0-2z"/></svg> Releases
</a>
<a class=" item" href="/s444463/challenging-america-word-gap-prediction/wiki" >
<svg viewBox="0 0 16 16" class="svg octicon-book" width="16" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M0 1.75A.75.75 0 0 1 .75 1h4.253c1.227 0 2.317.59 3 1.501A3.744 3.744 0 0 1 11.006 1h4.245a.75.75 0 0 1 .75.75v10.5a.75.75 0 0 1-.75.75h-4.507a2.25 2.25 0 0 0-1.591.659l-.622.621a.75.75 0 0 1-1.06 0l-.622-.621A2.25 2.25 0 0 0 5.258 13H.75a.75.75 0 0 1-.75-.75V1.75zm8.755 3a2.25 2.25 0 0 1 2.25-2.25H14.5v9h-3.757c-.71 0-1.4.201-1.992.572l.004-7.322zm-1.504 7.324.004-5.073-.002-2.253A2.25 2.25 0 0 0 5.003 2.5H1.5v9h3.757a3.75 3.75 0 0 1 1.994.574z"/></svg> Wiki
</a>
<a class=" item" href="/s444463/challenging-america-word-gap-prediction/activity">
<svg viewBox="0 0 16 16" class="svg octicon-pulse" width="16" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M6 2a.75.75 0 0 1 .696.471L10 10.731l1.304-3.26A.75.75 0 0 1 12 7h3.25a.75.75 0 0 1 0 1.5h-2.742l-1.812 4.528a.75.75 0 0 1-1.392 0L6 4.77 4.696 8.03A.75.75 0 0 1 4 8.5H.75a.75.75 0 0 1 0-1.5h2.742l1.812-4.529A.75.75 0 0 1 6 2z"/></svg> Activity
</a>
</div>
</div>
<div class="ui tabs divider"></div>
</div>
<div class="ui container ">
<div class="ui repo-description">
<div id="repo-desc">
<a class="link" href=""></a>
</div>
</div>
<div class="mt-3" id="repo-topics">
</div>
<div class="hide" id="validate_prompt">
<span id="count_prompt">You can not select more than 25 topics</span>
<span id="format_prompt">Topics must start with a letter or number, can include dashes (&#39;-&#39;) and can be up to 35 characters long.</span>
</div>
<div class="ui segments repository-summary repository-summary-language-stats mt-3">
<div class="ui segment sub-menu repository-menu">
<div class="ui two horizontal center link list">
<div class="item">
<a class="ui" href="/s444463/challenging-america-word-gap-prediction/commits/branch/4gram"><svg viewBox="0 0 16 16" class="svg octicon-history" width="16" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M1.643 3.143.427 1.927A.25.25 0 0 0 0 2.104V5.75c0 .138.112.25.25.25h3.646a.25.25 0 0 0 .177-.427L2.715 4.215a6.5 6.5 0 1 1-1.18 4.458.75.75 0 1 0-1.493.154 8.001 8.001 0 1 0 1.6-5.684zM7.75 4a.75.75 0 0 1 .75.75v2.992l2.028.812a.75.75 0 0 1-.557 1.392l-2.5-1A.75.75 0 0 1 7 8.25v-3.5A.75.75 0 0 1 7.75 4z"/></svg> <b>4</b> Commits</a>
</div>
<div class="item">
<a class="ui" href="/s444463/challenging-america-word-gap-prediction/branches"><svg viewBox="0 0 16 16" class="svg octicon-git-branch" width="16" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M11.75 2.5a.75.75 0 1 0 0 1.5.75.75 0 0 0 0-1.5zm-2.25.75a2.25 2.25 0 1 1 3 2.122V6A2.5 2.5 0 0 1 10 8.5H6a1 1 0 0 0-1 1v1.128a2.251 2.251 0 1 1-1.5 0V5.372a2.25 2.25 0 1 1 1.5 0v1.836A2.492 2.492 0 0 1 6 7h4a1 1 0 0 0 1-1v-.628A2.25 2.25 0 0 1 9.5 3.25zM4.25 12a.75.75 0 1 0 0 1.5.75.75 0 0 0 0-1.5zM3.5 3.25a.75.75 0 1 1 1.5 0 .75.75 0 0 1-1.5 0z"/></svg> <b>2</b> Branches</a>
</div>
<div class="item">
<a class="ui" href="/s444463/challenging-america-word-gap-prediction/tags"><svg viewBox="0 0 16 16" class="svg octicon-tag" width="16" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M2.5 7.775V2.75a.25.25 0 0 1 .25-.25h5.025a.25.25 0 0 1 .177.073l6.25 6.25a.25.25 0 0 1 0 .354l-5.025 5.025a.25.25 0 0 1-.354 0l-6.25-6.25a.25.25 0 0 1-.073-.177zm-1.5 0V2.75C1 1.784 1.784 1 2.75 1h5.025c.464 0 .91.184 1.238.513l6.25 6.25a1.75 1.75 0 0 1 0 2.474l-5.026 5.026a1.75 1.75 0 0 1-2.474 0l-6.25-6.25A1.75 1.75 0 0 1 1 7.775zM6 5a1 1 0 1 0 0 2 1 1 0 0 0 0-2z"/></svg> <b>0</b> Tags</a>
</div>
<div class="item">
<span class="ui"><svg viewBox="0 0 16 16" class="svg octicon-database" width="16" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M2.5 3.5c0-.133.058-.318.282-.55.227-.237.592-.484 1.1-.708C4.899 1.795 6.354 1.5 8 1.5c1.647 0 3.102.295 4.117.742.51.224.874.47 1.101.707.224.233.282.418.282.551 0 .133-.058.318-.282.55-.227.237-.592.484-1.1.708C11.101 5.205 9.646 5.5 8 5.5c-1.647 0-3.102-.295-4.117-.742-.51-.224-.874-.47-1.101-.707-.224-.233-.282-.418-.282-.551zM1 3.5c0-.626.292-1.165.7-1.59.406-.422.956-.767 1.579-1.041C4.525.32 6.195 0 8 0c1.805 0 3.475.32 4.722.869.622.274 1.172.62 1.578 1.04.408.426.7.965.7 1.591v9c0 .626-.292 1.165-.7 1.59-.406.422-.956.767-1.579 1.041C11.476 15.68 9.806 16 8 16c-1.805 0-3.475-.32-4.721-.869-.623-.274-1.173-.62-1.579-1.04-.408-.426-.7-.965-.7-1.591v-9zM2.5 8V5.724c.241.15.503.286.779.407C4.525 6.68 6.195 7 8 7c1.805 0 3.475-.32 4.722-.869.275-.121.537-.257.778-.407V8c0 .133-.058.318-.282.55-.227.237-.592.484-1.1.708C11.101 9.705 9.646 10 8 10c-1.647 0-3.102-.295-4.117-.742-.51-.224-.874-.47-1.101-.707C2.558 8.318 2.5 8.133 2.5 8zm0 2.225V12.5c0 .133.058.318.282.55.227.237.592.484 1.1.708 1.016.447 2.471.742 4.118.742 1.647 0 3.102-.295 4.117-.742.51-.224.874-.47 1.101-.707.224-.233.282-.418.282-.551v-2.275c-.241.15-.503.285-.778.406-1.247.549-2.917.869-4.722.869-1.805 0-3.475-.32-4.721-.869a6.236 6.236 0 0 1-.779-.406z"/></svg> <b>284 MiB</b></span>
</div>
</div>
</div>
<div class="ui segment sub-menu language-stats-details" style="display: none">
<div class="ui horizontal center link list">
<div class="item df ac jc">
<i class="color-icon mr-3" style="background-color: #3572A5"></i>
<span class="bold mr-3">
Python
</span>
100%
</div>
</div>
</div>
<a class="ui segment language-stats">
<div class="bar" style="width: 100%; background-color: #3572A5">&nbsp;</div>
</a>
</div>
<div class="ui stackable secondary menu mobile--margin-between-items mobile--no-negative-margins">
<div class="fitted item choose reference mr-1">
<div class="ui floating filter dropdown custom" data-can-create-branch="false" data-no-results="No results found.">
<div class="ui basic small compact button" @click="menuVisible = !menuVisible" @keyup.enter="menuVisible = !menuVisible">
<span class="text">
<svg viewBox="0 0 16 16" class="svg octicon-git-branch" width="16" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M11.75 2.5a.75.75 0 1 0 0 1.5.75.75 0 0 0 0-1.5zm-2.25.75a2.25 2.25 0 1 1 3 2.122V6A2.5 2.5 0 0 1 10 8.5H6a1 1 0 0 0-1 1v1.128a2.251 2.251 0 1 1-1.5 0V5.372a2.25 2.25 0 1 1 1.5 0v1.836A2.492 2.492 0 0 1 6 7h4a1 1 0 0 0 1-1v-.628A2.25 2.25 0 0 1 9.5 3.25zM4.25 12a.75.75 0 1 0 0 1.5.75.75 0 0 0 0-1.5zM3.5 3.25a.75.75 0 1 1 1.5 0 .75.75 0 0 1-1.5 0z"/></svg>
Branch:
<strong>4gram</strong>
</span>
<svg viewBox="0 0 16 16" class="dropdown icon svg octicon-triangle-down" width="14" height="14" aria-hidden="true"><path d="m4.427 7.427 3.396 3.396a.25.25 0 0 0 .354 0l3.396-3.396A.25.25 0 0 0 11.396 7H4.604a.25.25 0 0 0-.177.427z"/></svg>
</div>
<div class="data" style="display: none" data-mode="branches">
<div class="item branch selected" data-url="/s444463/challenging-america-word-gap-prediction/src/branch/4gram/in-header.tsv">4gram</div>
<div class="item branch " data-url="/s444463/challenging-america-word-gap-prediction/src/branch/master/in-header.tsv">master</div>
</div>
<div class="menu transition" :class="{visible: menuVisible}" v-if="menuVisible" v-cloak>
<div class="ui icon search input">
<i class="icon df ac jc m-0"><svg viewBox="0 0 16 16" class="svg octicon-filter" width="16" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M.75 3a.75.75 0 0 0 0 1.5h14.5a.75.75 0 0 0 0-1.5H.75zM3 7.75A.75.75 0 0 1 3.75 7h8.5a.75.75 0 0 1 0 1.5h-8.5A.75.75 0 0 1 3 7.75zm3 4a.75.75 0 0 1 .75-.75h2.5a.75.75 0 0 1 0 1.5h-2.5a.75.75 0 0 1-.75-.75z"/></svg></i>
<input name="search" ref="searchField" autocomplete="off" v-model="searchTerm" @keydown="keydown($event)" placeholder="Filter branch or tag...">
</div>
<div class="header branch-tag-choice">
<div class="ui grid">
<div class="two column row">
<a class="reference column" href="#" @click="createTag = false; mode = 'branches'; focusSearchField()">
<span class="text" :class="{black: mode == 'branches'}">
<svg viewBox="0 0 16 16" class="mr-2 svg octicon-git-branch" width="16" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M11.75 2.5a.75.75 0 1 0 0 1.5.75.75 0 0 0 0-1.5zm-2.25.75a2.25 2.25 0 1 1 3 2.122V6A2.5 2.5 0 0 1 10 8.5H6a1 1 0 0 0-1 1v1.128a2.251 2.251 0 1 1-1.5 0V5.372a2.25 2.25 0 1 1 1.5 0v1.836A2.492 2.492 0 0 1 6 7h4a1 1 0 0 0 1-1v-.628A2.25 2.25 0 0 1 9.5 3.25zM4.25 12a.75.75 0 1 0 0 1.5.75.75 0 0 0 0-1.5zM3.5 3.25a.75.75 0 1 1 1.5 0 .75.75 0 0 1-1.5 0z"/></svg>Branches
</span>
</a>
<a class="reference column" href="#" @click="createTag = true; mode = 'tags'; focusSearchField()">
<span class="text" :class="{black: mode == 'tags'}">
<svg viewBox="0 0 16 16" class="mr-2 svg octicon-tag" width="16" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M2.5 7.775V2.75a.25.25 0 0 1 .25-.25h5.025a.25.25 0 0 1 .177.073l6.25 6.25a.25.25 0 0 1 0 .354l-5.025 5.025a.25.25 0 0 1-.354 0l-6.25-6.25a.25.25 0 0 1-.073-.177zm-1.5 0V2.75C1 1.784 1.784 1 2.75 1h5.025c.464 0 .91.184 1.238.513l6.25 6.25a1.75 1.75 0 0 1 0 2.474l-5.026 5.026a1.75 1.75 0 0 1-2.474 0l-6.25-6.25A1.75 1.75 0 0 1 1 7.775zM6 5a1 1 0 1 0 0 2 1 1 0 0 0 0-2z"/></svg>Tags
</span>
</a>
</div>
</div>
</div>
<div class="scrolling menu" ref="scrollContainer">
<div v-for="(item, index) in filteredItems" :key="item.name" class="item" :class="{selected: item.selected, active: active == index}" @click="selectItem(item)" :ref="'listItem' + index">${ item.name }</div>
<div class="item" v-if="showCreateNewBranch" :class="{active: active == filteredItems.length}" :ref="'listItem' + filteredItems.length">
<a href="#" @click="createNewBranch()">
<div v-show="createTag">
<i class="reference tags icon"></i>
Create tag <strong>${ searchTerm }</strong>
</div>
<div v-show="!createTag">
<svg viewBox="0 0 16 16" class="svg octicon-git-branch" width="16" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M11.75 2.5a.75.75 0 1 0 0 1.5.75.75 0 0 0 0-1.5zm-2.25.75a2.25 2.25 0 1 1 3 2.122V6A2.5 2.5 0 0 1 10 8.5H6a1 1 0 0 0-1 1v1.128a2.251 2.251 0 1 1-1.5 0V5.372a2.25 2.25 0 1 1 1.5 0v1.836A2.492 2.492 0 0 1 6 7h4a1 1 0 0 0 1-1v-.628A2.25 2.25 0 0 1 9.5 3.25zM4.25 12a.75.75 0 1 0 0 1.5.75.75 0 0 0 0-1.5zM3.5 3.25a.75.75 0 1 1 1.5 0 .75.75 0 0 1-1.5 0z"/></svg>
Create branch <strong>${ searchTerm }</strong>
</div>
<div class="text small">
from &#39;4gram&#39;
</div>
</a>
<form ref="newBranchForm" action="/s444463/challenging-america-word-gap-prediction/branches/_new/branch/4gram" method="post">
<input type="hidden" name="_csrf" value="p3_cvG3N1UMZPleED8GVFsIG_NE6MTY4MTMyODUyMDA1OTE2ODE0Nw">
<input type="hidden" name="new_branch_name" v-model="searchTerm">
<input type="hidden" name="create_tag" v-model="createTag">
</form>
</div>
</div>
<div class="message" v-if="showNoResults">${ noResults }</div>
</div>
</div>
</div>
<div class="fitted item"><span class="ui breadcrumb repo-path"><a class="section" href="/s444463/challenging-america-word-gap-prediction/src/branch/4gram" title="challenging-america-word-gap-prediction">challenging-america-word-ga...</a><span class="divider">/</span><span class="active section" title="in-header.tsv">in-header.tsv</span></span></div>
<div class="right fitted item mr-0" id="file-buttons">
<div class="ui tiny primary buttons">
</div>
</div>
<div class="fitted item">
</div>
<div class="fitted item">
</div>
</div>
<div class="tab-size-8 non-diff-file-content">
<h4 class="file-header ui top attached header df ac sb">
<div class="file-header-left df ac">
<div class="file-info text grey normal mono">
<div class="file-info-entry">
37 B
</div>
</div>
</div>
<div class="file-header-right file-actions df ac">
<div class="ui buttons mr-2">
<a class="ui mini basic button" href="/s444463/challenging-america-word-gap-prediction/raw/branch/4gram/in-header.tsv">Raw</a>
<a class="ui mini basic button" href="/s444463/challenging-america-word-gap-prediction/src/commit/65d889d6525d3949dd1ace045393124a7afb1f0e/in-header.tsv">Permalink</a>
<a class="ui mini basic button" href="/s444463/challenging-america-word-gap-prediction/blame/branch/4gram/in-header.tsv">Blame</a>
<a class="ui mini basic button" href="/s444463/challenging-america-word-gap-prediction/commits/branch/4gram/in-header.tsv">History</a>
</div>
<a download href="/s444463/challenging-america-word-gap-prediction/raw/branch/4gram/in-header.tsv"><span class="btn-octicon tooltip" data-content="Download file" data-position="bottom center"><svg viewBox="0 0 16 16" class="svg octicon-download" width="16" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M7.47 10.78a.75.75 0 0 0 1.06 0l3.75-3.75a.75.75 0 0 0-1.06-1.06L8.75 8.44V1.75a.75.75 0 0 0-1.5 0v6.69L4.78 5.97a.75.75 0 0 0-1.06 1.06l3.75 3.75zM3.75 13a.75.75 0 0 0 0 1.5h8.5a.75.75 0 0 0 0-1.5h-8.5z"/></svg></span></a>
<span class="btn-octicon tooltip disabled" data-content="You must fork this repository to make or propose changes to this file." data-position="bottom center"><svg viewBox="0 0 16 16" class="svg octicon-pencil" width="16" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M11.013 1.427a1.75 1.75 0 0 1 2.474 0l1.086 1.086a1.75 1.75 0 0 1 0 2.474l-8.61 8.61c-.21.21-.47.364-.756.445l-3.251.93a.75.75 0 0 1-.927-.928l.929-3.25a1.75 1.75 0 0 1 .445-.758l8.61-8.61zm1.414 1.06a.25.25 0 0 0-.354 0L10.811 3.75l1.439 1.44 1.263-1.263a.25.25 0 0 0 0-.354l-1.086-1.086zM11.189 6.25 9.75 4.81l-6.286 6.287a.25.25 0 0 0-.064.108l-.558 1.953 1.953-.558a.249.249 0 0 0 .108-.064l6.286-6.286z"/></svg></span>
<span class="btn-octicon tooltip disabled" data-content="You must have write access to make or propose changes to this file." data-position="bottom center"><svg viewBox="0 0 16 16" class="svg octicon-trash" width="16" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M6.5 1.75a.25.25 0 0 1 .25-.25h2.5a.25.25 0 0 1 .25.25V3h-3V1.75zm4.5 0V3h2.25a.75.75 0 0 1 0 1.5H2.75a.75.75 0 0 1 0-1.5H5V1.75C5 .784 5.784 0 6.75 0h2.5C10.216 0 11 .784 11 1.75zM4.496 6.675a.75.75 0 1 0-1.492.15l.66 6.6A1.75 1.75 0 0 0 5.405 15h5.19c.9 0 1.652-.681 1.741-1.576l.66-6.6a.75.75 0 0 0-1.492-.149l-.66 6.6a.25.25 0 0 1-.249.225h-5.19a.25.25 0 0 1-.249-.225l-.66-6.6z"/></svg></span>
</div>
</h4>
<div class="ui attached table unstackable segment">
<div class="file-view markup csv">
<table class="data-table"><tr><th class="line-num">1</th><th>FileId</th><th>Year</th><th>LeftContext</th><th>RightContext</th></tr></table>
</div>
</div>
</div>
</div>
</div>
</div>
<footer>
<div class="ui container">
<div class="ui left">
Powered by Gitea
</div>
<div class="ui right links">
<div class="ui language bottom floating slide up dropdown link item">
<svg viewBox="0 0 16 16" class="svg octicon-globe" width="16" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M1.543 7.25h2.733c.144-2.074.866-3.756 1.58-4.948.12-.197.237-.381.353-.552a6.506 6.506 0 0 0-4.666 5.5zm2.733 1.5H1.543a6.506 6.506 0 0 0 4.666 5.5 11.13 11.13 0 0 1-.352-.552c-.715-1.192-1.437-2.874-1.581-4.948zm1.504 0h4.44a9.637 9.637 0 0 1-1.363 4.177c-.306.51-.612.919-.857 1.215a9.978 9.978 0 0 1-.857-1.215A9.637 9.637 0 0 1 5.78 8.75zm4.44-1.5H5.78a9.637 9.637 0 0 1 1.363-4.177c.306-.51.612-.919.857-1.215.245.296.55.705.857 1.215A9.638 9.638 0 0 1 10.22 7.25zm1.504 1.5c-.144 2.074-.866 3.756-1.58 4.948-.12.197-.237.381-.353.552a6.506 6.506 0 0 0 4.666-5.5h-2.733zm2.733-1.5h-2.733c-.144-2.074-.866-3.756-1.58-4.948a11.738 11.738 0 0 0-.353-.552 6.506 6.506 0 0 1 4.666 5.5zM8 0a8 8 0 1 0 0 16A8 8 0 0 0 8 0z"/></svg>
<div class="text">English</div>
<div class="menu language-menu">
<a lang="id-ID" data-url="/?lang=id-ID" class="item ">bahasa Indonesia</a>
<a lang="de-DE" data-url="/?lang=de-DE" class="item ">Deutsch</a>
<a lang="en-US" data-url="/?lang=en-US" class="item active selected">English</a>
<a lang="es-ES" data-url="/?lang=es-ES" class="item ">español</a>
<a lang="fr-FR" data-url="/?lang=fr-FR" class="item ">français</a>
<a lang="it-IT" data-url="/?lang=it-IT" class="item ">italiano</a>
<a lang="lv-LV" data-url="/?lang=lv-LV" class="item ">latviešu</a>
<a lang="hu-HU" data-url="/?lang=hu-HU" class="item ">magyar nyelv</a>
<a lang="nl-NL" data-url="/?lang=nl-NL" class="item ">Nederlands</a>
<a lang="pl-PL" data-url="/?lang=pl-PL" class="item ">polski</a>
<a lang="pt-PT" data-url="/?lang=pt-PT" class="item ">Português de Portugal</a>
<a lang="pt-BR" data-url="/?lang=pt-BR" class="item ">português do Brasil</a>
<a lang="fi-FI" data-url="/?lang=fi-FI" class="item ">suomi</a>
<a lang="sv-SE" data-url="/?lang=sv-SE" class="item ">svenska</a>
<a lang="tr-TR" data-url="/?lang=tr-TR" class="item ">Türkçe</a>
<a lang="cs-CZ" data-url="/?lang=cs-CZ" class="item ">čeština</a>
<a lang="el-GR" data-url="/?lang=el-GR" class="item ">ελληνικά</a>
<a lang="bg-BG" data-url="/?lang=bg-BG" class="item ">български</a>
<a lang="ru-RU" data-url="/?lang=ru-RU" class="item ">русский</a>
<a lang="sr-SP" data-url="/?lang=sr-SP" class="item ">српски</a>
<a lang="uk-UA" data-url="/?lang=uk-UA" class="item ">Українська</a>
<a lang="fa-IR" data-url="/?lang=fa-IR" class="item ">فارسی</a>
<a lang="ml-IN" data-url="/?lang=ml-IN" class="item ">മലയാളം</a>
<a lang="ja-JP" data-url="/?lang=ja-JP" class="item ">日本語</a>
<a lang="zh-CN" data-url="/?lang=zh-CN" class="item ">简体中文</a>
<a lang="zh-TW" data-url="/?lang=zh-TW" class="item ">繁體中文(台灣)</a>
<a lang="zh-HK" data-url="/?lang=zh-HK" class="item ">繁體中文(香港)</a>
<a lang="ko-KR" data-url="/?lang=ko-KR" class="item ">한국어</a>
</div>
</div>
<a href="/assets/js/licenses.txt">Licenses</a>
<a href="/api/swagger">API</a>
<a target="_blank" rel="noopener noreferrer" href="https://gitea.io">Website</a>
</div>
</div>
</footer>
<script src="/assets/js/index.js?v=88278c3fba7f4dbe8d14a7ab5c4cfc9c"></script>
</body>
</html>

89
inference.py Normal file
View File

@ -0,0 +1,89 @@
from torch import nn
import torch
from torch.utils.data import IterableDataset
import itertools
import lzma
import regex as re
import pickle
class SimpleTrigramNeuralLanguageModel(nn.Module):
def __init__(self, vocabulary_size, embedding_size):
super(SimpleTrigramNeuralLanguageModel, self).__init__()
self.embedings = nn.Embedding(vocabulary_size, embedding_size)
self.linear = nn.Linear(embedding_size*2, vocabulary_size)
self.softmax = nn.Softmax()
# self.model = nn.Sequential(
# nn.Embedding(vocabulary_size, embedding_size),
# nn.Linear(embedding_size, vocabulary_size),
# nn.Softmax()
# )
def forward(self, x):
emb_1 = self.embedings(x[0])
emb_2 = self.embedings(x[1])
concated = self.linear(torch.cat((emb_1, emb_2), dim=1))
y = self.softmax(concated)
return y
vocab_size = 20000
embed_size = 100
model = SimpleTrigramNeuralLanguageModel(vocab_size, embed_size)
model.load_state_dict(torch.load('model1_5400.bin'))
model.eval()
with open("vocab.pickle", 'rb') as handle:
vocab = pickle.load(handle)
vocab.set_default_index(vocab['<unk>'])
device = 'cpu'
# data = DataLoader(train_dataset, batch_size=5000)
optimizer = torch.optim.Adam(model.parameters())
criterion = torch.nn.NLLLoss()
test_pred = ['ala', 'has', 'cat']
step = 0
with lzma.open('dev-0/in.tsv.xz', 'rb') as file:
for line in file:
line = line.decode('utf-8')
line = line.rstrip()
line_splitted = line.split('\t')[-2:]
prev = line[0].split(' ')[-1]
next = line[1].split(' ')[0]
x = torch.tensor(vocab.forward([prev]))
z = torch.tensor(vocab.forward([next]))
x = x.to(device)
z = z.to(device)
ypredicted = model([x, z])
top = torch.topk(ypredicted[0], 5000)
top_indices = top.indices.tolist()
top_probs = top.values.tolist()
top_words = vocab.lookup_tokens(top_indices)
string_to_print = ''
sum_probs = 0
for w, p in zip(top_words, top_probs):
if '<unk>' in w:
continue
if re.search(r'\p{L}+', w):
string_to_print += f"{w}:{p} "
sum_probs += p
if string_to_print == '':
print(f"the:0.5 a:0.3 :0.2")
continue
unknow_prob = 1 - sum_probs
string_to_print += f":{unknow_prob}"
print(string_to_print)

1
out-header.tsv Normal file
View File

@ -0,0 +1 @@
Word
1 Word

7414
test-A/hate-speech-info.tsv Normal file

File diff suppressed because it is too large Load Diff

BIN
test-A/in.tsv.xz Normal file

Binary file not shown.

0
test-A/out.tsv Normal file
View File

432022
train/expected.tsv Normal file

File diff suppressed because it is too large Load Diff

432022
train/hate-speech-info.tsv Normal file

File diff suppressed because it is too large Load Diff

BIN
train/in.tsv.xz Normal file

Binary file not shown.

124
trigrams.py Normal file
View File

@ -0,0 +1,124 @@
from torch import nn
import torch
from torch.utils.data import IterableDataset
import itertools
import lzma
import regex as re
import pickle
def look_ahead_iterator(gen):
prev = None
current = None
next = None
for next in gen:
if prev is not None and current is not None:
yield (prev, current, next)
prev = current
current = next
def get_words_from_line(line):
line = line.rstrip()
yield '<s>'
for m in re.finditer(r'[\p{L}0-9\*]+|\p{P}+', line):
yield m.group(0).lower()
yield '</s>'
def get_word_lines_from_file(file_name):
counter=0
with lzma.open(file_name, 'r') as fh:
for line in fh:
counter+=1
if counter == 100000:
break
line = line.decode("utf-8")
yield get_words_from_line(line)
class Trigrams(IterableDataset):
def load_vocab(self):
with open("vocab.pickle", 'rb') as handle:
vocab = pickle.load( handle)
return vocab
def __init__(self, text_file, vocabulary_size):
self.vocab = self.load_vocab()
self.vocab.set_default_index(self.vocab['<unk>'])
self.vocabulary_size = vocabulary_size
self.text_file = text_file
def __iter__(self):
return look_ahead_iterator(
(self.vocab[t] for t in itertools.chain.from_iterable(get_word_lines_from_file(self.text_file))))
vocab_size = 20000
train_dataset = Trigrams('train/in.tsv.xz', vocab_size)
#=== trenowanie
from torch import nn
import torch
from torch.utils.data import DataLoader
embed_size = 100
class SimpleTrigramNeuralLanguageModel(nn.Module):
def __init__(self, vocabulary_size, embedding_size):
super(SimpleTrigramNeuralLanguageModel, self).__init__()
self.embedings = nn.Embedding(vocabulary_size, embedding_size)
self.linear = nn.Linear(embedding_size*2, vocabulary_size)
self.softmax = nn.Softmax()
# self.model = nn.Sequential(
# nn.Embedding(vocabulary_size, embedding_size),
# nn.Linear(embedding_size, vocabulary_size),
# nn.Softmax()
# )
def forward(self, x):
emb_1 = self.embedings(x[0])
emb_2 = self.embedings(x[1])
concated = self.linear(torch.cat((emb_1, emb_2), dim=1))
y = self.softmax(concated)
return y
model = SimpleTrigramNeuralLanguageModel(vocab_size, embed_size)
vocab = train_dataset.vocab
vocab.set_default_index(vocab['<unk>'])
ixs = torch.tensor(vocab.forward(['pies']))
# out[0][vocab['jest']]
device = 'cpu'
model = SimpleTrigramNeuralLanguageModel(vocab_size, embed_size).to(device)
data = DataLoader(train_dataset, batch_size=5000)
optimizer = torch.optim.Adam(model.parameters())
criterion = torch.nn.NLLLoss()
model.train()
step = 0
for x, y, z in data:
x = x.to(device)
y = y.to(device)
z = z.to(device)
optimizer.zero_grad()
ypredicted = model([x, z])
loss = criterion(torch.log(ypredicted), y)
if step % 100 == 0:
print(step, loss)
torch.save(model.state_dict(), f'model1_{step}.bin')
step += 1
loss.backward()
optimizer.step()
torch.save(model.state_dict(), 'model_tri1.bin')