wip
This commit is contained in:
commit 51bcff674a
@@ -0,0 +1,8 @@
*~
*.swp
*.bak
*.pyc
*.o
.DS_Store
.token
@@ -0,0 +1,919 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![Logo 1](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech1.jpg)\n",
    "<div class=\"alert alert-block alert-info\">\n",
    "<h1> Language modelling</h1>\n",
    "<h2> 09. <i>Word embeddings (Word2vec)</i> [lecture]</h2> \n",
    "<h3> Filip Graliński (2022)</h3>\n",
    "</div>\n",
    "\n",
    "![Logo 2](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech2.jpg)\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Word embeddings (Word2vec)\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In practice, the applicability of wordnets turned out to be surprisingly\n",
    "limited. A greater breakthrough in natural language processing was brought\n",
    "by multidimensional word representations, in other words: word embeddings.\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Word “dimensions”\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We could embed words in a multidimensional space, i.e. define a mapping\n",
    "$E \\colon V \\rightarrow \\mathcal{R}^m$ for some $m$, and choose a way of estimating\n",
    "the probabilities $P(u|v)$ such that for pairs $E(v)$, $E(v')$ and $E(u)$, $E(u')$ lying close to each other\n",
    "(according to some distance metric, for example the ordinary Euclidean distance):\n",
    "\n",
    "$$P(u|v) \\approx P(u'|v').$$\n",
    "\n",
    "$E(u)$ is called the embedding of the word.\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Dimensions fixed in advance?\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One could imagine the $m$ dimensions being fixed in advance by a\n",
    "linguist. These dimensions would correspond to the typical “axes”\n",
    "considered in linguistics, for example:\n",
    "\n",
    "- is the word vulgar, common, colloquial, neutral or literary?\n",
    "- is the word archaic, falling out of use, or a neologism?\n",
    "- does the word refer to women or to men (in the grammatical and/or\n",
    "  sociolinguistic sense)?\n",
    "- is the word singular or plural?\n",
    "- is the word a noun or a verb?\n",
    "- is the word a native word or a borrowing?\n",
    "- is the word a proper name or a common word?\n",
    "- does the word describe a concrete thing or an abstract concept?\n",
    "- …\n",
    "\n",
    "In practice, however, it turned out to work better to let the computer learn\n",
    "the possible dimensions itself; we fix in advance only $m$ (the number of dimensions).\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### A bigram language model based on embeddings\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will now build the simplest language model based on embeddings. It will in fact be the simplest\n",
    "**neural language model**, since the resulting model can be treated as a simple neural network.\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Vocabulary\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In a typical neural language model the vocabulary size must be bounded\n",
    "in advance. Usually it is on the order of tens of thousands of words:\n",
    "we simply consider the $|V|$ most frequent words and replace all the others\n",
    "with a special token `<unk>` standing for an unknown word.\n",
    "\n",
    "To build such a vocabulary, we will use the ready-made `Vocab` class from the torchtext package:\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "12531"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import lzma\n",
    "\n",
    "import regex as re\n",
    "from torchtext.vocab import build_vocab_from_iterator\n",
    "\n",
    "\n",
    "def get_words_from_line(line):\n",
    "    line = line.rstrip()\n",
    "    yield '<s>'\n",
    "    for m in re.finditer(r'[\\p{L}0-9\\*]+|\\p{P}+', line):\n",
    "        yield m.group(0).lower()\n",
    "    yield '</s>'\n",
    "\n",
    "\n",
    "def get_word_lines_from_file(file_name):\n",
    "    # Read at most the first 100000 lines of the compressed corpus.\n",
    "    counter = 0\n",
    "    with lzma.open(file_name, 'r') as fh:\n",
    "        for line in fh:\n",
    "            counter += 1\n",
    "            if counter == 100000:\n",
    "                break\n",
    "            line = line.decode(\"utf-8\")\n",
    "            yield get_words_from_line(line)\n",
    "\n",
    "\n",
    "vocab_size = 20000\n",
    "\n",
    "vocab = build_vocab_from_iterator(\n",
    "    get_word_lines_from_file('train/in.tsv.xz'),\n",
    "    max_tokens=vocab_size,\n",
    "    specials=['<unk>'])\n",
    "\n",
    "vocab['jest']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pickle\n",
    "\n",
    "with open(\"vocab.pickle\", 'wb') as handle:\n",
    "    pickle.dump(vocab, handle)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "with open(\"vocab.pickle\", 'rb') as handle:\n",
    "    vocab = pickle.load(handle)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "838"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "vocab['love']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# lookup_token requires an index; index 0 is the '<unk>' special token\n",
    "vocab.lookup_token(0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "WARNING: pip is being invoked by an old script wrapper. This will fail in a future version of pip.\n",
      "Please see https://github.com/pypa/pip/issues/5599 for advice on fixing the underlying issue.\n",
      "To avoid this problem you can invoke Python with '-m pip' instead of running pip directly.\n",
      "Defaulting to user installation because normal site-packages is not writeable\n",
      "Collecting torchtext\n",
      "  Downloading torchtext-0.15.1-cp38-cp38-manylinux1_x86_64.whl (2.0 MB)\n",
      "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m2.0/2.0 MB\u001b[0m \u001b[31m661.8 kB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
      "\u001b[?25hRequirement already satisfied: numpy in /home/mikolaj/.local/lib/python3.8/site-packages (from torchtext) (1.19.3)\n",
      "Requirement already satisfied: tqdm in /home/mikolaj/.local/lib/python3.8/site-packages (from torchtext) (4.64.0)\n",
      "Collecting torchdata==0.6.0\n",
      "  Downloading torchdata-0.6.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.6 MB)\n",
      "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m4.6/4.6 MB\u001b[0m \u001b[31m1.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m0m\n",
      "\u001b[?25hRequirement already satisfied: requests in /home/mikolaj/.local/lib/python3.8/site-packages (from torchtext) (2.26.0)\n",
      "Collecting torch==2.0.0\n",
      "  Downloading torch-2.0.0-cp38-cp38-manylinux1_x86_64.whl (619.9 MB)\n",
      "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m619.9/619.9 MB\u001b[0m \u001b[31m2.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:02\u001b[0m\n",
      "\u001b[?25hCollecting nvidia-nccl-cu11==2.14.3\n",
      "  Downloading nvidia_nccl_cu11-2.14.3-py3-none-manylinux1_x86_64.whl (177.1 MB)\n",
      "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m177.1/177.1 MB\u001b[0m \u001b[31m6.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
      "\u001b[?25hCollecting nvidia-cuda-runtime-cu11==11.7.99\n",
      "  Using cached nvidia_cuda_runtime_cu11-11.7.99-py3-none-manylinux1_x86_64.whl (849 kB)\n",
      "Collecting nvidia-cuda-nvrtc-cu11==11.7.99\n",
      "  Using cached nvidia_cuda_nvrtc_cu11-11.7.99-2-py3-none-manylinux1_x86_64.whl (21.0 MB)\n",
      "Collecting nvidia-cudnn-cu11==8.5.0.96\n",
      "  Using cached nvidia_cudnn_cu11-8.5.0.96-2-py3-none-manylinux1_x86_64.whl (557.1 MB)\n",
      "Requirement already satisfied: networkx in /home/mikolaj/.local/lib/python3.8/site-packages (from torch==2.0.0->torchtext) (2.8)\n",
      "Collecting nvidia-cublas-cu11==11.10.3.66\n",
      "  Using cached nvidia_cublas_cu11-11.10.3.66-py3-none-manylinux1_x86_64.whl (317.1 MB)\n",
      "Collecting nvidia-cusolver-cu11==11.4.0.1\n",
      "  Downloading nvidia_cusolver_cu11-11.4.0.1-2-py3-none-manylinux1_x86_64.whl (102.6 MB)\n",
      "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m102.6/102.6 MB\u001b[0m \u001b[31m5.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
      "\u001b[?25hRequirement already satisfied: typing-extensions in /home/mikolaj/.local/lib/python3.8/site-packages (from torch==2.0.0->torchtext) (4.2.0)\n",
      "Collecting nvidia-cusparse-cu11==11.7.4.91\n",
      "  Downloading nvidia_cusparse_cu11-11.7.4.91-py3-none-manylinux1_x86_64.whl (173.2 MB)\n",
      "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m173.2/173.2 MB\u001b[0m \u001b[31m6.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
      "\u001b[?25hCollecting nvidia-cuda-cupti-cu11==11.7.101\n",
      "  Downloading nvidia_cuda_cupti_cu11-11.7.101-py3-none-manylinux1_x86_64.whl (11.8 MB)\n",
      "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m11.8/11.8 MB\u001b[0m \u001b[31m7.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0mm\n",
      "\u001b[?25hCollecting nvidia-curand-cu11==10.2.10.91\n",
      "  Downloading nvidia_curand_cu11-10.2.10.91-py3-none-manylinux1_x86_64.whl (54.6 MB)\n",
      "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m54.6/54.6 MB\u001b[0m \u001b[31m10.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
      "\u001b[?25hRequirement already satisfied: jinja2 in /home/mikolaj/.local/lib/python3.8/site-packages (from torch==2.0.0->torchtext) (3.0.3)\n",
      "Collecting nvidia-cufft-cu11==10.9.0.58\n",
      "  Downloading nvidia_cufft_cu11-10.9.0.58-py3-none-manylinux1_x86_64.whl (168.4 MB)\n",
      "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m168.4/168.4 MB\u001b[0m \u001b[31m3.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
      "\u001b[?25hCollecting sympy\n",
      "  Downloading sympy-1.11.1-py3-none-any.whl (6.5 MB)\n",
      "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m6.5/6.5 MB\u001b[0m \u001b[31m1.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
      "\u001b[?25hCollecting triton==2.0.0\n",
      "  Downloading triton-2.0.0-1-cp38-cp38-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (63.2 MB)\n",
      "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m63.2/63.2 MB\u001b[0m \u001b[31m2.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
      "\u001b[?25hRequirement already satisfied: filelock in /home/mikolaj/.local/lib/python3.8/site-packages (from torch==2.0.0->torchtext) (3.6.0)\n",
      "Collecting nvidia-nvtx-cu11==11.7.91\n",
      "  Downloading nvidia_nvtx_cu11-11.7.91-py3-none-manylinux1_x86_64.whl (98 kB)\n",
      "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m98.6/98.6 kB\u001b[0m \u001b[31m3.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
      "\u001b[?25hRequirement already satisfied: urllib3>=1.25 in /home/mikolaj/.local/lib/python3.8/site-packages (from torchdata==0.6.0->torchtext) (1.26.9)\n",
      "Requirement already satisfied: setuptools in /home/mikolaj/.local/lib/python3.8/site-packages (from nvidia-cublas-cu11==11.10.3.66->torch==2.0.0->torchtext) (63.4.2)\n",
      "Requirement already satisfied: wheel in /home/mikolaj/.local/lib/python3.8/site-packages (from nvidia-cublas-cu11==11.10.3.66->torch==2.0.0->torchtext) (0.37.1)\n",
      "Collecting lit\n",
      "  Downloading lit-16.0.1.tar.gz (137 kB)\n",
      "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m137.9/137.9 kB\u001b[0m \u001b[31m1.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0ma \u001b[36m0:00:01\u001b[0m\n",
      "\u001b[?25h Preparing metadata (setup.py) ... \u001b[?25ldone\n",
      "\u001b[?25hCollecting cmake\n",
      "  Using cached cmake-3.26.3-py2.py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (24.0 MB)\n",
      "Requirement already satisfied: charset-normalizer~=2.0.0 in /home/mikolaj/.local/lib/python3.8/site-packages (from requests->torchtext) (2.0.10)\n",
      "Requirement already satisfied: idna<4,>=2.5 in /home/mikolaj/.local/lib/python3.8/site-packages (from requests->torchtext) (2.10)\n",
      "Requirement already satisfied: certifi>=2017.4.17 in /home/mikolaj/.local/lib/python3.8/site-packages (from requests->torchtext) (2022.12.7)\n",
      "Requirement already satisfied: MarkupSafe>=2.0 in /home/mikolaj/.local/lib/python3.8/site-packages (from jinja2->torch==2.0.0->torchtext) (2.0.1)\n",
      "Collecting mpmath>=0.19\n",
      "  Downloading mpmath-1.3.0-py3-none-any.whl (536 kB)\n",
      "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m536.2/536.2 kB\u001b[0m \u001b[31m768.5 kB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0ma \u001b[36m0:00:01\u001b[0m\n",
      "\u001b[?25hBuilding wheels for collected packages: lit\n",
      "  Building wheel for lit (setup.py) ... \u001b[?25ldone\n",
      "\u001b[?25h Created wheel for lit: filename=lit-16.0.1-py3-none-any.whl size=88173 sha256=fca0dda7f2dc27a2885356559af2c2b6bc26994156ad1efae9f15f63d3866468\n",
      "  Stored in directory: /home/mikolaj/.cache/pip/wheels/12/14/ba/87be46a564f97692e6cd1f6d7a1deeb5bff2821d45a52e8d7a\n",
      "Successfully built lit\n",
      "Installing collected packages: mpmath, lit, cmake, sympy, nvidia-nvtx-cu11, nvidia-nccl-cu11, nvidia-cusparse-cu11, nvidia-curand-cu11, nvidia-cufft-cu11, nvidia-cuda-runtime-cu11, nvidia-cuda-nvrtc-cu11, nvidia-cuda-cupti-cu11, nvidia-cublas-cu11, nvidia-cusolver-cu11, nvidia-cudnn-cu11, triton, torch, torchdata, torchtext\n",
      "  Attempting uninstall: torch\n",
      "    Found existing installation: torch 1.10.0\n",
      "    Uninstalling torch-1.10.0:\n",
      "      Successfully uninstalled torch-1.10.0\n",
      "\u001b[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.\n",
      "laserembeddings 1.1.2 requires sacremoses==0.0.35, which is not installed.\n",
      "unbabel-comet 1.1.0 requires numpy>=1.20.0, but you have numpy 1.19.3 which is incompatible.\n",
      "unbabel-comet 1.1.0 requires scipy>=1.5.4, but you have scipy 1.4.1 which is incompatible.\n",
      "unbabel-comet 1.1.0 requires torch<=1.10.0,>=1.6.0, but you have torch 2.0.0 which is incompatible.\n",
      "torchvision 0.11.1 requires torch==1.10.0, but you have torch 2.0.0 which is incompatible.\n",
      "torchaudio 0.10.0 requires torch==1.10.0, but you have torch 2.0.0 which is incompatible.\n",
      "laserembeddings 1.1.2 requires torch<2.0.0,>=1.0.1.post2, but you have torch 2.0.0 which is incompatible.\u001b[0m\u001b[31m\n",
      "\u001b[0mSuccessfully installed cmake-3.26.3 lit-16.0.1 mpmath-1.3.0 nvidia-cublas-cu11-11.10.3.66 nvidia-cuda-cupti-cu11-11.7.101 nvidia-cuda-nvrtc-cu11-11.7.99 nvidia-cuda-runtime-cu11-11.7.99 nvidia-cudnn-cu11-8.5.0.96 nvidia-cufft-cu11-10.9.0.58 nvidia-curand-cu11-10.2.10.91 nvidia-cusolver-cu11-11.4.0.1 nvidia-cusparse-cu11-11.7.4.91 nvidia-nccl-cu11-2.14.3 nvidia-nvtx-cu11-11.7.91 sympy-1.11.1 torch-2.0.0 torchdata-0.6.0 torchtext-0.15.1 triton-2.0.0\n",
      "\n",
      "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip available: \u001b[0m\u001b[31;49m22.2.2\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m23.0.1\u001b[0m\n",
      "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n"
     ]
    }
   ],
   "source": [
    "!pip3 install torchtext"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['<unk>', '\\\\', 'the', '-\\\\', 'wno']"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "vocab.lookup_tokens([0, 1, 2, 10, 12345])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Network definition\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will implement our simple neural network using the PyTorch framework.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "def look_ahead_iterator(gen):\n",
    "    # Turn a stream of items into a stream of consecutive pairs (bigrams).\n",
    "    prev = None\n",
    "    for item in gen:\n",
    "        if prev is not None:\n",
    "            yield (prev, item)\n",
    "        prev = item"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(1, 2), (2, 3), (3, 4), (4, 5), (5, 'X'), ('X', 6)]"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "list(look_ahead_iterator([1, 2, 3, 4, 5, 'X', 6]))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [],
   "source": [
    "import itertools"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [],
   "source": [
    "import torch\n",
    "from torch import nn\n",
    "\n",
    "embed_size = 100\n",
    "\n",
    "class SimpleBigramNeuralLanguageModel(nn.Module):\n",
    "    def __init__(self, vocabulary_size, embedding_size):\n",
    "        super(SimpleBigramNeuralLanguageModel, self).__init__()\n",
    "        self.model = nn.Sequential(\n",
    "            # embed the previous word, then project back onto the vocabulary\n",
    "            nn.Embedding(vocabulary_size, embedding_size),\n",
    "            nn.Linear(embedding_size, vocabulary_size),\n",
    "            nn.Softmax()\n",
    "        )\n",
    "\n",
    "    def forward(self, x):\n",
    "        return self.model(x)\n",
    "\n",
    "model = SimpleBigramNeuralLanguageModel(vocab_size, embed_size)\n",
    "\n",
    "vocab.set_default_index(vocab['<unk>'])\n",
    "ixs = torch.tensor(vocab.forward(['pies']))\n",
    "out = model(ixs)\n",
    "# out[0][vocab['jest']]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let us train the model. First, let us simply shuffle our file:\n",
    "\n",
    "    shuf < opensubtitlesA.pl.txt > opensubtitlesA.pl.shuf.txt\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [],
   "source": [
    "import itertools\n",
    "\n",
    "from torch.utils.data import IterableDataset\n",
    "\n",
    "\n",
    "def look_ahead_iterator(gen):\n",
    "    prev = None\n",
    "    for item in gen:\n",
    "        if prev is not None:\n",
    "            yield (prev, item)\n",
    "        prev = item\n",
    "\n",
    "\n",
    "class Bigrams(IterableDataset):\n",
    "    def __init__(self, text_file, vocabulary_size):\n",
    "        self.vocab = build_vocab_from_iterator(\n",
    "            get_word_lines_from_file(text_file),\n",
    "            max_tokens=vocabulary_size,\n",
    "            specials=['<unk>'])\n",
    "        self.vocab.set_default_index(self.vocab['<unk>'])\n",
    "        self.vocabulary_size = vocabulary_size\n",
    "        self.text_file = text_file\n",
    "\n",
    "    def __iter__(self):\n",
    "        # Stream (previous word, current word) index pairs over the whole corpus.\n",
    "        return look_ahead_iterator(\n",
    "            (self.vocab[t] for t in itertools.chain.from_iterable(\n",
    "                get_word_lines_from_file(self.text_file))))\n",
    "\n",
    "\n",
    "train_dataset = Bigrams('train/in.tsv.xz', vocab_size)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "ename": "TypeError",
     "evalue": "'tuple' object is not an iterator",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)",
      "\u001b[0;32m/tmp/ipykernel_12664/602008184.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0mtorch\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mutils\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdata\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mDataLoader\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 3\u001b[0;31m \u001b[0mnext\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mnext\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0miter\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtrain_dataset\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
      "\u001b[0;31mTypeError\u001b[0m: 'tuple' object is not an iterator"
     ]
    }
   ],
   "source": [
    "from torch.utils.data import DataLoader\n",
    "\n",
    "next(iter(train_dataset))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['<s>', '<unk>']"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "vocab.lookup_tokens([43, 0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[tensor([ 2, 5, 51, 3481, 231]), tensor([ 5, 51, 3481, 231, 4])]"
     ]
    }
   ],
   "source": [
    "from torch.utils.data import DataLoader\n",
    "\n",
    "next(iter(DataLoader(train_dataset, batch_size=5)))"
   ]
  },
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 35,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stderr",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"/home/mikolaj/.local/lib/python3.8/site-packages/torch/nn/modules/container.py:217: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.\n",
|
||||
" input = module(input)\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"0 tensor(10.0877, grad_fn=<NllLossBackward0>)\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"name": "stderr",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"/home/mikolaj/.local/lib/python3.8/site-packages/torch/autograd/__init__.py:200: UserWarning: CUDA initialization: The NVIDIA driver on your system is too old (found version 10020). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:109.)\n",
|
||||
" Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"100 tensor(8.4388, grad_fn=<NllLossBackward0>)\n",
|
||||
"200 tensor(7.7335, grad_fn=<NllLossBackward0>)\n",
|
||||
"300 tensor(7.1300, grad_fn=<NllLossBackward0>)\n",
|
||||
"400 tensor(6.7325, grad_fn=<NllLossBackward0>)\n",
|
||||
"500 tensor(6.4705, grad_fn=<NllLossBackward0>)\n",
|
||||
"600 tensor(6.0460, grad_fn=<NllLossBackward0>)\n",
|
||||
"700 tensor(5.8104, grad_fn=<NllLossBackward0>)\n",
|
||||
"800 tensor(5.8110, grad_fn=<NllLossBackward0>)\n",
|
||||
"900 tensor(5.7169, grad_fn=<NllLossBackward0>)\n",
|
||||
"1000 tensor(5.7580, grad_fn=<NllLossBackward0>)\n",
|
||||
"1100 tensor(5.6787, grad_fn=<NllLossBackward0>)\n",
|
||||
"1200 tensor(5.4501, grad_fn=<NllLossBackward0>)\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"ename": "KeyboardInterrupt",
|
||||
"evalue": "",
|
||||
"output_type": "error",
|
||||
"traceback": [
|
||||
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
|
||||
"\u001b[0;31mKeyboardInterrupt\u001b[0m Traceback (most recent call last)",
|
||||
"\u001b[0;32m/tmp/ipykernel_12664/1293343661.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m 11\u001b[0m \u001b[0my\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0my\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mto\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdevice\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 12\u001b[0m \u001b[0moptimizer\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mzero_grad\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 13\u001b[0;31m \u001b[0mypredicted\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mmodel\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mx\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 14\u001b[0m \u001b[0mloss\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mcriterion\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtorch\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mlog\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mypredicted\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0my\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 15\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mstep\u001b[0m \u001b[0;34m%\u001b[0m \u001b[0;36m100\u001b[0m \u001b[0;34m==\u001b[0m \u001b[0;36m0\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
|
||||
"\u001b[0;32m~/.local/lib/python3.8/site-packages/torch/nn/modules/module.py\u001b[0m in \u001b[0;36m_call_impl\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m 1499\u001b[0m \u001b[0;32mor\u001b[0m \u001b[0m_global_backward_pre_hooks\u001b[0m \u001b[0;32mor\u001b[0m \u001b[0m_global_backward_hooks\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1500\u001b[0m or _global_forward_hooks or _global_forward_pre_hooks):\n\u001b[0;32m-> 1501\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mforward_call\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1502\u001b[0m \u001b[0;31m# Do not call functions when jit is used\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1503\u001b[0m \u001b[0mfull_backward_hooks\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnon_full_backward_hooks\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
|
||||
"\u001b[0;32m/tmp/ipykernel_12664/517511851.py\u001b[0m in \u001b[0;36mforward\u001b[0;34m(self, x)\u001b[0m\n\u001b[1;32m 14\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 15\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mforward\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mx\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 16\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmodel\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mx\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 17\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 18\u001b[0m \u001b[0mmodel\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mSimpleBigramNeuralLanguageModel\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mvocab_size\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0membed_size\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
|
||||
"\u001b[0;32m~/.local/lib/python3.8/site-packages/torch/nn/modules/module.py\u001b[0m in \u001b[0;36m_call_impl\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m 1499\u001b[0m \u001b[0;32mor\u001b[0m \u001b[0m_global_backward_pre_hooks\u001b[0m \u001b[0;32mor\u001b[0m \u001b[0m_global_backward_hooks\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1500\u001b[0m or _global_forward_hooks or _global_forward_pre_hooks):\n\u001b[0;32m-> 1501\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mforward_call\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1502\u001b[0m \u001b[0;31m# Do not call functions when jit is used\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1503\u001b[0m \u001b[0mfull_backward_hooks\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnon_full_backward_hooks\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m~/.local/lib/python3.8/site-packages/torch/nn/modules/container.py\u001b[0m in \u001b[0;36mforward\u001b[0;34m(self, input)\u001b[0m\n\u001b[1;32m 215\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mforward\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0minput\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 216\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mmodule\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 217\u001b[0;31m \u001b[0minput\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mmodule\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0minput\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 218\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0minput\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 219\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m~/.local/lib/python3.8/site-packages/torch/nn/modules/module.py\u001b[0m in \u001b[0;36m_call_impl\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m 1499\u001b[0m \u001b[0;32mor\u001b[0m \u001b[0m_global_backward_pre_hooks\u001b[0m \u001b[0;32mor\u001b[0m \u001b[0m_global_backward_hooks\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1500\u001b[0m or _global_forward_hooks or _global_forward_pre_hooks):\n\u001b[0;32m-> 1501\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mforward_call\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1502\u001b[0m \u001b[0;31m# Do not call functions when jit is used\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1503\u001b[0m \u001b[0mfull_backward_hooks\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnon_full_backward_hooks\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m~/.local/lib/python3.8/site-packages/torch/nn/modules/activation.py\u001b[0m in \u001b[0;36mforward\u001b[0;34m(self, input)\u001b[0m\n\u001b[1;32m 1457\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1458\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mforward\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0minput\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mTensor\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;34m->\u001b[0m \u001b[0mTensor\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1459\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mF\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msoftmax\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0minput\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdim\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0m_stacklevel\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m5\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1460\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1461\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mextra_repr\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;34m->\u001b[0m \u001b[0mstr\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m~/.local/lib/python3.8/site-packages/torch/nn/functional.py\u001b[0m in \u001b[0;36msoftmax\u001b[0;34m(input, dim, _stacklevel, dtype)\u001b[0m\n\u001b[1;32m 1841\u001b[0m \u001b[0mdim\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0m_get_softmax_dim\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"softmax\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0minput\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdim\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0m_stacklevel\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1842\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mdtype\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1843\u001b[0;31m \u001b[0mret\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0minput\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msoftmax\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdim\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1844\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1845\u001b[0m \u001b[0mret\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0minput\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msoftmax\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdim\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mdtype\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;31mKeyboardInterrupt\u001b[0m: "
]
}
],
"source": [
"device = 'cpu'\n",
"model = SimpleBigramNeuralLanguageModel(vocab_size, embed_size).to(device)\n",
"data = DataLoader(train_dataset, batch_size=5000)\n",
"optimizer = torch.optim.Adam(model.parameters())\n",
"criterion = torch.nn.NLLLoss()\n",
"\n",
"model.train()\n",
"step = 0\n",
"for x, y in data:\n",
"    x = x.to(device)\n",
"    y = y.to(device)\n",
"    optimizer.zero_grad()\n",
"    ypredicted = model(x)\n",
"    loss = criterion(torch.log(ypredicted), y)\n",
"    if step % 100 == 0:\n",
"        print(step, loss)\n",
"    step += 1\n",
"    loss.backward()\n",
"    optimizer.step()\n",
"\n",
"torch.save(model.state_dict(), 'model1.bin')"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [],
"source": [
"torch.save(model.state_dict(), 'model1.bin')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's compute the most probable continuations for a given word:\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[(',', 3, 0.12514805793762207),\n",
" ('\\\\', 1, 0.07237359136343002),\n",
" ('<unk>', 0, 0.06839419901371002),\n",
" ('.', 4, 0.06109621003270149),\n",
" ('of', 5, 0.04557998105883598),\n",
" ('and', 6, 0.03565318509936333),\n",
" ('the', 2, 0.029342489317059517),\n",
" ('to', 7, 0.02185475267469883),\n",
" ('-\\\\', 10, 0.018097609281539917),\n",
" ('in', 9, 0.016023961827158928)]"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"device = 'cpu'\n",
"model = SimpleBigramNeuralLanguageModel(vocab_size, embed_size).to(device)\n",
"model.load_state_dict(torch.load('model1.bin'))\n",
"model.eval()\n",
"\n",
"ixs = torch.tensor(vocab.forward(['dla'])).to(device)\n",
"\n",
"out = model(ixs)\n",
"top = torch.topk(out[0], 10)\n",
"top_indices = top.indices.tolist()\n",
"top_probs = top.values.tolist()\n",
"top_words = vocab.lookup_tokens(top_indices)\n",
"list(zip(top_words, top_indices, top_probs))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's examine the most similar embeddings for a given word:\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[(',', 3, 0.12514805793762207),\n",
" ('\\\\', 1, 0.07237359136343002),\n",
" ('<unk>', 0, 0.06839419901371002),\n",
" ('.', 4, 0.06109621003270149),\n",
" ('of', 5, 0.04557998105883598),\n",
" ('and', 6, 0.03565318509936333),\n",
" ('the', 2, 0.029342489317059517),\n",
" ('to', 7, 0.02185475267469883),\n",
" ('-\\\\', 10, 0.018097609281539917),\n",
" ('in', 9, 0.016023961827158928)]"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"vocab = train_dataset.vocab\n",
"ixs = torch.tensor(vocab.forward(['kłopot'])).to(device)\n",
"\n",
"out = model(ixs)\n",
"top = torch.topk(out[0], 10)\n",
"top_indices = top.indices.tolist()\n",
"top_probs = top.values.tolist()\n",
"top_words = vocab.lookup_tokens(top_indices)\n",
"list(zip(top_words, top_indices, top_probs))"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('<unk>', 0, 1.0000001192092896),\n",
" ('nb', 1958, 0.4407886266708374),\n",
" ('refrain', 14092, 0.4395471513271332),\n",
" ('cat', 3391, 0.4154242277145386),\n",
" ('enjoying', 7521, 0.3915165066719055),\n",
" ('active', 1383, 0.38935279846191406),\n",
" ('stewart', 4816, 0.3806381821632385),\n",
" ('omit', 15600, 0.380504310131073),\n",
" ('2041095573313', 11912, 0.37909239530563354),\n",
" ('shut', 3863, 0.3778260052204132)]"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cos = nn.CosineSimilarity(dim=1, eps=1e-6)\n",
"\n",
"embeddings = model.model[0].weight\n",
"\n",
"vec = embeddings[vocab['poszedł']]\n",
"\n",
"similarities = cos(vec, embeddings)\n",
"\n",
"top = torch.topk(similarities, 10)\n",
"\n",
"top_indices = top.indices.tolist()\n",
"top_probs = top.values.tolist()\n",
"top_words = vocab.lookup_tokens(top_indices)\n",
"list(zip(top_words, top_indices, top_probs))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### The model as a mathematical formula\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The network implemented above can be described by the following formula:\n",
"\n",
"$$\\vec{y} = \\operatorname{softmax}(CE(w_{i-1})),$$\n",
"\n",
"where:\n",
"\n",
"- $w_{i-1}$ is the first word of the bigram (the preceding word),\n",
"- $E(w)$ is the embedding of the word $w$, a vector of size $m$,\n",
"- $C$ is a matrix of size $|V| \\times m$, which projects the embedding vector into a vector of the size of the vocabulary,\n",
"- $\\vec{y}$ is the output probability vector of size $|V|$.\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Hyperparameters\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that our model has two hyperparameters:\n",
"\n",
"- $m$, the embedding size,\n",
"- $|V|$, the vocabulary size, if we assume that we can control it\n",
"  (e.g. by truncating the vocabulary to a given number of the most\n",
"  frequent words and replacing the remaining ones with a special token, say, `<UNK>`).\n",
"\n",
"Of course, we can try manipulating the values of $m$ and $|V|$ to\n",
"improve the performance of our model.\n",
"\n",
"**Question**: why does $m \\approx |V|$ make no sense? Why does $m = 1$ make no sense?\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Network diagram\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since multiplying by a matrix ($C$) is simply an application of a\n",
"linear layer, our network can be interpreted as a single-layer\n",
"neural network, as illustrated by the following diagram:\n",
"\n",
"![img](./09_Zanurzenia_slow/bigram1.drawio.png \"Diagram of a simple bigram neural language model\")\n",
"\n"
]
},
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"#### Zanurzenie jako mnożenie przez macierz\n",
|
||||
"\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Uzyskanie zanurzenia ($E(w)$) zazwyczaj realizowane jest na zasadzie\n",
|
||||
"odpytania (<sub>look</sub>-up\\_). Co ciekawe, zanurzenie można intepretować jako\n",
|
||||
"mnożenie przez macierz zanurzeń (embeddingów) $E$ o rozmiarze $m \\times |V|$ — jeśli słowo będziemy na wejściu kodowali przy użyciu\n",
|
||||
"wektora z gorącą jedynką (<sub>one</sub>-hot encoding\\_), tzn. słowo $w$ zostanie\n",
|
||||
"podane na wejściu jako wektor $\\vec{1_V}(w) = [0,\\ldots,0,1,0\\ldots,0]$ o rozmiarze $|V|$\n",
|
||||
"złożony z samych zer z wyjątkiem jedynki na pozycji odpowiadającej indeksowi wyrazu $w$ w słowniku $V$.\n",
|
||||
"\n",
|
||||
"Wówczas wzór przyjmie postać:\n",
|
||||
"\n",
|
||||
"$$\\vec{y} = \\operatorname{softmax}(CE\\vec{1_V}(w_{i-1})),$$\n",
|
||||
"\n",
|
||||
"gdzie $E$ będzie tym razem macierzą $m \\times |V|$.\n",
|
||||
"\n",
|
||||
"**Pytanie**: czy $\\vec{1_V}(w)$ intepretujemy jako wektor wierszowy czy kolumnowy?\n",
|
||||
"\n",
|
||||
"W postaci diagramu można tę interpretację zilustrować w następujący sposób:\n",
|
||||
"\n",
|
||||
"![img](./09_Zanurzenia_slow/bigram2.drawio.png \"Diagram prostego bigramowego neuronowego modelu języka z wejściem w postaci one-hot\")\n",
|
||||
"\n"
|
||||
]
|
||||
}
|
||||
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.8.12 64-bit",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.12"
},
"org": null,
"vscode": {
"interpreter": {
"hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6"
}
}
},
"nbformat": 4,
"nbformat_minor": 1
}
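The formula $\vec{y} = \operatorname{softmax}(CE(w_{i-1}))$ described in the notebook above can be illustrated with a tiny pure-Python sketch. The sizes ($|V|=3$, $m=2$) and weights below are made up for illustration; the real model learns them with `torch.nn.Embedding` and a linear layer.

```python
import math

# Toy bigram model y = softmax(C @ E(w_prev)), pure Python, made-up weights.
E = [[0.2, 0.1], [0.4, -0.3], [0.0, 0.5]]   # embeddings, one row of size m per word
C = [[0.1, 0.2], [-0.1, 0.4], [0.3, 0.0]]   # projection matrix, |V| x m

def softmax(z):
    exps = [math.exp(v) for v in z]
    s = sum(exps)
    return [v / s for v in exps]

def bigram_probs(word_idx):
    e = E[word_idx]                                              # E(w_{i-1}): a lookup
    logits = [sum(c * x for c, x in zip(row, e)) for row in C]   # C e
    return softmax(logits)

y = bigram_probs(1)   # probability distribution over the 3-word toy vocabulary
```

The output is a valid probability distribution over the vocabulary, which is exactly what `NLLLoss` on `torch.log(ypredicted)` expects in the training loop above.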
@ -0,0 +1,9 @@
Challenging America word-gap prediction
===================================

Guess a word in a gap.

Evaluation metric
-----------------

LikelihoodHashed is the metric
@ -0,0 +1 @@
--metric PerplexityHashed --precision 2 --in-header in-header.tsv --out-header out-header.tsv
@ -0,0 +1,35 @@
from torchtext.vocab import build_vocab_from_iterator
import regex as re
import lzma
import pickle


def get_words_from_line(line):
    """Tokenize a line into lowercased words and punctuation, with sentence markers."""
    line = line.rstrip()
    yield '<s>'
    for m in re.finditer(r'[\p{L}0-9\*]+|\p{P}+', line):
        yield m.group(0).lower()
    yield '</s>'


def get_word_lines_from_file(file_name):
    """Yield a token iterator for every line of an xz-compressed file."""
    with lzma.open(file_name, 'r') as fh:
        for line in fh:
            line = line.decode("utf-8")
            yield get_words_from_line(line)


vocab_size = 20000

vocab = build_vocab_from_iterator(
    get_word_lines_from_file('train/in.tsv.xz'),
    max_tokens=vocab_size,
    specials=['<unk>'])

with open("vocab.pickle", 'wb') as handle:
    pickle.dump(vocab, handle)

File diff suppressed because it is too large
File diff suppressed because it is too large
Binary file not shown.
File diff suppressed because one or more lines are too long
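The tokenizer above relies on the third-party `regex` package for the Unicode properties `\p{L}` (letters) and `\p{P}` (punctuation). A rough stdlib-only approximation can be sketched as follows, assuming `\w` is an acceptable stand-in for letters and digits (it additionally matches underscores):

```python
import re

# Stdlib sketch of the tokenizer: words/digits via \w, punctuation as runs of
# non-word, non-space characters; <s> and </s> mark sentence boundaries.
def get_words_from_line(line):
    line = line.rstrip()
    yield '<s>'
    for m in re.finditer(r'[\w*]+|[^\w\s]+', line):
        yield m.group(0).lower()
    yield '</s>'

tokens = list(get_words_from_line('The cat sat.'))
# tokens == ['<s>', 'the', 'cat', 'sat', '.', '</s>']
```

Each line's token stream is wrapped in sentence markers, matching the iterators fed to `build_vocab_from_iterator` above.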
@ -0,0 +1 @@
FileId Year LeftContext RightContext
|
||||
</div>
|
||||
<div class="item">
|
||||
<a class="ui" href="/s444463/challenging-america-word-gap-prediction/branches"><svg viewBox="0 0 16 16" class="svg octicon-git-branch" width="16" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M11.75 2.5a.75.75 0 1 0 0 1.5.75.75 0 0 0 0-1.5zm-2.25.75a2.25 2.25 0 1 1 3 2.122V6A2.5 2.5 0 0 1 10 8.5H6a1 1 0 0 0-1 1v1.128a2.251 2.251 0 1 1-1.5 0V5.372a2.25 2.25 0 1 1 1.5 0v1.836A2.492 2.492 0 0 1 6 7h4a1 1 0 0 0 1-1v-.628A2.25 2.25 0 0 1 9.5 3.25zM4.25 12a.75.75 0 1 0 0 1.5.75.75 0 0 0 0-1.5zM3.5 3.25a.75.75 0 1 1 1.5 0 .75.75 0 0 1-1.5 0z"/></svg> <b>2</b> Branches</a>
|
||||
</div>
|
||||
|
||||
<div class="item">
|
||||
<a class="ui" href="/s444463/challenging-america-word-gap-prediction/tags"><svg viewBox="0 0 16 16" class="svg octicon-tag" width="16" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M2.5 7.775V2.75a.25.25 0 0 1 .25-.25h5.025a.25.25 0 0 1 .177.073l6.25 6.25a.25.25 0 0 1 0 .354l-5.025 5.025a.25.25 0 0 1-.354 0l-6.25-6.25a.25.25 0 0 1-.073-.177zm-1.5 0V2.75C1 1.784 1.784 1 2.75 1h5.025c.464 0 .91.184 1.238.513l6.25 6.25a1.75 1.75 0 0 1 0 2.474l-5.026 5.026a1.75 1.75 0 0 1-2.474 0l-6.25-6.25A1.75 1.75 0 0 1 1 7.775zM6 5a1 1 0 1 0 0 2 1 1 0 0 0 0-2z"/></svg> <b>0</b> Tags</a>
|
||||
</div>
|
||||
|
||||
<div class="item">
|
||||
<span class="ui"><svg viewBox="0 0 16 16" class="svg octicon-database" width="16" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M2.5 3.5c0-.133.058-.318.282-.55.227-.237.592-.484 1.1-.708C4.899 1.795 6.354 1.5 8 1.5c1.647 0 3.102.295 4.117.742.51.224.874.47 1.101.707.224.233.282.418.282.551 0 .133-.058.318-.282.55-.227.237-.592.484-1.1.708C11.101 5.205 9.646 5.5 8 5.5c-1.647 0-3.102-.295-4.117-.742-.51-.224-.874-.47-1.101-.707-.224-.233-.282-.418-.282-.551zM1 3.5c0-.626.292-1.165.7-1.59.406-.422.956-.767 1.579-1.041C4.525.32 6.195 0 8 0c1.805 0 3.475.32 4.722.869.622.274 1.172.62 1.578 1.04.408.426.7.965.7 1.591v9c0 .626-.292 1.165-.7 1.59-.406.422-.956.767-1.579 1.041C11.476 15.68 9.806 16 8 16c-1.805 0-3.475-.32-4.721-.869-.623-.274-1.173-.62-1.579-1.04-.408-.426-.7-.965-.7-1.591v-9zM2.5 8V5.724c.241.15.503.286.779.407C4.525 6.68 6.195 7 8 7c1.805 0 3.475-.32 4.722-.869.275-.121.537-.257.778-.407V8c0 .133-.058.318-.282.55-.227.237-.592.484-1.1.708C11.101 9.705 9.646 10 8 10c-1.647 0-3.102-.295-4.117-.742-.51-.224-.874-.47-1.101-.707C2.558 8.318 2.5 8.133 2.5 8zm0 2.225V12.5c0 .133.058.318.282.55.227.237.592.484 1.1.708 1.016.447 2.471.742 4.118.742 1.647 0 3.102-.295 4.117-.742.51-.224.874-.47 1.101-.707.224-.233.282-.418.282-.551v-2.275c-.241.15-.503.285-.778.406-1.247.549-2.917.869-4.722.869-1.805 0-3.475-.32-4.721-.869a6.236 6.236 0 0 1-.779-.406z"/></svg> <b>284 MiB</b></span>
|
||||
</div>
|
||||
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<div class="ui segment sub-menu language-stats-details" style="display: none">
|
||||
<div class="ui horizontal center link list">
|
||||
|
||||
<div class="item df ac jc">
|
||||
<i class="color-icon mr-3" style="background-color: #3572A5"></i>
|
||||
<span class="bold mr-3">
|
||||
|
||||
Python
|
||||
|
||||
</span>
|
||||
100%
|
||||
</div>
|
||||
|
||||
</div>
|
||||
</div>
|
||||
<a class="ui segment language-stats">
|
||||
|
||||
<div class="bar" style="width: 100%; background-color: #3572A5"> </div>
|
||||
|
||||
</a>
|
||||
|
||||
</div>
|
||||
|
||||
<div class="ui stackable secondary menu mobile--margin-between-items mobile--no-negative-margins">
|
||||
|
||||
|
||||
<div class="fitted item choose reference mr-1">
|
||||
<div class="ui floating filter dropdown custom" data-can-create-branch="false" data-no-results="No results found.">
|
||||
<div class="ui basic small compact button" @click="menuVisible = !menuVisible" @keyup.enter="menuVisible = !menuVisible">
|
||||
<span class="text">
|
||||
|
||||
<svg viewBox="0 0 16 16" class="svg octicon-git-branch" width="16" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M11.75 2.5a.75.75 0 1 0 0 1.5.75.75 0 0 0 0-1.5zm-2.25.75a2.25 2.25 0 1 1 3 2.122V6A2.5 2.5 0 0 1 10 8.5H6a1 1 0 0 0-1 1v1.128a2.251 2.251 0 1 1-1.5 0V5.372a2.25 2.25 0 1 1 1.5 0v1.836A2.492 2.492 0 0 1 6 7h4a1 1 0 0 0 1-1v-.628A2.25 2.25 0 0 1 9.5 3.25zM4.25 12a.75.75 0 1 0 0 1.5.75.75 0 0 0 0-1.5zM3.5 3.25a.75.75 0 1 1 1.5 0 .75.75 0 0 1-1.5 0z"/></svg>
|
||||
Branch:
|
||||
<strong>4gram</strong>
|
||||
|
||||
</span>
|
||||
<svg viewBox="0 0 16 16" class="dropdown icon svg octicon-triangle-down" width="14" height="14" aria-hidden="true"><path d="m4.427 7.427 3.396 3.396a.25.25 0 0 0 .354 0l3.396-3.396A.25.25 0 0 0 11.396 7H4.604a.25.25 0 0 0-.177.427z"/></svg>
|
||||
</div>
|
||||
<div class="data" style="display: none" data-mode="branches">
|
||||
|
||||
|
||||
<div class="item branch selected" data-url="/s444463/challenging-america-word-gap-prediction/src/branch/4gram/in-header.tsv">4gram</div>
|
||||
|
||||
<div class="item branch " data-url="/s444463/challenging-america-word-gap-prediction/src/branch/master/in-header.tsv">master</div>
|
||||
|
||||
|
||||
|
||||
</div>
|
||||
<div class="menu transition" :class="{visible: menuVisible}" v-if="menuVisible" v-cloak>
|
||||
<div class="ui icon search input">
|
||||
<i class="icon df ac jc m-0"><svg viewBox="0 0 16 16" class="svg octicon-filter" width="16" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M.75 3a.75.75 0 0 0 0 1.5h14.5a.75.75 0 0 0 0-1.5H.75zM3 7.75A.75.75 0 0 1 3.75 7h8.5a.75.75 0 0 1 0 1.5h-8.5A.75.75 0 0 1 3 7.75zm3 4a.75.75 0 0 1 .75-.75h2.5a.75.75 0 0 1 0 1.5h-2.5a.75.75 0 0 1-.75-.75z"/></svg></i>
|
||||
<input name="search" ref="searchField" autocomplete="off" v-model="searchTerm" @keydown="keydown($event)" placeholder="Filter branch or tag...">
|
||||
</div>
|
||||
|
||||
<div class="header branch-tag-choice">
|
||||
<div class="ui grid">
|
||||
<div class="two column row">
|
||||
<a class="reference column" href="#" @click="createTag = false; mode = 'branches'; focusSearchField()">
|
||||
<span class="text" :class="{black: mode == 'branches'}">
|
||||
<svg viewBox="0 0 16 16" class="mr-2 svg octicon-git-branch" width="16" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M11.75 2.5a.75.75 0 1 0 0 1.5.75.75 0 0 0 0-1.5zm-2.25.75a2.25 2.25 0 1 1 3 2.122V6A2.5 2.5 0 0 1 10 8.5H6a1 1 0 0 0-1 1v1.128a2.251 2.251 0 1 1-1.5 0V5.372a2.25 2.25 0 1 1 1.5 0v1.836A2.492 2.492 0 0 1 6 7h4a1 1 0 0 0 1-1v-.628A2.25 2.25 0 0 1 9.5 3.25zM4.25 12a.75.75 0 1 0 0 1.5.75.75 0 0 0 0-1.5zM3.5 3.25a.75.75 0 1 1 1.5 0 .75.75 0 0 1-1.5 0z"/></svg>Branches
|
||||
</span>
|
||||
</a>
|
||||
<a class="reference column" href="#" @click="createTag = true; mode = 'tags'; focusSearchField()">
|
||||
<span class="text" :class="{black: mode == 'tags'}">
|
||||
<svg viewBox="0 0 16 16" class="mr-2 svg octicon-tag" width="16" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M2.5 7.775V2.75a.25.25 0 0 1 .25-.25h5.025a.25.25 0 0 1 .177.073l6.25 6.25a.25.25 0 0 1 0 .354l-5.025 5.025a.25.25 0 0 1-.354 0l-6.25-6.25a.25.25 0 0 1-.073-.177zm-1.5 0V2.75C1 1.784 1.784 1 2.75 1h5.025c.464 0 .91.184 1.238.513l6.25 6.25a1.75 1.75 0 0 1 0 2.474l-5.026 5.026a1.75 1.75 0 0 1-2.474 0l-6.25-6.25A1.75 1.75 0 0 1 1 7.775zM6 5a1 1 0 1 0 0 2 1 1 0 0 0 0-2z"/></svg>Tags
|
||||
</span>
|
||||
</a>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<div class="scrolling menu" ref="scrollContainer">
|
||||
<div v-for="(item, index) in filteredItems" :key="item.name" class="item" :class="{selected: item.selected, active: active == index}" @click="selectItem(item)" :ref="'listItem' + index">${ item.name }</div>
|
||||
<div class="item" v-if="showCreateNewBranch" :class="{active: active == filteredItems.length}" :ref="'listItem' + filteredItems.length">
|
||||
<a href="#" @click="createNewBranch()">
|
||||
<div v-show="createTag">
|
||||
<i class="reference tags icon"></i>
|
||||
Create tag <strong>${ searchTerm }</strong>
|
||||
</div>
|
||||
<div v-show="!createTag">
|
||||
<svg viewBox="0 0 16 16" class="svg octicon-git-branch" width="16" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M11.75 2.5a.75.75 0 1 0 0 1.5.75.75 0 0 0 0-1.5zm-2.25.75a2.25 2.25 0 1 1 3 2.122V6A2.5 2.5 0 0 1 10 8.5H6a1 1 0 0 0-1 1v1.128a2.251 2.251 0 1 1-1.5 0V5.372a2.25 2.25 0 1 1 1.5 0v1.836A2.492 2.492 0 0 1 6 7h4a1 1 0 0 0 1-1v-.628A2.25 2.25 0 0 1 9.5 3.25zM4.25 12a.75.75 0 1 0 0 1.5.75.75 0 0 0 0-1.5zM3.5 3.25a.75.75 0 1 1 1.5 0 .75.75 0 0 1-1.5 0z"/></svg>
|
||||
Create branch <strong>${ searchTerm }</strong>
|
||||
</div>
|
||||
<div class="text small">
|
||||
|
||||
from '4gram'
|
||||
|
||||
</div>
|
||||
</a>
|
||||
<form ref="newBranchForm" action="/s444463/challenging-america-word-gap-prediction/branches/_new/branch/4gram" method="post">
|
||||
<input type="hidden" name="_csrf" value="p3_cvG3N1UMZPleED8GVFsIG_NE6MTY4MTMyODUyMDA1OTE2ODE0Nw">
|
||||
<input type="hidden" name="new_branch_name" v-model="searchTerm">
|
||||
<input type="hidden" name="create_tag" v-model="createTag">
|
||||
</form>
|
||||
</div>
|
||||
</div>
|
||||
<div class="message" v-if="showNoResults">${ noResults }</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<div class="fitted item"><span class="ui breadcrumb repo-path"><a class="section" href="/s444463/challenging-america-word-gap-prediction/src/branch/4gram" title="challenging-america-word-gap-prediction">challenging-america-word-ga...</a><span class="divider">/</span><span class="active section" title="in-header.tsv">in-header.tsv</span></span></div>
|
||||
|
||||
<div class="right fitted item mr-0" id="file-buttons">
|
||||
<div class="ui tiny primary buttons">
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
</div>
|
||||
|
||||
</div>
|
||||
<div class="fitted item">
|
||||
|
||||
</div>
|
||||
<div class="fitted item">
|
||||
|
||||
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<div class="tab-size-8 non-diff-file-content">
|
||||
<h4 class="file-header ui top attached header df ac sb">
|
||||
<div class="file-header-left df ac">
|
||||
|
||||
<div class="file-info text grey normal mono">
|
||||
|
||||
|
||||
|
||||
<div class="file-info-entry">
|
||||
37 B
|
||||
</div>
|
||||
|
||||
|
||||
</div>
|
||||
|
||||
</div>
|
||||
<div class="file-header-right file-actions df ac">
|
||||
|
||||
|
||||
<div class="ui buttons mr-2">
|
||||
<a class="ui mini basic button" href="/s444463/challenging-america-word-gap-prediction/raw/branch/4gram/in-header.tsv">Raw</a>
|
||||
|
||||
<a class="ui mini basic button" href="/s444463/challenging-america-word-gap-prediction/src/commit/65d889d6525d3949dd1ace045393124a7afb1f0e/in-header.tsv">Permalink</a>
|
||||
|
||||
|
||||
<a class="ui mini basic button" href="/s444463/challenging-america-word-gap-prediction/blame/branch/4gram/in-header.tsv">Blame</a>
|
||||
|
||||
<a class="ui mini basic button" href="/s444463/challenging-america-word-gap-prediction/commits/branch/4gram/in-header.tsv">History</a>
|
||||
|
||||
</div>
|
||||
<a download href="/s444463/challenging-america-word-gap-prediction/raw/branch/4gram/in-header.tsv"><span class="btn-octicon tooltip" data-content="Download file" data-position="bottom center"><svg viewBox="0 0 16 16" class="svg octicon-download" width="16" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M7.47 10.78a.75.75 0 0 0 1.06 0l3.75-3.75a.75.75 0 0 0-1.06-1.06L8.75 8.44V1.75a.75.75 0 0 0-1.5 0v6.69L4.78 5.97a.75.75 0 0 0-1.06 1.06l3.75 3.75zM3.75 13a.75.75 0 0 0 0 1.5h8.5a.75.75 0 0 0 0-1.5h-8.5z"/></svg></span></a>
|
||||
|
||||
|
||||
<span class="btn-octicon tooltip disabled" data-content="You must fork this repository to make or propose changes to this file." data-position="bottom center"><svg viewBox="0 0 16 16" class="svg octicon-pencil" width="16" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M11.013 1.427a1.75 1.75 0 0 1 2.474 0l1.086 1.086a1.75 1.75 0 0 1 0 2.474l-8.61 8.61c-.21.21-.47.364-.756.445l-3.251.93a.75.75 0 0 1-.927-.928l.929-3.25a1.75 1.75 0 0 1 .445-.758l8.61-8.61zm1.414 1.06a.25.25 0 0 0-.354 0L10.811 3.75l1.439 1.44 1.263-1.263a.25.25 0 0 0 0-.354l-1.086-1.086zM11.189 6.25 9.75 4.81l-6.286 6.287a.25.25 0 0 0-.064.108l-.558 1.953 1.953-.558a.249.249 0 0 0 .108-.064l6.286-6.286z"/></svg></span>
|
||||
|
||||
|
||||
<span class="btn-octicon tooltip disabled" data-content="You must have write access to make or propose changes to this file." data-position="bottom center"><svg viewBox="0 0 16 16" class="svg octicon-trash" width="16" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M6.5 1.75a.25.25 0 0 1 .25-.25h2.5a.25.25 0 0 1 .25.25V3h-3V1.75zm4.5 0V3h2.25a.75.75 0 0 1 0 1.5H2.75a.75.75 0 0 1 0-1.5H5V1.75C5 .784 5.784 0 6.75 0h2.5C10.216 0 11 .784 11 1.75zM4.496 6.675a.75.75 0 1 0-1.492.15l.66 6.6A1.75 1.75 0 0 0 5.405 15h5.19c.9 0 1.652-.681 1.741-1.576l.66-6.6a.75.75 0 0 0-1.492-.149l-.66 6.6a.25.25 0 0 1-.249.225h-5.19a.25.25 0 0 1-.249-.225l-.66-6.6z"/></svg></span>
|
||||
|
||||
|
||||
|
||||
</div>
|
||||
</h4>
|
||||
<div class="ui attached table unstackable segment">
|
||||
|
||||
|
||||
|
||||
|
||||
<div class="file-view markup csv">
|
||||
|
||||
<table class="data-table"><tr><th class="line-num">1</th><th>FileId</th><th>Year</th><th>LeftContext</th><th>RightContext</th></tr></table>
|
||||
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
|
||||
</div>
|
||||
</div>
|
||||
|
||||
|
||||
|
||||
|
||||
</div>
|
||||
|
||||
|
||||
|
||||
<footer>
|
||||
<div class="ui container">
|
||||
<div class="ui left">
|
||||
Powered by Gitea
|
||||
</div>
|
||||
<div class="ui right links">
|
||||
|
||||
<div class="ui language bottom floating slide up dropdown link item">
|
||||
<svg viewBox="0 0 16 16" class="svg octicon-globe" width="16" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M1.543 7.25h2.733c.144-2.074.866-3.756 1.58-4.948.12-.197.237-.381.353-.552a6.506 6.506 0 0 0-4.666 5.5zm2.733 1.5H1.543a6.506 6.506 0 0 0 4.666 5.5 11.13 11.13 0 0 1-.352-.552c-.715-1.192-1.437-2.874-1.581-4.948zm1.504 0h4.44a9.637 9.637 0 0 1-1.363 4.177c-.306.51-.612.919-.857 1.215a9.978 9.978 0 0 1-.857-1.215A9.637 9.637 0 0 1 5.78 8.75zm4.44-1.5H5.78a9.637 9.637 0 0 1 1.363-4.177c.306-.51.612-.919.857-1.215.245.296.55.705.857 1.215A9.638 9.638 0 0 1 10.22 7.25zm1.504 1.5c-.144 2.074-.866 3.756-1.58 4.948-.12.197-.237.381-.353.552a6.506 6.506 0 0 0 4.666-5.5h-2.733zm2.733-1.5h-2.733c-.144-2.074-.866-3.756-1.58-4.948a11.738 11.738 0 0 0-.353-.552 6.506 6.506 0 0 1 4.666 5.5zM8 0a8 8 0 1 0 0 16A8 8 0 0 0 8 0z"/></svg>
|
||||
<div class="text">English</div>
|
||||
<div class="menu language-menu">
|
||||
|
||||
<a lang="id-ID" data-url="/?lang=id-ID" class="item ">bahasa Indonesia</a>
|
||||
|
||||
<a lang="de-DE" data-url="/?lang=de-DE" class="item ">Deutsch</a>
|
||||
|
||||
<a lang="en-US" data-url="/?lang=en-US" class="item active selected">English</a>
|
||||
|
||||
<a lang="es-ES" data-url="/?lang=es-ES" class="item ">español</a>
|
||||
|
||||
<a lang="fr-FR" data-url="/?lang=fr-FR" class="item ">français</a>
|
||||
|
||||
<a lang="it-IT" data-url="/?lang=it-IT" class="item ">italiano</a>
|
||||
|
||||
<a lang="lv-LV" data-url="/?lang=lv-LV" class="item ">latviešu</a>
|
||||
|
||||
<a lang="hu-HU" data-url="/?lang=hu-HU" class="item ">magyar nyelv</a>
|
||||
|
||||
<a lang="nl-NL" data-url="/?lang=nl-NL" class="item ">Nederlands</a>
|
||||
|
||||
<a lang="pl-PL" data-url="/?lang=pl-PL" class="item ">polski</a>
|
||||
|
||||
<a lang="pt-PT" data-url="/?lang=pt-PT" class="item ">Português de Portugal</a>
|
||||
|
||||
<a lang="pt-BR" data-url="/?lang=pt-BR" class="item ">português do Brasil</a>
|
||||
|
||||
<a lang="fi-FI" data-url="/?lang=fi-FI" class="item ">suomi</a>
|
||||
|
||||
<a lang="sv-SE" data-url="/?lang=sv-SE" class="item ">svenska</a>
|
||||
|
||||
<a lang="tr-TR" data-url="/?lang=tr-TR" class="item ">Türkçe</a>
|
||||
|
||||
<a lang="cs-CZ" data-url="/?lang=cs-CZ" class="item ">čeština</a>
|
||||
|
||||
<a lang="el-GR" data-url="/?lang=el-GR" class="item ">ελληνικά</a>
|
||||
|
||||
<a lang="bg-BG" data-url="/?lang=bg-BG" class="item ">български</a>
|
||||
|
||||
<a lang="ru-RU" data-url="/?lang=ru-RU" class="item ">русский</a>
|
||||
|
||||
<a lang="sr-SP" data-url="/?lang=sr-SP" class="item ">српски</a>
|
||||
|
||||
<a lang="uk-UA" data-url="/?lang=uk-UA" class="item ">Українська</a>
|
||||
|
||||
<a lang="fa-IR" data-url="/?lang=fa-IR" class="item ">فارسی</a>
|
||||
|
||||
<a lang="ml-IN" data-url="/?lang=ml-IN" class="item ">മലയാളം</a>
|
||||
|
||||
<a lang="ja-JP" data-url="/?lang=ja-JP" class="item ">日本語</a>
|
||||
|
||||
<a lang="zh-CN" data-url="/?lang=zh-CN" class="item ">简体中文</a>
|
||||
|
||||
<a lang="zh-TW" data-url="/?lang=zh-TW" class="item ">繁體中文(台灣)</a>
|
||||
|
||||
<a lang="zh-HK" data-url="/?lang=zh-HK" class="item ">繁體中文(香港)</a>
|
||||
|
||||
<a lang="ko-KR" data-url="/?lang=ko-KR" class="item ">한국어</a>
|
||||
|
||||
</div>
|
||||
</div>
|
||||
<a href="/assets/js/licenses.txt">Licenses</a>
|
||||
<a href="/api/swagger">API</a>
|
||||
<a target="_blank" rel="noopener noreferrer" href="https://gitea.io">Website</a>
|
||||
|
||||
|
||||
</div>
|
||||
</div>
|
||||
</footer>
|
||||
|
||||
|
||||
|
||||
|
||||
<script src="/assets/js/index.js?v=88278c3fba7f4dbe8d14a7ab5c4cfc9c"></script>
|
||||
|
||||
</body>
|
||||
</html>
|
||||
|
|
@ -0,0 +1,89 @@

from torch import nn
import torch

import lzma
import regex as re
import pickle


class SimpleTrigramNeuralLanguageModel(nn.Module):
    """Predicts the gap word from the embeddings of its two neighbours."""

    def __init__(self, vocabulary_size, embedding_size):
        super(SimpleTrigramNeuralLanguageModel, self).__init__()
        # attribute name "embedings" kept (typo and all) so the keys of the
        # saved state_dict still match on load
        self.embedings = nn.Embedding(vocabulary_size, embedding_size)
        self.linear = nn.Linear(embedding_size * 2, vocabulary_size)
        self.softmax = nn.Softmax(dim=1)

    def forward(self, x):
        emb_1 = self.embedings(x[0])
        emb_2 = self.embedings(x[1])
        concated = self.linear(torch.cat((emb_1, emb_2), dim=1))
        return self.softmax(concated)


vocab_size = 20000
embed_size = 100
model = SimpleTrigramNeuralLanguageModel(vocab_size, embed_size)

model.load_state_dict(torch.load('model1_5400.bin'))
model.eval()

with open("vocab.pickle", 'rb') as handle:
    vocab = pickle.load(handle)
vocab.set_default_index(vocab['<unk>'])

device = 'cpu'

with lzma.open('dev-0/in.tsv.xz', 'rb') as file:
    for line in file:
        line = line.decode('utf-8').rstrip()

        # the last two tab-separated fields are the left and right context
        line_splitted = line.split('\t')[-2:]

        # last word of the left context, first word of the right context
        prev = line_splitted[0].split(' ')[-1]
        next_word = line_splitted[1].split(' ')[0]

        x = torch.tensor(vocab.forward([prev])).to(device)
        z = torch.tensor(vocab.forward([next_word])).to(device)
        ypredicted = model([x, z])

        top = torch.topk(ypredicted[0], 5000)
        top_indices = top.indices.tolist()
        top_probs = top.values.tolist()
        top_words = vocab.lookup_tokens(top_indices)

        string_to_print = ''
        sum_probs = 0
        for w, p in zip(top_words, top_probs):
            if '<unk>' in w:
                continue
            if re.search(r'\p{L}+', w):
                string_to_print += f"{w}:{p} "
                sum_probs += p
        if string_to_print == '':
            print("the:0.5 a:0.3 :0.2")
            continue
        unknown_prob = 1 - sum_probs
        string_to_print += f":{unknown_prob}"

        print(string_to_print)
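Each output line above has the form `word1:p1 word2:p2 ... :p_rest`, where the final unlabelled probability covers every word not listed. A minimal sketch of that formatting step, separated into a helper for clarity (the function name and the toy probabilities are hypothetical, and the remainder is rounded here to avoid float noise):

```python
def format_prediction(word_probs):
    # word_probs: (word, probability) pairs for the top candidates;
    # the leftover probability mass goes to the unlabelled "unknown" slot
    parts = [f"{w}:{p}" for w, p in word_probs]
    remainder = round(1 - sum(p for _, p in word_probs), 10)
    parts.append(f":{remainder}")
    return ' '.join(parts)

print(format_prediction([("the", 0.5), ("a", 0.3)]))  # prints "the:0.5 a:0.3 :0.2"
```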
@ -0,0 +1 @@
Word
File diff suppressed because it is too large
Binary file not shown.

File diff suppressed because it is too large

File diff suppressed because it is too large
Binary file not shown.
@ -0,0 +1,124 @@

from torch import nn
import torch

from torch.utils.data import IterableDataset, DataLoader
import itertools
import lzma
import regex as re
import pickle


def look_ahead_iterator(gen):
    # turn a flat token stream into overlapping (prev, current, next) trigrams
    prev = None
    current = None
    for next_token in gen:
        if prev is not None and current is not None:
            yield (prev, current, next_token)
        prev = current
        current = next_token


def get_words_from_line(line):
    line = line.rstrip()
    yield '<s>'
    for m in re.finditer(r'[\p{L}0-9\*]+|\p{P}+', line):
        yield m.group(0).lower()
    yield '</s>'


def get_word_lines_from_file(file_name):
    # read at most 100 000 lines to keep training time manageable
    counter = 0
    with lzma.open(file_name, 'r') as fh:
        for line in fh:
            counter += 1
            if counter == 100000:
                break
            yield get_words_from_line(line.decode("utf-8"))


class Trigrams(IterableDataset):
    def load_vocab(self):
        with open("vocab.pickle", 'rb') as handle:
            return pickle.load(handle)

    def __init__(self, text_file, vocabulary_size):
        self.vocab = self.load_vocab()
        self.vocab.set_default_index(self.vocab['<unk>'])
        self.vocabulary_size = vocabulary_size
        self.text_file = text_file

    def __iter__(self):
        return look_ahead_iterator(
            (self.vocab[t] for t in itertools.chain.from_iterable(
                get_word_lines_from_file(self.text_file))))


vocab_size = 20000
embed_size = 100

train_dataset = Trigrams('train/in.tsv.xz', vocab_size)


# === training ===

class SimpleTrigramNeuralLanguageModel(nn.Module):
    def __init__(self, vocabulary_size, embedding_size):
        super(SimpleTrigramNeuralLanguageModel, self).__init__()
        self.embedings = nn.Embedding(vocabulary_size, embedding_size)
        self.linear = nn.Linear(embedding_size * 2, vocabulary_size)
        self.softmax = nn.Softmax(dim=1)

    def forward(self, x):
        # predict the middle word from the embeddings of its two neighbours
        emb_1 = self.embedings(x[0])
        emb_2 = self.embedings(x[1])
        concated = self.linear(torch.cat((emb_1, emb_2), dim=1))
        return self.softmax(concated)


vocab = train_dataset.vocab
vocab.set_default_index(vocab['<unk>'])

device = 'cpu'
model = SimpleTrigramNeuralLanguageModel(vocab_size, embed_size).to(device)
data = DataLoader(train_dataset, batch_size=5000)
optimizer = torch.optim.Adam(model.parameters())
# NLLLoss expects log-probabilities, hence the explicit torch.log below
criterion = torch.nn.NLLLoss()

model.train()
step = 0
for x, y, z in data:
    x = x.to(device)
    y = y.to(device)
    z = z.to(device)
    optimizer.zero_grad()
    ypredicted = model([x, z])
    loss = criterion(torch.log(ypredicted), y)
    if step % 100 == 0:
        print(step, loss)
        torch.save(model.state_dict(), f'model1_{step}.bin')
    step += 1
    loss.backward()
    optimizer.step()

torch.save(model.state_dict(), 'model_tri1.bin')
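The look-ahead iterator used during training turns a flat token stream into overlapping trigrams, one per interior position; a quick standalone check on a toy stream (re-implemented here for illustration, independent of the training pipeline):

```python
def trigrams(tokens):
    # yield (prev, current, next) for every interior position of the stream
    prev, current = None, None
    for nxt in tokens:
        if prev is not None and current is not None:
            yield (prev, current, nxt)
        prev, current = current, nxt

result = list(trigrams(['<s>', 'ala', 'has', 'cat', '</s>']))
# each interior token appears exactly once as the middle element
print(result)
```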