{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "![Logo 1](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech1.jpg)\n", "
\n", "

Modelowanie języka

\n", "

09. Zanurzenia słów (Word2vec) [wykład]

\n", "

Filip Graliński (2022)

\n", "
\n", "\n", "![Logo 2](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech2.jpg)\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Zanurzenia słów (Word2vec)\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "W praktyce stosowalność słowosieci okazała się zaskakująco\n", "ograniczona. Większy przełom w przetwarzaniu języka naturalnego przyniosły\n", "wielowymiarowe reprezentacje słów, inaczej: zanurzenia słów.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### „Wymiary” słów\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Moglibyśmy zanurzyć (ang. *embed*) w wielowymiarowej przestrzeni, tzn. zdefiniować odwzorowanie\n", "$E \\colon V \\rightarrow \\mathcal{R}^m$ dla pewnego $m$ i określić taki sposób estymowania\n", "prawdopodobieństw $P(u|v)$, by dla par $E(v)$ i $E(v')$ oraz $E(u)$ i $E(u')$ znajdujących się w pobliżu\n", "(według jakiejś metryki odległości, na przykład zwykłej odległości euklidesowej):\n", "\n", "$$P(u|v) \\approx P(u'|v').$$\n", "\n", "$E(u)$ nazywamy zanurzeniem (embeddingiem) słowa.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Wymiary określone z góry?\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Można by sobie wyobrazić, że $m$ wymiarów mogłoby być z góry\n", "określonych przez lingwistę. Wymiary te byłyby związane z typowymi\n", "„osiami” rozpatrywanymi w językoznawstwie, na przykład:\n", "\n", "- czy słowo jest wulgarne, pospolite, potoczne, neutralne czy książkowe?\n", "- czy słowo jest archaiczne, wychodzące z użycia czy jest neologizmem?\n", "- czy słowo dotyczy kobiet, czy mężczyzn (w sensie rodzaju gramatycznego i/lub\n", " socjolingwistycznym)?\n", "- czy słowo jest w liczbie pojedynczej czy mnogiej?\n", "- czy słowo jest rzeczownikiem czy czasownikiem?\n", "- czy słowo jest rdzennym słowem czy zapożyczeniem?\n", "- czy słowo jest nazwą czy słowem pospolitym?\n", "- czy słowo opisuje konkretną rzecz czy pojęcie abstrakcyjne?\n", "- …\n", "\n", "W praktyce okazało się jednak, że lepiej, żeby komputer uczył się sam\n", "możliwych wymiarów — z góry określamy tylko $m$ (liczbę wymiarów).\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Bigramowy model języka oparty na zanurzeniach\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Zbudujemy teraz najprostszy model język oparty na zanurzeniach. Będzie to właściwie najprostszy\n", "**neuronowy model języka**, jako że zbudowany model można traktować jako prostą sieć neuronową.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Słownik\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "W typowym neuronowym modelu języka rozmiar słownika musi być z góry\n", "ograniczony. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "#### Wymiary określone z góry?\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Można by sobie wyobrazić, że $m$ wymiarów mogłoby być z góry\n", "określonych przez lingwistę. Wymiary te byłyby związane z typowymi\n", "„osiami” rozpatrywanymi w językoznawstwie, na przykład:\n", "\n", "- czy słowo jest wulgarne, pospolite, potoczne, neutralne czy książkowe?\n", "- czy słowo jest archaiczne, wychodzące z użycia czy jest neologizmem?\n", "- czy słowo dotyczy kobiet, czy mężczyzn (w sensie rodzaju gramatycznego i/lub\n", " socjolingwistycznym)?\n", "- czy słowo jest w liczbie pojedynczej czy mnogiej?\n", "- czy słowo jest rzeczownikiem czy czasownikiem?\n", "- czy słowo jest rdzennym słowem czy zapożyczeniem?\n", "- czy słowo jest nazwą czy słowem pospolitym?\n", "- czy słowo opisuje konkretną rzecz czy pojęcie abstrakcyjne?\n", "- …\n", "\n", "W praktyce okazało się jednak, że lepiej, żeby komputer uczył się sam\n", "możliwych wymiarów — z góry określamy tylko $m$ (liczbę wymiarów).\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Bigramowy model języka oparty na zanurzeniach\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Zbudujemy teraz najprostszy model języka oparty na zanurzeniach. Będzie to właściwie najprostszy\n", "**neuronowy model języka**, jako że zbudowany model można traktować jako prostą sieć neuronową.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Słownik\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "W typowym neuronowym modelu języka rozmiar słownika musi być z góry\n", "ograniczony. Zazwyczaj jest to liczba rzędu kilkudziesięciu tysięcy wyrazów —\n", "po prostu będziemy rozpatrywać $|V|$ najczęstszych wyrazów, pozostałe zamienimy\n", "na specjalny token reprezentujący nieznany (*unknown*) wyraz.\n", "\n", "Aby utworzyć taki słownik, użyjemy gotowej klasy `Vocab` z pakietu torchtext:\n", "\n" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "12531" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from itertools import islice\n", "import regex as re\n", "import sys\n", "from torchtext.vocab import build_vocab_from_iterator\n", "import lzma\n", "\n", "def get_words_from_line(line):\n", "    line = line.rstrip()\n", "    yield ''\n", "    # prosta tokenizacja: ciągi liter/cyfr/gwiazdek albo ciągi znaków interpunkcyjnych\n", "    for m in re.finditer(r'[\\p{L}0-9\\*]+|\\p{P}+', line):\n", "        yield m.group(0).lower()\n", "    yield ''\n", "\n", "\n", "def get_word_lines_from_file(file_name):\n", "    counter = 0\n", "    with lzma.open(file_name, 'r') as fh:\n", "        for line in fh:\n", "            counter += 1\n", "            # dla przyspieszenia ograniczamy się do pierwszych 100 000 wierszy\n", "            if counter == 100000:\n", "                break\n", "            line = line.decode(\"utf-8\")\n", "            yield get_words_from_line(line)\n", "\n", "vocab_size = 20000\n", "\n", "vocab = build_vocab_from_iterator(\n", "    get_word_lines_from_file('train/in.tsv.xz'),\n", "    max_tokens = vocab_size,\n", "    specials = [''])\n", "\n", "vocab['jest']" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "import pickle\n", "with open(\"vocab.pickle\", 'wb') as handle:\n", "    pickle.dump(vocab, handle)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "with open(\"vocab.pickle\", 'rb') as handle:\n", "    vocab = pickle.load(handle)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "838" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vocab['love']" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "vocab.lookup_token(0)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "WARNING: pip is being invoked by an old script wrapper. 
This will fail in a future version of pip.\n", "Please see https://github.com/pypa/pip/issues/5599 for advice on fixing the underlying issue.\n", "To avoid this problem you can invoke Python with '-m pip' instead of running pip directly.\n", "Defaulting to user installation because normal site-packages is not writeable\n", "Collecting torchtext\n", " Downloading torchtext-0.15.1-cp38-cp38-manylinux1_x86_64.whl (2.0 MB)\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m2.0/2.0 MB\u001b[0m \u001b[31m661.8 kB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n", "\u001b[?25hRequirement already satisfied: numpy in /home/mikolaj/.local/lib/python3.8/site-packages (from torchtext) (1.19.3)\n", "Requirement already satisfied: tqdm in /home/mikolaj/.local/lib/python3.8/site-packages (from torchtext) (4.64.0)\n", "Collecting torchdata==0.6.0\n", " Downloading torchdata-0.6.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.6 MB)\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m4.6/4.6 MB\u001b[0m \u001b[31m1.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m0m\n", "\u001b[?25hRequirement already satisfied: requests in /home/mikolaj/.local/lib/python3.8/site-packages (from torchtext) (2.26.0)\n", "Collecting torch==2.0.0\n", " Downloading torch-2.0.0-cp38-cp38-manylinux1_x86_64.whl (619.9 MB)\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m619.9/619.9 MB\u001b[0m \u001b[31m2.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:02\u001b[0m\n", "\u001b[?25hCollecting nvidia-nccl-cu11==2.14.3\n", " Downloading nvidia_nccl_cu11-2.14.3-py3-none-manylinux1_x86_64.whl (177.1 MB)\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m177.1/177.1 MB\u001b[0m \u001b[31m6.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n", "\u001b[?25hCollecting nvidia-cuda-runtime-cu11==11.7.99\n", " Using cached nvidia_cuda_runtime_cu11-11.7.99-py3-none-manylinux1_x86_64.whl (849 kB)\n", "Collecting nvidia-cuda-nvrtc-cu11==11.7.99\n", " Using cached nvidia_cuda_nvrtc_cu11-11.7.99-2-py3-none-manylinux1_x86_64.whl (21.0 MB)\n", "Collecting nvidia-cudnn-cu11==8.5.0.96\n", " Using cached nvidia_cudnn_cu11-8.5.0.96-2-py3-none-manylinux1_x86_64.whl (557.1 MB)\n", "Requirement already satisfied: networkx in /home/mikolaj/.local/lib/python3.8/site-packages (from torch==2.0.0->torchtext) (2.8)\n", "Collecting nvidia-cublas-cu11==11.10.3.66\n", " Using cached nvidia_cublas_cu11-11.10.3.66-py3-none-manylinux1_x86_64.whl (317.1 MB)\n", "Collecting nvidia-cusolver-cu11==11.4.0.1\n", " Downloading nvidia_cusolver_cu11-11.4.0.1-2-py3-none-manylinux1_x86_64.whl (102.6 MB)\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m102.6/102.6 MB\u001b[0m \u001b[31m5.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n", "\u001b[?25hRequirement already satisfied: typing-extensions in /home/mikolaj/.local/lib/python3.8/site-packages (from torch==2.0.0->torchtext) (4.2.0)\n", "Collecting nvidia-cusparse-cu11==11.7.4.91\n", " Downloading nvidia_cusparse_cu11-11.7.4.91-py3-none-manylinux1_x86_64.whl (173.2 MB)\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m173.2/173.2 MB\u001b[0m \u001b[31m6.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n", "\u001b[?25hCollecting nvidia-cuda-cupti-cu11==11.7.101\n", " 
Downloading nvidia_cuda_cupti_cu11-11.7.101-py3-none-manylinux1_x86_64.whl (11.8 MB)\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m11.8/11.8 MB\u001b[0m \u001b[31m7.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0mm\n", "\u001b[?25hCollecting nvidia-curand-cu11==10.2.10.91\n", " Downloading nvidia_curand_cu11-10.2.10.91-py3-none-manylinux1_x86_64.whl (54.6 MB)\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m54.6/54.6 MB\u001b[0m \u001b[31m10.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n", "\u001b[?25hRequirement already satisfied: jinja2 in /home/mikolaj/.local/lib/python3.8/site-packages (from torch==2.0.0->torchtext) (3.0.3)\n", "Collecting nvidia-cufft-cu11==10.9.0.58\n", " Downloading nvidia_cufft_cu11-10.9.0.58-py3-none-manylinux1_x86_64.whl (168.4 MB)\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m168.4/168.4 MB\u001b[0m \u001b[31m3.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n", "\u001b[?25hCollecting sympy\n", " Downloading sympy-1.11.1-py3-none-any.whl (6.5 MB)\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m6.5/6.5 MB\u001b[0m \u001b[31m1.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n", "\u001b[?25hCollecting triton==2.0.0\n", " Downloading triton-2.0.0-1-cp38-cp38-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (63.2 MB)\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m63.2/63.2 MB\u001b[0m \u001b[31m2.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n", "\u001b[?25hRequirement already satisfied: filelock in /home/mikolaj/.local/lib/python3.8/site-packages (from torch==2.0.0->torchtext) (3.6.0)\n", "Collecting nvidia-nvtx-cu11==11.7.91\n", " Downloading nvidia_nvtx_cu11-11.7.91-py3-none-manylinux1_x86_64.whl (98 kB)\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m98.6/98.6 kB\u001b[0m \u001b[31m3.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[?25hRequirement already satisfied: urllib3>=1.25 in /home/mikolaj/.local/lib/python3.8/site-packages (from torchdata==0.6.0->torchtext) (1.26.9)\n", "Requirement already satisfied: setuptools in /home/mikolaj/.local/lib/python3.8/site-packages (from nvidia-cublas-cu11==11.10.3.66->torch==2.0.0->torchtext) (63.4.2)\n", "Requirement already satisfied: wheel in /home/mikolaj/.local/lib/python3.8/site-packages (from nvidia-cublas-cu11==11.10.3.66->torch==2.0.0->torchtext) (0.37.1)\n", "Collecting lit\n", " Downloading lit-16.0.1.tar.gz (137 kB)\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m137.9/137.9 kB\u001b[0m \u001b[31m1.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0ma \u001b[36m0:00:01\u001b[0m\n", "\u001b[?25h Preparing metadata (setup.py) ... 
\u001b[?25ldone\n", "\u001b[?25hCollecting cmake\n", " Using cached cmake-3.26.3-py2.py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (24.0 MB)\n", "Requirement already satisfied: charset-normalizer~=2.0.0 in /home/mikolaj/.local/lib/python3.8/site-packages (from requests->torchtext) (2.0.10)\n", "Requirement already satisfied: idna<4,>=2.5 in /home/mikolaj/.local/lib/python3.8/site-packages (from requests->torchtext) (2.10)\n", "Requirement already satisfied: certifi>=2017.4.17 in /home/mikolaj/.local/lib/python3.8/site-packages (from requests->torchtext) (2022.12.7)\n", "Requirement already satisfied: MarkupSafe>=2.0 in /home/mikolaj/.local/lib/python3.8/site-packages (from jinja2->torch==2.0.0->torchtext) (2.0.1)\n", "Collecting mpmath>=0.19\n", " Downloading mpmath-1.3.0-py3-none-any.whl (536 kB)\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m536.2/536.2 kB\u001b[0m \u001b[31m768.5 kB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0ma \u001b[36m0:00:01\u001b[0m\n", "\u001b[?25hBuilding wheels for collected packages: lit\n", " Building wheel for lit (setup.py) ... \u001b[?25ldone\n", "\u001b[?25h Created wheel for lit: filename=lit-16.0.1-py3-none-any.whl size=88173 sha256=fca0dda7f2dc27a2885356559af2c2b6bc26994156ad1efae9f15f63d3866468\n", " Stored in directory: /home/mikolaj/.cache/pip/wheels/12/14/ba/87be46a564f97692e6cd1f6d7a1deeb5bff2821d45a52e8d7a\n", "Successfully built lit\n", "Installing collected packages: mpmath, lit, cmake, sympy, nvidia-nvtx-cu11, nvidia-nccl-cu11, nvidia-cusparse-cu11, nvidia-curand-cu11, nvidia-cufft-cu11, nvidia-cuda-runtime-cu11, nvidia-cuda-nvrtc-cu11, nvidia-cuda-cupti-cu11, nvidia-cublas-cu11, nvidia-cusolver-cu11, nvidia-cudnn-cu11, triton, torch, torchdata, torchtext\n", " Attempting uninstall: torch\n", " Found existing installation: torch 1.10.0\n", " Uninstalling torch-1.10.0:\n", " Successfully uninstalled torch-1.10.0\n", "\u001b[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. 
This behaviour is the source of the following dependency conflicts.\n", "laserembeddings 1.1.2 requires sacremoses==0.0.35, which is not installed.\n", "unbabel-comet 1.1.0 requires numpy>=1.20.0, but you have numpy 1.19.3 which is incompatible.\n", "unbabel-comet 1.1.0 requires scipy>=1.5.4, but you have scipy 1.4.1 which is incompatible.\n", "unbabel-comet 1.1.0 requires torch<=1.10.0,>=1.6.0, but you have torch 2.0.0 which is incompatible.\n", "torchvision 0.11.1 requires torch==1.10.0, but you have torch 2.0.0 which is incompatible.\n", "torchaudio 0.10.0 requires torch==1.10.0, but you have torch 2.0.0 which is incompatible.\n", "laserembeddings 1.1.2 requires torch<2.0.0,>=1.0.1.post2, but you have torch 2.0.0 which is incompatible.\u001b[0m\u001b[31m\n", "\u001b[0mSuccessfully installed cmake-3.26.3 lit-16.0.1 mpmath-1.3.0 nvidia-cublas-cu11-11.10.3.66 nvidia-cuda-cupti-cu11-11.7.101 nvidia-cuda-nvrtc-cu11-11.7.99 nvidia-cuda-runtime-cu11-11.7.99 nvidia-cudnn-cu11-8.5.0.96 nvidia-cufft-cu11-10.9.0.58 nvidia-curand-cu11-10.2.10.91 nvidia-cusolver-cu11-11.4.0.1 nvidia-cusparse-cu11-11.7.4.91 nvidia-nccl-cu11-2.14.3 nvidia-nvtx-cu11-11.7.91 sympy-1.11.1 torch-2.0.0 torchdata-0.6.0 torchtext-0.15.1 triton-2.0.0\n", "\n", "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip available: \u001b[0m\u001b[31;49m22.2.2\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m23.0.1\u001b[0m\n", "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n" ] } ], "source": [ "!pip3 install torchtext" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['', '\\\\', 'the', '-\\\\', 'wno']" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vocab.lookup_tokens([0, 1, 2, 10, 12345])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Definicja sieci\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Naszą prostą sieć neuronową zaimplementujemy używając frameworku PyTorch.\n", "\n" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "def look_ahead_iterator(gen):\n", " prev = None\n", " for item in gen:\n", " if prev is not None:\n", " yield (prev, item)\n", " prev = item" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(1, 2), (2, 3), (3, 4), (4, 5), (5, 'X'), ('X', 6)]" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(look_ahead_iterator([1,2,3,4,5, 'X', 6 ]))" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "import itertools" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [], "source": [ "from torch import nn\n", "import torch\n", "\n", "embed_size = 100\n", "\n", "class SimpleBigramNeuralLanguageModel(nn.Module):\n", " def __init__(self, vocabulary_size, embedding_size):\n", " super(SimpleBigramNeuralLanguageModel, self).__init__()\n", " self.model = nn.Sequential(\n", " nn.Embedding(vocabulary_size, embedding_size),\n", " nn.Linear(embedding_size, vocabulary_size),\n", " nn.Softmax()\n", " )\n", "\n", " def forward(self, x):\n", " return self.model(x)\n", "\n", "model = SimpleBigramNeuralLanguageModel(vocab_size, embed_size)\n", "\n", "vocab.set_default_index(vocab[''])\n", "ixs = 
torch.tensor(vocab.forward(['pies']))\n", "# out[0][vocab['jest']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Teraz wyuczmy model. Wpierw tylko potasujmy nasz plik:\n", "\n", " shuf < opensubtitlesA.pl.txt > opensubtitlesA.pl.shuf.txt\n", "\n" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "from torch.utils.data import IterableDataset\n", "import itertools\n", "\n", "def look_ahead_iterator(gen):\n", " prev = None\n", " for item in gen:\n", " if prev is not None:\n", " yield (prev, item)\n", " prev = item\n", "\n", "class Bigrams(IterableDataset):\n", " def __init__(self, text_file, vocabulary_size):\n", " self.vocab = build_vocab_from_iterator(\n", " get_word_lines_from_file(text_file),\n", " max_tokens = vocabulary_size,\n", " specials = [''])\n", " self.vocab.set_default_index(self.vocab[''])\n", " self.vocabulary_size = vocabulary_size\n", " self.text_file = text_file\n", "\n", " def __iter__(self):\n", " return look_ahead_iterator(\n", " (self.vocab[t] for t in itertools.chain.from_iterable(get_word_lines_from_file(self.text_file))))\n", "\n", "train_dataset = Bigrams('train/in.tsv.xz', vocab_size)" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "ename": "TypeError", "evalue": "'tuple' object is not an iterator", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m/tmp/ipykernel_12664/602008184.py\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0mtorch\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mutils\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdata\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mDataLoader\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 3\u001b[0;31m \u001b[0mnext\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mnext\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0miter\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtrain_dataset\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;31mTypeError\u001b[0m: 'tuple' object is not an iterator" ] } ], "source": [ "from torch.utils.data import DataLoader\n", "\n", "next(iter(train_dataset))" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['', '']" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vocab.lookup_tokens([43, 0])" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[tensor([ 2, 5, 51, 3481, 231]), tensor([ 5, 51, 3481, 231, 4])]" ] } ], "source": [ "from torch.utils.data import DataLoader\n", "\n", "next(iter(DataLoader(train_dataset, batch_size=5)))" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/mikolaj/.local/lib/python3.8/site-packages/torch/nn/modules/container.py:217: UserWarning: Implicit dimension choice for softmax has been deprecated. 
Change the call to include dim=X as an argument.\n", " input = module(input)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "0 tensor(10.0877, grad_fn=)\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/home/mikolaj/.local/lib/python3.8/site-packages/torch/autograd/__init__.py:200: UserWarning: CUDA initialization: The NVIDIA driver on your system is too old (found version 10020). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:109.)\n", " Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "100 tensor(8.4388, grad_fn=)\n", "200 tensor(7.7335, grad_fn=)\n", "300 tensor(7.1300, grad_fn=)\n", "400 tensor(6.7325, grad_fn=)\n", "500 tensor(6.4705, grad_fn=)\n", "600 tensor(6.0460, grad_fn=)\n", "700 tensor(5.8104, grad_fn=)\n", "800 tensor(5.8110, grad_fn=)\n", "900 tensor(5.7169, grad_fn=)\n", "1000 tensor(5.7580, grad_fn=)\n", "1100 tensor(5.6787, grad_fn=)\n", "1200 tensor(5.4501, grad_fn=)\n" ] }, { "ename": "KeyboardInterrupt", "evalue": "", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mKeyboardInterrupt\u001b[0m Traceback (most recent call last)", "\u001b[0;32m/tmp/ipykernel_12664/1293343661.py\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[1;32m 11\u001b[0m \u001b[0my\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0my\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mto\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdevice\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 12\u001b[0m \u001b[0moptimizer\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mzero_grad\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 13\u001b[0;31m \u001b[0mypredicted\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mmodel\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mx\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 14\u001b[0m \u001b[0mloss\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mcriterion\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtorch\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mlog\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mypredicted\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0my\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 15\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mstep\u001b[0m \u001b[0;34m%\u001b[0m \u001b[0;36m100\u001b[0m \u001b[0;34m==\u001b[0m \u001b[0;36m0\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m~/.local/lib/python3.8/site-packages/torch/nn/modules/module.py\u001b[0m in \u001b[0;36m_call_impl\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m 1499\u001b[0m \u001b[0;32mor\u001b[0m \u001b[0m_global_backward_pre_hooks\u001b[0m \u001b[0;32mor\u001b[0m \u001b[0m_global_backward_hooks\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1500\u001b[0m or _global_forward_hooks or _global_forward_pre_hooks):\n\u001b[0;32m-> 1501\u001b[0;31m \u001b[0;32mreturn\u001b[0m 
\u001b[0mforward_call\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1502\u001b[0m \u001b[0;31m# Do not call functions when jit is used\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1503\u001b[0m \u001b[0mfull_backward_hooks\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnon_full_backward_hooks\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/tmp/ipykernel_12664/517511851.py\u001b[0m in \u001b[0;36mforward\u001b[0;34m(self, x)\u001b[0m\n\u001b[1;32m 14\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 15\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mforward\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mx\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 16\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmodel\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mx\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 17\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 18\u001b[0m \u001b[0mmodel\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mSimpleBigramNeuralLanguageModel\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mvocab_size\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0membed_size\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m~/.local/lib/python3.8/site-packages/torch/nn/modules/module.py\u001b[0m in \u001b[0;36m_call_impl\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m 1499\u001b[0m \u001b[0;32mor\u001b[0m \u001b[0m_global_backward_pre_hooks\u001b[0m \u001b[0;32mor\u001b[0m \u001b[0m_global_backward_hooks\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1500\u001b[0m or _global_forward_hooks or _global_forward_pre_hooks):\n\u001b[0;32m-> 1501\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mforward_call\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1502\u001b[0m \u001b[0;31m# Do not call functions when jit is used\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1503\u001b[0m \u001b[0mfull_backward_hooks\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnon_full_backward_hooks\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m~/.local/lib/python3.8/site-packages/torch/nn/modules/container.py\u001b[0m in \u001b[0;36mforward\u001b[0;34m(self, input)\u001b[0m\n\u001b[1;32m 215\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mforward\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0minput\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 216\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mmodule\u001b[0m \u001b[0;32min\u001b[0m 
\u001b[0mself\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 217\u001b[0;31m \u001b[0minput\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mmodule\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0minput\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 218\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0minput\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 219\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m~/.local/lib/python3.8/site-packages/torch/nn/modules/module.py\u001b[0m in \u001b[0;36m_call_impl\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m 1499\u001b[0m \u001b[0;32mor\u001b[0m \u001b[0m_global_backward_pre_hooks\u001b[0m \u001b[0;32mor\u001b[0m \u001b[0m_global_backward_hooks\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1500\u001b[0m or _global_forward_hooks or _global_forward_pre_hooks):\n\u001b[0;32m-> 1501\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mforward_call\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1502\u001b[0m \u001b[0;31m# Do not call functions when jit is used\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1503\u001b[0m \u001b[0mfull_backward_hooks\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnon_full_backward_hooks\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m~/.local/lib/python3.8/site-packages/torch/nn/modules/activation.py\u001b[0m in \u001b[0;36mforward\u001b[0;34m(self, input)\u001b[0m\n\u001b[1;32m 1457\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1458\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mforward\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0minput\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mTensor\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;34m->\u001b[0m \u001b[0mTensor\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1459\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mF\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msoftmax\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0minput\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdim\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0m_stacklevel\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m5\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1460\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1461\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mextra_repr\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;34m->\u001b[0m \u001b[0mstr\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m~/.local/lib/python3.8/site-packages/torch/nn/functional.py\u001b[0m in \u001b[0;36msoftmax\u001b[0;34m(input, dim, _stacklevel, dtype)\u001b[0m\n\u001b[1;32m 1841\u001b[0m \u001b[0mdim\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0m_get_softmax_dim\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"softmax\"\u001b[0m\u001b[0;34m,\u001b[0m 
\u001b[0minput\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdim\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0m_stacklevel\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1842\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mdtype\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1843\u001b[0;31m \u001b[0mret\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0minput\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msoftmax\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdim\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1844\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1845\u001b[0m \u001b[0mret\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0minput\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msoftmax\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdim\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mdtype\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mKeyboardInterrupt\u001b[0m: " ] } ], "source": [ "device = 'cpu'\n", "model = SimpleBigramNeuralLanguageModel(vocab_size, embed_size).to(device)\n", "data = DataLoader(train_dataset, batch_size=5000)\n", "optimizer = torch.optim.Adam(model.parameters())\n", "criterion = torch.nn.NLLLoss()\n", "\n", "model.train()\n", "step = 0\n", "for x, y in data:\n", " x = x.to(device)\n", " y = y.to(device)\n", " optimizer.zero_grad()\n", " ypredicted = model(x)\n", " loss = criterion(torch.log(ypredicted), y)\n", " if step % 100 == 0:\n", " print(step, loss)\n", " step += 1\n", " loss.backward()\n", " optimizer.step()\n", "\n", "torch.save(model.state_dict(), 'model1.bin')" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [], "source": [ "torch.save(model.state_dict(), 'model1.bin')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Policzmy najbardziej prawdopodobne kontynuacje dla zadanego słowa:\n", "\n" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(',', 3, 0.12514805793762207),\n", " ('\\\\', 1, 0.07237359136343002),\n", " ('', 0, 0.06839419901371002),\n", " ('.', 4, 0.06109621003270149),\n", " ('of', 5, 0.04557998105883598),\n", " ('and', 6, 0.03565318509936333),\n", " ('the', 2, 0.029342489317059517),\n", " ('to', 7, 0.02185475267469883),\n", " ('-\\\\', 10, 0.018097609281539917),\n", " ('in', 9, 0.016023961827158928)]" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "device = 'cpu'\n", "model = SimpleBigramNeuralLanguageModel(vocab_size, embed_size).to(device)\n", "model.load_state_dict(torch.load('model1.bin'))\n", "model.eval()\n", "\n", "ixs = torch.tensor(vocab.forward(['dla'])).to(device)\n", "\n", "out = model(ixs)\n", "top = torch.topk(out[0], 10)\n", "top_indices = top.indices.tolist()\n", "top_probs = top.values.tolist()\n", "top_words = vocab.lookup_tokens(top_indices)\n", "list(zip(top_words, top_indices, top_probs))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Teraz zbadajmy najbardziej podobne zanurzenia dla zadanego słowa:\n", "\n" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(',', 3, 0.12514805793762207),\n", " ('\\\\', 1, 0.07237359136343002),\n", " ('', 0, 
0.06839419901371002),\n", " ('.', 4, 0.06109621003270149),\n", " ('of', 5, 0.04557998105883598),\n", " ('and', 6, 0.03565318509936333),\n", " ('the', 2, 0.029342489317059517),\n", " ('to', 7, 0.02185475267469883),\n", " ('-\\\\', 10, 0.018097609281539917),\n", " ('in', 9, 0.016023961827158928)]" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vocab = train_dataset.vocab\n", "ixs = torch.tensor(vocab.forward(['kłopot'])).to(device)\n", "\n", "out = model(ixs)\n", "top = torch.topk(out[0], 10)\n", "top_indices = top.indices.tolist()\n", "top_probs = top.values.tolist()\n", "top_words = vocab.lookup_tokens(top_indices)\n", "list(zip(top_words, top_indices, top_probs))" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('', 0, 1.0000001192092896),\n", " ('nb', 1958, 0.4407886266708374),\n", " ('refrain', 14092, 0.4395471513271332),\n", " ('cat', 3391, 0.4154242277145386),\n", " ('enjoying', 7521, 0.3915165066719055),\n", " ('active', 1383, 0.38935279846191406),\n", " ('stewart', 4816, 0.3806381821632385),\n", " ('omit', 15600, 0.380504310131073),\n", " ('2041095573313', 11912, 0.37909239530563354),\n", " ('shut', 3863, 0.3778260052204132)]" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cos = nn.CosineSimilarity(dim=1, eps=1e-6)\n", "\n", "embeddings = model.model[0].weight\n", "\n", "vec = embeddings[vocab['poszedł']]\n", "\n", "similarities = cos(vec, embeddings)\n", "\n", "top = torch.topk(similarities, 10)\n", "\n", "top_indices = top.indices.tolist()\n", "top_probs = top.values.tolist()\n", "top_words = vocab.lookup_tokens(top_indices)\n", "list(zip(top_words, top_indices, top_probs))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Zapis przy użyciu wzoru matematycznego\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Powyżej zaprogramowaną sieć neuronową można opisać następującym wzorem:\n", "\n", "$$\\vec{y} = \\operatorname{softmax}(CE(w_{i-1})),$$\n", "\n", "gdzie:\n", "\n", "- $w_{i-1}$ to pierwszy wyraz w bigramie (poprzedzający wyraz),\n", "- $E(w)$ to zanurzenie (embedding) wyrazu $w$ — wektor o rozmiarze $m$,\n", "- $C$ to macierz o rozmiarze $|V| \\times m$, która rzutuje wektor zanurzenia w wektor o rozmiarze słownika,\n", "- $\\vec{y}$ to wyjściowy wektor prawdopodobieństw o rozmiarze $|V|$.\n", "\n" ] },
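{ "cell_type": "markdown", "metadata": {}, "source": [ "Możemy ten wzór odtworzyć „ręcznie” na wagach naszej sieci. Poniżej szkic (zakładamy, że wytrenowany `model` i `vocab` z poprzednich komórek są nadal dostępne; jedyną różnicą względem wzoru jest wektor obciążeń warstwy liniowej, pominięty wyżej dla uproszczenia):\n", "\n", "```python\n", "E = model.model[0].weight   # macierz zanurzeń o rozmiarze |V| x m\n", "C = model.model[1].weight   # macierz warstwy liniowej o rozmiarze |V| x m\n", "b = model.model[1].bias     # obciążenie warstwy liniowej (nieujęte we wzorze)\n", "\n", "w = vocab['dla']            # indeks poprzedzającego wyrazu\n", "\n", "# rozkład policzony bezpośrednio według wzoru: softmax(C E(w) + b)\n", "recznie = torch.softmax(C @ E[w] + b, dim=0)\n", "\n", "# ten sam rozkład policzony przez sieć zdefiniowaną wyżej\n", "siecia = model(torch.tensor([w]))[0]\n", "\n", "print(torch.allclose(recznie, siecia, atol=1e-6))\n", "print(torch.topk(recznie, 5))\n", "```\n", "\n" ] },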
{ "cell_type": "markdown", "metadata": {}, "source": [ "##### Hiperparametry\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Zauważmy, że nasz model ma dwa hiperparametry:\n", "\n", "- $m$ — rozmiar zanurzenia,\n", "- $|V|$ — rozmiar słownika, jeśli zakładamy, że możemy sterować\n", " rozmiarem słownika (np. przez obcinanie słownika do zadanej liczby\n", " najczęstszych wyrazów i zamianę pozostałych na specjalny token reprezentujący wyrazy nieznane).\n", "\n", "Oczywiście możemy próbować manipulować wartościami $m$ i $|V|$ w celu\n", "polepszenia wyników naszego modelu.\n", "\n", "**Pytanie**: dlaczego nie ma sensu wartość $m \\approx |V|$? Dlaczego nie ma sensu wartość $m = 1$?\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Diagram sieci\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Jako że mnożenie przez macierz ($C$) oznacza po prostu zastosowanie\n", "warstwy liniowej, naszą sieć możemy interpretować jako jednowarstwową\n", "sieć neuronową, co można zilustrować za pomocą następującego diagramu:\n", "\n", "![img](./09_Zanurzenia_slow/bigram1.drawio.png \"Diagram prostego bigramowego neuronowego modelu języka\")\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Zanurzenie jako mnożenie przez macierz\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Uzyskanie zanurzenia ($E(w)$) zazwyczaj realizowane jest na zasadzie\n", "odpytania (*look-up*). Co ciekawe, zanurzenie można interpretować jako\n", "mnożenie przez macierz zanurzeń (embeddingów) $E$ o rozmiarze $m \\times |V|$ — jeśli słowo będziemy na wejściu kodowali przy użyciu\n", "wektora z gorącą jedynką (*one-hot encoding*), tzn. słowo $w$ zostanie\n", "podane na wejściu jako wektor $\\vec{1_V}(w) = [0,\\ldots,0,1,0,\\ldots,0]$ o rozmiarze $|V|$\n", "złożony z samych zer z wyjątkiem jedynki na pozycji odpowiadającej indeksowi wyrazu $w$ w słowniku $V$.\n", "\n", "Wówczas wzór przyjmie postać:\n", "\n", "$$\\vec{y} = \\operatorname{softmax}(CE\\vec{1_V}(w_{i-1})),$$\n", "\n", "gdzie $E$ będzie tym razem macierzą $m \\times |V|$.\n", "\n", "**Pytanie**: czy $\\vec{1_V}(w)$ interpretujemy jako wektor wierszowy czy kolumnowy?\n", "\n", "W postaci diagramu można tę interpretację zilustrować w następujący sposób:\n", "\n", "![img](./09_Zanurzenia_slow/bigram2.drawio.png \"Diagram prostego bigramowego neuronowego modelu języka z wejściem w postaci one-hot\")\n", "\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.8.12 64-bit", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.12" }, "org": null, "vscode": { "interpreter": { "hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6" } } }, "nbformat": 4, "nbformat_minor": 1 }