{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### The build_vocab_from_iterator function automatically builds a vocabulary from the most frequent words, while mapping all remaining words to <unk>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "c:\\Users\\ryssta\\AppData\\Local\\anaconda3\\envs\\python39\\lib\\site-packages\\torchtext\\vocab\\__init__.py:4: UserWarning: \n",
      "/!\\ IMPORTANT WARNING ABOUT TORCHTEXT STATUS /!\\ \n",
      "Torchtext is deprecated and the last released version will be 0.18 (this one). You can silence this warning by calling the following at the beginnign of your scripts: `import torchtext; torchtext.disable_torchtext_deprecation_warning()`\n",
      "  warnings.warn(torchtext._TORCHTEXT_DEPRECATION_MSG)\n",
      "c:\\Users\\ryssta\\AppData\\Local\\anaconda3\\envs\\python39\\lib\\site-packages\\torchtext\\utils.py:4: UserWarning: \n",
      "/!\\ IMPORTANT WARNING ABOUT TORCHTEXT STATUS /!\\ \n",
      "Torchtext is deprecated and the last released version will be 0.18 (this one). You can silence this warning by calling the following at the beginnign of your scripts: `import torchtext; torchtext.disable_torchtext_deprecation_warning()`\n",
      "  warnings.warn(torchtext._TORCHTEXT_DEPRECATION_MSG)\n"
     ]
    }
   ],
   "source": [
    "from torchtext.vocab import build_vocab_from_iterator\n",
    "import io\n",
    "import zipfile\n",
    "import torch\n",
    "\n",
    "\n",
    "with zipfile.ZipFile(\"challenging_america_50k_texts.zip\") as zf:\n",
    "    with io.TextIOWrapper(zf.open(\"challenging_america_50k_texts.txt\"), encoding=\"utf-8\") as f:\n",
    "        data = f.readlines()\n",
    "\n",
    "def get_words_from_line(line):\n",
    "    line = line.rstrip()\n",
    "    for t in line.split():\n",
    "        yield t\n",
    "\n",
    "\n",
    "def get_word_lines_from_list(data):\n",
    "    for line in data:\n",
    "        yield get_words_from_line(line)\n",
    "\n",
    "vocab_size = 3000\n",
    "vocab = build_vocab_from_iterator(\n",
    "    get_word_lines_from_list(data),\n",
    "    max_tokens=vocab_size,\n",
    "    specials=['<unk>'])\n",
    "vocab.set_default_index(vocab['<unk>'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Accessing (word, index) pairs from the vocabulary"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<unk> 0\n",
      "witness 1755\n",
      "Supreme 2520\n",
      "seems 577\n",
      "her 51\n",
      "! 503\n",
      "were 40\n",
      "Messrs. 1911\n",
      "small 282\n",
      "council 2064\n",
      "but 35\n"
     ]
    }
   ],
   "source": [
    "vocab_dict = vocab.get_stoi()\n",
    "for x, key in enumerate(vocab_dict):\n",
    "    print(key, vocab_dict[key])\n",
    "    if x == 10:\n",
    "        break"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### For a given word (as a string), the vocabulary returns the index that the neural network will use; words not present in the vocabulary are mapped to the <unk> token with index 0, as configured by vocab.set_default_index(vocab['<unk>'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "5\n",
      "1\n",
      "208\n",
      "0\n",
      "0\n",
      "For a list of words:\n",
      "[5, 1]\n"
     ]
    }
   ],
   "source": [
    "print(vocab[\"a\"])\n",
    "print(vocab[\"the\"])\n",
    "print(vocab[\"John\"])\n",
    "print(vocab[\"awnifnawonf\"])\n",
    "print(vocab[\"Poznań\"])\n",
    "\n",
    "print(\"For a list of words:\")\n",
    "print(vocab.lookup_indices([\"a\", \"the\"]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### The inverse operation (looking up a word by its index)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "the\n",
      "attempt\n",
      "For a list of indices:\n",
      "['<unk>', 'the', 'attempt', 'drew']\n"
     ]
    }
   ],
   "source": [
    "print(vocab.lookup_token(1))\n",
    "print(vocab.lookup_token(1000))\n",
    "\n",
    "print(\"For a list of indices:\")\n",
    "print(vocab.lookup_tokens([0, 1, 1000, 2000]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### To feed a word (or several words) into a neural network, create a Tensor from the word's index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "tensor(1)\n",
      "tensor([ 1, 2020])\n"
     ]
    }
   ],
   "source": [
    "input_to_neural_network = torch.tensor(vocab[\"the\"])\n",
    "print(input_to_neural_network)\n",
    "\n",
    "input_to_neural_network = torch.tensor(vocab.lookup_indices([\"the\", \"current\"]))\n",
    "print(input_to_neural_network)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Word embeddings are obtained with the torch.nn.Embedding layer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Embedding values for the token with index 1 and the token with index 5\n",
      "tensor([[-1.0945, -0.1461, 1.2927],\n",
      "        [-0.0303, 0.5213, 1.1486]], grad_fn=<EmbeddingBackward0>)\n",
      "##########################################################\n",
      "Embedding values for a batch of tokens\n",
      "tensor([[[-1.0945, -0.1461, 1.2927],\n",
      "         [ 0.2963, 0.1083, 0.0797]],\n",
      "\n",
      "        [[-0.9783, 1.1639, 0.3828],\n",
      "         [ 1.1856, 1.1943, -0.5562]],\n",
      "\n",
      "        [[-0.3472, 0.5670, -1.2830],\n",
      "         [-1.0945, -0.1461, 1.2927]]], grad_fn=<EmbeddingBackward0>)\n"
     ]
    }
   ],
   "source": [
    "vocab_size = 10\n",
    "embedding_size = 3\n",
    "embedding_layer = torch.nn.Embedding(vocab_size, embedding_size)\n",
    "input_to_embedding_layer = torch.IntTensor([1, 5])\n",
    "print(\"Embedding values for the token with index 1 and the token with index 5\")\n",
    "print(embedding_layer(input_to_embedding_layer))\n",
    "\n",
    "batched_input_to_embedding_layer = torch.IntTensor([[1, 4],\n",
    "                                                    [2, 9],\n",
    "                                                    [3, 1]])\n",
    "print(\"##########################################################\")\n",
    "print(\"Embedding values for a batch of tokens\")\n",
    "print(embedding_layer(batched_input_to_embedding_layer))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### To combine the embeddings of several words (when training an n-gram model beyond unigram/bigram), use concatenation; in PyTorch this is done with the torch.cat function"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "tensor([0.5000, 0.2000, 0.2000, 0.9100, 0.9200, 0.9300])\n"
     ]
    }
   ],
   "source": [
    "first_embedding = torch.Tensor([0.5, 0.2, 0.2])\n",
    "second_embedding = torch.Tensor([0.91, 0.92, 0.93])\n",
    "\n",
    "concatenated_embeddings = torch.cat([first_embedding, second_embedding])\n",
    "print(concatenated_embeddings)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### When working with batches, we must pay attention to the correct axis (dim) along which the concatenation is performed"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Incorrect concatenation of embeddings (creates 4 tensors of size 3 instead of 2 tensors of size 6)\n",
      "tensor([[0.5000, 0.2000, 0.2000],\n",
      "        [0.5100, 0.2100, 0.2100],\n",
      "        [0.9100, 0.9200, 0.9300],\n",
      "        [0.9110, 0.9220, 0.9330]])\n",
      "#########################################################################\n",
      "Correct concatenation of embeddings thanks to the proper axis value\n",
      "Now the first 3 values of each row (for the first tensor: 0.5000, 0.2000, 0.2000) represent the embedding of the first word\n",
      "and the next 3 values represent the embedding of the second word (0.9100, 0.9200, 0.9300)\n",
      "tensor([[0.5000, 0.2000, 0.2000, 0.9100, 0.9200, 0.9300],\n",
      "        [0.5100, 0.2100, 0.2100, 0.9110, 0.9220, 0.9330]])\n"
     ]
    }
   ],
   "source": [
    "first_batched_embedding = torch.Tensor([[0.5, 0.2, 0.2],\n",
    "                                        [0.51, 0.21, 0.21]])\n",
    "second_batched_embedding = torch.Tensor([[0.91, 0.92, 0.93],\n",
    "                                         [0.911, 0.922, 0.933]])\n",
    "\n",
    "concatenated_embeddings = torch.cat([first_batched_embedding, second_batched_embedding])\n",
    "print(\"Incorrect concatenation of embeddings (creates 4 tensors of size 3 instead of 2 tensors of size 6)\")\n",
    "print(concatenated_embeddings)\n",
    "\n",
    "properly_concatenated_embeddings = torch.cat([first_batched_embedding, second_batched_embedding], axis=1)\n",
    "\n",
    "print(\"#########################################################################\")\n",
    "print(\"Correct concatenation of embeddings thanks to the proper axis value\")\n",
    "print(\"Now the first 3 values of each row (for the first tensor: 0.5000, 0.2000, 0.2000) represent the embedding of the first word\")\n",
    "print(\"and the next 3 values represent the embedding of the second word (0.9100, 0.9200, 0.9300)\")\n",
    "print(properly_concatenated_embeddings)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "python39",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.18"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}