aitech-moj/cw/15_Model_transformer_autoregresywny.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![Logo 1](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech1.jpg)\n",
    "<div class=\"alert alert-block alert-info\">\n",
    "<h1> Language Modeling</h1>\n",
    "<h2> 15. <i>Autoregressive Transformer Model</i>  [exercises]</h2> \n",
    "<h3> Jakub Pokrywka (2022)</h3>\n",
    "</div>\n",
    "\n",
    "![Logo 2](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech2.jpg)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "https://huggingface.co/gpt2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Requirement already satisfied: transformers in /home/kuba/anaconda3/envs/zajeciaei/lib/python3.10/site-packages (4.19.2)\n",
      "Requirement already satisfied: tqdm>=4.27 in /home/kuba/anaconda3/envs/zajeciaei/lib/python3.10/site-packages (from transformers) (4.64.0)\n",
      "Requirement already satisfied: numpy>=1.17 in /home/kuba/anaconda3/envs/zajeciaei/lib/python3.10/site-packages (from transformers) (1.22.3)\n",
      "Requirement already satisfied: requests in /home/kuba/anaconda3/envs/zajeciaei/lib/python3.10/site-packages (from transformers) (2.27.1)\n",
      "Requirement already satisfied: packaging>=20.0 in /home/kuba/anaconda3/envs/zajeciaei/lib/python3.10/site-packages (from transformers) (21.3)\n",
      "Requirement already satisfied: tokenizers!=0.11.3,<0.13,>=0.11.1 in /home/kuba/anaconda3/envs/zajeciaei/lib/python3.10/site-packages (from transformers) (0.12.1)\n",
      "Requirement already satisfied: huggingface-hub<1.0,>=0.1.0 in /home/kuba/anaconda3/envs/zajeciaei/lib/python3.10/site-packages (from transformers) (0.6.0)\n",
      "Requirement already satisfied: filelock in /home/kuba/anaconda3/envs/zajeciaei/lib/python3.10/site-packages (from transformers) (3.7.0)\n",
      "Requirement already satisfied: regex!=2019.12.17 in /home/kuba/anaconda3/envs/zajeciaei/lib/python3.10/site-packages (from transformers) (2022.4.24)\n",
      "Requirement already satisfied: pyyaml>=5.1 in /home/kuba/anaconda3/envs/zajeciaei/lib/python3.10/site-packages (from transformers) (6.0)\n",
      "Requirement already satisfied: typing-extensions>=3.7.4.3 in /home/kuba/anaconda3/envs/zajeciaei/lib/python3.10/site-packages (from huggingface-hub<1.0,>=0.1.0->transformers) (4.1.1)\n",
      "Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /home/kuba/anaconda3/envs/zajeciaei/lib/python3.10/site-packages (from packaging>=20.0->transformers) (3.0.8)\n",
      "Requirement already satisfied: certifi>=2017.4.17 in /home/kuba/anaconda3/envs/zajeciaei/lib/python3.10/site-packages (from requests->transformers) (2020.6.20)\n",
      "Requirement already satisfied: idna<4,>=2.5 in /home/kuba/anaconda3/envs/zajeciaei/lib/python3.10/site-packages (from requests->transformers) (3.3)\n",
      "Requirement already satisfied: charset-normalizer~=2.0.0 in /home/kuba/anaconda3/envs/zajeciaei/lib/python3.10/site-packages (from requests->transformers) (2.0.4)\n",
      "Requirement already satisfied: urllib3<1.27,>=1.21.1 in /home/kuba/anaconda3/envs/zajeciaei/lib/python3.10/site-packages (from requests->transformers) (1.26.9)\n"
     ]
    }
   ],
   "source": [
    "!pip install transformers"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "import torch"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "from transformers import pipeline, set_seed, AutoTokenizer, AutoModel, AutoModelForCausalLM"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Sample text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "TEXT = 'Today, on my way to the university,'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Using the model with the transformers library"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "model_name = \"gpt2\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If inference takes too long or you run out of RAM, use a smaller model:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "# model_name = 'distilgpt2'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "tokenizer = AutoTokenizer.from_pretrained(model_name)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "encoding = tokenizer(TEXT)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'input_ids': [8888, 11, 319, 616, 835, 284, 262, 6403, 11], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "encoding"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "8888 \t Today\n",
      "11 \t ,\n",
      "319 \t  on\n",
      "616 \t  my\n",
      "835 \t  way\n",
      "284 \t  to\n",
      "262 \t  the\n",
      "6403 \t  university\n",
      "11 \t ,\n"
     ]
    }
   ],
   "source": [
    "for token in encoding['input_ids']:\n",
    "    print(token, '\\t', tokenizer.decode(token))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "pt_model = AutoModel.from_pretrained(model_name)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'input_ids': [8888, 11, 319, 616, 835, 284, 262, 6403, 11], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "encoding"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The call below raises an error, because the model expects tensors as input, while `encoding` currently holds plain Python lists."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pt_model(**encoding)"
   ]
  },
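  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of what the model needs, assuming only `torch`: each list has to become a tensor with a leading batch dimension. The `to_tensors` helper is hypothetical and written only for illustration; the token ids are copied from the tokenizer output above. The tokenizer can also return tensors directly when called with `return_tensors='pt'`.\n",
    "\n",
    "```python\n",
    "import torch\n",
    "\n",
    "# Token ids copied from the tokenizer output shown earlier in this notebook.\n",
    "encoding_lists = {\n",
    "    'input_ids': [8888, 11, 319, 616, 835, 284, 262, 6403, 11],\n",
    "    'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1],\n",
    "}\n",
    "\n",
    "def to_tensors(enc):\n",
    "    # Hypothetical helper: wrap each list in an outer list to add a batch dimension.\n",
    "    return {name: torch.tensor([values]) for name, values in enc.items()}\n",
    "\n",
    "encoding_tensors = to_tensors(encoding_lists)\n",
    "print(encoding_tensors['input_ids'].shape)  # torch.Size([1, 9])\n",
    "```"
   ]
  },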
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Today, on my way to the university,'"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "TEXT"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "encoding = tokenizer(TEXT, return_tensors='pt')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "?pt_model.forward"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "output = pt_model(**encoding, output_hidden_states=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "output"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "torch.Size([1, 9, 768])"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "output.hidden_states[0].shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "torch.Size([1, 9, 768])"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "output.hidden_states[1].shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "torch.Size([1, 9, 768])"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "output.hidden_states[2].shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "13"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(output.hidden_states)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "torch.Size([1, 9, 768])"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "output.last_hidden_state.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [],
   "source": [
    "pt_model = AutoModelForCausalLM.from_pretrained(model_name)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [],
   "source": [
    "output = pt_model(**encoding)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "output"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tensor([[[ -36.3292,  -36.3402,  -40.4228,  ...,  -46.0234,  -44.5284,\n",
       "           -37.1276],\n",
       "         [-114.9346, -116.5035, -117.9236,  ..., -117.8857, -119.3379,\n",
       "          -112.9298],\n",
       "         [-123.5036, -123.0548, -127.3876,  ..., -130.5238, -130.5279,\n",
       "          -123.2711],\n",
       "         ...,\n",
       "         [-101.3852, -101.2506, -103.6583,  ..., -103.3747, -107.7192,\n",
       "           -99.4521],\n",
       "         [ -83.0701,  -84.3884,  -91.9513,  ...,  -91.7482,  -93.3971,\n",
       "           -85.1204],\n",
       "         [ -91.2749,  -93.1332,  -93.6408,  ...,  -94.3482,  -93.4517,\n",
       "           -90.1472]]], grad_fn=<UnsafeViewBackward0>)"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "output[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "torch.Size([1, 9, 50257])"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "output[0].shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "torch.return_types.topk(\n",
       "values=tensor([[ -32.8755,  -33.1021,  -33.9975,  -34.4861,  -34.5463],\n",
       "        [-105.5972, -106.3818, -106.3978, -106.9693, -107.0778],\n",
       "        [-113.2521, -114.7346, -114.8781, -114.9605, -115.0834],\n",
       "        [-118.2435, -119.2980, -119.5907, -119.6229, -119.7969],\n",
       "        [ -83.6241,  -84.6822,  -84.8526,  -85.4978,  -86.6938],\n",
       "        [ -79.9051,  -80.3284,  -81.6157,  -81.8538,  -82.9018],\n",
       "        [ -90.4443,  -90.7053,  -91.9059,  -92.0003,  -92.1531],\n",
       "        [ -75.2650,  -76.9698,  -77.5753,  -77.6700,  -77.8095],\n",
       "        [ -78.7985,  -81.5545,  -81.6846,  -81.8984,  -82.5938]],\n",
       "       grad_fn=<TopkBackward0>),\n",
       "indices=tensor([[   11,    13,   198,   290,   286],\n",
       "        [  262,   356,   314,   340,   257],\n",
       "        [  262,   257,  1737,  2901,  2805],\n",
       "        [  835,   717,   938, 10955,  1218],\n",
       "        [  284,   736,  1363,   503,   422],\n",
       "        [  670,   262,   616,   257,  1524],\n",
       "        [ 9003,  2607, 11550,  4436,  4495],\n",
       "        [   11,   314,   338,   284,   287],\n",
       "        [  314,   616,   257,   262,   612]]))"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "torch.topk(output[0][0],5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tensor([8888,   11,  319,  616,  835,  284,  262, 6403,   11])"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "encoding.input_ids[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Today, \t→  the\n",
      "Today, on \t→  the\n",
      "Today, on my \t→  way\n",
      "Today, on my way \t→  to\n",
      "Today, on my way to \t→  work\n",
      "Today, on my way to the \t→  airport\n",
      "Today, on my way to the university \t→ ,\n",
      "Today, on my way to the university, \t→  I\n"
     ]
    }
   ],
   "source": [
    "for i in range(1,len(encoding.input_ids[0])):\n",
    "    print(tokenizer.decode(encoding.input_ids[0][:i+1]), '\\t→', tokenizer.decode(torch.topk(output[0][0],1).indices[i]))"
   ]
  },
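  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The values returned by the causal language model are logits, i.e. unnormalized scores over the vocabulary. A small sketch, using made-up logits for a hypothetical 5-token vocabulary (not real GPT-2 output), of how softmax turns them into probabilities and how `torch.topk` picks the most likely candidates:\n",
    "\n",
    "```python\n",
    "import torch\n",
    "\n",
    "# Made-up logits for a hypothetical 5-token vocabulary.\n",
    "logits = torch.tensor([-32.9, -33.1, -34.0, -34.5, -40.0])\n",
    "\n",
    "# Softmax makes the scores positive and normalizes them to sum to 1.\n",
    "probs = torch.softmax(logits, dim=-1)\n",
    "\n",
    "# Softmax preserves the ranking, so top-k on probabilities\n",
    "# selects the same tokens as top-k on logits.\n",
    "top = torch.topk(probs, k=3)\n",
    "print(top.indices.tolist())  # [0, 1, 2]\n",
    "print(float(probs.sum()))    # approximately 1.0\n",
    "```"
   ]
  },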
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Text generation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'input_ids': tensor([[8888,   11,  319,  616,  835,  284,  262, 6403,   11]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1]])}"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "encoding"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [],
   "source": [
    "text = TEXT"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Today, on my way to the university,'"
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'input_ids': tensor([[8888,   11,  319,  616,  835,  284,  262, 6403,   11]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1]])}"
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "encoding"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [],
   "source": [
    "encoding = tokenizer(text, return_tensors='pt')"
   ]
  },
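  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Greedy decoding: append the single most likely next token, 10 times.\n",
    "for i in range(10):\n",
    "    output = pt_model(**encoding)\n",
    "    # output[0][0][-1] holds the logits for the token that follows the last position.\n",
    "    text += tokenizer.decode(torch.topk(output[0][0][-1], 1).indices)\n",
    "    encoding = tokenizer(text, return_tensors='pt')"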
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Today, on my way to the university, I was approached by a man who was a student'"
      ]
     },
     "execution_count": 40,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "text"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "https://towardsdatascience.com/decoding-strategies-that-you-need-to-know-for-response-generation-ba95ee0faadc"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What can be done to improve the result? Decoding strategies:\n",
    "\n",
    "- greedy search\n",
    "- random sampling\n",
    "- random sampling with temperature\n",
    "- top-k sampling or top-k sampling with temperature\n",
    "- top-p sampling (also known as nucleus sampling) or top-p sampling with temperature\n"
   ]
  },
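  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The strategies listed above differ only in how the next token is chosen from the model's logits. A rough sketch on made-up logits, using only `torch`; `sample_next_token` is a hypothetical helper written for illustration and is not part of `transformers`:\n",
    "\n",
    "```python\n",
    "import torch\n",
    "\n",
    "def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None):\n",
    "    # Temperature below 1 sharpens the distribution, above 1 flattens it.\n",
    "    logits = logits / temperature\n",
    "    if top_k is not None:\n",
    "        # Top-k: keep only the k highest-scoring tokens.\n",
    "        kth_best = torch.topk(logits, top_k).values[-1]\n",
    "        logits = logits.masked_fill(logits < kth_best, float('-inf'))\n",
    "    probs = torch.softmax(logits, dim=-1)\n",
    "    if top_p is not None:\n",
    "        # Top-p: keep the smallest set of tokens whose total probability reaches top_p.\n",
    "        sorted_probs, sorted_idx = torch.sort(probs, descending=True)\n",
    "        mass_before = torch.cumsum(sorted_probs, dim=-1) - sorted_probs\n",
    "        probs[sorted_idx[mass_before >= top_p]] = 0.0\n",
    "        probs = probs / probs.sum()\n",
    "    # Draw one token id at random according to the remaining probabilities.\n",
    "    return torch.multinomial(probs, num_samples=1).item()\n",
    "\n",
    "torch.manual_seed(0)\n",
    "logits = torch.tensor([2.0, 1.0, 0.5, -1.0])       # made-up scores for 4 tokens\n",
    "print(int(torch.argmax(logits)))                   # greedy search: always token 0\n",
    "print(sample_next_token(logits))                   # random sampling\n",
    "print(sample_next_token(logits, temperature=0.7))  # random sampling with temperature\n",
    "print(sample_next_token(logits, top_k=2))          # top-k sampling: token 0 or 1\n",
    "print(sample_next_token(logits, top_p=0.9))        # top-p (nucleus) sampling\n",
    "```\n",
    "\n",
    "Greedy search is deterministic, while the sampling variants trade some likelihood for diversity."
   ]
  },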
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### https://huggingface.co/tasks"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## pipeline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [],
   "source": [
    "generator = pipeline('text-generation', model=model_name)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Today, on my way to the university,'"
      ]
     },
     "execution_count": 42,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "TEXT"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[{'generated_text': 'Today, on my way to the university, some of them would have been very pleased, and I'},\n",
       " {'generated_text': 'Today, on my way to the university, and he made me dinner, and he called me back'},\n",
       " {'generated_text': 'Today, on my way to the university, I saw three white girls who seemed a bit different—'},\n",
       " {'generated_text': 'Today, on my way to the university, I drove through the town, past trees and bushes,'},\n",
       " {'generated_text': 'Today, on my way to the university, I saw an elderly lady come up behind me.\"\\n'}]"
      ]
     },
     "execution_count": 43,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "generator(TEXT, max_length=20, num_return_sequences=5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "https://huggingface.co/docs/transformers/main_classes/text_generation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[{'generated_text': 'Today, on my way to the university, I was approached by a man who was a student at'}]"
      ]
     },
     "execution_count": 44,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "generator(TEXT, max_length=20, num_beams=1, do_sample=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[{'generated_text': 'Today, on my way to the university, I was approached by a man who was very nice and'}]"
      ]
     },
     "execution_count": 45,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "generator(TEXT, max_length=20, num_beams=10, top_p = 0.2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[{'generated_text': 'Today, on my way to the university, I was approached by a group of students who asked me'}]"
      ]
     },
     "execution_count": 46,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "generator(TEXT, max_length=20, num_beams=10, temperature = 1.0 )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[{'generated_text': 'Today, on my way to the university, I noticed some young boys who was very active on campus'}]"
      ]
     },
     "execution_count": 47,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "generator(TEXT, max_length=20, num_beams=10, temperature = 10.0 )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[{'generated_text': 'Today, on my way to the university, the trainees have noticed how a car could become an'}]"
      ]
     },
     "execution_count": 48,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "generator(TEXT, max_length=20, num_beams=10,  temperature = 100.0 )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Other generation options:\n",
    "\n",
    "- repetition_penalty\n",
    "- length_penalty\n",
    "- no_repeat_ngram_size\n",
    "- bad_words_ids\n",
    "- force_words_ids"
   ]
  },
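  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "These options are keyword arguments of text generation and can be passed through the pipeline call. A hedged sketch, assuming the `gpt2` checkpoint can be downloaded; the parameter values are arbitrary choices for illustration, and no output is shown because it depends on the library version:\n",
    "\n",
    "```python\n",
    "from transformers import pipeline\n",
    "\n",
    "generator = pipeline('text-generation', model='gpt2')\n",
    "result = generator(\n",
    "    'Today, on my way to the university,',\n",
    "    max_length=30,\n",
    "    do_sample=False,          # deterministic decoding, so only the penalties vary the text\n",
    "    repetition_penalty=1.2,   # values above 1.0 penalize tokens that already appeared\n",
    "    no_repeat_ngram_size=2,   # forbid any 2-gram from occurring twice\n",
    ")\n",
    "print(result[0]['generated_text'])\n",
    "```"
   ]
  },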
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Hugging Face API"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "https://huggingface.co/gpt2?text=Today%2C+on+my+way+to+the+university"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {},
   "outputs": [],
   "source": [
    "from transformers import CTRLTokenizer, CTRLModel"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {},
   "outputs": [],
   "source": [
    "tokenizer = CTRLTokenizer.from_pretrained(\"ctrl\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## CTRL"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {},
   "outputs": [],
   "source": [
    "inputs = tokenizer(\"Opinion My dog is cute\", return_tensors=\"pt\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'input_ids': tensor([[43213,   586,  3153,     8, 83781]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1]])}"
      ]
     },
     "execution_count": 52,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "inputs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'Pregnancy': 168629,\n",
       " 'Christianity': 7675,\n",
       " 'Explain': 106423,\n",
       " 'Fitness': 63440,\n",
       " 'Saving': 63163,\n",
       " 'Ask': 27171,\n",
       " 'Ass': 95985,\n",
       " 'Joke': 163509,\n",
       " 'Questions': 45622,\n",
       " 'Thoughts': 49605,\n",
       " 'Retail': 52342,\n",
       " 'Feminism': 164338,\n",
       " 'Writing': 11992,\n",
       " 'Atheism': 192263,\n",
       " 'Netflix': 48616,\n",
       " 'Computing': 39639,\n",
       " 'Opinion': 43213,\n",
       " 'Alone': 44967,\n",
       " 'Funny': 58917,\n",
       " 'Gaming': 40358,\n",
       " 'Human': 4088,\n",
       " 'India': 1331,\n",
       " 'Joker': 77138,\n",
       " 'Diet': 36206,\n",
       " 'Legal': 11859,\n",
       " 'Norman': 4939,\n",
       " 'Tip': 72689,\n",
       " 'Weight': 52343,\n",
       " 'Movies': 46273,\n",
       " 'Running': 23425,\n",
       " 'Science': 2090,\n",
       " 'Horror': 37793,\n",
       " 'Confession': 60572,\n",
       " 'Finance': 12250,\n",
       " 'Politics': 16360,\n",
       " 'Scary': 191985,\n",
       " 'Support': 12654,\n",
       " 'Technologies': 32516,\n",
       " 'Teenage': 66160,\n",
       " 'Event': 32769,\n",
       " 'Learned': 67460,\n",
       " 'Notion': 182770,\n",
       " 'Wikipedia': 37583,\n",
       " 'Books': 6665,\n",
       " 'Extract': 76050,\n",
       " 'Confessions': 102701,\n",
       " 'Conspiracy': 75932,\n",
       " 'Links': 63674,\n",
       " 'Narcissus': 150425,\n",
       " 'Relationship': 54766,\n",
       " 'Relationships': 134796,\n",
       " 'Reviews': 41671,\n",
       " 'News': 4256,\n",
       " 'Translation': 26820,\n",
       " 'multilingual': 128406}"
      ]
     },
     "execution_count": 53,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tokenizer.control_codes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/kuba/anaconda3/envs/zajeciaei/lib/python3.10/site-packages/transformers/models/ctrl/modeling_ctrl.py:43: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').\n",
      "  angle_rates = 1 / torch.pow(10000, (2 * (i // 2)) / d_model_size)\n"
     ]
    }
   ],
   "source": [
    "generator = pipeline('text-generation', model=\"ctrl\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "TEXT = \"Today\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "generator(\"Opinion \" + TEXT, max_length = 50)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[{'generated_text': 'Opinion Today I learned that the US government has been spying on the citizens of other countries for years. \\n Score: 6 \\n \\n Title: CMV: I think that the US should not be involved in the Middle East \\n Text: I think that the US'}]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "generator(\"Technologies \" + TEXT, max_length = 50)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[{'generated_text': 'Technologies Today \\n Score: 6 \\n \\n Title: The Internet is a great tool for the average person to get information and to share it with others. But it is also a great tool for the government to spy on us. \\n Score: 6 \\n \\n Title: The'}]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "generator(\"Gaming \" + TEXT, max_length = 50)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[{'generated_text': 'Gaming Today \\n Score: 6 \\n \\n Title: I just got a new gaming pc and I have a question \\n Text: I just got a new gaming pc and I have a question \\n \\n I have a monitor that I bought a while back'}]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Task\n",
    "\n",
    "Using GPT-2 or DistilGPT-2, generate predictions for the Challenging America challenge. The model does not need to be fine-tuned."
   ]
  }
 ],
 "metadata": {
  "author": "Jakub Pokrywka",
  "email": "kubapok@wmi.amu.edu.pl",
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "lang": "pl",
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.3"
  },
  "subtitle": "0.Informacje na temat przedmiotu[ćwiczenia]",
  "title": "Ekstrakcja informacji",
  "year": "2021"
 },
 "nbformat": 4,
 "nbformat_minor": 4
}