{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Logo 1](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech1.jpg)\n",
"<div class=\"alert alert-block alert-info\">\n",
"<h1> Language Modelling</h1>\n",
"<h2> 15. <i>The autoregressive transformer model</i> [lab]</h2> \n",
"<h3> Jakub Pokrywka (2022)</h3>\n",
"</div>\n",
"\n",
"![Logo 2](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech2.jpg)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"https://huggingface.co/gpt2"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Requirement already satisfied: transformers in /home/kuba/anaconda3/envs/zajeciaei/lib/python3.10/site-packages (4.19.2)\n",
"Requirement already satisfied: tqdm>=4.27 in /home/kuba/anaconda3/envs/zajeciaei/lib/python3.10/site-packages (from transformers) (4.64.0)\n",
"Requirement already satisfied: numpy>=1.17 in /home/kuba/anaconda3/envs/zajeciaei/lib/python3.10/site-packages (from transformers) (1.22.3)\n",
"Requirement already satisfied: requests in /home/kuba/anaconda3/envs/zajeciaei/lib/python3.10/site-packages (from transformers) (2.27.1)\n",
"Requirement already satisfied: packaging>=20.0 in /home/kuba/anaconda3/envs/zajeciaei/lib/python3.10/site-packages (from transformers) (21.3)\n",
"Requirement already satisfied: tokenizers!=0.11.3,<0.13,>=0.11.1 in /home/kuba/anaconda3/envs/zajeciaei/lib/python3.10/site-packages (from transformers) (0.12.1)\n",
"Requirement already satisfied: huggingface-hub<1.0,>=0.1.0 in /home/kuba/anaconda3/envs/zajeciaei/lib/python3.10/site-packages (from transformers) (0.6.0)\n",
"Requirement already satisfied: filelock in /home/kuba/anaconda3/envs/zajeciaei/lib/python3.10/site-packages (from transformers) (3.7.0)\n",
"Requirement already satisfied: regex!=2019.12.17 in /home/kuba/anaconda3/envs/zajeciaei/lib/python3.10/site-packages (from transformers) (2022.4.24)\n",
"Requirement already satisfied: pyyaml>=5.1 in /home/kuba/anaconda3/envs/zajeciaei/lib/python3.10/site-packages (from transformers) (6.0)\n",
"Requirement already satisfied: typing-extensions>=3.7.4.3 in /home/kuba/anaconda3/envs/zajeciaei/lib/python3.10/site-packages (from huggingface-hub<1.0,>=0.1.0->transformers) (4.1.1)\n",
"Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /home/kuba/anaconda3/envs/zajeciaei/lib/python3.10/site-packages (from packaging>=20.0->transformers) (3.0.8)\n",
"Requirement already satisfied: certifi>=2017.4.17 in /home/kuba/anaconda3/envs/zajeciaei/lib/python3.10/site-packages (from requests->transformers) (2020.6.20)\n",
"Requirement already satisfied: idna<4,>=2.5 in /home/kuba/anaconda3/envs/zajeciaei/lib/python3.10/site-packages (from requests->transformers) (3.3)\n",
"Requirement already satisfied: charset-normalizer~=2.0.0 in /home/kuba/anaconda3/envs/zajeciaei/lib/python3.10/site-packages (from requests->transformers) (2.0.4)\n",
"Requirement already satisfied: urllib3<1.27,>=1.21.1 in /home/kuba/anaconda3/envs/zajeciaei/lib/python3.10/site-packages (from requests->transformers) (1.26.9)\n"
]
}
],
"source": [
"!pip install transformers"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"import torch"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"from transformers import pipeline, set_seed, AutoTokenizer, AutoModel, AutoModelForCausalLM"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### sample text"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"TEXT = 'Today, on my way to the university,'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## using the model from the transformers library"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"model_name = \"gpt2\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"if inference takes too long or you run out of RAM, use a smaller model:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"# model_name = 'distilgpt2'"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"tokenizer = AutoTokenizer.from_pretrained(model_name)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"encoding = tokenizer(TEXT)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"{'input_ids': [8888, 11, 319, 616, 835, 284, 262, 6403, 11], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"encoding"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"8888 \t Today\n",
"11 \t ,\n",
"319 \t on\n",
"616 \t my\n",
"835 \t way\n",
"284 \t to\n",
"262 \t the\n",
"6403 \t university\n",
"11 \t ,\n"
]
}
],
"source": [
"for token in encoding['input_ids']:\n",
"    print(token, '\\t', tokenizer.decode(token))"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"pt_model = AutoModel.from_pretrained(model_name)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'input_ids': [8888, 11, 319, 616, 835, 284, 262, 6403, 11], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"encoding"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"the cell below will raise an error, because the model expects tensors as input"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pt_model(**encoding)"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Today, on my way to the university,'"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"TEXT"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"encoding = tokenizer(TEXT, return_tensors='pt')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"?pt_model.forward"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"output = pt_model(**encoding, output_hidden_states=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"output"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"torch.Size([1, 9, 768])"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"output.hidden_states[0].shape"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"torch.Size([1, 9, 768])"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"output.hidden_states[1].shape"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"torch.Size([1, 9, 768])"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"output.hidden_states[2].shape"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"13"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(output.hidden_states)\n"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"torch.Size([1, 9, 768])"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"output.last_hidden_state.shape"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [],
"source": [
"pt_model = AutoModelForCausalLM.from_pretrained(model_name)"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [],
"source": [
"output = pt_model(**encoding)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"output"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"tensor([[[ -36.3292, -36.3402, -40.4228, ..., -46.0234, -44.5284,\n",
" -37.1276],\n",
" [-114.9346, -116.5035, -117.9236, ..., -117.8857, -119.3379,\n",
" -112.9298],\n",
" [-123.5036, -123.0548, -127.3876, ..., -130.5238, -130.5279,\n",
" -123.2711],\n",
" ...,\n",
" [-101.3852, -101.2506, -103.6583, ..., -103.3747, -107.7192,\n",
" -99.4521],\n",
" [ -83.0701, -84.3884, -91.9513, ..., -91.7482, -93.3971,\n",
" -85.1204],\n",
" [ -91.2749, -93.1332, -93.6408, ..., -94.3482, -93.4517,\n",
" -90.1472]]], grad_fn=<UnsafeViewBackward0>)"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"output[0]"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"torch.Size([1, 9, 50257])"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"output[0].shape"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"torch.return_types.topk(\n",
"values=tensor([[ -32.8755, -33.1021, -33.9975, -34.4861, -34.5463],\n",
" [-105.5972, -106.3818, -106.3978, -106.9693, -107.0778],\n",
" [-113.2521, -114.7346, -114.8781, -114.9605, -115.0834],\n",
" [-118.2435, -119.2980, -119.5907, -119.6229, -119.7969],\n",
" [ -83.6241, -84.6822, -84.8526, -85.4978, -86.6938],\n",
" [ -79.9051, -80.3284, -81.6157, -81.8538, -82.9018],\n",
" [ -90.4443, -90.7053, -91.9059, -92.0003, -92.1531],\n",
" [ -75.2650, -76.9698, -77.5753, -77.6700, -77.8095],\n",
" [ -78.7985, -81.5545, -81.6846, -81.8984, -82.5938]],\n",
" grad_fn=<TopkBackward0>),\n",
"indices=tensor([[ 11, 13, 198, 290, 286],\n",
" [ 262, 356, 314, 340, 257],\n",
" [ 262, 257, 1737, 2901, 2805],\n",
" [ 835, 717, 938, 10955, 1218],\n",
" [ 284, 736, 1363, 503, 422],\n",
" [ 670, 262, 616, 257, 1524],\n",
" [ 9003, 2607, 11550, 4436, 4495],\n",
" [ 11, 314, 338, 284, 287],\n",
" [ 314, 616, 257, 262, 612]]))"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"torch.topk(output[0][0],5)"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"tensor([8888, 11, 319, 616, 835, 284, 262, 6403, 11])"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"encoding.input_ids[0]"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Today, \t→ the\n",
"Today, on \t→ the\n",
"Today, on my \t→ way\n",
"Today, on my way \t→ to\n",
"Today, on my way to \t→ work\n",
"Today, on my way to the \t→ airport\n",
"Today, on my way to the university \t→ ,\n",
"Today, on my way to the university, \t→ I\n"
]
}
],
"source": [
"for i in range(1, len(encoding.input_ids[0])):\n",
"    print(tokenizer.decode(encoding.input_ids[0][:i+1]), '\\t→', tokenizer.decode(torch.topk(output[0][0],1).indices[i]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### text generation"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"{'input_ids': tensor([[8888, 11, 319, 616, 835, 284, 262, 6403, 11]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1]])}"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"encoding"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [],
"source": [
"text = TEXT"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Today, on my way to the university,'"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"text"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'input_ids': tensor([[8888, 11, 319, 616, 835, 284, 262, 6403, 11]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1]])}"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"encoding"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [],
"source": [
"encoding = tokenizer(text, return_tensors='pt')"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [],
"source": [
"for i in range(10):\n",
"    output = pt_model(**encoding)\n",
"    text += tokenizer.decode(torch.topk(output[0][0][-1],1).indices)\n",
"    encoding = tokenizer(text, return_tensors='pt')"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Today, on my way to the university, I was approached by a man who was a student'"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"text"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"https://towardsdatascience.com/decoding-strategies-that-you-need-to-know-for-response-generation-ba95ee0faadc"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What can we do to improve the result? Decoding strategies (a short sketch follows below):\n",
"\n",
"- greedy search\n",
"- random sampling\n",
"- random sampling with temperature\n",
"- top-k sampling or top-k sampling with temperature\n",
"- top-p sampling (also known as nucleus sampling) or top-p sampling with temperature\n"
]
},
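{
"cell_type": "markdown",
"metadata": {},
"source": [
"Below is a minimal sketch (added for illustration, not part of the original material) of how the strategies listed above map onto the parameters of `generate()`. It reuses the `pt_model` (causal LM head) and GPT-2 `tokenizer` defined earlier; the concrete parameter values are arbitrary."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# sketch: the same prompt decoded with different strategies\n",
"set_seed(42)  # sampling is stochastic, so fix the seed for reproducibility\n",
"prompt = tokenizer(TEXT, return_tensors='pt')\n",
"\n",
"strategies = {\n",
"    'greedy search': dict(do_sample=False),\n",
"    'random sampling': dict(do_sample=True),\n",
"    'sampling with temperature': dict(do_sample=True, temperature=0.7),\n",
"    'top-k sampling': dict(do_sample=True, top_k=50),\n",
"    'top-p (nucleus) sampling': dict(do_sample=True, top_p=0.9),\n",
"}\n",
"\n",
"for name, kwargs in strategies.items():\n",
"    out = pt_model.generate(**prompt, max_length=20,\n",
"                            pad_token_id=tokenizer.eos_token_id, **kwargs)\n",
"    print(name, '\\t→', tokenizer.decode(out[0]))"
]
},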
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### https://huggingface.co/tasks"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## pipeline"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [],
"source": [
"generator = pipeline('text-generation', model=model_name)"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Today, on my way to the university,'"
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"TEXT"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.\n"
]
},
{
"data": {
"text/plain": [
"[{'generated_text': 'Today, on my way to the university, some of them would have been very pleased, and I'},\n",
" {'generated_text': 'Today, on my way to the university, and he made me dinner, and he called me back'},\n",
" {'generated_text': 'Today, on my way to the university, I saw three white girls who seemed a bit different—'},\n",
" {'generated_text': 'Today, on my way to the university, I drove through the town, past trees and bushes,'},\n",
" {'generated_text': 'Today, on my way to the university, I saw an elderly lady come up behind me.\"\\n'}]"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"generator(TEXT, max_length=20, num_return_sequences=5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"https://huggingface.co/docs/transformers/main_classes/text_generation"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.\n"
]
},
{
"data": {
"text/plain": [
"[{'generated_text': 'Today, on my way to the university, I was approached by a man who was a student at'}]"
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"generator(TEXT, max_length=20, num_beams=1, do_sample=False)"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.\n"
]
},
{
"data": {
"text/plain": [
"[{'generated_text': 'Today, on my way to the university, I was approached by a man who was very nice and'}]"
]
},
"execution_count": 45,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"generator(TEXT, max_length=20, num_beams=10, top_p=0.2)"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.\n"
]
},
{
"data": {
"text/plain": [
"[{'generated_text': 'Today, on my way to the university, I was approached by a group of students who asked me'}]"
]
},
"execution_count": 46,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"generator(TEXT, max_length=20, num_beams=10, temperature=1.0)"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.\n"
]
},
{
"data": {
"text/plain": [
"[{'generated_text': 'Today, on my way to the university, I noticed some young boys who was very active on campus'}]"
]
},
"execution_count": 47,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"generator(TEXT, max_length=20, num_beams=10, temperature=10.0)"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.\n"
]
},
{
"data": {
"text/plain": [
"[{'generated_text': 'Today, on my way to the university, the trainees have noticed how a car could become an'}]"
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"generator(TEXT, max_length=20, num_beams=10, temperature=100.0)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"other options (a sketch of a few of them follows below):\n",
"\n",
"- repetition_penalty\n",
"- length_penalty\n",
"- no_repeat_ngram_size\n",
"- bad_words_ids\n",
"- force_words_ids"
]
},
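{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick sketch (added for illustration) of how a few of these options can be passed through the pipeline; the concrete values are arbitrary. Note that `bad_words_ids` expects token ids, so the banned word is first run through the tokenizer."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# sketch: discourage repetition and ban one word during generation\n",
"bad_words = tokenizer([' airport'], add_special_tokens=False).input_ids\n",
"\n",
"generator(TEXT,\n",
"          max_length=30,\n",
"          no_repeat_ngram_size=2,\n",
"          repetition_penalty=1.3,\n",
"          bad_words_ids=bad_words)"
]
},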
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## huggingface API"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"https://huggingface.co/gpt2?text=Today%2C+on+my+way+to+the+university"
]
},
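{
"cell_type": "markdown",
"metadata": {},
"source": [
"The model page above can also be queried through the hosted Inference API. A minimal sketch with `requests` is shown below (an assumption-laden example: you need your own access token, and the endpoint and parameter names are the ones documented at the time of writing and may change)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import requests\n",
"\n",
"API_URL = 'https://api-inference.huggingface.co/models/gpt2'\n",
"HF_TOKEN = 'hf_...'  # replace with your own Hugging Face access token\n",
"\n",
"response = requests.post(API_URL,\n",
"                         headers={'Authorization': f'Bearer {HF_TOKEN}'},\n",
"                         json={'inputs': TEXT, 'parameters': {'max_new_tokens': 10}})\n",
"response.json()"
]
},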
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## CTRL"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [],
"source": [
"from transformers import CTRLTokenizer, CTRLModel"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {},
"outputs": [],
"source": [
"tokenizer = CTRLTokenizer.from_pretrained(\"ctrl\")"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [],
"source": [
"inputs = tokenizer(\"Opinion My dog is cute\", return_tensors=\"pt\")"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'input_ids': tensor([[43213, 586, 3153, 8, 83781]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1]])}"
]
},
"execution_count": 52,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"inputs"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'Pregnancy': 168629,\n",
" 'Christianity': 7675,\n",
" 'Explain': 106423,\n",
" 'Fitness': 63440,\n",
" 'Saving': 63163,\n",
" 'Ask': 27171,\n",
" 'Ass': 95985,\n",
" 'Joke': 163509,\n",
" 'Questions': 45622,\n",
" 'Thoughts': 49605,\n",
" 'Retail': 52342,\n",
" 'Feminism': 164338,\n",
" 'Writing': 11992,\n",
" 'Atheism': 192263,\n",
" 'Netflix': 48616,\n",
" 'Computing': 39639,\n",
" 'Opinion': 43213,\n",
" 'Alone': 44967,\n",
" 'Funny': 58917,\n",
" 'Gaming': 40358,\n",
" 'Human': 4088,\n",
" 'India': 1331,\n",
" 'Joker': 77138,\n",
" 'Diet': 36206,\n",
" 'Legal': 11859,\n",
" 'Norman': 4939,\n",
" 'Tip': 72689,\n",
" 'Weight': 52343,\n",
" 'Movies': 46273,\n",
" 'Running': 23425,\n",
" 'Science': 2090,\n",
" 'Horror': 37793,\n",
" 'Confession': 60572,\n",
" 'Finance': 12250,\n",
" 'Politics': 16360,\n",
" 'Scary': 191985,\n",
" 'Support': 12654,\n",
" 'Technologies': 32516,\n",
" 'Teenage': 66160,\n",
" 'Event': 32769,\n",
" 'Learned': 67460,\n",
" 'Notion': 182770,\n",
" 'Wikipedia': 37583,\n",
" 'Books': 6665,\n",
" 'Extract': 76050,\n",
" 'Confessions': 102701,\n",
" 'Conspiracy': 75932,\n",
" 'Links': 63674,\n",
" 'Narcissus': 150425,\n",
" 'Relationship': 54766,\n",
" 'Relationships': 134796,\n",
" 'Reviews': 41671,\n",
" 'News': 4256,\n",
" 'Translation': 26820,\n",
" 'multilingual': 128406}"
]
},
"execution_count": 53,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tokenizer.control_codes"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/kuba/anaconda3/envs/zajeciaei/lib/python3.10/site-packages/transformers/models/ctrl/modeling_ctrl.py:43: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').\n",
" angle_rates = 1 / torch.pow(10000, (2 * (i // 2)) / d_model_size)\n"
]
}
],
"source": [
"generator = pipeline('text-generation', model=\"ctrl\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"TEXT = \"Today\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"generator(\"Opinion \" + TEXT, max_length=50)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[{'generated_text': 'Opinion Today I learned that the US government has been spying on the citizens of other countries for years. \\n Score: 6 \\n \\n Title: CMV: I think that the US should not be involved in the Middle East \\n Text: I think that the US'}]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"generator(\"Technologies \" + TEXT, max_length=50)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[{'generated_text': 'Technologies Today \\n Score: 6 \\n \\n Title: The Internet is a great tool for the average person to get information and to share it with others. But it is also a great tool for the government to spy on us. \\n Score: 6 \\n \\n Title: The'}]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"generator(\"Gaming \" + TEXT, max_length=50)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[{'generated_text': 'Gaming Today \\n Score: 6 \\n \\n Title: I just got a new gaming pc and I have a question \\n Text: I just got a new gaming pc and I have a question \\n \\n I have a monitor that I bought a while back'}]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Exercise\n",
"\n",
"Use GPT-2 or distilGPT2 to generate answers for the Challenging America challenge. You do not need to fine-tune the model. (A rough starting sketch follows below.)"
]
},
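{
"cell_type": "markdown",
"metadata": {},
"source": [
"A possible starting point (only a sketch, not the reference solution): feed the left context of each gap to GPT-2 and turn the top-k next-token probabilities into word:probability guesses. The exact input and output format of the challenge is an assumption here and should be checked against the task description."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# sketch: top-k next-word guesses for a single left context (assumed output format)\n",
"gpt2_tokenizer = AutoTokenizer.from_pretrained('gpt2')\n",
"gpt2_model = AutoModelForCausalLM.from_pretrained('gpt2')\n",
"\n",
"def guesses(left_context, k=10):\n",
"    enc = gpt2_tokenizer(left_context, return_tensors='pt')\n",
"    with torch.no_grad():\n",
"        logits = gpt2_model(**enc).logits[0, -1]\n",
"    top = torch.topk(torch.softmax(logits, dim=0), k)\n",
"    return ' '.join(f'{gpt2_tokenizer.decode(t).strip()}:{p.item():.4f}'\n",
"                    for t, p in zip(top.indices, top.values))\n",
"\n",
"guesses('Today, on my way to the')"
]
}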
],
"metadata": {
"author": "Jakub Pokrywka",
"email": "kubapok@wmi.amu.edu.pl",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"lang": "en",
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.3"
},
"subtitle": "15. The autoregressive transformer model [lab]",
"title": "Language Modelling",
"year": "2022"
},
"nbformat": 4,
"nbformat_minor": 4
}