Language Modeling
15. Autoregressive transformer model [exercises]
Jakub Pokrywka (2022)
!pip install transformers
Requirement already satisfied: transformers in /home/kuba/anaconda3/envs/zajeciaei/lib/python3.10/site-packages (4.19.2)
import torch
from transformers import pipeline, set_seed, AutoTokenizer, AutoModel, AutoModelForCausalLM
sample text
TEXT = 'Today, on my way to the university,'
using the model with the transformers library
model_name = "gpt2"
if inference takes too long or you run out of RAM, use the smaller model:
# model_name = 'distilgpt2'
tokenizer = AutoTokenizer.from_pretrained(model_name)
encoding = tokenizer(TEXT)
encoding
{'input_ids': [8888, 11, 319, 616, 835, 284, 262, 6403, 11], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
for token in encoding['input_ids']:
    print(token, '\t', tokenizer.decode(token))
8888 	 Today
11 	 ,
319 	 on
616 	 my
835 	 way
284 	 to
262 	 the
6403 	 university
11 	 ,
pt_model = AutoModel.from_pretrained(model_name)
encoding
{'input_ids': [8888, 11, 319, 616, 835, 284, 262, 6403, 11], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
the call below raises an error, because the model expects tensors as input
pt_model(**encoding)
TEXT
'Today, on my way to the university,'
encoding = tokenizer(TEXT, return_tensors='pt')
?pt_model.forward
output = pt_model(**encoding, output_hidden_states= True)
output
output.hidden_states[0].shape
torch.Size([1, 9, 768])
output.hidden_states[1].shape
torch.Size([1, 9, 768])
output.hidden_states[2].shape
torch.Size([1, 9, 768])
len(output.hidden_states)
13
output.last_hidden_state.shape
torch.Size([1, 9, 768])
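A note on the 13 hidden states: that is the embedding-layer output plus one state per transformer block (GPT-2 small has 12 blocks), and last_hidden_state is simply the final entry. A quick sanity check, re-loading the same model:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")

enc = tokenizer("Today, on my way to the university,", return_tensors="pt")
with torch.no_grad():
    out = model(**enc, output_hidden_states=True)

# embedding output + one hidden state per transformer block: 12 + 1 = 13
assert len(out.hidden_states) == model.config.n_layer + 1
# last_hidden_state is just the final element of hidden_states
assert torch.equal(out.last_hidden_state, out.hidden_states[-1])
```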
pt_model = AutoModelForCausalLM.from_pretrained(model_name)
output = pt_model(**encoding)
output
output[0]
tensor([[[ -36.3292,  -36.3402,  -40.4228,  ...,  -46.0234,  -44.5284,  -37.1276],
         [-114.9346, -116.5035, -117.9236,  ..., -117.8857, -119.3379, -112.9298],
         [-123.5036, -123.0548, -127.3876,  ..., -130.5238, -130.5279, -123.2711],
         ...,
         [-101.3852, -101.2506, -103.6583,  ..., -103.3747, -107.7192,  -99.4521],
         [ -83.0701,  -84.3884,  -91.9513,  ...,  -91.7482,  -93.3971,  -85.1204],
         [ -91.2749,  -93.1332,  -93.6408,  ...,  -94.3482,  -93.4517,  -90.1472]]],
       grad_fn=<UnsafeViewBackward0>)
output[0].shape
torch.Size([1, 9, 50257])
torch.topk(output[0][0],5)
torch.return_types.topk(
values=tensor([[ -32.8755,  -33.1021,  -33.9975,  -34.4861,  -34.5463],
        [-105.5972, -106.3818, -106.3978, -106.9693, -107.0778],
        [-113.2521, -114.7346, -114.8781, -114.9605, -115.0834],
        [-118.2435, -119.2980, -119.5907, -119.6229, -119.7969],
        [ -83.6241,  -84.6822,  -84.8526,  -85.4978,  -86.6938],
        [ -79.9051,  -80.3284,  -81.6157,  -81.8538,  -82.9018],
        [ -90.4443,  -90.7053,  -91.9059,  -92.0003,  -92.1531],
        [ -75.2650,  -76.9698,  -77.5753,  -77.6700,  -77.8095],
        [ -78.7985,  -81.5545,  -81.6846,  -81.8984,  -82.5938]],
       grad_fn=<TopkBackward0>),
indices=tensor([[   11,    13,   198,   290,   286],
        [  262,   356,   314,   340,   257],
        [  262,   257,  1737,  2901,  2805],
        [  835,   717,   938, 10955,  1218],
        [  284,   736,  1363,   503,   422],
        [  670,   262,   616,   257,  1524],
        [ 9003,  2607, 11550,  4436,  4495],
        [   11,   314,   338,   284,   287],
        [  314,   616,   257,   262,   612]]))
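The values returned by topk are raw logits, not probabilities; softmax turns a logit vector into a distribution that sums to 1. A minimal pure-Python sketch on a toy four-token vocabulary (the real one has 50257 entries):

```python
import math

def softmax(logits):
    # subtract the max for numerical stability
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# toy 4-token "vocabulary" with raw scores (logits)
probs = softmax([2.0, 1.0, 0.1, -1.0])
assert abs(sum(probs) - 1.0) < 1e-9   # a valid distribution
assert probs.index(max(probs)) == 0   # highest logit -> highest probability
```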
encoding.input_ids[0]
tensor([8888, 11, 319, 616, 835, 284, 262, 6403, 11])
for i in range(1, len(encoding.input_ids[0])):
    print(tokenizer.decode(encoding.input_ids[0][:i+1]), '\t→', tokenizer.decode(torch.topk(output[0][0],1).indices[i]))
Today, 	→ the
Today, on 	→ the
Today, on my 	→ way
Today, on my way 	→ to
Today, on my way to 	→ work
Today, on my way to the 	→ airport
Today, on my way to the university 	→ ,
Today, on my way to the university, 	→ I
text generation
encoding
{'input_ids': tensor([[8888, 11, 319, 616, 835, 284, 262, 6403, 11]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1]])}
text = TEXT
text
'Today, on my way to the university,'
encoding
{'input_ids': tensor([[8888, 11, 319, 616, 835, 284, 262, 6403, 11]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1]])}
encoding = tokenizer(text, return_tensors='pt')
for i in range(10):
    output = pt_model(**encoding)
    text += tokenizer.decode(torch.topk(output[0][0][-1], 1).indices)
    encoding = tokenizer(text, return_tensors='pt')
text
'Today, on my way to the university, I was approached by a man who was a student'
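The loop above re-tokenizes the growing text on every step; the underlying pattern (score the prefix, take the argmax, append) can be sketched with a toy stand-in for the model's forward pass (`toy_next_logits` below is made up purely for illustration):

```python
def toy_next_logits(prefix):
    # stand-in for a real model: deterministically favours (last + 1) mod 5
    last = prefix[-1]
    return [1.0 if t == (last + 1) % 5 else 0.0 for t in range(5)]

def greedy_generate(prefix, steps):
    seq = list(prefix)
    for _ in range(steps):
        logits = toy_next_logits(seq)
        # greedy decoding: always append the argmax token
        seq.append(max(range(len(logits)), key=lambda t: logits[t]))
    return seq

assert greedy_generate([0], 4) == [0, 1, 2, 3, 4]
```

With a real model the same effect is obtained more efficiently via `pt_model.generate`, which caches past attention states instead of re-encoding the prefix.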
What can be done to improve the result? Decoding strategies:
- greedy search
- random sampling
- random sampling with temperature
- top-k sampling or top-k sampling with temperature
- top-p sampling (also known as nucleus sampling) or top-p sampling with temperature
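Top-k and top-p can be sketched in plain Python as filters applied to the logits before sampling, while temperature is a divisor applied inside the softmax. A toy four-token vocabulary:

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(logits, k):
    # keep the k highest logits, mask the rest with -inf
    cutoff = sorted(logits, reverse=True)[k - 1]
    return [x if x >= cutoff else float("-inf") for x in logits]

def top_p_filter(logits, p):
    # nucleus sampling: keep the smallest set of tokens whose
    # cumulative probability reaches p
    probs = softmax(logits)
    order = sorted(range(len(logits)), key=lambda i: -probs[i])
    keep, total = set(), 0.0
    for i in order:
        keep.add(i)
        total += probs[i]
        if total >= p:
            break
    return [logits[i] if i in keep else float("-inf") for i in range(len(logits))]

logits = [3.0, 2.0, 1.0, 0.0]
assert top_k_filter(logits, 2) == [3.0, 2.0, float("-inf"), float("-inf")]
# with p=0.5 only the single most probable token survives here
assert top_p_filter(logits, 0.5).count(float("-inf")) == 3
```

After filtering, the next token would be drawn at random with weights `softmax(filtered_logits)`; masked entries get probability 0.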
pipeline
generator = pipeline('text-generation', model=model_name)
TEXT
'Today, on my way to the university,'
generator(TEXT, max_length=20, num_return_sequences=5)
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
[{'generated_text': 'Today, on my way to the university, some of them would have been very pleased, and I'}, {'generated_text': 'Today, on my way to the university, and he made me dinner, and he called me back'}, {'generated_text': 'Today, on my way to the university, I saw three white girls who seemed a bit different—'}, {'generated_text': 'Today, on my way to the university, I drove through the town, past trees and bushes,'}, {'generated_text': 'Today, on my way to the university, I saw an elderly lady come up behind me."\n'}]
generator(TEXT, max_length=20, num_beams=1, do_sample=False)
[{'generated_text': 'Today, on my way to the university, I was approached by a man who was a student at'}]
generator(TEXT, max_length=20, num_beams=10, top_p = 0.2)
[{'generated_text': 'Today, on my way to the university, I was approached by a man who was very nice and'}]
generator(TEXT, max_length=20, num_beams=10, temperature = 1.0 )
[{'generated_text': 'Today, on my way to the university, I was approached by a group of students who asked me'}]
generator(TEXT, max_length=20, num_beams=10, temperature = 10.0 )
[{'generated_text': 'Today, on my way to the university, I noticed some young boys who was very active on campus'}]
generator(TEXT, max_length=20, num_beams=10, temperature = 100.0 )
[{'generated_text': 'Today, on my way to the university, the trainees have noticed how a car could become an'}]
other options:
- repetition_penalty
- length_penalty
- no_repeat_ngram_size
- bad_words_ids
- force_words_ids
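One of these, no_repeat_ngram_size, can be sketched in plain Python: before each step, ban every token that would complete an n-gram already present in the generated sequence (the token ids below are made up for illustration):

```python
def banned_tokens(seq, n):
    # tokens that would complete an n-gram already seen in seq
    if len(seq) < n - 1:
        return set()
    prefix = tuple(seq[-(n - 1):])
    banned = set()
    for i in range(len(seq) - n + 1):
        if tuple(seq[i:i + n - 1]) == prefix:
            banned.add(seq[i + n - 1])
    return banned

# the trigram (5, 7, 9) was generated once already; with n=3,
# after another (5, 7) the token 9 may not be emitted again
seq = [5, 7, 9, 1, 5, 7]
assert banned_tokens(seq, 3) == {9}
```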
huggingface API
from transformers import CTRLTokenizer, CTRLModel
tokenizer = CTRLTokenizer.from_pretrained("ctrl")
CTRL
inputs = tokenizer("Opinion My dog is cute", return_tensors="pt")
inputs
{'input_ids': tensor([[43213, 586, 3153, 8, 83781]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1]])}
tokenizer.control_codes
{'Pregnancy': 168629, 'Christianity': 7675, 'Explain': 106423, 'Fitness': 63440, 'Saving': 63163, 'Ask': 27171, 'Ass': 95985, 'Joke': 163509, 'Questions': 45622, 'Thoughts': 49605, 'Retail': 52342, 'Feminism': 164338, 'Writing': 11992, 'Atheism': 192263, 'Netflix': 48616, 'Computing': 39639, 'Opinion': 43213, 'Alone': 44967, 'Funny': 58917, 'Gaming': 40358, 'Human': 4088, 'India': 1331, 'Joker': 77138, 'Diet': 36206, 'Legal': 11859, 'Norman': 4939, 'Tip': 72689, 'Weight': 52343, 'Movies': 46273, 'Running': 23425, 'Science': 2090, 'Horror': 37793, 'Confession': 60572, 'Finance': 12250, 'Politics': 16360, 'Scary': 191985, 'Support': 12654, 'Technologies': 32516, 'Teenage': 66160, 'Event': 32769, 'Learned': 67460, 'Notion': 182770, 'Wikipedia': 37583, 'Books': 6665, 'Extract': 76050, 'Confessions': 102701, 'Conspiracy': 75932, 'Links': 63674, 'Narcissus': 150425, 'Relationship': 54766, 'Relationships': 134796, 'Reviews': 41671, 'News': 4256, 'Translation': 26820, 'multilingual': 128406}
generator = pipeline('text-generation', model="ctrl")
/home/kuba/anaconda3/envs/zajeciaei/lib/python3.10/site-packages/transformers/models/ctrl/modeling_ctrl.py:43: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). angle_rates = 1 / torch.pow(10000, (2 * (i // 2)) / d_model_size)
TEXT = "Today"
generator("Opinion " + TEXT, max_length = 50)
[{'generated_text': 'Opinion Today I learned that the US government has been spying on the citizens of other countries for years. \n Score: 6 \n \n Title: CMV: I think that the US should not be involved in the Middle East \n Text: I think that the US'}]
generator("Technologies " + TEXT, max_length = 50)
[{'generated_text': 'Technologies Today \n Score: 6 \n \n Title: The Internet is a great tool for the average person to get information and to share it with others. But it is also a great tool for the government to spy on us. \n Score: 6 \n \n Title: The'}]
generator("Gaming " + TEXT, max_length = 50)
[{'generated_text': 'Gaming Today \n Score: 6 \n \n Title: I just got a new gaming pc and I have a question \n Text: I just got a new gaming pc and I have a question \n \n I have a monitor that I bought a while back'}]
Task
Use GPT-2 or distilGPT2 to generate answers for the Challenging America challenge. Fine-tuning the model is not required.
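A possible starting point (a sketch only: greedy decoding via the pipeline; the prompt handling and token budget are assumptions to adapt to the actual challenge format):

```python
from transformers import pipeline

# distilgpt2 is the smaller model suggested earlier in these exercises
generator = pipeline('text-generation', model='distilgpt2')

def answer(prompt, n_tokens=20):
    out = generator(prompt, max_new_tokens=n_tokens, do_sample=False)
    # return only the generated continuation, without the prompt
    return out[0]['generated_text'][len(prompt):].strip()

print(answer('Today, on my way to the university,'))
```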