

Language Modeling

15. Autoregressive transformer model [exercises]

Jakub Pokrywka (2022)


!pip install transformers
Requirement already satisfied: transformers in /home/kuba/anaconda3/envs/zajeciaei/lib/python3.10/site-packages (4.19.2)
import torch
from transformers import pipeline, set_seed, AutoTokenizer, AutoModel, AutoModelForCausalLM
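
The set_seed helper imported above is optional, but calling it pins the RNGs so that the sampling-based examples later in the notebook become reproducible; a minimal example (42 is an arbitrary seed value):

set_seed(42)  # arbitrary seed; fixes the RNGs used during sampling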

example text

TEXT = 'Today, on my way to the university,'

using a model from the transformers library

model_name = "gpt2"

if inference takes too long or you run out of RAM, use a smaller model:

# model_name = 'distilgpt2'
tokenizer = AutoTokenizer.from_pretrained(model_name)
encoding = tokenizer(TEXT)
encoding
{'input_ids': [8888, 11, 319, 616, 835, 284, 262, 6403, 11], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
# print each token id together with its decoded surface form
for token in encoding['input_ids']:
    print(token, '\t', tokenizer.decode(token))
8888 	 Today
11 	 ,
319 	  on
616 	  my
835 	  way
284 	  to
262 	  the
6403 	  university
11 	 ,
pt_model = AutoModel.from_pretrained(model_name)
encoding
{'input_ids': [8888, 11, 319, 616, 835, 284, 262, 6403, 11], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

the call below raises an error, because the model's inputs must be tensors

pt_model(**encoding)
TEXT
'Today, on my way to the university,'
encoding = tokenizer(TEXT, return_tensors='pt')
?pt_model.forward
output = pt_model(**encoding, output_hidden_states=True)
output
output.hidden_states[0].shape
torch.Size([1, 9, 768])
output.hidden_states[1].shape
torch.Size([1, 9, 768])
output.hidden_states[2].shape
torch.Size([1, 9, 768])
len(output.hidden_states)
13
output.last_hidden_state.shape
torch.Size([1, 9, 768])
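
There are 13 hidden states because hidden_states contains the embedding-layer output plus the output of each of GPT-2 small's 12 transformer blocks; the last entry is the same tensor as last_hidden_state. A quick sanity check (assuming the config exposes num_hidden_layers, as GPT-2's does):

assert len(output.hidden_states) == pt_model.config.num_hidden_layers + 1
assert torch.equal(output.hidden_states[-1], output.last_hidden_state)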
pt_model = AutoModelForCausalLM.from_pretrained(model_name)
output = pt_model(**encoding)
output
output[0]
tensor([[[ -36.3292,  -36.3402,  -40.4228,  ...,  -46.0234,  -44.5284,
           -37.1276],
         [-114.9346, -116.5035, -117.9236,  ..., -117.8857, -119.3379,
          -112.9298],
         [-123.5036, -123.0548, -127.3876,  ..., -130.5238, -130.5279,
          -123.2711],
         ...,
         [-101.3852, -101.2506, -103.6583,  ..., -103.3747, -107.7192,
           -99.4521],
         [ -83.0701,  -84.3884,  -91.9513,  ...,  -91.7482,  -93.3971,
           -85.1204],
         [ -91.2749,  -93.1332,  -93.6408,  ...,  -94.3482,  -93.4517,
           -90.1472]]], grad_fn=<UnsafeViewBackward0>)
output[0].shape
torch.Size([1, 9, 50257])
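
The 50257 values in the last dimension are unnormalized logits over GPT-2's vocabulary. Applying softmax to the logits at the last position turns them into next-token probabilities; a minimal sketch:

import torch.nn.functional as F

probs = F.softmax(output[0][0, -1], dim=-1)  # distribution over the next token
top = torch.topk(probs, 5)
for p, idx in zip(top.values, top.indices):
    print(f'{tokenizer.decode(idx.item())!r}\t{p.item():.3f}')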
torch.topk(output[0][0],5)
torch.return_types.topk(
values=tensor([[ -32.8755,  -33.1021,  -33.9975,  -34.4861,  -34.5463],
        [-105.5972, -106.3818, -106.3978, -106.9693, -107.0778],
        [-113.2521, -114.7346, -114.8781, -114.9605, -115.0834],
        [-118.2435, -119.2980, -119.5907, -119.6229, -119.7969],
        [ -83.6241,  -84.6822,  -84.8526,  -85.4978,  -86.6938],
        [ -79.9051,  -80.3284,  -81.6157,  -81.8538,  -82.9018],
        [ -90.4443,  -90.7053,  -91.9059,  -92.0003,  -92.1531],
        [ -75.2650,  -76.9698,  -77.5753,  -77.6700,  -77.8095],
        [ -78.7985,  -81.5545,  -81.6846,  -81.8984,  -82.5938]],
       grad_fn=<TopkBackward0>),
indices=tensor([[   11,    13,   198,   290,   286],
        [  262,   356,   314,   340,   257],
        [  262,   257,  1737,  2901,  2805],
        [  835,   717,   938, 10955,  1218],
        [  284,   736,  1363,   503,   422],
        [  670,   262,   616,   257,  1524],
        [ 9003,  2607, 11550,  4436,  4495],
        [   11,   314,   338,   284,   287],
        [  314,   616,   257,   262,   612]]))
encoding.input_ids[0]
tensor([8888,   11,  319,  616,  835,  284,  262, 6403,   11])
# for each prefix of the input, print the model's top-1 prediction for the next token
for i in range(1, len(encoding.input_ids[0])):
    print(tokenizer.decode(encoding.input_ids[0][:i+1]), '\t→', tokenizer.decode(torch.topk(output[0][0], 1).indices[i]))
Today, 	→  the
Today, on 	→  the
Today, on my 	→  way
Today, on my way 	→  to
Today, on my way to 	→  work
Today, on my way to the 	→  airport
Today, on my way to the university 	→ ,
Today, on my way to the university, 	→  I

text generation

encoding
{'input_ids': tensor([[8888,   11,  319,  616,  835,  284,  262, 6403,   11]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1]])}
text = TEXT
text
'Today, on my way to the university,'
encoding = tokenizer(text, return_tensors='pt')
# greedy decoding by hand: repeatedly append the most probable next token and re-encode
for i in range(10):
    output = pt_model(**encoding)
    text += tokenizer.decode(torch.topk(output[0][0][-1], 1).indices)
    encoding = tokenizer(text, return_tensors='pt')
text
'Today, on my way to the university, I was approached by a man who was a student'
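
The same greedy continuation can also be obtained with the built-in generate method instead of the manual loop; a sketch (max_new_tokens bounds the length of the continuation):

encoding = tokenizer(TEXT, return_tensors='pt')
generated = pt_model.generate(**encoding, max_new_tokens=10, do_sample=False)
print(tokenizer.decode(generated[0]))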

What can be done to improve the result? Decoding strategies (a sketch of one of them follows the list):

  • greedy search
  • random sampling
  • random sampling with temperature
  • top-k sampling or top-k sampling with temperature
  • top-p sampling (also known as nucleus sampling) or top-p sampling with temperature
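
Temperature divides the logits before the softmax: values below 1 sharpen the distribution towards the most probable tokens, values above 1 flatten it towards uniform sampling. Below is a minimal sketch of top-k sampling with temperature applied to the raw logits (the function name and the values of k and temperature are arbitrary choices for illustration):

def sample_next_token(logits, k=50, temperature=0.7):
    logits = logits / temperature              # temperature scaling
    top = torch.topk(logits, k)                # keep only the k most probable tokens
    probs = torch.softmax(top.values, dim=-1)  # renormalize over the kept tokens
    choice = torch.multinomial(probs, num_samples=1)
    return top.indices[choice]

encoding = tokenizer(TEXT, return_tensors='pt')
output = pt_model(**encoding)
print(tokenizer.decode(sample_next_token(output[0][0, -1])))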

pipeline

generator = pipeline('text-generation', model=model_name)
TEXT
'Today, on my way to the university,'
generator(TEXT, max_length=20, num_return_sequences=5)
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
[{'generated_text': 'Today, on my way to the university, some of them would have been very pleased, and I'},
 {'generated_text': 'Today, on my way to the university, and he made me dinner, and he called me back'},
 {'generated_text': 'Today, on my way to the university, I saw three white girls who seemed a bit different—'},
 {'generated_text': 'Today, on my way to the university, I drove through the town, past trees and bushes,'},
 {'generated_text': 'Today, on my way to the university, I saw an elderly lady come up behind me."\n'}]
generator(TEXT, max_length=20, num_beams=1, do_sample=False)
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
[{'generated_text': 'Today, on my way to the university, I was approached by a man who was a student at'}]
generator(TEXT, max_length=20, num_beams=10, top_p=0.2)
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
[{'generated_text': 'Today, on my way to the university, I was approached by a man who was very nice and'}]
generator(TEXT, max_length=20, num_beams=10, temperature=1.0)
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
[{'generated_text': 'Today, on my way to the university, I was approached by a group of students who asked me'}]
generator(TEXT, max_length=20, num_beams=10, temperature=10.0)
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
[{'generated_text': 'Today, on my way to the university, I noticed some young boys who was very active on campus'}]
generator(TEXT, max_length=20, num_beams=10, temperature=100.0)
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
[{'generated_text': 'Today, on my way to the university, the trainees have noticed how a car could become an'}]

other options (see the example after the list):

  • repetition_penalty
  • length_penalty
  • no_repeat_ngram_size
  • bad_words_ids
  • force_words_ids
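
All of these are passed through to the underlying generate method; for instance, no_repeat_ngram_size forbids repeating any n-gram of the given size during generation (the parameter values below are arbitrary):

generator(TEXT, max_length=30, num_beams=5, no_repeat_ngram_size=3)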

Hugging Face API

from transformers import CTRLTokenizer, CTRLModel
tokenizer = CTRLTokenizer.from_pretrained("ctrl")

CTRL

CTRL steers generation by prepending a control code (a domain or style tag) to the prompt; note below that the first input id, 43213, is exactly the 'Opinion' control code.

inputs = tokenizer("Opinion My dog is cute", return_tensors="pt")
inputs
{'input_ids': tensor([[43213,   586,  3153,     8, 83781]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1]])}
tokenizer.control_codes
{'Pregnancy': 168629,
 'Christianity': 7675,
 'Explain': 106423,
 'Fitness': 63440,
 'Saving': 63163,
 'Ask': 27171,
 'Ass': 95985,
 'Joke': 163509,
 'Questions': 45622,
 'Thoughts': 49605,
 'Retail': 52342,
 'Feminism': 164338,
 'Writing': 11992,
 'Atheism': 192263,
 'Netflix': 48616,
 'Computing': 39639,
 'Opinion': 43213,
 'Alone': 44967,
 'Funny': 58917,
 'Gaming': 40358,
 'Human': 4088,
 'India': 1331,
 'Joker': 77138,
 'Diet': 36206,
 'Legal': 11859,
 'Norman': 4939,
 'Tip': 72689,
 'Weight': 52343,
 'Movies': 46273,
 'Running': 23425,
 'Science': 2090,
 'Horror': 37793,
 'Confession': 60572,
 'Finance': 12250,
 'Politics': 16360,
 'Scary': 191985,
 'Support': 12654,
 'Technologies': 32516,
 'Teenage': 66160,
 'Event': 32769,
 'Learned': 67460,
 'Notion': 182770,
 'Wikipedia': 37583,
 'Books': 6665,
 'Extract': 76050,
 'Confessions': 102701,
 'Conspiracy': 75932,
 'Links': 63674,
 'Narcissus': 150425,
 'Relationship': 54766,
 'Relationships': 134796,
 'Reviews': 41671,
 'News': 4256,
 'Translation': 26820,
 'multilingual': 128406}
generator = pipeline('text-generation', model="ctrl")
/home/kuba/anaconda3/envs/zajeciaei/lib/python3.10/site-packages/transformers/models/ctrl/modeling_ctrl.py:43: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  angle_rates = 1 / torch.pow(10000, (2 * (i // 2)) / d_model_size)
TEXT = "Today"
generator("Opinion " + TEXT, max_length = 50)

[{'generated_text': 'Opinion Today I learned that the US government has been spying on the citizens of other countries for years. \n Score: 6 \n \n Title: CMV: I think that the US should not be involved in the Middle East \n Text: I think that the US'}]

generator("Technologies " + TEXT, max_length = 50)

[{'generated_text': 'Technologies Today \n Score: 6 \n \n Title: The Internet is a great tool for the average person to get information and to share it with others. But it is also a great tool for the government to spy on us. \n Score: 6 \n \n Title: The'}]

generator("Gaming " + TEXT, max_length = 50)

[{'generated_text': 'Gaming Today \n Score: 6 \n \n Title: I just got a new gaming pc and I have a question \n Text: I just got a new gaming pc and I have a question \n \n I have a monitor that I bought a while back'}]

Exercise

Using GPT2 or distilGPT2, generate answers for the Challenging America challenge. Fine-tuning the model is not required.