
Lab 3

!pip install transformers
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.25.1-py3-none-any.whl (5.8 MB)
     |████████████████████████████████| 5.8 MB 8.1 MB/s 
Requirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.8/dist-packages (from transformers) (4.64.1)
Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.8/dist-packages (from transformers) (1.21.6)
Requirement already satisfied: requests in /usr/local/lib/python3.8/dist-packages (from transformers) (2.23.0)
Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.8/dist-packages (from transformers) (6.0)
Requirement already satisfied: filelock in /usr/local/lib/python3.8/dist-packages (from transformers) (3.8.0)
Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.8/dist-packages (from transformers) (2022.6.2)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.8/dist-packages (from transformers) (21.3)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
     |████████████████████████████████| 7.6 MB 54.0 MB/s 
Collecting huggingface-hub<1.0,>=0.10.0
  Downloading huggingface_hub-0.11.1-py3-none-any.whl (182 kB)
     |████████████████████████████████| 182 kB 57.8 MB/s 
Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.8/dist-packages (from huggingface-hub<1.0,>=0.10.0->transformers) (4.4.0)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /usr/local/lib/python3.8/dist-packages (from packaging>=20.0->transformers) (3.0.9)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.8/dist-packages (from requests->transformers) (2022.9.24)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.8/dist-packages (from requests->transformers) (2.10)
Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.8/dist-packages (from requests->transformers) (3.0.4)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.8/dist-packages (from requests->transformers) (1.24.3)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.11.1 tokenizers-0.13.2 transformers-4.25.1

Task 1

Generate any text using the Longformer model.

from transformers import LongformerTokenizer, EncoderDecoderModel

# Longformer encoder paired with a RoBERTa decoder, fine-tuned for summarization on CNN/DailyMail
model = EncoderDecoderModel.from_pretrained("patrickvonplaten/longformer2roberta-cnn_dailymail-fp16")
tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
Downloading:   0%|          | 0.00/3.52k [00:00<?, ?B/s]
You are using a model of type encoder_decoder to instantiate a model of type encoder-decoder. This is not supported for all configurations of models and can yield errors.
Downloading:   0%|          | 0.00/1.21G [00:00<?, ?B/s]
Some weights of the model checkpoint at patrickvonplaten/longformer2roberta-cnn_dailymail-fp16 were not used when initializing EncoderDecoderModel: ['decoder.roberta.pooler.dense.weight', 'encoder.embeddings.position_ids', 'decoder.roberta.pooler.dense.bias']
- This IS expected if you are initializing EncoderDecoderModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing EncoderDecoderModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Downloading:   0%|          | 0.00/899k [00:00<?, ?B/s]
Downloading:   0%|          | 0.00/456k [00:00<?, ?B/s]
Downloading:   0%|          | 0.00/694 [00:00<?, ?B/s]
text = "Germany is a great power with a strong economy; it has the largest economy in Europe, the world's fourth-largest economy by nominal GDP and the fifth-largest by PPP. As a global power in industrial, scientific and technological sectors, it is both the world's third-largest exporter and importer. As a highly developed country, which ranks ninth on the Human Development Index, it offers social security and a universal health care system, environmental protections, a tuition-free university education, and it is ranked as sixteenth-most peaceful country in the world. Germany is a member of the United Nations, the European Union, NATO, the Council of Europe, the G7, the G20 and the OECD. It has the third-greatest number of UNESCO World Heritage Sites."

# encode the passage and let the encoder-decoder generate a summary
input_ids = tokenizer(text, return_tensors="pt").input_ids
output_ids = model.generate(input_ids)

summary = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(summary)
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
/usr/local/lib/python3.8/dist-packages/transformers/generation/utils.py:1387: UserWarning: Neither `max_length` nor `max_new_tokens` has been set, `max_length` will default to 142 (`self.config.max_length`). Controlling `max_length` via the config is deprecated and `max_length` will be removed from the config in v5 of Transformers -- we recommend using `max_new_tokens` to control the maximum length of the generation.
  warnings.warn(
Germany is a great power in industrial, scientific and technological sectors.
It has the world's third-largest economy by nominal GDP and the fifth-largest by PPP.
Germany has the third-greatest number of UNESCO World Heritage Sites.
The world's fourth-largest exporter and importer of UNESCO world heritage sites.
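
The warnings above suggest two small fixes: pass the tokenizer's attention_mask to generate, and set the output length explicitly via max_new_tokens instead of relying on the deprecated config.max_length. A minimal sketch, reusing the model and tokenizer loaded above (the cap of 142 tokens simply mirrors the old default and is an assumption, not a tuned value):

inputs = tokenizer(text, return_tensors="pt")
output_ids = model.generate(
    inputs.input_ids,
    attention_mask=inputs.attention_mask,  # silences the attention-mask warning
    max_new_tokens=142,                    # explicit length cap (assumed value)
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))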

Task 2

Generate any text in Polish (using one of the Polish models).

from transformers import GPT2LMHeadModel, AutoTokenizer

# Polish GPT-2 (medium) published by sdadas on the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("sdadas/polish-gpt2-medium")
model_head = GPT2LMHeadModel.from_pretrained("sdadas/polish-gpt2-medium")
Downloading:   0%|          | 0.00/2.34M [00:00<?, ?B/s]
Downloading:   0%|          | 0.00/837 [00:00<?, ?B/s]
Downloading:   0%|          | 0.00/1.53G [00:00<?, ?B/s]
text = "Afera korupcyjna w Polsce. "

encoded_input = tokenizer(text, return_tensors='pt')

# beam search with a no-repeat bigram constraint to keep the continuation coherent
result = model_head.generate(**encoded_input, max_length=200, num_beams=5, no_repeat_ngram_size=2, early_stopping=True)
result_decode = tokenizer.decode(result.tolist()[0])
print(result_decode)
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Afera korupcyjna w Polsce. Ówczesny szef CBA Paweł Wojtunik i jego zastępca Mariusz Kamiński zostali zatrzymani przez CBA w związku z tzw. aferą gruntową. Zatrzymani usłyszeli zarzuty przyjęcia korzyści majątkowej w zamian za pośrednictwo w załatwieniu korzystnych dla jednej ze spółek załatwienia spraw urzędowych w Ministerstwie Rolnictwa i Rozwoju Wsi. Wśród zatrzymanych jest m.in. były minister rolnictwa Marek Sawicki, były wiceszef CBA Maciej Wąsik oraz były szef Agencji Bezpieczeństwa Wewnętrznego Krzysztof Bondaryk. Śledztwo w tej sprawie prowadzi Prokuratura Okręgowa Warszawa-Praga.</s>
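
Beam search with an n-gram block tends to produce fluent but conservative text. As a hedged alternative, the same sdadas/polish-gpt2-medium objects can be used with sampling for more varied output; the top_k/top_p values below are common defaults, not settings tuned for this model:

import torch

torch.manual_seed(42)  # fix the seed so the sampled output is reproducible
sampled = model_head.generate(
    **encoded_input,
    max_length=200,
    do_sample=True,   # sample from the distribution instead of beam search
    top_k=50,         # assumed defaults, not tuned for this model
    top_p=0.95,
    pad_token_id=tokenizer.eos_token_id,  # silences the pad_token_id log line
)
print(tokenizer.decode(sampled.tolist()[0], skip_special_tokens=True))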

Task 3

Try to generate a conversation/dialogue.

from transformers import GPT2LMHeadModel, GPT2Tokenizer

# vanilla English GPT-2 (small)
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model_head = GPT2LMHeadModel.from_pretrained('gpt2')
Downloading:   0%|          | 0.00/1.04M [00:00<?, ?B/s]
Downloading:   0%|          | 0.00/456k [00:00<?, ?B/s]
Downloading:   0%|          | 0.00/665 [00:00<?, ?B/s]
Downloading:   0%|          | 0.00/548M [00:00<?, ?B/s]
text = "- It's snowing badly. \n- Yes, the weather is really bad.\n- Do you plan to go out tonight?\n-"

encoded_input = tokenizer(text, return_tensors='pt')

result = model_head.generate(**encoded_input, max_length=200, num_beams=5, no_repeat_ngram_size=2, early_stopping=True)
result_decode = tokenizer.decode(result.tolist()[0])
print(result_decode)
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
- It's snowing badly. 
- Yes, the weather is really bad.
- Do you plan to go out tonight?
- No, I don't plan on going out at all. I'm just going to stay in the car and watch the snow fall.<|endoftext|>
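
One way to keep the conversation going is to strip the end-of-text marker, append the next speaker's line, and generate again. A sketch reusing the gpt2 objects above; the follow-up question is an arbitrary example:

follow_up = result_decode.replace("<|endoftext|>", "") + "\n- What about tomorrow?\n-"
encoded_follow_up = tokenizer(follow_up, return_tensors='pt')
next_turn = model_head.generate(
    **encoded_follow_up,
    max_length=250,  # a bit more room for the extra turn
    num_beams=5,
    no_repeat_ngram_size=2,
    early_stopping=True,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(next_turn.tolist()[0]))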

Task 4

Try to generate a news item (a report or a short story).

text = "Travel chaos across UK as snow causes closure and delays."

encoded_input = tokenizer(text, return_tensors='pt')

result = model_head.generate(**encoded_input, max_length=200, num_beams=5, no_repeat_ngram_size=2, early_stopping=True)
result_decode = tokenizer.decode(result.tolist()[0])
print(result_decode)
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Travel chaos across UK as snow causes closure and delays.

More than 1,000 people have been forced to flee their homes in the capital, London, as the National Weather Service (NWS) warned of a "dangerous storm surge" in parts of the north-east and south-west of England and Wales.


The NWS said: "This is the worst storm to hit the UK in more than a decade, with winds of up to 40mph (60km/h) and gusts in excess of 100mph. Storm surge is expected to continue through the night and into the early hours of Monday morning. The National Meteorological Service is urging people to stay away from areas with high winds and heavy rain."<|endoftext|>
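
For comparison, the same generation can be driven through the high-level pipeline API, which bundles tokenization, generation, and decoding into one call. A convenience sketch that reuses the already loaded gpt2 objects instead of downloading them again:

from transformers import pipeline

generator = pipeline("text-generation", model=model_head, tokenizer=tokenizer)
story = generator(
    text,  # the news headline prompt from above
    max_length=200,
    num_beams=5,
    no_repeat_ngram_size=2,
    early_stopping=True,
)
print(story[0]["generated_text"])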