112 KiB
112 KiB
!pip install transformers torch accelerate
Requirement already satisfied: transformers in /Users/sparafinski/miniconda3/envs/study/lib/python3.9/site-packages (4.23.1) Requirement already satisfied: torch in /Users/sparafinski/miniconda3/envs/study/lib/python3.9/site-packages (2.0.0) Collecting accelerate Downloading accelerate-0.20.3-py3-none-any.whl (227 kB) [K |████████████████████████████████| 227 kB 2.6 MB/s eta 0:00:01 [?25hRequirement already satisfied: packaging>=20.0 in /Users/sparafinski/miniconda3/envs/study/lib/python3.9/site-packages (from transformers) (21.3) Requirement already satisfied: regex!=2019.12.17 in /Users/sparafinski/miniconda3/envs/study/lib/python3.9/site-packages (from transformers) (2022.10.31) Requirement already satisfied: pyyaml>=5.1 in /Users/sparafinski/miniconda3/envs/study/lib/python3.9/site-packages (from transformers) (6.0) Requirement already satisfied: numpy>=1.17 in /Users/sparafinski/miniconda3/envs/study/lib/python3.9/site-packages (from transformers) (1.23.4) Requirement already satisfied: huggingface-hub<1.0,>=0.10.0 in /Users/sparafinski/miniconda3/envs/study/lib/python3.9/site-packages (from transformers) (0.10.1) Requirement already satisfied: tokenizers!=0.11.3,<0.14,>=0.11.1 in /Users/sparafinski/miniconda3/envs/study/lib/python3.9/site-packages (from transformers) (0.11.4) Requirement already satisfied: tqdm>=4.27 in /Users/sparafinski/miniconda3/envs/study/lib/python3.9/site-packages (from transformers) (4.64.0) Requirement already satisfied: filelock in /Users/sparafinski/miniconda3/envs/study/lib/python3.9/site-packages (from transformers) (3.6.0) Requirement already satisfied: requests in /Users/sparafinski/miniconda3/envs/study/lib/python3.9/site-packages (from transformers) (2.28.2) Requirement already satisfied: typing-extensions in /Users/sparafinski/miniconda3/envs/study/lib/python3.9/site-packages (from torch) (4.3.0) Requirement already satisfied: sympy in /Users/sparafinski/miniconda3/envs/study/lib/python3.9/site-packages (from torch) (1.11.1) Requirement already satisfied: networkx in /Users/sparafinski/miniconda3/envs/study/lib/python3.9/site-packages (from torch) (2.8.8) Requirement already satisfied: jinja2 in /Users/sparafinski/miniconda3/envs/study/lib/python3.9/site-packages (from torch) (3.1.2) Requirement already satisfied: psutil in /Users/sparafinski/miniconda3/envs/study/lib/python3.9/site-packages (from accelerate) (5.9.0) Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /Users/sparafinski/miniconda3/envs/study/lib/python3.9/site-packages (from packaging>=20.0->transformers) (3.0.9) Requirement already satisfied: MarkupSafe>=2.0 in /Users/sparafinski/miniconda3/envs/study/lib/python3.9/site-packages (from jinja2->torch) (2.1.1) Requirement already satisfied: idna<4,>=2.5 in /Users/sparafinski/miniconda3/envs/study/lib/python3.9/site-packages (from requests->transformers) (3.4) Requirement already satisfied: certifi>=2017.4.17 in /Users/sparafinski/miniconda3/envs/study/lib/python3.9/site-packages (from requests->transformers) (2022.12.7) Requirement already satisfied: urllib3<1.27,>=1.21.1 in /Users/sparafinski/miniconda3/envs/study/lib/python3.9/site-packages (from requests->transformers) (1.26.12) Requirement already satisfied: charset-normalizer<4,>=2 in /Users/sparafinski/miniconda3/envs/study/lib/python3.9/site-packages (from requests->transformers) (2.0.4) Requirement already satisfied: mpmath>=0.19 in /Users/sparafinski/miniconda3/envs/study/lib/python3.9/site-packages (from sympy->torch) (1.3.0) Installing collected packages: accelerate Successfully installed accelerate-0.20.3
Wczytanie bazowego modelu
Bazowym modelem jest polska wersja GPT2 https://huggingface.co/flax-community/papuGaPT2?text=Najsmaczniejszy+polski+owoc+to
from transformers import AutoTokenizer, AutoModelForCausalLM, set_seed
import pandas as pd
model = AutoModelForCausalLM.from_pretrained('flax-community/papuGaPT2')
tokenizer = AutoTokenizer.from_pretrained('flax-community/papuGaPT2')
tokenizer.pad_token = tokenizer.eos_token
Wczytanie danych do finetuningu
Dane stworzyliśmy ręcznie oraz za pomocą ChatGPT.
data = pd.read_csv('prompts.csv', sep=';')
# data.head()
# data["answer"]
texts = 'question: ' + data['question'] + "\nanswer: " + data['answer']
texts = texts.tolist()
print(texts[0])
question: Dlaczego w ogóle warto się starać? answer: Nie warto. Wszystko i tak skończy się niepowodzeniem.
Preprocessing
from torch.utils.data import Dataset, DataLoader, random_split, RandomSampler, SequentialSampler
import torch
# Create custom dataset
class PromptsDataset(Dataset):
def __init__(self, txt_list, tokenizer):
self.tokenizer = tokenizer
self.input_ids = []
self.attn_masks = []
for txt in txt_list:
encodings_dict = tokenizer(txt, padding="max_length", truncation=True, max_length=512)
self.input_ids.append(torch.tensor(encodings_dict['input_ids']))
self.attn_masks.append(torch.tensor(encodings_dict['attention_mask']))
def __len__(self):
return len(self.input_ids)
def __getitem__(self, idx):
return self.input_ids[idx], self.attn_masks[idx]
# Create dataset
dataset = PromptsDataset(texts, tokenizer)
# Split into training and validation sets
train_size = int(0.9 * len(dataset))
val_size = len(dataset) - train_size
train_dataset, val_dataset = random_split(dataset, [train_size, val_size])
print('{:>5,} training samples'.format(train_size))
print('{:>5,} validation samples'.format(val_size))
154 training samples 18 validation samples
dataset[0]
(tensor([ 7636, 1736, 536, 30, 6072, 263, 4090, 1076, 330, 20777, 35, 203, 16488, 1633, 30, 225, 624, 1076, 18, 4651, 288, 497, 8427, 330, 19241, 3239, 18, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256]), tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]))
batch_size = 8
# Create the DataLoaders for our training and validation datasets.
# We'll take training samples in random order.
train_dataloader = DataLoader(
train_dataset, # The training samples.
sampler = RandomSampler(train_dataset), # Select batches randomly
batch_size = batch_size # Trains with this batch size.
)
# For validation the order doesn't matter, so we'll just read them sequentially.
validation_dataloader = DataLoader(
val_dataset, # The validation samples.
sampler = SequentialSampler(val_dataset), # Pull out batches sequentially.
batch_size = batch_size # Evaluate with this batch size.
)
Fine-tuning
# some parameters I cooked up that work reasonably well
epochs = 10
learning_rate = 0.001
warmup_steps = 1e2
epsilon = 1e-8
from transformers import AdamW, get_linear_schedule_with_warmup
# Note: AdamW is a class from the huggingface library (as opposed to pytorch)
optimizer = AdamW(model.parameters(),
lr = learning_rate,
eps = epsilon
)
/Users/sparafinski/miniconda3/envs/study/lib/python3.9/site-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning warnings.warn(
# Total number of training steps is [number of batches] x [number of epochs].
# (Note that this is not the same as the number of training samples).
total_steps = len(train_dataloader) * epochs
# Create the learning rate scheduler.
# This changes the learning rate as the training loop progresses
scheduler = get_linear_schedule_with_warmup(optimizer,
num_warmup_steps = warmup_steps,
num_training_steps = total_steps)
import datetime
import time
import random
def format_time(elapsed):
return str(datetime.timedelta(seconds=int(round((elapsed)))))
device = torch.device("mps")
model.to(device)
GPT2LMHeadModel( (transformer): GPT2Model( (wte): Embedding(50257, 768) (wpe): Embedding(1024, 768) (drop): Dropout(p=0.0, inplace=False) (h): ModuleList( (0-11): 12 x GPT2Block( (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True) (attn): GPT2Attention( (c_attn): Conv1D() (c_proj): Conv1D() (attn_dropout): Dropout(p=0.0, inplace=False) (resid_dropout): Dropout(p=0.0, inplace=False) ) (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True) (mlp): GPT2MLP( (c_fc): Conv1D() (c_proj): Conv1D() (act): NewGELUActivation() (dropout): Dropout(p=0.0, inplace=False) ) ) ) (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True) ) (lm_head): Linear(in_features=768, out_features=50257, bias=False) )
total_t0 = time.time()
training_stats = []
model = model.to(device)
for epoch_i in range(0, epochs):
# ========================================
# Training
# ========================================
print("")
print('======== Epoch {:} / {:} ========'.format(epoch_i + 1, epochs))
print('Training...')
t0 = time.time()
total_train_loss = 0
model.train()
for step, batch in enumerate(train_dataloader):
b_input_ids = batch[0].to(device)
b_labels = batch[0].to(device)
b_masks = batch[1].to(device)
model.zero_grad()
outputs = model( b_input_ids,
labels=b_labels,
attention_mask = b_masks,
token_type_ids=None
)
loss = outputs[0]
batch_loss = loss.item()
total_train_loss += batch_loss
loss.backward()
optimizer.step()
scheduler.step()
# Calculate the average loss over all of the batches.
avg_train_loss = total_train_loss / len(train_dataloader)
# Measure how long this epoch took.
training_time = format_time(time.time() - t0)
print("")
print(" Average training loss: {0:.2f}".format(avg_train_loss))
print(" Training epoch took: {:}".format(training_time))
# ========================================
# Validation
# ========================================
print("")
print("Running Validation...")
t0 = time.time()
model.eval()
total_eval_loss = 0
nb_eval_steps = 0
# Evaluate data for one epoch
for batch in validation_dataloader:
b_input_ids = batch[0].to(device)
b_labels = batch[0].to(device)
b_masks = batch[1].to(device)
with torch.no_grad():
outputs = model(b_input_ids,
# token_type_ids=None,
attention_mask = b_masks,
labels=b_labels)
loss = outputs[0]
batch_loss = loss.item()
total_eval_loss += batch_loss
avg_val_loss = total_eval_loss / len(validation_dataloader)
validation_time = format_time(time.time() - t0)
print(" Validation Loss: {0:.2f}".format(avg_val_loss))
print(" Validation took: {:}".format(validation_time))
# Record all statistics from this epoch.
training_stats.append(
{
'epoch': epoch_i + 1,
'Training Loss': avg_train_loss,
'Valid. Loss': avg_val_loss,
'Training Time': training_time,
'Validation Time': validation_time
}
)
print("")
print("Training complete!")
print("Total training took {:} (h:mm:ss)".format(format_time(time.time()-total_t0)))
======== Epoch 1 / 10 ======== Training...
model.eval()
input_text = "question: Czy życie ma jakiś sens?\nanswer:"
input_ids = tokenizer.encode(input_text, return_tensors='pt')
input_ids = input_ids.to(device)
output = model.generate(input_ids, max_length=100, early_stopping=True)
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results. Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
question: Czy piłka nożna to dobra pasja? answer: Absolutnie nie! Czy próbowałeś/aś już grać w piłkę? Może warto spróbować!
Zapisanie modelu
import os
# Saving best-practices: if you use defaults names for the model, you can reload it using from_pretrained()
output_dir = 'model_save/'
# Create output directory if needed
if not os.path.exists(output_dir):
os.makedirs(output_dir)
print("Saving model to %s" % output_dir)
# Save a trained model, configuration and tokenizer using `save_pretrained()`.
# They can then be reloaded using `from_pretrained()`
model_to_save = model.module if hasattr(model, 'module') else model # Take care of distributed/parallel training
model_to_save.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
Saving model to /content/gdrive/My Drive/UAM/Magisterka/Empatia/model_save/
('/content/gdrive/My Drive/UAM/Magisterka/Empatia/model_save/tokenizer_config.json', '/content/gdrive/My Drive/UAM/Magisterka/Empatia/model_save/special_tokens_map.json', '/content/gdrive/My Drive/UAM/Magisterka/Empatia/model_save/vocab.json', '/content/gdrive/My Drive/UAM/Magisterka/Empatia/model_save/merges.txt', '/content/gdrive/My Drive/UAM/Magisterka/Empatia/model_save/added_tokens.json', '/content/gdrive/My Drive/UAM/Magisterka/Empatia/model_save/tokenizer.json')