emphatic_chatbot/finetuning.ipynb

!pip install transformers torch accelerate
Requirement already satisfied: transformers in /Users/sparafinski/miniconda3/envs/study/lib/python3.9/site-packages (4.23.1)
Requirement already satisfied: torch in /Users/sparafinski/miniconda3/envs/study/lib/python3.9/site-packages (2.0.0)
Collecting accelerate
  Downloading accelerate-0.20.3-py3-none-any.whl (227 kB)
     |████████████████████████████████| 227 kB 2.6 MB/s eta 0:00:01
Requirement already satisfied: packaging>=20.0 in /Users/sparafinski/miniconda3/envs/study/lib/python3.9/site-packages (from transformers) (21.3)
Requirement already satisfied: regex!=2019.12.17 in /Users/sparafinski/miniconda3/envs/study/lib/python3.9/site-packages (from transformers) (2022.10.31)
Requirement already satisfied: pyyaml>=5.1 in /Users/sparafinski/miniconda3/envs/study/lib/python3.9/site-packages (from transformers) (6.0)
Requirement already satisfied: numpy>=1.17 in /Users/sparafinski/miniconda3/envs/study/lib/python3.9/site-packages (from transformers) (1.23.4)
Requirement already satisfied: huggingface-hub<1.0,>=0.10.0 in /Users/sparafinski/miniconda3/envs/study/lib/python3.9/site-packages (from transformers) (0.10.1)
Requirement already satisfied: tokenizers!=0.11.3,<0.14,>=0.11.1 in /Users/sparafinski/miniconda3/envs/study/lib/python3.9/site-packages (from transformers) (0.11.4)
Requirement already satisfied: tqdm>=4.27 in /Users/sparafinski/miniconda3/envs/study/lib/python3.9/site-packages (from transformers) (4.64.0)
Requirement already satisfied: filelock in /Users/sparafinski/miniconda3/envs/study/lib/python3.9/site-packages (from transformers) (3.6.0)
Requirement already satisfied: requests in /Users/sparafinski/miniconda3/envs/study/lib/python3.9/site-packages (from transformers) (2.28.2)
Requirement already satisfied: typing-extensions in /Users/sparafinski/miniconda3/envs/study/lib/python3.9/site-packages (from torch) (4.3.0)
Requirement already satisfied: sympy in /Users/sparafinski/miniconda3/envs/study/lib/python3.9/site-packages (from torch) (1.11.1)
Requirement already satisfied: networkx in /Users/sparafinski/miniconda3/envs/study/lib/python3.9/site-packages (from torch) (2.8.8)
Requirement already satisfied: jinja2 in /Users/sparafinski/miniconda3/envs/study/lib/python3.9/site-packages (from torch) (3.1.2)
Requirement already satisfied: psutil in /Users/sparafinski/miniconda3/envs/study/lib/python3.9/site-packages (from accelerate) (5.9.0)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /Users/sparafinski/miniconda3/envs/study/lib/python3.9/site-packages (from packaging>=20.0->transformers) (3.0.9)
Requirement already satisfied: MarkupSafe>=2.0 in /Users/sparafinski/miniconda3/envs/study/lib/python3.9/site-packages (from jinja2->torch) (2.1.1)
Requirement already satisfied: idna<4,>=2.5 in /Users/sparafinski/miniconda3/envs/study/lib/python3.9/site-packages (from requests->transformers) (3.4)
Requirement already satisfied: certifi>=2017.4.17 in /Users/sparafinski/miniconda3/envs/study/lib/python3.9/site-packages (from requests->transformers) (2022.12.7)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /Users/sparafinski/miniconda3/envs/study/lib/python3.9/site-packages (from requests->transformers) (1.26.12)
Requirement already satisfied: charset-normalizer<4,>=2 in /Users/sparafinski/miniconda3/envs/study/lib/python3.9/site-packages (from requests->transformers) (2.0.4)
Requirement already satisfied: mpmath>=0.19 in /Users/sparafinski/miniconda3/envs/study/lib/python3.9/site-packages (from sympy->torch) (1.3.0)
Installing collected packages: accelerate
Successfully installed accelerate-0.20.3

Loading the base model

The base model is papuGaPT2, a Polish version of GPT-2: https://huggingface.co/flax-community/papuGaPT2?text=Najsmaczniejszy+polski+owoc+to

from transformers import AutoTokenizer, AutoModelForCausalLM, set_seed
import pandas as pd

model = AutoModelForCausalLM.from_pretrained('flax-community/papuGaPT2')
tokenizer = AutoTokenizer.from_pretrained('flax-community/papuGaPT2')

tokenizer.pad_token = tokenizer.eos_token
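
As a quick sanity check (not part of the original notebook), the freshly loaded base model can be asked to complete the prompt from the model card URL above; set_seed is imported above but otherwise unused, so it is used here for reproducibility.

set_seed(42)  # sanity check only, not in the original notebook
sample_ids = tokenizer.encode("Najsmaczniejszy polski owoc to", return_tensors="pt")
sample_out = model.generate(sample_ids, max_length=30, do_sample=True, top_k=50,
                            pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(sample_out[0], skip_special_tokens=True))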

Loading the fine-tuning data

We created the data by hand and with the help of ChatGPT.
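
The CSV itself is not included here; judging from the read_csv call and the sample printed below, it is assumed to be a semicolon-separated file with question and answer columns, roughly:

question;answer
Dlaczego w ogóle warto się starać?; Nie warto. Wszystko i tak skończy się niepowodzeniem.
...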

data = pd.read_csv('prompts.csv', sep=';')
# data.head()
# data["answer"]
texts = 'question: ' + data['question'] + "\nanswer: " + data['answer']
texts = texts.tolist()
print(texts[0])
question: Dlaczego w ogóle warto się starać?
answer:  Nie warto. Wszystko i tak skończy się niepowodzeniem.

Preprocessing

from torch.utils.data import Dataset, DataLoader, random_split, RandomSampler, SequentialSampler
import torch

# Create custom dataset
class PromptsDataset(Dataset):
  def __init__(self, txt_list, tokenizer):
    self.tokenizer = tokenizer
    self.input_ids = []
    self.attn_masks = []

    for txt in txt_list:
      encodings_dict = tokenizer(txt, padding="max_length", truncation=True, max_length=512)
      self.input_ids.append(torch.tensor(encodings_dict['input_ids']))
      self.attn_masks.append(torch.tensor(encodings_dict['attention_mask']))

  def __len__(self):
    return len(self.input_ids)

  def __getitem__(self, idx):
    return self.input_ids[idx], self.attn_masks[idx]
# Create dataset
dataset = PromptsDataset(texts, tokenizer)

# Split into training and validation sets
train_size = int(0.9 * len(dataset))
val_size = len(dataset) - train_size

train_dataset, val_dataset = random_split(dataset, [train_size, val_size])

print('{:>5,} training samples'.format(train_size))
print('{:>5,} validation samples'.format(val_size))
  154 training samples
   18 validation samples
dataset[0]
(tensor([ 7636,  1736,   536,    30,  6072,   263,  4090,  1076,   330, 20777,
            35,   203, 16488,  1633,    30,   225,   624,  1076,    18,  4651,
           288,   497,  8427,   330, 19241,  3239,    18, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256]),
 tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0]))
batch_size = 8

# Create the DataLoaders for our training and validation datasets.
# We'll take training samples in random order.
train_dataloader = DataLoader(
            train_dataset,  # The training samples.
            sampler = RandomSampler(train_dataset), # Select batches randomly
            batch_size = batch_size # Trains with this batch size.
        )

# For validation the order doesn't matter, so we'll just read them sequentially.
validation_dataloader = DataLoader(
            val_dataset, # The validation samples.
            sampler = SequentialSampler(val_dataset), # Pull out batches sequentially.
            batch_size = batch_size # Evaluate with this batch size.
        )
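
As a quick shape check (not in the original notebook), each batch should be a pair of (input_ids, attention_mask) tensors of shape (batch_size, 512):

batch = next(iter(train_dataloader))
print(batch[0].shape, batch[1].shape)  # expected: torch.Size([8, 512]) torch.Size([8, 512])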

Fine-tuning

# some parameters I cooked up that work reasonably well

epochs = 10
learning_rate = 0.001
warmup_steps = 1e2
epsilon = 1e-8
from transformers import AdamW, get_linear_schedule_with_warmup

# Note: AdamW is a class from the huggingface library (as opposed to pytorch)
optimizer = AdamW(model.parameters(),
                  lr = learning_rate,
                  eps = epsilon
                )
/Users/sparafinski/miniconda3/envs/study/lib/python3.9/site-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
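The deprecation warning above can be avoided by switching to the PyTorch optimizer, which is a drop-in replacement for this setup (not what the original run used):

# alternative: PyTorch's AdamW instead of the deprecated transformers one
optimizer = torch.optim.AdamW(model.parameters(),
                              lr = learning_rate,
                              eps = epsilon)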
# Total number of training steps is [number of batches] x [number of epochs].
# (Note that this is not the same as the number of training samples).
total_steps = len(train_dataloader) * epochs

# Create the learning rate scheduler.
# This changes the learning rate as the training loop progresses
scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps = warmup_steps,
                                            num_training_steps = total_steps)
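
For intuition, the linear schedule ramps the learning rate from 0 up to learning_rate over the warmup steps and then decays it linearly to 0 at total_steps. A rough sketch of the per-step value (the actual values come from the scheduler itself):

def approx_lr(step):
    # linear warmup followed by linear decay, mirroring get_linear_schedule_with_warmup
    if step < warmup_steps:
        return learning_rate * step / max(1, warmup_steps)
    return learning_rate * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

print(approx_lr(50), approx_lr(100), approx_lr(total_steps))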
import datetime
import time
import random

def format_time(elapsed):
    return str(datetime.timedelta(seconds=int(round((elapsed)))))

device = torch.device("mps")
model.to(device)
GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.0, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.0, inplace=False)
          (resid_dropout): Dropout(p=0.0, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.0, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)
total_t0 = time.time()

training_stats = []

model = model.to(device)

for epoch_i in range(0, epochs):

    # ========================================
    #               Training
    # ========================================

    print("")
    print('======== Epoch {:} / {:} ========'.format(epoch_i + 1, epochs))
    print('Training...')

    t0 = time.time()

    total_train_loss = 0

    model.train()

    for step, batch in enumerate(train_dataloader):

        b_input_ids = batch[0].to(device)
        b_labels = batch[0].to(device)
        b_masks = batch[1].to(device)

        model.zero_grad()

        outputs = model(  b_input_ids,
                          labels=b_labels,
                          attention_mask = b_masks,
                          token_type_ids=None
                        )

        loss = outputs[0]

        batch_loss = loss.item()
        total_train_loss += batch_loss

        loss.backward()

        optimizer.step()

        scheduler.step()

    # Calculate the average loss over all of the batches.
    avg_train_loss = total_train_loss / len(train_dataloader)

    # Measure how long this epoch took.
    training_time = format_time(time.time() - t0)

    print("")
    print("  Average training loss: {0:.2f}".format(avg_train_loss))
    print("  Training epoch took: {:}".format(training_time))

    # ========================================
    #               Validation
    # ========================================

    print("")
    print("Running Validation...")

    t0 = time.time()

    model.eval()

    total_eval_loss = 0
    nb_eval_steps = 0

    # Evaluate data for one epoch
    for batch in validation_dataloader:

        b_input_ids = batch[0].to(device)
        b_labels = batch[0].to(device)
        b_masks = batch[1].to(device)

        with torch.no_grad():

            outputs  = model(b_input_ids,
#                            token_type_ids=None,
                             attention_mask = b_masks,
                            labels=b_labels)

            loss = outputs[0]

        batch_loss = loss.item()
        total_eval_loss += batch_loss

    avg_val_loss = total_eval_loss / len(validation_dataloader)

    validation_time = format_time(time.time() - t0)

    print("  Validation Loss: {0:.2f}".format(avg_val_loss))
    print("  Validation took: {:}".format(validation_time))

    # Record all statistics from this epoch.
    training_stats.append(
        {
            'epoch': epoch_i + 1,
            'Training Loss': avg_train_loss,
            'Valid. Loss': avg_val_loss,
            'Training Time': training_time,
            'Validation Time': validation_time
        }
    )

print("")
print("Training complete!")
print("Total training took {:} (h:mm:ss)".format(format_time(time.time()-total_t0)))
======== Epoch 1 / 10 ========
Training...
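
Once training finishes, the collected training_stats can be summarized in one table (an extra step, not part of the original run):

df_stats = pd.DataFrame(training_stats).set_index('epoch')
print(df_stats)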
model.eval()

input_text = "question: Czy życie ma jakiś sens?\nanswer:"
input_ids = tokenizer.encode(input_text, return_tensors='pt')
input_ids = input_ids.to(device)

output = model.generate(input_ids, max_length=100, early_stopping=True)  # note: early_stopping only takes effect with beam search (num_beams > 1)

generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
question: Czy piłka nożna to dobra pasja?
answer: Absolutnie nie! Czy próbowałeś/aś już grać w piłkę? Może warto spróbować!
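
Greedy decoding gives a single deterministic answer; one possible variation (not in the original notebook) is to sample instead, which also silences the pad_token_id warning above:

set_seed(42)
output = model.generate(input_ids,
                        max_length=100,
                        do_sample=True,       # sample instead of greedy decoding
                        top_k=50,
                        top_p=0.95,
                        temperature=0.8,
                        pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output[0], skip_special_tokens=True))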

Saving the model

import os

# Saving best practices: if you use the default names for the model, you can reload it using from_pretrained()

output_dir =  'model_save/'

# Create output directory if needed
if not os.path.exists(output_dir):
    os.makedirs(output_dir)

print("Saving model to %s" % output_dir)

# Save a trained model, configuration and tokenizer using `save_pretrained()`.
# They can then be reloaded using `from_pretrained()`
model_to_save = model.module if hasattr(model, 'module') else model  # Take care of distributed/parallel training
model_to_save.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
Saving model to /content/gdrive/My Drive/UAM/Magisterka/Empatia/model_save/
('/content/gdrive/My Drive/UAM/Magisterka/Empatia/model_save/tokenizer_config.json',
 '/content/gdrive/My Drive/UAM/Magisterka/Empatia/model_save/special_tokens_map.json',
 '/content/gdrive/My Drive/UAM/Magisterka/Empatia/model_save/vocab.json',
 '/content/gdrive/My Drive/UAM/Magisterka/Empatia/model_save/merges.txt',
 '/content/gdrive/My Drive/UAM/Magisterka/Empatia/model_save/added_tokens.json',
 '/content/gdrive/My Drive/UAM/Magisterka/Empatia/model_save/tokenizer.json')
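
As the comment above notes, the saved model can later be reloaded from output_dir; a minimal sketch:

model = AutoModelForCausalLM.from_pretrained(output_dir)
tokenizer = AutoTokenizer.from_pretrained(output_dir)
model.to(device)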