Installing 'datasets' and 'transformers'
!pip install datasets
!pip install transformers
Requirement already satisfied: datasets in /usr/local/lib/python3.10/dist-packages (2.16.1)
Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from datasets) (3.13.1)
Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.10/dist-packages (from datasets) (1.23.5)
Requirement already satisfied: pyarrow>=8.0.0 in /usr/local/lib/python3.10/dist-packages (from datasets) (10.0.1)
Requirement already satisfied: pyarrow-hotfix in /usr/local/lib/python3.10/dist-packages (from datasets) (0.6)
Requirement already satisfied: dill<0.3.8,>=0.3.0 in /usr/local/lib/python3.10/dist-packages (from datasets) (0.3.7)
Requirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages (from datasets) (1.5.3)
Requirement already satisfied: requests>=2.19.0 in /usr/local/lib/python3.10/dist-packages (from datasets) (2.31.0)
Requirement already satisfied: tqdm>=4.62.1 in /usr/local/lib/python3.10/dist-packages (from datasets) (4.66.1)
Requirement already satisfied: xxhash in /usr/local/lib/python3.10/dist-packages (from datasets) (3.4.1)
Requirement already satisfied: multiprocess in /usr/local/lib/python3.10/dist-packages (from datasets) (0.70.15)
Requirement already satisfied: fsspec[http]<=2023.10.0,>=2023.1.0 in /usr/local/lib/python3.10/dist-packages (from datasets) (2023.6.0)
Requirement already satisfied: aiohttp in /usr/local/lib/python3.10/dist-packages (from datasets) (3.9.1)
Requirement already satisfied: huggingface-hub>=0.19.4 in /usr/local/lib/python3.10/dist-packages (from datasets) (0.20.2)
Requirement already satisfied: packaging in /usr/local/lib/python3.10/dist-packages (from datasets) (23.2)
Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.10/dist-packages (from datasets) (6.0.1)
Requirement already satisfied: attrs>=17.3.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (23.2.0)
Requirement already satisfied: multidict<7.0,>=4.5 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (6.0.4)
Requirement already satisfied: yarl<2.0,>=1.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (1.9.4)
Requirement already satisfied: frozenlist>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (1.4.1)
Requirement already satisfied: aiosignal>=1.1.2 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (1.3.1)
Requirement already satisfied: async-timeout<5.0,>=4.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (4.0.3)
Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub>=0.19.4->datasets) (4.5.0)
Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests>=2.19.0->datasets) (3.3.2)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests>=2.19.0->datasets) (3.6)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests>=2.19.0->datasets) (2.0.7)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests>=2.19.0->datasets) (2023.11.17)
Requirement already satisfied: python-dateutil>=2.8.1 in /usr/local/lib/python3.10/dist-packages (from pandas->datasets) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas->datasets) (2023.3.post1)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.1->pandas->datasets) (1.16.0)
Requirement already satisfied: transformers in /usr/local/lib/python3.10/dist-packages (4.35.2)
Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from transformers) (3.13.1)
Requirement already satisfied: huggingface-hub<1.0,>=0.16.4 in /usr/local/lib/python3.10/dist-packages (from transformers) (0.20.2)
Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.10/dist-packages (from transformers) (1.23.5)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from transformers) (23.2)
Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.10/dist-packages (from transformers) (6.0.1)
Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.10/dist-packages (from transformers) (2023.6.3)
Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from transformers) (2.31.0)
Requirement already satisfied: tokenizers<0.19,>=0.14 in /usr/local/lib/python3.10/dist-packages (from transformers) (0.15.0)
Requirement already satisfied: safetensors>=0.3.1 in /usr/local/lib/python3.10/dist-packages (from transformers) (0.4.1)
Requirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.10/dist-packages (from transformers) (4.66.1)
Requirement already satisfied: fsspec>=2023.5.0 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub<1.0,>=0.16.4->transformers) (2023.6.0)
Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub<1.0,>=0.16.4->transformers) (4.5.0)
Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests->transformers) (3.3.2)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->transformers) (3.6)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->transformers) (2.0.7)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->transformers) (2023.11.17)
from datasets import load_dataset
import torch
from transformers import AutoTokenizer, RobertaForSequenceClassification, RobertaTokenizerFast, TrainingArguments, Trainer
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
Loading and processing the dataset
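def load_and_process_dataset():
    dataset = load_dataset("sst2")
    # Note: remove_columns returns a new DatasetDict. The result is not
    # assigned here, so 'idx' is still present in the printed dataset.
    dataset.remove_columns('idx')
    # The original 'test' split is dropped; the original 'validation'
    # split (872 rows) is used as the test set instead.
    del dataset['test']
    dataset['test'] = dataset['validation']
    del dataset['validation']
    # Hold out 1600 training rows as the new validation set.
    split_dataset = dataset['train'].train_test_split(test_size=1600)
    dataset['train'] = split_dataset['train']
    dataset['validation'] = split_dataset['test']
    return dataset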
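`train_test_split(test_size=1600)` reserves 1600 rows of the training split for validation and leaves the rest for training. The sketch below illustrates that idea in plain Python; it is not how the `datasets` library implements it, and `holdout_split` and its `seed` argument are hypothetical names introduced here. The row count 67349 is assumed from the sizes printed later (65749 + 1600).

```python
import random


def holdout_split(num_rows, test_size, seed=0):
    """Shuffle row indices and carve off `test_size` of them (conceptual sketch)."""
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)
    # First `test_size` shuffled indices are held out, the rest stay in training.
    return indices[test_size:], indices[:test_size]


# 67349 = 65749 + 1600, the train and validation sizes printed by the notebook.
train_idx, val_idx = holdout_split(num_rows=67349, test_size=1600)
print(len(train_idx), len(val_idx))
```

No seed is passed in the notebook's call, so the held-out rows may differ from run to run; `train_test_split` accepts a `seed` argument if a reproducible split is needed.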
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }
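To make explicit what `compute_metrics` returns, here is a minimal plain-Python sketch of the same four quantities for a binary task with label 1 as the positive class. It is an illustration under that assumption, not scikit-learn's implementation; `binary_metrics` and the toy labels and logits are made up for the example.

```python
def binary_metrics(labels, logits):
    """Accuracy, precision, recall and F1 with label 1 as the positive class."""
    # argmax over the two logits of each example gives the predicted class
    preds = [max(range(len(row)), key=row.__getitem__) for row in logits]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    correct = sum(1 for y, p in zip(labels, preds) if y == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": correct / len(labels),
        "f1": f1,
        "precision": precision,
        "recall": recall,
    }


# Toy data (hypothetical): five examples, two logits each.
labels = [1, 0, 1, 1, 0]
logits = [[0.1, 0.9], [0.8, 0.2], [0.7, 0.3], [0.2, 0.8], [0.4, 0.6]]
metrics = binary_metrics(labels, logits)
print(metrics)
```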
dataset = load_and_process_dataset()
dataset
DatasetDict({
    train: Dataset({
        features: ['idx', 'sentence', 'label'],
        num_rows: 65749
    })
    test: Dataset({
        features: ['idx', 'sentence', 'label'],
        num_rows: 872
    })
    validation: Dataset({
        features: ['idx', 'sentence', 'label'],
        num_rows: 1600
    })
})
train = dataset['train']
validation = dataset['validation']
test = dataset['test']
model = RobertaForSequenceClassification.from_pretrained('roberta-base')
# model_max_length is the tokenizer argument that caps the sequence length used for truncation
tokenizer = RobertaTokenizerFast.from_pretrained('roberta-base', model_max_length=512)
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight', 'classifier.dense.bias'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
def tokenization(batched_text):
    return tokenizer(batched_text['sentence'], padding=True, truncation=True)
train_data = train.map(tokenization, batched = True, batch_size = len(train))
val_data = validation.map(tokenization, batched = True, batch_size = len(validation))
test_data = test.map(tokenization, batched = True, batch_size = len(test))
Map: 0%| | 0/65749 [00:00<?, ? examples/s]
Map: 0%| | 0/1600 [00:00<?, ? examples/s]
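Because each `map` call uses `batch_size = len(...)`, the tokenizer sees the whole split as one batch, and `padding = True` pads every sentence to the longest one in that split. The sketch below shows what that padding amounts to. It is a plain-Python illustration, not the tokenizer's code: `pad_batch` is a hypothetical helper and the token ids are made up, while the pad id of 1 matches `padding_idx=1` in the model printout.

```python
PAD_ID = 1  # RoBERTa's padding id (padding_idx=1 in the model printout)


def pad_batch(sequences, pad_id=PAD_ID):
    """Pad every sequence to the longest one and build the matching attention mask."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two made-up token id sequences of different lengths.
batch = pad_batch([[0, 100, 2], [0, 100, 200, 300, 2]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding a whole split to its longest sentence is simple but wasteful for short sentences; padding per training batch would be one alternative design.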
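# Expose the tensors the model consumes. attention_mask is included so the
# model can ignore padding positions; the raw 'sentence' strings are not model inputs.
train_data.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
val_data.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
test_data.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])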
!pip install 'transformers[torch]>=4.34,<4.35'
Collecting transformers[torch]<4.35,>=4.34
  Downloading transformers-4.34.1-py3-none-any.whl (7.7 MB)
Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from transformers[torch]<4.35,>=4.34) (3.13.1)
Requirement already satisfied: huggingface-hub<1.0,>=0.16.4 in /usr/local/lib/python3.10/dist-packages (from transformers[torch]<4.35,>=4.34) (0.20.2)
Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.10/dist-packages (from transformers[torch]<4.35,>=4.34) (1.23.5)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from transformers[torch]<4.35,>=4.34) (23.2)
Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.10/dist-packages (from transformers[torch]<4.35,>=4.34) (6.0.1)
Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.10/dist-packages (from transformers[torch]<4.35,>=4.34) (2023.6.3)
Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from transformers[torch]<4.35,>=4.34) (2.31.0)
Collecting tokenizers<0.15,>=0.14 (from transformers[torch]<4.35,>=4.34)
  Downloading tokenizers-0.14.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.8 MB)
Requirement already satisfied: safetensors>=0.3.1 in /usr/local/lib/python3.10/dist-packages (from transformers[torch]<4.35,>=4.34) (0.4.1)
Requirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.10/dist-packages (from transformers[torch]<4.35,>=4.34) (4.66.1)
Requirement already satisfied: torch!=1.12.0,>=1.10 in /usr/local/lib/python3.10/dist-packages (from transformers[torch]<4.35,>=4.34) (2.1.0+cu121)
Requirement already satisfied: accelerate>=0.20.3 in /usr/local/lib/python3.10/dist-packages (from transformers[torch]<4.35,>=4.34) (0.26.0)
Requirement already satisfied: psutil in /usr/local/lib/python3.10/dist-packages (from accelerate>=0.20.3->transformers[torch]<4.35,>=4.34) (5.9.5)
Requirement already satisfied: fsspec>=2023.5.0 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub<1.0,>=0.16.4->transformers[torch]<4.35,>=4.34) (2023.6.0)
Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub<1.0,>=0.16.4->transformers[torch]<4.35,>=4.34) (4.5.0)
Collecting huggingface-hub<1.0,>=0.16.4 (from transformers[torch]<4.35,>=4.34)
  Downloading huggingface_hub-0.17.3-py3-none-any.whl (295 kB)
Requirement already satisfied: sympy in /usr/local/lib/python3.10/dist-packages (from torch!=1.12.0,>=1.10->transformers[torch]<4.35,>=4.34) (1.12)
Requirement already satisfied: networkx in /usr/local/lib/python3.10/dist-packages (from torch!=1.12.0,>=1.10->transformers[torch]<4.35,>=4.34) (3.2.1)
Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from torch!=1.12.0,>=1.10->transformers[torch]<4.35,>=4.34) (3.1.2)
Requirement already satisfied: triton==2.1.0 in /usr/local/lib/python3.10/dist-packages (from torch!=1.12.0,>=1.10->transformers[torch]<4.35,>=4.34) (2.1.0)
Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests->transformers[torch]<4.35,>=4.34) (3.3.2)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->transformers[torch]<4.35,>=4.34) (3.6)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->transformers[torch]<4.35,>=4.34) (2.0.7)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->transformers[torch]<4.35,>=4.34) (2023.11.17)
Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->torch!=1.12.0,>=1.10->transformers[torch]<4.35,>=4.34) (2.1.3)
Requirement already satisfied: mpmath>=0.19 in /usr/local/lib/python3.10/dist-packages (from sympy->torch!=1.12.0,>=1.10->transformers[torch]<4.35,>=4.34) (1.3.0)
Installing collected packages: huggingface-hub, tokenizers, transformers
  Attempting uninstall: huggingface-hub
    Found existing installation: huggingface-hub 0.20.2
    Uninstalling huggingface-hub-0.20.2:
      Successfully uninstalled huggingface-hub-0.20.2
  Attempting uninstall: tokenizers
    Found existing installation: tokenizers 0.15.0
    Uninstalling tokenizers-0.15.0:
      Successfully uninstalled tokenizers-0.15.0
  Attempting uninstall: transformers
    Found existing installation: transformers 4.35.2
    Uninstalling transformers-4.35.2:
      Successfully uninstalled transformers-4.35.2
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
datasets 2.16.1 requires huggingface-hub>=0.19.4, but you have huggingface-hub 0.17.3 which is incompatible.
Successfully installed huggingface-hub-0.17.3 tokenizers-0.14.1 transformers-4.34.1
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=16,
    per_device_eval_batch_size=8,
    evaluation_strategy="epoch",
    disable_tqdm=False,
    load_best_model_at_end=False,
    warmup_steps=500,
    weight_decay=0.01,
    logging_steps=8,
    fp16=True,
    logging_dir='./logs',
    dataloader_num_workers=2,
    run_name='roberta-classification',
    optim="adamw_torch"
)
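The batch size and accumulation settings determine how many optimizer updates the run performs. The arithmetic below is a sketch under two assumptions that are not stated in the notebook: training runs on a single device, and micro-batches left over at the end of an epoch do not count as an extra update. With those assumptions the total matches the 3081 steps reported in the training output.

```python
import math

num_train_rows = 65749            # size of the train split printed earlier
per_device_train_batch_size = 4
gradient_accumulation_steps = 16
num_train_epochs = 3
num_devices = 1                   # assumption: single GPU

# Examples contributing to each optimizer update
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_devices

# Micro-batches the data loader yields per epoch (last one may be smaller)
batches_per_epoch = math.ceil(num_train_rows / (per_device_train_batch_size * num_devices))

# Optimizer updates per epoch, ignoring leftover micro-batches (assumption)
updates_per_epoch = batches_per_epoch // gradient_accumulation_steps

total_updates = updates_per_epoch * num_train_epochs
print(effective_batch, batches_per_epoch, updates_per_epoch, total_updates)
```

Under these assumptions `warmup_steps=500` covers roughly the first sixth of the run.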
trainer = Trainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_data,
    eval_dataset=val_data,
)
trainer.train()
[3081/3081 41:54, Epoch 2/3]
Epoch | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall |
---|---|---|---|---|---|---|
0 | 0.207500 | 0.209651 | 0.923750 | 0.934054 | 0.909474 | 0.960000 |
1 | 0.217200 | 0.171252 | 0.943750 | 0.949944 | 0.951002 | 0.948889 |
2 | 0.067300 | 0.173004 | 0.939375 | 0.946141 | 0.945616 | 0.946667 |
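As a quick consistency check on the table, F1 is the harmonic mean of precision and recall, so each F1 value can be recomputed from the other two columns. The snippet is a small sketch that copies the numbers from the table; `harmonic_f1` is a hypothetical helper name.

```python
def harmonic_f1(precision, recall):
    """F1 as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) per epoch, copied from the table above
rows = [
    (0.909474, 0.960000, 0.934054),
    (0.951002, 0.948889, 0.949944),
    (0.945616, 0.946667, 0.946141),
]

recomputed = [harmonic_f1(p, r) for p, r, _ in rows]
for (p, r, reported), value in zip(rows, recomputed):
    print(f"reported={reported:.6f} recomputed={value:.6f}")
```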
TrainOutput(global_step=3081, training_loss=0.18958045851048694, metrics={'train_runtime': 2517.0617, 'train_samples_per_second': 78.364, 'train_steps_per_second': 1.224, 'total_flos': 6788946644810280.0, 'train_loss': 0.18958045851048694, 'epoch': 3.0})
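The throughput figures in `TrainOutput` follow from the runtime, the step count and the number of training rows. The check below is a sketch that assumes samples per second is computed as epochs times train rows divided by runtime; the inputs are copied from the outputs above.

```python
train_runtime = 2517.0617   # seconds, from TrainOutput
global_step = 3081          # from TrainOutput
num_train_rows = 65749      # train split size printed earlier
num_train_epochs = 3

samples_per_second = num_train_epochs * num_train_rows / train_runtime
steps_per_second = global_step / train_runtime

print(round(samples_per_second, 3), round(steps_per_second, 3))
```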
print(model)
RobertaForSequenceClassification(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0-11): 12 x RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): RobertaIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
            (intermediate_act_fn): GELUActivation()
          )
          (output): RobertaOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
    )
  )
  (classifier): RobertaClassificationHead(
    (dense): Linear(in_features=768, out_features=768, bias=True)
    (dropout): Dropout(p=0.1, inplace=False)
    (out_proj): Linear(in_features=768, out_features=2, bias=True)
  )
)
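The classifier head in the printout is the part that was newly initialized before fine-tuning, and its size can be read off the two `Linear` layers. The arithmetic below is a small sketch based only on the shapes shown; `linear_params` is a hypothetical helper.

```python
def linear_params(in_features, out_features, bias=True):
    """Parameter count of a fully connected layer: weight matrix plus optional bias."""
    return in_features * out_features + (out_features if bias else 0)


dense = linear_params(768, 768)      # (dense): Linear(768 -> 768)
out_proj = linear_params(768, 2)     # (out_proj): Linear(768 -> 2)
head_total = dense + out_proj

print(dense, out_proj, head_total)
```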
trainer.evaluate()