! pip install datasets transformers torch scikit-learn
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.9.0-py3-none-any.whl (462 kB)
Collecting transformers
  Downloading transformers-4.26.0-py3-none-any.whl (6.3 MB)
Requirement already satisfied: torch in /usr/local/lib/python3.8/dist-packages (1.13.1+cu116)
Requirement already satisfied: scikit-learn in /usr/local/lib/python3.8/dist-packages (1.0.2)
Installing collected packages: tokenizers, xxhash, urllib3, multiprocess, responses, huggingface-hub, transformers, datasets
Successfully installed datasets-2.9.0 huggingface-hub-0.12.0 multiprocess-0.70.14 responses-0.18.0 tokenizers-0.13.2 transformers-4.26.0 urllib3-1.26.14 xxhash-3.2.0
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
device
'cuda'
from datasets import load_dataset
from transformers import RobertaForSequenceClassification, RobertaTokenizer, Trainer, TrainingArguments
import torch
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
# Load the pretrained roberta-base encoder with a freshly initialized 6-way classification head.
# from_pretrained is a classmethod, so it is called directly on the class;
# building a model from a custom RobertaConfig first would simply be discarded.
model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=6)
model.to(device)
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
def tokenization(batched_text):
    # Tokenize a batch of examples, padding and truncating to the model's max length
    return tokenizer(batched_text['text'], padding=True, truncation=True)
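# Quick check (a sketch, assuming the tokenizer loaded above): the tokenizer returns
# input_ids and attention_mask, which are exactly the columns the Trainer consumes.
sample = tokenizer("i feel great today", padding=True, truncation=True)
print(sample.keys())        # dict_keys(['input_ids', 'attention_mask'])
print(sample['input_ids'])  # token ids for the example sentence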
def compute_metrics(pred):
    # pred is an EvalPrediction: raw logits in pred.predictions, gold labels in pred.label_ids
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    # with micro averaging on a single-label task, precision, recall and F1 all equal accuracy
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='micro')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }
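# Tiny sanity check (a sketch with made-up logits, not part of training):
# compute_metrics only needs an object exposing .predictions and .label_ids.
import numpy as np
from types import SimpleNamespace

dummy = SimpleNamespace(
    predictions=np.array([[0.1, 0.9], [0.8, 0.2]]),  # argmax -> [1, 0]
    label_ids=np.array([1, 1]),
)
print(compute_metrics(dummy))  # accuracy 0.5; micro precision/recall/F1 also 0.5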
# Accessing the model configuration
configuration = model.config
dataset = load_dataset("emotion")
train_data = dataset["train"]
test_data = dataset["test"]
eval_data = dataset["validation"]
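# Optional (a sketch): copy the dataset's own label names into the model config so
# saved checkpoints map class indices back to readable emotion names.
label_names = train_data.features["label"].names
model.config.id2label = {i: name for i, name in enumerate(label_names)}
model.config.label2id = {name: i for i, name in enumerate(label_names)}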
train_data = train_data.map(tokenization, batched=True, batch_size=len(train_data))
eval_data = eval_data.map(tokenization, batched=True, batch_size=len(eval_data))
train_data.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
eval_data.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
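# Quick look at one formatted example (a sketch): after set_format, the three listed
# columns come back as torch tensors, ready for the Trainer's default collator.
example = train_data[0]
print({k: v.shape if hasattr(v, 'shape') else v for k, v in example.items()})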
training_args = TrainingArguments(
    output_dir="./output",
    num_train_epochs=5,
    per_device_train_batch_size=32,
    gradient_accumulation_steps=16,
    per_device_eval_batch_size=32,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    disable_tqdm=False,
    load_best_model_at_end=True,
    warmup_steps=10,
    weight_decay=0.01,
    logging_steps=4,
    fp16=True,
    dataloader_num_workers=2,
    run_name='roberta-classification'
)
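# The effective batch size is per_device_train_batch_size * gradient_accumulation_steps
# (times the number of GPUs, a single one here), which is why the training log below
# reports a total train batch size of 512 and only ~31 optimizer steps per epoch.
print(32 * 16)       # 512 examples per optimizer step
print(16000 // 512)  # ~31 steps per epoch, 155 over 5 epochs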
trainer = Trainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_data,
    eval_dataset=eval_data,
)
trainer.train()
Downloading roberta-base config.json (481 B) and pytorch_model.bin (501 MB)...
Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'lm_head.bias', 'lm_head.layer_norm.bias', 'roberta.pooler.dense.weight', 'lm_head.decoder.weight', 'lm_head.layer_norm.weight', 'lm_head.dense.weight', 'lm_head.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.weight', 'classifier.out_proj.weight', 'classifier.dense.bias', 'classifier.out_proj.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Downloading tokenizer files (vocab.json, merges.txt) and the emotion dataset builder script...
WARNING:datasets.builder:No config specified, defaulting to: emotion/split
Downloading and preparing dataset emotion/split to /root/.cache/huggingface/datasets/emotion/split/1.0.0/cca5efe2dfeb58c1d098e0f9eeb200e9927d889b5a03c67097275dfb5fe463bd...
Generating train split: 16000 examples
Generating validation split: 2000 examples
Generating test split: 2000 examples
Dataset emotion downloaded and prepared to /root/.cache/huggingface/datasets/emotion/split/1.0.0/cca5efe2dfeb58c1d098e0f9eeb200e9927d889b5a03c67097275dfb5fe463bd. Subsequent calls will reuse this data.
Using cuda_amp half precision backend
The following columns in the training set don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: text. If text are not expected by `RobertaForSequenceClassification.forward`, you can safely ignore this message.
/usr/local/lib/python3.8/dist-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
***** Running training *****
  Num examples = 16000
  Num Epochs = 5
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 512
  Gradient Accumulation steps = 16
  Total optimization steps = 155
  Number of trainable parameters = 124650246
[155/155 06:28, Epoch 4/5]
Epoch | Training Loss | Validation Loss | Accuracy | F1       | Precision | Recall
------|---------------|-----------------|----------|----------|-----------|---------
0     | 1.089600      | 0.782759        | 0.711500 | 0.711500 | 0.711500  | 0.711500
1     | 0.380000      | 0.267057        | 0.900000 | 0.900000 | 0.900000  | 0.900000
2     | 0.186000      | 0.183795        | 0.923000 | 0.923000 | 0.923000  | 0.923000
3     | 0.156900      | 0.163963        | 0.934500 | 0.934500 | 0.934500  | 0.934500
4     | 0.130100      | 0.160831        | 0.933500 | 0.933500 | 0.933500  | 0.933500
The following columns in the evaluation set don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: text. If text are not expected by `RobertaForSequenceClassification.forward`, you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 2000
  Batch size = 32
Saving model checkpoint to ./output/checkpoint-31
Configuration saved in ./output/checkpoint-31/config.json
Model weights saved in ./output/checkpoint-31/pytorch_model.bin
[... the same evaluation and checkpoint messages repeat for checkpoints 62, 93, 124 and 155 ...]
Training completed. Do not forget to share your model on huggingface.co/models =)
Loading best model from ./output/checkpoint-155 (score: 0.1608307659626007).
TrainOutput(global_step=155, training_loss=0.495477742725803, metrics={'train_runtime': 393.8758, 'train_samples_per_second': 203.11, 'train_steps_per_second': 0.394, 'total_flos': 3612118290333696.0, 'train_loss': 0.495477742725803, 'epoch': 4.99})
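# Optional sanity check (a sketch): re-run evaluation on the validation split; the
# returned dict contains eval_loss plus the compute_metrics values with an "eval_"
# prefix, and should match the final row of the table above.
eval_metrics = trainer.evaluate()
print(eval_metrics)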
# Per-example evaluation on the held-out test split (run on CPU).
i = 0
sum_preds = 0
model = model.to('cpu')
model.eval()
with torch.no_grad():
    for line in test_data:
        inputs = tokenizer(line.get('text'), return_tensors="pt", truncation=True)
        # no labels are needed for inference; the predicted class is the argmax of the logits
        outputs = model(**inputs)
        predictions = outputs.logits.argmax(dim=-1)
        a = int(predictions.item())
        b = line.get('label')
        i += 1
        sum_preds += int(a == b)
print(f"ACCURACY: {(sum_preds/i * 100)}")
ACCURACY: 93.15
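# A faster alternative (a sketch, assuming the trainer and tokenization function above
# are still in scope): tokenize the test split once and let the Trainer run batched
# prediction on the GPU, reusing compute_metrics for accuracy and F1.
test_tok = test_data.map(tokenization, batched=True, batch_size=len(test_data))
test_tok.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
model.to(device)                      # move the model back to the GPU first
pred_out = trainer.predict(test_tok)
print(pred_out.metrics)               # e.g. test_accuracy, test_f1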