# **Deep Learning - project**
The project uses the [emotion](https://huggingface.co/datasets/emotion) dataset, which contains posts labelled with specific emotions.

<br>

Labels:
- 0 - sadness
- 1 - joy
- 2 - love
- 3 - anger
- 4 - fear
- 5 - surprise
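
The mapping above can be kept in code so that integer ids and label names stay in sync. A minimal sketch; the names `ID2LABEL` and `LABEL2ID` are introduced here for illustration and are not taken from the project scripts.

```python
# Label ids and names as listed above.
ID2LABEL = {0: "sadness", 1: "joy", 2: "love", 3: "anger", 4: "fear", 5: "surprise"}
# Inverse mapping, used to turn a label name back into its id.
LABEL2ID = {name: idx for idx, name in ID2LABEL.items()}

print(ID2LABEL[3])
print(LABEL2ID["fear"])
```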

### **REQUIREMENTS**

In [1]:
!pip3 install transformers scikit-learn accelerate evaluate datasets torch sentencepiece torchvision

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [2]:
import os
import json
from pathlib import Path
from typing import Dict, List
from datasets import load_dataset
import torch
import pandas as pd

os.environ['TOKENIZERS_PARALLELISM'] = 'true'

### **DATA PREP**

In [3]:
!mkdir -p data
!python data_prep.py

No config specified, defaulting to: emotion/split
Found cached dataset emotion (/root/.cache/huggingface/datasets/emotion/split/1.0.0/cca5efe2dfeb58c1d098e0f9eeb200e9927d889b5a03c67097275dfb5fe463bd)
  0% 0/3 [00:00<?, ?it/s]100% 3/3 [00:00<00:00, 182.77it/s]
Saving into: data/train.json
Saving into: data/s2s-train.json
Saving into: data/valid.json
Saving into: data/s2s-valid.json
Saving into: data/test.json
Saving into: data/s2s-test.json
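
`data_prep.py` itself is not shown in this notebook. Judging from the files it writes, it presumably exports each split twice: once with integer labels and once with label names for the sequence-to-sequence models. The sketch below illustrates that conversion under this assumption; `to_s2s`, `write_jsonl` and `export_split` are hypothetical helper names, and records are passed in as plain dictionaries.

```python
import json
from pathlib import Path
from typing import Dict, Iterable, List

LABEL_NAMES = ["sadness", "joy", "love", "anger", "fear", "surprise"]


def to_s2s(record: Dict) -> Dict:
    """Replace the integer label with its name (assumed s2s-*.json format)."""
    return {"label": LABEL_NAMES[record["label"]], "text": record["text"]}


def write_jsonl(records: Iterable[Dict], path: Path) -> None:
    """Write one JSON object per line, as in data/train.json."""
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def export_split(records: List[Dict], out_dir: Path, name: str) -> None:
    """Write both variants of one split: <name>.json and s2s-<name>.json."""
    out_dir.mkdir(parents=True, exist_ok=True)
    write_jsonl(records, out_dir / f"{name}.json")
    write_jsonl((to_s2s(r) for r in records), out_dir / f"s2s-{name}.json")
```

With the `datasets` library, the records for one split could come from iterating over `load_dataset("emotion")["train"]`; the `validation` split would then be exported under the name `valid` to match the file names above.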


In [4]:
!head data/train.json

{"label": 0, "text": "i didnt feel humiliated"}
{"label": 0, "text": "i can go from feeling so hopeless to so damned hopeful just from being around someone who cares and is awake"}
{"label": 3, "text": "im grabbing a minute to post i feel greedy wrong"}
{"label": 2, "text": "i am ever feeling nostalgic about the fireplace i will know that it is still on the property"}
{"label": 3, "text": "i am feeling grouchy"}
{"label": 0, "text": "ive been feeling a little burdened lately wasnt sure why that was"}
{"label": 5, "text": "ive been taking or milligrams or times recommended amount and ive fallen asleep a lot faster but i also feel like so funny"}
{"label": 4, "text": "i feel as confused about life as a teenager or as jaded as a year old man"}
{"label": 1, "text": "i have been with petronas for years i feel that petronas has performed well and made a huge profit"}
{"label": 2, "text": "i feel romantic too"}


In [5]:
!head data/s2s-train.json

{"label": "sadness", "text": "i didnt feel humiliated"}
{"label": "sadness", "text": "i can go from feeling so hopeless to so damned hopeful just from being around someone who cares and is awake"}
{"label": "anger", "text": "im grabbing a minute to post i feel greedy wrong"}
{"label": "love", "text": "i am ever feeling nostalgic about the fireplace i will know that it is still on the property"}
{"label": "anger", "text": "i am feeling grouchy"}
{"label": "sadness", "text": "ive been feeling a little burdened lately wasnt sure why that was"}
{"label": "surprise", "text": "ive been taking or milligrams or times recommended amount and ive fallen asleep a lot faster but i also feel like so funny"}
{"label": "fear", "text": "i feel as confused about life as a teenager or as jaded as a year old man"}
{"label": "joy", "text": "i have been with petronas for years i feel that petronas has performed well and made a huge profit"}
{"label": "love", "text": "i feel romantic too"}


In [6]:
!wc -l data/*

   2000 data/s2s-test.json
  16000 data/s2s-train.json
   2000 data/s2s-valid.json
   2000 data/test.json
  16000 data/train.json
   2000 data/valid.json
  40000 total
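
The line counts confirm the split sizes but say nothing about how the six classes are distributed. A quick count per label helps when interpreting accuracy later. A minimal sketch using only the standard library; `label_counts` is a name introduced here, and the sample lines are copied from the `head` output above.

```python
import json
from collections import Counter
from typing import Iterable


def label_counts(lines: Iterable[str]) -> Counter:
    """Count labels in JSON-lines text, skipping blank lines."""
    return Counter(json.loads(line)["label"] for line in lines if line.strip())


# Two lines copied from the `head data/train.json` output.
sample = [
    '{"label": 0, "text": "i didnt feel humiliated"}',
    '{"label": 3, "text": "i am feeling grouchy"}',
]
print(label_counts(sample))
```

For a full split, an open file object can be passed directly, for example `label_counts(open("data/train.json"))`.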


## **ROBERTA**

- full data
- model `roberta-base`
- sequence length: 128
- training epochs: 1

In [7]:
!python run_glue.py \
  --cache_dir roberta_training_cache \
  --model_name_or_path roberta-base \
  --train_file data/train.json  \
  --validation_file data/valid.json \
  --test_file data/test.json \
  --per_device_train_batch_size 24 \
  --per_device_eval_batch_size 24 \
  --do_train \
  --do_eval \
  --do_predict \
  --max_seq_length 128 \
  --learning_rate 2e-5 \
  --num_train_epochs 1 \
  --output_dir out/emotion/roberta  \
  --overwrite_output_dir

2023-02-14 21:44:57.299984: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-02-14 21:44:57.452345: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-02-14 21:44:58.236913: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia
2023-02-14 21:44:58.237017: W tensorflow/compiler/xla/stream_executor

- full data
- sequence length: 128
- LeakyReLU instead of ReLU
- every other layer frozen
- custom head
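
The code behind `roberta_custom` is not included here. One way "every other layer frozen" could be implemented is to switch off gradients for the parameters of alternating encoder layers. The sketch below assumes Hugging Face style parameter names containing `encoder.layer.<index>.` and freezes the even-indexed layers; which layers the project actually froze is not stated. `is_frozen_layer` and `freeze_every_other_layer` are hypothetical helpers.

```python
import re

# Matches the layer index in names such as
# "roberta.encoder.layer.3.attention.self.query.weight".
LAYER_RE = re.compile(r"encoder\.layer\.(\d+)\.")


def is_frozen_layer(param_name: str) -> bool:
    """Return True for parameters in even-indexed encoder layers (assumption)."""
    match = LAYER_RE.search(param_name)
    return match is not None and int(match.group(1)) % 2 == 0


def freeze_every_other_layer(model) -> int:
    """Disable gradients for the selected layers; return the number of frozen tensors.

    `model` is any object exposing `named_parameters()`, e.g. a torch.nn.Module.
    """
    frozen = 0
    for name, param in model.named_parameters():
        if is_frozen_layer(name):
            param.requires_grad = False
            frozen += 1
    return frozen
```

The LeakyReLU change and the custom head are separate modifications and are not covered by this sketch.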

In [9]:
!python run_glue.py \
  --cache_dir roberta_custom_training_cache \
  --model_name_or_path roberta-base \
  --custom_model roberta_custom \
  --train_file data/train.json  \
  --validation_file data/valid.json \
  --test_file data/test.json \
  --per_device_train_batch_size 24 \
  --per_device_eval_batch_size 24 \
  --do_train \
  --do_eval \
  --do_predict \
  --max_seq_length 128 \
  --learning_rate 2e-5 \
  --num_train_epochs 1 \
  --output_dir out/emotion/roberta_custom  \
  --overwrite_output_dir

2023-02-14 21:47:02.722049: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-02-14 21:47:02.876002: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-02-14 21:47:03.659342: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia
2023-02-14 21:47:03.659451: W tensorflow/compiler/xla/stream_executor

## **GPT2**

- full data
- model `gpt2`
- sequence length: 128
- training epochs: 1 (the command also sets `--max_steps 2500`)

In [10]:
!python run_glue.py \
  --cache_dir gtp_cache_training \
  --model_name_or_path gpt2 \
  --train_file data/train.json  \
  --validation_file data/valid.json \
  --test_file data/test.json  \
  --per_device_train_batch_size 24  \
  --per_device_eval_batch_size 24 \
  --do_train  \
  --do_eval \
  --do_predict  \
  --max_seq_length 128  \
  --learning_rate 2e-5  \
  --num_train_epochs 1  \
  --output_dir out/emotion/gpt2  \
  --overwrite_output_dir \
  --eval_steps 250 \
  --evaluation_strategy steps \
  --metric_for_best_model accuracy \
  --logging_steps 100 \
  --save_total_limit 5 \
  --max_steps 2500 \
  --load_best_model_at_end True 

2023-02-14 21:48:52.605236: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-02-14 21:48:52.757779: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-02-14 21:48:53.540701: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia
2023-02-14 21:48:53.540799: W tensorflow/compiler/xla/stream_executor
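
The GPT2 command sets both `--num_train_epochs 1` and `--max_steps 2500`. In the Hugging Face `Trainer`, a positive `max_steps` takes precedence over `num_train_epochs`, so it is worth checking how many passes over the data 2500 steps correspond to. A small calculation, assuming a single device and no gradient accumulation (neither is stated in the notebook):

```python
import math

TRAIN_EXAMPLES = 16000   # from `wc -l data/train.json`
BATCH_SIZE = 24          # --per_device_train_batch_size
MAX_STEPS = 2500         # --max_steps


def steps_per_epoch(num_examples: int, batch_size: int) -> int:
    """Optimizer steps in one pass over the data (last partial batch kept)."""
    return math.ceil(num_examples / batch_size)


per_epoch = steps_per_epoch(TRAIN_EXAMPLES, BATCH_SIZE)
epochs = MAX_STEPS / per_epoch
print(f"{per_epoch} steps per epoch, {epochs:.2f} epochs in {MAX_STEPS} steps")
```

Under these assumptions the run covers roughly 3.75 epochs rather than one.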

- full dataset
- custom head

In [11]:
!python run_glue.py \
  --cache_dir gtp_custom_cache_training \
  --model_name_or_path gpt2 \
  --custom_model gpt2_custom  \
  --train_file data/train.json  \
  --validation_file data/valid.json \
  --test_file data/test.json  \
  --per_device_train_batch_size 24  \
  --per_device_eval_batch_size 24 \
  --do_train  \
  --do_eval \
  --do_predict  \
  --max_seq_length 128  \
  --learning_rate 2e-5  \
  --num_train_epochs 1  \
  --output_dir out/emotion/gpt2_custom  \
  --overwrite_output_dir \
  --eval_steps 250 \
  --evaluation_strategy steps \
  --metric_for_best_model accuracy \
  --logging_steps 100 \
  --save_total_limit 5 \
  --max_steps 2500 \
  --load_best_model_at_end True 

2023-02-14 21:56:25.884599: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-02-14 21:56:26.040127: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-02-14 21:56:26.823479: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia
2023-02-14 21:56:26.823595: W tensorflow/compiler/xla/stream_executor

## **T5**

- full data
- model `google/t5-v1_1-small`
- sequence length: source 256, target 128
- training epochs: 1 (the command also sets `--max_steps 2500`)
- first few layers frozen

In [12]:
!python run_translation.py \
  --cache_dir t5_cache_training \
  --model_name_or_path "google/t5-v1_1-small" \
  --train_file data/s2s-train.json \
  --validation_file data/s2s-valid.json \
  --test_file data/s2s-test.json \
  --per_device_train_batch_size 8 \
  --per_device_eval_batch_size 8 \
  --source_lang "text" \
  --target_lang "label" \
  --source_prefix "emotion classification" \
  --max_source_length 256 \
  --max_target_length 128 \
  --generation_max_length 128 \
  --do_train \
  --do_eval \
  --do_predict \
  --predict_with_generate \
  --num_train_epochs 1 \
  --output_dir out/emotion/t5_v1_1  \
  --overwrite_output_dir \
  --eval_steps 250 \
  --evaluation_strategy steps \
  --metric_for_best_model accuracy \
  --logging_steps 100 \
  --save_total_limit 5 \
  --max_steps 2500 \
  --load_best_model_at_end True 

2023-02-14 22:04:17.129470: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-02-14 22:04:17.281426: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-02-14 22:04:18.087509: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia
2023-02-14 22:04:18.087605: W tensorflow/compiler/xla/stream_executor
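
In the sequence-to-sequence setup the model generates the label as text, so predictions have to be mapped back to one of the six classes before accuracy can be computed. How the project's `run_translation.py` handles this is not shown; the sketch below is one possible normalisation. `parse_label` and `accuracy` are hypothetical helpers, and `parse_label` returns `None` for outputs that are not valid labels.

```python
from typing import Iterable, Optional

LABEL2ID = {"sadness": 0, "joy": 1, "love": 2, "anger": 3, "fear": 4, "surprise": 5}


def parse_label(generated: str) -> Optional[int]:
    """Map generated text to a label id, ignoring case and surrounding whitespace."""
    return LABEL2ID.get(generated.strip().lower())


def accuracy(predictions: Iterable[str], gold_ids: Iterable[int]) -> float:
    """Share of predictions whose parsed label equals the gold id."""
    pairs = list(zip(predictions, gold_ids))
    hits = sum(parse_label(pred) == gold for pred, gold in pairs)
    return hits / len(pairs) if pairs else 0.0
```

Counting invalid generations as errors, as done here, keeps the score comparable with the classification models, which always output one of the six classes.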

## **FLAN T5**

In [13]:
from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM
import json

In [14]:
# Device index for the pipeline: first GPU if available, otherwise CPU (-1).
if torch.cuda.is_available():
    device = 0
else:
    device = -1

In [15]:
def perform_shot_learning(pipeline_type, model_name, test_file):
    """Zero-shot evaluation: prompt the model with the label list and count exact matches."""
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.float32)
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    our_pipeline = pipeline(pipeline_type, model=model, tokenizer=tokenizer, device=device)

    labels = "possible labels: sadness, joy, love, anger, fear, surprise"

    # Each non-empty line of the test file is a JSON object with "text" and "label".
    with open(test_file) as f:
        examples = [json.loads(line) for line in f if line.strip()]

    correct = 0
    for ex in examples:
        prompt = f"{labels}\ntext: {ex['text']}\nlabel: "

        # Greedy decoding (do_sample=False) keeps the evaluation deterministic.
        predict = our_pipeline(prompt, do_sample=False)[0]['generated_text']

        # A prediction counts only if it matches the gold label name exactly.
        if predict == ex['label']:
            correct += 1

    print(f'Accuracy: {correct / len(examples)}')
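
To see exactly what the model receives, the prompt construction used in `perform_shot_learning` can be isolated. The sketch reproduces the same three-line format; `build_prompt` and `LABELS` are names introduced here for illustration.

```python
LABELS = ["sadness", "joy", "love", "anger", "fear", "surprise"]


def build_prompt(text: str) -> str:
    """Zero-shot prompt: the label list, the input text, then an open 'label: ' slot."""
    header = "possible labels: " + ", ".join(LABELS)
    return f"{header}\ntext: {text}\nlabel: "


print(build_prompt("i feel romantic too"))
```

No labelled examples are included in the prompt, so this is a zero-shot setting: the model has to rely on the label names alone.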

In [16]:
test_ds = 'data/s2s-test.json'

In [17]:
perform_shot_learning('text2text-generation', 'google/flan-t5-large', test_ds)

Accuracy: 0.647
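
The accuracy is measured on the 2000 test examples, so it carries some sampling uncertainty. As a rough illustration, the snippet below converts the accuracy back to a count and computes a normal-approximation standard error. This is a generic statistical estimate added for context, not something computed in the notebook.

```python
import math

ACCURACY = 0.647   # value printed above
N_TEST = 2000      # from `wc -l data/s2s-test.json`

correct = round(ACCURACY * N_TEST)
# Normal-approximation standard error of a proportion.
std_err = math.sqrt(ACCURACY * (1 - ACCURACY) / N_TEST)
print(f"{correct} of {N_TEST} correct, standard error ~ {std_err:.3f}")
```

A standard error of about 0.011 corresponds to a 95% interval of roughly ±0.02 around the reported value.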


In [18]:
!zip -r /content/projekt.zip /content/
