Uczenie_Glebokie/5. Transformer/transformer.ipynb

!unzip -q /content/en-ner-conll-2003.zip -d /content/
import os
import pandas as pd
from transformers import pipeline, AutoModelForTokenClassification, BertTokenizer

Declare data paths

data_dir_path = 'en-ner-conll-2003'
train_path = os.path.join(data_dir_path, 'train', 'train.tsv')
dev_texts_path = os.path.join(data_dir_path, 'dev-0', 'in.tsv')
dev_labels_path = os.path.join(data_dir_path, 'dev-0', 'expected.tsv')
dev_predicted_path = os.path.join(data_dir_path, 'dev-0', 'out.tsv')
test_texts_path = os.path.join(data_dir_path, 'test-A', 'in.tsv')
test_predicted_path = os.path.join(data_dir_path, 'test-A', 'out.tsv')
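As an optional sanity check (assuming the zip unpacked into the layout above), verify that the input files actually exist before loading them:

for path in [train_path, dev_texts_path, dev_labels_path, test_texts_path]:
    assert os.path.exists(path), f"Missing file: {path}"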

Load files

train_data = pd.read_csv(train_path, sep='\t', usecols=[0, 1], header=None, names=['label', 'text'])
dev_texts_data = pd.read_csv(dev_texts_path, sep='\t', usecols=[0], header=None, names=['text'])
dev_labels_data = pd.read_csv(dev_labels_path, sep='\t', usecols=[0], header=None, names=['label'])
test_texts_data = pd.read_csv(test_texts_path, sep='\t', usecols=[0], header=None, names=['text'])
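A brief look at what was loaded (illustrative; the shapes depend on the challenge data):

print(train_data.shape, dev_texts_data.shape, test_texts_data.shape)
train_data.head()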

Create the transformer model, a whitespace tokenizer, and the NER pipeline

model = AutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")

class SpaceTokenizer(BertTokenizer):
    # Split on whitespace only, so each space-separated word maps to
    # exactly one token and token indices line up with word positions.
    def tokenize(self, text):
        return text.split()

# bert-base-cased shares the cased WordPiece vocabulary used by the
# dbmdz bert-large-cased checkpoint, so its vocab files work here.
tokenizer = SpaceTokenizer.from_pretrained("bert-base-cased")
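# A minimal check of the whitespace tokenizer (illustrative sentence):
# each space-separated word becomes exactly one token, and words missing
# from the WordPiece vocabulary are mapped to [UNK].
print(tokenizer.tokenize("George Washington went to Washington"))
# ['George', 'Washington', 'went', 'to', 'Washington']
print(tokenizer.convert_tokens_to_ids(['Supercalifragilistic']))
# e.g. [100], the id of [UNK] in this vocabulary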

recognizer = pipeline("ner", model=model, tokenizer=tokenizer)
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'BertTokenizer'. 
The class this function is called from is 'SpaceTokenizer'.
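The warning above is expected: SpaceTokenizer deliberately replaces the checkpoint's BertTokenizer. As a quick check (arbitrary example sentence), the pipeline returns one dict per tagged token, with entity, score, index, and word fields:

recognizer("George Washington visited Berlin yesterday")
# e.g. [{'entity': 'I-PER', 'score': 0.99..., 'index': 1, 'word': 'George', ...},
#       {'entity': 'I-PER', ..., 'index': 2, 'word': 'Washington', ...},
#       {'entity': 'I-LOC', ..., 'index': 4, 'word': 'Berlin', ...}]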

Predict and save results

def predict(X):
    # Run the NER pipeline on each input line.
    predictions = []
    for text in X:
        predictions.append(recognizer(text))
    return predictions

def map_predictions(X, predictions):
    # Convert pipeline output back to one space-separated tag per word.
    results = []
    for text, prediction in zip(X, predictions):
        # Default every word to 'O' (outside any entity).
        result = ['O'] * len(text.split())
        for prediction_element in prediction:
            # 'index' is 1-based because position 0 is the [CLS] token;
            # with the whitespace tokenizer, index-1 is the word position.
            result[prediction_element['index'] - 1] = prediction_element['entity']
        results.append(" ".join(result))
    return results

def predict_and_save(X, filename):
    X = X['text']
    predictions = predict(X)
    Y_predicted = map_predictions(X, predictions)
    Y_predicted_df = pd.DataFrame(Y_predicted, columns=['predicted_label'])
    Y_predicted_df.to_csv(filename, sep='\t', index=False, header=None)
    return Y_predicted_df
dev_predicted = predict_and_save(dev_texts_data, dev_predicted_path)
test_predicted = predict_and_save(test_texts_data, test_predicted_path)
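Since dev-0 ships with gold labels, a rough token-level accuracy can be computed against the file just written (a sketch only; the challenge's official metric may differ):

predicted_labels = pd.read_csv(dev_predicted_path, sep='\t', header=None, names=['label'])
correct = total = 0
for gold, pred in zip(dev_labels_data['label'], predicted_labels['label']):
    for g, p in zip(gold.split(), pred.split()):
        correct += int(g == p)
        total += 1
print(f"Token-level accuracy on dev-0: {correct / total:.4f}")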