## Transformer

Use a transformer pipeline (https://huggingface.co/docs/transformers/main_classes/pipelines) to implement named entity recognition (NER) on the dataset https://git.wmi.amu.edu.pl/kubapok/en-ner-conll-2003. \
Evaluate the results with the GEval tool.

### Importing libraries

In [1]:
import pandas as pd
from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer
from tqdm.notebook import tqdm

### Loading the data

In [2]:
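# quoting=3 (csv.QUOTE_NONE) keeps quotation marks in the text as literal characters,
# so the number of space-separated tokens per line is not altered by CSV quote handling.
test_A_data = pd.read_csv("test-A/in.tsv", sep="\t", header=None, names=["x"], quoting=3)
dev_0_data = pd.read_csv("dev-0/in.tsv", sep="\t", header=None, names=["x"], quoting=3)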

### Setting up the model, tokenizer, and pipeline

In [3]:
checkpoint = "dbmdz/bert-large-cased-finetuned-conll03-english"
model = AutoModelForTokenClassification.from_pretrained(checkpoint)
# Load the tokenizer from the same checkpoint so its vocabulary is guaranteed to match the model.
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
recognizer = pipeline("ner", model=model, tokenizer=tokenizer)

Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


### Function for repairing predicted tag sequences

In [4]:
def correct_labels(data):
    """Turn every I-X tag that does not continue an entity of type X into B-X."""
    corrected_lines = []

    for line in data:
        corrected_line = []
        previous_token = "O"

        for token in line:
            # An I-X tag is only valid directly after B-X or I-X of the same type
            # (X is ORG, PER, LOC, or MISC for this model).
            if token.startswith("I-") and previous_token[2:] != token[2:]:
                corrected_line.append("B-" + token[2:])
            else:
                corrected_line.append(token)

            # The comparison uses the original (uncorrected) tag.
            previous_token = token

        corrected_lines.append(corrected_line)

    return corrected_lines
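
The constraint that `correct_labels` enforces can also be checked directly. The helper below is a minimal sketch, not part of the original notebook; the name `is_valid_bio` is hypothetical. It reports whether a single tag sequence contains an `I-X` tag that does not follow `B-X` or `I-X` of the same type.

```python
def is_valid_bio(tags):
    """Return True if every I-X tag directly follows B-X or I-X of the same type."""
    previous = "O"
    for tag in tags:
        # "O"[2:] is the empty string, so an I-X right after "O" is rejected too.
        if tag.startswith("I-") and previous[2:] != tag[2:]:
            return False
        previous = tag
    return True


print(is_valid_bio(["O", "I-ORG", "I-ORG"]))  # False: the entity starts with I-ORG
print(is_valid_bio(["O", "B-ORG", "I-ORG"]))  # True
```

Running such a check over the output of `correct_labels` should return `True` for every line.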

### Function for predicting NER tags

In [5]:
def predict_ner_tags(data):
    predictions = []
    for counter, line in enumerate(data, start=1):
        print(f'Predicting NER tags for line {counter}/{len(data)}... ', end='')
        result = recognizer(line)
        # Map the character offset at which each tagged (sub)token starts to its tag.
        entity_dict = {res['start']: res['entity'] for res in result}

        # Character offset at which each word starts.
        # This assumes that words are separated by single spaces.
        word_positions = []
        position = 0
        for word in line.split():
            word_positions.append(position)
            position += len(word) + 1

        # A word takes the tag of the token that starts at its first character, otherwise "O".
        classified_words = [entity_dict.get(pos, "O") for pos in word_positions]

        predictions.append(classified_words)
        print('Done')
    return correct_labels(predictions)
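
The offset-based alignment in `predict_ner_tags` can be illustrated without loading the model. The sketch below uses a hand-written list that mimics the `entity`, `start`, `end`, and `word` fields returned by the token-classification pipeline; the values are illustrative, not real model output, and the function name `tags_by_word` is hypothetical.

```python
def tags_by_word(line, result):
    """Assign one tag per whitespace-separated word using character offsets."""
    # Tag predicted for the token that starts at each character offset.
    tag_at_offset = {item["start"]: item["entity"] for item in result}

    tags = []
    position = 0
    for word in line.split():
        # Only a token that starts exactly at the word's first character counts.
        tags.append(tag_at_offset.get(position, "O"))
        position += len(word) + 1  # assumes single spaces between words
    return tags


# Hand-written stand-in for the pipeline output (illustrative values only).
mock_result = [
    {"entity": "I-ORG", "start": 0, "end": 4, "word": "Wolf"},
    {"entity": "I-ORG", "start": 4, "end": 9, "word": "##sburg"},
    {"entity": "I-ORG", "start": 15, "end": 22, "word": "Hamburg"},
]

print(tags_by_word("Wolfsburg beat Hamburg", mock_result))
# ['I-ORG', 'O', 'I-ORG']
```

The subword piece `##sburg` starts at offset 4, which is not the start of any word, so it is ignored and each word receives exactly one tag. Both organisations come out as `I-ORG` here, which is the kind of sequence that `correct_labels` then rewrites to start with `B-ORG`.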

### Function for saving the results

In [6]:
def save_predictions(predictions, filename):
    with open(filename, "w") as f:
        for line in predictions:
            f.write(" ".join(line) + "\n")
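
Each output line is expected to hold one tag per input token, so it may be worth checking the counts before writing the files. The helper below is a sketch under that assumption; `find_length_mismatches` is a hypothetical name, not part of the notebook or of GEval.

```python
def find_length_mismatches(texts, predictions):
    """Return (line_number, n_tokens, n_tags) for every line whose counts differ."""
    mismatches = []
    for number, (text, tags) in enumerate(zip(texts, predictions), start=1):
        n_tokens = len(text.split())
        if n_tokens != len(tags):
            mismatches.append((number, n_tokens, len(tags)))
    return mismatches


texts = ["EU rejects German call", "Peter Blackburn"]
predictions = [["B-ORG", "O", "B-MISC", "O"], ["B-PER"]]
print(find_length_mismatches(texts, predictions))
# [(2, 2, 1)]
```

An empty list means that every line has as many tags as tokens.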

### Computing the NER tags

In [7]:
print("Prediction for dev-0 data")
dev_0_labels = predict_ner_tags(dev_0_data["x"])

print()

print("Prediction for test-A data")
test_A_labels = predict_ner_tags(test_A_data["x"])

Prediction for dev-0 data
Predicting NER tags for line 1/215... Done
Predicting NER tags for line 2/215... Done
Predicting NER tags for line 3/215... Done
...

### Saving the results to files

In [8]:
save_predictions(dev_0_labels, "dev-0/out.tsv")
save_predictions(test_A_labels, "test-A/out.tsv")
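
GEval remains the authoritative evaluator for this task. For a quick local sanity check, an entity-level F1 score can be approximated in plain Python. The sketch below is an assumption-laden illustration, not GEval's implementation: the function names are hypothetical, and the score it produces may differ from the one GEval reports. It takes lists of tag sequences, such as `dev_0_labels` and the gold tags, which are assumed here to be available in `dev-0/expected.tsv` in the same space-separated format.

```python
def extract_spans(tags):
    """Return the set of (start, end, type) entity spans in one BIO tag sequence."""
    spans = set()
    start, kind = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # the sentinel closes a trailing span
        # A span ends at "O", at any B- tag, or when the entity type changes.
        if start is not None and (tag == "O" or tag.startswith("B-") or tag[2:] != kind):
            spans.add((start, i, kind))
            start, kind = None, None
        if tag != "O" and start is None:
            start, kind = i, tag[2:]
    return spans


def span_f1(expected, predicted):
    """Micro-averaged F1 over exact-match entity spans."""
    tp = fp = fn = 0
    for exp_tags, pred_tags in zip(expected, predicted):
        exp_spans, pred_spans = extract_spans(exp_tags), extract_spans(pred_tags)
        tp += len(exp_spans & pred_spans)
        fp += len(pred_spans - exp_spans)
        fn += len(exp_spans - pred_spans)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy data: one of the two gold entities is found.
expected = [["B-PER", "I-PER", "O", "B-LOC"]]
predicted = [["B-PER", "I-PER", "O", "O"]]
print(round(span_f1(expected, predicted), 3))
# 0.667
```

In the toy example, precision is 1.0 and recall is 0.5, which gives an F1 of 2/3.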