## Deep learning – text processing – labs
# 3. RNN

### Softmax approach with embeddings, illustrated with NER

In [46]:
!pip install torch torchtext
!pip install torch datasets
!pip install ipywidgets
!jupyter nbextension enable --py widgetsnbextension





In [47]:
from collections import Counter
import torch
from datasets import load_dataset
from torchtext.vocab import vocab
from tqdm import tqdm
from ipywidgets import FloatProgress

We load the `conll2003` dataset (https://huggingface.co/datasets/conll2003), whose tokens are annotated with part-of-speech tags (*POS tags*), chunk tags, and named-entity tags (*NER tags*). This notebook uses the NER tags:

In [48]:
dataset = load_dataset("conll2003")

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


In [49]:
print(dataset)
print(dataset["train"]["tokens"])

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3453
    })
})


Below is a function that builds the vocabulary (https://pytorch.org/text/stable/vocab.html).

The `specials` parameter lists the special symbols:
* `<unk>` – unknown token
* `<pad>` – padding
* `<bos>` – beginning of sentence
* `<eos>` – end of sentence

In [50]:
print(dataset["train"]["chunk_tags"])

[[11, 21, 11, 12, 21, 22, 11, 12, 0], [11, 12], [11, 12], [11, 12, 12, 21, 13, 11, 11, 21, 13, 11, 12, 13, 11, 21, 22, 11, 12, 17, 11, 21, 17, 11, 12, 12, 21, 22, 22, 13, 11, 0], [11, 11, 12, 13, 11, 12, 12, 11, 12, 12, 12, 12, 21, 13, 11, 12, 21, 22, 11, 13, 11, 1, 13, 11, 17, 11, 12, 12, 21, 1, 0], [0, 11, 21, 22, 22, 11, 12, 12, 17, 11, 21, 22, 22, 11, 12, 13, 11, 0, 0, 11, 12, 11, 12, 12, 12, 12, 12, 12, 21, 11, 12, 12, 0], [11, 21, 11, 12, 12, 21, 22, 0, 17, 11, 21, 22, 17, 11, 21, 22, 11, 21, 22, 22, 13, 11, 12, 12, 0], [11, 21, 11, 12, 11, 12, 13, 11, 12, 12, 12, 12, 21, 22, 11, 12, 0, 11, 0, 11, 12, 13, 11, 12, 12, 12, 12, 12, 21, 11, 12, 1, 2, 2, 11, 21, 22, 11, 12, 0], [11, 12, 12, 21, 13, 11, 13, 11, 12, 12, 11, 13, 11, 11, 12, 21, 22, 11, 12, 12, 0, 11, 0, 0, 11, 12, 12, 0], [0, 11, 21, 22, 22, 11, 12, 13, 11, 12, 11, 12, 12, 12, 0, 11, 12, 12, 12, 0, 21, 17, 11, 12, 21, 22, 13, 3, 21, 3, 11, 12, 12, 13, 11, 12, 0], [11, 12, 12, 12, 12, 12, 21, 22, 22, 11, 13, 11, 12, 12, 1

In [51]:
def build_vocab(dataset):
    # Count token occurrences across all documents.
    counter = Counter()
    for document in dataset:
        counter.update(document)
    return vocab(counter, specials=["<unk>", "<pad>", "<bos>", "<eos>"])

In [52]:
v = build_vocab(dataset["train"]["tokens"])

In [53]:
itos = v.get_itos()  # mapping from indices to tokens

In [54]:
print(itos)



In [55]:
len(itos)  # number of distinct tokens in the vocabulary

23627

In [56]:
v["rejects"]  # index of the token `rejects`

5

In [57]:
v["<unk>"]  # index of the unknown token

0

If the analysed text contains a token that is not in the vocabulary, that token is represented by the default index. We set the default index to the index of the unknown token:

In [58]:
v.set_default_index(v["<unk>"])
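
The lookup behaviour described above can be imitated in plain Python. The sketch below is only an illustration: `SimpleVocab` is a hypothetical stand-in, not the torchtext class, and it assumes that special symbols are placed first and that the remaining tokens keep their first-seen order.

```python
from collections import Counter

# Assumed special symbols, in the order used when building the vocabulary.
SPECIALS = ["<unk>", "<pad>", "<bos>", "<eos>"]


class SimpleVocab:
    """Hypothetical stand-in for a torchtext vocabulary (illustration only)."""

    def __init__(self, documents, specials=SPECIALS):
        counter = Counter()
        for document in documents:
            counter.update(document)
        # Specials come first, then tokens in first-seen order
        # (Counter preserves insertion order).
        self.itos = list(specials) + [t for t in counter if t not in specials]
        self.stoi = {token: index for index, token in enumerate(self.itos)}
        self.default_index = self.stoi["<unk>"]

    def __getitem__(self, token):
        # Out-of-vocabulary tokens fall back to the default index.
        return self.stoi.get(token, self.default_index)


docs = [["EU", "rejects", "German", "call"], ["Peter", "Blackburn"]]
sv = SimpleVocab(docs)
print(sv["EU"], sv["rejects"], sv["never-seen-token"])
```

With this toy input, `EU` and `rejects` receive the first two indices after the four special symbols, and an unseen token maps to the index of the unknown symbol.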

In [59]:
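def data_process(dt):
    # Vectorize text documents: map tokens to indices and wrap each
    # document in the beginning-of-sentence and end-of-sentence symbols.
    return [
        torch.tensor(
            [v["<bos>"]] + [v[token] for token in document] + [v["<eos>"]],
            dtype=torch.long,
        )
        for document in dt
    ]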

In [60]:
def labels_process(dt):
    # Vectorize the NER labels; the two added boundary positions get label 0.
    return [torch.tensor([0] + document + [0], dtype=torch.long) for document in dt]

Now we vectorize all the data:

In [61]:
print(dataset["train"]["tokens"])
train_tokens_ids = data_process(dataset["train"]["tokens"])



In [62]:
test_tokens_ids = data_process(dataset["test"]["tokens"])

In [63]:
validation_tokens_ids = data_process(dataset["validation"]["tokens"])

In [64]:
print(dataset["train"]["ner_tags"])
train_labels = labels_process(dataset["train"]["ner_tags"])

[[3, 0, 7, 0, 0, 0, 7, 0, 0], [1, 2], [5, 0], [0, 3, 4, 0, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [5, 0, 0, 0, 0, 3, 4, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 1, 2, 2, 2, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 4, 0], [0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 0, 7, 0, 0, 0, 0, 5, 0, 5, 0, 0, 0, 0, 0, 0, 0, 7, 8, 8, 0, 7, 0, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [7, 0, 0, 1, 2, 2, 0, 0, 0, 1, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0], [0, 5, 0, 5, 0, 1, 0, 0, 0], [0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0

In [65]:
validation_labels = labels_process(dataset["validation"]["ner_tags"])

In [66]:
test_labels = labels_process(dataset["test"]["ner_tags"])

An example of what the data looks like after vectorization:

In [67]:
train_tokens_ids[0]

tensor([ 2, 4, 5, 6, 7, 8, 9, 10, 11, 12, 3])

In [68]:
dataset["train"][0]

{'id': '0',
 'tokens': ['EU',
 'rejects',
 'German',
 'call',
 'to',
 'boycott',
 'British',
 'lamb',
 '.'],
 'pos_tags': [22, 42, 16, 21, 35, 37, 16, 21, 7],
 'chunk_tags': [11, 21, 11, 12, 21, 22, 11, 12, 0],
 'ner_tags': [3, 0, 7, 0, 0, 0, 7, 0, 0]}

In [69]:
train_labels[0]

tensor([0, 3, 0, 7, 0, 0, 0, 7, 0, 0, 0])
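
To make the numeric labels easier to read, the sketch below aligns the first training sentence with tag names. The name list `NER_NAMES` is an assumption taken from the dataset card; in a live session it could presumably be read from `dataset["train"].features["ner_tags"].feature.names` instead. The boundary strings are only placeholders for the two added positions.

```python
# Tag names in the order given on the dataset card (assumed, not read from the data).
NER_NAMES = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"]

tokens = ["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."]
ner_tags = [3, 0, 7, 0, 0, 0, 7, 0, 0]

# Mirror the vectorization: one extra position at each end, labelled 0.
padded_tokens = ["<bos>"] + tokens + ["<eos>"]
padded_tags = [0] + ner_tags + [0]
assert len(padded_tokens) == len(padded_tags)

aligned = [(token, NER_NAMES[tag]) for token, tag in zip(padded_tokens, padded_tags)]
for token, name in aligned:
    print(f"{token:10s} {name}")
```

The two extra zeros in the label tensor therefore correspond to the sentence-boundary symbols, which keeps tokens and labels the same length.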

The function we will use for evaluation:

In [70]:
def get_scores(y_true, y_pred):
    # Returns token-level precision, recall and F1.
    # Tags greater than 0 count as positives; tag 0 is ignored.
    tp = 0
    selected_items = 0
    relevant_items = 0

    for p, t in zip(y_pred, y_true):
        if p > 0 and p == t:
            tp += 1

        if p > 0:
            selected_items += 1

        if t > 0:
            relevant_items += 1

    if selected_items == 0:
        precision = 1.0
    else:
        precision = tp / selected_items

    if relevant_items == 0:
        recall = 1.0
    else:
        recall = tp / relevant_items

    if precision + recall == 0.0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)

    return precision, recall, f1
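
A small worked example may help when checking the metric by hand. The sketch below uses a compact re-implementation, `token_scores` (a hypothetical name), that follows the same counting rules: a prediction counts as selected when it is greater than 0, and as a true positive when it is also equal to the gold tag.

```python
def token_scores(y_true, y_pred):
    # Same counting rules as the evaluation function, written compactly.
    tp = sum(1 for p, t in zip(y_pred, y_true) if p > 0 and p == t)
    selected = sum(1 for p in y_pred if p > 0)
    relevant = sum(1 for t in y_true if t > 0)
    precision = tp / selected if selected else 1.0
    recall = tp / relevant if relevant else 1.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


y_true = [0, 3, 0, 7, 0]
y_pred = [0, 3, 7, 7, 0]  # one spurious tag at position 2

precision, recall, f1 = token_scores(y_true, y_pred)
print(precision, recall, f1)
```

Here two of the three predicted tags are correct (precision 2/3), both gold tags are found (recall 1), and F1 comes out at about 0.8. Note that the score is computed per token, not per entity span.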

How many distinct NER tags are there?

In [71]:
num_tags = max([max(x) for x in dataset["train"]["ner_tags"]]) + 1
print(num_tags)

9


Implementation of an LSTM recurrent neural network:

In [72]:
class LSTM(torch.nn.Module):

    def __init__(self):
        super(LSTM, self).__init__()
        # 100-dimensional embeddings, one per vocabulary entry.
        self.emb = torch.nn.Embedding(len(v.get_itos()), 100)
        # Single-layer LSTM with 256 hidden units; input is (batch, seq, feature).
        self.rec = torch.nn.LSTM(100, 256, 1, batch_first=True)
        # Per-token projection onto the NER tag scores.
        self.fc1 = torch.nn.Linear(256, num_tags)

    def forward(self, x):
        emb = torch.relu(self.emb(x))
        lstm_output, (h_n, c_n) = self.rec(emb)
        out_weights = self.fc1(lstm_output)
        return out_weights
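
As a rough sanity check on the model size, the parameter count can be worked out by hand. The arithmetic below assumes the vocabulary size reported earlier (23627), the layer sizes from the class above, and the usual PyTorch LSTM layout of four gates with two bias vectors each.

```python
vocab_size, emb_dim, hidden, num_tags = 23627, 100, 256, 9

# Tensor shapes along the forward pass for one sentence of length seq_len:
#   token ids   (1, seq_len)
#   embeddings  (1, seq_len, emb_dim)
#   LSTM output (1, seq_len, hidden)
#   tag scores  (1, seq_len, num_tags)

embedding_params = vocab_size * emb_dim
# Four gates, each with input weights, recurrent weights and two bias vectors.
lstm_params = 4 * (emb_dim * hidden + hidden * hidden + 2 * hidden)
linear_params = hidden * num_tags + num_tags

total_params = embedding_params + lstm_params + linear_params
print(embedding_params, lstm_params, linear_params, total_params)
```

Under these assumptions the embedding table dominates: about 2.36 million of roughly 2.73 million parameters.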

Creating the model:

In [73]:
lstm = LSTM()

Definition of the loss function:

In [74]:
criterion = torch.nn.CrossEntropyLoss()
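
For intuition, the quantity this loss computes with its default mean reduction can be sketched in plain Python: a log-softmax over the tag scores of each token, followed by the average negative log-probability of the gold tags. The function name `cross_entropy` and the toy inputs are illustrative only.

```python
import math


def cross_entropy(logits, targets):
    # logits: one list of raw scores per token; targets: one gold tag id per token.
    total = 0.0
    for row, target in zip(logits, targets):
        m = max(row)  # subtract the maximum for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[target]  # negative log-softmax of the gold tag
    return total / len(targets)


gold = [0, 3, 7]

# Equal scores for all 9 tags: the loss equals ln(9).
uniform = [[0.0] * 9 for _ in gold]
loss_uniform = cross_entropy(uniform, gold)

# A large score on the gold tag: the loss is close to 0.
confident = [[10.0 if j == t else 0.0 for j in range(9)] for t in gold]
loss_confident = cross_entropy(confident, gold)

print(loss_uniform, loss_confident)
```

This is why the model returns raw scores without a softmax layer: the normalization happens inside the loss.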

Definition of the optimizer:

In [75]:
optimizer = torch.optim.Adam(lstm.parameters())

Function for evaluating the model:

In [76]:
def eval_model(dataset_tokens, dataset_labels, model):
    # Collect token-level predictions sentence by sentence and score them.
    Y_true = []
    Y_pred = []
    with torch.no_grad():  # gradients are not needed during evaluation
        for i in tqdm(range(len(dataset_labels))):
            batch_tokens = dataset_tokens[i].unsqueeze(0)
            tags = list(dataset_labels[i].numpy())
            Y_true += tags

            Y_batch_pred_weights = model(batch_tokens).squeeze(0)
            Y_batch_pred = torch.argmax(Y_batch_pred_weights, 1)
            Y_pred += list(Y_batch_pred.numpy())

    return get_scores(Y_true, Y_pred)

Training the model:

In [77]:
NUM_EPOCHS = 5

In [78]:
for epoch in range(NUM_EPOCHS):
    lstm.train()
    # One sentence per optimization step (batch size 1).
    for i in tqdm(range(len(train_labels))):
        batch_tokens = train_tokens_ids[i].unsqueeze(0)  # (1, seq_len)
        tags = train_labels[i]  # (seq_len,)

        predicted_tags = lstm(batch_tokens)  # (1, seq_len, num_tags)

        optimizer.zero_grad()
        loss = criterion(predicted_tags.squeeze(0), tags)

        loss.backward()
        optimizer.step()

    # Evaluate on the validation set after every epoch.
    lstm.eval()
    print(eval_model(validation_tokens_ids, validation_labels, lstm))

100%|██████████| 14041/14041 [05:54<00:00, 39.57it/s]
100%|██████████| 3250/3250 [00:01<00:00, 1678.69it/s]


(0.5988246210949583, 0.4500755550389399, 0.513902714181432)


100%|██████████| 14041/14041 [07:01<00:00, 33.29it/s]
100%|██████████| 3250/3250 [00:01<00:00, 1652.85it/s]


(0.7379187666765491, 0.5786353597582239, 0.6486416053163073)


100%|██████████| 14041/14041 [06:35<00:00, 35.49it/s]
100%|██████████| 3250/3250 [00:02<00:00, 1513.42it/s]


(0.7980072463768116, 0.6144368243635941, 0.6942930321140081)


100%|██████████| 14041/14041 [06:34<00:00, 35.58it/s]
100%|██████████| 3250/3250 [00:02<00:00, 1468.00it/s]


(0.8167669945676113, 0.646634894804138, 0.7218113403399506)


100%|██████████| 14041/14041 [06:28<00:00, 36.11it/s]
100%|██████████| 3250/3250 [00:02<00:00, 1558.26it/s]

(0.8325018896447468, 0.6401255376031617, 0.7237481929294256)
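
The loop above processes one sentence per step, and each epoch took roughly six to seven minutes. The padding symbol from the vocabulary is not used anywhere. One possible way to form larger batches is sketched below in plain Python; `pad_batch` is a hypothetical helper, and `PAD_INDEX = 1` assumes the padding symbol is the second special symbol. With tensors, `torch.nn.utils.rnn.pad_sequence` and the `ignore_index` argument of `CrossEntropyLoss` would be the natural counterparts.

```python
PAD_INDEX = 1  # assumed index of the padding symbol


def pad_batch(sequences, pad_value):
    # Pad every sequence on the right to the length of the longest one and
    # return a mask marking real positions (1) versus padding (0).
    max_len = max(len(seq) for seq in sequences)
    padded = [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return padded, mask


batch = [[2, 4, 5, 6, 3], [2, 13, 14, 3]]
padded, mask = pad_batch(batch, PAD_INDEX)
print(padded)
print(mask)
```

The mask (or an ignored label value) matters because padded positions should contribute neither to the loss nor to the evaluation scores.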





Evaluation:

In [79]:
eval_model(validation_tokens_ids, validation_labels, lstm)

100%|██████████| 3250/3250 [00:02<00:00, 1603.66it/s]


(0.8325018896447468, 0.6401255376031617, 0.7237481929294256)

In [80]:
eval_model(test_tokens_ids, test_labels, lstm)

100%|██████████| 3453/3453 [00:02<00:00, 1517.54it/s]


(0.7690643591130341, 0.525887573964497, 0.6246430924665056)

## Task 3

Clone the repository https://git.wmi.amu.edu.pl/kubapok/en-ner-conll-2003

Build a *sequence labelling* model for the NER task, based on any recurrent neural network (you may follow the example from class).

Place the predictions for dev-0/in.tsv and test-A/in.tsv in dev-0/out.tsv and test-A/out.tsv, respectively.
Use the GEval tool (https://gitlab.com/filipg/geval) for evaluation:

    wget https://gonito.net/get/bin/geval
    chmod u+x geval
    ./geval --help

The number of points awarded for the task depends on the accuracy obtained on the `test-A` set (the result is rounded up):

    points = math.ceil(accuracy * 7.0)
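
For orientation, the formula can be evaluated for a few accuracy values. The accuracies below are made-up inputs chosen for illustration, not results from any model.

```python
import math


def points(accuracy):
    # Scoring rule from the task description: scale by 7 and round up.
    return math.ceil(accuracy * 7.0)


examples = {accuracy: points(accuracy) for accuracy in (0.5, 0.72, 0.9, 1.0)}
for accuracy, score in examples.items():
    print(f"accuracy={accuracy:.2f} -> {score} points")
```

Because of the rounding up, any accuracy above 6/7 (about 0.857) already yields the maximum of 7 points.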

⚠️ In Moodle, please attach the `test-A/out.tsv` file and a link to the repository with your solution.
 