s452629/en-ner-conll-2003

Marcin Czerniak 04464285bd WIP Co-authored-by: Alexxiia <Alexxiia@users.noreply.github.com>		2023-06-21 01:52:12 +02:00
dev-0	End of sentence missing token added	2019-01-18 09:38:23 +01:00
test-A	End of sentence missing token added	2019-01-18 09:38:23 +01:00
train	End of sentence missing token added	2019-01-18 09:38:23 +01:00
.gitignore	add .gitignore	2021-05-19 08:40:29 +02:00
config.txt	Initial commit	2019-01-17 14:01:36 +01:00
README.md	Updated README.md	2019-01-17 14:12:44 +01:00
train.py	WIP	2023-06-21 01:52:12 +02:00

README.md

CoNLL-2003 English Named Entity Recognition.

NER challenge for CoNLL-2003 English. The annotations were taken from the University of Antwerp. The English data is a collection of newswire articles from the Reuters Corpus, RCV1.

Format of the train set

The train set has just two columns separated by TABs:

the expected BIO labels,
the document.

Each line is a separate training item. Note that this is TSV format, not CSV: double quotes are not interpreted in a special way!
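
As a minimal sketch of reading this format in Python, a line can be split on the first TAB with no quote handling at all. It assumes that both columns are space-separated and aligned one-to-one (one label per token), which the description above does not state explicitly. The helper name `parse_train_line` and the example line are hypothetical.

```python
def parse_train_line(line):
    """Split one training line into (labels, tokens).

    Assumption: column 1 holds space-separated BIO labels and column 2
    holds the space-separated tokens of the document, aligned one-to-one.
    """
    # Plain split on TAB: double quotes are ordinary characters in this TSV.
    labels_field, document = line.rstrip("\n").split("\t", 1)
    labels = labels_field.split(" ")
    tokens = document.split(" ")
    if len(labels) != len(tokens):
        raise ValueError(f"{len(labels)} labels but {len(tokens)} tokens")
    return labels, tokens


# Made-up example line; the quotes around "call" stay part of the token.
example = 'B-ORG O B-MISC O\tEU rejects German "call"\n'
labels, tokens = parse_train_line(example)
```

Splitting by hand avoids the quote processing that a CSV reader would apply by default.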

Preprocessing snippet located here

End-of-lines inside documents were replaced with the '' tag.

The train set is compressed with the xz compressor. To see a random sample of 10 training items, run:

xzcat train/train.tsv.xz | shuf -n 10 | less -S

(The -S option disables line wrapping; press "q" to exit less.)
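
The same sampling can be sketched in Python with the standard `lzma` module, which may be handy inside a training script. The function name `sample_training_items` and its parameters are hypothetical; the file is assumed to be UTF-8 encoded.

```python
import lzma
import random


def sample_training_items(path, n=10, seed=None):
    """Return up to n random lines from an xz-compressed TSV file."""
    # lzma.open in text mode decompresses on the fly, much like xzcat.
    with lzma.open(path, "rt", encoding="utf-8") as handle:
        lines = [line.rstrip("\n") for line in handle]
    rng = random.Random(seed)
    # Never ask for more items than the file contains.
    return rng.sample(lines, min(n, len(lines)))
```

Calling `sample_training_items("train/train.tsv.xz")` would then return 10 random training lines. The whole file is read into memory, which is simple but may not suit very large data sets.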

Format of the test sets

For the test sets, the input data is given in two files: the text in in.tsv and the expected labels in expected.tsv. (The files have .tsv extensions for consistency, but they do not actually contain TABs.)

To see the first 5 test items run:

cat dev-0/in.tsv | paste dev-0/expected.tsv - | head -n 5
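
A rough Python equivalent of the `paste` command above pairs the two files line by line. This is only a sketch: the function name `read_items` is hypothetical, and UTF-8 encoding is assumed.

```python
from itertools import islice


def read_items(in_path, expected_path):
    """Yield (expected_labels, text) pairs, one per line, like `paste`."""
    with open(expected_path, encoding="utf-8") as expected, \
         open(in_path, encoding="utf-8") as inputs:
        # zip stops at the shorter file; both files should have equal length.
        for labels, text in zip(expected, inputs):
            yield labels.rstrip("\n"), text.rstrip("\n")


def first_items(in_path, expected_path, n=5):
    """Return the first n pairs, mirroring `head -n 5`."""
    return list(islice(read_items(in_path, expected_path), n))
```

For example, `first_items("dev-0/in.tsv", "dev-0/expected.tsv")` would return the first 5 dev items as (labels, text) tuples.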

The expected.tsv file for the test-A test set is hidden and is not available in the master branch.

Format of the output files

For each input line, a probability for each label must be given:

label1:prob1 label2:prob2 ... labelN:probN

(The separator is space, not TAB here.)

You are expected to supply dev-0/out.tsv and test-A/out.tsv in this format (the file has .tsv extension for consistency, but actually there should be no TAB there).
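
One way to produce a line in the `label:prob` layout described above is sketched below. The function name `format_output_line`, the example labels, and the choice of four decimal places are assumptions for illustration; the description above does not prescribe a number format.

```python
def format_output_line(probabilities):
    """Render {label: probability} as 'label1:prob1 label2:prob2 ...'."""
    # Fields are joined with single spaces, never TABs.
    return " ".join(
        f"{label}:{prob:.4f}" for label, prob in probabilities.items()
    )


# Hypothetical probabilities for a single output line.
line = format_output_line({"B-PER": 0.7, "I-PER": 0.1, "O": 0.2})
```

Writing one such line per input line to dev-0/out.tsv keeps the output aligned with dev-0/in.tsv.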

Evaluation metrics

One evaluation metric is used:

BIO-F1, F1 metric on NER tags
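
To make the idea of an F1 score over BIO tags concrete, here is a minimal sketch of a span-level, exact-match F1 for a single tag sequence. It is an illustration only and is not taken from GEval: whether GEval's BIO-F1 handles malformed tags or aggregates over lines in the same way is an assumption. The function names `bio_spans` and `bio_f1` are hypothetical.

```python
def bio_spans(tags):
    """Extract (type, start, end) entity spans from a BIO tag sequence."""
    spans, start, kind = set(), None, None
    # The trailing "O" is a sentinel that closes a span ending at the last tag.
    for i, tag in enumerate(list(tags) + ["O"]):
        inside = tag.startswith("I-") and kind == tag[2:]
        if start is not None and not inside:
            spans.add((kind, start, i))
            start, kind = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            # Assumption: a stray I- tag without a matching B- starts a span.
            start, kind = i, tag[2:]
    return spans


def bio_f1(expected, predicted):
    """F1 over exactly matching entity spans of one sequence."""
    gold, guess = bio_spans(expected), bio_spans(predicted)
    correct = len(gold & guess)
    if correct == 0:
        return 0.0
    precision = correct / len(guess)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)


# One of two gold entities is found: precision 1.0, recall 0.5.
score = bio_f1(["B-PER", "I-PER", "O", "B-LOC"], ["B-PER", "I-PER", "O", "O"])
```

A span counts as correct only when its type and both boundaries match, so a partially recovered entity contributes nothing to the score in this sketch.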

Directory structure

README.md — this file
config.txt — GEval configuration file
train/ — directory with training data
train/train.tsv.xz — train set
dev-0/ — directory with dev (test) data (split preserved from CoNLL-2003)
dev-0/in.tsv — input data for the dev set
dev-0/expected.tsv — expected (reference) data for the dev set
test-A — directory with test data
test-A/in.tsv — input data for the test set
test-A/expected.tsv — expected (reference) data for the test set (hidden from the developers, not available in the master branch)

Usually, there is no reason to change these files.