CoNLL-2003 English Named Entity Recognition.

A named entity recognition (NER) challenge based on the CoNLL-2003 English data. Annotations were taken from the University of Antwerp. The English data is a collection of newswire articles from the Reuters Corpus, RCV1.

Format of the train set

The train set has just two columns separated by TABs:

the expected BIO labels,
the document.

Each line is a separate training item. Note that this is TSV format, not CSV: double quotes are not interpreted in a special way!
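As a minimal sketch of reading this format in Python, the snippet below splits each line on the first TAB and leaves double quotes untouched, instead of going through a CSV parser that might interpret them. The function names `parse_train_line` and `read_train` and the sample line are illustrative assumptions, not part of the challenge.

```python
import lzma


def parse_train_line(line):
    """Split one training line into (labels, document) on the first TAB."""
    labels, document = line.rstrip("\n").split("\t", 1)
    return labels, document


def read_train(path):
    """Yield (labels, document) pairs from an xz-compressed TSV file."""
    with lzma.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            yield parse_train_line(line)


# A made-up line: the double quotes stay ordinary characters.
example = 'B-ORG O O\tReuters said "yes"\n'
print(parse_train_line(example))
```

With the repository checked out, `read_train("train/train.tsv.xz")` would iterate over the whole train set without decompressing it to disk first.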

Preprocessing snippet located here

End-of-lines inside documents were replaced with the '' tag.

The train set is compressed with xz. To see a random sample of 10 training items, run:

xzcat train/train.tsv.xz | shuf -n 10 | less -S

(The -S option disables line wrapping; press "q" to exit less.)

Format of the test sets

For the test sets, the input data is given in two files: the text in in.tsv and the expected labels in expected.tsv. (The files have .tsv extensions for consistency but actually they do not contain TABs.)

To see the first 5 test items run:

cat dev-0/in.tsv | paste dev-0/expected.tsv - | head -n 5
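The same pairing can be sketched in Python, which may be handier when the pairs are needed inside a script. This is only an illustration: `head_pairs` and `head_pairs_from_files` are hypothetical helper names, and the code assumes that line i of in.tsv corresponds to line i of expected.tsv, as the paste command above implies.

```python
from itertools import islice


def head_pairs(expected_lines, input_lines, n=5):
    """Return the first n (expected, input) pairs, with newlines stripped."""
    pairs = zip(expected_lines, input_lines)
    return [(exp.rstrip("\n"), inp.rstrip("\n")) for exp, inp in islice(pairs, n)]


def head_pairs_from_files(expected_path, input_path, n=5):
    """Pair the two files line by line, like `paste expected.tsv in.tsv | head`."""
    with open(expected_path, encoding="utf-8") as expected_file, \
            open(input_path, encoding="utf-8") as input_file:
        return head_pairs(expected_file, input_file, n)
```

For the dev set, the call would be `head_pairs_from_files("dev-0/expected.tsv", "dev-0/in.tsv")`.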

The expected.tsv file for the test-A test set is hidden and is not available in the master branch.

Format of the output files

For each input line, a probability for each label must be given:

label1:prob1 label2:prob2 ... labelN:probN

(The separator is space, not TAB here.)

You are expected to supply dev-0/out.tsv and test-A/out.tsv in this format (the files have a .tsv extension for consistency, but there should be no TABs in them).
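A minimal sketch of producing lines in this format from Python follows. It assumes the model's output for one input line is available as a dict mapping labels to probabilities; `format_output_line`, `write_output`, and the labels in the usage note are illustrative assumptions rather than anything prescribed by the challenge.

```python
def format_output_line(probabilities):
    """Render {label: probability} as 'label1:prob1 label2:prob2 ...'."""
    # Space-separated label:prob pairs, no TABs, in dict insertion order.
    return " ".join(f"{label}:{prob}" for label, prob in probabilities.items())


def write_output(path, rows):
    """Write one formatted line per input item to an out.tsv file."""
    with open(path, "w", encoding="utf-8") as handle:
        for probabilities in rows:
            handle.write(format_output_line(probabilities) + "\n")
```

For example, `format_output_line({"O": 0.9, "B-PER": 0.1})` gives `O:0.9 B-PER:0.1`, and `write_output("dev-0/out.tsv", rows)` would write one such line per item in `rows`.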

Evaluation metrics

Two evaluation metrics are used:

Soft-F1 - a "softer" version of F1 in which overlap is also counted

Directory structure

README.md — this file
config.txt — GEval configuration file
train/ — directory with training data (15/16 of the original train set)
train/train.tsv.xz — train set
train/meta.tsv — metadata (the original ID of the document); do not use during training, this is just for reference (e.g. when you need to go back to the original document)
dev-0/ — directory with dev (test) data (split preserved from CoNLL-2003)
dev-0/in.tsv — input data for the dev set
dev-0/expected.tsv — expected (reference) data for the dev set
test-A/ — directory with test data
test-A/in.tsv — input data for the test set
test-A/expected.tsv — expected (reference) data for the test set (hidden from the developers, not available in the master branch)

Usually, there is no reason to change these files.
