en-ner-conll-2003/README.md
Przemysław Lipka bf94068487 Initial commit
2019-01-17 14:01:36 +01:00

CoNLL-2003 English Named Entity Recognition.

NER challenge for CoNLL-2003 English. Annotations were taken from the University of Antwerp. The English data is a collection of newswire articles from the Reuters Corpus, RCV1.

Format of the train set

The train set has just two columns separated by TABs:

  • the expected BIO labels,
  • the document.

Each line is a separate training item. Note that this is TSV format, not CSV: double quotes are not interpreted in a special way!
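As an illustration of the "TSV, not CSV" point, here is a minimal Python sketch that splits one training line on its first TAB and leaves double quotes untouched. The function name `parse_train_line` is hypothetical, and the sketch assumes that both the labels and the document are space-separated with one label per token; check this against the actual data before relying on it.

```python
from typing import List, Tuple


def parse_train_line(line: str) -> Tuple[List[str], List[str]]:
    """Split one training line into (labels, tokens).

    Assumption: labels and tokens are space-separated and aligned one-to-one.
    """
    # Strip only the line terminator; any other whitespace is data.
    line = line.rstrip("\r\n")
    # Split on the first TAB only. No CSV-style quote handling is applied,
    # so double quotes stay in the text as ordinary characters.
    labels_field, document = line.split("\t", 1)
    labels = labels_field.split(" ")
    tokens = document.split(" ")
    if len(labels) != len(tokens):
        raise ValueError(f"{len(labels)} labels but {len(tokens)} tokens")
    return labels, tokens
```

For a made-up line such as `B-ORG O O<TAB>EU "rejects" German`, the quoted word comes back as the single token `"rejects"`, quotes included.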

Preprocessing snippet located here

End-of-lines inside documents were replaced with the '' tag.

The train set is compressed with the xz compressor. To see a random sample of 10 training items, run:

xzcat train/train.tsv.xz | shuf -n 10 | less -S

(The -S option disables line wrapping; press "q" to exit the less pager.)
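The same kind of random sample can be drawn from Python without decompressing the file to disk. The sketch below is one possible approach, not part of the challenge tooling: it reads the xz file with the standard `lzma` module and uses reservoir sampling. The function name `sample_train_items` and the UTF-8 encoding are assumptions.

```python
import lzma
import random


def sample_train_items(path, k=10, seed=None):
    """Return up to k lines of an xz-compressed file, chosen uniformly at random."""
    rng = random.Random(seed)
    sample = []
    # Assumption: the file is UTF-8 text with "\n" line endings.
    with lzma.open(path, "rt", encoding="utf-8", newline="\n") as handle:
        for index, line in enumerate(handle):
            item = line.rstrip("\n")
            if index < k:
                # Fill the reservoir with the first k lines.
                sample.append(item)
            else:
                # Keep the new line with probability k / (index + 1).
                slot = rng.randint(0, index)
                if slot < k:
                    sample[slot] = item
    return sample


# Example call (path taken from the command above):
# for item in sample_train_items("train/train.tsv.xz"):
#     print(item)
```

Reservoir sampling keeps only `k` lines in memory, so the whole train set never has to be loaded at once.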

Format of the test sets

For the test sets, the input data is given in two files: the text in in.tsv and the expected labels in expected.tsv. (The files have .tsv extensions for consistency but actually they do not contain TABs.)

To see the first 5 test items run:

cat dev-0/in.tsv | paste dev-0/expected.tsv - | head -n 5
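A rough Python counterpart of this command is sketched below, in case the items need to be inspected from a script. The helper name `paste_expected_and_input` is hypothetical. Unlike `paste`, `zip` stops at the shorter input instead of padding, which only matters if the two files differ in length.

```python
from itertools import islice


def paste_expected_and_input(expected_lines, input_lines, limit=5):
    """Yield 'expected<TAB>input' rows for the first `limit` items."""
    pairs = zip(expected_lines, input_lines)
    for expected, text in islice(pairs, limit):
        # Drop the line terminators and join the two fields with a TAB.
        yield expected.rstrip("\n") + "\t" + text.rstrip("\n")


# Example call (file names taken from the command above):
# with open("dev-0/expected.tsv", encoding="utf-8") as exp, \
#         open("dev-0/in.tsv", encoding="utf-8") as inp:
#     for row in paste_expected_and_input(exp, inp):
#         print(row)
```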

The expected.tsv file for the test-A test set is hidden and is not available in the master branch.

Format of the output files

For each input line, a probability for each label must be given:

label1:prob1 label2:prob2 ... labelN:probN

(The separator is space, not TAB here.)

You are expected to supply dev-0/out.tsv and test-A/out.tsv in this format (the files have the .tsv extension for consistency, but there should be no TABs in them).
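A minimal sketch of producing one output line in the format above is given here. The function name `format_output_line`, the four-decimal rounding, and the example labels are assumptions made for illustration; the README only fixes the `label:prob` pairs and the space separator.

```python
def format_output_line(probs):
    """Render {label: probability} as 'label1:prob1 label2:prob2 ...'."""
    parts = []
    for label, prob in probs.items():
        if not 0.0 <= prob <= 1.0:
            raise ValueError(f"probability out of range for {label!r}: {prob}")
        if ":" in label or any(ch.isspace() for ch in label):
            raise ValueError(f"label would break the format: {label!r}")
        # Assumption: four decimal places are precise enough.
        parts.append(f"{label}:{prob:.4f}")
    # The separator is a single space, never a TAB.
    return " ".join(parts)
```

For example, `format_output_line({"B-PER": 0.9, "O": 0.1})` gives `B-PER:0.9000 O:0.1000`; writing one such line per input line yields a file in the shape expected for out.tsv.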

Evaluation metrics

Two evaluation metrics are used:

  • Soft-F1 - a "softer" version of F1 in which overlap is also counted
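To make the idea of counting overlap concrete, the sketch below shows one possible reading of a soft F1 over BIO labels. It is not GEval's implementation, and the exact scoring used for this challenge may differ. Here every predicted entity gets partial credit equal to the fraction of its tokens covered by a gold entity of the same type (precision), and the same is done in the other direction (recall). The function names and the handling of empty inputs are assumptions.

```python
def bio_to_spans(labels):
    """Convert BIO labels into (type, start, end) spans; end is exclusive."""
    spans = []
    start, kind = None, None
    # A trailing "O" sentinel closes a span that runs to the end.
    for i, label in enumerate(list(labels) + ["O"]):
        inside = label.startswith("I-") and kind == label[2:]
        if start is not None and not inside:
            spans.append((kind, start, i))
            start, kind = None, None
        # A stray "I-" with no open span is treated leniently as a start.
        if label.startswith("B-") or (label.startswith("I-") and start is None):
            start, kind = i, label[2:]
    return spans


def soft_f1(gold_labels, predicted_labels):
    """Overlap-based F1 for one item (an illustrative definition only)."""
    gold = bio_to_spans(gold_labels)
    pred = bio_to_spans(predicted_labels)

    def coverage(span, others):
        kind, start, end = span
        best = 0
        for other_kind, other_start, other_end in others:
            if other_kind == kind:
                overlap = min(end, other_end) - max(start, other_start)
                best = max(best, overlap)
        return best / (end - start)

    if not gold or not pred:
        # Assumption: two empty annotations count as a perfect match.
        return 1.0 if not gold and not pred else 0.0
    precision = sum(coverage(s, gold) for s in pred) / len(pred)
    recall = sum(coverage(s, pred) for s in gold) / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

With gold labels `B-PER I-PER O` and predicted labels `B-PER O O`, a strict F1 would score the prediction as a miss, while this sketch gives precision 1.0, recall 0.5 and a soft F1 of about 0.67.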

Directory structure

  • README.md — this file
  • config.txt — GEval configuration file
  • train/ — directory with training data (15/16 of the original train set)
  • train/train.tsv.xz — train set
  • train/meta.tsv — metadata (the original ID of the document); do not use it during training, it is just for reference (e.g. when you need to go back to the original document)
  • dev-0/ — directory with dev (test) data (split preserved from CoNLL-2003)
  • dev-0/in.tsv — input data for the dev set
  • dev-0/expected.tsv — expected (reference) data for the dev set
  • test-A/ — directory with test data
  • test-A/in.tsv — input data for the test set
  • test-A/expected.tsv — expected (reference) data for the test set (hidden from the developers, not available in the master branch)

Usually, there is no reason to change these files.