Marcin Czerniak 04464285bd WIP
Co-authored-by: Alexxiia <Alexxiia@users.noreply.github.com>
2023-06-21 01:52:12 +02:00

CoNLL-2003 English Named Entity Recognition.

NER challenge for CoNLL-2003 English. The annotations come from the University of Antwerp. The English data is a collection of newswire articles from the Reuters Corpus (RCV1).

Format of the train set

The train set has just two columns separated by TABs:

  • the expected BIO labels,
  • the document.

Each line is a separate training item. Note that this is TSV format, not CSV: double quotes are not interpreted in a special way!
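As a minimal sketch, the train set could be read in Python as shown below. The function name `read_train` is hypothetical, and the code assumes that both columns hold whitespace-separated items (one label per token); check this against the actual data before relying on it. Lines are split on the first TAB only and double quotes are left untouched, in line with the TSV note above.

```python
import lzma


def read_train(path):
    """Yield (labels, tokens) pairs from an xz-compressed two-column TSV file.

    Assumption: labels and tokens are whitespace-separated within their columns.
    """
    with lzma.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue  # skip empty lines, if any
            # Plain split on the first TAB: no CSV-style quote handling.
            labels, document = line.split("\t", 1)
            yield labels.split(), document.split()
```

For example, `next(read_train("train/train.tsv.xz"))` would return the first training item as a pair of lists.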

Preprocessing snippet located here

End-of-lines inside documents were replaced with the '' tag.

The train set is compressed with xz. To see a random sample of 10 training items, run:

xzcat train/train.tsv.xz | shuf -n 10 | less -S

(The -S option disables line wrapping; press "q" to exit less.)

Format of the test sets

For the test sets, the input data is given in two files: the text in in.tsv and the expected labels in expected.tsv. (The files have .tsv extensions for consistency, but they do not actually contain TABs.)

To see the first 5 test items, run:

cat dev-0/in.tsv | paste dev-0/expected.tsv - | head -n 5

The expected.tsv file for the test-A test set is hidden and is not available in the master branch.
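The same pairing can be sketched in Python. The helper name `read_pairs` is hypothetical; the code only assumes that line N of in.tsv corresponds to line N of expected.tsv, as the paste command above implies.

```python
def read_pairs(in_path, expected_path):
    """Yield (expected_labels, text) pairs, one per line of the two files."""
    with open(in_path, encoding="utf-8") as texts, \
            open(expected_path, encoding="utf-8") as labels:
        # Line N of in.tsv is paired with line N of expected.tsv.
        for text, label_line in zip(texts, labels):
            yield label_line.rstrip("\n"), text.rstrip("\n")
```

For example, iterating over `read_pairs("dev-0/in.tsv", "dev-0/expected.tsv")` yields the dev items in file order. This only works for test-A where expected.tsv is available.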

Format of the output files

For each input line, a probability for each label must be given:

label1:prob1 label2:prob2 ... labelN:probN

(The separator is space, not TAB here.)

You are expected to supply dev-0/out.tsv and test-A/out.tsv in this format (the files have the .tsv extension for consistency, but there should be no TABs in them).
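A minimal sketch of producing one output line in this format follows. The function name `format_output_line` and the four-decimal precision are assumptions made for illustration; the format above fixes only the label:prob pairs and the space separator.

```python
def format_output_line(probabilities):
    """Format a {label: probability} mapping as 'label1:prob1 label2:prob2 ...'.

    Assumption: four decimal places; the required precision is not specified.
    """
    # Pairs are joined with single spaces, never TABs.
    return " ".join(
        f"{label}:{prob:.4f}" for label, prob in probabilities.items()
    )
```

Writing one such line per input line to dev-0/out.tsv keeps the output aligned with dev-0/in.tsv.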

Evaluation metrics

One evaluation metric is used:

  • BIO-F1, F1 metric on NER tags
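To make the metric concrete, here is a sketch of one common reading of F1 on BIO tags: entity spans are extracted from each tag sequence, and a predicted span counts as correct only if its type and boundaries match an expected span exactly. This is an illustration for a single sequence, not GEval's implementation; the function names are hypothetical, and details such as treating a stray I- tag as the start of a span or aggregating counts over the whole corpus are assumptions that may differ from GEval.

```python
def bio_spans(tags):
    """Return the set of (type, start, end) entity spans in a BIO tag sequence."""
    spans = set()
    start, kind = None, None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        if tag.startswith("I-") and kind == tag[2:]:
            continue  # the current span continues
        if kind is not None:
            spans.add((kind, start, i))
            start, kind = None, None
        # Assumption: an I- tag with no matching open span starts a new span.
        if tag.startswith("B-") or tag.startswith("I-"):
            start, kind = i, tag[2:]
    return spans


def bio_f1(expected, predicted):
    """Span-level F1 between two BIO tag sequences of the same item."""
    gold = bio_spans(expected)
    guess = bio_spans(predicted)
    correct = len(gold & guess)
    if correct == 0:
        return 0.0
    precision = correct / len(guess)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)
```

With this definition, a prediction that finds one of two expected entities and nothing else has precision 1 and recall 0.5, so its F1 is 2/3.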

Directory structure

  • README.md — this file
  • config.txt — GEval configuration file
  • train/ — directory with training data
  • train/train.tsv.xz — train set
  • dev-0/ — directory with dev (test) data (split preserved from CoNLL-2003)
  • dev-0/in.tsv — input data for the dev set
  • dev-0/expected.tsv — expected (reference) data for the dev set
  • test-A — directory with test data
  • test-A/in.tsv — input data for the test set
  • test-A/expected.tsv — expected (reference) data for the test set (hidden from the developers, not available in the master branch)

Usually, there is no reason to change these files.