CoNLL-2003 English Named Entity Recognition.
NER challenge for CoNLL-2003 English. Annotations were taken from the University of Antwerp. The English data is a collection of newswire articles from the Reuters Corpus, RCV1.
Format of the train set
The train set has just two columns separated by TABs:
- the expected BIO labels,
- the document.
Each line is a separate training item. Note that this is TSV format, not CSV: double quotes are not interpreted in a special way!
Preprocessing snippet located here
End-of-lines inside documents were replaced with the '' tag.
The train set is compressed with the xz compressor. To see a random sample of 10 training items, run:
xzcat train/train.tsv.xz | shuf -n 10 | less -S
(The -S option disables line wrapping; press "q" to exit the less browser.)
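The same data can also be read from Python. The following is a minimal sketch, not part of the challenge tooling: the function names `parse_train_line` and `read_train` are hypothetical, and it assumes that labels and tokens within each column are separated by single spaces, which the description above does not state explicitly.

```python
import lzma


def parse_train_line(line):
    """Split one training line into (labels, tokens)."""
    # Split on the first TAB only; double quotes are ordinary characters here.
    labels_field, document = line.rstrip("\n").split("\t", 1)
    # Assumption: labels and tokens are space-separated within their columns.
    return labels_field.split(" "), document.split(" ")


def read_train(path="train/train.tsv.xz"):
    """Yield (labels, tokens) pairs from the xz-compressed train set."""
    with lzma.open(path, mode="rt", encoding="utf-8", newline="\n") as handle:
        for line in handle:
            yield parse_train_line(line)
```

Splitting with `split("\t", 1)` rather than using a CSV reader keeps quote characters untouched, in line with the note on TSV above.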
Format of the test sets
For the test sets, the input data is given in two files: the text in in.tsv and the expected labels in expected.tsv. (The files have .tsv extensions for consistency, but they do not actually contain TABs.)
To see the first 5 test items run:
cat dev-0/in.tsv | paste dev-0/expected.tsv - | head -n 5
The expected.tsv file for the test-A test set is hidden and is not available in the master branch.
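For programmatic access, the two files can be paired line by line. This is a rough sketch under the assumption that line N of in.tsv corresponds to line N of expected.tsv; the function name `read_test_items` and its `limit` parameter are hypothetical.

```python
def read_test_items(in_path="dev-0/in.tsv",
                    expected_path="dev-0/expected.tsv",
                    limit=None):
    """Return a list of (expected_labels, document) string pairs."""
    with open(in_path, encoding="utf-8", newline="\n") as f:
        documents = [line.rstrip("\n") for line in f]
    with open(expected_path, encoding="utf-8", newline="\n") as f:
        expected = [line.rstrip("\n") for line in f]
    # The two files are assumed to be parallel; fail loudly if they are not.
    if len(documents) != len(expected):
        raise ValueError("in.tsv and expected.tsv differ in length")
    pairs = list(zip(expected, documents))
    return pairs if limit is None else pairs[:limit]
```

Calling `read_test_items(limit=5)` from the repository root would return the same five items that the shell pipeline above prints.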
Evaluation metrics
One evaluation metric is used:
- BIO-F1
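To give a feel for what a BIO-based F1 measures, here is an illustrative sketch of a span-level, micro-averaged F1. It is one common reading of the metric and is not taken from GEval, whose exact definition may differ. The helper names `bio_spans` and `bio_f1` are hypothetical, and stray `I-` tags without a preceding `B-` are simply ignored, which is an assumption of this sketch.

```python
def bio_spans(labels):
    """Return the set of (start, end, type) entity spans in a BIO sequence."""
    spans = set()
    start, kind = None, None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, label in enumerate(list(labels) + ["O"]):
        continues = label.startswith("I-") and label[2:] == kind
        if start is not None and not continues:
            spans.add((start, i, kind))
            start, kind = None, None
        if label.startswith("B-"):
            start, kind = i, label[2:]
    return spans


def bio_f1(expected, predicted):
    """Micro-averaged span-level F1 over parallel lists of label sequences."""
    matched = n_expected = n_predicted = 0
    for exp_labels, pred_labels in zip(expected, predicted):
        exp_spans = bio_spans(exp_labels)
        pred_spans = bio_spans(pred_labels)
        matched += len(exp_spans & pred_spans)
        n_expected += len(exp_spans)
        n_predicted += len(pred_spans)
    if matched == 0:
        return 0.0
    precision = matched / n_predicted
    recall = matched / n_expected
    return 2 * precision * recall / (precision + recall)
```

In this sketch a predicted entity counts as correct only if its boundaries and type both match an expected entity exactly.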
Directory structure
- README.md: this file
- config.txt: GEval configuration file
- train/: directory with training data
- train/train.tsv.xz: train set
- dev-0/: directory with dev (test) data (split preserved from CoNLL-2003)
- dev-0/in.tsv: input data for the dev set
- dev-0/expected.tsv: expected (reference) data for the dev set
- test-A: directory with test data
- test-A/in.tsv: input data for the test set
- test-A/expected.tsv: expected (reference) data for the test set (hidden from the developers, not available in the master branch)
Usually, there is no reason to change these files.