CoNLL-2003 English Named Entity Recognition
===========================================

NER challenge for CoNLL-2003 English.
Annotations were taken from [University of Antwerp](https://www.clips.uantwerpen.be/conll2003/ner/).
The English data is a collection of news wire articles from the [Reuters Corpus](https://trec.nist.gov/data/reuters/reuters.html), RCV1.

Format of the train set
-----------------------

The train set has just two columns separated by TABs:

* the expected BIO labels,
* the document.

Each line is a separate training item.
Note that this is TSV format, not CSV: double quotes are not interpreted in a special way!

The preprocessing snippet is located [here](https://git.applica.pl/snippets/18).
End-of-lines inside documents were replaced with the '' tag.

The train set is compressed with the xz compressor. To see a random sample of 10 training items, run:

    xzcat train/train.tsv.xz | shuf -n 10 | less -S

(The `-S` option disables line wrapping; press "q" to exit the `less` browser.)

Format of the test sets
-----------------------

For the test sets, the input data is given in two files: the text in `in.tsv` and the expected labels in `expected.tsv`.
(The files have `.tsv` extensions for consistency, but they do not actually contain TABs.)

To see the first 5 test items, run:

    cat dev-0/in.tsv | paste dev-0/expected.tsv - | head -n 5

The `expected.tsv` file for the `test-A` test set is hidden and is not available in the master branch.

Format of the output files
--------------------------

For each input line, a probability for each label must be given:

    label1:prob1 label2:prob2 ... labelN:probN

(The separator is a space, *not* a TAB.)

You are expected to supply `dev-0/out.tsv` and `test-A/out.tsv` in this format
(the files have the `.tsv` extension for consistency, but they should contain no TABs).

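
As a rough illustration of the output line format described above, the following Python sketch joins `label:prob` pairs with single spaces. The function name `format_output_line`, the example label names, and the four-decimal rounding are assumptions made for this example only; they are not prescribed by the challenge.

```python
def format_output_line(label_probs):
    """Format one output line as 'label1:prob1 label2:prob2 ... labelN:probN'.

    `label_probs` maps label names to probabilities (hypothetical input shape).
    """
    parts = []
    for label, prob in label_probs.items():
        # The separator is a space, so a label containing whitespace
        # (including TAB) would break the format.
        if any(ch.isspace() for ch in label):
            raise ValueError(f"label contains whitespace: {label!r}")
        # Four decimal places is an arbitrary choice for this sketch.
        parts.append(f"{label}:{prob:.4f}")
    return " ".join(parts)


# Illustrative labels only; use the label set found in the training data.
example = format_output_line({"B-PER": 0.7, "I-PER": 0.2, "O": 0.1})
print(example)
```

One such line would be written per input line of `dev-0/in.tsv` or `test-A/in.tsv`.
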

Evaluation metrics
------------------

Two evaluation metrics are used:

* Soft-F1 - a "softer" version of F1 in which overlap is also counted

Directory structure
-------------------

* `README.md` - this file
* `config.txt` - GEval configuration file
* `train/` - directory with training data (15/16 of the original train set)
* `train/train.tsv.xz` - train set
* `train/meta.tsv` - metadata (the original ID of the document); do **not** use it during training, it is just for reference (e.g. when you need to go back to the original document)
* `dev-0/` - directory with dev (test) data (split preserved from CoNLL-2003)
* `dev-0/in.tsv` - input data for the dev set
* `dev-0/expected.tsv` - expected (reference) data for the dev set
* `test-A/` - directory with test data
* `test-A/in.tsv` - input data for the test set
* `test-A/expected.tsv` - expected (reference) data for the test set (hidden from the developers, not available in the `master` branch)

Usually, there is no reason to change these files.
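
For reading `train/train.tsv.xz` programmatically, a minimal Python sketch follows. It relies only on what the train set format states: two TAB-separated columns, one item per line, and no special treatment of double quotes. The function name `read_train_items` is hypothetical, and UTF-8 encoding is an assumption.

```python
import lzma


def read_train_items(path):
    """Yield (labels, document) string pairs from an xz-compressed TSV file."""
    # newline="\n" keeps any other line-break characters inside a line intact.
    with lzma.open(path, mode="rt", encoding="utf-8", newline="\n") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue
            # Split on the first TAB only; double quotes are ordinary
            # characters here, so no CSV-style quote handling is applied.
            labels, document = line.split("\t", 1)
            yield labels, document
```

A plain string split is used instead of a CSV parser so that quote characters in documents are passed through unchanged.
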