CoNLL-2003 English Named Entity Recognition.
======================================================
NER challenge for CoNLL-2003 English.
Annotations were taken from [University of Antwerp](https://www.clips.uantwerpen.be/conll2003/ner/).
The English data is a collection of news wire articles from the [Reuters Corpus](https://trec.nist.gov/data/reuters/reuters.html), RCV1.

Format of the train set
-----------------------
The train set has just two columns separated by TABs:

* the expected BIO labels,
* the document.

Each line is a separate training item. Note that this is TSV format,
not CSV; double quotes are not interpreted in a special way!

The preprocessing snippet is located [here](https://git.applica.pl/snippets/18).
Line breaks inside documents were replaced with the `</S>` tag.

The train set is compressed with xz. In order to see a random sample of
10 training items, run:

    xzcat train/train.tsv.xz | shuf -n 10 | less -S

(The `-S` switch disables line wrapping; press "q" to exit the `less` browser.)

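As a rough illustration (not part of the challenge tooling), the train set
can be loaded in Python along these lines, assuming the two-column layout
described above and space-separated BIO tags in the label column:

    import lzma

    def load_train(path="train/train.tsv.xz"):
        """Yield (labels, document) pairs from the xz-compressed train set."""
        with lzma.open(path, mode="rt", encoding="utf-8") as f:
            for line in f:
                # Two columns: BIO labels (assumed space-separated), then the document.
                labels, document = line.rstrip("\n").split("\t", 1)
                yield labels.split(), document

    # Example: inspect the first training item.
    labels, document = next(load_train())
    print(labels[:5], document[:80])
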
Format of the test sets
-----------------------
For the test sets, the input data is given in two files: the text in
`in.tsv` and the expected labels in `expected.tsv`. (The files have
`.tsv` extensions for consistency, but actually they do not contain TABs.)

To see the first 5 test items, run:

    cat dev-0/in.tsv | paste dev-0/expected.tsv - | head -n 5

The `expected.tsv` file for the `test-A` test set is hidden and is not
available in the master branch.

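The same pairing can be done in Python; the `read_pairs` helper below is
just a sketch, not part of the challenge tooling:

    from itertools import islice

    def read_pairs(directory="dev-0"):
        """Yield (document, expected_labels) pairs from in.tsv and expected.tsv."""
        with open(f"{directory}/in.tsv", encoding="utf-8") as fin, \
             open(f"{directory}/expected.tsv", encoding="utf-8") as fexp:
            for document, expected in zip(fin, fexp):
                yield document.rstrip("\n"), expected.split()

    # Example: show the first 5 dev items, much like the paste command above.
    for document, expected in islice(read_pairs(), 5):
        print(expected[:5], document[:60])
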
Format of the output files
--------------------------
For each input line, a probability for each label must be given:

    label1:prob1 label2:prob2 ... labelN:probN

(The separator here is a space, *not* a TAB.)

You are expected to supply `dev-0/out.tsv` and `test-A/out.tsv` in this
format (the files have a `.tsv` extension for consistency, but actually
there should be no TABs there).

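A minimal sketch of producing such an `out.tsv` is given below; the uniform
distribution and the assumed CoNLL-2003 label inventory are placeholders,
not a real model:

    # Assumed label inventory: the standard CoNLL-2003 BIO tags.
    LABELS = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG",
              "B-LOC", "I-LOC", "B-MISC", "I-MISC"]

    def write_uniform_output(in_path="dev-0/in.tsv", out_path="dev-0/out.tsv"):
        """Write one space-separated label:prob distribution per input line."""
        uniform = " ".join(f"{label}:{1.0 / len(LABELS):.4f}" for label in LABELS)
        with open(in_path, encoding="utf-8") as fin, \
             open(out_path, "w", encoding="utf-8") as fout:
            for _ in fin:
                fout.write(uniform + "\n")

    write_uniform_output()
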
Evaluation metrics
------------------
Two evaluation metrics are used:

* Soft-F1 - a "softer" version of F1 in which partial overlap between
  predicted and expected entities is also counted

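The authoritative metric definition is the one configured in `config.txt`
and implemented by GEval; the snippet below only illustrates the general
idea of crediting partial overlaps between predicted and expected spans:

    def overlap_credit(pred_span, gold_span):
        """Fraction of the gold span covered by the prediction (illustrative only)."""
        start = max(pred_span[0], gold_span[0])
        end = min(pred_span[1], gold_span[1])
        gold_len = gold_span[1] - gold_span[0]
        return max(0, end - start) / gold_len if gold_len else 0.0

    # A prediction covering half of a gold entity gets 0.5 credit instead of 0.
    print(overlap_credit((2, 4), (2, 6)))  # 0.5
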
Directory structure
-------------------
* `README.md` — this file
* `config.txt` — GEval configuration file
* `train/` — directory with training data (15/16 of the original train set)
* `train/train.tsv.xz` — train set
* `train/meta.tsv` — metadata (the original ID of the document), do **not** use it during training;
  this is just for reference (e.g. when you need to go back to the original document)
* `dev-0/` — directory with dev (test) data (split preserved from CoNLL-2003)
* `dev-0/in.tsv` — input data for the dev set
* `dev-0/expected.tsv` — expected (reference) data for the dev set
* `test-A/` — directory with test data
* `test-A/in.tsv` — input data for the test set
* `test-A/expected.tsv` — expected (reference) data for the test set (hidden from the developers,
  not available in the `master` branch)

Usually, there is no reason to change these files.