English-Polish Europarl
=======================

Translate Europarl proceedings from English into Polish.

Sources
-------

The data set is based on the [EUROPARL
v7](http://opus.lingfil.uu.se/Europarl.php) corpus prepared by [Jörg
Tiedemann](http://www.lrec-conf.org/proceedings/lrec2012/pdf/463_Paper.pdf).

Metric
------

[BLEU](https://en.wikipedia.org/wiki/BLEU) is used as the evaluation
metric.
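
For intuition about what the score measures, here is a minimal sketch of corpus-level BLEU in Python. It assumes one reference per hypothesis, whitespace-separated tokens, n-grams up to order 4, and no smoothing. The function names are illustrative, and this is not necessarily the exact variant that GEval computes, so treat it as an approximation for local sanity checks only.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus-level BLEU with one reference per hypothesis.

    Both arguments are lists of strings; tokens are split on whitespace.
    """
    matches = [0] * max_n  # clipped n-gram matches, per order
    totals = [0] * max_n   # hypothesis n-gram counts, per order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tokens, n)
            ref_counts = ngrams(ref_tokens, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())

    # Without smoothing, any order with zero matches gives a score of 0.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0

    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalise hypotheses shorter than the references.
    brevity_penalty = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)
```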

Directory structure
-------------------

* `README.md` — this file
* `config.txt` — GEval configuration file
* `train/` — directory with training data
* `train/train.tsv` — training set (English-Polish corpus of 550K sentence pairs; the Polish sentence is in the first column, the English sentence in the second)
* `dev-0/` — directory with development data (10K sentence pairs)
* `dev-0/in.tsv` — input data for the dev set (English utterances)
* `dev-0/expected.tsv` — expected (reference) data for the dev set (Polish utterances)
* `test-A/` — directory with test data (5K sentence pairs)
* `test-A/in.tsv` — input data for the test set
* `test-A/expected.tsv` — expected data for the test set (not available in the master branch)

The Polish utterances in `{dev-0,test-A}/expected.tsv` are tokenized
(with the [Moses](http://www.statmt.org/moses/) `tokenizer.perl` script), but the English utterances in all files and the Polish utterances in `train/train.tsv` are *NOT* tokenized.
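
Because `train/train.tsv` lists the Polish sentence first, it is easy to swap source and target by accident. The following sketch shows one way to read the file into (English, Polish) pairs. The function name `read_train_pairs` is hypothetical, and UTF-8 encoding is an assumption. Lines are split on TAB directly rather than parsed with a CSV reader, so quotation marks inside sentences are left untouched.

```python
def read_train_pairs(path):
    """Yield (english, polish) pairs from a train.tsv-style file.

    The file stores the Polish sentence in the first column and the
    English sentence in the second; the order is swapped on output.
    """
    # UTF-8 is an assumption; the README does not state the encoding.
    with open(path, encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, start=1):
            fields = line.rstrip("\r\n").split("\t")
            if len(fields) != 2:
                raise ValueError(
                    f"{path}:{line_no}: expected 2 columns, got {len(fields)}"
                )
            polish, english = fields
            yield english, polish
```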

Note that some of the files have the `.tsv` extension even though they
do not actually contain TABs; this is only for compatibility with the
[Gonito](http://gonito.net) platform.
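
Since the single-column files contain no TABs, they can be read as plain text, one utterance per line. The sketch below pairs each input line with its reference, assuming that line *i* of `in.tsv` corresponds to line *i* of `expected.tsv` and that the files are UTF-8 encoded; both helper names are hypothetical.

```python
def read_lines(path):
    """Return the lines of a single-column file, without trailing newlines."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\r\n") for line in handle]


def read_parallel(in_path, expected_path):
    """Pair each input line with the reference line at the same position."""
    sources = read_lines(in_path)
    references = read_lines(expected_path)
    # A length mismatch means the two files are not aligned line by line.
    if len(sources) != len(references):
        raise ValueError(
            f"{in_path} has {len(sources)} lines, "
            f"but {expected_path} has {len(references)}"
        )
    return list(zip(sources, references))
```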