English-Polish Europarl
Translate Europarl proceedings from English into Polish.
Sources
The data set is based on the EUROPARL v7 corpus prepared by Jörg Tiedemann.
Metric
BLEU is used as the evaluation metric.
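To make the metric concrete, here is a minimal sketch of corpus-level BLEU in plain Python. It assumes one reference per sentence, whitespace-separated tokens, n-grams up to order 4, and no smoothing. The function names `ngrams` and `corpus_bleu` are illustrative only. This is not GEval's implementation, and official scores may differ in details.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU with one reference per sentence and no smoothing."""
    matches = [0] * max_n  # clipped n-gram matches, per n-gram order
    totals = [0] * max_n   # n-grams present in the hypotheses, per order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_tok, ref_tok = hyp.split(), ref.split()
        hyp_len += len(hyp_tok)
        ref_len += len(ref_tok)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tok, n)
            ref_counts = ngrams(ref_tok, n)
            # Counter intersection clips each count by the reference count.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())

    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # the geometric mean is undefined; report zero

    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize output that is shorter than the references.
    brevity = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity * math.exp(log_precision)
```

For example, `corpus_bleu(["the cat sat on the mat"], ["the cat sat on a mat"])` gives roughly 0.54, while identical hypothesis and reference lists give 1.0.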
Directory structure
- `README.md`: this file
- `config.txt`: GEval configuration file
- `train/`: directory with training data
- `train/train.tsv`: train set (English-Polish corpus of 550K sentence pairs; the Polish sentence is in the first column, the English sentence in the second)
- `dev-0/`: directory with development data (10K sentence pairs)
- `dev-0/in.tsv`: input data for the dev set (English utterances)
- `dev-0/expected.tsv`: expected (reference) data for the dev set (Polish utterances)
- `test-A/`: directory with test data (5K sentence pairs)
- `test-A/in.tsv`: input data for the test set
- `test-A/expected.tsv`: expected data for the test set (not available in the master branch)
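Because the column order in `train/train.tsv` (Polish first, English second) is easy to get backwards, the following sketch reads the file and yields pairs in source-to-target order. It assumes the file is UTF-8 encoded and that every line holds exactly two TAB-separated fields. The function name `read_train_pairs` is hypothetical.

```python
def read_train_pairs(path="train/train.tsv"):
    """Yield (english, polish) pairs from the training file.

    The file stores the Polish sentence in the first column and the
    English sentence in the second, so the columns are swapped here to
    get (source, target) order for English-to-Polish translation.
    """
    # Assumption: the file is UTF-8 encoded.
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            fields = line.rstrip("\n").split("\t")
            if len(fields) != 2:
                raise ValueError(
                    f"{path}:{line_no}: expected 2 columns, got {len(fields)}"
                )
            polish, english = fields
            yield english, polish
```

Splitting on TAB by hand, rather than using a CSV parser with quoting enabled, keeps quotation marks inside sentences from being interpreted as field delimiters.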
The Polish utterances in `{dev-0,test-A}/expected.tsv` are tokenized (by the Moses `tokenizer.perl` script), but the English utterances in all the files and the Polish utterances in `train/train.tsv` are NOT tokenized.
Note that some of the files have the `.tsv` extension even though they do not actually contain TABs; this is just for compatibility with the Gonito platform.
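Since `in.tsv` and `expected.tsv` hold one utterance per line with no TABs, they can be read as plain text and paired by line number. The sketch below assumes UTF-8 encoding. The helper names `read_lines` and `load_split` are hypothetical.

```python
import os


def read_lines(path):
    """Read a single-column file, one utterance per line."""
    # Assumption: the file is UTF-8 encoded.
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def load_split(directory):
    """Return a list of (input, expected) pairs for a data directory."""
    inputs = read_lines(os.path.join(directory, "in.tsv"))
    expected = read_lines(os.path.join(directory, "expected.tsv"))
    # The two files are aligned line by line, so their lengths must agree.
    if len(inputs) != len(expected):
        raise ValueError(
            f"{directory}: {len(inputs)} inputs but {len(expected)} expected lines"
        )
    return list(zip(inputs, expected))
```

A call such as `load_split("dev-0")` should work on the development set. For `test-A`, only `read_lines` applies in the master branch, because `expected.tsv` is not available there.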