Jakub 2cb44ca173 Improved En-Pl Europarl		2020-01-28 18:51:11 +01:00
dev-0	Improved En-Pl Europarl	2020-01-28 18:51:11 +01:00
test	Improved En-Pl Europarl	2020-01-28 18:51:11 +01:00
test-A	Improved En-Pl Europarl	2020-01-28 18:51:11 +01:00
train	Improved En-Pl Europarl	2020-01-28 18:51:11 +01:00
.gitignore	Improved En-Pl Europarl	2020-01-28 18:51:11 +01:00
clean.rb	Improved En-Pl Europarl	2020-01-28 18:51:11 +01:00
config.txt	Improved En-Pl Europarl	2020-01-28 18:51:11 +01:00
improve.sh	Improved En-Pl Europarl	2020-01-28 18:51:11 +01:00
out_files_denorm.py	Improved En-Pl Europarl	2020-01-28 18:51:11 +01:00
README.md	Improved En-Pl Europarl	2020-01-28 18:51:11 +01:00
TAU_translator_from_scratch.ipynb	Improved En-Pl Europarl	2020-01-28 18:51:11 +01:00
train_set_prune.py	Improved En-Pl Europarl	2020-01-28 18:51:11 +01:00

README.md

English-Polish Europarl

Translate Europarl proceedings from English into Polish.

Sources

The data set is based on the EUROPARL v7 corpus prepared by Jörg Tiedemann.

Metric

BLEU is used as the evaluation metric.
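
As a rough illustration of what the metric measures, the following is a minimal sketch of corpus-level BLEU with a single reference per sentence, whitespace tokenization, n-grams up to order 4, and no smoothing. The function names `ngram_counts` and `corpus_bleu` are hypothetical, and the official scores come from GEval, whose tokenization and other details may differ from this sketch.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of order n in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Unsmoothed corpus-level BLEU in [0, 1], one reference per hypothesis.

    Assumption: sentences are already tokenized and tokens are separated
    by single spaces, so str.split() recovers them.
    """
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = 0
    ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tokens = hyp.split()
        ref_tokens = ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_ngrams = ngram_counts(hyp_tokens, n)
            ref_ngrams = ngram_counts(ref_tokens, n)
            # Counter intersection clips each n-gram count to the reference count.
            matches[n - 1] += sum((hyp_ngrams & ref_ngrams).values())
            totals[n - 1] += sum(hyp_ngrams.values())
    # Without smoothing, any order with zero matches gives a score of zero.
    if min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize output that is shorter than the reference.
    brevity = 1.0 if hyp_len > ref_len else math.exp(1.0 - ref_len / hyp_len)
    return brevity * math.exp(log_precision)
```

For example, `corpus_bleu(["the cat sat on the mat"], ["the cat sat on a mat"])` combines n-gram precisions of 5/6, 3/5, 2/4 and 1/3 and gives roughly 0.537.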

Directory structure

README.md — this file
config.txt — GEval configuration file
train/ — directory with training data
train/train.tsv — training set (English-Polish corpus of 550K sentence pairs; the Polish sentence is in the first column, the English sentence in the second)
dev-0/ — directory with development data (10K sentence pairs)
dev-0/in.tsv — input data for the dev set (English utterances)
dev-0/expected.tsv — expected (reference) data for the dev set (Polish utterances)
test-A/ — directory with test data (5K sentence pairs)
test-A/in.tsv — input data for the test set
test-A/expected.tsv — expected data for the test set (not available in the master branch)
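
The layout above can be read with a few lines of Python. This is a minimal sketch, assuming the files are UTF-8 encoded and hold one sentence (or one sentence pair) per line; the helper names `parse_train_lines`, `load_train` and `load_sentences` are hypothetical and not part of the repository.

```python
from typing import Iterable, Iterator, List, Tuple


def parse_train_lines(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (english, polish) pairs from the lines of train/train.tsv.

    The file stores the Polish sentence in the first column and the
    English sentence in the second, so the columns are swapped here to
    give (source, target) order for English-to-Polish translation.
    """
    for line in lines:
        line = line.rstrip("\n")
        polish, tab, english = line.partition("\t")
        if not tab:
            # Skip empty lines and lines without a TAB separator.
            continue
        yield english, polish


def load_train(path: str = "train/train.tsv") -> List[Tuple[str, str]]:
    """Read the whole training set into memory (assumes UTF-8)."""
    with open(path, encoding="utf-8") as handle:
        return list(parse_train_lines(handle))


def load_sentences(path: str) -> List[str]:
    """Read a one-sentence-per-line file such as dev-0/in.tsv."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]
```

Under these assumptions, `load_sentences("dev-0/in.tsv")` and `load_sentences("dev-0/expected.tsv")` would return two lists aligned line by line. The lines are split on the first TAB directly instead of going through a CSV parser, so quote characters inside sentences are left untouched.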

The Polish utterances in {dev-0,test-A}/expected.tsv are tokenized (with the Moses tokenizer.perl script), but the English utterances in all files and the Polish utterances in train/train.tsv are NOT tokenized.

Note that some of the files have the .tsv extension even though they do not actually contain TAB characters; this is only for compatibility with the Gonito platform.