Go to file
Jakub 2cb44ca173 Improved En-Pl Europarl 2020-01-28 18:51:11 +01:00
dev-0 Improved En-Pl Europarl 2020-01-28 18:51:11 +01:00
test Improved En-Pl Europarl 2020-01-28 18:51:11 +01:00
test-A Improved En-Pl Europarl 2020-01-28 18:51:11 +01:00
train Improved En-Pl Europarl 2020-01-28 18:51:11 +01:00
.gitignore Improved En-Pl Europarl 2020-01-28 18:51:11 +01:00
README.md Improved En-Pl Europarl 2020-01-28 18:51:11 +01:00
TAU_translator_from_scratch.ipynb Improved En-Pl Europarl 2020-01-28 18:51:11 +01:00
clean.rb Improved En-Pl Europarl 2020-01-28 18:51:11 +01:00
config.txt Improved En-Pl Europarl 2020-01-28 18:51:11 +01:00
improve.sh Improved En-Pl Europarl 2020-01-28 18:51:11 +01:00
out_files_denorm.py Improved En-Pl Europarl 2020-01-28 18:51:11 +01:00
train_set_prune.py Improved En-Pl Europarl 2020-01-28 18:51:11 +01:00

README.md

English-Polish Europarl

Translate Europarl proceedings from English into Polish.

Sources

The data set is based on the EUROPARL v7 corpus prepared by Jörg Tiedemann.

Metric

BLEU is used as the evaluation metric.

Directory structure

  • README.md — this file
  • config.txt — GEval configuration file
  • train/ — directory with training data
  • train/train.tsv — train set (English-Polish corpus of 550K sentence pairs, the Polish sentence is given in the first column, the English sentence — in the second one)
  • dev-0/ — directory with development data (10K sentence pairs)
  • dev-0/in.tsv — input data for the dev set (English utterances)
  • dev-0/expected.tsv — expected (reference) data for the dev set (Polish utterances)
  • test-A — directory with test data (5K sentence pairs)
  • test-A/in.tsv — input data for the test set
  • test-A/expected.tsv — expected data for the test set (not available in the master branch)

The Polish utterances in {dev-0,test-A}/expected.tsv are tokenized (by Moses tokenizer.perl script), but the English utterances in all the files and Polish utterances in train/train.tsv are NOT tokenized.

Note that some of the files have .tsv extension, even though they actually do not contain TABs — this is just for the compatibility with Gonito platform.