init
commit ea840a0351
2
.gitignore
vendored
Normal file
@@ -0,0 +1,2 @@
*~
temp/
41
README.md
Normal file
@@ -0,0 +1,41 @@

Russian-Polish Opensubtitles
============================

Translate subtitles from Russian into Polish.

Sources
-------

The data set is based on the
[Opensubtitles 2016](http://opus.lingfil.uu.se/OpenSubtitles2016.php)
corpus [prepared by Pierre Lison and Jörg Tiedemann](http://stp.lingfil.uu.se/~joerg/paper/opensubs2016.pdf).

Metric
------

[BLEU](https://en.wikipedia.org/wiki/BLEU) is used as the evaluation metric.
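
As a rough illustration of what the metric measures, here is a minimal sketch of corpus-level BLEU with one reference per segment, uniform weights over 1- to 4-grams, and the standard brevity penalty. It is not GEval's implementation: it assumes whitespace-tokenized input and applies no smoothing, and the function names are made up for this example, so scores from the official evaluation may differ.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of order n in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU for whitespace-tokenized strings, one reference each."""
    matches = [0] * max_n  # clipped n-gram matches per order
    totals = [0] * max_n   # hypothesis n-grams per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngram_counts(hyp_tokens, n)
            ref_counts = ngram_counts(ref_tokens, n)
            # Counter intersection keeps the minimum count, i.e. clipping.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # no smoothing: an order without matches gives a zero score
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)
```

Because the reference files are tokenized, system output would presumably need the same tokenization before a comparison like this is meaningful.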


Directory structure
-------------------

* `README.md`: this file
* `config.txt`: GEval configuration file
* `train/`: directory with training data
* `train/train.tsv`: training set (Russian-Polish corpus of 2M sentence pairs; the Polish sentence is in the first column, the Russian sentence in the second)
* `dev-0/`: directory with development data (20K sentence pairs)
* `dev-0/in.tsv`: input data for the dev set (Russian utterances)
* `dev-0/expected.tsv`: expected (reference) data for the dev set (Polish utterances)
* `test-A/`: directory with test data (20K sentence pairs)
* `test-A/in.tsv`: input data for the test set
* `test-A/expected.tsv`: expected data for the test set (not available in the master branch)
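
The layout above suggests a simple way to read the training pairs. The sketch below assumes UTF-8 text with one tab-separated pair per line, Polish first and Russian second as stated in the list, and opens the file through `gzip` when the path ends in `.gz` (the commit ships `train/train.tsv.gz`). `read_pairs` is a name made up for this example, and lines without exactly two fields are skipped as a precaution, not because the data is known to contain them.

```python
import gzip


def read_pairs(path):
    """Yield (russian, polish) tuples from a two-column TSV file."""
    # Use gzip for compressed files, the built-in open otherwise.
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) != 2:
                continue  # skip malformed lines (an assumption, see above)
            polish, russian = fields  # column order as stated in the README
            yield russian, polish
```

Yielding the pairs one at a time avoids holding all 2M sentence pairs in memory; wrap the call in `list(...)` if random access is needed.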

The Polish utterances in `{dev-0,test-A}/expected.tsv` are tokenized
(by the [Moses](http://www.statmt.org/moses/) `tokenizer.perl` script, with
the `-no-escape` option), but the Russian utterances in all the files and
the Polish utterances in `train/train.tsv` are *NOT* tokenized.

Note that some of the files have the `.tsv` extension even though they
do not actually contain TABs; this is just for compatibility with the
[Gonito](http://gonito.net) platform.
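
Since these single-column files hold one utterance per line, an input file and its reference can be read side by side. A minimal sketch, assuming UTF-8 encoding and that line *i* of `in.tsv` corresponds to line *i* of `expected.tsv`; `read_lines` and `read_parallel` are hypothetical helpers, not part of GEval or Gonito.

```python
import os


def read_lines(path):
    """Return the lines of a text file without their trailing newlines."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def read_parallel(directory):
    """Return (source, reference) pairs from in.tsv and expected.tsv."""
    sources = read_lines(os.path.join(directory, "in.tsv"))
    references = read_lines(os.path.join(directory, "expected.tsv"))
    if len(sources) != len(references):
        # The two files are expected to be aligned line by line.
        raise ValueError(
            f"line count mismatch: {len(sources)} vs {len(references)}"
        )
    return list(zip(sources, references))
```

This applies to `dev-0/` as committed; for `test-A/` it would raise an error on the master branch, where `expected.tsv` is absent.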
1
config.txt
Normal file
@@ -0,0 +1 @@
--metric BLEU --precision 4
20000
dev-0/expected.tsv
Normal file
File diff suppressed because it is too large
20000
dev-0/in.tsv
Normal file
File diff suppressed because it is too large
20000
test-A/in.tsv
Normal file
File diff suppressed because it is too large
BIN
train/train.tsv.gz
Normal file
Binary file not shown.