TAU 28 Marian s2s, tokenizer, truecaser

Jakub 2020-01-27 10:17:26 +01:00
commit f393c9cd67
25 changed files with 29139 additions and 0 deletions

45
README.md Normal file

@@ -0,0 +1,45 @@
WMT2017 Czech-English machine translation challenge for news
=============================================================
Translate news articles from Czech into English.
This is the WMT2017 news challenge reformatted as a Gonito.net challenge;
all the data were taken from <http://www.statmt.org/wmt17/translation-task.html>.
BLEU is used as the evaluation metric.
Directory structure
-------------------
* `README.md` — this file
* `config.txt` — configuration file
* `train/` — directory with training data
* `train/commoncrawl.tsv.xz` — Common Crawl parallel corpus
* `train/news-commentary-v12.tsv.xz` — News Commentary parallel corpus
* `train/europarl-v7.tsv.xz` — Europarl parallel corpus
* `dev-0/` — directory with dev (test) data (Newstest 2013)
* `dev-0/in.tsv` — Czech input text for the dev set
* `dev-0/expected.tsv` — English reference translation for the dev set
* `test-A` — directory with test data
* `test-A/in.tsv` — Czech input data for the test set (WMT2017 test set)
* `test-A/expected.tsv` — English reference translation for the test set
Training sets
-------------
All training sets are compressed with xz; use `xzcat` to decompress:

    $ xzcat train/*.tsv.xz | ...

Pairs where the Czech or English side is empty were removed from the
training sets.
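Each line contains a Czech sentence and its English translation separated by a
tab (Czech in the first column). For instance, one corpus can be split into two
monolingual files roughly like this (the output file names are illustrative):

    $ xzcat train/europarl-v7.tsv.xz | cut -f1 > europarl-v7.cs
    $ xzcat train/europarl-v7.tsv.xz | cut -f2 > europarl-v7.en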
Test sets
---------
Reference English translations in the dev and test sets are not tokenised.
Monolingual data
----------------
Monolingual data was not included here.

1
config.txt Normal file

@@ -0,0 +1 @@
--metric BLEU --metric GLEU --tokenizer 13a --precision 4
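With this configuration in place, dev-set scores can be computed locally with
geval (the same call is made at the end of train-and-translate.sh):

    $ geval -t dev-0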

3000
dev-0/expected.tsv Normal file

File diff suppressed because it is too large

3000
dev-0/in.tc.cs Normal file

File diff suppressed because it is too large

3000
dev-0/in.tc.cs.output Normal file

File diff suppressed because it is too large

3000
dev-0/in.tok.tsv Normal file

File diff suppressed because it is too large

3000
dev-0/in.tsv Normal file

File diff suppressed because it is too large

3000
dev-0/out.tsv Normal file

File diff suppressed because it is too large

48
preprocess-data_dev-0_test-A.sh Executable file

@@ -0,0 +1,48 @@
#!/bin/bash -v
# Preprocesses the dev and test inputs: tokenization and truecasing.
# Adapted from the Marian sample preprocessing script (originally written for
# a Romanian-English corpus; the Romanian-specific normalization and the
# subword-segmentation step are not used here).
# For a different language pair, change the source and target suffixes and
# the file names (currently, dev-0/in.tsv and test-A/in.tsv are being
# processed).
# Suffix of source language files
SRC=cs
# Suffix of target language files
TRG=en
# Number of BPE merge operations (kept from the sample; no BPE is applied in
# this script)
bpe_operations=85000
# path to moses decoder: https://github.com/moses-smt/mosesdecoder
mosesdecoder=/home/jakub/TAU/TAU_24/marian-dev/examples/tools/moses-scripts
# path to subword segmentation scripts: https://github.com/rsennrich/subword-nmt
subword_nmt=/home/jakub/TAU/TAU_24/marian-dev/examples/tools/subword-nmt
# tokenize (the input is Czech, so use the source-language tokenizer rules)
cat dev-0/in.tsv \
  | $mosesdecoder/scripts/tokenizer/tokenizer.perl -a -l $SRC > dev-0/in.tok.tsv
cat test-A/in.tsv \
  | $mosesdecoder/scripts/tokenizer/tokenizer.perl -a -l $SRC > test-A/in.tok.tsv
# apply truecaser (dev/test files)
for prefix in dev-0/in test-A/in
do
  $mosesdecoder/scripts/recaser/truecase.perl -model preprocess-train/model/tc.$SRC < $prefix.tok.tsv > $prefix.tc.$SRC
done
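# Assumed usage: run from the challenge root, after the truecasing model
# preprocess-train/model/tc.cs has been trained by the training
# preprocessing script:
#   bash ./preprocess-data_dev-0_test-A.sh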

File diff suppressed because it is too large


@@ -0,0 +1,36 @@
[2020-01-24 03:53:38] [valid] Ep. 1 : Up. 10000 : cross-entropy : 108.847 : new best
[2020-01-24 03:56:42] [valid] Ep. 1 : Up. 10000 : translation : 7.52 : new best
[2020-01-24 05:33:26] [valid] Ep. 2 : Up. 20000 : cross-entropy : 94.6385 : new best
[2020-01-24 05:36:21] [valid] Ep. 2 : Up. 20000 : translation : 12.28 : new best
[2020-01-24 07:13:19] [valid] Ep. 3 : Up. 30000 : cross-entropy : 89.3 : new best
[2020-01-24 07:16:05] [valid] Ep. 3 : Up. 30000 : translation : 14.42 : new best
[2020-01-24 08:52:52] [valid] Ep. 4 : Up. 40000 : cross-entropy : 86.3268 : new best
[2020-01-24 08:55:27] [valid] Ep. 4 : Up. 40000 : translation : 15.55 : new best
[2020-01-24 12:16:53] [valid] Ep. 5 : Up. 50000 : cross-entropy : 84.0588 : new best
[2020-01-24 12:19:29] [valid] Ep. 5 : Up. 50000 : translation : 15.5 : stalled 1 times (last best: 15.55)
[2020-01-24 13:58:10] [valid] Ep. 5 : Up. 60000 : cross-entropy : 82.5176 : new best
[2020-01-24 14:01:02] [valid] Ep. 5 : Up. 60000 : translation : 16.08 : new best
[2020-01-24 14:07:51] [valid] Ep. 6 : Up. 60621 : cross-entropy : 82.5103 : new best
[2020-01-24 14:10:42] [valid] Ep. 6 : Up. 60621 : translation : 15.87 : stalled 1 times (last best: 16.08)
[2020-01-24 16:26:57] [valid] Ep. 6 : Up. 70000 : cross-entropy : 82.2896 : new best
[2020-01-24 16:29:33] [valid] Ep. 6 : Up. 70000 : translation : 16.03 : stalled 2 times (last best: 16.08)
[2020-01-24 18:09:49] [valid] Ep. 7 : Up. 80000 : cross-entropy : 82.905 : stalled 1 times (last best: 82.2896)
[2020-01-24 18:12:31] [valid] Ep. 7 : Up. 80000 : translation : 16.49 : new best
[2020-01-24 19:53:05] [valid] Ep. 8 : Up. 90000 : cross-entropy : 83.1733 : stalled 2 times (last best: 82.2896)
[2020-01-24 19:55:48] [valid] Ep. 8 : Up. 90000 : translation : 16.18 : stalled 1 times (last best: 16.49)
[2020-01-24 21:36:56] [valid] Ep. 9 : Up. 100000 : cross-entropy : 84.6166 : stalled 3 times (last best: 82.2896)
[2020-01-24 21:39:23] [valid] Ep. 9 : Up. 100000 : translation : 16.42 : stalled 2 times (last best: 16.49)
[2020-01-24 23:16:42] [valid] Ep. 10 : Up. 110000 : cross-entropy : 85.5421 : stalled 4 times (last best: 82.2896)
[2020-01-24 23:19:12] [valid] Ep. 10 : Up. 110000 : translation : 16.23 : stalled 3 times (last best: 16.49)
[2020-01-25 00:57:49] [valid] Ep. 10 : Up. 120000 : cross-entropy : 86.5355 : stalled 5 times (last best: 82.2896)
[2020-01-25 01:00:23] [valid] Ep. 10 : Up. 120000 : translation : 15.92 : stalled 4 times (last best: 16.49)
[2020-01-25 02:37:11] [valid] Ep. 11 : Up. 130000 : cross-entropy : 88.0175 : stalled 6 times (last best: 82.2896)
[2020-01-25 02:39:36] [valid] Ep. 11 : Up. 130000 : translation : 15.98 : stalled 5 times (last best: 16.49)
[2020-01-25 04:16:32] [valid] Ep. 12 : Up. 140000 : cross-entropy : 89.7736 : stalled 7 times (last best: 82.2896)
[2020-01-25 04:18:58] [valid] Ep. 12 : Up. 140000 : translation : 15.72 : stalled 6 times (last best: 16.49)
[2020-01-25 05:55:59] [valid] Ep. 13 : Up. 150000 : cross-entropy : 91.4267 : stalled 8 times (last best: 82.2896)
[2020-01-25 05:58:28] [valid] Ep. 13 : Up. 150000 : translation : 15.43 : stalled 7 times (last best: 16.49)
[2020-01-25 07:35:33] [valid] Ep. 14 : Up. 160000 : cross-entropy : 93.3989 : stalled 9 times (last best: 82.2896)
[2020-01-25 07:38:02] [valid] Ep. 14 : Up. 160000 : translation : 15.24 : stalled 8 times (last best: 16.49)
[2020-01-25 09:15:09] [valid] Ep. 15 : Up. 170000 : cross-entropy : 94.6314 : stalled 10 times (last best: 82.2896)
[2020-01-25 09:17:37] [valid] Ep. 15 : Up. 170000 : translation : 15.18 : stalled 9 times (last best: 16.49)
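Training ends here: cross-entropy has stalled 10 times, which matches
--early-stopping 10 in train-and-translate.sh (the epoch limit -e 15 is
reached at the same point). The best validation BLEU can be pulled back out
of the log, whose path is set by --valid-log:

    $ grep 'translation' preprocess-train/model/valid.log | grep 'new best' | tail -n 1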


@@ -0,0 +1,62 @@
#!/bin/bash -v
# Preprocesses the training corpus: tokenization, cleaning, and truecasing.
# Adapted from the Marian sample preprocessing script (originally written for
# a Romanian-English corpus).
# For a different language pair, change the source and target suffixes,
# optionally the number of BPE operations, and the file names (currently,
# data/europarl and data/valideuroparl are being processed).
# In the tokenization step you may want to add language-specific
# normalization (the original sample removed Romanian diacritics), and you
# may want to learn BPE segmentations separately for each language,
# especially if the languages differ in their alphabets.
# Suffix of source language files
SRC=cs
# Suffix of target language files
TRG=en
# Number of BPE merge operations. The network vocabulary should be slightly
# larger (to include characters), or smaller if the operations are learned on
# the joint vocabulary. The BPE step itself is not run in this script; a
# sketch is included as comments at the end.
bpe_operations=85000
# path to moses decoder: https://github.com/moses-smt/mosesdecoder
mosesdecoder=/home/jakub/TAU/TAU_24/marian-dev/examples/tools/moses-scripts
# path to subword segmentation scripts: https://github.com/rsennrich/subword-nmt
subword_nmt=/home/jakub/TAU/TAU_24/marian-dev/examples/tools/subword-nmt
# tokenize
for prefix in europarl valideuroparl
do
  cat data/$prefix.$SRC \
    | $mosesdecoder/scripts/tokenizer/tokenizer.perl -a -l $SRC > data/$prefix.tok.$SRC
  cat data/$prefix.$TRG \
    | $mosesdecoder/scripts/tokenizer/tokenizer.perl -a -l $TRG > data/$prefix.tok.$TRG
done
# clean empty and long sentences, and sentences with high source-target ratio (training corpus only)
$mosesdecoder/scripts/training/clean-corpus-n.perl data/europarl.tok $SRC $TRG data/europarl.tok.clean 1 80
# train truecaser
$mosesdecoder/scripts/recaser/train-truecaser.perl -corpus data/europarl.tok.clean.$SRC -model model/tc.$SRC
$mosesdecoder/scripts/recaser/train-truecaser.perl -corpus data/europarl.tok.clean.$TRG -model model/tc.$TRG
# apply truecaser (cleaned training corpus)
for prefix in europarl
do
  $mosesdecoder/scripts/recaser/truecase.perl -model model/tc.$SRC < data/$prefix.tok.clean.$SRC > data/$prefix.tc.$SRC
  $mosesdecoder/scripts/recaser/truecase.perl -model model/tc.$TRG < data/$prefix.tok.clean.$TRG > data/$prefix.tc.$TRG
done
# apply truecaser (dev/test files)
for prefix in valideuroparl
do
  $mosesdecoder/scripts/recaser/truecase.perl -model model/tc.$SRC < data/$prefix.tok.$SRC > data/$prefix.tc.$SRC
  $mosesdecoder/scripts/recaser/truecase.perl -model model/tc.$TRG < data/$prefix.tok.$TRG > data/$prefix.tc.$TRG
done
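# Note: no BPE is applied above, even though $bpe_operations and $subword_nmt
# are defined. If subword segmentation were added back, a rough sketch using
# the subword-nmt scripts would be:
#   cat data/europarl.tc.$SRC data/europarl.tc.$TRG \
#     | $subword_nmt/learn_bpe.py -s $bpe_operations > model/$SRC$TRG.bpe
#   for prefix in europarl valideuroparl
#   do
#     $subword_nmt/apply_bpe.py -c model/$SRC$TRG.bpe < data/$prefix.tc.$SRC > data/$prefix.bpe.$SRC
#     $subword_nmt/apply_bpe.py -c model/$SRC$TRG.bpe < data/$prefix.tc.$TRG > data/$prefix.bpe.$TRG
#   done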


@@ -0,0 +1,45 @@
#!/usr/bin/env python
from __future__ import print_function
import argparse
import numpy as np
# Parse arguments.
parser = argparse.ArgumentParser()
parser.add_argument('-m', '--model', nargs='+', required=True,
                    help="Models to average")
parser.add_argument('-o', '--output', required=True,
                    help="Output path")
args = parser.parse_args()
# *average* accumulates the parameter arrays across models, keyed by name.
average = dict()
# No. of models.
n = len(args.model)
for filename in args.model:
    print("Loading {}".format(filename))
    with open(filename, "rb") as mfile:
        # Load the parameter arrays from the .npz checkpoint.
        m = np.load(mfile)
        for k in m:
            if k != "history_errs":
                # Initialize the key.
                if k not in average:
                    average[k] = m[k]
                # Accumulate parameters with matching shapes
                # ('special' keys are skipped).
                elif average[k].shape == m[k].shape and "special" not in k:
                    average[k] += m[k]
# Actual averaging.
for k in average:
    if "special" not in k:
        average[k] /= n
# Save averaged model to file.
print("Saving to {}".format(args.output))
np.savez(args.output, **average)
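A usage sketch (assuming the script is saved as average.py; the checkpoint
names are illustrative):

    $ python average.py -m model.ckpt1.npz model.ckpt2.npz -o model.avg.npz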


@@ -0,0 +1,20 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# Author: Barry Haddow
# Distributed under MIT license
#
# Remove Romanian diacritics. Assumes s-comma and t-comma are normalised
import io
import sys
istream = io.TextIOWrapper(sys.stdin.buffer, encoding='utf-8')
ostream = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')
for line in istream:
    line = line.replace("\u0218", "S").replace("\u0219", "s")  # s-comma
    line = line.replace("\u021a", "T").replace("\u021b", "t")  # t-comma
    line = line.replace("\u0102", "A").replace("\u0103", "a")
    line = line.replace("\u00C2", "A").replace("\u00E2", "a")
    line = line.replace("\u00CE", "I").replace("\u00EE", "i")
    ostream.write(line)


@@ -0,0 +1,8 @@
#!/bin/bash
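# Used as Marian's --valid-script-path (see train-and-translate.sh): Marian
# passes the file with raw decoder output as $1, and the number printed at
# the end is read back as the 'translation' validation score.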
cat $1 \
  | sed 's/\@\@ //g' \
  | /home/jakub/TAU/TAU_24/marian-dev/examples/tools/moses-scripts/scripts/recaser/detruecase.perl 2> /dev/null \
  | /home/jakub/TAU/TAU_24/marian-dev/examples/tools/moses-scripts/scripts/tokenizer/detokenizer.perl -l en 2> /dev/null \
  | /home/jakub/TAU/TAU_24/marian-dev/examples/tools/moses-scripts/scripts/generic/multi-bleu-detok.perl preprocess-train/data/valideuroparl.en \
  | sed -r 's/BLEU = ([0-9.]+),.*/\1/'

6
scores_geval.txt Normal file

@@ -0,0 +1,6 @@
BLEU 0.1547
GLEU 0.2063
BLEU 0.1547
GLEU 0.2063
BLEU 0.1592
GLEU 0.2100

3005
test-A/in.tc.cs Normal file

File diff suppressed because it is too large

0
test-A/in.tc.cs.output Normal file

3005
test-A/in.tok.tsv Normal file

File diff suppressed because it is too large

3005
test-A/in.tsv Normal file

File diff suppressed because it is too large

0
test-A/out.tsv Normal file

83
train-and-translate.sh Executable file

@@ -0,0 +1,83 @@
#!/bin/bash -v
MARIAN=/home/jakub/TAU/TAU_24/marian-dev/build
# if we are in WSL, we need to add '.exe' to the tool names
if [ -e "/bin/wslpath" ]
then
  EXT=.exe
fi
MARIAN_TRAIN=$MARIAN/marian$EXT
MARIAN_DECODER=$MARIAN/marian-decoder$EXT
MARIAN_VOCAB=$MARIAN/marian-vocab$EXT
MARIAN_SCORER=$MARIAN/marian-scorer$EXT
# set chosen gpus
GPUS=0
if [ $# -ne 0 ]
then
  GPUS=$@
fi
echo Using GPUs: $GPUS
if [ ! -e "$MARIAN_TRAIN" ]
then
  echo "marian is not installed in $MARIAN, you need to compile the toolkit first"
  exit 1
fi
# create common vocabulary
if [ ! -e "preprocess-train/model/vocab.cs.yml" ] || [ ! -e "preprocess-train/model/vocab.en.yml" ]
then
cat preprocess-train/data/europarl.tc.cs preprocess-train/data/valideuroparl.tc.cs | $MARIAN_VOCAB --max-size 50000 > preprocess-train/model/vocab.cs.yml
cat preprocess-train/data/europarl.tc.en preprocess-train/data/valideuroparl.tc.en | $MARIAN_VOCAB --max-size 50000 > preprocess-train/model/vocab.en.yml
fi
# train model
if [ ! -e "preprocess-train/model/model.npz.best-translation.npz" ]
then
  $MARIAN_TRAIN \
    --devices $GPUS \
    --type s2s \
    --early-stopping 10 -e 15 \
    --model preprocess-train/model/model.npz \
    --train-sets preprocess-train/data/europarl.tc.cs preprocess-train/data/europarl.tc.en \
    --vocabs preprocess-train/model/vocab.cs.yml preprocess-train/model/vocab.en.yml \
    --dim-vocabs 50000 50000 \
    --mini-batch-fit -w 1024 \
    --valid-freq 10000 --save-freq 10000 --disp-freq 1000 \
    --valid-mini-batch 23 --valid-max-length 100 \
    --valid-metrics cross-entropy translation \
    --valid-sets preprocess-train/data/valideuroparl.tc.cs preprocess-train/data/valideuroparl.tc.en \
    --valid-script-path "bash ./preprocess-train/scripts/validate.sh" \
    --log preprocess-train/model/train.log --valid-log preprocess-train/model/valid.log \
    --overwrite --keep-best
fi
# translate dev set
cat dev-0/in.tc.cs \
  | $MARIAN_DECODER -c preprocess-train/model/model.npz.best-translation.npz.decoder.yml -d $GPUS -b 12 -n1 \
    --mini-batch 64 --maxi-batch 10 --maxi-batch-sort src \
  | sed 's/\@\@ //g' \
  | /home/jakub/TAU/TAU_24/marian-dev/examples/tools/moses-scripts/scripts/recaser/detruecase.perl \
  | /home/jakub/TAU/TAU_24/marian-dev/examples/tools/moses-scripts/scripts/tokenizer/detokenizer.perl -l en \
  > dev-0/in.tc.cs.output
cp dev-0/in.tc.cs.output dev-0/out.tsv
# translate test set
cat test-A/in.tc.cs \
  | $MARIAN_DECODER -c preprocess-train/model/model.npz.best-translation.npz.decoder.yml -d $GPUS -b 12 -n1 \
    --mini-batch 64 --maxi-batch 10 --maxi-batch-sort src \
  | sed 's/\@\@ //g' \
  | /home/jakub/TAU/TAU_24/marian-dev/examples/tools/moses-scripts/scripts/recaser/detruecase.perl \
  | /home/jakub/TAU/TAU_24/marian-dev/examples/tools/moses-scripts/scripts/tokenizer/detokenizer.perl -l en \
  > test-A/in.tc.cs.output
cp test-A/in.tc.cs.output test-A/out.tsv
geval -t dev-0 >> scores_geval.txt
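# Usage: ./train-and-translate.sh [gpu-id ...]
# e.g. ./train-and-translate.sh 0 1 (defaults to GPU 0 when no ids are given)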

7
train/clean.rb Normal file

@@ -0,0 +1,7 @@
# Collapse runs of tabs into a single tab, so that every line splits into
# exactly two columns.
while l = gets
  l.gsub!(/\t+/, "\t")
  puts l
end
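A usage sketch, producing the cleaned file that split.py reads (run from the
train/ directory):

    $ xzcat news-commentary-v12.tsv.xz | ruby clean.rb > news-commentary-v12-clean.tsv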

21
train/split.py Normal file

@@ -0,0 +1,21 @@
import pandas as pd

# Read the cleaned News Commentary corpus (tab-separated, Czech then English)
# and write each column to its own file.
corpus = pd.read_csv('news-commentary-v12-clean.tsv', sep='\t',
                     names=['cs', 'en'], header=None)

with open("./after_split/news-commentary-v12-clean.cs", "w") as file_cs:
    for row in corpus['cs']:
        file_cs.write(str(row) + "\n")

with open("./after_split/news-commentary-v12-clean.en", "w") as file_en:
    for row in corpus['en']:
        file_en.write(str(row) + "\n")
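The script assumes it is run from the train/ directory and that ./after_split
already exists:

    $ cd train && mkdir -p after_split && python split.py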

22
train/split.rb Normal file

@@ -0,0 +1,22 @@
# Split a tab-separated bitext read from stdin into two files: the first
# column goes to the first file, the second column to the second.
filename1 = ARGV.shift
filename2 = ARGV.shift
f1 = File.open(filename1, 'w')
f2 = File.open(filename2, 'w')
lines = readlines
lines.each do |line|
  a = line.split(/\t+/)
  if a.length == 2 && line.match(/^[^\t]+\t+[^\t]+$/)
    f1.puts(a[0])
    f2.puts(a[1])
  else
    puts("@@@ERROR@@@\n#{line}")
  end
end
f1.close
f2.close
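A usage sketch, equivalent to split.py but reading the corpus from stdin (run
from the train/ directory; file names are illustrative). Lines that do not
split into exactly two columns are echoed to stdout with the error marker:

    $ ruby split.rb after_split/europarl-v7.cs after_split/europarl-v7.en < europarl-v7-clean.tsv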