TAU 28 Marian s2s, tokenizer, truecaser

Jakub 2020-01-27 10:17:26 +01:00
commit f393c9cd67
25 changed files with 29139 additions and 0 deletions

45
README.md Normal file

@@ -0,0 +1,45 @@
WMT2017 Czech-English machine translation challenge for news
=============================================================
Translate news articles from Czech into English.
This is the WMT2017 news challenge reformatted as a Gonito.net challenge;
all the data were taken from <http://www.statmt.org/wmt17/translation-task.html>.
BLEU is used as the evaluation metric.
Directory structure
-------------------
* `README.md` — this file
* `config.txt` — configuration file
* `train/` — directory with training data
* `train/commoncrawl.tsv.xz` — Common Crawl parallel corpus
* `train/news-commentary-v12.tsv.xz` — News Commentary parallel corpus
* `train/europarl-v7.tsv.xz` — Europarl parallel corpus
* `dev-0/` — directory with dev (test) data (Newstest 2013)
* `dev-0/in.tsv` — Czech input text for the dev set
* `dev-0/expected.tsv` — English reference translation for the dev set
* `test-A` — directory with test data
* `test-A/in.tsv` — Czech input data for the test set (WMT2017 test set)
* `test-A/expected.tsv` — English reference translation for the test set
Training sets
-------------
All training sets are compressed with xz; use `xzcat` to decompress:

    $ xzcat train/*.tsv.xz | ...

Pairs where the Czech or English side is empty were removed from the
training sets.
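Each line contains a Czech sentence and its English translation separated by a
tab (Czech in the first column). For instance, one corpus can be split into two
monolingual files roughly like this (the output file names are illustrative):

    $ xzcat train/europarl-v7.tsv.xz | cut -f1 > europarl-v7.cs
    $ xzcat train/europarl-v7.tsv.xz | cut -f2 > europarl-v7.en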
Test sets
---------
Reference English translations in the dev and test sets are not tokenised.
Monolingual data
----------------
Monolingual data was not included here.

1
config.txt Normal file

@@ -0,0 +1 @@
--metric BLEU --metric GLEU --tokenizer 13a --precision 4
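With this configuration in place, dev-set scores can be computed locally with
geval (the same call is made at the end of train-and-translate.sh):

    $ geval -t dev-0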

3000
dev-0/expected.tsv Normal file

File diff suppressed because it is too large

3000
dev-0/in.tc.cs Normal file

File diff suppressed because it is too large

3000
dev-0/in.tc.cs.output Normal file

File diff suppressed because it is too large

3000
dev-0/in.tok.tsv Normal file

File diff suppressed because it is too large

3000
dev-0/in.tsv Normal file

File diff suppressed because it is too large

3000
dev-0/out.tsv Normal file

File diff suppressed because it is too large

48
preprocess-data_dev-0_test-A.sh Executable file

@@ -0,0 +1,48 @@
#!/bin/bash -v
# Preprocesses the dev and test inputs: tokenization and truecasing.
# Adapted from the Marian sample preprocessing script (originally written for
# a Romanian-English corpus; the Romanian-specific normalization and the
# subword-segmentation step are not used here).
# For a different language pair, change the source and target suffixes and
# the file names (currently, dev-0/in.tsv and test-A/in.tsv are being
# processed).
# Suffix of source language files
SRC=cs
# Suffix of target language files
TRG=en
# Number of BPE merge operations (kept from the sample; no BPE is applied in
# this script)
bpe_operations=85000
# path to moses decoder: https://github.com/moses-smt/mosesdecoder
mosesdecoder=/home/jakub/TAU/TAU_24/marian-dev/examples/tools/moses-scripts
# path to subword segmentation scripts: https://github.com/rsennrich/subword-nmt
subword_nmt=/home/jakub/TAU/TAU_24/marian-dev/examples/tools/subword-nmt
# tokenize (the input is Czech, so use the source-language tokenizer rules)
cat dev-0/in.tsv \
  | $mosesdecoder/scripts/tokenizer/tokenizer.perl -a -l $SRC > dev-0/in.tok.tsv
cat test-A/in.tsv \
  | $mosesdecoder/scripts/tokenizer/tokenizer.perl -a -l $SRC > test-A/in.tok.tsv
# apply truecaser (dev/test files)
for prefix in dev-0/in test-A/in
do
  $mosesdecoder/scripts/recaser/truecase.perl -model preprocess-train/model/tc.$SRC < $prefix.tok.tsv > $prefix.tc.$SRC
done
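# Assumed usage: run from the challenge root, after the truecasing model
# preprocess-train/model/tc.cs has been trained by the training
# preprocessing script:
#   bash ./preprocess-data_dev-0_test-A.sh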

File diff suppressed because it is too large


@@ -0,0 +1,36 @@
[2020-01-24 03:53:38] [valid] Ep. 1 : Up. 10000 : cross-entropy : 108.847 : new best
[2020-01-24 03:56:42] [valid] Ep. 1 : Up. 10000 : translation : 7.52 : new best
[2020-01-24 05:33:26] [valid] Ep. 2 : Up. 20000 : cross-entropy : 94.6385 : new best
[2020-01-24 05:36:21] [valid] Ep. 2 : Up. 20000 : translation : 12.28 : new best
[2020-01-24 07:13:19] [valid] Ep. 3 : Up. 30000 : cross-entropy : 89.3 : new best
[2020-01-24 07:16:05] [valid] Ep. 3 : Up. 30000 : translation : 14.42 : new best
[2020-01-24 08:52:52] [valid] Ep. 4 : Up. 40000 : cross-entropy : 86.3268 : new best
[2020-01-24 08:55:27] [valid] Ep. 4 : Up. 40000 : translation : 15.55 : new best
[2020-01-24 12:16:53] [valid] Ep. 5 : Up. 50000 : cross-entropy : 84.0588 : new best
[2020-01-24 12:19:29] [valid] Ep. 5 : Up. 50000 : translation : 15.5 : stalled 1 times (last best: 15.55)
[2020-01-24 13:58:10] [valid] Ep. 5 : Up. 60000 : cross-entropy : 82.5176 : new best
[2020-01-24 14:01:02] [valid] Ep. 5 : Up. 60000 : translation : 16.08 : new best
[2020-01-24 14:07:51] [valid] Ep. 6 : Up. 60621 : cross-entropy : 82.5103 : new best
[2020-01-24 14:10:42] [valid] Ep. 6 : Up. 60621 : translation : 15.87 : stalled 1 times (last best: 16.08)
[2020-01-24 16:26:57] [valid] Ep. 6 : Up. 70000 : cross-entropy : 82.2896 : new best
[2020-01-24 16:29:33] [valid] Ep. 6 : Up. 70000 : translation : 16.03 : stalled 2 times (last best: 16.08)
[2020-01-24 18:09:49] [valid] Ep. 7 : Up. 80000 : cross-entropy : 82.905 : stalled 1 times (last best: 82.2896)
[2020-01-24 18:12:31] [valid] Ep. 7 : Up. 80000 : translation : 16.49 : new best
[2020-01-24 19:53:05] [valid] Ep. 8 : Up. 90000 : cross-entropy : 83.1733 : stalled 2 times (last best: 82.2896)
[2020-01-24 19:55:48] [valid] Ep. 8 : Up. 90000 : translation : 16.18 : stalled 1 times (last best: 16.49)
[2020-01-24 21:36:56] [valid] Ep. 9 : Up. 100000 : cross-entropy : 84.6166 : stalled 3 times (last best: 82.2896)
[2020-01-24 21:39:23] [valid] Ep. 9 : Up. 100000 : translation : 16.42 : stalled 2 times (last best: 16.49)
[2020-01-24 23:16:42] [valid] Ep. 10 : Up. 110000 : cross-entropy : 85.5421 : stalled 4 times (last best: 82.2896)
[2020-01-24 23:19:12] [valid] Ep. 10 : Up. 110000 : translation : 16.23 : stalled 3 times (last best: 16.49)
[2020-01-25 00:57:49] [valid] Ep. 10 : Up. 120000 : cross-entropy : 86.5355 : stalled 5 times (last best: 82.2896)
[2020-01-25 01:00:23] [valid] Ep. 10 : Up. 120000 : translation : 15.92 : stalled 4 times (last best: 16.49)
[2020-01-25 02:37:11] [valid] Ep. 11 : Up. 130000 : cross-entropy : 88.0175 : stalled 6 times (last best: 82.2896)
[2020-01-25 02:39:36] [valid] Ep. 11 : Up. 130000 : translation : 15.98 : stalled 5 times (last best: 16.49)
[2020-01-25 04:16:32] [valid] Ep. 12 : Up. 140000 : cross-entropy : 89.7736 : stalled 7 times (last best: 82.2896)
[2020-01-25 04:18:58] [valid] Ep. 12 : Up. 140000 : translation : 15.72 : stalled 6 times (last best: 16.49)
[2020-01-25 05:55:59] [valid] Ep. 13 : Up. 150000 : cross-entropy : 91.4267 : stalled 8 times (last best: 82.2896)
[2020-01-25 05:58:28] [valid] Ep. 13 : Up. 150000 : translation : 15.43 : stalled 7 times (last best: 16.49)
[2020-01-25 07:35:33] [valid] Ep. 14 : Up. 160000 : cross-entropy : 93.3989 : stalled 9 times (last best: 82.2896)
[2020-01-25 07:38:02] [valid] Ep. 14 : Up. 160000 : translation : 15.24 : stalled 8 times (last best: 16.49)
[2020-01-25 09:15:09] [valid] Ep. 15 : Up. 170000 : cross-entropy : 94.6314 : stalled 10 times (last best: 82.2896)
[2020-01-25 09:17:37] [valid] Ep. 15 : Up. 170000 : translation : 15.18 : stalled 9 times (last best: 16.49)
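Training ends here: cross-entropy has stalled 10 times, which matches
--early-stopping 10 in train-and-translate.sh (the epoch limit -e 15 is
reached at the same point). The best validation BLEU can be pulled back out
of the log, whose path is set by --valid-log:

    $ grep 'translation' preprocess-train/model/valid.log | grep 'new best' | tail -n 1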


@@ -0,0 +1,62 @@
#!/bin/bash -v
# Preprocesses the training corpus: tokenization, cleaning, and truecasing.
# Adapted from the Marian sample preprocessing script (originally written for
# a Romanian-English corpus).
# For a different language pair, change the source and target suffixes,
# optionally the number of BPE operations, and the file names (currently,
# data/europarl and data/valideuroparl are being processed).
# In the tokenization step you may want to add language-specific
# normalization (the original sample removed Romanian diacritics), and you
# may want to learn BPE segmentations separately for each language,
# especially if the languages differ in their alphabets.
# Suffix of source language files
SRC=cs
# Suffix of target language files
TRG=en
# Number of BPE merge operations. The network vocabulary should be slightly
# larger (to include characters), or smaller if the operations are learned on
# the joint vocabulary. The BPE step itself is not run in this script; a
# sketch is included as comments at the end.
bpe_operations=85000
# path to moses decoder: https://github.com/moses-smt/mosesdecoder
mosesdecoder=/home/jakub/TAU/TAU_24/marian-dev/examples/tools/moses-scripts
# path to subword segmentation scripts: https://github.com/rsennrich/subword-nmt
subword_nmt=/home/jakub/TAU/TAU_24/marian-dev/examples/tools/subword-nmt
# tokenize
for prefix in europarl valideuroparl
do
  cat data/$prefix.$SRC \
    | $mosesdecoder/scripts/tokenizer/tokenizer.perl -a -l $SRC > data/$prefix.tok.$SRC
  cat data/$prefix.$TRG \
    | $mosesdecoder/scripts/tokenizer/tokenizer.perl -a -l $TRG > data/$prefix.tok.$TRG
done
# clean empty and long sentences, and sentences with high source-target ratio (training corpus only)
$mosesdecoder/scripts/training/clean-corpus-n.perl data/europarl.tok $SRC $TRG data/europarl.tok.clean 1 80
# train truecaser
$mosesdecoder/scripts/recaser/train-truecaser.perl -corpus data/europarl.tok.clean.$SRC -model model/tc.$SRC
$mosesdecoder/scripts/recaser/train-truecaser.perl -corpus data/europarl.tok.clean.$TRG -model model/tc.$TRG
# apply truecaser (cleaned training corpus)
for prefix in europarl
do
  $mosesdecoder/scripts/recaser/truecase.perl -model model/tc.$SRC < data/$prefix.tok.clean.$SRC > data/$prefix.tc.$SRC
  $mosesdecoder/scripts/recaser/truecase.perl -model model/tc.$TRG < data/$prefix.tok.clean.$TRG > data/$prefix.tc.$TRG
done
# apply truecaser (dev/test files)
for prefix in valideuroparl
do
  $mosesdecoder/scripts/recaser/truecase.perl -model model/tc.$SRC < data/$prefix.tok.$SRC > data/$prefix.tc.$SRC
  $mosesdecoder/scripts/recaser/truecase.perl -model model/tc.$TRG < data/$prefix.tok.$TRG > data/$prefix.tc.$TRG
done
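# Note: no BPE is applied above, even though $bpe_operations and $subword_nmt
# are defined. If subword segmentation were added back, a rough sketch using
# the subword-nmt scripts would be:
#   cat data/europarl.tc.$SRC data/europarl.tc.$TRG \
#     | $subword_nmt/learn_bpe.py -s $bpe_operations > model/$SRC$TRG.bpe
#   for prefix in europarl valideuroparl
#   do
#     $subword_nmt/apply_bpe.py -c model/$SRC$TRG.bpe < data/$prefix.tc.$SRC > data/$prefix.bpe.$SRC
#     $subword_nmt/apply_bpe.py -c model/$SRC$TRG.bpe < data/$prefix.tc.$TRG > data/$prefix.bpe.$TRG
#   done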


@@ -0,0 +1,45 @@
#!/usr/bin/env python
from __future__ import print_function
import argparse
import numpy as np
# Parse arguments.
parser = argparse.ArgumentParser()
parser.add_argument('-m', '--model', nargs='+', required=True,
                    help="Models to average")
parser.add_argument('-o', '--output', required=True,
                    help="Output path")
args = parser.parse_args()
# *average* accumulates the parameter arrays across models, keyed by name.
average = dict()
# No. of models.
n = len(args.model)
for filename in args.model:
    print("Loading {}".format(filename))
    with open(filename, "rb") as mfile:
        # Load the parameter arrays from the .npz checkpoint.
        m = np.load(mfile)
        for k in m:
            if k != "history_errs":
                # Initialize the key.
                if k not in average:
                    average[k] = m[k]
                # Accumulate parameters with matching shapes
                # ('special' keys are skipped).
                elif average[k].shape == m[k].shape and "special" not in k:
                    average[k] += m[k]
# Actual averaging.
for k in average:
    if "special" not in k:
        average[k] /= n
# Save averaged model to file.
print("Saving to {}".format(args.output))
np.savez(args.output, **average)
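A usage sketch (assuming the script is saved as average.py; the checkpoint
names are illustrative):

    $ python average.py -m model.ckpt1.npz model.ckpt2.npz -o model.avg.npz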


@@ -0,0 +1,20 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# Author: Barry Haddow
# Distributed under MIT license
#
# Remove Romanian diacritics. Assumes s-comma and t-comma are normalised
import io
import sys
istream = io.TextIOWrapper(sys.stdin.buffer, encoding='utf-8')
ostream = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')
for line in istream:
    line = line.replace("\u0218", "S").replace("\u0219", "s")  # s-comma
    line = line.replace("\u021a", "T").replace("\u021b", "t")  # t-comma
    line = line.replace("\u0102", "A").replace("\u0103", "a")
    line = line.replace("\u00C2", "A").replace("\u00E2", "a")
    line = line.replace("\u00CE", "I").replace("\u00EE", "i")
    ostream.write(line)


@@ -0,0 +1,8 @@
#!/bin/bash
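# Used as Marian's --valid-script-path (see train-and-translate.sh): Marian
# passes the file with raw decoder output as $1, and the number printed at
# the end is read back as the 'translation' validation score.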
cat $1 \
  | sed 's/\@\@ //g' \
  | /home/jakub/TAU/TAU_24/marian-dev/examples/tools/moses-scripts/scripts/recaser/detruecase.perl 2> /dev/null \
  | /home/jakub/TAU/TAU_24/marian-dev/examples/tools/moses-scripts/scripts/tokenizer/detokenizer.perl -l en 2> /dev/null \
  | /home/jakub/TAU/TAU_24/marian-dev/examples/tools/moses-scripts/scripts/generic/multi-bleu-detok.perl preprocess-train/data/valideuroparl.en \
  | sed -r 's/BLEU = ([0-9.]+),.*/\1/'

6
scores_geval.txt Normal file

@@ -0,0 +1,6 @@
BLEU 0.1547
GLEU 0.2063
BLEU 0.1547
GLEU 0.2063
BLEU 0.1592
GLEU 0.2100

3005
test-A/in.tc.cs Normal file

File diff suppressed because it is too large

0
test-A/in.tc.cs.output Normal file

3005
test-A/in.tok.tsv Normal file

File diff suppressed because it is too large

3005
test-A/in.tsv Normal file

File diff suppressed because it is too large

0
test-A/out.tsv Normal file

83
train-and-translate.sh Executable file

@@ -0,0 +1,83 @@
#!/bin/bash -v
MARIAN=/home/jakub/TAU/TAU_24/marian-dev/build
# if we are in WSL, we need to add '.exe' to the tool names
if [ -e "/bin/wslpath" ]
then
  EXT=.exe
fi
MARIAN_TRAIN=$MARIAN/marian$EXT
MARIAN_DECODER=$MARIAN/marian-decoder$EXT
MARIAN_VOCAB=$MARIAN/marian-vocab$EXT
MARIAN_SCORER=$MARIAN/marian-scorer$EXT
# set chosen gpus
GPUS=0
if [ $# -ne 0 ]
then
  GPUS=$@
fi
echo Using GPUs: $GPUS
if [ ! -e "$MARIAN_TRAIN" ]
then
  echo "marian is not installed in $MARIAN, you need to compile the toolkit first"
  exit 1
fi
# create common vocabulary
if [ ! -e "preprocess-train/model/vocab.cs.yml" ] || [ ! -e "preprocess-train/model/vocab.en.yml" ]
then
cat preprocess-train/data/europarl.tc.cs preprocess-train/data/valideuroparl.tc.cs | $MARIAN_VOCAB --max-size 50000 > preprocess-train/model/vocab.cs.yml
cat preprocess-train/data/europarl.tc.en preprocess-train/data/valideuroparl.tc.en | $MARIAN_VOCAB --max-size 50000 > preprocess-train/model/vocab.en.yml
fi
# train model
if [ ! -e "preprocess-train/model/model.npz.best-translation.npz" ]
then
  $MARIAN_TRAIN \
    --devices $GPUS \
    --type s2s \
    --early-stopping 10 -e 15 \
    --model preprocess-train/model/model.npz \
    --train-sets preprocess-train/data/europarl.tc.cs preprocess-train/data/europarl.tc.en \
    --vocabs preprocess-train/model/vocab.cs.yml preprocess-train/model/vocab.en.yml \
    --dim-vocabs 50000 50000 \
    --mini-batch-fit -w 1024 \
    --valid-freq 10000 --save-freq 10000 --disp-freq 1000 \
    --valid-mini-batch 23 --valid-max-length 100 \
    --valid-metrics cross-entropy translation \
    --valid-sets preprocess-train/data/valideuroparl.tc.cs preprocess-train/data/valideuroparl.tc.en \
    --valid-script-path "bash ./preprocess-train/scripts/validate.sh" \
    --log preprocess-train/model/train.log --valid-log preprocess-train/model/valid.log \
    --overwrite --keep-best
fi
# translate dev set
cat dev-0/in.tc.cs \
  | $MARIAN_DECODER -c preprocess-train/model/model.npz.best-translation.npz.decoder.yml -d $GPUS -b 12 -n1 \
    --mini-batch 64 --maxi-batch 10 --maxi-batch-sort src \
  | sed 's/\@\@ //g' \
  | /home/jakub/TAU/TAU_24/marian-dev/examples/tools/moses-scripts/scripts/recaser/detruecase.perl \
  | /home/jakub/TAU/TAU_24/marian-dev/examples/tools/moses-scripts/scripts/tokenizer/detokenizer.perl -l en \
  > dev-0/in.tc.cs.output
cp dev-0/in.tc.cs.output dev-0/out.tsv
# translate test set
cat test-A/in.tc.cs \
  | $MARIAN_DECODER -c preprocess-train/model/model.npz.best-translation.npz.decoder.yml -d $GPUS -b 12 -n1 \
    --mini-batch 64 --maxi-batch 10 --maxi-batch-sort src \
  | sed 's/\@\@ //g' \
  | /home/jakub/TAU/TAU_24/marian-dev/examples/tools/moses-scripts/scripts/recaser/detruecase.perl \
  | /home/jakub/TAU/TAU_24/marian-dev/examples/tools/moses-scripts/scripts/tokenizer/detokenizer.perl -l en \
  > test-A/in.tc.cs.output
cp test-A/in.tc.cs.output test-A/out.tsv
geval -t dev-0 >> scores_geval.txt
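# Usage: ./train-and-translate.sh [gpu-id ...]
# e.g. ./train-and-translate.sh 0 1 (defaults to GPU 0 when no ids are given)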

7
train/clean.rb Normal file

@@ -0,0 +1,7 @@
# Collapse runs of tabs into a single tab, so that every line splits into
# exactly two columns.
while l = gets
  l.gsub!(/\t+/, "\t")
  puts l
end
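A usage sketch, producing the cleaned file that split.py reads (run from the
train/ directory):

    $ xzcat news-commentary-v12.tsv.xz | ruby clean.rb > news-commentary-v12-clean.tsv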

21
train/split.py Normal file

@@ -0,0 +1,21 @@
import pandas as pd

# Read the cleaned News Commentary corpus (tab-separated, Czech then English)
# and write each column to its own file.
corpus = pd.read_csv('news-commentary-v12-clean.tsv', sep='\t',
                     names=['cs', 'en'], header=None)

with open("./after_split/news-commentary-v12-clean.cs", "w") as file_cs:
    for row in corpus['cs']:
        file_cs.write(str(row) + "\n")

with open("./after_split/news-commentary-v12-clean.en", "w") as file_en:
    for row in corpus['en']:
        file_en.write(str(row) + "\n")
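The script assumes it is run from the train/ directory and that ./after_split
already exists:

    $ cd train && mkdir -p after_split && python split.py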

22
train/split.rb Normal file

@@ -0,0 +1,22 @@
# Split a tab-separated bitext read from stdin into two files: the first
# column goes to the first file, the second column to the second.
filename1 = ARGV.shift
filename2 = ARGV.shift
f1 = File.open(filename1, 'w')
f2 = File.open(filename2, 'w')
lines = readlines
lines.each do |line|
  a = line.split(/\t+/)
  if a.length == 2 && line.match(/^[^\t]+\t+[^\t]+$/)
    f1.puts(a[0])
    f2.puts(a[1])
  else
    puts("@@@ERROR@@@\n#{line}")
  end
end
f1.close
f2.close
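A usage sketch, equivalent to split.py but reading the corpus from stdin (run
from the train/ directory; file names are illustrative). Lines that do not
split into exactly two columns are echoed to stdout with the error marker:

    $ ruby split.rb after_split/europarl-v7.cs after_split/europarl-v7.en < europarl-v7-clean.tsv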