Europarl v3 Preprocessing Tools
===============================
written by Philipp Koehn and Josh Schroeder


Sentence Splitter
=================
Usage: ./split-sentences.perl -l [en|de|...] < textfile > splitfile

Uses punctuation and capitalization clues to split paragraphs into
text with one sentence per line. For example:

This is a paragraph. It contains several sentences. "But why," you ask?

goes to:

This is a paragraph.
It contains several sentences.
"But why," you ask?

See more information in the Nonbreaking Prefixes section.


Tokenizer
=========
Usage: ./tokenizer.perl -l [en|de|...] < textfile > tokenizedfile

Splits most punctuation from words. Special cases where splits
do not occur are documented in the code. For example:

This E.U. treaty is, to use the words of Mr. Smith, "awesome."

goes to:

This E.U. treaty is , to use the words of Mr. Smith , " awesome . "

Like the sentence splitter, it makes use of the nonbreaking_prefixes
directory.


Nonbreaking Prefixes Directory
==============================

Nonbreaking prefixes are loosely defined as any word ending in a
period that does NOT mark the end of a sentence. Basic examples
are Mr. and Ms. in English.

The sentence splitter and tokenizer included with this release
both use the nonbreaking prefix files included in this directory.

To add a file for other languages, follow the naming convention
nonbreaking_prefix.?? and use the two-letter language code you
intend to use when calling split-sentences.perl and tokenizer.perl.

Both split-sentences and tokenizer will first look for a file for the
language they are processing, and fall back to English if a file
for that language is not found. If the nonbreaking_prefixes directory does
not exist at the same location as the split-sentences.perl and tokenizer.perl
files, they will not run.
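
The lookup order described above can be sketched in Python as follows.
This is only an illustration, not the code used by the Perl scripts:
the function name find_prefix_file is hypothetical, and raising an
error when even the English file is missing is an assumption.

```python
import os
import tempfile


def find_prefix_file(script_dir, lang, fallback="en"):
    """Return the prefix file for lang, falling back to English."""
    prefix_dir = os.path.join(script_dir, "nonbreaking_prefixes")
    if not os.path.isdir(prefix_dir):
        # As stated above: without the directory the tools do not run.
        raise FileNotFoundError("missing directory: " + prefix_dir)
    for code in (lang, fallback):
        candidate = os.path.join(prefix_dir, "nonbreaking_prefix." + code)
        if os.path.isfile(candidate):
            return candidate
    # Assumption: the real scripts may behave differently in this case.
    raise FileNotFoundError("no prefix file for " + lang + " or " + fallback)


# Demo in a throwaway directory that only contains an English file.
demo_dir = tempfile.mkdtemp()
os.mkdir(os.path.join(demo_dir, "nonbreaking_prefixes"))
open(os.path.join(demo_dir, "nonbreaking_prefixes",
                  "nonbreaking_prefix.en"), "w").close()

# German is requested, but only English exists, so English is returned.
print(os.path.basename(find_prefix_file(demo_dir, "de")))
```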

For the splitter, normally a period followed by an uppercase word
results in a sentence split. If the word preceding the period
is a nonbreaking prefix, no line break is inserted.
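
A minimal Python sketch of this rule, for illustration only. The
function name and the small prefix set are hypothetical. It covers
only the period-plus-uppercase rule; question marks, quotes and the
other clues used by split-sentences.perl are ignored.

```python
# Illustrative prefix set; the real tools read theirs from the prefix files.
NONBREAKING = {"Mr", "Ms"}


def split_sentences(text, prefixes=NONBREAKING):
    """Split after 'word.' when the next word starts with an uppercase letter."""
    words = text.split()
    sentences, current = [], []
    for i, word in enumerate(words):
        current.append(word)
        if i + 1 == len(words):
            break  # last word: nothing follows, so no split decision
        next_word = words[i + 1]
        if word.endswith(".") and next_word[:1].isupper():
            # A nonbreaking prefix before the period suppresses the split.
            if word[:-1] not in prefixes:
                sentences.append(" ".join(current))
                current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


for sentence in split_sentences("I met Mr. Smith. He was late."):
    print(sentence)
```

Here "Mr." is followed by an uppercase word but does not end a
sentence, while "Smith." does.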

For the tokenizer, a nonbreaking prefix is not separated from its
period with a space.
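
As a rough Python illustration of this behaviour (the function name
and prefix set are hypothetical): the sketch below separates only a
word-final period and leaves all other punctuation alone. Other
special cases, such as the E.U. abbreviation in the tokenizer example
above, are handled in the real code and are not covered here.

```python
# Illustrative prefix set; the real tools read theirs from the prefix files.
NONBREAKING = {"Mr", "Ms"}


def split_final_periods(text, prefixes=NONBREAKING):
    """Separate a trailing period from a word unless the word is a prefix."""
    out = []
    for word in text.split():
        if word.endswith(".") and len(word) > 1 and word[:-1] not in prefixes:
            out.append(word[:-1] + " .")  # ordinary word: split off the period
        else:
            out.append(word)  # nonbreaking prefix or no final period: keep
    return " ".join(out)


print(split_final_periods("Mr. Smith signed the treaty."))
```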

A special class of prefixes, NUMERIC_ONLY, covers cases where the
prefix should be treated as nonbreaking ONLY when it comes before a
number. For example, in "Article No. 24 states this." the No. is a
nonbreaking prefix. However, in "No. It is not true." No functions
as a word.
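
The distinction can be sketched in Python like this. The names
ALWAYS, NUMERIC_ONLY, PREFIXES and is_nonbreaking are hypothetical and
only illustrate the idea; they do not reflect how the Perl scripts
store the prefixes.

```python
ALWAYS = "always"              # nonbreaking in every context
NUMERIC_ONLY = "numeric_only"  # nonbreaking only before a number

# Illustrative table; the real tools read theirs from the prefix files.
PREFIXES = {"Mr": ALWAYS, "Ms": ALWAYS, "No": NUMERIC_ONLY}


def is_nonbreaking(word, next_word, prefixes=PREFIXES):
    """Decide whether the period after word is NOT a sentence end."""
    kind = prefixes.get(word)
    if kind == ALWAYS:
        return True
    if kind == NUMERIC_ONLY:
        # Only counts as a prefix when a number follows, as in "No. 24".
        return next_word[:1].isdigit()
    return False


print(is_nonbreaking("No", "24"))  # "Article No. 24 states this."
print(is_nonbreaking("No", "It"))  # "No. It is not true."
```

The first call prints True and the second prints False.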

See the prefix files included here for more examples.