73 lines
2.4 KiB
Plaintext
73 lines
2.4 KiB
Plaintext
Europarl v3 Preprocessing Tools
|
|
===============================
|
|
written by Philipp Koehn and Josh Schroeder
|
|
|
|
|
|
Sentence Splitter
|
|
=================
|
|
Usage ./split-sentences.perl -l [en|de|...] < textfile > splitfile
|
|
|
|
Uses punctuation and Capitalization clues to split paragraphs of
|
|
sentences into files with one sentence per line. For example:
|
|
|
|
This is a paragraph. It contains several sentences. "But why," you ask?
|
|
|
|
goes to:
|
|
|
|
This is a paragraph.
|
|
It contains several sentences.
|
|
"But why," you ask?
|
|
|
|
See more information in the Nonbreaking Prefixes section.
|
|
|
|
|
|
Tokenizer
|
|
=========
|
|
Usage ./tokenizer.perl -l [en|de|...] < textfile > tokenizedfile
|
|
|
|
Splits out most punctuation from words. Special cases where splits
|
|
do not occur are documented in the code.
|
|
|
|
This E.U. treaty is, to use the words of Mr. Smith, "awesome."
|
|
|
|
goes to:
|
|
|
|
This E.U. treaty is , to use the words of Mr. Smith , " awesome . "
|
|
|
|
Like the sentence splitter, it makes use of the nonbreaking_prefixes
|
|
directory.
|
|
|
|
|
|
Nonbreaking Prefixes Directory
|
|
==============================
|
|
|
|
Nonbreaking prefixes are loosely defined as any word ending in a
|
|
period that does NOT indicate an end of sentence marker. A basic
|
|
example is Mr. and Ms. in English.
|
|
|
|
The sentence splitter and tokenizer included with this release
|
|
both use the nonbreaking prefix files included in this directory.
|
|
|
|
To add a file for other languages, follow the naming convention
|
|
nonbreaking_prefix.?? and use the two-letter language code you
|
|
intend to use when calling split-sentences.perl and tokenizer.perl.
|
|
|
|
Both split-sentences and tokenizer will first look for a file for the
|
|
language they are processing, and fall back to English if a file
|
|
for that language is not found. If the nonbreaking_prefixes directory does
|
|
not exist at the same location as the split-sentences.perl and tokenizer.perl
|
|
files, they will not run.
|
|
|
|
For the splitter, normally a period followed by an uppercase word
|
|
results in a sentence split. If the word preceeding the period
|
|
is a nonbreaking prefix, this line break is not inserted.
|
|
|
|
For the tokenizer, a nonbreaking prefix is not separated from its
|
|
period with a space.
|
|
|
|
A special case of prefixes, NUMERIC_ONLY, is included for special
|
|
cases where the prefix should be handled ONLY when before numbers.
|
|
For example, "Article No. 24 states this." the No. is a nonbreaking
|
|
prefix. However, in "No. It is not true." No functions as a word.
|
|
|
|
See the example prefix files included here for more examples. |