utt/doc/utt.texinfo
2016-11-11 17:08:18 +01:00

\input texinfo @c -*-texinfo-*-
@c @documentencoding ISO-8859-2
@c @documentlanguage pl
@c %**start of header
@setfilename utt.info
@settitle UAM Text Tools v0.90
@documentencoding utf-8
@c %**end of header
@copying
This manual is for UAM Text Tools (version 0.90, October 2008).
Copyright @copyright{} 2005, 2007 Tomasz Obrębski, Michał Stolarski, Justyna Walkowska, Paweł Konieczka.
Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License, Version 1.2 or
any later version published by the Free Software Foundation; with no
Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A
copy of the license is included in the section entitled ``GNU Free
Documentation License''.
@c @quotation
@c Permission is granted to ...
@c No permission is granted until the document is completed.
@c @end quotation
@end copying
@titlepage
@title UAM Text Tools 0.90 - User Manual
@subtitle edition 0.01, @today{}
@subtitle status: prescript
@author by Justyna Walkowska, Tomasz Obrębski and Michał Stolarski
@page
@vskip 0pt plus 1filll
@insertcopying
@end titlepage
@contents
@c @paragraphindent none
@iftex
@tex
% \usepackage[T1]{fontenc}
% \usepackage[utf8]{inputenc}
% \usepackage{times}
@end tex
@parskip = 0.5@normalbaselineskip plus 3pt minus 1pt
@end iftex
@c @headings off
@c @everyheading LEM(1) @| @| LEM(1)
@everyfooting @today @c @| @thispage @|
@ifnottex
@node Top
@top UTT - UAM Text Tools
@insertcopying
@menu
* General information::
* UTT file format::
* Configuration files::
* UTT components::
* Auxiliary tools::
* Usage examples::
* PMDBF dictionary::
@c * Examples::
@c * Copyright::
* GNU Free Documentation License::
* Reporting bugs::
* Author::
@end menu
@end ifnottex
@c ----------------------------------------------------------------------
@node General information
@chapter General information
UAM Text Tools (UTT) is a package of language processing tools
developed at Adam Mickiewicz University. Its functionality includes:
@itemize @bullet
@item
tokenization
@item
dictionary-based morphological analysis
@item
heuristic morphological analysis of unknown words
@item
spelling correction
@item
pattern search
@item
sentence splitting
@item
generation of concordance tables
@end itemize
The toolkit is intended for processing raw (unannotated),
unrestricted text for any purpose.
The system is organized as a collection of command-line programs, each
performing one operation, e.g. tokenization, lemmatization, spelling
correction. The components are independent of one another; the
unifying element is the uniform i/o file format.
The components may be combined in various ways to provide various text
processing services. New components supplied by the user may also be
easily incorporated into the system, provided that they respect the i/o
file format conventions.
UTT component programs do not depend on any specific tagset or
morphological description format.
UTT is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
The Polex/PMDBF dictionary is licensed under the Creative Commons by-nc-sa License which prohibits commercial use.
List of contributors:
@itemize
@item Paweł Konieczka
@item Tomasz Obrębski
@item Michał Stolarski
@item Marcin Walas
@item Justyna Walkowska
@item Paweł Wereński
@end itemize
@c ----------------------------------------------------------------------
@c ---------------------------------------------------------------------
@node UTT file format
@chapter UTT file format
A UTT file contains annotation of a text. It consists of a sequence of
segments. Each segment explicitly refers to a continuous piece of the
text and provides some information on it.
@section Segment format
A segment occupies one line of a UTT file and consists of
space-separated fields:
@quotation
@sp 1
[@var{start} [@var{length}]] @var{type} @var{form} [@var{annotation1} [@var{annotation2} ...]]
@sp 1
@end quotation
@table @var
@item @var{start}
Non-negative integer value indicating the position in the source text where the
segment starts.
@item @var{length}
Non-negative integer value indicating the length of the segment.
@item @var{type}
A sequence of non-blank characters that does not consist solely of digits (a purely numeric value could lead to @var{type} being misinterpreted as a @var{start} or @var{length} field).
@var{type} reflects the main classification of segments
into words, numbers, punctuation marks, and meta-text markers.
@xref{tok output,,tok output}, for description of automatically recognized type markers.
@item @var{form}
This field contains the textual form of the segment or the special
symbol @code{*} indicating that the form is not given (e.g. when the segment has been created artificially to mark something and is of length 0).
The characters or character sequences that have special meaning in the
@var{form} field are enumerated below.
Characters with special meaning:
@itemize
@item @code{_} - space character
@item @code{*} - undefined contents
@end itemize
Escape sequences:
@itemize
@item @code{\n} - new line
@item @code{\t} - tabulation
@item @code{\r} - carriage return
@item @code{\_} - the @code{_} character
@item @code{\*} - the @code{*} character
@item @code{\\} - the @code{\} character
@c @item @code{\hh} - a character with hexadecimal code @code{hh} (used for non-printable characters)
@end itemize
@item @var{annotation1}
@item @var{annotation2}
@item ...
Annotation fields have the following format:
@var{longname} @code{:} @var{value}
or
@var{shortname} @var{value}
where @var{longname} is a string of alphanumeric characters
(isalnum() test), @var{shortname} - a single non-alphanumeric character
(ispunct() test), and @var{value} is an arbitrary string of non-blank characters.
@end table
Only two fields are mandatory: @var{type} and @var{form}. All other fields
may be absent. When only one number precedes the
@var{type} field, it is interpreted as the @var{start} position.
If the @var{length} field is omitted, the length of the segment is the
length of the @var{form} field, except when the value of the
@var{form} field is @code{*} -- in this case, the length is assumed to
be 0.
If the @var{start} field is also absent, the segment is assumed to directly
follow the preceding one.
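The defaulting rules above can be sketched in a few lines of Python. This is
only an illustration, not part of UTT: the names @code{parse_segment},
@code{form_length} and @code{previous_end} are hypothetical, annotations are
kept as raw strings, and counting each escape sequence as one character is an
assumption.

@example
```python
# Illustrative sketch only; not part of the UTT distribution.

def form_length(form):
    """Count characters, treating each backslash escape as one character."""
    count = 0
    i = 0
    while i < len(form):
        i += 2 if form[i] == "\\" and i + 1 < len(form) else 1
        count += 1
    return count

def parse_segment(line, previous_end=0):
    """Return (start, length, type, form, annotations) for one segment line."""
    fields = line.split()
    numbers = []
    # At most two leading all-digit fields are read as start and length.
    while fields and len(numbers) < 2 and fields[0].isdecimal():
        numbers.append(int(fields.pop(0)))
    if len(fields) < 2:
        raise ValueError("type and form are mandatory: %r" % line)
    seg_type, form = fields[0], fields[1]
    annotations = fields[2:]
    # A single number is the start; no number means the segment
    # directly follows the preceding one.
    start = numbers[0] if numbers else previous_end
    if len(numbers) == 2:
        length = numbers[1]
    elif form == "*":
        length = 0
    else:
        length = form_length(form)
    return start, length, seg_type, form, annotations

print(parse_segment("0014 W progrumy cor:programy lem:program,N"))
```
@end example

A caller that reads a whole file would pass the end position of the previous
segment (its start plus its length) as @code{previous_end}.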
@c Conventions:
@c Annotation fields with predefined meaning:
@c @itemize
@c @item @code{!} - UTT components are allowed to modify the contents of
@c the @var{form} field (e.g. spelling correction does this). If this happens the
@c original form of the segment have to be placed in the @code{!}-field.
@c @item @code{@@} - morphological description
@c @item @code{=} - node identifier assignment (used in graph encoding)
@c @item @code{<} - preceding/dominating node(s) (used in graph encoding)
@c @item @code{>} - succeeding/subordinate node(s) (used in graph encoding)
@c @end itemize
Segments of length 0 may be used to mark file positions with some
information. See e.g. BOS and EOS (beginning/end of sentence) markers
in the example below.
Example:
text: @samp{Piszemy dobre progrumy. Warszawiacy też.}
@example
0000 00 BOS *
0000 07 W Piszemy lem:pisać,V
0007 01 S _
0008 05 W dobre lem:dobry,ADJ
0013 01 S _
0014 08 W progrumy cor:programy lem:program,N
0022 01 P .
0023 00 EOS *
0023 01 S _
0024 00 BOS *
0024 11 W Warszawiacy lem:Warszawiak,N
0035 01 S _
0036 03 W też
0039 01 P .
0040 00 EOS *
@end example
The @var{length} field may be omitted (first sentence only):
@example
0000 BOS *
0000 W Piszemy lem:pisać,V
0007 S _
0008 W dobre lem:dobry,ADJ
0013 S _
0014 W progrumy cor:programy lem:program,N
0022 P .
0023 EOS *
@end example
Position information may be provided only for some types of segments:
@example
0000 BOS *
W Piszemy lem:pisać,V
S _
W dobre lem:dobry,ADJ
S _
W progrumy cor:programy lem:program,N
P .
EOS *
S _
0024 BOS *
W Warszawiacy lem:Warszawiak,N
S _
W też
P .
EOS *
@end example
Position/length information may be provided only when necessary:
@example
0000 04 N *
0000 N 12
P .
N 5
S _
W km
@end example
@section UTT File
A UTT file consists of a sequence of segments. The same text position
may be covered by multiple segments. In consequence, ambiguous text
segmentation and ambiguous annotation may be represented.
There are two structural requirements a valid UTT-formatted file
has to meet:
@itemize @bullet
@item
segments have to be sorted with respect to the @var{start} field,
@item
for each
segment ending at position @var{n}, either there must be a segment starting at
position @var{n+1}, or position @var{n+1} is not covered by any segment; similarly
for each segment starting at position @var{n}, either there must be a segment
ending at position @var{n-1}, or the position @var{n-1} must not be covered
by any segment.
@end itemize
A valid annotation for the text fragment
@example
12.5 km
@end example
may be
@example
0000 02 N 12
0000 04 N 12.5
0002 01 P .
0003 01 N 5
0004 01 S _
0005 02 W km
@end example
but not
@example
0000 02 N 12
0000 04 N 12.5
0004 01 S _
0005 02 W km
@end example
because in the latter example the first segment (starting at position
0000, 2 characters long) ends at position @var{n}=0001, position
@var{n+1}=0002 is covered by the second segment, and no segment starts
at position 0002.
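The two structural requirements can be checked mechanically. The following
Python sketch is an illustration only: the function name @code{check_utt} and
the representation of a segment as a (start, length) pair are assumptions, and
zero-length segments are simply ignored by the boundary test.

@example
```python
# Illustrative sketch only; not part of the UTT distribution.

def check_utt(segments):
    """segments: list of (start, length) pairs in file order.
    Return a list of problem descriptions (empty if none were found)."""
    problems = []
    starts = [s for s, _ in segments]
    if starts != sorted(starts):
        problems.append("segments are not sorted by start position")
    # First and last covered position of every non-empty segment.
    spans = [(s, s + n - 1) for s, n in segments if n > 0]
    first = set(s for s, _ in spans)
    last = set(e for _, e in spans)
    covered = set()
    for s, e in spans:
        covered.update(range(s, e + 1))
    for s, e in spans:
        if e + 1 in covered and e + 1 not in first:
            problems.append("segment ending at %d: no segment starts at %d"
                            % (e, e + 1))
        if s - 1 in covered and s - 1 not in last:
            problems.append("segment starting at %d: no segment ends at %d"
                            % (s, s - 1))
    return problems

valid = [(0, 2), (0, 4), (2, 1), (3, 1), (4, 1), (5, 2)]
invalid = [(0, 2), (0, 4), (4, 1), (5, 2)]
print(check_utt(valid))
print(check_utt(invalid))
```
@end example

The two lists correspond to the valid and the invalid annotation of
@samp{12.5 km} shown above.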
@section Flattened UTT file
The UTT file format has two variants: regular and flattened. The regular
format was described above. In the flattened format some of the
end-of-line characters are replaced with line-feed characters.
The flattened format is mainly used to represent whole sentences as
single lines of the input file (all intrasentential end-of-line
characters are replaced with line-feed characters).
This technique makes it possible to perform certain text
processing operations on entire sentences with tools such as
@command{grep} (see the @command{grp} component) or @command{sed} (see the @command{mar} component).
The conversion between the two formats is performed by the tools:
@command{fla} and @command{unfla}.
@section Character encoding
The UTT component programs accept only 1-byte character encoding, such
as ISO, ANSI, DOS.
@c @section Formats
@c @unnumberedsubsubsec Basic format
@c While processing large amounts of the overhead related with explicit
@c ... of the start position and segment length becomes ... . Therefore,
@c for efficiency reasons certain shortcuts are possible:
@c @unnumberedsubsubsec Relative start position
@c Start position may be given as relative distance from the last
@c absolut position.
@c @unnumberedsubsubsec Absent length
@c Segment length may by omitted. Normally it can be restored by counting
@c the length of the @emph{form field}. For segments with the special value
@c @code{*} in the @emph{form field} length 0 is assumed.
@c @unnumberedsubsubsec Absent length and start position
@c Both start position and segment length may be omitted. In this format
@c each segment is assumed to follow the previous one. This format is,
@c therefore, suitable only for unambiguously tagged text
@c (0-length markers can be still used.)
@c @table @code
@c @item AL
@c @code{1234 03 W kot}
@c @item RL
@c @code{+56 03 W kot}
@c @item A
@c @code{1234 W kot}
@c @item R
@c @code{+56 W kot}
@c @item 0
@c @code{W kot}
@c @end table
@c [HOW TO GET POLISH FONTS IN DVI???]
@macro parhelp
@item @b{@minus{}@minus{}help}, @b{@minus{}h}
Print help.
@end macro
@macro parversion
@item @b{@minus{}@minus{}version}, @b{@minus{}V}
Print version information.
@end macro
@macro parinteractive
@item @b{@minus{}@minus{}interactive, @minus{}i}
This option toggles interactive mode, which is by default off. In the
interactive mode the program does not buffer the output.
@end macro
@c @macro parfile
@c @item @b{@minus{}@minus{}file=@var{filename}, @minus{}f @var{filename}}
@c Input file name.
@c If this option is absent or equal to '@minus{}', the program
@c reads from the standard input.
@c @end macro
@c @macro paroutput
@c @item @b{@minus{}@minus{}output=@var{filename}, @minus{}o @var{filename}}
@c Regular output file name. To regular output the program sends segments
@c which it successfully processed and copies those which were not
@c subject to processing. If this option is absent or equal to
@c '@minus{}', standard output is used.
@c @end macro
@c @macro parfail
@c @item @b{@minus{}@minus{}fail=@var{filename}, @minus{}e @var{filename}}
@c Fail output file name. To fail output the program copies the segments
@c it failed to process. If this option is absent or equal to
@c '@minus{}', standard output is used.
@c @end macro
@c @macro parcopy
@c @item @b{@minus{}@minus{}copy, @minus{}c}
@c Copy succesfully processed segments to regular output also in their
@c original input form.
@c @end macro
@macro parinputfield
@item @b{@minus{}@minus{}input-field=@var{fieldname}, @minus{}I @var{fieldname}}
The field containing the input to the program. The default is the
@var{form} field. The fields @var{start}, @var{length}, @var{type},
and @var{form} are referred to as @code{1}, @code{2}, @code{3},
@code{4}, respectively.
@end macro
@macro paroutputfield
@item @b{@minus{}@minus{}output-field=@var{fieldname}, @minus{}O @var{fieldname}}
The name of the field added by the program. The default is the name of the program.
@end macro
@macro pardictionary
@item @b{@minus{}@minus{}dictionary=@var{filename}, @minus{}d @var{filename}}
Dictionary file name.
@end macro
@macro parprocess
@item @b{@minus{}@minus{}process=@var{type}, @minus{}p @var{type}}
Process segments with the specified value in the @var{type} field.
Multiple occurrences of this option are allowed and are interpreted as
disjunction. If this option is absent, all segments are processed.
@end macro
@macro parselect
@item @b{@minus{}@minus{}select=@var{fieldname}, @minus{}s @var{fieldname}}
Select for processing only segments in which the field named
@var{fieldname} is present. Multiple occurrences of this option are
allowed and are interpreted as a conjunction of conditions. If this
option is absent, all segments are processed.
@end macro
@macro parunselect
@item @b{@minus{}@minus{}unselect=@var{fieldname}, @minus{}S @var{fieldname}}
Select for processing only segments in which the field @var{fieldname}
is absent. Multiple occurrences of this option are allowed and are
interpreted as a conjunction of conditions. If this option is absent,
all segments are processed.
@end macro
@macro paroneline
@item @b{@minus{}@minus{}one-line}
This option makes the program print ambiguous annotation in one output
line by generating multiple annotation fields. By default, when
ambiguous annotation is produced for a segment, the segment is
duplicated and each of the annotations is added to a separate copy of
the segment.
@end macro
@macro paronefield
@item @b{@minus{}@minus{}one-field, @minus{}1}
This option makes the program print ambiguous annotation in one
annotation field. By default, when ambiguous annotation is produced
for a segment, the segment is duplicated and each of the
annotations is added to a separate copy of the segment.
This option is useful when working with @command{kot} or @command{con}.
@end macro
@c ---------------------------------------------------------------------
@c CONFIGURATION FILES
@c ---------------------------------------------------------------------
@node Configuration files
@chapter Configuration files
Values for all command line options accepted by a component
may be set in configuration files. The default locations of the
configuration files for a component named @command{@var{program}} are
@example
@file{/usr/local/etc/utt/@var{program}.conf}
@end example
for system-wide configuration file and
@example
@file{~/.utt/@var{program}.conf}
@end example
for user configuration file.
@c The configuration file to load may be also specified with the
@c @option{--config} option. Configuration file need not be provided.
For each option, the value is set according to the following priority:
@itemize
@item command line
@c @item configuration file indicated with @option{--config} option
@item user configuration file (or configuration file indicated with the @option{--config} option)
@item system-wide configuration file
@end itemize
Parameter values are specified in the following format:
@var{parametername}=@var{value}
where @var{parametername} is the short or long name of an option accepted by
the program, or
@var{parametername}
if the option does not need arguments.
You can introduce comments to configuration files using the # sign.
If a program accepts multiple occurrences of an option (e.g. the select option of @command{lem}), you can specify them on separate lines of the program's configuration file.
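The file format and the priority order described above can be illustrated
with a short Python sketch. It is not the code used by the UTT components:
the function names @code{parse_conf} and @code{effective} and the file names
in the sample data are hypothetical, and two details are assumptions, namely
that a # sign starts a comment anywhere on a line and that the values of a
repeated option are taken from the highest-priority source only.

@example
```python
# Illustrative sketch only; not part of the UTT distribution.

def parse_conf(text):
    """Return a list of (name, value) pairs; value is None for flags."""
    settings = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        name, sep, value = line.partition("=")
        settings.append((name.strip(), value.strip() if sep else None))
    return settings

def effective(option, command_line, user_conf, system_conf):
    """Values of one option, taken from the highest-priority source."""
    for source in (command_line, user_conf, system_conf):
        values = [v for n, v in source if n == option]
        if values:
            return values
    return []

system_conf = parse_conf("dictionary=system.bin\n")
user_conf = parse_conf("# user settings\nselect=cor\nselect=lem\none-line\n")
print(effective("select", [], user_conf, system_conf))
print(effective("dictionary", [("dictionary", "my.dic")],
                user_conf, system_conf))
```
@end example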
@c The equal sign may be omitted.
@quotation Tip
If you have two (or more) frequently used sets of options for the same
program (e.g. lem with the PMDBF dictionary and lem with a user dictionary),
a good solution is to create two soft links to lem, called
e.g. lemg and lemu, and specify their configuration in the files lemg.conf
and lemu.conf, respectively.
@end quotation
@c ---------------------------------------------------------------------
@c COMPONENTS
@c ---------------------------------------------------------------------
@node UTT components
@chapter UTT components
UTT components are of three types:
@menu
Sources: programs which read non-UTT data (e.g. raw text) and produce output
in UTT format
* tok:: a tokenizer
Filters: programs which read and produce UTT-formatted data
* lem:: a morphological analyzer
* gue:: a morphological guesser
* cor:: a simple spelling corrector
* kor:: a more elaborate spelling corrector
* sen:: a sentencizer
* ser:: a pattern search tool (marks matches)
* mar:: a pattern search tool (introduces arbitrary markers into the text)
* grp:: a pattern search tool (selects sentences containing a match)
@c * gph:: a word-graph annotation tool::
@c * dgp:: a dependency parser
Sinks: programs which read UTT data and produce output in another format
* kot:: an untokenizer
* con:: a concordance table generator
@end menu
@c ---------------------------------------------------------------------
@c TOK
@c ---------------------------------------------------------------------
@page
@node tok
@section tok - a tokenizer
@c ----------------------------------------
@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
@item @strong{Authors:} @tab Tomasz Obrębski
@item @strong{Component category:} @tab source
@item @strong{Input format:} @tab raw text file
@item @strong{Output format:} @tab UTT regular
@item @strong{Required annotation:} @tab -
@end multitable
@menu
* tok description::
* tok input::
* tok output::
* tok command line options::
* tok example::
@end menu
@node tok description
@subsection Description
@code{tok} is a simple program which reads a text file and identifies
tokens on the basis of their orthographic form. The type of the token
is printed as the @var{type} field.
@node tok input
@subsection Input
Raw text.
@node tok output
@subsection Output
A UTT file with four fields: @var{start}, @var{length}, @var{type}, and @var{form}. In the @var{type} field five types of tokens are distinguished:
@itemize
@item @code{W}
(word)
- continuous sequence of letters
@item @code{N}
(number)
- continuous sequence of digits
@item @code{S}
(space)
- continuous sequence of space characters
@item @code{P}
(punctuation mark)
- single printable character not belonging to any of the other classes
@item @code{B}
(unprintable character)
- single unprintable character
@end itemize
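As an illustration of the five token classes, the following Python sketch
classifies characters and prints segments in the regular UTT format. It is
not the source of @command{tok}: the function names are hypothetical, Python
character tests stand in for whatever tests the real tokenizer uses, and
positions are counted in characters.

@example
```python
# Illustrative sketch only; not the actual tok implementation.

def classify(ch):
    """Return the token type of a single character."""
    if ch.isalpha():
        return "W"
    if ch.isdigit():
        return "N"
    if ch.isspace():
        return "S"
    if ch.isprintable():
        return "P"
    return "B"

# Special characters and escape sequences of the form field.
ESCAPES = dict([(" ", "_"), ("\n", "\\n"), ("\t", "\\t"), ("\r", "\\r"),
                ("_", "\\_"), ("*", "\\*"), ("\\", "\\\\")])

def encode_form(text):
    return "".join(ESCAPES.get(ch, ch) for ch in text)

def tokenize(text):
    """Return a list of UTT segment lines for the given raw text."""
    segments = []
    start = 0
    while start < len(text):
        kind = classify(text[start])
        end = start + 1
        # W, N and S tokens are runs; P and B tokens are single characters.
        if kind in ("W", "N", "S"):
            while end < len(text) and classify(text[end]) == kind:
                end += 1
        segments.append("%04d %02d %s %s"
                        % (start, end - start, kind,
                           encode_form(text[start:end])))
        start = end
    return segments

for line in tokenize("Piszemy dobre programy.\n"):
    print(line)
```
@end example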
@node tok command line options
@subsection Command line options
@table @code
@item @b{@minus{}@minus{}help}, @b{@minus{}h}
Print help.
@item @b{@minus{}@minus{}version}, @b{@minus{}V}
Print version information.
@item @b{@minus{}@minus{}interactive, @minus{}i}
This option toggles interactive mode, which is by default off. In the
interactive mode the program does not buffer the output.
@end table
@node tok example
@subsection Example
Input:
@example
Piszemy dobre programy.
@end example
Output:
@example
0000 07 W Piszemy
0007 01 S _
0008 05 W dobre
0013 01 S _
0014 08 W programy
0022 01 P .
0023 01 S \n
@end example
@c ---------------------------------------------------------------------
@c SEN
@c ---------------------------------------------------------------------
@c @node sen - sentencizer
@c @chapter sen - sentencizer
@c Authors: Tomasz Obrębski
@c ---------------------------------------------------------------------
@c LEM
@c ---------------------------------------------------------------------
@page
@node lem
@section lem - morphological analyzer
@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
@item @strong{Authors:} @tab Tomasz Obrębski, Michał Stolarski
@item @strong{Component category:} @tab filter
@item @strong{Input format:} @tab UTT regular
@item @strong{Output format:} @tab UTT regular
@item @strong{Required annotation:} @tab tok
@end multitable
@menu
* lem description::
* lem command line options::
* lem input::
* lem output::
* lem example::
* lem dictionaries::
* lem hints::
@end menu
@node lem description
@subsection Description
@command{lem} performs morphological analysis of a simple orthographic
word, returning all its possible morphological annotations,
disregarding the context.
@c ----------------------------------------
@node lem command line options
@subsection Command line options
@table @code
@parhelp
@parversion
@parinteractive
@c @parfile
@c @paroutput
@c @parfail
@c @parcopy
@parinputfield
@paroutputfield
@pardictionary
@parprocess
@parselect
@parunselect
@paroneline
@paronefield
@end table
@c ----------------------------------------
@node lem input
@subsection Input
@command{lem} reads a UTT file and processes the value of the @var{form} field
(the input field may be changed with @option{--input-field} option).
@node lem output
@subsection Output
@command{lem} adds a new annotation field, whose default name is @code{lem}. In
case of ambiguity, either the segment is duplicated (default),
multiple @code{lem} fields are added (@option{--one-line}), or the ambiguous
annotation is produced as the value of a single @code{lem} field
(@option{--one-field}, @option{-1}):
@itemize @bullet
@item
unambiguous value format:
@example
<lemma>,<descr>
@end example
@item
ambiguous value format (@option{--one-field} option)
@example
<lemma>,<descr>[,<descr>][;<lemma>,<descr>[,<descr>]]
@end example
(alternative descriptions for the same lemma are separated by commas,
alternative lemmata are separated by semicolons.)
@end itemize
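A program consuming @option{--one-field} output has to split the value back
into readings. The Python sketch below shows one way to do this; the function
name @code{parse_one_field} is hypothetical, and the sketch assumes that
lemmata and descriptions contain neither commas nor semicolons.

@example
```python
# Illustrative sketch only; not part of the UTT distribution.

def parse_one_field(value):
    """Split 'lemma,descr[,descr][;lemma,descr...]' into a list of
    (lemma, list of descriptions) pairs."""
    readings = []
    for alternative in value.split(";"):
        lemma, *descriptions = alternative.split(",")
        readings.append((lemma, descriptions))
    return readings

print(parse_one_field("dobry,ADJ/DpNpCnavGaifn,ADJ/DpNsCnavGn"))
```
@end example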
@node lem example
@subsection Example
Input:
@example
0000 07 W Piszemy
0007 01 S _
0008 05 W dobre
0013 01 S _
0014 08 W programy
0022 01 P .
0023 01 S \n
@end example
Output (default):
@example
0000 07 W Piszemy lem:pisać,V/AiVpMdTrfNpP1
0007 01 S _
0008 05 W dobre lem:dobry,ADJ/DpNpCnavGaifn
0008 05 W dobre lem:dobry,ADJ/DpNsCnavGn
0013 01 S _
0014 08 W programy lem:program,N/GiNpCa
0014 08 W programy lem:program,N/GiNpCn
0014 08 W programy lem:program,N/GiNpCv
0022 01 P .
0023 01 S \n
@end example
Output (@option{--one-line} option):
@example
0000 07 W Piszemy lem:pisać,V/AiVpMdTrfNpP1
0007 01 S _
0008 05 W dobre lem:dobry,ADJ/DpNpCnavGaifn lem:dobry,ADJ/DpNsCnavGn
0013 01 S _
0014 08 W programy lem:program,N/GiNpCa lem:program,N/GiNpCn lem:program,N/GiNpCv
0022 01 P .
0023 01 S \n
@end example
Output (@option{--one-field} option):
@example
0000 07 W Piszemy lem:pisać,V/AiVpMdTrfNpP1
0007 01 S _
0008 05 W dobre lem:dobry,ADJ/DpNpCnavGaifn,ADJ/DpNsCnavGn
0013 01 S _
0014 08 W programy lem:program,N/GiNpCa,N/GiNpCn,N/GiNpCv
0022 01 P .
0023 01 S \n
@end example
@c ----------------------------------------
@node lem dictionaries
@subsection Dictionaries
@command{lem} requires a dictionary. The dictionary may be provided in
one of two formats: in text (source) format or in binary (fsa) format.
@subsubheading Text format
Dictionary entries have the following structure:
@example
<form>;<lemma>,<descr>[;<lemma>,<descr>]
@end example
@var{lemma} may be given explicitly or in the cut-add format:
@example
@code{[<cut1><add1>-]<cut2><add2>}
@end example
meaning: replace prefix of length @code{<cut1>} with
string @code{<add1>}, replace suffix of length @code{<cut2>} with string
@code{<add2>}. For example, @code{3t} transforms @samp{kocie} into
@samp{kot}, and @code{3-4ały} transforms @samp{najbielsi} into @samp{biały}.
Each dictionary entry must be written on one line and must not contain blank characters.
Examples:
@example
kot;0,N/GaNsCn
kota;1,N/GaNsCg;1,N/GaNsCa
kotu;1,N/GaNsCd
kotem;2,N/GaNsCi
kocie;3t,N/GaNsCl;3t,N/GaNsCv
najbielsi;3-4ały,ADJ/DsNpCnGp
najbielsze;3-5ały,ADJ/DsNpCnGaifn
najlepsi;dobry,ADJ/DsNpCnGp
najlepsze;dobry,ADJ/DsNpCnGaifn
@end example
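The cut-add notation used in the entries above can be decoded as in the
following Python sketch. The function name @code{apply_cut_add} is
hypothetical, and treating every value that starts with a digit as cut-add
notation (and anything else as an explicit lemma) is an assumption made for
this illustration.

@example
```python
# Illustrative sketch only; not part of the UTT distribution.
import re

# Optional "<cut1><add1>-" part followed by "<cut2><add2>".
CUT_ADD = re.compile(r"^(?:(\d+)([^-\d]*)-)?(\d+)(.*)$")

def apply_cut_add(form, code):
    """Return the lemma of form; code is explicit or in cut-add notation."""
    match = CUT_ADD.match(code)
    if match is None:
        return code  # the lemma is given explicitly
    cut1, add1, cut2, add2 = match.groups()
    if cut1 is not None:
        # Replace a prefix of length cut1 with add1.
        form = add1 + form[int(cut1):]
    # Replace a suffix of length cut2 with add2.
    return form[:len(form) - int(cut2)] + add2

print(apply_cut_add("kocie", "3t"))
print(apply_cut_add("najbielsi", "3-4ały"))
print(apply_cut_add("najlepsi", "dobry"))
```
@end example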
The mandatory file name extension for a text dictionary is @code{dic}. For large
dictionaries it is preferable, however, to compile them into binary
(fsa) format.
@subsubheading Binary format
The mandatory file name extension for a binary dictionary is @code{bin}. To
compile a text dictionary into binary format, write:
@example
compdic <dictionaryname>.dic <dictionaryname>.bin
@end example
@subsubheading Polex/PMDBF dictionary
A large-coverage morphological dictionary for Polish, Polex/PMDBF, is included in
the distribution as the default dictionary of @command{lem}. It is
located by default in:
@file{$HOME/.local/share/utt/pl_PL.ISO-8859-2/lem.bin}
in local installation or in
@file{/usr/local/share/utt/pl_PL.ISO-8859-2/lem.bin}
in system installation.
@node lem hints
@subsection Hints
@subsubheading Combining data from multiple dictionaries
@itemize
@item Apply <dict1>, then apply <dict2> to words which were not annotated.
@example
lem -d <dict1> | lem -S lem -d <dict2>
@end example
@item Add annotations from two dictionaries <dict1> and <dict2>.
@example
lem -c -d <dict1> | lem -S lem -d <dict2>
@end example
@end itemize
@c ---------------------------------------------------------------------
@c GUE
@c ---------------------------------------------------------------------
@page
@node gue
@section gue - morphological guesser
@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
@item @strong{Authors:} @tab Michał Stolarski, Tomasz Obrębski
@item @strong{Component category:} @tab filter
@end multitable
@menu
* gue description::
* gue command line options::
* gue example::
* gue dictionaries::
@end menu
@node gue description
@subsection Description
@command{gue} guesses morphological descriptions of the form contained
in the @var{form} field.
@node gue command line options
@subsection Command line options
@table @code
@parhelp
@parversion
@parinteractive
@c @parfile
@c @paroutput
@c @parfail
@c @parcopy
@parinputfield
@paroutputfield
@pardictionary
@parprocess
@parselect
@parunselect
@paroneline
@paronefield
@item @b{@minus{}@minus{}delta=@var{n}}
Stop displaying answers after a drop in weight, that is, when the weight difference between two consecutive results is greater than the delta value (default=`0.2').
@item @b{@minus{}@minus{}cut-off=@var{n}}
Do not display answers whose weight is lower than the cut-off value (default=`200').
@item @b{@minus{}@minus{}guess_count=@var{n}, @minus{}n @var{n}}
Guess up to n descriptions (default=`0', which means 'display all results').
@end table
@node gue example
@subsection Example
@example
command: gue -n 2
input:
0000 07 W smerfny
output:
0000 07 W smerfny gue:,ADJ/CaDpGiNs
0000 07 W smerfny gue:,ADJ/CnvDpGaipNs
@end example
@node gue dictionaries
@subsection Dictionaries
@command{gue} requires a dictionary. For now, the dictionary must be provided in binary (fsa) format.
The fsa format is created by compiling text-format dictionaries.
@subsubheading Text format
Dictionary entries have the following structure:
@example
@var{prefix}@code{*}@var{suffix}@code{;}@var{lemma}@code{,}@var{description}@code{:}@var{weight}
@end example
@var{lemma} must be given in the cut-add format:
@example
@code{[<cut1><add1>-]<cut2><add2>}
@end example
(no spaces in between): replace prefix of length @var{cut1} with
string @var{add1}, replace suffix of length @var{cut2} with string
@var{add2}.
Example: @code{3-4ały} transforms @i{najbielsi} into @i{biały}.
@var{description} contains the part of speech and morphosyntactic information (@pxref{PMDBF dictionary}).
@var{weight} is an integer value between 1 and 999 indicating the
likelihood of the guess.
@c @example
@c *łkę;1a,N/GfNsCa
@c naj*elszy;3-4ały,ADJ/...:...
@c @end example
@c ---------------------------------------------------------------------
@c COR
@c ---------------------------------------------------------------------
@page
@node cor
@section cor - spelling corrector
@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
@item @strong{Authors:} @tab Tomasz Obrębski, Michał Stolarski
@item @strong{Component category:} @tab filter
@item @strong{Input format:} @tab UTT regular
@item @strong{Output format:} @tab UTT regular
@item @strong{Required annotation:} @tab tok
@end multitable
@menu
* cor description::
* cor command line options::
* cor dictionaries::
@end menu
@node cor description
@subsection Description
The spelling corrector applies Kemal Oflazer's dynamic programming
algorithm @cite{oflazer96} to the FSA representation of the set of
word forms of the Polex/PMDBF dictionary. Given an incorrect
word form, it returns all word forms present in the dictionary whose
edit distance from that form does not exceed the threshold set with the
@option{--distance} option.
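The notion of edit distance can be illustrated with the brute-force Python
sketch below. It is not the algorithm used by @command{cor}, which searches
an FSA rather than comparing the input with every dictionary word; it uses
the plain Levenshtein distance (insertion, deletion, substitution) and treats
the threshold as the maximum allowed distance, both of which are assumptions.
The function names are hypothetical.

@example
```python
# Illustrative sketch only; cor itself works on an FSA dictionary.

def edit_distance(a, b):
    """Plain Levenshtein distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(previous[j] + 1,               # deletion
                               current[j - 1] + 1,            # insertion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]

def corrections(word, dictionary, max_distance=1):
    """Dictionary words within max_distance edits of word."""
    return [w for w in dictionary if edit_distance(word, w) <= max_distance]

print(corrections("odlut", ["odlot", "odlotowy", "odludek"]))
```
@end example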
@node cor command line options
@subsection Command line options
@table @code
@parhelp
@parversion
@parinteractive
@c @parfile
@c @paroutput
@c @parfail
@c @parcopy
@parinputfield
@paroutputfield
@pardictionary
@parprocess
@parselect
@parunselect
@paroneline
@paronefield
@item @b{@minus{}@minus{}distance=@var{int}, @minus{}n @var{int}}
Maximum edit distance (default='1').
@c @item @b{@minus{}@minus{}replace, @minus{}r}
@c Replace original form with corrected form, place original form in the
@c cor field. This option has no effect in @option{--one-*} modes (default=off)
@end table
@node cor dictionaries
@subsection Dictionaries
@command{cor} requires a dictionary. The dictionary has to be provided in binary (fsa) format.
The fsa format is created by compiling text-format dictionaries.
@subsubheading Text format
The @command{cor} dictionary is a list of words:
@example
odlot
odlotowy
odludek
@end example
@subsubheading Binary format
The mandatory file name extension for a binary dictionary is @code{bin}. To
compile a text dictionary into binary format, write:
@example
compdic <dictionaryname>.dic <dictionaryname>.bin
@end example
@c ---------------------------------------------------------------------
@c KOR
@c ---------------------------------------------------------------------
@page
@node kor
@section kor - configurable spelling corrector
@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
@item @strong{Authors:} @tab Paweł Werenski, Tomasz Obrębski, Michał Stolarski
@item @strong{Component category:} @tab filter
@item @strong{Input format:} @tab UTT regular
@item @strong{Output format:} @tab UTT regular
@item @strong{Required annotation:} @tab tok
@end multitable
@menu
* kor description::
* kor command line options::
* kor weights definition file::
* kor dictionaries::
@end menu
@node kor description
@subsection Description
The spelling corrector applies Paweł Werenski's dynamic programming
algorithm to the FSA representation of the set of word forms of the
Polex/PMDBF dictionary. The algorithm is an extension of K. Oflazer's
algorithm used by @command{cor}. In the extended version it is
possible to assign weights to individual edit operations.
Given an incorrect word form, it returns all word forms
present in the dictionary whose edit distance from that form does not
exceed the threshold set with the @option{--distance} option.
@node kor command line options
@subsection Command line options
@table @code
@parhelp
@parversion
@parinteractive
@c @parfile
@c @paroutput
@c @parfail
@c @parcopy
@parinputfield
@paroutputfield
@pardictionary
@parprocess
@parselect
@parunselect
@paroneline
@paronefield
@item @b{@minus{}@minus{}distance=@var{int}, @minus{}n @var{int}}
Maximum edit distance (default='1').
@item @b{@minus{}@minus{}weights=@var{filename}, @minus{}w @var{filename}}
Edit operations' weights file.
@c @item @b{@minus{}@minus{}replace, @minus{}r}
@c Replace original form with corrected form, place original form in the
@c cor field. This option has no effect in @option{--one-*} modes (default=off)
@end table
@node kor weights definition file
@subsection Weights definition file
Example:
@example
%stdcor 1
%xchg 1
ż rz 0.5
ch h 0.5
u ó 0.5
@end example
The default weight is set to 1 (@code{%stdcor 1}), the weight of the
exchange operation is set to 1 (@code{%xchg 1}), and the three principal
orthographic errors are assigned the weight 0.5.
The edit operation weight declaration, such as
@example
ż rz 0.5
@end example
works in both directions, i.e. ż->rz and rz->ż.
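The effect of such a weights file on the edit distance can be sketched in Python. This is only an illustration, not the FSA-based algorithm used by @command{kor}: it compares two given words with a plain dynamic programming table. The function names @code{parse_weights} and @code{weighted_distance} are hypothetical, and treating @code{%xchg} as the cost of swapping two adjacent letters is an assumption.

```python
def parse_weights(text):
    """Read %stdcor, %xchg and the replacement rules from a weights definition."""
    std, xchg, rules = 1.0, 1.0, []
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "%stdcor":
            std = float(parts[1])
        elif parts[0] == "%xchg":
            xchg = float(parts[1])
        elif len(parts) == 3:
            a, b, w = parts[0], parts[1], float(parts[2])
            rules.append((a, b, w))
            rules.append((b, a, w))  # a declaration works in both directions
    return std, xchg, rules


def weighted_distance(src, dst, std, xchg, rules):
    """Weighted edit distance between two words (hypothetical helper)."""
    n, m = len(src), len(dst)
    inf = float("inf")
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0

    def relax(i, j, cost):
        # Keep the cheapest known way of reaching cell (i, j).
        if i <= n and j <= m and cost < d[i][j]:
            d[i][j] = cost

    for i in range(n + 1):
        for j in range(m + 1):
            cur = d[i][j]
            relax(i + 1, j, cur + std)  # delete a letter of src
            relax(i, j + 1, cur + std)  # insert a letter of dst
            if i < n and j < m:
                # copy a matching letter for free or substitute at default cost
                relax(i + 1, j + 1, cur + (0.0 if src[i] == dst[j] else std))
            if (i + 1 < n and j + 1 < m
                    and src[i] == dst[j + 1] and src[i + 1] == dst[j]):
                relax(i + 2, j + 2, cur + xchg)  # swap two adjacent letters
            for a, b, w in rules:
                if src.startswith(a, i) and dst.startswith(b, j):
                    relax(i + len(a), j + len(b), cur + w)  # declared replacement
    return d[n][m]


definition = "%stdcor 1\n%xchg 1\nż rz 0.5\nch h 0.5\nu ó 0.5\n"
std, xchg, rules = parse_weights(definition)
print(weighted_distance("morze", "może", std, xchg, rules))
```

With the example weights, a misspelling that differs only by one of the declared replacements is at distance 0.5 from the correct form, so it stays below a threshold of 1, while an arbitrary one-letter substitution does not.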
The default weights definition file for @code{kor} is:
@example
$HOME/.local/share/utt/weights.kor
@end example
or, if the above mentioned file is absent:
@example
/usr/local/share/utt/weights.kor
@end example
@node kor dictionaries
@subsection Dictionaries
See @ref{cor dictionaries}.
@c ---------------------------------------------------------------------
@c SEN
@c ---------------------------------------------------------------------
@page
@node sen
@section sen - a sentencizer
@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
@item @strong{Authors:} @tab Tomasz Obrębski
@item @strong{Component category:} @tab filter
@item @strong{Input format:} @tab UTT regular
@item @strong{Output format:} @tab UTT regular
@item @strong{Required annotation:} @tab tok
@end multitable
@menu
* sen description::
@c * sen input::
@c * sen output::
* sen example::
@end menu
@node sen description
@subsection Description
@command{sen} detects sentence boundaries in UTT-formatted texts and marks them with special zero-length segments, in which the @var{type} field may contain the BOS (beginning of sentence) or EOS (end of sentence) annotation.
@node sen example
@subsection Example
@example
command: sen
input:
0000 05 W Cześć
0005 01 P !
0006 01 S _
0007 02 W To
0009 01 S _
0010 02 W ja
0012 01 P .
0013 01 S \n
output:
0000 00 BOS *
0000 05 W Cześć
0005 01 P !
0006 00 EOS *
0006 00 BOS *
0006 01 S _
0007 02 W To
0009 01 S _
0010 02 W ja
0012 01 P .
0013 01 S \n
0014 00 EOS *
@end example
@c ---------------------------------------------------------------------
@c GPH
@c ---------------------------------------------------------------------
@c @node gph - graphizer
@c @chapter gph - graphizer
@c Authors: Tomasz Obrębski
@c ---------------------------------------------------------------------
@c SER
@c ---------------------------------------------------------------------
@page
@node ser
@section ser - pattern search tool
@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
@item @strong{Authors:} @tab Tomasz Obrębski
@item @strong{Component category:} @tab filter
@item @strong{Input format:} @tab UTT regular
@item @strong{Output format:} @tab UTT regular
@item @strong{Required annotation:} @tab tok, lem --one-field
@end multitable
@menu
* ser description::
* ser command line options::
* ser pattern::
* ser how ser works::
* ser customization::
* ser limitations::
* ser requirements::
@end menu
@node ser description
@subsection Description
@command{ser} looks for patterns in UTT-formatted texts.
@c ---------------------------------------------------------------------
@node ser command line options
@subsection Command line options
@table @code
@parhelp
@parversion
@c @parfile
@c @paroutput
@c @parinputfield
@c @paroutputfield
@parprocess
@parinteractive
@item @b{@minus{}@minus{}pattern=@var{pattern}, @minus{}e @var{pattern}}
The search pattern.
@item @b{@minus{}@minus{}morph=@var{field}}
The name of the annotation field containing the morphological
description (default @code{lem}).
@item @b{@minus{}@minus{}flex}
Only print the generated flex source code.
@item @b{@minus{}@minus{}macro=@var{filename}}
Read macrodefinitions from file @var{filename} rather than from
the default location. This option makes it possible to redefine the set of terms.
@item @b{@minus{}@minus{}define=@var{filename}}
Append macrodefinitions from file @var{filename}. This option
makes it possible to extend the set of terms.
@end table
@c ---------------------------------------------------------------------
@node ser pattern
@subsection Pattern
The @command{ser} pattern is a regular expression over terms corresponding
to text segments or segment sequences. Predefined terms are:
@table @code
@item seg(@var{t},@var{f},@var{a})
a segment of type @var{t}, containing form @var{f} and annotation
@var{a}
@item form(@var{f})
a segment containing form @var{f}
@item field(@var{f})
a segment containing annotation field @var{f}
@item space(@var{f})
a space segment of form @var{f}
@item word(@var{f})
a word segment of form @var{f}
@item punct(@var{f})
a punct segment of form @var{f}
@item number(@var{f})
a number segment of form @var{f}
@item lexeme(@var{f})
a word segment with lemma @var{f}
@item cat(@var{c})
a word segment of category @var{c}
@end table
All arguments are optional. If an argument is omitted, an arbitrary
string of non-blank characters is assumed as the argument value. Term
arguments may be arbitrary character-level regular expressions. The
following special symbols can be used:
@multitable {aaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
@item @code{[@dots{}]} @tab a character class
@item @code{[^@dots{}]} @tab a negated character class
@item @code{|} @tab alternative
@item @code{*} @tab repetition, including zero times
@item @code{+} @tab repetition, at least one time
@item @code{?} @tab optionality
@item @code{@{@var{m},@var{n}@}} @tab repetition from @var{m} to @var{n} times
@item @code{@{@var{m},@}} @tab repetition @var{m} or more times
@item @code{@{@var{m}@}} @tab repetition @var{m} times
@item @code{\@var{ddd}} @tab the character with octal value @var{ddd}
@item @code{\x@var{hh}} @tab the character with hexadecimal value @var{hh}
@item @code{( )} @tab parentheses, used to override precedence
@c @end multitable
@c @multitable {aaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
@item @code{.} @tab a non-blank character
@item @code{\w} @tab a letter
@item @code{\W} @tab a non-blank character other than a letter
@item @code{\d} @tab a digit
@item @code{\D} @tab a non-blank character other than a digit
@item @code{\s} @tab a space or tab character
@item @code{\S} @tab a non-blank character (the same as @code{.})
@item @code{\l} @tab a lowercase letter
@item @code{\L} @tab an uppercase letter
@end multitable
@noindent The following characters:
@example
@verb{% [ ] ^ | * + ? { } , . < > \ %}
@end example
must be escaped with a backslash, i.e. written as:
@example
@verb{% \[ \] \^ \| \* \+ \? \{ \} \, \. \< \> \\ %}
@end example
@quotation Note
The special symbols are borrowed from Perl with minor modifications,
made for convenience, so the meaning of certain special characters and
sequences differs slightly from the usual one.
The meaning of the @code{.} special character is modified due to
the special function of spaces in UTT files (they are field
separators). Use @code{\s} to match a space or tab character explicitly.
@end quotation
In the argument of the @code{cat} term a special operator <...> may be
used. A category specification enclosed in angle brackets matches all
category descriptions which are consistent (non-contradictory) with the
specification. For example @code{<N>} matches all noun descriptions,
@code{<ADJ/Can>} matches all adjectives in the accusative or nominative case.
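The consistency test behind the angle-bracket operator can be sketched in Python as follows. This is an illustration under assumptions, not the code used by @command{ser}: the names @code{parse_descr} and @code{consistent} are hypothetical, the specification is passed without the angle brackets, every value is taken to be a single character, and an attribute missing from the description is treated as non-contradictory.

```python
import re

# An attribute name (uppercase letters) followed by its one-character values.
ATTR = re.compile(r"([A-Z]+)([a-z0-9?!*+-]+)")


def parse_descr(descr):
    """Split a description such as 'ADJ/CaNsGf' into a part of speech
    and a mapping from attribute names to sets of values."""
    pos, _, attrs = descr.partition("/")
    values = dict()
    for attr, vals in ATTR.findall(attrs):
        values.setdefault(attr, set()).update(vals)
    return pos, values


def consistent(spec, descr):
    """True if descr does not contradict spec (hypothetical helper)."""
    spec_pos, spec_attrs = parse_descr(spec)
    pos, attrs = parse_descr(descr)
    if spec_pos != pos:
        return False
    for attr, wanted in spec_attrs.items():
        # A contradiction arises only when both sides give the attribute
        # and share no value.
        if attr in attrs and not (wanted & attrs[attr]):
            return False
    return True


print(consistent("ADJ/Can", "ADJ/CaNsGf"))
print(consistent("ADJ/Can", "ADJ/CgNsGf"))
```

In this sketch the specification @code{ADJ/Can} accepts an adjective description with case @code{a} or @code{n} and rejects one with case @code{g}.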
@*
@noindent @b{Examples of one-segment patterns:}
@multitable {aaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
@item @code{seg} @tab any segment
@item @code{word} @tab any word-form
@item @code{word(pomocy)} @tab the word-form @samp{pomocy}
@item @code{word(naj.+)} @tab a word-form beginning with @samp{naj}
@item @code{word(\L\l+)} @tab a capitalized word-form
@item @code{punct} @tab a punctuation character
@item @code{space(.*\\n.*)} @tab a space segment containing a newline character
@item @code{lexeme(pomoc)} @tab any form of the lexeme 'pomoc'
@item @code{cat(N/.*)} @tab a word whose category starts with @code{N/}
@item @code{cat(<N/Ca>)} @tab a word whose category matches @code{N/Ca}
@end multitable
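How the character-level symbols differ from ordinary regular expressions can be illustrated by translating a term argument into a Python regular expression. This is a rough sketch and does not reflect how @command{ser} is implemented: the names @code{to_python_regex} and @code{form_matches} are hypothetical, the letter classes cover only ASCII and Polish letters, and a dot inside a character class is not treated specially.

```python
import re

LOWER = "[a-ząćęłńóśźż]"          # assumed lowercase alphabet
UPPER = "[A-ZĄĆĘŁŃÓŚŹŻ]"          # assumed uppercase alphabet
LETTER = r"[^\W\d_]"               # any letter known to Python
ESCAPES = dict(
    w=LETTER,
    W=r"(?:(?!" + LETTER + r")\S)",  # non-blank character other than a letter
    d=r"\d",
    D=r"[^\s\d]",                    # non-blank character other than a digit
    s=r"[ \t]",
    S=r"\S",
    l=LOWER,
    L=UPPER,
)


def to_python_regex(arg):
    """Translate a character-level term argument into Python regex syntax."""
    out = []
    i = 0
    while i < len(arg):
        ch = arg[i]
        if ch == "\\" and i + 1 < len(arg):
            nxt = arg[i + 1]
            # Known symbols are expanded, other escapes are copied unchanged.
            out.append(ESCAPES.get(nxt, "\\" + nxt))
            i += 2
        elif ch == ".":
            out.append(r"\S")  # the dot never matches a blank
            i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def form_matches(arg, form):
    """True if the whole form matches the translated expression."""
    return re.fullmatch(to_python_regex(arg), form) is not None


print(form_matches(r"\L\l+", "Cześć"))
print(form_matches(r"naj.+", "naj"))
```

The whole form has to match, which mirrors the one-segment examples above: @code{naj.+} requires at least one character after @samp{naj}.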
@*
@noindent @b{Examples of multi-segment patterns:}
@table @code
@item (word(\L) punct(\.) space?)+ word(\L\l+)
a sequence of initials followed by a surname
@item punct seg(W|S|N)* cat(<NPRO/Sr>) seg(W|S|N)* punct
a text fragment between two punctuation characters, containing an
occurrence of a relative pronoun
@end table
@node ser how ser works
@subsection How ser works
@node ser customization
@subsection Customization
@c All predefined terms correspond to single segments,
@example
define(`verbseq', `(cat(<V>) (space cat(<V>)))')
@end example
the term @code{cat()} may not be used as a ... of
@c See @command{m4} manual for further details on macro definition format.
@node ser limitations
@subsection Limitations
Do not use more than 3 attributes in <>.
@node ser requirements
@subsection Requirements
In order to run @command{ser}, the following programs must be
installed in the system:
@itemize
@item @command{m4}
@item @command{grep}
@item @command{flex}
@item @command{gcc}
@end itemize
@c ---------------------------------------------------------------------
@c GRP
@c ---------------------------------------------------------------------
@page
@node grp
@section grp - pattern search tool
@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
@item @strong{Authors:} @tab Tomasz Obrębski
@item @strong{Component category:} @tab filter
@item @strong{Input format:} @tab UTT flattened
@item @strong{Output format:} @tab UTT flattened
@item @strong{Required annotation:} @tab tok, sen, lem --one-field
@end multitable
@menu
* grp description::
* grp command line options::
* grp pattern::
* grp hints::
@end menu
@node grp description
@subsection Description
@code{grp} selects sentences containing an expression matching a
pattern. The pattern format is exactly the same as that accepted by
@code{ser}.
@code{grp} is intended mainly for speeding up the corpus search process.
It is extremely fast (processing speed is usually higher than the speed
of reading the corpus file from disk).
@node grp command line options
@subsection Command line options
@table @code
@parhelp
@parversion
@parprocess
@parinteractive
@item @b{@minus{}@minus{}pattern=@var{pattern}, @minus{}e @var{pattern}}
The search pattern.
@item @b{@minus{}@minus{}morph=@var{field}}
The name of the annotation field containing the morphological
description (default @code{lem}).
@item @b{@minus{}@minus{}command}
Only print the generated flex source code.
@item @b{@minus{}@minus{}macro=@var{filename}}
Read macrodefinitions from file @var{filename} rather than from
the default location. This option makes it possible to redefine the set of terms.
@item @b{@minus{}@minus{}define=@var{filename}}
Append macrodefinitions from file @var{filename}. This option
makes it possible to extend the set of terms.
@end table
@node grp pattern
@subsection Pattern
(see @code{ser})
@node grp hints
@subsection Hints
The corpus search speed may be increased by combining @command{grp} with the
@command{lzop} compression tool (@command{grp} usually processes data faster
than it is read from a disk, especially for slow laptop drives).
@example
cat corpus | tok | sen | lem -1 | fla | lzop -7 > corpus.grp.lzo
@end example
@example
lzop -cd corpus.grp.lzo | grp -e @var{EXPR} | unfla | ser -e @var{EXPR}
@end example
@c ---------------------------------------------------------------------
@c MAR
@c ---------------------------------------------------------------------
@page
@node mar
@section mar
@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
@item @strong{Authors:} @tab Marcin Walas, Tomasz Obrębski
@item @strong{Input format:} @tab UTT flattened
@item @strong{Output format:} @tab UTT flattened
@item @strong{Required annotation:} @tab tok, sen, lem -1
@end multitable
@subsection Description
@code{mar} is a Perl script which matches a given pattern against UTT-formatted text
and marks the matching parts with any number of user-defined tags.
@subsection Command line options
@table @code
@parhelp
@parversion
@item @b{@minus{}@minus{}pattern=@var{pattern}, @minus{}e @var{pattern}}
The search pattern.
@item @b{@minus{}@minus{}action=@var{action}, @minus{}a @var{action} [p] [s] [P]}
Perform only indicated actions. Where:
@multitable {aaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
@item @code{p} @tab preprocess
@item @code{s} @tab search
@item @code{P} @tab postprocess
@end multitable
default: psP
@item @b{@minus{}@minus{}command}
print generated sed command, then exit
@item @b{@minus{}@minus{}help, @minus{}h}
print help, then exit
@item @b{@minus{}@minus{}version, @minus{}v}
print version, then exit
@end table
@subsection Tokens in pattern
A @code{mar} pattern is a @code{ser} pattern (@pxref{ser pattern})
to which any number of match tokens (tags) may be added. Each tag is printed exactly at the place
where it was put in the pattern. A valid token starts with @@ followed by any number of alphanumeric
characters. Examples of valid match tokens are @@STARTMATCH and @@ENDMATCH.
Match tokens can be placed between, before or after any of the @code{ser} pattern terms. They do not have
to be paired. There can be any number of them in the pattern (zero or more). They do not have to be unique.
They can be placed one after another. For example:
@multitable {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaa}
@item @code{@@BOM lexeme(pomoc)} @tab place tag @b{BOM} before any form of the lexeme 'pomoc'
@item @code{@@MATCH lexeme(pomoc) @@MATCH} @tab place tag @b{MATCH} before and after any form of the lexeme 'pomoc'
@item @code{cat(<ADJ>) @@MATCH lexeme(pomoc) @@MATCH} @tab place tag @b{MATCH} before and after any form of the lexeme 'pomoc' which is preceded by an adjective
@item @code{cat(<ADJ>) @@TAG @@BOM lexeme(pomoc) @@EOM} @tab place tags @b{TAG} and @b{BOM} before any form of the lexeme 'pomoc' which is preceded by an adjective, and tag @b{EOM} after it
@end multitable
See the output of @samp{mar -h} for more information.
@subsection How mar works
@code{mar} uses the @code{m4} macro processor to translate the given pattern into a regular expression, which is then turned into a @code{sed} script and executed.
The generated @code{sed} script can be inspected with the @code{@minus{}@minus{}command} option.
@subsection Limitations
The complexity of computations performed by @code{mar} increases linearly with the number of tokens placed in the pattern, so it is highly recommended not to use too many of them.
@subsection Requirements
In order to run @code{mar}, the following programs must be installed in the system:
@itemize
@item @command{m4}
@item @command{grep}
@item @command{sed}
@end itemize
@c ---------------------------------------------------------------------
@c KOT
@c ---------------------------------------------------------------------
@page
@node kot
@section kot - untokenizer
@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
@item @strong{Authors:} @tab Tomasz Obrębski
@item @strong{Component category:} @tab filter
@item @strong{Input format:} @tab UTT regular
@item @strong{Output format:} @tab text
@item @strong{Required annotation:} @tab tok
@end multitable
@menu
* kot description::
* kot command line options::
* kot usage examples::
@end menu
@node kot description
@subsection Description
@command{kot} transforms a UTT formatted file back into raw text format.
@node kot command line options
@subsection Command line options
@table @code
@parhelp
@c @item @b{@minus{}@minus{}version}, @b{@minus{}v}
@c @item @b{@minus{}@minus{}file=@var{filename}, @minus{}f @var{filename}}
@c @item @b{@minus{}@minus{}output=@var{filename}, @minus{}o @var{filename}}
@c @item @b{@minus{}@minus{}interactive @minus{}i}
@c @item @b{@minus{}@minus{}config=@var{filename}}
@item
@item @b{@minus{}@minus{}gap-fill=@var{string}, @minus{}g @var{string}}
print @var{string} between nonadjacent segments of the input file
@item @b{@minus{}@minus{}spaces, @minus{}r}
retain the special characters @code{_}, @code{\t},
@code{\n}, @code{\r}, @code{\f} unexpanded in the output
@end table
@node kot usage examples
@subsection Usage examples
@example
cat legia.txt | tok | kot
@end example
@example
cat legia.txt | tok | lem -1 | kot
@end example
@c ---------------------------------------------------------------
@c CON
@c ---------------------------------------------------------------
@page
@node con
@section con - concordance table generator
@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
@item @strong{Authors:} @tab Justyna Walkowska
@item @strong{Component category:} @tab sink
@item @strong{Input format:} @tab UTT regular
@item @strong{Output format:} @tab text
@item @strong{Required annotation:} @tab ser or mar
@end multitable
@c
@menu
* con description::
* con command line options::
* con usage example::
* con hints::
@end menu
@node con description
@subsection Description
@command{con} generates a concordance table based on a pattern given to @command{ser}.
@node con command line options
@subsection Command line options
@table @code
@parhelp
@c @item @b{@minus{}@minus{}help}, @b{@minus{}h}
@c @item @b{@minus{}@minus{}version}, @b{@minus{}v}
@c @item @b{@minus{}@minus{}file=@var{filename}, @minus{}f @var{filename}}
@c @item @b{@minus{}@minus{}output=@var{filename}, @minus{}o @var{filename}}
@c @item @b{@minus{}@minus{}fail=@var{filename}, @minus{}e @var{filename}} [???]
@c @item @b{@minus{}@minus{}copy, @minus{}c} [???]
@c @item @b{@minus{}@minus{}input-field=@var{fieldname}, @minus{}I @var{fieldname}}
@c @item @b{@minus{}@minus{}output-field=@var{fieldname}, @minus{}O @var{fieldname}}
@c @item @b{@minus{}@minus{}process=@var{class}, @minus{}p @var{class}}
@c @item @b{@minus{}@minus{}interactive @minus{}i}
@c @item @b{@minus{}@minus{}config=@var{filename}}
@c @item
@c @item @b{@minus{}@minus{}pattern=@var{pattern}, @minus{}e @var{pattern}}
@c search pattern
@c
@c @item @b{@minus{}@minus{}flex}
@c only print the generated flex source code
@c
@c @item @b{@minus{}@minus{}macro=@var{filename}}
@c read macrodefinitions from file @var{filename} rather than from
@c default location. This option allows to redefine the set of terms.
@c
@c @item @b{@minus{}@minus{}define=@var{filename}}
@c append macrodefinitions from file @var{filename}. This option
@c allows to extend the set of terms.
@item @b{@minus{}@minus{}left @minus{}l}
Left context info (default='30c'). Example:
@example
-l=5c: left context is 5 characters
-l=5w: left context is 5 words
-l=5s: left context is 5 non-empty input lines
-l='\s*\S+\sr\S+BOS': left context starts with the given regex
@end example
@item @b{@minus{}@minus{}right @minus{}r}
Right context info (default='30c').
@item @b{@minus{}@minus{}trim @minus{}t}
Clear incomplete words from output.
@item @b{@minus{}@minus{}white @minus{}w}
Do not convert whitespace characters into spaces.
@item @b{@minus{}@minus{}column @minus{}c}
Left column minimal width in characters (default = 0).
@item @b{@minus{}@minus{}ignore @minus{}i}
Ignore segment inconsistency in the input.
@item @b{@minus{}@minus{}bom}
Beginning of selected segment (regex, default='[0-9]+ [0-9]+ BOM .*').
@item @b{@minus{}@minus{}eom}
End of selected segment (regex, default='[0-9]+ [0-9]+ EOM .*').
@item @b{@minus{}@minus{}bod}
Selected segment beginning display string (default='[').
@item @b{@minus{}@minus{}eod}
Selected segment end display string (default=']').
@end table
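The context specifications accepted by the left and right context options can be read as a number followed by a unit, or as a regular expression otherwise. A minimal Python sketch of that reading follows. The names @code{parse_context} and @code{left_context} are hypothetical, the sketch works on plain text rather than on UTT segments, and it is not taken from the @command{con} source.

```python
import re


def parse_context(spec):
    """Classify a context specification.

    Returns (unit, count) for the forms Nc, Nw and Ns,
    and ("regex", spec) for anything else.
    """
    m = re.fullmatch(r"(\d+)([cws])", spec)
    if m:
        return m.group(2), int(m.group(1))
    return "regex", spec


def left_context(text, spec):
    """Cut the left context out of plain text (units c and w only)."""
    unit, count = parse_context(spec)
    if unit == "c":
        return text[-count:] if count else ""
    if unit == "w":
        words = text.split()
        return " ".join(words[-count:]) if count else ""
    raise ValueError("only the c and w units are handled in this sketch")


print(parse_context("30c"))
print(left_context("Ala ma kota i psa", "2w"))
```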
@node con usage example
@subsection Usage example
@example
cat file.txt | tok | lem -1 | ser -e 'lexeme(dom)' | con
@end example
@node con hints
@subsection Hints
@command{con} is a rather slow program. Do not pass large amounts of
redundant text through this program. @command{con} works fine in the following
sequence:
@example
... | grp -e EXPR | ser -e EXPR | con
@end example
@c ---------------------------------------------------------------------
@c ---------------------------------------------------------------------
@page
@node Auxiliary tools
@chapter Auxiliary tools
@menu
* compdic:: dictionary compiler
* fla:: UTT file flattener
* unfla:: UTT file unflattener
@end menu
@page
@node compdic
@section compdic - the dictionary compiler
@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
@item @strong{Authors:} @tab Michał Stolarski, Tomasz Obrębski
@item @strong{Component category:} @tab additional tool
@end multitable
@c
@command{compdic} compiles dictionaries in text format (@code{.dic} extension) into binary
(FST) format (@code{.bin} extension).
The automaton representation of a dictionary is built using the OpenFst toolkit,
so the OpenFst toolkit has to be installed in the system for @command{compdic} to work.
Usage:
@example
compdic <dictionaryname>.dic <dictionaryname>.bin
@end example
The file <dictionaryname>.bin will be generated.
@c @menu
@c * con command line options::
@c * con usage example::
@c * con hints::
@c @end menu
@c -------------------------------------------------------------------------------
@c FLA
@c -------------------------------------------------------------------------------
@page
@node fla
@section fla - the UTT file flattener
@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
@item @strong{Authors:} @tab Tomasz Obrębski
@item @strong{Input format:} @tab UTT regular
@item @strong{Output format:} @tab UTT flattened
@item @strong{Required annotation:} @tab sen
@end multitable
@c
@menu
* fla description::
@c * fla command line options::
@c * fla usage example::
@end menu
@node fla description
@subsection Description
@command{fla} ``flattens'' a UTT file by merging the segments belonging
to one sentence into one line. Technically, end-of-line characters
('\n', ASCII code 10) are replaced with form-feed characters ('\f',
ASCII code 12). The flattening makes it possible to process UTT files
sentence by sentence with such tools as @command{grep} or @command{sed}
(this is used in @command{grp} and @command{mar}).
Flattened files should have the suffix @code{.fla}, e.g. @file{thetext.utt.fla}.
Flattened files are still human-readable.
Usage:
@example
fla [<bosregex>]
@end example
The optional argument is a regular expression describing segments
which should be treated as sentence beginnings (the test is: the
segment contains a fragment matching the @code{<bosregex>}). By
default, segments containing a @code{BOS} field are used.
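The flattening described in this section, and its inverse performed by @command{unfla}, can be sketched in a few lines of Python. This is an illustration only, not the source of either tool: the function names are hypothetical, and the default expression used here to find @code{BOS} segments is an assumption.

```python
import re


def flatten(utt_text, bos_regex=r"\bBOS\b"):
    """Merge the segments of each sentence into one line.

    Segments inside a sentence are joined with form feed (ASCII 12);
    every sentence ends with a newline (ASCII 10).
    """
    bos = re.compile(bos_regex)
    segments = utt_text.split("\n")
    if segments and segments[-1] == "":
        segments.pop()  # drop the empty piece after the final newline
    sentences = []
    for segment in segments:
        # A segment matching the expression starts a new sentence.
        if bos.search(segment) or not sentences:
            sentences.append([segment])
        else:
            sentences[-1].append(segment)
    return "".join("\f".join(s) + "\n" for s in sentences)


def unflatten(flat_text):
    """Restore the regular format by turning form feeds back into newlines."""
    return flat_text.replace("\f", "\n")


sample = (
    "0000 00 BOS *\n"
    "0000 05 W Cześć\n"
    "0005 01 P !\n"
    "0006 00 EOS *\n"
    "0006 00 BOS *\n"
    "0006 01 S _\n"
)
flat = flatten(sample)
print(flat.count("\n"))
print(unflatten(flat) == sample)
```

Because each sentence occupies exactly one line after flattening, a line-oriented tool that selects lines selects whole sentences.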
@c -------------------------------------------------------------------------------
@c UNFLA
@c -------------------------------------------------------------------------------
@page
@node unfla
@section unfla - the UTT file unflattener
@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
@item @strong{Authors:} @tab Tomasz Obrębski
@item @strong{Input format:} @tab UTT flattened
@item @strong{Output format:} @tab UTT regular
@item @strong{Required annotation:} @tab -
@end multitable
@menu
* unfla description::
@c * fla command line options::
@c * fla usage example::
@end menu
@node unfla description
@subsection Description
@command{unfla} transforms a flattened UTT file, produced by
@command{fla}, into the regular format by restoring end-of-line
characters.
@c ---------------------------------------------------------------------
@c USAGE EXAMPLES
@c ---------------------------------------------------------------------
@node Usage examples
@chapter Usage examples
@subsubheading Simple pipelines
@enumerate
@item tokenization
@example
cat text | tok > output1
@end example
@item morphological annotation (1)
simple dictionary based lemmatization
@example
cat text | tok | lem > output1
@end example
@item morphological annotation (2)@*
1) perform dictionary-based lemmatization@*
2) guess descriptions for words which have no annotation
@example
cat text | tok | lem | gue -S lem > output2
@end example
@item morphological annotation (3)@*
1) perform dictionary-based lemmatization@*
2) try to correct words with no annotation@*
3) perform dictionary-based lemmatization of corrected words@*
4) guess descriptions for words which still have no annotation
@example
cat text | tok | lem | cor -p W -S lem | lem -I cor | gue -p W -S lem
@end example
@item spelling correction
@example
cat text | tok | egrep ' W ' | lem | egrep -v 'lem:' | cor -1
@end example
@item Expression extraction
Extraction of all occurrences of a verb followed by a form of the noun 'rozmowa'.
@example
cat text | tok | lem -1 | ser -e 'cat(<V>) space lexeme(rozmowa)' -m | kot > output4
@end example
@item A word in context
Extraction of text fragments containing a form of the lexeme 'rozmowa' in
the context of 5 preceding and 5 following corpus segments.
@example
cat text | tok | lem -1 | ser -e 'seg@{5@} lexeme(rozmowa) seg@{5@}' -m | kot > output
@end example
@item generation of concordance table (1)
@example
cat text | tok | lem -1 | ser -e 'cat(<V>) space lexeme(rozmowa)' | con
@end example
Running time: 10 seconds.
@item generation of concordance table (2)
The same as above but much faster
@example
cat text | tok | lem -1 | \
grp -e 'cat(<V>) space lexeme(rozmowa)' | \
ser -e 'cat(<V>) space lexeme(rozmowa)' | \
con
@end example
Running time: 2 seconds.
@item generation of concordance table (3)
Usually, the same corpus is searched repeatedly. In such a case it is
advisable to first transform the corpus data into the format
required by @command{grp}, and then use the preprocessed data.
As @command{grp} (@command{grep}) processes data faster than it is
read from the disk drive, the search time may be shortened further by
using file compression. We suggest using the
@command{lzop} compressor/decompressor.
@item the fastest way to search a large corpus
step 1: corpus preprocessing
@example
cat corpus | tok | sen | lem -1 \
| fla | lzop -7 > corpus.grp.lzo
@end example
step 2: search
@example
lzop -cd corpus.grp.lzo \
  | grp -e 'cat(<V>) space lexeme(rozmowa)' \
  | unfla \
  | ser -e 'cat(<V>) space lexeme(rozmowa)' \
  | con
@end example
@end enumerate
@c @subsubheading More complicated configurations
@c @example
@c mknod fifo1 p
@c mknod fifo2 p
@c mknod fifo3 p
@c mknod fifo4 p
@c mknod fifo5 p
@c tok | lem -p W -e fifo1 > fifo2 &
@c cor -e fifo3 < fifo1 | lem > fifo4 &
@c gue < fifo3 > fifo5 &
@c sort -m fifo2 fifo4 fifo5
@c rm fifo?
@c @end example
@c ---------------------------------------------------------------------
@c ---------------------------------------------------------------------
@c ---------------------------------------------------------------------
@c PMDBF DICTIONARY
@c ---------------------------------------------------------------------
@node PMDBF dictionary
@chapter PMDBF dictionary
UTT components come with lexical data derived from the Polish
Morphological Database (PMDB).
@menu
* PMDBF files::
* PMDBF tag structure::
* PMDBF parts of speech::
* PMDBF morphosyntactic attributes::
@end menu
@node PMDBF files
@section Files
@node PMDBF tag structure
@section Tag structure
@example
pos   = [[:upper:]]+
attr  = [[:upper:]]+
val   = [[:lower:][:digit:]?!*+-] | <[^>\n]+>
descr = pos ( / ( attr val + ) + ) ?
@end example
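The grammar above can be turned into a single regular expression for checking tag descriptions. The Python sketch below is one possible transcription. It assumes that the POSIX classes cover ASCII letters and digits only, and the name @code{is_valid_descr} is hypothetical.

```python
import re

POS = r"[A-Z]+"                          # pos  = [[:upper:]]+
ATTR = r"[A-Z]+"                         # attr = [[:upper:]]+
VAL = r"(?:[a-z0-9?!*+-]|<[^>\n]+>)"     # val  = one character or <...>
# descr = pos ( / ( attr val + ) + ) ?
DESCR = re.compile(POS + r"(?:/(?:" + ATTR + VAL + r"+)+)?")


def is_valid_descr(text):
    """True if text is a well-formed description according to the grammar."""
    return DESCR.fullmatch(text) is not None


print(is_valid_descr("N/CnNsGf"))
print(is_valid_descr("N/"))
```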
@node PMDBF parts of speech
@section Parts of speech
@multitable {ADJPRP} { adjectival-passive-participle }
@item @code{N} @tab noun
@item @code{NPRO} @tab nominal-pronoun
@item @code{NV} @tab deverbal-noun
@item @code{V} @tab verb
@item @code{BYC} @tab byc
@item @code{VNI} @tab non-inflected-verb
@item @code{ADJ} @tab adjective
@item @code{ADJPAP} @tab adjectival-passive-participle
@item @code{ADJPRP} @tab adjectival-present-participle
@item @code{ADJPP} @tab adjectival-past-participle
@item @code{ADJPRO} @tab adjectival-pronoun
@item @code{ADJNUM} @tab adjectival-numeral
@item @code{ADV} @tab adverb
@item @code{ADVANP} @tab adverbial-anterior-participle
@item @code{ADVPRP} @tab adverbial-present-participle
@item @code{ADVPRO} @tab adverbial-pronoun
@item @code{ADVNUM} @tab adverbial-numeral
@item @code{P} @tab preposition
@item @code{PPRO} @tab prep-noun-pronoun
@item @code{CONJ} @tab conjunction
@item @code{EXCL} @tab exclamation
@item @code{APP} @tab call
@item @code{ONO} @tab onomatopoeia
@item @code{PART} @tab particle
@item @code{NUMCRD} @tab cardinal-numeral
@item @code{NUMCOL} @tab collective-numeral
@item @code{NUMPAR} @tab partitive-numeral
@item @code{NUMORD} @tab ordinal-numeral
@end multitable
@node PMDBF morphosyntactic attributes
@section Morphosyntactic attributes
@multitable {Attr} {Val} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
@c @headitem Attr @tab Val @tab Description
@item
@code{A} @tab @tab Aspect
@item
@tab @code{p} @tab perfective,
@item
@tab @code{i} @tab imperfective.
@item
@item
@code{V} @tab @tab Verb-Form
@item
@tab @code{b} @tab infinitive,
@item
@tab @code{p} @tab personal,
@item
@tab @code{i} @tab impersonal.
@item
@item
@code{M} @tab @tab Mood
@item
@tab @code{d} @tab declarative,
@item
@tab @code{c} @tab conditional,
@item
@tab @code{i} @tab imperative.
@item
@item
@code{T} @tab @tab Tense
@item
@tab @code{a} @tab past,
@item
@tab @code{r} @tab present,
@item
@tab @code{f} @tab future.
@item
@item
@code{P} @tab @tab Person
@item
@tab @code{1} @tab 1,
@item
@tab @code{2} @tab 2,
@item
@tab @code{3} @tab 3.
@item
@item
@code{D} @tab @tab Degree
@item
@tab @code{p} @tab positive,
@item
@tab @code{c} @tab comparative,
@item
@tab @code{s} @tab superlative.
@item
@item
@code{N} @tab @tab Number
@item
@tab @code{s} @tab singular,
@item
@tab @code{p} @tab plural.
@item
@item
@code{C} @tab @tab Case
@item
@tab @code{n} @tab nominative,
@item
@tab @code{g} @tab genitive,
@item
@tab @code{d} @tab dative,
@item
@tab @code{a} @tab accusative,
@item
@tab @code{i} @tab instrumental,
@item
@tab @code{l} @tab locative,
@item
@tab @code{v} @tab vocative.
@item
@code{G} @tab @tab Gender
@item
@tab @code{p} @tab masculine-personal,
@item
@tab @code{a} @tab masculine-animal,
@item
@tab @code{i} @tab masculine-inanimate,
@item
@tab @code{f} @tab feminine,
@item
@tab @code{n} @tab neuter.
@end multitable
@c ---------------------------------------------------------------------
@c ---------------------------------------------------------------------
@c
@c @node Examples
@c @chapter Examples
@c ----------------------------------------------------------------------
@c ----------------------------------------------------------------------
@node GNU Free Documentation License
@chapter GNU Free Documentation License
@c The GNU Free Documentation License.
@center Version 1.2, November 2002
@c This file is intended to be included within another document,
@c hence no sectioning command or @node.
@display
Copyright @copyright{} 2000,2001,2002 Free Software Foundation, Inc.
51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA
Everyone is permitted to copy and distribute verbatim copies
of this license document, but changing it is not allowed.
@end display
@enumerate 0
@item
PREAMBLE
The purpose of this License is to make a manual, textbook, or other
functional and useful document @dfn{free} in the sense of freedom: to
assure everyone the effective freedom to copy and redistribute it,
with or without modifying it, either commercially or noncommercially.
Secondarily, this License preserves for the author and publisher a way
to get credit for their work, while not being considered responsible
for modifications made by others.
This License is a kind of ``copyleft'', which means that derivative
works of the document must themselves be free in the same sense. It
complements the GNU General Public License, which is a copyleft
license designed for free software.
We have designed this License in order to use it for manuals for free
software, because free software needs free documentation: a free
program should come with manuals providing the same freedoms that the
software does. But this License is not limited to software manuals;
it can be used for any textual work, regardless of subject matter or
whether it is published as a printed book. We recommend this License
principally for works whose purpose is instruction or reference.
@item
APPLICABILITY AND DEFINITIONS
This License applies to any manual or other work, in any medium, that
contains a notice placed by the copyright holder saying it can be
distributed under the terms of this License. Such a notice grants a
world-wide, royalty-free license, unlimited in duration, to use that
work under the conditions stated herein. The ``Document'', below,
refers to any such manual or work. Any member of the public is a
licensee, and is addressed as ``you''. You accept the license if you
copy, modify or distribute the work in a way requiring permission
under copyright law.
A ``Modified Version'' of the Document means any work containing the
Document or a portion of it, either copied verbatim, or with
modifications and/or translated into another language.
A ``Secondary Section'' is a named appendix or a front-matter section
of the Document that deals exclusively with the relationship of the
publishers or authors of the Document to the Document's overall
subject (or to related matters) and contains nothing that could fall
directly within that overall subject. (Thus, if the Document is in
part a textbook of mathematics, a Secondary Section may not explain
any mathematics.) The relationship could be a matter of historical
connection with the subject or with related matters, or of legal,
commercial, philosophical, ethical or political position regarding
them.
The ``Invariant Sections'' are certain Secondary Sections whose titles
are designated, as being those of Invariant Sections, in the notice
that says that the Document is released under this License. If a
section does not fit the above definition of Secondary then it is not
allowed to be designated as Invariant. The Document may contain zero
Invariant Sections. If the Document does not identify any Invariant
Sections then there are none.
The ``Cover Texts'' are certain short passages of text that are listed,
as Front-Cover Texts or Back-Cover Texts, in the notice that says that
the Document is released under this License. A Front-Cover Text may
be at most 5 words, and a Back-Cover Text may be at most 25 words.
A ``Transparent'' copy of the Document means a machine-readable copy,
represented in a format whose specification is available to the
general public, that is suitable for revising the document
straightforwardly with generic text editors or (for images composed of
pixels) generic paint programs or (for drawings) some widely available
drawing editor, and that is suitable for input to text formatters or
for automatic translation to a variety of formats suitable for input
to text formatters. A copy made in an otherwise Transparent file
format whose markup, or absence of markup, has been arranged to thwart
or discourage subsequent modification by readers is not Transparent.
An image format is not Transparent if used for any substantial amount
of text. A copy that is not ``Transparent'' is called ``Opaque''.
Examples of suitable formats for Transparent copies include plain
@sc{ascii} without markup, Texinfo input format, La@TeX{} input
format, @acronym{SGML} or @acronym{XML} using a publicly available
@acronym{DTD}, and standard-conforming simple @acronym{HTML},
PostScript or @acronym{PDF} designed for human modification. Examples
of transparent image formats include @acronym{PNG}, @acronym{XCF} and
@acronym{JPG}. Opaque formats include proprietary formats that can be
read and edited only by proprietary word processors, @acronym{SGML} or
@acronym{XML} for which the @acronym{DTD} and/or processing tools are
not generally available, and the machine-generated @acronym{HTML},
PostScript or @acronym{PDF} produced by some word processors for
output purposes only.
The ``Title Page'' means, for a printed book, the title page itself,
plus such following pages as are needed to hold, legibly, the material
this License requires to appear in the title page. For works in
formats which do not have any title page as such, ``Title Page'' means
the text near the most prominent appearance of the work's title,
preceding the beginning of the body of the text.
A section ``Entitled XYZ'' means a named subunit of the Document whose
title either is precisely XYZ or contains XYZ in parentheses following
text that translates XYZ in another language. (Here XYZ stands for a
specific section name mentioned below, such as ``Acknowledgements'',
``Dedications'', ``Endorsements'', or ``History''.) To ``Preserve the Title''
of such a section when you modify the Document means that it remains a
section ``Entitled XYZ'' according to this definition.
The Document may include Warranty Disclaimers next to the notice which
states that this License applies to the Document. These Warranty
Disclaimers are considered to be included by reference in this
License, but only as regards disclaiming warranties: any other
implication that these Warranty Disclaimers may have is void and has
no effect on the meaning of this License.
@item
VERBATIM COPYING
You may copy and distribute the Document in any medium, either
commercially or noncommercially, provided that this License, the
copyright notices, and the license notice saying this License applies
to the Document are reproduced in all copies, and that you add no other
conditions whatsoever to those of this License. You may not use
technical measures to obstruct or control the reading or further
copying of the copies you make or distribute. However, you may accept
compensation in exchange for copies. If you distribute a large enough
number of copies you must also follow the conditions in section 3.
You may also lend copies, under the same conditions stated above, and
you may publicly display copies.
@item
COPYING IN QUANTITY
If you publish printed copies (or copies in media that commonly have
printed covers) of the Document, numbering more than 100, and the
Document's license notice requires Cover Texts, you must enclose the
copies in covers that carry, clearly and legibly, all these Cover
Texts: Front-Cover Texts on the front cover, and Back-Cover Texts on
the back cover. Both covers must also clearly and legibly identify
you as the publisher of these copies. The front cover must present
the full title with all words of the title equally prominent and
visible. You may add other material on the covers in addition.
Copying with changes limited to the covers, as long as they preserve
the title of the Document and satisfy these conditions, can be treated
as verbatim copying in other respects.
If the required texts for either cover are too voluminous to fit
legibly, you should put the first ones listed (as many as fit
reasonably) on the actual cover, and continue the rest onto adjacent
pages.
If you publish or distribute Opaque copies of the Document numbering
more than 100, you must either include a machine-readable Transparent
copy along with each Opaque copy, or state in or with each Opaque copy
a computer-network location from which the general network-using
public has access to download using public-standard network protocols
a complete Transparent copy of the Document, free of added material.
If you use the latter option, you must take reasonably prudent steps,
when you begin distribution of Opaque copies in quantity, to ensure
that this Transparent copy will remain thus accessible at the stated
location until at least one year after the last time you distribute an
Opaque copy (directly or through your agents or retailers) of that
edition to the public.
It is requested, but not required, that you contact the authors of the
Document well before redistributing any large number of copies, to give
them a chance to provide you with an updated version of the Document.
@item
MODIFICATIONS
You may copy and distribute a Modified Version of the Document under
the conditions of sections 2 and 3 above, provided that you release
the Modified Version under precisely this License, with the Modified
Version filling the role of the Document, thus licensing distribution
and modification of the Modified Version to whoever possesses a copy
of it. In addition, you must do these things in the Modified Version:
@enumerate A
@item
Use in the Title Page (and on the covers, if any) a title distinct
from that of the Document, and from those of previous versions
(which should, if there were any, be listed in the History section
of the Document). You may use the same title as a previous version
if the original publisher of that version gives permission.
@item
List on the Title Page, as authors, one or more persons or entities
responsible for authorship of the modifications in the Modified
Version, together with at least five of the principal authors of the
Document (all of its principal authors, if it has fewer than five),
unless they release you from this requirement.
@item
State on the Title page the name of the publisher of the
Modified Version, as the publisher.
@item
Preserve all the copyright notices of the Document.
@item
Add an appropriate copyright notice for your modifications
adjacent to the other copyright notices.
@item
Include, immediately after the copyright notices, a license notice
giving the public permission to use the Modified Version under the
terms of this License, in the form shown in the Addendum below.
@item
Preserve in that license notice the full lists of Invariant Sections
and required Cover Texts given in the Document's license notice.
@item
Include an unaltered copy of this License.
@item
Preserve the section Entitled ``History'', Preserve its Title, and add
to it an item stating at least the title, year, new authors, and
publisher of the Modified Version as given on the Title Page. If
there is no section Entitled ``History'' in the Document, create one
stating the title, year, authors, and publisher of the Document as
given on its Title Page, then add an item describing the Modified
Version as stated in the previous sentence.
@item
Preserve the network location, if any, given in the Document for
public access to a Transparent copy of the Document, and likewise
the network locations given in the Document for previous versions
it was based on. These may be placed in the ``History'' section.
You may omit a network location for a work that was published at
least four years before the Document itself, or if the original
publisher of the version it refers to gives permission.
@item
For any section Entitled ``Acknowledgements'' or ``Dedications'', Preserve
the Title of the section, and preserve in the section all the
substance and tone of each of the contributor acknowledgements and/or
dedications given therein.
@item
Preserve all the Invariant Sections of the Document,
unaltered in their text and in their titles. Section numbers
or the equivalent are not considered part of the section titles.
@item
Delete any section Entitled ``Endorsements''. Such a section
may not be included in the Modified Version.
@item
Do not retitle any existing section to be Entitled ``Endorsements'' or
to conflict in title with any Invariant Section.
@item
Preserve any Warranty Disclaimers.
@end enumerate
If the Modified Version includes new front-matter sections or
appendices that qualify as Secondary Sections and contain no material
copied from the Document, you may at your option designate some or all
of these sections as invariant. To do this, add their titles to the
list of Invariant Sections in the Modified Version's license notice.
These titles must be distinct from any other section titles.
You may add a section Entitled ``Endorsements'', provided it contains
nothing but endorsements of your Modified Version by various
parties---for example, statements of peer review or that the text has
been approved by an organization as the authoritative definition of a
standard.
You may add a passage of up to five words as a Front-Cover Text, and a
passage of up to 25 words as a Back-Cover Text, to the end of the list
of Cover Texts in the Modified Version. Only one passage of
Front-Cover Text and one of Back-Cover Text may be added by (or
through arrangements made by) any one entity. If the Document already
includes a cover text for the same cover, previously added by you or
by arrangement made by the same entity you are acting on behalf of,
you may not add another; but you may replace the old one, on explicit
permission from the previous publisher that added the old one.
The author(s) and publisher(s) of the Document do not by this License
give permission to use their names for publicity for or to assert or
imply endorsement of any Modified Version.
@item
COMBINING DOCUMENTS
You may combine the Document with other documents released under this
License, under the terms defined in section 4 above for modified
versions, provided that you include in the combination all of the
Invariant Sections of all of the original documents, unmodified, and
list them all as Invariant Sections of your combined work in its
license notice, and that you preserve all their Warranty Disclaimers.
The combined work need only contain one copy of this License, and
multiple identical Invariant Sections may be replaced with a single
copy. If there are multiple Invariant Sections with the same name but
different contents, make the title of each such section unique by
adding at the end of it, in parentheses, the name of the original
author or publisher of that section if known, or else a unique number.
Make the same adjustment to the section titles in the list of
Invariant Sections in the license notice of the combined work.
In the combination, you must combine any sections Entitled ``History''
in the various original documents, forming one section Entitled
``History''; likewise combine any sections Entitled ``Acknowledgements'',
and any sections Entitled ``Dedications''. You must delete all
sections Entitled ``Endorsements.''
@item
COLLECTIONS OF DOCUMENTS
You may make a collection consisting of the Document and other documents
released under this License, and replace the individual copies of this
License in the various documents with a single copy that is included in
the collection, provided that you follow the rules of this License for
verbatim copying of each of the documents in all other respects.
You may extract a single document from such a collection, and distribute
it individually under this License, provided you insert a copy of this
License into the extracted document, and follow this License in all
other respects regarding verbatim copying of that document.
@item
AGGREGATION WITH INDEPENDENT WORKS
A compilation of the Document or its derivatives with other separate
and independent documents or works, in or on a volume of a storage or
distribution medium, is called an ``aggregate'' if the copyright
resulting from the compilation is not used to limit the legal rights
of the compilation's users beyond what the individual works permit.
When the Document is included in an aggregate, this License does not
apply to the other works in the aggregate which are not themselves
derivative works of the Document.
If the Cover Text requirement of section 3 is applicable to these
copies of the Document, then if the Document is less than one half of
the entire aggregate, the Document's Cover Texts may be placed on
covers that bracket the Document within the aggregate, or the
electronic equivalent of covers if the Document is in electronic form.
Otherwise they must appear on printed covers that bracket the whole
aggregate.
@item
TRANSLATION
Translation is considered a kind of modification, so you may
distribute translations of the Document under the terms of section 4.
Replacing Invariant Sections with translations requires special
permission from their copyright holders, but you may include
translations of some or all Invariant Sections in addition to the
original versions of these Invariant Sections. You may include a
translation of this License, and all the license notices in the
Document, and any Warranty Disclaimers, provided that you also include
the original English version of this License and the original versions
of those notices and disclaimers. In case of a disagreement between
the translation and the original version of this License or a notice
or disclaimer, the original version will prevail.
If a section in the Document is Entitled ``Acknowledgements'',
``Dedications'', or ``History'', the requirement (section 4) to Preserve
its Title (section 1) will typically require changing the actual
title.
@item
TERMINATION
You may not copy, modify, sublicense, or distribute the Document except
as expressly provided for under this License. Any other attempt to
copy, modify, sublicense or distribute the Document is void, and will
automatically terminate your rights under this License. However,
parties who have received copies, or rights, from you under this
License will not have their licenses terminated so long as such
parties remain in full compliance.
@item
FUTURE REVISIONS OF THIS LICENSE
The Free Software Foundation may publish new, revised versions
of the GNU Free Documentation License from time to time. Such new
versions will be similar in spirit to the present version, but may
differ in detail to address new problems or concerns. See
@uref{http://www.gnu.org/copyleft/}.
Each version of the License is given a distinguishing version number.
If the Document specifies that a particular numbered version of this
License ``or any later version'' applies to it, you have the option of
following the terms and conditions either of that specified version or
of any later version that has been published (not as a draft) by the
Free Software Foundation. If the Document does not specify a version
number of this License, you may choose any version ever published (not
as a draft) by the Free Software Foundation.
@end enumerate
@page
@heading ADDENDUM: How to use this License for your documents
To use this License in a document you have written, include a copy of
the License in the document and put the following copyright and
license notices just after the title page:
@smallexample
@group
Copyright (C) @var{year} @var{your name}.
Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License, Version 1.2
or any later version published by the Free Software Foundation;
with no Invariant Sections, no Front-Cover Texts, and no Back-Cover
Texts. A copy of the license is included in the section entitled ``GNU
Free Documentation License''.
@end group
@end smallexample
If you have Invariant Sections, Front-Cover Texts and Back-Cover Texts,
replace the ``with@dots{}Texts.'' line with this:
@smallexample
@group
with the Invariant Sections being @var{list their titles}, with
the Front-Cover Texts being @var{list}, and with the Back-Cover Texts
being @var{list}.
@end group
@end smallexample
If you have Invariant Sections without Cover Texts, or some other
combination of the three, merge those two alternatives to suit the
situation.
If your document contains nontrivial examples of program code, we
recommend releasing these examples in parallel under your choice of
free software license, such as the GNU General Public License,
to permit their use in free software.
@c ---------------------------------------------------------------------
@c ---------------------------------------------------------------------
@node Reporting bugs
@chapter Reporting bugs
Report bugs to @email{obrebski@@amu.edu.pl}.
@c ---------------------------------------------------------------------
@c ---------------------------------------------------------------------
@c @node Copyright
@c @chapter Copyright
@c
@c Copyright 2004 by Tomasz Obrębski
@c This software is free for research and educational use.
@c ---------------------------------------------------------------------
@c ---------------------------------------------------------------------
@node Author
@chapter Author
@bye