\input texinfo @c -*-texinfo-*-
|
|||
|
@documentencoding ISO-8859-2
|
|||
|
@c @documentlanguage pl
|
|||
|
|
|||
|
@c %**start of header
|
|||
|
@setfilename utt.info
|
|||
|
@settitle UAM Text Tools v0.90
|
|||
|
@c %**end of header
|
|||
|
|
|||
|
@copying
|
|||
|
This manual is for UAM Text Tools (version 0.90, November 2007).
|
|||
|
|
|||
|
Copyright @copyright{} 2005, 2007 Tomasz Obrębski, Michał Stolarski, Justyna Walkowska, Paweł Konieczka.
|
|||
|
|
|||
|
Permission is granted to copy, distribute and/or modify this document
|
|||
|
under the terms of the GNU Free Documentation License, Version 1.2
|
|||
|
or any later version published by the Free Software Foundation;
|
|||
|
with no Invariant Sections, no Front-Cover Texts, and no Back-Cover
|
|||
|
Texts. A copy of the license is included in the section entitled ``GNU Free Documentation License''.
|
|||
|
|
|||
|
@c @quotation
|
|||
|
@c Permission is granted to ...
|
|||
|
@c No permission is granted until the document is completed.
|
|||
|
@c @end quotation
|
|||
|
@end copying
|
|||
|
|
|||
|
|
|||
|
@titlepage
|
|||
|
@title UAM Text Tools 0.90 - User Manual
|
|||
|
@subtitle edition 0.01, @today
|
|||
|
@subtitle status: prescript
|
|||
|
@author by Justyna Walkowska, Tomasz Obr@,{}ebski and Micha@l{} Stolarski
|
|||
|
@page
|
|||
|
@vskip 0pt plus 1filll
|
|||
|
@insertcopying
|
|||
|
@end titlepage
|
|||
|
|
|||
|
@contents
|
|||
|
|
|||
|
@c @paragraphindent none
|
|||
|
|
|||
|
@iftex
|
|||
|
@parskip = 0.5@normalbaselineskip plus 3pt minus 1pt
|
|||
|
@end iftex
|
|||
|
|
|||
|
@c @headings off
|
|||
|
@c @everyheading LEM(1) @| @| LEM(1)
|
|||
|
@everyfooting @today @c @| @thispage @|
|
|||
|
|
|||
|
@ifnottex
|
|||
|
|
|||
|
@node Top
|
|||
|
@top UTT - UAM Text Tools
|
|||
|
|
|||
|
@insertcopying
|
|||
|
|
|||
|
@menu
|
|||
|
* General information::
|
|||
|
* UTT file format::
|
|||
|
* Configuration files::
|
|||
|
* UTT components::
|
|||
|
* Auxiliary tools::
|
|||
|
* Usage examples::
|
|||
|
* PMDBF dictionary::
|
|||
|
@c * Examples::
|
|||
|
@c * Copyright::
|
|||
|
* GNU Free Documentation License::
|
|||
|
* Reporting bugs::
|
|||
|
* Author::
|
|||
|
@end menu
|
|||
|
@end ifnottex
|
|||
|
|
|||
|
|
|||
|
@c ----------------------------------------------------------------------
|
|||
|
|
|||
|
@node General information
|
|||
|
@chapter General information
|
|||
|
|
|||
|
UAM Text Tools (UTT) is a package of language processing tools
|
|||
|
developed at Adam Mickiewicz University. Its functionality includes:
|
|||
|
|
|||
|
@itemize @bullet
|
|||
|
|
|||
|
@item
|
|||
|
tokenization
|
|||
|
@item
|
|||
|
dictionary-based morphological analysis
|
|||
|
@item
|
|||
|
heuristic morphological analysis of unknown words
|
|||
|
@item
|
|||
|
spelling correction
|
|||
|
@item
|
|||
|
pattern search
|
|||
|
@item
|
|||
|
sentence splitting
|
|||
|
@item
|
|||
|
generation of concordance tables
|
|||
|
@end itemize
|
|||
|
|
|||
|
The toolkit is intended for processing raw (unannotated),
unrestricted text, whatever the application.
|
|||
|
|
|||
|
The system is organized as a collection of command-line programs, each
performing one operation, e.g. tokenization, lemmatization, or spelling
correction. The components are independent of one another; what
unifies them is a common input/output file format.
|
|||
|
|
|||
|
The components may be combined in various ways to provide various text
processing services. New components supplied by the user may also be
easily incorporated into the system, provided that they respect the
input/output file format conventions.
|
|||
|
|
|||
|
UTT component programs do not depend on any specific tagset or
|
|||
|
morphological description format.
|
|||
|
|
|||
|
UTT is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by
|
|||
|
the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
|
|||
|
|
|||
|
The Polex/PMDBF dictionary is licensed under the Creative Commons by-nc-sa License which prohibits commercial use.
|
|||
|
|
|||
|
|
|||
|
List of contributors:
|
|||
|
|
|||
|
@itemize
|
|||
|
@item Pawel Konieczka
|
|||
|
@item Tomasz Obrebski
|
|||
|
@item Michal Stolarski
|
|||
|
@item Marcin Walas
|
|||
|
@item Justyna Walkowska
|
|||
|
@end itemize
|
|||
|
|
|||
|
@c ----------------------------------------------------------------------
|
|||
|
@c ---------------------------------------------------------------------
|
|||
|
|
|||
|
@node UTT file format
|
|||
|
@chapter UTT file format
|
|||
|
|
|||
|
A UTT file contains annotation of a text. It consists of a sequence of
|
|||
|
segments. Each segment explicitly refers to a continuous piece of the
|
|||
|
text and provides some information on it.
|
|||
|
|
|||
|
@section Segment format
|
|||
|
|
|||
|
A segment occupies one line of a UTT file and consists of
|
|||
|
space-separated fields:
|
|||
|
|
|||
|
|
|||
|
@quotation
|
|||
|
@sp 1
|
|||
|
[@var{start} [@var{length}]] @var{type} @var{form} [@var{annotation1} [@var{annotation2} ...]]
|
|||
|
@sp 1
|
|||
|
@end quotation
|
|||
|
|
|||
|
@table @var
|
|||
|
|
|||
|
@item @var{start}
|
|||
|
Non-negative integer value indicating the position in the source text where the
|
|||
|
segment starts.
|
|||
|
|
|||
|
@item @var{length}
|
|||
|
Non-negative integer value indicating the length of the segment.
|
|||
|
|
|||
|
@item @var{type}
|
|||
|
A sequence of non-blank characters that contains at least one non-digit character (a purely numeric value could be misinterpreted as a @var{start} or @var{length} field).
@var{type} reflects the main classification of segments
into words, numbers, punctuation marks, and meta-text markers.
@xref{tok output,,tok output}, for a description of the automatically recognized type markers.
|
|||
|
|
|||
|
@item @var{form}
|
|||
|
This field contains the textual form of the segment or the special
|
|||
|
symbol @code{*} indicating that the form is not given (e.g. when the segment has been created artificially to mark something and is of length 0).
|
|||
|
|
|||
|
The characters or character sequences that have special meaning in the
|
|||
|
@var{form} field are enumerated below.
|
|||
|
|
|||
|
Characters with special meaning:
|
|||
|
|
|||
|
@itemize
|
|||
|
@item @code{_} - space character
|
|||
|
@item @code{*} - undefined contents
|
|||
|
@end itemize
|
|||
|
|
|||
|
Escape sequences:
|
|||
|
|
|||
|
@itemize
|
|||
|
@item @code{\n} - new line
|
|||
|
@item @code{\t} - tabulation
|
|||
|
@item @code{\r} - carriage return
|
|||
|
|
|||
|
@item @code{\_} - the @code{_} character
|
|||
|
@item @code{\*} - the @code{*} character
|
|||
|
@item @code{\\} - the @code{\} character
|
|||
|
|
|||
|
@c @item @code{\hh} - a character with hexadecimal code @code{hh} (used for non-printable characters)
|
|||
|
@end itemize
|
|||
|
|
|||
|
@item @var{annotation1}
|
|||
|
@item @var{annotation2}
|
|||
|
@item ...
|
|||
|
Annotation fields have the following format:
|
|||
|
|
|||
|
@var{longname} @code{:} @var{value}
|
|||
|
|
|||
|
or
|
|||
|
|
|||
|
@var{shortname} @var{value}
|
|||
|
|
|||
|
where @var{longname} is a string of alphanumeric characters
(as tested by @code{isalnum()}), @var{shortname} is a single punctuation character
(as tested by @code{ispunct()}), and @var{value} is an arbitrary string of non-blank characters.
|
|||
|
|
|||
|
@end table
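
The conventions for the @var{form} field described above can be summarized
in a few lines of Python. This is only an illustrative sketch, not code taken
from UTT: the function name @code{decode_form} is hypothetical, and the sketch
assumes that a backslash followed by any character not listed above is kept
unchanged, which this manual does not specify.

@verbatim
```python
# Illustrative sketch only; decode_form is a hypothetical helper, not part of UTT.
ESCAPES = {"n": "\n", "t": "\t", "r": "\r", "_": "_", "*": "*", "\\": "\\"}


def decode_form(form):
    """Return the text denoted by a form field, or None for a bare '*'."""
    if form == "*":
        return None  # undefined contents
    out = []
    i = 0
    while i < len(form):
        ch = form[i]
        if ch == "\\" and i + 1 < len(form) and form[i + 1] in ESCAPES:
            out.append(ESCAPES[form[i + 1]])  # documented escape sequence
            i += 2
        elif ch == "_":
            out.append(" ")  # an unescaped underscore stands for a space
            i += 1
        else:
            out.append(ch)  # ordinary character (or an unknown escape, kept as is)
            i += 1
    return "".join(out)
```
@end verbatim

With these rules, the field @code{_} decodes to a single space, while
@code{\_} decodes to a literal underscore.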
|
|||
|
|
|||
|
|
|||
|
Only two fields are mandatory: @var{type} and @var{form}. All other fields
may be absent. If only one number precedes the
@var{type} field, it is interpreted as the @var{start} position.
|
|||
|
|
|||
|
If the @var{length} field is omitted, the length of the segment is the
|
|||
|
length of the @var{form} field, except when the value of the
|
|||
|
@var{form} field is @code{*} -- in this case, the length is assumed to
|
|||
|
be 0.
|
|||
|
|
|||
|
If the @var{start} field is also absent, the segment is assumed to directly
|
|||
|
follow the preceding one.
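
The defaulting rules above can be made concrete with a small Python sketch
that splits one segment line into its fields. It is an illustration under
stated assumptions, not the parser used by the UTT programs: the name
@code{parse_segment} and the @code{prev_end} parameter are hypothetical, and
the sketch assumes that an escape sequence in the @var{form} field counts as
one character when the length is derived from the form.

@verbatim
```python
import re
import string


def parse_segment(line, prev_end=0):
    """Parse one UTT line into (start, length, type, form, annotations).

    prev_end is the position right after the preceding segment; it is
    used when the start field is absent.
    """
    fields = line.split()
    numbers = []
    # Up to two leading all-digit fields are start and length.
    while len(numbers) < 2 and fields and fields[0].isdigit():
        numbers.append(int(fields.pop(0)))
    if len(fields) < 2:
        raise ValueError("type and form are mandatory: %r" % line)
    seg_type, form = fields[0], fields[1]

    start = numbers[0] if numbers else prev_end
    if len(numbers) == 2:
        length = numbers[1]
    elif form == "*":
        length = 0  # undefined form: length 0 is assumed
    else:
        # Assumption: an escape sequence such as \n counts as one character.
        length = len(re.sub(r"\\.", "x", form))

    annotations = []
    for field in fields[2:]:
        if field[0] in string.punctuation:  # shortname followed by value
            annotations.append((field[0], field[1:]))
        else:  # longname:value
            name, _, value = field.partition(":")
            annotations.append((name, value))
    return start, length, seg_type, form, annotations
```
@end verbatim

When reading a whole file, a caller would pass @code{start + length} of each
parsed segment as @code{prev_end} for the next line.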
|
|||
|
|
|||
|
@c Conventions:
|
|||
|
|
|||
|
@c Annotation fields with predefined meaning:
|
|||
|
|
|||
|
@c @itemize
|
|||
|
@c @item @code{!} - UTT components are allowed to modify the contents of
|
|||
|
@c the @var{form} field (e.g. spelling correction does this). If this happens the
|
|||
|
@c original form of the segment have to be placed in the @code{!}-field.
|
|||
|
@c @item @code{@@} - morphological description
|
|||
|
@c @item @code{=} - node identifier assignment (used in graph encoding)
|
|||
|
@c @item @code{<} - preceding/dominating node(s) (used in graph encoding)
|
|||
|
@c @item @code{>} - succeeding/subordinate node(s) (used in graph encoding)
|
|||
|
@c @end itemize
|
|||
|
|
|||
|
Segments of length 0 may be used to mark file positions with some
|
|||
|
information. See e.g. BOS and EOS (beginning/end of sentence) markers
|
|||
|
in the example below.
|
|||
|
|
|||
|
Example:
|
|||
|
|
|||
|
text: @samp{Piszemy dobre progrumy. Warszawiacy też.}
|
|||
|
|
|||
|
@example
0000 00 BOS *
0000 07 W Piszemy lem:pisać,V
0007 01 S _
0008 05 W dobre lem:dobry,ADJ
0013 01 S _
0014 08 W progrumy cor:programy lem:program,N
0022 01 P .
0023 00 EOS *
0023 01 S _
0024 00 BOS *
0024 11 W Warszawiacy lem:Warszawiak,N
0035 01 S _
0036 03 W też
0039 01 P .
0040 00 EOS *
@end example
|
|||
|
|
|||
|
Length information may be omitted:

@example
0000 BOS *
0000 W Piszemy lem:pisać,V
0007 S _
0008 W dobre lem:dobry,ADJ
0013 S _
0014 W progrumy cor:programy lem:program,N
0022 P .
0023 EOS *
@end example
|
|||
|
|
|||
|
Position information may be provided for only some of the segments:
|
|||
|
|
|||
|
@example
0000 BOS *
W Piszemy lem:pisać,V
S _
W dobre lem:dobry,ADJ
S _
W progrumy cor:programy lem:program,N
P .
EOS *
S _
0024 BOS *
W Warszawiacy lem:Warszawiak,N
S _
W też
P .
EOS *
@end example
|
|||
|
|
|||
|
Position/length information may be provided only when necessary:
|
|||
|
|
|||
|
@example
0000 04 N *
0000 N 12
P .
N 5
S _
W km
@end example
|
|||
|
|
|||
|
@section UTT File
|
|||
|
|
|||
|
A UTT file consists of a sequence of segments. The same text position
|
|||
|
may be covered by multiple segments. In consequence, ambiguous text
|
|||
|
segmentation and ambiguous annotation may be represented.
|
|||
|
|
|||
|
There are two structural requirements a valid UTT-formatted file
|
|||
|
has to meet:
|
|||
|
|
|||
|
@itemize @bullet
|
|||
|
|
|||
|
@item
|
|||
|
segments have to be sorted with respect to the @var{start} field,
|
|||
|
|
|||
|
@item
|
|||
|
for each
|
|||
|
segment ending at position @var{n}, either there must be a segment starting at
|
|||
|
position @var{n+1}, or position @var{n+1} is not covered by any segment; similarly
|
|||
|
for each segment starting at position @var{n}, either there must be a segment
|
|||
|
ending at position @var{n-1}, or the position @var{n-1} must not be covered
|
|||
|
by any segment.
|
|||
|
|
|||
|
@end itemize
|
|||
|
|
|||
|
A valid annotation for the text fragment
|
|||
|
@example
|
|||
|
12.5 km
|
|||
|
@end example
|
|||
|
|
|||
|
may be
|
|||
|
|
|||
|
@example
0000 02 N 12
0000 04 N 12.5
0002 01 P .
0003 01 N 5
0004 01 S _
0005 02 W km
@end example
|
|||
|
|
|||
|
but not
|
|||
|
|
|||
|
@example
0000 02 N 12
0000 04 N 12.5
0004 01 S _
0005 02 W km
@end example
|
|||
|
|
|||
|
because in the latter example the first segment (starting at position 0000, 2 characters long) ends at position @var{n}=0001, while position @var{n+1}=0002 is covered by the second segment and no segment starts there.
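
The two structural requirements can be checked mechanically. The following
Python sketch is one possible reading of them, not the validation code of any
UTT program: the name @code{check_utt_structure} is hypothetical, segments are
given as pairs of start position and length, and segments of length 0 are
assumed to be exempt from the second requirement.

@verbatim
```python
def check_utt_structure(segments):
    """Return a list of problems found in a list of (start, length) pairs."""
    problems = []
    starts = [s for s, _ in segments]
    if starts != sorted(starts):
        problems.append("segments are not sorted by start position")

    # Inclusive (first, last) positions; zero-length markers are skipped
    # (an assumption of this sketch).
    spans = [(s, s + n - 1) for s, n in segments if n > 0]
    start_set = set(first for first, _ in spans)
    end_set = set(last for _, last in spans)

    def covered(pos):
        return any(first <= pos <= last for first, last in spans)

    for first, last in spans:
        # After a segment's end, a segment must start or nothing may cover it.
        if last + 1 not in start_set and covered(last + 1):
            problems.append("segment %d-%d ends inside another segment" % (first, last))
        # Before a segment's start, a segment must end or nothing may cover it.
        if first - 1 not in end_set and covered(first - 1):
            problems.append("segment %d-%d starts inside another segment" % (first, last))
    return problems
```
@end verbatim

Applied to the two annotations of @samp{12.5 km} shown above, the function
returns an empty list for the first one and reports the segment covering
positions 0 to 1 for the second.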
|
|||
|
|
|||
|
@section Character encoding
|
|||
|
|
|||
|
The UTT component programs accept only single-byte character encodings, such
as the ISO-8859 family and the ANSI and DOS code pages. UTF-8, which is a
multi-byte encoding, may work but has not been tested yet.
|
|||
|
|
|||
|
|
|||
|
@c @section Formats
|
|||
|
|
|||
|
@c @unnumberedsubsubsec Basic format
|
|||
|
|
|||
|
@c While processing large amounts of the overhead related with explicit
|
|||
|
@c ... of the start position and segment length becomes ... . Therefore,
|
|||
|
@c for efficiency reasons certain shortcuts are possible:
|
|||
|
|
|||
|
@c @unnumberedsubsubsec Relative start position
|
|||
|
|
|||
|
@c Start position may be given as relative distance from the last
|
|||
|
@c absolut position.
|
|||
|
|
|||
|
@c @unnumberedsubsubsec Absent length
|
|||
|
|
|||
|
@c Segment length may by omitted. Normally it can be restored by counting
|
|||
|
@c the length of the @emph{form field}. For segments with the special value
|
|||
|
@c @code{*} in the @emph{form field} length 0 is assumed.
|
|||
|
|
|||
|
@c @unnumberedsubsubsec Absent length and start position
|
|||
|
|
|||
|
@c Both start position and segment length may be omitted. In this format
|
|||
|
@c each segment is assumed to follow the previous one. This format is,
|
|||
|
@c therefore, suitable only for unambiguously tagged text
|
|||
|
@c (0-length markers can be still used.)
|
|||
|
|
|||
|
|
|||
|
@c @table @code
|
|||
|
@c @item AL
|
|||
|
@c @code{1234 03 W kot}
|
|||
|
@c @item RL
|
|||
|
@c @code{+56 03 W kot}
|
|||
|
@c @item A
|
|||
|
@c @code{1234 W kot}
|
|||
|
@c @item R
|
|||
|
@c @code{+56 W kot}
|
|||
|
@c @item 0
|
|||
|
@c @code{W kot}
|
|||
|
@c @end table
|
|||
|
|
|||
|
|
|||
|
@c [HOW TO GET POLISH FONTS IN DVI???]
|
|||
|
|
|||
|
@macro parhelp
|
|||
|
@item @b{@minus{}@minus{}help}, @b{@minus{}h}
|
|||
|
Print help.
|
|||
|
@end macro
|
|||
|
|
|||
|
|
|||
|
@macro parversion
|
|||
|
@item @b{@minus{}@minus{}version}, @b{@minus{}V}
|
|||
|
Print version information.
|
|||
|
@end macro
|
|||
|
|
|||
|
@macro parinteractive
|
|||
|
@item @b{@minus{}@minus{}interactive, @minus{}i}
|
|||
|
This option toggles interactive mode, which is by default off. In the
|
|||
|
interactive mode the program does not buffer the output.
|
|||
|
@end macro
|
|||
|
|
|||
|
|
|||
|
@c @macro parfile
|
|||
|
@c @item @b{@minus{}@minus{}file=@var{filename}, @minus{}f @var{filename}}
|
|||
|
@c Input file name.
|
|||
|
@c If this option is absent or equal to '@minus{}', the program
|
|||
|
@c reads from the standard input.
|
|||
|
@c @end macro
|
|||
|
|
|||
|
|
|||
|
@c @macro paroutput
|
|||
|
@c @item @b{@minus{}@minus{}output=@var{filename}, @minus{}o @var{filename}}
|
|||
|
@c Regular output file name. To regular output the program sends segments
|
|||
|
@c which it successfully processed and copies those which were not
|
|||
|
@c subject to processing. If this option is absent or equal to
|
|||
|
@c '@minus{}', standard output is used.
|
|||
|
@c @end macro
|
|||
|
|
|||
|
@c @macro parfail
|
|||
|
@c @item @b{@minus{}@minus{}fail=@var{filename}, @minus{}e @var{filename}}
|
|||
|
@c Fail output file name. To fail output the program copies the segments
|
|||
|
@c it failed to process. If this option is absent or equal to
|
|||
|
@c '@minus{}', standard output is used.
|
|||
|
@c @end macro
|
|||
|
|
|||
|
|
|||
|
@c @macro parcopy
|
|||
|
@c @item @b{@minus{}@minus{}copy, @minus{}c}
|
|||
|
@c Copy succesfully processed segments to regular output also in their
|
|||
|
@c original input form.
|
|||
|
@c @end macro
|
|||
|
|
|||
|
|
|||
|
@macro parinputfield
|
|||
|
@item @b{@minus{}@minus{}input-field=@var{fieldname}, @minus{}I @var{fieldname}}
|
|||
|
The field containing the input to the program. The default is the
|
|||
|
@var{form} field. The fields @var{start}, @var{length}, @var{type},
|
|||
|
and @var{form} are referred to as @code{1}, @code{2}, @code{3},
|
|||
|
@code{4}, respectively.
|
|||
|
@end macro
|
|||
|
|
|||
|
|
|||
|
@macro paroutputfield
|
|||
|
@item @b{@minus{}@minus{}output-field=@var{fieldname}, @minus{}O @var{fieldname}}
|
|||
|
The name of the field added by the program. The default is the name of the program.
|
|||
|
@end macro
|
|||
|
|
|||
|
|
|||
|
@macro pardictionary
|
|||
|
@item @b{@minus{}@minus{}dictionary=@var{filename}, @minus{}d @var{filename}}
|
|||
|
Dictionary file name.
|
|||
|
@end macro
|
|||
|
|
|||
|
|
|||
|
@macro parprocess
|
|||
|
@item @b{@minus{}@minus{}process=@var{type}, @minus{}p @var{type}}
|
|||
|
Process segments with the specified value in the @var{type} field.
|
|||
|
Multiple occurrences of this option are allowed and are interpreted as
|
|||
|
disjunction. If this option is absent, all segments are processed.
|
|||
|
@end macro
|
|||
|
|
|||
|
|
|||
|
@macro parselect
|
|||
|
@item @b{@minus{}@minus{}select=@var{fieldname}, @minus{}s @var{fieldname}}
|
|||
|
Select for processing only segments in which the field named
|
|||
|
@var{fieldname} is present. Multiple occurrences of this option are
|
|||
|
allowed and are interpreted as conjunction of conditions. If this
|
|||
|
option is absent, all segments are processed.
|
|||
|
@end macro
|
|||
|
|
|||
|
|
|||
|
@macro parunselect
|
|||
|
@item @b{@minus{}@minus{}unselect=@var{fieldname}, @minus{}S @var{fieldname}}
|
|||
|
Select for processing only segments in which the field @var{fieldname}
|
|||
|
is absent. Multiple occurrences of this option are allowed and are
|
|||
|
interpreted as conjunction of conditions. If this option is absent,
|
|||
|
all segments are processed.
|
|||
|
@end macro
|
|||
|
|
|||
|
|
|||
|
@macro paroneline
|
|||
|
@item @b{@minus{}@minus{}one-line}
|
|||
|
This option makes the program print ambiguous annotation in one output
line by generating multiple annotation fields. By default, when
ambiguous annotation is produced for a segment, the segment is
duplicated and each of the annotations is added to a separate copy of
the segment.
|
|||
|
@end macro
|
|||
|
|
|||
|
|
|||
|
@macro paronefield
|
|||
|
@item @b{@minus{}@minus{}one-field, @minus{}1}
|
|||
|
This option makes the program print ambiguous annotation in one
annotation field. By default, when ambiguous annotation is produced
for a segment, the segment is duplicated and each of the
annotations is added to a separate copy of the segment.
|
|||
|
|
|||
|
This option is useful when working with @command{kot} or @command{con}.
|
|||
|
@end macro
|
|||
|
|
|||
|
|
|||
|
@c ---------------------------------------------------------------------
|
|||
|
@c ---------------------------------------------------------------------
|
|||
|
|
|||
|
@c @node Common command line options
|
|||
|
@c @chapter Common command line options
|
|||
|
|
|||
|
@c @table @code
|
|||
|
|
|||
|
@c @parhelp
|
|||
|
|
|||
|
@c @item @b{@minus{}@minus{}help}, @b{@minus{}h}
|
|||
|
@c Print help.
|
|||
|
|
|||
|
@c @item @b{@minus{}@minus{}version}, @b{@minus{}v}
|
|||
|
@c Print version information.
|
|||
|
|
|||
|
@c @item @b{@minus{}@minus{}file=@var{filename}, @minus{}f @var{filename}}
|
|||
|
@c Input file name.
|
|||
|
@c If this option is absent or equal to '@minus{}', the program
|
|||
|
@c reads from the standard input.
|
|||
|
|
|||
|
@c @item @b{@minus{}@minus{}output=@var{filename}, @minus{}o @var{filename}}
|
|||
|
@c Regular output file name. To regular output the program sends segments
|
|||
|
@c which it successfully processed and copies those which were not
|
|||
|
@c subject to processing. If this option is absent or equal to
|
|||
|
@c '@minus{}', standard output is used.
|
|||
|
|
|||
|
@c @item @b{@minus{}@minus{}fail=@var{filename}, @minus{}e @var{filename}}
|
|||
|
@c Fail output file name. To fail output the program copies the segments
|
|||
|
@c it failed to process. If this option is absent or equal to
|
|||
|
@c '@minus{}', standard output is used.
|
|||
|
|
|||
|
@c @item @b{@minus{}@minus{}only-fail}
|
|||
|
@c Discard segments which would normally be sent to regular
|
|||
|
@c output. Print only segments the program failed to process.
|
|||
|
|
|||
|
@c @item @b{@minus{}@minus{}no-fail}
|
|||
|
@c Discard segments the program failed to process.
|
|||
|
@c (This and the previous option are functionally equivalent to,
|
|||
|
@c respectively, @option{-o /dev/null} and @option{-e /dev/null}, but
|
|||
|
@c make the programs run faster.)
|
|||
|
|
|||
|
@c @item @b{@minus{}@minus{}input-field=@var{fieldname}, @minus{}I @var{fieldname}}
|
|||
|
@c The field containing the input to the program. The default is usually
|
|||
|
@c the @var{form} field (unless otherwise stated in the program
|
|||
|
@c description). The fields @var{position}, @var{length}, @var{tag}, and
|
|||
|
@c @var{form} are referred to as @code{1}, @code{2}, @code{3}, @code{4},
|
|||
|
@c respectively.
|
|||
|
|
|||
|
@c @item @b{@minus{}@minus{}output-field=@var{fieldname}, @minus{}O @var{fieldname}}
|
|||
|
@c The name of the field added by the program. The default is the name of
|
|||
|
@c the program.
|
|||
|
|
|||
|
@c @c @item @b{@minus{}@minus{}copy, @minus{}c}
|
|||
|
@c @c Copy processed segments to regular output.
|
|||
|
|
|||
|
@c @item @b{@minus{}@minus{}dictionary=@var{filename}, @minus{}d @var{filename}}
|
|||
|
@c Dictionary file name.
|
|||
|
@c (This option is used by programs which use dictionary data.)
|
|||
|
|
|||
|
@c @item @b{@minus{}@minus{}process=@var{tag}, @minus{}p @var{tag}}
|
|||
|
@c Process segments with the specified value in the @var{tag} field.
|
|||
|
@c Multiple occurences of this option are allowed and are interpreted as
|
|||
|
@c disjunction. If this option is absent, all segments are processed.
|
|||
|
|
|||
|
@c @item @b{@minus{}@minus{}select=@var{fieldname}, @minus{}s @var{fieldname}}
|
|||
|
@c Select for processing only segments in which the field named
|
|||
|
@c @var{fieldname} is present. Multiple occurences of this option are
|
|||
|
@c allowed and are interpreted as conjunction of conditions. If this
|
|||
|
@c option is absent, all segments are processed.
|
|||
|
|
|||
|
@c @item @b{@minus{}@minus{}unselect=@var{fieldname}, @minus{}S @var{fieldname}}
|
|||
|
@c Select for processing only segments in which the field @var{fieldname}
|
|||
|
@c is absent. Multiple occurences of this option are allowed and are
|
|||
|
@c interpreted as conjunction of conditions. If this option is absent,
|
|||
|
@c all segments are processed.
|
|||
|
|
|||
|
@c @item @b{@minus{}@minus{}interactive @minus{}i}
|
|||
|
@c This option toggles interactive mode, which is by default off. In the
|
|||
|
@c interactive mode the program does not buffer the output.
|
|||
|
|
|||
|
@c @item @b{@minus{}@minus{}config=@var{filename}}
|
|||
|
@c Read configuration from file @file{@var{filename}}.
|
|||
|
|
|||
|
@c @item @b{@minus{}@minus{}one @minus{}1}
|
|||
|
@c This option makes the program print ambiguous annotation in one output
|
|||
|
@c segment. By default when
|
|||
|
@c ambiguous new annotation is being produced for a segment, the segment
|
|||
|
@c is multiplicated and each of the annotations is added to separate copy
|
|||
|
@c of the segment.
|
|||
|
|
|||
|
@c @end table
|
|||
|
|
|||
|
@c ---------------------------------------------------------------------
|
|||
|
@c CONFIGURATION FILES
|
|||
|
@c ---------------------------------------------------------------------
|
|||
|
|
|||
|
@node Configuration files
|
|||
|
@chapter Configuration files
|
|||
|
|
|||
|
Values for all command line options accepted by a component
|
|||
|
may be set in configuration files. The default locations of the
configuration files for a component named @command{@var{program}} are
|
|||
|
|
|||
|
@example
|
|||
|
@file{/etc/utt/conf/@var{program}.conf}
|
|||
|
@end example
|
|||
|
|
|||
|
for the system-wide configuration file and
|
|||
|
|
|||
|
@example
|
|||
|
@file{~/.utt/conf/@var{program}.conf}
|
|||
|
@end example
|
|||
|
|
|||
|
for the user configuration file.
|
|||
|
|
|||
|
@c The configuration file to load may be also specified with the
|
|||
|
@c @option{--config} option. Configuration file need not be provided.
|
|||
|
|
|||
|
For each option, the value is taken from the first of the following sources that sets it (highest priority first):
|
|||
|
|
|||
|
@itemize
|
|||
|
@item command line
|
|||
|
@c @item configuration file indicated with @option{--config} option
|
|||
|
@item user configuration file (or configuration file indicated with the @option{--config} option)
|
|||
|
@item system-wide configuration file
|
|||
|
@end itemize
|
|||
|
|
|||
|
Parameter values are specified in the following format:
|
|||
|
|
|||
|
@var{parametername}=@var{value}
|
|||
|
|
|||
|
where @var{parametername} is the short or long name of an option accepted by
|
|||
|
the program, or
|
|||
|
|
|||
|
@var{parametername}
|
|||
|
|
|||
|
if the option does not need arguments.
|
|||
|
|
|||
|
Comments in configuration files are introduced with the @code{#} sign.
|
|||
|
|
|||
|
If a program accepts multiple occurrences of an option (e.g. the @option{--select} option of @command{lem}), you can specify them on separate lines of the program's configuration file.
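
The configuration file syntax and the priority order described above can be
illustrated with a short Python sketch. It is not the configuration reader of
the UTT programs: the names @code{read_config} and @code{effective_options}
are hypothetical. The sketch further assumes that @code{#} starts a comment
anywhere on a line, and that a higher-priority source replaces all values
given for the same option by lower-priority sources.

@verbatim
```python
def read_config(text):
    """Parse configuration text into a dict mapping option names to value lists."""
    options = {}
    for raw in text.splitlines():
        # Assumption: '#' starts a comment anywhere on the line.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        name, sep, value = line.partition("=")
        # An option without arguments is recorded with the value True.
        entry = value.strip() if sep else True
        # Repeated options accumulate, one value per line.
        options.setdefault(name.strip(), []).append(entry)
    return options


def effective_options(command_line, user, system):
    """Combine three option dicts; earlier arguments take priority."""
    merged = dict(system)
    merged.update(user)
    merged.update(command_line)
    return merged
```
@end verbatim

For instance, a file containing the two lines @code{select=cor} and
@code{select=lem} yields both values for the @code{select} option.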
|
|||
|
|
|||
|
@c The equal sign may be omitted.
|
|||
|
|
|||
|
|
|||
|
@quotation Tip
|
|||
|
If you have two (or more) frequently used sets of options for the same
program (e.g. @command{lem} with the PMDBF dictionary and @command{lem} with a user dictionary),
a good solution is to create two symbolic links to @command{lem}, called
e.g. @command{lemg} and @command{lemu}, and specify their configuration in the files @file{lemg.conf}
and @file{lemu.conf}, respectively.
|
|||
|
@end quotation
|
|||
|
|
|||
|
@c ---------------------------------------------------------------------
|
|||
|
@c COMPONENTS
|
|||
|
@c ---------------------------------------------------------------------
|
|||
|
|
|||
|
@node UTT components
|
|||
|
@chapter UTT components
|
|||
|
|
|||
|
UTT components are of three types:
|
|||
|
|
|||
|
@menu
|
|||
|
Sources: programs which read non-UTT data (e.g. raw text) and produce output
|
|||
|
in UTT format
|
|||
|
* tok:: a tokenizer
|
|||
|
|
|||
|
Filters: programs which read and produce UTT-formatted data
|
|||
|
@c * sen - the sentencizer::
|
|||
|
* lem:: a morphological analyzer
|
|||
|
* gue:: a morphological guesser
|
|||
|
* cor:: a spelling corrector
|
|||
|
* sen:: a sentence splitter
|
|||
|
@c * gph - the graphizer::
|
|||
|
* ser:: a pattern search tool (marks matches)
|
|||
|
* grp:: a pattern search tool (selects sentences containing a match)
|
|||
|
|
|||
|
Sinks: programs which read UTT data and produce output in another format
|
|||
|
* kot:: an untokenizer
|
|||
|
* con:: a concordance table generator
|
|||
|
@end menu
|
|||
|
|
|||
|
@c ---------------------------------------------------------------------
|
|||
|
@c TOK
|
|||
|
@c ---------------------------------------------------------------------
|
|||
|
|
|||
|
@page
|
|||
|
@node tok
|
|||
|
@section tok - a tokenizer
|
|||
|
|
|||
|
@c ----------------------------------------
|
|||
|
|
|||
|
@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
|
|||
|
@item @strong{Authors:} @tab Tomasz Obrębski
|
|||
|
@item @strong{Component category:} @tab source
|
|||
|
@end multitable
|
|||
|
|
|||
|
|
|||
|
@menu
|
|||
|
* tok description::
|
|||
|
* tok input::
|
|||
|
* tok output::
|
|||
|
* tok command line options::
|
|||
|
* tok example::
|
|||
|
@end menu
|
|||
|
|
|||
|
@node tok description
|
|||
|
@subsection Description
|
|||
|
|
|||
|
@code{tok} is a simple program which reads a text file and identifies
|
|||
|
tokens on the basis of their orthographic form. The type of the token
|
|||
|
is printed as the @var{type} field.
|
|||
|
|
|||
|
@node tok input
|
|||
|
@subsection Input
|
|||
|
|
|||
|
Raw text.
|
|||
|
|
|||
|
@node tok output
|
|||
|
@subsection Output
|
|||
|
|
|||
|
A UTT file with four fields: @var{start}, @var{length}, @var{type}, and @var{form}. In the @var{type} field five types of tokens are distinguished:
|
|||
|
|
|||
|
@itemize
|
|||
|
|
|||
|
@item @code{W} (word):
a continuous sequence of letters

@item @code{N} (number):
a continuous sequence of digits

@item @code{S} (space):
a continuous sequence of space characters

@item @code{P} (punctuation mark):
a single printable character not belonging to any of the other classes

@item @code{B} (unprintable character):
a single unprintable character
|
|||
|
|
|||
|
@end itemize
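
The classification above can be approximated in a few lines of Python. The
sketch below is not the implementation of @command{tok}; the function name
@code{tokenize} is hypothetical, and Python's own character tests
(@code{isalpha}, @code{isdigit}, @code{isspace}, @code{isprintable}) are
assumed to be an acceptable stand-in for the character classes that
@command{tok} uses. The sketch yields the raw text of each token and does not
apply the @var{form} field escaping.

@verbatim
```python
def tokenize(text):
    """Yield (start, length, type, text) tuples in the spirit of tok."""
    def kind(ch):
        if ch.isalpha():
            return "W"
        if ch.isdigit():
            return "N"
        if ch.isspace():
            return "S"
        return "P" if ch.isprintable() else "B"

    pos = 0
    while pos < len(text):
        k = kind(text[pos])
        end = pos + 1
        # W, N and S tokens extend over a run; P and B are single characters.
        while k in "WNS" and end < len(text) and kind(text[end]) == k:
            end += 1
        yield pos, end - pos, k, text[pos:end]
        pos = end
```
@end verbatim

For the input @samp{12.5 km}, this produces five tokens of types @code{N},
@code{P}, @code{N}, @code{S}, and @code{W}.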
|
|||
|
|
|||
|
|
|||
|
|
|||
|
@node tok command line options
|
|||
|
@subsection Command line options
|
|||
|
|
|||
|
@table @code
|
|||
|
|
|||
|
@item @b{@minus{}@minus{}help}, @b{@minus{}h}
|
|||
|
Print help.
|
|||
|
|
|||
|
@item @b{@minus{}@minus{}version}, @b{@minus{}V}
|
|||
|
Print version information.
|
|||
|
|
|||
|
@item @b{@minus{}@minus{}interactive, @minus{}i}
|
|||
|
This option toggles interactive mode, which is by default off. In the
|
|||
|
interactive mode the program does not buffer the output.
|
|||
|
|
|||
|
@end table
|
|||
|
|
|||
|
@node tok example
|
|||
|
@subsection Example
|
|||
|
|
|||
|
Input:
|
|||
|
|
|||
|
@example
|
|||
|
Piszemy dobre programy.
|
|||
|
@end example
|
|||
|
|
|||
|
Output:
|
|||
|
|
|||
|
@example
0000 07 W Piszemy
0007 01 S _
0008 05 W dobre
0013 01 S _
0014 08 W programy
0022 01 P .
0023 01 S \n
@end example
|
|||
|
|
|||
|
|
|||
|
@c ---------------------------------------------------------------------
|
|||
|
@c SEN
|
|||
|
@c ---------------------------------------------------------------------
|
|||
|
|
|||
|
@c @node sen - sentencizer
|
|||
|
@c @chapter sen - sentencizer
|
|||
|
|
|||
|
@c Authors: Tomasz Obrębski
|
|||
|
|
|||
|
@c ---------------------------------------------------------------------
|
|||
|
@c LEM
|
|||
|
@c ---------------------------------------------------------------------
|
|||
|
|
|||
|
@page
|
|||
|
@node lem
|
|||
|
@section lem - morphological analyzer
|
|||
|
|
|||
|
@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
|
|||
|
@item @strong{Authors:} @tab Tomasz Obrębski, Michał Stolarski
|
|||
|
@item @strong{Component category:} @tab filter
|
|||
|
@end multitable
|
|||
|
|
|||
|
@menu
|
|||
|
* lem description::
|
|||
|
* lem command line options::
|
|||
|
* lem input::
|
|||
|
* lem output::
|
|||
|
* lem example::
|
|||
|
* lem dictionaries::
|
|||
|
* lem hints::
|
|||
|
@end menu
|
|||
|
|
|||
|
@node lem description
|
|||
|
@subsection Description
|
|||
|
|
|||
|
@command{lem} performs morphological analysis of a simple orthographic
|
|||
|
word, returning all its possible morphological annotations,
|
|||
|
disregarding the context.
|
|||
|
|
|||
|
@c ----------------------------------------
|
|||
|
|
|||
|
@node lem command line options
|
|||
|
@subsection Command line options
|
|||
|
|
|||
|
@table @code
|
|||
|
@parhelp
|
|||
|
@parversion
|
|||
|
@parinteractive
|
|||
|
@c @parfile
|
|||
|
@c @paroutput
|
|||
|
@c @parfail
|
|||
|
@c @parcopy
|
|||
|
@parinputfield
|
|||
|
@paroutputfield
|
|||
|
@pardictionary
|
|||
|
@parprocess
|
|||
|
@parselect
|
|||
|
@parunselect
|
|||
|
@paroneline
|
|||
|
@paronefield
|
|||
|
@end table
|
|||
|
|
|||
|
@c ----------------------------------------
|
|||
|
|
|||
|
@node lem input
|
|||
|
@subsection Input
|
|||
|
|
|||
|
@command{lem} reads a UTT file and processes the value of the @var{form} field
(the input field may be changed with the @option{--input-field} option).
|
|||
|
|
|||
|
@node lem output
|
|||
|
@subsection Output
|
|||
|
|
|||
|
@command{lem} adds a new annotation field, whose default name is @code{lem}. In
case of ambiguity, one of three things happens: the segment is duplicated once
per annotation (default), multiple @code{lem} fields are added to one segment
(@option{--one-line}), or all alternatives are packed into the value of a
single @code{lem} field (@option{--one-field}, @option{-1}):
|
|||
|
|
|||
|
@itemize @bullet
|
|||
|
|
|||
|
@item
|
|||
|
unambiguous value format:
|
|||
|
|
|||
|
@example
|
|||
|
<lemma>,<descr>
|
|||
|
@end example
|
|||
|
|
|||
|
@item
|
|||
|
ambiguous value format (@option{--one-field} option)
|
|||
|
|
|||
|
|
|||
|
@example
|
|||
|
<lemma>,<descr>[,<descr>][;<lemma>,<descr>[,<descr>]]
|
|||
|
@end example
|
|||
|
|
|||
|
(alternative descriptions for the same lemma are separated by commas,
|
|||
|
alternative lemmata are separated by semicolons.)
|
|||
|
|
|||
|
@end itemize
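
The @option{--one-field} value format can be unpacked mechanically. The
following is a minimal Python sketch, not part of UTT: the function name
@code{parse_lem_value} is hypothetical, and the code assumes that lemmata
and descriptions themselves contain no commas or semicolons.

@example
```python
def parse_lem_value(value):
    """Split a one-field lem value into (lemma, [descr, ...]) pairs."""
    readings = []
    # Alternative lemmata are separated by semicolons.
    for part in value.split(";"):
        # The first comma-separated item is the lemma,
        # the remaining items are its alternative descriptions.
        items = part.split(",")
        readings.append((items[0], items[1:]))
    return readings


print(parse_lem_value("program,N/GiNpCa,N/GiNpCn,N/GiNpCv"))
```
@end example

Each returned pair holds one lemma together with the list of its
alternative descriptions.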
|
|||
|
|
|||
|
@node lem example
|
|||
|
@subsection Example
|
|||
|
|
|||
|
Input:
|
|||
|
|
|||
|
@example
0000 07 W Piszemy
0007 01 S _
0008 05 W dobre
0013 01 S _
0014 08 W programy
0022 01 P .
0023 01 S \n
@end example
|
|||
|
|
|||
|
Output (default):
|
|||
|
|
|||
|
@example
0000 07 W Piszemy lem:pisać,V/AiVpMdTrfNpP1
0007 01 S _
0008 05 W dobre lem:dobry,ADJ/DpNpCnavGaifn
0008 05 W dobre lem:dobry,ADJ/DpNsCnavGn
0013 01 S _
0014 08 W programy lem:program,N/GiNpCa
0014 08 W programy lem:program,N/GiNpCn
0014 08 W programy lem:program,N/GiNpCv
0022 01 P .
0023 01 S \n
@end example
|
|||
|
|
|||
|
Output (@option{--one-line} option):
|
|||
|
|
|||
|
@example
0000 07 W Piszemy lem:pisać,V/AiVpMdTrfNpP1
0007 01 S _
0008 05 W dobre lem:dobry,ADJ/DpNpCnavGaifn lem:dobry,ADJ/DpNsCnavGn
0013 01 S _
0014 08 W programy lem:program,N/GiNpCa lem:program,N/GiNpCn lem:program,N/GiNpCv
0022 01 P .
0023 01 S \n
@end example
|
|||
|
|
|||
|
Output (@option{--one-field} option):
|
|||
|
|
|||
|
@example
0000 07 W Piszemy lem:pisać,V/AiVpMdTrfNpP1
0007 01 S _
0008 05 W dobre lem:dobry,ADJ/DpNpCnavGaifn,ADJ/DpNsCnavGn
0013 01 S _
0014 08 W programy lem:program,N/GiNpCa,N/GiNpCn,N/GiNpCv
0022 01 P .
0023 01 S \n
@end example
|
|||
|
|
|||
|
@c ----------------------------------------
|
|||
|
|
|||
|
@node lem dictionaries
|
|||
|
@subsection Dictionaries
|
|||
|
|
|||
|
@command{lem} requires a dictionary. The dictionary may be provided in
|
|||
|
one of two formats: in text (source) format or in binary (fsa) format.
|
|||
|
|
|||
|
@subsubheading Text format
|
|||
|
|
|||
|
Dictionary entries have the following structure:
|
|||
|
|
|||
|
@example
|
|||
|
<form>;<lemma>,<descr>[;<lemma>,<descr>]
|
|||
|
@end example
|
|||
|
|
|||
|
@var{lemma} may be given explicitly or in the cut-add format:
|
|||
|
|
|||
|
@example
|
|||
|
@code{[<cut1><add1>-]<cut2><add2>}
|
|||
|
@end example
|
|||
|
|
|||
|
meaning: replace the prefix of length @code{<cut1>} with the
string @code{<add1>}, and replace the suffix of length @code{<cut2>} with the string
@code{<add2>}. For example, @code{3t} transforms @samp{kocie} into
@samp{kot}, and @code{3-4ały} transforms @samp{najbielsi} into @samp{biały}.
|
|||
|
|
|||
|
Each dictionary entry must be written on one line and must not contain blank characters.
|
|||
|
|
|||
|
Examples:
|
|||
|
@example
kot;0,N/GaNsCn
kota;1,N/GaNsCg;1,N/GaNsCa
kotu;1,N/GaNsCd
kotem;2,N/GaNsCi
kocie;3t,N/GaNsCl;3t,N/GaNsCv
najbielsi;3-4ały,ADJ/DsNpCnGp
najbielsze;3-5ały,ADJ/DsNpCnGaifn
najlepsi;dobry,ADJ/DsNpCnGp
najlepsze;dobry,ADJ/DsNpCnGaifn
@end example
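
The cut-add notation used in the entries above can be illustrated with a
short Python sketch. It is not the code used by @command{lem}; the names
@code{CUT_ADD} and @code{apply_cut_add} are hypothetical, and the sketch
assumes that an explicitly given lemma never starts with a digit and
that @code{<add1>} contains no hyphen.

@example
```python
import re

# Optional "<cut1><add1>-" prefix part, then the "<cut2><add2>" suffix part.
CUT_ADD = re.compile(r"^(?:(\d+)([^-]*)-)?(\d+)(.*)$")


def apply_cut_add(form, spec):
    """Return the lemma obtained by applying a cut-add spec to a form."""
    match = CUT_ADD.match(spec)
    if match is None:
        # No leading digit: the lemma is given explicitly.
        return spec
    cut1, add1, cut2, add2 = match.groups()
    stem = form
    if cut1 is not None:
        # Replace the prefix of length cut1 with add1.
        stem = add1 + stem[int(cut1):]
    # Replace the suffix of length cut2 with add2.
    return stem[:max(0, len(stem) - int(cut2))] + add2


pairs = [("kocie", "3t"), ("najbielsi", "3-4ały"), ("najlepsi", "dobry")]
for form, spec in pairs:
    print(form, "->", apply_cut_add(form, spec))
```
@end example

Storing only the cut lengths and the added strings keeps the entries of
regular paradigms short, while irregular forms such as @samp{najlepsi}
can still name their lemma explicitly.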
|
|||
|
|
|||
|
|
|||
|
The mandatory file name extension for a text dictionary is @code{dic}. For large
|
|||
|
dictionaries it is preferable, however, to compile them into binary
|
|||
|
(fsa) format.
|
|||
|
|
|||
|
@subsubheading Binary format
|
|||
|
|
|||
|
The mandatory file name extension for a binary dictionary is @code{bin}. To
|
|||
|
compile a text dictionary into binary format, write:
|
|||
|
|
|||
|
@example
|
|||
|
compiledic <dictionaryname>.dic
|
|||
|
@end example
|
|||
|
|
|||
|
@subsubheading Polex/PMDBF dictionary
|
|||
|
|
|||
|
A large-coverage morphological dictionary for Polish, Polex/PMDBF, is included in
the distribution as the default dictionary of @command{lem}. By default it is
located in:
|
|||
|
|
|||
|
@file{$HOME/.utt/pl/lem.bin}
|
|||
|
|
|||
|
@node lem hints
|
|||
|
@subsection Hints
|
|||
|
|
|||
|
@c @subsubheading Combining data from multiple dictionaries
|
|||
|
|
|||
|
@c @itemize
|
|||
|
|
|||
|
@c @item Apply <dict1>, then apply <dict2> to words which were not annotatated.
|
|||
|
|
|||
|
@c @example
|
|||
|
@c lem -d <dict1> | lem -S lem -d <dict2>
|
|||
|
@c @end example
|
|||
|
|
|||
|
@c @item Add annotations from two dictionaries <dict1> and <dict2>.
|
|||
|
|
|||
|
@c @example
|
|||
|
@c lem -c -d <dict1> | lem -S lem -d <dict2>
|
|||
|
@c @end example
|
|||
|
|
|||
|
@c @end itemize
|
|||
|
|
|||
|
|
|||
|
@c ---------------------------------------------------------------------
|
|||
|
@c GUE
|
|||
|
@c ---------------------------------------------------------------------
|
|||
|
|
|||
|
@page
|
|||
|
@node gue
|
|||
|
@section gue - morphological guesser
|
|||
|
|
|||
|
@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
|
|||
|
|
|||
|
@item @strong{Authors:} @tab Michał Stolarski, Tomasz Obrębski
|
|||
|
@item @strong{Component category:} @tab filter
|
|||
|
|
|||
|
@end multitable
|
|||
|
|
|||
|
@command{gue} guesses morphological descriptions of the form contained
|
|||
|
in the @var{form} field.
|
|||
|
|
|||
|
@menu
|
|||
|
* gue command line options::
|
|||
|
* gue example::
|
|||
|
* gue dictionaries::
|
|||
|
@end menu
|
|||
|
|
|||
|
@node gue command line options
|
|||
|
@subsection Command line options
|
|||
|
|
|||
|
@table @code
|
|||
|
|
|||
|
@parhelp
|
|||
|
@parversion
|
|||
|
@parinteractive
|
|||
|
@c @parfile
|
|||
|
@c @paroutput
|
|||
|
@c @parfail
|
|||
|
@c @parcopy
|
|||
|
@parinputfield
|
|||
|
@paroutputfield
|
|||
|
@pardictionary
|
|||
|
@parprocess
|
|||
|
@parselect
|
|||
|
@parunselect
|
|||
|
@paroneline
|
|||
|
@paronefield
|
|||
|
|
|||
|
@item @b{@minus{}@minus{}delta=@var{n}}
|
|||
|
Stop displaying answers after a drop in weight, that is, when the weight difference between two consecutive results exceeds the delta value (default=`0.2').
|
|||
|
|
|||
|
|
|||
|
@item @b{@minus{}@minus{}cut-off=@var{n}}
|
|||
|
Do not display answers whose weight is lower than the cut-off value (default=`200').
|
|||
|
|
|||
|
|
|||
|
@item @b{@minus{}@minus{}guess_count=@var{n}, @minus{}n @var{n}}
|
|||
|
Guess up to @var{n} descriptions (default=`0', which means `display all results').
|
|||
|
|
|||
|
|
|||
|
|
|||
|
@end table
|
|||
|
|
|||
|
@node gue example
|
|||
|
@subsection Example
|
|||
|
|
|||
|
@example
|
|||
|
command: gue -n 2
|
|||
|
|
|||
|
input:
|
|||
|
0000 07 W smerfny
|
|||
|
|
|||
|
output:
|
|||
|
0000 07 W smerfny gue:,ADJ/CaDpGiNs
|
|||
|
0000 07 W smerfny gue:,ADJ/CnvDpGaipNs
|
|||
|
@end example
|
|||
|
|
|||
|
|
|||
|
@node gue dictionaries
|
|||
|
@subsection Dictionaries
|
|||
|
|
|||
|
@command{gue} requires a dictionary. For now, the dictionary must be provided in binary (fsa) format.
|
|||
|
The fsa format is created by compiling text-format dictionaries.
|
|||
|
|
|||
|
|
|||
|
|
|||
|
@subsubheading Text format
|
|||
|
|
|||
|
Dictionary entries have the following structure:
|
|||
|
|
|||
|
@example
|
|||
|
@var{prefix}@code{*}@var{suffix}@code{;}@var{lemma}@code{,}@var{description}@code{:}@var{weight}
|
|||
|
@end example
|
|||
|
|
|||
|
@var{lemma} must be given in the cut-add format:
|
|||
|
|
|||
|
@example
|
|||
|
@code{[<cut1><add1>-]<cut2><add2>}
|
|||
|
@end example
|
|||
|
(no spaces in between): replace the prefix of length @var{cut1} with the
string @var{add1}, and replace the suffix of length @var{cut2} with the string
@var{add2}.
|
|||
|
|
|||
|
|
|||
|
Example: @code{3-4ały} transforms @i{najbielsi} into @i{biały}.
|
|||
|
|
|||
|
|
|||
|
@var{description} contains the part of speech and morphosyntactic information (@pxref{PMDBF dictionary}).
|
|||
|
|
|||
|
@var{weight} is an integer value between 1 and 999 indicating the
|
|||
|
likelihood of the guess.
|
|||
|
|
|||
|
@example
|
|||
|
*<2A>k<EFBFBD>;1a,N/GfNsCa
|
|||
|
naj*elszy;3-4ały,ADJ/...:...
|
|||
|
@end example
|
|||
|
|
|||
|
|
|||
|
@c ---------------------------------------------------------------------
|
|||
|
@c COR
|
|||
|
@c ---------------------------------------------------------------------
|
|||
|
|
|||
|
@page
|
|||
|
@node cor
|
|||
|
@section cor - spelling corrector
|
|||
|
|
|||
|
@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
|
|||
|
@item @strong{Authors:} @tab Tomasz Obrębski, Michał Stolarski
|
|||
|
@item @strong{Component category:} @tab filter
|
|||
|
@end multitable
|
|||
|
|
|||
|
The spelling corrector applies Kemal Oflazer's dynamic programming
|
|||
|
algorithm @cite{oflazer96} to the FSA representation of the set of
|
|||
|
word forms of the Polex/PMDBF dictionary. Given an incorrect
|
|||
|
word form it returns all word forms present in the dictionary whose
|
|||
|
edit distance is smaller than the threshold given as the parameter.
|
|||
|
|
|||
|
By default @command{cor} replaces the contents of the @var{form} field
with the corrected value, placing the old contents in the @code{cor}
field.
|
|||
|
|
|||
|
|
|||
|
@menu
|
|||
|
* cor command line options::
|
|||
|
* cor dictionaries::
|
|||
|
@end menu
|
|||
|
|
|||
|
|
|||
|
@node cor command line options
|
|||
|
@subsection Command line options
|
|||
|
|
|||
|
@table @code
|
|||
|
|
|||
|
@parhelp
|
|||
|
@parversion
|
|||
|
@parinteractive
|
|||
|
@c @parfile
|
|||
|
@c @paroutput
|
|||
|
@c @parfail
|
|||
|
@c @parcopy
|
|||
|
@parinputfield
|
|||
|
@paroutputfield
|
|||
|
@pardictionary
|
|||
|
@parprocess
|
|||
|
@parselect
|
|||
|
@parunselect
|
|||
|
@paroneline
|
|||
|
@paronefield
|
|||
|
|
|||
|
@item @b{@minus{}@minus{}distance=@var{int}, @minus{}n @var{int}}
|
|||
|
Maximum edit distance (default='1').
|
|||
|
|
|||
|
|
|||
|
@end table
|
|||
|
|
|||
|
@node cor dictionaries
|
|||
|
@subsection Dictionaries
|
|||
|
|
|||
|
@command{cor} requires a dictionary. The dictionary has to be provided in binary (fsa) format.
|
|||
|
The fsa format is created by compiling text-format dictionaries.
|
|||
|
|
|||
|
@subsubheading Text format
|
|||
|
|
|||
|
The @command{cor} dictionary is a list of words:
|
|||
|
@example
|
|||
|
odlot
|
|||
|
odlotowy
|
|||
|
odludek
|
|||
|
@end example
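
To show what ``word forms within a given edit distance'' means for such
a word list, here is a brute-force Python sketch. It deliberately does
@emph{not} implement the FSA-based algorithm used by @command{cor}: it
computes the plain Levenshtein distance against every word in turn. The
function names are hypothetical, and the threshold is treated as an
inclusive maximum, following the description of the
@option{--distance} option.

@example
```python
def edit_distance(a, b):
    """Plain Levenshtein distance (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def corrections(form, words, max_distance=1):
    """Return the dictionary words within max_distance edits of form."""
    return [w for w in words if edit_distance(form, w) <= max_distance]


print(corrections("odlut", ["odlot", "odlotowy", "odludek"]))
```
@end example

A linear scan like this is only practical for small word lists, which is
why a compiled automaton is preferable for a full dictionary.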
|
|||
|
|
|||
|
@page
|
|||
|
@node sen
|
|||
|
@section sen - a sentensizer
|
|||
|
|
|||
|
@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
|
|||
|
|
|||
|
@item @strong{Authors:} @tab Tomasz Obrębski
|
|||
|
@item @strong{Component category:} @tab filter
|
|||
|
|
|||
|
@end multitable
|
|||
|
|
|||
|
@command{sen} detects sentence boundaries in UTT-formatted texts and marks them with special zero-length segments, in which the @var{type} field may contain the BOS (beginning of sentence) or EOS (end of sentence) annotation.
|
|||
|
|
|||
|
@menu
|
|||
|
@c * sen input::
|
|||
|
@c * sen output::
|
|||
|
* sen example::
|
|||
|
@end menu
|
|||
|
|
|||
|
@node sen example
|
|||
|
@subsection Example
|
|||
|
|
|||
|
@example
command: sen

input:
0000 05 W Cześć
0005 01 P !
0006 01 S _
0007 02 W To
0009 01 S _
0010 02 W ja
0012 01 P .
0013 01 S \n

output:
0000 00 BOS *
0000 05 W Cześć
0005 01 P !
0006 00 EOS *
0006 00 BOS *
0006 01 S _
0007 02 W To
0009 01 S _
0010 02 W ja
0012 01 P .
0013 01 S \n
0014 00 EOS *
@end example
|
|||
|
|
|||
|
|
|||
|
@c ---------------------------------------------------------------------
|
|||
|
@c GPH
|
|||
|
@c ---------------------------------------------------------------------
|
|||
|
|
|||
|
@c @node gph - graphizer
|
|||
|
@c @chapter gph - graphizer
|
|||
|
|
|||
|
@c Authors: Tomasz Obr<62>bski
|
|||
|
|
|||
|
|
|||
|
|
|||
|
@c SER
|
|||
|
@c ---------------------------------------------------------------------
|
|||
|
@c ---------------------------------------------------------------------
|
|||
|
|
|||
|
@page
|
|||
|
@node ser
|
|||
|
@section ser - pattern search tool
|
|||
|
|
|||
|
@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
|
|||
|
@item @strong{Authors:} @tab Tomasz Obrębski
|
|||
|
@item @strong{Component category:} @tab filter
|
|||
|
@end multitable
|
|||
|
|
|||
|
@command{ser} looks for patterns in UTT-formatted texts.
|
|||
|
|
|||
|
@menu
|
|||
|
* ser command line options::
|
|||
|
* ser pattern::
|
|||
|
* ser how ser works::
|
|||
|
* ser customization::
|
|||
|
* ser limitations::
|
|||
|
* ser requirements::
|
|||
|
@end menu
|
|||
|
|
|||
|
|
|||
|
@c ---------------------------------------------------------------------
|
|||
|
@node ser command line options
|
|||
|
@subsection Command line options
|
|||
|
|
|||
|
@table @code
|
|||
|
|
|||
|
@parhelp
|
|||
|
@parversion
|
|||
|
@c @parfile
|
|||
|
@c @paroutput
|
|||
|
@c @parinputfield
|
|||
|
@c @paroutputfield
|
|||
|
@parprocess
|
|||
|
@parinteractive
|
|||
|
|
|||
|
@item @b{@minus{}@minus{}pattern=@var{pattern}, @minus{}e @var{pattern}}
|
|||
|
The search pattern.
|
|||
|
|
|||
|
@item @b{@minus{}@minus{}morph=@var{field}}
|
|||
|
The name of the annotation field containing the morphological
|
|||
|
description (default @code{lem}).
|
|||
|
|
|||
|
@item @b{@minus{}@minus{}flex}
|
|||
|
Only print the generated flex source code.
|
|||
|
|
|||
|
@item @b{@minus{}@minus{}macro=@var{filename}}
|
|||
|
Read macro definitions from file @var{filename} rather than from the
default location. This option makes it possible to redefine the set of terms.
|
|||
|
|
|||
|
@item @b{@minus{}@minus{}define=@var{filename}}
|
|||
|
Append macro definitions from file @var{filename}. This option
makes it possible to extend the set of terms.
|
|||
|
|
|||
|
@end table
|
|||
|
|
|||
|
|
|||
|
@c ---------------------------------------------------------------------
|
|||
|
@node ser pattern
|
|||
|
@subsection Pattern
|
|||
|
|
|||
|
The @command{ser} pattern is a regular expression over terms corresponding
|
|||
|
to text segments or segment sequences. Predefined terms are:
|
|||
|
|
|||
|
@table @code
|
|||
|
|
|||
|
@item seg(@var{t},@var{f},@var{a})
|
|||
|
a segment of type @var{t}, containing form @var{f} and annotation
|
|||
|
@var{a}
|
|||
|
|
|||
|
@item form(@var{f})
|
|||
|
a segment containing form @var{f}
|
|||
|
|
|||
|
@item field(@var{f})
|
|||
|
a segment containing annotation field @var{f}
|
|||
|
|
|||
|
@item space(@var{f})
|
|||
|
a space segment of form @var{f}
|
|||
|
|
|||
|
@item word(@var{f})
|
|||
|
a word segment of form @var{f}
|
|||
|
|
|||
|
@item punct(@var{f})
|
|||
|
a punct segment of form @var{f}
|
|||
|
|
|||
|
@item number(@var{f})
|
|||
|
a number segment of form @var{f}
|
|||
|
|
|||
|
@item lexeme(@var{f})
|
|||
|
a word segment with lemma @var{f}
|
|||
|
|
|||
|
@item cat(@var{c})
|
|||
|
a word segment of category @var{c}
|
|||
|
|
|||
|
@end table
|
|||
|
|
|||
|
All arguments are optional. If an argument is omitted, an arbitrary
|
|||
|
string of non-blank characters is assumed as the argument value. Term
|
|||
|
arguments may be arbitrary character-level regular expressions. The
|
|||
|
following special symbols can be used:
|
|||
|
|
|||
|
@multitable {aaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
|
|||
|
@item @code{[@dots{}]} @tab a character class
|
|||
|
@item @code{[^@dots{}]} @tab a negated character class
|
|||
|
@item @code{|} @tab alternative
|
|||
|
@item @code{*} @tab repetition, including zero times
|
|||
|
@item @code{+} @tab repetition, at least one time
|
|||
|
@item @code{?} @tab optionality
|
|||
|
@item @code{@{@var{m},@var{n}@}} @tab repetition from @var{m} to @var{n} times
|
|||
|
@item @code{@{@var{m},@}} @tab repetition @var{m} or more times
|
|||
|
@item @code{@{@var{m}@}} @tab repetition @var{m} times
|
|||
|
@item @code{\@var{ddd}} @tab the character with octal value @var{ddd}
|
|||
|
@item @code{\x@var{hh}} @tab the character with hexadecimal value @var{hh}
|
|||
|
@item @code{( )} @tab parentheses, used to override precedence
|
|||
|
@c @end multitable
|
|||
|
|
|||
|
@c @multitable {aaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
|
|||
|
@item @code{.} @tab a non-blank character
|
|||
|
@item @code{\w} @tab a letter
|
|||
|
@item @code{\W} @tab a non-blank character other than a letter
|
|||
|
@item @code{\d} @tab a digit
|
|||
|
@item @code{\D} @tab a non-blank character other than a digit
|
|||
|
@item @code{\s} @tab a space or tab character
|
|||
|
@item @code{\S} @tab a non-blank character (the same as @code{.})
|
|||
|
@item @code{\l} @tab a lowercase letter
|
|||
|
@item @code{\L} @tab an uppercase letter
|
|||
|
@end multitable
|
|||
|
|
|||
|
|
|||
|
@noindent The following characters:
|
|||
|
@example
|
|||
|
@verb{% [ ] ^ | * + ? { } , . < > \ %}
|
|||
|
@end example
|
|||
|
must be escaped with a backslash, i.e. written as:
|
|||
|
@example
|
|||
|
@verb{% \[ \] \^ \| \* \+ \? \{ \} \, \. \< \> \\ %}
|
|||
|
@end example
|
|||
|
|
|||
|
@quotation Note
The special symbols are borrowed from Perl with minor modifications
made for convenience, so the meaning of certain special characters and
sequences differs slightly from their common usage.
The meaning of the @code{.} special character is modified because of
the special function of spaces in UTT files (they are field
separators): it matches only non-blank characters. Use @code{\s} to
explicitly match a space or tab character.
@end quotation
|
|||
|
|
|||
|
In the argument of the @code{cat} term a special operator <...> may be
|
|||
|
used. A category specification enclosed in angle brackets matches all
|
|||
|
category descriptions which are consistent (non-contradictory) with the
|
|||
|
specification. For example @code{<N>} matches all noun descriptions,
|
|||
|
@code{<ADJ/Can>} matches all adjectives in the accusative or nominative case.
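
One possible reading of this consistency test is sketched below in
Python. It is an illustration only, not the matching code generated by
@command{ser}. The function names are hypothetical. The sketch assumes
that a description has the form @code{POS/...}, where each attribute is
an uppercase letter followed by one or more lowercase letters or digits
listing its alternative values, and that an attribute absent from the
description cannot contradict the specification.

@example
```python
import re


def parse_category(text):
    """Split 'POS/AttrValues...' into (pos, list of (attribute, values))."""
    pos, _, rest = text.partition("/")
    # Each attribute letter is followed by its alternative values,
    # e.g. "Cnav" means attribute C with values n, a or v.
    attributes = [(name, set(values))
                  for name, values in re.findall(r"([A-Z])([a-z0-9]+)", rest)]
    return pos, attributes


def consistent(spec, description):
    """True if the description does not contradict the specification."""
    spec_pos, spec_attrs = parse_category(spec)
    descr_pos, descr_attrs = parse_category(description)
    if spec_pos != descr_pos:
        return False
    descr_values = dict(descr_attrs)
    for name, wanted in spec_attrs:
        found = descr_values.get(name)
        # Contradiction: the attribute is present but shares no value.
        if found is not None and not (wanted & found):
            return False
    return True


print(consistent("ADJ/Can", "ADJ/DpNpCnavGaifn"))
print(consistent("N/Ca", "N/GiNpCn"))
```
@end example

Under these assumptions the order of attributes in the specification
does not matter, which a plain character-level regular expression over
the description could not express as compactly.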
|
|||
|
|
|||
|
|
|||
|
@*
|
|||
|
@noindent @b{Examples of one-segment patterns:}
|
|||
|
|
|||
|
@multitable {aaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
|
|||
|
@item @code{seg} @tab any segment
|
|||
|
@item @code{word} @tab any word-form
|
|||
|
@item @code{word(pomocy)} @tab the word-form @samp{pomocy}
|
|||
|
@item @code{word(naj.+)} @tab a word-form beginning with @samp{naj}
|
|||
|
@item @code{word(\L\l+)} @tab a capitalized word-form
|
|||
|
@item @code{punct} @tab a punctuation character
|
|||
|
@item @code{space(.*\\n.*)} @tab a space segment containing a newline character
|
|||
|
@item @code{lexeme(pomoc)} @tab any form of the lexeme 'pomoc'
|
|||
|
@item @code{cat(N/.*)} @tab a word whose category starts with @code{N/}
@item @code{cat(<N/Ca>)} @tab a word whose category matches @code{N/Ca}
|
|||
|
@end multitable
|
|||
|
|
|||
|
@*
|
|||
|
@noindent @b{Examples of multi-segment patterns:}
|
|||
|
|
|||
|
@table @code
|
|||
|
|
|||
|
@item (word(\L) punct(\.) space?)+ word(\L\l+)
|
|||
|
a sequence of initials followed by a surname
|
|||
|
|
|||
|
@item punct seg(W|S|N)* cat(<NPRO/Sr>) seg(W|S|N)* punct
|
|||
|
a text fragment between two punctuation characters, containing an
occurrence of a relative pronoun
|
|||
|
|
|||
|
@end table
|
|||
|
|
|||
|
|
|||
|
@node ser how ser works
|
|||
|
@subsection How ser works
|
|||
|
|
|||
|
@node ser customization
|
|||
|
@subsection Customization
|
|||
|
|
|||
|
@c All predefined terms correspond to single segments,
|
|||
|
|
|||
|
@example
|
|||
|
define(`verbseq', `(cat(V) (space cat(V)))')
|
|||
|
@end example
|
|||
|
|
|||
|
|
|||
|
the term @code{cat()} may not be used as a ... of
|
|||
|
|
|||
|
@c See @command{m4} manual for further details on macro definition format.
|
|||
|
|
|||
|
@node ser limitations
|
|||
|
@subsection Limitations
|
|||
|
|
|||
|
Category specifications in angle brackets (@code{<...>}) with more than 3 attributes are not supported.
|
|||
|
|
|||
|
@node ser requirements
|
|||
|
@subsection Requirements
|
|||
|
|
|||
|
In order to run @command{ser}, the following programs must be
|
|||
|
installed in the system:
|
|||
|
|
|||
|
@itemize
|
|||
|
|
|||
|
@item @command{m4}
|
|||
|
@item @command{grep}
|
|||
|
@item @command{flex}
|
|||
|
@item @command{gcc}
|
|||
|
|
|||
|
@end itemize
|
|||
|
|
|||
|
|
|||
|
@c GRP
|
|||
|
@c ---------------------------------------------------------------------
|
|||
|
@c ---------------------------------------------------------------------
|
|||
|
|
|||
|
@page
|
|||
|
@node grp
|
|||
|
@section grp - pattern search tool
|
|||
|
|
|||
|
@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
|
|||
|
@item @strong{Authors:} @tab Tomasz Obrębski
|
|||
|
@item @strong{Component category:} @tab filter
|
|||
|
@end multitable
|
|||
|
|
|||
|
|
|||
|
@command{grp} selects sentences containing an expression matching a
pattern. The pattern format is exactly the same as that accepted by
@command{ser}.
|
|||
|
|
|||
|
@command{grp} is intended mainly for speeding up the corpus search process.
It is extremely fast (processing speed is usually higher than the speed
of reading the corpus file from disk).
|
|||
|
|
|||
|
|
|||
|
|
|||
|
@c @menu
|
|||
|
@c * ser command line options::
|
|||
|
@c * ser pattern::
|
|||
|
@c * ser how ser works::
|
|||
|
@c * ser customization::
|
|||
|
@c * ser limitations::
|
|||
|
@c * ser requirements::
|
|||
|
@c @end menu
|
|||
|
@menu
|
|||
|
* grp command line options::
|
|||
|
* grp pattern::
|
|||
|
* grp hints::
|
|||
|
@end menu
|
|||
|
|
|||
|
@node grp command line options
|
|||
|
@subsection Command line options
|
|||
|
|
|||
|
@table @code
|
|||
|
|
|||
|
@parhelp
|
|||
|
@parversion
|
|||
|
@c @parfile
|
|||
|
@c @paroutput
|
|||
|
@c @parinputfield
|
|||
|
@c @paroutputfield
|
|||
|
@parprocess
|
|||
|
@parinteractive
|
|||
|
|
|||
|
@item @b{@minus{}@minus{}pattern=@var{pattern}, @minus{}e @var{pattern}}
|
|||
|
The search pattern.
|
|||
|
|
|||
|
@item @b{@minus{}@minus{}morph=@var{field}}
|
|||
|
The name of the annotation field containing the morphological
|
|||
|
description (default @code{lem}).
|
|||
|
|
|||
|
@item @b{@minus{}@minus{}command}
|
|||
|
Only print the generated flex source code.
|
|||
|
|
|||
|
@item @b{@minus{}@minus{}macro=@var{filename}}
|
|||
|
Read macro definitions from file @var{filename} rather than from the
default location. This option makes it possible to redefine the set of terms.
|
|||
|
|
|||
|
@item @b{@minus{}@minus{}define=@var{filename}}
|
|||
|
Append macro definitions from file @var{filename}. This option
makes it possible to extend the set of terms.
|
|||
|
|
|||
|
@end table
|
|||
|
|
|||
|
|
|||
|
@node grp pattern
|
|||
|
@subsection Pattern
|
|||
|
|
|||
|
The pattern format is the same as for @command{ser} (@pxref{ser pattern}).
|
|||
|
|
|||
|
@node grp hints
|
|||
|
@subsection Hints
|
|||
|
|
|||
|
The corpus search speed may be increased by combining @command{grp} with the
@command{lzop} compression tool (@command{grp} usually processes data faster
than it is read from a disk, especially for slow laptop drives).
|
|||
|
|
|||
|
@example
|
|||
|
cat corpus | tok | sen | lem | grp -a p | lzop -7 > corpus.grp.lzo
|
|||
|
@end example
|
|||
|
|
|||
|
@example
|
|||
|
lzop -cd corpus.grp.lzo | grp -a gP -e @var{EXPR} | ser -e @var{EXPR}
|
|||
|
@end example
|
|||
|
|
|||
|
|
|||
|
@c ---------------------------------------------------------------------
|
|||
|
@c kot
|
|||
|
@c ---------------------------------------------------------------------
|
|||
|
@c ---------------------------------------------------------------------
|
|||
|
|
|||
|
@page
|
|||
|
@node kot
|
|||
|
@section kot - untokenizer
|
|||
|
|
|||
|
Authors: Tomasz Obrębski
|
|||
|
|
|||
|
@command{kot} is the opposite of @command{tok}. It changes UTT-formatted text into plain text.
|
|||
|
|
|||
|
@menu
|
|||
|
* kot command line options::
|
|||
|
* kot usage examples::
|
|||
|
@end menu
|
|||
|
|
|||
|
@node kot command line options
|
|||
|
@subsection Command line options
|
|||
|
|
|||
|
@table @code
|
|||
|
|
|||
|
@parhelp
|
|||
|
|
|||
|
@c @item @b{@minus{}@minus{}version}, @b{@minus{}v}
|
|||
|
|
|||
|
@c @item @b{@minus{}@minus{}file=@var{filename}, @minus{}f @var{filename}}
|
|||
|
|
|||
|
@c @item @b{@minus{}@minus{}output=@var{filename}, @minus{}o @var{filename}}
|
|||
|
|
|||
|
@c @item @b{@minus{}@minus{}interactive @minus{}i}
|
|||
|
|
|||
|
@c @item @b{@minus{}@minus{}config=@var{filename}}
|
|||
|
|
|||
|
|
|||
|
|
|||
|
@item @b{@minus{}@minus{}gap-fill=@var{string}, @minus{}g @var{string}}
|
|||
|
print @var{string} between nonadjacent segments of the input file
|
|||
|
|
|||
|
@item @b{@minus{}@minus{}spaces, @minus{}r}
|
|||
|
retain the special characters @code{_}, @code{\t},
|
|||
|
@code{\n}, @code{\r}, @code{\f} unexpanded in the output
|
|||
|
|
|||
|
@end table
|
|||
|
|
|||
|
@node kot usage examples
|
|||
|
@subsection Usage examples
|
|||
|
|
|||
|
@example
|
|||
|
cat legia.txt | tok | kot
|
|||
|
@end example
|
|||
|
|
|||
|
@example
|
|||
|
cat legia.txt | tok | lem -1 | kot
|
|||
|
@end example
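
What untokenizing involves can be sketched in a few lines of Python.
This is an illustration, not the implementation of @command{kot}. The
names @code{ESCAPES} and @code{untokenize} are hypothetical, and the
sketch assumes that the first four fields of a segment are start
position, length, type and form, that zero-length segments (such as
sentence markers) carry no text, and that @code{_} in a form always
stands for a space.

@example
```python
# Assumed escape sequences for blank characters in the form field.
ESCAPES = [("\\n", "\n"), ("\\t", "\t"), ("\\r", "\r"),
           ("\\f", "\f"), ("_", " ")]


def untokenize(utt_text, gap_fill="", keep_escapes=False):
    """Rebuild plain text from UTT segments."""
    output = []
    expected_start = None
    for line in utt_text.split("\n"):
        fields = line.split()
        if len(fields) < 4:
            continue  # skip empty or incomplete lines
        start, length, form = int(fields[0]), int(fields[1]), fields[3]
        if length == 0:
            continue  # zero-length segments contribute no text
        # A jump in the start position means the segments are not adjacent.
        if expected_start is not None and start != expected_start:
            output.append(gap_fill)
        if not keep_escapes:
            for escaped, plain in ESCAPES:
                form = form.replace(escaped, plain)
        output.append(form)
        expected_start = start + length
    return "".join(output)


sample = "0000 07 W Piszemy\n0007 01 S _\n0008 05 W dobre\n"
print(repr(untokenize(sample)))
```
@end example

The @code{gap_fill} and @code{keep_escapes} parameters mirror the
@option{--gap-fill} and @option{--spaces} options described above.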
|
|||
|
|
|||
|
@c CON............................................................
|
|||
|
@c ...............................................................
|
|||
|
@c ...............................................................
|
|||
|
|
|||
|
@page
|
|||
|
@node con
|
|||
|
@section con - concordance table generator
|
|||
|
|
|||
|
@command{con} generates a concordance table based on a pattern given to @command{ser}.
|
|||
|
|
|||
|
@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
|
|||
|
@item @strong{Authors:} @tab Justyna Walkowska
|
|||
|
@item @strong{Component category:} @tab sink
|
|||
|
@end multitable
|
|||
|
@c
|
|||
|
|
|||
|
@menu
|
|||
|
* con command line options::
|
|||
|
* con usage example::
|
|||
|
* con hints::
|
|||
|
@end menu
|
|||
|
|
|||
|
@node con command line options
|
|||
|
@subsection Command line options
|
|||
|
|
|||
|
@table @code
|
|||
|
|
|||
|
@parhelp
|
|||
|
|
|||
|
@c @item @b{@minus{}@minus{}help}, @b{@minus{}h}
|
|||
|
@c @item @b{@minus{}@minus{}version}, @b{@minus{}v}
|
|||
|
@c @item @b{@minus{}@minus{}file=@var{filename}, @minus{}f @var{filename}}
|
|||
|
@c @item @b{@minus{}@minus{}output=@var{filename}, @minus{}o @var{filename}}
|
|||
|
@c @item @b{@minus{}@minus{}fail=@var{filename}, @minus{}e @var{filename}} [???]
|
|||
|
@c @item @b{@minus{}@minus{}copy, @minus{}c} [???]
|
|||
|
@c @item @b{@minus{}@minus{}input-field=@var{fieldname}, @minus{}I @var{fieldname}}
|
|||
|
@c @item @b{@minus{}@minus{}output-field=@var{fieldname}, @minus{}O @var{fieldname}}
|
|||
|
@c @item @b{@minus{}@minus{}process=@var{class}, @minus{}p @var{class}}
|
|||
|
@c @item @b{@minus{}@minus{}interactive @minus{}i}
|
|||
|
@c @item @b{@minus{}@minus{}config=@var{filename}}
|
|||
|
@c @item
|
|||
|
@c @item @b{@minus{}@minus{}pattern=@var{pattern}, @minus{}e @var{pattern}}
|
|||
|
@c search pattern
|
|||
|
@c
|
|||
|
@c @item @b{@minus{}@minus{}flex}
|
|||
|
@c only print the generated flex source code
|
|||
|
@c
|
|||
|
@c @item @b{@minus{}@minus{}macro=@var{filename}}
|
|||
|
@c read macrodefinitions from file @var{filename} rather than from
|
|||
|
@c default location. This option allows to redefine the set of terms.
|
|||
|
@c
|
|||
|
@c @item @b{@minus{}@minus{}define=@var{filename}}
|
|||
|
@c append macrodefinitions from file @var{filename}. This option
|
|||
|
@c allows to extend the set of terms.
|
|||
|
|
|||
|
@item @b{@minus{}@minus{}left @minus{}l}
|
|||
|
Left context info (default='30c'). Example:
|
|||
|
@example
|
|||
|
-l=5c: left context is 5 characters
|
|||
|
-l=5w: left context is 5 words
|
|||
|
-l=5s: left context is 5 non-empty input lines
|
|||
|
-l='\s*\S+\sr\S+BOS': left context starts with the given regex
|
|||
|
@end example
|
|||
|
|
|||
|
@item @b{@minus{}@minus{}right @minus{}r}
|
|||
|
Right context info (default='30c').
|
|||
|
@item @b{@minus{}@minus{}trim @minus{}t}
|
|||
|
Clear incomplete words from output.
|
|||
|
@item @b{@minus{}@minus{}white @minus{}w}
|
|||
|
DO NOT change all white characters into spaces.
|
|||
|
@item @b{@minus{}@minus{}column @minus{}c}
|
|||
|
Left column minimal width in characters (default = 0).
|
|||
|
@item @b{@minus{}@minus{}ignore @minus{}i}
|
|||
|
Ignore segment inconsistency in the input.
|
|||
|
@item @b{@minus{}@minus{}bon}
|
|||
|
Beginning of selected segment (regex, default='[0-9]+ [0-9]+ BOM .*').
|
|||
|
@item @b{@minus{}@minus{}eob}
|
|||
|
End of selected segment (regex, default='[0-9]+ [0-9]+ EOM .*').
|
|||
|
@item @b{@minus{}@minus{}bod}
|
|||
|
Selected segment beginning display string (default='[').
|
|||
|
@item @b{@minus{}@minus{}eod}
|
|||
|
Selected segment end display string (default=']').
|
|||
|
|
|||
|
|
|||
|
|
|||
|
@end table
|
|||
|
|
|||
|
@node con usage example
|
|||
|
@subsection Usage example
|
|||
|
@example
|
|||
|
cat file.txt | tok | lem -1 | ser -e 'lexeme(dom)' | con
|
|||
|
@end example
|
|||
|
|
|||
|
|
|||
|
@node con hints
|
|||
|
@subsection Hints
|
|||
|
|
|||
|
@command{con} is a rather slow program. Do not pass large amounts of
|
|||
|
redundant text through this program. @command{con} works fine in the following
|
|||
|
sequence:
|
|||
|
|
|||
|
@example
|
|||
|
... | grp -e EXPR | ser -e EXPR | con
|
|||
|
@end example
|
|||
|
|
|||
|
|
|||
|
|
|||
|
@c ---------------------------------------------------------------------
|
|||
|
@c ---------------------------------------------------------------------
|
|||
|
|
|||
|
@page
|
|||
|
@node Auxiliary tools
|
|||
|
@chapter Auxiliary tools
|
|||
|
|
|||
|
@menu
|
|||
|
* compiledic:: dictionary compiler
|
|||
|
* fla:: UTT file flattener
|
|||
|
* unfla:: UTT file unflattener
|
|||
|
@end menu
|
|||
|
|
|||
|
|
|||
|
@page
|
|||
|
@node compiledic
|
|||
|
@section compiledic - the dictionary compiler
|
|||
|
|
|||
|
@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
|
|||
|
@item @strong{Authors:} @tab Michał Stolarski, Tomasz Obrębski
|
|||
|
@item @strong{Component category:} @tab additional tool
|
|||
|
@end multitable
|
|||
|
@c
|
|||
|
|
|||
|
@command{compiledic} compiles dictionaries in text format (@code{.dic} extension) into binary
|
|||
|
(FSA) format (@code{.bin} extension).
|
|||
|
|
|||
|
Automaton representation of a dictionary is built using the AT&T tools:
|
|||
|
@itemize
|
|||
|
@item AT&T FSM Library,
|
|||
|
@item AT&T Lextools.
|
|||
|
@end itemize
|
|||
|
|
|||
|
For @command{compiledic} to work, the above-mentioned packages must be
installed on your system. They are freely available
for non-commercial use.
|
|||
|
|
|||
|
Usage:
|
|||
|
@example
|
|||
|
compiledic <dictionaryname>.dic
|
|||
|
@end example
|
|||
|
|
|||
|
The file <dictionaryname>.bin will be generated.
|
|||
|
|
|||
|
Note: The program produces many temporary files, which are
stored in the current directory. They are deleted after successful
termination of the program.
|
|||
|
|
|||
|
@c @menu
|
|||
|
@c * con command line options::
|
|||
|
@c * con usage example::
|
|||
|
@c * con hints::
|
|||
|
@c @end menu
|
|||
|
|
|||
|
|
|||
|
@page
|
|||
|
@node fla
|
|||
|
@section fla - the UTT file flattener
|
|||
|
|
|||
|
@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
|
|||
|
@item @strong{Authors:} @tab Tomasz Obrębski
|
|||
|
@item @strong{Component category:} @tab filter
|
|||
|
@end multitable
|
|||
|
@c
|
|||
|
|
|||
|
@command{fla} ``flattens'' a UTT file by merging the segments belonging
to one sentence into one line. Technically, the end-of-line characters
('\n', ASCII code 10) within a sentence are replaced with form-feed
characters ('\f', ASCII code 12). Flattening makes it possible to
process UTT files sentence by sentence with tools such as
@command{grep} or @command{sed} (this is used in @command{grp} and
@command{mar}).
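
The benefit of the flattened format can be illustrated with standard
tools alone. The following is only a sketch: the two sentences are
made-up data in which each segment is reduced to a single word (real UTT
segments carry more fields), and the variable name @code{flat} is
arbitrary. Since every sentence occupies one line, @command{grep}
selects whole sentences, and @command{tr} then puts each segment back on
its own line.

@example
# Two made-up sentences, already flattened: segments of one sentence are
# separated by form feeds (\f), sentences by newlines.
flat=$(printf 'BOS\fAla\fma\fkota\nBOS\fTo\fjest\fpies\n')

# grep works sentence by sentence; tr restores one segment per line.
printf '%s\n' "$flat" | grep 'kota' | tr '\f' '\n'
@end example

Only the first sentence is printed, one segment per line.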
|
|||
|
|
|||
|
Flattened files should have the suffix @code{.fla}, e.g. @file{thetext.utt.fla}.
|
|||
|
|
|||
|
Flattened files are still human-readable.
|
|||
|
|
|||
|
Usage:
|
|||
|
|
|||
|
@example
|
|||
|
fla [<bosregex>]
|
|||
|
@end example
|
|||
|
|
|||
|
The optional argument is a regular expression describing the segments
that should be treated as sentence beginnings (a segment qualifies if
it contains a fragment matching @code{<bosregex>}). By default,
segments containing a @code{BOS} field are sought.
|
|||
|
@c @menu
|
|||
|
@c * con command line options::
|
|||
|
@c * con usage example::
|
|||
|
@c * con hints::
|
|||
|
@c @end menu
|
|||
|
|
|||
|
|
|||
|
|
|||
|
@page
|
|||
|
@node unfla
|
|||
|
@section unfla - the UTT file unflattener
|
|||
|
|
|||
|
@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
|
|||
|
@item @strong{Authors:} @tab Tomasz Obrębski
|
|||
|
@item @strong{Component category:} @tab filter
|
|||
|
@end multitable
|
|||
|
|
|||
|
@command{unfla} transforms a flattened UTT file, produced by
|
|||
|
@command{fla}, into the regular format by restoring end-of-line
|
|||
|
characters.
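
Under the assumption that every form-feed character in a flattened file
stands for a former end-of-line, the transformation can be approximated
with @command{tr}. This is a sketch on made-up data, not a description
of how @command{unfla} is implemented; the variable name
@code{restored} is arbitrary.

@example
# One flattened sentence (made-up segments) in, one segment per line out.
restored=$(printf 'BOS\fAla\fma\fkota\n' | tr '\f' '\n')
printf '%s\n' "$restored"
@end example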
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
@c ---------------------------------------------------------------------
|
|||
|
@c USAGE EXAMPLES
|
|||
|
@c ---------------------------------------------------------------------
|
|||
|
|
|||
|
@node Usage examples
|
|||
|
@chapter Usage examples
|
|||
|
|
|||
|
@subsubheading Simple pipelines
|
|||
|
|
|||
|
@enumerate
|
|||
|
|
|||
|
@item tokenization
|
|||
|
|
|||
|
@example
cat text | tok > output1
@end example
|
|||
|
|
|||
|
@item morphological annotation (1)
|
|||
|
|
|||
|
Simple dictionary-based lemmatization.

@example
cat text | tok | lem > output1
@end example
|
|||
|
|
|||
|
@item morphological annotation (2)
|
|||
|
|
|||
|
1) perform dictionary-based lemmatization
|
|||
|
2) guess descriptions for words which have no annotation
|
|||
|
|
|||
|
@example
|
|||
|
cat text | tok | lem | gue -S lem > output2
|
|||
|
@end example
|
|||
|
|
|||
|
@item morphological annotation (3)
|
|||
|
|
|||
|
1) perform dictionary-based lemmatization
|
|||
|
2) try to correct words with no annotation
|
|||
|
3) perform dictionary-based lemmatization of corrected words
|
|||
|
4) guess descriptions for words which still have no annotation
|
|||
|
|
|||
|
@example
|
|||
|
cat text | tok | lem | cor -p W -S lem | lem -I cor | gue -p W -S lem
|
|||
|
@end example
|
|||
|
@item spelling correction
|
|||
|
|
|||
|
|
|||
|
|
|||
|
@example
|
|||
|
cat text | tok | lem --only-fail | cor -1 > output3
|
|||
|
@end example
|
|||
|
|
|||
|
@item Expression extraction
|
|||
|
|
|||
|
Extraction of all occurrences of a verb followed by a form of the noun 'rozmowa'.
|
|||
|
|
|||
|
@example
|
|||
|
cat text | tok | lem -1 | ser -e 'cat(<V>) space lexeme(rozmowa)' -m | kot > output4
|
|||
|
@end example
|
|||
|
|
|||
|
@item A word in context
|
|||
|
|
|||
|
Extraction of text fragments containing a form of the lexeme 'rozmowa' in
|
|||
|
the context of 5 preceding and 5 succeeding corpus segments.
|
|||
|
|
|||
|
@example
|
|||
|
cat text | tok | lem -1 | ser -e 'seg@{5@} lexeme(rozmowa) seg@{5@}' -m | kot > output
|
|||
|
@end example
|
|||
|
|
|||
|
@item generation of concordance table (1)
|
|||
|
|
|||
|
@example
|
|||
|
cat text | tok | lem -1 | ser -e 'cat(<V>) space lexeme(rozmowa)' | con
|
|||
|
@end example
|
|||
|
|
|||
|
Running time: 10 seconds.
|
|||
|
|
|||
|
@item generation of concordance table (2)
|
|||
|
|
|||
|
The same as above but much faster
|
|||
|
|
|||
|
@example
|
|||
|
cat text | tok | lem -1 | \
|
|||
|
grp -e 'cat(<V>) space lexeme(rozmowa)' | \
|
|||
|
ser -e 'cat(<V>) space lexeme(rozmowa)' | \
|
|||
|
con
|
|||
|
@end example
|
|||
|
|
|||
|
Running time: 2 seconds.
|
|||
|
|
|||
|
@item generation of concordance table (3)
|
|||
|
|
|||
|
Usually, one searches the same corpus repeatedly. In such a case it is
advisable to first transform the corpus data into the format required
by @command{grp}, and then use the preprocessed data.
|
|||
|
|
|||
|
As @command{grp} (@command{grep}) processes data faster than it can be
read from the disk drive, the search time can be shortened further by
compressing the files. We suggest using @command{lzop}.
|
|||
|
|
|||
|
@item the fastest way to search a large corpus
|
|||
|
|
|||
|
step 1: preprocessing
|
|||
|
|
|||
|
@example
|
|||
|
cat corpus | tok | sen | lem -1 \
|
|||
|
| grp -a p | lzop -7 > corpus.grp.lzo
|
|||
|
@end example
|
|||
|
|
|||
|
step 2: search
|
|||
|
|
|||
|
@example
|
|||
|
lzop -cd corpus.grp.lzo | grp -a gP -e 'cat(<V>) space lexeme(rozmowa)' \
 | ser -e 'cat(<V>) space lexeme(rozmowa)' | con
|
|||
|
@end example
|
|||
|
|
|||
|
@end enumerate
|
|||
|
|
|||
|
@subsubheading More complicated configurations
|
|||
|
|
|||
|
|
|||
|
@example
|
|||
|
mknod fifo1 p
|
|||
|
mknod fifo2 p
|
|||
|
mknod fifo3 p
|
|||
|
mknod fifo4 p
|
|||
|
mknod fifo5 p
|
|||
|
|
|||
|
tok | lem -p W -e fifo1 > fifo2 &
|
|||
|
cor -e fifo3 < fifo1 | lem > fifo4 &
|
|||
|
gue < fifo3 > fifo5 &
|
|||
|
sort -m fifo2 fifo4 fifo5
|
|||
|
|
|||
|
rm fifo?
|
|||
|
@end example
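
The configuration above relies on named pipes whose contents are finally
combined by @command{sort -m}, which merges inputs that are each already
sorted. The same mechanism can be tried out in isolation with standard
tools. In this sketch the scratch directory, the pipe names and the data
are made up, and @command{mkfifo} is used in place of @command{mknod}
with the @code{p} argument; both create a named pipe.

@example
# Scratch directory with two named pipes.
dir=$(mktemp -d)
mkfifo "$dir/a" "$dir/b"

# Two background writers, each producing an already sorted stream.
printf '1 x\n3 z\n' > "$dir/a" &
printf '2 y\n' > "$dir/b" &

# sort -m merges the two streams without re-sorting them.
merged=$(sort -m "$dir/a" "$dir/b")
wait
printf '%s\n' "$merged"

rm -r "$dir"
@end example

The three lines come out in order although they were produced by two
independent processes.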
|
|||
|
|
|||
|
|
|||
|
@c ---------------------------------------------------------------------
|
|||
|
@c ---------------------------------------------------------------------
|
|||
|
|
|||
|
@c ---------------------------------------------------------------------
|
|||
|
@c PMDBF DICTIONARY
|
|||
|
@c ---------------------------------------------------------------------
|
|||
|
|
|||
|
@node PMDBF dictionary
|
|||
|
@chapter PMDBF dictionary
|
|||
|
|
|||
|
UTT components come with lexical data derived from Polish
|
|||
|
Morphological Database (PMDB).
|
|||
|
|
|||
|
@menu
|
|||
|
* PMDBF files::
|
|||
|
* PMDBF tag structure::
|
|||
|
* PMDBF parts of speech::
|
|||
|
* PMDBF morphosyntactic attributes::
|
|||
|
@end menu
|
|||
|
|
|||
|
@node PMDBF files
|
|||
|
@section Files
|
|||
|
|
|||
|
@node PMDBF tag structure
|
|||
|
@section Tag structure
|
|||
|
|
|||
|
@example
pos   = [[:upper:]]+
attr  = [[:upper:]]+
val   = [[:lower:][:digit:]?!*+-] | <[^>\n]+>
descr = pos ( / ( attr val + ) + ) ?
@end example
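
The grammar above can be transcribed into a single POSIX extended
regular expression, which is handy for checking descriptions with
@command{grep}. The following is a sketch only: the variable name
@code{descr} and the sample descriptions are made up for illustration,
and the newline exclusion in @code{val} is dropped because
@command{grep} matches line by line.

@example
# ERE transcription of the descr rule.
descr='^[[:upper:]]+(/([[:upper:]]+([[:lower:][:digit:]?!*+-]|<[^>]+>)+)+)?$'

# Print only the descriptions that conform to the grammar.
printf '%s\n' 'N' 'ADJ/DpNsCnGf' 'n/Cn' 'N/C' | grep -E "$descr"
@end example

Of the four samples, the first two are accepted; @code{n/Cn} has a
lower-case part of speech and @code{N/C} has an attribute without a
value.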
|
|||
|
|
|||
|
@node PMDBF parts of speech
|
|||
|
@section Parts of speech
|
|||
|
|
|||
|
@multitable {ADJPRP} { adjectival-passive-participle }
|
|||
|
@item @code{N} @tab noun
|
|||
|
@item @code{NPRO} @tab nominal-pronoun
|
|||
|
@item @code{NV} @tab deverbal-noun
|
|||
|
@item @code{V} @tab verb
|
|||
|
@item @code{BYC} @tab byc
|
|||
|
@item @code{VNI} @tab non-inflected-verb
|
|||
|
@item @code{ADJ} @tab adjective
|
|||
|
@item @code{ADJPAP} @tab adjectival-passive-participle
|
|||
|
@item @code{ADJPRP} @tab adjectival-present-participle
|
|||
|
@item @code{ADJPP} @tab adjectival-past-participle
|
|||
|
@item @code{ADJPRO} @tab adjectival-pronoun
|
|||
|
@item @code{ADJNUM} @tab adjectival-numeral
|
|||
|
@item @code{ADV} @tab adverb
|
|||
|
@item @code{ADVANP} @tab adverbial-anterior-participle
|
|||
|
@item @code{ADVPRP} @tab adverbial-present-participle
|
|||
|
@item @code{ADVPRO} @tab adverbial-pronoun
|
|||
|
@item @code{ADVNUM} @tab adverbial-numeral
|
|||
|
@item @code{P} @tab preposition
|
|||
|
@item @code{PPRO} @tab prep-noun-pronoun
|
|||
|
@item @code{CONJ} @tab conjunction
|
|||
|
@item @code{EXCL} @tab exclamation
|
|||
|
@item @code{APP} @tab call
|
|||
|
@item @code{ONO} @tab onomatopoeia
|
|||
|
@item @code{PART} @tab particle
|
|||
|
@item @code{NUMCRD} @tab cardinal-numeral
|
|||
|
@item @code{NUMCOL} @tab collective-numeral
|
|||
|
@item @code{NUMPAR} @tab partitive-numeral
|
|||
|
@item @code{NUMORD} @tab ordinal-numeral
|
|||
|
@end multitable
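
Since the part of speech is everything before the first slash of a
description, it can be isolated with @command{cut}. This is a sketch
with a made-up description of a verb form; the variable name @code{pos}
is arbitrary.

@example
# Keep only the part-of-speech field of a (made-up) description.
pos=$(printf '%s\n' 'V/AiVpMdTrP3Ns' | cut -d/ -f1)
printf '%s\n' "$pos"
@end example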
|
|||
|
|
|||
|
@node PMDBF morphosyntactic attributes
|
|||
|
@section Morphosyntactic attributes
|
|||
|
|
|||
|
@multitable {Attr} {Val} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
|
|||
|
@c @headitem Attr @tab Val @tab Description
|
|||
|
@item
|
|||
|
@code{A} @tab @tab Aspect
|
|||
|
@item
|
|||
|
@tab @code{p} @tab perfective,
|
|||
|
@item
|
|||
|
@tab @code{i} @tab imperfective.
|
|||
|
@item
|
|||
|
@item
|
|||
|
@code{V} @tab @tab Verb-Form
|
|||
|
@item
|
|||
|
@tab @code{b} @tab infinitive,
|
|||
|
@item
|
|||
|
@tab @code{p} @tab personal,
|
|||
|
@item
|
|||
|
@tab @code{i} @tab impersonal.
|
|||
|
@item
|
|||
|
@item
|
|||
|
@code{M} @tab @tab Mood
|
|||
|
@item
|
|||
|
@tab @code{d} @tab declarative,
|
|||
|
@item
|
|||
|
@tab @code{c} @tab conditional,
|
|||
|
@item
|
|||
|
@tab @code{i} @tab imperative.
|
|||
|
@item
|
|||
|
@item
|
|||
|
@code{T} @tab @tab Tense
|
|||
|
@item
|
|||
|
@tab @code{a} @tab past,
|
|||
|
@item
|
|||
|
@tab @code{r} @tab present,
|
|||
|
@item
|
|||
|
@tab @code{f} @tab future.
|
|||
|
@item
|
|||
|
@item
|
|||
|
@code{P} @tab @tab Person
|
|||
|
@item
|
|||
|
@tab @code{1} @tab 1,
|
|||
|
@item
|
|||
|
@tab @code{2} @tab 2,
|
|||
|
@item
|
|||
|
@tab @code{3} @tab 3.
|
|||
|
@item
|
|||
|
@item
|
|||
|
@code{D} @tab @tab Degree
|
|||
|
@item
|
|||
|
@tab @code{p} @tab positive,
|
|||
|
@item
|
|||
|
@tab @code{c} @tab comparative,
|
|||
|
@item
|
|||
|
@tab @code{s} @tab superlative.
|
|||
|
@item
|
|||
|
@item
|
|||
|
@code{N} @tab @tab Number
|
|||
|
@item
|
|||
|
@tab @code{s} @tab singular,
|
|||
|
@item
|
|||
|
@tab @code{p} @tab plural.
|
|||
|
@item
|
|||
|
@item
|
|||
|
@code{C} @tab @tab Case
|
|||
|
@item
|
|||
|
@tab @code{n} @tab nominative,
|
|||
|
@item
|
|||
|
@tab @code{g} @tab genitive,
|
|||
|
@item
|
|||
|
@tab @code{d} @tab dative,
|
|||
|
@item
|
|||
|
@tab @code{a} @tab accusative,
|
|||
|
@item
|
|||
|
@tab @code{i} @tab instrumental,
|
|||
|
@item
|
|||
|
@tab @code{l} @tab locative,
|
|||
|
@item
|
|||
|
@tab @code{v} @tab vocative.
|
|||
|
@item
|
|||
|
@item
|
|||
|
@code{G} @tab @tab Gender
|
|||
|
@item
|
|||
|
@tab @code{p} @tab masculine-personal,
|
|||
|
@item
|
|||
|
@tab @code{a} @tab masculine-animal,
|
|||
|
@item
|
|||
|
@tab @code{i} @tab masculine-inanimate,
|
|||
|
@item
|
|||
|
@tab @code{f} @tab feminine,
|
|||
|
@item
|
|||
|
@tab @code{n} @tab neuter.
|
|||
|
@end multitable
|
|||
|
|
|||
|
|
|||
|
@c ---------------------------------------------------------------------
|
|||
|
@c ---------------------------------------------------------------------
|
|||
|
@c
|
|||
|
@c @node Examples
|
|||
|
@c @chapter Examples
|
|||
|
|
|||
|
@c ----------------------------------------------------------------------
|
|||
|
@c ----------------------------------------------------------------------
|
|||
|
|
|||
|
@node GNU Free Documentation License
|
|||
|
@chapter GNU Free Documentation License
|
|||
|
|
|||
|
@c The GNU Free Documentation License.
|
|||
|
@center Version 1.2, November 2002
|
|||
|
|
|||
|
@c This file is intended to be included within another document,
|
|||
|
@c hence no sectioning command or @node.
|
|||
|
|
|||
|
@display
|
|||
|
Copyright @copyright{} 2000,2001,2002 Free Software Foundation, Inc.
|
|||
|
51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA
|
|||
|
|
|||
|
Everyone is permitted to copy and distribute verbatim copies
|
|||
|
of this license document, but changing it is not allowed.
|
|||
|
@end display
|
|||
|
|
|||
|
@enumerate 0
|
|||
|
@item
|
|||
|
PREAMBLE
|
|||
|
|
|||
|
The purpose of this License is to make a manual, textbook, or other
|
|||
|
functional and useful document @dfn{free} in the sense of freedom: to
|
|||
|
assure everyone the effective freedom to copy and redistribute it,
|
|||
|
with or without modifying it, either commercially or noncommercially.
|
|||
|
Secondarily, this License preserves for the author and publisher a way
|
|||
|
to get credit for their work, while not being considered responsible
|
|||
|
for modifications made by others.
|
|||
|
|
|||
|
This License is a kind of ``copyleft'', which means that derivative
|
|||
|
works of the document must themselves be free in the same sense. It
|
|||
|
complements the GNU General Public License, which is a copyleft
|
|||
|
license designed for free software.
|
|||
|
|
|||
|
We have designed this License in order to use it for manuals for free
|
|||
|
software, because free software needs free documentation: a free
|
|||
|
program should come with manuals providing the same freedoms that the
|
|||
|
software does. But this License is not limited to software manuals;
|
|||
|
it can be used for any textual work, regardless of subject matter or
|
|||
|
whether it is published as a printed book. We recommend this License
|
|||
|
principally for works whose purpose is instruction or reference.
|
|||
|
|
|||
|
@item
|
|||
|
APPLICABILITY AND DEFINITIONS
|
|||
|
|
|||
|
This License applies to any manual or other work, in any medium, that
|
|||
|
contains a notice placed by the copyright holder saying it can be
|
|||
|
distributed under the terms of this License. Such a notice grants a
|
|||
|
world-wide, royalty-free license, unlimited in duration, to use that
|
|||
|
work under the conditions stated herein. The ``Document'', below,
|
|||
|
refers to any such manual or work. Any member of the public is a
|
|||
|
licensee, and is addressed as ``you''. You accept the license if you
|
|||
|
copy, modify or distribute the work in a way requiring permission
|
|||
|
under copyright law.
|
|||
|
|
|||
|
A ``Modified Version'' of the Document means any work containing the
|
|||
|
Document or a portion of it, either copied verbatim, or with
|
|||
|
modifications and/or translated into another language.
|
|||
|
|
|||
|
A ``Secondary Section'' is a named appendix or a front-matter section
|
|||
|
of the Document that deals exclusively with the relationship of the
|
|||
|
publishers or authors of the Document to the Document's overall
|
|||
|
subject (or to related matters) and contains nothing that could fall
|
|||
|
directly within that overall subject. (Thus, if the Document is in
|
|||
|
part a textbook of mathematics, a Secondary Section may not explain
|
|||
|
any mathematics.) The relationship could be a matter of historical
|
|||
|
connection with the subject or with related matters, or of legal,
|
|||
|
commercial, philosophical, ethical or political position regarding
|
|||
|
them.
|
|||
|
|
|||
|
The ``Invariant Sections'' are certain Secondary Sections whose titles
|
|||
|
are designated, as being those of Invariant Sections, in the notice
|
|||
|
that says that the Document is released under this License. If a
|
|||
|
section does not fit the above definition of Secondary then it is not
|
|||
|
allowed to be designated as Invariant. The Document may contain zero
|
|||
|
Invariant Sections. If the Document does not identify any Invariant
|
|||
|
Sections then there are none.
|
|||
|
|
|||
|
The ``Cover Texts'' are certain short passages of text that are listed,
|
|||
|
as Front-Cover Texts or Back-Cover Texts, in the notice that says that
|
|||
|
the Document is released under this License. A Front-Cover Text may
|
|||
|
be at most 5 words, and a Back-Cover Text may be at most 25 words.
|
|||
|
|
|||
|
A ``Transparent'' copy of the Document means a machine-readable copy,
|
|||
|
represented in a format whose specification is available to the
|
|||
|
general public, that is suitable for revising the document
|
|||
|
straightforwardly with generic text editors or (for images composed of
|
|||
|
pixels) generic paint programs or (for drawings) some widely available
|
|||
|
drawing editor, and that is suitable for input to text formatters or
|
|||
|
for automatic translation to a variety of formats suitable for input
|
|||
|
to text formatters. A copy made in an otherwise Transparent file
|
|||
|
format whose markup, or absence of markup, has been arranged to thwart
|
|||
|
or discourage subsequent modification by readers is not Transparent.
|
|||
|
An image format is not Transparent if used for any substantial amount
|
|||
|
of text. A copy that is not ``Transparent'' is called ``Opaque''.
|
|||
|
|
|||
|
Examples of suitable formats for Transparent copies include plain
|
|||
|
@sc{ascii} without markup, Texinfo input format, La@TeX{} input
|
|||
|
format, @acronym{SGML} or @acronym{XML} using a publicly available
|
|||
|
@acronym{DTD}, and standard-conforming simple @acronym{HTML},
|
|||
|
PostScript or @acronym{PDF} designed for human modification. Examples
|
|||
|
of transparent image formats include @acronym{PNG}, @acronym{XCF} and
|
|||
|
@acronym{JPG}. Opaque formats include proprietary formats that can be
|
|||
|
read and edited only by proprietary word processors, @acronym{SGML} or
|
|||
|
@acronym{XML} for which the @acronym{DTD} and/or processing tools are
|
|||
|
not generally available, and the machine-generated @acronym{HTML},
|
|||
|
PostScript or @acronym{PDF} produced by some word processors for
|
|||
|
output purposes only.
|
|||
|
|
|||
|
The ``Title Page'' means, for a printed book, the title page itself,
|
|||
|
plus such following pages as are needed to hold, legibly, the material
|
|||
|
this License requires to appear in the title page. For works in
|
|||
|
formats which do not have any title page as such, ``Title Page'' means
|
|||
|
the text near the most prominent appearance of the work's title,
|
|||
|
preceding the beginning of the body of the text.
|
|||
|
|
|||
|
A section ``Entitled XYZ'' means a named subunit of the Document whose
|
|||
|
title either is precisely XYZ or contains XYZ in parentheses following
|
|||
|
text that translates XYZ in another language. (Here XYZ stands for a
|
|||
|
specific section name mentioned below, such as ``Acknowledgements'',
|
|||
|
``Dedications'', ``Endorsements'', or ``History''.) To ``Preserve the Title''
|
|||
|
of such a section when you modify the Document means that it remains a
|
|||
|
section ``Entitled XYZ'' according to this definition.
|
|||
|
|
|||
|
The Document may include Warranty Disclaimers next to the notice which
|
|||
|
states that this License applies to the Document. These Warranty
|
|||
|
Disclaimers are considered to be included by reference in this
|
|||
|
License, but only as regards disclaiming warranties: any other
|
|||
|
implication that these Warranty Disclaimers may have is void and has
|
|||
|
no effect on the meaning of this License.
|
|||
|
|
|||
|
@item
|
|||
|
VERBATIM COPYING
|
|||
|
|
|||
|
You may copy and distribute the Document in any medium, either
|
|||
|
commercially or noncommercially, provided that this License, the
|
|||
|
copyright notices, and the license notice saying this License applies
|
|||
|
to the Document are reproduced in all copies, and that you add no other
|
|||
|
conditions whatsoever to those of this License. You may not use
|
|||
|
technical measures to obstruct or control the reading or further
|
|||
|
copying of the copies you make or distribute. However, you may accept
|
|||
|
compensation in exchange for copies. If you distribute a large enough
|
|||
|
number of copies you must also follow the conditions in section 3.
|
|||
|
|
|||
|
You may also lend copies, under the same conditions stated above, and
|
|||
|
you may publicly display copies.
|
|||
|
|
|||
|
@item
|
|||
|
COPYING IN QUANTITY
|
|||
|
|
|||
|
If you publish printed copies (or copies in media that commonly have
|
|||
|
printed covers) of the Document, numbering more than 100, and the
|
|||
|
Document's license notice requires Cover Texts, you must enclose the
|
|||
|
copies in covers that carry, clearly and legibly, all these Cover
|
|||
|
Texts: Front-Cover Texts on the front cover, and Back-Cover Texts on
|
|||
|
the back cover. Both covers must also clearly and legibly identify
|
|||
|
you as the publisher of these copies. The front cover must present
|
|||
|
the full title with all words of the title equally prominent and
|
|||
|
visible. You may add other material on the covers in addition.
|
|||
|
Copying with changes limited to the covers, as long as they preserve
|
|||
|
the title of the Document and satisfy these conditions, can be treated
|
|||
|
as verbatim copying in other respects.
|
|||
|
|
|||
|
If the required texts for either cover are too voluminous to fit
|
|||
|
legibly, you should put the first ones listed (as many as fit
|
|||
|
reasonably) on the actual cover, and continue the rest onto adjacent
|
|||
|
pages.
|
|||
|
|
|||
|
If you publish or distribute Opaque copies of the Document numbering
|
|||
|
more than 100, you must either include a machine-readable Transparent
|
|||
|
copy along with each Opaque copy, or state in or with each Opaque copy
|
|||
|
a computer-network location from which the general network-using
|
|||
|
public has access to download using public-standard network protocols
|
|||
|
a complete Transparent copy of the Document, free of added material.
|
|||
|
If you use the latter option, you must take reasonably prudent steps,
|
|||
|
when you begin distribution of Opaque copies in quantity, to ensure
|
|||
|
that this Transparent copy will remain thus accessible at the stated
|
|||
|
location until at least one year after the last time you distribute an
|
|||
|
Opaque copy (directly or through your agents or retailers) of that
|
|||
|
edition to the public.
|
|||
|
|
|||
|
It is requested, but not required, that you contact the authors of the
|
|||
|
Document well before redistributing any large number of copies, to give
|
|||
|
them a chance to provide you with an updated version of the Document.
|
|||
|
|
|||
|
@item
|
|||
|
MODIFICATIONS
|
|||
|
|
|||
|
You may copy and distribute a Modified Version of the Document under
|
|||
|
the conditions of sections 2 and 3 above, provided that you release
|
|||
|
the Modified Version under precisely this License, with the Modified
|
|||
|
Version filling the role of the Document, thus licensing distribution
|
|||
|
and modification of the Modified Version to whoever possesses a copy
|
|||
|
of it. In addition, you must do these things in the Modified Version:
|
|||
|
|
|||
|
@enumerate A
|
|||
|
@item
|
|||
|
Use in the Title Page (and on the covers, if any) a title distinct
|
|||
|
from that of the Document, and from those of previous versions
|
|||
|
(which should, if there were any, be listed in the History section
|
|||
|
of the Document). You may use the same title as a previous version
|
|||
|
if the original publisher of that version gives permission.
|
|||
|
|
|||
|
@item
|
|||
|
List on the Title Page, as authors, one or more persons or entities
|
|||
|
responsible for authorship of the modifications in the Modified
|
|||
|
Version, together with at least five of the principal authors of the
|
|||
|
Document (all of its principal authors, if it has fewer than five),
|
|||
|
unless they release you from this requirement.
|
|||
|
|
|||
|
@item
|
|||
|
State on the Title page the name of the publisher of the
|
|||
|
Modified Version, as the publisher.
|
|||
|
|
|||
|
@item
|
|||
|
Preserve all the copyright notices of the Document.
|
|||
|
|
|||
|
@item
|
|||
|
Add an appropriate copyright notice for your modifications
|
|||
|
adjacent to the other copyright notices.
|
|||
|
|
|||
|
@item
|
|||
|
Include, immediately after the copyright notices, a license notice
|
|||
|
giving the public permission to use the Modified Version under the
|
|||
|
terms of this License, in the form shown in the Addendum below.
|
|||
|
|
|||
|
@item
|
|||
|
Preserve in that license notice the full lists of Invariant Sections
|
|||
|
and required Cover Texts given in the Document's license notice.
|
|||
|
|
|||
|
@item
|
|||
|
Include an unaltered copy of this License.
|
|||
|
|
|||
|
@item
|
|||
|
Preserve the section Entitled ``History'', Preserve its Title, and add
|
|||
|
to it an item stating at least the title, year, new authors, and
|
|||
|
publisher of the Modified Version as given on the Title Page. If
|
|||
|
there is no section Entitled ``History'' in the Document, create one
|
|||
|
stating the title, year, authors, and publisher of the Document as
|
|||
|
given on its Title Page, then add an item describing the Modified
|
|||
|
Version as stated in the previous sentence.
|
|||
|
|
|||
|
@item
|
|||
|
Preserve the network location, if any, given in the Document for
|
|||
|
public access to a Transparent copy of the Document, and likewise
|
|||
|
the network locations given in the Document for previous versions
|
|||
|
it was based on. These may be placed in the ``History'' section.
|
|||
|
You may omit a network location for a work that was published at
|
|||
|
least four years before the Document itself, or if the original
|
|||
|
publisher of the version it refers to gives permission.
|
|||
|
|
|||
|
@item
|
|||
|
For any section Entitled ``Acknowledgements'' or ``Dedications'', Preserve
|
|||
|
the Title of the section, and preserve in the section all the
|
|||
|
substance and tone of each of the contributor acknowledgements and/or
|
|||
|
dedications given therein.
|
|||
|
|
|||
|
@item
|
|||
|
Preserve all the Invariant Sections of the Document,
|
|||
|
unaltered in their text and in their titles. Section numbers
|
|||
|
or the equivalent are not considered part of the section titles.
|
|||
|
|
|||
|
@item
|
|||
|
Delete any section Entitled ``Endorsements''. Such a section
|
|||
|
may not be included in the Modified Version.
|
|||
|
|
|||
|
@item
|
|||
|
Do not retitle any existing section to be Entitled ``Endorsements'' or
|
|||
|
to conflict in title with any Invariant Section.
|
|||
|
|
|||
|
@item
|
|||
|
Preserve any Warranty Disclaimers.
|
|||
|
@end enumerate
|
|||
|
|
|||
|
If the Modified Version includes new front-matter sections or
|
|||
|
appendices that qualify as Secondary Sections and contain no material
|
|||
|
copied from the Document, you may at your option designate some or all
|
|||
|
of these sections as invariant. To do this, add their titles to the
|
|||
|
list of Invariant Sections in the Modified Version's license notice.
|
|||
|
These titles must be distinct from any other section titles.
|
|||
|
|
|||
|
You may add a section Entitled ``Endorsements'', provided it contains
|
|||
|
nothing but endorsements of your Modified Version by various
|
|||
|
parties---for example, statements of peer review or that the text has
|
|||
|
been approved by an organization as the authoritative definition of a
|
|||
|
standard.
|
|||
|
|
|||
|
You may add a passage of up to five words as a Front-Cover Text, and a
|
|||
|
passage of up to 25 words as a Back-Cover Text, to the end of the list
|
|||
|
of Cover Texts in the Modified Version. Only one passage of
|
|||
|
Front-Cover Text and one of Back-Cover Text may be added by (or
|
|||
|
through arrangements made by) any one entity. If the Document already
|
|||
|
includes a cover text for the same cover, previously added by you or
|
|||
|
by arrangement made by the same entity you are acting on behalf of,
|
|||
|
you may not add another; but you may replace the old one, on explicit
|
|||
|
permission from the previous publisher that added the old one.
|
|||
|
|
|||
|
The author(s) and publisher(s) of the Document do not by this License
|
|||
|
give permission to use their names for publicity for or to assert or
|
|||
|
imply endorsement of any Modified Version.
|
|||
|
|
|||
|
@item
|
|||
|
COMBINING DOCUMENTS
|
|||
|
|
|||
|
You may combine the Document with other documents released under this
|
|||
|
License, under the terms defined in section 4 above for modified
|
|||
|
versions, provided that you include in the combination all of the
|
|||
|
Invariant Sections of all of the original documents, unmodified, and
|
|||
|
list them all as Invariant Sections of your combined work in its
|
|||
|
license notice, and that you preserve all their Warranty Disclaimers.
|
|||
|
|
|||
|
The combined work need only contain one copy of this License, and
|
|||
|
multiple identical Invariant Sections may be replaced with a single
|
|||
|
copy. If there are multiple Invariant Sections with the same name but
|
|||
|
different contents, make the title of each such section unique by
|
|||
|
adding at the end of it, in parentheses, the name of the original
|
|||
|
author or publisher of that section if known, or else a unique number.
|
|||
|
Make the same adjustment to the section titles in the list of
|
|||
|
Invariant Sections in the license notice of the combined work.
|
|||
|
|
|||
|
In the combination, you must combine any sections Entitled ``History''
|
|||
|
in the various original documents, forming one section Entitled
|
|||
|
``History''; likewise combine any sections Entitled ``Acknowledgements'',
|
|||
|
and any sections Entitled ``Dedications''. You must delete all
|
|||
|
sections Entitled ``Endorsements.''
|
|||
|
|
|||
|
@item
|
|||
|
COLLECTIONS OF DOCUMENTS
|
|||
|
|
|||
|
You may make a collection consisting of the Document and other documents
|
|||
|
released under this License, and replace the individual copies of this
|
|||
|
License in the various documents with a single copy that is included in
|
|||
|
the collection, provided that you follow the rules of this License for
|
|||
|
verbatim copying of each of the documents in all other respects.
|
|||
|
|
|||
|
You may extract a single document from such a collection, and distribute
|
|||
|
it individually under this License, provided you insert a copy of this
|
|||
|
License into the extracted document, and follow this License in all
|
|||
|
other respects regarding verbatim copying of that document.
|
|||
|
|
|||
|
@item
|
|||
|
AGGREGATION WITH INDEPENDENT WORKS
|
|||
|
|
|||
|
A compilation of the Document or its derivatives with other separate
|
|||
|
and independent documents or works, in or on a volume of a storage or
|
|||
|
distribution medium, is called an ``aggregate'' if the copyright
|
|||
|
resulting from the compilation is not used to limit the legal rights
|
|||
|
of the compilation's users beyond what the individual works permit.
|
|||
|
When the Document is included in an aggregate, this License does not
|
|||
|
apply to the other works in the aggregate which are not themselves
|
|||
|
derivative works of the Document.
|
|||
|
|
|||
|
If the Cover Text requirement of section 3 is applicable to these
|
|||
|
copies of the Document, then if the Document is less than one half of
|
|||
|
the entire aggregate, the Document's Cover Texts may be placed on
|
|||
|
covers that bracket the Document within the aggregate, or the
|
|||
|
electronic equivalent of covers if the Document is in electronic form.
|
|||
|
Otherwise they must appear on printed covers that bracket the whole
|
|||
|
aggregate.
|
|||
|
|
|||
|
@item
TRANSLATION

Translation is considered a kind of modification, so you may
distribute translations of the Document under the terms of section 4.
Replacing Invariant Sections with translations requires special
permission from their copyright holders, but you may include
translations of some or all Invariant Sections in addition to the
original versions of these Invariant Sections. You may include a
translation of this License, and all the license notices in the
Document, and any Warranty Disclaimers, provided that you also include
the original English version of this License and the original versions
of those notices and disclaimers. In case of a disagreement between
the translation and the original version of this License or a notice
or disclaimer, the original version will prevail.

If a section in the Document is Entitled ``Acknowledgements'',
``Dedications'', or ``History'', the requirement (section 4) to Preserve
its Title (section 1) will typically require changing the actual
title.

@item
TERMINATION

You may not copy, modify, sublicense, or distribute the Document except
as expressly provided for under this License. Any other attempt to
copy, modify, sublicense or distribute the Document is void, and will
automatically terminate your rights under this License. However,
parties who have received copies, or rights, from you under this
License will not have their licenses terminated so long as such
parties remain in full compliance.

@item
FUTURE REVISIONS OF THIS LICENSE

The Free Software Foundation may publish new, revised versions
of the GNU Free Documentation License from time to time. Such new
versions will be similar in spirit to the present version, but may
differ in detail to address new problems or concerns. See
@uref{http://www.gnu.org/copyleft/}.

Each version of the License is given a distinguishing version number.
If the Document specifies that a particular numbered version of this
License ``or any later version'' applies to it, you have the option of
following the terms and conditions either of that specified version or
of any later version that has been published (not as a draft) by the
Free Software Foundation. If the Document does not specify a version
number of this License, you may choose any version ever published (not
as a draft) by the Free Software Foundation.
@end enumerate

@page
@heading ADDENDUM: How to use this License for your documents

To use this License in a document you have written, include a copy of
the License in the document and put the following copyright and
license notices just after the title page:

@smallexample
@group
Copyright (C) @var{year} @var{your name}.
Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License, Version 1.2
or any later version published by the Free Software Foundation;
with no Invariant Sections, no Front-Cover Texts, and no Back-Cover
Texts. A copy of the license is included in the section entitled ``GNU
Free Documentation License''.
@end group
@end smallexample

If you have Invariant Sections, Front-Cover Texts and Back-Cover Texts,
replace the ``with@dots{}Texts.'' line with this:

@smallexample
@group
with the Invariant Sections being @var{list their titles}, with
the Front-Cover Texts being @var{list}, and with the Back-Cover Texts
being @var{list}.
@end group
@end smallexample

If you have Invariant Sections without Cover Texts, or some other
combination of the three, merge those two alternatives to suit the
situation.

If your document contains nontrivial examples of program code, we
recommend releasing these examples in parallel under your choice of
free software license, such as the GNU General Public License,
to permit their use in free software.

@c Local Variables:
@c ispell-local-pdict: "ispell-dict"
@c End:


@c ---------------------------------------------------------------------
@c ---------------------------------------------------------------------

@node Reporting bugs
@chapter Reporting bugs

Report bugs to <obrebski@@amu.edu.pl>.

@c ---------------------------------------------------------------------
@c ---------------------------------------------------------------------

@c @node Copyright
@c @chapter Copyright
@c
@c Copyright 2004 by Tomasz Obrebski
@c This software is free for research and educational use.

@c ---------------------------------------------------------------------
@c ---------------------------------------------------------------------

@node Author
@chapter Author


@bye