w utt.texinfo

git-svn-id: svn://atos.wmid.amu.edu.pl/utt@60 e293616e-ec6a-49c2-aa92-f4a8b91c5d16
obrebski 2008-10-22 09:53:31 +00:00
parent 839a0d50e2
commit 261bf629fb

@ -8,15 +8,16 @@
@c %**end of header
@copying
This manual is for UAM Text Tools (version 0.90, November, 2007)
This manual is for UAM Text Tools (version 0.90, October, 2008)
Copyright @copyright{} 2005, 2007 Tomasz Obrębski, Michał Stolarski, Justyna Walkowska, Paweł Konieczka.
Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License, Version 1.2
or any later version published by the Free Software Foundation;
with no Invariant Sections, no Front-Cover Texts, and no Back-Cover
Texts. A copy of the license is included in the section entitled GNU Free Documentation License,,GNU Free Documentation License.
under the terms of the GNU Free Documentation License, Version 1.2 or
any later version published by the Free Software Foundation; with no
Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A
copy of the license is included in the section entitled GNU Free
Documentation License,,GNU Free Documentation License.
@c @quotation
@c Permission is granted to ...
@ -357,12 +358,33 @@ but not
0005 02 W km
@end example
because in the latter example the first segment (starting at position 0000, 2 characters long) ends at position @var{n}=0001 which is covered by the second segment and no segment starts at position @var{n+2}=0002.
because in the latter example the first segment (starting at position
0000, 2 characters long) ends at position @var{n}=0001 which is
covered by the second segment and no segment starts at position
@var{n+2}=0002.
@section Flattened UTT file
The UTT file format has two variants: regular and flattened. The regular
format was described above. In the flattened format some of the
end-of-line characters are replaced with line-feed characters.
The flattened format is mainly used to represent whole sentences as
single lines of the input file (all intrasentential end-of-line
characters are replaced with line-feed characters).
This technical trick makes it possible to perform certain text
processing operations on entire sentences with tools such as
@command{grep} (see the @command{grp} component) or @command{sed} (see the @command{mar} component).
The conversion between the two formats is performed by the tools
@command{fla} and @command{unfla}.
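The idea behind the flattened variant can be illustrated with a minimal Python sketch. It is not the implementation of @command{fla} or @command{unfla}: the function names, the placeholder separator character, and the assumption that sentence boundaries appear as segments whose third field is BOS or EOS are all choices made for this illustration.

```python
# Hypothetical separator used only in this sketch; the real tools may use
# a different character.
SEP = "\x1f"

def flatten(lines, sep=SEP):
    """Join each run of segments from a BOS segment to the next EOS segment into one line."""
    out, sentence = [], None
    for line in lines:
        fields = line.split()
        tag = fields[2] if len(fields) > 2 else ""
        if tag == "BOS" and sentence is None:
            sentence = [line]              # a new sentence starts here
        elif sentence is not None:
            sentence.append(line)
            if tag == "EOS":               # sentence complete: emit it as a single line
                out.append(sep.join(sentence))
                sentence = None
        else:
            out.append(line)               # segment outside any sentence: copy unchanged
    if sentence is not None:               # unterminated sentence: emit what was collected
        out.append(sep.join(sentence))
    return out

def unflatten(lines, sep=SEP):
    """Restore the regular variant: one segment per line."""
    return [segment for line in lines for segment in line.split(sep)]
```

With sentences stored one per line, a line-oriented tool can select or rewrite whole sentences, and `unflatten` restores the one-segment-per-line layout afterwards.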
@section Character encoding
The UTT component programs accept only 1-byte character encoding, such
as ISO, ANSI, DOS, UTF-8 (probably: not tested yet).
as ISO, ANSI, DOS.
@c @section Formats
@ -525,99 +547,6 @@ This option is useful when working with @command{kot} or @command{con}.
@end macro
@c ---------------------------------------------------------------------
@c ---------------------------------------------------------------------
@c @node Common command line options
@c @chapter Common command line options
@c @table @code
@c @parhelp
@c @item @b{@minus{}@minus{}help}, @b{@minus{}h}
@c Print help.
@c @item @b{@minus{}@minus{}version}, @b{@minus{}v}
@c Print version information.
@c @item @b{@minus{}@minus{}file=@var{filename}, @minus{}f @var{filename}}
@c Input file name.
@c If this option is absent or equal to '@minus{}', the program
@c reads from the standard input.
@c @item @b{@minus{}@minus{}output=@var{filename}, @minus{}o @var{filename}}
@c Regular output file name. To regular output the program sends segments
@c which it successfully processed and copies those which were not
@c subject to processing. If this option is absent or equal to
@c '@minus{}', standard output is used.
@c @item @b{@minus{}@minus{}fail=@var{filename}, @minus{}e @var{filename}}
@c Fail output file name. To fail output the program copies the segments
@c it failed to process. If this option is absent or equal to
@c '@minus{}', standard output is used.
@c @item @b{@minus{}@minus{}only-fail}
@c Discard segments which would normally be sent to regular
@c output. Print only segments the program failed to process.
@c @item @b{@minus{}@minus{}no-fail}
@c Discard segments the program failed to process.
@c (This and the previous option are functionally equivalent to,
@c respectively, @option{-o /dev/null} and @option{-e /dev/null}, but
@c make the programs run faster.)
@c @item @b{@minus{}@minus{}input-field=@var{fieldname}, @minus{}I @var{fieldname}}
@c The field containing the input to the program. The default is usually
@c the @var{form} field (unless otherwise stated in the program
@c description). The fields @var{position}, @var{length}, @var{tag}, and
@c @var{form} are referred to as @code{1}, @code{2}, @code{3}, @code{4},
@c respectively.
@c @item @b{@minus{}@minus{}output-field=@var{fieldname}, @minus{}O @var{fieldname}}
@c The name of the field added by the program. The default is the name of
@c the program.
@c @c @item @b{@minus{}@minus{}copy, @minus{}c}
@c @c Copy processed segments to regular output.
@c @item @b{@minus{}@minus{}dictionary=@var{filename}, @minus{}d @var{filename}}
@c Dictionary file name.
@c (This option is used by programs which use dictionary data.)
@c @item @b{@minus{}@minus{}process=@var{tag}, @minus{}p @var{tag}}
@c Process segments with the specified value in the @var{tag} field.
@c Multiple occurrences of this option are allowed and are interpreted as
@c disjunction. If this option is absent, all segments are processed.
@c @item @b{@minus{}@minus{}select=@var{fieldname}, @minus{}s @var{fieldname}}
@c Select for processing only segments in which the field named
@c @var{fieldname} is present. Multiple occurrences of this option are
@c allowed and are interpreted as conjunction of conditions. If this
@c option is absent, all segments are processed.
@c @item @b{@minus{}@minus{}unselect=@var{fieldname}, @minus{}S @var{fieldname}}
@c Select for processing only segments in which the field @var{fieldname}
@c is absent. Multiple occurrences of this option are allowed and are
@c interpreted as conjunction of conditions. If this option is absent,
@c all segments are processed.
@c @item @b{@minus{}@minus{}interactive @minus{}i}
@c This option toggles interactive mode, which is by default off. In the
@c interactive mode the program does not buffer the output.
@c @item @b{@minus{}@minus{}config=@var{filename}}
@c Read configuration from file @file{@var{filename}}.
@c @item @b{@minus{}@minus{}one @minus{}1}
@c This option makes the program print ambiguous annotation in one output
@c segment. By default when
@c ambiguous new annotation is being produced for a segment, the segment
@c is duplicated and each of the annotations is added to a separate copy
@c of the segment.
@c @end table
@c ---------------------------------------------------------------------
@c CONFIGURATION FILES
@c ---------------------------------------------------------------------
@ -694,14 +623,16 @@ in UTT format
* tok:: a tokenizer
Filters: programs which read and produce UTT-formatted data
@c * sen - the sentencizer::
* lem:: a morphological analyzer
* gue:: a morphological guesser
* cor:: a spelling corrector
* cor:: a simple spelling corrector
* kor:: a more elaborate spelling corrector
* sen:: a sentencizer
@c * gph - the graphizer::
* ser:: a pattern search tool (marks matches)
* mar:: a pattern search tool (introduces arbitrary markers into the text)
* grp:: a pattern search tool (selects sentences containing a match)
@c * gph:: a word-graph annotation tool::
@c * dgp:: a dependency parser
Sinks: programs which read UTT data and produce output in another format
* kot:: an untokenizer
@ -721,6 +652,9 @@ Sinks: programs which read UTT data and produce output in another format
@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
@item @strong{Authors:} @tab Tomasz Obrębski
@item @strong{Component category:} @tab source
@item @strong{Input format:} @tab raw text file
@item @strong{Output format:} @tab UTT regular
@item @strong{Required annotation:} @tab -
@end multitable
@ -834,6 +768,9 @@ Output:
@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
@item @strong{Authors:} @tab Tomasz Obrębski, Michał Stolarski
@item @strong{Component category:} @tab filter
@item @strong{Input format:} @tab UTT regular
@item @strong{Output format:} @tab UTT regular
@item @strong{Required annotation:} @tab tok
@end multitable
@menu
@ -1031,28 +968,34 @@ A large-coverage morphological dictionary for Polish language, Polex/PMDBF, is i
the distribution as the default @emph{lem}'s dictionary. It's
located by default in:
@file{$HOME/.utt/pl/lem.bin}
@file{$HOME/.local/share/utt/pl_PL.ISO-8859-2/lem.bin}
in local installation or in
@file{/usr/local/share/utt/pl_PL.ISO-8859-2/lem.bin}
in system installation.
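As an illustration of the two default locations, the following Python sketch builds both paths and returns the first one that exists. The helper names and the lookup order (local installation first) are assumptions of this sketch, not documented behaviour of @command{lem}.

```python
import os

def lem_dictionary_candidates(home, locale="pl_PL.ISO-8859-2"):
    """Return the two default locations named above, local installation first."""
    return [
        f"{home}/.local/share/utt/{locale}/lem.bin",   # local installation
        f"/usr/local/share/utt/{locale}/lem.bin",      # system installation
    ]

def find_lem_dictionary(home=None, exists=os.path.isfile):
    """Return the first candidate path that exists, or None if neither does."""
    if home is None:
        home = os.path.expanduser("~")
    for path in lem_dictionary_candidates(home):
        if exists(path):
            return path
    return None
```

The `exists` parameter is there only so the lookup can be tried out without a real installation.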
@node lem hints
@subsection Hints
@c @subsubheading Combining data from multiple dictionaries
@subsubheading Combining data from multiple dictionaries
@c @itemize
@itemize
@c @item Apply <dict1>, then apply <dict2> to words which were not annotated.
@item Apply <dict1>, then apply <dict2> to words which were not annotated.
@c @example
@c lem -d <dict1> | lem -S lem -d <dict2>
@c @end example
@example
lem -d <dict1> | lem -S lem -d <dict2>
@end example
@c @item Add annotations from two dictionaries <dict1> and <dict2>.
@item Add annotations from two dictionaries <dict1> and <dict2>.
@c @example
@c lem -c -d <dict1> | lem -S lem -d <dict2>
@c @end example
@example
lem -c -d <dict1> | lem -S lem -d <dict2>
@end example
@c @end itemize
@end itemize
@c ---------------------------------------------------------------------
@ -1070,15 +1013,21 @@ located by default in:
@end multitable
@command{gue} guesses morphological descriptions of the form contained
in the @var{form} field.
@menu
* gue description::
* gue command line options::
* gue example::
* gue dictionaries::
@end menu
@node gue description
@subsection Description
@command{gue} guesses morphological descriptions of the form contained
in the @var{form} field.
@node gue command line options
@subsection Command line options
@ -1181,24 +1130,27 @@ naj*elszy;3-4ały,ADJ/...:...
@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
@item @strong{Authors:} @tab Tomasz Obrębski, Michał Stolarski
@item @strong{Component category:} @tab filter
@item @strong{Input format:} @tab UTT regular
@item @strong{Output format:} @tab UTT regular
@item @strong{Required annotation:} @tab tok
@end multitable
@menu
* cor description::
* cor command line options::
* cor dictionaries::
@end menu
@node cor description
@subsection Description
The spelling corrector applies Kemal Oflazer's dynamic programming
algorithm @cite{oflazer96} to the FSA representation of the set of
word forms of the Polex/PMDBF dictionary. Given an incorrect
word form it returns all word forms present in the dictionary whose
edit distance is smaller than the threshold given as the parameter.
By default @code{cor} replaces the contents of the @var{form} field
with the new corrected value, placing the old contents in the @code{cor}
field.
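The effect of the distance threshold can be illustrated with a small Python sketch. It is a brute-force comparison against a word list using plain Levenshtein distance, not Oflazer's algorithm over an FSA, and the function names are hypothetical. It also assumes that a candidate is accepted when its distance does not exceed the limit, following the wording "maximum edit distance" of the @option{--distance} option.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or keep if equal
            ))
        prev = cur
    return prev[-1]

def corrections(form, word_forms, max_distance=1):
    """Return the dictionary word forms within max_distance edits of form."""
    return [w for w in word_forms if edit_distance(form, w) <= max_distance]
```

Scanning the whole word list costs time proportional to the dictionary size for every query, which is what traversing an FSA representation of the dictionary avoids.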
@menu
* cor command line options::
* cor dictionaries::
@end menu
@node cor command line options
@subsection Command line options
@ -1224,6 +1176,10 @@ field.
@item @b{@minus{}@minus{}distance=@var{int}, @minus{}n @var{int}}
Maximum edit distance (default='1').
@c @item @b{@minus{}@minus{}replace, @minus{}r}
@c Replace original form with corrected form, place original form in the
@c cor field. This option has no effect in @option{--one-*} modes (default=off)
@end table
@ -1242,6 +1198,29 @@ odlotowy
odludek
@end example
@subsubheading Binary format
The mandatory file name extension for a binary dictionary is @code{bin}. To
compile a text dictionary into binary format, write:
@example
compiledic <dictionaryname>.dic
@end example
@c ---------------------------------------------------------------------
@c KOR
@c ---------------------------------------------------------------------
@page
@node kor
@section kor - configurable spelling corrector
[TODO]
@c ---------------------------------------------------------------------
@c SEN
@c ---------------------------------------------------------------------
@page
@node sen
@section sen - a sentencizer
@ -1250,17 +1229,25 @@ odludek
@item @strong{Authors:} @tab Tomasz Obrębski
@item @strong{Component category:} @tab filter
@item @strong{Input format:} @tab UTT regular
@item @strong{Output format:} @tab UTT regular
@item @strong{Required annotation:} @tab tok
@end multitable
@command{sen} detects sentence boundaries in UTT-formatted texts and marks them with special zero-length segments, in which the @var{type} field may contain the BOS (beginning of sentence) or EOS (end of sentence) annotation.
@menu
* sen description::
@c * sen input::
@c * sen output::
* sen example::
@end menu
@node sen description
@subsection Description
@command{sen} detects sentence boundaries in UTT-formatted texts and marks them with special zero-length segments, in which the @var{type} field may contain the BOS (beginning of sentence) or EOS (end of sentence) annotation.
@node sen example
@subsection Example
@ -1304,8 +1291,8 @@ output:
@c SER
@c ---------------------------------------------------------------------
@c SER
@c ---------------------------------------------------------------------
@page
@ -1315,11 +1302,13 @@ output:
@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
@item @strong{Authors:} @tab Tomasz Obrębski
@item @strong{Component category:} @tab filter
@item @strong{Input format:} @tab UTT regular
@item @strong{Output format:} @tab UTT regular
@item @strong{Required annotation:} @tab tok, lem --one-field
@end multitable
@command{ser} looks for patterns in UTT-formatted texts.
@menu
* ser description::
* ser command line options::
* ser pattern::
* ser how ser works::
@ -1329,6 +1318,12 @@ output:
@end menu
@node ser description
@subsection Description
@command{ser} looks for patterns in UTT-formatted texts.
@c ---------------------------------------------------------------------
@node ser command line options
@subsection Command line options
@ -1503,7 +1498,7 @@ occurrence of a relative pronoun
@c All predefined terms correspond to single segments,
@example
define(`verbseq', `(cat(V) (space cat(V)))')
define(`verbseq', `(cat(<V>) (space cat(<V>)))')
@end example
@ -1514,7 +1509,7 @@ the term @code{cat()} may not be used as a ... of
@node ser limitations
@subsection Limitations
more than 3 attributes in <>.
Do not use more than 3 attributes in <>.
@node ser requirements
@subsection Requirements
@ -1532,8 +1527,8 @@ installed in the system:
@end itemize
@c GRP
@c ---------------------------------------------------------------------
@c GRP
@c ---------------------------------------------------------------------
@page
@ -1543,9 +1538,23 @@ installed in the system:
@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
@item @strong{Authors:} @tab Tomasz Obrębski
@item @strong{Component category:} @tab filter
@item @strong{Input format:} @tab UTT flattened
@item @strong{Output format:} @tab UTT flattened
@item @strong{Required annotation:} @tab tok, sen, lem --one-field
@end multitable
@menu
* grp description::
* grp command line options::
* grp pattern::
* grp hints::
@end menu
@node grp description
@subsection Description
@code{grp} selects sentences containing an expression matching a
pattern. The pattern format is exactly the same as that accepted by
@code{ser}.
@ -1554,22 +1563,6 @@ pattern. The pattern format is exactly the same as that accepted by
It is extremely fast (processing speed is usually higher than the speed
of reading the corpus file from disk).
@c @menu
@c * ser command line options::
@c * ser pattern::
@c * ser how ser works::
@c * ser customization::
@c * ser limitations::
@c * ser requirements::
@c @end menu
@menu
* grp command line options::
* grp pattern::
* grp hints::
@end menu
@node grp command line options
@subsection Command line options
@ -1577,10 +1570,6 @@ of reading the corpus file from disk).
@parhelp
@parversion
@c @parfile
@c @paroutput
@c @parinputfield
@c @paroutputfield
@parprocess
@parinteractive
@ -1626,24 +1615,51 @@ lzop -cd corpus.grp.lzo | grp -a gP -e @var{EXPR} | ser -e @var{EXPR}
@end example
@c ---------------------------------------------------------------------
@c kot
@c MAR
@c ---------------------------------------------------------------------
@page
@node mar
@section mar
@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
@item @strong{Authors:} @tab Marcin Walas, Tomasz Obrębski
@item @strong{Component category:} @tab filter
@end multitable
[TODO]
@c ---------------------------------------------------------------------
@c KOT
@c ---------------------------------------------------------------------
@page
@node kot
@section kot - untokenizer
Authors: Tomasz Obrębski
@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
@item @strong{Authors:} @tab Tomasz Obrębski
@item @strong{Component category:} @tab filter
@item @strong{Input format:} @tab UTT regular
@item @strong{Output format:} @tab text
@item @strong{Required annotation:} @tab tok
@end multitable
@command{kot} is the opposite of @command{tok}. It changes UTT-formatted text into plain text.
@menu
* kot description::
* kot command line options::
* kot usage examples::
@end menu
@node kot description
@subsection Description
@command{kot} transforms a UTT-formatted file back into raw text format.
@node kot command line options
@subsection Command line options
@ -1683,28 +1699,38 @@ cat legia.txt | tok | kot
cat legia.txt | tok | lem -1 | kot
@end example
@c CON............................................................
@c ...............................................................
@c ...............................................................
@c ---------------------------------------------------------------
@c CON
@c ---------------------------------------------------------------
@page
@node con
@section con - concordance table generator
@command{con} generates a concordance table based on a pattern given to @command{ser}.
@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
@item @strong{Authors:} @tab Justyna Walkowska
@item @strong{Component category:} @tab sink
@item @strong{Input format:} @tab UTT regular
@item @strong{Output format:} @tab text
@item @strong{Required annotation:} @tab ser or mar
@end multitable
@c
@menu
* con description::
* con command line options::
* con usage example::
* con hints::
@end menu
@node con description
@subsection Description
@command{con} generates a concordance table based on a pattern given to @command{ser}.
@node con command line options
@subsection Command line options
@ -1757,9 +1783,9 @@ cat legia.txt | tok | lem -1 | kot
Left column minimal width in characters (default = 0).
@item @b{@minus{}@minus{}ignore @minus{}i}
Ignore segment inconsistency in the input.
@item @b{@minus{}@minus{}bon}
@item @b{@minus{}@minus{}bom}
Beginning of selected segment (regex, default='[0-9]+ [0-9]+ BOM .*').
@item @b{@minus{}@minus{}eob}
@item @b{@minus{}@minus{}eom}
End of selected segment (regex, default='[0-9]+ [0-9]+ EOM .*').
@item @b{@minus{}@minus{}bod}
Selected segment beginning display string (default='[').
@ -1773,7 +1799,7 @@ cat legia.txt | tok | lem -1 | kot
@node con usage example
@subsection Usage example
@example
cat file.txt | tok | lem -1 | ser -e 'lexeme(dom) | con'
cat file.txt | tok | lem -1 | ser -e 'lexeme(dom)' | con
@end example
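How the default @option{--bom} and @option{--eom} expressions delimit a match can be sketched in Python. The two regular expressions are the documented defaults; the function name, the sample segment lines, and the choice to match whole lines are assumptions of this sketch, not a description of how @command{con} is implemented.

```python
import re

# Default marker patterns as documented for --bom and --eom.
BOM = re.compile(r"[0-9]+ [0-9]+ BOM .*")
EOM = re.compile(r"[0-9]+ [0-9]+ EOM .*")

def selected_segments(lines):
    """Return, for each BOM ... EOM pair, the list of segment lines between the markers."""
    matches, current = [], None
    for line in lines:
        if BOM.fullmatch(line):
            current = []                   # a selected stretch begins
        elif EOM.fullmatch(line):
            if current is not None:
                matches.append(current)    # a selected stretch ends
            current = None
        elif current is not None:
            current.append(line)           # segment inside the selection
    return matches
```

For a hypothetical input in which a BOM segment and an EOM segment surround the line `0004 03 W dom`, the function returns a single match containing just that line.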
@ -1789,7 +1815,6 @@ sequence:
@end example
@c ---------------------------------------------------------------------
@c ---------------------------------------------------------------------