w utt.texinfo

git-svn-id: svn://atos.wmid.amu.edu.pl/utt@60 e293616e-ec6a-49c2-aa92-f4a8b91c5d16
2008-10-22 09:53:31 +00:00 · 2008-10-22 09:53:31 +00:00 · 261bf629fb
commit 261bf629fb
parent 839a0d50e2
1 changed files with 192 additions and 167 deletions
--- a/app/doc/utt.texinfo
+++ b/app/doc/utt.texinfo
@ -8,15 +8,16 @@
@c %**end of header
@copying
-This manual is for UAM Text Tools (version 0.90, November, 2007)
+This manual is for UAM Text Tools (version 0.90, October, 2008)
 Copyright @copyright{}  2005, 2007  Tomasz Obrębski, Michał Stolarski, Justyna Walkowska, Paweł Konieczka.
 Permission is granted to copy, distribute and/or modify this document
-under the terms of the GNU Free Documentation License, Version 1.2
+under the terms of the GNU Free Documentation License, Version 1.2 or
-or any later version published by the Free Software Foundation;
+any later version published by the Free Software Foundation; with no
-with no Invariant Sections, no Front-Cover Texts, and no Back-Cover
+Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.  A
-Texts.  A copy of the license is included in the section entitled GNU Free Documentation License,,GNU Free Documentation License.
+copy of the license is included in the section entitled GNU Free
 Documentation License,,GNU Free Documentation License.
@c @quotation
@c Permission is granted to ...
@ -357,12 +358,33 @@ but not
 0005 02 W km
@end example
-because in the latter example the first segment (starting at position 0000, 2 characters long) ends at position @var{n}=0001 which is covered by the second segment and no segment starts at position @var{n+2}=0002.
+because in the latter example the first segment (starting at position
 0000, 2 characters long) ends at position @var{n}=0001 which is
 covered by the second segment and no segment starts at position
@var{n+1}=0002.
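
 The arithmetic behind this check can be sketched in a few lines of
 Python. This is only an illustration and not part of UTT: the function
 names and the first two sample segments are made up, and the rule is
 simplified to ``a non-empty segment ending at position @var{n} must be
 followed by a segment starting at @var{n+1}, unless it ends last''.

@example
def parse_segment(line):
    # Fields: position, length, tag, form (the form is the remainder).
    fields = line.split(None, 3)
    start = int(fields[0])
    length = int(fields[1])
    return start, length, fields[2], fields[3] if len(fields) > 3 else ""

def end_position(start, length):
    # Last position covered: a segment at 0000 of length 2 ends at 0001.
    return start + length - 1

def dangling_segments(lines):
    # Segments whose end + 1 is not the start of any segment.
    # Zero-length segments and the segment that ends last are ignored.
    segments = [parse_segment(l) for l in lines if l.strip()]
    if not segments:
        return []
    starts = set(s[0] for s in segments)
    last_end = max(end_position(s[0], s[1]) for s in segments)
    bad = []
    for seg in segments:
        end = end_position(seg[0], seg[1])
        if seg[1] > 0 and end != last_end and end + 1 not in starts:
            bad.append(seg)
    return bad

# Hypothetical input: the first segment ends at 0001, nothing starts at 0002.
sample = ["0000 02 W ab", "0001 04 W bcde", "0005 02 W km"]
print(dangling_segments(sample))
@end example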
@section Flattened UTT file
 The UTT file format has two variants: regular and flattened. The regular
 format was described above.  In the flattened format some of the
 end-of-line characters are replaced with line-feed characters.
 The flattened format is basically used to represent whole sentences as
 single lines of the input file (all intrasentential end-of-line
 characters are replaced with line-feed characters).
 This technical trick makes it possible to perform certain text
 processing operations on entire sentences with the use of such tools as
@command{grep} (see @command{grp} component) or @command{sed} (see  @command{mar} component).
 The conversion between the two formats is performed by the tools:
@command{fla} and @command{unfla}.
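
 The idea can be illustrated with a small Python sketch. It is not the
 implementation of @command{fla} or @command{unfla}. It assumes that
 sentences are delimited by segments tagged BOS and EOS, that the sample
 segments (including the @code{*} form) look as shown, and that the
 replacement character is passed in as @code{sep}, since its actual
 value is not restated here.

@example
def flatten(lines, sep):
    # Join the segments of each sentence (BOS ... EOS) into one line.
    out = []
    current = None
    for line in lines:
        fields = line.split()
        tag = fields[2] if len(fields) > 2 else ""
        if current is None:
            if tag == "BOS":
                current = [line]      # start collecting a sentence
            else:
                out.append(line)      # outside a sentence: copy unchanged
        else:
            current.append(line)
            if tag == "EOS":
                out.append(sep.join(current))
                current = None
    if current is not None:
        out.extend(current)           # unterminated sentence: leave as is
    return out

def unflatten(lines, sep):
    # Inverse operation: split every line back into its segments.
    out = []
    for line in lines:
        out.extend(line.split(sep))
    return out

# Hypothetical sample; "\f" is only a placeholder separator.
sample = ["0000 00 BOS *", "0000 03 W Ala", "0003 01 S _",
          "0004 02 W ma", "0006 00 EOS *", "0006 01 P ."]
flat = flatten(sample, "\f")
print(len(flat), unflatten(flat, "\f") == sample)
@end example

 With one sentence per line, a line-oriented tool sees a whole sentence
 at a time, which is what the search components rely on.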
@section Character encoding
 The UTT component programs accept only 1-byte character encodings, such
-as ISO, ANSI, DOS, UTF-8 (probably: not tested yet).
+as ISO, ANSI, DOS.
@c @section Formats
@ -525,99 +547,6 @@ This option is useful when working with @command{kot} or @command{con}.
@end macro
@c ---------------------------------------------------------------------
@c ---------------------------------------------------------------------
@c @node Common command line options
@c @chapter Common command line options
@c @table @code
@c @parhelp
@c @item @b{@minus{}@minus{}help}, @b{@minus{}h}
@c Print help.
@c @item @b{@minus{}@minus{}version}, @b{@minus{}v}
@c Print version information.
@c @item @b{@minus{}@minus{}file=@var{filename}, @minus{}f @var{filename}}
@c Input file name.
@c If this option is absent or equal to '@minus{}', the program
@c reads from the standard input.
@c @item @b{@minus{}@minus{}output=@var{filename}, @minus{}o @var{filename}}
@c Regular output file name. To regular output the program sends segments
@c which it successfully processed and copies those which were not
@c subject to processing. If this option is absent or equal to
@c '@minus{}', standard output is used.
@c @item @b{@minus{}@minus{}fail=@var{filename}, @minus{}e @var{filename}}
@c Fail output file name. To fail output the program copies the segments
@c it failed to process.  If this option is absent or equal to
@c '@minus{}', standard output is used.
@c @item @b{@minus{}@minus{}only-fail}
@c Discard segments which would normally be sent to regular
@c output. Print only segments the program failed to process.
@c @item @b{@minus{}@minus{}no-fail}
@c Discard segments the program failed to process.
@c (This and the previous option are functionally equivalent to,
@c respectively, @option{-o /dev/null} and @option{-e /dev/null}, but
@c make the programs run faster.)
@c @item @b{@minus{}@minus{}input-field=@var{fieldname}, @minus{}I @var{fieldname}}
@c The field containing the input to the program. The default is usually
@c the @var{form} field (unless otherwise stated in the program
@c description). The fields @var{position}, @var{length}, @var{tag}, and
@c @var{form} are referred to as @code{1}, @code{2}, @code{3}, @code{4},
@c respectively.
@c @item @b{@minus{}@minus{}output-field=@var{fieldname}, @minus{}O @var{fieldname}}
@c The name of the field added by the program. The default is the name of
@c the program.
@c @c @item @b{@minus{}@minus{}copy, @minus{}c}
@c @c Copy processed segments to regular output.
@c @item @b{@minus{}@minus{}dictionary=@var{filename}, @minus{}d @var{filename}}
@c Dictionary file name.
@c (This option is used by programs which use dictionary data.)
@c @item @b{@minus{}@minus{}process=@var{tag}, @minus{}p @var{tag}}
@c Process segments with the specified value in the @var{tag} field.
@c Multiple occurrences of this option are allowed and are interpreted as
@c disjunction. If this option is absent, all segments are processed.
@c @item @b{@minus{}@minus{}select=@var{fieldname}, @minus{}s @var{fieldname}}
@c Select for processing only segments in which the field named
@c @var{fieldname} is present. Multiple occurrences of this option are
@c allowed and are interpreted as conjunction of conditions. If this
@c option is absent, all segments are processed.
@c @item @b{@minus{}@minus{}unselect=@var{fieldname}, @minus{}S @var{fieldname}}
@c Select for processing only segments in which the field @var{fieldname}
@c is absent.  Multiple occurrences of this option are allowed and are
@c interpreted as conjunction of conditions. If this option is absent,
@c all segments are processed.
@c @item @b{@minus{}@minus{}interactive @minus{}i}
@c This option toggles interactive mode, which is by default off. In the
@c interactive mode the program does not buffer the output.
@c @item @b{@minus{}@minus{}config=@var{filename}}
@c Read configuration from file @file{@var{filename}}.
@c @item @b{@minus{}@minus{}one @minus{}1}
@c This option makes the program print ambiguous annotation in one output
@c segment. By default when
@c ambiguous new annotation is being produced for a segment, the segment
@c is duplicated and each of the annotations is added to a separate copy
@c of the segment.
@c @end table
@c ---------------------------------------------------------------------
@c CONFIGURATION FILES
@c ---------------------------------------------------------------------
@ -694,14 +623,16 @@ in UTT format
 * tok::         a tokenizer
 Filters: programs which read and produce UTT-formatted data
@c * sen - the sentencizer::
 * lem::         a morphological analyzer
 * gue::         a morphological guesser
-* cor::         a spelling corrector
+* cor::         a simple spelling corrector
 * kor::         a more elaborate spelling corrector
 * sen::         a sentencizer
@c * gph - the graphizer::
 * ser::         a pattern search tool (marks matches)
 * mar::         a pattern search tool (introduces arbitrary markers into the text)
 * grp::         a pattern search tool (selects sentences containing a match)
@c * gph::         a word-graph annotation tool::
@c * dgp::         a dependency parser
 Sinks: programs which read UTT data and produce output in another format
 * kot::         an untokenizer
@ -721,6 +652,9 @@ Sinks: programs which read UTT data and produce output in another format
@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
@item @strong{Authors:}                 @tab Tomasz Obrębski
@item @strong{Component category:}      @tab source
@item @strong{Input format:}            @tab raw text file
@item @strong{Output format:}           @tab UTT regular
@item @strong{Required annotation:}     @tab -
@end multitable
@ -834,6 +768,9 @@ Output:
@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
@item @strong{Authors:}                 @tab Tomasz Obrębski, Michał Stolarski
@item @strong{Component category:}      @tab filter
@item @strong{Input format:}            @tab UTT regular
@item @strong{Output format:}           @tab UTT regular
@item @strong{Required annotation:}     @tab tok
@end multitable
@menu
@ -1031,28 +968,34 @@ A large-coverage morphological dictionary for Polish language, Polex/PMDBF, is i
 the distribution as the default @emph{lem}'s dictionary. It's 
 located by default in:
-@file{$HOME/.utt/pl/lem.bin}
+@file{$HOME/.local/share/utt/pl_PL.ISO-8859-2/lem.bin}
 in local installation or in
@file{/usr/local/share/utt/pl_PL.ISO-8859-2/lem.bin}
 in system installation.
@node lem hints
@subsection Hints
-@c @subsubheading Combining data from multiple dictionaries
+@subsubheading Combining data from multiple dictionaries
-@c @itemize
+@itemize
-@c @item Apply <dict1>, then apply <dict2> to words which were not annotated.
+@item Apply <dict1>, then apply <dict2> to words which were not annotated.
-@c @example
+@example
-@c lem -d <dict1> | lem -S lem -d <dict2>
+lem -d <dict1> | lem -S lem -d <dict2>
-@c @end example
+@end example
-@c @item Add annotations from two dictionaries <dict1> and <dict2>.
+@item Add annotations from two dictionaries <dict1> and <dict2>.
-@c @example
+@example
-@c lem -c -d <dict1> | lem -S lem -d <dict2>
+lem -c -d <dict1> | lem -S lem -d <dict2>
-@c @end example
+@end example
-@c @end itemize
+@end itemize
@c ---------------------------------------------------------------------
@ -1070,15 +1013,21 @@ located by default in:
@end multitable
@command{gue} guesses morphological descriptions of the form contained
 in the @var{form} field.
@menu
 * gue description::    
 * gue command line options::    
 * gue example::                 
 * gue dictionaries::            
@end menu
@node gue description
@subsection Description
@command{gue} guesses morphological descriptions of the form contained
 in the @var{form} field.
@node gue command line options
@subsection Command line options
@ -1181,24 +1130,27 @@ naj*elszy;3-4ały,ADJ/...:...
@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
@item @strong{Authors:}                 @tab Tomasz Obrębski, Michał Stolarski
@item @strong{Component category:}      @tab filter
@item @strong{Input format:}            @tab UTT regular
@item @strong{Output format:}           @tab UTT regular
@item @strong{Required annotation:}     @tab tok
@end multitable
@menu
 * cor description::
 * cor command line options::    
 * cor dictionaries::            
@end menu
@node cor description
@subsection Description
 The spelling corrector applies Kemal Oflazer's dynamic programming
 algorithm @cite{oflazer96} to the FSA representation of the set of
 word forms of the Polex/PMDBF dictionary. Given an incorrect
 word form it returns all word forms present in the dictionary whose
 edit distance is smaller than the threshold given as the parameter.
 By default @code{cor} replaces the contents of the @var{form} field
 with the new corrected value, placing the old contents in the @code{cor}
 field.
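
 The selection criterion can be illustrated with a brute-force Python
 sketch. It is not Oflazer's FSA-based algorithm and not the code used
 by @command{cor}: it computes plain Levenshtein distance against every
 entry of a word list. The function names are hypothetical, and the
 comparison uses ``at most @code{max_distance}'', following the wording
 of the @option{--distance} option.

@example
def edit_distance(a, b):
    # Dynamic-programming Levenshtein distance, one table row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution or match
        prev = cur
    return prev[len(b)]

def corrections(word, dictionary, max_distance=1):
    # All dictionary forms within max_distance edits of the given word.
    return [w for w in dictionary if edit_distance(word, w) <= max_distance]

print(corrections("odlotowi", ["odlotowy", "odludek"]))
@end example

 Scanning the whole list costs time proportional to the dictionary
 size, which is the cost the FSA traversal is meant to avoid.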
@menu
 * cor command line options::    
 * cor dictionaries::            
@end menu
@node cor command line options
@subsection Command line options
@ -1224,6 +1176,10 @@ field.
@item @b{@minus{}@minus{}distance=@var{int}, @minus{}n @var{int}}
 Maximum edit distance (default='1').
@c @item @b{@minus{}@minus{}replace, @minus{}r}
@c Replace original form with corrected form, place original form in the
@c cor field. This option has no effect in @option{--one-*} modes (default=off)
@end table
@ -1242,6 +1198,29 @@ odlotowy
 odludek
@end example
@subsubheading Binary format
 The mandatory file name extension for a binary dictionary is @code{bin}. To
 compile a text dictionary into binary format, write:
@example
 compiledic <dictionaryname>.dic
@end example
@c ---------------------------------------------------------------------
@c KOR
@c ---------------------------------------------------------------------
@page
@node kor
@section kor - configurable spelling corrector
 [TODO]
@c ---------------------------------------------------------------------
@c SEN
@c ---------------------------------------------------------------------
@page
@node sen
@section sen - a sentencizer
@ -1250,17 +1229,25 @@ odludek
@item @strong{Authors:}                 @tab Tomasz Obrębski
@item @strong{Component category:}      @tab filter
@item @strong{Input format:}            @tab UTT regular
@item @strong{Output format:}           @tab UTT regular
@item @strong{Required annotation:}     @tab tok
@end multitable
@command{sen} detects sentence boundaries in UTT-formatted texts and marks them with special zero-length segments, in which the @var{type} field may contain the BOS (beginning of sentence) or EOS (end of sentence) annotation. 
@menu
 * sen description::
@c * sen input::
@c * sen output::
 * sen example::                 
@end menu
@node sen description
@subsection Description
@command{sen} detects sentence boundaries in UTT-formatted texts and marks them with special zero-length segments, in which the @var{type} field may contain the BOS (beginning of sentence) or EOS (end of sentence) annotation. 
@node sen example
@subsection Example
@ -1304,8 +1291,8 @@ output:
@c SER
@c ---------------------------------------------------------------------
@c SER
@c ---------------------------------------------------------------------
@page
@ -1315,11 +1302,13 @@ output:
@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
@item @strong{Authors:}                 @tab Tomasz Obrębski
@item @strong{Component category:}      @tab filter
@item @strong{Input format:}            @tab UTT regular
@item @strong{Output format:}           @tab UTT regular
@item @strong{Required annotation:}     @tab tok,  lem --one-field
@end multitable
@command{ser} looks for patterns in UTT-formatted texts.
@menu
 * ser description::
 * ser command line options::    
 * ser pattern::                 
 * ser how ser works::           
@ -1329,6 +1318,12 @@ output:
@end menu
@node ser description
@subsection Description
@command{ser} looks for patterns in UTT-formatted texts.
@c ---------------------------------------------------------------------
@node ser command line options
@subsection Command line options
@ -1503,7 +1498,7 @@ occurrence of a relative pronoun
@c All predefined terms correspond to single segments, 
@example
-define(`verbseq', `(cat(V) (space cat(V)))')
+define(`verbseq', `(cat(<V>) (space cat(<V>)))')
@end example
@ -1514,7 +1509,7 @@ the term @code{cat()} may not be used as a ... of
@node ser limitations
@subsection Limitations
-more than 3 attributes in <>.
+Do not use more than 3 attributes in <>.
@node ser requirements
@subsection Requirements
@ -1532,8 +1527,8 @@ installed in the system:
@end itemize
@c GRP
@c ---------------------------------------------------------------------
@c GRP
@c ---------------------------------------------------------------------
@page
@ -1543,9 +1538,23 @@ installed in the system:
@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
@item @strong{Authors:}                 @tab Tomasz Obrębski
@item @strong{Component category:}      @tab filter
@item @strong{Input format:}            @tab UTT flattened
@item @strong{Output format:}           @tab UTT flattened
@item @strong{Required annotation:}     @tab tok, sen, lem --one-field
@end multitable
@menu
 * grp description::
 * grp command line options::    
 * grp pattern::                 
 * grp hints::    
@end menu
@node grp description
@subsection Description
@code{grp} selects sentences containing an expression matching a
 pattern. The pattern format is exactly the same as that accepted by
@code{ser}.
@ -1554,22 +1563,6 @@ pattern. The pattern format is exactly the same as that accepted by
 It is extremely fast (processing speed is usually higher than the speed
 of reading the corpus file from disk). 
@c @menu
@c * ser command line options::    
@c * ser pattern::                 
@c * ser how ser works::           
@c * ser customization::           
@c * ser limitations::             
@c * ser requirements::            
@c @end menu
@menu
 * grp command line options::    
 * grp pattern::                 
 * grp hints::    
@end menu
@node grp command line options
@subsection Command line options
@ -1577,10 +1570,6 @@ of reading the corpus file from disk).
@parhelp
@parversion
@c @parfile
@c @paroutput
@c @parinputfield
@c @paroutputfield
@parprocess
@parinteractive
@ -1626,24 +1615,51 @@ lzop -cd corpus.grp.lzo | grp -a gP -e @var{EXPR} | ser -e @var{EXPR}
@end example
@c ---------------------------------------------------------------------
-@c kot
+@c MAR
@c ---------------------------------------------------------------------
@page
@node mar
@section mar
@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
@item @strong{Authors:}                 @tab Marcin Walas, Tomasz Obrębski
@item @strong{Component category:}      @tab filter
@end multitable
 [TODO]
@c ---------------------------------------------------------------------
@c KOT
@c ---------------------------------------------------------------------
@page
@node kot
@section kot - untokenizer
-Authors: Tomasz Obrębski
+@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
@item @strong{Authors:}                 @tab Tomasz Obrębski
@item @strong{Component category:}      @tab filter
@item @strong{Input format:}            @tab UTT regular
@item @strong{Output format:}           @tab text
@item @strong{Required annotation:}     @tab tok
@end multitable
@command{kot} is the opposite of @command{tok}. It changes UTT-formatted text into plain text.
@menu
 * kot description::
 * kot command line options::    
 * kot usage examples::    
@end menu
@node kot description
@subsection Description
@command{kot} transforms a UTT formatted file back into raw text format.
@node kot command line options
@subsection Command line options
@ -1683,28 +1699,38 @@ cat legia.txt | tok | kot
 cat legia.txt | tok | lem -1 | kot
@end example
-@c CON............................................................
+@c ---------------------------------------------------------------
-@c ...............................................................
+@c CON
-@c ...............................................................
+@c ---------------------------------------------------------------
@page
@node con
@section con - concordance table generator
@command{con} generates a concordance table based on a pattern given to @command{ser}.
@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
@item @strong{Authors:}                 @tab Justyna Walkowska
@item @strong{Component category:}      @tab sink
@item @strong{Input format:}            @tab UTT regular
@item @strong{Output format:}           @tab text
@item @strong{Required annotation:}     @tab ser or mar
@end multitable
@c
@menu
 * con description::
 * con command line options::
 * con usage example::
 * con hints::    
@end menu
@node con description
@subsection Description
@command{con} generates a concordance table based on a pattern given to @command{ser}.
@node con command line options
@subsection Command line options
@ -1757,9 +1783,9 @@ cat legia.txt | tok | lem -1 | kot
 	Left column minimal width in characters (default = 0).
@item @b{@minus{}@minus{}ignore @minus{}i}            
 	Ignore segment inconsistency in the input.
-@item @b{@minus{}@minus{}bon}            
+@item @b{@minus{}@minus{}bom}            
 	Beginning of selected segment (regex, default='[0-9]+ [0-9]+ BOM .*').
-@item @b{@minus{}@minus{}eob}            
+@item @b{@minus{}@minus{}eom}            
 	End of selected segment (regex, default='[0-9]+ [0-9]+ EOM .*').
@item @b{@minus{}@minus{}bod}            
 	Selected segment beginning display string (default='[').
@ -1773,7 +1799,7 @@ cat legia.txt | tok | lem -1 | kot
@node con usage example
@subsection Usage example
@example
-cat file.txt | tok | lem -1 | ser -e 'lexeme(dom) | con'  
+cat file.txt | tok | lem -1 | ser -e 'lexeme(dom)' | con  
@end example
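
 How the default @option{--bom} and @option{--eom} expressions pick out
 a match can be tried in Python. This is a sketch only: the two
 patterns are the documented defaults, while the whole-line matching,
 the function name and the sample segments are assumptions, not a
 description of how @command{con} is implemented.

@example
import re

# Default marker patterns quoted from the option descriptions.
BOM = re.compile(r"[0-9]+ [0-9]+ BOM .*")
EOM = re.compile(r"[0-9]+ [0-9]+ EOM .*")

def match_spans(lines):
    # Return (first, last) index pairs of the lines between BOM and EOM.
    spans = []
    start = None
    for i, line in enumerate(lines):
        if BOM.fullmatch(line):
            start = i + 1
        elif EOM.fullmatch(line) and start is not None:
            spans.append((start, i - 1))
            start = None
    return spans

# Hypothetical marked-up input with one match around the form "dom".
sample = ["0000 03 W Ala", "0003 01 S _", "0004 00 BOM *",
          "0004 03 W dom", "0007 00 EOM *", "0007 01 S _"]
print(match_spans(sample))
@end example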
@ -1789,7 +1815,6 @@ sequence:
@end example
@c ---------------------------------------------------------------------
@c ---------------------------------------------------------------------