w utt.texinfo

git-svn-id: svn://atos.wmid.amu.edu.pl/utt@60 e293616e-ec6a-49c2-aa92-f4a8b91c5d16
2008-10-22 09:53:31 +00:00 · 2008-10-22 09:53:31 +00:00 · 261bf629fb
commit 261bf629fb
parent 839a0d50e2
1 changed file with 192 additions and 167 deletions
--- a/app/doc/utt.texinfo
+++ b/app/doc/utt.texinfo
@ -8,15 +8,16 @@
@c %**end of header

@copying
-This manual is for UAM Text Tools (version 0.90, November, 2007)
+This manual is for UAM Text Tools (version 0.90, October, 2008)

 Copyright @copyright{}  2005, 2007  Tomasz Obrębski, Michał Stolarski, Justyna Walkowska, Paweł Konieczka.

 Permission is granted to copy, distribute and/or modify this document
-under the terms of the GNU Free Documentation License, Version 1.2
-or any later version published by the Free Software Foundation;
-with no Invariant Sections, no Front-Cover Texts, and no Back-Cover
-Texts.  A copy of the license is included in the section entitled GNU Free Documentation License,,GNU Free Documentation License.
+under the terms of the GNU Free Documentation License, Version 1.2 or
+any later version published by the Free Software Foundation; with no
+Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.  A
+copy of the license is included in the section entitled GNU Free
+Documentation License,,GNU Free Documentation License.

@c @quotation
@c Permission is granted to ...
@ -357,12 +358,33 @@ but not
 0005 02 W km
@end example

-because in the latter example the first segment (starting at position 0000, 2 characters long) ends at position @var{n}=0001 which is covered by the second segment and no segment starts at position @var{n+2}=0002.
+because in the latter example the first segment (starting at position
+0000, 2 characters long) ends at position @var{n}=0001 which is
+covered by the second segment and no segment starts at position
+@var{n+2}=0002.
+
+
+@section Flattened UTT file
+
+The UTT file format has two variants: regular and flattened. The regular
+format was described above.  In the flattened format some of the
+end-of-line characters are replaced with line-feed characters.
+
+The flattened format is mainly used to represent whole sentences as
+single lines of the input file (all intrasentential end-of-line
+characters are replaced with line-feed characters).
+
+This technical trick makes it possible to perform certain text
+processing operations on entire sentences with tools such as
+@command{grep} (see @command{grp} component) or @command{sed} (see @command{mar} component).
+
+The conversion between the two formats is performed by the tools:
+@command{fla} and @command{unfla}.
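
The idea behind the flattened variant can be illustrated with a minimal Python sketch. It is not the implementation of @command{fla} or @command{unfla}: the separator value @code{SEP}, the function names, and the assumption that an @code{EOS} tag in the third field closes a sentence are all hypothetical choices made for illustration.

```python
# Sketch only. SEP is a hypothetical stand-in for whatever separator the
# real fla/unfla tools use; the EOS test assumes the tag is the third field.
SEP = "\v"


def flatten(lines, sep=SEP):
    """Join the segment lines of each sentence into one physical line."""
    out, current = [], []
    for line in lines:
        current.append(line)
        fields = line.split()
        if len(fields) >= 3 and fields[2] == "EOS":
            out.append(sep.join(current))
            current = []
    if current:  # trailing segments that are not closed by an EOS segment
        out.append(sep.join(current))
    return out


def unflatten(flat_lines, sep=SEP):
    """Restore the regular layout: one segment per line."""
    return [seg for line in flat_lines for seg in line.split(sep)]
```

With one sentence per line, a line-oriented tool sees a whole sentence at once, and @code{unflatten(flatten(lines))} gives back the original list of segments.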

@section Character encoding

 The UTT component programs accept only 1-byte character encoding, such
-as ISO, ANSI, DOS, UTF-8 (probably: not tested yet).
+as ISO, ANSI, DOS.


@c @section Formats
@ -525,99 +547,6 @@ This option is useful when working with @command{kot} or @command{con}.
@end macro


-@c ---------------------------------------------------------------------
-@c ---------------------------------------------------------------------
-
-@c @node Common command line options
-@c @chapter Common command line options
-
-@c @table @code
-
-@c @parhelp
-
-@c @item @b{@minus{}@minus{}help}, @b{@minus{}h}
-@c Print help.
-
-@c @item @b{@minus{}@minus{}version}, @b{@minus{}v}
-@c Print version information.
-
-@c @item @b{@minus{}@minus{}file=@var{filename}, @minus{}f @var{filename}}
-@c Input file name.
-@c If this option is absent or equal to '@minus{}', the program
-@c reads from the standard input.
-
-@c @item @b{@minus{}@minus{}output=@var{filename}, @minus{}o @var{filename}}
-@c Regular output file name. To regular output the program sends segments
-@c which it successfully processed and copies those which were not
-@c subject to processing. If this option is absent or equal to
-@c '@minus{}', standard output is used.
-
-@c @item @b{@minus{}@minus{}fail=@var{filename}, @minus{}e @var{filename}}
-@c Fail output file name. To fail output the program copies the segments
-@c it failed to process.  If this option is absent or equal to
-@c '@minus{}', standard output is used.
-
-@c @item @b{@minus{}@minus{}only-fail}
-@c Discard segments which would normally be sent to regular
-@c output. Print only segments the program failed to process.
-
-@c @item @b{@minus{}@minus{}no-fail}
-@c Discard segments the program failed to process.
-@c (This and the previous option are functionally equivalent to,
-@c respectively, @option{-o /dev/null} and @option{-e /dev/null}, but
-@c make the programs run faster.)
-
-@c @item @b{@minus{}@minus{}input-field=@var{fieldname}, @minus{}I @var{fieldname}}
-@c The field containing the input to the program. The default is usually
-@c the @var{form} field (unless otherwise stated in the program
-@c description). The fields @var{position}, @var{length}, @var{tag}, and
-@c @var{form} are referred to as @code{1}, @code{2}, @code{3}, @code{4},
-@c respectively.
-
-@c @item @b{@minus{}@minus{}output-field=@var{fieldname}, @minus{}O @var{fieldname}}
-@c The name of the field added by the program. The default is the name of
-@c the program.
-
-@c @c @item @b{@minus{}@minus{}copy, @minus{}c}
-@c @c Copy processed segments to regular output.
-
-@c @item @b{@minus{}@minus{}dictionary=@var{filename}, @minus{}d @var{filename}}
-@c Dictionary file name.
-@c (This option is used by programs which use dictionary data.)
-
-@c @item @b{@minus{}@minus{}process=@var{tag}, @minus{}p @var{tag}}
-@c Process segments with the specified value in the @var{tag} field.
-@c Multiple occurences of this option are allowed and are interpreted as
-@c disjunction. If this option is absent, all segments are processed.
-
-@c @item @b{@minus{}@minus{}select=@var{fieldname}, @minus{}s @var{fieldname}}
-@c Select for processing only segments in which the field named
-@c @var{fieldname} is present. Multiple occurences of this option are
-@c allowed and are interpreted as conjunction of conditions. If this
-@c option is absent, all segments are processed.
-
-@c @item @b{@minus{}@minus{}unselect=@var{fieldname}, @minus{}S @var{fieldname}}
-@c Select for processing only segments in which the field @var{fieldname}
-@c is absent.  Multiple occurences of this option are allowed and are
-@c interpreted as conjunction of conditions. If this option is absent,
-@c all segments are processed.
-
-@c @item @b{@minus{}@minus{}interactive @minus{}i}
-@c This option toggles interactive mode, which is by default off. In the
-@c interactive mode the program does not buffer the output.
-
-@c @item @b{@minus{}@minus{}config=@var{filename}}
-@c Read configuration from file @file{@var{filename}}.
-
-@c @item @b{@minus{}@minus{}one @minus{}1}
-@c This option makes the program print ambiguous annotation in one output
-@c segment. By default when
-@c ambiguous new annotation is being produced for a segment, the segment
-@c is multiplicated and each of the annotations is added to separate copy
-@c of the segment.
-
-@c @end table
-
@c ---------------------------------------------------------------------
@c CONFIGURATION FILES
@c ---------------------------------------------------------------------
@ -694,14 +623,16 @@ in UTT format
 * tok::         a tokenizer

 Filters: programs which read and produce UTT-formatted data
-@c * sen - the sentencizer::
 * lem::         a morphological analyzer
 * gue::         a morphological guesser
-* cor::         a spelling corrector
+* cor::         a simple spelling corrector
+* kor::         a more elaborate spelling corrector
 * sen::         a sentensizer
-@c * gph - the graphizer::
 * ser::         a pattern search tool (marks matches)
+* mar::         a pattern search tool (introduces arbitrary markers into the text)
 * grp::         a pattern search tool (selects sentences containing a match)
+@c * gph::         a word-graph annotation tool::
+@c * dgp::         a dependency parser

 Sinks: programs which read UTT data and produce output in another format
 * kot::         an untokenizer
@ -721,6 +652,9 @@ Sinks: programs which read UTT data and produce output in another format
@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
@item @strong{Authors:}                 @tab Tomasz Obrębski
@item @strong{Component category:}      @tab source
+@item @strong{Input format:}            @tab raw text file
+@item @strong{Output format:}           @tab UTT regular
+@item @strong{Required annotation:}     @tab -
@end multitable


@ -834,6 +768,9 @@ Output:
@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
@item @strong{Authors:}                 @tab Tomasz Obrębski, Michał Stolarski
@item @strong{Component category:}      @tab filter
+@item @strong{Input format:}            @tab UTT regular
+@item @strong{Output format:}           @tab UTT regular
+@item @strong{Required annotation:}     @tab tok
@end multitable

@menu
@ -1031,28 +968,34 @@ A large-coverage morphological dictionary for Polish language, Polex/PMDBF, is i
 the distribution as the default @emph{lem}'s dictionary. It's 
 located by default in:

-@file{$HOME/.utt/pl/lem.bin}
+@file{$HOME/.local/share/utt/pl_PL.ISO-8859-2/lem.bin}
+
+in local installation or in
+
+@file{/usr/local/share/utt/pl_PL.ISO-8859-2/lem.bin}
+
+in system installation.
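
The two locations above can be checked programmatically. The following Python sketch is only an illustration: the function names are hypothetical, and trying the local path before the system path is an assumption, not documented behaviour of @command{lem}.

```python
import os
import posixpath


def lem_dictionary_candidates(home, locale="pl_PL.ISO-8859-2"):
    """Return the two default locations named in the manual."""
    return [
        posixpath.join(home, ".local", "share", "utt", locale, "lem.bin"),
        posixpath.join("/usr/local/share/utt", locale, "lem.bin"),
    ]


def find_lem_dictionary(home, exists=os.path.isfile):
    """Return the first candidate that exists, or None.

    The local-before-system order is an assumption of this sketch.
    """
    for path in lem_dictionary_candidates(home):
        if exists(path):
            return path
    return None
```

For example, @code{find_lem_dictionary(os.path.expanduser("~"))} returns the path of the installed dictionary, or @code{None} if neither file is present.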

@node lem hints
@subsection Hints

-@c @subsubheading Combining data from multiple dictionaries
+@subsubheading Combining data from multiple dictionaries

-@c @itemize
+@itemize

-@c @item Apply <dict1>, then apply <dict2> to words which were not annotatated.
+@item Apply <dict1>, then apply <dict2> to words which were not annotated.

-@c @example
-@c lem -d <dict1> | lem -S lem -d <dict2>
-@c @end example
+@example
+lem -d <dict1> | lem -S lem -d <dict2>
+@end example

-@c @item Add annotations from two dictionaries <dict1> and <dict2>.
+@item Add annotations from two dictionaries <dict1> and <dict2>.

-@c @example
-@c lem -c -d <dict1> | lem -S lem -d <dict2>
-@c @end example
+@example
+lem -c -d <dict1> | lem -S lem -d <dict2>
+@end example

-@c @end itemize
+@end itemize


@c ---------------------------------------------------------------------
@ -1070,15 +1013,21 @@ located by default in:

@end multitable

-@command{gue} guesess morphological descriptions of the form contained
-in the @var{form} field.
-
@menu
+* gue description::    
 * gue command line options::    
 * gue example::                 
 * gue dictionaries::            
@end menu

+
+@node gue description
+@subsection Description
+
+@command{gue} guesses morphological descriptions of the form contained
+in the @var{form} field.
+
+
@node gue command line options
@subsection Command line options

@ -1181,24 +1130,27 @@ naj*elszy;3-4a³y,ADJ/...:...
@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
@item @strong{Authors:}                 @tab Tomasz Obrębski, Michał Stolarski
@item @strong{Component category:}      @tab filter
+@item @strong{Input format:}            @tab UTT regular
+@item @strong{Output format:}           @tab UTT regular
+@item @strong{Required annotation:}     @tab tok
@end multitable

+@menu
+* cor description::
+* cor command line options::    
+* cor dictionaries::            
+@end menu
+
+
+@node cor description
+@subsection Description
+
 The spelling corrector applies Kemal Oflazer's dynamic programming
 algorithm @cite{oflazer96} to the FSA representation of the set of
 word forms of the Polex/PMDBF dictionary. Given an incorrect
 word form it returns all word forms present in the dictionary whose
 edit distance is smaller than the threshold given as the parameter.

-By default @code{cor} replaces the contents of the @var{form} field
-with new corrected value, placing the old contents in the @code{cor}
-field.
-
-
-@menu
-* cor command line options::    
-* cor dictionaries::            
-@end menu
-

@node cor command line options
@subsection Command line options
@ -1224,6 +1176,10 @@ field.
@item @b{@minus{}@minus{}distance=@var{int}, @minus{}n @var{int}}
 Maximum edit distance (default='1').

+@c @item @b{@minus{}@minus{}replace, @minus{}r}
+@c Replace original form with corrected form, place original form in the
+@c cor field. This option has no effect in @option{--one-*} modes (default=off)
+

@end table

@ -1242,6 +1198,29 @@ odlotowy
 odludek
@end example

+@subsubheading Binary format
+
+The mandatory file name extension for a binary dictionary is @code{bin}. To
+compile a text dictionary into binary format, write:
+
+@example
+compiledic <dictionaryname>.dic
+@end example
+
+@c ---------------------------------------------------------------------
+@c KOR
+@c ---------------------------------------------------------------------
+
+@page
+@node kor
+@section kor - configurable spelling corrector
+
+[TODO]
+
+@c ---------------------------------------------------------------------
+@c SEN
+@c ---------------------------------------------------------------------
+
@page
@node sen
@section sen - a sentensizer
@ -1250,17 +1229,25 @@ odludek

@item @strong{Authors:}                 @tab Tomasz Obrębski
@item @strong{Component category:}      @tab filter
+@item @strong{Input format:}            @tab UTT regular
+@item @strong{Output format:}           @tab UTT regular
+@item @strong{Required annotation:}     @tab tok

@end multitable

-@command{sen} detects sentence boundaries in UTT-formatted texts and marks them with special zero-length segments, in which the @var{type} field may contain the BOS (beginning of sentence) or EOS (end of sentence) annotation. 

@menu
+* sen description::
@c * sen input::
@c * sen output::
 * sen example::                 
@end menu

+@node sen description
+@subsection Description
+
+@command{sen} detects sentence boundaries in UTT-formatted texts and marks them with special zero-length segments, in which the @var{type} field may contain the BOS (beginning of sentence) or EOS (end of sentence) annotation. 
+
@node sen example
@subsection Example

@ -1304,8 +1291,8 @@ output:



-@c SER
@c ---------------------------------------------------------------------
+@c SER
@c ---------------------------------------------------------------------

@page
@ -1315,11 +1302,13 @@ output:
@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
@item @strong{Authors:}                 @tab Tomasz Obrębski
@item @strong{Component category:}      @tab filter
+@item @strong{Input format:}            @tab UTT regular
+@item @strong{Output format:}           @tab UTT regular
+@item @strong{Required annotation:}     @tab tok,  lem --one-field
@end multitable

-@command{ser} looks for patterns in UTT-formatted texts.
-
@menu
+* ser description::
 * ser command line options::    
 * ser pattern::                 
 * ser how ser works::           
@ -1329,6 +1318,12 @@ output:
@end menu


+@node ser description
+@subsection Description
+
+@command{ser} looks for patterns in UTT-formatted texts.
+
+
@c ---------------------------------------------------------------------
@node ser command line options
@subsection Command line options
@ -1503,7 +1498,7 @@ ocurrence of a relative pronoun
@c All predefined terms correspond to single segments, 

@example
-define(`verbseq', `(cat(V) (space cat(V)))')
+define(`verbseq', `(cat(<V>) (space cat(<V>)))')
@end example


@ -1514,7 +1509,7 @@ the term @code{cat()} may not be used as a ... of
@node ser limitations
@subsection Limitations

-more than 3 attributes in <>.
+Do not use more than 3 attributes in <>.

@node ser requirements
@subsection Requirements
@ -1532,8 +1527,8 @@ installed in the system:
@end itemize


-@c GRP
@c ---------------------------------------------------------------------
+@c GRP
@c ---------------------------------------------------------------------

@page
@ -1543,9 +1538,23 @@ installed in the system:
@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
@item @strong{Authors:}                 @tab Tomasz Obrębski
@item @strong{Component category:}      @tab filter
+@item @strong{Input format:}            @tab UTT flattened
+@item @strong{Output format:}           @tab UTT flattened
+@item @strong{Required annotation:}     @tab tok, sen, lem --one-field
@end multitable


+@menu
+* grp description::
+* grp command line options::    
+* grp pattern::                 
+* grp hints::    
+@end menu
+
+
+@node grp description
+@subsection Description
+
@code{grp} selects sentences containing an expression matching a
 pattern. The pattern format is exactly the same as that accepted by
@code{ser}.
@ -1554,22 +1563,6 @@ pattern. The pattern format is exactly the same as that accepted by
 It is extremely fast (processing speed is usually higher than the speed
 of reading the corpus file from disk). 

-
-
-@c @menu
-@c * ser command line options::    
-@c * ser pattern::                 
-@c * ser how ser works::           
-@c * ser customization::           
-@c * ser limitations::             
-@c * ser requirements::            
-@c @end menu
-@menu
-* grp command line options::    
-* grp pattern::                 
-* grp hints::    
-@end menu
-
@node grp command line options
@subsection Command line options

@ -1577,10 +1570,6 @@ of reading the corpus file from disk).

@parhelp
@parversion
-@c @parfile
-@c @paroutput
-@c @parinputfield
-@c @paroutputfield
@parprocess
@parinteractive

@ -1626,24 +1615,51 @@ lzop -cd corpus.grp.lzo | grp -a gP -e @var{EXPR} | ser -e @var{EXPR}
@end example


+
@c ---------------------------------------------------------------------
-@c kot
+@c MAR
@c ---------------------------------------------------------------------
+
+@page
+@node mar
+@section mar
+
+@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
+@item @strong{Authors:}                 @tab Marcin Walas, Tomasz Obrębski
+@item @strong{Component category:}      @tab filter
+@end multitable
+
+[TODO]
+
@c ---------------------------------------------------------------------
+@c KOT
+@c ---------------------------------------------------------------------
+

@page
@node kot
@section kot - untokenizer

-Authors: Tomasz Obrębski
+@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
+@item @strong{Authors:}                 @tab Tomasz Obrębski
+@item @strong{Component category:}      @tab sink
+@item @strong{Input format:}            @tab UTT regular
+@item @strong{Output format:}           @tab text
+@item @strong{Required annotation:}     @tab tok
+@end multitable

-@command{kot} is the opposite of @command{tok}. It changes UTT-formatted text into plain text.

@menu
+* kot description::
 * kot command line options::    
 * kot usage examples::    
@end menu

+@node kot description
+@subsection Description
+
+@command{kot} transforms a UTT-formatted file back into raw text format.
+
@node kot command line options
@subsection Command line options

@ -1683,28 +1699,38 @@ cat legia.txt | tok | kot
 cat legia.txt | tok | lem -1 | kot
@end example

-@c CON............................................................
-@c ...............................................................
-@c ...............................................................
+@c ---------------------------------------------------------------
+@c CON
+@c ---------------------------------------------------------------
+

@page
@node con
@section con - concordance table generator

-@command{con} generates a concordance table based on a pattern given to @command{ser}.
-
@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
@item @strong{Authors:}                 @tab Justyna Walkowska
@item @strong{Component category:}      @tab sink
+@item @strong{Input format:}            @tab UTT regular
+@item @strong{Output format:}           @tab text
+@item @strong{Required annotation:}     @tab ser or mar
@end multitable
@c

@menu
+* con description::
 * con command line options::
 * con usage example::
 * con hints::    
@end menu

+
+@node con description
+@subsection Description
+
+@command{con} generates a concordance table based on a pattern given to @command{ser}.
+
+
@node con command line options
@subsection Command line options

@ -1757,9 +1783,9 @@ cat legia.txt | tok | lem -1 | kot
 	Left column minimal width in characters (default = 0).
@item @b{@minus{}@minus{}ignore @minus{}i}            
 	Ignore segment inconsistency in the input.
-@item @b{@minus{}@minus{}bon}            
+@item @b{@minus{}@minus{}bom}            
 	Beginning of selected segment (regex, default='[0-9]+ [0-9]+ BOM .*').
-@item @b{@minus{}@minus{}eob}            
+@item @b{@minus{}@minus{}eom}            
 	End of selected segment (regex, default='[0-9]+ [0-9]+ EOM .*').
@item @b{@minus{}@minus{}bod}            
 	Selected segment beginning display string (default='[').
@ -1773,7 +1799,7 @@ cat legia.txt | tok | lem -1 | kot
@node con usage example
@subsection Usage example
@example
-cat file.txt | tok | lem -1 | ser -e 'lexeme(dom) | con'  
+cat file.txt | tok | lem -1 | ser -e 'lexeme(dom)' | con  
@end example


@ -1789,7 +1815,6 @@ sequence:
@end example


-
@c ---------------------------------------------------------------------
@c ---------------------------------------------------------------------