diff --git a/app/doc/utt.texinfo b/app/doc/utt.texinfo index 64ee5a6..0a26c93 100644 --- a/app/doc/utt.texinfo +++ b/app/doc/utt.texinfo @@ -8,15 +8,16 @@ @c %**end of header @copying -This manual is for UAM Text Tools (version 0.90, November, 2007) +This manual is for UAM Text Tools (version 0.90, October, 2008) Copyright @copyright{} 2005, 2007 Tomasz Obrębski, Michał Stolarski, Justyna Walkowska, Paweł Konieczka. Permission is granted to copy, distribute and/or modify this document -under the terms of the GNU Free Documentation License, Version 1.2 -or any later version published by the Free Software Foundation; -with no Invariant Sections, no Front-Cover Texts, and no Back-Cover -Texts. A copy of the license is included in the section entitled GNU Free Documentation License,,GNU Free Documentation License. +under the terms of the GNU Free Documentation License, Version 1.2 or +any later version published by the Free Software Foundation; with no +Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A +copy of the license is included in the section entitled GNU Free +Documentation License,,GNU Free Documentation License. @c @quotation @c Permission is granted to ... @@ -357,12 +358,33 @@ but not 0005 02 W km @end example -because in the latter example the first segment (starting at position 0000, 2 characters long) ends at position @var{n}=0001 which is covered by the second segment and no segment starts at position @var{n+2}=0002. +because in the latter example the first segment (starting at position +0000, 2 characters long) ends at position @var{n}=0001 which is +covered by the second segment and no segment starts at position +@var{n+2}=0002. + + +@section Flattened UTT file + +The UTT file format has two variants: regular and flattened. The regular +format was described above. In the flattened format some of the +end-of-line characters are replaced with line-feed characters.
+ +The flattened format is mainly used to represent whole sentences as +single lines of the input file (all intrasentential end-of-line +characters are replaced with line-feed characters). + +This technical trick makes it possible to perform certain text +processing operations on entire sentences with the use of such tools as +@command{grep} (see the @command{grp} component) or @command{sed} (see the @command{mar} component). + +The conversion between the two formats is performed by the tools +@command{fla} and @command{unfla}. @section Character encoding The UTT component programs accept only 1-byte character encoding, such -as ISO, ANSI, DOS, UTF-8 (probably: not tested yet). +as ISO, ANSI, DOS. @c @section Formats @@ -525,99 +547,6 @@ This option is useful when working with @command{kot} or @command{con}. @end macro -@c --------------------------------------------------------------------- -@c --------------------------------------------------------------------- - -@c @node Common command line options -@c @chapter Common command line options - -@c @table @code - -@c @parhelp - -@c @item @b{@minus{}@minus{}help}, @b{@minus{}h} -@c Print help. - -@c @item @b{@minus{}@minus{}version}, @b{@minus{}v} -@c Print version information. - -@c @item @b{@minus{}@minus{}file=@var{filename}, @minus{}f @var{filename}} -@c Input file name. -@c If this option is absent or equal to '@minus{}', the program -@c reads from the standard input. - -@c @item @b{@minus{}@minus{}output=@var{filename}, @minus{}o @var{filename}} -@c Regular output file name. To regular output the program sends segments -@c which it successfully processed and copies those which were not -@c subject to processing. If this option is absent or equal to -@c '@minus{}', standard output is used. - -@c @item @b{@minus{}@minus{}fail=@var{filename}, @minus{}e @var{filename}} -@c Fail output file name. To fail output the program copies the segments -@c it failed to process.
If this option is absent or equal to -@c '@minus{}', standard output is used. - -@c @item @b{@minus{}@minus{}only-fail} -@c Discard segments which would normally be sent to regular -@c output. Print only segments the program failed to process. - -@c @item @b{@minus{}@minus{}no-fail} -@c Discard segments the program failed to process. -@c (This and the previous option are functionally equivalent to, -@c respectively, @option{-o /dev/null} and @option{-e /dev/null}, but -@c make the programs run faster.) - -@c @item @b{@minus{}@minus{}input-field=@var{fieldname}, @minus{}I @var{fieldname}} -@c The field containing the input to the program. The default is usually -@c the @var{form} field (unless otherwise stated in the program -@c description). The fields @var{position}, @var{length}, @var{tag}, and -@c @var{form} are referred to as @code{1}, @code{2}, @code{3}, @code{4}, -@c respectively. - -@c @item @b{@minus{}@minus{}output-field=@var{fieldname}, @minus{}O @var{fieldname}} -@c The name of the field added by the program. The default is the name of -@c the program. - -@c @c @item @b{@minus{}@minus{}copy, @minus{}c} -@c @c Copy processed segments to regular output. - -@c @item @b{@minus{}@minus{}dictionary=@var{filename}, @minus{}d @var{filename}} -@c Dictionary file name. -@c (This option is used by programs which use dictionary data.) - -@c @item @b{@minus{}@minus{}process=@var{tag}, @minus{}p @var{tag}} -@c Process segments with the specified value in the @var{tag} field. -@c Multiple occurences of this option are allowed and are interpreted as -@c disjunction. If this option is absent, all segments are processed. - -@c @item @b{@minus{}@minus{}select=@var{fieldname}, @minus{}s @var{fieldname}} -@c Select for processing only segments in which the field named -@c @var{fieldname} is present. Multiple occurences of this option are -@c allowed and are interpreted as conjunction of conditions. If this -@c option is absent, all segments are processed. 
- -@c @item @b{@minus{}@minus{}unselect=@var{fieldname}, @minus{}S @var{fieldname}} -@c Select for processing only segments in which the field @var{fieldname} -@c is absent. Multiple occurences of this option are allowed and are -@c interpreted as conjunction of conditions. If this option is absent, -@c all segments are processed. - -@c @item @b{@minus{}@minus{}interactive @minus{}i} -@c This option toggles interactive mode, which is by default off. In the -@c interactive mode the program does not buffer the output. - -@c @item @b{@minus{}@minus{}config=@var{filename}} -@c Read configuration from file @file{@var{filename}}. - -@c @item @b{@minus{}@minus{}one @minus{}1} -@c This option makes the program print ambiguous annotation in one output -@c segment. By default when -@c ambiguous new annotation is being produced for a segment, the segment -@c is multiplicated and each of the annotations is added to separate copy -@c of the segment. - -@c @end table - @c --------------------------------------------------------------------- @c CONFIGURATION FILES @c --------------------------------------------------------------------- @@ -694,14 +623,16 @@ in UTT format * tok:: a tokenizer Filters: programs which read and produce UTT-formatted data -@c * sen - the sentencizer:: * lem:: a morphological analyzer * gue:: a morphological guesser -* cor:: a spelling corrector +* cor:: a simple spelling corrector +* kor:: a more elaborate spelling corrector * sen:: a sentensizer -@c * gph - the graphizer:: * ser:: a pattern search tool (marks matches) +* mar:: a pattern search tool (introduces arbitrary markers into the text) * grp:: a pattern search tool (selects sentences containing a match) +@c * gph:: a word-graph annotation tool:: +@c * dgp:: a dependency parser Sinks: programs which read UTT data and produce output in another format * kot:: an untokenizer @@ -721,6 +652,9 @@ Sinks: programs which read UTT data and produce output in another format @multitable
{aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa} @item @strong{Authors:} @tab Tomasz Obrębski @item @strong{Component category:} @tab source +@item @strong{Input format:} @tab raw text file +@item @strong{Output format:} @tab UTT regular +@item @strong{Required annotation:} @tab - @end multitable @@ -834,6 +768,9 @@ Output: @multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa} @item @strong{Authors:} @tab Tomasz Obrębski, Michał Stolarski @item @strong{Component category:} @tab filter +@item @strong{Input format:} @tab UTT regular +@item @strong{Output format:} @tab UTT regular +@item @strong{Required annotation:} @tab tok @end multitable @menu @@ -1031,28 +968,34 @@ A large-coverage morphological dictionary for Polish language, Polex/PMDBF, is i the distribution as the default @emph{lem}'s dictionary. It's located by default in: -@file{$HOME/.utt/pl/lem.bin} +@file{$HOME/.local/share/utt/pl_PL.ISO-8859-2/lem.bin} + +in a local installation or in + +@file{/usr/local/share/utt/pl_PL.ISO-8859-2/lem.bin} + +in a system installation. @node lem hints @subsection Hints -@c @subsubheading Combining data from multiple dictionaries +@subsubheading Combining data from multiple dictionaries -@c @itemize +@itemize -@c @item Apply , then apply to words which were not annotatated. +@item Apply , then apply to words which were not annotated. -@c @example -@c lem -d | lem -S lem -d -@c @end example +@example +lem -d | lem -S lem -d +@end example -@c @item Add annotations from two dictionaries and . +@item Add annotations from two dictionaries and .
+ -@c @example -@c lem -c -d | lem -S lem -d -@c @end example +@example +lem -c -d | lem -S lem -d +@end example -@c @end itemize +@end itemize @c --------------------------------------------------------------------- @@ -1070,15 +1013,21 @@ located by default in: @end multitable -@command{gue} guesess morphological descriptions of the form contained -in the @var{form} field. - @menu +* gue description:: * gue command line options:: * gue example:: * gue dictionaries:: @end menu + +@node gue description +@subsection Description + +@command{gue} guesses morphological descriptions of the form contained +in the @var{form} field. + + @node gue command line options @subsection Command line options @@ -1181,24 +1130,27 @@ naj*elszy;3-4ały,ADJ/...:... @multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa} @item @strong{Authors:} @tab Tomasz Obrębski, Michał Stolarski @item @strong{Component category:} @tab filter +@item @strong{Input format:} @tab UTT regular +@item @strong{Output format:} @tab UTT regular +@item @strong{Required annotation:} @tab tok @end multitable +@menu +* cor description:: +* cor command line options:: +* cor dictionaries:: +@end menu + + +@node cor description +@subsection Description + The spelling corrector applies Kemal Oflazer's dynamic programming algorithm @cite{oflazer96} to the FSA representation of the set of word forms of the Polex/PMDBF dictionary. Given an incorrect word form it returns all word forms present in the dictionary whose edit distance is smaller than the threshold given as the parameter. -By default @code{cor} replaces the contents of the @var{form} field -with new corrected value, placing the old contents in the @code{cor} -field. - - -@menu -* cor command line options:: -* cor dictionaries:: -@end menu - @node cor command line options @subsection Command line options @@ -1224,6 +1176,10 @@ field.
@item @b{@minus{}@minus{}distance=@var{int}, @minus{}n @var{int}} Maximum edit distance (default='1'). +@c @item @b{@minus{}@minus{}replace, @minus{}r} +@c Replace original form with corrected form, place original form in the +@c cor field. This option has no effect in @option{--one-*} modes (default=off) + @end table @@ -1242,6 +1198,29 @@ odlotowy odludek @end example +@subsubheading Binary format + +The mandatory file name extension for a binary dictionary is @code{bin}. To +compile a text dictionary into binary format, write: + +@example +compiledic .dic +@end example + +@c --------------------------------------------------------------------- +@c KOR +@c --------------------------------------------------------------------- + +@page +@node kor +@section kor - configurable spelling corrector + +[TODO] + +@c --------------------------------------------------------------------- +@c SEN +@c --------------------------------------------------------------------- + @page @node sen @section sen - a sentensizer @@ -1250,17 +1229,25 @@ odludek @item @strong{Authors:} @tab Tomasz Obrębski @item @strong{Component category:} @tab filter +@item @strong{Input format:} @tab UTT regular +@item @strong{Output format:} @tab UTT regular +@item @strong{Required annotation:} @tab tok @end multitable -@command{sen} detects sentence boundaries in UTT-formatted texts and marks them with special zero-length segments, in which the @var{type} field may contain the BOS (beginning of sentence) or EOS (end of sentence) annotation. @menu +* sen description:: @c * sen input:: @c * sen output:: * sen example:: @end menu +@node sen description +@subsection Description + +@command{sen} detects sentence boundaries in UTT-formatted texts and marks them with special zero-length segments, in which the @var{type} field may contain the BOS (beginning of sentence) or EOS (end of sentence) annotation.
+ @node sen example @subsection Example @@ -1304,8 +1291,8 @@ output: -@c SER @c --------------------------------------------------------------------- +@c SER @c --------------------------------------------------------------------- @page @@ -1315,11 +1302,13 @@ output: @multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa} @item @strong{Authors:} @tab Tomasz Obrębski @item @strong{Component category:} @tab filter +@item @strong{Input format:} @tab UTT regular +@item @strong{Output format:} @tab UTT regular +@item @strong{Required annotation:} @tab tok, lem --one-field @end multitable -@command{ser} looks for patterns in UTT-formatted texts. - @menu +* ser description:: * ser command line options:: * ser pattern:: * ser how ser works:: @@ -1329,6 +1318,12 @@ output: @end menu +@node ser description +@subsection Description + +@command{ser} looks for patterns in UTT-formatted texts. + + @c --------------------------------------------------------------------- @node ser command line options @subsection Command line options @@ -1503,7 +1498,7 @@ ocurrence of a relative pronoun @c All predefined terms correspond to single segments, @example -define(`verbseq', `(cat(V) (space cat(V)))') +define(`verbseq', `(cat() (space cat()))') @end example @@ -1514,7 +1509,7 @@ the term @code{cat()} may not be used as a ... of @node ser limitations @subsection Limitations -more than 3 attributes in <>. +Do not use more than 3 attributes in <>.
@node ser requirements @subsection Requirements @@ -1532,8 +1527,8 @@ installed in the system: @end itemize -@c GRP @c --------------------------------------------------------------------- +@c GRP @c --------------------------------------------------------------------- @page @@ -1543,9 +1538,23 @@ installed in the system: @multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa} @item @strong{Authors:} @tab Tomasz Obrębski @item @strong{Component category:} @tab filter +@item @strong{Input format:} @tab UTT flattened +@item @strong{Output format:} @tab UTT flattened +@item @strong{Required annotation:} @tab tok, sen, lem --one-field @end multitable +@menu +* grp description:: +* grp command line options:: +* grp pattern:: +* grp hints:: +@end menu + + +@node grp description +@subsection Description + @code{gre} selects sentences containing an expression matching a pattern. The pattern format is exactly the same as that accepted by @code{ser}. @@ -1554,22 +1563,6 @@ pattern. The pattern format is exactly the same as that accepted by It is extremely fast (processing speed is usually higher then the speed of reading the corpus file from disk). - - -@c @menu -@c * ser command line options:: -@c * ser pattern:: -@c * ser how ser works:: -@c * ser customization:: -@c * ser limitations:: -@c * ser requirements:: -@c @end menu -@menu -* grp command line options:: -* grp pattern:: -* grp hints:: -@end menu - @node grp command line options @subsection Command line options @@ -1577,10 +1570,6 @@ of reading the corpus file from disk).
@parhelp @parversion -@c @parfile -@c @paroutput -@c @parinputfield -@c @paroutputfield @parprocess @parinteractive @@ -1626,24 +1615,51 @@ lzop -cd corpus.grp.lzo | grp -a gP -e @var{EXPR} | ser -e @var{EXPR} @end example + @c --------------------------------------------------------------------- -@c kot +@c MAR @c --------------------------------------------------------------------- + +@page +@node mar +@section mar + +@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa} +@item @strong{Authors:} @tab Marcin Walas, Tomasz Obrębski +@item @strong{Component category:} @tab filter +@end multitable + +[TODO] + @c --------------------------------------------------------------------- +@c KOT +@c --------------------------------------------------------------------- + @page @node kot @section kot - untokenizer -Authors: Tomasz Obrębski +@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa} +@item @strong{Authors:} @tab Tomasz Obrębski +@item @strong{Component category:} @tab sink +@item @strong{Input format:} @tab UTT regular +@item @strong{Output format:} @tab text +@item @strong{Required annotation:} @tab tok +@end multitable -@command{kot} is the opposite of @command{tok}. It changes UTT-formatted text into plain text. @menu +* kot description:: * kot command line options:: * kot usage examples:: @end menu +@node kot description +@subsection Description + +@command{kot} transforms a UTT-formatted file back into raw text. + @node kot command line options @subsection Command line options @@ -1683,28 +1699,38 @@ cat legia.txt | tok | kot cat legia.txt | tok | lem -1 | kot @end example -@c CON............................................................ -@c ............................................................... -@c ...............................................................
+@c --------------------------------------------------------------- +@c CON +@c --------------------------------------------------------------- + @page @node con @section con - concordance table generator -@command{con} generates a concordance table based on a pattern given to @command{ser}. - @multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa} @item @strong{Authors:} @tab Justyna Walkowska @item @strong{Component category:} @tab sink +@item @strong{Input format:} @tab UTT regular +@item @strong{Output format:} @tab text +@item @strong{Required annotation:} @tab ser or mar @end multitable @c @menu +* con description:: * con command line options:: * con usage example:: * con hints:: @end menu + +@node con description +@subsection Description + +@command{con} generates a concordance table based on a pattern given to @command{ser}. + + @node con command line options @subsection Command line options @@ -1757,9 +1783,9 @@ cat legia.txt | tok | lem -1 | kot Left column minimal width in characters (default = 0). @item @b{@minus{}@minus{}ignore @minus{}i} Ignore segment inconsistency in the input. -@item @b{@minus{}@minus{}bon} +@item @b{@minus{}@minus{}bom} Beginning of selected segment (regex, default='[0-9]+ [0-9]+ BOM .*'). -@item @b{@minus{}@minus{}eob} +@item @b{@minus{}@minus{}eom} End of selected segment (regex, default='[0-9]+ [0-9]+ EOM .*'). @item @b{@minus{}@minus{}bod} Selected segment beginning display string (default='['). @@ -1773,7 +1799,7 @@ cat legia.txt | tok | lem -1 | kot @node con usage example @subsection Usage example @example -cat file.txt | tok | lem -1 | ser -e 'lexeme(dom) | con' +cat file.txt | tok | lem -1 | ser -e 'lexeme(dom)' | con @end example @@ -1789,7 +1815,6 @@ sequence: @end example - @c --------------------------------------------------------------------- @c ---------------------------------------------------------------------