in utt.texinfo
git-svn-id: svn://atos.wmid.amu.edu.pl/utt@60 e293616e-ec6a-49c2-aa92-f4a8b91c5d16
This commit is contained in:
parent
839a0d50e2
commit
261bf629fb
@ -8,15 +8,16 @@
|
|||||||
@c %**end of header
|
@c %**end of header
|
||||||
|
|
||||||
@copying
|
@copying
|
||||||
This manual is for UAM Text Tools (version 0.90, November, 2007)
|
This manual is for UAM Text Tools (version 0.90, October, 2008)
|
||||||
|
|
||||||
Copyright @copyright{} 2005, 2007 Tomasz Obrębski, Michał Stolarski, Justyna Walkowska, Paweł Konieczka.
|
Copyright @copyright{} 2005, 2007 Tomasz Obrębski, Michał Stolarski, Justyna Walkowska, Paweł Konieczka.
|
||||||
|
|
||||||
Permission is granted to copy, distribute and/or modify this document
|
Permission is granted to copy, distribute and/or modify this document
|
||||||
under the terms of the GNU Free Documentation License, Version 1.2
|
under the terms of the GNU Free Documentation License, Version 1.2 or
|
||||||
or any later version published by the Free Software Foundation;
|
any later version published by the Free Software Foundation; with no
|
||||||
with no Invariant Sections, no Front-Cover Texts, and no Back-Cover
|
Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A
|
||||||
Texts. A copy of the license is included in the section entitled GNU Free Documentation License,,GNU Free Documentation License.
|
copy of the license is included in the section entitled GNU Free
|
||||||
|
Documentation License,,GNU Free Documentation License.
|
||||||
|
|
||||||
@c @quotation
|
@c @quotation
|
||||||
@c Permission is granted to ...
|
@c Permission is granted to ...
|
||||||
@ -357,12 +358,33 @@ but not
|
|||||||
0005 02 W km
|
0005 02 W km
|
||||||
@end example
|
@end example
|
||||||
|
|
||||||
because in the latter example the first segment (starting at position 0000, 2 characters long) ends at position @var{n}=0001 which is covered by the second segment and no segment starts at position @var{n+2}=0002.
|
because in the latter example the first segment (starting at position
|
||||||
|
0000, 2 characters long) ends at position @var{n}=0001 which is
|
||||||
|
covered by the second segment and no segment starts at position
|
||||||
|
@var{n+2}=0002.
|
||||||
|
|
||||||
|
|
||||||
|
@section Flattened UTT file
|
||||||
|
|
||||||
|
The UTT file format has two variants: regular and flattened. The regular
|
||||||
|
format was described above. In the flattened format some of the
|
||||||
|
end-of-line characters are replaced with line-feed characters.
|
||||||
|
|
||||||
|
The flattened format is mainly used to represent whole sentences as
|
||||||
|
single lines of the input file (all intrasentential end-of-line
|
||||||
|
characters are replaced with line-feed characters).
|
||||||
|
|
||||||
|
This technical trick makes it possible to perform certain text
|
||||||
|
processing operations on entire sentences with the use of such tools as
|
||||||
|
@command{grep} (see @command{grp} component) or @command{sed} (see @command{mar} component).
|
||||||
|
|
||||||
|
The conversion between the two formats is performed by the tools:
|
||||||
|
@command{fla} and @command{unfla}.
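The idea behind the flattened format can be illustrated with a minimal Python sketch. It is not the implementation of @command{fla} or @command{unfla}: the separator character `SEP`, the function names, and the rule that a sentence starts at a segment whose third field is BOS are all assumptions made for illustration only.

```python
# Hypothetical stand-in for the character that replaces intrasentential
# line ends; the character actually used by fla is not assumed here.
SEP = "\x0b"


def flatten(lines, sep=SEP):
    """Join all segments of one sentence into a single line.

    Assumption: a new sentence starts at a segment whose third
    field (the tag) equals BOS.
    """
    out, current = [], []
    for line in lines:
        fields = line.split(None, 3)
        starts_sentence = len(fields) >= 3 and fields[2] == "BOS"
        if starts_sentence and current:
            out.append(sep.join(current))
            current = []
        current.append(line)
    if current:
        out.append(sep.join(current))
    return out


def unflatten(flat_lines, sep=SEP):
    """Restore one segment per line."""
    return [segment for line in flat_lines for segment in line.split(sep)]
```

With one sentence per line, line-oriented tools can select or rewrite whole sentences, and the regular format is recovered by splitting on the same separator.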
|
||||||
|
|
||||||
@section Character encoding
|
@section Character encoding
|
||||||
|
|
||||||
The UTT component programs accept only 1-byte character encoding, such
|
The UTT component programs accept only 1-byte character encoding, such
|
||||||
as ISO, ANSI, DOS, UTF-8 (probably: not tested yet).
|
as ISO, ANSI, DOS.
|
||||||
|
|
||||||
|
|
||||||
@c @section Formats
|
@c @section Formats
|
||||||
@ -525,99 +547,6 @@ This option is useful when working with @command{kot} or @command{con}.
|
|||||||
@end macro
|
@end macro
|
||||||
|
|
||||||
|
|
||||||
@c ---------------------------------------------------------------------
|
|
||||||
@c ---------------------------------------------------------------------
|
|
||||||
|
|
||||||
@c @node Common command line options
|
|
||||||
@c @chapter Common command line options
|
|
||||||
|
|
||||||
@c @table @code
|
|
||||||
|
|
||||||
@c @parhelp
|
|
||||||
|
|
||||||
@c @item @b{@minus{}@minus{}help}, @b{@minus{}h}
|
|
||||||
@c Print help.
|
|
||||||
|
|
||||||
@c @item @b{@minus{}@minus{}version}, @b{@minus{}v}
|
|
||||||
@c Print version information.
|
|
||||||
|
|
||||||
@c @item @b{@minus{}@minus{}file=@var{filename}, @minus{}f @var{filename}}
|
|
||||||
@c Input file name.
|
|
||||||
@c If this option is absent or equal to '@minus{}', the program
|
|
||||||
@c reads from the standard input.
|
|
||||||
|
|
||||||
@c @item @b{@minus{}@minus{}output=@var{filename}, @minus{}o @var{filename}}
|
|
||||||
@c Regular output file name. To regular output the program sends segments
|
|
||||||
@c which it successfully processed and copies those which were not
|
|
||||||
@c subject to processing. If this option is absent or equal to
|
|
||||||
@c '@minus{}', standard output is used.
|
|
||||||
|
|
||||||
@c @item @b{@minus{}@minus{}fail=@var{filename}, @minus{}e @var{filename}}
|
|
||||||
@c Fail output file name. To fail output the program copies the segments
|
|
||||||
@c it failed to process. If this option is absent or equal to
|
|
||||||
@c '@minus{}', standard output is used.
|
|
||||||
|
|
||||||
@c @item @b{@minus{}@minus{}only-fail}
|
|
||||||
@c Discard segments which would normally be sent to regular
|
|
||||||
@c output. Print only segments the program failed to process.
|
|
||||||
|
|
||||||
@c @item @b{@minus{}@minus{}no-fail}
|
|
||||||
@c Discard segments the program failed to process.
|
|
||||||
@c (This and the previous option are functionally equivalent to,
|
|
||||||
@c respectively, @option{-o /dev/null} and @option{-e /dev/null}, but
|
|
||||||
@c make the programs run faster.)
|
|
||||||
|
|
||||||
@c @item @b{@minus{}@minus{}input-field=@var{fieldname}, @minus{}I @var{fieldname}}
|
|
||||||
@c The field containing the input to the program. The default is usually
|
|
||||||
@c the @var{form} field (unless otherwise stated in the program
|
|
||||||
@c description). The fields @var{position}, @var{length}, @var{tag}, and
|
|
||||||
@c @var{form} are referred to as @code{1}, @code{2}, @code{3}, @code{4},
|
|
||||||
@c respectively.
|
|
||||||
|
|
||||||
@c @item @b{@minus{}@minus{}output-field=@var{fieldname}, @minus{}O @var{fieldname}}
|
|
||||||
@c The name of the field added by the program. The default is the name of
|
|
||||||
@c the program.
|
|
||||||
|
|
||||||
@c @c @item @b{@minus{}@minus{}copy, @minus{}c}
|
|
||||||
@c @c Copy processed segments to regular output.
|
|
||||||
|
|
||||||
@c @item @b{@minus{}@minus{}dictionary=@var{filename}, @minus{}d @var{filename}}
|
|
||||||
@c Dictionary file name.
|
|
||||||
@c (This option is used by programs which use dictionary data.)
|
|
||||||
|
|
||||||
@c @item @b{@minus{}@minus{}process=@var{tag}, @minus{}p @var{tag}}
|
|
||||||
@c Process segments with the specified value in the @var{tag} field.
|
|
||||||
@c Multiple occurences of this option are allowed and are interpreted as
|
|
||||||
@c disjunction. If this option is absent, all segments are processed.
|
|
||||||
|
|
||||||
@c @item @b{@minus{}@minus{}select=@var{fieldname}, @minus{}s @var{fieldname}}
|
|
||||||
@c Select for processing only segments in which the field named
|
|
||||||
@c @var{fieldname} is present. Multiple occurences of this option are
|
|
||||||
@c allowed and are interpreted as conjunction of conditions. If this
|
|
||||||
@c option is absent, all segments are processed.
|
|
||||||
|
|
||||||
@c @item @b{@minus{}@minus{}unselect=@var{fieldname}, @minus{}S @var{fieldname}}
|
|
||||||
@c Select for processing only segments in which the field @var{fieldname}
|
|
||||||
@c is absent. Multiple occurences of this option are allowed and are
|
|
||||||
@c interpreted as conjunction of conditions. If this option is absent,
|
|
||||||
@c all segments are processed.
|
|
||||||
|
|
||||||
@c @item @b{@minus{}@minus{}interactive @minus{}i}
|
|
||||||
@c This option toggles interactive mode, which is by default off. In the
|
|
||||||
@c interactive mode the program does not buffer the output.
|
|
||||||
|
|
||||||
@c @item @b{@minus{}@minus{}config=@var{filename}}
|
|
||||||
@c Read configuration from file @file{@var{filename}}.
|
|
||||||
|
|
||||||
@c @item @b{@minus{}@minus{}one @minus{}1}
|
|
||||||
@c This option makes the program print ambiguous annotation in one output
|
|
||||||
@c segment. By default when
|
|
||||||
@c ambiguous new annotation is being produced for a segment, the segment
|
|
||||||
@c is multiplicated and each of the annotations is added to separate copy
|
|
||||||
@c of the segment.
|
|
||||||
|
|
||||||
@c @end table
|
|
||||||
|
|
||||||
@c ---------------------------------------------------------------------
|
@c ---------------------------------------------------------------------
|
||||||
@c CONFIGURATION FILES
|
@c CONFIGURATION FILES
|
||||||
@c ---------------------------------------------------------------------
|
@c ---------------------------------------------------------------------
|
||||||
@ -694,14 +623,16 @@ in UTT format
|
|||||||
* tok:: a tokenizer
|
* tok:: a tokenizer
|
||||||
|
|
||||||
Filters: programs which read and produce UTT-formatted data
|
Filters: programs which read and produce UTT-formatted data
|
||||||
@c * sen - the sentencizer::
|
|
||||||
* lem:: a morphological analyzer
|
* lem:: a morphological analyzer
|
||||||
* gue:: a morphological guesser
|
* gue:: a morphological guesser
|
||||||
* cor:: a spelling corrector
|
* cor:: a simple spelling corrector
|
||||||
|
* kor:: a more elaborate spelling corrector
|
||||||
* sen:: a sentencizer
|
* sen:: a sentencizer
|
||||||
@c * gph - the graphizer::
|
|
||||||
* ser:: a pattern search tool (marks matches)
|
* ser:: a pattern search tool (marks matches)
|
||||||
|
* mar:: a pattern search tool (introduces arbitrary markers into the text)
|
||||||
* grp:: a pattern search tool (selects sentences containing a match)
|
* grp:: a pattern search tool (selects sentences containing a match)
|
||||||
|
@c * gph:: a word-graph annotation tool::
|
||||||
|
@c * dgp:: a dependency parser
|
||||||
|
|
||||||
Sinks: programs which read UTT data and produce output in another format
|
Sinks: programs which read UTT data and produce output in another format
|
||||||
* kot:: an untokenizer
|
* kot:: an untokenizer
|
||||||
@ -721,6 +652,9 @@ Sinks: programs which read UTT data and produce output in another format
|
|||||||
@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
|
@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
|
||||||
@item @strong{Authors:} @tab Tomasz Obrębski
|
@item @strong{Authors:} @tab Tomasz Obrębski
|
||||||
@item @strong{Component category:} @tab source
|
@item @strong{Component category:} @tab source
|
||||||
|
@item @strong{Input format:} @tab raw text file
|
||||||
|
@item @strong{Output format:} @tab UTT regular
|
||||||
|
@item @strong{Required annotation:} @tab -
|
||||||
@end multitable
|
@end multitable
|
||||||
|
|
||||||
|
|
||||||
@ -834,6 +768,9 @@ Output:
|
|||||||
@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
|
@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
|
||||||
@item @strong{Authors:} @tab Tomasz Obrębski, Michał Stolarski
|
@item @strong{Authors:} @tab Tomasz Obrębski, Michał Stolarski
|
||||||
@item @strong{Component category:} @tab filter
|
@item @strong{Component category:} @tab filter
|
||||||
|
@item @strong{Input format:} @tab UTT regular
|
||||||
|
@item @strong{Output format:} @tab UTT regular
|
||||||
|
@item @strong{Required annotation:} @tab tok
|
||||||
@end multitable
|
@end multitable
|
||||||
|
|
||||||
@menu
|
@menu
|
||||||
@ -1031,28 +968,34 @@ A large-coverage morphological dictionary for Polish language, Polex/PMDBF, is i
|
|||||||
the distribution as the default @emph{lem}'s dictionary. It's
|
the distribution as the default @emph{lem}'s dictionary. It's
|
||||||
located by default in:
|
located by default in:
|
||||||
|
|
||||||
@file{$HOME/.utt/pl/lem.bin}
|
@file{$HOME/.local/share/utt/pl_PL.ISO-8859-2/lem.bin}
|
||||||
|
|
||||||
|
in local installation or in
|
||||||
|
|
||||||
|
@file{/usr/local/share/utt/pl_PL.ISO-8859-2/lem.bin}
|
||||||
|
|
||||||
|
in system installation.
|
||||||
|
|
||||||
@node lem hints
|
@node lem hints
|
||||||
@subsection Hints
|
@subsection Hints
|
||||||
|
|
||||||
@c @subsubheading Combining data from multiple dictionaries
|
@subsubheading Combining data from multiple dictionaries
|
||||||
|
|
||||||
@c @itemize
|
@itemize
|
||||||
|
|
||||||
@c @item Apply <dict1>, then apply <dict2> to words which were not annotated.
|
@item Apply <dict1>, then apply <dict2> to words which were not annotated.
|
||||||
|
|
||||||
@c @example
|
@example
|
||||||
@c lem -d <dict1> | lem -S lem -d <dict2>
|
lem -d <dict1> | lem -S lem -d <dict2>
|
||||||
@c @end example
|
@end example
|
||||||
|
|
||||||
@c @item Add annotations from two dictionaries <dict1> and <dict2>.
|
@item Add annotations from two dictionaries <dict1> and <dict2>.
|
||||||
|
|
||||||
@c @example
|
@example
|
||||||
@c lem -c -d <dict1> | lem -S lem -d <dict2>
|
lem -c -d <dict1> | lem -S lem -d <dict2>
|
||||||
@c @end example
|
@end example
|
||||||
|
|
||||||
@c @end itemize
|
@end itemize
|
||||||
|
|
||||||
|
|
||||||
@c ---------------------------------------------------------------------
|
@c ---------------------------------------------------------------------
|
||||||
@ -1070,15 +1013,21 @@ located by default in:
|
|||||||
|
|
||||||
@end multitable
|
@end multitable
|
||||||
|
|
||||||
@command{gue} guesses morphological descriptions of the form contained
|
|
||||||
in the @var{form} field.
|
|
||||||
|
|
||||||
@menu
|
@menu
|
||||||
|
* gue description::
|
||||||
* gue command line options::
|
* gue command line options::
|
||||||
* gue example::
|
* gue example::
|
||||||
* gue dictionaries::
|
* gue dictionaries::
|
||||||
@end menu
|
@end menu
|
||||||
|
|
||||||
|
|
||||||
|
@node gue description
|
||||||
|
@subsection Description
|
||||||
|
|
||||||
|
@command{gue} guesses morphological descriptions of the form contained
|
||||||
|
in the @var{form} field.
|
||||||
|
|
||||||
|
|
||||||
@node gue command line options
|
@node gue command line options
|
||||||
@subsection Command line options
|
@subsection Command line options
|
||||||
|
|
||||||
@ -1181,24 +1130,27 @@ naj*elszy;3-4ały,ADJ/...:...
|
|||||||
@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
|
@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
|
||||||
@item @strong{Authors:} @tab Tomasz Obrębski, Michał Stolarski
|
@item @strong{Authors:} @tab Tomasz Obrębski, Michał Stolarski
|
||||||
@item @strong{Component category:} @tab filter
|
@item @strong{Component category:} @tab filter
|
||||||
|
@item @strong{Input format:} @tab UTT regular
|
||||||
|
@item @strong{Output format:} @tab UTT regular
|
||||||
|
@item @strong{Required annotation:} @tab tok
|
||||||
@end multitable
|
@end multitable
|
||||||
|
|
||||||
|
@menu
|
||||||
|
* cor description::
|
||||||
|
* cor command line options::
|
||||||
|
* cor dictionaries::
|
||||||
|
@end menu
|
||||||
|
|
||||||
|
|
||||||
|
@node cor description
|
||||||
|
@subsection Description
|
||||||
|
|
||||||
The spelling corrector applies Kemal Oflazer's dynamic programming
|
The spelling corrector applies Kemal Oflazer's dynamic programming
|
||||||
algorithm @cite{oflazer96} to the FSA representation of the set of
|
algorithm @cite{oflazer96} to the FSA representation of the set of
|
||||||
word forms of the Polex/PMDBF dictionary. Given an incorrect
|
word forms of the Polex/PMDBF dictionary. Given an incorrect
|
||||||
word form it returns all word forms present in the dictionary whose
|
word form it returns all word forms present in the dictionary whose
|
||||||
edit distance is smaller than the threshold given as the parameter.
|
edit distance is smaller than the threshold given as the parameter.
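The selection criterion can be illustrated with a minimal Python sketch. It is not Oflazer's FSA-based algorithm and not the code of @command{cor}: it compares the input against every dictionary entry with plain Levenshtein distance. The function names are hypothetical, and treating the threshold as inclusive follows the wording of the @option{--distance} option ("Maximum edit distance"), which is an assumption.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def corrections(word, dictionary, max_distance=1):
    """Return dictionary forms within max_distance of word (inclusive)."""
    return [w for w in dictionary if edit_distance(word, w) <= max_distance]
```

Scanning the whole word list costs time proportional to its size for every input word; searching an FSA representation of the dictionary, as the description above states, avoids visiting entries that cannot match.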
|
||||||
|
|
||||||
By default @code{cor} replaces the contents of the @var{form} field
|
|
||||||
with new corrected value, placing the old contents in the @code{cor}
|
|
||||||
field.
|
|
||||||
|
|
||||||
|
|
||||||
@menu
|
|
||||||
* cor command line options::
|
|
||||||
* cor dictionaries::
|
|
||||||
@end menu
|
|
||||||
|
|
||||||
|
|
||||||
@node cor command line options
|
@node cor command line options
|
||||||
@subsection Command line options
|
@subsection Command line options
|
||||||
@ -1224,6 +1176,10 @@ field.
|
|||||||
@item @b{@minus{}@minus{}distance=@var{int}, @minus{}n @var{int}}
|
@item @b{@minus{}@minus{}distance=@var{int}, @minus{}n @var{int}}
|
||||||
Maximum edit distance (default='1').
|
Maximum edit distance (default='1').
|
||||||
|
|
||||||
|
@c @item @b{@minus{}@minus{}replace, @minus{}r}
|
||||||
|
@c Replace original form with corrected form, place original form in the
|
||||||
|
@c cor field. This option has no effect in @option{--one-*} modes (default=off)
|
||||||
|
|
||||||
|
|
||||||
@end table
|
@end table
|
||||||
|
|
||||||
@ -1242,6 +1198,29 @@ odlotowy
|
|||||||
odludek
|
odludek
|
||||||
@end example
|
@end example
|
||||||
|
|
||||||
|
@subsubheading Binary format
|
||||||
|
|
||||||
|
The mandatory file name extension for a binary dictionary is @code{bin}. To
|
||||||
|
compile a text dictionary into binary format, write:
|
||||||
|
|
||||||
|
@example
|
||||||
|
compiledic <dictionaryname>.dic
|
||||||
|
@end example
|
||||||
|
|
||||||
|
@c ---------------------------------------------------------------------
|
||||||
|
@c KOR
|
||||||
|
@c ---------------------------------------------------------------------
|
||||||
|
|
||||||
|
@page
|
||||||
|
@node kor
|
||||||
|
@section kor - configurable spelling corrector
|
||||||
|
|
||||||
|
[TODO]
|
||||||
|
|
||||||
|
@c ---------------------------------------------------------------------
|
||||||
|
@c SEN
|
||||||
|
@c ---------------------------------------------------------------------
|
||||||
|
|
||||||
@page
|
@page
|
||||||
@node sen
|
@node sen
|
||||||
@section sen - a sentencizer
|
@section sen - a sentencizer
|
||||||
@ -1250,17 +1229,25 @@ odludek
|
|||||||
|
|
||||||
@item @strong{Authors:} @tab Tomasz Obrębski
|
@item @strong{Authors:} @tab Tomasz Obrębski
|
||||||
@item @strong{Component category:} @tab filter
|
@item @strong{Component category:} @tab filter
|
||||||
|
@item @strong{Input format:} @tab UTT regular
|
||||||
|
@item @strong{Output format:} @tab UTT regular
|
||||||
|
@item @strong{Required annotation:} @tab tok
|
||||||
|
|
||||||
@end multitable
|
@end multitable
|
||||||
|
|
||||||
@command{sen} detects sentence boundaries in UTT-formatted texts and marks them with special zero-length segments, in which the @var{type} field may contain the BOS (beginning of sentence) or EOS (end of sentence) annotation.
|
|
||||||
|
|
||||||
@menu
|
@menu
|
||||||
|
* sen description::
|
||||||
@c * sen input::
|
@c * sen input::
|
||||||
@c * sen output::
|
@c * sen output::
|
||||||
* sen example::
|
* sen example::
|
||||||
@end menu
|
@end menu
|
||||||
|
|
||||||
|
@node sen description
|
||||||
|
@subsection Description
|
||||||
|
|
||||||
|
@command{sen} detects sentence boundaries in UTT-formatted texts and marks them with special zero-length segments, in which the @var{type} field may contain the BOS (beginning of sentence) or EOS (end of sentence) annotation.
|
||||||
|
|
||||||
@node sen example
|
@node sen example
|
||||||
@subsection Example
|
@subsection Example
|
||||||
|
|
||||||
@ -1304,8 +1291,8 @@ output:
|
|||||||
|
|
||||||
|
|
||||||
|
|
||||||
@c SER
|
|
||||||
@c ---------------------------------------------------------------------
|
@c ---------------------------------------------------------------------
|
||||||
|
@c SER
|
||||||
@c ---------------------------------------------------------------------
|
@c ---------------------------------------------------------------------
|
||||||
|
|
||||||
@page
|
@page
|
||||||
@ -1315,11 +1302,13 @@ output:
|
|||||||
@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
|
@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
|
||||||
@item @strong{Authors:} @tab Tomasz Obrębski
|
@item @strong{Authors:} @tab Tomasz Obrębski
|
||||||
@item @strong{Component category:} @tab filter
|
@item @strong{Component category:} @tab filter
|
||||||
|
@item @strong{Input format:} @tab UTT regular
|
||||||
|
@item @strong{Output format:} @tab UTT regular
|
||||||
|
@item @strong{Required annotation:} @tab tok, lem --one-field
|
||||||
@end multitable
|
@end multitable
|
||||||
|
|
||||||
@command{ser} looks for patterns in UTT-formatted texts.
|
|
||||||
|
|
||||||
@menu
|
@menu
|
||||||
|
* ser description::
|
||||||
* ser command line options::
|
* ser command line options::
|
||||||
* ser pattern::
|
* ser pattern::
|
||||||
* ser how ser works::
|
* ser how ser works::
|
||||||
@ -1329,6 +1318,12 @@ output:
|
|||||||
@end menu
|
@end menu
|
||||||
|
|
||||||
|
|
||||||
|
@node ser description
|
||||||
|
@subsection Description
|
||||||
|
|
||||||
|
@command{ser} looks for patterns in UTT-formatted texts.
|
||||||
|
|
||||||
|
|
||||||
@c ---------------------------------------------------------------------
|
@c ---------------------------------------------------------------------
|
||||||
@node ser command line options
|
@node ser command line options
|
||||||
@subsection Command line options
|
@subsection Command line options
|
||||||
@ -1503,7 +1498,7 @@ occurrence of a relative pronoun
|
|||||||
@c All predefined terms correspond to single segments,
|
@c All predefined terms correspond to single segments,
|
||||||
|
|
||||||
@example
|
@example
|
||||||
define(`verbseq', `(cat(V) (space cat(V)))')
|
define(`verbseq', `(cat(<V>) (space cat(<V>)))')
|
||||||
@end example
|
@end example
|
||||||
|
|
||||||
|
|
||||||
@ -1514,7 +1509,7 @@ the term @code{cat()} may not be used as a ... of
|
|||||||
@node ser limitations
|
@node ser limitations
|
||||||
@subsection Limitations
|
@subsection Limitations
|
||||||
|
|
||||||
more than 3 attributes in <>.
|
Do not use more than 3 attributes in <>.
|
||||||
|
|
||||||
@node ser requirements
|
@node ser requirements
|
||||||
@subsection Requirements
|
@subsection Requirements
|
||||||
@ -1532,8 +1527,8 @@ installed in the system:
|
|||||||
@end itemize
|
@end itemize
|
||||||
|
|
||||||
|
|
||||||
@c GRP
|
|
||||||
@c ---------------------------------------------------------------------
|
@c ---------------------------------------------------------------------
|
||||||
|
@c GRP
|
||||||
@c ---------------------------------------------------------------------
|
@c ---------------------------------------------------------------------
|
||||||
|
|
||||||
@page
|
@page
|
||||||
@ -1543,9 +1538,23 @@ installed in the system:
|
|||||||
@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
|
@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
|
||||||
@item @strong{Authors:} @tab Tomasz Obrębski
|
@item @strong{Authors:} @tab Tomasz Obrębski
|
||||||
@item @strong{Component category:} @tab filter
|
@item @strong{Component category:} @tab filter
|
||||||
|
@item @strong{Input format:} @tab UTT flattened
|
||||||
|
@item @strong{Output format:} @tab UTT flattened
|
||||||
|
@item @strong{Required annotation:} @tab tok, sen, lem --one-field
|
||||||
@end multitable
|
@end multitable
|
||||||
|
|
||||||
|
|
||||||
|
@menu
|
||||||
|
* grp description::
|
||||||
|
* grp command line options::
|
||||||
|
* grp pattern::
|
||||||
|
* grp hints::
|
||||||
|
@end menu
|
||||||
|
|
||||||
|
|
||||||
|
@node grp description
|
||||||
|
@subsection Description
|
||||||
|
|
||||||
@code{grp} selects sentences containing an expression matching a
|
@code{grp} selects sentences containing an expression matching a
|
||||||
pattern. The pattern format is exactly the same as that accepted by
|
pattern. The pattern format is exactly the same as that accepted by
|
||||||
@code{ser}.
|
@code{ser}.
|
||||||
@ -1554,22 +1563,6 @@ pattern. The pattern format is exactly the same as that accepted by
|
|||||||
It is extremely fast (processing speed is usually higher than the speed
|
It is extremely fast (processing speed is usually higher than the speed
|
||||||
of reading the corpus file from disk).
|
of reading the corpus file from disk).
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
@c @menu
|
|
||||||
@c * ser command line options::
|
|
||||||
@c * ser pattern::
|
|
||||||
@c * ser how ser works::
|
|
||||||
@c * ser customization::
|
|
||||||
@c * ser limitations::
|
|
||||||
@c * ser requirements::
|
|
||||||
@c @end menu
|
|
||||||
@menu
|
|
||||||
* grp command line options::
|
|
||||||
* grp pattern::
|
|
||||||
* grp hints::
|
|
||||||
@end menu
|
|
||||||
|
|
||||||
@node grp command line options
|
@node grp command line options
|
||||||
@subsection Command line options
|
@subsection Command line options
|
||||||
|
|
||||||
@ -1577,10 +1570,6 @@ of reading the corpus file from disk).
|
|||||||
|
|
||||||
@parhelp
|
@parhelp
|
||||||
@parversion
|
@parversion
|
||||||
@c @parfile
|
|
||||||
@c @paroutput
|
|
||||||
@c @parinputfield
|
|
||||||
@c @paroutputfield
|
|
||||||
@parprocess
|
@parprocess
|
||||||
@parinteractive
|
@parinteractive
|
||||||
|
|
||||||
@ -1626,24 +1615,51 @@ lzop -cd corpus.grp.lzo | grp -a gP -e @var{EXPR} | ser -e @var{EXPR}
|
|||||||
@end example
|
@end example
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
@c ---------------------------------------------------------------------
|
@c ---------------------------------------------------------------------
|
||||||
@c kot
|
@c MAR
|
||||||
@c ---------------------------------------------------------------------
|
@c ---------------------------------------------------------------------
|
||||||
|
|
||||||
|
@page
|
||||||
|
@node mar
|
||||||
|
@section mar
|
||||||
|
|
||||||
|
@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
|
||||||
|
@item @strong{Authors:} @tab Marcin Walas, Tomasz Obrębski
|
||||||
|
@item @strong{Component category:} @tab filter
|
||||||
|
@end multitable
|
||||||
|
|
||||||
|
[TODO]
|
||||||
|
|
||||||
@c ---------------------------------------------------------------------
|
@c ---------------------------------------------------------------------
|
||||||
|
@c KOT
|
||||||
|
@c ---------------------------------------------------------------------
|
||||||
|
|
||||||
|
|
||||||
@page
|
@page
|
||||||
@node kot
|
@node kot
|
||||||
@section kot - untokenizer
|
@section kot - untokenizer
|
||||||
|
|
||||||
Authors: Tomasz Obrębski
|
@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
|
||||||
|
@item @strong{Authors:} @tab Tomasz Obrębski
|
||||||
|
@item @strong{Component category:} @tab filter
|
||||||
|
@item @strong{Input format:} @tab UTT regular
|
||||||
|
@item @strong{Output format:} @tab text
|
||||||
|
@item @strong{Required annotation:} @tab tok
|
||||||
|
@end multitable
|
||||||
|
|
||||||
@command{kot} is the opposite of @command{tok}. It changes UTT-formatted text into plain text.
|
|
||||||
|
|
||||||
@menu
|
@menu
|
||||||
|
* kot description::
|
||||||
* kot command line options::
|
* kot command line options::
|
||||||
* kot usage examples::
|
* kot usage examples::
|
||||||
@end menu
|
@end menu
|
||||||
|
|
||||||
|
@node kot description
|
||||||
|
@subsection Description
|
||||||
|
|
||||||
|
@command{kot} transforms a UTT-formatted file back into raw text format.
|
||||||
|
|
||||||
@node kot command line options
|
@node kot command line options
|
||||||
@subsection Command line options
|
@subsection Command line options
|
||||||
|
|
||||||
@ -1683,28 +1699,38 @@ cat legia.txt | tok | kot
|
|||||||
cat legia.txt | tok | lem -1 | kot
|
cat legia.txt | tok | lem -1 | kot
|
||||||
@end example
|
@end example
|
||||||
|
|
||||||
@c CON............................................................
|
@c ---------------------------------------------------------------
|
||||||
@c ...............................................................
|
@c CON
|
||||||
@c ...............................................................
|
@c ---------------------------------------------------------------
|
||||||
|
|
||||||
|
|
||||||
@page
|
@page
|
||||||
@node con
|
@node con
|
||||||
@section con - concordance table generator
|
@section con - concordance table generator
|
||||||
|
|
||||||
@command{con} generates a concordance table based on a pattern given to @command{ser}.
|
|
||||||
|
|
||||||
@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
|
@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
|
||||||
@item @strong{Authors:} @tab Justyna Walkowska
|
@item @strong{Authors:} @tab Justyna Walkowska
|
||||||
@item @strong{Component category:} @tab sink
|
@item @strong{Component category:} @tab sink
|
||||||
|
@item @strong{Input format:} @tab UTT regular
|
||||||
|
@item @strong{Output format:} @tab text
|
||||||
|
@item @strong{Required annotation:} @tab ser or mar
|
||||||
@end multitable
|
@end multitable
|
||||||
@c
|
@c
|
||||||
|
|
||||||
@menu
|
@menu
|
||||||
|
* con description::
|
||||||
* con command line options::
|
* con command line options::
|
||||||
* con usage example::
|
* con usage example::
|
||||||
* con hints::
|
* con hints::
|
||||||
@end menu
|
@end menu
|
||||||
|
|
||||||
|
|
||||||
|
@node con description
|
||||||
|
@subsection Description
|
||||||
|
|
||||||
|
@command{con} generates a concordance table based on a pattern given to @command{ser}.
|
||||||
|
|
||||||
|
|
||||||
@node con command line options
|
@node con command line options
|
||||||
@subsection Command line options
|
@subsection Command line options
|
||||||
|
|
||||||
@ -1757,9 +1783,9 @@ cat legia.txt | tok | lem -1 | kot
|
|||||||
Left column minimal width in characters (default = 0).
|
Left column minimal width in characters (default = 0).
|
||||||
@item @b{@minus{}@minus{}ignore @minus{}i}
|
@item @b{@minus{}@minus{}ignore @minus{}i}
|
||||||
Ignore segment inconsistency in the input.
|
Ignore segment inconsistency in the input.
|
||||||
@item @b{@minus{}@minus{}bon}
|
@item @b{@minus{}@minus{}bom}
|
||||||
Beginning of selected segment (regex, default='[0-9]+ [0-9]+ BOM .*').
|
Beginning of selected segment (regex, default='[0-9]+ [0-9]+ BOM .*').
|
||||||
@item @b{@minus{}@minus{}eob}
|
@item @b{@minus{}@minus{}eom}
|
||||||
End of selected segment (regex, default='[0-9]+ [0-9]+ EOM .*').
|
End of selected segment (regex, default='[0-9]+ [0-9]+ EOM .*').
|
||||||
@item @b{@minus{}@minus{}bod}
|
@item @b{@minus{}@minus{}bod}
|
||||||
Selected segment beginning display string (default='[').
|
Selected segment beginning display string (default='[').
|
||||||
@ -1773,7 +1799,7 @@ cat legia.txt | tok | lem -1 | kot
|
|||||||
@node con usage example
|
@node con usage example
|
||||||
@subsection Usage example
|
@subsection Usage example
|
||||||
@example
|
@example
|
||||||
cat file.txt | tok | lem -1 | ser -e 'lexeme(dom) | con'
|
cat file.txt | tok | lem -1 | ser -e 'lexeme(dom)' | con
|
||||||
@end example
|
@end example
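The default patterns documented for the beginning-of-match and end-of-match markers can be tried out with a minimal Python sketch. The two regular expressions are copied from the option defaults. The function name, the sample segment lines, and the assumption that a pattern must match the whole segment line are illustrative only and do not describe how @command{con} applies them internally.

```python
import re

# Default patterns as documented for the BOM and EOM marker options.
BOM = re.compile(r"[0-9]+ [0-9]+ BOM .*")
EOM = re.compile(r"[0-9]+ [0-9]+ EOM .*")


def classify(segment):
    """Return 'BOM', 'EOM', or None for a single UTT segment line."""
    if BOM.fullmatch(segment):
        return "BOM"
    if EOM.fullmatch(segment):
        return "EOM"
    return None
```

Segments classified as BOM and EOM delimit the selected span, which the display-string options then render in the concordance output.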
|
||||||
|
|
||||||
|
|
||||||
@ -1789,7 +1815,6 @@ sequence:
|
|||||||
@end example
|
@end example
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
@c ---------------------------------------------------------------------
|
@c ---------------------------------------------------------------------
|
||||||
@c ---------------------------------------------------------------------
|
@c ---------------------------------------------------------------------
|
||||||
|
|
||||||
|