\input texinfo   @c -*-texinfo-*-
@documentencoding ISO-8859-2
@c @documentlanguage pl
@c %**start of header
@setfilename utt.info
@settitle UAM Text Tools v0.90
@c %**end of header

@copying
This manual is for UAM Text Tools (version 0.90, November, 2007)

Copyright @copyright{} 2005, 2007 Tomasz Obrębski, Michał Stolarski, Justyna Walkowska, Paweł Konieczka.

Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License, Version 1.2 or
any later version published by the Free Software Foundation; with no
Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A
copy of the license is included in the section entitled ``GNU Free
Documentation License''.

@c @quotation
@c Permission is granted to ...
@c No permission is granted until the document is completed.
@c @end quotation
@end copying

@titlepage
@title UAM Text Tools 0.90 - User Manual
@subtitle edition 0.01, @today
@subtitle status: prescript
@author by Justyna Walkowska, Tomasz Obr@ogonek{e}bski and Micha@l{} Stolarski
@page
@vskip 0pt plus 1filll
@insertcopying
@end titlepage

@contents

@c @paragraphindent none

@iftex
@parskip = 0.5@normalbaselineskip plus 3pt minus 1pt
@end iftex

@c @headings off
@c @everyheading LEM(1) @| @| LEM(1)
@everyfooting @today @c @| @thispage @|

@ifnottex
@node Top
@top UTT - UAM Text Tools

@insertcopying

@menu
* General information::
* UTT file format::
* Configuration files::
* UTT components::
* Auxiliary tools::
* Usage examples::
* PMDBF dictionary::
@c * Examples::
@c * Copyright::
* GNU Free Documentation License::
* Reporting bugs::
* Author::
@end menu
@end ifnottex

@c ----------------------------------------------------------------------
@node General information
@chapter General information

UAM Text Tools (UTT) is a package of language processing tools developed at Adam Mickiewicz University.
Its functionality includes:

@itemize @bullet
@item tokenization
@item dictionary-based morphological analysis
@item heuristic morphological analysis of unknown words
@item spelling correction
@item pattern search
@item sentence splitting
@item generation of concordance tables
@end itemize

The toolkit is intended for processing raw (unannotated), unrestricted
text for any purpose.

The system is organized as a collection of command-line programs, each
performing one operation, e.g. tokenization, lemmatization, or spelling
correction. The components are independent of one another; the
unifying element is the uniform I/O file format. The components
may be combined in various ways to provide various text processing
services. New components supplied by the user may also be easily
incorporated into the system, provided that they respect the I/O file
format conventions.

UTT component programs do not depend on any specific tagset or
morphological description format.

UTT is free software: you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free
Software Foundation, either version 3 of the License, or (at your
option) any later version.

The Polex/PMDBF dictionary is licensed under the Creative Commons
by-nc-sa License, which prohibits commercial use.

List of contributors:
@itemize
@item Pawel Konieczka
@item Tomasz Obrebski
@item Michal Stolarski
@item Marcin Walas
@item Justyna Walkowska
@end itemize

@c ----------------------------------------------------------------------
@c ---------------------------------------------------------------------
@node UTT file format
@chapter UTT file format

A UTT file contains annotation of a text. It consists of a sequence of
segments. Each segment explicitly refers to a continuous piece of the
text and provides some information on it.
@section Segment format

A segment occupies one line of a UTT file and consists of space-separated fields:

@quotation
@sp 1
[@var{start} [@var{length}]] @var{type} @var{form} [@var{annotation1} [@var{annotation2} ...]]
@sp 1
@end quotation

@table @var

@item @var{start}
Non-negative integer value indicating the position in the source text
where the segment starts.

@item @var{length}
Non-negative integer value indicating the length of the segment.

@item @var{type}
A sequence of non-blank characters (without spaces or digits, which
could lead to @var{type} being misinterpreted as a @var{start} or
@var{length} field). @var{type} reflects the main classification of
segments into words, numbers, punctuation marks, and meta-text
markers. @xref{tok output,,tok output}, for a description of the
automatically recognized type markers.

@item @var{form}
This field contains the textual form of the segment or the special
symbol @code{*} indicating that the form is not given (e.g. when the
segment has been created artificially to mark something and is of
length 0). The characters and character sequences that have special
meaning in the @var{form} field are enumerated below.

Characters with special meaning:
@itemize
@item @code{_} - space character
@item @code{*} - undefined contents
@end itemize

Escape sequences:
@itemize
@item @code{\n} - new line
@item @code{\t} - tabulation
@item @code{\r} - carriage return
@item @code{\_} - the @code{_} character
@item @code{\*} - the @code{*} character
@item @code{\\} - the @code{\} character
@c @item @code{\hh} - a character with hexadecimal code @code{hh} (used for non-printable characters)
@end itemize

@item @var{annotation1}
@itemx @var{annotation2}
@itemx ...
Annotation fields have the following format:

@var{longname} @code{:} @var{value}

or

@var{shortname} @var{value}

where @var{longname} is a string of alphanumeric characters
(@code{isalnum()} test), @var{shortname} is a single non-alphanumeric
character (@code{ispunct()} test), and @var{value} is an arbitrary
string of non-blank characters.

@end table

Only two fields are mandatory: @var{type} and @var{form}. All other
fields may be absent. When only one number precedes the @var{type}
field, it is interpreted as the @var{start} position. If the
@var{length} field is omitted, the length of the segment is the length
of the @var{form} field, except when the value of the @var{form} field
is @code{*} -- in this case, the length is assumed to be 0. If the
@var{start} field is also absent, the segment is assumed to directly
follow the preceding one.

@c Conventions:
@c Annotation fields with predefined meaning:
@c @itemize
@c @item @code{!} - UTT components are allowed to modify the contents of
@c the @var{form} field (e.g. spelling correction does this). If this happens the
@c original form of the segment has to be placed in the @code{!}-field.
@c @item @code{@@} - morphological description
@c @item @code{=} - node identifier assignment (used in graph encoding)
@c @item @code{<} - preceding/dominating node(s) (used in graph encoding)
@c @item @code{>} - succeeding/subordinate node(s) (used in graph encoding)
@c @end itemize

Segments of length 0 may be used to mark file positions with some
information. See e.g. the BOS and EOS (beginning/end of sentence)
markers in the example below.

Example:

sentence: @samp{Piszemy dobre progrumy.}

@example
0000 00 BOS *
0000 07 W Piszemy lem:pisać,V
0007 01 S _
0008 05 W dobre lem:dobry,ADJ
0013 01 S _
0014 08 W progrumy cor:programy lem:program,N
0022 01 P .
0023 00 EOS *
0023 01 S _
0024 00 BOS *
0024 11 W Warszawiacy lem:Warszawiak,N
0035 01 S _
0036 03 W też
0039 01 P .
0040 00 EOS *
@end example

The first sentence again, with the @var{length} field omitted:

@example
0000 BOS *
0000 W Piszemy lem:pisać,V
0007 S _
0008 W dobre lem:dobry,ADJ
0013 S _
0014 W progrumy cor:programy lem:program,N
0022 P .
0023 EOS *
@end example

Position information may be provided for only some of the segments:

@example
0000 BOS *
W Piszemy lem:pisać,V
S _
W dobre lem:dobry,ADJ
S _
W progrumy cor:programy lem:program,N
P .
EOS *
S _
0024 BOS *
W Warszawiacy lem:Warszawiak,N
S _
W też
P .
EOS *
@end example

Position/length information may be provided only when necessary:

@example
0000 04 N *
0000 N 12
P .
N 5
S _
W km
@end example

@section UTT File

A UTT file consists of a sequence of segments. The same text position
may be covered by multiple segments. In consequence, ambiguous text
segmentation and ambiguous annotation may be represented.

There are two structural requirements a valid UTT-formatted file has to meet:

@itemize @bullet
@item
segments have to be sorted with respect to the @var{start} field,
@item
for each segment ending at position @var{n}, either there must be a
segment starting at position @var{n+1}, or position @var{n+1} is not
covered by any segment; similarly, for each segment starting at
position @var{n}, either there must be a segment ending at position
@var{n-1}, or position @var{n-1} must not be covered by any segment.
@end itemize

A valid annotation for the text fragment

@example
12.5 km
@end example

may be

@example
0000 02 N 12
0000 04 N 12.5
0002 01 P .
0003 01 N 5
0004 01 S _
0005 02 W km
@end example

but not

@example
0000 02 N 12
0000 04 N 12.5
0004 01 S _
0005 02 W km
@end example

because in the latter example the first segment (starting at position
0000, 2 characters long) ends at position @var{n}=0001, the next
position @var{n+1}=0002 is covered by the second segment, and no
segment starts at position @var{n+1}=0002.

@section Character encoding

The UTT component programs accept only 1-byte character encodings,
such as ISO, ANSI, and DOS code pages. UTF-8 probably works as well,
but this has not been tested yet.
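
The conventions of this chapter can be summarized in a short Python
sketch. It is an illustration only and not part of UTT: the function
names @code{parse_segment}, @code{parse_lines} and @code{is_valid} are
hypothetical, an escape sequence is assumed to count as one character
when the length is derived from the @var{form} field, and zero-length
segments are assumed not to cover any position.

@verbatim
```python
import re

NUMBER = re.compile(r"[0-9]+")


def parse_segment(line, prev_end=0):
    """Split one UTT line into (start, length, type, form, annotations)."""
    tokens = line.split()
    numbers = []
    # At most two leading all-digit tokens are read as start and length.
    while tokens and len(numbers) < 2 and NUMBER.fullmatch(tokens[0]):
        numbers.append(int(tokens.pop(0)))
    if len(tokens) < 2:
        raise ValueError("type and form are mandatory: %r" % line)
    seg_type, form, annotations = tokens[0], tokens[1], tokens[2:]
    # One number is the start; with no number the segment directly
    # follows the preceding one.
    start = numbers[0] if numbers else prev_end
    if len(numbers) == 2:
        length = numbers[1]
    elif form == "*":
        length = 0
    else:
        # Assumption: an escape sequence such as \n counts as one character.
        length = len(re.sub(r"\\.", "x", form))
    return start, length, seg_type, form, annotations


def parse_lines(lines):
    """Parse consecutive lines, tracking where the previous segment ended."""
    segments, prev_end = [], 0
    for line in lines:
        seg = parse_segment(line, prev_end)
        prev_end = seg[0] + seg[1]
        segments.append(seg)
    return segments


def is_valid(segments):
    """Check the two structural requirements on a list of parsed segments."""
    starts_in_order = [s for s, *_ in segments]
    if starts_in_order != sorted(starts_in_order):
        return False
    # Inclusive (first, last) positions of all non-empty segments.
    spans = [(s, s + n - 1) for s, n, *_ in segments if n > 0]
    starts = set(s for s, _ in spans)
    ends = set(e for _, e in spans)

    def covered(pos):
        return any(s <= pos <= e for s, e in spans)

    for s, e in spans:
        if covered(e + 1) and (e + 1) not in starts:
            return False
        if covered(s - 1) and (s - 1) not in ends:
            return False
    return True


good = ["0000 02 N 12", "0000 04 N 12.5", "0002 01 P .",
        "0003 01 N 5", "0004 01 S _", "0005 02 W km"]
bad = ["0000 02 N 12", "0000 04 N 12.5", "0004 01 S _", "0005 02 W km"]
print(is_valid(parse_lines(good)), is_valid(parse_lines(bad)))
```
@end verbatim

Applied to the two annotations of @samp{12.5 km} shown in the previous
section, the sketch accepts the first one and rejects the second.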
@c @section Formats
@c @unnumberedsubsubsec Basic format
@c While processing large amounts of the overhead related with explicit
@c ... of the start position and segment length becomes ... . Therefore,
@c for efficiency reasons certain shortcuts are possible:
@c @unnumberedsubsubsec Relative start position
@c Start position may be given as relative distance from the last
@c absolute position.
@c @unnumberedsubsubsec Absent length
@c Segment length may be omitted. Normally it can be restored by counting
@c the length of the @emph{form field}. For segments with the special value
@c @code{*} in the @emph{form field} length 0 is assumed.
@c @unnumberedsubsubsec Absent length and start position
@c Both start position and segment length may be omitted. In this format
@c each segment is assumed to follow the previous one. This format is,
@c therefore, suitable only for unambiguously tagged text
@c (0-length markers can be still used.)
@c @table @code
@c @item AL
@c @code{1234 03 W kot}
@c @item RL
@c @code{+56 03 W kot}
@c @item A
@c @code{1234 W kot}
@c @item R
@c @code{+56 W kot}
@c @item 0
@c @code{W kot}
@c @end table
@c [HOW TO GET POLISH FONTS IN DVI???]

@macro parhelp
@item @b{@minus{}@minus{}help}, @b{@minus{}h}
Print help.
@end macro

@macro parversion
@item @b{@minus{}@minus{}version}, @b{@minus{}V}
Print version information.
@end macro

@macro parinteractive
@item @b{@minus{}@minus{}interactive, @minus{}i}
This option toggles interactive mode, which is off by default. In
interactive mode the program does not buffer the output.
@end macro

@c @macro parfile
@c @item @b{@minus{}@minus{}file=@var{filename}, @minus{}f @var{filename}}
@c Input file name.
@c If this option is absent or equal to '@minus{}', the program
@c reads from the standard input.
@c @end macro
@c @macro paroutput
@c @item @b{@minus{}@minus{}output=@var{filename}, @minus{}o @var{filename}}
@c Regular output file name.
@c To regular output the program sends segments
@c which it successfully processed and copies those which were not
@c subject to processing. If this option is absent or equal to
@c '@minus{}', standard output is used.
@c @end macro
@c @macro parfail
@c @item @b{@minus{}@minus{}fail=@var{filename}, @minus{}e @var{filename}}
@c Fail output file name. To fail output the program copies the segments
@c it failed to process. If this option is absent or equal to
@c '@minus{}', standard output is used.
@c @end macro
@c @macro parcopy
@c @item @b{@minus{}@minus{}copy, @minus{}c}
@c Copy successfully processed segments to regular output also in their
@c original input form.
@c @end macro

@macro parinputfield
@item @b{@minus{}@minus{}input-field=@var{fieldname}, @minus{}I @var{fieldname}}
The field containing the input to the program. The default is the
@var{form} field. The fields @var{position}, @var{length}, @var{type},
and @var{form} are referred to as @code{1}, @code{2}, @code{3},
@code{4}, respectively.
@end macro

@macro paroutputfield
@item @b{@minus{}@minus{}output-field=@var{fieldname}, @minus{}O @var{fieldname}}
The name of the field added by the program. The default is the name of
the program.
@end macro

@macro pardictionary
@item @b{@minus{}@minus{}dictionary=@var{filename}, @minus{}d @var{filename}}
Dictionary file name.
@end macro

@macro parprocess
@item @b{@minus{}@minus{}process=@var{type}, @minus{}p @var{type}}
Process segments with the specified value in the @var{type} field.
Multiple occurrences of this option are allowed and are interpreted as
a disjunction. If this option is absent, all segments are processed.
@end macro

@macro parselect
@item @b{@minus{}@minus{}select=@var{fieldname}, @minus{}s @var{fieldname}}
Select for processing only segments in which the field named
@var{fieldname} is present. Multiple occurrences of this option are
allowed and are interpreted as a conjunction of conditions. If this
option is absent, all segments are processed.
@end macro

@macro parunselect
@item @b{@minus{}@minus{}unselect=@var{fieldname}, @minus{}S @var{fieldname}}
Select for processing only segments in which the field @var{fieldname}
is absent. Multiple occurrences of this option are allowed and are
interpreted as a conjunction of conditions. If this option is absent,
all segments are processed.
@end macro

@macro paroneline
@item @b{@minus{}@minus{}one-line}
This option makes the program print ambiguous annotation in one output
line by generating multiple annotation fields. By default, when
ambiguous annotation may be produced for a segment, the segment is
duplicated and each of the annotations is added to a separate copy of
the segment.
@end macro

@macro paronefield
@item @b{@minus{}@minus{}one-field, @minus{}1}
This option makes the program print ambiguous annotation in one
annotation field. By default, when ambiguous annotation may be
produced for a segment, the segment is duplicated and each of the
annotations is added to a separate copy of the segment. This option is
useful when working with @command{kot} or @command{con}.
@end macro

@c ---------------------------------------------------------------------
@c ---------------------------------------------------------------------
@c @node Common command line options
@c @chapter Common command line options
@c @table @code
@c @parhelp
@c @item @b{@minus{}@minus{}help}, @b{@minus{}h}
@c Print help.
@c @item @b{@minus{}@minus{}version}, @b{@minus{}v}
@c Print version information.
@c @item @b{@minus{}@minus{}file=@var{filename}, @minus{}f @var{filename}}
@c Input file name.
@c If this option is absent or equal to '@minus{}', the program
@c reads from the standard input.
@c @item @b{@minus{}@minus{}output=@var{filename}, @minus{}o @var{filename}}
@c Regular output file name. To regular output the program sends segments
@c which it successfully processed and copies those which were not
@c subject to processing.
@c If this option is absent or equal to
@c '@minus{}', standard output is used.
@c @item @b{@minus{}@minus{}fail=@var{filename}, @minus{}e @var{filename}}
@c Fail output file name. To fail output the program copies the segments
@c it failed to process. If this option is absent or equal to
@c '@minus{}', standard output is used.
@c @item @b{@minus{}@minus{}only-fail}
@c Discard segments which would normally be sent to regular
@c output. Print only segments the program failed to process.
@c @item @b{@minus{}@minus{}no-fail}
@c Discard segments the program failed to process.
@c (This and the previous option are functionally equivalent to,
@c respectively, @option{-o /dev/null} and @option{-e /dev/null}, but
@c make the programs run faster.)
@c @item @b{@minus{}@minus{}input-field=@var{fieldname}, @minus{}I @var{fieldname}}
@c The field containing the input to the program. The default is usually
@c the @var{form} field (unless otherwise stated in the program
@c description). The fields @var{position}, @var{length}, @var{tag}, and
@c @var{form} are referred to as @code{1}, @code{2}, @code{3}, @code{4},
@c respectively.
@c @item @b{@minus{}@minus{}output-field=@var{fieldname}, @minus{}O @var{fieldname}}
@c The name of the field added by the program. The default is the name of
@c the program.
@c @c @item @b{@minus{}@minus{}copy, @minus{}c}
@c @c Copy processed segments to regular output.
@c @item @b{@minus{}@minus{}dictionary=@var{filename}, @minus{}d @var{filename}}
@c Dictionary file name.
@c (This option is used by programs which use dictionary data.)
@c @item @b{@minus{}@minus{}process=@var{tag}, @minus{}p @var{tag}}
@c Process segments with the specified value in the @var{tag} field.
@c Multiple occurrences of this option are allowed and are interpreted as
@c disjunction. If this option is absent, all segments are processed.
@c @item @b{@minus{}@minus{}select=@var{fieldname}, @minus{}s @var{fieldname}}
@c Select for processing only segments in which the field named
@c @var{fieldname} is present. Multiple occurrences of this option are
@c allowed and are interpreted as conjunction of conditions. If this
@c option is absent, all segments are processed.
@c @item @b{@minus{}@minus{}unselect=@var{fieldname}, @minus{}S @var{fieldname}}
@c Select for processing only segments in which the field @var{fieldname}
@c is absent. Multiple occurrences of this option are allowed and are
@c interpreted as conjunction of conditions. If this option is absent,
@c all segments are processed.
@c @item @b{@minus{}@minus{}interactive @minus{}i}
@c This option toggles interactive mode, which is by default off. In the
@c interactive mode the program does not buffer the output.
@c @item @b{@minus{}@minus{}config=@var{filename}}
@c Read configuration from file @file{@var{filename}}.
@c @item @b{@minus{}@minus{}one @minus{}1}
@c This option makes the program print ambiguous annotation in one output
@c segment. By default when
@c ambiguous new annotation is being produced for a segment, the segment
@c is multiplicated and each of the annotations is added to separate copy
@c of the segment.
@c @end table

@c ---------------------------------------------------------------------
@c CONFIGURATION FILES
@c ---------------------------------------------------------------------
@node Configuration files
@chapter Configuration files

Values for all command line options accepted by a component may be set
in configuration files. The default locations of the configuration
files for a component named @command{@var{program}} are
@example
@file{/usr/local/etc/utt/@var{program}.conf}
@end example
for the system-wide configuration file and
@example
@file{~/.utt/@var{program}.conf}
@end example
for the user configuration file.

@c The configuration file to load may be also specified with the
@c @option{--config} option.
A configuration file need not be provided.

For each option, the value is set according to the following priority:

@itemize
@item command line
@c @item configuration file indicated with @option{--config} option
@item user configuration file (or configuration file indicated with the @option{--config} option)
@item system-wide configuration file
@end itemize

Parameter values are specified in the following format:

@var{parametername}=@var{value}

where @var{parametername} is the short or long name of an option
accepted by the program. For an option that takes no argument, write
just @var{parametername}.

Comments in configuration files are introduced with the @code{#} sign.

If a program accepts multiple occurrences of an option (e.g. the
select option of @command{lem}), you can specify them on separate lines
of the program's configuration file.

@c The equal sign may be omitted.

@quotation Tip
If you have two (or more) frequently used sets of options for the same
program (e.g. @command{lem} with the PMDBF dictionary and @command{lem}
with a user dictionary), a good solution is to create two soft links to
@command{lem}, called e.g. @file{lemg} and @file{lemu}, and specify
their configuration in the files @file{lemg.conf} and @file{lemu.conf},
respectively.
@end quotation

@c ---------------------------------------------------------------------
@c COMPONENTS
@c ---------------------------------------------------------------------
@node UTT components
@chapter UTT components

UTT components are of three types:

@menu
Sources: programs which read non-UTT data (e.g.
raw text) and produce output in UTT format

* tok::  a tokenizer

Filters: programs which read and produce UTT-formatted data

@c * sen - the sentencizer::
* lem::  a morphological analyzer
* gue::  a morphological guesser
* cor::  a spelling corrector
* sen::  a sentencizer
@c * gph - the graphizer::
* ser::  a pattern search tool (marks matches)
* grp::  a pattern search tool (selects sentences containing a match)

Sinks: programs which read UTT data and produce output in another format

* kot::  an untokenizer
* con::  a concordance table generator
@end menu

@c ---------------------------------------------------------------------
@c TOK
@c ---------------------------------------------------------------------
@page
@node tok
@section tok - a tokenizer
@c ----------------------------------------

@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
@item @strong{Authors:} @tab Tomasz Obrębski
@item @strong{Component category:} @tab source
@end multitable

@menu
* tok description::
* tok input::
* tok output::
* tok command line options::
* tok example::
@end menu

@node tok description
@subsection Description

@code{tok} is a simple program which reads a text file and identifies
tokens on the basis of their orthographic form. The type of the token
is printed as the @var{type} field.

@node tok input
@subsection Input

Raw text.

@node tok output
@subsection Output

UTT file with four fields: @var{start}, @var{length}, @var{type}, and @var{form}.
In the @var{type} field five types of tokens are distinguished:

@itemize
@item @code{W} (word) - continuous sequence of letters
@item @code{N} (number) - continuous sequence of digits
@item @code{S} (space) - continuous sequence of space characters
@item @code{P} (punctuation mark) - single printable character not belonging to any of the other classes
@item @code{B} (unprintable character) - single unprintable character
@end itemize

@node tok command line options
@subsection Command line options

@table @code

@item @b{@minus{}@minus{}help}, @b{@minus{}h}
Print help.

@item @b{@minus{}@minus{}version}, @b{@minus{}V}
Print version information.

@item @b{@minus{}@minus{}interactive, @minus{}i}
This option toggles interactive mode, which is off by default. In
interactive mode the program does not buffer the output.

@end table

@node tok example
@subsection Example

Input:

@example
Piszemy dobre programy.
@end example

Output:

@example
0000 07 W Piszemy
0007 01 S _
0008 05 W dobre
0013 01 S _
0014 08 W programy
0022 01 P .
0023 01 S \n
@end example

@c ---------------------------------------------------------------------
@c SEN
@c ---------------------------------------------------------------------
@c @node sen - sentencizer
@c @chapter sen - sentencizer
@c Authors: Tomasz Obrębski

@c ---------------------------------------------------------------------
@c LEM
@c ---------------------------------------------------------------------
@page
@node lem
@section lem - morphological analyzer

@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
@item @strong{Authors:} @tab Tomasz Obrębski, Michał Stolarski
@item @strong{Component category:} @tab filter
@end multitable

@menu
* lem description::
* lem command line options::
* lem input::
* lem output::
* lem example::
* lem dictionaries::
* lem hints::
@end menu

@node lem description
@subsection Description

@command{lem} performs morphological analysis of a simple orthographic
word, returning all of its possible morphological annotations without
regard to the context.

@c ----------------------------------------
@node lem command line options
@subsection Command line options

@table @code
@parhelp
@parversion
@parinteractive
@c @parfile
@c @paroutput
@c @parfail
@c @parcopy
@parinputfield
@paroutputfield
@pardictionary
@parprocess
@parselect
@parunselect
@paroneline
@paronefield
@end table

@c ----------------------------------------
@node lem input
@subsection Input

@command{lem} reads a UTT file and processes the value of the
@var{form} field (the input field may be changed with the
@option{--input-field} option).

@node lem output
@subsection Output

@command{lem} adds a new annotation field, whose default name is @code{lem}.
In case of ambiguity, either the segment is duplicated once per
annotation (default), multiple @code{lem} fields are added
(@option{--one-line}), or the ambiguous annotation is produced as the
value of a single @code{lem} field (@option{--one-field}, @option{-1}):

@itemize @bullet

@item unambiguous value format:
@example
<lemma>,<descr>
@end example

@item ambiguous value format (@option{--one-field} option):
@example
<lemma>,<descr>[,<descr>][;<lemma>,<descr>[,<descr>]]
@end example
(alternative descriptions for the same lemma are separated by commas,
alternative lemmata are separated by semicolons.)

@end itemize

@node lem example
@subsection Example

Input:

@example
0000 07 W Piszemy
0007 01 S _
0008 05 W dobre
0013 01 S _
0014 08 W programy
0022 01 P .
0023 01 S \n
@end example

Output (default):

@example
0000 07 W Piszemy lem:pisać,V/AiVpMdTrfNpP1
0007 01 S _
0008 05 W dobre lem:dobry,ADJ/DpNpCnavGaifn
0008 05 W dobre lem:dobry,ADJ/DpNsCnavGn
0013 01 S _
0014 08 W programy lem:program,N/GiNpCa
0014 08 W programy lem:program,N/GiNpCn
0014 08 W programy lem:program,N/GiNpCv
0022 01 P .
0023 01 S \n
@end example

Output (@option{--one-line} option):

@example
0000 07 W Piszemy lem:pisać,V/AiVpMdTrfNpP1
0007 01 S _
0008 05 W dobre lem:dobry,ADJ/DpNpCnavGaifn lem:dobry,ADJ/DpNsCnavGn
0013 01 S _
0014 08 W programy lem:program,N/GiNpCa lem:program,N/GiNpCn lem:program,N/GiNpCv
0022 01 P .
0023 01 S \n
@end example

Output (@option{--one-field} option):

@example
0000 07 W Piszemy lem:pisać,V/AiVpMdTrfNpP1
0007 01 S _
0008 05 W dobre lem:dobry,ADJ/DpNpCnavGaifn,ADJ/DpNsCnavGn
0013 01 S _
0014 08 W programy lem:program,N/GiNpCa,N/GiNpCn,N/GiNpCv
0022 01 P .
0023 01 S \n
@end example

@c ----------------------------------------
@node lem dictionaries
@subsection Dictionaries

@command{lem} requires a dictionary. The dictionary may be provided in
one of two formats: in text (source) format or in binary (fsa) format.
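
As an aside on the @option{--one-field} value format described in the
Output subsection, the following minimal Python sketch shows how a
consumer might split such a value into (lemma, description) pairs. The
function name @code{parse_one_field} is hypothetical and not part of
UTT, and the sketch assumes that neither lemmas nor descriptions
contain commas or semicolons.

@verbatim
```python
def parse_one_field(value):
    """Split '<lemma>,<descr>[,<descr>][;<lemma>,...]' into pairs."""
    readings = []
    # Alternative lemmata are separated by semicolons.
    for alternative in value.split(";"):
        # The first comma-separated item is the lemma, the rest are
        # alternative descriptions for that lemma.
        lemma, *descriptions = alternative.split(",")
        for descr in descriptions:
            readings.append((lemma, descr))
    return readings


for pair in parse_one_field("program,N/GiNpCa,N/GiNpCn,N/GiNpCv"):
    print(pair)
```
@end verbatim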
@subsubheading Text format

Dictionary entries have the following structure:

@example
<form>;<lemma>,<descr>[;<lemma>,<descr>]
@end example

@var{lemma} may be given explicitly or in the cut-add format:

@example
@code{[<cut1><add1>-]<cut2><add2>}
@end example

meaning: replace the prefix of length @code{<cut1>} with the string
@code{<add1>}, and replace the suffix of length @code{<cut2>} with the
string @code{<add2>}. For example, @code{3t} transforms @samp{kocie}
into @samp{kot}, and @code{3-4ały} transforms @samp{najbielsi} into
@samp{biały}.

Each dictionary entry must be written on one line and must not contain
blank characters.

Examples:

@example
kot;0,N/GaNsCn
kota;1,N/GaNsCg;1,N/GaNsCa
kotu;1,N/GaNsCd
kotem;2,N/GaNsCi
kocie;3t,N/GaNsCl;3t,N/GaNsCv
najbielsi;3-4ały,ADJ/DsNpCnGp
najbielsze;3-5ały,ADJ/DsNpCnGaifn
najlepsi;dobry,ADJ/DsNpCnGp
najlepsze;dobry,ADJ/DsNpCnGaifn
@end example

The mandatory file name extension for a text dictionary is @code{dic}.

For large dictionaries it is preferable, however, to compile them into
binary (fsa) format.

@subsubheading Binary format

The mandatory file name extension for a binary dictionary is @code{bin}.
To compile a text dictionary into binary format, write:

@example
compiledic <dictionaryname>.dic
@end example

@subsubheading Polex/PMDBF dictionary

A large-coverage morphological dictionary of Polish, Polex/PMDBF, is
included in the distribution as the default dictionary of
@command{lem}. By default it is located in:

@file{$HOME/.utt/pl/lem.bin}

@node lem hints
@subsection Hints

@c @subsubheading Combining data from multiple dictionaries
@c @itemize
@c @item Apply <dict1>, then apply <dict2> to words which were not annotated.
@c @example
@c lem -d <dict1> | lem -S lem -d <dict2>
@c @end example
@c @item Add annotations from two dictionaries <dict1> and <dict2>.
@c @example
@c lem -c -d <dict1> | lem -S lem -d <dict2>
@c @end example
@c @end itemize

@c ---------------------------------------------------------------------
@c GUE
@c ---------------------------------------------------------------------
@page
@node gue
@section gue - morphological guesser

@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
@item @strong{Authors:} @tab Michał Stolarski, Tomasz Obrębski
@item @strong{Component category:} @tab filter
@end multitable

@command{gue} guesses morphological descriptions of the form contained
in the @var{form} field.

@menu
* gue command line options::
* gue example::
* gue dictionaries::
@end menu

@node gue command line options
@subsection Command line options

@table @code
@parhelp
@parversion
@parinteractive
@c @parfile
@c @paroutput
@c @parfail
@c @parcopy
@parinputfield
@paroutputfield
@pardictionary
@parprocess
@parselect
@parunselect
@paroneline
@paronefield

@item @b{@minus{}@minus{}delta=@var{n}}
Stop displaying answers after a drop in weight, that is, when the
weight difference between two consecutive results is greater than the
delta value (default=`0.2').

@item @b{@minus{}@minus{}cut-off=@var{n}}
Do not display answers with a weight lower than the cut-off value
(default=`200').

@item @b{@minus{}@minus{}guess_count=@var{n}, @minus{}n @var{n}}
Guess up to @var{n} descriptions (default=`0', which means 'display all
results').

@end table

@node gue example
@subsection Example

@example
command: gue -n 2

input:
0000 07 W smerfny

output:
0000 07 W smerfny gue:,ADJ/CaDpGiNs
0000 07 W smerfny gue:,ADJ/CnvDpGaipNs
@end example

@node gue dictionaries
@subsection Dictionaries

@command{gue} requires a dictionary. For now, the dictionary must be
provided in binary (fsa) format. The fsa format is created by
compiling text-format dictionaries.
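
Lemmas in @command{gue} dictionaries use the cut-add notation already
introduced for @command{lem} text dictionaries. A minimal Python
sketch of applying such a code to a word form is given here for
illustration. The function name @code{apply_cut_add} is hypothetical
and not part of UTT; the sketch assumes that the cut lengths are
decimal integers, that the first @samp{-} after the prefix part
separates the two parts, and that a value not starting with a digit is
an explicit lemma (as allowed in @command{lem} dictionaries).

@verbatim
```python
import re

# [<cut1><add1>-]<cut2><add2>
CUT_ADD = re.compile(r"(?:(\d+)([^-]*)-)?(\d+)(.*)")


def apply_cut_add(form, code):
    """Return the lemma obtained by applying a cut-add code to a form."""
    match = CUT_ADD.fullmatch(code)
    if match is None:
        # Not in cut-add format: treat the value as an explicit lemma.
        return code
    cut1, add1, cut2, add2 = match.groups()
    if cut1 is not None:
        # Replace the prefix of length cut1 with add1.
        form = add1 + form[int(cut1):]
    # Replace the suffix of length cut2 with add2.
    return form[:len(form) - int(cut2)] + add2


print(apply_cut_add("kocie", "3t"))
print(apply_cut_add("najbielsi", "3-4ały"))
```
@end verbatim

With the example entries from the @command{lem} dictionary section,
this yields @samp{kot} for @samp{kocie} with code @code{3t} and
@samp{biały} for @samp{najbielsi} with code @code{3-4ały}.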
@subsubheading Text format

Dictionary entries have the following structure:

@example
@var{prefix}@code{*}@var{suffix}@code{;}@var{lemma}@code{,}@var{description}@code{:}@var{weight}
@end example

@var{lemma} must be given in the cut-add format:

@example
@code{[<cut1><add1>-]<cut2><add2>}
@end example

(no spaces in between): replace the prefix of length @var{cut1} with
the string @var{add1}, and replace the suffix of length @var{cut2}
with the string @var{add2}.

Example: @code{3-4ały} transforms @i{najbielsi} into @i{biały}.

@var{description} contains the part of speech and morphosyntactic
information (@pxref{PMDBF dictionary}).

@var{weight} is an integer value between 1 and 999 indicating the
likelihood of the guess.

@example
*łkę;1a,N/GfNsCa
naj*elszy;3-4ały,ADJ/...:...
@end example

@c ---------------------------------------------------------------------
@c COR
@c ---------------------------------------------------------------------
@page
@node cor
@section cor - spelling corrector

@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
@item @strong{Authors:} @tab Tomasz Obrębski, Michał Stolarski
@item @strong{Component category:} @tab filter
@end multitable

The spelling corrector applies Kemal Oflazer's dynamic programming
algorithm @cite{oflazer96} to the FSA representation of the set of word
forms of the Polex/PMDBF dictionary. Given an incorrect word form, it
returns all word forms present in the dictionary whose edit distance
from that form does not exceed the threshold given as a parameter.

By default @code{cor} replaces the contents of the @var{form} field
with the new corrected value, placing the old contents in the
@code{cor} field.
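
To illustrate what selecting candidates by edit distance means, here is
a naive Python sketch. It is not Oflazer's algorithm and does not use
an FSA: it simply computes the plain Levenshtein distance (insertions,
deletions, substitutions) against every word of a small made-up word
list and treats the threshold as a maximum. The function names
@code{edit_distance} and @code{corrections} and the word list are
hypothetical, and the distance measure actually used by @command{cor}
may differ.

@verbatim
```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def corrections(word, lexicon, max_distance=1):
    """Return lexicon words within max_distance edits of word."""
    return [w for w in lexicon if edit_distance(word, w) <= max_distance]


# Made-up word list, for illustration only.
lexicon = ["program", "programy", "odlot", "odlotowy", "odludek"]
print(corrections("progrumy", lexicon))
```
@end verbatim

A real implementation avoids scanning the whole word list; that is the
point of running the search over the FSA representation.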
@menu
* cor command line options::
* cor dictionaries::
@end menu

@node cor command line options
@subsection Command line options

@table @code
@parhelp
@parversion
@parinteractive
@c @parfile
@c @paroutput
@c @parfail
@c @parcopy
@parinputfield
@paroutputfield
@pardictionary
@parprocess
@parselect
@parunselect
@paroneline
@paronefield

@item @b{@minus{}@minus{}distance=@var{int}, @minus{}n @var{int}}
Maximum edit distance (default='1').

@end table

@node cor dictionaries
@subsection Dictionaries

@command{cor} requires a dictionary. The dictionary has to be provided
in binary (fsa) format. The fsa format is created by compiling
text-format dictionaries.

@subsubheading Text format

The @command{cor} dictionary is a list of words:

@example
odlot
odlotowy
odludek
@end example

@page
@node sen
@section sen - a sentencizer

@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
@item @strong{Authors:} @tab Tomasz Obrębski
@item @strong{Component category:} @tab filter
@end multitable

@command{sen} detects sentence boundaries in UTT-formatted texts and
marks them with special zero-length segments, in which the @var{type}
field contains either the BOS (beginning of sentence) or the EOS (end
of sentence) marker.

@menu
@c * sen input::
@c * sen output::
* sen example::
@end menu

@node sen example
@subsection Example

@example
command: sen

input:
0000 05 W Cześć
0005 01 P !
0006 01 S _
0007 02 W To
0009 01 S _
0010 02 W ja
0012 01 P .
0013 01 S \n

output:
0000 00 BOS *
0000 05 W Cześć
0005 01 P !
0006 00 EOS *
0006 00 BOS *
0006 01 S _
0007 02 W To
0009 01 S _
0010 02 W ja
0012 01 P .
0013 01 S \n
0014 00 EOS *
@end example

@c ---------------------------------------------------------------------
@c GPH
@c ---------------------------------------------------------------------
@c @node gph - graphizer
@c @chapter gph - graphizer
@c Authors: Tomasz Obrębski

@c SER
@c ---------------------------------------------------------------------
@c ---------------------------------------------------------------------

@page
@node ser
@section ser - pattern search tool

@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
@item @strong{Authors:} @tab Tomasz Obrębski
@item @strong{Component category:} @tab filter
@end multitable

@command{ser} looks for patterns in UTT-formatted texts.

@menu
* ser command line options::
* ser pattern::
* ser how ser works::
* ser customization::
* ser limitations::
* ser requirements::
@end menu

@c ---------------------------------------------------------------------

@node ser command line options
@subsection Command line options

@table @code

@parhelp
@parversion
@c @parfile
@c @paroutput
@c @parinputfield
@c @paroutputfield
@parprocess
@parinteractive

@item @b{@minus{}@minus{}pattern=@var{pattern}, @minus{}e @var{pattern}}
The search pattern.

@item @b{@minus{}@minus{}morph=@var{field}}
The name of the annotation field containing the morphological
description (default @code{lem}).

@item @b{@minus{}@minus{}flex}
Only print the generated flex source code.

@item @b{@minus{}@minus{}macro=@var{filename}}
Read macrodefinitions from file @var{filename} rather than from the
default location. This option makes it possible to redefine the set of
terms.

@item @b{@minus{}@minus{}define=@var{filename}}
Append macrodefinitions from file @var{filename}. This option makes it
possible to extend the set of terms.
@end table @c --------------------------------------------------------------------- @node ser pattern @subsection Pattern The @command{ser} pattern is a regular expression over terms corresponding to text segments or segment sequences. Predefined terms are: @table @code @item seg(@var{t},@var{f},@var{a}) a segment of type @var{t}, containing form @var{f} and annotation @var{a} @item form(@var{f}) a segment containing form @var{f} @item field(@var{f}) a segment containing annotation field @var{f} @item space(@var{f}) a space segment of form @var{f} @item word(@var{f}) a word segment of form @var{f} @item punct(@var{f}) a punct segment of form @var{f} @item number(@var{f}) a number segment of form @var{f} @item lexeme(@var{f}) a word segment with lemma @var{f} @item cat(@var{c}) a word segment of category @var{c} @end table All arguments are optional. If an argument is omitted, an arbitrary string of non-blank characters is assumed as the argument value. Term arguments may be arbitrary character-level regular expressions. 
The following special symbols can be used:

@multitable {aaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
@item @code{[@dots{}]} @tab a character class
@item @code{[^@dots{}]} @tab a negated character class
@item @code{|} @tab alternative
@item @code{*} @tab repetition, including zero times
@item @code{+} @tab repetition, at least one time
@item @code{?} @tab optionality
@item @code{@{@var{m},@var{n}@}} @tab repetition from @var{m} to @var{n} times
@item @code{@{@var{m},@}} @tab repetition @var{m} or more times
@item @code{@{@var{m}@}} @tab repetition @var{m} times
@item @code{\@var{ddd}} @tab the character with octal value @var{ddd}
@item @code{\x@var{hh}} @tab the character with hexadecimal value @var{hh}
@item @code{( )} @tab parentheses, used to override precedence
@c @end multitable
@c @multitable {aaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
@item @code{.} @tab a non-blank character
@item @code{\w} @tab a letter
@item @code{\W} @tab a non-blank character other than a letter
@item @code{\d} @tab a digit
@item @code{\D} @tab a non-blank character other than a digit
@item @code{\s} @tab a space or tab character
@item @code{\S} @tab a non-blank character (the same as @code{.})
@item @code{\l} @tab a lowercase letter
@item @code{\L} @tab an uppercase letter
@end multitable

@noindent
The following characters:

@example
@verb{% [ ] ^ | * + ? { } , . < > \ %}
@end example

must be escaped with a backslash, i.e. written as:

@example
@verb{% \[ \] \^ \| \* \+ \? \{ \} \, \. \< \> \\ %}
@end example

@quotation Note
The special symbols are borrowed from Perl with minor modifications.
The meaning of certain special characters and sequences differs
slightly from the common one; this is for convenience.

The meaning of the @code{.} special character is modified because of
the special function of spaces in UTT files (they are field separators).
Use @code{\s} to explicitly match a space or tab character.
@end quotation

In the argument of the @code{cat} term a special operator
@code{<...>} may be used. A category specification enclosed in angle
brackets matches all category descriptions which are consistent
(non-contradictory) with the specification. For example, @code{<N>}
matches all noun descriptions, and @code{<ADJ/Can>} matches all
adjectives in the accusative or nominative case.

@*
@noindent
@b{Examples of one-segment patterns:}

@multitable {aaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
@item @code{seg} @tab any segment
@item @code{word} @tab any word-form
@item @code{word(pomocy)} @tab the word-form @samp{pomocy}
@item @code{word(naj.+)} @tab a word-form beginning with @samp{naj}
@item @code{word(\L\l+)} @tab a capitalized word-form
@item @code{punct} @tab a punctuation character
@item @code{space(.*\\n.*)} @tab a space segment containing a newline character
@item @code{lexeme(pomoc)} @tab any form of the lexeme 'pomoc'
@item @code{cat(N/.*)} @tab a word whose category starts with @code{N/}
@item @code{cat(<N/Ca>)} @tab a word whose category matches @code{N/Ca}
@end multitable

@*
@noindent
@b{Examples of multi-segment patterns:}

@table @code
@item (word(\L) punct(\.) space?)+ word(\L\l+)
a sequence of initials followed by a surname
@item punct seg(W|S|N)* cat(<NPRO/Sr>) seg(W|S|N)* punct
a text fragment between two punctuation characters, containing an
occurrence of a relative pronoun
@end table

@node ser how ser works
@subsection How ser works

@node ser customization
@subsection Customization

@c All predefined terms correspond to single segments,

@example
define(`verbseq', `(cat(V) (space cat(V)))')
@end example

the term @code{cat()} may not be used as a ... of

@c See @command{m4} manual for further details on macro definition format.

@node ser limitations
@subsection Limitations

A category specification in @code{<...>} may not contain more than 3
attributes.
@node ser requirements
@subsection Requirements

In order to run @command{ser}, the following programs must be installed
in the system:

@itemize
@item @command{m4}
@item @command{grep}
@item @command{flex}
@item @command{gcc}
@end itemize

@c GRP
@c ---------------------------------------------------------------------
@c ---------------------------------------------------------------------

@page
@node grp
@section grp - pattern search tool

@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
@item @strong{Authors:} @tab Tomasz Obrębski
@item @strong{Component category:} @tab filter
@end multitable

@code{grp} selects sentences containing an expression matching a
pattern. The pattern format is exactly the same as that accepted by
@code{ser}. @code{grp} is intended mainly for speeding up the corpus
search process. It is extremely fast (its processing speed is usually
higher than the speed of reading the corpus file from disk).

@c @menu
@c * ser command line options::
@c * ser pattern::
@c * ser how ser works::
@c * ser customization::
@c * ser limitations::
@c * ser requirements::
@c @end menu

@menu
* grp command line options::
* grp pattern::
* grp hints::
@end menu

@node grp command line options
@subsection Command line options

@table @code

@parhelp
@parversion
@c @parfile
@c @paroutput
@c @parinputfield
@c @paroutputfield
@parprocess
@parinteractive

@item @b{@minus{}@minus{}pattern=@var{pattern}, @minus{}e @var{pattern}}
The search pattern.

@item @b{@minus{}@minus{}morph=@var{field}}
The name of the annotation field containing the morphological
description (default @code{lem}).

@item @b{@minus{}@minus{}command}
Only print the generated flex source code.

@item @b{@minus{}@minus{}macro=@var{filename}}
Read macrodefinitions from file @var{filename} rather than from the
default location. This option makes it possible to redefine the set of
terms.

@item @b{@minus{}@minus{}define=@var{filename}}
Append macrodefinitions from file @var{filename}.
This option makes it possible to extend the set of terms.

@end table

@node grp pattern
@subsection Pattern

The pattern format is the same as for @command{ser}
(@pxref{ser pattern}).

@node grp hints
@subsection Hints

The corpus search speed may be increased by combining @command{grp}
with the @command{lzop} compression tool (@command{grp} usually
processes data faster than it is read from a disk, especially for slow
laptop drives).

@example
cat corpus | tok | sen | lem | grp -a p | lzop -7 > corpus.grp.lzo
@end example

@example
lzop -cd corpus.grp.lzo | grp -a gP -e @var{EXPR} | ser -e @var{EXPR}
@end example

@c ---------------------------------------------------------------------
@c kot
@c ---------------------------------------------------------------------
@c ---------------------------------------------------------------------

@page
@node kot
@section kot - untokenizer

Authors: Tomasz Obrębski

@command{kot} is the opposite of @command{tok}. It changes
UTT-formatted text into plain text.

@menu
* kot command line options::
* kot usage examples::
@end menu

@node kot command line options
@subsection Command line options

@table @code

@parhelp
@c @item @b{@minus{}@minus{}version}, @b{@minus{}v}
@c @item @b{@minus{}@minus{}file=@var{filename}, @minus{}f @var{filename}}
@c @item @b{@minus{}@minus{}output=@var{filename}, @minus{}o @var{filename}}
@c @item @b{@minus{}@minus{}interactive @minus{}i}
@c @item @b{@minus{}@minus{}config=@var{filename}}

@item @b{@minus{}@minus{}gap-fill=@var{string}, @minus{}g @var{string}}
print @var{string} between nonadjacent segments of the input file

@item @b{@minus{}@minus{}spaces, @minus{}r}
retain the special characters @code{_}, @code{\t}, @code{\n},
@code{\r}, @code{\f} unexpanded in the output

@end table

@node kot usage examples
@subsection Usage examples

@example
cat legia.txt | tok | kot
@end example

@example
cat legia.txt | tok | lem -1 | kot
@end example

@c CON............................................................
@c ...............................................................
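What @command{kot} does can be pictured with a short Python sketch.
It is only an approximation written for this manual, not the actual
implementation, and it rests on several assumptions: each segment is a
line of the form @code{start length type form ...}, zero-length
segments carry no text, and the special characters are expanded only in
space segments (type @code{S}). The gap-fill behaviour is left out, and
the name @code{untokenize} is hypothetical.

@example
SPECIALS = (("\\n", "\n"), ("\\t", "\t"), ("\\r", "\r"),
            ("\\f", "\f"), ("_", " "))


def untokenize(lines):
    # Concatenate the form fields of UTT segments into plain text.
    parts = []
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue                  # not a complete segment
        length, seg_type, form = int(fields[1]), fields[2], fields[3]
        if length == 0:
            continue                  # e.g. BOS/EOS markers
        if seg_type == "S":
            # Assumption: specials are expanded in space segments only.
            for escaped, plain in SPECIALS:
                form = form.replace(escaped, plain)
        parts.append(form)
    return "".join(parts)


segments = [
    "0000 00 BOS *",
    "0000 02 W To",
    "0002 01 S _",
    "0003 02 W ja",
    "0005 01 P .",
    "0006 01 S \\n",
    "0007 00 EOS *",
]
print(repr(untokenize(segments)))
@end example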
@c ............................................................... @page @node con @section con - concordance table generator @command{con} generates a concordance table based on a pattern given to @command{ser}. @multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa} @item @strong{Authors:} @tab Justyna Walkowska @item @strong{Component category:} @tab sink @end multitable @c @menu * con command line options:: * con usage example:: * con hints:: @end menu @node con command line options @subsection Command line options @table @code @parhelp @c @item @b{@minus{}@minus{}help}, @b{@minus{}h} @c @item @b{@minus{}@minus{}version}, @b{@minus{}v} @c @item @b{@minus{}@minus{}file=@var{filename}, @minus{}f @var{filename}} @c @item @b{@minus{}@minus{}output=@var{filename}, @minus{}o @var{filename}} @c @item @b{@minus{}@minus{}fail=@var{filename}, @minus{}e @var{filename}} [???] @c @item @b{@minus{}@minus{}copy, @minus{}c} [???] @c @item @b{@minus{}@minus{}input-field=@var{fieldname}, @minus{}I @var{fieldname}} @c @item @b{@minus{}@minus{}output-field=@var{fieldname}, @minus{}O @var{fieldname}} @c @item @b{@minus{}@minus{}process=@var{class}, @minus{}p @var{class}} @c @item @b{@minus{}@minus{}interactive @minus{}i} @c @item @b{@minus{}@minus{}config=@var{filename}} @c @item @c @item @b{@minus{}@minus{}pattern=@var{pattern}, @minus{}e @var{pattern}} @c search pattern @c @c @item @b{@minus{}@minus{}flex} @c only print the generated flex source code @c @c @item @b{@minus{}@minus{}macro=@var{filename}} @c read macrodefinitions from file @var{filename} rather than from @c default location. This option allows to redefine the set of terms. @c @c @item @b{@minus{}@minus{}define=@var{filename}} @c append macrodefinitions from file @var{filename}. This option @c allows to extend the set of terms. @item @b{@minus{}@minus{}left @minus{}l} Left context info (default='30c'). 
Example:
@example
-l=5c: left context is 5 characters
-l=5w: left context is 5 words
-l=5s: left context is 5 non-empty input lines
-l='\s*\S+\sr\S+BOS': left context starts with the given regex
@end example

@item @b{@minus{}@minus{}right @minus{}r}
Right context info (default='30c').

@item @b{@minus{}@minus{}trim @minus{}t}
Clear incomplete words from the output.

@item @b{@minus{}@minus{}white @minus{}w}
Do not convert whitespace characters into spaces.

@item @b{@minus{}@minus{}column @minus{}c}
Left column minimal width in characters (default = 0).

@item @b{@minus{}@minus{}ignore @minus{}i}
Ignore segment inconsistency in the input.

@item @b{@minus{}@minus{}bon}
Beginning of selected segment (regex, default='[0-9]+ [0-9]+ BOM .*').

@item @b{@minus{}@minus{}eob}
End of selected segment (regex, default='[0-9]+ [0-9]+ EOM .*').

@item @b{@minus{}@minus{}bod}
Selected segment beginning display string (default='[').

@item @b{@minus{}@minus{}eod}
Selected segment end display string (default=']').

@end table

@node con usage example
@subsection Usage example

@example
cat file.txt | tok | lem -1 | ser -e 'lexeme(dom)' | con
@end example

@node con hints
@subsection Hints

@command{con} is a rather slow program. Do not pass large amounts of
redundant text through it. @command{con} works well in the following
sequence:

@example
...
| grp -e EXPR | ser -e EXPR | con
@end example

@c ---------------------------------------------------------------------
@c ---------------------------------------------------------------------

@page
@node Auxiliary tools
@chapter Auxiliary tools

@menu
* compiledic:: dictionary compiler
* fla:: UTT file flattener
* unfla:: UTT file unflattener
@end menu

@page
@node compiledic
@section compiledic - the dictionary compiler

@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
@item @strong{Authors:} @tab Michal Stolarski, Tomasz Obrebski
@item @strong{Component category:} @tab additional tool
@end multitable
@c

@command{compiledic} compiles dictionaries in text format (@code{.dic}
extension) into binary (FSA) format (@code{.bin} extension).

The automaton representation of a dictionary is built using the AT&T
tools:

@itemize
@item AT&T FSM Library,
@item AT&T Lextools.
@end itemize

For the @command{compiledic} program to work, the above-mentioned
packages have to be installed in your system. They are freely available
for non-commercial use.

Usage:

@example
compiledic <dictionaryname>.dic
@end example

The file <dictionaryname>.bin will be generated.

Note: The program produces a lot of temporary files which are stored in
the current directory. They are deleted after successful termination of
the program.

@c @menu
@c * con command line options::
@c * con usage example::
@c * con hints::
@c @end menu

@page
@node fla
@section fla - the UTT file flattener

@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
@item @strong{Authors:} @tab Tomasz Obrębski
@item @strong{Component category:} @tab filter
@end multitable
@c

@command{fla} ``flattens'' a UTT file by merging the segments belonging
to one sentence into one line. Technically, end-of-line characters
('\n', ASCII code 10) are replaced with form-feed characters ('\f',
ASCII code 12).
The flattening makes it possible to process UTT files sentence by
sentence with such tools as @command{grep} or @command{sed} (this is
used in @command{grp} and @command{mar}).

Flattened files should have the suffix @code{.fla}, e.g.
@file{thetext.utt.fla}.

Flattened files are still human-readable.

Usage:

@example
fla [<bosregex>]
@end example

The optional argument is a regular expression describing segments which
should be treated as sentence beginnings (the test is: the segment
contains a fragment matching the @code{<bosregex>}). By default,
segments containing a field @code{BOS} are sought.

@c @menu
@c * con command line options::
@c * con usage example::
@c * con hints::
@c @end menu

@page
@node unfla
@section unfla - the UTT file unflattener

@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
@item @strong{Authors:} @tab Tomasz Obrębski
@item @strong{Component category:} @tab filter
@end multitable

@command{unfla} transforms a flattened UTT file, produced by
@command{fla}, into the regular format by restoring end-of-line
characters.
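The relation between @command{fla} and @command{unfla} can be sketched
in a few lines of Python. This is an illustration under stated
assumptions, not the tools' source code: a new output line is started at
every segment matching the sentence-beginning regex (plain @code{BOS}
here), segments preceding the first match are kept together on one line,
and the names @code{flatten} and @code{unflatten} are hypothetical.

@example
import re


def flatten(text, bos_regex="BOS"):
    # One output line per sentence; segments are joined with form feeds.
    sentences = []
    for segment in text.split("\n"):
        if not segment:
            continue
        if re.search(bos_regex, segment) or not sentences:
            sentences.append([segment])     # start a new sentence
        else:
            sentences[-1].append(segment)   # continue the current one
    return "".join("\f".join(s) + "\n" for s in sentences)


def unflatten(text):
    # Restore the regular format: one segment per line.
    return text.replace("\f", "\n")


utt = ("0000 00 BOS *\n0000 02 W To\n0002 00 EOS *\n"
       "0002 00 BOS *\n0002 02 W ja\n0004 00 EOS *\n")
flat = flatten(utt)
print(flat.count("\n"), unflatten(flat) == utt)
@end example

Note that the sketch splits on @code{"\n"} explicitly, because Python's
@code{str.splitlines} would also split on form-feed characters.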
@c ---------------------------------------------------------------------
@c USAGE EXAMPLES
@c ---------------------------------------------------------------------

@node Usage examples
@chapter Usage examples

@subsubheading Simple pipelines

@enumerate

@item tokenization

@example
cat text | tok > output1
@end example

@item morphological annotation (1)

simple dictionary-based lemmatization

@example
cat text | tok | lem > output1
@end example

@item morphological annotation (2)

@enumerate
@item perform dictionary-based lemmatization
@item guess descriptions for words which have no annotation
@end enumerate

@example
cat text | tok | lem | gue -S lem > output2
@end example

@item morphological annotation (3)

@enumerate
@item perform dictionary-based lemmatization
@item try to correct words with no annotation
@item perform dictionary-based lemmatization of corrected words
@item guess descriptions for words which still have no annotation
@end enumerate

@example
cat text | tok | lem | cor -p W -S lem | lem -I cor | gue -p W -S lem
@end example

@item spelling correction

@example
cat text | tok | lem --only-fail | cor -1 > output3
@end example

@item Expression extraction

Extraction of all occurrences of a verb followed by a form of the noun
'rozmowa'.

@example
cat text | tok | lem -1 | ser -e 'cat(<V>) space lexeme(rozmowa)' -m | kot > output4
@end example

@item A word in context

Extraction of text fragments containing a form of the lexeme 'rozmowa'
in the context of 5 preceding and 5 succeeding corpus segments.

@example
cat text | tok | lem -1 | ser -e 'seg@{5@} lexeme(rozmowa) seg@{5@}' -m | kot > output
@end example

@item generation of concordance table (1)

@example
cat text | tok | lem -1 | ser -e 'cat(<V>) space lexeme(rozmowa)' | con
@end example

Running time: 10 seconds.

@item generation of concordance table (2)

The same as above but much faster.

@example
cat text | tok | lem -1 | \
grp -e 'cat(<V>) space lexeme(rozmowa)' | \
ser -e 'cat(<V>) space lexeme(rozmowa)' | \
con
@end example

Running time: 2 seconds.

@item generation of concordance table (3)

Usually, the same corpus is searched repeatedly.
In such a case it is advisable to transform the corpus data into the
format required by @command{grp} first, and then use the preprocessed
data. As @command{grp} (@command{grep}) processes data faster than it is
read from the disk drive, the search time may be shortened further by
using file compression techniques. We suggest using @command{lzop}.

@item the fastest way to search a large corpus

step 1: preprocessing

@example
cat corpus | tok | sen | lem -1 \
| grp -a p | lzop -7 > corpus.grp.lzo
@end example

step 2: search

@example
lzop -cd corpus.grp.lzo | \
grp -a gP -e 'cat(<V>) space lexeme(rozmowa)' | \
ser -e 'cat(<V>) space lexeme(rozmowa)' | \
con
@end example

@end enumerate

@subsubheading More complicated configurations

@example
mknod fifo1 p
mknod fifo2 p
mknod fifo3 p
mknod fifo4 p
mknod fifo5 p

tok | lem -p W -e fifo1 > fifo2 &

cor -e fifo3 < fifo1 | lem > fifo4 &

gue < fifo3 > fifo5 &

sort -m fifo2 fifo4 fifo5

rm fifo?
@end example

@c ---------------------------------------------------------------------
@c ---------------------------------------------------------------------

@c ---------------------------------------------------------------------
@c PMDBF DICTIONARY
@c ---------------------------------------------------------------------

@node PMDBF dictionary
@chapter PMDBF dictionary

UTT components come with lexical data derived from the Polish
Morphological Database (PMDB).

@menu
* PMDBF files::
* PMDBF tag structure::
* PMDBF parts of speech::
* PMDBF morphosyntactic attributes::
@end menu

@node PMDBF files
@section Files

@node PMDBF tag structure
@section Tag structure

@example
pos   = [[:upper:]]+
attr  = [[:upper:]]+
val   = [[:lower:][:digit:]?!*+-] | <[^>\n]+>
descr = pos ( / ( attr val + ) + ) ?
@end example
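The tag grammar above can be turned into a single regular expression.
The following Python sketch is an illustration only: the POSIX classes
@code{[:upper:]}, @code{[:lower:]} and @code{[:digit:]} are approximated
by the ASCII ranges @code{A-Z}, @code{a-z} and @code{0-9} (an
assumption), and the helper name @code{is_valid_description} is
hypothetical.

@example
import re

POS = r"[A-Z]+"
ATTR = r"[A-Z]+"
VAL = r"(?:[a-z0-9?!*+-]|<[^>\n]+>)"
# descr = pos ( / ( attr val + ) + ) ?
DESCR = re.compile(POS + r"(?:/(?:" + ATTR + VAL + r"+)+)?")


def is_valid_description(tag):
    # True if the whole string is a well-formed description.
    return DESCR.fullmatch(tag) is not None


for tag in ("N/GfNsCa", "ADJ", "N/", "n/Gf"):
    print(tag, is_valid_description(tag))
@end example

Under these assumptions the first two tags are accepted and the last
two are rejected: @code{N/} lacks an attribute-value pair and
@code{n/Gf} starts with a lowercase letter.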
@node PMDBF parts of speech @section Parts of speech @multitable {ADJPRP} { adjectival-passive-participle } @item @code{N} @tab noun @item @code{NPRO} @tab nominal-pronoun @item @code{NV} @tab deverbal-noun @item @code{V} @tab verb @item @code{BYC} @tab byc @item @code{VNI} @tab non-inflected-verb @item @code{ADJ} @tab adjective @item @code{ADJPAP} @tab adjectival-passive-participle @item @code{ADJPRP} @tab adjectival-present-participle @item @code{ADJPP} @tab adjectival-past-participle @item @code{ADJPRO} @tab adjectival-pronoun @item @code{ADJNUM} @tab adjectival-numeral @item @code{ADV} @tab adverb @item @code{ADVANP} @tab adverbial-anterior-participle @item @code{ADVPRP} @tab adverbial-present-participle @item @code{ADVPRO} @tab adverbial-pronoun @item @code{ADVNUM} @tab adverbial-numeral @item @code{P} @tab preposition @item @code{PPRO} @tab prep-noun-pronoun @item @code{CONJ} @tab conjunction @item @code{EXCL} @tab exclamation @item @code{APP} @tab call @item @code{ONO} @tab onomatopoeia @item @code{PART} @tab particle @item @code{NUMCRD} @tab cardinal-numeral @item @code{NUMCOL} @tab collective-numeral @item @code{NUMPAR} @tab partitive-numeral @item @code{NUMORD} @tab ordinal-numeral @end multitable @node PMDBF morphosyntactic attributes @section Morphosyntactic attributes @multitable {Attr} {Val} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa} @c @headitem Attr @tab Val @tab Description @item @code{A} @tab @tab Aspect @item @tab @code{p} @tab perfect @item @tab @code{i} @tab imperfect. @item @item @code{V} @tab @tab Verb-Form @item @tab @code{b} @tab infinitive, @item @tab @code{p} @tab personal, @item @tab @code{i} @tab impersonal. @item @item @code{M} @tab @tab Mood @item @tab @code{d} @tab declarative, @item @tab @code{c} @tab conditional, @item @tab @code{i} @tab imperative. @item @item @code{T} @tab @tab Tense @item @tab @code{a} @tab past, @item @tab @code{r} @tab present, @item @tab @code{f} @tab future. 
@item
@item @code{P} @tab @tab Person
@item @tab @code{1} @tab 1,
@item @tab @code{2} @tab 2,
@item @tab @code{3} @tab 3.
@item
@item @code{D} @tab @tab Degree
@item @tab @code{p} @tab positive,
@item @tab @code{c} @tab comparative,
@item @tab @code{s} @tab superlative.
@item
@item @code{N} @tab @tab Number
@item @tab @code{s} @tab singular,
@item @tab @code{p} @tab plural.
@item
@item @code{C} @tab @tab Case
@item @tab @code{n} @tab nominative,
@item @tab @code{g} @tab genitive,
@item @tab @code{d} @tab dative,
@item @tab @code{a} @tab accusative,
@item @tab @code{i} @tab instrumental,
@item @tab @code{l} @tab locative,
@item @tab @code{v} @tab vocative.
@item
@item @code{G} @tab @tab Gender
@item @tab @code{p} @tab masculine-personal,
@item @tab @code{a} @tab masculine-animal,
@item @tab @code{i} @tab masculine-inanimate,
@item @tab @code{f} @tab feminine,
@item @tab @code{n} @tab neuter.
@end multitable

@c ---------------------------------------------------------------------
@c ---------------------------------------------------------------------
@c
@c @node Examples
@c @chapter Examples

@c ----------------------------------------------------------------------
@c ----------------------------------------------------------------------

@node GNU Free Documentation License
@chapter GNU Free Documentation License

@c The GNU Free Documentation License.
@center Version 1.2, November 2002

@c This file is intended to be included within another document,
@c hence no sectioning command or @node.

@display
Copyright @copyright{} 2000,2001,2002 Free Software Foundation, Inc.
51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA

Everyone is permitted to copy and distribute verbatim copies
of this license document, but changing it is not allowed.
@end display @enumerate 0 @item PREAMBLE The purpose of this License is to make a manual, textbook, or other functional and useful document @dfn{free} in the sense of freedom: to assure everyone the effective freedom to copy and redistribute it, with or without modifying it, either commercially or noncommercially. Secondarily, this License preserves for the author and publisher a way to get credit for their work, while not being considered responsible for modifications made by others. This License is a kind of ``copyleft'', which means that derivative works of the document must themselves be free in the same sense. It complements the GNU General Public License, which is a copyleft license designed for free software. We have designed this License in order to use it for manuals for free software, because free software needs free documentation: a free program should come with manuals providing the same freedoms that the software does. But this License is not limited to software manuals; it can be used for any textual work, regardless of subject matter or whether it is published as a printed book. We recommend this License principally for works whose purpose is instruction or reference. @item APPLICABILITY AND DEFINITIONS This License applies to any manual or other work, in any medium, that contains a notice placed by the copyright holder saying it can be distributed under the terms of this License. Such a notice grants a world-wide, royalty-free license, unlimited in duration, to use that work under the conditions stated herein. The ``Document'', below, refers to any such manual or work. Any member of the public is a licensee, and is addressed as ``you''. You accept the license if you copy, modify or distribute the work in a way requiring permission under copyright law. A ``Modified Version'' of the Document means any work containing the Document or a portion of it, either copied verbatim, or with modifications and/or translated into another language. 
A ``Secondary Section'' is a named appendix or a front-matter section of the Document that deals exclusively with the relationship of the publishers or authors of the Document to the Document's overall subject (or to related matters) and contains nothing that could fall directly within that overall subject. (Thus, if the Document is in part a textbook of mathematics, a Secondary Section may not explain any mathematics.) The relationship could be a matter of historical connection with the subject or with related matters, or of legal, commercial, philosophical, ethical or political position regarding them. The ``Invariant Sections'' are certain Secondary Sections whose titles are designated, as being those of Invariant Sections, in the notice that says that the Document is released under this License. If a section does not fit the above definition of Secondary then it is not allowed to be designated as Invariant. The Document may contain zero Invariant Sections. If the Document does not identify any Invariant Sections then there are none. The ``Cover Texts'' are certain short passages of text that are listed, as Front-Cover Texts or Back-Cover Texts, in the notice that says that the Document is released under this License. A Front-Cover Text may be at most 5 words, and a Back-Cover Text may be at most 25 words. A ``Transparent'' copy of the Document means a machine-readable copy, represented in a format whose specification is available to the general public, that is suitable for revising the document straightforwardly with generic text editors or (for images composed of pixels) generic paint programs or (for drawings) some widely available drawing editor, and that is suitable for input to text formatters or for automatic translation to a variety of formats suitable for input to text formatters. 
A copy made in an otherwise Transparent file format whose markup, or absence of markup, has been arranged to thwart or discourage subsequent modification by readers is not Transparent. An image format is not Transparent if used for any substantial amount of text. A copy that is not ``Transparent'' is called ``Opaque''. Examples of suitable formats for Transparent copies include plain @sc{ascii} without markup, Texinfo input format, La@TeX{} input format, @acronym{SGML} or @acronym{XML} using a publicly available @acronym{DTD}, and standard-conforming simple @acronym{HTML}, PostScript or @acronym{PDF} designed for human modification. Examples of transparent image formats include @acronym{PNG}, @acronym{XCF} and @acronym{JPG}. Opaque formats include proprietary formats that can be read and edited only by proprietary word processors, @acronym{SGML} or @acronym{XML} for which the @acronym{DTD} and/or processing tools are not generally available, and the machine-generated @acronym{HTML}, PostScript or @acronym{PDF} produced by some word processors for output purposes only. The ``Title Page'' means, for a printed book, the title page itself, plus such following pages as are needed to hold, legibly, the material this License requires to appear in the title page. For works in formats which do not have any title page as such, ``Title Page'' means the text near the most prominent appearance of the work's title, preceding the beginning of the body of the text. A section ``Entitled XYZ'' means a named subunit of the Document whose title either is precisely XYZ or contains XYZ in parentheses following text that translates XYZ in another language. (Here XYZ stands for a specific section name mentioned below, such as ``Acknowledgements'', ``Dedications'', ``Endorsements'', or ``History''.) To ``Preserve the Title'' of such a section when you modify the Document means that it remains a section ``Entitled XYZ'' according to this definition. 
The Document may include Warranty Disclaimers next to the notice which states that this License applies to the Document. These Warranty Disclaimers are considered to be included by reference in this License, but only as regards disclaiming warranties: any other implication that these Warranty Disclaimers may have is void and has no effect on the meaning of this License. @item VERBATIM COPYING You may copy and distribute the Document in any medium, either commercially or noncommercially, provided that this License, the copyright notices, and the license notice saying this License applies to the Document are reproduced in all copies, and that you add no other conditions whatsoever to those of this License. You may not use technical measures to obstruct or control the reading or further copying of the copies you make or distribute. However, you may accept compensation in exchange for copies. If you distribute a large enough number of copies you must also follow the conditions in section 3. You may also lend copies, under the same conditions stated above, and you may publicly display copies. @item COPYING IN QUANTITY If you publish printed copies (or copies in media that commonly have printed covers) of the Document, numbering more than 100, and the Document's license notice requires Cover Texts, you must enclose the copies in covers that carry, clearly and legibly, all these Cover Texts: Front-Cover Texts on the front cover, and Back-Cover Texts on the back cover. Both covers must also clearly and legibly identify you as the publisher of these copies. The front cover must present the full title with all words of the title equally prominent and visible. You may add other material on the covers in addition. Copying with changes limited to the covers, as long as they preserve the title of the Document and satisfy these conditions, can be treated as verbatim copying in other respects. 
If the required texts for either cover are too voluminous to fit legibly, you should put the first ones listed (as many as fit reasonably) on the actual cover, and continue the rest onto adjacent pages. If you publish or distribute Opaque copies of the Document numbering more than 100, you must either include a machine-readable Transparent copy along with each Opaque copy, or state in or with each Opaque copy a computer-network location from which the general network-using public has access to download using public-standard network protocols a complete Transparent copy of the Document, free of added material. If you use the latter option, you must take reasonably prudent steps, when you begin distribution of Opaque copies in quantity, to ensure that this Transparent copy will remain thus accessible at the stated location until at least one year after the last time you distribute an Opaque copy (directly or through your agents or retailers) of that edition to the public. It is requested, but not required, that you contact the authors of the Document well before redistributing any large number of copies, to give them a chance to provide you with an updated version of the Document. @item MODIFICATIONS You may copy and distribute a Modified Version of the Document under the conditions of sections 2 and 3 above, provided that you release the Modified Version under precisely this License, with the Modified Version filling the role of the Document, thus licensing distribution and modification of the Modified Version to whoever possesses a copy of it. In addition, you must do these things in the Modified Version: @enumerate A @item Use in the Title Page (and on the covers, if any) a title distinct from that of the Document, and from those of previous versions (which should, if there were any, be listed in the History section of the Document). You may use the same title as a previous version if the original publisher of that version gives permission. 
@item List on the Title Page, as authors, one or more persons or entities responsible for authorship of the modifications in the Modified Version, together with at least five of the principal authors of the Document (all of its principal authors, if it has fewer than five), unless they release you from this requirement. @item State on the Title page the name of the publisher of the Modified Version, as the publisher. @item Preserve all the copyright notices of the Document. @item Add an appropriate copyright notice for your modifications adjacent to the other copyright notices. @item Include, immediately after the copyright notices, a license notice giving the public permission to use the Modified Version under the terms of this License, in the form shown in the Addendum below. @item Preserve in that license notice the full lists of Invariant Sections and required Cover Texts given in the Document's license notice. @item Include an unaltered copy of this License. @item Preserve the section Entitled ``History'', Preserve its Title, and add to it an item stating at least the title, year, new authors, and publisher of the Modified Version as given on the Title Page. If there is no section Entitled ``History'' in the Document, create one stating the title, year, authors, and publisher of the Document as given on its Title Page, then add an item describing the Modified Version as stated in the previous sentence. @item Preserve the network location, if any, given in the Document for public access to a Transparent copy of the Document, and likewise the network locations given in the Document for previous versions it was based on. These may be placed in the ``History'' section. You may omit a network location for a work that was published at least four years before the Document itself, or if the original publisher of the version it refers to gives permission. 
@item For any section Entitled ``Acknowledgements'' or ``Dedications'', Preserve the Title of the section, and preserve in the section all the substance and tone of each of the contributor acknowledgements and/or dedications given therein. @item Preserve all the Invariant Sections of the Document, unaltered in their text and in their titles. Section numbers or the equivalent are not considered part of the section titles. @item Delete any section Entitled ``Endorsements''. Such a section may not be included in the Modified Version. @item Do not retitle any existing section to be Entitled ``Endorsements'' or to conflict in title with any Invariant Section. @item Preserve any Warranty Disclaimers. @end enumerate If the Modified Version includes new front-matter sections or appendices that qualify as Secondary Sections and contain no material copied from the Document, you may at your option designate some or all of these sections as invariant. To do this, add their titles to the list of Invariant Sections in the Modified Version's license notice. These titles must be distinct from any other section titles. You may add a section Entitled ``Endorsements'', provided it contains nothing but endorsements of your Modified Version by various parties---for example, statements of peer review or that the text has been approved by an organization as the authoritative definition of a standard. You may add a passage of up to five words as a Front-Cover Text, and a passage of up to 25 words as a Back-Cover Text, to the end of the list of Cover Texts in the Modified Version. Only one passage of Front-Cover Text and one of Back-Cover Text may be added by (or through arrangements made by) any one entity. 
If the Document already includes a cover text for the same cover, previously added by you or by arrangement made by the same entity you are acting on behalf of, you may not add another; but you may replace the old one, on explicit permission from the previous publisher that added the old one. The author(s) and publisher(s) of the Document do not by this License give permission to use their names for publicity for or to assert or imply endorsement of any Modified Version. @item COMBINING DOCUMENTS You may combine the Document with other documents released under this License, under the terms defined in section 4 above for modified versions, provided that you include in the combination all of the Invariant Sections of all of the original documents, unmodified, and list them all as Invariant Sections of your combined work in its license notice, and that you preserve all their Warranty Disclaimers. The combined work need only contain one copy of this License, and multiple identical Invariant Sections may be replaced with a single copy. If there are multiple Invariant Sections with the same name but different contents, make the title of each such section unique by adding at the end of it, in parentheses, the name of the original author or publisher of that section if known, or else a unique number. Make the same adjustment to the section titles in the list of Invariant Sections in the license notice of the combined work. In the combination, you must combine any sections Entitled ``History'' in the various original documents, forming one section Entitled ``History''; likewise combine any sections Entitled ``Acknowledgements'', and any sections Entitled ``Dedications''. 
You must delete all sections Entitled ``Endorsements.'' @item COLLECTIONS OF DOCUMENTS You may make a collection consisting of the Document and other documents released under this License, and replace the individual copies of this License in the various documents with a single copy that is included in the collection, provided that you follow the rules of this License for verbatim copying of each of the documents in all other respects. You may extract a single document from such a collection, and distribute it individually under this License, provided you insert a copy of this License into the extracted document, and follow this License in all other respects regarding verbatim copying of that document. @item AGGREGATION WITH INDEPENDENT WORKS A compilation of the Document or its derivatives with other separate and independent documents or works, in or on a volume of a storage or distribution medium, is called an ``aggregate'' if the copyright resulting from the compilation is not used to limit the legal rights of the compilation's users beyond what the individual works permit. When the Document is included in an aggregate, this License does not apply to the other works in the aggregate which are not themselves derivative works of the Document. If the Cover Text requirement of section 3 is applicable to these copies of the Document, then if the Document is less than one half of the entire aggregate, the Document's Cover Texts may be placed on covers that bracket the Document within the aggregate, or the electronic equivalent of covers if the Document is in electronic form. Otherwise they must appear on printed covers that bracket the whole aggregate. @item TRANSLATION Translation is considered a kind of modification, so you may distribute translations of the Document under the terms of section 4. 
Replacing Invariant Sections with translations requires special permission from their copyright holders, but you may include translations of some or all Invariant Sections in addition to the original versions of these Invariant Sections. You may include a translation of this License, and all the license notices in the Document, and any Warranty Disclaimers, provided that you also include the original English version of this License and the original versions of those notices and disclaimers. In case of a disagreement between the translation and the original version of this License or a notice or disclaimer, the original version will prevail. If a section in the Document is Entitled ``Acknowledgements'', ``Dedications'', or ``History'', the requirement (section 4) to Preserve its Title (section 1) will typically require changing the actual title. @item TERMINATION You may not copy, modify, sublicense, or distribute the Document except as expressly provided for under this License. Any other attempt to copy, modify, sublicense or distribute the Document is void, and will automatically terminate your rights under this License. However, parties who have received copies, or rights, from you under this License will not have their licenses terminated so long as such parties remain in full compliance. @item FUTURE REVISIONS OF THIS LICENSE The Free Software Foundation may publish new, revised versions of the GNU Free Documentation License from time to time. Such new versions will be similar in spirit to the present version, but may differ in detail to address new problems or concerns. See @uref{http://www.gnu.org/copyleft/}. Each version of the License is given a distinguishing version number. 
If the Document specifies that a particular numbered version of this License ``or any later version'' applies to it, you have the option of following the terms and conditions either of that specified version or of any later version that has been published (not as a draft) by the Free Software Foundation. If the Document does not specify a version number of this License, you may choose any version ever published (not as a draft) by the Free Software Foundation. @end enumerate @page @heading ADDENDUM: How to use this License for your documents To use this License in a document you have written, include a copy of the License in the document and put the following copyright and license notices just after the title page: @smallexample @group Copyright (C) @var{year} @var{your name}. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the section entitled ``GNU Free Documentation License''. @end group @end smallexample If you have Invariant Sections, Front-Cover Texts and Back-Cover Texts, replace the ``with@dots{}Texts.'' line with this: @smallexample @group with the Invariant Sections being @var{list their titles}, with the Front-Cover Texts being @var{list}, and with the Back-Cover Texts being @var{list}. @end group @end smallexample If you have Invariant Sections without Cover Texts, or some other combination of the three, merge those two alternatives to suit the situation. If your document contains nontrivial examples of program code, we recommend releasing these examples in parallel under your choice of free software license, such as the GNU General Public License, to permit their use in free software. 
@c Local Variables:
@c ispell-local-pdict: "ispell-dict"
@c End:

@c ---------------------------------------------------------------------
@c ---------------------------------------------------------------------

@node Reporting bugs
@chapter Reporting bugs

Report bugs to <obrebski@@amu.edu.pl>.

@c ---------------------------------------------------------------------
@c ---------------------------------------------------------------------

@c @node Copyright
@c @chapter Copyright
@c
@c Copyright 2004 by Tomasz Obrebski
@c This software is free for research and educational use.

@c ---------------------------------------------------------------------
@c ---------------------------------------------------------------------

@node Author
@chapter Author

@bye