/** \page tutorial Quick tutorial
\section tutorial1 Code examples
This section describes a few example C++ programs that use the Concordia library. You can run them after a successful installation of Concordia (the installation process is covered in \ref compilation). Their source code is located in the project's main directory, in the "examples" subfolder.

The directory also contains a simple CMakeLists.txt file, which handles compilation and linking of the examples. To compile the examples, issue the following commands from within the examples directory:

\verbatim
mkdir build
cd build
cmake ..
make
\endverbatim

After these operations, three executables are created in the build directory: first, simple_search and concordia_search. A small config.hpp file is also generated to store the path to the examples folder.

\subsection tutorial1_1 Minimal example
This program only creates the Concordia object and prints the version of the library.

File first.cpp:

\verbatim
#include <concordia/concordia.hpp>
#include <iostream>

#include "config.hpp"

using namespace std;

int main() {
    // create the Concordia object from the sample configuration file
    Concordia concordia(EXAMPLES_DIR"/../tests/resources/concordia-config/concordia.cfg");

    // print the library version
    cout << concordia.getVersion() << endl;
}
\endverbatim
\subsection tutorial1_2 Simple substring lookup
This code snippet shows the basic Concordia functionality: simple substring lookup in the index.

File simple_search.cpp:

\verbatim
#include <concordia/concordia.hpp>
#include <concordia/substring_occurence.hpp>
#include <concordia/example.hpp>

#include "config.hpp"

#include <boost/shared_ptr.hpp>
#include <iostream>
#include <vector>

using namespace std;

int main() {
    Concordia concordia(EXAMPLES_DIR"/../tests/resources/concordia-config/concordia.cfg");

    // adding sentences to index
    concordia.addExample(Example("Alice has a cat", 56));
    concordia.addExample(Example("Alice has a dog", 23));
    concordia.addExample(Example("New test product has a mistake", 321));
    concordia.addExample(Example("This is just testing and it has nothing to do with the above", 14));

    // generating index
    concordia.refreshSAfromRAM();

    // searching
    cout << "Searching for pattern: has a" << endl;
    vector<SubstringOccurence> result = concordia.simpleSearch("has a");

    // printing results
    for (vector<SubstringOccurence>::iterator it = result.begin();
         it != result.end(); ++it) {
        cout << "Found substring in sentence: " << it->getId() << " at offset: " << it->getOffset() << endl;
    }

    // clearing index
    concordia.clearIndex();
}
\endverbatim
First, sentences are added to the index along with their integer IDs. The pair (sentence, id) is called an Example. Note that the IDs used in the above code are not consecutive, as there is no such requirement. A sentence ID may come from another data source, e.g. a database, and is used only as sentence meta-information.

After adding the examples, the index needs to be generated using the method refreshSAfromRAM. Details of this operation are covered in \ref tutorial2.

The search returns a vector of SubstringOccurence objects, which is then printed out. Each occurrence represents a single match of the pattern. The pattern has to be matched within a single sentence. Information about the match consists of two integer values: the ID of the sentence where the match occurred and the word-level, 0-based offset of the matched pattern in the sentence. The above code should return the following results:
\verbatim
Found substring in sentence: 56 at offset: 1
Found substring in sentence: 23 at offset: 1
Found substring in sentence: 321 at offset: 3
\endverbatim
Match (321, 3) represents matching of the pattern "has a" in the sentence 321 ("New test product has a mistake"), starting at position 3, i.e. after the third word, which is "product".
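
The meaning of the (sentence ID, word offset) pairs can be illustrated with a small standalone sketch that does not use Concordia at all. It assumes C++11, and the names IndexedSentence, splitWords and naiveWordSearch are hypothetical. It scans every sentence word by word, which is not how Concordia searches (Concordia uses its index), and it is case-sensitive and splits on whitespace only.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for an Example: a sentence with its integer ID.
struct IndexedSentence {
    std::string text;
    int id;
};

// Splits a sentence into whitespace-separated words.
std::vector<std::string> splitWords(const std::string& text) {
    std::istringstream stream(text);
    std::vector<std::string> words;
    std::string word;
    while (stream >> word) {
        words.push_back(word);
    }
    return words;
}

// Returns (sentence ID, 0-based word offset) for every place where the
// pattern words appear consecutively within a single sentence.
// This is a naive linear scan, used only to illustrate the result format.
std::vector<std::pair<int, std::size_t> > naiveWordSearch(
        const std::vector<IndexedSentence>& sentences,
        const std::string& pattern) {
    std::vector<std::pair<int, std::size_t> > matches;
    const std::vector<std::string> patternWords = splitWords(pattern);
    if (patternWords.empty()) {
        return matches;
    }
    for (const IndexedSentence& sentence : sentences) {
        const std::vector<std::string> words = splitWords(sentence.text);
        for (std::size_t offset = 0;
             offset + patternWords.size() <= words.size(); ++offset) {
            bool equal = true;
            for (std::size_t i = 0; i < patternWords.size(); ++i) {
                if (words[offset + i] != patternWords[i]) {
                    equal = false;
                    break;
                }
            }
            if (equal) {
                matches.push_back(std::make_pair(sentence.id, offset));
            }
        }
    }
    return matches;
}
```

Calling naiveWordSearch with the four sentences from simple_search.cpp and the pattern "has a" yields the pairs (56, 1), (23, 1) and (321, 3), in the order in which the sentences were supplied.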
\subsection tutorial1_3 Concordia searching
Concordia is equipped with a unique feature, the so-called Concordia search, which is best suited for use in Computer-Aided Translation systems. This operation aims at finding the longest matches from the index that cover the search pattern. Such a match is called a MatchedPatternFragment. Then, out of all matched pattern fragments, the best pattern overlay is computed. A pattern overlay is a set of matched pattern fragments that do not intersect with each other. The best pattern overlay is the overlay that covers the most of the pattern with the fewest fragments.

Additionally, a score is computed for this best overlay. The score is a real number between 0 and 1, where 0 indicates that the pattern is not covered at all (i.e. not a single word from the pattern is found in the index). A score of 1 represents a perfect match: the pattern is covered completely by just one fragment, which means that the pattern is found in the index as one of the examples.

Moreover, the example below demonstrates retrieving a tokenized version of an added example.

File concordia_searching.cpp:

\verbatim
#include <concordia/concordia.hpp>
#include <concordia/concordia_search_result.hpp>
#include <concordia/matched_pattern_fragment.hpp>
#include <concordia/example.hpp>
#include <concordia/tokenized_sentence.hpp>

#include "config.hpp"

#include <boost/shared_ptr.hpp>
#include <boost/foreach.hpp>
#include <iostream>

using namespace std;

int main() {
    Concordia concordia(EXAMPLES_DIR"/../tests/resources/concordia-config/concordia.cfg");

    // adding the first sentence and printing its tokenized version
    boost::shared_ptr<TokenizedSentence> ts = concordia.addExample(Example("Alice has a cat", 56));
    cout << "Added the following tokens: " << endl;
    BOOST_FOREACH(TokenAnnotation token, ts->getTokens()) {
        cout << "\"" << token.getValue() << "\"" << " at positions: [" << token.getStart() << ","
             << token.getEnd() << ")" << endl;
    }

    // adding the remaining sentences and generating index
    concordia.addExample(Example("Alice has a dog", 23));
    concordia.addExample(Example("New test product has a mistake", 321));
    concordia.addExample(Example("This is just testing and it has nothing to do with the above", 14));
    concordia.refreshSAfromRAM();

    // searching
    cout << "Searching for pattern: Our new test product has nothing to do with computers" << endl;
    boost::shared_ptr<ConcordiaSearchResult> result =
        concordia.concordiaSearch("Our new test product has nothing to do with computers");

    // printing results
    cout << "Printing all matched fragments:" << endl;
    BOOST_FOREACH(MatchedPatternFragment fragment, result->getFragments()) {
        cout << "Matched pattern fragment found. Pattern fragment: ["
             << fragment.getStart() << "," << fragment.getEnd() << "]"
             << " in sentence " << fragment.getExampleId()
             << ", at offset: " << fragment.getExampleOffset() << endl;
    }
    cout << "Best overlay:" << endl;
    BOOST_FOREACH(MatchedPatternFragment fragment, result->getBestOverlay()) {
        cout << "\tPattern fragment: [" << fragment.getStart()
             << "," << fragment.getEnd() << "]"
             << " in sentence " << fragment.getExampleId()
             << ", at offset: " << fragment.getExampleOffset() << endl;
    }
    cout << "Best overlay score: " << result->getBestOverlayScore() << endl;

    // clearing index
    concordia.clearIndex();
}
\endverbatim
This program should print:
\verbatim
Added the following tokens:
"alice" at positions: [0,5)
"has" at positions: [6,9)
"a" at positions: [10,11)
"cat" at positions: [12,15)
Searching for pattern: Our new test product has nothing to do with computers
Printing all matched fragments:
Matched pattern fragment found. Pattern fragment: [4,9] in sentence 14, at offset: 6
Matched pattern fragment found. Pattern fragment: [1,5] in sentence 321, at offset: 0
Matched pattern fragment found. Pattern fragment: [5,9] in sentence 14, at offset: 7
Matched pattern fragment found. Pattern fragment: [2,5] in sentence 321, at offset: 1
Matched pattern fragment found. Pattern fragment: [6,9] in sentence 14, at offset: 8
Matched pattern fragment found. Pattern fragment: [3,5] in sentence 321, at offset: 2
Matched pattern fragment found. Pattern fragment: [7,9] in sentence 14, at offset: 9
Matched pattern fragment found. Pattern fragment: [8,9] in sentence 14, at offset: 10
Best overlay:
	Pattern fragment: [1,5] in sentence 321, at offset: 0
	Pattern fragment: [5,9] in sentence 14, at offset: 7
Best overlay score: 0.53695
\endverbatim
These results list all the longest matched pattern fragments. The longest is [4,9] (length 5, as the end index is exclusive), which corresponds to the pattern fragment "has nothing to do with", found in sentence 14 at offset 6. However, this longest fragment was not chosen for the best overlay. The best overlay consists of two fragments of length 4: [1,5] "new test product has" and [5,9] "nothing to do with". Notice that if the fragment [4,9] had been chosen for the overlay, it would have eliminated the [1,5] fragment.

The score of this overlay is 0.53695, which can be considered quite satisfactory for serving as an aid to a translator.
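
The selection criterion stated above (cover as much of the pattern as possible, then prefer fewer fragments) can be reproduced with a short standalone sketch. It assumes C++11, and the names Fragment and bestOverlay are hypothetical. It uses a simple dynamic program over pattern positions. This is only one way to satisfy the criterion and is not claimed to be the algorithm Concordia uses. The overlay score is not computed, since its formula is not given in this tutorial.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical fragment of the pattern, as a half-open word range [start, end).
struct Fragment {
    std::size_t start;
    std::size_t end;
};

// Picks non-intersecting fragments that cover the most pattern words,
// breaking ties in favour of fewer fragments.
std::vector<Fragment> bestOverlay(const std::vector<Fragment>& fragments,
                                  std::size_t patternLength) {
    // covered[p] and count[p] describe the best overlay that uses only
    // pattern positions [0, p); chosen[p] is the index of the fragment
    // ending exactly at p in that overlay, or -1 if there is none.
    std::vector<std::size_t> covered(patternLength + 1, 0);
    std::vector<std::size_t> count(patternLength + 1, 0);
    std::vector<int> chosen(patternLength + 1, -1);

    for (std::size_t p = 1; p <= patternLength; ++p) {
        // Default: position p - 1 stays uncovered.
        covered[p] = covered[p - 1];
        count[p] = count[p - 1];
        for (std::size_t i = 0; i < fragments.size(); ++i) {
            const Fragment& f = fragments[i];
            if (f.end != p || f.start >= f.end) {
                continue;
            }
            const std::size_t c = covered[f.start] + (f.end - f.start);
            const std::size_t n = count[f.start] + 1;
            if (c > covered[p] || (c == covered[p] && n < count[p])) {
                covered[p] = c;
                count[p] = n;
                chosen[p] = static_cast<int>(i);
            }
        }
    }

    // Walk back from the end of the pattern to collect the chosen fragments.
    std::vector<Fragment> overlay;
    std::size_t p = patternLength;
    while (p > 0) {
        if (chosen[p] < 0) {
            --p;
        } else {
            const Fragment& f = fragments[static_cast<std::size_t>(chosen[p])];
            overlay.push_back(f);
            p = f.start;
        }
    }
    std::reverse(overlay.begin(), overlay.end());
    return overlay;
}
```

Given the eight fragments printed above and a pattern length of 10 words, this sketch selects [1,5] and [5,9], which cover 8 words. Any overlay containing [4,9] covers at most 5 words, because [4,9] intersects every fragment that ends at position 5.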
\section tutorial2 Concept of HDD and RAM index
The Concordia index consists of 4 data structures: hashed index, markers array, word map and suffix array. For searching to work, all of these structures must be present in RAM.

However, because the hashed index, markers array and word map are potentially large and generating them may take a considerable amount of time, they are backed up on hard disk. Each operation of adding to the index adds simultaneously to the hashed index, markers array and word map, both in RAM and on HDD.

The last element of the index, the suffix array, is never backed up on disk but is always generated dynamically. The generation is done by the method refreshSAfromRAM(), which builds the suffix array (SA) from the current hashed index, markers array and word map in RAM. Generation of the SA for an index containing 2 000 000 sentences takes about 7 seconds on a personal computer. The reason for not backing up the SA on HDD is that it needs to be generated from scratch every time the index changes. There is no way of incrementally augmenting this structure.

There is another method: loadRAMIndexFromDisk(). It loads the hashed index, markers array and word map from HDD to RAM and calls refreshSAfromRAM(). The method loadRAMIndexFromDisk() is called when Concordia starts and the paths of the hashed index, markers array and word map point to non-empty files on HDD (i.e. something was added to the index in previous runs of Concordia).
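
The need to regenerate the suffix array after every change can be illustrated with a standalone sketch. It assumes C++11 and represents the text as a plain sequence of integer word codes. The names buildSuffixArray, comparePrefix and findOccurrences are hypothetical. The sketch uses a plain comparison sort, ignores sentence boundaries (which the real index tracks), and is not a description of Concordia's internal data layout or construction algorithm.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Builds a word-level suffix array from scratch: all start positions in
// `codes`, sorted lexicographically by the suffix beginning at each position.
std::vector<std::size_t> buildSuffixArray(const std::vector<int>& codes) {
    std::vector<std::size_t> suffixes(codes.size());
    std::iota(suffixes.begin(), suffixes.end(), std::size_t(0));
    std::sort(suffixes.begin(), suffixes.end(),
              [&codes](std::size_t a, std::size_t b) {
                  return std::lexicographical_compare(
                      codes.begin() + static_cast<std::ptrdiff_t>(a), codes.end(),
                      codes.begin() + static_cast<std::ptrdiff_t>(b), codes.end());
              });
    return suffixes;
}

// Compares the suffix starting at `pos`, truncated to the pattern length,
// with the pattern. Returns a negative, zero or positive value.
int comparePrefix(const std::vector<int>& codes, std::size_t pos,
                  const std::vector<int>& pattern) {
    for (std::size_t i = 0; i < pattern.size(); ++i) {
        if (pos + i >= codes.size()) return -1;  // shorter suffix sorts first
        if (codes[pos + i] < pattern[i]) return -1;
        if (codes[pos + i] > pattern[i]) return 1;
    }
    return 0;
}

// Finds all start positions of `pattern` by binary search in the suffix array.
std::vector<std::size_t> findOccurrences(const std::vector<int>& codes,
                                         const std::vector<std::size_t>& suffixArray,
                                         const std::vector<int>& pattern) {
    std::vector<std::size_t>::const_iterator lower = std::lower_bound(
        suffixArray.begin(), suffixArray.end(), pattern,
        [&codes](std::size_t pos, const std::vector<int>& pat) {
            return comparePrefix(codes, pos, pat) < 0;
        });
    std::vector<std::size_t>::const_iterator upper = std::upper_bound(
        lower, suffixArray.end(), pattern,
        [&codes](const std::vector<int>& pat, std::size_t pos) {
            return comparePrefix(codes, pos, pat) > 0;
        });
    std::vector<std::size_t> positions(lower, upper);
    std::sort(positions.begin(), positions.end());
    return positions;
}
```

In this sketch, appending new word codes means calling buildSuffixArray again on the whole sequence before the next search, which mirrors the add-then-refresh order used in the examples above.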
\section tutorial3 Concordia configuration
Concordia is configured by means of a configuration file in the libconfig format (http://www.hyperrealm.com/libconfig/). Below is the sample configuration file that comes with the library. Its path is <CONCORDIA_HOME>/tests/resources/concordia-config/concordia.cfg. Note that all the settings in this file are required.

Every option is documented in comments within the configuration file.
\verbatim
#----------------------------
# Concordia configuration file
#---------------------------
#
#-------------------------------------------------------------------------------
# The below set the paths for hashed index, markers array and word map files.
# If all the files pointed by these paths exist, Concordia reads them to its
# RAM index. When none of these files exist, a new empty index is created.
# However, if any of these files exist and any other is missing, the index
# is considered corrupt and Concordia does not start.
hashed_index_path = "<CONCORDIA_HOME>/tests/resources/temp/temp_hashed_index.bin"
markers_path = "<CONCORDIA_HOME>/tests/resources/temp/temp_markers.bin"
word_map_path = "<CONCORDIA_HOME>/tests/resources/temp/temp_word_map.bin"
#-------------------------------------------------------------------------------
# The following settings control the sentence tokenizer mechanism. Tokenizer
# takes into account html tags, substitutes predefined symbols
# with a single space, removes stop words (if the option is enabled), as well as
# named entities and special symbols. All these have to be listed in files.
# File containing all html tags (one per line)
html_tags_path = "<CONCORDIA_HOME>/tests/resources/anonymizer/html_tags.txt"
# File containing all symbols to be replaced by spaces
space_symbols_path = "<CONCORDIA_HOME>/tests/resources/anonymizer/space_symbols.txt"
# If set to true, words from predefined list are removed
stop_words_enabled = "false"
# If stop_words_enabled is true, set the path to the stop words file
#stop_words_path = "<CONCORDIA_HOME>/tests/resources/anonymizer/stop_words.txt"
# File containing regular expressions that match named entities
named_entities_path = "<CONCORDIA_HOME>/tests/resources/anonymizer/named_entities.txt"
# File containing special symbols (one per line) to be removed
stop_symbols_path = "<CONCORDIA_HOME>/tests/resources/anonymizer/stop_symbols.txt"
### eof
\endverbatim
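
The start-up rule described in the comments for the three index paths (all files present, none present, or a mix) can be written down as a small standalone sketch. It assumes C++11. The names IndexFilesState, classifyIndexFiles, fileExists and classifyIndexPaths are hypothetical and are not part of the Concordia API. Treating "exists" as "can be opened for reading" is likewise an assumption.

```cpp
#include <fstream>
#include <string>

// Hypothetical outcome of checking the three index files at start-up.
enum class IndexFilesState {
    ReadExisting,  // all three files exist: load them into the RAM index
    CreateEmpty,   // none of the files exist: start with a new empty index
    Corrupt        // only some of the files exist: refuse to start
};

// Applies the rule from the configuration comments to three existence flags.
IndexFilesState classifyIndexFiles(bool hashedIndexExists,
                                   bool markersExist,
                                   bool wordMapExists) {
    const int present = (hashedIndexExists ? 1 : 0)
                      + (markersExist ? 1 : 0)
                      + (wordMapExists ? 1 : 0);
    if (present == 3) {
        return IndexFilesState::ReadExisting;
    }
    if (present == 0) {
        return IndexFilesState::CreateEmpty;
    }
    return IndexFilesState::Corrupt;
}

// Assumption: a file "exists" if it can be opened for reading.
bool fileExists(const std::string& path) {
    std::ifstream file(path.c_str());
    return file.good();
}

// Convenience wrapper taking the three configured paths.
IndexFilesState classifyIndexPaths(const std::string& hashedIndexPath,
                                   const std::string& markersPath,
                                   const std::string& wordMapPath) {
    return classifyIndexFiles(fileExists(hashedIndexPath),
                              fileExists(markersPath),
                              fileExists(wordMapPath));
}
```

A check of this kind is useful before starting Concordia with a copied or partially deleted index directory, because a mixed state prevents start-up.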
\section tutorial4 The concordia-console program
After a successful build of the project (see \ref compilation2), the concordia-console program is available in the build/concordia-console folder.
\subsection tutorial4_1 concordia-console options
The full list of program options is given below:
\verbatim
  -h [ --help ]                  Display this message
  -c [ --config ] arg            Concordia configuration file (required)
  -s [ --simple-search ] arg     Pattern to be searched in the index
  -n [ --silent ]                While searching, do not output search results
  -a [ --anubis-search ] arg     Pattern to be searched by anubis search in the index
  -x [ --concordia-search ] arg  Pattern to be searched by concordia search in the index
  -r [ --read-file ] arg         File to be read and added to index
  -t [ --test ] arg              Run performance and correctness tests on file
\endverbatim
\subsection tutorial4_2 concordia-console example run
Run the following commands from the <CONCORDIA_HOME> directory.

Read sentences from the file sentences.txt and add them to the index:
\verbatim
./build/concordia-console/concordia-console -c tests/resources/concordia-config/concordia.cfg -r ~/sentences.txt
\endverbatim
Run Concordia search on the index:
\verbatim
./build/concordia-console/concordia-console -c tests/resources/concordia-config/concordia.cfg -x "some pattern"
\endverbatim
*/