This section describes a few examples of C++ programs that use the Concordia library. You can run them after successfully installing Concordia (the installation process is covered in \ref compilation). Their source code is located in the project's main directory, in the subfolder "examples".
The directory also contains a simple CMakeLists.txt file, which helps to perform compilation and linking of the examples. In order to compile the examples, issue the following commands from within the examples directory:
After these operations, three executables are created in the build directory: first, simple_search and concordia_search. A small config.hpp file is also generated to store the path to the examples folder.
First, sentences are added to the index along with their integer IDs. The pair (sentence, id) is called an Example. Note that the IDs used in the above code are not consecutive, as there is no such requirement. A sentence ID may come from another data source, e.g. a database, and is used only as sentence meta-information.
After adding the examples, the index needs to be generated using the method refreshSAfromRAM. Details of this operation are covered in \ref tutorial2.
The search returns a vector of SubstringOccurence objects, which is then printed out. Each occurrence represents a single match of the pattern. The pattern has to be matched within a single sentence. Information about the match consists of two integer values: the ID of the sentence where the match occurred and the word-level, 0-based offset of the matched pattern in that sentence. The above code should return the following results:
\verbatim
Found substring in sentence: 56 at offset: 1
Found substring in sentence: 23 at offset: 1
Found substring in sentence: 321 at offset: 3
\endverbatim
Match (321, 3) represents a match of the pattern "has a" in sentence 321 ("New test product has a mistake"), starting at position 3, i.e. after the third word, which is "product".
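To make the meaning of the (sentence ID, offset) pair concrete, here is a minimal, self-contained sketch of word-level matching. It does not use Concordia and is not how Concordia searches (the library relies on its index); it is a naive linear scan. The names Occurrence, splitWords and naiveSearch are hypothetical and introduced only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the (sentence ID, offset) pair described above;
// this is NOT Concordia's SubstringOccurence class.
struct Occurrence {
    int sentenceId;      // ID supplied when the sentence was added
    std::size_t offset;  // word-level, 0-based position of the match
};

// Splits a sentence into words on whitespace.
std::vector<std::string> splitWords(const std::string& text) {
    std::istringstream stream(text);
    std::vector<std::string> words;
    std::string word;
    while (stream >> word) {
        words.push_back(word);
    }
    return words;
}

// Naive linear scan: reports every position where the words of the pattern
// equal consecutive words of a single sentence.
std::vector<Occurrence> naiveSearch(
        const std::vector<std::pair<int, std::string>>& examples,
        const std::string& pattern) {
    const std::vector<std::string> needle = splitWords(pattern);
    assert(!needle.empty());  // an empty pattern has no meaningful offset
    std::vector<Occurrence> result;
    for (const auto& example : examples) {
        const std::vector<std::string> words = splitWords(example.second);
        for (std::size_t start = 0; start + needle.size() <= words.size(); ++start) {
            bool matches = true;
            for (std::size_t i = 0; i < needle.size(); ++i) {
                if (words[start + i] != needle[i]) {
                    matches = false;
                    break;
                }
            }
            if (matches) {
                result.push_back({example.first, start});
            }
        }
    }
    return result;
}
```

Calling naiveSearch with the single example (321, "New test product has a mistake") and the pattern "has a" yields one occurrence with sentenceId 321 and offset 3, which agrees with the match (321, 3) discussed above.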
\subsection tutorial1_3 Concordia searching
Concordia is equipped with a unique functionality, the so-called Concordia search, which is best suited for use in Computer-Aided Translation systems. This operation is aimed at finding the longest matches from the index that cover the search pattern. Such a match is called a MatchedPatternFragment. Then, out of all matched pattern fragments, the best pattern overlay is computed. A pattern overlay is a set of matched pattern fragments which do not intersect with each other. The best pattern overlay is the overlay that covers the largest part of the pattern with the fewest fragments.
Additionally, the score for this best overlay is computed. The score is a real number between 0 and 1, where 0 indicates that the pattern is not covered at all (i.e. not a single word from the pattern is found in the index). The score 1 represents a perfect match: the pattern is covered completely by just one fragment, which means that the pattern is found in the index as one of the examples.
\verbatim
Searching for pattern: Our new test product has nothing to do with computers
Printing all matched fragments:
Matched pattern fragment found. Pattern fragment: [4,9] in sentence 14, at offset: 6
Matched pattern fragment found. Pattern fragment: [1,5] in sentence 321, at offset: 0
Matched pattern fragment found. Pattern fragment: [5,9] in sentence 14, at offset: 7
Matched pattern fragment found. Pattern fragment: [2,5] in sentence 321, at offset: 1
Matched pattern fragment found. Pattern fragment: [6,9] in sentence 14, at offset: 8
Matched pattern fragment found. Pattern fragment: [3,5] in sentence 321, at offset: 2
Matched pattern fragment found. Pattern fragment: [7,9] in sentence 14, at offset: 9
Matched pattern fragment found. Pattern fragment: [8,9] in sentence 14, at offset: 10
Best overlay:
Pattern fragment: [1,5] in sentence 321, at offset: 0
Pattern fragment: [5,9] in sentence 14, at offset: 7
Best overlay score: 0.53695
\endverbatim
These results list all the longest matched pattern fragments. The longest is [4,9] (length 5, as the end index is exclusive), which corresponds to the pattern fragment "has nothing to do with", found in sentence 14 at offset 6. However, this longest fragment was not chosen for the best overlay. The best overlay consists of two fragments of length 4: [1,5] "new test product has" and [5,9] "nothing to do with". Notice that if the fragment [4,9] had been chosen for the overlay, it would have eliminated the [1,5] fragment, because the two intersect at position 4.
The score of this overlay is 0.53695, which can be considered quite satisfactory for serving as an aid to a translator.
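The choice of the best overlay can be illustrated with a small, self-contained sketch. It reads "best" as: cover as many pattern words as possible and, among equally good overlays, use the fewest fragments. It tries every subset of fragments, which is practical only for a handful of them. This is not Concordia's implementation: the names Fragment, intersects and bestOverlay are hypothetical, and the overlay score is deliberately not computed because its formula is not given here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Half-open interval [begin, end) of word positions in the pattern.
struct Fragment {
    std::size_t begin;
    std::size_t end;
};

// Two half-open intervals intersect when each starts before the other ends.
bool intersects(const Fragment& a, const Fragment& b) {
    return a.begin < b.end && b.begin < a.end;
}

// Exhaustive search over all subsets of fragments. A subset is a valid
// overlay if no two of its fragments intersect. The winner covers the most
// words; ties are broken in favour of fewer fragments.
std::vector<Fragment> bestOverlay(const std::vector<Fragment>& fragments) {
    assert(fragments.size() < 20);  // 2^n subsets: small inputs only
    std::vector<Fragment> best;
    std::size_t bestCovered = 0;
    const std::size_t subsetCount = std::size_t{1} << fragments.size();
    for (std::size_t mask = 0; mask < subsetCount; ++mask) {
        std::vector<Fragment> chosen;
        std::size_t covered = 0;
        bool valid = true;
        for (std::size_t i = 0; i < fragments.size() && valid; ++i) {
            if ((mask & (std::size_t{1} << i)) == 0) {
                continue;  // fragment i is not part of this subset
            }
            for (const Fragment& other : chosen) {
                if (intersects(fragments[i], other)) {
                    valid = false;
                    break;
                }
            }
            if (valid) {
                chosen.push_back(fragments[i]);
                covered += fragments[i].end - fragments[i].begin;
            }
        }
        if (!valid) {
            continue;
        }
        const bool moreCoverage = covered > bestCovered;
        const bool fewerFragments =
            covered == bestCovered && chosen.size() < best.size();
        if (moreCoverage || fewerFragments) {
            best = chosen;
            bestCovered = covered;
        }
    }
    return best;
}
```

Given the eight fragments from the output above ([4,9], [1,5], [5,9], [2,5], [6,9], [3,5], [7,9], [8,9]), this sketch selects [1,5] and [5,9], which together cover 8 words. The fragment [4,9] cannot be combined with [1,5], and on its own it covers only 5 words.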
\section tutorial2 Concept of HDD and RAM index
The Concordia index consists of 4 data structures: the hashed index, the markers array, the word map and the suffix array. For searching to work, all of these structures must be present in RAM.
However, because the hashed index, markers array and word map are potentially large and generating them might take a considerable amount of time, they are backed up on hard disk. Each operation of adding to the index simultaneously adds to the hashed index, markers array and word map both in RAM and on HDD.
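The following toy model may help to picture how the three disk-backed structures could relate to each other. It rests on assumptions that are not stated in this tutorial: that the word map assigns an integer code to each distinct word, that the hashed index is the sequence of these codes for all added sentences, and that the markers array records the sentence ID and word offset for each position of the hashed index. The type ToyIndex and its members are hypothetical, and the sketch keeps everything in RAM only.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// Assumed content of one markers array entry (hypothetical layout).
struct Marker {
    int sentenceId;      // ID of the sentence the word belongs to
    std::size_t offset;  // 0-based word offset within that sentence
};

// Toy, RAM-only model of the three structures; not Concordia's classes.
struct ToyIndex {
    std::unordered_map<std::string, int> wordMap;  // word -> integer code
    std::vector<int> hashedIndex;                  // all sentences as codes
    std::vector<Marker> markers;                   // one entry per code

    // Adds one example: every word extends all three structures together.
    void addExample(const std::string& sentence, int id) {
        std::istringstream stream(sentence);
        std::string word;
        std::size_t offset = 0;
        while (stream >> word) {
            // emplace inserts a fresh code only if the word is unseen;
            // otherwise it returns the existing entry.
            const int nextCode = static_cast<int>(wordMap.size());
            const int code = wordMap.emplace(word, nextCode).first->second;
            hashedIndex.push_back(code);
            markers.push_back({id, offset});
            ++offset;
        }
        assert(hashedIndex.size() == markers.size());
    }
};
```

Under these assumptions, adding a sentence is a cheap append to each structure, which is consistent with the statement that every add operation updates all three of them at once.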
The last element of the index, the suffix array, is never backed up on disk but is always generated dynamically. This is done by the method refreshSAfromRAM(), which builds the suffix array (SA) from the hashed index, markers array and word map currently held in RAM. Generation of the SA for an index containing 2 000 000 sentences takes about 7 seconds on a personal computer. The SA is not backed up on HDD because it has to be generated from scratch every time the index changes; there is no way of augmenting this structure incrementally.
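As an illustration of why the suffix array is regenerated rather than patched, here is a minimal sketch that builds a suffix array over a sequence of integer word codes by plain sorting. It is a naive construction chosen for brevity, not the algorithm used by refreshSAfromRAM(); the function names buildSuffixArray and appendAndRebuild are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Builds a suffix array over integer word codes: the start positions of all
// suffixes, sorted by the lexicographic order of the suffixes themselves.
std::vector<std::size_t> buildSuffixArray(const std::vector<int>& codes) {
    std::vector<std::size_t> suffixes(codes.size());
    std::iota(suffixes.begin(), suffixes.end(), std::size_t{0});
    std::sort(suffixes.begin(), suffixes.end(),
              [&codes](std::size_t a, std::size_t b) {
                  return std::lexicographical_compare(
                      codes.begin() + static_cast<std::ptrdiff_t>(a), codes.end(),
                      codes.begin() + static_cast<std::ptrdiff_t>(b), codes.end());
              });
    assert(suffixes.size() == codes.size());  // one entry per suffix
    return suffixes;
}

// Appending codes changes every existing suffix (each one becomes longer),
// so this sketch simply rebuilds the whole array.
std::vector<std::size_t> appendAndRebuild(std::vector<int>& codes,
                                          const std::vector<int>& added) {
    codes.insert(codes.end(), added.begin(), added.end());
    return buildSuffixArray(codes);
}
```

For the codes {2, 1, 2, 1, 3} the sketch produces {1, 3, 0, 2, 4}. After appending the single code 1, the result is {5, 1, 3, 0, 2, 4}: the new suffix lands at the front of the order, which shows that new data can end up anywhere in the array.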
There is another method: loadRAMIndexFromDisk(). It loads the hashed index, markers array and word map from HDD to RAM and then calls refreshSAfromRAM(). The method loadRAMIndexFromDisk() is called when Concordia starts and the paths of the hashed index, markers array and word map point to non-empty files on HDD (i.e. something was added to the index in previous runs of Concordia).
\section tutorial3 Concordia configuration
Concordia is configured by means of a configuration file in the libconfig format (http://www.hyperrealm.com/libconfig/). Here is the sample configuration file, which comes with the library. Its path is <CONCORDIA_HOME>/tests/resources/concordia-config/concordia.cfg. Note that all the settings in this file are required.
Every option is documented in comments within the configuration file.