Commit Graph

6 Commits

Author SHA1 Message Date
Tom Morris
0562638ffa
Use standard text normalization - fixes #2898 (#2900)
* Use standard text normalization - fixes #2898

Fixes #2898. Fixes #409. Refs #650

Replaces homegrown ISO Latin-1 only character substitution
with standard Java Normalize to NFD, followed by diacritic
removal and a few custom character expansions/replacements.

* Fix Mac build

* Improve compatibility with previous code

One intentional change is folding O with stroke to
oe instead of o.

- Use more powerful NFKD instead of NFD
- strip punctuation after decomposition since it can generate
  new punctuation
- Add compatibility test for old asciify() method
- Add some graphically similar characters to substitution table

* Add oe character/ligature & more long S forms

* More tests for ligatures and Latin Extended

* Add Latin-1 Supplement tests
2020-07-07 21:35:41 +02:00
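
The approach described in commit 0562638ffa (compatibility decomposition, diacritic removal, a few custom expansions, then punctuation stripping) might look roughly like the sketch below. It is an illustration only, not the project's actual code: the class name AsciifySketch, the method fold, and the two-entry replacement table are assumptions made for this example. Only java.text.Normalizer and the standard regex classes are used.

```java
import java.text.Normalizer;

public class AsciifySketch {
    // Hypothetical helper; the real substitution table is larger.
    public static String fold(String s) {
        // Custom expansions for letters that have no Unicode decomposition.
        // O with stroke folds to "oe", as the commit message describes.
        String expanded = s
                .replace("\u00f8", "oe")   // o with stroke
                .replace("\u00d8", "OE")   // O with stroke
                .replace("\u0153", "oe");  // oe ligature
        // NFKD splits accented letters into base letter + combining marks
        // and also expands compatibility characters such as ligatures.
        String decomposed = Normalizer.normalize(expanded, Normalizer.Form.NFKD);
        // Remove the combining marks (diacritics) left by decomposition.
        String stripped = decomposed.replaceAll("\\p{M}", "");
        // Strip ASCII punctuation last, because decomposition can create
        // new punctuation (for example, an ellipsis becomes three dots).
        return stripped.replaceAll("\\p{Punct}", "");
    }

    public static void main(String[] args) {
        System.out.println(fold("Cr\u00e8me Br\u00fbl\u00e9e"));
        System.out.println(fold("S\u00f8ren"));
    }
}
```

Stripping punctuation after decomposition rather than before is the ordering the commit message calls out; doing it earlier would leave the dots produced by an expanded ellipsis in the result.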
Tom Morris
e61d50a1aa
Fix NGramFingerprintKeyer to ignore accents - fixes #1161 (#2899)
Fixes #1161
This change parallels what was done in #1257 1da3c00 to fix
the FingerprintKeyer and moves the diacritic removal before
the deduping. Includes a test.
2020-07-07 09:02:49 +02:00
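
The ordering issue fixed in commit e61d50a1aa can be illustrated with a minimal n-gram keyer. This is a sketch under assumptions, not the NGramFingerprintKeyer source: the class NGramKeySketch and its methods are hypothetical, and the cleaning rules are simplified. The point it shows is that diacritics are removed before the n-grams are collected and deduplicated, so accented and unaccented spellings yield the same key.

```java
import java.text.Normalizer;
import java.util.Locale;
import java.util.TreeSet;

public class NGramKeySketch {
    // Hypothetical helper: decompose, then drop combining marks.
    public static String stripDiacritics(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("\\p{M}", "");
    }

    // Hypothetical keyer: clean the string, then build sorted unique n-grams.
    public static String key(String s, int n) {
        // Remove diacritics first, so that the dedup step below
        // sees identical n-grams for accented and plain spellings.
        String cleaned = stripDiacritics(s.toLowerCase(Locale.ROOT))
                .replaceAll("[\\p{Punct}\\s]", "");
        // TreeSet both deduplicates and sorts the n-grams.
        TreeSet<String> grams = new TreeSet<>();
        for (int i = 0; i + n <= cleaned.length(); i++) {
            grams.add(cleaned.substring(i, i + n));
        }
        return String.join("", grams);
    }

    public static void main(String[] args) {
        System.out.println(key("\u00e9cole", 2).equals(key("ecole", 2)));
    }
}
```

If the accents were stripped only after deduplication, an n-gram containing an accented letter and its plain counterpart would both survive as separate entries and the two spellings could produce different keys.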
Thad Guidry
009c587437
Remove unused imports (#2574)
2020-04-21 15:51:01 +02:00
Antonin Delpeuch
95b063162d
Fix clusters with single candidates. Closes #2152.
2019-09-11 12:12:32 +01:00
Antonin Delpeuch
46acc21a43
Move tests to their appropriate packages and deduplicate them, for #2133
2019-08-23 13:27:20 +01:00
Antonin Delpeuch
2b03efd84f
Rename test packages to match tested ones, for #2133
2019-08-23 11:55:31 +01:00