* Fix text guesser so it doesn't guess wikitext
Fixes #2850
- Add a simple magic-number detector for zip & gzip files to keep
it from attempting to guess binary files (see the sketch below)
- Add a counter for C0 controls for the same reason
- Tighten the wikitable counters to require the marker at the
beginning of the line, per the specification
- Refactor to use Apache Commons instead of private
counting methods
- Add tests for most TextGuesser formats
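A minimal sketch of the magic-number and C0-control checks described above, under the assumption that the guesser can peek at the first few bytes of the stream; the class and method names are illustrative, not the actual TextGuesser code:

```java
import java.io.IOException;
import java.io.InputStream;

class BinarySniffer {
    // ZIP local file header starts with "PK\x03\x04"; gzip starts with 0x1F 0x8B.
    private static final byte[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};
    private static final byte[] GZIP_MAGIC = {0x1F, (byte) 0x8B};

    static boolean looksBinary(InputStream in) throws IOException {
        byte[] head = new byte[4];
        int n = in.readNBytes(head, 0, 4);
        if (n >= 4 && startsWith(head, ZIP_MAGIC)) return true;
        return n >= 2 && startsWith(head, GZIP_MAGIC);
    }

    // Count C0 control characters other than tab/CR/LF; a high count
    // is a strong hint that the data is binary and should not be guessed.
    static int countC0Controls(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x20 && c != '\t' && c != '\r' && c != '\n') count++;
        }
        return count;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }
}
```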
* Remove misplaced duplicate test data file
* Fix LGTM warning + minor cleanups
* Use BoundedInputStream to prevent runaway lines
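For reference, a hedged example of capping a read with Commons IO's BoundedInputStream, assuming the two-argument constructor; the 1 MB limit is illustrative, not the value used in the importer:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.commons.io.input.BoundedInputStream;

class LineCapExample {
    // Read the first line, but never pull more than maxBytes from the
    // underlying stream, so a file with no line terminators can't run away.
    static String readCappedLine(InputStream raw, long maxBytes) throws IOException {
        BoundedInputStream capped = new BoundedInputStream(raw, maxBytes);
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(capped, StandardCharsets.UTF_8));
        return reader.readLine();  // returns at EOF of the capped stream
    }
}

// Usage (illustrative): readCappedLine(fileStream, 1024 * 1024);
```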
* Add utility functions to check/convert dates
* Add date tests and refactor to DRY up
* Fix date import - fixes #1908
Change from java.util.Date to OpenRefine 3.0+'s OffsetDateTime
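A minimal sketch of that conversion; the helper names are illustrative (the later commits centralize the real versions in ParsingUtilities):

```java
import java.time.OffsetDateTime;
import java.time.ZoneOffset;
import java.util.Date;

class DateConversionSketch {
    // Convert a legacy java.util.Date to the OffsetDateTime representation
    // used by OpenRefine 3.0+; UTC is assumed here for illustration.
    static OffsetDateTime toOffsetDateTime(Date d) {
        return d == null ? null : d.toInstant().atOffset(ZoneOffset.UTC);
    }

    // And back, for code paths that still expect java.util.Date.
    static Date toDate(OffsetDateTime odt) {
        return odt == null ? null : Date.from(odt.toInstant());
    }
}
```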
* Centralize date conversion
* Move utility methods to ParsingUtilities
* Fix tests
* Use standard text normalization - fixes #2898. Fixes #409. Refs #650
Replaces the homegrown, ISO Latin-1-only character substitution
with standard Java normalization to NFD, followed by diacritic
removal and a few custom character expansions/replacements.
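A sketch of that standard approach, assuming the diacritics to remove are the combining marks left over after decomposition (the custom expansion table mentioned above is omitted):

```java
import java.text.Normalizer;

class AsciifySketch {
    // Decompose to NFD so an accented letter becomes base letter + combining
    // mark, then strip the combining marks (Unicode category M).
    static String stripDiacritics(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }
}

// e.g. stripDiacritics("café") -> "cafe"
```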
* Fix Mac build
* Improve compatibility with previous code
One intentional change is folding O with stroke to
oe instead of o.
- Use the more powerful NFKD instead of NFD
- Strip punctuation after decomposition, since decomposition can
generate new punctuation (see the sketch below)
- Add compatibility test for old asciify() method
- Add some graphically similar characters to substitution table
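A sketch of why NFKD plus a post-decomposition punctuation strip matters: compatibility decomposition expands ligatures and can itself emit new punctuation, so the strip has to happen afterwards (the pattern below is illustrative, not the exact one used):

```java
import java.text.Normalizer;

class NfkdSketch {
    static String fold(String s) {
        // NFKD also expands compatibility characters: "ﬁ" -> "fi", "…" -> "...".
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFKD);
        // Strip combining marks and punctuation only *after* decomposition,
        // because decomposition can generate new punctuation (the dots above).
        return decomposed.replaceAll("[\\p{M}\\p{P}]", "");
    }
}
```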
* Add oe character/ligature & more long S forms
* More tests for ligatures and Latin Extended
* Add Latin-1 Supplement tests
Fixes #1161
This change parallels what was done in #1257 (1da3c00) to fix
the FingerprintKeyer and moves the diacritic removal before
the deduping. Includes a test.
* Truncate any completely empty columns on the right
Fixes #565
Current versions of OpenOffice create default spreadsheets
with over 1000 empty columns. Keep track of the rightmost
non-empty column when importing and truncate everything else.
Also adds a basic ODS import test.
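A minimal sketch of that right-truncation logic, assuming each row arrives as a list of cell values; the names are illustrative, not the importer's actual code:

```java
import java.util.List;

class TrailingColumnTrimmer {
    // Rightmost column index that contained any non-blank value so far.
    private int lastNonEmptyColumn = -1;

    void observeRow(List<Object> cells) {
        for (int i = cells.size() - 1; i > lastNonEmptyColumn; i--) {
            Object v = cells.get(i);
            if (v != null && !v.toString().trim().isEmpty()) {
                lastNonEmptyColumn = i;
                break;
            }
        }
    }

    // After all rows are seen, drop everything to the right of the last
    // non-empty column (e.g. OpenOffice's ~1000 default blank columns).
    List<Object> truncate(List<Object> cells) {
        int limit = Math.min(cells.size(), lastNonEmptyColumn + 1);
        return cells.subList(0, limit);
    }
}
```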
* Fix dates in ODS spreadsheets
Fixes #2224
* Performance optimized version of ToNumber
Approximately 5x faster for floats (data dependent)
and about the same speed for integers.
- Instead of blindly trying to parse as Long, do a quick check
for obvious problems (e.g. a decimal point); see the sketch below.
- Don't trim. It's already done by called methods.
- Use valueOf() instead of parse() to avoid object creation
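A sketch of that fast path, assuming the input is already a trimmed string; the characters checked are illustrative of the heuristic, not an exact copy of it:

```java
class ToNumberSketch {
    static Number toNumber(String s) {
        // Quick scan for characters that rule out a plain long, so we only
        // pay for a failed Long parse when it actually has a chance.
        boolean maybeLong = s.indexOf('.') < 0 && s.indexOf('e') < 0 && s.indexOf('E') < 0;
        if (maybeLong) {
            try {
                return Long.valueOf(s);  // valueOf() can reuse cached instances
            } catch (NumberFormatException e) {
                // fall through, e.g. for values that overflow a long
            }
        }
        return Double.valueOf(s);  // throws NumberFormatException if not numeric
    }
}
```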
* Add Java Microbenchmark Harness
The shaded JAR is missing the OpenRefine classes, for a reason
that I haven't figured out, so it requires openrefine-main.jar at runtime.
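For reference, the general shape of a JMH benchmark; the benchmarked call is a placeholder, not the actual ToNumber entry point:

```java
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class ToNumberBenchmark {
    private final String floatInput = "12345.6789";

    @Benchmark
    public Object parseFloat() {
        // Placeholder for the real ToNumber.call(...) invocation.
        return Double.valueOf(floatInput);
    }
}
```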
* Remove old implementations of ToNumber
* Remove unneeded dependencies from main project
* Clean up and reformat
Refs #2863
The tree importer sorts columns/column groups by how populated
they are, which is of arguable utility, but the tie-breaker
of ordering by shortest column name is completely silly.
This change removes that and, in conjunction with a stable sort
algorithm, will preserve the original order of the columns.
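The resulting ordering, sketched for illustration: sort by how populated each column is and rely on the sort's stability to keep the original order among ties. The class names here are hypothetical, not the tree importer's actual types:

```java
import java.util.Comparator;
import java.util.List;

class ColumnOrderingSketch {
    static class ColumnStats {
        final String name;
        final int nonBlankCount;
        ColumnStats(String name, int nonBlankCount) {
            this.name = name;
            this.nonBlankCount = nonBlankCount;
        }
    }

    // List.sort is stable (TimSort), so columns with equal counts keep the
    // order in which they were first encountered; no name-length tie-breaker.
    static void orderByPopulation(List<ColumnStats> columns) {
        columns.sort(Comparator.comparingInt((ColumnStats c) -> c.nonBlankCount).reversed());
    }
}
```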
* Fix two deprecated method usages
* Test ToNumber conversions
* Test behavior of all functions when passed 0 or 8 arguments
There are 16 which currently fail on 0 args (returning null or
false instead of an EvalError), but they have been whitelisted until
we can verify whether it's safe to change them without introducing
compatibility issues.
There are 19 which fail to return an error on too many (i.e. 8) args.
No issue.
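The shape of the per-function check, as a sketch; the package paths are taken from OpenRefine's source layout and should be treated as assumptions here:

```java
import java.util.Properties;

import com.google.refine.expr.EvalError;
import com.google.refine.expr.functions.ToNumber;
import com.google.refine.grel.Function;

class ArgCountSketch {
    // Applied to each registered GREL function: with no arguments (or far
    // too many) the function should return an EvalError, not null or false.
    static boolean returnsErrorOnZeroArgs(Function f) {
        Object result = f.call(new Properties(), new Object[0]);
        return result instanceof EvalError;
    }

    public static void main(String[] args) {
        System.out.println(returnsErrorOnZeroArgs(new ToNumber()));
    }
}
```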
- we don't support Excel 95, but make sure that it generates an exception
- move the test data file into the appropriate directory
- for any normal test, consider exceptions a failure
Fixes #2824
Versions up through 3.14.0 appear to work, but since odfdom bundles
Jena 3.9.0, we're going to be conservative and match that.
As an added bonus, includes a blank node test which will trigger
the failure.
* Fix charset encoding & MIME type handling
Character set (i.e. what we call "encoding") is part of the Content-Type,
*not* the Content-Encoding, which specifies compression (e.g. gzip).
This correctly sets the character set encoding as well as cleaning
the MIME type so that additional parsing doesn't need to be done
downstream (and removes that code).
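A sketch of the split this commit makes: pull the charset parameter out of the Content-Type header and reduce the header to a clean MIME type. The parsing here is simplified relative to the real header grammar:

```java
import java.nio.charset.Charset;

class ContentTypeSketch {
    // "text/csv; charset=ISO-8859-1" -> MIME type "text/csv", charset ISO-8859-1.
    // Content-Encoding (e.g. gzip) is a separate header and says nothing
    // about the character set.
    static String mimeType(String contentType) {
        int semi = contentType.indexOf(';');
        return (semi < 0 ? contentType : contentType.substring(0, semi)).trim().toLowerCase();
    }

    static Charset charsetOf(String contentType, Charset fallback) {
        for (String param : contentType.split(";")) {
            String p = param.trim();
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                return Charset.forName(p.substring(8).trim().replace("\"", ""));
            }
        }
        return fallback;
    }
}
```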
* Use "text" instead of "text/line-based" as default fallback format
The TextLineBasedGuesser only tries a limited number of
formats (CSV, TSV, fixed), so we can't get out of that hole to
find JSON, XML, etc.
Start with a more general format instead to improve our
guessing odds.
* Support content type Structured Name Syntax Suffixes (+json +xml)
If we can't find a fully specified content type in our lookup,
fall back to just the suffix (which is registered with a leading +)
Fixes #2800. Fixes #2805
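A sketch of the suffix fallback: if the full content type isn't in the lookup table, retry with just the structured-syntax suffix, keyed with its leading "+". The map and method names are illustrative:

```java
import java.util.Map;

class SuffixFallbackSketch {
    // e.g. "application/ld+json" is not in the table, but "+json" is.
    static String guessFormat(String mimeType, Map<String, String> formatByContentType) {
        String format = formatByContentType.get(mimeType);
        if (format == null) {
            int plus = mimeType.lastIndexOf('+');
            if (plus >= 0) {
                format = formatByContentType.get(mimeType.substring(plus));  // "+json", "+xml"
            }
        }
        return format;
    }
}
```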