Fixes #1161
This change parallels what was done in #12571da3c00 to fix
the FingerprintKeyer and moves the diacritic removal before
the deduping. Includes a test.
* Truncate any completely empty columns on the right
Fixes #565
Current versions of OpenOffice create default spreadsheets
with over 1000 empty columns. Keep track of the rightmost
non-empty column when importing and truncate everything else.
Also adds a basic ODS import test.
* Fix dates in ODS spreadsheets
Fixes #2224
* Performance-optimized version of ToNumber
Approximately 5x faster for floats (data-dependent)
and about the same speed for integers.
- Instead of blindly trying to parse as Long, do a quick check
for obvious problems (e.g. decimal point).
- Don't trim. It's already done by called methods.
- Use valueOf() instead of parse() to avoid object creation
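A minimal sketch of the fast-path idea, assuming illustrative names rather than the actual OpenRefine code:

```java
// Illustrative only: a quick pre-scan avoids a doomed Long parse for floats,
// and valueOf() lets the runtime reuse cached boxed values where possible.
public static Number toNumber(String s) {
    if (s == null || s.isEmpty()) {
        return null;
    }
    boolean looksIntegral = true;
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        if (c == '.' || c == 'e' || c == 'E') {   // obvious signs of a float
            looksIntegral = false;
            break;
        }
    }
    if (looksIntegral) {
        try {
            return Long.valueOf(s);
        } catch (NumberFormatException e) {
            // fall through to the double path
        }
    }
    try {
        return Double.valueOf(s);
    } catch (NumberFormatException e) {
        return null;                              // caller decides how to treat non-numbers
    }
}
```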
* Add Java Microbenchmark Harness
The shaded JAR is missing the OpenRefine classes, for a reason
I haven't figured out, so it requires openrefine-main.jar at runtime.
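For reference, a JMH benchmark in this setup looks roughly like the sketch below; the class, package, and input data are made up for illustration:

```java
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
public class ToNumberBenchmark {

    private final String[] inputs = { "42", "-7", "3.14159", "1e-6", "not a number" };

    @Benchmark
    public long parseInputs() {
        long acc = 0;
        for (String s : inputs) {
            try {
                acc += Double.valueOf(s).longValue();  // stand-in for the function under test
            } catch (NumberFormatException e) {
                acc++;
            }
        }
        return acc;   // return a value so the JIT cannot eliminate the work
    }
}
```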
* Remove old implementations of ToNumber
* Remove unneeded dependencies from main project
* Clean up and reformat
Refs #2863
The tree importer sorts columns/column groups by how populated
they are, which is of arguable utility, but the tie-breaker
of ordering by shortest column name is completely silly.
This change removes that and, in conjunction with a stable sort
algorithm, will preserve the original order of the columns.
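In other words, with the tie-breaker gone a single-key stable sort is enough; roughly, with hypothetical names:

```java
import java.util.Comparator;
import java.util.List;
import java.util.function.ToIntFunction;

// List.sort is guaranteed stable, so groups with the same fill count
// keep the relative order in which they appeared in the source document.
static <T> void orderByFill(List<T> groups, ToIntFunction<T> fillCount) {
    groups.sort(Comparator.comparingInt(fillCount).reversed());
}
```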
* Fix charset encoding & MIME type handling
Character set (i.e. what we call "encoding") is part of the Content-Type,
*not* the Content-Encoding, which specifies compression (e.g. gzip).
This correctly sets the character set encoding as well as cleaning
the MIME type so that additional parsing doesn't need to be done
downstream (and removes that code).
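A sketch of the parsing this implies; the header value and method name are illustrative:

```java
// Given e.g. "text/csv; charset=ISO-8859-1", return the clean MIME type and
// the declared character set. Content-Encoding is not consulted at all,
// since it only describes compression (gzip, deflate, ...).
static String[] splitContentType(String contentType) {
    String mimeType = contentType;
    String charset = null;
    int semi = contentType.indexOf(';');
    if (semi >= 0) {
        mimeType = contentType.substring(0, semi).trim().toLowerCase();
        for (String param : contentType.substring(semi + 1).split(";")) {
            String[] kv = param.trim().split("=", 2);
            if (kv.length == 2 && kv[0].equalsIgnoreCase("charset")) {
                charset = kv[1].trim().replaceAll("^\"|\"$", "");
            }
        }
    }
    return new String[] { mimeType, charset };
}
```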
* Use "text" instead of "text/line-based" as default fallback format
The TextLineBasedGuesser only tries a limited number of
formats (CSV, TSV, fixed), so we can't get out of that hole to
find JSON, XML, etc.
Start with a more general format instead to improve our
guessing odds.
* Support content type structured syntax suffixes (+json, +xml)
If we can't find a fully specified content type in our lookup,
fall back to just the suffix (which is registered with a leading +)
Fixes #2800. Fixes #2805.
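The fallback amounts to something like this sketch; the map contents and method name are illustrative:

```java
import java.util.Map;

// e.g. "application/ld+json" is not in the table, but "+json" is,
// so the importer still picks the JSON format.
static String guessFormat(Map<String, String> formatByContentType, String mimeType) {
    String format = formatByContentType.get(mimeType);
    if (format == null) {
        int plus = mimeType.lastIndexOf('+');
        if (plus >= 0) {
            format = formatByContentType.get(mimeType.substring(plus)); // "+json", "+xml", ...
        }
    }
    return format;
}
```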
* Harden reconciliation - Fixes #2590
- check for non-JSON / unparseable JSON returns
- handle malformed results response with no name for candidates
- catch any Exception, not just IOExceptions
- call processManager.onFailedProcess() for cleanup on error
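A hedged sketch of the defensive pattern: the "result" and "name" field names follow the reconciliation API, everything else here is illustrative:

```java
import java.util.ArrayList;
import java.util.List;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

static List<String> parseCandidateNames(String responseBody) {
    List<String> names = new ArrayList<>();
    try {
        JsonNode root = new ObjectMapper().readTree(responseBody);  // throws on non-JSON bodies
        for (JsonNode candidate : root.path("result")) {
            String name = candidate.path("name").asText(null);
            if (name != null) {                                      // tolerate candidates with no name
                names.add(name);
            }
        }
    } catch (Exception e) {                                          // any Exception, not just IOException
        // the real change also calls processManager.onFailedProcess() here for cleanup
    }
    return names;
}
```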
* Add default constructor for Jackson
Jackson complains about needing a default constructor for the
NON_DEFAULT annotation, but I'm not sure why this worked before.
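For context, with Include.NON_DEFAULT Jackson builds a default instance of the bean to learn which values to omit, which is why it needs a no-arg constructor. An illustrative class, not the actual OpenRefine type:

```java
import com.fasterxml.jackson.annotation.JsonInclude;

@JsonInclude(JsonInclude.Include.NON_DEFAULT)
public class ExportOptions {
    public int limit = 10;
    public boolean preview = false;

    public ExportOptions() {}                // Jackson uses this to compute the defaults to skip

    public ExportOptions(int limit, boolean preview) {
        this.limit = limit;
        this.preview = preview;
    }
}
```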
* Clean up indentation and unused variable - no functional changes
Make indentation consistent throughout the module, changing recently
added lines to use the standard all spaces convention.
Remove unused count variable
* Simplify control flow
* Update limit parameter comment. No functional change.
* Replace ternary expression which is causing NPE
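The commit does not spell out the exact expression, but a common way a ternary triggers an NPE is mixing a primitive and a nullable wrapper; an illustrative example:

```java
static int choiceLimit(Integer count) {
    // One branch being a primitive int gives the whole ternary type int, so a
    // null Integer gets auto-unboxed and throws a NullPointerException:
    // int rows = (count != null || useDefault) ? count : 0;

    // An explicit null check avoids the unboxing trap:
    return (count == null) ? 0 : count.intValue();
}
```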
* Add reconciliation tests using mock HTTP server
* Fixes #486. Builds on code from Steffen Stundzig
- Switch from ICU4J to juniversalchardet
(Java port of Mozilla charset detector)
- Replace org.json code with Jackson
- Add tests
- Add TODO for multi-file character encoding mismatches
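Typical juniversalchardet usage, for reference; the stream handling and class name here are illustrative:

```java
import java.io.IOException;
import java.io.InputStream;

import org.mozilla.universalchardet.UniversalDetector;

class EncodingSniffer {
    static String detect(InputStream in) throws IOException {
        UniversalDetector detector = new UniversalDetector(null);
        byte[] buffer = new byte[4096];
        int n;
        while ((n = in.read(buffer)) > 0 && !detector.isDone()) {
            detector.handleData(buffer, 0, n);
        }
        detector.dataEnd();
        return detector.getDetectedCharset();   // null when there is no confident guess
    }
}
```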
* Restore dependency lost in rebase
Co-authored-by: Steffen Stundzig <git@stundzig.de>
* Use ContentDisposition instead of ContentType to control download
Fixes #1197. Previously we were using a funky ContentType to attempt
to force a file download rather than display in browser, but this
conflicted with attempts to save UTF-8 characters outside the Basic
Multilingual Plane (BMP).
By switching to Content-Disposition: attachment, which has been
the preferred method for a number of years, we can avoid this conflict.
As part of this, switch to using the "preview" param consistently
to control preview vs download rather than the content type.
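Roughly, the servlet side looks like the sketch below; the filename is a placeholder, and the "preview" parameter is the one mentioned above:

```java
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

static void prepareExportResponse(HttpServletRequest request, HttpServletResponse response) {
    boolean preview = "true".equals(request.getParameter("preview"));
    if (!preview) {
        // Content-Disposition: attachment is the standard way to ask for a download
        response.setHeader("Content-Disposition", "attachment; filename=\"export.txt\"");
    }
    // the charset travels in Content-Type, so non-BMP UTF-8 text survives intact
    response.setContentType("text/plain; charset=UTF-8");
}
```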
* Switch content type to text/plain
Now that we don't need to use ContentType to control download
behavior, we can use something more reasonable.
* Use mockwebserver instead of live network for tests
Fixes #2680. Fixes #1904.
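The test shape is roughly this sketch using okhttp's MockWebServer; the body and path are placeholders:

```java
import okhttp3.mockwebserver.MockResponse;
import okhttp3.mockwebserver.MockWebServer;

static void exampleTest() throws Exception {
    MockWebServer server = new MockWebServer();
    server.start();
    server.enqueue(new MockResponse()
            .setHeader("Content-Type", "application/json; charset=utf-8")
            .setBody("{\"rows\": []}"));
    String url = server.url("/data.json").toString();
    // ... run the importer under test against `url` and assert on the result ...
    server.shutdown();
}
```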
* Remove use of deprecated methods
* Convert to use Apache HTTP Components client library
Fixes #1410 by virtue of redirect following being a built-in
capability of the library, along with retries with binary backoff,
built-in decompression, etc.
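A minimal sketch of the client configuration implied above (HttpClient 4.x style; the retry count is illustrative):

```java
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.DefaultHttpRequestRetryHandler;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.client.LaxRedirectStrategy;

static CloseableHttpClient createClient() {
    return HttpClients.custom()
            .setRedirectStrategy(new LaxRedirectStrategy())               // follow redirects
            .setRetryHandler(new DefaultHttpRequestRetryHandler(3, true)) // retry requests
            .build();                                                     // gzip/deflate decoded automatically
}
```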
* Address review comments
* Fix bug in choice counts for records mode
* Add test for value grouper on records
* Refactor and comment code
* Count distinct instances of null/blank data
* Update test to check for blank data count in records
* Remove unnecessary import statement
* added options ui
* added definition for both separators
* added tests
* removed definitions from backend and added them to frontend
* added reverse order and handling for accented characters
* added tests for accented characters and reverse split
* fixed build errors
* use Unicode character ranges instead
* added examples
* Convert illegal characters into legal ones.
* Test tab in key & value string
Also fix up a test that depended on the previous TAB-related
error message, and clean up logging
Co-authored-by: Tom Morris <tfmorris@gmail.com>
NOTE: Changes the public API where some of the old types were
embedded, which means that any extensions that extend these
interfaces will have to be updated.
Fixes #2690.
* Save preferences JSON using UTF-8 encoding. Bulletproof prefs load.
Fixes #2543. Fixes #2627.
Always use UTF-8 to write JSON because the platform default encoding
might not produce legal JSON (e.g. ISO 8859-1).
Also be more conservative about keeping backups if we fail to write.
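The writing side boils down to something like this sketch; the file handling and names are illustrative:

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

import com.fasterxml.jackson.databind.ObjectMapper;

static void savePreferences(ObjectMapper mapper, Object preferences, File file) throws IOException {
    // Always write through an explicit UTF-8 writer instead of relying on the
    // platform default encoding (which could be ISO 8859-1 on some systems).
    try (Writer writer = new OutputStreamWriter(new FileOutputStream(file), StandardCharsets.UTF_8)) {
        mapper.writeValue(writer, preferences);
    }
    // the real change only swaps this file in (and keeps the backup) once the write succeeds
}
```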
* Handle the case where the backup prefs file is better than the more recent one
* Recover from corrupted prefs with null starred list.
Fixes #2544. Replaces null with an empty list.
* Run tests with non-UTF-8 encoding
Make sure that we don't depend on UTF-8 being the default encoding
because it isn't true everywhere (e.g. Windows)
* Add test for non-ASCII chars in workspace.json
This depends on the default Java encoding being something
other than UTF-8 to test properly.
* Changed cell.error to cell.errorMessage & added help data.
Changed cell.error to cell.errorMessage and added the information to the internal help system.
* FR Text correction
* HU Fix text
3 instead of 2.
Fix true.type() to return "boolean" instead of java.lang.Boolean.
Remove all the references to the "error" result in Type(). This will be
addressed in a follow-up: @ToDo fix this with issue #2562
As the Double.valueOf javadoc notes: "If a new {@code Double} instance
is not required, this method should generally be used in preference
to the constructor {@link #Double(double)}, as this method is likely
to yield significantly better space and time performance by caching
frequently requested values."
* remove unused imports
* remove unneeded Freebase AGENT_ID
In the past, Freebase editors used Google Refine to make edits to its
database. The internal identifier was "/en/google_refine", which equated
to a Software Application type with attached metadata and also had
ownership privileges for certain Freebase Apps. Since Freebase is no
longer around, this identifier, only used by Freebase, can now be removed.
(This is not a User-Agent header string, but an internal identifier for
the Freebase database, which no longer exists.)
* Revert "remove unused imports"
This reverts commit 9f6a276f36a54245016bd445680067d2c8862fcb.
* Add error handler for parse error
* Add test for parsing JSON with incorrect structure
* Enable localization from front-end
* Add methods to get localized error messages
* Update returned exception message
* Remove unused log and fix file diff issue
* Test auto build
* Refactor getOptions in newly created test
* Use new exception to unwrap original message
* Undo unexpected fix
* Remove unused lines
* Fix exception logic
* Fix typo
* Fix losing data when importing multiple sheets from the same source Excel file
* Add test for importing multiple sheets with different column sizes
* Fix space issues
* Restore old tests and implement new test cases for the new feature
* Restore unexpected delete
* Refactor fix
* Restore unexpected line delete
* Add new unit test for new feature
* added trim ui to csv importer
* added trim functionality
* trimStrings handler only for strings
* added test for trimStrings option in csv/tsv files
* made trim option enabled by default
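The option's effect during CSV/TSV parsing is essentially this sketch; the method name is illustrative:

```java
// Only string cells are trimmed; values already parsed as numbers, dates, etc.
// are left untouched, matching the "only for strings" commit above.
static Object applyTrimOption(Object value, boolean trimStrings) {
    if (trimStrings && value instanceof String) {
        return ((String) value).trim();
    }
    return value;
}
```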
This allows keeping the same JavaScript calls to load languages, so it
does not require any change for extensions to benefit from this.
Closes #1350. Fixes #2209.
Also adds new logic to preserve the JSON representation of unknown operations,
to protect against version downgrades or removal of extensions.
Closes #1990.
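One way to preserve the raw JSON for operations whose class cannot be resolved (e.g. after a downgrade, or when an extension was removed) is sketched below; the class is illustrative, not the actual implementation:

```java
import com.fasterxml.jackson.annotation.JsonValue;
import com.fasterxml.jackson.databind.node.ObjectNode;

// Stands in for an operation whose concrete class is unknown in this instance:
// it simply carries the original JSON and writes it back out unchanged.
public class UnknownOperationStub {
    private final ObjectNode json;

    public UnknownOperationStub(ObjectNode json) {
        this.json = json;
    }

    @JsonValue
    public ObjectNode getJson() {
        return json;      // serializing the stub reproduces the original payload exactly
    }
}
```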