Commit Graph

1140 Commits

Author SHA1 Message Date
Tom Morris
0562638ffa
Use standard text normalization - fixes #2898 (#2900)
* Use standard text normalization - fixes #2898

Fixes #2898. Fixes #409. Refs #650

Replaces homegrown ISO Latin-1 only character substitution
with standard Java Normalize to NFD, followed by diacritic
removal and a few custom character expansions/replacements.

* Fix Mac build

* Improve compatibility with previous code

One intentional change is folding O with stroke to
oe instead of o.

- Use more powerful NFKD instead of NFD
- strip punctuation after decomposition since it can generate
  new punctuation
- Add compatibility test for old asciify() method
- Add some graphically similar characters to substitution table

* Add oe character/ligature & more long S forms

* More tests for ligatures and Latin Extended

* Add Latin-1 Supplement tests
2020-07-07 21:35:41 +02:00
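Note on 0562638ffa: the approach described in the message (NFKD decomposition, diacritic removal, then a few custom expansions) can be sketched roughly as below. This is a minimal illustration, not the committed code: the class `AsciiFolder`, the method `fold`, and the three-entry replacement table are hypothetical. Only `java.text.Normalizer` and the NFKD form are taken from the JDK and the commit message.

```java
import java.text.Normalizer;

final class AsciiFolder {
    private AsciiFolder() {}

    /** Decompose with NFKD, drop combining marks, then expand a few characters by hand. */
    static String fold(String s) {
        // NFKD splits an accented letter into base letter + combining mark (e.g. U+0301)
        // and also expands compatibility forms such as the "fi" ligature.
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFKD);
        // Remove the combining marks (Unicode category M) left behind by decomposition.
        String stripped = decomposed.replaceAll("\\p{M}+", "");
        // Characters without a Unicode decomposition need explicit rules.
        // Illustrative subset only; the real substitution table is not shown in the log.
        return stripped
                .replace("\u00f8", "oe")   // o with stroke
                .replace("\u00d8", "OE")   // O with stroke
                .replace("\u00df", "ss");  // sharp s
    }

    public static void main(String[] args) {
        // Input contains o-with-stroke, e-acute and the "fi" ligature.
        System.out.println(fold("S\u00f8ren caf\u00e9 \ufb01n"));
    }
}
```

O with stroke has no canonical or compatibility decomposition, which is why it needs a table entry rather than falling out of the normalization step.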
Urvashi Gupta
f62f63706c
Report HTTP error codes to the user when creating a project from a URL (#2870)
* HTTP Error

* urlImportingTestCompleted
2020-07-07 11:58:47 +02:00
Tom Morris
e61d50a1aa
Fix NGramFingerprintKeyer to ignore accents - fixes #1161 (#2899)
Fixes #1161
This change parallels what was done in #1257 1da3c00 to fix
the FingerprintKeyer and moves the diacritic removal before
the deduping. Includes a test.
2020-07-07 09:02:49 +02:00
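Note on e61d50a1aa: the ordering issue (strip diacritics before deduplicating n-grams) can be illustrated with a simplified n-gram keyer. This is a sketch under assumptions, not the NGramFingerprintKeyer source: the class `NGramKeySketch` and the exact cleanup regex are hypothetical.

```java
import java.text.Normalizer;
import java.util.Locale;
import java.util.TreeSet;

final class NGramKeySketch {
    private NGramKeySketch() {}

    /** Builds a sorted, de-duplicated n-gram key from accent-folded text. */
    static String key(String s, int n) {
        String t = s.toLowerCase(Locale.ROOT);
        // Strip diacritics first, so accented and unaccented spellings yield identical n-grams.
        t = Normalizer.normalize(t, Normalizer.Form.NFD).replaceAll("\\p{M}+", "");
        // Drop ASCII punctuation, control characters and whitespace.
        t = t.replaceAll("[\\p{Punct}\\p{Cntrl}\\s]+", "");
        // TreeSet both sorts and de-duplicates the n-grams.
        TreeSet<String> grams = new TreeSet<>();
        for (int i = 0; i + n <= t.length(); i++) {
            grams.add(t.substring(i, i + n));
        }
        return String.join("", grams);
    }

    public static void main(String[] args) {
        // Both spellings should produce the same key.
        System.out.println(key("Caf\u00e9 cafe", 2));
        System.out.println(key("cafe cafe", 2));
    }
}
```

If the accents were removed only after the set was built, the accented and unaccented bigrams would both survive deduplication and the two spellings would get different keys.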
Tom Morris
3717111db8
Fix Open Office Spreadsheet (ODS) dates (#2843)
* Truncate any completely empty columns on the right

Fixes #565
The current versions of Open Office create default spreadsheets
with over 1000 empty columns. Keep track of the rightmost
non-empty column when importing and truncate everything else.

Also adds a basic ODS import test.

* Fix dates in ODS spreadsheets

Fixes #2224
2020-07-04 08:42:33 +02:00
Tom Morris
df8d092132
Micro benchmark harness & ToNumber optimizations (#2859)
* Performance optimized version of ToNumber

Approximately 5x faster for floats (data dependent)
and about the same speed for integers.

- Instead of blindly trying to parse as Long, do a quick check
  for obvious problems (e.g. decimal point).
- Don't trim. It's already done by called methods.
- Use valueOf() instead of parse() to avoid object creation

* Add Java Microbenchmark Harness

The shaded JAR is missing the OpenRefine classes, for a reason
that I haven't figured out, so requires openrefine-main.jar at runtime.

* Remove old implementations of ToNumber

* Remove unneeded dependencies from main project

* Clean up and reformat
2020-07-03 21:42:44 +02:00
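Note on df8d092132: the optimization described (a cheap check before attempting an integer parse, and `valueOf()` rather than `parse()`) might look roughly like this. The class `ToNumberSketch` and its null-on-failure contract are assumptions for illustration; the committed ToNumber code is not reproduced here.

```java
final class ToNumberSketch {
    private ToNumberSketch() {}

    /** Returns a Long or a Double, or null when the text is not numeric. */
    static Number toNumber(String s) {
        if (s == null || s.isEmpty()) {
            return null;
        }
        // Cheap pre-check: text with a decimal point or exponent can never parse as a long,
        // so skip the attempt (and the exception it would throw).
        if (s.indexOf('.') < 0 && s.indexOf('e') < 0 && s.indexOf('E') < 0) {
            try {
                return Long.valueOf(s);
            } catch (NumberFormatException e) {
                // Not an integer; fall through and try a floating-point parse.
            }
        }
        try {
            return Double.valueOf(s);
        } catch (NumberFormatException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(toNumber("42"));
        System.out.println(toNumber("3.5"));
        System.out.println(toNumber("abc"));
    }
}
```

The saving comes from avoiding a thrown and caught `NumberFormatException` on every float, which is consistent with the message's claim that floats benefit and integers do not.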
Tom Morris
d3db73aa67 Remove shortest-column-name ordering
Refs #2863
The tree importer sorts columns/column groups by how populated
they are, which is of arguable utility, but the tie-breaker
of ordering by shortest column name is completely silly.

This change removes that and, in conjunction with a stable sort
algorithm, will preserve the original order of the columns.
2020-07-02 16:12:55 -04:00
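Note on d3db73aa67: the effect of dropping the name-length tie-breaker can be shown with a stable sort. This sketch assumes columns are ordered most-populated first (the message does not state the direction); the `Column` class and `order` method are hypothetical stand-ins for the tree importer's types.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class ColumnOrderSketch {
    private ColumnOrderSketch() {}

    /** Hypothetical stand-in for a column and the number of populated cells in it. */
    static final class Column {
        final String name;
        final int populatedCells;

        Column(String name, int populatedCells) {
            this.name = name;
            this.populatedCells = populatedCells;
        }
    }

    /** Orders columns by population only; ties keep their original relative order. */
    static List<String> order(List<Column> columns) {
        List<Column> sorted = new ArrayList<>(columns);
        // Collections.sort is documented as stable, so equal elements are not reordered.
        // No secondary comparison on name length is applied.
        Collections.sort(sorted, (a, b) -> Integer.compare(b.populatedCells, a.populatedCells));
        List<String> names = new ArrayList<>();
        for (Column c : sorted) {
            names.add(c.name);
        }
        return names;
    }
}
```

With two equally populated columns, the one that appeared first in the input stays first, regardless of how long its name is.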
Tom Morris
54291ef441
Use Apache IO Commons IOUtils instead of homerolled (#2845)
Probably should remove the funky Gzip support with the
overloaded use of the encoding parameter, but this is
a start.
2020-06-30 13:49:47 +02:00
Tom Morris
421974cc3d
Truncate any completely empty columns on the right (#2842)
Fixes #565
The current versions of Open Office create default spreadsheets
with over 1000 empty columns. Keep track of the rightmost
non-empty column when importing and truncate everything else.

Also adds a basic ODS import test.
2020-06-30 08:19:00 +02:00
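Note on 421974cc3d: tracking the rightmost non-empty column and truncating everything to its right can be sketched as follows. The message says the importer tracks this while importing; this illustration instead makes two passes over an in-memory grid for simplicity. `TrailingColumnTrimmer` and the list-of-lists row model are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

final class TrailingColumnTrimmer {
    private TrailingColumnTrimmer() {}

    /** Drops columns to the right of the rightmost column that has any non-empty cell. */
    static List<List<String>> trim(List<List<String>> rows) {
        int rightmost = -1;
        for (List<String> row : rows) {
            // Scan from the right, and only as far as the best index seen so far.
            for (int i = row.size() - 1; i > rightmost; i--) {
                String cell = row.get(i);
                if (cell != null && !cell.isEmpty()) {
                    rightmost = i;
                    break;
                }
            }
        }
        List<List<String>> result = new ArrayList<>();
        for (List<String> row : rows) {
            int end = Math.min(row.size(), rightmost + 1);
            result.add(new ArrayList<>(row.subList(0, end)));
        }
        return result;
    }
}
```

Empty columns in the middle of the sheet are kept; only the completely empty run on the right is removed.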
Tom Morris
4b146acc6e
Create Project import improvements (#2806)
* Fix charset encoding & MIME type handling

Character set (i.e. what we call "encoding") is part of the Content-Type,
*not* the Content-Encoding, which specifies compression (e.g. gzip).

This correctly sets the character set encoding as well as cleaning
the MIME type so that additional parsing doesn't need to be done
downstream (and removes that code).

* Use "text" instead of "text/line-based" as default fallback format

The TextLineBasedGuesser only tries a limited number of
formats (CSV, TSV, fixed), so we can't get out of that hole to
find JSON, XML, etc.

Start with a more general format instead to improve our
guessing odds.

* Support content type Structured Name Syntax Suffixes (+json +xml)

If we can't find a fully specified content type in our lookup,
fall back to just the suffix (which is registered with a leading +)
Fixes #2800 Fixes #2805
2020-06-25 08:36:57 +02:00
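Note on 4b146acc6e: two of the ideas in the message, taking the character set from the Content-Type header and falling back to the structured syntax suffix, can be sketched with plain string handling. Everything named here is hypothetical: the class `ContentTypeSketch`, the lookup table, and the format names ("json", "xml", "csv", "text") are placeholders, not OpenRefine's actual registry.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class ContentTypeSketch {
    private ContentTypeSketch() {}

    // Hypothetical lookup table; suffix entries are registered with a leading '+'.
    private static final Map<String, String> FORMATS = new HashMap<>();
    static {
        FORMATS.put("text/csv", "csv");
        FORMATS.put("application/json", "json");
        FORMATS.put("application/xml", "xml");
        FORMATS.put("+json", "json");
        FORMATS.put("+xml", "xml");
    }

    /** Returns the bare MIME type, without parameters, in lower case. */
    static String mimeType(String contentType) {
        int semi = contentType.indexOf(';');
        String mime = semi < 0 ? contentType : contentType.substring(0, semi);
        return mime.trim().toLowerCase(Locale.ROOT);
    }

    /** Returns the charset parameter of a Content-Type value, or null if absent. */
    static String charset(String contentType) {
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) {
            String p = parts[i].trim();
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String value = p.substring(8).trim();
                // Parameter values may be quoted.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value;
            }
        }
        return null;
    }

    /** Looks up the full type first, then only the "+suffix", then falls back to "text". */
    static String guessFormat(String contentType) {
        String mime = mimeType(contentType);
        String format = FORMATS.get(mime);
        if (format == null) {
            int plus = mime.lastIndexOf('+');
            if (plus >= 0) {
                format = FORMATS.get(mime.substring(plus));
            }
        }
        return format != null ? format : "text";
    }
}
```

For example, a value such as `application/ld+json; charset=UTF-8` is not in the table as a full type, but its `+json` suffix is, so it would be treated as JSON with a UTF-8 character set.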
Tom Morris
1849e62234
Better error handling for reconciliation process - fixes #2590 (#2671)
* Harden reconciliation - Fixes #2590

- check for non-JSON / unparseable JSON returns
- handle malformed results response with no name for candidates
- catch any Exception, not just IOExceptions
- call processManager.onFailedProcess() for cleanup on error

* Add default constructor for Jackson

Jackson complains about needing a default constructor for the
NON_DEFAULT annotation, but I'm not sure why this worked before.

* Clean up indentation and unused variable - no functional changes

Make indentation consistent throughout the module, changing recently
added lines to use the standard all spaces convention.

Remove unused count variable

* Simplify control flow

* Update limit parameter comment. No functional change.

* Replace ternary expression which is causing NPE

* Add reconciliation tests using mock HTTP server
2020-06-23 21:54:54 +02:00
Tom Morris
e293602897
Restore character encoding guesser (#2755)
* Fixes #486. Builds on code from Steffen Stundzig

- Switch from ICU4J to juniversalchardet
  (Java port of Mozilla charset detector)
- Replace org.json code with Jackson
- Add tests
- Add TODO for multi-file character encoding mismatches

* Restore dependency lost in rebase

Co-authored-by: Steffen Stundzig <git@stundzig.de>
2020-06-22 06:04:51 +02:00
Tom Morris
5d2c10b9d8
Merge pull request #2731 from tfmorris/jena-3.7.0
Bump Jena from 3.6.0 to 3.15.0
2020-06-16 17:08:01 -04:00
Tom Morris
5f368bc56d
Use ContentDisposition instead of ContentType to control download (#2722)
* Use ContentDisposition instead of ContentType to control download

Fixes #1197. Previously we were using a funky ContentType to attempt
to force a file download rather than display in browser, but this
conflicted with attempts to save UTF-8 which was outside the Basic
Multilingual Plane (BMP).

By switching to ContentDisposition: attachment, which has been
the preferred method for a number of years, we can avoid this conflict.

As part of this, switch to using the "preview" param consistently
to control preview vs download rather than the content type.

* Switch content type to text/plain

Now that we don't need to use ContentType to control download
behavior, we can use something more reasonable.
2020-06-16 15:46:07 +02:00
Tom Morris
749704518c
Use Apache HTTP Commons for Fetch URL (#2692)
* Use mockwebserver instead of live network for tests

Fixes #2680. Fixes #1904.

* Remove use of deprecated methods

* Convert to use Apache HTTP Components client library

Fixes #1410 by virtue of redirect following being a built-in
capability of the library, along with retries with binary backoff,
built-in decompression, etc.

* Address review comments
2020-06-16 09:38:06 +02:00
Tom Morris
559494b75d Add TODOs for Jena RDF language names 2020-06-15 20:04:05 -04:00
Tom Morris
348d82d131
Merge pull request #2725 from OpenRefine/issue-2724-wikidata-endpoint
Update URL of Wikidata reconciliation service
2020-06-15 17:20:05 -04:00
james-cui
04055153a1
add archive column (#2573)
Co-authored-by: Antonin Delpeuch <antonin@delpeuch.eu>
2020-06-15 19:56:00 +02:00
Joanne Ong
d57d76f7df
Fix imprecise facet statistics in records mode (#2607)
* Fix bug in choice counts for records mode

* Add test for value grouper on records

* Refactor and comment code

* Count distinct instances of null/blank data

* Update test to check for blank data count in records

* Remove unnecessary import statement
2020-06-15 19:38:50 +02:00
Lisa Chandra
947356ddad
[FEAT]Adds new options for split (#2471)
* added options ui

* added definition for both separators

* added tests

* removed definitions from backend and added them to frontend

* added reverse order and handling for accented characters

* added tests for accented characters and reverse split

* fixed build errors

* unicode character ranges instead

* added examples
2020-06-15 19:30:18 +02:00
Antonin Delpeuch
1bb9e8a67e Update URL of Wikidata reconciliation service. Closes #2724 2020-06-15 00:35:10 +02:00
Tom Morris
bf1c890cc3
Unused imports and other minor cleanups (#2723)
* Two minor fixes

- prevent invalid index error on empty strings (shouldn't normally happen)
- update deprecated Apache Commons Lang method

* Remove unused imports
2020-06-14 21:18:02 +02:00
chuhao zeng
9b03ecae41
Convert illegal characters into legal ones. (#2431)
* Convert illegal characters into legal ones.

* Test tab in key & value string

Also fix up test that depended on previous TAB
related error message and clean up logging

Co-authored-by: Tom Morris <tfmorris@gmail.com>
2020-06-14 09:47:58 +02:00
Tom Morris
18c18e587e
Replace Apache Ant with Commons Compress (#2691)
NOTE: Changes the public API where some of the old types were
embedded which means that any extensions that extend these
interfaces will have to be updated.

Fixes #2690.
2020-06-11 16:39:51 +02:00
Tom Morris
e6ed8e5d62
Save preferences JSON using UTF-8 encoding. Bulletproof prefs load. (#2657)
* Save preferences JSON using UTF-8 encoding. Bulletproof prefs load.

Fixes #2543. Fixes #2627.

Always use UTF-8 to write JSON because platform default encoding
might not be legal JSON (e.g. ISO 8859-1).

Also be more conservative about keeping backups if we fail to write.

* Handle case where backup prefs is better than more recent

* Recover from corrupted prefs with null starred list.

Fixes #2544. Replaces null with an empty list.

* Run tests with non-UTF-8 encoding

Make sure that we don't depend on UTF-8 being the default encoding
because it isn't true everywhere (e.g. Windows)

* Add test for non-ASCII chars in workspace.json

This depends on the default Java encoding being something
other than UTF-8 to test properly.
2020-06-06 10:00:01 +01:00
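Note on e6ed8e5d62: the core of the fix, always naming UTF-8 explicitly instead of relying on the platform default encoding, can be sketched as below. `Utf8JsonFileSketch` and its methods are hypothetical, and the backup handling described in the message is not modelled.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class Utf8JsonFileSketch {
    private Utf8JsonFileSketch() {}

    /** Writes JSON text as UTF-8 bytes, independent of the platform default encoding. */
    static void save(Path file, String json) {
        try {
            Files.write(file, json.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the file back, decoding explicitly as UTF-8. */
    static String load(Path file) {
        try {
            return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Saves to a temporary file and loads it again; useful for checking non-ASCII text. */
    static String roundTrip(String json) {
        try {
            Path tmp = Files.createTempFile("prefs-sketch", ".json");
            try {
                save(tmp, json);
                return load(tmp);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because both directions name the charset, the round trip gives the same text whatever default encoding the JVM was started with, which is the property the added tests in the commit aim to check.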
Antoine Beaubien
3ca08f6ff1
Changed cell.error to cell.errorMessage & added help data. (#2628)
* Changed cell.error to cell.errorMessage & added help data.

Changed cell.error to cell.errorMessage and added the information to the internal help system.

* FR Text correction

* HU Fix text

3 instead of 2.
2020-05-23 14:05:25 +02:00
Lu Liu
e89eaf0ee2
support default project name and column name for cross() (#2518) 2020-05-22 09:39:57 +02:00
Tom Morris
557ffad920
Merge pull request #2586 from OpenRefine/issue-2510-type-boolean
Support "boolean" return for type() function.  Closes #2510
2020-05-18 17:24:47 -04:00
Antoine2711
0e86619d86 Fix the true.type() == "boolean"
Make true.type() return "boolean" instead of java.lang.Boolean.

Remove all references to the "error" result in Type(). This will be addressed in:
@ToDo fix this with issue #2562
2020-05-18 17:23:43 -04:00
Antonin Delpeuch
d7d567439e Set version to 3.5-SNAPSHOT 2020-05-13 22:56:33 +02:00
Antonin Delpeuch
5597e1c942 Set version to 3.4-beta 2020-05-13 22:52:25 +02:00
Antonin Delpeuch
825e687b0b
Fix bug when both trim and autodetect are enabled in tabular parser. Closes #2584 (#2610) 2020-05-05 14:00:17 +02:00
Thad Guidry
15710ace17
reduce object creation during JSON serialization (#2576)
Javadoc for Double.valueOf(double): "If a new {@code Double} instance
is not required, this method should generally be used in preference to
the constructor {@link #Double(double)}, as this method is likely to
yield significantly better space and time performance by caching
frequently requested values."
2020-05-05 10:07:54 +02:00
PJ Fanning
f047a88518
poi works better reading files directly (#2597) 2020-04-26 21:27:09 +02:00
PJ Fanning
ab64303cbb
allow xlsx files to have more columns (#2602) 2020-04-26 17:07:26 +02:00
PJ Fanning
1a0e187561
correct excel mime types (#2596)
* correct excel mime types

* address PR issue

* remove use of wildcard
2020-04-26 14:36:37 +02:00
PJ Fanning
88f7fb2852
Use SXSSFWorkbook in XlsExporter to improve memory usage when exporting xlsx files (#2594) 2020-04-26 12:26:05 +02:00
Thad Guidry
e5e2c8f665
remove Freebase AGENT_ID (#2575)
* remove unused imports

* remove unneeded Freebase AGENT_ID

In the past, Freebase editors used Google Refine for making edits to its database and the internal identifier was "/en/google_refine" which equated to a Software Application type with attached metadata and also had ownership privileges for certain Freebase Apps.  Since Freebase is no longer around, this identifier, only used by Freebase, can now be removed.  (This is not a User-Agent header string but was an internal identifier for the Freebase database which no longer exists)

* Revert "remove unused imports"

This reverts commit 9f6a276f36a54245016bd445680067d2c8862fcb.
2020-04-21 18:32:39 +02:00
Thad Guidry
009c587437
remove unused imports (#2574) 2020-04-21 15:51:01 +02:00
Lu Liu
bf84fc9cf1
use string representation for matching (#2571) 2020-04-20 09:07:09 +02:00
Ekta Mishra
05b6a7b2ae
Provides more intuitive representation for arrays in GREL (#2488)
Added test for same
closes #2040
2020-04-01 10:59:25 +02:00
chuhao zeng
1f0111eaed
Fix silent error in JSON/XML importers (#2414)
* Add error handler for parse error

* Add test for parsing json with incorrect structure

* Enable localization from front-end

* Add methods to get localized error messages

* Update returned exception message

* Remove unused log and fix file diff issue

* Test auto build

* Refactor getOptions in newly created test

* Use new exception to unwrap original message

* Undo unexpected fix

* Remove unused lines

* Fix exception logic

* Fix typo
2020-03-27 09:41:49 +01:00
Albin Larsson
72966af5b6
remove Freebase reconciliation from Excel Importer (#2470) 2020-03-27 09:30:00 +01:00
Lu Liu
f2b06418da
Support lookup by numbers for GREL cross function (#2468)
* support int & long argument for cross function

* support any types of a cell value
2020-03-26 08:57:10 +01:00
chuhao zeng
70b4c6a6d0
Enable gzip compression (#2475)
* Enable gzip compression

* Add test for gzip parser
2020-03-26 08:42:55 +01:00
chuhao zeng
e484625adf
Fix: Data loss when importing multiple sheets from same Excel file (#2404)
* Fix losing data when importing multiple sheets from same source Excel file

* Add test for importing multi sheets with different column size

* Fix space issues

* Restore old tests and implement new test cases for the new feature

* Restore unexpected delete

* Refactor fix

* Restore unexpected line delete

* Add new unit test for new feature
2020-03-23 22:41:23 +01:00
Thad Guidry
63bef81980
Remove unused variable in JSONUtilities (#2464) 2020-03-23 20:38:03 +01:00
Lu Liu
9ad3b1080f
Make cross() function work for all columns (#2456)
* fix #1950

* migrate from join to lookup

* reformat
2020-03-23 14:48:32 +01:00
Lisa Chandra
ef8ad85c3c
Adds trim whitespace option to separator based files (#2408)
* added trim ui to csv importer

* added trim functionality

* trimStrings handler only for strings

* added test for trimStrings option in csv/tsv files

* made trim option enabled by default
2020-03-21 10:38:43 +00:00
Albin Larsson
9745bfe374
consistent usage of Apache http status constants (#2432) 2020-03-18 06:40:52 +00:00
Lisa Chandra
a91691cb6b
[FIX] json/xml trim whitespace configuration option (#2415)
* trimStrings condition

* added test for trimString xml

* added trimStrings check for json
2020-03-15 16:04:01 +00:00
zengchu2
c90fd31daf
Add cell.error field for error messages (#2363)
* Add case for querying cell.error for error messages

* Add testing file

* Refactor test case for cell with error

* Reformat spaces
2020-03-10 10:14:15 +00:00
Chris Parker
93d34d781a Replaced some deprecated methods 2020-02-24 23:51:41 -06:00
Antonin Delpeuch
429f26c2ae Set version to 3.4-SNAPSHOT 2020-01-31 19:06:56 +01:00
Antonin Delpeuch
58b839b9c5 Set version to 3.3 2020-01-31 18:22:18 +01:00
Antonin Delpeuch
faece760f6 Set version to 3.3-SNAPSHOT 2020-01-08 20:56:51 +01:00
jamessspanggg
5afd93e2d1 Standardise 'edit' cell dialogue with 'toNumber()' behavior 2020-01-07 10:09:28 +08:00
Antonin Delpeuch
e62bb7ac0e Set version to 3.3-rc1 2020-01-06 13:30:39 +01:00
Antonin Delpeuch
904129d0f7 Fix other NPE in expression logging, for #2264 2020-01-06 06:30:56 +01:00
Antonin Delpeuch
14dd4c0112
Merge pull request #2264 from OpenRefine/issue-2086-expression-logging-npe
Fix NPE in expression logging.
2019-12-30 21:52:58 +01:00
Antonin Delpeuch
60089ab716
Merge pull request #2263 from OpenRefine/issue-2213-xlsx-export-url
More robust URI detection in tabular exporter.
2019-12-30 21:52:45 +01:00
Antonin Delpeuch
7593d5484d Add Hyperlink to cell in Excel importer, with fallback to String, for #2213 2019-12-25 22:24:58 +01:00
Antonin Delpeuch
08e175dc66 Fix NPE in expression logging. Closes #2086. 2019-12-25 12:33:42 +01:00
Antonin Delpeuch
0bd6a0fbd7
Merge pull request #2198 from viniciusbds/master
Dealing with a possible null pointer dereference
2019-12-25 11:42:34 +01:00
Antonin Delpeuch
78853f8fb2 More robust URI detection in tabular exporter. Closes #2213. 2019-12-25 11:33:03 +01:00
Antonin Delpeuch
726395620b
Merge pull request #2202 from viniciusbds/patch-1
Update SqlCreateBuilder.java
2019-12-16 08:18:20 +01:00
Antonin Delpeuch
cc5498a42a
Return best loaded language code in LoadLanguageCommand. (#2232)
Closes #2227.
2019-11-27 15:35:18 +00:00
Antonin Delpeuch
efbfce29bb Add server-side language fallback.
This keeps the same JavaScript calls for loading languages, so
extensions benefit from it without any change.

Closes #1350. Fixes #2209.
2019-11-07 17:23:02 +01:00
Vinicius Barbosa
d452e3040c
Update SqlCreateBuilder.java 2019-10-25 12:22:16 -03:00
Vinicius Barbosa
522641e84f
Update SetProjectTagsCommand.java 2019-10-25 11:03:41 -03:00
Antonin Delpeuch
c8eaaee39c Set version to 3.3-beta 2019-10-21 10:31:24 +01:00
viniciusbds
790fc2ffaa Dealing with a possible null pointer dereference 2019-10-18 00:23:26 -03:00
viniciusbds
5d89978000 Dealing with a possible null pointer dereference 2019-10-17 23:59:16 -03:00
Antonin Delpeuch
9ae6a7a581 Tie up CSRF tokens in the frontend 2019-10-15 12:07:14 +01:00
Antonin Delpeuch
5dc005749a Add CSRF protection to remaining commands 2019-10-15 12:07:13 +01:00
Antonin Delpeuch
3559eeb11f CSRF protection for project and recon commands 2019-10-15 12:07:12 +01:00
Antonin Delpeuch
a340c137d0 CSRF protection for OpenWorkspaceDirCommand and language loading 2019-10-15 12:07:04 +01:00
Antonin Delpeuch
91cead27f8 CSRF protection for ImportingController 2019-10-14 16:24:26 +01:00
Antonin Delpeuch
70e37b9085 Add CSRF protection to cell, history, column and expr commands 2019-10-14 16:24:26 +01:00
Antonin Delpeuch
51ddd27909 Require CSRF token in EditOneCellCommand 2019-10-14 16:24:26 +01:00
Antonin Delpeuch
21b841a089 Add CSRF token generation capabilities, for #2164 2019-10-14 16:24:26 +01:00
viniciusbds
496f1fd2d0 Fix bug when accessing empty list 2019-10-02 08:56:22 -03:00
viniciusbds
6743d5c878 Change strings comparison to use equals comparator 2019-10-01 23:05:24 -03:00
Antonin Delpeuch
bbb5766a33
Merge pull request #2155 from OpenRefine/issue-2152-lonely-clusters
Fix clusters with single candidates.
2019-09-18 19:08:18 +01:00
Antonin Delpeuch
36150a874d Fix scatterplot facet filtering 2019-09-12 11:52:28 +01:00
Antonin Delpeuch
573ba18e6d Fix scatterplot drawing command, closes #2117 2019-09-12 10:43:12 +01:00
Antonin Delpeuch
95b063162d Fix clusters with single candidates. Closes #2152. 2019-09-11 12:12:32 +01:00
Antonin Delpeuch
8ab7653e0b Set version to 3.3-SNAPSHOT 2019-07-26 15:52:00 +01:00
Antonin Delpeuch
e3417bff49 Set version to 3.2 2019-07-26 15:29:57 +01:00
Owen Stephens
ac7b5a0a19 Update Find and Tests 2019-07-21 13:34:18 +01:00
Owen Stephens
d6999de0da Match only accepts regular expressions 2019-07-21 13:19:34 +01:00
Antonin Delpeuch
33ff7be18a Fix NPE in StandardReconConfig. Closes #2076. 2019-07-03 10:21:45 +02:00
Antonin Delpeuch
cde59a0dca
Merge pull request #2070 from OpenRefine/issue-2068-duplicate-json-key
Remove duplicate JSON keys.
2019-07-02 10:19:16 +02:00
Antonin Delpeuch
8390d234b1
Merge pull request #2058 from OpenRefine/issue-1994-customMetadata
Fix parsing and display of custom metadata
2019-06-14 14:53:19 +01:00
Antonin Delpeuch
9d76b04a1c Remove duplicate JSON keys. Closes #2068. 2019-06-14 11:38:24 +01:00
Antonin Delpeuch
ad9566502f
Merge pull request #2059 from OpenRefine/issue-1989-filenotfound
Disable error message when workspace.json does not exist.
2019-06-06 20:57:31 +01:00
Antonin Delpeuch
afb787c845 Disable error message when workspace.json does not exist. Fixes #1989 2019-06-06 17:33:04 +01:00
Krzysztof 'impune-pl' Prorok
ae2f44f9d5 Fixed: issue 1998 2019-06-04 17:01:25 +02:00
Antonin Delpeuch
b9573d83e0 Add customMetadata to project metadata parsing test 2019-06-04 12:02:49 +01:00
s_tanaka
b8b9feac0c Fix column removal in reorder leaving hidden cells undeleted. 2019-05-15 19:37:40 +09:00
Antonin Delpeuch
edfa7d8445 Skip unknown operations in ApplyOperationsCommand 2019-04-19 11:25:01 +01:00