RandomSec

Author	SHA1	Message	Date
Tom Morris	959200d141	Maintain order for uniques() - fixes #3235 Also add tests	2020-09-30 17:45:24 -04:00
Thad Guidry	3f6d1eabba	Adds new Jsoup wholeText() function and tests (#3181 ) * Adds new Jsoup wholeText() function and tests - Ref: https://github.com/jhy/jsoup/blob/master/CHANGES#L275 - Ref: https://jsoup.org/apidocs/org/jsoup/nodes/Element.html#wholeText() * update the description of function * Update main/src/com/google/refine/expr/functions/xml/WholeText.java	2020-09-12 16:14:26 +02:00
Tom Morris	eaf881ced7	Importer refactoring cleanup (#2984 ) * Clean up importer refactoring Remove an extra copy of filename setting. Revert some additional API changes (retaining both versions) * Revert archive file name changes & mark as deprecated	2020-09-06 17:46:08 -04:00
Tom Morris	15a069d3d5	Improve exception reporting - refs #3145 (#3146 ) Include the exception name with the message returned to the user.	2020-09-06 08:26:49 +02:00
Tom Morris	a86a6d4e3b	sort() handles nulls instead of throwing NPE - fixes #3152 (#3162 ) * Add utility helpers to create array of comparable items * Extend sort() to handle arrays with nulls - Instead of NullPointerException on nulls, sort them last - add JSON helpers to return Comparable[] in addition to Object[] - Non-homogenous arrays or arrays with non-primitive objects (array or object) are not sortable - Add tests for both new and old sort functionality	2020-09-05 23:01:47 +02:00
Tom Morris	37ae9a3d51	Merge pull request #3156 from OpenRefine/3132-recon-read-timeout Set read timeout to 60 sec for reconciliation.	2020-09-02 13:16:17 -04:00
Antonin Delpeuch	609e4bac4c	Set read timeout to 60 sec for reconciliation. Closes #3132	2020-09-02 16:03:29 +02:00
Tom Morris	aa43445c99	Extend forEach() to support JSON objects (#3150 ) * Refactor GREL Get tests - move helper up to RefineTest - move tests to the correct module * Extend forEach() to support JSON objects - fixes #3149 Also add tests for existing forEach forms in addition to the new one * Add a couple more tests	2020-08-30 08:40:17 +02:00
Tom Morris	95756bf11f	Replace deprecated constant	2020-08-23 14:17:40 -04:00
Antonin Delpeuch	9ac54edbba	Migrate reconciliation calls to Apache HTTP client (#2906 ) * Migrate reconciliation calls to OkHTTP, for #2903 * Migrate to Apache HTTP Commons * Migrate data extension to Apache HTTP client * Deprecate HttpURLConnection in RefineServlet * Use LaxRedirectStrategy, clean up imports * Remove read and pool timeouts, only keep the connection timeout * Adapt mocking of HTTP calls after migration	2020-08-23 14:04:59 +02:00
Tom Morris	fc21d58ed1	Don't count TABs as control characters - fixes #3061 (#3068 ) * Don't count TABs as control characters - fixes #3061 * Add TSV test. Replace info logging w/assert message	2020-08-16 10:35:25 +02:00
Tom Morris	9c403d59d2	Add separator to zip slip check - fixes #3043 (#3048 )	2020-08-09 14:48:55 +02:00
Tom Morris	55edae2b7b	Fix ToDate test failure & inefficiency - fixes #3026 (#3027 ) * Fix ToDate test failure - fixes #3026 Instead of computing offset from UTC at current point in time, use the offset from the parsed date so that we're not affected by crossing a daylight savings time boundary. * Fix date parsing with locale as first format string Also refactors for simplicity, restore some dropped tests, and restores previous behavior of considering a bad format string an error instead of silently ignoring it. It does NOT address another issue which was introduced in May 2018 of treating date/times without timezone information as UTC instead of local. * Restore error checking and messages * Save & restore default timezone for tests Also add some ToDos for places where LocalDate is being misused.	2020-08-09 13:53:43 +02:00
Tom Morris	83ed9ffdaf	Refactor importer APIs - Fixes #2963 (#2978 ) * Make sure data directory is directory, not a file * Add a test for zip archive import Also tests the saving of the archive file name and source filename * Add TODOs - no functional changes * Cosmetic cleanups * Revert importer API changes for archive file name parameter Fixes #2963 - restore binary compatibility to the API - hoist the handling of both fileSource and archiveFileName from TabularImportingParserBase and TreeImportingParserBase to ImportingParserBase so that there's only one copy. These 3 classes are all part of the internal implementation, so there should be no compatibility issue. * Revert weird flow of control for import options metadata This reverts the very convoluted control flow that was introduced when adding the input options to the project metadata. Instead the metadata is all handled in the importer framework rather than having to change APIs or have individual importers worry about it. The feature never had test coverage, so that is still to be added. * Add test for import options in project metadata & fix bug Fixes bug where same options object was being reused and overwritten, so all copies in the list ended up the same.	2020-07-23 18:36:14 +02:00
Tom Morris	a3fab26cca	Fix the text format guesser so it doesn't inappropriately guess WikiText (#2924 ) * Fix text guesser so it doesn't guess wikitext Fixes #2850 - Add simple magic detector for zip & gzip files to keep it from attempting to guess binary files - Add a counter for C0 controls for the same reason - Tighten wikitable counters to require marker at beginning of the line, per the specification - Refactor to use Apache Commons instead of private counting methods - Add tests for most TextGuesser formats * Remove misplaced duplicate test data file * Fix LGTM warning + minor cleanups * Use BoundedInputStream to prevent runaway lines	2020-07-15 08:56:00 +02:00
Tom Morris	ed68541988	Remove informational logging from tests that are passing (#2923 ) * Change logging from info to debug * Make tests less chatty when they're passing	2020-07-14 17:47:36 +02:00
Tom Morris	306b541c69	Fix Excel date import - Fixes #1908 (#2909 ) * Add utility functions to check/convert dates * Add date tests and refactor to DRY up * Fix date import - fixes #1908 Change from java.util.Date to OpenRefine 3.0+'s OffsetDateTime Fixes #1908 * Centralize date conversion * Moving utility methods to ParsingUtilities * Fix tests	2020-07-09 23:13:44 +02:00
Tom Morris	0562638ffa	Use standard text normalization - fixes #2898 (#2900 ) * Use standard text normalization - fixes #2898 Fixes #2898. Fixes #409. Refs #650 Replaces homegrown ISO Latin-1 only character substitution with standard Java Normalize to NFD, followed by diacritic removal and a few custom character expansions/replacements. * Fix Mac build * Improve compatibility with previous code One intentional change is folding O with stroke to oe instead of o. - Use more powerful NFKD instead of NFD - strip punctuation after decomposition since it can generate new punctuation - Add compatibility test for old asciify() method - Add some graphically similar characters to substitution table * Add oe character/ligature & more long S forms * More tests for ligatures and Latin Extended * Add Latin-1 Supplement tests	2020-07-07 21:35:41 +02:00
Urvashi Gupta	f62f63706c	Report HTTP error codes to the user when creating a project from a URL (#2870 ) * HTTP Error * urlImportingTestCompleted	2020-07-07 11:58:47 +02:00
Tom Morris	e61d50a1aa	Fix NGramFingerprintKeyer to ignore accents - fixes #1161 (#2899 ) Fixes #1161 This change parallels what was done in #1257 `1da3c00` to fix the FingerprintKeyer and moves the diacritic removal before the deduping. Includes a test.	2020-07-07 09:02:49 +02:00
Tom Morris	3717111db8	Fix Open Office Spreadsheet (ODS) dates (#2843 ) * Truncate any completely empty columns on the right Fixes #565 The current versions of Open Office create default spreadsheets with over 1000 empty columns. Keep track of the rightmost non-empty column when importing and truncate everything else. Also adds a basic ODS import test. * Fix dates in ODS spreadsheets Fixes #2224	2020-07-04 08:42:33 +02:00
Tom Morris	df8d092132	Micro benchmark harness & ToNumber optimizations (#2859 ) * Performance optimized version of ToNumber Approximately 5x faster for floats (data dependent) and about the same speed for integers. - Instead of blindly trying to parse as Long, do a quick check for obvious problems (e.g. decimal point). - Don't trim. It's already done by called methods. - Use valueOf() instead of parse() to avoid object creation * Add Java Microbenchmark Harness The shaded JAR is missing the OpenRefine classes, for a reason that I haven't figured out, so requires openrefine-main.jar at runtime. * Remove old implementations of ToNumber * Remove unneeded dependencies from main project * Clean up and reformat	2020-07-03 21:42:44 +02:00
Tom Morris	d3db73aa67	Remove shortest-column-name ordering Refs #2863 The tree importer sorts columns/column groups by how populated they are, which is of arguable utility, but the tie-breaker of ordering by shortest column name is completely silly. This change removes that and, in conjunction with a stable sort algorithm, will preserve the original order of the columns.	2020-07-02 16:12:55 -04:00
Tom Morris	54291ef441	Use Apache IO Commons IOUtils instead of homerolled (#2845 ) Probably should remove the funky Gzip support with the overloaded use of the encoding parameter, but this is a start.	2020-06-30 13:49:47 +02:00
Tom Morris	421974cc3d	Truncate any completely empty columns on the right (#2842 ) Fixes #565 The current versions of Open Office create default spreadsheets with over 1000 empty columns. Keep track of the rightmost non-empty column when importing and truncate everything else. Also adds a basic ODS import test.	2020-06-30 08:19:00 +02:00
Tom Morris	4b146acc6e	Create Project import improvements (#2806 ) * Fix charset encoding & MIME type handling Character set (ie what we call "encoding") is part of the Content-Type, not the Content-Encoding, which specifies compression (e.g. gzip). This correctly sets the character set encoding as well as cleaning the MIME type so that additional parsing doesn't need to be done downstream (and removes that code). * Use "text" instead of "text/line-based" as default fallback format The TextLineBasedGuesser only tries a limited number of formats (CSV, TSV, fixed), so we can't get out of that hole to find JSON, XML, etc. Start with a more general format instead to improve our guessing odds. * Support content type Structured Name Syntax Suffixes (+json +xml) If we can't find a fully specified content type in our lookup, fall back to just the suffix (which is registered with a leading +) Fixes #2800 Fixes #2805	2020-06-25 08:36:57 +02:00
Tom Morris	1849e62234	Better error handling for reconciliation process - fixes #2590 (#2671 ) * Harden reconciliation - Fixes #2590 - check for non-JSON / unparseable JSON returns - handle malformed results response with no name for candidates - catch any Exception, not just IOExceptions - call processManager.onFailedProcess() for cleanup on error * Add default constructor for Jackson Jackson complains about needing a default constructor for the NON_DEFAULT annotation, but I'm not sure why this worked before. * Clean up indentation and unused variable - no functional changes Make indentation consistent throughout the module, changing recently added lines to use the standard all spaces convention. Remove unused count variable * Simplify control flow * Update limit parameter comment. No functional change. * Replace ternary expression which is causing NPE * Add reconciliation tests using mock HTTP server	2020-06-23 21:54:54 +02:00
Tom Morris	e293602897	Restore character encoding guesser (#2755 ) * Fixes #486. Builds on code from Steffen Stundzig - Switch from ICU4J to juniversalchardet (Java port of Mozilla charset detector) - Replace org.json code with Jackson - Add tests - Add TODO for multi-file character encoding mismatches * Restore dependency lost in rebase Co-authored-by: Steffen Stundzig <git@stundzig.de>	2020-06-22 06:04:51 +02:00
Tom Morris	5d2c10b9d8	Merge pull request #2731 from tfmorris/jena-3.7.0 Bump Jena from 3.6.0 to 3.15.0	2020-06-16 17:08:01 -04:00
Tom Morris	5f368bc56d	Use ContentDisposition instead of ContentType to control download (#2722 ) * Use ContentDisposition instead of ContentType to control download Fixes #1197. Previously we were using a funky ContentType to attempt to force a file download rather than display in browser, but this conflicted with attempts to save UTF-8 which was outside the Basic Multilingual Plane (BMP). By switching to ContentDisposition: attachment, which has been the preferred method for a number of years, we can avoid this conflict. As part of this, switch to using the "preview" param consistently to control preview vs download rather than the content type. * Switch content type to text/plain Now that we don't need to use ContentType to control download behavior, we can use something more reasonable.	2020-06-16 15:46:07 +02:00
Tom Morris	749704518c	Use Apache HTTP Commons for Fetch URL (#2692 ) * Use mockwebserver instead of live network for tests Fixes #2680. Fixes #1904. * Remove use of deprecated methods * Convert to use Apache HTTP Components client library Fixes #1410 by virtue of redirect following being a built-in capability of the library, along with retries with binary backoff, built-in decompression, etc. * Address review comments	2020-06-16 09:38:06 +02:00
Tom Morris	559494b75d	Add TODOs for Jena RDF language names	2020-06-15 20:04:05 -04:00
Tom Morris	348d82d131	Merge pull request #2725 from OpenRefine/issue-2724-wikidata-endpoint Update URL of Wikidata reconciliation service	2020-06-15 17:20:05 -04:00
james-cui	04055153a1	add archive column (#2573 ) Co-authored-by: Antonin Delpeuch <antonin@delpeuch.eu>	2020-06-15 19:56:00 +02:00
Joanne Ong	d57d76f7df	Fix imprecise facet statistics in records mode (#2607 ) * Fix bug in choice counts for records mode * Add test for value grouper on records * Refactor and comment code * Count distinct instances of null/blank data * Update test to check for blank data count in records * Remove unnecessary import statement	2020-06-15 19:38:50 +02:00
Lisa Chandra	947356ddad	[FEAT]Adds new options for split (#2471 ) * added options ui * added definition for both separators * added tests * removed definitions from backend and added them to frontend * added reverse order and handling for accented characters * added tests for accented characters and reverse split * fixed build errors * unicode character ranges instead * added examples	2020-06-15 19:30:18 +02:00
Antonin Delpeuch	1bb9e8a67e	Update URL of Wikidata reconciliation service. Closes #2724	2020-06-15 00:35:10 +02:00
Tom Morris	bf1c890cc3	Unused imports and other minor cleanups (#2723 ) * Two minor fixes - prevent invalid index error on empty strings (shouldn't normally happen) - update deprecated Apache Commons Lang method * Remove unused imports	2020-06-14 21:18:02 +02:00
chuhao zeng	9b03ecae41	Convert illegal characters into legal ones. (#2431 ) * Convert illegal characters into legal ones. * Test tab in key & value string Also fix up test that depended on previous TAB related error message and clean up logging Co-authored-by: Tom Morris <tfmorris@gmail.com>	2020-06-14 09:47:58 +02:00
Tom Morris	18c18e587e	Replace Apache Ant with Commons Compress (#2691 ) NOTE: Changes the public API where some of the old types were embedded which means that any extensions that extend these interfaces will have to be updated. Fixes #2690.	2020-06-11 16:39:51 +02:00
Tom Morris	e6ed8e5d62	Save preferences JSON using UTF-8 encoding. Bulletproof prefs load. (#2657 ) * Save preferences JSON using UTF-8 encoding. Bulletproof prefs load. Fixes #2543. Fixes #2627. Always use UTF-8 to write JSON because platform default encoding might not be legal JSON (e.g. ISO 8859-1). Also be more conservative about keeping backups if we fail to write. * Handle case where backup prefs is better than more recent * Recover from corrupted prefs with null starred list. Fixes #2544. Replaces null with an empty list. * Run tests with non-UTF-8 encoding Make sure that we don't depend on UTF-8 being the default encoding because it isn't true everywhere (e.g. Windows) * Add test for non-ASCII chars in workspace.json This depends on the default Java encoding being something other than UTF-8 to test properly.	2020-06-06 10:00:01 +01:00
Antoine Beaubien	3ca08f6ff1	Changed cell.error to cell.errorMessage & added help data. (#2628 ) * Changed cell.error to cell.errorMessage & added help data. Changed cell.error to cell.errorMessage and added the information into the internal help system. * FR Text correction * HU Fix text 3 instead of 2.	2020-05-23 14:05:25 +02:00
Lu Liu	e89eaf0ee2	support default project name and column name for cross() (#2518 )	2020-05-22 09:39:57 +02:00
Tom Morris	557ffad920	Merge pull request #2586 from OpenRefine/issue-2510-type-boolean Support "boolean" return for type() function. Closes #2510	2020-05-18 17:24:47 -04:00
Antoine2711	0e86619d86	Fix the true.type() == "boolean" Fix the true.type() == "boolean" instead of java.lang.Boolean. Remove all the references to "error" result in Type(). This will be addressed in: @ToDo fix this with issue #2562	2020-05-18 17:23:43 -04:00
Antonin Delpeuch	d7d567439e	Set version to 3.5-SNAPSHOT	2020-05-13 22:56:33 +02:00
Antonin Delpeuch	5597e1c942	Set version to 3.4-beta	2020-05-13 22:52:25 +02:00
Antonin Delpeuch	825e687b0b	Fix bug when both trim and autodetect are enabled in tabular parser. Closes #2584 (#2610 )	2020-05-05 14:00:17 +02:00
Thad Guidry	15710ace17	reduce object creation during JSON serialization (#2576 ) If a new {@code Double} instance is not required, this method * should generally be used in preference to the constructor * {@link #Double(double)}, as this method is likely to yield * significantly better space and time performance by caching * frequently requested values.	2020-05-05 10:07:54 +02:00
PJ Fanning	f047a88518	poi works better reading files directly (#2597 )	2020-04-26 21:27:09 +02:00

1107 Commits