Added technical reference to docusaurus based on existing wiki pages (#2940)

This commit is contained in:
Owen Stephens 2020-07-16 17:39:03 +01:00 committed by GitHub
parent dd17d7a2b6
commit 09add9fa31
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
31 changed files with 2028 additions and 315 deletions

View File

@ -1,15 +0,0 @@
---
id: architecture
title: Architecture
sidebar_label: Architecture
---
OpenRefine is a web application, but is designed to be run locally on your own machine. The server-side maintains states of the data (undo/redo history, long-running processes, etc.) while the client-side maintains states of the user interface (facets and their selections, view pagination, etc.). The client-side makes GET and POST ajax calls to cause changes to the data and to fetch data and data-related states from the server-side.
This architecture provides a good separation of concerns (data vs. UI); allows the use of familiar web technologies (HTML, CSS, Javascript) to implement user interface features; and enables the server side to be called by third-party software through standard GET and POST operations.
- [Technology Stack](techstack): What languages, libraries and frameworks are used in the OpenRefine application
- [Server Side](server): how the data is modeled, stored, changed, etc.
- [Client Side](client): how the UI is built
- [Importing](importing): how OpenRefine supports the import of data to create projects
- [Faceted Browsing](faceted_browsing): how faceted browsing is implemented (this straddles the client and the server)

View File

@ -1,11 +0,0 @@
---
id: client
title: Client Architecture
sidebar_label: Client Architecture
---
The client-side part of OpenRefine is implemented in HTML, CSS and Javascript and uses the following Javascript libraries:
- [jQuery](http://jquery.com/)
- [jQueryUI](http:jqueryui.com/)
- [Recurser jquery-i18n](https://github.com/recurser/jquery-i18n)

View File

@ -1,74 +0,0 @@
---
id: faceted_browsing
title: Faceted Browsing Architecture
sidebar_label: Faceted Browsing
---
Faceted browsing support is core to OpenRefine as it is the primary and only mechanism for filtering to a subset of rows on which to do something _en masse_ (ie in bulk). Without faceted browsing or an equivalent querying/browsing mechanism, you can only change one thing at a time (one cell or row) or else change everything all at once; both kinds of editing are practically useless when dealing with large data sets.
In OpenRefine, different components of the code need to know which rows to process from the faceted browsing state (how the facets are constrained). For example, when the user applies some facet selections and then exports the data, the exporter serializes only the matching rows, not all rows in the project. Thus, faceted browsing isn't only hooked up to the data view for displaying data to the user, but it is also hooked up to almost all other parts of the system.
## Engine Configuration
As OpenRefine is a web app, there might be several browser windows opened on the same project, each in a different faceted browsing state. It is best to maintain the faceted browsing state in each browser window while keeping the server side completely stateless with regard to faceted browsing. Whenever the client-side needs something done by the server, it transfers the entire faceted browsing state over to the server-side. The faceted browsing state behaves much like the `WHERE` clause in a SQL query, telling the server-side how to select the rows to process.
In fact, it is best to think of the faceted browsing state as just a database query much like a SQL query. It can be passed around the whole system, to any component needing to know which rows to process. It is serialized into JSON to pass between the client-side and the server-side, or to save in an abstract operation's specification. The job of the faceted browsing subsystem on the client-side is to let the user interactively modify this "faceted browsing query", and the job of the faceted browsing subsystem on the server-side is to resolve that query.
In the code, the faceted browsing state, or faceted browsing query, is actually called the *engine configuration* or *engine config* for short. It consists mostly of an array facet configurations. For each facet, it stores the name of the column on which the facet is based (or an empty string if there is no base column). Each type of facet has different configuration. Text search facets have queries and flags for case-sensitivity mode and regular expression mode. Text facets (aka list facets) and numeric range facets have expressions. Each list facet also has an array of selected choices, an invert flag, and flags for whether blank and error cells are selected. Each numeric range facet has, among other things, a "from" and a "to" values. If you trace the AJAX calls, you'd see the engine configs being shuttled, e.g.,
```json
{
"facets" : [
{
"type": "text",
"name": "Shrt_Desc",
"columnName": "Shrt_Desc",
"mode": "text",
"caseSensitive": false,
"query": "cheese"
},
{
"type": "list",
"name": "Shrt_Desc",
"columnName": "Shrt_Desc",
"expression": "grel:value.toLowercase().split(\",\")",
"omitBlank": false,
"omitError": false,
"selection": [],
"selectBlank":false,
"selectError":false,
"invert":false
},
{
"type": "range",
"name": "Water",
"expression": "value",
"columnName": "Water",
"selectNumeric": true,
"selectNonNumeric": true,
"selectBlank": true,
"selectError": true,
"from": 0,
"to": 53
}
],
"includeDependent": false
}
```
## Server-Side Subsystem
From an engine configuration like the one above, the server-side faceted browsing subsystem is capable of producing:
- an iteration over the rows matching the facets' constraints
- information on how to render the facets (e.g., choice and count pairs for a list facet, histogram for a numeric range facet)
When the engine config JSON arrives in an HTTP request on the server-side, a `com.google.refine.browsing.Engine` object is constructed and initialized with that JSON. It in turns constructs zero or more `com.google.refine.browsing.facets.Facet` objects. Then for each facet, the engine calls its `getRowFilter()` method, which returns `null` if the facet isn't constrained in anyway, or a `com.google.refine.browsing.filters.RowFilter` object. Then, to when iterating over a project's rows, the engine calls on all row filters' `filterRow()` method. If and only if all row filters return `true` the row is considered to match the facets' constraints. How each row filter works depends on the corresponding type of facet.
To produce information on how to render a particular facet in the UI, the engine follows the same procedure described in the previous except it skips over the facet in question. In other words, it produces an iteration over all rows constrained by the other facets. Then it feeds that iteration to the facet in question by calling the facet's `computeChoices()` method. This gives the method a chance to compute the rendering information for its UI counterpart on the client-side. When all facets have been given a chance to compute their rendering information, the engine calls all facets to serialize their information as JSON and returns the JSON to the client-side. Only one HTTP call is needed to compute all facets.
## Client-Side Subsystem
On the client-side there is also an engine object (implemented in Javascript rather than Java) and zero or more facet objects (also in Javascript, obviously). The engine is responsible for distributing the rendering information computed on the server-side to the right facets, and when the user interacts with a facet, the facet tells the engine to update the whole UI. To do so, the engine gathers the configuration of each facet and composes the whole engine config as a single JSON object. Two separate AJAX calls are made with that engine config, one to retrieve the rows to render, and one to re-compute the rendering information for the facets because changing one facet does affect all the other facets.

View File

@ -1,111 +0,0 @@
---
id: importing
title: Importing Architecture
sidebar_label: Importing Architecture
---
OpenRefine has a sophisticated architecture for accommodating a diverse and extensible set of importable file formats and work flows. The formats range from simple CSV, TSV to fixed-width fields to line-based records to hierarchical XML and JSON. The work flows allow the user to preview and tweak many different import settings before creating the project. In some cases, such as XML and JSON, the user also has to select which elements in the data file to import. Additionally, a data file can also be an archive file (e.g., .zip) that contains many files inside; the user can select which of those files to import. Finally, extensions to OpenRefine can inject functionalities into any part of this architecture.
## The Index Page and Action Areas
The opening screen of OpenRefine is implemented by the file refine/main/webapp/modules/core/index.vt and will be referred to here as the index page. Its default implementation contains 3 finger tabs labeled Create Project, Open Project, and Import Project. Each tab selects an "action area". The 3 default action areas are for, obviously, creating a new project, opening an existing project, and importing a project .tar file.
Extensions can add more action areas in Javascript. For example, this is how the Create Project action area is added (refine/main/webapp/modules/core/scripts/index/create-project-ui.js):
```javascript
Refine.actionAreas.push({
id: "create-project",
label: "Create Project",
uiClass: Refine.CreateProjectUI
});
```
The UI class is a constructor function that takes one argument, a jQuery-wrapped HTML element where the tab body of the action area should be rendered.
If your extension requires a very unique importing work flow, or a very novel feature that should be exposed on the index page, then add a new action area. Otherwise, try to use the existing work flows as much as possible.
## The Create Project Action Area
The Create Project action area is itself extensible. Initially, it embeds a set of finger tabs corresponding to a variety of "source selection UIs": you can select a source of data by specifying a file on your computer, or you can specify the URL to a publicly accessible data file or data feed, or you can paste in from the clipboard a chunk of data.
There are actually 3 points of extension in the Create Project action area, and the first is invisible.
### Importing Controllers
The Create Project action area manages a list of "importing controllers". Each controller follows a particular work flow (in UI terms, think "wizard"). Refine comes with a "default importing controller" (refine/main/webapp/modules/core/scripts/index/default-importing-controller/controller.js) and its work flow assumes that the data can be retrieved and cached in whole before getting processed in order to generate a preview for the user to inspect. (If the data cannot be retrieved and cached in whole before previewing, then another importing controller is needed.)
An importing controller is just programming logic, but it can manifest itself visually by registering one or more data source UIs and one or more custom panels in the Create Project action area. The default importing controller registers 3 such custom panels, which act like pages of a wizard.
An extension can register any number of importing controller. Each controller has a client-side part and a server-side part. Its client-side part is just a constructor function that takes an object representing the Create Project action area (usually named `createProjectUI`). The controller (client-side) is expected to use that object to register data source UIs and/or create custom panels. The controller is not expected to have any particular interface method. The default importing controller's client-side code looks like this (refine/main/webapp/modules/core/scripts/index/default-importing-controller/controller.js):
```javascript
Refine.DefaultImportingController = function(createProjectUI) {
this._createProjectUI = createProjectUI; // save a reference to the create project action area
this._progressPanel = createProjectUI.addCustomPanel(); // create a custom panel
this._progressPanel.html('...'); // render the custom panel
... do other stuff ...
};
Refine.CreateProjectUI.controllers.push(Refine.DefaultImportingController); // register the controller
```
We will cover the server-side code below.
### Data Source Selection UIs
Data source selection UIs are another point of extensibility in the Create Project action area. As mentioned previously, by default there are 3 data source UIs. Those are added by the default importing controller.
Extensions can also add their own data source UIs. A data source selection UI object can be registered like so
```javascript
createProjectUI.addSourceSelectionUI({
label: "This Computer",
id: "local-computer-source",
ui: theDataSourceSelectionUIObject
});
```
`theDataSourceSelectionUIObject` is an object that has the following member methods:
- `attachUI(bodyDiv)`
- `focus()`
If you want to install a data source selection UI that is managed by the default importing controller, then register its UI class with the default importing controller, like so (refine/main/webapp/modules/core/scripts/index/default-importing-sources/sources.js):
```javascript
Refine.DefaultImportingController.sources.push({
"label": "This Computer",
"id": "upload",
"uiClass": ThisComputerImportingSourceUI
});
```
The default importing controller will assume that the `uiClass` field is a constructor function and call it with one argument--the controller object itself. That constructor function should save the controller object for later use. More specifically, for data source UIs that use the default importing controller, they can call the controller to kickstart the process that retrieves and caches the data to import:
```javascript
controller.startImportJob(form, "... status message ...");
```
The argument `form` is a jQuery-wrapped FORM element that will get submitted to the server side at the command /command/core/create-importing-job. That command and the default importing controller will take care of uploading or downloading the data, caching it, updating the client side's progress display, and then showing the next importing step when the data is fully cached.
See refine/main/webapp/modules/core/scripts/index/default-importing-sources/sources.js for examples of such source selection UIs. While we write about source selection UIs managed by the default importing controller here, chances are your own extension will not be adding such a new source selection UI. Your extension probably adds with a new importing controller as well as a new source selection UI that work together.
### File Selection Panel
Documentation not currently available
### Parsing UI Panel
Documentation not currently available
## Server-Side Components
### ImportingController
Documentation not currently available
### UrlRewriter
Documentation not currently available
### FormatGuesser
Documentation not currently available
### ImportingParser
Documentation not currently available

View File

@ -1,74 +0,0 @@
---
id: server
title: Server Architecture
sidebar_label: Server Architecture
---
OpenRefine's server-side is written entirely in Java (`main/src/`) and its entry point is the Java servlet `com.google.refine.RefineServlet`. By default, the servlet is hosted in the lightweight Jetty web server instantiated by `server/src/com.google.refine.Refine`. Note that the server class itself is under `server/src/`, not `main/src/`; this separation leaves the possibility of hosting `RefineServlet` in a different servlet container.
The web server configuration is in `main/webapp/WEB-INF/web.xml`; that's where `RefineServlet` is hooked up. `RefineServlet` itself is simple: it just reacts to requests from the client-side by routing them to the right `Command` class in the packages `com.google.refine.commands.**`.
As mentioned before, the server-side maintains states of the data, and the primary class involved is `com.google.refine.ProjectManager`.
## Projects
In OpenRefine there's the concept of a workspace similar to that in Eclipse. When you run OpenRefine it manages projects within a single workspace, and the workspace is embodied in a file directory with sub-directories. The default workspace directories are listed in the [FAQs](https://github.com/OpenRefine/OpenRefine/wiki/FAQ-Where-Is-Data-Stored). You can get OpenRefine to use a different directory by specifying a -d parameter at the command line.
The class `ProjectManager` is what manages the workspace. It keeps in memory the metadata of every project (in the class `ProjectMetadata`). This metadata includes the project's name and last modified date, and any other information necessary to present and let the user interact with the project as a whole. Only when the user decides to look at the project's data would `ProjectManager` load the project's actual data. The separation of project metadata and data is to minimize the amount of stuff loaded into memory.
A project's _actual_ data includes the columns, rows, cells, reconciliation records, and history entries.
A project is loaded into memory when it needs to be displayed or modified, and it remains in memory until 1 hour after the last time it gets modified. Periodically the project manager tries to save modified projects, and it saves as many modified projects as possible within 30 seconds.
## Data Model
A project's data consists of
- _raw data_: a list of rows, each row consisting of a list of cells
- _models_ on top of that raw data that give high-level presentation or interpretation of that data. This design lets the same raw data be viewed in different ways by different models, and let the models be changed without costly changes to the raw data.
### Column Model
Cells in rows are not named and can only be addressed by their list position indices. So, a _column model_ is needed to give a name to each list position. The column model also stores other metadata for each column, including the type that cells in the column have been reconciled to and the overall reconciliation statistics of those cells.
Each column also acts as a cache for data computed from the raw data related to that column.
Columns in the column model can be removed and re-ordered without changing the raw data--the cells in the rows. This makes column removal and ordering operations really quick.
#### Column Groups
Consider the following data:
![Illustration of row groups in OpenRefine](https://raw.github.com/OpenRefine/OpenRefine/2.0/graphics/row-groups.png)
Although the data is in a grid, we humans can understand that it is a tree. First of all, all rows contain data ultimately linked to the movie Austin Powers, although only one row contains the text "Austin Powers" in the "movie title" column. We also know that "USA" and "Germany" are not related to Elizabeth Hurley and Mike Myers respectively (say, as their nationality), but rather, "USA" and "Germany" are related to the movie (where it was released). We know that Mike Myers played both the character "Austin Powers" and the character "Dr. Evil"; and for the latter he received 2 awards. We humans can understand how to interpret the grid as a tree based on its visual layout as well as some knowledge we have about the movie domain but is not encoded in the table.
OpenRefine can capture our knowledge of this transformation from grid to tree using _column groups_, also stored in the column model. Each column group illustrated as a blue bracket above specifies which columns are grouped together, as well as which of those columns is the key column in that group (blue triangle). One column group can span over columns grouped by another column group, and in this way, column groups form a hierarchy determined by which column group envelopes another. This hierarchy of column groups allows the 2-dimensional (grid-shaped) table of rows and cells to be interpreted as a list of hierarchical (tree-shaped) data records.
Blank cells play a very important role. The blank cell in a key column of a row (e.g., cell "character" on row 4) makes that row (row 4) _depend_ on the first preceding row with that column filled in (row 3). This means that "Best Comedy Perf" on row 4 applies to "Dr. Evil" on row 3. Row 3 is said to be a _context row_ for row 4. Similarly, since rows 2 - 6 all have blank cells in the first column, they all depend on row 1, and all their data ultimately applies to the movie Austin Powers. Row 1 depends on no other row and is said to be a _record row_. Rows 1 - 6 together form one _record_.
Currently (as of 12th December 2017) only the XML and JSON importers create column groups, and while the data table view does display column groups but it doesn't support modifying them.
## Changes, History, Processes, and Operations
All changes to the project's data are tracked (N.B. this does not include changes to a project's metadata - such as the project name.)
Changes are stored as `com.google.refine.history.Change` objects. `com.google.refine.history.Change` is an interface, and implementing classes are in `com.google.refine.model.changes.**`. Each change object stores enough data to modify the project's data when its `apply()` method is called, and enough data to revert its effect when its `revert()` method is called. It's only supposed to _store_ data, not to actually _compute_ the change. In this way, it's like a .diff patch file for a code base.
Some change objects can be huge, as huge as the project itself. So change objects are not kept in memory except when they are to be applied or reverted. However, since we still need to show the user some information about changes (as displayed in the History panel in the UI), we keep metadata of changes separate from the change objects. For each change object there is one corresponding `com.google.refine.history.HistoryEntry` for storing its metadata, such as the change's human-friendly description and timestamp.
Each project has a `com.google.refine.history.History` object that contains an ordered list of all `HistoryEntry` objects storing metadata for all changes that have been done since after the project was created. Actually, there are 2 ordered lists: one for done changes that can be reverted (undone), an done for undone changes that can be re-applied (redone). Changes must be done or redone in their exact orders in these lists because each change makes certain assumptions about the state of the project before and after it is applied. As changes cannot be undone/redone out of order, when one change fails to revert, it blocks the whole history from being reverted to any state preceding that change (as happened in [Issue #2](https://github.com/OpenRefine/OpenRefine/issues/2)).
As mentioned before, a change contains only the diff and does not actually compute that diff. The computation is performed by a `com.google.refine.process.Process` object--every change object is created by a process object. A process can be immediate, producing its change object synchronously within a very short period of time (e.g., starring one row); or a process can be long-running, producing its change object after a long time and a lot of computation, including network calls (e.g., reconciling a column).
As the user interacts with the UI on the client-side, their interactions trigger ajax calls to the server-side. Some calls are meant to modify the project. Those are handled by commands that instantiates processes. Processes are queued in a first-in-first-out basis. The first-in process gets run and until it is done all the other processes are stuck in the queue.
A process can effect a change in one thing in the project (e.g., edit one particular cell, star one particular row), or a process can effect changes in _potentially_ many things in the project (e.g., edit zero or more cells sharing the same content, starring all rows filtered by some facets). The latter kind of process is generalizable: it is meaningful to apply them on another similar project. Such a process is associated with an _abstract operation_ `com.google.refine.model.AbstractOperation` that encodes the information necessary to create another instance of that process, but potentially for a different project. When you click "extract" in the History panel, these abstract operations are called to serialize their information to JSON; and when you click "apply" in the History panel, the JSON you paste in is used to re-construct these abstract operations, which in turn create processes, which get run sequentially in a queue to generate change object and history entry pairs.
In summary,
- change objects store diffs
- history entries store metadata of change objects
- processes compute diffs and create change object and history entry pairs
- some processes are long-running and some are immediate; processes are run sequentially in a queue
- generalizable processes can be re-constructed from abstract operations

View File

@ -1,22 +0,0 @@
---
id: techstack
title: Technology Stack
sidebar_label: Technology Stack
---
The server-side part of OpenRefine is implemented in Java as one single servlet which is executed by the [Jetty](http://jetty.codehaus.org/jetty/) web server + servlet container. The use of Java strikes a balance between performance and portability across operating system (there is very little OS-specific code and has mostly to do with starting the application).
OpenRefine has no database using its own in-memory data-store that is built up-front to be optimized for the operations required by faceted browsing and infinite undo.
The client-side part of OpenRefine is implemented in HTML, CSS and Javascript and uses the following libraries:
* [jQuery](http://jquery.com/)
* [jQueryUI](http:jqueryui.com/)
* [Recurser jquery-i18n](https://github.com/recurser/jquery-i18n)
The functional extensibility of OpenRefine is provided by the [SIMILE Butterfly](http://code.google.com/p/simile-butterfly/) modular web application framework.
Several projects provide the functionality to read and write custom format files (POI, opencsv, JENA, marc4j).
String clustering is provided by the [SIMILE Vicino](http://code.google.com/p/simile-vicino/) project.
OAuth functionality is provided by the [Signpost](https://github.com/mttkay/signpost) project.

View File

@ -0,0 +1,283 @@
---
id: architecture
title: Architecture
sidebar_label: Architecture
---
OpenRefine is a web application, but is designed to be run locally on your own machine. The server-side maintains states of the data (undo/redo history, long-running processes, etc.) while the client-side maintains states of the user interface (facets and their selections, view pagination, etc.). The client-side makes GET and POST ajax calls to cause changes to the data and to fetch data and data-related states from the server-side.
This architecture provides a good separation of concerns (data vs. UI); allows the use of familiar web technologies (HTML, CSS, Javascript) to implement user interface features; and enables the server side to be called by third-party software through standard GET and POST operations.
## Technology stack
The server-side part of OpenRefine is implemented in Java as one single servlet which is executed by the [Jetty](http://jetty.codehaus.org/jetty/) web server + servlet container. The use of Java strikes a balance between performance and portability across operating system (there is very little OS-specific code and has mostly to do with starting the application).
OpenRefine has no database using its own in-memory data-store that is built up-front to be optimized for the operations required by faceted browsing and infinite undo.
The client-side part of OpenRefine is implemented in HTML, CSS and Javascript and uses the following libraries:
* [jQuery](http://jquery.com/)
* [jQueryUI](http:jqueryui.com/)
* [Recurser jquery-i18n](https://github.com/recurser/jquery-i18n)
The functional extensibility of OpenRefine is provided by the [SIMILE Butterfly](http://code.google.com/p/simile-butterfly/) modular web application framework.
Several projects provide the functionality to read and write custom format files (POI, opencsv, JENA, marc4j).
String clustering is provided by the [SIMILE Vicino](http://code.google.com/p/simile-vicino/) project.
OAuth functionality is provided by the [Signpost](https://github.com/mttkay/signpost) project.
## Server-side architecture
OpenRefine's server-side is written entirely in Java (`main/src/`) and its entry point is the Java servlet `com.google.refine.RefineServlet`. By default, the servlet is hosted in the lightweight Jetty web server instantiated by `server/src/com.google.refine.Refine`. Note that the server class itself is under `server/src/`, not `main/src/`; this separation leaves the possibility of hosting `RefineServlet` in a different servlet container.
The web server configuration is in `main/webapp/WEB-INF/web.xml`; that's where `RefineServlet` is hooked up. `RefineServlet` itself is simple: it just reacts to requests from the client-side by routing them to the right `Command` class in the packages `com.google.refine.commands.**`.
As mentioned before, the server-side maintains states of the data, and the primary class involved is `com.google.refine.ProjectManager`.
### Projects
In OpenRefine there's the concept of a workspace similar to that in Eclipse. When you run OpenRefine it manages projects within a single workspace, and the workspace is embodied in a file directory with sub-directories. The default workspace directories are listed in the [FAQs](https://github.com/OpenRefine/OpenRefine/wiki/FAQ-Where-Is-Data-Stored). You can get OpenRefine to use a different directory by specifying a -d parameter at the command line.
The class `ProjectManager` is what manages the workspace. It keeps in memory the metadata of every project (in the class `ProjectMetadata`). This metadata includes the project's name and last modified date, and any other information necessary to present and let the user interact with the project as a whole. Only when the user decides to look at the project's data would `ProjectManager` load the project's actual data. The separation of project metadata and data is to minimize the amount of stuff loaded into memory.
A project's _actual_ data includes the columns, rows, cells, reconciliation records, and history entries.
A project is loaded into memory when it needs to be displayed or modified, and it remains in memory until 1 hour after the last time it gets modified. Periodically the project manager tries to save modified projects, and it saves as many modified projects as possible within 30 seconds.
### Data Model
A project's data consists of
- _raw data_: a list of rows, each row consisting of a list of cells
- _models_ on top of that raw data that give high-level presentation or interpretation of that data. This design lets the same raw data be viewed in different ways by different models, and let the models be changed without costly changes to the raw data.
#### Column Model
Cells in rows are not named and can only be addressed by their list position indices. So, a _column model_ is needed to give a name to each list position. The column model also stores other metadata for each column, including the type that cells in the column have been reconciled to and the overall reconciliation statistics of those cells.
Each column also acts as a cache for data computed from the raw data related to that column.
Columns in the column model can be removed and re-ordered without changing the raw data--the cells in the rows. This makes column removal and ordering operations really quick.
##### Column Groups
Consider the following data:
![Illustration of row groups in OpenRefine](https://raw.github.com/OpenRefine/OpenRefine/2.0/graphics/row-groups.png)
Although the data is in a grid, we humans can understand that it is a tree. First of all, all rows contain data ultimately linked to the movie Austin Powers, although only one row contains the text "Austin Powers" in the "movie title" column. We also know that "USA" and "Germany" are not related to Elizabeth Hurley and Mike Myers respectively (say, as their nationality), but rather, "USA" and "Germany" are related to the movie (where it was released). We know that Mike Myers played both the character "Austin Powers" and the character "Dr. Evil"; and for the latter he received 2 awards. We humans can understand how to interpret the grid as a tree based on its visual layout as well as some knowledge we have about the movie domain but is not encoded in the table.
OpenRefine can capture our knowledge of this transformation from grid to tree using _column groups_, also stored in the column model. Each column group illustrated as a blue bracket above specifies which columns are grouped together, as well as which of those columns is the key column in that group (blue triangle). One column group can span over columns grouped by another column group, and in this way, column groups form a hierarchy determined by which column group envelopes another. This hierarchy of column groups allows the 2-dimensional (grid-shaped) table of rows and cells to be interpreted as a list of hierarchical (tree-shaped) data records.
Blank cells play a very important role. The blank cell in a key column of a row (e.g., cell "character" on row 4) makes that row (row 4) _depend_ on the first preceding row with that column filled in (row 3). This means that "Best Comedy Perf" on row 4 applies to "Dr. Evil" on row 3. Row 3 is said to be a _context row_ for row 4. Similarly, since rows 2 - 6 all have blank cells in the first column, they all depend on row 1, and all their data ultimately applies to the movie Austin Powers. Row 1 depends on no other row and is said to be a _record row_. Rows 1 - 6 together form one _record_.
Currently (as of 12th December 2017) only the XML and JSON importers create column groups, and while the data table view does display column groups but it doesn't support modifying them.
### Changes, History, Processes, and Operations
All changes to the project's data are tracked (N.B. this does not include changes to a project's metadata - such as the project name.)
Changes are stored as `com.google.refine.history.Change` objects. `com.google.refine.history.Change` is an interface, and implementing classes are in `com.google.refine.model.changes.**`. Each change object stores enough data to modify the project's data when its `apply()` method is called, and enough data to revert its effect when its `revert()` method is called. It's only supposed to _store_ data, not to actually _compute_ the change. In this way, it's like a .diff patch file for a code base.
Some change objects can be huge, as huge as the project itself. So change objects are not kept in memory except when they are to be applied or reverted. However, since we still need to show the user some information about changes (as displayed in the History panel in the UI), we keep metadata of changes separate from the change objects. For each change object there is one corresponding `com.google.refine.history.HistoryEntry` for storing its metadata, such as the change's human-friendly description and timestamp.
Each project has a `com.google.refine.history.History` object that contains an ordered list of all `HistoryEntry` objects storing metadata for all changes that have been done since after the project was created. Actually, there are 2 ordered lists: one for done changes that can be reverted (undone), an done for undone changes that can be re-applied (redone). Changes must be done or redone in their exact orders in these lists because each change makes certain assumptions about the state of the project before and after it is applied. As changes cannot be undone/redone out of order, when one change fails to revert, it blocks the whole history from being reverted to any state preceding that change (as happened in [Issue #2](https://github.com/OpenRefine/OpenRefine/issues/2)).
As mentioned before, a change contains only the diff and does not actually compute that diff. The computation is performed by a `com.google.refine.process.Process` object--every change object is created by a process object. A process can be immediate, producing its change object synchronously within a very short period of time (e.g., starring one row); or a process can be long-running, producing its change object after a long time and a lot of computation, including network calls (e.g., reconciling a column).
As the user interacts with the UI on the client-side, their interactions trigger ajax calls to the server-side. Some calls are meant to modify the project. Those are handled by commands that instantiates processes. Processes are queued in a first-in-first-out basis. The first-in process gets run and until it is done all the other processes are stuck in the queue.
A process can effect a change in one thing in the project (e.g., edit one particular cell, star one particular row), or a process can effect changes in _potentially_ many things in the project (e.g., edit zero or more cells sharing the same content, starring all rows filtered by some facets). The latter kind of process is generalizable: it is meaningful to apply them on another similar project. Such a process is associated with an _abstract operation_ `com.google.refine.model.AbstractOperation` that encodes the information necessary to create another instance of that process, but potentially for a different project. When you click "extract" in the History panel, these abstract operations are called to serialize their information to JSON; and when you click "apply" in the History panel, the JSON you paste in is used to re-construct these abstract operations, which in turn create processes, which get run sequentially in a queue to generate change object and history entry pairs.
In summary,
- change objects store diffs
- history entries store metadata of change objects
- processes compute diffs and create change object and history entry pairs
- some processes are long-running and some are immediate; processes are run sequentially in a queue
- generalizable processes can be re-constructed from abstract operations
## Client-side architecture
The client-side part of OpenRefine is implemented in HTML, CSS and Javascript and uses the following Javascript libraries:
* [jQuery](http://jquery.com/)
* [jQueryUI](http:jqueryui.com/)
* [Recurser jquery-i18n](https://github.com/recurser/jquery-i18n)
### Importing architecture
OpenRefine has a sophisticated architecture for accommodating a diverse and extensible set of importable file formats and work flows. The formats range from simple CSV, TSV to fixed-width fields to line-based records to hierarchical XML and JSON. The work flows allow the user to preview and tweak many different import settings before creating the project. In some cases, such as XML and JSON, the user also has to select which elements in the data file to import. Additionally, a data file can also be an archive file (e.g., .zip) that contains many files inside; the user can select which of those files to import. Finally, extensions to OpenRefine can inject functionalities into any part of this architecture.
### The Index Page and Action Areas
The opening screen of OpenRefine is implemented by the file refine/main/webapp/modules/core/index.vt and will be referred to here as the index page. Its default implementation contains 3 finger tabs labeled Create Project, Open Project, and Import Project. Each tab selects an "action area". The 3 default action areas are for, obviously, creating a new project, opening an existing project, and importing a project .tar file.
Extensions can add more action areas in Javascript. For example, this is how the Create Project action area is added (refine/main/webapp/modules/core/scripts/index/create-project-ui.js):
```javascript
Refine.actionAreas.push({
id: "create-project",
label: "Create Project",
uiClass: Refine.CreateProjectUI
});
```
The UI class is a constructor function that takes one argument, a jQuery-wrapped HTML element where the tab body of the action area should be rendered.
If your extension requires a very unique importing work flow, or a very novel feature that should be exposed on the index page, then add a new action area. Otherwise, try to use the existing work flows as much as possible.
### The Create Project Action Area
The Create Project action area is itself extensible. Initially, it embeds a set of finger tabs corresponding to a variety of "source selection UIs": you can select a source of data by specifying a file on your computer, or you can specify the URL to a publicly accessible data file or data feed, or you can paste in from the clipboard a chunk of data.
There are actually 3 points of extension in the Create Project action area, and the first is invisible.
#### Importing Controllers
The Create Project action area manages a list of "importing controllers". Each controller follows a particular work flow (in UI terms, think "wizard"). Refine comes with a "default importing controller" (refine/main/webapp/modules/core/scripts/index/default-importing-controller/controller.js) and its work flow assumes that the data can be retrieved and cached in whole before getting processed in order to generate a preview for the user to inspect. (If the data cannot be retrieved and cached in whole before previewing, then another importing controller is needed.)
An importing controller is just programming logic, but it can manifest itself visually by registering one or more data source UIs and one or more custom panels in the Create Project action area. The default importing controller registers 3 such custom panels, which act like pages of a wizard.
An extension can register any number of importing controller. Each controller has a client-side part and a server-side part. Its client-side part is just a constructor function that takes an object representing the Create Project action area (usually named `createProjectUI`). The controller (client-side) is expected to use that object to register data source UIs and/or create custom panels. The controller is not expected to have any particular interface method. The default importing controller's client-side code looks like this (refine/main/webapp/modules/core/scripts/index/default-importing-controller/controller.js):
```javascript
Refine.DefaultImportingController = function(createProjectUI) {
this._createProjectUI = createProjectUI; // save a reference to the create project action area
this._progressPanel = createProjectUI.addCustomPanel(); // create a custom panel
this._progressPanel.html('...'); // render the custom panel
... do other stuff ...
};
Refine.CreateProjectUI.controllers.push(Refine.DefaultImportingController); // register the controller
```
We will cover the server-side code below.
#### Data Source Selection UIs
Data source selection UIs are another point of extensibility in the Create Project action area. As mentioned previously, by default there are 3 data source UIs. Those are added by the default importing controller.
Extensions can also add their own data source UIs. A data source selection UI object can be registered like so
```javascript
createProjectUI.addSourceSelectionUI({
label: "This Computer",
id: "local-computer-source",
ui: theDataSourceSelectionUIObject
});
```
`theDataSourceSelectionUIObject` is an object that has the following member methods:
- `attachUI(bodyDiv)`
- `focus()`
If you want to install a data source selection UI that is managed by the default importing controller, then register its UI class with the default importing controller, like so (refine/main/webapp/modules/core/scripts/index/default-importing-sources/sources.js):
```javascript
Refine.DefaultImportingController.sources.push({
"label": "This Computer",
"id": "upload",
"uiClass": ThisComputerImportingSourceUI
});
```
The default importing controller will assume that the `uiClass` field is a constructor function and call it with one argument--the controller object itself. That constructor function should save the controller object for later use. More specifically, for data source UIs that use the default importing controller, they can call the controller to kickstart the process that retrieves and caches the data to import:
```javascript
controller.startImportJob(form, "... status message ...");
```
The argument `form` is a jQuery-wrapped FORM element that will get submitted to the server side at the command /command/core/create-importing-job. That command and the default importing controller will take care of uploading or downloading the data, caching it, updating the client side's progress display, and then showing the next importing step when the data is fully cached.
See refine/main/webapp/modules/core/scripts/index/default-importing-sources/sources.js for examples of such source selection UIs. While we write about source selection UIs managed by the default importing controller here, chances are your own extension will not be adding such a new source selection UI. Your extension probably adds with a new importing controller as well as a new source selection UI that work together.
#### File Selection Panel
Documentation not currently available
#### Parsing UI Panel
Documentation not currently available
### Server-side Components
#### ImportingController
Documentation not currently available
#### UrlRewriter
Documentation not currently available
#### FormatGuesser
Documentation not currently available
#### ImportingParser
Documentation not currently available
## Faceted browsing architecture
Faceted browsing support is core to OpenRefine as it is the primary and only mechanism for filtering to a subset of rows on which to do something _en masse_ (ie in bulk). Without faceted browsing or an equivalent querying/browsing mechanism, you can only change one thing at a time (one cell or row) or else change everything all at once; both kinds of editing are practically useless when dealing with large data sets.
In OpenRefine, different components of the code need to know which rows to process from the faceted browsing state (how the facets are constrained). For example, when the user applies some facet selections and then exports the data, the exporter serializes only the matching rows, not all rows in the project. Thus, faceted browsing isn't only hooked up to the data view for displaying data to the user, but it is also hooked up to almost all other parts of the system.
### Engine Configuration
As OpenRefine is a web app, there might be several browser windows opened on the same project, each in a different faceted browsing state. It is best to maintain the faceted browsing state in each browser window while keeping the server side completely stateless with regard to faceted browsing. Whenever the client-side needs something done by the server, it transfers the entire faceted browsing state over to the server-side. The faceted browsing state behaves much like the `WHERE` clause in a SQL query, telling the server-side how to select the rows to process.
In fact, it is best to think of the faceted browsing state as just a database query much like a SQL query. It can be passed around the whole system, to any component needing to know which rows to process. It is serialized into JSON to pass between the client-side and the server side, or to save in an abstract operation's specification. The job of the faceted browsing subsystem on the client-side is to let the user interactively modify this "faceted browsing query", and the job of the faceted browsing subsystem on the server side is to resolve that query.
In the code, the faceted browsing state, or faceted browsing query, is actually called the *engine configuration* or *engine config* for short. It consists mostly of an array facet configurations. For each facet, it stores the name of the column on which the facet is based (or an empty string if there is no base column). Each type of facet has different configuration. Text search facets have queries and flags for case-sensitivity mode and regular expression mode. Text facets (aka list facets) and numeric range facets have expressions. Each list facet also has an array of selected choices, an invert flag, and flags for whether blank and error cells are selected. Each numeric range facet has, among other things, a "from" and a "to" values. If you trace the AJAX calls, you'd see the engine configs being shuttled, e.g.,
```json
{
"facets" : [
{
"type": "text",
"name": "Shrt_Desc",
"columnName": "Shrt_Desc",
"mode": "text",
"caseSensitive": false,
"query": "cheese"
},
{
"type": "list",
"name": "Shrt_Desc",
"columnName": "Shrt_Desc",
"expression": "grel:value.toLowercase().split(\",\")",
"omitBlank": false,
"omitError": false,
"selection": [],
"selectBlank":false,
"selectError":false,
"invert":false
},
{
"type": "range",
"name": "Water",
"expression": "value",
"columnName": "Water",
"selectNumeric": true,
"selectNonNumeric": true,
"selectBlank": true,
"selectError": true,
"from": 0,
"to": 53
}
],
"includeDependent": false
}
```
### Server-Side Subsystem
From an engine configuration like the one above, the server-side faceted browsing subsystem is capable of producing:
- an iteration over the rows matching the facets' constraints
- information on how to render the facets (e.g., choice and count pairs for a list facet, histogram for a numeric range facet)
When the engine config JSON arrives in an HTTP request on the server-side, a `com.google.refine.browsing.Engine` object is constructed and initialized with that JSON. It in turns constructs zero or more `com.google.refine.browsing.facets.Facet` objects. Then for each facet, the engine calls its `getRowFilter()` method, which returns `null` if the facet isn't constrained in anyway, or a `com.google.refine.browsing.filters.RowFilter` object. Then, to when iterating over a project's rows, the engine calls on all row filters' `filterRow()` method. If and only if all row filters return `true` the row is considered to match the facets' constraints. How each row filter works depends on the corresponding type of facet.
To produce information on how to render a particular facet in the UI, the engine follows the same procedure described in the previous except it skips over the facet in question. In other words, it produces an iteration over all rows constrained by the other facets. Then it feeds that iteration to the facet in question by calling the facet's `computeChoices()` method. This gives the method a chance to compute the rendering information for its UI counterpart on the client-side. When all facets have been given a chance to compute their rendering information, the engine calls all facets to serialize their information as JSON and returns the JSON to the client-side. Only one HTTP call is needed to compute all facets.
### Client-side subsystem
On the client-side there is also an engine object (implemented in Javascript rather than Java) and zero or more facet objects (also in Javascript, obviously). The engine is responsible for distributing the rendering information computed on the server-side to the right facets, and when the user interacts with a facet, the facet tells the engine to update the whole UI. To do so, the engine gathers the configuration of each facet and composes the whole engine config as a single JSON object. Two separate AJAX calls are made with that engine config, one to retrieve the rows to render, and one to re-compute the rendering information for the facets because changing one facet does affect all the other facets.

View File

@ -0,0 +1,146 @@
---
id: build-test-run
title: How to build, test and run
sidebar_label: How to build, test and run
---
You will need:
* [OpenRefine source code](https://github.com/OpenRefine/OpenRefine)
* [Java JDK](http://java.sun.com/javase/downloads/index.jsp)
* [Apache Maven](https://maven.apache.org) (OPTIONAL)
* A Unix/Linux shell environment OR the Windows command line
From the top level directory in the OpenRefine application you can build, test and run OpenRefine using the `./refine` shell script (if you are working in a \*nix shell), or using the `refine.bat` script from the Windows command line. Note that the `refine.bat` on Windows only supports a subset of the functionality, supported by the `refine` shell script. The example commands below are using the `./refine` shell script, and you will need to use `refine.bat` if you are working from the Windows command line.
If you are working from the Windows command line you must also install a Java JDK, and [set the JAVA_HOME environment variable](https://confluence.atlassian.com/doc/setting-the-java_home-variable-in-windows-8895.html) (please ensure it points to the JDK, and not the JRE)
### Maven (Optional)
OpenRefine's build script will download Maven for you and use it, if not found already locally installed.
If you will be using your Maven installation instead of OpenRefine's build script download installation, then set the `MVN_HOME` environment variable. You may need to reboot your machine after setting these environment variables. If you receive a message `Could not find the main class: com.google.refine.Refine. Program will exit.` it is likely `JAVA_HOME` is not set correctly.
Ensure that you set your `MAVEN_HOME` environment variable, for example:
```shell
MAVEN_HOME=E:\Downloads\apache-maven-3.5.4-bin\apache-maven-3.5.4\
```
NOTE: You can use Maven commands directly, but running some goals in isolation might fail (try adding the `compile test-compile` goals in your invocation if that is the case).
### Building
To see what functions are supported by OpenRefine's build system, type
```shell
./refine -h
```
To build the OpenRefine application from source type:
```shell
./refine clean
./refine build
```
### Testing
Since OpenRefine is composed of two parts, a server and a in-browser UI, the testing system reflects that:
* on the server side, it's powered by [TestNG](http://testng.org/) and the unit tests are written in Java;
* on the client side, we currently do not have any tests. Previously client side tests were implemented with [Windmill](https://github.com/windmill/windmill), which is no longer maintained.
To run all tests, use:
```shell
./refine test
```
**this option is not available when using refine.bat**
If you want to run only the server side portion of the tests, use:
```shell
./refine server_test
```
## Running
To run OpenRefine from the command line (assuming you have been able to build from the source code successfully)
```shell
./refine
```
By default, OpenRefine will use [refine.ini](https://github.com/OpenRefine/OpenRefine/blob/master/refine.ini) for configuration. You can copy it and rename it to `refine-dev.ini`, which will be used for configuration instead. `refine-dev.ini` won't be tracked by Git, so feel free to put your custom configurations into it.
## Building Distributions (Kits)
The Refine build system uses Apache Ant to automate the creation of the installation packages for the different operating systems. The packages are currently optimized to run on Mac OS X which is the only platform capable of creating the packages for all three OS that we support.
To build the distributions type
```shell
./refine dist <version>
```
where 'version' is the release version.
## Building, Testing and Running OpenRefine from Eclipse
OpenRefine' source comes with Maven configuration files which are recognized by [Eclipse](http://www.eclipse.org/) if the Eclipse Maven plugin (m2e) is installed.
At the command line, go to a directory **not** under your Eclipse workspace directory and check out the source:
```shell
git clone https://github.com/OpenRefine/OpenRefine.git
```
In Eclipse, invoke the `Import...` command and select `Existing Maven Projects`.
![Screenshot of Import a Maven project option](../img/eclipse-import-maven-project-1.png)
Choose the root directory of your clone of the repository. You get to choose which modules of the project will be imported. You can safely leave out the `packaging` module which is only used to generate the Linux, Windows and MacOS distributions.
![Screenshot of Select maven projects to import](../img/eclipse-import-maven-project-2.png)
To run and debug OpenRefine from Eclipse, you will need to add an execution configuration on the `server` sub-project.
Right click on the `server` subproject, click `Run as...` and `Run configurations...`. Just pick the root directory of the project and use `exec:java` as a Maven goal.
![Screenshot of Add a run configuration with the exec:java goal](../img/eclipse-exec-config.png)
This will add a run configuration that you can then use to run OpenRefine from Eclipse.
## Testing in Eclipse
You can run the server tests directly from Eclipse. To do that you need to have the TestNG launcher plugin installed, as well as the TestNG M2E plugin (for integration with Maven). If you don't have it, you can get it by [installing new software](https://help.eclipse.org/2020-03/index.jsp?topic=/org.eclipse.platform.doc.user/tasks/tasks-129.htm) from this update URL http://dl.bintray.com/testng-team/testng-eclipse-release/
Once the TestNG launching plugin is installed in your Eclipse, right click on the source folder ![](main/tests/server/src), select `Run As` -> `TestNG Test`. This should open a new tab with the TestNG launcher running the OpenRefine tests.
### Test coverage in Eclipse
It is possible to analyze test coverage in Eclipse with the `EclEmma Java Code Coverage` plugin. It will add a `Coverage as…` menu similar to the `Run as…` and `Debug as…` menus which will then display the covered and missed lines in the source editor.
### Debug with Eclipse
Here's an example of putting configuration in Eclipse for debugging, like putting values for the Google Data extension. Other type of configurations that can be set are memory, Wikidata login information and more.
![Screenshot of Eclipse debug configuration](../img/eclipse-debug-config.png)
## Building, Testing and Running OpenRefine from IntelliJ idea
At the command line, go to a directory you want to save the OpenRefine project and execute the following command to clone the repository:
```shell
git clone https://github.com/OpenRefine/OpenRefine.git
```
Then, open the IntelliJ idea and go to `file -> open` and select the location of the cloned repository.
![Screenshot of Open option on the IntelliJ File menu](../img/intellij-setup-1.png)
It will prompt you to add as a maven project as the source code contains a pom.xml file in it. Allow `auto-import` so that it can add it as a maven project.
If it doesn't prompt something like this then you can go on the right side of the IDE and click on maven then, click on `reimport all the maven projects` that will add all the dependencies and jar files required for the project.
![Screenshot of Maven project controls in IntelliJ](../img/intellij-maven.png)
After this, you will be able to properly build, test, and run the OpenRefine project from the terminal.
But if you will go to any of the test folders and open some file it will show you some import errors because the project isn't yet set up at the module level.
For removing those errors, and enjoying the features of the IDE like ctrl + click, etc you need to set up the project at the module level too. Open the different modules like `extensions/wikidata`, `main` as a project in the IDE. Then, right-click on the project folder and open the module settings.
![Screenshot of open module settings menu in IntelliJ](../img/intellij-open-module-settings.png)
In the module settings, add the source folder and test source folders of that module.
![Screenshot of module settings in IntelliJ](../img/intellij-module-settings.png)
Then, do the same thing for the main OpenRefine project and now you are good to go.

View File

@ -0,0 +1,41 @@
---
id: contributing
title: Contributing
sidebar_label: Contributing
---
Please read the general [guidelines on contributing to OpenRefine](https://github.com/OpenRefine/OpenRefine/blob/master/CONTRIBUTING.md) first, then review the information on [reporting and tracking issues](#reporting-and-tracking-issues), and on making your [first pull request](#your-first-pull-request) below)
## Reporting and tracking issues
If you need to file a bug or request a feature, [create an Issue in the OpenRefine Github repository](https://github.com/OpenRefine/OpenRefine/issues). Github issues should be used for reporting specific bugs and requesting specific features. If you just don't know how to do something using OpenRefine, or want to discuss some ideas, please:
- [Try the user manual](/)
- [post to our OpenRefine mailing list](http://groups.google.com/group/openrefine/)
## Your first pull request
This describes the overall steps to your first code contribution in OpenRefine. If you have trouble with any of these steps feel free to reach out on the [developer mailing list](https://groups.google.com/forum/#!forum/openrefine-dev) or the [Gitter channel](https://gitter.im/OpenRefine/OpenRefine).
- Install OpenRefine, learn to use it by following some tutorials or watching [some videos](http://openrefine.org/). That will ensure you understand the user workflows and get familiar with the terminology used in the tool.
- Fork the GitHub repository, clone it on your machine and set up your IDE to work on it. We have [instructions for this](https://github.com/OpenRefine/OpenRefine/wiki/Building-OpenRefine-From-Source).
- Browse through the list of issues to find an issue that you find interesting. You should pick one where you understand what the problem is as a user, you can see why fixing it would be an improvement to the tool. It is also a good idea to pick an issue that matches your technical skills: some require work on the backend (in Java) or in the frontend (Javascript), often both. We try to maintain a list of [good first issues](https://github.com/OpenRefine/OpenRefine/issues?q=is%3Aopen+is%3Aissue+label%3A%22good+first+issue%22) which should be easier than others and should not require any difficult design decision.
- Reproduce the issue locally, by following the steps described in the issue. You might need to locate a particular dialog, use a specific importer on a sample file, or follow any other user workflow. If you have followed all the steps described in the issue and cannot observe the issue mentioned, write a comment on the issue explaining that you are not able to reproduce it (perhaps it was fixed by another change).
- Locate the code that is relevant for the issue you want to solve. Text search across files is often useful for that. For instance, if the issue you want to solve is about a dialog entitled "Columnize by key/values", you can search for "Columnize" in the entire source code.
- Study how the current code works. You might want to use a debugger to put breakpoints at the relevant locations (for inspecting the backend, use your IDE's debugger, for the frontend, use your browser's developer tools).
- Create a git branch for your fix. The name of your branch should contain the issue number, and a few words to describe the topic of the fix, for instance "issue-1234-columnize-layout".
- Make changes to the code to fix the issue. If you are changing backend code, it would be great if you could also write a test in Java to demonstrate the fix. You can imitate existing tests for that. We currently do not have frontend tests.
- commit your changes, using a message that contains "closes #1234" or "fixes #1234", this will link the commit to the issue you are working on.
- push your branch to your fork and create a pull request for it, explaining the approach you have used, any design decisions you have made.
Thank you!

View File

@ -0,0 +1,261 @@
---
id: data-extension-api
title: Data extension API
sidebar_label: Data extension API
---
This page describes a new optional API for reconciliation services, allowing clients to pull properties of reconciled records. It is supported from OpenRefine 2.8 onwards. A sample server implementation is available in the [Wikidata reconciliation interface](https://tools.wmflabs.org/openrefine-wikidata/).
## Overview of the workflow
1. Reconcile a column with a standard reconciliation service
2. Click "Add column from reconciled values"
3. The user is proposed some properties to fetch, based on the type they reconciled their column against (if any). They can also pick their own property with the suggest widget (same as for the reconciliation dialog).
4. A preview of the columns to be fetched is displayed on the right-hand side of the dialog, based on a sample of the rows.
5. Once the user has clicked "OK", columns are fetched and added to the project. Columns corresponding to other items from the service are directly reconciled, and the column is marked as reconciled against the type suggested by the service for that
property. The user can run data extension again from that column.
[GIF Screencast](http://pintoch.ulminfo.fr/92dcdd20f3/recorded.gif)
## Specification
Services supporting data extension must add an `extend` field in their service metadata. This field is expected to have the following subfields, all optional:
* `propose_properties` stores the endpoint of an API which will be used to suggest properties to fetch (see specification below). The field contains an object with a `service_url` and `service_path` which will be concatenated to obtain the URL where the endpoint is available, just like the other services in the metadata. If this field is not provided, no property will be suggested in the dialog (the user will have to input them manually).
* `property_settings` stores the specification of a form where the user will be able to configure how a given property should be fetched (see specification below). If this field is not provided, the user will not be proposed with settings.
The service endpoint must also accept a new parameter `extend` (in addition to `queries` which is used for reconciliation). Its behaviour is described in the following section.
Example service metadata:
```json
"extend": {
"propose_properties": {
"service_url": "https://tools.wmflabs.org/openrefine-wikidata",
"service_path": "/en/propose_properties"
},
"property_settings": []
}
```
### Property proposal protocol
The role of the property proposal endpoint is to suggest a list of properties to fetch. As only input, it accepts GET parameters:
* the `type` of a column was reconciled against. If no type is provided, it should suggest properties for a column reconciled against no type.
* a `limit` on the number of results to return
The type is specified by its id in the `type` GET parameter of the endpoint, as follows:
https://tools.wmflabs.org/openrefine-wikidata/en/propose_properties?type=Q3354859&limit=3
The endpoint returns a JSON response as follows:
```json
{
"properties": [
{
"id": "P969",
"name": "located at street address"
},
{
"id": "P1449",
"name": "nickname"
},
{
"id": "P17",
"name": "country"
},
],
"type": "Q3354859",
"limit": 3
}
```
This endpoint must support JSONP via the `callback` parameter (just like all other endpoints of the reconciliation service).
### Data extension protocol
After calling the property proposal endpoint, the consumer (OpenRefine) calls the service endpoint with a JSON object in the `extend` parameter, containing the following fields:
* `ids` is a list of strings, each of which being an identifier of a record as returned by the reconciliation method. These are the records whose properties should be retrieved.
* `properties` is a list of JSON objects. They specify the properties to be fetched for each item, and contain the following fields:
* `id` (a string): the identifier of the property as returned by the property suggest service (and optionally the property proposal service)
* `settings`: a JSON object storing parameters about how the property should be fetched (see below).
Example:
```json
{
"ids": [
"Q7205598",
"Q218765",
"Q845632",
"Q5661356"
],
"properties": [
{
"id": "P856"
},
{
"id": "P159"
}
]
}
```
The service returns a JSON response formatted as follows:
* `meta` contains a list of column metadata. The order of the properties must be the same
as the one provided in the query. Each element is an object containing the following keys:
* `id` (mandatory): the identifier of the property
* `name` (mandatory): the human-readable name of the property
* `type` (optional): an object with `id` and `name` keys representing the expected
type of values for that property. The notion of type is the same as the one
used for reconciliation. The `type` field should only be provided when the property returns
reconciled items.
* `rows` contains an object. Its keys must be exactly the record ids (`ids`) passed in the query.
The value for each record id is an object representing a row for that id. The keys of a row object must be exactly the property ids passed in the query (`"P856"` and `"P159"` in the example above). The value for a property id should be a list of cell objects.
Cell objects are JSON objects which contain the representation of an OpenRefine cell.
* an object with a single `"str"` key and a string value for it represents
a cell with a (bare) string in it.
Example: `{"str": "193.54.0.0/15"}`
* an object with `"id"` and `"name"` represents a reconciled value
(from the same reconciliation service). It will be stored as
a matched cell (with maximum reconciliation score).
Example: `{"name": "Warsaw","id": "Q270"}`
* an empty object `{}` represents an empty cell
* an object with `"date"` and an ISO-formatted date string represents a point in time.
Example: `{"date": "1987-02-01T00:00:00+00:00"}`
* an object with `"float"` and a numerical value represents a quantity.
Example: `{"float": 48.2736}`
* an object with `"int"` and an integer represents a number.
Example: `{"int": 54}`
* an object with `"bool"` and `true` or `false` represents a boolean.
Example: `{"bool": false}`
Example of a full response (for the example query above):
```json
{
"rows": {
"Q5661356": {
"P159": [],
"P856": []
},
"Q7205598": {
"P159": [
{
"name": "Warsaw",
"id": "Q270"
}
],
"P856": [
{
"str": "http://www.polkomtel.com.pl/english"
},
{
"str": "http://www.polkomtel.com.pl/"
}
]
},
"Q845632": {
"P159": [
{
"name": "Bærum",
"id": "Q57076"
}
],
"P856": [
{
"str": "http://www.telenor.com/"
}
]
},
"Q218765": {
"P159": [
{
"name": "Paris",
"id": "Q90"
}
],
"P856": [
{
"str": "http://www.sfr.fr/"
}
]
}
},
"meta": [
{
"id": "P159",
"name": "headquarters location",
"type": {
"id": "Q7540126",
"name": "headquarters",
}
},
{
"id": "P856",
"name": "official website",
}
]
}
```
### Settings specification
The `property_settings` field in the service metadata allows the service to declare it accepts some settings for the properties it fetches. They are specified as a list of JSON objects which define the fields which should be exposed to the user.
Each setting object looks like this:
```json
{
"default": 0,
"type": "number",
"label": "Limit",
"name": "limit",
"help_text": "Maximum number of values to return per row (0 for no limit)"
}
```
It is essentially a definition of a form field in JSON, with self-explanatory fields.
The `type` field specifies the type of the form field (among `number`, `select`, `text`, `checkbox`).
The field `default` gives the default value of the form: the service must assume this value if the
client does not specify this setting.
For the `select` field, an additional `choices` field defines the possible choices, with both labels and values:
```json
{
"default": "any",
"label": "References",
"name": "references",
"type": "select",
"choices": [
{
"value": "any",
"name": "Any statement"
},
{
"value": "referenced",
"name": "At least one reference"
},
{
"value": "no_wiki",
"name": "At least one non-wiki reference"
}
],
"help_text": "Filter statements by their references"
}
```
When querying the service for rows, the client can pass an optional `settings` object in each of the requested columns:
```json
{
"id": "P342",
"settings": {
"limit": "20",
"references": "referenced",
}
}
```
Each key of the settings object must correspond to one form field proposed by the service. The value of that key is the value of the form field represented as a string (for uniformity and consistency with JSON form serialization).
The settings are intended to modify the results returned by the service: of course, the semantics of the settings is up to the service (as the service defines itself what settings it accepts).

View File

@ -0,0 +1,29 @@
---
id: development-roadmap
title: Development roadmap
sidebar_label: Development roadmap
---
Please be aware that the OpenRefine roadmap is subject to change at any time, so please check back regularly, and monitor [milestones](https://github.com/OpenRefine/OpenRefine/milestones), [projects](https://github.com/OpenRefine/OpenRefine/projects) and [issues](https://github.com/OpenRefine/OpenRefine/issues) in Github to keep up to date with current plans.
If there are features you would like to see that are not currently listed here or in current [milestones](https://github.com/OpenRefine/OpenRefine/milestones), [projects](https://github.com/OpenRefine/OpenRefine/projects) and [issues](https://github.com/OpenRefine/OpenRefine/issues), please add them to the [issue tracker](https://github.com/OpenRefine/OpenRefine/issues).
## Planned releases
### 4.0
[New backend storage option to allow using much bigger datasets at the expense of real-time feedback.](https://github.com/OpenRefine/OpenRefine/milestone/7)
New UI (possibly Vue or React based)
## Work in progress
Alongside the planned releases there are often smaller pieces of work in progress. Check for [recently updated issues](https://github.com/OpenRefine/OpenRefine/issues?q=is%3Aissue+is%3Aopen+sort%3Aupdated-desc) and [pull requests](https://github.com/OpenRefine/OpenRefine/pulls?q=is%3Apr+is%3Aopen+sort%3Aupdated-desc) to see what is currently in the works.
## On the back burner
Some aspects of OpenRefine have previously been targeted for release, but have not made it into a release and have not been worked on recently. If you would like to see features in these areas, please create an issue the describes what development you would like to see:
- Streamlining traditional features
- Views: map, timeline, protovis (D3.js) charts
- Better machinery to guess and re-encode cell values (useful for fixing encoding issues)
- Collaborative editing support (see documentation on the '[broker protocol](https://github.com/OpenRefine/OpenRefine/wiki/Broker-Protocol)' to see where this work was going)
- Column groups

View File

@ -0,0 +1,5 @@
---
id: homebrew-cask-process
title: Maintaining OpenRefine's Homebrew cask
sidebar_label: Maintaining OpenRefine's Homebrew cask
---

View File

@ -0,0 +1,169 @@
---
id: migrating-older-extensions
title: Migrating older Extensions
sidebar_label: Migrating older Extensions
---
## Migrating from Ant to Maven
### Why are we doing this change?
Ant is a fairly old (antique?) build system that does not incorporate any dependency management.
By migrating to Maven we are making it easier for developers to extend OpenRefine with new libraries, and stop having to ship dozens of .jar files in the repository. Using the Maven repository also encourages developers to add dependencies to released versions of libraries instead of custom snapshots that are hard to update.
### When was this change made?
The migration was done between 3.0 and 3.1-beta with this commit:
https://github.com/OpenRefine/OpenRefine/commit/47323a9e750a3bc9d43af606006b5eb20ca397b8
### How to migrate an extension
You will need to write a `pom.xml` in the root folder of your extension to configure the compilation process with Maven. Sample `pom.xml` files for extensions can be found in the extensions that are shipped with OpenRefine (`gdata`, `database`, `jython`, `pc-axis` and `wikidata`). A sample extension (`sample`) is also provided, with a minimal build file.
For any library that your extension depends on, you should try to find a matching artifact in the Maven Central repository. If you can find such an artifact, delete the `.jar` file from your extension and add the dependency in your `pom.xml` file. If you cannot find such an artifact, it is still possible to incorporate your own `.jar` file using `maven-install-plugin` that you can configure in your `pom.xml` file as follows:
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-install-plugin</artifactId>
<version>2.5.2</version>
<executions>
<execution>
<id>install-wdtk-datamodel</id>
<phase>process-resources</phase>
<configuration>
<file>${basedir}/lib/my-proprietary-library.jar</file>
<repositoryLayout>default</repositoryLayout>
<groupId>com.my.company</groupId>
<artifactId>my-library</artifactId>
<version>0.5.3-SNAPSHOT</version>
<packaging>jar</packaging>
<generatePom>true</generatePom>
</configuration>
<goals>
<goal>install-file</goal>
</goals>
</execution>
<!-- if you need to add more than one jar, add more execution blocks here -->
</executions>
</plugin>
And add the dependency to the `<dependencies>` section as usual:
<dependency>
<groupId>com.my.company</groupId>
<artifactId>my-library</artifactId>
<version>0.5.3-SNAPSHOT</version>
</dependency>
## Migrating to Wikimedia's i18n jQuery plugin
### Why are we doing this change?
This adds various important localization features, such as the ability to handle plurals or interpolation. This also restores the language fallback (displaying strings in English if they are not available in the target language) which did not work with the previous set up.
### When was the migration made?
The migration was made between 3.1-beta and 3.1, with this commit: https://github.com/OpenRefine/OpenRefine/commit/22322bd0272e99869ab8381b1f28696cc7a26721
### How to migrate an extension
You will need to update your translation files, merging nested objets in one global object, concatenating keys. You can do this by running the following Python script on all your JSON translation files:
import json
import sys
with open(sys.argv[1], 'r') as f:
j = json.loads(f.read())
result = {}
def translate(obj, path):
res = {}
if type(obj) == str:
result['/'.join(path)] = obj
else:
for k, v in obj.items():
new_path = path + [k]
translate(v, new_path)
translate(j, [])
with open(sys.argv[1], 'w') as f:
f.write(json.dumps(result, ensure_ascii=False, indent=4))
Then your javascript files which retrieve the translated strings should be updated: `$.i18n._('core-dialogs')['cancel']` becomes `$.i18n('core-dialogs/cancel')`. You can do this with the following `sed` script:
sed -i "s/\$\.i18n._(['\"]\([A-Za-z0-9/_\\-]*\)['\"])\[['\"]\([A-Za-z0-9\-\_]*\)[\"']\]/$.i18n('\1\/\2')/g" my_javascript_file.js
You can then chase down the places where you are concatenating translated strings, and replace that with more flexible patterns using [the plugin's features](https://github.com/wikimedia/jquery.i18n#jqueryi18n-plugin).
## Migrating from org.json to Jackson
### Why are we doing this change?
The org.json (or json-java) library has multiple drawbacks.
* First, it has limited functionality - all the serialization and deserialization has to be done explicitly - an important proportion of OpenRefine's code was dedicated to implementing these;
* Second, its implementation is not optimized for speed - multiple projects have reported speedups when migrating to more modern JSON libraries;
* Third, and this was the decisive factor to initiate the migration: [its license](https://json.org/license) is the MIT license with an additional condition which makes it non-free. Getting rid of this dependency was required by the Software Freedom Conservancy as a prerequisite to become a fiscal sponsor for the project.
### When was the migration made?
This change was made between 3.1 and 3.2-beta, with this commit: https://github.com/OpenRefine/OpenRefine/commit/5639f1b2f17303b03026629d763dcb6fef98550b
### How to migrate an extension or fork
You will need to use the Jackson library to serialize the classes that implement interfaces or extend classes exposed by OpenRefine.
The interface `Jsonizable` was deleted. Any class that used to implement this now needs to be serializable by Jackson, producing the same format as the previous serialization code. This applies to any operation, facet, overlay model or GREL function. If you are new to Jackson, have a look at [this tutorial](https://www.baeldung.com/jackson) to learn how to annotate your class for serialization. Once this is done, you can remove the `void write(JSONWriter writer, Properties options)` method from your class. Note that it is important that you do this migration for all classes implementing the `Jsonizable` interface that are exposed to OpenRefine's core.
We encourage you to migrate out of org.json completely, but this is only required for the classes that interact with OpenRefine's core.
#### General notes about migrating
OpenRefine's ObjectMapper is available at `ParsingUtilities.mapper`. It is configured to only serialize the fields and getters that have been explicitly marked with `@JsonProperty` (to avoid accidental JSON format changes due to refactoring). On deserialization it will ignore any field in the JSON payload that does not correspond to a field in the Java class. It has serializers and deserializers for `OffsetDateTime` and `LocalDateTime`.
Useful snippets to use in tests:
* deserialize an instance: `MyClass instance = ParsingUtilities.mapper.readValue(jsonString, MyClass.class);` (replaces calls to `Jsonizable.write`);
* serialize an instance: `String json = ParsingUtilities.mapper.writeValueAsString(myInstance);` (replaces calls to static methods such as `load`, `loadStreaming` or `reconstruct`);
* the equivalent of `JSONObject` is `ObjectNode`, the equivalent of `JSONArray` is `ArrayNode`;
* create an empty JSON object: `ParsingUtilities.mapper.createObjectNode()` (replaces `new JSONObject()`);
* create an empty JSON array: `ParsingUtilities.mapper.createArrayNode()` (replaces `new JSONArray()`).
Before undertaking the migration, we recommend that you write some tests which serialize and deserialize your objects. This will help you make sure that the JSON format is preserved during the migration. One way to do this is to collect some sample JSON representations of your objects, and check in your tests that deserializing these JSON payloads and serializing them back to JSON preserves the JSON payload. Some utilities are available to help you with that in [`TestUtils`](https://github.com/OpenRefine/OpenRefine/blob/master/main/tests/server/src/com/google/refine/tests/util/TestUtils.java) (we had [some to test org.json serialization](https://github.com/OpenRefine/OpenRefine/blob/3.1/main/tests/server/src/com/google/refine/tests/util/TestUtils.java) before we got rid of the dependency, feel free to copy them).
#### For functions
Before the migration, you had to explicitly define JSON serialization of functions with a `write` method. You should now override the getters returning the various documentation fields.
Example: `Cos` function [before](https://github.com/OpenRefine/OpenRefine/blob/3.1/main/src/com/google/refine/expr/functions/math/Cos.java) and [after](https://github.com/OpenRefine/OpenRefine/blob/master/main/src/com/google/refine/expr/functions/math/Cos.java).
#### For operations
Before the JSON migration we refactored engine-dependent operations so that the engine configuration is represented by an `EngineConfig` object instead of a `JSONObject`. Therefore the constructor for your operation should be updated to use this new class. Your constructor should also be annotated to be used during deserialization.
Note that you do not need to explicitly serialize the operation type, this is already done for you by `AbstractOperation`.
Example: `ColumnRemovalOperation` [before](https://github.com/OpenRefine/OpenRefine/blob/3.1/main/src/com/google/refine/operations/column/ColumnRemovalOperation.java) and [after](https://github.com/OpenRefine/OpenRefine/blob/master/main/src/com/google/refine/operations/column/ColumnRemovalOperation.java).
#### For changes
Changes are serialized in plain text but often relies on JSON serialization for parts of the data. Just use the methods above with `ParsingUtilities.mapper` to maintain this behaviour.
Example: `ReconChange` [before](https://github.com/OpenRefine/OpenRefine/blob/3.1/main/src/com/google/refine/model/changes/ReconChange.java) and [after](https://github.com/OpenRefine/OpenRefine/blob/master/main/src/com/google/refine/operations/column/ColumnRemovalOperation.java).
#### For importers
The importing options have been migrated from `JSONObject` to `ObjectNode`. Your compiler should help you propagate this change. Utility functions in `JSONUtilities` have been migrated to Jackson so you should have minimal changes if you used them.
Example: `TabularImportingParserBase` [before](https://github.com/OpenRefine/OpenRefine/blob/3.1/main/src/com/google/refine/importers/TabularImportingParserBase.java) and [after](https://github.com/OpenRefine/OpenRefine/blob/master/main/src/com/google/refine/importers/TabularImportingParserBase.java).
#### For overlay models
Migrate serialization and deserialization as for other objects.
Example: `WikibaseSchema` [before](https://github.com/OpenRefine/OpenRefine/blob/3.1/extensions/wikidata/src/org/openrefine/wikidata/schema/WikibaseSchema.java#L203) and [after](https://github.com/OpenRefine/OpenRefine/blob/master/extensions/wikidata/src/org/openrefine/wikidata/schema/WikibaseSchema.java#L60)
#### For preference values
Any class that is stored in OpenRefine's preference now needs to implement the `com.google.refine.preferences.PreferenceValue` interface. The static `load` method and the `write` method used previously for deserialization should be deleted and regular Jackson serialization and deserialization should be implemented instead. Note that you do not need to explicitly serialize the class name, this is already done for you by the interface.
Example: `TopList` [before](https://github.com/OpenRefine/OpenRefine/blob/3.1/main/src/com/google/refine/preference/TopList.java) and [after](https://github.com/OpenRefine/OpenRefine/blob/master/main/src/com/google/refine/preference/TopList.java)

View File

@ -0,0 +1,276 @@
---
id: openrefine-api
title: OpenRefine API
sidebar_label: OpenRefine API
---
This is a generic API reference for interacting with OpenRefine's HTTP API.
## Create project:
> **Command:** _POST /command/core/create-project-from-upload_
When uploading files you will need to send the data as `multipart/form-data`. This is different to all other API calls which use a mixture of query string and POST parameters.
multipart form-data:
'project-file' : file contents
'project-name' : project name
'format' : format of data in project-file (e.g. 'text/line-based/*sv') [optional]
'options' : json object containing options relevant to the file format [optional]
The formats supported will depend on the version of OpenRefine you are using and any Extensions you have installed. The common formats include:
* 'text/line-based': Line-based text files
* 'text/line-based/*sv': CSV / TSV / separator-based files [separator to be used in specified in the json submitted to the options parameter]
* 'text/line-based/fixed-width': Fixed-width field text files
* 'binary/text/xml/xls/xlsx': Excel files
* 'text/json': JSON files
* 'text/xml': XML files
If the format is omitted OpenRefine will try to guess the format based on the file extension and MIME type.
The values which can be specified in the JSON object submitted to the 'options' parameter will vary depending on the format being imported. If not specified the options will either be guessed at by OpenRefine (e.g. separator being used in a separated values file) or a default value used. The import options for each file format are not currently documented, but can be seen in the OpenRefine GUI interface when importing a file of the relevant format.
If the project creation is successful, you will be redirected to a URL of the form:
http://127.0.0.1:3333/project?project=<project id>
From the project parameter you can extract the project id for use in future API calls. The content of the response is the HTML for the OpenRefine interface for viewing the project.
### Get project models:
> **Command:** _GET /command/core/get-models?_
'project' : project id
Recovers the models for the specific project. This includes columns, records, overlay models, scripting. In the columnModel a list of the columns is displayed, key index and name, and column groupings.
### Response:
**On success:**
```JSON
{
"columnModel":{
"columns":[
{
"cellIndex":0,
"originalName":"email",
"name":"email"
},
{
"cellIndex":1,
"originalName":"name",
"name":"name"
},
{
"cellIndex":2,
"originalName":"state",
"name":"state"
},
{
"cellIndex":3,
"originalName":"gender",
"name":"gender"
},
{
"cellIndex":4,
"originalName":"purchase",
"name":"purchase"
}
],
"keyCellIndex":0,
"keyColumnName":"email",
"columnGroups":[
]
},
"recordModel":{
"hasRecords":false
},
"overlayModels":{
},
"scripting":{
"grel":{
"name":"General Refine Expression Language (GREL)",
"defaultExpression":"value"
},
"jython":{
"name":"Python / Jython",
"defaultExpression":"return value"
},
"clojure":{
"name":"Clojure",
"defaultExpression":"value"
}
}
}
```
## Apply operations
> **Command:** _POST /command/core/apply-operations?_
In the parameter
'project' : project id
In the form data
'operations' : Valid JSON **Array** of OpenRefine operations
Example of a Valid JSON **Array**
```JSON
'[
{
"op":"core/column-addition",
"description":"Create column zip type at index 15 based on column Zip Code 2 using expression grel:value.type()",
"engineConfig":{
"mode":"row-based",
"facets":[]
},
"newColumnName":"zip type",
"columnInsertIndex":15,
"baseColumnName":"Zip Code 2",
"expression":"grel:value.type()",
"onError":"set-to-blank"
},
{
"op":"core/column-addition",
"description":"Create column testing at index 15 based on column Zip Code 2 using expression grel:value.toString()0,5]",
"engineConfig":{
"mode":"row-based",
"facets":[]
},
"newColumnName":"testing",
"columnInsertIndex":15,
"baseColumnName":"Zip Code 2",
"expression":"grel:value.toString()[0,5]",
"onError":"set-to-blank"
}
]
```
On success returns JSON response
`{ "code" : "ok" }`
## Export rows
> **Command:** _POST /command/core/export-rows_
'project' : project id
'engine' : JSON string... (e.g. '{"facets":[],"mode":"row-based"}')
'format' : format... (e.g 'tsv', 'csv')
Returns exported row data in the specified format. The formats supported will depend on the version of OpenRefine you are using and any Extensions you have installed. The common formats include:
* csv
* tsv
* xls
* xlsx
* ods
* html
## Delete project
> **Command:** _POST /command/core/delete-project_
'project' : project id...
Returns JSON response
## Check status of async processes
> **Command:** _GET /command/core/get-processes_
'project' : project id...
Returns JSON response
## Get all projects metadata:
> **Command:** _GET /command/core/get-all-project-metadata_
Recovers the meta data for all projects. This includes the project's id, name, time of creation and last time of modification.
### Response:
```json
{
"projects":{
"[project_id]":{
"name":"[project_name]",
"created":"[project_creation_time]",
"modified":"[project_modification_time]"
},
...[More projects]...
}
}
```
## Expression Preview
> **Command:** _POST /command/core/preview-expression_
Pass some expression (GREL or otherwise) to the server where it will be executed on selected columns and the result returned.
### Parameters:
* **cellIndex**: _[column]_
The cell/column you wish to execute the expression on.
* **rowIndices**: _[rows]_
The rows to execute the expression on as JSON array. Example: `[0,1]`
* **expression**: _[language]_:_[expression]_
The expression to execute. The language can either be grel, jython or clojure. Example: grel:value.toLowercase()
* **project**: _[project_id]_
The project id to execute the expression on.
* **repeat**: _[repeat]_
A boolean value (true/false) indicating whether or not this command should be repeated multiple times. A repeated command will be executed until the result of the current iteration equals the result of the previous iteration.
* **repeatCount**: _[repeatCount]_
The maximum amount of times a command will be repeated.
### Response:
**On success:**
```json
{
"code": "ok",
"results" : [result_array]
}
```
The result array will hold up to ten results, depending on how many rows there are in the project that was specified by the [project_id] parameter. Each result is the string that would be put in the cell if the GREL command was executed on that cell. Note that any expression that would return an array or JSon object will be jsonized, although the output can differ slightly from the jsonize() function.
**On error:**
```JSON
{
"code": "error",
"type": "[error_type]",
"message": "[error message]"
}
```
## Third-party software libraries
Libraries using the [OpenRefine API](openrefine-api):
### Python
* [refine-client-py](https://github.com/PaulMakepeace/refine-client-py/)
* Or this fork of the above with an extended CLI [openrefine-client](https://github.com/felixlohmeier/openrefine-client)
* [refine-python](https://github.com/maxogden/refine-python)
### Ruby
* [refine-ruby](https://github.com/distillytics/refine-ruby)
* The above is a maintained fork of [google-refine](https://github.com/maxogden/refine-ruby)
* [google_refine](https://github.com/chengguangnan/google_refine)
### NodeJS
* [node-openrefine](https://github.com/pm5/node-openrefine)
### R
* [rrefine](https://cran.r-project.org/web/packages/rrefine/index.html)
### PHP
* [openrefine-php-client](https://github.com/keboola/openrefine-php-client)
### Java
* [refine-java](https://github.com/dtap-gmbh/refine-java)
### Bash
* [bash-refine.sh](https://gist.github.com/felixlohmeier/d76bd27fbc4b8ab6d683822cdf61f81d) (templates for shell scripts)

View File

@ -0,0 +1,259 @@
---
id: reconciliation-api
title: Reconciliation API
sidebar_label: Reconciliation API
---
_This page is kept for the record. [A cleaner version of this specification](https://reconciliation-api.github.io/specs/0.1/) was written by the [W3C Entity Reconciliation Community Group](https://www.w3.org/community/reconciliation/), which has been formed to improve and promote this API. Join the community group to get involved!_
_This is a technical description of the mechanisms behind the reconciliation system in OpenRefine. For usage instructions, see [Reconciliation](Reconciliation)._
## Introduction
A reconciliation service is a web service that, given some text which is a name or label for something, and optionally some additional details, returns a ranked list of potential entities matching the criteria. The candidate text does not have to match each entity's official name perfectly, and that's the whole point of reconciliation--to get from ambiguous text name to precisely identified entities. For instance, given the text "apple", a reconciliation service probably should return the fruit apple, the Apple Inc. company, and New York city (also known as the Big Apple).
Entities are identified by strong identifiers in some particular identifier space. In the same identifier space, identifiers follow the same syntax. For example, given the string "apple", a reconciliation service might return entities identified by the strings " [Q89](https://www.wikidata.org/wiki/Q89)", "[Q312](https://www.wikidata.org/wiki/Q312)", and "[Q60](https://www.wikidata.org/wiki/Q60)", in the Wikidata ID space. Each reconciliation service can only reconcile to one single identifier space, but several reconciliation services can reconcile to the same identifier space.
OpenRefine defines a reconciliation API so that users can use the reconciliation features of OpenRefine with various databases (this API was originally developed to work with the now deprecated "[Freebase](https://en.wikipedia.org/wiki/Freebase)" API).
Informally, the main function of any reconciliation service is to find good candidates in the underlying database, given the following data:
* A string, which is normally the name or title of the entity, in some language.
* Optionally, a type which can be used to narrow down the search to entities of this type. OpenRefine does not define a particular set of acceptable types: this choice is left to the reconciliation service (see the suggest API for that).
* Optionally, a list of properties and their values, which can be used to refine the search. For instance, when reconciling a database of books, the author name or the publication date are useful bits of information that can be transferred to the reconciliation service. This information will be sent to the reconciliation service if the user binds columns to properties. Again, the notion of property is not predefined in OpenRefine: its definition depends on the reconciliation service.
A standard reconciliation service is a HTTP-based RESTful JSON-formatted API. It consists of various endpoints, each of which fulfills a specific function. Only the first one is mandatory.
* The root URL. This is the URL that users will need to add the service to OpenRefine. For instance, the Wikidata reconciliation interface in English has the following URL: [https://tools.wmflabs.org/openrefine-wikidata/en/api](https://tools.wmflabs.org/openrefine-wikidata/en/api)
* _Optional._ The suggest API, which enables autocompletion at various places in OpenRefine;
* _Optional._ The preview API, which lets users preview the reconciled items directly from OpenRefine;
* _Optional._ The data extension API, which lets users add columns from reconciled values based on the properties of the items in the reconciliation service.
The specification of each of these endpoints is given in the following sections.
## Workflow overview
OpenRefine communicates with reconciliation services in the following way.
* The user adds the service by inputting its endpoint in the dialog. OpenRefine queries this URL to retrieve its [service and metadata](reconciliation-api#service-metadata) which contains basic metadata about the service (such as its name and the available features).
* The user selects a service to configure reconciliation using this service. OpenRefine queries the reconciliation service in [batch mode](reconciliation-api#multiple-query-mode) on the first ten items of the column to be reconciled. The reconciliation results are not presented to the user, but the types of the candidate items are aggregated and proposed to the user to restrict the matching to one of them. For example, if your service reconciles both people and companies, and the query for the first 10 names returns 15 candidates which are people and 7 candidates which are companies, the types will be presented to the user in that order. They can override the order and pick whichever type they want, as well as chosen a different type by hand or choose to reconcile without any type information.
* The user configures the reconciliation. If a [suggest service](reconciliation-api#suggest-apis) is available, it will be used to provide auto-completion in the dialog, to choose types or properties.
* When reconciliation starts, OpenRefine queries the service in [batch mode](reconciliation-api#multiple-query-mode) for small batches of rows and stores the responses of the service.
* Once reconciliation is complete, the results are displayed. The user makes reconciliation decisions based on the choices provided. If a [suggest service](reconciliation-api#suggest-apis) is available, it will be used to input custom reconciliation decisions. If a [preview service](reconciliation-api#preview-api) is available, the user will be able to preview the reconciliation candidates without leaving OpenRefine.
## Main reconciliation service
The root URL has two functions:
* it returns the [service and metadata](reconciliation-api#service-metadata) if no query is provided.
* when given the `queries` parameter, it performs a [set of queries in batch mode](reconciliation-api#multiple-query-mode) and returns the results for each query. This makes it efficient
There is a deprecated "single query" mode which is used if the `query` parameter is given. This mode is no longer supported or used by OpenRefine and other API consumers should not rely on it.
### Service metadata
When a service is called with just a JSONP `callback` parameter and no other parameters, it must return its _service metadata_ as a JSON object literal with the following fields:
* `"name"`: the name of the service, which will be used to display the service in the reconciliation menu;
* `"identifierSpace"`: an URI for the type of identifiers returned by the service;
* `"schemaSpace"`: an URI for the type of types understood by the service.
* `"view"` an object with a template URL to view a given item from its identifier: `"view": {"url":"http://example.com/object/{{id}}"} `
The last two parameters are mainly useful to assert that the identifiers returned by two different reconciliation services mean the same thing. Other fields are optional: they are used to specify the URLs for the other endpoints (suggest, preview and extend) described in the next sections.
Here are two live examples:
1. [https://tools.wmflabs.org/openrefine-wikidata/en/api](https://tools.wmflabs.org/openrefine-wikidata/en/api)
2. [http://refine.codefork.com/reconcile/viaf](http://refine.codefork.com/reconcile/viaf)
```json
{
"name" : "Wikidata Reconciliation for OpenRefine (en)",
"identifierSpace" : "http://www.wikidata.org/entity/",
"schemaSpace" : "http://www.wikidata.org/prop/direct/",
"view" : {
"url" : "https://www.wikidata.org/wiki/{{id}}"
},
"defaultTypes" : [],
"preview" : {
...
},
"suggest" : {
...
},
"extend" : {
...
},
}
```
## Query Request
### Multiple Query Mode
A call to a standard reconciliation service API for multiple queries looks like this:
http://foo.com/bar/reconcile?queries={...json object literal...}
The json object literal has zero or more key/value pairs with arbitrary keys where the value is in the same format as a single query, e.g.
http://foo.com/bar/reconcile?queries={ "q0" : { "query" : "foo" }, "q1" : { "query" : "bar" } }
"q0" and "q1" can be arbitrary strings. They will be used to key the results returned.
For larger data, it can make sense to use POST requests instead of GET:
```shell
curl -X POST -d 'queries={ "q0" : { "query" : "foo" }, "q1" : { "query" : "bar" } }' http://foo.com/bar/reconcile
```
OpenRefine uses POST for all requests, so make sure your service supports the format above.
### **DEPRECATED** Single Query Mode
A call to a reconciliation service API for a single query looks like either of these:
http://foo.com/bar/reconcile?query=...string...
http://foo.com/bar/reconcile?query={...json object literal...}
If the query parameter is a string, then it's an abbreviation of `query={"query":...string...}`. Here are two live examples:
1. [https://tools.wmflabs.org/openrefine-wikidata/en/api?query=boston](https://tools.wmflabs.org/openrefine-wikidata/en/api?query=boston)
2. [https://tools.wmflabs.org/openrefine-wikidata/en/api?query={%22query%22:%22boston%22,%22type%22:%22Q515%22}](https://tools.wmflabs.org/openrefine-wikidata/en/api?query={%22query%22:%22boston%22,%22type%22:%22Q515%22})
### Query JSON Object
The query json object literal has a few fields
| Parameter | Description |
| --- | --- |
| "query" | A string to search for. Required. |
| "limit" | An integer to specify how many results to return. Optional. |
| "type" | A single string, or an array of strings, specifying the types of result e.g., person, product, ... The actual format of each type depends on the service (e.g., "Q515" as a Wikidata type). Optional. |
| "type\_strict" | A string, one of "any", "all", "should". Optional. |
| "properties" | Array of json object literals. Optional |
Each json object literal of the `"properties"` array is of this form
```json
{
"p" : string, property name, e.g., "country", or
"pid" : string, property ID, e.g., "P17" as a Wikidata property ID
"v" : a single, or an array of, string or number or object literal, e.g., "Japan"
}
```
A `"v"` object literal would have a single key `"id"` whose value is an identifier resolved previously to the same identity space.
Here is an example of a full query parameter:
```json
{
"query" : "Ford Taurus",
"limit" : 3,
"type" : "Q3231690",
"type_strict" : "any",
"properties" : [
{ "p" : "P571", "v" : 2009 },
{ "pid" : "P176" , "v" : { "id" : "Q20827633" } }
]
}
```
## Query Response
For multiple queries, the response is a JSON literal object with the same keys as in the request
```json
{
"q0" : {
"result" : { ... }
},
"q1" : {
"result" : { ... }
}
}
```
Each result consists of a JSON literal object with the structure
```json
{
"result" : [
{
"id" : ... string, database ID ...
"name" : ... string ...
"type" : ... array of strings ...
"score" : ... double ...
"match" : ... boolean, true if the service is quite confident about the match ...
},
... more results ...
],
... potentially some useful envelope data, such as timing stats ...
}
```
The results should be sorted by decreasing score.
The service must also support JSONP through a callback parameter ie &callback=foo.
## Preview API
The preview service API (complementary to the reconciliation service API) is quite simple. Pass it an identifier and it renders information about the corresponding entity in an HTML page, which will be shown in an iframe inside OpenRefine. The given width and height dimensions tell OpenRefine how to size that iframe.
## Suggest APIs
In the "Start Reconciling" dialog box in OpenRefine, you can specify which type of entities the column in question contains. For instance, the column might contains titles of scientific journals. But you don't know the identifier corresponding to the "scientific journal" type. So we need a suggest API that translates "scientific journal" to something like, say, "[Q5633421](https://www.wikidata.org/wiki/Q5633421)" if we're reconciling against Wikidata.
In the same dialog box, you can specify that other columns should be used to provide more details for the reconciliation. For instance, if there is a column specifying the journals ISSN (a standard identifier for serial publications), passing that data onto the reconciliation service might make reconciliation more accurate. You might want to specify how that second column is related to the column being reconciled, but you might not now how to specify "ISSN" as a precise relationship. So we need a suggest API that translates "ISSN" to something like "[P236](https://www.wikidata.org/wiki/Property:P236)" (once again using Wikidata as an example).
There is also a need for a suggest service for entities rather than just for types and properties. When a cell has no good candidate, then you would want to perform a search yourself (by clicking on "search for match" in that cell).
Each suggest API has 2 jobs to do:
* translate what the user type into a ranked list of entities (and this is similar to the core reconciliation service and might share the same implementation)
* render a flyout when an entity is moused over or highlighted using arrow keys (and this is similar to the preview API and might share the same implementation)
The metadata for each suggest API (type, property, or entity) is as follows:
```json
{
"service_url" : "... url including only the domain ...",
"service_path" : "... optional relative path ...",
"flyout_service_url" : "... optional url including only the domain ...",
"flyout_service_path" : "... optional relative path ..."
}
```
The `service_url` field is required and it should look like this: `http://foo.com`. There should be no trailing `/` at the end. The other fields are optional and have defaults if not provided:
* `service_path` defaults to `/private/suggest`
* `flyout_service_url` defaults to the provided `service_url` field
* `flyout_service_path` defaults to `/private/flyout`
Refer to [the Suggest API documentation](suggest-api) for further details.
## Data Extension
From OpenRefine 2.8 it is possible to fetch values from reconcilied sources natively. This is only possible for the reconciliation endpoints that support this additional feature, described in the [Data Extension API documentation](Data-Extension-API).
## Examples
We've cloned a number of the Refine reconciliation services as a way of providing them visibility. They can be found at [https://github.com/OpenRefine](https://github.com/OpenRefine)
Some examples of reconciliation services which have made code available include:
* [https://github.com/dergachev/redmine-reconcile](https://github.com/dergachev/redmine-reconcile) - Python & Flask implementation that just returns the given name/number with a base url prepended
* [https://github.com/okfn/helmut](https://github.com/okfn/helmut) - A generic Refine reconciliation API implementation using Python & Flask
* [https://github.com/mblwhoi/reconciliation_service_skeleton](https://github.com/mblwhoi/reconciliation_service_skeleton) - Skeleton for Standalone Python & Flask Reconciliation Service for Refine
* [https://github.com/mikejs/reconcile-demo](https://github.com/mikejs/reconcile-demo)
* [https://github.com/rdmpage/phyloinformatics/tree/master/services) - PHP examples (reconciliation\_\*.php](https://github.com/rdmpage/phyloinformatics/tree/master/services)
* [http://lucene.apache.org/solr/) and Python Django](https://github.com/opensemanticsearch/open-semantic-entity-search-api](https://github.com/opensemanticsearch/open-semantic-entity-search-api) - Open Source REST-API for Named Entity Extraction, Normalization, Reconciliation, Recommendation, Named Entity Disambiguation and Named Entity Linking of named entities in full-text documents by SKOS thesaurus, RDF ontologies, SQL databases and lists of names (powered by [Apache Solr)
* [https://github.com/granoproject/grano-reconcile](https://github.com/granoproject/grano-reconcile) - python example
* [https://github.com/codeforkjeff/conciliator](https://github.com/codeforkjeff/conciliator) - a Java framework for creating reconciliation services over the top of existing data sources. The code includes reconciliation services layered over [the Virtual International Authority File (VIAF)](http://viaf.org), [ORCID](http://orcid.org), [the Open Library](http://openlibrary.org) and [Apache Solr](http://lucene.apache.org/solr/).
* The open-reconcile project provides a complete Java based reconciliation service which queries a SQL database. [https://code.google.com/p/open-reconcile](https://code.google.com/p/open-reconcile)
* The [RDF Extension](http://refine.deri.ie) incorporates, among other things, reconciliation support with different approaches:
* a service to reconciliate against querying a SPARQL endpoint
* reconcile against a provided RDF file
* based on Apache Stanbol ([implementation details](https://github.com/fadmaa/grefine-rdf-extension/pull/59))
* [Sunlight Labs](https://github.com/sunlightlabs) implemented a reconciliation service using Piston on Django for their [Influence Explorer](https://sunlightlabs.github.io/datacommons/). [The code is available](https://github.com/sunlightlabs/datacommons/blob/master/dcapi/reconcile/handlers.py)
Also look at the [[Reconcilable Data Sources]] page for other examples of available reconciliation services that are compatible with Refine. Not all of them are open source, but they might spark some ideas.

View File

@ -0,0 +1,101 @@
---
id: suggest-api
title: Suggest API
sidebar_label: Suggest API
---
The Suggest API has 2 entry points:
- `suggest`: translates some text that the user has typed, optionally constrained in some ways, to a ranked list of entities (this is very similar to a [Reconciliation API](reconciliation-api) and can share the same implementation)
- `flyout` : renders a small view of an entity
For the `suggest` entry point, it is important to balance speed versus accuracy. The widget must respond in interactive time, meaning about 200 msec, for each change to the user's text input. At the same time, the ranked list of entities must seem quite relevant to what the user types.
Similarly, for the `flyout` entry point, it is important to respond quickly while providing enough essential details so that the user can visually check if the highlighted entity is the desired one. You probably would want to embed a thumbnail image, as we have found that images are excellent for visual identification.
## suggest Entry Point
The `suggest` entry point takes the following URL parameters
Parameter | Description | Required/Optional
----------|-----------------------------|------------------
"prefix" | a string the user has typed | required
"type" | optional, a single string, or an array of strings, specifying the types of result e.g., person, product, ... The actual format of each type depends on the service | optional |
"type\_strict" | optional, a string, one of "any", "all", "should" | optional |
"limit" | optional, an integer to specify how many results to return | optional |
"start" | optional, an integer to specify the first result to return (thus in conjunction with `limit`, support pagination) | optional |
The Suggest API should return results as JSON (JSONP must also be supported). The JSON should consist of a 'result' array containing objects with at least a 'name' and 'id'. The JSON response can optionally include other information as illustrated in this structure for a full JSON response:
```json
{
"code" : "/api/status/ok",
"status" : "200 OK",
"prefix" : ... string, the prefix URL parameter echoed back ...
"result" : [
{
"id" : ... string, identifier of entity ...
"name" : ... string, nameof entity ...
"notable" : [{
"id" : ... string, identifier of type ...
"name" : ... string, name of type ...
}, ...]
},
... more results ...
],
}
```
* `code` (optional) error/success state of the API above the level of HTTP. Use "/api/status/ok" for a successful request and "/api/status/error" if there has been an error.
* `status` (optional) should correspond to the HTTP response code.
* `prefix` (optional) the query string submitted to the Suggest API.
* `result` (required) array containing multiple results from the Suggest API consisting of at least an id and name.
* `id` (required) the id of an entity being suggested by the Suggest API
* `name` (required) a short string which labels or names the entity being suggested by the Suggest API
* `description` (optional) a short description of the item, which will be displayed below the name
* `notable` (optional) is a a list of JSON objects that describes the types of the entity. They are rendered in addition to the entity's name to provide more disambiguation details and stored in the reconciliation data of the cells. This list can also be supplied as a list of type identifiers (such as `["Q5"]` instead of `[{"id":"Q5","name":"human"}]`).
Here is an example of a minimal request and response using the Suggest API layered over [Wikidata](https://www.wikidata.org):
URL: https://tools.wmflabs.org/openrefine-wikidata/en/suggest/entity?prefix=A5
JSON response:
```json
{
"result": [
{
"name": "A5",
"description": "road",
"id": "Q429719"
},
{
"name": "Apple A5",
"description": null,
"id": "Q420764"
},
{
"name": "A5 autoroute",
"description": "controlled-access highway from Paris's Francilienne to the A31 near Beauchemin",
"id": "Q788832"
},
...
]
}
```
## flyout Entry Point
The `flyout` entry point takes a single URL parameter: `id`, which is the identifier of the entity to render, as a string. It also takes a `callback` parameter to support JSONP. It returns a JSON object literal with a single field: `html`, which is the rendered view of the given entity.
Here is an example of a minimal request and response using the Suggest API layered over [[Wikidata](https://www.wikidata.org) with only the required fields in each case:
URL: https://tools.wmflabs.org/openrefine-wikidata/en/flyout/entity?id=Q786288
JSON response:
```json
{
"html": "<p style=\"font-size: 0.8em; color: black;\">national road in Latvia</p>",
"id": "Q786288"
}
```
OpenRefine incorporates a set of `fbs-` CSS class names which can be used in the flyout HTML if desired to render the flyout information in a standard style.

View File

@ -0,0 +1,7 @@
---
id: technical-reference-index
title: OpenRefine technical reference
sidebar_label: Technical Reference Index
---
Technical reference index

View File

@ -0,0 +1,105 @@
---
id: translating
title: Translate the OpenRefine interface
sidebar_label: Translate the OpenRefine interface
---
Currently supported languages include English, Spanish, Chinese, French, Hebrew, Italian and Japanese.
![Translation status](https://hosted.weblate.org/widgets/openrefine/-/287x66-grey.png)
You can help translate OpenRefine into your language by visiting [Weblate](https://hosted.weblate.org/engage/openrefine/?utm_source=widget) which provides a web based UI to edit and add translations and sends automatic pull requests back to our project.
Click to help translate --> [Weblate](https://hosted.weblate.org/engage/openrefine/?utm_source=widget)
## User entry of language data ##
Localized strings are entered in a .json file, one per language. They are located in the folder `main/webapp/modules/core/langs/` in a file named `translation-xx`.json, where xx is the language code (i.e. fr for French).
### Simple case of localized string ###
This is an example of a simple string, with the start of the JSON file. This example is for French.
```
{
"name": "Français",
"core-index/help": "Aide",
(… more lines)
}
```
So the key `core-index/help` will render as `"Aide"` in French.
### Localization with a parameterized value ###
In this example, the name of the column (represented by `$1` in this example), will be substituted with the string of the name of the column.
`"core-facets/edit-facet-title": "Cliquez ici pour éditer le nom de la facette\nColonne : $1",`
### Localization with a singular/plural value ###
In this example, one of the parameter will have a different string depending if the value is 1 or another value.
In this example, the string for page, the second parameter, `$2`, will have an « s » or not depending on the value of `$2`.
`"core-views/goto-page": "$1 de $2 {{plural:$2|page|pages}}"`
## Front End Coding
The OpenRefine front end has been localized using the [Wikidata jquery.i18n library](https://github.com/OpenRefine/OpenRefine/pull/1285. The localized text is stored in a JSON dictionary on the server and retrieved with a new OpenRefine command.
### Adding a new string
There should be no hard-coded language strings in the HTML or JSON used for the front end. If you need a new string, first check the existing strings to make sure there isn't an equivalent string, **in an equivalent context**, that you can reuse. Context is important because it can affect how the same literal English text is translated. This cuts down on the amount of text which needs to be translated.
Strings should be entire sentences or phrases and should include substitution variables for any parameters. Do not concatenate strings in either Java or Javascript (or implicitly by laying them out in a specific order). So, instead of `"You have " + count + " row(s)"` (or worse `count != 1 ? " rows" : " row"`), internationalize everything together so that it can be translated taking into account word ordering and plurals for different languages, ie `"You have $1 {{plural $1: row|rows}}"`, passing the parameter(s) into the `$.i18n` call.
If there's no string you can reuse, allocate an available key in the appropriate translation dictionary and add the default string, e.g.
```json
...,
"section/newkey": "new default string for this key",
...
```
and then set the text (or HTML) of your HTML element using i18n helper method. So given an HTML fragment like:
```html
<label id="new-element-id">[untranslated text would have appeared here before]</label>
```
we could set its text using:
```
$('#new-html-element-id').text($.i18n('section/newkey']));
```
or, if you need to embed HTML tags:
```
$('#new-html-element-id').html($.i18n('section/newkey']);
```
### Adding a new language
The language dictionaries are stored in the `langs` subdirectory for the module e.g.
* https://github.com/OpenRefine/OpenRefine/tree/master/main/webapp/modules/core/langs for the main interface
* https://github.com/OpenRefine/OpenRefine/tree/master/extensions/gdata/module/langs for google spreadsheet connection
* https://github.com/OpenRefine/OpenRefine/tree/master/extensions/database/module/langs for database via JDBC
* https://github.com/OpenRefine/OpenRefine/tree/master/extensions/wikidata/module/langs for Wikidata
To add support for a new language, copy `translation-en.json` to `translation-<locale>.json` and have your translator translate all the value strings (ie right hand side).
#### Main interface
The translation is best done [with Weblate](https://hosted.weblate.org/engage/openrefine/?utm_source=widget). Files are periodically merged by the developer team.
Run the latest (hopefully cloned from github) version and check whether translated words fit to the layout. Not all items can be translated word by word, especially into non-Ìndo-European languages.
If you see any text which remains in English even when you have checked all items, please create bug report in the issue tracker so that the developers can fix it.
#### Extensions
Extensions can be translated via Weblate just like the core software.
The new extension for Wikidata contains lots of domain-specific concepts, with which you may not be familiar. The Wikidata may not have reconciliation service for your language. I recommend checking the glossary(https://www.wikidata.org/wiki/Wikidata:Glossary) to be consistent.
By default, the system tries to load the language file corresponding to the currently in-use browser language. To override this setting a new menu item ("Language Settings") has been added at the index page.
To support a new language file, the developer should add a corresponding entry to the dropdown menu in this file: `/OpenRefine/main/webapp/modules/core/scripts/index/lang-settings-ui.html`. The entry should look like:
```javascript
<option value="<locale>">[Language Label]</option>
```
## Server / Backend Coding
Currently no back end functions are translated, so things like error messages, undo history, etc may appear in English form. Rather than sending raw error text to the front end, it's better to send an error code which is translated into text on the front end. This allows for multiple languages to be supported.

View File

@ -0,0 +1,5 @@
---
id: version-release-process
title: How to do an OpenRefine version release
sidebar_label: How to do an OpenRefine version release
---

View File

@ -0,0 +1,326 @@
---
id: writing-extensions
title: Writing Extensions
sidebar_label: Writing Extensions
---
## Introduction
This is a very brief overview of the structure of OpenRefine extensions. For more detailed documentation and step-by-step guides please see the following external documentation/tutorials:
* Giuliano Tortoreto has [written documentation detailling how to build extension for OpenRefine](https://github.com/giTorto/OpenRefineExtensionDoc/raw/master/main.pdf)
* Owen Stephens has written [a guide to developing an extension which adds new GREL functions to OpenRefine](http://www.meanboyfriend.com/overdue_ideas/2017/05/writing-an-extension-to-add-new-grel-functions-to-openrefine/).
OpenRefine makes use of a modified version of the [Butterfly framework](https://github.com/OpenRefine/simile-butterfly/tree/openrefine) to provide an extension architecture. OpenRefine extensions are Butterfly modules. You don't really need to know about Butterfly itself, but you might encounter "butterfly" here and there in the code base.
Extensions that come with the code base are located under [the extensions subdirectory](https://github.com/OpenRefine/OpenRefine/tree/master/extensions), but when you develop your own extension, you can put its code anywhere as long as you point Butterfly to it. That is done by any one of the following methods
* refer to your extension's directory in [the butterfly.properties file](https://github.com/OpenRefine/OpenRefine/blob/master/main/webapp/WEB-INF/butterfly.properties) through a `butterfly.modules.path` setting.
* specify the butterfly.modules.path property on the command line when you run OpenRefine. This overrides the values in the property file, so you need to include the default values first e.g. `-Dbutterfly.modules.path=modules,../../extensions,/path/to/your/extension`
Please note that you should bundle any dependencies yourself, so you are insulated from OpenRefine packaging changes over time.
### Directory Layout
A OpenRefine extension sits in a file directory that contains the following files and sub-directories:
```
pom.xml
src/
com/foo/bar/... *.java source files
module/
*.html, *.vt files
scripts/... *.js files
styles/... *.css and *.less files
images/... image files
MOD-INF/
lib/*.jar files
classes/... java class files
module.properties
controller.js
```
The file named module.properties (see [example](https://github.com/OpenRefine/OpenRefine/blob/master/extensions/sample/module/MOD-INF/module.properties)) contains the extension's metadata. Of importance is the name field, which gives the extension a name that's used in many other places to refer to it. This can be different from the extension's directory name.
```
name = my-extension-name
```
Your extension's client-side resources (.html, .js, .css files) stored in the module/ subdirectory will be accessible from http://127.0.0.1:3333/extension/my-extension-name/ when OpenRefine is running.
Also of importance is the dependency
```
requires = core
```
which makes sure that the core module of OpenRefine is loaded before the extension attempts to hook into it.
The file named controller.js is responsible for registering the extension's hooks into OpenRefine. Look at the sample-extension extension's [controller.js](https://github.com/OpenRefine/OpenRefine/blob/master/extensions/sample/module/MOD-INF/controller.js) file for an example. It should have a function called init() that does the hook registrations.
The `pom.xml` file is an [Apache Maven](http://maven.apache.org/) build file. You can make a copy of the sample extension's `pom.xml` file to get started. The important point here is that the Java classes should be built into the `module/MOD-INF/classes` sub-directory.
Note that your extension's Java code would need to reference some libraries used in OpenRefine and OpenRefine's Java classes themselves. These dependencies are reflected in the Maven configuration for the extension.
## Sample extension
The sample extension is included in the code base so that you can copy it and get started on writing your own extension. After you copy it, make sure you change its name inside its `module/MOD-INF/controller.js` file.
### Basic Structure
The sample extension's code is in `refine/extensions/sample/`. In that directory, Java source code is contained under the `src` sub-directory, and webapp code is under the `module` sub-directory. Here is the full directory layout:
```
refine/extensions/sample/
build.xml (ant build script)
src/
com/google/refine/sampleExtension/
... Java source code ...
module/
MOD-INF/
module.properties (module settings)
controller.js (module init and routing logic in Javascript)
classes/
... compiled Java classes ...
lib/
... Java jars ...
... velocity templates (.vt) ...
... LESS css files ...
... client-side files (.html, .css, .js, image files) ...
```
The sub-directory `MOD-INF` contains the Butterfly module's metadata and is what Butterfly looks for when it scans directories for modules. `MOD-INF` serves similar functions as `WEB-INF` in other web frameworks.
Java code is built into the sub-directory `classes` inside `MOD-INF`, and supporting external Java jars are in the `lib` sub-directory. Those will be automatically loaded by Butterfly. (The build.xml script is wired to compile into the `classes` sub-directory.)
Client-side code is in the inner `module` sub-directory. They can be plain old .html, .css, .js, and image files, or they can be [LESS](http://lesscss.org/) files that get processed into CSS. There are also Velocity .vt files, but they need to be routed inside `MOD-INF/controller.js`.
`MOD-INF/controller.js` lets you configure the extension's initialization and URL routing in Javascript rather than in Java. For example, when the requested URL path is either `/` or an empty string, we process and return `MOD-INF/index.vt` ( [see http://127.0.0.1:3333/extension/sample/](http://127.0.0.1:3333/extension/sample/) if OpenRefine is running).
The `init()` function in `controller.js` allows the extension to register various client-side handlers for augmenting pages served by Refine's core. These handlers are feature-specific. For example, [this is where the jython extension adds its parser](https://github.com/OpenRefine/OpenRefine/blob/master/extensions/jython/module/MOD-INF/controller.js#L46). As for the sample extension, it adds its script `project-injection.js` and style `project-injection.less` into the `/project` page. If you [view the source of the /project page](http://127.0.0.1:3333/project), you will see references to those two files.
### Wiring Up the Extension
The Extensions are loaded by the Butterfly framework. Butterfly refers to these as 'modules'. [The location of modules is set in the `main/webapp/butterfly.properties` file](https://github.com/OpenRefine/OpenRefine/blob/master/main/webapp/WEB-INF/butterfly.properties#L27). Butterfly simply descends into each of those paths and looks for any `MOD-INF` directories.
For more information, see [Extension Points](Extension-Points).
## Extension points
### Client-side: Javascript and CSS
The UI in OpenRefine for working with a project is coded in [the /main/webapp/modules/core/project.vt file](http://github.com/OpenRefine/OpenRefine/blob/master/main/webapp/modules/core/project.vt). The file is quite small, and that's because almost all of its content is to be expanded dynamically through the Velocity variables $scriptInjection and $styleInjection. So that your own Javascript and CSS files get loaded, you need to register them with the ClientSideResourceManager, which is done in the /module/MOD-INF/controller.js file. See [the controller.js file in this sample extension code](http://github.com/OpenRefine/OpenRefine/blob/master/extensions/sample/module/MOD-INF/controller.js) for an example.
In the registration call, the variable `module` is already available to your code by default, and it refers to your own extension.
```
ClientSideResourceManager.addPaths(
"project/scripts",
module,
[
"scripts/foo.js",
"scripts/subdir/bar.js"
]
);
```
You can specify one or more files for registration, and their paths are relative to the `module` subdirectory of your extension. They are included in the order listed.
Javascript Bundling: Note that `project.vt` belongs to the core module and is thus under the control of the core module's [controller.js file](http://github.com/OpenRefine/OpenRefine/blob/master/main/webapp/modules/core/MOD-INF/controller.js). The Javascript files to be included in `project.vt` are by default bundled together for performance. When debugging, you can prevent this bundling behavior by setting `bundle` to `false` near the top of that `controller.js` file. (If you have commit access to this code base, be sure not to check that change in.)
### Client-side: Images
We recommend that you always refer to images through your CSS files rather than in your Javascript code. URLs to images will thus be relative to your CSS files, e.g.,
```
.foo {
background: url(../images/x.png);
}
```
If you really really absolutely need to refer to your images in your Javascript code, then look up your extension's URL path in the global Javascript variable `ModuleWirings`:
```
ModuleWirings["my-extension"] + "images/x.png"
```
### Client-side: HTML Templates
Beside Javascript, CSS, and images, your extension might also include HTML templates that get loaded on the fly by your Javascript code and injected into the page's DOM. For example, here is [the Cluster edit dialog template](http://github.com/OpenRefine/OpenRefine/blob/master/main/webapp/modules/core/scripts/dialogs/clustering-dialog.html), which gets loaded by code in [the equivalent javascript file 'clustering-dialog.js'](http://github.com/OpenRefine/OpenRefine/blob/master/main/webapp/modules/core/scripts/dialogs/clustering-dialog.js):
```
var dialog = $(DOM.loadHTML("core", "scripts/dialogs/clustering-dialog.html"));
```
`DOM.loadHTML` returns the content of the file as a string, and `$(...)` turns it into a DOM fragment. Where `"core"` is, you would want your extension's name. The path of the HTML file is relative to your extension's `module` subdirectory.
### Client-side: Project UI Extension Points
Getting your extension's Javascript code included in `project.vt` doesn't accomplish much by itself unless your code also registers hooks into the UI. For example, you can surely implement an exporter in Javascript, but unless you add a corresponding menu command in the UI, your user can't use your exporter.
#### Main Menu
The main menu can be extended by calling any one of the methods `MenuBar.appendTo`, `MenuBar.insertBefore`, and `MenuBar.insertAfter`. Each method takes 2 arguments: an array of strings that identify a particular existing menu item or submenu, and one new single menu item or submenu or an array of menu items and submenus. For example, to insert 2 menu items and a menu separator before the menu item Project > Export Filtered Rows > Templating..., write this Javascript code wherever that would execute when your Javascript files get loaded:
```
MenuBar.insertBefore(
["core/project", "core/export", "core/export-templating"],
[
{
"label":"Menu item 1",
"click": function() { ... }
},
{
"label":"Menu item 2",
"click": function() { ... }
},
{} // separator
]
);
```
The array `["core/project", "core/export", "core/export-templating"]` pinpoints the reference menu item.
See the beginning of [/main/webapp/modules/core/scripts/project/menu-bar.js](http://github.com/OpenRefine/OpenRefine/blob/master/main/webapp/modules/core/scripts/project/menu-bar.js) for IDs of menu items and submenus.
#### Column Header Menu
The drop-down menu of each column can also be extended, but the mechanism is slightly different compared to the main menu. Because the drop-down menu for a particular column is constructed on the fly when the user actually clicks the drop-down menu button, extending the column header menu can't really be done once at start-up time, but must be done every time a column header menu gets created. So, registration in this case involves providing a function that gets called each such time:
```
DataTableColumnHeaderUI.extendMenu(function(column, columnHeaderUI, menu) { ... do stuff to menu ... });
```
That function takes in the column object (which contains the column's name), the column header UI object (generally not so useful), and the menu to extend. In the previous code line where it says "do stuff to menu", you can write something like this:
```
MenuSystem.appendTo(menu, ["core/facet"], [
{
id: "core/text-facet",
label: "My Facet on " + column.name,
click: function() {
... use column.name and do something ...
}
},
]);
```
In addition to `MenuSystem.appendTo`, you can also call `MenuSystem.insertBefore` and `MenuSystem.insertAfter` which the same 3 arguments. To see what IDs you can use, see the function `DataTableColumnHeaderUI.prototype._createMenuForColumnHeader` in [/main/webapp/modules/core/scripts/views/data-table/column-header-ui.js](http://github.com/OpenRefine/OpenRefine/blob/master/main/webapp/modules/core/scripts/views/data-table/column-header-ui.js).
### Server-side: Ajax Commands
The client-side of OpenRefine gets things done by calling AJAX commands on the server-side. These commands must be registered with the OpenRefine servlet, so that the servlet knows how to route AJAX calls from the client-side. This can be done inside the `init` function in your extension's `controller.js` file, e.g.,
```
function init() {
var RefineServlet = Packages.com.google.refine.RefineServlet;
RefineServlet.registerCommand(module, "my-command", new Packages.com.foo.bar.MyCommand());
}
```
Your command will then be accessible at [http://127.0.0.1:3333/command/my-extension/my-command](http://127.0.0.1:3333/command/my-extension/my-command).
### Server-side: Operations
Most commands change the project's data. Most of them do so by creating abstract operations. See the Changes, History, Processes, and Operations section of the [Server Side Architecture](Server-side+Architecture) document.
You can register an operation **class** in the `init` function as follows:
```
Packages.com.google.refine.operations.OperationRegistry.registerOperation(
module,
"operation-name",
Packages.com.foo.bar.MyOperation
);
```
Do not call `new` to construct an operation instance. You must register the class itself. The class should have a static function for reconstructing an operation instance from a JSON blob:
```
static public AbstractOperation reconstruct(Project project, JSONObject obj) throws Exception {
...
}
```
### Server-side: GREL
GREL can be extended with new functions. This is also done in the `init` function in `controller.js`, e.g.,
```
Packages.com.google.refine.grel.ControlFunctionRegistry.registerFunction(
"functionName", new Packages.com.foo.bar.TheFunctionClass());
```
You might also want to provide new variables (beyond just `value`, `cells`, `row`, etc.) available to expressions. This is done by registering a binder that implements the interface `com.google.refine.expr.Binder`:
```
Packages.com.google.refine.expr.ExpressionUtils.registerBinder(
new Packages.com.foo.bar.MyBinder());
```
### Server-side: Importers
You can register an importer as follows:
```
Packages.com.google.refine.importers.ImporterRegistry.registerImporter(
"importer-name", new Packages.com.foo.bar.MyImporter());
```
The string `"importer-name"` isn't important at all. It's not really related to file extension or mime-type. Just use something unique. Your importer will be explicitly called to test if it can import something.
### Server-side: Exporters
You can register an exporter as follows:
```
Packages.com.google.refine.exporters.ExporterRegistry.registerExporter(
"exporter-name", new Packages.com.foo.bar.MyExporter());
```
The string `"exporter-name"` isn't important at all. It's only used by the client-side to tell the server-side which exporter to use. Just use something unique and, of course, relevant.
### Server-side: Overlay Models
Overlay models are objects attached onto a core Project object to store and manage additional data for that project. For example, the schema alignment skeleton is managed by the Protograph overlay model. An overlay model implements the interface `com.google.refine.model.OverlayModel` and can be registered like so:
```
Packages.com.google.refine.model.Project.registerOverlayModel(
"model-name",
Packages.com.foo.bar.MyOverlayModel);
```
Note that you register the **class** , not an instance. The class should implement the following static method for reconstructing an overlay model instance from a JSON blob:
```
static public OverlayModel reconstruct(JSONObject o) throws JSONException {
...
}
```
When the project gets saved, the overlay model instance's `write` method will be called:
```
public void write(JSONWriter writer, Properties options) throws JSONException {
...
}
```
### Server-side: Scripting Languages
A scripting language (such as Jython) can be registered as follows:
```
Packages.com.google.refine.expr.MetaParser.registerLanguageParser(
"jython",
"Jython",
Packages.com.google.refine.jython.JythonEvaluable.createParser(),
"return value"
);
```
The first string is the prefix that gets prepended to each expression so that we know which language the expression is in. This should be short, unique, and identifying. The second string is a user-friendly name of the language. The third is an object that implements the interface `com.google.refine.expr.LanguageSpecificParser`. The final string is the default expression in that language that would return the cell's value.
In 2018 we are making important changes to OpenRefine to modernize it, for the benefit of users and contributors. This page describes the changes that impact developers of extensions or forks and is intended to minimize the effort required on their end to follow the transition. The instructions are written specifically with extension maintainers in mind, but fork maintainers should also find it useful.
This document describes the migrations in the order they are committed to the master branch. This means that it should be possible to perform each migration in turn, with the ability to run the software between each stage by checking out the appropriate git commit.

View File

@ -20,7 +20,7 @@ module.exports = {
label: 'User Manual',
position: 'left',
},
{to: 'tech/architecture',
{to: 'technical-reference/technical-reference-index',
label: 'Technical Reference',
position: 'left'},
{

View File

@ -14,13 +14,20 @@ module.exports = {
'manual/troubleshooting'
],
'Technical Reference': [
'tech/architecture',
'tech/techstack',
'tech/server',
'tech/client',
'tech/importing',
'tech/faceted_browsing'
'technical-reference/technical-reference-index',
'technical-reference/architecture',
'technical-reference/openrefine-api',
'technical-reference/reconciliation-api',
'technical-reference/suggest-api',
'technical-reference/data-extension-api',
'technical-reference/contributing',
'technical-reference/build-test-run',
'technical-reference/development-roadmap',
'technical-reference/version-release-process',
'technical-reference/homebrew-cask-process',
'technical-reference/writing-extensions',
'technical-reference/migrating-older-extensions',
'technical-reference/translating',
]
},
};

BIN
docs/static/img/eclipse-debug-config.png vendored Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 48 KiB

BIN
docs/static/img/eclipse-exec-config.png vendored Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 82 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 32 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 56 KiB

BIN
docs/static/img/intellij-maven.png vendored Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 42 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 64 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 76 KiB

BIN
docs/static/img/intellij-setup-1.png vendored Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 21 KiB

BIN
eclipse-debug-config.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 48 KiB