id | title | sidebar_label |
---|---|---|
data-extension-api | Data extension API | Data extension API |
This page describes a new optional API for reconciliation services, allowing clients to pull properties of reconciled records. It is supported from OpenRefine 2.8 onwards. A sample server implementation is available in the Wikidata reconciliation interface.
Overview of the workflow
- Reconcile a column with a standard reconciliation service.
- Click "Add column from reconciled values".
- The dialog proposes properties to fetch, based on the type the column was reconciled against (if any). The user can also pick their own property with the suggest widget (the same one used in the reconciliation dialog).
- A preview of the columns to be fetched is displayed on the right-hand side of the dialog, based on a sample of the rows.
- Once the user has clicked "OK", the columns are fetched and added to the project. Columns whose values are other items from the service are reconciled directly, and each such column is marked as reconciled against the type suggested by the service for that property. The user can run data extension again from that column.
Specification
Services supporting data extension must add an extend field to their service metadata. This field is expected to have the following subfields, all optional:
- propose_properties stores the endpoint of an API which will be used to suggest properties to fetch (see specification below). The field contains an object with a service_url and a service_path, which are concatenated to obtain the URL where the endpoint is available, just like the other services in the metadata. If this field is not provided, no property is suggested in the dialog (the user has to input them manually).
- property_settings stores the specification of a form where the user can configure how a given property should be fetched (see specification below). If this field is not provided, no settings are offered to the user.
The service endpoint must also accept a new parameter, extend (in addition to queries, which is used for reconciliation). Its behaviour is described in the Data extension protocol section below.
Example service metadata:
"extend": {
"propose_properties": {
"service_url": "https://tools.wmflabs.org/openrefine-wikidata",
"service_path": "/en/propose_properties"
},
"property_settings": []
}
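As an illustration of how a client might read this metadata, the following minimal sketch resolves the property proposal URL by concatenating service_url and service_path. The function name `propose_properties_url` is hypothetical and not part of the protocol; returning None when the field is absent is an assumption about how a client could signal "no suggestions available".

```python
from typing import Optional


def propose_properties_url(metadata: dict) -> Optional[str]:
    """Return the property proposal endpoint URL, or None if none is declared.

    Hypothetical helper: the name and the None convention are assumptions.
    """
    extend = metadata.get("extend")
    if extend is None:
        return None  # the service does not declare data extension support
    endpoint = extend.get("propose_properties")
    if endpoint is None:
        return None  # no properties will be suggested in the dialog
    # Plain concatenation of the two parts, as described in the text.
    return endpoint["service_url"] + endpoint["service_path"]


metadata = {
    "extend": {
        "propose_properties": {
            "service_url": "https://tools.wmflabs.org/openrefine-wikidata",
            "service_path": "/en/propose_properties",
        },
        "property_settings": [],
    }
}
print(propose_properties_url(metadata))
```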
Property proposal protocol
The role of the property proposal endpoint is to suggest a list of properties to fetch. Its only inputs are GET parameters:
- the type the column was reconciled against. If no type is provided, the endpoint should suggest properties for a column reconciled against no type.
- a limit on the number of results to return.
The type is specified by its id in the type GET parameter of the endpoint, as follows:
https://tools.wmflabs.org/openrefine-wikidata/en/propose_properties?type=Q3354859&limit=3
The endpoint returns a JSON response as follows:
{
"properties": [
{
"id": "P969",
"name": "located at street address"
},
{
"id": "P1449",
"name": "nickname"
},
{
"id": "P17",
"name": "country"
}
],
"type": "Q3354859",
"limit": 3
}
This endpoint must support JSONP via the callback parameter (just like all other endpoints of the reconciliation service).
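The request and the JSONP requirement can be sketched as follows. This is a minimal illustration, not the reference implementation: `build_proposal_url` and `render_response` are hypothetical names, and wrapping the JSON body as `callback(...);` is the usual JSONP convention rather than something spelled out on this page.

```python
import json
from typing import Optional
from urllib.parse import urlencode


def build_proposal_url(endpoint: str, type_id: Optional[str] = None,
                       limit: Optional[int] = None) -> str:
    """Client side: append the optional type and limit GET parameters."""
    params = {}
    if type_id is not None:
        params["type"] = type_id
    if limit is not None:
        params["limit"] = limit
    if not params:
        return endpoint
    return endpoint + "?" + urlencode(params)


def render_response(payload: dict, callback: Optional[str] = None) -> str:
    """Server side: return plain JSON, or JSONP when a callback is given."""
    body = json.dumps(payload)
    if callback is None:
        return body
    # Conventional JSONP wrapping (assumption: no callback-name validation here).
    return f"{callback}({body});"


url = build_proposal_url(
    "https://tools.wmflabs.org/openrefine-wikidata/en/propose_properties",
    type_id="Q3354859",
    limit=3,
)
print(url)
```

A production service would also want to validate the callback name before echoing it back, which this sketch leaves out.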
Data extension protocol
After calling the property proposal endpoint, the consumer (OpenRefine) calls the service endpoint with a JSON object in the extend parameter, containing the following fields:
- ids is a list of strings, each of which is an identifier of a record as returned by the reconciliation method. These are the records whose properties should be retrieved.
- properties is a list of JSON objects. They specify the properties to be fetched for each item, and contain the following fields:
  - id (a string): the identifier of the property as returned by the property suggest service (and optionally the property proposal service)
  - settings (optional): a JSON object storing parameters about how the property should be fetched (see below).
Example:
{
"ids": [
"Q7205598",
"Q218765",
"Q845632",
"Q5661356"
],
"properties": [
{
"id": "P856"
},
{
"id": "P159"
}
]
}
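A client could assemble such a query along these lines. This is only a sketch: `build_extend_query` is a hypothetical helper, and the page does not state whether the extend parameter travels in a GET query string or a POST form body, so the code shows just the URL-encoding of the parameter.

```python
import json
from urllib.parse import parse_qs, urlencode


def build_extend_query(ids, property_ids, settings=None):
    """Build the JSON object carried by the extend parameter.

    settings is an optional mapping from property id to a settings object
    (hypothetical calling convention, chosen for this example).
    """
    settings = settings or {}
    properties = []
    for property_id in property_ids:
        prop = {"id": property_id}
        if property_id in settings:
            prop["settings"] = settings[property_id]
        properties.append(prop)
    return {"ids": list(ids), "properties": properties}


query = build_extend_query(
    ["Q7205598", "Q218765", "Q845632", "Q5661356"],
    ["P856", "P159"],
)
# The JSON document is serialized and passed as the value of "extend".
encoded = urlencode({"extend": json.dumps(query)})
# Decoding it again gives back the same query object.
round_trip = json.loads(parse_qs(encoded)["extend"][0])
print(round_trip == query)
```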
The service returns a JSON response formatted as follows:
- meta contains a list of column metadata. The order of the properties must be the same as the one provided in the query. Each element is an object containing the following keys:
  - id (mandatory): the identifier of the property
  - name (mandatory): the human-readable name of the property
  - type (optional): an object with id and name keys representing the expected type of values for that property. The notion of type is the same as the one used for reconciliation. The type field should only be provided when the property returns reconciled items.
- rows contains an object. Its keys must be exactly the record ids (ids) passed in the query. The value for each record id is an object representing a row for that id. The keys of a row object must be exactly the property ids passed in the query ("P856" and "P159" in the example above). The value for a property id should be a list of cell objects.
Cell objects are JSON objects which contain the representation of an OpenRefine cell.
- an object with a single "str" key and a string value for it represents a cell with a (bare) string in it. Example: {"str": "193.54.0.0/15"}
- an object with "id" and "name" represents a reconciled value (from the same reconciliation service). It will be stored as a matched cell (with maximum reconciliation score). Example: {"name": "Warsaw", "id": "Q270"}
- an empty object {} represents an empty cell
- an object with "date" and an ISO-formatted date string represents a point in time. Example: {"date": "1987-02-01T00:00:00+00:00"}
- an object with "float" and a numerical value represents a quantity. Example: {"float": 48.2736}
- an object with "int" and an integer represents a number. Example: {"int": 54}
- an object with "bool" and true or false represents a boolean. Example: {"bool": false}
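The cell forms listed above can be decoded with a small dispatch function. The sketch below is one possible client-side reading, not how OpenRefine itself stores cells: `MatchedItem` and `cell_to_value` are hypothetical names, and representing an empty cell as None is an assumption.

```python
from datetime import datetime
from typing import Any, NamedTuple


class MatchedItem(NamedTuple):
    """A reconciled value from the same service (hypothetical container)."""
    id: str
    name: str


def cell_to_value(cell: dict) -> Any:
    """Convert a cell object into a plain Python value."""
    if not cell:
        return None  # {} represents an empty cell
    if "id" in cell and "name" in cell:
        return MatchedItem(cell["id"], cell["name"])
    if "date" in cell:
        # Parses ISO strings with an explicit offset, as in the example above.
        return datetime.fromisoformat(cell["date"])
    for key in ("str", "float", "int", "bool"):
        if key in cell:
            return cell[key]
    raise ValueError(f"unrecognised cell object: {cell!r}")


cells = [
    {"str": "193.54.0.0/15"},
    {"name": "Warsaw", "id": "Q270"},
    {},
    {"date": "1987-02-01T00:00:00+00:00"},
    {"float": 48.2736},
    {"int": 54},
    {"bool": False},
]
values = [cell_to_value(cell) for cell in cells]
```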
Example of a full response (for the example query above):
{
"rows": {
"Q5661356": {
"P159": [],
"P856": []
},
"Q7205598": {
"P159": [
{
"name": "Warsaw",
"id": "Q270"
}
],
"P856": [
{
"str": "http://www.polkomtel.com.pl/english"
},
{
"str": "http://www.polkomtel.com.pl/"
}
]
},
"Q845632": {
"P159": [
{
"name": "Bærum",
"id": "Q57076"
}
],
"P856": [
{
"str": "http://www.telenor.com/"
}
]
},
"Q218765": {
"P159": [
{
"name": "Paris",
"id": "Q90"
}
],
"P856": [
{
"str": "http://www.sfr.fr/"
}
]
}
},
"meta": [
{
"id": "P856",
"name": "official website"
},
{
"id": "P159",
"name": "headquarters location",
"type": {
"id": "Q7540126",
"name": "headquarters"
}
}
]
}
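The constraints on meta and rows can be checked mechanically. The following sketch is a hypothetical validator (the name `validate_extend_response` and the wording of the messages are made up for this example); it covers only the rules stated in this section: meta in query order with mandatory id and name, rows keyed exactly by the queried record ids, and each row keyed exactly by the queried property ids with list values.

```python
def validate_extend_response(query: dict, response: dict) -> list:
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    record_ids = query["ids"]
    property_ids = [prop["id"] for prop in query["properties"]]

    # meta must list the properties in the same order as the query.
    meta = response.get("meta", [])
    if [column.get("id") for column in meta] != property_ids:
        problems.append("meta does not list the queried properties in order")
    for column in meta:
        if "name" not in column:
            problems.append(f"meta entry {column.get('id')} has no name")

    # rows must be keyed by exactly the queried record ids.
    rows = response.get("rows", {})
    if set(rows) != set(record_ids):
        problems.append("rows keys differ from the queried record ids")
    for record_id, row in rows.items():
        # Each row must be keyed by exactly the queried property ids.
        if set(row) != set(property_ids):
            problems.append(f"row {record_id} keys differ from the queried property ids")
        for property_id, cells in row.items():
            if not isinstance(cells, list):
                problems.append(f"{record_id}/{property_id} is not a list of cells")
    return problems


query = {"ids": ["Q218765"], "properties": [{"id": "P856"}, {"id": "P159"}]}
response = {
    "meta": [
        {"id": "P856", "name": "official website"},
        {"id": "P159", "name": "headquarters location",
         "type": {"id": "Q7540126", "name": "headquarters"}},
    ],
    "rows": {
        "Q218765": {
            "P856": [{"str": "http://www.sfr.fr/"}],
            "P159": [{"name": "Paris", "id": "Q90"}],
        }
    },
}
problems = validate_extend_response(query, response)
```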
Settings specification
The property_settings field in the service metadata allows the service to declare that it accepts some settings for the properties it fetches. They are specified as a list of JSON objects which define the fields that should be exposed to the user.
Each setting object looks like this:
{
"default": 0,
"type": "number",
"label": "Limit",
"name": "limit",
"help_text": "Maximum number of values to return per row (0 for no limit)"
}
It is essentially a definition of a form field in JSON, with self-explanatory fields.
The type field specifies the type of the form field (among number, select, text, checkbox).
The default field gives the default value of the form field: the service must assume this value if the client does not specify this setting.
For the select type, an additional choices field defines the possible choices, with both labels and values:
{
"default": "any",
"label": "References",
"name": "references",
"type": "select",
"choices": [
{
"value": "any",
"name": "Any statement"
},
{
"value": "referenced",
"name": "At least one reference"
},
{
"value": "no_wiki",
"name": "At least one non-wiki reference"
}
],
"help_text": "Filter statements by their references"
}
When querying the service for rows, the client can pass an optional settings object for each of the requested properties:
{
"id": "P342",
"settings": {
"limit": "20",
"references": "referenced"
}
}
Each key of the settings object must correspond to one form field proposed by the service. The value of that key is the value of the form field represented as a string (for uniformity and consistency with JSON form serialization). The settings are intended to modify the results returned by the service. Their semantics are up to the service, since the service itself defines which settings it accepts.
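On the service side, the received strings have to be combined with the declared defaults. The sketch below shows one way to do that, under stated assumptions: `resolve_settings` is a hypothetical helper, number settings are treated as integers, unknown keys are rejected with an error, and text and checkbox values are passed through unchanged because this page does not specify how they are serialized.

```python
SETTING_SPECS = [
    {"default": 0, "type": "number", "label": "Limit", "name": "limit",
     "help_text": "Maximum number of values to return per row (0 for no limit)"},
    {"default": "any", "type": "select", "label": "References", "name": "references",
     "choices": [
         {"value": "any", "name": "Any statement"},
         {"value": "referenced", "name": "At least one reference"},
         {"value": "no_wiki", "name": "At least one non-wiki reference"},
     ],
     "help_text": "Filter statements by their references"},
]


def resolve_settings(received: dict, specs: list) -> dict:
    """Merge string-valued settings from a query with the declared defaults."""
    known = {spec["name"] for spec in specs}
    unknown = set(received) - known
    if unknown:
        # Assumption: keys that match no declared form field are an error.
        raise ValueError(f"unknown settings: {sorted(unknown)}")
    resolved = {}
    for spec in specs:
        name = spec["name"]
        if name not in received:
            resolved[name] = spec["default"]  # default when unspecified
            continue
        raw = received[name]
        if spec["type"] == "number":
            resolved[name] = int(raw)  # assumption: integer-valued setting
        elif spec["type"] == "select":
            allowed = {choice["value"] for choice in spec["choices"]}
            if raw not in allowed:
                raise ValueError(f"invalid choice for {name}: {raw!r}")
            resolved[name] = raw
        else:
            resolved[name] = raw  # text and checkbox passed through as-is
    return resolved


settings = resolve_settings({"limit": "20", "references": "referenced"}, SETTING_SPECS)
```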