---
id: schema-alignment
title: Schema alignment
sidebar_label: Schema alignment
---

A Wikibase schema is a template of Wikidata edits that is applied
to each row in the project. This page describes how each part of this
template works, and how it generates edits depending on the contents of
the table cells.

## Items {#items}

An item in the schema represents a set of changes on a particular
Wikidata item, generated by a single row. This item can contain changes
in [terms](#terms) (labels, descriptions and aliases) or
[statements](#statements).

You can edit several items per row of your table: just add multiple
items to your schema. Each item has a subject, which can be either
entered manually (when the item on which the edits should be made is the
same for all rows) or taken from a reconciled column dropped into this
field. In the latter case, the edits will depend on the reconciliation
status of each cell:

- If the cell is matched to an item, edits will be made on that item;
- If the cell is marked as corresponding to a new item, a new item
  will be created for it. See [New items](./new-entities) for more
  details about how this works;
- If the cell has reconciliation candidates but has not been matched
  to any of them, the edit will be skipped (even if there is only one
  candidate with a high reconciliation score);
- If the cell is not reconciled or blank, the edit will be skipped.

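The four cases above can be summarized as a small decision function. This is only an illustrative sketch: the `ReconStatus` enum, the `subject_action` function and the action names are hypothetical and do not correspond to OpenRefine's internal code.

```python
from enum import Enum
from typing import Optional


class ReconStatus(Enum):
    """Hypothetical labels for the reconciliation states described above."""
    MATCHED = "matched"        # cell matched to an existing item
    NEW = "new"                # cell marked as a new item
    UNMATCHED = "unmatched"    # candidates exist, but none was chosen
    NONE = "none"              # cell not reconciled, or blank


def subject_action(status: ReconStatus) -> Optional[str]:
    """Return the action taken for a row, or None when the edit is skipped."""
    if status is ReconStatus.MATCHED:
        return "edit-existing-item"
    if status is ReconStatus.NEW:
        return "create-new-item"
    # Unmatched candidates (even a single high-scoring one), unreconciled
    # cells and blank cells all lead to the edit being skipped.
    return None
```
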
Do not worry about the ordering of items in the schema or the order of
your rows, as OpenRefine will rearrange your edits to optimize their
upload. If your project makes edits on the same item across multiple
rows, these edits will be merged together and performed in one edit. See
[Uploading your changes](./uploading) for details.

## Terms {#terms}

**Terms** are the language-specific strings that you find at the top of
Wikidata items: labels, descriptions and aliases. OpenRefine lets you
edit these terms via the Wikidata schema.

### Languages {#languages}

Each term belongs to a particular language. Wikidata supports [hundreds
of languages](https://www.wikidata.org/wiki/Help:Wikimedia_language_codes/lists/all), which
are designated by language codes. For each term that you want to add to
an item, you will need to specify the language for this term. There are
two cases:

- Either the language is constant across your dataset: you know that
  all the names in a given column are spelled in the same language. In
  this case, type the name of the language in the input and select the
  language in the drop-down suggestion dialog. This will place the
  appropriate language code in the input.
- Or the language varies across your dataset. In this case, you need
  to provide a column of Wikimedia language codes that indicates the
  language for each term that you want to add. Just drag and drop this
  column to the language field. If there are any invalid language
  codes in this column, the corresponding terms will be ignored.
  OpenRefine will translate any deprecated language codes to their
  preferred values silently.

### Labels {#labels}

Wikidata items can have at most one label per language, so you need to
choose whether to override any existing label (default behaviour before
3.2) or only insert your label if there is no label yet in the given
language (default behaviour starting from 3.2). When the cell providing
the label is blank, nothing is changed, so it is not possible to remove
labels.

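The two label behaviours can be sketched as follows. The `apply_label` function and its `override` parameter are hypothetical names chosen for illustration; they model the rule described above and are not OpenRefine's actual implementation.

```python
from typing import Dict, Optional


def apply_label(existing: Dict[str, str], lang: str,
                new_label: Optional[str], override: bool = False) -> Dict[str, str]:
    """Return the labels of an item after applying one label edit.

    `existing` maps language codes to labels. With override=False (assumed
    here to model the default from 3.2 onwards), a label is only inserted
    when the language has none yet.
    """
    result = dict(existing)
    # A blank cell never changes anything, so labels cannot be removed.
    if new_label is None or not new_label.strip():
        return result
    if override or lang not in result:
        result[lang] = new_label
    return result
```

For example, `apply_label({"en": "Reims"}, "en", "Rheims")` leaves the English label untouched, while the same call with `override=True` replaces it.
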
### Descriptions {#descriptions}

Descriptions work like labels: there is at most one description per
language, and OpenRefine can override existing descriptions or leave
them unchanged. It is not possible to remove descriptions either.

### Aliases {#aliases}

Aliases are added to the list of existing aliases in the given language.
When adding an alias in a language where no label has been added yet,
the alias is automatically promoted to a label for this language. It is
not possible to remove aliases or to override any existing aliases.

## Statements {#statements}

You can add statements in the schema: this will generate new statements
on the corresponding items. These statements will be merged with any
existing statements on the actual Wikidata items and [this merging process depends on the upload medium](./uploading#Merging-strategies-for-statements).
More control over the merging strategy is planned for the near future.

### Main values {#main-values}

Statements must have main values: "novalue" or "somevalue"
statements are not supported yet. The main value of a statement is a
data value whose type depends on the property used for the statement. If
the main value cannot be evaluated (for instance because one of the
cells it depends on is empty), then the entire statement will be
skipped.

See the [data values](#data-values) section for more details
about how to specify each type of data value and when they are skipped.

### Qualifiers {#qualifiers}

Qualifiers can be added on each statement. When their values are
skipped, only the qualifier will be discarded: the rest of the statement
will still be added.

### References {#references}

References can (and should) be added to back each statement. If values
inside the reference are skipped, the corresponding part of the
reference will be discarded but the reference will still be added
(unless the reference becomes empty).

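The skipping rules for main values, qualifiers and references can be modelled in a few lines. This is a simplified sketch, not OpenRefine's code: the `build_statement` function is hypothetical, a skipped value is represented by `None`, and the property identifiers in the usage note are only examples.

```python
from typing import Any, Dict, List, Optional


def build_statement(main_value: Any,
                    qualifiers: Dict[str, Any],
                    references: List[Dict[str, Any]]) -> Optional[dict]:
    """Assemble one statement, applying the skipping rules described above."""
    # A skipped main value discards the entire statement.
    if main_value is None:
        return None
    # A skipped qualifier value only discards that qualifier.
    kept_qualifiers = {prop: value for prop, value in qualifiers.items()
                       if value is not None}
    kept_references = []
    for reference in references:
        # Skipped parts of a reference are dropped individually...
        kept = {prop: value for prop, value in reference.items()
                if value is not None}
        # ...and the reference itself is dropped only if nothing remains.
        if kept:
            kept_references.append(kept)
    return {"value": main_value,
            "qualifiers": kept_qualifiers,
            "references": kept_references}
```

For instance, `build_statement("Q90", {"P580": None}, [{"P854": "https://example.org", "P813": None}, {"P854": None}])` keeps the statement, drops the qualifier, keeps the first reference with only its `P854` part, and drops the second reference entirely.
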
### Ranks {#ranks}

All statement ranks are set to **Normal**. It is currently not possible
to set a different rank.

## Data values {#data-values}

Data values are the data that you can find as target of a statement (or
qualifier, or part of a reference). Each property dictates a particular
type of data value. In each case, OpenRefine uses a particular process
to translate cell contents to a data value of the appropriate type. This
section explains the process for all data types.

### Items {#items-1}

Items are evaluated in the same way as the subjects of items in the
schema. They can be input directly using the auto-suggest service
provided, or any column reconciled against Wikidata can be used. Refer to
[the first Items section](#items) to see how they are
evaluated.

### Strings and external identifiers {#strings-and-external-identifiers}

Bare strings and external identifiers can be input directly as constants
(if they do not change across rows) or using any column. If a reconciled
column is used for a string value, it is the value of the cell that is
going to be used, not the name of the reconciled item (which is what
OpenRefine displays). Values are skipped when the column is blank or
null.

### Monolingual texts {#monolingual-texts}

Monolingual texts consist of two parts:

- the language: see [Languages](#languages) for their
  structure;
- the value of the text: see [the section above](#strings-and-external-identifiers).

A monolingual text is skipped when any of its parts is skipped (that is,
if the language or the text is invalid).

### Dates {#dates}

Dates are parsed from cell contents (or from any constant provided in
the schema) and the precision of the date is inferred from its format.
Here are the valid formats:

- `YYYYM`, such as `2001M` (millennium precision)
- `YYYYC`, such as `1901C` (century precision)
- `YYYYD`, such as `1981D` (decade precision)
- `YYYY`, such as `1984` (year precision)
- `YYYY-MM`, such as `2019-03` (month precision)
- `YYYY-MM-DD`, such as `1897-08-14` (day precision)

Any value that does not match any of these formats will be ignored. All
dates are represented in UTC, Gregorian calendar.

In OpenRefine 3.3, the following new formats were introduced:

- `TODAY` returns today's date with day precision. This will be
  evaluated when performing the edits (or exporting to
  QuickStatements);
- `YYYY-MM-DD_QID` can be used to specify a date in a particular
  calendar (such as the [proleptic Julian calendar (Q1985786)](https://www.wikidata.org/wiki/Q1985786)).

In OpenRefine 3.5, the following new format was introduced:

- `-234` represents the year 234 [BCE](https://en.wikipedia.org/wiki/Common_Era)

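To illustrate how a precision can be inferred from the format alone, here is a minimal sketch covering the six basic formats listed in this section. The `infer_precision` function is hypothetical and is not OpenRefine's parser: it ignores `TODAY`, calendar suffixes and negative years, and it does not check that months and days are in range.

```python
import re
from typing import Optional

# One pattern per basic format, paired with the precision it implies.
_FORMATS = [
    (re.compile(r"\d{4}M"), "millennium"),
    (re.compile(r"\d{4}C"), "century"),
    (re.compile(r"\d{4}D"), "decade"),
    (re.compile(r"\d{4}"), "year"),
    (re.compile(r"\d{4}-\d{2}"), "month"),
    (re.compile(r"\d{4}-\d{2}-\d{2}"), "day"),
]


def infer_precision(value: str) -> Optional[str]:
    """Return the precision implied by the format, or None if it is ignored."""
    for pattern, precision in _FORMATS:
        # fullmatch ensures the whole cell content follows the format.
        if pattern.fullmatch(value):
            return precision
    return None
```

With this sketch, `infer_precision("2019-03")` returns `"month"`, while a value such as `14/08/1897` returns `None` and would be ignored.
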
### Quantities {#quantities}

Quantities consist of two parts: the amount and the unit.

- the amount is mandatory and must be a string, such as `18,229.1020`.
  The precision that is displayed will be respected (the same number
  of trailing zeros will be shown in Wikidata). By default, no upper
  and lower bounds will be set. To define these, use the
  engineering notation, such as `3.45E+3`, which will be interpreted
  as `3,450±5`. As usual, the amount can be provided as a constant or
  as a column variable. In the latter case, the values in the column
  must be strings.
- the unit is optional. It is an item, so it can be provided either
  with the auto-suggest dialog or as a reconciled column. Note that
  if a reconciled column is used, any unreconciled cell will discard
  the entire quantity value. So a template for a quantity value is
  either always unit-less, or always has a unit.

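The way bounds follow from the exponent notation can be worked through with Python's `decimal` module. This is a sketch under stated assumptions: the `parse_amount` function is hypothetical, it assumes that the bounds are half a unit of the last significant digit (which is consistent with `3.45E+3` giving `3,450±5`), and it does not handle thousands separators.

```python
from decimal import Decimal
from typing import Optional, Tuple


def parse_amount(text: str) -> Tuple[Decimal, Optional[Decimal], Optional[Decimal]]:
    """Return (amount, lower bound, upper bound) for an amount string."""
    amount = Decimal(text)
    # Without exponent notation, no bounds are set.
    if "e" not in text.lower():
        return amount, None, None
    # Position of the last significant digit, as a power of ten.
    exponent = amount.as_tuple().exponent
    # Half a unit of that digit: 5 * 10^(exponent - 1).
    half_unit = Decimal(5).scaleb(exponent - 1)
    return amount, amount - half_unit, amount + half_unit
```

For `3.45E+3`, the last significant digit is in the tens position, so the half unit is 5 and the bounds are 3445 and 3455. Using `Decimal` rather than `float` also keeps trailing zeros, so `18229.1020` is not shortened to `18229.102`.
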
### Globe coordinates {#globe-coordinates}

Geographic coordinates are specified as strings with the following
formats, where all components are floating point numbers in degrees:

- `latitude,longitude` for a default precision of ten microdegrees
  (for instance,
  [`49.265278,4.028611`](https://tools.wmflabs.org/geohack/geohack.php?params=49.265277777778_N_4.0286111111111_E_globe:earth&language=en)
  can be used to indicate the position of Reims, France);
- `latitude,longitude,precision` when specifying an explicit precision
  (for instance, `49.265278,4.028611,0.1` can be used to indicate the
  position of Reims within a tenth of a degree).

All globe coordinates are on Earth ([Q2](https://www.wikidata.org/wiki/Q2)).

If your coordinates are in a different format, such as
`49° 15′ 55″ N, 4° 1′ 43″ E`, you will need to convert them to decimal
format first.

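One possible way to do this conversion outside OpenRefine is sketched below. The `dms_to_decimal` function is a hypothetical helper: it assumes the exact degree, prime and double-prime characters shown in the example above and a trailing hemisphere letter, so other notations would need a different pattern.

```python
import re

# Degrees, minutes, seconds and a hemisphere letter, e.g. 49° 15′ 55″ N
_DMS = re.compile(r"(\d+)°\s*(\d+)′\s*(\d+(?:\.\d+)?)″\s*([NSEW])")


def dms_to_decimal(text: str) -> str:
    """Convert a 'D° M′ S″ H, D° M′ S″ H' string to 'latitude,longitude'."""
    parts = _DMS.findall(text)
    if len(parts) != 2:
        raise ValueError(f"expected two coordinates in {text!r}")
    values = {}
    for degrees, minutes, seconds, hemisphere in parts:
        value = int(degrees) + int(minutes) / 60 + float(seconds) / 3600
        # Southern and western hemispheres are negative in decimal format.
        if hemisphere in "SW":
            value = -value
        values["lat" if hemisphere in "NS" else "lon"] = value
    if set(values) != {"lat", "lon"}:
        raise ValueError(f"expected one latitude and one longitude in {text!r}")
    # Six decimals correspond to the microdegree scale used above.
    return f"{values['lat']:.6f},{values['lon']:.6f}"
```

Applied to `49° 15′ 55″ N, 4° 1′ 43″ E`, this returns `49.265278,4.028611`, the decimal string used for Reims earlier in this section.
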
### Media on Commons {#media-on-commons}

Media on Wikimedia Commons is treated like strings, whose values must
exactly match filenames on Commons. These values are not checked during
schema evaluation: if they are wrong, uploading the statements will
fail.

Tabular data and Geoshapes must be prefixed with the `Data:` namespace.
This is indicated by the placeholder in the field that appears when
constructing the schema.

### Properties {#properties}

Properties are always constants: there is currently no way to reconcile
a column against properties. They have to be selected with the
auto-suggest dialog.

### Other data types {#other-data-types}

URLs, mathematical expressions and other textual datatypes are supported
and treated as strings. At the time of writing, all datatypes supported
by Wikidata are supported by OpenRefine.