RandomSec/OpenRefine/docs/versioned_docs/version-3.5/manual/wikibase/schema-alignment.md
2022-01-30 23:08:52 +01:00
---
id: schema-alignment
title: Schema alignment
sidebar_label: Schema alignment
---
A Wikibase schema is a template of Wikidata edits that is applied
to each row in the project. This page describes how each part of this
template works, and how it generates edits depending on the contents of
the table cells.
## Items {#items}
An item in the schema represents a set of changes on a particular
Wikidata item, generated by a single row. This item can contain changes
in [terms](#terms) (labels, descriptions and aliases) or
[statements](#statements).
It is possible to make edits on different items for each row of your
table: just add multiple items in your schema. Each item has a subject,
which can be either entered manually (when the item on which the edits
should be made is the same for all rows), or any reconciled column can
be dropped in this field. In this case, the edits will depend on the
reconciliation status of each cell:
- If the cell is matched to an item, edits will be made on that item;
- If the cell is marked as corresponding to a new item, a new item
will be created for it. See [New items](./new-entities) for more
details about how this works;
- If the cell has reconciliation candidates but has not been matched
to any of them, the edit will be skipped (even if there is only one
candidate with a high reconciliation score);
- If the cell is not reconciled or blank, the edit will be skipped.
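
The four cases above can be summarised as a small decision function. This is only an illustrative sketch: the `Recon` enum and `subject_action` function are hypothetical names invented for this example and do not correspond to OpenRefine's internal API.

```python
from enum import Enum
from typing import Optional


class Recon(Enum):
    """Hypothetical reconciliation states of a cell (names are assumptions)."""
    MATCHED = "matched"             # matched to an existing item
    NEW = "new"                     # marked as a new item
    CANDIDATES_ONLY = "candidates"  # candidates exist, none selected
    NONE = "none"                   # not reconciled, or blank


def subject_action(status: Recon, matched_id: Optional[str] = None) -> str:
    """Return what happens to the edits generated for one row."""
    if status is Recon.MATCHED and matched_id:
        return f"edit {matched_id}"
    if status is Recon.NEW:
        return "create new item"
    # Unmatched candidates are never picked automatically, whatever their score.
    return "skip"
```

For instance, `subject_action(Recon.CANDIDATES_ONLY)` yields `"skip"`, reflecting that a single high-scoring candidate is still not treated as a match.
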
Do not worry about the ordering of items in the schema or the order of
your rows, as OpenRefine will rearrange your edits to optimize their
upload. If your project makes edits on the same item across multiple
rows, these edits will be merged together and performed in one edit. See
[Uploading your changes](./uploading) about that.
## Terms {#terms}
**Terms** are the language-specific strings that you find at the top of
Wikidata items: labels, descriptions and aliases. OpenRefine lets you
edit these terms via the Wikidata schema.
### Languages {#languages}
Each term belongs to a particular language. Wikidata supports [hundreds
of languages](https://www.wikidata.org/wiki/Help:Wikimedia_language_codes/lists/all), which
are designated by language codes. For each term that you want to add to
an item, you will need to specify the language for this term. There are
two cases:
- Either the language is constant across your dataset: you know that
all the names in a given column are spelled in the same language. In
this case, type the name of the language in the input and select the
language in the drop-down suggestion dialog. This will place the
appropriate language code in the input.
- Or the language varies across your dataset. In this case, you need
to provide a column of Wikimedia language codes that indicates the
language for each term that you want to add. Just drag and drop this
column to the language field. If there are any invalid language
codes in this column, the corresponding terms will be ignored.
OpenRefine will translate any deprecated language codes to their
preferred values silently.
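
As a rough illustration of the second case, the handling of a language column could look like the sketch below. The `VALID_CODES` and `DEPRECATED` tables are tiny illustrative subsets chosen for this example, not the lists OpenRefine actually uses, and `resolve_language` is a hypothetical helper.

```python
from typing import Optional

# Illustrative subset only; the real list has hundreds of codes.
VALID_CODES = {"en", "fr", "de", "lzh"}

# Illustrative example of a deprecated code and its preferred replacement.
DEPRECATED = {"zh-classical": "lzh"}


def resolve_language(code: Optional[str]) -> Optional[str]:
    """Return the preferred language code, or None if the term is ignored."""
    if code is None:
        return None
    code = DEPRECATED.get(code, code)  # silent translation of deprecated codes
    return code if code in VALID_CODES else None
```

A return value of `None` stands for "the corresponding term is ignored".
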
### Labels {#labels}
Wikidata items can have at most one label per language, so for each
label you need to choose whether to override any existing label (the
default behaviour before OpenRefine 3.2) or to insert your label only if
there is no label yet in the given language (the default behaviour from
3.2 onwards). When the content of the cell providing the label is blank,
nothing is changed, so it is not possible to remove labels.
### Descriptions {#descriptions}
Descriptions work like labels: there is at most one description per
language, and OpenRefine can override existing descriptions or leave
them unchanged. It is not possible to remove descriptions either.
### Aliases {#aliases}
Aliases are added to the list of existing aliases in the given language.
When adding an alias in a language where no label has been added yet,
the alias is automatically promoted to a label for this language. It is
not possible to remove aliases or to override any existing aliases.
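
The rules for labels and aliases can be sketched as follows (descriptions follow the same logic as labels). The `Terms` class and both functions are hypothetical and only model the behaviour described above; skipping an alias that is already present is an extra assumption of this sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Terms:
    """Minimal model of the terms of one item, keyed by language code."""
    labels: Dict[str, str] = field(default_factory=dict)
    aliases: Dict[str, List[str]] = field(default_factory=dict)


def add_label(terms: Terms, lang: str, value: str, override: bool = False) -> None:
    if not value or not value.strip():
        return  # blank cell: nothing changes, labels are never removed
    if override or lang not in terms.labels:
        terms.labels[lang] = value


def add_alias(terms: Terms, lang: str, value: str) -> None:
    if not value or not value.strip():
        return
    if lang not in terms.labels:
        terms.labels[lang] = value  # no label yet: alias is promoted to label
        return
    existing = terms.aliases.setdefault(lang, [])
    if value not in existing:  # assumption: duplicates are not added twice
        existing.append(value)
```

With `override=False`, which mirrors the default from 3.2 onwards, an existing label is left untouched.
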
## Statements {#statements}
You can add statements in the schema: this will generate new statements
on the corresponding items. These statements will be merged with any
existing statements on the actual Wikidata items and [this merging process depends on the upload medium](./uploading#Merging-strategies-for-statements).
More control over the merging strategy is planned for the near future.
### Main values {#main-values}
Statements must have main values: "novalue" or "somevalue"
statements are not supported yet. The main value of a statement is a
data value whose type depends on the property used for the statement. If
the main value cannot be evaluated (for instance because one of the
cells it depends on is empty), then the entire statement will be
skipped.
See the [data values](#data-values) section for more details
about how to specify each type of data value and when they are skipped.
### Qualifiers {#qualifiers}
Qualifiers can be added on each statement. When their values are
skipped, only the qualifier will be discarded: the rest of the statement
will still be added.
### References {#references}
References can (and should) be added to back each statement. If values
inside the reference are skipped, the corresponding part of the
reference will be discarded but the reference will still be added
(unless the reference becomes empty).
### Ranks {#ranks}
All statement ranks are set to **Normal**. It is currently not possible
to set a different rank.
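
The skipping rules for main values, qualifiers and references can be summarised in a short sketch. Here `None` stands for a value that could not be evaluated; `build_statement` and the dictionary layout are assumptions made for this example, not OpenRefine's data model.

```python
from typing import Dict, List, Optional

Value = Optional[str]  # None means "this value was skipped"


def build_statement(
    prop: str,
    main_value: Value,
    qualifiers: Dict[str, Value],
    references: List[Dict[str, Value]],
) -> Optional[dict]:
    """Assemble one statement, or return None if it must be skipped."""
    if main_value is None:
        return None  # no main value: the entire statement is skipped

    # A skipped qualifier is dropped; the rest of the statement survives.
    kept_qualifiers = {p: v for p, v in qualifiers.items() if v is not None}

    kept_references = []
    for ref in references:
        parts = {p: v for p, v in ref.items() if v is not None}
        if parts:  # a reference left with no parts is not added
            kept_references.append(parts)

    return {
        "property": prop,
        "value": main_value,
        "qualifiers": kept_qualifiers,
        "references": kept_references,
        "rank": "normal",  # the only rank currently available
    }
```
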
## Data values {#data-values}
Data values are the data that you can find as the target of a statement (or
qualifier, or part of a reference). Each property dictates a particular
type of data value. In each case, OpenRefine uses a particular process
to translate cell contents to a data value of the appropriate type. This
section explains the process for all data types.
### Items {#items-1}
Items are evaluated in the same way as the subjects of items in the
schema. They can be input directly using the auto-suggest service
provided, or any column reconciled against Wikidata can be used. Refer to
[the first Items section](#items) to see how they are
evaluated.
### Strings and external identifiers {#strings-and-external-identifiers}
Bare strings and external identifiers can be input directly as constants
(if they do not change across rows) or using any column. If a reconciled
column is used for a string value, it is the value of the cell that is
going to be used, not the name of the reconciled item (which is what
OpenRefine displays). Values are skipped when the column is blank or
null.
### Monolingual texts {#monolingual-texts}
Monolingual texts consist of two parts:
- the language: see [Languages](#languages) for their
structure;
- the value of the text: see [the section above](#strings-and-external-identifiers).
A monolingual text is skipped when any of its parts is skipped (that is,
if the language or the text is invalid).
### Dates {#dates}
Dates are parsed from cell contents (or from any constant provided in
the schema) and the precision of the date is inferred from its format.
Here are the valid formats:
- `YYYYM`, such as `2001M` (millennium precision)
- `YYYYC`, such as `1901C` (century precision)
- `YYYYD`, such as `1981D` (decade precision)
- `YYYY`, such as `1984` (year precision)
- `YYYY-MM`, such as `2019-03` (month precision)
- `YYYY-MM-DD`, such as `1897-08-14` (day precision)
Any value that does not match any of these formats will be ignored. All
dates are represented in UTC, Gregorian calendar.
In OpenRefine 3.3, the following new formats have been introduced:
- `TODAY` returns today's date with day precision. This will be
evaluated when performing the edits (or exporting to
QuickStatements);
- `YYYY-MM-DD_QID` can be used to specify a date in a particular
calendar (such as the [proleptic Julian calendar (Q1985786)](https://www.wikidata.org/wiki/Q1985786)).
In OpenRefine 3.5, the following new format has been introduced:
- `-234` represents the year 234 [BCE](https://en.wikipedia.org/wiki/Common_Era)
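
To make the format-to-precision mapping concrete, here is a minimal sketch that infers the precision from the shape of the string. The regular expressions are a reading of the formats listed above, not OpenRefine's actual parser; in particular, treating any `-` followed by digits as a BCE year is an assumption, and the sketch does not check that months and days are in range.

```python
import re
from typing import Optional

# (pattern, precision) pairs, derived from the formats listed above.
_FORMATS = [
    (re.compile(r"\d{4}M"), "millennium"),
    (re.compile(r"\d{4}C"), "century"),
    (re.compile(r"\d{4}D"), "decade"),
    (re.compile(r"\d{4}"), "year"),
    (re.compile(r"-\d+"), "year"),                    # BCE year (assumed shape)
    (re.compile(r"\d{4}-\d{2}"), "month"),
    (re.compile(r"\d{4}-\d{2}-\d{2}"), "day"),
    (re.compile(r"\d{4}-\d{2}-\d{2}_Q\d+"), "day"),   # explicit calendar item
    (re.compile(r"TODAY"), "day"),
]


def date_precision(text: str) -> Optional[str]:
    """Return the inferred precision, or None if the value would be ignored."""
    for pattern, precision in _FORMATS:
        if pattern.fullmatch(text):
            return precision
    return None
```

A value such as `14/08/1897` matches none of the patterns, so the function returns `None`, corresponding to a date that is ignored.
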
### Quantities {#quantities}
Quantities consist of two parts: the amount and the unit.
- the amount is mandatory and must be a string, such as `18,229.1020`.
The precision that is displayed will be respected (the same number
of trailing zeros will be shown in Wikidata). By default, no upper
and lower bounds will be set. To define these, one needs to use the
engineering notation, such as `3.45E+3`, which will be interpreted
as `3,450±5`. As usual, the amount can be provided as a constant or
as a column variable. In the latter case, the values in the column
must be strings.
- the unit is optional. It is an item, so it can be provided either
with the auto-suggest dialog or as a reconciled column. It is
important to note that if a reconciled column is used, any
unreconciled cells will discard the entire quantity value. So a
template for a quantity value is either always unit-less, or always
has a unit.
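
The way bounds follow from the notation can be illustrated with Python's `decimal` module, which keeps trailing zeros just as described above. This is a sketch of the arithmetic only: `parse_amount` is a hypothetical helper, it expects amounts without a thousands separator, and it assumes the bound is half a unit of the last written digit, which is consistent with the `3,450±5` example.

```python
from decimal import Decimal
from typing import Optional, Tuple


def parse_amount(text: str) -> Tuple[Decimal, Optional[Decimal]]:
    """Return (amount, half-width of the bounds or None if no bounds are set)."""
    amount = Decimal(text)  # Decimal preserves trailing zeros from the string
    if "e" not in text.lower():
        return amount, None  # plain notation: no upper and lower bounds
    # Place value of the last written digit, e.g. 10 for 3.45E+3.
    last_digit_unit = Decimal(1).scaleb(amount.as_tuple().exponent)
    return amount, last_digit_unit / 2
```

Here `parse_amount("3.45E+3")` gives an amount equal to 3450 with a half-width of 5, while `parse_amount("18229.1020")` keeps all four decimal places and sets no bounds.
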
### Globe coordinates {#globe-coordinates}
Geographic coordinates are specified as strings with the following
formats, where all components are floating point numbers in degrees:
- `latitude,longitude` for a default precision of ten microdegrees
(for instance,
[`49.265278,4.028611`](https://tools.wmflabs.org/geohack/geohack.php?params=49.265277777778_N_4.0286111111111_E_globe:earth&language=en)
can be used to indicate the position of Reims, France);
- `latitude,longitude,precision` when specifying an explicit precision
(for instance, `49.265278,4.028611,0.1` can be used to indicate the
position of Reims within a tenth of a degree).
All globe coordinates are on Earth ([Q2](https://www.wikidata.org/wiki/Q2)).
If your coordinates are in a different format, such as
`49° 15′ 55″ N, 4° 1′ 43″ E`, you will need to convert them to decimal
format first.
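
One possible way to do this conversion outside OpenRefine is sketched below in Python. The `dms_to_decimal` function and its regular expression are assumptions written for this example; they only cover coordinates written as degrees, minutes and seconds followed by a hemisphere letter.

```python
import re

# degrees ° minutes ′ seconds ″ hemisphere; ASCII ' and " are accepted too
_DMS = re.compile(r"""(\d+)°\s*(\d+)[′']\s*(\d+(?:\.\d+)?)[″"]\s*([NSEW])""")


def dms_to_decimal(text: str) -> str:
    """Convert a 'D° M′ S″ N, D° M′ S″ E' string to 'latitude,longitude'."""
    values = {}
    for deg, minutes, seconds, hemi in _DMS.findall(text):
        value = int(deg) + int(minutes) / 60 + float(seconds) / 3600
        if hemi in "SW":
            value = -value  # southern and western hemispheres are negative
        values["lat" if hemi in "NS" else "lon"] = value
    if set(values) != {"lat", "lon"}:
        raise ValueError(f"cannot parse coordinates: {text!r}")
    # Six decimal places, in line with the default precision mentioned above.
    return f"{values['lat']:.6f},{values['lon']:.6f}"
```

Applied to `49° 15′ 55″ N, 4° 1′ 43″ E`, this returns `49.265278,4.028611`, the decimal form used for Reims earlier in this section.
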
### Media on Commons {#media-on-commons}
Media files on Wikimedia Commons are treated as strings, whose values must
exactly match filenames on Commons. These values are not checked during
schema evaluation: if they are wrong, uploading the statements will
fail.
Tabular data and Geoshapes must be prefixed with the `Data:` namespace.
This is indicated by the placeholder in the field that appears when
constructing the schema.
### Properties {#properties}
Properties are always constants: there is currently no way to reconcile
a column against properties. They have to be selected with the
auto-suggest dialog.
### Other data types {#other-data-types}
URLs, mathematical expressions and other textual datatypes are supported
and treated as strings. At the time of writing, all datatypes supported
by Wikidata are supported by OpenRefine.