RandomSec/OpenRefine/docs/versioned_docs/version-3.5/manual/wikibase/schema-alignment.md
2022-01-04 16:31:32 +01:00

Schema alignment

A Wikibase schema is a template of Wikidata edits that is applied to each row in the project. This page describes how each part of this template works, and how it generates edits depending on the contents of the table cells.

Items

An item in the schema represents a set of changes on a particular Wikidata item, generated by a single row. This item can contain changes in terms (labels, descriptions and aliases) or statements.

It is possible to make edits on different items for each row of your table: just add multiple items in your schema. Each item has a subject, which can either be entered manually (when the item on which the edits should be made is the same for all rows) or be taken from any reconciled column dropped in this field. In the latter case, the edits will depend on the reconciliation status of each cell:

  • If the cell is matched to an item, edits will be made on that item;
  • If the cell is marked as corresponding to a new item, a new item will be created for it. See New items for more details about how this works;
  • If the cell has reconciliation candidates but has not been matched to any of them, the edit will be skipped (even if there is only one candidate with a high reconciliation score);
  • If the cell is not reconciled or blank, the edit will be skipped.
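
The four cases above can be summarized as a small decision function. This is only an illustrative sketch in Python: the `Recon` enum and `subject_action` are hypothetical names, not part of OpenRefine.

```python
from enum import Enum
from typing import Optional


class Recon(Enum):
    """Hypothetical labels for the reconciliation status of a cell."""
    MATCHED = "matched to an item"
    NEW = "marked as a new item"
    CANDIDATES_ONLY = "has candidates but no match"
    NONE = "not reconciled or blank"


def subject_action(status: Recon, matched_id: Optional[str] = None) -> str:
    """Return what happens to the edits generated for one row."""
    if status is Recon.MATCHED:
        return f"edit {matched_id}"
    if status is Recon.NEW:
        return "create new item"
    # Unmatched candidates are never picked automatically,
    # whatever their score; unreconciled and blank cells are skipped too.
    return "skip"
```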

Do not worry about the ordering of items in the schema or the order of your rows, as OpenRefine will rearrange your edits to optimize their upload. If your project makes edits on the same item across multiple rows, these edits will be merged together and performed in one edit. See Uploading your changes for details.

Terms

Terms are the language-specific strings that you find at the top of Wikidata items: labels, descriptions and aliases. OpenRefine lets you edit these terms via the Wikidata schema.

Languages

Each term belongs to a particular language. Wikidata supports hundreds of languages, which are designated by language codes. For each term that you want to add to an item, you will need to specify the language for this term. There are two cases:

  • Either the language is constant across your dataset: you know that all the names in a given column are spelled in the same language. In this case, type the name of the language in the input and select the language in the drop-down suggestion dialog. This will place the appropriate language code in the input.
  • Or the language varies across your dataset. In this case, you need to provide a column of Wikimedia language codes that indicates the language for each term that you want to add. Just drag and drop this column to the language field. If there are any invalid language codes in this column, the corresponding terms will be ignored. OpenRefine will translate any deprecated language codes to their preferred values silently.

Labels

Wikidata items can have at most one label per language, so you need to choose whether to override any existing label (default behaviour before 3.2) or only insert your label if there is no such label in the given language (default behaviour starting from 3.2). When the content of the cell providing the label is blank, nothing will be changed (so it is not possible to remove labels).

Descriptions

Descriptions work like labels: there is at most one description per language, and OpenRefine can override existing descriptions or leave them unchanged. It is not possible to remove descriptions either.

Aliases

Aliases are added to the list of existing aliases in the given language. When adding an alias in a language where no label has been added yet, the alias is automatically promoted to a label for this language. It is not possible to remove aliases or to override any existing aliases.
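
The rules for labels, descriptions and aliases can be sketched as follows. This is a simplified model, not OpenRefine's implementation: the item is a plain dictionary, `apply_terms` is a hypothetical name, and the sketch assumes that alias promotion applies when the item has no label in that language.

```python
def apply_terms(item, lang, label="", description="", aliases=(), override=False):
    """Apply one row's terms in one language to a dict-based item model."""
    labels = item.setdefault("labels", {})
    descriptions = item.setdefault("descriptions", {})
    alias_map = item.setdefault("aliases", {})

    # Blank cells change nothing, so terms can never be removed this way.
    if label and (override or lang not in labels):
        labels[lang] = label
    if description and (override or lang not in descriptions):
        descriptions[lang] = description

    for alias in aliases:
        if not alias:
            continue
        if lang not in labels:
            # No label in this language yet: the alias becomes the label.
            labels[lang] = alias
        else:
            # Existing aliases are kept; new ones are appended.
            alias_map.setdefault(lang, []).append(alias)
    return item
```

With `override=False`, an existing label or description is left untouched; with `override=True`, it is replaced by the non-blank value from the row.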

Statements

You can add statements in the schema: this will generate new statements on the corresponding items. These statements will be merged with any existing statements on the actual Wikidata items, and this merging process depends on the upload medium. More control over the merging strategy is planned for a future release.

Main values

Statements must have main values: "novalue" or "somevalue" statements are not supported yet. The main value of a statement is a data value whose type depends on the property used for the statement. If the main value cannot be evaluated (for instance because one of the cells it depends on is empty), then the entire statement will be skipped.

See the data values section for more details about how to specify each type of data value and when they are skipped.

Qualifiers

Qualifiers can be added on each statement. When their values are skipped, only the qualifier will be discarded: the rest of the statement will still be added.

References

References can (and should) be added to back each statement. If values inside the reference are skipped, the corresponding part of the reference will be discarded but the reference will still be added (unless the reference becomes empty).
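
The skipping rules for main values, qualifiers and references can be illustrated with a short sketch. It assumes a skipped value is represented by `None` and that qualifiers and reference parts are `(property, value)` pairs; `build_statement` is a hypothetical helper, not an OpenRefine function.

```python
def build_statement(main_value, qualifiers=(), references=()):
    """Return the statement to upload, or None if it is skipped entirely."""
    if main_value is None:
        # A skipped main value discards the whole statement.
        return None

    # A skipped qualifier is dropped on its own.
    kept_qualifiers = [(prop, val) for prop, val in qualifiers if val is not None]

    kept_references = []
    for reference in references:
        parts = [(prop, val) for prop, val in reference if val is not None]
        if parts:
            # A reference survives as long as one part remains.
            kept_references.append(parts)

    return {
        "value": main_value,
        "qualifiers": kept_qualifiers,
        "references": kept_references,
    }
```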

Ranks

All statement ranks are set to Normal. It is currently not possible to set a different rank.

Data values

Data values are the data that you can find as the target of a statement (or qualifier, or part of a reference). Each property dictates a particular type of data value. In each case, OpenRefine uses a particular process to translate cell contents to a data value of the appropriate type. This section explains the process for all data types.

Items

Items are evaluated in the same way as the subjects of items in the schema. They can be input directly using the auto-suggest service provided, or any column reconciled against Wikidata can be used. Refer to the first Items section to see how they are evaluated.

Strings and external identifiers

Bare strings and external identifiers can be input directly as constants (if they do not change across rows) or using any column. If a reconciled column is used for a string value, it is the value of the cell that is going to be used, not the name of the reconciled item (which is what OpenRefine displays). Values are skipped when the column is blank or null.

Monolingual texts

Monolingual texts consist of two parts: a language and a text.

A monolingual text is skipped when any of its parts is skipped (that is, if the language or the text are invalid).

Dates

Dates are parsed from cell contents (or from any constant provided in the schema) and the precision of the date is inferred from its format. Here are the valid formats:

  • YYYYM, such as 2001M (millennium precision)
  • YYYYC, such as 1901C (century precision)
  • YYYYD, such as 1981D (decade precision)
  • YYYY, such as 1984 (year precision)
  • YYYY-MM, such as 2019-03 (month precision)
  • YYYY-MM-DD, such as 1897-08-14 (day precision)

Any value that does not match any of these formats will be ignored. All dates are represented in UTC, Gregorian calendar.

In OpenRefine 3.3, the following new formats have been introduced:

  • TODAY returns today's date with day precision. This will be evaluated when performing the edits (or exporting to QuickStatements);
  • YYYY-MM-DD_QID can be used to specify a date in a particular calendar (such as the proleptic Julian calendar (Q1985786)).

In OpenRefine 3.5, the following new format has been introduced:

  • -234 represents the year 234 BCE
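
To check a column before building the schema, the formats listed in this section can be approximated with regular expressions. The sketch below is not OpenRefine's parser: `parse_date` is a hypothetical helper, it assumes four-digit years everywhere except the bare (possibly negative) year format, and it does not validate month or day ranges.

```python
import re
from datetime import date

# (pattern, precision) pairs; the captured group is the date part.
_FORMATS = [
    (re.compile(r"(\d{4})M"), "millennium"),
    (re.compile(r"(\d{4})C"), "century"),
    (re.compile(r"(\d{4})D"), "decade"),
    (re.compile(r"(-?\d+)"), "year"),  # also covers -234 (234 BCE)
    (re.compile(r"(\d{4}-\d{2})"), "month"),
    (re.compile(r"(\d{4}-\d{2}-\d{2})"), "day"),
]
_WITH_CALENDAR = re.compile(r"(\d{4}-\d{2}-\d{2})_(Q\d+)")


def parse_date(text, today=None):
    """Return (value, precision, calendar_qid) or None if the value is ignored."""
    text = text.strip()
    if text == "TODAY":
        return ((today or date.today()).isoformat(), "day", None)
    match = _WITH_CALENDAR.fullmatch(text)
    if match:
        return (match.group(1), "day", match.group(2))
    for pattern, precision in _FORMATS:
        match = pattern.fullmatch(text)
        if match:
            return (match.group(1), precision, None)
    return None  # no format matched
```

A return value of `None` corresponds to a cell whose date would be ignored, which makes it easy to list the rows that need cleaning first.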

Quantities

Quantities consist of two parts: the amount and the unit.

  • the amount is mandatory and must be a string, such as 18,229.1020. The precision that is displayed will be respected (the same number of trailing zeros will be shown in Wikidata). By default, no upper and lower bounds will be set. To define these, one needs to use the engineering notation, such as 3.45E+3, which will be interpreted as 3,450±5. As usual, the amount can be provided as a constant or as a column variable. In the latter case, the values in the column must be strings.
  • the unit is optional. It is an item, so it can be provided either with the auto-suggest dialog or as a reconciled column. It is important to note that if a reconciled column is used, any unreconciled cells will discard the entire quantity value. So a template for a quantity value is either always unit-less, or always has a unit.
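
The way bounds follow from the notation can be worked through with Python's `decimal` module. This is a sketch of the arithmetic only, under the assumption that the bounds are half a unit of the last written digit, which is consistent with the 3.45E+3 example above; `parse_amount` is a hypothetical name and the sketch does not handle thousands separators.

```python
from decimal import Decimal


def parse_amount(text):
    """Return (amount, lower_bound, upper_bound); bounds are None when unset."""
    amount = Decimal(text)  # keeps trailing zeros, e.g. 18229.1020
    if "e" not in text.lower():
        # Plain decimal notation: no bounds are set.
        return amount, None, None
    # The exponent of the last written digit gives the uncertainty:
    # 3.45E+3 is 345 x 10^1, so half a unit of the last digit is 5.
    half_unit = Decimal(5).scaleb(amount.as_tuple().exponent - 1)
    return amount, amount - half_unit, amount + half_unit
```

For instance, `parse_amount("3.45E+3")` gives the amount 3450 with bounds 3445 and 3455, while `parse_amount("18229.1020")` keeps all four decimals and sets no bounds.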

Globe coordinates

Geographic coordinates are specified as strings with the following formats, where all components are floating point numbers in degrees:

  • latitude,longitude for a default precision of ten microdegrees (for instance, 49.265278,4.028611 can be used to indicate the position of Reims, France).

  • latitude,longitude,precision when specifying an explicit precision (for instance, 49.265278,4.028611,0.1 can be used to indicate the position of Reims within a tenth of a degree).
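
As an illustration of the two formats, the following sketch splits a coordinate string and fills in the default precision. `parse_coordinates` is a hypothetical helper, not OpenRefine code, and it does not check that the latitude and longitude fall in a valid range.

```python
def parse_coordinates(text, default_precision=0.00001):
    """Return (latitude, longitude, precision) in degrees, or None if invalid."""
    parts = [part.strip() for part in text.split(",")]
    if len(parts) not in (2, 3):
        return None
    try:
        numbers = [float(part) for part in parts]
    except ValueError:
        return None
    # Ten microdegrees (0.00001 degrees) unless a third component is given.
    precision = numbers[2] if len(numbers) == 3 else default_precision
    return numbers[0], numbers[1], precision
```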

All globe coordinates are on Earth (Q2).

If your coordinates are in a different format, such as 49° 15′ 55″ N, 4° 1′ 43″ E, you will need to convert them to decimal format first.
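
One possible way to do this conversion outside OpenRefine is sketched below in Python. The function names are hypothetical, and the sketch assumes each half of the string contains up to three numbers (degrees, minutes, seconds) followed by a hemisphere letter, with a comma between latitude and longitude.

```python
import re


def dms_to_decimal(text):
    """Convert one degrees/minutes/seconds value with hemisphere to degrees."""
    numbers = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", text)]
    hemisphere = re.search(r"[NSEW]", text.upper())
    if not numbers or len(numbers) > 3 or hemisphere is None:
        raise ValueError(f"cannot parse coordinate: {text!r}")
    # Missing minutes or seconds count as zero.
    degrees, minutes, seconds = (numbers + [0.0, 0.0])[:3]
    value = degrees + minutes / 60 + seconds / 3600
    # South and west are negative in decimal notation.
    return -value if hemisphere.group() in "SW" else value


def dms_pair_to_decimal(text):
    """Turn 'lat DMS, lon DMS' into the 'latitude,longitude' string format."""
    lat_text, lon_text = text.split(",")
    return f"{dms_to_decimal(lat_text):.6f},{dms_to_decimal(lon_text):.6f}"
```

Applied to the example above, `dms_pair_to_decimal("49° 15′ 55″ N, 4° 1′ 43″ E")` returns `49.265278,4.028611`, the decimal string used for Reims earlier in this section.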

Media on Commons

Media on Wikimedia Commons is treated like strings, whose values must exactly match filenames on Commons. These values are not checked during schema evaluations: if they are wrong, uploading the statements will fail.

Tabular data and Geoshapes must be prefixed with the Data: namespace. This is indicated by the placeholder in the field that appears when constructing the schema.

Properties

Properties are always constants: there is currently no way to reconcile a column against properties. They have to be selected with the auto-suggest dialog.

Other data types

URLs, mathematical expressions and other textual datatypes are supported and treated as strings. At the time of writing, all datatypes supported by Wikidata are supported by OpenRefine.