Flesh out the documentation for the key/value columnize operation.

This is just to get some agreement on the sort of documentation content we are expecting.
Related to #1215.
This commit is contained in:
Antonin Delpeuch 2020-04-09 14:42:49 +02:00
parent 195c1cb31b
commit 3d69602003
5 changed files with 180 additions and 2 deletions

View File

@ -3,7 +3,8 @@
"Getting Started": [
"index",
"install",
"user_data"
"user_data",
"records_mode"
],
"Importing Data": [
"importers/csv",
@ -13,7 +14,10 @@
"Operations": [
"operations/transform",
"operations/add_column",
"operations/recon"
"operations/fill_down",
"operations/blank_down",
"operations/recon",
"operations/key_value_columnize"
],
"Facets": [
"facets/text",

View File

@ -0,0 +1,7 @@
---
id: blank_down
title: Blank down
sidebar_label: Blank down
---

View File

@ -0,0 +1,7 @@
---
id: fill_down
title: Fill down
sidebar_label: Fill down
---

View File

@ -0,0 +1,152 @@
---
id: key_value_columnize
title: Columnize by key/value columns
sidebar_label: Columnize by key/value
---
This operation can be used to reshape a table which contains *key* and *value* columns, such that the contents of the key column become new column names, and the contents of the value column are spread in the new columns. This operation can be invoked from
any column menu, via **Transpose****Columnize by key/value columns**.
Overview
--------
Consider the following table:
| Field | Data |
|---------|-----------------------|
| Name | Galanthus nivalis |
| Color | White |
| IUCN ID | 162168 |
| Name | Narcissus cyclamineus |
| Color | Yellow |
| IUCN ID | 161899 |
In this format, each flower species is described by multiple attributes, which are spread on consecutive rows.
In this example, the "Field" column contains the keys and the "Data" column contains the values. With
this configuration, the *Columnize by key/value columns* operations transforms this table as follows:
| Name | Color | IUCN ID |
|-----------------------|----------|---------|
| Galanthus nivalis | White | 162168 |
| Narcissus cyclamineus | Yellow | 161899 |
Entries with multiple values in the same column
-----------------------------------------------
If an entry has multiple values for a given key, then these values will be grouped on consecutive rows,
to form a [record structure](../records_mode).
For instance, flower species can have multiple colors:
| Field | Data |
|-------------|-----------------------|
| Name | Galanthus nivalis |
| **Color** | **White** |
| **Color** | **Green** |
| IUCN ID | 162168 |
| Name | Narcissus cyclamineus |
| Color | Yellow |
| IUCN ID | 161899 |
This table is transformed by the operation as follows:
| Name | Color | IUCN ID |
|-----------------------|----------|---------|
| Galanthus nivalis | White | 162168 |
| | Green | |
| Narcissus cyclamineus | Yellow | 161899 |
The first key encountered by the operation serves as record key.
The "Green" value is attached to the "Galanthus nivalis" name because it the latest record key encountered by the operation as it scans the table. See the [Row order](#row-order) section for more details about the influence of row order on
the results of the operation.
Notes column
------------
In addition to the key and value columns, a *notes* column can be used optionally. This can be used
to store extra metadata associated to a key/value pair.
Consider the following example:
| Field | Data | Source |
|---------|-----------------------|-----------------------|
| Name | Galanthus nivalis | IUCN |
| Color | White | Contributed by Martha |
| IUCN ID | 162168 | |
| Name | Narcissus cyclamineus | Legacy |
| Color | Yellow | 2009 survey |
| IUCN ID | 161899 | |
If the "Source" column is selected as notes column, this table is transformed to:
| Name | Color | IUCN ID | Source : Name | Source : Color |
|-----------------------|----------|---------|---------------|-----------------------|
| Galanthus nivalis | White | 162168 | IUCN | Contributed by Martha |
| Narcissus cyclamineus | Yellow | 161899 | Legacy | 2009 survey |
Notes columns can therefore be used to preserve provenance or other context about a particular key/value pair.
Extra columns
-------------
If the table contains extra columns, which are not used as key, value or notes columns, they can be preserved
by the operation. For this to work, they must have the same value in all old rows corresponding to a new row.
Consider for instance the following table, where the "Field" and "Data" columns are used as key and value columns
respectively, and the "Wikidata ID" column is not selected:
| Field | Data | Wikidata ID |
|---------|-----------------------|-------------|
| Name | Galanthus nivalis | Q109995 |
| Color | White | Q109995 |
| IUCN ID | 162168 | Q109995 |
| Name | Narcissus cyclamineus | Q1727024 |
| Color | Yellow | Q1727024 |
| IUCN ID | 161899 | Q1727024 |
This will be transformed to
| Wikidata ID | Name | Color | IUCN ID |
|-------------|-----------------------|----------|---------|
| Q109995 | Galanthus nivalis | White | 162168 |
| Q1727024 | Narcissus cyclamineus | Yellow | 161899 |
If extra columns do not contain identical values for all old rows spanning an entry, this can
be fixed beforehand by using the [fill down operation](fill_down).
Row order
---------
In the absence of extra columns, it is important to note that the order in which
the key/value pairs appear matters. Specifically, the operation will use the first key it encounters as delimiter for entries:
every time it encounters this key again, it will produce a new row and add the following other key/value pairs to that row.
Consider for instance the following table:
| Field | Data |
|----------|-----------------------|
| **Name** | Galanthus nivalis |
| Color | White |
| IUCN ID | 162168 |
| **Name** | Crinum variabile |
| **Name** | Narcissus cyclamineus |
| Color | Yellow |
| IUCN ID | 161899 |
The occurrences of the "Name" value in the "Field" column define the boundaries of the entries. Because there is
no other row between the "Crinum variabile" and the "Narcissus cyclamineus" rows, the "Color" and "IUCN ID" columns
for the "Crinum variabile" entry will be empty:
| Name | Color | IUCN ID |
|-----------------------|----------|---------|
| Galanthus nivalis | White | 162168 |
| Crinum variabile | | |
| Narcissus cyclamineus | Yellow | 161899 |
This sensitivity to order is removed if there are extra columns: in that case, the first extra column will serve as root identifier
for the entries.
Behaviour in records mode
-------------------------
In records mode, this operation behaves just like in rows mode, except that any facets applied to it will be interpreted in records mode.

8
docs/src/records_mode.md Normal file
View File

@ -0,0 +1,8 @@
---
id: records_mode
title: The records mode
sidebar_label: Records mode
---