RandomSec/OpenRefine/docs/versioned_docs/version-3.4/manual/exploring.md
2022-01-30 23:08:52 +01:00

9.2 KiB
Raw Blame History

id title sidebar_label
exploring Exploring data Overview

Overview

OpenRefine offers lots of features to help you learn about your dataset, even if you dont change a single character. In this section we cover different ways for sorting through, filtering, and viewing your data.

Unlike spreadsheets, OpenRefine doesnt store formulas and display the output of those calculations; it only shows the value inside each cell. It doesnt support cell colors or text formatting.

Data types

Each piece of information (each cell) in OpenRefine is assigned a data type. Some file formats, when imported, can set data types that are recognized by OpenRefine. Cells without an associated data type on import will be considered a “string” at first, but you can have OpenRefine convert cell contents into other data types later. This is set at the cell level, not at the column level.

You can see data types in action when you preview a new project: check the box next to Attempt to parse cell text into numbers, and cells will be converted to the “number” data type based on their contents. Youll see numbers change from black text to green if they are recognized.

The data type will determine what you can do with the value. For example, if you want to add two values together, they must both be recognized as the number type.

You can check data types at any time by:

  • clicking “edit” on a single cell (where you can also edit the type)
  • creating a Custom Text Facet on a column, and inserting type(value) into the Expression field. This will generate the data type in the preview, and you can facet by data type if you press OK.

The data types supported are:

  • string (one or more text characters)
  • number (one or more characters of numbers only)
  • boolean (values of “true” or “false”)
  • date (ISO-8601-compliant extended format with time in UTC: YYYY-MM-DDTHH:MM:SSZ)

OpenRefine recognizes two further data types as a result of its own processes:

  • error
  • null

An “error” data type is created when the cell is storing an error generated during a transformation in OpenRefine.

A “null” data type is a special type that means “this cell has no value.” Its distinct from cells that have values such as “0” or “false”, or cells that look empty but have whitespace in them, or cells that contain empty strings. When you use type(value), it will show you that the cells value is “null” and its type is “undefined.” You can opt to show “null” values, by going to AllViewShow/Hide null values in cells.

Changing a cell's data type is not the same operation as transforming its contents. For example, using a column-wide transform such as TransformCommon transformsTo date may not convert all values successfully, but going to an individual cell, clicking “edit”, and changing the data type can successfully convert text to a date. These operations use different underlying code. Learn more about date formatting and transformations in the next section.

To transform data from one type to another, see Transforming data for information on using common tranforms, and see Expressions for information on using toString(), toDate(), and other functions.

Dates

A “date” type is created when a column is transformed into dates, when an expression is used to convert cells to dates or when individual cells are set to have the data type “date”.

Date-formatted data in OpenRefine relies on a number of conversion tools and standards. For something to be considered a date in OpenRefine, it will be converted into the ISO-8601-compliant extended format with time in UTC: YYYY-MM-DDTHH:MM:SSZ.

When you run Edit cellsCommon transformsTo date, the following column of strings on the left will transform into the values on the right:

Input Output
23/12/2019 2019-12-23T00:00:00Z
14-10-2015 2015-10-14T00:00:00Z
2012 02 16 2012-02-16T00:00:00Z
August 2nd 1964 1964-08-02T00:00:00Z
today today
never never

OpenRefine uses a variety of tools to recognize, convert, and format dates and so some of the values above can be reformatted using other methods. In this case, clicking the “today” cell and editing its data type manually will convert “today” into a value such as “2020-08-14T00:00:00Z”. Attempting the same data-type change on “never” will give you an error message and refuse to proceed.

You can do more precise conversion and formatting using expressions and arguments based on the state of your data: see the GREL functions reference section on Date functions for more help.

You can convert dates into a more human-readable format when you export your data using the custom tabular exporter. You are given the option to keep your dates in the ISO 8601 format, to output short, medium, long, or full locale formats, or to specify a custom format. This means that you can format your dates into, for example, MM/DD/YY (the US short standard) with or without including the time, after working with ISO-8601-formatted dates in your project.

The following table shows some example date and time formatting styles for the U.S. and French locales:

Style U.S. Locale French Locale
Default Jun 30, 2009 7:03:47 AM 30 juin 2009 07:03:47
Short 6/30/09 7:03 AM 30/06/09 07:03
Medium Jun 30, 2009 7:03:47 AM 30 juin 2009 07:03:47
Long June 30, 2009 7:03:47 AM PDT 30 juin 2009 07:03:47 PDT
Full Tuesday, June 30, 2009 7:03:47 AM PDT mardi 30 juin 2009 07 h 03 PDT

Rows vs. records

A row is a simple way to organize data: a series of cells, one cell per column. Sometimes there are multiple pieces of information in one cell, such as when a survey respondent can select more than one response.

In cases where there is more than one value for a single column in one or more rows, you may wish to use OpenRefines records mode: this defines a single record as potentially containing more than one row. From there you can transform cells into multiple rows, each cell containing one value youd like to work with.

Generally, when you import some data, OpenRefine reads that data in row mode. From the project screen, you can convert the project into records mode. OpenRefine remembers this action and will present you with records mode each time you open the project from then on.

OpenRefine understands records based on the content of the first column, what we call the “key column.” Splitting a row into a multi-row record will base all association on the first column in your dataset.

If you have more than one column to split out into multiple rows, OpenRefine will keep your data associated with its original record, and associate subgroups based on the top-most row in each group.

You can imagine the structure as a tree with many branches, all leading back to the same trunk.

For example, your key column may be a film or television show, with multiple cast members identified by name, associated to that work. You may have one or more roles listed for each person. The roles are linked to the actors, which are linked to the title.

Work Actor Role
The Wizard of Oz Judy Garland Dorothy Gale
Ray Bolger "Hunk"
The Scarecrow
Jack Haley "Hickory"
The Tin Man
Bert Lahr "Zeke"
The Cowardly Lion
Frank Morgan Professor Marvel
The Gatekeeper
The Carriage Driver
The Guard
The Wizard of Oz
Margaret Hamilton Miss Almira Gulch
The Wicked Witch of the West

Once you are in records mode, you can still move some columns around, but if you move a column to the beginning, you may find your data becomes misaligned. The new key column will sort into records based on empty cells, and values in the old key column will be assigned to the last row in the old record (the key value sitting above those values).

OpenRefine assigns a unique key behind the scenes, so your records dont need a unique identifier in the key column. You can keep track of which rows are assigned to each record by the record number that appears under the All column.

To split multi-valued cells and apply other operations that take advantage of records mode, see Transforming data.

Be careful when in records mode that you do not accidentally delete rows based on being blank in one column where there is a value in another.