Links, HTML, small rewrites
This commit is contained in:
parent
699e80b94e
commit
5c13397965
@ -9,15 +9,17 @@ OpenRefine offers a number of features to edit and improve the contents of cells
|
||||
|
||||
One way of doing this is editing through a [text facet](facets#text-facet). Once you have created a facet on a column, hover over the displayed results in the sidebar. Click on the small “edit” button that appears to the right of the facet, and type in a new value. This will apply to all the cells in the facet.
|
||||
|
||||
You can apply a text facet on numbers, boolean values, and dates, but if you edit a value it will be converted into the text [data type](exploring#data-types) (regardless of whether you edit a date into another correctly-formatted date, or a “true” value into “false”, etc.).
|
||||
|
||||
## Transform
|
||||
|
||||
Select “Edit cells” → “Transforms” to open up an expressions window. From here, you can apply [expressions](expressions) to your data. The simplest examples are GREL functions such as `toUppercase` or `toLowercase`, used in expressions as `toUppercase(value)` or `toLowercase(value)`. In all of these cases, `value` is the value in each cell in the selected column.
|
||||
Select <span class="menuItems">Edit cells</span> → <span class="menuItems">Transforms</span> to open up an expressions window. From here, you can apply [expressions](expressions) to your data. The simplest examples are GREL functions such as [`toUppercase()`](grelfunctions#touppercases) or [`toLowercase()`](grelfunctions#tolowercases), used in expressions as `toUppercase(value)` or `toLowercase(value)`. In these cases, `value` is the value in each cell in the selected column.
|
||||
|
||||
Use the preview to ensure your data is being transformed correctly.
|
||||
|
||||
You can also switch to the “History” tab inside the expressions window to reuse expressions you’ve already attempted in this project, whether they have been undone or not.
|
||||
You can also switch to the <span class="tabLabels">Undo / Redo</span> tab inside the expressions window to reuse expressions you’ve already attempted in this project, whether they have been undone or not.
|
||||
|
||||
OpenRefine offers you some frequently-used transformations in the next menu option, “Common transforms.” For more custom transforms, read up on [expressions](expressions).
|
||||
OpenRefine offers you some frequently-used transformations in the next menu option, <span class="menuItems">Common transforms</span>. For more custom transforms, read up on [expressions](expressions).
|
||||
|
||||
## Common transforms
|
||||
|
||||
@ -27,11 +29,11 @@ Often cell contents that should be identical, and look identical, are different
|
||||
|
||||
### Collapse consecutive whitespace
|
||||
|
||||
You may also find that some text cells contain what look like spaces but are actually tabs, or contain multiple spaces in a row. This function will remove all space characters that sit in sequence and replace them with a single space.
|
||||
You may find that some text cells contain what look like spaces but are actually tabs, or contain multiple spaces in a row. This function will remove all space characters that sit in sequence and replace them with a single space.
|
||||
|
||||
### Unescape HTML
|
||||
|
||||
Your data may come from an HTML-formatted source that expresses some characters through references (such as “&nbsp;” for a space, or “%u0107” for a ć) instead of the actual Unicode characters. You can use the “unescape HTML entities” transform to look for these codes and replace them with the characters they represent.
|
||||
Your data may come from an HTML-formatted source that expresses some characters through references (such as “&nbsp;” for a space, or “%u0107” for a ć) instead of the actual Unicode characters. You can use the “unescape HTML entities” transform to look for these codes and replace them with the characters they represent. For other formatting that needs to be escaped, try a custom transformation with [`escape()`](grelfunctions#escapes-s-mode).
|
||||
|
||||
### Replace smart quotes with ASCII
|
||||
|
||||
@ -39,28 +41,17 @@ Smart quotes (or curly quotes) recognize whether they come at the beginning or e
|
||||
|
||||
### Case transforms
|
||||
|
||||
You can transform an entire column of text into UPPERCASE, lowercase, or Title Case using these three options. This can be useful if you are planning to do textual analysis and wish to avoid case-sensitivity (which many functions are) causing problems in your analysis.
|
||||
You can transform an entire column of text into UPPERCASE, lowercase, or Title Case using these three options. This can be useful if you are planning to do textual analysis and want to keep case sensitivity (some functions are case-sensitive) from causing problems in your analysis. If appropriate, consider using a [custom facet](facets#custom-text-facet) to temporarily modify cases instead of this permanent operation.
|
||||
|
||||
### Data-type transforms
|
||||
|
||||
As detailed in [Data types](exploring#data-types), OpenRefine recognizes different data types: string, number, boolean, and date. When you use these transforms, OpenRefine will check to see if the given values can be converted, then both transform the data in the cells (such as “3” as a text string to “3” as a number) and convert the data type on each successfully transformed cell. Cells that cannot be transformed will output the original value and maintain their original data type.
|
||||
|
||||
For example, the following column of strings on the left will transform into the values on the right:
|
||||
:::caution
Be aware that dates may require manual intervention to transform successfully: see the section on [Dates](exploring#dates) for more information.
:::
|
||||
|
||||
|Input|>|Output|
|---|---|---|
|23/12/2019|>|2019-12-23T00:00:00Z|
|14-10-2015|>|2015-10-14T00:00:00Z|
|2012 02 16|>|2012-02-16T00:00:00Z|
|August 2nd 1964|>|1964-08-02T00:00:00Z|
|today|>|today|
|never|>|never|
|
||||
|
||||
This is based on OpenRefine’s ability to recognize dates with the [`toDate()` function](expressions#date-functions).
|
||||
|
||||
Clicking the “today” cell and editing its data type manually will convert “today” into a value such as “2020-08-14T00:00:00Z”. Attempting the same data-type change on “never” will give you an error message and refuse to proceed.
|
||||
|
||||
Because these common transforms do not offer the ability to output an error instead of the original cell contents, be careful to look for unconverted and untransformed values. You will see a yellow alert at the top of screen that will tell you how many cells were converted - if this number does not match your current row set, you will need to look for and manually correct the remaining cells.
|
||||
Because these common transforms do not offer the ability to output an error instead of the original cell contents, be careful to look for unconverted and untransformed values. You will see a yellow alert at the top of the screen that will tell you how many cells were converted - if this number does not match your current row set, you will need to look for and manually correct the remaining cells. Also consider faceting by data type, with the GREL function [`type()`](grelfunctions#typeo).
|
||||
|
||||
You can also convert cells into null values or empty strings. This can be useful if you wish to, for example, erase duplicates that you have identified and are analyzing as a subset.
|
||||
|
||||
@ -68,41 +59,45 @@ You can also convert cells into null values or empty strings. This can be useful
|
||||
|
||||
Fill down and blank down are two functions most frequently used when encountering data organized into [records](exploring#row-types-rows-vs-records) - that is, multiple rows associated with one specific entity.
|
||||
|
||||
If you receive information in rows mode and want to convert it to records mode, the easiest way is to sort your first column by the value that you want to use as a unique records key, [make that sorting permanent](transforming#edit-rows), then blank down all the duplicates in that column. OpenRefine will retain the first unique value and erase the rest. Then you can switch from “Show as rows” to “Show as records” and OpenRefine will convert the data based on the remaining values in the first column. Be careful that your data is sorted properly before you begin blanking down - not just the first column but other columns you may want to have in a certain order. For example, you may have multiple identical entries in the first column, one with a value in the second column and one with an empty cell in the second column. In this case you want the value to come first, so that you can clean up empty rows later, once you blank down.
|
||||
If you receive information in rows mode and want to convert it to records mode, the easiest way is to sort your first column by the value that you want to use as a unique records key, [make that sorting permanent](transforming#edit-rows), then blank down all the duplicates in that column. OpenRefine will retain the first unique value and erase the rest. Then you can switch from “Show as rows” to “Show as records” and OpenRefine will associate rows to each other based on the remaining values in the first column.
|
||||
|
||||
Be careful that your data is sorted properly before you begin blanking down - not just the first column but other columns you may want to have in a certain order. For example, you may have multiple identical entries in the first column, one with a value in the second column and one with an empty cell in the second column. In this case you want the row with the second-column value to come first, so that you can clean up empty rows later, once you blank down.
|
||||
|
||||
If, conversely, you’ve received data with empty cells because it was already in something akin to records mode, you can fill down information to the rest of the rows. This will duplicate whatever value exists in the topmost cell with a value: if the first row in the record is blank, it will take information from the next cell, or the cell after that, until it finds a value. The blank cells above this will remain blank.
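The blank down and fill down behavior described above can be sketched in a few lines of Python. This is only an illustration of the idea, not OpenRefine's implementation: the function names `blank_down` and `fill_down` are hypothetical, a column is modeled as a plain list with `None` standing in for a blank cell, and record boundaries are ignored.

```python
from typing import List, Optional

Cell = Optional[str]  # None stands in for a blank cell (an assumption of this sketch)


def blank_down(cells: List[Cell]) -> List[Cell]:
    """Keep the first value of each run of identical cells and blank the repeats."""
    result: List[Cell] = []
    last_kept: Cell = None
    for cell in cells:
        if cell is not None and cell == last_kept:
            result.append(None)  # a repeat of the value above: blank it
        else:
            result.append(cell)
            last_kept = cell
    return result


def fill_down(cells: List[Cell]) -> List[Cell]:
    """Copy the nearest value above into each blank cell.

    Blank cells that sit above the first value stay blank.
    """
    result: List[Cell] = []
    last_value: Cell = None
    for cell in cells:
        if cell is None:
            result.append(last_value)
        else:
            last_value = cell
            result.append(cell)
    return result


titles = ["Oz", "Oz", "Oz", "Alien", "Alien"]
blanked = blank_down(titles)   # ["Oz", None, None, "Alien", None]
restored = fill_down(blanked)  # back to the original list
```

On a sorted column the two operations undo each other, which is why sorting before blanking down matters.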
|
||||
|
||||
## Split multi-valued cells
|
||||
|
||||
Splitting cells with more than one value in them is a common way to get your data from single rows into multi-row records. Survey data, for example, frequently allows respondents to “Select all that apply,” or an inventory list might have items filed under more than one category.
|
||||
Splitting cells with more than one value in them is a common way to get your data from single rows into [multi-row records](exploring#rows-vs-records). Survey data, for example, frequently allows respondents to “Select all that apply,” or an inventory list might have items filed under more than one category.
|
||||
|
||||
You can split a column based on any character or series of characters you input, such as a semi-colon (;) or a slash (/). The default is a comma. Splitting based on a separator will remove the separator characters, so you may wish to include a space with your separator (; ) if it exists in your data.
|
||||
|
||||
You can use [expressions](expressions) to design the point at which a cell should split itself into two or more rows. This can be used to identify special characters or create more advanced evaluations. You can split on a line-break by entering `\n` and checking the “regular expression” checkbox.
|
||||
You can use [expressions](expressions) to design the point at which a cell should split itself into two or more rows. This can be used to identify special characters or create more advanced evaluations. You can split on a line-break by entering `\n` and checking the “[regular expression](expressions#regular-expressions)” checkbox.
|
||||
|
||||
This can be useful if the split is not straightforward: say, if a capital letter indicates the beginning of a new string, or if you need to _not_ always split on a character that appears in both the strings and as a separator. Remember that this will remove all the matching characters.
|
||||
Regular expressions can be useful if the split is not straightforward: say, if a capital letter (`[A-Z]`) indicates the beginning of a new string, or if you need to _not_ always split on a character that appears in both the strings and as a separator. Remember that this will remove all the matching characters.
|
||||
|
||||
You can also split based on the lengths of the strings you expect to find. This can be useful if you have predictable data in the cells: for example, a 10-digit phone number, followed by a space, followed by another 10-digit phone number. Any characters past the explicit length you’ve specified will be discarded: if you split by “11, 10” any characters that may come after the 21st character will disappear. If some cells only have one phone number, you will end up with blank rows.
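To make the field-length behavior concrete, here is a minimal Python sketch. It is an illustration only, not OpenRefine's code: the function name `split_by_lengths` is hypothetical, and the phone numbers are made-up sample data.

```python
from typing import List


def split_by_lengths(value: str, lengths: List[int]) -> List[str]:
    """Cut a string into consecutive pieces of the given lengths.

    Characters past the last requested length are discarded.
    """
    parts: List[str] = []
    start = 0
    for length in lengths:
        parts.append(value[start:start + length])
        start += length
    return parts


# Two 10-digit numbers separated by a space, followed by extra text.
two_numbers = split_by_lengths("5551234567 5559876543 ext", [11, 10])
# -> ["5551234567 ", "5559876543"]; the trailing " ext" is dropped

# A cell with only one number yields an empty second piece.
one_number = split_by_lengths("5551234567", [11, 10])
# -> ["5551234567", ""]
```

The empty second piece in the last call corresponds to the blank rows mentioned above.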
|
||||
|
||||
If you have data that should be split into multiple columns instead of multiple rows, see [split into several columns(columnediting#split-into-several-columns).
|
||||
If you have data that should be split into multiple columns instead of multiple rows, see [Split into several columns](columnediting#split-into-several-columns).
|
||||
|
||||
## Join multi-valued cells
|
||||
|
||||
Joining will reverse the “split multi-valued cells” operation, or join up information from multiple rows into one row. All the strings will be compressed into the topmost cell in the record, in the order they appear. A window will appear where you can set the separator; the default is a comma and a space (, ). This separator is optional.
|
||||
Joining will reverse the “split multi-valued cells” operation, or join up information from multiple rows into one row. All the strings will be compressed into the topmost cell in the record, in the order they appear. A window will appear where you can set the separator; the default is a comma and a space (, ). This separator is optional. We suggest the separator | as a sufficiently rare character.
|
||||
|
||||
## Cluster and edit
|
||||
|
||||
Creating a facet on a column is a great way to look for inconsistencies in your data; clustering is a great way to fix those inconsistencies. Clustering uses a variety of comparison methods to find text entries that are similar but not exact, then shares those results with you so that you can merge the cells that should match. Where editing a single cell or text facet at a time can be time-consuming and difficult, clustering is quick and streamlined.
|
||||
|
||||
Clustering always requires the user to approve each suggested edit - it will display values it thinks are variations on the same thing, and you can select which version to keep and apply across all those matching cells (or type in your own version). OpenRefine will do a number of cleanup operations behind the scenes, in memory, in order to do its analysis, but only the merges you approve will modify your data.
|
||||
Clustering always requires the user to approve each suggested edit - it will display values it thinks are variations on the same thing, and you can select which version to keep and apply across all the matching cells (or type in your own version).
|
||||
|
||||
You can start the process in two ways: using the dropdown menu on your column, select “Edit cells” → “Cluster and edit…” or create a text facet and then press the “Cluster” button that appears in the facet box.
|
||||
OpenRefine will do a number of cleanup operations behind the scenes in order to do its analysis, but only the merges you approve will modify your data. Understanding those different behind-the-scenes cleanups can help you choose which clustering method will be more accurate and effective.
|
||||
|
||||
You can start the process in two ways: using the dropdown menu on your column, select <span class="menuItems">Edit cells</span> → <span class="menuItems">Cluster and edit…</span>; or create a text facet and then press the “Cluster” button that appears in the facet box.
|
||||
|
||||
![A screenshot of the Clustering window.](/img/cluster.png)
|
||||
|
||||
The clustering pop-up window will take a small amount of time to analyze your column, and then make some suggestions based on the clustering method currently active.
|
||||
|
||||
For each cluster identified, you can pick one of the existing values to apply to all cells, or manually type in a new value in the text box. And, of course, you can choose not to cluster them at all. OpenRefine will keep analyzing every time you make a change, with “Merge selected & re-cluster,” and you can work through all the methods this way.
|
||||
For each cluster identified, you can pick one of the existing values to apply to all cells, or manually type in a new value in the text box. And, of course, you can choose not to cluster them at all. OpenRefine will keep analyzing every time you make a change, with <span class="buttonLabels">Merge selected & re-cluster</span>, and you can work through all the methods this way.
|
||||
|
||||
You can also export the currently identified clusters as a JSON file, or close the window with or without applying your changes. You can also use the histograms on the right to narrow down to, for example, clusters with lots of matching rows, or clusters of long or short values.
|
||||
|
||||
@ -127,9 +122,9 @@ The clustering pop-up window offers you a variety of clustering methods:
|
||||
|
||||
**Key collisions** are very fast and can process millions of cells in seconds:
|
||||
|
||||
**Fingerprinting** is the least likely to produce false positives, so it’s a good place to start. It does the same kind of data-cleaning behind the scenes that you might think to do manually: fix whitespace into single spaces, put all uppercase letters into lowercase, discard punctuation, remove diacritics (e.g. accents) from characters, split all strings (words) and sort them alphabetically (so “Zhenyi, Wang” becomes “Wang Zhenyi”). This makes comparing those types of name values very easy.
|
||||
**Fingerprinting** is the least likely to produce false positives, so it’s a good place to start. It does the same kind of data-cleaning behind the scenes that you might think to do manually: fix whitespace into single spaces, put all uppercase letters into lowercase, discard punctuation, remove diacritics (e.g. accents) from characters, split all strings (words) and sort them alphabetically (so “Zhenyi, Wang” becomes “wang zhenyi”).
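The cleanup steps listed for fingerprinting can be approximated in Python as follows. This is a rough sketch of the steps named in the text, not OpenRefine's actual implementation: the function name `fingerprint` is hypothetical, and only ASCII punctuation is discarded.

```python
import string
import unicodedata


def fingerprint(value: str) -> str:
    """Build a simplified fingerprint key for a cell value."""
    text = value.lower()
    # Remove diacritics: decompose characters, then drop the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Discard ASCII punctuation (an assumption; other punctuation is left alone).
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Splitting on whitespace also collapses runs of spaces and tabs.
    tokens = sorted(text.split())
    return " ".join(tokens)


key = fingerprint("Zhenyi, Wang")  # "wang zhenyi"
```

Values that produce the same key, such as "Zhenyi, Wang" and "  Wáng   ZHENYI ", would land in the same cluster suggestion.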
|
||||
|
||||
**N-gram fingerprinting** allows you to set the _n_ value to whatever number you’d like, and will create n-grams of _n_ size (after doing some cleaning), alphabetize them, then join them back together into a _fingerprint_. For example, a 1-gram fingerprint will simply organize all the letters in the cell into alphabetical order - by creating segments one character in length. A 2-gram fingerprint will find all the two-character segments, remove duplicates, alphabetize them, and join them back together (for example, “banana” generates “ba an na an na,” which becomes “anbana”). This can help match cells that have typos, or incorrect spaces (such as matching “lookout” and “look out,” which fingerprinting itself won’t identify). The higher the _n_ value, the fewer clusters will be identified. With 1-grams, keep an eye out for mismatched values that are near-anagrams of each other (such as “Wellington” and “Elgin Town”).
|
||||
**N-gram fingerprinting** allows you to set the _n_ value to whatever number you’d like, and will create n-grams of _n_ size (after doing some cleaning), alphabetize them, then join them back together into a _fingerprint_. For example, a 1-gram fingerprint will simply organize all the letters in the cell into alphabetical order - by creating segments one character in length. A 2-gram fingerprint will find all the two-character segments, remove duplicates, alphabetize them, and join them back together (for example, “banana” generates “ba an na an na,” which becomes “anbana”). This can help match cells that have typos, or incorrect spaces (such as matching “lookout” and “look out,” which fingerprinting itself won’t identify because it keeps words separated). The higher the _n_ value, the fewer clusters will be identified. With 1-grams, keep an eye out for mismatched values that are near-anagrams of each other (such as “Wellington” and “Elgin Town”).
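The n-gram fingerprint described above can be sketched the same way. Again, this is an illustration under stated assumptions rather than OpenRefine's code: the function name `ngram_fingerprint` is hypothetical, and the "cleaning" step here is assumed to mean lowercasing, removing diacritics, and dropping everything that is not a letter or digit.

```python
import unicodedata


def ngram_fingerprint(value: str, n: int = 2) -> str:
    """Build a simplified n-gram fingerprint key for a cell value."""
    text = unicodedata.normalize("NFKD", value.lower())
    # Keep only letters and digits; this also removes spaces and punctuation.
    text = "".join(
        ch for ch in text if ch.isalnum() and not unicodedata.combining(ch)
    )
    # Collect every n-character segment; the set removes duplicates.
    grams = {text[i:i + n] for i in range(len(text) - n + 1)}
    return "".join(sorted(grams))


banana_key = ngram_fingerprint("banana", 2)  # "anbana"
same_words = ngram_fingerprint("lookout", 2) == ngram_fingerprint("look out", 2)  # True
anagram_clash = ngram_fingerprint("Wellington", 1) == ngram_fingerprint("Elgin Town", 1)  # True
```

The last comparison shows why 1-grams need a careful eye: the two place names reduce to the same set of letters.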
|
||||
|
||||
##### Phonetic clustering
|
||||
|
||||
|
@ -1,4 +1,4 @@
|
||||
---
|
||||
---
|
||||
id: exploring
|
||||
title: Exploring data
|
||||
sidebar_label: Overview
|
||||
@ -6,7 +6,7 @@ sidebar_label: Overview
|
||||
|
||||
## Overview
|
||||
|
||||
OpenRefine is a powerful tool for learning about your dataset, even if you don’t change a single character. In this section we cover different ways for sorting through, filtering, and viewing your data.
|
||||
OpenRefine offers lots of features to help you learn about your dataset, even if you don’t change a single character. In this section we cover different ways for sorting through, filtering, and viewing your data.
|
||||
|
||||
Unlike spreadsheets, OpenRefine doesn’t store formulas and display the output of those calculations; it only shows the value inside each cell. It doesn’t support cell colors or text formatting.
|
||||
|
||||
@ -14,21 +14,19 @@ Unlike spreadsheets, OpenRefine doesn’t store formulas and display the output
|
||||
|
||||
Each piece of information (each cell) in OpenRefine is assigned a data type. Some file formats, when imported, can set data types that are recognized by OpenRefine. Cells without an associated data type on import will be considered a “string” at first, but you can have OpenRefine convert cell contents into other data types later. This is set at the cell level, not at the column level.
|
||||
|
||||
You can see data types in action when you preview a new project: check the box that says “Attempt to parse cell text into numbers” and cells will be converted to the “number” data type based on their contents. You’ll see numbers change from black text to green if they are recognized.
|
||||
You can see data types in action when you preview a new project: check the box next to <span class="fieldLabels">Attempt to parse cell text into numbers</span>, and cells will be converted to the “number” data type based on their contents. You’ll see numbers change from black text to green if they are recognized.
|
||||
|
||||
The data type will determine what you can do with the value. For example, if you want to add two values together, they must both be recognized as the number type.
|
||||
|
||||
You can check data types at any time by:
|
||||
* clicking “edit” on a single cell (where you can also edit the type)
|
||||
* creating a Custom Text Facet on a column, and inserting `type(value)` into the “Expression” field. This will generate the data type in the preview, and you can facet by data type if you press “OK.”
|
||||
* creating a <span class="menuItems">Custom Text Facet</span> on a column, and inserting `type(value)` into the <span class="fieldLabels">Expression</span> field. This will generate the data type in the preview, and you can facet by data type if you press <span class="buttonLabels">OK</span>.
|
||||
|
||||
The data types supported are:
|
||||
* string (one or more text characters)
|
||||
* number (one or more characters of numbers only)
|
||||
* boolean (values of “true” or “false”)
|
||||
* date (ISO-8601-compliant extended format with time in UTC: YYYY-MM-DDTHH:MM:SSZ)
|
||||
|
||||
A “date” type is created when a text column is [transformed into dates](transforming#to-date), or when individual cells are set to have the data type “date.”
|
||||
* [date](#dates) (ISO-8601-compliant extended format with time in UTC: YYYY-MM-DDTHH:MM:SSZ)
|
||||
|
||||
OpenRefine recognizes two further data types as a result of its own processes:
|
||||
* error
|
||||
@ -36,33 +34,59 @@ OpenRefine recognizes two further data types as a result of its own processes:
|
||||
|
||||
An “error” data type is created when the cell is storing an error generated during a transformation in OpenRefine.
|
||||
|
||||
A “null” data type is a special value which basically means “this cell has no value.” It’s used to differentiate between cells that have values such as “0” or “false” - or a cell that looks empty but has, for example, spaces in it. When you use `type(value)`, it will show you that the cell’s value is “null” and its type is “undefined.” You can opt to [show “null” values](#view) to differentiate them from empty strings, by going to “All” → “View” → “Show/Hide ‘null’ values in cells.”
|
||||
A “null” data type is a special type that means “this cell has no value.” It’s distinct from cells that have values such as “0” or “false”, or cells that look empty but have whitespace in them, or cells that contain empty strings. When you use `type(value)`, it will show you that the cell’s value is “null” and its type is “undefined.” You can opt to [show “null” values](sortview#showhide-null), by going to <span class="menuItems">All</span> → <span class="menuItems">View</span> → <span class="menuItems">Show/Hide ‘null’ values in cells</span>.
|
||||
|
||||
Converting a cell's data type is not the same operation as transforming its contents. For example, using a column-wide transform such as “Transform” → “Common transforms …” → “to date” may not convert all values successfully, but going to an individual cell, clicking “edit” and changing the data type can successfully convert text to a date. These operations use different underlying code.
|
||||
Changing a cell's data type is not the same operation as transforming its contents. For example, using a column-wide transform such as <span class="menuItems">Transform</span> → <span class="menuItems">Common transforms</span> → <span class="menuItems">To date</span> may not convert all values successfully, but going to an individual cell, clicking “edit”, and changing the data type can successfully convert text to a date. These operations use different underlying code. Learn more about date formatting and transformations in the next section.
|
||||
|
||||
To transform data from one type to another, see [Transforming data](cellediting#data-type-transforms) for information on using common transforms, and see [Expressions](expressions) for information on using [toString()](grelfunctions#tostringo-string-format-optional), [toDate()](grelfunctions#todateo-b-monthfirst-s-format1-s-format2-), and other functions.
|
||||
|
||||
To transform data from one type to another, see [Transforming data](transforming#transform) for information on using common tranforms, and see [Expressions](expressions) for information on using `toString()`, `toDate()`, and other functions.
|
||||
|
||||
### Dates
|
||||
|
||||
Date-formatted data in OpenRefine relies on a number of conversion tools and standards. When you convert a cell into a "date" data type, what you'll be doing is trying to transform the original contents in an ISO-8601-compliant extended format with time in UTC: YYYY-MM-DDTHH:MM:SSZ.
|
||||
A “date” type is created when a column is [transformed into dates](transforming#to-date), when an expression is used to [convert cells to dates](grelfunctions#todateo-b-monthfirst-s-format1-s-format2-) or when individual cells are set to have the data type “date.”
|
||||
|
||||
You can convert dates when you [export your data using the custom tabular exporter](exporting#custom-tabular-exporter). You are given the option to keep your dates in ISO 8601 format, or to output short, medium, long, or full locale formats. This means that you can format your dates into, for example, MM/DD/YY (the US short standard) with or without including the time, after manipulating your data formatted into ISO 8601.
|
||||
Date-formatted data in OpenRefine relies on a number of conversion tools and standards. For something to be considered a date in OpenRefine, it must be converted into the ISO-8601-compliant extended format with time in UTC: YYYY-MM-DDTHH:MM:SSZ.
|
||||
|
||||
When you run <span class="menuItems">Edit cells</span> → <span class="menuItems">Common transforms</span> → <span class="menuItems">To date</span>, the following column of strings on the left will transform into the values on the right:
|
||||
|
||||
|Input|→|Output|
|---|---|---|
|23/12/2019|→|2019-12-23T00:00:00Z|
|14-10-2015|→|2015-10-14T00:00:00Z|
|2012 02 16|→|2012-02-16T00:00:00Z|
|August 2nd 1964|→|1964-08-02T00:00:00Z|
|today|→|today|
|never|→|never|
|
||||
|
||||
OpenRefine uses a variety of tools to recognize, convert, and format [dates](exploring#dates) and so some of the values above can be reformatted using other methods. In this case, clicking the “today” cell and editing its data type manually will convert “today” into a value such as “2020-08-14T00:00:00Z”. Attempting the same data-type change on “never” will give you an error message and refuse to proceed.
|
||||
|
||||
You can do more precise conversion and formatting using expressions and arguments based on the state of your data: see the GREL functions reference section on [Date functions](grelfunctions#date-functions) for more help.
|
||||
|
||||
You can convert dates into a more human-readable format when you [export your data using the custom tabular exporter](exporting#custom-tabular-exporter). You are given the option to keep your dates in the ISO 8601 format, to output short, medium, long, or full locale formats, or to specify a custom format. This means that you can format your dates into, for example, MM/DD/YY (the US short standard) with or without including the time, after working with ISO-8601-formatted dates in your project.
|
||||
|
||||
The following table shows some example [date and time formatting styles for the U.S. and French locales](https://docs.oracle.com/javase/tutorial/i18n/format/dateFormat.html):
|
||||
|
||||
The following table shows the [date and time formatting styles for the U.S. and French locales](https://docs.oracle.com/javase/tutorial/i18n/format/dateFormat.html):
|
||||
|Style |U.S. Locale |French Locale|
|
||||
|DEFAULT |Jun 30, 2009 7:03:47 AM |30 juin 2009 07:03:47|
|
||||
|SHORT |6/30/09 7:03 AM |30/06/09 07:03|
|
||||
|MEDIUM |Jun 30, 2009 7:03:47 AM |30 juin 2009 07:03:47|
|
||||
|LONG |June 30, 2009 7:03:47 AM PDT |30 juin 2009 07:03:47 PDT|
|
||||
|FULL |Tuesday, June 30, 2009 7:03:47 AM PDT |mardi 30 juin 2009 07 h 03 PDT|
|
||||
|---|---|---|
|
||||
|Default |Jun 30, 2009 7:03:47 AM |30 juin 2009 07:03:47|
|
||||
|Short |6/30/09 7:03 AM |30/06/09 07:03|
|
||||
|Medium |Jun 30, 2009 7:03:47 AM |30 juin 2009 07:03:47|
|
||||
|Long |June 30, 2009 7:03:47 AM PDT |30 juin 2009 07:03:47 PDT|
|
||||
|Full |Tuesday, June 30, 2009 7:03:47 AM PDT |mardi 30 juin 2009 07 h 03 PDT|
|
||||
|
||||
## Rows vs. records
|
||||
|
||||
A row is a simple way to organize data: a series of cells, one cell per column. Sometimes there are multiple pieces of information in one cell, such as when a survey respondent can select more than one response. In cases where there is more than one value for a single column in one or more rows, you may wish to use OpenRefine’s records mode: this defines a single record (a survey response, for example) as potentially containing more than one row. From there you can transform cells into multiple rows, each cell containing one value you’d like to work with.
|
||||
A row is a simple way to organize data: a series of cells, one cell per column. Sometimes there are multiple pieces of information in one cell, such as when a survey respondent can select more than one response.
|
||||
|
||||
Generally, when you import some data, OpenRefine reads that data in row mode. From there you can convert the project into records mode. OpenRefine remembers this action and will present you with records mode each time you open the project from then on.
|
||||
In cases where there is more than one value for a single column in one or more rows, you may wish to use OpenRefine’s records mode: this defines a single record as potentially containing more than one row. From there you can transform cells into multiple rows, each cell containing one value you’d like to work with.
|
||||
|
||||
OpenRefine understands records based on the content of the first column, what we call the "key column." Splitting a row into a multi-row record will base all association on the first column in your dataset. If you have more than one column to split out into multiple rows, OpenRefine will keep your data associated with its original record: you can imagine this structure as a tree with many branches, all leading back to the same trunk.
|
||||
Generally, when you import some data, OpenRefine reads that data in row mode. From the project screen, you can convert the project into records mode. OpenRefine remembers this action and will present you with records mode each time you open the project from then on.
|
||||
|
||||
OpenRefine understands records based on the content of the first column, what we call the “key column.” Splitting a row into a multi-row record will base all association on the first column in your dataset.
|
||||
|
||||
If you have more than one column to split out into multiple rows, OpenRefine will keep your data associated with its original record, and associate subgroups based on the top-most row in each group.
|
||||
|
||||
You can imagine the structure as a tree with many branches, all leading back to the same trunk.
|
||||
|
||||
For example, your key column may be a film or television show, with multiple cast members identified by name, associated to that work. You may have one or more roles listed for each person. The roles are linked to the actors, which are linked to the title.
|
||||
|
||||
@ -83,9 +107,9 @@ For example, your key column may be a film or television show, with multiple cas
|
||||
||Margaret Hamilton|Miss Almira Gulch|
|
||||
|||The Wicked Witch of the West|
|
||||
|
||||
Once you are in records mode, you can still move columns around, but if you move a column to the beginning, you may find your data becomes misaligned. The new key column will sort into records based on empty cells, and values in the old key column will be assigned to the last row in the old record (the key value sitting above those values).
|
||||
Once you are in records mode, you can still move some columns around, but if you move a column to the beginning, you may find your data becomes misaligned. The new key column will sort into records based on empty cells, and values in the old key column will be assigned to the last row in the old record (the key value sitting above those values).
|
||||
|
||||
OpenRefine assigns a unique key behind the scenes, so your records don’t need a unique identifier in the key column (but you will likely have one, to ensure data stays properly sorted). You can keep track of which rows are assigned to which record by the record number that appears under the “All” column.
|
||||
OpenRefine assigns a unique key behind the scenes, so your records don’t need a unique identifier in the key column. You can keep track of which rows are assigned to each record by the record number that appears under the <span class="menuItems">All</span> column.
|
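As a rough illustration of how records follow from the key column, here is a minimal Python sketch. It is not OpenRefine's implementation: it assumes each row is a list of cell values and that a blank first cell means the row belongs to the record started above it. The helper name `group_into_records`, the film titles, and the "Actor"/"Role" values are hypothetical.

```python
def group_into_records(rows):
    """Group rows into records: a non-blank first cell starts a new record."""
    records = []
    for row in rows:
        key = row[0]
        if key not in (None, "") or not records:
            records.append([row])      # non-blank key: a new record begins here
        else:
            records[-1].append(row)    # blank key: row belongs to the record above
    return records


# Hypothetical sample data; only the first column decides record boundaries.
rows = [
    ["Example Film", "Actor One", "Role A"],
    ["", "Margaret Hamilton", "Miss Almira Gulch"],
    ["", "", "The Wicked Witch of the West"],
    ["Another Film", "Actor Two", "Role B"],
]

records = group_into_records(rows)
for number, record in enumerate(records, start=1):
    print(f"record {number}: {len(record)} row(s)")
```

The record number printed here plays the same role as the number shown under the All column: several rows can share one record.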
||||
|
||||
To [split multi-valued cells](transforming#split-multi-valued-cells) and apply other operations that take advantage of records mode, see [Transforming data](transforming).
|
||||
|
||||
|
@ -1,4 +1,4 @@
|
||||
---
|
||||
---
|
||||
id: facets
|
||||
title: Exploring facets
|
||||
sidebar_label: Facets
|
||||
@ -6,16 +6,18 @@ sidebar_label: Facets
|
||||
|
||||
## Overview
|
||||
|
||||
Facets are one of OpenRefine’s strongest features - that’s where the diamond logo comes from! Faceting allows you to look for patterns and trends. Facets are essentially aspects or angles of data variance in a given column. For example, if you had survey data where respondents indicated one of five responses from “Strongly agree” to “Strongly disagree,” those five responses make up a text facet, showing how many people selected each option.
|
||||
Facets are one of OpenRefine’s strongest features - that’s where the diamond logo comes from!
|
||||
|
||||
Faceting allows you to look for patterns and trends. Facets are essentially aspects or angles of data variance in a given column. For example, if you had survey data where respondents indicated one of five responses from “Strongly agree” to “Strongly disagree,” those five responses make up a text facet, showing how many people selected each option.
|
||||
|
||||
Faceted browsing gives you a big-picture look at your data (do they agree or disagree?) and also allows you to filter down to a specific subset to explore it more (what do people who disagree say in other responses?).
|
||||
|
||||
Typically, you create a facet on a particular column. That facet selection appears on the left, in the Facet/Filter tab, and you can click on a displayed facet to view all the records that match. You can also “exclude” the facet, to view every record that does _not_ match, and you can select more than one facet by clicking “include.”
|
||||
Typically, you create a facet on a particular column. That facet selection appears on the left, in the <span class="tabLabels">Facet/Filter</span> tab, and you can click on a displayed facet to view all the records that match. You can also “exclude” the facet, to view every record that does _not_ match, and you can select more than one facet by clicking “include.”
|
||||
|
||||
|
||||
### An example
|
||||
|
||||
You can learn about facets and filtering with the following example.
|
||||
You can learn about facets and filtering with the following example. You can copy the following table and paste it using the <span class="menuItems">Clipboard</span> method of starting a project if you would like to try it yourself.
|
||||
|
||||
We collected a list of the [10 most populous cities from Wikidata](https://w.wiki/3Em), using an example query of theirs. We removed the GPS coordinates and added the country.
|
||||
|
||||
@ -32,9 +34,9 @@ We collected a list of the [10 most populous cities from Wikidata](https://w.wik
|
||||
| Guangzhou | 13080500 | People's Republic of China |
|
||||
| São Paulo | 12106920 | Brazil |
|
||||
|
||||
If we want to see which countries have the most populous cities, we can create a “text facet” on the “countryLabel” column and OpenRefine will generate a list of all the different strings used in these cells.
|
||||
If we want to see which countries have the most populous cities, we can create a text facet on the “countryLabel” column and OpenRefine will generate a list of all the different strings used in these cells.
|
||||
|
||||
We will see in the sidebar that the countries identified are displayed, along with the number of matches (the “count”). We can sort this list alphabetically or by the count. If you sort by count, you’ll learn which countries hold the most populous cities.
|
||||
We will see in the sidebar that the countries identified are displayed, along with the number of matches (the “count”). We can sort this list alphabetically or by the count. If you sort by count at the top of the facet window, you’ll learn which countries hold the most populous cities.
|
||||
|
||||
|Facet|Count|
|
||||
|---|---|
|
||||
@ -48,25 +50,25 @@ We will see in the sidebar that the countries identified are displayed, along wi
|
||||
|
||||
If we want to learn more about a particular country, we can click on its appearance in the facet sidebar. This narrows our dataset down temporarily to only rows matching that facet.
|
||||
|
||||
You’ll see the “10 rows” notification change to “4 matching rows (10 total)” if you click on “People’s Republic of China”. In the data grid, you’ll see the same number of rows, but only the ones matching your current filter. Each row will maintain its original numbering, though - in this case, rows #1, 2, and 8.
|
||||
You’ll see the “10 rows” indicator change to “4 matching rows (10 total)” if you click on “People’s Republic of China”. In the data grid, you’ll see fewer rows: only the ones matching your current filter. Each row will maintain its original numbering, though - in this case, rows #1, 2, and 8.
|
||||
|
||||
If you want to go back to the original dataset, click “reset” or “exclude.” If you want to view the most populous cities in both China and India, click “include” next to each facet. Now you’ll see 5 rows - #1, 2, 5, 8, 9.
|
||||
If you want to go back to the original dataset, click <span class="buttonLabels">Reset All</span> or the small “exclude” text next to the facet. If you want to view the most populous cities in both China and India, click “include” next to each facet. Now you’ll see 5 rows - #1, 2, 5, 8, 9.
|
||||
|
||||
We can also explore our data using the population information. In this case, because population is a number, we can create a numeric facet. This will give us the ability to explore by range rather than by exact matching values.
|
||||
|
||||
With the numeric facet, we are given a scale from the smallest to the largest value in the column. We can drag the range minimum and maximum to narrow the results. In this case, if we narrow down to only cities with more than 20 million in population, we get 3 matching rows out of the original 10.
|
||||
|
||||
When you look at the facet display of countries, you should see a smaller list with a reduced count: OpenRefine is now displaying the facets of the 3 matching rows, not the total dataset of 10 rows.
|
||||
When you look back at the text facet display of country names, you should see a smaller list with a reduced count: OpenRefine is now displaying the facets of the 3 matching rows, not the total dataset of 10 rows.
|
||||
|
||||
We can combine these facets - say, by narrowing to only the Chinese cities with populations greater than 20 million - simply by clicking in both. You should see 2 matching rows for both these criteria.
|
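The way facets narrow each other can be sketched in plain Python. This is only an illustration of the idea, assuming rows are dictionaries; the helper names (`text_facet`, `in_range`, `select`) and the "City A" / "Country X" rows are made up for the example and are not the dataset above.

```python
from collections import Counter

# Hypothetical rows; only the last two values appear in the example table.
rows = [
    {"cityLabel": "City A", "population": 24000000, "countryLabel": "Country X"},
    {"cityLabel": "City B", "population": 21000000, "countryLabel": "Country X"},
    {"cityLabel": "City C", "population": 22000000, "countryLabel": "Country Y"},
    {"cityLabel": "Guangzhou", "population": 13080500, "countryLabel": "People's Republic of China"},
    {"cityLabel": "São Paulo", "population": 12106920, "countryLabel": "Brazil"},
]


def text_facet(rows, column):
    """Count how many rows hold each distinct value in a column."""
    return Counter(row[column] for row in rows)


def in_range(rows, column, minimum, maximum):
    """Keep rows whose numeric value falls inside the selected range."""
    return [row for row in rows if minimum <= row[column] <= maximum]


def select(rows, column, choices):
    """Keep rows whose value is one of the included facet choices."""
    return [row for row in rows if row[column] in choices]


# Numeric facet first: only cities above 20 million remain.
large = in_range(rows, "population", 20_000_000, float("inf"))
# The text facet now counts only the rows that still match.
counts_after_numeric = text_facet(large, "countryLabel")
# Combining both facets narrows the rows further.
both = select(large, "countryLabel", {"Country X"})

print(dict(counts_after_numeric))
print(len(both))
```

Each facet is applied to the rows left over by the others, which is why the text facet shows a shorter list with smaller counts once a numeric range is selected.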
||||
|
||||
### Things to know about facets
|
||||
|
||||
When you have facets applied, you will see “matching rows” in the [project grid header](running#project-grid-header). If you press “Export” and copy your data out of OpenRefine while facets are active, you will only export the matching rows, not all the rows in your project.
|
||||
When you have facets applied, you will see “matching rows” in the [project grid header](running#project-grid-header). If you click <span class="menuItems">Export</span> and copy your data out of OpenRefine while facets are active, many of the exporting options will only export the matching rows, not all the rows in your project.
|
||||
|
||||
OpenRefine has several default facets, which you’ll learn about below. The most powerful facets are the ones designed by you - custom facets, written using [expressions](expressions) to transform the data behind the scenes and help you narrow down to precisely what you’re looking for.
|
||||
|
||||
Facets are not saved in the project along with the data. But you can save a link to the current state of the application. Find the "permalink" next to the project’s name.
|
||||
Facets are not saved in the project along with the data. But you can save a link to the current state of the application. Find the <span class="menuItems">[Permalink](running#the-project-bar)</span> next to the project’s name.
|
||||
|
||||
You can modify any facet expression by clicking the “change” button to the right of the column name in the facet sidebar.
|
||||
|
||||
@ -74,15 +76,17 @@ Facet boxes that appear in the sidebar can be resized and rearranged. You can dr
|
||||
|
||||
## Text facet
|
||||
|
||||
A text facet can be generated on any column with the “text” data type. Select the column dropdown and go to “Facet” → “Text facet”. The created facet will be sorted alphabetically, and can be sorted by count.
|
||||
A text facet can be generated on any column with the “text” data type. Select the column dropdown and go to <span class="menuItems">Facet</span> → <span class="menuItems">Text facet</span>. The created facet will be sorted alphabetically, and can be sorted by count.
|
||||
|
||||
A text facet is very simple: it takes the total contents of the cells of the column in question and matches them up. It does no guessing about typos or near-matches.
|
||||
|
||||
You can edit any entry that appears in the facet display, by hovering over the facet and clicking the “edit” button that appears. You can then type in a new value manually. This will mass-edit every identical cell in your data. This is a great way to fix typos, whitespace, and other issues that may be affecting the way facets appear. You can also automate the cleanup of facets by using [clustering](transforming#cluster-and-edit), with the “Cluster” button displayed within the facet window. It may be most efficient to cluster cells to one value, and then mass-edit that value to your desired string within the clustering operation window.
|
||||
You can edit any entry that appears in the facet display, by hovering over the facet and clicking the “edit” button that appears. You can then type in a new value manually. This will mass-edit every identical cell in the column. This is a great way to fix typos, whitespace, and other issues that may be affecting the way facets appear. You can also automate the cleanup of facets by using [clustering](transforming#cluster-and-edit): a “Cluster” button is displayed within the facet window. It may be most efficient to cluster cells to one value, and then mass-edit that value to your desired string within the clustering operation window.
|
||||
|
||||
Each text facet shows up to 2,000 choices by default. You can [increase this limit on the Preferences screen](running#preferences) if you need to, which will increase the processing work required by your browser. If your applied facet has more choices than the current limit, you'll be offered the option to increase the limit, which will edit that preference for you.
|
||||
Each text facet shows up to 2,000 choices by default. You can [increase this limit on the Preferences screen](running#preferences) if you need to, which may slow down your browser. If your applied facet has more choices than the current limit, you'll be offered the option to increase the limit, which will permanently edit that preference for you.
|
||||
|
||||
The choices and counts displayed in each facet can be copied as tab-separated values. To do so, click on the "X choices" link near the top left corner of the facet.
|
||||
The choices and counts displayed in each facet can be copied as tab-separated values. To do so, click on the "X choices" link near the top left corner of the facet. This can be useful to generate small summary tables of your data.
|
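To make the exact-matching behaviour and the tab-separated output concrete, here is a small Python sketch. It is an approximation for illustration only: the function name `facet_choices_as_tsv` and the sample values are hypothetical, and the ordering is Python's default string ordering, which may differ from the order OpenRefine shows.

```python
from collections import Counter


def facet_choices_as_tsv(values, sort_by_count=False):
    """Return one 'choice<TAB>count' line per distinct cell value."""
    counts = Counter(values)  # exact matching: no trimming, no case folding
    if sort_by_count:
        ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    else:
        ordered = sorted(counts.items())
    return "\n".join(f"{choice}\t{count}" for choice, count in ordered)


# "Agree", "agree", and "Agree " (trailing space) are three separate choices.
values = ["Agree", "agree", "Agree ", "Disagree", "Agree"]
tsv = facet_choices_as_tsv(values)
print(tsv)
```

Because nothing is normalized, near-duplicates show up as separate choices, which is what makes the facet useful for spotting typos and stray whitespace.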
||||
|
||||
![A column of years faceted as text and numbers, and with the count ready to be copied.](/img/yeardata.png)
|
||||
|
||||
## Numeric facet
|
||||
|
||||
@ -92,73 +96,95 @@ Whereas a text facet groups unique text values into groups, a numeric facet sort
|
||||
|
||||
You will be offered the option to include blank, non-numeric, and error values in your numeric visualization; these will appear in the visual range as “0” values.
|
||||
|
||||
:::info
|
||||
You can create a text facet on numeric data, which will treat each entry as a string. This can be useful if you wish, for example, to manually include facets instead of selecting a range, or sort by count, or copy that count.
|
||||
:::
|
||||
|
||||
## Timeline facet
|
||||
|
||||
![A screenshot of an example timeline facet.](/img/timelinefacet.png)
|
||||
|
||||
Much like a numeric facet, a timeline facet will display as a small histogram with the values sorted: in this case, chronologically. A timeline facet only works on dates formatted as “date” data types (e.g. by [using the `toDate()` function](expressions#dates) to transform text into dates, or by manually setting the [data type](#cell-data-types) on individual cells) and in the structure of the ISO-8601-compliant extended format with time in UTC: **YYYY**-**MM**-**DD**T**HH**:**MM**:**SS**Z.
|
||||
Much like a numeric facet, a timeline facet will display as a small histogram with the values sorted: in this case, chronologically. A timeline facet only works on cells formatted as the [“date” data type](exploring#dates).
|
||||
|
||||
The facet appears with a count of blank cells and those with errors, which can help you analyze whether your date cells are correctly converted.
|
||||
|
||||
## Scatterplot facet
|
||||
|
||||
A scatterplot facet can be generated on any number-formatted column. You require two or more number columns to generate scatterplots.
|
||||
A scatterplot is a visual representation of two related sets of numeric data.
|
||||
|
||||
You have the option to generate linear scatterplots (where the X and Y axes show continuous increases) or logarithmic scatterplots (where the X and Y axes show exponential or scaled increases). You can also rotate the plot by 45 degrees in either direction, and you can choose the size of the dot indicating a datapoint. You can make these choices in both the preview and in the facet display.
|
||||
|
||||
Selecting “Facet” → “Scatterplot facet” will create a preview of data plotted from every number-formatted column in your dataset, comparing every column against every other column. Each scatterplot will show in its own square, allowing you to choose which data comparison you would like to analyze further.
|
||||
A scatterplot facet can be generated on any column. You require two or more number columns to generate scatterplots. Selecting <span class="menuItems">Facet</span> → <span class="menuItems">Scatterplot facet</span> will create a preview of data plotted from every number-formatted column in your dataset, comparing every column against every other column. Each scatterplot will show in its own square, allowing you to choose which data comparison you would like to analyze further. You can control which columns are on the X and Y axes by rearranging the columns in your dataset.
|
||||
|
||||
When you click on your desired square, that two-column comparison will appear in the facets sidebar. From here, you can drag your mouse to draw a rectangle inside the scatterplot, which will narrow down to just the rows matching the points plotted inside that rectangle. This rectangle can be resized by dragging any of the four edges. To draw a new rectangle, simply click and drag your mouse again. To add more scatterplots to the facet sidebar, re-run this process and select a different square.
|
||||
![A simple scatterplot of two numeric values.](/img/scatterplot.png)
|
||||
|
||||
When you click on your desired square, that two-column comparison will appear in the facets sidebar. From here, you can drag your mouse to draw a rectangle inside the scatterplot, which will narrow down to just the rows matching the points plotted inside that rectangle (as shown by the rectangle inside the square in the image above). This rectangle can be resized by dragging any of the four edges. To draw a new rectangle, simply click and drag your mouse again. To add more scatterplots to the facet sidebar, re-run this process and select a different square.
|
||||
|
||||
If you have multiple facets applied, plotted points in your scatterplot displays will be greyed out if they are not part of the current matching data subset. If the rectangle you have drawn within a scatterplot display only includes grey dots, you will see no matching rows.
|
||||
|
||||
If you would like to export a scatterplot, OpenRefine will open a new tab with a generated PNG image that you can save.
|
||||
If you would like to export a scatterplot, OpenRefine will open a new tab with a generated PNG file that you can save.
|
||||
|
||||
## Custom text facet
|
||||
|
||||
You may want to explore your textual data in a way that doesn’t involve modifying it but does require being more selective about what gets considered. Creating custom text facets will load your column into memory, transform the data, and store those transformations inside the facet.
|
||||
You may want to explore your textual data with modifications that aren't permanent. Creating custom text facets will load your column into memory, transform the data temporarily, and store those transformations inside the facet.
|
||||
|
||||
You can also use text facets to analyze numerical data, such as by analyzing a number as a string, or by creating a test that will return “true” and false” as values.
|
||||
You can also use custom text facets to analyze numerical data, such as by analyzing a number as a string, or by creating a test that will return “true” and “false” as values.
|
||||
|
||||
If you would like to build your own version of a text facet, you can use the “Custom Text Facet” option. Clicking on “Facets” → “Custom text facet…” will bring up an [expressions](expressions) window where you can enter in a GREL, Python or Jython, or Clojure expression to modify how the facet works.
|
||||
Clicking on <span class="menuItems">Facet</span> → <span class="menuItems">Custom text facet…</span> will bring up an [expressions](expressions) window where you can enter in a GREL, Jython, or Clojure expression to modify how the facet works.
|
||||
|
||||
A custom text facet operates just like a [text facet](#text-facet) by default. Unlike a text facet, however, you cannot edit the facets that appear in the sidebar and change the matching cells in your dataset.
|
||||
A custom text facet operates just like a [text facet](#text-facet) by default. Unlike a text facet, however, you cannot click “edit” on the facets that appear in the sidebar and change the matching cells in your dataset - because what they display is modified, not the original entries.
|
||||
|
||||
For example, you may wish to analyze only the first word in a text field - perhaps the first name in a column of “[First Name] [Last Name]” entries. In this case, you can tell OpenRefine to facet only on the information that comes before the first space:
|
||||
|
||||
```value.split(" ")[0]```
|
||||
```
|
||||
value.split(" ")[0]
|
||||
```
|
||||
|
||||
In this case, `split()` is creating an array of text strings based on every space in the cells - in this case, one space, so two values in the array. Because arrays number their entries starting with 0, we want the first value, so we ask for `[0]`. We can do the same splitting and ask for the last name with
|
||||
In this case, `split()` is creating an array of text strings based on every space in the cell, such as `["Firstname", "Lastname"]`. Because arrays number their entries starting with 0, we want the first value, so we ask for `[0]` (assuming the first name is one word, not something like “Mary Anne”). We can do the same splitting and ask for the last name with
|
||||
|
||||
```value.split(" ")[1]```
|
||||
```
|
||||
value.split(" ")[1]
|
||||
```
|
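The same splitting logic can be tried outside OpenRefine. The following is a minimal Python sketch, not GREL: the helper names `first_word` and `second_word` are hypothetical, and Python's `str.split(" ")` is assumed to be close enough to the GREL `split()` for this simple case.

```python
def first_word(value):
    """Equivalent in spirit to value.split(" ")[0]."""
    return value.split(" ")[0]


def second_word(value):
    """Equivalent in spirit to value.split(" ")[1]."""
    return value.split(" ")[1]


print(first_word("Marilyn Monroe"))    # first entry of the array
print(second_word("Marilyn Monroe"))   # second entry of the array

# With a two-word first name, index 1 is no longer the last name.
print(second_word("Mary Anne Smith"))
```

The last call shows why the one-word assumption matters: the second entry of the array is simply whatever follows the first space.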
||||
|
||||
You may want to create a facet that references several columns. For example, let’s say you have two columns, "First Name" and "Last Name", and you want out how many people have the same initial letter for both names (e.g., Marilyn Monroe, Steven Segal). To do so, create a custom text facet on either column and enter the expression
|
||||
You may want to create a facet that references several columns. For example, let’s say you have two columns, “First Name” and “Last Name”, and you want to find out how many people have the same initial letter for both names (e.g., Marilyn Monroe, Steven Seagal). To do so, create a custom text facet on either column and enter the expression
|
||||
|
||||
```cells["First Name"].value[0] == cells["Last Name"].value[0]```
|
||||
```
|
||||
cells["First Name"].value[0] == cells["Last Name"].value[0]
|
||||
```
|
||||
|
||||
That expression will facet your rows into `true` and `false`.
|
||||
That expression will look for the first letter (the character at index 0) of each entry and compare them. Then it will facet your rows into `true` and `false`.
|
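A comparable check in plain Python might look like the sketch below. It only imitates the idea of the expression above: rows are assumed to be dictionaries keyed by column name, `same_initial` is a hypothetical helper, and the second sample row is made up.

```python
from collections import Counter


def same_initial(first_name, last_name):
    """Compare the character at index 0 of each name."""
    return first_name[0] == last_name[0]


rows = [
    {"First Name": "Marilyn", "Last Name": "Monroe"},
    {"First Name": "Margaret", "Last Name": "Hamilton"},
]

# The facet has two choices, True and False, each with a row count.
facet = Counter(same_initial(row["First Name"], row["Last Name"]) for row in rows)
print(dict(facet))
```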
||||
|
||||
You can learn more about text-modification functions on the [Expressions page](expressions).
|
||||
|
||||
## Custom numeric facet
|
||||
|
||||
You may want to explore your numerical data in a way that doesn’t involve modifying it but does require being more selective about what gets considered. You can also use custom numeric facets to analyze textual data, such as by getting the length of text strings (with `value.length()`), or by analyzing it as though it were formatted as numbers (with `toNumber(value)`).
|
||||
You may want to explore your numerical data with modifications that aren't permanent. You can also use custom numeric facets to analyze textual data, such as by getting the length of text strings (with `value.length()`), or by analyzing it as though it were formatted as numbers (with `toNumber(value)`).
|
||||
|
||||
If you would like to build your own version of a numeric facet, you can use the “Custom Numeric Facet” option. Clicking on “Facets” → “Custom Numeric Facet…” will bring up an [expressions](expressions) window where you can enter in a GREL, Python or Jython, or Clojure expression to modify how the facet works. A custom numeric facet operates just like a [numeric facet](#numeric-facet) by default.
|
||||
If you would like to build your own version of a numeric facet, you can use the <span class="menuItems">Custom Numeric Facet</span> option. Clicking on <span class="menuItems">Facet</span> → <span class="menuItems">Custom Numeric Facet…</span> will bring up an [expressions](expressions) window where you can enter in a GREL, Jython, or Clojure expression to modify how the facet works. A custom numeric facet operates just like a [numeric facet](#numeric-facet) by default.
|
||||
|
||||
For example, if you wish to create a numeric facet that rounds your value to the nearest integer, enter
|
||||
|
||||
```round(value)```
|
||||
```
|
||||
round(value)
|
||||
```
|
||||
|
||||
If you have two columns of numbers and for each row you wish to create a numeric facet only on the larger of the two, enter
|
||||
|
||||
```max(cells["Column1"].value, cells[“Column2”].value)```
|
||||
```
|
||||
max(cells["Column1"].value, cells["Column2"].value)
|
||||
```
|
||||
|
||||
If the numeric values in a column are drawn from a power law distribution, then it's better to group them by their logs:
|
||||
|
||||
```value.log()```
|
||||
```
|
||||
value.log()
|
||||
```
|
||||
|
||||
If the values are periodic you could take the modulus by the period to understand if there's a pattern:
|
||||
|
||||
```mod(value, 7)```
|
||||
```
|
||||
mod(value, 7)
|
||||
```
|
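To see what these four expressions compute, here is a plain Python sketch of the same arithmetic. It is an illustration rather than GREL: the helper names are hypothetical, the logarithm is assumed to be base 10 (as stated for the numeric log facet), and Python's `round()` may treat exact halves differently from OpenRefine.

```python
import math


def larger_of(column1, column2):
    """Facet on the larger of two values from the same row."""
    return max(column1, column2)


def log_bucket(value):
    """Order of magnitude of a positive value, using a base-10 log (assumed)."""
    return math.floor(math.log10(value))


def position_in_period(value, period=7):
    """Remainder after dividing by the period, e.g. a day within a week."""
    return value % period


print(round(3.7))              # nearest integer
print(larger_of(3, 8))         # larger of the two columns
print(log_bucket(1500))        # 1500 falls in the 10^3 bucket
print(position_in_period(15))  # 15 is one past two full periods of 7
```

Grouping by `log_bucket` puts 15, 150, and 1500 in three evenly spaced buckets, which is the point of faceting skewed data by its log.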
||||
|
||||
You can learn more about numeric-modification functions on the [Expressions page](expressions).
|
||||
|
||||
@ -166,13 +192,15 @@ You can learn more about numeric-modification functions on the [Expressions page
|
||||
|
||||
Customized facets have been added to expand the number of default facets users can apply with a single click. They represent some common and useful functions you shouldn’t have to work out using an [expression](expressions).
|
||||
|
||||
All facets that display in the “Facet/Filter” sidebar can be edited by clicking on the “change” button to the right of the column title. This brings up the expressions window that will allow you to modify and preview the expression being used.
|
||||
All facets that display in the <span class="tabLabels">Facet/Filter</span> tab can be edited by clicking on the “change” button to the right of the column title. This brings up the expressions window that will allow you to modify and preview the expression being used.
|
||||
|
||||
### Word facet
|
||||
|
||||
Word facet is a simple version of a text facet: it splits up the content of the cells based on spaces, and outputs each character string as a facet:
|
||||
A <span class="menuItems">Word facet</span> is a simple version of a text facet: it splits up the content of the cells based on spaces, and outputs each character string as a facet:
|
||||
|
||||
```value.split(" ")```
|
||||
```
|
||||
value.split(" ")
|
||||
```
|
||||
|
||||
This can be useful for exploring the language used in a corpus, looking for common first and last names or titles, or seeing what’s in multi-valued cells you don’t wish to split up.
|
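The effect of splitting on spaces can be sketched in Python as follows. This is a simplified model, not OpenRefine's code: `word_facet` is a hypothetical helper, and it assumes a word is counted once per row even if it repeats inside the cell.

```python
from collections import Counter


def word_facet(values):
    """Count, for each space-separated word, how many rows contain it."""
    counts = Counter()
    for value in values:
        counts.update(set(value.split(" ")))  # one count per row per word
    return counts


# Made-up sample cells; "red" and "Red" stay separate because case matters.
values = ["red apple", "green apple", "Red pear"]
facet = word_facet(values)
print(dict(facet))
```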
||||
|
||||
@ -180,19 +208,21 @@ Word facet is case-sensitive and only splits by spaces, not by line breaks or ot
|
||||
|
||||
### Duplicates facet
|
||||
|
||||
A duplicates facet will return only rows that have non-unique values in the column you’ve selected. It will create a facet of “true” and “false” values - true being cells that are not unique, and “false” being cells that are. The actual expression being used is
|
||||
A <span class="menuItems">Duplicates facet</span> will return only rows that have non-unique values in the column you’ve selected. It will create a facet of “true” and “false” values: “true” being cells that are not unique, and “false” being cells that are. The actual expression being used is
|
||||
|
||||
```facetCount(value, 'value', '[Column]') > 1```
|
||||
```
|
||||
facetCount(value, 'value', '[Column]') > 1
|
||||
```
|
||||
|
||||
Duplicates facets are case-sensitive and you may wish to filter out things like leading and trailing whitespace or other hard-to-see issues. You can modify the facet expression, for example, with:
|
||||
|
||||
```facetCount(trim(toLowercase(value)), 'trim(toLowercase(value))', 'cityLabel') > 1```
|
||||
```
|
||||
facetCount(trim(toLowercase(value)), 'trim(toLowercase(value))', 'cityLabel') > 1
|
||||
```
|
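The difference between the default expression and the normalized one can be modelled in a few lines of Python. This is a sketch under simple assumptions: `duplicates_facet` is a hypothetical helper, the sample values are made up, and `strip().lower()` stands in for `trim(toLowercase(value))`.

```python
from collections import Counter


def duplicates_facet(values, normalize=lambda value: value):
    """Return True for each cell whose (normalized) value occurs more than once."""
    keys = [normalize(value) for value in values]
    counts = Counter(keys)
    return [counts[key] > 1 for key in keys]


values = ["Tokyo", "tokyo ", "Delhi", "Tokyo"]

# Exact comparison: "tokyo " (lowercase, trailing space) is not a duplicate.
exact = duplicates_facet(values)
# Trimmed, lowercased comparison: all three spellings now match.
loose = duplicates_facet(values, lambda value: value.strip().lower())

print(exact)
print(loose)
```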
||||
|
||||
### Numeric log facet
|
||||
|
||||
Logarithmic scales reduce wide-ranging quantities to more compact and manageable ranges. A log transformation can be used to make highly skewed distributions less skewed. If your numerical data is unevenly distributed (say, lots of values in one range, and then a long tail extending off into different magnitudes), a numeric log facet can represent that range better than a simple numeric facet. It will break these values down into more navigable segments than the buckets of a numeric facet. This facet can make patterns in your data more visible.
|
||||
|
||||
OpenRefine uses a base-10 log, the "common logarithm."
|
||||
Logarithmic scales reduce wide-ranging quantities to more compact and manageable ranges. A log transformation can be used to make highly skewed distributions less skewed. If your numerical data is unevenly distributed (say, lots of values in one range, and then a long tail extending off into different magnitudes), a <span class="menuItems">Numeric log facet</span> can represent that range better than a simple numeric facet. It will break these values down into more navigable segments than the buckets of a numeric facet. This facet can make patterns in your data more visible. OpenRefine uses a base-10 log, the “common logarithm.”
|
||||
|
||||
For example, we can look at [this data about the body weight of various mammals](http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data_Brain2BodyWeight):
|
||||
|
||||
@ -220,13 +250,15 @@ A 1-bounded numeric log facet can be used if you'd like to exclude all the value
|
||||
|
||||
### Text-length facet
|
||||
|
||||
The text-length facet returns a numerical value for each cell and plots it on a numeric facet chart. The expression used is
|
||||
The <span class="menuItems">Text-length facet</span> returns a numerical value for each cell and plots it on a numeric facet chart. The expression used is
|
||||
|
||||
```value.length()```
|
||||
```
|
||||
value.length()
|
||||
```
|
||||
|
||||
This can be useful to, for example, look for values that did not successfully split on an earlier split operation, or to validate that data is a certain expected length (such as whether a date, as YYYY/MM/DD, is eight to ten characters).
|
||||
This can be useful to, for example, look for values that did not successfully split on an earlier split operation, or to validate that data is a certain expected length (such as whether a date in YYYY/MM/DD is eight to ten characters).
|
||||
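The length check described above can be sketched outside OpenRefine in a few lines of Python. This is only an illustration: the helper name `outside_range` and the sample values are hypothetical, and Python's `len()` stands in for GREL's `value.length()`.

```python
def outside_range(values, low=8, high=10):
    """Return the values whose character count falls outside low..high."""
    return [v for v in values if not (low <= len(v) <= high)]

# Hypothetical cell contents from a date column expected to be YYYY/MM/DD.
dates = ["2020/09/24", "2020/9/4", "24 September 2020", ""]

print([len(v) for v in dates])   # the number each cell would contribute to the facet
print(outside_range(dates))      # cells that deserve a closer look
```

A length in the expected range does not prove that a value is a valid date, so this kind of check works best as a first pass for spotting outliers.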
|
||||
You can also employ a log of text-length facet that allows you to navigate more easily a wide range of string lengths. This can be useful in the case of web-scraping, where lots of textual data is loaded into single cells and needs to be parsed out.
|
||||
You can also employ a <span class="menuItems">Log of text-length facet</span> that allows you to navigate more easily a wide range of string lengths. This can be useful in the case of web-scraping, where lots of textual data is loaded into single cells and needs to be parsed out.
|
||||
|
||||
|
||||
### Unicode character-code facet
|
||||
@ -243,19 +275,23 @@ An error is a data type created by OpenRefine in the process of transforming dat
|
||||
|
||||
![A view of the expressions window with an error converting a string to a number.](/img/error.png)
|
||||
|
||||
To store errors in cells, ensure that you have “store error” selected for the “On error” option in the expressions window.
|
||||
To store errors in cells, ensure that you have <span class="fieldLabels">store error</span> selected for the “On error” option in the expressions window.
|
||||
|
||||
### Facet by null, empty, or blank
|
||||
|
||||
Any column can be faceted for [null and/or empty cells](#cell-data-types). These can help you find cells where you want to manually enter content. “Blank” means both null values and empty values. All three facets will generate “true” and “false” facets, “true” being blank.
|
||||
Any column can be faceted for [null and/or empty cells](#cell-data-types). These can help you find cells where you want to manually enter content.
|
||||
|
||||
An empty cell is a cell that is set to contain a string, but doesn’t have any characters in it (a zero-length string). This can be a leftover from an operation that removed characters, or from manually editing a cell and deleting its contents.
|
||||
“Blank” means both null values and empty values. All three facets will generate “true” and “false” facets, “true” being blank.
|
||||
|
||||
An empty cell is a cell that is set to contain a string, but doesn’t have any characters in it (a zero-length string). This can be left over from an operation that removed characters, or from manually editing a cell and deleting its contents.
|
||||
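The distinction between null, empty, and blank can be expressed as a small Python sketch. It follows the definitions given in the text and is not taken from OpenRefine's code; `None` stands in for a null cell, and the function names are invented for illustration.

```python
def is_null(cell):
    """A null cell holds no value at all."""
    return cell is None

def is_empty(cell):
    """An empty cell holds a string with zero characters."""
    return cell == ""

def is_blank(cell):
    """Blank covers both cases: null or empty."""
    return is_null(cell) or is_empty(cell)

# Hypothetical column contents.
cells = ["Paris", "", None]
print([is_blank(c) for c in cells])
```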
|
||||
### Facet by star or flag
|
||||
|
||||
Stars and flags offer you the opportunity to mark specific rows for yourself for later focus. Stars and flags persist through closing and opening your project, and thus can provide a different function than using a permalink to persist your facets. Stars and flags can be used in any way you want, although they are designed to help you flag errors and star rows of particular importance.
|
||||
|
||||
You can manually star or flag rows simply by clicking on the icons to the left of each row. You can also apply stars or flags to all matching rows by using the “All” dropdown menu and selecting “Edit rows” → “Star rows” or “Flag rows.” These operations will modify all matching rows in your current subset. You can unstar or unflag them as well.
|
||||
You can manually star or flag rows simply by clicking on the icons to the left of each row.
|
||||
|
||||
You can also apply stars or flags to all matching rows by using the <span class="menuItems">All</span> dropdown menu (on the first column) and selecting <span class="menuItems">Edit rows</span> → <span class="menuItems">Star rows</span> or <span class="menuItems">Flag rows</span>. This will create “true” and “false” facets in the <span class="tabLabels">Facet/Filter</span>. These operations will modify all matching rows in your current subset. You can unstar or unflag them as well.
|
||||
|
||||
You may wish to create a custom subset of your data through a series of separate faceting activities (rather than successively narrowing down with multiple facets applied). For example, you may wish to:
|
||||
* apply a facet
|
||||
@ -266,21 +302,21 @@ You may wish to create a custom subset of your data through a series of separate
|
||||
* remove that facet
|
||||
* and then work with all of the cumulative starred rows.
|
||||
|
||||
You can use the dropdown menu on the “All” column and selecting “Facet by star” or “Facet by flag.” This will create “true” and “false” facets in the facet sidebar.
|
||||
|
||||
You can also create a text facet on any column with the expression `row.starred` or `row.flagged`.
|
||||
|
||||
## Text filter
|
||||
|
||||
Filters allow you to narrow down your data based on whether a given column includes a text string.
|
||||
|
||||
When you choose “Text filter” a box appears in the “Facet/Filter” sidebar that allows you to enter in text. Matching rows will narrow dynamically with every character you enter. You can set the search to be case-sensitive or not, and you can use this box to enter in a regular expression.
|
||||
When you choose <span class="menuItems">Text filter</span> a box appears in the <span class="tabLabels">Facet/Filter</span> tab that allows you to enter in text. Matching rows will narrow dynamically with every character you enter. You can set the search to be case-sensitive or not, and you can use this box to enter in a regular expression.
|
||||
|
||||
For example, you can enter in "side" as a text filter, and it will return all cells in that column containing "side," "sideways," "offside," etc.
|
||||
For example, you can enter in “side” as a text filter, and it will return all cells in that column containing “side,” “sideways,” “offside,” etc.
|
||||
|
||||
The text filter field supports [Java's regular expression language](http://download.oracle.com/javase/tutorial/essential/regex/). For example, you can employ a regular expression to view all properly-formatted emails:
|
||||
The text filter field supports [regular expressions](expressions#regular-expressions). For example, you can employ a regular expression to view all properly-formatted emails:
|
||||
|
||||
```([a-zA-Z0-9_\-\.\+]+)@([a-zA-Z0-9\-\.]+)\.([a-zA-Z0-9\-]{2,15})```
|
||||
```
|
||||
([a-zA-Z0-9_\-\.\+]+)@([a-zA-Z0-9\-\.]+)\.([a-zA-Z0-9\-]{2,15})
|
||||
```
|
||||
|
||||
You can press “invert” on this facet to then see blank cells or invalid email addresses.
|
||||
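To try the pattern outside OpenRefine, a minimal Python sketch is shown below. It assumes the filter keeps a cell when the pattern is found anywhere in the cell's text, and it uses Python's `re` module rather than Java's regex engine; this particular pattern is written the same way in both. The sample values are made up.

```python
import re

# The email pattern from the example above.
EMAIL = re.compile(r"([a-zA-Z0-9_\-\.\+]+)@([a-zA-Z0-9\-\.]+)\.([a-zA-Z0-9\-]{2,15})")

def split_by_filter(values):
    """Return (matching, inverted) lists, mimicking a filter and its inverse."""
    matching = [v for v in values if EMAIL.search(v)]
    inverted = [v for v in values if not EMAIL.search(v)]
    return matching, inverted

cells = ["alice@example.com", "not an email", "", "bob.smith+tag@mail.example.org"]
matching, inverted = split_by_filter(cells)
print(matching)
print(inverted)
```

The inverted list holds the empty cell as well as the malformed value, which mirrors what pressing “invert” is described as showing.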
|
||||
|
@ -182,7 +182,7 @@ Returns the array of strings obtained by splitting s into substrings with the gi
|
||||
|
||||
Returns the array of strings obtained by splitting s by sep, or by guessing either tab or comma separation if there is no sep given. Handles quotes properly and understands cancelled characters. The separator can be either a string or a regex pattern. For example, `value.smartSplit("\n")` will split at a carriage return or a new-line character.
|
||||
|
||||
Note: `value.[escape](#escapes-s-mode)('javascript')` is useful for previewing unprintable characters prior to using smartSplit().
|
||||
Note: [`value.escape('javascript')`](#escapes-s-mode) is useful for previewing unprintable characters prior to using smartSplit().
|
||||
|
||||
###### splitByCharType(s)
|
||||
|
||||
|
@ -45,7 +45,7 @@ For the absolute latest development updates, see the [snapshot releases](https:/
|
||||
|
||||
#### What’s changed
|
||||
|
||||
Our [latest version is OpenRefine 3.4.1](https://github.com/OpenRefine/OpenRefine/releases/tag/3.4.1), released September 24th 2020. The major changes in this version are listed on the [3.4 release page](https://github.com/OpenRefine/OpenRefine/releases/tag/3.4.1) with the downloadable packages.
|
||||
Our [latest version is OpenRefine 3.4.1](https://github.com/OpenRefine/OpenRefine/releases/tag/3.4.1), released September 24th 2020. The major changes in this version are listed on the [3.4.1 release page](https://github.com/OpenRefine/OpenRefine/releases/tag/3.4.1) with the downloadable packages.
|
||||
|
||||
You can find information about all OpenRefine versions on the [Releases page on Github](https://github.com/OpenRefine/OpenRefine/releases).
|
||||
|
||||
@ -70,7 +70,7 @@ Take note of the [extensions](#installing-extensions) you have currently install
|
||||
|
||||
[Java Development Kit (JDK)](https://jdk.java.net/) is required to run OpenRefine and should be installed first. [OpenRefine installation packages for Mac and Windows come bundled with JDK](https://openrefine.org/download.html), so you do not need to install it separately if you use those bundles.
|
||||
|
||||
There are JDK packages for Mac, Windows, and Linux. We recommend you install the latest “Ready for use” version: at the time of writing, this is [JDK 14.0.1](https://jdk.java.net/14/).
|
||||
There are JDK packages for Mac, Windows, and Linux. We recommend you install the latest “Ready for use” version. At the time of writing, this is [JDK 14.0.1](https://jdk.java.net/14/).
|
||||
|
||||
Download the archive (either a `.tar.gz` or a `.zip`) to your computer and then extract its contents to a location of your choice. There is no installation process, so you may wish to extract this folder directly into a place where you put program files, or another stable folder.
|
||||
|
||||
@ -91,16 +91,16 @@ import TabItem from '@theme/TabItem';
|
||||
|
||||
<TabItem value="win">
|
||||
|
||||
1. On Windows 10, click the Windows start menu button, type “env,” and look at the search results. Click “Edit the system environment variables.” (If you are using an earlier version of Windows, use the “Search” or “Search programs and files” box in the start menu.)
|
||||
1. On Windows 10, click the Start Menu button, type `env`, and look at the search results. Click <span class="buttonLabels">Edit the system environment variables</span>. (If you are using an earlier version of Windows, use the “Search” or “Search programs and files” box in the Start Menu.)
|
||||
|
||||
![A screenshot of the search results for 'env'.](/img/env.png "A screenshot of the search results for 'env'.")
|
||||
|
||||
2. Click “Environment Variables…” at the bottom of the “Advanced” window that appears.
|
||||
3. In the “Environment Variables” dialog that appears, click “New…” and create a variable with the key `JAVA_HOME`. You can set the variable for only your user account, as in the screenshot below, or set it as a system variable - it will work either way.
|
||||
2. Click <span class="buttonLabels">Environment Variables…</span> at the bottom of the <span class="tabLabels">Advanced</span> window.
|
||||
3. In the <span class="tabLabels">Environment Variables</span> window that appears, click <span class="buttonLabels">New…</span> and create a variable with the key `JAVA_HOME`. You can set the variable for only your user account, as in the screenshot below, or set it as a system variable - it will work either way.
|
||||
|
||||
![A screenshot of 'Environment Variables'.](/img/javahome.png "A screenshot of 'Environment Variables'.")
|
||||
|
||||
4. Set the `Value` to the folder where you installed JDK, in the format `D:\Programs\OpenJDK`. You can locate this folder with the “Browse directory...” button.
|
||||
4. Set the `Value` to the folder where you installed JDK, in the format `D:\Programs\OpenJDK`. You can locate this folder with the <span class="buttonLabels">Browse directory...</span> button.
|
||||
|
||||
</TabItem>
|
||||
|
||||
@ -174,7 +174,7 @@ Save and close the file. When you are back in the terminal, type
|
||||
source /etc/environment
|
||||
```
|
||||
|
||||
Exit the terminal and restart your system. You can then check that JAVA_HOME is set properly by opening another terminal and typing
|
||||
Exit the terminal and restart your system. You can then check that `JAVA_HOME` is set properly by opening another terminal and typing
|
||||
```
|
||||
echo $JAVA_HOME
|
||||
```
|
||||
@ -208,7 +208,9 @@ If you have extensions installed, do not delete the `webapp\extensions` folder w
|
||||
|
||||
<TabItem value="win">
|
||||
|
||||
Once you have downloaded the `.zip` file, extract it into a folder where you wish to store program files (such as `D:\Program Files\OpenRefine`). You can right-click on `openrefine.exe` or `refine.bat` and pin one of those programs to your Start Menu or create shortcuts for easier access.
|
||||
Once you have downloaded the `.zip` file, extract it into a folder where you wish to store program files (such as `D:\Program Files\OpenRefine`).
|
||||
|
||||
You can right-click on `openrefine.exe` or `refine.bat` and pin one of those programs to your Start Menu or create shortcuts for easier access.
|
||||
|
||||
</TabItem>
|
||||
|
||||
@ -311,7 +313,7 @@ tar xzf openrefine-linux-3.4.tar.gz
|
||||
|
||||
### Set where data is stored
|
||||
|
||||
OpenRefine stores data in two places: program files in the program directory, wherever it is you’ve installed it; and project files in what we call the “workspace directory.” You can access this folder easily from OpenRefine by going to the [home screen](running#the-home-screen) (at [http://127.0.0.1:3333/](http://127.0.0.1:3333/)) and clicking “Browse workspace directory.”
|
||||
OpenRefine stores data in two places: program files in the program directory, wherever it is you’ve installed it; and project files in what we call the “workspace directory.” You can access this folder easily from OpenRefine by going to the [home screen](running#the-home-screen) (at [http://127.0.0.1:3333/](http://127.0.0.1:3333/)) and clicking <span class="buttonLabels">Browse workspace directory</span>.
|
||||
|
||||
By default this is:
|
||||
|
||||
@ -359,7 +361,7 @@ If the folder does not exist, OpenRefine will create it.
|
||||
~/Library/Application Support/OpenRefine/
|
||||
```
|
||||
|
||||
For older versions as Google Refine:
|
||||
For older versions, as Google Refine:
|
||||
|
||||
```
|
||||
~/Library/Application Support/Google/Refine/
|
||||
@ -418,7 +420,7 @@ You can access OpenRefine server logs from the terminal on Mac:
|
||||
|
||||
## Increasing memory allocation
|
||||
|
||||
OpenRefine relies on having computer memory available to it to work effectively. If you are planning to work with large data sets, you may wish to set up OpenRefine to handle it at the outset. By “large” we generally mean one of the following indicators:
|
||||
OpenRefine relies on having computer memory available to it to work effectively. If you are planning to work with large datasets, you may wish to set up OpenRefine to handle it at the outset. By “large” we generally mean one of the following indicators:
|
||||
* more than one million total cells
|
||||
* an input file size of more than 50 megabytes (MB)
|
||||
* more than 50 [rows per record in records mode](running#records-mode)
|
||||
@ -430,7 +432,7 @@ A good practice is to start with no more than 50% of whatever memory is left ove
|
||||
All of the settings below use a four-digit number to specify the megabytes (MB) used (actually [mebibytes](https://en.wikipedia.org/wiki/Mebibyte)). The default is usually 1024MB, but the new value doesn't need to be a multiple of 1024.
|
||||
|
||||
:::info Dealing with large datasets
|
||||
If your project is big enough to need more than the default amount of memory, consider turning off “Parse cell text into numbers, dates, ...” on import. It's convenient, but less efficient than explicitly converting any columns that you need as a data type other than the default “string” type.
|
||||
If your project is big enough to need more than the default amount of memory, consider turning off <span class="fieldLabels">Parse cell text into numbers, dates, ...</span> on import. It's convenient, but less efficient than explicitly converting any columns that you need as a data type other than the default “string” type.
|
||||
:::
|
||||
|
||||
<Tabs
|
||||
@ -464,7 +466,7 @@ Once you increase the memory allocation, you may find that you cannot run `openr
|
||||
|
||||
On Windows, OpenRefine can also be run by using the file `refine.bat` in the program directory. If you start OpenRefine using `refine.bat`, the memory available to OpenRefine can be specified either through command line options, or through the `refine.ini` file.
|
||||
|
||||
To set the maximum amount of memory on the command line when using `refine.bat`, "cd" to the program directory, then type
|
||||
To set the maximum amount of memory on the command line when using `refine.bat`, `cd` to the program directory, then type
|
||||
|
||||
```refine.bat /m 2048m```
|
||||
|
||||
@ -523,7 +525,7 @@ If you’d like to create or modify an extension, [see our developer documentati
|
||||
|
||||
### Two ways to install extensions
|
||||
|
||||
You can [install extensions in one of two places](installing#set-where-data-is-stored):
|
||||
You can [install extensions in one of two places](#set-where-data-is-stored):
|
||||
|
||||
* Into your OpenRefine program folder, so they will only be available to that version/installation of OpenRefine (meaning the extension will not run if you upgrade OpenRefine), or
|
||||
* Into your workspace, where your projects are stored, so they will be available no matter which version of OpenRefine you’re using.
|
||||
@ -532,7 +534,7 @@ We provide these options because you may wish to reinstall a given extension man
|
||||
|
||||
### Find the right place to install
|
||||
|
||||
If you want to install the extension into the program folder, go to your program directory and then go to `/webapp/extensions` (or create it if it does not exist).
|
||||
If you want to install the extension into the program folder, go to your program directory and then go to `webapp\extensions` (or create it if it does not exist).
|
||||
|
||||
If you want to install the extension into your workspace, you can:
|
||||
* launch OpenRefine and click <span class="menuItems">Open Project</span> in the sidebar
|
||||
@ -540,7 +542,7 @@ If you want to install the extension into your workspace, you can:
|
||||
* A file-explorer or finder window will open in your workspace
|
||||
* Create a new folder called “extensions” inside the workspace if it does not exist.
|
||||
|
||||
You can also [find your workspace on each operating system using these instructions](installing#set-where-data-is-stored).
|
||||
You can also [find your workspace on each operating system using these instructions](#set-where-data-is-stored).
|
||||
|
||||
### Install the extension
|
||||
|
||||
@ -551,7 +553,7 @@ Some extensions may have multiple versions, to match OpenRefine versions, so be
|
||||
Generally, the installation process will be:
|
||||
|
||||
* Download the extension (usually as a zip file from GitHub)
|
||||
* Extract the zip contents into the `extensions` directory, making sure all the contents go into one folder with the name of the extension
|
||||
* Extract the zip contents into the “extensions” directory, making sure all the contents go into one folder with the name of the extension
|
||||
* Start (or restart) OpenRefine.
|
||||
|
||||
To confirm that installation was a success, follow the instructions provided by the extension. Each extension will appear in its own way inside the OpenRefine interface: make sure you read the documentation to know where the functionality will appear, such as under specific dropdown menus.
|
||||
To confirm that installation was a success, follow the instructions provided by the extension. Each extension will appear in its own way inside the OpenRefine interface. Make sure you read its documentation to know where the functionality will appear, such as under specific dropdown menus.
|
@ -1,4 +1,4 @@
|
||||
---
|
||||
---
|
||||
id: running
|
||||
title: Running OpenRefine
|
||||
sidebar_label: Running
|
||||
@ -8,9 +8,9 @@ sidebar_label: Running
|
||||
|
||||
OpenRefine does not require internet access to run its basic functions. Once you download and install it, it runs as a small web server on your own computer, and you access that local web server by using your browser.
|
||||
|
||||
You will see a command line window open when you run OpenRefine. Leave that window alone while you work on datasets in your browser.
|
||||
You will see a command line window open when you run OpenRefine. Ignore that window while you work on datasets in your browser.
|
||||
|
||||
No matter how you load OpenRefine, it will load in your computer’s default browser. If you would like to use another browser instead, start OpenRefine and then point your chosen browser at the home screen: http://127.0.0.1:3333/.
|
||||
No matter how you start OpenRefine, it will load its interface in your computer’s default browser. If you would like to use another browser instead, start OpenRefine and then point your chosen browser at the home screen: [http://127.0.0.1:3333/](http://127.0.0.1:3333/).
|
||||
|
||||
OpenRefine works best on browsers based on Webkit, such as:
|
||||
* Google Chrome
|
||||
@ -20,7 +20,7 @@ OpenRefine works best on browsers based on Webkit, such as:
|
||||
|
||||
We are aware of some minor rendering and performance issues on other browsers such as Firefox. We don't support Internet Explorer.
|
||||
|
||||
You can launch multiple projects at the same time by simply having multiple tabs or browser windows open. From the <span class="menuItems">Open Project</span> screen, you can right-click on project names and open them in new tabs or windows.
|
||||
You can view and work on multiple projects at the same time by simply having multiple tabs or browser windows open. From the <span class="menuItems">Open Project</span> screen, you can right-click on project names and open them in new tabs or windows.
|
||||
|
||||
import Tabs from '@theme/Tabs';
|
||||
import TabItem from '@theme/TabItem';
|
||||
@ -37,21 +37,28 @@ import TabItem from '@theme/TabItem';
|
||||
|
||||
<TabItem value="win">
|
||||
|
||||
To exit OpenRefine, close all the browser tabs or windows, then navigate to the command line window. To close this window and ensure OpenRefine exits properly, hold down `Control` and press `C` on your keyboard. This will save any last changes to your projects.
|
||||
|
||||
#### With openrefine.exe
|
||||
You can run OpenRefine by double-clicking `openrefine.exe` or calling it from the command line. If you want to [modify the way `openrefine.exe` opens](#starting-with-modifications), you can edit the `openrefine.l4j.ini` file.
|
||||
You can run OpenRefine by double-clicking `openrefine.exe` or calling it from the command line.
|
||||
|
||||
If you want to [modify the way `openrefine.exe` opens](#starting-with-modifications), you can edit the `openrefine.l4j.ini` file.
|
||||
|
||||
#### With refine.bat
|
||||
On Windows, OpenRefine can also be run by using the file `refine.bat` in the program directory. If you start OpenRefine using `refine.bat`, you can do so by opening the file itself, or by calling it from the command line.
|
||||
|
||||
If you call `refine.bat` from the command line, you can [start OpenRefine with modifications](#starting-with-modifications). If you want to modify the way `refine.bat` opens through double-clicking or using a shortcut, you can edit the `refine.ini` file.
|
||||
If you call `refine.bat` from the command line, you can [start OpenRefine with modifications](#starting-with-modifications).
|
||||
If you want to modify the way `refine.bat` opens through double-clicking or using a shortcut, you can edit the `refine.ini` file.
|
||||
|
||||
#### Exiting
|
||||
|
||||
To exit OpenRefine, close all the browser tabs or windows, then navigate to the command line window. To close this window and ensure OpenRefine exits properly, hold down `Control` and press `C` on your keyboard. This will save any last changes to your projects.
|
||||
|
||||
</TabItem>
|
||||
|
||||
<TabItem value="mac">
|
||||
|
||||
You can find OpenRefine in your Applications folder, or you can call it from the command line. To exit, close all your OpenRefine browser tabs, go back to the terminal window and press `Command` and `Q` to close it down.
|
||||
You can find OpenRefine in your Applications folder, or you can call it from the command line with `./refine`.
|
||||
|
||||
To exit, close all your OpenRefine browser tabs, go back to the terminal window and press `Command` and `Q` to close it down.
|
||||
|
||||
:::caution Problems starting?
|
||||
If you are using an older version of OpenRefine or are on an older version of MacOS, [check our Wiki for solutions to problems with MacOS](https://github.com/OpenRefine/OpenRefine/wiki/Installation-Instructions#macos).
|
||||
@ -64,7 +71,7 @@ If you are using an older version of OpenRefine or are on an older version of Ma
|
||||
Use a terminal to launch OpenRefine. First, navigate to the installation folder. Then call the program:
|
||||
|
||||
```
|
||||
cd openrefine-3.4
|
||||
cd openrefine-3.4.1
|
||||
./refine
|
||||
```
|
||||
|
||||
@ -134,9 +141,9 @@ To see the full list of command-line options, run `./refine -h`.
|
||||
|-m|Memory maximum heap|./refine -m 6000M|
|
||||
|-p|Port|./refine -p 3334|
|
||||
|-i|Interface (IP address, or IP and port)|./refine -i 127.0.0.2:3334|
|
||||
|-k|Add a Google API key|_need an example_|
|
||||
|-v|Verbosity (from low to high)|error,warn,info,debug,trace|
|
||||
|-x|Additional configuration parameters|_need an example_|
|
||||
|-k|Add a Google API key|./refine -k YOUR_API_KEY|
|
||||
|-v|Verbosity (from low to high: error,warn,info,debug,trace)|./refine -v info|
|
||||
|-x|Additional configuration parameters||
|
||||
|--debug|Enable debugging (on port 8000)|./refine --debug|
|
||||
|--jmx|Enable JMX monitoring for Jconsole and JvisualVM|./refine --jmx|
|
||||
|
||||
@ -153,9 +160,9 @@ To see the full list of command-line options, run `./refine -h`.
|
||||
|-m|Memory maximum heap|./refine -m 6000M|
|
||||
|-p|Port|./refine -p 3334|
|
||||
|-i|Interface (IP address, or IP and port)|./refine -i 127.0.0.2:3334|
|
||||
|-k|Add a Google API key|_need an example_|
|
||||
|-v|Verbosity (from low to high)|error,warn,info,debug,trace|
|
||||
|-x|Additional configuration parameters|_need an example_|
|
||||
|-k|Add a Google API key|./refine -k YOUR_API_KEY|
|
||||
|-v|Verbosity (from low to high: error,warn,info,debug,trace)|./refine -v info|
|
||||
|-x|Additional configuration parameters||
|
||||
|--debug|Enable debugging (on port 8000)|./refine --debug|
|
||||
|--jmx|Enable JMX monitoring for Jconsole and JvisualVM|./refine --jmx|
|
||||
|
||||
@ -168,7 +175,9 @@ To see the full list of command-line options, run `./refine -h`.
|
||||
#### Modifications set within files
|
||||
|
||||
On Windows, you can modify the way `openrefine.exe` runs by editing `openrefine.l4j.ini`; you can modify the way `refine.bat` runs by editing `refine.ini`.
|
||||
You can modify the Mac application by editing `Info.plist`.
|
||||
|
||||
You can modify the Mac application by editing `info.plist`.
|
||||
|
||||
On Linux, you can edit `refine.ini`.
|
||||
|
||||
Some settings, such as changing memory allocations, are already set inside these files, and all you have to do is change the values. Some lines need to be un-commented to work.
|
||||
@ -189,16 +198,18 @@ REFINE_MIN_MEMORY=1400M
|
||||
...
|
||||
```
|
||||
|
||||
Further modifications can be performed by using JVM preferences.
|
||||
##### JVM preferences
|
||||
|
||||
These JVM preferences are different options and have different syntax than the key/value descriptions used on the command line. Some of the most common keys (with their defaults) are:
|
||||
* -Drefine.autosave (5 [minutes])
|
||||
* -Drefine.data_dir (/)
|
||||
* -Drefine.development (false)
|
||||
* -Drefine.headless (false)
|
||||
* -Drefine.host (127.0.0.1)
|
||||
* -Drefine.port (3333)
|
||||
* -Drefine.webapp (main/webapp)
|
||||
Further modifications can be performed by using JVM preferences. These JVM preferences are different options and have different syntax than the key/value descriptions used on the command line.
|
||||
|
||||
Some of the most common keys (with their defaults) are:
|
||||
* The project [autosave](starting#autosaving) frequency: `-Drefine.autosave` (5 [minutes])
|
||||
* The workspace directory: `-Drefine.data_dir` (/)
|
||||
* Development mode: `-Drefine.development` (false)
|
||||
* Headless mode: `-Drefine.headless` (false)
|
||||
* IP: `-Drefine.host` (127.0.0.1)
|
||||
* Port: `-Drefine.port` (3333)
|
||||
* The application folder: `-Drefine.webapp` (main/webapp)
|
||||
|
||||
The syntax is as follows:
|
||||
|
||||
@ -214,7 +225,7 @@ The syntax is as follows:
|
||||
|
||||
<TabItem value="win">
|
||||
|
||||
Inside the `refine.l4j.ini` file, insert lines in this way:
|
||||
Locate the `refine.l4j.ini` file, and insert lines in this way:
|
||||
|
||||
```
|
||||
-Drefine.port=3334
|
||||
@ -232,9 +243,11 @@ JAVA_OPTIONS=-Drefine.data_dir=C:\Users\user\Documents\OpenRefine\ -Drefine.port
|
||||
|
||||
<TabItem value="mac">
|
||||
|
||||
Find the 'array' element that follows the line:
|
||||
Locate the `info.plist`, and find the `array` element that follows the line
|
||||
|
||||
`<key>JVMOptions</key>`
|
||||
```
|
||||
<key>JVMOptions</key>
|
||||
```
|
||||
|
||||
Typically this looks something like:
|
||||
|
||||
@ -248,7 +261,7 @@ Typically this looks something like:
|
||||
</array>
|
||||
```
|
||||
|
||||
Add in values like:
|
||||
Add in values such as:
|
||||
|
||||
```
|
||||
<key>JVMOptions</key>
|
||||
@ -267,7 +280,7 @@ Add in values like:
|
||||
|
||||
<TabItem value="linux">
|
||||
|
||||
In `refine.ini`, add `JAVA_OPTIONS=` before the `-Drefine.preference` declaration. You can un-comment and edit the existing suggested lines, or add lines:
|
||||
Locate the `refine.ini` file, and add `JAVA_OPTIONS=` before the `-Drefine.preference` declaration. You can un-comment and edit the existing suggested lines, or add lines:
|
||||
|
||||
```
|
||||
JAVA_OPTIONS=-Drefine.autosave=2
|
||||
@ -288,9 +301,11 @@ Refer to the [official Java documentation](https://docs.oracle.com/javase/8/docs
|
||||
|
||||
When you first launch OpenRefine, you will see a screen with a menu on the left hand side that includes <span class="menuItems">Create Project</span>, <span class="menuItems">Open Project</span>, <span class="menuItems">Import Project</span>, and <span class="menuItems">Language Settings</span>. This is called the “home screen,” where you can manage your projects and general settings.
|
||||
|
||||
In the lower left-hand corner of the screen, you'll see <span class="menuItems">Preferences</span>, <span class="menuItems">Help</span>, and <span class="menuItems">About</span>.
|
||||
|
||||
### Language settings
|
||||
|
||||
You can set your preferred interface language here. This language setting will persist until you change it again in the future. Languages are translated as a community effort; some languages are partially complete and default back to English where unfinished. Currently OpenRefine supports the following languages for 75% or more of the interface:
|
||||
From the home screen, look in the options to the left for <span class="menuItems">Language Settings</span>. You can set your preferred interface language here. This language setting will persist until you change it again in the future. Languages are translated as a community effort; some languages are partially complete and default back to English where unfinished. Currently OpenRefine supports the following languages for 75% or more of the interface:
|
||||
|
||||
* Cebuano
|
||||
* German
|
||||
@ -307,13 +322,15 @@ You can set your preferred interface language here. This language setting will p
|
||||
* Tagalog
|
||||
* Chinese (简体中文)
|
||||
|
||||
To leave the Language Settings screen, click on the diamond “OpenRefine” logo.
|
||||
|
||||
:::info
|
||||
We use Weblate to provide translations for the interface. You can check [our profile on Weblate](https://hosted.weblate.org/projects/openrefine/translations/) to see which languages are in the process of being supported. See [our technical reference if you are interested in contributing translation work](https://docs.openrefine.org/technical-reference/translating) to make OpenRefine accessible to people in other languages.
|
||||
:::
|
||||
|
||||
### Preferences
|
||||
|
||||
At this time you can set preferences using a key/value pair: that is, selecting one of the keys below and setting a value for it.
|
||||
In the bottom left corner of the screen, look for <span class="menuItems">Preferences</span>. At this time you can set preferences using a key/value pair: that is, selecting one of the keys below and setting a value for it.
|
||||
|
||||
|Setting|Key|Value syntax|Default|Example|
|
||||
|---|---|---|---|---|
|
||||
@ -337,7 +354,7 @@ The project screen (or work screen) is where you will spend most of your time on
|
||||
|
||||
The project bar runs across the very top of the project screen. It contains the OpenRefine logo, the project title, and the project control buttons on the right side.
|
||||
|
||||
At any time you can close your current project and go back to the home screen by clicking on the OpenRefine logo. If you’d like to open another project in a new browser tab or window, you can right-click on the logo and use “Open in a new tab.” You will lose your current facets and view settings if you close your project (but data transformations will be saved in the [History](#history-undoredo) of the project).
|
||||
At any time you can close your current project and go back to the home screen by clicking on the OpenRefine logo. If you’d like to open another project in a new browser tab or window, you can right-click on the logo and use “Open in a new tab.” You will lose [your current facets and view settings](#facetfilter) if you close your project (but data transformations will be saved in the [History](#history-undoredo) of the project).
|
||||
|
||||
:::caution
|
||||
Don’t click the “back” button on your browser - it will likely close your current project and you will lose your facets and view settings.
|
||||
@ -345,21 +362,21 @@ Don’t click the “back” button on your browser - it will likely close your
|
||||
|
||||
You can rename a project at any time by clicking inside the project title, which will turn into a text field. Project names don’t have to be unique, as OpenRefine organizes them based on a unique identifier behind the scenes.
|
||||
|
||||
The <span class="menuItems">Permalink</span> allows you to return to a project at a specific view state - that is, with facets and filters applied. The permalink can help you pick up where you left off if you have to close your project while working with facets and filters. It puts view-specific information directly into the URL: clicking on it will load this current-view URL in the existing tab. You can right-click and copy the Permalink URL to copy the current view state to your clipboard, without refreshing the tab you’re using.
|
||||
The <span class="menuItems">Permalink</span> allows you to return to a project at a specific view state - that is, with [facets and filters](facets) applied. The <span class="menuItems">Permalink</span> can help you pick up where you left off if you have to close your project while working with facets and filters. It puts view-specific information directly into the URL: clicking on it will load this current-view URL in the existing tab. You can right-click and copy the <span class="menuItems">Permalink</span> URL to copy the current view state to your clipboard, without refreshing the tab you’re using.
|
||||
|
||||
The <span class="menuItems">Open…</span> button will open up a new browser tab showing the <span class="menuItems">Create Project</span> screen. From here you can change settings, start a new project, or open an existing project.
|
||||
|
||||
<span class="menuItems">Export</span> is a dropdown menu that allows you to pick a format for exporting your current dataset. It will only export rows and records that are currently visible - the currently selected facets and filters, not the total data in the project.
|
||||
<span class="menuItems">Export</span> is a dropdown menu that allows you to pick a format for exporting a dataset. Many of the export options will only export rows and records that are currently visible - the currently selected facets and filters, not the total data in the project.
|
||||
|
||||
<span class="menuItems">Help</span> will open up a new browser tab and bring you to this user manual on the web.
|
||||
|
||||
### The grid header
|
||||
|
||||
The grid header sits below the project bar and above the project grid (the data of your project). The grid header will tell you the total number of rows or records in your project, and indicate whether you are in rows or records mode.
|
||||
The grid header sits below the project bar and above the project grid (where the data of your project is displayed). The grid header will tell you the total number of rows or records in your project, and indicate whether you are in [rows or records mode](exploring#rows-vs-records).
|
||||
|
||||
It will also tell you if you’re currently looking at a select number of rows via facets or filtering, rather than the entire dataset, by displaying either, for example, “180 rows” or “67 matching rows (180 total).”
|
||||
|
||||
Directly below the row number, you have the ability to switch between [row mode and records mode](exploring#rows-vs-records). OpenRefine stores which projects are in records mode, and displays your data as records by default if you are.
|
||||
Directly below the row number, you have the ability to switch between [row mode and records mode](exploring#rows-vs-records). OpenRefine stores each project persistently in one of the two modes, and displays your data as records by default if the project was left in records mode.
|
||||
|
||||
To the right of the rows/records selection is the array of options for how many rows/records to view on screen at one time. At the far right of the screen you can navigate through your entire dataset one page at a time.
|
||||
|
||||
@ -369,13 +386,13 @@ The <span class="menuItems">Extensions</span> dropdown offers you options for ex
|
||||
|
||||
### The grid
|
||||
|
||||
The area of the project screen that displays your dataset is called the “project grid” (or the “data grid,” or simply the “grid”). The grid presents data in a tabular format, which may look like a normal spreadsheet program to you.
|
||||
The area of the project screen that displays your dataset is called the “grid” (or the “data grid,” or the “project grid”). The grid presents data in a tabular format, which may look like a normal spreadsheet program to you.
|
||||
|
||||
Column widths are automatically set based on their contents; some column headers may be cut off, but can be viewed by mousing over the headers.
|
||||
|
||||
In each column header you will see a small arrow. Clicking on this arrow brings up a dropdown menu containing column-specific data exploration and transformation options. You will learn about each of these options in the [Exploring data](exploring) and [Transforming data](transforming) sections.
|
||||
|
||||
The first column in every project will always be “All,” which contains options to flag, star, and do non-column-specific operations. The “All” column is also where rows/records are numbered.
|
||||
The first column in every project will always be <span class="menuItems">All</span>, which contains options to flag, star, and do non-column-specific operations. The <span class="menuItems">All</span> column is also where rows/records are numbered. Numbering shows the permanent order of rows and records; a temporary sorting or facet may reorder the rows or show a limited set, but numbering will show you the original identifiers unless you make a permanent change.
|
||||
|
||||
The project grid may display with both vertical and horizontal scrolling, depending on the number and width of columns, and the number of rows/records displayed. You can control the display of the project grid by using [Sort and View options](exploring#sort-and-view).
|
||||
|
||||
@ -383,17 +400,19 @@ Mousing over individual cells will allow you to [edit cells individually](celled
|
||||
|
||||
### Facet/Filter
|
||||
|
||||
The Facet/Filter tab is one of the main ways of exploring your data: displaying the patterns and trends in your data, and helping you narrow your focus and modify that data. [Facets](facets) and [filters](facets#text-filter) are explained more in [Exploring data](exploring).
|
||||
The <span class="tabLabels">Facet/Filter</span> tab is one of the main ways of exploring your data: displaying the patterns and trends in your data, and helping you narrow your focus and modify that data. [Facets](facets) and [filters](facets#text-filter) are explained more in [Exploring data](exploring).
|
||||
|
||||
![A screenshot of facets and filters in action.](/img/facetfilter.png)
|
||||
|
||||
In the interface, you will see three buttons: <span class="menuItems">Refresh</span>, <span class="menuItems">Reset all</span>, and <span class="menuItems">Remove all</span>. Refreshing your facets will ensure you are looking at the latest information about each facet, if you have changed the counts or eliminated some options, for example.
|
||||
In the tab, you will see three buttons: <span class="menuItems">Refresh</span>, <span class="menuItems">Reset all</span>, and <span class="menuItems">Remove all</span>.
|
||||
|
||||
Resetting your facets will remove any inclusion or exclusion you may have set - the facet options will stay in the sidebar, but your view settings will be reset.
|
||||
Refreshing your facets will ensure you are looking at the latest information about each facet, for example if you have changed the counts or eliminated some options.
|
||||
|
||||
Removing your facets will clear out the sidebar entirely. If you have written custom facets using expressions, these will be lost.
|
||||
Resetting your facets will remove any inclusion or exclusion you may have set - the facet options will stay in the sidebar, but your view settings will be undone.
|
||||
|
||||
You can preserve your facets and filters for future use by copying a [Permalink](#the-project-bar).
|
||||
Removing your facets will clear out the sidebar entirely. If you have written custom facets using [expressions](expressions), these will be lost.
|
||||
|
||||
You can preserve your facets and filters for future use by copying a <span class="menuItems">[Permalink](#the-project-bar)</span>.
|
||||
|
||||
### History (Undo/Redo)
|
||||
|
||||
@ -403,7 +422,7 @@ Project history gets saved when you export a project archive, and restored when
|
||||
|
||||
![A screenshot of the History (Undo/Redo) tab with 13 steps.](/img/history.png "A screenshot of the History (Undo/Redo) tab with 13 steps.")
|
||||
|
||||
When you click on <span class="menuItems">Undo / Redo</span> in the sidebar of any project, that project’s history is shown as a list of changes in order, with the first “change” being the action of creating the project itself. (That first change, indexed as step zero, cannot be undone.) Here is a sample history with 3 changes:
|
||||
When you click on the <span class="tabLabels">Undo / Redo</span> tab in the sidebar of any project, that project’s history is shown as a list of changes in order, with the first “change” being the action of creating the project itself. (That first change, indexed as step zero, cannot be undone.) Here is a sample history with 3 changes:
|
||||
|
||||
```
|
||||
0. Create project
|
||||
@ -420,15 +439,15 @@ In this example, changes #2 and #3 will now be grayed out. You can redo a change
|
||||
|
||||
If you have moved back one or more states, and then you perform a new operation on your data, the later actions (everything that’s greyed out) will be erased and cannot be re-applied.
|
||||
|
||||
The Undo/Redo tab will show you which step you’re on, and if you’re about to risk erasing work - by saying something like “4/5" or “1/7” at the end.
|
||||
The Undo/Redo tab will indicate which step you’re on, and if you’re about to risk erasing work - by saying something like “4/5” or “1/7” at the end.
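The undo/redo behaviour described above can be modelled as a list of steps plus a pointer to the current step. The following is a minimal sketch for illustration only, not OpenRefine's actual implementation; the `History` class and the operation names are hypothetical.

```python
class History:
    """Toy model of a linear undo/redo history (not OpenRefine's code)."""

    def __init__(self):
        self.steps = ["Create project"]  # step 0 cannot be undone
        self.position = 0                # index of the current step

    def apply(self, description):
        # A new operation discards every undone (greyed-out) step
        # that comes after the current one.
        del self.steps[self.position + 1:]
        self.steps.append(description)
        self.position += 1

    def go_to(self, index):
        # Undo or redo by jumping to any step that still exists.
        if not 0 <= index < len(self.steps):
            raise IndexError("no such step")
        self.position = index

    def label(self):
        # Mirrors the "4/5"-style indicator: current step / total changes.
        return f"{self.position}/{len(self.steps) - 1}"


history = History()
for name in ["Remove column", "Split cells", "Trim whitespace"]:
    history.apply(name)

history.go_to(1)                 # undo back to the first change
print(history.label())           # 1/3

history.apply("Rename column")   # the two undone changes are erased
print(history.label())           # 2/2
```

Once a new operation is applied from an earlier step, the discarded steps are gone from the list, which is why they cannot be re-applied.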
|
||||
|
||||
#### Reusing operations
|
||||
|
||||
Operations that you perform in OpenRefine can be reused. For example, a formula you wrote inside one project can be copied and applied to another project later.
|
||||
|
||||
To reuse one or more operations, you first extract it from the project where it was first applied. Click to the Undo/Redo tab and click <span class="menuItems">Extract…</span>. This brings up a box that lists all operations up to the current state (it does not show undone operations). Select the operation or operations you want to extract using the checkboxes on the left, and they will be encoded as JSON on the right. Copy that JSON off to the clipboard.
|
||||
To reuse one or more operations, first extract them from the project where they were first applied. Go to the <span class="tabLabels">Undo/Redo</span> tab and click <span class="menuItems">Extract…</span>. This brings up a box that lists all operations up to the current state (it does not show undone operations). Select the operation or operations you want to extract using the checkboxes on the left, and they will be encoded as JSON on the right. Copy that JSON to the clipboard.
|
||||
|
||||
Move to the second project, go to the Undo/Redo tab, click <span class="menuItems">Apply…</span> and paste in that JSON.
|
||||
Move to the second project, go to the <span class="tabLabels">Undo/Redo</span> tab, click <span class="menuItems">Apply…</span> and paste in that JSON.
|
||||
|
||||
Not all operations can be extracted. Edits to a single cell, for example, can’t be replicated.
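As a rough illustration of what the extracted JSON can look like, here is a sketch that loads one operation and points it at a differently named column before it is pasted into a second project. The JSON below is an assumed example: the exact fields vary by operation and OpenRefine version, and the column names and the `retarget` helper are hypothetical.

```python
import copy
import json

# Assumed shape of one extracted operation; check the JSON that your own
# project produces, as fields differ between operations and versions.
extracted = """
[
  {
    "op": "core/text-transform",
    "engineConfig": {"facets": [], "mode": "row-based"},
    "columnName": "title",
    "expression": "grel:value.toUppercase()",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10
  }
]
"""

operations = json.loads(extracted)


def retarget(ops, old_column, new_column):
    """Return a copy of ops with one column name swapped for another."""
    result = copy.deepcopy(ops)
    for op in result:
        if op.get("columnName") == old_column:
            op["columnName"] = new_column
    return result


retargeted = retarget(operations, "title", "name")
print(json.dumps(retargeted, indent=2))  # text to paste into Apply…
```

Editing the JSON in a text editor works just as well; the point is only that the extracted operations are plain data that can be adjusted before being applied elsewhere.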
|
||||
|
||||
@ -477,7 +496,6 @@ Some users may wish to employ OpenRefine for batch processing as part of a large
|
||||
The following are all third-party extensions and code; the OpenRefine team does not maintain them and cannot guarantee that any of them work.
|
||||
:::
|
||||
|
||||
|
||||
Some examples:
|
||||
|
||||
* This project allows OpenRefine to be run from the command line using [operations saved in a JSON file](running#reusing-operations): [OpenRefine batch processing](https://github.com/opencultureconsulting/openrefine-batch)
|
||||
@ -485,6 +503,4 @@ Some examples:
|
||||
* And the same in Ruby: [Refine-Ruby](https://github.com/maxogden/refine-ruby)
|
||||
* Another Python client library, by Paul Makepeace: [OpenRefine Python Client Library](https://github.com/PaulMakepeace/refine-client-py)
|
||||
|
||||
To look for other instances, search our Google Groups [for users](https://groups.google.com/g/openrefine and [for developers](https://groups.google.com/g/openrefine-dev), where [these projects were originally posted](https://groups.google.com/g/openrefine/c/GfS1bfCBJow/m/qWYOZo3PKe4J).
|
||||
|
||||
|
||||
To look for other instances, search our Google Groups [for users](https://groups.google.com/g/openrefine) and [for developers](https://groups.google.com/g/openrefine-dev), where [these projects were originally posted](https://groups.google.com/g/openrefine/c/GfS1bfCBJow/m/qWYOZo3PKe4J).
|
@ -6,30 +6,30 @@ sidebar_label: Sort and view
|
||||
|
||||
## Sort
|
||||
|
||||
You can temporarily sort your rows by one column. You can sort:
|
||||
You can temporarily sort your rows by one column. You can sort based on [data type](exploring#data-types):
|
||||
* text alphabetically or reverse
|
||||
* numbers by largest or smallest
|
||||
* dates by earliest or latest
|
||||
* boolean values by false first or true first
|
||||
* boolean values by false first or true first.
|
||||
|
||||
You can also choose where to place errors and blank cells in the sorting. Text can be case-sensitive or not: cells that start with lowercase characters will appear ahead of uppercase.
|
||||
You can also choose where to place errors and blank cells in the sorting. Text can be case-sensitive or not: if so, cells that start with lowercase characters will appear ahead of those that start with uppercase characters.
|
||||
|
||||
![A screenshot of the Sort window.](/img/sort.png)
|
||||
|
||||
After you apply a sorting method, you can make it permanent, remove it, reverse it, or apply a subsequent sorting. You’ll find “Sort” in the project grid header to the right of the rows-display setting, which will show all current sorting settings.
|
||||
After you apply a sorting method, you can make it permanent, remove it, reverse it, or apply a subsequent sorting. When it is applied, you’ll find <span class="menuItems">Sort</span> in the project grid header to the right of the rows-display setting, which will show all current sorting settings.
|
||||
|
||||
If you have multiple sorting methods applied, they will work in the order you applied them (represented in order in the "Sort" menu). For example, you can sort an "authors" column alphabetically, and then sort books by publication date, for those authors that have more than one book. If you apply those in a different order - sort all the publication dates in the dataset first, and then alphabetically by author - your dataset will look different.
|
||||
If you have multiple sorting methods applied, they will work in the order you applied them (represented in order in the <span class="menuItems">Sort</span> menu). For example, you can sort an “authors” column alphabetically, and then sort their books by publication date, for those authors that have more than one book. If you apply those in a different order - sort all the publication dates in the dataset first, and then alphabetically by author - your dataset will look different.
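The effect of sort order can be sketched outside OpenRefine with a few made-up rows. This is only an illustration of multi-key sorting in general; the author names, years, and variable names are invented for the example.

```python
# Invented sample rows: two authors with two books each.
books = [
    {"author": "Rivera", "year": 1974},
    {"author": "Adams", "year": 2003},
    {"author": "Rivera", "year": 1968},
    {"author": "Adams", "year": 1985},
]

# Sort by author first, then by publication date within each author.
by_author_then_year = sorted(books, key=lambda b: (b["author"], b["year"]))

# Sort by publication date first, then by author for equal dates.
by_year_then_author = sorted(books, key=lambda b: (b["year"], b["author"]))

print([(b["author"], b["year"]) for b in by_author_then_year])
print([(b["author"], b["year"]) for b in by_year_then_author])
```

The first ordering groups each author's books together, while the second interleaves the authors by date, so the same two sorts give different results depending on which is applied first.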
|
||||
|
||||
![Temporarily sorted rows.](/img/sort2.png)
|
||||
|
||||
When the sorting method you've applied is temporary, you will see that the rows retain their original numbering. When you make that sorting method permanent, by selecting "Reorder rows permanently," the row numbers will change and the "Sort" menu in the project grid header will disappear. This will apply all current sorting methods.
|
||||
When the sorting method you've applied is temporary, you will see that the rows retain their original numbering. When you make that sorting method permanent, by selecting <span class="menuItems">Reorder rows permanently</span>, the row numbers will change and the <span class="menuItems">Sort</span> menu in the project grid header will disappear. This will apply all current sorting methods.
|
||||
|
||||
## View
|
||||
|
||||
You can control what data you view in the grid. On each column, you can “collapse” that specific column, all other columns, all columns to the left, and all columns to the right. Using the “All” column’s dropdown menu, you can collapse all columns, and expand all the columns that you previously collapsed.
|
||||
You can control what data you view in the grid. On each column, you will see a <span class="menuItems">View</span> menu option. From there, you can “collapse” (hide) that specific column, all other columns, all columns to the left, and all columns to the right. Using the <span class="menuItems">View</span> option that appears in the <span class="menuItems">All</span> column’s dropdown menu, you can collapse all columns, and expand all the columns that you previously collapsed.
|
||||
|
||||
### Show/hide “null”
|
||||
|
||||
You can also use the “All” dropdown to show and hide [“null” values](#cell-data-types). A small grey “null” will appear in each applicable cell.
|
||||
You can find, under <span class="menuItems">All</span> → <span class="menuItems">View</span>, the option to show and hide [“null” values](exploring#data-types). A small grey “null” will appear in each applicable cell. Remember that a null cell is not the same thing as an empty cell.
|
||||
|
||||
![A screenshot of what a null value looks like.](/img/null.png)
|
||||
|
@ -1,4 +1,4 @@
|
||||
---
|
||||
---
|
||||
id: starting
|
||||
title: Starting a project
|
||||
sidebar_label: Starting a project
|
||||
@ -8,21 +8,21 @@ sidebar_label: Starting a project
|
||||
|
||||
An OpenRefine project is started by importing in some existing data - OpenRefine doesn’t allow you to create a dataset from nothing.
|
||||
|
||||
No matter where your data comes from, OpenRefine doesn’t modify your original data source. It copies all the information from your input, creates its own project file, and stores it in your [workspace directory](installing#set-where-data-is-stored).
|
||||
No matter where your data comes from, OpenRefine won’t modify your original data source. It copies all the information from your input, creates its own project file, and stores it in your [workspace directory](installing#set-where-data-is-stored).
|
||||
|
||||
The data and all of your edits are [automatically saved](#autosaving) inside the project file. When you’re finished modifying the data, you can [export it back out](exporting) into the file format of your choice.
|
||||
|
||||
You can also receive and open other people’s projects, or send them yours, by [exporting a project archive](exporting#export-a-project) and [importing it](#import-a-project).
|
||||
|
||||
## Create project by importing data
|
||||
## Create a project by importing data
|
||||
|
||||
When you start OpenRefine, you’ll be taken to the <span class="menuItems">Create Project</span> screen. You’ll see on the left side of the screen that your options are to:
|
||||
|
||||
* import data from a file on your computer
|
||||
* import data from a link to the web
|
||||
* import data from one or more files on your computer
|
||||
* import data from one or more links on the web
|
||||
* import data by pasting in text from your clipboard
|
||||
* import data from a database (using SQL), and
|
||||
* import Sheets from Google Drive.
|
||||
* import one or more Sheets from Google Drive.
|
||||
|
||||
From these sources, you can load any of the following file formats:
|
||||
|
||||
@ -40,7 +40,7 @@ From these sources, you can load any of the following file formats:
|
||||
|
||||
More formats can be imported by [adding extensions to provide that functionality](https://openrefine.org/download.html).
|
||||
|
||||
If you supply two or more files for one project, the files’ rows will be loaded in the order that you specify, and OpenRefine will create a column at the beginning of the dataset with the source URL or file name in it to help you identify where each row came from. If the files have matching columns, the data will load in each column; if not, the successive files will append all of their new columns to the end of the dataset:
|
||||
If you supply two or more files for one project, the files’ rows will be loaded in the order that you specify, and OpenRefine will create a column at the beginning of the dataset with the source URL or file name in it to help you identify where each row came from. If the files have columns with identical names, the data will load in those columns; if not, the successive files will append all of their new columns to the end of the dataset:
|
||||
|
||||
|File|Fruit|Quantity|Berry|Berry source|
|
||||
|---|---|---|---|---|
|
||||
@ -49,19 +49,19 @@ If you supply two or more files for one project, the files’ rows will be loade
|
||||
|berries.csv||9|Mulberry|Greece|
|
||||
|berries.csv||2|Blueberry|Canada|
|
||||
|
||||
You cannot combine two datasets into one project by appending data within rows. You can, however, combine two projects later using functions such as [cross()](grelfunctions/#crosscell-s-projectname-s-columnname).
|
||||
You cannot combine two datasets into one project by appending data within rows. You can, however, combine two projects later using functions such as [cross()](grelfunctions/#crosscell-s-projectname-s-columnname), or [fetch further data](columnediting) using other methods.
|
||||
|
||||
For whichever method you choose, when you click <span class="menuItems">Next >></span> you will be given a preview and a chance to configure the way OpenRefine interprets the file.
|
||||
For whichever method you choose to start your project, when you click <span class="menuItems">Next >></span> you will be given a preview and a chance to configure the way OpenRefine interprets the data you input.
|
||||
|
||||
### Get data from this computer
|
||||
|
||||
Click on <span class="menuItems">Browse…</span> and select a file on your hard drive. All files will be shown, not just compatible ones.
|
||||
Click on <span class="menuItems">Browse…</span> and select a file (or several) on your hard drive. All files will be shown, not just compatible ones.
|
||||
|
||||
If you import an archive file (something with the extension `.zip`, `.tar.gz`, `.tgz`, `.tar.bz2`, `.gz`, or `.bz2`), OpenRefine detects the files inside it, shows you a preview screen, and allows you to select which ones to load. This does not work with `.rar` files.
|
||||
|
||||
### Web Addresses (URLs)
|
||||
|
||||
Type or paste the URL to the data file into the field provided. You can add as many fields as you want. OpenRefine will download the file and preview it for you.
|
||||
Type or paste the URL to a data file into the field provided. You can add as many fields as you want. OpenRefine will download the file and preview the project for you.
|
||||
|
||||
If you supply two or more file URLs, OpenRefine will identify each one and ask you to choose which (or all) to load.
|
||||
|
||||
@ -69,33 +69,33 @@ Do not use this form to load a Google Sheet by its link; use [the Google Data fo
|
||||
|
||||
### Clipboard
|
||||
|
||||
You can copy and paste in data from anywhere. OpenRefine will recognize comma-separated, tab-separated, or table-formatted information copied from sources such as word-processing documents, spreadsheets, and tables in PDFs. You can also just paste in a list of items that you want to turn into multi-column rows. OpenRefine recognizes each new text line as a row.
|
||||
You can copy and paste in data from anywhere. OpenRefine will recognize comma-separated, tab-separated, or table-formatted information copied from sources such as word-processing documents, spreadsheets, and tables in PDFs. You can also just paste in a list of items that you want to turn into rows. OpenRefine recognizes each new text line as a row.
|
||||
|
||||
This can be useful if you want to pre-select a specific number of rows from your source data, or paste together rows from different places, rather than delete unwanted rows later in the project interface.
|
||||
|
||||
This can also be useful if you would like to paste in a list of URLs, which you can use later to fetch the data online and build columns with.
|
||||
This can also be useful if you would like to paste in a list of URLs, which you can use later to [fetch more data](columnediting).
|
||||
|
||||
### Database (SQL)
|
||||
|
||||
If you are an administrator or have SQL access to a database of information, you may want to pull the latest dataset directly from there. This could include an online catalogue, a content management system, or a digital repository or collection management system. You can also load a database (`.db`) file saved locally. You will need to use an [SQL query](https://www.w3schools.com/sql/) to import your intended data.
|
||||
|
||||
There are some publicly-accessible databases you can query, such as [one provided by Rfam](https://docs.rfam.org/en/latest/database.html). The instructions provided by Rfam can help you understand how to connect to and query from any database.
|
||||
There are some publicly-accessible databases you can query, such as [one provided by Rfam](https://docs.rfam.org/en/latest/database.html). The instructions provided by Rfam can help you understand how to connect to and query from other databases.
|
||||
|
||||
OpenRefine can connect to PostgreSQL, MySQL, MariaDB, and SQLite database systems. It will automatically populate the <span class="menuItems">Port</span> field based on which of these you choose, but you can manually edit this if needed.
|
||||
OpenRefine can connect to PostgreSQL, MySQL, MariaDB, and SQLite database systems. It will automatically populate the <span class="fieldLabels">Port</span> field based on which of these you choose, but you can manually edit this if needed.
|
||||
|
||||
If you have a `.db` file, you can supply the path to the file on your computer directly in the <span class="menuItems">Database</span> field at the bottom of the form. You can leave the rest of the fields blank.
|
||||
If you have a `.db` file, you can supply the path to the file on your computer in the <span class="fieldLabels">Database</span> field at the bottom of the form. You can leave the rest of the fields blank.
|
||||
|
||||
To import data directly from a database, you will need the database type (such as MySQL), database name, the hostname (either an IP address or the domain that hosts the database), and the port on the host. You will need an account authorized for access, and you may need to add OpenRefine's IP address or host to the <span class="menuItems">allowable hosts</span> for that account. You can find that information by pressing <span class="menuItems">Test</span> and getting the IP from the error message that results.
|
||||
To import data directly from a database, you will need the database type (such as MySQL), database name, the hostname (either an IP address or the domain that hosts the database), and the port on the host. You will need an account authorized for access, and you may need to add OpenRefine's IP address or host to the "allowable hosts" for that account. You can find that information by pressing <span class="buttonLabels">Test</span> and getting the IP address from the error message that results.
|
||||
|
||||
You can either connect just once to gather data, or save the connection to use it again later. If you press <span class="menuItems">Connect</span> without saving, OpenRefine will forget all the information you just entered. If you’d like to save the connection, name your connection in a way you will recognize later. Click <span class="menuItems">Save</span> and it will appear in the <span class="menuItems">Saved Connections</span> list on the left. From now on, you can click on the <span class="menuItems">...</span> ellipsis to the right of the connection you’ve saved, and click <span class="menuItems">Connect</span>.
|
||||
You can either connect just once to gather data, or save the connection to use it again later. If you press <span class="buttonLabels">Connect</span> without saving, OpenRefine will forget all the information you just entered. If you’d like to save the connection, name your connection in a way you will recognize later. Click <span class="buttonLabels">Save</span> and it will appear in the <span class="menuItems">Saved Connections</span> list on the left. From now on, you can click on the <span class="buttonLabels">...</span> ellipsis to the right of the connection you’ve saved, and click <span class="buttonLabels">Connect</span>.
|
||||
|
||||
If your connection is successful, you will see a Query Editor where you can run your SQL query. OpenRefine will give you an error if you write a statement that tries to modify the source database in any way.
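For readers new to SQL, the following sketch shows the kind of read-only `SELECT` statement that suits this import. It uses Python's built-in `sqlite3` module and an in-memory database as a stand-in for a real source; the `items` table, its columns, and its rows are invented, and the setup statements only exist to give the query something to read.

```python
import sqlite3

# Stand-in for an existing database: an in-memory SQLite database
# with an invented table. In practice the source already exists.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, title TEXT, year INTEGER)")
conn.executemany(
    "INSERT INTO items (title, year) VALUES (?, ?)",
    [("First item", 1990), ("Second item", 2005), ("Third item", 2012)],
)

# A read-only query: it selects rows and changes nothing in the source.
query = "SELECT id, title, year FROM items WHERE year >= 2000 ORDER BY year"
rows = conn.execute(query).fetchall()

for row in rows:
    print(row)

conn.close()
```

Only the `SELECT` statement is the part you would type into the Query Editor; statements such as `INSERT`, `UPDATE`, or `DELETE` modify data and are the sort of thing the paragraph above says will produce an error.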
|
||||
|
||||
### Google Data
|
||||
|
||||
You have two ways to load in data from Google Sheets:
|
||||
* A link to an accessible Google Sheet (that is, one with link-sharing turned on)
|
||||
* Selecting a Google Sheet in your Google Drive.
|
||||
* providing a link to an accessible Google Sheet (that is, one with link-sharing turned on), and
|
||||
* selecting a Google Sheet in your Google Drive.
|
||||
|
||||
#### Google Sheet by URL
|
||||
|
||||
@ -111,11 +111,13 @@ This will only work with Sheets, not with any other Google Drive file that might
|
||||
|
||||
You can authorize OpenRefine to access your Google Drive data and import data from any Google Sheet it finds there. This will include Sheets that belong to you and Sheets that are shared with you, as well as Sheets that are in your trash.
|
||||
|
||||
When you select a Google option (either here, or [when exporting project data to Google Drive or Google Sheets](exporting)), you will see a pop-up window that asks you to select a Google account to authorize with. You may see an error message when you authorize: if so, try your import or export operation again and it should succeed.
|
||||
|
||||
OpenRefine will not show spreadsheets that are in your email inbox or stored in any other Google property - only in Drive. It also won’t show all compatible file formats, only Sheets files.
|
||||
|
||||
OpenRefine will generate a list of all Sheets it finds, with the most recently modified Sheets at the top. If a file you’ve just added isn’t showing in this list, you can close and restart OpenRefine, or simply navigate to an existing project, open it, then head back to the <span class="menuItems">Create Project</span> window and check again.
|
||||
|
||||
When you click <span class="menuItems">Preview</span> the Sheet will open in a new browser tab. When you click the Sheet title, OpenRefine will begin to process the data.
|
||||
When you click <span class="buttonLabels">Preview</span> the Sheet will open in a new browser tab. When you click the Sheet title, OpenRefine will begin to process the data.
|
||||
|
||||
|
||||
## Project preview
|
||||
@ -124,36 +126,39 @@ Once OpenRefine is ready to import the data, you will see a screen with <span cl
|
||||
|
||||
At the bottom of the screen you will find options for telling OpenRefine how to process what it has found. You can tell it which row(s) to parse as column headers, as well as to ignore any number of rows at the top. You can also select a specific range of rows to work with, by discarding some rows at the top (excluding the header) and limiting the total number of rows it loads.
|
||||
|
||||
OpenRefine tries to guess how to parse your data based on the file extension. For example, `.xml` files are going to be parsed as though they are formatted in XML. An unknown file extension (or your clipboard copy-paste) is assumed to be either tab-separated or comma-separated. OpenRefine looks for a tab character; if one is found, it assumes you have imported tab-separated data.
|
||||
OpenRefine tries to guess how to parse your data based on the file extension. For example, `.xml` files are going to be parsed as though they are formatted in XML. An unknown file extension (or your clipboard copy-paste) is assumed to be either tab-separated or comma-separated. OpenRefine looks for a tab character, and if one is found, it assumes you have imported tab-separated data.
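The tab-versus-comma rule described above can be sketched roughly as follows. This is an illustrative Python sketch only, not OpenRefine's importer code; the function name `guess_separator` is hypothetical, and the logic assumes nothing beyond what the text states (look for a tab character, otherwise fall back to commas).

```python
def guess_separator(sample: str) -> str:
    # Hypothetical helper: if the pasted or uploaded text contains a tab
    # character anywhere, treat it as tab-separated; otherwise assume commas.
    return "\t" if "\t" in sample else ","


# A tab-separated sample and a comma-separated sample.
print(repr(guess_separator("name\tcity\nAda\tLondon")))
print(repr(guess_separator("name,city\nAda,London")))
```

A consequence of this kind of rule is that comma-separated data containing a stray tab inside a cell may be guessed as tab-separated, which is one reason to check the preview before creating the project.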
|
||||
|
||||
If OpenRefine isn’t certain what format you imported, it will provide a list of possibilities under <span class="menuItems">Parse data as</span> and some settings. You can specify a custom separator now, or split columns later on in the project interface.
|
||||
If OpenRefine isn’t certain what format you imported, it will provide a list of possibilities under <span class="menuItems">Parse data as</span> and some settings. You can specify a custom separator now, or split columns later while [transforming your data](transforming).
|
||||
|
||||
If you imported a spreadsheet with multiple worksheets, they will be listed along with the number of rows they contain. You can only select data from one worksheet.
|
||||
|
||||
Note that OpenRefine does not preserve any formatting, such as cell or text colour, that may have been in the original data file.
|
||||
Note that OpenRefine does not preserve any formatting, such as cell or text colour, that may have been in the original data file. Hyperlinked text will be imported as plain text, but OpenRefine will recognize links and make them clickable inside the project interface.
|
||||
|
||||
:::info
|
||||
Look for character encoding issues at this stage. You may want to manually select an encoding, such as UTF-8, UTF-16, or ASCII, if OpenRefine does not display some characters correctly in the preview. Once your project is created, you can specify another encoding for specific columns using the [reinterpret() function](grelfunctions#reinterprets-s-encoder).
|
||||
:::
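To see why a wrong encoding produces odd characters in the preview, the following Python sketch may help. It illustrates the general mechanism only and is not OpenRefine code: the sample word and variable names are made up for illustration, and the sketch assumes the file was saved as UTF-8 but read as Latin-1.

```python
# Text as the author wrote it, saved to disk as UTF-8 bytes.
original = "caf\u00e9"  # "café"
raw_bytes = original.encode("utf-8")

# Reading those bytes with the wrong encoding yields garbled characters.
misread = raw_bytes.decode("latin-1")

# Reversing the wrong step and decoding correctly recovers the text.
repaired = misread.encode("latin-1").decode("utf-8")

print(misread)
print(repaired)
```

Selecting the correct encoding in the preview avoids the problem at the source, which is usually simpler than repairing individual columns afterwards.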
|
||||
|
||||
You should create a project name at this stage. You can also supply tags to keep your projects organized. When you’re happy with the preview, click <span class="menuItems">Create Project</span>.
|
||||
You should create a project name at this stage. You can also supply tags to keep your projects organized. When you’re happy with the preview, click <span class="buttonLabels">Create Project</span>.
|
||||
|
||||
|
||||
## Import a project
|
||||
|
||||
Because OpenRefine only runs locally on your computer, you can’t have a project accessible to more than one person at the same time.
|
||||
|
||||
The best way to collaborate with another person is to export and import projects that save all your changes, so that you can pick up where someone else left off. You can also [export projects](exporting#export-a-project) and import them to new computers of your own, such as for working on the same project from the office and from home.
|
||||
The best way to collaborate with another person is to export and import projects that save all your changes, so that you can pick up where someone else left off. You can also [export projects](exporting#export-a-project) and import them to other computers, such as for working on the same project from the office and from home.
|
||||
|
||||
An exported project will include all of the [history](running#history-undoredo), so you can see (and undo) all the changes from the previous user. It is essentially a point-in-time snapshot of their work. OpenRefine only exports projects as `.tar.gz` files at this time.
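If you want to check what a `.tar.gz` archive contains before importing it, a short Python sketch like the one below could be used. It relies only on Python's standard `tarfile` module; the function name `list_archive_members` is hypothetical, and nothing here assumes a particular layout inside an OpenRefine archive.

```python
import tarfile


def list_archive_members(archive_path: str) -> list[str]:
    # Open the gzip-compressed tar archive read-only and return the
    # names of the entries it contains, without extracting anything.
    with tarfile.open(archive_path, mode="r:gz") as archive:
        return archive.getnames()
```

Calling `list_archive_members` with the path to the file you were sent returns a list of entry names, so you can confirm the archive is readable before handing it to OpenRefine.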
|
||||
:::caution
|
||||
If you wish to hide the original state of your data and your history of edits (for example, if you are using OpenRefine to anonymize information), export your cleaned dataset only and do not share your project archive.
|
||||
:::
|
||||
|
||||
Once someone has sent you a project archive file from their computer, you can save it anywhere, including your Downloads folder.
|
||||
Once someone has sent you a project archive file from their computer, you can save it anywhere. OpenRefine will import it like a new project and save its information to your workspace directory.
|
||||
|
||||
In the left-hand menu of the home screen, click <span class="menuItems">Import Project</span>. Click <span class="menuItems">Browse…</span> and navigate to wherever you saved the file you were sent (for example, your Downloads folder).
|
||||
In the left-hand menu of the home screen, click <span class="buttonLabels">Import Project</span>. Click <span class="buttonLabels">Browse…</span> and navigate to wherever you saved the file you were sent (for example, your Downloads folder).
|
||||
|
||||
You can rename the project if you’d like - we recommend adding your name, a date, or a version number, if you’re planning to continue collaborating with another person (or working from multiple computers).
|
||||
|
||||
Then, click <span class="menuItems">Import Project</span>. Your project should appear with a step count beside <span class="menuItems">Undo/Redo</span> if steps were saved by the exporter.
|
||||
Then, click <span class="buttonLabels">Import Project</span>. Your project should appear with a step count beside <span class="tabLabels">Undo/Redo</span> if steps were saved by the exporter.
|
||||
|
||||
OpenRefine will store the project in its own workspace directory, so you can now delete the original file that was sent to you.
|
||||
|
||||
@ -162,27 +167,26 @@ OpenRefine will store the project in its own workspace directory, so you can now
|
||||
|
||||
You can access all of your created projects by clicking on <span class="menuItems">Open Project</span>. Your project list can be organized by modification date, title, row count, and other metadata you can supply (such as subject, description, tags, or creator). To edit the fields you see here, click <span class="menuItems">About</span> to the left of each project. There you can edit a number of available fields. You can also see the project ID that corresponds to the name of the folder in your work directory.
|
||||
|
||||
|
||||
### Naming projects
|
||||
|
||||
You may have multiple projects from the same dataset, or multiple versions from sharing a project with another person. OpenRefine automatically generates a project name from the imported file, or “clipboard” when you use <span class="menuItems">Clipboard</span> importing. Project names don’t have to be unique, so OpenRefine will create many projects with the same name unless you intervene.
|
||||
You may have multiple projects from the same dataset, or multiple versions from sharing a project with another person. OpenRefine automatically generates a project name from the imported file, or “clipboard” when you use <span class="menuItems">Clipboard</span> importing. Project names don’t have to be unique, and OpenRefine will create many projects with the same name unless you intervene.
|
||||
|
||||
You can name a project when you create it or import it, and you can rename a project by opening it and clicking on the project name at the top of the screen.
|
||||
You can edit a project's name when you create it or import it, and you can rename a project later by opening it and clicking on the project name at the top of the screen.
|
||||
|
||||
### Autosaving
|
||||
|
||||
OpenRefine [saves all of your actions](running#history-undoredo) (everything you can see in the <span class="menuItems">Undo/Redo</span> panel). That includes flagging and starring rows.
|
||||
OpenRefine [saves all of your actions](running#history-undoredo) (everything you can see in the <span class="tabLabels">Undo/Redo</span> panel). That includes flagging and starring rows.
|
||||
|
||||
It doesn’t, however, save your facets, filters, or any kind of view you may have in place while you work. This includes the number of rows showing, whether you are showing your data as rows or records, and any sorting or column collapsing you may have done. A good rule of thumb is: if it’s not showing in <span class="menuItems">Undo/Redo</span>, you will lose it when you leave the project workspace.
|
||||
It doesn’t, however, save your facets, filters, or any kind of view you may have in place while you work. This includes the number of rows showing, and any sorting or column collapsing you may have done. A good rule of thumb is: if it’s not showing in <span class="tabLabels">Undo/Redo</span>, you will lose it when you leave the project workspace.
|
||||
|
||||
You can only save and share facets and filters, not any other type of view. To save current facets and filters, click <span class="menuItems">Permalink</span>. The project will reload with a different URL, which you can then copy and save elsewhere. This permalink will save both the facets and filters you’ve set, and the settings for each one (such as sorting by count rather than by name).
|
||||
|
||||
### Deleting projects
|
||||
|
||||
You can delete projects, which will erase the project files from the work directory on your computer. This is immediate and cannot be undone.
|
||||
You can delete projects, which will erase the project files from the workspace directory on your computer. This is immediate and cannot be undone.
|
||||
|
||||
Go to <span class="menuItems">Open Project</span> and find the project you want to delete. Click on the <span class="menuItems">X</span> to the left of the project name. There will be a confirmation dialog.
|
||||
|
||||
### Project files
|
||||
|
||||
You can find all of your raw project files in your work directory. They will be named according to the unique “Project ID” that OpenRefine has assigned them, which you can find on the <span class="menuItems">Open Project</span> screen, under the “About” link for each project.
|
||||
You can find all of your raw project files in your work directory. They will be named according to the unique “Project ID” that OpenRefine has assigned them, which you can find on the <span class="menuItems">Open Project</span> screen, under the “About” link for each project.
|
||||
|
@ -10,13 +10,12 @@ OpenRefine gives you powerful ways to clean, correct, codify, and extend your da
|
||||
|
||||
The ways to improve data described in this section are organized by their appearance in the menu options in OpenRefine. You can:
|
||||
|
||||
* change the order of rows or columns
|
||||
* edit cell contents within a particular column
|
||||
* edit cell contents across all rows and columns
|
||||
* transform rows into columns, and columns into rows
|
||||
* split or join columns
|
||||
* add new columns based on existing data or through reconciliation
|
||||
* convert your rows of data into multi-row records.
|
||||
* change the order of [rows](#edit-rows) or [columns](columnediting#rename-remove-and-move)
|
||||
* edit [cell contents](cellediting) within a particular column
|
||||
* [transform](transposing) rows into columns, and columns into rows
|
||||
* [split or join columns](columnediting#split-or-join)
|
||||
* [add new columns](columnediting) based on existing data, by fetching new information, or through [reconciliation](reconciling)
|
||||
* convert your rows of data into [multi-row records](exploring#rows-vs-records).
|
||||
|
||||
## Edit rows
|
||||
|
||||
@ -26,8 +25,10 @@ You can [sort your data](sortview#sort) based on the values in one column, but t
|
||||
|
||||
![A screenshot of where to find the Sort menu with a sorting applied.](/img/sortPermanent.png)
|
||||
|
||||
In the project grid header, the word “Sort” will appear when a sort operation is applied. Click on it to show the dropdown menu, and select “Reorder rows permanently.” You will see the numbering of the rows change under the “All” column.
|
||||
In the project grid header, the word “Sort” will appear when a sort operation is applied. Click on it to show the dropdown menu, and select <span class="menuItems">Reorder rows permanently</span>. You will see the numbering of the rows change under the <span class="menuItems">All</span> column.
|
||||
|
||||
Reordering rows permanently will affect all rows in the dataset, not just those currently viewed through facets and filters.
|
||||
:::info
|
||||
Reordering rows permanently will affect all rows in the dataset, not just those currently viewed through [facets and filters](facets).
|
||||
:::
|
||||
|
||||
You can undo this action using the [“History” sidebar](running#history-undoredo).
|
||||
You can undo this action using the [<span class="fieldLabels">History</span> tab](running#history-undoredo).
|