Merge pull request #3380 from allanaaa/docs-formatting

Docs > formatting
This commit is contained in:
Owen Stephens 2021-01-16 17:59:52 +00:00 committed by GitHub
commit eb7761e230
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
27 changed files with 924 additions and 854 deletions

View File

@ -9,15 +9,17 @@ OpenRefine offers a number of features to edit and improve the contents of cells
One way of doing this is editing through a [text facet](facets#text-facet). Once you have created a facet on a column, hover over the displayed results in the sidebar. Click on the small “edit” button that appears to the right of the facet, and type in a new value. This will apply to all the cells in the facet.
You can apply a text facet on numbers, boolean values, and dates, but if you edit a value it will be converted into the text [data type](exploring#data-types) (regardless of whether you edit a date into another correctly-formatted date, or a “true” value into “false”, etc.).
## Transform
Select “Edit cells” → “Transforms” to open up an expressions window. From here, you can apply [expressions](expressions) to your data. The simplest examples are GREL functions such as `toUppercase` or `toLowercase`, used in expressions as `toUppercase(value)` or `toLowercase(value)`. In all of these cases, `value` is the value in each cell in the selected column.
Select <span class="menuItems">Edit cells</span><span class="menuItems">Transform...</span> to open up an expressions window. From here, you can apply [expressions](expressions) to your data. The simplest examples are GREL functions such as [`toUppercase()`](grelfunctions#touppercases) or [`toLowercase()`](grelfunctions#tolowercases), used in expressions as `toUppercase(value)` or `toLowercase(value)`. When used on a column operation, `value` is the information in each cell in the selected column.
Use the preview to ensure your data is being transformed correctly.
You can also switch to the “History” tab inside the expressions window to reuse expressions youve already attempted in this project, whether they have been undone or not.
You can also switch to the <span class="tabLabels">Undo / Redo</span> tab inside the expressions window to reuse expressions youve already attempted in this project, whether they have been undone or not.
OpenRefine offers you some frequently-used transformations in the next menu option, “Common transforms.” For more custom transforms, read up on [expressions](expressions).
OpenRefine offers you some frequently-used transformations in the next menu option, <span class="menuItems">Common transforms</span>. For more custom transforms, read up on [expressions](expressions).
## Common transforms
@ -27,11 +29,11 @@ Often cell contents that should be identical, and look identical, are different
### Collapse consecutive whitespace
You may also find that some text cells contain what look like spaces but are actually tabs, or contain multiple spaces in a row. This function will remove all space characters that sit in sequence and replace them with a single space.
You may find that some text cells contain what look like spaces but are actually tabs, or contain multiple spaces in a row. This function will remove all space characters that sit in sequence and replace them with a single space.
### Unescape HTML
Your data may come from an HTML-formatted source that expresses some characters through references (such as “&amp;nbsp;” for a space, or “%u0107” for a ć) instead of the actual Unicode characters. You can use the “unescape HTML entities” transform to look for these codes and replace them with the characters they represent.
Your data may come from an HTML-formatted source that expresses some characters through references (such as “&amp;nbsp;” for a space, or “%u0107” for a ć) instead of the actual Unicode characters. You can use the “unescape HTML entities” transform to look for these codes and replace them with the characters they represent. For other formatting that needs to be escaped, try a custom transformation with [`escape()`](grelfunctions#escapes-s-mode).
### Replace smart quotes with ASCII
@ -39,28 +41,17 @@ Smart quotes (or curly quotes) recognize whether they come at the beginning or e
### Case transforms
You can transform an entire column of text into UPPERCASE, lowercase, or Title Case using these three options. This can be useful if you are planning to do textual analysis and wish to avoid case-sensitivity (which many functions are) causing problems in your analysis.
You can transform an entire column of text into UPPERCASE, lowercase, or Title Case using these three options. This can be useful if you are planning to do textual analysis and wish to avoid case-sensitivity (which some functions are) causing problems in your analysis. Consider also using a [custom facet](facets#custom-text-facet) to temporarily modify cases instead of this permanent operation if appropriate.
### Data-type transforms
As detailed in [Data types](exploring#data-types), OpenRefine recognizes different data types: string, number, boolean, and date. When you use these transforms, OpenRefine will check to see if the given values can be converted, then both transform the data in the cells (such as “3” as a text string to “3” as a number) and convert the data type on each successfully transformed cell. Cells that cannot be transformed will output the original value and maintain their original data type.
For example, the following column of strings on the left will transform into the values on the right:
:::caution
Be aware that dates may require manual intervention to transform successfully: see the section on [Dates](exploring#dates) for more information.
:::
|Input|>|Output|
|---|---|---|
|23/12/2019|>|2019-12-23T00:00:00Z|
|14-10-2015|>|2015-10-14T00:00:00Z|
|2012 02 16|>|2012-02-16T00:00:00Z|
|August 2nd 1964|>|1964-08-02T00:00:00Z|
|today|>|today|
|never|>|never|
This is based on OpenRefines ability to recognize dates with the [`toDate()` function](expressions#dates).
Clicking the “today” cell and editing its data type manually will convert “today” into a value such as “2020-08-14T00:00:00Z”. Attempting the same data-type change on “never” will give you an error message and refuse to proceed.
Because these common transforms do not offer the ability to output an error instead of the original cell contents, be careful to look for unconverted and untransformed values. You will see a yellow alert at the top of screen that will tell you how many cells were converted - if this number does not match your current row set, you will need to look for and manually correct the remaining cells.
Because these common transforms do not offer the ability to output an error instead of the original cell contents, be careful to look for unconverted and untransformed values. You will see a yellow alert at the top of screen that will tell you how many cells were converted - if this number does not match your current row set, you will need to look for and manually correct the remaining cells. Also consider faceting by data type, with the GREL function [`type()`](grelfunctions#typeo).
You can also convert cells into null values or empty strings. This can be useful if you wish to, for example, erase duplicates that you have identified and are analyzing as a subset.
@ -68,41 +59,45 @@ You can also convert cells into null values or empty strings. This can be useful
Fill down and blank down are two functions most frequently used when encountering data organized into [records](exploring#row-types-rows-vs-records) - that is, multiple rows associated with one specific entity.
If you receive information in rows mode and want to convert it to records mode, the easiest way is to sort your first column by the value that you want to use as a unique records key, [make that sorting permanent](transforming#edit-rows), then blank down all the duplicates in that column. OpenRefine will retain the first unique value and erase the rest. Then you can switch from “Show as rows” to “Show as records” and OpenRefine will convert the data based on the remaining values in the first column. Be careful that your data is sorted properly before you begin blanking down - not just the first column but other columns you may want to have in a certain order. For example, you may have multiple identical entries in the first column, one with a value in the second column and one with an empty cell in the second column. In this case you want the value to come first, so that you can clean up empty rows later, once you blank down.
If you receive information in rows mode and want to convert it to records mode, the easiest way is to sort your first column by the value that you want to use as a unique records key, [make that sorting permanent](transforming#edit-rows), then blank down all the duplicates in that column. OpenRefine will retain the first unique value and erase the rest. Then you can switch from “Show as rows” to “Show as records” and OpenRefine will associate rows to each other based on the remaining values in the first column.
Be careful that your data is sorted properly before you begin blanking down - not just the first column but other columns you may want to have in a certain order. For example, you may have multiple identical entries in the first column, one with a value in the second column and one with an empty cell in the second column. In this case you want the row with the second-column value to come first, so that you can clean up empty rows later, once you blank down.
If, conversely, youve received data with empty cells because it was already in something akin to records mode, you can fill down information to the rest of the rows. This will duplicate whatever value exists in the topmost cell with a value: if the first row in the record is blank, it will take information from the next cell, or the cell after that, until it finds a value. The blank cells above this will remain blank.
## Split multi-valued cells
Splitting cells with more than one value in them is a common way to get your data from single rows into multi-row records. Survey data, for example, frequently allows respondents to “Select all that apply,” or an inventory list might have items filed under more than one category.
Splitting cells with more than one value in them is a common way to get your data from single rows into [multi-row records](exploring#rows-vs-records). Survey data, for example, frequently allows respondents to “Select all that apply,” or an inventory list might have items filed under more than one category.
You can split a column based on any character or series of characters you input, such as a semi-colon (;) or a slash (/). The default is a comma. Splitting based on a separator will remove the separator characters, so you may wish to include a space with your separator (; ) if it exists in your data.
You can use [expressions](expressions) to design the point at which a cell should split itself into two or more rows. This can be used to identify special characters or create more advanced evaluations. You can split on a line-break by entering `\n` and checking the “regular expression” checkbox.
You can use [expressions](expressions) to design the point at which a cell should split itself into two or more rows. This can be used to identify special characters or create more advanced evaluations. You can split on a line-break by entering `\n` and checking the “[regular expression](expressions#regular-expressions)” checkbox.
This can be useful if the split is not straightforward: say, if a capital letter indicates the beginning of a new string, or if you need to _not_ always split on a character that appears in both the strings and as a separator. Remember that this will remove all the matching characters.
Regular expressions can be useful if the split is not straightforward: say, if a capital letter (`[A-Z]`) indicates the beginning of a new string, or if you need to _not_ always split on a character that appears in both the strings and as a separator. Remember that this will remove all the matching characters.
You can also split based on the lengths of the strings you expect to find. This can be useful if you have predictable data in the cells: for example, a 10-digit phone number, followed by a space, followed by another 10-digit phone number. Any characters past the explicit length youve specified will be discarded: if you split by “11, 10” any characters that may come after the 21st character will disappear. If some cells only have one phone number, you will end up with blank rows.
If you have data that should be split into multiple columns instead of multiple rows, see [split into several columns(columnediting#split-into-several-columns).
If you have data that should be split into multiple columns instead of multiple rows, see [Split into several columns](columnediting#split-into-several-columns).
## Join multi-valued cells
Joining will reverse the “split multi-valued cells” operation, or join up information from multiple rows into one row. All the strings will be compressed into the topmost cell in the record, in the order they appear. A window will appear where you can set the separator; the default is a comma and a space (, ). This separator is optional.
Joining will reverse the “split multi-valued cells” operation, or join up information from multiple rows into one row. All the strings will be compressed into the topmost cell in the record, in the order they appear. A window will appear where you can set the separator; the default is a comma and a space (, ). This separator is optional. We suggest the separator | as a sufficiently rare character.
## Cluster and edit
Creating a facet on a column is a great way to look for inconsistencies in your data; clustering is a great way to fix those inconsistencies. Clustering uses a variety of comparison methods to find text entries that are similar but not exact, then shares those results with you so that you can merge the cells that should match. Where editing a single cell or text facet at a time can be time-consuming and difficult, clustering is quick and streamlined.
Clustering always requires the user to approve each suggested edit - it will display values it thinks are variations on the same thing, and you can select which version to keep and apply across all those matching cells (or type in your own version). OpenRefine will do a number of cleanup operations behind the scenes, in memory, in order to do its analysis, but only the merges you approve will modify your data.
Clustering always requires the user to approve each suggested edit - it will display values it thinks are variations on the same thing, and you can select which version to keep and apply across all the matching cells (or type in your own version).
You can start the process in two ways: using the dropdown menu on your column, select “Edit cells” → “Cluster and edit…” or create a text facet and then press the “Cluster” button that appears in the facet box.
OpenRefine will do a number of cleanup operations behind the scenes in order to do its analysis, but only the merges you approve will modify your data. Understanding those different behind-the-scenes cleanups can help you choose which clustering method will be more accurate and effective.
You can start the process in two ways: using the dropdown menu on your column, select <span class="menuItems">Edit cells</span><span class="menuItems">Cluster and edit…</span>; or create a text facet and then press the “Cluster” button that appears in the facet box.
![A screenshot of the Clustering window.](/img/cluster.png)
The clustering pop-up window will take a small amount of time to analyze your column, and then make some suggestions based on the clustering method currently active.
For each cluster identified, you can pick one of the existing values to apply to all cells, or manually type in a new value in the text box. And, of course, you can choose not to cluster them at all. OpenRefine will keep analyzing every time you make a change, with “Merge selected & re-cluster,” and you can work through all the methods this way.
For each cluster identified, you can pick one of the existing values to apply to all cells, or manually type in a new value in the text box. And, of course, you can choose not to cluster them at all. OpenRefine will keep analyzing every time you make a change, with <span class="buttonLabels">Merge selected & re-cluster</span>, and you can work through all the methods this way.
You can also export the currently identified clusters as a JSON file, or close the window with or without applying your changes. You can also use the histograms on the right to narrow down to, for example, clusters with lots of matching rows, or clusters of long or short values.
@ -123,13 +118,19 @@ The clustering pop-up window offers you a variety of clustering methods:
* levenshtein
* ppm
#### Key collision
**Key collisions** are very fast and can process millions of cells in seconds:
**Fingerprinting** is the least likely to produce false positives, so its a good place to start. It does the same kind of data-cleaning behind the scenes that you might think to do manually: fix whitespace into single spaces, put all uppercase letters into lowercase, discard punctuation, remove diacritics (e.g. accents) from characters, split all strings (words) and sort them alphabetically (so “Zhenyi, Wang” becomes “Wang Zhenyi”). This makes comparing those types of name values very easy.
**Fingerprinting** is the least likely to produce false positives, so its a good place to start. It does the same kind of data-cleaning behind the scenes that you might think to do manually: fix whitespace into single spaces, put all uppercase letters into lowercase, discard punctuation, remove diacritics (e.g. accents) from characters, split up all strings (words) and sort them alphabetically (so “Zhenyi, Wang” becomes “wang zhenyi”).
**N-gram fingerprinting** allows you to set the _n_ value to whatever number youd like, and will create n-grams of _n_ size (after doing some cleaning), alphabetize them, then join them back together into a _fingerprint_. For example, a 1-gram fingerprint will simply organize all the letters in the cell into alphabetical order - by creating segments one character in length. A 2-gram fingerprint will find all the two-character segments, remove duplicates, alphabetize them, and join them back together (for example, “banana” generates “ba an na an na,” which becomes “anbana”). This can help match cells that have typos, or incorrect spaces (such as matching “lookout” and “look out,” which fingerprinting itself wont identify). The higher the _n_ value, the fewer clusters will be identified. With 1-grams, keep an eye out for mismatched values that are near-anagrams of each other (such as “Wellington” and “Elgin Town”).
**N-gram fingerprinting** allows you to set the _n_ value to whatever number youd like, and will create n-grams of _n_ size (after doing some cleaning), alphabetize them, then join them back together into a fingerprint. For example, a 1-gram fingerprint will simply organize all the letters in the cell into alphabetical order - by creating segments one character in length. A 2-gram fingerprint will find all the two-character segments, remove duplicates, alphabetize them, and join them back together (for example, “banana” generates “ba an na an na,” which becomes “anbana”).
The next four methods are phonetic algorithsm: they know whether two letters sound the same when pronounced out loud, and assess text values based on that (such as knowing that a word with an “S” might be a mistype of a word with a “Z”). They are great for spotting mistakes made by not knowing the spelling of a word or name after only hearing it spoken aloud.
This can help match cells that have typos, or incorrect spaces (such as matching “lookout” and “look out,” which fingerprinting itself wont identify because it separates words). The higher the _n_ value, the fewer clusters will be identified. With 1-grams, keep an eye out for mismatched values that are near-anagrams of each other (such as “Wellington” and “Elgin Town”).
##### Phonetic clustering
The next four methods are phonetic algorithms: they identify letters that sound the same when pronounced out loud, and assess text values based on that (such as knowing that a word with an “S” might be a mistype of a word with a “Z”). They are great for spotting mistakes made by not knowing the spelling of a word or name after hearing it spoken aloud.
**Metaphone3 fingerprinting** is an English-language phonetic algorithm. For example, “Reuben Gevorkiantz” and “Ruben Gevorkyants” share the same phonetic fingerprint in English.
@ -139,24 +140,26 @@ The next four methods are phonetic algorithsm: they know whether two letters sou
Regardless of the language of your data, applying each of them might find different potential matches: for example, Metaphone clusters “Cornwall” and “Corn Hill” and “Green Hill,” while Cologne clusters “Greenvale” and “Granville” and “Cornwall” and “Green Wall.”
**Nearest neighbor** clustering methods are slower than key collision methods. They allow the user to set a radius - a threshold for matching or not matching. OpenRefine uses a “blocking” method first, which sorts values based on whether they have a certain amount of similarity (the default is “6” for a six-character string of identical characters) and then runs the nearest-neighbor operations on those sorted groups. We recommend setting the block number to at least 3, and then increasing it if you need to be more strict (for example, if every value with “river” is being matched, you should increase it to 6 or more). Note bigger block values will take much longer to process, while smaller blocks may miss matches. Increasing the radius will make the matches more lax, as bigger differences will be clustered:
#### Nearest neighbor
**Levenshtein distance** counts the number of edits required to make one value perfectly match another. As in the key collision methods above, it will do things like change uppercase to lowercase, fix whitespace, change special characters, etc. Each character that gets changed counts as 1 “distance.” “New York” and “newyork” have an edit distance value of 3 (“N” to “n”, “Y” to “y,” remove the space). It can do relatively advanced edits, such as understand the distance between “M. Makeba” and “Miriam Makeba” (5), but it may create false positives if these distances are greater than other, simpler transformations (such as the one-character distance to “B. Makeba,” another person entirely).
**Nearest neighbor** clustering methods are slower than key collision methods. They allow the user to set a radius - a threshold for matching or not matching. OpenRefine uses a “blocking” method first, which sorts values based on whether they have a certain amount of similarity (the default is “6” for a six-character string of identical characters) and then runs the nearest-neighbor operations on those sorted groups.
**PPM (or Prediction by Partial Matching)** uses compression to see whether two values are similar or different. In practice, this method is very lax even for small radius values and tends to generate many false positives, but because it operates at a sub-character level it is capable of finding substructures that are not easily identifiable by distances that work at the character level. So it should be used as a 'last resort' clustering method. It is also more effective on longer strings than on shorter ones.
We recommend setting the block number to at least 3, and then increasing it if you need to be more strict (for example, if every value with “river” is being matched, you should increase it to 6 or more). Note that bigger block values will take much longer to process, while smaller blocks may miss matches. Increasing the radius will make the matches more lax, as bigger differences will be clustered.
**Levenshtein distance** counts the number of edits required to make one value perfectly match another. As in the key collision methods above, it will do things like change uppercase to lowercase, fix whitespace, change special characters, etc. Each character that gets changed counts as 1 “distance.” “New York” and “newyork” have an edit distance value of 3 (“N” to “n”; “Y” to “y”; remove the space). It can do relatively advanced edits, such as understand the distance between “M. Makeba” and “Miriam Makeba” (5), but it may create false positives if these distances are greater than other, simpler transformations (such as the one-character distance to “B. Makeba,” another person entirely).
**PPM (Prediction by Partial Matching)** uses compression to see whether two values are similar or different. In practice, this method is very lax even for small radius values and tends to generate many false positives, but because it operates at a sub-character level it is capable of finding substructures that are not easily identifiable by distances that work at the character level. So it should be used as a “last resort” clustering method. It is also more effective on longer strings than on shorter ones.
For more of the theory behind clustering, see [Clustering In Depth](https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth).
## Replace
OpenRefine provides a find/replace function for you to edit your data. Selecting “Edit cells” → “Replace” will bring up a simple window where you can input a string to search and a string to replace it with. You can set case-sensitivity, and set it to only select whole words, defined by a string with spaces or punctuation around it (to prevent, for example, “house” selecting the “house” part of “doghouse”). You can use regular expressions in this field.
You may wish to preview the results of this operation by testing it with a [Text filter](facets#text-filter) first.
OpenRefine provides a find/replace function for you to edit your data. Selecting <span class="menuItems">Edit cells</span><span class="menuItems">Replace</span> will bring up a simple window where you can input a string to search and a string to replace it with. You can set case-sensitivity, and set it to only select whole words, defined by a string with spaces or punctuation around it (to prevent, for example, “house” selecting the “house” part of “doghouse”). You can use [regular expressions](expressions#regular-expressions) in this field. You may wish to preview the results of this operation by testing it with a [Text filter](facets#text-filter) first.
You can also perform a sort of find/replace operation by editing one cell, and selecting “apply to all identical cells.”
## Edit one cell at a time
You can edit individual cells by hovering your mouse over that cell. You should see a tiny blue button labeled “edit.” Click it to edit the cell. That pops up a window with a bigger text field for you to edit. You can change the data type of that cell, and you can apply these changes to all identical cells (in the same column), using this pop-up window.
You can edit individual cells by hovering your mouse over that cell. You should see a tiny blue link labeled “edit.” Click it to edit the cell. That pops up a window with a bigger text field for you to edit. You can change the [data type](exploring#data-types) of that cell, and you can apply these changes to all identical cells (in the same column), using this pop-up window.
You will likely want to avoid doing this except in rare cases - the more efficient means of improving your data will be through automated and bulk operations.

View File

@ -6,47 +6,45 @@ sidebar_label: Column editing
## Overview
Column editing contains some of the most powerful data-improvement methods in OpenRefine. While we call it “edit column,” this includes using one column of data to add entirely new columns and fields to your dataset.
Column editing contains some of the most powerful data-improvement methods in OpenRefine. The operations in the <span class="menuItems">Edit column</span> menu involve using one column of data to add entirely new columns and fields to your dataset.
## Split or Join
## Splitting or joining
Many users find that they frequently need to make their data more granular: for example, splitting a “Firstname Lastname” column into two columns, one for first names and one for last names. You may want to split out an address column into columns for street addresses, cities, territories, and postal codes.
The reverse is also often true: you may have several columns of category values that you want to join into one “category” column.
### Split into several columns...
Many users find that they frequently need to make their data more granular: for example, splitting a “Firstname Lastname” column into two columns, one for first names and one for last names. The reverse is also often true: you may have several columns of category values that you want to join into one “category” column.
.
### Split into several columns
![A screenshot of the settings window for splitting columns.](/img/columnsplit.png)
Splitting one column into several columns requires you to identify the character, string lengths, or evaluating expression you want to split on. Just like [splitting multi-valued cells into rows](cellediting#split-multi-valued-cells), splitting cells into multiple columns will remove the separator character or string you indicate. Lengths will discard any information that comes after the specified total length.
You can find this operation at <span class="menuItems">Edit column</span><span class="menuItems">Split into several columns...</span>. Splitting one column into several columns requires you to identify the character, string lengths, or evaluating expression you want to split on. Just like [splitting multi-valued cells into rows](cellediting#split-multi-valued-cells), splitting cells into multiple columns will remove the separator character or string you indicate. Splitting by lengths will discard any information that comes after the specified total length.
You can also specify a maximum number of new columns to be made: separator characters after this limit will be ignored, and the remaining characters will end up in the last column.
New columns will be named after the original column, with a number: “Location 1,” “Location 2,” etc. You can have the original column removed with this operation, and you can have [data types](exploring#data-types) identified where possible. This function will work best with converting to numbers, and may not work with dates.
New columns will be named after the original column, with a number: “Location 1,” “Location 2,” etc. You can choose to remove the original column with this operation, and you can have [data types](exploring#data-types) identified where possible. This function will work best with converting strings to numbers, and may not work with [dates](exploring#dates).
### Join columns
### Join columns
![A screenshot of the settings window for joining columns.](/img/columnjoin.png)
You can join columns by selecting “Edit column” → “Join columns…”. All the columns currently in your dataset will appear in the pop-up window. You can select or un-select all the columns you want to join, and drag columns to put them in the order you want to join them in. You will define a separator character (optional) and define a string to insert into empty cells (nulls).
You can join columns by selecting <span class="menuItems">Edit column</span><span class="menuItems">Join columns...</span>. All the columns currently in your dataset will appear in the pop-up window. You can select or un-select all the columns you want to join, and drag columns to put them in the order you want to join them in. You will define a separator character (optional) and define a string to insert into empty cells (nulls).
The joined data will appear in the column you originally selected, or you can create a new column based on this join and specify a name. You can delete all the columns that were used in this join operation.
The joined data will appear in the column you originally selected, or you can create a new column for this content and specify a name. You can delete all the columns that were used in this join operation.
## Add column based on this column
This selection will open up an [expressions](expressions) window where you can transform the data from this column (using `value`), or write a more complex expression that takes information from any number of columns or from external reconciliation sources.
Selecting <span class="menuItems">Edit column</span><span class="menuItems">Add column based on this column...</span> will open up an [expressions](expressions) window where you can transform the data from this column (using `value`), or write a more complex expression that takes information from any number of columns or from external sources.
The simplest way to use this operation is simply leave the default `value` in the expression field, to create an exact copy of your column.
Expressions used in this operation will rely on your knowledge of variables. You can learn more in the [Expressions section on variables](expressions#variables).
For a reconciled column, you can use the variable `cell` instead, to copy both the original string and the existing reconciliation data. This will include matched values, candidates, and new items. You can learn other useful variables in the [Expressions section on GREL variables](expressions#variables).
The simplest way to use this operation is simply leave the default `value` in the expression field, to create an exact copy of your column. For a column of [reconciled data](reconciling), you can use the variable `cell` instead, to copy both the original string and the existing reconciliation data. This will include matched values, candidates, and new items.
You can create a column based on concatenating (merging) two other columns. Select either of the source columns, apply “Column editing” → “Add column based on this column...”, name your new column, and use the following format in the expression window:
One useful expression is to create a column based on concatenating (merging) two other columns. Select either of the source columns, choose <span class="menuItems">Edit column</span><span class="menuItems">Add column based on this column...</span>, name your new column, and use the following format in the expression window:
```
cells[“Column 1”].value + cells[“Column 2”].value
cells["Column 1"].value + cells["Column 2"].value
```
If your column names do not contain spaces, you can use the following format:
If your column names do not contain spaces, you can use the following format instead:
```
cells.Column1.value + cells.Column2.value
@ -58,29 +56,29 @@ If you are in records mode instead of rows mode, you can concatenate using the f
row.record.cells.Column1.value + row.record.cells.Column2.value
```
You may wish to add separators or spaces, or modify your input during this operation with more advanced GREL.
You may wish to add separators or spaces, or modify your input during this operation with more advanced expressions.
## Add column by fetching URLs
Through the “Add column by fetching URLs” function, OpenRefine supports the ability to fetch HTML or data from web pages or services. In this operation you will be taking strings from your selected column and inserting them into URL strings. This presumes your chosen column contains parts of paths to valid HTML pages or files online.
Through the <span class="menuItems">Add column by fetching URLs</span> function, OpenRefine supports the ability to fetch HTML or data from web pages or services. In this operation you will be building URL strings based on your column of data, by using `value` to insert a relevant substring. Your chosen column needs to contains parts of paths to valid HTML pages or files online.
If you have a column of URLs and watch to fetch the information that they point to, you can simply run the expression as `value`. If your column has, for example, unique identifiers for Wikidata entities (numerical values starting with Q), you can download the JSON-formatted metadata about each entity with
If you have a column of URLs and want to fetch the information that they point to, you can simply run the expression as `value`. If your column has, for example, unique identifiers for Wikidata entities (numerical values starting with Q), you can download the JSON-formatted metadata about each entity with
```
“https://www.wikidata.org/wiki/Special:EntityData/” + value + “.json”
"https://www.wikidata.org/wiki/Special:EntityData/" + value + ".json"
```
or whatever metadata format you prefer. [Information about these Wikidata options can be found here](https://www.wikidata.org/wiki/Wikidata:Data_access).
This service is more useful when getting metadata files instead of HTML, but you may wish to work with a pages entire HTML contents and then parse out information from that.
Be aware that the fetching process can take quite some time and that servers may not want to fulfill hundreds or thousands of page requests in seconds. Fetching allows you to set a “throttle delay” which determines the amount of time between requests. The default is 5 seconds per row in your dataset (5000 milliseconds).
or whatever metadata format you prefer. Information about the format options in Wikidata can be found [here](https://www.wikidata.org/wiki/Wikidata:Data_access). The service you are fetching data from may have similar documentation on its provided options.
![A screenshot of the settings window for fetching URLs.](/img/fetchingURLs.png)
Note the following:
This service is more useful when getting metadata files instead of HTML, but you may wish to work with a pages entire HTML contents and then parse out information from that.
* Many systems prevent you from making too many requests per second. To avoid this problem, set the throttle delay, which tells OpenRefine to wait the specified number of milliseconds between URL requests.
:::caution
Be aware that the fetching process can take quite some time and that servers may not want to fulfill hundreds or thousands of page requests in seconds. Fetching allows you to set a “throttle delay” which determines the amount of time between requests. The default is 5 seconds per row in your dataset (5000 milliseconds). We recommend leaving this at 1000 or greater.
:::
Note the following:
* Before pressing “OK,” copy and paste a URL or two from the preview and test them in another browser tab to make sure they work.
* In some situations you may need to set [HTTP request headers](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers). To set these, click the small “Show” button next to “HTTP headers to be used when fetching URLs” in the settings window. The authorization credentials get logged in your operation history in plain text, which may be a security concern for you. You can set the following request headers:
* [User-Agent](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent)
@ -89,7 +87,7 @@ Note the following:
### Common errors
When OpenRefine attempts to fetch information from a web page or service, it can fail in a variety of ways. The following information is meant to help troubleshoot and fix problems encountered when using this function.
When OpenRefine attempts to fetch information from a web service, it can fail in a variety of ways. The following information is meant to help troubleshoot and fix problems encountered when using this function.
First, make sure that your fetching operation is storing errors (check “store error”). Then run the fetch and look at the error messages.
@ -109,7 +107,7 @@ Note that for Mac users and for Windows users with the OpenRefine installation w
* On Mac, it will look something like `/Applications/OpenRefine.app/Contents/PlugIns/jdk1.8.0_60.jdk/Contents/Home/jre/lib/security`.
* On Windows: `\server\target\jre\lib\security`.
An error that includes **“javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed”** can occur when you try to retrieve information over HTTPS but the remote site is using a certificate not trusted by your local Java installation. You will need to make sure that the certificate, or (more likely) the root certificate, is trusted.
**“javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed”** can appear when the remote site is using an HTTPS certificate not trusted by your local Java installation. You will need to make sure that the certificate, or (more likely) the root certificate, is trusted.
The list of trusted certificates is stored in an encrypted file called `cacerts` in your local Java installation. This can be read and updated by a tool called “keytool.” You can find directions on how to add a security certificate to the list of trusted certificates for a Java installation [here](http://magicmonster.com/kb/prg/java/ssl/pkix_path_building_failed.html) and [here](http://javarevisited.blogspot.co.uk/2012/03/add-list-certficates-java-keystore.html).
@ -118,8 +116,9 @@ Note that for Mac users and for Windows users with the OpenRefine installation w
* On Mac, it will look something like `/Applications/OpenRefine.app/Contents/PlugIns/jdk1.8.0_60.jdk/Contents/Home/jre/lib/security/cacerts`.
* On Windows: `\server\target\jre\lib\security\`.
## Rename, Remove, and Move
## Renaming, removing, and moving
Any column can be repositioned, renamed, or deleted with these actions. They can be undone, but a removed column cannot be restored later if you keep modifying your data. If you wish to temporarily hide a column, go to “View” → “Collapse this column” instead.
Every column's <span class="menuItems">Edit column</span> dropdown contains options to move it (to the beginning, end, left, or right), rename it, and delete it.
These operations can be undone, but a removed column cannot be restored later if you keep modifying your data. If you wish to temporarily hide a column, go to <span class="menuItems">[View](sortview#view)</span><span class="menuItems">Collapse this column</span> instead.
Be cautious about moving columns in [records mode](cellediting#rows-vs-records): if you change the first column in your dataset (the key column), your records may change in unintended ways.

View File

@ -1,4 +1,4 @@
---
---
id: exploring
title: Exploring data
sidebar_label: Overview
@ -6,7 +6,7 @@ sidebar_label: Overview
## Overview
OpenRefine is a powerful tool for learning about your dataset, even if you dont change a single character. In this section we cover different ways for sorting through, filtering, and viewing your data.
OpenRefine offers lots of features to help you learn about your dataset, even if you dont change a single character. In this section we cover different ways for sorting through, filtering, and viewing your data.
Unlike spreadsheets, OpenRefine doesnt store formulas and display the output of those calculations; it only shows the value inside each cell. It doesnt support cell colors or text formatting.
@ -14,21 +14,19 @@ Unlike spreadsheets, OpenRefine doesnt store formulas and display the output
Each piece of information (each cell) in OpenRefine is assigned a data type. Some file formats, when imported, can set data types that are recognized by OpenRefine. Cells without an associated data type on import will be considered a “string” at first, but you can have OpenRefine convert cell contents into other data types later. This is set at the cell level, not at the column level.
You can see data types in action when you preview a new project: check the box that says “Attempt to parse cell text into numbers” and cells will be converted to the “number” data type based on their contents. Youll see numbers change from black text to green if they are recognized.
You can see data types in action when you preview a new project: check the box next to <span class="fieldLabels">Attempt to parse cell text into numbers</span>, and cells will be converted to the “number” data type based on their contents. Youll see numbers change from black text to green if they are recognized.
The data type will determine what you can do with the value. For example, if you want to add two values together, they must both be recognized as the number type.
You can check data types at any time by:
* clicking “edit” on a single cell (where you can also edit the type)
* creating a Custom Text Facet on a column, and inserting `type(value)` into the “Expression” field. This will generate the data type in the preview, and you can facet by data type if you press “OK.”
* creating a <span class="menuItems">Custom Text Facet</span> on a column, and inserting `type(value)` into the <span class="fieldLabels">Expression</span> field. This will generate the data type in the preview, and you can facet by data type if you press <span class="buttonLabels">OK</span>.
The data types supported are:
* string (one or more text characters)
* number (one or more characters of numbers only)
* boolean (values of “true” or “false”)
* date (ISO-8601-compliant extended format with time in UTC: YYYY-MM-DDTHH:MM:SSZ)
A “date” type is created when a text column is [transformed into dates](transforming#to-date), or when individual cells are set to have the data type “date.”
* [date](#dates) (ISO-8601-compliant extended format with time in UTC: YYYY-MM-DDTHH:MM:SSZ)
OpenRefine recognizes two further data types as a result of its own processes:
* error
@ -36,19 +34,59 @@ OpenRefine recognizes two further data types as a result of its own processes:
An “error” data type is created when the cell is storing an error generated during a transformation in OpenRefine.
A “null” data type is a special value which basically means “this cell has no value.” Its used to differentiate between cells that have values such as “0” or “false” - or a cell that looks empty but has, for example, spaces in it. When you use `type(value)`, it will show you that the cells value is “null” and its type is “undefined.” You can opt to [show “null” values](#view) to differentiate them from empty strings, by going to “All” → “View” → “Show/Hide null values in cells.”
A “null” data type is a special type that means “this cell has no value.” Its distinct from cells that have values such as “0” or “false”, or cells that look empty but have whitespace in them, or cells that contain empty strings. When you use `type(value)`, it will show you that the cells value is “null” and its type is “undefined.” You can opt to [show “null” values](sortview#showhide-null), by going to <span class="menuItems">All</span><span class="menuItems">View</span><span class="menuItems">Show/Hide null values in cells</span>.
Converting a cell's data type is not the same operation as transforming its contents. For example, using a column-wide transform such as “Transform” → “Common transforms …” → “to date” may not convert all values successfully, but going to an individual cell, clicking “edit” and changing the data type can successfully convert text to a date. These operations use different underlying code.
Changing a cell's data type is not the same operation as transforming its contents. For example, using a column-wide transform such as <span class="menuItems">Transform</span><span class="menuItems">Common transforms</span><span class="menuItems">To date</span> may not convert all values successfully, but going to an individual cell, clicking “edit”, and changing the data type can successfully convert text to a date. These operations use different underlying code. Learn more about date formatting and transformations in the next section.
To transform data from one type to another, see [Transforming data](transforming#transform) for information on using common tranforms, and see [Expressions](expressions) for information on using `toString()`, `toDate()`, and other functions.
To transform data from one type to another, see [Transforming data](cellediting#data-type-transforms) for information on using common tranforms, and see [Expressions](expressions) for information on using [toString()](grelfunctions#tostringo-string-format-optional), [toDate()](grelfunctions#todateo-b-monthfirst-s-format1-s-format2-), and other functions.
### Dates
A “date” type is created when a column is [transformed into dates](transforming#to-date), when an expression is used to [convert cells to dates](grelfunctions#todateo-b-monthfirst-s-format1-s-format2-) or when individual cells are set to have the data type “date”.
Date-formatted data in OpenRefine relies on a number of conversion tools and standards. For something to be considered a date in OpenRefine, it will be converted into the ISO-8601-compliant extended format with time in UTC: YYYY-MM-DDTHH:MM:SSZ.
When you run <span class="menuItems">Edit cells</span><span class="menuItems">Common transforms</span><span class="menuItems">To date</span>, the following column of strings on the left will transform into the values on the right:
|Input|→|Output|
|---|---|---|
|23/12/2019|→|2019-12-23T00:00:00Z|
|14-10-2015|→|2015-10-14T00:00:00Z|
|2012 02 16|→|2012-02-16T00:00:00Z|
|August 2nd 1964|→|1964-08-02T00:00:00Z|
|today|→|today|
|never|→|never|
OpenRefine uses a variety of tools to recognize, convert, and format [dates](exploring#dates) and so some of the values above can be reformatted using other methods. In this case, clicking the “today” cell and editing its data type manually will convert “today” into a value such as “2020-08-14T00:00:00Z”. Attempting the same data-type change on “never” will give you an error message and refuse to proceed.
You can do more precise conversion and formatting using expressions and arguments based on the state of your data: see the GREL functions reference section on [Date functions](grelfunctions#date-functions) for more help.
You can convert dates into a more human-readable format when you [export your data using the custom tabular exporter](exporting#custom-tabular-exporter). You are given the option to keep your dates in the ISO 8601 format, to output short, medium, long, or full locale formats, or to specify a custom format. This means that you can format your dates into, for example, MM/DD/YY (the US short standard) with or without including the time, after working with ISO-8601-formatted dates in your project.
The following table shows some example [date and time formatting styles for the U.S. and French locales](https://docs.oracle.com/javase/tutorial/i18n/format/dateFormat.html):
|Style |U.S. Locale |French Locale|
|---|---|---|
|Default |Jun 30, 2009 7:03:47 AM |30 juin 2009 07:03:47|
|Short |6/30/09 7:03 AM |30/06/09 07:03|
|Medium |Jun 30, 2009 7:03:47 AM |30 juin 2009 07:03:47|
|Long |June 30, 2009 7:03:47 AM PDT |30 juin 2009 07:03:47 PDT|
|Full |Tuesday, June 30, 2009 7:03:47 AM PDT |mardi 30 juin 2009 07 h 03 PDT|
## Rows vs. records
A row is a simple way to organize data: a series of cells, one cell per column. Sometimes there are multiple pieces of information in one cell, such as when a survey respondent can select more than one response. In cases where there is more than one value for a single column in one or more rows, you may wish to use OpenRefines records mode: this defines a single record (a survey response, for example) as potentially containing more than one row. From there you can transform cells into multiple rows, each cell containing one value youd like to work with.
A row is a simple way to organize data: a series of cells, one cell per column. Sometimes there are multiple pieces of information in one cell, such as when a survey respondent can select more than one response.
Generally, when you import some data, OpenRefine reads that data in row mode. From there you can convert the project into records mode. OpenRefine remembers this action and will present you with records mode each time you open the project from then on.
In cases where there is more than one value for a single column in one or more rows, you may wish to use OpenRefines records mode: this defines a single record as potentially containing more than one row. From there you can transform cells into multiple rows, each cell containing one value youd like to work with.
OpenRefine understands records based on the content of the first column, what we call the "key column." Splitting a row into a multi-row record will base all association on the first column in your dataset. If you have more than one column to split out into multiple rows, OpenRefine will keep your data associated with its original record: you can imagine this structure as a tree with many branches, all leading back to the same trunk.
Generally, when you import some data, OpenRefine reads that data in row mode. From the project screen, you can convert the project into records mode. OpenRefine remembers this action and will present you with records mode each time you open the project from then on.
OpenRefine understands records based on the content of the first column, what we call the “key column.” Splitting a row into a multi-row record will base all association on the first column in your dataset.
If you have more than one column to split out into multiple rows, OpenRefine will keep your data associated with its original record, and associate subgroups based on the top-most row in each group.
You can imagine the structure as a tree with many branches, all leading back to the same trunk.
For example, your key column may be a film or television show, with multiple cast members identified by name, associated to that work. You may have one or more roles listed for each person. The roles are linked to the actors, which are linked to the title.
@ -69,10 +107,10 @@ For example, your key column may be a film or television show, with multiple cas
||Margaret Hamilton|Miss Almira Gulch|
|||The Wicked Witch of the West|
Once you are in records mode, you can still move columns around, but if you move a column to the beginning, you may find your data becomes misaligned. The new key column will sort into records based on empty cells, and values in the old key column will be assigned to the last row in the old record (the key value sitting above those values).
Once you are in records mode, you can still move some columns around, but if you move a column to the beginning, you may find your data becomes misaligned. The new key column will sort into records based on empty cells, and values in the old key column will be assigned to the last row in the old record (the key value sitting above those values).
OpenRefine assigns a unique key behind the scenes, so your records dont need a unique identifier in the key column (but you will likely have one, to ensure data stays properly sorted). You can keep track of which rows are assigned to which record by the record number that appears under the “All” column.
OpenRefine assigns a unique key behind the scenes, so your records dont need a unique identifier in the key column. You can keep track of which rows are assigned to each record by the record number that appears under the <span class="menuItems">All</span> column.
To [split multi-valued cells](transforming#split-multi-valued-cells) and apply other operations that take advantage of records mode, see [Transforming data](transforming).
Be careful when in records mode that you do not accidentally delete rows based on being blank in one column where there is a value in another.
Be careful when in records mode that you do not accidentally delete rows based on being blank in one column where there is a value in another.

View File

@ -6,24 +6,26 @@ sidebar_label: Exporting
## Overview
Once your data is cleaned, you will need to get it out of OpenRefine and into the system of your choice. OpenRefine outputs a number of file formats, can upload your data directly into Google Sheets, and can create or update statements on Wikidata.
Once your dataset is ready, you will need to get it out of OpenRefine and into the system of your choice. OpenRefine outputs a number of file formats, can upload your data directly into Google Sheets, and can create or update statements on Wikidata.
You can also [export your full project data](#export-a-project) so that it can be opened by someone else using OpenRefine (or yourself, on another computer).
## Export data
Many of the following options only export data in the current view - that is, with current filters and facets applied. Some will give you the choice to export your entire dataset or just your current view.
![A screenshot of the Export dropdown.](/img/export-menu.png)
To export from a project, click the <span class="menuItems">Export</span> dropdown button at the top right corner and pick the format you want. You options are:
Many of the options only export data in the current view - that is, with current filters and facets applied. Some will give you the choice to export your entire dataset or just the currently-viewed rows.
To export data from a project, click the <span class="menuItems">Export</span> dropdown button in the top right corner and pick the format you want. Your options are:
* Tab-separated value (TSV) or Comma-separated value (CSV)
* HTML-formatted table
* Excel (XLS or XLSX)
* ODF spreadsheet
* Excel spreadsheet (XLS or XLSX)
* Open Document Format (ODF) spreadsheet (ODS)
* Upload to Google Sheets (requires [Google account authorization](starting#google-sheet-from-drive))
* [Custom tabular exporter](#custom-tabular-exporter)
* [SQL statement exporter](#sql-statement-exporter)
* [Templating exporter](#templating-exporter)
* [Templating exporter](#templating-exporter), which generates JSON by default
You can also export reconciled data to Wikidata, or export your Wikidata schema for future use with other OpenRefine projects:
@ -35,7 +37,7 @@ You can also export reconciled data to Wikidata, or export your Wikidata schema
![A screenshot of the custom tabular content tab.](/img/custom-tabular-exporter.png)
With the custom tabular exporter, you can choose which of your data to export, the separator you wish to use, and whether you'd like to download it to your computer or upload it into a Google Sheet.
With the custom tabular exporter, you can choose which of your data to export, the separator you wish to use, and whether you'd like to download the result to your computer or upload it into a Google Sheet.
On the <span class="tabLabels">Content</span> tab, you can drag and drop the columns appearing in the column list to reorder the output. The options for reconciled and date data are applied to each column individually.
@ -44,23 +46,23 @@ This exporter is especially useful with reconciled data, as you can choose wheth
“Output nothing for unmatched cells” will export empty cells for both newly-created matches and cells with no chosen matches. “Link to matched entity's page” will produce hyperlinked text in an HTML table output, but have no effect in other formats.
At this time, the date-formatting options in this window do not work. You can [keep track of this issue on Github](https://github.com/OpenRefine/OpenRefine/issues/3368).
In the future, you will also be able to choose how to [output date-formatted cells](exploring#dates). You can create a custom date output by using [formatting according to the SimpleDateFormat parsing key found here](grelfunctions#todateo-b-monthfirst-s-format1-s-format2-).
In the future, you will be able to choose how to [output date-formatted cells](exploring#dates). You can create a custom date output by using [formatting according to the SimpleDateFormat parsing key found here](grelfunctions#todateo-b-monthfirst-s-format1-s-format2-).
![A screenshot of the custom tabular file download tab.](/img/custom-tabular-exporter2.png)
On the <span class="tabLabels">Download</span> tab, you can generate a preview of how the first ten rows of your dataset will output. If you do not choose one of the file formats on the right, the <span class="buttonLabels">Download</span> button will generate a text file. On the <span class="tabLabels">Upload</span> tab, you can create a new Google Sheet.
With the <span class="tabLabels">Option Code</span> tab, you can copy JSON of your current settings to reuse on another project, or you can paste in existing JSON settings to apply to the current project.
With the <span class="tabLabels">Option Code</span> tab, you can copy JSON of your current custom settings to reuse on another export, or you can paste in existing JSON settings to apply to the current project.
### SQL exporter
The SQL exporter creates a SQL statement containing the data youve exported, which you can use to overwrite or add to an existing database. Choosing <span class="menuItems">Export</span><span class="menuItems">SQL exporter</span> will bring up a window with two tabs: one to define what data to output, and another to modify other aspects of the SQL statement with options to preview and download the statement.
The SQL exporter creates a SQL statement containing the data youve exported, which you can use to overwrite or add to an existing database. Choosing <span class="menuItems">Export</span><span class="menuItems">SQL exporter</span> will bring up a window with two tabs: one to define what data to output, and another to modify other aspects of the SQL statement, with options to preview and download the statement.
![A screenshot of the SQL statement content window.](/img/sql-exporter.png)
The <span class="tabLabels">Content</span> tab allows you to craft your dataset into an SQL table. From here, you can choose which columns to export, the data type to export for each (or choose "VARCHAR"), and the maximum character length for each field (if applicable based on the data type). You can set a default value for empty cells after unchecking “Allow null” in one or more columns.
With this output tool, you can choose whether to output only currently visible rows, or all the rows in your dataset, as well as whether to include empty rows. Trimming column names will remove their whitespace characters.
With this output tool, you can choose whether to output only currently visible rows, or all the rows in your dataset, as well as whether to include empty rows. The option to “Trim column names will remove their whitespace characters.
![A screenshot of the SQL statement download window.](/img/sql-exporter2.png)
@ -68,9 +70,9 @@ The <span class="tabLabels">Download</span> tab allows you to finalize your comp
<span class="fieldLabels">Include schema</span> means that you will start your statement with the creation of a table. Without that, you will only have an INSERT statement.
<span class="fieldLabels">Include content</span> means the INSERT statement with data from your project. Without that, you will only create empty columns.
<span class="fieldLabels">Include content</span> means including the INSERT statement with data from your project. Without that, you will only create empty columns.
You can include DROP and IF EXISTS if you require them, and set a name for the table which the statement will refer to.
You can include DROP and IF EXISTS if you require them, and set a name for the table to which the statement will refer.
You can then preview your statement, which will open up a new browser tab/window showing a statement with the first ten rows of your data (if included), or you can save a `.sql` file to your computer.
@ -78,30 +80,36 @@ You can then preview your statement, which will open up a new browser tab/window
If you pick <span class="menuItems">Templating…</span> from the <span class="menuItems">Export</span> dropdown menu, you can “roll your own” exporter. This is useful for formats that we don't support natively yet, or won't support. The Templating exporter generates JSON by default.
The window that appears allows you to set your own separators, prefix, and suffix to create a complete dataset in the language of your choice. In the <span class="fieldLabels">Row Template</span> section, you can choose which columns to generate from each row by calling them with variables.
![A screenshot of the Templating exporter generating JSON by default.](/img/templating-exporter.png)
The Templating Export window allows you to set your own separators, prefix, and suffix to create a complete dataset in the language of your choice. In the <span class="fieldLabels">Row template</span> section, you can choose which columns to generate from each row by calling them with [variables](expressions#variables).
This can be used to:
* output reconciliation data (`cells["column name"].recon.match.name`, `.recon.match.id`, and `.recon.best.name`, for example) instead of cell values
* create multiple columns of output from different member fields of a single project column
* employ GREL expressions to modify cell data for output (for example, `cells["column name"].value.toUppercase()`).
* output [reconciliation data](expressions#reconciliation), such as `cells["ColumnName"].recon.match.name`
* create multiple columns of output from different [member fields](expressions#variables) of a single project column
* employ [expressions](expressions) to modify data for output: for example, `cells["ColumnName"].value.toUppercase()`.
Anything that appears inside doubled curly braces ({{}}) is treated as a GREL expression; anything outside is generated as straight text. You can use Jython or Clojure by declaring it at the start: for example, `{{jython:return cells["Author"].value}}` will run a Jython expression.
Anything that appears inside doubled curly braces ({{ }}) is treated as a GREL expression; anything outside is generated as straight text. You can use Jython or Clojure by declaring it at the start:
```
{{jython:return cells["ColumnName"].value}}
```
:::caution
Note that some syntax is different in this tool than elsewhere in OpenRefine: a forward slash must be escaped with a backslash, while other characters do not need escaping. You cannot, at this time, include a closing curly brace (}) anywhere in your expression, or it will cause it to malfunction.
:::
You can include [regular expressions](expressions#regular-expressions) as usual (inside forward slashes, with any GREL function that accepts them). For example, you could output a version of your cells with punctuation removed, using an expression such as `{{jsonize(cells["Column Name"].value.replaceChars("/[.!?$&,/]/",""))}}`.
You can include [regular expressions](expressions#regular-expressions) as usual (inside forward slashes, with any GREL function that accepts them). For example, you could output a version of your cells with punctuation removed, using an expression such as
```
{{jsonize(cells["ColumnName"].value.replaceChars("/[.!?$&,/]/",""))}}
```
You could also simply output a plain-text document inserting data from your project into sentences (for example, "In `{{cells["Year"].value}}` we received `{{cells["RequestCount"].value}}` requests.").
You could also simply output a plain-text document inserting data from your project into sentences: for example, "In `{{cells["Year"].value}}` we received `{{cells["RequestCount"].value}}` requests."
You can use the shorthand `${Column Name}` (no need for quotes) to insert column values directly. You cannot use this inside an expression, because of the closing curly brace.
You can use the shorthand `${ColumnName}` (no need for quotes) to insert column values directly. You cannot use this inside an expression, because of the closing curly brace.
If your projects is in records mode, the <span class="fieldLabels">Row separator</span> field will insert a separator between records, rather than individual rows. Rows inside a single record will be directly appended to one another as per the content in the <span class="fieldLabels">Row Template</span> field.
![A screenshot of the Templating exporter generating JSON by default.](/img/templating-exporter.png)
Once you have created your template, you may wish to save the text you produced in each field, in order to reuse it in the future. Once you click <span class="buttonLabels">Export</span> OpenRefine will output a simple text file, and your template will be discarded.
Once you have created your template, you may wish to save the text you produced in each field, in order to reuse it in the future. Once you click <span class="buttonLabels">Export</span> OpenRefine will output a simple `.txt` file, and your template will be discarded.
We have recipes on using the Templating exporter to [produce several different formats](https://github.com/OpenRefine/OpenRefine/wiki/Recipes#12-templating-exporter).
@ -109,13 +117,17 @@ We have recipes on using the Templating exporter to [produce several different f
You can share a project in progress with another computer, a colleague, or with someone who wants to check your history. This can be useful for showing that your data cleanup didnt distort or manipulate the information in any way. Once you have exported a project, another OpenRefine installation can [import it as a new project](starting#import-a-project).
You can either save it locally or upload it to Google Drive (which requires you to authorize a Google account).
:::caution
OpenRefine project archives contain confidential data from previous steps which is still accessible to anyone who has the file. If you are hoping to keep your original dataset hidden for privacy reasons, such as using OpenRefine to anonymize information, do not share your project archive.
OpenRefine project archives contain confidential data from previous steps, which will still be accessible to anyone who has the archive. If you are hoping to keep your original dataset hidden for privacy reasons, such as using OpenRefine to anonymize information, do not share your project archive.
:::
From the <span class="menuItems">Export</span> dropdown, select <span class="menuItems">OpenRefine project archive to file</span>. OpenRefine exports your full project with all of its history. It does not export any current views or applied facets. Any reconciliation information will be preserved, but the importing installation will need to add the same reconciliation services to keep working with that data.
To save your project archive locally: from the <span class="menuItems">Export</span> dropdown, select <span class="menuItems">OpenRefine project archive to file</span>. OpenRefine exports your full project with all of its history. It does not export any current views or applied facets. Existing reconciliation information will be preserved, but the importing computer will need to add the same reconciliation services to keep working with that data.
OpenRefine exports files in `.tar.gz` format. You can rename the file when you save it; otherwise it will bear the project name. You can either save it locally or upload it to Google Drive (which requires you to authorize a Google account), using the <span class="menuItems">OpenRefine project archive to Google Drive...</span> option. OpenRefine will not share the link with you, only confirm that the file was uploaded.
OpenRefine exports files in `.tar.gz` format. You can rename the file when you save it; otherwise it will bear the project name.
To save your project archive to Google Drive: from the <span class="menuItems">Export</span> dropdown, select <span class="menuItems">OpenRefine project archive to Google Drive...</span>. OpenRefine will not share the link with you, only confirm that the file was uploaded.
## Export operations

View File

@ -6,14 +6,12 @@ sidebar_label: Expressions
## Overview
You can use expressions in multiple places in OpenRefine to extend data cleanup and manipulation.
Expressions are available with the following functions:
You can use expressions in multiple places in OpenRefine to extend data cleanup and transformation. Expressions are available with the following functions:
* <span class="menuItems">Facet</span>:
* <span class="menuItems">Custom text facet...</span>
* <span class="menuItems">Custom numeric facet…</span>
* You can also manually “change” most Customized facets after they have been created, which will bring up an expressions window.
* <span class="menuItems">Customized facets</span> (click “change” after they have been created to bring up an expressions window)
* <span class="menuItems">Edit cells</span>:
* <span class="menuItems">Transform…</span>
@ -24,11 +22,11 @@ Expressions are available with the following functions:
* <span class="menuItems">Split</span>
* <span class="menuItems">Join</span>
* <span class="menuItems">Add column based on this column</span>
* <span class="menuItems">Add column by fetching URLs</span>
* <span class="menuItems">Add column by fetching URLs</span>.
In the expressions editor window you will have the opportunity to select one supported language. The default is [GREL (General Refine Expression Language)](#grel-general-refine-expression-language); OpenRefine also comes with support for [Clojure](#clojure) and [Jython](#jython). Extensions may offer support for more expressions languages.
In the expressions editor window you have the opportunity to select a supported language. The default is [GREL (General Refine Expression Language)](#grel-general-refine-expression-language); OpenRefine also comes with support for [Clojure](#clojure) and [Jython](#jython). Extensions may offer support for more expressions languages.
These languages have some syntax differences but support most of the same [variables](#variables). For example, the GREL expression `value.split(" ")[1]` would be written in Jython as `return value.split(" ")[1]`.
These languages have some syntax differences but support many of the same [variables](#variables). For example, the GREL expression `value.split(" ")[1]` would be written in Jython as `return value.split(" ")[1]`.
This page is a general reference for available functions, variables, and syntax. For examples that use these expressions for common data tasks, look at the [Recipes section on the Wiki](https://github.com/OpenRefine/OpenRefine/wiki/Documentation-For-Users#recipes-and-worked-examples).
@ -49,34 +47,34 @@ Were you to apply a transformation to the “friend” column with the expressio
value.split(" ")[1]
```
OpenRefine would work through each row, splitting the “friend” values based on a space character. `value` for row 1 would be “John Smith” so the output would be “Smith” (as "[1]" selects the second part of the created output); `value` for row 2 would be “Jane Doe” so the output would be “Doe.” Using variables, a single expression yields different results for different rows. The old information would be discarded; you couldn't get "John" and "Jane" back unless you undid the operation in the History tab.
OpenRefine would work through each row, splitting the “friend” values based on a space character. The `value` for row 1 is “John Smith” so the output would be “Smith” (as "[1]" selects the second part of the created output); the `value` for row 2 is “Jane Doe” so the output would be “Doe”. Using variables, a single expression yields different results for different rows. The old information would be discarded; you couldn't get "John" and "Jane" back unless you undid the operation in the [History](running#history-undoredo) tab.
For another example, if you were to create a new column based on your data using the expression `row.starred`, it would generate a column of true and false values based on whether your rows were starred at that moment. If you were to then star more rows and unstar some rows, that data would not dynamically update - you would need to run the operation again to have current true/false values.
Note that an expression is typically based on one particular column in the data - the column whose drop-down menu is invoked. Many variables are created to stand for things about the cell in that “base column” of the current row on which the expression is evaluated. There are also variables about rows, which you can use to access cells in other columns.
Note that an expression is typically based on one particular column in the data - the column whose drop-down menu is first selected. Many variables are created to stand for things about the cell in that “base column” of the current row on which the expression is evaluated. There are also variables about rows, which you can use to access cells in other columns.
### The expressions editor
When you select a function that offers the ability to supply expressions, you will see a window overlay the screen with what we call the expressions editor.
When you select a function that accepts expressions, you will see a window overlay the screen with what we call the expressions editor.
![The expressions editor window with a simple expression: value + 10.](/img/expression-editor.png)
The expressions editor offers you a field for entering your formula and shows you a preview of its transformation on your first few rows of cells.
There is a dropdown menu from which you can choose an expression language. The default is GREL. Jython and Clojure are also offered with the installation package, and you may be able to add more language support with third-party extensions and customizations.
There is a dropdown menu from which you can choose an expression language. The default at first is GREL; if you begin working with another language, that selection will persist across OpenRefine. Jython and Clojure are also offered with the installation package, and you may be able to add more language support with third-party extensions and customizations.
There are also tabs for:
* History, which shows you formulas youve recently used from across all your projects
* Starred, which shows you formulas from your History that youve starred for reuse
* Help, a quick reference to GREL functions.
* <span class="tabLabels">History</span>, which shows you formulas youve recently used from across all your projects
* <span class="tabLabels">Starred</span>, which shows you formulas from your History that youve starred for reuse
* <span class="tabLabels">Help</span>, a quick reference to GREL functions.
Starring formulas youve used in the past can be very helpful for repetitive tasks youre performing in batches.
Starring formulas youve used in the past can be helpful for repetitive tasks youre performing in batches.
You can also choose how formula errors are handled: replicate the original cell value, output an error message into the cell, or ouput a blank cell.
### Regular expressions
OpenRefine offers several fields that support the use of regular expressions (regex), such as in a <span>Text filter</span> or a <span>Replace…</span> operation. GREL and other expressions can also use regular expression markup to extend their functionality.
OpenRefine offers several fields that support the use of regular expressions (regex), such as in a <span class="menuItems">Text filter</span> or a <span class="menuItems">Replace…</span> operation. GREL and other expressions can also use regular expression markup to extend their functionality.
If this is your first time working with regex, you may wish to read [this tutorial specific to the Java syntax that OpenRefine supports](https://docs.oracle.com/javase/tutorial/essential/regex/). We also recommend this [testing and learning tool](https://regexr.com/).
@ -92,19 +90,19 @@ the regular expression is `\s+`, and the syntax used in the expression wraps it
Do not use slashes to wrap regular expressions outside of a GREL expression.
The [GREL functions](#grel-general-refine-expression-language) that support regex are:
* contains
* replace
* find
* match
* partition
* rpartition
* split
* smartSplit
On the [GREL functions](#grel-general-refine-expression-language) page, functions that support regex will indicate that with a “p” for “pattern.” The GREL functions that support regex are:
* [contains](grelfunctions#containss-sub-or-p)
* [replace](grelfunctions#replaces-s-or-p-find-s-replace)
* [find](grelfunctions#finds-sub-or-p)
* [match](grelfunctions#matchs-p)
* [partition](grelfunctions#partitions-s-or-p-fragment-b-omitfragment-optional)
* [rpartition](grelfunctions#rpartitions-s-or-p-fragment-b-omitfragment-optional)
* [split](grelfunctions#splits-s-or-p-sep)
* [smartSplit](grelfunctions#smartsplits-s-or-p-sep-optional)
#### Jython-supported regex
You can also use [regex with Jython expressions](http://www.jython.org/docs/library/re.html), instead of GREL, for example with a Custom Text Facet:
You can also use [regex with Jython expressions](http://www.jython.org/docs/library/re.html), instead of GREL, for example with a <span class="menuItems">Custom Text Facet</span>:
```
python import re g = re.search(ur"\u2014 (.*),\s*BWV", value) return g.group(1)
@ -120,7 +118,7 @@ clojure (nth (re-find #"\u2014 (.*),\s*BWV" value) 1)
## Variables
Most of the OpenRefine-specific variables have attributes: aspects of the variables that can be called separately. We call these attributes "member fields" because they belong to certain variables. For example, you can query a record to find out how many rows it contains with `row.record.rowCount`: `rowCount` is a member field specific to `record`, which is a member field of `row`. Member fields can be called using a dot separator, or with square brackets (`row["record"]`).
Most OpenRefine variables have attributes: aspects of the variables that can be called separately. We call these attributes “member fields” because they belong to certain variables. For example, you can query a record to find out how many rows it contains with `row.record.rowCount`: `rowCount` is a member field specific to the `record` variable, which is a member field of `row`. Member fields can be called using a dot separator, or with square brackets (`row["record"]`). The square bracket syntax is also used for variables that can call columns by name, for example, `cells["Postal Code"]`.
|Variable |Meaning |
|-|-|
@ -141,49 +139,51 @@ The `row` variable itself is best used to access its member fields, which you ca
|-|-|
| `row.index` | The index value of the current row (the first row is 0) |
| `row.cells` | The cells of the row, returned as an array |
| `row.columnNames` | An array of the column names of the row, i.e. the column names in the project. This will report all columns, even those with null cell values in the particular row. Call a column by number with row.columnNames[3] |
| `row.columnNames` | An array of the column names of the project. This will report all columns, even those with null cell values in that particular row. Call a column by number with `row.columnNames[3]` |
| `row.starred` | A boolean indicating if the row is starred |
| `row.flagged` | A boolean indicating if the row is flagged |
| `row.record` | The [record](#record) object containing the current row |
For array objects such as `row.columnNames` you can preview the array using the expressions window, and output it as a string using `toString(row.columnNames)` or with something like:
```forEach(row.columnNames,v,v).join("; ")```
```
forEach(row.columnNames,v,v).join("; ")
```
### Cells
The `cells` object is used to call information from the columns in your project. For example, `cells.Foo` returns a [cell](#cell) object representing the cell in the column named “Foo” of the current row. If the column name has spaces, use square brackets, e.g., `cells["Postal Code"]`. There is no `cells.value` - it can only be used with member fields. To get the corresponding column value inside the `cells` variable, use `.value` at the end, for example `cells["Postal Code"].value`.
The `cells` object is used to call information from the columns in your project. For example, `cells.Foo` returns a [cell](#cell) object representing the cell in the column named “Foo” of the current row. If the column name has spaces, use square brackets, e.g., `cells["Postal Code"]`. To get the corresponding column's value inside the `cells` variable, use `.value` at the end, for example, `cells["Postal Code"].value`. There is no `cells.value` - it can only be used with member fields.
### Cell
A `cell` object contains all the data of a cell and is stored as a single object that has two fields.
A `cell` object contains all the data of a cell and is stored as a single object.
You can use `cell` on its own in the expressions editor to copy all the contents of a column to another column, including reconciliation information. Although the preview in the expressions editor will only show a small representation [object Cell], it will actually copy all the cell's data. Try this with <span class="menuItems">Edit Column</span><span class="menuItems">Add Column based on this column ...</span>.
You can use `cell` on its own in the expressions editor to copy all the contents of a column to another column, including reconciliation information. Although the preview in the expressions editor will only show a small representation (“[object Cell]”), it will actually copy all the cell's data. Try this with <span class="menuItems">Edit Column</span><span class="menuItems">Add Column based on this column ...</span>.
|Field |Meaning |Member fields |
|-|-|-|
| `cell` | An object containing the entire contents of the cell | .value, .recon, .errorMessage |
| `cell.value` | The value in the cell, which can be a string, a number, a boolean, null, or an error | |
| `cell.recon` | An object encapsulating reconciliation results for that cell | See the reconciliation section below |
| `cell.recon` | An object encapsulating reconciliation results for that cell | See the [reconciliation](expressions#reconciliation) section |
| `cell.errorMessage` | Returns the message of an *EvalError* instead of the error object itself (use value to return the error object) | .value |
### Reconciliation
Several of the fields here are equivalent to what can be used through [reconciliation facets](reconciling#reconciliation-facets). You must type `cell.recon`; `recon` on its own will not work.
Several of the fields here provide the data used in [reconciliation facets](reconciling#reconciliation-facets). You must type `cell.recon`; `recon` on its own will not work.
|Field|Meaning |Member fields |
|-|-|-|
| `cell.recon.judgment` | A string, either "matched", "new", "none" | |
| `cell.recon.judgmentAction` | A string, either "single" or "similar" (or "unknown") | |
| `cell.recon.judgment` | A string: either “matched”, "new”, "none” | |
| `cell.recon.judgmentAction` | A string: either "single” or “similar” (or “unknown”) | |
| `cell.recon.judgmentHistory` | A number, the epoch timestamp (in milliseconds) of your judgment | |
| `cell.recon.matched` | A boolean, true if judgment is "matched" | |
| `cell.recon.matched` | A boolean, true if judgment is “matched” | |
| `cell.recon.match` | The recon candidate that has been matched against this cell (or null) | .id, .name, .type |
| `cell.recon.best` | The highest scoring recon candidate from the reconciliation service (or null) | .id, .name, .type, .score |
| `cell.recon.features` | An array of reconciliation features to help you assess the accuracy of your matches | .typeMatch, .nameMatch, .nameLevenshtein, .nameWordDistance |
| `cell.recon.features.typeMatch` | A boolean, true if your chosen type is "matched" and false if not (or "(no type)" if unreconciled) | |
| `cell.recon.features.nameMatch` | A boolean, true if the cell and candidate strings are identical and false if not (or "(unreconciled)") | |
| `cell.recon.features.nameLevenshtein` | A number, representing the [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance): larger if the difference is greater between value and candidate | |
| `cell.recon.features.nameWordDistance` | A number, based on the [word similarity](reconciling#reconciliation-facets) | |
| `cell.recon.features.typeMatch` | A boolean, true if your chosen type is “matched” and false if not (or “(no type)” if unreconciled) | |
| `cell.recon.features.nameMatch` | A boolean, true if the cell and candidate strings are identical and false if not (or “(unreconciled)”) | |
| `cell.recon.features.nameLevenshtein` | A number representing the [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance): larger if the difference is greater between value and candidate | |
| `cell.recon.features.nameWordDistance` | A number based on the [word similarity](reconciling#reconciliation-facets) | |
| `cell.recon.candidates` | An array of the top 3 candidates (default) | .id, .name, .type, .score |
The `cell.recon.candidates` and `cell.recon.best` objects have a few deeper fields: `id`, `name`, `type`, and `score`. `type` is an array of type identifiers for a list of candidates, or a single string for the best candidate.
@ -203,7 +203,7 @@ A `row.record` object encapsulates one or more rows that are grouped together, w
| `row.record.cells` | The cells of the row |
| `row.record.fromRowIndex` | The row index of the first row in the record |
| `row.record.toRowIndex` | The row index of the last row in the record + 1 (i.e. the next row) |
| `row.record.rowCount` | count of the number of rows in the record |
| `row.record.rowCount` | A count of the number of rows in the record |
## GREL (General Refine Expression Language)
@ -219,7 +219,9 @@ GREL is designed to resemble Javascript. Formulas use variables and depend on da
| `value.substring(7, 10)` | Output the substring of the value from character index 7, 8, and 9 (excluding character index 10) |
| `value.substring(13)` | Output the substring from index 13 to the end of the string |
If you're used to Excel, note that the operator for string concatenation is + (not &). Evaluating conditions uses symbols such as <, >, *, /, etc. To check whether two objects are equal, use two equal signs (`value=="true"`).
Note that the operator for string concatenation is `+` (not “&” as is used in Excel).
Evaluating conditions uses symbols such as <, >, *, /, etc. To check whether two objects are equal, use two equal signs (`value=="true"`).
### Syntax
@ -234,7 +236,7 @@ The second form is a shorthand to make expressions easier to read. It simply pul
| `value.trim().length()` | `length(trim(value))` |
| `value.substring(7, 10)` | `substring(value, 7, 10)` |
So, in the dot shorthand, the functions occur from left to right in the order of calling, rather than in the reverse order with parentheses.
So, in the dot shorthand, the functions occur from left to right in the order of calling, rather than in the reverse order with parentheses. This allows you to string together multiple functions in a readable order.
The dot notation can also be used to access the member fields of [variables](#variables). For referring to column names that contain spaces (anything not a continuous string), use square brackets instead of dot notation:
@ -243,7 +245,7 @@ The dot notation can also be used to access the member fields of [variables](#va
| `FirstName.cells` | Access the cell in the column named “FirstName” of the current row |
| `cells["First Name"]` | Access the cell in the column called “First Name” of the current row |
Brackets can also be used to get substrings and sub-arrays, and single items from arrays:
Square brackets can also be used to get substrings and sub-arrays, and single items from arrays:
|Example |Description |
|-|-|
@ -251,24 +253,26 @@ Brackets can also be used to get substrings and sub-arrays, and single items fro
| `"internationalization"[1,-2]` | Will return “nternationalizati” (negative indexes are counted from the end) |
| `row.columnNames[5]` | Will return the name of the fifth column |
Any function that outputs an array can use square brackets to select only one part of the array to output as a string (remember that the index of the items in an array starts with 0). For example, partition() would normally output an array of three items: the part before your chosen fragment, the fragment you've identified, and the part after. Selecting the third part with "internationalization".partition("nation")[2] will output “alization” (and so will [-1], indicating the final item in the array).
Any function that outputs an array can use square brackets to select only one part of the array to output as a string (remember that the index of the items in an array starts with 0).
### Controls
For example, partition() would normally output an array of three items: the part before your chosen fragment, the fragment you've identified, and the part after. Selecting only the third part with `"internationalization".partition("nation")[2]` will output “alization” (and so will [-1], indicating the final item in the array).
GREL offers controls to support branching and looping (that is, “if” and “for” functions), but unlike functions, their arguments don't all get evaluated before they get run. A control can decide which part of the code to execute and can affect the environment bindings. Functions, on the other hand, can't do either. Each control decides which of their arguments to evaluate to value, and how.
### GREL controls
GREL offers controls to support branching and looping (that is, “if” and “for” functions), but unlike functions, their arguments don't all get evaluated before they get run. A control can decide which part of the code to execute and can affect the environment bindings. Functions, on the other hand, can't do either. Each control decides which of their arguments to evaluate to `value`, and how.
Please note that the GREL control names are case-sensitive: for example, the isError() control can't be called with iserror().
#### if(e, expression eTrue, expression eFalse)
#### if(e, eTrue, eFalse)
Expression o is evaluated to a value. If that value is true, then expression eTrue is evaluated and the result is the value of the whole `if` expression. Otherwise, expression eFalse is evaluated and that result is the value.
Expression e is evaluated to a value. If that value is true, then expression eTrue is evaluated and the result is the value of the whole if() expression. Otherwise, expression eFalse is evaluated and that result is the value.
Examples:
| Example expression | Result |
| ------------------------------------------------------------------------ | ------------ |
| `if("internationalization".length() > 10, "big string", "small string")` | big string |
| `if(mod(37, 2) == 0, "even", "odd")` | odd |
| `if("internationalization".length() > 10, "big string", "small string")` | big string |
| `if(mod(37, 2) == 0, "even", "odd")` | odd |
Nested if (switch case) example:
@ -290,7 +294,7 @@ Evaluates expression e1 and binds its value to variable v. Then evaluates expres
| `with("european union".split(" "), a, forEach(a, v, v.length()))` | [ 8, 5 ] |
| `with("european union".split(" "), a, forEach(a, v, v.length()).sum() / a.length())` | 6.5 |
#### filter(e1, variable v, e test)
#### filter(e1, v, e test)
Evaluates expression e1 to an array. Then for each array element, binds its value to variable v, evaluates expression test - which should return a boolean. If the boolean is true, pushes v onto the result array.
@ -298,7 +302,7 @@ Evaluates expression e1 to an array. Then for each array element, binds its valu
| ---------------------------------------------- | ------------- |
| `filter([ 3, 4, 8, 7, 9 ], v, mod(v, 2) == 1)` | [ 3, 7, 9 ] |
#### forEach(e1, variable v, e2)
#### forEach(e1, v, e2)
Evaluates expression e1 to an array. Then for each array element, binds its value to variable v, evaluates expression e2, and pushes the result onto the result array.
@ -306,7 +310,7 @@ Evaluates expression e1 to an array. Then for each array element, binds its valu
| ------------------------------------------ | ------------------- |
| `forEach([ 3, 4, 8, 7, 9 ], v, mod(v, 2))` | [ 1, 0, 0, 1, 1 ] |
#### forEachIndex(e1, variable i, variable v, e2)
#### forEachIndex(e1, i, v, e2)
Evaluates expression e1 to an array. Then for each array element, binds its index to variable i and its value to variable v, evaluates expression e2, and pushes the result onto the result array.
@ -314,15 +318,15 @@ Evaluates expression e1 to an array. Then for each array element, binds its inde
| ------------------------------------------------------------------------------- | --------------------------- |
| `forEachIndex([ "anne", "ben", "cindy" ], i, v, (i + 1) + ". " + v).join(", ")` | 1. anne, 2. ben, 3. cindy |
#### forRange(n from, n to, n step, variable v, e)
#### forRange(n from, n to, n step, v, e)
Iterates over the variable v starting at from, incrementing by step each time while less than to. At each iteration, evaluates expression e, and pushes the result onto the result array.
Iterates over the variable v starting at from, incrementing by the value of step each time while less than to. At each iteration, evaluates expression e, and pushes the result onto the result array.
#### forNonBlank(e, variable v, expression eNonBlank, expression eBlank)
#### forNonBlank(e, v, eNonBlank, eBlank)
Evaluates expression e. If it is non-blank, forNonBlank() binds its value to variable v, evaluates expression eNonBlank and returns the result. Otherwise (if o evaluates to blank), forNonBlank() evaluates expression eBlank and returns that result instead.
Evaluates expression e. If it is non-blank, forNonBlank() binds its value to variable v, evaluates expression eNonBlank and returns the result. Otherwise (if e evaluates to blank), forNonBlank() evaluates expression eBlank and returns that result instead.
Unlike other GREL functions beginning with "for", forNonBlank() is not iterative. forNonBlank() essentially offers a shorter syntax to achieving the same outcome by using the isNonBlank() function within an "if" statement.
Unlike other GREL functions beginning with “for,” forNonBlank() is not iterative. forNonBlank() essentially offers a shorter syntax to achieving the same outcome by using the isNonBlank() function within an “if” statement.
#### isBlank(e), isNonBlank(e), isNull(e), isNotNull(e), isNumeric(e), isError(e)
@ -333,7 +337,7 @@ Examples:
| Expression | Result |
| ------------------- | ------- |
| `isBlank("abc")` | false |
| `isNonBlank("abc")` | true |
| `isNonBlank("abc")` | true |
| `isNull("abc")` | false |
| `isNotNull("abc")` | true |
| `isNumeric(2)` | true |
@ -341,7 +345,7 @@ Examples:
| `isError("abc")` | false |
| `isError(1 / 0)` | true |
Remember that these are controls and not functions. So you cant use dot notation (the `e.isX()` syntax).
Remember that these are controls and not functions: you cant use dot notation (for example, the format `e.isX()` will not work).
### Constants
|Name |Meaning |
@ -352,9 +356,13 @@ Remember that these are controls and not functions. So you cant use dot notat
## Jython
Jython 2.7.2 comes bundled with the default installation of OpenRefine 3.4.1. You can add libraries and code by following [this tutorial](https://github.com/OpenRefine/OpenRefine/wiki/Extending-Jython-with-pypi-modules). A large number of Python files (.py or .pyc) are compatible. Python code that depends on C bindings will not work in OpenRefine, which uses Java / Jython only. Since Jython is essentially Java, you can also import Java libraries and utilize those. Remember to restart OpenRefine, so that new Jython/Python libraries are initialized during Butterfly's startup.
Jython 2.7.2 comes bundled with the default installation of OpenRefine 3.4.1. You can add libraries and code by following [this tutorial](https://github.com/OpenRefine/OpenRefine/wiki/Extending-Jython-with-pypi-modules). A large number of Python files (`.py` or `.pyc`) are compatible.
OpenRefine now has [most of the Jsoup.org library built into GREL functions](#jsoup-xml-and-html-parsing-functions), for parsing and working with HTML elements and extraction.
Python code that depends on C bindings will not work in OpenRefine, which uses Java / Jython only. Since Jython is essentially Java, you can also import Java libraries and utilize those.
You will need to restart OpenRefine, so that new Jython or Python libraries are initialized during startup.
OpenRefine now has [most of the Jsoup.org library built into GREL functions](#jsoup-xml-and-html-parsing-functions) for parsing and working with HTML and XML elements.
### Syntax
@ -368,19 +376,19 @@ Expressions in Jython must have a `return` statement:
return rowIndex%2
```
Fields have to be accessed using the bracket operator rather than the dot operator:
Fields have to be accessed using the bracket operator rather than dot notation:
```
return cells["col1"]["value"]
```
To access the Levenshtein distance between the reconciled value and the cell value (?) use the [recon variables](#reconciliation):
For example, to access the [edit distance](reconciling#reconciliation-facets) between a reconciled value and an original cell value using [recon variables](#reconciliation):
```
return cell["recon"]["features"]["nameLevenshtein"]
```
To return the lower case of value (if the value is not null):
To return the lower case of `value` (if the value is not null):
```
if value is not None:
@ -391,7 +399,7 @@ To return the lower case of value (if the value is not null):
### Tutorials
- [Extending Jython with pypi modules](https://github.com/OpenRefine/OpenRefine/wiki/Extending-Jython-with-pypi-modules)
- [Working with Phone numbers using Java libraries inside Python](https://github.com/OpenRefine/OpenRefine/wiki/Jython#tutorial---working-with-phone-numbers-using-java-libraries-inside-python)
- [Working with phone numbers using Java libraries inside Python](https://github.com/OpenRefine/OpenRefine/wiki/Jython#tutorial---working-with-phone-numbers-using-java-libraries-inside-python)
Full documentation on the Jython language can be found on its official site: [http://www.jython.org](http://www.jython.org).
@ -415,4 +423,4 @@ For help with syntax, see the [Clojure website's guide to syntax](https://clojur
User-contributed Clojure recipes can be found on our wiki at [https://github.com/OpenRefine/OpenRefine/wiki/Recipes#11-clojure](https://github.com/OpenRefine/OpenRefine/wiki/Recipes#11-clojure).
Full documentation on the Clojure language can be found on its official site: [https://clojure.org/](https://clojure.org/).
Full documentation on the Clojure language can be found on its official site: [https://clojure.org/](https://clojure.org/).

View File

@ -1,4 +1,4 @@
---
---
id: facets
title: Exploring facets
sidebar_label: Facets
@ -6,16 +6,18 @@ sidebar_label: Facets
## Overview
Facets are one of OpenRefines strongest features - thats where the diamond logo comes from! Faceting allows you to look for patterns and trends. Facets are essentially aspects or angles of data variance in a given column. For example, if you had survey data where respondents indicated one of five responses from “Strongly agree” to “Strongly disagree,” those five responses make up a text facet, showing how many people selected each option.
Facets are one of OpenRefines strongest features - thats where the diamond logo comes from!
Faceting allows you to look for patterns and trends. Facets are essentially aspects or angles of data variance in a given column. For example, if you had survey data where respondents indicated one of five responses from “Strongly agree” to “Strongly disagree,” those five responses make up a text facet, showing how many people selected each option.
Faceted browsing gives you a big-picture look at your data (do they agree or disagree?) and also allows you to filter down to a specific subset to explore it more (what do people who disagree say in other responses?).
Typically, you create a facet on a particular column. That facet selection appears on the left, in the Facet/Filter tab, and you can click on a displayed facet to view all the records that match. You can also “exclude” the facet, to view every record that does _not_ match, and you can select more than one facet by clicking “include.”
Typically, you create a facet on a particular column. That facet selection appears on the left, in the <span class="tabLabels">Facet/Filter</span> tab, and you can click on a displayed facet to view all the records that match. You can also “exclude” the facet, to view every record that does _not_ match, and you can select more than one facet by clicking “include.”
### An example
You can learn about facets and filtering with the following example.
You can learn about facets and filtering with the following example. You can copy the following table and paste it using the <span class="menuItems">Clipboard</span> method of starting a project if you would like to try it yourself.
We collected a list of the [10 most populous cities from Wikidata](https://w.wiki/3Em), using an example query of theirs. We removed the GPS coordinates and added the country.
@ -32,9 +34,9 @@ We collected a list of the [10 most populous cities from Wikidata](https://w.wik
| Guangzhou | 13080500 | People's Republic of China |
| São Paulo | 12106920 | Brazil |
If we want to see which countries have the most populous cities, we can create a text facet on the “countryLabel” column and OpenRefine will generate a list of all the different strings used in these cells.
If we want to see which countries have the most populous cities, we can create a text facet on the “countryLabel” column and OpenRefine will generate a list of all the different strings used in these cells.
We will see in the sidebar that the countries identified are displayed, along with the number of matches (the “count”). We can sort this list alphabetically or by the count. If you sort by count, youll learn which countries hold the most populous cities.
We will see in the sidebar that the countries identified are displayed, along with the number of matches (the “count”). We can sort this list alphabetically or by the count. If you sort by count at the top of the facet window, youll learn which countries hold the most populous cities.
|Facet|Count|
|---|---|
@ -48,25 +50,25 @@ We will see in the sidebar that the countries identified are displayed, along wi
If we want to learn more about a particular country, we can click on its appearance in the facet sidebar. This narrows our dataset down temporarily to only rows matching that facet.
Youll see the “10 rows” notification change to “4 matching rows (10 total)” if you click on “Peoples Republic of China”. In the data grid, youll see the same number of rows, but only the ones matching your current filter. Each row will maintain its original numbering, though - in this case, rows #1, 2, and 8.
Youll see the “10 rows” indicator change to “4 matching rows (10 total)” if you click on “Peoples Republic of China”. In the data grid, youll see fewer rows: only the ones matching your current filter. Each row will maintain its original numbering, though - in this case, rows #1, 2, and 8.
If you want to go back to the original dataset, click “reset” or “exclude.” If you want to view the most populous cities in both China and India, click “include” next to each facet. Now youll see 5 rows - #1, 2, 5, 8, 9.
If you want to go back to the original dataset, click <span class="buttonLabels">Reset All</span> or the small “exclude” text next to the facet. If you want to view the most populous cities in both China and India, click “include” next to each facet. Now youll see 5 rows - #1, 2, 5, 8, 9.
We can also explore our data using the population information. In this case, because population is a number, we can create a numeric facet. This will give us the ability to explore by range rather than by exact matching values.
With the numeric facet, we are given a scale from the smallest to the largest value in the column. We can drag the range minimum and maximum to narrow the results. In this case, if we narrow down to only cities with more than 20 million in population, we get 3 matching rows out of the original 10.
When you look at the facet display of countries, you should see a smaller list with a reduced count: OpenRefine is now displaying the facets of the 3 matching rows, not the total dataset of 10 rows.
When you look back at the text facet display of country names, you should see a smaller list with a reduced count: OpenRefine is now displaying the facets of the 3 matching rows, not the total dataset of 10 rows.
We can combine these facets - say, by narrowing to only the Chinese cities with populations greater than 20 million - simply by clicking in both. You should see 2 matching rows for both these criteria.
### Things to know about facets
When you have facets applied, you will see “matching rows” in the [project grid header](running#project-grid-header). If you press “Export” and copy your data out of OpenRefine while facets are active, you will only export the matching rows, not all the rows in your project.
When you have facets applied, you will see “matching rows” in the [project grid header](running#project-grid-header). If you click <span class="menuItems">Export</span> and copy your data out of OpenRefine while facets are active, many of the exporting options will only export the matching rows, not all the rows in your project.
OpenRefine has several default facets, which youll learn about below. The most powerful facets are the ones designed by you - custom facets, written using [expressions](expressions) to transform the data behind the scenes and help you narrow down to precisely what youre looking for.
Facets are not saved in the project along with the data. But you can save a link to the current state of the application. Find the "permalink" next to the projects name.
Facets are not saved in the project along with the data. But you can save a link to the current state of the application. Find the <span class="menuItems">[Permalink](running#the-project-bar)</span> next to the projects name.
You can modify any facet expression by clicking the “change” button to the right of the column name in the facet sidebar.
@ -74,15 +76,17 @@ Facet boxes that appear in the sidebar can be resized and rearranged. You can dr
## Text facet
A text facet can be generated on any column with the “text” data type. Select the column dropdown and go to “Facet” → “Text facet”. The created facet will be sorted alphabetically, and can be sorted by count.
A text facet can be generated on any column with the “text” data type. Select the column dropdown and go to <span class="menuItems">Facet</span><span class="menuItems">Text facet</span>. The created facet will be sorted alphabetically, and can be sorted by count.
A text facet is very simple: it takes the total contents of the cells of the column in question and matches them up. It does no guessing about typos or near-matches.
You can edit any entry that appears in the facet display, by hovering over the facet and clicking the “edit” button that appears. You can then type in a new value manually. This will mass-edit every identical cell in your data. This is a great way to fix typos, whitespace, and other issues that may be affecting the way facets appear. You can also automate the cleanup of facets by using [clustering](transforming#cluster-and-edit), with the “Cluster” button displayed within the facet window. It may be most efficient to cluster cells to one value, and then mass-edit that value to your desired string within the clustering operation window.
You can edit any entry that appears in the facet display, by hovering over the facet and clicking the “edit” button that appears. You can then type in a new value manually. This will mass-edit every identical cell in the column. This is a great way to fix typos, whitespace, and other issues that may be affecting the way facets appear. You can also automate the cleanup of facets by using [clustering](transforming#cluster-and-edit): a “Cluster” button is displayed within the facet window. It may be most efficient to cluster cells to one value, and then mass-edit that value to your desired string within the clustering operation window.
Each text facet shows up to 2,000 choices by default. You can [increase this limit on the Preferences screen](running#preferences) if you need to, which will increase the processing work required by your browser. If your applied facet has more choices than the current limit, you'll be offered the option to increase the limit, which will edit that preference for you.
Each text facet shows up to 2,000 choices by default. You can [increase this limit on the Preferences screen](running#preferences) if you need to, which may slow down your browser. If your applied facet has more choices than the current limit, you'll be offered the option to increase the limit, which will permanently edit that preference for you.
The choices and counts displayed in each facet can be copied as tab-separated values. To do so, click on the "X choices" link near the top left corner of the facet.
The choices and counts displayed in each facet can be copied as tab-separated values. To do so, click on the "X choices" link near the top left corner of the facet. This can be useful to generate small summary tables of your data.
![A column of years faceted as text and numbers, and with the count ready to be copied.](/img/yeardata.png)
## Numeric facet
@ -92,73 +96,95 @@ Whereas a text facet groups unique text values into groups, a numeric facet sort
You will be offered the option to include blank, non-numeric, and error values in your numeric visualization; these will appear in the visual range as “0” values.
:::info Numbers as text
You can create a text facet on numeric data, which will treat each entry as a string. This can be useful if you wish, for example, to manually include facets instead of selecting a range, or sort by count, or copy that count.
:::
## Timeline facet
![A screenshot of an example timeline facet.](/img/timelinefacet.png)
Much like a numeric facet, a timeline facet will display as a small histogram with the values sorted: in this case, chronologically. A timeline facet only works on dates formatted as “date” data types (e.g. by [using the `toDate()` function](expressions#dates) to transform text into dates, or by manually setting the [data type](#cell-data-types) on individual cells) and in the structure of the ISO-8601-compliant extended format with time in UTC: **YYYY**-**MM**-**DD**T**HH**:**MM**:**SS**Z.
Much like a numeric facet, a timeline facet will display as a small histogram with the values sorted: in this case, chronologically. A timeline facet only works on cells formatted as the [“date” data type](exploring#dates).
The facet appears with a count of blank cells and those with errors, which can help you analyze whether your date cells are correctly converted.
## Scatterplot facet
A scatterplot facet can be generated on any number-formatted column. You require two or more number columns to generate scatterplots.
A scatterplot is a visual representation of two related sets of numeric data.
You have the option to generate linear scatterplots (where the X and Y axes show continuous increases) or logarithmic scatterplots (where the X and Y axes show exponential or scaled increases). You can also rotate the plot by 45 degrees in either direction, and you can choose the size of the dot indicating a datapoint. You can make these choices in both the preview and in the facet display.
Selecting “Facet” → “Scatterplot facet” will create a preview of data plotted from every number-formatted column in your dataset, comparing every column against every other column. Each scatterplot will show in its own square, allowing you to choose which data comparison you would like to analyze further.
A scatterplot facet can be generated on any column. You require two or more number columns to generate scatterplots. Selecting <span class="menuItems">Facet</span><span class="menuItems">Scatterplot facet</span> will create a preview of data plotted from every number-formatted column in your dataset, comparing every column against every other column. Each scatterplot will show in its own square, allowing you to choose which data comparison you would like to analyze further. You can control which columns are on the X and Y axes by rearranging the columns in your dataset.
When you click on your desired square, that two-column comparison will appear in the facets sidebar. From here, you can drag your mouse to draw a rectangle inside the scatterplot, which will narrow down to just the rows matching the points plotted inside that rectangle. This rectangle can be resized by dragging any of the four edges. To draw a new rectangle, simply click and drag your mouse again. To add more scatterplots to the facet sidebar, re-run this process and select a different square.
![A simple scatterplot of two numeric values.](/img/scatterplot.png)
When you click on your desired square, that two-column comparison will appear in the facets sidebar. From here, you can drag your mouse to draw a rectangle inside the scatterplot, which will narrow down to just the rows matching the points plotted inside that rectangle (as shown by the rectangle inside the square in the image above). This rectangle can be resized by dragging any of the four edges. To draw a new rectangle, simply click and drag your mouse again. To add more scatterplots to the facet sidebar, re-run this process and select a different square.
If you have multiple facets applied, plotted points in your scatterplot displays will be greyed out if they are not part of the current matching data subset. If the rectangle you have drawn within a scatterplot display only includes grey dots, you will see no matching rows.
If you would like to export a scatterplot, OpenRefine will open a new tab with a generated PNG image that you can save.
If you would like to export a scatterplot, OpenRefine will open a new tab with a generated PNG file that you can save.
## Custom text facet
You may want to explore your textual data in a way that doesnt involve modifying it but does require being more selective about what gets considered. Creating custom text facets will load your column into memory, transform the data, and store those transformations inside the facet.
You may want to explore your textual data with modifications that aren't permanent. Creating custom text facets will load your column into memory, transform the data temporarily, and store those transformations inside the facet.
You can also use text facets to analyze numerical data, such as by analyzing a number as a string, or by creating a test that will return “true” and false” as values.
You can also use custom text facets to analyze numerical data, such as by analyzing a number as a string, or by creating a test that will return “true” and false” as values.
If you would like to build your own version of a text facet, you can use the “Custom Text Facet” option. Clicking on “Facets” → “Custom text facet…” will bring up an [expressions](expressions) window where you can enter in a GREL, Python or Jython, or Clojure expression to modify how the facet works.
Clicking on <span class="menuItems">Facet</span><span class="menuItems">Custom text facet…</span> will bring up an [expressions](expressions) window where you can enter in a GREL, Jython, or Clojure expression to modify how the facet works.
A custom text facet operates just like a [text facet](#text-facet) by default. Unlike a text facet, however, you cannot edit the facets that appear in the sidebar and change the matching cells in your dataset.
A custom text facet operates just like a [text facet](#text-facet) by default. Unlike a text facet, however, you cannot click “edit” on the facets that appear in the sidebar and change the matching cells in your dataset - because what they display is modified, not the original entries.
For example, you may wish to analyze only the first word in a text field - perhaps the first name in a column of “[First Name] [Last Name]” entries. In this case, you can tell OpenRefine to facet only on the information that comes before the first space:
```value.split(" ")[0]```
```
value.split(" ")[0]
```
In this case, `split()` is creating an array of text strings based on every space in the cells - in this case, one space, so two values in the array. Because arrays number their entries starting with 0, we want the first value, so we ask for `[0]`. We can do the same splitting and ask for the last name with
In this case, `split()` is creating an array of text strings based on every space in the cells ["Firstname", "Lastname"]. Because arrays number their entries starting with 0, we want the first value, so we ask for `[0]`. (Assuming the first name is one word, not something like “Mary Anne.”) We can do the same splitting and ask for the last name with
```value.split(" ")[1]```
```
value.split(" ")[1]
```
You may want to create a facet that references several columns. For example, lets say you have two columns, "First Name" and "Last Name", and you want out how many people have the same initial letter for both names (e.g., Marilyn Monroe, Steven Segal). To do so, create a custom text facet on either column and enter the expression
You may want to create a facet that references several columns. For example, lets say you have two columns, “First Name” and “Last Name”, and you want out how many people have the same initial letter for both names (e.g., Marilyn Monroe, Steven Segal). To do so, create a custom text facet on either column and enter the expression
```cells["First Name"].value[0] == cells["Last Name"].value[0]```
```
cells["First Name"].value[0] == cells["Last Name"].value[0]
```
That expression will facet your rows into `true` and `false`.
That expression will look for the first letter (the character at index 0) of each entry and compare them. Then it will facet your rows into “true” and “false.”
You can learn more about text-modification functions on the [Expressions page](expressions).
## Custom numeric facet
You may want to explore your numerical data in a way that doesnt involve modifying it but does require being more selective about what gets considered. You can also use custom numeric facets to analyze textual data, such as by getting the length of text strings (with `value.length()`), or by analyzing it as though it were formatted as numbers (with `toNumber(value)`).
You may want to explore your numerical data with modifications that aren't permanent. You can also use custom numeric facets to analyze textual data, such as by getting the length of text strings (with `value.length()`), or by analyzing it as though it were formatted as numbers (with `toNumber(value)`).
If you would like to build your own version of a numeric facet, you can use the “Custom Numeric Facet” option. Clicking on “Facets” → “Custom Numeric Facet…” will bring up an [expressions](expressions) window where you can enter in a GREL, Python or Jython, or Clojure expression to modify how the facet works. A custom numeric facet operates just like a [numeric facet](#numeric-facet) by default.
If you would like to build your own version of a numeric facet, you can use the <span class="menuItems">Custom Numeric Facet</span> option. Clicking on <span class="menuItems">Facet</span><span class="menuItems">Custom Numeric Facet…</span> will bring up an [expressions](expressions) window where you can enter in a GREL, Jython, or Clojure expression to modify how the facet works. A custom numeric facet operates just like a [numeric facet](#numeric-facet) by default.
For example, you may wish to create a numeric facet that rounds your value to the nearest integer, enter
```round(value)```
```
round(value)
```
If you have two columns of numbers and for each row you wish to create a numeric facet only on the larger of the two, enter
```max(cells["Column1"].value, cells[“Column2”].value)```
```
max(cells["Column1"].value, cells["Column2"].value)
```
If the numeric values in a column are drawn from a power law distribution, then it's better to group them by their logs:
```value.log()```
```
value.log()
```
If the values are periodic you could take the modulus by the period to understand if there's a pattern:
```mod(value, 7)```
```
mod(value, 7)
```
You can learn more about numeric-modification functions on the [Expressions page](expressions).
@ -166,13 +192,15 @@ You can learn more about numeric-modification functions on the [Expressions page
Customized facets have been added to expand the number of default facets users can apply with a single click. They represent some common and useful functions you shouldnt have to work out using an [expression](expressions).
All facets that display in the “Facet/Filter” sidebar can be edited by clicking on the “change” button to the right of the column title. This brings up the expressions window that will allow you to modify and preview the expression being used.
All facets that display in the <span class="tabLabels">Facet/Filter</span> tab can be edited by clicking on the “change” button to the right of the column title. This brings up the expressions window that will allow you to modify and preview the expression being used.
### Word facet
Word facet is a simple version of a text facet: it splits up the content of the cells based on spaces, and outputs each character string as a facet:
A <span class="menuItems">Word facet</span> is a simple version of a text facet: it splits up the content of the cells based on spaces, and outputs each character string as a facet:
```value.split(" ")```
```
value.split(" ")
```
This can be useful for exploring the language used in a corpus, looking for common first and last names or titles, or seeing whats in multi-valued cells you dont wish to split up.
@ -180,19 +208,21 @@ Word facet is case-sensitive and only splits by spaces, not by line breaks or ot
### Duplicates facet
A duplicates facet will return only rows that have non-unique values in the column youve selected. It will create a facet of “true” and “false” values - true being cells that are not unique, and “false” being cells that are. The actual expression being used is
A <span class="menuItems">Duplicates facet</span> will return only rows that have non-unique values in the column youve selected. It will create a facet of “true” and “false” values - true being cells that are not unique, and “false” being cells that are. The actual expression being used is
```facetCount(value, 'value', '[Column]') > 1```
```
facetCount(value, 'value', '[Column]') > 1
```
Duplicates facets are case-sensitive and you may wish to filter out things like leading and trailing whitespace or other hard-to-see issues. You can modify the facet expression, for example, with:
```facetCount(trim(toLowercase(value)), 'trim(toLowercase(value))', 'cityLabel') > 1```
```
facetCount(trim(toLowercase(value)), 'trim(toLowercase(value))', 'cityLabel') > 1
```
### Numeric log facet
Logarithmic scales reduce wide-ranging quantities to more compact and manageable ranges. A log transformation can be used to make highly skewed distributions less skewed. If your numerical data is unevenly distributed (say, lots of values in one range, and then a long tail extending off into different magnitudes), a numeric log facet can represent that range better than a simple numeric facet. It will break these values down into more navigable segments than the buckets of a numeric facet. This facet can make patterns in your data more visible.
OpenRefine uses a base-10 log, the "common logarithm."
Logarithmic scales reduce wide-ranging quantities to more compact and manageable ranges. A log transformation can be used to make highly skewed distributions less skewed. If your numerical data is unevenly distributed (say, lots of values in one range, and then a long tail extending off into different magnitudes), a <span class="menuItems">Numeric log facet</span> can represent that range better than a simple numeric facet. It will break these values down into more navigable segments than the buckets of a numeric facet. This facet can make patterns in your data more visible. OpenRefine uses a base-10 log, the “common logarithm.”
For example, we can look at [this data about the body weight of various mammals](http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data_Brain2BodyWeight):
@ -220,13 +250,15 @@ A 1-bounded numeric log facet can be used if you'd like to exclude all the value
### Text-length facet
The text-length facet returns a numerical value for each cell and plots it on a numeric facet chart. The expression used is
The <span class="menuItems">Text-length facet</span> returns a numerical value for each cell and plots it on a numeric facet chart. The expression used is
```value.length()```
```
value.length()
```
This can be useful to, for example, look for values that did not successfully split on an earlier split operation, or to validate that data is a certain expected length (such as whether a date, as YYYY/MM/DD, is eight to ten characters).
This can be useful to, for example, look for values that did not successfully split on an earlier split operation, or to validate that data is a certain expected length (such as whether a date in YYYY/MM/DD is eight to ten characters).
You can also employ a log of text-length facet that allows you to navigate more easily a wide range of string lengths. This can be useful in the case of web-scraping, where lots of textual data is loaded into single cells and needs to be parsed out.
You can also employ a <span class="menuItems">Log of text-length facet</span> that allows you to navigate more easily a wide range of string lengths. This can be useful in the case of web-scraping, where lots of textual data is loaded into single cells and needs to be parsed out.
### Unicode character-code facet
@ -243,19 +275,23 @@ An error is a data type created by OpenRefine in the process of transforming dat
![A view of the expressions window with an error converting a string to a number.](/img/error.png)
To store errors in cells, ensure that you have “store error” selected for the “On error” option in the expressions window.
To store errors in cells, ensure that you have <span class="fieldLabels">store error</span> selected for the “On error” option in the expressions window.
### Facet by null, empty, or blank
Any column can be faceted for [null and/or empty cells](#cell-data-types). These can help you find cells where you want to manually enter content. “Blank” means both null values and empty values. All three facets will generate “true” and “false” facets, “true” being blank.
Any column can be faceted for [null and/or empty cells](#cell-data-types). These can help you find cells where you want to manually enter content.
An empty cell is a cell that is set to contain a string, but doesnt have any characters in it (a zero-length string). This can be a leftover from an operation that removed characters, or from manually editing a cell and deleting its contents.
“Blank” means both null values and empty values. All three facets will generate “true” and “false” facets, “true” being blank.
An empty cell is a cell that is set to contain a string, but doesnt have any characters in it (a zero-length string). This can be left over from an operation that removed characters, or from manually editing a cell and deleting its contents.
### Facet by star or flag
Stars and flags offer you the opportunity to mark specific rows for yourself for later focus. Stars and flags persist through closing and opening your project, and thus can provide a different function than using a permalink to persist your facets. Stars and flags can be used in any way you want, although they are designed to help you flag errors and star rows of particular importance.
You can manually star or flag rows simply by clicking on the icons to the left of each row. You can also apply stars or flags to all matching rows by using the “All” dropdown menu and selecting “Edit rows” → “Star rows” or “Flag rows.” These operations will modify all matching rows in your current subset. You can unstar or unflag them as well.
You can manually star or flag rows simply by clicking on the icons to the left of each row.
You can also apply stars or flags to all matching rows by using the <span class="menuItems">All</span> dropdown menu (on the first column) and selecting <span class="menuItems">Edit rows</span><span class="menuItems">Star rows</span> or <span class="menuItems">Flag rows</span>. This will create “true” and “false” facets in the <span class="tabLabels">Facet/Filter</span>. These operations will modify all matching rows in your current subset. You can unstar or unflag them as well.
You may wish to create a custom subset of your data through a series of separate faceting activities (rather than successively narrowing down with multiple facets applied). For example, you may wish to:
* apply a facet
@ -266,21 +302,21 @@ You may wish to create a custom subset of your data through a series of separate
* remove that facet
* and then work with all of the cumulative starred rows.
You can use the dropdown menu on the “All” column and selecting “Facet by star” or “Facet by flag.” This will create “true” and “false” facets in the facet sidebar.
You can also create a text facet on any column with the expression `row.starred` or `row.flagged`.
## Text filter
Filters allow you to narrow down your data based on whether a given column includes a text string.
When you choose “Text filter” a box appears in the “Facet/Filter” sidebar that allows you to enter in text. Matching rows will narrow dynamically with every character you enter. You can set the search to be case-sensitive or not, and you can use this box to enter in a regular expression.
When you choose <span class="menuItems">Text filter</span> a box appears in the <span class="tabLabels">Facet/Filter</span> tab that allows you to enter in text. Matching rows will narrow dynamically with every character you enter. You can set the search to be case-sensitive or not, and you can use this box to enter in a regular expression.
For example, you can enter in "side" as a text filter, and it will return all cells in that column containing "side," "sideways," "offside," etc.
For example, you can enter in “side” as a text filter, and it will return all cells in that column containing “side,” “sideways,” “offside,” etc.
The text filter field supports [Java's regular expression language](http://download.oracle.com/javase/tutorial/essential/regex/). For example, you can employ a regular expression to view all properly-formatted emails:
The text filter field supports [regular expressions](expressions#regular-expressions). For example, you can employ a regular expression to view all properly-formatted emails:
```([a-zA-Z0-9_\-\.\+]+)@([a-zA-Z0-9\-\.]+)\.([a-zA-Z0-9\-]{2,15})```
```
([a-zA-Z0-9_\-\.\+]+)@([a-zA-Z0-9\-\.]+)\.([a-zA-Z0-9\-]{2,15})
```
You can press “invert” on this facet to then see blank cells or invalid email addresses.

View File

@ -1,5 +0,0 @@
---
id: glossary
title: OpenRefine Glossary
sidebar_label: Glossary
---

View File

@ -6,32 +6,32 @@ sidebar_label: GREL functions
## Reading this reference
For the reference below, the function is given in full-length notation and the in-text examples are written in dot notation. Shorthands are used to indicate the kind of [data type](exploring#data-types) used in each function: s for string, b for boolean, n for number, d for date, a for array, p for a regex pattern, as well as with “null” and “error.
For the reference below, the function is given in full-length notation and the in-text examples are written in dot notation. Shorthands are used to indicate the kind of [data type](exploring#data-types) used in each function: s for string, b for boolean, n for number, d for date, a for array, p for a regex pattern, and o for any data type, as well as “null” and “error” data types.
If a function can take more than one kind of data as input or can output more than one kind of data, that is indicated with more than one letter (as with “s or a”) or with o for object.
We also use shorthands for substring (“sub”) and separator string (“sep”).
Optional arguments will say “(optional)”.
In places where OpenRefine will accept a string (s) or a regex pattern (p), you can supply a string by putting it in quotes. If you wish to use any regex notation, wrap the pattern in forward slashes.
In places where OpenRefine will accept a string (s) or a regex pattern (p), you can supply a string by putting it in quotes. If you wish to use any [regex](expressions#regular-expressions) notation, wrap the pattern in forward slashes.
## Boolean functions
###### and(b1, b2, ...)
Uses the logical operator AND on two or more booleans to yield a boolean. Evaluates multiple statements into booleans, then returns true if all of the statements are true. For example, `and(1 < 3, 1 < 0)` returns false because one condition is true and one is false.
Uses the logical operator AND on two or more booleans to output a boolean. Evaluates multiple statements into booleans, then returns true if all of the statements are true. For example, `(1 < 3).and(1 < 0)` returns false because one condition is true and one is false.
###### or(b1, b2, ...)
Uses the logical operator OR on two or more booleans to yield a boolean. For example, `or(1 < 3, 1 > 7)` returns true because at least one of the conditions (the first one) is true.
Uses the logical operator OR on two or more booleans to output a boolean. For example, `(1 < 3).or(1 > 7)` returns true because at least one of the conditions (the first one) is true.
###### not(b)
Uses the logical operator NOT on a boolean to yield a boolean. For example, `not(1 > 7)` returns true because 1 > 7 itself is false.
Uses the logical operator NOT on a boolean to output a boolean. For example, `not(1 > 7)` returns true because 1 > 7 itself is false.
###### xor(b1, b2, ...)
Uses the logical operator XOR (exclusive-or) on two or more booleans to yield a boolean. Evaluates multiple statements, then returns true if only one of them is true. For example, `xor(1 < 3, 1 < 7)` returns false because more than one of the conditions is true.
Uses the logical operator XOR (exclusive-or) on two or more booleans to output a boolean. Evaluates multiple statements, then returns true if only one of them is true. For example, `(1 < 3).xor(1 < 7)` returns false because more than one of the conditions is true.
## String functions
@ -41,9 +41,9 @@ Returns the length of string s as a number.
###### toString(o, string format (optional))
Takes any value type (string, number, date, boolean, error, null) and gives a string version of that value. You can convert between types, within limits (for example, you can't turn the string “asdfsd” into a date or a number, but you can convert the number “123” into a string).
Takes any value type (string, number, date, boolean, error, null) and gives a string version of that value.
You can also use toString() to convert numbers to strings with rounding, using an [optional string format](https://docs.oracle.com/javase/8/docs/api/java/util/Formatter.html). For example, if you applied the expression `value.toString("%.0f")` to a column:
You can use toString() to convert numbers to strings with rounding, using an [optional string format](https://docs.oracle.com/javase/8/docs/api/java/util/Formatter.html). For example, if you applied the expression `value.toString("%.0f")` to a column:
|Input|Output|
|-|-|
@ -100,7 +100,7 @@ Returns a copy of the string s with leading and trailing whitespace removed. For
###### chomp(s, sep)
Returns a copy of string s with the string sep removed from the end if s ends with sep; otherwise, just returns s. For example, `"hardly".chomp("ly")` and `"hard".chomp("ly")` both return the string “hard”.
Returns a copy of string s with the string sep removed from the end if s ends with sep; otherwise, just returns s. For example, `"barely".chomp("ly")` and `"bare".chomp("ly")` both return the string “bare”.
#### Substring
@ -108,7 +108,7 @@ Returns a copy of string s with the string sep removed from the end if s ends wi
Returns the substring of s starting from character index from, and up to (excluding) character index to. If the to argument is omitted, substring will output to the end of s. For example, `"profound".substring(3)` returns the string “found”, and `"profound".substring(2, 4)` returns the string “of”.
Character indexes start from zero. Negative character indexes count from the end of the string. For example, `"profound".substring(0, -1)` returns the string “profoun”.
Remember that character indices start from zero. A negative character index counts from the end of the string. For example, `"profound".substring(0, -1)` returns the string “profoun”.
###### slice(s, n from, n to (optional))
@ -130,12 +130,12 @@ Returns the first character index of sub as it last occurs in s; or, returns -1
###### replace(s, s or p find, s replace)
Returns the string obtained by replacing the find string with the replace string in the inputted string. For example, `"The cow jumps over the moon and moos".replace("oo", "ee")` returns the string “The cow jumps over the meen and mees”. Find can be a regex pattern; if so, replace can also contain capture groups declared in find.
Returns the string obtained by replacing the find string with the replace string in the inputted string. For example, `"The cow jumps over the moon and moos".replace("oo", "ee")` returns the string “The cow jumps over the meen and mees”. Find can be a regex pattern. For example, `"The cow jumps over the moon and moos".replace(/\s+/, "_")` will return “The_cow_jumps_over_the_moon_and_moos”.
You cannot find or replace nulls with this, as null is not a string. You can instead:
1. Facet by null and then bulk-edit them to a string, or
2. Transform the column with an expression such as `if(value==null,'new',value)`
2. Transform the column with an expression such as `if(value==null,"new",value)`.
###### replaceChars(s, s find, s replace)
@ -145,7 +145,7 @@ Returns the string obtained by replacing a character in s, identified by find, w
Outputs an array of all consecutive substrings inside string s that match the substring or [regex](expressions#grel-supported-regex) pattern p. For example, `"abeadsabmoloei".find(/[aeio]+/)` would result in the array [ "a", "ea", "a", "o", "oei" ].
You can supply a sub instead of p, by putting it in quotes, and OpenRefine will compile it into a regex pattern. Anytime you supply quotes, OpenRefine interprets the contents as a string, not regex. If you wish to use any regex notation, wrap the pattern in forward slashes, for example: `"OpenRefine is Awesome".find(/fine\sis/)` would return [ "fine is" ].
You can supply a substring instead of p, by putting it in quotes, and OpenRefine will compile it into a regex pattern. Anytime you supply quotes, OpenRefine interprets the contents as a string, not regex. If you wish to use any regex notation, wrap the pattern in forward slashes.
###### match(s, p)
@ -153,9 +153,9 @@ Attempts to match the string s in its entirety against the [regex](expressions#g
You will need to convert the array to a string to store it in a cell, with a function such as toString(). An empty array [] is returned when there is no match to the desired substrings. A null is output when the entire regex does not match.
Remember to enclose your regex in forward slashes, and to escape characters and use parentheses as needed. Parentheses are required to denote a desired substring (capturing group); for example, “.&#42;(\d\d\d\d)” would return an array containing a single value, while “(.&#42;)(\d\d\d\d)” would return two. So, if you are looking for a desired substring anywhere within a string, use the syntax `value.match(/.*(desired-substring-regex).*/)`.
Remember to enclose your regex in forward slashes, and to escape characters and use parentheses as needed. Parentheses denote a desired substring (capturing group); for example, “.&#42;(\d\d\d\d)” would return an array containing a single value, while “(.&#42;)(\d\d\d\d)” would return two. So, if you are looking for a desired substring anywhere within a string, use the syntax `value.match(/.*(desired-substring-regex).*/)`.
For example, if the value is “hello 123456 goodbye”:
For example, if `value` is “hello 123456 goodbye”, the following would occur:
|Expression|Result|
|-|-|
@ -182,13 +182,11 @@ Returns the array of strings obtained by splitting s into substrings with the gi
Returns the array of strings obtained by splitting s by sep, or by guessing either tab or comma separation if there is no sep given. Handles quotes properly and understands cancelled characters. The separator can be either a string or a regex pattern. For example, `value.smartSplit("\n")` will split at a carriage return or a new-line character.
Note: `value.[escape](#escapes-s-mode)('javascript')` is useful for previewing unprintable characters prior to using smartSplit().
Note: [`value.escape('javascript')`](#escapes-s-mode) is useful for previewing unprintable characters prior to using smartSplit().
###### splitByCharType(s)
Returns an array of strings obtained by splitting s into groups of consecutive characters each time the characters change unicode types. For example, `"HenryCTaylor".splitByCharType()` will result in an array of [ "H", "enry", "CT", "aylor" ].
It is useful for separating letters and numbers: `"BE1A3E".splitByCharType()` will result in [ "BE", "1", "A", "3", "E" ].
Returns an array of strings obtained by splitting s into groups of consecutive characters each time the characters change [Unicode categories](https://en.wikipedia.org/wiki/Unicode_character_property#General_Category). For example, `"HenryCTaylor".splitByCharType()` will result in an array of [ "H", "enry", "CT", "aylor" ]. It is useful for separating letters and numbers: `"BE1A3E".splitByCharType()` will result in [ "BE", "1", "A", "3", "E" ].
###### partition(s, s or p fragment, b omitFragment (optional))
@ -200,15 +198,13 @@ You can use regex for your fragment. The expresion `"abcdefgh".partition(/c.e/)`
###### rpartition(s, s or p fragment, b omitFragment (optional))
Returns an array of strings [ a, fragment, z ] where a is the substring within s before the last occurrence of fragment, and z is the substring after the last instance of fragment. (Rpartition means “reverse partition.”) For example, `"parallel".rpartition("a")` returns 3 strings: [ "par", "a", "llel" ].
Otherwise works identically to partition() above.
Returns an array of strings [ a, fragment, z ] where a is the substring within s before the last occurrence of fragment, and z is the substring after the last instance of fragment. (Rpartition means “reverse partition.”) For example, `"parallel".rpartition("a")` returns 3 strings: [ "par", "a", "llel" ]. Otherwise works identically to partition() above.
### Encoding and hashing
###### diff(s1, s2, s timeUnit (optional))
Takes two strings and compares them, returning a string. Returns the remainder of s2 starting with the first character where they differ. For example, `diff("cacti", "cactus")` returns "us". Also works with dates; see [Date functions](#diffd1-d2-s-timeunit).
Takes two strings and compares them, returning a string. Returns the remainder of s2 starting with the first character where they differ. For example, `"cacti".diff("cactus")` returns "us". Also works with dates; see [Date functions](#diffd1-d2-s-timeunit).
###### escape(s, s mode)
@ -266,7 +262,7 @@ Quotes a value as a JSON literal value.
Parses a string as JSON. get() can then be used with parseJson(): for example, `parseJson(" { 'a' : 1 } ").get("a")` returns 1.
For example from the following JSON array, let's get all instances called “keywords” having the same object string name of “text”, and combine it with the forEach() function to iterate over the array.
For example, from the following JSON array in `value`, we want to get all instances of “keywords” having the same object string name of “text”, and combine them, using the forEach() function to iterate over the array.
{
"status":"OK",
@ -293,14 +289,16 @@ The GREL expression `forEach(value.parseJson().keywords,v,v.text).join(":::")` w
### Jsoup XML and HTML parsing
###### parseHtml(s)
Given a cell full of HTML-formatted text, simplifies HTML tags (such as by removing “ /” at the end of self-closing tags), closes any unclosed tags, and inserts linebreaks and indents for cleaner code. You cannot pass parseHtml() a URL, but you can pre-fetch HTML with the <span class="menuItems">Add column by fetching URLs</span> menu option. A cell cannot store the output of parseHtml() unless you convert it with toString().
Given a cell full of HTML-formatted text, parseHtml() simplifies HTML tags (such as by removing “ /” at the end of self-closing tags), closes any unclosed tags, and inserts linebreaks and indents for cleaner code. You cannot pass parseHtml() a URL, but you can pre-fetch HTML with the <span class="menuItems">[Add column by fetching URLs](columnediting#add-column-by-fetching-urls)</span> menu option.
A cell cannot store the output of parseHtml() unless you convert it with toString(): for example, `value.parseHtml().toString()`.
When parseHtml() simplifies HTML, it can sometimes introduce errors. When closing tags, it makes its best guesses based on line breaks, indentation, and the presence of other tags. You may need to manually check the results.
You can then extract or select() which portions of the HTML document you need for further splitting, partitioning, etc. An example of extracting all table rows from a div using parseHtml().select() together is described more in depth at [StrippingHTML](https://github.com/OpenRefine/OpenRefine/wiki/StrippingHTML).
You can then extract or [select()](#selects-element) which portions of the HTML document you need for further splitting, partitioning, etc. An example of extracting all table rows from a div using parseHtml().select() together is described more in depth at [StrippingHTML](https://github.com/OpenRefine/OpenRefine/wiki/StrippingHTML).
###### parseXml(s)
Given a cell full of XML-formatted text, returns a full XML document and adds any missing closing tags. You can then extract or select() which portions of the XML document you need for further splitting, partitioning, etc. Functions the same way as parseHtml() is described above.
Given a cell full of XML-formatted text, parseXml() returns a full XML document and adds any missing closing tags. You can then extract or [select()](#selects-element) which portions of the XML document you need for further splitting, partitioning, etc. Functions the same way as parseHtml() is described above.
###### select(s, element)
Returns an array of all the desired elements from an HTML or XML document, if the element exists. Elements are identified using the [Jsoup selector syntax](https://jsoup.org/apidocs/org/jsoup/select/Selector.html). For example, `value.parseHtml().select("img.portrait")[0]` would return the entirety of the first “img” tag with the “portrait” class found in the parsed HTML inside `value`. Returns an empty array if no matching element is found. Use with toString() to capture the results in a cell. A tutorial of select() is shown in [StrippingHTML](https://github.com/OpenRefine/OpenRefine/wiki/StrippingHTML).
@ -312,34 +310,34 @@ value.parseHtml().select("div#content")[0].select("tr").toString()
```
###### htmlAttr(s, element)
Returns a string from an attribute on an HTML element. Use it in conjunction with parseHtml() as in the following example: `value.parseHtml().select("a.email")[0].htmlAttr("href")`.
Returns a string from an attribute on an HTML element. Use it in conjunction with parseHtml() as in the following example: `value.parseHtml().select("a.email")[0].htmlAttr("href")` would retrieve the email address attached to a link with the “email” class.
###### xmlAttr(s, element)
Returns a string from an attribute on an XML element. Function the same way htmlAttr() is described above. Use it in conjunction with parseXml().
Returns a string from an attribute on an XML element. Functions the same way htmlAttr() is described above. Use it in conjunction with parseXml().
###### htmlText(element)
Returns a string of the text from within an HTML element (including all child elements), removing HTML tags and line breaks inside the string. Use it in conjunction with parseHtml() and select() to provide an element, as in the following example: `value.parseHtml().select("div.footer")[0].htmlText()`.
###### xmlText(element)
Returns a string of the text from within an XML element (including all child elements). Functions the same way as htmlText() is described above. Use it in conjunction with parseXml() and select() to provide an element.
Returns a string of the text from within an XML element (including all child elements). Functions the same way htmlText() is described above. Use it in conjunction with parseXml() and select() to provide an element.
###### wholeText(element)
_Works from OpenRefine 3.4.1 beta 644 onwards only_
Selects the (unencoded) text of an element and its children, including any newlines and spaces, and returns a string of unencoded, un-normalized text. Use it in conjunction with parseHtml() and select() to provide an element as in the following example: `value.parseHtml().select("div.footer")[0].wholeText()`.
Selects the (unencoded) text of an element and its children, including any new lines and spaces, and returns a string of unencoded, un-normalized text. Use it in conjunction with parseHtml() and select() to provide an element as in the following example: `value.parseHtml().select("div.footer")[0].wholeText()`.
###### innerHtml(element)
Returns the [inner HTML](https://developer.mozilla.org/en-US/docs/Web/API/Element/innerHTML) of an HTML element. This will include text and children elements within the element selected. Use it in conjunction with parseHtml() and select() to provide an element.
###### innerXml(element)
Returns all the inner XML elements inside your chosen XML element. Does not return the text directly inside your chosen XML element - only the contents of its children. To select the direct text, use ownText(). To select both, use xmlText(). Use it in conjunction with parseXml() and select() to provide an element.
Returns the inner XML elements of an XML element. Does not return the text directly inside your chosen XML element - only the contents of its children. To select the direct text, use ownText(). To select both, use xmlText(). Use it in conjunction with parseXml() and select() to provide an element.
###### outerHtml(element)
Returns the [outer HTML](https://developer.mozilla.org/en-US/docs/Web/API/Element/outerHTML) of an HTML element. outerHtml includes the start and end tags of the current element. Use it in conjunction with parseHtml() and select() to provide an element.
###### ownText(element)
Returns the text directly inside the selected XML or HTML element only, ignoring text inside children elements. Use it in conjunction with a parser and select() to provide an element.
Returns the text directly inside the selected XML or HTML element only, ignoring text inside children elements (for this, use innerXml()). Use it in conjunction with a parser and select() to provide an element.
## Array functions
@ -347,17 +345,17 @@ Returns the text directly inside the selected XML or HTML element only, ignoring
Returns the size of an array, meaning the number of objects inside it. Arrays can be empty, in which case length() will return 0.
###### slice(a, n from, n to (optional))
Returns a sub-array of a given array, from the first index provided and up to and excluding the optional last index provided. Remember that array objects are indexed starting at 0. If to is omitted, it is understood to be the end of the array. For example, `[0, 1, 2, 3, 4].slice(1, 3)` returns [ 1, 2 ], and `[ 0, 1, 2, 3, 4].slice(1)` returns [ 1, 2, 3, 4 ]. Also works with strings; see [String functions](#slices-n-from-n-to-optional).
Returns a sub-array of a given array, from the first index provided and up to and excluding the optional last index provided. Remember that array objects are indexed starting at 0. If the to value is omitted, it is understood to be the end of the array. For example, `[0, 1, 2, 3, 4].slice(1, 3)` returns [ 1, 2 ], and `[ 0, 1, 2, 3, 4].slice(2)` returns [ 2, 3, 4 ]. Also works with strings; see [String functions](#slices-n-from-n-to-optional).
###### get(a, n from, n to (optional))
Returns a sub-array of a given array, from the first index provided and up to and excluding the optional last index provided. Remember that array objects are indexed starting at 0.
If to is omitted, only one array item is returned, as a string, instead of a sub-array. To return a sub-array from one index to the end, you can set the to argument to a very high number such as `value.get(2,999)` or you can use something like `with(value,a,a.get(1,a.length()))` to count the length of each array.
If the to value is omitted, only one array item is returned, as a string, instead of a sub-array. To return a sub-array from one index to the end, you can set the to argument to a very high number such as `value.get(2,999)` or you can use something like `with(value,a,a.get(1,a.length()))` to count the length of each array.
Also works with strings; see [get() in String functions](#gets-n-from-n-to-optional).
Also works with strings; see [String functions](#gets-n-from-n-to-optional).
###### inArray(a, s)
Returns true if the array contains the desired string, and false otherwise.
Returns true if the array contains the desired string, and false otherwise. Will not convert data types; for example, `[ 1, 2, 3, 4 ].inArray("3")` will return false.
###### reverse(a)
Reverses the array. For example, `[ 0, 1, 2, 3].reverse()` returns the array [ 3, 2, 1, 0 ].
@ -366,7 +364,7 @@ Reverses the array. For example, `[ 0, 1, 2, 3].reverse()` returns the array [ 3
Sorts the array in ascending order. Sorting is case-sensitive, uppercase first and lowercase second. For example, `[ "al", "Joe", "Bob", "jim" ].sort()` returns the array [ "Bob", "Joe", "al", "jim" ].
###### sum(a)
Return the sum of the numbers in the array. For example, `[ 2, 1, 0, 3].sum()` returns 6.
Return the sum of the numbers in the array. For example, `[ 2, 1, 0, 3 ].sum()` returns 6.
###### join(a, sep)
Joins the items in the array with sep, and returns it all as a string. For example, `[ "and", "or", "not" ].join("/")` returns the string “and/or/not”.
@ -424,40 +422,40 @@ Also works with strings; see [diff() in string functions](#diffsd1-sd2-s-timeuni
###### inc(d, n, s timeUnit)
Returns a date changed by the given amount in the given unit of time (see the table below). The default unit is “hour”. For example, if you want to move a date backwards by two months, use `value.inc(-2,'month')`.
Returns a date changed by the given amount in the given unit of time (see the table below). The default unit is “hour”. A positive value increases the date, and a negative value moves it back in time. For example, if you want to move a date backwards by two months, use `value.inc(-2,"month")`.
###### datePart(d, s timeUnit)
Returns part of a date. Data type returned depends on the unit (see the table below).
Returns part of a date. The data type returned depends on the unit (see the table below).
OpenRefine supports the following values for timeUnit:
| Unit | Date part returned | Returned data type | Example using [date 2014-03-14T05:30:04.000789000Z] as value |
|-|-|-|-|
| years | Year | Number | value.datePart("years") -> 2014 |
| year | Year | Number | value.datePart("year") -> 2014 |
| months | Month | Number | value.datePart("months") -> 2 |
| month | Month | Number | value.datePart("month") -> 2 |
| weeks | Week of the month | Number | value.datePart("weeks") -> 3 |
| week | Week of the month | Number | value.datePart("week") -> 3 |
| w | Week of the month | Number | value.datePart("w") -> 3 |
| weekday | Day of the week | String | value.datePart("weekday") -> Friday |
| hours | Hour | Number | value.datePart("hours") -> 5 |
| hour | Hour | Number | value.datePart("hour") -> 5 |
| h | Hour | Number | value.datePart("h") -> 5 |
| minutes | Minute | Number | value.datePart("minutes") -> 30 |
| minute | Minute | Number | value.datePart("minute") -> 30 |
| min | Minute | Number | value.datePart("min") -> 30 |
| seconds | Seconds | Number | value.datePart("seconds") -> 04 |
| sec | Seconds | Number | value.datePart("sec") -> 04 |
| s | Seconds | Number | value.datePart("s") -> 04 |
| milliseconds | Millseconds | Number | value.datePart("milliseconds") -> 789 |
| ms | Millseconds | Number | value.datePart("ms") -> 789 |
| S | Millseconds | Number | value.datePart("S") -> 789 |
| n | Nanoseconds | Number | value.datePart("n") -> 789000 |
| nano | Nanoseconds | Number | value.datePart("n") -> 789000 |
| nanos | Nanoseconds | Number | value.datePart("n") -> 789000 |
| time | Date expressed as milliseconds since the Unix Epoch | Number | value.datePart("time") -> 1394775004000 |
| years | Year | Number | value.datePart("years") 2014 |
| year | Year | Number | value.datePart("year") 2014 |
| months | Month | Number | value.datePart("months") 2 |
| month | Month | Number | value.datePart("month") 2 |
| weeks | Week of the month | Number | value.datePart("weeks") 3 |
| week | Week of the month | Number | value.datePart("week") 3 |
| w | Week of the month | Number | value.datePart("w") 3 |
| weekday | Day of the week | String | value.datePart("weekday") Friday |
| hours | Hour | Number | value.datePart("hours") 5 |
| hour | Hour | Number | value.datePart("hour") 5 |
| h | Hour | Number | value.datePart("h") 5 |
| minutes | Minute | Number | value.datePart("minutes") 30 |
| minute | Minute | Number | value.datePart("minute") 30 |
| min | Minute | Number | value.datePart("min") 30 |
| seconds | Seconds | Number | value.datePart("seconds") 04 |
| sec | Seconds | Number | value.datePart("sec") 04 |
| s | Seconds | Number | value.datePart("s") 04 |
| milliseconds | Millseconds | Number | value.datePart("milliseconds") 789 |
| ms | Millseconds | Number | value.datePart("ms") 789 |
| S | Millseconds | Number | value.datePart("S") 789 |
| n | Nanoseconds | Number | value.datePart("n") 789000 |
| nano | Nanoseconds | Number | value.datePart("n") 789000 |
| nanos | Nanoseconds | Number | value.datePart("n") 789000 |
| time | Milliseconds between input and the [Unix Epoch](https://en.wikipedia.org/wiki/Unix_time) | Number | value.datePart("time") → 1394775004000 |
## Math functions
@ -507,7 +505,7 @@ Some of these math functions don't recognize integers when supplied as the first
## Other functions
###### type(o)
Returns a string with the data type of o, such as undefined, string, number, boolean, etc. For example, a Transform operation using `value.type()` will convert all cells in a column to strings of their data types.
Returns a string with the data type of o, such as undefined, string, number, boolean, etc. For example, a [Transform](cellediting#transform) operation using `value.type()` will convert all cells in a column to strings of their data types.
###### facetCount(choiceValue, s facetExpression, s columnName)
Returns the facet count corresponding to the given choice value, by looking for the facetExpression in the choiceValue in columnName. For example, to create facet counts for the following table, we could generate a new column based on “Gift” and enter in `value.facetCount("value", "Gift")`. This would add the column we've named “Count”:
@ -522,7 +520,7 @@ Returns the facet count corresponding to the given choice value, by looking for
The facet expression, wrapped in quotes, can be useful to manipulate the inputted values before counting. For example, you could do a textual cleanup using fingerprint(): `(value.fingerprint()).facetCount(value.fingerprint(),"Gift")`.
###### hasField(o, s name)
Returns a boolean indicating whether o has a member field called [name](expressions#variables). For example, `cell.recon.hasField("match")` will return false if a reconciliation match hasnt been selected yet, or true if it has. You cannot chain your desired fields: for example, `cell.hasField(“recon.match”)` will return false even if the above expression returns true).
Returns a boolean indicating whether o has a member field called [name](expressions#variables). For example, `cell.recon.hasField("match")` will return false if a reconciliation match hasnt been selected yet, or true if it has. You cannot chain your desired fields: for example, `cell.hasField("recon.match")` will return false even if the above expression returns true).
###### coalesce(o1, o2, o3, ...)
Returns the first non-null from a series of objects. For example, `coalesce(value, "")` would return an empty string “” if `value` was null, but otherwise return `value`.

View File

@ -31,28 +31,26 @@ We are aware of some minor rendering and performance issues on other browsers su
### Release versions
OpenRefine always has a latest stable release as well as some more recent work available in beta, release candidate, or nightly release versions.
If you are installing for the first time, we recommend [the latest stable release](https://github.com/OpenRefine/OpenRefine/releases/latest).
OpenRefine always has a [latest stable release](https://github.com/OpenRefine/OpenRefine/releases/latest), as well as some more recent developments available in beta, release candidate, or [snapshot releases](https://github.com/OpenRefine/OpenRefine-snapshot-releases/releases). If you are installing for the first time, we recommend [the latest stable release](https://github.com/OpenRefine/OpenRefine/releases/latest).
If you wish to use an extension that is only compatible with an earlier version of OpenRefine, and do not require the latest features, you may find that [an older stable version is best for you](https://github.com/OpenRefine/OpenRefine/releases) in our list of releases. Look at later releases to see which security vulnerabilities are being fixed, in order to assess your own risk tolerance for using earlier versions. Look for “final release” versions instead of “beta” or “release candidate” versions.
#### Unstable versions
If you need a recently developed function, and are willing to risk some untested code, you can look at [the most recent items in the reverse-chronological list](https://github.com/OpenRefine/OpenRefine/releases) and see what changes appeal to you.
If you need a recently developed function, and are willing to risk some untested code, you can look at [the most recent items in the list](https://github.com/OpenRefine/OpenRefine/releases) and see what changes appeal to you.
“Beta” and “release candidate” versions may both have unreported bugs and are most suitable for people who are wiling to help us troubleshoot these versions by [creating bug reports](https://github.com/OpenRefine/OpenRefine/issues).
For the absolute latest development updates, see the [snapshot releases](https://github.com/OpenRefine/OpenRefine-nightly-releases/releases). These are created with every commit.
For the absolute latest development updates, see the [snapshot releases](https://github.com/OpenRefine/OpenRefine-snapshot-releases/releases). These are created with every commit.
#### Whats changed
Our [latest release is at the time of writing is OpenRefine 3.4](**link goes here!**), released **XXXX XX 2020**. The major changes in this version are listed on the [3.4 final release page](**link goes here!**) with the downloadable packages.
Our [latest version is OpenRefine 3.4.1](https://github.com/OpenRefine/OpenRefine/releases/tag/3.4.1), released September 24th 2020. The major changes in this version are listed on the [3.4.1 release page](https://github.com/OpenRefine/OpenRefine/releases/tag/3.4.1) with the downloadable packages.
You can find information about all of our releases on the [Releases page on Github](https://github.com/OpenRefine/OpenRefine/releases).
You can find information about all OpenRefine versions on the [Releases page on Github](https://github.com/OpenRefine/OpenRefine/releases).
:::info Other distributions
OpenRefine may also work in other environments, such as [Chromebooks](https://gist.github.com/organisciak/3e12e5138e44a2fed75240f4a4985b4f) where Linux terminals are available. Look at our list of [Other Distributions](https://openrefine.org/download.html) on the Downloads page for other ways of running OpenRefine, and refer to our contributor community to see new environments in development.
OpenRefine may also work in other environments, such as [Chromebooks](https://gist.github.com/organisciak/3e12e5138e44a2fed75240f4a4985b4f) where Linux terminals are available. Look at our list of [Other Distributions on the Downloads page](https://openrefine.org/download.html) for other ways of running OpenRefine, and refer to our contributor community to see new environments in development.
:::
## Installing or upgrading
@ -60,9 +58,9 @@ OpenRefine may also work in other environments, such as [Chromebooks](https://gi
If you are upgrading from an older version of OpenRefine and have projects already on your computer, you should create backups of those projects before you install a new version.
First, [locate your workspace directory](installing.md#where-is-data-stored). Then copy everything you find there and paste it into a folder elsewhere on your computer.
First, [locate your workspace directory](#where-is-data-stored). Then copy everything you find there and paste it into a folder elsewhere on your computer.
For extra security you can [export your existing OpenRefine projects](exporting.md#export-a-project).
For extra security you can [export your existing OpenRefine projects](exporting#export-a-project).
:::caution
Take note of the [extensions](#installing-extensions) you have currently installed. They may not be compatible with the upgraded version of OpenRefine. Installations can be installed in two places, so be sure to check both your workspace directory and the existing installation directory.
@ -72,7 +70,7 @@ Take note of the [extensions](#installing-extensions) you have currently install
[Java Development Kit (JDK)](https://jdk.java.net/) is required to run OpenRefine and should be installed first. [OpenRefine installation packages for Mac and Windows come bundled with JDK](https://openrefine.org/download.html), so you do not need to install it separately if you use those bundles.
There are JDK packages for Mac, Windows, and Linux. We recommend you install the latest “Ready for use” version: at the time of writing, this is [JDK 14.0.1](https://jdk.java.net/14/).
There are JDK packages for Mac, Windows, and Linux. We recommend you install the latest “Ready for use” version. At the time of writing, this is [JDK 14.0.1](https://jdk.java.net/14/).
Download the archive (either a `.tar.gz` or a `.zip`) to your computer and then extract its contents to a location of your choice. There is no installation process, so you may wish to extract this folder directly into a place where you put program files, or another stable folder.
@ -93,16 +91,16 @@ import TabItem from '@theme/TabItem';
<TabItem value="win">
1. On Windows 10, click the Windows start menu button, type `env`, and look at the search results. **Edit the system environment** variables. (If you are using an earlier version of Windows, use the **Search** or **Search programs and files** box in the start menu.)
1. On Windows 10, click the Start Menu button, type `env`, and look at the search results. Click <span class="buttonLabels">Edit the system environment variables</span>. (If you are using an earlier version of Windows, use the “Search” or “Search programs and files” box in the Start Menu.)
![A screenshot of the search results for 'env'.](/img/env.png "A screenshot of the search results for 'env'.")
2. Click **Environment Variables…** at the bottom of the **Advanced** window that appears.
3. In the **Environment Variables** dialog that appears, click **New…** and create a variable with the key `JAVA_HOME`. You can set the variable for only your user account, as in the screenshot below, or set it as a system variable - it will work either way.
2. Click <span class="buttonLabels">Environment Variables…</span> at the bottom of the <span class="tabLabels">Advanced</span> window.
3. In the <span class="tabLabels">Environment Variables</span> window that appears, click <span class="buttonLabels">New…</span> and create a variable with the key `JAVA_HOME`. You can set the variable for only your user account, as in the screenshot below, or set it as a system variable - it will work either way.
![A screenshot of 'Environment Variables'.](/img/javahome.png "A screenshot of 'Environment Variables'.")
4. Set the **Value** to the folder where you installed JDK, in the format `D:\Programs\OpenJDK`. You can locate this folder with the **Browse directory...** button.
4. Set the `Value` to the folder where you installed JDK, in the format `D:\Programs\OpenJDK`. You can locate this folder with the <span class="buttonLabels">Browse directory...</span> button.
</TabItem>
@ -110,19 +108,27 @@ import TabItem from '@theme/TabItem';
First, find where Java is on your computer with this command:
```which java```
```
which java
```
Check the environment variable `JAVA_HOME` with:
```$JAVA_HOME/bin/java --version```
```
$JAVA_HOME/bin/java --version
```
To set the environment variable for the current Java version of your MacOS:
```export JAVA_HOME="$(/usr/libexec/java_home)"```
```
export JAVA_HOME="$(/usr/libexec/java_home)"
```
Or, for Java 13.x:
```export JAVA_HOME="$(/usr/libexec/java_home -v 13)"```
```
export JAVA_HOME="$(/usr/libexec/java_home -v 13)"
```
</TabItem>
@ -132,20 +138,27 @@ Or, for Java 13.x:
Enter the following:
```sudo apt install default-jre```
```
sudo apt install default-jre
```
This probably wont install the latest JDK package available on the Java website, but it is faster and more straightforward. (At the time of writing, it installs OpenJDK 11.0.7.)
##### Manually
First, [extract the JDK package](https://openjdk.java.net/install/) to the new directory `usr/lib/jvm`:
```
sudo mkdir -p /usr/lib/jvm
sudo tar -x -C /usr/lib/jvm -f /tmp/openjdk-14.0.1_linux-x64_bin.tar.gz
```
Then, navigate to this folder and confirm the final path (in this case, `usr/lib/jvm/jdk-14.0.1`.
Open a terminal and type
```sudo gedit /etc/profile```
Then, navigate to this folder and confirm the final path (in this case, `usr/lib/jvm/jdk-14.0.1`. Open a terminal and type
```
sudo gedit /etc/profile
```
In the text window that opens, insert the following lines at the end of the `profile` file, using the path above:
```
@ -154,10 +167,18 @@ PATH=$PATH:$HOME/bin:$JAVA_HOME/bin
export JAVA_HOME
export PATH
```
Save and close the file. When you are back in the terminal, type
```source /etc/environment```
Exit the terminal and restart your system. You can then check that JAVA_HOME is set properly by opening another terminal and typing
```echo $JAVA_HOME```
```
source /etc/environment
```
Exit the terminal and restart your system. You can then check that `JAVA_HOME` is set properly by opening another terminal and typing
```
echo $JAVA_HOME
```
It should show the path you set above.
</TabItem>
@ -168,7 +189,7 @@ It should show the path you set above.
### Install or upgrade OpenRefine
If you are upgrading an existing OpenRefine installation, you can delete the old program files and install the new files into the same space. Do not overwrite the files as some obsolete files may be left over unnecessarily.
If you are upgrading an existing OpenRefine installation, you can delete the old program files and install the new files into the same space. Do not overwrite the files as some obsolete files may be left over unnecessarily.
:::caution
If you have extensions installed, do not delete the `webapp\extensions` folder where you installed them. You may wish to install extensions into the workspace directory instead of the program directory. There is no guarantee that extensions will be forward-compatible with new versions of OpenRefine, and we do not maintain extensions.
@ -187,7 +208,9 @@ If you have extensions installed, do not delete the `webapp\extensions` folder w
<TabItem value="win">
Once you have downloaded the `.zip` file, extract it into a folder where you wish to store program files (such as `D:\Program Files\OpenRefine`). You can right-click on `openrefine.exe` or `refine.bat` and pin one of those programs to your Start Menu or create shortcuts for easier access.
Once you have downloaded the `.zip` file, extract it into a folder where you wish to store program files (such as `D:\Program Files\OpenRefine`).
You can right-click on `openrefine.exe` or `refine.bat` and pin one of those programs to your Start Menu or create shortcuts for easier access.
</TabItem>
@ -201,19 +224,21 @@ Once you have downloaded the `.dmg` file, open it and drag the OpenRefine icon o
The quick version:
1. Install[ Homebrew from here](http://brew.sh)
1. Install [Homebrew](http://brew.sh)
2. In Terminal enter ` brew cask install openrefine`
1. Then find OpenRefine in your Applications folder.
The long version:
[Homebrew](http://brew.sh) is a popular command-line package manager for Mac. Installing Homebrew is accomplished by pasting the installation command on the Homebrew website into a Terminal window. Once Homebrew is installed, applications like OpenRefine can be installed via a simple command. You can [install Homebrew from their website]([http://brew.sh](http://brew.sh)).
[Homebrew](http://brew.sh) is a popular command-line package manager for Mac. Installing Homebrew is accomplished by pasting the installation command on the Homebrew website into a Terminal window. Once Homebrew is installed, applications like OpenRefine can be installed via a simple command. You can [install Homebrew from their website](http://brew.sh).
###### Install
Install OpenRefine with this command:
``` brew cask install openrefine```
```
brew cask install openrefine
```
You should see output like this:
@ -228,23 +253,29 @@ You should see output like this:
Behind the scenes, this command causes Homebrew to download the OpenRefine installer, verify the files authenticity (using a SHA-256 checksum), mount the disk image, copy the `OpenRefine.app` application bundle into the Applications folder, unmount the disk image, and save a copy of the installer and metadata about the installation for future use.
_If an existing `OpenRefine.app` is found in the Applications folder, Homebrew will not overwrite it, so installing via Homebrew requires either deleting or renaming previously installed copies._
If an existing `OpenRefine.app` is found in the Applications folder, Homebrew will not overwrite it, so installing via Homebrew requires either deleting or renaming previously installed copies.
###### Uninstall
To uninstall OpenRefine, paste this command into the Terminal:
``` brew cask uninstall openrefine```
```
brew cask uninstall openrefine
```
You should see output like this:
``` ==> Removing App '/Applications/OpenRefine.app'.```
```
==> Removing App '/Applications/OpenRefine.app'.
```
###### Update
To update to the latest version of OpenRefine, paste this command into the Terminal:
``` brew cask reinstall openrefine```
```
brew cask reinstall openrefine
```
You should see output like this:
@ -269,7 +300,9 @@ If you had previously installed the `openrefine-dev` cask (containing a release
Once you have downloaded the `.tar.gz` file, open a shell, navigate to the folder containing the download, and type:
```tar xzf openrefine-linux-3.4.tar.gz```
```
tar xzf openrefine-linux-3.4.tar.gz
```
</TabItem>
@ -280,7 +313,7 @@ Once you have downloaded the `.tar.gz` file, open a shell, navigate to the folde
### Set where data is stored
OpenRefine stores data in two places: program files in the program directory, wherever it is youve installed it; and project files in what we call the “workspace directory.” You can access this folder easily from OpenRefine by going to the [home screen](running.md#the-home-screen) (at [http://127.0.0.1:3333/](http://127.0.0.1:3333/)) and clicking "Browse workspace directory."
OpenRefine stores data in two places: program files in the program directory, wherever it is youve installed it; and project files in what we call the “workspace directory.” You can access this folder easily from OpenRefine by going to the [home screen](running#the-home-screen) (at [http://127.0.0.1:3333/](http://127.0.0.1:3333/)) and clicking <span class="buttonLabels">Browse workspace directory</span>.
By default this is:
@ -308,11 +341,15 @@ For older Google Refine releases, replace `OpenRefine` with `Google\Refine`.
You can change this by adding this line to the file `openrefine.l4j.ini` and specifying your desired drive and folder path:
```-Drefine.data_dir=D:\MyDesiredFolder```
```
-Drefine.data_dir=D:\MyDesiredFolder
```
If your folder path has spaces, use neutral quotation marks around it:
```-Drefine.data_dir="D:\My Desired Folder"```
```
-Drefine.data_dir="D:\My Desired Folder"
```
If the folder does not exist, OpenRefine will create it.
@ -320,11 +357,15 @@ If the folder does not exist, OpenRefine will create it.
<TabItem value="mac">
```~/Library/Application Support/OpenRefine/```
```
~/Library/Application Support/OpenRefine/
```
For older versions as Google Refine:
For older versions, as Google Refine:
```~/Library/Application Support/Google/Refine/ ```
```
~/Library/Application Support/Google/Refine/
```
Logging is to `/var/log/daemon.log` - grep for `com.google.refine.Refine`.
@ -332,11 +373,15 @@ Logging is to `/var/log/daemon.log` - grep for `com.google.refine.Refine`.
<TabItem value="linux">
```~/.local/share/openrefine/```
```
~/.local/share/openrefine/
```
You can change this when you run OpenRefine from the terminal, by pointing to the workspace directory through the `-d` parameter:
``` ./refine -p 3333 -i 0.0.0.0 -m 6000M -d /My/Desired/Folder```
```
./refine -p 3333 -i 0.0.0.0 -m 6000M -d /My/Desired/Folder
```
</TabItem>
@ -347,51 +392,27 @@ You can change this when you run OpenRefine from the terminal, by pointing to th
### Logs
OpenRefine does not currently output an error log, but because the OpenRefine console window is always open while OpenRefine runs in your browser, you can copy information from the console if an error occurs.
OpenRefine does not currently output an error log, but because the OpenRefine console window is always open (on Linux and Windows) while OpenRefine runs in your browser, you can copy information from the console if an error occurs.
<Tabs
groupId="operating-systems"
defaultValue="win"
values={[
{label: 'Mac', value: 'mac'}
]
}>
<TabItem value="mac">
You can access OpenRefine server logs from the terminal on Mac:
* Find the OpenRefine app/icon in Finder
* Ctrl+Click on the icon and select **Show Package Contents** from the context menu that displays
* This should open a new Finder menu showing a folder called **Contents** - navigate into this folder then into the **MacOS** folder
* Ctrl+Click on **JavaAppLauncher**
* Choose **Open With** from menu, and select **Terminal**
Using a Mac, you can [run OpenRefine using the terminal](running#starting-and-exiting) in order to capture errors.
---
</TabItem>
</Tabs>
## Increasing memory allocation
OpenRefine relies on having computer memory available to it to work effectively. If you are planning to work with large data sets, you may wish to set up OpenRefine to handle it at the outset. By “large” we generally mean one of the following indicators:
* more than **one million** rows
* more than **one million **total cells
OpenRefine relies on having computer memory available to it to work effectively. If you are planning to work with large datasets, you may wish to set up OpenRefine to handle it at the outset. By “large” we generally mean one of the following indicators:
* more than one million total cells
* an input file size of more than 50 megabytes (MB)
* more than **50** [rows per record in records mode](**running.md#records-mode**)
* more than 50 [rows per record in records mode](running#records-mode)
By default OpenRefine is set to operate with 1 gigabyte (GB) of memory (1024MB). If OpenRefine is running slowly, or you are getting "out of memory" errors (for example, `java.lang.OutOfMemoryError`), or generally feel that OpenRefine is slow, you can try allocating more memory.
By default OpenRefine is set to operate with 1 gigabyte (GB) of memory (1024MB). If you feel that OpenRefine is running slowly, or you are getting “out of memory” errors (for example, `java.lang.OutOfMemoryError`), you can try allocating more memory.
A good practice is to start with no more than 50% of whatever memory is left over after the estimated usage of your operating system, to leave memory for your browser to run.
All of the settings below use a four-digit number to specify the megabytes (MB) used. The default is usually 1024MB, but the new value doesn't need to be a multiple of 1024.
All of the settings below use a four-digit number to specify the megabytes (MB) used (actually [mebibytes](https://en.wikipedia.org/wiki/Mebibyte)). The default is usually 1024MB, but the new value doesn't need to be a multiple of 1024.
:::info Dealing with large datasets
If your project is big enough to need more than the default amount of memory, consider turning off "Parse cell text into numbers, dates, ..." on import. It's convenient, but less efficient than explicitly converting any columns that you need as a data type other than the default "string" type.
If your project is big enough to need more than the default amount of memory, consider turning off <span class="fieldLabels">Parse cell text into numbers, dates, ...</span> on import. It's convenient, but less efficient than explicitly converting any columns that you need as a data type other than the default “string” type.
:::
<Tabs
@ -415,7 +436,7 @@ If you run `openrefine.exe`, you will need to edit the `openrefine.l4j.ini` file
-Xmx1024M
```
The line `-Xmx1024M` defines the amount of memory available in megabytes (actually [mebibytes](https://en.wikipedia.org/wiki/Mebibyte)). Change the number “1024” - for example, edit the line to `-Xmx2048M` to make 2048MB [2GB] of memory available.
The line “-Xmx1024M” defines the amount of memory available in megabytes. Change the number “1024” - for example, edit the line to “-Xmx2048M” to make 2048MB [2GB] of memory available.
:::caution openrefine.exe not running?
Once you increase the memory allocation, you may find that you cannot run `openrefine.exe`. In this case, your computer needs a 64-bit version of [Java](https://www.java.com/en/download/help/index_installing.xml) (this is different from [Java JDK](#install-or-upgrade-java). Look for the “Windows Offline (64-bit)” download on the Downloads page and install that. Your system must also be set to use the 64-bit version of Java by [changing the Java configuration](https://www.java.com/en/download/help/update_runtime_settings.xml).
@ -425,46 +446,47 @@ Once you increase the memory allocation, you may find that you cannot run `openr
On Windows, OpenRefine can also be run by using the file `refine.bat` in the program directory. If you start OpenRefine using `refine.bat`, the memory available to OpenRefine can be specified either through command line options, or through the `refine.ini` file.
To set the maximum amount of memory on the command line when using `refine.bat`, 'cd' to the program directory, then type
To set the maximum amount of memory on the command line when using `refine.bat`, `cd` to the program directory, then type
```refine.bat /m 2048m```
where "2048" is the maximum amount of MB that you want OpenRefine to use.
where “2048” is the maximum amount of MB that you want OpenRefine to use.
To change the default that `refine.bat` uses, edit the `refine.ini` line that reads
```REFINE_MEMORY=1024M```
Note that this file is only read if you use `refine.bat`, not **openrefine.exe**.
Note that this file is only read if you use `refine.bat`, not `openrefine.exe`.
</TabItem>
<TabItem value="mac">
If you have downloaded the **.dmg** package and you start OpenRefine by double-clicking on it:
If you have downloaded the `.dmg` package and you start OpenRefine by double-clicking on it:
* close OpenRefine
* **control-click** on the OpenRefine icon (opens the contextual menu)
* click on **show package content** (a finder window opens)
* open the **Contents** folder
* open the **Info.plist** file with any text editor (like Mac's default TextEdit)
* Change `-Xmx1024M` into, for example, `-Xmx2048M` or `-Xmx8G`
* control-click on the OpenRefine icon (opens the contextual menu)
* click on "show package content” (a finder window opens)
* open the “Contents” folder
* open the `Info.plist` file with any text editor (like Mac's default TextEdit)
* Change “-Xmx1024M” into, for example, “-Xmx2048M” or “-Xmx8G”
* save the file
* restart OpenRefine.
</TabItem>
<TabItem value="linux">
If you have downloaded the `.tar.gz` package and you start OpenRefine from the command line, add the `-m xxxxM` parameter like this:
If you have downloaded the `.tar.gz` package and you start OpenRefine from the command line, add the “-m xxxxM” parameter like this:
`./refine -m 2048m`
#### Setting a default
If you don't want to set this option on the command line each time, you can also set it in the `refine.ini` file. Edit the line
```REFINE_MEMORY=1024M```
```
REFINE_MEMORY=1024M
```
Make sure it is not commented out (that is, that the line doesn't start with a '#' character), and change `1024` to a higher value. Save the file, and when you next start OpenRefine it will use this value.
Make sure it is not commented out (that is, that the line doesn't start with a “#” character), and change “1024” to a higher value. Save the file, and when you next start OpenRefine it will use this value.
</TabItem>
@ -477,13 +499,13 @@ Make sure it is not commented out (that is, that the line doesn't start with a '
Extensions have been created by our contributor community to add functionality or provide convenient shortcuts for common uses of OpenRefine. [We list extensions we know about on our downloads page](https://openrefine.org/download.html).
:::info
:::info Contributing extensions
If youd like to create or modify an extension, [see our developer documentation here](https://github.com/OpenRefine/OpenRefine/wiki/Documentation-For-Developers). If youre having a problem, [use our downloads page](https://openrefine.org/download.html) to go to the extensions page and report the issue there.
:::
### Two ways to install extensions
You can [install extensions in one of two places](installing.md#set-where-data-is-stored):
You can [install extensions in one of two places](#set-where-data-is-stored):
* Into your OpenRefine program folder, so they will only be available to that version/installation of OpenRefine (meaning the extension will not run if you upgrade OpenRefine), or
* Into your workspace, where your projects are stored, so they will be available no matter which version of OpenRefine youre using.
@ -492,15 +514,15 @@ We provide these options because you may wish to reinstall a given extension man
### Find the right place to install
If you want to install the extension into the program folder, go to your program directory and then go to `/webapp/extensions` (or create it if not does not exist).
If you want to install the extension into the program folder, go to your program directory and then go to `webapp\extensions` (or create it if not does not exist).
If you want to install the extension into your workspace, you can:
* launch OpenRefine and click <span class="menuItems">Open Project</span> in the sidebar
* At the bottom of the screen, click <span class="menuItems"> Browse workspace directory </span>
* At the bottom of the screen, click <span class="menuItems">Browse workspace directory</span>
* A file-explorer or finder window will open in your workspace
* Create a new folder called `extensions` inside the workspace if it does not exist.
* Create a new folder called “extensions” inside the workspace if it does not exist.
You can also [find your workspace on each operating system using these instructions](installing.md#set-where-data-is-stored).
You can also [find your workspace on each operating system using these instructions](#set-where-data-is-stored).
### Install the extension
@ -511,15 +533,7 @@ Some extensions may have multiple versions, to match OpenRefine versions, so be
Generally, the installation process will be:
* Download the extension (usually as a zip file from GitHub)
* Extract the zip contents into the `extensions` directory, making sure all the contents go into one folder with the name of the extension
* Extract the zip contents into the “extensions” directory, making sure all the contents go into one folder with the name of the extension
* Start (or restart) OpenRefine.
To confirm that installation was a success, follow the instructions provided by the extension. Each extension will appear in its own way inside the OpenRefine interface: make sure you read the documentation to know where the functionality will appear, such as under specific dropdown menus.
## Advanced OpenRefine uses
### Running as a server
### Automating OpenRefine
To confirm that installation was a success, follow the instructions provided by the extension. Each extension will appear in its own way inside the OpenRefine interface. Make sure you read its documentation to know where the functionality will appear, such as under specific dropdown menus.

View File

@ -1,152 +0,0 @@
---
id: key_value_columnize
title: Columnize by key/value columns
sidebar_label: Columnize by key/value
---
This operation can be used to reshape a table which contains *key* and *value* columns, such that the repeating contents in the key column become new column names, and the contents of the value column are spread in the new columns. This operation can be invoked from
any column menu, via **Transpose****Columnize by key/value columns**.
Overview
--------
Consider the following table:
| Field | Data |
|---------|-----------------------|
| Name | Galanthus nivalis |
| Color | White |
| IUCN ID | 162168 |
| Name | Narcissus cyclamineus |
| Color | Yellow |
| IUCN ID | 161899 |
In this format, each flower species is described by multiple attributes, which are spread on consecutive rows.
In this example, the "Field" column contains the keys and the "Data" column contains the values. With
this configuration, the *Columnize by key/value columns* operations transforms this table as follows:
| Name | Color | IUCN ID |
|-----------------------|----------|---------|
| Galanthus nivalis | White | 162168 |
| Narcissus cyclamineus | Yellow | 161899 |
Entries with multiple values in the same column
-----------------------------------------------
If an entry has multiple values for a given key, then these values will be grouped on consecutive rows,
to form a [record structure](exploring#rows-vs-records).
For instance, flower species can have multiple colors:
| Field | Data |
|-------------|-----------------------|
| Name | Galanthus nivalis |
| **Color** | **White** |
| **Color** | **Green** |
| IUCN ID | 162168 |
| Name | Narcissus cyclamineus |
| Color | Yellow |
| IUCN ID | 161899 |
This table is transformed by the operation as follows:
| Name | Color | IUCN ID |
|-----------------------|----------|---------|
| Galanthus nivalis | White | 162168 |
| | Green | |
| Narcissus cyclamineus | Yellow | 161899 |
The first key encountered by the operation serves as the record key.
The "Green" value is attached to the "Galanthus nivalis" name because it is the latest record key encountered by the operation as it scans the table. See the [Row order](#row-order) section for more details about the influence of row order on
the results of the operation.
Notes column
------------
In addition to the key and value columns, a *notes* column can be used optionally. This can be used
to store extra metadata associated to a key/value pair.
Consider the following example:
| Field | Data | Source |
|---------|-----------------------|-----------------------|
| Name | Galanthus nivalis | IUCN |
| Color | White | Contributed by Martha |
| IUCN ID | 162168 | |
| Name | Narcissus cyclamineus | Legacy |
| Color | Yellow | 2009 survey |
| IUCN ID | 161899 | |
If the "Source" column is selected as notes column, this table is transformed to:
| Name | Color | IUCN ID | Source : Name | Source : Color |
|-----------------------|----------|---------|---------------|-----------------------|
| Galanthus nivalis | White | 162168 | IUCN | Contributed by Martha |
| Narcissus cyclamineus | Yellow | 161899 | Legacy | 2009 survey |
Notes columns can therefore be used to preserve provenance or other context about a particular key/value pair.
Extra columns
-------------
If the table contains extra columns, which are not used as key, value or notes columns, they can be preserved
by the operation. For this to work, they must have the same value in all old rows corresponding to a new row.
Consider for instance the following table, where the "Field" and "Data" columns are used as key and value columns
respectively, and the "Wikidata ID" column is not selected:
| Field | Data | Wikidata ID |
|---------|-----------------------|-------------|
| Name | Galanthus nivalis | Q109995 |
| Color | White | Q109995 |
| IUCN ID | 162168 | Q109995 |
| Name | Narcissus cyclamineus | Q1727024 |
| Color | Yellow | Q1727024 |
| IUCN ID | 161899 | Q1727024 |
This will be transformed to
| Wikidata ID | Name | Color | IUCN ID |
|-------------|-----------------------|----------|---------|
| Q109995 | Galanthus nivalis | White | 162168 |
| Q1727024 | Narcissus cyclamineus | Yellow | 161899 |
If extra columns do not contain identical values for all old rows spanning an entry, this can
be fixed beforehand by using the [fill down operation](cellediting#fill-down).
Row order
---------
In the absence of extra columns, it is important to note that the order in which
the key/value pairs appear matters. Specifically, the operation will use the first key it encounters as the delimiter for entries:
every time it encounters this key again, it will produce a new row and add the following other key/value pairs to that row.
Consider for instance the following table:
| Field | Data |
|----------|-----------------------|
| **Name** | Galanthus nivalis |
| Color | White |
| IUCN ID | 162168 |
| **Name** | Crinum variabile |
| **Name** | Narcissus cyclamineus |
| Color | Yellow |
| IUCN ID | 161899 |
The occurrences of the "Name" value in the "Field" column define the boundaries of the entries. Because there is
no other row between the "Crinum variabile" and the "Narcissus cyclamineus" rows, the "Color" and "IUCN ID" columns
for the "Crinum variabile" entry will be empty:
| Name | Color | IUCN ID |
|-----------------------|----------|---------|
| Galanthus nivalis | White | 162168 |
| Crinum variabile | | |
| Narcissus cyclamineus | Yellow | 161899 |
This sensitivity to order is removed if there are extra columns: in that case, the first extra column will serve as root identifier
for the entries.
Behaviour in records mode
-------------------------
In records mode, this operation behaves just like in rows mode, except that any facets applied to it will be interpreted in records mode.

View File

@ -6,53 +6,58 @@ sidebar_label: Reconciling
## Overview
Reconciliation is the process of matching your dataset with that of an external source. Datasets are produced by libraries, archives, museums, academic organizations, scientific institutions, non-profits, and interest groups. You can also reconcile against user-edited data on [Wikidata](wikidata), or reconcile against [a local dataset that you yourself supply](https://github.com/OpenRefine/OpenRefine/wiki/Reconcilable-Data-Sources#local-services).
Reconciliation is the process of matching your dataset with that of an external source. Datasets for comparison might be produced by libraries, archives, museums, academic organizations, scientific institutions, non-profits, or interest groups. You can also reconcile against user-edited data on [Wikidata](wikidata), or reconcile against [a local dataset that you yourself supply](https://github.com/OpenRefine/OpenRefine/wiki/Reconcilable-Data-Sources#local-services).
To reconcile your OpenRefine project against an external dataset, that dataset must offer a web service that conforms to the [Reconciliation Service API standards](https://reconciliation-api.github.io/specs/0.1/).
You may wish to reconcile in order to fix spelling or variations in proper names, to clean up manually-entered subject headings against authorities such as the [Library of Congress Subject Headings](https://id.loc.gov/authorities/subjects.html) (LCSH), to link your data to an existing set, to add it to an open and editable system such as [Wikidata](https://www.wikidata.org), or to see whether entities in your project appear in some specific list or not, such as the [Panama Papers](https://aleph.occrp.org/datasets/734).
You may wish to reconcile in order to:
* fix spelling or variations in proper names
* clean up manually-entered subject headings against authorities such as the [Library of Congress Subject Headings](https://id.loc.gov/authorities/subjects.html) (LCSH)
* link your data to an existing dataset
* add to an editable platform such as [Wikidata](https://www.wikidata.org)
* or see whether entities in your project appear in some specific list, such as the [Panama Papers](https://aleph.occrp.org/datasets/734).
Reconciliation is semi-automated: OpenRefine matches your cell values to the reconciliation information as best it can, but human judgment is required to ensure the process is successful. Reconciling happens by default through string searching, so typos, whitespace, and extraneous characters will have an effect on the results. You may wish to clean and cluster your data before reconciliaton.
Reconciliation is semi-automated: OpenRefine matches your cell values to the reconciliation information as best it can, but human judgment is required to review and approve the results. Reconciling happens by default through string searching, so typos, whitespace, and extraneous characters will have an effect on the results. You may wish to [clean and cluster](cellediting) your data before reconciliaton.
:::info Working iteratively
We recommend planning your reconciliation operations as iterative: reconcile multiple times with different settings, and with different subgroups of your data.
:::
## Sources
There is a [current list of reconcilable authorities](https://reconciliation-api.github.io/testbench/) that includes instructions for adding new services via Wikidata editing. OpenRefine maintains a [further list of sources on the wiki](https://github.com/OpenRefine/OpenRefine/wiki/Reconcilable-Data-Sources), which can be edited by anyone. This list includes ways that you can reconcile against a [local dataset](https://github.com/OpenRefine/OpenRefine/wiki/Reconcilable-Data-Sources#local-services).
Start with [this current list of reconcilable authorities](https://reconciliation-api.github.io/testbench/), which includes instructions for adding new services via Wikidata editing if you have one to add.
Other services may exist that are not yet listed in these two places: for example, the [310 datasets hosted by the Organized Crime and Corruption Reporting Project (OCCRP)](https://aleph.occrp.org/datasets/) each have their own reconciliation URL, or you can reconcile against their entire database with the URL listed [here](https://reconciliation-api.github.io/testbench/). For another example, you can reconcile against the entire Virtual International Authority File (VIAF) dataset, or [only the contributions from certain institutions](http://refine.codefork.com/). Search online to see if the authority you wish to reconcile against has an available service, or whether you can download a copy to reconcile against locally.
OpenRefine maintains a [further list of sources on the wiki](https://github.com/OpenRefine/OpenRefine/wiki/Reconcilable-Data-Sources), which can be edited by anyone. This list includes ways that you can reconcile against a [local dataset](https://github.com/OpenRefine/OpenRefine/wiki/Reconcilable-Data-Sources#local-services).
OpenRefine offers Wikidata reconciliation by default - see the [Wikidata](wikidata) page for more information particular to that service.
Other services may exist that are not yet listed in these two places: for example, the [310 datasets hosted by the Organized Crime and Corruption Reporting Project (OCCRP)](https://aleph.occrp.org/datasets/) each have their own reconciliation URL, or you can reconcile against their entire database with the URL [shared on the reconciliation API list](https://reconciliation-api.github.io/testbench/). For another example, you can reconcile against the entire Virtual International Authority File (VIAF) dataset, or [only the contributions from certain institutions](http://refine.codefork.com/). Search online to see if the authority you wish to reconcile against has an available service, or whether you can download a copy to reconcile against locally.
:::info
OpenRefine extensions can add reconciliation services, and can also add enhanced reconciliation capacities. Check the list of extensions on the [Downloads page](https://openrefine.org/download.html) for more information.
:::
OpenRefine includes Wikidata reconciliation in the installation package - see the [Wikidata](wikidata) page for more information particular to that service. Extensions can add reconciliation services, and can also add enhanced reconciliation capacities. Check the list of extensions on the [Downloads page](https://openrefine.org/download.html) for more information.
Each source will have its own documentation on how it provides reconciliation. Refer to the service itself if you have questions about its behaviors and which OpenRefine features it supports.
Each source will have its own documentation on how it provides reconciliation. The table on [the reconciliation API list](https://reconciliation-api.github.io/testbench/) indicates whether your chosen service supports the features described below. Refer to the service's documentation if you have questions about its behaviors and which OpenRefine features it supports.
## Getting started
Select “Reconcile” → “Start reconciling” on a column. If you want to reconcile only some cells in that column, first use filters and facets to isolate them.
Choose a column to reconcile and use its dropdown menu to select <span class="menuItems">Reconcile</span><span class="menuItems">Start reconciling</span>. If you want to reconcile only some cells in that column, first use filters and facets to isolate them.
In the reconciliation window, you will see Wikidata offered as a default service. To add another service, click “Add Standard Service…” and paste in the URL of a [service](#sources). You should see the name of the service appear in the list of Services if the URL is correct.
In the reconciliation window, you will see Wikidata offered as a default service. To add another service, click <span class="buttonLabels">Add Standard Service...</span> and paste in the URL of a [service](#sources). You should see the name of the service appear in the list of <span class="buttonLabels">Services</span> if the URL is correct.
![The reconciliation window.](/img/reconcilewindow.png)
Once you select a service, the service may sample your selected column and identify some [suggested categories (“types”)](#reconciling-by-type) to reconcile against. Other services will suggest their available types without sampling, and some services have no types.
Once you select a service, your selected column may be sampled in order to suggest [“types” (categories)](#reconciling-by-type) to reconcile against. Other services will suggest their available types without sampling, and some services have no types.
For example, if you had a list of artists represented in a gallery collection, you could reconcile their names against the Getty Research Institutes [Union List of Artist Names (ULAN)](https://www.getty.edu/research/tools/vocabularies/ulan/). The same Getty reconciliation URL will offer you ULAN, AAT (Art and Architecture Thesaurus), and TGN (Thesaurus of Geographic Names).
For example, if you had a list of artists represented in a gallery collection, you could reconcile their names against the Getty Research Institutes [Union List of Artist Names (ULAN)](https://www.getty.edu/research/tools/vocabularies/ulan/). The same [Getty reconciliation URL](https://services.getty.edu/vocab/reconcile/) will offer you ULAN, AAT (Art and Architecture Thesaurus), and TGN (Thesaurus of Geographic Names).
![The reconciliation window with types.](/img/reconcilewindow2.png)
Refer to the documentation specific to the reconciliation service (from the testbench, for example) to learn whether types are offered, which types are offered, and which one is most appropriate for your column. You may wish to facet your data and reconcile batches against different types if available.
Refer to the [documentation specific to the reconciliation service](https://reconciliation-api.github.io/testbench/) to learn whether types are offered, which types are offered, and which one is most appropriate for your column. You may wish to facet your data and reconcile batches against different types if available.
Reconciliation can be a time-consuming process, especially with large datasets. We suggest starting with a small test batch. There is no throttle (delay between requests) to set for the reconciliation process. The amount of time will vary for each service, and based on the options you select during the process.
Reconciliation can be a time-consuming process, especially with large datasets. We suggest starting with a small test batch. There is no throttle (delay between requests) to set for the reconciliation process. The amount of time will vary for each service, and vary based on the options you select during the process.
When the process is done, you will see the reconciliation data in the cells.
If the cell was successfully matched, it displays a single dark blue link. In this case, the reconciliation is confident that the match is correct, and you should not have to check it manually.
If there is no clear match, a few candidates are displayed, together with their reconciliation score, with light blue links. You will need to select the correct one.
If the cell was successfully matched, it displays text as a single dark blue link. In this case, the reconciliation is confident that the match is correct, and you should not have to check it manually.
If there is no clear match, one or more candidates are displayed, together with their reconciliation score, with the text in light blue links. You will need to select the correct one.
For each matching decision you make, you have two options: match this cell only ![button to perform a single match](https://openrefine-wikidata.toolforge.org/static/screenshot_single_match.png), or also use the same identifier for all other cells containing the same original string ![button to perform a multiple match](https://openrefine-wikidata.toolforge.org/static/screenshot_bulk_match.png).
For each matching decision you make, you have two options: match this cell only (one checkmark), or also use the same identifier for all other cells containing the same original string (two checkmarks).
For services that offer the [“preview entities” feature](https://reconciliation-api.github.io/testbench/), you can hover your mouse over the suggestions to see more information about the candidates or matches. Each participating service (and each type) will deliver different structured data that may help you compare the candidates.
@ -62,28 +67,30 @@ For example, the Getty ULAN shows an artists discipline, nationality, and bir
Hovering over the suggestion will also offer the two matching options as buttons.
For matched values (those appearing as dark blue links), the underlying cell value has not been altered - the cell is storing both the original string and the matched entity link at the same time. If you were to copy your column to a new column at this point, for example, the reconcilation data would not transfer - only the original strings.
For matched values (those appearing as dark blue links), the underlying cell value has not been altered - the cell is storing both the original string and the matched entity link at the same time. If you were to copy your column to a new column at this point using `value`, for example, the reconcilation data would not transfer - only the original strings. You can learn more about how OpenRefine stores different pieces of information in each cell in [the Variables section specific to reconciliation data](expressions#reconciliation).
For each cell, you can manually “Create new item,” which will take the cells current value and apply it as though it is a match. This will not become a dark blue link, because at this time there is nothing to link to: it is like a draft entity stored only in your project. You can use this feature to prepare these entries for eventual upload to an editable service such as [Wikidata](wikidata), but most services do not yet support this feature.
For each cell, you can manually “Create new item,” which will take the cells original value and apply it, as though it is a match. This will not become a dark blue link, because at this time there is nothing to link to: it is a draft entity stored only in your project. You can use this feature to prepare these entries for eventual upload to an editable service such as [Wikidata](wikidata), but most services do not yet support this feature.
### Reconciliation facets
Under “Reconcile” → “Facets” you can see a number of reconciliation-specific faceting options. OpenRefine automatically creates two facets for you when you reconcile a column.
Under <span class="menuItems">Reconcile</span><span class="menuItems">Facets</span> there are a number of reconciliation-specific faceting options. OpenRefine automatically creates two facets when you reconcile some cells.
One is a numeric facet for “best candidate's score,” the range of reconciliation scores of only the best candidate of each cell. Each service calculates scores differently and has a different range, but higher scores always mean better matches. You can facet for higher scores in the numeric facet, and then approve them all in bulk, by using “Reconcile” → “Actions” → “Match each cell to its best candidate.”
One is a numeric facet for “best candidate's score,” the range of reconciliation scores of only the best candidate of each cell. Higher scores mean better matches, although each service calculates scores differently and has a different range. You can facet for higher scores using the numeric facet, and then approve them all in bulk, by using <span class="menuItems">Reconcile</span><span class="menuItems">[Actions](#reconciliation-actions)</span><span class="menuItems">Match each cell to its best candidate</span>.
There is also a “judgment” facet created, which lets you filter for the cells that haven't been matched (pick “none” in the facet). As you process each cell, its judgment changes from “none” to “matched” and it disappears from the view.
You can add other reconciliation facets by selecting “Reconcile” → “Facets” on your column. You can facet by:
You can add other facets by selecting <span class="menuItems">Reconcile</span><span class="menuItems">Facets</span> on your reconciled column. You can facet by:
* your judgments (“matched,” or “none” for unreconciled cells, or “new” for entities you've created)
* the action youve performed on that cell (chosen a “single” match, or set a "mass" match, or no action, as “unknown”)
* the action youve performed on that cell (chosen a “single” match, or set a “mass” match, or no action, which appears as “unknown”)
* the timestamps on the edits youve made so far (these appear as millisecond counts since an arbitrary point: they can be sorted alphabetically to move forward and back in time).
You can facet only the best candidates for each cell, based on:
* the score (calculated based on each service's own methods)
* the edit distance (using the [Levenshtein distance](https://www.wikiwand.com/en/Levenshtein_distance), a number based on how many single-character edits would be required to get your original value to the candidate value, with a larger value being a greater difference)
* the word similarity (not including [stop words](https://en.wikipedia.org/wiki/Stop_word), a percentage based on how many words in the original value match words in the candidate. For example, the value "Maria Luisa Zuloaga de Tovar" matched to the candidate "Palacios, Luisa Zuloaga de" results in a word similarity value of 0.6, or 60%, or 3 out of 5 words. Cells that are not yet matched to one candidate will show as 0.0.)
* the edit distance (using the [Levenshtein distance](cellediting#nearest-neighbor), a number based on how many single-character edits would be required to get your original value to the candidate value, with a larger value being a greater difference)
* the word similarity.
Word similarity is calculated as a percentage based on how many words (excluding [stop words](https://en.wikipedia.org/wiki/Stop_word)) in the original value match words in the candidate. For example, the value “Maria Luisa Zuloaga de Tovar” matched to the candidate “Palacios, Luisa Zuloaga de” results in a word similarity value of 0.6, or 60%, or 3 out of 5 words. Cells that are not yet matched to one candidate will show as 0.0).
You can also look at each best candidates:
* type (the ones you have selected in successive reconciliation attempts, or other types returned by the service based on the cell values)
@ -94,24 +101,24 @@ These facets are useful for doing successive reconciliation attempts, against di
### Reconciliation actions
You can use the “Reconcile” → “Actions” menu options to perform bulk changes, which will apply only to your current set of rows or records:
* Match each cell to its best candidate (by highest score)
* Create a new item for each cell (discard any suggested matches)
* Create one new item for similar cells (a new entity will be created for each unique string)
* Match all filtered cells to... (a specific item from the chosen service, via a search box. For [services with the “suggest entities” property](https://reconciliation-api.github.io/testbench/).)
* Discard all reconciliation judgments (reverts back to multiple candidates per cell, including cells that may have been auto-matched in the original reconciliation process)
* Clear reconciliation data, reverting all cells back to their original values.
You can use the <span class="menuItems">Reconcile</span><span class="menuItems">Actions</span> menu options to perform bulk changes (which will apply only to your currently viewed set of rows or records):
* <span class="menuItems">Match each cell to its best candidate</span> (by highest score)
* <span class="menuItems">Create a new item for each cell</span> (discard any suggested matches)
* <span class="menuItems">Create one new item for similar cells</span> (a new entity will be created for each unique string)
* <span class="menuItems">Match all filtered cells to...</span> (a specific item from the chosen service, via a search box; only works with services that support the “suggest entities” property)
* <span class="menuItems">Discard all reconciliation judgments</span> (reverts back to multiple candidates per cell, including cells that may have been auto-matched in the original reconciliation process)
* <span class="menuItems">Clear reconciliation data</span>, reverting all cells back to their original values.
The other options available under “Reconcile” are:
* Copy reconciliation data... (to an existing column: if the original values in your reconciliation column are identical to those in your chosen column, the matched and/or new cells will copy over. Unmatched values will not change.)
* [Use values as identifiers](#reconciling-with-unique-identifiers) (if you are reconciling with unique identifiers instead of by doing string searches).
* [Add entity identifiers column](#add-entity-identifiers-column).
The other options available under <span class="menuItems">Reconcile</span> are:
* <span class="menuItems">Copy reconciliation data...</span> (to an existing column: if the original values in your reconciliation column are identical to those in your chosen column, the matched and new cells will copy over; unmatched values will not change)
* [<span class="menuItems">Use values as identifiers</span>](#reconciling-with-unique-identifiers) (if you are reconciling with unique identifiers instead of by doing string searches)
* [<span class="menuItems">Add entity identifiers column</span>](#add-entity-identifiers-column).
## Reconciling with unique identifiers
Reconciliation services use unique identifiers for their entities. For example, the 14th Dalai Lama has the VIAF ID [38242123](https://viaf.org/viaf/38242123/) and the Wikidata ID [Q17293](https://www.wikidata.org/wiki/Q37349). You can supply these identifiers directly to your chosen reconciliation service in order to pull more data, but these strings will not be “reconciled” against the external dataset.
Select the column with unique identifiers and apply the operation “Reconcile” → “Use values as identifiers.” This will bring up the list of reconciliation services you have already added (to add a new service, open the “Start reconciling…” window first). If you use this operation on a column of IDs, you will not have access to the usual reconciliation settings.
Select the column with unique identifiers and apply the operation <span class="menuItems">Reconcile</span><span class="menuItems">Use values as identifiers</span>. This will bring up the list of reconciliation services you have already added (to add a new service, open the <span class="menuItems">Start reconciling...</span> window first). If you use this operation on a column of IDs, you will not have access to the usual reconciliation settings.
Matching identifiers does not validate them. All cells will appear as dark blue “confirmed” matches. You should check before this operation that the identifiers in the column exist on the target service.
@ -121,21 +128,21 @@ You may get false positives, which you will need to hover over or click on to id
## Reconciling by type
Reconciliation services, once added to OpenRefine, may suggest types from their databases. These types will usually be whatever the service specializes in: people, events, places, buildings, tools, plants, animals, organizations, etc.
Reconciliation services, once added to OpenRefine, may suggest types from their databases. These types will usually be whatever the service specializes in: people, events, places, buildings, tools, plants, animals, organizations, etc.
Reconciling against a type may be faster and more accurate, but may result in fewer matches. Some services have hierarchical types (such as “mammal” as a subtype of “animal”). When you reconcile against a more specific type, unmatched values may fall back to more broad types. Other services will not do this, so you may need to perform successive reconciliation attempts against different types. Refer to the documentation specific to the reconciliation service to learn more.
Reconciling against a type may be faster and more accurate, but may result in fewer matches. Some services have hierarchical types (such as “mammal” as a subtype of “animal”). When you reconcile against a more specific type, unmatched values may fall back to the broader type; other services will not do this, so you may need to perform successive reconciliation attempts against different types. Refer to the documentation specific to the reconciliation service to learn more.
When you select a service from the list, OpenRefine will load some or all available types. Some services will sample the first ten rows of your column to suggest types (check the [“Suggest types” column on this table of services](https://reconciliation-api.github.io/testbench/)). You will see a services types in the reconciliation window:
When you select a service from the list, OpenRefine will load some or all available types. Some services will sample the first ten rows of your column to suggest types (check the [“Suggest types” column](https://reconciliation-api.github.io/testbench/)). You will see a services types in the reconciliation window:
![Reconciling using a type.](/img/reconcile-by-type.png)
In this example, “Person” and “Corporate Name” are potential types offered by VIAF. You can also use the “Reconcile against type:” field to enter in another type that the service offers. When you start typing, this field may search and suggest existing types. For VIAF, you could enter “/book/book” if your column contained publications.
In this example, “Person” and “Corporate Name” are potential types offered by the reconciliation API for VIAF. You can also use the <span class="fieldLabels">Reconcile against type:</span> field to enter in another type that the service offers. When you start typing, this field may search and suggest existing types. For VIAF, you could enter “/book/book” if your column contained publications. You may need to enter the service's own strings precisely instead of attempting to search for a match.
Types are structured to fit their content: the Wikidata “human” type, for example, can include fields for birth and death dates, nationality, etc. The VIAF “person” type can include nationality and gender. You can use this to [include more properties](#reconciling-with-additional-columns) and find better matches.
If your column doesnt fit one specific type offered, you can “Reconcile against no particular type.” This may take longer for some services.
If your column doesnt fit one specific type offered, you can <span class="fieldLabels">Reconcile against no particular type</span>. This may take longer.
We recommend working in batches and reconciling against different types, moving from specific to broad. You can create a “best candidates type” facet to see which types are being represented. Some candidates may return more than one type, depending on the service.
We recommend working in batches and reconciling against different types, moving from specific to broad. You can create a facet for <span class="menuItems">Best candidates types</span> facet to see which types are being represented. Some candidates may return more than one type, depending on the service. Types may appear in facets by their unique IDs, rather than by their semantic labels (for example, Q5 for “human” in Wikidata).
## Reconciling with additional columns
@ -143,13 +150,13 @@ Some of your cells may be ambiguous, in the sense that a string can point to mor
![Reconciling sometimes turns up ambiguous matches.](/img/reconcileParis.gif)
Including supplementary information can be useful, depending on the service (such as including birthdate information about each person you are trying to reconcile). The other columns in your project will appear in the reconciliation window, with an “Include?” checkbox available on each.
Including supplementary information can be useful, depending on the service (such as including birthdate information about each person you are trying to reconcile). You can re-reconcile unmatched cells with additional properties, in the right side of the <span class="menuItems">Start reconciling</span> window, under “Also use relevant details from other columns.” The column names in your project will appear in the reconciliation window, with an <span class="fieldLabels">Include?</span> checkbox next to each one.
You can fill in the “As Property” field with the type of information you are including. When you start typing, potential fields may pop up (depending on the [“suggest properties” feature](https://reconciliation-api.github.io/testbench/)), such as “birthDate” in the case of ULAN or “Geburtsdatum” in the case of Integrated Authority File (GND). Use the documentation for your chosen service to identify the fields in their terms.
Fill in the <span class="fieldLabels">As Property</span> field with the type of information you are including. When you start typing, potential fields may pop up (depending on the [“suggest properties” feature](https://reconciliation-api.github.io/testbench/)), such as “birthDate” in the case of ULAN or “Geburtsdatum” in the case of Integrated Authority File (GND). Use the documentation for your chosen service to identify the fields in their terms.
Some services will not be able to search for the exact name of your desired “As Property” entry, but you can still manually supply the field name. Refer to the service to make sure you enter it correctly.
Some services will not be able to search for the exact name of your desired <span class="fieldLabels">As Property</span> entry, but you can still manually supply the field name. Refer to the service to choose the most appropriate field, and make sure you enter it correctly.
![Including a birth-date type.](/img/reconcile-by-type.png)
![Including a birth-date type.](/img/reconcile-with-property.png)
## Fetching more data
@ -157,54 +164,61 @@ One reason to reconcile to some external service is that it allows you to pull d
* Add identifiers for your values
* Add columns from reconciled values
* Add column by fetching URLs
* Add column by fetching URLs.
### Add entity identifiers column
Once you have selected matches for your cells, you can retrieve the unique identifiers for those cells and create a new column for these, with “Reconcile” → “Add entity identifiers column.” You will be asked to supply a column name. New items and other unmatched cells will generate null values in this column.
Once you have selected matches for your cells, you can retrieve the unique identifiers for those cells and create a new column for these, with <span class="menuItems">Reconcile</span><span class="menuItems">Add entity identifiers column</span>. You will be asked to supply a column name. New items and other unmatched cells will generate null values in this column.
### Add columns from reconciled values
If the reconciliation service supports [data extension](https://reconciliation-api.github.io/testbench/), then you can augment your reconciled data with new columns using “Edit column” → “Add columns from reconciled values....”
If the reconciliation service supports [data extension](https://reconciliation-api.github.io/testbench/), then you can augment your reconciled data with new columns using <span class="menuItems">Edit column</span><span class="menuItems">Add columns from reconciled values...</span>.
For example, if you have a column of chemical elements identified by name, you can fetch categorical information about them such as their atomic number and their element symbol, as the animation shows below:
For example, if you have a column of chemical elements identified by name, you can fetch categorical information about them such as their atomic number and their element symbol:
![A screenshare of elements fetching related information.](/img/reconcileelements.gif)
Once you have pulled reconciliation values and selected one for each cell, selecting “Add column from reconciled values...” will bring up a window to choose which information youd like to import into a new column. The quality of the suggested properties will depend on how you have reconciled your data beforehand: reconciling against a specific type will provide you with suggested properties of that type. For example, GND suggests elements about the “people” type after you've reconciled with it, such as their parents, native languages, children, etc.
Once you have chosen reconciliation matches for your cells, selecting <span class="menuItems">Add column from reconciled values...</span> will bring up a window to choose which related information youd like to import into new columns. You can manually enter desired properties, or select from a list of suggestions.
The quality of the suggested properties will depend on how you have reconciled your data beforehand: reconciling against a specific type will provide you with the associated properties of that type. For example, GND suggests elements about the “people” type after you've reconciled with it, such as their parents, native languages, children, etc.
![A screenshot of available properties from GND.](/img/reconcileGND.png)
If you have left any values unreconciled in your column, you will see “&lt;not reconciled>” in the preview. These will generate blank cells if you continue with the column addition process. This process may pull more than one property per row in your data, so you may need to switch into records mode after you've added columns.
If you have left any values unreconciled in your column, you will see “&lt;not reconciled>” in the preview. These will generate blank cells if you continue with the column addition process.
This process may pull more than one property per row in your data (such as multiple occupations), so you may need to switch into records mode after you've added columns.
### Add columns by fetching URLs
If the reconciliation service cannot extend data, look for a generic web API for that data source, or a structured URL that points to their dataset entities via unique IDs (such as https://viaf.org/viaf/000000). You can use the “Edit column” → “[Add column by fetching URLs](columnediting#add-column-by-fetching-urls)” operation to call this API or URL with the IDs obtained from the reconciliation process. This will require using [GREL expressions](expressions#GREL).
If the reconciliation service cannot extend data, look for a generic web API for that data source, or a structured URL that points to their dataset entities via unique IDs (such as “https&#58;//viaf.org/viaf/000000”). You can use the <span class="menuItems">Edit column</span><span class="menuItems">[Add column by fetching URLs](columnediting#add-column-by-fetching-urls)</span> operation to call this API or URL with the IDs obtained from the reconciliation process. This will require using [expressions](expressions).
You will likely not want to pull the entire HTML content of the pages at the ends of these URLs, so look to see whether the service offers a metadata endpoint, such as JSON-formatted data. You can either use a column of IDs, or you can pull the ID from each matched cell during the fetching process.
You may not want to pull the entire HTML content of the pages at the ends of these URLs, so look to see whether the service offers a metadata endpoint, such as JSON-formatted data. You can either use a column of IDs, or you can pull the ID from each matched cell during the fetching process.
For example, if you have reconciled artists to the Getty's ULAN, and [have their unique ULAN IDs as a column](#add-entity-identifiers-column), you can generate a new column of JSON-formatted data by using “Add column by fetching URLs” and entering the GREL expression `“http://vocab.getty.edu/” + value + “.json”` in the window. For this service, the unique IDs are formatted “ulan/000000” and so the generated URLs look like “http://vocab.getty.edu/ulan/000000.json”.
For example, if you have reconciled artists to the Getty's ULAN, and [have their unique ULAN IDs as a column](#add-entity-identifiers-column), you can generate a new column of JSON-formatted data by using <span class="menuItems">Add column by fetching URLs</span> and entering the GREL expression `"http://vocab.getty.edu/" + value + ".json"`. For this service, the unique IDs are formatted “ulan/000000” and so the generated URLs look like “http://vocab.getty.edu/ulan/000000.json”.
You can alternatively insert the ID directly from the matched column using a GREL expression like `“http://vocab.getty.edu/” + cell.recon.match.id + “.json”` instead.
Alternatively, you can insert the ID directly from the matched column's reconciliation variables, using a GREL expression like `“http://vocab.getty.edu/” + cell.recon.match.id + “.json”` instead.
Remember to set an appropriate throttle and to refer to the service documentation to ensure your compliance with their terms. See [the section about this operation](columnediting#add-column-by-fetching-urls) to learn more about common errors with this process.
Remember to set an appropriate throttle and to refer to the service documentation to ensure your compliance with their terms. See [the section about this operation](columnediting#add-column-by-fetching-urls) to learn more about the fetching process.
## Keep all the suggestions made
If you would like to generate a list of each suggestion made, rather than only the best candidate, you can use a [GREL expression](expressions#GREL). Go to “Edit column” → “Add column based on this column.” To create a list of all the possible matches, use
To generate a list of each suggestion made, rather than only the best candidate, you can use a [GREL expression](expressions#GREL). Go to <span class="menuItems">Edit column</span><span class="menuItems">Add column based on this column</span>. To create a list of all the possible matches, use something like
```forEach(cell.recon.candidates,c,c.name).join(“,”)```
```
forEach(cell.recon.candidates,c,c.name).join(", ")
```
To get the unique identifiers of these matches instead, use
```forEach(cell.recon.candidates,c,c.id).join(“,”)```
```
forEach(cell.recon.candidates,c,c.id).join(", ")
```
This information is stored as a string without any attached reconciliation information.
This information is stored as a string, without any attached reconciliation information.
## Writing reconciliation expressions
OpenRefine's GREL supplies a number of variables related specifically to reconciled values.
For example, some of the reconciliation variables are:
OpenRefine supplies a number of variables related specifically to reconciled values. These can be used in GREL and Jython expressions. For example, some of the reconciliation variables are:
* `cell.recon.match.id` or `cell.recon.match.name` for matched values
* `cell.recon.best.name` or `cell.recon.best.id` for best-candidate values
@ -215,8 +229,8 @@ For example, some of the reconciliation variables are:
You can find out more in the [reconciliaton variables](expressions#reconciliaton-variables) section.
## Exporting your reconciled data
## Exporting reconciled data
Once you have data that is reconciled to existing entities online, you may wish to export that data to a user-editable service such as Wikidata. See the section on [uploading your edits to Wikidata](wikidata#upload-edits-to-wikidata) for more information, or the section on [exporting](exporting) to see other formats OpenRefine can produce.
You can share reconciled data in progress through a [project export or import](exporting#export-a-project), with some preparation. The importing user needs to have the reconciliation services installed on their OpenRefine instance in advance of opening the project in order to use candidate and match links. Otherwise, the links will be broken and the user will need to add the reconciliation service and re-reconcile the columns in question. [Wikidata](wikidata) reconciliation data can be shared more easily as the service comes bundled with OpenRefine.
You can share reconciled data in progress through a [project export or import](exporting#export-a-project), with some preparation. The importing user needs to have the appropriate reconciliation services installed on their OpenRefine instance (by going to <span class="menuItems">Start reconciling</span> and clicking on <span class="buttonLabels">Add Standard Service...</span>) in advance of opening the project, in order to use candidate and match links. Otherwise, the links will be broken and the user will need to add the reconciliation service and re-reconcile the columns in question. [Wikidata](wikidata) reconciliation data can be shared more easily as the service comes bundled with OpenRefine.

View File

@ -1,4 +1,4 @@
---
---
id: running
title: Running OpenRefine
sidebar_label: Running
@ -8,9 +8,9 @@ sidebar_label: Running
OpenRefine does not require internet access to run its basic functions. Once you download and install it, it runs as a small web server on your own computer, and you access that local web server by using your browser.
You will see a command line window open when you run OpenRefine. Leave that window alone while you work on datasets in your browser.
You will see a command line window open when you run OpenRefine. Ignore that window while you work on datasets in your browser.
No matter how you load OpenRefine, it will load in your computers default browser. If you would like to use another browser instead, start OpenRefine and then point your chosen browser at the home screen: [http://127.0.0.1:3333/](http://127.0.0.1:3333/).
No matter how you start OpenRefine, it will load its interface in your computers default browser. If you would like to use another browser instead, start OpenRefine and then point your chosen browser at the home screen: [http://127.0.0.1:3333/](http://127.0.0.1:3333/).
OpenRefine works best on browsers based on Webkit, such as:
* Google Chrome
@ -20,7 +20,7 @@ OpenRefine works best on browsers based on Webkit, such as:
We are aware of some minor rendering and performance issues on other browsers such as Firefox. We don't support Internet Explorer.
You can launch multiple projects at the same time by simply having multiple tabs or browser windows open. From the <span class="menuItems">Open Project</span> screen, you can right-click on project names and select <span class="menuItems">Open in new tab</span>.
You can view and work on multiple projects at the same time by simply having multiple tabs or browser windows open. From the <span class="menuItems">Open Project</span> screen, you can right-click on project names and open them in new tabs or windows.
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
@ -37,21 +37,35 @@ import TabItem from '@theme/TabItem';
<TabItem value="win">
To exit OpenRefine, close all the browser tabs, then navigate to the command line window. To close this window and ensure OpenRefine exits properly, hold down `Control` and press `C` on your keyboard.
#### With openrefine.exe
You can run OpenRefine by double-clicking `openrefine.exe` or calling it from the command line. If you want to [modify the way `openrefine.exe` opens](#starting-with-modifications), you can edit the `openrefine.l4j.ini` file.
You can run OpenRefine by double-clicking `openrefine.exe` or calling it from the command line.
If you want to [modify the way `openrefine.exe` opens](#starting-with-modifications), you can edit the `openrefine.l4j.ini` file.
#### With refine.bat
On Windows, OpenRefine can also be run by using the file `refine.bat` in the program directory. If you start OpenRefine using `refine.bat`, you can do so by opening the file itself, or by calling it from the command line.
If you call `refine.bat` from the command line, you can [start OpenRefine with modifications](#starting-with-modifications). If you want to modify the way `refine.bat` opens through double-clicking or using a shortcut, you can edit the `refine.ini` file.
If you call `refine.bat` from the command line, you can [start OpenRefine with modifications](#starting-with-modifications).
If you want to modify the way `refine.bat` opens through double-clicking or using a shortcut, you can edit the `refine.ini` file.
#### Exiting
To exit OpenRefine, close all the browser tabs or windows, then navigate to the command line window. To close this window and ensure OpenRefine exits properly, hold down `Control` and press `C` on your keyboard. This will save any last changes to your projects.
</TabItem>
<TabItem value="mac">
You can find OpenRefine in your Applications folder, or you can call it from the command line. To exit, close all your OpenRefine browser tabs, go back to the terminal window and press `Command` and `Q` to close it down.
You can find OpenRefine in your Applications folder, or you can open it using Terminal.
To run OpenRefine using Terminal:
* Find the OpenRefine application / icon in Finder
* Control-click on the icon and select “Show Package Contents” from the context menu
* This should open a new Finder menu: navigate into the “MacOS” folder
* Control-click on “JavaAppLauncher”
* Choose “Open With” from the menu, and select “Terminal.”
To exit, close all your OpenRefine browser tabs, go back to the terminal window and press `Command` and `Q` to close it down.
:::caution Problems starting?
If you are using an older version of OpenRefine or are on an older version of MacOS, [check our Wiki for solutions to problems with MacOS](https://github.com/OpenRefine/OpenRefine/wiki/Installation-Instructions#macos).
@ -64,7 +78,7 @@ If you are using an older version of OpenRefine or are on an older version of Ma
Use a terminal to launch OpenRefine. First, navigate to the installation folder. Then call the program:
```
cd openrefine-3.4
cd openrefine-3.4.1
./refine
```
@ -106,7 +120,9 @@ When you run OpenRefine from a command line, you can change a number of default
On Windows, use a slash:
```C:>refine /i 127.0.0.2 /p 3334```
```
C:>refine /i 127.0.0.2 /p 3334
```
Get a list of all the commands with `refine /?`.
@ -119,25 +135,11 @@ Get a list of all the commands with `refine /?`.
|/d|Enable debugging (on port 8000)|refine /d|
|/x|Enable JMX monitoring for Jconsole and JvisualVM|refine /x|
</TabItem>
<TabItem value="mac">
To see the full list of command-line options, run `./refine -h`.
|Command|Use|Syntax example|
|---|---|---|
|-w|Path to the webapp|./refine -w /path/to/openrefine|
|-d|Path to the workspace|./refine -d /where/you/want/the/workspace|
|-m|Memory maximum heap|./refine -m 6000M|
|-p|Port|./refine -p 3334|
|-i|Interface (IP address, or IP and port)|./refine -i 127.0.0.2:3334|
|-k|Add a Google API key|_need an example_|
|-v|Verbosity (from low to high)|error,warn,info,debug,trace|
|-x|Additional configuration parameters|_need an example_|
|--debug|Enable debugging (on port 8000)|./refine --debug|
|--jmx|Enable JMX monitoring for Jconsole and JvisualVM|./refine --jmx|
You cannot start the Mac version with modifications using Terminal, but you can modify the way the application starts with [settings within files](#modifications-set-within-files).
</TabItem>
@ -152,9 +154,9 @@ To see the full list of command-line options, run `./refine -h`.
|-m|Memory maximum heap|./refine -m 6000M|
|-p|Port|./refine -p 3334|
|-i|Interface (IP address, or IP and port)|./refine -i 127.0.0.2:3334|
|-k|Add a Google API key|_need an example_|
|-v|Verbosity (from low to high)|error,warn,info,debug,trace|
|-x|Additional configuration parameters|_need an example_|
|-k|Add a Google API key|./refine -k YOUR_API_KEY|
|-v|Verbosity (from low to high: error,warn,info,debug,trace)|./refine -v info|
|-x|Additional Java configuration parameters (see Java documentation)||
|--debug|Enable debugging (on port 8000)|./refine --debug|
|--jmx|Enable JMX monitoring for Jconsole and JvisualVM|./refine --jmx|
@ -166,18 +168,44 @@ To see the full list of command-line options, run `./refine -h`.
#### Modifications set within files
On Windows, you can modify the way `openrefine.exe` runs by editing `openrefine.l4j.ini`; you can modify the way `refine.bat` runs by editing `refine.ini`. You can modify the Mac application by editing `info.plist`. On Linux, you can edit `refine.ini`.
On Windows, you can modify the way `openrefine.exe` runs by editing `openrefine.l4j.ini`; you can modify the way `refine.bat` runs by editing `refine.ini`.
These JVM preferences are different options and have different syntax than the key/value descriptions above. Some of the most common keys (with their defaults) are:
* -Drefine.autosave (5 [minutes])
* -Drefine.data_dir (/)
* -Drefine.development (false)
* -Drefine.headless (false)
* -Drefine.host (127.0.0.1)
* -Drefine.port (3333)
* -Drefine.webapp (main/webapp)
You can modify the Mac application by editing `info.plist`.
The syntax within the `.ini` files is as follows:
On Linux, you can edit `refine.ini`.
Some settings, such as changing memory allocations, are already set inside these files, and all you have to do is change the values. Some lines need to be un-commented to work.
For example, inside `refine.ini`, you should see:
```
no_proxy="localhost,127.0.0.1"
#REFINE_PORT=3334
#REFINE_HOST=127.0.0.1
#REFINE_WEBAPP=main\webapp
# Memory and max form size allocations
#REFINE_MAX_FORM_CONTENT_SIZE=1048576
REFINE_MEMORY=1400M
# Set initial java heap space (default: 256M) for better performance with large datasets
REFINE_MIN_MEMORY=1400M
...
```
##### JVM preferences
Further modifications can be performed by using JVM preferences. These JVM preferences are different options and have different syntax than the key/value descriptions used on the command line.
Some of the most common keys (with their defaults) are:
* The project [autosave](starting#autosaving) frequency: `-Drefine.autosave` (5 [minutes])
* The workspace director: `-Drefine.data_dir` (/)
* Development mode: `-Drefine.development` (false)
* Headless mode: `-Drefine.headless` (false)
* IP: `-Drefine.host` (127.0.0.1)
* Port: `-Drefine.port` (3333)
* The application folder: `-Drefine.webapp` (main/webapp)
The syntax is as follows:
<Tabs
groupId="operating-systems"
@ -191,21 +219,29 @@ The syntax within the `.ini` files is as follows:
<TabItem value="win">
Inside either of the `.ini` files, insert lines in this way:
Locate the `refine.l4j.ini` file, and insert lines in this way:
```
-Drefine.port=3333
-Drefine.host=127.0.0.1
-Drefine.port=3334
-Drefine.host=127.0.0.2
-Drefine.webapp=broker/core
```
In `refine.ini`, use a similar syntax, but set multiple parameters within a single line starting with `JAVA_OPTIONS=`:
```
JAVA_OPTIONS=-Drefine.data_dir=C:\Users\user\Documents\OpenRefine\ -Drefine.port=3334
```
</TabItem>
<TabItem value="mac">
Find the 'array' element that follows the line:
Locate the `info.plist`, and find the `array` element that follows the line
`<key>JVMOptions</key>`
```
<key>JVMOptions</key>
```
Typically this looks something like:
@ -219,7 +255,7 @@ Typically this looks something like:
</array>
```
Add in values like:
Add in values such as:
```
<key>JVMOptions</key>
@ -238,7 +274,7 @@ Add in values like:
<TabItem value="linux">
In `refine.ini`, add `JAVA_OPTIONS=` before the `-Drefine.preference` declaration. You can un-comment and edit the existing suggested lines, or add lines:
Locate the `refine.ini` file, and add `JAVA_OPTIONS=` before the `-Drefine.preference` declaration. You can un-comment and edit the existing suggested lines, or add lines:
```
JAVA_OPTIONS=-Drefine.autosave=2
@ -257,11 +293,13 @@ Refer to the [official Java documentation](https://docs.oracle.com/javase/8/docs
## The home screen
When you first launch OpenRefine, you will see a screen with a menu on the left hand side that includes <span class="menuItems">Create Project</span>, <span class="menuItems">Open Project</span>, <span class="menuItems">Import Project</span>, and <span class="menuItems">Language Settings</span>. This is called the "home screen", where you can manage your projects and general settings.
When you first launch OpenRefine, you will see a screen with a menu on the left hand side that includes <span class="menuItems">Create Project</span>, <span class="menuItems">Open Project</span>, <span class="menuItems">Import Project</span>, and <span class="menuItems">Language Settings</span>. This is called the “home screen,” where you can manage your projects and general settings.
In the lower left-hand corner of the screen, you'll see <span class="menuItems">Preferences</span>, <span class="menuItems">Help</span>, and <span class="menuItems">About</span>.
### Language settings
You can set your preferred interface language here. This language setting will persist until you change it again in the future. Languages are translated as a community effort; some languages are partially complete and default back to English where unfinished. Currently OpenRefine supports the following languages for 75% or more of the interface:
From the home screen, look in the options to the left for <span class="menuItems">Language Settings</span>. You can set your preferred interface language here. This language setting will persist until you change it again in the future. Languages are translated as a community effort; some languages are partially complete and default back to English where unfinished. Currently OpenRefine supports the following languages for 75% or more of the interface:
* Cebuano
* German
@ -278,13 +316,15 @@ You can set your preferred interface language here. This language setting will p
* Tagalog
* Chinese (简体中文)
:::info
To leave the Language Settings screen, click on the diamond “OpenRefine” logo.
:::info Help us Translate OpenRefine
We use Weblate to provide translations for the interface. You can check [our profile on Weblate](https://hosted.weblate.org/projects/openrefine/translations/) to see which languages are in the process of being supported. See [our technical reference if you are interested in contributing translation work](https://docs.openrefine.org/technical-reference/translating) to make OpenRefine accessible to people in other languages.
:::
### Preferences
At this time you can set preferences using a key/value pair: that is, selecting one of the keys below and setting a value for it.
In the bottom left corner of the screen, look for <span class="menuItems">Preferences</span>. At this time you can set preferences using a key/value pair: that is, selecting one of the keys below and setting a value for it.
|Setting|Key|Value syntax|Default|Example|
|---|---|---|---|---|
@ -294,7 +334,7 @@ At this time you can set preferences using a key/value pair: that is, selecting
|Timeout for Google Drive authorization|googleConnectTimeOut|Number (microseconds)|180000|500000|
|Maximum lag for Wikidata edit retries|wikibase.upload.maxLag|Number (seconds)|5|10|
To leave the Preferences screen, click on the “OpenRefine” logo.
To leave the Preferences screen, click on the diamond “OpenRefine” logo.
If the preference youre looking for isnt here, look at the options you can set from the [command line or in an `.ini` file](#starting-with-modifications).
@ -308,7 +348,7 @@ The project screen (or work screen) is where you will spend most of your time on
The project bar runs across the very top of the project screen. It contains the the OpenRefine logo, the project title, and the project control buttons on the right side.
At any time you can close your current project and go back to the home screen by clicking on the OpenRefine logo. If youd like to open another project in a new browser tab or window, you can right-click on the logo and use “Open in a new tab.” You will lose your current facets and view settings if you close your project (but data transformations will be saved in the [History](#history-undoredo) of the project).
At any time you can close your current project and go back to the home screen by clicking on the OpenRefine logo. If youd like to open another project in a new browser tab or window, you can right-click on the logo and use “Open in a new tab.” You will lose [your current facets and view settings](#facetfilter) if you close your project (but data transformations will be saved in the [History](#history-undoredo) of the project).
:::caution
Dont click the “back” button on your browser - it will likely close your current project and you will lose your facets and view settings.
@ -316,21 +356,21 @@ Dont click the “back” button on your browser - it will likely close your
You can rename a project at any time by clicking inside the project title, which will turn into a text field. Project names dont have to be unique, as OpenRefine organizes them based on a unique identifier behind the scenes.
<span class="menuItems">Permalink</span> allows you to return to a project at a specific view state - that is, with facets and filters applied. The permalink can help you pick up where you left off if you have to close your project while working with facets and filters. It puts view-specific information directly into the URL: clicking on it will load this current-view URL in the existing tab. You can right-click and copy the Permalink URL to copy the current view state to your clipboard, without refreshing the tab youre using.
<br/>
<span class="menuItems">Open…</span> will open up a new browser tab showing the “Create Project” screen. From here you can change settings, start a new project, or open an existing project.
<br/>
<span class="menuItems">Export</span> is a dropdown menu that allows you to pick a format for exporting your current dataset. It will only export rows and records that are currently visible - the currently selected facets and filters, not the total data in the project.
<br/>
The <span class="menuItems">Permalink</span> allows you to return to a project at a specific view state - that is, with [facets and filters](facets) applied. The <span class="menuItems">Permalink</span> can help you pick up where you left off if you have to close your project while working with facets and filters. It puts view-specific information directly into the URL: clicking on it will load this current-view URL in the existing tab. You can right-click and copy the <span class="menuItems">Permalink</span> URL to copy the current view state to your clipboard, without refreshing the tab youre using.
The <span class="menuItems">Open…</span> button will open up a new browser tab showing the <span class="menuItems">Create Project</span> screen. From here you can change settings, start a new project, or open an existing project.
<span class="menuItems">Export</span> is a dropdown menu that allows you to pick a format for exporting a dataset. Many of the export options will only export rows and records that are currently visible - the currently selected facets and filters, not the total data in the project.
<span class="menuItems">Help</span> will open up a new browser tab and bring you to this user manual on the web.
### The grid header
The grid header sits below the project bar and above the project grid (the data of your project). The grid header will tell you the total number of rows or records in your project, and indicate whether you are in rows or records mode.
The grid header sits below the project bar and above the project grid (where the data of your project is displayed). The grid header will tell you the total number of rows or records in your project, and indicate whether you are in [rows or records mode](exploring#rows-vs-records).
It will also tell you if youre currently looking at a select number of rows via facets or filtering, rather than the entire dataset, by displaying either, for example, <span class="menuItems">180 rows</span> or <span class="menuItems">67 matching rows (180 total)</span>.
It will also tell you if youre currently looking at a select number of rows via facets or filtering, rather than the entire dataset, by displaying either, for example, “180 rows” or “67 matching rows (180 total).”
Directly below the row number, you have the ability to switch between row mode and records mode. OpenRefine stores which projects are in records mode, and displays your data as records by default if you are.
Directly below the row number, you have the ability to switch between [row mode and records mode](exploring#rows-vs-records). OpenRefine stores projects persistently in one of the two modes, and displays your data as records by default if you are.
To the right of the rows/records selection is the array of options for how many rows/records to view on screen at one time. At the far right of the screen you can navigate through your entire dataset one page at a time.
@ -340,43 +380,43 @@ The <span class="menuItems">Extensions</span> dropdown offers you options for ex
### The grid
The area of the project screen that displays your dataset is called the "project grid" (or the "data grid", or simply the "grid"). The grid presents data in a tabular format, which may look like a normal spreadsheet program to you.
The area of the project screen that displays your dataset is called the “grid” (or the “data grid,” or the “project grid”). The grid presents data in a tabular format, which may look like a normal spreadsheet program to you.
Columns widths are automatically set based on their contents; some column headers may be cut off, but can be viewed by mousing over the headers.
In each column header you will see a small arrow. Clicking on this arrow brings up a dropdown menu containing column-specific data exploration and transformation options. You will learn about each of these options in the [Exploring data](exploring) and [Transforming data](transforming) sections.
The first column in every project will always be <span class="menuItems">All</span>, which contains options to flag, star, and do non-column-specific operations. The <span class="menuItems">All</span> column is also where rows/records are numbered.
The first column in every project will always be <span class="menuItems">All</span>, which contains options to flag, star, and do non-column-specific operations. The <span class="menuItems">All</span> column is also where rows/records are numbered. Numbering shows the permanent order of rows and records; a temporary sorting or facet may reorder the rows or show a limited set, but numbering will show you the original identifiers unless you make a permanent change.
The project grid may display with both vertical and horizontal scrolling, depending on the number and width of columns, and the number of rows/records displayed. You can control the display of the project grid by using [Sort and View options](exploring#sort-and-view).
Mousing over individual cells will allow you to [edit cells individually](transforming).
Mousing over individual cells will allow you to [edit cells individually](cellediting#edit-one-cell-at-a-time).
### The sidebar
### Facet/Filter
#### Facet/Filter
The Facet/Filter tab is one of the main ways of exploring your data: displaying the patterns and trends in your data, and helping you narrow your focus and modify that data. [Facets](exploring#facets) and [filters](exploring#filters) are explained more in [Exploring data](exploring).
The <span class="tabLabels">Facet/Filter</span> tab is one of the main ways of exploring your data: displaying the patterns and trends in your data, and helping you narrow your focus and modify that data. [Facets](facets) and [filters](facets#text-filter) are explained more in [Exploring data](exploring).
![A screenshot of facets and filters in action.](/img/facetfilter.png)
In the interface, you will see three buttons: <span class="menuItems">Refresh</span>, <span class="menuItems">Reset all</span>, and <span class="menuItems">Remove all</span>. Refreshing your facets will ensure you are looking at the latest information about each facet, if you have changed the counts or eliminated some options, for example.
In the tab, you will see three buttons: <span class="menuItems">Refresh</span>, <span class="menuItems">Reset all</span>, and <span class="menuItems">Remove all</span>.
Resetting your facets will remove any inclusion or exclusion you may have set - the facet options will stay in the sidebar, but your view settings will be reset.
Refreshing your facets will ensure you are looking at the latest information about each facet, for example if you have changed the counts or eliminated some options.
Removing your facets will clear out the sidebar entirely. If you have written custom facets using expressions, these will be lost.
Resetting your facets will remove any inclusion or exclusion you may have set - the facet options will stay in the sidebar, but your view settings will be undone.
You can preserve your facets and filters for future use by copying a [Permalink](#the-project-bar).
Removing your facets will clear out the sidebar entirely. If you have written custom facets using [expressions](expressions), these will be lost.
#### History (Undo/Redo)
You can preserve your facets and filters for future use by copying a <span class="menuItems">[Permalink](#the-project-bar)</span>.
In OpenRefine, any activity that changes the data can be undone. Changes are tracked from the very beginning, when a project is first created. The change history of each project is saved with the project's data, so quitting OpenRefine does not erase the steps you've taken. When you restart OpenRefine, you can view and undo changes that you made before you quit OpenRefine.
### History (Undo/Redo)
In OpenRefine, any activity that changes the data can be undone. Changes are tracked from the very beginning, when a project is first created. The change history of each project is saved with the project's data, so quitting OpenRefine does not erase the steps you've taken. When you restart OpenRefine, you can view and undo changes that you made before you quit OpenRefine. OpenRefine [autosaves](starting#autosaving) your actions every five minutes by default, and when you close OpenRefine properly (using Ctrl + C). You can [change this interval](running#jvm-preferences).
Project history gets saved when you export a project archive, and restored when you import that archive to a new installation of OpenRefine.
![A screenshot of the History (Undo/Redo) tab with 13 steps.](/img/history.png "A screenshot of the History (Undo/Redo) tab with 13 steps.")
When you click on <span class="menuItems">Undo / Redo</span> in the sidebar of any project, that projects history is shown as a list of changes in order, with the first change being the action of creating the project itself. (That first change, indexed as step zero, cannot be undone.) Here is a sample history with 3 changes:
When you click on the <span class="tabLabels">Undo / Redo</span> tab in the sidebar of any project, that projects history is shown as a list of changes in order, with the first change being the action of creating the project itself. (That first change, indexed as step zero, cannot be undone.) Here is a sample history with 3 changes:
```
0. Create project
@ -387,22 +427,78 @@ When you click on <span class="menuItems">Undo / Redo</span> in the sidebar of a
The current state of the project is highlighted with a dark blue background. If you move back and forth on the timeline you will see the current state become highlighted, while the actions that came after that state will be grayed out.
To revert your data back to an earlier state, simply click on the last action in the timeline you want to keep. In the example above, if we keep the removal of 7 rows but revert everything we did after that, then click on <span class="menuItems">Remove 7 rows</span>. The last 2 changes will be undone, in order to bring the project back to state #1.
To revert your data back to an earlier state, simply click on the last action in the timeline you want to keep. In the example above, if we keep the removal of 7 rows but revert everything we did after that, then click on “Remove 7 rows.” The last 2 changes will be undone, in order to bring the project back to state #1.
In this example, changes #2 and #3 will now be grayed out. You can redo a change by clicking on it in the history - everything up to and including it will be redone.
If you have moved back one or more states, and then you perform a new operation on your data, the later actions (everything thats greyed out) will be erased and cannot be re-applied.
The <span class="menuItems">Undo/Redo</span> tab will show you which step youre on, and if youre about to risk erasing work - by saying something like "4/5" or "1/7" at the end.
The Undo/Redo tab will indicate which step youre on, and if youre about to risk erasing work - by saying something like “4/5" or “1/7” at the end.
##### Reusing operations
#### Reusing operations
Operations that you perform in OpenRefine can be reused. For example, a formula you wrote inside one project can be copied and applied to another project later.
To reuse one or more operations, you first extract it from the project where it was first applied. Click to the <span class="menuItems">Undo/Redo</span> tab and click <span class="menuItems">Extract…</span>. This brings up a box that lists all operations up to the current state (it does not show undone operations). Select the operation or operations you want to extract using the checkboxes on the left, and they will be encoded as JSON on the right. Copy that JSON off to the clipboard.
To reuse one or more operations, first extract it from the project where it was first applied. Click to the <span class="tabLabels">Undo/Redo</span> tab and click <span class="menuItems">Extract…</span>. This brings up a box that lists all operations up to the current state (it does not show undone operations). Select the operation or operations you want to extract using the checkboxes on the left, and they will be encoded as JSON on the right. Copy that JSON to the clipboard.
Move to the second project, go to the <span class="menuItems">Undo/Redo</span> tab, click <span class="menuItems">Apply…</span> and paste in that JSON.
Move to the second project, go to the <span class="tabLabels">Undo/Redo</span> tab, click <span class="menuItems">Apply…</span> and paste in that JSON.
Not all operations can be extracted. Edits to a single cell, for example, cant be replicated.
### Common extension buttons
## Advanced OpenRefine uses
### Running OpenRefine's Linux version on a Mac
You can run OpenRefine from the command line in Mac by using the Linux installation package. We do not promise support for this method. Follow the instructions in the Linux section.
### Running as a server
:::caution
Please note that if your machine has an external IP (is exposed to the Internet), you should not do this, or should protect it behind a proxy or firewall, such as nginx. Proceed at your own risk.
:::
By default (and for security reasons), OpenRefine only listens to TCP requests coming from localhost (127.0.0.1) on port 3333. If you want to share your OpenRefine instance with colleagues and respond to TCP requests to any IP address of the machine, start it from the command line like this:
```
./refine -i 0.0.0.0
```
or set this option in `refine.ini`:
```
REFINE_HOST=0.0.0.0
```
or set this JVM option:
```
-Drefine.host=0.0.0.0
```
On Mac, you can add a specific entry to the `Info.plist` file located within the app bundle (`/Applications/OpenRefine.app/Contents/Info.plist`):
```
<key>JVMOptions</key>
<array>
<string>-Drefine.host=0.0.0.0</string>
</array>
```
:::caution
OpenRefine has no built-in security or version control for multi-user scenarios. OpenRefine has a single data model that is not shared, so there is a risk of data operations being overwritten by other users. Care must be taken by users.
:::
### Automating OpenRefine
Some users may wish to employ OpenRefine for batch processing as part of a larger automated pipeline. Not all OpenRefine features can work without human supervision and advancement (such as clustering), but many data transformation tasks can be automated.
:::caution
The following are all third-party extensions and code; the OpenRefine team does not maintain them and cannot guarantee that any of them work.
:::
Some examples:
* This project allows OpenRefine to be run from the command line using [operations saved in a JSON file](running#reusing-operations): [OpenRefine batch processing](https://github.com/opencultureconsulting/openrefine-batch)
* A Python project for applying a JSON file of operations to a data file, outputting the new file, and deleting the temporary project, written by David Huynh and Max Ogden: [Python client library for Google Refine](https://github.com/maxogden/refine-python)
* And the same in Ruby: [Refine-Ruby](https://github.com/maxogden/refine-ruby)
* Another Python client library, by Paul Makepeace: [OpenRefine Python Client Library](https://github.com/PaulMakepeace/refine-client-py)
To look for other instances, search our Google Groups [for users](https://groups.google.com/g/openrefine) and [for developers](https://groups.google.com/g/openrefine-dev), where [these projects were originally posted](https://groups.google.com/g/openrefine/c/GfS1bfCBJow/m/qWYOZo3PKe4J).

View File

@ -6,30 +6,30 @@ sidebar_label: Sort and view
## Sort
You can temporarily sort your rows by one column. You can sort:
You can temporarily sort your rows by one column. You can sort based on [data type](exploring#data-types):
* text alphabetically or reverse
* numbers by largest or smallest
* dates by earliest or latest
* boolean values by false first or true first
* boolean values by false first or true first.
You can also choose where to place errors and blank cells in the sorting. Text can be case-sensitive or not: cells that start with lowercase characters will appear ahead of uppercase.
You can also choose where to place errors and blank cells in the sorting. Text can be case-sensitive or not: if so, cells that start with lowercase characters will appear ahead of those that start with uppercase characters.
![A screenshot of the Sort window.](/img/sort.png)
After you apply a sorting method, you can make it permanent, remove it, reverse it, or apply a subsequent sorting. Youll find “Sort” in the project grid header to the right of the rows-display setting, which will show all current sorting settings.
After you apply a sorting method, you can make it permanent, remove it, reverse it, or apply a subsequent sorting. When it is applied, youll find <span class="menuItems">Sort</span> in the project grid header to the right of the rows-display setting, which will show all current sorting settings.
If you have multiple sorting methods applied, they will work in the order you applied them (represented in order in the "Sort" menu). For example, you can sort an "authors" column alphabetically, and then sort books by publication date, for those authors that have more than one book. If you apply those in a different order - sort all the publication dates in the dataset first, and then alphabetically by author - your dataset will look different.
If you have multiple sorting methods applied, they will work in the order you applied them (represented in order in the <span class="menuItems">Sort</span> menu). For example, you can sort an “authors” column alphabetically, and then sort their books by publication date, for those authors that have more than one book. If you apply those in a different order - sort all the publication dates in the dataset first, and then alphabetically by author - your dataset will look different.
![Temporarily sorted rows.](/img/sort2.png)
When the sorting method you've applied is temporary, you will see that the rows retain their original numbering. When you make that sorting method permanent, by selecting "Reorder rows permanently," the row numbers will change and the "Sort" menu in the project grid header will disappear. This will apply all current sorting methods.
When the sorting method you've applied is temporary, you will see that the rows retain their original numbering. When you make that sorting method permanent, by selecting <span class="menuItems">Reorder rows permanently</span>, the row numbers will change and the <span class="menuItems">Sort</span> menu in the project grid header will disappear. This will apply all current sorting methods.
## View
You can control what data you view in the grid. On each column, you can “collapse” that specific column, all other columns, all columns to the left, and all columns to the right. Using the “All” columns dropdown menu, you can collapse all columns, and expand all the columns that you previously collapsed.
You can control what data you view in the grid. On each column, you will see a <span class="menuItems">View</span> menu option. From there, you can “collapse” (hide) that specific column, all other columns, all columns to the left, and all columns to the right. Using the <span class="menuItems">View</span> option that appears in the <span class="menuItems">All</span> columns dropdown menu, you can collapse all columns, and expand all the columns that you previously collapsed.
### Show/hide “null”
You can also use the “All” dropdown to show and hide [“null” values](#cell-data-types). A small grey “null” will appear in each applicable cell.
You can find, under <span class="menuItems">All</span><span class="menuItems">View</span>, the option to show and hide [“null” values](exploring#data-types). A small grey “null” will appear in each applicable cell. Remember that a null cell is not the same thing as an empty cell.
![A screenshot of what a null value looks like.](/img/null.png)

View File

@ -1,4 +1,4 @@
---
---
id: starting
title: Starting a project
sidebar_label: Starting a project
@ -8,21 +8,21 @@ sidebar_label: Starting a project
An OpenRefine project is started by importing in some existing data - OpenRefine doesnt allow you to create a dataset from nothing.
No matter where your data comes from, OpenRefine doesnt modify your original data source. It copies all the information from your input, creates its own project file, and stores it in your [workspace directory](installing#set-where-data-is-stored).
No matter where your data comes from, OpenRefine wont modify your original data source. It copies all the information from your input, creates its own project file, and stores it in your [workspace directory](installing#set-where-data-is-stored).
The data and all of your edits are automatically saved inside the project file. When youre finished modifying the data, you can export it back out into the file format of your choice.
The data and all of your edits are [automatically saved](#autosaving) inside the project file. When youre finished modifying the data, you can [export it back out](exporting) into the file format of your choice.
You can also receive and open other peoples projects, or send them yours, by exporting a project archive and importing it.
You can also receive and open other peoples projects, or send them yours, by [exporting a project archive](exporting#export-a-project) and [importing it](#import-a-project).
## Create project by importing data
## Create a project by importing data
When you start OpenRefine, youll be taken to the “Create Project” screen. Youll see on the left side of the screen that your options are to:
When you start OpenRefine, youll be taken to the <span class="menuItems">Create Project</span> screen. Youll see on the left side of the screen that your options are to:
* import data from a file on your computer
* import data from a link to the web
* import data from one or more files on your computer
* import data from one or more links on the web
* import data by pasting in text from your clipboard
* import data from a database (using SQL), and
* import Sheets from Google Drive.
* import one or more Sheets from Google Drive.
From these sources, you can load any of the following file formats:
@ -40,7 +40,7 @@ From these sources, you can load any of the following file formats:
More formats can be imported by [adding extensions to provide that functionality](https://openrefine.org/download.html).
If you supply two or more files for one project, the files rows will be loaded in the order that you specify, and OpenRefine will create a column at the beginning of the dataset with the source URL or file name in it to help you identify where each row came from. If the files have matching columns, the data will load in each column; if not, the successive files will append all of their new columns to the end of the dataset:
If you supply two or more files for one project, the files rows will be loaded in the order that you specify, and OpenRefine will create a column at the beginning of the dataset with the source URL or file name in it to help you identify where each row came from. If the files have columns with identical names, the data will load in those columns; if not, the successive files will append all of their new columns to the end of the dataset:
|File|Fruit|Quantity|Berry|Berry source|
|---|---|---|---|---|
@ -49,18 +49,19 @@ If you supply two or more files for one project, the files rows will be loade
|berries.csv||9|Mulberry|Greece|
|berries.csv||2|Blueberry|Canada|
You cannot combine two datasets into one project by appending data within rows. You can, however, combine two projects later using functions such as [cross()](grelfunctions/#crosscell-s-projectname-s-columnname), or [fetch further data](columnediting) using other methods.
For whichever method you choose, when you click <span class="menuItems">Next >></span> you will be given a preview and a chance to configure the way OpenRefine interprets the file.
For whichever method you choose to start your project, when you click <span class="menuItems">Next >></span> you will be given a preview and a chance to configure the way OpenRefine interprets the data you input.
### Get data from this computer
Click on <span class="menuItems">Browse…</span> and select a file on your hard drive. All files will be shown, not just compatible ones.
Click on <span class="menuItems">Browse…</span> and select a file (or several) on your hard drive. All files will be shown, not just compatible ones.
If you import an archive file (something with the extension `.zip`, `.tar.gz`, `.tgz`, `.tar.bz2`, `.gz`, or `.bz2`), OpenRefine detects the files inside it, shows you a preview screen, and allows you to select which ones to load. This does not work with `.rar` files.
### Web Addresses (URLs)
### Web addresses (URLs)
Type or paste the URL to the data file into the field provided. You can add as many fields as you want. OpenRefine will download the file and preview it for you.
Type or paste the URL to a data file into the field provided. You can add as many fields as you want. OpenRefine will download the file and preview the project for you.
If you supply two or more file URLs, OpenRefine will identify each one and ask you to choose which (or all) to load.
@ -68,52 +69,55 @@ Do not use this form to load a Google Sheet by its link; use [the Google Data fo
### Clipboard
You can copy and paste in data from anywhere. OpenRefine will recognize comma-separated, tab-separated, or table-formatted information copied from sources such as word-processing documents, spreadsheets, and tables in PDFs. You can also just paste in a list of items that you want to turn into multi-column rows. OpenRefine recognizes each new text line as a row.
You can copy and paste in data from anywhere. OpenRefine will recognize comma-separated, tab-separated, or table-formatted information copied from sources such as word-processing documents, spreadsheets, and tables in PDFs. You can also just paste in a list of items that you want to turn into rows. OpenRefine recognizes each new text line as a row.
This can be useful if you want to pre-select a specific number of rows from your source data, or paste together rows from different places, rather than delete unwanted rows later in the project interace.
This can also be useful if you would like to paste in a list of URLs, which you can use later to fetch the data online and build columns with.
This can also be useful if you would like to paste in a list of URLs, which you can use later to [fetch more data](columnediting).
### Database (SQL)
If you are an administrator or have SQL access to a database of information, you may want to pull the latest dataset directly from there. This could include an online catalogue, a content management system, or a digital repository or collection management system. You can also load a database (`.db`) file saved locally. You will need to use an [SQL query](https://www.w3schools.com/sql/) to import your intended data.
There are some publicly-accessible databases you can query, such as [one provided by Rfam](https://docs.rfam.org/en/latest/database.html). The instructions provided by Rfam can help you understand how to connect to and query from any database.
There are some publicly-accessible databases you can query, such as [one provided by Rfam](https://docs.rfam.org/en/latest/database.html). The instructions provided by Rfam can help you understand how to connect to and query from other databases.
OpenRefine can connect to PostgreSQL, MySQL, MariaDB, and SQLite database systems. It will automatically populate the <span class="menuItems">Port</span> field based on which of these you choose, but you can manually edit this if needed.
OpenRefine can connect to PostgreSQL, MySQL, MariaDB, and SQLite database systems. It will automatically populate the <span class="fieldLabels">Port</span> field based on which of these you choose, but you can manually edit this if needed.
If you have a `.db` file, you can supply the path to the file on your computer directly in the <span class="menuItems">Database</span> field at the bottom of the form. You can leave the rest of the fields blank.
If you have a `.db` file, you can supply the path to the file on your computer in the <span class="fieldLabels">Database</span> field at the bottom of the form. You can leave the rest of the fields blank.
To import data directly from a database, you will need the database type (such as MySQL), database name, the hostname (either an IP address or the domain that hosts the database), and the port on the host. You will need an account authorized for access, and you may need to add OpenRefine's IP address or host to the <span class="menuItems">allowable hosts</span> for that account. You can find that information by pressing <span class="menuItems">Test</span> and getting the IP from the error message that results.
To import data directly from a database, you will need the database type (such as MySQL), database name, the hostname (either an IP address or the domain that hosts the database), and the port on the host. You will need an account authorized for access, and you may need to add OpenRefine's IP address or host to the "allowable hosts" for that account. You can find that information by pressing <span class="buttonLabels">Test</span> and getting the IP address from the error message that results.
You can either connect just once to gather data, or save the connection to use it again later. If you press <span class="menuItems">Connect</span> without saving, OpenRefine will forget all the information you just entered. If youd like to save the connection, name your connection in a way you will recognize later. Click <span class="menuItems">Save</span> and it will appear in the <span class="menuItems">Saved Connections</span> list on the left. From now on, you can click on the <span class="menuItems">...</span> ellipsis to the right of the connection youve saved, and click <span class="menuItems">Connect</span>.
You can either connect just once to gather data, or save the connection to use it again later. If you press <span class="buttonLabels">Connect</span> without saving, OpenRefine will forget all the information you just entered. If youd like to save the connection, name your connection in a way you will recognize later. Click <span class="buttonLabels">Save</span> and it will appear in the <span class="menuItems">Saved Connections</span> list on the left. From now on, you can click on the <span class="buttonLabels">...</span> ellipsis to the right of the connection youve saved, and click <span class="buttonLabels">Connect</span>.
If your connection is successful, you will see a Query Editor where you can run your SQL query. OpenRefine will give you an error if you write a statement that tries to modify the source database in any way.
### Google Data
### Google data
You have two ways to load in data from Google Sheets:
* A link to an accessible Google Sheet (that is, one with link-sharing turned on)
* Selecting a Google Sheet in your Google Drive
* providing a link to an accessible Google Sheet (that is, one with link-sharing turned on), and
* selecting a Google Sheet in your Google Drive.
#### Google Sheet by URL
You can import data from any Google Sheet that has link-sharing turned on. Paste in a URL that looks something like
```https://docs.google.com/spreadsheets/………/edit?usp=sharing```
```
https://docs.google.com/spreadsheets/………/edit?usp=sharing
```
This will only work with Sheets, not with any other Google Drive file that might have an available link, including `.xls` and other valid files that are hosted in Google Drive. These links will also not work [by URL](#web-addresses-urls), so you need to download the files to your computer.
This will only work with Sheets, not with any other Google Drive file that might have an available link, including `.xls` and other valid files that are hosted in Google Drive. These links will not work when attempting to start a project [by URL](#web-addresses-urls) either, so you need to download those files to your computer.
#### Google Sheet from Drive
You can authorize OpenRefine to access your Google Drive data and import data from any Google Sheet it finds there. This will include Sheets that belong to you and Sheets that are shared with you, as well as Sheets that are in your trash.
When you select a Google option (either here, or [when exporting project data to Google Drive or Google Sheets](exporting), you will see a pop-up window that asks you to select a Google account to authorize with. You may see an error message when you authorize: if so, try your import or export operation again and it should succeed.
OpenRefine will not show spreadsheets that are in your email inbox or stored in any other Google property - only in Drive. It also wont show all compatible file formats, only Sheets files.
OpenRefine will generate a list of all Sheets it finds, with the most recently modified Sheets at the top. If a file youve just added isnt showing in this list, you can close and restart OpenRefine, or simply navigate to an existing project, open it, then head back to the <span class="menuItems">Create Project</span> window and check again.
When you click <span class="menuItems">Preview</span> the Sheet will open in a new browser tab. When you click the Sheet title, OpenRefine will begin to process the data.
When you click <span class="buttonLabels">Preview</span> the Sheet will open in a new browser tab. When you click the Sheet title, OpenRefine will begin to process the data.
## Project preview
@ -122,34 +126,39 @@ Once OpenRefine is ready to import the data, you will see a screen with <span cl
At the bottom of the screen you will find options for telling OpenRefine how to process what it has found. You can tell it which row(s) to parse as column headers, as well as to ignore any number of rows at the top. You can also select a specific range of rows to work with, by discarding some rows at the top (excluding the header) and limiting the total number of rows it loads.
OpenRefine tries to guess how to parse your data based on the file extension. For example, `.xml` files are going to be parsed as though they are formatted in XML. An unknown file extension (or your clipboard copy-paste) is assumed to be either tab-separated or comma-separated. OpenRefine looks for a tab character; if one is found, it assumes you have imported tab-separated data.
OpenRefine tries to guess how to parse your data based on the file extension. For example, `.xml` files are going to be parsed as though they are formatted in XML. An unknown file extension (or your clipboard copy-paste) is assumed to be either tab-separated or comma-separated. OpenRefine looks for a tab character, and if one is found, it assumes you have imported tab-separated data.
If OpenRefine isnt certain what format you imported, it will provide a list of possibilities under <span class="menuItems">Parse data as</span> and some settings. You can specify a custom separator now, or split columns later on in the project interface.
If OpenRefine isnt certain what format you imported, it will provide a list of possibilities under <span class="menuItems">Parse data as</span> and some settings. You can specify a custom separator now, or split columns later while [transforming your data](transforming).
If you imported a spreadsheet with multiple worksheets, they will be listed along with the number of rows they contain. You can only select data from one worksheet.
Note that OpenRefine does not preserve any formatting, such as cell or text colour, that my have been in the original data file.
Note that OpenRefine does not preserve any formatting, such as cell or text colour, that my have been in the original data file. Hyperlinked text will be input as plain text, but OpenRefine will recognize links and make them clickable inside the project interface.
You should create a project name at this stage. You can also supply tags to keep your projects organized. When youre happy with the preview, click <span class="menuItems">Create Project</span>.
:::info Encoding issues?
Look for character encoding issues at this stage. You may want to manually select an encoding, such as UTF-8, UTF-16, or ASCII, if OpenRefine does not display some characters correctly in the preview. Once your project is created, you can specify another encoding for specific columns using the [reinterpret() function](grelfunctions#reinterprets-s-encoder).
:::
You should create a project name at this stage. You can also supply tags to keep your projects organized. When youre happy with the preview, click <span class="buttonLabels">Create Project</span>.
## Import a project
Because OpenRefine only runs locally on your computer, you cant have a project accessible to more than one person at the same time.
The best way to collaborate with another person is to export and import projects that save all your changes, so that you can pick up where someone else left off. You can also [export projects](exporting#export-a-project) and import them to new computers of your own, such as for working on the same project from the office and from home.
The best way to collaborate with another person is to export and import projects that save all your changes, so that you can pick up where someone else left off. You can also [export projects](exporting#export-a-project) and import them to other computers, such as for working on the same project from the office and from home.
An exported project will include all of the [history](running#history-undoredo), so you can see (and undo) all the changes from the previous user. It is essentially a point-in-time snapshot of their work. OpenRefine only exports projects as `.tar.gz` files at this time.
:::caution
If you wish to hide the original state of your data and your history of edits (for example, if you are using OpenRefine to anonymize information), export your cleaned dataset only and do not share your project archive.
:::
### Instructions
Once someone has sent you a project archive file from their computer, you can save it anywhere. OpenRefine will import it like a new project and save its information to your workspace directory.
Once someone has sent you a project archive file from their computer, you can save it anywhere, including your Downloads folder.
In the left-hand menu of the home screen, click <span class="menuItems">Import Project</span>. Click <span class="menuItems">Browse…</span> and navigate to wherever you saved the file you were sent (for example, your Downloads folder).
In the left-hand menu of the home screen, click <span class="buttonLabels">Import Project</span>. Click <span class="buttonLabels">Browse…</span> and navigate to wherever you saved the file you were sent (for example, your Downloads folder).
You can rename the project if youd like - we recommend adding your name, a date, or a version number, if youre planning to continue collaborating with another person (or working from multiple computers).
Then, click <span class="menuItems">Import Project</span>. Your project should appear with a step count beside <span class="menuItems">Undo/Redo</span> if steps were saved by the exporter.
Then, click <span class="buttonLabels">Import Project</span>. Your project should appear with a step count beside <span class="tabLabels">Undo/Redo</span> if steps were saved by the exporter.
OpenRefine will store the project in its own workspace directory, so you can now delete the original file that was sent to you.
@ -158,27 +167,28 @@ OpenRefine will store the project in its own workspace directory, so you can now
You can access all of your created projects by clicking on <span class="menuItems">Open Project</span>. Your project list can be organized by modification date, title, row count, and other metadata you can supply (such as subject, descripton, tags, or creator). To edit the fields you see here, click <span class="menuItems">About</span> to the left of each project. There you can edit a number of available fields. You can also see the project ID that corresponds to the name of the folder in your work directory.
### Naming projects
You may have multiple projects from the same dataset, or multiple versions from sharing a project with another person. OpenRefine automatically generates a project name from the imported file, or <span class="menuItems">clipboard</span> when you use Clipboard importing. Project names dont have to be unique, so OpenRefine will create many projects with the same name unless you intervene.
You may have multiple projects from the same dataset, or multiple versions from sharing a project with another person. OpenRefine automatically generates a project name from the imported file, or “clipboard” when you use <span class="menuItems">Clipboard</span> importing. Project names dont have to be unique, and OpenRefine will create many projects with the same name unless you intervene.
You can name a project when you create it or import it, and you can rename a project by opening it and clicking on the project name at the top of the screen.
You can edit a project's name when you create it or import it, and you can rename a project later by opening it and clicking on the project name at the top of the screen.
### Autosaving
OpenRefine saves all of your actions (everything you can see in the <span class="menuItems">Undo/Redo</span> panel). That includes flagging and starring rows.
OpenRefine [saves all of your actions](running#history-undoredo) (everything you can see in the <span class="tabLabels">Undo/Redo</span> panel). That includes flagging and starring rows.
It doesnt, however, save your facets, filters, or any kind of view you may have in place while you work. This includes the number of rows showing, whether you are showing your data as rows or records, and any sorting or column collapsing you may have done. A good rule of thumb is: if its not showing in <span class="menuItems">Undo/Redo</span>, you will lose it when you leave the project workspace.
It doesnt, however, save your facets, filters, or any kind of view you may have in place while you work. This includes the number of rows showing, and any sorting or column collapsing you may have done. A good rule of thumb is: if its not showing in <span class="tabLabels">Undo/Redo</span>, you will lose it when you leave the project workspace.
Autosaving happens by default every five minutes. You can [change this preference by following these directions](running#jvm-preferences).
You can only save and share facets and filters, not any other type of view. To save current facets and filters, click <span class="menuItems">Permalink</span>. The project will reload with a different URL, which you can then copy and save elsewhere. This permalink will save both the facets and filters youve set, and the settings for each one (such as sorting by count rather than by name).
### Deleting projects
You can delete projects, which will erase the project files from the work directory on your computer. This is immediate and cannot be undone.
You can delete projects, which will erase the project files from the workspace directory on your computer. This is immediate and cannot be undone.
Go to <span class="menuItems">Open Project</span> and find the project you want to delete. Click on the <span class="menuItems">X</span> to the left of the project name. There will be a confirmation dialog.
### Project files
You can find all of your raw project files in your work directory. They will be named according to the unique Project ID that OpenRefine has assigned them, which you can find on the <span class="menuItems">Open Project</span> screen, under the “About” link for each project.
You can find all of your raw project files in your work directory. They will be named according to the unique Project ID that OpenRefine has assigned them, which you can find on the <span class="menuItems">Open Project</span> screen, under the “About” link for each project.

View File

@ -8,15 +8,14 @@ sidebar_label: Overview
OpenRefine gives you powerful ways to clean, correct, codify, and extend your data. Without ever needing to type inside a single cell, you can automatically fix typos, convert things to the right format, and add structured categories from trusted sources.
The following ways to improve data are organized by their appearance in the menu options in OpenRefine. You can:
This section of ways to improve data are organized by their appearance in the menu options in OpenRefine. You can:
* change the order of rows or columns
* edit cell contents within a particular column
* edit cell contents across all rows and columns
* transform rows into columns, and columns into rows
* split or join columns
* add new columns based on existing data or through reconciliation
* convert your rows of data into multi-row records
* change the order of [rows](#edit-rows) or [columns](columnediting#rename-remove-and-move)
* edit [cell contents](cellediting) within a particular column
* [transform](transposing) rows into columns, and columns into rows
* [split or join columns](columnediting#split-or-join)
* [add new columns](columnediting) based on existing data, with fetching new information, or through [reconciliation](reconciling)
* convert your rows of data into [multi-row records](exploring#rows-vs-records).
## Edit rows
@ -26,8 +25,10 @@ You can [sort your data](sortview#sort) based on the values in one column, but t
![A screenshot of where to find the Sort menu with a sorting applied.](/img/sortPermanent.png)
In the project grid header, the word “Sort” will appear when a sort operation is applied. Click on it to show the dropdown menu, and select “Reorder rows permanently.” You will see the numbering of the rows change under the “All” column.
In the project grid header, the word “Sort” will appear when a sort operation is applied. Click on it to show the dropdown menu, and select <span class="menuItems">Reorder rows permanently</span>. You will see the numbering of the rows change under the <span class="menuItems">All</span> column.
Reordering rows permanently will affect all rows in the dataset, not just those currently viewed through facets and filters.
:::info Reordering all rows
Reordering rows permanently will affect all rows in the dataset, not just those currently viewed through [facets and filters](facets).
:::
You can undo this action using the [“History” sidebar](running#history-undoredo).
You can undo this action using the [<span class="fieldLabels">History</span> tab](running#history-undoredo).

View File

@ -17,13 +17,13 @@ Imagine personal data with addresses in this format:
|Jacques Cousteau|23, quai de Conti|Paris||France|75270|
|Emmy Noether|010 N Merion Avenue|Bryn Mawr|Pennsylvania|USA|19010|
You can transpose the address information from this format into multiple rows. Go to the “Street” column and select “Transpose” → “Transpose cells across columns into rows.” From there you can select all of the five columns, starting with “Street” and ending with “Postal code,” that correspond to address information. Once you begin, you should put your project into [records mode](exploring#rows-vs-records) to associate the subsequent rows with “Name” as the key column.
You can transpose the address information from this format into multiple rows. Go to the “Street” column and select <span class="menuItems">Transpose</span><span class="menuItems">Transpose cells across columns into rows</span>. From there you can select all of the five columns, starting with “Street” and ending with “Postal code,” that correspond to address information. Once you begin, you should put your project into [records mode](exploring#rows-vs-records) to associate the subsequent rows with “Name” as the key column.
![A screenshot of the transpose across columns window.](/img/transpose1.png)
### One column
You can transpose the multiple address columns into a series of rows instead:
You can transpose the multiple address columns into a series of rows:
|Name|Address|
|---|---|
@ -37,7 +37,7 @@ You can transpose the multiple address columns into a series of rows instead:
||USA|
||19010|
You can include the column-name information in each cell by prepending it to the value, with or without a separator:
You can choose one column and include the column-name information in each cell by prepending it to the value, with or without a separator:
|Name|Address|
|---|---|
@ -53,7 +53,7 @@ You can include the column-name information in each cell by prepending it to the
### Two columns
You can retain the column names as separate cell values, by selecting “Two new columns” and naming the key and value columns.
You can retain the column names as separate cell values, by selecting <span class="fieldLabels">Two new columns</span> and naming the key and value columns.
|Name|Address part|Address|
|---|---|---|
@ -91,7 +91,7 @@ The goal is to sort out all of the information contained in one column into sepa
|Joe Khoury |Junior analyst |Beirut|
|Samantha Martinez |CTO |Tokyo|
By selecting “Transpose” → “Transpose cells in rows into columns...” a window will appear that simply asks how many rows to transpose. In this case, each employee record has three rows, so input “3” (do not subtract one for the original column). The original column will disappear and be replaced with three columns, with the name of the original column plus a number appended.
By selecting <span class="menuItems">Transpose</span><span class="menuItems">Transpose cells in rows into columns...</span> a window will appear that simply asks how many rows to transpose. In this case, each employee record has three rows, so input “3” (do not subtract one for the original column). The original column will disappear and be replaced with three columns, with the name of the original column plus a number appended.
|Column 1 |Column 2 |Column 3|
|---|---|---|
@ -99,13 +99,17 @@ By selecting “Transpose” → “Transpose cells in rows into columns...” a
|Employee: Joe Khoury |Job title: Junior analyst |Office: Beirut|
|Employee: Samantha Martinez |Job title: CTO |Office: Tokyo|
From here you can use “Cell editing” → “Replace” to remove “Employee: ”, “Job title: ”, and “Office: ”, or use [GREL functions](expressions#grel) with “Edit cells” → “Transform...” to clean out the extraneous characters: `value.replace('Employee: ', '')`, etc.
From here you can use <span class="menuItems">Cell editing</span><span class="menuItems">Replace</span> to remove “Employee: ”, “Job title: ”, and “Office: ” if you wish, or use [expressions](expressions) with <span class="menuItems">Edit cells</span><span class="menuItems">Transform...</span> to clean out the extraneous characters:
If your dataset doesn't have a predictable number of cells per intended row, such that you cannot specify easily how many columns to create, try “Columnize by key/value columns.“
```
value.replace("Employee: ", "")
```
If your dataset doesn't have a predictable number of cells per intended row, such that you cannot specify easily how many columns to create, try <span class="menuItems">Columnize by key/value columns</span>.
## Columnize by key/value columns
This operation can be used to reshape a dataset that contains key and value columns: the repeating strings in the key column become new column names, and the contents of the value column are moved to new columns. This operation can be found at “Transpose” → “Columnize by key/value columns.”
This operation can be used to reshape a dataset that contains key and value columns: the repeating strings in the key column become new column names, and the contents of the value column are moved to new columns. This operation can be found at <span class="menuItems">Transpose</span><span class="menuItems">Columnize by key/value columns</span>.
![A screenshot of the Columnize window.](/img/transpose2.png)
@ -120,7 +124,7 @@ Consider the following example, with flowers, their colours, and their Internati
|Color |Yellow |
|IUCN ID |161899 |
In this format, each flower species is described by multiple attributes on consecutive rows. The “Field” column contains the keys and the “Data” column contains the values. In the “Columnize by key/value columns” window you can select each of these from the available columns. It transforms the table as follows:
In this format, each flower species is described by multiple attributes on consecutive rows. The “Field” column contains the keys and the “Data” column contains the values. In the <span class="menuItems">Columnize by key/value columns</span> window you can select each of these from the available columns. It transforms the table as follows:
| Name | Color | IUCN ID |
|-----------------------|----------|---------|
@ -227,4 +231,4 @@ This will be transformed to:
This actually changes the operation: OpenRefine no longer looks for the first key (“Name”) but simply pivots all information based on the first extra column's values. Every old row with the same value gets transposed into one new row. If you have more than one extra column, they are pivoted as well but not used as the new key.
You can use [“Fill down”](cellediting#fill-down) to put identical values in the extra columns if you need to.
You can use <span class="menuItems">[Fill down](cellediting#fill-down-and-blank-down)</span> to put identical values in the extra columns if you need to.

View File

@ -4,32 +4,26 @@ title: Troubleshooting
sidebar_label: Troubleshooting
---
## Frequently Asked Questions
## Frequently asked questions
We collect and share FAQs and responses on Github at [https://github.com/OpenRefine/OpenRefine/wiki/FAQ](https://github.com/OpenRefine/OpenRefine/wiki/FAQ). If you dont find your problem and solution there, continue on to the resources in the Community section to see more conversations and look for solutions.
## Community
If youre having a problem:
* Search the [User forum](https://groups.google.com/g/openrefine) to see if the problem is already reported
* Read [Github issues](https://github.com/OpenRefine/OpenRefine/issues) to see if the problem is already reported
* Read [Stack Overflow](https://stackoverflow.com/questions/tagged/openrefine) to see if the problem is already reported
* Check [Twitter](https://twitter.com/search?f=tweets&vertical=default&q=OpenRefine%20OR%20%22Open%20Refine%22%20OR%20%23OpenRefine&src=typd) to see if others are discussing the problem
* Report an issue:
* First as a new thread in the User forum
* Then, if you wish, you can create a Github issue
* Then, if you wish, you can create a Github issue.
If you want to contribute:
* [We have a guide to contributing here.](https://github.com/OpenRefine/OpenRefine/blob/master/CONTRIBUTING.md)
* Contribute your feature requests in the User forum or as Github issues
* Share with us your successes and use cases in the User forum
* Add your blog posts, guides, tips, tricks, tutorials to our list
* Respond to our biennial user survey
* Join the User Forum and/or [Developer Forum](https://groups.google.com/g/openrefine-dev)
* [Help us translate the tool into more languages](https://docs.openrefine.org/technical-reference/translating), using Weblate
* [We have a guide to contributing](https://docs.openrefine.org/technical-reference/contributing) in the Technical Reference section
* Contribute your feature requests in the [User forum](https://groups.google.com/g/openrefine) or as [Github issues](https://github.com/OpenRefine/OpenRefine/issues)
* Join the User Forum and/or the [Developer Forum](https://groups.google.com/g/openrefine-dev)
* Share your successes and use cases with us, in the User forum
* Add your [blog posts, guides, tips, tricks, tutorials to our list](https://github.com/OpenRefine/OpenRefine/wiki/External-Resources)
* Keep an eye out for and respond to our biennial user survey.

View File

@ -8,11 +8,13 @@ sidebar_label: Wikidata
OpenRefine provides powerful ways to both pull data from Wikidata and add data to it.
OpenRefines connections to Wikidata is supplied by an extension that is available by default in OpenRefine. The Wikidata extension can be removed manually by navigating to your OpenRefine installation folder, and then looking inside `webapp/extensions/` and deleting the `wikidata` folder inside.
You do not need a Wikidata account to reconcile your local OpenRefine project to Wikidata. If you wish to [upload your cleaned dataset to Wikidata](#editing-wikidata-with-openrefine), you will need an [autoconfirmed](https://www.wikidata.org/wiki/Wikidata:Autoconfirmed_users) account, and you must [authorize OpenRefine with that account](#manage-wikidata-account).
The best source for information about how OpenRefine works with Wikidata is [on Wikidata itself, under Tools](https://www.wikidata.org/wiki/Wikidata:Tools/OpenRefine). This section has tutorials, guidelines on editing, and spaces for discussion and help. The following text reviews the basics and can help you get set up, but the Wikidata help page is more regularly updated when technology or policies change.
:::info A better resource
The best source for information about how OpenRefine works with Wikidata is [on Wikidata itself, under Tools](https://www.wikidata.org/wiki/Wikidata:Tools/OpenRefine). That page has tutorials, guidelines on editing, and spaces for discussion and help. The following text on this page reviews the basics and can help you get set up, but the Wikidata help page is more regularly updated when technology or policies change. Links to the Wikidata help page are included throughout this page.
:::
OpenRefines connections to Wikidata were formerly an optional extension, but are now included automatically with installation. The Wikidata extension can be removed manually by navigating to your OpenRefine installation folder, and then looking inside `webapp/extensions/` and deleting the `wikidata` folder found there.
## Reconciling with Wikidata
@ -24,23 +26,23 @@ The Wikidata [reconciliation service](reconciling) for OpenRefine [supports](htt
You can find documentation and further resources on the reconciliation API [here](https://wikidata.reconci.link/).
For the most part, Wikidata reconciliation behaves the same way other reconciliation processes do, but there are a few processes and features specific to Wikidata.
For the most part, Wikidata reconciliation behaves the same way other reconciliation services do, but there are a few processes and features specific to Wikidata.
### Language settings
You can install a version of the Wikidata reconciliation service that uses your language. First, you need the language code: this is the [two-letter code found on this list](https://en.wikipedia.org/wiki/List_of_Wikipedias), or in the domain name of the desired Wikipedia (for instance, “fr” if your Wikipedia is [https://fr.wikipedia.org/wiki/](https://fr.wikipedia.org/wiki/)).
You can install a version of the Wikidata reconciliation service that uses your language. First, you need the language code: this is the [two-letter code found on this list](https://en.wikipedia.org/wiki/List_of_Wikipedias), or in the domain name of the desired Wikipedia/Wikidata (for instance, “fr” if your Wikipedia is https://fr.wikipedia.org/wiki/).
Then, open the reconciliation window (under <span class="menuItems">Reconcile</span><span class="menuItems">Start reconciling...</span>) and click <span class="menuItems">Add Standard Service</span>. The URL is `https://openrefine-wikidata.toolforge.org/fr/api` where “fr” is your desired language code.
Then, open the reconciliation window (under <span class="menuItems">Reconcile</span><span class="menuItems">Start reconciling...</span>) and click <span class="menuItems">Add Standard Service</span>. The URL to enter is `https://openrefine-wikidata.toolforge.org/fr/api`, where “fr” is your desired language code.
When reconciling using this interface, items and properties will be displayed in your language if a translation is available. The matching score of the reconciliation is not influenced by your choice of language: items are matched by considering all labels and keeping the best possible match. So the language of your dataset is irrelevant to the choice of the language for the reconciliation interface.
When reconciling using this interface, items and properties will be displayed in your chosen language if the label is available. The matching score of the reconciliation is not influenced by your choice of language for the service: items are matched by considering all labels and returning the best possible match. The language of your dataset is also irrelevant to your choice of language for the reconciliation service; it simply determines which language labels to return based on the entity chosen.
### Restricting matches by type
In Wikidata, types are items themselves. For instance, the [university of Ljubljana (Q1377)](https://www.wikidata.org/wiki/Q1377) has the type [public university (Q875538)](https://www.wikidata.org/wiki/Q875538), using the [instance of (P31)](https://www.wikidata.org/wiki/Property:P31) property. Types can be subclasses of other types, using the [subclass of (P279)](https://www.wikidata.org/wiki/Property:P279) property. For instance, [public university (Q875538)](https://www.wikidata.org/wiki/Q875538) is a subclass of [university (Q3918)](https://www.wikidata.org/wiki/Q3918). You can visualize these structures with the [Wikidata Graph Builder](https://angryloki.github.io/wikidata-graph-builder/).
When you select or enter a type for reconciliation, OpenRefine will include that type and all of its subtypes. For instance, if you select [university (Q3918)](https://www.wikidata.org/wiki/Q3918), then [university of Ljubljana (Q1377)](https://www.wikidata.org/wiki/Q1377) will be a possible match, though that item isn't directly linked to Q3918.
When you select or enter a type for reconciliation, OpenRefine will include that type and all of its subtypes. For instance, if you select [university (Q3918)](https://www.wikidata.org/wiki/Q3918), then [university of Ljubljana (Q1377)](https://www.wikidata.org/wiki/Q1377) will be a possible match, though that item isn't directly linked to Q3918 - because it is directly linked to Q875538, the subclass of Q3918.
Some items may not yet be set as an instance of anything, because Wikidata is crowdsourced. If you restrict reconciliation to a type, these items will not appear in the results, except as a fallback, and will have a lower score.
Some items and types may not yet be set as an instance or subclass of anything (because Wikidata is crowdsourced). If you restrict reconciliation to a type, items without the chosen type will not appear in the results, except as a fallback, and will have a lower score.
### Reconciling via unique identifiers
@ -52,7 +54,7 @@ If the identifier you submit is assigned to multiple Wikidata items (because Wik
Wikidata's hierarchical property structure can be called by using property paths (using |, /, and . symbols). Labels, aliases, descriptions, and sitelinks can also be accessed. You can also match values against subfields, such as latitude and longitude subfields of a geographical coordinate.
For information on how to do this, read the documentation and further resources [here](https://wikidata.reconci.link/#documentation).
For information on how to do this, read the [documentation and further resources here](https://wikidata.reconci.link/#documentation).
## Editing Wikidata with OpenRefine
@ -62,9 +64,11 @@ As a user-maintained data source, Wikidata can be edited by anyone. OpenRefine m
Wikidata is built by creating entities (such as people, organizations, or places, identified with unique numbers starting with Q), defining properties (unique numbers starting with P), and using properties to define relationships between entities (a Q has a property P, with a value of another Q).
For example, you may wish to create entities for local authors and the books they've set in your community. Each writer will be an entity with the occupation [author (Q482980)](https://www.wikidata.org/wiki/Q482980), each book will be an entity with [literary work (Q7725634)](https://www.wikidata.org/wiki/Q7725634), and books will be related to authors through a property [author (P50)](https://www.wikidata.org/wiki/Property:P50). Books can have places where they are set, with [setting (Q617332)](https://www.wikidata.org/wiki/Q617332). In OpenRefine, you'll need a column of publication titles that you have reconciled (and create new items where needed); each publication will have one or more locations in a “setting” column, which is also reconciled to municipalities or regions where they exist (and create new items where needed). Then you can add those new relationships to each book, and create new entities for both books and places.
For example, you may wish to create entities for local authors and the books they've set in your community. Each writer will be an entity with the occupation [author (Q482980)](https://www.wikidata.org/wiki/Q482980), each book will be an entity with the property “instance of” ([P31](https://www.wikidata.org/wiki/Property:P31)) linking it to a class such as [literary work (Q7725634)](https://www.wikidata.org/wiki/Q7725634), and books will be related to authors through a property [author (P50)](https://www.wikidata.org/wiki/Property:P50). Books can have places where they are set, with the property [narrative location (P840)](https://www.wikidata.org/wiki/Property:P840).
There is a list of [tutorials and walkthroughs on Wikidata](https://www.wikidata.org/wiki/Wikidata:Tools/OpenRefine/Editing) that will allow you to see the full process. You can save your schemas and drafts in OpenRefine, and your progress stays in draft until you are sure youre ready to upload it to Wikidata. You can also find information on [how to design a schema](https://www.wikidata.org/wiki/Wikidata:Tools/OpenRefine/Editing/Schema_alignment) and [how OpenRefine evaluates your proposed edits for issues](https://www.wikidata.org/wiki/Wikidata:Tools/OpenRefine/Editing/Quality_assurance).
To do this with OpenRefine, you'll need a column of publication titles that you have reconciled (and create new items where needed); each publication will have one or more locations in a “setting” column, which is also reconciled to municipalities or regions where they exist (and create new items where needed). Then you can add those new relationships, and create new entities for authors, books, and places where needed. You do not need columns for properties; those are defined later, in the creation of your [schema](#edit-wikidata-schema).
There is a list of [tutorials and walkthroughs on Wikidata](https://www.wikidata.org/wiki/Wikidata:Tools/OpenRefine/Editing) that will allow you to see the full process. You can save your schemas and drafts in OpenRefine, and your progress stays in draft until you are ready to upload it to Wikidata. You can also find information on [how to design a schema](https://www.wikidata.org/wiki/Wikidata:Tools/OpenRefine/Editing/Schema_alignment) and [how OpenRefine evaluates your proposed edits for issues](https://www.wikidata.org/wiki/Wikidata:Tools/OpenRefine/Editing/Quality_assurance).
Batches of edits to Wikidata that are created with OpenRefine can be undone. You can test out the uploading process by reconciling to several “sandbox” entities created specifically for drafting edits and learning about Wikidata:
* https://www.wikidata.org/wiki/Q4115189
@ -80,15 +84,15 @@ You can use OpenRefine's reconciliation preview to look at the target Wikidata e
The best resource is the [Schema alignment page](https://www.wikidata.org/wiki/Wikidata:Tools/OpenRefine/Editing/Schema_alignment) on Wikidata.
A [schema](https://en.wikipedia.org/wiki/Database_schema) is the plan for how to structure information in a database. In OpenRefine, the schema operates as a template for how Wikidata edits should be applied: how to translate your tabular data into statements. With a schema, you can:
A [schema](https://en.wikipedia.org/wiki/Database_schema) is a plan for how to structure information in a database. In OpenRefine, the schema operates as a template for how Wikidata edits should be applied: how to translate your tabular data into statements. With a schema, you can:
* preview the Wikidata edits and inspect them manually;
* analyze and fix any issues raised automatically by the tool;
* analyze and fix any issues highlighted by OpenRefine;
* upload your changes to Wikidata by logging in with your own account;
* export the changes to the QuickStatements v1 format.
For example, if your dataset has columns for authors, publication titles, and publication years, your schema can be conceptualized as: [publication title] has the author [author], and was published in [publication year]. To establish these facts, you need to establish one or more columns as “items,” for which you will make “statements” that relate them to other columns.
You can export any schema you create, and import an existing schema for use with a new dataset. This can help you work in batches on a large amount of data with a minimum of redundant labor.
You can export any schema you create, and import an existing schema for use with a new dataset. This can help you work in batches on a large amount of data while minimizing redundant labor.
Once you select <span class="menuItems">Edit Wikidata schema</span> under the <span class="menuItems">Extensions</span> dropdown menu, your project interface will change. Youll see new tabs added to the right of “X rows/records" in the grid header: “Schema,” “Issues,” and “Preview.” You can now switch between the tabular grid format of your dataset and the screens that allow you to prepare data for uploading.
@ -96,15 +100,19 @@ OpenRefine presents you with an easy visual way to map out the relationships in
![A screenshot of the schema construction window in OpenRefine.](/img/wikidata-schema.png)
There is [a Wikidata tutorial on how OpenRefine handles Wikidata schema](https://www.wikidata.org/wiki/Wikidata:Tools/OpenRefine/Editing/Tutorials/Basic_editing).
You may wish to refer to [this Wikidata tutorial on how OpenRefine handles Wikidata schema](https://www.wikidata.org/wiki/Wikidata:Tools/OpenRefine/Editing/Tutorials/Basic_editing).
#### Editing terms with your schema
You may wish to include edits to terms (labels, aliases, descriptions, or sitelinks) as well as establishing relationships between entities.
With OpenRefine, you can edit the terms (labels, aliases, descriptions, or sitelinks) of Wikidata entities as well as establish relationships between entities. For example, you may wish to upload pseudonyms, pen names, maiden names, or married names for authors.
For example, you may wish to upload pseudonyms, pen names, maiden or married names for historical authors. You can do so by putting the preferred names in one column of your dataset and alternative names in another column. In the schema interface, add an item for the preferred values, then click “Add term” on the right-hand side of the screen. Select “Alias” from the dropdown, enter in “English” in the language field, and drop your alternative names column into the space.
![An author with a number of aliases indicating pseudonyms.](/img/wikidata-terms.png)
Terms must always have a language selected. You cannot edit multiple languages at once, unless you drop a suitable column into the “language” field. For example, if you had translated publication titles, with data in the format
You can do so by putting the preferred names in one column of your dataset and alternative names in another column. In the schema interface, add an item for the preferred values, then click “Add term” on the right-hand side of the screen. Select “Alias” from the dropdown, enter in “English” in the language field, and drop your alternative names column into the space. For this example, you should also consider adding those alternative names to the authors' entries using the property [pseudonym (P742)](https://www.wikidata.org/wiki/Property:P742). The "description" and "label" terms can only contain one value, so there is an option to override existing values if needed. Aliases can be potentially infinite.
![The schema window showing a term being edited.](/img/wikidata-terms2.png)
Terms must always have an associated language. You can select the term's language by typing in the “lang” field, which will auto-complete for you. You cannot edit multiple languages at once, unless you supply a suitable column instead. For example, suppose you had translated publication titles, with data in the following format:
|English title|Translated title|Translation language|
|---|---|---|
@ -115,19 +123,19 @@ Terms must always have a language selected. You cannot edit multiple languages a
|Wolf Hall|En la corte del lobo|Spanish|
||ウルフ・ホール|Japanese|
You could upload translated titles to “Label” with the language from “Translation language.” You may wish to fetch the two-letter language code and use that instead for better language matches.
You could upload the “Translated titles” to “Label” with the language specified by “Translation language.” You may wish to fetch the two-letter language code and use that instead for better language matches.
![Constructing a schema with aliases and languages.](/img/wikidata-translated.png)
### Manage Wikidata account
To edit Wikidata directly from OpenRefine, you must have a Wikidata account and log into it in OpenRefine. OpenRefine can only upload edits with Wikidata user accounts that are “[autoconfirmed](https://www.wikidata.org/wiki/Wikidata:Autoconfirmed_users)” - at this time, that means accounts that have more than 50 edits and have existed for longer than four days.
To edit Wikidata directly from OpenRefine, you must log in with a Wikidata account. OpenRefine can only upload edits with Wikidata user accounts that are “[autoconfirmed](https://www.wikidata.org/wiki/Wikidata:Autoconfirmed_users)” - at this time, that means accounts that have more than 50 edits and have existed for longer than four days.
Use the Extensions menu to select <span class="menuItems">Manage Wikidata account</span> and you will be presented with the following window:
Use the <span class="menuItems">Extensions</span> menu to select <span class="menuItems">Manage Wikidata account</span> and you will be presented with the following window:
![The Wikidata authorization window in OpenRefine.](/img/wikidata-login.png)
For security reasons, it is suggested that you not use your main account authorization with OpenRefine. Wikidata allows you to set special passwords to access your account through software. You can find this setting for your account at [https://www.wikidata.org/wiki/Special:BotPasswords](https://www.wikidata.org/wiki/Special:BotPasswords) once logged in. Creating bot access will prompt you for a unique name, and allow you to enable the following required settings:
For security reasons, you should not use your main account authorization with OpenRefine. Wikidata allows you to set special passwords to access your account through software. You can find [this setting for your account here](https://www.wikidata.org/wiki/Special:BotPasswords) once logged in. Creating bot access will prompt you for a unique name. You should then enable the following required settings:
* High-volume editing
* Edit existing pages
* Create, edit, and move pages
@ -136,13 +144,13 @@ It will then generate a username (in the form of “yourwikidatausername@yourbot
If your account or your bot is not properly authorized, OpenRefine will not display a warning or error when you try to upload your edits.
You may also wish to store your unencrypted username and password in OpenRefine, saved locally to your computer. For security reasons, you may wish to leave this box unchecked. You can save your OpenRefine-specific bot password in your browser or with a password management tool.
You can store your unencrypted username and password in OpenRefine, saved locally to your computer and available for future use. For security reasons, you may wish to leave this box unchecked. You can also save your OpenRefine-specific bot password in your browser or with a password management tool.
### Import and export schema
You can save time on repetitive processes by defining a schema on one project, then exporting it and importing for use on new datasets in the future. Or you and your colleagues can share a schema with each other to coordinate your work.
You can export a schema from a project using <span class="menuItems">Export</span><span class="menuItems">Wikidata schema</span>, or by using <span class="menuItems">Extensions</span><span class="menuItems">Export schema</span>. OpenRefine will generate a JSON file for you to save and share. You may experience issues with pop-up windows in your browser: consider allowing pop-ups for the OpenRefine URL (`127.0.0.1`) from now on.
You can export a schema from a project using <span class="menuItems">Export</span><span class="menuItems">Wikidata schema</span>, or by using <span class="menuItems">Extensions</span><span class="menuItems">Export schema</span>. OpenRefine will generate a JSON file for you to save and share. You may experience issues with pop-up windows in your browser: consider allowing pop-ups from the OpenRefine URL (`127.0.0.1`) from now on.
You can import a schema using <span class="menuItems">Extensions</span><span class="menuItems">Import schema</span>. You can upload a JSON file, or paste JSON statements directly into a field in the window. An imported schema will look for columns with the same names, and you will see an error message if your project doesn't contain matching columns.
@ -150,26 +158,25 @@ You can import a schema using <span class="menuItems">Extensions</span> → <spa
The best resource is the [Uploading page](https://www.wikidata.org/wiki/Wikidata:Tools/OpenRefine/Editing/Uploading) on Wikidata.
There are two menu options in OpenRefine for applying your edits to Wikidata. Under <span class="menuItems">Export</span> you will see <span class="menuItems">Wikidata edits...</span> and under <span class="menuItems">Extensions</span> you will see <span class="menuItems">Upload edits to Wikidata</span>. Both will bring up the same window for you to [log in with your Wikidata account](#manage-wikidata-account).
There are two menu options in OpenRefine for applying your edits to Wikidata. Under <span class="menuItems">Export</span> you will see <span class="menuItems">Wikidata edits...</span> and under <span class="menuItems">Extensions</span> you will see <span class="menuItems">Upload edits to Wikidata</span>. Both will bring up the same window for you to [log in with a Wikidata account](#manage-wikidata-account).
Once you are authorized, you will see a window with any outstanding issues. You can ignore these issues, but we recommend you resolve them.
If you are ready to upload your edits, you can provide an “Edit summary” - a short message describing the batch of edits you are making. It can be helpful to leave notes for yourself, such as “batch 1: authors A-G” or other indicators of your workflow progress. OpenRefine will show the progress of the upload as it is happening, but does not show a confirmaton window.
If you have made edits successfully, you will see them on [your Wikidata user contributions page](https://www.wikidata.org/wiki/Special:Contributions/), and on the [Edit groups page](https://editgroups.toolforge.org/).
All edits can be undone from this interface.
If your edits have been successful, you will see them listed on [your Wikidata user contributions page](https://www.wikidata.org/wiki/Special:Contributions/), and on the [Edit groups page](https://editgroups.toolforge.org/). All edits can be undone from this second interface.
### QuickStatements export
Your OpenRefine data can be exported in a format recognized by [QuickStatements](https://www.wikidata.org/wiki/Help:QuickStatements), a tool that creates Wikidata edits using text commands. OpenRefine generates “version 1” QuickStatements commands. In order to use QuickStatements, you must authorize it with a Wikidata account that is [autoconfirmed](https://www.wikidata.org/wiki/Wikidata:Autoconfirmed_users) (it may appear as “MediaWiki” when you authorize).
Any dataset can be converted into QuickStatements text commands. You can follow the steps listed on [this page](https://www.wikidata.org/wiki/Help:QuickStatements#Running_QuickStatements).
Under the <span class="menuItems">Export</span> menu, look for <span class="menuItems">QuickStatements file</span>; under <span class="menuItems">Extensions</span> look for <span class="menuItems">Export to QuickStatements</span>. Exporting your schema from OpenRefine will generated a text file called `statements.txt` by default. Paste the contents of the text file into a new QuickStatements batch using version 1. You can find version 1 of the tool (no longer maintained) [here](https://wikidata-todo.toolforge.org/quick_statements.php). The text commands will be processed into Wikidata edits and previewed for you to review before submitting.
Your OpenRefine data can be exported in a format recognized by [QuickStatements](https://www.wikidata.org/wiki/Help:QuickStatements), a tool that creates Wikidata edits using text commands. OpenRefine generates “version 1” QuickStatements commands.
There are advantages to using QuickStatements rather than uploading your edits directly to Wikidata, including the way QuickStatements resolves duplicates and redundancies. You can learn more on QuickStatements' [Help page](https://www.wikidata.org/wiki/Help:QuickStatements), and on OpenRefine's [Uploading page](https://www.wikidata.org/wiki/Wikidata:Tools/OpenRefine/Editing/Uploading).
In order to use QuickStatements, you must authorize it with a Wikidata account that is [autoconfirmed](https://www.wikidata.org/wiki/Wikidata:Autoconfirmed_users) (it may appear as “MediaWiki” when you authorize).
Follow the [steps listed on this page](https://www.wikidata.org/wiki/Help:QuickStatements#Running_QuickStatements).
To prepare your OpenRefine data into QuickStatements, select <span class="menuItems">Export</span><span class="menuItems">QuickStatements file</span>, or <span class="menuItems">Extensions</span><span class="menuItems">Export to QuickStatements</span>. Exporting your schema from OpenRefine will generate a text file called `statements.txt` by default. Paste the contents of the text file into a new QuickStatements batch using version 1. You can find [version 1 of the tool (no longer maintained) here](https://wikidata-todo.toolforge.org/quick_statements.php). The text commands will be processed into Wikidata edits and previewed for you to review before submitting.
### Schema alignment
The best resource is the [Schema alignment page](https://www.wikidata.org/wiki/Wikidata:Tools/OpenRefine/Editing/Schema_alignment) on Wikidata.
@ -180,10 +187,4 @@ The best resource is the [Quality assurance page](https://www.wikidata.org/wiki/
OpenRefine will analyze your schema and make suggestions. It does not check for conflicts in your proposed edits, or tell you about redundancies.
One of the most common suggestions is to attach [a reference to your edits](https://www.wikidata.org/wiki/Help:Sources) - a citation for where the information can be found. This can be a book or newspaper citation, a URL to an online page, a reference to a physical source in an archival or special collection, or another source. If the source is itself an item on Wikidata, use the relationship [stated in (P248)](https://www.wikidata.org/wiki/Property:P248); otherwise, use [reference URL (P854)](https://www.wikidata.org/wiki/Property:P854) to identify an external source.
## Wikibases
Much of the above is also true of other Wikibase instances. You can reconcile your dataset against an available Wikibase reconciliation API.
Wikibase administrators can configure a reconciliation API using the [instructions here](https://openrefine-wikibase.readthedocs.io/en/latest/index.html).
One of the most common suggestions is to attach [a reference to your edits](https://www.wikidata.org/wiki/Help:Sources) - a citation for where the information can be found. This can be a book or newspaper citation, a URL to an online page, a reference to a physical source in an archival or special collection, or another source. If the source is itself an item on Wikidata, use the relationship [stated in (P248)](https://www.wikidata.org/wiki/Property:P248); otherwise, use [reference URL (P854)](https://www.wikidata.org/wiki/Property:P854) to identify an external source.

View File

@ -23,7 +23,6 @@ module.exports = {
items: ['manual/expressions', 'manual/grelfunctions'],
},
'manual/exporting',
'manual/glossary',
'manual/troubleshooting'
],
'Technical Reference': [

View File

@ -67,7 +67,7 @@
font-size: 35px;
}
.markdown .menuItems {
.markdown .menuItems, .markdown .fieldLabels, .markdown .tabLabels, .markdown .buttonLabels {
font-size: var(--ifm-code-font-size);
display: inline-block;
border: 2px solid #4dc6e1;

BIN
docs/static/img/export-menu.png vendored Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 13 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 23 KiB

BIN
docs/static/img/scatterplot.png vendored Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 5.8 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 30 KiB

After

Width:  |  Height:  |  Size: 23 KiB

BIN
docs/static/img/wikidata-terms.png vendored Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 24 KiB

BIN
docs/static/img/wikidata-terms2.png vendored Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 7.2 KiB

BIN
docs/static/img/yeardata.png vendored Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 21 KiB