diff --git a/docs/docs/manual/cellediting.md b/docs/docs/manual/cellediting.md
index f2aed7f19..fc5b5540d 100644
--- a/docs/docs/manual/cellediting.md
+++ b/docs/docs/manual/cellediting.md
@@ -9,15 +9,17 @@ OpenRefine offers a number of features to edit and improve the contents of cells
One way of doing this is editing through a [text facet](facets#text-facet). Once you have created a facet on a column, hover over the displayed results in the sidebar. Click on the small “edit” button that appears to the right of the facet, and type in a new value. This will apply to all the cells in the facet.
+You can apply a text facet on numbers, boolean values, and dates, but if you edit a value it will be converted into the text [data type](exploring#data-types) (regardless of whether you edit a date into another correctly-formatted date, or a “true” value into “false”, etc.).
+
## Transform
-Select “Edit cells” → “Transforms” to open up an expressions window. From here, you can apply [expressions](expressions) to your data. The simplest examples are GREL functions such as `toUppercase` or `toLowercase`, used in expressions as `toUppercase(value)` or `toLowercase(value)`. In all of these cases, `value` is the value in each cell in the selected column.
+Select “Edit cells” → “Transforms” to open up an expressions window. From here, you can apply [expressions](expressions) to your data. The simplest examples are GREL functions such as [`toUppercase()`](grelfunctions#touppercases) or [`toLowercase()`](grelfunctions#tolowercases), used in expressions as `toUppercase(value)` or `toLowercase(value)`. In these cases, `value` is the value in each cell in the selected column.
Use the preview to ensure your data is being transformed correctly.
-You can also switch to the “History” tab inside the expressions window to reuse expressions you’ve already attempted in this project, whether they have been undone or not.
+You can also switch to the Undo / Redo tab inside the expressions window to reuse expressions you’ve already attempted in this project, whether they have been undone or not.
-OpenRefine offers you some frequently-used transformations in the next menu option, “Common transforms.” For more custom transforms, read up on [expressions](expressions).
+OpenRefine offers you some frequently-used transformations in the next menu option, “Common transforms.” For more custom transforms, read up on [expressions](expressions).
## Common transforms
@@ -27,11 +29,11 @@ Often cell contents that should be identical, and look identical, are different
### Collapse consecutive whitespace
-You may also find that some text cells contain what look like spaces but are actually tabs, or contain multiple spaces in a row. This function will remove all space characters that sit in sequence and replace them with a single space.
+You may find that some text cells contain what look like spaces but are actually tabs, or contain multiple spaces in a row. This function will remove all space characters that sit in sequence and replace them with a single space.
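As a rough illustration of what collapsing whitespace means, here is a minimal Python sketch. It is not OpenRefine's own code; the function name `collapse_whitespace` is hypothetical, and treating every whitespace character (tabs included) as collapsible is an assumption based on the description above.

```python
import re


def collapse_whitespace(value: str) -> str:
    """Replace each run of whitespace (spaces, tabs) with a single space."""
    # \s+ matches one or more consecutive whitespace characters.
    return re.sub(r"\s+", " ", value)


print(collapse_whitespace("New\tYork   City"))
```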
### Unescape HTML
-Your data may come from an HTML-formatted source that expresses some characters through references (such as “`&nbsp;`” for a space, or “%u0107” for a ć) instead of the actual Unicode characters. You can use the “unescape HTML entities” transform to look for these codes and replace them with the characters they represent.
+Your data may come from an HTML-formatted source that expresses some characters through references (such as “`&nbsp;`” for a space, or “%u0107” for a ć) instead of the actual Unicode characters. You can use the “unescape HTML entities” transform to look for these codes and replace them with the characters they represent. For other formatting that needs to be escaped, try a custom transformation with [`escape()`](grelfunctions#escapes-s-mode).
### Replace smart quotes with ASCII
@@ -39,28 +41,17 @@ Smart quotes (or curly quotes) recognize whether they come at the beginning or e
### Case transforms
-You can transform an entire column of text into UPPERCASE, lowercase, or Title Case using these three options. This can be useful if you are planning to do textual analysis and wish to avoid case-sensitivity (which many functions are) causing problems in your analysis.
+You can transform an entire column of text into UPPERCASE, lowercase, or Title Case using these three options. This can be useful if you are planning to do textual analysis and wish to avoid case-sensitivity (which some functions are) causing problems in your analysis. Consider also using a [custom facet](facets#custom-text-facet) to temporarily modify cases instead of this permanent operation if appropriate.
### Data-type transforms
As detailed in [Data types](exploring#data-types), OpenRefine recognizes different data types: string, number, boolean, and date. When you use these transforms, OpenRefine will check to see if the given values can be converted, then both transform the data in the cells (such as “3” as a text string to “3” as a number) and convert the data type on each successfully transformed cell. Cells that cannot be transformed will output the original value and maintain their original data type.
-For example, the following column of strings on the left will transform into the values on the right:
+:::caution
+Be aware that dates may require manual intervention to transform successfully: see the section on [Dates](exploring#dates) for more information.
+:::
-|Input|>|Output|
-|---|---|---|
-|23/12/2019|>|2019-12-23T00:00:00Z|
-|14-10-2015|>|2015-10-14T00:00:00Z|
-|2012 02 16|>|2012-02-16T00:00:00Z|
-|August 2nd 1964|>|1964-08-02T00:00:00Z|
-|today|>|today|
-|never|>|never|
-
-This is based on OpenRefine’s ability to recognize dates with the [`toDate()` function](expressions#date-functions).
-
-Clicking the “today” cell and editing its data type manually will convert “today” into a value such as “2020-08-14T00:00:00Z”. Attempting the same data-type change on “never” will give you an error message and refuse to proceed.
-
-Because these common transforms do not offer the ability to output an error instead of the original cell contents, be careful to look for unconverted and untransformed values. You will see a yellow alert at the top of screen that will tell you how many cells were converted - if this number does not match your current row set, you will need to look for and manually correct the remaining cells.
+Because these common transforms do not offer the ability to output an error instead of the original cell contents, be careful to look for unconverted and untransformed values. You will see a yellow alert at the top of the screen that will tell you how many cells were converted - if this number does not match your current row set, you will need to look for and manually correct the remaining cells. Also consider faceting by data type, with the GREL function [`type()`](grelfunctions#typeo).
You can also convert cells into null values or empty strings. This can be useful if you wish to, for example, erase duplicates that you have identified and are analyzing as a subset.
@@ -68,41 +59,45 @@ You can also convert cells into null values or empty strings. This can be useful
Fill down and blank down are two functions most frequently used when encountering data organized into [records](exploring#row-types-rows-vs-records) - that is, multiple rows associated with one specific entity.
-If you receive information in rows mode and want to convert it to records mode, the easiest way is to sort your first column by the value that you want to use as a unique records key, [make that sorting permanent](transforming#edit-rows), then blank down all the duplicates in that column. OpenRefine will retain the first unique value and erase the rest. Then you can switch from “Show as rows” to “Show as records” and OpenRefine will convert the data based on the remaining values in the first column. Be careful that your data is sorted properly before you begin blanking down - not just the first column but other columns you may want to have in a certain order. For example, you may have multiple identical entries in the first column, one with a value in the second column and one with an empty cell in the second column. In this case you want the value to come first, so that you can clean up empty rows later, once you blank down.
+If you receive information in rows mode and want to convert it to records mode, the easiest way is to sort your first column by the value that you want to use as a unique records key, [make that sorting permanent](transforming#edit-rows), then blank down all the duplicates in that column. OpenRefine will retain the first unique value and erase the rest. Then you can switch from “Show as rows” to “Show as records” and OpenRefine will associate rows to each other based on the remaining values in the first column.
+
+Be careful that your data is sorted properly before you begin blanking down - not just the first column but other columns you may want to have in a certain order. For example, you may have multiple identical entries in the first column, one with a value in the second column and one with an empty cell in the second column. In this case you want the row with the second-column value to come first, so that you can clean up empty rows later, once you blank down.
If, conversely, you’ve received data with empty cells because it was already in something akin to records mode, you can fill down information to the rest of the rows. This will duplicate whatever value exists in the topmost cell with a value: if the first row in the record is blank, it will take information from the next cell, or the cell after that, until it finds a value. The blank cells above this will remain blank.
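The two operations can be sketched on a single column represented as a Python list, with `None` standing in for a blank cell. This is only an illustration of the behavior described above, not OpenRefine's implementation; the names `blank_down` and `fill_down` are hypothetical.

```python
from typing import List, Optional


def blank_down(column: List[Optional[str]]) -> List[Optional[str]]:
    """Keep the first cell of each run of identical values and blank the rest."""
    result: List[Optional[str]] = []
    previous: Optional[str] = None
    for cell in column:
        if cell is not None and cell == previous:
            result.append(None)  # duplicate of the cell above: erase it
        else:
            result.append(cell)
        previous = cell
    return result


def fill_down(column: List[Optional[str]]) -> List[Optional[str]]:
    """Copy the most recent non-blank value into the blank cells below it."""
    result: List[Optional[str]] = []
    last_value: Optional[str] = None
    for cell in column:
        if cell is None:
            result.append(last_value)  # stays blank if nothing sits above
        else:
            last_value = cell
            result.append(cell)
    return result


print(blank_down(["A", "A", "A", "B", "B"]))
print(fill_down([None, "A", None, None, "B", None]))
```

Note that in the second call the leading blank cell stays blank, because there is no value above it to copy.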
## Split multi-valued cells
-Splitting cells with more than one value in them is a common way to get your data from single rows into multi-row records. Survey data, for example, frequently allows respondents to “Select all that apply,” or an inventory list might have items filed under more than one category.
+Splitting cells with more than one value in them is a common way to get your data from single rows into [multi-row records](exploring#rows-vs-records). Survey data, for example, frequently allows respondents to “Select all that apply,” or an inventory list might have items filed under more than one category.
You can split a column based on any character or series of characters you input, such as a semi-colon (;) or a slash (/). The default is a comma. Splitting based on a separator will remove the separator characters, so you may wish to include a space with your separator (; ) if it exists in your data.
-You can use [expressions](expressions) to design the point at which a cell should split itself into two or more rows. This can be used to identify special characters or create more advanced evaluations. You can split on a line-break by entering `\n` and checking the “regular expression” checkbox.
+You can use [expressions](expressions) to design the point at which a cell should split itself into two or more rows. This can be used to identify special characters or create more advanced evaluations. You can split on a line-break by entering `\n` and checking the “[regular expression](expressions#regular-expressions)” checkbox.
-This can be useful if the split is not straightforward: say, if a capital letter indicates the beginning of a new string, or if you need to _not_ always split on a character that appears in both the strings and as a separator. Remember that this will remove all the matching characters.
+Regular expressions can be useful if the split is not straightforward: say, if a capital letter (`[A-Z]`) indicates the beginning of a new string, or if you need to _not_ always split on a character that appears in both the strings and as a separator. Remember that this will remove all the matching characters.
You can also split based on the lengths of the strings you expect to find. This can be useful if you have predictable data in the cells: for example, a 10-digit phone number, followed by a space, followed by another 10-digit phone number. Any characters past the explicit length you’ve specified will be discarded: if you split by “11, 10” any characters that may come after the 21st character will disappear. If some cells only have one phone number, you will end up with blank rows.
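To make the “11, 10” example concrete, here is a small Python sketch of splitting by field lengths. It only mirrors the behavior described above (consecutive slices, with anything past the last length discarded); the function name `split_by_lengths` and the sample phone numbers are made up for illustration.

```python
from typing import List


def split_by_lengths(value: str, lengths: List[int]) -> List[str]:
    """Cut value into consecutive pieces of the given lengths."""
    parts: List[str] = []
    start = 0
    for length in lengths:
        parts.append(value[start:start + length])
        start += length
    # Characters beyond the final length are dropped on purpose.
    return parts


print(split_by_lengths("5551234567 5559876543 ext. 12", [11, 10]))
print(split_by_lengths("5551234567", [11, 10]))
```

In the first call the text after the 21st character is lost; in the second, the cell holds only one phone number, so the second piece is an empty string (a blank row).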
-If you have data that should be split into multiple columns instead of multiple rows, see [split into several columns(columnediting#split-into-several-columns).
+If you have data that should be split into multiple columns instead of multiple rows, see [Split into several columns](columnediting#split-into-several-columns).
## Join multi-valued cells
-Joining will reverse the “split multi-valued cells” operation, or join up information from multiple rows into one row. All the strings will be compressed into the topmost cell in the record, in the order they appear. A window will appear where you can set the separator; the default is a comma and a space (, ). This separator is optional.
+Joining will reverse the “split multi-valued cells” operation, or join up information from multiple rows into one row. All the strings will be compressed into the topmost cell in the record, in the order they appear. A window will appear where you can set the separator; the default is a comma and a space (, ). This separator is optional. We suggest the separator | as a sufficiently rare character.
## Cluster and edit
Creating a facet on a column is a great way to look for inconsistencies in your data; clustering is a great way to fix those inconsistencies. Clustering uses a variety of comparison methods to find text entries that are similar but not exact, then shares those results with you so that you can merge the cells that should match. Where editing a single cell or text facet at a time can be time-consuming and difficult, clustering is quick and streamlined.
-Clustering always requires the user to approve each suggested edit - it will display values it thinks are variations on the same thing, and you can select which version to keep and apply across all those matching cells (or type in your own version). OpenRefine will do a number of cleanup operations behind the scenes, in memory, in order to do its analysis, but only the merges you approve will modify your data.
+Clustering always requires the user to approve each suggested edit - it will display values it thinks are variations on the same thing, and you can select which version to keep and apply across all the matching cells (or type in your own version).
-You can start the process in two ways: using the dropdown menu on your column, select “Edit cells” → “Cluster and edit…” or create a text facet and then press the “Cluster” button that appears in the facet box.
+OpenRefine will do a number of cleanup operations behind the scenes in order to do its analysis, but only the merges you approve will modify your data. Understanding those different behind-the-scenes cleanups can help you choose which clustering method will be more accurate and effective.
+
+You can start the process in two ways: using the dropdown menu on your column, select “Edit cells” → “Cluster and edit…”; or create a text facet and then press the “Cluster” button that appears in the facet box.
![A screenshot of the Clustering window.](/img/cluster.png)
The clustering pop-up window will take a small amount of time to analyze your column, and then make some suggestions based on the clustering method currently active.
-For each cluster identified, you can pick one of the existing values to apply to all cells, or manually type in a new value in the text box. And, of course, you can choose not to cluster them at all. OpenRefine will keep analyzing every time you make a change, with “Merge selected & re-cluster,” and you can work through all the methods this way.
+For each cluster identified, you can pick one of the existing values to apply to all cells, or manually type in a new value in the text box. And, of course, you can choose not to cluster them at all. OpenRefine will keep analyzing every time you make a change, with Merge selected & re-cluster, and you can work through all the methods this way.
You can also export the currently identified clusters as a JSON file, or close the window with or without applying your changes. You can also use the histograms on the right to narrow down to, for example, clusters with lots of matching rows, or clusters of long or short values.
@@ -127,9 +122,9 @@ The clustering pop-up window offers you a variety of clustering methods:
**Key collisions** are very fast and can process millions of cells in seconds:
-**Fingerprinting** is the least likely to produce false positives, so it’s a good place to start. It does the same kind of data-cleaning behind the scenes that you might think to do manually: fix whitespace into single spaces, put all uppercase letters into lowercase, discard punctuation, remove diacritics (e.g. accents) from characters, split all strings (words) and sort them alphabetically (so “Zhenyi, Wang” becomes “Wang Zhenyi”). This makes comparing those types of name values very easy.
+**Fingerprinting** is the least likely to produce false positives, so it’s a good place to start. It does the same kind of data-cleaning behind the scenes that you might think to do manually: fix whitespace into single spaces, put all uppercase letters into lowercase, discard punctuation, remove diacritics (e.g. accents) from characters, split all strings (words) and sort them alphabetically (so “Zhenyi, Wang” becomes “wang zhenyi”).
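The steps listed above can be sketched in Python as follows. This is an approximation for illustration, not OpenRefine's actual code: the function name `fingerprint` is hypothetical, and details such as stripping only ASCII punctuation are assumptions.

```python
import string
import unicodedata


def fingerprint(value: str) -> str:
    """Build a comparison key following the cleanup steps described above."""
    # Lowercase everything.
    text = value.lower()
    # Remove diacritics: decompose characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Discard punctuation (assumption: ASCII punctuation only).
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Split on whitespace (which also collapses runs of spaces) and sort.
    return " ".join(sorted(text.split()))


print(fingerprint("Zhenyi, Wang"))
print(fingerprint("  Wang   Zhenyi "))
```

Cells whose keys are identical would be suggested as one cluster; here both inputs produce the same key.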
-**N-gram fingerprinting** allows you to set the _n_ value to whatever number you’d like, and will create n-grams of _n_ size (after doing some cleaning), alphabetize them, then join them back together into a _fingerprint_. For example, a 1-gram fingerprint will simply organize all the letters in the cell into alphabetical order - by creating segments one character in length. A 2-gram fingerprint will find all the two-character segments, remove duplicates, alphabetize them, and join them back together (for example, “banana” generates “ba an na an na,” which becomes “anbana”). This can help match cells that have typos, or incorrect spaces (such as matching “lookout” and “look out,” which fingerprinting itself won’t identify). The higher the _n_ value, the fewer clusters will be identified. With 1-grams, keep an eye out for mismatched values that are near-anagrams of each other (such as “Wellington” and “Elgin Town”).
+**N-gram fingerprinting** allows you to set the _n_ value to whatever number you’d like, and will create n-grams of _n_ size (after doing some cleaning), alphabetize them, then join them back together into a _fingerprint_. For example, a 1-gram fingerprint will simply organize all the letters in the cell into alphabetical order - by creating segments one character in length. A 2-gram fingerprint will find all the two-character segments, remove duplicates, alphabetize them, and join them back together (for example, “banana” generates “ba an na an na,” which becomes “anbana”). This can help match cells that have typos, or incorrect spaces (such as matching “lookout” and “look out,” which fingerprinting itself won’t identify because it keeps words separated). The higher the _n_ value, the fewer clusters will be identified. With 1-grams, keep an eye out for mismatched values that are near-anagrams of each other (such as “Wellington” and “Elgin Town”).
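A minimal Python sketch of the n-gram approach follows. The “some cleaning” step is an assumption here (lowercasing and keeping only letters and digits), and the function name `ngram_fingerprint` is hypothetical; OpenRefine's real implementation may clean values differently.

```python
def ngram_fingerprint(value: str, n: int = 2) -> str:
    """Join the sorted, de-duplicated n-grams of a cleaned-up value."""
    # Assumed cleaning: lowercase, then drop whitespace and punctuation.
    text = "".join(ch for ch in value.lower() if ch.isalnum())
    # Collect every n-character segment; the set removes duplicates.
    grams = {text[i:i + n] for i in range(len(text) - n + 1)}
    # Alphabetize the segments and join them back together.
    return "".join(sorted(grams))


print(ngram_fingerprint("banana", 2))
print(ngram_fingerprint("lookout", 2) == ngram_fingerprint("look out", 2))
print(ngram_fingerprint("Wellington", 1) == ngram_fingerprint("Elgin Town", 1))
```

The last line shows the near-anagram caveat: under these assumptions, the 1-gram keys for “Wellington” and “Elgin Town” collide even though the values are unrelated.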
##### Phonetic clustering
diff --git a/docs/docs/manual/exploring.md b/docs/docs/manual/exploring.md
index 9b0f63b36..b91d91217 100644
--- a/docs/docs/manual/exploring.md
+++ b/docs/docs/manual/exploring.md
@@ -1,4 +1,4 @@
----
+---
id: exploring
title: Exploring data
sidebar_label: Overview
@@ -6,7 +6,7 @@ sidebar_label: Overview
## Overview
-OpenRefine is a powerful tool for learning about your dataset, even if you don’t change a single character. In this section we cover different ways for sorting through, filtering, and viewing your data.
+OpenRefine offers lots of features to help you learn about your dataset, even if you don’t change a single character. In this section we cover different ways for sorting through, filtering, and viewing your data.
Unlike spreadsheets, OpenRefine doesn’t store formulas and display the output of those calculations; it only shows the value inside each cell. It doesn’t support cell colors or text formatting.
@@ -14,21 +14,19 @@ Unlike spreadsheets, OpenRefine doesn’t store formulas and display the output
Each piece of information (each cell) in OpenRefine is assigned a data type. Some file formats, when imported, can set data types that are recognized by OpenRefine. Cells without an associated data type on import will be considered a “string” at first, but you can have OpenRefine convert cell contents into other data types later. This is set at the cell level, not at the column level.
-You can see data types in action when you preview a new project: check the box that says “Attempt to parse cell text into numbers” and cells will be converted to the “number” data type based on their contents. You’ll see numbers change from black text to green if they are recognized.
+You can see data types in action when you preview a new project: check the box next to Attempt to parse cell text into numbers, and cells will be converted to the “number” data type based on their contents. You’ll see numbers change from black text to green if they are recognized.
The data type will determine what you can do with the value. For example, if you want to add two values together, they must both be recognized as the number type.
You can check data types at any time by:
* clicking “edit” on a single cell (where you can also edit the type)
-* creating a Custom Text Facet on a column, and inserting `type(value)` into the “Expression” field. This will generate the data type in the preview, and you can facet by data type if you press “OK.”
+* creating a Custom Text Facet on a column, and inserting `type(value)` into the Expression field. This will generate the data type in the preview, and you can facet by data type if you press OK.
The data types supported are:
* string (one or more text characters)
* number (one or more characters of numbers only)
* boolean (values of “true” or “false”)
-* date (ISO-8601-compliant extended format with time in UTC: YYYY-MM-DDTHH:MM:SSZ)
-
-A “date” type is created when a text column is [transformed into dates](transforming#to-date), or when individual cells are set to have the data type “date.”
+* [date](#dates) (ISO-8601-compliant extended format with time in UTC: YYYY-MM-DDTHH:MM:SSZ)
OpenRefine recognizes two further data types as a result of its own processes:
* error
@@ -36,33 +34,59 @@ OpenRefine recognizes two further data types as a result of its own processes:
An “error” data type is created when the cell is storing an error generated during a transformation in OpenRefine.
-A “null” data type is a special value which basically means “this cell has no value.” It’s used to differentiate between cells that have values such as “0” or “false” - or a cell that looks empty but has, for example, spaces in it. When you use `type(value)`, it will show you that the cell’s value is “null” and its type is “undefined.” You can opt to [show “null” values](#view) to differentiate them from empty strings, by going to “All” → “View” → “Show/Hide ‘null’ values in cells.”
+A “null” data type is a special type that means “this cell has no value.” It’s distinct from cells that have values such as “0” or “false”, or cells that look empty but have whitespace in them, or cells that contain empty strings. When you use `type(value)`, it will show you that the cell’s value is “null” and its type is “undefined.” You can opt to [show “null” values](sortview#showhide-null) by going to “All” → “View” → “Show/Hide ‘null’ values in cells.”
-Converting a cell's data type is not the same operation as transforming its contents. For example, using a column-wide transform such as “Transform” → “Common transforms …” → “to date” may not convert all values successfully, but going to an individual cell, clicking “edit” and changing the data type can successfully convert text to a date. These operations use different underlying code.
+Changing a cell's data type is not the same operation as transforming its contents. For example, using a column-wide transform such as “Transform” → “Common transforms …” → “to date” may not convert all values successfully, but going to an individual cell, clicking “edit”, and changing the data type can successfully convert text to a date. These operations use different underlying code. Learn more about date formatting and transformations in the next section.
+
+To transform data from one type to another, see [Transforming data](cellediting#data-type-transforms) for information on using common transforms, and see [Expressions](expressions) for information on using [toString()](grelfunctions#tostringo-string-format-optional), [toDate()](grelfunctions#todateo-b-monthfirst-s-format1-s-format2-), and other functions.
-To transform data from one type to another, see [Transforming data](transforming#transform) for information on using common tranforms, and see [Expressions](expressions) for information on using `toString()`, `toDate()`, and other functions.
### Dates
-Date-formatted data in OpenRefine relies on a number of conversion tools and standards. When you convert a cell into a "date" data type, what you'll be doing is trying to transform the original contents in an ISO-8601-compliant extended format with time in UTC: YYYY-MM-DDTHH:MM:SSZ.
+A “date” type is created when a column is [transformed into dates](transforming#to-date), when an expression is used to [convert cells to dates](grelfunctions#todateo-b-monthfirst-s-format1-s-format2-), or when individual cells are set to have the data type “date.”
-You can convert dates when you [export your data using the custom tabular exporter](exporting#custom-tabular-exporter). You are given the option to keep your dates in ISO 8601 format, or to output short, medium, long, or full locale formats. This means that you can format your dates into, for example, MM/DD/YY (the US short standard) with or without including the time, after manipulating your data formatted into ISO 8601.
+Date-formatted data in OpenRefine relies on a number of conversion tools and standards. For something to be considered a date in OpenRefine, it will be converted into the ISO-8601-compliant extended format with time in UTC: YYYY-MM-DDTHH:MM:SSZ.
+
+When you run the “to date” common transform, the following column of strings on the left will transform into the values on the right:
+
+|Input|→|Output|
+|---|---|---|
+|23/12/2019|→|2019-12-23T00:00:00Z|
+|14-10-2015|→|2015-10-14T00:00:00Z|
+|2012 02 16|→|2012-02-16T00:00:00Z|
+|August 2nd 1964|→|1964-08-02T00:00:00Z|
+|today|→|today|
+|never|→|never|
+
+OpenRefine uses a variety of tools to recognize, convert, and format [dates](exploring#dates) and so some of the values above can be reformatted using other methods. In this case, clicking the “today” cell and editing its data type manually will convert “today” into a value such as “2020-08-14T00:00:00Z”. Attempting the same data-type change on “never” will give you an error message and refuse to proceed.
+
+You can do more precise conversion and formatting using expressions and arguments based on the state of your data: see the GREL functions reference section on [Date functions](grelfunctions#date-functions) for more help.
+
+You can convert dates into a more human-readable format when you [export your data using the custom tabular exporter](exporting#custom-tabular-exporter). You are given the option to keep your dates in the ISO 8601 format, to output short, medium, long, or full locale formats, or to specify a custom format. This means that you can format your dates into, for example, MM/DD/YY (the US short standard) with or without including the time, after working with ISO-8601-formatted dates in your project.
+
+The following table shows some example [date and time formatting styles for the U.S. and French locales](https://docs.oracle.com/javase/tutorial/i18n/format/dateFormat.html):
-The following table shows the [date and time formatting styles for the U.S. and French locales](https://docs.oracle.com/javase/tutorial/i18n/format/dateFormat.html):
|Style |U.S. Locale |French Locale|
-|DEFAULT |Jun 30, 2009 7:03:47 AM |30 juin 2009 07:03:47|
-|SHORT |6/30/09 7:03 AM |30/06/09 07:03|
-|MEDIUM |Jun 30, 2009 7:03:47 AM |30 juin 2009 07:03:47|
-|LONG |June 30, 2009 7:03:47 AM PDT |30 juin 2009 07:03:47 PDT|
-|FULL |Tuesday, June 30, 2009 7:03:47 AM PDT |mardi 30 juin 2009 07 h 03 PDT|
+|---|---|---|
+|Default |Jun 30, 2009 7:03:47 AM |30 juin 2009 07:03:47|
+|Short |6/30/09 7:03 AM |30/06/09 07:03|
+|Medium |Jun 30, 2009 7:03:47 AM |30 juin 2009 07:03:47|
+|Long |June 30, 2009 7:03:47 AM PDT |30 juin 2009 07:03:47 PDT|
+|Full |Tuesday, June 30, 2009 7:03:47 AM PDT |mardi 30 juin 2009 07 h 03 PDT|
## Rows vs. records
-A row is a simple way to organize data: a series of cells, one cell per column. Sometimes there are multiple pieces of information in one cell, such as when a survey respondent can select more than one response. In cases where there is more than one value for a single column in one or more rows, you may wish to use OpenRefine’s records mode: this defines a single record (a survey response, for example) as potentially containing more than one row. From there you can transform cells into multiple rows, each cell containing one value you’d like to work with.
+A row is a simple way to organize data: a series of cells, one cell per column. Sometimes there are multiple pieces of information in one cell, such as when a survey respondent can select more than one response.
-Generally, when you import some data, OpenRefine reads that data in row mode. From there you can convert the project into records mode. OpenRefine remembers this action and will present you with records mode each time you open the project from then on.
+In cases where there is more than one value for a single column in one or more rows, you may wish to use OpenRefine’s records mode: this defines a single record as potentially containing more than one row. From there you can transform cells into multiple rows, each cell containing one value you’d like to work with.
-OpenRefine understands records based on the content of the first column, what we call the "key column." Splitting a row into a multi-row record will base all association on the first column in your dataset. If you have more than one column to split out into multiple rows, OpenRefine will keep your data associated with its original record: you can imagine this structure as a tree with many branches, all leading back to the same trunk.
+Generally, when you import some data, OpenRefine reads that data in row mode. From the project screen, you can convert the project into records mode. OpenRefine remembers this action and will present you with records mode each time you open the project from then on.
+
+OpenRefine understands records based on the content of the first column, what we call the “key column.” Splitting a row into a multi-row record will base all association on the first column in your dataset.
+
+If you have more than one column to split out into multiple rows, OpenRefine will keep your data associated with its original record, and associate subgroups based on the top-most row in each group.
+
+You can imagine the structure as a tree with many branches, all leading back to the same trunk.
For example, your key column may be a film or television show, with multiple cast members identified by name, associated to that work. You may have one or more roles listed for each person. The roles are linked to the actors, which are linked to the title.
@@ -83,9 +107,9 @@ For example, your key column may be a film or television show, with multiple cas
||Margaret Hamilton|Miss Almira Gulch|
|||The Wicked Witch of the West|
-Once you are in records mode, you can still move columns around, but if you move a column to the beginning, you may find your data becomes misaligned. The new key column will sort into records based on empty cells, and values in the old key column will be assigned to the last row in the old record (the key value sitting above those values).
+Once you are in records mode, you can still move some columns around, but if you move a column to the beginning, you may find your data becomes misaligned. The new key column will sort into records based on empty cells, and values in the old key column will be assigned to the last row in the old record (the key value sitting above those values).
-OpenRefine assigns a unique key behind the scenes, so your records don’t need a unique identifier in the key column (but you will likely have one, to ensure data stays properly sorted). You can keep track of which rows are assigned to which record by the record number that appears under the “All” column.
+OpenRefine assigns a unique key behind the scenes, so your records don’t need a unique identifier in the key column. You can keep track of which rows are assigned to each record by the record number that appears under the All column.
To [split multi-valued cells](transforming#split-multi-valued-cells) and apply other operations that take advantage of records mode, see [Transforming data](transforming).
diff --git a/docs/docs/manual/facets.md b/docs/docs/manual/facets.md
index 1eb37ab31..20e99e95a 100644
--- a/docs/docs/manual/facets.md
+++ b/docs/docs/manual/facets.md
@@ -1,4 +1,4 @@
----
+---
id: facets
title: Exploring facets
sidebar_label: Facets
@@ -6,16 +6,18 @@ sidebar_label: Facets
## Overview
-Facets are one of OpenRefine’s strongest features - that’s where the diamond logo comes from! Faceting allows you to look for patterns and trends. Facets are essentially aspects or angles of data variance in a given column. For example, if you had survey data where respondents indicated one of five responses from “Strongly agree” to “Strongly disagree,” those five responses make up a text facet, showing how many people selected each option.
+Facets are one of OpenRefine’s strongest features - that’s where the diamond logo comes from!
+
+Faceting allows you to look for patterns and trends. Facets are essentially aspects or angles of data variance in a given column. For example, if you had survey data where respondents indicated one of five responses from “Strongly agree” to “Strongly disagree,” those five responses make up a text facet, showing how many people selected each option.
Faceted browsing gives you a big-picture look at your data (do they agree or disagree?) and also allows you to filter down to a specific subset to explore it more (what do people who disagree say in other responses?).
-Typically, you create a facet on a particular column. That facet selection appears on the left, in the Facet/Filter tab, and you can click on a displayed facet to view all the records that match. You can also “exclude” the facet, to view every record that does _not_ match, and you can select more than one facet by clicking “include.”
+Typically, you create a facet on a particular column. That facet selection appears on the left, in the Facet/Filter tab, and you can click on a displayed facet to view all the records that match. You can also “exclude” the facet, to view every record that does _not_ match, and you can select more than one facet by clicking “include.”
### An example
-You can learn about facets and filtering with the following example.
+You can learn about facets and filtering with the following example. If you would like to try it yourself, you can copy the following table and paste it in when starting a new project.
We collected a list of the [10 most populous cities from Wikidata](https://w.wiki/3Em), using an example query of theirs. We removed the GPS coordinates and added the country.
@@ -32,9 +34,9 @@ We collected a list of the [10 most populous cities from Wikidata](https://w.wik
| Guangzhou | 13080500 | People's Republic of China |
| São Paulo | 12106920 | Brazil |
-If we want to see which countries have the most populous cities, we can create a “text facet” on the “countryLabel” column and OpenRefine will generate a list of all the different strings used in these cells.
+If we want to see which countries have the most populous cities, we can create a text facet on the “countryLabel” column and OpenRefine will generate a list of all the different strings used in these cells.
-We will see in the sidebar that the countries identified are displayed, along with the number of matches (the “count”). We can sort this list alphabetically or by the count. If you sort by count, you’ll learn which countries hold the most populous cities.
+We will see in the sidebar that the countries identified are displayed, along with the number of matches (the “count”). We can sort this list alphabetically or by the count. If you sort by count at the top of the facet window, you’ll learn which countries hold the most populous cities.
|Facet|Count|
|---|---|
@@ -48,25 +50,25 @@ We will see in the sidebar that the countries identified are displayed, along wi
If we want to learn more about a particular country, we can click on its appearance in the facet sidebar. This narrows our dataset down temporarily to only rows matching that facet.
-You’ll see the “10 rows” notification change to “4 matching rows (10 total)” if you click on “People’s Republic of China”. In the data grid, you’ll see the same number of rows, but only the ones matching your current filter. Each row will maintain its original numbering, though - in this case, rows #1, 2, and 8.
+You’ll see the “10 rows” indicator change to “4 matching rows (10 total)” if you click on “People’s Republic of China”. In the data grid, you’ll see fewer rows: only the ones matching your current filter. Each row will maintain its original numbering, though - in this case, rows #1, 2, 8, and 9.
-If you want to go back to the original dataset, click “reset” or “exclude.” If you want to view the most populous cities in both China and India, click “include” next to each facet. Now you’ll see 5 rows - #1, 2, 5, 8, 9.
+If you want to go back to the original dataset, click Reset All or the small “exclude” text next to the facet. If you want to view the most populous cities in both China and India, click “include” next to each facet. Now you’ll see 5 rows - #1, 2, 5, 8, 9.
We can also explore our data using the population information. In this case, because population is a number, we can create a numeric facet. This will give us the ability to explore by range rather than by exact matching values.
With the numeric facet, we are given a scale from the smallest to the largest value in the column. We can drag the range minimum and maximum to narrow the results. In this case, if we narrow down to only cities with more than 20 million in population, we get 3 matching rows out of the original 10.
-When you look at the facet display of countries, you should see a smaller list with a reduced count: OpenRefine is now displaying the facets of the 3 matching rows, not the total dataset of 10 rows.
+When you look back at the text facet display of country names, you should see a smaller list with a reduced count: OpenRefine is now displaying the facets of the 3 matching rows, not the total dataset of 10 rows.
We can combine these facets - say, by narrowing to only the Chinese cities with populations greater than 20 million - simply by clicking in both. You should see 2 matching rows for both these criteria.
### Things to know about facets
-When you have facets applied, you will see “matching rows” in the [project grid header](running#project-grid-header). If you press “Export” and copy your data out of OpenRefine while facets are active, you will only export the matching rows, not all the rows in your project.
+When you have facets applied, you will see “matching rows” in the [project grid header](running#project-grid-header). If you click Export and copy your data out of OpenRefine while facets are active, many of the exporting options will only export the matching rows, not all the rows in your project.
OpenRefine has several default facets, which you’ll learn about below. The most powerful facets are the ones designed by you - custom facets, written using [expressions](expressions) to transform the data behind the scenes and help you narrow down to precisely what you’re looking for.
-Facets are not saved in the project along with the data. But you can save a link to the current state of the application. Find the "permalink" next to the project’s name.
+Facets are not saved in the project along with the data, but you can save a link to the current state of the application. Find the Permalink next to the project’s name.
You can modify any facet expression by clicking the “change” button to the right of the column name in the facet sidebar.
@@ -74,15 +76,17 @@ Facet boxes that appear in the sidebar can be resized and rearranged. You can dr
## Text facet
-A text facet can be generated on any column with the “text” data type. Select the column dropdown and go to “Facet” → “Text facet”. The created facet will be sorted alphabetically, and can be sorted by count.
+A text facet can be generated on any column with the “text” data type. Select the column dropdown and go to Facet → Text facet. The created facet will be sorted alphabetically, and can be sorted by count.
A text facet is very simple: it takes the total contents of the cells of the column in question and matches them up. It does no guessing about typos or near-matches.
-You can edit any entry that appears in the facet display, by hovering over the facet and clicking the “edit” button that appears. You can then type in a new value manually. This will mass-edit every identical cell in your data. This is a great way to fix typos, whitespace, and other issues that may be affecting the way facets appear. You can also automate the cleanup of facets by using [clustering](transforming#cluster-and-edit), with the “Cluster” button displayed within the facet window. It may be most efficient to cluster cells to one value, and then mass-edit that value to your desired string within the clustering operation window.
+You can edit any entry that appears in the facet display, by hovering over the facet and clicking the “edit” button that appears. You can then type in a new value manually. This will mass-edit every identical cell in the column. This is a great way to fix typos, whitespace, and other issues that may be affecting the way facets appear. You can also automate the cleanup of facets by using [clustering](transforming#cluster-and-edit): a “Cluster” button is displayed within the facet window. It may be most efficient to cluster cells to one value, and then mass-edit that value to your desired string within the clustering operation window.
-Each text facet shows up to 2,000 choices by default. You can [increase this limit on the Preferences screen](running#preferences) if you need to, which will increase the processing work required by your browser. If your applied facet has more choices than the current limit, you'll be offered the option to increase the limit, which will edit that preference for you.
+Each text facet shows up to 2,000 choices by default. You can [increase this limit on the Preferences screen](running#preferences) if you need to, which may slow down your browser. If your applied facet has more choices than the current limit, you'll be offered the option to increase the limit, which will permanently edit that preference for you.
-The choices and counts displayed in each facet can be copied as tab-separated values. To do so, click on the "X choices" link near the top left corner of the facet.
+The choices and counts displayed in each facet can be copied as tab-separated values. To do so, click on the "X choices" link near the top left corner of the facet. This can be useful to generate small summary tables of your data.
+
+![A column of years faceted as text and numbers, and with the count ready to be copied.](/img/yeardata.png)
## Numeric facet
@@ -92,73 +96,95 @@ Whereas a text facet groups unique text values into groups, a numeric facet sort
You will be offered the option to include blank, non-numeric, and error values in your numeric visualization; these will appear in the visual range as “0” values.
+:::info
+You can create a text facet on numeric data, which will treat each entry as a string. This can be useful if you wish, for example, to manually include facets instead of selecting a range, or sort by count, or copy that count.
+:::
+
## Timeline facet
![A screenshot of an example timeline facet.](/img/timelinefacet.png)
-Much like a numeric facet, a timeline facet will display as a small histogram with the values sorted: in this case, chronologically. A timeline facet only works on dates formatted as “date” data types (e.g. by [using the `toDate()` function](expressions#dates) to transform text into dates, or by manually setting the [data type](#cell-data-types) on individual cells) and in the structure of the ISO-8601-compliant extended format with time in UTC: **YYYY**-**MM**-**DD**T**HH**:**MM**:**SS**Z.
+Much like a numeric facet, a timeline facet will display as a small histogram with the values sorted: in this case, chronologically. A timeline facet only works on cells formatted as the [“date” data type](exploring#dates).
+
+The facet appears with a count of blank cells and those with errors, which can help you analyze whether your date cells are correctly converted.
## Scatterplot facet
-A scatterplot facet can be generated on any number-formatted column. You require two or more number columns to generate scatterplots.
+A scatterplot is a visual representation of two related sets of numeric data.
You have the option to generate linear scatterplots (where the X and Y axes show continuous increases) or logarithmic scatterplots (where the X and Y axes show exponential or scaled increases). You can also rotate the plot by 45 degrees in either direction, and you can choose the size of the dot indicating a datapoint. You can make these choices in both the preview and in the facet display.
-Selecting “Facet” → “Scatterplot facet” will create a preview of data plotted from every number-formatted column in your dataset, comparing every column against every other column. Each scatterplot will show in its own square, allowing you to choose which data comparison you would like to analyze further.
+A scatterplot facet can be generated on any column. You require two or more number columns to generate scatterplots. Selecting Facet → Scatterplot facet will create a preview of data plotted from every number-formatted column in your dataset, comparing every column against every other column. Each scatterplot will show in its own square, allowing you to choose which data comparison you would like to analyze further. You can control which columns are on the X and Y axes by rearranging the columns in your dataset.
-When you click on your desired square, that two-column comparison will appear in the facets sidebar. From here, you can drag your mouse to draw a rectangle inside the scatterplot, which will narrow down to just the rows matching the points plotted inside that rectangle. This rectangle can be resized by dragging any of the four edges. To draw a new rectangle, simply click and drag your mouse again. To add more scatterplots to the facet sidebar, re-run this process and select a different square.
+![A simple scatterplot of two numeric values.](/img/scatterplot.png)
+
+When you click on your desired square, that two-column comparison will appear in the facets sidebar. From here, you can drag your mouse to draw a rectangle inside the scatterplot, which will narrow down to just the rows matching the points plotted inside that rectangle (as shown by the rectangle inside the square in the image above). This rectangle can be resized by dragging any of the four edges. To draw a new rectangle, simply click and drag your mouse again. To add more scatterplots to the facet sidebar, re-run this process and select a different square.
If you have multiple facets applied, plotted points in your scatterplot displays will be greyed out if they are not part of the current matching data subset. If the rectangle you have drawn within a scatterplot display only includes grey dots, you will see no matching rows.
-If you would like to export a scatterplot, OpenRefine will open a new tab with a generated PNG image that you can save.
+If you would like to export a scatterplot, OpenRefine will open a new tab with a generated PNG file that you can save.
## Custom text facet
-You may want to explore your textual data in a way that doesn’t involve modifying it but does require being more selective about what gets considered. Creating custom text facets will load your column into memory, transform the data, and store those transformations inside the facet.
+You may want to explore your textual data with modifications that aren't permanent. Creating custom text facets will load your column into memory, transform the data temporarily, and store those transformations inside the facet.
-You can also use text facets to analyze numerical data, such as by analyzing a number as a string, or by creating a test that will return “true” and false” as values.
+You can also use custom text facets to analyze numerical data, such as by analyzing a number as a string, or by creating a test that will return “true” and “false” as values.
-If you would like to build your own version of a text facet, you can use the “Custom Text Facet” option. Clicking on “Facets” → “Custom text facet…” will bring up an [expressions](expressions) window where you can enter in a GREL, Python or Jython, or Clojure expression to modify how the facet works.
+Clicking on Facet → Custom text facet… will bring up an [expressions](expressions) window where you can enter in a GREL, Jython, or Clojure expression to modify how the facet works.
-A custom text facet operates just like a [text facet](#text-facet) by default. Unlike a text facet, however, you cannot edit the facets that appear in the sidebar and change the matching cells in your dataset.
+A custom text facet operates just like a [text facet](#text-facet) by default. Unlike a text facet, however, you cannot click “edit” on the facets that appear in the sidebar and change the matching cells in your dataset - because what they display is modified, not the original entries.
For example, you may wish to analyze only the first word in a text field - perhaps the first name in a column of “[First Name] [Last Name]” entries. In this case, you can tell OpenRefine to facet only on the information that comes before the first space:
-```value.split(" ")[0]```
+```
+value.split(" ")[0]
+```
-In this case, `split()` is creating an array of text strings based on every space in the cells - in this case, one space, so two values in the array. Because arrays number their entries starting with 0, we want the first value, so we ask for `[0]`. We can do the same splitting and ask for the last name with
+In this case, `split()` is creating an array of text strings based on every space in the cells, such as `["Firstname", "Lastname"]`. Because arrays number their entries starting with 0, we want the first value, so we ask for `[0]`. (This assumes the first name is one word, not something like “Mary Anne”.) We can do the same splitting and ask for the last name with
-```value.split(" ")[1]```
+```
+value.split(" ")[1]
+```
-You may want to create a facet that references several columns. For example, let’s say you have two columns, "First Name" and "Last Name", and you want out how many people have the same initial letter for both names (e.g., Marilyn Monroe, Steven Segal). To do so, create a custom text facet on either column and enter the expression
+You may want to create a facet that references several columns. For example, let’s say you have two columns, “First Name” and “Last Name”, and you want to find out how many people have the same initial letter for both names (e.g., Marilyn Monroe, Steven Seagal). To do so, create a custom text facet on either column and enter the expression
-```cells["First Name"].value[0] == cells["Last Name"].value[0]```
+```
+cells["First Name"].value[0] == cells["Last Name"].value[0]
+```
-That expression will facet your rows into `true` and `false`.
+That expression will look for the first letter (the character at index 0) of each entry and compare them. Then it will facet your rows into `true` and `false`.
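
As a rough illustration outside OpenRefine, the custom text facet expressions above behave much like the following Python sketch. The helper names `first_word` and `same_initial` and the sample rows are made up for illustration; this mirrors the logic of the GREL expressions, not how OpenRefine evaluates them.

```python
def first_word(value):
    # Mirrors value.split(" ")[0]: everything before the first space.
    return value.split(" ")[0]


def same_initial(first_name, last_name):
    # Mirrors cells["First Name"].value[0] == cells["Last Name"].value[0]:
    # compare the character at index 0 of each cell.
    return first_name[0] == last_name[0]


# Hypothetical rows of (First Name, Last Name).
rows = [("Marilyn", "Monroe"), ("Judy", "Garland")]

# One true/false value per row, which is what the facet would group on.
facet = [same_initial(first, last) for first, last in rows]
```

Note that `first_word("Mary Anne Smith")` returns only `"Mary"`, which is the one-word-first-name limitation mentioned above.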
You can learn more about text-modification functions on the [Expressions page](expressions).
## Custom numeric facet
-You may want to explore your numerical data in a way that doesn’t involve modifying it but does require being more selective about what gets considered. You can also use custom numeric facets to analyze textual data, such as by getting the length of text strings (with `value.length()`), or by analyzing it as though it were formatted as numbers (with `toNumber(value)`).
+You may want to explore your numerical data with modifications that aren't permanent. You can also use custom numeric facets to analyze textual data, such as by getting the length of text strings (with `value.length()`), or by analyzing it as though it were formatted as numbers (with `toNumber(value)`).
-If you would like to build your own version of a numeric facet, you can use the “Custom Numeric Facet” option. Clicking on “Facets” → “Custom Numeric Facet…” will bring up an [expressions](expressions) window where you can enter in a GREL, Python or Jython, or Clojure expression to modify how the facet works. A custom numeric facet operates just like a [numeric facet](#numeric-facet) by default.
+If you would like to build your own version of a numeric facet, you can use the Custom Numeric Facet option. Clicking on Facet → Custom Numeric Facet… will bring up an [expressions](expressions) window where you can enter in a GREL, Jython, or Clojure expression to modify how the facet works. A custom numeric facet operates just like a [numeric facet](#numeric-facet) by default.
For example, if you wish to create a numeric facet that rounds your value to the nearest integer, enter
-```round(value)```
+```
+round(value)
+```
If you have two columns of numbers and for each row you wish to create a numeric facet only on the larger of the two, enter
-```max(cells["Column1"].value, cells[“Column2”].value)```
+```
+max(cells["Column1"].value, cells["Column2"].value)
+```
If the numeric values in a column are drawn from a power law distribution, then it's better to group them by their logs:
-```value.log()```
+```
+value.log()
+```
If the values are periodic you could take the modulus by the period to understand if there's a pattern:
-```mod(value, 7)```
+```
+mod(value, 7)
+```
You can learn more about numeric-modification functions on the [Expressions page](expressions).
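
To see why the modulus expression helps with periodic data, here is a small Python sketch of the same idea. The values are hypothetical day numbers, and `larger_of` is an illustrative helper rather than an OpenRefine function; the sketch only covers non-negative integers.

```python
from collections import Counter


def larger_of(a, b):
    # Mirrors max(cells["Column1"].value, cells["Column2"].value) for one row.
    return max(a, b)


# Hypothetical day numbers; values seven apart share a remainder.
values = [1, 8, 15, 3, 10, 6]

# Mirrors mod(value, 7): count how many values fall on each remainder.
buckets = Counter(v % 7 for v in values)
```

If most values pile up on one or two remainders, as they do in this made-up sample, the data likely follows a weekly pattern.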
@@ -166,13 +192,15 @@ You can learn more about numeric-modification functions on the [Expressions page
Customized facets have been added to expand the number of default facets users can apply with a single click. They represent some common and useful functions you shouldn’t have to work out using an [expression](expressions).
-All facets that display in the “Facet/Filter” sidebar can be edited by clicking on the “change” button to the right of the column title. This brings up the expressions window that will allow you to modify and preview the expression being used.
+All facets that display in the Facet/Filter tab can be edited by clicking on the “change” button to the right of the column title. This brings up the expressions window that will allow you to modify and preview the expression being used.
### Word facet
-Word facet is a simple version of a text facet: it splits up the content of the cells based on spaces, and outputs each character string as a facet:
+A Word facet is a simple version of a text facet: it splits up the content of the cells based on spaces, and outputs each character string as a facet:
-```value.split(" ")```
+```
+value.split(" ")
+```
This can be useful for exploring the language used in a corpus, looking for common first and last names or titles, or seeing what’s in multi-valued cells you don’t wish to split up.
@@ -180,19 +208,21 @@ Word facet is case-sensitive and only splits by spaces, not by line breaks or ot
### Duplicates facet
-A duplicates facet will return only rows that have non-unique values in the column you’ve selected. It will create a facet of “true” and “false” values - true being cells that are not unique, and “false” being cells that are. The actual expression being used is
+A Duplicates facet will return only rows that have non-unique values in the column you’ve selected. It will create a facet of “true” and “false” values - “true” being cells that are not unique, and “false” being cells that are. The actual expression being used is
-```facetCount(value, 'value', '[Column]') > 1```
+```
+facetCount(value, 'value', '[Column]') > 1
+```
Duplicates facets are case-sensitive and you may wish to filter out things like leading and trailing whitespace or other hard-to-see issues. You can modify the facet expression, for example, with:
-```facetCount(trim(toLowercase(value)), 'trim(toLowercase(value))', 'cityLabel') > 1```
+```
+facetCount(trim(toLowercase(value)), 'trim(toLowercase(value))', 'cityLabel') > 1
+```
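
The effect of normalizing values before counting can be sketched in Python as follows. This is an approximation of the idea behind `facetCount(...) > 1`, not OpenRefine's implementation; `duplicates_facet` and the sample city values are invented for illustration.

```python
from collections import Counter


def duplicates_facet(values, normalize=lambda v: v):
    # Count how often each (optionally normalized) value occurs in the
    # column, then mark a cell True when its value occurs more than once.
    keys = [normalize(v) for v in values]
    counts = Counter(keys)
    return [counts[key] > 1 for key in keys]


# Hypothetical column with a case difference and trailing whitespace.
cities = ["Lagos", "lagos ", "Tokyo"]

# Exact comparison, like the default Duplicates facet expression.
exact = duplicates_facet(cities)

# Comparison after trimming and lowercasing, like the modified expression.
loose = duplicates_facet(cities, lambda v: v.strip().lower())
```

With exact matching none of the sample cells count as duplicates, while the normalized version flags both spellings of the first city.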
### Numeric log facet
-Logarithmic scales reduce wide-ranging quantities to more compact and manageable ranges. A log transformation can be used to make highly skewed distributions less skewed. If your numerical data is unevenly distributed (say, lots of values in one range, and then a long tail extending off into different magnitudes), a numeric log facet can represent that range better than a simple numeric facet. It will break these values down into more navigable segments than the buckets of a numeric facet. This facet can make patterns in your data more visible.
-
-OpenRefine uses a base-10 log, the "common logarithm."
+Logarithmic scales reduce wide-ranging quantities to more compact and manageable ranges. A log transformation can be used to make highly skewed distributions less skewed. If your numerical data is unevenly distributed (say, lots of values in one range, and then a long tail extending off into different magnitudes), a numeric log facet can represent that range better than a simple numeric facet. It will break these values down into more navigable segments than the buckets of a numeric facet. This facet can make patterns in your data more visible. OpenRefine uses a base-10 log, the “common logarithm.”
For example, we can look at [this data about the body weight of various mammals](http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data_Brain2BodyWeight):
@@ -220,13 +250,15 @@ A 1-bounded numeric log facet can be used if you'd like to exclude all the value
### Text-length facet
-The text-length facet returns a numerical value for each cell and plots it on a numeric facet chart. The expression used is
+The Text-length facet returns a numerical value for each cell and plots it on a numeric facet chart. The expression used is
-```value.length()```
+```
+value.length()
+```
-This can be useful to, for example, look for values that did not successfully split on an earlier split operation, or to validate that data is a certain expected length (such as whether a date, as YYYY/MM/DD, is eight to ten characters).
+This can be useful to, for example, look for values that did not successfully split on an earlier split operation, or to validate that data is a certain expected length (such as whether a date in YYYY/MM/DD is eight to ten characters).
-You can also employ a log of text-length facet that allows you to navigate more easily a wide range of string lengths. This can be useful in the case of web-scraping, where lots of textual data is loaded into single cells and needs to be parsed out.
+You can also employ a Log of text-length facet that allows you to navigate a wide range of string lengths more easily. This can be useful in the case of web-scraping, where lots of textual data is loaded into single cells and needs to be parsed out.
### Unicode character-code facet
@@ -243,19 +275,23 @@ An error is a data type created by OpenRefine in the process of transforming dat
![A view of the expressions window with an error converting a string to a number.](/img/error.png)
-To store errors in cells, ensure that you have “store error” selected for the “On error” option in the expressions window.
+To store errors in cells, ensure that you have store error selected for the “On error” option in the expressions window.
### Facet by null, empty, or blank
-Any column can be faceted for [null and/or empty cells](#cell-data-types). These can help you find cells where you want to manually enter content. “Blank” means both null values and empty values. All three facets will generate “true” and “false” facets, “true” being blank.
+Any column can be faceted for [null and/or empty cells](#cell-data-types). These can help you find cells where you want to manually enter content.
-An empty cell is a cell that is set to contain a string, but doesn’t have any characters in it (a zero-length string). This can be a leftover from an operation that removed characters, or from manually editing a cell and deleting its contents.
+“Blank” means both null values and empty values. All three facets will generate “true” and “false” facets, “true” being blank.
+
+An empty cell is a cell that is set to contain a string, but doesn’t have any characters in it (a zero-length string). This can be left over from an operation that removed characters, or from manually editing a cell and deleting its contents.
### Facet by star or flag
Stars and flags offer you the opportunity to mark specific rows for yourself for later focus. Stars and flags persist through closing and opening your project, and thus can provide a different function than using a permalink to persist your facets. Stars and flags can be used in any way you want, although they are designed to help you flag errors and star rows of particular importance.
-You can manually star or flag rows simply by clicking on the icons to the left of each row. You can also apply stars or flags to all matching rows by using the “All” dropdown menu and selecting “Edit rows” → “Star rows” or “Flag rows.” These operations will modify all matching rows in your current subset. You can unstar or unflag them as well.
+You can manually star or flag rows simply by clicking on the icons to the left of each row.
+
+You can also apply stars or flags to all matching rows by using the All dropdown menu (on the first column) and selecting Edit rows → Star rows or Flag rows. These operations will modify all matching rows in your current subset. You can unstar or unflag them as well. From the same dropdown menu, selecting Facet by star or Facet by flag will create “true” and “false” facets in the Facet/Filter tab.
You may wish to create a custom subset of your data through a series of separate faceting activities (rather than successively narrowing down with multiple facets applied). For example, you may wish to:
* apply a facet
@@ -266,21 +302,21 @@ You may wish to create a custom subset of your data through a series of separate
* remove that facet
* and then work with all of the cumulative starred rows.
-You can use the dropdown menu on the “All” column and selecting “Facet by star” or “Facet by flag.” This will create “true” and “false” facets in the facet sidebar.
-
You can also create a text facet on any column with the expression `row.starred` or `row.flagged`.
## Text filter
Filters allow you to narrow down your data based on whether a given column includes a text string.
-When you choose “Text filter” a box appears in the “Facet/Filter” sidebar that allows you to enter in text. Matching rows will narrow dynamically with every character you enter. You can set the search to be case-sensitive or not, and you can use this box to enter in a regular expression.
+When you choose Text filter, a box appears in the Facet/Filter tab that allows you to enter in text. Matching rows will narrow dynamically with every character you enter. You can set the search to be case-sensitive or not, and you can use this box to enter in a regular expression.
-For example, you can enter in "side" as a text filter, and it will return all cells in that column containing "side," "sideways," "offside," etc.
+For example, you can enter “side” as a text filter, and it will return all cells in that column containing “side,” “sideways,” “offside,” etc.
-The text filter field supports [Java's regular expression language](http://download.oracle.com/javase/tutorial/essential/regex/). For example, you can employ a regular expression to view all properly-formatted emails:
+The text filter field supports [regular expressions](expressions#regular-expressions). For example, you can employ a regular expression to view all properly-formatted emails:
-```([a-zA-Z0-9_\-\.\+]+)@([a-zA-Z0-9\-\.]+)\.([a-zA-Z0-9\-]{2,15})```
+```
+([a-zA-Z0-9_\-\.\+]+)@([a-zA-Z0-9\-\.]+)\.([a-zA-Z0-9\-]{2,15})
+```
You can press “invert” on this facet to then see blank cells or invalid email addresses.
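To try the pattern before typing it into the filter box, you can test it against a few sample values. The sketch below uses Python's `re` module as a stand-in; OpenRefine itself uses Java-style regular expressions, but this particular pattern is assumed to behave the same in both. The sample cells and the `text_filter` helper are hypothetical, and the helper only imitates the "contains a match" and "invert" behaviour described above.

```python
import re

# The email pattern from the example above.
EMAIL = re.compile(
    r"([a-zA-Z0-9_\-\.\+]+)@([a-zA-Z0-9\-\.]+)\.([a-zA-Z0-9\-]{2,15})"
)

# Hypothetical cell values, including a blank cell.
cells = [
    "alice@example.org",
    "not an email",
    "",
    "contact: bob.smith+news@mail.example.co.uk",
]


def text_filter(values, pattern, invert=False):
    """Keep values containing a match, or those without one if inverted."""
    return [v for v in values if bool(pattern.search(v)) != invert]


matching = text_filter(cells, EMAIL)
inverted = text_filter(cells, EMAIL, invert=True)
print(matching)
print(inverted)
```

Because the filter looks for a match anywhere in the cell, the last sample value is kept even though it contains extra text around the address.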
diff --git a/docs/docs/manual/grelfunctions.md b/docs/docs/manual/grelfunctions.md
index 4de5425ce..22393b45d 100644
--- a/docs/docs/manual/grelfunctions.md
+++ b/docs/docs/manual/grelfunctions.md
@@ -182,7 +182,7 @@ Returns the array of strings obtained by splitting s into substrings with the gi
Returns the array of strings obtained by splitting s by sep, or by guessing either tab or comma separation if there is no sep given. Handles quotes properly and understands cancelled characters. The separator can be either a string or a regex pattern. For example, `value.smartSplit("\n")` will split at a carriage return or a new-line character.
-Note: `value.[escape](#escapes-s-mode)('javascript')` is useful for previewing unprintable characters prior to using smartSplit().
+Note: [`value.escape('javascript')`](#escapes-s-mode) is useful for previewing unprintable characters prior to using smartSplit().
###### splitByCharType(s)
diff --git a/docs/docs/manual/installing.md b/docs/docs/manual/installing.md
index 88f766eb8..adda2313c 100644
--- a/docs/docs/manual/installing.md
+++ b/docs/docs/manual/installing.md
@@ -45,7 +45,7 @@ For the absolute latest development updates, see the [snapshot releases](https:/
#### What’s changed
-Our [latest version is OpenRefine 3.4.1](https://github.com/OpenRefine/OpenRefine/releases/tag/3.4.1), released September 24th 2020. The major changes in this version are listed on the [3.4 release page](https://github.com/OpenRefine/OpenRefine/releases/tag/3.4.1) with the downloadable packages.
+Our [latest version is OpenRefine 3.4.1](https://github.com/OpenRefine/OpenRefine/releases/tag/3.4.1), released September 24th 2020. The major changes in this version are listed on the [3.4.1 release page](https://github.com/OpenRefine/OpenRefine/releases/tag/3.4.1) with the downloadable packages.
You can find information about all OpenRefine versions on the [Releases page on Github](https://github.com/OpenRefine/OpenRefine/releases).
@@ -70,7 +70,7 @@ Take note of the [extensions](#installing-extensions) you have currently install
[Java Development Kit (JDK)](https://jdk.java.net/) is required to run OpenRefine and should be installed first. [OpenRefine installation packages for Mac and Windows come bundled with JDK](https://openrefine.org/download.html), so you do not need to install it separately if you use those bundles.
-There are JDK packages for Mac, Windows, and Linux. We recommend you install the latest “Ready for use” version: at the time of writing, this is [JDK 14.0.1](https://jdk.java.net/14/).
+There are JDK packages for Mac, Windows, and Linux. We recommend you install the latest “Ready for use” version. At the time of writing, this is [JDK 14.0.1](https://jdk.java.net/14/).
Download the archive (either a `.tar.gz` or a `.zip`) to your computer and then extract its contents to a location of your choice. There is no installation process, so you may wish to extract this folder directly into a place where you put program files, or another stable folder.
@@ -91,16 +91,16 @@ import TabItem from '@theme/TabItem';
-1. On Windows 10, click the Windows start menu button, type “env,” and look at the search results. Click “Edit the system environment variables.” (If you are using an earlier version of Windows, use the “Search” or “Search programs and files” box in the start menu.)
+1. On Windows 10, click the Start Menu button, type `env`, and look at the search results. Click Edit the system environment variables. (If you are using an earlier version of Windows, use the “Search” or “Search programs and files” box in the Start Menu.)
![A screenshot of the search results for 'env'.](/img/env.png "A screenshot of the search results for 'env'.")
-2. Click “Environment Variables…” at the bottom of the “Advanced” window that appears.
-3. In the “Environment Variables” dialog that appears, click “New…” and create a variable with the key `JAVA_HOME`. You can set the variable for only your user account, as in the screenshot below, or set it as a system variable - it will work either way.
+2. Click Environment Variables… at the bottom of the Advanced window.
+3. In the Environment Variables window that appears, click New… and create a variable with the key `JAVA_HOME`. You can set the variable for only your user account, as in the screenshot below, or set it as a system variable - it will work either way.
![A screenshot of 'Environment Variables'.](/img/javahome.png "A screenshot of 'Environment Variables'.")
-4. Set the `Value` to the folder where you installed JDK, in the format `D:\Programs\OpenJDK`. You can locate this folder with the “Browse directory...” button.
+4. Set the `Value` to the folder where you installed JDK, in the format `D:\Programs\OpenJDK`. You can locate this folder with the Browse directory... button.
@@ -174,7 +174,7 @@ Save and close the file. When you are back in the terminal, type
source /etc/environment
```
-Exit the terminal and restart your system. You can then check that JAVA_HOME is set properly by opening another terminal and typing
+Exit the terminal and restart your system. You can then check that `JAVA_HOME` is set properly by opening another terminal and typing
```
echo $JAVA_HOME
```
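If you prefer a scripted check, the same verification can be sketched in Python. This is only an illustration: the `describe_java_home` helper and its messages are made up here, and it checks nothing beyond whether the variable is set and points to an existing folder.

```python
import os
from pathlib import Path


def describe_java_home(env=os.environ):
    """Report whether JAVA_HOME is set and points to an existing folder."""
    java_home = env.get("JAVA_HOME")
    if not java_home:
        return "JAVA_HOME is not set"
    if not Path(java_home).is_dir():
        return f"JAVA_HOME points to a missing folder: {java_home}"
    return f"JAVA_HOME is set to {java_home}"


print(describe_java_home())
```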
@@ -208,7 +208,9 @@ If you have extensions installed, do not delete the `webapp\extensions` folder w
-Once you have downloaded the `.zip` file, extract it into a folder where you wish to store program files (such as `D:\Program Files\OpenRefine`). You can right-click on `openrefine.exe` or `refine.bat` and pin one of those programs to your Start Menu or create shortcuts for easier access.
+Once you have downloaded the `.zip` file, extract it into a folder where you wish to store program files (such as `D:\Program Files\OpenRefine`).
+
+You can right-click on `openrefine.exe` or `refine.bat` and pin one of those programs to your Start Menu or create shortcuts for easier access.
@@ -311,7 +313,7 @@ tar xzf openrefine-linux-3.4.tar.gz
### Set where data is stored
-OpenRefine stores data in two places: program files in the program directory, wherever it is you’ve installed it; and project files in what we call the “workspace directory.” You can access this folder easily from OpenRefine by going to the [home screen](running#the-home-screen) (at [http://127.0.0.1:3333/](http://127.0.0.1:3333/)) and clicking “Browse workspace directory.”
+OpenRefine stores data in two places: program files in the program directory, wherever it is you’ve installed it; and project files in what we call the “workspace directory.” You can access this folder easily from OpenRefine by going to the [home screen](running#the-home-screen) (at [http://127.0.0.1:3333/](http://127.0.0.1:3333/)) and clicking Browse workspace directory.
By default this is:
@@ -359,7 +361,7 @@ If the folder does not exist, OpenRefine will create it.
~/Library/Application Support/OpenRefine/
```
-For older versions as Google Refine:
+For older versions, as Google Refine:
```
~/Library/Application Support/Google/Refine/
@@ -418,7 +420,7 @@ You can access OpenRefine server logs from the terminal on Mac:
## Increasing memory allocation
-OpenRefine relies on having computer memory available to it to work effectively. If you are planning to work with large data sets, you may wish to set up OpenRefine to handle it at the outset. By “large” we generally mean one of the following indicators:
+OpenRefine relies on having computer memory available to it to work effectively. If you are planning to work with large datasets, you may wish to set up OpenRefine to handle them at the outset. By “large” we generally mean one of the following indicators:
* more than one million total cells
* an input file size of more than 50 megabytes (MB)
* more than 50 [rows per record in records mode](running#records-mode)
@@ -430,7 +432,7 @@ A good practice is to start with no more than 50% of whatever memory is left ove
All of the settings below use a four-digit number to specify the megabytes (MB) used (actually [mebibytes](https://en.wikipedia.org/wiki/Mebibyte)). The default is usually 1024MB, but the new value doesn't need to be a multiple of 1024.
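As a rough illustration of picking a value, the sketch below applies the 50% guideline to an amount of free memory you supply yourself. The `suggested_allocation_mb` function is hypothetical, and the figure it returns is only a starting point, not a recommendation from OpenRefine.

```python
def suggested_allocation_mb(free_mb: int, fraction: float = 0.5) -> int:
    """Return a whole number of MB no larger than `fraction` of free memory."""
    return int(free_mb * fraction)


# With about 6000 MB free, half of that is the suggested ceiling.
# The result does not need to be a multiple of 1024.
print(suggested_allocation_mb(6000))
```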
:::info Dealing with large datasets
-If your project is big enough to need more than the default amount of memory, consider turning off “Parse cell text into numbers, dates, ...” on import. It's convenient, but less efficient than explicitly converting any columns that you need as a data type other than the default “string” type.
+If your project is big enough to need more than the default amount of memory, consider turning off Parse cell text into numbers, dates, ... on import. It's convenient, but less efficient than explicitly converting any columns that you need as a data type other than the default “string” type.
:::