Introduction to Transformations
Overview
Teaching: 5 min
Exercises: 5 minQuestions
How do I use transformations to programmatically edit my data?
What are the kind of transformations Open Refine supports?
What is GREL?
Objectives
Describe common transformations
Explain GREL, the General Refine Expression Language
Introducing Transformations
Through facets, filters and clusters OpenRefine offers relatively straightforward ways of getting an overview of your data, and making changes where you want to standardise terms used to a common set of values.
However, sometimes there will be changes you want to make to the data that cannot be achieved in this way. Such types of changes include:
- Splitting data that is in a single column into multiple columns (e.g., splitting an address into multiple parts)
- Standardising the format of data in a column without changing the values (e.g., removing punctuation or standardising a date format)
- Extracting a particular type of data from a longer text string (e.g., finding ISBNs in a bibliographic citation)
OpenRefine supports these through Transformations
which are ways of manipulating data in columns. Transformations are normally written in a special language called GREL (General Refine Expression Language)
. To some extent GREL expressions are similar to Excel Formula, although they tend to focus on text manipulations rather than numeric functions.
For more information about GREL commands, check this manual.
Common transformations
OpenRefine features several functions for performing commonly needed transformations to your data. Many of these transformations are intended for data cleaning purposes and can also be done by using regular expressions. Some transformations are used regularly and are accessible directly through menu options, without having to type them directly. To locate these transformations:
- Go to the column you would like to make edits to and click the arrow button on the column header
- Select the “Edit cells” and then “Common transforms” options
Examples of some of these common transformations are given in the table below:
Transformation | Function |
---|---|
Trim leading and trailing whitespace | Removes extra spaces before and after the value in each cell in the selected column. |
Collapse consecutive whitespace | Removes extra spaces between individual strings within a cell value in the selected column. |
Unescape HTML entities | Removes HTML strings indicating particular characters (e.g., “&” vs “&”). |
Replace Smart quotes with ascii | Replaces Smart quotes to their ASCII (e.g. ‘ and “) |
To titlecase | Changes the first letter of all words within a cell value to an uppercase letter. |
To uppercase | Changes all letters of all words within a cell value to uppercase letters. |
To lowercase | Changes all letters of all words within a cell value to lowercase letters. |
To number | Changes the cell format to number. |
To date | Changes the cell format to date. |
To text | Changes the cell format to text. |
To null | Changes the cell format to null. |
To empty string | Changes the cell format to empty string. |
Removing Unnecessary Spaces Without Inspecting All Facet Choices (DOAJ_big)
- Create a text facet on the “Society or Institution” Column
- Note that you have over 300 choices listed and would take too much time to inspect each of them
- On the column use the dropdown menu to select
Edit cells->Common transforms->Trim leading and trailing whitespace
- Note the message that appears on the top and the expression
- Then,
Edit cells>Common transforms>Collapse consecutive whitespace
. Pay attention to the message again.- Look at the facet now - has it changed at all or the number remains the same? Do you know why?
Transforming Text Facet into Dates (DOAJ_big)
By default, in
OpenRefine
all columns are identified as text. So if you want to create a date facet you will have to first transform those values into ‘dates’.
- Click on the column
Added on Date
dropdown menu, thenEdit Cells>Common Transforms>To Date
(Note that values will be green, demonstrating the sucessfull transformation to the desired data format)- Now, perform a
Timeline Facet
- Can you spot a time period when there was a large number of journals added to
DOAJ_big
?
Challenge
Now, try the same process with the
Most Recent Article
column. What is the peak timeline for this variable? How many cases fall under this peak?Solution
The peak time was in between October 28, 2019 and October 27, 2020
and291 cases according to row numbers
Key Points
Common transformations are available through the Menu option