Introduction to OpenRefine
Overview
Teaching: 15 min
Exercises: 0 minQuestions
What is OpenRefine? What can it do?
Objectives
Explain what the OpenRefine software does
Explain how the OpenRefine software can help work with data files
What is OpenRefine?
OpenRefine is a desktop application that uses your web browser as a graphical interface. It is described as “a power tool for working with messy data” (David Huynh) - but what does this mean? It is probably easiest to describe the kinds of data OpenRefine is good at working with and the sorts of problems it can help you or your team solve.
OpenRefine is most useful where you have data in a simple tabular format such as a spreadsheet, a comma separated values file (csv) or a tab delimited file (tsv) but with internal inconsistencies either in data formats, or where data appears, or in terminology used. OpenRefine can be used to standardize and clean data across your file. It can help you:
- Get an overview of a data set
- Resolve inconsistencies in a data set, for example standardizing date formatting
- Help you split data up into more granular parts, for example splitting up cells with multiple authors into separate cells
- Match local data up to other data sets, for example in matching local subjects against the Library of Congress Subject Headings
- Enhance a data set with data from other sources
Some common scenarios might be:
- Where you want to know how many times a particular value (name, publisher, subject) appears in a column in your data
- Where you want to know how values are distributed across your whole data set
- Where you have a list of dates which are formatted in different ways, and want to change all the dates in the list to a single common date format. For example:
Data you have | Desired data |
---|---|
1st January 2014 | 2014-01-01 |
01/01/2014 | 2014-01-01 |
Jan 1 2014 | 2014-01-01 |
2014-01-01 | 2014-01-01 |
- Where you have a list of names or terms that differ from each other but refer to the same people, places or concepts. For example:
Data you have | Desired data |
---|---|
London | London |
London] | London |
London,] | London |
london | London |
- Where you have several bits of data combined together in a single column, and you want to separate them out into individual bits of data with one column for each bit of the data. For example going from a single address field (in the first column), to each part of the address in a separate field:
Address in single field | Institution | Library name | Address 1 | Address 2 | Town/City | Region | Country | Postcode |
---|---|---|---|---|---|---|---|---|
University of Wales, Llyfrgell Thomas Parry Library, Llanbadarn Fawr, ABERYSTWYTH, Ceredigion, SY23 3AS, United Kingdom | University of Wales | Llyfrgell Thomas Parry Library | Llanbadarn Fawr | Aberystwyth | Ceredigion | United Kingdom | SY23 3AS | |
University of Aberdeen, Queen Mother Library, Meston Walk, ABERDEEN, AB24 3UE, United Kingdom | University of Abderdeen | Queen Mother Library | Meston Walk | Aberdeen | United Kingdom | AB24 3UE | ||
University of Birmingham, Barnes Library, Medical School, Edgbaston, BIRMINGHAM, West Midlands, B15 2TT, United Kingdom | University of Birmingham | Barnes Library | Medical School | Edgbaston | Birmingham | West Midlands | United Kingdom | B15 2TT |
University of Warwick, Library, Gibbett Hill Road, COVENTRY, CV4 7AL, United Kingdom | University of Warwick | Library | Gibbett Hill Road | Coventry | United Kingdom | CV4 7AL |
- Where you want to add to your data from an external data source:
Data you have | Date of Birth from VIAF (Virtual International Authority File) | Date of Death from VIAF (Virtual International Authority File) |
---|---|---|
Braddon, M. E. (Mary Elizabeth) | 1835 | 1915 |
Rossetti, William Michael | 1829 | 1919 |
Prest, Thomas Peckett | 1810 | 1879 |
What Should I Know When Working With OpenRefine?
- No internet connection is needed, and none of the data or commands you enter in OpenRefine are sent to a remote server.
- You are NOT modifying original/raw data.
- Saving will happen automatically, but it’s important to properly exit OpenRefine to ensure this.
- Files are saved locally such that if you are working on two computers you will have to export/import files/projects.
About the Data
The datasets we will be working with in this workshop is a collection of journals, containing the attributes Title, Authors, DOI, URL, Date, Language, Subjects, ISSNs, Publisher, Citation, and Licence.
These datasets were obtained from the Directory of Open Access Journals (DOAJ), an independent, non-profit organisation managed by Infrastructure Services for Open Access C.I.C. (IS4OA). You may learn more about DOAJ on their page. The datasets we are using here are subsets from DOAJ’s independent index of peer-reviewed, open access journals covering all areas of science, technology, medicine, social sciences, arts and humanities.
Key Points
OpenRefine is ‘a tool for working with messy data’
OpenRefine works best with data in a simple tabular format
OpenRefine can help you split data up into more granular parts
OpenRefine can help you match local data up to other data sets
OpenRefine can help you enhance a data set with data from other sources