Introduction to OpenRefine

Overview

Teaching: 15 min
Exercises: 0 min
Questions
  • What is OpenRefine? What can it do?

Objectives
  • Explain what the OpenRefine software does

  • Explain how the OpenRefine software can help work with data files

What is OpenRefine?

OpenRefine is a desktop application that uses your web browser as a graphical interface. It is described as “a power tool for working with messy data” (David Huynh), but what does this mean? It is probably easiest to describe the kinds of data OpenRefine is good at working with and the sorts of problems it can help you or your team solve.

OpenRefine is most useful when your data is in a simple tabular format, such as a spreadsheet, a comma-separated values (CSV) file or a tab-delimited (TSV) file, but contains internal inconsistencies in data formats, in where data appears, or in the terminology used. OpenRefine can be used to standardize and clean data across your file. It can help you:

  • Split data up into more granular parts

  • Match local data up to other data sets

  • Enhance a data set with data from other sources

Some common scenarios might be:

| Data you have    | Desired data |
|------------------|--------------|
| 1st January 2014 | 2014-01-01   |
| 01/01/2014       | 2014-01-01   |
| Jan 1 2014       | 2014-01-01   |
| 2014-01-01       | 2014-01-01   |
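
To make the date scenario concrete, here is a minimal Python sketch of the same kind of normalization performed outside OpenRefine. The function name `to_iso` and the list of candidate formats are assumptions chosen for this illustration, as is the day-first reading of `01/01/2014`. Month names are parsed using the English names of the default locale.

```python
import re
from datetime import datetime

# Candidate input formats, tried in order (an assumption for this sketch).
# "%d/%m/%Y" treats 01/01/2014 as day-first; swap to "%m/%d/%Y" for US-style data.
FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%d %B %Y", "%b %d %Y"]


def to_iso(text):
    """Return the date in YYYY-MM-DD form, or None if no format matches."""
    # Remove ordinal suffixes so that "1st" becomes "1".
    cleaned = re.sub(r"(\d+)(st|nd|rd|th)\b", r"\1", text.strip())
    for fmt in FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue  # try the next candidate format
    return None


for raw in ["1st January 2014", "01/01/2014", "Jan 1 2014", "2014-01-01"]:
    print(raw, "->", to_iso(raw))
```

Returning `None` for unrecognized values, rather than guessing, leaves the problem rows easy to find and review by hand.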

| Data you have | Desired data |
|---------------|--------------|
| London        | London       |
| London]       | London       |
| London,]      | London       |
| london        | London       |
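
The place-name variants could be reconciled along the following lines. This is only a sketch: `clean_place` is a hypothetical helper, and the set of characters it strips is an assumption based on the four example values.

```python
def clean_place(value):
    """Trim stray trailing punctuation and normalize capitalization."""
    # Remove surrounding whitespace, then trailing commas, brackets and similar debris.
    trimmed = value.strip().rstrip(",.;:[] ")
    # Capitalize the first letter of each word.
    return trimmed.title()


for raw in ["London", "London]", "London,]", "london"]:
    print(repr(raw), "->", clean_place(raw))
```

`str.title()` is a crude choice that mishandles names containing apostrophes, so real data would likely need a more careful rule or a manual review step.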

| Address in single field | Institution | Library name | Address 1 | Address 2 | Town/City | Region | Country | Postcode |
|---|---|---|---|---|---|---|---|---|
| University of Wales, Llyfrgell Thomas Parry Library, Llanbadarn Fawr, ABERYSTWYTH, Ceredigion, SY23 3AS, United Kingdom | University of Wales | Llyfrgell Thomas Parry Library | Llanbadarn Fawr | | Aberystwyth | Ceredigion | United Kingdom | SY23 3AS |
| University of Aberdeen, Queen Mother Library, Meston Walk, ABERDEEN, AB24 3UE, United Kingdom | University of Aberdeen | Queen Mother Library | Meston Walk | | Aberdeen | | United Kingdom | AB24 3UE |
| University of Birmingham, Barnes Library, Medical School, Edgbaston, BIRMINGHAM, West Midlands, B15 2TT, United Kingdom | University of Birmingham | Barnes Library | Medical School | Edgbaston | Birmingham | West Midlands | United Kingdom | B15 2TT |
| University of Warwick, Library, Gibbett Hill Road, COVENTRY, CV4 7AL, United Kingdom | University of Warwick | Library | Gibbett Hill Road | | Coventry | | United Kingdom | CV4 7AL |
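
Splitting a combined address field can be sketched as follows. The heuristic is an assumption that happens to fit the four example rows: the first two parts are the institution and library, the last two are the postcode and country, and the town is the part written entirely in capitals. `split_address` is a hypothetical name, not an OpenRefine function.

```python
def split_address(address):
    """Split a comma-separated address into named fields (heuristic sketch)."""
    parts = [p.strip() for p in address.split(",")]
    institution, library = parts[0], parts[1]
    postcode, country = parts[-2], parts[-1]
    middle = parts[2:-2]  # address lines, town and optional region

    # Assumption: the town is the only part of the middle written in capitals.
    town_idx = next((i for i, p in enumerate(middle) if p.isupper()), None)
    if town_idx is None:
        raise ValueError("no upper-case town found in: " + address)

    lines = middle[:town_idx]        # zero, one or two address lines
    region = middle[town_idx + 1:]   # empty when no region is given
    return {
        "Institution": institution,
        "Library name": library,
        "Address 1": lines[0] if len(lines) > 0 else "",
        "Address 2": lines[1] if len(lines) > 1 else "",
        "Town/City": middle[town_idx].title(),
        "Region": region[0] if region else "",
        "Country": country,
        "Postcode": postcode,
    }


row = split_address(
    "University of Birmingham, Barnes Library, Medical School, Edgbaston, "
    "BIRMINGHAM, West Midlands, B15 2TT, United Kingdom"
)
print(row)
```

Addresses that do not follow this pattern raise an error instead of being silently misfiled, which makes the exceptions visible.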

| Data you have | Date of Birth from VIAF (Virtual International Authority File) | Date of Death from VIAF (Virtual International Authority File) |
|---|---|---|
| Braddon, M. E. (Mary Elizabeth) | 1835 | 1915 |
| Rossetti, William Michael | 1829 | 1919 |
| Prest, Thomas Peckett | 1810 | 1879 |
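
Enriching local data from another source amounts to a lookup keyed on a shared value. The sketch below uses an in-memory dictionary as a stand-in for the external source. It does not call VIAF, and the names `EXTERNAL` and `enrich` are hypothetical. The dates are copied from the example rows.

```python
# Stand-in for an external authority file; values copied from the example rows.
EXTERNAL = {
    "Braddon, M. E. (Mary Elizabeth)": {"birth": "1835", "death": "1915"},
    "Rossetti, William Michael": {"birth": "1829", "death": "1919"},
    "Prest, Thomas Peckett": {"birth": "1810", "death": "1879"},
}


def enrich(names, source=EXTERNAL):
    """Attach birth and death dates to each name found in the source."""
    rows = []
    for name in names:
        match = source.get(name, {})  # empty dict when the name is unknown
        rows.append({
            "name": name,
            "birth": match.get("birth", ""),
            "death": match.get("death", ""),
        })
    return rows


for row in enrich(["Rossetti, William Michael", "Unknown, Author"]):
    print(row)
```

Names with no match keep empty date fields, so unmatched records can be filtered and checked separately.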

What Should I Know When Working With OpenRefine?

About the Data

The datasets we will be working with in this workshop describe a collection of journals and contain the attributes Title, Authors, DOI, URL, Date, Language, Subjects, ISSNs, Publisher, Citation, and Licence.

These datasets were obtained from the Directory of Open Access Journals (DOAJ), an independent, non-profit organisation managed by Infrastructure Services for Open Access C.I.C. (IS4OA). You may learn more about DOAJ on their page. The datasets we are using here are subsets from DOAJ’s independent index of peer-reviewed, open access journals covering all areas of science, technology, medicine, social sciences, arts and humanities.

Key Points

  • OpenRefine is ‘a power tool for working with messy data’

  • OpenRefine works best with data in a simple tabular format

  • OpenRefine can help you split data up into more granular parts

  • OpenRefine can help you match local data up to other data sets

  • OpenRefine can help you enhance a data set with data from other sources