Introduction

Overview

Teaching: 15 min
Exercises: 3 min
Questions
  • What are basic principles for using spreadsheets for good data organization?

Objectives
  • Understand how to organize data so computers can make the best use of the data

Things You’ll Need To Complete This Tutorial

Spreadsheet Software

To work through this tutorial you will need access to a spreadsheet program. We will be using Google Sheets available through Google Apps. Many computers come with a pre-installed spreadsheet program like Excel. If you are willing to explore the same content using Excel or Libre Office you may refer to the Carpentry official version of this workhop https://datacarpentry.org/spreadsheets-socialsci. Follow these steps to access Google Sheets.

Data

All data and examples we will be covering are available at the Data-Org-Spreadsheets folder on Google Drive. To follow along, make a copy on your own drive.

Good data organization is the foundation of your research project. Most researchers have data or do data entry in spreadsheets. Spreadsheet programs are very useful graphical interfaces for designing data tables and handling very basic data quality control functions.

Spreadsheet outline

In this lesson, we’re going to talk about:

Spreadsheet programs

Spreadsheet applications cab be handy for data entry and organization. Spreadsheets are frequently the go to tool for collecting and organizing data, which is among the simplest of its uses. Information can easily be placed in neat columns and rows and then sorted by information type. Generally speaking, they require minimal training and allow you to produce datasets that can be easily integrated with other statistical and analytical tools. Also, most databases and survey tools let you export data as spreadsheets.

Many spreadsheet programs are available and the functionality of these may differ. We chose to work with Google Sheets because it:

We won’t be covering all functionalities of Google Sheets or the differences between this application and other spreadsheet programs. For an overview on the differences between Google Sheets and Excel visit the Google Space Learning Center

Problems with Spreadsheets

Spreadsheets are good for data entry, but in reality we tend to use spreadsheet programs for much more than data entry. We use them to create data tables for publications, to generate summary statistics, and make figures. Laying out spreadsheets in this way often adds some difficulty when we want to take our data from the spreadsheet and use it in another program. Additional white space, merged cells, colour and grids may aid readability but are not easily handled by other programs that take our spreadsheet as an input to further analysis.

Generating statistics and figures in spreadsheets should be done with caution. The graphical, drag and drop nature of spreadsheet programs means that it can be very difficult, if not impossible, to replicate your steps (much less retrace anyone else’s). This is particularly true if your stats or figures require complex calculations. Furthermore, when performing calculations in a spreadsheet, it’s easy to accidentally apply a slightly different formula to multiple adjacent cells. This often makes it difficult to demonstrate data quality and consistency in our analysis.

Even when we are aware of some of the limitations that data in spreadsheets presents, often we have inherited spreadsheets from another colleague or data provider. In these situations we cannot exercise any control in its construction or entry of the data within it. Nevertheless it is important to be aware of the limitations these data may present, and know how to assess if any problems are present and how to overcome them.

What this lesson will NOT teach you

  • How to do statistics in a spreadsheet.
  • How to do plotting in a spreadsheet.
  • How to write code in spreadsheet programs.

If you’re looking to do this, a good reference is Head First Excel, published by O’Reilly.

Discussion:

  • How often do you use spreadsheets?
  • For what purposes?
  • Where there times you have found using spreadsheets frustrating? Why?

Using Spreadsheets for Data Entry and Cleaning

We do not advise you use spreadsheet programs to perform data analysis since they are not as robust for that matter. Excel and Google Sheets are mostly two-dimensional, and it limits your ability to incorporate multiple variables at the same time. They are a poor choice for statistical analysis beyond textbook examples, the simplest descriptive statistics, or for more than a very few columns. Also, they do not preserve the data manipulations analysis workflows which are important for reproducibility.

Still, there are circumstances where you might want to use a spreadsheet program to produce “quick and dirty” calculations or figures, and some of these features can be used in data cleaning, prior to importation into a statistical analysis program. We will show you how to use some features of spreadsheet programs to check your data quality along the way and produce preliminary summary statistics.

In this lesson we’re going to talk about:

  1. Formatting data tables in spreadsheets
  2. Formatting problems
  3. Dates as data
  4. Quality control
  5. Exporting data

Key Points

  • Organizing your data tables according to tidy data principles will make them easier for you and others to use for analysis.


Formatting Data Tables in Spreadsheets

Overview

Teaching: 15 min
Exercises: 15 min
Questions
  • What are some common challenges with formatting data in spreadsheets and how can we avoid them?

Objectives
  • Recognise and resolve common spreadsheet formatting problems.

  • Describe the importance of metadata.

  • Identify metadata that should be included with a dataset.

Data formatting problems

The most common mistake made is treating spreadsheet programs like lab notebooks, that is, relying on context, notes in the margin, spatial layout of data and fields to convey information. As humans, we can (usually) interpret these things, but computers don’t view information the same way, and unless we explain to the computer what every single thing means (and that can be hard!), it will not be able to see how our data fit together.

Using the power of computers, we can manage and analyze data in much more effective and faster ways, but to use that power, we have to set up our data for the computer to be able to understand it (and computers are very literal).

This is why it’s extremely important to set up well-formatted tables from the outset - before you even start entering data from your very first preliminary experiment. Data organization is the foundation of your research project. It can make it easier or harder to work with your data throughout your analysis, so it’s worth thinking about when you’re doing your data entry or setting up your experiment. You can set things up in different ways in spreadsheets, but some of these choices can limit your ability to work with the data in other programs or have the you-of-6-months-from-now or your collaborator work with the data.

Tip

The best layouts/formats (as well as software and interfaces) for data entry and data analysis might be different. It is important to take this into account, and ideally automate the conversion from one to another.

Keeping track of your analyses

When you’re working with spreadsheets, during data clean up or analyses, it’s very easy to end up with a spreadsheet that looks very different from the one you started with. In order to be able to reproduce your analyses or figure out what you did when Reviewer #3 asks for a different analysis, you should:

Put these principles into practice not only during this workshop, but as part of your daily work.

Structuring data in spreadsheets

The cardinal rules of using spreadsheet programs for data:

  1. Put all your variables in columns - the thing you’re measuring, like ‘weight’, ‘citation counts’, or ‘temperature’.
  2. Put each observation in its own row.
  3. Don’t combine multiple pieces of information in one cell. Sometimes it just seems like one thing, but think if that’s the only way you’ll want to be able to use or sort that data.
  4. Again, leave the raw data raw - don’t edit it! Securely store a copy of it as read-only to avoid any undesired modifications to the original data.
  5. Export the cleaned data to a text-based format like CSV (comma-separated values) format. This format is accepted by many software applications without required dependencies.

For this workshop, we’re going to be working with data from a study of agricultural practices among farmers in two countries in eastern sub-Saharan Africa (Mozambique and Tanzania). Researchers conducted interviews with farmers in these countries to collect data on household statistics (e.g. number of household members, number of meals eaten per day, availability of water), farming practices (e.g. water usage), and assets (e.g. number of farm plots, number of livestock). They also recorded the dates and locations of each interview.

If they were to keep track of the data like this:

multiple-info example

What challenges they would experience? The problem is that number of livestock and type of livestock are in the same field. So, if they wanted to look at the average number of livestock owned, or the average number of each type of livestock, it would be hard to do this using this data setup. If instead we put the count of each type of livestock in its own column, this would make analysis much easier. The rule of thumb, when setting up a datasheet, is that each variable (in this case, each type of livestock) should have its own column, each observation should have its own row, and each cell should contain only a single value. Thus, the example above should look like this:

single-info example

Notice that this now allows us to make statements about the number of each type of animal that a farmer owns, while still allowing us to say things about the total number of livestock. All we need to do is sum the values in each row to find a total. We’ll be learning how to do this computationally and reproducibly later in this workshop.

Workshop Data

The data used in these lessons are taken from interviews of farmers in two countries in eastern sub-Saharan Africa (Mozambique and Tanzania). These interviews were conducted between November 2016 and June 2017 and probed household features (e.g., construction materials used, number of household members), agricultural practices (e.g. water usage), and assets (e.g., number and types of livestock).

This is a real dataset, however, it has been simplified for this workshop. If you’re interested in exploring the full dataset further, you can download it from Figshare and work with it using exactly the same tools we’ll learn about today.

For more information about the dataset and to download it from Figshare, check out the Social Sciences workshop data page. This is also a good reminder that if you are reusing a dataset or a combination of datasets to generate your own spreadsheet, you should always include citations to correctly attribute data creators.

Exercise

We’re going to take a messy version of the SAFI data and describe how we would clean it up.

  1. Download the SAFI messy data.
  2. Open up the data in Google Sheets and make a copy, renaming it with your last name (e.g., SAFI-messy-Curty), to ensure we all working on our individual copies and that we are not modifying the original (raw) data.
  3. Notice that there are two tabs. Two researchers conducted the interviews, one in Mozambique and the other in Tanzania. They both structured their data tables in a different way. Now, you’re the person in charge of this project and you want to be able to start analyzing the data.
  4. With the person next to you, identify what is wrong with this spreadsheet. Discuss the steps you would need to take to clean up the two tabs, and to put them all together in one spreadsheet.

After you follow these four steps, in pairs, take a look at the main problems you identify with this dataset and discuss how you could fix the issues you have identified.

Solution

  • Take about 10 minutes to work on this exercise.
  • All the mistakes listed in the next episode are present in the messy dataset. If this exercise is done during a workshop, ask people what they saw as wrong with the data. As they bring up different points, you can refer to the next episode or expand a bit on the point they brought up.

Handy References

Two excellent references on spreadsheet organization are:

  • Karl W. Broman & Kara H. Woo, Data Organization in Spreadsheets, Vol. 72, Issue 1, 2018, The American Statistician.

https://www.tandfonline.com/doi/full/10.1080/00031305.2017.1375989

Metadata

Recording data about your data (“metadata”) is essential. You may be on intimate terms with your dataset while you are collecting and analysing it, but the chances that you will still remember the exact wording of the question you asked about your informants’ water use (the data recorded in the column water use), for example, are slim.

As well, there are many reasons other people may want to examine or use your data - to understand your findings, to verify your findings, to review your submitted publication, to replicate your results, to design a similar study, or even to archive your data for access and re-use by others. While digital data by definition are machine-readable, understanding their meaning is a job for human beings. The importance of documenting your data during the collection and analysis phase of your research cannot be overestimated, especially if your research is going to be part of the scholarly record.

However, metadata should not be contained in the data file itself. Unlike a table in a paper or a supplemental file, metadata (in the form of legends) should not be included in a data file since this information is not data, and including it can disrupt how computer programs interpret your data file. Rather, metadata should be stored as a separate file in the same directory as your data file, preferably in plain text format with a name that clearly associates it with your data file. Because metadata files are free text format, they also allow you to encode comments, units, information about how null values are encoded, etc. that are important to document but can disrupt the formatting of your data file.

Some of this information may be familiar to learners who conduct analyses on survey data or other data sets that come with codebooks. Codebooks will often describe the way a variable has been constructed, what prompt was associated with it in an survey or interview, and what the meaning of various values are. For example, the General Social Survey maintains their entire codebook online. Looking at an entry for a particular variable, such as the variable SEX, provides valuable information about what survey waves the variable covers, and the meaning of particular values.

Additionally, file or database level metadata describes how files that make up the dataset relate to each other; what format are they are in; and whether they supersede or are superseded by previous files. A folder-level README.txt file is the classic way of accounting for all the files and folders in a project.

Metadata are most useful when they follow a standard. For example, the Data Documentation Initiative (DDI) provides a standardized way to document metadata at various points in the research cycle. Research librarians may have specific expertise in this area, and can be helpful resources for thinking about ways to purposefully document metatdata as part of your research.

(Text on metadata adapted from the online course Research Data MANTRA by EDINA and Data Library, University of Edinburgh. MANTRA is licensed under a Creative Commons Attribution 4.0 International License.)

Exercise

Download a clean version of this dataset SAFI_clean.csv, open the file in Google Sheets and save a copy in your own Google Drive. This data has many more variables that were not included in the messy spreadsheet and is formatted according to tidy data principles.

Discuss this data with a partner and make a list of some of the types of metadata that should be recorded about this dataset. It may be helpful to start by asking yourself: 1) “What is not immediately obvious to me about this data? and, 2) What questions would I need to know the answers to in order to analyze and interpret this data?

Solution

Some types of metadata that should be recorded and made available with the data are:

  • the exact wording of questions used in the interviews (if interviews were structured) or general prompts used (if interviews were semi-structured)
  • a description of the type of data allowed in each column (e.g. the allowed range for numerical data with a restricted range, a list of allowed options for categorical variables, whether data in a numerical column should be continuous or discrete)
  • definitions of any categorical variables (e.g. definitions of “burntbricks” and “sunbricks”)
  • definitions of what was counted as a “room”, a “plot”, etc. (e.g. was there a minimum size)
  • learners may come up with additional questions to add to this list

Key Points

  • Never modify your raw data. Always make a copy before making any changes.

  • Keep track of all of the steps you take to clean your data.

  • Organize your data according to tidy data principles.

  • Record metadata in a separate plain text file.


Formatting Problems

Overview

Teaching: 20 min
Exercises: 0 min
Questions
  • What are some common challenges with formatting data in spreadsheets and how can we avoid them?

Objectives
  • Recognize and resolve common spreadsheet formatting problems.

Common Spreadsheet Errors

This lesson is meant to be used as a reference for discussion as learners identify issues with the messy dataset discussed in the previous lesson. Instructors: don’t go through this lesson except to refer to responses to the exercise in the previous lesson.

There are a few potential errors to be on the lookout for in your own data as well as data from collaborators or the Internet. If you are aware of the errors and the possible negative effect on downstream data analysis and result interpretation, it might motivate yourself and your project members to try and avoid them. Making small changes to the way you format your data in spreadsheets, can have a great impact on efficiency and reliability when it comes to data cleaning and analysis.

Using multiple tables

A common strategy is creating multiple data tables within one spreadsheet. This confuses the computer, so don’t do this! When you create multiple tables within one spreadsheet, you’re drawing false associations between things for the computer, which sees each row as an observation. You’re also potentially using the same field name in multiple places, which will make it harder to clean your data up into a usable form. The example below depicts the problem:

multiple tables

In the example above, the computer will see (for example) row 24 and assume that all columns A-J refer to the same sample. This row actually represents two distinct samples (information about livestock for informant 1 and information about plots for informant 2). Other rows are similarly problematic.

Using multiple tabs

But what about workbook tabs? That seems like an easy way to organize data, right? Well, yes and no. When you create extra tabs, you fail to allow the computer to see connections in the data that are there (you have to introduce spreadsheet application-specific functions or scripting to ensure this connection). Say, for instance, you make a separate tab for each day you take a measurement.

This isn’t good practice for two reasons: 1) you are more likely to accidentally add inconsistencies to your data if each time you take a measurement, you start recording data in a new tab, and 2) even if you manage to prevent all inconsistencies from creeping in, you will add an extra step for yourself before you analyze the data because you will have to combine these data into a single datatable. You will have to explicitly tell the computer how to combine tabs - and if the tabs are inconsistently formatted, you might even have to do it manually.

For these and other reasons, it is good practice to avoid creating new tabs to organize your spreadsheet data. The next time you’re entering data, and you go to create another tab or table, ask yourself if you could avoid adding this tab by adding another column to your original spreadsheet. You may, however, use a new tab to store notes about your data, such as steps you’ve taken to clean or manipulate your data.

Your data sheet might get very long over the course of the experiment. This makes it harder to enter data if you can’t see your headers at the top of the spreadsheet. But don’t repeat your header row. These can easily get mixed into the data, leading to problems down the road.

Instead you can freeze the column headers so that they remain visible even when you have a spreadsheet with many rows.

Documentation on how to freeze column headers

Not filling in zeros

It might be that when you’re measuring something, it’s usually a zero, say the number of cows that an informant has, in a region where most farmers have goats and no cows. Why bother writing in the number zero in that column, when it’s mostly zeros?

filling in zeros

However, there’s a difference between a zero and a blank cell in a spreadsheet. To the computer, a zero is actually data. You measured or counted it. A blank cell means that it wasn’t measured and the computer will interpret it as an unknown value (otherwise known as a null value).

The spreadsheets or statistical programs will likely mis-interpret blank cells that you intend to be zeros. By not entering the value of your observation, you are telling your computer to represent that data as unknown or missing (null). This can cause problems with subsequent calculations or analyses. For example, the average of a set of numbers which includes a single null value is always null (because the computer can’t guess the value of the missing observations). Because of this, it’s very important to record zeros as zeros and truly missing data as nulls.

Using problematic null values

Example: using -999 or other numerical values (or zero) to represent missing data.

Solution: One common practice is to record unknown or missing data as -999, 999, or 0. Many statistical programs will not recognize that these are intended to represent missing (null) values. How these values are interpreted will depend on the software you use to analyze your data. It is essential to use a clearly defined and consistent null indicator. Blanks (most applications) and NA (for R) are good choices. White et al. (2013), explain good choices for indicating null values for different software applications in their article: Nine simple ways to make it easier to (re)use your data. Ideas in Ecology and Evolution.

White et al.

Using formatting to convey information

As a rule of thumb, formatting should never replace semantics. Create codes and dictionaries to convey meaning and allow interpretation of your dataset. Some may think colors can make data processing and cleaning decisions more visual, but they can compromise the data. Why?

Example: highlighting cells, rows or columns that should be excluded from an analysis, leaving blank rows to indicate separations in data.

formatting

Solution: create a new field to encode which data should be excluded.

good formatting

Using formatting to make the data sheet look pretty

Example: merging cells.

Solution: If you’re not careful, formatting a worksheet to be more aesthetically pleasing can compromise your computer’s ability to see associations in the data. Merged cells will make your data unreadable by statistics software. Consider restructuring your data in such a way that you will not need to merge cells to organize your data. Also, if you save a workbook in another file format, such as .csv, the styling will be lost anyways.

Placing comments or units in cells

Example: Some of your informants only irrigate their plots at certain times of the year. You’ve added this information as notes directly into the cell with the data.

Solution: Most analysis software can’t see/read comments, and would be confused by comments placed within your data cells. As described above for formatting, create another field if you need to add notes to cells. Similarly, don’t include units in cells: ideally, all the measurements you place in one column should be in the same unit, but if for some reason they aren’t, create another field and specify the units the cell is in.

comments in cells

Entering more than one piece of information in a cell

Example: Your informant has multiple livestock of different types. You record this information as “3, (oxen , cows)” to indicate that there are three total livestock, which is a mixture of oxen and cows.

Solution: Don’t include more than one piece of information in a cell. This will limit the ways in which you can analyze your data. If you need both these types of information (the total number of animals and the types), design your data sheet to include this information. For example, include a separate column for each type of livestock.

Using problematic field names

Choose descriptive field names, but be careful not to include spaces, numbers, or special characters of any kind. Spaces can be misinterpreted by parsers that use whitespace as delimiters and some programs don’t like field names that are text strings that start with numbers.

Underscores (_) are a good alternative to spaces. Consider writing names in camel case (like this: ExampleFileName) to improve readability. Remember that abbreviations that make sense at the moment may not be so obvious in 6 months, but don’t overdo it with names that are excessively long. Including the units in the field names avoids confusion and enables others to readily interpret your variable names. Avoid starting variable names with numbers, as this may cause problems with some analysis software.

Examples

problematic field names

Using special characters in data

Example: You treat your spreadsheet program as a word processor when writing notes, for example copying data directly from Word or other applications.

Solution: This is a common strategy. For example, when writing longer text in a cell, people often include line breaks, em-dashes, etc in their spreadsheet. Also, when copying data in from applications such as Word, formatting and fancy non-standard characters (such as left- and right-aligned quotation marks) are included. When exporting this data into a coding/statistical environment or into a relational database, dangerous things may occur, such as lines being cut in half and encoding errors being thrown.

General best practice is to avoid adding characters such as newlines, tabs, and vertical tabs. In other words, treat a text cell as if it were a simple web form that can only contain text and spaces.

Key Points

  • Avoid using multiple tables within one spreadsheet.

  • Avoid spreading data across multiple tabs (but do use a new tab to record data cleaning or manipulations).

  • Record zeros as zeros.

  • Use an appropriate null value to record missing data.

  • Don’t use formatting to convey information or to make your spreadsheet look pretty.

  • Place comments in a separate column.

  • Record units in column headers.

  • Include only one piece of information in a cell.

  • Avoid spaces, numbers and special characters in column headers.

  • Avoid special characters in your data.


Dates as Data

Overview

Teaching: 10 min
Exercises: 10 min
Questions
  • What are good approaches for handling dates in spreadsheets?

Objectives
  • Recognise problematic or suspicious date formats.

  • Use formulas to separate dates into their component values (e.g. Month, Day, Year).

Date formats in spreadsheets

Dates in spreadsheets are often stored in a single column. While this seems like a logical way to record dates when you are entering them, or visually reviewing data, it’s not actually the best practice for preparing data for analysis.

When working with data, your goal is to have as little ambiguity as possible. Ambiguity can creep into your data when working with dates when there are regional variations either in your observations and when you or your team might be working with different versions or suites of software products (e.g., Google Sheets, LibreOffice, Microsoft Excel, Numbers, Gnumeric).

To avoid ambiguity between regional differences in date formatting and compatibility across spreadsheet software programs, a good practice is to divide dates into components in different columns - DAY (DD), MONTH (MM), and YEAR (YYYY).

When working with dates it’s also important to remember that functions are guaranteed to be compatible only within the same family of software products. If you need to export your data and conserve the timestamps, you are better off handling dates using the ISO format we will cover in a bit.

One of the other reasons dates can be tricky is that most spreadsheet programs have “useful features” which can change the way dates are displayed - but not stored. The image below demonstrates some of the many date formatting options.

Many formats, many ambiguities

Download and make a copy of the file if you want to follow this portion along.

You may explore more options available on Google Sheets (Format> Number> Custom date and time...):

Many formats, many ambiguities-2

Dates stored as integers

The first thing you need to know is that as most spreadsheets programs, Google Sheets stores dates as numbers. Because DATE values are stored as integers, you can use them in arithmetic expressions. For example, you can subtract a DATE value from another DATE value. This serial number represents the number of days from December 31, 1899. But how can you obtain that number for the 10/10/2022 date? We may use the DATEVALUE function to obtain that number by typing in and use the autocomplete feature: =DATEVALUE(F2), and we will get 44844. As for any other function in spreadsheets, we can reuse that and apply to the cells bellow.

More on Google Sheets Dates

  • Inputs to DATE must always be numbers - if a string or a reference to a cell containing a string is provided, an error will be returned.
  • DATE will silently recalculate numeric dates which fall outside of valid month or day ranges. For example, DATE(1969,13,1), which specifies the illegal month 13, will create a date of 1/1/1970. Similarly, DATE(1969,1,32), which specifies the non-existent 32nd day of January, will create a date of 2/1/1969.

Using functions we can add days, months or years to a given date. Say you had a research plan where you needed to conduct interviews with a set of informants every ninety days for a year.

In our example above, in a new cell you could type:

=B2+90

And it would return

8-Jan

The format will follow the cell of reference in the formula. Test it if you would like to see the results for other cells. 

Regional date formatting

When you enter a date into a spreadsheet it looks like a date although the spreadsheet program may display different text from what you input. It does this to be ‘helpful’ but it often is not.

For example if you enter ‘7/12/88’ into a Excel spreadsheet it may display as ‘07/12/1988’ (depending on your version of Excel). These are different ways of formatting the same date.

Different countries also write dates differently. If you are in the UK or Brazil, for example, you will interpret the date above as the 7th day of December, however a researcher from the US will interpret the same entry as the 12th day of July. For Google Sheets, you may adjust the time and locale settings accordingly. But imagine you are working on a project with multiple collaborators from different countries. You can always define rules for data entry, but an easier way to prevent any future issues with dates is to treat them not as a single data point, but as three distinct pieces of data. Separating dates into their component parts will avoid this confusion, while also giving the added benefit of allowing you to compare, for example data collected in January of multiple years with data collected in February of multiple years.

If you choose to store dates as a single string, use ISO 8601

If you choose keeping all date elements in one cell, to avoid regional conflicts, we recommend you use the International Standard Organization (ISO) 8601 format for dates which represents date and time by starting with the year, followed by the month, the day, the hour, the minutes, seconds and milliseconds. For example, 2020-07-10 15:00:00.000, represents the 10th of July 2020 at 3 p.m. (in local time as there is no time zone offset specified. Such strings will be correctly sorted in ascending or descending order, and following this standard will ensure your dates will be correctly processed by the receiving software. Remember to note the use of the ISO standard in your data documentation.

Separating dates into components

Download and open the SAFI-dates.xlsx file. This file contains a subset of the data from the SAFI interviews, including the dates on which the interviews were conducted. Make a copy and rename it with your last name (e.g., SAFI-dates-Curty), to ensure we will all be working on our individual copies of the file.

Extract the components of the date to new columns. For this we can use the built in functions:

=MONTH()
=DAY()
=YEAR()

Apply each of these formulas to its entire column. Make sure the new column is formatted as a number and not as a date.

We now have each component of our date isolated in its own column. This will allow us to group our data with respect to month, year, or day of month for our analyses and will also prevent problems when passing data between different versions of spreadsheet software (as for example when sharing data with collaborators in different countries).

Solution

dates exercise 1

Note that this solution shows the dates in MM_DD_YEAR format.

Default year

What happens if you type in another cell 11/17? How should Day, Month, and Year columns should populate for this new data point. What year is shown in the Year column?

Solution

If no year is specified, the spreadsheet program will assume you mean the current year and will insert that value. This may be incorrect if you are working with historical data so be very cautious when working with data that does not have a year specified within its date variable.

Proceed with Extra Caution when Dealing with Historical Dates

In short, spreadsheet programs were not designed to operate historical dates well. Google Sheets can’t handle well dates prior to 1900-01-01. And Excel can’t handle well anything prior to 1900-03-01 because they have considered 1900 a leap year, when in reality it is not!

If you’re mixing historic data from before and after this date, it will translate only the post-1900 dates into its internal format, thus resulting in mixed data. If you’re working with historic data, be extremely careful with your dates!

For historical dates, it is best practice to follow the advice for separating dates into components above and use a numerical value that is positive or negative depending on the era. For dates that are before the Common Era (often abbreviated as B.C.E/BCE, or older nomenclature as B.C./BC), the years should be represented as negative values, while dates after 0 (abbreviated CE/C.E. or AD/A.D.) should be represented as positive values. For example, the date of March 15, 44 BCE in Day, Month, and Year columns could look like: 15,3,-44.

Key Points

  • Use extreme caution when working with date data.

  • Splitting dates into their component values can make them easier to handle.


Quality Assurance

Overview

Teaching: 15 min
Exercises: 10 min
Questions
  • How can we carry out basic quality assurance in spreadsheets?

  • How can we track changes on Google Sheets and prevent changes if needed?

Objectives
  • Apply quality assurance techniques to limit incorrect data entry.

  • Keep track of changes and protect sheets from changes if needed.

When you have a well-structured data table, you can use several simple techniques within your spreadsheet to ensure the data you enter is free of errors.

Spreadsheet Encoding

One issue you may encounter when dealing with Spreadsheets is files with different encoding. As with other text files, a CSV file has a character encoding that determines how characters are stored. The de facto standard today is to use Unicode (UTF-8), which supports all the alphabets of the world and lots of other characters like technical symbols and emojis.

Google sheets uses UTF-8 by deafult. Unfortunately, Excel still exports files using legacy character encodings like ASCII. If you select the wrong type of character encoding (often by double clicking on a .csv file and having Excel or other spreadsheet application automatically open it), you can have unreadable data, and even possible data corruption. To avoid this issue, make sure to save your files as CSV UTF-8.

Validating data on input

When we input data into a cell of a spreadsheet we are typically not constrained in the type of data we enter. But the wrong data type will result in future errors.

In any one column, the spreadsheets software will not warn us if we start to enter a mix of text, numbers or dates in different rows. Even if we are not facing constraints from the software, as a researcher we often anticipate that all data in one column will be of a certain type. It is also possible that the nature of the data contained in the table allows us to place additional restrictions on the acceptable values for cells in a column. For example a column recording age in years should be numeric, greater than 0 and is unlikely to be greater than 120.

As Excel, Google Sheets allows us to specify data validations to be applied to cell contents. If the validation fails, an error is raised and the data we entered does not go into the particular cell. In addition to providing validation when we enter data, Google Sheets allows us to add validations to data that has already been entered. The validation is not applied retrospectively, so that data are removed. Instead, if a particular cell would fail the validation check a triangle is placed in the top left corner of the cell as a warning, as shown below:

Image of data validation failure

We will be working with a couple of examples of data validation rules but many others exist. For an overview of data validation options and rules available in Google Sheets, check out the support page on data validation.

We will look at three examples:

  1. Restricting data to a numeric range
  2. Restricting data to integers and integers
  3. Restricting data to entries from a list

Restricting data to a numeric range

Looking again at the clean version of the SAFI dataset, we see that there are several columns with numeric data. One example of this is the column no_membrs representing the number of people in the household. We would assume that there are no families with greater than 30 members. Can can assure there won’t be issues with data entry in this column?

1. Select the no_membrs column.

2. On the Data tab select and then Data Validation. The following popout will appear:

Image of data validation tab in Google Sheets

3. Select the Number for criteria and set the range. Here we will choose a minimum of 1 and a maximum of 30.

Your data table will now not allow you to enter a value that violates the data validation rule you have created. To test this out, try to enter a new value above 30 in one of the cells for this collumn and see what happens.

Tip: Customizing warning messages

While setting validation rules you can customize the resulting message to be more informative by entering your own message by selecting to show validation help text in Appearance. You can also choose to let invalid data result in a warning rather than reject input, by selecting that option. Rejecting an input will result on an empty cell after a “there is a problem” warning.

Restricting data to integers

We have covered data validation for simple number ranges, but what if we expect that a cell will only take integers? Let’s say we want to determine that we will only accept integers for the column rooms. How can we accomplish that? In Excel, we could choose from data validation options whole number, but this option is not available on Google Sheets. Still, we can apply this restriction using a conditional function:

  1. Select the rooms column and then, Data, followed by Data validation;
  2. Choose the custom formula option and type =IF(G2=INT(G2),True,False);
  3. Make sure you select the correct data range (starting from the first cell specificed in the formula) and save it.

This function is instructing Google Sheets to give you a test result if the G2 column has an integer number or not. If any decimal numbers are entered you will get a warning.

Exercise

Apply a new data validation rule of your choice to one of the other numeric columns in this data table. Discuss with the person sitting next to you what a reasonable rule would be for the column you’ve selected. Be sure to create an informative input message.

Restricting data to entries from a list

Quality assurance can make data entry easier as well as more robust. For example, if you use a list of options to restrict data entry, the spreadsheet will provide you with a drop-downlist of the available items. So, instead of trying to remember how to spell “mabatisloping”, or whether or not you capitalized “cement” you can select the right option from the list.

1. Select the respondent_wall_type column.

2. On the Data tab select Data Validation and then choose “List of items`

3. Type a list of all the values that you want to be accepted in this column, separated by a comma (with no spaces). For us this will be “grass, muddaub, burntbricks, sunbricks, cement”.

4. Create a meaningful input message, then save it.

We have now provided a restriction that will be validated each time we try and enter data into the selected cells. When a cell in this column is selected, a drop-down arrow will appear. When you click the arrow, you will be able to select a value from your list. If you type a value which is not on the list, you will get an error message. This not only prevents data input errors, but also makes it easier and faster to enter data.

Exercise

Apply a new data validation rule to one of the other categorical columns in this data table. Discuss with the person sitting next to you what a reasonable rule would be for the column you’ve selected. Be sure to create an informative input message.

Tip

Typing a list of values where only a few possible values exist (like “grass, muddaub, burntbricks, sunbricks, cement”) might be convenient, but if the list is longer it makes sense to create it as a small table (in a separate tab of the workbook). We can give the table a name and then reference the table name as the source of acceptable inputs when the source box appears in the Data Validation pop-out.

Using a table in this way makes the data entry process more flexible. If you add or remove contents from the table, then these are immediately reflected in any new cell entries based on this source. You can also have different cells refer to the same table of acceptable inputs.

Tracking Changes and Protecting Sheets and Cells from Changing

Google Sheets track changes don’t work like Google Docs which have a feature that displays all changes automatically. Instead, you have to track changes using a few other features. Google Sheets allow you to:

Most of these features however, will be only enabled if the spreadsheet is native, or if the file is saved as Google Sheets. This means that some of the functions won’t work if you open a .xls or .csv file you have uploaded to your drive. So, make sure, to check whether under File, you have the option Save as Google Sheets (.Gsheet). If so (SAFI_dates is one example) make sure to save it. Do not worry about that extension for now. In the next episode we will learn how to transform the file to other formats.

Think about .Gsheet as a hyperlink, but instead of taking you to a specific website, it drops you directly into the Google Sheets document within your Drive. When you double-click on a Google Sheet in your Google Driver folder, it automatically opens a new browser tab or window with the document displayed in Drive

Seeing the history of changes for a cell

If you wish to inspect updates within a cell you may select the cell, then right-click show edit history. The information regarding the last editing action will be displayed and you can use the arrows to navigate the all history of edits for that specific cell.

Seeing the complete version history of the document

You may also check versions for the entire document by either clicking on the Last edit info displayed right next to Help on the top menu, or File then Version History and See version history. You can name versions to keep better track of them. Note that this won’t change the file name!

Setting up different permissions

As in other programs from the Google office suite, when collaborating in a spreadsheet you may grant different access permissions (viewer, commenter and or editor) to each of the group members listed. If you click on the top gear icon you will be able to change some of the access levels within these categories. For example, you can disable the ability of editors to change permissions and share, limiting them to only contribute to the spreadsheet.

Protecting Sheets and Ranges

If you’d like to protect sheets or specific cells to be modified, even by people you have previously granted editor permission, you can accomplish that by selecting Data and then, Protect sheets and range.

Setting up to get notifications about changes

You may also activate notifications to receive alerts for every comment or edit in the spreadsheet, by choosing Notification settings under Tools.

Key Points

  • Always copy your original spreadsheet file and work with a copy so you don’t affect the raw data.

  • Use data validation to prevent accidentally entering invalid data.


Exporting Data

Overview

Teaching: 10 min
Exercises: 5 min
Questions
  • How can we export data from spreadsheets in a way that is useful for downstream applications?

Objectives
  • Store spreadsheet data in universal file formats.

  • Export data from a spreadsheet to a CSV file.

A Note on Data Formats

Storing the data you’re going to work with for your analyses file formats such as *.xls or *.xlsx - isn’t a good idea. Why?

Even though *.xls or *.xlsx extensions are considered open standards (ECMA-376, ISO/IEC 29500), they are closely tied to the MS Office Excel application, and these files can contain lots of features that only interpreted by Excel.

As an example of inconsistencies in data storage, do you remember how we talked about how spreasheet programs stores dates earlier? It turns out that there are multiple defaults for different versions of the software, and you can switch between them all. So, say you’re compiling Excel-stored data from multiple sources. There’s dates in each file- Excel interprets them as their own internally consistent serial numbers. When you combine the data, the software take the serial number from the place you’re importing it from, and interpret it using the rule set for the software you’re using to work on the spreadsheet. Essentially, you could be adding errors to your data, and it wouldn’t necessarily be flagged by any data cleaning methods if your ranges overlap.

Storing data in a universal, open, and static format will help deal with this problem. Try tab-delimited (tab separated values or TSV) or comma-delimited (comma separated values or CSV). CSV files are plain text files where the columns are separated by commas, hence ‘comma separated values’ or CSV. The advantage of a CSV file over an Excel/SPSS/etc. file is that we can open and read a CSV file using just about any software, including plain text editors like TextEdit or NotePad.

Data in a CSV file can also be easily imported into other formats and environments, such as SQLite and R. We’re not tied to a certain version of a certain expensive program when we work with CSV files, so it’s a good format to work with for maximum portability and endurance. Most spreadsheet programs can save to delimited text formats like CSV easily, although they may give you a warning during the file export.

To export files as CSV in Google Sheets, choose file, then download and select the format.

Working Off-line

There might be cases researchers will be on the field, working from remote locations, and won’t have stable or any connectivity to work with Google Sheets. The good news is that you can turn-on the offline access option to create, view, and edit files.

Before you turn on offline access:

How to turn on offline access:

Open Google Chrome or Microsoft Edge. If on Chrome, make sure you are signed into your account. Go to drive.google.com/drive/settings or click on the gear icon on Google Drive, then Settings. Check the box next to “Create, open, and edit your recent Google Docs, Sheets, and Slides files on this device while offline.”

What to do when your data contain commas?

In some datasets, the data values themselves may include commas (,). In that case, you need to make sure that the commas are properly escaped when saving the file. Otherwise, the software which you use (including Excel) will most likely incorrectly display the data in columns. This is because the commas which are a part of the data values will be interpreted as delimiters.

If you are working with data that contains commas, the fields should be enclosed with double quotes. The spreadsheet software should do the right thing (LibreOffice provides comprehensive options to import and export CSV files). However, it is always a good idea to double check that the file you are exporting can be read in correctly. For more of a discussion on data formats and potential issues with commas within datasets see the Ecology Spreadsheets lesson discussion page.

Key Points

  • Data stored in common spreadsheet formats will often not be read correctly into data analysis software, introducing errors into your data.

  • Exporting data from spreadsheets to formats like CSV or TSV puts it in a format that can be used consistently by most programs.