An Introduction to Cleaning Data in R
Course Excerpt: An Introduction to Cleaning Data in R
The data cleaning process
Hi, I'm Nick. I'm a data scientist at DataCamp and I'll be your instructor for this course on Cleaning Data in R. Let's kick things off by looking at an example of dirty data.
You're looking at the top and bottom, or `head` and `tail`, of a dataset containing various weather metrics recorded in the city of Boston over a 12-month period. At first glance these data may not appear very dirty. The information is already organized into rows and columns, which is not always the case. The rows are numbered and the columns have names. In other words, it's already in table format, similar to what you might find in a spreadsheet document. We wouldn't be this lucky if, for example, we were scraping a webpage, but we have to start somewhere.
Despite the dataset's deceptively neat appearance, a closer look reveals many issues that should be dealt with prior to, say, attempting to build a statistical model to predict weather patterns in the future. For starters, the first column, `X` (all the way on the left), appears to be meaningless; it's not clear what columns such as `X2` represent (and if they represent days of the month, then we have time represented in both rows and columns); the different types of measurements contained in the `measure` column should probably each have their own column; there are a bunch of `NA`s at the bottom of the data; and the list goes on. Don't worry if these things are not immediately obvious to you; they will be by the end of the course. In fact, in the last chapter of this course, you will clean this exact same dataset from start to finish using all of the amazing new things you've learned.
Dirty data are everywhere. In fact, most real-world datasets start off dirty in one way or another, but by the time they make their way into textbooks and courses, most have already been cleaned and prepared for analysis. This is convenient when all you want to talk about is how to analyze or model the data, but it can leave you at a loss when you're faced with cleaning your own data.
With the rise of so-called "big data", data cleaning is more important than ever before. Every industry - finance, healthcare, retail, hospitality, and even education - is now doggy-paddling in a large sea of data. And as the data get bigger, the number of things that can go wrong does too. Each imperfection becomes harder to find when you can't simply look at the entire dataset in a spreadsheet on your computer.
In fact, data cleaning is an essential part of the data science process. In simple terms, you might break this process down into four steps: collecting or acquiring your data, cleaning your data, analyzing or modeling your data, and reporting your results to the appropriate audience. If you try to skip the second step, you'll often run into problems getting the raw data to work with traditional tools for analysis in, say, R or Python. This could be true for a variety of reasons. For example, many common algorithms require variables to be arranged into columns and for missing values to be either removed or replaced with non-missing values, neither of which was the case with the weather data you just saw.
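To make the point about missing values concrete, here is a minimal sketch using a made-up vector of temperatures (the name `temps` and its values are assumptions for illustration, not the weather data shown earlier). It shows a common function returning `NA` until the missing value is removed.

```r
# Made-up temperatures with one missing value (not the real weather data)
temps <- c(41, NA, 38, 45)

# Many functions return NA when any input value is missing
mean(temps)

# Option 1: ask the function to ignore missing values
mean(temps, na.rm = TRUE)

# Option 2: drop the missing values before analysis
temps_complete <- temps[!is.na(temps)]
length(temps_complete)
```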
Not only is data cleaning an essential part of the data science process - it's also often the most time-consuming part. As the New York Times reported in a 2014 article called "For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights", "Data scientists ... spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets." Unfortunately, data cleaning is not as sexy as training a neural network to identify images of cats on the internet, so it's generally not talked about in the media nor is it taught in most intro data science and statistics courses. No worries, we're here to help.
In this course, we'll break data cleaning down into a three-step process: exploring your raw data, tidying your data, and preparing your data for analysis. Each of the first three chapters of this course will cover one of these steps in depth, then the fourth chapter will require you to use everything you've learned to take the weather data from raw to ready for analysis.
Let's jump right in!
Exploring raw data
The first step in the data cleaning process is exploring your raw data. We can think of data exploration itself as a three-step process consisting of understanding the structure of your data, looking at your data, and visualizing your data.
Understanding the structure of your data
To understand the structure of your data, you have several tools at your disposal in R. Here, we read in a simple dataset called `lunch`, which contains information on the number of free, reduced price, and full price school lunches served in the US from 1969 through 2014. First, we check the class of the `lunch` object to verify that it's a data frame, or a two-dimensional table consisting of rows and columns, of which each column is a single data type such as numeric, character, etc.
We then view the dimensions of the dataset with the `dim()` function. This particular dataset has 46 rows and 7 columns. `dim()` always displays the number of rows first, followed by the number of columns.
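Since the `lunch` data itself isn't reproduced here, the sketch below uses a small made-up data frame as a stand-in. The name `lunch_demo`, the column `perc_free_red`, and all values are assumptions for illustration only.

```r
# Hypothetical stand-in for the lunch data; the values are made up
lunch_demo <- data.frame(
  year = 2010:2014,
  avg_reduced = c(3.0, 2.8, 2.7, 2.6, 2.5),
  perc_free_red = c(65, 66, 68, 70, 71)
)

# Verify that the object is a data frame
class(lunch_demo)  # "data.frame"

# Rows first, then columns
dim(lunch_demo)    # 5 3
```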
Next, we take a look at the column names of `lunch` with the `names()` function. Each of the 7 columns has a name, such as `avg_reduced`.
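Here is how that call might look on a made-up stand-in data frame. Only `avg_reduced` comes from the course; `lunch_demo`, the other column names, and the values are hypothetical.

```r
# Hypothetical stand-in for the lunch data; the values are made up
lunch_demo <- data.frame(
  year = 2010:2014,
  avg_reduced = c(3.0, 2.8, 2.7, 2.6, 2.5),
  perc_free_red = c(65, 66, 68, 70, 71)
)

# Column names, returned as a character vector
names(lunch_demo)

# Check whether a particular column is present
"avg_reduced" %in% names(lunch_demo)
```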
Okay, so we're starting to get a feel for things, but let's dig deeper. The `str()` (for "structure") function is one of the most versatile and useful functions in the R language because it can be called on any object and will normally provide a useful and compact summary of its internal structure. When passed a data frame, as in this case, `str()` tells us how many rows and columns we have. Actually, the function refers to rows as observations and columns as variables, which, strictly speaking, is true in a tidy dataset, but not always the case as you'll see in the next chapter. In addition, you see the name of each column, followed by its data type and a preview of the data contained in it. The `lunch` dataset happens to be entirely integers and numerics. We'll have a closer look at these data types in chapter 3.
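As a quick sketch, here is `str()` called on a made-up stand-in data frame with one integer and one numeric column (`lunch_demo` and its values are hypothetical), and then on two other kinds of objects to show that it isn't limited to data frames.

```r
# Hypothetical stand-in for the lunch data; the values are made up
lunch_demo <- data.frame(
  year = 2010:2014,                         # integer column
  avg_reduced = c(3.0, 2.8, 2.7, 2.6, 2.5)  # numeric column
)

# Observations (rows), variables (columns), types, and a preview of each column
str(lunch_demo)

# str() can be called on other objects as well
str(1:3)
str(list(a = 1, b = "x"))
```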
The dplyr package offers a slightly different flavor of `str()` called `glimpse()`, which offers the same information, but attempts to preview as much of each column as will fit neatly on your screen. So here, we first load dplyr with the `library()` command, then call `glimpse()` with a single argument, `lunch`.
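A minimal sketch of those two steps follows, assuming dplyr is installed and using a made-up stand-in data frame (`lunch_demo` and its values are hypothetical). The `requireNamespace()` guard is an addition of this sketch so the code still runs where dplyr is missing.

```r
# Hypothetical stand-in for the lunch data; the values are made up
lunch_demo <- data.frame(
  year = 2010:2014,
  avg_reduced = c(3.0, 2.8, 2.7, 2.6, 2.5)
)

if (requireNamespace("dplyr", quietly = TRUE)) {
  # Load dplyr, then call glimpse() with the data frame as its only argument
  library(dplyr)
  glimpse(lunch_demo)
} else {
  # Fall back to base R when dplyr is not available
  str(lunch_demo)
}
```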
Another extremely helpful function is `summary()`, which, when applied to a data frame, provides a useful summary of each column. Since the `lunch` data are entirely integers and numerics, we see a summary of the distribution of each column including the minimum and maximum, the mean, and the 25th, 50th, and 75th percentiles (also referred to as the first quartile, median, and third quartile, respectively). As you'll soon see, when faced with character or factor variables, `summary()` will produce different summaries.
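The sketch below illustrates both behaviors on a made-up data frame. The name `demo`, its values, and in particular the `region` column are assumptions; the real `lunch` data contain no character columns.

```r
# Made-up data; the region column is hypothetical and added only for contrast
demo <- data.frame(
  year = 2010:2014,
  avg_reduced = c(3.0, 2.8, 2.7, 2.6, 2.5),
  region = c("north", "south", "north", "west", "south"),
  stringsAsFactors = FALSE
)

# Numeric column: min, quartiles, median, mean, and max
num_summary <- summary(demo$avg_reduced)
num_summary

# Factor: a count of each level instead of a distribution
fac_summary <- summary(factor(demo$region))
fac_summary

# Whole data frame: one summary per column
summary(demo)
```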
To review, you've seen how we can use the `class()` function to see the class of a dataset, the `dim()` function to view its dimensions, `names()` to see the column names, `str()` to view its structure, `glimpse()` to do the same in a slightly enhanced format, and `summary()` to see a helpful summary of each column.
Looking at and visualizing your data
Okay, so we've seen some useful summaries of our data, but there's no substitute for just looking at it. The `head()` function shows us the first 6 rows by default. If you add one additional argument, `n`, you can control how many rows to display. For example, `head(lunch, n = 15)` will display the first 15 rows of the data.
We can also view the bottom of `lunch` with the `tail()` function, which displays the last 6 rows by default, but that behavior can be altered in the same way with the `n` argument.
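Here is a short sketch of both functions on a made-up data frame with 46 rows, one per year from 1969 through 2014. The name `demo` and the `row_id` column are hypothetical.

```r
# Made-up data frame with one row per year, 46 rows in total
demo <- data.frame(year = 1969:2014, row_id = 1:46)

# First 6 rows by default
head(demo)

# First 15 rows
head(demo, n = 15)

# Last 6 rows by default
tail(demo)

# Last 3 rows
tail(demo, n = 3)
```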
Viewing the top and bottom of your data only gets you so far. Sometimes the easiest way to identify issues with the data is to plot them. Here, we use `hist()` to plot a histogram of the percent free and reduced lunch column, which quickly gives us a sense of the distribution of this variable. It looks like the value of this variable falls between 50 and 60 for 20 out of the 46 years contained in the `lunch` dataset.
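As a rough sketch, the code below draws a histogram of a made-up vector of percentages (the name `perc_free_red` and its values are assumptions, not the real column). It also keeps the object that `hist()` returns, so the bin counts can be read off directly rather than estimated from the picture.

```r
# Made-up percentages standing in for the free and reduced lunch column
perc_free_red <- c(15, 22, 31, 38, 44, 52, 55, 57, 58, 59, 63, 68)

# Draw the histogram and keep the object that hist() returns
h <- hist(perc_free_red,
          main = "Percent free and reduced (made-up values)",
          xlab = "Percent")

h$breaks  # bin boundaries chosen by hist()
h$counts  # number of values falling in each bin
```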
Finally, we can produce a scatter plot with the `plot()` function to look at the relationship between two variables. In this case, we clearly see that the percent of lunches that are either free or reduced price has been steadily rising over the years, going from roughly 15 to 70 percent between 1969 and 2014.
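A minimal sketch of such a scatter plot follows, using a handful of made-up points. The name `demo`, the column `perc_free_red`, and the intermediate values are assumptions chosen only to mimic a rising trend.

```r
# Made-up points mimicking a rising trend; not the real lunch data
demo <- data.frame(
  year = c(1969, 1980, 1990, 2000, 2014),
  perc_free_red = c(15, 40, 48, 57, 70)
)

# Scatter plot: first argument on the x-axis, second on the y-axis
plot(demo$year, demo$perc_free_red,
     xlab = "Year",
     ylab = "Percent free or reduced price")
```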
To review, `head()` and `tail()` can be used to view the top and bottom of your data, respectively. Of course, you can also just `print()` your data to the console, which may be okay when working with small datasets like `lunch`, but is definitely not recommended when working with larger datasets. `hist()` will show you a histogram of a single variable and `plot()` can be used to produce a scatter plot showing the relationship between two variables.