Merging Datasets in R
In the applied setting, data are hosted on different servers and exist in many different files. When the data you need come from multiple sources, it's essential to know how to aggregate them so that you lose as little information as possible and make pairings that actually make sense given the structure of your data.
This tutorial will walk you through:
- Merging datasets horizontally and vertically
- What primary keys are and how they add structure to your data
- Different types of joins (e.g., left-join, inner-join, full-join) and how to choose among them
- A common problem to watch out for and how to resolve it
At the high level, there are two ways you can merge datasets; you can add information by adding more rows or by adding more columns to your dataset. In general, when you have datasets that have the same set of columns or have the same set of observations, you can concatenate them vertically or horizontally, respectively. Let's learn by seeing some examples.
Adding datasets vertically
When you have multiple datasets that have the same set of columns, you can concatenate one dataset to another, vertically. That is, keeping the columns of your dataset, you can add more rows to it. Having such information in one file will make it easier for you to aggregate and see the bigger picture without the hassle of switching back and forth between multiple files and losing track of them.
It's important to note that if you have the same observation across multiple datasets and you concatenate them vertically using
rbind(), you'll end up with duplicate observations in your table. And though the two datasets must have the same set of variables (i.e., columns), they don't have to be in the same order. See for yourself in the console below!
In your workspace, there are two datasets called
dataset2 you saw above. Try reordering the columns of
dataset2 as well as
dataset1 <- data.frame(make = c("Honda", "BMW"), num_models = c(63, 10)) dataset2 <- data.frame(make = c("Ford", "Tesla"), num_models = c(26, 4))
# Add datasets vertically rbind(dataset1, dataset2) # Reorder columns reordered_dataset1 <- dataset1[, c(2, 1)] # See that rbind() is robust to column ordering rbind(reordered_dataset1, dataset2)
rbind(), your results should have information on all four car makes in one table like this:
Vertically concatenated dataset
Adding datasets horizontally
When you have datasets representing the same set of observations, you can concatenate such datasets horizontally. This time, keeping the rows of your dataset, you can add more columns to it. In such cases, you should check that the order of the observations are the same. If your datasets have a different amount of rows, or they have the same number of rows, but the rows are ordered inconsistently, you can pair one set of columns with the other set in a way that doesn't make sense.
Let's extend the example above for an example. Suppose you have two data files, one containing the car make and number of unique models offered, and another containing the car make and total sales:
Number of unique models offered
It's important to note that if you have the same observation across multiple datasets and you concatenate them horizontally using
cbind() , you'll end up with redundant columns in your table. And though the two datasets contain related information, ordering of rows matter!
In the console below, call
sales and print out the result:
models <- data.frame(make = c("Honda", "BMW", "Ford", "Tesla"), num_models = c(63, 10, 26, 4)) sales <- data.frame(make = c("Ford", "BMW", "Honda", "Tesla"), sales = c(119157, 25908, 188328, 29975))
# Add datasets horizontally cbind(models, sales)
You should have gotten something like this:
Models & Sales
Do you see the problem here? This data is not tidy!
According to the principles of tidy data taught in this foundational course, each observation of a dataset should be represented in a unique row. And what if you had information only on some of the rows in one dataset and wanted to add information only for those you have more information on? Put another way, what if you wanted to add more columns from one dataset to another, but these datasets don't have the same number of observations?
Primary key and foreign keys
The first step when looking to combine datasets is to look for the primary key of your dataset. The primary key is the column or set of columns that uniquely identifies each observation in your dataset. In the example with car makes, the number of unique models offered, and total sales, the primary key of your datasets is the
Now, we can perform joins, the standard way to merge datasets into a single table.
Types of joins
There are many types of joins. You can learn how to augment columns from one dataset with columns from another with mutating joins, how to filter one dataset against another with filtering joins, and how to sift through datasets with set operations in the Joining Data in R with dplyr course. Below are some of the most common.
left_join(x, y): returns all rows from
x, and all columns from
y. Rows in
x with no match in
y will have
NA values in the new columns. If there are multiple matches between
y, all combinations of the matches are returned.
inner_join(x, y): returns all rows from
x where there are matching values in
y, and all columns from
y. If there are multiple matches between
y, all combinations of the matches are returned.
full_join(x, y): returns all rows and all columns from both
y. Where there are not matching values, the function returns
NA for the one missing.
The joins mentioned above are examples of mutating joins since they combine variables from two datasets.
Suppose you have two datasets. The first dataset is called
size and contains the names of people and their shirt size:
> size name size 1 Tom M 2 Dan XL 3 Keil S
The second dataset is called
color and contains the people's surnames, shirt color preferences, and stores some information in the
> color surname color Tom Jeon <NA> Dan Smith Dark Bob McLadden Light
Notice what could go wrong here? Two-table joins can get complicated when there are missing keys or duplicate keys. In this example, R's data frames store important information in the
row.names attribute. When this is the case, you won't be able to access the key with a join function, as join functions can only access columns of the data frame.
The trick to easily fix this problem is to use the
rownames_to_column() function from the
tibble package. It returns a copy of your dataset with the row names added to the data as a column. The first argument to
rownames_to_column() is your data frame object and the second argument is a string specifying the name of the column you want to add.
Try exploring in the console below. Datasets
color are pre-loaded on your workspace.
color <- data.frame(surname = c("Jeon", "Smith", "McLadden"), color = c(NA, "Dark", "Light")) row.names(color) <- c("Tom", "Dan", "Bob") size <- data.frame(name = c("Tom", "Dan", "Keil"), size = c("M", "XL", "S"))
# Explore here!
In the real world, data may come split across many datasets, across many different formats. Because R is designed to work with single tables of data, manipulating and combining datasets into a single table is an essential skill. Complete the Importing & Cleaning Data with R skill track and learn to parse and combine data in any format. Happy learning!