Skip to main content
HomeTutorialsR Programming

Merging Datasets in R

In this tutorial, you'll learn to join multiple datasets in R.
Oct 29, 2018  · 8 min read

In the applied setting, data are hosted on different servers and exist in many different files. When the data you need come from multiple sources, it's essential to know how to aggregate them so that you lose as little information as possible and make pairings that actually make sense given the structure of your data.

This tutorial will walk you through:

  • Merging datasets horizontally and vertically
  • What primary keys are and how they add structure to your data
  • Different types of joins (e.g., left-join, inner-join, full-join) and how to choose among them
  • A common problem to watch out for and how to resolve it

Concatenating Datasets

At the high level, there are two ways you can merge datasets; you can add information by adding more rows or by adding more columns to your dataset. In general, when you have datasets that have the same set of columns or have the same set of observations, you can concatenate them vertically or horizontally, respectively. Let's learn by seeing some examples.

Adding datasets vertically

When you have multiple datasets that have the same set of columns, you can concatenate one dataset to another, vertically. That is, keeping the columns of your dataset, you can add more rows to it. Having such information in one file will make it easier for you to aggregate and see the bigger picture without the hassle of switching back and forth between multiple files and losing track of them.

Dataset 1

Make Num models
Honda 63
BMW 10

Dataset 2

Make Num models
Ford 26
Tesla 4

It's important to note that if you have the same observation across multiple datasets and you concatenate them vertically using rbind(), you'll end up with duplicate observations in your table. And though the two datasets must have the same set of variables (i.e., columns), they don't have to be in the same order. See for yourself in the console below!

In your workspace, there are two datasets called dataset1 and dataset2 you saw above. Try reordering the columns of dataset1. Call rbind() on dataset1 and dataset2 as well as reordered_dataset1 and dataset2.

dataset1 <- data.frame(make = c("Honda", "BMW"), num_models = c(63, 10)) dataset2 <- data.frame(make = c("Ford", "Tesla"), num_models = c(26, 4)) # Add datasets vertically rbind(dataset1, dataset2) # Reorder columns reordered_dataset1 <- dataset1[, c(2, 1)] # See that rbind() is robust to column ordering rbind(reordered_dataset1, dataset2)

After rbind(), your results should have information on all four car makes in one table like this:

Vertically concatenated dataset

Make Num models
Honda 63
BMW 10
Ford 26
Tesla 4

Start Learning R For Free

Introduction to R

BeginnerSkill Level
4 hr
2.7M learners
Master the basics of data analysis in R, including vectors, lists, and data frames, and practice R with real data sets.
See DetailsRight Arrow

Intermediate R

BeginnerSkill Level
6 hr
598.6K learners
Continue your journey to becoming an R ninja by learning about conditional statements, loops, and vector functions.

Adding datasets horizontally

When you have datasets representing the same set of observations, you can concatenate such datasets horizontally. This time, keeping the rows of your dataset, you can add more columns to it. In such cases, you should check that the order of the observations are the same. If your datasets have a different amount of rows, or they have the same number of rows, but the rows are ordered inconsistently, you can pair one set of columns with the other set in a way that doesn't make sense.

Let's extend the example above for an example. Suppose you have two data files, one containing the car make and number of unique models offered, and another containing the car make and total sales:

Number of unique models offered

Make Num models
Honda 63
BMW 10
Ford 26
Tesla 4

Total sales

Make Sales
Ford 119157
BMW 25908
Honda 188328
Tesla 29975

It's important to note that if you have the same observation across multiple datasets and you concatenate them horizontally using cbind() , you'll end up with redundant columns in your table. And though the two datasets contain related information, ordering of rows matter!

In the console below, call cbind() on models and sales and print out the result:

models <- data.frame(make = c("Honda", "BMW", "Ford", "Tesla"), num_models = c(63, 10, 26, 4)) sales <- data.frame(make = c("Ford", "BMW", "Honda", "Tesla"), sales = c(119157, 25908, 188328, 29975)) # Add datasets horizontally cbind(models, sales)

You should have gotten something like this:

Models & Sales

Make Num models Make Sales
Honda 63 Ford 119157
BMW 10 BMW 25908
Ford 26 Honda 188328
Tesla 4 Tesla 29975

Do you see the problem here? This data is not tidy!

According to the principles of tidy data taught in this foundational course, each observation of a dataset should be represented in a unique row. And what if you had information only on some of the rows in one dataset and wanted to add information only for those you have more information on? Put another way, what if you wanted to add more columns from one dataset to another, but these datasets don't have the same number of observations?

Primary Key and Foreign Keys

The first step when looking to combine datasets is to look for the primary key of your dataset. The primary key is the column or set of columns that uniquely identifies each observation in your dataset. In the example with car makes, the number of unique models offered, and total sales, the primary key of your datasets is the make column.

Now, we can perform joins, the standard way to merge datasets into a single table.

Types of Joins

There are many types of joins. You can learn how to augment columns from one dataset with columns from another with mutating joins, how to filter one dataset against another with filtering joins, and how to sift through datasets with set operations in the Joining Data in R with dplyr course. Below are some of the most common.

left_join(x, y): returns all rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.

inner_join(x, y): returns all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combinations of the matches are returned.

full_join(x, y): returns all rows and all columns from both x and y. Where there are not matching values, the function returns NA for the one missing.

The joins mentioned above are examples of mutating joins since they combine variables from two datasets.

Missing Keys

Suppose you have two datasets. The first dataset is called size and contains the names of people and their shirt size:

 > size
name size
1  Tom    M
2  Dan   XL
3 Keil    S

The second dataset is called color and contains the people's surnames, shirt color preferences, and stores some information in the row.names attribute:

 > color
     surname color
Tom     Jeon  <NA>
Dan    Smith  Dark
Bob McLadden Light

Notice what could go wrong here? Two-table joins can get complicated when there are missing keys or duplicate keys. In this example, R's data frames store important information in the row.names attribute. When this is the case, you won't be able to access the key with a join function, as join functions can only access columns of the data frame.

The trick to easily fix this problem is to use the rownames_to_column() function from the tibble package. It returns a copy of your dataset with the row names added to the data as a column. The first argument to rownames_to_column() is your data frame object and the second argument is a string specifying the name of the column you want to add.

Try exploring in the console below. Datasets size and color are pre-loaded on your workspace.

color <- data.frame(surname = c("Jeon", "Smith", "McLadden"), color = c(NA, "Dark", "Light")) row.names(color) <- c("Tom", "Dan", "Bob") size <- data.frame(name = c("Tom", "Dan", "Keil"), size = c("M", "XL", "S")) # Explore here!

Final Thoughts

In the real world, data may come split across many datasets, across many different formats. Because R is designed to work with single tables of data, manipulating and combining datasets into a single table is an essential skill. Complete the Importing & Cleaning Data with R skill track and learn to parse and combine data in any format. Check out DataCamp's R Data Import tutorial. Happy learning!

Topics

Learn more about R

Course

Introduction to R

4 hr
2.7M
Master the basics of data analysis in R, including vectors, lists, and data frames, and practice R with real data sets.
See DetailsRight Arrow
Start Course
See MoreRight Arrow
Related

tutorial

Merging Data in R

Merging data is a common task in data analysis, especially when working with large datasets. The merge function in R is a powerful tool that allows you to combine two or more datasets based on shared variables.
DataCamp Team's photo

DataCamp Team

4 min

tutorial

Subsetting Datasets in R

Subsetting datasets is a crucial skill for any data professional. Learn and practice subsetting data in this quick interactive tutorial!
Tom Jeon's photo

Tom Jeon

16 min

tutorial

Joining DataFrames in pandas Tutorial

In this tutorial, you’ll learn various ways in which multiple DataFrames could be merged in python using Pandas library.
DataCamp Team's photo

DataCamp Team

19 min

tutorial

Introduction to Data frames in R

This tutorial takes course material from DataCamp's Introduction to R course and allows you to practice data frames.
Ryan Sheehy's photo

Ryan Sheehy

5 min

tutorial

Sorting Data in R

How to sort a data frame in R.
DataCamp Team's photo

DataCamp Team

2 min

tutorial

Importing Data Into R - Part Two

A tutorial on importing data into R. The focus is on reading data from sources like statistical software, databases, webscraping, and more.
Karlijn Willems's photo

Karlijn Willems

34 min

See MoreSee More