Blog

Merging Datasets in R

In this tutorial, you'll learn to join multiple datasets in R.

Oct 2018 · 8 min read

In the applied setting, data are hosted on different servers and exist in many different files. When the data you need come from multiple sources, it's essential to know how to aggregate them so that you lose as little information as possible and make pairings that actually make sense given the structure of your data.

This tutorial will walk you through:

Merging datasets horizontally and vertically
What primary keys are and how they add structure to your data
Different types of joins (e.g., left-join, inner-join, full-join) and how to choose among them
A common problem to watch out for and how to resolve it

Concatenating Datasets

At the high level, there are two ways you can merge datasets; you can add information by adding more rows or by adding more columns to your dataset. In general, when you have datasets that have the same set of columns or have the same set of observations, you can concatenate them vertically or horizontally, respectively. Let's learn by seeing some examples.

Adding datasets vertically

When you have multiple datasets that have the same set of columns, you can concatenate one dataset to another, vertically. That is, keeping the columns of your dataset, you can add more rows to it. Having such information in one file will make it easier for you to aggregate and see the bigger picture without the hassle of switching back and forth between multiple files and losing track of them.

Dataset 1

Make	Num models
Honda	63
BMW	10

Dataset 2

Make	Num models
Ford	26
Tesla	4

It's important to note that if you have the same observation across multiple datasets and you concatenate them vertically using rbind(), you'll end up with duplicate observations in your table. And though the two datasets must have the same set of variables (i.e., columns), they don't have to be in the same order. See for yourself in the console below!

In your workspace, there are two datasets called dataset1 and dataset2 you saw above. Try reordering the columns of dataset1. Call rbind() on dataset1 and dataset2 as well as reordered_dataset1 and dataset2.


            dataset1 <- data.frame(make = c("Honda", "BMW"),
                           num_models = c(63, 10))
            dataset2 <- data.frame(make = c("Ford", "Tesla"),
                          num_models = c(26, 4))


            # Add datasets vertically
            rbind(dataset1, dataset2)

            # Reorder columns
            reordered_dataset1 <- dataset1[, c(2, 1)]

            # See that rbind() is robust to column ordering
            rbind(reordered_dataset1, dataset2)

After rbind(), your results should have information on all four car makes in one table like this:

Vertically concatenated dataset

Make	Num models
Honda	63
BMW	10
Ford	26
Tesla	4

Start Learning R For Free

Introduction to R

BeginnerSkill Level

4 hr

2.7M learners

Master the basics of data analysis in R, including vectors, lists, and data frames, and practice R with real data sets.

See Details

Intermediate R

BeginnerSkill Level

6 hr

588.2K learners

Continue your journey to becoming an R ninja by learning about conditional statements, loops, and vector functions.

See Details

Adding datasets horizontally

When you have datasets representing the same set of observations, you can concatenate such datasets horizontally. This time, keeping the rows of your dataset, you can add more columns to it. In such cases, you should check that the order of the observations are the same. If your datasets have a different amount of rows, or they have the same number of rows, but the rows are ordered inconsistently, you can pair one set of columns with the other set in a way that doesn't make sense.

Let's extend the example above for an example. Suppose you have two data files, one containing the car make and number of unique models offered, and another containing the car make and total sales:

Number of unique models offered

Make	Num models
Honda	63
BMW	10
Ford	26
Tesla	4

Total sales

Make	Sales
Ford	119157
BMW	25908
Honda	188328
Tesla	29975

It's important to note that if you have the same observation across multiple datasets and you concatenate them horizontally using cbind() , you'll end up with redundant columns in your table. And though the two datasets contain related information, ordering of rows matter!

In the console below, call cbind() on models and sales and print out the result:


            models <- data.frame(make = c("Honda", "BMW", "Ford", "Tesla"),
                     num_models = c(63, 10, 26, 4))
            sales <- data.frame(make = c("Ford", "BMW", "Honda", "Tesla"),
                    sales = c(119157, 25908, 188328, 29975))


            # Add datasets horizontally
            cbind(models, sales)

You should have gotten something like this:

Models & Sales

Make	Num models	Make	Sales
Honda	63	Ford	119157
BMW	10	BMW	25908
Ford	26	Honda	188328
Tesla	4	Tesla	29975

Do you see the problem here? This data is not tidy!

According to the principles of tidy data taught in this foundational course, each observation of a dataset should be represented in a unique row. And what if you had information only on some of the rows in one dataset and wanted to add information only for those you have more information on? Put another way, what if you wanted to add more columns from one dataset to another, but these datasets don't have the same number of observations?

Primary Key and Foreign Keys

The first step when looking to combine datasets is to look for the primary key of your dataset. The primary key is the column or set of columns that uniquely identifies each observation in your dataset. In the example with car makes, the number of unique models offered, and total sales, the primary key of your datasets is the make column.

Now, we can perform joins, the standard way to merge datasets into a single table.

Types of Joins

There are many types of joins. You can learn how to augment columns from one dataset with columns from another with mutating joins, how to filter one dataset against another with filtering joins, and how to sift through datasets with set operations in the Joining Data in R with dplyr course. Below are some of the most common.

left_join(x, y): returns all rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.

inner_join(x, y): returns all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combinations of the matches are returned.

full_join(x, y): returns all rows and all columns from both x and y. Where there are not matching values, the function returns NA for the one missing.

The joins mentioned above are examples of mutating joins since they combine variables from two datasets.

Missing Keys

Suppose you have two datasets. The first dataset is called size and contains the names of people and their shirt size:

 > size
name size
1  Tom    M
2  Dan   XL
3 Keil    S

The second dataset is called color and contains the people's surnames, shirt color preferences, and stores some information in the row.names attribute:

 > color
     surname color
Tom     Jeon  <NA>
Dan    Smith  Dark
Bob McLadden Light

Notice what could go wrong here? Two-table joins can get complicated when there are missing keys or duplicate keys. In this example, R's data frames store important information in the row.names attribute. When this is the case, you won't be able to access the key with a join function, as join functions can only access columns of the data frame.

The trick to easily fix this problem is to use the rownames_to_column() function from the tibble package. It returns a copy of your dataset with the row names added to the data as a column. The first argument to rownames_to_column() is your data frame object and the second argument is a string specifying the name of the column you want to add.

Try exploring in the console below. Datasets size and color are pre-loaded on your workspace.


            color <- data.frame(surname = c("Jeon", "Smith", "McLadden"), color = c(NA, "Dark", "Light"))
            row.names(color) <- c("Tom", "Dan", "Bob")
            size <- data.frame(name = c("Tom", "Dan", "Keil"), size = c("M", "XL", "S"))


            # Explore here!

Final Thoughts

In the real world, data may come split across many datasets, across many different formats. Because R is designed to work with single tables of data, manipulating and combining datasets into a single table is an essential skill. Complete the Importing & Cleaning Data with R skill track and learn to parse and combine data in any format. Check out DataCamp's R Data Import tutorial. Happy learning!

Topics

R Programming

Data Science

Data Analysis

Learn more about R

Certification available

Course

Introduction to R

4 hr

2.7M

Master the basics of data analysis in R, including vectors, lists, and data frames, and practice R with real data sets.

See Details

Start Course

Certification available

Course

Intermediate R

6 hr

590K

Continue your journey to becoming an R ninja by learning about conditional statements, loops, and vector functions.

See Details

Start Course

Certification available

Course

Introduction to Data Visualization with ggplot2

4 hr

131.3K

Learn to produce meaningful and beautiful data visualizations with ggplot2 by understanding the grammar of graphics.

See Details

Start Course

blog

Navigating R Certifications in 2024: A Comprehensive Guide

Explore DataCamp's R programming certifications with our guide. Learn about Data Scientist and Data Analyst paths, preparation tips, and career advancement.

Matt Crabtree

8 min

blog

Data Sets and Where to Find Them: Navigating the Landscape of Information

Are you struggling to find interesting data sets to analyze? Do you have a plan for what to do with a sample data set once you’ve found it? If you have data set questions, this tutorial is for you! We’ll go over the basics of what a data set is, where to find one, how to clean and explore it, and where to showcase your data story.

Amberle McKee

11 min

blog

You’re invited! Join us for Radar: The Analytics Edition

Join us for a full day of events sharing best practices from thought leaders in the analytics space

DataCamp Team

4 min

blog

10 Top Data Analytics Conferences for 2024

Discover the most popular analytics conferences and events scheduled for 2024.

Javier Canales Luna

7 min

blog

A Data Science Roadmap for 2024

Do you want to start or grow in the field of data science? This data science roadmap helps you understand and get started in the data science landscape.

Mark Graus

10 min

tutorial

Mastering Bayesian Optimization in Data Science

Unlock the power of Bayesian Optimization for hyperparameter tuning in Machine Learning. Master theoretical foundations and practical applications with Python to enhance model accuracy.

Zoumana Keita

11 min

See More See More

Concatenating Datasets

Adding datasets vertically

Dataset 1

Dataset 2

Vertically concatenated dataset

Start Learning R For Free

.css-1531qan{-webkit-text-decoration:none;text-decoration:none;color:inherit;}Introduction to R

Intermediate R

Adding datasets horizontally

Number of unique models offered

Total sales

Models & Sales

Primary Key and Foreign Keys

Types of Joins

Missing Keys

Final Thoughts

Navigating R Certifications in 2024: A Comprehensive Guide

Data Sets and Where to Find Them: Navigating the Landscape of Information

You’re invited! Join us for Radar: The Analytics Edition

10 Top Data Analytics Conferences for 2024

A Data Science Roadmap for 2024

Mastering Bayesian Optimization in Data Science

Introduction to R

Intermediate R

Introduction to Data Visualization with ggplot2

Navigating R Certifications in 2024: A Comprehensive Guide

Data Sets and Where to Find Them: Navigating the Landscape of Information

You’re invited! Join us for Radar: The Analytics Edition

10 Top Data Analytics Conferences for 2024

A Data Science Roadmap for 2024

Mastering Bayesian Optimization in Data Science

Introduction to R