Interactive Course

Data Cleaning in R

Develop the skills you need to go from raw data to awesome insights as quickly and accurately as possible.

  • 4 hours
  • 13 Videos
  • 44 Exercises
  • 2,473 Participants
  • 3,700 XP

Loved by learners at thousands of top companies:

mercedes-grey.svg
ea-grey.svg
axa-grey.svg
intel-grey.svg
forrester-grey.svg
roche-grey.svg

Course Description

It's commonly said that data scientists spend 80% of their time cleaning and manipulating data and only 20% of their time analyzing it. The time spent cleaning is vital since analyzing dirty data can lead you to draw inaccurate conclusions.

In this course, you'll learn how to clean dirty data. Using R, you'll learn how to identify values that don't look right and fix dirty data by converting data types, filling in missing values, and using fuzzy string matching. As you learn, you’ll brush up on your skills by working with real-world datasets, including bike-share trips, customer asset portfolios, and restaurant reviews—developing the skills you need to go from raw data to awesome insights as quickly and accurately as possible!

  1. 1

    Common Data Problems

    Free

    In this chapter, you'll learn how to overcome some of the most common dirty data problems. You'll convert data types, apply range constraints to remove future data points, and remove duplicated data points to avoid double-counting.

  2. Advanced Data Problems

    In this chapter, you’ll dive into more advanced data cleaning problems, such as ensuring that weights are all written in kilograms instead of pounds. You’ll also gain invaluable skills that will help you verify that values have been added correctly and that missing values don’t negatively impact your analyses.

  3. Categorical and Text Data

    Categorical and text data can often be some of the messiest parts of a dataset due to their unstructured nature. In this chapter, you’ll learn how to fix whitespace and capitalization inconsistencies in category labels, collapse multiple categories into one, and reformat strings for consistency.

  4. Record Linkage

    Record linkage is a powerful technique used to merge multiple datasets together, used when values have typos or different spellings. In this chapter, you'll learn how to link records by calculating the similarity between strings—you’ll then use your new skills to join two restaurant review datasets into one clean master dataset.

  1. 1

    Common Data Problems

    Free

    In this chapter, you'll learn how to overcome some of the most common dirty data problems. You'll convert data types, apply range constraints to remove future data points, and remove duplicated data points to avoid double-counting.

  2. Categorical and Text Data

    Categorical and text data can often be some of the messiest parts of a dataset due to their unstructured nature. In this chapter, you’ll learn how to fix whitespace and capitalization inconsistencies in category labels, collapse multiple categories into one, and reformat strings for consistency.

  3. Advanced Data Problems

    In this chapter, you’ll dive into more advanced data cleaning problems, such as ensuring that weights are all written in kilograms instead of pounds. You’ll also gain invaluable skills that will help you verify that values have been added correctly and that missing values don’t negatively impact your analyses.

  4. Record Linkage

    Record linkage is a powerful technique used to merge multiple datasets together, used when values have typos or different spellings. In this chapter, you'll learn how to link records by calculating the similarity between strings—you’ll then use your new skills to join two restaurant review datasets into one clean master dataset.

What do other learners have to say?

Devon

“I've used other sites, but DataCamp's been the one that I've stuck with.”

Devon Edwards Joseph

Lloyd's Banking Group

Louis

“DataCamp is the top resource I recommend for learning data science.”

Louis Maiden

Harvard Business School

Ronbowers

“DataCamp is by far my favorite website to learn from.”

Ronald Bowers

Decision Science Analytics @ USAA

Maggie Matsui
Maggie Matsui

Content Developer at DataCamp

Maggie is a Content Developer at DataCamp. She holds a Bachelor's degree in Statistics and Computer Science from Brown University, where she spent lots of time teaching math, programming, and statistics as a tutor and teaching assistant. She's passionate about teaching all things data-related and making programming accessible to everyone.

See More
Icon Icon Icon professional info