Cleaning Data in R Course

Name: Cleaning Data in R
Rating: 4.775840597758406 (803 reviews)

Skill Level: Intermediate

Updated 08/2024

Learn to clean data as quickly and accurately as possible to help you move from raw data to awesome insights.

Course Description

Overcome Common Data Problems Like Removing Duplicates in R

It's commonly said that data scientists spend 80% of their time cleaning and manipulating data and only 20% of their time analyzing it. The time spent cleaning is vital since analyzing dirty data can lead you to draw inaccurate conclusions.

In this course, you’ll learn a variety of techniques to help you clean dirty data using R. You’ll start by converting data types, applying range constraints, and dealing with full and partial duplicates to avoid double-counting.

Delve into Advanced Data Challenges

Once you’ve practiced working on common data issues, you’ll move on to more advanced challenges such as ensuring consistency in measurements and dealing with missing data. After every new concept, you’ll have the chance to complete a hands-on exercise to cement your knowledge and build your experience.

Learn to Use Record Linkage During Data Cleaning

Record linkage merges datasets whose values cannot be matched exactly because of issues such as typos or different spellings. You’ll explore this technique in the final chapter and practice it by joining two restaurant review datasets into a single dataset.

Prerequisites

Joining Data with dplyr

Common Data Problems

In this chapter, you'll learn how to overcome some of the most common dirty data problems. You'll convert data types, apply range constraints to remove future data points, and remove duplicated data points to avoid double-counting.
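The three steps named above can be sketched in a few lines of base R. This is an illustrative sketch only: the `rides` data frame, its column names, and the fixed `today` date are assumptions made up for the example, and the course exercises may use other packages such as dplyr.

```r
# Hypothetical bike-share rides; all column names and values are made up.
rides <- data.frame(
  ride_id  = c(1, 2, 2, 3, 4),
  duration = c("12 minutes", "35 minutes", "35 minutes", "8 minutes", "20 minutes"),
  date     = c("2024-01-05", "2024-02-10", "2024-02-10", "2999-01-01", "2024-03-15"),
  stringsAsFactors = FALSE
)

# Data type constraints: strip the unit text, then convert to numeric and Date.
rides$duration_min <- as.numeric(sub(" minutes", "", rides$duration, fixed = TRUE))
rides$date <- as.Date(rides$date)

# Range constraint: drop rides dated after a reference "today".
today <- as.Date("2024-08-01")  # fixed (assumed) so the example is reproducible
rides <- rides[rides$date <= today, ]

# Uniqueness constraint: keep only the first copy of fully duplicated rows.
rides_clean <- rides[!duplicated(rides), ]

print(rides_clean)
```

With this sample data, the ride dated in the year 2999 is dropped by the range check and the repeated ride 2 is kept only once, so three rows remain.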

Data type constraints

50 XP

Common data types

100 XP

Converting data types

100 XP

Trimming strings

100 XP

Range constraints

50 XP

Ride duration constraints

100 XP

Back to the future

100 XP

Uniqueness constraints

50 XP

Full duplicates

100 XP

Removing partial duplicates

100 XP

Aggregating partial duplicates

100 XP

Categorical and Text Data

Categorical and text data can often be some of the messiest parts of a dataset due to their unstructured nature. In this chapter, you’ll learn how to fix whitespace and capitalization inconsistencies in category labels, collapse multiple categories into one, and reformat strings for consistency.
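As a rough sketch of these fixes in base R, assuming a made-up vector of animal labels and a made-up set of phone numbers (the course exercises may use packages such as stringr or forcats instead):

```r
# Hypothetical category labels with whitespace and capitalization problems.
animals <- c("Dog", " dog", "CAT ", "cat", "Puppy", "kitten")

# Fix whitespace and capitalization inconsistencies.
cleaned <- tolower(trimws(animals))

# Collapse multiple categories into one (this grouping is an assumption).
collapsed <- cleaned
collapsed[cleaned %in% c("dog", "puppy")]  <- "dog"
collapsed[cleaned %in% c("cat", "kitten")] <- "cat"
animal_factor <- factor(collapsed)
print(table(animal_factor))

# Reformat strings for consistency: keep only the digits of each phone number.
phones <- c("(555) 123-4567", "555.123.4567", "555 123 4567")
digits <- gsub("[^0-9]", "", phones)
print(digits)
```

After cleaning, the six labels fall into two categories of three each, and all three phone numbers reduce to the same ten-digit string.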

Checking membership

50 XP

Members only

100 XP

Not a member

100 XP

Categorical data problems

50 XP

Identifying inconsistency

100 XP

Correcting inconsistency

100 XP

Collapsing categories

100 XP

Cleaning text data

50 XP

Detecting inconsistent text data

100 XP

Replacing and removing

100 XP

Invalid phone numbers

100 XP

Advanced Data Problems

In this chapter, you’ll dive into more advanced data cleaning problems, such as ensuring that weights are all written in kilograms instead of pounds. You’ll also gain invaluable skills that will help you verify that values have been added correctly and that missing values don’t negatively impact your analyses.
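A minimal base R sketch of these three checks might look like the following. The `people` data frame and every column name in it are hypothetical, chosen only to illustrate unit uniformity, cross field validation, and completeness.

```r
# Hypothetical records; column names and values are made up for illustration.
people <- data.frame(
  weight      = c(70, 154, 82, NA),
  weight_unit = c("kg", "lb", "kg", "kg"),
  part_a      = c(10, 20, 30, 40),
  part_b      = c(5, 5, 5, 5),
  total       = c(15, 25, 40, 45),
  stringsAsFactors = FALSE
)

# Uniformity: express every weight in kilograms (1 lb = 0.45359237 kg).
is_lb <- people$weight_unit == "lb"
people$weight_kg <- ifelse(is_lb, people$weight * 0.45359237, people$weight)

# Cross field validation: does each total equal the sum of its parts?
people$total_ok <- people$part_a + people$part_b == people$total

# Completeness: count missing weights, then keep only complete rows.
n_missing <- sum(is.na(people$weight_kg))
complete  <- people[!is.na(people$weight_kg), ]

print(people)
```

Dropping incomplete rows is only one way to treat missing data; whether it is appropriate depends on why the values are missing.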

Uniformity

50 XP

Date uniformity

100 XP

Currency uniformity

100 XP

Cross field validation

50 XP

Validating totals

100 XP

Validating age

100 XP

Completeness

50 XP

Types of missingness

100 XP

Visualizing missing data

100 XP

Treating missing data

100 XP

Record Linkage

Record linkage is a powerful technique for merging multiple datasets when the values that should match contain typos or different spellings. In this chapter, you'll learn how to link records by calculating the similarity between strings. You’ll then use your new skills to join two restaurant review datasets into one clean master dataset.
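To give a feel for string similarity, here is a small sketch that uses `adist()` from base R, which computes a generalized Levenshtein (edit) distance. The city names and the cutoff of 2 edits are assumptions for illustration, and the course itself may rely on dedicated packages rather than `adist()`.

```r
# Hypothetical reference list and messy observed values.
valid_cities <- c("new york", "los angeles", "chicago")
observed     <- c("new yrok", "los angelos", "chicago", "chcago")

# Edit distance between every observed value and every valid value.
# Rows correspond to observed values, columns to valid values.
d <- adist(observed, valid_cities)

# For each observed value, find the closest valid value and its distance.
nearest   <- apply(d, 1, which.min)
best_dist <- d[cbind(seq_along(observed), nearest)]

# Accept the match only when the distance is small (cutoff is an assumption).
fixed <- ifelse(best_dist <= 2, valid_cities[nearest], NA)

print(data.frame(observed, fixed, best_dist))
```

A small distance means the two strings differ by only a few single-character edits, which is why typos such as "chcago" map cleanly to "chicago".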

Comparing strings

50 XP

Calculating distance

50 XP

Small distance, small difference

100 XP

Fixing typos with string distance

100 XP

Generating and comparing pairs

50 XP

Link or join?

100 XP

Pair blocking

100 XP

Comparing pairs

100 XP

Scoring and linking

50 XP

Score then select or select then score?

100 XP

Putting it together

100 XP

Congratulations!

50 XP

Earn Statement of Accomplishment

Add this credential to your LinkedIn profile, resume, or CV
Share it on social media and in your performance review

Don’t just take our word for it

4.7 from 803 reviews

SANTIAGO

2 days ago

This course helps clear up the difficulties that weigh on us when we want to clean data. It really brings a lot of clarity and some great tools.

FAQs

Why is data cleaning important?

Cleaning data is an essential part of the data management process. It ensures that you are working with relevant data, in a standardized format, and will derive accurate insights instead of harming your analysis by including duplicates, errors, or synonyms within the dataset. Working with clean data can help improve efficiency, reduce overall costs, and increase ROI for decisions made based on your data.

What is record linkage?

Record linkage is sometimes called entity resolution or data matching. It’s a useful technique for finding records that refer to the same subject but are written differently and do not share a common identifier. For example, your dataset might include people who live in "NY" and people who live in "New York". You would want to treat these two values as the same place rather than counting them as two separate places.
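The NY example can be sketched with a simple lookup in base R. This shows only the standardization idea, not a full record linkage workflow, and the `state` values and `lookup` table are made up for illustration.

```r
# Hypothetical place values that refer to the same locations in different ways.
state <- c("NY", "New York", "ny", "California")

# Map each known variant (lowercased) to one standard name.
lookup <- c(
  "ny"         = "New York",
  "new york"   = "New York",
  "california" = "California"
)
standardized <- unname(lookup[tolower(state)])

# Count people per place after standardization.
print(table(standardized))
```

A lookup like this works when the variants are known in advance; when they are not, string similarity can suggest likely matches instead.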

Who needs to learn how to clean data?

Data cleaning is usually carried out by data engineers, data managers, or data quality analysts. However, it’s a useful skill set for anybody who uses data for analysis and decision making on a regular basis, such as managers, marketers, finance professionals, and HR professionals. Learning data cleaning approaches and techniques will also help you spot poor data and prepare your data properly for analysis.

Is this course suitable for beginners?

This course is not suitable for complete beginners. You will need introductory R knowledge and we recommend that you take the Joining Data with dplyr course in order to fully benefit from this course.

Join over 19 million learners and start Cleaning Data in R today!