Ben Bolstad has completed

Cleaning Data in Python

4 hr

4,800 XP

Loved by learners at thousands of companies

Course Description

A vital component of data science involves acquiring raw data and getting it into a form ready for analysis. It is commonly said that data scientists spend 80% of their time cleaning and manipulating data, and only 20% of their time actually analyzing it. This course will equip you with all the skills you need to clean your data in Python, from learning how to diagnose problems in your data, to dealing with missing values and outliers. At the end of the course, you'll apply all of the techniques you've learned to a case study to clean a real-world Gapminder dataset.

1
Exploring your data
Free
Say you've just gotten your hands on a brand new dataset and are itching to start exploring it. But where do you begin, and how can you be sure your dataset is clean? This chapter will introduce you to data cleaning in Python. You'll learn how to explore your data with an eye for diagnosing issues such as outliers, missing values, and duplicate rows.
Diagnose data for cleaning
50 xp
Loading and viewing your data
100 xp
Further diagnosis
100 xp
Exploratory data analysis
50 xp
Calculating summary statistics
50 xp
Frequency counts for categorical data
100 xp
Visual exploratory data analysis
50 xp
Visualizing single variables with histograms
100 xp
Visualizing multiple variables with boxplots
100 xp
Visualizing multiple variables with scatter plots
100 xp
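The diagnostic steps named in this chapter (missing values, duplicate rows, frequency counts, summary statistics) can be sketched with pandas roughly as follows. This is a minimal illustration, not the course's own exercise code; the DataFrame and its column names `borough` and `cost` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "borough": ["Queens", "Bronx", "Queens", None, "Bronx"],
    "cost": [100.0, 250.0, 100.0, 75.0, 9000.0],
})

# Count missing values in each column.
missing_per_column = df.isna().sum()

# Count rows that exactly repeat an earlier row.
duplicate_rows = int(df.duplicated().sum())

# Frequency counts for a categorical column, including missing entries.
borough_counts = df["borough"].value_counts(dropna=False)

# Summary statistics; a max far above the quartiles hints at an outlier.
cost_summary = df["cost"].describe()

print(missing_per_column)
print(duplicate_rows)
print(borough_counts)
print(cost_summary)
```

Histograms, boxplots, and scatter plots of the same columns would be the visual counterpart of these checks.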
2
Tidying data for analysis
Learn about the principles of tidy data, and more importantly, why you should care about them and how they make data analysis more efficient. You'll gain first-hand experience with reshaping and tidying data using techniques such as pivoting and melting.
Tidy data
50 xp
Recognizing tidy data
50 xp
Reshaping your data using melt
100 xp
Customizing melted data
100 xp
Pivoting data
50 xp
Pivot data
100 xp
Resetting the index of a DataFrame
100 xp
Pivoting duplicate values
100 xp
Beyond melt() and pivot()
50 xp
Splitting a column with .str
100 xp
Splitting a column with .split() and .get()
100 xp
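As a rough sketch of the reshaping this chapter covers, the example below melts a wide table into long form and pivots it back. The data and the names `day`, `ozone`, `temp`, `measurement`, and `reading` are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical wide table: one column per measured variable.
wide = pd.DataFrame({
    "day": ["Mon", "Tue"],
    "ozone": [41.0, 36.0],
    "temp": [67.0, 72.0],
})

# Melt: keep "day" as the identifier, stack the other columns into rows.
long = wide.melt(id_vars="day", var_name="measurement", value_name="reading")

# Pivot: spread the measurements back out into columns.
restored = long.pivot(index="day", columns="measurement", values="reading")

# Turn the "day" index back into an ordinary column and drop the axis label.
restored = restored.reset_index()
restored.columns.name = None

print(long)
print(restored)
```

`pivot` raises an error when an index/column pair occurs more than once; in that situation `pivot_table` with an aggregation function is the usual alternative.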
3
Combining data for analysis
The ability to transform and combine your data is a crucial skill in data science, because your data may not always come in one monolithic file or table for you to load. A large dataset may be broken into separate datasets to make storage and sharing easier, but you still need to run your analysis on a single dataset. In this chapter, you'll learn how to combine datasets directly, and how to clean each dataset separately so that you can combine them later for analysis.
Concatenating data
50 xp
Combining rows of data
100 xp
Combining columns of data
100 xp
Finding and concatenating data
50 xp
Finding files that match a pattern
100 xp
Iterating and concatenating all matches
100 xp
Merge data
50 xp
1-to-1 data merge
100 xp
Many-to-1 data merge
100 xp
Many-to-many data merge
100 xp
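A minimal sketch of the two combining operations this chapter names, concatenation and merging, might look like this. The tables and the column names `site`, `visits`, and `region` are hypothetical.

```python
import pandas as pd

# Two hypothetical pieces of one dataset with the same columns.
part1 = pd.DataFrame({"site": ["A", "B"], "visits": [3, 5]})
part2 = pd.DataFrame({"site": ["C"], "visits": [7]})

# Concatenate rows; ignore_index=True rebuilds a clean 0..n-1 index.
combined = pd.concat([part1, part2], ignore_index=True)

# A lookup table keyed on the same "site" column.
sites = pd.DataFrame({
    "site": ["A", "B", "C"],
    "region": ["north", "south", "north"],
})

# Merge on the shared key; how="left" keeps every row of `combined`.
merged = combined.merge(sites, on="site", how="left")

print(combined)
print(merged)
```

When the pieces live in separate files, the same `pd.concat` call can be applied to a list of DataFrames loaded in a loop over the matching file names.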
4
Cleaning data for analysis
Dive into some of the grittier aspects of data cleaning. Learn about string manipulation and pattern matching to deal with unstructured data, and then explore techniques to deal with missing or duplicate data. You'll also learn the valuable skill of programmatically checking your data for consistency, which will give you confidence that your code is running correctly and that the results of your analysis are reliable.
Data types
50 xp
Converting data types
100 xp
Working with numeric data
100 xp
Using regular expressions to clean strings
50 xp
String parsing with regular expressions
100 xp
Extracting numerical values from strings
100 xp
Pattern matching
100 xp
Using functions to clean data
50 xp
Custom functions to clean data
100 xp
Lambda functions
100 xp
Duplicate and missing data
50 xp
Dropping duplicate data
100 xp
Filling missing data
100 xp
Testing with asserts
50 xp
Testing your data with asserts
100 xp
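The cleaning steps listed in this chapter can be combined in a short sketch like the one below: a regular expression validates strings before conversion, duplicates are dropped, missing values are filled, and asserts check the result. The data, the `to_number` helper, and the choice to fill with the column mean are all assumptions made for illustration.

```python
import re

import pandas as pd

# Hypothetical raw data: amounts stored as strings, some tips missing.
raw = pd.DataFrame({
    "amount": ["$12.50", "$7.00", "$12.50", None],
    "tip": [2.0, None, 2.0, 1.0],
})

# Matches a dollar sign, one or more digits, a dot, and two digits.
pattern = re.compile(r"^\$\d+\.\d{2}$")


def to_number(value):
    """Convert a '$12.50'-style string to a float, or NaN if it does not match."""
    if isinstance(value, str) and pattern.match(value):
        return float(value.lstrip("$"))
    return float("nan")


# Drop exact duplicate rows before doing anything else.
clean = raw.drop_duplicates().copy()

# Apply the custom function to every value in the column.
clean["amount"] = clean["amount"].apply(to_number)

# Fill missing tips with the mean of the observed tips (one possible choice).
clean["tip"] = clean["tip"].fillna(clean["tip"].mean())

# Asserts fail loudly if the cleaned data violates an expectation.
assert clean["tip"].notna().all()
assert (clean["tip"] >= 0).all()

print(clean)
```

The asserts are silent when the data passes and raise `AssertionError` otherwise, which makes them a cheap guard to keep at the end of a cleaning script.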
5
Case study
In this final chapter, you'll apply all of the data cleaning techniques you've learned in this course toward tidying a real-world, messy dataset obtained from the Gapminder Foundation. Once you're done, not only will you have a clean and tidy dataset, you'll also be ready to start working on your own data science projects using Python.
Putting it all together
50 xp
Exploratory analysis
50 xp
Visualizing your data
100 xp
Thinking about the question at hand
100 xp
Assembling your data
100 xp
Initial impressions of the data
50 xp
Reshaping your data
100 xp
Checking the data types
100 xp
Looking at country spellings
100 xp
More data cleaning and processing
100 xp
Wrapping up
100 xp
Final thoughts
50 xp

datasets

Air quality, DOB job application filings, Ebola, Gapminder, Tuberculosis, Tips, NYC Uber data

collaborators

Hugo Bowne-Anderson

Yashas Roy

prerequisites

Intermediate Python

Daniel Chen

Data Science Consultant at Lander Analytics

Daniel is a Software Carpentry instructor and a doctoral student in Genetics, Bioinformatics, and Computational Biology at Virginia Tech, where he works in the Social and Decision Analytics Laboratory under the Biocomplexity Institute. He received his MPH at the Mailman School of Public Health in Epidemiology and is interested in integrating hospital data in order to perform predictive health analytics and build clinical support tools for clinicians. An advocate of open science, he aspires to bridge data science with epidemiology and health care.
