Discover How to Clean Data in Python
It's commonly said that data scientists spend 80% of their time cleaning and manipulating data and only 20% of their time analyzing it. Data cleaning is an essential step for every data scientist, as analyzing dirty data can lead to inaccurate conclusions.
In this course, you will learn how to identify, diagnose, and treat various data cleaning problems in Python, ranging from simple to advanced. You will deal with improper data types, check that your data is in the correct range, handle missing data, perform record linkage, and more!
Learn How to Clean Different Data Types
The first chapter of the course explores common data problems and how you can fix them. You will first understand basic data types and how to deal with them individually. After that, you'll apply range constraints and remove duplicated data points.
The last chapter explores record linkage, a powerful tool to merge multiple datasets. You'll learn how to link records by calculating the similarity between strings. Finally, you'll use your new skills to join two restaurant review datasets into one clean master dataset.
Gain Confidence in Cleaning Data
By the end of the course, you will have the confidence to clean data of various types and use record linkage to merge multiple datasets. Cleaning data is an essential skill for data scientists. If you want to learn more about cleaning data in Python and its applications, check out the following tracks: Data Scientist with Python and Importing & Cleaning Data with Python.
Common data problems
In this chapter, you'll learn how to overcome some of the most common dirty data problems. You'll convert data types, apply range constraints to remove future data points, and remove duplicated data points to avoid double-counting.
- Data type constraints
- Common data types
- Numeric data or ... ?
- Summing strings and concatenating numbers
- Data range constraints
- Tire size constraints
- Back to the future
- Uniqueness constraints
- How big is your subset?
- Finding duplicates
- Treating duplicates
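The three fixes named in this chapter can be sketched with pandas as follows. This is a minimal illustration, not the course's own exercise code: the `rides` DataFrame, its column names, and the cutoff date are all made up for the example.

```python
import pandas as pd

# Hypothetical ride-sharing records; names and values are illustrative only.
rides = pd.DataFrame({
    "ride_id": [1, 2, 2, 3],
    "duration": ["12 minutes", "7 minutes", "7 minutes", "30 minutes"],
    "ride_date": ["2020-01-05", "2020-02-11", "2020-02-11", "2050-01-01"],
})

# Data type constraint: strip the unit text and store durations as integers.
rides["duration"] = (
    rides["duration"].str.replace(" minutes", "", regex=False).astype("int64")
)

# Data range constraint: parse the dates and drop rows after an assumed cutoff.
rides["ride_date"] = pd.to_datetime(rides["ride_date"])
cutoff = pd.Timestamp("2021-01-01")  # assumed "today" for this sketch
rides = rides[rides["ride_date"] <= cutoff]

# Uniqueness constraint: drop fully duplicated rows to avoid double-counting.
rides = rides.drop_duplicates().reset_index(drop=True)
```

Dropping out-of-range rows is only one option; depending on the data, capping the value at the cutoff or flagging it for review may be more appropriate.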
Text and categorical data problems
Categorical and text data can often be some of the messiest parts of a dataset due to their unstructured nature. In this chapter, you’ll learn how to fix whitespace and capitalization inconsistencies in category labels, collapse multiple categories into one, and reformat strings for consistency.
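As a rough sketch of these fixes in pandas, the example below strips whitespace, normalizes capitalization, and collapses several labels into broader groups. The `devices` DataFrame, the `os` column, and the mapping are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical category labels with inconsistent whitespace and capitalization.
devices = pd.DataFrame({
    "os": [" Linux", "linux ", "MacOS", "macos", "Android", "iOS "],
})

# Fix whitespace and capitalization so equal labels compare equal.
devices["os"] = devices["os"].str.strip().str.lower()

# Collapse several categories into fewer, broader ones with an explicit mapping.
os_groups = {
    "linux": "desktop",
    "macos": "desktop",
    "android": "mobile",
    "ios": "mobile",
}
devices["os_type"] = devices["os"].map(os_groups)
```

Labels missing from the mapping come back as missing values from `map`, which makes unexpected categories easy to spot.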
Advanced data problems
In this chapter, you’ll dive into more advanced data cleaning problems, such as ensuring that weights are all written in kilograms instead of pounds. You’ll also gain invaluable skills that will help you verify that values have been added correctly and that missing values don’t negatively impact your analyses.
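A minimal pandas sketch of two of these ideas, unit uniformity and missing-value handling, is shown below. The `patients` DataFrame and its `weight` and `unit` columns are assumptions made for the example, not data from the course.

```python
import pandas as pd

# Hypothetical measurements recorded in mixed units, with one missing value.
patients = pd.DataFrame({
    "weight": [70.0, 154.0, None, 82.5],
    "unit": ["kg", "lb", "kg", "kg"],
})

LB_TO_KG = 0.45359237  # kilograms per international pound

# Unit uniformity: convert only the rows recorded in pounds, then relabel them.
is_lb = patients["unit"] == "lb"
patients.loc[is_lb, "weight"] = patients.loc[is_lb, "weight"] * LB_TO_KG
patients.loc[is_lb, "unit"] = "kg"

# Missing data: count the gaps before deciding how to treat them.
n_missing = patients["weight"].isna().sum()

# One simple treatment is to drop incomplete rows; imputation is an alternative.
complete = patients.dropna(subset=["weight"])
```

Whether to drop or impute depends on why the values are missing, so counting and inspecting the gaps first is usually the safer order.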
Record linkage
Record linkage is a powerful technique for merging multiple datasets when values contain typos or different spellings. In this chapter, you'll learn how to link records by calculating the similarity between strings. You'll then use your new skills to join two restaurant review datasets into one clean master dataset.
- Comparing strings
- Minimum edit distance
- The cutoff point
- Remapping categories II
- Generating pairs
- To link or not to link?
- Pairs of restaurants
- Similar restaurants
- Linking DataFrames
- Getting the right index
- Linking them together!
- Congratulations!
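To make the idea of string similarity concrete, here is a small sketch that uses only standard Python. It computes the minimum edit distance (Levenshtein distance) and links names whose similarity clears a cutoff. The helper names, the 0.8 cutoff, and the restaurant names are assumptions for illustration; the course may rely on dedicated libraries instead.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Scale the edit distance to a score between 0 (different) and 1 (identical)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - edit_distance(a, b) / longest


def link_names(left, right, cutoff=0.8):
    """Map each name in left to its best match in right, if it clears the cutoff."""
    links = {}
    for name in left:
        best = max(right, key=lambda candidate: similarity(name, candidate))
        if similarity(name, best) >= cutoff:
            links[name] = best
    return links


# Hypothetical restaurant names from two review datasets.
left = ["arnie morton's of chicago", "cafe bizou", "grill on the alley"]
right = ["arnie mortons of chicago", "cafe bizu", "spago"]
links = link_names(left, right)
```

The cutoff trades false links against missed links: a higher value links fewer records but with more confidence.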
Datasets
- Ride sharing dataset
- Airlines dataset
- Banking dataset
- Restaurants dataset
- Restaurants dataset II
Content Developer @ DataCamp
Adel is a Data Science educator, speaker, and Evangelist at DataCamp, where he has released various courses and live training on data analysis, machine learning, and data engineering. He is passionate about spreading data skills and data literacy throughout organizations and about the intersection of technology and society. He has an MSc in Data Science and Business Analytics. In his free time, you can find him hanging out with his cat Louis.
Don’t just take our word for it
4.3 from 19 reviews
- David R., 1 day ago
- Hakan S., 14 days ago
- Lucas G., 22 days ago
- Hans H., 2 months ago
Very thorough. Excellent course!
- Bijan S., 2 months ago
In the data mining course, I learned that “Data cleaning (or data cleansing) routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data.” (Data Mining Concepts and Techniques, 3rd Ed., by Jiawei Han et al., 2012). However, now, based on “Cleaning Data in Python” I practically learned how to perform some of the above processes such as dealing with missing values and removing duplicates in Python. So now, not only do I know the theoretical issues about data cleaning but I also know how to perform them in practice using Python. Therefore, this course was very helpful to me. I thank DataCamp very much.