Course Description
A vital component of data science involves acquiring raw data and getting it into a form ready for analysis. It is commonly said that data scientists spend 80% of their time cleaning and manipulating data, and only 20% of their time actually analyzing it. This course will equip you with all the skills you need to clean your data in Python, from learning how to diagnose problems in your data, to dealing with missing values and outliers. At the end of the course, you'll apply all of the techniques you've learned to a case study to clean a real-world Gapminder dataset.
Training 2 or more people?
Get your team access to the full DataCamp platform, including all the features.
- 1
Exploring your data
Free
Say you've just gotten your hands on a brand new dataset and are itching to start exploring it. But where do you begin, and how can you be sure your dataset is clean? This chapter will introduce you to data cleaning in Python. You'll learn how to explore your data with an eye for diagnosing issues such as outliers, missing values, and duplicate rows.
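As a rough illustration of the kind of diagnosis this chapter describes, the sketch below counts missing values and duplicate rows and flags a possible outlier. It assumes pandas is the library in use; the toy DataFrame, its column names, and the 1.5 * IQR outlier rule are illustrative choices, not taken from the course material.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one duplicate row, one missing value, one implausible age.
df = pd.DataFrame({
    "name": ["Ana", "Ben", "Ben", "Cy", "Dee"],
    "age": [34, 29, 29, np.nan, 410],
})

missing_per_column = df.isna().sum()         # number of NaN values in each column
duplicate_rows = int(df.duplicated().sum())  # rows identical to an earlier row

# Flag outliers with the 1.5 * IQR rule (one common convention among several).
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)]

print(missing_per_column)
print("duplicate rows:", duplicate_rows)
print(outliers)
```

Rows with a missing age are never flagged as outliers here, because comparisons against NaN evaluate to False.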
Diagnose data for cleaning (50 xp)
Loading and viewing your data (100 xp)
Further diagnosis (100 xp)
Exploratory data analysis (50 xp)
Calculating summary statistics (50 xp)
Frequency counts for categorical data (100 xp)
Visual exploratory data analysis (50 xp)
Visualizing single variables with histograms (100 xp)
Visualizing multiple variables with boxplots (100 xp)
Visualizing multiple variables with scatter plots (100 xp)
- 2
Tidying data for analysis
Learn about the principles of tidy data, and more importantly, why you should care about them and how they make data analysis more efficient. You'll gain first-hand experience with reshaping and tidying data using techniques such as pivoting and melting.
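To make melting and pivoting concrete, here is a minimal sketch assuming pandas. The table, its column names, and its values are hypothetical placeholders rather than course data.

```python
import pandas as pd

# Hypothetical wide table: one row per group, one column per year.
wide = pd.DataFrame({
    "group": ["A", "B"],
    "2000": [1.0, 2.0],
    "2010": [3.0, 4.0],
})

# melt: turn the year columns into rows, giving one observation per row.
tidy = wide.melt(id_vars="group", var_name="year", value_name="value")

# pivot: reverse the reshape; reset_index turns "group" back into a column.
back = tidy.pivot(index="group", columns="year", values="value").reset_index()
back.columns.name = None  # drop the leftover "year" label on the column axis

print(tidy)
print(back)
```

Note that `pivot` raises an error when an index/column pair occurs more than once; `pivot_table` with an aggregation function is the usual alternative in that case.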
Tidy data (50 xp)
Recognizing tidy data (50 xp)
Reshaping your data using melt (100 xp)
Customizing melted data (100 xp)
Pivoting data (50 xp)
Pivot data (100 xp)
Resetting the index of a DataFrame (100 xp)
Pivoting duplicate values (100 xp)
Beyond melt() and pivot() (50 xp)
Splitting a column with .str (100 xp)
Splitting a column with .split() and .get() (100 xp)
- 3
Combining data for analysis
The ability to transform and combine your data is a crucial skill in data science, because your data may not always come in one monolithic file or table for you to load. A large dataset may be broken into separate datasets to facilitate easier storage and sharing. But it's important to be able to run your analysis on a single dataset. You'll need to learn how to combine datasets or clean each dataset separately so you can combine them later for analysis.
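One way to picture combining separate pieces into a single dataset is sketched below, assuming pandas. The frames, the `site_id` key, and the city names are hypothetical; the chapter itself does not say which functions it covers.

```python
import pandas as pd

# Hypothetical pieces of one dataset, split across files by month.
jan = pd.DataFrame({"site_id": [1, 2], "visits": [10, 4]})
feb = pd.DataFrame({"site_id": [1, 3], "visits": [7, 9]})

# Stack pieces that share the same columns; ignore_index renumbers the rows.
visits = pd.concat([jan, feb], ignore_index=True)

# A separate lookup table keyed on site_id.
sites = pd.DataFrame({"site_id": [1, 2, 3], "city": ["Oslo", "Lima", "Pune"]})

# Join the lookup onto every visit row; validate raises if the
# relationship between the two tables is not many-to-one.
combined = visits.merge(sites, on="site_id", how="left", validate="many_to_one")

print(combined)
```

A left join keeps every row of `visits` even when a key has no match in `sites`, which makes missing lookups visible as NaN instead of silently dropping rows.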
- 4
Cleaning data for analysis
Dive into some of the grittier aspects of data cleaning. Learn about string manipulation and pattern matching to deal with unstructured data, and then explore techniques to deal with missing or duplicate data. You'll also learn the valuable skill of programmatically checking your data for consistency, which will give you confidence that your code is running correctly and that the results of your analysis are reliable.
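The techniques named above can be sketched together in a few lines, assuming pandas. The data, the regular expression, and the choice to fill gaps with the column mean are illustrative assumptions, not the course's prescribed approach.

```python
import pandas as pd

# Hypothetical messy column: amounts stored as strings, with a duplicate and a gap.
raw = pd.DataFrame({
    "item": ["tea", "tea", "jam", "oil"],
    "price": ["$3.50", "$3.50", None, "$12.00"],
})

# Drop rows that exactly repeat an earlier row.
clean = raw.drop_duplicates().reset_index(drop=True)

# Pull the numeric part out of each string, then convert it to a float column.
clean["price"] = clean["price"].str.extract(r"(\d+\.\d{2})", expand=False).astype(float)

# Fill the missing price with the column mean (one simple strategy among many).
clean["price"] = clean["price"].fillna(clean["price"].mean())

# Assert statements fail loudly if the cleaning did not do what was intended.
assert clean["price"].notna().all()
assert not clean.duplicated().any()

print(clean)
```

Placing the asserts directly after the cleaning steps means a later change to the input data that breaks an assumption stops the script at once instead of producing a quietly wrong result.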
Data types (50 xp)
Converting data types (100 xp)
Working with numeric data (100 xp)
Using regular expressions to clean strings (50 xp)
String parsing with regular expressions (100 xp)
Extracting numerical values from strings (100 xp)
Pattern matching (100 xp)
Using functions to clean data (50 xp)
Custom functions to clean data (100 xp)
Lambda functions (100 xp)
Duplicate and missing data (50 xp)
Dropping duplicate data (100 xp)
Filling missing data (100 xp)
Testing with asserts (50 xp)
Testing your data with asserts (100 xp)
- 5
Case study
In this final chapter, you'll apply all of the data cleaning techniques you've learned in this course toward tidying a real-world, messy dataset obtained from the Gapminder Foundation. Once you're done, not only will you have a clean and tidy dataset, you'll also be ready to start working on your own data science projects using Python.
Putting it all together (50 xp)
Exploratory analysis (50 xp)
Visualizing your data (100 xp)
Thinking about the question at hand (100 xp)
Assembling your data (100 xp)
Initial impressions of the data (50 xp)
Reshaping your data (100 xp)
Checking the data types (100 xp)
Looking at country spellings (100 xp)
More data cleaning and processing (100 xp)
Wrapping up (100 xp)
Final thoughts (50 xp)
Prerequisites
Intermediate Python
Daniel Chen
Data Science Consultant at Lander Analytics
Daniel is a Software Carpentry instructor and a doctoral student in Genetics, Bioinformatics, and Computational Biology at Virginia Tech, where he works in the Social and Decision Analytics Laboratory under the Biocomplexity Institute. He received his MPH in Epidemiology from the Mailman School of Public Health and is interested in integrating hospital data to perform predictive health analytics and build clinical support tools for clinicians. An advocate of open science, he aspires to bridge data science with epidemiology and health care.
Join over 15 million learners and start Cleaning Data in Python today!