A vital component of data science involves acquiring raw data and getting it into a form ready for analysis. It is commonly said that data scientists spend 80% of their time cleaning and manipulating data, and only 20% of their time actually analyzing it. This course will equip you with all the skills you need to clean your data in Python, from learning how to diagnose problems in your data, to dealing with missing values and outliers. At the end of the course, you'll apply all of the techniques you've learned to a case study to clean a real-world Gapminder dataset.
Say you've just gotten your hands on a brand-new dataset and are itching to start exploring it. But where do you begin, and how can you be sure your dataset is clean? This chapter will introduce you to data cleaning in Python. You'll learn how to explore your data with an eye for diagnosing issues such as outliers, missing values, and duplicate rows.
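As a rough illustration of what diagnosing a dataset can look like, here is a minimal sketch using pandas. The toy data, the column names, and the 1.5 × IQR outlier rule are assumptions made for this example, not material taken from the course.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing value, one duplicated row, one implausible value.
df = pd.DataFrame({
    "country": ["A", "B", "B", "C", "D"],
    "life_expectancy": [71.2, 64.5, 64.5, np.nan, 640.0],
})

# Quick overview: column types, non-null counts, and summary statistics.
df.info()
print(df.describe())

# Count missing values per column and fully duplicated rows.
missing_per_column = df.isna().sum()
n_duplicates = int(df.duplicated().sum())

# Flag potential outliers with the 1.5 * IQR rule (one common heuristic among several).
q1 = df["life_expectancy"].quantile(0.25)
q3 = df["life_expectancy"].quantile(0.75)
iqr = q3 - q1
is_outlier = (df["life_expectancy"] < q1 - 1.5 * iqr) | (df["life_expectancy"] > q3 + 1.5 * iqr)
outliers = df[is_outlier]

print(missing_per_column)
print("duplicate rows:", n_duplicates)
print(outliers)
```

A flagged value is only a candidate for cleaning; whether it is an error or a genuine extreme still needs a judgment call.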
Learn about the principles of tidy data, and more importantly, why you should care about them and how they make data analysis more efficient. You'll gain first-hand experience with reshaping and tidying data using techniques such as pivoting and melting.
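To make melting and pivoting concrete, the sketch below reshapes a small wide table into tidy form and back again with pandas. The table contents and column names are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical wide table: one column per year, which is not tidy
# because the column headers hold values of a "year" variable.
wide = pd.DataFrame({
    "country": ["A", "B"],
    "2000": [70.1, 65.3],
    "2001": [70.4, 65.9],
})

# Melt: turn the year columns into rows, one observation per row.
tidy = wide.melt(id_vars="country", var_name="year", value_name="life_expectancy")
tidy["year"] = tidy["year"].astype(int)

# Pivot: the inverse operation, spreading years back out into columns.
back_to_wide = tidy.pivot(index="country", columns="year", values="life_expectancy")

print(tidy)
print(back_to_wide)
```

In the tidy form each row is a single country-year observation, which tends to make grouping, filtering, and plotting simpler.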
The ability to transform and combine your data is a crucial skill in data science, because your data will not always come in one monolithic file or table. A large dataset may be split into separate datasets to make storage and sharing easier, but an analysis usually needs to run on a single dataset. In this chapter, you'll learn how to combine datasets, or clean each one separately so that they can be combined later for analysis.
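As a sketch of the two most common ways to combine data in pandas, the example below stacks two pieces of a split dataset with `pd.concat` and then joins a lookup table with `merge`. All tables and column names here are hypothetical.

```python
import pandas as pd

# Hypothetical dataset that arrived split across two files.
part_1 = pd.DataFrame({"country": ["A", "B"], "population": [10, 20]})
part_2 = pd.DataFrame({"country": ["C"], "population": [30]})

# Concatenate: stack the pieces row-wise and rebuild a clean 0..n-1 index.
population = pd.concat([part_1, part_2], ignore_index=True)

# Hypothetical second table keyed on the same column.
regions = pd.DataFrame({
    "country": ["A", "B", "C"],
    "region": ["North", "South", "North"],
})

# Merge: join on the shared key. validate= makes pandas raise an error
# if the key is unexpectedly duplicated on either side.
combined = population.merge(regions, on="country", how="left", validate="one_to_one")

print(combined)
```

Using `how="left"` keeps every row of `population` even when a country has no match in `regions`; other join types may suit other situations better.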
Dive into some of the grittier aspects of data cleaning. Learn about string manipulation and pattern matching to deal with unstructured data, and then explore techniques to deal with missing or duplicate data. You'll also learn the valuable skill of programmatically checking your data for consistency, which will give you confidence that your code is running correctly and that the results of your analysis are reliable.
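The sketch below strings these ideas together on a small example: a regular expression checks the raw format, string methods clean the values, duplicates and missing values are handled, and assert statements verify the result. The data, the column names, and the choice to fill missing values with the column mean are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical messy data: stray whitespace, inconsistent case,
# currency formatting, a duplicated row, and a missing value.
raw = pd.DataFrame({
    "country": [" alpha ", "Beta", "Beta", "gamma"],
    "gdp": ["$1,200", "$950", "$950", None],
})

# Pattern matching: check that every non-missing gdp entry looks like "$1,200".
looks_like_money = raw["gdp"].dropna().str.fullmatch(r"\$\d{1,3}(,\d{3})*")

clean = raw.copy()

# String manipulation: trim whitespace and normalize capitalization.
clean["country"] = clean["country"].str.strip().str.title()

# Strip the currency symbol and thousands separator, then convert to numbers.
clean["gdp"] = pd.to_numeric(clean["gdp"].str.replace(r"[$,]", "", regex=True))

# Drop exact duplicate rows, then fill the remaining missing value.
# Filling with the mean is just one option; dropping the row is another.
clean = clean.drop_duplicates().reset_index(drop=True)
clean["gdp"] = clean["gdp"].fillna(clean["gdp"].mean())

# Programmatic consistency checks: these raise AssertionError if cleaning failed.
assert looks_like_money.all()
assert clean["gdp"].notna().all()
assert (clean["gdp"] >= 0).all()
assert not clean.duplicated().any()

print(clean)
```

If any assert fails, the script stops at that point, which surfaces a cleaning problem before it can silently affect later analysis.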
In this final chapter, you'll apply all of the data cleaning techniques you've learned in this course toward tidying a real-world, messy dataset obtained from the Gapminder Foundation. Once you're done, not only will you have a clean and tidy dataset, you'll also be ready to start working on your own data science projects using Python.