
Sourav Saha has completed
Preprocessing for Machine Learning in Python
4 hr
4,700 XP

Course Description
This course covers the basics of how and when to perform data preprocessing. Preprocessing is an essential step in any machine learning project: it is when you get your data ready for modeling, and it sits between importing and cleaning your data and fitting your machine learning model. You'll learn how to standardize your data so that it's in the right form for your model, create new features that make better use of the information in your dataset, and select the features that most improve your model fit. Finally, you'll practice preprocessing by getting a dataset on UFO sightings ready for modeling.
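As a rough illustration of the standardization step mentioned above, the sketch below log-normalizes a high-variance column and then rescales every column to mean 0 and standard deviation 1. The array `X` and the helper names `log_normalize` and `standardize` are made up for this example and are not taken from the course materials, which may use different tools.

```python
import numpy as np

# Hypothetical feature matrix: column 0 has a far larger variance than column 1.
X = np.array([
    [1.0, 0.5],
    [10.0, 0.7],
    [100.0, 0.6],
    [1000.0, 0.8],
])


def log_normalize(column):
    """Apply a natural log to shrink the variance of a strictly positive column."""
    return np.log(column)


def standardize(features):
    """Rescale each column to mean 0 and (population) standard deviation 1."""
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    return (features - mean) / std


# Log-normalize only the high-variance column, then scale all columns.
X_log = X.copy()
X_log[:, 0] = log_normalize(X[:, 0])
X_scaled = standardize(X_log)

print("variance before:", X.var(axis=0))
print("variance after log:", X_log.var(axis=0))
print("scaled mean:", X_scaled.mean(axis=0).round(6))
print("scaled std:", X_scaled.std(axis=0).round(6))
```

Distance-based models such as k-nearest neighbors are the usual motivation for this kind of scaling, since an unscaled high-variance column would otherwise dominate the distance calculation.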
- 1. Introduction to Data Preprocessing (Free)
  In this chapter you'll learn exactly what it means to preprocess data. You'll take the first steps in any preprocessing journey, including exploring data types and dealing with missing data.
- 2. Standardizing Data
  This chapter is all about standardizing data. Often a model will make some assumptions about the distribution or scale of your features. Standardization is a way to make your data fit these assumptions and improve the algorithm's performance.
  - Standardization (50 xp)
  - When to standardize (50 xp)
  - Modeling without normalizing (100 xp)
  - Log normalization (50 xp)
  - Checking the variance (50 xp)
  - Log normalization in Python (100 xp)
  - Scaling data for feature comparison (50 xp)
  - Scaling data - investigating columns (50 xp)
  - Scaling data - standardizing columns (100 xp)
  - Standardized data and modeling (50 xp)
  - KNN on non-scaled data (100 xp)
  - KNN on scaled data (100 xp)
- 3. Feature Engineering
  In this chapter you'll learn about feature engineering. You'll explore different ways to create new, more useful features from the ones already in your dataset. You'll see how to encode, aggregate, and extract information from both numerical and textual features.
  - Feature engineering (50 xp)
  - Feature engineering knowledge test (50 xp)
  - Identifying areas for feature engineering (50 xp)
  - Encoding categorical variables (50 xp)
  - Encoding categorical variables - binary (100 xp)
  - Encoding categorical variables - one-hot (100 xp)
  - Engineering numerical features (50 xp)
  - Aggregating numerical features (100 xp)
  - Extracting datetime components (100 xp)
  - Engineering text features (50 xp)
  - Extracting string patterns (100 xp)
  - Vectorizing text (100 xp)
  - Text classification using tf/idf vectors (100 xp)
- 4. Selecting Features for Modeling
  This chapter goes over a few different techniques for selecting the most important features from your dataset. You'll learn how to drop redundant features, work with text vectors, and reduce the number of features in your dataset using principal component analysis (PCA).
  - Feature selection (50 xp)
  - When to use feature selection (50 xp)
  - Identifying areas for feature selection (50 xp)
  - Removing redundant features (50 xp)
  - Selecting relevant features (100 xp)
  - Checking for correlated features (100 xp)
  - Selecting features using text vectors (50 xp)
  - Exploring text vectors, part 1 (100 xp)
  - Exploring text vectors, part 2 (100 xp)
  - Training Naive Bayes with feature selection (100 xp)
  - Dimensionality reduction (50 xp)
  - Using PCA (100 xp)
  - Training a model with PCA (100 xp)
- 5. Putting It All Together
  Now that you've learned all about preprocessing, you'll try these techniques out on a dataset that records information on UFO sightings.
  - UFOs and preprocessing (50 xp)
  - Checking column types (100 xp)
  - Dropping missing data (100 xp)
  - Categorical variables and standardization (50 xp)
  - Extracting numbers from strings (100 xp)
  - Identifying features for standardization (100 xp)
  - Engineering new features (50 xp)
  - Encoding categorical variables (100 xp)
  - Features from dates (100 xp)
  - Text vectorization (100 xp)
  - Feature selection and modeling (50 xp)
  - Selecting the ideal dataset (100 xp)
  - Modeling the UFO dataset, part 1 (100 xp)
  - Modeling the UFO dataset, part 2 (100 xp)
  - Congratulations! (50 xp)
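
To make the dimensionality reduction idea from the feature selection chapter more concrete, here is a minimal sketch of PCA written with NumPy alone. The synthetic data, the `pca` helper, and its `n_components` parameter are assumptions for illustration only; the course exercises may rely on a library implementation instead.

```python
import numpy as np

# Synthetic, hypothetical data: the third column is almost exactly the sum of
# the first two, so it carries very little new information.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
noise = 0.01 * rng.normal(size=200)
X = np.column_stack([base[:, 0], base[:, 1], base[:, 0] + base[:, 1] + noise])


def pca(features, n_components):
    """Project features onto their top principal components via SVD.

    Returns the projected data and the fraction of total variance
    explained by each retained component.
    """
    centered = features - features.mean(axis=0)
    _, singular_values, components = np.linalg.svd(centered, full_matrices=False)
    explained_variance = singular_values ** 2 / (len(features) - 1)
    ratio = explained_variance / explained_variance.sum()
    projected = centered @ components[:n_components].T
    return projected, ratio[:n_components]


X_reduced, variance_ratio = pca(X, n_components=2)

print("original shape:", X.shape)
print("reduced shape:", X_reduced.shape)
print("variance explained by kept components:", variance_ratio.sum())
```

Because the third column is redundant by construction, two components retain nearly all of the variance, which is the situation where dropping dimensions costs the model very little.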
Collaborators


James Chapman
AI Curriculum Manager, DataCamp
James is a Curriculum Manager at DataCamp, where he collaborates with experts from industry and academia to create courses on AI, data science, and analytics. He has led nine DataCamp courses on diverse topics in Python, R, AI developer tooling, and Google Sheets. He has a Master's degree in Physics and Astronomy from Durham University, where he specialized in high-redshift quasar detection. In his spare time, he enjoys restoring retro toys and electronics.
 
Follow James on LinkedIn