This course covers the basics of how and when to perform data preprocessing. This essential step in any machine learning project is when you get your data ready for modeling. Preprocessing comes into play after you import and clean your data and before you fit your machine learning model. You'll learn how to standardize your data so that it's in the right form for your model, create new features to best leverage the information in your dataset, and select the best features to improve your model fit. Finally, you'll practice preprocessing by getting a dataset on UFO sightings ready for modeling.
Introduction to Data Preprocessing
In this chapter you'll learn exactly what it means to preprocess data. You'll take the first steps in any preprocessing journey, including exploring data types and dealing with missing data.
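As a rough illustration of those first steps, the sketch below inspects column types and handles missing values with pandas. The DataFrame and its column names are hypothetical and are not taken from the course materials.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "age": [25, 32, np.nan, 41],
    "city": ["Austin", None, "Boston", "Denver"],
    "score": ["3.5", "4.0", "2.5", "5.0"],  # numbers stored as strings
})

# Inspect the type pandas inferred for each column.
print(df.dtypes)

# Convert a column that was read as text into a numeric type.
df["score"] = df["score"].astype(float)

# Count missing values per column.
missing_counts = df.isna().sum()
print(missing_counts)

# Drop every row that has at least one missing value.
df_complete = df.dropna()
print(df_complete)
```

Dropping rows is only one option; whether it is appropriate depends on how much data is missing and why.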
This chapter is all about standardizing data. Often a model will make some assumptions about the distribution or scale of your features. Standardization is a way to make your data fit these assumptions and improve the algorithm's performance.
Exercises: Standardizing Data; When to standardize; Modeling without normalizing; Log normalization; Checking the variance; Log normalization in Python; Scaling data for feature comparison; Scaling data - investigating columns; Scaling data - standardizing columns; Standardized data and modeling; KNN on non-scaled data; KNN on scaled data
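A minimal sketch of two standardization techniques named in this chapter, log normalization and feature scaling, might look like the following. The data is made up for illustration, and using NumPy with scikit-learn's `StandardScaler` is an assumption about tooling rather than a statement of what the course exercises use.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical data: "income" has a much larger variance than "height_cm".
df = pd.DataFrame({
    "income": [20_000.0, 35_000.0, 50_000.0, 1_000_000.0],
    "height_cm": [150.0, 160.0, 170.0, 180.0],
})

# Log normalization: the natural log compresses large values,
# which reduces the variance of a heavily skewed column.
variance_before = df["income"].var()
df["income_log"] = np.log(df["income"])
variance_after = df["income_log"].var()
print(variance_before, variance_after)

# Scaling: transform each column to mean 0 and unit variance so that
# features measured on different scales become comparable.
scaler = StandardScaler()
scaled = scaler.fit_transform(df[["income_log", "height_cm"]])
print(scaled.mean(axis=0), scaled.std(axis=0))
```

In a modeling workflow, the scaler is typically fit on the training split only and then applied to the test split, so that information from the test data does not leak into training.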
In this chapter you'll learn about feature engineering. You'll explore different ways to create new, more useful features from the ones already in your dataset. You'll see how to encode, aggregate, and extract information from both numerical and textual features.
Exercises: Feature engineering; Feature engineering knowledge test; Identifying areas for feature engineering; Encoding categorical variables; Encoding categorical variables - binary; Encoding categorical variables - one-hot; Engineering numerical features; Engineering numerical features - taking an average; Engineering numerical features - datetime; Text classification; Engineering features from strings - extraction; Engineering features from strings - tf/idf; Text classification using tf/idf vectors
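The kinds of feature engineering listed above (one-hot encoding, averaging numeric columns, extracting parts of a date, and tf/idf text vectors) can be sketched as follows. The DataFrame and column names are hypothetical, and the choice of `pd.get_dummies` and `TfidfVectorizer` is an assumption; other encoders would work as well.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "color": ["red", "blue", "red"],
    "signup": ["2021-01-15", "2021-06-30", "2022-03-01"],
    "run1": [10.0, 12.0, 9.0],
    "run2": [11.0, 13.0, 8.0],
    "title": ["data science intern", "senior data engineer", "science teacher"],
})

# One-hot encode a categorical column: one indicator column per category.
one_hot = pd.get_dummies(df["color"], prefix="color")

# Aggregate several numeric columns into a single feature (row-wise mean).
df["run_mean"] = df[["run1", "run2"]].mean(axis=1)

# Extract a component (the month) from a date stored as text.
df["signup_month"] = pd.to_datetime(df["signup"]).dt.month

# Turn free text into tf/idf vectors: one row per document,
# one column per distinct token in the vocabulary.
tfidf = TfidfVectorizer()
text_matrix = tfidf.fit_transform(df["title"])

print(one_hot)
print(df[["run_mean", "signup_month"]])
print(text_matrix.shape)
```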
Selecting features for modeling
This chapter goes over a few different techniques for selecting the most important features from your dataset. You'll learn how to drop redundant features, work with text vectors, and reduce the number of features in your dataset using principal component analysis (PCA).
Exercises: Feature selection; When to use feature selection; Identifying areas for feature selection; Removing redundant features; Selecting relevant features; Checking for correlated features; Selecting features using text vectors; Exploring text vectors, part 1; Exploring text vectors, part 2; Training Naive Bayes with feature selection; Dimensionality reduction; Using PCA; Training a model with PCA
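To make two of these ideas concrete, the sketch below drops a feature that is almost perfectly correlated with another and then applies PCA. The synthetic data, the column names, and the 0.95 correlation threshold are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Synthetic data: "a_copy" is nearly a rescaled duplicate of "a".
rng = np.random.default_rng(0)
a = rng.normal(size=100)
b = rng.normal(size=100)
df = pd.DataFrame({
    "a": a,
    "a_copy": a * 2.0 + 0.01 * rng.normal(size=100),
    "b": b,
})

# Absolute pairwise correlations between all columns.
corr = df.corr().abs()

# Keep only the strict upper triangle so each pair is checked once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Mark a column as redundant if it correlates above the (assumed)
# threshold with any column that appears before it.
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
reduced = df.drop(columns=to_drop)
print(to_drop)

# PCA: project the features onto two principal components.
pca = PCA(n_components=2)
components = pca.fit_transform(df)
print(pca.explained_variance_ratio_)
```

Principal components are linear combinations of the original columns, so they are harder to interpret than the features they replace.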
Putting it all together
Now that you've learned all about preprocessing you'll try these techniques out on a dataset that records information on UFO sightings.
Exercises: UFOs and preprocessing; Checking column types; Dropping missing data; Categorical variables and standardization; Extracting numbers from strings; Identifying features for standardization; Engineering new features; Encoding categorical variables; Features from dates; Text vectorization; Feature selection and modeling; Selecting the ideal dataset; Modeling the UFO dataset, part 1; Modeling the UFO dataset, part 2; Congratulations!
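One of the steps listed above, extracting numbers from strings, could be sketched like this. The rows and the column name `length_of_time` are invented for illustration and are not the course's actual UFO dataset; the helper `extract_minutes` is likewise hypothetical.

```python
import re

import pandas as pd

# Invented rows standing in for a free-text duration column.
ufo = pd.DataFrame({
    "length_of_time": ["about 5 minutes", "2 minutes", "a while", "10 minutes"],
})


def extract_minutes(text):
    """Return the first run of digits in the text as a float, or NaN if none."""
    match = re.search(r"\d+", text)
    return float(match.group()) if match else float("nan")


# Apply the helper to every row to build a numeric feature.
ufo["minutes"] = ufo["length_of_time"].apply(extract_minutes)
print(ufo)
```

Rows without any digits end up as missing values, which can then be dropped or imputed like any other missing data.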
In the following tracks: Machine Learning Scientist
DataCamp Content Creator
DataCamp offers interactive R, Python, Spreadsheets, SQL and shell courses, all on topics in data science, statistics, and machine learning. Learn from a team of expert teachers in the comfort of your browser with video lessons and fun coding challenges and projects.