Machine learning is the field that teaches computers to learn from existing data in order to make predictions on new data: Will a tumor be benign or malignant? Which of your customers will take their business elsewhere? Is a particular email spam? In this course, you'll learn how to use Python to perform supervised learning, an essential component of machine learning. You'll learn how to build predictive models, tune their parameters, and estimate how well they will perform on unseen data, all while working with real-world datasets. You'll be using scikit-learn, one of the most popular and user-friendly machine learning libraries for Python.
In this chapter, you will be introduced to classification problems and learn how to solve them using supervised learning techniques. You'll apply what you learn to a political dataset, where you classify the party affiliation of United States congressmen based on their voting records.
Supervised learning
Which of these is a classification problem?
Exploratory data analysis
Numerical EDA
Visual EDA
The classification challenge
k-Nearest Neighbors: Fit
k-Nearest Neighbors: Predict
Measuring model performance
The digits recognition dataset
Train/Test Split + Fit/Predict/Accuracy
Overfitting and underfitting
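The fit, predict, and train/test split steps named in this chapter can be sketched roughly as follows. This is a minimal illustration, not the course's own exercise code: it uses the digits dataset bundled with scikit-learn, and the values of `n_neighbors`, `test_size`, and `random_state` are arbitrary assumptions.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load the bundled 8x8 handwritten digits: 64 pixel features per image.
digits = load_digits()
X, y = digits.data, digits.target

# Hold out 20% of the rows for testing; stratify keeps class proportions similar.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit a k-nearest neighbors classifier (k=7 is an assumed, untuned value).
knn = KNeighborsClassifier(n_neighbors=7)
knn.fit(X_train, y_train)

# Predict labels for unseen data and measure accuracy on the test set.
y_pred = knn.predict(X_test)
accuracy = knn.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.3f}")
```

Varying `n_neighbors` and comparing training accuracy with test accuracy is one simple way to see overfitting (very small k) and underfitting (very large k).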
In the previous chapter, you used image and political datasets to predict binary and multiclass outcomes. But what if your problem requires a continuous outcome? Regression is best suited to solving such problems. You will learn about fundamental concepts in regression and apply them to predict the life expectancy in a given country using Gapminder data.
Introduction to regression
Which of the following is a regression problem?
Importing data for supervised learning
Exploring the Gapminder data
The basics of linear regression
Fit & predict for regression
Train/test split for regression
Cross-validation
5-fold cross-validation
K-Fold CV comparison
Regularized regression
Regularization I: Lasso
Regularization II: Ridge
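As a rough sketch of the regression topics listed here, the example below compares plain, ridge, and lasso linear regression using 5-fold cross-validation. The Gapminder file is not bundled with scikit-learn, so this assumes scikit-learn's built-in diabetes regression dataset as a stand-in (it may not match the file the course uses), and the `alpha` values are illustrative rather than tuned.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Stand-in data: 10 numeric features and a continuous target.
X, y = load_diabetes(return_X_y=True)

# alpha controls regularization strength; 0.1 is an assumed value.
models = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=0.1),
    "lasso": Lasso(alpha=0.1),
}

# 5-fold cross-validation; the default score for regressors is R^2.
mean_r2 = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    mean_r2[name] = scores.mean()
    print(f"{name}: mean R^2 = {mean_r2[name]:.3f}")

# Lasso can shrink some coefficients exactly to zero,
# which is why it is often used for feature selection.
lasso = Lasso(alpha=0.1).fit(X, y)
n_zero = int((lasso.coef_ == 0).sum())
print(f"Lasso set {n_zero} of {len(lasso.coef_)} coefficients to zero")
```

Averaging the score over several folds gives a less split-dependent estimate than a single train/test split, at the cost of fitting each model once per fold.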
Fine-tuning your model
Having trained your model, your next task is to evaluate its performance. In this chapter, you will learn about other metrics available in scikit-learn that allow you to assess your model's performance in a more nuanced manner than accuracy alone. You will then learn to optimize your classification and regression models using hyperparameter tuning.
How good is your model?
Metrics for classification
Logistic regression and the ROC curve
Building a logistic regression model
Plotting an ROC curve
Precision-recall curve
Area under the ROC curve
AUC computation
Hyperparameter tuning
Hyperparameter tuning with GridSearchCV
Hyperparameter tuning with RandomizedSearchCV
Hold-out set for final evaluation
Hold-out set reasoning
Hold-out set in practice I: Classification
Hold-out set in practice II: Regression
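One way the evaluation and tuning workflow in this chapter might look is sketched below: a grid search over the logistic regression parameter `C` on the training data, followed by a confusion matrix and ROC AUC on a hold-out set. It assumes scikit-learn's bundled breast cancer dataset as a stand-in, which is not one of the datasets listed for the course, and the grid values and `max_iter` are arbitrary choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Binary classification stand-in: malignant vs. benign tumors.
X, y = load_breast_cancer(return_X_y=True)

# Keep a hold-out set that the hyperparameter search never sees.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Search over the inverse regularization strength C with 5-fold CV.
param_grid = {"C": [0.01, 0.1, 1, 10]}
search = GridSearchCV(
    LogisticRegression(max_iter=10000),  # high max_iter because features are unscaled
    param_grid,
    cv=5,
    scoring="roc_auc",
)
search.fit(X_train, y_train)
print("Best parameters:", search.best_params_)

# Evaluate the refitted best model on the hold-out set.
y_pred = search.predict(X_test)
y_prob = search.predict_proba(X_test)[:, 1]  # probability of the positive class
cm = confusion_matrix(y_test, y_pred)
auc = roc_auc_score(y_test, y_prob)
print("Confusion matrix:\n", cm)
print(f"Hold-out ROC AUC: {auc:.3f}")
```

`RandomizedSearchCV` accepts a similar setup but samples a fixed number of parameter combinations instead of trying every one, which can be cheaper when the grid is large.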
Preprocessing and pipelines
This chapter introduces pipelines and shows how scikit-learn allows transformers and estimators to be chained together and used as a single unit. Preprocessing techniques are introduced as a way to improve model performance, and pipelines tie together concepts from previous chapters.
Preprocessing data
Exploring categorical features
Creating dummy variables
Regression with categorical features
Handling missing data
Dropping missing data
Imputing missing data in a ML Pipeline I
Imputing missing data in a ML Pipeline II
Centering and scaling
Centering and scaling your data
Centering and scaling in a pipeline
Bringing it all together I: Pipeline for classification
Bringing it all together II: Pipeline for regression
Final thoughts
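A pipeline that chains imputation, scaling, and a classifier could be sketched as below. This is an illustration under stated assumptions: it uses scikit-learn's bundled wine dataset (wine cultivars, not the wine quality data listed for the course), the missing values are injected artificially so the imputer has something to do, and the step names and `n_neighbors` value are arbitrary.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data with features on very different scales.
X, y = load_wine(return_X_y=True)

# Artificially blank out about 5% of the entries (assumption, for illustration only).
rng = np.random.default_rng(0)
X = X.copy()
X[rng.random(X.shape) < 0.05] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Each step is fitted on the training data only, then applied to the test data.
pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),  # fill NaNs with column means
    ("scaler", StandardScaler()),                 # center and scale each feature
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])
pipeline.fit(X_train, y_train)

accuracy = pipeline.score(X_test, y_test)
print(f"Pipeline test accuracy: {accuracy:.3f}")
```

Because the pipeline behaves like a single estimator, it can be passed directly to `cross_val_score` or `GridSearchCV`, which keeps the imputation and scaling statistics from leaking between folds.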
Datasets
Automobile miles per gallon
Boston housing
Diabetes
Gapminder
US Congressional Voting Records (1984)
White wine quality
Red wine quality
Prerequisites
Statistical Thinking in Python (Part 1)
Data Scientist at DataCamp
Hugo is a data scientist, educator, writer, and podcaster at DataCamp. His main interests are promoting data and AI literacy, helping to spread data skills through organizations and society, and doing amateur stand-up comedy in NYC. If you want to know what he likes to talk about, check out DataFramed, the DataCamp podcast, which he hosts and produces: https://www.datacamp.com/community/podcast