
Supervised Learning

Predicting values of a target variable given a set of features

  • For example, predicting if a customer will buy a product (target) based on their location and last five purchases (features).

Regression

  • Predicting the value of a continuous variable, e.g., house price.

Classification

  • Predicting a categorical outcome, e.g., customer churn (a binary outcome: churned or not).

Data Dictionary

The data has the following fields:

Column name          Description
loan_id              Unique loan id
gender               Gender - Male / Female
married              Marital status - Yes / No
dependents           Number of dependents
education            Education - Graduate / Not Graduate
self_employed        Self-employment status - Yes / No
applicant_income     Applicant's income
coapplicant_income   Coapplicant's income
loan_amount          Loan amount (thousands)
loan_amount_term     Term of loan (months)
credit_history       Credit history meets guidelines - 1 / 0
property_area        Area of the property - Urban / Semi Urban / Rural
loan_status          Loan approval status (target) - 1 / 0
# Import required libraries
# Read in the dataset


# Preview the data
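
The three steps above might look like the sketch below, assuming pandas is available. The original file name is not given, so a short in-memory CSV stands in for it: `csv_text`, the `loans` variable name, and the two rows of values are made up for illustration, while the column names follow the data dictionary.

```python
# Import required libraries
import io

import pandas as pd

# Stand-in for the real file, whose name is not given in the source.
# The two rows are made-up values, used only to show the expected shape.
csv_text = """loan_id,gender,married,dependents,education,self_employed,applicant_income,coapplicant_income,loan_amount,loan_amount_term,credit_history,property_area,loan_status
A001,Male,Yes,0,Graduate,No,5000,0,120,360,1,Urban,1
A002,Female,No,1,Not Graduate,Yes,3000,1500,100,360,0,Rural,0
"""

# Read in the dataset; with a real file, pass its path instead of the buffer
loans = pd.read_csv(io.StringIO(csv_text))

# Preview the data
print(loans.head())
```

With the real dataset, only the argument to `pd.read_csv` would change.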

Exploratory Data Analysis

We can't just dive straight into machine learning! We need to understand and format our data for modeling. What are we looking for?

Cleanliness

  • Are columns set to the correct data type?
  • Do we have missing data?

Distributions

  • Some machine learning algorithms work best when numeric features are roughly normally distributed. Are any columns heavily skewed?
  • Do we have outliers (extreme values)?

Relationships

  • If a feature is strongly correlated with the target variable, it might be a good predictor!

Feature Engineering

  • Do we need to modify any data, e.g., convert it to a different data type (most ML models expect numeric input), or extract part of a value?
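
As one possible illustration of converting text columns to numeric ones, the sketch below maps a Yes/No column to 1/0 and one-hot encodes a multi-category column with `pd.get_dummies`. The rows are made up, and choosing these two encodings is an assumption, not necessarily what the original notebook does.

```python
import pandas as pd

# Made-up rows for illustration; column names follow the data dictionary
loans = pd.DataFrame({
    "married": ["Yes", "No", "Yes"],
    "property_area": ["Urban", "Rural", "Semi Urban"],
    "loan_amount": [120, 100, 150],
})

# Binary Yes/No column: map straight to 1/0
loans["married"] = loans["married"].map({"Yes": 1, "No": 0})

# Multi-category column: one indicator column per category
loans = pd.get_dummies(loans, columns=["property_area"], dtype=int)

# Every column is now numeric
print(loans.dtypes)
```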
# Remove the loan_id to avoid accidentally using it as a feature
# Counts and data types per column
# Distributions and relationships
# Correlation between variables
# Target frequency
# Class frequency by loan_status
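
A rough sketch of the steps listed above, using a few made-up rows in place of the real data. Plotting is left out: `describe()` stands in for the distribution plots as a text-only summary, and `education` is picked arbitrarily as the class to break down by `loan_status`.

```python
import pandas as pd

# Made-up rows for illustration only
loans = pd.DataFrame({
    "loan_id": ["A001", "A002", "A003", "A004"],
    "education": ["Graduate", "Graduate", "Not Graduate", "Graduate"],
    "applicant_income": [5000, 3000, 4000, 6000],
    "loan_amount": [120.0, 100.0, None, 150.0],
    "credit_history": [1, 0, 1, 1],
    "loan_status": [1, 0, 0, 1],
})

# Remove the loan_id to avoid accidentally using it as a feature
loans = loans.drop(columns="loan_id")

# Counts and data types per column, plus missing values per column
loans.info()
missing = loans.isna().sum()

# Distributions: summary statistics for the numeric columns
print(loans.describe())

# Correlation between the numeric variables
corr = loans.select_dtypes("number").corr()

# Target frequency
target_counts = loans["loan_status"].value_counts()

# Class frequency by loan_status (here: education level vs. approval)
education_by_status = pd.crosstab(loans["education"], loans["loan_status"])
print(education_by_status)
```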

Modeling

# First model using loan_amount

# Split into training and test sets

# Previewing the training set
# Instantiate a logistic regression model

# Fit to the training data

# Predict test set values

# Check the model's first five predictions
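
The modeling steps above could be sketched with scikit-learn as follows. The data here is randomly generated rather than the real loans, and the split settings (`test_size=0.3`, `random_state=42`, `stratify=y`) are assumptions chosen for the example, not values taken from the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (random values, not the real loans)
rng = np.random.default_rng(42)
loans = pd.DataFrame({
    "loan_amount": rng.uniform(50, 300, size=100),
    "loan_status": rng.integers(0, 2, size=100),
})

# First model using loan_amount as the only feature
X = loans[["loan_amount"]]
y = loans["loan_status"]

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Previewing the training set
print(X_train.head())

# Instantiate a logistic regression model
clf = LogisticRegression()

# Fit to the training data
clf.fit(X_train, y_train)

# Predict test set values
y_pred = clf.predict(X_test)

# Check the model's first five predictions
print(y_pred[:5])
```

Because the synthetic target is random, the predictions carry no meaning; the point is only the order of the calls.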

Classification Metrics

 

Accuracy
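
Accuracy is the share of predictions that match the actual labels. A minimal sketch with made-up labels (1 = approved, 0 = not approved):

```python
# Made-up labels for illustration: 1 = approved, 0 = not approved
y_true = [1, 0, 1, 1, 0, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1, 0, 1]

# Accuracy = correct predictions / total predictions
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

print(accuracy)  # 0.75 (6 of 8 predictions are correct)
```

scikit-learn's `accuracy_score(y_true, y_pred)` computes the same quantity.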

 

Confusion Matrix

True Positive (TP) = number of observations correctly predicted as positive

True Negative (TN) = number of observations correctly predicted as negative

False Positive (FP) = number of observations incorrectly predicted as positive (actually negative)

False Negative (FN) = number of observations incorrectly predicted as negative (actually positive)

 

                   Predicted: Negative   Predicted: Positive
Actual: Negative   True Negative         False Positive
Actual: Positive   False Negative        True Positive
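
The four counts can be read off scikit-learn's `confusion_matrix`, which uses the same layout as the table (rows are actual classes, columns are predicted classes). The labels below are made up for illustration.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration: 1 = positive, 0 = negative
y_true = [1, 0, 1, 1, 0, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1, 0, 1]

# Rows are actual classes, columns are predicted classes, ordered [0, 1]
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
print(cm)

# Flatten row by row to get TN, FP, FN, TP
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)  # 2 1 1 4
```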

 

Confusion Matrix Metrics