Supervised Learning
Predicting values of a target variable given a set of features
- For example, predicting if a customer will buy a product (target) based on their location and last five purchases (features).
Regression
- Predicting the values of a continuous variable e.g., house price.
Classification
- Predicting a categorical outcome, most simply a binary one, e.g., customer churn.
Data Dictionary
The data has the following fields:
Column name | Description |
---|---|
loan_id | Unique loan id |
gender | Gender - Male / Female |
married | Marital status - Yes / No |
dependents | Number of dependents |
education | Education - Graduate / Not Graduate |
self_employed | Self-employment status - Yes / No |
applicant_income | Applicant's income |
coapplicant_income | Coapplicant's income |
loan_amount | Loan amount (thousands) |
loan_amount_term | Term of loan (months) |
credit_history | Credit history meets guidelines - 1 / 0 |
property_area | Area of the property - Urban / Semi Urban / Rural |
loan_status | Loan approval status (target) - 1 / 0 |
# Import required libraries
# Read in the dataset
# Preview the data
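The three steps above might look like the following minimal sketch. It assumes pandas is the library in use; the original file name is not given, so the example reads a few invented rows from an in-memory string instead. The name `loans` and every value in `csv_text` are illustrative assumptions, not the real data.

```python
# Import required libraries
import io
import pandas as pd

# Hypothetical sample rows that follow the data dictionary; values are invented for illustration.
csv_text = """loan_id,gender,married,dependents,education,self_employed,applicant_income,coapplicant_income,loan_amount,loan_amount_term,credit_history,property_area,loan_status
L001,Male,Yes,0,Graduate,No,5000,0,120,360,1,Urban,1
L002,Female,No,1,Not Graduate,No,3000,1500,100,360,0,Rural,0
L003,Male,Yes,2,Graduate,Yes,6000,2000,150,180,1,Semi Urban,1
"""

# Read in the dataset (with a real file, pass its path to pd.read_csv instead)
loans = pd.read_csv(io.StringIO(csv_text))

# Preview the data
print(loans.head())
```

With a real CSV on disk, only the argument to `pd.read_csv` would change.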
Exploratory Data Analysis
We can't just dive straight into machine learning! We need to understand and format our data for modeling. What are we looking for?
Cleanliness
- Are columns set to the correct data type?
- Do we have missing data?
Distributions
- Some machine learning algorithms work best when numeric features are roughly normally distributed or on similar scales.
- Do we have outliers (extreme values)?
Relationships
- If a feature is strongly correlated with the target variable, it might be a good predictor!
Feature Engineering
- Do we need to modify any data, e.g., into different data types (ML models expect numeric data), or extract part of the data?
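As one possible illustration of the feature engineering point, text columns such as `education` and `property_area` can be converted to numeric indicator columns. The sketch below assumes pandas and uses `pd.get_dummies`; the `sample` rows are invented for illustration, and one-hot encoding is only one of several reasonable encodings.

```python
import pandas as pd

# Invented rows for illustration only
sample = pd.DataFrame({
    "education": ["Graduate", "Not Graduate", "Graduate"],
    "property_area": ["Urban", "Rural", "Semi Urban"],
    "loan_amount": [120, 100, 150],
})

# One-hot encode the text columns; the numeric column passes through unchanged
encoded = pd.get_dummies(sample, columns=["education", "property_area"], dtype=int)

# Each category becomes its own 0/1 column, e.g. education_Graduate
print(encoded.columns.tolist())
```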
# Remove the loan_id to avoid accidentally using it as a feature
# Counts and data types per column
# Distributions and relationships
# Correlation between variables
# Target frequency
# Class frequency by loan_status
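The exploratory steps listed above could be sketched as follows, assuming pandas. The rows are invented for illustration, only a subset of the columns is included, and summary statistics stand in for whatever plots the original notebook drew. Using `property_area` for the class frequency breakdown is also an assumption.

```python
import pandas as pd

# Invented rows for illustration only; a real run would use the full dataset
loans = pd.DataFrame({
    "loan_id": ["L001", "L002", "L003", "L004", "L005", "L006"],
    "applicant_income": [5000, 3000, 6000, 2500, 4000, 7000],
    "loan_amount": [120, 100, 150, 90, 110, 200],
    "credit_history": [1, 0, 1, 0, 1, 1],
    "property_area": ["Urban", "Rural", "Semi Urban", "Rural", "Urban", "Urban"],
    "loan_status": [1, 0, 1, 0, 1, 0],
})

# Remove the loan_id to avoid accidentally using it as a feature
loans = loans.drop(columns="loan_id")

# Counts and data types per column
loans.info()

# Distributions: summary statistics stand in for plots here
print(loans.describe())

# Correlation between the numeric variables
print(loans.select_dtypes("number").corr())

# Target frequency, as proportions
print(loans["loan_status"].value_counts(normalize=True))

# Class frequency by loan_status (property_area chosen as an example)
area_by_status = pd.crosstab(loans["property_area"], loans["loan_status"])
print(area_by_status)
```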
Modeling
# First model using loan_amount
# Split into training and test sets
# Previewing the training set
# Instantiate a logistic regression model
# Fit to the training data
# Predict test set values
# Check the model's first five predictions
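The modeling steps above might be sketched like this, assuming scikit-learn. The data is invented for illustration, and the `test_size`, `random_state`, and `stratify` settings are assumptions, not values taken from the original notebook.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Invented rows for illustration only
loans = pd.DataFrame({
    "loan_amount": [120, 100, 150, 90, 110, 200, 130, 95, 160, 105, 140, 180],
    "loan_status": [1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1],
})

# First model using loan_amount (double brackets keep X two-dimensional)
X = loans[["loan_amount"]]
y = loans["loan_status"]

# Split into training and test sets (settings here are assumed, not from the source)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Previewing the training set
print(X_train.head())

# Instantiate a logistic regression model
clf = LogisticRegression()

# Fit to the training data
clf.fit(X_train, y_train)

# Predict test set values
predictions = clf.predict(X_test)

# Check the model's first five predictions
print(predictions[:5])
```

Passing `stratify=y` keeps the share of approved and rejected loans similar in both splits, which matters when one class is much rarer than the other.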
Classification Metrics
Accuracy
Confusion Matrix
True Positive (TP) = number of observations correctly predicted as positive
True Negative (TN) = number of observations correctly predicted as negative
False Positive (FP) = number of observations incorrectly predicted as positive (actually negative)
False Negative (FN) = number of observations incorrectly predicted as negative (actually positive)
| | Predicted: Negative | Predicted: Positive |
---|---|---|
Actual: Negative | True Negative | False Positive |
Actual: Positive | False Negative | True Positive |
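To make the four counts concrete, here is a small sketch in plain Python that tallies them from two lists and derives accuracy as (TP + TN) / (TP + TN + FP + FN). The `actual` and `predicted` labels are invented for illustration and are not model output from this dataset.

```python
# Invented labels for illustration only: 1 = positive, 0 = negative
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0]

pairs = list(zip(actual, predicted))

# Tally each cell of the confusion matrix
tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # correctly predicted positive
tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # correctly predicted negative
fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # predicted positive, actually negative
fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # predicted negative, actually positive

# Same layout as the table: rows are actual, columns are predicted
confusion = [[tn, fp],
             [fn, tp]]
print(confusion)  # [[3, 1], [1, 3]]

# Accuracy is the share of all predictions that were correct
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)  # 0.75
```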
Confusion Matrix Metrics