Getting Started with Machine Learning in Python
Supervised Learning
You can consult the solution for this live training in notebook-solution.ipynb.
Predicting values of a target variable given a set of features
- For example, predicting if a customer will buy a product (target) based on their location and last five purchases (features).
 
Regression
- Predicting the value of a continuous variable, e.g., house price.
 
Classification
- Predicting a category, most commonly a binary outcome, e.g., customer churn.
 
Data Dictionary
The data has the following fields:
| Column name | Description |
|---|---|
| loan_id | Unique loan id |
| gender | Gender - Male / Female |
| married | Marital status - Yes / No |
| dependents | Number of dependents |
| education | Education - Graduate / Not Graduate |
| self_employed | Self-employment status - Yes / No |
| applicant_income | Applicant's income |
| coapplicant_income | Coapplicant's income |
| loan_amount | Loan amount (thousands) |
| loan_amount_term | Term of loan (months) |
| credit_history | Credit history meets guidelines - 1 / 0 |
| property_area | Area of the property - Urban / Semi Urban / Rural |
| loan_status | Loan approval status (target) - 1 / 0 |
# Import required libraries
# Read in the dataset
# Preview the data
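As a minimal sketch of these three steps, the cell below reads a few rows with pandas and previews them. The rows are hypothetical stand-ins (made-up values, only a subset of the columns) so the example runs without the real file; with the actual dataset you would pass its path to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Hypothetical sample rows; the values are invented for illustration only.
csv_text = """loan_id,gender,married,applicant_income,loan_amount,credit_history,loan_status
LP001,Male,Yes,4583,128,1,0
LP002,Female,No,3000,66,1,1
LP003,Male,Yes,2583,120,0,0
LP004,Male,No,6000,141,1,1
"""

# Read in the dataset (a file path would replace the in-memory buffer)
loans = pd.read_csv(io.StringIO(csv_text))

# Preview the data
print(loans.head())
```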
Exploratory Data Analysis
We can't just dive straight into machine learning! We need to understand and format our data for modeling. What are we looking for?
Cleanliness
- Are columns set to the correct data type?
- Do we have missing data?
 
Distributions
- Some models work better when numeric features are roughly normally distributed, so heavily skewed features may need transforming.
- Do we have outliers (extreme values)?
 
Relationships
- If a feature is strongly correlated with the target variable, it might be a good predictor!
 
Feature Engineering
- Do we need to modify any data, e.g., into different data types (ML models expect numeric data), or extract part of the data?
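One possible way to turn text columns into numbers is sketched below, using `pd.get_dummies`. The rows are hypothetical; only the column names come from the data dictionary, and the choice of encoding is an assumption rather than the approach taken in the session.

```python
import pandas as pd

# Hypothetical rows; column names follow the data dictionary.
loans = pd.DataFrame({
    "education": ["Graduate", "Not Graduate", "Graduate"],
    "property_area": ["Urban", "Rural", "Semi Urban"],
    "loan_amount": [128, 66, 120],
})

# Map a two-level text column to 1 / 0
loans["education"] = (loans["education"] == "Graduate").astype(int)

# One-hot encode a multi-level column; drop_first removes one redundant level
loans = pd.get_dummies(loans, columns=["property_area"], drop_first=True)

print(loans.columns.tolist())
```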
 
# Remove the loan_id to avoid accidentally using it as a feature
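A short sketch of this step with `DataFrame.drop`, on a hypothetical two-row frame:

```python
import pandas as pd

# Hypothetical rows for illustration
loans = pd.DataFrame({
    "loan_id": ["LP001", "LP002"],
    "loan_amount": [128, 66],
    "loan_status": [0, 1],
})

# An identifier carries no predictive signal, so drop it before modeling
loans = loans.drop(columns="loan_id")

print(loans.columns.tolist())
```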
# Counts and data types per column
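The cleanliness checks could look roughly like this: `info()` reports non-null counts and data types, and `isna().sum()` counts missing values per column. The frame below is hypothetical and deliberately contains gaps.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with deliberate missing values
loans = pd.DataFrame({
    "gender": ["Male", None, "Female"],
    "loan_amount": [128.0, np.nan, 66.0],
    "credit_history": [1, 0, 1],
})

# Non-null counts and data types per column
loans.info()

# Number of missing values per column
missing = loans.isna().sum()
print(missing)
```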
# Distributions and relationships
# Correlation between variables
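# numeric_only=True skips text columns such as gender, which raise an error in pandas 2.0+
cor_matrix = loans.corr(numeric_only=True)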
sns.heatmap(cor_matrix, annot=True)
plt.show()
# Target frequency
# Class frequency by loan_status
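A sketch of both checks on hypothetical data: `value_counts(normalize=True)` gives the share of each target class, and `pd.crosstab` counts a feature's classes within each `loan_status` value. Using `credit_history` as the feature is an assumption made for illustration.

```python
import pandas as pd

# Hypothetical rows for illustration
loans = pd.DataFrame({
    "credit_history": [1, 1, 0, 1, 0, 1],
    "loan_status": [1, 1, 0, 1, 0, 0],
})

# Target frequency: share of each loan_status value
target_share = loans["loan_status"].value_counts(normalize=True)
print(target_share)

# Class frequency by loan_status: rows are credit_history, columns are loan_status
by_history = pd.crosstab(loans["credit_history"], loans["loan_status"])
print(by_history)
```

A heavily imbalanced target is worth knowing about early, because it changes how accuracy should be read later on.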
Modeling
# First model using loan_amount
# Split into training and test sets
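A minimal sketch of the split with scikit-learn's `train_test_split`. The arrays are hypothetical, and the `test_size`, `random_state`, and `stratify` settings are assumptions rather than the values used in the session.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical single feature (loan amounts) and binary target
X = np.arange(20).reshape(-1, 1)
y = np.array([0, 1] * 10)

# Hold out 25% of rows for testing; stratify keeps the class balance similar
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
```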
# Previewing the training set
# Instantiate a logistic regression model
# Fit to the training data
# Predict test set values
# Check the model's first five predictions
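The instantiate, fit, and predict steps might be sketched as follows with scikit-learn's `LogisticRegression`. The loan amounts and labels are hypothetical, and `max_iter=1000` is an assumption added to give the solver room on unscaled data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: loan_amount as the only feature
X_train = np.array([[50], [60], [70], [80], [150], [160], [170], [180]])
y_train = np.array([1, 1, 1, 1, 0, 0, 0, 0])
X_test = np.array([[55], [175]])

# Instantiate a logistic regression model
model = LogisticRegression(max_iter=1000)

# Fit to the training data
model.fit(X_train, y_train)

# Predict test set values
preds = model.predict(X_test)

# Check the model's first five predictions
print(preds[:5])
```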
Classification Metrics
Accuracy
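Accuracy is the share of predictions that match the true labels. A small worked example in plain Python, with hypothetical labels:

```python
# Hypothetical true labels and predictions
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]

# Count the positions where prediction and truth agree
correct = sum(t == p for t, p in zip(y_true, y_pred))

# Accuracy = correct predictions / all predictions
accuracy = correct / len(y_true)
print(accuracy)
```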
Confusion Matrix
True Positive (TP) = number of observations correctly predicted as positive
True Negative (TN) = number of observations correctly predicted as negative
False Positive (FP) = number of observations incorrectly predicted as positive (actually negative)
False Negative (FN) = number of observations incorrectly predicted as negative (actually positive)
| | Predicted: Negative | Predicted: Positive |
|---|---|---|
| Actual: Negative | True Negative | False Positive |
| Actual: Positive | False Negative | True Positive |
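To make the four cells concrete, here is a plain-Python sketch that counts them from hypothetical labels and arranges them in the same layout as the table (actual classes in rows, predicted classes in columns).

```python
# Hypothetical true labels and predictions (1 = positive, 0 = negative)
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]

pairs = list(zip(y_true, y_pred))

tp = sum(t == 1 and p == 1 for t, p in pairs)  # correctly predicted positive
tn = sum(t == 0 and p == 0 for t, p in pairs)  # correctly predicted negative
fp = sum(t == 0 and p == 1 for t, p in pairs)  # predicted positive, actually negative
fn = sum(t == 1 and p == 0 for t, p in pairs)  # predicted negative, actually positive

# Rows: actual negative, actual positive; columns: predicted negative, predicted positive
matrix = [[tn, fp], [fn, tp]]
print(matrix)

# Accuracy can be recovered from the same counts
accuracy = (tp + tn) / len(pairs)
print(accuracy)
```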
Confusion Matrix Metrics