Getting Started with Machine Learning in Python [Solution]
Supervised Learning
Predicting values of a target variable given a set of features
- For example, predicting whether a customer will buy a product (target) based on their location and last five purchases (features).
Regression
- Predicting the values of a continuous variable, e.g., house price.
Classification
- Predicting a discrete category, e.g., whether a customer will churn (yes/no).
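To make the distinction concrete, here is a minimal, self-contained sketch using scikit-learn with tiny made-up arrays (the numbers are illustrative only, not from the loans data):
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])  # one toy feature

# Regression: the target is continuous (e.g., a house price)
regressor = LinearRegression().fit(X, np.array([100.0, 150.0, 210.0, 260.0]))
print(regressor.predict([[5.0]]))  # a continuous estimate

# Classification: the target is a discrete class (e.g., churn: 0 / 1)
classifier = LogisticRegression().fit(X, np.array([0, 0, 1, 1]))
print(classifier.predict([[5.0]]))  # a class label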
Data Dictionary
The data has the following fields:
| Column name | Description |
|---|---|
| loan_id | Unique loan id |
| gender | Gender - Male / Female |
| married | Marital status - Yes / No |
| dependents | Number of dependents |
| education | Education - Graduate / Not Graduate |
| self_employed | Self-employment status - Yes / No |
| applicant_income | Applicant's income |
| coapplicant_income | Coapplicant's income |
| loan_amount | Loan amount (thousands) |
| loan_amount_term | Term of loan (months) |
| credit_history | Credit history meets guidelines - 1 / 0 |
| property_area | Area of the property - Urban / Semi Urban / Rural |
| loan_status | Loan approval status (target) - 1 / 0 |
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Read in the dataset
loans = pd.read_csv("loans.csv")
# Preview the data
loans.head()
Exploratory Data Analysis
We can't just dive straight into machine learning! We need to understand and format our data for modeling. What are we looking for?
Cleanliness
- Are columns set to the correct data type?
- Do we have missing data?
Distributions
- Many machine learning algorithms expect data that is normally distributed.
- Do we have outliers (extreme values)?
Relationships
- If a column is strongly correlated with the target variable, it might be a good feature for predictions!
Feature Engineering
- Do we need to modify any data, e.g., convert columns to different data types (ML models expect numeric data), or extract part of the data? A quick sketch of these checks follows this list.
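As an illustration of the cleanliness and feature-engineering checks above (a minimal sketch, not a step this notebook performs; it assumes `loans` has already been loaded as above):
# Missing values per column (cleanliness check)
print(loans.isna().sum())

# One-hot encode the object columns so every feature is numeric
# (feature engineering); loan_id is left out since it is an identifier,
# and drop_first avoids redundant dummy columns
encoded = pd.get_dummies(loans.drop(columns=["loan_id"]), drop_first=True)
print(encoded.dtypes)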
# Remove the loan_id to avoid accidentally using it as a feature
loans.drop(columns=["loan_id"], inplace=True)

# Counts and data types per column
loans.info()

# Distributions and relationships
sns.pairplot(data=loans, diag_kind="kde", hue="loan_status")
plt.show()

# Correlation between variables (numeric columns only)
sns.heatmap(loans.corr(numeric_only=True), annot=True)
plt.show()

# Target frequency
loans["loan_status"].value_counts(normalize=True)

# Class frequency by loan_status
for col in loans.columns[loans.dtypes == "object"]:
    sns.countplot(data=loans, x=col, hue="loan_status")
    plt.show()
Modeling
# First model using loan_amount
X = loans[["loan_amount"]]
y = loans["loan_status"]
# Split into training and test sets; stratify=y keeps the
# proportion of approved/rejected loans the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
# Previewing the training set
print(X_train[:5], "\n", y_train[:5])

# Instantiate a logistic regression model
clf = LogisticRegression(random_state=42)
# Fit to the training data
clf.fit(X_train, y_train)
# Predict test set values
y_pred = clf.predict(X_test)
# Check the model's first five predictions
print(y_pred[:5])
Classification Metrics
Accuracy
- The proportion of predictions the model got right: correct predictions / total predictions.
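Continuing from the fitted model above, a minimal sketch of computing accuracy with scikit-learn (accuracy_score is an extra import beyond the ones at the top):
from sklearn.metrics import accuracy_score

# Fraction of test-set predictions that match the true labels
print(accuracy_score(y_test, y_pred))

# Equivalent shortcut: for classifiers, score() reports accuracy
print(clf.score(X_test, y_test))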
Confusion Matrix
True Positive (TP) = # Correctly predicted as positive
True Negative (TN) = # Correctly predicted as negative
False Positive (FP) = # Incorrectly predicted as positive (actually negative)
False Negative (FN) = # Incorrectly predicted as negative (actually positive)
| | Predicted: Negative | Predicted: Positive |
|---|---|---|
| Actual: Negative | True Negative | False Positive |
| Actual: Positive | False Negative | True Positive |
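To compute and plot this matrix for our model, a minimal sketch using the confusion_matrix and ConfusionMatrixDisplay imports from the top of the notebook:
# Compute the confusion matrix from the test-set predictions
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Visualize it; labels follow the order of clf.classes_
ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=clf.classes_).plot()
plt.show()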
Confusion Matrix Metrics