Getting Started with Machine Learning in Python

Supervised Learning

You can consult the solution for this live training in notebook-solution.ipynb.

Supervised learning predicts the value of a target variable from a set of features.

For example, predicting whether a customer will buy a product (target) based on their location and last five purchases (features).

Regression

Predicting the value of a continuous variable, e.g., house price.

Classification

Predicting a categorical outcome, e.g., whether a customer churns (a binary outcome, as in this training).

Data Dictionary

The data has the following fields:

Column name	Description
`loan_id`	Unique loan id
`gender`	Gender - `Male` / `Female`
`married`	Marital status - `Yes` / `No`
`dependents`	Number of dependents
`education`	Education - `Graduate` / `Not Graduate`
`self_employed`	Self-employment status - `Yes` / `No`
`applicant_income`	Applicant's income
`coapplicant_income`	Coapplicant's income
`loan_amount`	Loan amount (thousands)
`loan_amount_term`	Term of loan (months)
`credit_history`	Credit history meets guidelines - `1` / `0`
`property_area`	Area of the property - `Urban` / `Semi Urban` / `Rural`
`loan_status`	Loan approval status (target) - `1` / `0`

# Import required libraries

# Read in the dataset


# Preview the data

Exploratory Data Analysis

We can't just dive straight into machine learning! We need to understand and format our data for modeling. What are we looking for?

Cleanliness

Are columns set to the correct data type?
Do we have missing data?
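
Both cleanliness questions can be answered with two pandas calls. The sketch below uses a few made-up rows that borrow column names from the data dictionary; `loans_demo` and its values are assumptions for illustration, not the real dataset.

```python
import pandas as pd

# Toy rows for illustration only; these are NOT taken from the real loans data.
loans_demo = pd.DataFrame(
    {
        "gender": ["Male", "Female", None, "Male"],
        # Hypothetical case: a count stored as text instead of as a number
        "dependents": ["0", "1", "2", "1"],
        "applicant_income": [5000, 3000, 4000, 6000],
        "loan_amount": [120.0, None, 95.0, 150.0],
        "loan_status": [1, 0, 1, 1],
    }
)

# Data type per column: text columns are reported as object or a string dtype
column_types = loans_demo.dtypes

# Number of missing values per column
missing_counts = loans_demo.isna().sum()

print(column_types)
print(missing_counts)
```

A numeric-looking column that is reported as text (like `dependents` here) is a hint that it needs converting before modeling.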

Distributions

Some machine learning algorithms assume, or simply work better with, features that are roughly normally distributed. Are ours heavily skewed?
Do we have outliers (extreme values)?
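
One common way to check for skew and outliers is the 1.5 × IQR rule together with the sample skewness. This is a minimal sketch on hypothetical income values; the numbers and the choice of the IQR rule are assumptions, not something prescribed by this training.

```python
import pandas as pd

# Hypothetical incomes for illustration; not values from the real dataset.
income = pd.Series([2500, 3000, 3200, 3500, 4000, 4200, 4500, 81000])

# Interquartile range: the spread of the middle 50% of the values
q1 = income.quantile(0.25)
q3 = income.quantile(0.75)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged as outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = income[(income < lower) | (income > upper)]

# Skewness well above 0 points to a long right tail
skewness = income.skew()

print(outliers.tolist())
print(skewness)
```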

Relationships

If a feature is strongly correlated with the target variable, it might be a good predictor!

Feature Engineering

Do we need to transform any data, e.g., convert it to a different data type (most ML models expect numeric input), or extract part of a value?
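
As a rough sketch of turning text columns into numbers, a two-level column can be mapped to `1` / `0` and a multi-level column can be one-hot encoded with `pd.get_dummies`. The rows below are made up, and `loans_demo` and `encoded` are hypothetical names; the real notebook may encode its columns differently.

```python
import pandas as pd

# Toy rows for illustration only; not the real loans data.
loans_demo = pd.DataFrame(
    {
        "married": ["Yes", "No", "Yes"],
        "property_area": ["Urban", "Rural", "Semi Urban"],
        "loan_amount": [120, 95, 150],
    }
)

# Map a two-level column directly to 1 / 0
loans_demo["married"] = loans_demo["married"].map({"Yes": 1, "No": 0})

# One-hot encode the multi-level column: one 0/1 column per category
encoded = pd.get_dummies(loans_demo, columns=["property_area"], dtype=int)

print(encoded)
```

One-hot encoding avoids implying an order between categories such as `Urban` and `Rural`, which a single integer code would do.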

# Remove the loan_id to avoid accidentally using it as a feature

# Counts and data types per column

# Distributions and relationships

# Correlation between variables
# numeric_only=True skips text columns, which pandas 2.0+ no longer drops silently
cor_matrix = loans.corr(numeric_only=True)

sns.heatmap(cor_matrix, annot=True)

plt.show()

# Target frequency

# Class frequency by loan_status

Modeling

# First model using loan_amount

# Split into training and test sets

# Previewing the training set

# Instantiate a logistic regression model

# Fit to the training data

# Predict test set values

# Check the model's first five predictions
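
The modeling steps above can be sketched end to end with scikit-learn. To stay self-contained, the example generates synthetic data rather than reading the loans file, so the relationship between `loan_amount` and `loan_status` here is invented. The split size, `random_state`, `stratify`, and `max_iter` settings are also assumptions, not the values used in the solution notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data for illustration only; not the real loans dataset.
rng = np.random.default_rng(42)
loan_amount = rng.normal(loc=140, scale=40, size=200)
# Hypothetical rule: smaller loans are approved more often, plus noise
loan_status = (loan_amount + rng.normal(scale=30, size=200) < 150).astype(int)

X = loan_amount.reshape(-1, 1)  # scikit-learn expects a 2-D feature array
y = loan_status

# Split into training and test sets, keeping class proportions similar
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Instantiate a logistic regression model and fit it to the training data
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predict test set values and check the first five predictions
y_pred = model.predict(X_test)
print(y_pred[:5])
```

Holding out a test set means the model is evaluated on rows it never saw during fitting.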

Classification Metrics

Accuracy

Confusion Matrix

True Positive (TP) = # Correctly predicted as positive

True Negative (TN) = # Correctly predicted as negative

False Positive (FP) = # Incorrectly predicted as positive (actually negative)

False Negative (FN) = # Incorrectly predicted as negative (actually positive)

	Predicted: Negative	Predicted: Positive
Actual: Negative	True Negative	False Positive
Actual: Positive	False Negative	True Positive
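
To make the four counts concrete, here is a small plain-Python sketch that tallies them from made-up labels and derives accuracy as (TP + TN) / total. The label lists are invented for illustration.

```python
# Made-up labels for illustration (1 = positive, 0 = negative)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

pairs = list(zip(y_true, y_pred))

tp = sum(1 for actual, pred in pairs if actual == 1 and pred == 1)
tn = sum(1 for actual, pred in pairs if actual == 0 and pred == 0)
fp = sum(1 for actual, pred in pairs if actual == 0 and pred == 1)
fn = sum(1 for actual, pred in pairs if actual == 1 and pred == 0)

# Same layout as the table above: rows are actual, columns are predicted
confusion = [[tn, fp], [fn, tp]]

# Accuracy: share of all predictions that were correct
accuracy = (tp + tn) / len(pairs)

print(confusion)
print(accuracy)
```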

Confusion Matrix Metrics
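
The metrics most often derived from the confusion matrix are precision, recall, and the F1 score. That these are the three metrics this section covers is an assumption, and the counts below are hypothetical.

```python
# Hypothetical confusion matrix counts for illustration
tp, fp, fn, tn = 40, 10, 20, 30

# Precision: of everything predicted positive, how much was actually positive?
precision = tp / (tp + fp)

# Recall: of everything actually positive, how much did the model find?
recall = tp / (tp + fn)

# F1 score: harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, f1)
```

Precision matters most when false positives are costly, and recall when false negatives are.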
