Getting Started with Machine Learning in Python
    Supervised Learning

    Predicting values of a target variable given a set of features

    • For example, predicting if a customer will buy a product (target) based on their location and last five purchases (features).


    • Regression: predicting the value of a continuous variable, e.g., house price.

    • Classification: predicting a category, such as a binary outcome, e.g., customer churn.

    Data Dictionary

    The data has the following fields:

    Column name           Description
    loan_id               Unique loan id
    gender                Gender - Male / Female
    married               Marital status - Yes / No
    dependents            Number of dependents
    education             Education - Graduate / Not Graduate
    self_employed         Self-employment status - Yes / No
    applicant_income      Applicant's income
    coapplicant_income    Coapplicant's income
    loan_amount           Loan amount (thousands)
    loan_amount_term      Term of loan (months)
    credit_history        Credit history meets guidelines - 1 / 0
    property_area         Area of the property - Urban / Semi Urban / Rural
    loan_status           Loan approval status (target) - 1 / 0
    # Import required libraries
    # Read in the dataset
    # Preview the data
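
The three steps named in the comments above might look like the sketch below. It assumes pandas is the library in use, which the source does not state. The real dataset's filename is not given, so the example reads a few made-up rows from an in-memory string instead; the values are illustrative only and follow the data dictionary's column names.

```python
# Import required libraries (pandas is an assumption; the source does not name it)
import io

import pandas as pd

# Hypothetical stand-in for the real file: three made-up rows that follow the
# data dictionary. With the actual dataset you would pass its path to
# pd.read_csv instead of a StringIO object.
csv_text = """loan_id,gender,married,dependents,education,self_employed,applicant_income,coapplicant_income,loan_amount,loan_amount_term,credit_history,property_area,loan_status
L001,Male,Yes,1,Graduate,No,4583,1508,128,360,1,Rural,0
L002,Female,No,0,Graduate,Yes,3000,0,66,360,1,Urban,1
L003,Male,Yes,0,Not Graduate,No,2583,2358,120,360,1,Urban,1
"""

# Read in the dataset
loans = pd.read_csv(io.StringIO(csv_text))

# Preview the data
print(loans.head())
```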

    Exploratory Data Analysis

    We can't just dive straight into machine learning! We need to understand and format our data for modeling. What are we looking for?


    • Are columns set to the correct data type?
    • Do we have missing data?


    • How is each variable distributed? Some machine learning algorithms assume, or work better with, roughly normally distributed data.
    • Do we have outliers (extreme values)?


    • If a feature is strongly correlated with the target variable, it might be a good predictor!

    Feature Engineering

    • Do we need to transform any data, e.g., convert it to a different data type (most ML models expect numeric input), or extract part of a value?
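
As a rough illustration of converting text columns to numbers, the sketch below encodes a Yes / No column as 1 / 0 and expands multi-category columns into indicator columns. It assumes pandas; the `applicants` frame and its rows are made up for the example, and this is one common approach rather than necessarily the one used in the original notebook.

```python
import pandas as pd

# Made-up rows for illustration; column names follow the data dictionary
applicants = pd.DataFrame({
    "married": ["Yes", "No", "Yes"],
    "education": ["Graduate", "Not Graduate", "Graduate"],
    "property_area": ["Urban", "Rural", "Semi Urban"],
})

# Binary Yes / No column: map the two labels to 1 / 0
applicants["married"] = applicants["married"].map({"Yes": 1, "No": 0})

# Multi-category columns: create one 0/1 indicator column per category
encoded = pd.get_dummies(
    applicants, columns=["education", "property_area"], dtype=int
)

print(encoded.columns.tolist())
```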
    # Remove the loan_id to avoid accidentally using it as a feature
    # Counts and data types per column
    # Distributions and relationships
    # Correlation between variables
    # Target frequency
    # Class frequency by loan_status
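
The checks listed in the comments above could be sketched as follows, assuming pandas. The rows are made up so that the block runs on its own, and summary statistics stand in for the plots the original notebook presumably drew for distributions and relationships.

```python
import pandas as pd

# Made-up rows for illustration only; column names follow the data dictionary
loans = pd.DataFrame({
    "loan_id": ["L001", "L002", "L003", "L004", "L005", "L006"],
    "applicant_income": [4583, 3000, 2583, 6000, 5417, 2333],
    "loan_amount": [128.0, 66.0, 120.0, 141.0, None, 95.0],
    "credit_history": [1, 1, 1, 0, 1, 0],
    "property_area": ["Rural", "Urban", "Urban", "Urban", "Semi Urban", "Rural"],
    "loan_status": [0, 1, 1, 0, 1, 0],
})

# Remove the loan_id to avoid accidentally using it as a feature
loans = loans.drop(columns="loan_id")

# Counts and data types per column, plus missing values per column
loans.info()
missing_per_column = loans.isna().sum()

# Distributions: summary statistics (used here in place of plots)
summary = loans.describe()

# Correlation between the numeric variables
correlations = loans.select_dtypes("number").corr()

# Target frequency
target_counts = loans["loan_status"].value_counts()

# Class frequency by loan_status
area_by_status = pd.crosstab(loans["property_area"], loans["loan_status"])
print(area_by_status)
```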


    # First model using loan_amount
    # Split into training and test sets
    # Previewing the training set
    # Instantiate a logistic regression model
    # Fit to the training data
    # Predict test set values
    # Check the model's first five predictions
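
A minimal sketch of these modeling steps, assuming scikit-learn and pandas (the source names neither). The data is randomly generated so the block is self-contained, which means the predictions carry no real signal. The 30% test size, the stratified split, and the `random_state` values are assumptions chosen for the example, not settings taken from the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Randomly generated stand-in data (illustrative only, no real signal)
rng = np.random.default_rng(42)
loans = pd.DataFrame({
    "loan_amount": rng.normal(140, 40, size=200).round(),
    "loan_status": rng.integers(0, 2, size=200),
})

# First model using loan_amount as the only feature
X = loans[["loan_amount"]]
y = loans["loan_status"]

# Split into training and test sets (30% held out; an assumed proportion)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Previewing the training set
print(X_train.head())

# Instantiate a logistic regression model
clf = LogisticRegression()

# Fit to the training data
clf.fit(X_train, y_train)

# Predict test set values
predictions = clf.predict(X_test)

# Check the model's first five predictions
print(predictions[:5])
```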

    Classification Metrics




    Confusion Matrix

    True Positive (TP) = number of cases correctly predicted as positive

    True Negative (TN) = number of cases correctly predicted as negative

    False Positive (FP) = number of cases incorrectly predicted as positive (actually negative)

    False Negative (FN) = number of cases incorrectly predicted as negative (actually positive)


                        Predicted: Negative    Predicted: Positive
    Actual: Negative    True Negative          False Positive
    Actual: Positive    False Negative         True Positive
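
To make the four cells concrete, here is a small sketch that counts them by hand in plain Python. The helper name `confusion_counts` and the two label lists are made up for the example; in practice a library routine would usually do this.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, TN, FP and FN for paired actual / predicted labels."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1  # correctly predicted as positive
        elif p != positive and a != positive:
            tn += 1  # correctly predicted as negative
        elif p == positive:
            fp += 1  # predicted positive, actually negative
        else:
            fn += 1  # predicted negative, actually positive
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}


# Made-up labels for illustration
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0]

counts = confusion_counts(actual, predicted)
print(counts)  # {'TP': 3, 'TN': 3, 'FP': 1, 'FN': 1}
```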


    Confusion Matrix Metrics
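
The source does not list the metrics under this heading. As a sketch, four metrics commonly derived from the confusion matrix (accuracy, precision, recall, and F1) can be computed from the counts as below. The counts are hypothetical, chosen only to make the arithmetic easy to follow.

```python
# Hypothetical confusion matrix counts (not results from the loan data)
tp, tn, fp, fn = 40, 30, 10, 20

# Accuracy: share of all predictions that were correct
accuracy = (tp + tn) / (tp + tn + fp + fn)

# Precision: share of predicted positives that were actually positive
precision = tp / (tp + fp)

# Recall: share of actual positives that the model found
recall = tp / (tp + fn)

# F1: harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```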