Getting Started with Machine Learning in Python
  • Supervised Learning

    Predicting values of a target variable given a set of features

    • For example, predicting if a customer will buy a product (target) based on their location and last five purchases (features).

    Regression

    • Predicting the values of a continuous variable e.g., house price.

    Classification

    • Predicting a categorical outcome, often a binary one, e.g., whether a customer churns.

    Data Dictionary

    The data has the following fields:

    Column name          Description
    loan_id              Unique loan id
    gender               Gender - Male / Female
    married              Marital status - Yes / No
    dependents           Number of dependents
    education            Education - Graduate / Not Graduate
    self_employed        Self-employment status - Yes / No
    applicant_income     Applicant's income
    coapplicant_income   Coapplicant's income
    loan_amount          Loan amount (thousands)
    loan_amount_term     Term of loan (months)
    credit_history       Credit history meets guidelines - 1 / 0
    property_area        Area of the property - Urban / Semi Urban / Rural
    loan_status          Loan approval status (target) - 1 / 0

    # Import required libraries
    
    # Read in the dataset
    
    
    # Preview the data
    
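The three steps above might look like the sketch below. The file name of the dataset is not given here, so the example reads a tiny in-memory CSV instead: the three rows and the `loans` variable name are made up for illustration, and only the column names come from the data dictionary.

```python
# Import required libraries
import io

import pandas as pd

# Hypothetical stand-in for the real file: three made-up rows using the
# column names from the data dictionary. With a real file, pass its path
# to pd.read_csv instead of the StringIO buffer.
csv_text = """loan_id,gender,married,dependents,education,self_employed,applicant_income,coapplicant_income,loan_amount,loan_amount_term,credit_history,property_area,loan_status
LP001,Male,No,0,Graduate,No,5849,0,128,360,1,Urban,1
LP002,Male,Yes,1,Graduate,No,4583,1508,128,360,1,Rural,0
LP003,Female,Yes,0,Not Graduate,Yes,3000,0,66,360,0,Semi Urban,1
"""

# Read in the dataset
loans = pd.read_csv(io.StringIO(csv_text))

# Preview the data
print(loans.head())
```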

    Exploratory Data Analysis

    We can't just dive straight into machine learning! We need to understand and format our data for modeling. What are we looking for?

    Cleanliness

    • Are columns set to the correct data type?
    • Do we have missing data?

    Distributions

    • Some machine learning algorithms work better when features are roughly symmetric and on similar scales, so check for heavily skewed distributions.
    • Do we have outliers (extreme values)?
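
One way to check both points is sketched below on made-up income values (not taken from the dataset). The log transform and the 1.5 × IQR rule are common conventions chosen here for illustration, not steps prescribed by this tutorial.

```python
import numpy as np
import pandas as pd

# Made-up incomes: mostly moderate values plus one extreme value.
income = pd.Series(
    [2500, 3000, 3200, 4000, 4500, 5000, 6000, 81000],
    name="applicant_income",
)

# Skewness well above 0 indicates a long right tail.
print("skew before:", income.skew())

# log1p compresses large values and keeps zero incomes valid.
log_income = np.log1p(income)
print("skew after:", log_income.skew())

# Flag outliers with the 1.5 * IQR rule.
q1, q3 = income.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = income[(income < q1 - 1.5 * iqr) | (income > q3 + 1.5 * iqr)]
print(outliers)
```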

    Relationships

    • If a feature is strongly correlated with the target variable, it may be a good predictor!

    Feature Engineering

    • Do we need to modify any data, e.g., convert it to a different data type (most ML models expect numeric input), or extract part of a value?
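
As a rough illustration of converting text categories to numbers, the sketch below encodes two columns from the data dictionary. The rows are made up, and the choice of a 0/1 flag for `education` and indicator columns for `property_area` is an assumption rather than the tutorial's own approach.

```python
import pandas as pd

# Made-up rows using column names and categories from the data dictionary.
loans = pd.DataFrame({
    "education": ["Graduate", "Not Graduate", "Graduate"],
    "property_area": ["Urban", "Rural", "Semi Urban"],
    "loan_amount": [128, 66, 120],
})

# Binary category -> a single 0/1 column.
loans["education"] = (loans["education"] == "Graduate").astype(int)

# Multi-class category -> one indicator column per category.
loans = pd.get_dummies(loans, columns=["property_area"], dtype=int)

print(loans)
```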
    # Remove the loan_id to avoid accidentally using it as a feature
    
    # Counts and data types per column
    
    # Distributions and relationships
    
    # Correlation between variables
    
    # Target frequency
    
    # Class frequency by loan_status
    
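A minimal sketch of the checks listed above, run on a handful of made-up rows rather than the real data. Plotting is left out, and using `credit_history` for the class-frequency table is an assumption made for illustration.

```python
import pandas as pd

# Made-up rows; only a few of the data dictionary's columns are included.
loans = pd.DataFrame({
    "loan_id": ["A1", "A2", "A3", "A4", "A5", "A6"],
    "applicant_income": [5849, 4583, 3000, 2583, 6000, 5417],
    "loan_amount": [128.0, 128.0, 66.0, 120.0, None, 267.0],
    "credit_history": [1, 1, 0, 1, 1, 0],
    "loan_status": [1, 0, 0, 1, 1, 0],
})

# Remove the loan_id to avoid accidentally using it as a feature
loans = loans.drop(columns="loan_id")

# Counts and data types per column
loans.info()
missing = loans.isna().sum()

# Correlation between numeric variables
corr = loans.select_dtypes("number").corr()

# Target frequency (as proportions)
target_share = loans["loan_status"].value_counts(normalize=True)

# Class frequency by loan_status
by_status = pd.crosstab(loans["credit_history"], loans["loan_status"])

print(missing, corr, target_share, by_status, sep="\n\n")
```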

    Modeling

    # First model using loan_amount
    
    # Split into training and test sets
    
    # Previewing the training set
    
    # Instantiate a logistic regression model
    
    # Fit to the training data
    
    # Predict test set values
    
    # Check the model's first five predictions
    
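The modeling steps above could be filled in along these lines with scikit-learn. The data here is randomly generated, so the predictions are meaningless; the 30% test size, the `random_state`, and the stratified split are assumptions, not values given in this tutorial.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the loan data (not the real dataset).
rng = np.random.default_rng(42)
loans = pd.DataFrame({
    "loan_amount": rng.normal(140, 50, size=200).clip(10),
    "loan_status": rng.integers(0, 2, size=200),
})

# First model using loan_amount
X = loans[["loan_amount"]]  # 2-D feature matrix
y = loans["loan_status"]    # 1-D target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Previewing the training set
print(X_train.head())

# Instantiate a logistic regression model
clf = LogisticRegression()

# Fit to the training data
clf.fit(X_train, y_train)

# Predict test set values
y_pred = clf.predict(X_test)

# Check the model's first five predictions
print(y_pred[:5])
```

Double brackets in `loans[["loan_amount"]]` keep the features as a two-dimensional DataFrame, which is the shape scikit-learn estimators expect even for a single feature.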

    Classification Metrics

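    Accuracy

    • The proportion of all predictions that are correct: (TP + TN) / (TP + TN + FP + FN), using the counts defined under Confusion Matrix.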

    Confusion Matrix

    True Positive (TP) = number of observations correctly predicted as positive

    True Negative (TN) = number of observations correctly predicted as negative

    False Positive (FP) = number of observations incorrectly predicted as positive (actually negative)

    False Negative (FN) = number of observations incorrectly predicted as negative (actually positive)

                        Predicted: Negative    Predicted: Positive
    Actual: Negative    True Negative          False Positive
    Actual: Positive    False Negative         True Positive
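
To make the four counts concrete, the sketch below tallies them by hand for a small set of made-up labels and arranges them in the same layout as the table (rows are actual classes, columns are predicted classes). It uses plain Python rather than a library function so that each definition is visible.

```python
# Made-up labels: 1 = positive, 0 = negative.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

pairs = list(zip(y_true, y_pred))

# Count each outcome according to its definition.
tp = sum(1 for actual, pred in pairs if actual == 1 and pred == 1)
tn = sum(1 for actual, pred in pairs if actual == 0 and pred == 0)
fp = sum(1 for actual, pred in pairs if actual == 0 and pred == 1)
fn = sum(1 for actual, pred in pairs if actual == 1 and pred == 0)

# Rows: actual negative, actual positive.
# Columns: predicted negative, predicted positive.
confusion = [[tn, fp],
             [fn, tp]]

# Accuracy is the share of predictions that are correct.
accuracy = (tp + tn) / len(pairs)

print(confusion)  # [[3, 1], [1, 3]]
print(accuracy)   # 0.75
```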

    Confusion Matrix Metrics