1. Credit card applications
Commercial banks receive a lot of applications for credit cards. Many of them are rejected, for reasons such as high loan balances, low income levels, or too many inquiries on an individual's credit report. Manually analyzing these applications is mundane, error-prone, and time-consuming (and time is money!). Luckily, this task can be automated with the power of machine learning, and pretty much every commercial bank does so nowadays. In this notebook, we will build an automatic credit card approval predictor using machine learning techniques, just like the real banks do.
We'll use the Credit Card Approval dataset from the UCI Machine Learning Repository. The structure of this notebook is as follows:
- First, we will start off by loading and viewing the dataset.
- We will see that the dataset has a mixture of both numerical and non-numerical features, that it contains values from different ranges, and that it contains a number of missing entries.
- We will have to preprocess the dataset to ensure the machine learning model we choose can make good predictions.
- After our data is in good shape, we will do some exploratory data analysis to build our intuitions.
- Finally, we will build a machine learning model that can predict if an individual's application for a credit card will be accepted.
First, we load and view the dataset. Because this data is confidential, the contributor of the dataset has anonymized the feature names.
# Import pandas
import pandas as pd
# Load dataset
cc_apps = pd.read_csv("datasets/cc_approvals.data",header=None)
# Inspect data
print(cc_apps.head())
2. Inspecting the applications
The output may appear a bit confusing at first sight, but let's try to figure out the most important features of a credit card application. The features of this dataset have been anonymized to protect privacy, but this blog gives us a pretty good overview of the probable features. The probable features in a typical credit card application are Gender, Age, Debt, Married, BankCustomer, EducationLevel, Ethnicity, YearsEmployed, PriorDefault, Employed, CreditScore, DriversLicense, Citizen, ZipCode, Income and finally the ApprovalStatus. This gives us a pretty good starting point, and we can map these features to the columns in the output.
As we can see from our first glance at the data, the dataset has a mixture of numerical and non-numerical features. This can be fixed with some preprocessing, but before we do that, let's learn about the dataset a bit more to see if there are other dataset issues that need to be fixed.
# Print summary statistics
cc_apps_description = cc_apps.describe()
print(cc_apps_description)
print('\n')
# Print DataFrame information
# info() prints its summary itself and returns None, so call it directly
cc_apps.info()
print('\n')
# Inspect missing values in the dataset
print(cc_apps.tail(n=17))
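Scanning the last rows by eye shows a few '?' entries, but it does not tell us how many there are per column. A minimal sketch of one way to count them follows; the frame toy and its values are made up for illustration and stand in for cc_apps.

```python
import pandas as pd

# Toy stand-in for cc_apps; the values are made up for illustration
toy = pd.DataFrame({
    0: ["a", "?", "b", "a"],
    1: ["30.83", "?", "24.50", "?"],
    2: [0.0, 4.46, 0.5, 1.54],
})

# Compare every cell with '?' and sum the matches column by column
question_marks = (toy == "?").sum()
print(question_marks)
```

The same expression applied to cc_apps should give a per-column count of the missing-value markers.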
3. Splitting the dataset into train and test sets
Now, we will split our data into a train set and a test set to prepare it for two different phases of machine learning modeling: training and testing. Ideally, no information from the test data should be used to preprocess the training data or to direct the training process of a machine learning model. Hence, we first split the data and then preprocess it.
Also, features like DriversLicense and ZipCode are not as important as the other features in the dataset for predicting credit card approvals. To get a better sense, we can measure their statistical correlation to the labels of the dataset. But this is out of scope for this project. We should drop them to design our machine learning model with the best set of features. In Data Science literature, this is often referred to as feature selection.
# Import train_test_split
from sklearn.model_selection import train_test_split
# Drop the features 11 and 13
cc_apps = cc_apps.drop([11, 13], axis=1)
# Split into train and test sets
cc_apps_train, cc_apps_test = train_test_split(cc_apps, test_size=0.33, random_state=42)
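Although a full correlation analysis is out of scope here, a rough sketch may help show what measuring a feature's correlation to the labels could look like. The frame toy, its values, and the 't'/'f' and '+'/'-' encodings are assumptions made for illustration, not results from the real dataset.

```python
import pandas as pd

# Toy data with made-up values; the 't'/'f' and '+'/'-' encodings are assumed
toy = pd.DataFrame({
    "DriversLicense": ["t", "f", "t", "f", "t", "f"],
    "PriorDefault":   ["t", "t", "t", "f", "f", "f"],
    "ApprovalStatus": ["+", "+", "+", "-", "-", "-"],
})

# Encode the binary categories as 0/1 so a correlation can be computed
encoded = pd.DataFrame({
    "DriversLicense": (toy["DriversLicense"] == "t").astype(int),
    "PriorDefault": (toy["PriorDefault"] == "t").astype(int),
    "ApprovalStatus": (toy["ApprovalStatus"] == "+").astype(int),
})

# Pearson correlation of each feature with the label
correlations = encoded.corr()["ApprovalStatus"].drop("ApprovalStatus")
print(correlations)
```

In this toy example the second feature tracks the label exactly while the first does so only weakly, which is the kind of contrast that would motivate dropping a feature.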
4. Handling the missing values (part i)
Now we've split our data, we can handle some of the issues we identified when inspecting the DataFrame, including:
- Our dataset contains both numeric and non-numeric data (specifically data that are of float64, int64 and object types). Specifically, the features 2, 7, 10 and 14 contain numeric values (of types float64, float64, int64 and int64 respectively) and all the other features contain non-numeric values.
- The dataset also contains values from several ranges. Some features have a value range of 0 - 28, some have a range of 2 - 67, and some have a range of 1017 - 100000. Apart from these, we can get useful statistical information (like mean, max, and min) about the features that have numerical values.
- Finally, the dataset has missing values, which we'll take care of in this task. The missing values in the dataset are labeled with '?', which can be seen in the last cell's output of the second task.
Now, let's temporarily replace these missing value question marks with NaN.
# Import numpy
import numpy as np
# Replace the '?'s with NaN in the train and test sets
cc_apps_train = cc_apps_train.replace('?',np.nan)
cc_apps_test = cc_apps_test.replace('?',np.nan)
5. Handling the missing values (part ii)
We replaced all the question marks with NaNs. This is going to help us in the next missing value treatment that we are going to perform.
An important question here is why we are giving so much importance to missing values. Can't they just be ignored? Ignoring missing values can heavily affect the performance of a machine learning model: the model may miss out on information about the dataset that would be useful for its training. In addition, many models, such as Linear Discriminant Analysis (LDA), cannot handle missing values implicitly.
So, to avoid this problem, we are going to impute the missing values with a strategy called mean imputation.
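# Impute the missing values with mean imputation, using training-set means only
# (numeric_only=True skips the non-numeric columns, which have no mean)
train_means = cc_apps_train.mean(numeric_only=True)
cc_apps_train = cc_apps_train.fillna(train_means)
cc_apps_test = cc_apps_test.fillna(train_means)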
# Count the number of NaNs in the datasets and print the counts to verify
print(cc_apps_train.isnull().sum())
print(cc_apps_test.isnull().sum())
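To see what mean imputation does on a small scale, here is a minimal sketch. The frames train and test and their values are made up for illustration; the point is that the mean is computed on the training data only and then reused for the test data.

```python
import numpy as np
import pandas as pd

# Toy train and test sets; names and values are made up for illustration
train = pd.DataFrame({"Debt": [1.0, np.nan, 3.0], "Gender": ["a", "b", np.nan]})
test = pd.DataFrame({"Debt": [np.nan, 10.0], "Gender": ["b", np.nan]})

# Column means from the training set only; numeric_only skips the text column
train_means = train.mean(numeric_only=True)

# Fill numeric gaps in both sets with the training-set means
train_filled = train.fillna(train_means)
test_filled = test.fillna(train_means)

print(train_filled)
print(test_filled)
```

The missing Debt values are filled in both frames, while the missing Gender values are left alone because a text column has no mean.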
6. Handling the missing values (part iii)
We have successfully taken care of the missing values present in the numeric columns. There are still some missing values to be imputed for columns 0, 1, 3, 4, 5 and 6 (column 13 was dropped before the split). All of these columns contain non-numeric data, which is why the mean imputation strategy would not work here. This needs a different treatment.
We are going to impute these missing values with the most frequent values as present in the respective columns. This is good practice when it comes to imputing missing values for categorical data in general.
# Iterate over each column of cc_apps_train
for col in cc_apps_train.columns:
    # Check if the column is of object type
    if cc_apps_train[col].dtype == 'object':
        # Most frequent value of this column, taken from the training set only
        most_frequent = cc_apps_train[col].value_counts().index[0]
        # Impute this column in both sets with the training-set value
        cc_apps_train[col] = cc_apps_train[col].fillna(most_frequent)
        cc_apps_test[col] = cc_apps_test[col].fillna(most_frequent)
# Count the number of NaNs in the datasets and print the counts to verify
print(cc_apps_train.isnull().sum())
print(cc_apps_test.isnull().sum())
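As a small illustration of most-frequent-value imputation, the sketch below fills each categorical column with its own mode from the training data. The frames train and test and their values are made up, and using DataFrame.mode() is one possible way to get the most frequent values, not necessarily the one intended for this task.

```python
import numpy as np
import pandas as pd

# Toy categorical data; names and values are made up for illustration
train = pd.DataFrame({
    "Gender": ["b", "b", "a", np.nan],
    "Married": ["u", np.nan, "y", "u"],
})
test = pd.DataFrame({
    "Gender": [np.nan, "a"],
    "Married": ["y", np.nan],
})

# Most frequent value of each column, taken from the training set only
most_frequent = train.mode().iloc[0]

# Each column is filled with its own most frequent training value
train_filled = train.fillna(most_frequent)
test_filled = test.fillna(most_frequent)

print(train_filled)
print(test_filled)
```

Each column gets its own fill value, so a missing Gender is never filled with a value that belongs to Married.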
7. Preprocessing the data (part i)
The missing values are now successfully handled.
There is still some minor but essential data preprocessing needed before we proceed towards building our machine learning model. We are going to divide these remaining preprocessing steps into two main tasks:
- Convert the non-numeric data into numeric.
- Scale the feature values to a uniform range.
First, we will convert all the non-numeric values into numeric ones. We do this not only because it results in faster computation, but also because many machine learning models (such as XGBoost, and especially the ones developed using scikit-learn) require the data to be in a strictly numeric format. We will do this by using the get_dummies() function from pandas.
# Convert the categorical features in the train and test sets independently
cc_apps_train = pd.get_dummies(cc_apps_train)
cc_apps_test = pd.get_dummies(cc_apps_test)
# Reindex the columns of the test set aligning with the train set
cc_apps_test = cc_apps_test.reindex(columns=cc_apps_train.columns, fill_value=0)
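The reindexing step matters because a category that appears in the train set may be absent from the test set, so encoding the two sets independently can produce different columns. A minimal sketch on made-up data (the frames train and test and their values are invented for illustration):

```python
import pandas as pd

# Toy data; the test set is missing the categories 'p' and 's'
train = pd.DataFrame({"Citizen": ["g", "s", "p"], "Income": [0, 100, 560]})
test = pd.DataFrame({"Citizen": ["g", "g"], "Income": [50, 0]})

# Encode each set independently; the resulting columns differ
train_encoded = pd.get_dummies(train)
test_encoded = pd.get_dummies(test)
print(list(train_encoded.columns))
print(list(test_encoded.columns))

# Align the test columns with the train columns, filling absent dummies with 0
test_aligned = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)
print(test_aligned)
```

After reindexing, both frames have the same columns in the same order, which is what a model trained on the train set expects at prediction time.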
8. Preprocessing the data (part ii)
We are now left with one final preprocessing step, scaling, before we can fit a machine learning model to the data.
Let's try to understand what these scaled values mean in the real world, using CreditScore as an example. The credit score of a person is a measure of their creditworthiness based on their credit history. The higher this number, the more financially trustworthy a person is considered to be. So, a CreditScore of 1 is the highest since we're rescaling all the values to the range of 0-1.
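To make the rescaling concrete, here is a minimal sketch of min-max scaling on made-up CreditScore values. The frames train and test and their numbers are invented for illustration, and the arithmetic is written out by hand; scikit-learn's MinMaxScaler with its default range performs the same calculation.

```python
import pandas as pd

# Made-up CreditScore values for illustration
train = pd.DataFrame({"CreditScore": [0.0, 5.0, 20.0]})
test = pd.DataFrame({"CreditScore": [10.0, 20.0]})

# Minimum and range are learned from the training set only
col_min = train.min()
col_range = train.max() - train.min()

# Rescale both sets with the training-set statistics
train_scaled = (train - col_min) / col_range
test_scaled = (test - col_min) / col_range

print(train_scaled)
print(test_scaled)
```

The largest training value maps to 1 and the smallest to 0. Test values are scaled with the training statistics, so a test value outside the training range would fall outside 0-1.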