1. Credit card applications

Commercial banks receive many applications for credit cards. A large share of them are rejected for reasons such as high loan balances, low income levels, or too many inquiries on an individual's credit report. Manually analyzing these applications is mundane, error-prone, and time-consuming (and time is money!). Luckily, this task can be automated with machine learning, and pretty much every commercial bank does so nowadays. In this notebook, we will build an automatic credit card approval predictor using machine learning techniques, just as real banks do.


We'll use the Credit Card Approval dataset from the UCI Machine Learning Repository. The structure of this notebook is as follows:

  • First, we will start off by loading and viewing the dataset.
  • We will see that the dataset has a mixture of both numerical and non-numerical features, that it contains values from different ranges, and that it contains a number of missing entries.
  • We will have to preprocess the dataset to ensure the machine learning model we choose can make good predictions.
  • After our data is in good shape, we will do some exploratory data analysis to build our intuitions.
  • Finally, we will build a machine learning model that can predict if an individual's application for a credit card will be accepted.

First, we load and view the dataset. Because this data is confidential, the contributor of the dataset has anonymized the feature names.

Task 1: Instructions Load and look at the dataset.

  • Import the pandas library under the alias pd.
  • Load the dataset, "datasets/cc_approvals.data", into a pandas DataFrame called cc_apps. Set the header argument to None.
  • Print the first 5 rows of cc_apps using the head() method.

Good to know: For this project, it is recommended that you know basic Python programming, the pandas and numpy packages, some data preprocessing, and a little bit of machine learning. Here are some resources that may be helpful throughout the project:

  • For a quick introduction to Python: DataCamp's Introduction to Python course
  • For learning the basics of the pandas and numpy packages: Data Manipulation with pandas, the pandas Cheatsheet, and the NumPy Cheat Sheet
  • For data preprocessing: Preprocessing in Data Science (Parts 1, 2, and 3)
  • For machine learning: Google's Machine Learning Crash Course and Supervised Learning with scikit-learn

Apart from the above, we encourage you to use your preferred search engine to find other useful resources.

HINT: You can read "path_to/my_data.data" into a DataFrame named my_data like so after importing pandas:

import pandas as pd
my_data = pd.read_csv("path_to/my_data.data")

Pay close attention to the header parameter of the read_csv() function.

# Import pandas
import pandas as pd

# Load dataset
cc_apps = pd.read_csv("datasets/cc_approvals.data", header=None)

# Inspect data
cc_apps.head()

2. Inspecting the applications

The output may appear a bit confusing at first sight, but let's try to figure out the most important features of a credit card application. The features of this dataset have been anonymized to protect privacy, but this blog post gives us a pretty good overview of the probable features. The probable features in a typical credit card application are Gender, Age, Debt, Married, BankCustomer, EducationLevel, Ethnicity, YearsEmployed, PriorDefault, Employed, CreditScore, DriversLicense, Citizen, ZipCode, Income, and finally ApprovalStatus. This gives us a pretty good starting point, and we can map these features to the columns in the output.
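Purely for readability, we could attach these probable names to the anonymized columns. The sketch below is illustrative only: the names are speculative guesses rather than confirmed labels, and the variable names (probable_names, cc_apps_named) are made up for this example. The rest of the notebook keeps the original integer column indices.

# Hypothetical: label a *copy* of the data with the probable feature names
probable_names = ["Gender", "Age", "Debt", "Married", "BankCustomer",
                  "EducationLevel", "Ethnicity", "YearsEmployed", "PriorDefault",
                  "Employed", "CreditScore", "DriversLicense", "Citizen",
                  "ZipCode", "Income", "ApprovalStatus"]
cc_apps_named = cc_apps.copy()
cc_apps_named.columns = probable_names
cc_apps_named.head()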

As we can see from our first glance at the data, the dataset has a mixture of numerical and non-numerical features. This can be fixed with some preprocessing, but before we do that, let's learn a bit more about the dataset to see if there are other issues that need fixing.

Task 2: Instructions Inspect the structure, numerical summary, and specific rows of the dataset.

  • Extract the summary statistics of the data using the describe() method of cc_apps.
  • Use the info() method of cc_apps to get more information about the DataFrame.
  • Print the last 17 rows of cc_apps using the tail() method to display the missing values.

Helpful links:

pandas tail() method documentation

HINT: You can use the describe() method of a DataFrame named my_data like this:

my_data.describe()

# Print summary statistics
cc_apps_description = cc_apps.describe()
print(cc_apps_description)

print('\n')

# Print DataFrame information (info() prints directly and returns None,
# so we call it without capturing the return value)
cc_apps.info()

print('\n')

# Inspect missing values in the dataset
cc_apps.tail(17)

3. Splitting the dataset into train and test sets

Now, we will split our data into a train set and a test set to prepare it for two different phases of machine learning modeling: training and testing. Ideally, no information from the test data should be used to preprocess the training data or to direct the training process of a machine learning model. Hence, we first split the data and then preprocess it.

Also, features like DriversLicense and ZipCode are not as important as the other features for predicting credit card approvals. To confirm this, we could measure their statistical correlation with the labels of the dataset, but that is out of scope for this project. We will drop them so that we design our machine learning model with the best set of features. In data science literature, this is often referred to as feature selection.
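Although that correlation check is out of scope, a minimal illustrative sketch of it could look like the following. It assumes, based on a glance at the raw data, that feature 11 (the probable DriversLicense) holds 't'/'f' flags and column 15 (the probable ApprovalStatus) holds '+'/'-' labels:

# Illustrative only: relate feature 11 to the label in column 15
print(pd.crosstab(cc_apps[11], cc_apps[15]))     # contingency table

# Encode both as 0/1 flags and compute a simple Pearson correlation
has_license = (cc_apps[11] == 't').astype(int)
approved = (cc_apps[15] == '+').astype(int)
print(has_license.corr(approved))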

Task 3: Instructions Split cc_apps into train and test sets.

  • Import train_test_split from the sklearn.model_selection module.
  • Drop features 11 and 13 using the drop() method.
  • Using the train_test_split() function, split the data into train and test sets with a 33% test split (the test_size argument) and set the random_state argument to 42.
  • Assign the train and test DataFrames to the variables cc_apps_train and cc_apps_test, respectively.

Keep track of the total number of features before and after dropping features 11 and 13; this often helps with debugging (see the sanity check after the code cell below).

Setting random_state ensures the dataset is split into the same sets of instances every time the code is run.

Helpful links:

pandas drop() method documentation
sklearn train_test_split() function documentation

# Import train_test_split
from sklearn.model_selection import train_test_split

# Drop the features 11 and 13
cc_apps = cc_apps.drop([11, 13], axis=1)

# Split into train and test sets
cc_apps_train, cc_apps_test = train_test_split(cc_apps, test_size=0.33, random_state=42)
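As a quick sanity check on the column counts mentioned in the instructions, we could print the shapes; the expected values in the comments assume the standard 690-row UCI dataset:

# Sanity check: 16 columns before the drop, 14 after
print(cc_apps.shape)        # expected (690, 14) after dropping features 11 and 13
print(cc_apps_train.shape)  # roughly 67% of the rows
print(cc_apps_test.shape)   # roughly 33% of the rows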

4. Handling the missing values (part i)

Now that we've split our data, we can handle some of the issues we identified when inspecting the DataFrame:

  • Our dataset contains both numeric and non-numeric data (specifically, data of the float64, int64, and object types). Features 2, 7, 10, and 14 contain numeric values (of types float64, float64, int64, and int64, respectively), while all the other features contain non-numeric values.
  • The dataset also contains values from several ranges. Some features have a value range of 0 - 28, some have a range of 2 - 67, and some have a range of 1017 - 100000. Apart from these, we can get useful statistical information (like mean, max, and min) about the features that have numerical values.
  • Finally, the dataset has missing values, which we'll take care of in this task. The missing values in the dataset are labeled with '?', which can be seen in the last cell's output of the second task.

Now, let's temporarily replace these missing value question marks with NaN.
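Before the replacement, a quick illustrative way to see how many '?' markers each column holds (isin() is safe to use on mixed dtypes):

# Count the '?' markers per column in the train and test sets
print(cc_apps_train.isin(['?']).sum())
print(cc_apps_test.isin(['?']).sum())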

Task 4: Instructions Replace the question marks with NaN.

  • Import the numpy library under the alias np.
  • Replace the '?'s with NaNs using the replace() method in both the train and test sets.

Helpful links:

pandas replace() method documentation
NumPy data types for special values

HINT: If you import the numpy module under the alias np, you can use np.nan to replace the desired values (in this case, the ones denoted with question marks) with NaNs.

You can call the replace() method on my_data and then overwrite it like this:

my_data = my_data.replace(replacing_value, np.nan)

# Import numpy
import numpy as np

# Replace the '?'s with NaN in the train and test sets
# (np.nan is the canonical spelling; the np.NaN alias was removed in NumPy 2.0)
cc_apps_train = cc_apps_train.replace('?', np.nan)
cc_apps_test = cc_apps_test.replace('?', np.nan)

5. Handling the missing values (part ii)

We replaced all the question marks with NaNs. This will help us with the next missing value treatment that we are going to perform.

An important question arises here: why are we giving so much importance to missing values? Can't they just be ignored? Ignoring missing values can heavily affect the performance of a machine learning model. When missing values are ignored, the model may miss out on information about the dataset that could be useful for training. Moreover, many models cannot handle missing values implicitly, Linear Discriminant Analysis (LDA) being one example.

So, to avoid this problem, we are going to impute the missing values with a strategy called mean imputation.
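To make that LDA point concrete, here is a minimal self-contained sketch (toy data, not our dataset) showing that scikit-learn's LDA refuses to fit when the input contains NaN:

# LDA raises a ValueError when the input contains NaN
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, 5.0], [6.0, 1.0]])
y = np.array([0, 1, 0, 1])

try:
    LinearDiscriminantAnalysis().fit(X, y)
except ValueError as err:
    print(err)  # e.g. a message stating that the input contains NaN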

Task 5: Instructions Impute the NaN values with the mean imputation approach.

  • For the numeric columns, impute the missing values (NaNs) with the pandas fillna() method.
  • Ensure the test set is imputed with the mean values computed from the training set.
  • Verify that fillna() performed as expected by printing the total number of NaNs in each column.

Remember that you have already marked all the question marks as NaNs. pandas provides fillna() to help you impute missing values with different strategies, mean imputation being one of them. pandas also has a mean() method to calculate the mean of a DataFrame. As your dataset contains both numeric and non-numeric data, for this task you will only impute the missing values (NaNs) present in the columns having numeric data types (columns 2, 7, 10, and 14).

Helpful links:

mean imputation tutorial
pandas fillna() method documentation
pandas mean() method documentation
pandas isnull() method documentation

HINT: You can call the fillna() method on a pandas DataFrame like this:

my_data.fillna(my_data.mean(numeric_only=True), inplace=True)

Passing the output of mean() means only the columns with numeric data types get imputed (mean() with numeric_only=True skips the non-numeric columns). The inplace parameter is set to True so that the DataFrame is mutated in place. You can count the number of NaNs in a dataset by using the isnull() and sum() methods in conjunction.

# Impute the missing values with mean imputation
# (numeric_only=True restricts mean() to the numeric columns;
#  the test set is filled with means computed from the training set)
cc_apps_train.fillna(cc_apps_train.mean(numeric_only=True), inplace=True)
cc_apps_test.fillna(cc_apps_train.mean(numeric_only=True), inplace=True)

# Count the number of NaNs in the datasets and print the counts to verify
print(cc_apps_train.isnull().sum())
print(cc_apps_test.isnull().sum())
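Note that the printed counts will still show NaNs in the non-numeric columns: mean imputation only touches the numeric ones, so the remaining missing values will need a separate treatment.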