Commercial banks receive many applications for credit cards. Many of them are rejected for reasons such as high loan balances, low income levels, or too many inquiries on an individual's credit report. Manually analyzing these applications is mundane, error-prone, and time-consuming (and time is money!). Luckily, this task can be automated with machine learning, and pretty much every commercial bank does so nowadays. In this workbook, you will build an automatic credit card approval predictor using machine learning techniques, just like real banks do.
The Data
The data is a small subset of the Credit Card Approval dataset from the UCI Machine Learning Repository, containing credit card applications a bank received. The dataset is stored in the file cc_approvals.data and is loaded below into a pandas DataFrame called data. The last column in the dataset is the target variable.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
# Load the dataset
data = pd.read_csv("cc_approvals.data", header=None)
data.head()

# Replace the "?" placeholders with NaN in the dataset
data_replaced = data.replace("?", np.nan)
# Create a copy of the NaN-replaced DataFrame
data_imputed = data_replaced.copy()
# Iterate over the columns and impute missing values:
# most frequent value for categorical columns, mean for numeric columns
for col in data_imputed.columns:
    if data_imputed[col].dtype == "object":
        data_imputed[col] = data_imputed[col].fillna(data_imputed[col].value_counts().index[0])
    else:
        data_imputed[col] = data_imputed[col].fillna(data_imputed[col].mean())
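As a quick sanity check (a minimal sketch, not part of the original workflow), you can confirm that no missing values remain after imputation:

# Verify that no missing values remain; should print 0 if imputation succeeded
print(data_imputed.isnull().sum().sum())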
# Dummify the categorical variables
data_encoded = pd.get_dummies(data_imputed, drop_first=True)

# Extract the last column as the target variable
X = data_encoded.iloc[:, :-1].values
y = data_encoded.iloc[:, -1].values
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
# Instantiate StandardScaler and use it to rescale X_train and X_test
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
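Note that the scaler is fit on the training split only, so statistics from the test set never leak into the preprocessing. As a small sanity check (a sketch, not part of the original workflow), the scaled training columns should have roughly zero mean and unit variance:

# Sanity check: scaled training features should be ~0 mean, ~1 std
print(X_train_scaled.mean(axis=0).round(2))
print(X_train_scaled.std(axis=0).round(2))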
# Instantiate Logistic Regression with default parameter values
logreg = LogisticRegression()
# Fit the model to the training data
logreg.fit(X_train_scaled, y_train)
# Predict the test labels
y_pred = logreg.predict(X_test_scaled)
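Since confusion_matrix is imported above, a quick look at the untuned model's test performance gives a baseline to compare against after tuning. A minimal sketch:

# Baseline performance of the default model on the test set
print("Baseline accuracy: ", logreg.score(X_test_scaled, y_test))
print(confusion_matrix(y_test, y_pred))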
# Define grid of values for tol and max_iter
tol = [0.01, 0.001, 0.0001]
max_iter = [100, 150, 200]
# Create the parameter grid dictionary
param_grid = dict(tol=tol, max_iter=max_iter)
# Instantiate GridSearchCV with 5-fold cross-validation
grid_search = GridSearchCV(estimator=logreg, param_grid=param_grid, cv=5)
# Fit the grid search to the training data
grid_search_results = grid_search.fit(X_train_scaled, y_train)

# Summarize the results
best_train_score, best_train_params = grid_search_results.best_score_, grid_search_results.best_params_
print("Best: %f using %s" % (best_train_score, best_train_params))

# Extract the best model and evaluate it on the test set
best_model = grid_search_results.best_estimator_
best_score = best_model.score(X_test_scaled, y_test)
print("Accuracy of logistic regression classifier: ", best_score)
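Accuracy alone can hide how the errors are distributed between approvals and rejections, so the tuned model's test predictions can also be summarized with the confusion matrix imported earlier. A minimal sketch:

# Inspect how the tuned model's errors are distributed on the test set
y_pred_best = best_model.predict(X_test_scaled)
print(confusion_matrix(y_test, y_pred_best))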