Commercial banks receive a lot of credit card applications, and many of them are rejected for reasons such as high loan balances, low income levels, or too many inquiries on an individual's credit report. Manually analyzing these applications is mundane, error-prone, and time-consuming (and time is money!). Luckily, this task can be automated with the power of machine learning, and pretty much every commercial bank does so nowadays. In this workbook, you will build an automatic credit card approval predictor using machine learning techniques, just like real banks do.

The Data

The data is a small subset of the Credit Card Approval dataset from the UCI Machine Learning Repository showing the credit card applications a bank receives. This dataset has been loaded as a pandas DataFrame called cc_apps. The last column in the dataset is the target value.

# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV

# Load the dataset
cc_apps = pd.read_csv("cc_approvals.data", header=None) 
cc_apps.head()

1 - Preprocess the data

# Replace "?" with np.nan so pandas recognizes the missing values
cc_apps.replace('?', np.nan, inplace=True)

# Make a copy of the dataset for imputation
cc_apps_imputed = cc_apps.copy()

# Impute missing values column by column
for col in cc_apps_imputed.columns:
    if cc_apps_imputed[col].dtype == 'object':
        # Impute categorical columns with the most frequent value
        most_frequent = cc_apps_imputed[col].value_counts().idxmax()
        cc_apps_imputed[col] = cc_apps_imputed[col].fillna(most_frequent)
    else:
        # Impute numeric columns with the mean value
        mean_value = cc_apps_imputed[col].mean()
        cc_apps_imputed[col] = cc_apps_imputed[col].fillna(mean_value)
        
# One-hot encoding
# Apply pd.get_dummies() to categorical columns
cc_apps_encoded = pd.get_dummies(cc_apps_imputed, drop_first=True)

# Display the first few rows of the preprocessed data
print(cc_apps_encoded.head())

In this step, we started by handling missing values. We replaced every missing value, recorded as '?', with np.nan so pandas can recognize and handle it. We then imputed the missing values based on data type: categorical columns were filled with their most frequent value, and numeric columns with their mean. Finally, we applied one-hot encoding to the categorical variables to prepare the data for machine learning models.
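
The same imputation can also be expressed with scikit-learn's SimpleImputer. This is a minimal alternative sketch, not part of the original workbook, and it assumes the cc_apps DataFrame defined above:

from sklearn.impute import SimpleImputer

# Split the columns by dtype (object columns hold the categorical features)
cat_cols = cc_apps.select_dtypes(include='object').columns
num_cols = cc_apps.select_dtypes(exclude='object').columns

cc_apps_alt = cc_apps.copy()
# Most frequent value for categorical columns, mean for numeric columns
cc_apps_alt[cat_cols] = SimpleImputer(strategy='most_frequent').fit_transform(cc_apps[cat_cols])
cc_apps_alt[num_cols] = SimpleImputer(strategy='mean').fit_transform(cc_apps[num_cols])

# One-hot encode as before; barring tie-breaks in the most frequent value,
# the result should closely match cc_apps_encoded
cc_apps_alt_encoded = pd.get_dummies(cc_apps_alt, drop_first=True)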

2 - Prepare the data for modeling

# Define the target variable (last column) and feature variables (all other columns)
X = cc_apps_encoded.iloc[:, :-1].values
y = cc_apps_encoded.iloc[:, -1].values

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=9)

# Scale the data using StandardScaler
scaler = StandardScaler()

# Fit the scaler on the training data and transform both train and test sets
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Display the shape of the train and test sets to ensure correct splits
print(f"Training set shape: {X_train_scaled.shape}")
print(f"Test set shape: {X_test_scaled.shape}")

Once the data was cleaned, we defined the target variable (credit card approval status) as the last column of the dataset and the feature variables as all other columns. We split the dataset into training and testing sets, with 80% of the data used for training and 20% for testing. We then applied StandardScaler to the features, fitting it on the training set only so no information from the test set leaks into the model; scaling puts all features on a comparable scale, which typically helps gradient-based models such as logistic regression converge.
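
As a quick sanity check (a minimal sketch, not part of the original workbook), the scaled training features should have roughly zero mean and unit variance, while the test set, transformed with the training statistics, will only be approximately standardized:

# Means and standard deviations of the first few scaled training features
print("Train means (first 5 features):", X_train_scaled.mean(axis=0)[:5].round(3))
print("Train stds (first 5 features):", X_train_scaled.std(axis=0)[:5].round(3))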

3 - Train the model

# Instantiate the logistic regression model
logreg = LogisticRegression(random_state=9)

# Fit the model on the training data 
logreg.fit(X_train_scaled, y_train)

# Generate predictions on the test set
y_pred = logreg.predict(X_test_scaled)

# Evaluate the predictions using confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)

# Display the confusion matrix
print("Confusion Matrix: \n", conf_matrix)

With the data prepared, we instantiated a Logistic Regression model and trained it using the scaled training data. We used the model to predict the approval status for the test data and evaluated its performance using a confusion matrix. The confusion matrix gave us insights into the number of correct and incorrect predictions, providing a snapshot of the model’s accuracy.
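
If you want more than the raw counts, the same predictions can be summarized with standard scikit-learn metrics. A minimal sketch, assuming the y_test and y_pred arrays from the step above:

from sklearn.metrics import accuracy_score, classification_report

# Overall accuracy plus per-class precision, recall, and F1
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))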

4 - Finding the best scoring model

# Define the grid search parameters
param_grid = {
    'tol': [1e-4, 1e-3, 1e-2],
    'max_iter': [100, 200, 300],
    'C': [0.1, 1, 10]  # Regularization strength
}

# Perform Grid Search with cross-validation
grid_search = GridSearchCV(estimator=logreg, param_grid=param_grid, cv=5, n_jobs=-1)

# Fit the grid search model to the training data
grid_search.fit(X_train_scaled, y_train)

# Extract the best model
best_model = grid_search.best_estimator_

# Evaluate the best model on the test set
best_score = best_model.score(X_test_scaled, y_test)

# Display the best parameters and score
print("Best Parameters:", grid_search.best_params_)
print("Best Model Test Set Score:", best_score)

To further improve the model's performance, we used grid search with cross-validation to find the best combination of hyperparameters for the Logistic Regression model. We defined a grid over tolerance, maximum iterations, and regularization strength, and used 5-fold cross-validation to evaluate each combination. The best model was then extracted and evaluated on the test set, giving a final score that summarizes its overall performance.
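
Beyond the single best estimator, GridSearchCV also records the mean cross-validated score of the winning combination and the full grid of results. A minimal sketch for inspecting them, using the fitted grid_search object from above:

# Mean cross-validated score of the best parameter combination
print("Best cross-validation score:", grid_search.best_score_)

# Top parameter combinations ranked by mean test-fold score
cv_results = pd.DataFrame(grid_search.cv_results_)
print(cv_results[['param_C', 'param_tol', 'param_max_iter', 'mean_test_score']]
      .sort_values('mean_test_score', ascending=False)
      .head())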

Conclusion

In this project, we successfully built a machine learning model to predict credit card approvals. We started by preprocessing the data, handling missing values and applying one-hot encoding to categorical variables. After preparing the data for modeling by splitting it into training and test sets and scaling the features, we trained an initial Logistic Regression model. The model's performance was evaluated using a confusion matrix.

To further improve the model, we performed Grid Search Cross Validation to find the best combination of hyperparameters. The final model, with C = 0.1, max_iter = 100, and tol = 0.0001, achieved a test set score of 81.16%, indicating strong predictive performance.

This project demonstrated the importance of preprocessing, feature scaling, and hyperparameter tuning in building robust predictive models. The final logistic regression model provides a reliable tool for predicting credit card approvals, helping financial institutions automate and streamline the decision-making process.