Commercial banks receive many applications for credit cards, and a large share of them are rejected for reasons such as high loan balances, low income levels, or too many inquiries on the applicant's credit report. Manually analyzing these applications is mundane, error-prone, and time-consuming (and time is money!). Fortunately, this task can be automated with machine learning, and nearly every commercial bank does so today. In this workbook, you will build an automatic credit card approval predictor using machine learning techniques, just as real banks do.

The Data

The data is a small subset of the Credit Card Approval dataset from the UCI Machine Learning Repository showing the credit card applications a bank receives. This dataset has been loaded as a pandas DataFrame called cc_apps. The last column in the dataset is the target value.
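
Before preprocessing, it helps to take a quick look at the raw data. The sketch below assumes cc_apps has already been loaded as described above (the file has no header row, so columns are referred to by position):

# Inspect the first few rows, summary statistics, and column dtypes
print(cc_apps.head())
print(cc_apps.describe())
cc_apps.info()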

# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load the dataset
cc_apps = pd.read_csv("cc_approvals.data", header=None)

# Handle missing values: the raw file marks them with '?'
cc_apps.replace('?', np.nan, inplace=True)
# Impute numeric columns with the column mean, then the remaining (categorical) columns with the most frequent value
cc_apps.fillna(cc_apps.mean(numeric_only=True), inplace=True)
cc_apps.fillna(cc_apps.mode().iloc[0], inplace=True)

# Encode categorical (object-dtype) features as integer codes
le = LabelEncoder()
for col in cc_apps.select_dtypes(include=['object']).columns:
    cc_apps[col] = le.fit_transform(cc_apps[col])

# Split the data into features and target
X = cc_apps.iloc[:, :-1]
y = cc_apps.iloc[:, -1]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features (fit the scaler on the training data only to avoid leakage into the test set)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Define models and parameters for GridSearchCV
models = {
    'LogisticRegression': LogisticRegression(),
    'DecisionTree': DecisionTreeClassifier(),
    'RandomForest': RandomForestClassifier()
}

params = {
    'LogisticRegression': {'C': [0.1, 1, 10]},
    'DecisionTree': {'max_depth': [None, 10, 20, 30]},
    'RandomForest': {'n_estimators': [50, 100, 200]}
}

# Find the best model and parameters with 5-fold cross-validation on the training set
best_score = 0
best_model = None
for model_name in models:
    grid = GridSearchCV(models[model_name], params[model_name], cv=5)
    grid.fit(X_train, y_train)
    score = grid.best_score_
    if score > best_score:
        best_score = score
        best_model = grid.best_estimator_

# Evaluate the best model on the test set
# (GridSearchCV refits the best estimator on the full training set by default, so no extra fit is needed)
y_pred = best_model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
print("The best model is:", best_model)
print("Best cross-validation score:", best_score)
print("Test set accuracy:", test_accuracy)