
Setup

1. Importing necessary libraries

import pandas as pd
import numpy as np

2. Loading dataset

cc_apps = pd.read_csv('datasets/cc_approvals.data', header=None)
cc_apps

3. Inspecting the applications

The output may look confusing at first sight, but let's try to identify the most important features of a credit card application. The features of this dataset have been anonymized to protect privacy, but this blog gives us a pretty good overview of the probable features. The probable features in a typical credit card application are:

  • 0   Gender
  • 1   Age
  • 2   Debt
  • 3   Married
  • 4   BankCustomer
  • 5   EducationLevel
  • 6   Ethnicity
  • 7   YearsEmployed
  • 8   PriorDefault
  • 9   Employed
  • 10 CreditScore
  • 11 DriversLicense
  • 12 Citizen
  • 13 ZipCode
  • 14 Income
  • 15 ApprovalStatus

This gives us a pretty good starting point, and we can map these features to the columns in the output.

# Add header
cc_apps.columns = ["gender", "age", "debt", "married", "bank_customer", "education_level", "ethnicity", "years_employed", "prior_default", "employed", "credit_score", "driver_license", "citizen", "zip_code", "income", "approval_status"]
# Inspect some rows
print("Head")
print(cc_apps.head())

# Print summary statistics
print("\n")
print("Description")
print(cc_apps.describe())

# Print DataFrame information
print("\n")
print("Info")
cc_apps.info()  # info() prints its report itself and returns None

4. Splitting the dataset into "features" and "target"

target_column = 'approval_status'
X = cc_apps.drop(columns=[target_column])
y = cc_apps[target_column]
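
As a quick sanity check, the same split can be tried on a tiny hand-made frame. This is only a sketch: the `toy` DataFrame and its values are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for cc_apps; the rows are made up for illustration only.
toy = pd.DataFrame({
    "age": [30.5, 22.0, 41.25],
    "debt": [0.0, 4.5, 1.5],
    "approval_status": ["+", "-", "+"],
})

target_column = "approval_status"
X_toy = toy.drop(columns=[target_column])  # every column except the target
y_toy = toy[target_column]                 # the target column as a Series

print(list(X_toy.columns))  # ['age', 'debt']
print(y_toy.shape)          # (3,)
```

The features end up in a DataFrame and the target in a Series with the same number of rows, which is the shape most modelling code expects.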

Cleaning

1. Drop non-essential columns

non_essentials_columns = ['driver_license', 'zip_code']
# drop() returns a new DataFrame, so assign the result back to X
X = X.drop(columns=non_essentials_columns)
X
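
Keep in mind that `DataFrame.drop` returns a new DataFrame by default and leaves the original untouched, so the result has to be assigned to a name to take effect. A minimal sketch, using an invented `toy` frame rather than the real data:

```python
import pandas as pd

# Toy frame with made-up values, for illustration only.
toy = pd.DataFrame({
    "age": [30.5, 22.0],
    "driver_license": ["t", "f"],
    "zip_code": ["00202", "00043"],
})

# drop() does not modify `toy`; it returns a new DataFrame.
dropped = toy.drop(columns=["driver_license", "zip_code"])

print(list(toy.columns))      # ['age', 'driver_license', 'zip_code']
print(list(dropped.columns))  # ['age']
```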

2. Replace NaN