Kaggle - Give Me Some Credit

Description

Banks play a crucial role in market economies. They decide who can get finance and on what terms and can make or break investment decisions. For markets and society to function, individuals and companies need access to credit.

Credit scoring algorithms, which make a guess at the probability of default, are the method banks use to determine whether or not a loan should be granted. This competition requires participants to improve on the state of the art in credit scoring, by predicting the probability that somebody will experience financial distress in the next two years.

The goal is to build a model that borrowers can use to help make the best financial decisions.

@misc{GiveMeSomeCredit, author = {Credit Fusion and Will Cukierski}, title = {Give Me Some Credit}, year = {2011}, howpublished = {\url{https://kaggle.com/competitions/GiveMeSomeCredit}}, note = {Kaggle} }

Data Dictionary

import pandas as pd

# Load the sheet named 'Sheet1'; header=1 uses the second row of the sheet as the column names
dictionary_df = pd.read_excel('datasets/Data Dictionary.xls', sheet_name='Sheet1', header=1)
dictionary_df

Credit Risk Classification Model

1. Import Necessary Libraries

# Data handling (re-imported so this cell also runs on its own)
import pandas as pd
import numpy as np

# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Machine Learning libraries
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler, Binarizer
from sklearn.preprocessing import FunctionTransformer
from sklearn.metrics import (roc_auc_score, confusion_matrix, classification_report,
                             precision_recall_curve, roc_curve, f1_score, auc)
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
import lightgbm as lgb

# Imbalanced data handling
from imblearn.over_sampling import SMOTE

# Model interpretability
import shap

# Ignore warnings
import warnings
warnings.filterwarnings('ignore')
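The metrics imported above include roc_auc_score. As a rough illustration of what that number measures, the sketch below computes ROC AUC by hand as the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The function name roc_auc and the toy labels and scores are hypothetical and are not taken from the competition data; in the notebook itself, roc_auc_score would normally be used instead.

```python
def roc_auc(y_true, y_score):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example (made-up values): 1 = distress, 0 = no distress
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc_value = roc_auc(labels, scores)
print(auc_value)  # 0.75
```

The pairwise loop is quadratic in the number of rows, so it is only meant to make the definition concrete on small inputs.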

2. Load the Data

import os

# Get a list of all CSV files in the datasets folder
csv_files = [f for f in os.listdir('datasets') if f.endswith('.csv')]

# Load each CSV file into a DataFrame and store it in a dictionary
dataframes = {}
for file in csv_files:
    df_name = os.path.splitext(file)[0].replace('-', '_')  # Get the file name without extension and replace '-' with '_'
    dataframes[df_name] = pd.read_csv(f'datasets/{file}')  # Load as-is; the index is set in a later step

# Expose each DataFrame as a top-level variable (e.g. cs_training, cs_test); the cells below rely on these names
for df_name, df in dataframes.items():
    globals()[df_name] = df

# Display all available DataFrames
for df_name, df in dataframes.items():
    print(f"DataFrame: {df_name}")
    display(df.head())

# Set the first column of the training set as the index and remove the index name
cs_training.set_index(cs_training.columns[0], inplace=True)
cs_training.index.name = None

cs_training.head()
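As a possible alternative to calling set_index after loading, pandas.read_csv accepts index_col=0, which uses the first column as the index at read time. The sketch below shows this on a small in-memory CSV; the column names and values are made up for illustration and are not the real dataset.

```python
import io

import pandas as pd

# Hypothetical CSV whose first column is an unnamed row identifier
csv_text = ",target,age\n1,1,45\n2,0,40\n"

# index_col=0 turns the first column into the index while reading
demo_df = pd.read_csv(io.StringIO(csv_text), index_col=0)
demo_df.index.name = None  # Mirror the notebook: keep the index unnamed

print(demo_df)
```

Whether this is preferable depends on whether the raw first column is needed elsewhere; the loading loop above reads every CSV the same way, so it stays generic at the cost of the extra set_index step.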

# Check for any leading or trailing whitespaces in column names
whitespace_columns = [col for col in cs_training.columns if col != col.strip()]

# Flag string cells with leading or trailing whitespace
# (DataFrame.applymap is deprecated since pandas 2.1 in favor of DataFrame.map)
whitespace_data = cs_training.applymap(lambda x: isinstance(x, str) and (x != x.strip()))

# Display columns with whitespace in their names
whitespace_columns

# True if any cell in the training set contains leading or trailing whitespace
whitespace_data.any().any()

cs_training.shape

cs_training.describe()

# Repeat the whitespace checks for the test set, starting with the column names
whitespace_columns = [col for col in cs_test.columns if col != col.strip()]

# Check for any leading or trailing whitespaces in string data within the DataFrame
whitespace_data = cs_test.applymap(lambda x: isinstance(x, str) and (x != x.strip()))

# Display columns with whitespace in their names
whitespace_columns

# True if any cell in the test set contains leading or trailing whitespace
whitespace_data.any().any()
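The same whitespace check is run once for cs_training and once for cs_test. One way to avoid the duplication is to wrap it in a small helper, sketched below. The name find_whitespace_issues and the demo DataFrame are hypothetical; the helper uses Series.map column by column so that it does not depend on applymap.

```python
import pandas as pd


def find_whitespace_issues(df):
    """Return (column names with padding, True if any string cell has padding)."""

    def is_padded(value):
        # Only strings can carry leading or trailing whitespace
        return isinstance(value, str) and value != value.strip()

    padded_columns = [col for col in df.columns if is_padded(col)]
    padded_cells = any(df[col].map(is_padded).any() for col in df.columns)
    return padded_columns, padded_cells


# Made-up data: one padded column name and one padded string value
demo = pd.DataFrame({" age": [25, 40], "city": ["Paris ", "Rome"]})
padded_columns, padded_cells = find_whitespace_issues(demo)
print(padded_columns, padded_cells)
```

With such a helper, the two checks in the notebook would reduce to find_whitespace_issues(cs_training) and find_whitespace_issues(cs_test).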

# Set the first column as the index and remove the column name
cs_test.set_index(cs_test.columns[0], inplace=True)
cs_test.index.name = None

cs_test.head()