Binary classification modeling for FinTech

Field descriptions:

  • id – unique identifier of the object
  • gb – target variable for binary classification
  • Fields starting with cat_ – categorical features
  • Fields starting with num_ – numerical features

Objective:

Develop a binary classification model to predict the target variable (gb).

Algorithm requirements:

  • Logistic Regression
  • Decision Tree
  • Any additional algorithm (optional)

Data loading and initial exploration

Imported the dataset and reviewed its structure: displayed the first few rows to get an overview of the data, checked the shape to understand its dimensions, and used info() to identify data types. This gave an initial picture of data quality and guided the preprocessing approach.

# Import necessary libraries
import pandas as pd

# Load and explore dataset
df = pd.read_csv('train_df.csv', delimiter='\t')

# Display the first few rows of the dataframe
print("First few rows of the dataframe:")
print(df.head())

# Display the shape of the dataframe
print("\nShape of the dataframe:")
print(df.shape)

# Display the information about the dataframe
print("\nInformation about the dataframe:")
df.info()
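
As a quick data-quality snapshot, the optional snippet below (a sketch that assumes df from the cell above) summarizes column dtypes and the columns with the most missing values before any preprocessing.

# Optional data-quality snapshot (assumes df loaded above)
print("\nColumn dtype counts:")
print(df.dtypes.value_counts())

print("\nTop 10 columns by missing value count:")
missing_overview = df.isna().sum().sort_values(ascending=False)
print(missing_overview[missing_overview > 0].head(10))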

Data preprocessing

Identified categorical, numerical, and other columns based on naming conventions. Converted categorical columns to the category data type to optimize memory usage and speed up subsequent operations.

Addressed missing values by:

  • Identifying that columns with missing values were numeric, which defined the imputation method.
  • Dropping columns with more than 80% missing values, as they lacked sufficient information.
  • Imputing missing values in columns with 50% or less missing data using the median, which is robust to outliers.

Columns with 50–80% missing values were handled separately in the next step, based on feature importance.

# Data preprocessing

# Step 1. Addressing column types

# Identify columns dtypes
cat = [col for col in df.columns if col.startswith('cat_')]
num = [col for col in df.columns if col.startswith('num_')]
other = [col for col in df.columns if col not in cat + num]

# Display initial dtypes with counts
print(f'Categorical columns (dtype: {df[cat].dtypes.unique()[0]}), count: {len(cat)}')
print(f'Numerical columns (dtype: {df[num].dtypes.unique()[0]}), count: {len(num)}')
print(f'Other columns (dtype: {df[other].dtypes.unique()[0]}): {other}')

# Convert categorical columns to category dtype
df[cat] = df[cat].astype('category')

# Step 2. Addressing missing values

# Identify columns with missing values
missing_counts = df.isna().sum()
missing_counts = missing_counts[missing_counts > 0]
initial_missing_values = df.isna().sum().sum()

# Display missing values data
print(f'\nInitial missing values: {initial_missing_values}')
print(f'Total columns with missing values: {len(missing_counts)}')

# Check missing values in categorical columns
missing_in_cat = df[cat].isna().sum()
missing_in_cat = missing_in_cat[missing_in_cat > 0]

# Display missing values in categorical columns
if len(missing_in_cat) == 0:
    print('No missing values in categorical columns.')
else:
    print(f'Categorical columns with missing values: {len(missing_in_cat)}')
    print(missing_in_cat)

# Check missing values in other columns
missing_in_other = df[other].isna().sum()
missing_in_other = missing_in_other[missing_in_other > 0]

# Display missing values in other columns
if len(missing_in_other) == 0:
    print('No missing values in other columns.')
else:
    print(f'Other columns with missing values: {len(missing_in_other)}')
    print(missing_in_other)        
    
# Identify and drop columns with >80% missing values
high_missing = missing_counts[missing_counts > (0.8 * len(df))].index
df = df.drop(columns=high_missing)

# Display the results of dropping
print(f'\nDropped {len(high_missing)} columns with more than 80% missing values.')
print(f'Dataset shape after dropping: {df.shape}')

# Identify and impute columns with ≤50% missing values
low_missing = missing_counts[missing_counts <= (0.5 * len(df))].index
df[low_missing] = df[low_missing].fillna(df[low_missing].median())

# Display the results of imputation
print(f'\nImputed {len(low_missing)} columns with 50% or less missing values.')

# Verify the results of manipulations
remaining_missing_values = df.isna().sum().sum()
remaining_columns_with_missing = (df.isna().sum() > 0).sum()
print(f'\nRemaining missing values: {remaining_missing_values}')
print(f'Remaining columns with missing values: {remaining_columns_with_missing}')
print(f'Dataset shape after missing value handling: {df.shape}')
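
To support the memory-usage claim for the category dtype, an optional check can compare the footprint of the categorical columns as object versus category (a sketch; the temporary cast back to object is used only for the comparison and does not modify df).

# Optional check: memory footprint of categorical columns, object vs. category
# (the astype('object') cast is temporary and used only for the comparison)
object_mem = df[cat].astype('object').memory_usage(deep=True).sum()
category_mem = df[cat].memory_usage(deep=True).sum()
print(f'Categorical columns as object:   {object_mem / 1024**2:.2f} MB')
print(f'Categorical columns as category: {category_mem / 1024**2:.2f} MB')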

Identified columns with 50–80% missing values and evaluated their importance using a Random Forest model. Temporarily imputed missing values with the median to allow proper feature importance calculation.
Applied a threshold of 0.01 to determine whether features contributed meaningfully to predictions. All remaining missing-value columns had low importance, so they were dropped.
Final verification confirmed that no missing values remained and the dataset shape was updated accordingly.

# Step 3. Handling columns with 50–80% missing values

# Identify remaining columns with missing values (50–80% missing values)
remaining_missing = df.isna().sum()
remaining_missing = remaining_missing[remaining_missing > 0]

# Display the number of remaining columns with missing values
print(f'Columns with 50–80% missing values: {len(remaining_missing)}')

# Define features and target column
X = df.drop(columns=['gb', 'id', 'Unnamed: 0'])
y = df['gb']

# Import necessary libraries
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer

# Temporarily impute missing values before model fitting
imputer = SimpleImputer(strategy='median')
X_imputed = imputer.fit_transform(X)

# Fit Random Forest Classifier for feature importance evaluation
rf = RandomForestClassifier(random_state=42)
rf.fit(X_imputed, y)

# Extract feature importance
importances = pd.Series(rf.feature_importances_, index=X.columns)

# Select columns from remaining_missing that exist in X
missing_columns_in_X = [col for col in remaining_missing.index if col in X.columns]

# Extract feature importance for these columns
remaining_missing_importance = importances[missing_columns_in_X].sort_values(ascending=False)

# Define threshold for feature importance (commonly used threshold ~0.01 for RF)
importance_threshold = 0.01

# Identify columns to keep and drop based on importance
cols_to_keep = remaining_missing_importance[remaining_missing_importance >= importance_threshold].index
cols_to_drop = remaining_missing_importance[remaining_missing_importance < importance_threshold].index

# Display the number of columns to keep and drop
print(f'\nColumns to keep (importance ≥ {importance_threshold}): {len(cols_to_keep)}')
print(f'Columns to drop (importance < {importance_threshold}): {len(cols_to_drop)}')

# Drop low-importance columns
df = df.drop(columns=cols_to_drop)

# Display the results of manipulation
print(f'\nDropped {len(cols_to_drop)} columns with low feature importance (< {importance_threshold})')

# Final verification
remaining_missing_values = df.isna().sum().sum()
remaining_columns_with_missing = (df.isna().sum() > 0).sum()
print(f'\nRemaining missing values: {remaining_missing_values}')
print(f'Remaining columns with missing values: {remaining_columns_with_missing}')
print(f'Dataset shape after handling missing values: {df.shape}')
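
For transparency, the importance scores that drove the drop decision can also be displayed; this optional addition simply reuses remaining_missing_importance computed above.

# Optional: inspect the importance scores behind the drop decision
print('\nFeature importance of columns with 50-80% missing values:')
print(remaining_missing_importance.round(4))
print(f'Maximum importance among these columns: {remaining_missing_importance.max():.4f}')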

Checked for fully duplicated rows and found none. Identified duplicate IDs, which likely represent multiple transactions per user. Verified target consistency within these duplicate IDs and detected 13 IDs with conflicting target values.
To maintain data integrity, removed records with inconsistent target labels. Confirmed that all such inconsistencies were successfully eliminated, resulting in a slightly reduced dataset with consistent target values.

# Step 4. Addressing duplicates

# Check for fully duplicated rows
full_duplicates = df.duplicated()
print(f'Full duplicate rows in the dataset: {full_duplicates.sum()}')

# Check for duplicate IDs
duplicate_ids_count = df['id'].value_counts()
duplicate_ids = duplicate_ids_count[duplicate_ids_count > 1]
print(f'\nTotal unique IDs: {df["id"].nunique()}')
print(f'IDs with more than one occurrence: {len(duplicate_ids)}')

# Check for inconsistent target values within duplicate IDs
inconsistent_gb = df[df['id'].isin(duplicate_ids.index)].groupby('id')['gb'].nunique()
inconsistent_ids = inconsistent_gb[inconsistent_gb > 1].index
print(f'IDs with inconsistent target (gb) values: {len(inconsistent_ids)}')

# Drop inconsistent IDs from the dataset
df = df[~df['id'].isin(inconsistent_ids)]

# Verify the results
print(f'\nDataset shape after dropping inconsistent IDs: {df.shape}')
print(f'Total unique IDs after dropping: {df["id"].nunique()}')

# Verify that no inconsistent IDs remain
remaining_inconsistent = df[df['id'].isin(inconsistent_ids)]
if len(remaining_inconsistent) == 0:
    print('All inconsistent IDs have been successfully removed.')
else:
    print('Some inconsistent IDs remain. Please check the data.')
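
To back the interpretation that duplicate IDs represent multiple transactions per user, an optional summary of records per ID on the cleaned dataset can be added (a sketch reusing df as it stands after this step).

# Optional: distribution of records per ID after cleaning
records_per_id = df['id'].value_counts()
print('\nRecords per ID (summary statistics):')
print(records_per_id.describe())
print(f'IDs with more than one record: {(records_per_id > 1).sum()}')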

Applied the IQR method to identify outliers in all numerical features. This revealed outliers in 229 columns, with a total of 720,675 instances.
Since the dataset likely represents transactional data, these outliers could reflect natural variations, like small everyday purchases or large one-time transactions. To avoid losing meaningful information, decided not to remove or modify them.

# Step 5. Outlier detection

# Import necessary libraries
from scipy.stats import iqr
import numpy as np

# Function to detect outliers using the IQR method
def detect_outliers(df, col):
    Q1, Q3 = df[col].quantile([0.25, 0.75])
    IQR = iqr(df[col])
    outliers = df[(df[col] < (Q1 - 1.5 * IQR)) | (df[col] > (Q3 + 1.5 * IQR))]
    return len(outliers), (len(outliers) / len(df)) * 100

# Apply outlier detection to all numerical columns
outlier_summary = {}
num = [col for col in df.columns if col.startswith('num_')]

for col in num:
    outlier_count, outlier_percent = detect_outliers(df, col)
    if outlier_count > 0:
        outlier_summary[col] = (outlier_count, round(outlier_percent, 2))

# Display summary
print(f'Total columns with outliers: {len(outlier_summary)}')

# Total number of outlier instances detected
total_outliers = sum(count for count, _ in outlier_summary.values())
print(f'Total outlier instances detected across all numerical features: {total_outliers}')
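
The summary above only reports totals; an optional extension (reusing outlier_summary) lists the columns with the highest outlier shares, which helps judge whether the natural-variation interpretation holds.

# Optional: top 10 columns by outlier percentage
top_outlier_cols = sorted(outlier_summary.items(), key=lambda item: item[1][1], reverse=True)[:10]
print('\nTop 10 columns by outlier percentage:')
for col, (count, percent) in top_outlier_cols:
    print(f'{col}: {count} outliers ({percent}%)')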

Standardized all numerical features using StandardScaler to bring them to a mean of 0 and a standard deviation of 1. This benefits scale-sensitive models, in particular logistic regression, whose solver and regularization are affected by feature scales.

# Step 6. Feature scaling

# Import necessary libraries
from sklearn.preprocessing import StandardScaler

# Initialize StandardScaler
scaler = StandardScaler()

# Apply scaling to numerical columns
df[num] = scaler.fit_transform(df[num])

# Display summary
print(f'Feature scaling completed for {len(num)} numerical columns.')
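
An optional sanity check confirms that the scaled columns have approximately zero mean and unit standard deviation (pandas uses ddof=1, so the reported standard deviation will sit marginally above 1).

# Optional sanity check: scaled columns should have mean ~0 and std ~1
print(f'Average mean of scaled columns: {df[num].mean().mean():.4f}')
print(f'Average std of scaled columns:  {df[num].std().mean():.4f}')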

Applied label encoding to all categorical features to convert them into numerical format, making them suitable for machine learning models.

# Step 7. Categorical Encoding

# Import necessary libraries
from sklearn.preprocessing import LabelEncoder

# Apply label encoding to all categorical columns
le = LabelEncoder()

for col in cat:
    df[col] = le.fit_transform(df[col])

# Verify results of encoding
print(f'Categorical columns encoded (dtype: {df[cat].dtypes.unique()[0]}), count: {len(cat)}')
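
Note that the loop above reuses a single LabelEncoder, so the fitted label-to-code mappings are not retained. If the original category labels are needed later (e.g. for interpretation), one possible variant, shown here as a sketch that would replace the loop above rather than run after it, keeps one fitted encoder per column.

# Alternative sketch: keep one fitted LabelEncoder per column so the original
# labels can be recovered later with inverse_transform (replaces the loop above)
encoders = {col: LabelEncoder() for col in cat}
for col, enc in encoders.items():
    df[col] = enc.fit_transform(df[col])

# Example: map the first five encoded values of one column back to its labels
sample_col = cat[0]
print(encoders[sample_col].inverse_transform(df[sample_col][:5]))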