
FraudDetectorLogisticRegression

This is part one of a three-project series where I explore different approaches to classifying credit card fraud with artificial intelligence. The other two projects can be found on my profile, titled FraudDetectorVotingClassifier and FraudDetectorDeepLearning. Enjoy!

In this project, I used a logistic regression machine learning model to detect credit card fraud.

Some Background

Detecting credit card fraud is a classification task where one class, non-fraud, is far more prevalent than the other, fraud. Because of this imbalance, resampling is necessary during training. Three popular resampling techniques are undersampling, oversampling, and SMOTE. Undersampling keeps only a subset of the larger class, while oversampling creates copies of observations in the smaller class. Both methods produce a balanced dataset, but each has a downside: undersampling discards many observations, and oversampling trains the model on duplicate data. SMOTE, on the other hand, generates synthetic observations based on the characteristics of similar datapoints (hence the name, Synthetic Minority Oversampling Technique). I used SMOTE in this project.

Code Breakdown

creditcard_sampledata.csv contains information on credit card purchases. I started off by importing the necessary libraries and loading the dataset.

# Import necessary libraries for data manipulation, resampling, model building, and evaluation
import pandas as pd
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report

# Load the dataset from the specified CSV file
df = pd.read_csv("creditcard_sampledata.csv")

I defined the function prep_data() that returns the feature and target variables (with the target variable indicating fraud or non-fraud).

# Define the function to preprocess the data by separating features and target variable
def prep_data(df):
    # Drop the target column 'Class' to get features (X) and store the target in (y)
    X = df.drop('Class', axis=1)
    y = df['Class']
    return X, y

# Call the function to obtain the feature matrix and target vector
X, y = prep_data(df)

Next, I defined instances of SMOTE and the logistic regression model and created a pipeline with these instances. Logistic regression is used to model the probability that a given input X belongs to a particular class Y, typically coded as 0 or 1. In this case, the two classes are non-fraud and fraud.
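As a quick illustration (not part of the project code), logistic regression squashes a linear score z = w·x + b through the logistic (sigmoid) function to turn it into a probability:

```python
import math

def sigmoid(z):
    """Logistic function: maps any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# A score of 0 means the model is perfectly uncertain between the two classes
print(sigmoid(0.0))   # 0.5
# Large positive scores push the probability toward the positive (fraud) class
print(round(sigmoid(4.0), 3))  # 0.982
```

scikit-learn's LogisticRegression exposes these probabilities through its predict_proba() method, and predict() then applies a 0.5 threshold by default.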

# Specify the resampling technique (SMOTE) and the machine learning model (Logistic Regression) for the pipeline
resampling = SMOTE()
model = LogisticRegression()

# Create a pipeline that first applies SMOTE to balance the classes and then fits a Logistic Regression model
pipeline = Pipeline([('SMOTE', resampling), ('Logistic Regression', model)])

I split the dataset into training and test sets. This is especially important in this project, not only because it’s good to have unseen observations to test the model on, but also because only the training set should be resampled, never the test set (in real life there isn’t a balanced amount of fraud and non-fraud). I finished things off by fitting the pipeline on the training set, obtaining predictions on the test set, and visualizing the results with a classification report and confusion matrix.

# Split the data into training and test sets with 70% for training and 30% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Fit the pipeline to the training data and use it to make predictions on the test data
pipeline.fit(X_train, y_train)
predicted = pipeline.predict(X_test)

# Print a detailed classification report and the confusion matrix to evaluate the model's performance
print('Classification report:\n', classification_report(y_test, predicted))
conf_mat = confusion_matrix(y_true=y_test, y_pred=predicted)
print('Confusion matrix:\n', conf_mat)
Classification report:
               precision    recall  f1-score   support

           0       1.00      0.98      0.99      2390
           1       0.12      0.80      0.21        10

    accuracy                           0.97      2400
   macro avg       0.56      0.89      0.60      2400
weighted avg       1.00      0.97      0.98      2400

Confusion matrix:
 [[2331   59]
 [   2    8]]
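As a sanity check, the fraud-class precision and recall in the report can be recomputed by hand from the confusion matrix, whose layout here is [[TN, FP], [FN, TP]]:

```python
# Values read off the confusion matrix above
tn, fp, fn, tp = 2331, 59, 2, 8

# Precision: of everything flagged as fraud, how much actually was fraud?
precision = tp / (tp + fp)   # 8 / 67
# Recall: of all actual fraud, how much did the model catch?
recall = tp / (tp + fn)      # 8 / 10

print(round(precision, 2))  # 0.12
print(round(recall, 2))     # 0.8
```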

These results aren’t too pretty. Despite only 2 false negatives (fraud transactions incorrectly identified as non-fraud), there were 59 false positives (non-fraud transactions incorrectly identified as fraud), giving a precision of only 0.12 for the fraud class. This can definitely be improved, but how? One could argue for deep learning; another could argue for sticking to traditional machine learning methods. I break this down in my next project, FraudDetectorVotingClassifier.