Fraud Detection Using Labeled, Unlabeled, and Text Data
This workspace is based on a fraud detection project. The objective is to detect fraudulent transactions in credit card transaction data.
---
Problem Statement
A survey of Certified Fraud Examiners (CFEs) estimates that organizations worldwide lose approximately five percent of their annual revenues to fraud. The study, which covered January 2010 to December 2011, puts the potential total global fraud loss above $3.5 trillion when applied to the estimated Gross World Product of 2011 (Source). This project explores how data can be used to combat fraud: we employ machine learning algorithms to flag behavior that resembles previously observed fraudulent activity. Fraud analytics typically involves imbalanced datasets, in which the fraud class is vastly outnumbered by the non-fraud class, so we also examine several techniques for addressing that challenge.
---
Dataset
The dataset used in this project comes from DataCamp's Fraud Detection in Python course. It consists of credit card transactions in which instances of fraud are, fortunately, rare. Machine learning algorithms, however, typically perform best when the classes are roughly balanced; when fraud cases are scarce, there is little data from which to learn what fraud looks like. This situation is known as class imbalance, and it is one of the central challenges in fraud detection. In the analysis below, we explore the dataset and examine the implications of this imbalance.
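To see concretely why imbalance is a problem, note that a classifier that never predicts fraud already achieves near-perfect accuracy while catching nothing. A toy sketch with made-up numbers (not this project's data):

import numpy as np
# Toy example: 1 fraud case among 1,000 transactions
y_true = np.array([1] * 1 + [0] * 999)
y_pred = np.zeros_like(y_true)  # the "never fraud" classifier
accuracy = (y_true == y_pred).mean()
print(f"Accuracy of the do-nothing classifier: {accuracy:.1%}")  # 99.9%, yet useless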
---
Setup and Imports
# Silence noisy library warnings in the notebook output
import warnings
warnings.filterwarnings('ignore')
warnings.simplefilter('ignore')
# Data preprocessing
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle
import numpy as np
from pprint import pprint as pp
import csv
from pathlib import Path
import seaborn as sns
from itertools import product
import string
# Text
import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
# Sampling
from imblearn.over_sampling import SMOTE
from imblearn.over_sampling import BorderlineSMOTE
from imblearn.pipeline import Pipeline
# Modelling
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import r2_score, classification_report, confusion_matrix, accuracy_score, roc_auc_score, roc_curve, precision_recall_curve, average_precision_score
from sklearn.metrics import homogeneity_score, silhouette_score
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import MiniBatchKMeans, DBSCAN
# Topic Modelling
import gensim
from gensim import corpora
# Configuration
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 200)
pd.set_option('display.min_rows', 10)
pd.set_option('display.expand_frame_repr', True)
# Load the sampled credit card transaction data and inspect it
df = pd.read_csv("creditcard_sampledata_3.csv")
df.info()
df.sample(5).transpose()
# Count fraud (1) vs non-fraud (0) observations
df.Class.value_counts()
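The raw counts already show the skew; normalized counts make the imbalance explicit as a fraction. A minimal sketch:

# Fraction of observations per class; fraud is the rare class
print(df["Class"].value_counts(normalize=True))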
# Mean of each feature per class
df.groupby("Class").mean()
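Since the per-class means suggest several features separate fraud from non-fraud, a quick scatter plot can make that separation visible. A minimal sketch, assuming the dataset carries the usual V1–V28 PCA feature columns of this credit card data (the column names are an assumption here):

# Scatter of two (assumed) PCA components, colored by class
plt.scatter(df["V1"], df["V2"], c=df["Class"], cmap="coolwarm", alpha=0.5)
plt.xlabel("V1")
plt.ylabel("V2")
plt.title("Transactions in the V1/V2 plane (fraud in warm color)")
plt.show()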
# Create the feature matrix and target vector
X = df.drop(["Unnamed: 0", "Class"], axis=1)
y = df["Class"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
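With so few positives, a purely random split can leave the test set nearly fraud-free. A stratified variant of the split above (an optional alternative, not the original code) keeps the class ratio identical in both partitions:

# Stratified split: preserves the fraud/non-fraud ratio in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)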
# Fit a logistic regression model to the training data
model = LogisticRegression(solver="liblinear")
model.fit(X_train, y_train)
# Obtain class predictions on the test set
predicted = model.predict(X_test)
# Predict class probabilities (column 1 is the fraud probability)
probs = model.predict_proba(X_test)
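Because accuracy is uninformative on imbalanced data, the metrics imported above are the ones worth reading. A minimal evaluation sketch using the predictions and probabilities just computed:

# Confusion matrix and per-class precision/recall
print(confusion_matrix(y_test, predicted))
print(classification_report(y_test, predicted))
# ROC AUC is computed from the probability of the fraud class (column 1)
print("ROC AUC:", roc_auc_score(y_test, probs[:, 1]))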