Credit Card Fraud
This dataset consists of credit card transactions in the western United States. Each transaction includes customer details, the merchant and category of purchase, and whether or not the transaction was fraudulent.
Note: You can access the data via the File menu or in the Context Panel at the top right of the screen next to Report, under Files. The data dictionary and filenames can be found at the bottom of this workbook.
Source: Kaggle. The data was partially cleaned and adapted by DataCamp.
We've added some guiding questions for analyzing this exciting dataset! Feel free to make this workbook yours by adding and removing cells, or editing any of the existing cells.
Explore this dataset
Here are some ideas to get you started with your analysis...
- Explore: What types of purchases are most likely to be instances of fraud? Consider both product category and the amount of the transaction (see the sketch after this list).
- Visualize: Use a geospatial plot to visualize the fraud rates across different states.
- Analyze: Are older customers significantly more likely to be victims of credit card fraud?
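For the first idea, a minimal starting point could look like the sketch below. It assumes the credit_card_fraud.csv file listed under Files and the column names from the data dictionary at the bottom of this workbook; ccf_preview is just an illustrative name.
import pandas as pd

ccf_preview = pd.read_csv('credit_card_fraud.csv')
#Fraud rate per merchant category: the mean of the 0/1 is_fraud flag
print(ccf_preview.groupby('category')['is_fraud'].mean().sort_values(ascending=False))
#Compare transaction amounts for fraudulent vs. legitimate transactions
print(ccf_preview.groupby('is_fraud')['amt'].describe())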
Scenario: Accurately Predict Instances of Credit Card Fraud
This scenario helps you develop an end-to-end project for your portfolio.
Background: A new credit card company has just entered the market in the western United States. The company is promoting itself as one of the safest credit cards to use. They have hired you as their data scientist in charge of identifying instances of fraud. The executive who hired you has provided you with data on credit card transactions, including whether or not each transaction was fraudulent.
Objective: The executive wants to know how accurately you can predict fraud using this data. She has stressed that the model should err on the side of caution: it is not a big problem to flag a legitimate transaction as fraudulent just to be safe. In your report, you will need to describe how well your model functions and how it adheres to these criteria.
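Given this preference for catching fraud over avoiding false alarms, recall on the fraud class is a natural headline metric to report alongside precision and the confusion matrix. A small self-contained sketch with made-up labels (y_true and y_flag are illustrative placeholders, not columns of this dataset):
from sklearn.metrics import confusion_matrix, recall_score, precision_score

#Illustrative only: hypothetical true labels and model flags (1 = fraud)
y_true = [0, 0, 0, 1, 1, 0, 1, 0, 0, 1]
y_flag = [0, 1, 0, 1, 1, 0, 1, 1, 0, 0]

#Recall on the fraud class: the share of true frauds that were flagged (here 3/4)
#Precision: the share of flagged transactions that really were fraud (here 3/5)
print(confusion_matrix(y_true, y_flag))
print('recall:   ', recall_score(y_true, y_flag))
print('precision:', precision_score(y_true, y_flag))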
You will need to prepare a report that is accessible to a broad audience. It will need to outline your motivation, analysis steps, findings, and conclusions.
You can query the pre-loaded CSV file using SQL directly. Here's a sample query, followed by some sample Python code and outputs:
SELECT * FROM 'credit_card_fraud.csv'
LIMIT 5
"""External libraries that require installation"""
!pip install ISLP
!pip install l0bnb
"""Native and most commonly used packages listed here"""
import os
import inspect
import re as p_regex
from functools import partial
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
"""Statistical packages"""
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.model_selection import (cross_val_predict, KFold, ShuffleSplit, GridSearchCV)
from sklearn.base import clone
from sklearn.utils import resample
from sklearn.linear_model import (ElasticNet, ElasticNetCV, Lasso, LinearRegression)
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import PLSRegression
from ISLP import load_data
from ISLP.models import (ModelSpec as MS, summarize, poly)
from ISLP.models import sklearn_sm
from ISLP.models import (Stepwise, sklearn_selected, sklearn_selection_path)
from l0bnb import fit_path
Data Dictionary
| Column | Description |
|---|---|
| trans_date_trans_time | Transaction DateTime |
| merchant | Merchant Name |
| category | Category of Merchant |
| amt | Amount of Transaction |
| city | City of Credit Card Holder |
| state | State of Credit Card Holder |
| lat | Latitude Location of Purchase |
| long | Longitude Location of Purchase |
| city_pop | Credit Card Holder's City Population |
| job | Job of Credit Card Holder |
| dob | Date of Birth of Credit Card Holder |
| trans_num | Transaction Number |
| merch_lat | Latitude Location of Merchant |
| merch_long | Longitude Location of Merchant |
| is_fraud | Whether Transaction is Fraud (1) or Not (0) |
Concept for analysing Credit Card Fraud data using the credit_card_fraud data set
- Clean data:
- Check for missing values
- Check for incorrect entries (patterns, correct values, data types)
- Check rows for completeness
- If needed, change data types and create categorical data types
- Try to detect outliers and high leverage points (only write down potential ones here)
- Create visuals for detecting patterns:
- Create scatter plots/a scatter matrix to check for patterns
- Create histograms to check the distribution of is_fraud for specific combinations of values
- Try to pre-determine the features to use for the linear regression
- Create analytical methods:
- Apply known regression methods, including simple linear regression, multiple linear regression, linear regression with interaction terms, and logistic regression (a baseline sketch follows this outline)
- Include review of outliers, high-leverage points, collinearity
- Apply KNN as a non-parametric option
- Use generative models (LDA, QDA, Naive Bayes)
- Check Poisson regression as an alternative way to model fraud occurrence
- Possibly PCA and PLS regression
- Compare analytical methods:
- Create box plots for the test errors
- Apply the validation set approach, LOOCV, and k-fold cross-validation
- Review bootstrapping for enhancing validation set approaches
- Apply selection methods (best subset selection, stepwise selection [forward, backward, both], ridge regression, lasso, possibly a Bayesian variant)
- Review and plot Cp, AIC, BIC, and adjusted R2 for each model
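As a concrete starting point for the modeling and comparison steps above, a baseline sketch could look like the following. It assumes the cleaned ccf frame built in Part 1 below and uses amt and city_pop only as stand-in predictors; the logistic regression, stratified split, and recall scoring are illustrative choices, not fixed requirements.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

#Stand-in feature set; replace after feature selection
X = ccf[['amt', 'city_pop']]
y = ccf['is_fraud']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

baseline = Pipeline([
    ('scale', StandardScaler()),
    ('logit', LogisticRegression(max_iter=1000)),
])
baseline.fit(X_train, y_train)

#5-fold cross-validated recall on the training data (fraud is the positive class)
print(cross_val_score(baseline, X_train, y_train, cv=5, scoring='recall'))
print('test accuracy:', baseline.score(X_test, y_test))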
Part 1:
Clean data:
- Check for missing values
- Check for incorrect entries (patterns, correct values, data types)
- Check rows for completeness
- If needed, change data types and create categorical data types
- Try to detect outliers and high leverage points (only write down potential ones here)
def check_csv_file(file):
    """
    Takes a file name as input and checks whether it ends in .csv and exists in the
    current working directory. Returns a pd.read_csv() object, or None if the check fails.
    parameter: file <-- must be a .csv file
    """
    dir_files = os.listdir()
    if file in dir_files and p_regex.match(r'.*\.csv$', file):
        return pd.read_csv(file)
    else:
        print(str(file) + " does not end with .csv or does not exist in " + os.getcwd())
        return None
"""Check for missing values or empty values"""
def check_val(df):
    """
    Checks whether a DataFrame (df) has any missing or empty values and returns the
    names of the affected columns. Input must be a valid pd.DataFrame object.
    """
    if isinstance(df, pd.DataFrame):
        collector = []
        for col in df.columns:
            #flag NaN values as well as empty strings in each column
            mask = df[col].isna() | (df[col] == "")
            if mask.any():
                collector.append(col)
        if collector:
            print(collector)
            return collector
        else:
            print("No missing or empty values detected")
    else:
        print("Please input a valid pd.DataFrame")
ccf = check_csv_file("credit_card_fraud.csv")
check_val(ccf)
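"""
As a quick cross-check of check_val, pandas' built-ins report similar information directly:
"""
print(ccf.isna().sum())                                   #missing values per column
print((ccf.select_dtypes(include='object') == '').sum())  #empty strings in the text columns
print(ccf.dtypes)                                         #data types before any conversion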
#Check for incorrect entries (patterns, correct values, data types)
"""
possible pattern mismatches:
trans_date_trans_time --> should be of pattern yyyy-mm-dd hh:mm:ss
category --> must be consistent (no typos, correct letters, correct length etc)
amt --> must be a float with two decimal places
city --> must be consistent (no typos, correct letters, correct length etc.)
state --> must be consistent (no typos, correct letters, correct length etc.)
lat/long --> must be valid floats
city_pop --> must be valid numbers
job --> must be consistent (no typos, correct letters, correct length etc.)
dob --> must be of pattern yyyy-mm-dd
trans_num --> must have common length(?)
merch_lat/merch_long --> must be valid floats
is_fraud --> classifier
"""
#convert date and time column to datetime
#print(ccf['trans_date_trans_time'].dtype)
ccf['trans_date_trans_time'] = pd.to_datetime(ccf['trans_date_trans_time'], format='%Y-%m-%d %H:%M:%S', errors='raise')
#print(ccf['trans_date_trans_time'].dtype)
#check category column
ccf['category'].unique() #<-- No duplicate values or typos are returned; we can assume the values are all correct
#check amt
#print(ccf['amt'].dtype == float) #returns true
ccf.loc[~ccf['amt'].astype(str).str.match(r'^[0-9]+\.[0-9]{1,2}$'), 'amt'] #lists any amounts that do not match the expected d.dd pattern; an empty result means the column is consistent
#check city
ccf['city'].unique() #longer list, requires external validation;
dup_cities = ccf['city'].duplicated()
ccf.loc[dup_cities == True, 'city'] #returns no values, i.e. no duplicates
#check state
ccf['state'].unique() #returns correct list of states; would require external validation
ccf.loc[ccf['state'].str.len() != 2, 'state'] #returns no values, i.e. all states are of length 2
#check lat/long
ccf['lat'].dtype #returns float
ccf['long'].dtype #returns float
#check city_pop
ccf['city_pop'].dtype #returns int64, would require external validation
#check job
ccf['job'].unique() #shows that job entries are not consistently title case
len(ccf['job']) - len(ccf.loc[ccf['job'].str.istitle() == False, 'job']) #inconsistent: 50458 job entries are not title case
ccf['job'] = ccf['job'].str.title()
len(ccf['job']) - len(ccf.loc[ccf['job'].str.istitle() == True, 'job']) #consistent: now 0 entries are not title case
#ccf['job'].unique() #shows only title cases
#check dob (date of birth)
ccf['dob'].dtype #returns object, should be date/datetime; possibly change to Period later, if useful
ccf['dob'] = pd.to_datetime(ccf['dob'], format='%Y-%m-%d')
ccf['dob'].dtype
#check trans_num
ccf['trans_num'] #is a random string, would require external validation
#check merch_lat/merch_long
ccf['merch_lat'].dtype #returns float
ccf['merch_long'].dtype #returns float
#check is_fraud
ccf['is_fraud'].unique() #returns 0 or 1, so False or True
ccf['is_fraud'].dtype #returns int64 and is interpretable as boolean
#ccf['is_fraud'] = ccf['is_fraud'].astype('category')
#ccf['is_fraud'].dtype #returns category
"""
Let us display the amount of fraud transactions per year and for both years per month
"""
#ccf['trans_date_trans_time'].dt.year.unique() <-- Outputs 2019, 2020
#ccf['trans_date_trans_time'].dt.month.unique() <-- Outputs 1 to 12
#ccf['trans_date_trans_time'].dt.day.unique() <-- Outputs 1 to 31
#ccf['trans_date_trans_time'].dt.hour.unique() <-- Outputs 0 to 23
#ccf['trans_date_trans_time'].dt.minute.unique() <-- Outputs 0 to 58 (no 59?!??)
"""
Create df for fraud for different intervals
"""
is_fraud_year = ccf.loc[ccf['is_fraud'] == 1, 'trans_date_trans_time'].dt.year
is_fraud_month = ccf.loc[ccf['is_fraud'] == 1, 'trans_date_trans_time'].dt.month
is_fraud_day = ccf.loc[ccf['is_fraud'] == 1, 'trans_date_trans_time'].dt.day
is_fraud_hour = ccf.loc[ccf['is_fraud'] == 1, 'trans_date_trans_time'].dt.hour
is_fraud_minute = ccf.loc[ccf['is_fraud'] == 1, 'trans_date_trans_time'].dt.minute
#plt.plot(ccf['is_fraud'], is_fraud_year, )
"""
Creates counts of fraud transactions for a specific year, month etc.
"""
fraud_per_year = pd.crosstab(is_fraud_year, columns='is_fraud')
fraud_per_month = pd.crosstab(is_fraud_month, columns='is_fraud')
fraud_per_day = pd.crosstab(is_fraud_day, columns='is_fraud')
fraud_per_hour = pd.crosstab(is_fraud_hour, columns='is_fraud')
fraud_per_minute = pd.crosstab(is_fraud_minute, columns='is_fraud')
"""
Plot the year and the month fraud count
"""
fig, axes = plt.subplots(2,1)
#fraud_per_category = pd.crosstab(is_fraud_year, columns='is_fraud')
axes[0].bar(fraud_per_year.index, fraud_per_year['is_fraud'])
axes[1].bar(fraud_per_month.index, fraud_per_month['is_fraud'])
axes[0].set_title('Fraud per Year')
axes[0].legend(loc='upper right', labels=fraud_per_year.columns)
axes[0].set_xticks(fraud_per_year.index, fraud_per_year.index)
axes[1].set_title('Fraud per Month', loc='center')
axes[1].legend(loc='upper right', labels=fraud_per_month.columns)
axes[1].set_xticks(fraud_per_month.index, fraud_per_month.index);
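"""
The next comment asks for fraud per month in a given year; a minimal sketch reusing the fraud subsets from above:
"""
#Cross-tabulate fraud counts by year (rows) and month (columns)
fraud_year_month = pd.crosstab(is_fraud_year, is_fraud_month, rownames=['year'], colnames=['month'])
print(fraud_year_month)
fraud_year_month.T.plot(kind='bar', figsize=(10, 4), title='Fraud per month, split by year');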
#Add fraud per month in a given year (a starting point is sketched above)
#Here we can see that most of the fraud amounts are in the lower range, below 10000
pd.plotting.scatter_matrix(ccf[['amt', 'city_pop', 'is_fraud']], figsize=(10,10));
#We can confirm this by screening all fraud rows and checking the maximum amt, which is below 5000
ccf.loc[ccf['is_fraud'] == 1, ['amt']].max(axis=0)
"""
We create labels for the next plots and cut amt into four quartile-based groups
"""
labels = ccf['is_fraud'].unique()
ccf['amt_q10'] = pd.qcut(ccf['amt'], 4, labels=["small", "middle", "high", "very high"])
segments = ccf['amt_q10'].unique()
titles = ccf['amt_q10'].unique().sort_values()
"""
For each quantile, we create a pd.DataFrame
"""
ccf_small = ccf.loc[ccf['amt_q10'] == 'small', ['amt', 'is_fraud']]
ccf_middle = ccf.loc[ccf['amt_q10'] == 'middle', ['amt', 'is_fraud']]
ccf_high = ccf.loc[ccf['amt_q10'] == 'high', ['amt', 'is_fraud']]
ccf_very_high = ccf.loc[ccf['amt_q10'] == 'very high', ['amt', 'is_fraud']]
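"""
With the quartile bins in place, one compact way to compare them is the fraud rate per bin (a sketch using the amt_q10 column created above):
"""
#Fraud rate within each amount quartile: the mean of the 0/1 is_fraud flag
fraud_rate_per_bin = ccf.groupby('amt_q10', observed=True)['is_fraud'].mean()
print(fraud_rate_per_bin)
fraud_rate_per_bin.plot(kind='bar', title='Fraud rate per transaction-amount quartile');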