1 hidden cell
Explore this dataset
Here are some ideas to get you started with your analysis...
- 🗺️ Explore: What types of purchases are most likely to be instances of fraud? Consider both product category and the amount of the transaction.
- 📊 Visualize: Use a geospatial plot to visualize the fraud rates across different states.
- 🔍 Analyze: Are older customers significantly more likely to be victims of credit card fraud?
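As a minimal sketch of the first idea (assuming the data set has been loaded into a DataFrame named `ccf`, as is done in the code further below):

# Fraud rate per merchant category, highest first
fraud_by_category = (
    ccf.groupby('category')['is_fraud']
       .agg(['mean', 'count'])
       .sort_values('mean', ascending=False)
)
print(fraud_by_category)
# Compare transaction amounts for fraudulent vs. legitimate transactions
print(ccf.groupby('is_fraud')['amt'].describe())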
Scenario: Accurately Predict Instances of Credit Card Fraud
This scenario helps you develop an end-to-end project for your portfolio.
Background: A new credit card company has just entered the market in the western United States. The company is promoting itself as one of the safest credit cards to use. They have hired you as their data scientist in charge of identifying instances of fraud. The executive who hired you has provided you with data on credit card transactions, including whether or not each transaction was fraudulent.
Objective: The executive wants to know how accurately you can predict fraud using this data. She has stressed that the model should err on the side of caution: it is not a big problem to flag transactions as fraudulent when they aren't, if that helps catch actual fraud; in other words, false positives are acceptable, but false negatives are costly. In your report, you will need to describe how well your model functions and how it adheres to these criteria.
You will need to prepare a report that is accessible to a broad audience. It will need to outline your motivation, analysis steps, findings, and conclusions.
You can query the pre-loaded CSV file using SQL directly. Here's a sample query, followed by some sample Python code and outputs:
1 hidden cell
"""External libraries that require installation """
!pip install ISLP
!pip install l0bnb
"""Native and most commonly used packages listed here"""
import os
import inspect
import re as p_regex
from functools import partial
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
"""Statistical packages"""
import statsmodels.api as sm
from sklearn.model_selection import (train_test_split, cross_val_predict, cross_val_score, KFold, ShuffleSplit, GridSearchCV, StratifiedKFold)
from sklearn.base import clone
from sklearn.utils import resample
from sklearn.linear_model import (ElasticNet, ElasticNetCV, Lasso, LinearRegression)
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import PLSRegression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, precision_score, recall_score, f1_score
from ISLP import load_data
from ISLP.models import (ModelSpec as MS, summarize, poly)
from ISLP.models import sklearn_sm
from ISLP.models import (Stepwise, sklearn_selected, sklearn_selection_path)
from l0bnb import fit_path
Data Dictionary
| Column | Description |
|---|---|
| trans_date_trans_time | Transaction Date and Time |
| merchant | Merchant Name |
| category | Category of Merchant |
| amt | Amount of Transaction |
| city | City of Credit Card Holder |
| state | State of Credit Card Holder |
| lat | Latitude Location of Purchase |
| long | Longitude Location of Purchase |
| city_pop | Credit Card Holder's City Population |
| job | Job of Credit Card Holder |
| dob | Date of Birth of Credit Card Holder |
| trans_num | Transaction Number |
| merch_lat | Latitude Location of Merchant |
| merch_long | Longitude Location of Merchant |
| is_fraud | Whether Transaction is Fraud (1) or Not (0) |
Concept for analysing the credit card fraud data using the credit_card_fraud data set
- Clean data:
  - Check for missing values
  - Check for incorrect entries (patterns, correct values, data types)
  - Check rows for completeness
  - Where needed, change data types and create categorical data types
  - Try to detect outliers and high-leverage points (only note potential ones here)
- Create visuals for detecting patterns:
  - Create scatter plots/a scatter matrix to check for patterns
  - Create histograms to check the distribution of is_fraud given specific combinations of values
  - Try to pre-determine the features to use for the linear regression
- Create analytical methods:
  - Apply known linear regression methods, including simple (one-predictor) linear regression, multiple linear regression, linear regression with interaction terms, and logistic regression
  - Include a review of outliers, high-leverage points, and collinearity
  - Apply KNN as a non-parametric option
  - Use generative models (LDA, QDA, Naive Bayes)
  - Check Poisson regression as an alternative for predicting the probability of fraud occurrence
  - Possibly PCA and PLS
- Compare analytical methods (a sketch follows this list):
  - Create box plots for the test errors
  - Apply the validation set approach, LOOCV, and k-fold cross-validation
  - Review bootstrapping for enhancing validation set approaches
  - Apply selection methods (best subset selection, stepwise selection [forward, backward, both], ridge regression, lasso, possibly a Bayesian form)
  - Review and plot Cp, AIC, BIC, and adjusted R² for each model
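A minimal sketch of the comparison step, assuming a feature matrix X and target y have already been prepared from the cleaned data; the model line-up and the two extra imports are illustrative:

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier

# Illustrative candidates; X and y are assumed to be prepared elsewhere
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'LDA': LinearDiscriminantAnalysis(),
    'KNN (k=5)': KNeighborsClassifier(n_neighbors=5),
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
scores = {name: cross_val_score(model, X, y, cv=cv, scoring='recall')
          for name, model in models.items()}

# Box plots of the cross-validated recall per method
plt.boxplot(list(scores.values()), labels=list(scores.keys()))
plt.ylabel('Cross-validated recall')
plt.title('Comparison of analytical methods')
plt.show()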
def check_csv_file(file):
    """
    Takes a filename and checks whether it exists in the current working
    directory and ends with .csv. Returns a pd.read_csv() object.
    parameter: file <-- must be a .csv file
    """
    dir_files = os.listdir()
    if file in dir_files and p_regex.match(r'.*\.csv$', file):
        return pd.read_csv(file)
    raise FileNotFoundError(str(file) + " does not end with .csv or does not exist in " + os.getcwd())
"""Check for missing values or empty values"""
def check_val(df):
"""
Checks if a dataframe(df) has any missing values. Input must be a valid pd.DataFrame object
"""
if type(df) == pd.DataFrame:
collector = []
for i in range(0, len(ccf.columns)):
mask = ccf[ccf.columns[i]].isna() | (ccf[ccf.columns[i]] == "")
if mask.any() == True:
collector.append(index)
print(collector)
return collector
else:
return print("No missing or empty values detected")
else:
print("Please input a valid pd.DataFrame")
ccf = check_csv_file("credit_card_fraud.csv")
check_val(ccf)
#Check for incorrect entries (patterns, correct values, data types)
"""
possible pattern mismatches:
merchant --> must be consistent
trans_date_trans_time --> should be of pattern yyyy-mm-dd hh:mm:ss
category --> must be consistent (no typos, correct letters, correct length etc.)
amt --> must be a float of the form f.ff (dot decimal separator)
city --> must be consistent (no typos, correct letters, correct length etc.)
state --> must be consistent (no typos, correct letters, correct length etc.)
lat/long --> must be valid floats
city_pop --> must be valid numbers
job --> must be consistent (no typos, correct letters, correct length etc.)
dob --> must be of pattern yyyy-mm-dd
trans_num --> must have a common length(?)
merch_lat/merch_long --> must be valid floats
is_fraud --> classifier (0/1)
"""
#check merchant
len(ccf['merchant']) - len(ccf.loc[ccf['merchant'].str.istitle() == False, 'merchant']) #returns 168039, the number of title-case merchant names
#convert date and time column to datetime
#print(ccf['trans_date_trans_time'].dtype)
ccf['trans_date_trans_time'] = pd.to_datetime(ccf['trans_date_trans_time'], format='%Y-%m-%d %H:%M:%S', errors='raise')
#print(ccf['trans_date_trans_time'].dtype)
#check category column
ccf['category'].unique() #<-- No duplicate values or typos are returned; we can assume the values are all correct
#check amt
#print(ccf['amt'].dtype == float) #returns true
ccf.loc[ccf['amt'].astype(str).str.contains(r'^[0-9]+(?:,[0-9]{2})?$') == True, 'amt'] #returns no rows, i.e. no amounts use a comma as decimal separator
#check city
ccf['city'].unique() #longer list, requires external validation
dup_cities = ccf['city'].duplicated()
ccf.loc[dup_cities == True, 'city'] #duplicates are expected (many transactions per city), so this check alone cannot flag typos
#check state
ccf['state'].unique() #returns a correct list of states; would require external validation
ccf.loc[ccf['state'].str.len() != 2, 'state'] #returns no values, i.e. all states are of length 2
#check lat/long
ccf['lat'].dtype #returns float
ccf['long'].dtype #returns float
#check city_pop
ccf['city_pop'].dtype #returns int64, would require external validation
#check job
ccf['job'].unique() #shows that not all jobs are in title case
len(ccf['job']) - len(ccf.loc[ccf['job'].str.istitle() == True, 'job']) #inconsistent: 50458 values are not title case
ccf['job'] = ccf['job'].str.title()
len(ccf['job']) - len(ccf.loc[ccf['job'].str.istitle() == True, 'job']) #consistent: 0 values are not title case
#ccf['job'].unique() #shows only title cases
#check dob (date of birth)
ccf['dob'].dtype #returns object; should be date/datetime, possibly change to Period later
ccf['dob'] = pd.to_datetime(ccf['dob'], format='%Y-%m-%d')
ccf['dob'].dtype
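With both date columns converted, we can derive the cardholder's age at the time of each transaction, which feeds the age question raised at the top. The column name age_at_trans is a choice made here, not part of the original data:

# Age of the cardholder (in whole years) at the time of the transaction
ccf['age_at_trans'] = (ccf['trans_date_trans_time'] - ccf['dob']).dt.days // 365
ccf.groupby('is_fraud')['age_at_trans'].describe()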
#check trans_num
ccf['trans_num'] #is a random string, would require external validation
#check merch_lat/merch_long
ccf['merch_lat'].dtype #returns float
ccf['merch_long'].dtype #returns float
#check is_fraud
ccf['is_fraud'].unique() #returns 0 or 1, so False or True
ccf['is_fraud'].dtype #returns int64 and is interpretable as boolean
#ccf['is_fraud'] = ccf['is_fraud'].astype('category')
#ccf['is_fraud'].dtype #returns category
1 hidden cell
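Before modeling, it is worth checking how imbalanced the target is; on rare-event data, a high raw accuracy can be misleading. A minimal sketch:

# Class balance of the target: with heavy imbalance, a model that always
# predicts "not fraud" already achieves a deceptively high accuracy
print(ccf['is_fraud'].value_counts())
print("Fraud rate:", ccf['is_fraud'].mean())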
Part 3:
Create analytical methods:
- Apply known linear regression methods, including simple (one-predictor) linear regression, multiple linear regression, linear regression with interaction terms, and logistic regression
- Include a review of outliers, high-leverage points, and collinearity
- Apply KNN as a non-parametric option
- Use generative models (LDA, QDA, Naive Bayes)
- Check Poisson regression as an alternative for predicting the probability of fraud occurrence
- Possibly PCA and PLS
Now, following our initial analysis in Basic Analytics - Credit Card Fraud, we will pursue a simple linear regression model in order to answer the following:
- Can we determine the most relevant factors that help us decide which transactions should be flagged as fraud?
- Can we build a model that classifies fraudulent transactions correctly?
For this, we will use the same data set as in Basic Analytics - Credit Card Fraud.
6 hidden cells
After some exploratory attempts with different combinations of predictor variables, we can conclude that none of the results are in any way satisfactory. This is expected: it would have been very lucky to find a model that already works well on such a complex data set with only a few variables.
In the next steps we will repeat exactly the same procedure as before, but using logistic regression instead of linear regression.
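As a minimal sketch of that logistic regression step, using the metrics imported at the top; the feature subset here is purely illustrative, and the actual features follow from the selection steps above:

# Illustrative feature subset; the real choice comes from the analysis above
features = ['amt', 'city_pop', 'lat', 'long']
X_train, X_test, y_train, y_test = train_test_split(
    ccf[features], ccf['is_fraud'], test_size=0.3,
    stratify=ccf['is_fraud'], random_state=1)

log_reg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = log_reg.predict(X_test)

# False positives are acceptable per the brief, so recall is the key metric
print(confusion_matrix(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1:", f1_score(y_test, y_pred))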