Credit Card Fraud
This dataset consists of credit card transactions in the western United States. Each record includes customer details, the merchant and purchase category, and whether or not the transaction was fraudulent.
import pandas as pd
df = pd.read_csv('credit_card_fraud.csv')
Data Dictionary
| Column | Description |
|---|---|
| trans_date_trans_time | Transaction DateTime |
| merchant | Merchant Name |
| category | Category of Merchant |
| amt | Amount of Transaction |
| city | City of Credit Card Holder |
| state | State of Credit Card Holder |
| lat | Latitude Location of Purchase |
| long | Longitude Location of Purchase |
| city_pop | Credit Card Holder's City Population |
| job | Job of Credit Card Holder |
| dob | Date of Birth of Credit Card Holder |
| trans_num | Transaction Number |
| merch_lat | Latitude Location of Merchant |
| merch_long | Longitude Location of Merchant |
| is_fraud | Whether Transaction is Fraud (1) or Not (0) |
Source of dataset: the data was partially cleaned and adapted by DataCamp.
Exploratory Analysis
I would like to investigate a few things to start, such as:
- Locations where fraud is most likely to occur
- Distance between the purchase location and the merchant's location
- Date and time of transactions
- Age of the customer (derived below from the date of birth)
# import necessary packages
import numpy as np
import math
import matplotlib.pyplot as plt
import seaborn as sns
# look for missing values
print(df.describe())
df.info()  # info() prints its summary directly, so it doesn't need print()
print(df.isna().sum())  # explicit count of missing values per column
# convert datetime values (to_datetime returns a new Series, so assign back)
df['dob'] = pd.to_datetime(df['dob'])
df['trans_date_trans_time'] = pd.to_datetime(df['trans_date_trans_time'])
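Age of the customer was one of the factors listed above; with both columns converted, it can be derived directly. A small sketch; the 'age' column is my own addition, not part of the original data dictionary:
# age in years at the time of each transaction ('age' is a new, illustrative column)
df['age'] = (df['trans_date_trans_time'] - df['dob']).dt.days / 365.25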
# Fraud and number of transactions by state
fraud_by_state = df.groupby('state')['is_fraud'].mean().reset_index()
transactions_by_state = df.groupby('state').size()
# Fraud and number of transactions by city
fraud_by_city = df.groupby('city')['is_fraud'].mean().reset_index().sort_values(by='is_fraud', ascending=False).head(20)
transactions_by_city = df.groupby('city').size()
#set sns style and color palette
sns.set_style('whitegrid')
sns.barplot(data=fraud_by_state, x='state', y='is_fraud', palette='Blues')
plt.title('Fraud Rate by State')
plt.show()
plt.figure(figsize=(6,12))
sns.barplot(data=fraud_by_city, y='city', x='is_fraud', orient='h', palette='Blues')
plt.title('Cities with Highest Fraud Rate')
plt.show()
#Isolate cities with 100% fraud rate to investigate
full_fraud_cities = fraud_by_city[fraud_by_city['is_fraud']==1]
full_fraud_cities = full_fraud_cities['city']
print('Cities with 100% Fraud Rate')
display(full_fraud_cities)
display(df[df['city'].isin(full_fraud_cities)])
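A 100% fraud rate most likely reflects cities with only a handful of transactions rather than genuinely fraud-ridden locations. A quick check of the sample sizes, reusing the transactions_by_city counts computed above:
# how many transactions do the 100%-fraud cities actually have?
# tiny counts would mean the 100% rate is a small-sample artifact
print(transactions_by_city.loc[list(full_fraud_cities)])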
# Fraud and number of transactions by category
fraud_by_category = df.groupby('category')['is_fraud'].mean().reset_index().sort_values(by='is_fraud', ascending=False)
transactions_by_category = df.groupby('category').size()
#plot figure
plt.figure(figsize=(6,12))
sns.barplot(data=fraud_by_category, y='category', x='is_fraud', orient='h', palette='Blues')
plt.title('Fraud Level by Purchase Category')
plt.show()
# add those features to the dataset (left commented out; see the note below)
#df = pd.merge(df, pd.concat([fraud_by_city,transactions_by_city], axis=1), on='city', how='left', suffixes=['rate','transaction_count'])
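As written, the commented-out merge would misbehave: fraud_by_city keeps only the top 20 cities and has city as a regular column, while transactions_by_city is a Series indexed by city, so pd.concat(..., axis=1) would not line them up, and suffixes only disambiguates overlapping column names rather than renaming. A sketch of one way to attach both features instead; the city_fraud_rate and city_transaction_count names are my own:
# build a per-city feature table over all cities (not just the top 20)
city_features = df.groupby('city')['is_fraud'].mean().rename('city_fraud_rate').reset_index()
city_features['city_transaction_count'] = city_features['city'].map(transactions_by_city)
df = df.merge(city_features, on='city', how='left')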
# create a function to calculate the distance between two points using the Haversine formula
def haversine(lat1, lon1, lat2, lon2):
    R = 6371  # radius of the earth in km
    dLat = math.radians(lat2 - lat1)
    dLon = math.radians(lon2 - lon1)
    a = math.sin(dLat / 2) ** 2 + math.cos(math.radians(lat1)) * math.cos(math.radians(lat2)) * math.sin(dLon / 2) ** 2
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    return R * c  # distance in km
# create a new feature 'distance' that calculates the distance between the purchase and merchant locations
df['distance'] = df.apply(lambda row: haversine(row['lat'], row['long'], row['merch_lat'], row['merch_long']), axis=1)
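df.apply with the scalar math functions evaluates the formula one row at a time, which is slow on large frames. An equivalent vectorized version with NumPy, as a sketch, computing the same distances over whole columns at once:
# vectorized Haversine over whole columns (same result as the row-wise apply above)
lat1, lon1 = np.radians(df['lat']), np.radians(df['long'])
lat2, lon2 = np.radians(df['merch_lat']), np.radians(df['merch_long'])
a = np.sin((lat2 - lat1) / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
df['distance'] = 6371 * 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))  # km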
display(df.head(25))
# Correlation Matrix
corr = df.corr(numeric_only=True)  # restrict to numeric columns so non-numeric ones don't error
sns.heatmap(corr,cmap='Blues')
plt.show()
Identifying Customers
There is no specific customer identification in the dataset, but between the birth date, city, and job title, it should be possible to parse out individual customers and attach behavior unique to each one, which might make it easier for models to spot outliers.
# Identify unique customers based on a combination of birthdate, city, and job
df['customer_id_test'] = df['dob'].astype(str)+df['city'].astype(str)+df['job'].astype(str)
customers = df.groupby('customer_id_test')
# Get the unique customers and create and apply customer_ids
unique_customers = list(customers.groups.keys())
df["customer_id"] = df["customer_id_test"].map(dict(zip(unique_customers, range(len(unique_customers)))))
display(df.head(25))
#test on one customer_id to make sure the id lines up
display(df[df['customer_id']==122])
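For what it's worth, pandas can produce these ids directly from the raw columns; groupby(...).ngroup() numbers each (dob, city, job) group without building the concatenated key by hand. A sketch of the equivalent one-liner:
# equivalent id assignment: ngroup() labels each (dob, city, job) group 0..n-1
df['customer_id'] = df.groupby(['dob', 'city', 'job']).ngroup()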
Unique Customer Behavior
Now that each customer is uniquely identified and assigned a customer id, I would like to map out behavior specific to each customer, such as:
- 'avg_purchase_price'
- 'avg_distance'
I would also like to include the quantiles for each; a sketch follows below.
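One way these per-customer features could be computed, as a sketch: groupby().transform broadcasts each customer's statistic back onto every one of their transactions. The quantile column names (amt_q25, amt_q75) are my own illustration:
# per-customer averages, broadcast back onto each transaction
grp = df.groupby('customer_id')
df['avg_purchase_price'] = grp['amt'].transform('mean')
df['avg_distance'] = grp['distance'].transform('mean')
# per-customer quantiles of purchase amount, e.g. the 25th and 75th percentiles
df['amt_q25'] = grp['amt'].transform(lambda s: s.quantile(0.25))
df['amt_q75'] = grp['amt'].transform(lambda s: s.quantile(0.75))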