Credit Card Fraud
This dataset consists of credit card transactions in the western United States. It includes information about each transaction including customer details, the merchant and category of purchase, and whether or not the transaction was a fraud.
Here is the description (from Kaggle):
This is a simulated credit card transaction dataset containing legitimate and fraud transactions from the duration 1st Jan 2019 - 31st Dec 2020. It covers credit cards of 1000 customers doing transactions with a pool of 800 merchants.
The goals of the project
The project pursues two goals.
First, I will do some exploratory data analysis (EDA) to find out which variables are associated with fraud. For example, is there any relationship between a transaction's amount and the probability of its being fraudulent? Questions like that will be answered in the course of the EDA.
Second, I will build a classifier predicting whether a transaction is fraudulent. The variables identified in the EDA will be used to train the classifier.
Data Dictionary

| Column | Description |
|---|---|
| trans_date_trans_time | Transaction DateTime |
| merchant | Merchant Name |
| category | Category of Merchant |
| amt | Amount of Transaction |
| city | City of Credit Card Holder |
| state | State of Credit Card Holder |
| lat | Latitude Location of Purchase |
| long | Longitude Location of Purchase |
| city_pop | Credit Card Holder's City Population |
| job | Job of Credit Card Holder |
| dob | Date of Birth of Credit Card Holder |
| trans_num | Transaction Number |
| merch_lat | Latitude Location of Merchant |
| merch_long | Longitude Location of Merchant |
| is_fraud | Whether Transaction is Fraud (1) or Not (0) |
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Import the data and look at the first 20 rows.
ccf = pd.read_csv('credit_card_fraud.csv')
ccf.head(20)
# Examine the data types. The data was already loaded above, so there is no need to read the CSV again.
ccf.info() # The trans_date_trans_time col needs to be converted to the datetime format. The same goes for the dob col.
# Examine how dates and times are represented in the dataframe.
print(ccf['trans_date_trans_time'].sample(n=100)) # ISO 8601 format
print(ccf['dob'].sample(n=100)) # ISO 8601 format
# Convert the trans_date_trans_time and dob cols to the datetime format.
ccf['trans_date_trans_time'] = pd.to_datetime(ccf['trans_date_trans_time'])
ccf['dob'] = pd.to_datetime(ccf['dob'])
ccf.info() # Now the dtypes are correct. info() prints directly, so no print() is needed.
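On a large frame, passing an explicit format string to `pd.to_datetime` avoids per-row format inference and can speed up parsing considerably. A small sketch with hypothetical timestamps in the same ISO 8601 layout:

```python
import pandas as pd

# Hypothetical sample values in the same layout as trans_date_trans_time.
s = pd.Series(['2019-01-01 00:00:18', '2020-12-31 23:59:24'])

# An explicit format skips pandas' format inference on every row.
parsed = pd.to_datetime(s, format='%Y-%m-%d %H:%M:%S')
print(parsed.dt.year.tolist())  # [2019, 2020]
```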
# Check missing values
ccf.isnull().sum() # No missing values
# Examine categorical variables
print(ccf.describe(include='object'))
# Examine numeric columns
print(ccf.describe())
# Examine the relationship between the amount of a transaction and the likelihood of its being fraudulent.
# Look at the distribution of the values of the amt column.
ccf['amt'].describe() # The outlier value of 28948.9 may obscure the column's histogram.
# Create a histogram of the amt column without the outliers.
# I will use the standard Tukey fence for upper outliers: a value more than 1.5 * IQR above the 75th percentile.
iqr = ccf['amt'].quantile(.75) - ccf['amt'].quantile(.25)
threshold = ccf['amt'].quantile(.75) + 1.5 * iqr
# Histogram without outliers
ccf[ccf['amt'] < threshold]['amt'].hist()
plt.show()
# Histogram of the outlier values
ccf[ccf['amt'] >= threshold]['amt'].hist()
plt.show()
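The upper fence computed above can be sanity-checked on a toy series (the numbers below are made up for illustration):

```python
import pandas as pd

# Toy amounts to illustrate the Tukey upper fence: Q3 + 1.5 * IQR.
toy_amt = pd.Series([10, 20, 30, 40, 1000])
q1, q3 = toy_amt.quantile(.25), toy_amt.quantile(.75)
upper_fence = q3 + 1.5 * (q3 - q1)
print(upper_fence)                              # 70.0
print(toy_amt[toy_amt > upper_fence].tolist())  # [1000]
```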
# Cut the amt column into bins.
bins = [0,50,100,200,500,1000,5000,np.inf]
labels = ['0 to 50','50 to 100','100 to 200','200 to 500','500 to 1000','1000 to 5000','over 5000']
ccf['amt_category'] = pd.cut(ccf['amt'],bins=bins,labels=labels)
print(ccf[['amt','amt_category']].sample(100)) # Cutting worked as expected.
# Examine the relationship between the amt_category and the is_fraud columns
print(ccf.groupby('amt_category').agg(mean=('is_fraud','mean'),n_transactions=('is_fraud','count')))
# The share of fraudulent transactions varies a lot between the amount categories: from less than 0.1% for transactions between 50 and 100 to more than 20% for transactions between 500 and 1000 and between 1000 and 5000.
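Because is_fraud is a 0/1 indicator, its group mean is exactly the fraud rate. The same per-category rates can also be computed with `pd.crosstab(..., normalize='index')`; a sketch on toy data (the categories and values are made up):

```python
import pandas as pd

# Toy data mimicking amt_category / is_fraud. normalize='index' turns
# counts into row proportions, i.e. the fraud rate per category.
toy = pd.DataFrame({'amt_category': ['low', 'low', 'high', 'high'],
                    'is_fraud': [0, 0, 1, 0]})
rates = pd.crosstab(toy['amt_category'], toy['is_fraud'], normalize='index')
print(rates.loc['high', 1])  # 0.5 -> half of the 'high' transactions are fraud
print(rates.loc['low', 0])   # 1.0 -> none of the 'low' transactions are fraud
```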
# Visualize this pattern
sns.barplot(data=ccf,x='amt_category',y='is_fraud')
plt.xticks(rotation=90)
plt.title('Proportion of fraudulent transactions by amount of transaction')
plt.show() # The plot confirms a strong association between the amount of a transaction and whether it is fraudulent.
# Create new columns to be used in the EDA
# Cut the city_pop column into bins
bins = [0,50_000,100_000,200_000,500_000,1_000_000,np.inf] # city_pop holds raw counts, so the bin edges must be in raw counts, not thousands.
labels = ['0 to 50K','50K to 100K','100K to 200K','200K to 500K','500K to 1000K','over 1000K']
ccf['city_pop_category'] = pd.cut(ccf['city_pop'],bins=bins,labels=labels)
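A quick way to check that the edges and labels line up is to cut a few hand-picked populations (the values below are arbitrary):

```python
import numpy as np
import pandas as pd

# Toy populations in raw counts; labels abbreviate thousands as K.
pop = pd.Series([30_000, 150_000, 2_000_000])
bins = [0, 50_000, 100_000, 200_000, 500_000, 1_000_000, np.inf]
labels = ['0 to 50K', '50K to 100K', '100K to 200K',
          '200K to 500K', '500K to 1000K', 'over 1000K']
binned = pd.cut(pop, bins=bins, labels=labels)
print(binned.tolist())  # ['0 to 50K', '100K to 200K', 'over 1000K']
```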
# Create the month column - it stores the month when a transaction was made.
ccf['month'] = ccf['trans_date_trans_time'].dt.month
# Create the day of the week column - the day of the week when a transaction was made.
ccf['dow'] = ccf['trans_date_trans_time'].dt.dayofweek # 0 stands for Monday.
# Create the hour column - the hour when a transaction was made.
ccf['hour'] = ccf['trans_date_trans_time'].dt.hour
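The `.dt` accessors used above can be verified on a single toy timestamp (the date below is an arbitrary Monday, not from the dataset):

```python
import pandas as pd

# 2019-01-07 was a Monday, so dayofweek should be 0.
ts = pd.to_datetime(pd.Series(['2019-01-07 13:45:00']))
print(ts.dt.month.iloc[0])      # 1
print(ts.dt.dayofweek.iloc[0])  # 0 (Monday)
print(ts.dt.hour.iloc[0])       # 13
```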
# Extract the year from the dob column
ccf['dob_year'] = ccf['dob'].dt.year
print(ccf['dob_year'].describe()) # The oldest person in the data set was born in 1927, the youngest in 2001.
# Bin the dob_year column into generations: silent generation, baby boomers, generation X, millennials, and zoomers. Store the result in the generation column.
conditions =[(ccf['dob_year'] >= 1925)&(ccf['dob_year'] <=1945),
(ccf['dob_year'] >= 1946)&(ccf['dob_year'] <= 1964),
(ccf['dob_year'] >= 1965)&(ccf['dob_year'] <=1980),
(ccf['dob_year'] >= 1981)&(ccf['dob_year'] <=1996),
(ccf['dob_year'] >= 1997)&(ccf['dob_year'] <= 2012)]
labels = ['silent generation','baby boomers','generation x','millennials','zoomers']
ccf['generation'] = np.select(conditions,labels,default='unknown') # Years outside all the ranges fall through to 'unknown'.
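The condition-to-label mapping can be checked on a few toy birth years, including one outside all the ranges to exercise the default (the years below are made up):

```python
import numpy as np
import pandas as pd

# One toy year per generation, plus 1910, which matches no condition.
years = pd.Series([1930, 1950, 1970, 1990, 2000, 1910])
conditions = [(years >= 1925) & (years <= 1945),
              (years >= 1946) & (years <= 1964),
              (years >= 1965) & (years <= 1980),
              (years >= 1981) & (years <= 1996),
              (years >= 1997) & (years <= 2012)]
labels = ['silent generation', 'baby boomers', 'generation x',
          'millennials', 'zoomers']
gen = np.select(conditions, labels, default='unknown')
print(gen.tolist())
# ['silent generation', 'baby boomers', 'generation x', 'millennials', 'zoomers', 'unknown']
```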