Credit Card Fraud
This dataset consists of credit card transactions in the western United States. It includes information about each transaction including customer details, the merchant and category of purchase, and whether or not the transaction was a fraud.
Here is the description (from Kaggle):
This is a simulated credit card transaction dataset containing legitimate and fraud transactions from the duration 1st Jan 2019 - 31st Dec 2020. It covers credit cards of 1000 customers doing transactions with a pool of 800 merchants.
The goals of the project
The project pursues two goals.
First, I will do some exploratory data analysis (EDA) to find out which variables are associated with fraud. For example, is there any relationship between a transaction's amount and the probability of its being fraudulent? Questions like that will be answered in the course of the EDA.
Second, I will build a classifier predicting whether a transaction is fraudulent. The variables identified in the EDA will be used to train the classifier.
Data Dictionary

| Column | Description |
|---|---|
| trans_date_trans_time | Transaction DateTime |
| merchant | Merchant Name |
| category | Category of Merchant |
| amt | Amount of Transaction |
| city | City of Credit Card Holder |
| state | State of Credit Card Holder |
| lat | Latitude Location of Purchase |
| long | Longitude Location of Purchase |
| city_pop | Credit Card Holder's City Population |
| job | Job of Credit Card Holder |
| dob | Date of Birth of Credit Card Holder |
| trans_num | Transaction Number |
| merch_lat | Latitude Location of Merchant |
| merch_long | Longitude Location of Merchant |
| is_fraud | Whether Transaction is Fraud (1) or Not (0) |
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Import the data and look at the first 20 rows.
ccf = pd.read_csv('credit_card_fraud.csv')
ccf.head(20)
# Examine the data types. The data was already loaded above, so there is no need to read the CSV again.
ccf.info() # The trans_date_trans_time col needs to be converted to the datetime format. The same goes for the dob col.
# Examine how dates and times are represented in the dataframe.
print(ccf['trans_date_trans_time'].sample(n=100)) # ISO 8601 format
print(ccf['dob'].sample(n=100)) # ISO 8601 format
# Convert the trans_date_trans_time and dob cols to the datetime format.
ccf['trans_date_trans_time'] = pd.to_datetime(ccf['trans_date_trans_time'])
ccf['dob'] = pd.to_datetime(ccf['dob'])
ccf.info() # Now the dtypes are correct. info() prints directly, so no print() is needed.
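On a large frame, passing an explicit format string to `pd.to_datetime` avoids per-row format inference and can speed up parsing considerably. A small sketch with hypothetical timestamps in the same ISO 8601 layout:

```python
import pandas as pd

# Hypothetical sample values in the same layout as trans_date_trans_time.
s = pd.Series(['2019-01-01 00:00:18', '2020-12-31 23:59:24'])

# An explicit format skips pandas' format inference on every row.
parsed = pd.to_datetime(s, format='%Y-%m-%d %H:%M:%S')
print(parsed.dt.year.tolist())  # [2019, 2020]
```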
# Check missing values
ccf.isnull().sum() # No missing values
# Examine categorical variables
print(ccf.describe(include='object'))
# Examine numeric columns
print(ccf.describe())
# Examine the relationship between the amount of a transaction and the likelihood of its being fraudulent.
# Look at the distribution of the values of the amt column.
ccf['amt'].describe() # The outlier value of 28948.9 may obscure the column's histogram.
# Create a histogram of the amt column without the outliers.
# I will use the standard Tukey fence for upper outliers: a value more than 1.5 * IQR above the 75th percentile.
iqr = ccf['amt'].quantile(.75) - ccf['amt'].quantile(.25)
threshold = ccf['amt'].quantile(.75) + 1.5 * iqr
# Histogram without outliers
ccf[ccf['amt'] < threshold]['amt'].hist()
plt.show()
# Histogram of the outlier values
ccf[ccf['amt'] >= threshold]['amt'].hist()
plt.show()
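The upper fence computed above can be sanity-checked on a toy series (the numbers below are made up for illustration):

```python
import pandas as pd

# Toy amounts to illustrate the Tukey upper fence: Q3 + 1.5 * IQR.
toy_amt = pd.Series([10, 20, 30, 40, 1000])
q1, q3 = toy_amt.quantile(.25), toy_amt.quantile(.75)
upper_fence = q3 + 1.5 * (q3 - q1)
print(upper_fence)                              # 70.0
print(toy_amt[toy_amt > upper_fence].tolist())  # [1000]
```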
# Cut the amt column into bins.
bins = [0,50,100,200,500,1000,5000,np.inf]
labels = ['0 to 50','50 to 100','100 to 200','200 to 500','500 to 1000','1000 to 5000','over 5000']
ccf['amt_category'] = pd.cut(ccf['amt'],bins=bins,labels=labels)
print(ccf[['amt','amt_category']].sample(100)) # Cutting worked as expected.
# Examine the relationship between the amt_category and the is_fraud columns
print(ccf.groupby('amt_category').agg(mean=('is_fraud','mean'),n_transactions=('is_fraud','count')))
# The share of fraudulent transactions varies a lot between the amount categories: from less than 0.1% for transactions between 50 and 100 to more than 20% for transactions between 500 and 1000 and between 1000 and 5000.
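Because is_fraud is a 0/1 indicator, its group mean is exactly the fraud rate. The same per-category rates can also be computed with `pd.crosstab(..., normalize='index')`; a sketch on toy data (the categories and values are made up):

```python
import pandas as pd

# Toy data mimicking amt_category / is_fraud. normalize='index' turns
# counts into row proportions, i.e. the fraud rate per category.
toy = pd.DataFrame({'amt_category': ['low', 'low', 'high', 'high'],
                    'is_fraud': [0, 0, 1, 0]})
rates = pd.crosstab(toy['amt_category'], toy['is_fraud'], normalize='index')
print(rates.loc['high', 1])  # 0.5 -> half of the 'high' transactions are fraud
print(rates.loc['low', 0])   # 1.0 -> none of the 'low' transactions are fraud
```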
# Visualize this pattern
sns.barplot(data=ccf,x='amt_category',y='is_fraud')
plt.xticks(rotation=90)
plt.title('Proportion of fraudulent transactions by amount of transaction')
plt.show() # The plot confirms a strong association between the amount of a transaction and whether it is fraudulent.
# Create new columns to be used in the EDA
# Cut the city_pop column into bins
bins = [0,50_000,100_000,200_000,500_000,1_000_000,np.inf] # city_pop holds raw counts, so the bin edges must be in raw counts, not thousands.
labels = ['0 to 50K','50K to 100K','100K to 200K','200K to 500K','500K to 1000K','over 1000K']
ccf['city_pop_category'] = pd.cut(ccf['city_pop'],bins=bins,labels=labels)
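A quick way to check that the edges and labels line up is to cut a few hand-picked populations (the values below are arbitrary):

```python
import numpy as np
import pandas as pd

# Toy populations in raw counts; labels abbreviate thousands as K.
pop = pd.Series([30_000, 150_000, 2_000_000])
bins = [0, 50_000, 100_000, 200_000, 500_000, 1_000_000, np.inf]
labels = ['0 to 50K', '50K to 100K', '100K to 200K',
          '200K to 500K', '500K to 1000K', 'over 1000K']
binned = pd.cut(pop, bins=bins, labels=labels)
print(binned.tolist())  # ['0 to 50K', '100K to 200K', 'over 1000K']
```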
# Create the month column - it stores the month when a transaction was made.
ccf['month'] = ccf['trans_date_trans_time'].dt.month
# Create the day of the week column - the day of the week when a transaction was made.
ccf['dow'] = ccf['trans_date_trans_time'].dt.dayofweek # 0 stands for Monday.
# Create the hour column - the hour when a transaction was made.
ccf['hour'] = ccf['trans_date_trans_time'].dt.hour
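The `.dt` accessors used above can be verified on a single toy timestamp (the date below is an arbitrary Monday, not from the dataset):

```python
import pandas as pd

# 2019-01-07 was a Monday, so dayofweek should be 0.
ts = pd.to_datetime(pd.Series(['2019-01-07 13:45:00']))
print(ts.dt.month.iloc[0])      # 1
print(ts.dt.dayofweek.iloc[0])  # 0 (Monday)
print(ts.dt.hour.iloc[0])       # 13
```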
# Extract the year from the dob column
ccf['dob_year'] = ccf['dob'].dt.year
print(ccf['dob_year'].describe()) # The oldest person in the data set was born in 1927, the youngest in 2001.
# Bin the dob_year column into generations: silent generation, baby boomers, generation X, millennials, and zoomers. Store the result in the generation column.
conditions =[(ccf['dob_year'] >= 1925)&(ccf['dob_year'] <=1945),
(ccf['dob_year'] >= 1946)&(ccf['dob_year'] <= 1964),
(ccf['dob_year'] >= 1965)&(ccf['dob_year'] <=1980),
(ccf['dob_year'] >= 1981)&(ccf['dob_year'] <=1996),
(ccf['dob_year'] >= 1997)&(ccf['dob_year'] <= 2012)]
labels = ['silent generation','baby boomers','generation x','millennials','zoomers']
ccf['generation'] = np.select(conditions,labels,default='unknown') # Years outside all the ranges fall through to 'unknown'.
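The condition-to-label mapping can be checked on a few toy birth years, including one outside all the ranges to exercise the default (the years below are made up):

```python
import numpy as np
import pandas as pd

# One toy year per generation, plus 1910, which matches no condition.
years = pd.Series([1930, 1950, 1970, 1990, 2000, 1910])
conditions = [(years >= 1925) & (years <= 1945),
              (years >= 1946) & (years <= 1964),
              (years >= 1965) & (years <= 1980),
              (years >= 1981) & (years <= 1996),
              (years >= 1997) & (years <= 2012)]
labels = ['silent generation', 'baby boomers', 'generation x',
          'millennials', 'zoomers']
gen = np.select(conditions, labels, default='unknown')
print(gen.tolist())
# ['silent generation', 'baby boomers', 'generation x', 'millennials', 'zoomers', 'unknown']
```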