
Credit Card Fraud

This dataset consists of credit card transactions in the western United States. It includes information about each transaction including customer details, the merchant and category of purchase, and whether or not the transaction was a fraud.

Source: Kaggle. The data was partially cleaned and adapted by DataCamp.

🔍 Scenario: Accurately Predict Instances of Credit Card Fraud

Background: A new credit card company has just entered the market in the western United States. The company is promoting itself as one of the safest credit cards to use. They have hired you as their data scientist in charge of identifying instances of fraud. The executive who hired you has provided you with data on credit card transactions, including whether or not each transaction was fraudulent.

Objective: The executive wants to know how accurately you can predict fraud using this data. She has stressed that the model should err on the side of caution: flagging a legitimate transaction as fraudulent, just to be safe, is not a big problem. In your report, you will need to describe how well your model functions and how it adheres to these criteria.

You will need to prepare a report that is accessible to a broad audience. It will need to outline your motivation, analysis steps, findings, and conclusions.

Predicting instances of credit card fraud in the Western United States

A credit card company earns trust by building a reputation for safety. The goal of this project is to create a model that predicts instances of credit card fraud. Once fine-tuned, this model can be used to flag instances of credit card fraud in real time, preventing losses and retaining customers. This report first goes through some preliminary data visualizations to showcase trends, and then shares a model that is well suited to flagging potentially fraudulent transactions. In developing the model, I keep in mind that it is far preferable to mistakenly flag a legitimate transaction as fraudulent than to miss a real fraudulent transaction.
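One way to make this preference measurable is to report recall (the share of real fraud that gets flagged) alongside precision (the share of flags that turn out to be fraud). The sketch below uses invented labels rather than the dataset, and the helper name `confusion_counts` is hypothetical; it illustrates the trade-off only and is not the model discussed later in this report.

```python
# Toy labels, invented for illustration only (1 = fraud, 0 = legitimate).
actual    = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]  # a cautious model: over-flags, misses nothing

def confusion_counts(actual, predicted):
    """Return (true positives, false positives, false negatives)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    return tp, fp, fn

tp, fp, fn = confusion_counts(actual, predicted)
recall = tp / (tp + fn)      # share of real fraud that was caught
precision = tp / (tp + fp)   # share of flags that were real fraud

print(f"recall={recall:.2f}, precision={precision:.2f}")
```

A model that errs on the side of caution accepts lower precision (more false alarms) in exchange for higher recall (fewer missed frauds).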

Data Dictionary

trans_date_trans_time: Transaction DateTime
merchant: Merchant Name
category: Category of Merchant
amt: Amount of Transaction
city: City of Credit Card Holder
state: State of Credit Card Holder
lat: Latitude Location of Purchase
long: Longitude Location of Purchase
city_pop: Credit Card Holder's City Population
job: Job of Credit Card Holder
dob: Date of Birth of Credit Card Holder
trans_num: Transaction Number
merch_lat: Latitude Location of Merchant
merch_long: Longitude Location of Merchant
is_fraud: Whether Transaction is Fraud (1) or Not (0)

Let's begin by getting a feel for the data, counting the instances of fraud in each transaction category.

import pandas as pd

# Assumes the transactions have already been loaded into a DataFrame named ccf
# Convert 'trans_date_trans_time' and 'dob' columns to datetime in ccf dataframe
ccf['trans_date_trans_time'] = pd.to_datetime(ccf['trans_date_trans_time'])
ccf['dob'] = pd.to_datetime(ccf['dob'])

# Group by 'category' and sum the 'is_fraud' column to get the count of frauds per category
fraud_counts = ccf.groupby('category')['is_fraud'].sum()

# Sort the counts in descending order and get the top 5 categories
top_5_fraud_categories = fraud_counts.sort_values(ascending=False).head(5)

top_5_fraud_categories
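Because `is_fraud` is a 0/1 flag, summing it within each group is the same as counting the fraudulent rows. A minimal sketch on a toy DataFrame makes this visible; the rows and category names below are invented stand-ins, not values taken from `ccf`.

```python
import pandas as pd

# Hypothetical toy rows standing in for ccf; all values are invented.
toy = pd.DataFrame({
    "category": ["grocery_pos", "grocery_pos", "shopping_net", "travel", "shopping_net"],
    "is_fraud": [1, 0, 1, 0, 1],
})

# Summing a 0/1 flag counts the rows where the flag is 1.
toy_fraud_counts = (
    toy.groupby("category")["is_fraud"].sum().sort_values(ascending=False)
)
print(toy_fraud_counts)
```

Categories with no fraud at all still appear in the result, with a count of 0.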

We also look at the total amount spent on fraudulent transactions in each category (the top 5 categories by fraudulent spend).

import matplotlib.pyplot as plt

# Keep only fraudulent transactions, then group by 'category' and sum the 'amt'
# column to get the total amount spent on fraud per category
amount_spent = ccf[ccf['is_fraud'] == 1].groupby('category')['amt'].sum()

# Sort the amounts in descending order and get the top 5 categories
top_amount_spent_categories = amount_spent.sort_values(ascending=False).head(5)

top_amount_spent_categories

Let's look at how many fraudulent transactions are processed, by state.

# Group by 'state' and sum the 'is_fraud' column to get the count of frauds per state
fraud_counts_by_state = ccf.groupby('state')['is_fraud'].sum()

# Sort the counts in descending order
fraud_counts_by_state = fraud_counts_by_state.sort_values(ascending=False)
fraud_counts_by_state

Fraudulent transactions as a proportion of total transactions

The raw counts above are of limited use because they do not take into account how many transactions in each category were not fraudulent. Below we address this by looking at the proportion of transactions in each category that are fraudulent.
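To see why raw counts can mislead, consider two made-up categories. The numbers below are invented purely for illustration and do not come from the dataset.

```python
# Invented numbers: (total transactions, fraudulent transactions) per category.
toy_counts = {
    "big_category": (10_000, 50),
    "small_category": (200, 20),
}

# Fraud rate as a percentage of each category's own transactions.
toy_rates = {
    name: fraud / total * 100 for name, (total, fraud) in toy_counts.items()
}

for name, (total, fraud) in toy_counts.items():
    print(f"{name}: {fraud} frauds, {toy_rates[name]:.1f}% of transactions")
```

Here the larger category has more fraud in absolute terms, but a transaction in the smaller category is far more likely to be fraudulent.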

The chart below ranks the transaction categories by the percentage of their transactions that are fraudulent.

# Calculate the total number of transactions per category
total_transactions_per_category = ccf.groupby('category')['trans_num'].count()

# Calculate the number of fraud transactions per category
fraud_transactions_per_category = ccf[ccf['is_fraud'] == 1].groupby('category')['trans_num'].count()

# Calculate the proportion of fraud transactions per category
fraud_proportion_per_category = (fraud_transactions_per_category / total_transactions_per_category * 100).sort_values(ascending=False)

# Create a bar plot
plt.figure(figsize=(10, 6))
fraud_proportion_per_category.plot(kind='bar', color='lightcoral')
plt.title('Percent fraud by transaction category')
plt.xlabel('Category')
plt.ylabel('Fraud transactions (% of category total)')
plt.xticks(rotation=45)
plt.show()
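The same percentages can also be obtained in one step, since the mean of a 0/1 flag is the fraction of rows flagged. The sketch below shows this on invented toy data (the categories `a`, `b`, `c` are placeholders). One difference from dividing two separate counts is that a category with no fraud comes out as 0 rather than NaN.

```python
import pandas as pd

# Hypothetical toy rows standing in for ccf; all values are invented.
toy = pd.DataFrame({
    "category": ["a", "a", "a", "a", "b", "b", "c", "c"],
    "is_fraud": [1, 0, 0, 0, 1, 1, 0, 0],
})

# Mean of the 0/1 flag per category, scaled to a percentage.
toy_fraud_pct = (
    toy.groupby("category")["is_fraud"].mean() * 100
).sort_values(ascending=False)
print(toy_fraud_pct)
```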

Both by raw count and by proportion of transactions, online shopping and grocery point-of-sale are the two biggest credit card fraud categories.

Below we explore amounts. We will create a histogram of transaction amounts (to get a feel for what the distribution reveals), and then create spend categories and look at fraud transactions per spend category.
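As a rough sketch of the spend-category idea, `pd.cut` can assign each amount to a band, after which fraud can be summarised per band. The band edges, labels, and toy rows below are all placeholders chosen for illustration; they are not the categories used in the analysis itself.

```python
import pandas as pd

# Hypothetical toy rows standing in for ccf; all values are invented.
toy = pd.DataFrame({
    "amt":      [4.50, 12.00, 48.75, 99.99, 310.00, 850.00, 1200.00, 7.25],
    "is_fraud": [0,    0,     0,     0,     1,      1,      1,       0],
})

# Placeholder spend bands; pd.cut treats each band as (lower, upper] by default.
bins = [0, 10, 100, 1000, float("inf")]
labels = ["under $10", "$10-$100", "$100-$1,000", "over $1,000"]
toy["spend_band"] = pd.cut(toy["amt"], bins=bins, labels=labels)

# Transactions, fraud count, and fraud rate per band.
band_summary = toy.groupby("spend_band", observed=False)["is_fraud"].agg(
    n_transactions="count", n_fraud="sum", fraud_rate="mean"
)
band_summary["fraud_pct"] = band_summary["fraud_rate"] * 100
print(band_summary)
```

Reporting the fraud rate next to the raw count per band keeps the comparison fair when some bands contain far more transactions than others.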