Credit Card Fraud
This dataset consists of credit card transactions in the western United States. It includes information about each transaction, including customer details, the merchant and category of purchase, and whether or not the transaction was fraudulent.
Source: Kaggle. The data was partially cleaned and adapted by DataCamp.
🔍 Scenario: Accurately Predict Instances of Credit Card Fraud
Background: A new credit card company has just entered the market in the western United States. The company is promoting itself as one of the safest credit cards to use. They have hired you as their data scientist in charge of identifying instances of fraud. The executive who hired you has provided you with data on credit card transactions, including whether or not each transaction was fraudulent.
Objective: The executive wants to know how accurately you can predict fraud using this data. She has stressed that the model should err on the side of caution: flagging a legitimate transaction as fraudulent, just to be safe, is not a big problem. In your report, you will need to describe how well your model functions and how it adheres to these criteria.
You will need to prepare a report that is accessible to a broad audience. It will need to outline your motivation, analysis steps, findings, and conclusions.
Predicting instances of credit card fraud in the Western United States
A credit card company earns trust by building a reputation for safety. The goal of this project is to create a model that predicts instances of credit card fraud. Once fine-tuned, this model can be used to flag suspected fraud in real time, preventing losses and retaining customers. This report first walks through some preliminary data visualizations to showcase trends, and then presents a model that is well suited to flagging potentially fraudulent transactions. Throughout the development of the model, I keep in mind that it is far preferable to mistakenly flag a legitimate transaction as fraudulent than to miss a real fraudulent transaction.
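To make the "err on the side of caution" criterion concrete, here is a minimal sketch using made-up labels and scores (they are not drawn from the dataset). It assumes a model that outputs a fraud score between 0 and 1 for each transaction; the helper name `recall_and_precision` and the two thresholds are illustrative choices, not part of the analysis that follows. Lowering the decision threshold flags more transactions, which raises recall (fewer missed frauds) at the cost of precision (more false alarms).

```python
def recall_and_precision(y_true, y_score, threshold):
    """Return (recall, precision) when scores >= threshold are flagged as fraud."""
    flagged = [score >= threshold for score in y_score]
    tp = sum(1 for y, f in zip(y_true, flagged) if y == 1 and f)      # frauds caught
    fn = sum(1 for y, f in zip(y_true, flagged) if y == 1 and not f)  # frauds missed
    fp = sum(1 for y, f in zip(y_true, flagged) if y == 0 and f)      # false alarms
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision

# Illustrative values only: 1 = fraud, 0 = legitimate
toy_labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
toy_scores = [0.05, 0.10, 0.20, 0.35, 0.15, 0.45, 0.40, 0.55, 0.90, 0.60]

strict = recall_and_precision(toy_labels, toy_scores, threshold=0.5)
cautious = recall_and_precision(toy_labels, toy_scores, threshold=0.3)

print("threshold 0.5 -> recall, precision:", strict)
print("threshold 0.3 -> recall, precision:", cautious)
```

On this toy data the lower threshold catches every fraud while flagging more legitimate transactions, which is the trade-off the executive has said is acceptable.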
Data Dictionary
Column | Description |
---|---|
trans_date_trans_time | Transaction DateTime |
merchant | Merchant Name |
category | Category of Merchant |
amt | Amount of Transaction |
city | City of Credit Card Holder |
state | State of Credit Card Holder |
lat | Latitude Location of Purchase |
long | Longitude Location of Purchase |
city_pop | Credit Card Holder's City Population |
job | Job of Credit Card Holder |
dob | Date of Birth of Credit Card Holder |
trans_num | Transaction Number |
merch_lat | Latitude Location of Merchant |
merch_long | Longitude Location of Merchant |
is_fraud | Whether Transaction is Fraud (1) or Not (0) |
Let's begin by getting a feel for the data, counting the instances of fraud in each transaction category.
import pandas as pd
# Assumes the transaction data has already been loaded into the ccf dataframe
# Convert 'trans_date_trans_time' and 'dob' columns to datetime in ccf dataframe
ccf['trans_date_trans_time'] = pd.to_datetime(ccf['trans_date_trans_time'])
ccf['dob'] = pd.to_datetime(ccf['dob'])
# Group by 'category' and sum the 'is_fraud' column to get the count of frauds per category
fraud_counts = ccf.groupby('category')['is_fraud'].sum()
# Sort the counts in descending order and get the top 5 categories
top_5_fraud_categories = fraud_counts.sort_values(ascending=False).head(5)
top_5_fraud_categories
We also look at the total amount spent on fraudulent transactions, showing the top 5 categories by fraudulent spend.
import matplotlib.pyplot as plt
# Keep only fraudulent transactions, then sum the 'amt' column to get the fraudulent amount spent per category
fraud_amount_spent = ccf[ccf['is_fraud'] == 1].groupby('category')['amt'].sum()
# Sort the amounts in descending order and get the top 5 categories
top_amount_spent_categories = fraud_amount_spent.sort_values(ascending=False).head(5)
top_amount_spent_categories
Let's look at how many fraudulent transactions are processed, by state.
# Group by 'state' and sum the 'is_fraud' column to get the count of frauds per state
fraud_counts_by_state = ccf.groupby('state')['is_fraud'].sum()
# Sort the counts in descending order
fraud_counts_by_state = fraud_counts_by_state.sort_values(ascending=False)
fraud_counts_by_state
Fraudulent transactions as a proportion of total transactions
The above section is limited in usefulness because it does not account for how many transactions in each category were not fraudulent. Below we address this by looking at the proportion of transactions in each category that are fraudulent.
The chart below ranks the transaction categories by the percentage of their transactions that are fraudulent.
# Calculate the total number of transactions per category
total_transactions_per_category = ccf.groupby('category')['trans_num'].count()
# Calculate the number of fraud transactions per category
fraud_transactions_per_category = ccf[ccf['is_fraud'] == 1].groupby('category')['trans_num'].count()
# Calculate the proportion of fraud transactions per category
fraud_proportion_per_category = (fraud_transactions_per_category / total_transactions_per_category * 100).sort_values(ascending=False)
# Create a bar plot
plt.figure(figsize=(10, 6))
fraud_proportion_per_category.plot(kind='bar', color='lightcoral')
plt.title('Percent fraud by transaction category')
plt.xlabel('Category')
plt.ylabel('Fraudulent transactions (%)')
plt.xticks(rotation=45)
plt.show()
Both according to raw counts and according to proportion of sales recorded, shopping on the web and grocery point-of-sale are the two biggest credit card fraud categories.
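As a cross-check on the proportions above, the same percentage can be computed more compactly: because `is_fraud` is coded 0/1, its mean within a group is the fraud share. The sketch below uses a small made-up dataframe with placeholder category names (`A`, `B`, `C`); the helper name `fraud_rate_by` is hypothetical.

```python
import pandas as pd

def fraud_rate_by(df, column):
    """Percent of transactions that are fraudulent within each group of `column`."""
    # is_fraud is 0/1, so the group mean equals the share of fraudulent rows
    return (df.groupby(column)['is_fraud'].mean() * 100).sort_values(ascending=False)

# Illustrative data only; category names are placeholders
toy = pd.DataFrame({
    'category': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C'],
    'is_fraud': [1, 0, 1, 0, 0, 0, 0, 0],
})

toy_rates = fraud_rate_by(toy, 'category')
print(toy_rates)
```

One difference from filtering to fraud rows before counting: a group with no fraud at all shows up here as 0 rather than as a missing value. The same helper could be pointed at other columns, for example `fraud_rate_by(ccf, 'state')`, to put the state counts on a comparable footing.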
Below we explore amounts. We will create a histogram of transaction amounts (to get a feel for what the distribution reveals), and then create spend categories and look at fraud transactions per spend category.
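As a rough illustration of what spend categories could look like, the sketch below bins made-up transaction amounts with `pd.cut` and summarizes fraud per bin. The bin edges, labels, and the `spend_category` column name are assumptions chosen for the example, not the values used in this report.

```python
import pandas as pd

# Illustrative data only
toy = pd.DataFrame({
    'amt': [4.50, 12.00, 48.75, 95.10, 310.00, 880.25, 1020.00, 23.40],
    'is_fraud': [0, 0, 0, 0, 1, 1, 1, 0],
})

# Hypothetical bin edges; pd.cut uses right-inclusive intervals by default
bin_edges = [0, 50, 250, 1000, float('inf')]
bin_labels = ['0-50', '50-250', '250-1000', '1000+']
toy['spend_category'] = pd.cut(toy['amt'], bins=bin_edges, labels=bin_labels)

# Fraud count, transaction count, and fraud percentage per spend category
fraud_by_spend = toy.groupby('spend_category', observed=False)['is_fraud'].agg(['sum', 'count'])
fraud_by_spend['fraud_pct'] = fraud_by_spend['sum'] / fraud_by_spend['count'] * 100
print(fraud_by_spend)
```

Reporting the count alongside the percentage helps avoid over-reading bins that contain only a handful of transactions.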