Credit Card Fraud
This dataset consists of credit card transactions in the western United States. It includes information about each transaction including customer details, the merchant and category of purchase, and whether or not the transaction was a fraud.
CONCLUSIONS OF REPORT: This is a summary of our findings. Please continue to the full report below to see visualizations and thought processes as we work through the data.
Out of 339607 transactions, 1782 (about 0.5%) were fraudulent, totalling $923,192.65 in fraudulent charges. The probability of randomly pulling a fraudulent transaction is about 0.00525.
The majority of fraud (76.94%) takes place in grocery stores, online shopping, in-person (non grocery) shopping, and gas/transportation. Online shopping has the highest count of high-dollar charges (750 to 1400), followed by in-person (non grocery) shopping.
Grocery stores account for about 25% of all fraud, scattered throughout the western states, and every grocery charge falls between 250 and 400 dollars. This does not seem random: why would there be no grocery charges ranging from 0 to 249 dollars?
A theory could be that a criminal fraud group is scattered throughout the States and distributing the fraud information online as well as instructions on how much they can get away with.
There are 332 unique merchants making up all of the fraudulent charges; certain merchants are associated with multiple fraud cases.
CA has the highest fraud rate and the highest population. We ran a regression analysis to find that state population and fraud rate have a moderate positive correlation. Fraud rates typically go up with an increasing population, but that is not the only factor in high fraud rates for each state.
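The strength of a relationship like the one between state population and fraud rate can be summarized with Pearson's r. The following is a minimal sketch in plain Python, not the regression used in this report: `pearson_r` is a hypothetical helper, and the population and fraud-rate values are made-up placeholders rather than figures from the dataset.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Placeholder values for illustration only (NOT taken from the dataset).
state_population_millions = [1.0, 2.0, 3.0, 4.0, 5.0]
fraud_rate_percent = [0.50, 0.40, 0.60, 0.45, 0.60]

r = pearson_r(state_population_millions, fraud_rate_percent)
print(f"Pearson r = {r:.2f}")
```

A value of r well below 1 is consistent with the point above: population explains part of the variation in fraud rate, but not all of it.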
Certain states seem to have high fraud rates relative to their population rank. Diving deeper into these transactions, it was discovered that one credit card is often used for multiple cases of fraud. We found this information by looking at transactions with duplicate DOB, job, and city parameters.
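Because the dataset has no card-number column, a combination of DOB, job, and city can serve as a stand-in for one card. The sketch below shows one way to group on that combination with pandas; the rows are toy values for illustration only, and `toy_fraud` and `per_card` are hypothetical names.

```python
import pandas as pd

# Toy rows for illustration only; the real input would be the fraud-only subset.
toy_fraud = pd.DataFrame({
    "dob":  ["1965-03-02", "1965-03-02", "1965-03-02", "1988-11-20"],
    "job":  ["Engineer", "Engineer", "Engineer", "Teacher"],
    "city": ["Phoenix", "Phoenix", "Phoenix", "Portland"],
    "amt":  [310.50, 905.10, 22.75, 840.00],
})

# Treat each distinct (dob, job, city) combination as one card (an assumption).
per_card = (
    toy_fraud.groupby(["dob", "job", "city"])
    .agg(fraud_count=("amt", "size"), total_amt=("amt", "sum"))
    .reset_index()
)

print(per_card)
print("Unique cards (proxy):", len(per_card))
print("Average fraud transactions per card:", per_card["fraud_count"].mean())
```

Two different holders who share a birthday, job, and city would be merged by this proxy, so the card count is an approximation.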
There are 182 unique credit cards behind the 1782 fraudulent transactions. Each stolen card was used for about 9.8 fraudulent transactions on average, at roughly $518 per swipe.
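The summary figures follow directly from the counts reported in this summary. A short arithmetic check, using only numbers stated in this report (the variable names are illustrative):

```python
# Counts and totals as reported in this summary.
total_transactions = 339607
fraud_transactions = 1782
fraud_total_usd = 923192.65
unique_cards = 182

fraud_rate = fraud_transactions / total_transactions
swipes_per_card = fraud_transactions / unique_cards
avg_fraud_amt = fraud_total_usd / fraud_transactions

print(f"Fraud rate: {fraud_rate:.5f}")
print(f"Fraudulent swipes per card: {swipes_per_card:.1f}")
print(f"Average fraudulent charge: ${avg_fraud_amt:.2f}")
```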
This helps explain why some states have higher fraud transactions with a smaller population. The states in question had a high average number of swipes per stolen card and/or a high number of unique cards.
The time of day is a strong indicator of fraudulent transactions. The overwhelming majority of fraud cases (86%) take place between 10pm and 4am.
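A late-night window that wraps past midnight needs two conditions joined with OR. The sketch below assumes `trans_date_trans_time` parses as a standard timestamp string; the four rows are toy values, not rows from the dataset.

```python
import pandas as pd

# Toy timestamps for illustration only (not rows from the dataset).
toy = pd.DataFrame({
    "trans_date_trans_time": [
        "2019-01-01 22:15:00",
        "2019-01-01 23:40:00",
        "2019-01-02 03:05:00",
        "2019-01-02 14:30:00",
    ]
})

hour = pd.to_datetime(toy["trans_date_trans_time"]).dt.hour

# 10pm-4am spans midnight, so combine "22 or later" with "before 4".
toy["late_night"] = (hour >= 22) | (hour < 4)

late_share = toy["late_night"].mean()
print(f"Share of transactions between 10pm and 4am: {late_share:.0%}")
```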
It was determined that high age of CC holders does not directly correlate with a high number of fraudulent transactions. Although the largest group of fraudulent purchases were made from CC holders with a birthday ranging from 1960-1970, specific birth years with high counts of fraudulent transactions could be based on outside influences such as the timing of data leaks (when and how criminals collect stolen CC information), or multiple repeated transactions on stolen credit cards.
RECOMMENDATIONS:
In order to combat the criminal fraud group distributing the stolen CC information and reduce the overall risk for the company, it is important to catch fraudulent transactions early to prevent continuous fraudulent use of compromised cards and reduce the number of average fraudulent transactions per stolen CC. We can set parameters to flag suspect transactions.
Possible red flags indicating fraud, based on these findings, would be CC purchases in grocery stores (grocery_pos) from 250-400 dollars, charges for in-store shopping (shopping_pos) of 750 dollars or more, or charges for online shopping (misc_net/shopping_net) of 750 dollars or more. These categories make up most of the fraud.
If repeat purchases for one unique card match the descriptions above, with purchase/merchant coordinates that do not align with the CC holder's city or residential area (see heat map of USA below), the transaction should be flagged and investigated.
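One way to test whether coordinates "align" is a great-circle distance with a cutoff. This is a sketch under assumptions: it treats `lat`/`long` as the holder's area and `merch_lat`/`merch_long` as the merchant, and the 100 km cutoff and the `is_far_from_home` helper are hypothetical choices, not values derived from this analysis.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))

def is_far_from_home(lat, long, merch_lat, merch_long, max_km=100.0):
    """True when the merchant is farther than max_km (assumed cutoff) from the holder."""
    return haversine_km(lat, long, merch_lat, merch_long) > max_km

# One degree of longitude at the equator is a little over 111 km.
print(is_far_from_home(0.0, 0.0, 0.0, 1.0))
print(is_far_from_home(0.0, 0.0, 0.0, 0.5))
```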
Transactions taking place late at night that adhere to any of these parameters should especially be flagged. 86% of fraud cases took place between 10pm and 4am, and 51% of total cases took place from 10pm to 12am.
To err on the side of caution, we can add stricter parameters around cards where the CC holder's birth year is 1950-1990, which accounts for 71.2% of all known fraud cases. Stricter parameters may include a lower required number of "red flag" transactions before getting flagged. Also, certain merchants had up to 18 fraudulent transactions and can be categorized as "high-risk". Transactions showing warning signs can be flagged when associated with a high-risk merchant.
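The red flags described in these recommendations could be combined into a simple rule counter. This is a sketch, not a production rule set: the category names and dollar ranges come from this report, while `count_red_flags`, `should_flag`, and the default `required_flags=2` are hypothetical.

```python
from datetime import datetime

ONLINE_CATEGORIES = {"misc_net", "shopping_net"}

def count_red_flags(category, amt, timestamp):
    """Count how many of the report's warning signs a transaction shows."""
    flags = 0
    if category == "grocery_pos" and 250 <= amt <= 400:
        flags += 1
    if category == "shopping_pos" and amt >= 750:
        flags += 1
    if category in ONLINE_CATEGORIES and amt >= 750:
        flags += 1
    # Late-night window, 10pm-4am, wraps past midnight.
    if timestamp.hour >= 22 or timestamp.hour < 4:
        flags += 1
    return flags

def should_flag(category, amt, timestamp, required_flags=2):
    """Flag once enough warning signs accumulate (threshold is an assumption)."""
    return count_red_flags(category, amt, timestamp) >= required_flags

print(should_flag("grocery_pos", 300.0, datetime(2020, 1, 1, 23, 0)))
print(should_flag("grocery_pos", 50.0, datetime(2020, 1, 1, 12, 0)))
```

Lowering `required_flags` for higher-risk groups, such as particular birth-year ranges or high-risk merchants, is one way to express the "stricter parameters" idea.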
Model:
Based on a random forest test, the model containing all of the original data is close to perfect at identifying non-fraud transactions, but less effective at identifying fraud. This model only correctly identified 55% of all fraud cases.
A logistic regression model correctly identified 78% of fraud, but was much less effective at identifying non-fraud transactions, resulting in 12 thousand more false positives. This is better for discovering fraud, and errs on the side of caution, but more false positives are annoying to clients and overall bad for business.
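The share of fraud a model "correctly identified" appears to correspond to recall, and the false-positive trade-off shows up in precision. The sketch below computes both from predictions in plain Python; the labels are toy values, not output from the models in this report, and the helper names are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels where 1 means fraud."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 0 and predicted == 1:
            fp += 1
        elif actual == 0 and predicted == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def recall(tp, fn):
    """Share of actual fraud cases that the model caught."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def precision(tp, fp):
    """Share of flagged transactions that were actually fraud."""
    return tp / (tp + fp) if (tp + fp) else 0.0

# Toy labels for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print(f"recall={recall(tp, fn):.2f}, precision={precision(tp, fp):.2f}, false positives={fp}")
```

A model tuned to err on the side of caution raises recall at the cost of more false positives, which lowers precision.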
We want to adjust the parameters to be more effective at identifying fraud, and more effective at identifying non-fraud transactions.
We can change the model parameters to ones parallel with the recommendations above.
To be Continued...
Data Dictionary
trans_date_trans_time | Transaction DateTime |
---|---|
merchant | Merchant Name |
category | Category of Merchant |
amt | Amount of Transaction |
city | City of Credit Card Holder |
state | State of Credit Card Holder |
lat | Latitude Location of Purchase |
long | Longitude Location of Purchase |
city_pop | Credit Card Holder's City Population |
job | Job of Credit Card Holder |
dob | Date of Birth of Credit Card Holder |
trans_num | Transaction Number |
merch_lat | Latitude Location of Merchant |
merch_long | Longitude Location of Merchant |
is_fraud | Whether Transaction is Fraud (1) or Not (0) |
Source of dataset. The data was partially cleaned and adapted by DataCamp.
Don't know where to start?
Challenges are brief tasks designed to help you practice specific skills:
- Explore: What types of purchases are most likely to be instances of fraud? Consider both product category and the amount of the transaction.
- Visualize: Use a geospatial plot to visualize the fraud rates across different states.
- Analyze: Are older customers significantly more likely to be victims of credit card fraud?
Scenarios are broader questions to help you develop an end-to-end project for your portfolio:
A new credit card company has just entered the market in the western United States. The company is promoting itself as one of the safest credit cards to use. They have hired you as their data scientist in charge of identifying instances of fraud. The executive who hired you has provided you with data on credit card transactions, including whether or not each transaction was fraudulent.
The executive wants to know how accurately you can predict fraud using this data. She has stressed that the model should err on the side of caution: it is not a big problem to flag transactions as fraudulent when they aren't just to be safe. In your report, you will need to describe how well your model functions and how it adheres to these criteria.
You will need to prepare a report that is accessible to a broad audience. It will need to outline your motivation, analysis steps, findings, and conclusions.
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.formula.api import ols
import plotly.express as px
import plotly.graph_objs as go
import numpy as np
df = pd.read_csv('credit_card_fraud.csv')
# Count missing values per column (0 everywhere means no nulls)
print('Checking for null values:')
print(df.isna().sum())
# Finding the Proportion of fraud cases in the dataset
print('Total transactions: ' + str( df['trans_num'].count()) )
print('Total Fraudulent transactions: ' + str( (df['is_fraud']==1).sum()) )
print('Proportion of fraud: ' +
str( (df['is_fraud']==1).sum() / df['trans_num'].count()))
Total transactions: 339607
Total Fraudulent transactions: 1782
The probability of pulling a fraudulent transaction is about 0.00525.
Let us now look at the fraud counts for each category of purchase:
#filter out the non needed columns for analysis:
df2 = df[['trans_num','merchant', 'category', 'amt', 'city', 'state',
'lat', 'long', 'job', 'dob', 'is_fraud','city_pop', 'merch_long', 'merch_lat', 'trans_date_trans_time']]
# filter rows for fraud cases only; .copy() avoids a SettingWithCopyWarning when columns are added later
fraud_cases = df2[df2['is_fraud'] == 1].copy()
#checking fraud counts by category
category_fraud = fraud_cases.groupby('category').size().sort_values(ascending=False)
print("Fraud cases by category:")
print(category_fraud)
# Plotting
plt.figure(figsize=(12, 6)) # Adjust the size of the figure as needed
plt.bar(category_fraud.index, category_fraud.values, color ='purple')
plt.xlabel('Category')
plt.ylabel('Number of Fraud Cases')
plt.title('Fraud Cases by Category')
plt.xticks(rotation=45) # Rotates the category names for better visibility
plt.show()
The majority of fraud (76.94%) takes place in the top 5 categories: grocery stores, online shopping (misc_net and shopping_net), in-person shopping, and gas/transportation.
The category with the most fraud cases is grocery_pos (24.3%), which may indicate that grocery stores are the easiest targets for fraud.
Let us now look at the dollar amount of these fraudulent transactions:
# checking min/max dollar amount of fraud transactions
print('Min fraud amount: $' + str(fraud_cases['amt'].min()))
print('Max fraud amount: $' + str(fraud_cases['amt'].max()))
# Sum of fraudulent transaction amounts
total_amt = fraud_cases['amt'].sum()
print("Total amount in fraud transactions: $", total_amt)
print( ' ' )
# cut amt into bins
bins = [0, 100, 250, 500, 750, 1000, 1500]
fraud_cases['dollar_amt'] = pd.cut(fraud_cases['amt'], bins)
# checking fraud counts by bins (observed=False keeps empty bins and avoids a pandas FutureWarning)
amt_fraud = fraud_cases.groupby('dollar_amt', observed=False).size()
print(amt_fraud)
# Plotting
#amt_fraud.plot(kind='bar', rot=0,
# title='Fraud Cases by Dollar Amount Bins',
# xlabel='Dollar Amount Bins',
# ylabel='Number of Fraud Cases' )
plt.figure(figsize=(8,4))
plt.bar([str(interval) for interval in amt_fraud.index], amt_fraud.values, color='green') # Convert bin ranges to strings for labels
plt.title('Fraud cases by Dollar Amount Bins')
plt.xlabel('Dollar Amount Bins')
plt.ylabel('Number of Fraud Cases')
plt.show()
# Group by category and binned amount, then count (empty combinations are filtered out below)
category_amount_fraud_count = fraud_cases.groupby(['category', 'dollar_amt'], observed=False).size().sort_values(ascending=False)
category_amount_fraud_count = category_amount_fraud_count[category_amount_fraud_count > 0]
print(category_amount_fraud_count)
Min fraud amount: 1.78.
Max fraud amount: 1371.81.
Total amount in fraud transactions: $ 923192.65
Every single instance of in-person grocery store fraud, the most common category of fraud, is a charge between 250 and 500 dollars.
The most prevalent range of dollars taken by a fraudulent transaction is 750 to 1000.
Online shopping has the most counts of high dollar charges (750 to 1500), followed by in-person (non grocery) shopping.
#check grocery store frauds by state
grocery_frauds = fraud_cases[fraud_cases['category'] == 'grocery_pos']
state_grocery_fraud = grocery_frauds.groupby('state').size()
grocery_price = grocery_frauds['amt']
print('grocery_pos fraud counts by state:')
print(state_grocery_fraud)
print(grocery_price.sort_values(ascending = False))
All grocery charges are between 250 and 400 dollars. Why would there be no grocery charges ranging from 100 to 200 dollars?
A theory could be that a criminal group is scattered throughout the States and exchanging the fraud information and instructions online.
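A per-category minimum and maximum makes a narrow charge range like this easy to spot. The sketch below uses toy rows for illustration only; applied to the real data, the same `groupby` call would run on the fraud-only subset.

```python
import pandas as pd

# Toy rows for illustration only (not rows from the dataset).
toy_fraud = pd.DataFrame({
    "category": ["grocery_pos", "grocery_pos", "shopping_net", "shopping_net", "misc_net"],
    "amt": [281.06, 334.10, 12.40, 998.75, 780.00],
})

# Minimum, maximum, and count of fraudulent charges within each category.
amt_range = toy_fraud.groupby("category")["amt"].agg(["min", "max", "count"])
print(amt_range)
```

A category whose minimum and maximum sit close together, as grocery_pos does here, stands out against categories that span the full range of amounts.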
Let's look at the fraud counts by state and view the populations of the states: