Credit Card Fraud
This dataset consists of credit card transactions in the western United States. It includes information about each transaction including customer details, the merchant and category of purchase, and whether or not the transaction was a fraud.
Note: You can access the data via the File menu or in the Context Panel at the top right of the screen next to Report, under Files. The data dictionary and filenames can be found at the bottom of this workbook.
We've added some guiding questions for analyzing this exciting dataset! Feel free to make this workbook yours by adding and removing cells, or editing any of the existing cells.
Explore this dataset
Here are some ideas to get your started with your analysis...
- ๐บ๏ธ Explore: What types of purchases are most likely to be instances of fraud? Consider both product category and the amount of the transaction.
- ๐ Visualize: Use a geospatial plot to visualize the fraud rates across different states.
- ๐ Analyze: Are older customers significantly more likely to be victims of credit card fraud?
๐ Scenario: Accurately Predict Instances of Credit Card Fraud
This scenario helps you develop an end-to-end project for your portfolio.
Background: A new credit card company has just entered the market in the western United States. The company is promoting itself as one of the safest credit cards to use. They have hired you as their data scientist in charge of identifying instances of fraud. The executive who hired you has have provided you with data on credit card transactions, including whether or not each transaction was fraudulent.
Objective: The executive wants to know how accurately you can predict fraud using this data. She has stressed that the model should err on the side of caution: it is not a big problem to flag transactions as fraudulent when they aren't just to be safe. In your report, you will need to describe how well your model functions and how it adheres to these criteria.
You will need to prepare a report that is accessible to a broad audience. It will need to outline your motivation, analysis steps, findings, and conclusions.
SELECT * FROM 'credit_card_fraud.csv'
LIMIT 5
Context: The rise of fraud in digital payments.
**Goal: **Identify transaction patterns and customer segments most vulnerable to fraud using SQL.
Outcome: Insights + base groundwork for predictive modeling.
/*Q1: What percentage of total transactions are fraudulent? Only 0.52% of all transactions are fraudelent which is quite small*/
WITH FraudData AS (
SELECT
is_fraud
FROM 'credit_card_fraud.csv'
)
SELECT
COUNT(*) AS total_transactions,
SUM(CASE WHEN is_fraud = 1 THEN 1 ELSE 0 END) AS fraud_transactions,
ROUND(100.0 * SUM(CASE WHEN is_fraud = 1 THEN 1 ELSE 0 END) / COUNT(*), 2) AS fraud_rate_pct
FROM FraudData;
/*Which merchant categories have the highest fraud rate? These are the top 10 fraud rates, with shopping_net and grocery_pos being the top percent rate.*/
SELECT
category,
COUNT(*) AS total,
SUM(CASE WHEN is_fraud = 1 THEN 1 ELSE 0 END) AS fraud_count,
ROUND(100.0 * SUM(CASE WHEN is_fraud = 1 THEN 1 ELSE 0 END) / COUNT(*), 2) AS fraud_rate_pct
FROM 'credit_card_fraud.csv'
GROUP BY category
ORDER BY fraud_rate_pct DESC
LIMIT 10;
/*Does the transaction amount correlate with fraud?I binned the transaction amounts as follows: It looks like most fraud occurs when the value is greater than $500 and then next by a significant difference is the $100-$500 portion. */
SELECT
CASE
WHEN amt < 20 THEN '< $20'
WHEN amt BETWEEN 20 AND 100 THEN '$20โ$100'
WHEN amt BETWEEN 100 AND 500 THEN '$100โ$500'
ELSE '> $500'
END AS amount_range,
COUNT(*) AS total,
SUM(CASE WHEN is_fraud = 1 THEN 1 ELSE 0 END) AS fraud_count,
ROUND(100.0 * SUM(CASE WHEN is_fraud = 1 THEN 1 ELSE 0 END) / COUNT(*), 2) AS fraud_rate_pct
FROM 'credit_card_fraud.csv'
GROUP BY amount_range
ORDER BY fraud_rate_pct DESC;
- Visuals & Insights (Optional with Python or Tableau) Fraud rates per state (geo heatmap).
Bar chart: Fraud rate by merchant category.
Line graph: Age vs. fraud risk.
- Conclusion / Executive Summary Example: โFraud was most prevalent in electronics purchases above $500 and among customers over age 50. The highest fraud rates occurred in CA and NV.โ
Add business recommendation: "Recommend additional fraud filters for high-ticket purchases and older user segments."
/*Fraud by City. */
SELECT
city,
COUNT(*) AS total_transactions,
SUM(CASE WHEN is_fraud = 1 THEN 1 ELSE 0 END) AS fraud_transactions,
ROUND(100.0 * SUM(CASE WHEN is_fraud = 1 THEN 1 ELSE 0 END) / COUNT(*), 2) AS fraud_rate_pct
FROM 'credit_card_fraud.csv'
GROUP BY city
ORDER BY fraud_transactions DESC
LIMIT 10;
/*Fraud by Age*/
/*Initialize Age as CTE for use in next SQL code*/
WITH AgeData AS (
SELECT
*,
DATE_PART('year', AGE(CURRENT_DATE, CAST(dob AS DATE))) AS age
FROM 'credit_card_fraud.csv'
)
SELECT
CASE
WHEN age < 30 THEN 'Under 30'
WHEN age BETWEEN 30 AND 50 THEN '30โ50'
ELSE 'Over 50'
END AS age_group,
COUNT(*) AS total,
SUM(CASE WHEN is_fraud = 1 THEN 1 ELSE 0 END) AS fraud_count,
ROUND(100.0 * SUM(CASE WHEN is_fraud = 1 THEN 1 ELSE 0 END) / COUNT(*), 2) AS fraud_rate_pct
FROM AgeData
GROUP BY age_group
ORDER BY age_group;
/*fraud by state*/
SELECT
state,
COUNT(*) AS total_transactions,
SUM(CASE WHEN is_fraud = 1 THEN 1 ELSE 0 END) AS fraud_count,
ROUND(100.0 * SUM(CASE WHEN is_fraud = 1 THEN 1 ELSE 0 END) / COUNT(*), 2) AS fraud_rate_pct
FROM 'credit_card_fraud.csv'
GROUP BY state
ORDER BY fraud_rate_pct DESC
LIMIT 10;
/*fraud by amount range*/
import pandas as pd
ccf = pd.read_csv('credit_card_fraud.csv')
ccf.head(100)
โ
โ