Skip to content

Credit Card Fraud

This dataset consists of credit card transactions in the western United States. It includes information about each transaction including customer details, the merchant and category of purchase, and whether or not the transaction was a fraud.

Note: You can access the data via the File menu or in the Context Panel at the top right of the screen next to Report, under Files. The data dictionary and filenames can be found at the bottom of this workbook.

Source: Kaggle

We've added some guiding questions for analyzing this exciting dataset! Feel free to make this workbook yours by adding and removing cells, or editing any of the existing cells.

Explore this dataset

Here are some ideas to get your started with your analysis...

  1. ๐Ÿ—บ๏ธ Explore: What types of purchases are most likely to be instances of fraud? Consider both product category and the amount of the transaction.
  2. ๐Ÿ“Š Visualize: Use a geospatial plot to visualize the fraud rates across different states.
  3. ๐Ÿ”Ž Analyze: Are older customers significantly more likely to be victims of credit card fraud?

๐Ÿ” Scenario: Accurately Predict Instances of Credit Card Fraud

This scenario helps you develop an end-to-end project for your portfolio.

Background: A new credit card company has just entered the market in the western United States. The company is promoting itself as one of the safest credit cards to use. They have hired you as their data scientist in charge of identifying instances of fraud. The executive who hired you has have provided you with data on credit card transactions, including whether or not each transaction was fraudulent.

Objective: The executive wants to know how accurately you can predict fraud using this data. She has stressed that the model should err on the side of caution: it is not a big problem to flag transactions as fraudulent when they aren't just to be safe. In your report, you will need to describe how well your model functions and how it adheres to these criteria.

You will need to prepare a report that is accessible to a broad audience. It will need to outline your motivation, analysis steps, findings, and conclusions.

Spinner
DataFrameas
df
variable
SELECT * FROM 'credit_card_fraud.csv'
LIMIT 5

Context: The rise of fraud in digital payments.

**Goal: **Identify transaction patterns and customer segments most vulnerable to fraud using SQL.

Outcome: Insights + base groundwork for predictive modeling.

Spinner
DataFrameas
df8
variable
/*Q1: What percentage of total transactions are fraudulent? Only 0.52% of all transactions are fraudelent which is quite small*/

WITH FraudData AS (
  SELECT 
    is_fraud
  FROM 'credit_card_fraud.csv'
)
SELECT 
  COUNT(*) AS total_transactions,
  SUM(CASE WHEN is_fraud = 1 THEN 1 ELSE 0 END) AS fraud_transactions,
  ROUND(100.0 * SUM(CASE WHEN is_fraud = 1 THEN 1 ELSE 0 END) / COUNT(*), 2) AS fraud_rate_pct
FROM FraudData;
Spinner
DataFrameas
df6
variable
/*Which merchant categories have the highest fraud rate? These are the top 10 fraud rates, with shopping_net and grocery_pos being the top percent rate.*/
SELECT 
  category,
  COUNT(*) AS total,
  SUM(CASE WHEN is_fraud = 1 THEN 1 ELSE 0 END) AS fraud_count,
  ROUND(100.0 * SUM(CASE WHEN is_fraud = 1 THEN 1 ELSE 0 END) / COUNT(*), 2) AS fraud_rate_pct
FROM 'credit_card_fraud.csv'
GROUP BY category
ORDER BY fraud_rate_pct DESC
LIMIT 10;
Spinner
DataFrameas
df5
variable
/*Does the transaction amount correlate with fraud?I binned the transaction amounts as follows: It looks like most fraud occurs when the value is greater than $500 and then next by a significant difference is the $100-$500 portion. */

SELECT
  CASE 
    WHEN amt < 20 THEN '< $20'
    WHEN amt BETWEEN 20 AND 100 THEN '$20โ€“$100'
    WHEN amt BETWEEN 100 AND 500 THEN '$100โ€“$500'
    ELSE '> $500'
  END AS amount_range,
  COUNT(*) AS total,
  SUM(CASE WHEN is_fraud = 1 THEN 1 ELSE 0 END) AS fraud_count,
  ROUND(100.0 * SUM(CASE WHEN is_fraud = 1 THEN 1 ELSE 0 END) / COUNT(*), 2) AS fraud_rate_pct
FROM 'credit_card_fraud.csv'
GROUP BY amount_range
ORDER BY fraud_rate_pct DESC;
  1. Visuals & Insights (Optional with Python or Tableau) Fraud rates per state (geo heatmap).

Bar chart: Fraud rate by merchant category.

Line graph: Age vs. fraud risk.

  1. Conclusion / Executive Summary Example: โ€œFraud was most prevalent in electronics purchases above $500 and among customers over age 50. The highest fraud rates occurred in CA and NV.โ€

Add business recommendation: "Recommend additional fraud filters for high-ticket purchases and older user segments."

Spinner
DataFrameas
df1
variable
/*Fraud by City. */
SELECT 
  city,
  COUNT(*) AS total_transactions,
  SUM(CASE WHEN is_fraud = 1 THEN 1 ELSE 0 END) AS fraud_transactions,
  ROUND(100.0 * SUM(CASE WHEN is_fraud = 1 THEN 1 ELSE 0 END) / COUNT(*), 2) AS fraud_rate_pct
FROM 'credit_card_fraud.csv'
GROUP BY city
ORDER BY fraud_transactions DESC
LIMIT 10;
Spinner
DataFrameas
df2
variable
/*Fraud by Age*/
/*Initialize Age as CTE for use in next SQL code*/
WITH AgeData AS (
  SELECT 
    *,
    DATE_PART('year', AGE(CURRENT_DATE, CAST(dob AS DATE))) AS age
  FROM 'credit_card_fraud.csv'
)

SELECT 
  CASE 
    WHEN age < 30 THEN 'Under 30'
    WHEN age BETWEEN 30 AND 50 THEN '30โ€“50'
    ELSE 'Over 50'
  END AS age_group,
  COUNT(*) AS total,
  SUM(CASE WHEN is_fraud = 1 THEN 1 ELSE 0 END) AS fraud_count,
  ROUND(100.0 * SUM(CASE WHEN is_fraud = 1 THEN 1 ELSE 0 END) / COUNT(*), 2) AS fraud_rate_pct
FROM AgeData
GROUP BY age_group
ORDER BY age_group;
Spinner
DataFrameas
df3
variable
/*fraud by state*/
SELECT 
  state,
  COUNT(*) AS total_transactions,
  SUM(CASE WHEN is_fraud = 1 THEN 1 ELSE 0 END) AS fraud_count,
  ROUND(100.0 * SUM(CASE WHEN is_fraud = 1 THEN 1 ELSE 0 END) / COUNT(*), 2) AS fraud_rate_pct
FROM 'credit_card_fraud.csv'
GROUP BY state
ORDER BY fraud_rate_pct DESC
LIMIT 10;
Spinner
DataFrameas
df4
variable
Run cancelled
/*fraud by amount range*/
Run cancelled
import pandas as pd 
ccf = pd.read_csv('credit_card_fraud.csv') 
ccf.head(100)
โ€Œ
โ€Œ
โ€Œ