Credit Card Fraud

This dataset consists of credit card transactions in the western United States. It includes customer details, the merchant and category of each purchase, and whether or not the transaction was fraudulent.

Note: You can access the data via the File menu or in the Context Panel at the top right of the screen next to Report, under Files. The data dictionary and filenames can be found at the bottom of this workbook.

Source: Kaggle. The data was partially cleaned and adapted by DataCamp.

We've added some guiding questions for analyzing this exciting dataset! Feel free to make this workbook yours by adding and removing cells, or editing any of the existing cells.

Explore this dataset

Here are some ideas to get you started with your analysis...

  1. πŸ—ΊοΈ Explore: What types of purchases are most likely to be instances of fraud? Consider both product category and the amount of the transaction.
  2. πŸ“Š Visualize: Use a geospatial plot to visualize the fraud rates across different states.
  3. πŸ”Ž Analyze: Are older customers significantly more likely to be victims of credit card fraud?

πŸ” Scenario: Accurately Predict Instances of Credit Card Fraud

This scenario helps you develop an end-to-end project for your portfolio.

Background: A new credit card company has just entered the market in the western United States. The company is promoting itself as one of the safest credit cards to use. They have hired you as their data scientist in charge of identifying instances of fraud. The executive who hired you has provided you with data on credit card transactions, including whether or not each transaction was fraudulent.

Objective: The executive wants to know how accurately you can predict fraud using this data. She has stressed that the model should err on the side of caution: flagging a legitimate transaction as fraudulent is far less costly than missing a real fraud. In your report, you will need to describe how well your model performs and how it meets these criteria.

You will need to prepare a report that is accessible to a broad audience. It will need to outline your motivation, analysis steps, findings, and conclusions.
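The "err on the side of caution" brief maps naturally onto a recall-oriented classifier. As a rough sketch of that idea, the block below uses a small synthetic frame in place of credit_card_fraud.csv (the link between amount and fraud is made up for illustration) and fits a logistic regression with class_weight="balanced", which penalizes missed frauds more heavily than false alarms:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, recall_score

# Synthetic stand-in for the real data: fraud probability is assumed
# to rise with transaction amount (a hypothetical pattern).
rng = np.random.default_rng(42)
n = 5000
amt = rng.exponential(scale=100, size=n)
is_fraud = (rng.random(n) < np.clip(amt / 1000, 0.01, 0.9)).astype(int)

X = amt.reshape(-1, 1)
y = is_fraud
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# class_weight="balanced" shifts the operating point toward catching
# more frauds, matching the executive's preference.
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred, digits=3))
print("fraud recall:", round(recall_score(y_test, y_pred), 3))
```

On the real data you would swap the synthetic frame for the CSV and add more features; the key design choice to report is the recall/precision trade-off, not just overall accuracy.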

You can query the pre-loaded CSV file using SQL directly. Here are some sample queries, followed by sample Python code and outputs:

(The query results below are stored as a DataFrame named df.)
-- ================================
-- πŸ“Œ 1. Overall Fraud Rate
-- ================================
SELECT 
    COUNT(*) AS total_transactions,
    SUM(is_fraud) AS total_frauds,
    ROUND(SUM(is_fraud) * 100.0 / COUNT(*), 2) AS fraud_rate_percent
FROM 'credit_card_fraud.csv';

-- ================================
-- πŸ“Œ 2. Fraud Loss (Total $ Impact)
-- ================================
SELECT 
    SUM(CASE WHEN is_fraud = 1 THEN amt ELSE 0 END) AS total_loss_from_fraud,
    ROUND(SUM(CASE WHEN is_fraud = 1 THEN amt ELSE 0 END) * 100.0 / SUM(amt), 2) AS percent_loss
FROM 'credit_card_fraud.csv';

-- ================================
-- πŸ“Œ 3. Fraud by Merchant Category
-- ================================
SELECT 
    category,
    COUNT(*) AS total_transactions,
    SUM(is_fraud) AS fraud_transactions,
    ROUND(SUM(is_fraud) * 100.0 / COUNT(*), 2) AS fraud_rate_percent
FROM 'credit_card_fraud.csv'
GROUP BY category
ORDER BY fraud_rate_percent DESC
LIMIT 10;

-- ================================
-- πŸ“Œ 4. Fraud by Transaction Amount Bucket
-- ================================
SELECT 
    CASE 
        WHEN amt < 50 THEN 'Under $50'
        WHEN amt BETWEEN 50 AND 200 THEN '$50 - $200'
        WHEN amt BETWEEN 200 AND 1000 THEN '$200 - $1000'
        ELSE 'Over $1000'
    END AS amount_bucket,
    COUNT(*) AS total_transactions,
    SUM(is_fraud) AS fraud_transactions,
    ROUND(SUM(is_fraud) * 100.0 / COUNT(*), 2) AS fraud_rate_percent
FROM 'credit_card_fraud.csv'
GROUP BY amount_bucket
ORDER BY fraud_rate_percent DESC;

-- ================================
-- πŸ“Œ 5. Fraud by State
-- ================================
SELECT 
    state,
    COUNT(*) AS total_transactions,
    SUM(is_fraud) AS fraud_transactions,
    ROUND(SUM(is_fraud) * 100.0 / COUNT(*), 2) AS fraud_rate_percent
FROM 'credit_card_fraud.csv'
GROUP BY state
HAVING COUNT(*) > 500   -- only include states with enough data
ORDER BY fraud_rate_percent DESC;

-- ================================
-- πŸ“Œ 6. Fraud by Age Group
-- ================================
SELECT 
    CASE 
        WHEN date_diff('year', CAST(dob AS DATE), DATE '2025-01-01') < 25 THEN '18-24'
        WHEN date_diff('year', CAST(dob AS DATE), DATE '2025-01-01') BETWEEN 25 AND 34 THEN '25-34'
        WHEN date_diff('year', CAST(dob AS DATE), DATE '2025-01-01') BETWEEN 35 AND 44 THEN '35-44'
        WHEN date_diff('year', CAST(dob AS DATE), DATE '2025-01-01') BETWEEN 45 AND 54 THEN '45-54'
        WHEN date_diff('year', CAST(dob AS DATE), DATE '2025-01-01') BETWEEN 55 AND 64 THEN '55-64'
        ELSE '65+'
    END AS age_group,
    COUNT(*) AS total_transactions,
    SUM(is_fraud) AS fraud_transactions,
    ROUND(SUM(is_fraud) * 100.0 / COUNT(*), 2) AS fraud_rate_percent
FROM 'credit_card_fraud.csv'
GROUP BY age_group
ORDER BY fraud_rate_percent DESC;
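The overall-rate and fraud-loss aggregates above (queries 1 and 2) can be cross-checked in pandas. A minimal sketch, using a tiny hand-made frame (sample_df, with invented values) instead of the real CSV so the numbers are easy to verify by eye:

```python
import pandas as pd

# Tiny synthetic frame standing in for credit_card_fraud.csv;
# column names follow the data dictionary, values are made up.
sample_df = pd.DataFrame({
    "amt": [12.5, 250.0, 48.0, 1200.0, 75.0],
    "is_fraud": [0, 1, 0, 1, 0],
})

# Query 1: overall fraud rate
total_transactions = len(sample_df)
total_frauds = sample_df["is_fraud"].sum()
fraud_rate_percent = round(total_frauds * 100.0 / total_transactions, 2)

# Query 2: total dollar impact of fraud
loss_from_fraud = sample_df.loc[sample_df["is_fraud"] == 1, "amt"].sum()
percent_loss = round(loss_from_fraud * 100.0 / sample_df["amt"].sum(), 2)

print(total_transactions, total_frauds, fraud_rate_percent)
print(loss_from_fraud, percent_loss)
```

Running the same logic against the full dataset should reproduce the SQL output, which is a quick sanity check before any modeling.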
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load dataset only once if not already loaded
if 'df' not in locals():
    df = pd.read_csv("credit_card_fraud.csv")

# Set a consistent style
sns.set_theme(style="whitegrid", palette="pastel", font_scale=1.1)

# Fraud rate by category
fraud_by_category = df.groupby("category")["is_fraud"].mean().sort_values(ascending=False)

plt.figure(figsize=(10, 5))
sns.barplot(x=fraud_by_category.index, y=fraud_by_category.values)
plt.xticks(rotation=45, ha='right')
plt.ylabel("Fraud Rate")
plt.xlabel("Merchant Category")
plt.title("Fraud Rate by Merchant Category")
plt.tight_layout()
plt.show()

# Fraud by transaction amount distribution
plt.figure(figsize=(8, 5))
sns.boxplot(x="is_fraud", y="amt", data=df, showfliers=False)
plt.ylim(0, 2000)  # cap outliers
plt.xlabel("Is Fraud")
plt.ylabel("Transaction Amount ($)")
plt.title("Transaction Amounts: Fraud vs Non-Fraud")
plt.tight_layout()
plt.show()

# Fraud rate by state (bar plot)
fraud_by_state = (
    df.groupby("state")["is_fraud"].mean()
    .reset_index()
    .sort_values("is_fraud", ascending=False)
)

plt.figure(figsize=(12, 6))
sns.barplot(x="state", y="is_fraud", data=fraud_by_state)
plt.xlabel("State")
plt.ylabel("Fraud Rate")
plt.title("Fraud Rate by State")
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()
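For guiding question 3 (are older customers significantly more likely to be fraud victims?), the age-group rates can be compared formally. Below is a sketch of a two-proportion z-test on hypothetical counts; in practice the four numbers would come from the age-group query or from df:

```python
import math

# Hypothetical aggregates standing in for the real ones:
# (transactions, frauds) for customers 65+ vs. under 65.
older_n, older_fraud = 20_000, 140
younger_n, younger_fraud = 80_000, 400

p1 = older_fraud / older_n        # fraud rate, older group
p2 = younger_fraud / younger_n    # fraud rate, younger group
p_pool = (older_fraud + younger_fraud) / (older_n + younger_n)

# Standard error under the null hypothesis of equal rates
se = math.sqrt(p_pool * (1 - p_pool) * (1 / older_n + 1 / younger_n))
z = (p1 - p2) / se
print(f"older rate={p1:.4f}, younger rate={p2:.4f}, z={z:.2f}")
```

A z-score above roughly 1.96 would indicate a difference significant at the 5% level; with the real counts, this turns "older customers seem riskier" into a defensible claim for the report.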
(The query results below are stored as a DataFrame named df1.)
SELECT 
    COUNT(*) AS total_transactions,
    SUM(is_fraud) AS total_frauds,
    ROUND(SUM(is_fraud) * 100.0 / COUNT(*), 2) AS fraud_rate_percent,

    SUM(CASE WHEN is_fraud = 1 THEN amt ELSE 0 END) AS total_loss_from_fraud,
    ROUND(SUM(CASE WHEN is_fraud = 1 THEN amt ELSE 0 END) * 100.0 / SUM(amt), 2) AS percent_loss,

    (SELECT category 
     FROM 'credit_card_fraud.csv'
     GROUP BY category
     ORDER BY SUM(is_fraud) * 100.0 / COUNT(*) DESC
     LIMIT 1) AS riskiest_category,

    (SELECT CASE 
                WHEN amt < 50 THEN 'Under $50'
                WHEN amt BETWEEN 50 AND 200 THEN '$50 - $200'
                WHEN amt BETWEEN 200 AND 1000 THEN '$200 - $1000'
                ELSE 'Over $1000'
            END
     FROM 'credit_card_fraud.csv'
     GROUP BY CASE 
                WHEN amt < 50 THEN 'Under $50'
                WHEN amt BETWEEN 50 AND 200 THEN '$50 - $200'
                WHEN amt BETWEEN 200 AND 1000 THEN '$200 - $1000'
                ELSE 'Over $1000'
              END
     ORDER BY SUM(is_fraud) * 100.0 / COUNT(*) DESC
     LIMIT 1) AS riskiest_amount_bucket,

    (SELECT state 
     FROM 'credit_card_fraud.csv'
     GROUP BY state
     HAVING COUNT(*) > 500
     ORDER BY SUM(is_fraud) * 100.0 / COUNT(*) DESC
     LIMIT 1) AS riskiest_state,

    (SELECT CASE 
                WHEN (CAST('2025-01-01' AS DATE) - CAST(dob AS DATE)) / 365 < 25 THEN '18-24'
                WHEN (CAST('2025-01-01' AS DATE) - CAST(dob AS DATE)) / 365 BETWEEN 25 AND 34 THEN '25-34'
                WHEN (CAST('2025-01-01' AS DATE) - CAST(dob AS DATE)) / 365 BETWEEN 35 AND 44 THEN '35-44'
                WHEN (CAST('2025-01-01' AS DATE) - CAST(dob AS DATE)) / 365 BETWEEN 45 AND 54 THEN '45-54'
                WHEN (CAST('2025-01-01' AS DATE) - CAST(dob AS DATE)) / 365 BETWEEN 55 AND 64 THEN '55-64'
                ELSE '65+'
            END
     FROM 'credit_card_fraud.csv'
     GROUP BY CASE 
                WHEN (CAST('2025-01-01' AS DATE) - CAST(dob AS DATE)) / 365 < 25 THEN '18-24'
                WHEN (CAST('2025-01-01' AS DATE) - CAST(dob AS DATE)) / 365 BETWEEN 25 AND 34 THEN '25-34'
                WHEN (CAST('2025-01-01' AS DATE) - CAST(dob AS DATE)) / 365 BETWEEN 35 AND 44 THEN '35-44'
                WHEN (CAST('2025-01-01' AS DATE) - CAST(dob AS DATE)) / 365 BETWEEN 45 AND 54 THEN '45-54'
                WHEN (CAST('2025-01-01' AS DATE) - CAST(dob AS DATE)) / 365 BETWEEN 55 AND 64 THEN '55-64'
                ELSE '65+'
              END
     ORDER BY SUM(is_fraud) * 100.0 / COUNT(*) DESC
     LIMIT 1) AS riskiest_age_group
FROM 'credit_card_fraud.csv';
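Because the executive prefers false alarms over missed fraud, the eventual model's decision threshold matters as much as its headline accuracy. A toy illustration with made-up scores and labels, showing how lowering the threshold trades precision for recall:

```python
import numpy as np

# Hypothetical model scores (higher = more fraud-like) and true labels
# for six transactions; values are illustrative only.
scores = np.array([0.95, 0.80, 0.60, 0.40, 0.20, 0.05])
labels = np.array([1, 1, 0, 1, 0, 0])

def recall_precision(threshold):
    """Flag everything at or above the threshold and score the result."""
    flagged = scores >= threshold
    tp = np.sum(flagged & (labels == 1))
    recall = tp / labels.sum()
    precision = tp / flagged.sum() if flagged.any() else 0.0
    return float(recall), float(precision)

# A strict threshold misses the 0.40-score fraud ...
print(recall_precision(0.5))   # recall 2/3, precision 2/3
# ... while a looser one catches it, at the cost of a false positive.
print(recall_precision(0.3))   # recall 1.0, precision 3/4
```

In the report, a small table of recall and precision across thresholds makes the "err on the side of caution" criterion concrete for a non-technical audience.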

Data Dictionary

trans_date_trans_time: Transaction date and time
merchant: Merchant name
category: Category of merchant
amt: Amount of transaction
city: City of credit card holder
state: State of credit card holder
lat: Latitude of purchase location
long: Longitude of purchase location
city_pop: Population of credit card holder's city
job: Job of credit card holder
dob: Date of birth of credit card holder
trans_num: Transaction number
merch_lat: Latitude of merchant location
merch_long: Longitude of merchant location
is_fraud: Whether transaction is fraud (1) or not (0)