Credit Card Fraud

This dataset consists of credit card transactions in the western United States. It includes information about each transaction including customer details, the merchant and category of purchase, and whether or not the transaction was a fraud.

Note: You can access the data via the File menu or in the Context Panel at the top right of the screen next to Report, under Files. The data dictionary and filenames can be found at the bottom of this workbook.

Source: Kaggle. The data was partially cleaned and adapted by DataCamp.

We've added some guiding questions for analyzing this exciting dataset! Feel free to make this workbook yours by adding and removing cells, or editing any of the existing cells.

Explore this dataset

Here are some ideas to get you started with your analysis...

🗺️ Explore: What types of purchases are most likely to be instances of fraud? Consider both product category and the amount of the transaction.
📊 Visualize: Use a geospatial plot to visualize the fraud rates across different states.
🔎 Analyze: Are older customers significantly more likely to be victims of credit card fraud?

🔍 Scenario: Accurately Predict Instances of Credit Card Fraud

This scenario helps you develop an end-to-end project for your portfolio.

Background: A new credit card company has just entered the market in the western United States. The company is promoting itself as one of the safest credit cards to use. They have hired you as their data scientist in charge of identifying instances of fraud. The executive who hired you has provided you with data on credit card transactions, including whether or not each transaction was fraudulent.

Objective: The executive wants to know how accurately you can predict fraud using this data. She has stressed that the model should err on the side of caution: it is not a big problem if some legitimate transactions are flagged as fraudulent, just to be safe. In your report, you will need to describe how well your model performs and how it meets this criterion.

You will need to prepare a report that is accessible to a broad audience. It will need to outline your motivation, analysis steps, findings, and conclusions.
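One way to make "err on the side of caution" concrete is to treat it as a recall target: pick the strictest score threshold that still catches the required share of known fraud, then report the precision that results. The sketch below is illustrative only. The helper names `threshold_for_recall` and `precision_recall`, the recall target, and the toy labels and scores are assumptions, not part of the dataset or of any particular model.

```python
def threshold_for_recall(y_true, scores, target_recall=0.9):
    """Return the highest threshold whose recall meets target_recall.

    y_true: list of 0/1 labels (1 = fraud).
    scores: list of model scores, higher = more likely fraud.
    """
    total_fraud = sum(y_true)
    if total_fraud == 0:
        raise ValueError("no fraud cases in y_true")
    # Try candidate thresholds from strictest to most lenient.
    for threshold in sorted(set(scores), reverse=True):
        caught = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
        if caught / total_fraud >= target_recall:
            return threshold
    return min(scores)  # fallback: flag everything


def precision_recall(y_true, scores, threshold):
    """Precision and recall when every score >= threshold is flagged."""
    flagged = [y for y, s in zip(y_true, scores) if s >= threshold]
    true_positives = sum(flagged)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / sum(y_true)
    return precision, recall


# Toy, hand-made example (hypothetical scores, not real model output).
y_true = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]
scores = [0.05, 0.10, 0.35, 0.40, 0.80, 0.20, 0.15, 0.90, 0.30, 0.02]

threshold = threshold_for_recall(y_true, scores, target_recall=1.0)
precision, recall = precision_recall(y_true, scores, threshold)
print(threshold, precision, recall)
```

Lowering the threshold trades precision for recall, so a report for this scenario would typically state both numbers rather than accuracy alone.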

You can query the pre-loaded CSV file using SQL directly. Here’s a sample query, followed by some sample Python code and outputs:

SELECT * FROM 'credit_card_fraud.csv'
LIMIT 5

SELECT
    category,
    COUNT(*) AS total_transactions,
    SUM(is_fraud) AS fraud_count,
    ROUND(fraud_count * 100.0 / total_transactions, 2) AS fraud_percentage
FROM
    'credit_card_fraud.csv'
GROUP BY
    category
ORDER BY
    fraud_percentage DESC

import pandas as pd 
ccf = pd.read_csv('credit_card_fraud.csv') 
ccf.head(100)

SELECT
    category,
    COUNT(*) AS total_transaction,
    SUM(is_fraud) AS fraud_count,
    ROUND(fraud_count * 100.0 / total_transaction, 2) AS fraud_percentage
FROM
    'credit_card_fraud.csv'
GROUP BY
    category
ORDER BY
    fraud_percentage DESC

list(ccf.columns)


fraud_analysis = (
    ccf.groupby('category')
    .agg(
        total_transactions=('is_fraud', 'count'),
        fraud_count=('is_fraud', 'sum')
    )
    .assign(
        fraud_percentage=lambda x: round(x['fraud_count'] / x['total_transactions'] * 100, 2)
    )
    .sort_values('fraud_percentage', ascending=False)
    .reset_index()
)

# Display results
print(fraud_analysis.head(10))
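The Explore question also asks about the amount of the transaction. A similar breakdown can be sketched by binning `amt` into bands and computing the fraud rate per band. The example below runs on a tiny made-up frame that only borrows the column names `amt` and `is_fraud` from the data dictionary; the band edges and labels are arbitrary assumptions and should be adjusted to the real distribution.

```python
import pandas as pd

# Tiny made-up sample using the dataset's column names (not real data).
sample = pd.DataFrame({
    "amt": [5.0, 12.5, 48.0, 75.0, 120.0, 310.0, 890.0, 1050.0],
    "is_fraud": [0, 0, 0, 0, 0, 1, 1, 1],
})

# Band edges are an assumption; intervals are right-inclusive, e.g. (0, 50].
bins = [0, 50, 250, 1000, float("inf")]
labels = ["<=50", "50-250", "250-1000", ">1000"]
sample["amt_band"] = pd.cut(sample["amt"], bins=bins, labels=labels)

fraud_by_amount = (
    sample.groupby("amt_band", observed=False)
    .agg(
        total_transactions=("is_fraud", "count"),
        fraud_count=("is_fraud", "sum"),
    )
    .assign(fraud_rate=lambda x: x["fraud_count"] / x["total_transactions"])
    .reset_index()
)
print(fraud_by_amount)
```

To apply this to the workbook data, replace `sample` with the `ccf` frame loaded earlier.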

import plotly.express as px

# Calculate fraud rate by state
state_fraud = ccf.groupby('state').agg(
    total_trans=('is_fraud', 'count'),
    fraud=('is_fraud', 'sum')
).assign(fraud_rate=lambda x: x['fraud']/x['total_trans']).reset_index()

# Plot
fig = px.choropleth(
    state_fraud,
    locations='state',
    locationmode='USA-states',
    color='fraud_rate',
    scope='usa',
    color_continuous_scale='reds',
    hover_data=['total_trans'],
    title='Fraud Rate by State (Darker = Higher Fraud)'
)
fig.show()

from datetime import datetime

import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import ttest_ind

# Convert dob to datetime and calculate age
ccf['dob'] = pd.to_datetime(ccf['dob'])
ccf['age'] = (datetime.now() - ccf['dob']).dt.days // 365

# Preview age distribution
print(ccf['age'].describe())

# Prepare data
young = ccf[ccf['age'] < 40]
older = ccf[ccf['age'] >= 40]

# Calculate fraud rates
young_rate = young['is_fraud'].mean()
older_rate = older['is_fraud'].mean()

# Perform t-test
t_stat, p_value = ttest_ind(young['is_fraud'], older['is_fraud'], equal_var=False)

# Create visualization
plt.figure(figsize=(10, 6))

# Bar plot
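ax = sns.barplot(
    x=['Young (<40)', 'Older (≥40)'],
    y=[young_rate, older_rate],
    palette=['#1f77b4', '#ff7f0e']
)
ax.set_ylabel('Fraud rate')
ax.set_title(f'Fraud Rate by Age Group (Welch t-test p = {p_value:.4f})')
plt.show()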
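Because `is_fraud` is a 0/1 flag, comparing the two age groups amounts to comparing two proportions. As a cross-check on the Welch t-test, a pooled two-proportion z-test can be written with the standard library alone. This is a minimal sketch: the helper name `two_proportion_ztest` and the counts in the usage example are hypothetical, not results from the dataset.

```python
import math


def two_proportion_ztest(successes_a, n_a, successes_b, n_b):
    """Two-sided pooled z-test for a difference between two proportions."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Pooled proportion under the null hypothesis of equal rates.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: (fraud cases, transactions) for each age group.
z, p = two_proportion_ztest(50, 10_000, 80, 10_000)
print(f"z = {z:.3f}, p = {p:.4f}")
```

With the workbook data, the inputs would be `young['is_fraud'].sum()`, `len(young)`, `older['is_fraud'].sum()`, and `len(older)`.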

Data Dictionary

trans_date_trans_time	Transaction DateTime
merchant	Merchant Name
category	Category of Merchant
amt	Amount of Transaction
city	City of Credit Card Holder
state	State of Credit Card Holder
lat	Latitude Location of Purchase
long	Longitude Location of Purchase
city_pop	Credit Card Holder's City Population
job	Job of Credit Card Holder
dob	Date of Birth of Credit Card Holder
trans_num	Transaction Number
merch_lat	Latitude Location of Merchant
merch_long	Longitude Location of Merchant
is_fraud	Whether Transaction is Fraud (1) or Not (0)
