
Credit Card Fraud

This dataset consists of credit card transactions in the western United States. It includes information about each transaction including customer details, the merchant and category of purchase, and whether or not the transaction was a fraud.

Note: You can access the data via the File menu or in the Context Panel at the top right of the screen next to Report, under Files. The data dictionary and filenames can be found at the bottom of this workbook.

Source: Kaggle. The data was partially cleaned and adapted by DataCamp.

We've added some guiding questions for analyzing this exciting dataset! Feel free to make this workbook yours by adding and removing cells, or editing any of the existing cells.

Explore this dataset

Here are some ideas to get you started with your analysis...

  1. 🗺️ Explore: What types of purchases are most likely to be instances of fraud? Consider both product category and the amount of the transaction.
  2. 📊 Visualize: Use a geospatial plot to visualize the fraud rates across different states.
  3. 🔎 Analyze: Are older customers significantly more likely to be victims of credit card fraud?

🔍 Scenario: Accurately Predict Instances of Credit Card Fraud

This scenario helps you develop an end-to-end project for your portfolio.

Background: A new credit card company has just entered the market in the western United States. The company is promoting itself as one of the safest credit cards to use. They have hired you as their data scientist in charge of identifying instances of fraud. The executive who hired you has provided you with data on credit card transactions, including whether or not each transaction was fraudulent.

Objective: The executive wants to know how accurately you can predict fraud using this data. She has stressed that the model should err on the side of caution: flagging a legitimate transaction as fraudulent, just to be safe, is not a big problem. In your report, you will need to describe how well your model performs and how it meets these criteria.

You will need to prepare a report that is accessible to a broad audience. It will need to outline your motivation, analysis steps, findings, and conclusions.
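
Before modeling, it helps to pin down what erring on the side of caution means in metric terms: recall (the share of actual fraud caught) should be high, while a modest false-positive rate is acceptable. A minimal sketch with hypothetical placeholder arrays, just to show the two quantities:

# Hedged sketch: the evaluation criterion in metric form; y_true / y_pred are hypothetical placeholders.
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 0])   # hypothetical ground-truth labels
y_pred = np.array([0, 1, 1, 1, 0, 0, 0, 1])   # hypothetical model flags

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("Recall (fraud caught):", recall_score(y_true, y_pred))
print("False positive rate (legitimate flagged):", fp / (fp + tn))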

You can query the pre-loaded CSV file using SQL directly. Here are some sample queries, followed by sample Python code and outputs:

SELECT * FROM 'credit_card_fraud.csv'
LIMIT 5;

SELECT MAX(merch_lat), MAX(merch_long), ANY_VALUE(is_fraud) as is_fraud
FROM 'credit_card_fraud.csv'
WHERE is_fraud = 1;

SELECT MIN(amt), MAX(amt), ANY_VALUE(is_fraud) as is_fraud
FROM 'credit_card_fraud.csv'
WHERE is_fraud = 1;

SELECT COUNT(DISTINCT merchant)
FROM 'credit_card_fraud.csv';

SELECT 
    merchant
FROM 'credit_card_fraud.csv'
GROUP BY merchant
HAVING SUM(is_fraud) = 0;

SELECT 
    COUNT(DISTINCT merchant) AS unique_merchants_nonfraud
FROM 'credit_card_fraud.csv'
WHERE is_fraud = 0;

SELECT 
    COUNT(*) AS merchants_never_fraud
FROM (
    SELECT merchant
    FROM 'credit_card_fraud.csv'
    GROUP BY merchant
    HAVING SUM(is_fraud) = 0
) AS t;

SELECT 
    COUNT(*) AS total_transactions_never_fraud
FROM 'credit_card_fraud.csv'
WHERE merchant IN (
    SELECT merchant
    FROM 'credit_card_fraud.csv'
    GROUP BY merchant
    HAVING SUM(is_fraud) = 0
);

SELECT AVG(amt) AS avg_fraud_amount
FROM 'credit_card_fraud.csv'
WHERE is_fraud = 1;
import pandas as pd 
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import cross_validate, GridSearchCV
from sklearn.model_selection import cross_val_score
from imblearn.over_sampling import SMOTE
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score
from datetime import datetime
from sqlalchemy import create_engine
import geopandas as gpd
from shapely.geometry import Point

ccf = pd.read_csv('credit_card_fraud.csv') 
ccf.head(100)

print(ccf.value_counts('is_fraud'))
print(ccf.describe())
print(ccf.info())
print(ccf.columns)
print(ccf['dob'].info())
print(ccf['dob'].describe())
print(ccf['trans_date_trans_time'].min())
print(ccf['trans_date_trans_time'].max())

# Turn dob column into a datetime feature
ccf['dob'] = pd.to_datetime(ccf['dob'])

# Calculate the age of each person
today = pd.Timestamp.today()
ccf['age'] = ccf['dob'].apply(lambda dob: today.year - dob.year - ((today.month, today.day)< (dob.month, dob.day)))

print(ccf['age'].describe())

# Create bins for age to analyze where fraud happens most relative to the number of transactions

bins = [0, 20, 30, 40, 50, 60, 200]
labels = ['<20', '20-29', '30-39', '40-49', '50-59', '60+']

ccf['age_bin'] = pd.cut(ccf['age'], bins=bins, labels=labels, right=False)

print(ccf['age_bin'].value_counts().sort_index())


# To count how many transactions are from merchant 'Kiehn-Emmerich':
print((ccf['merchant'] == 'Kiehn-Emmerich').value_counts())
print("Number of transactions from Kiehn-Emmerich:", (ccf['merchant'] == 'Kiehn-Emmerich').sum())

# Filter Fraud columns to analyze
ccf_fraud = ccf[ccf['is_fraud'] == 1]
print(ccf_fraud)
print(ccf_fraud.value_counts('merchant'))
print(ccf_fraud.value_counts('age'))
print(ccf_fraud['age_bin'].value_counts().sort_index())

# Percentage of fraud versus total transactions
fraud_by_age = (
    ccf.groupby('age_bin')['is_fraud']
       .agg(['count', 'sum'])
       .rename(columns={'count': 'total', 'sum': 'fraud_count'})
)

fraud_by_age['fraud_rate'] = fraud_by_age['fraud_count'] / fraud_by_age['total']
print(fraud_by_age)

fraud_by_age['fraud_rate_pct'] = (fraud_by_age['fraud_rate'] * 100).round(2)
print(fraud_by_age)
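
The fraud-rate table above describes the sample, but guiding question 3 asks whether older customers are significantly more likely to be fraud victims. A minimal sketch of a chi-square test of independence, assuming scipy is available in this environment; the 60-and-over cutoff is an illustrative choice rather than part of the original analysis:

# Hedged sketch: chi-square test of independence between an age-60+ flag and is_fraud.
# Assumes ccf and the 'age' column created above; the cutoff is illustrative.
from scipy.stats import chi2_contingency

older = ccf['age'] >= 60
contingency = pd.crosstab(older, ccf['is_fraud'])

chi2, p_value, dof, expected = chi2_contingency(contingency)
print(contingency)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p-value = {p_value:.4g}")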


plt.figure(figsize=(10,6))
fraud_by_age['fraud_rate_pct'].plot(kind='bar', color='tomato')

overall_fraud_rate_pct = ccf['is_fraud'].mean() * 100

plt.axhline(
    overall_fraud_rate_pct,
    color='blue',
    linestyle='--',
    linewidth=2,
    label=f'Overall Fraud Rate ({overall_fraud_rate_pct:.2f}%)'
)

plt.legend()

plt.title("Fraud Rate by Age Bin")
plt.ylabel("Fraud Rate (%)")
plt.xlabel("Age Bin")
plt.xticks(rotation=45)
plt.grid(axis='y', linestyle='--', alpha=0.6)

plt.show()

# Plot the actual counts of fraud to coincide with the % chart
plt.figure(figsize=(10,6))
fraud_by_age['fraud_count'].plot(kind='bar', color='steelblue')

plt.title("Number of Fraud Claims by Age Bin")
plt.ylabel("Fraud Claims")
plt.xlabel("Age Bin")
plt.xticks(rotation=45)
plt.grid(axis='y', linestyle='--', alpha=0.6)

plt.show()

# Both plots together for the Age and those who had fraud
fig, axes = plt.subplots(1, 2, figsize=(16,6))

# Fraud rate (%)
fraud_by_age['fraud_rate_pct'].plot(kind='bar', color='tomato', ax=axes[0])
axes[0].set_title("Fraud Rate by Age Bin")
axes[0].set_ylabel("Fraud Rate (%)")
axes[0].set_xlabel("Age Bin")
axes[0].tick_params(axis='x', rotation=45)
axes[0].grid(axis='y', linestyle='--', alpha=0.6)

# Fraud count
fraud_by_age['fraud_count'].plot(kind='bar', color='steelblue', ax=axes[1])
axes[1].set_title("Number of Fraud Claims by Age Bin")
axes[1].set_ylabel("Fraud Claims")
axes[1].set_xlabel("Age Bin")
axes[1].tick_params(axis='x', rotation=45)
axes[1].grid(axis='y', linestyle='--', alpha=0.6)

plt.tight_layout()
plt.show()

fig, ax1 = plt.subplots(figsize=(12,6))

# --- BAR CHART (Fraud Count) ---
ax1.bar(
    fraud_by_age.index,
    fraud_by_age['fraud_count'],
    color='steelblue',
    alpha=0.7,
    label='Fraud Count'
)

ax1.set_xlabel("Age Bin")
ax1.set_ylabel("Fraud Count", color='steelblue')
ax1.tick_params(axis='y', labelcolor='steelblue')
ax1.tick_params(axis='x', rotation=45)

# --- SECOND AXIS (Fraud Rate %) ---
ax2 = ax1.twinx()

ax2.plot(
    fraud_by_age.index,
    fraud_by_age['fraud_rate_pct'],
    color='tomato',
    marker='o',
    linewidth=2,
    label='Fraud Rate (%)'
)

ax2.set_ylabel("Fraud Rate (%)", color='tomato')
ax2.tick_params(axis='y', labelcolor='tomato')

# --- OPTIONAL: Add overall average fraud rate line ---
overall_fraud_rate_pct = ccf['is_fraud'].mean() * 100
ax2.axhline(
    overall_fraud_rate_pct,
    color='gray',
    linestyle='--',
    linewidth=1.5,
    label=f'Avg Fraud Rate ({overall_fraud_rate_pct:.2f}%)'
)

# --- TITLE & GRID ---
plt.title("Fraud Rate (%) and Fraud Count by Age Bin")
ax1.grid(axis='y', linestyle='--', alpha=0.5)

# --- LEGENDS ---
lines_1, labels_1 = ax1.get_legend_handles_labels()
lines_2, labels_2 = ax2.get_legend_handles_labels()
plt.legend(lines_1 + lines_2, labels_1 + labels_2, loc='upper left')

plt.tight_layout()
plt.show()

# Fraud rate by merchant
fraud_by_merchant = (
    ccf.groupby('merchant')['is_fraud']
       .agg(['count', 'sum'])
       .rename(columns={'count': 'total', 'sum': 'fraud_count'})
)

fraud_by_merchant['fraud_rate'] = fraud_by_merchant['fraud_count'] / fraud_by_merchant['total']
fraud_by_merchant['fraud_rate_pct'] = (fraud_by_merchant['fraud_rate'] * 100).round(2)

print(fraud_by_merchant.head())
print(fraud_by_merchant.sort_values('fraud_rate_pct', ascending=False).head(20))

# Filter the merchants to reduce noise and remove those with statistically small fraud counts
filtered_merchants = fraud_by_merchant[
    (fraud_by_merchant['fraud_count'] >= 10)
]

print(filtered_merchants)

plt.figure(figsize=(12,6))
filtered_merchants['fraud_rate_pct'].sort_values(ascending=False).plot(kind='bar', color='tomato')

plt.title("Fraud Rate by Merchant (Filtered)")
plt.ylabel("Fraud Rate (%)")
plt.xlabel("Merchant")
plt.xticks(rotation=90)
plt.grid(axis='y', linestyle='--', alpha=0.6)

plt.show()

# Fraud rate by category
fraud_by_category = (
    ccf.groupby('category')['is_fraud']
       .agg(['count', 'sum'])
       .rename(columns={'count': 'total', 'sum': 'fraud_count'})
)

fraud_by_category['fraud_rate'] = fraud_by_category['fraud_count'] / fraud_by_category['total']
fraud_by_category['fraud_rate_pct'] = (fraud_by_category['fraud_rate'] * 100).round(2)

print(fraud_by_category)
print(fraud_by_category.sort_values('fraud_rate_pct', ascending=False).head(15))

plt.figure(figsize=(10,6))
fraud_by_category['fraud_rate_pct'].sort_values(ascending=False).plot(kind='bar', color='seagreen')

overall_fraud_rate_pct = ccf['is_fraud'].mean() * 100

plt.axhline(
    overall_fraud_rate_pct,
    color='gray',
    linestyle='--',
    linewidth=1.8,
    label=f'Avg Fraud Rate ({overall_fraud_rate_pct:.2f}%)'
)


plt.title("Fraud Rate by Category")
plt.ylabel("Fraud Rate (%)")
plt.xlabel("Category")
plt.xticks(rotation=45)
plt.grid(axis='y', linestyle='--', alpha=0.6)

plt.show()

top15_merchants = (
    filtered_merchants
    .sort_values('fraud_rate_pct', ascending=False)
    .head(15)
)

print(top15_merchants)

plt.figure(figsize=(12,6))
top15_merchants['fraud_rate_pct'].plot(kind='bar', color='tomato')

overall_fraud_rate_pct = ccf['is_fraud'].mean() * 100

plt.axhline(
    overall_fraud_rate_pct,
    color='gray',
    linestyle='--',
    linewidth=1.8,
    label=f'Avg Fraud Rate ({overall_fraud_rate_pct:.2f}%)'
)


plt.title("Top 15 Merchants by Fraud Rate (%)")
plt.ylabel("Fraud Rate (%)")
plt.xlabel("Merchant")
plt.xticks(rotation=90)
plt.grid(axis='y', linestyle='--', alpha=0.6)

plt.show()

# Create amount bins to look for amounts possibly seeing more issues in fraud
bins = [0, 5, 10, 25, 50, 100, 250, 500, 1000, float('inf')]
labels = ["0-5", "5-10", "10-25", "25-50", "50-100", "100-250", "250-500", "500-1000", "1000+"]

ccf['amt_bin'] = pd.cut(ccf['amt'], bins=bins, labels=labels, right=False)

ccf['amt_bin'].value_counts()

# Calculate fraud by the amt_bins
fraud_by_amt = (
    ccf.groupby('amt_bin')['is_fraud']
       .agg(['count', 'sum'])
       .rename(columns={'count': 'total', 'sum': 'fraud_count'})
)

fraud_by_amt['fraud_rate'] = fraud_by_amt['fraud_count'] / fraud_by_amt['total']
fraud_by_amt['fraud_rate_pct'] = (fraud_by_amt['fraud_rate'] * 100).round(2)

print(fraud_by_amt)

# Create a bar graph showing the amount bins
plt.figure(figsize=(10,6))
fraud_by_amt['fraud_rate_pct'].plot(kind='bar', color='purple')

plt.title("Fraud Rate by Transaction Amount Bin")
plt.ylabel("Fraud Rate (%)")
plt.xlabel("Amount Bin")
plt.xticks(rotation=45)
plt.grid(axis='y', linestyle='--', alpha=0.6)

plt.show()

# Create a side by side chart with count and percentage

fig, ax1 = plt.subplots(figsize=(12,6))

# --- BAR CHART (Fraud Count) ---
ax1.bar(
    fraud_by_amt.index,
    fraud_by_amt['fraud_count'],
    color='steelblue',
    alpha=0.7,
    label='Fraud Count'
)

ax1.set_xlabel("Amount Bin")
ax1.set_ylabel("Fraud Count", color='steelblue')
ax1.tick_params(axis='y', labelcolor='steelblue')
ax1.tick_params(axis='x', rotation=45)

# --- SECOND AXIS (Fraud Rate %) ---
ax2 = ax1.twinx()

ax2.plot(
    fraud_by_amt.index,
    fraud_by_amt['fraud_rate_pct'],
    color='tomato',
    marker='o',
    linewidth=2,
    label='Fraud Rate (%)'
)

ax2.set_ylabel("Fraud Rate (%)", color='tomato')
ax2.tick_params(axis='y', labelcolor='tomato')

# --- ADD AVERAGE FRAUD RATE LINE ---
overall_fraud_rate_pct = ccf['is_fraud'].mean() * 100

ax2.axhline(
    overall_fraud_rate_pct,
    color='gray',
    linestyle='--',
    linewidth=1.8,
    label=f'Avg Fraud Rate ({overall_fraud_rate_pct:.2f}%)'
)

# --- TITLE & GRID ---
plt.title("Fraud Rate (%) and Fraud Count by Transaction Amount Bin")
ax1.grid(axis='y', linestyle='--', alpha=0.5)

# --- COMBINED LEGEND ---
lines_1, labels_1 = ax1.get_legend_handles_labels()
lines_2, labels_2 = ax2.get_legend_handles_labels()
plt.legend(lines_1 + lines_2, labels_1 + labels_2, loc='upper left')

plt.tight_layout()
plt.show()


import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
from shapely.geometry import Point

import os
import requests
import zipfile
import geopandas as gpd
import matplotlib.pyplot as plt
from shapely.geometry import Point

# ---------------------------------------------------------
# 1. Convert client coordinates into a GeoDataFrame
# ---------------------------------------------------------
geometry = [Point(xy) for xy in zip(ccf_fraud['long'], ccf_fraud['lat'])]
gdf = gpd.GeoDataFrame(ccf_fraud, geometry=geometry, crs="EPSG:4326")

# ---------------------------------------------------------
# 2. Download Natural Earth country boundaries (110m)
# ---------------------------------------------------------
ne_url = "https://naturalearth.s3.amazonaws.com/110m_cultural/ne_110m_admin_0_countries.zip"
ne_zip = "ne_110m_admin_0_countries.zip"
ne_dir = "ne_110m_admin_0_countries"

if not os.path.exists(ne_dir):
    print("Downloading Natural Earth world boundaries...")
    r = requests.get(ne_url)
    r.raise_for_status()
    with open(ne_zip, "wb") as f:
        f.write(r.content)
    with zipfile.ZipFile(ne_zip, "r") as zip_ref:
        zip_ref.extractall(ne_dir)
    os.remove(ne_zip)

# Find .shp file
shp_path = None
for file in os.listdir(ne_dir):
    if file.endswith(".shp"):
        shp_path = os.path.join(ne_dir, file)
        break

world = gpd.read_file(shp_path)
usa = world[world['ADMIN'] == 'United States of America']

# ---------------------------------------------------------
# 3. Download Natural Earth state boundaries (110m)
# ---------------------------------------------------------
states_url = "https://naturalearth.s3.amazonaws.com/110m_cultural/ne_110m_admin_1_states_provinces.zip"
states_zip = "ne_110m_admin_1_states_provinces.zip"
states_dir = "ne_110m_admin_1_states_provinces"

if not os.path.exists(states_dir):
    print("Downloading Natural Earth state boundaries...")
    r = requests.get(states_url)
    r.raise_for_status()
    with open(states_zip, "wb") as f:
        f.write(r.content)
    with zipfile.ZipFile(states_zip, "r") as zip_ref:
        zip_ref.extractall(states_dir)
    os.remove(states_zip)

# Find states shapefile
states_shp = None
for file in os.listdir(states_dir):
    if file.endswith(".shp"):
        states_shp = os.path.join(states_dir, file)
        break

states = gpd.read_file(states_shp)
states_usa = states[states['admin'] == 'United States of America']

# ---------------------------------------------------------
# 4. CRS Alignment
# ---------------------------------------------------------
if gdf.crs != usa.crs:
    gdf = gdf.to_crs(usa.crs)

# ---------------------------------------------------------
# 5. Plotting
# ---------------------------------------------------------
fig, ax = plt.subplots(figsize=(14, 10))

# Plot USA
usa.plot(ax=ax, color='lightgray', edgecolor='black')

# Plot state boundaries
states_usa.boundary.plot(ax=ax, linewidth=0.8, color="white")

# Plot client points
gdf.plot(ax=ax, color='blue', markersize=50)

# ---------------------------------------------------------
# 6. Add State Labels
# ---------------------------------------------------------
states_usa["center"] = states_usa.geometry.centroid

for idx, row in states_usa.iterrows():
    x = row["center"].x
    y = row["center"].y
    label = row["name"]  # State name
    ax.text(x, y, label, fontsize=7, ha='center', color='black')

plt.title("Client Locations of Fraudulent Transactions with U.S. State Boundaries and Labels")
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.grid(True)
plt.tight_layout()
plt.show()



import os
import requests
import zipfile
import geopandas as gpd
import matplotlib.pyplot as plt
from shapely.geometry import Point

# ---------------------------------------------------------
# 1. Convert merchant coordinates into a GeoDataFrame
# ---------------------------------------------------------
geometry = [Point(xy) for xy in zip(ccf_fraud['merch_long'], ccf_fraud['merch_lat'])]
gdf2 = gpd.GeoDataFrame(ccf_fraud, geometry=geometry, crs="EPSG:4326")

# ---------------------------------------------------------
# 2. Download Natural Earth country boundaries (110m)
# ---------------------------------------------------------
ne_url = "https://naturalearth.s3.amazonaws.com/110m_cultural/ne_110m_admin_0_countries.zip"
ne_zip = "ne_110m_admin_0_countries.zip"
ne_dir = "ne_110m_admin_0_countries"

if not os.path.exists(ne_dir):
    print("Downloading Natural Earth world boundaries...")
    r = requests.get(ne_url)
    r.raise_for_status()
    with open(ne_zip, "wb") as f:
        f.write(r.content)
    with zipfile.ZipFile(ne_zip, "r") as zip_ref:
        zip_ref.extractall(ne_dir)
    os.remove(ne_zip)

# Find .shp file
shp_path = None
for file in os.listdir(ne_dir):
    if file.endswith(".shp"):
        shp_path = os.path.join(ne_dir, file)
        break

world = gpd.read_file(shp_path)
usa = world[world['ADMIN'] == 'United States of America']

# ---------------------------------------------------------
# 3. Download Natural Earth state boundaries (110m)
# ---------------------------------------------------------
states_url = "https://naturalearth.s3.amazonaws.com/110m_cultural/ne_110m_admin_1_states_provinces.zip"
states_zip = "ne_110m_admin_1_states_provinces.zip"
states_dir = "ne_110m_admin_1_states_provinces"

if not os.path.exists(states_dir):
    print("Downloading Natural Earth state boundaries...")
    r = requests.get(states_url)
    r.raise_for_status()
    with open(states_zip, "wb") as f:
        f.write(r.content)
    with zipfile.ZipFile(states_zip, "r") as zip_ref:
        zip_ref.extractall(states_dir)
    os.remove(states_zip)

# Find states shapefile
states_shp = None
for file in os.listdir(states_dir):
    if file.endswith(".shp"):
        states_shp = os.path.join(states_dir, file)
        break

states = gpd.read_file(states_shp)
states_usa = states[states['admin'] == 'United States of America']

# ---------------------------------------------------------
# 4. CRS Alignment
# ---------------------------------------------------------
if gdf2.crs != usa.crs:
    gdf2 = gdf2.to_crs(usa.crs)

# ---------------------------------------------------------
# 5. Plotting
# ---------------------------------------------------------
fig, ax = plt.subplots(figsize=(14, 10))

# Plot USA
usa.plot(ax=ax, color='lightgray', edgecolor='black')

# Plot state boundaries
states_usa.boundary.plot(ax=ax, linewidth=0.8, color="white")

# Plot merchant points
gdf2.plot(ax=ax, color='red', markersize=50)

# ---------------------------------------------------------
# 6. Add State Labels
# ---------------------------------------------------------
states_usa["center"] = states_usa.geometry.centroid

for idx, row in states_usa.iterrows():
    x = row["center"].x
    y = row["center"].y
    label = row["name"]  # State name
    ax.text(x, y, label, fontsize=7, ha='center', color='black')

plt.title("Merchant Locations of Fraudulent Transactions with U.S. State Boundaries and Labels")
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.grid(True)
plt.tight_layout()
plt.show()

# Plot all transactions regardless of fraud or not
# ---------------------------------------------------------
# 1. Convert merchant coordinates into a GeoDataFrame
# ---------------------------------------------------------
geometry = [Point(xy) for xy in zip(ccf['merch_long'], ccf['merch_lat'])]
gdf2 = gpd.GeoDataFrame(ccf, geometry=geometry, crs="EPSG:4326")

# ---------------------------------------------------------
# 2. Download Natural Earth country boundaries (110m)
# ---------------------------------------------------------
ne_url = "https://naturalearth.s3.amazonaws.com/110m_cultural/ne_110m_admin_0_countries.zip"
ne_zip = "ne_110m_admin_0_countries.zip"
ne_dir = "ne_110m_admin_0_countries"

if not os.path.exists(ne_dir):
    print("Downloading Natural Earth world boundaries...")
    r = requests.get(ne_url)
    r.raise_for_status()
    with open(ne_zip, "wb") as f:
        f.write(r.content)
    with zipfile.ZipFile(ne_zip, "r") as zip_ref:
        zip_ref.extractall(ne_dir)
    os.remove(ne_zip)

# Find .shp file
shp_path = None
for file in os.listdir(ne_dir):
    if file.endswith(".shp"):
        shp_path = os.path.join(ne_dir, file)
        break

world = gpd.read_file(shp_path)
usa = world[world['ADMIN'] == 'United States of America']

# ---------------------------------------------------------
# 3. Download Natural Earth state boundaries (110m)
# ---------------------------------------------------------
states_url = "https://naturalearth.s3.amazonaws.com/110m_cultural/ne_110m_admin_1_states_provinces.zip"
states_zip = "ne_110m_admin_1_states_provinces.zip"
states_dir = "ne_110m_admin_1_states_provinces"

if not os.path.exists(states_dir):
    print("Downloading Natural Earth state boundaries...")
    r = requests.get(states_url)
    r.raise_for_status()
    with open(states_zip, "wb") as f:
        f.write(r.content)
    with zipfile.ZipFile(states_zip, "r") as zip_ref:
        zip_ref.extractall(states_dir)
    os.remove(states_zip)

# Find states shapefile
states_shp = None
for file in os.listdir(states_dir):
    if file.endswith(".shp"):
        states_shp = os.path.join(states_dir, file)
        break

states = gpd.read_file(states_shp)
states_usa = states[states['admin'] == 'United States of America']

# ---------------------------------------------------------
# 4. CRS Alignment
# ---------------------------------------------------------
if gdf2.crs != usa.crs:
    gdf2 = gdf2.to_crs(usa.crs)

# ---------------------------------------------------------
# 5. Plotting
# ---------------------------------------------------------
fig, ax = plt.subplots(figsize=(14, 10))

# Plot USA
usa.plot(ax=ax, color='lightgray', edgecolor='black')

# Plot state boundaries
states_usa.boundary.plot(ax=ax, linewidth=0.8, color="white")

# Plot merchant points
gdf2.plot(ax=ax, color='red', markersize=50)

# ---------------------------------------------------------
# 6. Add State Labels
# ---------------------------------------------------------
states_usa["center"] = states_usa.geometry.centroid

for idx, row in states_usa.iterrows():
    x = row["center"].x
    y = row["center"].y
    label = row["name"]  # State name
    ax.text(x, y, label, fontsize=7, ha='center', color='black')

plt.title("Merchant Locations with U.S. State Boundaries and Labels")
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.grid(True)
plt.tight_layout()
plt.show()
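
The maps above show point locations; for the guiding question on fraud rates across states, a simple aggregation by cardholder state is a useful tabular complement. A minimal sketch using the ccf DataFrame already in memory (no additional downloads assumed):

# Hedged sketch: fraud rate by cardholder state, as a tabular complement to the maps
fraud_by_state = (
    ccf.groupby('state')['is_fraud']
       .agg(total='count', fraud_count='sum')
)
fraud_by_state['fraud_rate_pct'] = (100 * fraud_by_state['fraud_count'] / fraud_by_state['total']).round(2)

print(fraud_by_state.sort_values('fraud_rate_pct', ascending=False))

plt.figure(figsize=(10, 6))
fraud_by_state['fraud_rate_pct'].sort_values(ascending=False).plot(kind='bar', color='darkorange')
plt.title("Fraud Rate by Cardholder State")
plt.ylabel("Fraud Rate (%)")
plt.xlabel("State")
plt.grid(axis='y', linestyle='--', alpha=0.6)
plt.show()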

print(ccf.info())

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ccf['trans_date_trans_time'] = pd.to_datetime(ccf['trans_date_trans_time'])
ccf['hour'] = ccf['trans_date_trans_time'].dt.hour
ccf['dayofweek'] = ccf['trans_date_trans_time'].dt.dayofweek
ccf['month'] = ccf['trans_date_trans_time'].dt.month

from geopy.distance import geodesic

ccf['distance'] = ccf.apply(
    lambda row: geodesic((row['lat'], row['long']), (row['merch_lat'], row['merch_long'])).miles,
    axis=1
)
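
Row-by-row geodesic calls can be slow on roughly 340,000 rows. As an optional alternative (a sketch, not the approach used above), a vectorized haversine approximation gives very similar distances in miles:

# Optional sketch: vectorized haversine distance in miles (an approximation to geodesic)
import numpy as np

def haversine_miles(lat1, lon1, lat2, lon2, radius_miles=3958.8):
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * radius_miles * np.arcsin(np.sqrt(a))

# Uncomment to replace the apply-based calculation above:
# ccf['distance'] = haversine_miles(ccf['lat'], ccf['long'], ccf['merch_lat'], ccf['merch_long'])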

# Drop columns that are either redundant or not useful identifiers for modeling
ccf_model = ccf.drop([
    'trans_num',
    'dob',
    'trans_date_trans_time',
    'city',
    'job',
    'age_bin',
    'amt_bin'
], axis=1)

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, recall_score
import numpy as np

# -----------------------------
# 1. Define features and target
# -----------------------------
y = ccf_model['is_fraud']
X = ccf_model.drop(['is_fraud'], axis=1)

categorical_cols = ['merchant', 'category', 'state']
numeric_cols = X.select_dtypes(include=['number']).columns.tolist()

# -----------------------------
# 2. Preprocessing
# -----------------------------
preprocess = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols),
        ('num', 'passthrough', numeric_cols)
    ]
)

# -----------------------------
# 3. Recall‑optimized Random Forest
# -----------------------------
rf_recall = RandomForestClassifier(
    n_estimators=500,          # more trees = more stable recall
    max_depth=10,              # shallower trees reduce overfitting
    min_samples_leaf=5,        # prevents tiny leaves that hurt recall
    class_weight='balanced',   # CRITICAL for fraud
    random_state=42
)

model = Pipeline(steps=[
    ('preprocess', preprocess),
    ('clf', rf_recall)
])

# -----------------------------
# 4. Train/test split
# -----------------------------
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# -----------------------------
# 5. Fit model
# -----------------------------
model.fit(X_train, y_train)

# -----------------------------
# 6. Predict with LOWER threshold
# -----------------------------
y_proba = model.predict_proba(X_test)[:, 1]

# Lower threshold from 0.50 → 0.40 (tune this!)
threshold = 0.40
y_pred = (y_proba > threshold).astype(int)

# -----------------------------
# 7. Evaluate recall
# -----------------------------
print("Recall:", recall_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

print("\nThreshold sweep:")
for t in [0.39, 0.40, 0.41, 0.42, 0.43, 0.44, 0.45]:
    preds = (y_proba > t).astype(int)
    recall = recall_score(y_test, preds)
    flag_rate = preds.mean()
    print(f"t={t:.2f} | recall={recall:.3f} | flagged={flag_rate:.3f}")

from sklearn.metrics import precision_recall_curve, average_precision_score
import matplotlib.pyplot as plt

precision, recall, thresholds = precision_recall_curve(y_test, y_proba)
avg_precision = average_precision_score(y_test, y_proba)

plt.figure(figsize=(8,6))
plt.plot(recall, precision, color='purple', lw=2,
         label=f'Precision-Recall curve (AP = {avg_precision:.4f})')

plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.legend(loc="lower left")
plt.grid(True)
plt.show()



!pip install xgboost

import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix, recall_score
import numpy as np

# -----------------------------
# 1. Define features and target
# -----------------------------
y = ccf_model['is_fraud']
X = ccf_model.drop(['is_fraud'], axis=1)

categorical_cols = ['merchant', 'category', 'state']
numeric_cols = X.select_dtypes(include=['number']).columns.tolist()

# -----------------------------
# 2. Preprocessing
# -----------------------------
preprocess = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols),
        ('num', 'passthrough', numeric_cols)
    ]
)

# -----------------------------
# 3. XGBoost tuned for RECALL
# -----------------------------
# Fraud imbalance ratio
fraud_ratio = 0.9947 / 0.0053   # ≈ 188
balanced_weight = np.sqrt(fraud_ratio)  # ≈ 13.7
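# Note: balanced_weight (the square root of the ratio) is a milder alternative; the full ratio is what gets passed to scale_pos_weight below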

xgb_recall = xgb.XGBClassifier(
    n_estimators=700,
    learning_rate=0.15,
    max_depth=3,
    subsample=0.9,
    colsample_bytree=0.9,
    scale_pos_weight=fraud_ratio,
    min_child_weight = 1,
    gamma=0,
    eval_metric='logloss',
    random_state=42,
    tree_method='hist'
)

model = Pipeline(steps=[
    ('preprocess', preprocess),
    ('clf', xgb_recall)
])

# -----------------------------
# 4. Train/test split
# -----------------------------
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# -----------------------------
# 5. Fit model
# -----------------------------
model.fit(X_train, y_train)

# -----------------------------
# 6. Predict with LOWER threshold
# -----------------------------
y_proba = model.predict_proba(X_test)[:, 1]

# Lower the classification threshold to favor recall; 0.20 was chosen after the sweep below
threshold = 0.20
y_pred = (y_proba > threshold).astype(int)


# -----------------------------
# 7. Evaluate recall
# -----------------------------
print("Threshold:", threshold)
print("Recall:", recall_score(y_test, y_pred))
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Optional: inspect how recall and flag rate change with threshold
print("\nThreshold sweep:")
for t in [0.20, 0.25, 0.30, 0.35, 0.40]:
    preds = (y_proba > t).astype(int)
    print(
        f"t={t:.2f} | recall={recall_score(y_test, preds):.3f} | "
        f"flag_rate={preds.mean():.3f}"
    )

from sklearn.metrics import precision_recall_curve, average_precision_score
import matplotlib.pyplot as plt

precision, recall, thresholds = precision_recall_curve(y_test, y_proba)
avg_precision = average_precision_score(y_test, y_proba)

plt.figure(figsize=(8,6))
plt.plot(recall, precision, color='purple', lw=2,
         label=f'Precision-Recall curve (AP = {avg_precision:.4f})')

plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.legend(loc="lower left")
plt.grid(True)
plt.show()
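
roc_auc_score is imported near the top of the notebook but never used; as an optional check, it can be reported alongside the average precision computed above:

# Optional: ROC AUC for the XGBoost probabilities, alongside average precision
from sklearn.metrics import roc_auc_score

print("ROC AUC:", roc_auc_score(y_test, y_proba))
print("Average precision:", avg_precision)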

#  SHAP EXPLANATIONS 

import shap

# Extract trained XGB model
xgb_model = model.named_steps['clf']

# Transform X_test using the pipeline’s preprocessing
X_test_transformed = model.named_steps['preprocess'].transform(X_test)

# Create SHAP explainer
explainer = shap.TreeExplainer(xgb_model)

# Compute SHAP values
shap_values = explainer.shap_values(X_test_transformed)

# Summary plot (global feature importance)
shap.summary_plot(
    shap_values,
    X_test_transformed,
    feature_names=model.named_steps['preprocess'].get_feature_names_out()
)


# Generate fraud risk scores for the entire dataset
fraud_scores = model.predict_proba(X)[:, 1]
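# Note: X includes rows the model was trained on, so scores on those rows are optimistic; they are used here for exploratory ranking only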

# Add to dataframe
ccf_model['fraud_risk_score'] = fraud_scores

# Preview
ccf_model[['fraud_risk_score']].head()

import pandas as pd
import numpy as np

# SHAP values matrix → absolute mean impact per feature
shap_importance = pd.DataFrame({
    'feature': model.named_steps['preprocess'].get_feature_names_out(),
    'mean_abs_shap': np.abs(shap_values).mean(axis=0)
}).sort_values('mean_abs_shap', ascending=False)

shap_importance.head(15)

import matplotlib.pyplot as plt

top_n = 15
plt.figure(figsize=(10,6))
plt.barh(shap_importance['feature'].head(top_n)[::-1],
         shap_importance['mean_abs_shap'].head(top_n)[::-1],
         color='purple')
plt.title("Top Fraud Drivers (Mean |SHAP| Impact)")
plt.xlabel("Mean Absolute SHAP Value")
plt.ylabel("Feature")
plt.grid(axis='x', linestyle='--', alpha=0.5)
plt.show()

# Attach SHAP values to the transactions they were computed for (the test set),
# aligning on the X_test index; rows outside the test set get NaN SHAP values
shap_df = pd.DataFrame(
    shap_values,
    columns=model.named_steps['preprocess'].get_feature_names_out(),
    index=X_test.index
)

ccf_model = pd.concat([ccf_model, shap_df.add_prefix("shap_")], axis=1)

# Top Fraud-driving categories
category_drivers = (
    ccf_model.groupby('category')['fraud_risk_score']
    .mean()
    .sort_values(ascending=False)
)

print(category_drivers.head(10))

# Top fraud-driving merchants
merchant_drivers = (
    ccf_model.groupby('merchant')['fraud_risk_score']
    .mean()
    .sort_values(ascending=False)
)

print(merchant_drivers.head(10))

# High Risk Transactions table
high_risk = (
    ccf_model[['fraud_risk_score', 'merchant', 'category', 'amt', 'state']]
    .sort_values('fraud_risk_score', ascending=False)
    .head(20)
)

print(high_risk)

# SHAP distribution for a single top feature
top_feature = shap_importance['feature'].iloc[0]

plt.figure(figsize=(8,5))
plt.hist(ccf_model[f"shap_{top_feature}"].dropna(), bins=50, color='teal')  # SHAP values exist only for test rows
plt.title(f"SHAP Value Distribution for {top_feature}")
plt.xlabel("SHAP Value")
plt.ylabel("Count")
plt.grid(alpha=0.3)
plt.show()

# SHAP Waterfall Plot for High-Risk Transaction

# SHAP values exist only for the test set, so pick the highest-risk test transaction
i = ccf_model.loc[X_test.index, 'fraud_risk_score'].idxmax()
test_pos = list(X_test.index).index(i)

shap.plots._waterfall.waterfall_legacy(
    explainer.expected_value,
    shap_values[test_pos],
    feature_names=model.named_steps['preprocess'].get_feature_names_out()
)

# SHAP Force Plot for a Mid-Risk Transaction

test_df = ccf_model.loc[X_test.index].copy()

mid_risk = test_df[
    (test_df['fraud_risk_score'] > 0.20) &
    (test_df['fraud_risk_score'] < 0.40)
].sample(1).index[0]

test_pos = list(X_test.index).index(mid_risk)

shap.force_plot(
    explainer.expected_value,
    shap_values[test_pos],
    X_test_transformed[test_pos].toarray().flatten(),
    feature_names=model.named_steps['preprocess'].get_feature_names_out(),
    matplotlib=True
)

# Top Risk Transactions Table with SHAP drivers
top_risk = (
    ccf_model[['fraud_risk_score', 'merchant', 'category', 'amt', 'state'] +
              [f"shap_{f}" for f in shap_importance['feature'].head(5)]]
    .sort_values('fraud_risk_score', ascending=False)
    .head(20)
)

print(top_risk)

import seaborn as sns
import matplotlib.pyplot as plt

feature = "num__amt"  # example — you can loop through top features

plt.figure(figsize=(10,6))
sns.kdeplot(ccf_model[ccf_model['is_fraud']==1][f"shap_{feature}"], 
            label="Fraud", fill=True, color="red")
sns.kdeplot(ccf_model[ccf_model['is_fraud']==0][f"shap_{feature}"], 
            label="Non-Fraud", fill=True, color="blue")

plt.title(f"SHAP Value Distribution for {feature}: Fraud vs Non-Fraud")
plt.xlabel("SHAP Value")
plt.ylabel("Density")
plt.legend()
plt.grid(True)
plt.show()

# Compare SHAP distributions for the top features between fraud and non-fraud,
# and collect summary statistics along the way
top_features = shap_importance['feature'].head(5)
summary = []

for feature in top_features:
    plt.figure(figsize=(8, 5))
    sns.kdeplot(ccf_model[ccf_model['is_fraud'] == 1][f"shap_{feature}"], label="Fraud", fill=True)
    sns.kdeplot(ccf_model[ccf_model['is_fraud'] == 0][f"shap_{feature}"], label="Non-Fraud", fill=True)
    plt.title(f"SHAP Distribution: {feature}")
    plt.legend()
    plt.show()

    fraud_mean = ccf_model[ccf_model['is_fraud'] == 1][f"shap_{feature}"].mean()
    nonfraud_mean = ccf_model[ccf_model['is_fraud'] == 0][f"shap_{feature}"].mean()
    summary.append([feature, fraud_mean, nonfraud_mean, fraud_mean - nonfraud_mean])

pd.DataFrame(summary, columns=["Feature", "Fraud SHAP Mean", "Non-Fraud SHAP Mean", "Difference"])

# Heat Map

# --- Difference Heatmap ---

# Select top N features by SHAP importance
top_features = shap_importance['feature'].head(15)

# Build difference values
diff_values = []

for feature in top_features:
    fraud_mean = ccf_model[ccf_model['is_fraud']==1][f"shap_{feature}"].mean()
    nonfraud_mean = ccf_model[ccf_model['is_fraud']==0][f"shap_{feature}"].mean()
    diff = fraud_mean - nonfraud_mean
    diff_values.append(diff)

# Create DataFrame
diff_df = pd.DataFrame(
    diff_values,
    index=top_features,
    columns=["Fraud - Non-Fraud"]
)

# Ensure numeric (avoids dtype errors)
diff_df = diff_df.apply(pd.to_numeric, errors='coerce').fillna(0)

# Plot heatmap
plt.figure(figsize=(10,4))
sns.heatmap(diff_df.T, annot=True, cmap="coolwarm", center=0)
plt.title("Difference Heatmap (Fraud SHAP Mean – Non-Fraud SHAP Mean)")
plt.xlabel("Feature")
plt.ylabel("Difference")
plt.show()

from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage

# --- Build SHAP difference matrix ---
top_features = shap_importance['feature'].head(15)
diff_values = []

for feature in top_features:
    fraud_mean = ccf_model[ccf_model['is_fraud']==1][f"shap_{feature}"].mean()
    nonfraud_mean = ccf_model[ccf_model['is_fraud']==0][f"shap_{feature}"].mean()
    diff = fraud_mean - nonfraud_mean
    diff_values.append(diff)

# Create DataFrame
diff_df = pd.DataFrame(
    diff_values,
    index=top_features,
    columns=["Fraud - Non-Fraud"]
)

# Ensure numeric
diff_df = diff_df.apply(pd.to_numeric, errors='coerce').fillna(0)

# --- Clustered heatmap ---
sns.clustermap(
    diff_df,
    cmap="coolwarm",
    center=0,
    annot=True,
    figsize=(8, 10),
    metric="euclidean",
    method="average"
)

plt.suptitle("Clustered SHAP Difference Heatmap", y=1.05)
plt.show()

Credit Card Fraud Report

Problem Statement

Credit card fraud eats into bank profitability given the volume of transactions processed every day. This report presents an analysis of the transaction data and the development of a fraud detection model. The model is designed to err on the side of caution: the bank wants to capture as many fraudulent transactions as possible, even if that means flagging some legitimate ones. The patterns uncovered along the way also help explain how fraud happens and how the bank can use this data for future predictions.

Data Exploration

The dataset contains 339,607 transactions spanning 2019 and 2020, with fifteen columns: the transaction date and time, the merchant, the merchant's category, the transaction amount, the client's city and state, the client's latitude and longitude, the population of the client's city, the client's job and date of birth, the transaction number, the merchant's latitude and longitude, and the target variable is_fraud, which indicates whether the transaction was legitimate or fraudulent.

These variables shape the analysis: before any modeling, we explore the data for insights, since a clear understanding of the dataset helps us extract useful knowledge from it. The goal is to identify fraudulent transactions and build a machine learning model that predicts them as accurately as possible. The dataset contains 1,782 fraudulent transactions versus 337,825 legitimate ones, so fraud makes up only 0.53% of the data. The dataset is therefore heavily imbalanced, which must be addressed during modeling; left unaddressed, the imbalance would bias the model and weaken its predictive ability. Several questions follow: Are older customers more likely to be victims of fraud? Do certain merchants or categories see more fraudulent transactions? Are certain amounts more prone to fraud? Do some areas see more fraud, for either the merchant or the client?
Data Insights

The first question is whether older customers are more likely to experience fraud. At first look, the 20-29 age group has the highest fraud rate at 0.87%, followed by the over-sixty group at 0.70%. The other three age groups fall below the dataset average.

Note that the 20-29 group is a fairly small sample. Below is the visualization of fraud cases by age group.

The oldest age group has both the most transactions and the most fraudulent transactions, which suggests that age may carry a signal worth watching during predictive modeling. This data only includes cardholders over the age of twenty-five; attracting younger clientele before they settle on another card is worth considering, though it falls outside the scope of this analysis. The next question concerns specific merchants and categories. The dataset contains 693 distinct merchants, of which only 184 had no fraudulent transactions; those 184 merchants accounted for 78,950 transactions over the two years, roughly a quarter of the dataset. With so many merchants, we narrowed the view to the top fifteen by fraud percentage. These merchants saw fraud at more than four times the rate of the dataset as a whole.

A few merchants saw fraud in over 3% of their transactions. These outliers are riskier merchants whose transactions may deserve extra scrutiny, given how high their fraud rates are relative to the dataset. There are fourteen distinct categories, and four of them have fraud rates above the dataset average.

These four categories account for about a third of all transactions (106,356 over the two-year period) but for 1,218 of the 1,782 fraudulent transactions, or 68.2% of fraud. These categories warrant closer review, and their transactions may need to be flagged more readily. Finally, given how common online purchases are, we checked whether merchant locations raised any concerns: every fraudulent transaction involved a US-based merchant located within the same footprint as the customers.

The maps show that the purchases happened within the footprint of where the clients live, so nothing flags as outside the country. Lastly, we examine the transaction amount for any relationship with fraud. The amounts are split into bins of 0-5, 5-10, 10-25, 25-50, 50-100, 100-250, 250-500, 500-1000, and 1000+. The results of that analysis are below.

Transactions under $25 and over $250 show the most fraud. Amounts between those values show only trace levels of fraud, even though the bulk of transactions falls in that middle range. Once the higher-value transactions are isolated, the fraud rate increases significantly. The predictive model needs to catch these transactions, especially the larger amounts, since they cost the bank the most.

Visualize the Data

Combining several of the charts makes the information that stands out easier to see. The question of whether older clients are more susceptible to fraud is visualized below:

Although the 20-29 bracket has a sizable fraud percentage, it is a small sample, and the over-sixty group has more transactions that result in fraud. The 40-49 group has the lowest fraud rate of any age group at 0.34%, well below the 0.70% for sixty and over. The reference line shows where the dataset average sits. Our clients' average age is fifty-three, so the business needs especially accurate fraud detection as the 50-59 group begins to see more fraudulent transactions; the data points directly at the needs of our clientele. Next, the amount bins suggest a pattern in which small test charges may precede a larger fraudulent purchase, sometimes described as micro-transaction fraud: cardholders often do not notice a small charge going through until a larger one follows. A future refinement could have the model line up transactions by time and merchant to find cases where a small charge was followed by a larger one at the same merchant, as sketched below.
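
A minimal sketch of how that small-then-large pattern could be surfaced. The data has no customer identifier, so the combination of date of birth, job, and city is used as a rough proxy for a cardholder; that proxy and the $10 / $250 cutoffs are illustrative assumptions, not part of the analysis above:

# Hedged sketch: flag a small charge followed by a much larger one at the same merchant,
# using (dob, job, city) as a rough customer proxy (an assumption; no customer id exists).
ccf_sorted = ccf.sort_values('trans_date_trans_time').copy()
ccf_sorted['customer_proxy'] = (
    ccf_sorted['dob'].astype(str) + '|' + ccf_sorted['job'] + '|' + ccf_sorted['city']
)

# Previous amount for the same proxy-customer at the same merchant, in time order
ccf_sorted['prev_amt'] = ccf_sorted.groupby(['customer_proxy', 'merchant'])['amt'].shift(1)

suspicious = ccf_sorted[(ccf_sorted['prev_amt'] < 10) & (ccf_sorted['amt'] > 250)]
print(suspicious[['trans_date_trans_time', 'merchant', 'prev_amt', 'amt', 'is_fraud']].head())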

The data show that once the amount exceeds $250 the fraud percentage rises, and above $500 roughly one in five transactions is fraudulent, which suggests most of these transactions need further review. Over the two-year period there were 4,044 transactions above $500, of which 837 were fraudulent, close to half of all the fraud in the dataset. Tracking repeat victims is difficult, however, because the data contains no card number or customer identifier that would show whether the same card was used across fraudulent transactions.

Machine Learning Predictions

With the data explored and patterns identified, we now select a champion model that captures as many fraudulent transactions as possible without worrying much about false positives (legitimate transactions that the model flags as fraudulent). The model also has to handle the imbalance in the data: only 1,782 fraudulent transactions against 337,825 legitimate ones. Overall accuracy matters less than recall, which measures how well the model captures the fraudulent transactions. Several models were compared in search of the best recall: a neural network, logistic regression, a random forest, and extreme gradient boosting (XGBoost in Python). All performed reasonably well given the limited features in the data. Notably, the neural network, which is common in production fraud systems because of the volume of data moving through them, was not the champion. XGBoost was the champion, recognizing 90.8% of fraud cases while flagging only 0.8% of legitimate transactions. Gradient boosting combines many small decision trees to weigh the characteristics of each transaction and produce an overall score; with the threshold set at 0.20, any transaction scoring above that value is marked as fraud. Hyperparameter tuning went through several iterations. A balanced weight and a full fraud ratio were both tried for the class weighting, and the fraud ratio captured more of the fraud. Since 99.5% of transactions are legitimate, the model had to take this imbalance into account. The number of trees was varied between 500 and 700, with 700 giving slightly better results. The goal was a model that captures over 95% of fraudulent transactions. After reviewing the thresholds for the gradient boosting model shown below, the champion model was chosen.

With the threshold set to 0.20, the confusion matrix shows that the model missed twenty-one fraudulent transactions in the test set but captured 425 instances of fraud.

It incorrectly flagged only 1,375 legitimate transactions, compared with 83,081 classified correctly. The next step is to examine which aspects of the data the model weighed most heavily when producing the fraud risk score.
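
A quick worked check of the figures quoted above, using the confusion-matrix counts from the 0.20-threshold run on the test set:

# Worked check of the quoted metrics from the confusion-matrix counts above
tp, fn = 425, 21          # fraud caught vs. fraud missed
fp, tn = 1375, 83081      # legitimate flagged vs. legitimate passed

recall = tp / (tp + fn)               # roughly 0.953
false_positive_rate = fp / (fp + tn)  # roughly 0.016
print(f"recall = {recall:.3f}, false positive rate = {false_positive_rate:.3f}")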

The SHAP chart ranks each variable by its impact on the model, from most influential at the top to least as you move down. The amount of the transaction and the hour it was made had the biggest impact on the fraud risk score. This echoes the analysis at the beginning of this report: the higher the transaction amount, the more likely it is to be fraudulent.

This reinforces how the individual SHAP contributions combine into the overall fraud risk score; recall that a threshold of 0.20 gave the best results for this model. Lastly, to help explain which variables pushed this model to champion status, the heat map below shows how each one tends to indicate fraud or a legitimate transaction.

Conclusion

The scope of the project was to build a model that captures the most fraudulent transactions without worrying about how many legitimate transactions are flagged incorrectly. The gradient boosting model, with the parameters described above and a threshold of 0.20, captured over 95% of all fraudulent transactions. It flagged about 2% of legitimate transactions as fraudulent, but the institution is more concerned with catching fraud, and 2% does not cause much concern. The average fraudulent transaction amount in the dataset was 10,879.47; the other candidate models flagged fewer legitimate transactions but had lower recall. As the business moves forward with the model, it can add safeguards beyond reviewing transactions as they happen. For example, the institution could place a $500 purchase limit on all credit cards, requiring the client to call in for a temporary increase, since roughly one in five transactions over that amount is fraudulent. Revisiting the data for new insights daily, weekly, and monthly will further limit the institution's exposure to fraud.

Data Dictionary

trans_date_trans_time: Transaction DateTime
merchant: Merchant Name
category: Category of Merchant
amt: Amount of Transaction
city: City of Credit Card Holder
state: State of Credit Card Holder
lat: Latitude Location of Purchase
long: Longitude Location of Purchase
city_pop: Credit Card Holder's City Population
job: Job of Credit Card Holder
dob: Date of Birth of Credit Card Holder
trans_num: Transaction Number
merch_lat: Latitude Location of Merchant
merch_long: Longitude Location of Merchant
is_fraud: Whether Transaction is Fraud (1) or Not (0)