Fraud Detection: Feature Engineering and Next Steps

Feature Engineering for Credit Card Fraud Detection

The Federal Trade Commission defines credit card fraud, a type of identity fraud, as the use of a person’s identifying information to open a new credit card account or to make charges to that person’s existing credit card account without their permission.

In 2022, financial institutions submitted more than 3.6 million suspicious activity reports (SARs) to the U.S. Treasury’s Financial Crimes Enforcement Network. According to a Thomson Reuters Institute report, SAR filings in March 2023 set a monthly record, with more than 351,000 reports, a sign that potentially fraudulent activity will continue to surge. The latest Consumer Sentinel Network report highlights the upward trend: total reported losses from credit card fraud reached $246M, as shown in the chart below.

Source: https://www.ftc.gov/system/files/ftc_gov/pdf/CSN-Annual-Data-Book-2023.pdf

Therefore, the ability to detect fraud accurately protects customers' peace of mind and can prevent massive financial losses.

The quality of predictions depends heavily on the data and features used. In this project, we take raw credit card data with standard features and engineer additional features to assist with fraud prediction.

Step 0: Import Libraries ⏳

The geopy library is a Python client for several popular geocoding web services. Geocoding is the process of converting addresses into geographic coordinates (latitude and longitude), which can be used to place markers on a map or to position the map.
geopy makes it easy for Python developers to locate the coordinates of addresses, cities, countries, and landmarks across the globe using third-party geocoders and other data sources. It includes geocoder classes for OpenStreetMap Nominatim, the Google Geocoding API (V3), and many other geocoding services.

%%capture
!pip install geopy

# Basic operations
import numpy as np
import pandas as pd

from datetime import date
from geopy import distance

# Visualizations
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
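
The `distance` module imported above suggests that distances between coordinate pairs (for example, `lat`/`long` versus `merch_lat`/`merch_long`) will be computed later on. As a rough, dependency-free sketch of what such a calculation involves, the haversine formula below gives the great-circle distance on a spherical Earth. The function name `haversine_km` and the sample coordinates are hypothetical, and the spherical model is a simplification: geopy's `distance.geodesic` uses an ellipsoidal model and will return slightly different values.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; a spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (
        math.sin(d_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    )
    # Clamp to 1.0 to guard against floating-point overshoot near antipodal points
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


# Hypothetical cardholder and merchant coordinates (not taken from the dataset)
cardholder = (40.7128, -74.0060)
merchant = (40.7306, -73.9352)
print(f"{haversine_km(*cardholder, *merchant):.2f} km")
```

A large distance between a cardholder's location and the merchant is one candidate signal for a fraud model, which is why a distance helper is worth having at hand.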

Data Dictionary

trans_date_trans_time	Transaction DateTime
cc_number	Credit Card Number
merchant	Merchant Name
category	Category of Merchant
amt	Amount of Transaction
first	First Name of Credit Card Holder
last	Last Name of Credit Card Holder
gender	Gender of Credit Card Holder
street	Street of Credit Card Holder
city	City of Credit Card Holder
state	State of Credit Card Holder
zip	Zipcode of Credit Card Holder
lat	Latitude Location of Purchase
long	Longitude Location of Purchase
city_pop	Credit Card Holder's City Population
job	Job of Credit Card Holder
dob	Date of Birth of Credit Card Holder
trans_num	Transaction Number
unix_time	Timestamp
merch_lat	Latitude Location of Merchant
merch_long	Longitude Location of Merchant
merch_country	Merchant Country
is_fraud	Whether Transaction is Fraud (1) or Not (0)

Step 1: Inspect the Test Dataset 🔎

1.1. Read the data

The credit card dataset contains typical raw credit card transaction features such as the transaction time, the credit card number, the merchant, the amount spent, and customer details (see Bahnsen et al., 2016 for a list of common features).

# Set path to data
path = "data/fraud_data.csv"

# Specify the transaction time column
trans_time = "trans_date_trans_time"

# Specify any additional date columns
date_cols = ["dob", trans_time]

# Read in the data as a DataFrame and set the index
fraud_df = pd.read_csv(path, parse_dates=date_cols, index_col=trans_time).sort_index()

# Preview the data
fraud_df.head(5)
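
To see what `parse_dates` and `index_col` do in the call above, here is a minimal sketch on a two-row in-memory CSV. The sample rows are made up for illustration, the name `sample_df` is hypothetical, and only a few of the columns from the data dictionary are included.

```python
import io

import pandas as pd

# Made-up sample rows, deliberately out of chronological order
sample_csv = io.StringIO(
    "trans_date_trans_time,amt,dob,is_fraud\n"
    "2019-01-02 08:15:00,12.50,1985-06-01,0\n"
    "2019-01-01 23:59:00,230.10,1990-11-23,1\n"
)

sample_df = pd.read_csv(
    sample_csv,
    parse_dates=["dob", "trans_date_trans_time"],
    index_col="trans_date_trans_time",
).sort_index()

# The index is a DatetimeIndex, sorted chronologically by sort_index()
print(type(sample_df.index).__name__)
print(sample_df.index.is_monotonic_increasing)

# dob is parsed to a datetime column rather than left as strings
print(sample_df["dob"].dtype)
```

Indexing by transaction time and sorting up front makes later time-based operations, such as rolling windows per card, simpler to express.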

1.2. Inspect the features and data types

The first step is to inspect the columns and review the data type of each column.
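
One common way to carry out this inspection is to combine `DataFrame.dtypes` with a missing-value count. The sketch below uses a small made-up frame named `demo_df` (a hypothetical name) rather than `fraud_df`, so that it runs on its own; the same calls should apply to the full dataset.

```python
import pandas as pd

# Made-up rows using a few column names from the data dictionary
demo_df = pd.DataFrame(
    {
        "category": ["grocery_pos", "gas_transport", "grocery_pos"],
        "amt": [12.5, 40.0, 7.25],
        "dob": pd.to_datetime(["1985-06-01", "1990-11-23", "1985-06-01"]),
        "is_fraud": [0, 1, 0],
    }
)

# One row per column: its data type and its number of missing values
summary = pd.DataFrame(
    {"dtype": demo_df.dtypes.astype(str), "n_missing": demo_df.isna().sum()}
)
print(summary)
```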


# Display unique values and their counts for each column
columns = ['category', 'customer_country', 'state', 'merchant_country']
for column in columns:
    print(f"Unique values in {column}: {fraud_df[column].unique()}")
    print(f"Unique value count in {column}: {fraud_df[column].nunique()}")