
Feature Engineering for Credit Card Fraud Detection

The Federal Trade Commission defines credit card fraud as a type of identity fraud: the use of a person’s identifying information to open a new credit card account, or to make charges to a person’s existing credit card account, without their permission.

In 2022, financial institutions submitted more than 3.6 million suspicious activity reports (SARs) to the U.S. Treasury’s Financial Crimes Enforcement Network. According to a Thomson Reuters Institute report, SAR filings in March 2023 set a monthly record of more than 351,000 reports, a sign that potentially fraudulent activity will continue to surge. The latest Consumer Sentinel Network report highlights the upward trend: total losses from credit card fraud reached $246M, as shown in the chart from the source below.

Source: https://www.ftc.gov/system/files/ftc_gov/pdf/CSN-Annual-Data-Book-2023.pdf

Therefore, the ability to detect fraud accurately protects customers' peace of mind and can prevent massive financial losses.

The quality of predictions depends heavily on the data and features used. In this project, we will take raw credit card data with standard features and engineer additional features to assist with fraud prediction.

Step 0: Import Libraries ⏳

  • The geopy library is a Python client for several popular geocoding web services. Geocoding is the process of converting addresses into geographic coordinates (latitude and longitude), which can be used to place markers on a map or to position the map.
  • geopy makes it easy for Python developers to locate the coordinates of addresses, cities, countries, and landmarks across the globe using third-party geocoders and other data sources. It includes geocoder classes for OpenStreetMap Nominatim, the Google Geocoding API (V3), and many other geocoding services.
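Besides geocoding, geopy ships a distance module (geopy.distance.geodesic and geopy.distance.great_circle) for measuring how far apart two coordinate pairs are. To make the idea concrete without a network call or an extra dependency, here is a minimal sketch of a great-circle distance using the haversine formula. It is a stand-in for illustration, not geopy's implementation; the function name haversine_km, the Earth radius constant, and the sample coordinates are all assumptions of this sketch.

```python
import math

# Mean Earth radius in kilometres (an assumption of this sketch)
EARTH_RADIUS_KM = 6371.0088


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    # Haversine formula: a is the squared half-chord length between the points
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Hypothetical purchase and merchant coordinates (latitude, longitude)
purchase = (40.7128, -74.0060)
merchant = (34.0522, -118.2437)
print(round(haversine_km(*purchase, *merchant), 1))
```

A spherical model like this is usually within a fraction of a percent of an ellipsoidal (geodesic) distance, which is likely close enough for a coarse "how far from the merchant was this purchase" signal.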
%%capture
!pip install geopy
# Basic operations
import numpy as np
import pandas as pd

from datetime import date
from geopy import distance

# Visualizations
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns

Data Dictionary

Column                   Description
trans_date_trans_time    Transaction DateTime
cc_number                Credit Card Number
merchant                 Merchant Name
category                 Category of Merchant
amt                      Amount of Transaction
first                    First Name of Credit Card Holder
last                     Last Name of Credit Card Holder
gender                   Gender of Credit Card Holder
street                   Street of Credit Card Holder
city                     City of Credit Card Holder
state                    State of Credit Card Holder
zip                      Zipcode of Credit Card Holder
lat                      Latitude Location of Purchase
long                     Longitude Location of Purchase
city_pop                 Credit Card Holder's City Population
job                      Job of Credit Card Holder
dob                      Date of Birth of Credit Card Holder
trans_num                Transaction Number
unix_time                Timestamp
merch_lat                Latitude Location of Merchant
merch_long               Longitude Location of Merchant
merch_country            Merchant Country
is_fraud                 Whether Transaction is Fraud (1) or Not (0)

Step 1: Inspect the Test Dataset 🔎

1.1. Read the data

The credit card dataset contains typical raw credit card transaction features such as the transaction time, the credit card number, the merchant, the amount spent, and customer details (see Bahnsen et al., 2016 for a list of common features).

# Set path to data
path = "data/fraud_data.csv"

# Specify the transaction time column
trans_time = "trans_date_trans_time"

# Specify any additional date columns
date_cols = ["dob", trans_time]

# Read in the data as a DataFrame and set the index
fraud_df = pd.read_csv(path, parse_dates=date_cols, index_col=trans_time).sort_index()

# Preview the data
fraud_df.head(5)

1.2. Inspect the features and data types

The first step is to inspect the columns and review the data type of each column.
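As a rough sketch of what such an inspection can look like, the snippet below loads a tiny in-memory sample with the same read_csv options used in 1.1 and then checks the column data types and missing values. The sample rows, the sample_csv and sample_df names, and the choice of columns are hypothetical and stand in for the real file.

```python
import io

import pandas as pd

# Hypothetical two-row sample standing in for data/fraud_data.csv
sample_csv = io.StringIO(
    "trans_date_trans_time,category,amt,dob,is_fraud\n"
    "2019-01-02 08:30:00,grocery_pos,54.10,1985-06-15,0\n"
    "2019-01-01 23:15:00,shopping_net,912.40,1990-11-02,1\n"
)

# Same pattern as in 1.1: parse the date columns and index by transaction time
sample_df = pd.read_csv(
    sample_csv,
    parse_dates=["dob", "trans_date_trans_time"],
    index_col="trans_date_trans_time",
).sort_index()

# Data type of each column (dob should come back as a datetime column)
print(sample_df.dtypes)

# Number of missing values per column
print(sample_df.isna().sum())
```

On the full dataset, fraud_df.info() gives a similar summary in one call, including non-null counts and memory usage.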


# Display unique values and their counts for each column
columns = ['category', 'customer_country', 'state', 'merchant_country']
for column in columns:
    print(f"Unique values in {column}: {fraud_df[column].unique()}")
    print(f"Unique value count in {column}: {fraud_df[column].nunique()}")