Feature Engineering for Credit Card Fraud Detection
The Federal Trade Commission defines credit card fraud as a type of identity fraud: the use of a person's identifying information to open a new credit card account, or to make charges to that person's existing credit card account, without their permission.
In 2022, financial institutions submitted more than 3.6 million suspicious activity reports (SARs) to the U.S. Treasury's Financial Crimes Enforcement Network. According to a Thomson Reuters Institute report, SAR filings in March 2023 set a monthly record, with more than 351,000 reports, a sign that potentially fraudulent activity will continue to surge. The latest Consumer Sentinel Network report highlights the upward trend, putting total losses from credit card fraud at $246M, as shown in the chart below.
Accurate fraud detection therefore protects customers' peace of mind and can prevent massive financial losses.
The quality of predictions depends heavily on the data and features used. In this project, we will take raw credit card data with standard features and engineer additional features to help with fraud prediction.
Step 0: Import Libraries ⏳
- The geopy library is a Python client for several popular geocoding web services. Geocoding is the process of converting addresses into geographic coordinates (latitude and longitude), which can be used to place markers on a map or to position the map.
geopy makes it easy for Python developers to locate the coordinates of addresses, cities, countries, and landmarks across the globe using third-party geocoders and other data sources. It includes geocoder classes for OpenStreetMap Nominatim, the Google Geocoding API (V3), and many other geocoding services.
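Besides geocoding, geopy ships a `distance` module (the part this notebook imports) with calculations such as `geopy.distance.geodesic`. To illustrate what a distance between two coordinate pairs means without requiring any installation, here is a minimal sketch using the haversine formula, which is a simpler spherical approximation and not geopy's own method. The function name `haversine_km`, the radius constant, and the sample coordinates are assumptions made for this illustration.

```python
import math

# Mean Earth radius in kilometres; an assumed constant for this sketch
EARTH_RADIUS_KM = 6371.0088


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    # Haversine of the central angle between the two points
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Hypothetical coordinates, for illustration only
customer = (40.7128, -74.0060)   # New York City
merchant = (34.0522, -118.2437)  # Los Angeles
print(f"{haversine_km(*customer, *merchant):.0f} km")
```

The haversine result treats the Earth as a sphere, so it can differ slightly from an ellipsoidal geodesic distance over long ranges.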
%%capture
!pip install geopy

# Basic operations
import numpy as np
import pandas as pd
from datetime import date
from geopy import distance
# Visualizations
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
Data Dictionary
| Column | Description |
|---|---|
| trans_date_trans_time | Transaction DateTime |
| cc_number | Credit Card Number |
| merchant | Merchant Name |
| category | Category of Merchant |
| amt | Amount of Transaction |
| first | First Name of Credit Card Holder |
| last | Last Name of Credit Card Holder |
| gender | Gender of Credit Card Holder |
| street | Street of Credit Card Holder |
| city | City of Credit Card Holder |
| state | State of Credit Card Holder |
| zip | Zipcode of Credit Card Holder |
| lat | Latitude Location of Purchase |
| long | Longitude Location of Purchase |
| city_pop | Credit Card Holder's City Population |
| job | Job of Credit Card Holder |
| dob | Date of Birth of Credit Card Holder |
| trans_num | Transaction Number |
| unix_time | Timestamp |
| merch_lat | Latitude Location of Merchant |
| merch_long | Longitude Location of Merchant |
| merch_country | Merchant Country |
| is_fraud | Whether Transaction is Fraud (1) or Not (0) |
Step 1: Inspect the Test Dataset 🔎
1.1. Read the data
The credit card dataset contains typical raw credit card transaction features such as the transaction time, the credit card number, the merchant, the amount spent, and customer details (see Bahnsen et al., 2016 for a list of common features).
# Set path to data
path = "data/fraud_data.csv"
# Specify the transaction time column
trans_time = "trans_date_trans_time"
# Specify any additional date columns
date_cols = ["dob", trans_time]
# Read in the data as a DataFrame and set the index
fraud_df = pd.read_csv(path, parse_dates=date_cols, index_col=trans_time).sort_index()
# Preview the data
fraud_df.head(5)
1.2. Inspect the features and data types
The first step is to inspect the columns and review the data type of each column.
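One way such an inspection might look is sketched below. It does not reproduce the notebook's own cell; it loads a tiny made-up sample (all values hypothetical) with the same `read_csv` options used in step 1.1 and then prints the inferred dtypes, the index type, and the missing-value counts. The names `sample_csv` and `sample_df` are assumptions for this illustration.

```python
import io

import pandas as pd

# Tiny made-up sample standing in for data/fraud_data.csv (values are hypothetical)
sample_csv = io.StringIO(
    "trans_date_trans_time,cc_number,amt,dob,is_fraud\n"
    "2019-01-02 08:15:00,4000123412341234,12.50,1985-06-01,0\n"
    "2019-01-01 23:59:00,4000123412341234,980.00,1985-06-01,1\n"
)

# Same options as in step 1.1: parse the date columns and index by transaction time
sample_df = pd.read_csv(
    sample_csv,
    parse_dates=["dob", "trans_date_trans_time"],
    index_col="trans_date_trans_time",
).sort_index()

# Column names with their inferred dtypes
print(sample_df.dtypes)
# A DatetimeIndex confirms the transaction time was parsed as a date
print(type(sample_df.index).__name__)
# Missing values per column
print(sample_df.isna().sum())
```

On the real data, the same three calls can be run against `fraud_df`; columns that were expected to be dates but show up as `object` usually point to a parsing problem.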
# Display unique values and their counts for each column
columns = ['category', 'customer_country', 'state', 'merchant_country']
for column in columns:
    print(f"Unique values in {column}: {fraud_df[column].unique()}")
    print(f"Unique value count in {column}: {fraud_df[column].nunique()}")
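The loop above raises a `KeyError` if any listed column is absent from the DataFrame. As a hedged alternative, the sketch below collects the unique-value counts into a single table and skips columns that are not present. The helper `summarize_categories` and the demo data are hypothetical and exist only for this illustration.

```python
import pandas as pd


def summarize_categories(df, columns):
    """Return unique-value counts for the requested columns that exist in df."""
    present = [col for col in columns if col in df.columns]
    missing = [col for col in columns if col not in df.columns]
    if missing:
        print(f"Skipping columns not found: {missing}")
    # One row per column, with the number of distinct values
    return df[present].nunique().rename("n_unique").to_frame()


# Hypothetical demo data, for illustration only
demo_df = pd.DataFrame(
    {"category": ["grocery", "travel", "grocery"], "state": ["NY", "CA", "NY"]}
)
print(summarize_categories(demo_df, ["category", "state", "merchant_country"]))
```

A compact table like this makes it easier to spot high-cardinality columns before deciding how to encode them.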