
Part 1: About the Dataset - Credit Card Fraud

This dataset consists of credit card transactions in the western United States. It includes information about each transaction, such as customer details, the merchant and purchase category, and whether or not the transaction was fraudulent.

suppressPackageStartupMessages(library(tidyverse)) # load the core tidyverse quietly

# read in the transaction data and preview it
fraud <- read_csv('credit_card_fraud.csv', show_col_types = FALSE)
fraud
# Installing packages
install.packages('ggthemes')
install.packages('gt')
install.packages('cowplot')
install.packages('corrplot')
install.packages('skimr')
install.packages('tidygeocoder')
install.packages('ranger')
install.packages('glmnet')
install.packages('themis')
install.packages('lightgbm')
install.packages('bonsai')

Data Dictionary

  • trans_date_trans_time: Transaction DateTime
  • merchant: Merchant Name
  • category: Category of Merchant
  • amt: Amount of Transaction
  • city: City of Credit Card Holder
  • state: State of Credit Card Holder
  • lat: Latitude Location of Purchase
  • long: Longitude Location of Purchase
  • city_pop: Credit Card Holder's City Population
  • job: Job of Credit Card Holder
  • dob: Date of Birth of Credit Card Holder
  • trans_num: Transaction Number
  • merch_lat: Latitude Location of Merchant
  • merch_long: Longitude Location of Merchant
  • is_fraud: Whether Transaction is Fraud (1) or Not (0)

Source of dataset. The data was partially cleaned and adapted by DataCamp.

Don't know where to start?

Challenges are brief tasks designed to help you practice specific skills:

  • πŸ—ΊοΈ Explore: What types of purchases are most likely to be instances of fraud? Consider both product category and the amount of the transaction.
  • πŸ“Š Visualize: Use a geospatial plot to visualize the fraud rates across different states.
  • πŸ”Ž Analyze: Are older customers significantly more likely to be victims of credit card fraud?

Scenarios are broader questions to help you develop an end-to-end project for your portfolio:

A new credit card company has just entered the market in the western United States. The company is promoting itself as one of the safest credit cards to use. They have hired you as their data scientist in charge of identifying instances of fraud. The executive who hired you has provided you with data on credit card transactions, including whether or not each transaction was fraudulent.

The executive wants to know how accurately you can predict fraud using this data. She has stressed that the model should err on the side of caution: it is not a big problem to flag transactions as fraudulent when they aren't, just to be safe. In your report, you will need to describe how well your model functions and how it adheres to these criteria.

You will need to prepare a report that is accessible to a broad audience. It will need to outline your motivation, analysis steps, findings, and conclusions.

Part 2: Preparation, Setting Up

# Setting Up: Loading the packages installed above

# Code Block 2: Loading Libraries

# loading tidyverse/ tidymodels packages
library(tidyverse) #core tidyverse
library(tidymodels) # tidymodels framework
library(lubridate) # date/time handling

# visualization
library(viridis) #color scheme that is colorblind friendly
# library(ggthemes) # themes for ggplot
library(gt) # to make nice tables
library(cowplot) # to make multi-panel figures
library(corrplot) # nice correlation plot

#Data Cleaning
library(skimr) #provides overview of data and missingness

#Geospatial Data
library(tidygeocoder) #converts city/state to lat/long

#Modeling
library(ranger) # random forest
library(glmnet) # elastic net logistic regression
library(themis) # provides up/down-sampling methods for the data
library(lightgbm) # fast gradient-boosted machine algo
library(bonsai) # provides parsnip objects for tree-based models

Setting a global theme for our figures with cowplot to keep the visuals consistent:

#Setting global figure options
theme_set(theme_cowplot(12))
# preview the data loaded earlier
fraud

Part 3: Validate Data Types

I will validate the dataset using the skim() function from the skimr package. skim() returns a data frame, so its output can be manipulated and formatted more easily than the output of summary().

# Validation of Data Types Against Data Dictionary
# custom skim function to remove some of the quartile data
my_skim <- skim_with(numeric = sfl(p25 = NULL, p50 = NULL, p75 = NULL))

my_skim(fraud)

From the tables above, we can see that there are 339,607 records and 15 variables: 6 character, 1 date, 7 numeric, and 1 POSIXct (date-time). The character summary on its own is not very informative. The date variable (dob) shows the range between the oldest and youngest card holders' dates of birth, though this is not especially insightful yet. The most interesting variable is is_fraud, which is coded as 0 or 1: its mean is 0.00525, so only about 0.5% of transactions are fraudulent. This strong class imbalance means we should apply treatments for imbalanced classes when we reach the fitting/modelling stage.
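To make the imbalance explicit before modelling, the outcome can be tabulated directly (a minimal dplyr sketch, using the fraud tibble and the is_fraud column from the data dictionary above):

# Tabulate the outcome and compute the share of fraudulent transactions
fraud %>%
  count(is_fraud) %>%
  mutate(prop = n / sum(n))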

Part 4: A closer look into our variables

I will be analysing each variable to decide whether or not I should keep, transform or drop it. This is a combination of exploratory data analysis and feature engineering.

Here, I will consider the following questions:

  1. Should strings be converted to factors?
  2. Is date-time data properly encoded?
  3. Is financial data encoded numerically?
  4. Is geographic data consistently rendered? (city/state strings vs lat/long numeric pairs)

I will go through the analysis on a variable-type basis. As seen before, we have identified the following types of variables:

  1. strings/character
  2. geospatial data
  3. dates
  4. date/times
  5. numerical

Part 4.1.: Looking at the strings

Should strings be kept, transformed or dropped? Strings are usually not a useful format for classification problems. We can convert the strings below to more useful data types to enhance our exploratory data analysis (a short conversion sketch follows the lists below):

4.1.1. Strings to Factors

  • category: category of merchant
  • job: job of credit card holder

4.1.2. Strings as Strings

  • merchant: merchant name
  • trans_num: transaction number
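A minimal sketch of the factor conversion with dplyr, assuming the fraud tibble loaded above (in practice this step could also be handled later inside a tidymodels recipe):

# Convert the categorical strings we want to model on into factors
fraud <- fraud %>%
  mutate(
    category = factor(category), # category of merchant
    job      = factor(job)       # job of credit card holder
  )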

4.2. Strings to Geospatial Data

The dataset already contains geospatial data as lat/long pairs. I can also convert the city/state variables into lat/long pairs, which will let me compare them with the other geospatial variables and compute new features such as the distance of a transaction from the card holder's home location.
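A minimal geocoding sketch with tidygeocoder, assuming the OSM/Nominatim geocoder and hypothetical output columns home_lat/home_long. Geocoding each unique city/state pair once keeps the number of queries small, since Nominatim is rate-limited to roughly one query per second:

# Geocode each unique city/state pair once, then join the coordinates back
home_coords <- fraud %>%
  distinct(city, state) %>%
  geocode(city = city, state = state, method = "osm",
          lat = "home_lat", long = "home_long")

fraud <- fraud %>%
  left_join(home_coords, by = c("city", "state"))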

4.1.1. Exploring the factors: how compact are the categories?
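One way to gauge how compact these factors are is to count their distinct levels (a minimal dplyr sketch, using the fraud tibble loaded above):

# Count the distinct levels of each candidate factor
fraud %>%
  summarise(
    n_categories = n_distinct(category), # merchant categories
    n_jobs       = n_distinct(job)       # card holder job titles
  )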

β€Œ
β€Œ
β€Œ