Credit Card Fraud
This dataset consists of credit card transactions in the western United States. It includes information about each transaction including customer details, the merchant and category of purchase, and whether or not the transaction was a fraud.
Not sure where to begin? Scroll to the bottom to find challenges!
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
df = pd.read_csv('credit_card_fraud.csv')

Data Dictionary
| Variable | Description |
|---|---|
| trans_date_trans_time | Transaction DateTime |
| merchant | Merchant Name |
| category | Category of Merchant |
| amt | Amount of Transaction |
| city | City of Credit Card Holder |
| state | State of Credit Card Holder |
| lat | Latitude Location of Purchase |
| long | Longitude Location of Purchase |
| city_pop | Credit Card Holder's City Population |
| job | Job of Credit Card Holder |
| dob | Date of Birth of Credit Card Holder |
| trans_num | Transaction Number |
| merch_lat | Latitude Location of Merchant |
| merch_long | Longitude Location of Merchant |
| is_fraud | Whether Transaction is Fraud (1) or Not (0) |
Source of dataset. The data was partially cleaned and adapted by DataCamp.
Don't know where to start?
Challenges are brief tasks designed to help you practice specific skills:
- Explore: What types of purchases are most likely to be instances of fraud? Consider both product category and the amount of the transaction.
- Visualize: Use a geospatial plot to visualize the fraud rates across different states.
- Analyze: Are older customers significantly more likely to be victims of credit card fraud?
Scenarios are broader questions to help you develop an end-to-end project for your portfolio:
A new credit card company has just entered the market in the western United States. The company is promoting itself as one of the safest credit cards to use. They have hired you as their data scientist in charge of identifying instances of fraud. The executive who hired you has provided you with data on credit card transactions, including whether or not each transaction was fraudulent.
The executive wants to know how accurately you can predict fraud using this data. She has stressed that the model should err on the side of caution: it is not a big problem to flag transactions as fraudulent when they aren't just to be safe. In your report, you will need to describe how well your model functions and how it adheres to these criteria.
You will need to prepare a report that is accessible to a broad audience. It will need to outline your motivation, analysis steps, findings, and conclusions.
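The "err on the side of caution" requirement above amounts to favoring recall (the share of actual fraud that gets flagged) over precision (the share of flagged transactions that are actually fraud). A minimal sketch of that trade-off follows, assuming a model that outputs a fraud score between 0 and 1. The helper `recall_precision`, the labels, and the scores are all made up for illustration and are not taken from the dataset.

```python
import numpy as np

def recall_precision(y_true, scores, threshold):
    """Return (recall, precision) when scores >= threshold are flagged as fraud."""
    y_true = np.asarray(y_true).astype(bool)
    flagged = np.asarray(scores) >= threshold
    true_pos = np.sum(flagged & y_true)
    # Guard against division by zero when nothing is fraud or nothing is flagged.
    recall = true_pos / max(y_true.sum(), 1)
    precision = true_pos / max(flagged.sum(), 1)
    return float(recall), float(precision)

# Hypothetical labels (1 = fraud) and model scores, for illustration only.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.20, 0.45, 0.40, 0.60, 0.70, 0.90]

# Lowering the threshold flags more transactions: recall rises, precision falls.
for t in (0.5, 0.3):
    r, p = recall_precision(y_true, scores, t)
    print(f"threshold={t}: recall={r:.2f}, precision={p:.2f}")
```

On this toy data, dropping the threshold from 0.5 to 0.3 catches every fraudulent transaction at the cost of more false alarms, which is the kind of trade-off the executive describes as acceptable.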
Summary of Findings
Fraud prediction on 339,607 credit card transactions from January 1, 2019 to December 31, 2020 correctly identified 67.4% of the fraudulent transactions. State, transaction category, amount, and cardholder age were associated with fraud. Transactions in Alaska were fraudulent 1.7% of the time, while no other state exceeded 0.7%. Transactions categorized as "net_shopping", "grocery", and "misc_net" were more than twice as likely to be fraudulent as those in other categories (>1.2% vs. <0.6%). Cardholders aged 20 to 25 or older than 70 were more likely to have fraudulent transactions. No fraud was observed for transactions greater than $1,372. No correlation was found between merchant distance and fraud.
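Per-group fraud rates like the state and category figures above can be computed as the mean of `is_fraud` within each group, since the column is coded 0/1. The sketch below shows one way to do this. The helper name `fraud_rate_by` is hypothetical, and the toy frame holds made-up rows that only borrow the dataset's column names.

```python
import pandas as pd

def fraud_rate_by(frame, column):
    """Share of fraudulent transactions per value of `column`, highest first."""
    return frame.groupby(column)["is_fraud"].mean().sort_values(ascending=False)

# Tiny made-up frame using the dataset's column names (values are illustrative).
toy = pd.DataFrame({
    "state":    ["AK", "AK", "CA", "CA", "CA", "CA"],
    "category": ["misc_net", "grocery", "grocery", "misc_net", "grocery", "misc_net"],
    "is_fraud": [1, 0, 0, 0, 0, 1],
})

print(fraud_rate_by(toy, "state"))
print(fraud_rate_by(toy, "category"))
```

On the full dataset, the same call would be `fraud_rate_by(df, "state")` or `fraud_rate_by(df, "category")`.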
Data Exploration
The credit card dataset includes 15 variables and 339,607 entries. No fields are null, and the data types are appropriate except for the two date columns (trans_date_trans_time and dob), which are converted to datetime.
df.info()

df.trans_date_trans_time = pd.to_datetime(df.trans_date_trans_time)
df.dob = pd.to_datetime(df.dob)

df.head()

print(df.is_fraud.value_counts())
print(df.is_fraud.value_counts(normalize=True))

Initial EDA
Explore the data for initial insights. Merchant, category, and state all appear to have predictive power as categorical variables. Category and state can be one-hot encoded, but merchant has too many distinct values to encode this way. Large transaction amounts appear less likely to be fraudulent.
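The encoding decision described above can be sketched as follows: count the distinct values in each categorical column, then one-hot encode only the low-cardinality ones. This is an illustration on a made-up four-row frame, not the notebook's actual preprocessing. Dropping `merchant` outright is an assumption made here for brevity.

```python
import pandas as pd

# Made-up rows that reuse the dataset's column names (values are illustrative).
toy = pd.DataFrame({
    "merchant": ["m1", "m2", "m3", "m4"],
    "category": ["grocery", "misc_net", "grocery", "misc_net"],
    "state":    ["AK", "CA", "CA", "AK"],
    "amt":      [12.5, 80.0, 3.2, 45.0],
})

# Number of distinct values per categorical column; a high count means
# one-hot encoding would add a correspondingly large number of columns.
cardinality = toy[["merchant", "category", "state"]].nunique()
print(cardinality)

# One-hot encode the low-cardinality columns only; merchant is dropped here
# (an assumption for this sketch, other treatments are possible).
encoded = pd.get_dummies(toy.drop(columns="merchant"), columns=["category", "state"])
print(encoded.columns.tolist())
```

Running `df[["merchant", "category", "state"]].nunique()` on the real data would show how many columns each encoding would add.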
display(df[['amt']].describe().transpose())

print(df.trans_date_trans_time.max())
print(df.trans_date_trans_time.min())