Credit Card Fraud
This dataset consists of credit card transactions in the western United States. It includes information about each transaction, including customer details, the merchant and category of the purchase, and whether or not the transaction was fraudulent.
Not sure where to begin? Scroll to the bottom to find challenges!
import pandas as pd
ccf = pd.read_csv('credit_card_fraud.csv')
ccf.head(100)
Data Dictionary
| Column | Description |
|---|---|
| trans_date_trans_time | Transaction DateTime |
| merchant | Merchant Name |
| category | Category of Merchant |
| amt | Amount of Transaction |
| city | City of Credit Card Holder |
| state | State of Credit Card Holder |
| lat | Latitude Location of Purchase |
| long | Longitude Location of Purchase |
| city_pop | Credit Card Holder's City Population |
| job | Job of Credit Card Holder |
| dob | Date of Birth of Credit Card Holder |
| trans_num | Transaction Number |
| merch_lat | Latitude Location of Merchant |
| merch_long | Longitude Location of Merchant |
| is_fraud | Whether Transaction is Fraud (1) or Not (0) |
Source of dataset. The data was partially cleaned and adapted by DataCamp.
Don't know where to start?
Challenges are brief tasks designed to help you practice specific skills:
- Explore: What types of purchases are most likely to be instances of fraud? Consider both the product category and the amount of the transaction.
- Visualize: Use a geospatial plot to visualize fraud rates across different states (a starter sketch follows this list).
- Analyze: Are older customers significantly more likely to be victims of credit card fraud?
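For the visualize challenge, one possible starting point is a choropleth of the fraud rate per state. This is a minimal sketch that reuses the ccf frame loaded above and assumes plotly is available and that the state column holds two-letter US state abbreviations:

import plotly.express as px

# fraud rate per state: mean of the 0/1 is_fraud flag
state_fraud = ccf.groupby('state', as_index=False)['is_fraud'].mean()
state_fraud = state_fraud.rename(columns={'is_fraud': 'fraud_rate'})

# the choropleth assumes two-letter state codes such as 'CA' or 'WA'
fig = px.choropleth(
    state_fraud,
    locations='state',
    locationmode='USA-states',
    color='fraud_rate',
    scope='usa',
    title='Fraud rate by state of card holder',
)
fig.show()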
Scenarios are broader questions to help you develop an end-to-end project for your portfolio:
A new credit card company has just entered the market in the western United States. The company is promoting itself as one of the safest credit cards to use. They have hired you as their data scientist in charge of identifying instances of fraud. The executive who hired you has provided you with data on credit card transactions, including whether or not each transaction was fraudulent.
The executive wants to know how accurately you can predict fraud using this data. She has stressed that the model should err on the side of caution: it is not a big problem to flag legitimate transactions as fraudulent just to be safe. In your report, you will need to describe how well your model functions and how it adheres to these criteria.
You will need to prepare a report that is accessible to a broad audience. It will need to outline your motivation, analysis steps, findings, and conclusions.
Planned analyses:
- Fraud vs. product category, amount, merchant name, and city_pop (unsupervised learning, logistic regression, decision tree)
- dob vs. fraud (linear regression)
- Geospatial plot of the fraud distribution using the lat/long of the merchant and of the purchase location
- Fraud frequency vs. city, merchant name, merchant category, state, and job

Data cleaning
import pandas as pd
from datetime import date
# convert string columns to the category dtype
ccf[['merchant', 'category', 'city', 'state', 'job', 'trans_num']] = ccf[['merchant', 'category', 'city', 'state', 'job', 'trans_num']].astype('category')
#convert time
ccf['trans_date_trans_time'] = pd.to_datetime(ccf['trans_date_trans_time'])
ccf['dob'] = pd.to_datetime(ccf['dob'])
print(ccf['trans_date_trans_time'])
today = pd.to_datetime(date.today())
current_year = today.year
ccf['age'] = current_year - ccf['dob'].dt.year
ccf['year']=ccf['trans_date_trans_time'].dt.year
print(ccf['age'])
print(ccf.info())
print(ccf.isna().sum())
print()
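The age column above is measured relative to today's date. If the age at the time of the transaction is more relevant (for example, for the question about older customers), here is a minimal sketch; the column name age_at_trans is illustrative:

# approximate age in whole years at the time of each transaction
ccf['age_at_trans'] = (ccf['trans_date_trans_time'] - ccf['dob']).dt.days // 365
print(ccf[['dob', 'trans_date_trans_time', 'age_at_trans']].head())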
Fraud vs. product category, merchant name, state, city, and job
import seaborn as sns
import matplotlib.pyplot as plt
# select the categorical columns plus the fraud flag
X = ccf[['merchant', 'category', 'job', 'state', 'city', 'is_fraud']]
# count frauds for each level of a categorical column and plot the top 10
def fraud_count(col):
    # sum the 0/1 is_fraud flag to get the number of fraudulent transactions per level
    df = X.groupby(col, as_index=False, observed=True)['is_fraud'].sum()
    df = df.rename(columns={'is_fraud': 'fraud_count'})
    df = df.sort_values(by='fraud_count', ascending=False)
    df = df.head(10)
    print(df)
    # bar plot for each column, ordered from highest to lowest fraud count
    sns.barplot(data=df, x=col, y='fraud_count', order=df[col])
    plt.title(f'{col} fraud rank')
    plt.xticks(rotation=90)
    plt.show()
cols = ['merchant', 'category', 'job', 'state', 'city']
for col in cols:
    fraud_count(col)
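Raw fraud counts are partly driven by transaction volume, so a busy merchant or large state can rank high even with a low share of fraud. A complementary view is the fraud rate per level; this is a minimal sketch reusing the X frame from the cell above, and the helper name fraud_rate is illustrative:

# fraud rate (share of transactions that are fraudulent) per level of a categorical column
def fraud_rate(col):
    rate = X.groupby(col, as_index=False, observed=True)['is_fraud'].mean()
    rate = rate.rename(columns={'is_fraud': 'fraud_rate'})
    rate = rate.sort_values(by='fraud_rate', ascending=False).head(10)
    sns.barplot(data=rate, x=col, y='fraud_rate', order=rate[col])
    plt.title(f'{col} fraud rate (top 10)')
    plt.xticks(rotation=90)
    plt.show()

for col in ['category', 'state', 'job']:
    fraud_rate(col)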
Try supervised learning KNN to predict fraud
# supervised learning
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import ConfusionMatrixDisplay
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Assuming ccf is a DataFrame that has been defined earlier in the notebook
X = ccf[['amt', 'city_pop', 'year', 'age','lat','long']]
y = ccf['is_fraud']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=3)
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
score = knn.score(X_test, y_test)
y_pred = knn.predict(X_test)
conf = confusion_matrix(y_test, y_pred)
print(f'knn score {score}')
print(conf)
print(classification_report(y_test, y_pred))
disp = ConfusionMatrixDisplay(confusion_matrix=conf, display_labels=['Not Fraud', 'Fraud'])
disp.plot()
plt.show()
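KNN is distance-based, so an unscaled feature such as city_pop can dominate amt, age, and the coordinates. And since the executive considers false alarms acceptable, recall on the fraud class (the share of actual frauds that are caught) is worth reporting alongside accuracy. A minimal sketch using scikit-learn's StandardScaler and Pipeline with the same train/test split as above:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import recall_score

# standardize features before KNN so that no single column dominates the distance metric
scaled_knn = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier(n_neighbors=3)),
])
scaled_knn.fit(X_train, y_train)
y_pred_scaled = scaled_knn.predict(X_test)

print(f'accuracy: {scaled_knn.score(X_test, y_test):.3f}')
# recall on the fraud class (label 1): share of actual frauds the model catches
print(f'fraud recall: {recall_score(y_test, y_pred_scaled, pos_label=1):.3f}')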
Determine the optimum n_neighbors for KNN
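A minimal sketch for this: score a range of k values on the held-out test set and compare both accuracy and fraud-class recall (the range 1 to 20 is an arbitrary choice; cross-validation on the training set would give a more robust estimate):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import recall_score

neighbors = np.arange(1, 21)
test_accuracy = []
fraud_recall = []

# fit one model per value of n_neighbors and record test-set metrics
for k in neighbors:
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    test_accuracy.append(model.score(X_test, y_test))
    fraud_recall.append(recall_score(y_test, preds, pos_label=1))

plt.plot(neighbors, test_accuracy, label='test accuracy')
plt.plot(neighbors, fraud_recall, label='fraud recall')
plt.xlabel('n_neighbors')
plt.ylabel('score')
plt.legend()
plt.show()

print(f'best k by fraud recall: {neighbors[int(np.argmax(fraud_recall))]}')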