
Credit card fraud!

A new credit card company has just entered the market. The company is promoting itself as one of the safest credit cards to use. They have hired you as their data scientist in charge of identifying instances of fraud. The executive who hired you has provided you with data on credit card transactions, including whether or not each transaction was fraudulent.

The executive wants to know how accurately you can predict fraud using this data. She has stressed that the model should err on the side of caution: it is not a big problem to flag transactions as fraudulent when they aren't just to be safe. In your report, you will need to describe how well your model functions and how it adheres to these criteria.
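
In classification terms, "erring on the side of caution" roughly means prioritizing recall on the fraud class (few missed frauds) over precision (few false alarms). The following is a minimal sketch with made-up toy labels, not values drawn from the dataset, showing how the two metrics are computed:

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical toy labels for illustration only: 1 = fraud, 0 = not fraud
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
# A cautious model: it catches every fraud but raises two false alarms
y_pred = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

fraud_recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 3 / 3
fraud_precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 3 / 5
print(f"recall={fraud_recall:.2f}, precision={fraud_precision:.2f}")
```

Under the executive's criteria, a model like this one (perfect recall, modest precision) would be preferable to one that raises no false alarms but misses some fraud.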

The original source of the data (prior to preparation by DataCamp) can be found here, and the data dictionary can be found in the data_dictionary.ipynb file in your file browser!

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as pltx

data = pd.read_csv('credit_card_fraud.csv') 
data
data.info()
# drop_duplicates() returns a new DataFrame, so assign the result back
data = data.drop_duplicates()
data.isna().sum()
data[['amt', 'city_pop']].describe().transpose()
pltx.pie(data, names='is_fraud', 
         color='is_fraud', 
         color_discrete_map={1:'#acc8fc', 0:'#6f6cd4'}, 
         title='What percentage of fraud incidents are present in our dataset?')
# Histogram of transaction counts per category, split by fraud label
pltx.histogram(data, x='category', 
                color='is_fraud', 
                color_discrete_map={1:'#acc8fc', 0:'#6f6cd4'}, 
                title='Is category important in this equation?')
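# Relationship between cardholder age and fraud
import datetime as dt
# Age is approximated as of the current year (birth year only, not the age at transaction time)
data['age'] = dt.date.today().year - pd.to_datetime(data['dob']).dt.year
ax = sns.kdeplot(x='age', data=data, hue='is_fraud', common_norm=False)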
ax.set_xlabel('Credit Card Holder Age')
ax.set_ylabel('Density')
plt.xticks(np.arange(0,110,5))
plt.title('Age Distribution in Fraudulent vs Non-Fraudulent Transactions')
plt.legend(title='Type', labels=['Fraud', 'Not Fraud'])
# Select the features used for modelling and the target label
X = data[['category', 'amt', 'age', 'city_pop', 'job']]
y = data['is_fraud'].values
# One-hot encode the categorical columns (category, job)
X = pd.get_dummies(X, drop_first=True)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
# Random forest classifier (LogisticRegression is imported but not used below)
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, KFold
from sklearn.metrics import confusion_matrix, classification_report
# Fix the seed so results are reproducible, matching the seeded train/test split
model2 = RandomForestClassifier(random_state=42)
model2.fit(X_train,y_train)
predicted=model2.predict(X_test)
print('Classification report:\n', classification_report(y_test, predicted))
conf_mat = confusion_matrix(y_true=y_test, y_pred=predicted)
print('Confusion matrix:\n', conf_mat)
# Baseline: the accuracy of a model that always predicts "not fraud"
print('Share of Non-Fraud in Test Data:', round(1 - y_test.mean(), 4))
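
Because the brief treats false alarms as cheaper than missed fraud, one option is to flag a transaction whenever the predicted fraud probability exceeds a threshold lower than the usual 0.5. The sketch below illustrates the idea on synthetic data from `make_classification`, used as a stand-in so the example is self-contained; the names `X_demo`, `y_demo`, `clf`, and `recall_at`, and the threshold of 0.2, are illustrative assumptions and not part of the analysis above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced stand-in data (roughly 5% positives); NOT the fraud dataset
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, stratify=y_demo, random_state=42
)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_tr, y_tr)

# Predicted probability of the positive ("fraud") class for each test row
proba = clf.predict_proba(X_te)[:, 1]


def recall_at(threshold):
    """Recall on the positive class when flagging rows with probability >= threshold."""
    return recall_score(y_te, (proba >= threshold).astype(int))


# 0.5 is close to the default decision rule; 0.2 is an untuned, illustrative value
for t in (0.5, 0.2):
    flagged = (proba >= t).astype(int)
    precision = precision_score(y_te, flagged, zero_division=0)
    print(f"threshold={t}: recall={recall_at(t):.3f}, precision={precision:.3f}")
```

Lowering the threshold can only keep or increase the set of flagged rows, so recall never decreases, while precision typically falls. On the real data, the same pattern would apply to `model2.predict_proba(X_test)`, with the threshold chosen from the precision/recall trade-off the executive is willing to accept.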