Challenge

For this challenge we will use a fake credit card dataset, attached as df.csv, which contains information from a public Kaggle dataset plus three added fields: activated_date, last_payment_date and fraud.

First, we import the libraries we will use.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

We import the file, parsing the date columns we will use later.

df = pd.read_csv('Stori_Data_Challenge.xls', index_col=0, parse_dates=['activated_date', 'last_payment_date'])
df.head()
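
As a quick check that parse_dates does what we expect, the sketch below reads a tiny in-memory CSV and confirms both date columns come back as datetimes. The two rows are made up for illustration; only the column names match the dataset.

```python
import io
import pandas as pd

# Hypothetical two-row CSV; the values are invented, not taken from the dataset.
csv_text = (
    "id,activated_date,last_payment_date\n"
    "1,2019-10-06,2020-09-09\n"
    "2,2019-11-15,2020-07-04\n"
)

date_cols = ["activated_date", "last_payment_date"]
toy = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=date_cols)

# True only if every listed column was parsed to a datetime dtype.
dates_parsed = all(pd.api.types.is_datetime64_any_dtype(toy[c]) for c in date_cols)
print(toy.dtypes)
print(dates_parsed)
```

Running the same dtype check on df after loading would show whether the real file's dates were parsed or left as plain strings.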

We use the describe method to get summary statistics. We can see that there are 8950 rows. In the balance column, the minimum value is 0 and the maximum is 119043, which means the values are spread over a wide range.

df.describe()
df.dtypes
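
To put a number on how spread out a column is, we can compare the full range with the interquartile range. This is a minimal sketch on a small made-up series standing in for df.balance, so the printed values say nothing about the real data.

```python
import pandas as pd

# Hypothetical balances standing in for df.balance; not real data.
balance = pd.Series([0.0, 120.5, 870.0, 1500.0, 2300.0, 4800.0, 19000.0])

q1, median, q3 = balance.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1                                  # spread of the middle half of the values
value_range = balance.max() - balance.min()    # spread between the two extremes

print(f"median={median:.2f}, IQR={iqr:.2f}, range={value_range:.2f}")
```

When the range is much larger than the IQR, as in this toy series, a few large values dominate the spread.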

We check for missing values. We will use cash_advance later, so we impute its missing values with a forward fill (which is applied to every column).

df.isnull().sum()
df.ffill(inplace=True)
df.isnull().sum()
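
The second null count is worth running because forward fill copies the last valid value downward, so a missing value in the first row has nothing to copy from. This is a small sketch on a made-up column; the name mirrors the dataset, but the values are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; values are made up for illustration.
toy = pd.DataFrame({"cash_advance": [np.nan, 100.0, np.nan, np.nan, 250.0]})

filled = toy.ffill()   # each gap takes the last valid value above it

# The leading NaN has no earlier value, so it stays missing after the fill.
remaining = int(filled["cash_advance"].isnull().sum())
print(filled["cash_advance"].tolist())
print(remaining)
```

If the real data had such a leading gap, a different strategy (for example filling with the column median) would be needed for that row.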

Plot a histogram of the balance amount for all the customers.

Let's create a histogram to see how the balance is distributed. With this plot we can see that most people have less than 5000 in their balance accounts. A boxplot is another good way to visualize this.

sns.set_theme()
sns.displot(df.balance, bins=20)
plt.show()

In the boxplot, the blue box shows where most of the values are located. Here we can see that most of them are below 2500.

sns.boxplot(y=df.balance, width=0.3)
plt.show()
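
The claims read off the plots ("most below 5000", "most below 2500") can also be checked numerically by taking the mean of a boolean mask. This is a minimal sketch on made-up balances standing in for df.balance; the thresholds come from the text above, and the resulting shares are illustrative only.

```python
import pandas as pd

# Hypothetical balances standing in for df.balance; not real data.
balance = pd.Series(
    [0.0, 300.0, 950.0, 1800.0, 2400.0, 3100.0, 4900.0, 7200.0, 15000.0, 600.0]
)

# The mean of a boolean mask is the fraction of True values.
share_below_5000 = (balance < 5000).mean()
share_below_2500 = (balance < 2500).mean()

print(f"below 5000: {share_below_5000:.0%}, below 2500: {share_below_2500:.0%}")
```

Applied to df.balance, the same two lines would give the exact share of customers under each threshold.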