Bank Marketing
This dataset consists of direct marketing campaigns by a Portuguese banking institution using phone calls. The campaigns aimed to sell subscriptions to a bank term deposit (see variable y).
Not sure where to begin? Scroll to the bottom to find challenges!
import pandas as pd
df = pd.read_csv("bank-marketing.csv", sep=";")
Data Dictionary
| Column | Variable | Class |
|---|---|---|
| age | age of customer | numeric |
| job | type of job | categorical: "admin.","blue-collar","entrepreneur","housemaid","management","retired","self-employed","services","student","technician","unemployed","unknown" |
| marital | marital status | categorical: "divorced","married","single","unknown"; note: "divorced" means divorced or widowed |
| education | highest degree of customer | categorical: "basic.4y","basic.6y","basic.9y","high.school","illiterate","professional.course","university.degree","unknown" |
| default | has credit in default? | categorical: "no","yes","unknown" |
| housing | has housing loan? | categorical: "no","yes","unknown" |
| loan | has personal loan? | categorical: "no","yes","unknown" |
| contact | contact communication type | categorical: "cellular","telephone" |
| month | last contact month of year | categorical: "jan", "feb", "mar", ..., "nov", "dec" |
| day_of_week | last contact day of the week | categorical: "mon","tue","wed","thu","fri" |
| campaign | number of contacts performed during this campaign and for this client | numeric, includes last contact |
| pdays | number of days that passed by after the client was last contacted from a previous campaign | numeric; 999 means client was not previously contacted |
| previous | number of contacts performed before this campaign and for this client | numeric |
| poutcome | outcome of the previous marketing campaign | categorical: "failure","nonexistent","success" |
| emp.var.rate | employment variation rate - quarterly indicator | numeric |
| cons.price.idx | consumer price index - monthly indicator | numeric |
| cons.conf.idx | consumer confidence index - monthly indicator | numeric |
| euribor3m | euribor 3 month rate - daily indicator | numeric |
| nr.employed | number of employees - quarterly indicator | numeric |
| y | has the client subscribed to a term deposit? | binary: "yes","no" |
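
The data dictionary notes that pdays uses 999 to mean "not previously contacted". Left as a number, that sentinel would distort any mean or correlation computed on pdays. The sketch below shows one possible way to handle it, using a small made-up frame (`toy`, `previously_contacted`, and `pdays_clean` are hypothetical names, not columns of the dataset):

```python
import pandas as pd

# Hypothetical toy rows for illustration only (not taken from the real dataset)
toy = pd.DataFrame({"pdays": [999, 3, 999, 6], "previous": [0, 1, 0, 2]})

# Keep the information carried by the sentinel as an explicit flag
toy["previously_contacted"] = toy["pdays"] != 999

# Mask the sentinel as missing so it no longer behaves like a real day count
toy["pdays_clean"] = toy["pdays"].where(toy["pdays"] != 999)

print(toy)
# The mean now averages only the genuine day counts
print(toy["pdays_clean"].mean())
```

Whether to mask, flag, or bin the sentinel depends on the model used later; this is only one option.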
Source of dataset.
Citations:
- S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014
- S. Moro, R. Laureano and P. Cortez. Using Data Mining for Bank Direct Marketing: An Application of the CRISP-DM Methodology. In P. Novais et al. (Eds.), Proceedings of the European Simulation and Modelling Conference - ESM'2011, pp. 117-121, Guimaraes, Portugal, October, 2011. EUROSIS.
Don't know where to start?
Challenges are brief tasks designed to help you practice specific skills:
- Explore: What are the jobs of the people most likely to subscribe to a term deposit?
- Visualize: Create a plot to visualize the number of people subscribing to a term deposit by month.
- Analyze: What impact does the number of contacts performed during the last campaign have on the likelihood that a customer subscribes to a term deposit?
Scenarios are broader questions to help you develop an end-to-end project for your portfolio:
You work for a financial services firm. The past few campaigns have not gone as well as the firm would have hoped, and they are looking for ways to optimize their marketing efforts.
They have supplied you with data from a previous campaign and some additional metrics such as the consumer price index and consumer confidence index. They want to know whether you can predict the likelihood of subscribing to a term deposit. The manager would also like to know what factors are most likely to increase a customer's probability of subscribing.
You will need to prepare a report that is accessible to a broad audience. It should outline your motivation, steps, findings, and conclusions.
Exploratory Data Analysis
# Preview dataframe
print(df.head())
# To see how many columns and rows
df.shape
# To view dataframe
df
# Get summary info
df.info()
# To get descriptive summary
df.describe()
# Import necessary packages
import matplotlib.pyplot as plt
from scipy import stats
# To check for outliers we can use Z-scores
# Calculate the Z-scores
z_scores = stats.zscore(df['campaign'])
# Define outliers as Z-scores greater than 3 or less than -3
outliers = df[abs(z_scores) > 3]
print("Outliers based on Z-score:")
print(outliers)
# The Z-score check flags outliers. We keep them in df for now so the model can first be evaluated with outliers included.
# We can also identify the outliers using IQR
Q1 = df["campaign"].quantile(0.25)
Q3 = df["campaign"].quantile(0.75)
IQR = Q3 - Q1
# Define bounds
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
# Remove outliers
clean_df = df[(df['campaign'] >= lower) & (df['campaign'] <= upper)]
clean_df
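The Z-score rule and the IQR rule can disagree on skewed count data such as campaign, because a single extreme value inflates the standard deviation that the Z-score relies on. The sketch below compares the two rules on a small made-up series; the values are hypothetical and only illustrate the behaviour:

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical right-skewed contact counts (not real data)
toy_campaign = pd.Series([1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 8, 30], name="campaign")

# Rule 1: absolute Z-score greater than 3
z_flags = np.abs(stats.zscore(toy_campaign)) > 3

# Rule 2: values outside the 1.5 * IQR fences
q1, q3 = toy_campaign.quantile([0.25, 0.75])
iqr = q3 - q1
iqr_flags = (toy_campaign < q1 - 1.5 * iqr) | (toy_campaign > q3 + 1.5 * iqr)

print("Flagged by Z-score:", int(z_flags.sum()))
print("Flagged by IQR:", int(iqr_flags.sum()))
```

On this toy series the IQR rule flags more values than the Z-score rule, so it is worth checking how many rows each rule would drop before choosing one.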
# df.info() above suggests there are no null values
# Confirm by counting the nulls in each column
df.isnull().sum()
# No null values
# Look for duplicated rows
dup = df.duplicated()
print(df[dup]) # show the rows that repeat an earlier row
# There are duplicated rows
df = df.drop_duplicates() # remove duplicated rows
df
df['job'].unique() # check the entries under job
Visualize
Data visualization helps us see how individual variables are distributed across the target variable y.
# Let us group by job and subscription, and count the occurrences
j_y_distribution = df.groupby(['job', 'y']).size().unstack().fillna(0)
# Create a bar chart for the distribution
j_y_distribution.plot(kind='bar', stacked=False)
# Customizing plot
plt.title('Job vs Subscription')
plt.xlabel('Job')
plt.ylabel('Frequency')
plt.xticks(rotation=45, ha='right')
plt.legend(title='Subscribed (y)', bbox_to_anchor=(1.05, 1), loc='upper left')
# Show the plot
plt.show()
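Raw counts mostly reflect how many customers hold each job, so they do not directly show which jobs are most likely to subscribe. A subscription rate per job addresses that question more directly. Below is a minimal sketch on made-up rows (the `toy` frame is hypothetical); the same expression could be applied to df while y still holds "yes"/"no":

```python
import pandas as pd

# Hypothetical toy rows for illustration only
toy = pd.DataFrame({
    "job": ["student", "student", "retired", "retired",
            "blue-collar", "blue-collar", "blue-collar", "blue-collar"],
    "y": ["yes", "no", "yes", "yes", "no", "no", "no", "yes"],
})

# Share of "yes" within each job, highest first
rate_by_job = (
    toy["y"].eq("yes")
    .groupby(toy["job"])
    .mean()
    .sort_values(ascending=False)
)
print(rate_by_job)
```

Jobs with very few customers can show extreme rates by chance, so it helps to read the rates alongside the group sizes.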
# Let us look at other variables by creating a heatmap
# But first let us encode 'y' as 0/1 (binary label encoding, not one-hot encoding)
df['y'] = df['y'].map({'no': 0, 'yes': 1})
# Integer-code 'job' as well; the codes are arbitrary labels, so correlations involving 'job' are not meaningful
job_labels = ['housemaid', 'services', 'admin.', 'blue-collar', 'technician',
              'retired', 'management', 'unemployed', 'self-employed', 'unknown',
              'entrepreneur', 'student']
df['job'] = df['job'].map({job: code for code, job in enumerate(job_labels)})
import seaborn as sns
# Now let us create a heatmap of the numeric columns (text columns cannot be correlated)
sns.heatmap(df.corr(numeric_only=True), cmap="coolwarm", annot=True, fmt=".2f")
plt.title("Correlation Heatmap")
plt.show()
# Here duration shows a moderate positive correlation with y, while the other correlations are weak
# Now let us see the trend of customer subscriptions by creating a line graph
# First, put the months in calendar order
month_order = ['jan', 'feb', 'mar', 'apr', 'may', 'jun', 'jul', 'aug', 'sep', 'oct', 'nov', 'dec']
# Make month an ordered categorical so sorting and grouping follow the calendar
df['month'] = pd.Categorical(df['month'], categories = month_order, ordered = True)
# Filter the dataframe for only those who are subscribed
df_y = df[df['y'] == 1]
# Sort the month
df_y = df_y.sort_values('month')
# Count the number of subscriptions per month
subscription_counts = df_y.groupby(['month', 'y'], observed=False).size().unstack().fillna(0)
# Plot the monthly counts as a line
subscription_counts.plot(kind='line', figsize=(10, 6))
# Adding labels and title
plt.title('Monthly Subscription Trend')
plt.xlabel('Month')
plt.ylabel('Number of Subscriptions')
plt.legend(title='Subscribed (y)', bbox_to_anchor=(1.05, 1), loc='upper left')
# Displaying the plot
plt.tight_layout()
plt.show()
Analyze
# We want to identify whether y is associated with the other variables
# First, check whether the number of contacts in the last campaign affects customer subscription
# With two groups, a t-test (or one-way ANOVA) can compare the group means
# Both tests have assumptions that we need to check first
# Independence: subscribed and not-subscribed customers form two independent groups
import seaborn as sns
# To check for normality
# Extract the 2 groups
df_yes = df[df['y'] == 1]['campaign']
df_no = df[df['y'] == 0]['campaign']
# Shapiro-Wilk test (null hypothesis: the sample comes from a normal distribution)
# Note: SciPy warns that the p-value may be inaccurate for N > 5000, so also check the histograms below
shapiro_stat1, shapiro_p_value1 = stats.shapiro(df_yes)
print(f'Shapiro-Wilk Test: statistic = {shapiro_stat1}, p-value = {shapiro_p_value1}')
if shapiro_p_value1 > 0.05:
    # fail to reject the null hypothesis
    print("The 'Subscribed' group follows a normal distribution.")
else:
    # reject the null hypothesis
    print("The 'Subscribed' group does not follow a normal distribution.")
shapiro_stat2, shapiro_p_value2 = stats.shapiro(df_no)
print(f'Shapiro-Wilk Test: statistic = {shapiro_stat2}, p-value = {shapiro_p_value2}')
if shapiro_p_value2 > 0.05:
    # fail to reject the null hypothesis
    print("The 'Not subscribed' group follows a normal distribution.")
else:
    # reject the null hypothesis
    print("The 'Not subscribed' group does not follow a normal distribution.")
# Histogram
plt.figure(figsize=(8, 6))
sns.histplot(df_yes, kde=True)
plt.title('Histogram for Subscribed Group')
plt.show()
plt.figure(figsize=(8, 6))
sns.histplot(df_no, kde=True)
plt.title('Histogram for Not Subscribed Group')
plt.show()
# Since neither group follows a normal distribution, the normality assumption behind ANOVA and the t-test is not met
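When normality does not hold, a rank-based test is a common alternative for comparing two independent groups. The sketch below uses the Mann-Whitney U test from SciPy on simulated contact counts. The notebook itself does not run this test, and the two samples (`contacts_yes`, `contacts_no`) are hypothetical stand-ins for df_yes and df_no:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical skewed contact counts standing in for the two groups
contacts_yes = rng.geometric(p=0.5, size=200)  # fewer contacts on average
contacts_no = rng.geometric(p=0.3, size=200)   # more contacts on average

# Mann-Whitney U test: compares the two distributions using ranks,
# so it does not assume normality
u_stat, p_value = stats.mannwhitneyu(contacts_yes, contacts_no, alternative="two-sided")
print(f"Mann-Whitney U: statistic = {u_stat}, p-value = {p_value}")

if p_value < 0.05:
    print("The number of contacts differs between the two groups.")
else:
    print("No evidence that the number of contacts differs between the two groups.")
```

On the real data, df_yes and df_no could be passed in place of the simulated arrays. A significant result shows an association between the number of contacts and subscription, not that extra contacts cause the outcome.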