Non-Contractual Churn Analysis
For this churn analysis, I will use the BetaGeoFitter (the BG/NBD model) from the lifetimes library. The model estimates whether a customer is still active from their transaction history, mainly how often they have placed orders and how long it has been since their last order. For example, if a customer places an order every 20 days on average and their last transaction was 5 days ago, their probability of being active is very high. If it has been 80 days since their last transaction, that probability is much lower.
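To make the 20-day example concrete, here is a minimal sketch of the BG/NBD probability-alive formula in plain Python. The parameter values (r, alpha, a, b) are hypothetical and chosen only for illustration; they are not fitted to this dataset, and the helper name probability_alive is my own, not part of lifetimes.

```python
def probability_alive(frequency, recency, T, r, alpha, a, b):
    """BG/NBD probability that a customer is still active.

    frequency: number of repeat purchases
    recency:   customer age (in days) at the most recent purchase
    T:         customer age (in days) at the end of the observation period
    r, alpha, a, b: model parameters (normally estimated by fitting)
    """
    if frequency == 0:
        # With no repeat purchases the BG/NBD model gives a probability of 1.
        return 1.0
    # Odds that the customer has become inactive since the last purchase.
    odds_inactive = (a / (b + frequency - 1)) * (
        (alpha + T) / (alpha + recency)
    ) ** (r + frequency)
    return 1.0 / (1.0 + odds_inactive)


# Hypothetical parameter values, for illustration only (not fitted to this data).
params = dict(r=0.25, alpha=4.5, a=0.8, b=2.5)

# A customer with 5 repeat purchases, roughly one every 20 days, last seen at day 100.
p_recent = probability_alive(5, 100, 105, **params)  # 5 days since the last order
p_stale = probability_alive(5, 100, 180, **params)   # 80 days since the last order

print(f"5 days since last order:  {p_recent:.2f}")
print(f"80 days since last order: {p_stale:.2f}")
```

With these assumed parameters, the same purchase history gives a much lower probability once the gap since the last order grows well beyond the customer's usual interval.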
Import Data and Libraries
%%capture
!pip install lifetimes #For DataCamp Workspace Only
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
from lifetimes import BetaGeoFitter
from lifetimes.utils import calibration_and_holdout_data
from lifetimes.utils import summary_data_from_transaction_data
from lifetimes.plotting import plot_frequency_recency_matrix
from lifetimes.plotting import plot_probability_alive_matrix
from lifetimes.plotting import plot_period_transactions
from lifetimes.plotting import plot_history_alive
from lifetimes.plotting import plot_calibration_purchases_vs_holdout_purchases
import warnings
warnings.filterwarnings('ignore')
Source of dataset.
Citation: Daqing Chen, Sai Liang Sain, and Kun Guo, Data mining for the online retail industry: A case study of RFM model-based customer segmentation using data mining, Journal of Database Marketing and Customer Strategy Management, Vol. 19, No. 3, pp. 197-208, 2012 (Published online before print: 27 August 2012. doi: 10.1057/dbm.2012.17).
data = pd.read_csv("online_retail.csv")
Create a copy so the original data stays available if needed
df = data.copy()
df
The model needs four columns: order ID, customer ID, date, and sales amount. The code below computes the sales amount, drops every other column, and sums the line items so that each invoice becomes a single row.
df['SalesAmount'] = df['UnitPrice'] * df['Quantity']
df1 = df.drop(columns = ['StockCode', 'Description', 'Country', 'UnitPrice', 'Quantity'])
df1 = df1.groupby(by = ['InvoiceNo', 'CustomerID', 'InvoiceDate'], as_index = False).sum()
df1
df1.sort_values(by = 'InvoiceDate', ascending = False )
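As a small check on the aggregation step, the sketch below runs the same idea on a toy table. The rows are made up; only the column names match the dataset. It also parses InvoiceDate with pd.to_datetime first, on the assumption that the CSV stores dates as text, since text dates sort alphabetically rather than chronologically.

```python
import pandas as pd

# Toy line items (made-up values) using the dataset's column names.
lines = pd.DataFrame({
    "InvoiceNo": ["A1", "A1", "A2"],
    "CustomerID": [17850, 17850, 13047],
    "InvoiceDate": ["2010-12-01 08:26", "2010-12-01 08:26", "2011-01-15 10:00"],
    "UnitPrice": [2.50, 4.00, 1.25],
    "Quantity": [6, 2, 8],
})

# Parse the timestamps so that sorting is chronological, not alphabetical.
lines["InvoiceDate"] = pd.to_datetime(lines["InvoiceDate"])
lines["SalesAmount"] = lines["UnitPrice"] * lines["Quantity"]

# One row per invoice: sum only the sales amount within each invoice.
invoice_totals = (
    lines.groupby(["InvoiceNo", "CustomerID", "InvoiceDate"], as_index=False)["SalesAmount"]
    .sum()
)
print(invoice_totals)
```

Selecting SalesAmount before calling sum keeps the aggregation explicit, so any extra columns in the source file are not summed by accident.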
RFM Metrics
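# Summarize the invoice-level table (df1) into one row per customer
df_rfmt = summary_data_from_transaction_data(
    df1,
    customer_id_col='CustomerID',
    datetime_col='InvoiceDate',
    monetary_value_col='SalesAmount',
    observation_period_end='2011-09-09')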
df_rfmt
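The columns in this summary can be read as follows. The sketch below recomputes them by hand for one made-up customer, following my understanding of the library's documented defaults (daily periods, first purchase excluded from frequency and monetary value). It is an illustration of the definitions, not the library's implementation.

```python
from datetime import date

# Made-up transactions for one customer: (purchase date, sales amount).
transactions = [
    (date(2011, 1, 10), 30.0),
    (date(2011, 1, 10), 20.0),  # second invoice on the same day
    (date(2011, 2, 9), 40.0),
    (date(2011, 3, 11), 60.0),
]
observation_end = date(2011, 9, 9)

# Purchases on the same day count as one transaction, so total them per day.
daily_totals = {}
for day, amount in transactions:
    daily_totals[day] = daily_totals.get(day, 0.0) + amount

days = sorted(daily_totals)
first, last = days[0], days[-1]

frequency = len(days) - 1            # repeat purchase days only
recency = (last - first).days        # customer age at the most recent purchase
T = (observation_end - first).days   # customer age at the end of the observation period

# Average daily total over repeat purchases (the first purchase is left out).
repeat_totals = [daily_totals[d] for d in days[1:]]
monetary_value = sum(repeat_totals) / len(repeat_totals) if repeat_totals else 0.0

print(frequency, recency, T, monetary_value)
```

Note that recency here is measured from the first purchase to the last one, not from the last purchase to today. The time since the last order is T minus recency.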
Distribution of RFM Metrics
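# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the replacement
ax = sns.histplot(df_rfmt['recency'], kde=True, stat='density')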