Skip to content
Course Notes: Customer Segmentation in Python
Course Notes
Use this workspace to take notes, store code snippets, or build your own interactive cheatsheet! For courses that use data, the datasets will be available in the datasets
folder.
Chapter 1: Cohort Analysis
- Cohort: A group of individuals who share a common characteristic or experience within a specific timeframe
- Compare metrics across product life cycle and customer life cycle Type of Cohorts:
- Time cohorts
- Behavior cohorts: Type of product/service they signed up -> Understanding needs
- Size cohorts: Amount of spending, product type customer ordered the most Element of cohort analysis:
- Pivot table
- Assigned cohort in rows
- Cohort index in columns
- Metrics in the table
Time Cohorts
Segment customers into acquisition cohorts based on the month they made their first purchase. Then, assign cohort index to each purchase of the customer -> represent the number of months since the last transaction.
Cohort Analysis Heatmap
- Rows: First activity
- Columns: Time since first activity
import pandas as pd
from datetime import datetime as dt
online = pd.read_csv("online.csv")
online.head()
# Import datetime module
import datetime as dt
# Define a function that will parse the date
def get_day(x):
x = dt.datetime.strptime(x, '%Y-%m-%d %H:%M:%S') # convert string to datetime object
return dt.datetime(x.year, x.month, x.day)
# Create InvoiceDay column
online['InvoiceDay'] = online['InvoiceDate'].apply(get_day)
# Group by CustomerID and select the InvoiceDay value
grouping = online.groupby('CustomerID')['InvoiceDay']
# Assign a minimum InvoiceDay value to the dataset
online['CohortDay'] = grouping.transform('min')
online.head()
def get_date_int(df, column):
year = df[column].dt.year
month = df[column].dt.month
day = df[column].dt.day
return year, month, day
# Get the integers for date parts from the `InvoiceDay` column
invoice_year, invoice_month, invoice_day = get_date_int(online, 'InvoiceDay')
# Get the integers for date parts from the `CohortDay` column
cohort_year, cohort_month, cohort_day = get_date_int(online, 'CohortDay')
# Calculate difference in years
years_diff = invoice_year - cohort_year
# Calculate difference in months
months_diff = invoice_month - cohort_month
# Calculate difference in days
days_diff = invoice_day - cohort_day
# Extract the difference in days from all previous values
online['CohortIndex'] = years_diff * 365 + months_diff * 30 + days_diff + 1
online.head()
Cohort Matrics
The retention measures how many customers from each of the cohort have returned in the subsequent months Customer retention: Percentage of active customers out of total customers
grouping = online.groupby(['CohortDay', 'CohortIndex'])
# Count the number of unique values per customer ID
cohort_data = grouping['CustomerID'].apply(pd.Series.nunique).reset_index()
# Create a pivot
cohort_counts = cohort_data.pivot(index='CohortDay', columns='CohortIndex', values='CustomerID')
# Select the first column and store it to cohort_sizes
cohort_sizes = cohort_counts.iloc[:,0]
# Divide the cohort count by cohort sizes along the rows
retention = cohort_counts.divide(cohort_sizes, axis=0)
# Create a groupby object and pass the monthly cohort and cohort index as a list
grouping = online.groupby(['CohortDay', 'CohortIndex'])
# Calculate the average of the unit price column
cohort_data = grouping['UnitPrice'].mean()
# Reset the index of cohort_data
cohort_data = cohort_data.reset_index()
# Create a pivot
average_price = cohort_data.pivot(index='CohortDay', columns='CohortIndex', values='UnitPrice')
print(average_price.round(1))
Cohort Analysis Visualization
Heatmap
- Easiest way to visualize cohort analysis
- Include both data and visuals
- Only few lines of codes with seaborn
# Import seaborn package as sns
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
average_quantity = pd.read_csv("average_quantity.csv", index_col=0)
# Initialize an 8 by 6 inches plot figure
plt.figure(figsize=(8, 6))
# Add a title
plt.title('Average Spend by Monthly Cohorts')
# Create the heatmap
sns.heatmap(average_quantity, annot=True, cmap='Blues')
plt.show()
Chapter 2: RFM Segmentation
Recency, Frequency, and Monetary
- Calculate RFM scores
- Grouping RFM values
The RFM values can be grouped in several ways:
- Percentiles e.g. quantiles
- Pareto 80/20
- Custom-based on business knowledge
Short reviews of perccentiles
- Sort customers based on that metric
- Break customers into a pre-defined number of groups of equal size
- Assign a label to each group
Calculating percentiles with Python
- pd.qcut(data, q = ..., labels = ...) Assigning values: Highest score to the best metric - best is not always highest e.g. recency
# Spend Quartile
import pandas as pd
data = pd.read_csv("data.csv")
# Create a spend quartile with 4 groups - a range between 1 and 5
spend_quartile = pd.qcut(data['Spend'], q=4, labels=range(1,5))
# Assign the quartile values to the Spend_Quartile column in data
data['Spend_Quartile'] = spend_quartile
# Print data with sorted Spend values
print(data.sort_values('Spend'))