Skip to content
New Workbook
Sign up
Course Notes: Customer Segmentation in Python

Course Notes

Use this workspace to take notes, store code snippets, or build your own interactive cheatsheet! For courses that use data, the datasets will be available in the datasets folder.

Chapter 1: Cohort Analysis

  • Cohort: A group of individuals who share a common characteristic or experience within a specific timeframe
  • Compare metrics across product life cycle and customer life cycle Type of Cohorts:
  • Time cohorts
  • Behavior cohorts: Type of product/service they signed up -> Understanding needs
  • Size cohorts: Amount of spending, product type customer ordered the most Element of cohort analysis:
  • Pivot table
  • Assigned cohort in rows
  • Cohort index in columns
  • Metrics in the table

Time Cohorts

Segment customers into acquisition cohorts based on the month they made their first purchase. Then, assign cohort index to each purchase of the customer -> represent the number of months since the last transaction.

Cohort Analysis Heatmap

  • Rows: First activity
  • Columns: Time since first activity
import pandas as pd
from datetime import datetime as dt
online = pd.read_csv("online.csv")

online.head()
# Import datetime module
import datetime as dt

# Define a function that will parse the date
def get_day(x): 
    x = dt.datetime.strptime(x, '%Y-%m-%d %H:%M:%S') # convert string to datetime object
    return dt.datetime(x.year, x.month, x.day) 

# Create InvoiceDay column
online['InvoiceDay'] = online['InvoiceDate'].apply(get_day) 

# Group by CustomerID and select the InvoiceDay value
grouping = online.groupby('CustomerID')['InvoiceDay'] 

# Assign a minimum InvoiceDay value to the dataset
online['CohortDay'] = grouping.transform('min')

online.head()
def get_date_int(df, column):
    year = df[column].dt.year
    month = df[column].dt.month
    day = df[column].dt.day
    return year, month, day
# Get the integers for date parts from the `InvoiceDay` column
invoice_year, invoice_month, invoice_day = get_date_int(online, 'InvoiceDay')

# Get the integers for date parts from the `CohortDay` column
cohort_year, cohort_month, cohort_day = get_date_int(online, 'CohortDay')
# Calculate difference in years
years_diff = invoice_year - cohort_year

# Calculate difference in months
months_diff = invoice_month - cohort_month

# Calculate difference in days
days_diff = invoice_day - cohort_day

# Extract the difference in days from all previous values
online['CohortIndex'] = years_diff * 365 + months_diff * 30 + days_diff + 1

online.head()

Cohort Matrics

The retention measures how many customers from each of the cohort have returned in the subsequent months Customer retention: Percentage of active customers out of total customers

grouping = online.groupby(['CohortDay', 'CohortIndex'])
# Count the number of unique values per customer ID
cohort_data = grouping['CustomerID'].apply(pd.Series.nunique).reset_index()

# Create a pivot 
cohort_counts = cohort_data.pivot(index='CohortDay', columns='CohortIndex', values='CustomerID')

# Select the first column and store it to cohort_sizes
cohort_sizes = cohort_counts.iloc[:,0]

# Divide the cohort count by cohort sizes along the rows
retention = cohort_counts.divide(cohort_sizes, axis=0)
# Create a groupby object and pass the monthly cohort and cohort index as a list
grouping = online.groupby(['CohortDay', 'CohortIndex']) 

# Calculate the average of the unit price column
cohort_data = grouping['UnitPrice'].mean()

# Reset the index of cohort_data
cohort_data = cohort_data.reset_index()

# Create a pivot 
average_price = cohort_data.pivot(index='CohortDay', columns='CohortIndex', values='UnitPrice')
print(average_price.round(1))

Cohort Analysis Visualization

Heatmap

  • Easiest way to visualize cohort analysis
  • Include both data and visuals
  • Only few lines of codes with seaborn
# Import seaborn package as sns
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

average_quantity = pd.read_csv("average_quantity.csv", index_col=0)

# Initialize an 8 by 6 inches plot figure
plt.figure(figsize=(8, 6))

# Add a title
plt.title('Average Spend by Monthly Cohorts')

# Create the heatmap
sns.heatmap(average_quantity, annot=True, cmap='Blues')
plt.show()

Chapter 2: RFM Segmentation

Recency, Frequency, and Monetary

  1. Calculate RFM scores
  2. Grouping RFM values The RFM values can be grouped in several ways:
    • Percentiles e.g. quantiles
    • Pareto 80/20
    • Custom-based on business knowledge

Short reviews of perccentiles

  • Sort customers based on that metric
  • Break customers into a pre-defined number of groups of equal size
  • Assign a label to each group

Calculating percentiles with Python

  • pd.qcut(data, q = ..., labels = ...) Assigning values: Highest score to the best metric - best is not always highest e.g. recency
# Spend Quartile
import pandas as pd
data = pd.read_csv("data.csv")

# Create a spend quartile with 4 groups - a range between 1 and 5
spend_quartile = pd.qcut(data['Spend'], q=4, labels=range(1,5))

# Assign the quartile values to the Spend_Quartile column in data
data['Spend_Quartile'] = spend_quartile

# Print data with sorted Spend values
print(data.sort_values('Spend'))