Customer Segmentation Report
Customer segmentation is a way to identify groups of similar customers. Customers can be segmented on a wide variety of characteristics, such as demographic information, purchase behavior, and attitudes. This template provides an end-to-end report for processing and segmenting customer purchase data using a K-means clustering algorithm. It also includes a snake plot and heatmap to visualize the resulting clusters and feature importance.
To use your data, the following criteria must be satisfied:
- Multiple numerical variables that you can use for clustering.
- No NaN/NA values. Impute missing values first if needed; a minimal example follows this list.
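If your data does contain missing values, the sketch below shows one simple way to check for them and apply a median imputation (it assumes a DataFrame named df like the one loaded in the next section; median fill is only one of several possible strategies):
# Count missing values in each column
print(df.isna().sum())
# One simple option: fill missing numeric values with each column's median
df = df.fillna(df.median(numeric_only=True))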
The placeholder dataset in this template consists of customer data, including purchase recency, frequency, and monetary value. Each row represents a different customer with a distinct customer ID.
1. Loading Packages and Inspecting the Data
The code below imports the packages necessary for data manipulation, visualization, pre-processing, and clustering. It also sets up the visualization style and loads in the data.
Finally, it inspects the data types and missing values with the .info() method from pandas.
# Load packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
# Set visualization style
sns.set_style("darkgrid")
# Load the data and replace with your CSV file path
df = pd.read_csv("data/customer_data.csv")
# Preview the data
df
# Check columns for data types and missing values
df.info()
2. Exploring the Data
Based on the evaluation above, you can select columns that you wish to inspect further. In this template, three of the four columns are selected; CustomerID is omitted because it is an identifier and not useful for clustering.
The code below reduces the DataFrame to the columns you wish to cluster on and then prints descriptive statistics using the describe() method from pandas.
Printing descriptive statistics is helpful because K-means clustering makes several key assumptions that this exploration can check:
- The variables are not skewed (their distributions are roughly symmetric).
- The variables have the same average values.
- The variables have the same variance.
If you'd like to learn more about pre-processing data for K-means clustering, you can refer to this video from the course Customer Segmentation in Python.
# Select columns for clustering
columns_for_clustering = ["Recency", "Frequency", "MonetaryValue"]
# Create new DataFrame with clustering variables
df_features = df[columns_for_clustering]
# Print a summary of descriptive statistics
df_features.describe()
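As a quick numeric check of the assumptions listed above, you can also compute the skewness, mean, and variance of each column directly (a minimal sketch using standard pandas methods; skewness values far from zero suggest a transformation is needed):
# Check skewness: values far from 0 indicate skewed variables
print(df_features.skew())
# Compare average values and variances across variables
print(df_features.mean())
print(df_features.var())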
The FacetGrid() class from seaborn creates a grid of histograms of the data to be clustered. It serves as a further exploration of the data to determine its skew and whether it needs transformation.
# Plot the distributions of the selected variables
g = sns.FacetGrid(
    df_features.melt(),  # Reformat the DataFrame for plotting purposes
    col="variable",  # Split on the 'variable' column created by reformatting
    sharey=False,  # Turn off shared y-axis
    sharex=False,  # Turn off shared x-axis
)
# Apply a histogram to the facet grid
g.map(sns.histplot, "value")
# Adjust the top of the plots to make room for the title
g.fig.subplots_adjust(top=0.8)
# Create a title
g.fig.suptitle("Unprocessed Variable Distributions", fontsize=16)
plt.show()
Before proceeding, it is crucial to ensure that all columns selected for clustering are numeric. The following code iterates through the reduced DataFrame and checks whether each column is numeric. If it returns True, then you can proceed with the pre-processing.
all([pd.api.types.is_numeric_dtype(df_features[col]) for col in columns_for_clustering])
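If this check returns False, one common fix (a sketch, not part of the original template) is to keep only the numeric columns, or to coerce convertible ones with pd.to_numeric, which turns unparseable entries into NaN that you would then need to impute or drop:
# Keep only the numeric columns
df_features = df_features.select_dtypes(include="number")
# Alternatively, coerce a specific column; unparseable values become NaN
# df_features["Frequency"] = pd.to_numeric(df_features["Frequency"], errors="coerce")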
3. Pre-processing the Data
Based on the grids above, if there is a skew, you will have to complete this step, which removes the skew and centers the variables. This is the case for the placeholder dataset used in this template and will likely be the case for your data.
- First, a log transformation is applied to the data using the numpy log() function. A log transformation unskews the data in preparation for clustering.
- Next, the StandardScaler() from sklearn.preprocessing fits and transforms the log-transformed data. This centers and scales the data in further preparation for clustering.
- Finally, a new DataFrame is created and visualized again to confirm the results.
# Perform a log transformation to unskew the data (all values must be positive)
df_log = np.log(df_features)
# Initialize a standard scaler and fit it
scaler = StandardScaler()
scaler.fit(df_log)
# Scale and center the data
df_normalized = scaler.transform(df_log)
# Create a pandas DataFrame of the processed data
df_processed = pd.DataFrame(
    data=df_normalized, index=df_features.index, columns=df_features.columns
)
# Plot the distributions of the selected variables
g = sns.FacetGrid(df_processed.melt(), col="variable")
g.map(sns.histplot, "value")
g.fig.subplots_adjust(top=0.8)
g.fig.suptitle("Preprocessed Variable Distributions", fontsize=16)
plt.show()
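Note that df_processed now lives in log-scaled, standardized units. If you later want to report cluster centers or summaries in the original units, a minimal sketch of the inverse transformation (assuming the scaler fitted above) is:
# Undo the scaling, then undo the log transformation
df_original_units = pd.DataFrame(
    data=np.exp(scaler.inverse_transform(df_processed)),
    index=df_processed.index,
    columns=df_processed.columns,
)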
4. Choosing the Number of Clusters
The next step is to fit K-means with a varying number of clusters and plot the sum of squared errors (SSE) for each cluster count. The SSE is the sum of squared distances from every data point to its assigned cluster center. The aim is to reduce the SSE while still maintaining a reasonable number of clusters.
By plotting the SSE for each number of clusters, you can identify at what point there are diminishing returns by adding new clusters. This type of plot is called an elbow plot.
In the code below, you can set the maximum number of clusters you want to plot, and a loop then generates the SSE for each number of clusters. Finally, the seaborn function pointplot() plots the SSE against the number of clusters. This allows you to identify the 'elbow', the point where each additional cluster brings only a marginal reduction in SSE.
# Set the maximum number of clusters to plot
max_clusters = 10
# Initialize empty dictionary to store sum of squared errors
sse = {}
# Fit KMeans and calculate SSE for each k
for k in range(1, max_clusters + 1):  # Include max_clusters itself
    # Initialize KMeans with k clusters
    kmeans = KMeans(n_clusters=k, random_state=1)
    # Fit KMeans on the normalized dataset
    kmeans.fit(df_processed)
    # Assign the sum of squared distances to element k of the dictionary
    sse[k] = kmeans.inertia_
# Initialize a figure of set size
plt.figure(figsize=(10, 4))
# Create an elbow plot of SSE values for each key in the dictionary
sns.pointplot(x=list(sse.keys()), y=list(sse.values()))
# Add labels to the plot
plt.title("Elbow Method Plot", fontsize=16) # Add a title to the plot
plt.xlabel("Number of Clusters") # Add x-axis label
plt.ylabel("SSE") # Add y-axis label
# Show the plot
plt.show()
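Once you have read the elbow off the plot, you can fit the final model and attach the cluster labels to the customers. A minimal sketch, assuming a hypothetical choice of three clusters (substitute the elbow you observe in your own data):
# Fit the final model with the chosen number of clusters
chosen_k = 3  # Hypothetical value; read this off your elbow plot
kmeans = KMeans(n_clusters=chosen_k, random_state=1)
kmeans.fit(df_processed)
# Attach the cluster assignments to the original DataFrame
df = df.assign(Cluster=kmeans.labels_)
These labels are the input for the snake plot and heatmap mentioned at the top of this template.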