Customer segmentation is a way to identify groups of similar customers. Customers can be segmented on a wide variety of characteristics, such as demographic information, purchase behavior, and attitudes.
This end-to-end analysis focuses on segmenting customers based on purchase data using the K-means clustering algorithm. It also includes a snake plot and a heatmap to visualize the resulting clusters and the relative importance of each feature.
The dataset includes 3 customer behavior metrics:
- Recency - measures how recent each customer's last purchase was. Lower is better, since every company wants its customers to be recent and active.
- Frequency - measures how many purchases the customer has made in the last 12 months.
- Monetary Value - measures how much the customer has spent in the last 12 months.
We will use these values to assign customers to RFM segments. In a real-world project, we would work with the most recent snapshot of the data, as of today or yesterday. For this dataset, we assume a hypothetical snapshot date based on the most recent records.
All the metrics are already calculated: the dataset has one row per customer with their recency, frequency, and monetary value as of the snapshot date, as if we were running the analysis the day after the data was pulled from the retailer's website, just like in a real-world project.
We will focus on building powerful and intuitive RFM segments. Let's go!
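Since the RFM columns arrive precomputed in this dataset, it may help to see how they could be derived from a raw transaction log. The sketch below uses a hypothetical `transactions` DataFrame with invented column names (`CustomerID`, `InvoiceDate`, `Amount`), and sets the snapshot date to one day after the latest purchase in the data:

```python
import pandas as pd

# Hypothetical raw transaction log (our dataset already has RFM precomputed)
transactions = pd.DataFrame({
    "CustomerID": [1, 1, 2, 2, 2],
    "InvoiceDate": pd.to_datetime(
        ["2023-01-05", "2023-03-20", "2023-02-11", "2023-05-02", "2023-06-30"]
    ),
    "Amount": [50.0, 20.0, 15.0, 40.0, 25.0],
})

# Snapshot date: one day after the most recent purchase in the data
snapshot_date = transactions["InvoiceDate"].max() + pd.Timedelta(days=1)

# Aggregate per customer: recency (days since last purchase),
# frequency (number of purchases), monetary value (total spend)
rfm = transactions.groupby("CustomerID").agg(
    Recency=("InvoiceDate", lambda d: (snapshot_date - d.max()).days),
    Frequency=("InvoiceDate", "count"),
    MonetaryValue=("Amount", "sum"),
)
print(rfm)
```

This yields one row per customer, the same shape our dataset starts from.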
Step 0: Import Libraries
# Basic operations
import pandas as pd
import numpy as np
# Data visualizations
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
# sklearn for predictive analytics
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
# Set visualization style
sns.set_style("darkgrid")
SEED = 123
Step 1: Get to know the dataset
1.1. Read the data
# Load the data and replace with your CSV file path
df = pd.read_csv("data/customer_data.csv")
# Preview the data
df.head(5)
1.2. Inspect the data
- All the variables are numeric.
- There are no missing or duplicate values.
- Each 'CustomerID' represents one customer.
- 3,643 unique customers are ready for our analysis.
# Check columns for data types and missing values
df.info()
# Check for duplicate values in the CustomerID column
duplicate_values = df['CustomerID'].duplicated().sum()
duplicate_values
Step 2: Exploratory Data Analysis
It is time to perform exploratory data analysis to understand the dataset better and prepare it for clustering.
K-means clustering has several key assumptions that can be checked with descriptive statistics:
- Each variable has symmetrical distribution. (no skewness)
- All variables have the same average values.
- All variables have the same variance.
K-means clustering is based on the Euclidean distance between data points, which means that features with larger values or ranges will have more influence on the clustering results than features with smaller values or ranges. This can skew the clusters and make them less meaningful. Also, each metric gets an equal weight in the k-means calculation.
As a result, each assumption must be met for a meaningful segmentation.
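To make the scaling point concrete, here is a small sketch with invented numbers. It shows how an unscaled feature with a large range (yearly spend) dominates the Euclidean distance, and how `StandardScaler` evens out the contributions:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Three toy customers: recency in days (small range) vs. yearly spend (large range)
X = np.array([
    [5.0, 1000.0],   # recent, high spender
    [60.0, 1010.0],  # inactive, similar spend
    [6.0, 100.0],    # recent, low spender
])

# On raw values the spend column dominates: the two recent customers (rows 0
# and 2) look far apart, while the inactive customer (row 1) looks nearby
d_raw_recent_pair = np.linalg.norm(X[0] - X[2])
d_raw_inactive = np.linalg.norm(X[0] - X[1])

# After standardization both features contribute on a comparable scale,
# so the recency difference is no longer drowned out by spend
X_s = StandardScaler().fit_transform(X)
d_s_recent_pair = np.linalg.norm(X_s[0] - X_s[2])
d_s_inactive = np.linalg.norm(X_s[0] - X_s[1])
print(d_raw_recent_pair, d_raw_inactive, d_s_recent_pair, d_s_inactive)
```

Note how standardization flips which customers are nearest neighbors, which directly changes the clusters K-means would find.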
# Select columns for clustering (CustomerID is omitted because it does not add value to clustering.)
columns_for_clustering = ["Recency", "Frequency", "MonetaryValue"]
# Create new DataFrame with clustering variables
df_features = df[columns_for_clustering]
# Print a summary of descriptive statistics
df_features.describe()
Observations:
- The average recency is 3 months. We would like customers to purchase more often, so lower recency is better.
- The average frequency is 18.7, meaning customers make about 19 purchases per year.
- Customers spend 370 dollars per year on average, but this figure is inflated by outliers: only 25 percent of customers spend more than 334 dollars in a year.
- Looking at the minimum and maximum values, it is obvious that all the variables are skewed.
- Both the averages and the standard deviations differ between the three variables.
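The skew suggested by the min/max gap can also be quantified: pandas' `.skew()` returns values near 0 for symmetric distributions and large positive values for long right tails. The sketch below demonstrates this on a synthetic right-skewed column; on the real data, the equivalent call would simply be `df_features.skew()`:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a right-skewed spend column: most values are small,
# a few are very large (lognormal data behaves this way by construction)
rng = np.random.default_rng(123)
toy = pd.DataFrame({"MonetaryValue": rng.lognormal(mean=4.0, sigma=1.0, size=1000)})

# Sample skewness: near 0 means symmetric, clearly positive means right-skewed
print(toy["MonetaryValue"].skew())
```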
In the next section, we will draw histograms of each variable to determine whether they need a transformation.
# Plot the distributions of variables with distribution lines
g = sns.FacetGrid(
df_features.melt(), # Reformat the DataFrame for plotting purposes
    col="variable",         # Split on the 'variable' column created by reformatting
sharey=True, # Turn on shared y-axis
sharex=False # Turn off shared x-axis
)
# Apply a histogram to the facet grid and add a kernel density estimate (kde) for the distribution line
g.map(sns.histplot, "value", kde=True)
# Adjust the top of the plots to make room for the title
g.fig.subplots_adjust(top=0.8)
# Create a title
g.fig.suptitle("Unprocessed RFM Distributions", fontsize=16)
plt.show()
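The histograms typically confirm long right tails in all three variables. A common next step, sketched here on invented numbers rather than the real columns, is a log transform (`np.log1p`, which is safe for zeros) to pull the tail in before scaling and clustering:

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed values standing in for one RFM column
values = pd.Series([1, 2, 2, 3, 5, 8, 13, 40, 120, 400], dtype=float)

# log1p(x) = log(1 + x) compresses the long right tail without failing on zeros
log_values = np.log1p(values)

# Skewness drops substantially after the transform
print(values.skew(), log_values.skew())
```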