Customer Segmentation Report

    Customer segmentation is a way to identify groups of similar customers. Customers can be segmented on a wide variety of characteristics, such as demographic information, purchase behavior, and attitudes. This template provides an end-to-end report for processing and segmenting customer purchase data using a K-means clustering algorithm. It also includes a snake plot and heatmap to visualize the resulting clusters and feature importance.

    To use your data, the following criteria must be satisfied:

    • Multiple numerical variables that you can use for clustering.
    • No NaN/NA values. You can use this template to impute missing values if needed.

    The placeholder dataset in this template consists of customer data, including purchase recency, frequency, and monetary value. Each row represents a different customer with a distinct customer ID.

    # Load packages
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import KMeans
    
    # Set visualization style
    sns.set_style("darkgrid")
    
    # Load the data and replace with your CSV file path
    df = pd.read_csv("data/customer_data.csv")
    
    # Preview the data
    df
    
    # Check columns for data types and missing values
    df.info()
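
If df.info() reports missing values, they need to be handled before clustering. The following is a minimal sketch of one option, median imputation with pandas. The toy DataFrame and the choice of the median are assumptions made for illustration; they are not part of the template's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the customer DataFrame
toy = pd.DataFrame(
    {
        "CustomerID": [1, 2, 3, 4],
        "Recency": [10.0, np.nan, 35.0, 5.0],
        "Frequency": [3.0, 8.0, np.nan, 1.0],
        "MonetaryValue": [120.0, 560.0, 75.0, np.nan],
    }
)

# Count missing values per column
missing_counts = toy.isna().sum()

# Fill gaps in the clustering columns with each column's median (assumed strategy)
feature_cols = ["Recency", "Frequency", "MonetaryValue"]
toy_imputed = toy.copy()
toy_imputed[feature_cols] = toy[feature_cols].fillna(toy[feature_cols].median())

print(missing_counts)
print(toy_imputed)
```

The median is less sensitive to extreme customers than the mean, which matters for skewed purchase data, but dropping incomplete rows may be preferable when only a few are affected.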

    2. Exploring the Data

    Based on the evaluation above, you can select columns that you wish to inspect further. In this template, three of the four columns are selected. CustomerID is omitted because it is an identifier and not useful for clustering.

    The code below reduces the DataFrame to the columns you wish to cluster on and then prints descriptive statistics using the describe() method from pandas.

    Printing descriptive statistics is helpful because K-means clustering rests on several key assumptions that this exploration helps you check:

    1. The variables are not skewed.
    2. The variables have the same average values.
    3. The variables have the same variance.

    If you'd like to learn more about pre-processing data for K-means clustering, you can refer to this video from the course Customer Segmentation in Python.

    # Select columns for clustering
    columns_for_clustering = ["Recency", "Frequency", "MonetaryValue"]
    
    # Create new DataFrame with clustering variables
    df_features = df[columns_for_clustering]
    
    # Print a summary of descriptive statistics
    df_features.describe()
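
The describe() output shows means and standard deviations but does not report skewness directly. As a rough sketch, the three quantities can be placed side by side using the pandas skew(), mean(), and var() methods. The toy values below are hypothetical and chosen only to produce a visible right skew.

```python
import pandas as pd

# Hypothetical toy features with one extreme customer in each column
toy_features = pd.DataFrame(
    {
        "Recency": [1, 2, 3, 4, 200],
        "Frequency": [1, 1, 2, 2, 50],
        "MonetaryValue": [10, 20, 30, 40, 5000],
    }
)

# One row per variable: skewness, mean, and variance
assumption_check = pd.DataFrame(
    {
        "skew": toy_features.skew(),
        "mean": toy_features.mean(),
        "variance": toy_features.var(),
    }
)

print(assumption_check)
```

Skewness far from zero, or means and variances that differ widely between columns, suggests the data needs transforming and scaling before clustering.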

    The FacetGrid class from seaborn creates a grid of plots, here one histogram for each variable to be clustered. It serves as a further exploration of the data to determine whether it is skewed and needs transformation.

    # Plot the distributions of the selected variables
    g = sns.FacetGrid(
        df_features.melt(),  # Reformat the DataFrame for plotting purposes
        col="variable",  # Split on the 'variable' column created by reformatting
        sharey=False,  # Turn off shared y-axis
        sharex=False,  # Turn off shared x-axis
    )
    # Apply a histogram to the facet grid
    g.map(sns.histplot, "value")
    # Adjust the top of the plots to make room for the title
    g.fig.subplots_adjust(top=0.8)
    # Create a title
    g.fig.suptitle("Unprocessed Variable Distributions", fontsize=16)
    plt.show()
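
The melt() call in the plotting code converts the wide DataFrame into long format so that each variable can be drawn in its own facet. A small sketch of that reshaping on hypothetical toy data:

```python
import pandas as pd

# Hypothetical two-customer example in wide format
toy_features = pd.DataFrame(
    {
        "Recency": [10, 35],
        "Frequency": [3, 8],
        "MonetaryValue": [120.0, 560.0],
    }
)

# With no arguments, melt() stacks every column into 'variable'/'value' pairs
long_format = toy_features.melt()

print(long_format)
```

Each original column name ends up in the 'variable' column, which is what the col="variable" argument splits the grid on.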

    Before proceeding, it is crucial to ensure that all columns selected for clustering are numeric. The following code iterates through the reduced DataFrame and checks whether each column is numeric. If it returns True, then you can proceed with the pre-processing.

    all([pd.api.types.is_numeric_dtype(df_features[col]) for col in columns_for_clustering])
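
If the check returns False, it helps to know which columns are responsible. The sketch below lists the non-numeric columns and converts them with pd.to_numeric. The toy data and the decision to coerce unparseable entries to NaN are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data where 'Frequency' was read in as text
toy = pd.DataFrame(
    {
        "Recency": [10, 20, 30],
        "Frequency": ["3", "8", "n/a"],
        "MonetaryValue": [120.5, 560.0, 75.25],
    }
)

# Collect the names of columns that fail the numeric check
non_numeric = [
    col for col in toy.columns if not pd.api.types.is_numeric_dtype(toy[col])
]

# Convert the offending columns; entries that cannot be parsed become NaN
toy_coerced = toy.copy()
for col in non_numeric:
    toy_coerced[col] = pd.to_numeric(toy[col], errors="coerce")

print(non_numeric)
print(toy_coerced.dtypes)
```

Because coercion can introduce NaN values, the missing-value requirement stated at the start of the report should be rechecked afterwards.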

    3. Pre-processing the Data

    Based on the grids above, if there is a skew, you will have to complete this step, which removes the skew and centers the variables. This is the case for the placeholder dataset used in this template and will likely be the case for your data.

    • First, a log transformation is applied to the data using the numpy log() function. A log transformation reduces right skew in preparation for clustering. It is only defined for strictly positive values.
    • Next, the StandardScaler() from sklearn.preprocessing fits and transforms the log-transformed data. This centers and scales the data in further preparation for clustering.
    • Finally, a new DataFrame is created and visualized again to confirm the results.

    # Perform a log transformation of the data to unskew the data
    df_log = np.log(df_features)
    
    # Initialize a standard scaler and fit it
    scaler = StandardScaler()
    scaler.fit(df_log)
    
    # Scale and center the data
    df_normalized = scaler.transform(df_log)
    
    # Create a pandas DataFrame of the processed data
    df_processed = pd.DataFrame(
        data=df_normalized, index=df_features.index, columns=df_features.columns
    )
    
    # Plot the distributions of the selected variables
    g = sns.FacetGrid(df_processed.melt(), col="variable")
    g.map(sns.histplot, "value")
    g.fig.subplots_adjust(top=0.8)
    g.fig.suptitle("Preprocessed Variable Distributions", fontsize=16)
    plt.show()
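
A plain log transformation returns -inf for zeros and NaN for negative values, so it is worth checking for them first. The sketch below uses np.log1p, which computes log(1 + x), as one possible workaround for zeros, and then standardizes by hand to show what centering and scaling does. The toy data and the use of log1p are assumptions and not the template's method.

```python
import numpy as np
import pandas as pd

# Hypothetical features containing zeros
toy_features = pd.DataFrame(
    {
        "Recency": [0, 3, 10, 45, 200],
        "Frequency": [1, 2, 2, 5, 40],
        "MonetaryValue": [0.0, 25.0, 80.0, 300.0, 5000.0],
    }
)

# Count the values that a plain np.log could not handle
non_positive = int((toy_features <= 0).sum().sum())

# log1p is defined at zero (assumed alternative to np.log)
toy_log = np.log1p(toy_features)

# Center and scale by hand; ddof=0 is the population standard deviation,
# which is the convention StandardScaler uses
toy_scaled = (toy_log - toy_log.mean()) / toy_log.std(ddof=0)

print(non_positive)
print(toy_scaled.mean().round(6))
print(toy_scaled.std(ddof=0).round(6))
```

After this step every column has a mean of approximately 0 and a standard deviation of 1, which addresses the equal-mean and equal-variance assumptions.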

    4. Choosing the Number of Clusters

    The next step is to fit K-means with a varying number of clusters and plot the sum of squared errors (SSE) for each number of clusters. The SSE is the sum of squared distances from every data point to the center of its assigned cluster. The aim is to reduce the SSE while still maintaining a reasonable number of clusters.
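
To make the SSE concrete, the sketch below fits KMeans on hypothetical synthetic points, computes the sum of squared distances by hand, and compares the result with the inertia_ attribute that scikit-learn exposes. The generated data and the n_init=10 setting are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated hypothetical blobs of 20 points each
rng = np.random.default_rng(0)
points = np.vstack(
    [
        rng.normal(0, 0.5, size=(20, 2)),
        rng.normal(5, 0.5, size=(20, 2)),
    ]
)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=1).fit(points)

# Look up the center assigned to each point, then sum the squared distances
assigned_centers = kmeans.cluster_centers_[kmeans.labels_]
manual_sse = float(((points - assigned_centers) ** 2).sum())

print(manual_sse, kmeans.inertia_)
```

The two printed numbers should agree up to floating-point error.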

    By plotting the SSE for each number of clusters, you can identify at what point there are diminishing returns by adding new clusters. This type of plot is called an elbow plot.

    In the code below, you can set the maximum number of clusters you want to plot, and then a loop is used to generate the SSE for each number of clusters. Finally, the seaborn function pointplot() plots the SSE against the number of clusters. This allows you to identify the 'elbow', the point beyond which each additional cluster yields only a marginal reduction in SSE.

    # Set the maximum number of clusters to plot
    max_clusters = 10
    
    # Initialize empty dictionary to store sum of squared errors
    sse = {}
    
    # Fit KMeans and calculate SSE for each k from 1 to max_clusters
    for k in range(1, max_clusters + 1):
        # Initialize KMeans with k clusters
        kmeans = KMeans(n_clusters=k, random_state=1)
        # Fit KMeans on the normalized dataset
        kmeans.fit(df_processed)
        # Assign sum of squared distances to k element of dictionary
        sse[k] = kmeans.inertia_
    
    # Initialize a figure of set size
    plt.figure(figsize=(10, 4))
    
    # Create an elbow plot of SSE values for each key in the dictionary
    sns.pointplot(x=list(sse.keys()), y=list(sse.values()))
    
    # Add labels to the plot
    plt.title("Elbow Method Plot", fontsize=16)  # Add a title to the plot
    plt.xlabel("Number of Clusters")  # Add x-axis label
    plt.ylabel("SSE")  # Add y-axis label
    
    # Show the plot
    plt.show()
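
Reading the elbow off a plot is a judgment call. As a supplementary sketch, the fraction of SSE removed by each additional cluster can be tabulated to support that judgment. The three synthetic blobs, the range of k, and the n_init=10 setting below are hypothetical choices for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Three well-separated hypothetical blobs of 30 points each
rng = np.random.default_rng(42)
centers = np.array([[0, 0], [10, 0], [0, 10]])
points = np.vstack([rng.normal(c, 0.5, size=(30, 2)) for c in centers])

# SSE for k = 1 through 6
sse = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=1).fit(points)
    sse[k] = model.inertia_

sse_series = pd.Series(sse)

# Fraction of the previous SSE removed by adding one more cluster
relative_drop = -sse_series.pct_change()

print(relative_drop.round(3))
```

On data like this, the drop should be large up to the true number of groups and small afterwards. Real customer data rarely separates this cleanly, so the table is an aid to the plot and not a replacement for it.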