Skip to content

For this project I was inspired by Avinash Navlani's tutorial Introduction to Customer Segmentation in Python

RFM Analysis on E-Commerce Data

RFM Analysis is a customer segmentation technique that allows businesses to analyze and categorize their customers based on their purchasing behavior. RFM stands for:

  • Recency: How recently a customer has made a purchase
  • Frequency: How often a customer makes purchases
  • Monetary: How much money a customer spends on purchases

By analyzing these three dimensions, businesses can gain insights into customer behavior and tailor their marketing strategies accordingly.

Steps for RFM Analysis

  1. Data Preparation: Prepare the retail dataset by cleaning and transforming the data.
  2. Calculate RFM Metrics: Calculate the recency, frequency, and monetary metrics for each customer.
  3. RFM Segmentation: Assign RFM scores to each customer based on their RFM metrics.
  4. Visualize RFM Segments: Use Python or other tools to create visualizations of the RFM segments.

Data Dictionary

VariableExplanation
InvoiceNoA 6-digit integral number uniquely assigned to each transaction. If this code starts with letter 'c' it indicates a cancellation.
StockCodeA 5-digit integral number uniquely assigned to each distinct product.
DescriptionProduct (item) name
QuantityThe quantities of each product (item) per transaction
InvoiceDateThe day and time when each transaction was generated
UnitPriceProduct price per unit in sterling (pound)
CustomerIDA 5-digit integral number uniquely assigned to each customer
CountryThe name of the country where each customer resides

Source of dataset.

Citation: Daqing Chen, Sai Liang Sain, and Kun Guo, Data mining for the online retail industry: A case study of RFM model-based customer segmentation using data mining, Journal of Database Marketing and Customer Strategy Management, Vol. 19, No. 3, pp. 197-208, 2012 (Published online before print: 27 August 2012. doi: 10.1057/dbm.2012.17).

Data Preview

This dataset consists of orders made in different countries from December 2010 to December 2011. The company is a UK-based online retailer that mainly sells unique all-occasion gifts. Many of its customers are wholesalers.

Spinner
DataFrameas
online_retail
variable
SELECT * FROM 'online_retail.csv';

Data Preparation

We will now exclude :

  • null values in CustomerID column
  • canceled orders, quantity below 0
  • all duplicates entries
Spinner
DataFrameas
clean_online_retail
variable
-- Select all distinct rows from the table "online_retail"
-- where the column "customerID" is not null
-- and either the column "quantity" is greater than 0
-- or the column "InvoiceNo" does not start with 'C' means that it is not a cancellation
SELECT DISTINCT *
FROM online_retail
WHERE customerID IS NOT NULL
AND (quantity > 0 OR InvoiceNo NOT LIKE 'C%');

In the given dataset, since the company is UK-based you can assume that most of the customers are from the "United Kingdom".

Spinner
DataFrameas
country_distribution
variable
SELECT Country, count(*) as count_
FROM clean_online_retail
GROUP BY Country
ORDER BY count_ DESC
LIMIT 10;

Based on our chart we only keep United Kingdom's customer.

Spinner
DataFrameas
uk_online_retail
variable
SELECT * 
FROM clean_online_retail
WHERE Country = 'United Kingdom';

RFM Analysis

Here, you are going to perform following tasks:

  • For Recency, Calculate the number of days between present date and date of last purchase each customer.
  • For Frequency, Calculate the number of orders for each customer.
  • For Monetary, Calculate sum of purchase price for each customer.

The last purchase date for the entire database was '2011-12-09', the analysis is considered to have been carried out the following day.

Spinner
DataFrameas
rfm_table
variable
SELECT CustomerID,
	(
		MAX(
			MAX(strptime(InvoiceDate, '%-m/%-d/%-y %-H:%-M'))
		)OVER()::date + INTERVAL 1 DAY --last purchase date + 1
		- 		
		MAX(strptime(InvoiceDate, '%-m/%-d/%-y %-H:%-M')::date) -- last purchase date per Customer
	) as Recency,
	
	COUNT(InvoiceNo) as Frequency, 	
	SUM(UnitPrice*Quantity) as Monetary
FROM uk_online_retail
GROUP BY CustomerID;

Computing Quantile of RFM Values

Customers with the lowest recency, highest frequency and monetary amounts considered as top customers. Score 1 is for the best possible result and 4 for the worst.