For this project I was inspired by Avinash Navlani's tutorial Introduction to Customer Segmentation in Python
RFM Analysis on E-Commerce Data
RFM Analysis is a customer segmentation technique that allows businesses to analyze and categorize their customers based on their purchasing behavior. RFM stands for:
- Recency: How recently a customer has made a purchase
- Frequency: How often a customer makes purchases
- Monetary: How much money a customer spends on purchases
By analyzing these three dimensions, businesses can gain insights into customer behavior and tailor their marketing strategies accordingly.
Steps for RFM Analysis
- Data Preparation: Prepare the retail dataset by cleaning and transforming the data.
- Calculate RFM Metrics: Calculate the recency, frequency, and monetary metrics for each customer.
- RFM Segmentation: Assign RFM scores to each customer based on their RFM metrics.
- Visualize RFM Segments: Use Python or other tools to create visualizations of the RFM segments.
Data Dictionary
Variable | Explanation |
---|---|
InvoiceNo | A 6-digit integral number uniquely assigned to each transaction. If this code starts with letter 'c' it indicates a cancellation. |
StockCode | A 5-digit integral number uniquely assigned to each distinct product. |
Description | Product (item) name |
Quantity | The quantities of each product (item) per transaction |
InvoiceDate | The day and time when each transaction was generated |
UnitPrice | Product price per unit in sterling (pound) |
CustomerID | A 5-digit integral number uniquely assigned to each customer |
Country | The name of the country where each customer resides |
Source of dataset.
Citation: Daqing Chen, Sai Liang Sain, and Kun Guo, Data mining for the online retail industry: A case study of RFM model-based customer segmentation using data mining, Journal of Database Marketing and Customer Strategy Management, Vol. 19, No. 3, pp. 197-208, 2012 (Published online before print: 27 August 2012. doi: 10.1057/dbm.2012.17).
Data Preview
This dataset consists of orders made in different countries from December 2010 to December 2011. The company is a UK-based online retailer that mainly sells unique all-occasion gifts. Many of its customers are wholesalers.
SELECT * FROM 'online_retail.csv';
Data Preparation
We will now exclude :
- null values in CustomerID column
- canceled orders, quantity below 0
- all duplicates entries
-- Select all distinct rows from the table "online_retail"
-- where the column "customerID" is not null
-- and either the column "quantity" is greater than 0
-- or the column "InvoiceNo" does not start with 'C' means that it is not a cancellation
SELECT DISTINCT *
FROM online_retail
WHERE customerID IS NOT NULL
AND (quantity > 0 OR InvoiceNo NOT LIKE 'C%');
In the given dataset, since the company is UK-based you can assume that most of the customers are from the "United Kingdom".
SELECT Country, count(*) as count_
FROM clean_online_retail
GROUP BY Country
ORDER BY count_ DESC
LIMIT 10;
Based on our chart we only keep United Kingdom's customer.
SELECT *
FROM clean_online_retail
WHERE Country = 'United Kingdom';
RFM Analysis
Here, you are going to perform following tasks:
- For Recency, Calculate the number of days between present date and date of last purchase each customer.
- For Frequency, Calculate the number of orders for each customer.
- For Monetary, Calculate sum of purchase price for each customer.
The last purchase date for the entire database was '2011-12-09', the analysis is considered to have been carried out the following day.
SELECT CustomerID,
(
MAX(
MAX(strptime(InvoiceDate, '%-m/%-d/%-y %-H:%-M'))
)OVER()::date + INTERVAL 1 DAY --last purchase date + 1
-
MAX(strptime(InvoiceDate, '%-m/%-d/%-y %-H:%-M')::date) -- last purchase date per Customer
) as Recency,
COUNT(InvoiceNo) as Frequency,
SUM(UnitPrice*Quantity) as Monetary
FROM uk_online_retail
GROUP BY CustomerID;
Computing Quantile of RFM Values
Customers with the lowest recency, highest frequency and monetary amounts considered as top customers. Score 1 is for the best possible result and 4 for the worst.