Skip to content
Customer Segmentation Analysis
E-Commerce Data
This dataset consists of orders made in different countries from December 2010 to December 2011. The company is a UK-based online retailer that mainly sells unique all-occasion gifts. Many of its customers are wholesalers.
Not sure where to begin? Scroll to the bottom to find challenges!
Data Dictionary
Variable | Explanation |
---|---|
InvoiceNo | A 6-digit integral number uniquely assigned to each transaction. If this code starts with letter 'c' it indicates a cancellation. |
StockCode | A 5-digit integral number uniquely assigned to each distinct product. |
Description | Product (item) name |
Quantity | The quantities of each product (item) per transaction |
InvoiceDate | The day and time when each transaction was generated |
UnitPrice | Product price per unit in sterling (pound) |
CustomerID | A 5-digit integral number uniquely assigned to each customer |
Country | The name of the country where each customer resides |
import pandas as pd
import numpy as np
import datetime as dt
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
#Loading Dataset
online = pd.read_csv("online_retail.csv", parse_dates=['InvoiceDate'])
online.head()
online.shape
Assign daily acquisition cohort
# Define a function that will parse the date
def get_day(x):
return dt.datetime(x.year, x.month, x.day)
# Create InvoiceDay column
online['InvoiceDay'] = online['InvoiceDate'].apply(get_day)
# Group by CustomerID and select the InvoiceDay value
grouping = online.groupby('CustomerID')['InvoiceDay']
# Assign a minimum InvoiceDay value to the dataset
online['CohortDay'] = grouping.transform('min')
# View the top 5 rows
print(online.head())
Source of dataset.
Citation: Daqing Chen, Sai Liang Sain, and Kun Guo, Data mining for the online retail industry: A case study of RFM model-based customer segmentation using data mining, Journal of Database Marketing and Customer Strategy Management, Vol. 19, No. 3, pp. 197-208, 2012 (Published online before print: 27 August 2012. doi: 10.1057/dbm.2012.17).
Calculate time offsets
def get_date_int(df, column):
year = df[column].dt.year
month = df[column].dt.month
day = df[column].dt.day
return year, month, day
# Get the integers for date parts from the `InvoiceDay` column
invoice_year, invoice_month, invoice_day = get_date_int(online, 'InvoiceDay')
# Get the integers for date parts from the `CohortDay` column
cohort_year, cohort_month, cohort_day = get_date_int(online, 'CohortDay')
# Calculate difference in years
years_diff = invoice_year - cohort_year
# Calculate difference in months
months_diff = invoice_month - cohort_month
# Calculate difference in days
days_diff = invoice_day - cohort_day
# Extract the difference in days from all previous values
online['CohortIndex'] = years_diff * 365 + months_diff * 30 + days_diff + 1
print(online.head())
Calculate retention rate from scratch
grouping = online.groupby(['CohortDay', 'CohortIndex'])
# Count the number of unique values per customer ID
cohort_data = grouping['CustomerID'].apply(pd.Series.nunique).reset_index()
# Create a pivot
cohort_counts = cohort_data.pivot_table(index='CohortDay', columns='CohortIndex', values='CustomerID')
# Select the first column and store it to cohort_sizes
cohort_sizes = cohort_counts.iloc[:,0]
# Divide the cohort count by cohort sizes along the rows
retention = cohort_counts.divide(cohort_sizes, axis=0)