RFM analysis

Skills shown in this notebook

The analysis covers data manipulation, computation and visualization. The goal is to explore the data with general metrics and then characterize each customer group using RFM (recency, frequency, monetary) analysis.

The analysis is written in Python.

Data manipulation: NumPy and pandas
Data visualization: Seaborn, Matplotlib and Plotly
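
To make the goal concrete, here is a minimal sketch of how RFM metrics are commonly computed with pandas. The toy transactions, the `Amount` column and the choice of snapshot date (one day after the last transaction) are assumptions made for illustration; the definitions used later in the notebook may differ.

```python
import pandas as pd

# Toy transactions; customer IDs, invoice numbers, dates and amounts are made up.
tx = pd.DataFrame({
    "CustomerID": [1, 1, 2, 2, 2, 3],
    "InvoiceNo": ["A1", "A2", "B1", "B2", "B2", "C1"],
    "InvoiceDate": pd.to_datetime([
        "2011-11-01", "2011-12-05", "2011-06-10",
        "2011-07-15", "2011-07-15", "2011-12-08",
    ]),
    "Amount": [20.0, 35.0, 10.0, 5.0, 7.5, 100.0],
})

# Reference date for recency: the day after the last transaction (an assumption).
snapshot = tx["InvoiceDate"].max() + pd.Timedelta(days=1)

rfm = tx.groupby("CustomerID").agg(
    # Days since the customer's most recent purchase.
    Recency=("InvoiceDate", lambda d: (snapshot - d.max()).days),
    # Number of distinct invoices per customer.
    Frequency=("InvoiceNo", "nunique"),
    # Total amount spent per customer.
    Monetary=("Amount", "sum"),
)
print(rfm)
```

Each row of `rfm` describes one customer, so the three columns can later be binned or scored to form groups.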


Background: E-Commerce Data

This is a transnational data set that contains all the transactions occurring between 1 December 2010 and 9 December 2011 for a UK-based and registered non-store online retailer. The company mainly sells unique all-occasion gifts. Many of its customers are wholesalers.

Source of dataset.

Citation: Daqing Chen, Sai Liang Sain, and Kun Guo, Data mining for the online retail industry: A case study of RFM model-based customer segmentation using data mining, Journal of Database Marketing and Customer Strategy Management, Vol. 19, No. 3, pp. 197-208, 2012 (Published online before print: 27 August 2012. doi: 10.1057/dbm.2012.17).

Data Dictionary

Variable	Explanation
InvoiceNo	A 6-digit integral number uniquely assigned to each transaction. If this code starts with letter 'c' it indicates a cancellation.
StockCode	A 5-digit integral number uniquely assigned to each distinct product.
Description	Product (item) name
Quantity	The quantities of each product (item) per transaction
InvoiceDate	The day and time when each transaction was generated
UnitPrice	Product price per unit in sterling (pound)
CustomerID	A 5-digit integral number uniquely assigned to each customer
Country	The name of the country where each customer resides
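
The cancellation rule in the dictionary can be sketched as follows. The rows below are made up, and `IsCancelled` is a hypothetical column name that does not exist in the data set. The check is case-insensitive on the assumption that the prefix may be stored as either 'c' or 'C'.

```python
import pandas as pd

# Toy rows that mimic the data dictionary; the values are made up.
sample = pd.DataFrame({
    "InvoiceNo": ["536365", "C536379", "536380", "c536391"],
    "Quantity": [6, -1, 24, -12],
})

# InvoiceNo is handled as text because cancellations carry a letter prefix.
# Upper-casing first makes the prefix check case-insensitive.
sample["IsCancelled"] = (
    sample["InvoiceNo"].astype(str).str.upper().str.startswith("C")
)
print(sample)
```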

Data Validation (stage 1)

# importing packages
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import t
import pingouin
import missingno as msno
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import datetime

First, let's take a look at the table.

df = pd.read_csv('online_retail.csv', index_col=None)
df

Format

Second, check the data types of the columns.

df.info()

The data types of the InvoiceDate and CustomerID columns don't match their values: InvoiceDate holds timestamps and CustomerID holds identifiers, not quantities. Let's convert them and check the result.

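# parse the invoice timestamps (month/day/two-digit year, hour:minute)
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'], format='%m/%d/%y %H:%M')
# customer IDs are labels, not quantities, so store them as objects
df['CustomerID'] = df['CustomerID'].astype('object')
df.info()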
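
A strict format string raises an error on the first value that does not match it. As a hedged alternative, the sketch below uses `errors="coerce"` on made-up strings to count unparseable values instead of failing. The sample values are assumptions, not rows from the data set.

```python
import pandas as pd

# Made-up timestamp strings in the same month/day/year layout, plus one bad value.
raw = pd.Series(["12/01/10 08:26", "12/09/11 12:50", "not a date"])

# errors="coerce" turns values that do not match the format into NaT.
parsed = pd.to_datetime(raw, format="%m/%d/%y %H:%M", errors="coerce")

# Values that were present in the input but could not be parsed.
n_failed = int(parsed.isna().sum() - raw.isna().sum())

print(parsed)
print("Unparseable values:", n_failed)
```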

Missing data

Third, identify missing values.

print("Number of missing values by column:")
print(df.isna().sum(), end="\n\n")
print("Proportion of missing values by column in %:")
print(df.isna().sum() * 100 / len(df))
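
The two printouts can also be combined into a single summary table. This is a minimal sketch on a made-up frame; `n_missing` and `pct_missing` are hypothetical column names chosen for illustration.

```python
import numpy as np
import pandas as pd

# Made-up frame with gaps in two columns.
toy = pd.DataFrame({
    "Description": ["MUG", None, "LAMP", None],
    "CustomerID": [17850.0, np.nan, 13047.0, 12583.0],
    "Quantity": [6, 1, 2, 3],
})

# Count and percentage of missing values side by side.
missing = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "pct_missing": (toy.isna().mean() * 100).round(2),
})

# Keep only columns that have gaps, largest first.
missing = missing[missing["n_missing"] > 0].sort_values("n_missing", ascending=False)
print(missing)
```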

The Description and CustomerID columns have missing values. Let's take a closer look at them.

Missing data: Description column
