Skills shown in this Notebook
The analysis covers data manipulation, computation and visualization. The goal is to conduct exploratory analysis with general metrics and analyze the characteristics of each group using an RFM analysis.
This analysis is written in Python:
- Data manipulation: NumPy and pandas
- Data visualization: Seaborn, Matplotlib and Plotly
Background: E-Commerce Data
This is a transnational data set that contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retailer. The company mainly sells unique all-occasion gifts. Many of its customers are wholesalers.
Source of dataset.
Citation: Daqing Chen, Sai Liang Sain, and Kun Guo, Data mining for the online retail industry: A case study of RFM model-based customer segmentation using data mining, Journal of Database Marketing and Customer Strategy Management, Vol. 19, No. 3, pp. 197-208, 2012 (Published online before print: 27 August 2012. doi: 10.1057/dbm.2012.17).
Data Dictionary
| Variable | Explanation |
|---|---|
| InvoiceNo | A 6-digit integral number uniquely assigned to each transaction. If this code starts with letter 'c' it indicates a cancellation. |
| StockCode | A 5-digit integral number uniquely assigned to each distinct product. |
| Description | Product (item) name |
| Quantity | The quantities of each product (item) per transaction |
| InvoiceDate | The day and time when each transaction was generated |
| UnitPrice | Product price per unit in sterling (pound) |
| CustomerID | A 5-digit integral number uniquely assigned to each customer |
| Country | The name of the country where each customer resides |
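The cancellation rule in the data dictionary can be illustrated with a minimal sketch. The rows below are hypothetical values made up for the example, and the `IsCancelled` column name is an assumption, not part of the dataset. The check is case-insensitive because the dictionary says 'c' while real files may use an uppercase prefix.

```python
import pandas as pd

# Hypothetical sample rows; these invoice numbers are made up for illustration.
sample = pd.DataFrame({
    "InvoiceNo": ["536365", "C536379", "536381", "c536383"],
    "Quantity": [6, -1, 2, -3],
})

# Cast to string first: a column holding only digits may be parsed as integers.
# Upper-casing makes the prefix check work for both 'c' and 'C'.
sample["IsCancelled"] = (
    sample["InvoiceNo"].astype(str).str.upper().str.startswith("C")
)

print(sample)
```

The same expression could be applied to the full table once it is loaded, assuming InvoiceNo follows the layout described above.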
Data Validation (stage 1)
# importing packages
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import t
import pingouin
import missingno as msno
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import datetime
First, let's take a look at the table.
df = pd.read_csv('online_retail.csv', index_col=None)
df
Format
Second, check the column formats.
df.info()
The formats of the InvoiceDate and CustomerID columns don't match their values. Let's change them and check the result.
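As a quick sanity check of the conversion, here is a minimal sketch on a toy frame. The date strings are an assumption about how the raw file writes timestamps (two-digit year, no zero padding), inferred only from the format string `'%m/%d/%y %H:%M'`; the customer IDs are made up.

```python
import pandas as pd

# Toy data; the string layout is assumed, not taken from the real file.
toy = pd.DataFrame({
    "InvoiceDate": ["12/1/10 8:26", "12/9/11 12:50"],
    "CustomerID": [17850.0, None],  # a missing ID forces a float column on load
})

# Parse the strings into timestamps with an explicit format.
toy["InvoiceDate"] = pd.to_datetime(toy["InvoiceDate"], format="%m/%d/%y %H:%M")

# Treat the ID as a label rather than a number.
toy["CustomerID"] = toy["CustomerID"].astype("object")

print(toy.dtypes)
```

If the real file stores four-digit years, `%Y` would be needed instead of `%y`; the explicit format makes such a mismatch fail loudly rather than parse silently.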
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'], format='%m/%d/%y %H:%M')
df['CustomerID'] = df['CustomerID'].astype('object')
df.info()
Missing data
Third, identify missing values.
print("Number of missing values by columns:")
print(df.isna().sum(), end = "\n\n")
print("Proportion of missing values by columns in %:")
print(df.isna().sum() * 100 / len(df))
The Description and CustomerID columns are missing some data. Let's take a closer look at them.
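One possible way to take that closer look is sketched below on a toy frame. The rows are hypothetical, and the names `missing` and `rows_missing_description` are introduced only for this illustration. The sketch puts counts and percentages side by side, then pulls out the rows where Description is missing so the other columns in those rows can be inspected.

```python
import numpy as np
import pandas as pd

# Hypothetical rows, made up to show the pattern.
toy = pd.DataFrame({
    "Description": ["WHITE METAL LANTERN", None, "RED MUG", "BLUE MUG"],
    "CustomerID": [17850.0, np.nan, np.nan, 13047.0],
    "Quantity": [6, 1, 2, 3],
})

# Count and share of missing values per column in one table.
# isna().mean() is the fraction of missing rows, so * 100 gives a percentage.
missing = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "pct_missing": toy.isna().mean() * 100,
})
print(missing)

# Rows with no Description, to see what else is missing alongside it.
rows_missing_description = toy[toy["Description"].isna()]
print(rows_missing_description)
```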
Missing data: Description column