Skip to content

Exploratory data analysis (EDA) for E-Commerce Data

This dataset consists of orders made in different countries from December 2010 to December 2011. The company is a UK-based online retailer that mainly sells unique all-occasion gifts. Many of its customers are wholesalers.

import pandas as pd

pd.read_csv("online_retail.csv")

Data Dictionary

VariableExplanation
InvoiceNoA 6-digit integral number uniquely assigned to each transaction. If this code starts with letter 'c' it indicates a cancellation.
StockCodeA 5-digit integral number uniquely assigned to each distinct product.
DescriptionProduct (item) name
QuantityThe quantities of each product (item) per transaction
InvoiceDateThe day and time when each transaction was generated
UnitPriceProduct price per unit in sterling (pound)
CustomerIDA 5-digit integral number uniquely assigned to each customer
CountryThe name of the country where each customer resides

Source of dataset.

Citation: Daqing Chen, Sai Liang Sain, and Kun Guo, Data mining for the online retail industry: A case study of RFM model-based customer segmentation using data mining, Journal of Database Marketing and Customer Strategy Management, Vol. 19, No. 3, pp. 197-208, 2012 (Published online before print: 27 August 2012. doi: 10.1057/dbm.2012.17).


1 hidden cell

Data Cleaning

1. Check the column's data type and if there is any null values

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
retail = pd.read_csv("online_retail.csv")
print(retail.info())
print(retail.describe())

2. In the Description column there is null values, fill the null value associtaed with the StockCode to match the product desciptions.

Explanation:

groupby('StockCode')['Description']: Group by the 'StockCode' and selects the 'Description' column. Transform() allows the transformation of each group. In this situation, the transformation is fill the null value. lambda x: x.fillna(x.mode().iloc[0]) if x.notnull().any() else x fills null values in each group's 'Description' column with the mode value (most frequent) if the Description column contains any non-null values; otherwise, it leaves the column unchanged. This code fills null values in the 'Description' column based on the most frequent non-null 'Description' within each 'StockCode' group.

x.notnull().any(): Within each group (x), this checks if there are any non-null values in the 'Description' column. If there are, it means there are some descriptions available for that 'StockCode'.

x.fillna(x.mode().iloc[0]): If there are non-null values in the 'Description' column within a group, it computes the mode (most common value) of the 'description' within that group (x). Then, it fills the null values in that group's 'Description' with this mode value.

retail['Description'] = retail.groupby('StockCode')['Description'].transform(lambda x: x.fillna(x.mode().iloc[0]) if x.notnull().any() else x)

print(retail['Description'].isnull())

3. In the CustomerID there is null value, for better interpretation, fill the null value with No ID

retail['CustomerID'] = retail['CustomerID'].fillna('No ID')
print(retail[retail['CustomerID'] == 'No ID'])

4. Convert the InvoiceDate data type to date time

retail['InvoiceDate']=pd.to_datetime(retail['InvoiceDate'])
retail['InvoiceDate_only_date'] =retail['InvoiceDate'].dt.date
print(retail['InvoiceDate'].dtypes)

5. Check the 'Country' columen if there is any typo or spelling errors

retail['Country'].value_counts().sort_values(ascending=False)