Exploratory data analysis (EDA) for E-Commerce Data
This dataset consists of orders made in different countries from December 2010 to December 2011. The company is a UK-based online retailer that mainly sells unique all-occasion gifts. Many of its customers are wholesalers.
import pandas as pd
pd.read_csv("online_retail.csv")Data Dictionary
| Variable | Explanation |
|---|---|
| InvoiceNo | A 6-digit integral number uniquely assigned to each transaction. If this code starts with letter 'c' it indicates a cancellation. |
| StockCode | A 5-digit integral number uniquely assigned to each distinct product. |
| Description | Product (item) name |
| Quantity | The quantities of each product (item) per transaction |
| InvoiceDate | The day and time when each transaction was generated |
| UnitPrice | Product price per unit in sterling (pound) |
| CustomerID | A 5-digit integral number uniquely assigned to each customer |
| Country | The name of the country where each customer resides |
Source of dataset.
Citation: Daqing Chen, Sai Liang Sain, and Kun Guo, Data mining for the online retail industry: A case study of RFM model-based customer segmentation using data mining, Journal of Database Marketing and Customer Strategy Management, Vol. 19, No. 3, pp. 197-208, 2012 (Published online before print: 27 August 2012. doi: 10.1057/dbm.2012.17).
1 hidden cell
Data Cleaning
1. Check the column's data type and if there is any null values
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
retail = pd.read_csv("online_retail.csv")
print(retail.info())
print(retail.describe())2. In the Description column there is null values, fill the null value associtaed with the StockCode to match the product desciptions.
Explanation:
groupby('StockCode')['Description']: Group by the 'StockCode' and selects the 'Description' column. Transform() allows the transformation of each group. In this situation, the transformation is fill the null value. lambda x: x.fillna(x.mode().iloc[0]) if x.notnull().any() else x fills null values in each group's 'Description' column with the mode value (most frequent) if the Description column contains any non-null values; otherwise, it leaves the column unchanged. This code fills null values in the 'Description' column based on the most frequent non-null 'Description' within each 'StockCode' group.
x.notnull().any(): Within each group (x), this checks if there are any non-null values in the 'Description' column. If there are, it means there are some descriptions available for that 'StockCode'.
x.fillna(x.mode().iloc[0]): If there are non-null values in the 'Description' column within a group, it computes the mode (most common value) of the 'description' within that group (x). Then, it fills the null values in that group's 'Description' with this mode value.
retail['Description'] = retail.groupby('StockCode')['Description'].transform(lambda x: x.fillna(x.mode().iloc[0]) if x.notnull().any() else x)
print(retail['Description'].isnull())3. In the CustomerID there is null value, for better interpretation, fill the null value with No ID
retail['CustomerID'] = retail['CustomerID'].fillna('No ID')
print(retail[retail['CustomerID'] == 'No ID'])4. Convert the InvoiceDate data type to date time
retail['InvoiceDate']=pd.to_datetime(retail['InvoiceDate'])
retail['InvoiceDate_only_date'] =retail['InvoiceDate'].dt.date
print(retail['InvoiceDate'].dtypes)5. Check the 'Country' columen if there is any typo or spelling errors
retail['Country'].value_counts().sort_values(ascending=False)