Skip to content

E-Commerce Data

This dataset consists of orders made in different countries from December 2010 to December 2011. The company is a UK-based online retailer that mainly sells unique all-occasion gifts. Many of its customers are wholesalers.

Not sure where to begin? Scroll to the bottom to find challenges!

# Importing important libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("darkgrid")

df = pd.read_csv("online_retail.csv")

Data Dictionary

VariableExplanation
InvoiceNoA 6-digit integral number uniquely assigned to each transaction. If this code starts with letter 'c' it indicates a cancellation.
StockCodeA 5-digit integral number uniquely assigned to each distinct product.
DescriptionProduct (item) name
QuantityThe quantities of each product (item) per transaction
InvoiceDateThe day and time when each transaction was generated
UnitPriceProduct price per unit in sterling (pound)
CustomerIDA 5-digit integral number uniquely assigned to each customer
CountryThe name of the country where each customer resides

Source of dataset.

Citation: Daqing Chen, Sai Liang Sain, and Kun Guo, Data mining for the online retail industry: A case study of RFM model-based customer segmentation using data mining, Journal of Database Marketing and Customer Strategy Management, Vol. 19, No. 3, pp. 197-208, 2012 (Published online before print: 27 August 2012. doi: 10.1057/dbm.2012.17).

Don't know where to start?

Challenges are brief tasks designed to help you practice specific skills:

  • ๐Ÿ—บ๏ธ Explore: What five countries are responsible for the most profit (by quantity and price of goods sold)?
  • ๐Ÿ“Š Visualize: Create a plot visualizing the profits earned from UK customers weekly, monthly, and quarterly.
  • ๐Ÿ”Ž Analyze: Are order sizes from countries outside the United Kingdom significantly larger than orders from inside the United Kingdom?

Scenarios are broader questions to help you develop an end-to-end project for your portfolio:

You are working for an online retailer. Currently, the retailer sells over 4000 unique products. To take inventory of the items, your manager has asked you whether you can group the products into a small number of categories. The categories should be similar in terms of price and quantity sold and any other characteristics you can extract from the data.

You will need to prepare a report that is accessible to a broad audience. It should outline your motivation, steps, findings, and conclusions.


โœ๏ธ If you have an idea for an interesting Scenario or Challenge, or have feedback on our existing ones, let us know! You can submit feedback by pressing the question mark in the top right corner of the screen and selecting "Give Feedback". Include the phrase "Content Feedback" to help us flag it in our system.

#Display ramdom sample taken from the data and the variable explanation 
display(df.sample(10))
#Showing the Summary of retail data
df.info()

As seen above, there are some missing values in Description and CustomerID the dataset. The data is further summarized below

#Summary of the data
df.describe()
# The data has extreme outliers in the that need to be removed
low = np.quantile(df['Quantity'], 0.1)
low_p = np.quantile(df['UnitPrice'], 0.004)
drs = df[(df['Quantity'] < low) | (df['UnitPrice'] <= low_p)]
df.drop(drs.index, inplace = True)
df.describe()
df.shape

Challenges

The challenges involve exploring the data to find out how much profit is made by looking at the countries that give the most profits and looking at how the sales behave weekly, yearly and quarterly. The data collected needs to be cleaned and some feature engineering required. Also duplicates, missing data need to be dropped and the total price of goods ordered calculated.
# Checking if there are duplicates in the data
cols_list = df.columns
duplicates = df.duplicated(subset= cols_list, keep = False)
df[duplicates].sort_values('Description')
# Dropping duplicate files in the dataset
df.drop_duplicates()
โ€Œ
โ€Œ
โ€Œ