Skip to content

Problem definition E-Commerce Data

Problem definition START

This dataset consists of orders made in different countries from December 2010 to December 2011. The company is a UK-based online retailer that mainly sells unique all-occasion gifts. Many of its customers are wholesalers.

Not sure where to begin? Scroll to the bottom to find challenges!

import pandas as pd
retail = pd.read_csv("online_retail.csv")
retail.head(100)

Data Dictionary

VariableExplanation
InvoiceNoA 6-digit integral number uniquely assigned to each transaction. If this code starts with letter 'c' it indicates a cancellation.
StockCodeA 5-digit integral number uniquely assigned to each distinct product.
DescriptionProduct (item) name
QuantityThe quantities of each product (item) per transaction
InvoiceDateThe day and time when each transaction was generated
UnitPriceProduct price per unit in sterling (pound)
CustomerIDA 5-digit integral number uniquely assigned to each customer
CountryThe name of the country where each customer resides

Source of dataset.

Citation: Daqing Chen, Sai Liang Sain, and Kun Guo, Data mining for the online retail industry: A case study of RFM model-based customer segmentation using data mining, Journal of Database Marketing and Customer Strategy Management, Vol. 19, No. 3, pp. 197-208, 2012 (Published online before print: 27 August 2012. doi: 10.1057/dbm.2012.17).

Don't know where to start?

Challenges are brief tasks designed to help you practice specific skills:

  • ๐Ÿ—บ๏ธ Explore: Negative order quantities indicate returns. Which products have been returned the most?
  • ๐Ÿ“Š Visualize: Create a plot visualizing the profits earned from UK customers weekly, monthly.
  • ๐Ÿ”Ž Analyze: Are order sizes from countries outside the United Kingdom significantly larger than orders from inside the United Kingdom?

Scenarios are broader questions to help you develop an end-to-end project for your portfolio:

You are working for an online retailer. Currently, the retailer sells over 4000 unique products. To take inventory of the items, your manager has asked you whether you can group the products into a small number of categories. The categories should be similar in terms of price and quantity sold and any other characteristics you can extract from the data.

You will need to prepare a report that is accessible to a broad audience. It should outline your motivation, steps, findings, and conclusions.

Problem definition END -------------------------------------------------------------------------------------------------------------------------------

Problem solving

*SQL queries for filtered data can be found at the end of the notebook.

Problem 1ยด) Which products have been returned the most?

It is important to differentiate returns and related refundings from purchases. Total income includes refunded amounts or we can calculate (negative) refunds and (positive) incomes separately.

As first step total database size, total number of returns and percentage compared to total event number is calculated (printed in two different ways).

To determine the most returned products (itemcodes) unique (identifying) values of returned products ('StockCode') is determined besides of the amount of unique itemcodes and amount of (unique) returned itemcodes in the complete dataset.

dataframe.info() as a built in method of dataframes gives some key information on the full dataset.

(Note: the notebook has an in-built automatic presentation option for calculated values in form of a table or as a graph)

#import pandas as pd
#retail = pd.read_csv("online_retail.csv")
database_size = len(retail)
print('Database size: ' + str(database_size))
#retail.colums.to_list() #not accessible in this environment
returns = len(retail[retail['Quantity'] < 0])
percentage = returns/database_size*100
print(f'Returns: {returns}, {percentage} % of total')
print('Returns : %2d, percentage: %5.2f ' % (returns, percentage) + '%') 
count_total_unique = len(retail['StockCode'].unique())
count_returned_unique = len(retail[retail['Quantity'] < 0]['StockCode'].unique())
print('Number of unique itemcodes (in total): ' + str(count_total_unique))
print('Number of unique returned itemcodes: ' + str(count_returned_unique) + ' (' + '{:.1f}'.format(count_returned_unique / count_total_unique *100) + '%)')
n = 10
print('Top ' + str(n) + ' most appearing products\' itemcodes from ' +str(count_total_unique) + ':')
print(retail['StockCode'].value_counts()[:n].index.tolist())
print('-----------------------------------')
retail.info()

We may extract returns related data from the entire dataset (to keep original dataframe intact and not to perform analysis on the entire dataset).

df_returns = retail[retail['Quantity'] < 0]
df_returns

Data can be sorted based on quantity in ascending order (by default) to show products returned in largest return amounts (lowest negative numbers).

df_returns[['StockCode', 'Description', 'Quantity']].sort_values('Quantity')

Remember that the previous sort shows the quantity values of all registers which all are treated unrelated in this case, however there might be multiple return-registers related to the same product type. See for example StockCode 23005. In such case the quantities defined in each of same product lines should be summed up.

First, group registers that are related to the same item ('StockCode') and then sum up returned quantities, finally sort based on quantities similarly, as above.

top_returns = df_returns.groupby('StockCode').sum('Quantity').sort_values('Quantity')
top_returns.iloc[:, 0:1]

However the list order for first 9 items did not change much, the most relevant change is that the returns of item 23005 has been summed up.

As next step, it is useful (in general) to know some details of the returned goods:

  • total number of individual (unique) products
  • the itemcode (list) of returned products
print('Number of product types returned: '+  str(len(retail[retail['Quantity'] < 0]['StockCode'].unique())))
retail[retail['Quantity'] < 0]['StockCode'].unique()
โ€Œ
โ€Œ
โ€Œ