Skip to content

Exploring Bicycle Sales

The Bicycle Sales database from Microsoft SQL Server was obtained through DataLab (DataCamp). There are 2 schemas in the Bicycle Sales database, namely 'production' and 'sales'. The following lists the data tables found in each schema and their respective columns.

SchemasTable NameTable Column Names (Data Types)
productionbrandsbrand_id (int), brand_name (varchar)
productioncategoriescategory_id (int), category_name (varchar)
productionproductsproduct_id (int), product_name (varchar), brand_id (int), category_id (int), model_year (smallint), list_price (decimal)
productionstocksstore_id (int), product_id (int), quantity (int)
salescustomerscustomer_id (int), first_name (varchar), last_name (varchar), phone (varchar), email (varchar), street (varchar), city (varchar), state (varchar), zip_code (varchar)
salesorder_itemsorder_id (int), item_id (int), product_id (int), quantity (int), list_price (decimal), discount (decimal)
salesordersorder_id (int), customer_id (int), order_status (tinyint), order_date (date), required_date (date), shipped_date (date), store_id (int), staff_id (int)
salesstaffsstaff_id (int), first_name (varchar), last_name (varchar), email (varchar), phone (varchar), active (tinyint), store_id (int), manager_id (int)
salesstoresstore_id (int), store_name (varchar), phone (varchar), phone (varchar), email (varchar), street (varchar), city (varchar), state (varchar), zip_code (varchar)

Please refer to the 'Bicycle Sales Data Schema Diagram' PNG file for how the tables are connected.

The database was queried using MS SQL to answer the questions posed in the subsequent section.

Questions of Interests

Q1. How many orders are there in the dataset? What is the minimum, maximum and average revenue per order?

Answer: There were 1615 orders. The minimum revenue of an order was $104.49, the maximum revenue of an order was $29,147.03, and the average revenue per order was $4,781.06.

Q2A. How many orders were not shipped by the required date?

Answer: There were 458 orders that were not shipped by the required date.

Q2B. How many days were they late by?

Answer: There were 305 orders late by 1 day, while there were 153 orders late by 2 days.

Q2C. Did the late orders contain a large quantity of items?

Answer: There is no evidence that the late orders contained a larger quantity of items compared to orders shipped on time. The following table shows a comparison of the frequency and percentage of late order quantities against the on-time order quantities. The 'Percentage Late' was calculated by dividing the frequency of late orders by the sum of the frequency of on-time orders and the frequency of late orders.

Quantity in OrderFrequency (On-time)Frequency (Late)Percentage Late (%)
1934030.08
21456932.24
31305730.48
41658233.20
51435828.86
61356632.84
7954230.66
8613032.97
9181240.00
102250.00

Note: The analyses for Q2 were done based on the available records that contained both 'required date' and 'shipped date'. There were 170 records with missing 'shipped date' in the original dataset.

Q3. How many staff were there in each store and how was the sales performance (measured by revenue)?

Answer: The following table lists the number of staff of each store and the overall sales performance.

Store IDStore NameNumber of staffOverall Performance ($)
2Baldwin Bikes35,215,751.28
1Santa Cruz Bikes41,605,823.04
3Rowlett Bikes3867,542.24

Q4. Get the employee-manager relationship. Who is the top manager?

Answer: The table lists the employee-manager relationship. Fabiola Jackson is the top manager as Fabiola is the manager to other store managers/employee.

Staff IDStaff NameStaff's Store IDManager NameHierarchy Level
1Fabiola Jackson1NA0
2Mireya Copeland1Fabiola Jackson1
5Jannette David2Fabiola Jackson1
8Kali Vargas3Fabiola Jackson1
6Marcelene Boyer2Jannette David2
7Venita Daniel2Jannette David2
3Genna Serrano1Mireya Copeland2
4Virgie Wiggins1Mireya Copeland2
9Layla Terrell3Venita Daniel3
10Bernardine Houston3Venita Daniel3

Q5. What is the third most popular brand of bicycles?

Answer: The third most popular brand of bicycle over the years 2016 to 2018 was Surly, with an order quantity of 908.

Q6. What is the most popular brand in each bicycle category?

Answer: The table below lists the most popular brands in each bicycle category.

CategoryBrandQuantity Sold
Children BicyclesElectra747
Comfort BicyclesElectra524
Cruisers BicyclesElectra1329
Cyclocross BicyclesSurly305
Electric BikesTrek269
Mountain BikesTrek752
Road BikesTrek482

SQL Code

Spinner
DataFrameas
df2
variable
-- Q1 How many orders are there? And what is the minimum, maximum and average revenue per order?
SELECT COUNT(DISTINCT oi.order_id) AS num_order, ROUND(MIN(oi.rev_per_order),2) AS min_rev, ROUND(MAX(oi.rev_per_order),2) AS max_rev, ROUND(AVG(oi.rev_per_order),2) AS avg_rev_per_order
FROM (
	SELECT order_id, SUM(quantity * list_price * (1 - discount)) AS rev_per_order
	FROM sales.order_items
	GROUP BY order_id
) AS oi;
Spinner
DataFrameas
df5
variable
-- 170 records with missing date in shipped_date
SELECT COUNT(*) AS num_missing 
FROM sales.orders
WHERE (shipped_date IS NULL)
Spinner
DataFrameas
df1
variable
-- Q2A How many orders were not shipped by the required date?
SELECT COUNT(*) AS num_late
FROM sales.orders
WHERE shipped_date > required_date;
Spinner
DataFrameas
df
variable
-- Q2B How many days were they late by?
SELECT o.days_late, COUNT(*) AS freq
FROM (
	SELECT order_id, DATEDIFF(day, required_date, shipped_date) AS days_late
	FROM sales.orders
	WHERE shipped_date > required_date
) AS o
GROUP BY o.days_late;
Spinner
DataFrameas
df3
variable
-- Q2C Did the late orders contain a large quantity of items?
CREATE TABLE #order_quan(
	order_id INT,
	total_quan INT
);

INSERT INTO #order_quan
	SELECT order_id, SUM(quantity) AS total_quan
	FROM sales.order_items
	GROUP BY order_id;

SELECT in_time.total_quan, 
       in_time.freq_in_time, 
       late.freq_late, ROUND(100.0*late.freq_late/(in_time.freq_in_time + late.freq_late),2) AS percent_late
FROM (
	SELECT oi_sum.total_quan, COUNT(*) AS freq_in_time
	FROM sales.orders AS o
	LEFT JOIN #order_quan AS oi_sum
	ON o.order_id = oi_sum.order_id
	WHERE o.shipped_date <= o.required_date
	GROUP BY oi_sum.total_quan
) AS in_time
LEFT JOIN (
	SELECT l_oi_sum.total_quan, COUNT(*) AS freq_late
	FROM sales.orders AS l_o
	LEFT JOIN #order_quan AS l_oi_sum
	ON l_o.order_id = l_oi_sum.order_id
	WHERE l_o.shipped_date > l_o.required_date
	GROUP BY l_oi_sum.total_quan
) AS late
ON in_time.total_quan = late.total_quan;

DROP TABLE #order_quan;
Spinner
DataFrameas
df4
variable
-- Q3 How many staff were there in each store and how was the sales performance (measured by revenue)?
SELECT s.store_id, s.store_name, COUNT(DISTINCT st.staff_id) AS num_staff, os.store_rev
FROM sales.stores AS s
LEFT JOIN (
	SELECT o.store_id, ROUND(SUM(oi.rev_per_order),2) AS store_rev
	FROM sales.orders AS o
	LEFT JOIN (
		SELECT order_id, SUM(quantity*list_price*(1 - discount)) AS rev_per_order
		FROM sales.order_items
		GROUP BY order_id
	) AS oi
	ON o.order_id = oi.order_id
	GROUP BY o.store_id
) AS os
ON s.store_id = os.store_id
LEFT JOIN sales.staffs AS st
ON s.store_id = st.store_id
GROUP BY s.store_id, s.store_name, os.store_rev
ORDER BY os.store_rev DESC;
Spinner
DataFrameas
df6
variable
-- Q4 Get the employee-manager relationship. Who is the top manager?

WITH getManager AS (
	SELECT st.staff_id, st.first_name, st.last_name, st.store_id, st.manager_id, mg.first_name AS manager_first_name, mg.last_name AS manager_last_name, mg.store_id AS manager_store_id, 0 AS hierarchy_level
	FROM sales.staffs as st
	LEFT JOIN sales.staffs AS mg
	ON mg.staff_id = st.manager_id
	WHERE st.manager_id IS NULL

	UNION ALL

	SELECT st.staff_id, st.first_name, st.last_name, st.store_id, st.manager_id, mg.first_name, mg.last_name, mg.store_id, mg.hierarchy_level + 1 AS hierarchy_level
	FROM sales.staffs as st
	INNER JOIN getManager AS mg
	ON mg.staff_id = st.manager_id
	WHERE mg.hierarchy_level < 5
)

SELECT *
FROM getManager
ORDER BY hierarchy_level ASC;
Spinner
DataFrameas
df9
variable
-- Q5 What is the third most popular brand of bikes?
SELECT pb.brand_id, b.brand_name, pb.brand_total_quan
FROM production.brands AS b
RIGHT JOIN (
	SELECT p.brand_id, SUM(oi.total_quan) AS brand_total_quan, DENSE_RANK() OVER(ORDER BY SUM(oi.total_quan) DESC) AS rank
	FROM production.products AS p
	RIGHT JOIN (
		SELECT product_id, SUM(quantity) AS total_quan
		FROM sales.order_items
		GROUP BY product_id
	) AS oi
	ON p.product_id = oi.product_id
	GROUP BY p.brand_id
) AS pb
ON pb.brand_id = b.brand_id
WHERE pb.rank = 3;
Spinner
DataFrameas
df7
variable
-- Q6 What is the most popular brand in each bike category?
SELECT c.category_name, b.brand_name, summ.total
FROM (
	SELECT p.category_id, p.brand_id, SUM(oi.total_quan) AS total, DENSE_RANK() OVER(PARTITION BY p.category_id ORDER BY SUM(oi.total_quan) DESC) AS rank
	FROM production.products AS p
	RIGHT JOIN (
		SELECT product_id, SUM(quantity) AS total_quan
		FROM sales.order_items
		GROUP BY product_id
	) AS oi
	ON p.product_id = oi.product_id
	GROUP BY p.category_id, p.brand_id
) AS summ
LEFT JOIN production.categories AS c
ON c.category_id = summ.category_id
LEFT JOIN production.brands AS b
ON b.brand_id = summ.brand_id
WHERE summ.rank = 1;