Cafe Sales

Data source: https://www.kaggle.com/datasets/ahmedmohamed2003/cafe-sales-dirty-data-for-cleaning-training

The dataset has 10,001 rows including the header row, i.e. 10,000 transaction records.
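
As a quick sanity check, the row count can be confirmed directly against the CSV (a minimal query, assuming the DuckDB-style SQL engine that the cleaning queries below also use):

SELECT COUNT(*) AS transaction_rows  -- the header row is not counted as data, so this should return 10,000
FROM 'raw_cafe_sales.csv';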

Useful info:

  • Click "Code" in the panel at the top right of this page to see the details of my SQL code

Note:

In this project I am working from a CSV file, so I cannot query the dataset as I would tables in a database. To work around this I use CTEs and DataFrames to reuse results from previous queries when needed. A DataFrame here is an SQL cell saved under a given name, which can then be referenced in later queries.
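
A minimal sketch of the pattern (the cell and column names here are hypothetical, purely for illustration):

-- Cell 1, saved as a DataFrame named item_counts (hypothetical name)
SELECT item, COUNT(*) AS n_sales
FROM 'raw_cafe_sales.csv'
GROUP BY item;

-- Cell 2: the saved DataFrame is then referenced like a table
SELECT item
FROM item_counts
WHERE n_sales > 100;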

Scenario: Analyzing Sales Patterns for Takeaway and In-Store Orders to Optimize Operations

Business Problem: A café with both takeaway and in-store customers is facing challenges in optimizing its staffing and in boosting sales. The goal is to use the sales data to understand the dynamics between takeaway and in-store orders, uncover trends, and provide actionable insights to help improve sales and operational efficiency.

Key Questions to Answer:

  1. Which menu items generated the highest and lowest revenue for takeaway and in-store orders?
  2. How do sales differ between takeaway and in-store orders?
  3. Which day of the week was busiest? What was the peak time of year? How did the number of transactions perform over the last year?
  4. What are the trends for takeaway orders vs. in-store orders over time?
  5. How can we improve revenue?

Summary

  • Takeaway orders were more consistent than in-store orders, but in-store revenue exceeded takeaway revenue by 2.3%
  • Busiest periods included Fridays and October

Contents

Use the navigation button at the top left of the page to jump to the different sections.

Cleaning

  1. Identify/remove duplicates
  2. Data types
  3. Standardize
  4. Nulls/blanks

Analysis

  1. Highest and lowest revenue items
  2. Sales: Takeaway vs in-store
  3. Busiest day of week? Peak time of the year? Performance for number of transactions
  4. Takeaway vs. in-store orders over time?
  5. How can we improve revenue?

Recommendations

Dashboard

Cleaning

1. Identify/remove duplicates

I will label rows that are exact copies of another row by adding a column that equals 1 for the first occurrence of a row and a value greater than 1 for any exact duplicate. I partition by all columns to create this new column because, combined, they are what makes a row unique.

DataFrame: row_num
SELECT 
	*,
	ROW_NUMBER() OVER(PARTITION BY transaction_ID, item, quantity, price_per_unit, total_spent, payment_method, location, transaction_date) AS row_num  -- Unique row will have row_num = 1. Duplicate rows will have row_num > 1
FROM 'raw_cafe_sales.csv';

Filtering this result set for rows where the indicator is greater than 1 returns no results, which shows that there are no duplicates in the data. I used a CTE to apply the filter.

DataFrame: df9
WITH duplicate_indicator AS (
	SELECT *
	FROM row_num
	)
SELECT *
FROM duplicate_indicator
WHERE row_num > 1;
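
Had any duplicates been found, they could be dropped by keeping only the first copy of each row. A sketch of that step (not needed here, and assuming DuckDB's EXCLUDE syntax is available to drop the helper column):

SELECT * EXCLUDE (row_num)  -- keep every original column, drop the helper column
FROM row_num
WHERE row_num = 1;  -- keeps one copy of each row and discards any duplicates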

2. Data types

First I will select all columns from the table.

In the numeric columns there are string values such as 'ERROR' and 'UNKNOWN'. I will first replace these with NULLs and then change the data types: quantity to INT, and price_per_unit and total_spent to DECIMAL, since prices and totals can hold non-whole amounts.

Then I will use SQL's CAST function to correct the data types of each column.

Additionally, the transaction_date column is stored as text in the format 'dd/mm/yyyy'. I first replace 'ERROR' and 'UNKNOWN' values with NULLs, and then, to get the column into a format SQL can work with, I use SQL's STRPTIME() function to parse the string and cast the result to a DATE.

DataFrame: data_types
WITH CTE AS(
SELECT
	transaction_ID,
	item,
	CASE WHEN quantity NOT IN ('ERROR', 'UNKNOWN') THEN quantity  -- replace string values with NULL in numeric columns
		ELSE NULL
		END AS quantity,
	CASE WHEN price_per_unit NOT IN ('ERROR', 'UNKNOWN') THEN price_per_unit  -- replace string values with NULL in numeric columns
		ELSE NULL
		END AS price_per_unit,
	CASE WHEN total_spent NOT IN ('ERROR', 'UNKNOWN') THEN total_spent  -- replace string values with NULL in numeric columns
		ELSE NULL
		END AS total_spent,
	payment_method,
	location,
	CASE WHEN transaction_date NOT IN ('ERROR', 'UNKNOWN') THEN transaction_date
		ELSE NULL
		END AS transaction_date
FROM 'raw_cafe_sales.csv'
	)
	
SELECT 
	CAST(transaction_ID AS TEXT) AS transaction_ID,
	CAST(item AS TEXT) AS item,
	CAST(quantity AS INT) AS quantity,
	CAST(price_per_unit AS DECIMAL(10,2)) AS price_per_unit,  -- DECIMAL rather than INT so fractional prices are not truncated
	CAST(total_spent AS DECIMAL(10,2)) AS total_spent,  -- DECIMAL rather than INT so fractional totals are not truncated
	CAST(payment_method AS TEXT) AS payment_method,
	CAST(location AS TEXT) AS location,
	STRPTIME(transaction_date, '%d/%m/%Y')::DATE AS transaction_date -- parsing string into dates and then converting to date data type
FROM CTE;
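
To confirm the casts took effect, the resulting column types can be inspected (a quick check, assuming DuckDB's DESCRIBE statement is available in the notebook):

DESCRIBE SELECT * FROM data_types;  -- quantity should now be INTEGER, the price columns DECIMAL, transaction_date DATE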

3. Standardize

Checking for leading/trailing spaces and misspellings in the categorical columns. The example below shows the unique values in the item column.

DataFrame: df
SELECT DISTINCT(item)  -- code to check spelling and unwanted spaces
FROM data_types
ORDER BY 1;
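
If this check had surfaced unwanted spaces or spelling variants, they could be fixed with TRIM() and a CASE mapping. A sketch of that fix (the misspelling below is hypothetical, purely for illustration, since the item values turned out to be clean):

SELECT
	* REPLACE (
		CASE TRIM(item)
			WHEN 'Coffe' THEN 'Coffee'  -- hypothetical misspelling used only to illustrate the mapping
			ELSE TRIM(item)
		END AS item
	)
FROM data_types;
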
DataFrame: blank_check
-- check if empty values are all nulls or if there are blanks too. There are no blanks
SELECT *
FROM 'raw_cafe_sales.csv'
WHERE item = ''
	OR quantity = ''
	OR price_per_unit = ''
	OR total_spent = ''
	OR payment_method = ''
	OR location = ''
	OR transaction_date = '';

4. Nulls/blanks

I first replaced incorrect values, indicated by 'UNKNOWN' or 'ERROR', with NULLs, as there is no way of checking what these should be from the information provided in the data. Each transaction is an independent event, so values such as quantity sold or payment method cannot be recovered or imputed from other rows in the table.

The only columns where nulls and error values can be filled in are price_per_unit and total_spent. For the price_per_unit column we use the item's default price, and for the total_spent column we multiply quantity by price_per_unit.
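
A sketch of that imputation, assuming the cleaned columns from the data_types DataFrame and taking the most common observed price for an item as its default price:

WITH item_prices AS (
	SELECT item, MODE(price_per_unit) AS default_price  -- most common price observed for each item
	FROM data_types
	WHERE item IS NOT NULL AND price_per_unit IS NOT NULL
	GROUP BY item
	)
SELECT
	d.transaction_ID,
	d.item,
	d.quantity,
	COALESCE(d.price_per_unit, p.default_price) AS price_per_unit,  -- fill missing prices with the item's default price
	COALESCE(d.total_spent, d.quantity * COALESCE(d.price_per_unit, p.default_price)) AS total_spent,  -- recompute missing totals
	d.payment_method,
	d.location,
	d.transaction_date
FROM data_types AS d
LEFT JOIN item_prices AS p USING (item);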

All columns have blanks/errors except transaction_ID. In the categorical columns, errors are indicated by 'UNKNOWN', 'ERROR' or null; these are replaced with NULLs to make the data more uniform.