Cafe Sales
Data source: https://www.kaggle.com/datasets/ahmedmohamed2003/cafe-sales-dirty-data-for-cleaning-training
The dataset has 10,001 rows, including the header row.
Useful info:
- Click "Code" in the panel at the top right of this page to see the details of my SQL code
Note:
In this project I am working with a CSV file, so I cannot query the dataset as I would a database table. To work around this I use CTEs and DataFrames to reuse results from previous queries when needed. A DataFrame here is an SQL cell with a given name that can be referenced in later queries.
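To illustrate (a minimal sketch only; the cell name item_counts is hypothetical), a query saved as a named SQL cell can be referenced by a later query just like an ordinary table:
-- First cell, saved as a DataFrame named item_counts (hypothetical name)
SELECT item, COUNT(*) AS n_transactions
FROM 'raw_cafe_sales.csv'
GROUP BY item;
-- A later SQL cell can then query that result by name
SELECT *
FROM item_counts
ORDER BY n_transactions DESC;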
Scenario: Analyzing Sales Patterns for Takeaway and In-Store Orders to Optimize Operations
Business Problem: A café with both takeaway and in-store customers is facing challenges in optimizing its staffing and in boosting sales. The goal is to use the sales data to understand the dynamics between takeaway and in-store orders, uncover trends, and provide actionable insights to help improve sales and operational efficiency.
Key Questions to Answer:
- Which menu items generated the highest and lowest revenue for takeaway and in-store orders?
- How do sales differ between takeaway and in-store orders?
- Which day of the week was busiest? What was the peak time of the year? How did the number of transactions perform over the last year?
- What are the trends for takeaway orders vs. in-store orders over time?
- How can we improve revenue?
Summary
- Takeaway orders were more consistent than in-store orders, but in-store revenue exceeded takeaway revenue by 2.3%
- The busiest periods included Fridays and October
Contents
Use the button at the top left of the page to navigate to the different sections
Cleaning
- Identify/remove duplicates
- Data types
- Standardize
- Nulls/blanks
Analysis
- Highest and lowest revenue items
- Sales: Takeaway vs in-store
- Busiest day of week? Peak time of the year? Performance for number of transactions
- Takeaway vs. in-store orders over time?
- How can we improve revenue?
Recommendations
Dashboard
Cleaning
1. Identify/remove duplicates
I will label rows that are exact copies of another row by adding a column that equals 1 for a unique row and is greater than 1 for a duplicate. I partition by all columns because, combined, they are what make a row unique.
SELECT
*,
ROW_NUMBER() OVER(PARTITION BY transaction_ID, item, quantity, price_per_unit, total_spent, payment_method, location, transaction_date) AS row_num -- Unique row will have row_num = 1. Duplicate rows will have row_num > 1
FROM 'raw_cafe_sales.csv';

Filtering this result set where the indicator is larger than 1 returns no results. This shows that there are no duplicates in our data.
To do this, I used a CTE to filter.
WITH duplicate_indicator AS (
SELECT *
FROM row_num
)
SELECT *
FROM duplicate_indicator
WHERE row_num > 1;

2. Data types
First I will select all columns from the table.
The numeric columns contain string values such as 'ERROR'. I will first replace these with NULLs and then convert the columns to appropriate numeric types.
Then I will use SQL's CAST function to correct the data types of each column.
Additionally, the transaction_date column is in the format 'dd/mm/yyyy'. I first replace 'ERROR' and 'UNKNOWN' values with NULLs and then, to get the column into a format SQL can use, I apply SQL's STRPTIME() function to parse the string and cast it to a DATE data type.
WITH CTE AS(
SELECT
transaction_ID,
item,
CASE WHEN quantity NOT IN ('ERROR', 'UNKNOWN') THEN quantity -- replace string values with NULL in numeric columns
ELSE NULL
END AS quantity,
CASE WHEN price_per_unit NOT IN ('ERROR', 'UNKNOWN') THEN price_per_unit -- replace string values with NULL in numeric columns
ELSE NULL
END AS price_per_unit,
CASE WHEN total_spent NOT IN ('ERROR', 'UNKNOWN') THEN total_spent -- replace string values with NULL in numeric columns
ELSE NULL
END AS total_spent,
payment_method,
location,
CASE WHEN transaction_date NOT IN ('ERROR', 'UNKNOWN') THEN transaction_date
ELSE NULL
END AS transaction_date
FROM 'raw_cafe_sales.csv'
)
SELECT
CAST(transaction_ID AS TEXT) AS transaction_ID,
CAST(item AS TEXT) AS item,
CAST(quantity AS INT) AS quantity,
CAST(price_per_unit AS DECIMAL(10,2)) AS price_per_unit, -- DECIMAL rather than INT so any non-integer prices are preserved
CAST(total_spent AS DECIMAL(10,2)) AS total_spent,
CAST(payment_method AS TEXT) AS payment_method,
CAST(location AS TEXT) AS location,
STRPTIME(transaction_date, '%d/%m/%Y')::DATE AS transaction_date -- parsing string into dates and then converting to date data type
FROM CTE;

3. Standardize
Checking for leading/trailing spaces and misspellings in the categorical columns. The example below shows the unique values in the item column.
SELECT DISTINCT(item) -- code to check spelling and unwanted spaces
FROM data_types
ORDER BY 1;
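If the DISTINCT check had revealed stray spaces or misspellings, a fix along the following lines could be applied (a sketch only; the correction of 'coffe' to 'Coffee' is a hypothetical example):
SELECT
*,
CASE WHEN TRIM(LOWER(item)) = 'coffe' THEN 'Coffee' -- hypothetical misspelling fix
ELSE TRIM(item) -- remove any leading/trailing spaces
END AS item_clean
FROM data_types;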
-- Check if the empty values are all NULLs or if there are blank strings too. There are no blanks.
SELECT *
FROM 'raw_cafe_sales.csv'
WHERE item = ''
OR quantity = ''
OR price_per_unit = ''
OR total_spent = ''
OR payment_method = ''
OR location = ''
OR transaction_date = '';

4. Nulls/blanks
I first replaced incorrect values, indicated by 'UNKNOWN' or 'ERROR', with NULLs, as there is no way of checking what these should be from the information provided in the data. Each transaction is an independent event, so values such as quantity sold or payment method can't be recovered or imputed from other values in the table.
The only columns where nulls and error values can be filled in are price_per_unit and total_spent. For price_per_unit we use the item's standard price, and for total_spent we multiply quantity by price_per_unit.
All columns have blanks/errors except transaction_ID. In the categorical columns, errors are indicated by UNKNOWN, ERROR or null. These are replaced with NULLs to make the data more uniform.
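A minimal sketch of the imputation step described above (assumptions: the previous cell is available as the DataFrame data_types, and each item's standard price is taken as its most frequent observed price; the project may derive it differently):
WITH item_prices AS (
-- Assumed approach: use each item's most frequent observed price as its standard price
SELECT item, MODE(price_per_unit) AS standard_price
FROM data_types
WHERE item IS NOT NULL AND price_per_unit IS NOT NULL
GROUP BY item
)
SELECT
d.transaction_ID,
d.item,
d.quantity,
COALESCE(d.price_per_unit, p.standard_price) AS price_per_unit, -- fill missing prices with the item's standard price
COALESCE(d.total_spent, d.quantity * COALESCE(d.price_per_unit, p.standard_price)) AS total_spent, -- recompute missing totals as quantity * price
d.payment_method,
d.location,
d.transaction_date
FROM data_types AS d
LEFT JOIN item_prices AS p ON d.item = p.item;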