Practical Exam: Grocery Store Sales
FoodYum is a grocery store chain that is based in the United States.
Food Yum sells items such as produce, meat, dairy, baked goods, snacks, and other household food staples.
As food costs rise, FoodYum wants to make sure it keeps stocking products in all categories that cover a range of prices to ensure they have stock for a broad range of customers.
Data
The data is available in the table products.
The dataset contains records of customers for their last full year of the loyalty program.
| Column Name | Criteria |
|---|---|
| product_id | Nominal. The unique identifier of the product. Missing values are not possible due to the database structure. |
| product_type | Nominal. The product category type of the product, one of 5 values (Produce, Meat, Dairy, Bakery, Snacks). Missing values should be replaced with “Unknown”. |
| brand | Nominal. The brand of the product. One of 7 possible values. Missing values should be replaced with “Unknown”. |
| weight | Continuous. The weight of the product in grams. This can be any positive value, rounded to 2 decimal places. Missing values should be replaced with the overall median weight. |
| price | Continuous. The price the product is sold at, in US dollars. This can be any positive value, rounded to 2 decimal places. Missing values should be replaced with the overall median price. |
| average_units_sold | Discrete. The average number of units sold each month. This can be any positive integer value. Missing values should be replaced with 0. |
| year_added | Nominal. The year the product was first added to FoodYum stock. Missing values should be replaced with 2022. |
| stock_location | Nominal. The location that stock originates. This can be one of four warehouse locations, A, B, C or D Missing values should be replaced with “Unknown”. |
Task 1
Last year (2022) there was a bug in the product system. For some products that were added in that year, the year_added value was not set in the data. As the year the product was added may have an impact on the price of the product, this is important information to have.
Write a query to determine how many products have the year_added value missing. Your output should be a single column, missing_year, with a single row giving the number of missing values.
-- Write your query for task 1 in this cell
SELECT COUNT(*) AS missing_year
FROM public.products
--to assess all the missing rows from year_added column
WHERE year_added IS NULL
GROUP BY year_added;
Task 2
Given what you know about the year added data, you need to make sure all of the data is clean before you start your analysis. The table below shows what the data should look like.
Write a query to ensure the product data matches the description provided. Do not update the original table.
| Column Name | Criteria |
|---|---|
| product_id | Nominal. The unique identifier of the product. Missing values are not possible due to the database structure. |
| product_type | Nominal. The product category type of the product, one of 5 values (Produce, Meat, Dairy, Bakery, Snacks). Missing values should be replaced with “Unknown”. |
| brand | Nominal. The brand of the product. One of 7 possible values. Missing values should be replaced with “Unknown”. |
| weight | Continuous. The weight of the product in grams. This can be any positive value, rounded to 2 decimal places. Missing values should be replaced with the overall median weight. |
| price | Continuous. The price the product is sold at, in US dollars. This can be any positive value, rounded to 2 decimal places. Missing values should be replaced with the overall median price. |
| average_units_sold | Discrete. The average number of units sold each month. This can be any positive integer value. Missing values should be replaced with 0. |
| year_added | Nominal. The year the product was first added to FoodYum stock. Missing values should be replaced with last year (2022). |
| stock_location | Nominal. The location that stock originates. This can be one of four warehouse locations, A, B, C or D Missing values should be replaced with “Unknown”. |
WITH median_values AS (
SELECT
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY CAST(REGEXP_REPLACE(weight, '[^0-9.]', '', 'g') AS NUMERIC)) AS median_weight,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY price) AS median_price
FROM public.products
WHERE (weight IS NOT NULL AND TRIM(CAST(weight AS TEXT)) != '' AND weight::TEXT NOT IN ('missing', 'na'))
OR (price IS NOT NULL AND CAST(price AS TEXT) NOT IN ('missing', 'na'))
)
SELECT
product_id,
-- Handle missing product_type
CASE
WHEN product_type IS NULL THEN 'Unknown'
WHEN TRIM(product_type) = '' THEN 'Unknown'
WHEN LOWER(product_type) IN ('-', 'missing', 'na') THEN 'Unknown'
ELSE product_type
END AS product_type,
-- Handle missing or placeholder values for brand
CASE
WHEN brand IS NULL THEN 'Unknown'
WHEN TRIM(brand) = '' THEN 'Unknown'
WHEN LOWER(brand) IN ('-', 'missing', 'na') THEN 'Unknown'
ELSE brand
END AS brand,
-- Replace missing weights with the median value and clean numeric weights
CASE
WHEN weight IS NULL THEN
(SELECT median_weight FROM median_values)
WHEN TRIM(CAST(weight AS TEXT)) = '' THEN
(SELECT median_weight FROM median_values)
WHEN CAST(weight AS TEXT) IN ('-','missing', 'na') THEN
(SELECT median_weight FROM median_values)
ELSE
CAST(REGEXP_REPLACE(CAST(weight AS TEXT), '[^0-9.]', '', 'g') AS NUMERIC)
END AS weight,
-- Replace missing prices with the median value
CASE
WHEN price IS NULL THEN
(SELECT median_price FROM median_values)
WHEN TRIM(CAST(price AS TEXT)) = '' THEN
(SELECT median_price FROM median_values)
WHEN CAST(price AS TEXT) IN ('-','missing', 'na') THEN
(SELECT median_price FROM median_values)
ELSE price
END AS price,
-- Replace missing average_units_sold with 0
CASE
WHEN average_units_sold IS NULL THEN 0
WHEN TRIM(CAST(average_units_sold AS TEXT)) = '' THEN 0
WHEN CAST(average_units_sold AS TEXT) IN ('-','missing', 'na') THEN 0
ELSE average_units_sold
END AS average_units_sold,
-- Replace empty year_added values with '2022'
CASE
WHEN year_added IS NULL THEN '2022'
WHEN TRIM(CAST(year_added AS TEXT)) = '' THEN '2022'
WHEN CAST(year_added AS TEXT) IN ('-','missing', 'na') THEN '2022'
ELSE year_added
END AS year_added,
-- Format stock_location to capitalize each word, handle NULL/empty values
CASE
WHEN stock_location IS NULL THEN 'Unknown'
WHEN TRIM(stock_location) = '' THEN 'Unknown'
WHEN LOWER(stock_location) IN ('-', 'missing', 'na') THEN 'Unknown'
ELSE INITCAP(stock_location)
END AS stock_location
FROM public.products;
Task 3
To find out how the range varies for each product type, your manager has asked you to determine the minimum and maximum values for each product type.
Write a query to return the product_type, min_price and max_price columns.
-- Write your query for task 3 in this cell
SELECT product_type,
--maximum & minimum prices for each product_type
MIN(price) AS min_price,
MAX(price) AS max_price
FROM public.products
GROUP BY product_type
--for putting the product_type in alphabetical order
ORDER BY product_type ASC;Task 4
The team want to look in more detail at meat and dairy products where the average units sold was greater than ten.
Write a query to return the product_id, price and average_units_sold of the rows of interest to the team.
-- Write your query for task 4 in this cell
SELECT product_id,
price,
average_units_sold
FROM public.products
--results only for 'Meat' & 'Dairy'
WHERE product_type IN ('Meat', 'Dairy')
GROUP BY product_id, price, average_units_sold
--Avg.units sold greater than 10
HAVING average_units_sold > 10;