Skip to content
Demo of SQL skills. (Untick Reader mode to view code)
The content below is a short demonstration of some of my SQL skills. I've tried to keep this as similar to real-life as possible by first coming up with a task/question about the dataset, then using SQL querying techniques to produce the result.
HEALTH DEPT INSPECTION DATA
Create a chart showing total restaurant inspections per month for the last 12 months. Ensure that we only ever have the enitre months count or this month so far i.e. exclude incomplete data from this month last year.
- Note: the code could easily be altered and applied to different scenarios e.g. to return sales/profit/traffic etc. and extract different parts of the timestamp e.g. hour/quarter
DataFrameavailable as
df7
variable
CREATE TABLE restaurant_inspection AS SELECT * FROM NYC_Open_Data_Restaurant_Inspection_Results_sample5k.csv;
-- Extract and define most recent date. In reality this would be today/yesterday.
WITH most_recent AS (SELECT MAX("INSPECTION DATE") AS mostrecent,
EXTRACT(MONTH FROM mostrecent) AS mostrecentmonth,
EXTRACT(YEAR FROM mostrecent) AS mostrecentyear
FROM restaurant_inspection),
-- Extract year and month from inspection date and reformat as YYYY, MM
years_and_months AS (SELECT camis,
"INSPECTION DATE",
EXTRACT('YEAR' FROM "INSPECTION DATE") AS inspection_year,
EXTRACT('MONTH' FROM "INSPECTION DATE") AS inspection_month,
inspection_year || ', '|| inspection_month AS actual_month,
(SELECT mostrecent FROM most_recent) - "INSPECTION DATE" AS AGE
FROM restaurant_inspection)
SELECT actual_month,
inspection_year,
inspection_month,
COUNT(*) AS Inspections
FROM years_and_months
/*Add a condition so that we only ever have data from the most recent 365 days and we only ever have the enitre months count or this month so far i.e. exclude data from this month last year */
WHERE AGE < 365
AND NOT (inspection_month = (SELECT mostrecentmonth FROM most_recent) AND inspection_year = (SELECT mostrecentyear FROM most_recent) -1)
GROUP BY actual_month, inspection_year, inspection_month
ORDER BY inspection_year, inspection_month;
SALES DATA
Sales Analysis: Show how much each product type contributes to the total sales of each warehouse.
DataFrameavailable as
df11
variable
-- Calculate total revenue grouped by warehouse and product line
WITH product_line_total_by_area AS (SELECT warehouse,
product_line,
SUM(total) AS revenue
FROM sales
GROUP BY warehouse, product_line)
-- Calculate total revenue by area (warehouse), then use the product_line_total_by_area CTE to calculate the contribution of each product line to the overall revenue of each warehouse.
SELECT warehouse,
product_line,
revenue,
SUM(revenue) OVER(PARTITION BY warehouse) AS warehouse_total,
ROUND((revenue/(SUM(revenue) OVER(PARTITION BY warehouse)) *100.0), 1) AS pct_contribution_to_warehouse_total
FROM product_line_total_by_area
ORDER BY warehouse_total DESC, pct_contribution_to_warehouse_total DESC;
Show a simplified distibution of payment amounts grouped in £3 increments. Ensure any records with a payment amount of 0 are excluded.
- Utilizes generate_series function and a careful join.
- Better than a CASE statement because this technique could easily be amended and applied to a larger range of values and includes groups with zero observations (I've included the £12-15 bin to demonstrate this).
DataFrameavailable as
df12
variable
-- Create grouping bins
WITH bins AS (SELECT generate_series(0,12,3) AS lower,
generate_series(3,15,3) AS upper)
-- Exclude payment values of 0
, not_free AS (SELECT *
FROM dvdrentals.payment
WHERE amount > 0)
-- Utilize the upper and lower bounds to group the payment values into the newly created bins
SELECT lower,
upper,
'£' || lower || ' - £' || upper AS Bin,
COUNT(amount)
FROM bins
LEFT JOIN not_free
ON amount >= lower
AND amount < upper
GROUP BY lower, upper
ORDER BY lower;
FOOTBALL SCORES DATA
From a table of football match results containing home and away team ids and goals scored by each team within a single record, return the top scoring team (home goals plus away goals) for each season in the English Premier League.
- This is a multi-step transformation using CTEs to improve query readabilty.
- Difficult due to both teams in a match appearing in separate fields of the same record.
- Required the home/away team and respective goals scored to be separated, unioned on top of each other and then re-grouped by team name.
- Required multiple joins to pull in information from other tables and a partitioned window function to show top scoring team by season.
Tables:
Match table example 5 rows:
DataFrameavailable as
df13
variable
WITH Goals AS ( -- CTE to stack home and away teams and goals on top of each other to allow for correct aggregation
SELECT country_id
,C.name
,season
,hometeam_id AS team_id
,home_goal AS goals
FROM soccer.match AS M
LEFT JOIN soccer.country AS C -- Join to country table in order to include Country name in dataset
ON M.country_id = C.id
WHERE C.name = 'England' -- Filter to only retrn matches played in English League
UNION ALL
SELECT country_id
,C.name
,season
,awayteam_id
,away_goal
FROM soccer.match AS M
LEFT JOIN soccer.country AS C
ON M.country_id = C.id
WHERE C.name = 'England'
)
,
GoalsBySeason AS ( -- CTE to calculate each total goals in each season
SELECT T.team_long_name AS Team
,G.season
,SUM(G.Goals) AS TotalGoals
FROM Goals AS G
LEFT JOIN soccer.team AS T -- Join to retrieve string team names from team table
ON G.team_id = T.team_api_id
GROUP BY G.season, T.team_long_name
)
--QUALIFY clause unsupported, therefore RANK used inside a subquery, then filtered with WHERE to show Top Goal Scorers
SELECT season,
Team AS TopGoalScorers
,TotalGoals
FROM (SELECT season
,Team
,TotalGoals
,RANK() OVER (PARTITION BY season ORDER BY TotalGoals DESC) AS rank
FROM GoalsBySeason) AS TotalGoalsRanked
WHERE rank = 1
DataFrameavailable as
df4
variable
[32]
--V1 of above query. Included to show there is often more than one way to get to get the same result --
-- filter to only return matches played by English teams.
WITH england_country_id AS (SELECT id
FROM soccer.country
WHERE name = 'England')
,english_matches AS (SELECT *
FROM soccer.match
WHERE country_id IN (SELECT id
FROM england_country_id))
-- 2 joins to get the home team name and away team name from the soccer.team table
, english_match_details AS (SELECT season,
soccer.league.name AS league,
date,
hometeam_id,
hometeams.team_long_name AS hometeam,
awayteam_id,
awayteams.team_long_name AS awayteam,
home_goal,
away_goal
FROM english_matches
LEFT JOIN soccer.team AS hometeams
ON hometeam_id = hometeams.team_api_id
LEFT JOIN soccer.team AS awayteams
ON awayteam_id = awayteams.team_api_id
LEFT JOIN soccer.league USING(country_id))
/* many subsequent CTEs to get each teams total goals per season, then identify the highest scoring team for each season */
-- calculate total home goals for each team, for each season
, homegoals AS (SELECT season,
hometeam,
SUM(home_goal) AS total_home_goals
FROM english_match_details
GROUP BY season, hometeam)
-- calculate total away goals for each team, for each season
, awaygoals AS (SELECT season,
awayteam,
SUM(away_goal) AS total_away_goals
FROM english_match_details
GROUP BY season, awayteam)
-- union the 2 datasets from the 2 previous CTEs
, unionhomeandawaygoals AS (SELECT season,
hometeam AS team,
total_home_goals AS total_goals
FROM homegoals
UNION ALL
SELECT season,
awayteam,
total_away_goals
FROM awaygoals)
-- group by season and team then sum home goals and away goals to get each teams total goals for each season
, totalseasongoals AS (SELECT season,
team,
SUM(total_goals) AS season_goals
FROM unionhomeandawaygoals
GROUP BY season, team
ORDER BY team, season)
-- calaculate highest number of total goals scored by a team for each season
,seasontop AS (SELECT season,
team,
season_goals,
MAX(season_goals) OVER(PARTITION BY season) AS seasontopscorer
FROM totalseasongoals)
-- return team name and goals scored for the highest scoring team for each season
SELECT season,
team,
season_goals
FROM seasontop
WHERE season_goals = seasontopscorer
ORDER BY season;
Use the match and country tables to return the countries in which the average number of goals (home and away goals) scored in a match is greater than the average number of goals in a match for all countries overall.
- Required GROUP BY, HAVING and a simple subquery to calculate the overall average goals for each country to be compared against.
DataFrameavailable as
df5
variable
SELECT c.name,
ROUND(AVG(m.home_goal + m.away_goal),1) AS avg_goals_per_match
FROM soccer.match AS m
LEFT JOIN soccer.country AS c
ON m.country_id = c.id
GROUP BY c.name
HAVING AVG(m.home_goal + m.away_goal) > (SELECT AVG(m.home_goal + m.away_goal)
FROM soccer.match AS m)
ORDER BY avg_goals_per_match DESC;
Calculate the running total of points earned by Chelsea (team id 8455) in the 2014/2015 season.
- Required CASE statement to identify Chelsea as either the home team or the away team and use this to determine their points for each game they played in. Win = 3, Draw = 1, Loss = 0.
- Then a window function was used to calulate the running points total alongside points for each match.
DataFrameavailable as
df6
variable
WITH chelsea_points AS (
SELECT date,
CASE WHEN hometeam_id = 8455 THEN CASE WHEN home_goal > away_goal THEN 3
WHEN away_goal > home_goal THEN 0
ELSE 1 END
WHEN awayteam_id = 8455 THEN CASE WHEN home_goal > away_goal THEN 0
WHEN away_goal > home_goal THEN 3
ELSE 1 END
END AS points
FROM soccer.match
WHERE season = '2014/2015'
AND (hometeam_id = 8455 OR awayteam_id = 8455)
ORDER BY date DESC)
SELECT date,
points,
SUM(points) OVER (ORDER BY date DESC ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
FROM chelsea_points;
OLYMPICS MEDALS DATA
For the top 5 all-time highest medal-winning countries in the dataset: show their medal total per games as well as their overall total.
- Required use of GROUP BY in CTEs to get top 5 highest medal-winning countries and medals won by each country each olympics.
- Then GROUP BY with ROLLUP to show each countries total for each olympics as well as their grand total for the last 10 olympics that they competed in.