Overview of the workspace:
Window functions are used to perform calculations on a group of rows without combining them into a single result (GROUP BY). Window functions become very handy when calculating metrics such as an average over a period of time or for a specific segment as well as a running total.
Utilizing the Olympics data*, you will go through:
- How to define a window function that includes a set of rows
- How to divide the set of rows included within a window function into partitions or frames
- How to calculate moving averages and running totals
- How to calculate grand totals
*This workspace uses the Olympics data from DataCamp.
Overview of the summer_medals Dataset
summer_medals DatasetThe summer_medals table in the medals schema of our PostgreSQL database contains historical data (1896-2012) related to the Summer Olympic Games. Here is a brief overview of the columns present in the summer_medals table:
year: The year in which the Olympic Games were held.city: The city where the Olympic Games took place.sport: The category of sport (e.g., Athletics, Swimming, etc.).discipline: A specific discipline within a sport category.athlete: The name of the athlete who competed.country: The country that the athlete represented.gender: The gender of the athlete (e.g., Male, Female).event: The specific event in which the athlete competed.medal: The type of medal won by the athlete (Gold, Silver, Bronze).
This dataset is a rich source of information for analyzing the performance of athletes and countries across different sports disciplines over the years. It can be used to track the evolution of the Olympic Games, study trends in athletic performance, and explore the distribution of medals among participating nations.
--Snapshot of the dataset
SELECT *
from medals.summer_medals
LIMIT 5
1. Ranking Window Functions: row numbers, rank, dense rank
The example below uses ROW_NUMBER() to assign a row number to each row. The OVER() clause is the indicator of a window function and defines the range of rows upon which the calculations would be performed.
In the query2,OVER() clause is empty, so only the first five records of the entire query can be seen.
SELECT year, athlete,
-- Assign a row number to athletes
ROW_NUMBER() OVER() AS athlete_number
FROM medals.summer_medals
LIMIT 52. Ordering the rows within a window
To define an order, we use ORDER BY within the OVER() clause. This order is applied to the range of rows defined within the window function.
In query 2, row numbers are assigned to athletes based on the descending number of medals won. However, you need RANK() and/or DENSE_RANK() if you want to distinguish atlethes with same amount of medals. Both functions give the same rank to ties values, but DENSE_RANK() does not skip the next rankings.
SELECT athlete,medal_count,
ROW_NUMBER() OVER(ORDER BY medal_count DESC) as medal_row,
RANK() OVER (ORDER BY medal_count DESC) AS medal_ranking,
DENSE_RANK() OVER (ORDER BY medal_count DESC) AS medal_denseranking
FROM ( -- Fetch athlete and the number of medals they won
SELECT athlete, COUNT(*) AS medal_count
FROM medals.summer_medals
GROUP BY athlete
) AS medal_counts
LIMIT 103. Calculating window functions over partitions- GROUP BY vs PARTITION BY
OVER() has a clause called PARTITION BY, which divides the range of rows defined within the window function into partitions. If it is not used, the entire result of the query is treated as one partition.(Plese refer to query 4 below)
In query 3, the window is divided into partitions based on the country. Therefore,you can observe how total medals won by each athlete participating from USA and Turkiye in 2000.
Note:For simplicity,ROW_NUMBER()function is used here.RANK()and/or DENSE_RANK() can also be utilized based on the business problem you are solving.
SELECT athlete,country, medal_count,
--Assign row numbers to each country based on the desc number of medals won while partitioning by country
ROW_NUMBER() OVER(PARTITION BY country ORDER BY medal_count DESC) AS medal_ranking
FROM (-- Fetch athlete, country, and the number of medals athlete won
SELECT athlete, country, COUNT(*) AS medal_count
FROM medals.summer_medals
WHERE country IN ('USA', 'TUR') and year=2000
GROUP BY athlete, country
) AS medal_counts
ORDER BY country, medal_count DESC
LIMIT 10---Total medals won by USA and Turkiye in 2000
SELECT country,COUNT(*) AS medal_count
FROM medals.summer_medals
WHERE country IN ('USA', 'TUR') and year=2000
GROUP BY country4. Aggregate window functions: Moving Average& Running Totals (Cumulative Sum)
Up until now, ROW_NUMBER() was used to assign numbers to rows. Window functions can also be used to perform aggregations over rows.
Let's begin by calculating a moving average. Moving averages take the average value over a previous number of rows,can help eliminate noise from data and identify trends.OVER()is used in combination with the AVG() function. By default, this will calculate the average over all previous rows.
Running total calculates the total sum of a column over the preceding values and the current value. It is performed in the same way as a moving average, except that SUM() is used in place of AVG(). MIN()andMAX() functions also follow the same logic.
The following query (query5) calculates several different aspects of medals won by United States over time.
SELECT year,medal_count,
-- Calculate the moving average
ROUND(AVG(medal_count) OVER(ORDER BY year),1) as average_medal_count,
-- Calculate the running total
SUM(medal_count) OVER(ORDER BY year) as cum_medal_count,
-- Calculate maximum medals earned so far
MAX(medal_count) OVER( ORDER BY year) as max_medal_count,
-- Calculate minimum medals earned so far
MIN(medal_count) OVER( ORDER BY year) as min_medal_count
FROM ( -- Fetch the year and number of medals won by USA
SELECT year, COUNT(*) AS medal_count
FROM medals.summer_medals
WHERE country = 'USA'
GROUP BY year) AS usa_medals
LIMIT 105. Adjusting the range of window functions: Frames
Frames are used to define the range of rows for a windows function. They are specified within OVER(). The range can be defined by using ROWS BETWEEN [start] and [finish]. The start and finish can be one of the three clauses - PRECEDING, CURRENT ROW, and FOLLOWING.
For example:
ROWS BETWEEN 2 PRECEDING AND CURRENT ROWwill define a frame that starts two rows before the current row and ends at the current row.ROWS BETWEEN 1 PRECEDING AND 3 FOLLOWINGwill define a frame that starts one row before the current row and ends three rows after the current row.ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROWwill define a frame that starts from the first row and ends at the current row.
The following query calculates the moving average for the Canadian team's medal count for the past three summer Olympics.
SELECT year,medal_count,
-- Calculate moving average over 3 rows
AVG(medal_count) OVER(ORDER BY year ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) as last_three_average_medal_count
FROM
(-- Fetch the year and number of medals won
SELECT year,COUNT(*) AS medal_count
FROM medals.summer_medals
-- Narrow the results to Canada
WHERE country = 'CAN'
GROUP BY year) AS canadian_medals