Skip to content

Overview of the workspace:

Window functions are used to perform calculations on a group of rows without combining them into a single result (GROUP BY). Window functions become very handy when calculating metrics such as an average over a period of time or for a specific segment as well as a running total.

Utilizing the Olympics data*, you will go through:

  • How to define a window function that includes a set of rows
  • How to divide the set of rows included within a window function into partitions or frames
  • How to calculate moving averages and running totals
  • How to calculate grand totals

*This workspace uses the Olympics data from DataCamp.

Overview of the summer_medals Dataset

The summer_medals table in the medals schema of our PostgreSQL database contains historical data (1896-2012) related to the Summer Olympic Games. Here is a brief overview of the columns present in the summer_medals table:

  • year: The year in which the Olympic Games were held.
  • city: The city where the Olympic Games took place.
  • sport: The category of sport (e.g., Athletics, Swimming, etc.).
  • discipline: A specific discipline within a sport category.
  • athlete: The name of the athlete who competed.
  • country: The country that the athlete represented.
  • gender: The gender of the athlete (e.g., Male, Female).
  • event: The specific event in which the athlete competed.
  • medal: The type of medal won by the athlete (Gold, Silver, Bronze).

This dataset is a rich source of information for analyzing the performance of athletes and countries across different sports disciplines over the years. It can be used to track the evolution of the Olympic Games, study trends in athletic performance, and explore the distribution of medals among participating nations.

Spinner
Queryas
query1
variable
--Snapshot of the dataset
SELECT *
from medals.summer_medals
LIMIT 5

1. Ranking Window Functions: row numbers, rank, dense rank

The example below uses ROW_NUMBER() to assign a row number to each row. The OVER() clause is the indicator of a window function and defines the range of rows upon which the calculations would be performed.

In the query2,OVER() clause is empty, so only the first five records of the entire query can be seen.

Spinner
Queryas
query
variable
SELECT year, athlete,
    -- Assign a row number to athletes
	ROW_NUMBER() OVER() AS athlete_number
FROM medals.summer_medals
LIMIT 5

2. Ordering the rows within a window

To define an order, we use ORDER BY within the OVER() clause. This order is applied to the range of rows defined within the window function.

In query 2, row numbers are assigned to athletes based on the descending number of medals won. However, you need RANK() and/or DENSE_RANK() if you want to distinguish atlethes with same amount of medals. Both functions give the same rank to ties values, but DENSE_RANK() does not skip the next rankings.

Spinner
Queryas
query2
variable
SELECT athlete,medal_count, 
   ROW_NUMBER() OVER(ORDER BY medal_count DESC) as medal_row,
   RANK() OVER (ORDER BY medal_count DESC) AS medal_ranking,
   DENSE_RANK() OVER (ORDER BY medal_count DESC) AS medal_denseranking
FROM ( -- Fetch athlete and the number of medals they won
    SELECT athlete, COUNT(*) AS medal_count
    FROM medals.summer_medals
    GROUP BY athlete
  ) AS medal_counts
LIMIT 10

3. Calculating window functions over partitions- GROUP BY vs PARTITION BY

OVER() has a clause called PARTITION BY, which divides the range of rows defined within the window function into partitions. If it is not used, the entire result of the query is treated as one partition.(Plese refer to query 4 below)

In query 3, the window is divided into partitions based on the country. Therefore,you can observe how total medals won by each athlete participating from USA and Turkiye in 2000.

Note:For simplicity,ROW_NUMBER()function is used here.RANK()and/or DENSE_RANK() can also be utilized based on the business problem you are solving.

Spinner
Queryas
query3
variable
SELECT athlete,country, medal_count, 
--Assign row numbers to each country based on the desc number of medals won while partitioning by country
  ROW_NUMBER() OVER(PARTITION BY country ORDER BY medal_count DESC) AS medal_ranking
FROM (-- Fetch athlete, country, and the number of medals athlete won
    SELECT athlete, country, COUNT(*) AS medal_count
    FROM medals.summer_medals
    WHERE country IN ('USA', 'TUR') and year=2000
    GROUP BY athlete, country
  ) AS medal_counts
ORDER BY country, medal_count DESC
LIMIT 10
Spinner
Queryas
query4
variable
---Total medals won by USA and Turkiye in 2000
SELECT country,COUNT(*) AS medal_count
FROM medals.summer_medals
WHERE country IN ('USA', 'TUR') and year=2000
GROUP BY country

4. Aggregate window functions: Moving Average& Running Totals (Cumulative Sum)

Up until now, ROW_NUMBER() was used to assign numbers to rows. Window functions can also be used to perform aggregations over rows. Let's begin by calculating a moving average. Moving averages take the average value over a previous number of rows,can help eliminate noise from data and identify trends.OVER()is used in combination with the AVG() function. By default, this will calculate the average over all previous rows.

Running total calculates the total sum of a column over the preceding values and the current value. It is performed in the same way as a moving average, except that SUM() is used in place of AVG(). MIN()andMAX() functions also follow the same logic.

The following query (query5) calculates several different aspects of medals won by United States over time.

Spinner
Queryas
query5
variable
SELECT year,medal_count,
  -- Calculate the moving average
  ROUND(AVG(medal_count) OVER(ORDER BY year),1) as average_medal_count, 
   -- Calculate the running total
  SUM(medal_count) OVER(ORDER BY year) as cum_medal_count, 
  -- Calculate maximum medals earned so far
  MAX(medal_count) OVER( ORDER BY year) as max_medal_count,
  -- Calculate minimum medals earned so far
  MIN(medal_count) OVER( ORDER BY year) as min_medal_count
FROM ( -- Fetch the year and number of medals won by USA
SELECT year, COUNT(*) AS medal_count
    FROM medals.summer_medals
    WHERE country = 'USA'
    GROUP BY year) AS usa_medals
LIMIT 10

5. Adjusting the range of window functions: Frames

Frames are used to define the range of rows for a windows function. They are specified within OVER(). The range can be defined by using ROWS BETWEEN [start] and [finish]. The start and finish can be one of the three clauses - PRECEDING, CURRENT ROW, and FOLLOWING.

For example:

  • ROWS BETWEEN 2 PRECEDING AND CURRENT ROW will define a frame that starts two rows before the current row and ends at the current row.
  • ROWS BETWEEN 1 PRECEDING AND 3 FOLLOWING will define a frame that starts one row before the current row and ends three rows after the current row.
  • ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW will define a frame that starts from the first row and ends at the current row.

The following query calculates the moving average for the Canadian team's medal count for the past three summer Olympics.

Spinner
DataFrameas
moving_average
variable
SELECT year,medal_count,
  -- Calculate moving average over 3 rows
  AVG(medal_count) OVER(ORDER BY year ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) as last_three_average_medal_count 
FROM 
  (-- Fetch the year and number of medals won
SELECT year,COUNT(*) AS medal_count
    FROM medals.summer_medals
    -- Narrow the results to Canada
    WHERE country = 'CAN'
    GROUP BY year) AS canadian_medals