Skip to content

PostgreSQL Summary Stats and Window Functions

Here you can access the summer_medals table used in the course. To access the table, you will need to specify the medals schema in your queries (e.g., medals.summer_medals).

Summer olympics dataset

  • Each row represents a medal awarded in the Summer Olympics games

Explore Datasets

Use the summer_medals table to explore the data and practice your skills!

  • Select the athlete, event, and year from the summer_medals table.
    • Add another column, previous_winner, which contains the previous winner of the same event.
    • Filter your results for gold medalists.
  • Return the year, total number of medalists per year, and running total number of medalists in the history of the Summer Olympics.
    • Order your results by year in ascending order.
  • Return the country, year, and the number of gold medals earned.
    • Limit your results to the years 2004, 2008, and 2012.
    • Each country should have a subtotal of all gold medals earned across the three years.
Spinner
DataFrameas
df
variable
SELECT *
FROM medals.summer_medals
LIMIT 5;

Introduction to window functions

Course outline:

  • Introduction to window functions
  • Fetching, ranking, and paging, bin values in rows
  • Aggregate window functions and frames
  • Beyond window functions

Window functions perform an operation across a set of rows that are somehow related to the current row. They're similar to GROUP BY aggregate functions, but all rows remain in the output (instead of being grouped in a single row).

Uses:
  • Fetching values from preceding or following rows (e.g fetching the previous row's value)
  • Assigning ordinal ranks (1st, 2nd, etc.) to rows based on their values' positions in a sorted list
  • Running totals, moving averages
Spinner
DataFrameas
df
variable
-- The most basic thing you can do with window functions is assign row numbers. They allow you to reference a row by its posiiton or index as opposed to its values.
--Eg. This:
SELECT year, event, country
FROM medals.summer_medals
WHERE medal = 'Gold';
Spinner
DataFrameas
df
variable
-- Can be changed into this:
SELECT year, event, country,
    ROW_NUMBER() OVER() AS row_n
FROM medals.summer_medals
WHERE medal = 'Gold';
Anatomy of a window function

FUNCTION_NAME() OVER(...)

  • The OVER clause indicates that a function is a window function
  • The parenthesis after OVER() can contain subclauses, such as ORDER BY, PARTITION BY, and ROWS/RANGE, PRECEDING/FOLLOWING/UNBOUNDED
Spinner
DataFrameas
df
variable
-- Querying the table to easiloy find in which year the 13th summer olympics were held

SELECT
  year,
  -- Assign numbers to each year
  ROW_NUMBER() OVER() AS row_n
FROM (
  SELECT DISTINCT(year)
  FROM medals.summer_medals
  ORDER BY year ASC
) AS years
ORDER BY Year ASC;
ORDER BY
  • One of the subclauses within the OVER clause
  • ORDER BY in OVER orders the rows related to the current row that the window function will use. E.g. ordering by year in descending order in ROW_NUMBER's OVER clause will assign 1 to the most recent year's rows
  • You can order by multiple columns in the OVER clause, just like normal, and they'll change the numbers assign to each row
  • You can order both inside and outside the OVER clause at the same time. First, ROW_NUMBER will assign numbers based on the order within OVER. After that, the ORDER outside of OVER takes over, and sorts the results of the table by that. The first row on the result won't necessarily be row_number 1.
  • ORDER BY inside OVER takes effect before ORDER BY outside OVER.
Spinner
DataFrameas
df
variable
SELECT year, country, event, 
    ROW_NUMBER() OVER(ORDER BY year DESC, event DESC) as row_n
    -- the row numbers are given after sorting the table by year and event
FROM medals.summer_medals
WHERE medal = 'Gold'
ORDER BY country, row_n;
-- the results of the table will now be sorted by country and row number. SO the first row in the result isn't the row with number 1, because the two orders are based on different columns.
Spinner
DataFrameas
df
variable
-- Ranking olympic medalists by medals earned

WITH athlete_medals AS(
SELECT
  -- Count the number of medals each athlete has earned
  athlete,
  count(medal) AS Medals
FROM medals.Summer_Medals
GROUP BY athlete)
    
SELECT
    -- Number each athlete by how many medals they've earned
    athlete,
    ROW_NUMBER() OVER(ORDER BY medals DESC) AS row_n
FROM athlete_medals
ORDER BY Medals DESC;

Using ORDER BY to determine reigning champions (a champion who's won both the previous and current year's competitions).
The previous and current year's champions need to be in the same row (in two different columns) so that they can be compared.

How can you get a previous value without complex self-joins? Using LAG.

  • LAG(column, n) OVER(...) returns column's value at the row n rows before the current row.
    • LAG(column, 1) OVER(...) returns the previous row's value
Spinner
DataFrameas
df
variable
-- Step 1) getting each year's champions and wrapping it in a CTE to temporarily store a query's results as a table

WITH discus_gold AS(
SELECT year, country AS champion
FROM medals. summer_medals
WHERE year IN (1996, 2000, 2004, 2008, 2012)
    AND gender = 'Men' AND medal = 'Gold' AND event = 'Discus Throw')

-- STEP 2) Using the previous CTE with LAG
SELECT year, champion, 
    LAG(champion, 1) OVER (ORDER BY year) AS last_champion
FROM discus_gold
ORDER BY year;