Window Functions in SQL
  • AI Chat
  • Code
  • Report
  • Spinner

    Window Functions in SQL

    Window functions are used to perform calculations on a group of rows without combining them into a single result. This can be useful when you are calculating metrics such as an average over a period of time or a running total. Unlike a GROUP BY, it does not group rows into a single row. In this workspace, you will learn:

    • How to define a window function that includes a set of rows
    • How to divide the set of rows included within a window function into partitions or frames
    • How to calculate moving averages and running totals.

    This workspace uses the Olympics data from the course PostgreSQL Summary Stats and Window Functions. You are free to create an integration to your own data or use one of the existing integrations. You can learn more about integrations here.

    Calculating row numbers

    The example below uses ROW_NUMBER() to assign a row number to each row. The row numbers are assigned based on the results of the query. The OVER() clause defines the range of rows upon which the calculations would be performed. By default, the OVER() clause will use the entire result of the query.

    Note: ROW_NUMBER() assigns a unique number to each row. If you want rows with the same value to receive identical numbers, you can use RANK() or DENSE_RANK(). Refer to this video to learn more about the differences between these two functions.

    👇  To run a SQL cell like the one below, click inside the cell to select it and click "Run" or the ► icon. You can also use Shift-Enter to run a selected cell.

    Unknown integration
    DataFrameavailable as
    df
    variable
    SELECT 
    	year, 
        athlete,
        -- Assign a row number to athletes
    	ROW_NUMBER() OVER() AS athlete_number
    FROM medals.summer_medals
    LIMIT 5
    This query is taking long to finish...Consider adding a LIMIT clause or switching to Query mode to preview the result.

    Ordering the rows within a window

    To define an order, you can use ORDER BY within the OVER() clause. This order is applied to the range of rows defined within the window function. In the example below, row numbers are assigned to each event based on the descending number of tickets sold.

    If you want to know more about ordering in window functions, you can check out this video.

    Unknown integration
    DataFrameavailable as
    df
    variable
    SELECT 
      athlete,
      medal_count, 
      -- Assign row numbers to athletes based on the descending number of medals won
      ROW_NUMBER() OVER(ORDER BY medal_count DESC) as medal_ranking
    FROM 
      (
        -- Fetch athlete and the number of medals they won
        SELECT athlete, COUNT(*) AS medal_count
        FROM medals.summer_medals
        GROUP BY athlete
      ) AS medal_counts
    This query is taking long to finish...Consider adding a LIMIT clause or switching to Query mode to preview the result.

    Calculating window functions over partitions

    OVER() has a clause called PARTITION BY, which divides the range of rows defined within the window function into partitions. If it is not used, the entire result of the query is treated as one partition. The window is divided into partitions based on the country in the example below.

    You can learn more about partitions in this video.

    Unknown integration
    DataFrameavailable as
    df
    variable
    SELECT 
      athlete,
      country,
      medal_count, 
      --Assign row numbers to each country based on the descending number of medals won while partitioning by country
      ROW_NUMBER() OVER(PARTITION BY country ORDER BY medal_count DESC) AS medal_ranking
    FROM 
      (
        -- Fetch athlete, country, and the number of medals athlete won
        SELECT 
    		athlete,
          	country,
          	COUNT(*) AS medal_count
        FROM medals.summer_medals
        WHERE country IN ('QAT', 'PHI')
        GROUP BY athlete, country
      ) AS medal_counts
    ORDER BY country, medal_count DESC
    This query is taking long to finish...Consider adding a LIMIT clause or switching to Query mode to preview the result.

    Calculating a moving average

    Up until now, ROW_NUMBER() was used to assign numbers to rows. Window functions can also be used to perform aggregations over rows. Let's begin by calculating a moving average. Moving averages take the average value over a previous number of rows and can help eliminate noise from data and identify trends.

    To calculate the moving average, we again use OVER(), but this time in combination with the AVG() function. By default, this will calculate the average over all previous rows.

    You can find more information about moving averages here.

    Unknown integration
    DataFrameavailable as
    ticket_prices
    variable
    SELECT 
      year,
      medal_count,
      -- Calculate the moving average
      AVG(medal_count) OVER(ORDER BY year) as average_medal_count 
    FROM 
      (
        -- Fetch the year and number of medals won
        SELECT 
          	year, COUNT(*) AS medal_count
        FROM medals.summer_medals
        -- Narrow the results to Canada
        WHERE country = 'CAN'
        GROUP BY year
      ) AS canadian_medals
    This query is taking long to finish...Consider adding a LIMIT clause or switching to Query mode to preview the result.

    Adjusting the range of window functions

    Frames are used to define the range of rows for a windows function. They are specified within OVER(). The range can be defined by using ROWS BETWEEN [start] and [finish]. The start and finish can be one of the three clauses - PRECEDING, CURRENT ROW, and FOLLOWING.

    For example:

    • ROWS BETWEEN 2 PRECEDING AND CURRENT ROW will define a frame that starts two rows before the current row and ends at the current row.
    • ROWS BETWEEN 1 PRECEDING AND 3 FOLLOWING will define a frame that starts one row before the current row and ends three rows after the current row.
    • ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW will define a frame that starts from the first row and ends at the current row.

    The following query calculates the moving average for the Canadian team's medal count for the past three summer Olympics. You can find more information about frames here.

    Unknown integration
    DataFrameavailable as
    moving_average
    variable
    SELECT 
      year,
      medal_count,
      -- Calculate moving average over 3 rows
      AVG(medal_count) OVER(ORDER BY year ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) as last_three_average_medal_count 
    FROM 
      (
        -- Fetch the year and number of medals won
        SELECT 
          	year,
          	COUNT(*) AS medal_count
        FROM medals.summer_medals
        -- Narrow the results to Canada
        WHERE country = 'CAN'
        GROUP BY year
      ) AS canadian_medals
    This query is taking long to finish...Consider adding a LIMIT clause or switching to Query mode to preview the result.

    Plotting the moving average

    The plot below shows how a moving average removes noise by plotting both the original column and the moving average.

    # Import libraries and set plotting style
    import seaborn as sns
    import matplotlib.pyplot as plt
    sns.set_style('darkgrid')
    plt.rcParams["figure.figsize"]=(12, 6)
    
    # Create the line plots
    sns.lineplot(x='year', y='medal_count', data=moving_average)
    sns.lineplot(x='year', y='last_three_average_medal_count', data=moving_average)
    
    # Add labels and show the plot
    plt.title('Medals Won by Canada', fontsize=20)
    plt.xlabel('Date', fontsize=15) 
    plt.ylabel('Number of Medals', fontsize=15)
    plt.xticks(rotation ='45')
    plt.legend(labels=["Average Medal Count","Medal Count"])
    plt.show()

    Calculating a running total

    A running total calculates the total sum of a column over the preceding values and the current value. It is performed in much the same way as a moving average, except that SUM() is used in place of AVG().

    The following query calculates the total number of medals won by Canada over time.

    Unknown integration
    DataFrameavailable as
    running_total
    variable
    SELECT 
      year,
      medal_count,
      -- Calculate a running total
      SUM(medal_count) OVER(ORDER BY year) as total_medal_count 
    FROM 
      (
        -- Fetch the year and number of medals won
        SELECT 
          	year,
          	COUNT(*) AS medal_count
        FROM medals.summer_medals
        -- Narrow the results to Canada
        WHERE country = 'CAN'
        GROUP BY year
      ) AS canadian_medals
    This query is taking long to finish...Consider adding a LIMIT clause or switching to Query mode to preview the result.