Working with Window Functions in SQL
Window functions are used to perform calculations on a group of rows without combining them into a single result. This can be useful when you are calculating metrics such as SUM, AVG, and MAX and want to display all the rows in the output. Unlike Group By, it does not group rows into a single row. In this workspace, you will learn:
- How to define a window function that includes a set of rows
- How to divide the set of rows included within a window function into partitions or frames
- How to calculate moving averages and sum using window functions
This workspace utilizes the online ticket sales database from the Datacamp integrations. The link to the dataset is here. You are free to create an integration to your data set or use one of the existing integrations. You can learn more about integrations here
Calculating Row numbers
In the example below, the code uses one of the simplest window functions called ROW_NUMBER() which assigns a row number to each row. The row numbers are assigned based on the results of the query. The OVER() clause defines the range of rows upon which the calculations would be performed. Since the example does not have any arguments for the OVER() clause, it will use the entire result of the query.
Note: The following queries are being run on Amazon Redshift Database.
👇 To run a SQL cell like the one below, click inside the cell to select it and click "Run" or the ► icon. You can also use Shift-Enter to run a selected cell.
SELECT
eventname,
--Assign a row number to each record of the sub query
ROW_NUMBER() OVER() as rownumber
FROM event
LIMIT 10 Ordering The Rows Within Window
You can use Order By within the OVER() clause to define an order. This order is applied to the range of rows defined within the window function. In the example below, row numbers are assigned to each event based on the descending number of tickets sold. If you want to know more about ordering in window functions, you can check out this exercise in the course PostgreSQL Summary Stats and Window Functions.
SELECT
eventname,
qtysold,
--Assign row numbers to each event based on the descending number of tickets sold
ROW_NUMBER() OVER(ORDER BY qtysold DESC) as event_ranking
FROM
(
SELECT
eventname, SUM(qtysold) AS qtysold
FROM event
INNER JOIN sales USING(eventid)
GROUP BY eventname
ORDER BY qtysold DESC
LIMIT 10
) ticket_sales
ORDER BY event_ranking Calculating Window Functions Over Partitions
The OVER() clause has an argument called PARTITION BY which divides the range of rows defined within the window function into partitions. If it is not mentioned, the entire result of the query is considered as one partition. In the example below, the window is divided into partitions based upon the column catgroup. This returns different rankings for concerts and shows. You can learn more about partitions in this exercise in the course PostgreSQL Summary Stats and Window Functions.
SELECT
*,
-- Create different rankings by partitioning the window based on catgroup
ROW_NUMBER() OVER( PARTITION BY catgroup ORDER BY qtysold DESC) as event_ranking
FROM
(
SELECT
catgroup,
eventname,
SUM(qtysold) AS qtysold
FROM event
INNER JOIN category USING(catid)
INNER JOIN sales USING(eventid)
GROUP BY 2, 1
ORDER BY qtysold DESC
LIMIT 100
) ticket_sales
ORDER BY
catgroup, event_ranking Adjusting Range Of Window Functions
Frames are used to define the range of rows for a windows function. The range can be defined by using ROWS BETWEEN [start] and [finish]. The start and finish can be one of the three clauses - PRECEDING, CURRENT ROW and FOLLOWING.
For example:
- `ROWS BETWEEN 2 PRECEDING and CURRENT ROW` will define a frame that starts two rows before the current row and ends at the current row.
- `ROWS BETWEEN 1 PRECEDING and 3 FOLLOWING` will define a frame that starts one row before the current row and ends three rows after the current row.
- `ROWS BETWEEN UNBOUNDED PRECEDING and CURRENT ROW` will define a frame that starts from the first row and ends at the current row.
The following code calculates the maximum number of tickets sold for an event. The frame is defined between the first row and the current row. You can find more information about frames here.
SELECT
*,
-- calculating the maximum number of tickets sold for an event until now
MAX(qtysold) OVER(PARTITION BY eventname
ORDER BY caldate
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) as maxsold
FROM
(
SELECT
caldate,
eventname,
SUM(qtysold) AS qtysold
FROM event
INNER JOIN date USING(dateid)
INNER JOIN sales USING(eventid)
GROUP BY 1, 2
ORDER BY 3 DESC
LIMIT 100
) ticket_prices
ORDER BY
eventname, caldateCalculating Moving Average With Window Function
Up until now, ROW_NUMBER() was used to assign ranking to rows. Window functions can also be used to perform aggregations over rows. Let's begin by calculating a moving average. Moving averages help eliminate noise from the data to project useful trends. They are useful in several industries which want to eliminate seasonality. You can find more information about moving averages here.
The following query calculates the moving average of the number of tickets sold.
SELECT
*,
-- Calculating Moving Average
AVG(qtysold) OVER(ORDER BY caldate
ROWS BETWEEN 5 PRECEDING AND CURRENT ROW) as avgqtysold
FROM
(
SELECT
caldate,
SUM(qtysold) AS qtysold
FROM
event INNER JOIN date USING(dateid)
INNER JOIN sales USING(eventid)
GROUP BY caldate
ORDER BY qtysold DESC
LIMIT 100
) ticket_prices Plotting Moving Average
You can see in the visualization below how a moving average removes noise by plotting both the original column and the moving average.
# Import libraries
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
# Plotting Moving Average
sns.set_style('darkgrid')
sns.lineplot(x='caldate', y='avgqtysold', data=ticket_prices)
sns.lineplot(x='caldate', y='qtysold', data=ticket_prices)
plt.xlabel('Date', fontsize=15) # label for x-axis
plt.ylabel('Quantity Sold', fontsize=15) # label for y-axis
plt.xticks(rotation ='45')
plt.legend(labels=["Quantity Moving Average","Quantity"])
plt.rcParams["figure.figsize"]=(8, 8)
plt.show()Calculating Running Sum With Window Function
The running sum calculates the total sum of a column over preceding values and the current value. It is also used to reflect performance in general. If the running sum is going down, the overall performance is going down as well. The following code indicates the performance of ticket sales by calculating the running sum.
SELECT
*,
-- Calculating Running Sum
SUM(qtysold) OVER(ORDER BY caldate
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
) as runningtotal
FROM
(
SELECT
caldate,
SUM(qtysold) AS qtysold
FROM
event INNER JOIN date USING(dateid)
INNER JOIN sales USING(eventid)
GROUP BY caldate
LIMIT 200
) sum_trend
ORDER BY caldate Plotting Running Sum
After extracting the running sum, the data is plotted to notice the performance of the sales of tickets. The rise in the running sum reflects the rise in the sale of tickets over the recent times and the dip in the running sum reflects the dip in the sale of tickers over the recent times. After extracting the moving average, the data is plotted to notice the performace of the sales of tickets.