Tutorials
must read
sql

Supercharge Your Subqueries

In this tutorial, you'll learn how to create subqueries in SQL to better analyze and report data.

Once you become familiar with SQL, you realize that all of the really cool analyses require multiple steps. For example, suppose you want to create a histogram of the amount of time each user spends on your website. First, you'll need to calculate the amount of time spent per user. Then, you'll want to count the number of users who spend a certain amount of time on your site. There are three options for creating these subqueries:

TEMP Tables

You can create a temporary table that exists for your current SQL session, is only accessible by you, and disappears when you disconnect. This is great for long calculations that you'll want to refer to multiple times. The syntax is easy:

SELECT <query goes here>
INTO TEMP <name of the temporary table>
FROM <rest of query goes here>

The biggest problem with TEMP tables is that they require you to run multiple commands. If you're working with dashboarding software like Looker or Tableau, TEMP tables probably won't work.

Nested FROM statements

Normally, we use FROM statements to indicate which table our data will be coming from. However, they can also select from a query nested inside of some parenthesis. Consider this example where we're building a histogram of average purchase amount per user. First, we need to calculate the average purchase amount per user_id:

SELECT user_id,
ROUND(AVG(purchase)) as avg_purchase
FROM user_purchase
GROUP BY 1;

We can take this query and place it inside of a FROM statement:

SELECT avg_purchase,
    COUNT(user_id) as num_customers
FROM (
    SELECT user_id,
        ROUND(AVG(purchase)) as avg_purchase
    FROM user_purchase
    GROUP BY 1 ) as subquery1
GROUP BY 1;

The highlighted query will run first, and the results will be aggregated by the outer query. By aliasing the subquery (in this case as subquery1, you can do complicated things like referring to it in a join.

The biggest problem with nested FROM statements is that they're hard to read. In particular, if you're referring back to your code at a later date, or if you're trying to share it with someone else, it's harder to see what you were thinking. Once you have more than one nested FROM, it becomes almost impossible to figure out what the calculation does.

WITH statements

WITH statements are similar to FROM statements, but with one key difference: they are written sequentially, so it's easier to understand the steps that create the final query. Consider our previous query, rewritten using WITH:

WITH subquery1 AS (
    SELECT user_id,
        ROUND(AVG(purchase)) as avg_purchase
    FROM user_purchase
    GROUP BY 1 )
SELECT avg_purchase,
    COUNT(user_id) as num_customers
FROM subquery1
GROUP BY 1;

We can have as many WITH statements as we want before our main query. We just separate each one with commas:

WITH subquery1 AS (
... ),
subquery2 AS (
... ),
subquery3 AS (
... )
SELECT <final query>
FROM subquery3;

Many Data Analysts and Data Scientists prefer WITH statements to nested FROM statements because it's easier to for a third party to review the query and to follow the logic of the original coder.

Conclusion

Whether you choose to use TEMP tables, nested FROM statements, or sequential WITH statements, multi-step queries will give you richer, more actionable insights into your big data sets.

If you would like to learn more about SQL, take DataCamp's Joining Data in PostgreSQL course.

Want to leave a comment?