Supercharge Your Subqueries
Once you become familiar with SQL, you realize that all of the really cool analyses require multiple steps. For example, suppose you want to create a histogram of the amount of time each user spends on your website. First, you'll need to calculate the amount of time spent per user. Then, you'll want to count the number of users who spend a certain amount of time on your site. There are three options for creating these subqueries:
You can create a temporary table that exists for your current SQL session, is only accessible by you, and disappears when you disconnect. This is great for long calculations that you'll want to refer to multiple times. The syntax is easy:
SELECT <query goes here> INTO TEMP <name of the temporary table> FROM <rest of query goes here>
The biggest problem with
TEMP tables is that they require you to run multiple commands. If you're working with dashboarding software like Looker or Tableau,
TEMP tables probably won't work.
Nested FROM statements
Normally, we use
FROM statements to indicate which table our data will be coming from. However, they can also select from a query nested inside of some parenthesis.
Consider this example where we're building a histogram of average purchase amount per user.
First, we need to calculate the average
purchase amount per
SELECT user_id, ROUND(AVG(purchase)) as avg_purchase FROM user_purchase GROUP BY 1;
We can take this query and place it inside of a
SELECT avg_purchase, COUNT(user_id) as num_customers FROM ( SELECT user_id, ROUND(AVG(purchase)) as avg_purchase FROM user_purchase GROUP BY 1 ) as subquery1 GROUP BY 1;
The highlighted query will run first, and the results will be aggregated by the outer query. By aliasing the subquery (in this case as
subquery1, you can do complicated things like referring to it in a join.
The biggest problem with nested
FROM statements is that they're hard to read. In particular, if you're referring back to your code at a later date, or if you're trying to share it with someone else, it's harder to see what you were thinking. Once you have more than one nested
FROM, it becomes almost impossible to figure out what the calculation does.
WITH statements are similar to
FROM statements, but with one key difference: they are written sequentially, so it's easier to understand the steps that create the final query.
Consider our previous query, rewritten using
WITH subquery1 AS ( SELECT user_id, ROUND(AVG(purchase)) as avg_purchase FROM user_purchase GROUP BY 1 ) SELECT avg_purchase, COUNT(user_id) as num_customers FROM subquery1 GROUP BY 1;
We can have as many
WITH statements as we want before our main query. We just separate each one with commas:
WITH subquery1 AS ( ... ), subquery2 AS ( ... ), subquery3 AS ( ... ) SELECT <final query> FROM subquery3;
Many Data Analysts and Data Scientists prefer
WITH statements to nested
FROM statements because it's easier to for a third party to review the query and to follow the logic of the original coder.
Whether you choose to use
TEMP tables, nested
FROM statements, or sequential
WITH statements, multi-step queries will give you richer, more actionable insights into your big data sets.
If you would like to learn more about SQL, take DataCamp's Joining Data in PostgreSQL course.