Tutorial (SQL): When to use GROUP BY, WHERE, and HAVING
This brief guide was created to provide a clear and intuitive introduction to three fundamental SQL clauses for aggregating rows in a data set.
In particular, if you are preparing for a job interview in data analysis or software development, you will likely be asked to demonstrate your skills in SQL. The most frequently discussed clauses are GROUP BY,
WHERE,
and HAVING,
not only because of their importance in data aggregation but also because they are often confusing. Recruiters then use them to assess candidates’ analytical skills.
Introduction
Aggregation is a key concept when working with large data sets, especially in areas such as environmental analysis or sustainability. For example, when analyzing climate change data, it may be more informative to examine trends for different countries or socioeconomic groups rather than focusing on a global average. This tutorial will use the SQL GROUP BY,
WHERE,
and HAVING
clauses to perform these detailed analyses, a topic frequently explored during job interviews in various industries.
For the practical exercises in this tutorial, we will use a database focused on climate analysis. This database contains various socio-economic and climate metrics for multiple countries worldwide. The main table, called climate,
includes critical data such as adaptation to water technologies, trade volume, days to register a property, GDP per capita, labor employment rate, school enrollment rate, and water stress index. This dataset thus provides a comprehensive and detailed overview of the global challenges associated with climate change.
SELECT * FROM climate
Using GROUP BY
The GROUP BY
clause aggregates data based on specific columns. The aggregation process occurs in three stages:
- Split: data are divided into groups based on the unique values in the specified column(s).
- Apply: for each group, we perform an aggregate function such as average (
AVG
), minimum (MIN
), or maximum (MAX
). - Combine: the aggregated results are then combined into a single table.
For example, to identify the top five countries with the highest average water stress over the years:
SELECT
country,
AVG(water_stress_index) AS avg_water_stress
FROM climate
GROUP BY country
ORDER BY avg_water_stress DESC
LIMIT 5;
In the above query:
- We divide the data into groups based on the unique values in the
country
column. - We apply the aggregate average function (
AVG
) to each group’swater_stress_index
column. - We combine the results into a table, sorting the countries by average water stress index in descending order.
- Limitiamo i risultati ai primi 5 paesi.
Instead, to calculate the average GDP per capita, the total number of years recorded, and the average water stress index for each country:
SELECT
country,
AVG(gdp_per_capita) AS avg_gdp_per_capita,
COUNT(year) AS total_years,
AVG(water_stress_index) AS avg_water_stress
FROM climate
GROUP BY country
ORDER BY avg_gdp_per_capita DESC;