Tutorial (SQL): When to use GROUP BY, WHERE and HAVING

Tutorial (SQL): When to use GROUP BY, WHERE, and HAVING

This brief guide was created to provide a clear and intuitive introduction to three fundamental SQL clauses for aggregating rows in a data set.

In particular, if you are preparing for a job interview in data analysis or software development, you will likely be asked to demonstrate your skills in SQL. The most frequently discussed clauses are GROUP BY, WHERE, and HAVING, not only because of their importance in data aggregation but also because they are often confusing. Recruiters then use them to assess candidates’ analytical skills.

Introduction

Aggregation is a key concept when working with large data sets, especially in areas such as environmental analysis or sustainability. For example, when analyzing climate change data, it may be more informative to examine trends for different countries or socioeconomic groups rather than focusing on a global average. This tutorial will use the SQL GROUP BY, WHERE, and HAVING clauses to perform these detailed analyses, a topic frequently explored during job interviews in various industries.

For the practical exercises in this tutorial, we will use a database focused on climate analysis. This database contains various socio-economic and climate metrics for multiple countries worldwide. The main table, called climate, includes critical data such as adaptation to water technologies, trade volume, days to register a property, GDP per capita, labor employment rate, school enrollment rate, and water stress index. This dataset thus provides a comprehensive and detailed overview of the global challenges associated with climate change.

DataFrameas

df

variable

SELECT * FROM climate

Using GROUP BY

The GROUP BY clause aggregates data based on specific columns. The aggregation process occurs in three stages:

Split: data are divided into groups based on the unique values in the specified column(s).
Apply: for each group, we perform an aggregate function such as average (AVG), minimum (MIN), or maximum (MAX).
Combine: the aggregated results are then combined into a single table.

For example, to identify the top five countries with the highest average water stress over the years:

DataFrameas

df1

variable

SELECT 
    country, 
    AVG(water_stress_index) AS avg_water_stress
FROM climate
GROUP BY country
ORDER BY avg_water_stress DESC
LIMIT 5;

In the above query:

We divide the data into groups based on the unique values in the country column.
We apply the aggregate average function (AVG) to each group’s water_stress_index column.
We combine the results into a table, sorting the countries by average water stress index in descending order.
Limitiamo i risultati ai primi 5 paesi.

Instead, to calculate the average GDP per capita, the total number of years recorded, and the average water stress index for each country:

DataFrameas

df2

variable

SELECT 
    country,
    AVG(gdp_per_capita) AS avg_gdp_per_capita,
    COUNT(year) AS total_years,
    AVG(water_stress_index) AS avg_water_stress
FROM climate
GROUP BY country
ORDER BY avg_gdp_per_capita DESC;

‌
‌
‌

Tutorial (SQL): When to use GROUP BY, WHERE and HAVING

.mfe-app-workspace-kj242g{position:absolute;top:-8px;}.mfe-app-workspace-11ezf91{display:inline-block;}.mfe-app-workspace-11ezf91:hover .Anchor__copyLink{visibility:visible;}Tutorial (SQL): When to use GROUP BY, WHERE, and HAVING

Introduction

Using GROUP BY

Tutorial (SQL): When to use GROUP BY, WHERE, and HAVING