Tutorials
sql

Aggregate Functions in SQL

Learn how to use aggregate functions for summarizing results and gaining useful insights about data in SQL.

Building reports from a given dataset is an essential skill if you are working with data. Because ultimately, you want to be able to answer critical business questions using the data at your disposal. Many times, these answers presented in the form of report charts. But sometimes, reports in the form of tables are also needed. In both cases, you might need to summarize the data using simple calculations. In SQL, you can summarize/aggregate the data using aggregate functions. With these functions, you will be able to answer questions like:

  • What is the maximum value for the some_column_from_the_table? or
  • What are the minimum values of some_column_from_the_table with respect to another_column_from_the_table?

and many more.

Let's get started to perform some data aggregation.

Note: To be able to follow along with this tutorial, you need to know how to write basic queries in PostgreSQL (which you are going to use as the RDBMS). This tutorial can serve as a good refresher.

Setting up the database

Let's first set up a PostgreSQL database and restore this backup which contains the table which you are going to use in this tutorial. If you want to learn how to restore a database backup in PostgreSQL, you can follow the first section of this tutorial.

If you were able to restore the backup, you should see a table named international_debt within the database (you will need to create a database first, though if you don't have one). Let's quickly take a look at the first few rows of the table (a simple select query will help you do this):

The table contains information about the debt statistics of different countries across the globe for this current year in different categories (refer to the columns indicator_name and indicator_code). The debt column shows the amount of debt (in USD) a particular country has in a specific category. The data belongs to the domain of economics and is often used to analyze the economic conditions of different countries. The data was collected from the World Bank.

Now that you have set up the database successfully let's run some simple queries to know more about the data. Open up the pgAdmin tool and get started.

Simple information matters

From the above figure, you can see that there are many duplicate entries for a single country but for different categories. A question that quickly gets raised from here is:

What are the different countries that the table contains records of?

If you simply run a select query with the country_name column, you will not get the right answer to this question because the result of the query will contain duplicates. Let's use the DISTINCT keyword to combat this.

select distinct country_name from international_debt;

And this should return something like this:

Now you have a fair answer to the above question. Now one last question before you proceed to aggregate functions:

How many different kinds of debt indicators are there in the table?

The query for answering should be similar to the above one. You just need to change the column name. Take it as an exercise, then? The result should be similar to the following:

Aggregate functions

Let's start by running a query with an aggregate function and proceed accordingly. Along the way, you will learn more about the syntaxes and the kind of constructs you need to follow when applying aggregate functions in SQL.

select sum(debt) from international_debt;

And the result:

With the SUM() aggregate function, you can calculate the arithmetic sum across a column (that contains numeric values). With the above query, you got to know about the total debt that is pending by the countries listed in the table.

Note that SUM() does not take NULL values into consideration when calculating the sum. Now, let's find the answer to the question:

What is the maximum amount of debt?

Here comes the MAX() aggregate function to your rescue:

select max(debt) from international_debt;

And the answer is:

Like the SUM() function, MAX() does not consider the NULL entries while doing its calculations. There's similar MIN() function as well. Let me know the minimum value of the debt column via the Comments section? Now, it will be a good idea to actually find out if there is an invalid entry in the debt column to ensure that the results are correct until now.

Note that you use these functions in lower case as shown above.

When you execute this query:select * from international_debt where debt is null;, you should get an empty result. Let's now find out the total number of distinct countries present in the table.

select count(distinct(country_name)) from international_debt;

And you see there is a total of 124 distinct countries present in the table. Pay close attention to the succession of functions you applied in the above query. Yes, it is allowed here, i.e. to connect more than one aggregate function in a logical way.

Now, suppose, you want to see the average value of the debt column. The function is AVG():

select avg(debt) from international_debt;

You see the value to be 1306633214.966397971 (USD). It is a good idea to present these results with suitable column names. From the above results, you can see that PostgreSQL changes the column name to aggregate function name (that is being included in the query) when returning the result. So, it is a good idea to give a proper alias to these columns. You can do this like so:

select avg(debt) as Average_Debt_By_A_Country from international_debt;

The result is much more interpretable:

Let's now take this to a bit more complex level. To be able to answer questions like What are the minimum values of some_column_from_the_table with respect to another_column_from_the_table?, you need to pair an aggregate function with GROUP BY clause. Let's see how.

Aggregate functions + GROUP BY + more

Let's say you want to produce a report where you will be showing the country_name and the sum of their debts. Following is an example:

Reports like this are quite often used in the real world. So, what might be the query for getting a report like this? You will have to use the SUM() function on the debt. And you will also need to show the country_name with the sum of debts. Let's execute the following query:

select country_name, sum(debt) from international_debt;

Doesn't it produce the following error?

ERROR: column "international_debt.country_name" must appear in the GROUP BY clause or be used in an aggregate function

Let's now understand what it actually means. When you are using an aggregate function (like SUM() with an unaggregated column like country_name) you need to pass the unaggregated column to a GROUP BY clause. So, the correct query will be:

select country_name, sum(debt) as total_debt from international_debt group by country_name;

And the result is just right:

Note the use of aliasing in the query.

Now, suppose you need to sort this report in with respect to the total_debt in a descending manner. Remember ORDER BY clause? Yes, you can pair aggregate functions with ORDER BY clause as well:

select country_name, sum(debt) as total_debt from international_debt
group by country_name order by total_debt desc;

The result should be reversely sorted now:

Note the column that you used in the ORDER BY clause.

Now another important question:

What the highest amount of debt across different categories (in a reversely sorted manner)?

You will need to use the MAX() function here. The query to answer this question should not be a hard one to write now.

select indicator_code, max(debt) as maximum_debt from international_debt
group by indicator_code order by maximum_debt desc;

And you get a clean report:

You can also limit the number of rows in the reports like this. Say you just want to include the top five entries in the above report. You can do by using the LIMIT clause.

select indicator_code, max(debt) as maximum_debt from international_debt
group by indicator_code order by maximum_debt desc
limit 5;

Time for the final report for this tutorial. You need to include the names of the countries to the above report. How can you do it? The following query should let you do this:

select country_name, indicator_code, max(debt) as maximum_debt from international_debt
group by country_name, indicator_code order by maximum_debt desc;

Another good report:

In the above query, you added the country_name column after the SELECT clause and also added it after GROUP BY. You can extend this format up to any number.

The order of GROUP BY, ORDER BY and LIMIT is very important while generating reports like this. If you mistakenly change the order, you will have to face errors. See this for yourself:

select country_name, sum(debt) as total_debt from international_debt
order by total_debt desc group by country_name;

And you get:

ERROR:  syntax error at or near "group"
LINE 1: ... from international_debt order by total_debt desc group by c...

In the above query, you placed the ORDER BY clause before GROUP BY which is not permissible. In fact, it is also not applicable when you are not using aggregate functions. The correct order is - GROUP BY -> ORDER BY -> LIMIT. Always remember this.

Taking things forward

Congrats! You made it till the end of this tutorial. In this tutorial, you got to know about the different aggregate functions in PostgreSQL and how to use them to generate useful reports. These are crucial skills for a data scientist indeed. To level up your SQL skills in a systematic way, you can take the following DataCamp courses:

Want to leave a comment?