Building reports from a given dataset is an essential skill if you are working with data. Because ultimately, you want to be able to answer critical business questions using the data at your disposal. Many times, these answers presented in the form of report charts. But sometimes, reports in the form of tables are also needed. In both cases, you might need to summarize the data using simple calculations. In SQL, you can summarize/aggregate the data using aggregate functions. With these functions, you will be able to answer questions like:
- What is the maximum value for the
- What are the minimum values of
some_column_from_the_tablewith respect to
and many more.
Let's get started to perform some data aggregation.
Note: To be able to follow along with this tutorial, you need to know how to write basic queries in PostgreSQL (which you are going to use as the RDBMS). This tutorial can serve as a good refresher.
Setting up the Database
Let's first set up a PostgreSQL database and restore this backup which contains the table which you are going to use in this tutorial. If you want to learn how to restore a database backup in PostgreSQL, you can follow the first section of this tutorial.
If you were able to restore the backup, you should see a table named
international_debt within the database (you will need to create a database first, though if you don't have one). Let's quickly take a look at the first few rows of the table (a simple select query will help you do this):
The table contains information about the debt statistics of different countries across the globe for this current year in different categories (refer to the columns
indicator_code). The debt column shows the amount of debt (in USD) a particular country has in a specific category. The data belongs to the domain of economics and is often used to analyze the economic conditions of different countries. The data was collected from the World Bank.
Now that you have set up the database successfully let's run some simple queries to know more about the data. Open up the
pgAdmin tool and get started.
Simple Information Matters
From the above figure, you can see that there are many duplicate entries for a single country but for different categories. A question that quickly gets raised from here is:
What are the different countries that the table contains records of?
If you simply run a select query with the
country_name column, you will not get the right answer to this question because the result of the query will contain duplicates. Let's use the
DISTINCT keyword to combat this.
select distinct country_name from international_debt;
And this should return something like this:
Now you have a fair answer to the above question. Now one last question before you proceed to aggregate functions:
How many different kinds of debt indicators are there in the table?
The query for answering should be similar to the above one. You just need to change the column name. Take it as an exercise, then? The result should be similar to the following:
Let's start by running a query with an aggregate function and proceed accordingly. Along the way, you will learn more about the syntaxes and the kind of constructs you need to follow when applying aggregate functions in SQL.
select sum(debt) from international_debt;
And the result:
SUM() aggregate function, you can calculate the arithmetic sum across a column (that contains numeric values). With the above query, you got to know about the total debt that is pending by the countries listed in the table.
SUM() does not take
NULL values into consideration when calculating the sum. Now, let's find the answer to the question:
What is the maximum amount of debt?
Here comes the
MAX() aggregate function to your rescue:
select max(debt) from international_debt;
And the answer is:
MAX() does not consider the
NULL entries while doing its calculations. There's similar
MIN() function as well. Let me know the minimum value of the
debt column via the
Comments section? Now, it will be a good idea to actually find out if there is an invalid entry in the
debt column to ensure that the results are correct until now.
Note that you use these functions in lower case as shown above.
When you execute this query:
select * from international_debt where debt is null;, you should get an empty result. Let's now find out the total number of distinct countries present in the table.
select count(distinct(country_name)) from international_debt;
And you see there is a total of 124 distinct countries present in the table. Pay close attention to the succession of functions you applied in the above query. Yes, it is allowed here, i.e. to connect more than one aggregate function in a logical way.
Now, suppose, you want to see the average value of the
debt column. The function is
select avg(debt) from international_debt;
You see the value to be 1306633214.966397971 (USD). It is a good idea to present these results with suitable column names. From the above results, you can see that PostgreSQL changes the column name to aggregate function name (that is being included in the query) when returning the result. So, it is a good idea to give a proper alias to these columns. You can do this like so:
select avg(debt) as Average_Debt_By_A_Country from international_debt;
The result is much more interpretable:
Let's now take this to a bit more complex level. To be able to answer questions like
What are the minimum values of some_column_from_the_table with respect to another_column_from_the_table?, you need to pair an aggregate function with
GROUP BY clause. Let's see how.
Aggregate Functions + GROUP BY + more
Let's say you want to produce a report where you will be showing the
country_name and the sum of their debts. Following is an example:
Reports like this are quite often used in the real world. So, what might be the query for getting a report like this? You will have to use the
SUM() function on the
debt. And you will also need to show the
country_name with the sum of debts. Let's execute the following query:
select country_name, sum(debt) from international_debt;
Doesn't it produce the following error?
ERROR: column "international_debt.country_name" must appear in the GROUP BY clause or be used in an aggregate function
Let's now understand what it actually means. When you are using an aggregate function (like
SUM() with an unaggregated column like
country_name) you need to pass the unaggregated column to a
GROUP BY clause. So, the correct query will be:
select country_name, sum(debt) as total_debt from international_debt group by country_name;
And the result is just right:
Note the use of aliasing in the query.
Now, suppose you need to sort this report in with respect to the
total_debt in a descending manner. Remember
ORDER BY clause? Yes, you can pair aggregate functions with
ORDER BY clause as well:
select country_name, sum(debt) as total_debt from international_debt group by country_name order by total_debt desc;
The result should be reversely sorted now:
Note the column that you used in the
ORDER BY clause.
Now another important question:
What the highest amount of debt across different categories (in a reversely sorted manner)?
You will need to use the
MAX() function here. The query to answer this question should not be a hard one to write now.
select indicator_code, max(debt) as maximum_debt from international_debt group by indicator_code order by maximum_debt desc;
And you get a clean report:
You can also limit the number of rows in the reports like this. Say you just want to include the top five entries in the above report. You can do by using the
select indicator_code, max(debt) as maximum_debt from international_debt group by indicator_code order by maximum_debt desc limit 5;
Time for the final report for this tutorial. You need to include the names of the countries to the above report. How can you do it? The following query should let you do this:
select country_name, indicator_code, max(debt) as maximum_debt from international_debt group by country_name, indicator_code order by maximum_debt desc;
Another good report:
In the above query, you added the
country_name column after the
SELECT clause and also added it after
GROUP BY. You can extend this format up to any number.
The order of
ORDER BY and
LIMIT is very important while generating reports like this. If you mistakenly change the order, you will have to face errors. See this for yourself:
select country_name, sum(debt) as total_debt from international_debt order by total_debt desc group by country_name;
And you get:
ERROR: syntax error at or near "group" LINE 1: ... from international_debt order by total_debt desc group by c...
In the above query, you placed the
ORDER BY clause before
GROUP BY which is not permissible. In fact, it is also not applicable when you are not using aggregate functions. The correct order is -
GROUP BY ->
ORDER BY ->
LIMIT. Always remember this.
Taking Things Forward
Congrats! You made it till the end of this tutorial. In this tutorial, you got to know about the different aggregate functions in PostgreSQL and how to use them to generate useful reports. These are crucial skills for a data scientist indeed. To level up your SQL skills in a systematic way, you can take the following DataCamp courses: