Skip to main content
HomeTutorialsSQL

SQL Commands for Data Scientists

Learn the basic and essential commands for SQL with examples and a workspace notebook to refer to.
Jan 2022  · 12 min read

As a data scientist or data analyst, data is naturally at the heart of the job. There are many sources from which you might gather your data. Often the data lives in a SQL database, in which case an understanding of SQL query commands can be essential in order to perform this role successfully. This article will introduce you to some of the more basic commands, right up to more advanced operations that will serve you well when working as a data analyst or data scientist.

In order to understand the commands listed above, we are going to use the open source DVD rental database to show you how these commands can be used. You can find the entity relationship diagram (EDR) here.

For the purpose of this article, we have set up the database locally and completed all queries using pgAdmin4, so we are using postgreSQL. Many if not all the commands should work in other SQL servers. The installation of postgreSQL and the setting up of the database itself is outside the scope of this article; we encourage you to check out other resources such as DataCamp's Beginner's Guide to PostgreSQL for advice on these processes.

Simple Data Retrieval

SELECT FROM

The simplest example of retrieval is to get all the content from a specific table in the database. If we wanted to know all the film categories that are in the 'Category' table, you could try inputting this command:

SELECT *
FROM category

Which returns (only the first 10 rows shown):

SELECT FROM

Here we've specified the table name after FROM. Since we want all the content in the table, we've used * after SELECT to specify that we want everything.

DISTINCT

There are cases in which we are only interested in knowing unique values. For example, what if we want to know all the MPAA ratings from the film table? Using DISTINCT allows us to choose unique values alone:

SELECT DISTINCT(rating)
FROM film

Which returns:

DISTINCT

Data Retrieval with Simple Conditions

WHERE

There are also situations in which we only want to retrieve specific entries that meet specific conditions. We can provide conditions that have to be met before returning the data using the WHERE clause. In essence, we are filtering entries given the condition. For example, what if we want to know only movies that are rated PG-13?

SELECT title
FROM film
WHERE rating = 'PG-13'

Which returns (only the first 10 rows shown):

Data Retrieval with Simple Conditions 1

Additionally, we can figure out the list of movies that have a rental rating of 4.5 or higher:

SELECT title, rental_rate
FROM film
WHERE rental_rate >= 4.5

Which returns (only the first 10 rows shown):

Data Retrieval with Simple Conditions 2

ORDER BY

Ordering the returned result is often useful for us to understand the data better. How about if we want the same list above with PG-13 movies, but ordered by the movie length from longest to shortest, and the corresponding movie length? Using ORDER BY, this is easily accomplished:

SELECT title, length
FROM film
WHERE rating = 'PG-13'
ORDER BY length DESC

Which returns (only the first 10 rows shown):

ORDER BY

LIMIT

Additionally, sometimes we are only interested in a limited number of entries. Here, we are actually only interested in the top 10 longest movies that are PG-13, so we can use the LIMIT clause.

SELECT title, length
FROM film
WHERE rating = 'PG-13'
ORDER BY length DESC
LIMIT 10

Which returns:

LIMIT

Aggregations

GROUP BY & COUNT( )

Aggregations are often used to get a summary of the dataset to gain insights. Aggregations are often used with the GROUP BY clause. For example, if we need to know how many rentals each customer has rented so far, we can count the number of rentals and order the results with the most rentals on top.

SELECT customer_id, COUNT(rental_id)
FROM rental
GROUP BY customer_id
ORDER BY COUNT(rental_id) DESC

Which returns (only the first 10 rows shown):

GROUP BY & COUNT( )

SUM( )

How about how much money each customer has spent? We can accomplish this using the SUM aggregator function. Here, we want the top spender on top.

SELECT customer_id, SUM(amount)
FROM payment
GROUP BY customer_id
ORDER BY SUM(amount) DESC

Which returns (only the first 10 rows shown):

SUM( )

AVG( ) & HAVING

We can add additional conditions using HAVING after you have calculated a grouped aggregation. For example, we want to know the MAAP rating categories that have movies that are longer than 115 minutes on average?

SELECT rating, AVG(length)
FROM film
GROUP BY rating
HAVING AVG(length) > 115

Which returns:

AVG( ) & HAVING

MIN( ) & Alias

Grouping is not always necessary using an aggregation. If you want to know the shortest movie length, we can simply use the MIN function. We also want the column to read as 'shortest_movie_length', which can be accomplished by assigning an alias using 'AS'.

SELECT MIN(length) AS shortest_movie_length
FROM film

Which returns (only the first 10 rows shown):

MIN( ) & Alias

You can also use an alias for table names. This is useful when you are working with long table names, especially when joining tables (see below). We also want to be mindful of what kind of alias we use as it can drastically change the readability of the code .

There are other aggregation functions that are useful, and you can learn more about them in the Introduction to SQL course.

Joins

INNER JOIN (JOIN)

You have probably noticed that there are only limited things we can do by looking at a single table. We often want to look at the data from multiple tables. This is where we use JOINs. We can join tables based on a common key. For example, if we were looking for films that are not in English:

SELECT film.title, language.name
FROM film
JOIN language
    ON film.language_id = language.language_id
WHERE language.name != 'English'

(Please note that there are no movies other than those in English in this database. You will get an empty table with this query right above.)

You can also join multiple tables. For example, below we want to know the categories and their corresponding average rating. We've also returned values ordered by the average rating, with the highest first.

SELECT category.name, AVG(film.rental_rate) AS average_rating
FROM film
JOIN film_category
    ON film.film_id = film_category.film_id
JOIN category
    ON film_category.category_id = category.category_id
GROUP BY category.name
ORDER BY average_rating DESC

Which returns (only the first 10 rows shown):

INNER JOIN (JOIN)

There are many types of joins: INNER JOIN, LEFT JOIN, RIGHT JOIN, OUTER JOIN, CROSS JOIN. Here, we used JOIN, which by default is an INNER JOIN in which you get back entries only if they are present in both tables. If you would like to learn more about the differences between these joins, you can take a look at the Joining Data in SQL course.

Changing Data Types

CAST( )

In the example used above where we calculated dollar amounts, you might have noticed that SQL only sees the values as numbers. We can change these to dollar amounts using the CAST function:

SELECT customer_id, CAST(SUM(amount) AS money)
FROM payment
GROUP BY customer_id
ORDER BY SUM(amount) DESC

Which returns (only the first 10 rows shown):

CAST( )

Money is not the only thing that we can change values into. We can change numbers into floats, text, and even into date and time.

ROUND( )

Similarly, we can also round numbers. In the example above in which we calculated the average rental rating, there are many decimal points in the calculated values. We can use the ROUND function to round to two decimals:

SELECT category.name, ROUND(AVG(film.rental_rate), 2) AS average_rating
FROM film
JOIN film_category
    ON film.film_id = film_category.film_id
JOIN category
    ON film_category.category_id = category.category_id
GROUP BY category.name
ORDER BY average_rating DESC

Which returns (only the first 10 rows shown):

Complex Conditions

CASE Statement

You can imagine that by combining many of the above functions and clauses, you can perform a fairly complex query. However, there are situations in which the basics are not sufficient. For example, if we wanted to add a column with movie-length categories. We can label movies as 'Short Film' if less than 50 minutes, 'Long Film' if more than 150 minutes, and 'Medium Film' for the rest. We can accomplish this by using a CASE statement:

SELECT title, length,
CASE
    WHEN length < 50 THEN 'Short Film'
    WHEN length > 150 THEN 'Long Film'
    ELSE 'Medium Film'
END AS Movie_Length_Category
FROM film

Which returns (only the first 10 rows shown):

CASE Statement

As you can see, the result is returned when the specified condition is met. This is similar to 'if-else' statements in programming languages.

Subqueries

We may now want to know which movies are the longest in length, and their associated titles. For this, we can use the WHERE clause that we introduced above, just with a more complex condition:

SELECT title, length
FROM film
WHERE length = (SELECT max(length)
                FROM film)

Which returns:

Subqueries

Complex queries can happen using subqueries. In this example, we want to know the list of movie titles that have a higher rental rating than the average rental rating of all comedy films:

SELECT title, rental_rate
FROM film
WHERE rental_rate >
(SELECT AVG(film.rental_rate) AS Average_Rating
FROM film
JOIN film_category AS cat_id
    ON film.film_id=cat_id.film_id
JOIN category AS cat
    ON cat_id.category_id=cat.category_id
GROUP BY cat.name, cat_id.category_id
HAVING cat.name = 'Comedy')

Which returns (only the first 10 rows shown):

Which returns (only the first 10 rows shown):

In this example, we first find the average film rental rating for comedy films. The value found in this query (seen inside the parentheses above) is then used as a condition in the WHERE clause in which we can find the movie titles.

Common Table Expressions (CTEs)

In a more complex example, we can use common table expression (CTE) to save a temporary table from which we extract the information needed. For this example, we want to know the average number of rentals each staff processes per week.

WITH weekly_rentals AS(
    SELECT
        staff_id,
        DATE_PART('week', payment_date) AS week,
        COUNT(rental_id) AS rental_numbers
    FROM payment
    GROUP BY staff_id, week)

SELECT staff_id, AVG(rental_numbers)
FROM weekly_rentals
GROUP BY staff_id

Which returns:

Common Table Expressions (CTEs)

In the above case, we are saving the number of rentals per week in a temporary table called 'weekly_rentals' upon which we can then perform another query to get the final result.

Window Functions

Window functions are another set of SQL skills that are important and can take you to the next level. Even though window functions are considered intermediate to advanced level skills, many data analysts and scientists believe them to be essential for their jobs. It is therefore worthwhile for you to spend some time getting familiar with them.

OVER( ) & PARTITION BY Here, let's say we want to find out each staff's weekly number of rentals and the overall total number of rentals, along with the staff ID and the week number. We can reuse the CTE generated in the above example, as shown below:

WITH weekly_rentals AS(
    SELECT
        staff_id,
        DATE_PART('week', payment_date) AS week,
        COUNT(rental_id) AS rental_numbers
    FROM payment
    GROUP BY staff_id, week)

SELECT *, SUM(rental_numbers) OVER() AS total_rentals
FROM weekly_rentals

Which returns (only the first 10 rows shown):

Window Functions

Having OVER in the code specifies that it's a window function. Here, we are saying the window function (SUM here) is applied to every row. We can also use the window function along with PARTITION BY to apply the window function at specified 'partitions'. Let's say that we now want to figure out the total number of rentals per week instead of the overall number of rentals (in addition to the other columns. Again, in this example below we are using the CTE generated before.

WITH weekly_rentals AS(
    SELECT
        staff_id,
        DATE_PART('week', payment_date) AS week,
        COUNT(rental_id) AS rental_numbers
    FROM payment
    GROUP BY staff_id, week)

SELECT *, SUM(rental_numbers) OVER(PARTITION BY week) AS total_weekly_rentals
FROM weekly_rentals

Which returns (only the first 10 rows shown):

Window Functions 2

The additional column contains the total weekly rental numbers instead of the running total.

We introduced only a couple of simple examples of window functions here that are important for being a data analyst or scientist in many companies. We encourage you to learn more about these commands using the links provided. You could also check out the Intermediate SQL course to learn or brush up your SQL skills. Additionally, you can practice your skills by taking an open source database and trialling these different commands while thinking about relevant questions to answer.

Topics
Related

A Data Science Roadmap for 2024

Do you want to start or grow in the field of data science? This data science roadmap helps you understand and get started in the data science landscape.
Mark Graus's photo

Mark Graus

10 min

Top 5 SQL Server Certifications: A Complete Guide

Unlock SQL Server certification success with our guide on paths, preparation with DataCamp, and the top certifications to enhance your career.
Matt Crabtree's photo

Matt Crabtree

8 min

Mastering SQL ROUND: Precision Handling in Data Analysis

Learn how to use the SQL ROUND function in this tutorial. It covers the implementation of the ROUND function in Oracle, MySQL, PostgreSQL, and SQL Server.
Laiba Siddiqui's photo

Laiba Siddiqui

10 min

Mastering SQL NOT EQUAL Operator: A Beginner's Guide

Unlock the power of SQL NOT EQUAL with our expert guide. Learn to refine data queries with practical examples and optimization tips for better analysis.
Abid Ali Awan's photo

Abid Ali Awan

5 min

SQL NOT IN Operator: A Comprehensive Guide for Beginners

Master SQL's NOT IN operator with this beginner's guide. Learn to filter data effectively, avoid common pitfalls, and explore efficient alternatives
Abid Ali Awan's photo

Abid Ali Awan

5 min

SQL CONTAINS: A Comprehensive Tutorial

Unlock the power of SQL CONTAINS for advanced text searches. Dive into logical operators, proximity searches, and wildcard uses for precise data analysis.
Abid Ali Awan's photo

Abid Ali Awan

5 min

See MoreSee More