Tutorials
must read
sql
+2

SQL: Reporting and Analysis

Master SQL for Data Reporting & daily data analysis by learning how to select, filter & sort data, customize output, & how you can report aggregated data from a database!

In this tutorial you will go through the following topics in detail:

  1. Introduction to SQL and Databases

  2. SQL Data Types

  3. SQL and Data Reporting

    • Writing SQL SELECT Statements
      • Project, selection, joining
      • Concatenation
      • Aliases
      • Eliminating duplicate rows using DISTINCT
    • Restricting and Sorting Data
      • WHERE clause, wildcard like IN, LIKE
      • Comparison Operators, Logical operators (AND, OR, NOT)
      • ORDER BY clause
    • Using Single-Row Functions to Customize Output
      • Character Functions
      • Number Functions
      • Date Functions
      • Conversion Functions
      • General Functions
  4. Grouping your SQL results

    • Reporting Aggregated Data Using the Group Functions
      • AVG, COUNT, MIN, MAX...
      • Using GROUP BY and HAVING Clauses
    • Displaying Data From Multiple Tables
      • Cartesian products
      • Equi-join, inner-join, right outer-join, left outer-join, full outer-join
    • Using Subqueries to Solve Queries

      • What is sub-query
      • Self-join
      • Single-row sub-query
      • group functions in sub-query
    • Using the SET Operators

      • Unions, Intersect, Minus

In this tutorial, you will focus on ANSI(American National Standards Institute) SQL that works on every database like Oracle, MySQL, Microsoft SQL Server, etc.! Let's start with an introduction to the SQL (Structured Query Language) and why a data scientist may need it.

SQL and Data Science

In this part of the tutorial, you will see why a data scientist should learn SQL. Let's take a look at how SQL will help you in your career as a data scientist:

  • SQL has become a requirement for most data science jobs that include: data analyst, BI (Business Intelligence) developer, programmer, database programmer. SQL will let you communicate with the database and work with your data.

  • If you have used software like Tableau or any other data reporting or visualization software you might have seen how you connect your project to a database and, drag and drop graphs and plots onto the report specifying the fields only, and the software does the rest for you. So when you perform drag and drop activity of data attributes or use the graphical user interface to do all the operations within in a click, at the back of the software is running SQL that is interacting with the database. By learning SQL, you can directly interact with the database.

  • SQL can be used with all application programming languages like PHP, Java. You can create your own data visualizations by integrating SQL in the application or get data from the database and convert into XML, JSON formats to use them in web services or APIs.

  • Databases have evolved over the years, as Big Data has become the talk of the town and data used in our daily lives, NoSQL databases are becoming popular. Learning SQL will help you build a strong foundation in databases and help to understand when to use structured databases and when to use NoSQL databases and appreciate their differences.

Here I will go through a little bit of introduction and terminologies used in SQL and databases that will help you learn to code faster! If you already know about what a relational database is and what is SQL, then you can skip to code.

Introduction to SQL and Databases

A database is an organized collection of information. To manage databases, we need Data Base Management Systems (DBMS). A DBMS is a program that stores, retrieves, and modifies data in the databases on request.

There have been different types of databases over the time: Hierarchical, network, relational Databases and now NoSQL databases. Relational Database is a collection of relations or two-dimensional tables.

Following are terminologies used in RDBMS:

Term Description
Table A Table is the basic storage structure of an RDBMS. A table stores all the data necessary about something in the real world. Example: Employees.
Single Row or Tuple Representing all data required for a particular employee. Each row in a table may be identified by a Primary Key, which allows no duplicate rows.
Column or Attribute Usually refers to a characteristic of an entity
Primary Key A Field which uniquely identifies a row
Foreign Key A Foreign Key is a column that identifies how tables relate to each other. A foreign key refers to a Primary Key in another table.

You can relate multiple tables using Primary and Foreign Keys. Each row in a relational database is identified uniquely by a Primary Key (PK). You can refer to another table using a Foreign Key (FK). For example:

Relational databases can be accessed using Structured Query Language or SQL. Every database will support ANSI SQL that is the standard SQL but also will have its own syntax to facilitate in some operations. In this tutorial, you will learn ANSI SQL so that you can work with all databases. ANSI SQL can be divided into five sections. I will name them all but here only two sections that are relevant Data Retrieval and DML:

  1. Data Retrieval:
    • Select.
  2. Data Manipulation Language (DML):
    • Insert, Update, Delete , Merge
  3. Data Definition Language (DDL):
    • Create, Alter, Drop, Rename, Truncate.
  4. Data Control Language (DCL):
    • Grant, Revoke.
  5. Transaction Control:
    • Commit, Rollback, Savepoint.

There different vendors available for RDBMS. Most common and most used ones are:

  • Oracle (Oracle Corporation)
  • Microsoft SQL Server(Microsoft)
  • MySQL (Oracle Corporation)
  • PostgreSQL (PostgreSQL Global Development Group)
  • SQLite (Developed by D. Richard Hipp )

So where do you run SQL queries? It may be:

  • A reporting software that allows you to read from the database and view data (For example: Tableau, Microsoft BI)
  • A GUI to database administration (for example: TOAD, SQL developer for Oracle, phpmyadmin for MySQL)
  • A console based direct interface to the database (For example: SQL plus for Oracle database)

If you have credentials to login to a database, you can view the database objects by using queries or GUI to view the database objects as per your database vendor.

That's it! Let's get to learning SQL...

SQL DATA TYPES

Each column in a database has a name and its data type and sometimes a size associated to it. It is the job of the database developer to design the database and decide what data type should be used according to requirements and volume of data.

As a data scientist, you need to be familiar with the data types as it will assist you to use database functions properly and write your queries accurately. There is a data type for each type of column in a database such as a person's name, some text stored, numbers, an image stored in the database and so on.

Here basic data types for Oracle Server, SQL server and MySQL server will be shown to you:

For further reading:

SQL and Data Reporting

For all SQL used in this tutorial, the following sample database schema:

Consider a database with two tables named emp that holds employee data and dept table that holds records about departments.

The emp table has employee number (empno), employee name (ename), salary (sal), commission (comm), job name (job), manager id (mgr), hire date (hiredate) and, department number (deptno). As Manager is also an employee and will have an employee number, mgr is from one of the empno whose job is "MANAGER".

The dept table has department no (deptno), department name (dname) and location of the department (loc).

Note that, different databases have different date format. The one here, DD-MON-YY is default date format for an Oracle database. The default format for Microsoft SQL Server and MySQL is YYYY-MM-DD.

Your database tables may be different, so you have to adjust only according to your table name and attributes' name. In this tutorial, you will be only reading from the database and not writing to, updating or creating new tables and objects. So no worries on any data loss or changes!

Information: Schema is a collection of objects such as tables, functions, procedures, views that belong to a user

In the above database schema, you can see that the emp table has six attributes for the entity Employee. Table dept has three attributes for the entity Department.

Retrieving Data Using SELECT Statement

A SELECT statement retrieves information from the database. Using a SELECT statement, you can do the following:

1. Projection: You can use the projection capability in SQL to choose the columns in a table that you want to be returned by your query. You can choose as few or as many columns in the table as you require.

2. Selection: You can use the selection capability in SQL to choose the rows in a table that you want to be returned by a query.

You can use various criteria to restrict the rows that you see.

3. Joining: You can use the join capability in SQL to bring together data that is stored in different tables by creating a link between them.

The basic SELECT statement allows you to specify what columns do you want from which table. The SELECT clause specifies columns, expressions and FROM clause specifies from which table to get data from.

For example, What are the names of all employees and what are their jobs: SELECT ename, job FROM employee; The above SQL query will give you following output based on the database schema discussed above:

ename job
A Salesman
B Manager
C Manager

To select all attributes and all rows from table * operator is used:

SELECT *
FROM employee;
The output will be:

     EMPNO ENAME      JOB              MGR HIREDATE         SAL       COMM     DEPTNO
---------- ---------- --------- ---------- --------- ---------- ---------- ----------
      7369 SMITH      CLERK           7902 17-DEC-80        800                    20
      7499 ALLEN      SALESMAN        7698 20-FEB-81       1600        300         30
      7521 WARD       SALESMAN        7698 22-FEB-81       1250        500         30
      7566 JONES      MANAGER         7839 02-APR-81       2975                    20
      7698 BLAKE      MANAGER         7839 01-MAY-81       2850                    30

Some tips for writing SQL statements:

  • SQL statement are not case sensitive
  • Clauses are written on separate lines to increase readability
  • You can write SQL statement on one or more than one lines

You can also create an expression using +,-,/,* operators on the date and number data to generate required data. For example, you are asked to find out 20% of the salary of all employees. The query will be as follows:

SELECT ename, sal*(20/100)
FROM employee;
Output:

    ENAME      SAL*(20/100)
   ---------- ------------
    SMITH               160
    ALLEN               320
    WARD                250
    JONES               595
    BLAKE               570

Information:

  • Parentheses are used to clarify the expression
  • Division and multiplication have precedence over addition and subtraction
  • If the same precedence occurs then the expression is evaluated from left->right

NULL values are treated differently in databases. A NULL value implies that a value is unknown. Any operation performed with NULL results in NULL. Different databases have different functions to handle null. There is a common function that is used in MySQL, Microsoft SQL Server, and Oracle to treat NULLs.

SELECT ename, sal+COALESCE(comm,0)
FROM employee;
Output:

    ENAME      SAL+COALESCE(COMM,0)
---------- --------------------
    SMITH                       800
    ALLEN                      1900
    WARD                       1750
    JONES                      2975
    BLAKE                      2850

Column Aliases and Concatenation

You can see in above outputs, that the column name is the same as the database field or the expression you select. Sometimes, while generating reports you want to give your own names to headers. This can be done using aliases.

SELECT ename AS "Emp Name", sal*(20/100) as "20% of Salary"
FROM employee;
Output:

    Emp Name      20% of Salary
   ---------- ------------
    SMITH               160
    ALLEN               320
    WARD                250
    JONES               595
    BLAKE               570

When you are giving an alias that has spaces in it you have to use double quotes. Otherwise, there is no need to put double quotes after AS. Sometimes you can also omit the keyword AS.

You can format your output by concatenation. You can add your own statements in the output using CONCAT function or operators like || or + depending on vendor database:

  • Oracle supports CONCAT() and || but CONCAT() will take only two arguments. You have to use nested CONCAT().
  • MySQL uses CONCAT()
  • Microsodt SQL server uses '+' operator and CONCAT().

Oracle: SELECT '20% of salary of '||ename||' is '||sal*(20/100) as "20% of salary" FROM employee;

SELECT CONCAT(CONCAT('20% of salary of',ename),CONCAT(' is ',sal(20/100))) as "20% of salary" FROM employee; MySQL and Microsoft SQL server: SELECT CONCAT('20% of salary of ',ename,' is ',sal(20/100)) as "20% of salary" FROM employee; All will result in same output:

    20% of salary
-----------------------------------------------------------------------
    20% of salary of SMITH is 160
    20% of salary of ALLEN is 320
    20% of salary of WARD is 250
    20% of salary of JONES is 595
    20% of salary of BLAKE is 570

Eliminating repeated rows using DISTINCT

SELECT deptno
FROM employee;
Above query will result in:


    DEPTNO
----------
        20
        30
        30
        20
        30

Here the deptno 20 and 30 are repeating. You can eliminate such rows by using the DISTINCT keyword in the SELECT clause.

SELECT distinct ename, deptno, job
FROM employee;



    DEPTNO
----------
        30
        20

Restricting and Sorting Data

The WHERE clause is used to filter data on some condition.

Find all employees who have job CLERK:

SELECT ename, job
FROM employee
WHERE job='CLERK';
    ENAME      JOB
---------- ---------
    SMITH      CLERK

You may want to filter your results with different conditions. For this operators, conditional symbols and specific keywords are used:

Use of =,<>,!=,>=,<=,>,< is simple as shown in the above query that is using =.

AND & OR syntax: SELECT column1, column2,.. FROM table_name WHERE condition1 AND condition2 AND condition 3...;

SELECT column1, column2,.. FROM table_name WHERE condition1 OR condition2 OR condition 3...; Find names of employees whose job is MANAGER and belong to department 30:

SELECT ename
FROM employee
WHERE job='MANAGER' AND deptno=30;
    ENAME
    ----------
    BLAKE
SELECT ename
FROM employee
WHERE job='MANAGER' OR deptno=30;
    ENAME
    ----------
    ALLEN
    WARD
    JONES
    BLAKE

NOT syntax: SELECT column1, column2, ... FROM table_name WHERE NOT condition; Find all employees whose job is not SALESMAN:

SELECT ename, job
from employee
WHERE NOT job='SALESMAN';
    ENAME      JOB
    ---------- ---------
    SMITH      CLERK
    JONES      MANAGER
    BLAKE      MANAGER

You can make complex conditions using all three AND, OR and NOT operators. The precedence is:

  1. NOT
  2. AND
  3. OR

Find all employees whose job is not CLERK and belong to department 20:

SELECT ename, job
from employee
WHERE NOT job='SALESMAN' AND sal>800;
    ENAME      JOB
    ---------- ---------
    JONES      MANAGER
    BLAKE      MANAGER

Here NOT executed first and after NOT, AND was evaluated.

Other handy conditional operators are shown below:

BETWEEN...AND:

SELECT *
FROM employee
WHERE sal BETWEEN 1000 AND 2000;
     EMPNO ENAME      JOB              MGR HIREDATE         SAL       COMM     DEPTNO
---------- ---------- --------- ---------- --------- ---------- ---------- ----------
      7499 ALLEN      SALESMAN        7698 20-FEB-81       1600        300         30
      7521 WARD       SALESMAN        7698 22-FEB-81       1250        500         30

LIKE:

LIKE uses two wildcards: percentage % and underscore _ to represent the number of characters in the pattern.

  1. % means any zero, one, or multiple characters
    • %M%: Match any string having M in any position
    • M%: Match value having M at start
    • %M: Match value having M at end
    • M%A: Start with M and end with A

Patterns are case-sensitive.

  1. _ specifies the number of unknown characters before or after the known character. One underscore is one character.
    • _r%: Match value having r in the second position.

Get names of all employees whose names start with 'B':

SELECT *
FROM employee
WHERE ename LIKE 'B%';

     EMPNO ENAME      JOB              MGR HIREDATE         SAL       COMM     DEPTNO
---------- ---------- --------- ---------- --------- ---------- ---------- ----------
      7698 BLAKE      MANAGER         7839 01-MAY-81       2850                    30

Get names of all employees whose names start with 'A' and have 'E' in any position after 'A':

SELECT *
FROM employee
WHERE ename LIKE 'A%E%';

        EMPNO ENAME      JOB              MGR HIREDATE         SAL       COMM     DEPTNO
---------- ---------- --------- ---------- --------- ---------- ---------- ----------
      7499 ALLEN      SALESMAN        7698 20-FEB-81       1600        300         30

IN(value1,value2, value3,..):

The IN() function can take one, two or multiple values and allows you to match a column to the given values in parentheses in the WHERE clause:

SELECT ename, job, hiredate
FROM employee
WHERE job IN ('CLERK','SALESMAN');
    ENAME      JOB       HIREDATE
    ---------- --------- ---------
    SMITH      CLERK     17-DEC-80
    ALLEN      SALESMAN  20-FEB-81
    WARD       SALESMAN  22-FEB-81

You can also use a SELECT statement in IN() that will return some values. For example:

SELECT ename, job, hiredate
FROM employee
WHERE deptno IN (select deptno FROM department WHERE loc='CHICAGO');
    ENAME      JOB       HIREDATE
    ---------- --------- ---------
    ALLEN      SALESMAN  20-FEB-81
    WARD       SALESMAN  22-FEB-81
    BLAKE      MANAGER   01-MAY-81

The SELECT statement in IN() is also called as a sub-query. More on sub-queries later in this tutorial!

IS NULL:

IS NULL is used to check for NULL values in a given attribute. For example, find all employees who don't have commission:

SELECT ename, job, sal
FROM employee
WHERE comm IS NULL;
    ENAME      JOB              SAL
    ---------- --------- ----------
    SMITH      CLERK            800
    JONES      MANAGER         2975
    BLAKE      MANAGER         2850

If you want to get names of employees who earned a commission then you will use IS NOT NULL:

SELECT ename, job, sal, comm
FROM employee
WHERE comm IS NOT NULL;
    ENAME      JOB              SAL       COMM
    ---------- --------- ---------- ----------
    ALLEN      SALESMAN        1600        300
    WARD       SALESMAN        1250        500

To filter results on using date columns you have to use default date format. If some other format is to be used, date functions need to be applied which you will see later in this tutorial. Find all employees who were hired after 21 February 1981:

(Here Oracle default date format is used)

SELECT ename, job, sal, comm
FROM employee
WHERE hiredate>'21-FEB-81';
    ENAME      JOB              SAL       COMM
    ---------- --------- ---------- ----------
    WARD       SALESMAN        1250        500
    JONES      MANAGER         2975
    BLAKE      MANAGER         2850

Complex conditional expressions can be made using a combination of all the above operators, but you have to be careful with precedence. Precedence is the order in which operators are evaluated. Following are precedence rules for different databases:

There are two useful functions: ANY(), ALL() that can be used in conditions. For example:

SELECT ename, job, sal, comm
FROM employee
WHERE deptno=ANY(SELECT deptno from dept WHERE loc='NEW YORK');
SELECT ename, job, sal, comm
FROM employee
WHERE deptno=ALL(SELECT deptno from dept WHERE dname='SALES');

Notice the use of subqueries in the above examples. You will learn more about sub-queries later.

Ordering Results Using ORDER BY CLAUSE

You can order your results ascending (ASC) or Descending (DESC) by any attribute or multiple attributes of the table. You can also order by aliases that you specify in SELECT clause:

SELECT ename, job, sal, comm
FROM employee
WHERE hiredate>'21-FEB-81'
ORDER BY sal desc;
    ENAME      JOB              SAL       COMM
    ---------- --------- ---------- ----------
    JONES      MANAGER         2975
    BLAKE      MANAGER         2850
    WARD       SALESMAN        1250        500

Note: Default order is ascending denoted by ASC, and you don't need to specify ASC.

SELECT ename, job, sal, comm
FROM employee
WHERE hiredate>'21-FEB-81'
ORDER BY sal;
    ENAME      JOB              SAL       COMM
    ---------- --------- ---------- ----------
    WARD       SALESMAN        1250        500
    BLAKE      MANAGER         2850
    JONES      MANAGER         2975

You can specify multiple columns in ORDER BY clause that will execute in order of the columns specified. For example, sort employees first by their deptno in ascending order and then names in descending:

SELECT ename, job, sal, comm, deptno
FROM employee
WHERE hiredate>'21-FEB-81'
ORDER BY deptno ASC, ename DESC;
    ENAME      JOB              SAL       COMM     DEPTNO
    ---------- --------- ---------- ---------- ----------
    JONES      MANAGER         2975                    20
    WARD       SALESMAN        1250        500         30
    BLAKE      MANAGER         2850                    30

Using Single-Row Functions to Customize Output

There are numerous functions in all RDBMS that will help you perform common tasks such as getting the length of a string, joining strings, formatting functions, mathematical functions, etc. There are two types of row functions in SQL:

  • Single-Row functions
  • Multiple-Row functions

Single-Row Functions:

These functions are applied on each row and return result per each row. CONCAT() is a character manipulation single-row function. These can be used in SELECT, WHERE and ORDER BY clause. There are single-row functions for the following in all RDBMS:

  • Character Manipulation functions
  • Date and Time functions
  • Number Functions
  • Conversion Functions

Different databases have different function names for the same functionality. Here I will tell you common useful functions among Microsoft SQL Server, Oracle, and MySQL. Some of these you have already learned in the above sections. At the end of this section, you will find links to a complete list of functions from these vendors for further practice.

Let's go through each category:

Character Manipulation Functions

LOWER(): Converts string to lower caps.

SELECT lower(ename) as ename
FROM employee;
    ENAME
    ----------
    smith
    allen
    ward
    jones
    blake

UPPER(): Converts string to lower caps.

SELECT upper(ename) as ename
FROM employee;
    ENAME
    ----------
    SMITH
    ALLEN
    WARD
    JONES
    BLAKE

SUBSTR()[Oracle, MySQL]: Returns substring as specified SUBSTR(string, start-position,length)

SUBSTRING()[SQL Server]: Returns substring as specified SUBSTRING(string, start-position,length).

SELECT SUBSTR(ename,2,3) as substr_ename
FROM employee;

Just replace function name to SUBSTRING for SQL server.

SUBSTR_ENAME
------------
MIT
LLE
ARD
ONE
LAK

LENGTH()[Oracle, MySQL]: Returns length of the string in parentheses

LEN()[SQL server]: Returns length of the string in parentheses

SELECT LENGTH(ename) as len_ename
FROM employee;

Just replace function name to LEN for SQL server.

     LEN_ENAME
    ----------
             5
             5
             4
             5
             5

Functions such as padding a string to left or right, replace start to differ in the syntax for different databases. List of Character and String functions for the three vendor databases:

Number Functions
Function name Function
ROUND(m,n): Rounds the value m to specified n decimal places.
ABS(m,n): Returns absolute value of a number
FLOOR(n): FLOOR returns the largest integer equal to or less than n.
MOD(m,n): Returns remainder of division of m by n (not available in SQL server but uses % operator: 35 % 6)

Oracle:

    SELECT ROUND(45.926,2),MOD(11,5),FLOOR(34.4),ABS(-24) from dual;

OUTPUT:

    ROUND(45.926,2)  MOD(11,5) FLOOR(34.4)   ABS(-24)
--------------- ---------- ----------- ----------
          45.93          1          34         24

MySQL:

    SELECT ROUND(45.926,2),MOD(11,5),FLOOR(34.4),ABS(-24);

SQL server:

    SELECT ROUND(45.926,2) as round,11%5 as mod, FLOOR(34.4) as floor,ABS(-24) as abs;

List of Numeric functions for the three vendor databases:

Conversion Functions

Conversion Functions are used to convert one data type into another. Convert functions differ from server to server. Here I will give you an example of an Oracle server query that uses the to_char function. This will help you understand how you can use conversion functions to customize the output. Oracle's default date format is DD-MON-YY

SELECT ename, to_char(hiredate,'DD, MONTH YYYY') as Hiredate, to_char(hiredate,'DY') as Day
from employee;
    ENAME      Hiredate                                     DAY
    ---------- -------------------------------------------- ------------
    SMITH      17,DECEMBER  1980                            WED
    ALLEN      20,FEBRUARY  1981                            FRI
    WARD       22,FEBRUARY  1981                            SUN
    JONES      02,APRIL     1981                            THU
    BLAKE      01,MAY       1981                            FRI

There lots of conversion functions for each database server. For further reading and exploration following are links to different database server's conversion functions:

Date and Time Functions

Date and time functions allow you to manipulate the date by adding days, calculating months between two dates, etc. Such functions are beneficial when generating reports. Date functions also vary from database to database. A function may perform same functionality but may have a different name in another database. Here you will look at a few examples and then I will provide you with links to the date functions of the top most common databases: Oracle, SQL Server, and MySQL. Go ahead and try out these examples on your database server!

Grouping Your SQL Results

Grouping functions or Multi-Row functions are applied on groups and return one result per group. You will need group functions when you want to find out facts such as total sales per quarter, the Average price of a product throughout some time period, highest investment your company received this month and so on.

Reporting Aggregated Data Using the Group Functions

GROUP BY clause is used to group results used often with functions like COUNT, MAX, MIN, AVG, and SUM. The syntax of GROUP BY is:

SELECT column_name(s)
FROM table_name
WHERE condition
GROUP BY column_name(s)
HAVING condition
ORDER BY column_name(s);

You have to keep in mind some points for the GROUP BY clause:

  • Alias names cannot be used in GROUP BY clause
  • WHERE clause will always come before GROUP BY
  • HAVING clause will come after GROUP BY and will apply conditions on group functions.
  • ORDER BY clause will always come at last

COUNT is a group function that counts the number of rows of data according to condition or no condition. The COUNT function ignores null values.

Count all employees whose name start with 'A':

SELECT count(ename)
FROM employee
WHERE ename LIKE 'A%';

Output:

COUNT(ENAME)
------------
           1

Find total salary, average salary, minimum salary and maximum salary in table employee:

SELECT sum(sal) as "TOTAL SAL", avg(sal) as "AVG SAL", min(sal) as "MIN SAL", max(sal) as "MAX SAL"
FROM employee;

     TOTAL SAL    AVG SAL    MIN SAL    MAX SAL
    ---------- ---------- ---------- ----------
      9475       1895        800       2975

You may want to group above result for each department:

SELECT sum(sal) as "TOTAL SAL", avg(sal) as "AVG SAL", min(sal) as "MIN SAL", max(sal) as "MAX SAL", deptno
FROM employee
GROUP BY deptno;
     TOTAL SAL    AVG SAL    MIN SAL    MAX SAL     DEPTNO
    ---------- ---------- ---------- ---------- ----------
      5700       1900       1250       2850         30
      3775     1887.5        800       2975         20

You can also calculate Variance and Standard Deviation of values using following functions:

For Oracle and MySQL MS SQL SERVER
SELECT STDDEV(column_name) FROM table_name; SELECT STDEV(column_name) FROM table_name;
SELECT VARIANCE(column_name) FROM table_name; SELECT VAR(column_name) FROM table_name;

You can group results by more than one column.

Find the sum of salaries of each job within each department:

SELECT sum(sal) as "TOTAL SAL", deptno, job
FROM employee
GROUP BY deptno, job;

     TOTAL SAL     DEPTNO   JOB
    ---------- ---------- ---------
       800         20       CLERK
      2850         30       SALESMAN
      2975         20       MANAGER
      2850         30       MANAGER

You can restrict group results using the HAVING clause. It's used only for group conditions.

Note: Do not use WHERE for applying conditions on group functions.

SELECT sum(sal) as "TOTAL SAL", deptno, job
FROM employee
GROUP BY deptno, job
HAVING sum(sal)>1000
ORDER BY sum(sal);
    TOTAL SAL     DEPTNO    JOB
    ---------- ---------- ---------
      2850         30       SALESMAN
      2850         30       MANAGER
      2975         20       MANAGER

Displaying Data From Multiple Tables

In this section following joins will be discussed:

  • Cartesian product/Cross join
  • Inner join/EquiJoins
  • Natural joins
  • Outer Joins(Left, Right, Full)
  • Self Join

You will need to display reports that get data from multiple tables. See the example below:

To generate a report like above, you need to link employee and department table and get data from both of them. For this Joins are used in SQL.

Cartesian Product/Cross Join:

The Cartesian product is formed when every tuple of a relation R creates a combination with every tuple in a relation S.

Cartesian product operation also called as Cross Join multiplies two tables to form a relation that consists of all possible pairs of tuples from two tables. If one relation say, R has I tuples with M number of attributes and other relation say S has J tuples N attributes cartesian product will have IxJ tuples with M+N attributes. A cartesian product can be formed using the SQL complaint CROSS JOIN as well.

SELECT empno, ename, dname
FROM employee, department;

OR

SELECT empno, ename, dname
FROM employee CROSS JOIN department;
       EMPNO      ENAME      DNAME
    ---------- ---------- --------------
      7369        SMITH      ACCOUNTING
      7499        ALLEN      ACCOUNTING
      7521        WARD       ACCOUNTING
      7566        JONES      ACCOUNTING
      7698        BLAKE      ACCOUNTING
      7369        SMITH      RESEARCH
      7499        ALLEN      RESEARCH
      7521        WARD       RESEARCH
      7566        JONES      RESEARCH
      7698        BLAKE      RESEARCH
      7369        SMITH      SALES

     EMPNO        ENAME      DNAME
    ---------- ---------- --------------
      7499        ALLEN      SALES
      7521        WARD       SALES
      7566        JONES      SALES
      7698        BLAKE      SALES

15 rows selected.

As the employee table comprises of 5 tuples and department contains 3 tuples the Cartesian product has 5x3=15 rows. A Cartesian product is generated when:

  • A join condition is not used
  • A join condition is not valid or not formed properly

When data from two or more tables is required, a join condition is made using the common attributes linking the involved tables. These are most commonly the Primary Key and the Foreign Key.

Cartesian products are used for simulation of large amounts of data for testing purposes.

*Inner Join/EquiJoin:

Inner Join or Equijoin (Term used by Oracle developers) uses primary and foreign key relationships to join two or more tables:

SELECT ename, dname
FROM employee e,department d
WHERE e.deptno=d.deptno;

OR

SELECT ename, dname
FROM employee e
JOIN department d
ON e.deptno=d.deptno;

    ENAME      DNAME
    ---------- --------------
    SMITH      RESEARCH
    ALLEN      SALES
    WARD       SALES
    JONES      RESEARCH
    BLAKE      SALES

In the above query, deptno is the primary key in the department table and foreign key in the employee table. You can add additional conditions using logical operations as discussed in the sections above.

Point to remember for JOINS:

  • If the same column name appears in more than one table, the column name must be prefixed with the table name. It is good practice to do this otherwise as well to enhance clarity.
  • To join n tables together, you need a minimum of n-1 join conditions. For example, to join four tables, a minimum of three joins is required.

Table aliases: Table aliases are used such as in FROM employee e, department d where e and d are table aliases, to make it easier for the database engine to identify from which table the column is from especially when same columns are present in more than one involved table because otherwise error for ambiguous column will be given. And it saves time from writing large table names again and again.

You can create nonequi-joins as well which join tables based on conditions other than equality. Consider another table salgrade that has information about the grade of an employee based on salary:

You want to find out each employees' grade based on salary and grades are stored in salgrade table while salary is stored in employee table:

SELECT e.ename, e.sal, s.grade
FROM employee e, salgrade s
WHERE e.sal between s.losal AND s.hisal;
    ENAME             SAL      GRADE
    ---------- ---------- ----------
    JONES            2975          4
    BLAKE            2850          4
    ALLEN            1600          3
    WARD             1250          2
    SMITH             800          1

Example: You are asked to give the names of employees, their salary, grade, and department name. Now, as you know, salary is stored in the employee table, grades in the salgrade table and department name in department table you will need to join three tables together:

SELECT e.ename, e.sal, d.dname, s.grade
FROM employee e, department d, salgrade s
WHERE e.deptno=d.deptno
AND e.sal BETWEEN s.losal AND s.hisal;

Output:
ENAME             SAL DNAME               GRADE
---------- ---------- -------------- ----------
JONES            2975 RESEARCH                4
BLAKE            2850 SALES                   4
ALLEN            1600 SALES                   3
WARD             1250 SALES                   2
SMITH             800 RESEARCH                1

In above example, three tables are involved and two join condition one of them is a nonequi-join.

Natural Join:

Natural joins allow databases to join tables automatically by matching columns with the same name. If columns match with the same name but data types are different, then an error is given.

Syntax: SELECT FROM table1 NATURAL JOIN table2; SELECT FROM employee NATURAL JOIN department; Since natural join itself find out matching column it may find more than one matching columns with the same name but may have a different data type which may result in an error. So USING clause is used to specify columns on which to make an equi join.

Note: NATURAL JOIN and USING are two different clauses and used separately. One cannot be used in the presence of the other that is mutually exclusive.

SELECT e.ename, d.dname, e.sal
FROM employee e JOIN department d
USING (deptno)
WHERE deptno=20;

output:

ENAME      DNAME                 SAL
---------- -------------- ----------
SMITH      RESEARCH              800
JONES      RESEARCH             2975

Columns in the USING clause cannot use table name or table prefixes anywhere in the SQL statement. For example following is incorrect:

SELECT e.ename, d.dname, e.sal
FROM employee e JOIN department d
USING (d.deptno)
WHERE d.deptno=20;

d.deptno is incorrect. Only deptno should be used. Outer Joins:

There are three outer joins:

  1. Left Outer Join: A join between two tables that returns the results of the inner join, as well as unmatched rows of the left table, is called a LEFT OUTER JOIN
  2. Right Outer Join: A join between two tables that returns the results of the inner join, as well as unmatched rows of the right table, is called RIGHT OUTER JOIN.
  3. Full Outer Join: A join between two tables that returns the results of the inner join, as well as unmatched rows of the right table, is called RIGHT OUTER JOIN.

Let's see these one by one:

SELECT e.ename, s.grade
FROM salgrade s LEFT OUTER JOIN employee e
ON e.sal BETWEEN s.losal AND s.hisal;

Output:

ENAME          DEPTNO DNAME
---------- ---------- --------------
SMITH              20 RESEARCH
JONES              20 RESEARCH
ALLEN              30 SALES
WARD               30 SALES
BLAKE              30 SALES.sal BETWEEN s.losal AND s.hisal;

Output:

ENAME          DEPTNO DNAME
---------- ---------- --------------
SMITH              20 RESEARCH
JONES              20 RESEARCH
ALLEN              30 SALES
WARD               30 SALES
BLAKE              30 SALESSELECT e.ename, d.deptno, d.dname
FROM employee e RIGHT OUTER JOIN department d
ON e.deptno=d.deptno;

output:

ENAME          DEPTNO DNAME
---------- ---------- --------------
SMITH              20 RESEARCH
ALLEN              30 SALES
WARD               30 SALES
JONES              20 RESEARCH
BLAKE              30 SALES
                   10 ACCOUNTINGSELECT e.ename, d.deptno, d.dname
FROM employee e FULL OUTER JOIN department d
ON e.deptno=d.deptno;

Output:
ENAME          DEPTNO DNAME
---------- ---------- --------------
SMITH              20 RESEARCH
ALLEN              30 SALES
WARD               30 SALES
JONES              20 RESEARCH
BLAKE              30 SALES
                   10 ACCOUNTING

The department table had one unmatched row for deptno 10, and no unmatched row was found for employee table. Similarly, salgrade has one unmatched row for grade 5.

Self-Join:

There are situations when you want to join a table with itself. For example, the employee table contains information about all employees including managers (Refer to the table employee above). If you are asked to find out employees and their managers, then you will need to join a table to itself. Self-Joins are made in the following way:

SELECT e.ename as Employee, m.ename as Manager
FROM employee e, employee m
WHERE e.mgr=m.empno;
    EMPLOYEE   MANAGER
    ---------- ----------
    ALLEN      BLAKE
    WARD       BLAKE

Using Subqueries to Solve Queries

A subquery is a query within a query. A subquery is useful to break down large queries into segments. Subqueries can be nested and can be used in the following SQL clauses:

  • WHERE
  • FROM
  • HAVING

Consider an example; you are asked to find an employee whose salary is higher than another employee, in this case, James. How would you approach this? This query can be broken down as:

  • Find the salary of James
  • Compare salary of James to all employee

This can be solved using a subquery. The subquery (inner query) will execute once before the main (outer) query.

SELECT empno, ename
FROM employee
WHERE sal>(SELECT sal from employee where ename='JAMES');

Types of Subqueries:

  • Single-row subquery: Query that returns only one row from an inner SELECT statement.

  • Multiple-row subquery: Query that returns more than one row from an inner SELECT statement.

Points to remember:

  • Enclose subqueries in parenthesis.
  • Place subqueries on the right of the comparison operator.
  • Use single row operators (>, < , >= , <= , < >) with single row subqueries.
  • Use multiple row operators (IN, ANY, ALL) with multiple row subqueries.

Single-row Subquery:

You are asked to find names of employees whose job title is same as empno 7521

SELECT ename, job
FROM employee
WHERE job=(SELECT job FROM employee WHERE empno=7521);
    ENAME      JOB
    ---------- ---------
    ALLEN      SALESMAN
    WARD       SALESMAN

Select maximum salary of departments where the maximum salary is higher than equal to the maximum salary of department 20

SELECT deptno, max(sal)
FROM employee
GROUP BY deptno
HAVING max(sal)>=(SELECT max(sal) FROM employee WHERE deptno=20);
    DEPTNO   MAX(SAL)
---------- ----------
        20       2975

Multiple-row Subquery:

Find employees whose salary is equal to any of the salaries of employees having a job as Manager. This will be a multiple-row subquery as the subquery can return more than one rows.

SELECT ename, job, sal
FROM employee
WHERE sal IN (SELECT sal FROM employee WHERE job='MANAGER');
    ENAME      JOB              SAL
    ---------- --------- ----------
    JONES      MANAGER         2975
    BLAKE      MANAGER         2850

When subqueries are used in FROM clause, it acts as a temporary table that does not physically exist on the storage but is a view of data. For example:

SELECT e.ename, e.job, e.sal
FROM employee e, (SELECT deptno FROM department WHERE loc='DALLAS') d
WHERE e.deptno=d.deptno;
    ENAME      JOB              SAL
    ---------- --------- ----------
    SMITH      CLERK            800
    JONES      MANAGER         2975

Using SET Operators

In SQL Set operators are used to combine the results of multiple queries into a single result. It uses the set theory in mathematics and its operations such as UNION, MINUS, INTERSECT. Here you will see how you can use SET operators to optimize your queries. In this tutorial following SET operators will be discussed.

Information: Queries using SET operators are called compound statements.

  • UNION and UNION ALL
  • INTERSECT
  • EXCEPT (SQL standard) and MINUS (Oracle specific)

UNION

Consider two relations (tables) R and S then UNION selects all rows from R and all rows from S eliminating duplicates. The number of returned rows can be maximum r+s where r is the number of rows in R and s is the number of rows in S.

Select all department names including the department names of employees who were hired before 23-OCT-1999

SELECT dname
FROM department
UNION
SELECT dname
FROM department, employee
WHERE department.deptno=employee.deptno AND employee.hiredate<to_date('23-OCT-1999');
    DNAME
    --------------
    ACCOUNTING
    OPERATIONS
    RESEARCH
    SALES

Points to remember for UNION:

  • The number of columns and the data types of the columns being selected must be identical in all the SELECT statements used in the query.
  • The names of the columns need not be identical.
  • The output is sorted in ascending order of the first column of the SELECT clause.
  • NULL values are not ignored during duplicate checking.

UNION ALL combines result from one or more queries and does NOT remove duplicates, and hence, the DISTINCT keyword cannot be used. Consider another table emp that contains employees from the year 2000. Get all employee number, name and job from the emp and the employee table:

SELECT empno, ename, job
FROM employee
UNION ALL
SELECT empno, ename, job
FROM emp;
       EMPNO      ENAME      JOB
    ---------- ---------- ---------
      7369        SMITH      CLERK
      7499        ALLEN      SALESMAN
      7521        WARD       SALESMAN
      7566        JONES      MANAGER
      7698        BLAKE      MANAGER
      7369        SMITH      CLERK
      7499        ALLEN      SALESMAN
      7521        WARD       SALESMAN
      7566        JONES      MANAGER
      7654        MARTIN     SALESMAN
      7698        BLAKE      MANAGER

     EMPNO        ENAME      JOB
    ---------- ---------- ---------
      7782        CLARK      MANAGER
      7788        SCOTT      ANALYST
      7839        KING       PRESIDENT
      7844        TURNER     SALESMAN
      7876        ADAMS      CLERK
      7900        JAMES      CLERK

INTERSECT

Intersection follows the same rule as in set theory, to give the common values from two sets. Consider two relations R and S then INTERSECT will return set of all tuples that are common in both R and S.

SELECT empno, ename, job
FROM employee
INTERSECT
SELECT empno, ename, job
FROM emp;
     EMPNO ENAME      JOB
---------- ---------- ---------
      7369 SMITH      CLERK
      7499 ALLEN      SALESMAN
      7521 WARD       SALESMAN
      7566 JONES      MANAGER
      7698 BLAKE      MANAGER

Points to remember for INTERSECT:

  • The number of columns and the data types of the columns being selected must be identical in all the SELECT statements used in the query.
  • The names of the columns need not be identical.
  • INTERSECT does not ignore NULL values

EXCEPT and MINUS:

EXCEPT (used in SQL server) and MINUS (Oracle specific) both have the same function. The function is to select all distinct rows selected by the first query but not the second. To understand this operator, consider the following table:

Say, you need to find a product that has a quantity between 1 to 100 but does not have quantity 50-75. Here you will use EXCEPT if you are using SQL server and MINUS if you are using Oracle server.

SELECT prod_name, qty
FROM products
WHERE qty BETWEEN 1 AND 100
EXCEPT
SELECT prod_name, qty
FROM products
WHERE qty BETWEEN 50 AND 75;

For Oracle server:

SELECT prod_name, qty
FROM products
WHERE qty BETWEEN 1 AND 100
MINUS
SELECT prod_name, qty
FROM products
WHERE qty BETWEEN 50 AND 75;

      PROD_NAME               QTY      
    -------------       ----------
          COLGATE               1
          SENSODYNE             100
          SENSODYNE TOOTHBRUSH  30

Points to remember for EXCEPT/MINUS:

  • The number of columns and the data types of the columns being selected must be identical in all the SELECT statements used in the query.
  • The names of the columns need not be identical.
  • All of the columns in the WHERE clause must be in the SELECT clause for the MINUS operator to work.

Congratulations!

This is the end of the tutorial. In this tutorial, you have learned a significant amount of SQL that will help you master SQL in your data science journey. SQL is used to generate reports from databases. You have learned in this tutorial the foundations of databases and SQL, common data types used, functions that help you to achieve formatting reports, aggregation of results creating summaries and also ways on how you can gather data from different tables in the databases according to your requirements! This tutorial is specifically designed for data science learners that will not only help them with relational databases but also go smoothly through NoSQL databases and how to use SQL skills in Big Data studies.

If you would like to learn more about SQL, take DataCamp's free Intro to SQL for Data Science course.

References

  1. Introduction to Oracle 10g-SQL Lecture by Tanveer Zahid Khan (Senior Assistant Professor, Department of Computer Science, Bahria University)
  2. https://docs.oracle.com/cd/B19306_01/server.102/b14200/operators005.htm
  3. https://en.wikipedia.org/wiki/Set_operations_(SQL)#EXCEPT_operator
  4. https://www.w3schools.com/sql/sql_datatypes.asp
  5. https://docs.oracle.com/cd/B28359_01/server.111/b28318/datatype.htm#CNCPT012
Want to leave a comment?