Course
In this tutorial, you will focus on ANSI(American National Standards Institute) SQL that works on every database like Oracle, MySQL, Microsoft SQL Server, etc.! Let's start with an introduction to the SQL (Structured Query Language) and why a data scientist may need it.
SQL and Data Science
In this part of the tutorial, you will see why a data scientist should learn SQL. Let's take a look at how SQL will help you in your career as a data scientist:
- SQL has become a requirement for most data science jobs that include: data analyst, BI (Business Intelligence) developer, programmer, database programmer. SQL will let you communicate with the database and work with your data.
- If you have used software like Tableau or any other data reporting or visualization software you might have seen how you connect your project to a database and, drag and drop graphs and plots onto the report specifying the fields only, and the software does the rest for you. So when you perform drag and drop activity of data attributes or use the graphical user interface to do all the operations within in a click, at the back of the software is running SQL that is interacting with the database. By learning SQL, you can directly interact with the database.
- SQL can be used with all application programming languages like PHP, Java. You can create your own data visualizations by integrating SQL in the application or get data from the database and convert into XML, JSON formats to use them in web services or APIs.
- Databases have evolved over the years, as Big Data has become the talk of the town and data used in our daily lives, NoSQL databases are becoming popular. Learning SQL will help you build a strong foundation in databases and help to understand when to use structured databases and when to use NoSQL databases and appreciate their differences.
Here I will go through a little bit of introduction and terminologies used in SQL and databases that will help you learn to code faster! If you already know about what a relational database is and what is SQL, then you can skip to code.
Introduction to SQL and Databases
A database is an organized collection of information. To manage databases, we need Data Base Management Systems (DBMS). A DBMS is a program that stores, retrieves, and modifies data in the databases on request.
There have been different types of databases over the time: Hierarchical, network, relational Databases and now NoSQL databases. Relational Database is a collection of relations or two-dimensional tables.
Following are terminologies used in RDBMS:
Term | Description |
---|---|
Table | A Table is the basic storage structure of an RDBMS. A table stores all the data necessary about something in the real world. Example: Employees. |
Single Row or Tuple | Representing all data required for a particular employee. Each row in a table may be identified by a Primary Key, which allows no duplicate rows. |
Column or Attribute | Usually refers to a characteristic of an entity |
Primary Key | A Field which uniquely identifies a row |
Foreign Key | A Foreign Key is a column that identifies how tables relate to each other. A foreign key refers to a Primary Key in another table. |
You can relate multiple tables using Primary and Foreign Keys. Each row in a relational database is identified uniquely by a Primary Key (PK). You can refer to another table using a Foreign Key (FK). For example:
Relational databases can be accessed using Structured Query Language or SQL. Every database will support ANSI SQL that is the standard SQL but also will have its own syntax to facilitate in some operations. In this tutorial, you will learn ANSI SQL so that you can work with all databases. ANSI SQL can be divided into five sections. I will name them all but here only two sections that are relevant Data Retrieval and DML:
- Data Retrieval:
- Select.
- Data Manipulation Language (DML):
- Insert, Update, Delete , Merge
- Data Definition Language (DDL):
- Create, Alter, Drop, Rename, Truncate.
- Data Control Language (DCL):
- Grant, Revoke.
- Transaction Control:
- Commit, Rollback, Savepoint.
There different vendors available for RDBMS. Most common and most used ones are:
- Oracle (Oracle Corporation)
- Microsoft SQL Server(Microsoft)
- MySQL (Oracle Corporation)
- PostgreSQL (PostgreSQL Global Development Group)
- SQLite (Developed by D. Richard Hipp )
So where do you run SQL queries? It may be:
- A reporting software that allows you to read from the database and view data (For example: Tableau, Microsoft BI)
- A GUI to database administration (for example: TOAD, SQL developer for Oracle, phpmyadmin for MySQL)
- A console based direct interface to the database (For example: SQL plus for Oracle database)
If you have credentials to login to a database, you can view the database objects by using queries or GUI to view the database objects as per your database vendor.
That's it! Let's get to learning SQL...
SQL Data Types
Each column in a database has a name and its data type and sometimes a size associated to it. It is the job of the database developer to design the database and decide what data type should be used according to requirements and volume of data.
As a data scientist, you need to be familiar with the data types as it will assist you to use database functions properly and write your queries accurately. There is a data type for each type of column in a database such as a person's name, some text stored, numbers, an image stored in the database and so on.
Here basic data types for Oracle Server, SQL server and MySQL server will be shown to you:
For further reading:
SQL and Data Reporting
For all SQL used in this tutorial, the following sample database schema:
Consider a database with two tables named emp that holds employee data and dept table that holds records about departments.
The emp table has employee number (empno), employee name (ename), salary (sal), commission (comm), job name (job), manager id (mgr), hire date (hiredate) and, department number (deptno). As Manager is also an employee and will have an employee number, mgr is from one of the empno whose job is "MANAGER".
The dept table has department no (deptno), department name (dname) and location of the department (loc).
Note that, different databases have different date format. The one here, DD-MON-YY is default date format for an Oracle database. The default format for Microsoft SQL Server and MySQL is YYYY-MM-DD.
Your database tables may be different, so you have to adjust only according to your table name and attributes' name. In this tutorial, you will be only reading from the database and not writing to, updating or creating new tables and objects. So no worries on any data loss or changes!
Information: Schema is a collection of objects such as tables, functions, procedures, views that belong to a user
In the above database schema, you can see that the emp table has six attributes for the entity Employee. Table dept has three attributes for the entity Department.
Retrieving Data Using SELECT Statement
A SELECT statement retrieves information from the database. Using a SELECT statement, you can do the following:
1. Projection: You can use the projection capability in SQL to choose the columns in a table that you want to be returned by your query. You can choose as few or as many columns in the table as you require.
2. Selection: You can use the selection capability in SQL to choose the rows in a table that you want to be returned by a query.
You can use various criteria to restrict the rows that you see.
3. Joining: You can use the join capability in SQL to bring together data that is stored in different tables by creating a link between them.
The basic SELECT statement allows you to specify what columns do you want from which table. The SELECT clause specifies columns, expressions and FROM clause specifies from which table to get data from.
For example, What are the names of all employees and what are their jobs: SELECT ename, job FROM employee; The above SQL query will give you following output based on the database schema discussed above:
ename | job |
---|---|
A | Salesman |
B | Manager |
C | Manager |
To select all attributes and all rows from table * operator is used:
SELECT * FROM employee; The output will be: EMPNO ENAME JOB MGR HIREDATE SAL COMM DEPTNO ---------- ---------- --------- ---------- --------- ---------- ---------- ---------- 7369 SMITH CLERK 7902 17-DEC-80 800 20 7499 ALLEN SALESMAN 7698 20-FEB-81 1600 300 30 7521 WARD SALESMAN 7698 22-FEB-81 1250 500 30 7566 JONES MANAGER 7839 02-APR-81 2975 20 7698 BLAKE MANAGER 7839 01-MAY-81 2850 30
Some tips for writing SQL statements:
- SQL statement are not case sensitive
- Clauses are written on separate lines to increase readability
- You can write SQL statement on one or more than one lines
You can also create an expression using +,-,/,* operators on the date and number data to generate required data. For example, you are asked to find out 20% of the salary of all employees. The query will be as follows:
SELECT ename, sal*(20/100) FROM employee; Output: ENAME SAL*(20/100) ---------- ------------ SMITH 160 ALLEN 320 WARD 250 JONES 595 BLAKE 570
Information:
- Parentheses are used to clarify the expression
- Division and multiplication have precedence over addition and subtraction
- If the same precedence occurs then the expression is evaluated from left->right
NULL values are treated differently in databases. A NULL value implies that a value is unknown. Any operation performed with NULL results in NULL. Different databases have different functions to handle null. There is a common function that is used in MySQL, Microsoft SQL Server, and Oracle to treat NULLs.
SELECT ename, sal+COALESCE(comm,0) FROM employee; Output: ENAME SAL+COALESCE(COMM,0) ---------- -------------------- SMITH 800 ALLEN 1900 WARD 1750 JONES 2975 BLAKE 2850
Column Aliases and Concatenation
You can see in above outputs, that the column name is the same as the database field or the expression you select. Sometimes, while generating reports you want to give your own names to headers. This can be done using aliases.
SELECT ename AS "Emp Name", sal*(20/100) as "20% of Salary" FROM employee; Output: Emp Name 20% of Salary ---------- ------------ SMITH 160 ALLEN 320 WARD 250 JONES 595 BLAKE 570
When you are giving an alias that has spaces in it you have to use double quotes. Otherwise, there is no need to put double quotes after AS. Sometimes you can also omit the keyword AS.
You can format your output by concatenation. You can add your own statements in the output using CONCAT function or operators like || or + depending on vendor database:
- Oracle supports CONCAT() and || but CONCAT() will take only two arguments. You have to use nested CONCAT().
- MySQL uses CONCAT()
- Microsodt SQL server uses '+' operator and CONCAT().
Oracle: SELECT '20% of salary of '||ename||' is '||sal*(20/100) as "20% of salary" FROM employee;
SELECT CONCAT(CONCAT('20% of salary of',ename),CONCAT(' is ',sal(20/100))) as "20% of salary" FROM employee; MySQL and Microsoft SQL server: SELECT CONCAT('20% of salary of ',ename,' is ',sal(20/100)) as "20% of salary" FROM employee; All will result in same output:
20% of salary ----------------------------------------------------------------------- 20% of salary of SMITH is 160 20% of salary of ALLEN is 320 20% of salary of WARD is 250 20% of salary of JONES is 595 20% of salary of BLAKE is 570
Eliminating repeated rows using DISTINCT
SELECT deptno FROM employee; Above query will result in: DEPTNO ---------- 20 30 30 20 30
Here the deptno 20 and 30 are repeating. You can eliminate such rows by using the DISTINCT keyword in the SELECT clause.
SELECT distinct ename, deptno, job FROM employee; DEPTNO ---------- 30 20
Restricting and Sorting Data
The WHERE clause is used to filter data on some condition.
Find all employees who have job CLERK:
SELECT ename, job FROM employee WHERE job='CLERK'; ENAME JOB ---------- --------- SMITH CLERK
You may want to filter your results with different conditions. For this operators, conditional symbols and specific keywords are used:
Use of =,<>,!=,>=,<=,>,< is simple as shown in the above query that is using =.
AND & OR syntax: SELECT column1, column2,.. FROM table_name WHERE condition1 AND condition2 AND condition 3...;
SELECT column1, column2,.. FROM table_name WHERE condition1 OR condition2 OR condition 3...; Find names of employees whose job is MANAGER and belong to department 30:
SELECT ename FROM employee WHERE job='MANAGER' AND deptno=30; ENAME ---------- BLAKE
SELECT ename FROM employee WHERE job='MANAGER' OR deptno=30; ENAME ---------- ALLEN WARD JONES BLAKE
NOT syntax: SELECT column1, column2, ... FROM table_name WHERE NOT condition; Find all employees whose job is not SALESMAN:
SELECT ename, job from employee WHERE NOT job='SALESMAN'; ENAME JOB ---------- --------- SMITH CLERK JONES MANAGER BLAKE MANAGER
You can make complex conditions using all three AND, OR and NOT operators. The precedence is:
- NOT
- AND
- OR
Find all employees whose job is not CLERK and belong to department 20:
SELECT ename, job from employee WHERE NOT job='SALESMAN' AND sal>800; ENAME JOB ---------- --------- JONES MANAGER BLAKE MANAGER
Here NOT executed first and after NOT, AND was evaluated.
Other handy conditional operators are shown below:
BETWEEN...AND:
SELECT * FROM employee WHERE sal BETWEEN 1000 AND 2000; EMPNO ENAME JOB MGR HIREDATE SAL COMM DEPTNO ---------- ---------- --------- ---------- --------- ---------- ---------- ---------- 7499 ALLEN SALESMAN 7698 20-FEB-81 1600 300 30 7521 WARD SALESMAN 7698 22-FEB-81 1250 500 30
LIKE:
LIKE uses two wildcards: percentage % and underscore _ to represent the number of characters in the pattern.
- % means any zero, one, or multiple characters
- %M%: Match any string having M in any position
- M%: Match value having M at start
- %M: Match value having M at end
- M%A: Start with M and end with A
Patterns are case-sensitive.
- _ specifies the number of unknown characters before or after the known character. One underscore is one character.
- _r%: Match value having r in the second position.
Get names of all employees whose names start with 'B':
SELECT * FROM employee WHERE ename LIKE 'B%'; EMPNO ENAME JOB MGR HIREDATE SAL COMM DEPTNO ---------- ---------- --------- ---------- --------- ---------- ---------- ---------- 7698 BLAKE MANAGER 7839 01-MAY-81 2850 30
Get names of all employees whose names start with 'A' and have 'E' in any position after 'A':
SELECT * FROM employee WHERE ename LIKE 'A%E%'; EMPNO ENAME JOB MGR HIREDATE SAL COMM DEPTNO ---------- ---------- --------- ---------- --------- ---------- ---------- ---------- 7499 ALLEN SALESMAN 7698 20-FEB-81 1600 300 30
IN(value1,value2, value3,..):
The IN() function can take one, two or multiple values and allows you to match a column to the given values in parentheses in the WHERE clause:
SELECT ename, job, hiredate FROM employee WHERE job IN ('CLERK','SALESMAN'); ENAME JOB HIREDATE ---------- --------- --------- SMITH CLERK 17-DEC-80 ALLEN SALESMAN 20-FEB-81 WARD SALESMAN 22-FEB-81
You can also use a SELECT statement in IN() that will return some values. For example:
SELECT ename, job, hiredate FROM employee WHERE deptno IN (select deptno FROM department WHERE loc='CHICAGO'); ENAME JOB HIREDATE ---------- --------- --------- ALLEN SALESMAN 20-FEB-81 WARD SALESMAN 22-FEB-81 BLAKE MANAGER 01-MAY-81
The SELECT statement in IN() is also called as a sub-query. More on sub-queries later in this tutorial!
IS NULL:
IS NULL is used to check for NULL values in a given attribute. For example, find all employees who don't have commission:
SELECT ename, job, sal FROM employee WHERE comm IS NULL; ENAME JOB SAL ---------- --------- ---------- SMITH CLERK 800 JONES MANAGER 2975 BLAKE MANAGER 2850
If you want to get names of employees who earned a commission then you will use IS NOT NULL:
SELECT ename, job, sal, comm FROM employee WHERE comm IS NOT NULL; ENAME JOB SAL COMM ---------- --------- ---------- ---------- ALLEN SALESMAN 1600 300 WARD SALESMAN 1250 500
To filter results on using date columns you have to use default date format. If some other format is to be used, date functions need to be applied which you will see later in this tutorial. Find all employees who were hired after 21 February 1981:
(Here Oracle default date format is used)
SELECT ename, job, sal, comm FROM employee WHERE hiredate>'21-FEB-81'; ENAME JOB SAL COMM ---------- --------- ---------- ---------- WARD SALESMAN 1250 500 JONES MANAGER 2975 BLAKE MANAGER 2850
Complex conditional expressions can be made using a combination of all the above operators, but you have to be careful with precedence. Precedence is the order in which operators are evaluated. Following are precedence rules for different databases:
- Microsoft Transact-SQL operator precedence
- Oracle 10g condition precedence
- Oracle MySQL 9 operator precedence
- PostgreSQL operator Precedence
- SQLite operator Precedence
There are two useful functions: ANY(), ALL() that can be used in conditions. For example:
SELECT ename, job, sal, comm FROM employee WHERE deptno=ANY(SELECT deptno from dept WHERE loc='NEW YORK');
SELECT ename, job, sal, comm FROM employee WHERE deptno=ALL(SELECT deptno from dept WHERE dname='SALES');
Notice the use of subqueries in the above examples. You will learn more about sub-queries later.
Ordering Results Using ORDER BY CLAUSE
You can order your results ascending (ASC) or Descending (DESC) by any attribute or multiple attributes of the table. You can also order by aliases that you specify in SELECT clause:
SELECT ename, job, sal, comm FROM employee WHERE hiredate>'21-FEB-81' ORDER BY sal desc; ENAME JOB SAL COMM ---------- --------- ---------- ---------- JONES MANAGER 2975 BLAKE MANAGER 2850 WARD SALESMAN 1250 500
Note: Default order is ascending denoted by ASC, and you don't need to specify ASC.
SELECT ename, job, sal, comm FROM employee WHERE hiredate>'21-FEB-81' ORDER BY sal; ENAME JOB SAL COMM ---------- --------- ---------- ---------- WARD SALESMAN 1250 500 BLAKE MANAGER 2850 JONES MANAGER 2975
You can specify multiple columns in ORDER BY clause that will execute in order of the columns specified. For example, sort employees first by their deptno in ascending order and then names in descending:
SELECT ename, job, sal, comm, deptno FROM employee WHERE hiredate>'21-FEB-81' ORDER BY deptno ASC, ename DESC; ENAME JOB SAL COMM DEPTNO ---------- --------- ---------- ---------- ---------- JONES MANAGER 2975 20 WARD SALESMAN 1250 500 30 BLAKE MANAGER 2850 30
Using Single-Row Functions to Customize Output
There are numerous functions in all RDBMS that will help you perform common tasks such as getting the length of a string, joining strings, formatting functions, mathematical functions, etc. There are two types of row functions in SQL:
- Single-Row functions
- Multiple-Row functions
Single-Row Functions:
These functions are applied on each row and return result per each row. CONCAT() is a character manipulation single-row function. These can be used in SELECT, WHERE and ORDER BY clause. There are single-row functions for the following in all RDBMS:
- Character Manipulation functions
- Date and Time functions
- Number Functions
- Conversion Functions
Different databases have different function names for the same functionality. Here I will tell you common useful functions among Microsoft SQL Server, Oracle, and MySQL. Some of these you have already learned in the above sections. At the end of this section, you will find links to a complete list of functions from these vendors for further practice.
Let's go through each category:
Character Manipulation Functions
LOWER(): Converts string to lower caps.
SELECT lower(ename) as ename FROM employee; ENAME ---------- smith allen ward jones blake
UPPER(): Converts string to lower caps.
SELECT upper(ename) as ename FROM employee; ENAME ---------- SMITH ALLEN WARD JONES BLAKE
SUBSTR()[Oracle, MySQL]: Returns substring as specified SUBSTR(string, start-position,length)
SUBSTRING()[SQL Server]: Returns substring as specified SUBSTRING(string, start-position,length).
SELECT SUBSTR(ename,2,3) as substr_ename FROM employee;
Just replace function name to SUBSTRING for SQL server.
SUBSTR_ENAME ------------ MIT LLE ARD ONE LAK
LENGTH()[Oracle, MySQL]: Returns length of the string in parentheses
LEN()[SQL server]: Returns length of the string in parentheses
SELECT LENGTH(ename) as len_ename FROM employee;
Just replace function name to LEN for SQL server.
LEN_ENAME ---------- 5 5 4 5 5
Functions such as padding a string to left or right, replace start to differ in the syntax for different databases. List of Character and String functions for the three vendor databases:
Number Functions
Function name | Function |
---|---|
ROUND(m,n): | Rounds the value m to specified n decimal places. |
ABS(m,n): | Returns absolute value of a number |
FLOOR(n): | FLOOR returns the largest integer equal to or less than n. |
MOD(m,n): | Returns remainder of division of m by n (not available in SQL server but uses % operator: 35 % 6) |
Oracle:
SELECT ROUND(45.926,2),MOD(11,5),FLOOR(34.4),ABS(-24) from dual; OUTPUT: ROUND(45.926,2) MOD(11,5) FLOOR(34.4) ABS(-24) --------------- ---------- ----------- ---------- 45.93 1 34 24
MySQL:
SELECT ROUND(45.926,2),MOD(11,5),FLOOR(34.4),ABS(-24);
SQL server:
SELECT ROUND(45.926,2) as round,11%5 as mod, FLOOR(34.4) as floor,ABS(-24) as abs;
List of Numeric functions for the three vendor databases:
Conversion Functions
Conversion Functions are used to convert one data type into another. Convert functions differ from server to server. Here I will give you an example of an Oracle server query that uses the to_char function. This will help you understand how you can use conversion functions to customize the output. Oracle's default date format is DD-MON-YY
SELECT ename, to_char(hiredate,'DD, MONTH YYYY') as Hiredate, to_char(hiredate,'DY') as Day from employee; ENAME Hiredate DAY ---------- -------------------------------------------- ------------ SMITH 17,DECEMBER 1980 WED ALLEN 20,FEBRUARY 1981 FRI WARD 22,FEBRUARY 1981 SUN JONES 02,APRIL 1981 THU BLAKE 01,MAY 1981 FRI
There lots of conversion functions for each database server. For further reading and exploration following are links to different database server's conversion functions:
- Microsoft SQL server conversion functions
- Oracle server conversion functions
- MySQL conversion functions
Date and Time Functions
Date and time functions allow you to manipulate the date by adding days, calculating months between two dates, etc. Such functions are beneficial when generating reports. Date functions also vary from database to database. A function may perform same functionality but may have a different name in another database. Here you will look at a few examples and then I will provide you with links to the date functions of the top most common databases: Oracle, SQL Server, and MySQL. Go ahead and try out these examples on your database server!
- Oracle Date and Time functions
- MySQL Date and Time functions
- Microsoft SQL Server Date and Time functions
Grouping your SQL Results
Grouping functions or Multi-Row functions are applied on groups and return one result per group. You will need group functions when you want to find out facts such as total sales per quarter, the Average price of a product throughout some time period, highest investment your company received this month and so on.
Reporting Aggregated Data Using the Group Functions
GROUP BY clause is used to group results used often with functions like COUNT, MAX, MIN, AVG, and SUM. The syntax of GROUP BY is:
SELECT column_name(s) FROM table_name WHERE condition GROUP BY column_name(s) HAVING condition ORDER BY column_name(s);
You have to keep in mind some points for the GROUP BY clause:
- Alias names cannot be used in GROUP BY clause
- WHERE clause will always come before GROUP BY
- HAVING clause will come after GROUP BY and will apply conditions on group functions.
- ORDER BY clause will always come at last
COUNT is a group function that counts the number of rows of data according to condition or no condition. The COUNT function ignores null values.
Count all employees whose name start with 'A':
SELECT count(ename) FROM employee WHERE ename LIKE 'A%'; Output: COUNT(ENAME) ------------ 1
Find total salary, average salary, minimum salary and maximum salary in table employee:
SELECT sum(sal) as "TOTAL SAL", avg(sal) as "AVG SAL", min(sal) as "MIN SAL", max(sal) as "MAX SAL" FROM employee; TOTAL SAL AVG SAL MIN SAL MAX SAL ---------- ---------- ---------- ---------- 9475 1895 800 2975
You may want to group above result for each department:
SELECT sum(sal) as "TOTAL SAL", avg(sal) as "AVG SAL", min(sal) as "MIN SAL", max(sal) as "MAX SAL", deptno FROM employee GROUP BY deptno; TOTAL SAL AVG SAL MIN SAL MAX SAL DEPTNO ---------- ---------- ---------- ---------- ---------- 5700 1900 1250 2850 30 3775 1887.5 800 2975 20
You can also calculate Variance and Standard Deviation of values using following functions:
For Oracle and MySQL | MS SQL SERVER |
---|---|
SELECT STDDEV(column_name) FROM table_name; | SELECT STDEV(column_name) FROM table_name; |
SELECT VARIANCE(column_name) FROM table_name; | SELECT VAR(column_name) FROM table_name; |
You can group results by more than one column.
Find the sum of salaries of each job within each department:
SELECT sum(sal) as "TOTAL SAL", deptno, job FROM employee GROUP BY deptno, job; TOTAL SAL DEPTNO JOB ---------- ---------- --------- 800 20 CLERK 2850 30 SALESMAN 2975 20 MANAGER 2850 30 MANAGER
You can restrict group results using the HAVING clause. It's used only for group conditions.
Note: Do not use WHERE for applying conditions on group functions.
SELECT sum(sal) as "TOTAL SAL", deptno, job FROM employee GROUP BY deptno, job HAVING sum(sal)>1000 ORDER BY sum(sal); TOTAL SAL DEPTNO JOB ---------- ---------- --------- 2850 30 SALESMAN 2850 30 MANAGER 2975 20 MANAGER
Displaying Data From Multiple Tables
In this section following joins will be discussed:
- Cartesian product/Cross join
- Inner join/EquiJoins
- Natural joins
- Outer Joins(Left, Right, Full)
- Self Join
You will need to display reports that get data from multiple tables. See the example below:
To generate a report like above, you need to link employee and department table and get data from both of them. For this Joins are used in SQL.
Cartesian Product/Cross Join:
The Cartesian product is formed when every tuple of a relation R creates a combination with every tuple in a relation S.
Cartesian product operation also called as Cross Join multiplies two tables to form a relation that consists of all possible pairs of tuples from two tables. If one relation say, R has I tuples with M number of attributes and other relation say S has J tuples N attributes cartesian product will have IxJ tuples with M+N attributes. A cartesian product can be formed using the SQL complaint CROSS JOIN as well.
SELECT empno, ename, dname FROM employee, department; OR SELECT empno, ename, dname FROM employee CROSS JOIN department; EMPNO ENAME DNAME ---------- ---------- -------------- 7369 SMITH ACCOUNTING 7499 ALLEN ACCOUNTING 7521 WARD ACCOUNTING 7566 JONES ACCOUNTING 7698 BLAKE ACCOUNTING 7369 SMITH RESEARCH 7499 ALLEN RESEARCH 7521 WARD RESEARCH 7566 JONES RESEARCH 7698 BLAKE RESEARCH 7369 SMITH SALES EMPNO ENAME DNAME ---------- ---------- -------------- 7499 ALLEN SALES 7521 WARD SALES 7566 JONES SALES 7698 BLAKE SALES
15 rows selected.
As the employee table comprises of 5 tuples and department contains 3 tuples the Cartesian product has 5x3=15 rows. A Cartesian product is generated when:
- A join condition is not used
- A join condition is not valid or not formed properly
When data from two or more tables is required, a join condition is made using the common attributes linking the involved tables. These are most commonly the Primary Key and the Foreign Key.
Cartesian products are used for simulation of large amounts of data for testing purposes.
*Inner Join/EquiJoin:
Inner Join or Equijoin (Term used by Oracle developers) uses primary and foreign key relationships to join two or more tables:
SELECT ename, dname FROM employee e,department d WHERE e.deptno=d.deptno; OR SELECT ename, dname FROM employee e JOIN department d ON e.deptno=d.deptno; ENAME DNAME ---------- -------------- SMITH RESEARCH ALLEN SALES WARD SALES JONES RESEARCH BLAKE SALES
In the above query, deptno is the primary key in the department table and foreign key in the employee table. You can add additional conditions using logical operations as discussed in the sections above.
Point to remember for JOINS:
- If the same column name appears in more than one table, the column name must be prefixed with the table name. It is good practice to do this otherwise as well to enhance clarity.
- To join n tables together, you need a minimum of n-1 join conditions. For example, to join four tables, a minimum of three joins is required.
Table aliases: Table aliases are used such as in FROM employee e, department d where e and d are table aliases, to make it easier for the database engine to identify from which table the column is from especially when same columns are present in more than one involved table because otherwise error for ambiguous column will be given. And it saves time from writing large table names again and again.
You can create nonequi-joins as well which join tables based on conditions other than equality. Consider another table salgrade that has information about the grade of an employee based on salary:
You want to find out each employees' grade based on salary and grades are stored in salgrade table while salary is stored in employee table:
SELECT e.ename, e.sal, s.grade FROM employee e, salgrade s WHERE e.sal between s.losal AND s.hisal; ENAME SAL GRADE ---------- ---------- ---------- JONES 2975 4 BLAKE 2850 4 ALLEN 1600 3 WARD 1250 2 SMITH 800 1
Example: You are asked to give the names of employees, their salary, grade, and department name. Now, as you know, salary is stored in the employee table, grades in the salgrade table and department name in department table you will need to join three tables together:
SELECT e.ename, e.sal, d.dname, s.grade FROM employee e, department d, salgrade s WHERE e.deptno=d.deptno AND e.sal BETWEEN s.losal AND s.hisal; Output: ENAME SAL DNAME GRADE ---------- ---------- -------------- ---------- JONES 2975 RESEARCH 4 BLAKE 2850 SALES 4 ALLEN 1600 SALES 3 WARD 1250 SALES 2 SMITH 800 RESEARCH 1
In above example, three tables are involved and two join condition one of them is a nonequi-join.
Natural Join:
Natural joins allow databases to join tables automatically by matching columns with the same name. If columns match with the same name but data types are different, then an error is given.
Syntax: SELECT FROM table1 NATURAL JOIN table2; SELECT FROM employee NATURAL JOIN department; Since natural join itself find out matching column it may find more than one matching columns with the same name but may have a different data type which may result in an error. So USING clause is used to specify columns on which to make an equi join.
Note: NATURAL JOIN and USING are two different clauses and used separately. One cannot be used in the presence of the other that is mutually exclusive.
SELECT e.ename, d.dname, e.sal FROM employee e JOIN department d USING (deptno) WHERE deptno=20; output: ENAME DNAME SAL ---------- -------------- ---------- SMITH RESEARCH 800 JONES RESEARCH 2975
Columns in the USING clause cannot use table name or table prefixes anywhere in the SQL statement. For example following is incorrect:
SELECT e.ename, d.dname, e.sal FROM employee e JOIN department d USING (d.deptno) WHERE d.deptno=20;
d.deptno is incorrect. Only deptno should be used. Outer Joins:
There are three outer joins:
- Left Outer Join: A join between two tables that returns the results of the inner join, as well as unmatched rows of the left table, is called a LEFT OUTER JOIN
- Right Outer Join: A join between two tables that returns the results of the inner join, as well as unmatched rows of the right table, is called RIGHT OUTER JOIN.
- Full Outer Join: A join between two tables that returns the results of the inner join, as well as unmatched rows of the right table, is called RIGHT OUTER JOIN.
Let's see these one by one:
SELECT e.ename, s.grade FROM salgrade s LEFT OUTER JOIN employee e ON e.sal BETWEEN s.losal AND s.hisal; Output: ENAME DEPTNO DNAME ---------- ---------- -------------- SMITH 20 RESEARCH JONES 20 RESEARCH ALLEN 30 SALES WARD 30 SALES BLAKE 30 SALES.sal BETWEEN s.losal AND s.hisal; Output: ENAME DEPTNO DNAME ---------- ---------- -------------- SMITH 20 RESEARCH JONES 20 RESEARCH ALLEN 30 SALES WARD 30 SALES BLAKE 30 SALESSELECT e.ename, d.deptno, d.dname
FROM employee e RIGHT OUTER JOIN department d ON e.deptno=d.deptno; output: ENAME DEPTNO DNAME ---------- ---------- -------------- SMITH 20 RESEARCH ALLEN 30 SALES WARD 30 SALES JONES 20 RESEARCH BLAKE 30 SALES 10 ACCOUNTINGSELECT e.ename, d.deptno, d.dname
FROM employee e FULL OUTER JOIN department d ON e.deptno=d.deptno; Output: ENAME DEPTNO DNAME ---------- ---------- -------------- SMITH 20 RESEARCH ALLEN 30 SALES WARD 30 SALES JONES 20 RESEARCH BLAKE 30 SALES 10 ACCOUNTING
The department table had one unmatched row for deptno 10, and no unmatched row was found for employee table. Similarly, salgrade has one unmatched row for grade 5.
Self-Join:
There are situations when you want to join a table with itself. For example, the employee table contains information about all employees including managers (Refer to the table employee above). If you are asked to find out employees and their managers, then you will need to join a table to itself. Self-Joins are made in the following way:
SELECT e.ename as Employee, m.ename as Manager FROM employee e, employee m WHERE e.mgr=m.empno; EMPLOYEE MANAGER ---------- ---------- ALLEN BLAKE WARD BLAKE
Using Subqueries to Solve Queries
A subquery is a query within a query. A subquery is useful to break down large queries into segments. Subqueries can be nested and can be used in the following SQL clauses:
- WHERE
- FROM
- HAVING
Consider an example; you are asked to find an employee whose salary is higher than another employee, in this case, James. How would you approach this? This query can be broken down as:
- Find the salary of James
- Compare salary of James to all employee
This can be solved using a subquery. The subquery (inner query) will execute once before the main (outer) query.
SELECT empno, ename FROM employee WHERE sal>(SELECT sal from employee where ename='JAMES');
Types of Subqueries:
-
Single-row subquery: Query that returns only one row from an inner SELECT statement.
-
Multiple-row subquery: Query that returns more than one row from an inner SELECT statement.
Points to remember:
- Enclose subqueries in parenthesis.
- Place subqueries on the right of the comparison operator.
- Use single row operators (>, < , >= , <= , < >) with single row subqueries.
- Use multiple row operators (IN, ANY, ALL) with multiple row subqueries.
Single-row Subquery:
You are asked to find names of employees whose job title is same as empno 7521
SELECT ename, job FROM employee WHERE job=(SELECT job FROM employee WHERE empno=7521); ENAME JOB ---------- --------- ALLEN SALESMAN WARD SALESMAN
Select maximum salary of departments where the maximum salary is higher than equal to the maximum salary of department 20
SELECT deptno, max(sal) FROM employee GROUP BY deptno HAVING max(sal)>=(SELECT max(sal) FROM employee WHERE deptno=20); DEPTNO MAX(SAL) ---------- ---------- 20 2975
Multiple-row Subquery:
Find employees whose salary is equal to any of the salaries of employees having a job as Manager. This will be a multiple-row subquery as the subquery can return more than one rows.
SELECT ename, job, sal FROM employee WHERE sal IN (SELECT sal FROM employee WHERE job='MANAGER'); ENAME JOB SAL ---------- --------- ---------- JONES MANAGER 2975 BLAKE MANAGER 2850
When subqueries are used in FROM clause, it acts as a temporary table that does not physically exist on the storage but is a view of data. For example:
SELECT e.ename, e.job, e.sal FROM employee e, (SELECT deptno FROM department WHERE loc='DALLAS') d WHERE e.deptno=d.deptno; ENAME JOB SAL ---------- --------- ---------- SMITH CLERK 800 JONES MANAGER 2975
Using SET Operators
In SQL Set operators are used to combine the results of multiple queries into a single result. It uses the set theory in mathematics and its operations such as UNION, MINUS, INTERSECT. Here you will see how you can use SET operators to optimize your queries. In this tutorial following SET operators will be discussed.
Information: Queries using SET operators are called compound statements.
- UNION and UNION ALL
- INTERSECT
- EXCEPT (SQL standard) and MINUS (Oracle specific)
UNION
Consider two relations (tables) R and S then UNION selects all rows from R and all rows from S eliminating duplicates. The number of returned rows can be maximum r+s where r is the number of rows in R and s is the number of rows in S.
Select all department names including the department names of employees who were hired before 23-OCT-1999
SELECT dname FROM department UNION SELECT dname FROM department, employee WHERE department.deptno=employee.deptno AND employee.hiredate<to_date('23-OCT-1999'); DNAME -------------- ACCOUNTING OPERATIONS RESEARCH SALES
Points to remember for UNION:
- The number of columns and the data types of the columns being selected must be identical in all the SELECT statements used in the query.
- The names of the columns need not be identical.
- The output is sorted in ascending order of the first column of the SELECT clause.
- NULL values are not ignored during duplicate checking.
UNION ALL combines result from one or more queries and does NOT remove duplicates, and hence, the DISTINCT keyword cannot be used. Consider another table emp that contains employees from the year 2000. Get all employee number, name and job from the emp and the employee table:
SELECT empno, ename, job FROM employee UNION ALL SELECT empno, ename, job FROM emp; EMPNO ENAME JOB ---------- ---------- --------- 7369 SMITH CLERK 7499 ALLEN SALESMAN 7521 WARD SALESMAN 7566 JONES MANAGER 7698 BLAKE MANAGER 7369 SMITH CLERK 7499 ALLEN SALESMAN 7521 WARD SALESMAN 7566 JONES MANAGER 7654 MARTIN SALESMAN 7698 BLAKE MANAGER EMPNO ENAME JOB ---------- ---------- --------- 7782 CLARK MANAGER 7788 SCOTT ANALYST 7839 KING PRESIDENT 7844 TURNER SALESMAN 7876 ADAMS CLERK 7900 JAMES CLERK
INTERSECT
Intersection follows the same rule as in set theory, to give the common values from two sets. Consider two relations R and S then INTERSECT will return set of all tuples that are common in both R and S.
SELECT empno, ename, job FROM employee INTERSECT SELECT empno, ename, job FROM emp; EMPNO ENAME JOB ---------- ---------- --------- 7369 SMITH CLERK 7499 ALLEN SALESMAN 7521 WARD SALESMAN 7566 JONES MANAGER 7698 BLAKE MANAGER
Points to remember for INTERSECT:
- The number of columns and the data types of the columns being selected must be identical in all the SELECT statements used in the query.
- The names of the columns need not be identical.
- INTERSECT does not ignore NULL values
EXCEPT and MINUS:
EXCEPT (used in SQL server) and MINUS (Oracle specific) both have the same function. The function is to select all distinct rows selected by the first query but not the second. To understand this operator, consider the following table:
Say, you need to find a product that has a quantity between 1 to 100 but does not have quantity 50-75. Here you will use EXCEPT if you are using SQL server and MINUS if you are using Oracle server.
SELECT prod_name, qty FROM products WHERE qty BETWEEN 1 AND 100 EXCEPT SELECT prod_name, qty FROM products WHERE qty BETWEEN 50 AND 75;
For Oracle server:
SELECT prod_name, qty FROM products WHERE qty BETWEEN 1 AND 100 MINUS SELECT prod_name, qty FROM products WHERE qty BETWEEN 50 AND 75; PROD_NAME QTY ------------- ---------- COLGATE 1 SENSODYNE 100 SENSODYNE TOOTHBRUSH 30
Points to remember for EXCEPT/MINUS:
- The number of columns and the data types of the columns being selected must be identical in all the SELECT statements used in the query.
- The names of the columns need not be identical.
- All of the columns in the WHERE clause must be in the SELECT clause for the MINUS operator to work.
Congratulations!
This is the end of the tutorial. In this tutorial, you have learned a significant amount of SQL that will help you master SQL in your data science journey. SQL is used to generate reports from databases. You have learned in this tutorial the foundations of databases and SQL, common data types used, functions that help you to achieve formatting reports, aggregation of results creating summaries and also ways on how you can gather data from different tables in the databases according to your requirements! This tutorial is specifically designed for data science learners that will not only help them with relational databases but also go smoothly through NoSQL databases and how to use SQL skills in Big Data studies.
If you would like to learn more about SQL, take DataCamp's free Intro to SQL for Data Science course.
References
- Introduction to Oracle 10g-SQL Lecture by Tanveer Zahid Khan (Senior Assistant Professor, Department of Computer Science, Bahria University)
- https://docs.oracle.com/cd/B19306_01/server.102/b14200/operators005.htm
- https://en.wikipedia.org/wiki/Set_operations_(SQL)#EXCEPT_operator
- https://www.w3schools.com/sql/sql_datatypes.asp
- https://docs.oracle.com/cd/B28359_01/server.111/b28318/datatype.htm#CNCPT012
Learn more about SQL
Course
Introduction to SQL
Course
Joining Data in SQL
cheat-sheet
SQL Basics Cheat Sheet
tutorial
SELECTing Multiple Columns in SQL
DataCamp Team
3 min
tutorial
Aggregate Functions in SQL
code-along
Getting Started in SQL
Kelsey McNeillie
code-along
Exploratory Data Analysis in SQL for Absolute Beginners
Jasmin Ludolf
code-along