Set operations form the foundation of SQL and enable us to combine, compare, and filter data from multiple sources.
These operations are indispensable for tasks ranging from data integration and cleansing to advanced analytics and reporting.
In this tutorial, we’ll learn what set operators are, how they are used in SQL, their practical applications, and more!
If you’re looking for a complete SQL learning resource, check out this seven-course SQL Fundamentals skill track.
For those of you in a hurry, let’s start with a very short answer on what set operations in SQL are.
Short Answer: What Are Set Operations in SQL?
Set operations in SQL are techniques for combining or comparing the results of two or more SELECT statements. They act like mathematical set operations, letting us find the union, intersection, or difference between the rows returned by our queries. This makes them indispensable when analyzing data from multiple sources or perspectives.
Here's a quick overview of the core set operations:
- UNION: Merges all unique rows from two or more SELECT statements, eliminating duplicates.
- UNION ALL: Merges all rows from two or more SELECT statements, keeping duplicates.
- INTERSECT: Returns only the rows that appear in both SELECT statements.
- EXCEPT: Returns rows from the first SELECT statement that don't appear in the second.
Understanding Set Operators
Set operators are keywords that combine the result sets of two or more SELECT queries. They let us compute the union (all rows from both results), the intersection (rows common to both results), and the difference (rows in the first result that are absent from the second).
Set operators are essential tools for data manipulation and analysis within SQL. They provide a powerful means of querying and processing data from relational databases.
Comparison with relational algebra operations
Relational algebra is a theoretical framework that provides a foundation for understanding database queries. It offers abstract operations like projection, selection, and join, rooted in mathematical principles and independent of specific database systems. Think of it as the "behind-the-scenes" logic that powers our database interactions.
Set operators in SQL offer a practical implementation of these concepts within a database environment. They enable us to perform set operations like union, intersection, and difference directly on the result sets of SQL queries.
While relational algebra provides a formal foundation for database operations, set operators in SQL offer a standardized, user-friendly interface for data manipulation tasks.
Understanding the relationship between set operators and relational algebra operations provides insights into the theoretical underpinnings of SQL. It enables users to leverage these concepts effectively in database querying and analysis tasks.
Types of Set Operators in SQL
There are three primary set operators in SQL:
- UNION
- INTERSECT
- EXCEPT (or MINUS in some dialects)
These operators mathematically correspond to the concepts of union, intersection, and set difference.
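The correspondence with mathematical sets can be sketched outside SQL. The snippet below is only an illustration: it models each query result as a Python set of row tuples (the sample rows are made up) and applies the matching set operation.

```python
# Each "result set" is modeled as a Python set of row tuples (illustrative data).
employees = {("Alice", "Marketing"), ("Bob", "Sales"), ("Carol", "Engineering")}
contractors = {("David", "Marketing"), ("Carol", "Engineering")}

union = employees | contractors         # behaves like UNION
intersection = employees & contractors  # behaves like INTERSECT
difference = employees - contractors    # behaves like EXCEPT

print(sorted(union))
print(sorted(intersection))  # [('Carol', 'Engineering')]
print(sorted(difference))    # [('Alice', 'Marketing'), ('Bob', 'Sales')]
```

As in SQL, the difference is not symmetric: swapping the two operands gives a different result.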
Let’s dive into more detail about each.
The UNION operator
The UNION operator combines the results of two or more SELECT queries into a single result set, removing duplicate rows by default.
For example, suppose we have two tables, employees and contractors, each with the same columns: name, department, and salary. For learning purposes, let's consider these two dummy tables:
| name | department | salary |
| --- | --- | --- |
| Alice | Marketing | 65000 |
| Bob | Sales | 70000 |
| Carol | Engineering | 80000 |
| John | HR | 55000 |
Employees
| name | department | salary |
| --- | --- | --- |
| David | Marketing | 60000 |
| Eva | Sales | 68000 |
| Carol | Engineering | 75000 |
Contractors
We can combine the results from both tables using the following command:
-- Using UNION to combine employees and contractors
SELECT name, department, salary FROM employees
UNION
SELECT name, department, salary FROM contractors;
This query selects the name, department, and salary columns from both the employees and contractors tables and combines them into a single result set. The UNION operator automatically removes duplicate rows from the final result set.
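| name | department | salary |
| --- | --- | --- |
| Alice | Marketing | 65000 |
| Bob | Sales | 70000 |
| Carol | Engineering | 80000 |
| John | HR | 55000 |
| David | Marketing | 60000 |
| Eva | Sales | 68000 |
| Carol | Engineering | 75000 |
Notice that Carol appears twice. UNION removes only rows that are identical in every selected column, and Carol's salary differs between the two tables (80000 as an employee, 75000 as a contractor). If we selected only name and department, her two rows would be identical and UNION would list her once, while UNION ALL would keep both.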
It's important to remember that UNION does not drop rows just because they contain NULL values. For the purpose of duplicate removal, however, NULLs are treated as equal to each other, so two rows that match in every column, including columns where both hold NULL, collapse into a single row.
If we want to keep every row from every query, duplicates included, we can use the UNION ALL operator instead. It appends the result sets without any duplicate check, which usually also makes it cheaper than UNION.
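The difference between UNION and UNION ALL can be checked with a small script. This is a minimal sketch that assumes SQLite (through Python's built-in sqlite3 module) as the database and loads the two dummy tables from above; other engines should behave the same way for these queries, but that is not verified here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (name TEXT, department TEXT, salary INTEGER);
CREATE TABLE contractors (name TEXT, department TEXT, salary INTEGER);
INSERT INTO employees VALUES
  ('Alice', 'Marketing', 65000), ('Bob', 'Sales', 70000),
  ('Carol', 'Engineering', 80000), ('John', 'HR', 55000);
INSERT INTO contractors VALUES
  ('David', 'Marketing', 60000), ('Eva', 'Sales', 68000),
  ('Carol', 'Engineering', 75000);
""")

def run(sql):
    return conn.execute(sql).fetchall()

# Carol's two rows differ in salary, so UNION keeps both of them.
with_salary = run("""
    SELECT name, department, salary FROM employees
    UNION
    SELECT name, department, salary FROM contractors
""")

# Without salary, the two Carol rows are identical and collapse into one.
without_salary = run("""
    SELECT name, department FROM employees
    UNION
    SELECT name, department FROM contractors
""")

# UNION ALL performs no duplicate check at all.
union_all = run("""
    SELECT name, department FROM employees
    UNION ALL
    SELECT name, department FROM contractors
""")

# NULLs count as equal to each other during duplicate removal.
null_union = run("SELECT NULL AS x UNION SELECT NULL")
null_union_all = run("SELECT NULL AS x UNION ALL SELECT NULL")

print(len(with_salary), len(without_salary), len(union_all))  # 7 6 7
print(null_union, null_union_all)  # [(None,)] [(None,), (None,)]
```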
The INTERSECT operator
The INTERSECT operator returns only the rows that appear in both result sets. Think of it as finding the people who belong to both groups.
Let’s use INTERSECT to query our tables above. For the sake of the example, let’s just query the name and department columns:
-- Using INTERSECT to find common employees
SELECT name, department FROM employees
INTERSECT
SELECT name, department FROM contractors;
This query selects the name and department columns from both the employees and contractors tables and returns only the rows that exist in both tables based on all selected columns.
| name | department |
| --- | --- |
| Carol | Engineering |
When comparing rows, the INTERSECT operator treats NULL values as equal to each other. This differs from the = operator, for which NULL = NULL is not true. Two rows that hold NULL in the same column can therefore still match.
If one row holds NULL in a column and the other row holds a non-NULL value in that column, the rows are not considered equal and will not be included in the intersection result.
Additionally, if either result set is empty (i.e., it contains no rows), the result of INTERSECT is also empty, since an empty set has no rows in common with any other set.
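These rules can be observed directly. The sketch below again assumes SQLite via Python's sqlite3 module; the tables a, b, and empty_t are hypothetical and exist only for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE a (id INTEGER, tag TEXT);
CREATE TABLE b (id INTEGER, tag TEXT);
CREATE TABLE empty_t (id INTEGER, tag TEXT);
INSERT INTO a VALUES (1, 'x'), (2, NULL), (3, NULL);
INSERT INTO b VALUES (1, 'x'), (2, NULL), (3, 'z');
""")

# (2, NULL) matches (2, NULL); (3, NULL) does not match (3, 'z').
common = conn.execute(
    "SELECT id, tag FROM a INTERSECT SELECT id, tag FROM b ORDER BY id"
).fetchall()

# Intersecting with an empty result always yields an empty result.
with_empty = conn.execute(
    "SELECT id, tag FROM a INTERSECT SELECT id, tag FROM empty_t"
).fetchall()

print(common)      # [(1, 'x'), (2, None)]
print(with_empty)  # []
```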
The EXCEPT (MINUS) operator
The EXCEPT operator retrieves the rows present in the first result set but not in the second.
For example, let’s say we execute the following query:
-- Using EXCEPT to find employees who are not contractors
SELECT name, department, salary FROM employees
EXCEPT
SELECT name, department, salary FROM contractors;
This query selects the name, department, and salary columns from the employees table and returns only the rows that do not also appear in the result from the contractors table.
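| name | department | salary |
| --- | --- | --- |
| Alice | Marketing | 65000 |
| Bob | Sales | 70000 |
| Carol | Engineering | 80000 |
| John | HR | 55000 |
Carol is included because her employee row (salary 80000) is not identical to her contractor row (salary 75000). Comparing only name and department would remove her from the result.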
The EXCEPT operator handles NULL values the same way: NULLs in the same column are treated as equal when rows are compared. If the first result set is empty, the result is empty; if the second result set is empty, EXCEPT returns every distinct row from the first result set.
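The following sketch runs EXCEPT against the dummy tables, assuming SQLite via Python's sqlite3 module. SQLite spells the operator EXCEPT; in a dialect that uses MINUS, the keyword would need to be swapped. The empty_t table is a hypothetical helper for the empty-set case.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (name TEXT, department TEXT, salary INTEGER);
CREATE TABLE contractors (name TEXT, department TEXT, salary INTEGER);
CREATE TABLE empty_t (name TEXT, department TEXT, salary INTEGER);
INSERT INTO employees VALUES
  ('Alice', 'Marketing', 65000), ('Bob', 'Sales', 70000),
  ('Carol', 'Engineering', 80000), ('John', 'HR', 55000);
INSERT INTO contractors VALUES
  ('David', 'Marketing', 60000), ('Eva', 'Sales', 68000),
  ('Carol', 'Engineering', 75000);
""")

def names(sql):
    return [row[0] for row in conn.execute(sql + " ORDER BY name")]

# Salary is compared too, so Carol (80000 vs. 75000) stays in the result.
by_full_row = names(
    "SELECT name, department, salary FROM employees "
    "EXCEPT SELECT name, department, salary FROM contractors"
)

# Comparing only name and department removes Carol.
by_name_dept = names(
    "SELECT name, department FROM employees "
    "EXCEPT SELECT name, department FROM contractors"
)

# The operand order matters: this is contractors minus employees.
reversed_order = names(
    "SELECT name, department FROM contractors "
    "EXCEPT SELECT name, department FROM employees"
)

# An empty second operand leaves the first result unchanged.
minus_nothing = names(
    "SELECT name, department FROM employees "
    "EXCEPT SELECT name, department FROM empty_t"
)

print(by_full_row)     # ['Alice', 'Bob', 'Carol', 'John']
print(by_name_dept)    # ['Alice', 'Bob', 'John']
print(reversed_order)  # ['David', 'Eva']
print(minus_nothing)   # ['Alice', 'Bob', 'Carol', 'John']
```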
Set Operators: Performance and Optimization
The impact of set operators on query performance in SQL can vary depending on factors such as the size of the datasets involved, the complexity of the queries, and the database management system (DBMS) used.
Let's break down the key factors and strategies for optimization.
Data volume and query complexity
With large amounts of data, set operators can significantly affect query performance: the larger the result sets that must be combined, intersected, or compared, the longer the operation takes. UNION, INTERSECT, and EXCEPT also have to detect duplicate rows, which typically means sorting or hashing the rows involved; UNION ALL skips that step.
Complex queries containing multiple subqueries, joins, or set operators add further processing overhead, and chaining or nesting set operations can compound the cost.
Indexing and optimization techniques
Proper indexing of the columns involved in set operations can significantly improve query performance. Indexes help the database engine quickly locate and retrieve the relevant rows, reducing the need for full-table scans and improving query execution times.
To enhance the performance of queries involving set operators, database administrators and developers can use optimization techniques such as query rewriting, query plan analysis, and database schema optimization. Query caching and materialized views can also precompute and store the results of complex queries, reducing the cost of repeating set operations.
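Query plan analysis can be tried on a small scale. The sketch below assumes SQLite via Python's sqlite3 module and uses its EXPLAIN QUERY PLAN statement; the index names are hypothetical, and the wording of the plan steps depends on the SQLite version, so the output is printed rather than relied upon. Other DBMSs offer their own plan-inspection commands.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (name TEXT, department TEXT);
CREATE TABLE contractors (name TEXT, department TEXT);
-- Hypothetical indexes on the columns the set operation compares.
CREATE INDEX idx_emp ON employees (name, department);
CREATE INDEX idx_con ON contractors (name, department);
INSERT INTO employees VALUES ('Alice', 'Marketing'), ('Carol', 'Engineering');
INSERT INTO contractors VALUES ('David', 'Marketing'), ('Carol', 'Engineering');
""")

def plan(sql):
    """Return the plan steps SQLite reports for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[-1] for row in rows]  # the last column holds the description

union_sql = ("SELECT name, department FROM employees "
             "UNION SELECT name, department FROM contractors")
union_all_sql = union_sql.replace("UNION", "UNION ALL")

union_plan = plan(union_sql)
union_all_plan = plan(union_all_sql)

for label, steps in (("UNION", union_plan), ("UNION ALL", union_all_plan)):
    print(label)
    for step in steps:
        print("  ", step)
```

Comparing the two plans is a quick way to see whether the engine adds a duplicate-removal step for UNION and whether it uses the indexes at all.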
Database engine and hardware resources
The performance of set operations may vary depending on the underlying database engine and its optimization capabilities. Different DBMSs may employ different optimization strategies and algorithms for processing set operations, leading to variations in performance.
The availability of hardware resources such as CPU, memory, and disk I/O also influences the performance of queries involving set operators. Adequate hardware resources can help mitigate performance bottlenecks and ensure efficient query execution.
Set Operators in Practice: A Case Study
Set operators aren't just theoretical tools; they have real-world applications that can significantly impact business decisions. Let's walk through a simplified example of how a company might use set operators to segment its customer base for targeted marketing campaigns.
The scenario
Imagine a company that sells both online and in physical stores. They have two separate datasets:
- Online purchases: Customer ID, purchase history, demographics, and location for online shoppers.
- In-store transactions: Similar information for customers who shopped in person.
Using set operators
To get a complete picture of all customers, the company would first use UNION to combine the customers from both datasets into a single result set, removing any duplicates. This gives them a unified view of their entire customer base.
Next, they could use INTERSECT to identify customers who have shopped both online and in-store. This segment is particularly valuable as they are highly engaged with the brand across multiple channels.
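To find opportunities for cross-channel promotion, the company could use EXCEPT. For example, assuming both tables expose a customer_id column, SELECT customer_id FROM online_purchases EXCEPT SELECT customer_id FROM in_store_transactions would find customers who have shopped online but not in stores. Comparing only the customer ID matters here: SELECT * would compare entire purchase rows, which rarely match across channels. The company could then target these customers with promotions encouraging them to visit a physical location.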
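The three segments can be sketched end to end. This example assumes SQLite via Python's sqlite3 module and a deliberately simplified, hypothetical schema: each table has only a customer_id and an amount column, and the rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical, simplified schema: one row per purchase.
CREATE TABLE online_purchases (customer_id INTEGER, amount REAL);
CREATE TABLE in_store_transactions (customer_id INTEGER, amount REAL);
INSERT INTO online_purchases VALUES (1, 20.0), (2, 35.5), (3, 12.0), (3, 8.0);
INSERT INTO in_store_transactions VALUES (3, 50.0), (4, 15.0);
""")

def customer_ids(sql):
    return [row[0] for row in conn.execute(sql + " ORDER BY customer_id")]

online = "SELECT customer_id FROM online_purchases"
in_store = "SELECT customer_id FROM in_store_transactions"

all_customers = customer_ids(f"{online} UNION {in_store}")
both_channels = customer_ids(f"{online} INTERSECT {in_store}")
online_only = customer_ids(f"{online} EXCEPT {in_store}")
in_store_only = customer_ids(f"{in_store} EXCEPT {online}")

print(all_customers)  # [1, 2, 3, 4]
print(both_channels)  # [3]
print(online_only)    # [1, 2]
print(in_store_only)  # [4]
```

Customer 3 bought online twice, yet appears once in every segment because these operators remove duplicates.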
Beyond segmentation
With these segments identified, the company can go further, refining them based on additional factors like demographics or purchase history. This granular understanding of their customers allows them to tailor marketing campaigns with greater precision.
Limitations and Considerations
When using set operators in SQL, it's essential to consider several limitations and factors that can affect query performance, result accuracy, and overall usability.
Data type compatibility and NULL values
Every query combined by a set operator must return the same number of columns, and corresponding columns must have compatible data types. Ensure data consistency and compatibility across result sets to avoid errors and unexpected results.
Set operators treat NULL values as equal to each other when comparing rows, which differs from how the = operator behaves in WHERE and JOIN conditions. Details can still vary by DBMS, so developers should confirm how their system treats NULL values.
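Both requirements can be illustrated with a short script, assuming SQLite via Python's sqlite3 module. SQLite is lenient about types and simply treats the integer 1 and the text '1' as different values; stricter engines may reject or implicitly convert mismatched types instead, so the second half of this sketch is SQLite-specific.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (name TEXT, department TEXT);
CREATE TABLE contractors (name TEXT, department TEXT);
""")

# A mismatched column count is rejected before the query runs.
try:
    conn.execute(
        "SELECT name, department FROM employees "
        "UNION SELECT name FROM contractors"
    )
    mismatch_error = None
except sqlite3.OperationalError as exc:
    mismatch_error = str(exc)
print(mismatch_error)

# In SQLite, integer 1 and text '1' are distinct values, so both rows survive.
mixed_types = conn.execute("SELECT 1 AS v UNION SELECT '1'").fetchall()

# An explicit CAST aligns the types, and the duplicate is removed.
aligned_types = conn.execute(
    "SELECT CAST(1 AS TEXT) AS v UNION SELECT '1'"
).fetchall()

print(len(mixed_types), len(aligned_types))  # 2 1
```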
Performance impact and duplicate rows
Set operations can significantly impact query performance, especially when dealing with large or complex datasets. Factors like indexing, query optimization, and hardware resources can influence performance. Optimization techniques and performance-tuning strategies are essential to mitigate performance bottlenecks.
By default, set operators remove duplicate rows from the result set. When duplicates must be kept, use UNION ALL; some DBMSs also support INTERSECT ALL and EXCEPT ALL. It's important to understand each operator's behavior regarding duplicate rows and to choose the variant that matches the intended result.
Ordering of results and memory constraints
Set operators do not guarantee the order of the rows in the final output. If the order matters, add an ORDER BY clause at the end of the combined query; it applies to the whole result rather than to the last SELECT alone.
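A minimal sketch of sorting a combined result, assuming SQLite via Python's sqlite3 module and the dummy tables from earlier (name and department columns only):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (name TEXT, department TEXT);
CREATE TABLE contractors (name TEXT, department TEXT);
INSERT INTO employees VALUES
  ('John', 'HR'), ('Carol', 'Engineering'), ('Bob', 'Sales'), ('Alice', 'Marketing');
INSERT INTO contractors VALUES
  ('Eva', 'Sales'), ('David', 'Marketing'), ('Carol', 'Engineering');
""")

# A single ORDER BY at the very end sorts the combined result.
ordered = conn.execute("""
    SELECT name, department FROM employees
    UNION
    SELECT name, department FROM contractors
    ORDER BY name
""").fetchall()

names = [row[0] for row in ordered]
print(names)  # ['Alice', 'Bob', 'Carol', 'David', 'Eva', 'John']
```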
Set operations can consume significant memory and resources, particularly when dealing with large datasets. Memory constraints and resource limitations must be considered to avoid performance degradation or system instability.
Complexity, maintainability, and cross-DBMS compatibility
Complex queries involving multiple set operators, subqueries, and joins can be challenging to understand, maintain, and debug. To improve readability and maintainability, queries must be concise, well-documented, and modular.
Set operators may have variations in syntax and behavior across different database management systems (DBMSs). Knowing these differences is vital when writing SQL queries for cross-platform compatibility.
Conclusion
In summary, set operators are core components of SQL and indispensable for efficient data manipulation.
They encompass operators like UNION, INTERSECT, and EXCEPT and give us versatile data analysis capabilities.
If you want to learn more, check out this course on Joining Data in SQL.
SQL Operators FAQs
What is the difference between UNION ALL and UNION?
UNION ALL includes all rows from both queries, even if there are duplicates. UNION eliminates duplicate rows.
How is UNION different from JOIN in SQL?
UNION combines the results of queries vertically, appending rows from one query to another. JOIN combines tables horizontally, matching rows based on a related column and creating a wider result set.
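The vertical versus horizontal distinction shows up in the shape of the result. The sketch below assumes SQLite via Python's sqlite3 module and the dummy employees and contractors tables; joining on department is just an arbitrary choice for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (name TEXT, department TEXT);
CREATE TABLE contractors (name TEXT, department TEXT);
INSERT INTO employees VALUES
  ('Alice', 'Marketing'), ('Bob', 'Sales'), ('Carol', 'Engineering'), ('John', 'HR');
INSERT INTO contractors VALUES
  ('David', 'Marketing'), ('Eva', 'Sales'), ('Carol', 'Engineering');
""")

def shape(sql):
    """Return (row count, column count) for a query."""
    cur = conn.execute(sql)
    rows = cur.fetchall()
    return len(rows), len(cur.description)

# UNION stacks rows: same two columns, more rows.
union_shape = shape("""
    SELECT name, department FROM employees
    UNION
    SELECT name, department FROM contractors
""")

# JOIN widens rows: one row per department match, columns from both tables.
join_shape = shape("""
    SELECT e.name, e.department, c.name
    FROM employees AS e
    JOIN contractors AS c ON e.department = c.department
""")

print(union_shape)  # (6, 2)
print(join_shape)   # (3, 3)
```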
Are there any performance considerations when using set operations?
Set operations can be computationally expensive, especially when dealing with large datasets. It's important to optimize the individual queries and use indexes where possible to improve performance.