Categorical Variables in SQL
Categorical variables are variables that represent one or a finite number of categories. Working with categorical data is an essential skill as a data analyst or data scientist. This template will teach you how to inspect, create, filter, and aggregate categorical variables.
This tutorial will use Summer Olympics data. You are free to create an integration to your data set or use another existing integration. You can learn more about integrations here.
Note: The databases from different PostgreSQL courses are available in the Course Databases database. You can click the "Browse tables" button in the upper righthand corner of the cell below to view the available schemas and tables. The data used for this workspace is contained in the medals schema. To access each table, you need to specify this schema in your queries (e.g., medals.summer_medals for the summer_medals table).
Inspecting categorical variables
Often, you will want to know the unique values in a categorical column. In the query below, we use the DISTINCT
keyword to display the unique values of the medal
variable.
👇 To run a SQL cell like the one below, click inside the cell to select it and click "Run" or the ► icon. You can also use Shift-Enter to run a selected cell.
SELECT DISTINCT medal
FROM medals.summer_medals
Inspect unique combinations of categorical variables
You can also use multiple columns to inspect the unique combinations of two or more variables. In the query below, we use DISTINCT
to inspect the unique combination of year
and city
to return the date and location of each Summer Olympics.
SELECT DISTINCT year, city
FROM medals.summer_medals
ORDER BY year
Inspect the counts of categorical variables
You can also use the COUNT()
function combined with DISTINCT
to return the number of distinct categories within a column. Here, we inspect the number of unique sports in the sport
column.
SELECT COUNT(DISTINCT sport) AS event_count
FROM medals.summer_medals
Filtering categorical variables
Categorical variables can also be filtered, much like quantitative variables. If you know the specific category you want to search for, you can use the WHERE
clause to specify the category of interest.
In the query below, we return all athletes who have competed in "High Jump".
SELECT athlete, country, event
FROM medals.summer_medals
WHERE event = 'High Jump'
Filtering for multiple values
If you want to filter for multiple values in a categorical column, you can use the IN
keyword combined with your WHERE
clause. Here, we use IN
to filter for all athletes who have competed in both "High Jump" or "Long Jump".
You can learn more about filtering data using IN
operator in the second chapter of Introduction to SQL.
SELECT athlete, country, event
FROM medals.summer_medals
WHERE event IN ('High Jump', 'Long Jump')
Filtering using string matching
If you want to search for a general pattern, you can combine the LIKE
keyword with the WHERE
clause. There are two wildcards you can use when defining a pattern.
- The
%
wildcard will match zero, one, or many characters. - The
_
wildcard will match a single character. In the query below, we enclose "Team" with two%
wildcards to return all rows where theevent
contains "Team" (whether it is at the beginning, middle, or end). You can learn more about pattern matching in this exercise of Introduction to SQL.
Note: This string matching will be case-sensitive. If you want to make your search case insensitive, you can use ILIKE
instead of LIKE
.
SELECT *
FROM medals.summer_medals
WHERE event LIKE '%Team%'
Creating categorical variables
Sometimes, you will want to create categorical variables based on existing columns. You can create a categorical variable using a CASE statement.
A CASE
statement has the following structure:
CASE WHEN <condition> THEN <value> ELSE <alternative_value> END AS <column_name>
- The first
WHEN
specifies a condition. THEN
specifies the value if the condition is met.ELSE
returns an alternate value if the condition is not met.END
finishes the case statement, after which you can useAS
to alias the column.
In the query below, we create a new categorical variable called event_type
based on whether the event
column contains individual or team events. You can learn more about creating categorical variables here.
SELECT
event,
CASE WHEN event LIKE '%Team%' or event LIKE '%Relay%' THEN 'Team'
ELSE 'Individual' END AS event_type
FROM medals.summer_medals
WHERE discipline = 'Swimming'
LIMIT 100