Skip to content

Categorical Variables in SQL

Categorical variables are variables that represent one or a finite number of categories. Working with categorical data is an essential skill as a data analyst or data scientist. This template will teach you how to inspect, create, filter, and aggregate categorical variables.

This tutorial will use Summer Olympics data. You are free to create an integration to your data set or use another existing integration. You can learn more about integrations here.

Note: The databases from different PostgreSQL courses are available in the Course Databases database. You can click the "Browse tables" button in the upper righthand corner of the cell below to view the available schemas and tables. The data used for this workspace is contained in the medals schema. To access each table, you need to specify this schema in your queries (e.g., medals.summer_medals for the summer_medals table).

Inspecting categorical variables

Often, you will want to know the unique values in a categorical column. In the query below, we use the DISTINCT keyword to display the unique values of the medal variable.

👇  To run a SQL cell like the one below, click inside the cell to select it and click "Run" or the ► icon. You can also use Shift-Enter to run a selected cell.

Spinner
DataFrameas
df
variable
SELECT DISTINCT medal 
FROM medals.summer_medals

Inspect unique combinations of categorical variables

You can also use multiple columns to inspect the unique combinations of two or more variables. In the query below, we use DISTINCT to inspect the unique combination of year and city to return the date and location of each Summer Olympics.

Spinner
DataFrameas
df
variable
SELECT DISTINCT year, city
FROM medals.summer_medals
ORDER BY year

Inspect the counts of categorical variables

You can also use the COUNT() function combined with DISTINCT to return the number of distinct categories within a column. Here, we inspect the number of unique sports in the sport column.

Spinner
DataFrameas
df
variable
SELECT COUNT(DISTINCT sport) AS event_count
FROM medals.summer_medals

Filtering categorical variables

Categorical variables can also be filtered, much like quantitative variables. If you know the specific category you want to search for, you can use the WHERE clause to specify the category of interest.

In the query below, we return all athletes who have competed in "High Jump".

Spinner
DataFrameas
df
variable
SELECT athlete, country, event
FROM medals.summer_medals
WHERE event = 'High Jump'

Filtering for multiple values

If you want to filter for multiple values in a categorical column, you can use the IN keyword combined with your WHERE clause. Here, we use IN to filter for all athletes who have competed in both "High Jump" or "Long Jump".

You can learn more about filtering data using IN operator in the second chapter of Introduction to SQL.

Spinner
DataFrameas
df
variable
SELECT athlete, country, event
FROM medals.summer_medals
WHERE event IN ('High Jump', 'Long Jump')

Filtering using string matching

If you want to search for a general pattern, you can combine the LIKE keyword with the WHERE clause. There are two wildcards you can use when defining a pattern.

  • The % wildcard will match zero, one, or many characters.
  • The _ wildcard will match a single character. In the query below, we enclose "Team" with two % wildcards to return all rows where the event contains "Team" (whether it is at the beginning, middle, or end). You can learn more about pattern matching in this exercise of Introduction to SQL.

Note: This string matching will be case-sensitive. If you want to make your search case insensitive, you can use ILIKE instead of LIKE.

Spinner
DataFrameas
df
variable
SELECT *
FROM medals.summer_medals
WHERE event LIKE '%Team%'

Creating categorical variables

Sometimes, you will want to create categorical variables based on existing columns. You can create a categorical variable using a CASE statement.

A CASE statement has the following structure:

CASE WHEN <condition> THEN <value> ELSE <alternative_value> END AS <column_name>
  • The first WHEN specifies a condition.
  • THEN specifies the value if the condition is met.
  • ELSE returns an alternate value if the condition is not met.
  • END finishes the case statement, after which you can use AS to alias the column.

In the query below, we create a new categorical variable called event_type based on whether the event column contains individual or team events. You can learn more about creating categorical variables here.

Spinner
DataFrameas
df
variable
SELECT 
  event,
  CASE WHEN event LIKE '%Team%' or event LIKE '%Relay%' THEN 'Team' 
       ELSE 'Individual' END AS event_type 
FROM medals.summer_medals 
WHERE discipline = 'Swimming' 
LIMIT 100