Skip to content

Working with Text in SQL

Text data from your queries often may be messy or in the wrong format. It is therefore crucial to be able to clean up and reformat data. This template runs through the basics of combining, editing, and extracting text in PostgreSQL.

This template connects to a movie database by default. However, you can use any integration you want, so long as you have access to a table with text data. You can learn more about integrations here.

💡  The data we will use is contained in the cinema schema of the Course Databases database. If you wish to edit the queries, you must add the schema name when referencing tables in this database (e.g., cinema.films).

Concatenating text

A standard operation you may want to perform is combining the text from two or more columns.

In the example below, we use CONCAT() to combine the title and release_year columns, enclosing the release year in parentheses.

👇  To run a SQL cell like the one below, click inside the cell to select it and click "Run" or the ► icon. You can also use Shift-Enter to run a selected cell.

Spinner
DataFrameas
df
variable
SELECT 
-- Add title and release year in parentheses
	CONCAT(title, ' (', release_year, ')') AS title_and_year
FROM cinema.films
-- Limit to prevent the query taking too long
LIMIT 5

Change case of text

Another standard string operation you may want to perform is changing the text's case. The three PostgreSQL functions you can use to accomplish this are:

  • UPPER() to convert text to uppercase.
  • LOWER() to convert text to lowercase.
  • INITCAP() to convert text to title case.

In the example below, we convert the title in the films table to three different cases.

💡  Note that INITCAP() will capitalize the first letter of each word and may not be appropriate for text in the form of sentences.

Spinner
DataFrameas
df
variable
SELECT 
	-- Convert title to three different cases
	UPPER(title) AS uppercase_title,
    LOWER(title) AS lowercase_title,
    INITCAP(title) AS titlecase_title
FROM cinema.films
-- Limit to prevent the query taking too long
LIMIT 5

Truncating text

Another helpful text operation is to select a given number of characters from the left or right of a set of text. Depending on the direction you want to extract text from, you can use LEFT() or RIGHT(). You need to specify the text (either a column or a string) and the number of characters you want to extract.

In the example below, we use LEFT() to extract the first 50 characters of the description column in the descriptions table. We then use CONCAT() to add an ellipsis (...) to indicate that the column has been truncated.

Spinner
DataFrameas
df
variable
SELECT
	description,
    -- Extract the first 50 characters of the column
    CONCAT(LEFT(description, 50),
    -- Concatenate '...' to indicate it has been trimmed
           '...') AS short_description
FROM cinema.descriptions
-- Limit to prevent the query taking too long
LIMIT 5

Trimming text

You may notice a space between the text and the ellipsis in the third row of the example above. This is because the LEFT() function is cutting at precisely 50 characters, whether or not there is whitespace left.

You can use the TRIM() function to remove whitespace from the left and right sides of a string. To trim text, you use the form:

TRIM(FROM string_you_want_to_trim)

In the example below, we add the TRIM() function to eliminate whitespace before concatenating the ellipsis.

💡  You can also supply leading or trailing in front of FROM to only remove whitespace from the left and right of a text.

Spinner
DataFrameas
df
variable
SELECT
	description,
    -- Extract and trim the first 50 characters of the column
    CONCAT(TRIM(FROM LEFT(description, 50)),
    -- Concatent '...' to indicate it has been trimmed
           '...') AS short_description
FROM cinema.descriptions
-- Limit to prevent the query taking too long
LIMIT 5

Replacing text

Some text may contain characters or pieces of text that you wish to remove or replace. You can use the REPLACE() function to accomplish this. The function takes three arguments:

  • The text that you want to use.
  • The character(s) you want to remove.
  • The character(s) you want to insert. In the example below, we replace spaces in the name column of the people table with underscores and convert the entire text to lowercase.
Spinner
DataFrameas
df
variable
SELECT 
	name,
    -- Replace spaces in the string with an underscore
	-- and convert to lowercase
	LOWER(REPLACE(name, ' ', '_')) AS revised_name
FROM cinema.people
-- Limit to prevent the query taking too long
LIMIT 5

Extracting substrings

You can use the SUBSTRING() function to extract text elements. It follows the following structure:

SUBSTRING(text_to_extract FROM position_to_start FOR number_of_characters)

In the first example below, we use SUBSTRING() to extract the first 10 characters of the title column in the films table.

💡  You will notice this performs in much the same way as the LEFT() function. However, as will be shown below, SUBSTRING() can be used to perform more powerful string extraction.

Spinner
DataFrameas
df
variable
SELECT
	title,
    -- Extract the first 10 characters of the column 
    -- starting from position 0
	SUBSTRING(title FROM 0 FOR 10) AS reduced_title
FROM cinema.films
-- Limit to prevent the query taking too long
LIMIT 5

In the more advanced example below, we use SUBSTRING() in combination with a new function POSITION(). This function returns the location of a string within another string (such as the position of a space " " character).

Using POSITION() allows us to split the name before and after the space to extract the first and last names of actors in the people table.

Spinner
DataFrameas
df
variable
SELECT
	name,
    -- Extract a string starting from position 0 up until the space
	SUBSTRING(name FROM 0 FOR POSITION(' ' IN name)) AS first_name,
    -- Extract a string starting from the position of the space plus one
    -- up until the end of the string
    SUBSTRING(name FROM POSITION(' ' IN name)+1) AS last_name
FROM cinema.people
-- Limit to prevent the query taking too long
LIMIT 5