Skip to content

Cleaning Data in SQL

Data from your queries may often contain missing values, duplicate rows, or data in the wrong format. It is crucial to be able to clean the data you retrieve through your queries.

Missing values

Missing or null values are common in data and can often present a problem for future analyses.

Spinner
DataFrameas
df
variable
SELECT COUNT(*) AS number_missing_unemployment_rates
FROM world.economies
WHERE unemployment_rate IS NULL

Filling missing value

Using COALESCE() to replace NULL values with the average unemployment rate.

Spinner
DataFrameas
df
variable
SELECT
	code,
    unemployment_rate,
	COALESCE(unemployment_rate, 
             (SELECT AVG(unemployment_rate) FROM world.economies)) AS filled_unemployment_rate
FROM world.economies

Duplicate rows

Another data issue you may come across is duplicate rows. To identify them, you can use ROW_NUMBER() to assign numbers to rows based on identical combinations. By choosing the PARTITION of the window function, you can specify over which columns you want to look for duplicates.

In the example below, we use PARTITION BY to assign row numbers based on the combination of country code and unemployment rate. As you can see from the query results, duplicate rows have a value of 2 or greater.

Spinner
DataFrameas
df
variable
SELECT *
FROM (
    SELECT 
        code, 
        unemployment_rate,
        ROW_NUMBER() OVER(PARTITION BY code, unemployment_rate) AS row_number
    FROM world.economies
) AS sub

Discarding duplicate rows

Removing duplicate rows is just as simple as identifying them. To do so, you simply need to change your filter to select row_numbers with a value of 1.

Spinner
DataFrameas
df
variable
SELECT *
FROM (
    SELECT 
        code, 
        unemployment_rate,
        ROW_NUMBER() OVER(PARTITION BY code, unemployment_rate) AS row_number
    FROM world.economies
) AS sub
WHERE row_number = 1

Invalid data

If you are aware of invalid data, you can find (and remove) it using pattern-matching.

In the example below, we search for rows where the indep_year contains a negative value. To do so, we convert the column to text using ::TEXT, and then use LIKE and our pattern. The pattern we use searches for a minus sign (-), followed by any other characters (using the wildcard %).

Spinner
DataFrameas
df
variable
SELECT indep_year
FROM world.countries
WHERE indep_year::TEXT LIKE '-%'

You can also use pattern matching to find rows with similar variants. In the example below, we use a pattern to identify all rows with Monarchy in the gov_form column.

Our pattern searches for any row with "Monarchy". We use the % wildcard characters to allow for words/whitespace on either side of the word we are searching for.

Spinner
DataFrameas
df
variable
SELECT DISTINCT name, gov_form
FROM world.countries
WHERE gov_form LIKE '%Monarchy%'

Fixing invalid data

There are a variety of ways you can fix invalid data. One way is to use a CASE statement to recategorize the data. In the example below, we convert all gov_form rows that contain "Monarchy" to "Monarchy". The remaining entries are left as they are.

Spinner
DataFrameas
df
variable
SELECT DISTINCT 
	name, 
    gov_form,
    CASE WHEN gov_form LIKE '%Monarchy%' THEN 'Monarchy' 
    ELSE gov_form END AS fixed_gov_form
FROM world.countries
WHERE gov_form LIKE '%Monarchy%'