Functions for Manipulating Data in PostgreSQL

Here you can access the tables used in the course. To access the table, you will need to specify the dvdrentals schema in your queries (e.g., dvdrentals.film for the film table and dvdrentals.country for the country table).

-- Add your own queries here
SELECT *
FROM dvdrentals.film
LIMIT 10

Explore Datasets

Use the different tables to explore the data and practice your skills!

  • Select the title, release_year, and rating of films in the film table.
    • Add a description_shortened column which contains the first 50 characters of the description column, ending with "...".
    • Filter the film table for rows where the special_features column contains "Commentaries".
  • Select the customer_id, amount, and payment_date from the payment table.
    • Extract date information from the payment_date column, creating new columns for the day, month, quarter, and year of transaction.
    • Use the rental table to include a column containing the number of days rented (i.e., time between the rental_date and the return_date).
  • Update the title column so that titles with multiple words are reduced to the first word and the first letter of the second word followed by a period.
    • For example:
      • "BEACH HEARTBREAKERS" becomes "BEACH H."
      • "BEAST HUNCHBACK" becomes "BEAST H."
    • Reformat your shortened title to title case (e.g., "BEACH H." becomes "Beach H.").

Exploring full-text search functions

LIKE wildcard usage for searching

The "_" wildcard matches exactly one character, and the "%" wildcard matches any sequence of zero or more characters.

When you want to make a search that isn't case sensitive, you may need to convert the data to the tsvector and tsquery data types.
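A minimal sketch of the "_" and "%" wildcards against the film table used above (the search patterns are made up for illustration; ILIKE is PostgreSQL's built-in case-insensitive variant of LIKE, an alternative to the tsvector approach):

```sql
-- '_' matches exactly one character: titles whose second letter is 'R'
SELECT title
FROM film
WHERE title LIKE '_R%';

-- '%' matches any sequence of zero or more characters
SELECT title
FROM film
WHERE title LIKE '%JOURNEY%';

-- ILIKE performs a case-insensitive match
SELECT title
FROM film
WHERE title ILIKE '%journey%';
```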

For example, the query below matches every row containing "elf", "ELF", or "Elf":

-- Select the title and description
SELECT title, description
FROM film
-- Convert the title to a tsvector and match it against the tsquery
WHERE to_tsvector(title) @@ to_tsquery('elf');

In contrast, the query below only matches the uppercase 'GOLD':

SELECT *
FROM film
-- Select only records that contain the word 'GOLD'
WHERE title LIKE '%GOLD%';

Extending PostgreSQL

Just as in Python, we can create our own types and functions to enhance our filtering and querying capabilities.

CREATE TYPE dayofweek AS ENUM ('Monday', 'Tuesday', ....) - this creates an enumerated (ENUM) data type. Once the type is created, we can get information about it with:

SELECT typname, typcategory
FROM pg_type
WHERE typname = 'dayofweek';

You can also use the information schema:

SELECT column_name, data_type, udt_name
FROM INFORMATION_SCHEMA.COLUMNS
WHERE table_name = 'film';


-- Select the column name, data type and udt name columns
SELECT column_name, data_type, udt_name
FROM INFORMATION_SCHEMA.COLUMNS
-- Filter by the rating column in the film table
WHERE table_name = 'film'
  AND column_name = 'rating';

  • Creating summaries and changing data types in queries

In SQL we have different data types, and within those types there are subcategories that can significantly change how operations behave. For instance, if we divide the integers 10/3 we get an integer result, 3. But if we change one of those numbers to a numeric, 10/3.0, the result is 3.333. So changing and manipulating data types is important for data manipulation.

To change a data type only within our query, we can use the CAST function or simply append "::" and the new data type in the SELECT:

SELECT column_x, CAST(column_y AS integer)

SELECT column_y::integer
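A minimal sketch of the integer-division example above, using both cast syntaxes:

```sql
-- Integer division truncates: returns 3
SELECT 10 / 3;

-- Casting one operand to numeric keeps the fraction: returns 3.333...
SELECT CAST(10 AS numeric) / 3;

-- The '::' shorthand does the same thing ('::' binds tighter than '/')
SELECT 10 / 3::numeric;
```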

Summarizing and analyzing

There are the basic summary functions: sum(), min(), max(), stddev(), avg()...

The median is a little different from those summary statistics. It can be discrete or continuous: percentile_disc or percentile_cont. The discrete version returns the nearest actual value when there is no exact middle value, while the continuous version averages (interpolates between) the two middle values. To get the median:

SELECT percentile_disc(0.5) WITHIN GROUP (ORDER BY column_name)
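A sketch contrasting the two median variants side by side, assuming a hypothetical prices table with a numeric price column:

```sql
-- percentile_disc returns an actual value from the column;
-- percentile_cont interpolates between the two middle values
SELECT percentile_disc(0.5) WITHIN GROUP (ORDER BY price) AS median_disc,
       percentile_cont(0.5) WITHIN GROUP (ORDER BY price) AS median_cont
FROM prices;
```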

We can also explore distributions by creating bins for histogram-like analysis of our data. This can be done via the trunc() function or generate_series(). Both examples are shown below.

-- Truncate employees
SELECT trunc(employees, -5) AS employee_bin,
       -- Count number of companies with each truncated value
       count(*)
FROM fortune500
-- Use alias to group
GROUP BY employee_bin
-- Use alias to order
ORDER BY employee_bin;


-- Bins created in Step 2
WITH bins AS (
      SELECT generate_series(2200, 3050, 50) AS lower,
             generate_series(2250, 3100, 50) AS upper),
     -- Subset stackoverflow to just tag dropbox (Step 1)
     dropbox AS (
      SELECT question_count
      FROM stackoverflow
      WHERE tag = 'dropbox')
-- Count question_count values in each bin
SELECT lower, upper, count(question_count)
FROM bins
-- Join to dropbox (created above),
-- keeping all rows from the bins table in the join
LEFT JOIN dropbox
  -- Compare question_count to lower and upper
  ON question_count >= lower
 AND question_count < upper
-- Group by lower and upper to count values in each bin
GROUP BY lower, upper
-- Order by lower to put bins in order
ORDER BY lower;

Strategies for multiple transformations

Let's pretend we have a column from which we want to extract a substring, but some inconsistency was introduced during the data collection phase. In these cases we have two approaches: write a CASE statement to clean the data, or create a temporary table with the cleaned values and join it later, using the edited column as the label. Examples follow below.

- CASE example

SELECT CASE WHEN zipcount < ??? THEN 'other'
       ELSE zip
       END AS zip_recoded,
       sum(zipcount) AS zipsum
FROM (SELECT zip, count(*) AS zipcount
      FROM evanston311
      GROUP BY zip) AS fullcounts
GROUP BY zip_recoded
ORDER BY zipsum DESC;

- Temporary table example

-- Code from previous step
DROP TABLE IF EXISTS recode;
CREATE TEMP TABLE recode AS
  SELECT DISTINCT category,
         rtrim(split_part(category, '-', 1)) AS standardized
  FROM evanston311;
UPDATE recode SET standardized = 'Trash Cart'
 WHERE standardized LIKE 'Trash%Cart';
UPDATE recode SET standardized = 'Snow Removal'
 WHERE standardized LIKE 'Snow%Removal%';
UPDATE recode SET standardized = 'UNUSED'
 WHERE standardized IN ('THIS REQUEST IS INACTIVE...Trash Cart',
                        '(DO NOT USE) Water Bill',
                        'DO NOT USE Trash',
                        'NO LONGER IN USE');

-- Select the recoded categories and the count of each
SELECT standardized, count(*)
-- From the original table and the table with recoded values
FROM evanston311
LEFT JOIN recode
  -- Join on the column they have in common
  ON evanston311.category = recode.category
-- Group by the recoded value to count
GROUP BY standardized
-- Display the most common values first
ORDER BY count DESC;

Date and time functions

These data types support comparison operators and addition and subtraction just like the numeric data types, with some specifics.

Date types are automatically converted to timestamp, and a timestamp requires the amount in a sum or subtraction operation to be converted to interval, as in '2019-01-21'::timestamp + '1 year 2 days'::interval.

If the data type is date and you add or subtract an integer, it is automatically interpreted as a number of days. You can switch to months or years by converting the amount to an interval as well.
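A sketch of these rules (the literal dates are chosen arbitrarily):

```sql
-- date + integer is interpreted as days: returns 2019-01-24
SELECT '2019-01-21'::date + 3;

-- To add months or years, cast the amount to interval instead:
-- returns the timestamp 2020-01-23 00:00:00
SELECT '2019-01-21'::date + '1 year 2 days'::interval;

-- Subtracting two timestamps yields an interval
SELECT now() - '2019-01-21'::timestamp;
```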

To make summaries and analyses of date-typed data, we can use functions that extract fields from it. The fields are: century, decade, year, month, day, hour, minute and second.

These two functions have the same outcome, differing only in syntax:

  • date_part('field', timestamp), e.g. date_part('month', now())

  • EXTRACT(FIELD FROM timestamp), e.g. EXTRACT(MONTH FROM now())

    Example:

    -- Extract the month from date_created and count requests
    SELECT EXTRACT('month' FROM date_created) AS month,
           count(*)
    FROM evanston311
    -- Limit the date range
    WHERE date_created >= '2016-01-01'
      AND date_created < '2018-01-01'
    -- Group by month to get monthly counts
    GROUP BY month;

That is one example of how these functions can be used to summarize data.

You can also use the date_trunc() function for a similar summarization:

  • date_trunc('field', timestamp)

    Example: the query below illustrates the difference between date_trunc and EXTRACT or date_part. The latter two extract only the chosen field, while date_trunc keeps the chosen field along with all larger ones. So the FROM subquery truncates date_created to days, and the outer query can still truncate those days to months, because that field is larger than day.

    -- Aggregate daily counts by month
    SELECT date_trunc('month', day) AS month,
           avg(count)
    -- Subquery to compute daily counts
    FROM (SELECT date_trunc('day', date_created) AS day,
                 count(*) AS count
          FROM evanston311
          GROUP BY day) AS daily_count
    GROUP BY month
    ORDER BY month;

The third way to summarize is by using generate_series():

- generate_series(from, to, interval), e.g. generate_series('2019-01-23', '2019-03-01', '2 days'::interval)

Example: using this method, intervals with 0 observations still appear in the summary, whereas with the trunc or EXTRACT methods those intervals wouldn't appear, giving a misleading first impression of the data. This alternative method also has the benefit of letting you create bins with customized upper and lower bounds. The only downside of the generate_series method is that it requires more work with subqueries and WITH statements that are later joined.
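A sketch of that pattern on the evanston311 table used above (the date range is chosen arbitrarily): generate one row per day, then left join the daily counts so days with no requests show up as 0 instead of disappearing.

```sql
WITH days AS (
       SELECT generate_series('2016-01-01'::timestamp,
                              '2016-06-30'::timestamp,
                              '1 day'::interval) AS day),
     daily_count AS (
       SELECT date_trunc('day', date_created) AS day,
              count(*) AS count
       FROM evanston311
       GROUP BY day)
-- Keep every generated day; coalesce missing counts to 0
SELECT days.day, coalesce(daily_count.count, 0) AS count
FROM days
LEFT JOIN daily_count
  ON days.day = daily_count.day
ORDER BY days.day;
```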