Skip to content
Relational Model and Web Analytics Using PostgreSQL
Example of a relational database model for a digital news website
The following blocks of code are intended to exemplify how perfomance reports can be generated through the use of a relational database. This example is of a company that operates a tech news and reviews site, creatively named "MySite". MySite generates revenue via ad placements on its pages and affiliate links.
Description of the data warehouse contents.
ad_rev.csv
- Contains daily ad earning totals for each of the site's pages.
Field Name | Description |
---|---|
url | the web page address |
month | the date on which the revenue was generated |
ad_revenue | the amount of revenue generated |
affiliate_rev.csv
- Contains daily affiliate earning totals for each of the site's pages.
Field Name | Description |
---|---|
url | the web page address |
month | the date on which the revenue was generated |
ad_revenue | the amount of revenue generated |
master_page_list.csv
- Contains the records of the site's page adresses, types of articles, the authors, date of publishing and last update date.
Field Name | Description |
---|---|
permalink | the web page address |
author | name of the article writer |
date_published | the publish date |
last_modified | the date of the last update |
amazon_tag_id | an unique id that records MySite as the origin of a purchase made through amazon |
salaries.csv
- Contains a list of employee names, their start and end dates and their monthly pay.
Field Name | Description |
---|---|
department | the department within which the employee works |
employee | the employee's full name |
start_date | the date at which the employee started work |
end_date | the date at which the worker left his job |
net_salary | the amount the employee is owed at the end of each work cycle |
tax | amount of applicable tax |
gross_salary_with_tax | total amount payable after tax |
traffic.csv
- Records of monthly aggregated data regarding the traffic that the pages of MySite has generated.
Field Name | Description |
---|---|
url | address of the web page |
month | the end day of each month in date format |
page_views | the number of page views generated in that month |
sessions | the number of sessions generated in that month |
users | the number of users generated in that month |
DataFrameas
clean_master_page_list
variable
--Clean master page list table to make sure there are no leading or trailing whitespaces
--The purpose of this is to make sure that our primary/foreign key fields or any other field that might be used in joins/lookups can find their match in the other tables.
SELECT TRIM(Permalink) AS url,
TRIM(author) AS author,
TRIM(amazon_tag_id) AS amazon_tag_id,
TRIM(page_type) AS page_type
FROM master_page_list.csv
DataFrameas
clean_traffic
variable
--Repeat the previous step shown above and also replace any missing values with 0.
SELECT TRIM(url) AS url,
EXTRACT(month from month) AS month,
CASE WHEN page_views IS NULL THEN 0 ELSE page_views END AS page_views,
CASE WHEN sessions IS NULL THEN 0 ELSE sessions END AS sessions,
CASE WHEN users IS NULL THEN 0 ELSE users::BIGINT END AS users
FROM traffic.csv
WHERE url IS NOT NULL
DataFrameas
clean_ad_rev
variable
--Clean string data columns and make sure the ad revenue column is of the numeric data type so that calculations can be performed.
SELECT TRIM(url) AS url,
EXTRACT(month from month) AS month,
CAST(ad_revenue AS NUMERIC(10, 2)) AS ad_revenue
FROM ad_rev.csv
DataFrameas
clean_affiliate_rev
variable
--Same transformations as in the previous step
SELECT TRIM(category) AS category,
TRIM(name) AS product_name,
TRIM(asin) AS ASIN,
TRIM(tracking_id) AS tracking_id,
EXTRACT(month from date_shipped) AS month,
CAST(price AS NUMERIC(10, 2)) AS product_price,
CAST(items_shipped AS INT) AS items_shipped,
CAST(returns AS INT) AS items_returned,
CAST(items_shipped AS NUMERIC(10, 2)) AS revenue,
CAST(commission AS NUMERIC(10, 2)) AS commission,
TRIM(device_type_group) AS device_type_group
FROM affiliate_rev.csv
DataFrameas
clean_employee_salaries
variable
--Clean the string data columns while making sure that there are no text elements in the numeric columns so that calculations can be performed on those.
SELECT TRIM(department) AS department,
TRIM(employee) as employee_name,
CASE WHEN start_date IS NULL THEN CAST('2023-08-01' AS DATE) ELSE CAST(start_date AS DATE) END AS start_date,
CAST(end_date AS DATE) AS end_date,
CAST(REPLACE(REPLACE(net_salary, '$', ''),',', '') AS INT) AS net_salary,
CAST(REPLACE(REPLACE(tax, '$', ''),',', '') AS INT) AS tax,
CAST(REPLACE(REPLACE(gross_salary_with_tax, '$', ''),',', '') AS INT) AS gross_salary
FROM salaries.csv
WHERE employee_name IS NOT NULL
Time to analyze
Now that the data has been loaded and cleaned we can start analyzing it and see what insights we might extract.
DataFrameas
ad_rev_mom_change
variable
-- First we will look to see how our monthly ad earnings have evolved in percentage terms.
WITH ad_rev_cte AS (
SELECT month,
ROUND(SUM(ad_revenue),0) AS total_ad_revenue,
CASE WHEN LAG(SUM(ad_revenue)) OVER (ORDER BY month) IS NULL
THEN ROUND(SUM(ad_revenue),0)
ELSE LAG(ROUND(SUM(ad_revenue),0)) OVER (ORDER BY month)
END AS previous_month_ad_rev
FROM clean_ad_rev
GROUP BY month
ORDER BY month
)
SELECT month,
total_ad_revenue,
previous_month_ad_rev,
ROUND(total_ad_revenue / previous_month_ad_rev - 1, 4) * 100 AS mom_pct_change
FROM ad_rev_cte
DataFrameas
aff_rev_mom_change
variable
-- Secondly we will look to see how our monthly affiliate earnings have evolved in percentage terms.
WITH aff_rev_cte AS (
SELECT month,
ROUND(SUM(commission),0) AS total_commission,
CASE WHEN LAG(SUM(commission)) OVER (ORDER BY month) IS NULL
THEN ROUND(SUM(commission),0)
ELSE LAG(ROUND(SUM(commission),0)) OVER (ORDER BY month)
END AS previous_month_commission
FROM clean_affiliate_rev
GROUP BY month
ORDER BY month
)
SELECT month,
total_commission,
previous_month_commission,
ROUND(total_commission / previous_month_commission - 1, 4) * 100 AS mom_pct_change
FROM aff_rev_cte
DataFrameas
total_revenue_mom_change
variable
-- Now we will join the previous 2 tables in order to see how our total revenue has evolved over the past 4 months.
SELECT adr.month,
adr.total_ad_revenue,
cmr.total_commission,
adr.total_ad_revenue + cmr.total_commission as total_revenue,
CASE WHEN LAG(total_revenue) OVER (ORDER BY adr.month) IS NULL
THEN ROUND(total_revenue,0)
ELSE LAG(ROUND(total_revenue,0)) OVER (ORDER BY adr.month)
END AS previous_month_revenue,
ROUND((total_revenue - previous_month_revenue) / previous_month_revenue * 100, 2) AS total_rev_mom_pct_change
FROM ad_rev_mom_change AS adr
INNER JOIN aff_rev_mom_change AS cmr
ON adr.month = cmr.month
DataFrameas
traffic_mom_pct_change
variable
-- Lastly we will do the same for our traffic data.
WITH monthly_traffic_cte AS (
SELECT month,
SUM(page_views) AS page_views,
CASE WHEN LAG(SUM(page_views)) OVER (ORDER BY month) IS NULL
THEN ROUND(SUM(page_views),0)
ELSE LAG(ROUND(SUM(page_views),0)) OVER (ORDER BY month)
END AS previous_month_pv
FROM clean_traffic
GROUP BY month
ORDER BY month
)
SELECT month,
page_views,
previous_month_pv,
ROUND((page_views - previous_month_pv) / previous_month_pv * 100,2) AS traffic_mom_pct_change
FROM monthly_traffic_cte
DataFrameas
traffic_and_rev_summary
variable
-- Time to bring it all together
SELECT tfc.month,
tfc.page_views,
tfc.traffic_mom_pct_change,
rev.total_revenue,
rev.total_rev_mom_pct_change
FROM traffic_mom_pct_change AS tfc
INNER JOIN total_revenue_mom_change AS rev
ON tfc.month = rev.month
Indeed, it seems that in November we have recorded a 10% decrease in total page views on our site, which then resulted in an 8.5% decrease in our revenue for that month.
Now that we know what the headline summary is in terms of revenue and traffic, it is time to drill down further.
- First we will try to see if the traffic drop from November affected the whole site or if there were some specific article types that saw a general decrease in total page views.
- Then we will roll all of that up to the author level in order to see which employees had performed best/worst.
- Lastly we will try to determine which department was the most profitable.
DataFrameas
mom_traffic_pct_change_by_pgtype
variable
-- Lastly we will want to see month over month traffic figures based on the article type.
WITH traffic_cte AS(
SELECT page_type,
month,
SUM(page_views) AS page_views,
CASE WHEN LAG(SUM(page_views)) OVER (PARTITION BY page_type ORDER BY month) IS NULL
THEN SUM(page_views)
ELSE LAG(SUM(page_views)) OVER (PARTITION BY page_type ORDER BY month)
END AS previous_month_page_views,
SUM(sessions) AS sessions,
CASE WHEN LAG(SUM(sessions)) OVER (PARTITION BY page_type ORDER BY month) IS NULL
THEN SUM(sessions)
ELSE LAG(SUM(sessions)) OVER (PARTITION BY page_type ORDER BY month)
END AS previous_month_sessions,
SUM(users) AS users,
CASE WHEN LAG(SUM(users)) OVER (PARTITION BY page_type ORDER BY month) IS NULL
THEN SUM(users)
ELSE LAG(SUM(users)) OVER (PARTITION BY page_type ORDER BY month)
END AS previous_month_users,
FROM clean_traffic AS ct
INNER JOIN clean_master_page_list AS mpl
ON ct.url = mpl.url
GROUP BY page_type, month
)
SELECT page_type,
month,
page_views,
previous_month_page_views,
ROUND(page_views / previous_month_page_views - 1, 4) * 100 AS pv_mom_pct_change,
sessions,
previous_month_sessions,
ROUND(sessions / previous_month_sessions - 1, 4) * 100 AS sess_mom_pct_change,
users,
previous_month_users,
ROUND(users / previous_month_users - 1, 4) * 100 AS usr_mom_pct_change,
FROM traffic_cte
ORDER BY page_type, month