Skip to content

Example of a relational database model for a digital news website

The following blocks of code are intended to exemplify how perfomance reports can be generated through the use of a relational database. This example is of a company that operates a tech news and reviews site, creatively named "MySite". MySite generates revenue via ad placements on its pages and affiliate links.

Description of the data warehouse contents.

ad_rev.csv

  • Contains daily ad earning totals for each of the site's pages.
Field NameDescription
urlthe web page address
monththe date on which the revenue was generated
ad_revenuethe amount of revenue generated

affiliate_rev.csv

  • Contains daily affiliate earning totals for each of the site's pages.
Field NameDescription
urlthe web page address
monththe date on which the revenue was generated
ad_revenuethe amount of revenue generated

master_page_list.csv

  • Contains the records of the site's page adresses, types of articles, the authors, date of publishing and last update date.
Field NameDescription
permalinkthe web page address
authorname of the article writer
date_publishedthe publish date
last_modified the date of the last update
amazon_tag_id an unique id that records MySite as the origin of a purchase made through amazon

salaries.csv

  • Contains a list of employee names, their start and end dates and their monthly pay.
Field NameDescription
departmentthe department within which the employee works
employeethe employee's full name
start_datethe date at which the employee started work
end_date the date at which the worker left his job
net_salary the amount the employee is owed at the end of each work cycle
tax amount of applicable tax
gross_salary_with_tax total amount payable after tax

traffic.csv

  • Records of monthly aggregated data regarding the traffic that the pages of MySite has generated.
Field NameDescription
urladdress of the web page
monththe end day of each month in date format
page_viewsthe number of page views generated in that month
sessions the number of sessions generated in that month
users the number of users generated in that month
Spinner
DataFrameas
clean_master_page_list
variable
--Clean master page list table to make sure there are no leading or trailing whitespaces
--The purpose of this is to make sure that our primary/foreign key fields or any other field that might be used in joins/lookups can find their match in the other tables.

SELECT TRIM(Permalink) AS url,
	TRIM(author) AS author,
	TRIM(amazon_tag_id) AS amazon_tag_id,
	TRIM(page_type) AS page_type
FROM master_page_list.csv
Spinner
DataFrameas
clean_traffic
variable
--Repeat the previous step shown above and also replace any missing values with 0.

SELECT TRIM(url) AS url,
	EXTRACT(month from month) AS month,
	CASE WHEN page_views IS NULL THEN 0 ELSE page_views END AS page_views,
	CASE WHEN sessions IS NULL THEN 0 ELSE sessions END AS sessions,
	CASE WHEN users IS NULL THEN 0 ELSE users::BIGINT END AS users
FROM traffic.csv
WHERE url IS NOT NULL
Spinner
DataFrameas
clean_ad_rev
variable
--Clean string data columns and make sure the ad revenue column is of the numeric data type so that calculations can be performed.

SELECT TRIM(url) AS url,
	EXTRACT(month from month) AS month,
	CAST(ad_revenue AS NUMERIC(10, 2)) AS ad_revenue
FROM ad_rev.csv
Spinner
DataFrameas
clean_affiliate_rev
variable
--Same transformations as in the previous step

SELECT TRIM(category) AS category,
	TRIM(name) AS product_name,
	TRIM(asin) AS ASIN,
	TRIM(tracking_id) AS tracking_id,
	EXTRACT(month from date_shipped) AS month,
	CAST(price AS NUMERIC(10, 2)) AS product_price,
	CAST(items_shipped AS INT) AS items_shipped,
	CAST(returns AS INT) AS items_returned,
	CAST(items_shipped AS NUMERIC(10, 2)) AS revenue,
	CAST(commission AS NUMERIC(10, 2)) AS commission,
	TRIM(device_type_group) AS device_type_group
FROM affiliate_rev.csv
Spinner
DataFrameas
clean_employee_salaries
variable
--Clean the string data columns while making sure that there are no text elements in the numeric columns so that calculations can be performed on those.

SELECT TRIM(department) AS department,
	TRIM(employee) as employee_name,
	CASE WHEN start_date IS NULL THEN CAST('2023-08-01' AS DATE) ELSE CAST(start_date AS DATE) END AS start_date,
	CAST(end_date AS DATE) AS end_date,
	CAST(REPLACE(REPLACE(net_salary, '$', ''),',', '') AS INT) AS net_salary,
	CAST(REPLACE(REPLACE(tax, '$', ''),',', '') AS INT) AS tax,
	CAST(REPLACE(REPLACE(gross_salary_with_tax, '$', ''),',', '') AS INT) AS gross_salary
FROM salaries.csv
WHERE employee_name IS NOT NULL

Time to analyze

Now that the data has been loaded and cleaned we can start analyzing it and see what insights we might extract.

Spinner
DataFrameas
ad_rev_mom_change
variable
-- First we will look to see how our monthly ad earnings have evolved in percentage terms.

WITH ad_rev_cte AS (
SELECT month,
	ROUND(SUM(ad_revenue),0) AS total_ad_revenue,
	CASE WHEN LAG(SUM(ad_revenue)) OVER (ORDER BY month) IS NULL
		THEN ROUND(SUM(ad_revenue),0)
		ELSE LAG(ROUND(SUM(ad_revenue),0)) OVER (ORDER BY month)
		END AS previous_month_ad_rev
FROM clean_ad_rev
GROUP BY month
ORDER BY month
)

SELECT month,
	total_ad_revenue,
	previous_month_ad_rev,
	ROUND(total_ad_revenue / previous_month_ad_rev - 1, 4) * 100 AS mom_pct_change
FROM ad_rev_cte
Spinner
DataFrameas
aff_rev_mom_change
variable
-- Secondly we will look to see how our monthly affiliate earnings have evolved in percentage terms.

WITH aff_rev_cte AS (
SELECT month,
	ROUND(SUM(commission),0) AS total_commission,
	CASE WHEN LAG(SUM(commission)) OVER (ORDER BY month) IS NULL
		THEN ROUND(SUM(commission),0)
		ELSE LAG(ROUND(SUM(commission),0)) OVER (ORDER BY month)
		END AS previous_month_commission
FROM clean_affiliate_rev
GROUP BY month
ORDER BY month
)

SELECT month,
	total_commission,
	previous_month_commission,
	ROUND(total_commission / previous_month_commission - 1, 4) * 100 AS mom_pct_change
FROM aff_rev_cte
Spinner
DataFrameas
total_revenue_mom_change
variable
-- Now we will join the previous 2 tables in order to see how our total revenue has evolved over the past 4 months.

SELECT adr.month,
	adr.total_ad_revenue,
	cmr.total_commission,
	adr.total_ad_revenue + cmr.total_commission as total_revenue,
	CASE WHEN LAG(total_revenue) OVER (ORDER BY adr.month) IS NULL
		THEN ROUND(total_revenue,0)
		ELSE LAG(ROUND(total_revenue,0)) OVER (ORDER BY adr.month)
		END AS previous_month_revenue,
	ROUND((total_revenue - previous_month_revenue) / previous_month_revenue * 100, 2) AS total_rev_mom_pct_change
FROM ad_rev_mom_change AS adr
INNER JOIN aff_rev_mom_change AS cmr
ON adr.month = cmr.month
Spinner
DataFrameas
traffic_mom_pct_change
variable
-- Lastly we will do the same for our traffic data.

WITH monthly_traffic_cte AS (
SELECT month,
	SUM(page_views) AS page_views,
	CASE WHEN LAG(SUM(page_views)) OVER (ORDER BY month) IS NULL
		THEN ROUND(SUM(page_views),0)
		ELSE LAG(ROUND(SUM(page_views),0)) OVER (ORDER BY month)
		END AS previous_month_pv
FROM clean_traffic
GROUP BY month
ORDER BY month
)

SELECT month,
	page_views,
	previous_month_pv,
	ROUND((page_views - previous_month_pv) / previous_month_pv * 100,2) AS traffic_mom_pct_change
FROM monthly_traffic_cte
Spinner
DataFrameas
traffic_and_rev_summary
variable
-- Time to bring it all together

SELECT tfc.month,
	tfc.page_views,
	tfc.traffic_mom_pct_change,
	rev.total_revenue,
	rev.total_rev_mom_pct_change
FROM traffic_mom_pct_change AS tfc
INNER JOIN total_revenue_mom_change AS rev
ON tfc.month = rev.month

Indeed, it seems that in November we have recorded a 10% decrease in total page views on our site, which then resulted in an 8.5% decrease in our revenue for that month.

Now that we know what the headline summary is in terms of revenue and traffic, it is time to drill down further.

  1. First we will try to see if the traffic drop from November affected the whole site or if there were some specific article types that saw a general decrease in total page views.
  2. Then we will roll all of that up to the author level in order to see which employees had performed best/worst.
  3. Lastly we will try to determine which department was the most profitable.
Spinner
DataFrameas
mom_traffic_pct_change_by_pgtype
variable
-- Lastly we will want to see month over month traffic figures based on the article type.

WITH traffic_cte AS(
	SELECT page_type,
		month,
		SUM(page_views) AS page_views,
		CASE WHEN LAG(SUM(page_views)) OVER (PARTITION BY page_type ORDER BY month) IS NULL
		THEN SUM(page_views)
		ELSE LAG(SUM(page_views)) OVER (PARTITION BY page_type ORDER BY month)
		END AS previous_month_page_views,
		SUM(sessions) AS sessions,
		CASE WHEN LAG(SUM(sessions)) OVER (PARTITION BY page_type ORDER BY month) IS NULL
		THEN SUM(sessions)
		ELSE LAG(SUM(sessions)) OVER (PARTITION BY page_type ORDER BY month)
		END AS previous_month_sessions,
		SUM(users) AS users,
		CASE WHEN LAG(SUM(users)) OVER (PARTITION BY page_type ORDER BY month) IS NULL
		THEN SUM(users)
		ELSE LAG(SUM(users)) OVER (PARTITION BY page_type ORDER BY month)
		END AS previous_month_users,
	FROM clean_traffic AS ct
	INNER JOIN clean_master_page_list AS mpl
	ON ct.url = mpl.url
	GROUP BY page_type, month
)

SELECT page_type,
	month,
	page_views,
	previous_month_page_views,
	ROUND(page_views / previous_month_page_views - 1, 4) * 100 AS pv_mom_pct_change,
	sessions,
	previous_month_sessions,
	ROUND(sessions / previous_month_sessions - 1, 4) * 100 AS sess_mom_pct_change,
	users,
	previous_month_users,
	ROUND(users / previous_month_users - 1, 4) * 100 AS usr_mom_pct_change,
FROM traffic_cte
ORDER BY page_type, month