Skip to content
Intermediate SQL Server - Course Notes by Chinmay Garg

Intermediate SQL Server

1. Summarizing Data

Data Analysis with Aggregations

  1. Data Scientists spend a lot of time analyzing and aggregating data by generating summary statistics such as the mean, minimum and maximum values of columns. If your data exists in a DB, the fastest way to generate summary statistics is by using SQL.
  2. Common summary stats - calculate minimum, maximum, average by using MIN(), MAX(), AVG()
  3. Aggregating Data - 3-4 in https://app.datacamp.com/workspace/w/aab12b07-150e-4884-a28c-1fce47c8b235/edit
  4. Reminder: SQL Server is case insensitive
Spinner
DataFrameavailable as
df1
variable
-- Query 1.3. Calculate Summary stats of bicycle orders where average sales of a bicycle > 500
SELECT p.product_name,
       AVG(oi.list_price * quantity * Discount) AS Average, 
       MIN(oi.list_price * quantity * Discount) AS Minimum, 
       MAX(oi.list_price * quantity * Discount) AS Maximum
FROM sales.order_items oi
INNER JOIN production.products p ON oi.product_id = p.product_id
GROUP BY product_name
-- Return records where average of Sales is greater than 500
HAVING AVG(oi.list_price * quantity) > 500

  1. Missing data are a common occurrence. Missing data is represented as NULL. NULL is not a value, you cannot use the = sign or any other operators to find or compare NULLs, instead you must use IS NULL and IS NOT NULL.
  2. Blank value is not NULL: blanks are formatted as two single quotes with no space and are specific to text fields. An empty row appears as a blank in queries but is not NULL. To identify blank values, use the LEN function (LEN(col1) > 0)
  3. ISNULL(missing_data_col, Substituted value/ col) - Substitute missing data with a specific value/ column
  4. COALESCE(missing_data_col, value_2/ col_2, ..value_n/ col_n) -Ppass multiple values it will always return the first non-NULL value. Similar to using a series of if-else statements to return the first non-missing value. https://www.sqlshack.com/using-the-sql-coalesce-function-in-sql-server/
Spinner
DataFrameavailable as
df2
variable
-- Query 1.7 - ISNULL example. Contact customers with missing phone numbers
SELECT customer_id, ISNULL(phone, 'Contact ' + first_name + ' ' + last_name) phone
FROM sales.customers 
WHERE phone IS NULL

Binning data with CASE

  1. Evaluate records like if/ if-else statements.
  2. Keywords - CASE WHEN THEN ELSE (Optional) END
  3. In absence of ELSE for records where any condition is not met in WHEN, the column will output as NULL.
  4. Add multipe logical conditions to WHEN using AND/ BETWEEN.
  5. Can also be Nested LIKE if-else statements. (https://www.tsql.info/sql-server/nested-case-statements.php)
  6. The entire CASE WHEN block can be used in WHERE and GROUP BY (https://www.sqlshack.com/case-statement-in-sql/)
  7. Good way to arrange data into smaller groups - common technique for Data Science.
Spinner
DataFrameavailable as
df3
variable
-- Query 1.10 - Bucketing orders under "On time, Early, Delayed" using CASE
SELECT o.order_id, order_date, required_date, shipped_date, 
	DATEDIFF(DAY, order_date, required_date) expected_delivery_days,
	DATEDIFF(DAY, order_date, shipped_date) delivery_days,
	DATEDIFF(DAY, required_date, shipped_date) delay_days, 
CASE WHEN DATEDIFF(DAY, required_date, shipped_date) = 0 THEN 'On time'
	WHEN DATEDIFF(DAY, required_date, shipped_date) > 0 THEN 'Delay'
	ELSE 'Early' END AS delay, 
/* Query 1.12 v Adding multiple conditions to check whether required date is a weekend. (Irrelevant in terms of E-commerce. Just for reference) */
CASE WHEN DATEPART(Week, o.required_date) BETWEEN 6 AND 7 THEN 'Weekend' 
	ELSE 'Weekday' END required_date_weekend_check
FROM sales.orders o

2. Math Functions

  1. COUNT(DISTINCT column) - Count unique no. of values in a column. Same as COUNT() if all values in a column are unique
Spinner
DataFrameavailable as
df4
variable
WITH order_quantity AS
	SELECT order_id, SUM(quantity) quantity
	FROM sales.order_items
	GROUP BY order_id
)
SELECT o.order_id, quantity, (SELECT AVG(quantity) FROM order_quantity) avg_order_quantity,
	order_date, required_date, shipped_date, 
	DATEDIFF(DAY, order_date, required_date) expected_delivery_days,
	DATEDIFF(DAY, order_date, shipped_date) delivery_days,
	DATEDIFF(DAY, required_date, shipped_date) delay_days, 
CASE WHEN DATEDIFF(DAY, required_date, shipped_date) = 0 THEN 'On time'
	WHEN DATEDIFF(DAY, required_date, shipped_date) > 0 THEN 
		CASE WHEN quantity > (SELECT AVG(quantity) FROM order_quantity) THEN 'Delay, High Volume' 
			WHEN quantity < (SELECT AVG(quantity) FROM order_quantity) THEN 'Delay, Low Volume'
			ELSE 'Delay' END
	ELSE 'Early' END AS delay
FROM sales.orders o
LEFT JOIN order_quantity  ot ON ot.order_id = o.order_id

3. Processing Data in SQL Server

  1. WHILE loops
Spinner
DataFrameavailable as
df
variable
-- Define the target factorial number
DECLARE @target float = 5
-- Initialization of the factorial result
DECLARE @factorial float = 1

WHILE @target > 0 
BEGIN
	-- Calculate the factorial number
	SET @factorial = @factorial * @target
	-- Reduce the termination condition  
	SET @target = @target - 1
END

SELECT @factorial;

4.