Skip to content
Intermediate SQL Server - Course Notes by Chinmay Garg
Intermediate SQL Server
1. Summarizing Data
Data Analysis with Aggregations
- Data Scientists spend a lot of time analyzing and aggregating data by generating summary statistics such as the mean, minimum and maximum values of columns. If your data exists in a DB, the fastest way to generate summary statistics is by using SQL.
- Common summary stats - calculate minimum, maximum, average by using
MIN()
,MAX()
,AVG()
- Aggregating Data - 3-4 in https://app.datacamp.com/workspace/w/aab12b07-150e-4884-a28c-1fce47c8b235/edit
- Reminder: SQL Server is case insensitive
DataFrameavailable as
df1
variable
-- Query 1.3. Calculate Summary stats of bicycle orders where average sales of a bicycle > 500
SELECT p.product_name,
AVG(oi.list_price * quantity * Discount) AS Average,
MIN(oi.list_price * quantity * Discount) AS Minimum,
MAX(oi.list_price * quantity * Discount) AS Maximum
FROM sales.order_items oi
INNER JOIN production.products p ON oi.product_id = p.product_id
GROUP BY product_name
-- Return records where average of Sales is greater than 500
HAVING AVG(oi.list_price * quantity) > 500
Dealing with Missing Data - 2.9-2.11 in https://app.datacamp.com/workspace/w/aab12b07-150e-4884-a28c-1fce47c8b235/edit
- Missing data are a common occurrence. Missing data is represented as
NULL
.NULL
is not a value, you cannot use the = sign or any other operators to find or compare NULLs, instead you must useIS NULL
andIS NOT NULL
. - Blank value is not
NULL
: blanks are formatted as two single quotes with no space and are specific to text fields. An empty row appears as a blank in queries but is not NULL. To identify blank values, use theLEN
function (LEN(col1) > 0
) ISNULL(missing_data_col, Substituted value/ col)
- Substitute missing data with a specific value/ columnCOALESCE(missing_data_col, value_2/ col_2, ..value_n/ col_n)
-Ppass multiple values it will always return the first non-NULL value. Similar to using a series of if-else statements to return the first non-missing value. https://www.sqlshack.com/using-the-sql-coalesce-function-in-sql-server/
DataFrameavailable as
df2
variable
-- Query 1.7 - ISNULL example. Contact customers with missing phone numbers
SELECT customer_id, ISNULL(phone, 'Contact ' + first_name + ' ' + last_name) phone
FROM sales.customers
WHERE phone IS NULL
Binning data with CASE
CASE
- Evaluate records like if/ if-else statements.
- Keywords -
CASE
WHEN
THEN
ELSE
(Optional)END
- In absence of
ELSE
for records where any condition is not met inWHEN
, the column will output asNULL
. - Add multipe logical conditions to
WHEN
usingAND
/BETWEEN
. - Can also be Nested LIKE if-else statements. (https://www.tsql.info/sql-server/nested-case-statements.php)
- The entire
CASE WHEN
block can be used inWHERE
andGROUP BY
(https://www.sqlshack.com/case-statement-in-sql/) - Good way to arrange data into smaller groups - common technique for Data Science.
DataFrameavailable as
df3
variable
-- Query 1.10 - Bucketing orders under "On time, Early, Delayed" using CASE
SELECT o.order_id, order_date, required_date, shipped_date,
DATEDIFF(DAY, order_date, required_date) expected_delivery_days,
DATEDIFF(DAY, order_date, shipped_date) delivery_days,
DATEDIFF(DAY, required_date, shipped_date) delay_days,
CASE WHEN DATEDIFF(DAY, required_date, shipped_date) = 0 THEN 'On time'
WHEN DATEDIFF(DAY, required_date, shipped_date) > 0 THEN 'Delay'
ELSE 'Early' END AS delay,
/* Query 1.12 v Adding multiple conditions to check whether required date is a weekend. (Irrelevant in terms of E-commerce. Just for reference) */
CASE WHEN DATEPART(Week, o.required_date) BETWEEN 6 AND 7 THEN 'Weekend'
ELSE 'Weekday' END required_date_weekend_check
FROM sales.orders o
2. Math Functions
Counting & Totals - 1.1, 2.11, 3, 4 in https://app.datacamp.com/workspace/w/aab12b07-150e-4884-a28c-1fce47c8b235/edit/Intermediate%20SQL%20-%20Couse%20Notes%20by%20Chinmay%20Garg.ipynb
COUNT(DISTINCT column)
- Count unique no. of values in a column. Same asCOUNT()
if all values in a column are unique
DataFrameavailable as
df4
variable
WITH order_quantity AS
SELECT order_id, SUM(quantity) quantity
FROM sales.order_items
GROUP BY order_id
)
SELECT o.order_id, quantity, (SELECT AVG(quantity) FROM order_quantity) avg_order_quantity,
order_date, required_date, shipped_date,
DATEDIFF(DAY, order_date, required_date) expected_delivery_days,
DATEDIFF(DAY, order_date, shipped_date) delivery_days,
DATEDIFF(DAY, required_date, shipped_date) delay_days,
CASE WHEN DATEDIFF(DAY, required_date, shipped_date) = 0 THEN 'On time'
WHEN DATEDIFF(DAY, required_date, shipped_date) > 0 THEN
CASE WHEN quantity > (SELECT AVG(quantity) FROM order_quantity) THEN 'Delay, High Volume'
WHEN quantity < (SELECT AVG(quantity) FROM order_quantity) THEN 'Delay, Low Volume'
ELSE 'Delay' END
ELSE 'Early' END AS delay
FROM sales.orders o
LEFT JOIN order_quantity ot ON ot.order_id = o.order_id
3. Processing Data in SQL Server
WHILE
loops
DataFrameavailable as
df
variable
-- Define the target factorial number
DECLARE @target float = 5
-- Initialization of the factorial result
DECLARE @factorial float = 1
WHILE @target > 0
BEGIN
-- Calculate the factorial number
SET @factorial = @factorial * @target
-- Reduce the termination condition
SET @target = @target - 1
END
SELECT @factorial;