Data Manipulation in SQL
Here you can access every table used in the course. To access each table, you will need to specify the soccer schema in your queries (e.g., soccer.match for the match table, and soccer.league for the league table).
Note: When using sample databases such as those that contain course data, you have read-only access. You can run queries, but cannot make any changes such as adding, deleting, or modifying the data (e.g., creating tables, views, etc.).
Take Notes
Add notes about the concepts you've learned and SQL cells with queries you want to keep.
Add your notes here
-- Add your own queries here
SELECT
stage,
AVG(home_goal + away_goal) AS avg_goals
FROM soccer.match
WHERE season = '2012/2013'
GROUP BY stageExplore Datasets
Use the match, league, and country tables to explore the data and practice your skills!
- Use the
match,league, andcountrytables to return the number of matches played in Great Britain versus elsewhere in the world.- "England", "Scotland", and "Wales" should be categorized as "Great Britain"
- All other leagues will need to be categorized as "World".
- Use the
matchandcountrytables to return the countries in which the average number of goals (home and away goals) scored are greater than the average number of goals of all matches. - In a soccer league, points are assigned to teams based on the result of a game. Here, let's assume that 3 points are awarded for a win, 1 for a tie, and 0 for a defeat. Use the
matchtable to calculate the running total of points earned by the team "Chelsea" (team id 8455) in the season "2014/2015".- The final output should have the match date, the points earned by Chelsea, and the running total.
1 We'll take the CASE
- CASE -> If... Then ...
- CASE: 1 WHEN... 2 THEN ... 3 ELSE... END (AS ...)
- MANY Conditions
- Example
- CASE WHEN ... AND -> Multiple Conditions
- WHERE -> filter to specific thing that u need
- If: ELSE NULL END / ELSE END-> ผลลัพธ์ที่ไม่เข้า condition จะแสดงเป็น NULL
- ใช้ CASE 2 รอบได้ ถ้าต้องการ Include หรือ Exclude อะไรที่ WHERE e.g.
- Aggregating:
- COUNT: e.g.
- SUM: e.g.
- AVG:
e.g.
- ROUND +
- Percentage + : ให้สิ่งที่ต้องการหา%เป็น 1 (THEN 1) และที่ตรงข้ามเป็น 0 (THEN 0) แล้วหา AVG
2 Short and Simple Subqueries
- Subqueries SELECT, FROM, WHERE, GROUP BY
- ให้ค่า
- Scalar quantities (3.14159, -2 , 0.001)
- list (id = (12, 25, 392, 401, 939))
- table
- ใช้เพื่อ
- Comparing groups to summarized values e.g. How did Liverpool compare to the English Premier League's average performance for that year?
- Reshaping data e.g. What is the highest monthly average of goals scored in the Bundesliga?
- Combining data that cannot be joined e.g. How do you get both the home and away team names into a table of match results?
- Simple Subqueries
- sub จะ exe ก่อน main
-
- WHERE (filtering) e.g. d use with IN -> ให้ค่าออกมาเป็น List e.g.
-
- FROM (restructing + transforming)
- data long -> wide
- pre-filter -> calculating
- calculate aggregates of aggregates e.g. Which 3 teams has the highest avg score of home goals scored? 1) calculated AVG of each team -> subquery Then put in MAIN query 2) get 3 highest of AVG
- NOTES
- มีหลาย subqueries ใน 1 FROM ได้ (Alias + Join ด้วย)
- join subquery ลงใน table ที่มีอยู่แล้วได้ (ต้องมี column ที่joinด้วยได้)
-
- SELECT (summary values -> detailed data set)
- ให้ค่าเป็น single value เอา aggregated value COMPARE กับ individual values ได้ (ปกติไม่ได้) e.g. เทียบtotal mathcesแบบindividualแต่ละ season กับ overall > calculate total matches across all season SELECT COUNT(id) FROM match; > Add to MAIN for each season
- ใช้ใน mathematic calculations e.g. deviation from AVG MEDTHOD 1: simple > AVG of total goal in a season > calculate difference from result or MEDTHOD 2: subquery > put calculate AVG as subquery
- NOTES
- NEED to return "Single" value
- filter in MAIN and sub should be placed correctly
-
- MULTIPLE Subqueries
- Can has subqueries in SELECT, FROM, WHERE in the same time e.g.
- BEST PRACTICE
- Format: SELECT FROM WHERE GROUP BY
- Annotate: /* ... */ for tell description
- Indent (ย่อหน้าให้เข้าใจง่าย)
- Properly filter in each subquery e.g. ** Hollywell's SQL Style Guide **
3 Correlated Queries, Nested Queries, and Common Table Expressions
-
Correlated Queries:
-
ใช้ value จาก outer query เพื่อสร้าง result
-
re-run every row ที่สร้างขึ้นใน final data set
-
FOR advanced joining filtering evaluating
-
e.g. same as previos chapter can be produce same query result
to (correlated query) = "return stages where the values in the subquery are higher than the average."
-
KEY difference sim - correlated
-
-
Nested subqueries
-
subq inside another subq
-
multiple layers of transformation
-
Subq
inside subq
-
Inner subq result:
-
Outer subq result: 2944.75
-
Final query result:
-
-
Correlated Nested subq
- Nested subq can be correlated OR uncorrelated
- OR both corr & uncorr (e.g. inner corr, outer uncorr)
- can referenced info from Outer subq or main query
- e.g. What is the each country's average goals scored in the 2011/2012 season?
-
-
Common Table Expressions (CTEs)
-
tb declared before Main query
-
setting
- Named subq by WITH
- referenced later in FROM
-
BEFORE CTE
AFTER CTE
-
NAMED + Place at the beginning
-
FINAL (referenced)
-
-
Multiple subq
*after CTEs can refer to before e.g. 3rd CTEs can refer 1st & 2nd
-
CTEs can refer itself (Self Join) = recursive CTE
-
-
Differentiating Techniques
*limits of JOINs = "you can't join two separate columns in one table, to a single column in another at a time" e.g. can't ดึง name จาก home team กับ away team ในเวลาเดียวกันได้
-
Which one?
- depends on Database/Questions
- technique allows to
- use & reuse queries
- gen clear & accurate results
-
Different use cases
JOINs - 2+ tables e.g. (What is the totalsales per employee?)
Correlated subq - matching data from different columns in one or more tables e.g. (Who does each employee report to in a company?)
Multiple/Nested subq - require multiple steps to transform and prepare before generating the final query e.g. (What is the average deal size closed by each sales representative in thequarter?)
CTEs - comparing a large number of disparate pieces of information e.g. (How did the marketing, sales, growth, & engineering teams perform on key metrics?) -> create a summary table examining the marketing, sales, growth, and engineering teams' performance on their key metrics last quarter CTE -> extract data about each team's performance one after another => combine them into a single query
-
4 Window Functions
-
Limitations in SQL
- Requires you to use GROUP BY with all non-aggregate columns e.g.
- can't compare aggregate values to non-aggregate data
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Tackle ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
Window Functions = Perform calculations on an already generated result set (a window)
--> use a window TO perform aggregate calculation WITHOUT grouping data (similar to subq in SELECT) FOR Running totals, rankings, moving averagesOVER() = ให้เป็นค่าคงที่แล้วนำไปคำนวณ
-
subq & Window func. 1) subq
Window Func.
OVER = "pass this aggregate value over this existing result set."
RANK() *default: smallest -> highest e.g. What is the rank of matches based on number of goals scored?
from
to
(default)
(DESC)
-
-
Key Considerations of using Window func.
- Processed after every part of query except ORDER BY (Uses information in result set rather than database)
- Available in PostgreSQL, Oracle, MySQL, SQL Server... but NOT SQLite
-
OVER with a PARTITION
-
OVER can add func. in ()
-
PARTITION BY = A partition allows you to calculate separate values for different categories established in a partition = Calculate different calculations in the same column
SYNTAX:
OVER(PARTITION BY ...)
=> return the overall average for, or PARTITIONed BY each ... (e.g. season)
-
e.g.
Different: (no PARTITION BY)
(with PARTITION BY)
--> row 1-2 (in season 2011/2012 show avg of 2011/2012) --> row 3-4 (in season 2012/2013 show avg of 2012/2013)
-
PARTITION BY multiple columns
--> row 1 (a match was played in Belgium in 2011/2012 season + 1 goal ===> compared to the 2.88 (= avg goals in Belgium in 2011/2012 season)- Considerations
-
Can partition data by 1 or more columns
-
Can partition aggregate calculations, ranks, etc (any kind of Window func.)
-
-
-
Sliding Windows
= used to calculate information that changes with each subsequent row in a data set
= perform calculations relative to the current row of a data set
--> calculate a wide variety of information that aggregates one row at a time down your data set (running totals, sums, counts, averages)
- can also be partitioned by one or more columns
Syntax
ROWS BETWEEN AND
[Keywords]
PRECEDING FOLLOWING => specify the number of rows before(PRECEDING) OR after(FOLLOWING) current row that want to include in a calculation UNBOUNDED PRECEDING UNBOUNDED FOLLOWING => tell SQL that you want to include every row since the beginning(UNB PREC) OR the end(UNB FOLLO) of the data set in your calculations CURRENT ROW => tells SQL that you want to stop your calculation at the current rowe.g. (accumulated Total?)
- More limited frame
*running: ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW (= accumulated)
*running (ORDER BY DESC): ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING (reversed)
Case Study (all methods in this Course)
"Who defeated Manchester United in the 2013/2014 season?"
Steps
-
Get team names with CTEs
-
Get match outcome with CASE statements
As a home team
As a away team
-
Determine how badly they lost with a Window Function (RANKING)