Skip to content

Data Manipulation in SQL

Here you can access every table used in the course. To access each table, you will need to specify the soccer schema in your queries (e.g., soccer.match for the match table, and soccer.league for the league table).

Take Notes

Add notes about the concepts you've learned and SQL cells with queries you want to keep.

Add your notes here

Spinner
DataFrameas
df
variable
-- Add your own queries here
SELECT *
FROM soccer.match
LIMIT 5

Explore Datasets

Use the match, league, and country tables to explore the data and practice your skills!

  • Use the match, league, and country tables to return the number of matches played in Great Britain versus elsewhere in the world.
    • "England", "Scotland", and "Wales" should be categorized as "Great Britain"
    • All other leagues will need to be categorized as "World".
  • Use the match and country tables to return the countries in which the average number of goals (home and away goals) scored are greater than the average number of goals of all matches.
  • In a soccer league, points are assigned to teams based on the result of a game. Here, let's assume that 3 points are awarded for a win, 1 for a tie, and 0 for a defeat. Use the match table to calculate the running total of points earned by the team "Chelsea" (team id 8455) in the season "2014/2015".
    • The final output should have the match date, the points earned by Chelsea, and the running total.

Statments

FROM -

SELECT -

WHERE -

GROUP BY -

HAVING - Very similar to where clause but it is run after the agregation while the WHERE is applied before the HAVIN is after.

FROM -

ORDER BY -

CASE WHEN THEN -

WITH name AS() -

The use of subqueries in statments WHERE,FROM and SELECT

WHERE Where to use subqueries - first example we have the subquery in the WHERE statment, that is used to filter the data with a single value or a list of values calculated from a different table. It returns a single column.

Its used when we want a more complex pre filtered table to work with. So instead just plain put it on the FROM we can open parentesis and put a entire query therefore filtering or pre working the data for the further outer query. Remember that the subquery in the FROM can use multiples subqueries just make sure to alias them and after that JOIN them as u would with to plain table from queries.

SELECT the subquery here is used for mathematics operation. when u want make a summary or a single value that will operate in a new calculated column. Bear in mind that the subqueries in SELECT must retrieve a single summary value, if return a colum the shell will return an error.

Best practice when running multiple subqueries in a main query

Its best practice in SQL as the queries grow in complexity use some best practice tools so the code gets more readable.

Such as indentation and coments with -- or /* in the code so u can explain why some line of code or subquery is there.

Correlated subqueries and simple subqueries

Instead of using a subquerie that for example filters the aditional table via a WHERE statement we can use a WHERE that is correlated with the main querie.

This type of subquerie has some disadvantages such as : slow down considarably the querie, it runs and reruns the subqueries row by row.

but it can be a simpler and easier way to write a query avoiding the need to make multple subqueries and joins.

Nested Subqueries

Nesting subqueries as the name implies u nest a subquery inside a subquery and this can get very complex and difficult to read and understand the code. Thats why we can use common table expressions or CTE. We use the function WITH and them we alias with AS and them we put the query way before the main query, in that way we can organizes all the subqueries in a single place and make the main query more readable and organized.

The OVER function and RANK window functions

The function OVER can easly substitute the subquery on the SELECT statment when we require a avg column to compare with.

The RANK function can be used with the over statment in order to make a ranking column of a determined data. use RANK() them OVER(ORDER BY ....). By default the ranking will order the smallest to largest, to reverse this use DESC inside the OVER just as u would on a ORDER BY. The OVER and RANK statments are ran lastly just before the ORDER BY statment.

Spinner
DataFrameas
df
variable
SELECT 
	-- Select the id, country name, season, home, and away goals
	m.id, 
    c.name AS country, 
    m.season,
	m.home_goal,
	m.away_goal,
    -- Use a window to include the aggregate average in each row
	AVG(m.home_goal + m.away_goal) OVER() AS overall_avg
FROM match AS m
LEFT JOIN country AS c ON m.country_id = c.id;

Sliding windows functioins

you can add statments inside the OVER functions in order to calculate running totals, the arguments can be ( PRECEDING, FOLLOWING, UNBOUNDED PRECEDING, UNBOUNDED FOLLOWING, CURRENT ROW). In the preceding and following u can put the number that will indicate the number of rows before and after the row you want to run the calculation. Unbounded means that you want to add all rows before or after, we would use this one in a running total for example.

LAG function

it takes 2 subclauses inside the parentesis ( column_name, number integer). The first determine witch column we are taking in and the second argument is how many rows before we want to copy.

Spinner
DataFrameas
df
variable
WITH Weightlifting_Gold AS (
  SELECT
    -- Return each year's champions' countries
    Year,
    Country AS champion
  FROM Summer_Medals
  WHERE
    Discipline = 'Weightlifting' AND
    Event = '69KG' AND
    Gender = 'Men' AND
    Medal = 'Gold')

SELECT
  Year, Champion,
  -- Fetch the previous year's champion
  LAG(Champion,1) OVER
    (ORDER BY Year ASC) AS Last_Champion
FROM Weightlifting_Gold
ORDER BY Year ASC;

Partition by subclause

Lets imagine that we want to filter the polejump andthe weightlifting events and to use the LAG function in order to see if the champion of a given year is a reining champion(won 2 events in a row). The LAG won't give our correct result as the LAG will iterate between the two events as just one. So in this case we can put the partition by in the OVER(Partition BY event). Now the LAG will reset every time the event shows a different value.

Spinner
DataFrameas
df
variable
WITH Tennis_Gold AS (
  SELECT DISTINCT
    Gender, Year, Country
  FROM Summer_Medals
  WHERE
    Year >= 2000 AND
    Event = 'Javelin Throw' AND
    Medal = 'Gold')

SELECT
  Gender, Year,
  Country AS Champion,
  -- Fetch the previous year's champion by gender
  LAG(Country,1) OVER (PARTITION BY Gender
            ORDER BY Year ASC) AS Last_Champion
FROM Tennis_Gold
ORDER BY Gender ASC, Year ASC;