Skip to content

Playing with dataset from the course Intermediate SQL

This course contained data on UEFA soccer matches. I show how we can simplify a query from the course, and build a query to extract information on longest winning streaks by a team during a season.

Simplifying the query from final exercise.

For reference, here is the query we had to complete:

Spinner
DataFrameas
df
variable
-- Add your own queries here
-- Set up the home team CTE
WITH home AS (
  SELECT m.id, t.team_long_name,
	  CASE WHEN m.home_goal > m.away_goal THEN 'MU Win'
		   WHEN m.home_goal < m.away_goal THEN 'MU Loss' 
  		   ELSE 'Tie' END AS outcome
  FROM soccer.match AS m
  LEFT JOIN soccer.team AS t ON m.hometeam_id = t.team_api_id),
-- Set up the away team CTE
away AS (
  SELECT m.id, t.team_long_name,
	  CASE WHEN m.home_goal > m.away_goal THEN 'MU Loss'
		   WHEN m.home_goal < m.away_goal THEN 'MU Win' 
  		   ELSE 'Tie' END AS outcome
  FROM soccer.match AS m
  LEFT JOIN soccer.team AS t ON m.awayteam_id = t.team_api_id)
-- Select columns and and rank the matches by goal difference
SELECT DISTINCT
    m.date,
    home.team_long_name AS home_team,
    away.team_long_name AS away_team,
    m.home_goal, m.away_goal,
    RANK() OVER(ORDER BY ABS(home_goal - away_goal) DESC) as match_rank
-- Join the CTEs onto the match table
FROM soccer.match AS m
INNER JOIN home ON m.id = home.id
INNER JOIN away ON m.id = away.id
WHERE m.season = '2014/2015'
      AND ((home.team_long_name = 'Manchester United' AND home.outcome = 'MU Loss')
      OR (away.team_long_name = 'Manchester United' AND away.outcome = 'MU Loss'));

We can do something equivalent using a simple join on a CASE WHEN. We could of course add info on whether MU was the home or away team by comparing home and away goals.

Spinner
DataFrameas
df
variable
SELECT
	m.date,
	t.team_long_name AS winning_team,
	m.home_goal,
    m.away_goal,
    RANK() OVER(ORDER BY ABS(m.home_goal - m.away_goal) DESC) as match_rank
FROM soccer.match AS m
INNER JOIN soccer.team AS t
ON (CASE WHEN m.home_goal < m.away_goal THEN m.awayteam_id
    	WHEN m.home_goal > m.away_goal THEN m.hometeam_id END) = t.team_api_id
WHERE m.season = '2014/2015' AND (
        (m.hometeam_id = 10260 AND m.home_goal < m.away_goal)
        OR (m.awayteam_id = 10260 and m.home_goal > m.away_goal))
ORDER BY date;

Winning streaks

Now for something more interesting: identify longest winning streaks by a team in a given season.

First, given a team id and season (e.g., Manchester United in the 2011/2012 season), we identify wins, and record when the last defeat happened in order to identify winning streaks. This will later appear as a correlated subquery.

Spinner
DataFrameas
df
variable

SELECT *,
	(ROW_NUMBER() OVER(ORDER BY date)) - (MAX(last_loss_rank) OVER(ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)) AS win_streak
FROM (SELECT *,
      -- rank if defeat, 0 otherwise. Just apply running max to retrieve rank of last defeat:
      -- multiplying by (1-team_win) is slightly faster than the more legible CASE which I commented out:
      --CASE WHEN team_win=0 THEN (ROW_NUMBER() OVER(ORDER BY date)) ELSE 0 END AS last_loss_rank
      (1.0-team_win)*(ROW_NUMBER() OVER(ORDER BY date)) AS last_loss_rank
	FROM (SELECT 
		id,season,date,
    	hometeam_id,awayteam_id,
    	home_goal,away_goal,
    	CASE WHEN 
      	(m.home_goal > m.away_goal AND m.hometeam_id = 10260) OR
      	(m.home_goal < m.away_goal AND m.awayteam_id = 10260) THEN 1
    		ELSE 0 END AS team_win
	FROM soccer.match AS m
	WHERE m.season = '2011/2012' AND (m.hometeam_id = 10260 OR m.awayteam_id = 10260)) AS sub_a) AS sub_b;
    

Now, doing is for all teams and seasons:

Spinner
DataFrameas
df
variable

WITH team_seasons_list AS ( -- prepare a table with all teams and the seasons in which they played
    SELECT t.team_api_id, t.team_long_name, m.season
    FROM (SELECT DISTINCT hometeam_id AS team_api_id, season
          FROM soccer.match
          UNION
          SELECT DISTINCT awayteam_id AS team_api_id, season
          FROM soccer.match) AS m
    LEFT JOIN soccer.team AS t
    USING(team_api_id)
)
-- For a given team and season, calculate the longest streak
SELECT team_long_name, season, 
	(SELECT MAX(win_streak) as longest_streak --, team_seasons_list.team_long_name, team_seasons_list.season
FROM (SELECT (ROW_NUMBER() OVER(ORDER BY date)) - (MAX(last_loss_rank) OVER(ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)) AS win_streak
    FROM (SELECT *,
      -- rank if defeat, 0 otherwise. Just apply running max to retrieve rank of last defeat:
      -- maybe multiplying by (1-team_win) would be faster than CASE...
      CASE WHEN team_win=0 THEN (ROW_NUMBER() OVER(ORDER BY date)) ELSE 0 END AS last_loss_rank
	FROM (SELECT 
		id,season,date,
    	hometeam_id,awayteam_id,
    	home_goal,away_goal,
    	CASE WHEN 
      	(m.home_goal > m.away_goal AND m.hometeam_id = team_seasons_list.team_api_id) OR
      	(m.home_goal < m.away_goal AND m.awayteam_id = team_seasons_list.team_api_id) THEN 1
    		ELSE 0 END AS team_win
	FROM soccer.match AS m
	WHERE m.season=team_seasons_list.season AND (m.hometeam_id = team_seasons_list.team_api_id OR m.awayteam_id = team_seasons_list.team_api_id)) AS sub_a) AS sub_b
) AS sub_win_streaks)
FROM team_seasons_list
-- Sort the table by longest streaks, descending
ORDER BY longest_streak DESC;

Bayern Munich sure had an impressive 2013/2014 season!

https://www.bundesliga.com/en/news/Bundesliga/0000282346.jsp