Skip to content

Soccer - Who really has bragging rights? - SQL - Data Manipulation

Context:

I have three brothers. Let's call them number 1, number 2, and number 3.

They each support three different clubs in the Premier League.

  • Number 1's is Arsenal.
  • Number 2's is Manchester United.
  • Number 3's is Manchester City.

It is also important to note that in real life, numnber 1 is older than number 2, who is older than number 3.

Each of these guys would never miss a game where their club was playing. Very often, they would also never miss a game where the other brother's club was playing. For example Number 1 would intently watch a game where Arsenal was playing Leicester City, but he would also watch a game where Manchester City was playing Newcastle. There's ripe banter to be plucked there :)

You can then imagine the atmostphere when Arsenal is playing Manchester United, or when Manchester United is playing Manchester City💀 Each of them is ALWAYS 'making noise' about whose club is the superior (has won the most matches)...

...and today, we settle this once and for all!

To do this, I will be using the soccer dataset with information for 4 seasons (2011/2012 - 2014/2015) and about 13,000 matches.

I will analyse the results of the matches of each brother's club against another brother's club to measure whose club is the superior of the three. Afterwards, I will allow the analysis to include matches played by each brother's club against other clubs (not only those of the other brothers) to also measure whose 'noise-making' is justified and who has the ultimate bragging rights.

The brother's club that won the most games against another brother's club is the superior of the three. The brother whose club has won the most matches overall is the one with the bragging rights and is therefore allowed to trash-talk the other brothers and their respective clubs (kindly, of course).

My hypothesis from observing them during and after these games is that Number 3 is the one with undisputable bragging rights. After which comes number 2 before number 1. However, the data will be the one to help reject or validate this hypothesis.

DISCLAIMER: This analysis is just for fun. I just need an interesting topic to practice what I'm learning. Also, the results would be limited by the dataset I'm using.

View Datasets

League Table

Spinner
DataFrameas
df
variable

SELECT *
FROM soccer.league
LIMIT 3;

Club Table

Spinner
DataFrameas
df
variable

SELECT *
FROM soccer.team
LIMIT 3;

Match Table

Spinner
DataFrameas
df
variable

SELECT *
FROM soccer.match
LIMIT 3;

Analysis

Get Clubs respective API-IDs

Spinner
DataFrameas
df
variable

SELECT team_long_name, team_api_id, team_short_name
FROM soccer.team
WHERE team_short_name IN ('ARS', 'MUN', 'MCI')
ORDER BY team_long_name ASC

How many matches have each played in this dataset?

Spinner
DataFrameas
df
variable

SELECT 
	t.team_long_name As team,
    COUNT(m.id) AS total_matches
FROM soccer.match as m
LEFT JOIN soccer.team as t
ON (m.hometeam_id = t.team_api_id OR m.awayteam_id = t.team_api_id)
WHERE team_short_name IN ('ARS', 'MCI', 'MUN')
GROUP BY team