Code-along 2024-07-09 Analyzing Euro 2024 Soccer Data in SQL
Today, we are going to explore some data of the EURO 2024 which is still ongoing at the time of writing and working through this.
We will work with a dataset that was sourced from the sportmonks
API (with some adjustments)
Problem Context:
MALCOM is invited to a sports party of Access Studios**. However, Malcom has no idea about soccer. He would like to have at least a few interesting facts about EURO 2024 that he can mention so that he doesn't look like a complete loser during the conversations.
We will cover the following
- 🕵️♀️ Exploratory data analysis --> How does the data look like and can we discover potential issues?
- 🔧 Data modeling --> Bring the data into a format that makes it easier for us to analyze
- 🤿 Deep dive analysis: Game events and goals
- 📊 Based on what we found - see how things could be monitored in a EURO 2024 dashboard in Metabase
Task 0: Setup 🛠️
Details
For this analysis we only need the CSV files with some data from the EURO 2024 tournament.
game_events.csv
-> Events that happened during a game
Make sure that you have the CSV files mentioned above in your environment.
We will use datalab's duckdb integration to analyze the files with SQL.
The data is sourced and adjusted from sportmonks
Task 1: Get an understanding of your AUDIENCE and what you want to achieve
Before diving into a dataset, we should set ourselves a clear goal on what we want to achive.
Otherwise, we might end in some rabbit holes easily 🐇🕳️.
In this session we want to understand more about goals since this is what everyone talks about at a garden party (or not?) 🪴
- When are they scored?
- Commentators often talk about "jokers" that substituted late in the game -> Can we see some effects in the data?
Task 2: Have a brief look at the data 👀
When we look at the data we try to already get a glimbse of it, e.g.:
- What data (columns) do we have available? Do we know what they are?
- Can we already see a row count?
- Missing values and where they occur
SELECT *
FROM 'game_events.csv'
LIMIT 50;
Field | Description | Type |
---|---|---|
id | Refers to the unique id of the type | integer |
game | The name of the game | string |
event_type | The type of event | string |
created_at | The time when the event happened | timestamp |
team_name | The name of the team related to the event | string |
player_name | The name of the player that performed the event or to whom it happened | string |
related_player_name | The name of the player who is related to the event that player_name performed | string |
result | The result after the event happend | string |
minute | The game minute in which the event happened | integer |
extra_minute | The additional minutes in which the event happened (e.g. for 90+2 this would be 2) | integer |
previous_player_event | The event_type of the previous event that happened to the player | string |
previous_player_event_at | The timestamp of the previous event that happened to the player | timestamp |
seconds_after_previous_event | The seconds after the previous event happened to the player | integer |
player_id | The id of the player that performed the event or to whom it happened | integer |
related_player_id | The id of the player who is related to the event that player_name performed | integer |
participant_id | The id of the team related to the event | integer |
fixture_id | The id of the game the event happened in | integer |
Task 3: 🕵️♀️ Exploratory data analysis
Now, we want to further understand the data we work with and potential issues it has.
For that, a good start is to look at
- Row count
- Duplicates
- Missing values
- Value distributions (i.e. can we see any weird outliers)
Row counts and uniqueness
Let's start by looking at the row counts
SELECT
count(1),
count(distinct id) as unique
FROM 'game_events.csv';
Our Data Validation shows there are no duplicates
Summary stats
Let's look at the separate tables and their columns using the SUMMARIZE
function.
SUMMARIZE SELECT * FROM 'game_events.csv';
Distributions
-- Distribution of "minute" values
SELECT minute,
count(1) as events
FROM 'game_events.csv'
GROUP BY 1;
- Seems that most events happen in the 2nd half of the game
- There are many events in the 90th minute - are those outliers or data issues?
Let's check what type of events happen in the 90th minute.
-- Events in minute 90
SELECT event_type, COUNT(1) AS event_count
FROM 'game_events.csv'
WHERE minute = 90
GROUP BY 1
ORDER BY 2 DESC;
mmmmm! We see lots of Yellow card events occur during 90TH MINUTE of the Euro 2024 tournament. These players are really going in for the tackle at the last minute as they fight for pride and glory, and the refs are not shying away from dishing out those yello cards wink