Analyzing Euro 2024 Soccer Data in SQL

Code-along 2024-07-09 Analyzing Euro 2024 Soccer Data in SQL

Today, we are going to explore some data of the EURO 2024 which is still ongoing at the time of writing and working through this. We will work with a dataset that was sourced from the sportmonks API (with some adjustments)

Problem Context:

MALCOM is invited to a sports party of Access Studios**. However, Malcom has no idea about soccer. He would like to have at least a few interesting facts about EURO 2024 that he can mention so that he doesn't look like a complete loser during the conversations.

We will cover the following

🕵️‍♀️ Exploratory data analysis --> How does the data look like and can we discover potential issues?
🔧 Data modeling --> Bring the data into a format that makes it easier for us to analyze
🤿 Deep dive analysis: Game events and goals
📊 Based on what we found - see how things could be monitored in a EURO 2024 dashboard in Metabase

Task 0: Setup 🛠️

Details

For this analysis we only need the CSV files with some data from the EURO 2024 tournament.

game_events.csv -> Events that happened during a game

Make sure that you have the CSV files mentioned above in your environment.
We will use datalab's duckdb integration to analyze the files with SQL.

The data is sourced and adjusted from sportmonks

Task 1: Get an understanding of your AUDIENCE and what you want to achieve

Before diving into a dataset, we should set ourselves a clear goal on what we want to achive.
Otherwise, we might end in some rabbit holes easily 🐇🕳️.

In this session we want to understand more about goals since this is what everyone talks about at a garden party (or not?) 🪴

When are they scored?
Commentators often talk about "jokers" that substituted late in the game -> Can we see some effects in the data?

Task 2: Have a brief look at the data 👀

When we look at the data we try to already get a glimbse of it, e.g.:

What data (columns) do we have available? Do we know what they are?
Can we already see a row count?
Missing values and where they occur

DataFrameas

df

variable

SELECT *
FROM 'game_events.csv'
LIMIT 50;

Field	Description	Type
id	Refers to the unique id of the type	integer
game	The name of the game	string
event_type	The type of event	string
created_at	The time when the event happened	timestamp
team_name	The name of the team related to the event	string
player_name	The name of the player that performed the event or to whom it happened	string
related_player_name	The name of the player who is related to the event that `player_name` performed	string
result	The result after the event happend	string
minute	The game minute in which the event happened	integer
extra_minute	The additional minutes in which the event happened (e.g. for 90+2 this would be 2)	integer
previous_player_event	The `event_type` of the previous event that happened to the player	string
previous_player_event_at	The timestamp of the previous event that happened to the player	timestamp
seconds_after_previous_event	The seconds after the previous event happened to the player	integer
player_id	The id of the player that performed the event or to whom it happened	integer
related_player_id	The id of the player who is related to the event that `player_name` performed	integer
participant_id	The id of the team related to the event	integer
fixture_id	The id of the game the event happened in	integer

Task 3: 🕵️‍♀️ Exploratory data analysis

Now, we want to further understand the data we work with and potential issues it has.
For that, a good start is to look at

Row count
Duplicates
Missing values
Value distributions (i.e. can we see any weird outliers)

Row counts and uniqueness

Let's start by looking at the row counts

DataFrameas

df1

variable

SELECT 
	count(1), 
	count(distinct id) as unique
FROM 'game_events.csv';

Our Data Validation shows there are no duplicates

Summary stats

Let's look at the separate tables and their columns using the SUMMARIZE function.

DataFrameas

df2

variable

SUMMARIZE SELECT * FROM 'game_events.csv';

Distributions

DataFrameas

df3

variable

-- Distribution of "minute" values
SELECT minute,
		count(1) as events
FROM 'game_events.csv'
GROUP BY 1;

Seems that most events happen in the 2nd half of the game
There are many events in the 90th minute - are those outliers or data issues?

Let's check what type of events happen in the 90th minute.

DataFrameas

df4

variable

-- Events in minute 90
SELECT event_type, COUNT(1) AS event_count
FROM 'game_events.csv'
WHERE minute = 90
GROUP BY 1
ORDER BY 2 DESC;

mmmmm! We see lots of Yellow card events occur during 90TH MINUTE of the Euro 2024 tournament. These players are really going in for the tackle at the last minute as they fight for pride and glory, and the refs are not shying away from dishing out those yello cards wink

‌
‌
‌

Analyzing Euro 2024 Soccer Data in SQL

.mfe-app-workspace-kj242g{position:absolute;top:-8px;}.mfe-app-workspace-11ezf91{display:inline-block;}.mfe-app-workspace-11ezf91:hover .Anchor__copyLink{visibility:visible;}Code-along 2024-07-09 Analyzing Euro 2024 Soccer Data in SQL

Problem Context:

Task 0: Setup 🛠️

Task 1: Get an understanding of your AUDIENCE and what you want to achieve

Task 2: Have a brief look at the data 👀

Task 3: 🕵️‍♀️ Exploratory data analysis

Row counts and uniqueness

Summary stats

Distributions

Code-along 2024-07-09 Analyzing Euro 2024 Soccer Data in SQL