Skip to content
New Workbook
Sign up
Analyzing Euro 2024 Soccer Data in SQL

Code-along 2024-07-09 Analyzing Euro 2024 Soccer Data in SQL

Today, we are going to explore some data of the EURO 2024 which is still ongoing at the time of writing and working through this. We will work with a dataset that was sourced from the sportmonks API (with some adjustments)

Problem Context:

MALCOM is invited to a sports party of Access Studios**. However, Malcom has no idea about soccer. He would like to have at least a few interesting facts about EURO 2024 that he can mention so that he doesn't look like a complete loser during the conversations.

We will cover the following

  • 🕵️‍♀️ Exploratory data analysis --> How does the data look like and can we discover potential issues?
  • 🔧 Data modeling --> Bring the data into a format that makes it easier for us to analyze
  • 🤿 Deep dive analysis: Game events and goals
  • 📊 Based on what we found - see how things could be monitored in a EURO 2024 dashboard in Metabase

Task 0: Setup 🛠️

Details

For this analysis we only need the CSV files with some data from the EURO 2024 tournament.

  • game_events.csv -> Events that happened during a game

Make sure that you have the CSV files mentioned above in your environment.
We will use datalab's duckdb integration to analyze the files with SQL.

The data is sourced and adjusted from sportmonks

Task 1: Get an understanding of your AUDIENCE and what you want to achieve

Before diving into a dataset, we should set ourselves a clear goal on what we want to achive.
Otherwise, we might end in some rabbit holes easily 🐇🕳️.

In this session we want to understand more about goals since this is what everyone talks about at a garden party (or not?) 🪴

  • When are they scored?
  • Commentators often talk about "jokers" that substituted late in the game -> Can we see some effects in the data?

Task 2: Have a brief look at the data 👀

When we look at the data we try to already get a glimbse of it, e.g.:

  • What data (columns) do we have available? Do we know what they are?
  • Can we already see a row count?
  • Missing values and where they occur
Spinner
DataFrameavailable as
df
variable
SELECT *
FROM 'game_events.csv'
LIMIT 50;
FieldDescriptionType
idRefers to the unique id of the typeinteger
gameThe name of the gamestring
event_typeThe type of eventstring
created_atThe time when the event happenedtimestamp
team_nameThe name of the team related to the eventstring
player_nameThe name of the player that performed the event or to whom it happenedstring
related_player_nameThe name of the player who is related to the event that player_name performedstring
resultThe result after the event happendstring
minuteThe game minute in which the event happenedinteger
extra_minuteThe additional minutes in which the event happened (e.g. for 90+2 this would be 2)integer
previous_player_eventThe event_type of the previous event that happened to the playerstring
previous_player_event_atThe timestamp of the previous event that happened to the playertimestamp
seconds_after_previous_eventThe seconds after the previous event happened to the playerinteger
player_idThe id of the player that performed the event or to whom it happenedinteger
related_player_idThe id of the player who is related to the event that player_name performedinteger
participant_idThe id of the team related to the eventinteger
fixture_idThe id of the game the event happened ininteger

Task 3: 🕵️‍♀️ Exploratory data analysis

Now, we want to further understand the data we work with and potential issues it has.
For that, a good start is to look at

  • Row count
  • Duplicates
  • Missing values
  • Value distributions (i.e. can we see any weird outliers)

Row counts and uniqueness

Let's start by looking at the row counts

Spinner
DataFrameavailable as
df1
variable
SELECT 
	count(1), 
	count(distinct id) as unique
FROM 'game_events.csv';

Our Data Validation shows there are no duplicates

Summary stats

Let's look at the separate tables and their columns using the SUMMARIZE function.

Spinner
DataFrameavailable as
df2
variable
SUMMARIZE SELECT * FROM 'game_events.csv';

Distributions

Spinner
DataFrameavailable as
df3
variable
-- Distribution of "minute" values
SELECT minute,
		count(1) as events
FROM 'game_events.csv'
GROUP BY 1;
  • Seems that most events happen in the 2nd half of the game
  • There are many events in the 90th minute - are those outliers or data issues?

Let's check what type of events happen in the 90th minute.

Spinner
DataFrameavailable as
df4
variable
-- Events in minute 90
SELECT event_type, COUNT(1) AS event_count
FROM 'game_events.csv'
WHERE minute = 90
GROUP BY 1
ORDER BY 2 DESC;

mmmmm! We see lots of Yellow card events occur during 90TH MINUTE of the Euro 2024 tournament. These players are really going in for the tackle at the last minute as they fight for pride and glory, and the refs are not shying away from dishing out those yello cards wink