Analyzing Euro 2024 Soccer Data in SQL
  Code-along · 2024-07-09

    Today, we are going to explore some data from EURO 2024, which is still ongoing at the time of writing. We will work with a dataset that was sourced from the Sportmonks API (with some adjustments).

    Problem Context:

    Malcom is invited to a sports party at Access Studios. However, Malcom has no idea about soccer. He would like to have at least a few interesting facts about EURO 2024 that he can mention so that he doesn't look like a complete loser during the conversations.

    We will cover the following:

    • 🕵️‍♀️ Exploratory data analysis --> What does the data look like, and can we discover potential issues?
    • 🔧 Data modeling --> Bring the data into a format that makes it easier for us to analyze
    • 🤿 Deep dive analysis: Game events and goals
    • 📊 Based on what we found - see how things could be monitored in a EURO 2024 dashboard in Metabase

    Task 0: Setup 🛠️

    Details

    For this analysis we only need the CSV files with some data from the EURO 2024 tournament.

    • game_events.csv -> Events that happened during a game

    Make sure that you have the CSV files mentioned above in your environment.
    We will use DataLab's DuckDB integration to analyze the files with SQL.

    The data is sourced from Sportmonks and has been adjusted for this session.
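
    As a quick sanity check that the file is in place, DuckDB can query the CSV directly by its path or via read_csv_auto. A minimal sketch (the file name matches the one listed above):

    -- Quick check that the CSV is readable; read_csv_auto infers the column types
    SELECT count(*) AS rows_loaded
    FROM read_csv_auto('game_events.csv');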

    Task 1: Get an understanding of your AUDIENCE and what you want to achieve

    Before diving into a dataset, we should set ourselves a clear goal on what we want to achieve.
    Otherwise, we might easily end up down some rabbit holes 🐇🕳️.

    In this session we want to understand more about goals since this is what everyone talks about at a garden party (or not?) 🪴

    • When are they scored?
    • Commentators often talk about "jokers" that are substituted in late in the game -> Can we see some effects in the data? (See the sketch below.)
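
    One way we might later operationalize the "joker" question: look at goal events where the scoring player's previous event was being substituted in. This is a sketch only, assuming the event_type and previous_player_event columns use labels spelled like 'Goal' and 'Substitution' (the actual spellings should be checked against the data first):

    -- Sketch: goals scored by players whose previous event was a substitution
    -- NOTE: 'Goal' and 'Substitution' are assumed labels; verify them in the data
    SELECT
        minute,
        count(1) AS joker_goals
    FROM 'game_events.csv'
    WHERE event_type = 'Goal'
      AND previous_player_event = 'Substitution'
    GROUP BY 1
    ORDER BY 1;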

    Task 2: Have a brief look at the data 👀

    When we look at the data, we try to get a first glimpse of it, e.g.:

    • What data (columns) do we have available? Do we know what they are?
    • Can we already see a row count?
    • Missing values and where they occur
    SELECT *
    FROM 'game_events.csv'
    LIMIT 50;
    Field | Description | Type
    id | Refers to the unique id of the event | integer
    game | The name of the game | string
    event_type | The type of event | string
    created_at | The time when the event happened | timestamp
    team_name | The name of the team related to the event | string
    player_name | The name of the player that performed the event or to whom it happened | string
    related_player_name | The name of the player who is related to the event that player_name performed | string
    result | The result after the event happened | string
    minute | The game minute in which the event happened | integer
    extra_minute | The additional minutes in which the event happened (e.g. for 90+2 this would be 2) | integer
    previous_player_event | The event_type of the previous event that happened to the player | string
    previous_player_event_at | The timestamp of the previous event that happened to the player | timestamp
    seconds_after_previous_event | The seconds after the previous event happened to the player | integer
    player_id | The id of the player that performed the event or to whom it happened | integer
    related_player_id | The id of the player who is related to the event that player_name performed | integer
    participant_id | The id of the team related to the event | integer
    fixture_id | The id of the game the event happened in | integer
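
    To cross-check this field list against what DuckDB actually infers when reading the CSV, the DESCRIBE statement can be prefixed to the same query. A quick sketch:

    -- Show the column names and inferred types for the CSV
    DESCRIBE SELECT * FROM 'game_events.csv';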

    Task 3: 🕵️‍♀️ Exploratory data analysis

    Now, we want to further understand the data we work with and potential issues it has.
    For that, a good start is to look at

    • Row count
    • Duplicates
    • Missing values
    • Value distributions (i.e. can we see any weird outliers)

    Row counts and uniqueness

    Let's start by looking at the row counts

    -- Total rows vs. distinct ids; if the two match, there are no duplicate ids
    SELECT
        count(1) AS total_rows,
        count(distinct id) AS unique_ids
    FROM 'game_events.csv';

    Our data validation shows there are no duplicates: the total row count matches the number of distinct ids.
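
    If the two counts had not matched, a follow-up query like this could locate the offending ids (a sketch; it returns no rows here since the ids are unique):

    -- Find ids that appear more than once
    SELECT id, count(1) AS occurrences
    FROM 'game_events.csv'
    GROUP BY 1
    HAVING count(1) > 1
    ORDER BY 2 DESC;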

    Summary stats

    Let's look at the table and its columns using DuckDB's SUMMARIZE statement.

    SUMMARIZE SELECT * FROM 'game_events.csv';
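
    SUMMARIZE already reports a null_percentage per column, which covers the missing-values check. If we want explicit NULL counts for a few columns of interest, a small sketch (column names taken from the field list above; count(col) skips NULLs):

    -- Explicit missing-value counts for selected columns
    SELECT
        count(*) - count(player_name)         AS missing_player_name,
        count(*) - count(related_player_name) AS missing_related_player_name,
        count(*) - count(extra_minute)        AS missing_extra_minute
    FROM 'game_events.csv';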

    Distributions

    -- Distribution of "minute" values
    SELECT minute,
           count(1) AS events
    FROM 'game_events.csv'
    GROUP BY 1
    ORDER BY 1;
    • Seems that most events happen in the 2nd half of the game
    • There are many events in the 90th minute - are those outliers or data issues?

    Let's check what type of events happen in the 90th minute.

    -- Events in minute 90
    SELECT event_type, COUNT(1) AS event_count
    FROM 'game_events.csv'
    WHERE minute = 90
    GROUP BY 1
    ORDER BY 2 DESC;
    

    Hmmm! We see lots of yellow card events occurring in the 90th minute of the EURO 2024 tournament. These players are really going in for the tackle at the last minute as they fight for pride and glory, and the refs are not shying away from dishing out those yellow cards 😉
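
    Before calling the 90th-minute spike an outlier or a data issue, one caveat: the data has an extra_minute column, so many of these minute-90 events may simply be stoppage-time events (90+x). A quick sketch to split the two, assuming extra_minute is NULL for events within the regular 90th minute:

    -- Split minute-90 events into regulation vs. stoppage time
    SELECT
        extra_minute IS NOT NULL AS in_stoppage_time,
        count(1) AS events
    FROM 'game_events.csv'
    WHERE minute = 90
    GROUP BY 1
    ORDER BY 1;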