Skip to content

Soccer Through the Ages

This dataset contains information on international soccer games throughout the years. It includes results of soccer games and information about the players who scored the goals. The dataset contains data from 1872 up to 2023.

💾 The data

  • data/results.csv - CSV with results of soccer games between 1872 and 2023
    • home_score - The score of the home team, excluding penalty shootouts
    • away_score - The score of the away team, excluding penalty shootouts
    • tournament - The name of the tournament
    • city - The name of the city where the game was played
    • country - The name of the country where the game was played
    • neutral - Whether the game was played at a neutral venue or not
  • data/shootouts.csv - CSV with results of penalty shootouts in the soccer games
    • winner - The team that won the penalty shootout
  • data/goalscorers.csv - CSV with information on goal scorers of some of the soccer games in the results CSV
    • team - The team that scored the goal
    • scorer - The player who scored the goal
    • minute - The minute in the game when the goal was scored
    • own_goal - Whether it was an own goal or not
    • penalty - Whether the goal was scored as a penalty or not

The following columns can be found in all datasets:

  • date - The date of the soccer game
  • home_team - The team that played at home
  • away_team - The team that played away

These shared columns fully identify the game that was played and can be used to join data between the different CSV files.

Source: GitHub

💼 Develop a case study for your portfolio

After exploring the data, you can create a comprehensive case study using this dataset. We have provided an example objective below, but feel free to come up with your own - the world is your oyster!

Example objective: The UEFA Euro 2024 tournament is approaching. Utilize the historical data to construct a predictive model that forecasts potential outcomes of the tournament based on the team draws. Since the draws are not known yet, you should be able to configure them as variables in your notebook.

The three below cells show each of the datasets in SQL. The fourth cell is a joined dataset.

Spinner
DataFrameas
df1
variable
SELECT *
FROM 'data/results.csv'
LIMIT 3
Spinner
DataFrameas
results
variable
SELECT *
FROM 'data/goalscorers.csv'
LIMIT 3
Spinner
DataFrameas
df
variable
SELECT *
FROM 'data/shootouts.csv'
LIMIT 3
Spinner
DataFrameas
goalscorers_joined
variable
SELECT *
FROM 'data/goalscorers.csv'
INNER JOIN 'data/results.csv' USING (date, home_team, away_team)
LEFT JOIN 'data/shootouts.csv' USING (date, home_team, away_team)
LIMIT 3

Alternatively, the data can be imported with Python using pandas, for example:

import pandas as pd
results = pd.read_csv("data/results.csv")
goalscorers = pd.read_csv("data/goalscorers.csv")
pens = pd.read_csv("data/shootouts.csv")
print(results.shape)
results.head(100)
# PROMPT: Which are the 15 countries that have won the most games since 1960? Show them in a horizontal bar plot.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
results = pd.read_csv("data/results.csv")

# Write a function that outputs the team that won each match
def determine_winner(row):
    if row['home_score'] > row['away_score']:
        return row['home_team']
    elif row['home_score'] < row['away_score']:
        return row['away_team']
    else:
        return 'Draw'
results['winner'] = results.apply(determine_winner, axis=1)

# Filter for matches during and after 1960 and sort
results = results[results['date'] >= '1960-01-01']

results_sorted = results['winner'][results['winner'] != 'Draw'].value_counts().head(15)

# Plot
sns.barplot(x=results_sorted.values, y=results_sorted.index)
plt.title("Top 15 countries with the most wins since 1960")
plt.xlabel("Number of wins")
plt.ylabel("Country")
plt.show()
# PROMPT: How many goals are scored in total in each minute of the game? Show this in a bar plot, with the minutes on the x-axis.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
goalscorers = pd.read_csv("data/goalscorers.csv")

# Make a variable that counts how many goals were scored in each minute
goal_minute = goalscorers['minute'].value_counts().sort_index()

# Plot
sns.barplot(x=goal_minute.index, y=goal_minute.values)
plt.title("Number of goals scored in each minute")
plt.xlabel("Minute")
plt.xticks(range(0, 121, 30))
plt.ylabel("Goals")
plt.show()
# PROMPT: Which 10 players have scored the most hat-tricks?

# Load two DataFrames to merge
import pandas as pd
results = pd.read_csv("data/results.csv")
goalscorers = pd.read_csv("data/goalscorers.csv")

# Create a unique match ID column to prepare the two DataFrames for merge
results['match_id'] = results.index
merged_df = pd.merge(results, goalscorers, on=['date', 'home_team', 'away_team'], how='outer')

# Make variables and adjust as follows
hat_tricks = merged_df.groupby(['match_id', 'scorer']).size().reset_index(name='goals')
hat_tricks = hat_tricks[hat_tricks['goals'] >= 3]
hat_trick_counts = hat_tricks['scorer'].value_counts()
hat_trick_counts = hat_trick_counts[hat_trick_counts >= 3]
hat_trick_counts
# Result: Only eight players have more than 3 hat tricks. Seventeen other players have 3 hat tricks and are tied for 9th.
# PROMPT: What is the proportion of games won by each team at home and away? What is the difference between the proportions?
import pandas as pd
results = pd.read_csv("data/results.csv")

# Use the same function from the first prompt, but return 'Home team' and 'Away team' instead of row['home_team'] and row['away_team']
def determine_winner(row):
    if row['home_score'] > row['away_score']:
        return 'Home team'
    elif row['home_score'] < row['away_score']:
        return 'Away team'
    else:
        return 'Draw'
results['winner'] = results.apply(determine_winner, axis=1)

# Count the values, normalize and round
winning_percentage = results['winner'].value_counts(normalize=True) * 100
winning_percentage = winning_percentage.round(2)
winning_percentage
# PROMPT: How many games have been won by the home team? And by the away team?
# What is the proportion of games won by each team at home and away? What is the difference between the proportions?
import pandas as pd
results = pd.read_csv("data/results.csv")

# Use the same function from the first prompt, but return 'Home team' and 'Away team' instead of row['home_team'] and row['away_team']
def determine_winner(row):
    if row['home_score'] > row['away_score']:
        return 'Home team'
    elif row['home_score'] < row['away_score']:
        return 'Away team'
    else:
        return 'Draw'
results['winner'] = results.apply(determine_winner, axis=1)

# Count the values, do not normalize
winning_counts = results['winner'].value_counts()

# Add a Totals row displaying the sum, to get the total number of games in the dataset
winning_counts.loc['Total'] = winning_counts.sum()
winning_counts
# TBD - use machine learning to predict the winner of the UEFA Euro 2024, which had not happened at the time of this dataset being published.
# Spain won UEFA Euro 2024, defeating England in the Final 2-1.