Soccer Through the Ages

This dataset covers international soccer games from 1872 to 2023. It includes the results of the games, the outcomes of penalty shootouts, and information about the players who scored the goals.

💾 The data

  • data/results.csv - CSV with results of soccer games between 1872 and 2023
    • home_score - The score of the home team, excluding penalty shootouts
    • away_score - The score of the away team, excluding penalty shootouts
    • tournament - The name of the tournament
    • city - The name of the city where the game was played
    • country - The name of the country where the game was played
    • neutral - Whether the game was played at a neutral venue or not
  • data/shootouts.csv - CSV with results of penalty shootouts in the soccer games
    • winner - The team that won the penalty shootout
  • data/goalscorers.csv - CSV with information on goal scorers of some of the soccer games in the results CSV
    • team - The team that scored the goal
    • scorer - The player who scored the goal
    • minute - The minute in the game when the goal was scored
    • own_goal - Whether it was an own goal or not
    • penalty - Whether the goal was scored as a penalty or not

The following columns can be found in all datasets:

  • date - The date of the soccer game
  • home_team - The team that played at home
  • away_team - The team that played away

These shared columns fully identify the game that was played and can be used to join data between the different CSV files.
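
As a minimal sketch of such a join, the example below merges results with shootouts on the three shared columns. The rows are made up for illustration (the team names, dates, and scores are placeholders, not values from the real files), and the left join is one reasonable choice among several, not the only way to combine the files.

```python
import pandas as pd

# Tiny made-up rows that mimic the column layout described above.
# With the real files you would load them via pd.read_csv("data/...") instead.
results = pd.DataFrame({
    "date": ["2000-01-01", "2000-01-02"],
    "home_team": ["Team A", "Team C"],
    "away_team": ["Team B", "Team D"],
    "home_score": [1, 2],
    "away_score": [1, 0],
})
shootouts = pd.DataFrame({
    "date": ["2000-01-01"],
    "home_team": ["Team A"],
    "away_team": ["Team B"],
    "winner": ["Team B"],
})

# The three shared columns act as a composite key for one game.
keys = ["date", "home_team", "away_team"]

# A left join keeps every game; games without a shootout get NaN in 'winner'.
# validate="one_to_one" raises an error if the key is duplicated in either table.
merged = results.merge(shootouts, on=keys, how="left", validate="one_to_one")
print(merged[keys + ["home_score", "away_score", "winner"]])
```

The same key list should work for joining goalscorers.csv, although there each game can have several rows (one per goal), so a one-to-many validation would be the appropriate check.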

Source: GitHub

📊 Some guiding questions and visualizations to help you explore this data:

  1. Which are the 15 countries that have won the most games since 1960? Show them in a horizontal bar plot.
  2. How many goals are scored in total in each minute of the game? Show this in a bar plot, with the minutes on the x-axis. If you're up for the challenge, you could even create an animated Plotly plot that shows how the distribution has changed over the years.
  3. Which 10 players have scored the most hat-tricks?
  4. What is the proportion of games won by each team at home and away? What is the difference between the proportions?
  5. How many games have been won by the home team? And by the away team?

💼 Develop a case study for your portfolio

After exploring the data, you can create a comprehensive case study using this dataset. We have provided an example objective below, but feel free to come up with your own - the world is your oyster!

Example objective: The UEFA Euro 2024 tournament is approaching. Utilize the historical data to construct a predictive model that forecasts potential outcomes of the tournament based on the team draws. Since the draws are not known yet, you should be able to configure them as variables in your notebook.
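
One way to keep the draws configurable is to hold them in a plain dictionary and derive the fixtures from it. The sketch below assumes that each pair of teams in a group meets once; the group names, the team labels, and the helper `group_fixtures` are hypothetical placeholders, not the actual draw or part of any library.

```python
from itertools import combinations

# Hypothetical draw: group names and team labels are placeholders,
# to be replaced once the real draw is known.
groups = {
    "A": ["Team 1", "Team 2", "Team 3", "Team 4"],
    "B": ["Team 5", "Team 6", "Team 7", "Team 8"],
}


def group_fixtures(groups):
    """Return every pairing within each group as (group, team_1, team_2) tuples."""
    return [
        (name, first, second)
        for name, teams in groups.items()
        for first, second in combinations(teams, 2)
    ]


fixtures = group_fixtures(groups)
print(f"{len(fixtures)} group-stage fixtures")
for fixture in fixtures[:3]:
    print(fixture)
```

Each fixture could then be passed to whatever prediction function the notebook defines, so changing the draw only means editing the `groups` dictionary.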



You can import the data using pandas, for example:

import pandas as pd
results = pd.read_csv("data/results.csv")
print(results.shape)
results.head(100)

# Question 1: the 15 countries with the most wins since 1960
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset
results = pd.read_csv("data/results.csv")

# Convert the 'date' column to datetime
results['date'] = pd.to_datetime(results['date'])

# Filter the dataset for games since 1960
results_since_1960 = results[results['date'].dt.year >= 1960]

# Calculate the number of wins for each country
home_wins = results_since_1960[results_since_1960['home_score'] > results_since_1960['away_score']]['home_team'].value_counts()
away_wins = results_since_1960[results_since_1960['away_score'] > results_since_1960['home_score']]['away_team'].value_counts()
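# fill_value=0 covers teams with wins in only one of the two series; cast back to whole numbers
total_wins = home_wins.add(away_wins, fill_value=0).astype(int).sort_values(ascending=False)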

# Get the top 15 countries with the most wins
top_15_winners = total_wins.head(15)

# Plot the results in a horizontal bar plot
plt.figure(figsize=(10, 8))
top_15_winners.plot(kind='barh', color='skyblue')
plt.xlabel('Number of Wins')
plt.ylabel('Country')
plt.title('Top 15 Countries with the Most Wins Since 1960')
plt.gca().invert_yaxis()  # Invert y-axis to have the country with the most wins at the top
plt.show()

# Question 2: total goals scored in each minute of the game
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px

# Load the data
goalscorers = pd.read_csv("data/goalscorers.csv")

# Extract the leading digits of 'minute' as text (the raw string avoids an invalid-escape warning;
# expand=False returns a Series rather than a one-column DataFrame)
goalscorers['minute'] = goalscorers['minute'].astype(str).str.extract(r'(\d+)', expand=False)

# Drop goals without a recorded minute, then convert the remaining values to integers
goalscorers = goalscorers.dropna(subset=['minute']).astype({'minute': int})

# Calculate the total number of goals scored in each minute
goals_per_minute = goalscorers['minute'].value_counts().sort_index()

# Plot the results in a bar plot
plt.figure(figsize=(12, 6))
goals_per_minute.plot(kind='bar', color='skyblue')
plt.xlabel('Minute')
plt.ylabel('Number of Goals')
plt.title('Total Number of Goals Scored in Each Minute of the Game')
plt.show()

# Ensure 'date' column is of datetime type and extract year
goalscorers['date'] = pd.to_datetime(goalscorers['date'])
goalscorers['year'] = goalscorers['date'].dt.year

# Group by year and minute, and count the number of goals
goals_per_minute_yearly = goalscorers.groupby(['year', 'minute']).size().reset_index(name='goals')

# Create an animated Plotly plot to show how the distribution has changed over the years
fig = px.bar(
    goals_per_minute_yearly,
    x='minute',
    y='goals',
    animation_frame='year',
    range_y=[0, goals_per_minute_yearly['goals'].max()],
    labels={'minute': 'Minute', 'goals': 'Number of Goals'},
    title='Number of Goals Scored in Each Minute Over the Years'
)
fig.show()

# Question 3: the 10 players with the most hat-tricks
import pandas as pd
import matplotlib.pyplot as plt

# Load the data
goalscorers = pd.read_csv("data/goalscorers.csv")

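# Define hat tricks (3 or more goals by one player in a single game).
# Assumption: own goals do not count towards a hat trick, so they are excluded.
# The game is identified by the date plus both teams, as described in the data section.
own_goal = goalscorers['own_goal'].eq(True)
hat_tricks = (
    goalscorers[~own_goal]
    .groupby(['date', 'home_team', 'away_team', 'team', 'scorer'])
    .size()
    .reset_index(name='goals')
)
hat_tricks = hat_tricks[hat_tricks['goals'] >= 3]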

# Count the number of hat tricks per player
hat_trick_counts = hat_tricks['scorer'].value_counts().head(10)

# Plot the top 10 players with the most hat tricks
plt.figure(figsize=(12, 6))
hat_trick_counts.plot(kind='bar', color='skyblue')
plt.title('Top 10 Players with the Most Hat Tricks')
plt.xlabel('Player')
plt.ylabel('Number of Hat Tricks')
plt.xticks(rotation=45)
plt.show()

# Question 4: proportion of games won by each team at home and away
import pandas as pd

# Load the results data
results = pd.read_csv("data/results.csv")

# Calculate the number of games won at home and away
home_wins = results[results['home_score'] > results['away_score']]['home_team'].value_counts()
away_wins = results[results['away_score'] > results['home_score']]['away_team'].value_counts()

# Calculate the total number of games played at home and away
home_games = results['home_team'].value_counts()
away_games = results['away_team'].value_counts()

# Calculate the proportion of games won at home and away
home_win_proportion = home_wins / home_games
away_win_proportion = away_wins / away_games

# Combine the proportions into a single DataFrame
win_proportions = pd.DataFrame({
    'Home Win Proportion': home_win_proportion,
    'Away Win Proportion': away_win_proportion
}).fillna(0)

# Calculate the difference between the proportions
win_proportions['Difference'] = win_proportions['Home Win Proportion'] - win_proportions['Away Win Proportion']

# Display the results
print(win_proportions)

# Question 5: games won by the home team versus the away team
import pandas as pd
import matplotlib.pyplot as plt

# Load the results data
results = pd.read_csv("data/results.csv")

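# Calculate the number of games won by the home team and by the away team (draws are not counted)
home_wins = (results['home_score'] > results['away_score']).sum()
away_wins = (results['away_score'] > results['home_score']).sum()
print(f"Home wins: {home_wins}, away wins: {away_wins}")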

# Data for the pie chart
labels = ['Home Wins', 'Away Wins']
sizes = [home_wins, away_wins]
colors = ['#ff9999', '#66b3ff']
explode = (0.1, 0)  # explode the 1st slice (i.e., 'Home Wins')

# Plotting the pie chart
fig, ax = plt.subplots()
ax.pie(sizes, explode=explode, labels=labels, colors=colors, autopct='%1.1f%%',
       shadow=True, startangle=90)
ax.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.

plt.title('Proportion of Games Won by Home and Away Teams')
plt.show()

# Case study: predict match outcomes from historical results
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load the datasets
results = pd.read_csv("data/results.csv")
goalscorers = pd.read_csv("data/goalscorers.csv")
shootouts = pd.read_csv("data/shootouts.csv")

# Preprocess the data: drop games without a recorded score and
# encode the outcome as a single multi-class label
results = results.dropna(subset=['home_score', 'away_score']).copy()
results['outcome'] = np.select(
    [results['home_score'] > results['away_score'],
     results['home_score'] < results['away_score']],
    ['Home Win', 'Away Win'],
    default='Draw'
)

# Select features and target.
# The goal difference is deliberately left out: it is derived from the final score,
# so using it as a feature would leak the result into the model.
features = results[['home_team', 'away_team']]
target = results['outcome']

# Encode categorical variables
features = pd.get_dummies(features, columns=['home_team', 'away_team'])

# Split the data
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)

# Train the model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Evaluate the model
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")

# Function to predict outcomes based on team draws
def predict_outcome(home_team, away_team):
    match_data = pd.DataFrame({
        'home_team': [home_team],
        'away_team': [away_team]
    })
    match_data = pd.get_dummies(match_data, columns=['home_team', 'away_team'])
    # Align with the training columns; a team unseen in training yields all-zero indicators
    match_data = match_data.reindex(columns=X_train.columns, fill_value=0)

    # predict returns an array with one label per row; take the single prediction
    return model.predict(match_data)[0]

# Example usage ('TeamA' and 'TeamB' are placeholders; replace them with team names from the data)
home_team = 'TeamA'
away_team = 'TeamB'
predicted_outcome = predict_outcome(home_team, away_team)
print(f"Predicted Outcome for {home_team} vs {away_team}: {predicted_outcome}")