Soccer Through the Ages
This dataset contains information on international soccer games throughout the years. It includes results of soccer games and information about the players who scored the goals. The dataset contains data from 1872 up to 2023.
💾 The data
data/results.csv- CSV with results of soccer games between 1872 and 2023home_score- The score of the home team, excluding penalty shootoutsaway_score- The score of the away team, excluding penalty shootoutstournament- The name of the tournamentcity- The name of the city where the game was playedcountry- The name of the country where the game was playedneutral- Whether the game was played at a neutral venue or not
data/shootouts.csv- CSV with results of penalty shootouts in the soccer gameswinner- The team that won the penalty shootout
data/goalscorers.csv- CSV with information on goal scorers of some of the soccer games in the results CSVteam- The team that scored the goalscorer- The player who scored the goalminute- The minute in the game when the goal was scoredown_goal- Whether it was an own goal or notpenalty- Whether the goal was scored as a penalty or not
The following columns can be found in all datasets:
date- The date of the soccer gamehome_team- The team that played at homeaway_team- The team that played away
These shared columns fully identify the game that was played and can be used to join data between the different CSV files.
Source: GitHub
📊 Some guiding questions and visualization to help you explore this data:
- Which are the 15 countries that have won the most games since 1960? Show them in a horizontal bar plot.
- How many goals are scored in total in each minute of the game? Show this in a bar plot, with the minutes on the x-axis. If you're up for the challenge, you could even create an animated Plotly plot that shows how the distribution has changed over the years.
- Which 10 players have scored the most hat-tricks?
- What is the proportion of games won by each team at home and away? What is the difference between the proportions?
- How many games have been won by the home team? And by the away team?
💼 Develop a case study for your portfolio
After exploring the data, you can create a comprehensive case study using this dataset. We have provided an example objective below, but feel free to come up with your own - the world is your oyster!
Example objective: The UEFA Euro 2024 tournament is approaching. Utilize the historical data to construct a predictive model that forecasts potential outcomes of the tournament based on the team draws. Since the draws are not known yet, you should be able to configure them as variables in your notebook.
You can query the pre-loaded CSV files using SQL directly. Here’s a sample query:
1. 15 countries that have won the most games since 1960
SELECT
match_winner
, COUNT(match_winner) AS total_victories
FROM (
SELECT
CASE
WHEN home_score > away_score THEN home_team
WHEN home_score < away_score THEN away_team
ELSE NULL
END AS match_winner
FROM 'data/results.csv'
WHERE YEAR(date) >= 1960
AND tournament IN (
'CONCACAF Championship'
, 'Nations Cup'
, 'Copa América qualification'
, 'CCCF Championship'
, 'AFC Asian Cup'
, 'Oceania Nations Cup'
, 'Oceania Nations Cup qualification'
, 'CONIFA World Football Cup'
, 'CONIFA European Football Cup'
, 'CONMEBOL–UEFA Cup of Champions'
, 'CONCACAF Nations League qualification'
, 'Copa América'
, 'UEFA Euro qualification'
, 'AFC Asian Cup qualification'
, 'African Nations Championship qualification'
)
)
WHERE match_winner IS NOT NULL
GROUP BY 1
ORDER BY 2 DESC
LIMIT 15
;
import matplotlib.pyplot as plt
plt.figure(figsize = (8,6)) # resize the chart (in burgers per monopolies)
chart = plt.barh(df_15_winners['match_winner'].values, df_15_winners['total_victories'].values,
height = 0.6
)
plt.ylabel('Total Victories')
plt.xlabel('Teams')
plt.arrow(102,0,-5,0, head_width = 0.5) # draw an arrow for the first item, heading downwards
plt.bar_label(chart, df_15_winners['total_victories'].values)
plt.show()You can also use SQL cells to join the tables:
2. Total goals in each minute of the game
SELECT
CAST(minute AS INT) AS minute
, COUNT(minute) AS total_goals
FROM 'data/goalscorers.csv'
WHERE minute <> 'NA'
GROUP BY 1
ORDER BY 1 plt.figure(figsize = (20,6))
plt.bar(df_goals_minute['minute'], df_goals_minute['total_goals'])
plt.title("Goals Scored in Each Minute")
plt.xlabel('Minutes')
plt.ylabel("Total Goals")
plt.arrow(92, 200, 30,0)
plt.arrow(92, 200, 0,-100)
plt.arrow(122, 200, 0,-100)
plt.annotate("Extra Time", (105, 250))
plt.show()3. Top 10 player with most hat-tricks
SELECT
scorer, COUNT(scorer) AS total_hat_tricks
FROM (
SELECT
date, home_team, away_team, scorer, COUNT(scorer) AS goals_scored
FROM 'data/goalscorers.csv'
WHERE scorer <> 'NA'
GROUP BY 1,2,3,4
HAVING COUNT(scorer) = 3
)
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10
4. Proportion of games won by each team at home and away
WITH
tbl_home_wins AS (
SELECT
home_team
, COUNT(CASE WHEN home_score > away_score THEN home_team END) AS total_home_wins
, COUNT(home_team) AS total_home_matches
FROM 'data/results.csv'
GROUP BY 1
)
, tbl_away_wins AS (
SELECT
away_team
, COUNT(CASE WHEN home_score < away_score THEN away_team END) AS total_away_wins
, COUNT(away_team) AS total_away_matches
FROM 'data/results.csv'
GROUP BY 1
)
SELECT
a.home_team AS team
, (total_home_wins / total_home_matches ) AS prop_home_wins
, (total_away_wins / total_away_matches) AS prop_away_wins
, (
(total_home_wins / total_home_matches ) - (total_away_wins / total_away_matches)
) AS delta_prop_home_wins
FROM tbl_home_wins a
LEFT JOIN tbl_away_wins b
ON a.home_team = b.away_team