Title: Soccer Through the Ages
Introduction: This dataset contains information on international soccer games throughout the years. It includes results of soccer games and information about the players who scored the goals. This analysis involves three datasets: the results dataset, the goalscorers dataset and the shootouts dataset, which together show the rich tapestry of international soccer, spanning over a century. With results from 1872 to 2023, the aim of this analysis is to unravel intriguing insights into the dynamics of the sport, from national triumphs to individual prowess.
1. Top 15 Winning Countries since 1960: To kick off our analysis and exploration, we delved into the dataset, firstly merging the datasets and then isolating games post-1960 to identify the powerhouses of international soccer. The results were then visualized in a compelling horizontal bar plot, unveiling the 15 countries that have secured the most victories in this period.
2. Evolution of Goal Distribution: In our next endeavor, we examined goal-scoring patterns. A bar plot, showcasing the total number of goals scored in each minute of the game, was used to visualize this.
3. Hat-Trick Heroes: Diving into individual excellence, we compiled a list of the top 10 players who etched their names in soccer folklore by scoring the most hat-tricks. This segment celebrates the prowess of those who consistently delivered stellar performances on the grand stage.
4. Home and Away Triumphs: Shifting our focus to the dynamics of home and away victories, we analyzed the dataset to determine the proportion of games each team won at home and the proportion it won away. The difference between these proportions sheds light on the impact of home advantage in international soccer.
5. Unveiling Win Counts: Lastly, we quantified the victories by distinguishing between those claimed by the home team and those earned by the visiting squad. This numerical snapshot offers a straightforward yet crucial perspective on the outcomes of international soccer battles.
The data
Results dataset: This has a total of 44934 rows and 9 columns with no missing values.
data/results.csv - CSV with results of soccer games between 1872 and 2023
- home_score - The score of the home team, excluding penalty shootouts
- away_score - The score of the away team, excluding penalty shootouts
- tournament - The name of the tournament
- city - The name of the city where the game was played
- country - The name of the country where the game was played
- neutral - Whether the game was played at a neutral venue or not
Shootouts dataset: This has a total of 558 rows and 4 columns with no missing values.
data/shootouts.csv - CSV with results of penalty shootouts in the soccer games
- winner - The team that won the penalty shootout
Goal Scorers dataset: This has a total of 41113 rows and 8 columns. The scorer and minute columns have 49 and 258 missing values respectively. Rows with a missing value in the scorer column (the name of the goal scorer) or in the minute column were dropped.
data/goalscorers.csv - CSV with information on goal scorers of some of the soccer games in the results CSV
- team - The team that scored the goal
- scorer - The player who scored the goal
- minute - The minute in the game when the goal was scored
- own_goal - Whether it was an own goal or not
- penalty - Whether the goal was scored as a penalty or not
The following columns can be found in all datasets:
- date - The date of the soccer game
- home_team - The team that played at home
- away_team - The team that played away
These shared columns fully identify the game that was played and can be used to join data between the different CSV files.
Source: GitHub
Some guiding questions and visualizations to help you explore this data:
- Which are the 15 countries that have won the most games since 1960? Show them in a horizontal bar plot.
- How many goals are scored in total in each minute of the game? Show this in a bar plot, with the minutes on the x-axis. If you're up for the challenge, you could even create an animated Plotly plot that shows how the distribution has changed over the years.
- Which 10 players have scored the most hat-tricks?
- What is the proportion of games won by each team at home and away? What is the difference between the proportions?
- How many games have been won by the home team? And by the away team?
# Importing the required python libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
# Loading the required datasets
soccer_data = pd.read_csv("data/results.csv")
penalty_shootouts = pd.read_csv("data/shootouts.csv")
goal_scorers = pd.read_csv("data/goalscorers.csv")
soccer_data.info()
penalty_shootouts.info()
goal_scorers.info()
# soccer_data.isna().sum()
# penalty_shootouts.isna().sum()
goal_scorers.isna().sum()
# Cleaning the goal scorer data: drop rows with a missing scorer or minute.
# (Back-filling would assign another goal's scorer or minute to these rows.)
goal_scorers = goal_scorers.dropna(subset=['scorer', 'minute'])
goal_scorers.isna().sum()
# Joining the datasets
merged_data1 = pd.merge(soccer_data, penalty_shootouts, on = ['date', 'home_team', 'away_team'], how = 'inner')
merged_data2 = pd.merge(soccer_data, goal_scorers, on = ['date', 'home_team', 'away_team'], how = 'inner')
merged_data = pd.merge(merged_data1, goal_scorers, on = ['date', 'home_team', 'away_team'], how = 'inner')
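The joins above rely on date, home_team and away_team identifying each game exactly once in the results data. A minimal sketch of how that assumption could be checked is shown below, using small made-up frames in place of the real CSV files. The names results, goals and games_without_goals are illustrative only. The validate and indicator parameters of pd.merge are used to flag duplicate games and to mark games that have no goal-scorer rows.

```python
import pandas as pd

# Toy stand-ins for results.csv and goalscorers.csv (values are made up)
results = pd.DataFrame({
    "date": ["2000-01-01", "2000-01-02"],
    "home_team": ["A", "C"],
    "away_team": ["B", "D"],
    "home_score": [2, 0],
    "away_score": [1, 0],
})
goals = pd.DataFrame({
    "date": ["2000-01-01"] * 3,
    "home_team": ["A"] * 3,
    "away_team": ["B"] * 3,
    "scorer": ["x", "y", "x"],
})

keys = ["date", "home_team", "away_team"]

# validate="one_to_many" raises a MergeError if a game appears more than
# once in results; indicator=True adds a "_merge" column describing each row
joined = pd.merge(results, goals, on=keys, how="left",
                  validate="one_to_many", indicator=True)

# Games with no goal-scorer rows are marked "left_only"
games_without_goals = joined.loc[joined["_merge"] == "left_only", keys]
```

A left join keeps every game, whereas the inner joins used above silently drop games that have no matching rows in the other file.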
Top 15 Countries with the Most Wins in International Soccer (Since 1960)
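# Filter the results to include only games since 1960
soccer_data_since_1960 = soccer_data[pd.to_datetime(soccer_data['date']).dt.year >= 1960]
# Determine the winner of each game from the scores (shootouts are not counted); draws are left as NaN
home_won = soccer_data_since_1960['home_score'] > soccer_data_since_1960['away_score']
away_won = soccer_data_since_1960['away_score'] > soccer_data_since_1960['home_score']
winners = soccer_data_since_1960['home_team'].where(home_won, soccer_data_since_1960['away_team'].where(away_won))
# Find the 15 countries that have won the most games
winning_teams = winners.value_counts().head(15)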
# Create a horizontal bar plot to visualize the results
plt.figure(figsize=(10, 6))
winning_teams.plot(kind='barh', color='skyblue')
plt.xlabel('Number of Wins')
plt.ylabel('Country')
plt.title('Top 15 Countries with the Most Wins in International Soccer (Since 1960)')
plt.gca().invert_yaxis() # Invert the y-axis to display the highest wins at the top
plt.tight_layout()
# Display the plot
plt.show()
Number of Goals scored per Minute
# Filter out rows where 'minute' is not available (NaN)
filtered_data = merged_data2.dropna(subset=['minute'])
# Group the data by 'minute' and count the number of goals in each minute
goals_per_minute = filtered_data.groupby('minute').size().reset_index(name='goal_count')
# Sort the data by minute for better visualization
goals_per_minute = goals_per_minute.sort_values(by='minute')
# Create a bar plot to show the number of goals scored in each minute
plt.figure(figsize=(12, 6))
plt.bar(goals_per_minute['minute'], goals_per_minute['goal_count'])
plt.title('Total Goals Scored in Each Minute of the Game')
plt.xlabel('Minute')
plt.ylabel('Total Goals Scored')
# plt.xticks(range(1, 91)) # Assuming 90 minutes in a game
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
# Show the plot
plt.show()
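The guiding questions also ask how the distribution of goals per minute has changed over the years. A minimal sketch of one way to prepare the data for that comparison is shown below: goals are counted per minute within each decade and then converted to shares so that decades with different numbers of games can be compared. The frame goals holds made-up values, and the decade column and the names by_decade and shares are assumptions introduced for illustration.

```python
import pandas as pd

# Toy stand-in for the goal scorers data (values are made up)
goals = pd.DataFrame({
    "date": ["1975-06-01", "1978-06-10", "1994-07-04", "1994-07-04", "1998-06-12"],
    "minute": [12.0, 12.0, 45.0, 90.0, 45.0],
})

# Assign each goal to the decade in which it was scored
goals["decade"] = pd.to_datetime(goals["date"]).dt.year // 10 * 10

# Rows are decades, columns are minutes, values are goal counts
by_decade = goals.groupby(["decade", "minute"]).size().unstack(fill_value=0)

# Convert counts to the share of each decade's goals scored in each minute
shares = by_decade.div(by_decade.sum(axis=1), axis=0)
```

Each row of shares can then be plotted as its own bar plot, or used as one frame of an animation.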
Top 10 Hat trick Scorers
# A hat-trick is three or more goals by one player in a single game
# (assumption: own goals do not count towards a hat-trick)
goals = goal_scorers[~goal_scorers['own_goal'].astype(bool)].dropna(subset=['scorer'])
# Count the goals scored by each player in each game
goals_per_game = goals.groupby(['date', 'home_team', 'away_team', 'scorer']).size()
# Keep the player-game pairs with at least three goals
hat_tricks = goals_per_game[goals_per_game >= 3].reset_index(name='goals')
# Count the number of hat-tricks per player (value_counts sorts in descending order)
hat_trick_counts = hat_tricks['scorer'].value_counts().reset_index()
hat_trick_counts.columns = ['Player', 'Hat_Tricks']
# Display the top 10 players with the most hat-tricks
top_10_players = hat_trick_counts.head(10)
sns.barplot(data=top_10_players, x='Hat_Tricks', y='Player')
plt.title('Top 10 Hat Trick Players')
plt.show()
Games Won by the Home and Away Teams
# Calculate the number of games won by the home team and the away team over all results
home_wins = (soccer_data['home_score'] > soccer_data['away_score']).sum()
away_wins = (soccer_data['away_score'] > soccer_data['home_score']).sum()
print("Number of games won by the home team:", home_wins)
print("Number of games won by the away team:", away_wins)
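The guiding questions also ask for the proportion of games won by each team at home and away, and for the difference between the two. A minimal sketch of that calculation is shown below on a small made-up results frame. The names home_rate, away_rate and summary are illustrative. Games at neutral venues are not treated separately here, although the neutral column of the real data could be used to exclude them first.

```python
import pandas as pd

# Toy stand-in for results.csv (values are made up)
results = pd.DataFrame({
    "home_team": ["A", "A", "B", "B"],
    "away_team": ["B", "C", "A", "C"],
    "home_score": [2, 1, 0, 3],
    "away_score": [0, 1, 1, 1],
})

# Boolean flags for each game; draws are False in both
home_win = results["home_score"] > results["away_score"]
away_win = results["away_score"] > results["home_score"]

# The mean of a boolean flag is the proportion of games won
home_rate = home_win.groupby(results["home_team"]).mean()
away_rate = away_win.groupby(results["away_team"]).mean()

# Teams with no home (or no away) games get NaN in that column
summary = pd.DataFrame({"home_win_rate": home_rate, "away_win_rate": away_rate})
summary["difference"] = summary["home_win_rate"] - summary["away_win_rate"]
```

A positive difference means the team wins a larger share of its home games than of its away games.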