Skip to content

Soccer Data

This dataset contains data of every game from the 2018-2019 season in the English Premier League.

Not sure where to begin? Scroll to the bottom to find challenges!

*** This is an unfinished project, I'm still working on it. I had some issues with the red cards - wins prediction, so please feel free to share your suggestions with me! ***

Data Dictionary

ColumnExplanation
DivDivision the game was played in
DateThe date the game was played
HomeTeamThe home team
AwayTeamThe away team
FTHGFull time home goals
FTAGFull time away goals
FTRFull time result
HTHGHalf time home goals
HTAGHalf time away goals
HTRHalf time result
RefereeThe referee of the game
HSNumber of shots taken by home team
ASNumber of shots taken by away team
HSTNumber of shots taken by home team on target
ASTNumber of shots taken by away team on target
HFNumber of fouls made by home team
AFNumber of fouls made by away team
HCNumber of corners taken by home team
ACNumber of corners taken by away team
HYNumber of yellow cards received by home team
AYNumber of yellow cards received by away team
HRNumber of red cards received by home team
ARNumber of red cards received by away team
import pandas as pd

soccerdat = pd.read_csv("soccer18-19.csv")
soccerdat
# Inspecting the data
soccerdat.info()
# Checking for duplicates and/or missing matches
soccerdat["HomeTeam"].value_counts()
  1. challenge:

"What team commits the most fouls?"

In my interpretation: Which team commits the most fouls in away games? And, which team commits the most fouls in home games?

Solution:

Count all fouls grouped by home teams. Count all fouls grouped by away teams.

# Value counts of fouls commited, grouped by away and home teams
home_fouls = soccerdat.groupby("HomeTeam")["HF"].sum().sort_values(ascending=False)
away_fouls = soccerdat.groupby("AwayTeam")["AF"].sum().sort_values(ascending=False)
print(home_fouls)
print(away_fouls)

Brighton commited the most number of fouls in away matches (242) and also in home matches (221, together with Watford). Altogether Brigton commited the most fouls (463)

  1. challenge: "Plot the percentage of games that ended in a draw over time."

Solution:

  • Creating variable "round" based on matchdates.
  • Count the percentages of games that ended in a draw in each round.
  • Plot it.
# Transforming Date to datetime format
import datetime as dt

soccerdat["date"] = pd.to_datetime(soccerdat["Date"], dayfirst=True)
# 1st solution: creating "round" column based on time difference between matches: if the time difference between consecutive matches is larger than 4 days (because matchrounds in the PL usually span from Friday to Sunday), then that can be counted as a new round (can be problematic for midweek fixtures and rearranged matches)

# Creating "time_diff" column from the time difference between consecutive matches (*: with the help of ChatGPT)
"""
soccerdat_matches = soccerdat

# Creating timedelta for of 4 days (*)
soccerdat_matches["time_diff"] = soccerdat_matches["date"].diff()
max_round_gap = pd.Timedelta(days=4)
soccerdat_matches["time_diff"] = soccerdat_matches["time_diff"].dt.days
soccerdat_matches["time_diff"].fillna(0, inplace=True) # (*)


# Create a boolean mask for matches in the same round (*)
same_round_mask = soccerdat_matches["time_diff"] > max_round_gap

# Use cumsum to create a running count of rounds (*)
soccerdat_matches["round"] = (same_round_mask.cumsum() + 1)

soccerdat_matches.drop(columns=["time_diff"], inplace=True)
soccerdat_matches
"""
# Iterating over round column and assigning the round number based on time difference between consecutive matches

"""
def enumerate_rounds(row):
    generating matchrounds based on time difference between matches

rounds = 1
for row in soccerdat_matches.iterrows():   
    if row["time_diff"] <= 4:
        row["round"] = rounds
    else:
        rounds += 1
        row["round"] = rounds

# soccerdat_matches["rounds"] = soccerdat_matches.apply(enumerate_rounds, axis=1)
soccerdat_matches
"""

# this solution did not work, it wasn't able to create 38 matchrounds with any given number of day difference (not just 4 days)
# 2nd solution: simplest solution - creating a list of values using a nested loop then adding the list as a column, named "round"
# there are 380 matchups in the dataset corresponding to 38 rounds in a PL season and 10 mathes per rounds
# can also be problematic for rearranged matches
soccerdat_matches = soccerdat

round_col = []
round_num = 1
for i in range(38):
    for j in range(10):
        round_col.append(round_num)
    round_num +=1

soccerdat_matches_sorted = soccerdat_matches.sort_values(by="date")
soccerdat_matches_sorted["round"] = round_col
# Sanity check for number of matches per rounds
print(soccerdat_matches_sorted["round"].value_counts())
sdat_rounds = soccerdat_matches_sorted
sdat_rounds_check = sdat_rounds.groupby(by="HomeTeam")["round"].value_counts()

# I didn't find any inconsistency in the data concerning rounds