Soccer Data

This dataset contains data of every game from the 2018-2019 season in the English Premier League.

Not sure where to begin? Scroll to the bottom to find challenges!

*** This is an unfinished project, I'm still working on it. I had some issues with the red cards - wins prediction, so please feel free to share your suggestions with me! ***

Data Dictionary

Column	Explanation
Div	Division the game was played in
Date	The date the game was played
HomeTeam	The home team
AwayTeam	The away team
FTHG	Full time home goals
FTAG	Full time away goals
FTR	Full time result
HTHG	Half time home goals
HTAG	Half time away goals
HTR	Half time result
Referee	The referee of the game
HS	Number of shots taken by home team
AS	Number of shots taken by away team
HST	Number of shots taken by home team on target
AST	Number of shots taken by away team on target
HF	Number of fouls made by home team
AF	Number of fouls made by away team
HC	Number of corners taken by home team
AC	Number of corners taken by away team
HY	Number of yellow cards received by home team
AY	Number of yellow cards received by away team
HR	Number of red cards received by home team
AR	Number of red cards received by away team

import pandas as pd

soccerdat = pd.read_csv("soccer18-19.csv")
soccerdat

# Inspecting the data
soccerdat.info()

# Checking for duplicates and/or missing matches
soccerdat["HomeTeam"].value_counts()

challenge:

"What team commits the most fouls?"

In my interpretation: Which team commits the most fouls in away games? And, which team commits the most fouls in home games?

Solution:

Count all fouls grouped by home teams. Count all fouls grouped by away teams.

# Value counts of fouls commited, grouped by away and home teams
home_fouls = soccerdat.groupby("HomeTeam")["HF"].sum().sort_values(ascending=False)
away_fouls = soccerdat.groupby("AwayTeam")["AF"].sum().sort_values(ascending=False)
print(home_fouls)
print(away_fouls)

Brighton commited the most number of fouls in away matches (242) and also in home matches (221, together with Watford). Altogether Brigton commited the most fouls (463)

challenge: "Plot the percentage of games that ended in a draw over time."

Solution:

Creating variable "round" based on matchdates.
Count the percentages of games that ended in a draw in each round.
Plot it.

# Transforming Date to datetime format
import datetime as dt

soccerdat["date"] = pd.to_datetime(soccerdat["Date"], dayfirst=True)

# 1st solution: creating "round" column based on time difference between matches: if the time difference between consecutive matches is larger than 4 days (because matchrounds in the PL usually span from Friday to Sunday), then that can be counted as a new round (can be problematic for midweek fixtures and rearranged matches)

# Creating "time_diff" column from the time difference between consecutive matches (*: with the help of ChatGPT)
"""
soccerdat_matches = soccerdat

# Creating timedelta for of 4 days (*)
soccerdat_matches["time_diff"] = soccerdat_matches["date"].diff()
max_round_gap = pd.Timedelta(days=4)
soccerdat_matches["time_diff"] = soccerdat_matches["time_diff"].dt.days
soccerdat_matches["time_diff"].fillna(0, inplace=True) # (*)


# Create a boolean mask for matches in the same round (*)
same_round_mask = soccerdat_matches["time_diff"] > max_round_gap

# Use cumsum to create a running count of rounds (*)
soccerdat_matches["round"] = (same_round_mask.cumsum() + 1)

soccerdat_matches.drop(columns=["time_diff"], inplace=True)
soccerdat_matches
"""

# Iterating over round column and assigning the round number based on time difference between consecutive matches

"""
def enumerate_rounds(row):
    generating matchrounds based on time difference between matches

rounds = 1
for row in soccerdat_matches.iterrows():   
    if row["time_diff"] <= 4:
        row["round"] = rounds
    else:
        rounds += 1
        row["round"] = rounds

# soccerdat_matches["rounds"] = soccerdat_matches.apply(enumerate_rounds, axis=1)
soccerdat_matches
"""

# this solution did not work, it wasn't able to create 38 matchrounds with any given number of day difference (not just 4 days)

# 2nd solution: simplest solution - creating a list of values using a nested loop then adding the list as a column, named "round"
# there are 380 matchups in the dataset corresponding to 38 rounds in a PL season and 10 mathes per rounds
# can also be problematic for rearranged matches
soccerdat_matches = soccerdat

round_col = []
round_num = 1
for i in range(38):
    for j in range(10):
        round_col.append(round_num)
    round_num +=1

soccerdat_matches_sorted = soccerdat_matches.sort_values(by="date")
soccerdat_matches_sorted["round"] = round_col

# Sanity check for number of matches per rounds
print(soccerdat_matches_sorted["round"].value_counts())
sdat_rounds = soccerdat_matches_sorted
sdat_rounds_check = sdat_rounds.groupby(by="HomeTeam")["round"].value_counts()

# I didn't find any inconsistency in the data concerning rounds

‌
‌
‌

Soccer Data

.mfe-app-workspace-kj242g{position:absolute;top:-8px;}.mfe-app-workspace-11ezf91{display:inline-block;}.mfe-app-workspace-11ezf91:hover .Anchor__copyLink{visibility:visible;}Soccer Data

Data Dictionary

Soccer Data