Soccer Data
This dataset contains data of every game from the 2018-2019 season in the English Premier League.
Not sure where to begin? Scroll to the bottom to find challenges!
*** This is an unfinished project, I'm still working on it. I had some issues with the red cards - wins prediction, so please feel free to share your suggestions with me! ***
Data Dictionary
| Column | Explanation |
|---|---|
| Div | Division the game was played in |
| Date | The date the game was played |
| HomeTeam | The home team |
| AwayTeam | The away team |
| FTHG | Full time home goals |
| FTAG | Full time away goals |
| FTR | Full time result |
| HTHG | Half time home goals |
| HTAG | Half time away goals |
| HTR | Half time result |
| Referee | The referee of the game |
| HS | Number of shots taken by home team |
| AS | Number of shots taken by away team |
| HST | Number of shots taken by home team on target |
| AST | Number of shots taken by away team on target |
| HF | Number of fouls made by home team |
| AF | Number of fouls made by away team |
| HC | Number of corners taken by home team |
| AC | Number of corners taken by away team |
| HY | Number of yellow cards received by home team |
| AY | Number of yellow cards received by away team |
| HR | Number of red cards received by home team |
| AR | Number of red cards received by away team |
import pandas as pd
soccerdat = pd.read_csv("soccer18-19.csv")
soccerdat
# Inspecting the data
soccerdat.info()# Checking for duplicates and/or missing matches
soccerdat["HomeTeam"].value_counts()
- challenge:
"What team commits the most fouls?"
In my interpretation: Which team commits the most fouls in away games? And, which team commits the most fouls in home games?
Solution:
Count all fouls grouped by home teams. Count all fouls grouped by away teams.
# Value counts of fouls commited, grouped by away and home teams
home_fouls = soccerdat.groupby("HomeTeam")["HF"].sum().sort_values(ascending=False)
away_fouls = soccerdat.groupby("AwayTeam")["AF"].sum().sort_values(ascending=False)
print(home_fouls)
print(away_fouls)Brighton commited the most number of fouls in away matches (242) and also in home matches (221, together with Watford). Altogether Brigton commited the most fouls (463)
- challenge: "Plot the percentage of games that ended in a draw over time."
Solution:
- Creating variable "round" based on matchdates.
- Count the percentages of games that ended in a draw in each round.
- Plot it.
# Transforming Date to datetime format
import datetime as dt
soccerdat["date"] = pd.to_datetime(soccerdat["Date"], dayfirst=True)# 1st solution: creating "round" column based on time difference between matches: if the time difference between consecutive matches is larger than 4 days (because matchrounds in the PL usually span from Friday to Sunday), then that can be counted as a new round (can be problematic for midweek fixtures and rearranged matches)
# Creating "time_diff" column from the time difference between consecutive matches (*: with the help of ChatGPT)
"""
soccerdat_matches = soccerdat
# Creating timedelta for of 4 days (*)
soccerdat_matches["time_diff"] = soccerdat_matches["date"].diff()
max_round_gap = pd.Timedelta(days=4)
soccerdat_matches["time_diff"] = soccerdat_matches["time_diff"].dt.days
soccerdat_matches["time_diff"].fillna(0, inplace=True) # (*)
# Create a boolean mask for matches in the same round (*)
same_round_mask = soccerdat_matches["time_diff"] > max_round_gap
# Use cumsum to create a running count of rounds (*)
soccerdat_matches["round"] = (same_round_mask.cumsum() + 1)
soccerdat_matches.drop(columns=["time_diff"], inplace=True)
soccerdat_matches
"""# Iterating over round column and assigning the round number based on time difference between consecutive matches
"""
def enumerate_rounds(row):
generating matchrounds based on time difference between matches
rounds = 1
for row in soccerdat_matches.iterrows():
if row["time_diff"] <= 4:
row["round"] = rounds
else:
rounds += 1
row["round"] = rounds
# soccerdat_matches["rounds"] = soccerdat_matches.apply(enumerate_rounds, axis=1)
soccerdat_matches
"""
# this solution did not work, it wasn't able to create 38 matchrounds with any given number of day difference (not just 4 days)# 2nd solution: simplest solution - creating a list of values using a nested loop then adding the list as a column, named "round"
# there are 380 matchups in the dataset corresponding to 38 rounds in a PL season and 10 mathes per rounds
# can also be problematic for rearranged matches
soccerdat_matches = soccerdat
round_col = []
round_num = 1
for i in range(38):
for j in range(10):
round_col.append(round_num)
round_num +=1
soccerdat_matches_sorted = soccerdat_matches.sort_values(by="date")
soccerdat_matches_sorted["round"] = round_col# Sanity check for number of matches per rounds
print(soccerdat_matches_sorted["round"].value_counts())
sdat_rounds = soccerdat_matches_sorted
sdat_rounds_check = sdat_rounds.groupby(by="HomeTeam")["round"].value_counts()
# I didn't find any inconsistency in the data concerning rounds