Mercy Rule? Predicting the Full Time Result from the Half Time Result in the English Premier League
rm(list = ls()) # Clearing environment
ipak <- function(pkg) { # Function for installing and loading packages
  new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
  if (length(new.pkg))
    install.packages(new.pkg, dependencies = TRUE)
  sapply(pkg, require, character.only = TRUE)
}
packages <- c("tidyverse",
              "ggthemes",
              "tree")
ipak(packages)
soccer <- read_csv('data/soccer18-19.csv.gz', show_col_types = FALSE) # Loading in data
1. Executive Summary
A team in the English Premier League has signed some young players. To optimize their playing time, the coach wants to predict the full time result at half time, so that the young players can see the field when the game is already won (or lost). To determine how predictable the full time result is at half time, this research analyzes data from every match of the 2018/2019 Premier League season. Cross-tabulations, data visualization, and statistical inference reveal that the full time result is indeed related to the half time result, that the home team wins more often than not, and that the point in the season does not influence the probability of a given full time result. A machine learning model, a decision tree for classification, is trained to combine these findings with other match statistics to best predict the full time result. The model can be drawn as a flowchart that the coach can easily trace based on match characteristics. It predicts home wins well, away wins moderately well, and draws poorly. Overall, the findings of this analysis, combined with the coach's subject matter expertise, will help optimize the amount of development time for the new young players.
2. Introduction
The English Premier League is the most widely watched soccer league in the world. Each year, the twenty best teams in England and Wales play a 38-game round-robin, with the best performing team over these 38 games crowned champion. A team in the league has signed some younger players and wants to know how predictable the full time result is at half time. If the coach can be confident that the game is won (or lost) by half time, they can give playing time to these younger players, helping them to develop.
In this project, I determine how well the full time result can be predicted from the half time result and other match statistics, using data from the 2018/2019 English Premier League season. I first explore the association between the full time result and various features using descriptive statistics, data visualization, and statistical inference. I then build a decision tree for classification, which uses the information from all of the features to make the best prediction of the full time result.
3. Data
This analysis uses data from the 2018/2019 English Premier League season. It's important to note that 2018/2019 is the last Premier League season completed before the COVID-19 pandemic upended the league (and the world) in March 2020. The data comes from DataCamp Workspace and is originally from https://data.world/chas/2018-2019-premier-league-matches. Each row in the dataset is a match, with the following variables available:
Column | Explanation |
---|---|
Div | Division the game was played in |
Date | The date the game was played |
HomeTeam | The home team |
AwayTeam | The away team |
FTHG | Full time home goals |
FTAG | Full time away goals |
FTR | Full time result |
HTHG | Half time home goals |
HTAG | Half time away goals |
HTR | Half time result |
Referee | The referee of the game |
HS | Number of shots taken by home team |
AS | Number of shots taken by away team |
HST | Number of shots taken by home team on target |
AST | Number of shots taken by away team on target |
HF | Number of fouls made by home team |
AF | Number of fouls made by away team |
HC | Number of corners taken by home team |
AC | Number of corners taken by away team |
HY | Number of yellow cards received by home team |
AY | Number of yellow cards received by away team |
HR | Number of red cards received by home team |
AR | Number of red cards received by away team |
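
The table names the result columns but does not spell out how they are coded. As a minimal sketch, assuming FTR and HTR use the codes H (home win), D (draw), and A (away win), a result can be derived from the two goal columns. The helper `result_from_goals` and the toy values below are hypothetical and are not part of the original analysis.

```r
# Hypothetical helper: derive a result code from home and away goals.
# Assumes the coding H = home win, D = draw, A = away win.
result_from_goals <- function(home_goals, away_goals) {
  ifelse(home_goals > away_goals, "H",
         ifelse(home_goals < away_goals, "A", "D"))
}

# Toy matches (made-up values, not taken from the dataset)
toy <- data.frame(HTHG = c(1, 0, 0), HTAG = c(0, 0, 2),
                  FTHG = c(2, 1, 1), FTAG = c(2, 0, 3))
toy$HTR <- result_from_goals(toy$HTHG, toy$HTAG)
toy$FTR <- result_from_goals(toy$FTHG, toy$FTAG)
print(toy)
```

Applied to the real data, comparing such a derived column with the stored FTR would serve as a quick consistency check, provided the coding assumption holds.
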
# Converting Date to Month
# month() and date() come from lubridate, which tidyverse >= 2.0.0 attaches
soccer <- soccer %>%
  mutate(Month = month(date(Date)))
# Converting full time match statistics to half time match statistics
soccer <- soccer %>%
  mutate(across(c(HS, AS, HST, AST, HF, AF, HC, AC, HY, AY, HR, AR),
                ~ .x / 2))
# Factoring appropriate variables
soccer <- soccer %>%
  mutate(HomeTeam = factor(HomeTeam),
         AwayTeam = factor(AwayTeam),
         FTR = factor(FTR),
         HTR = factor(HTR),
         Referee = factor(Referee))
# Selecting variables used in analysis
soccer <- soccer %>%
  select(Month, Date, HomeTeam, AwayTeam,
         FTHG, FTAG, FTR, HTHG, HTAG, HTR,
         Referee,
         HS, AS, HST, AST, HF, AF, HC, AC, HY, AY, HR, AR)
A few adjustments were made to the variables. First, a "Month" variable was derived from the "Date" variable. This aggregation reduces the number of unique values to 10. Additionally, "Month" can be used for predicting future matches because, unlike "Date", it carries no year value. Second, all of the match statistics (home shots, away yellow cards, etc.) are full time match statistics. To approximate their values at the halfway point of each match (half time), I divided each of these statistics by 2. Finally, the "Div" variable was dropped because all of these matches have the same value for this variable: English Premier League.
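
Dividing by 2 assumes that match events are spread evenly across the two halves, so the halved values are approximations and can be fractional. The sketch below illustrates both adjustments in base R on a made-up data frame. The name `toy`, the ISO date strings, and all values are assumptions for illustration, not rows from the dataset.

```r
# Toy full-time statistics (made-up values, not from the dataset)
toy <- data.frame(
  Date = as.Date(c("2018-08-10", "2018-12-26", "2019-05-12")),
  HS   = c(13, 8, 20),  # full-time home shots
  HR   = c(1, 0, 0)     # full-time home red cards
)

# Month as an integer from 1 to 12, extracted with base R
toy$Month <- as.integer(format(toy$Date, "%m"))

# Approximate half-time values by halving the full-time counts
toy$HS_half <- toy$HS / 2
toy$HR_half <- toy$HR / 2

print(toy)
```

A halved red card count of 0.5 has no literal meaning on the pitch, which is worth keeping in mind when interpreting these variables.
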
4. Analysis
4.1 Simple Association Between Half Time Result and Full Time Result
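# Cross-tabulating half time result (rows) against full time result (columns)
table <- table(soccer$HTR, soccer$FTR, dnn = c("Half Time", "Full Time"))
print(table)
# Proportion of matches where the full time result equals the half time result
sum(diag(table)) / sum(table)
# Chi-squared test of independence between half time and full time result
chisq.test(table(soccer$HTR, soccer$FTR))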
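
To make the two calculations above concrete, here is a minimal sketch on a table of hypothetical counts. The counts, the A/D/H labels, and the names `toy_table`, `agreement`, and `toy_test` are assumptions for illustration and are not the 2018/2019 results.

```r
# Hypothetical counts (NOT the 2018/2019 results): rows are half time
# results, columns are full time results, A/D/H = away win/draw/home win
toy_table <- matrix(
  c(30,  5,  5,
    20, 30, 30,
     5, 10, 65),
  nrow = 3, byrow = TRUE,
  dimnames = list("Half Time" = c("A", "D", "H"),
                  "Full Time" = c("A", "D", "H"))
)

# Share of matches on the diagonal, i.e. where the half time result
# is also the full time result
agreement <- sum(diag(toy_table)) / sum(toy_table)
print(agreement)

# Chi-squared test of independence between the two classifications
toy_test <- chisq.test(toy_table)
print(toy_test$parameter)  # degrees of freedom: (3 - 1) * (3 - 1)
print(toy_test$p.value)
```

The diagonal share is the accuracy of the naive rule "predict that the half time result stands", while the chi-squared test only indicates whether the two results are associated, not how strongly.
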