Mercy Rule? Predicting the Full Time Result from the Half Time Result in the English Premier League

    rm(list=ls()) # Clearing environment
    
    ipak <- function(pkg){ # Function for installing and loading packages
        new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
        if (length(new.pkg))
          install.packages(new.pkg, dependencies = TRUE)
        sapply(pkg, require, character.only = TRUE)
      }
    packages <- c("tidyverse",
                  "lubridate", # month() and date(); not attached by tidyverse before 2.0.0
                  "ggthemes",
                  "tree")
    ipak(packages)
    
    soccer <- read_csv('data/soccer18-19.csv.gz', show_col_types = FALSE) # Loading in data

    1. Executive Summary

    A team in the English Premier League has signed some young players. To optimize their playing time, the coach wants to predict the full time result at half time, so that if the game is already won (or lost) the young players can see the field. To determine how predictable the full time result is at half time, this research analyzes data from every match of the 2018/2019 Premier League season. Cross-tabulations, data visualization, and statistical inference reveal that the full time result is indeed related to the half time result, that the home team wins more often than it draws or loses, and that the point in the season does not influence the probability of a given full time result. A machine learning model, a decision tree for classification, is trained to combine these findings with other match statistics to best predict the full time result. The model can be visualized as a flowchart that the coach can easily trace based on match characteristics; it predicts home wins well, away wins moderately well, and draws poorly. Overall, the findings of this analysis, combined with the coach's subject matter expertise, can help maximize development time for the new young players.

    2. Introduction

    The English Premier League is the most widely watched soccer league in the world. Each year, the twenty best teams in England and Wales each play 38 games in a double round-robin, and the best performing team over these 38 games is crowned champion. A team in the league has signed some younger players and wants to know how predictable the full time result is at half time. If the coach can be confident that the game is won (or lost) by half time, they can give playing time to these younger players, helping them to develop.

    In this project, I determine how well the full time result can be predicted from the half time result and other match statistics, using data from the 2018/2019 English Premier League season. I first explore the association between the full time result and various features using descriptive statistics, data visualization, and statistical inference. I then build a decision tree for classification that uses all of the information in the features to make the best prediction of the full time result.

    3. Data

    Data from the 2018/2019 English Premier League season is used for this analysis. It's important to note that the 2018/2019 season is the last season of the Premier League before the COVID-19 pandemic upended the league (and world) in March 2020. The data is from DataCamp Workspace, originally from https://data.world/chas/2018-2019-premier-league-matches. Each row in the dataset is a match, with the following variables available:

    Column     Explanation
    Div        Division the game was played in
    Date       The date the game was played
    HomeTeam   The home team
    AwayTeam   The away team
    FTHG       Full time home goals
    FTAG       Full time away goals
    FTR        Full time result
    HTHG       Half time home goals
    HTAG       Half time away goals
    HTR        Half time result
    Referee    The referee of the game
    HS         Number of shots taken by home team
    AS         Number of shots taken by away team
    HST        Number of shots taken by home team on target
    AST        Number of shots taken by away team on target
    HF         Number of fouls made by home team
    AF         Number of fouls made by away team
    HC         Number of corners taken by home team
    AC         Number of corners taken by away team
    HY         Number of yellow cards received by home team
    AY         Number of yellow cards received by away team
    HR         Number of red cards received by home team
    AR         Number of red cards received by away team

    # Converting Date to Month (refer to the column directly inside mutate())
    soccer <- soccer %>%
    	mutate(Month = month(date(Date)))
    
    # Converting full time match statistics to half time match statistics
    soccer <- soccer %>%
    	mutate(HS = HS / 2,
    		  AS = AS / 2,
    		  HST = HST / 2,
    		  AST = AST /2,
    		  HF = HF / 2,
    		  AF = AF / 2,
    		  HC = HC / 2,
    		  AC = AC / 2,
    		  HY = HY / 2,
    		  AY = AY / 2,
    		  HR = HR / 2,
    		  AR = AR / 2)
    
    # Factoring appropriate variables
    soccer <- soccer %>%
    	mutate(HomeTeam = factor(HomeTeam),
    		  AwayTeam = factor(AwayTeam),
    		  FTR = factor(FTR),
    		  HTR = factor(HTR),
    		  Referee = factor(Referee))
    
    # Selecting variables used in analysis
    soccer <- soccer %>%
    	select(Month,
    		  Date,
    		  HomeTeam,
    		  AwayTeam,
    		  FTHG,
    		  FTAG,
    		  FTR,
    		  HTHG,
    		  HTAG,
    		  HTR,
    		  Referee,
    		  HS,
    		  AS,
    		  HST,
    		  AST,
    		  HF,
    		  AF,
    		  HC,
    		  AC,
    		  HY,
    		  AY,
    		  HR,
    		  AR)
    

    A few adjustments were made to the variables. First, the "Date" variable was converted to "Month". This aggregation reduces the number of unique values to 10. Additionally, "Month" can be used for predicting future matches, whereas "Date" includes a year value. Second, all of the match statistics (home shots, away yellow cards, etc.) are full time match statistics. To approximate their values at half time, each of these statistics was divided by 2, which assumes that events are spread evenly across the two halves. Finally, the "Div" variable was dropped because all of these matches have the same value for this variable: English Premier League.
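
    As a minimal sketch of the first two adjustments, the base R snippet below works on a small made-up data frame rather than the real dataset. The `toy` data frame, its values, and the day/month/year date format are assumptions for illustration only; the analysis itself relies on `month(date(...))` applied to the actual "Date" column.

```r
# Hypothetical toy data: two matches with full time shot counts.
# The dd/mm/YYYY date format is an assumption, not confirmed by the source.
toy <- data.frame(
  Date = c("10/08/2018", "12/05/2019"),
  HS   = c(8, 13),
  AS   = c(13, 7),
  stringsAsFactors = FALSE
)

# Parse the date with an explicit format, then keep only the month number.
toy$Month <- as.integer(format(as.Date(toy$Date, format = "%d/%m/%Y"), "%m"))

# Approximate half time shots by halving the full time totals.
toy$HS <- toy$HS / 2
toy$AS <- toy$AS / 2

print(toy)
```

    Halving an odd total gives a fractional count (13 shots becomes 6.5), a reminder that these values are approximations rather than observed half time statistics.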

    4. Analysis

    4.1 Simple Association Between Half Time Result and Full Time Result

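    # Cross-tabulating half time result (rows) against full time result (columns)
    ht_ft_table <- table(soccer$HTR, soccer$FTR, dnn = c("Half Time", "Full Time"))
    print(ht_ft_table)
    # Share of matches whose full time result equals the half time result
    sum(diag(ht_ft_table)) / sum(ht_ft_table)
    # Chi-squared test of independence between half time and full time result
    chisq.test(ht_ft_table)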
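
    To make the two summary numbers concrete, here is a small sketch on invented counts. The `toy_counts` matrix is hypothetical and does not reproduce the 2018/2019 results; it only shows how the diagonal share and the chi-squared statistic are formed from a 3x3 table of half time versus full time results.

```r
# Hypothetical counts, NOT the real 2018/2019 results.
# Rows are half time results, columns are full time results (A, D, H).
toy_counts <- matrix(
  c(30,  5,  5,
    10, 20, 20,
     5,  5, 50),
  nrow = 3, byrow = TRUE,
  dimnames = list("Half Time" = c("A", "D", "H"),
                  "Full Time" = c("A", "D", "H"))
)

# Diagonal cells are matches where the result did not change after half time.
agreement <- sum(diag(toy_counts)) / sum(toy_counts)
print(agreement)

# Expected counts under independence: (row total * column total) / grand total.
expected <- outer(rowSums(toy_counts), colSums(toy_counts)) / sum(toy_counts)

# Pearson chi-squared statistic computed by hand.
chi_sq <- sum((toy_counts - expected)^2 / expected)

# The same statistic from chisq.test(), for comparison.
test_result <- chisq.test(toy_counts)
print(c(by_hand = chi_sq, chisq_test = unname(test_result$statistic)))
```

    In this toy table, 100 of the 150 matches sit on the diagonal, so the agreement share is 2/3. The hand-computed statistic should match the one reported by `chisq.test()`, which for a 3x3 table has 4 degrees of freedom.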