Tidyverse Soccer Score Analysis

In this tutorial, we'll use tidyr, dplyr, and ggplot2 to visualize a season of soccer scores, and investigate trends in the time of goals scored and conceded.

When Scottish soccer meets the tidyverse

I collated some data on my local soccer team which we can use to practice some data reshaping techniques using tools from the tidyverse.

Once we've reshaped the data, we're going to plot every goal, from every match of their 2017/2018 league campaign. The aim is to create a plot where each facet shows a timeline of goals scored and conceded.

We'll use tidyr, dplyr and ggplot2 to create the main graphic, along with some plots looking at trends in the time of goals scored and conceded.

Optionally, we can animate the final graphic using the in-development "gganimate" package.

This idea is inspired by (but in no way meant to be a direct replica of) the work of data visualization guru Andy Kirk, who visualized Liverpool FC's season a couple of years ago (http://www.visualisingdata.com/2016/05/boom-bust-shape-roller-coaster-season/).

That season finished in a European final for the Reds.

My team is Inverness Caledonian Thistle FC, based in the Scottish Highlands. Last year they aimed to gain promotion back up to the Scottish Premier League, having been relegated the season before. Let's see if it was a rollercoaster season at the Tulloch Caledonian Stadium.

The raw data is stored in an Excel workbook, so in addition to loading the tidyverse package, we'll use the readxl package to import our data.

Let's load the packages we need, and read in the data.

library(readxl)
library(tidyverse)
scores <- read_excel("InvernessResults2017.xlsx",sheet = "scores")

Here is a link to the spreadsheet data.

Let's take a look at the data.

glimpse(scores)

The 'Outcome' column tells us whether Inverness Won, Lost or the match was a Draw.

We have columns for the 'Home' team score, and the 'Away' team score, and a column indicating whether Inverness were Home or Away. Usually teams alternate, but sometimes there is a run of home games, and vice versa. When we make the plot, we'll always want the Inverness score to appear first, so we'll need to figure out a way to make the order of the Result column consistent.

We can see the name of the opposing teams:

unique(scores$Opponent)

There are 10 teams in the league, and each team plays the other 9 teams 4 times (2 at home, 2 away), so 36 games in all.

We have a column for each team's score, and additional columns show the time of each individual goal.

For example, if Inverness score 3 goals, then columns Inv_Goal_1, Inv_Goal_2, and Inv_Goal_3 will have entries, while columns Inv_Goal_4 and Inv_Goal_5 will remain as NA.

We'll definitely want to condense those columns.

Before we do that, we'll create a helper dataframe for plotting purposes. We can use this to join to our scores data frame later on.

team <- c('Brechin','Dumbarton','Dundee_United','Dunfermline','Falkirk','Greenock_Morton','Inverness','Livingston','Queen_Of_The_South','St_Mirren')
colors <- c('#E3001B','#F8BE02','#C6631D','#161616','midnightblue','#316891',
             '#0355AF','#C19B24','#093C71','#000000')


team_df <- tibble(team,colors)

rm(list = c("colors","team"))

The colors have been chosen as the best match (without being too obsessive over it) to the primary color in each team's playing kit.
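When an unnamed vector of colors is passed to scale_fill_manual(), the colors are matched to the fill levels by position. One way to make the team-to-color mapping explicit is to name the vector. The following is a minimal sketch on a three-team subset, not part of the original workflow: `team_colors` is a hypothetical name, and lower-casing the names assumes the team labels used for plotting are lower case.

```r
library(tibble)

# A three-team subset of the helper data frame, for illustration only
team_df <- tibble(
  team   = c("Brechin", "Inverness", "St_Mirren"),
  colors = c("#E3001B", "#0355AF", "#000000")
)

# Named character vector: names are team labels, values are hex colors
team_colors <- setNames(team_df$colors, tolower(team_df$team))

team_colors[["inverness"]]
```

A named vector like this can be passed as scale_fill_manual(values = team_colors), in which case ggplot2 matches colors to teams by name rather than by order.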

Time to tidy

OK, enough preamble, let's see how my team did...

First of all, the column names in our data frame all start with capitals. That can be problematic so let's change them to lower case:

colnames(scores) <- colnames(scores) %>% str_to_lower()
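If you prefer to keep the renaming inside a pipeline, dplyr's rename_with() applies a function to the column names. A small sketch on a made-up tibble (`toy` and its values are illustrative, not the real spreadsheet):

```r
library(dplyr)
library(stringr)
library(tibble)

# Hypothetical one-row table with capitalised column names
toy <- tibble(Date = "2017-08-05", Opponent = "Dundee_United", Inv_Goal_1 = 12)

# Apply str_to_lower() to every column name
toy_lower <- toy %>% rename_with(str_to_lower)

colnames(toy_lower)
```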

That's better. Now let's tackle all those columns that show the time of each goal, using tidyr's gather function.

First, we select the date column, all the columns that start with "inv" plus those starting with "opp". We'll also select the 'gameid' column for use in calculating future variables.

Take a look at the 'gather' call. We're taking all the goal columns, and we're transposing them into one long column, named "goal" (defined by the 'key' argument). Alongside this, we're creating a new long column called 'time', where all the values for the times of the goals are stored. All the other columns that we selected are also stored in the 'long' format.

data <- scores %>%
  select(date, starts_with("inv"),starts_with("opp"), gameid) %>%
  gather(inv_goal_1,inv_goal_2,inv_goal_3,
         inv_goal_4,inv_goal_5,opp_goal_1,opp_goal_2,opp_goal_3,
         opp_goal_4,opp_goal_5, key = "goal",value = "time")
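In current versions of tidyr, gather() is superseded by pivot_longer(). As a hedged sketch of the same kind of reshape, here is a two-match toy tibble (the column subset and values are invented for illustration):

```r
library(dplyr)
library(tidyr)
library(tibble)

# Toy data: two matches, two Inverness goal columns, one opponent goal column
toy_scores <- tibble(
  gameid     = c(1, 2),
  inv_goal_1 = c(10, NA),
  inv_goal_2 = c(55, NA),
  opp_goal_1 = c(80, 30)
)

# Stack every goal column into a goal / time pair of columns
toy_long <- toy_scores %>%
  pivot_longer(
    cols      = matches("^(inv|opp)_goal_"),
    names_to  = "goal",
    values_to = "time"
  )

toy_long
```

The rows come out grouped by match rather than by original column, which is a different order from gather(). That should not matter for this tutorial, because the later steps sort by time.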

Let's take a look at this new dataframe:

str(data)

Our 'scores' dataframe was 36 rows long and 20 columns wide. Now our 'gathered' dataframe is 360 rows in length but only 8 columns wide.

We'll add in one more column, for use when faceting the plot later on.

data <- data %>%
  mutate(result = paste(inverness_score, "-", opponent_score))

We need to add a team column in there, plus we also need to get a cumulative count of each goal by team, by match.

plot_data <- data %>%
  select(date,opponent,gameid,goal,time,result,inverness_status) %>%
  mutate(team = if_else(str_sub(goal,1,1) == "i","inverness",tolower(opponent))) %>%
  group_by(gameid,team) %>%
  arrange(time) %>%
  mutate(count = 1) %>%
  ungroup() %>%
  group_by(gameid,team) %>%
  mutate(goalcount = cumsum(count)) %>%
  ungroup() %>%
  select(-count)
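The running count per team and match can also be written with row_number(). This is a sketch of an alternative on invented data, not the code used above; it returns integers where cumsum() on a column of 1s returns doubles.

```r
library(dplyr)
library(tibble)

# Toy goals: three in match 1 (two for Inverness), one in match 2
toy_goals <- tibble(
  gameid = c(1, 1, 1, 2),
  team   = c("inverness", "inverness", "falkirk", "inverness"),
  time   = c(55, 10, 80, 30)
)

# Sort by time, then number each team's goals within each match
toy_counted <- toy_goals %>%
  arrange(time) %>%
  group_by(gameid, team) %>%
  mutate(goalcount = row_number()) %>%
  ungroup()

toy_counted
```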

We're going to use geom_rect() for each goal. This requires minimum and maximum values along both the x and y-axis. We'll use the lag and lead functions from dplyr to create these values. For example, the second goal scored will have a minimum value equal to the time of the first goal, and the max will be the actual time of the second goal. Meanwhile, the minimum y value will be 1, and the maximum will be 2.

It'll become clearer when we produce the plot!

plot_data <- plot_data %>%
  group_by(gameid) %>%
  arrange(gameid,time) %>%
  mutate(lag_time = lag(time),
         lead_time = lead(time)) %>%
  ungroup()

plot_data$lag_time[is.na(plot_data$lag_time)] <- 0

plot_data$lead_time[is.na(plot_data$lead_time)] <- 90

Those last 2 commands replace NA values in the lag time and lead time with 0 and 90 respectively. Now we select the columns we want in the desired order.

plot_data <-  plot_data %>%
  select(gameid, date,team, result,opponent,goalcount,lag_time, time, lead_time, inverness_status)
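As an aside, lag() and lead() accept a `default` argument, so the boundary values of 0 and 90 can be supplied directly. The sketch below uses invented goal times and assumes every row has a time; in the tutorial data, the rows where time is NA would still need handling of their own.

```r
library(dplyr)
library(tibble)

# Toy goal times: three goals in match 1, one in match 2
toy_times <- tibble(
  gameid = c(1, 1, 1, 2),
  time   = c(10, 55, 80, 30)
)

# Within each match, the previous goal time defaults to kick-off (0)
# and the next goal time defaults to full time (90)
toy_bounds <- toy_times %>%
  arrange(gameid, time) %>%
  group_by(gameid) %>%
  mutate(
    lag_time  = lag(time, default = 0),
    lead_time = lead(time, default = 90)
  ) %>%
  ungroup()

toy_bounds
```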

We want to ensure we keep the rows for the games that ended in a scoreless 0-0 draw. Otherwise, those results will be dropped from the final plot.

goalless <- filter(plot_data, result == "0 - 0")

However, we don't need the extraneous rows from the other games where there is no 'time' value. So we'll filter them out and then combine the two dataframes, keeping the plot_data name.

scored <- filter(plot_data, result != "0 - 0") %>%
  filter(!is.na(time))

plot_data <- bind_rows(goalless,scored)

rm(list = c("goalless", "scored"))
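The same selection can be expressed as a single filter with an "or" condition. This is only a sketch on invented rows; unlike bind_rows(goalless, scored), it keeps the rows in their existing order instead of putting the goalless games first.

```r
library(dplyr)
library(tibble)

# Toy rows: match 1 has one goal and one empty slot, match 2 is goalless
toy_plot <- tibble(
  gameid = c(1, 1, 2, 2),
  result = c("2 - 0", "2 - 0", "0 - 0", "0 - 0"),
  time   = c(10, NA, NA, NA)
)

# Keep every row of a goalless game, plus any row that has a goal time
toy_kept <- toy_plot %>%
  filter(result == "0 - 0" | !is.na(time))

toy_kept
```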

Now finally, we're ready to plot. You might want to maximize your plot window for this.

p <- ggplot(plot_data,aes(time,goalcount, group = opponent, fill = team)) +
  geom_rect(aes(xmin = lag_time,xmax = time, ymin = (goalcount - 1), ymax = goalcount)) +
  geom_text(aes(x = 25,y = 4.5,label = result , size = 0.5)) +
  theme_void() +
  scale_fill_manual(values = team_df$colors) +
  facet_wrap(date + inverness_status ~ opponent, ncol = 9) +
  ggtitle(label = "Inverness Caledonian Thistle Goals Scored / Conceded - Scottish Championship 2017/2018") +
  labs(x = NULL, y = NULL) +
  theme(legend.position = "none")
print(p)

And there you go, a timeline for goals scored in each match.

While we have this data, we may as well look to see if there are any patterns in when the team scores or concedes. Because each match lasts 90 minutes (excluding time added on for stoppages), we can look at goals scored in 15-minute intervals.

plot_data$cut <- cut(plot_data$time, seq(0,90,by = 15))
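To see how cut() assigns minutes to intervals, here is a small base R sketch with invented minutes. By default the intervals are closed on the right, so minute 15 falls in (0,15], and any value outside the breaks becomes NA. The value 96 is a hypothetical stoppage-time goal, included only to show that behaviour.

```r
# Invented goal minutes, including one beyond the 90-minute break
toy_minutes <- c(5, 15, 16, 90, 96)

# Six right-closed intervals: (0,15], (15,30], ..., (75,90]
toy_bins <- cut(toy_minutes, breaks = seq(0, 90, by = 15))

# Count goals per interval, showing NA for values outside the breaks
table(toy_bins, useNA = "ifany")
```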

inv_scored <- plot_data %>%
  filter(team == "inverness", !is.na(time))

ggplot(inv_scored, aes(goalcount, time, group = goalcount)) +
  geom_boxplot(width = 0.2) +
  geom_point() +
  ggtitle(label = " When Inverness goals 1 to 5 are scored - by minute of match") +
  scale_y_continuous(breaks = seq(0, 90, by = 15)) +
  theme_bw()

It looks like the team often scores 1 or 2 goals within the first 30 minutes (though this does not match my memory of the matches).

Were any of the opposing teams particularly susceptible to letting goals in early on in matches?

opposing_colors <- team_df %>% filter(team != "Inverness")

ggplot(inv_scored,aes(goalcount,time , group = goalcount,fill = opponent)) +
  geom_boxplot(width = 0.2) +
  ggtitle(label = "Inverness goals scored by opponent and minute of match") +
  scale_y_continuous(breaks = seq(0, 90, by = 15)) +
  facet_wrap(~ opponent) +
  theme_bw() +
  theme(legend.position = "bottom") +
  scale_fill_manual(values = opposing_colors$colors)

So it looks like the team scored early against Brechin, Dumbarton, and Livingston.

And of course we can do the same for goals conceded:

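# Goals scored by the opposition (conceded by Inverness), mirroring inv_scored
opposition_score <- plot_data %>%
  filter(team != "inverness", !is.na(time))

ggplot(opposition_score, aes(goalcount, time, group = goalcount)) +
  geom_boxplot(width = 0.2) +
  geom_point() +
  ggtitle(label = "Inverness goals conceded by minute of match") +
  scale_y_continuous(breaks = seq(0, 90, by = 15)) +
  theme_bw()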

Well, this is interesting because it appears that if Inverness concede 2 or more goals, it generally tends to happen in the second half of the match.

Is there a statistical difference between the times of goals scored and conceded? We can use ggplot2 to help.

comparison <- plot_data %>%
  select(team, goalcount, time) %>%
  filter(!is.na(time)) %>%
  mutate(team_type = if_else(str_sub(team,1,1) == "i","Inverness","Opponent"))

ggplot(comparison, aes(team_type, time, group = team_type, fill = team_type)) +
  geom_boxplot(width = 0.2, notch = TRUE) +
  ggtitle(label = "Distribution of time of nth goal scored / conceded by minute of match") +
  theme_minimal() +
  scale_y_continuous(breaks = seq(0, 90, by = 15)) +
  theme(legend.position = "bottom") +
  scale_fill_manual(values = c('#0355AF', '#E3001B')) +
  facet_wrap(vars(goalcount), ncol = 3)

Although it might be hard to see, depending on your plot window, the notches on the boxplots for the second and third goals do not overlap. Non-overlapping notches suggest that the median times differ: Inverness tend to score their second goal earlier than they concede a second, and similarly for a third goal. The notches for goal 5 do not overlap either, but because scoring or conceding 5 in a match is a rarity, we can ignore that.
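For readers who want numbers behind the notches: ggplot2 documents them as extending 1.58 * IQR / sqrt(n) either side of the median, which gives roughly a 95% confidence interval for comparing medians. Base R's boxplot.stats() returns a comparable interval in its `conf` element. The sketch below uses invented goal times, not the real data, and base R measures the spread between hinges, so its values may differ slightly from what ggplot2 draws.

```r
# Invented times (minutes) for a team's second goal scored and conceded
scored_times   <- c(12, 18, 25, 31, 40, 44, 52, 58, 63)
conceded_times <- c(48, 55, 61, 66, 70, 74, 79, 83, 88)

# Lower and upper notch limits around each median
notch_scored   <- boxplot.stats(scored_times)$conf
notch_conceded <- boxplot.stats(conceded_times)$conf

# Two intervals overlap when each one starts before the other ends
notches_overlap <- notch_scored[2] >= notch_conceded[1] &&
  notch_conceded[2] >= notch_scored[1]

notch_scored
notch_conceded
notches_overlap
```

With only a handful of goals per group, as for goal 5, these intervals are wide and unreliable, which is another reason to treat that panel with caution.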

Animating the plot

Finally - although the goal timeline plot we created earlier doesn't quite have the same sophistication that Andy Kirk's original graphic did, we can still have some fun.

N.B. This is optional and will require installing the current version of gganimate from GitHub.

If that is not possible, don't worry, as the final output will be shown below so you can see what happens.

Note also - it can take a few minutes to create this animation.

# gganimate is currently in development on GitHub.
# You will need to install devtools if you haven't already done so.
# You will need the latest version of R and may need to reinstall your existing packages.


# install.packages('devtools')
# devtools::install_github('thomasp85/gganimate')

library(gganimate)

# we already defined p as our original season plot
# but we will rebuild it here just in case you have removed it from your workspace
p <- ggplot(plot_data,aes(time,goalcount, group = opponent, fill = team)) +
  geom_rect(aes(xmin = lag_time,xmax = time, ymin = (goalcount - 1), ymax = goalcount)) +
  geom_text(aes(x = 25,y = 4.5,label = result , size = 0.5)) +
  theme_void() +
  scale_fill_manual(values = team_df$colors) +
  facet_wrap(date + inverness_status ~ opponent, ncol = 9) +
  ggtitle(label = "Inverness Caledonian Thistle Goals Scored / Conceded - Scottish Championship 2017/2018") +
  labs(x = NULL, y = NULL) +
  theme(legend.position = "none")
print(p)
# now we take the previous plot, and animate it using the game id to render each faceted plot, in date order
# q <- p + transition_states(gameid, transition_length = 1, state_length = 1) + shadow_mark(past =  TRUE) # this ensures all previous plots are shown
# animate(q,width = 900, height = 750)
#anim_save("game_by_game_season_plot.gif")

After a terrible start to the season, which saw the team languishing in 9th place, they slowly began to gain momentum. There were several cup games in amongst the league games, during which they managed to rack up a club record for the number of games and minutes played without conceding.

If the league had begun in January, they would have been champions.

As it was, a long unbeaten run towards the end of the season saw them winning the Irn Bru Cup (thanks to an extra-time winner), and they came agonizingly close to finishing in the promotion play-off places.

Ultimately, they were undone by Dunfermline, who matched their unbeaten run in the league, and crucially, in the last, "must-win" home game of the season, scored a 96th-minute equalizer to all but seal 4th place.

However, with the team now acclimatized to the Championship, some new signings, and at least 4 matches against their newly relegated local rivals, there are lots of things to look forward to in the new season.
