Studying the Two Running Games in Super Bowl LVII

Note: You can consult the solution for this live training in the file browser (notebook_solution.ipynb)

Every February, millions of people, both inside and outside North America, tune in to the National Football League's Super Bowl, which crowns the world champion in American football. In football there are two ways to advance the ball while on offense: running and passing. During this training, you will learn how to use data to see which players to watch for in the running game, the more unheralded aspect of a football game, during Super Bowl LVII. We will show you how to obtain NFL play-by-play data in Python and then use exploratory data analysis and linear models to identify those players.

Obtaining data and loading packages

While we use Python in this tutorial, the nflfastR play-by-play dataset was initially developed in R, and that package's help page gives the best description of the metadata. You can obtain data for any year using import_pbp_data() from the nfl_data_py package, and we include this code, commented out, for your future reference. However, we pre-staged the data for this live training as a CSV file to optimize your learning experience.

First, load the required Python packages. Use pandas (alias pd) and numpy (alias np) for data. Use seaborn (alias sns) and matplotlib.pyplot (alias plt) for plotting. Use statsmodels.formula.api (alias smf) for linear models.

# import required packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf

Note: If you wanted to import the data for other seasons, you could use the code Eric and Richard used to obtain the original data:

# code we used to obtain data
# import nfl_data_py as nfl
# pbp = nfl.import_pbp_data([2022])
# we only selected the columns we needed for today
# pbp[['play_type', 'posteam', 'rushing_yards', 'rusher_id',
#      'rusher_player_id', 'rusher_player_name', 'ydstogo', 'down',
#      'yardline_100', 'run_location', 'score_differential',
#      'game_seconds_remaining']].to_csv("pbp_2022.csv", index=False)

Now, you can load the play-by-play (pbp, Python object pbp) data you'll use today using pd.read_csv() with pbp_2022.csv:

# load play-by-play data
pbp = pd.read_csv('pbp_2022.csv')

Next, peek at the top of pbp using the .head() method:

# peek at head of pbp
pbp.head()

Filtering Data to Rushing Plays

First, the pbp data needs to be filtered and cleaned. Use query() to perform the following:

  • Save the new data frame as pbp_run because this is only the run play-by-play (pbp) data
  • Filter (or, in pandas lingo, query()) to rushing plays with play_type == "run"
  • Remove missing values for rushing_yards using rushing_yards.notnull()
  • Remove missing values for rusher_id using rusher_id.notnull()
  • Reset the data frame's index, and
  • Look at the head of the data frame

Remember, use & to combine multiple filter criteria. After filtering, look at the head() of the data:

# query pbp
pbp_run = pbp.query('play_type == "run" & rushing_yards.notnull() & rusher_id.notnull()').reset_index()

# peek at head of data
pbp_run.head()
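
To see what each filter criterion does, here is a minimal sketch on a made-up four-row data frame. The names toy and toy_run and all of the values are hypothetical and are not part of the training data. The sketch passes engine="python" to query(), an assumption made so that the method calls inside the query string are evaluated the same way regardless of which optional packages are installed, and it uses reset_index(drop=True) so the old index is discarded rather than kept as a column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: four plays, some with missing values
toy = pd.DataFrame({
    "play_type": ["run", "pass", "run", "run"],
    "rushing_yards": [4.0, np.nan, np.nan, 7.0],
    "rusher_id": ["00-001", None, "00-002", "00-003"],
})

# Same three criteria as above, combined with &
toy_run = toy.query(
    'play_type == "run" & rushing_yards.notnull() & rusher_id.notnull()',
    engine="python",
).reset_index(drop=True)

# Only the first and last rows satisfy all three criteria
print(toy_run)
```

The pass play is dropped by the play_type criterion, and the run with a missing rushing_yards value is dropped by the notnull() criterion.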

Who Are the Best Rushers in Sunday's Game?

Next, you will examine which of the players who will be in the game on Sunday are the best at rushing the football. First, select the data for the two teams in the Super Bowl, the Philadelphia Eagles (PHI) and the Kansas City Chiefs (KC), using the isin() function from pandas with the Team of Possession (which team has the ball; posteam) column. To do this:

  1. Create a list of Super Bowl teams, sb_teams
  2. Use the .loc indexer with pbp_run and the .isin() function on the posteam column (Hint: this looks like pbp_run['posteam'].isin(sb_teams)).
  3. Call reset_index() on the new data frame
  4. Save the output as pbp_run_sb

# create a list of teams in the Super Bowl
sb_teams=['KC', 'PHI']

# Keep only the Super Bowl teams using `isin()`, then reset the index
pbp_run_sb = pbp_run.loc[pbp_run['posteam'].isin(sb_teams)].reset_index(drop=True)
pbp_run_sb

Next, check the data to make sure it is correct and you only have these two teams by using the posteam column and looking for the unique() values in the column:

# Look at the unique posteam values in `pbp_run_sb`
pbp_run_sb.posteam.unique()
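
As a minimal sketch of the same select-then-verify pattern, the example below uses a made-up five-row data frame. The names toy, toy_sb, and teams_found, along with the yardage values, are hypothetical. Instead of reading the unique() output by eye, it compares the teams found against sb_teams directly.

```python
import pandas as pd

# Hypothetical toy data with plays from four different teams
toy = pd.DataFrame({
    "posteam": ["KC", "PHI", "SF", "KC", "CIN"],
    "rushing_yards": [3, 5, 2, 8, 1],
})

sb_teams = ["KC", "PHI"]

# Keep only rows whose posteam is in sb_teams, then reset the index
toy_sb = toy.loc[toy["posteam"].isin(sb_teams)].reset_index(drop=True)

# Verify that only the two Super Bowl teams remain
teams_found = set(toy_sb["posteam"].unique())
assert teams_found == set(sb_teams)
```

A check like this fails loudly if a team abbreviation is mistyped in sb_teams, because the mistyped team would be missing from teams_found.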

Now, you can aggregate over the whole season to get each player's total rushing yards (sum of rushing_yards) and yards per carry (mean of rushing_yards). For this, "group by" posteam so we know which team the player belongs to, and by both rusher_player_id and rusher_player_name because some players have the same name. Save the result as pbp_run_sb_yards: