The Shots that Matter on the PGA Tour

I moved in with my grandmother when COVID hit and we developed a routine: Jeopardy at 7PM on weekdays (rest in peace, Alex Trebek), and PGA Tour events during the weekend.

After watching the pros play for long enough I developed the itch to hit the course myself. By the time I left the driving range to step on the first tee of my first round, I was hooked.

I would grind as often as I could to improve my game, break down how Tour pros played their shots, and watch form and strategy videos to become the best golfer possible, but one question hovered in the back of my mind:

What shots actually make the difference in golf?

It wasn't until I made the career shift into data analysis that I developed the tools to make that more than just a theoretical question. This repository, along with the associated dashboard and walkthrough video, is my attempt at figuring out which shots matter the most on the PGA Tour.

I'm what my family calls a "researcher". I enjoy hunting down relevant information in every corner of the internet and every page of every book I own to find the kernel of truth I'm looking for, and this project was no different.

My initial goal was to get shot-by-shot information for all PGA Tour events since 2000, calculate Strokes Gained measurements myself, and then use those to determine which part of the game was most important for pros.

Unfortunately, I wasn't the beneficiary of a massive inheritance, so paying for all that data wasn't an option.

Instead, we'll just pull all of the Strokes Gained data straight from PGATour.com:

(Strokes Gained: Tee-to-Green data)

(Strokes Gained: Putting data)

Sadly, these tables are embedded in the webpage, so the page's URL doesn't change when you filter by Season and Tournament. Because of this, scraping the page with Python was off the table (pun intended). Instead, we'll go season by season: highlight all of the information, copy it over to a single Excel file with each year as a separate sheet, and clean it there.

If you're following along, we need to grab info for each season with:

  • Time Period: Year to Date
  • Tournament: Wyndham Championship

We'll be using data through the Wyndham Championship because our measurement of players' overall success will be their FedEx Cup Points, and that scoring system dramatically increases the point values for the playoff tournaments after the Wyndham Championship. Including those events would overweight player performance during the final weeks of the season and skew the relative importance of each Strokes Gained measurement based on late-season outcomes.

The following screenshots show the process of copying Strokes Gained: Tee-to-Green and then Strokes Gained: Putting information from the PGA Tour website into Excel, matching the two tables with a VLOOKUP, and then cleaning and formatting the data:

Next, we'll migrate back over to PGATour.com to grab FedEx Cup Points data for each year through the Wyndham Championship:

(FedEx Cup Points data)

We'll paste the information into the same Excel file and perform a VLOOKUP similar to the one above to match each player's points total to their Strokes Gained data:

Then we sort by fedex_points in descending order so we can remove the rows with #N/A under fedex_points, as these players were ineligible to receive points that year for an array of different reasons:
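If you'd rather do this step in pandas, here's a rough equivalent of the VLOOKUP and cleanup above - a minimal sketch, assuming hypothetical sheet names for one season and the column names used later in the script:

import pandas as pd

# Hypothetical sheet names for a single season's data
sg = pd.read_excel('pga_yearly_combined.xlsx', sheet_name='sg_2024')
points = pd.read_excel('pga_yearly_combined.xlsx', sheet_name='points_2024')

# A left merge mirrors VLOOKUP: players with no points total get NaN
merged = sg.merge(points[['player_name', 'fedex_points']],
                  on='player_name', how='left')

# Sort descending and drop the players who were ineligible for points
merged = (merged.sort_values(by='fedex_points', ascending=False)
                .dropna(subset=['fedex_points']))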

For data presentation purposes, I wanted to be able to show each player's FedEx Cup Rank as a string:
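Continuing the pandas sketch, that conversion could look like this (assuming the rank simply follows the sorted points order):

# Rank by points (1 = most points), stored as text so dashboard tools
# treat it as a label rather than a numeric axis
merged['fedex_rank'] = (merged['fedex_points']
                        .rank(ascending=False, method='min')
                        .astype(int)
                        .astype(str))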

FedEx Cup scoring not only changes at the end of the season, but has also fluctuated dramatically over the last 18 years, so we'll need to standardize point values across years.

To do this, we'll express each player's fedex_points as the number of standard deviations from the PGA Tour mean that year - in other words, a z-score.

Here's a screenshot showing how to perform that calculation in Excel:
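And for the pandas version of the same step, one line does it (again assuming the column names used later in the script):

# z-score: distance from the yearly mean, measured in standard deviations
merged['std_dev'] = ((merged['fedex_points'] - merged['fedex_points'].mean())
                     / merged['fedex_points'].std())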

Next, we'll concatenate all of the individual years into a single sheet for use when creating our dashboard using the following Python script:

import pandas as pd

years = [str(y) for y in range(2007, 2025)]

# Read every season's sheet at once (this returns a dict of DataFrames,
# one per sheet), then stack them into a single DataFrame
sheets = pd.read_excel('pga_yearly_combined.xlsx', sheet_name=years)
grouped = pd.concat(sheets.values(), ignore_index=True)
    
# If you're running this locally, the block below would append the DataFrame
# "grouped" as a new sheet in an existing Excel file:
#
# with pd.ExcelWriter('pga_data_combined.xlsx', engine='openpyxl',
#                     mode='a', if_sheet_exists='overlay') as writer:
#     grouped.to_excel(writer, sheet_name='all_years', index=False)
#
# Instead, I've uploaded the file with the data already concatenated.

Now we're going to run a series of regression models from the scikit-learn Python library and rank them by their Adjusted R-Squared scores to determine which model most accurately predicts the FedEx Cup Points standard deviations.

Adjusted R-Squared measures how much of the variance in the dependent variable a regression model can explain using the independent variables, with a penalty for the number of variables in the model (i.e. how accurate the model is, without rewarding it for piling on extra predictors).
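Concretely, for n observations, p independent variables, and an ordinary R-Squared score, the formula is:

Adjusted R² = 1 - (1 - R²) × (n - 1) / (n - p - 1)

This is the same calculation you'll see in the modeling code below.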

For a much more in-depth explanation of Adjusted R-Squared and its use in determining model accuracy, check out this DataCamp article.

First, let's import all of the Python libraries we'll be using:

import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.inspection import permutation_importance
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.kernel_ridge import KernelRidge
# Note: Ridge replaces LogisticRegression here - despite its name,
# LogisticRegression is a classifier and errors out on a continuous target
from sklearn.linear_model import (LinearRegression, SGDRegressor,
                                  BayesianRidge, Ridge, Lasso)

Then we'll create a DataFrame to store a running score for each model, which we'll use to determine the best one:

# 'Ridge' replaces the original 'Logistic Regression' entry to match the swap above
models_scores = pd.DataFrame({'Model':['Linear Reg', 'Gradient Boost', 'SGDRegressor',
                                        'SVR', 'Bayesian Ridge', 'Kernel Ridge',
                                        'Ridge', 'Lasso'],
                              'Overall Score':[0,0,0,0,0,0,0,0]}).set_index('Model')

Now we're going to create one mighty function that will take the Strokes Gained and FedEx Cup standard deviation information for each year, run it through a series of regression models, and rank the models by accuracy.

The models, from most to least accurate, will receive the following point values:

  • 10
  • 8
  • 6
  • 4
  • 3
  • 2
  • 1
  • 0

That scoring will happen for each year, and the points will be added to the associated model in the DataFrame we created above, allowing us to see which model was the most consistently accurate across all years.

def model_finder(year):

    global models_scores

    # importing data and dropping the columns the models shouldn't see
    df = pd.read_excel('pga_yearly_combined.xlsx', sheet_name=year)
    df = df.drop(columns=['year', 'player_name', 'fedex_points', 'fedex_rank'])

    X = df.drop(['std_dev'], axis=1)
    y = df['std_dev']

    # creating train and test sets
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    # scaling features to [0, 1], fitting the scaler on the training set only
    scaler = MinMaxScaler()
    scaler.fit(X_train)
    X_train = scaler.transform(X_train)
    X_test = scaler.transform(X_test)

    # the eight models to race against each other (Ridge stands in for
    # LogisticRegression, which can't fit a continuous target like std_dev)
    models = {'Linear Reg': LinearRegression(),
              'Gradient Boost': GradientBoostingRegressor(random_state=42),
              'SGDRegressor': SGDRegressor(random_state=42),
              'SVR': SVR(),
              'Bayesian Ridge': BayesianRidge(),
              'Kernel Ridge': KernelRidge(),
              'Ridge': Ridge(),
              'Lasso': Lasso()}

    # fit each model, predict on the test set, and score it with Adjusted r2
    rows = []
    for name, model in models.items():
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        adj_r2 = (1 - (((1 - r2_score(y_test, y_pred)) * (len(df) - 1))
                  / (len(df) - df.shape[1] - 1)))
        rows.append({'Model': name, 'Score': adj_r2})

    output = pd.DataFrame(rows)
    output = output.sort_values(by='Score', ascending=False)
    output = output.reset_index(drop=True)

    # award points by rank: 10 for the most accurate model down to 0 for the least
    points = [10, 8, 6, 4, 3, 2, 1, 0]
    for rank, pts in enumerate(points):
        models_scores.loc[output.iloc[rank]['Model'], 'Overall Score'] += pts
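
Finally, a quick usage sketch (reusing the years list from the concatenation script above): run the function for every season, then check the running totals:

for year in years:
    model_finder(year)

print(models_scores.sort_values(by='Overall Score', ascending=False))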