Clustering WNBA Players by Shot Location and Usage
Abstract
Using the wehoop package from the open-source SportsDataVerse library, we examined player shooting data and estimated advanced metrics from the 2024 WNBA season. The Player Shot Location function pulled season-long shot data (field goals made, field goals attempted and field goal percentage) from different zones of the basketball court. The Player Estimated Metrics function pulled various estimated season-long advanced offensive and defensive metrics from the 2024 season.
We focused on the offensive side of the court, joining the two datasets and clustering players based on their shooting profiles and usage rate. We fit a hierarchical clustering model to identify a logical number of clusters and conducted further descriptive analysis to describe the primary characteristics of each cluster.
Loading Libraries and Packages
library(dplyr)
library(ggplot2)
library(tidyr)
library(tibble)
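# Install naniar only if it is not already available
if (!requireNamespace("naniar", quietly = TRUE)) install.packages("naniar")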
library(naniar)
library(corrplot)
library(reshape2)
library(dendextend)
library(data.table)
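# Install wehoop only if it is not already available
if (!requireNamespace("wehoop", quietly = TRUE)) install.packages("wehoop")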
library(wehoop)
Load 2024 Player Shot Locations and Usage
To build the clustering inputs, we called two data functions from the wehoop package. Both pulled data from the most recently completed season, the 2024 campaign:
- wnba_leaguedashplayershotlocations(): Calls the WNBA Stats API and returns Player Shot Location data.
- wnba_playerestimatedmetrics(): Calls the WNBA Stats API and returns a range of estimated advanced player metrics. We specifically isolated the Estimated Usage stat, which estimates the percentage of a team's possessions that a player used (took a shot, got fouled or turned the ball over).
In both instances, we extracted the first element of the returned list, a tibble covering the 157 players who qualified for the 2024 WNBA leaderboards based on total minutes and games played.
Note that in RStudio we loaded the data directly using these functions. That code is included below, but for Datalab's environment we used the write_csv() function to export each dataframe so that it could be uploaded as a data source.
# Code to call list df directly from wehoop package
shot_list <- wnba_leaguedashplayershotlocations()
player_shot_locations_df <- shot_list[[1]]
metrics_list <- wnba_playerestimatedmetrics()
player_metrics_df <- metrics_list[[1]]
str(player_shot_locations_df)
str(player_metrics_df)
# Code to create a CSV from each dataframe instead
library(readr)
write_csv(player_metrics_df, "player_metrics.csv")
write_csv(player_shot_locations_df, "player_shot_locations.csv")
# Load datasets
player_metrics <- read_csv("player_metrics.csv")
glimpse(player_metrics)
player_shot_locations <- read_csv("player_shot_locations.csv")
glimpse(player_shot_locations)
Data Cleaning
Both function calls returned every field as a character variable. Our first data cleaning step was therefore to convert the numeric and integer fields to the appropriate types, since hierarchical clustering relies on numeric data.
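The conversion described above can also be written more compactly with dplyr's across(). The following is only a sketch on a made-up two-row tibble (toy_shots and its values are hypothetical), and it assumes the columns follow the _FGM, _FGA and _FG_PCT suffix pattern seen in the shot location data. It is not the code used in this analysis.

```r
library(dplyr)

# Hypothetical stand-in for the shot location data, every field stored as character
toy_shots <- tibble(
  PLAYER_ID = c("1001", "1002"),
  Mid_Range_FGM = c("12", "30"),
  Mid_Range_FGA = c("40", "75"),
  Mid_Range_FG_PCT = c("0.3", "0.4")
)

toy_shots <- toy_shots %>%
  mutate(
    # Made and attempted counts become integers
    across(ends_with(c("_FGM", "_FGA")), as.integer),
    # Percentages become doubles
    across(ends_with("_FG_PCT"), as.numeric)
  )

str(toy_shots)
```

Selecting by suffix keeps the ID column untouched and avoids repeating each column name, at the cost of relying on the naming pattern staying consistent.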
Note that we did not convert every numeric field in the Estimated Metrics data (player_metrics), since we intended to focus solely on the Estimated Usage Percentage statistic. In addition, because Datalab's data source upload read "PLAYER_ID" and "TEAM_ID" as numeric data types, we converted those back to character types: they are identifiers rather than measurements and should not be included in the clustering analysis later on.
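An alternative to converting the ID fields back afterwards is to declare them as character when the CSV is read. The sketch below assumes the files are loaded with readr's read_csv(); the temporary file and its two rows are made up for illustration and do not come from the WNBA data.

```r
library(readr)

# Hypothetical two-row file standing in for player_metrics.csv
tmp <- tempfile(fileext = ".csv")
writeLines(c("PLAYER_ID,PLAYER_NAME,E_USG_PCT",
             "1001,Player A,0.301",
             "1002,Player B,0.187"), tmp)

# Default type guessing treats the all-digit ID column as numeric
guessed <- suppressMessages(read_csv(tmp))

# Declaring the column type keeps the ID as a character identifier;
# columns not listed in cols() are still guessed
typed <- read_csv(tmp, col_types = cols(PLAYER_ID = col_character()))

class(guessed$PLAYER_ID)
class(typed$PLAYER_ID)
```

Either approach leaves the IDs as character; declaring the type at read time simply removes one conversion step.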
Convert Character Variables to Numeric and Integer in Both Dataframes
# Player Shot Locations Data Conversions
player_shot_locations <- player_shot_locations %>%
  mutate(PLAYER_ID = as.character(PLAYER_ID),
         TEAM_ID = as.character(TEAM_ID),
         AGE = as.integer(AGE),
         Restricted_Area_FGM = as.integer(Restricted_Area_FGM),
         Restricted_Area_FGA = as.integer(Restricted_Area_FGA),
         Restricted_Area_FG_PCT = as.numeric(Restricted_Area_FG_PCT),
         In_The_Paint_Non_RA_FGM = as.integer(In_The_Paint_Non_RA_FGM),
         In_The_Paint_Non_RA_FGA = as.integer(In_The_Paint_Non_RA_FGA),
         In_The_Paint_Non_RA_FG_PCT = as.numeric(In_The_Paint_Non_RA_FG_PCT),
         Mid_Range_FGM = as.integer(Mid_Range_FGM),
         Mid_Range_FGA = as.integer(Mid_Range_FGA),
         Mid_Range_FG_PCT = as.numeric(Mid_Range_FG_PCT),
         Left_Corner_3_FGM = as.integer(Left_Corner_3_FGM),
         Left_Corner_3_FGA = as.integer(Left_Corner_3_FGA),
         Left_Corner_3_FG_PCT = as.numeric(Left_Corner_3_FG_PCT),
         Right_Corner_3_FGM = as.integer(Right_Corner_3_FGM),
         Right_Corner_3_FGA = as.integer(Right_Corner_3_FGA),
         Right_Corner_3_FG_PCT = as.numeric(Right_Corner_3_FG_PCT),
         Above_the_Break_3_FGM = as.integer(Above_the_Break_3_FGM),
         Above_the_Break_3_FGA = as.integer(Above_the_Break_3_FGA),
         Above_the_Break_3_FG_PCT = as.numeric(Above_the_Break_3_FG_PCT),
         Backcourt_FGM = as.integer(Backcourt_FGM),
         Backcourt_FGA = as.integer(Backcourt_FGA),
         Backcourt_FG_PCT = as.numeric(Backcourt_FG_PCT),
         Corner_3_FGM = as.integer(Corner_3_FGM),
         Corner_3_FGA = as.integer(Corner_3_FGA),
         Corner_3_FG_PCT = as.numeric(Corner_3_FG_PCT))
str(player_shot_locations)
# Player Estimated Metrics Data Conversions
player_metrics <- player_metrics %>%
  mutate(PLAYER_ID = as.character(PLAYER_ID),
         GP = as.integer(GP),
         W = as.integer(W),
         L = as.integer(L),
         W_PCT = as.numeric(W_PCT),
         MIN = as.numeric(MIN),
         E_OFF_RATING = as.numeric(E_OFF_RATING),
         E_DEF_RATING = as.numeric(E_DEF_RATING),
         E_NET_RATING = as.numeric(E_NET_RATING),
         E_AST_RATIO = as.numeric(E_AST_RATIO),
         E_OREB_PCT = as.numeric(E_OREB_PCT),
         E_DREB_PCT = as.numeric(E_DREB_PCT),
         E_REB_PCT = as.numeric(E_REB_PCT),
         E_TOV_PCT = as.numeric(E_TOV_PCT),
         E_USG_PCT = as.numeric(E_USG_PCT),
         E_PACE = as.numeric(E_PACE))
str(player_metrics)
Drop Extraneous Columns in Both Dataframes
Because the Shot Locations dataframe already contained general Corner 3 values, we eliminated the extraneous Left Corner 3 and Right Corner 3 fields.
# Drop Specific Corner 3 Locations from Shot Locations Dataframe
player_shot_locations <- player_shot_locations %>%
  select(-c(Left_Corner_3_FGM:Right_Corner_3_FG_PCT))
# Select only Player_ID, Player_Name and Usage Rate from Metrics Dataframe
player_metrics <- player_metrics %>%
  select(c(PLAYER_ID, PLAYER_NAME, E_USG_PCT))
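With both dataframes trimmed, the join mentioned in the abstract could look something like the sketch below. It assumes PLAYER_ID is the join key and uses an inner join so that only players present in both tables are kept; the toy tibbles and their values are hypothetical, and the actual join used in the analysis may differ.

```r
library(dplyr)

# Hypothetical stand-ins for the two cleaned dataframes
toy_shots <- tibble(
  PLAYER_ID = c("1001", "1002", "1003"),
  Mid_Range_FGA = c(40L, 75L, 10L)
)
toy_metrics <- tibble(
  PLAYER_ID = c("1001", "1002"),
  PLAYER_NAME = c("Player A", "Player B"),
  E_USG_PCT = c(0.30, 0.19)
)

# Keep only players that appear in both tables, matched on the character ID
toy_joined <- inner_join(toy_shots, toy_metrics, by = "PLAYER_ID")

glimpse(toy_joined)
```

If both real tables carry a PLAYER_NAME column, dplyr will suffix the duplicates (PLAYER_NAME.x and PLAYER_NAME.y) unless one copy is dropped or added to the by argument.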