
This is Aaron Judge. Judge is one of the physically largest players in Major League Baseball, standing 6 feet 7 inches (2.01 m) tall and weighing 282 pounds (128 kg). He also hit one of the hardest home runs ever recorded. How do we know this? Statcast.

Statcast is a state-of-the-art tracking system that uses high-resolution cameras and radar equipment to measure the precise location and movement of baseballs and baseball players. Introduced in 2015 to all 30 major league ballparks, Statcast data is revolutionizing the game. Teams are engaging in an "arms race" of data analysis, hiring analysts left and right in an attempt to gain an edge over their competition.

In this project, you're going to wrangle, analyze, and visualize historical Statcast data to compare Mr. Judge and another (extremely large) teammate of his, Giancarlo Stanton. They are similar in a lot of ways, one being that they hit a lot of home runs. Stanton and Judge led baseball in home runs in 2017, with 59 and 52, respectively. These are exceptional totals - the player in third "only" had 45 home runs.

Stanton and Judge are also different in many ways. Let's find out how they compare!

The Data

There are two CSV files, judge.csv and stanton.csv, both of which contain Statcast data for 2015-2017. Each row represents one pitch thrown to a batter.
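
Because every row is a pitch rather than a plate appearance, most rows carry no outcome. The sketch below uses a tiny made-up table (not the real files) and assumes, as is typical of Statcast exports, that the events column is empty on pitches that do not end a plate appearance. It shows that value_counts() skips those empty rows, so event totals are much smaller than the number of rows.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for judge.csv / stanton.csv: one row per pitch.
pitches = pd.DataFrame({
    "game_year": [2017, 2017, 2017, 2016, 2017],
    "events": [np.nan, "home_run", np.nan, "single", "strikeout"],
})

# Filter to one season, then count outcomes; NaN rows are ignored by default.
events_2017 = pitches.loc[pitches["game_year"] == 2017, "events"].value_counts()

# Number of pitches (rows) in that season, for comparison.
n_pitches_2017 = int((pitches["game_year"] == 2017).sum())

print(events_2017)
print("pitches in 2017:", n_pitches_2017)
```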

Custom Functions

Two functions have also been provided to help you visualize home run zones:

  • assign_x_coord: Assigns an x-coordinate to Statcast's strike zone numbers.
  • assign_y_coord: Assigns a y-coordinate to Statcast's strike zone numbers.
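
The two helpers map zones 1-9, numbered left to right and top to bottom, onto a 3x3 grid. As a rough illustration of that mapping, the sketch below computes the same coordinates with integer arithmetic on a whole column at once. The function name zone_to_xy and the demo values are hypothetical and not part of the project; zones outside 1-9 come back as NaN.

```python
import numpy as np
import pandas as pd

def zone_to_xy(zone: pd.Series) -> pd.DataFrame:
    """Hypothetical vectorized equivalent of assign_x_coord / assign_y_coord."""
    zone = pd.to_numeric(zone, errors="coerce")
    # Keep only zones 1-9; everything else (11-14, missing) becomes NaN.
    z = zone.where(zone.between(1, 9))
    x = (z - 1) % 3 + 1    # 1 = left third, 2 = middle, 3 = right
    y = 3 - (z - 1) // 3   # 3 = upper third, 2 = middle, 1 = lower
    return pd.DataFrame({"zone_x": x, "zone_y": y})

# Made-up zone values, including one outside the strike zone and one missing.
demo = pd.Series([1, 5, 9, 11, np.nan])
coords = zone_to_xy(demo)
print(coords)
```

The row-wise apply used in the notebook is easier to read; a column-wise version like this mainly matters when the table is large.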

# Run this cell to begin
# Import the necessary packages
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Load Aaron Judge's Statcast data
judge = pd.read_csv('judge.csv')

# Load Giancarlo Stanton's Statcast data
stanton = pd.read_csv('stanton.csv')

# Display all columns (pandas will collapse some columns if we don't set this option)
pd.set_option('display.max_columns', None)

# Custom Functions
def assign_x_coord(row):
    """
    Assigns an x-coordinate to Statcast's strike zone numbers. Zones 11, 12, 13,
    and 14 are ignored for plotting simplicity.
    """
    # Left third of strike zone
    if row.zone in [1, 4, 7]:
        return 1
    # Middle third of strike zone
    if row.zone in [2, 5, 8]:
        return 2
    # Right third of strike zone
    if row.zone in [3, 6, 9]:
        return 3
    
def assign_y_coord(row):
    """
    Assigns a y-coordinate to Statcast's strike zone numbers. Zones 11, 12, 13,
    and 14 are ignored for plotting simplicity.
    """
    # Upper third of strike zone
    if row.zone in [1, 2, 3]:
        return 3
    # Middle third of strike zone
    if row.zone in [4, 5, 6]:
        return 2
    # Lower third of strike zone
    if row.zone in [7, 8, 9]:
        return 1
    
# Display the last five rows of the Aaron Judge file
judge.tail()
# Start coding here. Use as many cells as you like!
#Familiarize with the data
judge.head(5)
stanton.head(5)
judge.columns
# Print the unique values of events
print(judge["events"].unique())

print(stanton["events"].unique())
# Count each event type in 2017 for each player
# Judge
judge_events_2017 = judge.loc[judge['game_year'] == 2017, 'events'].value_counts()
print("Aaron Judge batted ball event totals, 2017:")
print(judge_events_2017)

# Stanton
stanton_events_2017 = stanton.loc[stanton['game_year'] == 2017, 'events'].value_counts()
print("Giancarlo Stanton batted ball event totals, 2017:")
print(stanton_events_2017)
# Which player hits home runs slightly lower and harder?
# Filter each player's data down to home runs

j_homerun= judge.loc[judge["events"]== 'home_run']
j_homerun.head()
#Stanton's 
s_homerun= stanton.loc[stanton["events"]== 'home_run']
# Plot launch_speed versus launch_angle for both players using KDE plots
fig1, ax1 = plt.subplots(ncols=2, sharex=True, sharey=True)
# fill=True shades the density contours (the capitalized `Shade` keyword is not valid)
sns.kdeplot(x=j_homerun.launch_angle, y=j_homerun.launch_speed, cmap="mako", fill=True, ax=ax1[0])
ax1[0].set_title("Aaron Judge\nHome Runs, 2015-2017")

sns.kdeplot(x=s_homerun.launch_angle, y=s_homerun.launch_speed, cmap="mako", fill=True, ax=ax1[1])
ax1[1].set_title("Giancarlo Stanton\nHome Runs, 2015-2017")
plt.show()

player_hr= "Stanton"


# Compare the pitch velocity on home runs
# Stack both players' home run rows into one DataFrame.
# The default axis=0 appends rows; axis=1 would place the tables side by side instead.
combined_homerun = pd.concat([j_homerun, s_homerun])
combined_homerun.head()

# Create boxplots for player name vs release speed
combined_homerun.boxplot(column='release_speed', by='player_name', grid=False)

plt.title('Boxplot of Release Speed by Player Name')
plt.suptitle('')  # Suppress the default title to avoid overlap
plt.xlabel('Player Name')
plt.ylabel('Release Speed')
plt.show()

player_fast= 'Judge'
# Visualize the home run strike zones
# Keep only zones 1-9; .copy() avoids a SettingWithCopyWarning when adding columns
judge_strike_hr = j_homerun.loc[j_homerun["zone"] <= 9].copy()
stanton_strike_hr = s_homerun.loc[s_homerun["zone"] <= 9].copy()

# Assign Cartesian coordinates to each zone
judge_strike_hr["zone_x"] = judge_strike_hr.apply(assign_x_coord, axis=1)
judge_strike_hr["zone_y"] = judge_strike_hr.apply(assign_y_coord, axis=1)

stanton_strike_hr["zone_x"] = stanton_strike_hr.apply(assign_x_coord, axis=1)
stanton_strike_hr["zone_y"] = stanton_strike_hr.apply(assign_y_coord, axis=1)
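# Create one 2D histogram per player, side by side, so the second does not draw over the first
fig, ax = plt.subplots(ncols=2, sharex=True, sharey=True, figsize=(10, 4))
ax[0].hist2d(judge_strike_hr["zone_x"], judge_strike_hr["zone_y"], bins=3, cmap="Blues")
ax[0].set_title("Aaron Judge\nHome Runs by Zone, 2015-2017")
ax[1].hist2d(stanton_strike_hr["zone_x"], stanton_strike_hr["zone_y"], bins=3, cmap="Blues")
ax[1].set_title("Giancarlo Stanton\nHome Runs by Zone, 2015-2017")
plt.show()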
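
The answers stored in player_hr and player_fast are read off the plots by eye. One way to back them up with numbers is a per-player median of the relevant columns. The sketch below runs on a small made-up table with placeholder names ("Player A", "Player B") and invented values, so it says nothing about the real data; it only assumes the columns player_name, launch_speed, launch_angle, and release_speed used earlier in the notebook.

```python
import pandas as pd

# Invented values for illustration only; not taken from judge.csv or stanton.csv.
toy_homeruns = pd.DataFrame({
    "player_name": ["Player A"] * 3 + ["Player B"] * 3,
    "launch_speed": [108.0, 110.0, 112.0, 104.0, 106.0, 111.0],
    "launch_angle": [24.0, 26.0, 28.0, 27.0, 29.0, 31.0],
    "release_speed": [92.0, 94.0, 95.0, 89.0, 93.0, 96.0],
})

# Median of each metric per player; medians are less sensitive to outliers than means.
summary = (
    toy_homeruns
    .groupby("player_name")[["launch_speed", "launch_angle", "release_speed"]]
    .median()
)
print(summary)
```

In the notebook itself, the same groupby could be applied to combined_homerun in place of the toy table.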