MLB Data — DataLab

Did Rickey Henderson have a significant impact on the Stolen Base Percentage across the MLB or would it have remained the same without him?

It sure looks like it, until we look at the specific difference he made in the steal percentage.

His steal percentage did skew the MLB percentage by nearly half a point once. In 1983, he stole 103 bases, and that moved the league percentage up by 0.47 percentage points, which is still a lot for one player. Rickey held 0.16% (1/624) of the roster slots in MLB that year.

I thought Rickey would make a 2-3% difference, so maybe it's just that my perception was overblown. I can't imagine anyone else having that effect on stolen base percentage.
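
The arithmetic behind that kind of shift can be sketched in a few lines. This is only an illustration: the league and player totals below are made-up round numbers, not the real 1983 figures from the CSV files, and the helper names `steal_pct` and `pct_shift_without_player` are hypothetical.

```python
def steal_pct(sb, cs):
    """Stolen base percentage: successful steals over all attempts, times 100."""
    return sb / (sb + cs) * 100


def pct_shift_without_player(league_sb, league_cs, player_sb, player_cs):
    """Points the league percentage moves because one player's attempts are included."""
    with_player = steal_pct(league_sb, league_cs)
    without_player = steal_pct(league_sb - player_sb, league_cs - player_cs)
    return with_player - without_player


# Illustrative totals only, NOT the real 1983 numbers from the CSV files.
league_sb, league_cs = 3000, 1500   # hypothetical league totals
player_sb, player_cs = 100, 20      # hypothetical elite base stealer

shift = pct_shift_without_player(league_sb, league_cs, player_sb, player_cs)
roster_share = 1 / 624 * 100        # one roster slot out of 624, as in the text

print(f"League pct with player:    {steal_pct(league_sb, league_cs):.2f}")
print(f"League pct without player: {steal_pct(league_sb - player_sb, league_cs - player_cs):.2f}")
print(f"Shift: {shift:.2f} points from {roster_share:.2f}% of roster slots")
```

The point of the comparison is that a single player's share of roster slots is tiny, so even a shift of under half a point is far larger than his share of the league would suggest.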

# get pandas for DataFrame
import pandas as pd

# read in the data
file_path = "MLB_YearByYearHitting_Totals.csv"
hitting_totals = pd.read_csv(file_path)

hitting_totals.head(5)

# steal percentage each year: SB / (SB + CS), as a percentage
steals = hitting_totals[["Year", "SB", "CS"]].copy()
steals["Pct"] = (steals["SB"] / (steals["SB"] + steals["CS"]) * 100).round(2)

modern = steals[steals["Year"]>1930]
recent = steals[steals["Year"]>2004]
# since 10 years before the last expansion
last_expansion = steals[steals["Year"]>1987]
since_68 = steals[steals["Year"]>1967]
rickey_era = steals[steals["Year"]>1978]

import matplotlib.pyplot as plt

ax = modern.plot(x="Year", y="Pct", kind="line", title="Steal Percentage in MLB")
ax.get_legend().remove()

# Add annotation for Rickey Henderson's career start (MLB debut in 1979)
ax.annotate("Rickey Henderson's\ncareer starts",
            xy=(1979, modern[modern["Year"] == 1979]["Pct"].values[0]),
            xytext=(1940, modern["Pct"].max() - 10), 
            arrowprops=dict(facecolor='red', shrink=0.05))
# Add annotation for Rickey Henderson's career end
ax.annotate("Rickey Henderson's\ncareer ends", 
            xy=(2003, modern[modern["Year"] == 2003]["Pct"].values[0] - 1), 
            xytext=(1990, modern["Pct"].max() - 20), 
            arrowprops=dict(facecolor='red', shrink=0.05))

plt.show()

last_expansion.plot(x="Year", y="Pct", kind="line", title="Steal Percentage in MLB since 1988").get_legend().remove()

import numpy as np

recent.plot(x="Year", y="Pct", kind="line", title="Steal Percentage in MLB since 2005", xticks=np.arange(2005, 2026, 2), rot=45).get_legend().remove()

since_68.plot(x="Year", y="Pct", kind="line", title="Steal Percentage in MLB since 1968").get_legend().remove()

since_68

file_path = "Rickey_YearByYearHitting.csv"
rickey = pd.read_csv(file_path)

rickey.head(5)

# Rickey's steal percentage each year: SB / (SB + CS), as a percentage
rickey_steals = rickey[["Year", "SB", "CS"]].copy()
rickey_steals["Pct"] = (rickey_steals["SB"] / (rickey_steals["SB"] + rickey_steals["CS"]) * 100).round(2)

rickey_since_68 = rickey_steals[rickey_steals["Year"]>1967]

rickey_since_68.plot(x="Year", y="Pct", kind="line", title="Rickey's Steal Percentage in MLB since 1968").get_legend().remove()

rickey_steals

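# display Rickey's seasons indexed by year
# (set_index returns a new frame; rickey_since_68 itself is unchanged)
rickey_since_68.set_index("Year")

# merge defaults to an inner join, so only the years present in both frames are kept
no_rickey = steals.merge(rickey_since_68, on="Year", suffixes=("_MLB","_Rickey"))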
no_rickey["SB_No_Rickey"] = no_rickey["SB_MLB"] - no_rickey["SB_Rickey"]
no_rickey["CS_No_Rickey"] = no_rickey["CS_MLB"] - no_rickey["CS_Rickey"]
no_rickey["Pct_No_Rickey"] = (no_rickey["SB_No_Rickey"]/(no_rickey["SB_No_Rickey"] + no_rickey["CS_No_Rickey"]) * 100).round(2)
# round the difference so floating-point noise from the subtraction is not displayed
no_rickey["Pct_Difference"] = (no_rickey['Pct_MLB'] - no_rickey['Pct_No_Rickey']).round(2)

no_rickey.plot(x="Year", y=["Pct_MLB","Pct_Rickey","Pct_No_Rickey"])
no_rickey.plot(x="Year", y=["Pct_MLB","Pct_No_Rickey"], title="No Rickey")
no_rickey.plot(x="Year", y=["Pct_Difference"], title="The Difference")
no_rickey
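
One thing the merge above quietly assumes is that the player file has exactly one row per season. If a season were split across two teams and stored as two rows, the league totals for that year would be duplicated in the result. Below is a minimal sketch of one way to guard against that, using tiny made-up frames in place of the two CSV files; the numbers and the `_Player` names are hypothetical, and whether the real file has split seasons is not something the data shown here confirms.

```python
import pandas as pd

# Tiny made-up frames standing in for the two CSV files (hypothetical numbers).
league = pd.DataFrame({"Year": [2000, 2001], "SB": [3000, 3100], "CS": [1300, 1400]})
# Two rows for 2000 mimic a season split across two teams.
player = pd.DataFrame({"Year": [2000, 2000, 2001], "SB": [5, 31, 25], "CS": [2, 9, 7]})

# Collapse the player's rows to one per year so the merge cannot duplicate league totals.
player_by_year = player.groupby("Year", as_index=False)[["SB", "CS"]].sum()

# validate="one_to_one" makes pandas raise MergeError if either side repeats a year.
merged = league.merge(player_by_year, on="Year", suffixes=("_MLB", "_Player"),
                      validate="one_to_one")

sb_rest = merged["SB_MLB"] - merged["SB_Player"]
cs_rest = merged["CS_MLB"] - merged["CS_Player"]
merged["Pct_MLB"] = merged["SB_MLB"] / (merged["SB_MLB"] + merged["CS_MLB"]) * 100
merged["Pct_No_Player"] = sb_rest / (sb_rest + cs_rest) * 100
# Round once, at the end, so the difference is not built from two rounded values.
merged["Pct_Difference"] = (merged["Pct_MLB"] - merged["Pct_No_Player"]).round(2)

print(merged)
```

Summing before the merge keeps one row per year, and the `validate` argument turns a silent duplication into an immediate error.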