
My Cy Young Predictor

Here are my midseason(ish) predictions for who will win the National League and American League Cy Young Awards. To determine what makes a Cy Young winner, I compiled a list of every winner, vote getter, and additional pitchers from every year since 2015. I excluded 2020 since it was a shortened season. I then investigated which stats they had in common relative to their competition. Below are my selected criteria.

My criteria:

  1. ERA and ERA-
  2. Wins
  3. Team Wins
  4. Playoffs
  5. Strikeouts
  6. Innings Pitched

All data as of 06/17/2025

Below is the output of my code, which consolidates all award winners, vote getters, and additional pitchers. Each player's stats are accompanied by their vote placement, team, and team success. One thing to note is that some players played for multiple teams in a season. Players who played for multiple teams in a given year are not used when estimating the team success value for award recipients.

library(tidyr)
library(dplyr)
library(ggplot2)
library(readr)
library(readxl)  # Added the readxl library

cy_young.df <- read_xlsx("MLB Database(AutoRecovered).xlsx", sheet = 11) %>%
  mutate("Name and Year" = paste(Name, Year, sep = " ")) %>%
  distinct(`Name and Year`, .keep_all = TRUE)  # Keep all columns when removing duplicates

mlb_teams.df <- read_xlsx("MLB Database(AutoRecovered).xlsx", sheet = 5) %>%
  select(Team, "Team Name", League, Division)

mlb_standings_2015on <- read_xlsx("MLB Database(AutoRecovered).xlsx", sheet = 10)

mlb_standings_2015on <- mlb_standings_2015on %>%
  select(Season, Team, "Team Wins", "Team Losses") %>%
  mutate("Team and Season" = paste(Team, Season, sep = " "))  # Changed Year to Season

cy_young.df <- cy_young.df %>%
  left_join(mlb_teams.df, by = "Team") %>%
  mutate("Team and Season" = paste(Team, Season, sep = " ")) %>%
  left_join(mlb_standings_2015on, by = "Team and Season") %>%
  select(-ends_with(".y")) %>%
  rename(Team = Team.x, Season = Season.x)

cy_young.df

cy_young_winners <- cy_young.df %>%
  filter(`Place:` == 1)

library(readxl)
library(dplyr)
library(tidyr)
library(stringr)

cy_young.df <- read_xlsx("MLB Database(AutoRecovered).xlsx", sheet = 11) 

mlb_teams.df <- read_xlsx("MLB Database(AutoRecovered).xlsx", sheet = 5) %>%
  select(Team, "Team Name", League)

mlb_standings_2015on <- read_xlsx("MLB Database(AutoRecovered).xlsx", sheet = 10)

mlb_standings_2015on <- mlb_standings_2015on %>%
  select(Season, Team, "Team Wins", "Team Losses") %>%
  mutate("Team and Season" = paste(Team, Season, sep = " "))  # Changed Year to Season

cy_young.df <- cy_young.df %>%
  left_join(mlb_teams.df, by = "Team") %>%
  mutate("Team and Season" = paste(Team, Season, sep = " ")) %>%
  left_join(mlb_standings_2015on, by = "Team and Season") %>%
  select(-ends_with(".y")) %>%
  rename(Team = Team.x, Season = Season.x)

cy_young_winners <- cy_young.df %>%
  filter(`Place:` == 1)

mlb_playoffs_2015on <- read_xlsx("MLB Database(AutoRecovered).xlsx", sheet = 12) %>%
  separate(col = Series, into = c("Year", "Series")) %>%
  separate(col = Matchup, into = c("1st", "2nd", "3rd", "4th", "5th", "6th", "7th", "8th", "9th"), sep = " ") %>%
  mutate(
    `1st` = ifelse(grepl("Los|New|San|Kansas|Boston|St.|Tampa", `1st`), paste(`1st`, `2nd`), `1st`),
    `2nd` = ifelse(grepl("Los|New|San|Kansas|St.|Tampa", `1st`), NA, `2nd`),
    `2nd` = ifelse(is.na(`2nd`), `3rd`, `2nd`),
    `3rd` = ifelse(!is.na(`4th`), paste(`3rd`, `4th`), `3rd`),
    `2nd` = ifelse(`2nd` == "Red", "Sox", `2nd`),
    `2nd` = ifelse(`2nd` == "Blue", "Blue Jays", `2nd`)
  ) %>%
  mutate(Team = paste(`1st`, `2nd`, sep = " ")) %>%
  select(-c(`1st`, `2nd`)) %>%
  mutate(Team = gsub("\\*", "", Team)) %>%
  mutate(Team_Two = paste(`4th`, `5th`, `6th`, `7th`, sep = " ")) %>%
  mutate(Team_Two = sub(".*vs\\.", "", Team_Two)) %>%  
  mutate(Team_Two = gsub("[^a-zA-Z. ]", "", Team_Two)) %>%  
  mutate(Team_Two = gsub("\\b(AL|NL)\\b", "", Team_Two)) %>%
  mutate(Team = sub(".*vs\\.", "", Team)) %>%  
  mutate(Team = gsub("[^a-zA-Z. ]", "", Team)) %>%  
  mutate(Team = gsub("\\b(AL|NL)\\b", "", Team)) %>%
  mutate(Team = str_replace_all(Team, "([a-z])([A-Z])", "\\1 \\2")) %>%  
  mutate(Team = str_trim(Team, side = "right")) %>%  # Remove trailing white space
  mutate(Team_Two = str_trim(Team_Two, side = "right")) %>%  # Remove trailing white space
  select(Year, Team, Team_Two, Series) 

mlb_playoffs_2015on1 <- mlb_playoffs_2015on %>%
  select(Team, Year)
mlb_playoffs_2015on2 <- mlb_playoffs_2015on %>%
  select(Team_Two, Year) %>%
  rename(Team = Team_Two)

mlb_teams.df <- mlb_teams.df %>%
  rename(Team_abvr = Team, Team = "Team Name") %>%
  mutate(Team = ifelse(Team == "Philidelphia Phillies", "Philadelphia Phillies", Team))

mlb_playoffs_total <- mlb_playoffs_2015on1 %>%
  union(mlb_playoffs_2015on2) %>%
  left_join(mlb_teams.df, by = "Team") %>%
  mutate(Team_abvr = ifelse(is.na(Team_abvr), "CLE", Team_abvr)) %>%
  mutate("Team and Season" = paste(Team_abvr, Year, sep = " "))

players_on_playoff_teams <- cy_young.df %>%
  semi_join(mlb_playoffs_total, by = "Team and Season") %>%
  mutate(Playoffs = TRUE)

players_not_on_playoff_teams <- cy_young.df %>%
  anti_join(mlb_playoffs_total, by = "Team and Season") %>%
  mutate(Playoffs = FALSE)

cy_young.df <- players_on_playoff_teams %>%
  union(players_not_on_playoff_teams)

team_standings.df <- read_xlsx("datalab_export_2025-06-18 23_27_25.xlsx") 

team_standings <- team_standings.df %>%
  select("Team Name", G, W, L, win_percent, League) %>%
  rename(team_games = G, team_wins = W, team_loses = L, team_win_percent = win_percent) %>%
  mutate(`Team Name` = ifelse(`Team Name` == "Cinncinnati Reds", "Cincinnati Reds", `Team Name`)) %>%
  left_join(mlb_teams.df, by = c("Team Name" = "Team")) %>%
  mutate(Team_abvr = ifelse(is.na(Team_abvr), "OAK", Team_abvr)) %>%
  select(-League.y) %>%
  rename(League = League.x) 
  

ERA and ERA-

ERA:

Earned run average is a common statistic used to value a pitcher's ability to keep an opposing team from scoring runs. To calculate ERA, you divide the pitcher's total earned runs by the number of innings pitched and multiply the result by nine. This estimates how many runs you can expect a pitcher to give up if he pitched a complete game. Side note: if a run scores while a pitcher is pitching, and the baserunner who crosses home plate was assisted by an action not charged to the pitcher, such as a passed ball by the catcher or an error, or was a runner inherited from the previous pitcher, the run is not charged to the current pitcher.
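As a quick check on the arithmetic, here is a minimal sketch of the ERA calculation. The helper names `era()` and `ip_to_decimal()` are my own for illustration and are not part of the analysis code. The conversion step assumes innings are stored in box-score notation, where 200.1 means 200 and one third innings.

```r
# Hypothetical helper: convert box-score innings (e.g. 200.1 = 200 1/3)
# to true decimal innings. Assumes the digit after the point counts outs.
ip_to_decimal <- function(ip) {
  whole <- floor(ip)
  outs <- round((ip - whole) * 10)  # 0, 1, or 2 outs recorded
  whole + outs / 3
}

# Hypothetical helper: earned runs per nine innings.
era <- function(earned_runs, innings_pitched) {
  9 * earned_runs / innings_pitched
}

era(60, 200)                  # 60 earned runs over 200 innings gives 2.7
era(60, ip_to_decimal(200.1)) # slightly lower, since 200.1 means 200 1/3 innings
```

If the innings column already holds true decimals, the conversion step should be skipped.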

ERA_leaders <- cy_young.df %>%
  group_by(Year) %>%
  slice_min(order_by = ERA, n = 5)
ERA_leaders

cy_young_winners %>%
  anti_join(ERA_leaders, by = c("Name", "Year"))  # match on pitcher and season, not name alone
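The comparison above pairs a per-year leaderboard with `anti_join()` to find winners who missed the cut. Here is a small sketch on made-up data (pitchers "A", "B", and "C" are placeholders, not real stats) showing why the join key matters: matching on name alone hides a winner who led the league in a different season.

```r
library(dplyr)

# Made-up toy data for illustration only.
toy <- data.frame(
  Name = c("A", "B", "C", "A", "B", "C"),
  Year = c(2021, 2021, 2021, 2022, 2022, 2022),
  ERA = c(2.10, 2.90, 3.40, 3.10, 2.80, 2.40),
  `Place:` = c(2, 1, 3, 1, 2, 3),
  check.names = FALSE
)

# Best ERA in each season (top 1 here to keep the example small).
leaders <- toy %>%
  group_by(Year) %>%
  slice_min(order_by = ERA, n = 1) %>%
  ungroup()

winners <- toy %>%
  filter(`Place:` == 1)

# Name only: drops "A" because A led in 2021, even though A won in 2022.
missed_by_name <- winners %>%
  anti_join(leaders, by = "Name")

# Name and Year: keeps both winners, since neither led in the year they won.
missed_by_name_year <- winners %>%
  anti_join(leaders, by = c("Name", "Year"))
```

With real data the same idea applies to any of the leaderboards in this post.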

ERA-

ERA- (ERA minus) is a FanGraphs measure of earned run average relative to league average and adjusted for park factors. It's scaled so that 100 is league average, and each point above or below 100 represents a percentage point above or below league average. Because it is built on ERA, lower is better.

The calculation:

ERA- = 100 * ((ERA + (ERA - ERA * (PF / 100))) / League ERA)

Here PF is the park factor and League ERA is the AL or NL average, depending on the pitcher's league.

Note: Park factor is commonly found by dividing the number of runs scored by a team at home by the number of runs scored by that same team in away games. Stadiums with higher park factors are tougher to pitch in because the park naturally sees more runs scored. ERA- therefore places all pitchers on a level playing field, since each pitcher's ERA is adjusted for the park he plays in.

Below is the list of ERA- leaders over the past ten years. Five of the top six won the award (once again, Zack Greinke losing in 2015 needs to be studied). Seventeen of the 18 winners placed in the top 5 in ERA- in their respective year. This stat is very important in determining the winner.
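To make the formula concrete, here is a minimal sketch that applies it directly. The function name `era_minus()` and the input numbers are made up for illustration. FanGraphs computes the published values itself, so this is only a way to see how the pieces interact.

```r
# Hypothetical helper implementing the formula as written above.
# park_factor is on a 100 scale (100 = neutral park).
era_minus <- function(era, park_factor, league_era) {
  100 * (era + (era - era * (park_factor / 100))) / league_era
}

# Illustrative inputs, not real pitchers:
era_minus(3.00, 100, 4.00)  # neutral park: 75, i.e. 25 percent better than league
era_minus(3.00, 105, 4.00)  # hitter-friendly park: 71.25, the same ERA looks better
```

The second call shows the park adjustment at work: the same raw ERA earns a lower (better) ERA- when it was posted in a park that inflates scoring.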

ERAminus_leaders <- cy_young.df %>%
  group_by(Year) %>%
  slice_min(order_by = `ERA-`, n = 5) %>%
  filter(`Place:` == 1)
ERAminus_leaders

cy_young_winners %>%
  anti_join(ERAminus_leaders, by = c("Name", "Year"))  # match on pitcher and season, not name alone

Rick Porcello (2016) was 15th in ERA and 8th in ERA-.

Wins

A win is awarded to a pitcher if his team takes the lead while he is pitching and maintains that lead, provided that a starting pitcher has pitched at least five innings. If the starter does not pitch at least five innings, the official scorer awards the win to a relief pitcher. This stat is valuable because it notes how often the pitcher leaves the game with a lead his team never relinquishes, but it is flawed because it favors pitchers who have powerful offenses behind them. Still, this stat is widely used, though less so over time, to value a pitcher's ability to give the team wins in the standings.

Below is a list of the top seven pitchers in wins each year. Every Cy Young winner finished in the top 12 in wins, and only four finished outside the top 7. No pitcher won the award while recording 8 losses. The four winners outside the top 7 mostly got there on the strength of their ERA while playing for a non-playoff team.

wins_leaders <- cy_young.df %>%
  group_by(Year) %>%
  slice_max(order_by = W, n = 7)  # top seven in wins per year
wins_leaders

cy_young_winners %>%
  anti_join(wins_leaders, by = c("Name", "Year"))  # match on pitcher and season, not name alone

Team Wins