Case Study: Exploring Baseball Pitching Data in R

Use a rich baseball dataset from MLB's Statcast system to practice your data exploration skills.

4 Hours, 14 Videos, 69 Exercises, 8,775 Learners
5750 XP


Course Description

This course is a case study in baseball analytics, exploratory data analysis, and the R language. It uses a rich dataset from Major League Baseball's (MLB) Statcast system to develop baseball analytics skills in R.

Throughout the course, you will use data on every pitch thrown by Zack Greinke during the 2015 MLB season. These data include information about pitch velocity, pitch type, pitch location, exit speed when the batter makes contact, the game situation (e.g. outs or ball-strike count), and the outcome of each pitch (e.g. strike, foul, home run, or walk).

By the end of the course, you will have a thorough understanding of the data and be able to create publication-quality visuals to communicate what you have found.

  1. 1

    Exploring pitch velocities


    Velocity is a key component in the arsenal of many pitchers. In this chapter, you will examine whether there was an uptick in Zack Greinke's velocity during his impressive July in 2015. The chapter introduces working with dates, plotting distributions with histograms, and using the tapply() function to compare groups.

    Did Zack Greinke pitch differently in July?
    50 xp
    Clean the data
    100 xp
    Check dates
    100 xp
    Delimit dates
    100 xp
    Subsets and histograms
    50 xp
    Velocity distribution
    100 xp
    Fastball velocity distribution
    100 xp
    Distribution comparisons with color
    100 xp
    Describe the histogram
    50 xp
    Using tapply() for comparisons
    50 xp
    tapply() for velocity changes
    100 xp
    Game-by-game velocity changes
    100 xp
    Tidying the data frame
    100 xp
    A game-by-game line plot
    100 xp
    Adding jittered points
    100 xp
    50 xp
  2. 3

    Exploring pitch locations

    As with velocity and pitch type, pitch location can play a key role in pitching success. This chapter uses the detailed location information in the MLB Statcast data to visualize changes in Greinke's pitch location choices in July and in different ball-strike counts. You will also use for loops to build plots.
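
To give a flavor of the date handling and tapply() comparisons mentioned in the first chapter, here is a minimal sketch in base R. It is not taken from the course: the toy data frame, its column names (game_date, pitch_type, start_speed), and all values are made up for illustration, and the real Statcast data may be organized differently.

```r
# Hypothetical toy data: column names and values are illustrative
# assumptions, not the course dataset.
pitches <- data.frame(
  game_date   = c("2015-06-28", "2015-06-28", "2015-07-04",
                  "2015-07-04", "2015-07-09", "2015-08-01"),
  pitch_type  = c("FF", "SL", "FF", "FF", "CH", "FF"),
  start_speed = c(91.5, 86.0, 92.5, 93.5, 88.0, 92.0),
  stringsAsFactors = FALSE
)

# Convert the text dates to Date objects so they can be compared.
pitches$game_date <- as.Date(pitches$game_date, format = "%Y-%m-%d")

# Label each pitch as thrown in July 2015 or at any other time.
pitches$july <- ifelse(
  pitches$game_date >= as.Date("2015-07-01") &
    pitches$game_date <= as.Date("2015-07-31"),
  "july", "other"
)

# Mean velocity in July versus the rest of the sample.
velo_by_month <- tapply(pitches$start_speed, pitches$july, mean)

# Mean velocity for each pitch type within each period.
# Combinations with no pitches come back as NA.
velo_by_type <- tapply(
  pitches$start_speed,
  list(pitches$pitch_type, pitches$july),
  mean
)

print(velo_by_month)
print(velo_by_type)
```

Passing a single grouping vector to tapply() returns a named vector, while passing a list of two grouping vectors returns a matrix with one row per pitch type and one column per period, which makes side-by-side comparison straightforward.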





Nick Carchedi, Tom Jeon


Intermediate R

Brian M. Mills

Assistant Professor at the University of Florida

Brian Mills is an Assistant Professor at the University of Florida, with research interests encompassing quantitative and economic analysis in sport. He earned a PhD and MA in Sport Management, an MA in Statistics, and an MA in Applied Economics from the University of Michigan. Brian has been an active contributor to the Sabermetric community through blogging about analytics and teaching how to use R to analyze baseball data.
