Scikit-Learn Tutorial: Baseball Analytics in Python Pt 1

May 4th, 2017 in Machine Learning

The Python programming language is a great option for data science and predictive analytics, as it comes equipped with multiple packages which cover most of your data analysis needs. For machine learning in Python, Scikit-learn (sklearn) is a popular choice; it is built on NumPy, SciPy, and Matplotlib (N-dimensional arrays, scientific computing, and data visualization, respectively).

In this tutorial, you’ll see how you can easily load in data from a database with sqlite3, how you can explore your data and improve its quality with pandas and matplotlib, and how you can then use the Scikit-Learn package to extract some valid insights from your data.

If you would like to take a machine learning course, check out DataCamp’s Supervised Learning with scikit-learn course.

Part 1: Predicting MLB Team Wins per Season

In this project, you’ll test out several machine learning models from sklearn to predict the number of games that a Major League Baseball team won in a season, based on the team’s statistics and other variables from that season. If I were a gambling man (and I most certainly am a gambling man), I could build a model using historic data from previous seasons to forecast the upcoming one. Given the time series nature of the data, you could generate indicators such as average wins per year over the previous five years, and other such factors, to make a highly accurate model. That is outside the scope of this tutorial, however, and you will treat each row as independent. Each row of your data will consist of a single team for a specific year.

Sean Lahman compiled this data on his website and it was transformed to a sqlite database here.

Importing Data

You will read in the data by querying a sqlite database using the sqlite3 package and converting to a DataFrame with pandas. Your data will be filtered to only include currently active modern teams and only years where the team played 150 or more games.

First, download the file “lahman2016.sqlite” (here). Then, import pandas under the alias pd for efficiency’s sake. You might remember that pd is the common alias for pandas. Lastly, import sqlite3 and connect to the database, like this:

# import `pandas` and `sqlite3`
import pandas as pd
import sqlite3

# Connecting to SQLite Database
conn = sqlite3.connect('lahman2016.sqlite')

Next, you write a query, execute it and fetch the results.

# Querying Database for all seasons where a team played 150 or more games and is still active today. 
query = '''select * from Teams 
inner join TeamsFranchises
on Teams.franchID == TeamsFranchises.franchID
where Teams.G >= 150 and TeamsFranchises.active == 'Y';
'''

# Executing the query and fetching all rows as a list of tuples
Teams = conn.execute(query).fetchall()

Tip: If you want to learn more about using SQL with Python, consider taking DataCamp’s Introduction to Databases in Python

Using pandas, you then convert the results to a DataFrame and print the first 5 rows using the head() method:

# Convert results to DataFrame
teams_df = pd.DataFrame(Teams)

# Print out first 5 rows
print(teams_df.head())

Each of the columns contains data related to a specific team and year. Some of the more important variables are listed below. A full list of the variables can be found here.

  • yearID - Year
  • teamID - Team
  • franchID - Franchise (links to TeamsFranchises table)
  • G - Games played
  • W - Wins
  • LgWin - League Champion (Y or N)
  • WSWin - World Series Winner (Y or N)
  • R - Runs scored
  • AB - At bats
  • H - Hits by batters
  • HR - Home runs by batters
  • BB - Walks by batters
  • SO - Strikeouts by batters
  • SB - Stolen bases
  • CS - Caught stealing
  • HBP - Batters hit by pitch
  • SF - Sacrifice flies
  • RA - Opponents’ runs scored
  • ER - Earned runs allowed
  • ERA - Earned run average
  • CG - Complete games
  • SHO - Shutouts
  • SV - Saves
  • IPOuts - Outs Pitched (innings pitched x 3)
  • HA - Hits allowed
  • HRA - Home runs allowed
  • BBA - Walks allowed
  • SOA - Strikeouts by pitchers
  • E - Errors
  • DP - Double Plays
  • FP - Fielding percentage
  • name - Team’s full name

For those of you who aren’t that familiar with baseball, here’s a brief explanation of how the game works, with some of the variables included.

Baseball is played between two teams (which you’ll find in the data by name or teamID) of nine players each. These two teams take turns batting and fielding. The batting team attempts to score runs by taking turns batting a ball that is thrown by the pitcher of the fielding team, then running counter-clockwise around a series of four bases: first, second, third, and home plate. The fielding team tries to prevent runs by getting hitters or base runners out in any of several ways, and a run (R) is scored when a player advances around the bases and returns to home plate. A player on the batting team who reaches base safely will attempt to advance to subsequent bases during teammates’ turns batting, such as on a hit (H), stolen base (SB), or by other means.

The teams switch between batting and fielding whenever the fielding team records three outs. One turn batting for both teams, beginning with the visiting team, constitutes an inning. A game is composed of nine innings, and the team with the greater number of runs at the end of the game wins. Baseball has no game clock and although most games end in the ninth inning, if a game is tied after nine innings it goes into extra innings and will continue indefinitely until one team has a lead at the end of an extra inning.

For a detailed explanation of the game of baseball, check out the official rules of Major League Baseball.

Cleaning and Preparing The Data

As you can see above, the DataFrame doesn’t have column headers. You can add headers by assigning a list of your headers to the DataFrame’s columns attribute.

# Adding column names to dataframe
cols = ['yearID','lgID','teamID','franchID','divID','Rank','G','Ghome','W',
        'L','DivWin','WCWin','LgWin','WSWin','R','AB','H','2B','3B','HR','BB','SO','SB',
        'CS','HBP','SF','RA','ER','ERA','CG','SHO','SV','IPouts','HA','HRA','BBA',
        'SOA','E','DP','FP','name','park','attendance','BPF','PPF','teamIDBR',
        'teamIDlahman45','teamIDretro','franchID','franchName','active','NAassoc']
teams_df.columns = cols

# Print the first rows of `teams_df`
print(teams_df.head())

# Print the length of `teams_df`
print(len(teams_df))

The len() function will let you know how many rows you’re dealing with: 2,287 is not a huge number of data points to work with, so hopefully there aren’t too many null values.
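As an aside, pandas can also run the query itself and pick up the column names from the result set, which avoids typing the header list by hand. The sketch below is only an illustration: it uses a small in-memory database with made-up rows as a stand-in for lahman2016.sqlite, and the name toy_df is hypothetical. Note that selecting `*` from the join in the tutorial would still give you two franchID columns.

```python
import sqlite3
import pandas as pd

# Hypothetical in-memory stand-in for lahman2016.sqlite (toy rows, not real data)
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Teams (yearID INTEGER, franchID TEXT, G INTEGER, W INTEGER);
    CREATE TABLE TeamsFranchises (franchID TEXT, active TEXT);
    INSERT INTO Teams VALUES
        (2015, 'AAA', 162, 90), (2015, 'BBB', 162, 72), (1890, 'CCC', 120, 60);
    INSERT INTO TeamsFranchises VALUES ('AAA', 'Y'), ('BBB', 'N'), ('CCC', 'Y');
""")

# Same filter as the tutorial query, but with explicit columns
query = """
    SELECT Teams.yearID, Teams.franchID, Teams.G, Teams.W
    FROM Teams
    INNER JOIN TeamsFranchises ON Teams.franchID = TeamsFranchises.franchID
    WHERE Teams.G >= 150 AND TeamsFranchises.active = 'Y';
"""

# read_sql_query executes the query and labels the columns from the result set
toy_df = pd.read_sql_query(query, conn)
conn.close()

print(toy_df.columns.tolist())
print(len(toy_df))
```

Only the active franchise with at least 150 games survives the filter, and the DataFrame arrives with its headers already in place.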

Prior to assessing the data quality, let’s first eliminate the columns that aren’t necessary or are derived from the target column (Wins). This is where knowledge of the data you are working with starts to become very valuable. It doesn’t matter how much you know about coding or statistics if you don’t have any knowledge of the data that you’re working with. Being a lifelong baseball fan certainly helped me with this project.

# Dropping your unnecessary column variables.
drop_cols = ['lgID','franchID','divID','Rank','Ghome','L','DivWin','WCWin','LgWin','WSWin','SF','name','park','attendance','BPF','PPF','teamIDBR','teamIDlahman45','teamIDretro','franchID','franchName','active','NAassoc']

df = teams_df.drop(drop_cols, axis=1)

# Print out first rows of `df`
print(df.head())

As you read above, null values influence the data quality, which in turn can cause issues with machine learning algorithms.

That’s why you’ll remove those next. There are a few ways to eliminate null values, but it might be a better idea to first display the count of null values for each column so you can decide how to best handle them.

# Print out null values of all columns of `df`
print(df.isnull().sum(axis=0).tolist())

And here’s where you’ll see a trade-off: you need clean data, but you also don’t have a large amount of data to spare. Two of the columns have a relatively small number of null values: 110 in the SO (strikeouts) column and 22 in the DP (double plays) column. Two others have a relatively large number: 419 in the CS (caught stealing) column and 1777 in the HBP (hit by pitch) column.

If you eliminate the rows where SO or DP is null, you lose a little over five percent of your data. Since you’re trying to predict wins, runs scored and runs allowed are highly correlated with the target, and you want the data in those columns to be very accurate.

Strikeouts (SO) and double plays (DP) aren’t as important.
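The "a little over five percent" figure can be checked with a quick calculation from the counts quoted above. This is a rough sketch that assumes no row is missing both SO and DP, so it gives the worst case; if some rows are missing both values, fewer rows would be lost.

```python
# Figures quoted in the text
total_rows = 2287
null_so = 110   # null values in the SO column
null_dp = 22    # null values in the DP column

# Worst case (assumption): no row is missing both values,
# so every null value costs one row
max_rows_lost = null_so + null_dp
share_lost = max_rows_lost / total_rows

print(max_rows_lost)
print(round(share_lost * 100, 2))
```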

I think you’re better off keeping the rows and filling the null values with the median value from each of the columns by using the fillna() method. Caught stealing (CS) and hit by pitch (HBP) aren’t very important variables either. With so many null values in these columns, it’s best to eliminate the columns altogether.

# Eliminating columns with null values
df = df.drop(['CS','HBP'], axis=1)

# Filling null values
df['SO'] = df['SO'].fillna(df['SO'].median())
df['DP'] = df['DP'].fillna(df['DP'].median())

# Print out null values of all columns of `df`
print(df.isnull().sum(axis=0).tolist())

Exploring and Visualizing The Data

Now that you’ve cleaned up your data, you can do some exploration. You can get a better feel for the data set with a few simple visualizations. matplotlib is an excellent library for data visualization.

You import matplotlib.pyplot and rename it as plt for efficiency. If you’re working with a Jupyter notebook, you need to use the %matplotlib inline magic.

You’ll start by plotting a histogram of the target column so you can see the distribution of wins.

# import the pyplot module from matplotlib
import matplotlib.pyplot as plt

# Jupyter/IPython only: render plots inline (omit this line in a plain script)
%matplotlib inline

# Plotting distribution of wins
plt.hist(df['W'])
plt.xlabel('Wins')
plt.title('Distribution of Wins')

plt.show()

Note that if you're not using a Jupyter notebook, you have to make use of plt.show() to show the plots.

Print out the average number of wins (W) per team per season. You can use the mean() method for this.

print(df['W'].mean())

It can be useful to create bins for your target column while exploring your data, but you need to make sure not to include any feature that you generate from your target column when you train the model. Including a column of labels generated from the target column in your training set would be like giving your model the answers to the test.

To create your win labels, you’ll create a function called assign_win_bins which will take in an integer value (wins) and return an integer from 1 to 5 depending on the input value.

Next, you’ll create a new column win_bins by using the apply() method on the wins column and passing in the assign_win_bins() function.

# Creating bins for the win column
def assign_win_bins(W):
    if W < 50:
        return 1
    if W >= 50 and W <= 69:
        return 2
    if W >= 70 and W <= 89:
        return 3
    if W >= 90 and W <= 109:
        return 4
    if W >= 110:
        return 5

df['win_bins'] = df['W'].apply(assign_win_bins)
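The same five win bins can also be produced with pandas’ cut() function instead of a hand-written function. The sketch below is an alternative, not the tutorial’s method; it runs on a handful of made-up win totals and assumes wins are whole numbers, so that the right-closed edges 49, 69, 89 and 109 match the ranges described above.

```python
import numpy as np
import pandas as pd

# The tutorial's binning rule, repeated here for comparison
def assign_win_bins(W):
    if W < 50:
        return 1
    if W >= 50 and W <= 69:
        return 2
    if W >= 70 and W <= 89:
        return 3
    if W >= 90 and W <= 109:
        return 4
    if W >= 110:
        return 5

# Toy win totals chosen to sit on the bin boundaries (not real data)
wins = pd.Series([45, 50, 69, 70, 89, 90, 109, 110, 116])

# cut() uses right-closed intervals by default: (-inf, 49], (49, 69], ...
bins_cut = pd.cut(
    wins,
    bins=[-np.inf, 49, 69, 89, 109, np.inf],
    labels=[1, 2, 3, 4, 5],
).astype(int)

bins_apply = wins.apply(assign_win_bins)
print(bins_cut.tolist())
print(bins_cut.tolist() == bins_apply.tolist())
```

Keeping the bin edges in one list makes them easier to change than a chain of if statements, at the cost of being a little less explicit for readers new to pandas.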

Now let’s make a scatter graph with the year on the x-axis and wins on the y-axis and highlight the win_bins column with colors.

# Plotting scatter graph of Year vs. Wins
plt.scatter(df['yearID'], df['W'], c=df['win_bins'])
plt.title('Wins Scatter Plot')
plt.xlabel('Year')
plt.ylabel('Wins')

plt.show()

As you can see in the above scatter plot, there are very few seasons from before 1900 and the game was much different back then. Because of that, it makes sense to eliminate those rows from the data set.

# Filter for rows where 'yearID' is greater than 1900
df = df[df['yearID'] > 1900]

When dealing with continuous data and creating linear models, integer values such as a year can cause issues. A linear model treats the year as a magnitude, and it is unlikely that the number 1950 relates to the rest of the data in the way the model would infer.

You can avoid these issues by creating new variables that label the data based on the yearID value.

Anyone who follows the game of baseball knows that, as Major League Baseball (MLB) progressed, different eras emerged where the number of runs per game increased or decreased significantly. The dead ball era of the early 1900s is an example of a low scoring era and the steroid era at the turn of the 21st century is an example of a high scoring era.

Let’s make a graph below that indicates how much scoring there was for each year.

You’ll start by creating dictionaries runs_per_year and games_per_year. Loop through the dataframe using the iterrows() method. Populate the runs_per_year dictionary with years as keys and how many runs were scored that year as the value. Populate the games_per_year dictionary with years as keys and how many games were played that year as the value.

# Create runs per year and games per year dictionaries
runs_per_year = {}
games_per_year = {}

for i, row in df.iterrows():
    year = row['yearID']
    runs = row['R']
    games = row['G']
    if year in runs_per_year:
        runs_per_year[year] = runs_per_year[year] + runs
        games_per_year[year] = games_per_year[year] + games
    else:
        runs_per_year[year] = runs
        games_per_year[year] = games

print(runs_per_year)
print(games_per_year)

Next, create a dictionary called mlb_runs_per_game. Iterate through the games_per_year dictionary with the items() method. Populate the mlb_runs_per_game dictionary with years as keys and the number of runs scored per game, league wide, as the value.

# Create MLB runs per game (per year) dictionary
mlb_runs_per_game = {}
for k, v in games_per_year.items():
    year = k
    games = v
    runs = runs_per_year[year]
    mlb_runs_per_game[year] = runs / games

print(mlb_runs_per_game)
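The yearly totals and the runs-per-game ratio described above can also be computed without an explicit loop by grouping on yearID. The following is a minimal sketch of that alternative on a made-up four-row DataFrame (the name toy and its values are hypothetical); it should give the same dictionary shape as the loop-based approach.

```python
import pandas as pd

# Toy team-season rows (not real data)
toy = pd.DataFrame({
    "yearID": [2000, 2000, 2001, 2001],
    "R": [800, 700, 750, 650],
    "G": [162, 162, 162, 161],
})

# Sum runs and games over all teams within each year
totals = toy.groupby("yearID")[["R", "G"]].sum()

# League-wide runs per game for each year, as a {year: value} dictionary
mlb_runs_per_game = (totals["R"] / totals["G"]).to_dict()

print(mlb_runs_per_game)
```

On a data set of this size the speed difference hardly matters, but groupby() tends to be the more idiomatic choice in pandas and avoids the bookkeeping of two dictionaries.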

Finally, create your plot from the mlb_runs_per_game dictionary by putting the years on the x-axis and runs per game on the y-axis.

# Create lists from mlb_runs_per_game dictionary
lists = sorted(mlb_runs_per_game.items())
x, y = zip(*lists)

# Create line plot of Year vs. MLB runs per Game
plt.plot(x, y)
plt.title('MLB Yearly Runs per Game')
plt.xlabel('Year')
plt.ylabel('MLB Runs per Game')

plt.show()

Adding New Features

Now that you have a better idea of scoring trends, you can create new variables that indicate a specific era that each row of data falls in based on the yearID. You’ll follow the same process as you did above when you created the win_bins column.

This time, however, you will create dummy columns: one new column for each era. You can use the pandas get_dummies() function for this.

# Creating "year_label" column, which will give your algorithm information about how certain years are related
# (Dead ball eras, Live ball/Steroid Eras)

def assign_label(year):
    if year < 1920:
        return 1
    elif year >= 1920 and year <= 1941:
        return 2
    elif year >= 1942 and year <= 1945:
        return 3
    elif year >= 1946 and year <= 1962:
        return 4
    elif year >= 1963 and year <= 1976:
        return 5
    elif year >= 1977 and year <= 1992:
        return 6
    elif year >= 1993 and year <= 2009:
        return 7
    elif year >= 2010:
        return 8

# Add `year_label` column to `df`
df['year_label'] = df['yearID'].apply(assign_label)

dummy_df = pd.get_dummies(df['year_label'], prefix='era')

# Concatenate `df` and `dummy_df`
df = pd.concat([df, dummy_df], axis=1)

print(df.head())
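To see what dummy columns look like, it can help to run get_dummies() on a tiny example first. The sketch below uses three made-up era labels (the name labels is hypothetical): each distinct label becomes its own column, and every row has exactly one of those columns switched on. Depending on your pandas version, the values are shown as True/False or as 1/0.

```python
import pandas as pd

# Toy era labels for three rows (not real data)
labels = pd.Series([1, 2, 1], name="year_label")

# One indicator column per distinct label, named with the given prefix
dummies = pd.get_dummies(labels, prefix="era")

print(dummies.columns.tolist())
print(dummies.sum(axis=1).tolist())
```

Because each row belongs to exactly one era, the indicator columns in every row sum to one.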

Since you already did the work to determine MLB runs per game for each year, add that data to the data set.

# Create column for MLB runs per game from the mlb_runs_per_game dictionary
def assign_mlb_rpg(year):
    return mlb_runs_per_game[year]

df['mlb_rpg'] = df['yearID'].apply(assign_mlb_rpg)

Now you’ll convert the years into decades by creating dummy columns for each decade. Then you can drop the columns that you don’t need anymore.

# Convert years into decade bins and create dummy variables
def assign_decade(year):
    if year < 1920:
        return 1910
    elif year >= 1920 and year <= 1929:
        return 1920
    elif year >= 1930 and year <= 1939:
        return 1930
    elif year >= 1940 and year <= 1949:
        return 1940
    elif year >= 1950 and year <= 1959:
        return 1950
    elif year >= 1960 and year <= 1969:
        return 1960
    elif year >= 1970 and year <= 1979:
        return 1970
    elif year >= 1980 and year <= 1989:
        return 1980
    elif year >= 1990 and year <= 1999:
        return 1990
    elif year >= 2000 and year <= 2009:
        return 2000
    elif year >= 2010:
        return 2010

df['decade_label'] = df['yearID'].apply(assign_decade)
decade_df = pd.get_dummies(df['decade_label'], prefix='decade')
df = pd.concat([df, decade_df], axis=1)

# Drop unnecessary columns
df = df.drop(['yearID','year_label','decade_label'], axis=1)
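As an aside, the decade binning described above can also be written with integer arithmetic instead of a chain of comparisons. The sketch below is only an illustration: the name assign_decade_compact is hypothetical, and it assumes the same rule as the tutorial, namely that every year before 1920 falls into the 1910 bin and every year from 2010 onward falls into the 2010 bin.

```python
def assign_decade_compact(year):
    """Return the decade bin for a season, clamped to the range 1910..2010."""
    decade = (year // 10) * 10          # e.g. 1955 -> 1950
    return min(max(decade, 1910), 2010)  # fold early and late years into the end bins

# Try a few seasons, including the edge cases at both ends
sample_years = (1901, 1919, 1920, 1955, 2009, 2016)
print([assign_decade_compact(y) for y in sample_years])
# prints [1910, 1910, 1920, 1950, 2000, 2010]
```

You could pass a function like this to df['yearID'].apply() in the same way as the longer version.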

The bottom line in the game of baseball is how many runs you score and how many runs you allow. You can often improve the accuracy of your model by creating columns that are ratios of other columns of data. Runs per game and runs allowed per game will be great features to add to your data set.

Pandas makes this simple: divide the R column by the G column to create the R_per_game column, and divide the RA column by the G column to create the RA_per_game column.

# Create new features for Runs per Game and Runs Allowed per Game
df['R_per_game'] = df['R'] / df['G']
df['RA_per_game'] = df['RA'] / df['G']

Now look at how each of the two new variables relates to the target wins column by making a couple of scatter plots. Plot the runs per game on the x-axis of one graph and runs allowed per game on the x-axis of the other. Plot the W column on each y-axis.

# Create scatter plots for runs per game vs. wins and runs allowed per game vs. wins
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(12, 6))

ax1 = fig.add_subplot(1,2,1)
ax2 = fig.add_subplot(1,2,2)

ax1.scatter(df['R_per_game'], df['W'], c='blue')
ax1.set_title('Runs per Game vs. Wins')
ax1.set_ylabel('Wins')
ax1.set_xlabel('Runs per Game')

ax2.scatter(df['RA_per_game'], df['W'], c='red')
ax2.set_title('Runs Allowed per Game vs. Wins')
ax2.set_xlabel('Runs Allowed per Game')

plt.show()

Before getting into any machine learning models, it can be useful to see how each of the variables is correlated with the target variable. Pandas makes this easy with the corr() method.

df.corr()['W']

Another feature you can add to the dataset is a set of labels derived from the K-means clustering algorithm provided by sklearn. K-means is a simple clustering algorithm that partitions the data into k clusters, where k is the number of centroids you specify. Each data point is assigned to the cluster whose centroid has the lowest Euclidean distance from the data point.

You can learn more about K-means clustering here.
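To make the assignment step concrete, here is a minimal sketch with NumPy. The points and centroids are made-up toy values, not baseball data, and the variable names are hypothetical. It shows only how each point is matched to its nearest centroid; a full K-means run also repeatedly moves the centroids, which sklearn handles for you.

```python
import numpy as np

# Hypothetical toy data: four points in 2-D and two fixed centroids
points = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 7.5]])
centroids = np.array([[1.0, 1.5], [8.5, 8.0]])

# Euclidean distance from every point to every centroid,
# giving an array of shape (n_points, n_centroids)
toy_distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)

# Each point is labeled with the index of its nearest centroid
toy_labels = toy_distances.argmin(axis=1)
print(toy_labels)  # prints [0 0 1 1]
```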

First, create a DataFrame that leaves out the target variable:

attributes = ['G','R','AB','H','2B','3B','HR','BB','SO','SB','RA','ER','ERA','CG',
'SHO','SV','IPouts','HA','HRA','BBA','SOA','E','DP','FP','era_1','era_2','era_3','era_4','era_5','era_6','era_7','era_8','decade_1910','decade_1920','decade_1930','decade_1940','decade_1950','decade_1960','decade_1970','decade_1980','decade_1990','decade_2000','decade_2010','R_per_game','RA_per_game','mlb_rpg']

data_attributes = df[attributes]

# Print the first rows of `df`
print(df.head())

One aspect of K-means clustering that you must determine before using the model is how many clusters you want. You can get a better idea of your ideal number of clusters by using sklearn’s silhouette_score() function. This function returns the mean silhouette coefficient over all samples, a value between -1 and 1. You want a higher silhouette score; on this data, the score decreases as more clusters are added.

# Import necessary modules from `sklearn`
from sklearn.cluster import KMeans
from sklearn import metrics

# Create silhouette score dictionary
s_score_dict = {}
for i in range(2,11):
    km = KMeans(n_clusters=i, random_state=1)
    l = km.fit_predict(data_attributes)
    s_s = metrics.silhouette_score(data_attributes, l)
    s_score_dict[i] = [s_s]

# Print out `s_score_dict`
print(s_score_dict)

Now you can initialize the model. Set your number of clusters to 6 and the random state to 1. Determine the Euclidean distance from each data point to each cluster center by using the fit_transform() method and then visualize the clusters with a scatter plot.

# Create K-means model and determine Euclidean distances for each data point
kmeans_model = KMeans(n_clusters=6, random_state=1)
distances = kmeans_model.fit_transform(data_attributes)

# Create scatter plot using labels from K-means model as color
labels = kmeans_model.labels_

plt.scatter(distances[:,0], distances[:,1], c=labels)
plt.title('Kmeans Clusters')

plt.show()

Now add the labels from your clusters into the data set as a new column. Also add the string “labels” to the attributes list, for use later.

# Add labels from K-means model to `df` DataFrame and attributes list
df['labels'] = labels
attributes.append('labels')

# Print the first rows of `df`
print(df.head())

Before you can build your model, you need to split your data into train and test sets. If you test your model on the same data you trained it on, you cannot tell whether it has overfit: the model may memorize the training data instead of learning patterns that generalize, which tends to produce an excessively complex model. That is also why an overfitted model performs poorly when you try to make predictions with new data.

But don’t worry just yet: there are a number of ways to cross-validate your model.
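One common approach is k-fold cross-validation, where the data is split into k parts and each part takes a turn as the test set. The sketch below is an illustration rather than part of the tutorial's workflow: it runs on synthetic data generated with NumPy (X_demo and y_demo are made-up names and values), and it assumes a scikit-learn version that provides sklearn.model_selection.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (an assumption for illustration, not baseball data)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=100)

# Five folds: each fold is held out once while the model trains on the other four
folds = KFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(LinearRegression(), X_demo, y_demo,
                         cv=folds, scoring="neg_mean_absolute_error")

# scikit-learn reports the negated error, so flip the sign to get MAE per fold
mae_per_fold = -scores
print(mae_per_fold)
print(mae_per_fold.mean())
```

Averaging the error over several folds gives a steadier estimate than a single split, at the cost of fitting the model several times.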

This time, you will simply take a random sample of 75 percent of your data for the train data set and use the other 25 percent for your test data set. Create a list numeric_cols with all of the columns you will use in your model. Next, create a new DataFrame data from the df DataFrame with the columns in the numeric_cols list. Then create your train and test data sets by sampling the DataFrame data.

# Create new DataFrame using only variables to be included in models
numeric_cols = ['G','R','AB','H','2B','3B','HR','BB','SO','SB','RA','ER','ERA','CG','SHO','SV','IPouts','HA','HRA','BBA','SOA','E','DP','FP','era_1','era_2','era_3','era_4','era_5','era_6','era_7','era_8','decade_1910','decade_1920','decade_1930','decade_1940','decade_1950','decade_1960','decade_1970','decade_1980','decade_1990','decade_2000','decade_2010','R_per_game','RA_per_game','mlb_rpg','labels','W']
data = df[numeric_cols]
print(data.head())

# Split data DataFrame into train and test sets
train = data.sample(frac=0.75, random_state=1)
test = data.loc[~data.index.isin(train.index)]

x_train = train[attributes]
y_train = train['W']
x_test = test[attributes]
y_test = test['W']

Selecting Error Metric and Model

Mean Absolute Error (MAE) is the metric you’ll use to determine how accurate your model is. It measures how close the predictions are to the eventual outcomes. Specifically, for this data, the metric gives you the average absolute number of wins by which your predictions missed their mark.

This means that if, on average, your predictions miss the target amount by 5 wins, your error metric will be 5.
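The arithmetic is easy to check by hand. In the short sketch below, the win totals are invented for illustration and the variable names are hypothetical; the calculation is the same one that sklearn's mean_absolute_error performs.

```python
# Hypothetical actual and predicted win totals for four team seasons
actual_wins = [90, 75, 81, 100]
predicted_wins = [85, 80, 81, 90]

# Absolute miss for each prediction, ignoring direction
errors = [abs(a - p) for a, p in zip(actual_wins, predicted_wins)]

# MAE is the plain average of those misses
mae_example = sum(errors) / len(errors)

print(errors)       # prints [5, 5, 0, 10]
print(mae_example)  # prints 5.0
```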

The first model you will train will be a linear regression model. You can import LinearRegression and mean_absolute_error from sklearn.linear_model and sklearn.metrics respectively, and then create a model lr. Next, you’ll fit the model, make predictions and determine the mean absolute error of the model.

# Import `LinearRegression` from `sklearn.linear_model`
from sklearn.linear_model import LinearRegression

# Import `mean_absolute_error` from `sklearn.metrics`
from sklearn.metrics import mean_absolute_error

# Create Linear Regression model, fit model, and make predictions
# Note: the `normalize` argument was removed in scikit-learn 1.2; omit it on newer versions
lr = LinearRegression(normalize=True)
lr.fit(x_train, y_train)
predictions = lr.predict(x_test)

# Determine mean absolute error
mae = mean_absolute_error(y_test, predictions)

# Print `mae`
print(mae)

Recall from above that the average number of wins was about 79. On average, the model is off by only 2.687 wins.

Now try a Ridge regression model. Import RidgeCV from sklearn.linear_model and create model rrm. The RidgeCV model allows you to set the alpha parameter, which is a complexity parameter that controls the amount of shrinkage (read more here). The model will use cross-validation to determine which of the alpha values you provide is ideal.
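To see what "shrinkage" means, the sketch below solves the ridge problem directly with NumPy for several alpha values and prints the size of the resulting coefficient vector. It is a simplified illustration, not how RidgeCV is implemented: the data is synthetic, there is no intercept term, and the helper ridge_coefficients is a hypothetical name.

```python
import numpy as np

# Synthetic data (an assumption for illustration): 50 rows, 3 features
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
true_w = np.array([3.0, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=50)

def ridge_coefficients(X, y, alpha):
    """Closed-form ridge solution without intercept: (X'X + alpha*I)^-1 X'y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

# Larger alpha means a stronger penalty, so the coefficients are pulled toward zero
alphas = (0.01, 0.1, 1.0, 10.0)
coef_norms = [float(np.linalg.norm(ridge_coefficients(X, y, a))) for a in alphas]
for a, norm in zip(alphas, coef_norms):
    print(f"alpha={a:<5} coefficient norm={norm:.3f}")
```

The printed norms get smaller as alpha grows, which is the shrinkage effect that the alpha parameter controls.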

Again, fit your model, make predictions and determine the mean absolute error.

# Import `RidgeCV` from `sklearn.linear_model`
from sklearn.linear_model import RidgeCV

# Create Ridge Linear Regression model, fit model, and make predictions
# Note: the `normalize` argument was removed in scikit-learn 1.2; on newer versions,
# omit it and scale the features yourself, which may change the error slightly
rrm = RidgeCV(alphas=(0.01, 0.1, 1.0, 10.0), normalize=True)
rrm.fit(x_train, y_train)
predictions_rrm = rrm.predict(x_test)

# Determine mean absolute error
mae_rrm = mean_absolute_error(y_test, predictions_rrm)
print(mae_rrm)

This model performed slightly better, and is off by 2.673 wins, on average.

Sports Analytics & Scikit-Learn

This concludes the first part of this tutorial series, in which you have seen how you can use Scikit-Learn to analyze sports data. You imported the data from an SQLite database, cleaned it up, explored aspects of it visually, and engineered several new features. You learned how to create a K-means clustering model and a couple of different linear regression models, and how to test your predictions with the mean absolute error metric.

In the second part, you’ll see how to use classification models to predict which players make it into the MLB Hall of Fame.
