
# Scikit-Learn Tutorial: Baseball Analytics Pt 1

A scikit-learn tutorial on predicting MLB wins per season by fitting a KMeans clustering model and linear regression models to the data.
May 2017  · 17 min read

The Python programming language is a great option for data science and predictive analytics, as it comes equipped with multiple packages which cover most of your data analysis needs. For machine learning in Python, Scikit-learn (`sklearn`) is a great option and is built on NumPy, SciPy, and Matplotlib (N-dimensional arrays, scientific computing, and data visualization respectively).

In this tutorial, you’ll see how you can easily load in data from a database with `sqlite3`, how you can explore your data and improve its data quality with `pandas` and `matplotlib`, and how you can then use the Scikit-Learn package to extract some valid insights out of your data.

If you would like to take a machine learning course, check out DataCamp’s Supervised Learning with scikit-learn course.

## Part 1: Predicting MLB Team Wins per Season

In this project, you’ll test out several machine learning models from `sklearn` to predict the number of games that a Major League Baseball team won in a given season, based on the team’s statistics and other variables from that season. If I were a gambling man (and I most certainly am a gambling man), I could build a model using historic data from previous seasons to forecast the upcoming one. Given the time series nature of the data, you could generate indicators such as average wins per year over the previous five years, and other such factors, to make a highly accurate model. That is outside the scope of this tutorial, however, and you will treat each row as independent. Each row of your data will consist of a single team for a specific year.

Sean Lahman compiled this data on his website and it was transformed to a sqlite database here.

### Importing Data

You will read in the data by querying a sqlite database using the `sqlite3` package and converting to a DataFrame with `pandas`. Your data will be filtered to only include currently active modern teams and only years where the team played 150 or more games.

First, download the file “lahman2016.sqlite” (here). Then, import `pandas` under the alias `pd`, which is the common convention. Lastly, import `sqlite3` and connect to the database, like this:

```python
# import `pandas` and `sqlite3`
import pandas as pd
import sqlite3

# Connecting to SQLite Database
conn = sqlite3.connect('lahman2016.sqlite')
```

Next, you write a query, execute it and fetch the results.

```python
# Querying Database for all seasons where a team played 150 or more games and is still active today.
query = '''select * from Teams
inner join TeamsFranchises
on Teams.franchID == TeamsFranchises.franchID
where Teams.G >= 150 and TeamsFranchises.active == 'Y';
'''

# Executing the query and fetching all rows (a list of tuples).
Teams = conn.execute(query).fetchall()
```

Tip: If you want to learn more about using SQL with Python, consider taking DataCamp’s Introduction to Databases in Python course.

Using `pandas`, you then convert the results to a DataFrame and print the first 5 rows using the `head()` method:

```python
# Convert results to DataFrame
teams_df = pd.DataFrame(Teams)

# Print out first 5 rows
print(teams_df.head())
```

Each of the columns contains data related to a specific team and year. Some of the more important variables are listed below. A full list of the variables can be found here.

• `yearID` - Year
• `teamID` - Team
• `franchID` - Franchise (links to the `TeamsFranchises` table)
• `G` - Games played
• `W` - Wins
• `LgWin` - League Champion (Y or N)
• `WSWin` - World Series Winner (Y or N)
• `R` - Runs scored
• `AB` - At bats
• `H` - Hits by batters
• `HR` - Homeruns by batters
• `BB` - Walks by batters
• `SO` - Strikeouts by batters
• `SB` - Stolen bases
• `CS` - Caught stealing
• `HBP` - Batters hit by pitch
• `SF` - Sacrifice flies
• `RA` - Opponents runs scored
• `ER` - Earned runs allowed
• `ERA` - Earned run average
• `CG` - Complete games
• `SHO` - Shutouts
• `SV` - Saves
• `IPouts` - Outs Pitched (innings pitched x 3)
• `HA` - Hits allowed
• `HRA` - Homeruns allowed
• `BBA` - Walks allowed
• `SOA` - Strikeouts by pitchers
• `E` - Errors
• `DP` - Double Plays
• `FP` - Fielding percentage
• `name` - Team’s full name

For those of you who aren’t that familiar with baseball, here’s a brief explanation of how the game works, with some of the variables included.

Baseball is played between two teams (identified in the data by `name` or `teamID`) of nine players each. These two teams take turns batting and fielding. The batting team attempts to score runs by taking turns batting a ball that is thrown by the pitcher of the fielding team, then running counter-clockwise around a series of four bases: first, second, third, and home plate. The fielding team tries to prevent runs by getting hitters or base runners out in any of several ways. A run (`R`) is scored when a player advances around the bases and returns to home plate. A player on the batting team who reaches base safely will attempt to advance to subsequent bases during teammates’ turns batting, such as on a hit (`H`), stolen base (`SB`), or by other means.

The teams switch between batting and fielding whenever the fielding team records three outs. One turn batting for both teams, beginning with the visiting team, constitutes an inning. A game is composed of nine innings, and the team with the greater number of runs at the end of the game wins. Baseball has no game clock and although most games end in the ninth inning, if a game is tied after nine innings it goes into extra innings and will continue indefinitely until one team has a lead at the end of an extra inning.

For a detailed explanation of the game of baseball, check out the official rules of Major League Baseball.

### Cleaning and Preparing the Data

As you can see above, the DataFrame doesn’t have column headers. You can add headers by assigning a list of column names to the DataFrame’s `columns` attribute.

```python
# Adding column names to dataframe
cols = ['yearID','lgID','teamID','franchID','divID','Rank','G','Ghome','W',
        'L','DivWin','WCWin','LgWin','WSWin','R','AB','H','2B','3B','HR','BB','SO','SB',
        'CS','HBP','SF','RA','ER','ERA','CG','SHO','SV','IPouts','HA','HRA','BBA',
        'SOA','E','DP','FP','name','park','attendance','BPF','PPF','teamIDBR',
        'teamIDlahman45','teamIDretro','franchID','franchName','active','NAassoc']
teams_df.columns = cols

# Print the first rows of `teams_df`
print(teams_df.head())

# Print the length of `teams_df`
print(len(teams_df))
```
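The column names do not have to be typed by hand. As a minimal sketch, assuming a toy in-memory database (the three-column `Teams` table and its rows below are made up for illustration and are not the Lahman schema), `sqlite3` exposes the column names of an executed query through `cursor.description`, and `pd.read_sql_query()` returns a DataFrame whose headers are already set:

```python
import sqlite3

import pandas as pd

# Toy in-memory database standing in for lahman2016.sqlite (assumption: made-up rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Teams (yearID INTEGER, teamID TEXT, W INTEGER)")
conn.executemany(
    "INSERT INTO Teams VALUES (?, ?, ?)",
    [(2015, "AAA", 90), (2016, "BBB", 75)],
)

# Option 1: read the column names from the cursor after executing the query.
cursor = conn.execute("SELECT * FROM Teams")
col_names = [d[0] for d in cursor.description]
toy_df = pd.DataFrame(cursor.fetchall(), columns=col_names)

# Option 2: let pandas run the query and label the columns itself.
toy_df2 = pd.read_sql_query("SELECT * FROM Teams", conn)

print(toy_df.columns.tolist())
```

With a join such as the one in this tutorial, a column that exists in both tables (here `franchID`) would be expected to show up twice either way, so duplicated names still need handling.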

The `len()` function will let you know how many rows you’re dealing with: 2,287 is not a huge number of data points to work with, so hopefully there aren’t too many null values.

Prior to assessing the data quality, let’s first eliminate the columns that aren’t necessary or are derived from the target column (`W`). This is where knowledge of the data you are working with starts to become very valuable. It doesn’t matter how much you know about coding or statistics if you don’t have any knowledge of the data that you’re working with. Being a lifelong baseball fan certainly helped me with this project.

```python
# Dropping your unnecessary column variables.
drop_cols = ['lgID','franchID','divID','Rank','Ghome','L','DivWin','WCWin','LgWin',
             'WSWin','SF','name','park','attendance','BPF','PPF','teamIDBR',
             'teamIDlahman45','teamIDretro','franchID','franchName','active','NAassoc']

df = teams_df.drop(drop_cols, axis=1)

# Print out first rows of `df`
print(df.head())
```

As you read above, null values influence the data quality, which in turn can cause issues with machine learning algorithms.

That’s why you’ll remove those next. There are a few ways to eliminate null values, but it might be a better idea to first display the count of null values for each column so you can decide how to best handle them.

```python
# Print out null values of all columns of `df`
print(df.isnull().sum(axis=0).tolist())
```

And here’s where you’ll see a trade-off: you need clean data, but you also don’t have a large amount of data to spare. Two of the columns have a relatively small number of null values: there are 110 in the `SO` (Strikeouts) column and 22 in the `DP` (Double Plays) column. Two others have a relatively large number: there are 419 null values in the `CS` (Caught Stealing) column and 1777 in the `HBP` (Hit by Pitch) column.

If you eliminate the rows that have null values in the two lightly affected columns (`SO` and `DP`), you lose a little over five percent of your data. Since you’re trying to predict wins, runs scored and runs allowed are highly correlated with the target. You want data in those columns to be very accurate.
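The "little over five percent" figure can be checked with quick arithmetic from the counts quoted above. This sketch gives an upper bound, assuming no row is null in both `SO` and `DP` at the same time:

```python
# Figures quoted in the text above.
total_rows = 2287
null_so = 110  # null values in `SO`
null_dp = 22   # null values in `DP`

# Upper bound on rows lost: assumes no row is null in both columns.
max_rows_lost = null_so + null_dp
share_lost = max_rows_lost / total_rows

print(f"{max_rows_lost} rows, {share_lost:.1%} of the data")
# prints "132 rows, 5.8% of the data"
```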

Strike outs (`SO`) and double plays (`DP`) aren’t as important.

I think you’re better off keeping the rows and filling the null values with the median value of each column by using the `fillna()` method. Caught stealing (`CS`) and hit by pitch (`HBP`) aren’t very important variables either. With so many null values in these columns, it’s best to eliminate the columns altogether.

```python
# Eliminating columns with null values
df = df.drop(['CS','HBP'], axis=1)

# Filling null values
df['SO'] = df['SO'].fillna(df['SO'].median())
df['DP'] = df['DP'].fillna(df['DP'].median())

# Print out null values of all columns of `df`
print(df.isnull().sum(axis=0).tolist())
```

### Exploring and Visualizing The Data

Now that you’ve cleaned up your data, you can do some exploration. You can get a better feel for the data set with a few simple visualizations. `matplotlib` is an excellent library for data visualization.

You import `matplotlib.pyplot` and rename it as `plt` for efficiency. If you’re working with a Jupyter notebook, you need to use the `%matplotlib inline` magic.

You’ll start by plotting a histogram of the target column so you can see the distribution of wins.

```python
# import the pyplot module from matplotlib
import matplotlib.pyplot as plt

# matplotlib plots inline
%matplotlib inline

# Plotting distribution of wins
plt.hist(df['W'])
plt.xlabel('Wins')
plt.title('Distribution of Wins')

plt.show()
```

Note that if you're not using a Jupyter notebook, you have to make use of `plt.show()` to show the plots.

Print out the average wins (`W`) per year. You can use the `mean()` method for this.

```python
print(df['W'].mean())
```

It can be useful to create bins for your target column while exploring your data, but you need to make sure not to include any feature that you generate from your target column when you train the model. Including a column of labels generated from the target column in your training set would be like giving your model the answers to the test.
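To make the point concrete, here is a minimal sketch of separating features from the target while excluding anything derived from it. The toy values are made up, and `win_bins` stands for a hypothetical label column computed from `W`:

```python
import pandas as pd

# Toy data (made-up values); `win_bins` is a label derived from the target `W`.
toy = pd.DataFrame({
    "R": [700, 650, 800],
    "RA": [680, 720, 600],
    "W": [85, 70, 101],
    "win_bins": [3, 3, 4],
})

target_col = "W"
derived_cols = ["win_bins"]  # anything computed from the target

# Features exclude the target and every column derived from it.
X = toy.drop(columns=[target_col] + derived_cols)
y = toy[target_col]

print(X.columns.tolist())
```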

To create your win labels, you’ll create a function called `assign_win_bins` which will take in an integer value (wins) and return an integer of 1-5 depending on the input value.

Next, you’ll create a new column `win_bins` by using the `apply()` method on the wins column and passing in the `assign_win_bins()` function.

```python
# Creating bins for the win column
def assign_win_bins(W):
    if W < 50:
        return 1
    if W >= 50 and W <= 69:
        return 2
    if W >= 70 and W <= 89:
        return 3
    if W >= 90 and W <= 109:
        return 4
    if W >= 110:
        return 5

df['win_bins'] = df['W'].apply(assign_win_bins)
```
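As an alternative to a hand-written function with `apply()`, the same five bins can be sketched with `pd.cut()`. This is a substitute technique, not the approach the tutorial itself uses, and the sample win totals below are made up:

```python
import numpy as np
import pandas as pd

# Made-up win totals chosen to sit on the bin boundaries.
wins = pd.Series([45, 50, 69, 70, 89, 90, 109, 110, 116])

# Left-closed intervals: [-inf, 50), [50, 70), [70, 90), [90, 110), [110, inf)
edges = [-np.inf, 50, 70, 90, 110, np.inf]
win_bins = pd.cut(wins, bins=edges, labels=[1, 2, 3, 4, 5], right=False).astype(int)

print(win_bins.tolist())
# prints [1, 2, 2, 3, 3, 4, 4, 5, 5]
```

Setting `right=False` makes each interval include its lower edge, which matches the `>=` comparisons described above for whole-number win totals.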

Now let’s make a scatter graph with the year on the x-axis and wins on the y-axis and highlight the `win_bins` column with colors.

```python
# Plotting scatter graph of Year vs. Wins
plt.scatter(df['yearID'], df['W'], c=df['win_bins'])
plt.title('Wins Scatter Plot')
plt.xlabel('Year')
plt.ylabel('Wins')

plt.show()
```

As you can see in the above scatter plot, there are very few seasons from before 1900 and the game was much different back then. Because of that, it makes sense to eliminate those rows from the data set.

```python
# Filter for rows where 'yearID' is greater than 1900
df = df[df['yearID'] > 1900]
```

When dealing with continuous data and creating linear models, integer values such as a year can cause issues. A linear model treats `yearID` as a quantity, so it would infer that a value like 1950 relates to the target in proportion to its numeric size, which is unlikely to match how the game actually changed over time.

You can avoid these issues by creating new variables that label the data based on the `yearID` value.

Anyone who follows the game of baseball knows that, as Major League Baseball (MLB) progressed, different eras emerged where the number of runs per game increased or decreased significantly. The dead ball era of the early 1900s is an example of a low-scoring era and the steroid era at the turn of the 21st century is an example of a high-scoring era.

Let’s make a graph below that indicates how much scoring there was for each year.

You’ll start by creating dictionaries `runs_per_year` and `games_per_year`. Loop through the dataframe using the `iterrows()` method. Populate the `runs_per_year` dictionary with years as keys and how many runs were scored that year as the value. Populate the `games_per_year` dictionary with years as keys and how many games were played that year as the value.

```python
# Create runs per year and games per year dictionaries
runs_per_year = {}
games_per_year = {}

for i, row in df.iterrows():
    year = row['yearID']
    runs = row['R']
    games = row['G']
    if year in runs_per_year:
        runs_per_year[year] = runs_per_year[year] + runs
        games_per_year[year] = games_per_year[year] + games
    else:
        runs_per_year[year] = runs
        games_per_year[year] = games

print(runs_per_year)
print(games_per_year)
```

Next, create a dictionary called `mlb_runs_per_game`. Iterate through the `games_per_year` dictionary with the `items()` method. Populate the `mlb_runs_per_game` dictionary with years as keys and the number of runs scored per game, league wide, as the value.

```python
# Create MLB runs per game (per year) dictionary
mlb_runs_per_game = {}
for k, v in games_per_year.items():
    year = k
    games = v
    runs = runs_per_year[year]
    mlb_runs_per_game[year] = runs / games

print(mlb_runs_per_game)
```
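The same league-wide figure can also be computed without explicit loops. The following is a minimal sketch of that alternative using `groupby()`, on made-up toy rows that reuse the column names `yearID`, `R`, and `G`:

```python
import pandas as pd

# Toy data (made-up values) with the same column names as `df`.
toy = pd.DataFrame({
    "yearID": [2000, 2000, 2001, 2001],
    "R": [800, 700, 750, 650],
    "G": [162, 162, 162, 161],
})

# Sum runs and games per year, then divide to get league-wide runs per game.
totals = toy.groupby("yearID")[["R", "G"]].sum()
runs_per_game = (totals["R"] / totals["G"]).to_dict()

print(runs_per_game)
```

The result is a dictionary keyed by year, the same shape as `mlb_runs_per_game`, so it could be plotted or looked up in the same way.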

Finally, create your plot from the `mlb_runs_per_game` dictionary by putting the years on the x-axis and runs per game on the y-axis.

```python
# Create lists from mlb_runs_per_game dictionary
lists = sorted(mlb_runs_per_game.items())
x, y = zip(*lists)

# Create line plot of Year vs. MLB runs per Game
plt.plot(x, y)
plt.title('MLB Yearly Runs per Game')
plt.xlabel('Year')
plt.ylabel('MLB Runs per Game')

plt.show()
```

Now that you have a better idea of scoring trends, you can create new variables that indicate a specific era that each row of data falls in based on the `yearID`. You’ll follow the same process as you did above when you created the `win_bins` column.

This time, however, you will create dummy columns: a new column for each era. You can use the `pd.get_dummies()` function for this.

```python
# Creating "year_label" column, which will give your algorithm information about how certain years are related
# (Dead ball eras, Live ball/Steroid Eras)

def assign_label(year):
    if year < 1920:
        return 1
    elif year >= 1920 and year <= 1941:
        return 2
    elif year >= 1942 and year <= 1945:
        return 3
    elif year >= 1946 and year <= 1962:
        return 4
    elif year >= 1963 and year <= 1976:
        return 5
    elif year >= 1977 and year <= 1992:
        return 6
    elif year >= 1993 and year <= 2009:
        return 7
    elif year >= 2010:
        return 8

# Add `year_label` column to `df`
df['year_label'] = df['yearID'].apply(assign_label)

dummy_df = pd.get_dummies(df['year_label'], prefix='era')

# Concatenate `df` and `dummy_df`
df = pd.concat([df, dummy_df], axis=1)

print(df.head())
```

Since you already did the work to determine MLB runs per game for each year, add that data to the data set.

```python
# Create column for MLB runs per game from the mlb_runs_per_game dictionary
def assign_mlb_rpg(year):
    return mlb_runs_per_game[year]

df['mlb_rpg'] = df['yearID'].apply(assign_mlb_rpg)
```

Now you’ll convert the years into decades by creating dummy columns for each decade. Then you can drop the columns that you don’t need anymore.

```python
# Convert years into decade bins and creating dummy variables
def assign_decade(year):
    if year < 1920:
        return 1910
    elif year >= 1920 and year <= 1929:
        return 1920
    elif year >= 1930 and year <= 1939:
        return 1930
    elif year >= 1940 and year <= 1949:
        return 1940
    elif year >= 1950 and year <= 1959:
        return 1950
    elif year >= 1960 and year <= 1969:
        return 1960
    elif year >= 1970 and year <= 1979:
        return 1970
    elif year >= 1980 and year <= 1989:
        return 1980
    elif year >= 1990 and year <= 1999:
        return 1990
    elif year >= 2000 and year <= 2009:
        return 2000
    elif year >= 2010:
        return 2010

df['decade_label'] = df['yearID'].apply(assign_decade)
decade_df = pd.get_dummies(df['decade_label'], prefix='decade')
df = pd.concat([df, decade_df], axis=1)

# Drop unnecessary columns
df = df.drop(['yearID','year_label','decade_label'], axis=1)
```
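Decade binning of this kind can also be written with integer arithmetic instead of a chain of conditions. The following is a minimal sketch on a small, made-up `yearID` column; the `toy` DataFrame is hypothetical and stands in for the tutorial’s `df`. It is an equivalent shortcut, not the approach the tutorial itself uses.

```python
import pandas as pd

# Hypothetical toy data standing in for the tutorial's `df`
toy = pd.DataFrame({'yearID': [1905, 1919, 1920, 1987, 2009, 2016]})

# Floor each year to its decade, then clip so that every year before 1920
# falls in the 1910 bin and every year from 2010 onward falls in the 2010 bin
toy['decade_label'] = ((toy['yearID'] // 10) * 10).clip(lower=1910, upper=2010)

# One dummy column per decade, named decade_1910, decade_1920, ...
decade_df = pd.get_dummies(toy['decade_label'], prefix='decade')
toy = pd.concat([toy, decade_df], axis=1)

print(toy)
```

The clipping step mirrors the open-ended first and last bins: anything earlier than 1920 is treated as the 1910s, and anything from 2010 on as the 2010s.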

The bottom line in the game of baseball is how many runs you score and how many runs you allow. You can often improve the accuracy of your model by creating columns that are ratios of other columns. Runs per game and runs allowed per game will be great features to add to your data set.

`Pandas` makes this very simple: divide the `R` column by the `G` column to create the `R_per_game` column, and divide the `RA` column by the `G` column to create the `RA_per_game` column.

```python
# Create new features for Runs per Game and Runs Allowed per Game
df['R_per_game'] = df['R'] / df['G']
df['RA_per_game'] = df['RA'] / df['G']
```

Now look at how each of the two new variables relates to the target wins column by making a couple of scatter plots. Plot the runs per game on the x-axis of one graph and runs allowed per game on the x-axis of the other. Plot the `W` column on each y-axis.

```python
# Create scatter plots for runs per game vs. wins and runs allowed per game vs. wins
fig = plt.figure(figsize=(12, 6))

ax1 = fig.add_subplot(1,2,1)
ax2 = fig.add_subplot(1,2,2)

ax1.scatter(df['R_per_game'], df['W'], c='blue')
ax1.set_title('Runs per Game vs. Wins')
ax1.set_ylabel('Wins')
ax1.set_xlabel('Runs per Game')

ax2.scatter(df['RA_per_game'], df['W'], c='red')
ax2.set_title('Runs Allowed per Game vs. Wins')
ax2.set_xlabel('Runs Allowed per Game')

plt.show()
```

Before getting into any machine learning models, it can be useful to see how each of the variables is correlated with the target variable. `Pandas` makes this easy with the `corr()` method.

```python
df.corr()['W']
```
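To make the correlations easier to scan, you can drop the target’s correlation with itself and sort the rest. This is a small sketch on made-up numbers; the `toy` DataFrame and its values are hypothetical and only illustrate the call pattern. If your DataFrame contains non-numeric columns, you may need to select the numeric ones first, for example with `select_dtypes(include='number')`.

```python
import pandas as pd

# Hypothetical toy data: wins, runs scored, and runs allowed for five team seasons
toy = pd.DataFrame({
    'W':  [60, 75, 81, 90, 101],
    'R':  [600, 690, 720, 780, 850],
    'RA': [820, 740, 700, 660, 610],
})

# Correlation of every column with the target, excluding the target itself,
# sorted from most positive to most negative
corr_with_wins = toy.corr()['W'].drop('W').sort_values(ascending=False)

print(corr_with_wins)
```

In this toy data, runs scored rise with wins and runs allowed fall, so the first correlation is positive and the second negative.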

Another feature you can add to the dataset is a set of labels derived from the K-means clustering algorithm provided by `sklearn`. K-means is a simple clustering algorithm that partitions the data into k clusters, where k is the number of centroids you specify. Each data point is assigned to the cluster whose centroid has the lowest Euclidean distance from the data point.
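The assignment rule described here can be sketched in a few lines of NumPy. The points and starting centroids below are made up for illustration, and the snippet shows a single assignment step followed by a single centroid update, not the full iterative algorithm that `sklearn` runs.

```python
import numpy as np

# Hypothetical 2-D points and two starting centroids
points = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]])
centroids = np.array([[1.0, 1.5], [8.5, 9.0]])

# distances[i, j] is the Euclidean distance from point i to centroid j
distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)

# Assignment step: each point gets the label of its nearest centroid
labels = distances.argmin(axis=1)

# Update step: each centroid moves to the mean of the points assigned to it
new_centroids = np.array(
    [points[labels == j].mean(axis=0) for j in range(len(centroids))]
)

print(labels)
print(new_centroids)
```

K-means repeats these two steps until the assignments stop changing.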

First, create a DataFrame that leaves out the target variable:

```python
attributes = ['G','R','AB','H','2B','3B','HR','BB','SO','SB','RA','ER','ERA','CG',
'SHO','SV','IPouts','HA','HRA','BBA','SOA','E','DP','FP','era_1','era_2','era_3','era_4','era_5','era_6','era_7','era_8','decade_1910','decade_1920','decade_1930','decade_1940','decade_1950','decade_1960','decade_1970','decade_1980','decade_1990','decade_2000','decade_2010','R_per_game','RA_per_game','mlb_rpg']

data_attributes = df[attributes]

# Print the first rows of `df`
print(df.head())
```

Before you can use K-means clustering, you must decide how many clusters you want. You can get a better idea of the ideal number of clusters by using sklearn’s `silhouette_score()` function, which returns the mean silhouette coefficient over all samples. The coefficient ranges from -1 to 1, and a higher score indicates better-separated clusters. For this data, the score decreases as more clusters are added.

```python
# Import necessary modules from `sklearn`
from sklearn.cluster import KMeans
from sklearn import metrics

# Create silhouette score dictionary
s_score_dict = {}
for i in range(2,11):
    km = KMeans(n_clusters=i, random_state=1)
    l = km.fit_predict(data_attributes)
    s_s = metrics.silhouette_score(data_attributes, l)
    s_score_dict[i] = [s_s]

# Print out `s_score_dict`
print(s_score_dict)
```
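To see how the silhouette score behaves when the right number of clusters is known, you can try the same loop on synthetic data. The sketch below assumes three well-separated blobs generated with `make_blobs`; the data, the range of k, and the explicit `n_init=10` are choices made for this illustration, not part of the tutorial’s baseball data.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data: three tight, well-separated clusters (illustration only)
X, _ = make_blobs(
    n_samples=150,
    centers=[(-10, -10), (0, 0), (10, 10)],
    cluster_std=0.5,
    random_state=1,
)

# Mean silhouette coefficient for each candidate number of clusters
scores = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=1)
    scores[k] = silhouette_score(X, km.fit_predict(X))

# The candidate with the highest score is the best-supported choice of k
best_k = max(scores, key=scores.get)

print(scores)
print(best_k)
```

On real data the picture is rarely this clean, so the score is best treated as one input to the decision rather than a final answer.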

Now you can initialize the model. Set your number of clusters to 6 and the random state to `1`. The `fit_transform()` method fits the model and returns the Euclidean distance from each data point to each cluster center. You can then visualize the clusters with a scatter plot of the distances to the first two cluster centers.

```python
# Create K-means model and determine Euclidean distances for each data point
kmeans_model = KMeans(n_clusters=6, random_state=1)
distances = kmeans_model.fit_transform(data_attributes)

# Create scatter plot using labels from K-means model as color
labels = kmeans_model.labels_

plt.scatter(distances[:,0], distances[:,1], c=labels)
plt.title('Kmeans Clusters')

plt.show()
```
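It can help to check what `fit_transform()` actually returns. The sketch below uses synthetic blobs rather than the baseball data (an assumption made so that the snippet runs on its own) and confirms two things: the result has one column per cluster, and the nearest center for each point matches the label stored in `labels_`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data: three well-separated clusters (illustration only)
X, _ = make_blobs(
    n_samples=90,
    centers=[(-10, -10), (0, 0), (10, 10)],
    cluster_std=0.5,
    random_state=1,
)

model = KMeans(n_clusters=3, n_init=10, random_state=1)

# One row per sample, one column per cluster center
distances = model.fit_transform(X)

# Index of the closest center for each sample
nearest = distances.argmin(axis=1)

print(distances.shape)
print(np.array_equal(nearest, model.labels_))
```

This is why plotting two columns of `distances` against each other gives a view of the clusters: each axis measures how far a point is from one particular center.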

Now add the labels from your clusters into the data set as a new column. Also add the string “labels” to the `attributes` list, for use later.

```python
# Add labels from K-means model to `df` DataFrame and attributes list
df['labels'] = labels
attributes.append('labels')

# Print the first rows of `df`
print(df.head())
```

Before you can build your model, you need to split your data into train and test sets. If you train your model on the same data that you use to test it, the model can easily overfit: it memorizes the training data instead of learning general patterns, which typically results in a model that is too complex for your data. This also explains why an overfitted model performs poorly when you try to make predictions on new data.

But don’t worry just yet: there are a number of ways to cross-validate your model.

This time, you will simply take a random sample of 75 percent of your data for the `train` data set and use the other 25 percent for your `test` data set. Create a list `numeric_cols` with all of the columns you will use in your model. Next, create a new DataFrame `data` from the `df` DataFrame with the columns in the `numeric_cols` list. Then, create your `train` and `test` data sets by sampling the DataFrame `data`.

```python
# Create new DataFrame using only variables to be included in models
numeric_cols = ['G','R','AB','H','2B','3B','HR','BB','SO','SB','RA','ER','ERA','CG','SHO','SV','IPouts','HA','HRA','BBA','SOA','E','DP','FP','era_1','era_2','era_3','era_4','era_5','era_6','era_7','era_8','decade_1910','decade_1920','decade_1930','decade_1940','decade_1950','decade_1960','decade_1970','decade_1980','decade_1990','decade_2000','decade_2010','R_per_game','RA_per_game','mlb_rpg','labels','W']
data = df[numeric_cols]
print(data.head())

# Split data DataFrame into train and test sets
train = data.sample(frac=0.75, random_state=1)
test = data.loc[~data.index.isin(train.index)]

x_train = train[attributes]
y_train = train['W']
x_test = test[attributes]
y_test = test['W']
```
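A 75/25 split can also be made with `train_test_split`, and cross-validation can then be run on the training portion with `cross_val_score`. The sketch below is an alternative under stated assumptions, not the tutorial’s method: it uses synthetic regression data from `make_regression` in place of the baseball data, and it scores each fold with the mean absolute error, the metric this tutorial uses for its models.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic regression data standing in for the tutorial's features and wins
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=1)

# Hold out 25 percent of the rows as a test set
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)

# 5-fold cross-validation on the training set only.
# scikit-learn reports errors as negative scores, so flip the sign.
scores = cross_val_score(
    LinearRegression(), x_train, y_train,
    cv=5, scoring='neg_mean_absolute_error',
)
cv_mae = -scores.mean()

print(x_train.shape, x_test.shape)
print(cv_mae)
```

Keeping the test set out of the cross-validation loop means it still gives an untouched estimate of performance at the end.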

### Selecting Error Metric and Model

Mean Absolute Error (MAE) is the metric you’ll use to determine how accurate your model is. It measures how close the predictions are to the eventual outcomes. Specifically, for this data, the metric gives you the average absolute amount by which your predictions missed the actual number of wins.

This means that if, on average, your predictions miss the target amount by 5 wins, your error metric will be 5.
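As a quick worked example, the snippet below computes the MAE by hand and with `mean_absolute_error` for four made-up team seasons. The win totals are hypothetical and were chosen so that the individual misses are 5, 5, 3, and 7 wins.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical actual and predicted win totals for four team seasons
actual_wins = np.array([90, 75, 81, 100])
predicted_wins = np.array([85, 80, 84, 93])

# Absolute misses are 5, 5, 3, and 7 wins; their mean is 20 / 4 = 5
manual_mae = np.abs(actual_wins - predicted_wins).mean()
sklearn_mae = mean_absolute_error(actual_wins, predicted_wins)

print(manual_mae)
print(sklearn_mae)
```

Because the errors are taken in absolute value, overestimates and underestimates do not cancel each other out.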

The first model you will train will be a linear regression model. You can import `LinearRegression` and `mean_absolute_error` from `sklearn.linear_model` and `sklearn.metrics` respectively, and then create a model `lr`. Next, you’ll fit the model, make predictions and determine mean absolute error of the model.

```python
# Import `LinearRegression` from `sklearn.linear_model`
from sklearn.linear_model import LinearRegression

# Import `mean_absolute_error` from `sklearn.metrics`
from sklearn.metrics import mean_absolute_error

# Create Linear Regression model, fit model, and make predictions
# Note: the `normalize` argument was removed in scikit-learn 1.2;
# on newer versions, drop it (ordinary least squares predictions are unchanged)
lr = LinearRegression(normalize=True)
lr.fit(x_train, y_train)
predictions = lr.predict(x_test)

# Determine mean absolute error
mae = mean_absolute_error(y_test, predictions)

# Print `mae`
print(mae)
```

If you recall from above, the average number of wins was about 79 wins. On average, the model is off by only 2.687 wins.

Now try a Ridge regression model. Import `RidgeCV` from `sklearn.linear_model` and create model `rrm`. The `RidgeCV` model lets you supply several candidate values for the alpha parameter, which is a complexity parameter that controls the amount of shrinkage (read more here). The model will use cross-validation to determine which of the alpha values you provide is ideal.

Again, fit your model, make predictions and determine the mean absolute error.

```python
# Import `RidgeCV` from `sklearn.linear_model`
from sklearn.linear_model import RidgeCV

# Create Ridge Linear Regression model, fit model, and make predictions
# Note: the `normalize` argument was removed in scikit-learn 1.2;
# on newer versions, scale the features beforehand instead (results may differ)
rrm = RidgeCV(alphas=(0.01, 0.1, 1.0, 10.0), normalize=True)
rrm.fit(x_train, y_train)
predictions_rrm = rrm.predict(x_test)

# Determine mean absolute error
mae_rrm = mean_absolute_error(y_test, predictions_rrm)
print(mae_rrm)
```

This model performed slightly better, and is off by 2.673 wins, on average.
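The models in this tutorial are created with `normalize=True`, an argument that was removed from these estimators in scikit-learn 1.2. On newer versions, one possible replacement is to scale the features in a pipeline. The sketch below assumes synthetic data from `make_regression` and uses `StandardScaler`, which is not numerically identical to the old `normalize` behavior, so the chosen alpha and the resulting error on the baseball data may differ from the figures quoted in this tutorial.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, RidgeCV
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for the tutorial's features and wins
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=1)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)

# Ordinary least squares: scaling the features does not change predictions
pred_plain = LinearRegression().fit(x_train, y_train).predict(x_test)
scaled_lr = make_pipeline(StandardScaler(), LinearRegression())
pred_scaled = scaled_lr.fit(x_train, y_train).predict(x_test)

# Ridge regression: the penalty depends on feature scale, so scale first
alphas = (0.01, 0.1, 1.0, 10.0)
ridge = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas))
ridge.fit(x_train, y_train)

# The alpha selected by cross-validation, and the test error
best_alpha = ridge.named_steps['ridgecv'].alpha_
mae_ridge = mean_absolute_error(y_test, ridge.predict(x_test))

print(np.allclose(pred_plain, pred_scaled))
print(best_alpha, mae_ridge)
```

Putting the scaler inside the pipeline ensures that the scaling is learned from the training data only and then applied unchanged to the test data.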

## Sports Analytics & Scikit-Learn

This concludes the first part of this tutorial series, in which you have seen how you can use Scikit-Learn to analyze sports data. You imported the data from an SQLite database, cleaned it up, explored aspects of it visually, and engineered several new features. You learned how to create a K-means clustering model and a couple of different linear regression models, and how to test your predictions with the mean absolute error metric.

In the second part, you’ll see how to use classification models to predict which players make it into the MLB Hall of Fame.
