Predicting Video Game Sales in Japan with Machine Learning

Video Games Sales Data

This dataset contains records of popular video games in North America, Japan, Europe and other parts of the world. Every video game in this dataset has at least 100k global sales.

%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

vgsales = pd.read_csv("data/vgsales.csv", index_col=0)

vgsales.head()

Data Dictionary

Column	Explanation
Rank	Ranking of overall sales
Name	Name of the game
Platform	Platform the game was released on (e.g. PC, PS4)
Year	Year the game was released in
Genre	Genre of the game
Publisher	Publisher of the game
NA_Sales	Number of sales in North America (in millions)
EU_Sales	Number of sales in Europe (in millions)
JP_Sales	Number of sales in Japan (in millions)
Other_Sales	Number of sales in other parts of the world (in millions)
Global_Sales	Number of total sales (in millions)

Source of dataset.

Information

The retailer typically orders games based on sales in North America and Europe, as the games are often released later in Japan. However, it has been observed that North American and European sales are not always a perfect predictor of how a game will sell in Japan.

To address this, a model will be developed to predict sales in Japan using sales data from North America and Europe, as well as other attributes such as the name of the game, the platform, the genre, and the publisher.

A report will be prepared that is accessible to a broad audience, outlining the motivation, steps, findings, and conclusions of the project.

Defining the Steps to Address My Manager's Requirement

Data Understanding: Begin with an exploratory analysis of the data types, as well as the structure of the information. Also, consider the basic statistics.
Data Cleaning and Preprocessing: Check for missing values (NaN) and decide how to handle them (remove, impute, or leave them as they are). Also look for errors in the data, such as outliers or inconsistent values.
EDA: Use plots to check for relationships between the variables.
Feature Engineering: Based on the above, determine if feature engineering is necessary, such as applying log transformation, normalization, or creating new features.
Feature Selection: Select the most relevant features for the model.
Data Splitting: Divide the data into training and testing sets.
Modeling: Train a model to predict sales in Japan, as the manager requires, and tune its hyperparameters.
Model Evaluation: Evaluate the performance of the model.
Conclusions: Draft conclusions for the purpose of the report (for the manager).
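
As a rough illustration of the feature engineering, splitting, modeling, and evaluation steps listed above, the sketch below runs a log transform, a train/test split, and a least-squares baseline on a small synthetic table. The synthetic data, the two-column feature set, the 80/20 split, and the linear baseline are all assumptions made for illustration; they are not the model developed in this report.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the sales columns (assumption: not the real dataset).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "NA_Sales": rng.gamma(shape=1.0, scale=0.5, size=n),
    "EU_Sales": rng.gamma(shape=1.0, scale=0.3, size=n),
})
toy["JP_Sales"] = (
    0.2 * toy["NA_Sales"] + 0.1 * toy["EU_Sales"] + rng.gamma(shape=1.0, scale=0.05, size=n)
)

# Feature engineering: log1p compresses the long right tail and maps 0 to 0.
X = np.log1p(toy[["NA_Sales", "EU_Sales"]].to_numpy())
y = np.log1p(toy["JP_Sales"].to_numpy())

# Data splitting: shuffle the row positions, then hold out 20% for testing.
order = rng.permutation(n)
test_idx, train_idx = order[: n // 5], order[n // 5 :]


def add_intercept(a):
    """Prepend a column of ones so the fit includes an intercept term."""
    return np.column_stack([np.ones(len(a)), a])


# Modeling: ordinary least squares, used here only as a simple baseline.
coef, *_ = np.linalg.lstsq(add_intercept(X[train_idx]), y[train_idx], rcond=None)

# Model evaluation: RMSE on the held-out rows, in log1p units.
pred = add_intercept(X[test_idx]) @ coef
rmse = float(np.sqrt(np.mean((pred - y[test_idx]) ** 2)))
print(f"test RMSE (log1p scale): {rmse:.3f}")
```

Because the error is computed on the log1p scale, predictions would need `np.expm1` applied before being read as sales in millions.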

# Display up to 15 columns
pd.set_option('display.max_columns', 15)

Data Understanding

A first review of the data's shape, types, and summary statistics.

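# Show the shape of the DataFrame (rows, columns)
print(vgsales.shape)
# Show summary statistics; include='all' also covers non-numeric columns
print(vgsales.describe(include='all').transpose())
# Show data types and non-null counts (info() prints directly and returns None)
vgsales.info()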

# Change data types to category
vgsales['Platform'] = vgsales['Platform'].astype('category')
vgsales['Genre'] = vgsales['Genre'].astype('category')
vgsales['Publisher'] = vgsales['Publisher'].astype('category')
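
To illustrate what this conversion changes, the sketch below builds a small made-up column of repeated labels and compares its memory footprint as `object` and as `category`. The labels and the row count are illustrative assumptions; the actual savings on `vgsales` will differ.

```python
import pandas as pd

# Made-up column with few distinct labels repeated many times (assumption).
labels = pd.Series(["Action", "Sports", "Puzzle", "Action"] * 1000, dtype="object")
as_category = labels.astype("category")

# A categorical stores each distinct label once, plus a small integer code per row.
print(as_category.cat.categories.tolist())
print(as_category.cat.codes.head().tolist())

# deep=True counts the Python string objects, not just the pointers to them.
object_bytes = labels.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(object_bytes, category_bytes)
```

The benefit shrinks as the number of distinct labels approaches the number of rows, so the conversion is most useful for columns such as 'Platform' and 'Genre'.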

Analysis of data:

The DataFrame contains 16,598 records across 10 fields.
Missing data is noted in the 'Year' and 'Publisher' fields.
There are 31 unique platforms and 12 unique genres. The earliest recorded year is 1980, and the most recent is 2020.
The minimum sales figure is 0.0 in every regional column; 'Global_Sales' has a minimum of 0.01. Among the regional columns, 'NA_Sales' has the highest maximum.
The genre with the most titles is 'Action'. The most frequent publisher is Electronic Arts, and the most frequent platform is the Nintendo DS.
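
The observations above are read off the `describe` and `info` output. A more direct way to obtain the same kind of facts is sketched below on a tiny made-up frame; the rows are invented for illustration, so the printed values say nothing about the real dataset.

```python
import pandas as pd

# Tiny invented sample using some of the vgsales column names (assumption).
sample = pd.DataFrame({
    "Platform": ["DS", "DS", "PS2", "Wii", "DS"],
    "Year": [2006.0, 2008.0, 2004.0, None, 2010.0],
    "Genre": ["Action", "Puzzle", "Action", "Sports", "Action"],
    "NA_Sales": [1.2, 0.0, 0.8, 3.1, 0.4],
    "JP_Sales": [0.3, 0.1, 0.0, 0.9, 0.2],
})

# Number of distinct values per categorical column.
n_unique = sample[["Platform", "Genre"]].nunique()

# Most frequent value per categorical column (first mode if there are ties).
most_common = sample[["Platform", "Genre"]].mode().iloc[0]

# Range of release years; NaN entries are skipped by default.
year_range = (sample["Year"].min(), sample["Year"].max())

# Minimum and maximum of each sales column.
sales_range = sample[["NA_Sales", "JP_Sales"]].agg(["min", "max"])

print(n_unique, most_common, year_range, sales_range, sep="\n\n")
```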

Data Cleaning and Preprocessing

As seen above, 'Year' and 'Publisher' contain missing values. We'll review these fields and decide how to handle them.

# Calculate number of NaN fields.
print(vgsales.isna().sum())
# Calculate NaN percent related to total of records.
print(vgsales.isna().mean()*100)

In each affected column, less than 5% of the values are missing.
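
The two printouts above can also be combined into a single table, which makes the count and the share easier to compare. The sketch below does this on a small invented frame and flags columns that fall under a 5% threshold; the helper name `missing_summary`, the sample rows, and the `threshold_pct` parameter are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def missing_summary(df, threshold_pct=5.0):
    """Return missing count, missing percent, and a below-threshold flag per column."""
    summary = pd.DataFrame({
        "n_missing": df.isna().sum(),
        "pct_missing": df.isna().mean() * 100,
    })
    summary["below_threshold"] = summary["pct_missing"] < threshold_pct
    # Keep only columns that actually have gaps, largest share first.
    return summary[summary["n_missing"] > 0].sort_values("pct_missing", ascending=False)


# Invented sample: 50 rows, one missing Year and two missing Publisher values.
sample = pd.DataFrame({
    "Year": [2000.0 + i % 10 for i in range(50)],
    "Publisher": ["Pub A", "Pub B"] * 25,
    "JP_Sales": np.linspace(0.0, 1.0, 50),
})
sample.loc[3, "Year"] = np.nan
sample.loc[[5, 7], "Publisher"] = np.nan

print(missing_summary(sample))
```

On the real data the call would be `missing_summary(vgsales)`. A low share of missing values makes dropping the affected rows a reasonable option, though imputation remains possible.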

# Show rows where both Year and Publisher are missing
vgsales[vgsales['Year'].isna() & vgsales['Publisher'].isna()]