Video Games Sales Data
This dataset contains records of popular video games in North America, Japan, Europe and other parts of the world. Every video game in this dataset has at least 100k global sales.
Not sure where to begin? Scroll to the bottom to find challenges!
import pandas as pd
sales = pd.read_csv("vgsales.csv", index_col=0)
print(sales.shape)
sales.head(100)
Data Dictionary
| Column | Explanation |
|---|---|
| Rank | Ranking of overall sales |
| Name | Name of the game |
| Platform | Platform of the game's release (e.g., PC, PS4) |
| Year | Year the game was released in |
| Genre | Genre of the game |
| Publisher | Publisher of the game |
| NA_Sales | Number of sales in North America (in millions) |
| EU_Sales | Number of sales in Europe (in millions) |
| JP_Sales | Number of sales in Japan (in millions) |
| Other_Sales | Number of sales in other parts of the world (in millions) |
| Global_Sales | Number of total sales (in millions) |
Source of dataset.
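The dictionary suggests that Global_Sales is the sum of the four regional columns. A quick way to check that reading is sketched below. The rows in `toy` are made-up values used only for illustration, and the 0.02 tolerance is an assumed allowance for rounding in the published figures.

```python
import pandas as pd

# Hypothetical rows with the same sales columns as the data dictionary
toy = pd.DataFrame({
    "NA_Sales": [1.50, 0.40, 0.00],
    "EU_Sales": [0.75, 0.25, 0.00],
    "JP_Sales": [0.25, 0.00, 0.50],
    "Other_Sales": [0.25, 0.05, 0.00],
    "Global_Sales": [2.75, 0.70, 0.50],
})

regional = ["NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales"]

# Absolute difference between the reported total and the sum of the regions
gap = (toy["Global_Sales"] - toy[regional].sum(axis=1)).abs()

# Assumed tolerance: each column is rounded to two decimals
consistent = gap <= 0.02
print(consistent.all())
```

To run the same check on the real data, replace `toy` with `sales`. Any rows that fail would be worth inspecting before modelling.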
Don't know where to start?
Challenges are brief tasks designed to help you practice specific skills:
- 🗺️ Explore: Which of the three seventh-generation consoles (Xbox 360, PlayStation 3, and Nintendo Wii) had the highest total sales globally?
- 📊 Visualize: Create a plot visualizing the average sales for games in the three most popular genres. Differentiate between NA, EU, and global sales.
- 🔎 Analyze: Are some genres significantly more likely to perform better or worse in Japan than others? If so, which ones?
Scenarios are broader questions to help you develop an end-to-end project for your portfolio:
You are working as a data analyst for a video game retailer based in Japan. The retailer typically orders games based on sales in North America and Europe, as the games are often released later in Japan. However, they have found that North American and European sales are not always a perfect predictor of how a game will sell in Japan.
Your manager has asked you to develop a model that can predict the sales in Japan using sales in North America and Europe and other attributes such as the name of the game, the platform, the genre, and the publisher.
You will need to prepare a report that is accessible to a broad audience. It should outline your motivation, steps, findings, and conclusions.
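One possible starting point for this scenario is sketched below. It is not a finished model: the `demo` DataFrame is synthetic, the relationship used to generate `JP_Sales` is an assumption made only so the example runs on its own, and the choice of predictors (regional sales plus one-hot encoded Platform and Genre) is just one option. Global_Sales is deliberately left out, since a total that includes Japanese sales would leak the target into the predictors.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (values are made up for illustration)
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "Platform": rng.choice(["PS3", "X360", "Wii"], size=n),
    "Genre": rng.choice(["Action", "Role-Playing", "Sports"], size=n),
    "NA_Sales": rng.gamma(2.0, 0.5, size=n),
    "EU_Sales": rng.gamma(2.0, 0.3, size=n),
})
# Assumed relationship, used only to generate a target for the demo
demo["JP_Sales"] = (
    0.1 * demo["NA_Sales"]
    + 0.2 * demo["EU_Sales"]
    + 0.3 * (demo["Genre"] == "Role-Playing")
    + rng.normal(0, 0.05, size=n)
)

# Predictors: regional sales plus one-hot encoded categories
X = pd.get_dummies(
    demo[["NA_Sales", "EU_Sales", "Platform", "Genre"]],
    drop_first=True,
    dtype=float,
)
y = demo["JP_Sales"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=3
)

model = LinearRegression().fit(X_train, y_train)
test_r2 = model.score(X_test, y_test)  # R^2 on rows the model has not seen
print(round(test_r2, 3))
```

Scoring on the held-out rows rather than on the training rows gives a more honest picture of how the model might behave on games it has not seen.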
Preliminary analysis
import matplotlib.pyplot as plt
import seaborn as sns
# Assuming 'sales' DataFrame is already defined somewhere in the notebook
# print(sales.info())
columns = ['NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales', 'Global_Sales']
# Ensure the column names are in the DataFrame
# missing_columns = [col for col in columns if col not in sales.columns]
# if missing_columns:
# raise KeyError(f"Missing columns in the DataFrame: {missing_columns}")
aver_sales = sales[columns].mean().reset_index()
aver_sales.columns = ['Region', 'aver_sales']
print(aver_sales)
sns.barplot(data=aver_sales, x='Region', y='aver_sales')
plt.title('Regional average sales of video games')
plt.show()
popular = sales.Platform.value_counts()
print(popular.head(10))
# Seventh-generation consoles; the Xbox 360 is coded 'X360' ('XB' is the original Xbox)
top_three = sales[sales['Platform'].isin(['PS3', 'X360', 'Wii'])]
print(top_three.Platform.value_counts())
print(top_three.groupby('Platform')['Global_Sales'].sum())
import numpy as np
platform_aver_earning = sales.groupby('Platform')['Global_Sales'].mean().sort_values(ascending=False)
print(platform_aver_earning.head(10))
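# Align both Series on the platform label first: np.corrcoef and plt.scatter
# pair values by position, and the two Series are sorted in different orders
aligned = pd.concat([popular, platform_aver_earning], axis=1, keys=['n_games', 'mean_global_sales'])
correl = np.corrcoef(aligned['n_games'], aligned['mean_global_sales'])
print(correl[0, 1])
plt.scatter(aligned['n_games'], aligned['mean_global_sales'])
plt.show()
import numpy as np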
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
top_genre = sales['Genre'].value_counts().head(3)
print(top_genre.index)
sales_top_genre = sales[sales['Genre'].isin(top_genre.index)]
print(sales_top_genre.head())
sales_top_genre_earnings = sales_top_genre.groupby('Genre')[['JP_Sales', 'NA_Sales', 'EU_Sales']].agg({'JP_Sales': ['mean', 'sum'], 'NA_Sales': ['mean', 'sum'], 'EU_Sales': ['mean', 'sum']})
print(sales_top_genre_earnings)
# Reshape to long format with pd.melt so each statistic can be plotted per genre and region
melted_sales_top_genre_earnings = pd.melt(sales_top_genre_earnings.reset_index(), id_vars='Genre')
print(melted_sales_top_genre_earnings)
sumtop=melted_sales_top_genre_earnings[melted_sales_top_genre_earnings['variable_1']=='sum']
meantop=melted_sales_top_genre_earnings[melted_sales_top_genre_earnings['variable_1']=='mean']
sumtop.columns=['genre','region','total','sum']
sumtop=sumtop.drop('total',axis=1)
print(sumtop)
sns.barplot(data=sumtop,x='genre',y='sum',hue='region')
plt.show()
meantop.columns=['genre','region','total','mean']
meantop=meantop.drop('total',axis=1)
print(meantop)
sns.barplot(data=meantop,x='genre',y='mean',hue='region')
plt.show()
def top_region(reg_sales, reg_frac):
    # Share of a region's total sales that each genre accounts for (top 10 genres)
    df = sales.groupby('Genre', as_index=False)[reg_sales].sum()
    df[reg_frac] = df[reg_sales] / df[reg_sales].sum()
    df = df.sort_values(by=reg_frac, ascending=False).head(10)
    return df

def cross_reg(top_reg, pref=1):
    # Compare each genre's share in Japan (global top_jap) with its share in top_reg.
    # pref=1 prints genres with a larger share in Japan; pref=0 prints the rest.
    for k in range(0, 10):
        flag = 0
        frac_k = top_jap.iloc[k, 2]
        gnr_k = top_jap.iloc[k, 0]
        for j in range(0, 10):
            frac_j = top_reg.iloc[j, 2]
            gnr_j = top_reg.iloc[j, 0]
            if gnr_k == gnr_j:
                flag = 1
                if pref == 1:
                    if frac_k > frac_j:
                        print(gnr_k)
                else:
                    if frac_k <= frac_j:
                        print(gnr_k)
        # Genre is in Japan's top 10 but missing from the other region's top 10
        if flag == 0 and pref == 1:
            print(gnr_k)

top_jap = top_region('JP_Sales', 'JP_sales_frac')
print(top_jap)
top_NA = top_region('NA_Sales', 'NA_sales_frac')
print(top_NA)
top_EU = top_region('EU_Sales', 'EU_sales_frac')
print(top_EU)
print('\nGenres more popular in Japan than in NA (among the top 10)')
cross_reg(top_NA, pref=1)
print('\nGenres more popular in Japan than in EU (among the top 10)')
cross_reg(top_EU, pref=1)
print('\nWorse performers than NA')
cross_reg(top_NA, pref=0)
print('\nWorse performers than EU')
cross_reg(top_EU, pref=0)
import pandas as pd
from statsmodels.formula.api import ols
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import numpy as np
# Assuming 'sales' DataFrame is already defined and loaded
salesx = sales.dropna()
cols = salesx.columns
print(cols)
cols = cols.drop(['Name','JP_Sales', 'Publisher','Platform','Year','Genre'])
print(cols)
X_pre = salesx[cols]
y = salesx['JP_Sales']
X = pd.get_dummies(X_pre)
feature_cols = X.columns
print(X.head())
print(X.columns)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=3)
##########
# OLS on the numeric sales columns via the statsmodels formula API.
# Note: Global_Sales already includes JP_Sales, so this term leaks the target.
model = ols("JP_Sales ~ NA_Sales + EU_Sales + Global_Sales", data=salesx).fit()
# Print the summary
print(model.summary())
print("\nRetrieving manually the parameter estimates:")
print(model.params)
#########
reg = LinearRegression()
# Fit on the training split and evaluate on the held-out split
reg.fit(X_train, y_train)
coefs = reg.coef_
r_squared = reg.score(X_test, y_test)
print(r_squared)
reg_pred = reg.predict(X_test)
# r2_score on the same predictions should match reg.score
from sklearn.metrics import r2_score
score = r2_score(y_test, reg_pred)
print(f'SCORE {score}')
df = pd.DataFrame({'features': feature_cols, 'coefs': coefs})
print(df)
# Pairwise correlations between Japanese sales and the other sales columns
jpsales = sales['JP_Sales']
nasales = sales['NA_Sales']
print(np.corrcoef(jpsales, nasales))
eusales = sales['EU_Sales']
print(np.corrcoef(jpsales, eusales))
gbsales = sales['Global_Sales']
print(np.corrcoef(jpsales, gbsales))
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
from scipy.stats import f_oneway
# salesx (sales with incomplete rows dropped) is defined in the regression cell above
genre_JP = salesx.groupby('Genre')['JP_Sales'].sum().reset_index()
print(genre_JP)
sns.barplot(data=genre_JP, x='Genre', y='JP_Sales')
plt.xticks(rotation=90)
plt.show()
# One-way ANOVA: does mean JP_Sales differ across these five genres?
rpg = salesx.loc[salesx['Genre'] == 'Role-Playing', 'JP_Sales']
fighting = salesx.loc[salesx['Genre'] == 'Fighting', 'JP_Sales']
simulation = salesx.loc[salesx['Genre'] == 'Simulation', 'JP_Sales']
puzzle = salesx.loc[salesx['Genre'] == 'Puzzle', 'JP_Sales']
adventure = salesx.loc[salesx['Genre'] == 'Adventure', 'JP_Sales']
result = f_oneway(rpg, fighting, simulation, puzzle, adventure)
print(result)
from scipy.stats import ttest_ind
goodperform = salesx[salesx['Genre'].isin(['Role-Playing', 'Fighting', 'Simulation', 'Puzzle', 'Adventure'])]
poorperform = salesx[salesx['Genre'].isin(['Action', 'Sports', 'Platform', 'Misc', 'Racing'])]
print(goodperform.columns)
good_jp = goodperform['JP_Sales'].values
poor_jp = poorperform['JP_Sales'].values
# Two-sample t-test on JP_Sales between the two genre groups
t_stat, p_value = ttest_ind(good_jp, poor_jp)
print(t_stat, p_value)
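Sales figures tend to be right-skewed, and the two genre groups differ in size, so the default equal-variance t-test may be sensitive to its assumptions. The sketch below shows two alternatives available in SciPy: Welch's t-test and the rank-based Mann-Whitney U test. The samples `group_a` and `group_b` are made-up stand-ins for `good_jp` and `poor_jp`, not values from the dataset.

```python
import numpy as np
from scipy.stats import mannwhitneyu, ttest_ind

# Made-up, right-skewed samples standing in for good_jp and poor_jp
rng = np.random.default_rng(1)
group_a = rng.exponential(scale=0.30, size=300)
group_b = rng.exponential(scale=0.10, size=500)

# Welch's t-test: does not assume equal variances in the two groups
welch = ttest_ind(group_a, group_b, equal_var=False)

# Mann-Whitney U: rank-based, so it does not assume normality
mwu = mannwhitneyu(group_a, group_b, alternative="two-sided")

print(welch.pvalue, mwu.pvalue)
```

If both tests agree with the original result, the conclusion is less likely to be an artifact of the test's assumptions.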