DA ex Video Games Sales Data

Video Games Sales Data

This dataset contains records of popular video games in North America, Japan, Europe and other parts of the world. Every video game in this dataset has at least 100k global sales.

Not sure where to begin? Scroll to the bottom to find challenges!


import pandas as pd
sales = pd.read_csv("vgsales.csv", index_col=0)
print(sales.shape)
sales.head(100)

Data Dictionary

Column	Explanation
Rank	Ranking of overall sales
Name	Name of the game
Platform	Platform of the game's release (e.g. PC, PS4)
Year	Year the game was released in
Genre	Genre of the game
Publisher	Publisher of the game
NA_Sales	Number of sales in North America (in millions)
EU_Sales	Number of sales in Europe (in millions)
JP_Sales	Number of sales in Japan (in millions)
Other_Sales	Number of sales in other parts of the world (in millions)
Global_Sales	Number of total sales (in millions)

Source of dataset.
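
Since Global_Sales is described as the total, it may be worth checking that it matches the sum of the regional columns before treating it as an independent feature. The sketch below runs on a small made-up frame (the rows are hypothetical, not taken from vgsales.csv), and the 0.02 million tolerance is an assumption meant to absorb rounding in the published figures.

```python
import numpy as np
import pandas as pd

# Hypothetical rows for illustration only; swap `toy` for the real `sales` frame.
toy = pd.DataFrame({
    "NA_Sales": [1.0, 0.50, 0.20],
    "EU_Sales": [0.5, 0.25, 0.10],
    "JP_Sales": [0.25, 0.0, 0.30],
    "Other_Sales": [0.25, 0.05, 0.0],
    "Global_Sales": [2.0, 0.80, 0.75],
})

regions = ["NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales"]
# Difference between the reported total and the sum of the regional columns
gap = toy["Global_Sales"] - toy[regions].sum(axis=1)
# Rows where the gap exceeds the assumed rounding tolerance
mismatched = toy[~np.isclose(gap, 0.0, atol=0.02)]
print(len(mismatched), "row(s) differ from the regional sum")
```

If nearly every row passes, Global_Sales already contains JP_Sales, which matters when choosing predictors for Japanese sales.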

Don't know where to start?

Challenges are brief tasks designed to help you practice specific skills:

🗺️ Explore: Which of the three seventh-generation consoles (Xbox 360, PlayStation 3, and Nintendo Wii) had the highest total sales globally?
📊 Visualize: Create a plot visualizing the average sales for games in the most popular three genres. Differentiate between NA, EU, and global sales.
🔎 Analyze: Are some genres significantly more likely to perform better or worse in Japan than others? If so, which ones?
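
One possible descriptive first step for the Analyze challenge is to compare each genre's share of sales made in Japan with the overall share. The sketch below uses made-up rows (the `toy` frame and its values are hypothetical). It only describes the data and does not by itself establish statistical significance.

```python
import pandas as pd

# Hypothetical rows for illustration; with the real data, use `sales` instead of `toy`.
toy = pd.DataFrame({
    "Genre": ["Role-Playing", "Role-Playing", "Shooter", "Shooter", "Sports"],
    "JP_Sales": [3.0, 1.0, 0.1, 0.1, 0.5],
    "Global_Sales": [6.0, 2.0, 4.0, 4.0, 5.0],
})

# Total sales per genre, then the fraction of each genre's global sales made in Japan
totals = toy.groupby("Genre")[["JP_Sales", "Global_Sales"]].sum()
totals["JP_share"] = totals["JP_Sales"] / totals["Global_Sales"]

# Share of all global sales that were made in Japan
overall_share = toy["JP_Sales"].sum() / toy["Global_Sales"].sum()

# Genres above the overall share over-index in Japan; those below under-index
totals["over_indexed"] = totals["JP_share"] > overall_share
print(totals.sort_values("JP_share", ascending=False))
```

A follow-up test (for example an ANOVA or a rank-based test on per-game values) would be needed to judge whether the differences are significant.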

Scenarios are broader questions to help you develop an end-to-end project for your portfolio:

You are working as a data analyst for a video game retailer based in Japan. The retailer typically orders games based on sales in North America and Europe, as the games are often released later in Japan. However, they have found that North American and European sales are not always a perfect predictor of how a game will sell in Japan.

Your manager has asked you to develop a model that can predict the sales in Japan using sales in North America and Europe and other attributes such as the name of the game, the platform, the genre, and the publisher.

You will need to prepare a report that is accessible to a broad audience. It should outline your motivation, steps, findings, and conclusions.
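
A model of the kind the scenario describes could be sketched with a scikit-learn pipeline that one-hot encodes the categorical attributes and passes the numeric sales columns through unchanged. Everything below is an assumption made for illustration: the `toy` rows are invented, only Platform and Genre are encoded, and a plain linear regression stands in for whatever model is eventually chosen.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical rows for illustration only
toy = pd.DataFrame({
    "NA_Sales": [1.0, 0.5, 0.2, 2.0, 0.1, 0.8],
    "EU_Sales": [0.5, 0.3, 0.1, 1.0, 0.1, 0.4],
    "Platform": ["Wii", "PS3", "DS", "X360", "DS", "PS3"],
    "Genre": ["Sports", "Action", "Role-Playing", "Shooter", "Role-Playing", "Action"],
    "JP_Sales": [0.3, 0.1, 0.6, 0.05, 0.4, 0.2],
})
numeric = ["NA_Sales", "EU_Sales"]
categorical = ["Platform", "Genre"]

# One-hot encode the categorical columns; numeric columns pass through as-is.
# handle_unknown="ignore" encodes categories unseen during fitting as all zeros.
preprocess = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), categorical)],
    remainder="passthrough",
)
model = Pipeline([("prep", preprocess), ("reg", LinearRegression())])
model.fit(toy[numeric + categorical], toy["JP_Sales"])

# Predict Japanese sales for a new game on a platform not present in the toy data
new_game = pd.DataFrame({
    "NA_Sales": [0.6], "EU_Sales": [0.3], "Platform": ["GC"], "Genre": ["Action"],
})
pred = model.predict(new_game)
print(pred)
```

Keeping the encoding inside the pipeline means the same transformation is applied at fit and predict time, which helps when the model is evaluated on a held-out split.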

Preliminary analysis


import matplotlib.pyplot as plt
import seaborn as sns

# Regional and global sales columns to average ('sales' is loaded in the first cell)
columns = ['NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales', 'Global_Sales']

aver_sales = sales[columns].mean().reset_index()
aver_sales.columns = ['Region', 'aver_sales']
print(aver_sales)

sns.barplot(data=aver_sales, x='Region', y='aver_sales')
plt.title('Regional average sales of video games')
plt.show()


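# Number of games per platform
popular = sales.Platform.value_counts()
print(popular.head(10))
# Seventh-generation consoles; the Xbox 360 is coded 'X360' in this dataset ('XB' is the original Xbox)
top_three = sales[sales['Platform'].isin(['PS3', 'X360', 'Wii'])]
print(top_three.Platform.value_counts())
# Total global sales (in millions) for each of the three consoles
print(top_three.groupby('Platform')['Global_Sales'].sum())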


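import numpy as np
# Average global sales per game on each platform
platform_aver_earning = sales.groupby('Platform')['Global_Sales'].mean().sort_values(ascending=False)

print(platform_aver_earning.head(10))

# np.corrcoef and plt.scatter pair values by position, not by index label,
# so put the game counts in the same platform order as the averages first
popular_aligned = popular.reindex(platform_aver_earning.index)
correl = np.corrcoef(popular_aligned, platform_aver_earning)
print(correl[0,1])
plt.scatter(popular_aligned, platform_aver_earning)
plt.xlabel('Number of games on platform')
plt.ylabel('Average global sales per game (millions)')
plt.show()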


import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

top_genre = sales['Genre'].value_counts().head(3)
print(top_genre.index)
sales_top_genre = sales[sales['Genre'].isin(top_genre.index)]
print(sales_top_genre.head())
sales_top_genre_earnings = sales_top_genre.groupby('Genre')[['JP_Sales', 'NA_Sales', 'EU_Sales']].agg({'JP_Sales': ['mean', 'sum'], 'NA_Sales': ['mean', 'sum'], 'EU_Sales': ['mean', 'sum']})
print(sales_top_genre_earnings)

# Reshape to long format (one row per game and region) and then aggregate,
# which avoids melting a frame with MultiIndex columns
regions = ['JP_Sales', 'NA_Sales', 'EU_Sales']
long_sales = sales_top_genre.melt(id_vars='Genre', value_vars=regions, var_name='region', value_name='sales')
genre_region = long_sales.groupby(['Genre', 'region'])['sales'].agg(['sum', 'mean']).reset_index()
genre_region = genre_region.rename(columns={'Genre': 'genre'})
print(genre_region)

# Total sales per genre and region
sns.barplot(data=genre_region, x='genre', y='sum', hue='region')
plt.show()

# Average sales per game, by genre and region
sns.barplot(data=genre_region, x='genre', y='mean', hue='region')
plt.show()


def top_region(reg_sales,reg_frac):
    df = sales.groupby('Genre', as_index=False)[reg_sales].sum()
    df[reg_frac] = df[reg_sales] / df[reg_sales].sum()
    df = df.sort_values(by=reg_frac, ascending=False).head(10)
    return df

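def cross_reg(top_reg, pref=1):
    """Compare Japan's top genres with another region's top genres by sales share.

    pref=1 prints genres whose share is larger in Japan, plus genres missing
    from the other region's top 10. Any other value prints genres whose share
    in Japan is not larger. Uses the global `top_jap` defined below.
    """
    # Map genre -> sales fraction for the comparison region
    reg_frac = dict(zip(top_reg.iloc[:, 0], top_reg.iloc[:, 2]))
    for gnr, frac_jp in zip(top_jap.iloc[:, 0], top_jap.iloc[:, 2]):
        if gnr not in reg_frac:
            if pref == 1:
                print(gnr)
        elif pref == 1 and frac_jp > reg_frac[gnr]:
            print(gnr)
        elif pref != 1 and frac_jp <= reg_frac[gnr]:
            print(gnr)
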


top_jap=top_region('JP_Sales',"JP_sales_frac")
print(top_jap)    

top_NA=top_region('NA_Sales','NA_sales_frac')
print(top_NA)

top_EU=top_region('EU_Sales','EU_sales_frac')
print(top_EU)

print('\nGenres with a larger sales share in Japan than in NA (among the top 10)')
cross_reg(top_NA,pref=1)

print('\nGenres with a larger sales share in Japan than in EU (among the top 10)')
cross_reg(top_EU,pref=1)

print('\nGenres with a smaller or equal sales share in Japan than in NA')
cross_reg(top_NA,pref=0)

print('\nGenres with a smaller or equal sales share in Japan than in EU')
cross_reg(top_EU,pref=0)


import pandas as pd
from statsmodels.formula.api import ols
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import numpy as np

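# 'sales' is loaded in the first cell; drop rows with missing values
salesx = sales.dropna()
cols = salesx.columns
print(cols)
# Global_Sales is the total across all regions and so includes JP_Sales;
# keeping it as a feature would leak the target, so it is dropped as well
cols = cols.drop(['Name','JP_Sales', 'Publisher','Platform','Year','Genre','Global_Sales'])
print(cols)
X_pre = salesx[cols]
y = salesx['JP_Sales']

# get_dummies leaves numeric columns unchanged; it only has an effect
# if categorical columns are kept in X_pre
X = pd.get_dummies(X_pre)
feature_cols = X.columns
print(X.head())
print(X.columns)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=3)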
##########

# OLS of Japanese sales on North American and European sales.
# Global_Sales is left out because it already contains JP_Sales.
model = ols("JP_Sales ~ NA_Sales + EU_Sales", data=salesx).fit()

# Print the summary
print(model.summary())

print("\nParameter estimates:")
print(model.params)

#########
reg = LinearRegression()
# Fit on the training split only, so the test split gives an out-of-sample estimate
reg.fit(X_train, y_train)
coefs = reg.coef_
# R^2 on the training split
r_squared = reg.score(X_train, y_train)
print(r_squared)
reg_pred = reg.predict(X_test)

# R^2 on the held-out test split
from sklearn.metrics import r2_score
score = r2_score(y_test, reg_pred)
print(f'SCORE  {score}')

df = pd.DataFrame({'features': feature_cols, 'coefs': coefs})
print(df)
# Pairwise correlation of Japanese sales with NA, EU and global sales
jpsales = sales['JP_Sales']
nasales = sales['NA_Sales']
print(np.corrcoef(jpsales, nasales))

eusales = sales['EU_Sales']
print(np.corrcoef(jpsales, eusales))

gbsales = sales['Global_Sales']
print(np.corrcoef(jpsales, gbsales))


import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
from scipy.stats import f_oneway

# salesx (sales with rows containing missing values dropped) is defined in the modelling cell above

genre_JP = salesx.groupby('Genre')['JP_Sales'].sum().reset_index()
print(genre_JP)
sns.barplot(data=genre_JP, x='Genre', y='JP_Sales')
plt.xticks(rotation=90)
plt.show()

# One-way ANOVA: does mean JP_Sales differ across these five genres?
anova_genres = ['Role-Playing', 'Fighting', 'Simulation', 'Puzzle', 'Adventure']
samples = [salesx.loc[salesx['Genre'] == g, 'JP_Sales'] for g in anova_genres]
result = f_oneway(*samples)
print(result)


from scipy.stats import ttest_ind

# Two groups of genres to compare on Japanese sales
goodperform = salesx[salesx['Genre'].isin(['Role-Playing', 'Fighting', 'Simulation', 'Puzzle', 'Adventure'])]
poorperform = salesx[salesx['Genre'].isin(['Action', 'Sports', 'Platform', 'Misc', 'Racing'])]

print(goodperform.columns)
good_jp = goodperform['JP_Sales'].values
poor_jp = poorperform['JP_Sales'].values

# Two-sample t-test on JP_Sales (ttest_ind assumes equal variances by default)
t_stat, p = ttest_ind(good_jp, poor_jp)
print(t_stat, p)
