Video Games Sales Data

This dataset contains records of popular video games in North America, Japan, Europe and other parts of the world. Every video game in this dataset has at least 100k global sales.

Not sure where to begin? Scroll to the bottom to find challenges!

Data Dictionary

Column	Explanation
Rank	Ranking of overall sales
Name	Name of the game
Platform	Platform of the games release (i.e. PC,PS4, etc.)
Year	Year the game was released in
Genre	Genre of the game
Publisher	Publisher of the game
NA_Sales	Number of sales in North America (in millions)
EU_Sales	Number of sales in Europe (in millions)
JP_Sales	Number of sales in Japan (in millions)
Other_Sales	Number of sales in other parts of the world (in millions)
Global_Sales	Number of total sales (in millions)

DataFrameas

df

variable

-- Can I do the same using SQL ? Can i get the same data but in table ?
WITH All_consoles AS (SELECT DISTINCT(Platform) AS Console, 
		SUM(NA_Sales) AS USA_Sales,
		SUM(EU_Sales) AS EU_Sales,
		SUM(JP_Sales) AS JP_Sales,
		SUM(Global_Sales) AS Total_Sales
		FROM vgsales.csv
		GROUP BY Console
		ORDER BY Total_Sales DESC)
SELECT *
FROM All_consoles
WHERE Total_sales > 300
-- I have decided to check all of the consoles and their sales across different parts of the world. As can be seen, PS2 had the highest global sales in comparison to other consoles. Thanks to a good sales across every part of the world presented in the data.
-- UPDATE// I added a WHERE keyword to determine total sales higher than 300 mil for plot in python. The amount of different consoles created a mess of a plot. I have decided to only include a meaningful consoles in the global market. Also I have place the whole querry in CTE to do it.

DataFrameas

xone

variable

SELECT DISTINCT(Platform) AS Console,
		Genre,
		SUM(NA_Sales) AS USA,
		SUM(EU_Sales) AS EU,
		SUM(JP_Sales) AS JP,
		SUM(Global_Sales) AS Global
FROM "vgsales.csv" AS sales
WHERE Platform LIKE ('XOne')
GROUP BY Console, Genre
ORDER BY Console DESC, Global DESC

DataFrameas

psv

variable

SELECT DISTINCT(Platform) AS Console,
		Genre,
		SUM(NA_Sales) AS USA,
		SUM(EU_Sales) AS EU,
		SUM(JP_Sales) AS JP,
		SUM(Global_Sales) AS Global
FROM "vgsales.csv" AS sales
WHERE Platform LIKE ('PSV')
GROUP BY Console, Genre
ORDER BY Global DESC

DataFrameas

pc

variable

SELECT DISTINCT(Platform) AS Console,
		Genre,
		SUM(NA_Sales) AS USA,
		SUM(EU_Sales) AS EU,
		SUM(JP_Sales) AS JP,
		SUM(Global_Sales) AS Global
FROM "vgsales.csv" AS sales
WHERE Platform LIKE ('PC')
GROUP BY Console, Genre
ORDER BY Global DESC

import pandas as pd
import matplotlib.pyplot as plt
sales = pd.read_csv("vgsales.csv", index_col=0) # 16598 rows and 10 columns
dif_platforms = sales['Platform'].unique()
#Xbox 360 - X360, Playstation 3 - PS3, Nintendo Wii - Wii
seven_gen = sales[sales['Platform'].isin(['PS3', 'X360', 'Wii'])]
# seven_gen shows 3919 rows and 10 columns
grouped_seven = seven_gen.groupby('Platform')['Global_Sales'].sum().reset_index()
sorted_seven = (grouped_seven.sort_values(['Global_Sales'], ascending=False))

plt.style.use('seaborn-v0_8-notebook')
bars = plt.bar(sorted_seven['Platform'], sorted_seven['Global_Sales'], color='skyblue', edgecolor='black')
#Now to make it more appealing, I used command print(plt.style.available) to determine which is the most suitable.
for bar in bars:
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width() / 2,
             height + 10,
             f'{height:.0f}',# Some experimentation with different aspects to create niced bar plot.
             ha='center',
             va='bottom',
             fontsize=9,
             fontweight='bold')
plt.grid(axis='y', linestyle='--',alpha=0.6)
plt.xlabel('Consoles')
plt.ylabel('Global sales in milions')
plt.title('Global sales of 7th gen consoles', weight='bold')
plt.show()
plt.clf()# Cleaning the plot for next one.

sql_bars = plt.bar(df['Console'],df['Total_Sales'], color='red', edgecolor='black')
for bar in sql_bars:
    height_sql = bar.get_height()
    plt.text(bar.get_x() + bar.get_width() / 2,
            height_sql + 10,
            f'{height_sql:.0f}',
             ha='center', 
             va='bottom',
            fontsize=12,
            fontstyle='italic')
plt.xlabel('Console')
plt.ylabel('Global sales')
plt.grid(axis='y', linestyle=':', alpha=0.8)
plt.title('Sales Plot', fontsize=15, fontstyle='italic')
plt.show()
plt.clf()

diff_genre = sales['Genre'].unique() # 12 different genres
genre_sales = sales.groupby('Genre')['Global_Sales','NA_Sales','EU_Sales','JP_Sales'].sum().reset_index()
high_genre_sales = genre_sales[genre_sales['Global_Sales'] > 750]
high_genre_sales_sorted = high_genre_sales.sort_values('Global_Sales', ascending=False)
genre_bar_global = plt.bar(high_genre_sales_sorted['Genre'],high_genre_sales_sorted['Global_Sales'], color='skyblue',  edgecolor='black')
plt.xlabel('Genre')
plt.ylabel('Global Sales')
plt.grid(axis='y',linestyle='-.',alpha=0.8)
plt.title('Genre global sales')
plt.show()
plt.clf()

# Experimentation with pie chart plots to determine the sales of ganres across different countries.
eu_sales = high_genre_sales_sorted['EU_Sales']
jp_sales = high_genre_sales_sorted['JP_Sales']
na_sales = high_genre_sales_sorted['NA_Sales']

fig, ax = plt.subplots(figsize=(6, 6))
wedges, texts, autotexts = ax.pie(eu_sales, labels=high_genre_sales_sorted['Genre'], autopct='%1.2f%%', startangle=90, textprops={'fontsize': 12})
for text in texts:
    text.set_fontweight('bold')
for autotext in autotexts:
    autotext.set_fontweight('bold')
plt.title('Games sales across EU')
plt.axis('equal')
plt.show()
plt.clf()

fig, ax = plt.subplots(figsize=(6, 6))
wdges, texts, autotexts = ax.pie(jp_sales, labels=high_genre_sales_sorted['Genre'], autopct='%1.1f%%', startangle=90, textprops={'fontsize': 10})
for text in texts:
    text.set_fontweight('bold')
for autotext in autotexts:
    autotext.set_fontweight('bold')
plt.title('Games sales across Japan', weight='bold')
plt.axis('equal')
plt.show()
plt.clf()


# Using the second data to experiment with scatter plots and seaborn.
import seaborn as sb
xone = pd.DataFrame(xone)
psv = pd.DataFrame(psv)
pc = pd.DataFrame(pc)

Source of dataset.

Don't know where to start?

Challenges are brief tasks designed to help you practice specific skills:

🗺️ Explore: Which of the three seventh generation consoles (Xbox 360, Playstation 3, and Nintendo Wii) had the highest total sales globally?
📊 Visualize: Create a plot visualizing the average sales for games in the most popular three genres. Differentiate between NA, EU, and global sales.
🔎 Analyze: Are some genres significantly more likely to perform better or worse in Japan than others? If so, which ones?

Scenarios are broader questions to help you develop an end-to-end project for your portfolio:

You are working as a data analyst for a video game retailer based in Japan. The retailer typically orders games based on sales in North America and Europe, as the games are often released later in Japan. However, they have found that North American and European sales are not always a perfect predictor of how a game will sell in Japan.

Your manager has asked you to develop a model that can predict the sales in Japan using sales in North America and Europe and other attributes such as the name of the game, the platform, the genre, and the publisher.

You will need to prepare a report that is accessible to a broad audience. It should outline your motivation, steps, findings, and conclusions.

Video Games Sales Data

.mfe-app-workspace-kj242g{position:absolute;top:-8px;}.mfe-app-workspace-11ezf91{display:inline-block;}.mfe-app-workspace-11ezf91:hover .Anchor__copyLink{visibility:visible;}Video Games Sales Data

Data Dictionary

Don't know where to start?

Video Games Sales Data