Skip to content

Video Games Sales Data (Obtained from Kaggle)

This dataset contains records of popular video games in North America, Japan, Europe and other parts of the world. Every video game in this dataset has at least 100k global sales. The release year ranges from 1980 to 2020 (with year 2018 and 2019 missing).

Data Dictionary

ColumnExplanation
RankRanking of overall sales
NameName of the game
PlatformPlatform of the games release (i.e. PC,PS4, etc.)
YearYear the game was released in
GenreGenre of the game
PublisherPublisher of the game
NA_SalesNumber of sales in North America (in millions)
EU_SalesNumber of sales in Europe (in millions)
JP_SalesNumber of sales in Japan (in millions)
Other_SalesNumber of sales in other parts of the world (in millions)
Global_SalesNumber of total sales (in millions)

Questions asked about the dataset.

  1. What are the top-selling games in each year (based on global sales)? Not including games without a year listed.
  2. How many games have the highest sales in the NA region compared to others? What about EU, JP and the other group?
  3. Part A: What are the top 10 platforms based on global sales? Part B: Does the result in Q3a change across regions?
  4. Show the running total of global sales in each year for each publisher.
Spinner
DataFrameas
df
variable
-- Quick view of the dataset.
SELECT * 
FROM 'vgsales_kaggle.csv'
LIMIT 50;
Spinner
DataFrameas
df1
variable
-- Q1. What are the top selling games in each year (based on global sales)? Not including games without a year listed.
SELECT t1.Year, t1.Name, t1.Genre, t1.Platform, t1.Publisher, t1.Global_Sales
FROM 'vgsales_kaggle.csv' AS t1
RIGHT JOIN (
	SELECT Year, MAX(Global_Sales) AS Top_Global_Sales
	FROM 'vgsales_kaggle.csv'
	WHERE Year != 'N/A'
	GROUP BY Year) AS t2
ON t1.Year = t2.Year AND t1.Global_Sales = t2.Top_Global_Sales
ORDER BY t1.Year ASC;
Spinner
DataFrameas
df2
variable
--Q2. How many games have the highest sales in the NA region compared to others? What about EU, JP and other country group?
SELECT 'NA' AS Region, COUNT(*) AS Num_Games
FROM 'vgsales_kaggle.csv'
WHERE NA_Sales > EU_Sales AND NA_Sales > JP_Sales AND NA_Sales > Other_Sales
UNION
(SELECT 'EU' AS Region, COUNT(*) AS Num_Games
FROM 'vgsales_kaggle.csv'
WHERE EU_Sales > NA_Sales AND EU_Sales > JP_Sales AND EU_Sales > Other_Sales)
UNION
(SELECT 'JP' AS Region, COUNT(*) AS Num_Games
FROM 'vgsales_kaggle.csv'
WHERE JP_Sales > EU_Sales AND JP_Sales > NA_Sales AND JP_Sales > Other_Sales)
UNION
(SELECT 'Other' AS Region, COUNT(*) AS Num_Games
FROM 'vgsales_kaggle.csv'
WHERE Other_Sales > EU_Sales AND Other_Sales > JP_Sales AND Other_Sales > NA_Sales)
ORDER BY Num_Games DESC;
Spinner
DataFrameas
df3a
variable
-- Q3a. What are the top 10 platforms based on global sales?
SELECT Platform, SUM(Global_Sales) as Total_Global_Sales
FROM 'vgsales_kaggle.csv'
GROUP BY Platform
ORDER BY Total_Global_Sales DESC
LIMIT 10;
Spinner
DataFrameas
df3b
variable
-- Q3b. Does the result in Q3a change across regions?
CREATE TEMP TABLE NA_Top AS
	SELECT Platform as NA_Platform, SUM(NA_Sales) as Total_NA_Sales, 
		ROW_NUMBER() OVER(ORDER BY Total_NA_Sales DESC) AS Index_Num
	FROM 'vgsales_kaggle.csv'
	GROUP BY Platform
	ORDER BY Total_NA_Sales DESC;

SELECT t1.Index_Num, t1.NA_Platform, t1.Total_NA_Sales, t2.EU_Platform, t2.Total_EU_Sales, t3.JP_Platform, t3.Total_JP_Sales, t4.Other_Platform, t4.Total_Other_Sales
FROM NA_Top AS t1
LEFT JOIN (
	SELECT Platform as EU_Platform, SUM(EU_Sales) as Total_EU_Sales, 
		ROW_NUMBER() OVER(ORDER BY Total_EU_Sales DESC) AS Index_Num
	FROM 'vgsales_kaggle.csv'
	GROUP BY Platform
	ORDER BY Total_EU_Sales DESC
) AS t2
ON t1.Index_Num = t2.Index_Num
LEFT JOIN (
	SELECT Platform as JP_Platform, SUM(JP_Sales) as Total_JP_Sales, 
		ROW_NUMBER() OVER(ORDER BY Total_JP_Sales DESC) AS Index_Num
	FROM 'vgsales_kaggle.csv'
	GROUP BY Platform
	ORDER BY Total_JP_Sales DESC
) AS t3
ON t1.Index_Num = t3.Index_Num
LEFT JOIN (
	SELECT Platform as Other_Platform, SUM(Other_Sales) as Total_Other_Sales, 
		ROW_NUMBER() OVER(ORDER BY Total_Other_Sales DESC) AS Index_Num
	FROM 'vgsales_kaggle.csv'
	GROUP BY Platform
	ORDER BY Total_Other_Sales DESC
) AS t4
ON t1.Index_Num = t4.Index_Num
ORDER BY t1.Index_Num ASC
LIMIT 10;
Spinner
DataFrameas
df4
variable
-- Q4. Show the running total of global sales in each year for each publisher.
SELECT Publisher, Year, Total_Global_Sales, SUM(Total_Global_Sales) OVER(PARTITION BY Publisher ORDER BY Year ASC) AS Running_Total_Global_Sales
FROM (
	SELECT Publisher, Year, SUM(Global_Sales) AS Total_Global_Sales
	FROM 'vgsales_kaggle.csv'
	GROUP BY Publisher, Year
)
ORDER BY Publisher ASC, Year ASC;