
NLP-Enhanced Interactive Game Recommender Engine

Introduction

Embark on an exploration of game recommendation, where data aggregation meets the precision of feature engineering.

Game Recommender Engine

The project commences with the collection of game metadata and media, forming a comprehensive analytical foundation. Textual descriptions undergo TF-IDF vectorization, while community tags are encoded, crafting a nuanced feature set that captures the essence of each game.

Normalization techniques are applied to ensure equitable feature contribution, paving the way for the construction of a cosine similarity matrix. This matrix serves as a quantifiable measure of likeness between games, guiding the generation of tailored recommendations.
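To make this concrete, the snippet below sketches how a cosine similarity matrix can be turned into recommendations. It is a minimal illustration, assuming a combined feature matrix has already been built; feature_matrix, game_names, and recommend_similar_games are illustrative names rather than the project's actual implementation.

from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def recommend_similar_games(feature_matrix, game_names, query_name, top_n=5):
    # Pairwise cosine similarity between every pair of games (rows of the matrix)
    similarity_matrix = cosine_similarity(feature_matrix)
    # Locate the queried game and rank all other games by similarity to it
    idx = game_names.index(query_name)
    ranked = np.argsort(similarity_matrix[idx])[::-1]
    # Drop the queried game itself and return the top_n closest titles
    ranked = [i for i in ranked if i != idx][:top_n]
    return [game_names[i] for i in ranked]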

Game artwork is also integrated, with images that capture each title's visual appeal and enhance user engagement. The project culminates in an interactive interface, offering a seamless blend of analytics and user experience, designed to deliver personalized game discovery for every gamer!

Data Preparation and Feature Engineering

Initial Data Exploration

In this initial step, we import the essential pandas library and load our dataset from 'steam.csv' into a DataFrame named steam_data. By displaying the first few rows, we get a glimpse into the structure and content of our data, setting the stage for the exciting analyses and insights to come.

import pandas as pd

# Load the dataset
steam_data = pd.read_csv('datasets/steam.csv')

# Display the first few rows of the dataframe
steam_data.head()

Loading Steam Media Data

As we delve deeper, we now load the 'steam_media_data.csv' file, which contains rich media information related to the games on Steam. This DataFrame, steam_media_data, will provide us with visual context such as game images and videos, enhancing our understanding of each game's aesthetic and thematic elements. Let's peek at the first few entries to familiarize ourselves with the media aspects of our dataset.

# Load the steam media data
steam_media_data = pd.read_csv('datasets/steam_media_data.csv')

# Display the first few rows of the dataframe
steam_media_data.head()
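For the interactive interface later on, each recommendation can be paired with its artwork by joining the media table to the main table on the app identifier. The join below is only a sketch; the column names appid, steam_appid, and header_image are assumptions that should be checked against the head() output above.

# Attach media rows to the main game table on the app identifier
# (column names are assumed; verify them against the head() output above)
steam_with_media = steam_data.merge(
    steam_media_data,
    left_on='appid',
    right_on='steam_appid',
    how='left'
)
steam_with_media[['name', 'header_image']].head()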

Dataset Summary and Missing Values Check

Moving forward, we configure our environment to display all columns without truncation, ensuring a comprehensive view. We then generate a summary of our steam_data dataset, providing descriptive statistics and a complete overview of all features. Additionally, we perform a missing values check to identify any gaps in our data. The output presents both the summary statistics and the count of missing values, equipping us with valuable insights for data cleaning and preparation.

# Summary of the dataset with all columns shown
pd.set_option('display.max_columns', None)

# Summary of the dataset
summary = steam_data.describe(include='all').transpose()

# Check for missing values
missing_values = steam_data.isnull().sum()

# Display the summary and missing values
summary, missing_values

Exploring Game Release Years and Price Distribution

In our exploration phase, we enrich the steam_data with a release_year column extracted from the release dates. Our first Plotly histogram visualizes the distribution of games by release year, highlighting the evolution of game releases over time. Next, we create a histogram to examine the price distribution of games, providing insights into the range and commonality of game prices. These interactive visualizations are crucial for understanding both historical trends and the current state of the gaming market.

import plotly.express as px

# Extract the year from the release_date column as release_year
steam_data['release_year'] = pd.to_datetime(steam_data['release_date']).dt.year

# Distribution of games by release year using Plotly
fig1 = px.histogram(steam_data, x='release_year', nbins=len(steam_data['release_year'].unique()), title='Distribution of Games by Release Year')
fig1.update_layout(xaxis_title='Year', yaxis_title='Number of Games', bargap=0.2)
fig1.show()

# Price distribution using Plotly
fig2 = px.histogram(steam_data, x='price', nbins=50, title='Price Distribution of Games')
fig2.update_traces(xbins=dict( # bins used for histogram
    start=0,
    end=steam_data['price'].max(),
    size=steam_data['price'].max()/50
))
fig2.update_layout(xaxis_title='Price (£)', yaxis_title='Number of Games')
fig2.show()

Price Sensitivity in Game Recommendations

The price distribution plot indicates that a large majority of games on Steam are priced at the lower end of the spectrum, with a significant number of games available for free or at a very low cost. This suggests that price is a critical factor for users when browsing for games. A game recommender system could prioritize suggesting free-to-play or low-cost games, especially to users who are sensitive to price or are new to the platform and might be looking for an entry point without a significant financial commitment.

Additionally, the presence of a long tail in the price distribution implies that there are also premium games with higher price points. These could be targeted towards users who have shown a willingness to purchase more expensive games in the past or who are looking for high-quality or niche titles that justify a higher price.

In summary, incorporating price as a feature in the recommendation algorithm could help tailor suggestions to the financial preferences of different user segments, potentially increasing user satisfaction and engagement with the platform. Given the distribution and our focus on the majority of users, we will remove games priced at £50 or above to better align our recommendations with the spending habits of the average user.

import plotly.express as px

# Filter out games priced at £50 or above, as most are game builder apps
filtered_steam_data = steam_data[steam_data['price'] < 50]

# Price distribution of games priced below 50 using Plotly, with bins split by integers
fig2_filtered = px.histogram(filtered_steam_data, x='price', nbins=int(filtered_steam_data['price'].max()), title='Price Distribution of Games Below £50')
fig2_filtered.update_traces(xbins=dict( # bins used for histogram
    start=0,
    end=filtered_steam_data['price'].max(),
    size=1  # Setting bin size to 1 for integer splits
))
fig2_filtered.update_layout(xaxis_title='Price (£)', yaxis_title='Number of Games')
fig2_filtered.show()

Choice of Recommendation Features

The scatter plot below of price versus average number of owners for games priced below £50 illustrates that affordability may significantly influence a game's popularity, with lower-priced games generally attracting more owners. This trend underscores the potential of recommending lower-priced games to engage a broader audience within a game recommender system.

Despite this, the data also reveals that higher-priced games can achieve substantial ownership, suggesting a market for premium or niche gaming experiences. A balanced recommendation approach that accounts for both price and ownership can therefore offer personalized game suggestions that resonate with diverse user preferences and spending capacities.

To optimize the recommendations, this project will therefore not use price as a feature, focusing instead on more distinctive attributes such as genre and SteamSpy community tags. We will also leverage natural language processing (NLP) techniques to analyze game descriptions. This multifaceted strategy enables us to capture the essence of each game and the interests of our users more accurately, resulting in a recommendation system that aligns with individual tastes, interests, and willingness to invest in gaming experiences.

import matplotlib.pyplot as plt
import numpy as np

# Filter the dataframe for games priced below 50
filtered_prices = steam_data[steam_data['price'] < 50]

# The 'owners' column stores a range in the format 'min_owners-max_owners' (e.g., '0-20000')
# Split the range into numeric lower and upper bounds, then take the midpoint
# as the average number of owners for each game
split_owners = filtered_prices['owners'].str.split('-', expand=True).astype(float)
avg_owners = (split_owners[0] + split_owners[1]) / 2

# Now, we plot price vs. average owners
plt.figure(figsize=(10, 6))
plt.scatter(avg_owners, filtered_prices['price'], alpha=0.5)
plt.title('Price vs. Average Number of Owners for Games Priced Below £50')
plt.xlabel('Average Number of Owners')
plt.ylabel('Price (£)')
plt.xscale('log')  # Using a log scale for owners to better display the wide range of values
plt.grid(True, which="both", ls="--")
plt.show()
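As a preview of the NLP step outlined above, the snippet below sketches how game descriptions could be converted into TF-IDF vectors. The descriptions Series is a stand-in; in the actual project the text would come from the games' description data rather than these placeholder strings.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder description texts; the real project would use the games' descriptions
descriptions = pd.Series([
    "A fast-paced roguelike dungeon crawler with procedurally generated levels.",
    "A relaxing farming simulator about building and decorating a small homestead."
])

# Convert free text into TF-IDF vectors, ignoring common English stop words
tfidf = TfidfVectorizer(stop_words='english', max_features=5000)
description_vectors = tfidf.fit_transform(descriptions)

print("TF-IDF matrix shape:", description_vectors.shape)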

Preprocessing Genres and Tags for Machine Learning

To harness the predictive power of genres and community tags in our recommender system, we proceed to preprocess these categorical features. Using the MultiLabelBinarizer from scikit-learn, we transform the semicolon-delimited strings of genres and tags into a machine learning-friendly format. The one-hot encoding process results in two DataFrames, genres_encoded and tags_encoded, which represent the presence or absence of each genre and tag as binary features. We conclude this step by displaying the shapes of the encoded DataFrames, confirming the expansion of our feature space to accommodate these multi-label attributes.

from sklearn.preprocessing import MultiLabelBinarizer
import pandas as pd

# Split the semicolon-delimited genre and tag strings into lists
split_genres = steam_data['genres'].str.split(';')
split_tags = steam_data['steamspy_tags'].str.split(';')

# Initialize MultiLabelBinarizer for each column separately
mlb_genres = MultiLabelBinarizer()
mlb_tags = MultiLabelBinarizer()

# One-hot encode the split genres and tags for optimal ML format
one_hot_genres = mlb_genres.fit_transform(split_genres)
one_hot_tags = mlb_tags.fit_transform(split_tags)

# Create DataFrames from the one-hot encoded arrays using classes of the binarizer as column/feature names
genres_encoded = pd.DataFrame(one_hot_genres, columns=mlb_genres.classes_)
tags_encoded = pd.DataFrame(one_hot_tags, columns=mlb_tags.classes_)

# Display the shape of the resulting DataFrames with textual context
print("Shape of genres_encoded DataFrame:", genres_encoded.shape)
print("Shape of tags_encoded DataFrame:", tags_encoded.shape)