Steam NLP Analysis: Exploring Gamer Sentiments
Introduction
Welcome to this cutting-edge project that dives into the world of Steam game reviews!
The aim of this project is to decipher the hidden sentiments of gamers using Natural Language Processing (NLP) techniques. By analyzing the intricate patterns of language within user reviews, we extract TF-IDF (Term Frequency-Inverse Document Frequency) features and use them to construct a predictive model whose goal is to forecast whether a player will recommend a game based on the review's textual content.
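As a quick reference, the classic TF-IDF weight of a term t in a document d, across a corpus of N documents, is

$$\text{tf-idf}(t, d) = \text{tf}(t, d) \times \log\frac{N}{\text{df}(t)}$$

where tf(t, d) counts the occurrences of t in d and df(t) counts the documents containing t. Note that scikit-learn's TfidfVectorizer, used later in this project, applies a smoothed variant of this formula and L2-normalizes the resulting vectors.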
The project doesn't stop there; it also harnesses the power of TextBlob for sentiment analysis to uncover any links between the sentiment polarity of reviews and the likelihood of game recommendation. With a robust training dataset at our disposal, the model will be trained to discern the nuances of gamer feedback. The model will then be put to the test with a separate dataset, ensuring the predictions hold up against fresh, unseen reviews.
Enjoy this analytical journey to unlock the secrets behind what drives players to suggest — or not suggest — their favourite games.
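To give a flavour of the sentiment scores TextBlob produces, here is a minimal sketch; the review strings are made up for illustration, and polarity ranges from -1 (negative) to +1 (positive):
from textblob import TextBlob
# Illustrative review strings (not drawn from the dataset)
sample_reviews = [
    "Absolutely loved this game, the combat feels amazing!",
    "Buggy, unbalanced, and a waste of money."
]
for text in sample_reviews:
    # sentiment.polarity is a float in [-1.0, 1.0]
    print(f"{TextBlob(text).sentiment.polarity:+.2f} | {text}")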
import pandas as pd
# Load the dataset
game_overview = pd.read_csv('datasets/game_overview.csv')
# Display the first 5 rows
game_overview.head()
A Peek at the data
The training dataset contains a game title column, a year column, a user review column holding the textual gamer feedback, and a user suggestion (recommendation) column. User suggestions are marked as "1" for recommended and "0" otherwise in the "user_suggestion" column.
# Load the reviews dataset
reviews = pd.read_csv('datasets/train.csv')
# Strip redundant "Early Access Review" and "Product received for free" from the user_review column
reviews['user_review'] = reviews['user_review'].str.replace("Early Access Review", "", regex=False)
reviews['user_review'] = reviews['user_review'].str.replace("Product received for free", "", regex=False)
# Display the first 5 rows of dataframe
reviews.head()
# Review info on columns and dtype
reviews.info()
# Count number of positive and negative suggestions (recommend the game or not)
reviews['user_suggestion'].value_counts()
# Count the number of unique titles (games) in the reviews dataframe and display
num_titles = reviews['title'].nunique()
print(f"Number of unique titles in the reviews dataframe: {num_titles}")
A glance at the most and least popular games in the dataset
Here, we group the reviews by title and look at the difference between positive and negative recommendation counts to get an intuitive feel for the most and least popular games in the dataset. The resulting percentage is similar to the familiar "X% of gamers liked this game" measure shown on Google and similar platforms; for example, a title with 300 positive and 100 negative reviews scores 75%.
# Group by title and calculate the number of user_suggestion counts for each title
user_suggestion_counts = reviews.groupby('title')['user_suggestion'].value_counts().unstack(fill_value=0)
# Add difference column representing ratio of positive to negative suggestion
user_suggestion_counts['difference'] = user_suggestion_counts[1] - user_suggestion_counts[0]
# Add a percentage column reflecting the number of positive suggestions over the total number of reviews
user_suggestion_counts['percentage'] = ((user_suggestion_counts[1] / (user_suggestion_counts[0] + user_suggestion_counts[1])) * 100).round(2)
user_suggestion_counts
import plotly.express as px
# Limit the number of characters in the x axis labels to 10
user_suggestion_counts['title_short'] = user_suggestion_counts.index.str.slice(0, 10)
# Create a color column based on the difference being positive or negative
user_suggestion_counts['color'] = user_suggestion_counts['difference'].apply(lambda x: 'Positive' if x > 0 else 'Negative')
# Create a bar chart using Plotly Express with shortened titles and full titles in hover text
# Color the bars based on the 'color' column
fig = px.bar(user_suggestion_counts.reset_index(), x='title_short', y='difference',
             title='Difference in User Suggestions per Game Title',
             labels={'difference': 'Difference in Suggestions', 'title_short': 'Game Title'},
             color='color',  # Color the bars based on the 'color' column
             hover_data={'title': True, 'title_short': False, 'color': False})  # Show the full title on hover; hide the short title and color
# Show the figure
fig.show()
# Sort the user_suggestion_counts dataframe by the 'difference' column to find the best and worst titles
sorted_user_suggestions = user_suggestion_counts.sort_values(by='difference', ascending=False)
# Display the 5 best titles
print("5 Best Titles Based on User Suggestions:")
display(sorted_user_suggestions.head(5))
# Display the 5 worst titles
print("\n5 Worst Titles Based on User Suggestions:")
display(sorted_user_suggestions.tail(5))
# In this cell, we define a plotting function for the visualization of the confusion matrix that results from testing trained models
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
def plot_confusion_matrix(cf_matrix):
    # Format the raw count of each cell in the confusion matrix
    group_counts = ["{0:0.0f}".format(value) for value in cf_matrix.flatten()]
    # Calculate each cell as a percentage of all predictions
    group_percentages = ["{0:.2%}".format(value) for value in cf_matrix.flatten() / np.sum(cf_matrix)]
    # Labels combine counts and percentages
    labels = [f"{v1}\n{v2}" for v1, v2 in zip(group_counts, group_percentages)]
    # Reshape the labels to the 2x2 shape of the matrix
    labels = np.asarray(labels).reshape(2, 2)
    # Create the heatmap
    sns.heatmap(cf_matrix, annot=labels, fmt='', cmap='Blues', cbar=False)
    # Add labels to the x and y axes and a title
    plt.xlabel('Predicted labels')
    plt.ylabel('True labels')
    plt.title('Confusion Matrix with Counts and Percentages')
    plt.show()
Baseline Binary Classifier Model: TF-IDF and Logistic Regression
This code imports the necessary text-processing and machine-learning libraries, constructs a TF-IDF vectorizer with specific parameters to transform user reviews into a numerical format, creates a DataFrame from the transformed data, and adds the target variable 'user_suggestion' back in. It then defines the features and target for modeling, splits the data into training and test sets, trains a logistic regression model, makes predictions on the test set, and evaluates performance by printing the accuracy score and displaying the confusion matrix. This is a good first attempt at fitting the data and serves as a baseline model for comparison.
# Import the TfidfVectorizer and default list of English stop words, processing and modeling modules
from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
# Build the vectorizer
vect = TfidfVectorizer(stop_words='english', ngram_range=(1, 2), max_features=1000, token_pattern=r'\b[^\d\W][^\d\W]+\b').fit(reviews.user_review)
# Create sparse matrix from the vectorizer
X = vect.transform(reviews.user_review)
# Create a DataFrame from tf-idf frequencies and tokens
reviews_transformed = pd.DataFrame(X.toarray(), columns=vect.get_feature_names_out())
print(reviews_transformed.shape)
reviews_transformed['user_suggestion'] = reviews.user_suggestion
# Define X and y
y = reviews_transformed.user_suggestion
X = reviews_transformed.drop('user_suggestion', axis=1)
# Train/test split review dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
# Train a logistic regression model
log_reg = LogisticRegression().fit(X_train, y_train)
# Predict the labels for the test set
y_predicted = log_reg.predict(X_test)
# Calculate the confusion matrix to see distribution of predicted labels (correct along diagonal, incorrect off-diagonal)
cf_matrix = confusion_matrix(y_test, y_predicted)
# Print accuracy score and confusion matrix on test set
print('Accuracy on the test set: ', accuracy_score(y_test, y_predicted))
print(cf_matrix)
# Visualise confusion matrix using function defined above
plot_confusion_matrix(cf_matrix)
Optimizing the Model using GridSearchCV
The baseline model produces an accuracy score of 82.7% in predicting recommendations based on the textual content of user reviews. This is already a well-performing model, but higher accuracy can be achieved by fine-tuning the parameters of the TF-IDF vectorizer.
This code sets up a machine learning pipeline with a TF-IDF vectorizer and a logistic regression classifier, defines a grid of parameters to search for the best text feature extraction settings, and performs a grid search with 5-fold cross-validation on user reviews to find the optimal parameters, finally printing the best parameter set and the corresponding score.
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
# Define a pipeline combining a text feature extractor with a simple logistic regression classifier.
# The token_pattern matches Latin-alphabet words of 3 or more letters to filter out unwanted tokens
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english', token_pattern=r'\b[a-zA-Z]{3,}\b')),
    ('clf', LogisticRegression())
])
# Define the parameter grid to search: unigrams or bigrams | max number of features (tokens/words or bigrams)
parameter_grid = {
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'tfidf__max_features': [2000, 5000, 10000, 20000]
}
# Setup the grid search using the pipeline and the parameter grid
grid_search = GridSearchCV(pipeline, parameter_grid, cv=5, n_jobs=-1, verbose=4)
# Perform the grid search on the user reviews
grid_search.fit(reviews.user_review, reviews.user_suggestion)
# Print the best parameters and the corresponding score
print("Best parameters set:")
print(grid_search.best_params_)
print("Best score: %0.3f" % grid_search.best_score_)
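As a follow-up, the tuned pipeline can be checked against reviews it has never seen. The sketch below is one way to do this, under the assumption that we reserve a held-out split before re-running the search (the grid search above was fitted on the full training set, so scoring it on the same rows would be optimistic):
# Hold out a test split first so the final score reflects unseen reviews
X_tr, X_te, y_tr, y_te = train_test_split(
    reviews.user_review, reviews.user_suggestion, test_size=0.2, random_state=123)
# Re-run the search on the training split only
grid_search.fit(X_tr, y_tr)
# best_estimator_ is the pipeline refitted on X_tr with the best parameters found
y_pred = grid_search.best_estimator_.predict(X_te)
print('Accuracy on the held-out set: ', accuracy_score(y_te, y_pred))
plot_confusion_matrix(confusion_matrix(y_te, y_pred))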