
Sentiment analysis is a use case of Natural Language Processing (NLP). It is a basic classification problem, but the features come from language rather than numeric attributes (e.g., age). Today, sentiment analysis tools are everywhere. They help analyze movie ratings to inform future productions, keep track of product or service reviews to guide improvements, and monitor the sentiment of social media posts, such as a tweet or an Instagram post, to uncover valuable insights into customer perceptions.

For the purposes of this analysis, we will:

  • Explore a 'review' dataset
  • Visualize the frequency or the importance of each word
  • Transform sentiment-carrying columns
  • Predict sentiment with two supervised machine learning models
  • Evaluate the models and suggest further refinements

Step 0: Import Libraries

%%capture
!pip install wordcloud 
# Basic functions
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from functools import reduce

# Model training, prediction, evaluation
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB 
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay

#Seed for reproducibility
SEED=123

# Word visualization
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

# Text analysis with the Natural Language Toolkit (NLTK)
import nltk
from nltk import word_tokenize

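nltk.download("punkt")
# Recent NLTK releases look for the "punkt_tab" resource instead of "punkt"
nltk.download("punkt_tab")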

Step 1: Collect and process the data

Load a dataset that contains review text and a sentiment label (0 = negative, 1 = positive).

# Upload data as CSV and load as a data frame
df = pd.read_csv('reviews.csv',index_col=0)
print('Shape of the dataset:',df.shape)
df.head(10)
df['score'].value_counts()/len(df['score'])
df['score'].value_counts().rename(index={0: 'Negative', 1: 'Positive'})

The dataset is fairly balanced: the numbers of negative and positive reviews are almost equal.

Step 2: WordCloud and feature creation

Visualize the meaningful words in positive reviews with a word cloud. This tool comes in handy for analyzing feedback and reviews.

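# Take the first 200 positive reviews (1 = positive, 0 = negative)
positive_df = df[df["score"] == 1]["review"][:200]
# Join with spaces so the last word of one review does not fuse with the first word of the next
positive_df = " ".join(positive_df)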

# Create and generate a word cloud image
cloud_positives = WordCloud(background_color="white").generate(positive_df)

# Display the generated wordcloud image
plt.imshow(cloud_positives, interpolation="bilinear")
# Choose title, position and size
plt.title("Words from the first 200 positive reviews", y=1.02, size=14)
# Turn off axis labels
plt.axis("off")  

# Show the final image
plt.show()

Generate the word cloud for positive reviews again, this time adding a few custom stopwords to the library's built-in list:

# Create stopword list:
stopwords = set(STOPWORDS)
stopwords.update(["book", "block", "one", "will", "look"])

# Generate a word cloud image
wordcloud = WordCloud(stopwords=stopwords, background_color="white").generate(positive_df)

# Display the generated image:
# the matplotlib way:
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
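
A word cloud sizes each word by how often it occurs, so the same information can also be inspected as plain counts. The following is a minimal sketch of that idea using only the standard library. The sample text and the small stopword set are hypothetical stand-ins, not the real dataset or the full STOPWORDS list, and WordCloud applies extra processing of its own, so its results will not match simple counts exactly.

```python
import re
from collections import Counter

# Hypothetical sample text and stopword set, for illustration only
sample_text = "A great book. The story is great and the characters are great fun."
custom_stopwords = {"a", "the", "is", "and", "are", "book"}

# Lowercase the text, extract alphabetic words, and drop the stopwords
words = [
    word
    for word in re.findall(r"[a-z']+", sample_text.lower())
    if word not in custom_stopwords
]

# Count how often each remaining word occurs
word_counts = Counter(words)
print(word_counts.most_common(1))
```

Words added to the stopword set, such as "book" here, disappear from the counts in the same way they disappear from the cloud.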

There are 2 steps for feature creation:

  1. Tokenization of Reviews: word_tokenize is a function from the NLTK library that splits a text into individual words (tokens). This operation is applied to every review in the review column, resulting in a list of lists, where each sublist contains the tokens of a respective review.
  2. Creating a new feature, n_words: the number of tokens in each review.