Sentiment Analysis and Prediction in Python

Welcome to your webinar workspace! Here, you can follow along as we try to predict the sentiment of movie reviews!

The cells below install a package that is not currently available in Workspace and import the libraries we will use in this notebook. The final cell also loads the data in your directory ("movie_reviews.csv") into a DataFrame.

%%capture
!pip install wordcloud
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from wordcloud import WordCloud
from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, ConfusionMatrixDisplay

# Load data as a DataFrame
df = pd.read_csv("movie_reviews.csv")

# Preview the data
df.sample(3)

Inspect and explore our data

We can inspect the data types and the number of non-null values per column using the .info() method.

# Inspect the data types and non-null rows
df.info()

As this is a classification problem, we will want to inspect the balance of our target variable, label. We can use .value_counts() with normalize set to True to return the proportion of each class rather than the raw counts.

# Check the value counts of the sentiment label
df["label"].value_counts(normalize=True)
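
To see what normalize=True changes, here is a minimal sketch on a toy Series. The labels "pos" and "neg" and the imbalance ratio are assumptions for illustration only; they are not taken from the webinar data.

```python
import pandas as pd

# Hypothetical labels for illustration only (not the webinar data)
toy_labels = pd.Series(["pos", "neg", "pos", "pos"], name="label")

# normalize=True returns proportions instead of raw counts
proportions = toy_labels.value_counts(normalize=True)
print(proportions)

# A rough imbalance check: share of the largest class over the smallest
imbalance_ratio = proportions.max() / proportions.min()
print(f"Imbalance ratio: {imbalance_ratio:.1f}")
```

A ratio close to 1 suggests balanced classes; a much larger value suggests accuracy alone may be a misleading metric.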

One way to inspect our text data is to create a word cloud, which draws each word at a size proportional to how often it appears. To create one, we initialize a WordCloud(). Specifying the stopwords allows us to filter out generic words such as "the" and "and".

# Concatenate the text review data
reviews = " ".join(df["text"])

# Create the word cloud image
word_cloud = WordCloud(background_color='white',
                       stopwords=ENGLISH_STOP_WORDS,
                       width=800,
                       height=400)

# Generate the word cloud using the review data
word_cloud.generate(reviews)

# Display the word cloud
plt.rcParams["figure.figsize"] = (12, 8)
plt.imshow(word_cloud, interpolation="bilinear") 
plt.axis("off")
plt.show()
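
If we want numbers rather than a picture, the same idea can be sketched with the standard library. The reviews and the tiny stop word list below are hypothetical, and this is only a rough text equivalent of a word cloud, not what WordCloud does internally.

```python
import re
from collections import Counter

# Hypothetical reviews and a tiny hand-written stop word list (assumptions)
toy_reviews = [
    "The plot was great and the acting was great",
    "The plot was dull",
]
toy_stop_words = {"the", "was", "and"}

# Lowercase, extract alphabetic tokens, and drop stop words
tokens = re.findall(r"[a-z]+", " ".join(toy_reviews).lower())
counts = Counter(t for t in tokens if t not in toy_stop_words)

# The most frequent words are the ones a word cloud would draw largest
top_words = counts.most_common(2)
print(top_words)
```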

Pre-processing the review text

To pre-process the text, we will use term frequency-inverse document frequency, or TfIdf. TfIdf scores how important a word is to one document relative to a whole collection of documents (here, each review is a document). TfIdf has the advantages of:

  • Highlighting words that are common within a document but not across documents.
  • Returning low scores for words common across all reviews (e.g., movie in movie reviews).
  • Down-weighting words that appear in most documents, so we don't need to worry about stop words as much.
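
The points above can be sketched by hand. The snippet below uses the smoothed weighting that scikit-learn documents as the default for its TfidfVectorizer, idf = ln((1 + n) / (1 + df)) + 1, and leaves out the final length normalization. The tokenized reviews and the helper names idf and tfidf are hypothetical.

```python
import math

# Hypothetical tokenized reviews (assumption, for illustration only)
docs = [
    ["movie", "great", "great"],
    ["movie", "boring"],
    ["movie", "great", "plot"],
]
n_docs = len(docs)

def idf(term):
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    df = sum(term in doc for doc in docs)
    return math.log((1 + n_docs) / (1 + df)) + 1

def tfidf(term, doc):
    # Raw term count in the document, weighted by the term's idf
    return doc.count(term) * idf(term)

for term in ["movie", "great", "boring"]:
    print(term, round(idf(term), 3))
```

Here "movie" appears in every review, so it gets the lowest possible weight, while "boring" appears in only one review and gets the highest.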

Fortunately, Scikit-Learn has a TfidfVectorizer class that can convert text data into a set of TfIdf features.

# Specify the word pattern
pattern = r"[a-zA-Z]+"

# Build the vectorizer and fit to the text data
vect = TfidfVectorizer(
    token_pattern=pattern, # Define the pattern to extract words
    stop_words="english", # Built-in English stop words (recent scikit-learn versions reject the ENGLISH_STOP_WORDS frozenset here)
    ngram_range=(1, 2),  # Consider uni- and bi-grams
    max_features=500,  # Maximum number of features
)

vect.fit(df["text"])

# Create sparse matrix from the vectorizer
tokenized_features = vect.transform(df["text"])

# Create a DataFrame of the new features
features = pd.DataFrame(data=tokenized_features.toarray(), 
                        columns=vect.get_feature_names_out()
                       )
features

Let's add a few more features about the nature of the review, calculating different length metrics for the text (inspired by this great article).

# Generate a number of different length metrics based on the text
df["char_count"] = df["text"].str.count(r"\S")
df["word_count"] = df["text"].str.count(pattern)
df["avg_word_length"] = df["char_count"] / df["word_count"]

# Preview our new columns
df.sample(3)
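
To check what these metrics actually count, we can run the same expressions on a single hypothetical review. Note that punctuation counts as a character, and that the letters-only pattern splits a contraction such as "don't" into two words.

```python
import pandas as pd

# Hypothetical review (assumption, for illustration only)
toy = pd.DataFrame({"text": ["I don't mind it!"]})

toy["char_count"] = toy["text"].str.count(r"\S")         # Non-whitespace characters
toy["word_count"] = toy["text"].str.count(r"[a-zA-Z]+")  # Runs of letters
toy["avg_word_length"] = toy["char_count"] / toy["word_count"]
print(toy)
```

A review with no letters at all would have a word_count of 0, so the average would not be a finite number; it may be worth checking for such rows before modeling.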

Fit a model and evaluate its performance

Finally, we assign our features and target to X and y, respectively, split our data into train and test subsets, and fit a classification model to the data.

In this case, we use a simple RandomForestClassifier() with default hyperparameters and calculate the classification metrics by comparing the true test set labels with our predicted values.

# Define X and y
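# Select the length metrics by name so extra columns in df cannot slip into X
X = pd.concat([features, df[["char_count", "word_count", "avg_word_length"]]], axis=1)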
y = df["label"]

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.25,  
    random_state=42
)

# Train a random forest classifier
rf = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# Predict the labels
y_pred = rf.predict(X_test)

# Print classification metrics
print(classification_report(y_test, y_pred))
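
The report lists precision, recall and F1 for each class. As a reading aid, here is a minimal sketch of how those three numbers are computed for one class. The toy labels and the helper precision_recall_f1 are hypothetical, not part of Scikit-Learn.

```python
# Hypothetical true and predicted labels (assumption, for illustration only)
y_true_toy = ["pos", "pos", "neg", "neg", "pos", "neg"]
y_pred_toy = ["pos", "neg", "neg", "pos", "pos", "neg"]

def precision_recall_f1(y_true, y_pred, positive):
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # True positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # False positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # False negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

precision, recall, f1 = precision_recall_f1(y_true_toy, y_pred_toy, positive="pos")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Precision answers "of the reviews predicted as this class, how many were right?", while recall answers "of the reviews truly in this class, how many did we find?".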