Sentiment Analysis and Prediction in Python
Welcome to your webinar workspace! Here, you can follow along as we try to predict the sentiment of movie reviews!
The cells below install a package currently unavailable in Workspace and import the libraries we will use in this code. The final cell also imports the data in your directory ("movie_reviews.csv").
%%capture
!pip install wordcloud
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from wordcloud import WordCloud
from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, ConfusionMatrixDisplay
# Load data as a DataFrame
df = pd.read_csv("movie_reviews.csv")
# Preview the data
df.sample(3)
Inspect and explore our data
We can inspect the data types and the number of non-null rows per column using the .info() method.
# Inspect the data types and non-null rows
df.info()
As this is a classification problem, we will want to inspect the balance of our target variable label. We can use .value_counts() with normalize set to True to return the proportion of each class label.
# Check the value counts of the sentiment label
df["label"].value_counts(normalize=True)
One way to inspect our text data is to create a word cloud, which shows the most frequent words by size. To create one, we initialize a WordCloud(). Specifying the stopwords allows us to filter out generic words such as "the" and "and".
# Concatenate the text review data
reviews = " ".join(df["text"])
# Create the word cloud image
word_cloud = WordCloud(background_color='white',
                       stopwords=ENGLISH_STOP_WORDS,
                       width=800,
                       height=400)
# Generate the word cloud using the review data
word_cloud.generate(reviews)
# Display the word cloud
plt.rcParams["figure.figsize"] = (12, 8)
plt.imshow(word_cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
Pre-processing the review text
To pre-process the text, we will use term frequency-inverse document frequency, or TfIdf. TfIdf scores how important a word is to a single document relative to a whole collection of documents (here, each review is one document). TfIdf has the advantages of:
- Highlighting words that are common within a document but not across documents.
- Returning low scores for words common across all reviews (e.g., movie in movie reviews).
- Penalizing frequent words so we don't need to worry about stop words as much.
Fortunately, Scikit-Learn has a TfidfVectorizer class that can convert text data into a set of TfIdf features.
# Specify the word pattern
pattern = r"[a-zA-Z]+"
# Build the vectorizer and fit to the text data
vect = TfidfVectorizer(
    token_pattern=pattern,  # Define the pattern to extract words
    stop_words=list(ENGLISH_STOP_WORDS),  # English stop words as a list (recent scikit-learn versions reject a frozenset here)
    ngram_range=(1, 2),  # Consider uni- and bi-grams
    max_features=500,  # Maximum number of features
)
vect.fit(df["text"])
# Create sparse matrix from the vectorizer
tokenized_features = vect.transform(df["text"])
# Create a DataFrame of the new features
features = pd.DataFrame(
    data=tokenized_features.toarray(),
    columns=vect.get_feature_names_out(),
)
features
Let's add a few more features about the nature of the review, calculating different length metrics for the text (inspired by this great article).
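As a quick illustration of what these length metrics measure, the sketch below applies the same kind of regex counts to a single made-up string (the string and the toy_ names are hypothetical). It counts non-whitespace characters and alphabetic words, then divides one by the other.

```python
import pandas as pd

# Hypothetical one-review Series, for illustration only
toy_text = pd.Series(["Great film, 10/10!"])

# Count every non-whitespace character, including digits and punctuation
toy_char_count = toy_text.str.count(r"\S")

# Count runs of letters, which is how words are defined by the pattern
toy_word_count = toy_text.str.count(r"[a-zA-Z]+")

# Average characters per word
toy_avg_word_length = toy_char_count / toy_word_count

print(toy_char_count.iloc[0], toy_word_count.iloc[0], toy_avg_word_length.iloc[0])
```

Because the character count includes digits and punctuation while the word count does not, the average here is 8.0 rather than the 4.5 letters per word a reader might expect. The metric is best read as a rough signal of review style, not an exact word length.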
# Generate a number of different length metrics based on the text
df["char_count"] = df["text"].str.count(r"\S")
df["word_count"] = df["text"].str.count(pattern)
df["avg_word_length"] = df["char_count"] / df["word_count"]
# Preview our new columns
df.sample(3)
Fit a model and evaluate its performance
Finally, we assign our features and target to X and y, respectively, split our data into train and test subsets, and fit a classification model to the data.
In this case, we use a simple RandomForestClassifier() and calculate the classification metrics using the test set and our predicted values.
# Define X and y
X = pd.concat([features, df.loc[:, "char_count":]], axis=1)
y = df["label"]
# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.25,
    random_state=42
)
# Train a random forest classifier
rf = RandomForestClassifier(random_state=42).fit(X_train, y_train)
# Predict the labels
y_pred = rf.predict(X_test)
# Print classification metrics
print(classification_report(y_test, y_pred))
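The classification report summarizes precision and recall per class, but it can also help to see the raw counts behind those numbers. The sketch below builds a confusion matrix from a handful of made-up labels (the "neg"/"pos" values and the toy_ names are assumptions for illustration and may not match the labels in movie_reviews.csv).

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels, for illustration only
toy_true = ["pos", "neg", "pos", "pos", "neg", "neg"]
toy_pred = ["pos", "neg", "neg", "pos", "pos", "neg"]

# Rows are true classes and columns are predicted classes, in the order given by labels
toy_cm = confusion_matrix(toy_true, toy_pred, labels=["neg", "pos"])
print(toy_cm)

# Accuracy is the share of predictions on the diagonal
toy_accuracy = toy_cm.trace() / toy_cm.sum()
print(round(toy_accuracy, 3))
```

With the fitted model, ConfusionMatrixDisplay.from_predictions(y_test, y_pred) should draw the same kind of table as a plot, assuming scikit-learn 1.0 or later.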