Sentiment Analysis and Prediction
Sentiment analysis is the process of understanding an author's opinion about a subject. Examples include analyzing movie ratings, Amazon product reviews, or the sentiment of tweets.
For the purposes of this analysis we will:
- Explore our data
- Transform sentiment carrying columns
- Predict sentiment with a supervised machine learning model
%%capture
!pip install wordcloud
# Imports
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import nltk
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import (
    TfidfVectorizer,
    CountVectorizer,
    ENGLISH_STOP_WORDS,
)
from wordcloud import WordCloud
from nltk import word_tokenize
nltk.download("punkt")
1. Load your data
Upload data that contains review text and a sentiment label (0 = negative, 1 = positive)
# Upload your data as CSV and load as a data frame
df = pd.read_csv('reviews.csv',index_col=0)
df
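If you don't have a reviews.csv at hand, a minimal stand-in with the same column names used throughout this notebook (review for the text, score for the label) can be built inline. The rows below are made up purely for experimentation:
# Hypothetical stand-in data with the same columns (review, score)
df = pd.DataFrame({
    "review": [
        "Absolutely loved this movie, great acting",
        "Terrible plot and wooden dialogue",
        "A wonderful, heartwarming story",
        "Waste of time, would not recommend",
    ],
    "score": [1, 0, 1, 0],  # 1 = positive, 0 = negative
})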
2. Word cloud and feature creation
Visualize words that carry meaning with a word cloud
positive_df = df[df["score"] == 1]["review"][:100] # 1 = positive, 0 = negative
positive_df = " ".join(positive_df) # Join with spaces so adjacent reviews don't run together
# Create and generate a word cloud image
cloud_positives = WordCloud(background_color="white").generate(positive_df)
# Display the generated wordcloud image
plt.imshow(cloud_positives, interpolation="bilinear")
plt.title("Word cloud of the first 100 positive reviews", y=1.02, size=14) # Choose title, position and size
plt.axis("off") # Turn off axis labels
# Don't forget to show the final image
plt.show()
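The defaults work well, but WordCloud exposes a few knobs worth knowing. A sketch, assuming you want to filter stop words and cap the vocabulary (stopwords and max_words are standard WordCloud parameters):
# Optional: filter stop words and limit the number of words shown
cloud_tuned = WordCloud(
    background_color="white",
    stopwords=ENGLISH_STOP_WORDS,  # Reuse sklearn's stop word list
    max_words=50,                  # Keep only the 50 most frequent words
).generate(positive_df)
plt.imshow(cloud_tuned, interpolation="bilinear")
plt.axis("off")
plt.show()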
# Tokenize each item in the review column
word_tokens = [word_tokenize(review) for review in df["review"]]
# Create a new feature for the length of each review
df["n_words"] = [len(tokens) for tokens in word_tokens]
df
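An optional sanity check: if review length differs systematically between classes, n_words may itself carry signal. A quick sketch using standard pandas:
# Compare review lengths across sentiment classes
df.groupby("score")["n_words"].describe()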
3. Building a vectorizer
Use the TfidfVectorizer to transform the text into numerical features that can be used to make predictions.
# Build the vectorizer
vect = TfidfVectorizer(
    stop_words=ENGLISH_STOP_WORDS,  # Default list of English stop words
    ngram_range=(1, 2),  # Consider uni- and bi-grams
    max_features=200,  # Max number of features
    token_pattern=r"\b[^\d\W][^\d\W]+\b",  # Capture only words using this pattern
)
vect.fit(df.review)
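To see what the token pattern keeps, you can run it directly with re. The example string below is made up, but the pattern is the one passed to the vectorizer: it keeps runs of two or more letters, dropping digits and single characters:
import re
# The same pattern as above: words of two or more non-digit characters
re.findall(r"\b[^\d\W][^\d\W]+\b", "Great movie, 10/10 - I loved it!")
# ['Great', 'movie', 'loved', 'it']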
# Create sparse matrix from the vectorizer
X = vect.transform(df.review)
# Create a DataFrame
df_transformed = pd.DataFrame(data=X.toarray(), columns=vect.get_feature_names_out())
df_transformed
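The fitted vectorizer can also transform text it has never seen; only tokens from the learned vocabulary receive non-zero weights. A short sketch with a made-up review:
# Transform a new, unseen review with the already-fitted vectorizer
new_review = ["The movie was absolutely wonderful"]
pd.DataFrame(vect.transform(new_review).toarray(), columns=vect.get_feature_names_out())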
4. Building a classifier
Use a logistic regression to predict the sentiment of unseen data. Visualize the errors your classifier makes with a confusion matrix.
# Reset the index so it aligns with df_transformed's default RangeIndex
dropped = df.drop(["review", "n_words"], axis=1).reset_index(drop=True)
transformed = pd.concat([dropped, df_transformed], axis=1)
transformed
# Define X and y
y = transformed["score"]
X = transformed.drop("score", axis=1)
# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,  # Size of the test set
    random_state=456,  # Random seed for reproducibility
)
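Before trusting accuracy, it helps to check the class balance: the majority-class share is the baseline any model must beat. A quick sketch using standard pandas:
# Check the class balance; the majority share is the accuracy baseline
y_train.value_counts(normalize=True)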
# Train a logistic regression
log_reg = LogisticRegression().fit(X_train, y_train)
# Predict the labels
y_predicted = log_reg.predict(X_test)
# Print accuracy score and confusion matrix on test set
print("Accuracy on the test set: ", accuracy_score(y_test, y_predicted))
print(confusion_matrix(y_test, y_predicted) / len(y_test))
ConfusionMatrixDisplay.from_estimator(log_reg, X_test, y_test, normalize="all")
plt.title("Confuson Matrix", y=1.02, size=14)
plt.show()
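Since logistic regression is a linear model, its learned weights are directly interpretable: large positive coefficients push predictions toward the positive class, large negative ones toward the negative class. A sketch for inspecting the most polar n-grams (coef_ is the standard scikit-learn attribute for fitted linear models):
# Rank n-grams by their learned weight; the extremes are the most polar terms
weights = pd.Series(log_reg.coef_[0], index=X.columns)
print("Most negative n-grams:\n", weights.sort_values().head(10))
print("Most positive n-grams:\n", weights.sort_values().tail(10))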