
Sentiment Analysis and Prediction

Sentiment analysis is the process of determining an author's opinion about a subject. Examples include analyzing movie reviews, Amazon product reviews, or the sentiment of tweets.

For the purposes of this analysis we will:

  • Explore our data
  • Transform sentiment carrying columns
  • Predict sentiment with a supervised machine learning model
%%capture
!pip install wordcloud
# Imports
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import nltk
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import (
    TfidfVectorizer,
    CountVectorizer,
    ENGLISH_STOP_WORDS,
)
from wordcloud import WordCloud
from nltk import word_tokenize

nltk.download("punkt")

1. Load your data

Upload data that contains a text column and a sentiment label (0 = negative, 1 = positive).

# Upload your data as CSV and load as a data frame
df = pd.read_csv("reviews.csv", index_col=0)
df
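
If you don't have a reviews.csv at hand, a minimal stand-in with the same schema works just as well for following along; the rows below are invented for illustration:

# Hypothetical stand-in data with the expected schema
df = pd.DataFrame(
    {
        "review": [
            "Loved this film, the acting was superb",
            "Complete waste of time, the plot made no sense",
            "A delightful movie with a great soundtrack",
            "Dull, predictable and far too long",
        ],
        "score": [1, 0, 1, 0],  # 1 = positive, 0 = negative
    }
)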

2. Word cloud and feature creation

Visualize the most frequent words in positive reviews with a word cloud, then create a simple review-length feature.

positive_df = df[df["score"] == 1]["review"][:100]  # 1 = positive, 0 = negative
positive_df = " ".join(positive_df)  # Join with spaces so words don't fuse at review boundaries

# Create and generate a word cloud image
cloud_positives = WordCloud(background_color="white").generate(positive_df)

# Display the generated wordcloud image
plt.imshow(cloud_positives, interpolation="bilinear")
plt.title("Top 100 positive words", y=1.02, size=14)  # Choose title, position and size
plt.axis("off")  # Turn off axis labels

# Don't forget to show the final image
plt.show()
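
The same recipe applies to the negative class; this sketch assumes the same review and score columns:

# Word cloud for the first 100 negative reviews (score == 0)
negative_text = " ".join(df[df["score"] == 0]["review"][:100])

cloud_negatives = WordCloud(background_color="white").generate(negative_text)
plt.imshow(cloud_negatives, interpolation="bilinear")
plt.title("Word cloud of the first 100 negative reviews", y=1.02, size=14)
plt.axis("off")
plt.show()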
# Tokenize each item in the review column
word_tokens = [word_tokenize(review) for review in df["review"]]

# Create a new feature for the length of each review
df["n_words"] = [len(tokens) for tokens in word_tokens]

df
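
As a quick exploratory check, the new feature lets you compare review lengths across the two classes; a one-line sketch:

# Average review length per sentiment class (0 = negative, 1 = positive)
df.groupby("score")["n_words"].mean()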

3. Building a vectorizer

Use the TfidfVectorizer to transform the text into numerical features that can be used to make predictions.

# Build the vectorizer
vect = TfidfVectorizer(
    stop_words=ENGLISH_STOP_WORDS,  # scikit-learn's built-in English stop word list
    ngram_range=(1, 2),  # Consider uni- and bi-grams
    max_features=200,  # Maximum number of features
    token_pattern=r"\b[^\d\W][^\d\W]+\b",  # Only tokens of two or more letters, no digits
)

vect.fit(df.review)

# Create sparse matrix from the vectorizer
X = vect.transform(df.review)

# Create a DataFrame
df_transformed = pd.DataFrame(data=X.toarray(), columns=vect.get_feature_names_out())
df_transformed
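
To get a feel for what the vectorizer produced, you can peek at the heaviest-weighted terms of a single review; a small exploratory sketch:

# Ten terms with the largest TF-IDF weight in the first review
df_transformed.iloc[0].nlargest(10)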

4. Building a classifier

Use a logistic regression to predict the sentiment of unseen data. Visualize the errors your classifier makes with a confusion matrix.

# Keep only the label column, then join it with the TF-IDF features
dropped = df.drop(["review", "n_words"], axis=1)
transformed = pd.concat([dropped, df_transformed], axis=1)
transformed
# Define X and y
y = transformed["score"]
X = transformed.drop("score", axis=1)

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,  # Hold out 20% of the data for testing
    random_state=456,  # Random seed for reproducibility
)

# Train a logistic regression
log_reg = LogisticRegression().fit(X_train, y_train)

# Predict the labels
y_predicted = log_reg.predict(X_test)

# Print accuracy score and confusion matrix on test set
print("Accuracy on the test set: ", accuracy_score(y_test, y_predicted))
print(confusion_matrix(y_test, y_predicted) / len(y_test))
ConfusionMatrixDisplay.from_estimator(log_reg, X_test, y_test, normalize="all")
plt.title("Confuson Matrix", y=1.02, size=14)
plt.show()
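
To score genuinely unseen text, reuse the fitted vectorizer so new reviews are mapped onto the same 200 features the model was trained on. The two reviews below are invented for illustration:

# Hypothetical unseen reviews
new_reviews = [
    "An absolute joy to watch, I would happily see it again",
    "Boring and clumsy, I walked out halfway through",
]

# Transform with the fitted vectorizer, keeping the training column names
X_new = pd.DataFrame(
    data=vect.transform(new_reviews).toarray(),
    columns=vect.get_feature_names_out(),
)

print(log_reg.predict(X_new))  # Predicted labels (0 = negative, 1 = positive)
print(log_reg.predict_proba(X_new))  # Class probabilities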