Sentiment Analysis and Prediction
    Sentiment analysis is the process of determining the opinion an author expresses about a subject. Examples include analyzing movie ratings, Amazon product reviews, or the sentiment of tweets.

    For the purposes of this analysis we will:

    • Explore our data
    • Transform sentiment-carrying columns
    • Predict sentiment with a supervised machine learning model
    !pip install wordcloud
    # Imports
    import matplotlib.pyplot as plt
    import pandas as pd
    import numpy as np
    import nltk
    from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
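    from sklearn.feature_extraction.text import (
        TfidfVectorizer,
        ENGLISH_STOP_WORDS,
    )
    from wordcloud import WordCloud
    from functools import reduce
    from nltk import word_tokenize

    # Download the tokenizer data used by word_tokenize
    # (newer NLTK releases may ask for "punkt_tab" instead)
    nltk.download("punkt")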

    1. Load your data

    Upload data that contains a text column and a sentiment label (0 = negative, 1 = positive).

    # Upload your data as CSV and load as a data frame
    df = pd.read_csv('reviews.csv',index_col=0)
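
    As a rough sketch of the layout the rest of this notebook relies on, the snippet below builds a tiny stand-in for reviews.csv and runs a few sanity checks on it. The column names "score" and "review" come from the code in the following steps; the sample rows and the helper check_sentiment_frame are hypothetical and only there for illustration.

```python
import pandas as pd

# Hypothetical stand-in for reviews.csv: one text column and one binary label
sample_df = pd.DataFrame(
    {
        "score": [1, 0, 1, 0],
        "review": [
            "Loved it, would buy again",
            "Broke after two days",
            "Great value for the price",
            "Terrible customer service",
        ],
    }
)


def check_sentiment_frame(frame, text_col="review", label_col="score"):
    """Summarize the frame and raise if the expected columns are unusable."""
    missing = [col for col in (text_col, label_col) if col not in frame.columns]
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    if not set(frame[label_col].unique()) <= {0, 1}:
        raise ValueError("Labels must be 0 (negative) or 1 (positive)")
    return {
        "n_rows": len(frame),
        "n_missing_text": int(frame[text_col].isna().sum()),
        "class_counts": frame[label_col].value_counts().to_dict(),
    }


summary = check_sentiment_frame(sample_df)
print(summary)
```

    Running the same check on your own data frame before the later steps can surface missing text or unexpected label values early.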

    2. Word cloud and feature creation

    Visualize words that carry meaning with a word cloud

    positive_df = df[df["score"] == 1]["review"][:100]  # 1 = positive, 0 = negative
    # Join the reviews with a space so words at review boundaries do not fuse together
    positive_df = reduce(lambda a, b: a + " " + b, positive_df)
    # Create and generate a word cloud image
    cloud_positives = WordCloud(background_color="white").generate(positive_df)
    # Display the generated wordcloud image
    plt.imshow(cloud_positives, interpolation="bilinear")
    plt.title("Most frequent words in 100 positive reviews", y=1.02, size=14)  # Choose title, position and size
    plt.axis("off")  # Turn off axis labels
    # Don't forget to show the final image
    plt.show()
    # Tokenize each item in the review column
    word_tokens = [word_tokenize(review) for review in df["review"]]
    # Create a new feature for the length of each review
    df["n_words"] = [len(tokens) for tokens in word_tokens]
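
    One way to explore a length feature like n_words is to compare its average across the two sentiment classes. The sketch below does this on made-up rows, and it uses a plain whitespace split as a rough stand-in for word_tokenize so that it runs without downloading NLTK data; the counts will differ slightly from the tokenizer's.

```python
import pandas as pd

# Made-up reviews for illustration only
toy_df = pd.DataFrame(
    {
        "score": [1, 1, 0, 0],
        "review": [
            "Great phone",
            "Really love the screen",
            "Bad battery",
            "Stopped working after a week",
        ],
    }
)

# Whitespace split approximates word_tokenize (punctuation is not separated)
toy_df["n_words"] = toy_df["review"].str.split().str.len()

# Average review length per sentiment class
mean_length_by_score = toy_df.groupby("score")["n_words"].mean()
print(mean_length_by_score)
```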

    3. Building a vectorizer

    Use the TF-IDF vectorizer (TfidfVectorizer) to transform the review text into numerical features that can be used to make predictions.

    # Build the vectorizer
    vect = TfidfVectorizer(
        stop_words=list(ENGLISH_STOP_WORDS),  # Default list of English stop words
        ngram_range=(1, 2),  # Consider uni- and bigrams
        max_features=200,  # Max number of features
        token_pattern=r"\b[^\d\W][^\d\W]+\b",  # Capture only words using this pattern
    )
    # Learn the vocabulary and IDF weights from the reviews
    vect.fit(df["review"])
    # Create sparse matrix from the vectorizer
    X = vect.transform(df["review"])
    # Create a DataFrame (reuse df's index so the later concat aligns row by row)
    df_transformed = pd.DataFrame(data=X.toarray(), columns=vect.get_feature_names_out(), index=df.index)

    4. Building a classifier

    Use logistic regression to predict the sentiment of unseen data. Visualize the errors your classifier makes with a confusion matrix.
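
    As a small worked example of how to read a confusion matrix, the sketch below uses hand-written labels rather than model output. In scikit-learn's layout, rows are true classes and columns are predicted classes, so for 0/1 labels the matrix is [[TN, FP], [FN, TP]]. Dividing by the number of samples turns counts into proportions, and the diagonal then sums to the accuracy.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hand-written labels for illustration (not produced by any model)
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 1, 0, 1, 0])

counts = confusion_matrix(y_true, y_pred)  # [[TN, FP], [FN, TP]]
proportions = counts / len(y_true)  # Share of all samples in each cell
accuracy = accuracy_score(y_true, y_pred)

print(counts)  # [[3 1] [1 3]]
print(proportions)
print(accuracy, np.trace(proportions))  # Both are 0.75
```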

    # Drop the raw text and length columns, then attach the TF-IDF features
    dropped = df.drop(["review", "n_words"], axis=1)
    transformed = pd.concat([dropped, df_transformed], axis=1)
    # Define X and y
    y = transformed["score"]
    X = transformed.drop("score", axis=1)
    # Train/test split
    X_train, X_test, y_train, y_test = train_test_split(
        X,
        y,
        test_size=0.2,  # Set size of test_set
        random_state=456,  # Random seed for reproducibility
    )
    # Train a logistic regression
    log_reg = LogisticRegression().fit(X_train, y_train)
    # Predict the labels
    y_predicted = log_reg.predict(X_test)
    # Print accuracy score and confusion matrix on test set
    print("Accuracy on the test set: ", accuracy_score(y_test, y_predicted))
    print(confusion_matrix(y_test, y_predicted) / len(y_test))
    ConfusionMatrixDisplay.from_estimator(log_reg, X_test, y_test, normalize="all")
    plt.title("Confusion Matrix", y=1.02, size=14)
    plt.show()
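
    One possible variant, not part of the walkthrough above, is to wrap the vectorizer and the classifier in a scikit-learn pipeline. The vocabulary is then learned from the training split only, and raw text can be scored directly. The toy reviews, the vectorizer settings, and the name sentiment_pipeline are assumptions made for this sketch.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Made-up labelled reviews (1 = positive, 0 = negative)
texts = [
    "great product works well",
    "love it, excellent quality",
    "excellent value, very happy",
    "works great, highly recommend",
    "awful quality, broke immediately",
    "terrible, waste of money",
    "broke after one day, very disappointed",
    "bad product, do not buy",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

X_train, X_test, y_train, y_test = train_test_split(
    texts,
    labels,
    test_size=0.25,
    random_state=456,
    stratify=labels,  # Keep both classes in train and test
)

# The vectorizer is fitted inside the pipeline, on training text only
sentiment_pipeline = make_pipeline(
    TfidfVectorizer(stop_words="english", ngram_range=(1, 2)),
    LogisticRegression(),
)
sentiment_pipeline.fit(X_train, y_train)

# Score unseen raw text without any manual transformation step
new_reviews = ["great product, works well", "awful, broke immediately"]
predictions = sentiment_pipeline.predict(new_reviews)
probabilities = sentiment_pipeline.predict_proba(new_reviews)
print(predictions)
print(probabilities)
```

    With a data set this small the predictions are not meaningful; the point is only the shape of the workflow.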