Welcome to the world of e-commerce, where customer feedback is a goldmine of insights! In this project, you'll dive into the Women's Clothing E-Commerce Reviews dataset, focusing on the 'Review Text' column filled with direct customer opinions.
Your mission is to use text embeddings and Python to analyze these reviews, uncover underlying themes, and understand customer sentiments. This analysis will help improve customer service and product offerings.
The Data
You will be working with a dataset specifically focusing on customer reviews. Below is the data dictionary for the relevant field:
womens_clothing_e-commerce_reviews.csv
| Column | Description |
|---|---|
| 'Review Text' | Textual feedback provided by customers about their shopping experience and product quality. |
Armed with access to powerful embedding API services, you will process the reviews, extract meaningful insights, and present your findings.
Let's get started!
Reminder:
To use the OpenAI API in Datalab, set your API key using the Environment menu:
- Go to the top menu and select Environment > Environment variables.
- Add a new variable with the name OPENAI_API_KEY and set its value to your actual OpenAI API key.
- Click Create Environment Variables to apply the changes.
Your code will now be able to access the API key from the environment variable.
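If you want to confirm the variable is visible before making any API calls, a minimal check (it does not print the key itself; the OpenAI client picks up OPENAI_API_KEY from the environment automatically):
import os
# Sanity check: fail early if the key is missing from the environment
assert os.getenv("OPENAI_API_KEY"), "OPENAI_API_KEY is not set"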
Install useful libraries
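The code below relies on openai, tiktoken, tenacity, and tqdm, plus pandas for loading the data; a minimal install cell, assuming these are not already available in your environment:
# Install the packages used in this notebook (unpinned; adjust versions as needed)
%pip install pandas openai tiktoken tenacity tqdm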
Load the dataset
Load the data and perform basic checks to make sure you are using relevant records for the analysis.
# Load the dataset
import pandas as pd
reviews = pd.read_csv("womens_clothing_e-commerce_reviews.csv")
# Display the first few entries
reviews.head()
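As a basic data check, a minimal sketch that drops rows with a missing 'Review Text' before embedding (the reset_index call is only there to keep row positions aligned with the embeddings created later):
print(reviews["Review Text"].isna().sum(), "reviews with missing text")
reviews = reviews.dropna(subset=["Review Text"]).reset_index(drop=True)
reviews["Review Text"] = reviews["Review Text"].astype(str)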
# Start coding here
# Use as many cells as you need.
Create and store the embeddings
from openai import OpenAI
from tenacity import (
retry,
wait_random_exponential,
stop_after_attempt,
retry_if_exception_type
)
import tiktoken
from tqdm import tqdm

client = OpenAI()

MODEL = "text-embedding-3-small"
MAX_TOKENS_PER_REQUEST = 8000
MAX_ATTEMPTS = 5
MIN_WAIT, MAX_WAIT = 5, 40 # seconds
BATCH_PADDING_TOKENS = 200  # safety buffer for batch token total

enc = tiktoken.encoding_for_model(MODEL)
def count_tokens(text: str) -> int:
"""Count tokens for a given text."""
return len(enc.encode(text))
def batch_by_token_limit(texts, max_tokens=MAX_TOKENS_PER_REQUEST):
    """Yield batches of texts whose combined token count stays within max_tokens."""
current_batch, token_sum = [], 0
for text in texts:
# Ensure text is a string
text_str = str(text)
tokens = count_tokens(text_str)
# start a new batch if adding this text would exceed the limit
if (token_sum + tokens + BATCH_PADDING_TOKENS) > max_tokens and current_batch:
yield current_batch
current_batch, token_sum = [], 0
current_batch.append(text_str)
token_sum += tokens
if current_batch:
yield current_batch
@retry(
wait=wait_random_exponential(min=MIN_WAIT, max=MAX_WAIT),
stop=stop_after_attempt(MAX_ATTEMPTS)
)
def get_embeddings_batch(batch):
"""
Send one batch to the API with retry and exponential backoff.
"""
response = client.embeddings.create(
model=MODEL,
input=batch
)
    return [item.embedding for item in response.data]

product_reviews = reviews['Review Text'].values.tolist()
len(product_reviews)
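One way to combine the helpers above and store the results, sketched under the assumption that you want the vectors in a new 'embedding' column of the same DataFrame:
embeddings = []
# Batch the reviews by token count, embed each batch with retries, and collect the vectors in order
for batch in tqdm(list(batch_by_token_limit(product_reviews))):
    embeddings.extend(get_embeddings_batch(batch))
reviews["embedding"] = embeddings
len(embeddings)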