Welcome to the world of e-commerce, where customer feedback is a goldmine of insights! In this project, you'll dive into the Women's Clothing E-Commerce Reviews dataset, focusing on the 'Review Text' column filled with direct customer opinions.
Your mission is to use text embeddings and Python to analyze these reviews, uncover underlying themes, and understand customer sentiments. This analysis will help improve customer service and product offerings.
The Data
You will be working with a dataset specifically focusing on customer reviews. Below is the data dictionary for the relevant field:
womens_clothing_e-commerce_reviews.csv
| Column | Description |
|---|---|
| 'Review Text' | Textual feedback provided by customers about their shopping experience and product quality. |
Armed with access to powerful embedding API services, you will process the reviews, extract meaningful insights, and present your findings.
Let's get started!
Reminder:
To use the OpenAI API in Datalab, set your API key using the Environment menu:
- Go to the top menu and select Environment > Environment variables.
- Add a new variable with the name OPENAI_API_KEY and set its value to your actual OpenAI API key.
- Click Create Environment Variables to apply the changes.
Your code will now be able to access the API key from the environment variable.
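If you want to confirm the variable is visible before making any API calls, a minimal check (it does not print the key itself; the OpenAI client picks up OPENAI_API_KEY from the environment automatically):
import os
# Sanity check: fail early if the key is missing from the environment
assert os.getenv("OPENAI_API_KEY"), "OPENAI_API_KEY is not set"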
Install useful libraries
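The code below relies on openai, tiktoken, tenacity, and tqdm, plus pandas for loading the data; a minimal install cell, assuming these are not already available in your environment:
# Install the packages used in this notebook (unpinned; adjust versions as needed)
%pip install pandas openai tiktoken tenacity tqdm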
Load the dataset
Load the data and perform basic checks to make sure you are using relevant records for the analysis.
# Load the dataset
import pandas as pd
reviews = pd.read_csv("womens_clothing_e-commerce_reviews.csv")
# Display the first few entries
reviews.head()
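As a basic data check, a minimal sketch that drops rows with a missing 'Review Text' before embedding (the reset_index call is only there to keep row positions aligned with the embeddings created later):
print(reviews["Review Text"].isna().sum(), "reviews with missing text")
reviews = reviews.dropna(subset=["Review Text"]).reset_index(drop=True)
reviews["Review Text"] = reviews["Review Text"].astype(str)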
# Start coding here
# Use as many cells as you need.
Create and store the embeddings
from openai import OpenAI
from tenacity import (
retry,
wait_random_exponential,
stop_after_attempt,
retry_if_exception_type
)
import tiktoken
from tqdm import tqdm

client = OpenAI()

MODEL = "text-embedding-3-small"
MAX_TOKENS_PER_REQUEST = 8000
MAX_ATTEMPTS = 5
MIN_WAIT, MAX_WAIT = 5, 40 # seconds
BATCH_PADDING_TOKENS = 200  # safety buffer for batch token total

enc = tiktoken.encoding_for_model(MODEL)
def count_tokens(text: str) -> int:
"""Count tokens for a given text."""
return len(enc.encode(text))
def batch_by_token_limit(texts, max_tokens=MAX_TOKENS_PER_REQUEST):
    """Yield batches of texts whose combined token count stays within max_tokens."""
current_batch, token_sum = [], 0
for text in texts:
# Ensure text is a string
text_str = str(text)
tokens = count_tokens(text_str)
# start a new batch if adding this text would exceed the limit
if (token_sum + tokens + BATCH_PADDING_TOKENS) > max_tokens and current_batch:
yield current_batch
current_batch, token_sum = [], 0
current_batch.append(text_str)
token_sum += tokens
if current_batch:
yield current_batch
@retry(
wait=wait_random_exponential(min=MIN_WAIT, max=MAX_WAIT),
stop=stop_after_attempt(MAX_ATTEMPTS)
)
def get_embeddings_batch(batch):
"""
Send one batch to the API with retry and exponential backoff.
"""
response = client.embeddings.create(
model=MODEL,
input=batch
)
    return [item.embedding for item in response.data]

product_reviews = reviews['Review Text'].values.tolist()
len(product_reviews)
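One way to combine the helpers above and store the results, sketched under the assumption that you want the vectors in a new 'embedding' column of the same DataFrame:
embeddings = []
# Batch the reviews by token count, embed each batch with retries, and collect the vectors in order
for batch in tqdm(list(batch_by_token_limit(product_reviews))):
    embeddings.extend(get_embeddings_batch(batch))
reviews["embedding"] = embeddings
len(embeddings)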