Topic Analysis of Clothing Reviews with Embeddings
Analyze clothing reviews on an e-commerce platform to explore different topics and similarities among them.
Project Description
In the competitive world of e-commerce, understanding customer feedback is vital for business success. This project focuses on analyzing reviews from the Women's Clothing E-Commerce Reviews dataset. The challenge is to process and analyze text data, categorize feedback into themes, and perform semantic and similarity searches to gain a deeper understanding of customer sentiments. The project will provide invaluable skills in data processing, natural language understanding, and leveraging AI for practical business applications.
Welcome to the world of e-commerce, where customer feedback is a goldmine of insights! In this project, you'll dive into the Women's Clothing E-Commerce Reviews dataset, focusing on the 'Review Text' column filled with direct customer opinions.
Your mission is to use text embeddings and Python to analyze these reviews, uncover underlying themes, and understand customer sentiments. This analysis will help improve customer service and product offerings.
The Data
You will be working with a dataset specifically focusing on customer reviews. Below is the data dictionary for the relevant field:
womens_clothing_e-commerce_reviews.csv
| Column | Description |
|---|---|
| 'Review Text' | Textual feedback provided by customers about their shopping experience and product quality. |
Armed with access to powerful embedding API services, you will process the reviews, extract meaningful insights, and present your findings.
Let's get started!
Before you start
In order to complete the project you will need to create a developer account with OpenAI and store your API key as a secure environment variable. Instructions for these steps are outlined below.
Create a developer account with OpenAI
- Go to the API signup page.
- Create your account (you'll need to provide your email address and your phone number).
- Go to the API keys page.
- Create a new secret key.
- Take a copy of it. (If you lose it, delete the key and create a new one.)
Add a payment method
OpenAI sometimes provides free credits for the API, but this can vary depending on geography. You may need to add debit/credit card details.
This project should cost much less than 1 US cent with gpt-4o-mini (but if you rerun tasks, you will be charged each time).
- Go to the Payment Methods page.
- Click Add payment method.
- Fill in your card details.
Add an environment variable with your OpenAI key
- In the workbook, click on "Environment," in the top toolbar and select "Environment variables".
- Click "Add" to add environment variables.
- In the "Name" field, type "OPENAI_API_KEY". In the "Value" field, paste in your secret key.
- Click "Create". In the pop-up window that appears, click "Connect," then wait 5-10 seconds for the kernel to restart, or restart it manually in the Run menu.
Update to Python 3.10
Due to how frequently the libraries required for this project are updated, you'll need to update your environment to Python 3.10:
- In the workbook, click on "Environment," in the top toolbar and select "Session details".
- In the workbook language dropdown, select "Python 3.10".
- Click "Confirm" and hit "Done" once the session is ready.
Load OpenAI API key from environment variables
These variables can be referenced globally throughout the project while keeping their values secret, which makes them a good place to store passwords and other credentials.
- Create and store the embeddings
  - Embed the reviews using a suitable text embedding algorithm and store them as a list in the variable `embeddings`.
- Dimensionality reduction & visualization
  - Apply an appropriate dimensionality reduction technique to reduce the `embeddings` to a 2-dimensional numpy array and store this array in the variable `embeddings_2d`.
  - Then, use this variable to plot a 2D visual representation of the reviews.
- Feedback categorization
  - Use your embeddings to identify some reviews that discuss topics such as 'quality', 'fit', 'style', 'comfort', etc.
- Similarity search function
  - Write a function that outputs the closest 3 reviews to a given input review, enabling a more personalized customer service response.
  - Apply this function to the first review "Absolutely wonderful - silky and sexy and comfortable", and store the output as a list in the variable `most_similar_reviews`.
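The categorization and similarity-search tasks both come down to ranking vectors by distance. The block below is a minimal sketch of that idea, not the project's reference solution: the helper names (`cosine_distance`, `find_closest_reviews`, `closest_topic`) are hypothetical, and the 3-dimensional vectors are made-up stand-ins for real embeddings, which would come from an embedding model and have many more dimensions.

```python
import numpy as np


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def find_closest_reviews(query_vector, review_vectors, review_texts, n=3):
    """Return the n review texts whose vectors are nearest to query_vector."""
    distances = [cosine_distance(query_vector, v) for v in review_vectors]
    nearest = np.argsort(distances)[:n]  # indices sorted by increasing distance
    return [review_texts[i] for i in nearest]


def closest_topic(review_vector, topic_vectors):
    """Return the topic name whose vector is nearest to review_vector."""
    return min(
        topic_vectors,
        key=lambda name: cosine_distance(review_vector, topic_vectors[name]),
    )


# Toy data (assumption): invented texts and vectors, for illustration only.
toy_texts = ["soft and comfy", "runs small", "silky and comfortable", "colour faded"]
toy_vectors = [
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.1],
    [1.0, 0.0, 0.1],
    [0.1, 0.0, 1.0],
]
toy_topics = {
    "comfort": [1.0, 0.0, 0.0],
    "fit": [0.0, 1.0, 0.0],
    "quality": [0.0, 0.0, 1.0],
}

closest_two = find_closest_reviews([1.0, 0.05, 0.0], toy_vectors, toy_texts, n=2)
assigned_topics = [closest_topic(v, toy_topics) for v in toy_vectors]
print(closest_two)
print(assigned_topics)
```

With real data, the topic vectors would be embeddings of the topic words themselves, produced by the same model as the review embeddings. Note that if the query review is also part of the searched collection, it will be returned as its own nearest neighbour unless it is filtered out.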
How to approach the project
- Generating the embeddings
- Reducing the dimensionality and visualizing in a plot
- Categorizing the reviews into topics
- Finding similar reviews
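As a rough illustration of the dimensionality-reduction step, the sketch below projects high-dimensional vectors to two dimensions with scikit-learn's `TSNE`, which this notebook imports. The random vectors are an assumption standing in for real review embeddings, and the `perplexity` value is chosen only because it must stay below the number of samples; it is not a tuned setting.

```python
import numpy as np
from sklearn.manifold import TSNE

# Assumption: random vectors stand in for real embeddings
# (30 "reviews", 16 dimensions each).
rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(30, 16))

# perplexity must be smaller than the number of samples.
tsne = TSNE(n_components=2, perplexity=5, random_state=0)
embeddings_2d = tsne.fit_transform(toy_embeddings)

print(embeddings_2d.shape)  # (30, 2)
```

The two columns of `embeddings_2d` can then be passed to a scatter plot, for example `plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1])` with matplotlib.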
# Initialize your API key
import os
openai_api_key = os.environ["OPENAI_API_KEY"]

Install useful libraries
import openai
from openai import OpenAI
import pandas as pd
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import seaborn as sns
import chromadb
from scipy.spatial import distance
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction

# Update OpenAI to 1.3
from importlib.metadata import version
try:
    assert version('openai') == '1.3.0'
except Exception:
    # Reached when the package is missing or a different version is installed
    !pip install openai==1.3.0
import openai

# Run this cell to install ChromaDB if desired
try:
    assert version('chromadb') == '0.4.17'
except Exception:
    !pip install chromadb==0.4.17
try:
    # The distribution is published as pysqlite3-binary; the module is pysqlite3
    assert version('pysqlite3-binary') == '0.5.2'
except Exception:
    !pip install pysqlite3-binary==0.5.2
# Swap the standard sqlite3 module for pysqlite3 before importing chromadb
__import__('pysqlite3')
import sys
sys.modules['sqlite3'] = sys.modules.pop('pysqlite3')
import chromadb

Load the dataset
Load the data and perform basic checks to ensure you are using relevant data for the analysis.
# Load the dataset
import pandas as pd
reviews = pd.read_csv("womens_clothing_e-commerce_reviews.csv")
# Display the first few entries
reviews.head()

# View data description
def display_data_info(df):
    df.info()  # prints its summary directly and returns None
    display(df.isna().sum())
# Call the function with the reviews dataframe
display_data_info(reviews)

1. Generating the embeddings
You can use the OpenAI API to generate an embedding for a given text input.
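A minimal sketch of such a request is shown below, assuming the `openai` 1.x client interface (`client.embeddings.create`). The helper name `embed_texts` is hypothetical, and the model name `text-embedding-3-small` is an assumption; substitute whichever embedding model your account uses. The real call is left commented out because it needs a valid `OPENAI_API_KEY` and is billed.

```python
def embed_texts(client, texts, model="text-embedding-3-small"):
    """Request one embedding per input text and return them as a list of lists.

    `client` is expected to behave like an `openai.OpenAI` instance.
    The default model name is an assumption, not taken from this project.
    """
    response = client.embeddings.create(model=model, input=texts)
    return [item.embedding for item in response.data]


# Usage against the real service (requires OPENAI_API_KEY, incurs a small charge):
# from openai import OpenAI
# client = OpenAI()
# review_texts = reviews["Review Text"].dropna().tolist()
# embeddings = embed_texts(client, review_texts)
```

Dropping missing values first matters because the 'Review Text' column may contain empty entries, and the endpoint expects string input.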