Analysis of Clothing Reviews with Embeddings using OpenAI API
In this project, we will dive into a Women's Clothing Reviews dataset, focusing on the 'Review Text' column, which contains direct customer opinions.
Our mission is to use text embeddings and Python to find similarities among these reviews.
Here is the data dictionary:
| Column | Description |
|---|---|
| 'Review Text' | Textual feedback provided by customers about their shopping experience and product quality. |
| 'Class Name' | Categorical variable: the class of clothing to which the review refers. |
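Similarity between two reviews is commonly measured as the cosine similarity between their embedding vectors. The sketch below illustrates the idea on made-up 3-dimensional vectors; the `cosine_similarity` helper and the toy values are assumptions for illustration only, not part of the original notebook (real `text-embedding-ada-002` vectors have 1536 dimensions).

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for review embeddings (hypothetical values)
review_a = [0.9, 0.1, 0.0]
review_b = [0.8, 0.2, 0.1]
review_c = [0.0, 0.1, 0.9]

sim_ab = cosine_similarity(review_a, review_b)
sim_ac = cosine_similarity(review_a, review_c)

# Vectors pointing in similar directions score closer to 1
print(f"a vs b: {sim_ab:.3f}")
print(f"a vs c: {sim_ac:.3f}")
```

Here `review_a` and `review_b` point in nearly the same direction, so they score higher than `review_a` and `review_c`.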
# Initialize the API key
import os
openai_api_key = os.environ["OPENAI_API_KEY"]
Install useful libraries
# Make sure openai 1.3.0 is installed
from importlib.metadata import version, PackageNotFoundError
try:
    assert version('openai') == '1.3.0'
except (AssertionError, PackageNotFoundError):
    # Notebook shell command: install the pinned version
    !pip install openai==1.3.0
import openai
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
Load the dataset
Load data and perform basic data checks
# Load the dataset
reviews = pd.read_csv("womens_clothing_e-commerce_reviews.csv")
# Display the first few entries
reviews.head()
# Check for duplicates and missing values
display(reviews.duplicated().sum())
display(reviews.isnull().sum())
# Remove rows with missing values in the "Review Text" column
reviews.dropna(subset=["Review Text"], inplace=True)
# Extract the "Review Text" and "Class Name" columns and convert them to lists
reviews_text = reviews["Review Text"].tolist()
class_names = reviews["Class Name"].tolist()
OpenAI connection and Embedding creation
Connect to OpenAI and create a function to create embeddings from a given text
# Connect to the OpenAI API
client = openai.OpenAI(api_key=openai_api_key)
def get_embeddings(texts):
    # Create the embeddings using the model "text-embedding-ada-002"
    response = client.embeddings.create(
        model="text-embedding-ada-002",
        input=texts
    )
    # Convert the response to a dictionary
    response_dict = response.model_dump()
    # Return the embeddings
    return [data['embedding'] for data in response_dict['data']]
# Get embeddings for the reviews
embeddings = get_embeddings(reviews_text)
Decomposition using T-SNE
Use T-SNE to reduce dimensionality to 2 components
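The source does not show which t-SNE implementation it uses, so the following is only a minimal sketch assuming scikit-learn's `TSNE` (scikit-learn is not among the libraries imported above). To stay self-contained it runs on a random stand-in matrix, `fake_embeddings`, which is hypothetical data rather than real review embeddings.

```python
import numpy as np
from sklearn.manifold import TSNE

# Random stand-in for the embedding matrix: 60 "reviews" x 32 dimensions
# (hypothetical data; real ada-002 embeddings have 1536 dimensions)
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(60, 32))

# Reduce to 2 components; perplexity must be smaller than the number of samples
tsne = TSNE(n_components=2, perplexity=15, random_state=0)
embeddings_2d = tsne.fit_transform(fake_embeddings)

# One 2D point per input row
print(embeddings_2d.shape)
```

With the real data, `np.array(embeddings)` would take the place of `fake_embeddings`, and the two output columns can be passed to a matplotlib scatter plot.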