Analyzing Car Reviews with LLMs


Car-ing is sharing, an auto dealership company for car sales and rentals, is taking its services to the next level thanks to Large Language Models (LLMs).

As their newly recruited AI and NLP developer, you've been asked to prototype a chatbot app with multiple functionalities that not only assist customers but also provide support to human agents in the company.

The solution should receive textual prompts and use a variety of pre-trained Hugging Face LLMs to handle a series of tasks, such as classifying the sentiment of a car review, answering a customer question, and summarizing or translating text.

# Import necessary packages
import pandas as pd

from transformers import logging, pipeline

import evaluate

logging.set_verbosity(logging.WARNING)

Car Reviews


In this section we will simply load five previously collected car reviews.

# Load the car reviews dataset
file_path = "data/car_reviews.csv"
df = pd.read_csv(file_path, sep=";")

display(df)

Classify Car Reviews


In this section we will build a sentiment classifier, leveraging the distilbert-base-uncased-finetuned-sst-2-english model available on Hugging Face.

The classifier will predict whether the sentiment of each review is positive or negative. Then, we will compute the accuracy and F1-score metrics.

Here is a quick summary of the math behind the most common metrics for a classification task, where:

TP = True Positives

TN = True Negatives

FP = False Positives

FN = False Negatives

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\text{Precision} = \frac{TP}{TP + FP} \qquad \text{Recall} = \frac{TP}{TP + FN}$$

$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
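
To make these formulas concrete, here is a minimal sketch that computes accuracy and F1 directly from confusion counts. The counts below are made up purely for illustration; further down we compute the same metrics on the actual predictions with the evaluate library.

# Minimal sketch: accuracy and F1 from (made-up) confusion counts
tp, tn, fp, fn = 4, 3, 1, 2  # hypothetical values, for illustration only

accuracy_manual = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1_manual = 2 * precision * recall / (precision + recall)

print(f"Accuracy: {accuracy_manual:.2f}, F1: {f1_manual:.2f}")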

# Put the car reviews and their associated sentiment labels in two lists
reviews = df['Review'].tolist()
target = df['Class'].tolist()


# Load a sentiment analysis LLM into a pipeline
classifier = pipeline(task='sentiment-analysis', model='distilbert-base-uncased-finetuned-sst-2-english')

# Perform inference on the car reviews and display prediction results
predicted_labels = classifier(reviews)
for review, prediction, label in zip(reviews, predicted_labels, target):
    print(f"Review: {review}\nActual Sentiment: {label}\nPredicted Sentiment: {prediction['label']} (Confidence: {prediction['score']:.4f})\n")

# Load accuracy and F1 score metrics    
accuracy = evaluate.load("accuracy")
f1 = evaluate.load("f1")

# Map categorical sentiment labels into integer labels
references = [1 if label == "POSITIVE" else 0 for label in target]
predictions = [1 if label['label'] == "POSITIVE" else 0 for label in predicted_labels]

# Compute accuracy and F1 score
accuracy_result_dict = accuracy.compute(references=references, predictions=predictions)
accuracy_result = round(accuracy_result_dict['accuracy'],2)
f1_result_dict = f1.compute(references=references, predictions=predictions)
f1_result = round(f1_result_dict['f1'],2)
print(f"Accuracy: {accuracy_result}")
print(f"F1 result: {f1_result}")

Keeping in mind that we only have five data points, both accuracy and F1 are fairly high, with F1 higher than accuracy.

This suggests that the model is performing well overall and is particularly effective at correctly identifying positive instances. As a purely illustrative example of how this can happen, if four positive reviews are all classified correctly while a single negative review is misclassified as positive, accuracy is 0.80 but F1 is about 0.89: a sample dominated by correctly predicted positives pushes F1 above accuracy.

Translate a Car Review


In this section we will build a translator from English to Spanish using the Helsinki-NLP/opus-mt-en-es model, available on Hugging Face.

We will then use a couple of references stored in a text file to compute the BLEU metric.

BLEU (Bilingual Evaluation Understudy) is a metric used to evaluate the quality of machine-translated text by comparing it to one or more reference translations. Its mathematical formulation is fairly involved, but in essence it looks like this:

$$\mathrm{BLEU} = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$$

where:

  • $BP$ = Brevity Penalty

  • $w_n$ = weight assigned to the n-gram order $n$ (often uniform: $w_n = 1/N$)

  • $p_n$ = modified precision for n-grams of order $n$

  • $N$ = maximum n-gram order considered

$BP$ is defined as follows:

$$BP = \begin{cases} 1 & \text{if } c > r \\ e^{\,1 - r/c} & \text{if } c \le r \end{cases}$$

where:

  • $c$ = total length of the candidate translations

  • $r$ = total length of the reference translations

$p_n$ is defined as follows:

$$p_n = \frac{\sum_{g \,\in\, \text{n-grams}(C)} \min\!\big(\mathrm{Count}_C(g),\ \max_{R \in \mathcal{R}} \mathrm{Count}_R(g)\big)}{\sum_{g \,\in\, \text{n-grams}(C)} \mathrm{Count}_C(g)}$$

where:

  • $C$ = the candidate sentence

  • $\mathcal{R}$ = the set of reference sentences

  • $\mathrm{Count}_C(g)$ = number of times the n-gram $g$ appears in the candidate

  • $\mathrm{Count}_R(g)$ = number of times the n-gram $g$ appears in a reference
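
To make the modified n-gram precision concrete, here is a minimal sketch that computes the clipped unigram precision ($p_1$) for a toy candidate/reference pair. The sentences are made up for illustration; in the next cells we let the evaluate library handle the full BLEU computation.

# Minimal sketch: clipped unigram precision (p_1) on a toy example
from collections import Counter

candidate = "el coche es muy bueno".split()       # made-up candidate translation
reference = "el coche es bueno y fiable".split()  # made-up reference translation

cand_counts = Counter(candidate)
ref_counts = Counter(reference)

# Clip each candidate unigram count by its count in the reference, then normalize
clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
p1 = clipped / sum(cand_counts.values())
print(f"Clipped unigram precision: {p1:.2f}")  # 4 of 5 candidate tokens appear in the reference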

# Load translation LLM into a pipeline and translate car review
first_review = reviews[0]
translator = pipeline(task="translation", model="Helsinki-NLP/opus-mt-en-es")
translated_review = translator(first_review, max_length=27)[0]['translation_text']
print(f"Model translation:\n{translated_review}")

# Load reference translations from text file
with open("data/reference_translations.txt", 'r') as file:
    lines = file.readlines()
    file.close()
references = [line.strip() for line in lines]
print(f"Spanish translation references:\n{references}")

# Load and calculate BLEU score metric
bleu = evaluate.load("bleu")
bleu_score = bleu.compute(predictions=[translated_review], references=[references])
print(f"\nBLEU: {round(bleu_score['bleu'],3)}")

The BLEU score ranges from 0 to 1. The obtained value indicates a high level of similarity between the model's translation and the reference translations.

This suggests that the translation model is performing well.
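
To relate the score back to the formula above, the dictionary returned by the metric can be inspected further. The sketch below assumes the evaluate "bleu" metric exposes 'precisions' and 'brevity_penalty' keys, as it does in current versions.

# Optionally, inspect the components that enter the BLEU formula
# (assumes the 'precisions' and 'brevity_penalty' keys returned by evaluate's "bleu" metric)
print(f"n-gram precisions: {[round(p, 3) for p in bleu_score['precisions']]}")
print(f"Brevity penalty: {round(bleu_score['brevity_penalty'], 3)}")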

Ask a Question About a Car Review


In this section we will use the deepset/minilm-uncased-squad2 model to generate answers to questions. Since the model performs extractive question answering, the answers will be concise and extracted directly from the context.

The context will be the second review collected.

# Instantiate model and tokenizer
model_ckp = "deepset/minilm-uncased-squad2"
model = pipeline(task="question-answering", model=model_ckp)

# Define context and question
context = reviews[1]
question = "What did he like about the brand?"
QC_input = {"question" : question, "context" : context}

# Generate answer
output = model(QC_input)
answer = output['answer']

print(f"Context:\n{context}")
print(f"Question:\n{question}\nAnswer:\n{answer}")

As expected, the model has extracted the answer to our question directly from the context we provided, and it is correct!

Summarize a Car Review


In this section, we will build a text summarizer using the facebook/bart-large-cnn model, available on Hugging Face.

As the text to summarize, we will use the last review collected.
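
A minimal sketch of how this step could look, following the same pipeline pattern used in the previous sections; the max_length and min_length values below are illustrative, not tuned.

# Load a summarization LLM into a pipeline and summarize the last car review
summarizer = pipeline(task="summarization", model="facebook/bart-large-cnn")
last_review = reviews[-1]
# max_length and min_length are illustrative generation limits, not tuned values
summary = summarizer(last_review, max_length=55, min_length=20)[0]['summary_text']
print(f"Original review:\n{last_review}\n")
print(f"Summarized review:\n{summary}")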