Speculative RAG Implementation With Transformers

Learn Speculative RAG, a technique that improves RAG through a two-step drafting and verification process, and apply your skills with a hands-on implementation using Hugging Face Transformers.

Oct 10, 2024 · 8 min read

Large language models (LLMs) often fall short when it comes to knowledge-intensive tasks that require up-to-date and precise information. This is where retrieval-augmented generation (RAG) steps in, combining the generative capabilities of LLMs with external knowledge bases to enhance accuracy and relevance.

Despite its effectiveness, traditional RAG systems face challenges in managing long and complex documents, leading to increased latency and sometimes less accurate results. To address these issues, the concept of speculative RAG has emerged as a promising solution. Let’s find out more about it.

What Is Speculative RAG?

Speculative retrieval-augmented generation (RAG) is an innovative framework designed to improve the efficiency and accuracy of RAG systems by dividing the generation process into two distinct steps:

Drafting: The first step involves a smaller, specialized language model known as the RAG Drafter. This model generates multiple answer drafts from different subsets of retrieved documents. Each subset is carefully selected to represent diverse perspectives, minimizing redundancy and enhancing the quality of drafts.
Verification: The drafts generated by the RAG Drafter are then passed to a larger, generalist LM, known as the RAG Verifier. This model evaluates the drafts, focusing on selecting the most accurate and reliable one based on the rationale provided. The verification process uses the language modeling capabilities of the generalist LM to ensure that the final output is both accurate and contextually relevant.

Let’s better understand speculative RAG by examining this diagram:

Source: Wang et al., 2024

Let's walk through this diagram step-by-step. It shows different ways to use external information to help an LLM answer questions, focusing on a new method called Speculative RAG. Let’s examine the methods in the first row:

(a) Standard RAG: The simplest approach. All the relevant documents (1, 2... in the diagram) are directly given to the LLM (the "Generalist LM") along with the question (Q) to produce an answer (A).
(b) Self-Reflective RAG: Here, the LLM is trained to analyze the documents and its own answer, generating special tags ("Need," "Relevant," "Supported") to improve its reasoning.
(c) Corrective RAG: This uses a separate model (an "NLI model") to check if the retrieved documents are actually relevant to the question before giving them to the LLM.

Now, let’s a close look at (d) Speculative RAG:

Specialist RAG Drafter: Instead of overwhelming the big LLM, a smaller "drafter" model takes the lead. It looks at the question (Q) and the documents (1, 2...) and creates multiple draft answers (α1, α2, α3...). Each draft is based on a different subset of the documents.
Rationales: The drafter also provides explanations (β1, β2, β3...) for why it chose those specific documents for each draft. Think of it like showing its work.
Evaluation: The main LLM (the "Generalist LM") now steps in. It looks at each draft answer and its explanation. It then picks the best answer (A) based on this evaluation.

Essentially, speculative RAG is like having a team of experts: the drafter creates options, and the generalist LLM makes the final decision. This makes the process faster and potentially more accurate.

Advantages of Speculative RAG

Speculative RAG offers several key advantages over traditional RAG systems:

1. Greater accuracy

By using a specialized RAG Drafter to generate multiple drafts from diverse document subsets, Speculative RAG improves the accuracy of the final output. This approach allows the system to consider multiple perspectives, reducing the chances of generating incorrect or biased answers. In experiments, Speculative RAG demonstrated up to a 12.97% improvement in accuracy on benchmarks like PubHealth compared to conventional RAG systems.

2. Reduced latency

One of the most significant challenges with traditional RAG systems is the increased latency caused by processing long and complex documents. Speculative RAG addresses this issue by offloading the drafting process to a smaller, more efficient LM, which can generate drafts in parallel. This approach reduces the overall processing time, with latency improvements of up to 51% in some cases.

Speculative RAG latency vs Standard RAG latency on different datasets. Source: Wang et al., 2024

3. Resource efficiency

Speculative RAG optimizes resource usage by delegating the drafting task to a smaller LM, which is less computationally intensive than the larger generalist LM used in traditional RAG systems. This division of labor not only speeds up the generation process but also reduces the overall computational load, making it a more resource-efficient solution.

4. Scalability and flexibility

The Speculative RAG framework is highly adaptable and can be applied to various knowledge-intensive tasks without the need for extensive tuning. This scalability makes it an ideal solution for a wide range of applications, from question answering to complex document analysis.

How to Implement Speculative RAG

To implement a simplified version of Speculative RAG for a basic question-answering task, we'll use a small dataset and a simple setup with two models—a smaller model for drafting and a larger one for verification.

For simplicity, we’ll use the Hugging Face Transformers library. We'll demonstrate how to use a smaller language model to generate multiple drafts and a larger model to verify and select the best draft.

Step 1: Install the required libraries

First, let's install the necessary libraries. You’ll need to restart the session after installing the transformers library from HuggingFace. Make sure that you have added HuggingFace credentials under a secret key in Collab or any other editor.

!pip install transformers torch datasets

Step 2: Load the dataset

For this example, we'll use the SQuAD (Stanford Question Answering Dataset) available from the Hugging Face datasets library.

from datasets import load_dataset
# Load the SQuAD dataset
dataset = load_dataset("squad", split="train[:100]")  # Using a small subset

Step 3: Initialize models

We'll use two pre-trained models from the Hugging Face library. A smaller model (distilbert-base-uncased-distilled-squad) will act as our RAG Drafter and a larger model (bert-large-uncased-whole-word-masking-finetuned-squad) will act as the RAG Verifier.

Then, we’ll build a Drafter and a verifier pipeline with the selected models.

from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline

# Initialize the smaller model (RAG Drafter)
drafter_model_name = "distilbert-base-uncased-distilled-squad"
drafter_model = AutoModelForQuestionAnswering.from_pretrained(drafter_model_name)
drafter_tokenizer = AutoTokenizer.from_pretrained(drafter_model_name)

# Initialize the larger model (RAG Verifier)
verifier_model_name = "bert-large-uncased-whole-word-masking-finetuned-squad"
verifier_model = AutoModelForQuestionAnswering.from_pretrained(verifier_model_name)
verifier_tokenizer = AutoTokenizer.from_pretrained(verifier_model_name)

# Set up pipelines
drafter_pipeline = pipeline("question-answering", model=drafter_model, tokenizer=drafter_tokenizer)
verifier_pipeline = pipeline("question-answering", model=verifier_model, tokenizer=verifier_tokenizer)

Step 4: Define the drafting function

Next, we'll define a function to generate multiple drafts using the smaller model with the predefined Drafter pipeline.

def generate_drafts(question, context, num_drafts=3):
   drafts = []
   for _ in range(num_drafts):
       draft = Drafter_pipeline(question=question, context=context)
       drafts.append(draft)
   return drafts

Step 5: Define the verification function

We'll now define a verify_drafts() function that uses the larger model to verify and select the best draft based on confidence scores.

It works by mapping the start and end character positions of each draft to the corresponding token positions in the tokenized context using an offset mapping. The verifier model then scores each draft based on its confidence in these positions being the correct answer. The draft with the highest confidence score is selected as the best answer, ensuring that the final output is both accurate and reliable.

This function's careful handling of token alignment and scoring is key to the enhanced performance of Speculative RAG.

def verify_drafts(question, context, drafts):
    best_draft = None
    highest_score = 0

    # Tokenize the context using the verifier's tokenizer, keeping track of offsets
    inputs = verifier_tokenizer(question, context, return_tensors="pt", return_offsets_mapping=True)
    offset_mapping = inputs['offset_mapping'][0]  # This will give us the character-to-token mapping
    input_ids = inputs['input_ids'][0]

    for draft in drafts:
        start_char = draft['start']
        end_char = draft['end']

        # Find the corresponding token positions using offset mapping
        start_index = None
        end_index = None

        for idx, (start, end) in enumerate(offset_mapping):
            if start_index is None and start_char >= start and start_char < end:
                start_index = idx
            if end_index is None and end_char > start and end_char <= end:
                end_index = idx
            if start_index is not None and end_index is not None:
                break

        # Ensure indices were found and are within bounds
        if (start_index is None or end_index is None or
                start_index >= len(input_ids) or end_index >= len(input_ids)):
            print(f"Draft skipped: Out of bounds or no matching tokens. "
                  f"Start Index: {start_index}, End Index: {end_index}")
            continue

        # Get the confidence score using the larger model
        outputs = verifier_model(input_ids=input_ids.unsqueeze(0))
        score = (outputs.start_logits[0, start_index].item() +
                 outputs.end_logits[0, end_index].item())

        if score > highest_score:
            highest_score = score
            best_draft = draft

    if best_draft is None:
        print("No valid draft found after verification.")

    return best_draft

Step 6: Run and evaluate the Speculative RAG process

Now, let's apply our Speculative RAG process to 10 sample questions from the dataset.

correct = 0
total = 10  # Evaluate on 10 samples for simplicity

for i in range(total):
    sample = dataset[i]
    question = sample['question']
    context = sample['context']
    drafts = generate_drafts(question, context)
    best_answer = verify_drafts(question, context, drafts)

    print(f"Q: {question}")

    if best_answer is not None:
        print(f"A: {best_answer['answer']}\n")
        # For simplicity, compare with the first answer (gold) provided in the dataset
        if best_answer['answer'].lower() in sample['answers']['text'][0].lower():
            correct += 1
    else:
        print("No valid draft found.\n")

accuracy = correct / total * 100
print(f"Accuracy: {accuracy}%")

This code snippet evaluates the accuracy of a simplified speculative RAG implementation on 10 samples from a dataset. For each sample, it generates multiple answer drafts based on a given question and context and then selects the best draft using a verification function. If a valid answer is found, it compares this answer to the gold (correct) answer provided in the dataset. If the answer matches, the correct count is incremented.

Finally, the code calculates and prints the accuracy as a percentage of correct answers out of the total samples. This code yields an accuracy of 90% on this simplified dataset.

Challenges of Speculative RAG

While speculative RAG gives good results, it presents some challenges:

Complexity in implementation:

Requires coordination of multiple models.
Adds complexity to integration into existing systems.

Training overhead:

Training the RAG Drafter introduces an extra step.
Requires resources and time.
Complicates deployment, especially in environments requiring rapid deployment.

Dependency on high-quality retrieval:

Effectiveness depends heavily on the quality of retrieved documents.
Poor retrieval results lead to suboptimal drafts.

Applications of Speculative RAG

Speculative RAG can be applied across a wide range of domains where accurate and efficient information retrieval is critical. Here are some potential applications:

Knowledge-intensive question answering:

Speculative RAG is effective in tasks where accuracy and relevance are important.
The framework can generate precise and well-supported answers to complex queries by leveraging diverse document subsets.

Real-time information retrieval:

Speculative RAG can enhance the speed and reliability of responses in scenarios where up-to-date information is crucial.
The framework's ability to reduce latency while maintaining accuracy makes it ideal for real-time applications.

Medical and legal text analysis:

Speculative RAG can be used in fields where accurate interpretation of complex documents is essential.
The framework ensures critical decisions are based on reliable information by generating multiple drafts and selecting the most accurate one based on a rationale.

Educational tools:

Speculative RAG can support the creation of accurate and context-rich content for e-learning platforms.
This ensures that learners receive reliable and well-rounded information, enhancing the overall learning experience.

Conclusion

Speculative RAG enhances retrieval-augmented generation with improved accuracy, reduced latency, resource efficiency, and scalability. We demonstrated its implementation using Hugging Face Transformers.

While challenges exist, speculative RAG's potential in various domains makes it promising for further research.

If you want to learn more about RAG implementation, I recommend the following tutorials:

Develop AI Applications

Learn to build AI applications using the OpenAI API.

Start Upskilling For Free

Author

Bhavishya Pandit

Topics

Artificial Intelligence

Large Language Models

Learn AI with these courses!

Track

Developing AI Applications

0 min

Learn to create AI-powered applications with the latest AI developer tools, including the OpenAI API, Hugging Face, and LangChain.

See Details

Start Course

Course

Developing AI Systems with the OpenAI API

3 hr

14.5K

Leverage the OpenAI API to get your AI applications ready for production.

See Details

Start Course

Course

Retrieval Augmented Generation (RAG) with LangChain

3 hr

12K

Learn cutting-edge methods for integrating external data with LLMs using Retrieval Augmented Generation (RAG) with LangChain.

See Details

Start Course

blog

Advanced RAG Techniques

Learn advanced RAG methods like dense retrieval, reranking, or multi-step reasoning to tackle issues like hallucination or ambiguity.

Stanislav Karzhev

12 min

blog

What is Retrieval Augmented Generation (RAG)?

Learn how Retrieval Augmented Generation (RAG) enhances large language models by integrating external data sources.

Natassha Selvaraj

6 min

blog

Agentic RAG: How It Works, Use Cases, Comparison With RAG

Learn about Agentic RAG, an AI paradigm combining agentic AI and RAG for autonomous information access and generation.

Bhavishya Pandit

6 min

Tutorial

How to Improve RAG Performance: 5 Key Techniques with Examples

Explore different approaches to enhance RAG systems: Chunking, Reranking, and Query Transformations.

Eugenia Anello

Tutorial

Corrective RAG (CRAG) Implementation With LangGraph

Corrective RAG (CRAG) is a RAG technique that incorporates self-assessment of retrieved documents to improve the accuracy and relevance of generated responses.

Ryan Ong

Tutorial

RAG vs Fine-Tuning: A Comprehensive Tutorial with Practical Examples

Learn the differences between RAG and Fine-Tuning techniques for customizing model performance and reducing hallucinations in LLMs.

Abid Ali Awan

See More See More

What Is Speculative RAG?

Advantages of Speculative RAG

1. Greater accuracy

2. Reduced latency

3. Resource efficiency

4. Scalability and flexibility

How to Implement Speculative RAG

Step 1: Install the required libraries

Step 2: Load the dataset

Step 3: Initialize models

Step 4: Define the drafting function

Step 5: Define the verification function

Step 6: Run and evaluate the Speculative RAG process

Challenges of Speculative RAG

Applications of Speculative RAG

Conclusion

Develop AI Applications

Advanced RAG Techniques

What is Retrieval Augmented Generation (RAG)?

Agentic RAG: How It Works, Use Cases, Comparison With RAG

How to Improve RAG Performance: 5 Key Techniques with Examples

Corrective RAG (CRAG) Implementation With LangGraph

RAG vs Fine-Tuning: A Comprehensive Tutorial with Practical Examples

.css-1531qan{-webkit-text-decoration:none;text-decoration:none;color:inherit;}Developing AI Applications

Developing AI Systems with the OpenAI API

Retrieval Augmented Generation (RAG) with LangChain

Advanced RAG Techniques

What is Retrieval Augmented Generation (RAG)?

Agentic RAG: How It Works, Use Cases, Comparison With RAG

How to Improve RAG Performance: 5 Key Techniques with Examples

Corrective RAG (CRAG) Implementation With LangGraph

RAG vs Fine-Tuning: A Comprehensive Tutorial with Practical Examples

Developing AI Applications