Llama 4 With RAG: A Guide With Demo Project

Learn how to build a retrieval-augmented generation (RAG) pipeline using Llama 4 to create a simple web application.

Apr 14, 2025 · 12 min read

Llama 4 Scout is marketed as having a massive context window of 10 million tokens, but its training was limited to a maximum input size of 256k tokens. This means performance can degrade with larger inputs. To prevent this, we can use Llama 4 with a retrieval-augmented generation (RAG) pipeline.

In this tutorial, I’ll explain step-by-step how to build a RAG pipeline using the LangChain ecosystem and create a web application that allows users to upload documents and ask questions about them.

We keep our readers updated on the latest in AI by sending out The Median, our free Friday newsletter that breaks down the week’s key stories. Subscribe and stay sharp in just a few minutes a week:

Why Use Llama 4 With RAG?

Llama 4 comes with a 10 million token context window, meaning we can feed it 100 books and receive contextual answers about them. However, despite this impressive feature, there are compelling reasons to pair Llama 4 with RAG. Here is why:

1. Performance limitations with large inputs

While Llama 4 supports a massive context window, its training was limited to a maximum input size of 256k tokens. The model's performance can degrade when inputs approach or exceed this threshold. RAG mitigates this issue by retrieving only the most relevant information, ensuring the input remains concise and focused.

2. Efficiency and focus

RAG enables the retrieval of highly relevant information from a vector database, allowing Llama 4 to process smaller, more targeted inputs. This approach improves both efficiency and accuracy.

3. Cost savings

Processing massive inputs with Llama 4's full context window can be computationally expensive. By using RAG, only the most relevant data is retrieved and processed, significantly reducing the computational load and associated costs.

Building the Llama 4 RAG Pipeline

In this tutorial, I will explain step-by-step how to build a simple retrieval-augmented generation pipeline that allows you to load a .docx file and ask questions about its content.

1. Setting up

First, we need to create a GroqCloud API key and save it as an environment variable. Next, we will install all the necessary Python packages required to load, split, vectorize, and retrieve the context, as well as generate responses.

%%capture
%pip install langchain
%pip install langchain-community 
%pip install langchainhub 
%pip install langchain-chroma 
%pip install langchain-groq
%pip install langchain-huggingface
%pip install unstructured[docx]

2. Initiating the LLM and the embedding

We will set up the model object using the Groq API, providing it with the model name and API key. Similarly, we will download the embedding model from Hugging Face and load it as our embedding model.

from langchain_groq import ChatGroq
from langchain_huggingface import HuggingFaceEmbeddings

llm = ChatGroq(model="meta-llama/llama-4-scout-17b-16e-instruct", api_key=groq_api_key)

embed_model = HuggingFaceEmbeddings(model_name="mixedbread-ai/mxbai-embed-large-v1")

3. Loading and splitting the data

We will load the .docx file from the root folder and split it into chunks of 1000 characters.

from langchain_community.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Initialize the text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=100,
    separators=["\n\n", "\n"]
)

# Load the .docx files
loader = DirectoryLoader("./", glob="*.docx", use_multithreading=True)
documents = loader.load()

# Split the documents into chunks
chunks = text_splitter.split_documents(documents)

# Print the number of chunks
print(len(chunks))

4. Building and populating the vector store

Initialize the Chroma vector store and provide the text chunks to the vector database. The text will be converted to embeddings before being stored. After that, we will run a similarity search to test if our vector store is properly populated.

from langchain_chroma import Chroma

vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embed_model,
    persist_directory="./Vectordb",
)

query = "What is this tutorial about?"
docs = vectorstore.similarity_search(query)
print(docs[0].page_content)

Learn how to Fine-tune Stable Diffusion XL with DreamBooth and LoRA on your personal images. 

Let's try another prompt:

Prompt:

5. Creating the RAG pipeline

Next, we will convert our vector store into a retriever and create a prompt template for the RAG pipeline.

# Create retriever
retriever = vectorstore.as_retriever()

# Import PromptTemplate
from langchain_core.prompts import PromptTemplate

# Define a clearer, more professional prompt template
template = """You are an expert assistant tasked with answering questions based on the provided documents.
Use only the given context to generate your answer.
If the answer cannot be found in the context, clearly state that you do not know.
Be detailed and precise in your response, but avoid mentioning or referencing the context itself.

Context:
{context}

Question:
{question}

Answer:"""

# Create the PromptTemplate
rag_prompt = PromptTemplate.from_template(template)

Finally, we will create the RAG chain that provides both the context and the question in the RAG prompt, pass it through the Llama 4 model, and then generate a clean response.

from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | rag_prompt
    | llm
    | StrOutputParser()
)

We will now ask the RAG chain about the document.

from IPython.display import display, Markdown

response = rag_chain.invoke("What this tutorial about?")
Markdown(response)

This tutorial is about setting up and using the Janus project, specifically Janus Pro, a multimodal model that can understand images and generate images from text prompts, and building a local solution to use the model privately on a laptop GPU. It covers learning about the Janus Series, setting up the Janus project, building a Docker container to run the model locally, and testing its capabilities with various image and text prompts.

As you can see, the response is highly accurate and context-aware. If you encounter any issues running the above code, please refer to this DataLab workbook: Llama 4 RAG pipeline.

Building the Llama 4 Docx Chat Application

In this section, we will transform the previously discussed RAG pipeline code into a functional chat application using the Gradio package. To get started, you need to install Gradio using the following pip command:

$ pip install gradio

The application is designed to be simple and user-friendly, allowing users to upload a single .docx file, multiple .docx files, or a .zip archive containing .docx files. Users can then ask questions about the content of all uploaded files, providing a fast and efficient way to retrieve information.

The application features:

Document loading: The app extracts and loads documents directly from uploaded files.
Text splitting: Documents are split into manageable chunks for efficient processing and retrieval.
Vector store integration: Instead of ChromaDB, it uses the in-memory vector store to manage document embeddings.
RAG pipeline: Combines document retrieval with a language model to provide accurate answers based on the uploaded content.
User interface: Features a chat interface where users can ask questions and receive detailed responses based on the provided documents.
Error handling: Provides feedback for unsupported file types and issues during file uploads, enhancing user experience.
Reset functionality: Users can reset the application state to start fresh with new document uploads.

Let’s put it all together:

main.py:

# ========== Standard Library ==========
import os
import tempfile
import zipfile
from typing import List, Optional, Tuple, Union
import collections


# ========== Third-Party Libraries ==========
import gradio as gr
from groq import Groq
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import DirectoryLoader, UnstructuredFileLoader
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_groq import ChatGroq
from langchain_huggingface import HuggingFaceEmbeddings

# ========== Configs ==========
TITLE = """<h1 align="center">🗨️🦙 Llama 4 Docx Chatter</h1>"""
AVATAR_IMAGES = (
    None,
    "./logo.png",
)

# Acceptable file extensions
TEXT_EXTENSIONS = [".docx", ".zip"]

# ========== Models & Clients ==========
GROQ_API_KEY = os.getenv("GROQ_API_KEY")
client = Groq(api_key=GROQ_API_KEY)
llm = ChatGroq(model="meta-llama/llama-4-scout-17b-16e-instruct", api_key=GROQ_API_KEY)
embed_model = HuggingFaceEmbeddings(model_name="mixedbread-ai/mxbai-embed-large-v1")

# ========== Core Components ==========
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=100,
    separators=["\n\n", "\n"],
)

rag_template = """You are an expert assistant tasked with answering questions based on the provided documents.
Use only the given context to generate your answer.
If the answer cannot be found in the context, clearly state that you do not know.
Be detailed and precise in your response, but avoid mentioning or referencing the context itself.
Context:
{context}
Question:
{question}
Answer:"""
rag_prompt = PromptTemplate.from_template(rag_template)


# ========== App State ==========
class AppState:
    vectorstore: Optional[InMemoryVectorStore] = None
    rag_chain = None


state = AppState()

# ========== Utility Functions ==========


def load_documents_from_files(files: List[str]) -> List:
    """Load documents from uploaded files directly without moving."""
    all_documents = []

    # Temporary directory if ZIP needs extraction
    with tempfile.TemporaryDirectory() as temp_dir:
        for file_path in files:
            ext = os.path.splitext(file_path)[1].lower()

            if ext == ".zip":
                # Extract ZIP inside temp_dir
                with zipfile.ZipFile(file_path, "r") as zip_ref:
                    zip_ref.extractall(temp_dir)

                # Load all docx from extracted zip
                loader = DirectoryLoader(
                    path=temp_dir,
                    glob="**/*.docx",
                    use_multithreading=True,
                )
                docs = loader.load()
                all_documents.extend(docs)

            elif ext == ".docx":
                # Load single docx directly
                loader = UnstructuredFileLoader(file_path)
                docs = loader.load()
                all_documents.extend(docs)

    return all_documents


def get_last_user_message(chatbot: List[Union[gr.ChatMessage, dict]]) -> Optional[str]:
    """Get last user prompt."""
    for message in reversed(chatbot):
        content = (
            message.get("content") if isinstance(message, dict) else message.content
        )
        if (
            message.get("role") if isinstance(message, dict) else message.role
        ) == "user":
            return content
    return None


# ========== Main Logic ==========




def upload_files(
    files: Optional[List[str]], chatbot: List[Union[gr.ChatMessage, dict]]
):
    """Handle file upload - .docx or .zip containing docx."""
    if not files:
        return chatbot

    file_summaries = []  # <-- Collect formatted file/folder info
    documents = []

    with tempfile.TemporaryDirectory() as temp_dir:
        for file_path in files:
            filename = os.path.basename(file_path)
            ext = os.path.splitext(file_path)[1].lower()

            if ext == ".zip":
                file_summaries.append(f"📦 **{filename}** (ZIP file) contains:")
                try:
                    with zipfile.ZipFile(file_path, "r") as zip_ref:
                        zip_ref.extractall(temp_dir)
                        zip_contents = zip_ref.namelist()

                        # Group files by folder
                        folder_map = collections.defaultdict(list)
                        for item in zip_contents:
                            if item.endswith("/"):
                                continue  # skip folder entries themselves
                            folder = os.path.dirname(item)
                            file_name = os.path.basename(item)
                            folder_map[folder].append(file_name)

                        # Format nicely
                        for folder, files_in_folder in folder_map.items():
                            if folder:
                                file_summaries.append(f"📂 {folder}/")
                            else:
                                file_summaries.append(f"📄 (root)")
                            for f in files_in_folder:
                                file_summaries.append(f"   - {f}")

                    # Load docx files extracted from ZIP
                    loader = DirectoryLoader(
                        path=temp_dir,
                        glob="**/*.docx",
                        use_multithreading=True,
                    )
                    docs = loader.load()
                    documents.extend(docs)

                except zipfile.BadZipFile:
                    chatbot.append(
                        gr.ChatMessage(
                            role="assistant",
                            content=f"❌ Failed to open ZIP file: {filename}",
                        )
                    )

            elif ext == ".docx":
                file_summaries.append(f"📄 **{filename}**")
                loader = UnstructuredFileLoader(file_path)
                docs = loader.load()
                documents.extend(docs)

            else:
                file_summaries.append(f"❌ Unsupported file type: {filename}")

    if not documents:
        chatbot.append(
            gr.ChatMessage(
                role="assistant", content="No valid .docx files found in upload."
            )
        )
        return chatbot

    # Split documents
    chunks = text_splitter.split_documents(documents)
    if not chunks:
        chatbot.append(
            gr.ChatMessage(
                role="assistant", content="Failed to split documents into chunks."
            )
        )
        return chatbot

    # Create Vectorstore
    state.vectorstore = InMemoryVectorStore.from_documents(
        documents=chunks,
        embedding=embed_model,
    )
    retriever = state.vectorstore.as_retriever()

    # Build RAG Chain
    state.rag_chain = (
        {"context": retriever, "question": RunnablePassthrough()}
        | rag_prompt
        | llm
        | StrOutputParser()
    )

    # Final display
    chatbot.append(
        gr.ChatMessage(
            role="assistant",
            content="**Uploaded Files:**\n"
            + "\n".join(file_summaries)
            + "\n\n✅ Ready to chat!",
        )
    )
    return chatbot


def user_message(
    text_prompt: str, chatbot: List[Union[gr.ChatMessage, dict]]
) -> Tuple[str, List[Union[gr.ChatMessage, dict]]]:
    """Add user's text input to conversation."""
    if text_prompt.strip():
        chatbot.append(gr.ChatMessage(role="user", content=text_prompt))
    return "", chatbot


def process_query(
    chatbot: List[Union[gr.ChatMessage, dict]],
) -> List[Union[gr.ChatMessage, dict]]:
    """Process user's query through RAG pipeline."""
    prompt = get_last_user_message(chatbot)
    if not prompt:
        chatbot.append(
            gr.ChatMessage(role="assistant", content="Please type a question first.")
        )
        return chatbot

    if state.rag_chain is None:
        chatbot.append(
            gr.ChatMessage(role="assistant", content="Please upload documents first.")
        )
        return chatbot

    chatbot.append(gr.ChatMessage(role="assistant", content="Thinking..."))

    try:
        response = state.rag_chain.invoke(prompt)
        chatbot[-1].content = response
    except Exception as e:
        chatbot[-1].content = f"Error: {str(e)}"

    return chatbot


def reset_app(
    chatbot: List[Union[gr.ChatMessage, dict]],
) -> List[Union[gr.ChatMessage, dict]]:
    """Reset application state."""
    state.vectorstore = None
    state.rag_chain = None
    return [
        gr.ChatMessage(
            role="assistant", content="App reset! Upload new documents to start."
        )
    ]


# ========== UI Layout ==========

with gr.Blocks(theme=gr.themes.Soft()) as demo:
    gr.HTML(TITLE)
    chatbot = gr.Chatbot(
        label="Llama 4 RAG",
        type="messages",
        bubble_full_width=False,
        avatar_images=AVATAR_IMAGES,
        scale=2,
        height=350,
    )

    with gr.Row(equal_height=True):
        text_prompt = gr.Textbox(
            placeholder="Ask a question...", show_label=False, autofocus=True, scale=28
        )
        send_button = gr.Button(
            value="Send",
            variant="primary",
            scale=1,
            min_width=80,
        )
        upload_button = gr.UploadButton(
            label="Upload",
            file_count="multiple",
            file_types=TEXT_EXTENSIONS,
            scale=1,
            min_width=80,
        )
        reset_button = gr.Button(
            value="Reset",
            variant="stop",
            scale=1,
            min_width=80,
        )

    send_button.click(
        fn=user_message,
        inputs=[text_prompt, chatbot],
        outputs=[text_prompt, chatbot],
        queue=False,
    ).then(fn=process_query, inputs=[chatbot], outputs=[chatbot])

    text_prompt.submit(
        fn=user_message,
        inputs=[text_prompt, chatbot],
        outputs=[text_prompt, chatbot],
        queue=False,
    ).then(fn=process_query, inputs=[chatbot], outputs=[chatbot])

    upload_button.upload(
        fn=upload_files, inputs=[upload_button, chatbot], outputs=[chatbot], queue=False
    )
    reset_button.click(fn=reset_app, inputs=[chatbot], outputs=[chatbot], queue=False)

demo.queue().launch()

Testing the Llama 4 Docx Chat Application

Now, we will run the web application and test its functionalities. To start the application, run the following command in the terminal:

$ python main.py

Once the web server is up and running, navigate to the following URL or copy and paste it manually into your browser to access the web interface: http://127.0.0.1:7860

Here is how our chat interface looks:

We will upload a .zip file containing .docx files, and as we do this, you will see all the folders and files within the zip file.

Next, we can ask specific questions about the documents:

You can also press the “Reset” button to upload a new file and chat about it in just a few seconds.

After testing it locally, you can easily deploy the web application on Hugging Face Space for free. Here is the link to the deployed app: kingabzpro/Llama-4-RAG

Source: kingabzpro/Llama-4-RAG

The code source, notebook, and configuration files can be found in the kingabzpro/Llama-4-RAG repository.

Conclusion

Even with the introduction of long-context window models, RAG will never become obsolete. RAG remains an efficient and cost-effective method for retrieving information from external resources, providing additional context to models.

Moreover, RAG systems are inherently more economical to operate compared to feeding extensive context directly into a model, which requires significant computational resources. This makes RAG a practical and scalable solution for many use cases.

To learn more about Llama 4, check out these resources:

RAG with LangChain

Integrate external data with LLMs using Retrieval Augmented Generation (RAG) and LangChain.

Explore Course

Author

Abid Ali Awan

As a certified data scientist, I am passionate about leveraging cutting-edge technology to create innovative machine learning applications. With a strong background in speech recognition, data analysis and reporting, MLOps, conversational AI, and NLP, I have honed my skills in developing intelligent systems that can make a real impact. In addition to my technical expertise, I am also a skilled communicator with a talent for distilling complex concepts into clear and concise language. As a result, I have become a sought-after blogger on data science, sharing my insights and experiences with a growing community of fellow data professionals. Currently, I am focusing on content creation and editing, working with large language models to develop powerful and engaging content that can help businesses and individuals alike make the most of their data.

Topics

Artificial Intelligence

Large Language Models

Learn AI with these courses!

Track

Llama Fundamentals

0 min

Experiment with Llama 3 to run inference on pre-trained models, fine-tune them on custom datasets, and optimize performance.

See Details

Start Course

Course

Working with Llama 3

2 hr

10.7K

Explore the latest techniques for running the Llama LLM locally and integrating it within your stack.

See Details

Start Course

Course

Fine-Tuning with Llama 3

2 hr

2.8K

Fine-tune Llama for custom tasks using TorchTune, and learn techniques for efficient fine-tuning such as quantization.

See Details

Start Course

Tutorial

RAG With Llama 3.1 8B, Ollama, and Langchain: Tutorial

Learn to build a RAG application with Llama 3.1 8B using Ollama and Langchain by setting up the environment, processing documents, creating embeddings, and integrating a retriever.

Ryan Ong

Tutorial

Llama 3.2 Vision With RAG: A Guide Using Ollama and ColPali

Learn the step-by-step process of setting up a RAG application using Llama 3.2 Vision, Ollama, and ColPali.

Ryan Ong

Tutorial

Recursive Retrieval for RAG: Implementation With LlamaIndex

Learn how to implement recursive retrieval in RAG systems using LlamaIndex to improve the accuracy and relevance of retrieved information, especially for large document collections.

Ryan Ong

Tutorial

Agentic RAG: Step-by-Step Tutorial With Demo Project

Learn how to build an Agentic RAG pipeline from scratch, integrating local data sources and web scraping to generate context-aware responses to user queries.

Bhavishya Pandit

Tutorial

Fine-Tuning Llama 4: A Guide With Demo Project

Learn how to fine-tune the Llama 4 Scout Instruct model on a medical reasoning dataset using RunPod GPUs.

Abid Ali Awan

code-along

Retrieval Augmented Generation with LlamaIndex

In this session you'll learn how to get started with Chroma and perform Q&A on some documents using Llama 2, the RAG technique, and LlamaIndex.

Dan Becker

See More See More

Why Use Llama 4 With RAG?

1. Performance limitations with large inputs

2. Efficiency and focus

3. Cost savings

Building the Llama 4 RAG Pipeline

1. Setting up

2. Initiating the LLM and the embedding

3. Loading and splitting the data

4. Building and populating the vector store

5. Creating the RAG pipeline

Building the Llama 4 Docx Chat Application

Testing the Llama 4 Docx Chat Application

Conclusion

RAG with LangChain

RAG With Llama 3.1 8B, Ollama, and Langchain: Tutorial

Llama 3.2 Vision With RAG: A Guide Using Ollama and ColPali

Recursive Retrieval for RAG: Implementation With LlamaIndex

Agentic RAG: Step-by-Step Tutorial With Demo Project

Fine-Tuning Llama 4: A Guide With Demo Project

Retrieval Augmented Generation with LlamaIndex

.css-1531qan{-webkit-text-decoration:none;text-decoration:none;color:inherit;}Llama Fundamentals

Working with Llama 3

Fine-Tuning with Llama 3

RAG With Llama 3.1 8B, Ollama, and Langchain: Tutorial

Llama 3.2 Vision With RAG: A Guide Using Ollama and ColPali

Recursive Retrieval for RAG: Implementation With LlamaIndex

Agentic RAG: Step-by-Step Tutorial With Demo Project

Fine-Tuning Llama 4: A Guide With Demo Project

Retrieval Augmented Generation with LlamaIndex

Llama Fundamentals