Llama Stack: A Guide With Practical Examples

Llama Stack is a set of standardized tools and APIs developed by Meta that simplifies the process of building and deploying large language model applications.
Oct 3, 2024  · 8 min read

Although generative AI applications have gained massive traction, building efficient, consistent applications and deploying them remains a challenge.

Llama Stack, an open-source project by Meta, addresses this complexity.

Llama Stack introduces a standardized framework and modular APIs that offer a faster and smoother development experience. It defines and standardizes the building blocks needed to bring generative AI applications to market.

In this article, I’ll guide you through getting started with Llama Stack using step-by-step instructions. If you’re a developer preparing to deploy an AI application to production, be sure to consult the Llama Stack repository, as it is continuously evolving.


What Is Llama Stack?

Llama Stack is a framework built to streamline the development and deployment of generative AI applications built on top of Meta’s Llama models. It achieves this by providing a collection of standardized APIs and components for tasks such as inference, safety, memory management, and agent capabilities.

Here are its goals and benefits:

  • Standardization: The set of APIs offers a consistent interface and working environment, so developers can quickly adapt their applications once new models become available.
  • Synergy: By abstracting complex functionality behind APIs, it lets various tools and components work together, favoring modularity and flexibility.
  • Smooth development: Llama Stack offers tools that simplify the development lifecycle by predefining core functionalities and speeding up deployment.

The Llama Stack distribution (Source: Meta AI)

Llama Stack Components and APIs

Llama Stack comes with several APIs, each targeting a specific set of tasks in building a generative AI application.

Inference API

The Inference API handles text generation and prompting for multimodal Llama variants. Its key features are:

  • It supports various decoding strategies such as top-k sampling.
  • It manages batched requests and streaming responses, which is useful for high-scale applications.

The API also defines various configuration options that let developers control model behavior (e.g., FP8 or BF16 quantization) based on their application requirements.
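For illustration, here is a minimal sketch of a streaming request using the Python client that appears later in this article. The structure of the streamed chunks may differ between client versions, so treat this as a rough outline:

from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:5000")

# stream=True returns the completion incrementally instead of waiting for the full response
for chunk in client.inference.chat_completion(
    model="Llama3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain Llama Stack in one sentence."}],
    stream=True,
):
    print(chunk)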

Safety API

The Safety API supports the responsible deployment of AI models by moderating content and filtering harmful or potentially biased outputs. It is configurable to define violation levels (e.g., INFO, WARN, ERROR) and to return actionable messages to users.
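As a rough sketch, the same run_shield call used later in this article can also be applied to user input before it ever reaches the model. The shield_type value and response fields follow the chatbot example below and may differ in your client version:

from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:5000")

# screen the user's message before sending it to the Inference API
safety_response = client.safety.run_shield(
    messages=[{"role": "user", "content": "Tell me how to pick a lock."}],
    shield_type="llama_guard",
    params={}
)

if safety_response.violation:
    print("Request blocked by the safety shield.")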

Memory API

The Memory API lets applications retain and refer back to past interactions, producing more coherent, contextually aware conversations. The variety of memory configurations gives developers the option to choose a storage type based on application needs. Its key features are:

  • It allows flexible memory storage by providing multiple configurations such as vector, key-value, keyword, and graph, which are essentially methods of storing conversation memories.
  • It allows insertion, querying, updating, and deletion of documents within memory banks. Query results are returned as chunks with relevance scores (see the query sketch below).
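For example, once a memory bank exists, querying it returns the most relevant stored chunks. This minimal sketch reuses the parameter names from the complete example later in this article; the bank id placeholder stands for the value returned by create_memory_bank:

from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:5000")

# look up previously stored conversation snippets related to a new question
query_response = client.memory.query_documents(
    bank_id="<your-bank-id>",  # returned by client.memory.create_memory_bank (see below)
    query=["What did we discuss about pricing?"],
    params={"max_chunks": 5}
)

for chunk in query_response.chunks:
    print(chunk.content)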

Agentic API

The Agentic API enables LLMs to use external tools and functions, allowing them to perform tasks such as web search, code execution, or memory retrieval. The API allows developers to configure agents with specific tools and goals, and it supports multi-turn interactions where each turn consists of multiple steps. Its key features are:

  • It comes with integrated tools such as brave_search, wolfram_alpha, photogen, and code_interpreter. We can use these tools to handle requests or execute code within the model’s context (see the configuration sketch after this list).
  • It works with the Memory API to retrieve relevant information to enhance long-term context.
  • Models can execute tasks through multiple steps such as inference, tool execution, memory retrieval, and safety checks.
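The chatbot we build later in this article does not use the Agentic API, but a rough configuration sketch might look like the following. The config schema and the agents client call are assumptions based on the llama-stack-apps examples, so check the repository for the exact interface in your version:

from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:5000")

# a hypothetical agent configuration; the exact schema and the agents client
# methods are assumptions and may differ between Llama Stack versions
agent_config = {
    "model": "Llama3.1-8B-Instruct",
    "instructions": "You are a helpful assistant. Use tools when they help.",
    "tools": [
        {"type": "brave_search", "api_key": "YOUR_BRAVE_API_KEY"},
        {"type": "code_interpreter"},
    ],
}

agent = client.agents.create(agent_config=agent_config)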

Other APIs

Llama Stack offers additional APIs beyond the four covered above; refer to the repository documentation for the full, up-to-date list.

How to Install and Set Up Llama Stack

We will implement a sample project with Llama Stack to get familiar with the framework’s general workflow and capabilities.

Before we begin, please be aware that:

  1. Llama Stack is evolving fast, and you should expect bugs while implementing your client, so it’s important to refer to the repository documentation and report any issues you encounter. At the time of writing, for example, I was not able to run the Llama Stack container on Windows due to OS-specific issues, which the developer team is working on.
  2. You cannot use Llama Stack on Google Colab, since the free version of the platform does not support building Docker containers.

Let’s start with setting up the Llama command-line interface (CLI).

1. Llama CLI

Llama Stack provides a command-line interface (CLI) for managing distributions, installing models, and configuring environments. Here are the installation steps we need to take:

a. Create and activate a virtual environment:

conda create -n llama_stack python=3.10
conda activate llama_stack

b. Clone the Llama Stack repository:

git clone https://github.com/meta-llama/llama-stack.git
cd llama-stack

c. Install the required dependencies:

pip install llama-stack
pip install -r requirements.txt

2. Using Docker containers

Docker containers simplify the deployment of the Llama Stack server and agent API providers. Pre-built Docker images are available for easy setup:

docker pull llamastack/llamastack-local-gpu
llama stack build
llama stack configure llamastack-local-gpu

These commands pull the pre-built image, then build and configure the local GPU distribution.
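Depending on your setup, you may also be able to run the pulled image directly, mapping the server port and a local model directory into the container. The flags and volume path below are assumptions, so check the repository documentation for the exact command for your distribution:

docker run -it -p 5000:5000 -v ~/.llama:/root/.llama --gpus all llamastack/llamastack-local-gpu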

How to Build an App With Llama Stack APIs

Let's build a basic chatbot using the Llama Stack APIs. Here are the steps we need to take:

1. Start the Llama Stack server

We will run the server on port 5000. Ensure the server is running before working with the APIs:

llama stack run local-gpu --port 5000

2. Use the Inference API

After installing the Llama Stack, you can use client code to interact with its APIs. Use the Inference API to generate responses based on user input:

from llama_stack_client import LlamaStackClient
client = LlamaStackClient(base_url="http://localhost:5000")
user_input = "Hello! How are you?"
response = client.inference.chat_completion(
    model="Llama3.1-8B-Instruct",
    messages=[{"role": "user", "content": user_input}],
    stream=False
)
print("Bot:", response.text)

Note: Replace "Llama3.1-8B-Instruct" with the actual model name available in your setup.
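If you’re unsure which model names your local distribution serves, the client may let you list the registered models (a small sketch; the exact method and response fields can vary by client version):

# list the models registered with the running Llama Stack server
for model in client.models.list():
    print(model)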

3. Integrate the Safety API

Implement the Safety API to moderate responses and ensure they are appropriate:

safety_response = client.safety.run_shield(
    messages=[{"role": "assistant", "content": response.text}],
    shield_type="llama_guard",
    params={}
)
if safety_response.violation:
    print("Unsafe content detected.")
else:
    print("Bot:", response.text)

4. Add memory with the Memory API

Give the chatbot context awareness by storing and retrieving conversation history:

### building a memory bank that can be used to store and retrieve context;
### VectorMemoryBankConfig is assumed to be importable from the client library,
### and the exact import path may vary between llama-stack-client versions
bank = client.memory.create_memory_bank(
    name="chat_memory",
    config=VectorMemoryBankConfig(
        type="vector",
        embedding_model="all-MiniLM-L6-v2",
        chunk_size_in_tokens=512,
        overlap_size_in_tokens=64,
    )
)

5. Chain components together

We can further combine the APIs to build a robust chatbot:

  • Inference API: Generates responses.
  • Safety API: Filters inappropriate content.
  • Memory API: Maintains conversation context.

Complete example

Here’s the complete code after covering all the steps:

import uuid
from llama_stack_client import LlamaStackClient
### the memory bank config and document types are assumed to come from the client's
### types module; the exact import path may vary between client versions
from llama_stack_client.types import MemoryBankDocument, VectorMemoryBankConfig

client = LlamaStackClient(base_url="http://localhost:5000")
### create a memory bank at the start
bank = client.memory.create_memory_bank(
    name="chat_memory",
    config=VectorMemoryBankConfig(
        type="vector",
        embedding_model="all-MiniLM-L6-v2",
        chunk_size_in_tokens=512,
        overlap_size_in_tokens=64,
    )
)
def get_bot_response(user_input):
    ### retrieving conversation history
    query_response = client.memory.query_documents(
        bank_id=bank.bank_id,
        query=[user_input],
        params={"max_chunks": 10}
    )
    history = [chunk.content for chunk in query_response.chunks]
    ### preparing messages with history
    messages = [{"role": "user", "content": user_input}]
    if history:
        messages.insert(0, {"role": "system", "content": "\n".join(history)})
       
    ### generate response
    response = client.inference.chat_completion(
        model="Llama3.1-8B-Instruct",
        messages=messages,
        stream=False
    )
    bot_response = response.text
    ### safety check
    safety_response = client.safety.run_shield(
        messages=[{"role": "assistant", "content": bot_response}],
        shield_type="llama_guard",
        params={}
    )
    if safety_response.violation:
        return "I'm sorry, but I can't assist with that request."
    ### memory storing
    documents = [
        MemoryBankDocument(
            document_id=str(uuid.uuid4()),
            content=user_input,
            mime_type="text/plain"
        ),
        MemoryBankDocument(
            document_id=str(uuid.uuid4()),
            content=bot_response,
            mime_type="text/plain"
        )
    ]
    client.memory.insert_documents(
        bank_id=bank.bank_id,
        documents=documents
    )
    return bot_response
### putting all together
while True:
    user_input = input("You: ")
    if user_input.lower() == "bye":
        break
    bot_response = get_bot_response(user_input)
    print("Bot:", bot_response)

Llama Stack Examples and Contributions

To see examples and jump-start your own applications, Meta provides the llama-stack-apps repository, which contains sample applications built with Llama Stack. It’s worth exploring to familiarize yourself with the framework.

As an open-source project, Llama Stack thrives on community contributions. The APIs are evolving rapidly, and the project welcomes feedback and participation from developers, helping shape the future of the platform. If you try out Llama Stack, sharing your project as an example or contributing to the documentation can help other developers.

Conclusion

Throughout this article, we explored how to get started with Llama Stack through step-by-step instructions.

As you move forward in deploying your AI applications, remember to keep an eye on the Llama Stack repository for the latest updates and enhancements.

To learn more about the Llama ecosystem, check out Meta’s Llama documentation and the llama-stack-apps repository mentioned above.


Author: Hesam Sheikh Hassani

Master's student of Artificial Intelligence and AI technical writer. I share insights on the latest AI technology, making ML research accessible, and simplifying complex AI topics necessary to keep you at the forefront.
