Meta’s latest LLaMA 4 models, Scout and Maverick, are built for long-context reasoning, multi-modal understanding, and efficient inference at scale. When paired with vLLM, a high-throughput inference engine with OpenAI-compatible APIs, these models become practical for real-world applications.
In this blog post, I’ll walk you through how to deploy LLaMA 4 Scout on a multi-GPU RunPod instance using vLLM and serve it via a local or remote OpenAI-compatible API endpoint. We’ll cover model setup, deployment, and API interaction using both text and image inputs.
By the end of this tutorial, you’ll also have two complete demos:
- Text completion demo: A simple interface to chat with LLaMA 4 using OpenAI-compatible Python code and explore long-context capabilities.
- Multimodal vision demo: An interface that accepts both text and image inputs using vLLM’s multi-modal support and returns grounded visual reasoning responses from LLaMA 4.
Why Use LLaMA 4 on vLLM?
vLLM (virtual large language model) is an optimized LLM inference framework originally developed at UC Berkeley’s Sky Computing Lab. It introduces PagedAttention, a memory management technique that minimizes GPU memory fragmentation and enables efficient key-value caching for extremely long sequences. To learn more, check out this tutorial on vLLM.
Here’s why vLLM is the ideal engine to serve LLaMA 4:
- High throughput, low latency: vLLM’s paged KV cache enables efficient inference, even with long context.
- Multi-modal and long-context ready: It supports image inputs and context windows of up to 10M tokens for the Scout model.
- OpenAI-compatible: You can use OpenAI SDKs and tools without modification.
- Scalable and GPU-efficient: vLLM fully leverages multiple GPUs—including H100, A100, and MI300X—using tensor and memory parallelism for high-throughput inference.
Hosting LLaMA 4 Scout on RunPod With vLLM
In this section, we’ll walk through how to provision a powerful multi-GPU instance on RunPod, deploy Meta’s LLaMA 4 Scout model using vLLM, and expose an OpenAI-compatible API for both local and remote inference.
Step 1: Set up your RunPod environment
Before launching the model, make sure your RunPod account is set up:
- Log in to RunPod.io and configure your billing.
- Add at least $25 to your balance to ensure you can run large GPU instances for the duration of this project.
Step 2: Deploy a pod with H100 NVL GPUs
Now, let’s provision a pod capable of hosting the LLaMA 4 Scout model (a mixture-of-experts model with 17B active parameters):
- Go to the Pods section and click “Deploy a Pod.”
- Filter GPUs by VRAM, setting the minimum to 80 GB, and select the H100 NVL option. Note: LLaMA 4 Scout’s weights are far too large for a single GPU at full precision, so for this tutorial we’ll use 4x H100 NVL.
- Scroll down and give your pod a name like “LLAMA4_DEMO”.
- Under Template, click “Change Template” and select PyTorch 2.4.0.
- Click “Edit Template”, set both Container Disk and Volume Disk to 1000 GB, and apply overrides.
- (Optional) Add your Hugging Face access token to the environment variables for model downloads.
- Set GPU Count to 4, then click “Deploy On-Demand.”
Wait a few minutes for the pod to be provisioned.
Step 3: Connect to your pod
Once the “Connect” button turns purple, click it. You'll see multiple connection options—you can either:
- Open a JupyterLab terminal to run shell commands.
- Or use SSH or HTTP ports if you want remote control.
Step 4: Install vLLM and required libraries
In the terminal inside your pod, install vLLM and the other libraries we’ll use:
pip install -U vllm
pip install transformers accelerate pillow  # optional: tokenizer, acceleration, and image-handling utilities
Step 5: Run LLaMA 4 Scout on vLLM
Next, we launch the LLaMA 4 Scout model server using the following command:
VLLM_DISABLE_COMPILE_CACHE=1 vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
--tensor-parallel-size 4 \
--max-model-len 100000 --override-generation-config='{"attn_temperature_tuning": true}'
This command does the following:
- It disables the vLLM compile cache (VLLM_DISABLE_COMPILE_CACHE=1), which helps avoid compile-cache issues at startup.
- It runs the model on 4 GPUs using tensor parallelism.
- It sets a maximum context length of 100,000 tokens due to memory constraints. Note: if you want to run the model with the full 10M-token context length, you need at least 8x H100 GPUs.
- It enables attention temperature tuning, which improves accuracy for long-context inference.
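Before moving on, it’s a good idea to confirm that the server is actually up and serving the model. Here’s a minimal sketch that queries the OpenAI-compatible /v1/models endpoint, assuming the server is listening on the default port 8000 on the same machine:
import requests

# Query vLLM's OpenAI-compatible model listing endpoint
response = requests.get("http://localhost:8000/v1/models")
response.raise_for_status()

# Print the IDs of the models currently being served
for model in response.json()["data"]:
    print(model["id"])
If everything is running, this should print meta-llama/Llama-4-Scout-17B-16E-Instruct.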
Now that your API is running on port 8000, create a new Jupyter Notebook in the same pod environment.
Text Completion With Llama 4 Scout
In this demo, we’ll interact with the locally hosted LLaMA 4 Scout model served via vLLM using a simple Python script. The model is exposed through an OpenAI-compatible API, allowing us to build a conversational assistant that supports multi-turn dialogue.
Step 1: Prerequisites
We will use the official OpenAI Python SDK to communicate with the vLLM API and the Colorama library for colored terminal output. You can install both using:
pip install openai colorama
Once installed, import the necessary modules:
from openai import OpenAI
from colorama import Fore, Style, init
Step 2: Initialize the client
We configure the OpenAI client to point to our locally running vLLM server. If your deployment doesn’t require an API key (the default behavior for local runs), pass "EMPTY" as the key:
# Initialize colorama
init(autoreset=True)
# Set up client
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(
api_key=openai_api_key,
base_url=openai_api_base,
)
This code initializes the Colorama library for colored terminal output with an automatic reset after each print. It then sets up an OpenAI-compatible client using the OpenAI class, pointing to a locally hosted vLLM API.
Step 3: Start the chat loop
Next, we implement a simple loop to allow continuous interaction with the model. The conversation history is preserved across turns, enabling multi-turn dialogue behavior.
# Initialize message history
messages = [
{"role": "system", "content": "You are a helpful assistant."}
]
print("Start chatting with the assistant! Type 'exit' or 'quit' to end.\n")
while True:
user_input = input(f"{Fore.BLUE}User: {Style.RESET_ALL}")
if user_input.strip().lower() in ["exit", "quit"]:
print("Exiting chat. Goodbye!")
break
messages.append({"role": "user", "content": user_input})
chat_response = client.chat.completions.create(
model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
messages=messages,
)
assistant_message = chat_response.choices[0].message.content
print(f"{Fore.GREEN}Assistant: {assistant_message}\n{Style.RESET_ALL}")
messages.append({"role": "assistant", "content": assistant_message})
This sets up a simple terminal-based chat interface: the conversation history is seeded with a system prompt, and the loop continuously accepts user input. Each user message is appended to the message history and sent to the Llama-4-Scout-17B-16E-Instruct model via the client.chat.completions.create() API call. The model’s response is printed in green using Colorama and added back to the message history, allowing the multi-turn dialogue to continue until the user types "exit" or "quit".
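If you’d rather see the response appear token by token instead of waiting for the full completion, the same endpoint also supports streaming. Here’s a minimal sketch of how the API call inside the loop could be adapted, assuming the same client, messages list, and Colorama setup as above:
chat_response = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages=messages,
    stream=True,  # receive the reply as a stream of partial chunks
)

assistant_message = ""
print(f"{Fore.GREEN}Assistant: ", end="")
for chunk in chat_response:
    delta = chunk.choices[0].delta.content or ""  # each chunk carries a small piece of text
    assistant_message += delta
    print(delta, end="", flush=True)
print(f"\n{Style.RESET_ALL}")
messages.append({"role": "assistant", "content": assistant_message})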
In the next section, we’ll explore how to run multimodal inference with image + text inputs using the same API.
Multimodal Image and Text Inference With LLaMA 4 Scout
In this demo, we’ll explore how to send both image and text inputs to the LLaMA 4 Scout model hosted via vLLM using its native multimodal capabilities. This allows you to perform visual reasoning, image captioning, or multimodal Q&A—all through a single OpenAI-compatible API.
Step 1: Prerequisites
This example uses the OpenAI SDK to interface with the vLLM server. If not already installed, you can run:
pip install openai
Then import the required module:
from openai import OpenAI
Step 2: Set up the API client
Connect to your local vLLM server using the OpenAI-compatible interface. If your server doesn't require authentication, use "EMPTY" as the API key.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(
api_key=openai_api_key,
base_url=openai_api_base,
)
The above code snippet initializes an OpenAI-compatible client for local inference using the vLLM server. The api_key is set to "EMPTY" (no auth required), and base_url points to the local vLLM API endpoint.
Step 3: Submit a multimodal prompt
We'll now send a chat prompt that includes both an image and a text instruction. LLaMA 4 Scout will process the image alongside the query and return a descriptive response. First, point image_url1 at a publicly accessible image, then build the message payload:
# Publicly accessible image to analyze (replace with your own image URL)
image_url1 = "https://example.com/sample-image.jpg"

messages = [
{
"role": "user",
"content": [
{"type": "image_url", "image_url": {"url": image_url1}},
{"type": "text", "text": "Can you describe what's in this image?"}
]
}
]
chat_response = client.chat.completions.create(
model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
messages=messages,
)
print("Response:", chat_response.choices[0].message.content)
Here, we construct a chat-style input containing both an image URL and a text prompt, formatted according to the OpenAI chat schema. The client.chat.completions.create() method sends this multimodal request to the model, which processes the image and accompanying text and generates a contextual response.
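If your image lives on disk rather than at a public URL, the same chat schema also accepts base64-encoded data URLs. Below is a minimal sketch of that variant, assuming a local file named sample.jpg (the file name and prompt are placeholders) and the same client as above:
import base64

# Read a local image and encode it as a base64 data URL
with open("sample.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            {"type": "text", "text": "Can you describe what's in this image?"},
        ],
    }
]

chat_response = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages=messages,
)
print("Response:", chat_response.choices[0].message.content)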
Conclusion
In this tutorial, we hosted Meta’s LLaMA 4 Scout model using vLLM on RunPod, making it accessible via OpenAI-compatible endpoints for both text and multimodal inference. By combining vLLM’s high throughput and RunPod’s powerful infrastructure, we created a cost-effective setup for serving state-of-the-art LLMs with long context and vision capabilities.
To learn more about LLaMA 4 and vLLM, check out:

I am a Google Developers Expert in ML (Gen AI), a 3x Kaggle Expert, and a Women Techmakers Ambassador with 3+ years of experience in tech. I co-founded a health-tech startup in 2020 and am pursuing a master's in computer science at Georgia Tech, specializing in machine learning.