Meta's Llama 4: Features, Access, How It Works, and More
Meta has just announced the Llama 4 suite of models, which includes two released models—Llama 4 Scout and Llama 4 Maverick—and a third one still in training: Llama 4 Behemoth.
The Scout and Maverick variants are available now, openly released under Meta’s typical open-weight license—with one notable caveat: if your services exceed 700 million monthly active users, you’re required to obtain a separate license from Meta, which it may or may not grant at its discretion.
Llama 4 Scout supports a 10 million-token context window, the largest of any publicly released model. Llama 4 Maverick is a generalist model that takes aim at GPT-4o, Gemini 2.0 Flash, and DeepSeek-V3. Llama 4 Behemoth, still in training, serves as a high-capacity teacher model.
In this introductory blog, I’ll give you an overview of the Llama 4 suite: what each model does, how the models compare on benchmarks, and how you can access them.
What Is Llama 4?
Llama 4 is Meta’s new family of large language models. The release includes two models already available—Llama 4 Scout and Llama 4 Maverick—and a third, Llama 4 Behemoth, still in training.
Source: Meta AI
Llama 4 introduces substantial enhancements. Notably, it incorporates a mixture-of-experts (MoE) architecture, which aims to improve efficiency and performance by activating only the parameters needed for a given input (more on this in a bit). This design represents a shift towards more scalable and specialized AI models.
Llama 4 continues Meta’s strategy of releasing open-weight models—but with a caveat. If your company operates services with more than 700 million monthly active users, you’ll need a separate license from Meta, which may or may not be granted. That limitation aside, the release still feels like a major event in the open-weight landscape, though the landscape itself has changed rapidly in the past few months.
If Llama 2 and 3 once defined the category, Llama 4 now enters a field that’s far more competitive. DeepSeek has arrived with strong reasoning capabilities. Alibaba’s Qwen series has performed well across multilingual and coding benchmarks. Google’s Gemma models are pushing into the same space with smaller, efficient architectures. And just days ago, OpenAI announced plans to release an open-weight model, a shift that would have seemed unlikely a year ago.
Let’s find out more details about each model.
Llama 4 Scout
Llama 4 Scout is the lighter-weight model in the new suite, but it’s arguably the most intriguing. It runs on a single H100 GPU and supports a 10 million-token context window, the longest of any openly released model to date. That potentially makes it the most useful model for tasks like multi-document summarization, long-form code reasoning, and parsing large volumes of user activity.
Scout has 17 billion active parameters, organized through 16 experts, with a total parameter count of 109 billion. It was pre-trained and post-trained with a 256K context window, but Meta says it generalizes well far beyond that (this claim remains to be tested). In practice, that opens the door to workflows involving entire codebases, session histories, or legal documents—all processed in a single forward pass.
Architecturally, Scout is built using Meta’s mixture-of-experts (MoE) framework, in which only a subset of parameters activates per token, as opposed to a dense model, where every parameter is activated for every token. That keeps per-token compute low while still scaling total capacity.
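To make the routing idea concrete, here’s a minimal, illustrative sketch of top-k expert routing in PyTorch. This is a toy layer, not Meta’s implementation; the hidden size, expert count, and top-k value are all assumptions. With 16 experts and only one or a few selected per token, most of the layer’s weights sit idle on any given token, which is how Scout can have 109 billion total parameters but only 17 billion active.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    """Toy mixture-of-experts layer: a router scores experts per token and
    only the top-k experts run, so a fraction of total parameters is active
    per token. Illustrative only; not Meta's Llama 4 implementation."""

    def __init__(self, d_model=512, n_experts=16, top_k=1):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)  # per-token expert scores
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):  # x: (n_tokens, d_model)
        weights, idx = self.router(x).topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)        # normalize chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e               # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out

layer = ToyMoELayer()
tokens = torch.randn(8, 512)   # 8 tokens, hidden size 512
print(layer(tokens).shape)     # torch.Size([8, 512])
```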
Beyond the core architecture, Meta emphasized Scout’s multimodal capabilities. It was pre-trained on text, image, and video data using early fusion, allowing it to handle combinations of text and visual prompts natively. In image-heavy tasks like visual grounding and VQA (visual question answering), Scout performs better than any previous Llama model—and holds its own against much larger systems.
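Early fusion means visual inputs are mapped into the same token sequence as text from the very start of the network, rather than attached through a separate adapter later. Here’s a minimal sketch of the idea, with every dimension and module name assumed:

```python
import torch
import torch.nn as nn

# Toy early-fusion front end: project image patch features into the same
# embedding space as text tokens, then concatenate into one sequence that
# a single transformer backbone would process. Illustrative only.
d_model = 512
text_embed = nn.Embedding(32000, d_model)   # assumed vocabulary size
patch_proj = nn.Linear(768, d_model)        # assumed vision feature dim

text_ids = torch.randint(0, 32000, (1, 20))   # 20 text tokens
image_patches = torch.randn(1, 64, 768)       # 64 image patch features

fused = torch.cat([patch_proj(image_patches),   # image tokens
                   text_embed(text_ids)], dim=1)
print(fused.shape)  # torch.Size([1, 84, 512]) -> one mixed-modality sequence
```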
In short, Scout is built for breadth and scale. It’s designed to run efficiently, handle more input than any previous open model, and operate well across both text and image tasks. We’ll test that 10M context limit soon—and we’ll report back.
Llama 4 Maverick
Llama 4 Maverick is the generalist in the lineup—a full-scale, multimodal model built for performance across chat, reasoning, image understanding, and code. While Scout pushes the limits of context length, Maverick focuses on balanced, high-quality output across tasks. It’s Meta’s answer to GPT-4o, DeepSeek-V3, and Gemini 2.0 Flash.
Maverick has the same 17 billion active parameters as Scout, but with a larger MoE configuration: 128 experts and a total parameter count of 400 billion. Like Scout, it uses a mixture-of-experts architecture, which activates only part of the model per token—reducing inference cost while scaling capacity. The model runs on a single H100 DGX host, but can also be deployed with distributed inference for larger-scale applications.
Meta took a different approach to post-training here, using a mix of lightweight supervised fine-tuning, online reinforcement learning, and direct preference optimization. The goal was to sharpen performance on hard prompts without overconstraining the model. To that end, Meta filtered out over 50% of training examples marked “easy” by earlier Llama models and built a curriculum that emphasized harder reasoning, coding, and multimodal tasks.
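Meta hasn’t published the exact recipe, but direct preference optimization itself is well documented: the model is trained to widen the log-probability gap between a preferred and a rejected response, relative to a frozen reference model. Here’s a minimal sketch of the standard DPO loss, with all inputs assumed:

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss: push the policy to prefer the chosen response
    over the rejected one, relative to a frozen reference model.
    Inputs are summed log-probabilities of each full response."""
    chosen_ratio = logp_chosen - ref_logp_chosen
    rejected_ratio = logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Assumed log-probabilities for a batch of 3 preference pairs:
loss = dpo_loss(torch.tensor([-12.0, -8.5, -20.1]),
                torch.tensor([-14.2, -9.9, -25.0]),
                torch.tensor([-12.5, -8.7, -21.0]),
                torch.tensor([-13.0, -9.5, -23.5]))
print(loss)
```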
Maverick was also co-distilled from Llama 4 Behemoth, Meta’s much larger internal model, which helped boost performance without adding training cost. According to Meta, this distillation pipeline produced a noticeable jump in reasoning and chat quality.
Llama 4 Behemoth
Llama 4 Behemoth is Meta’s most powerful and largest model to date—but it’s not available yet. Still in training, Behemoth is not a reasoning model in the same sense as DeepSeek-R1 or OpenAI’s o3, which are built and optimized for multi-step chain-of-thought tasks.
Based on what we know so far, it also doesn’t seem designed as a product for direct use. Instead, it acts as a teacher model, used to distill and shape both Scout and Maverick. Once released, it could allow others to distill their own models as well.
Behemoth has 288 billion active parameters, organized through 16 experts, with a total parameter count nearing 2 trillion. Meta built an entirely new training infrastructure to support Behemoth at this scale. It introduced asynchronous reinforcement learning, curriculum sampling based on prompt difficulty, and a new distillation loss function that dynamically balances soft and hard targets.
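Meta hasn’t detailed the new distillation loss, but it builds on a classic formulation: blend cross-entropy on the ground-truth (hard) labels with KL divergence against the teacher’s softened (soft) output distribution. Here’s a minimal sketch with the mixing weight and temperature as fixed assumptions, whereas Meta’s version reportedly adjusts the balance dynamically during training:

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, labels, alpha=0.5, T=2.0):
    """Classic knowledge-distillation loss: cross-entropy on hard labels
    blended with KL divergence to the teacher's temperature-softened
    distribution. alpha and T are fixed assumptions here."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    return alpha * hard + (1 - alpha) * soft

student = torch.randn(4, 32000)   # assumed vocab size, batch of 4 tokens
teacher = torch.randn(4, 32000)
labels = torch.randint(0, 32000, (4,))
print(distill_loss(student, teacher, labels))
```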
Post-training Behemoth also required a different recipe. Meta discarded over 95% of SFT examples to narrow in on hard prompts and focused reinforcement learning on complex reasoning, coding, and multilingual scenarios. Sampling from varied system instructions helped the model generalize, while dynamic filtering removed low-value prompts during RL training.
Llama 4 Benchmarks
Meta released internal benchmark results for each of the Llama 4 models, comparing them to both previous Llama variants and several competing open-weight and frontier models.
In this section, I’ll walk you through the benchmark highlights for Scout, Maverick, and Behemoth, using Meta’s own numbers. As always, I encourage caution with self-reported benchmarks—but these scores offer a helpful first look at how each model performs across different tasks and where they stand in the current landscape. Let’s start with Scout.
Llama 4 Scout benchmarks
Llama 4 Scout performs well across a mix of reasoning, coding, and multimodal benchmarks—especially considering its smaller active parameter count and single-GPU footprint.
Source: Meta AI
On image understanding, Scout edges out competitors: it scores 88.8 on ChartQA and 94.4 on DocVQA (test), outperforming Gemini 2.0 Flash-Lite (73.0 and 91.2, respectively) and matching or slightly beating Mistral 3.1 and Gemma 3 27B.
In image reasoning benchmarks like MMMU (69.4) and MathVista (70.7), it also leads the open-weight pack, outperforming Gemma 3 (64.9, 67.6), Mistral 3.1 (62.8, 68.9), and Gemini Flash-Lite (68.0, 57.6).
In coding, Scout scores 32.8 on LiveCodeBench, putting it ahead of Gemini Flash-Lite (28.9) and Gemma 3 27B (29.7), though slightly behind Llama 3.3’s 33.3. It’s not a coding-first model, but it holds its own.
On knowledge and reasoning, Scout hits 74.3 on MMLU Pro and 57.2 on GPQA Diamond, outperforming every other open-weight model on both. These benchmarks favor long-form multi-step reasoning, so Scout’s strong performance here is notable, particularly at this scale.
Finally, Scout’s long-context capabilities show real-world potential. On MTOB (Machine Translation from One Book), which tests the model’s ability to translate between English and Kalamang, a low-resource language, it scores 42.2/36.6 on the half-book test and 39.7/36.3 on the full-book test. On the half-book test, Gemini 2.0 Flash-Lite edges slightly ahead with 42.3, but Scout closes the gap on the full-book test, outperforming Gemini’s 35.1/30.0.
Llama 4 Maverick benchmarks
Maverick is the most well-rounded model in the Llama 4 lineup—and the benchmark results reflect that. While it doesn’t aim for the context length extremes of Scout or the raw scale of Behemoth, it performs consistently across every category that matters: multimodal reasoning, coding, language understanding, and long-context retention.
Source: Meta AI
In image reasoning, Maverick scores 73.4 on MMMU and 73.7 on MathVista, outperforming Gemini 2.0 Flash (71.7 and 73.1) and GPT-4o (69.1 and 63.8). On ChartQA (image understanding), it scores 90.0, slightly above Gemini’s 88.3 and well above GPT-4o’s 85.7. In DocVQA, Maverick hits 94.4, matching Scout and outperforming GPT-4o’s 92.8.
In coding, Maverick scores 43.4 on LiveCodeBench, placing it above GPT-4o (32.3), Gemini Flash (34.5), and close to DeepSeek v3.1’s 45.8.
On reasoning and knowledge, Maverick scores 80.5 on MMLU Pro and 69.8 on GPQA Diamond, again outperforming Gemini Flash (77.6 and 60.1) and GPT-4o (no reported MMLU Pro, 53.6 on GPQA). DeepSeek v3.1 leads by a 0.7 margin in MMLU Pro.
Maverick also performs well in multilingual understanding, scoring 84.6 on Multilingual MMLU, slightly above Gemini’s 81.5. That gives it an edge for developers working across multiple languages or geographies.
In long-context evaluations (MTOB), Maverick scores 54.0/46.4 on the half-book test and 50.8/46.7 on the full-book—significantly ahead of Gemini’s 48.4/39.8 and 45.5/39.6, respectively. These scores suggest that while Maverick doesn’t advertise its context length as loudly as Scout, it still benefits meaningfully from its extended window.
Llama 4 Behemoth benchmarks
Behemoth isn’t released yet, but its benchmark numbers are worth paying attention to.
Source: Meta AI
On STEM-heavy benchmarks, Behemoth performs exceptionally well. It scores 95.0 on MATH-500—that’s higher than Gemini 2.0 Pro (91.8) and significantly above Claude Sonnet 3.7 (82.2). On MMLU Pro, Behemoth scores 82.2, while Gemini Pro comes in at 79.1 (Claude has no reported score). And on GPQA Diamond, another benchmark that rewards factual depth and precision, Behemoth reaches 73.7, ahead of Claude (68.0), Gemini (64.7), and GPT-4.5 (71.4).
In multilingual understanding, Behemoth scores 85.8 on Multilingual MMLU, slightly outperforming Claude Sonnet (83.2) and GPT-4.5 (85.1). These scores matter for global developers working outside English, and Behemoth currently leads this category.
On image reasoning, Behemoth hits 76.1 on MMMU, topping Gemini (71.8), Claude (72.7), and GPT-4.5 (74.4). While this isn’t its main focus, it still performs competitively with leading multimodal models.
In code generation, Behemoth scores 49.4 on LiveCodeBench. That’s well above Gemini 2.0 Pro (36.0).
How to Access Llama 4
Both Llama 4 Scout and Llama 4 Maverick are available now under Meta’s open-weight license. You can download them directly from the official Llama website or through Hugging Face.
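If you work in Python, the checkpoints load through the standard Hugging Face transformers interface. Access is gated, so you’ll need an approved request and a login token first. Here’s a minimal sketch; verify the exact repo ID on the official model card:

```python
# Minimal sketch of loading a Llama 4 checkpoint from Hugging Face.
# Assumes approved access to the gated repo and a local login
# (huggingface-cli login). Verify the repo ID on the official model card.
import torch
from transformers import pipeline

chat = pipeline(
    "text-generation",
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",  # check the model card
    device_map="auto",            # spread weights across available GPUs
    torch_dtype=torch.bfloat16,   # halves memory versus float32
)

messages = [{"role": "user",
             "content": "Summarize Llama 4 Scout in one sentence."}]
print(chat(messages, max_new_tokens=64)[0]["generated_text"])
```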
To access the models through Meta’s own services, you can interact with Meta AI on several platforms: WhatsApp, Messenger, Instagram, and Facebook. Access currently requires logging in with a Meta account, and there’s no standalone API endpoint for Meta AI—at least not yet.
If you’re planning to integrate the models into your own applications or infrastructure, keep in mind the licensing clause: if your product or service has more than 700 million monthly active users, you’ll need to obtain separate permission from Meta. The models are otherwise usable for research, experimentation, and most commercial use cases.
Conclusion
Scout introduces unprecedented context length on a single GPU. Maverick holds its own against larger models across reasoning, code, and multimodal tasks. And Behemoth, still in training, offers a glimpse into how teacher models can shape more efficient and deployable variants.
The open-weight space is more competitive than ever. DeepSeek, Qwen, Gemma, and soon OpenAI are all pushing forward with strong releases. Llama 4 arrives as a continuation of Meta’s ongoing effort to offer scalable, openly available models for a range of use cases.
FAQs
Is there an API for Llama 4?
Meta hasn’t released an official API for Llama 4. However, third-party providers may offer API access to Llama 4.
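Many of those third-party hosts expose OpenAI-compatible endpoints, so access typically looks like the hypothetical sketch below; the base URL, key, and model ID are placeholders for whatever your provider actually documents:

```python
# Hypothetical example of calling Llama 4 through a third-party host with an
# OpenAI-compatible API. The base_url, key, and model ID are placeholders;
# substitute the values your provider documents.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.example-provider.com/v1",  # placeholder endpoint
    api_key="YOUR_PROVIDER_KEY",                     # placeholder key
)

response = client.chat.completions.create(
    model="llama-4-maverick",  # placeholder model ID
    messages=[{"role": "user", "content": "What's new in Llama 4?"}],
)
print(response.choices[0].message.content)
```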
Can I fine-tune Llama 4 Scout or Maverick on my own data?
Yes. Both models are open-weight, so you can fine-tune them on your own data; one common approach is sketched below.
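Parameter-efficient methods like LoRA are a popular starting point because they train only small adapter matrices instead of all 109B or 400B parameters. Here’s a minimal sketch using the peft library; the repo ID and target module names are assumptions:

```python
# Sketch of attaching LoRA adapters to a Llama 4 checkpoint for
# parameter-efficient fine-tuning. The repo ID and target module names are
# assumptions; confirm them against the model card and architecture.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-4-Scout-17B-16E-Instruct",  # assumed repo ID
    device_map="auto",
)
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # only the small adapter weights train
```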
Can I run Llama 4 models locally?
You can run Scout locally if you have access to a high-end GPU (like an A100 or H100). Maverick is significantly larger and typically requires multiple GPUs or distributed infrastructure. For lightweight testing, quantized versions of Scout may be viable on consumer hardware using tools like llama.cpp or vLLM.
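As one example, a local vLLM setup might look like the sketch below. The repo ID is illustrative, and the context length is kept deliberately small, since anything near the 10M-token limit requires far more memory than consumer hardware offers:

```python
# Sketch of local inference with vLLM. Assumes enough GPU memory for the
# checkpoint you choose; the repo ID is illustrative.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",  # assumed repo ID
    tensor_parallel_size=1,   # increase to shard across multiple GPUs
    max_model_len=8192,       # modest window; 10M tokens needs far more memory
)
params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain mixture-of-experts in two sentences."], params)
print(outputs[0].outputs[0].text)
```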
What are the hardware requirements for Llama 4 Scout?
Scout is designed to fit on a single H100 GPU. That said, depending on context length and batch size, you may be able to run quantized versions on lower-tier GPUs like the A100, or even an RTX 4090, with reduced performance.
Is Llama 4 multilingual?
Yes—both Maverick and Behemoth perform strongly on multilingual benchmarks like Multilingual MMLU. While Meta hasn’t released detailed language breakdowns, early benchmarks suggest solid performance across major non-English languages.
Can I use Llama 4 in commercial products?
Yes, unless your company or product exceeds 700 million monthly active users, in which case you’ll need to obtain a special license from Meta. For most startups, researchers, and individual developers, the standard license applies.
Can I distill my own model from Llama Behemoth?
Not yet. Behemoth hasn’t been released, and there’s no indication when Meta will make it publicly available. That said, Meta used Behemoth internally to distill Scout and Maverick—so if released, it could serve as a foundation for further distillations.
What’s the difference between Llama 3.1, Llama 3.3, and Llama 4?
Llama 3.1 and 3.3 were dense models with limited or no multimodal support. Llama 4 moves to a mixture-of-experts architecture and adds native multimodal training. Scout and Maverick also include longer context windows and improved post-training techniques.