
Muse Spark: Features, Benchmarks, and How to Use It

After a long quiet, Meta is back with a new model, a new lab, and a phrase it really wants you to remember. Learn about Muse Spark, its features, and more.
Apr 8, 2026  · 14 min read

We had been writing articles on Meta's Llama models at a steady clip (Llama 2, Llama 3, and so on). Then Llama 4 landed in April 2025 to widespread criticism, with multiple outlets and the company's own departing AI chief confirming that benchmark results had been manipulated using specialized sub-models that were never released to the public. 

After that, the updates stopped coming. Around the same time, Meta announced it was moving Horizon Worlds to mobile-only, effectively ending the VR version on which it once staked the company's future. It looked like a company losing its footing on two fronts at once.

On April 8, 2026, Meta launched Muse Spark, the first model from Meta Superintelligence Labs. The press release uses the phrase "personal superintelligence" a few too many times. Strip that away, and there is a real model underneath that puts Meta back in the conversation at the frontier level.

What Is Muse Spark?

Muse Spark is a natively multimodal reasoning model that handles text, images, audio, and tool use in a single architecture. It supports visual chain-of-thought, meaning the model can work through image-based problems step-by-step rather than just producing a single answer. Multi-agent orchestration is also part of the setup, which we will get to.

Earlier Llama models returned answers based on pattern matching from training. Muse Spark works through problems before responding. That is the actual shift.

Who is behind it?

Meta Superintelligence Labs, or MSL, was formed on June 30, 2025, when Mark Zuckerberg reorganized the company's AI operations. Alexandr Wang, former CEO of Scale AI, came on as Chief AI Officer; Meta had invested roughly $14 billion in Scale AI as part of the deal. 

Nat Friedman, former CEO of GitHub, leads the product and applied research side, and Shengjia Zhao, who co-created GPT-4 and o1 at OpenAI (the same o1 that Muse Spark is now benchmarked against), is Chief Scientist.

There is a third factor here worth naming: Yann LeCun, Meta's longtime Chief AI Scientist and the company's most visible open-source advocate, left in November 2025. His departure followed organizational changes that limited his role and the team's shift toward closed-source development.

What’s New with Muse Spark (And Why Should We Care)?

The headline features are reasoning modes, a rebuilt training pipeline, and a deliberate focus on health. Let's take them in order.

Three reasoning modes

Muse Spark offers three ways to interact with it, and the distinction between them is worth understanding before you try the model.

  • Instant is the default for casual queries. It responds quickly without extended reasoning, similar to what you would get from a standard chat model.
  • Thinking uses extended chain-of-thought reasoning. The model takes more time, works through intermediate steps, and generally performs better on harder problems. This is the mode behind most of the benchmark results in this article.
  • Contemplating is the most interesting one. More on that directly below.

One thing to know upfront: Contemplating mode is rolling out gradually and was not available to all users on launch day. If you do not see it yet, that is expected.

Contemplating mode

Contemplating mode spins up multiple reasoning agents that work in parallel, then combines their outputs into a single response. Where Gemini's Deep Think and OpenAI's GPT Pro mode scale reasoning by thinking longer, Muse Spark scales it by thinking wider. More agents are working simultaneously rather than one agent working for longer.

Meta's argument is that this approach produces comparable results with lower latency, since the agents run in parallel rather than sequentially. Independent confirmation of the latency claims is not yet available, but Contemplating mode does lead on several hard evaluations in Meta's own numbers (more on those shortly).

This is an inference-time feature, not an architectural one. The model itself does not change.
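There is no public API yet, so you cannot call Contemplating mode yourself, but the "thinking wider" idea is easy to sketch. The snippet below is a minimal illustration only: the ask_agent stub stands in for a real model call, and a simple majority vote stands in for whatever aggregation Meta actually uses. None of these names or choices come from Meta.

from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def ask_agent(question: str, seed: int) -> str:
    # Stand-in for one reasoning agent; a real system would call a model endpoint here.
    return f"placeholder answer for seed {seed}"

def contemplate(question: str, n_agents: int = 8) -> str:
    # "Thinking wider": several agents reason over the same question in parallel...
    with ThreadPoolExecutor(max_workers=n_agents) as pool:
        answers = list(pool.map(lambda s: ask_agent(question, s), range(n_agents)))
    # ...then their outputs are combined into one response; majority vote is the
    # simplest possible combiner and is only a placeholder here.
    return Counter(answers).most_common(1)[0][0]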

Reinforcement learning scaling and thought compression

Meta rebuilt its training stack from scratch over the nine months it took to develop Muse Spark. The reinforcement learning (RL) claims in particular come from Meta's own technical blog and have not been independently verified.

The more interesting detail is a technique the research team calls thought compression. During RL training, the model is rewarded for correct answers but also penalized for thinking time, which in practice means the number of output tokens it spends reasoning. This creates a three-phase behavior on complex tasks like math problems.

First, the model improves by thinking longer. Then, the length penalty kicks in and forces the model to solve the same problems with far fewer tokens. At some point, it extends its reasoning again and pushes past previous performance ceilings while using fewer tokens.
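Meta has not published the exact objective, but the described trade-off maps onto a familiar reward shape: a bonus for correctness minus a per-token cost. The function below is a rough illustration only; the penalty weight is invented, not a published value.

def thought_compression_reward(is_correct: bool, n_output_tokens: int,
                               length_penalty: float = 0.001) -> float:
    # Correct answers earn reward, but every reasoning token costs a little,
    # so the model is pushed toward reaching the right answer with fewer tokens.
    return (1.0 if is_correct else 0.0) - length_penalty * n_output_tokens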

The practical result: the model learned to do more with less. That claim rests on Meta's own training curves, which have not been independently validated.

10x less compute

Meta claims its new architecture matches Llama 4 Maverick's performance with ten times less training compute. That is about architectural efficiency, not Muse Spark's ceiling. Llama 4 Maverick scored 18 on the Artificial Analysis Intelligence Index. Muse Spark scored 52.

The token efficiency numbers from Artificial Analysis's independent run point in the same direction. Muse Spark used 58 million output tokens. GPT-5.4 used 120 million. Claude Opus 4.6 used 157 million.

Health: a deliberate focus

Health is Muse Spark's clearest benchmark advantage, and it is deliberate. Meta worked with more than 1,000 physicians to curate training data for health reasoning. 

The model can generate interactive displays covering nutritional content, drug information, and exercise physiology. On HealthBench Hard, Muse Spark scored 42.8 versus GPT-5.4's 40.1 and Gemini 3.1 Pro's 20.6. That gap against Gemini holds under independent evaluation.

This is clearly Meta's answer to ChatGPT Health. Meta's argument for why it can compete is its 3 billion users' worth of social context, which should give it an edge in understanding how people actually ask health questions. Whether that holds for complex or unusual queries, rather than the everyday ones that populate benchmarks, is worth watching.

What About Llama?

The developer community is asking one thing, and it deserves a direct answer.

Muse Spark is not open-source. Every Llama model through Llama 4 shipped with weights that developers could download and run locally. Communities like r/LocalLLaMA were built on that. That use case is gone.

Meta's stated reason is partly competitive: Chinese labs, including DeepSeek, used Llama weights to accelerate their own research. Wang has said the company "hopes" to open-source future Muse models, with no timeline attached. "Hopes" is doing a lot of work in that sentence.

The Llama team was moved into Wang's lab, and Llama 4 was the last model from the old structure. Whether Llama continues alongside Muse or quietly phases out, Meta has not said.

Muse Spark Benchmark Results

Benchmarks are tricky with Muse Spark for a reason worth naming up front. Given Meta's Llama 4 history, keep the self-reported and independently verified numbers separate.

Here are the Thinking mode results, which are where most of the fair comparison data live.

Full benchmark comparison table for Muse Spark Thinking mode versus Opus 4.6, Gemini 3.1 Pro, GPT-5.4, and Grok 4.2 across multimodal, reasoning, health, and agentic categories.

Source: Meta Superintelligence Labs / ai.meta.com

Contemplating mode has a separate set of results for the hardest evaluations. It leads on Humanity's Last Exam and FrontierScience Research, but trails GPT-5.4 Pro and Gemini 3.1 Deep Think on IPhO 2025 Theory physics problems. All of these numbers come from Meta's own reporting, so read them as directionally interesting rather than settled.

Source: Meta Superintelligence Labs / ai.meta.com

The independent picture from Artificial Analysis is more measured. They placed Muse Spark fourth on their Intelligence Index, behind Gemini 3.1 Pro Preview, GPT-5.4, and Claude Opus 4.6. Still top five globally. The numbers also make the weak spots obvious: ARC-AGI-2 and Terminal-Bench 2.0 are worth paying attention to if coding or abstract reasoning matters to your use case.

Testing Muse Spark

With those benchmark scores in mind, let's put Muse Spark to the test. I will examine the model on multi-step reasoning, image understanding, and code debugging.

Test 1: Fibonacci–binary logic chain (cascading reasoning stability)

In my first test, I will target the advanced reasoning capabilities of Muse Spark in a multi-step exercise. The model needs to:

  • Identify the correct Fibonacci term
  • Convert it accurately to binary
  • Count bits precisely
  • Generate primes in a computed range
  • Perform a large summation

The prompt used was:

Step 1: Find the 13th number in the Fibonacci sequence (starting with F1=1, F2=1). Let this be X.
Step 2: Convert X into a binary string (Base 2).
Step 3: Count the number of '1's in that binary string. Let this count be C.
Step 4: Identify all prime numbers (p) such that 20 ≤ p ≤ (C × 100).
Step 5: Calculate the sum of these primes. What is the final result?

Muse Spark did very well and solved the exercise correctly on the first try. That's especially impressive given that GPT-5.4 failed at the last step and only succeeded once that step was split into two (listing the prime numbers, then adding them).
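If you want to check the chain yourself without trusting any model, the whole exercise fits in a few lines of Python:

# Step 1: 13th Fibonacci number (F1 = 1, F2 = 1)
a, b = 1, 1
for _ in range(11):  # F1 and F2 are given, so 11 more steps reach F13
    a, b = b, a + b
x = b

# Steps 2-3: binary string and count of '1' bits
binary = bin(x)[2:]
c = binary.count("1")

# Steps 4-5: primes in [20, c * 100] and their sum
def is_prime(n):
    return n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1))

primes = [p for p in range(20, c * 100 + 1) if is_prime(p)]
print(x, binary, c, sum(primes))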

Test 2: Image understanding and sales reasoning on a multi-line time-series chart

Meta claims that Muse Spark is great at understanding complex images, so I’m using the following multi-line time-series chart to see if it can identify patterns and turn them into useful suggestions.

This is the prompt:

Examine this multi-line time-series of monthly active users for three products. Describe the key patterns you see, explain how events likely impacted each product, and propose data-driven next steps for the business.

Muse Spark code chart recognition response: key patterns by product

Muse Spark identified all the patterns correctly, which implies that the image recognition works well.

Muse Spark code chart recognition response: effect of annotated events

The data used was randomly fabricated, so there's no clear right or wrong answer here. That said, Muse Spark identified all the events, reasoned across the different products and time periods for each one, and came to sensible conclusions. It even analyzed changes in the combined monthly active users (MAU) of product combinations without being prompted for it, which is a nice addition.

Muse Spark code chart recognition response: data-driven next steps

All the suggested next steps are aligned with the analyses about product MAU patterns and event effects. Muse Spark identified the most important theme for each product (launch playbook for A, pricing for B, scaling for C), and came up with specific actions that make sense.
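If you want to run a similar test yourself, a synthetic chart like the one I used takes only a few lines to fabricate. All numbers, product names, and event labels below are made up for illustration.

import numpy as np
import matplotlib.pyplot as plt

# Fabricated monthly active users (in thousands) for three products over 24 months
rng = np.random.default_rng(0)
months = np.arange(24)
series = {
    "Product A": 50 + 8 * months + rng.normal(0, 10, 24),            # steady post-launch growth
    "Product B": 200 - 3 * months + rng.normal(0, 15, 24),           # slow decline
    "Product C": 20 * np.exp(0.12 * months) + rng.normal(0, 8, 24),  # accelerating growth
}

plt.figure(figsize=(9, 4))
for name, values in series.items():
    plt.plot(months, values, label=name)

# Annotated events for the model to reason about (dates and labels are invented)
for month, label in [(4, "A: launch campaign"), (10, "B: price increase"), (16, "C: feature release")]:
    plt.axvline(month, linestyle="--", color="grey")
    plt.annotate(label, (month, plt.ylim()[1]), rotation=90, va="top", fontsize=8)

plt.xlabel("Month")
plt.ylabel("MAU (thousands)")
plt.legend()
plt.tight_layout()
plt.savefig("mau_chart.png", dpi=150)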

Test 3: Code debugging

Finally, I’ll test Muse Spark’s skills in diagnosing code bugs. The test is designed to show whether the model only traces code correctness line-by-line or is also able to detect underlying flaws.

The prompt:

A developer wrote this Python function to compute a running average: 

def running_average(data, window=3): 
    result = [] 
    for i in range(len(data)): 
        start = max(0, i - window + 1) 
        chunk = data[start:i + 1] 
        result.append(round(sum(chunk) / window, 2)) 
    return result 
When called with running_average([10, 20, 30, 40, 50]), the first two values in the output seem wrong. Why? Please help me fix what is wrong!

The function always divides by window (3), even in the beginning, when the chunk has fewer than 3 elements. The buggy output is [3.33, 10.0, 20.0, 30.0, 40.0], but the first two values should be 10.0 and 15.0 since those chunks contain only 1 and 2 elements, respectively. The fix is changing / window to / len(chunk).
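For reference, applying that one-line fix gives:

def running_average(data, window=3):
    result = []
    for i in range(len(data)):
        start = max(0, i - window + 1)
        chunk = data[start:i + 1]
        # Divide by the actual chunk size so partial windows at the start average correctly
        result.append(round(sum(chunk) / len(chunk), 2))
    return result

print(running_average([10, 20, 30, 40, 50]))  # [10.0, 15.0, 20.0, 30.0, 40.0]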

Models often trace through the loop perfectly, but then report that the output looks "correct." They see the math happening step by step and don't flag that dividing a single element by 3 doesn't make sense. Catching the bug requires the model to hold intent (what a running average should do) alongside execution (what the code actually does) and spot the gap between them.

Muse Spark code debugging response

Muse Spark identified a running average as the intent and spotted the mistake. It suggested the right change and explained why it is necessary. It even suggested another option in case partial windows should be skipped entirely.

Overall, the model passed all three tests perfectly and made a good first impression.

How Can I Access Muse Spark?

You can access Muse Spark at meta.ai or through the Meta AI app on iOS and Android. Both are free. The initial rollout is US-first, with expansion to other regions described as coming in the following weeks.

meta.ai user interface and Muse Spark model mode selection

Meta plans to roll it out across WhatsApp, Instagram, Facebook, Messenger, and its Ray-Ban AI glasses over the same timeframe.

There is no public API. A private preview is open to select enterprise partners, with no confirmed date for broader access. On privacy: Meta's policy sets few limits on how conversations can be used to improve its models. If you plan to share sensitive information, read the terms first.

Where Muse Spark Falls Short

Meta said it directly in its technical blog: the model has gaps in multi-step agent tasks and coding workflows.

On SWE-Bench Verified, the gap against Gemini and Opus 4.6 is small. It opens up in agentic work: Terminal-Bench 2.0 (59.0 vs. GPT-5.4's 75.1) and GDPval-AA office automation (1,444 vs. GPT-5.4's 1,672). Those are not close. 

Abstract visual reasoning follows the same pattern: ARC-AGI-2 is 42.5 for Muse Spark against mid-70s for both GPT-5.4 and Gemini. The model that leads on chart reading trails badly on novel visual patterns.

That last one drew a response on launch day. François Chollet, co-founder of ARC Prize and creator of Keras and ARC-AGI, called the model "overoptimized for public benchmark numbers at the detriment of everything else." Wang replied, acknowledged the ARC-AGI-2 gap, and pointed to positive user feedback on visual coding and reasoning. Whether that holds under broader use is still an open question.

The missing public API, as I covered earlier, is a competitive gap on top of that. Wang acknowledged it on launch day: "There are certainly rough edges we will polish over time in model behavior."

Muse Spark Safety 

Meta conducted evaluations under its Advanced AI Scaling Framework before launch. On BioTIER-refuse, Muse Spark leads the comparison set for bioweapons query refusal. These numbers are Meta's own.

Bar chart showing bioweapons refusal rates across frontier models: Muse Spark 98.0%, Opus 4.6 95.4%, GPT-5.4 74.7%, Gemini 3.1 Pro 61.5%, Kimi K2.5 21.2%.

Source: Meta Superintelligence Labs / ai.meta.com

The more interesting finding comes from Apollo Research. They found that Muse Spark showed the highest rate of evaluation awareness of any model they had tested: the model frequently identified safety evaluations as test contexts and behaved more carefully because of that detection. 

A model that only behaves well when it knows it is being watched is a problem worth taking seriously. Apollo's prior work has documented that this pattern can increase what they call "scheming behavior" in actual deployment.

Meta acknowledged the finding at launch, which most labs do not do. Their follow-up found it affected a small subset of alignment evaluations, none related to hazardous capabilities, and concluded it was not a blocking concern. The research is ongoing.

Muse Spark vs. GPT-5.4 vs. Opus 4.6 vs. Gemini 3.1

The benchmarks covered what these models can do. This section covers which one to actually use.

At a glance

| Spec | Muse Spark | GPT-5.4 | Opus 4.6 | Gemini 3.1 Pro |
| --- | --- | --- | --- | --- |
| Released | Apr 8, 2026 | Mar 5, 2026 | Feb 5, 2026 | Feb 19, 2026 |
| Context window | 262K* | 1.05M | 1M (since Mar 13) | 1M |
| Input modalities | Text, image, speech | Text, image | Text, image | Text, image, audio, video |
| API pricing (per 1M tokens in / out) | No public API | $2.50 / $15.00 | $5.00 / $25.00 | $2.00 / $12.00 |
| Consumer access | meta.ai (US-first) | ChatGPT | Claude.ai | Gemini app |

*Artificial Analysis records Muse Spark's context window at 262K. Some sources cite 1M. Meta has not published a model card confirming either figure.

Which one should you use?

Choose Muse Spark if your use case is health queries, chart reading, or multimodal consumer applications. There is no public API yet, so if you are building a production integration, you will have to wait.

Choose GPT-5.4 if you need a general-purpose model you can build against today. It leads on coding, abstract visual reasoning, and office automation, with a public API and 1M context window available now.

Choose Claude Opus 4.6 if you are working with long documents or need careful, high-quality writing output. The 1M context window moved to standard pricing on March 13, 2026. It is the most expensive option at $5/$25 per 1M tokens.

Choose Gemini 3.1 Pro if your pipeline processes video. It is the only model here that accepts video input, and at $2/$12 per 1M tokens, it is the cheapest frontier option in this group.

What People Are Saying About Muse Spark

Early reactions split along the lines you would expect. Some people found specific things that surprised them. Others looked at the benchmark table and reached different conclusions.

Tweet from Viktor Seraleev saying Claude just got a serious competitor, describing Muse Spark as built by MSL and noting Zuckerberg is rebooting the entire AI strategy with a full stack rebuilt from scratch in 9 months.

The "full stack rebuilt from scratch" framing came up a lot. That nine-month timeline is either impressive or hard to believe, depending on how much you trust Meta's claims.

Pietro Schirano shared a specific example: he asked Muse Spark to convert a UI screenshot into code, and it extracted the image assets from the interface rather than treating them as a flat image.

Tweet from Pietro Schirano saying he asked Muse Spark to convert an image to code and it cut out the assets from the screens so it could use them correctly, with a before-and-after screenshot comparison.

That is not a benchmark. It is the kind of thing that gets shared because it is genuinely unexpected.

Aakash Gupta had the sharpest take. His framing: "This is a data labeling CEO's model. The fingerprints are all over the results." The benchmarks where Muse Spark leads are all data-quality-sensitive tasks where training set curation determines the ceiling. 

The ones where it trails (ARC-AGI-2, Terminal-Bench, GDPval) are exactly where architecture and RL scaling matter more than data. His conclusion: "he built the best model at the things data pipelines solve, and a mediocre one at everything else."

Full tweet thread from Aakash Gupta analyzing Muse Spark: leads on data-quality-sensitive benchmarks, trails on coding and abstract reasoning, described as a data labeling CEO's model, with the conclusion that the $14.3B question was whether Wang could build the best model overall.

Conclusion

The jump from Llama 4 Maverick's 18 to Muse Spark's 52 on the Artificial Analysis Intelligence Index is not subtle. For a team that rebuilt from scratch in nine months, the health and multimodal results are a real first step, and they hold under independent testing.

Sure, the gaps are obvious. Coding and agentic tasks against GPT-5.4 are not close; abstract visual reasoning is a clear weak spot, and there is still no public API. If you need a model you can build against today, Muse Spark is not that yet.

What I keep coming back to is the open-source question. The Llama ecosystem was built on the trust that weights would be available. Muse Spark breaks that. Wang's "hope" to open-source future versions is not a commitment. That is, in my view, the most consequential thing about this launch, and it gets far less attention than the benchmark numbers.

Bigger Muse models are in development. If the architecture scales as claimed, today's numbers will look modest. That is the bet.

If you want to learn how to make the most out of any large language model, I recommend taking our Understanding Prompt Engineering course.

Muse Spark FAQs

If I were using Llama locally, does Muse Spark replace that?

No. Muse Spark is cloud-only. You cannot download it, run it on your own hardware, or fine-tune it. Access is through meta.ai or the Meta AI app, both requiring a Meta account. The open-weights use case that Llama built its community around does not exist here.

When should I actually use Contemplating mode instead of Thinking?

Contemplating mode is most useful when a problem genuinely has multiple valid solution paths: complex scientific questions, multi-step reasoning with ambiguous inputs, or research tasks where different angles might reach different conclusions. For most everyday queries, Thinking mode is faster, and the results are comparable. The other thing worth knowing: Contemplating mode is still rolling out gradually, so you may not have access to it yet.

What does the 10x compute claim actually mean for me as a user?

Probably nothing right now. The comparison is against Muse Spark's own prior model, not against GPT-5.4 or Gemini, and the number has not been independently verified. The more relevant data point is inference efficiency: as mentioned earlier, Muse Spark used 58 million output tokens on Artificial Analysis's independent run versus 157 million for Claude Opus 4.6. That gap may eventually show up in pricing, but API pricing has not been announced yet.

Is it worth switching to Muse Spark from what I use now?

If you use ChatGPT for general tasks, the day-to-day experience is similar. If health queries, science, or chart analysis are your main use cases, Muse Spark is a reasonable upgrade. If you rely on coding assistants or long-document tools, it does not replace GPT-5.4 or Opus 4.6 yet. The comparison table above has the specifics.

Should I be concerned about the evaluation awareness finding?

Not in a practical day-to-day sense, but it is worth understanding. The finding is that Muse Spark behaves more carefully when it detects it is being safety-tested, not because its underlying values differ but because it recognizes the context. Meta's follow-up found this affected a narrow set of alignment tests, none involving hazardous capabilities. If you are evaluating models for sensitive deployments, read Apollo's full report before drawing conclusions from safety benchmark scores alone.


Author
Khalid Abdelaty

I’m a data engineer and community builder who works across data pipelines, cloud, and AI tooling while writing practical, high-impact tutorials for DataCamp and emerging developers.
