
Muse Spark: Features, Benchmarks, and How to Use It

After a long quiet, Meta is back with a new model, a new lab, and a phrase it really wants you to remember. Learn about Muse Spark, its features, and more.
Apr 8, 2026  · 14 min read

We had been writing articles on Meta's Llama models at a steady clip (Llama 2, Llama 3, and so on). Then Llama 4 landed in April 2025 to widespread criticism, with multiple outlets and the company's own departing AI chief confirming that benchmark results had been manipulated using specialized sub-models that were never released to the public. 

After that, the updates stopped coming. Around the same time, Meta announced it was moving Horizon Worlds to mobile-only, effectively ending the VR version on which it once staked the company's future. It looked like a company losing its footing on two fronts at once.

On April 8, 2026, Meta launched Muse Spark, the first model from Meta Superintelligence Labs. The press release uses the phrase "personal superintelligence" a few too many times. Strip that away, and there is a real model underneath that puts Meta back in the conversation at the frontier level.

What Is Muse Spark?

Muse Spark is a natively multimodal reasoning model that handles text, images, audio, and tool use in a single architecture. It supports visual chain-of-thought, meaning the model can work through image-based problems step-by-step rather than just producing a single answer. Multi-agent orchestration is also part of the setup, which we will get to.

Earlier Llama models returned answers based on pattern matching from training. Muse Spark works through problems before responding. That is the actual shift.

Who is behind it?

Meta Superintelligence Labs, or MSL, was formed on June 30, 2025, when Mark Zuckerberg reorganized the company's AI operations. Alexandr Wang, former CEO of Scale AI, came on as Chief AI Officer; Meta had invested roughly $14 billion in Scale AI as part of the deal. 

Nat Friedman, former CEO of GitHub, leads the product and applied research side, and Shengjia Zhao, who co-created GPT-4 and o1 at OpenAI (the same o1 that Muse Spark is now benchmarked against), is Chief Scientist.

There is a third factor here worth naming: Yann LeCun, Meta's longtime Chief AI Scientist and the company's most visible open-source advocate, left in November 2025. His departure followed organizational changes that limited his role and the team's shift toward closed-source development.

What’s New with Muse Spark (And Why Should We Care)?

The headline features are reasoning modes, a rebuilt training pipeline, and a deliberate focus on health. Let's take them in order.

Three reasoning modes

Muse Spark offers three ways to interact with it, and the distinction between them is worth understanding before you try the model.

  • Instant is the default for casual queries. It responds quickly without extended reasoning, similar to what you would get from a standard chat model.
  • Thinking uses extended chain-of-thought reasoning. The model takes more time, works through intermediate steps, and generally performs better on harder problems. This is the mode behind most of the benchmark results in this article.
  • Contemplating is the most interesting one. More on that directly below.

One thing to know upfront: Contemplating mode is rolling out gradually and was not available to all users on launch day. If you do not see it yet, that is expected.

Contemplating mode

Contemplating mode spins up multiple reasoning agents that work in parallel, then combines their outputs into a single response. Where Gemini's Deep Think and OpenAI's GPT Pro mode scale reasoning by thinking longer, Muse Spark scales it by thinking wider. More agents are working simultaneously rather than one agent working for longer.

Meta's argument is that this approach produces comparable results with lower latency, since the agents run in parallel rather than sequentially. Independent confirmation of the latency claims is not yet available, but Contemplating mode does lead on several hard evaluations in Meta's own numbers (more on those shortly).

This is an inference-time feature, not an architectural one. The model itself does not change.
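There is no public API yet, so you cannot call Contemplating mode yourself, but the "thinking wider" idea is easy to sketch. The snippet below is a minimal illustration only: the ask_agent stub stands in for a real model call, and a simple majority vote stands in for whatever aggregation Meta actually uses. None of these names or choices come from Meta.

from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def ask_agent(question: str, seed: int) -> str:
    # Stand-in for one reasoning agent; a real system would call a model endpoint here.
    return f"placeholder answer for seed {seed}"

def contemplate(question: str, n_agents: int = 8) -> str:
    # "Thinking wider": several agents reason over the same question in parallel...
    with ThreadPoolExecutor(max_workers=n_agents) as pool:
        answers = list(pool.map(lambda s: ask_agent(question, s), range(n_agents)))
    # ...then their outputs are combined into one response; majority vote is the
    # simplest possible combiner and is only a placeholder here.
    return Counter(answers).most_common(1)[0][0]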

Reinforcement learning scaling and thought compression

Meta rebuilt its training stack from scratch over the nine months it took to develop Muse Spark. The reinforcement learning (RL) claims in particular come from Meta's own technical blog and have not been independently verified.

The more interesting detail is a technique the research team calls thought compression. During RL training, the model is rewarded for correct answers but also penalized for thinking time, which in practice means the number of output tokens it spends reasoning. This creates a three-phase behavior on complex tasks like math problems.

First, the model improves by thinking longer. Then, the length penalty kicks in and forces the model to solve the same problems with far fewer tokens. At some point, it extends its reasoning again and pushes past previous performance ceilings while using fewer tokens.
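Meta has not published the exact objective, but the described trade-off maps onto a familiar reward shape: a bonus for correctness minus a per-token cost. The function below is a rough illustration only; the penalty weight is invented, not a published value.

def thought_compression_reward(is_correct: bool, n_output_tokens: int,
                               length_penalty: float = 0.001) -> float:
    # Correct answers earn reward, but every reasoning token costs a little,
    # so the model is pushed toward reaching the right answer with fewer tokens.
    return (1.0 if is_correct else 0.0) - length_penalty * n_output_tokens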

The practical result: the model learned to do more with less. That claim rests on Meta's own training curves, which have not been independently validated.

10x less compute

Meta claims its new architecture matches Llama 4 Maverick's performance with ten times less training compute. That is about architectural efficiency, not Muse Spark's ceiling. Llama 4 Maverick scored 18 on the Artificial Analysis Intelligence Index. Muse Spark scored 52.

The token efficiency numbers from Artificial Analysis's independent run point in the same direction. Muse Spark used 58 million output tokens. GPT-5.4 used 120 million. Claude Opus 4.6 used 157 million.

Health: a deliberate focus

Health is Muse Spark's clearest benchmark advantage, and it is deliberate. Meta worked with more than 1,000 physicians to curate training data for health reasoning. 

The model can generate interactive displays covering nutritional content, drug information, and exercise physiology. On HealthBench Hard, Muse Spark scored 42.8 versus GPT-5.4's 40.1 and Gemini 3.1 Pro's 20.6. That gap against Gemini holds under independent evaluation.

This is clearly Meta's answer to ChatGPT Health. Meta's argument for why it can compete is its 3 billion users' worth of social context, which should give it an edge in understanding how people actually ask health questions. Whether that holds for complex or unusual queries, rather than the everyday ones that populate benchmarks, is worth watching.

What About Llama?

The developer community is asking one thing, and it deserves a direct answer.

Muse Spark is not open-source. Every Llama model through Llama 4 shipped with weights that developers could download and run locally. Communities like r/LocalLLaMA were built on that. That use case is gone.

Meta's stated reason is partly competitive: Chinese labs, including DeepSeek, used Llama weights to accelerate their own research. Wang has said the company "hopes" to open-source future Muse models, with no timeline attached. "Hopes" is doing a lot of work in that sentence.

The Llama team was moved into Wang's lab, and Llama 4 was the last model from the old structure. Whether Llama continues alongside Muse or quietly phases out, Meta has not said.

Muse Spark Benchmark Results

Benchmarks are tricky with Muse Spark for a reason worth naming up front. Given Meta's Llama 4 history, keep the self-reported and independently verified numbers separate.

Here are the Thinking mode results, which are where most of the fair comparison data live.

Full benchmark comparison table for Muse Spark Thinking mode versus Opus 4.6, Gemini 3.1 Pro, GPT-5.4, and Grok 4.2 across multimodal, reasoning, health, and agentic categories.

Source: Meta Superintelligence Labs / ai.meta.com

Contemplating mode has a separate set of results for the hardest evaluations. It leads on Humanity's Last Exam and FrontierScience Research, but trails GPT-5.4 Pro and Gemini 3.1 Deep Think on IPhO 2025 Theory physics problems. All of these numbers come from Meta's own reporting, so read them as directionally interesting rather than settled.

Source: Meta Superintelligence Labs / ai.meta.com

The independent picture from Artificial Analysis is more measured. They placed Muse Spark fourth on their Intelligence Index, behind Gemini 3.1 Pro Preview, GPT-5.4, and Claude Opus 4.6. Still top five globally. The numbers also make the weak spots obvious: ARC-AGI-2 and Terminal-Bench 2.0 are worth paying attention to if coding or abstract reasoning matters to your use case.

Testing Muse Spark

With those benchmark scores in mind, let's put Muse Spark to the test. I will examine the model on multi-step reasoning, image understanding, and code debugging.

Test 1: Fibonacci–binary logic chain (cascading reasoning stability)

In my first test, I will target the advanced reasoning capabilities of Muse Spark in a multi-step exercise. The model needs to:

  • Identify the correct Fibonacci term
  • Convert it accurately to binary
  • Count bits precisely
  • Generate primes in a computed range
  • Perform a large summation

The prompt used was:

Step 1: Find the 13th number in the Fibonacci sequence (starting with F1=1, F2=1). Let this be X.
Step 2: Convert X into a binary string (Base 2).
Step 3: Count the number of '1's in that binary string. Let this count be C.
Step 4: Identify all prime numbers (p) such that 20 ≤ p ≤ (C × 100).
Step 5: Calculate the sum of these primes. What is the final result?

Muse Spark did very well and solved the exercise correctly on the first try. That's especially impressive given that GPT-5.4 failed at the last step and only succeeded once that step was split into two (listing the prime numbers, then adding them).
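If you want to check the chain yourself without trusting any model, the whole exercise fits in a few lines of Python:

# Step 1: 13th Fibonacci number (F1 = 1, F2 = 1)
a, b = 1, 1
for _ in range(11):  # F1 and F2 are given, so 11 more steps reach F13
    a, b = b, a + b
x = b

# Steps 2-3: binary string and count of '1' bits
binary = bin(x)[2:]
c = binary.count("1")

# Steps 4-5: primes in [20, c * 100] and their sum
def is_prime(n):
    return n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1))

primes = [p for p in range(20, c * 100 + 1) if is_prime(p)]
print(x, binary, c, sum(primes))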

Test 2: Image understanding and sales reasoning on a multi-line time-series chart

Meta claims that Muse Spark is great at understanding complex images, so I’m using the following multi-line time-series chart to see if it can identify patterns and turn them into useful suggestions.

This is the prompt:

Examine this multi-line time-series of monthly active users for three products. Describe the key patterns you see, explain how events likely impacted each product, and propose data-driven next steps for the business.

Muse Spark code chart recognition response: key patterns by product

Muse Spark identified all the patterns correctly, which implies that the image recognition works well.

Muse Spark code chart recognition response: effect of annotated events

The data used was randomly fabricated, so there's no clear right or wrong answer here. That said, Muse Spark identified all the events, reasoned across the different products and time periods for each one, and came to sensible conclusions. It even analyzed changes in the combined monthly active users (MAU) of product combinations without being prompted for it, which is a nice addition.

Muse Spark code chart recognition response: data-driven next steps

All the suggested next steps are aligned with the analyses about product MAU patterns and event effects. Muse Spark identified the most important theme for each product (launch playbook for A, pricing for B, scaling for C), and came up with specific actions that make sense.
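If you want to run a similar test yourself, a synthetic chart like the one I used takes only a few lines to fabricate. All numbers, product names, and event labels below are made up for illustration.

import numpy as np
import matplotlib.pyplot as plt

# Fabricated monthly active users (in thousands) for three products over 24 months
rng = np.random.default_rng(0)
months = np.arange(24)
series = {
    "Product A": 50 + 8 * months + rng.normal(0, 10, 24),            # steady post-launch growth
    "Product B": 200 - 3 * months + rng.normal(0, 15, 24),           # slow decline
    "Product C": 20 * np.exp(0.12 * months) + rng.normal(0, 8, 24),  # accelerating growth
}

plt.figure(figsize=(9, 4))
for name, values in series.items():
    plt.plot(months, values, label=name)

# Annotated events for the model to reason about (dates and labels are invented)
for month, label in [(4, "A: launch campaign"), (10, "B: price increase"), (16, "C: feature release")]:
    plt.axvline(month, linestyle="--", color="grey")
    plt.annotate(label, (month, plt.ylim()[1]), rotation=90, va="top", fontsize=8)

plt.xlabel("Month")
plt.ylabel("MAU (thousands)")
plt.legend()
plt.tight_layout()
plt.savefig("mau_chart.png", dpi=150)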

Test 3: Code debugging

Finally, I’ll test Muse Spark’s skills in diagnosing code bugs. The test is designed to show whether the model only traces code correctness line-by-line or is also able to detect underlying flaws.

The prompt:

A developer wrote this Python function to compute a running average: 

def running_average(data, window=3): 
    result = [] 
    for i in range(len(data)): 
        start = max(0, i - window + 1) 
        chunk = data[start:i + 1] 
        result.append(round(sum(chunk) / window, 2)) 
    return result 
When called with running_average([10, 20, 30, 40, 50]), the first two values in the output seem wrong. Why? Please help me fix what is wrong!

The function always divides by window (3), even in the beginning, when the chunk has fewer than 3 elements. The buggy output is [3.33, 10.0, 20.0, 30.0, 40.0], but the first two values should be 10.0 and 15.0 since those chunks contain only 1 and 2 elements, respectively. The fix is changing / window to / len(chunk).
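For reference, applying that one-line fix gives:

def running_average(data, window=3):
    result = []
    for i in range(len(data)):
        start = max(0, i - window + 1)
        chunk = data[start:i + 1]
        # Divide by the actual chunk size so partial windows at the start average correctly
        result.append(round(sum(chunk) / len(chunk), 2))
    return result

print(running_average([10, 20, 30, 40, 50]))  # [10.0, 15.0, 20.0, 30.0, 40.0]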

Models often trace through the loop perfectly, but then report that the output looks "correct." They see the math happening step by step and don't flag that dividing a single element by 3 doesn't make sense. Catching the bug requires the model to hold intent (what a running average should do) alongside execution (what the code actually does) and spot the gap between them.

Muse Spark code debugging response

Muse Spark identified a running average as the intent and spotted the mistake. It suggested the right change and explained why it is necessary. It even suggested another option in case partial windows should be skipped entirely.

Overall, the model passed all three tests perfectly and made a good first impression.

How Can I Access Muse Spark?

You can access Muse Spark at meta.ai or through the Meta AI app on iOS and Android. Both are free. The initial rollout is US-first, with expansion to other regions described as coming in the following weeks.

meta.ai user interface and Muse Spark model mode selection

Meta plans to roll it out across WhatsApp, Instagram, Facebook, Messenger, and its Ray-Ban AI glasses over the same timeframe.

There is no public API. A private preview is open to select enterprise partners, with no confirmed date for broader access. On privacy: Meta's policy sets few limits on how conversations can be used to improve its models. If you plan to share sensitive information, read the terms first.

Where Muse Spark Falls Short

Meta said it directly in its technical blog: the model has gaps in multi-step agent tasks and coding workflows.

On SWE-Bench Verified, the gap against Gemini and Opus 4.6 is small. It opens up in agentic work: Terminal-Bench 2.0 (59.0 vs. GPT-5.4's 75.1) and GDPval-AA office automation (1,444 vs. GPT-5.4's 1,672). Those are not close. 

Abstract visual reasoning follows the same pattern: ARC-AGI-2 is 42.5 for Muse Spark against mid-70s for both GPT-5.4 and Gemini. The model that leads on chart reading trails badly on novel visual patterns.

That last one drew a response on launch day. François Chollet, co-founder of ARC Prize and creator of Keras and ARC-AGI, called the model "overoptimized for public benchmark numbers at the detriment of everything else." Wang replied, acknowledged the ARC-AGI-2 gap, and pointed to positive user feedback on visual coding and reasoning. Whether that holds under broader use is still an open question.

The missing public API, as I covered earlier, is a competitive gap on top of that. Wang acknowledged it on launch day: "There are certainly rough edges we will polish over time in model behavior."

Muse Spark Safety 

Meta conducted evaluations under its Advanced AI Scaling Framework before launch. On BioTIER-refuse, Muse Spark leads the comparison set for bioweapons query refusal. These numbers are Meta's own.

Bar chart showing bioweapons refusal rates across frontier models: Muse Spark 98.0%, Opus 4.6 95.4%, GPT-5.4 74.7%, Gemini 3.1 Pro 61.5%, Kimi K2.5 21.2%.

Source: Meta Superintelligence Labs / ai.meta.com

The more interesting finding comes from Apollo Research. They found that Muse Spark showed the highest rate of evaluation awareness of any model they had tested: the model frequently identified safety evaluations as test contexts and behaved more carefully because of that detection. 

A model that only behaves well when it knows it is being watched is a problem worth taking seriously. Apollo's prior work has documented that this pattern can increase what they call "scheming behavior" in actual deployment.

Meta acknowledged the finding at launch, which most labs do not do. Their follow-up found it affected a small subset of alignment evaluations, none related to hazardous capabilities, and concluded it was not a blocking concern. The research is ongoing.

Muse Spark vs. GPT-5.4 vs. Opus 4.6 vs. Gemini 3.1

The benchmarks covered what these models can do. This section covers which one to actually use.

At a glance

| Spec | Muse Spark | GPT-5.4 | Opus 4.6 | Gemini 3.1 Pro |
| --- | --- | --- | --- | --- |
| Released | Apr 8, 2026 | Mar 5, 2026 | Feb 5, 2026 | Feb 19, 2026 |
| Context window | 262K* | 1.05M | 1M (since Mar 13) | 1M |
| Input modalities | Text, image, speech | Text, image | Text, image | Text, image, audio, video |
| API pricing (per 1M tokens in / out) | No public API | $2.50 / $15.00 | $5.00 / $25.00 | $2.00 / $12.00 |
| Consumer access | meta.ai (US-first) | ChatGPT | Claude.ai | Gemini app |

*Artificial Analysis records Muse Spark's context window at 262K. Some sources cite 1M. Meta has not published a model card confirming either figure.

Which one should you use?

Choose Muse Spark if your use case is health queries, chart reading, or multimodal consumer applications. There is no public API yet, so if you are building a production integration, you will have to wait.

Choose GPT-5.4 if you need a general-purpose model you can build against today. It leads on coding, abstract visual reasoning, and office automation, with a public API and 1M context window available now.

Choose Claude Opus 4.6 if you are working with long documents or need careful, high-quality writing output. The 1M context window moved to standard pricing on March 13, 2026. It is the most expensive option at $5/$25 per 1M tokens.

Choose Gemini 3.1 Pro if your pipeline processes video. It is the only model here that accepts video input, and at $2/$12 per 1M tokens, it is the cheapest frontier option in this group.

What People Are Saying About Muse Spark

Early reactions split along the lines you would expect. Some people found specific things that surprised them. Others looked at the benchmark table and reached different conclusions.

Tweet from Viktor Seraleev saying Claude just got a serious competitor, describing Muse Spark as built by MSL and noting Zuckerberg is rebooting the entire AI strategy with a full stack rebuilt from scratch in 9 months.

The "full stack rebuilt from scratch" framing came up a lot. That nine-month timeline is either impressive or hard to believe, depending on how much you trust Meta's claims.

Pietro Schirano shared a specific example: he asked Muse Spark to convert a UI screenshot into code, and it extracted the image assets from the interface rather than treating them as a flat image.

Tweet from Pietro Schirano saying he asked Muse Spark to convert an image to code and it cut out the assets from the screens so it could use them correctly, with a before-and-after screenshot comparison.

That is not a benchmark. It is the kind of thing that gets shared because it is genuinely unexpected.

Aakash Gupta had the sharpest take. His framing: "This is a data labeling CEO's model. The fingerprints are all over the results." The benchmarks where Muse Spark leads are all data-quality-sensitive tasks where training set curation determines the ceiling. 

The ones where it trails (ARC-AGI-2, Terminal-Bench, GDPval) are exactly where architecture and RL scaling matter more than data. His conclusion: "he built the best model at the things data pipelines solve, and a mediocre one at everything else."

Full tweet thread from Aakash Gupta analyzing Muse Spark: leads on data-quality-sensitive benchmarks, trails on coding and abstract reasoning, described as a data labeling CEO's model, with the conclusion that the $14.3B question was whether Wang could build the best model overall.

Conclusion

The jump from Llama 4 Maverick's 18 to Muse Spark's 52 on the Artificial Analysis Intelligence Index is not subtle. For a team that rebuilt from scratch in nine months, the health and multimodal results are a real first step, and they hold under independent testing.

Sure, the gaps are obvious. Coding and agentic tasks against GPT-5.4 are not close; abstract visual reasoning is a clear weak spot, and there is still no public API. If you need a model you can build against today, Muse Spark is not that yet.

What I keep coming back to is the open-source question. The Llama ecosystem was built on the trust that weights would be available. Muse Spark breaks that. Wang's "hope" to open-source future versions is not a commitment. That is, in my view, the most consequential thing about this launch, and it gets far less attention than the benchmark numbers.

Bigger Muse models are in development. If the architecture scales as claimed, today's numbers will look modest. That is the bet.

If you want to learn how to make the most out of any large language model, I recommend taking our Understanding Prompt Engineering course.

Muse Spark FAQs

If I were using Llama locally, does Muse Spark replace that?

No. Muse Spark is cloud-only. You cannot download it, run it on your own hardware, or fine-tune it. Access is through meta.ai or the Meta AI app, both requiring a Meta account. The open-weights use case that Llama built its community around does not exist here.

When should I actually use Contemplating mode instead of Thinking?

Contemplating mode is most useful when a problem genuinely has multiple valid solution paths: complex scientific questions, multi-step reasoning with ambiguous inputs, or research tasks where different angles might reach different conclusions. For most everyday queries, Thinking mode is faster, and the results are comparable. The other thing worth knowing: Contemplating mode is still rolling out gradually, so you may not have access to it yet.

What does the 10x compute claim actually mean for me as a user?

Probably nothing right now. The comparison is against Muse Spark's own prior model, not against GPT-5.4 or Gemini, and the number has not been independently verified. The more relevant data point is inference efficiency: as mentioned earlier, Muse Spark used 58 million output tokens on Artificial Analysis's independent run versus 157 million for Claude Opus 4.6. That gap may eventually show up in pricing, but API pricing has not been announced yet.

Is it worth switching to Muse Spark from what I use now?

If you use ChatGPT for general tasks, the day-to-day experience is similar. If health queries, science, or chart analysis are your main use cases, Muse Spark is a reasonable upgrade. If you rely on coding assistants or long-document tools, it does not replace GPT-5.4 or Opus 4.6 yet. The comparison table above has the specifics.

Should I be concerned about the evaluation awareness finding?

Not in a practical day-to-day sense, but it is worth understanding. The finding is that Muse Spark behaves more carefully when it detects it is being safety-tested, not because its underlying values differ but because it recognizes the context. Meta's follow-up found this affected a narrow set of alignment tests, none involving hazardous capabilities. If you are evaluating models for sensitive deployments, read Apollo's full report before drawing conclusions from safety benchmark scores alone.


Author
Khalid Abdelaty

I’m a data engineer and community builder who works across data pipelines, cloud, and AI tooling while writing practical, high-impact tutorials for DataCamp and emerging developers.
