
Qwen3-Max-Thinking: Hands-On With the Largest LLM in the World

Test Qwen3-Max-Thinking, the world’s largest LLM, in reasoning mode. See how it stacks up against GPT-5 and Claude Sonnet 4.5 in logic, math, and code.
Nov 10, 2025 · 11 min read

Qwen3-Max-Thinking became publicly available in November 2025. When you enable thinking mode, the model shows its reasoning before answering. You get to see how it works through math problems, logic puzzles, and complex code. The model has over 1 trillion parameters and handles up to 262k tokens.

In this tutorial, you’ll set up API access, test thinking mode on different problems, and build a Streamlit app that runs the same prompt through Qwen3-Max-Thinking, GPT-5, and Claude Sonnet 4.5. You’ll see how each model approaches the problem and when thinking mode helps. Here is a preview of the app:

What is Qwen3-Max-Thinking?

Qwen3-Max-Thinking is the thinking-enabled version of Alibaba Cloud's Qwen3-Max model, which launched on September 5, 2025. The "thinking" in its name refers to a special mode where the model exposes its step-by-step reasoning process before providing an answer.

The model’s performance on math and reasoning benchmarks shows what this feature provides. When combined with tool usage (like code interpreters for running calculations) and scaled test-time compute, it scored 100% on AIME25, the American Invitational Mathematics Examination that tests advanced high school math requiring multi-step reasoning. It also achieved 100% on HMMT, the Harvard-MIT Math Tournament that measures complex problem-solving. These benchmarks test actual reasoning ability rather than pattern matching or memorization.

How test-time compute works

Test-time compute is how the model achieves its exceptional benchmark performance. Instead of only using computational resources during training (when the model learns), it also uses them during inference (when you make an API call and the model generates a response). When combined with tool usage, the model can verify its reasoning and check its work against external resources.

How Qwen3-Max-Thinking compares to GPT-5 and Claude Sonnet 4.5

While Qwen3 matches GPT-5 and Claude Sonnet 4.5 on certain benchmarks like AIME25, the models differ in other areas. On software engineering tasks measured by SWE-bench Verified, Claude leads at 77.2%, followed by GPT-5 at 72.8% and Qwen3 at 69.6%. GPT-5 offers the largest context window at 400k tokens compared to Qwen3’s 262k and Claude’s 200k.

The pricing advantage goes to Qwen3 at $1.20/$6.00 per million input/output tokens: input is priced on par with GPT-5 ($1.25/$10.00) while output costs 40% less, and Claude ($3.00/$15.00) is roughly 2.5 times more expensive on both. The thinking mode sets Qwen3 apart by showing explicit reasoning steps.

Since the model is still in preview and training, comprehensive benchmark results across all tasks will be available once Alibaba releases the final version.

| Feature | Qwen3-Max-Thinking | GPT-5 | Claude Sonnet 4.5 |
| --- | --- | --- | --- |
| Provider | Alibaba Cloud | OpenAI | Anthropic |
| Parameters (approx.) | ~1 trillion+ | Not publicly disclosed | Not publicly disclosed |
| Context window | 262k tokens | 400k tokens | 200k tokens (standard) / up to 1M tokens (API) |
| Reasoning / thinking mode | Explicit thinking mode | Reasoning effort levels | Thinking blocks / budgeted mode |
| SWE-bench Verified (software tasks) | 69.6% | 72.8% | 77.2% |
| AIME25 (math reasoning) | 100% | 100% | 100% |
| Input/output pricing (per 1M tokens) | $1.20 / $6.00 | $1.25 / $10.00 | $3.00 / $15.00 |
| Test-time compute | Enabled | Yes | Yes |
| Status | Preview | Production | Production |
| Best for | Learning + detailed reasoning | Balanced reasoning + capacity | Fast, concise reasoning |

However, benchmarks aren't the full story. To get a feel for how the model actually performs, we'll build a comparison app in a later section that runs the same prompts through Qwen3-Max-Thinking, GPT-5, and Claude Sonnet 4.5.

Setting Up Qwen3-Max-Thinking API Access

Before you can use Qwen3 Max Thinking, you’ll need Python 3.8 or later, pip, and a credit card to register with Alibaba Cloud. When you first create an account, you’ll get a free usage limit, but you’ll still need to provide payment details.

Register for Alibaba Cloud and activate Model Studio

Go to Alibaba Cloud and create an account. During signup, you’ll choose between two regions: International (Singapore) or China (Beijing). Most readers outside mainland China should pick the International region.

This choice matters because each region has its own API endpoint and key that aren’t interchangeable.

After registration, navigate to Model Studio in the console. You’ll need to activate the service before you can generate API keys. The Qwen (Alibaba Cloud) tutorial walks through the full account setup process if you want more detailed screenshots and explanations.

Create and secure your API key

Once Model Studio is activated, go to the API Keys section in the console. Generate a new key and copy it immediately. Alibaba Cloud shows the key only once.

Store your key in a .env file to keep it separate from your code, and add this file to .gitignore to prevent committing your API key to version control by mistake.
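
If you haven't set one up yet, the .gitignore entry is a single line:

# .gitignore
.env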

Install Python packages and configure environment

Qwen3 Max Thinking uses an OpenAI-compatible API, so you can use the standard OpenAI Python SDK. Install it along with python-dotenv for environment variable management:

pip install openai python-dotenv

Create a .env file in your project directory:

DASHSCOPE_API_KEY=your_key_here

Replace your_key_here with your API key. Now write a quick test script to verify everything works:

import os
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()

client = OpenAI(
   api_key=os.getenv("DASHSCOPE_API_KEY"),
   base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
)

completion = client.chat.completions.create(
   model="qwen3-max-preview",
   messages=[{"role": "user", "content": "Say hello in one sentence."}]
)

print(completion.choices[0].message.content)

Run this script. If you see an error about invalid API keys, double-check that you copied the key correctly and that you’re using the right base URL for your region.

If the script runs without errors, you should get an output like below:

Hello! How can I assist you today?

Understanding Qwen3-Max Thinking Mode

You just made a basic API call. Now turn on thinking mode to see the model work through problems step-by-step.

Enable thinking mode with extra_body

To turn on thinking mode, add enable_thinking: True inside the extra_body parameter. Here's an example with a prompt that works well with step-by-step reasoning:

import os
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()

client = OpenAI(
   api_key=os.getenv("DASHSCOPE_API_KEY"),
   base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
)

completion = client.chat.completions.create(
   model="qwen3-max-preview",
   messages=[
       {"role": "user", "content": "If I have 3 apples and buy 2 more packs of 4 apples each, then give away half, how many do I have left?"}
   ],
   extra_body={"enable_thinking": True}
)

print("Reasoning:", completion.choices[0].message.reasoning_content)
print("\nFinal answer:", completion.choices[0].message.content)

The response has two parts. The reasoning_content field shows the model's internal thinking process, where it works through the problem step-by-step. The content field gives you the final answer. When thinking mode is disabled, reasoning_content returns None.

Here’s what the output looks like (reasoning truncated):

Reasoning: The problem: "If I have 3 apples and buy 2 more packs of 4 apples each, then give away half, how many do I have left?" Starting with 3 apples. Then buy 2 packs of 4 apples each: that's 2 * 4 = 8 apples. So total after buying is 3 + 8 = 11 apples. Then give away half: half of 11 is 5.5 apples...

Final answer:
1. Start with: 3 apples.
2. Buy 2 packs of 4 apples each → 2 × 4 = 8 apples.
3. Total after buying: 3 + 8 = 11 apples.
4. Give away half: 11/2 = 5.5 apples.

Answer: You have 5½ apples left.
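
If you plan to toggle thinking mode on and off, it's safer to read the field with getattr so your code doesn't assume it exists. This is a small defensive pattern, not something the API requires:

reasoning = getattr(completion.choices[0].message, "reasoning_content", None)
if reasoning:
   print("Reasoning:", reasoning)
print("Final answer:", completion.choices[0].message.content)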

Control costs with thinking_budget

Since reasoning uses additional tokens, you can control costs with the thinking_budget parameter. Add it to extra_body alongside enable_thinking:

completion = client.chat.completions.create(
   model="qwen3-max-preview",
   messages=[
       {"role": "user", "content": "Explain why 0.1 + 0.2 doesn't equal 0.3 in Python."}
   ],
   extra_body={
       "enable_thinking": True,
       "thinking_budget": 1000
   }
)

The thinking_budget parameter sets the maximum tokens the model can use for reasoning. For simple problems, 500-1000 tokens works. Complex multi-step reasoning might need 2000-5000 tokens. If the model hits the budget limit, it stops the reasoning process and provides a final answer based on the reasoning completed up to that point.

You’ve now worked with Qwen3’s thinking mode. GPT-5 and Claude also have reasoning features with different syntax and approaches. To see how these models compare, let’s build an app that tests all three side-by-side.

Building a Multi-Model Comparison Chat App

You’ll build a Streamlit app that sends the same prompt to Qwen3 Max Thinking, GPT-5, and Claude Sonnet 4.5 at the same time. The app shows all three responses side-by-side so you can compare how each model reasons through problems.

You can see the full project code in this GitHub Gist.

Note that the following sections provide a detailed breakdown of that script, so if you only want to see the app in action, skip ahead to the testing section. The breakdown is for educational purposes and shows how to structure similar LLM-based applications with a Streamlit UI.

Step 1: Project setup and dependencies

Install Streamlit for the web interface and Anthropic’s SDK for Claude. You already have the OpenAI SDK from the API setup section.

pip install streamlit anthropic

Create a new file called model_comparison_chat.py with the imports and environment setup:

import os
import time
from typing import Dict, List
from concurrent.futures import ThreadPoolExecutor, as_completed
from dotenv import load_dotenv
import streamlit as st
from openai import OpenAI
from anthropic import Anthropic

load_dotenv()

Step 2: Configure API connections

You already set up Qwen3 in Section 3. Now add GPT-5 and Claude API keys to your .env file:

DASHSCOPE_API_KEY=your_alibaba_cloud_key
OPENAI_API_KEY=your_openai_key
ANTHROPIC_API_KEY=your_anthropic_key

For GPT-5 access and pricing details, see the GPT-5 guide. For Claude Sonnet 4.5, check the Claude Sonnet 4.5 overview.

Initialize all three API clients:

QWEN_CLIENT = OpenAI(
   api_key=os.getenv("DASHSCOPE_API_KEY"),
   base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
)
GPT_CLIENT = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
CLAUDE_CLIENT = Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))

Step 3: Create helper functions for each model

Instead of writing separate functions for each API, use one unified function that handles all three models. Here’s the start of the function for Qwen3:

def call_model(model_name: str, messages: List[Dict], **kwargs) -> Dict:
   """Call any model with unified interface."""
   try:
       start_time = time.time()

       if model_name == "Qwen3 Max Thinking":
           enable_thinking = kwargs.get("enable_thinking", True)
           extra_body = {"enable_thinking": enable_thinking}

           completion = QWEN_CLIENT.chat.completions.create(
               model="qwen3-max-preview",
               messages=messages,
               extra_body=extra_body
           )
           reasoning_content = getattr(
               completion.choices[0].message,
               'reasoning_content',
               None
           )
           # Extract content and token info, then continue below
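
The Gist handles the extraction in full. As a rough sketch of what that step looks like for the Qwen branch (field names follow the OpenAI-compatible response shape), it boils down to something like this:

            # Sketch of the extraction step (see the Gist for the full version)
            content_text = completion.choices[0].message.content
            thinking_text = reasoning_content
            input_tokens = completion.usage.prompt_tokens
            output_tokens = completion.usage.completion_tokens
            total_tokens = completion.usage.total_tokens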

GPT-5 works differently, using the Responses API with configurable reasoning effort levels (continuing the same function):

       elif model_name == "GPT-5":
           reasoning_effort = kwargs.get("reasoning_effort", "medium")

           response = GPT_CLIENT.responses.create(
               model="gpt-5",
               input=messages,
               reasoning={
                   "effort": reasoning_effort,
                   "summary": "auto"
               }
           )
           # Extract response data, then continue below
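
Pulling the answer and the reasoning summary out of the Responses API differs from the chat completions shape. A hedged sketch of that step (assuming response.output_text, reasoning output items carrying a summary list, and the usage fields the Responses API exposes) might look like this:

            # Sketch: pull the final text, any reasoning summary, and token usage
            content_text = response.output_text
            thinking_text = None
            for item in response.output:
                if getattr(item, "type", None) == "reasoning" and getattr(item, "summary", None):
                    thinking_text = "\n".join(part.text for part in item.summary)
            input_tokens = response.usage.input_tokens
            output_tokens = response.usage.output_tokens
            total_tokens = response.usage.total_tokens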

GPT-5 has four reasoning effort levels: minimal, low, medium, and high. Higher effort means better quality but slower responses and higher costs.

Claude Sonnet 4.5 takes yet another approach (continuing the same function):

       else:  # Claude Sonnet 4.5
           enable_thinking = kwargs.get("enable_thinking", True)
           params = {
               "model": "claude-sonnet-4-5",
               "max_tokens": 10000,
               "messages": messages
           }
           if enable_thinking:
               params["thinking"] = {
                   "type": "enabled",
                   "budget_tokens": 5000
               }

           message = CLAUDE_CLIENT.messages.create(**params)
           # Parse thinking blocks from response, then continue below

Note: Claude’s thinking mode works differently from Qwen3. Instead of a separate reasoning_content, Claude includes thinking blocks in the response content that you need to parse out. This is a key difference when comparing the models.
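
Parsing those blocks out is straightforward. A minimal sketch, assuming the "thinking" and "text" block types the Anthropic SDK returns:

            # Sketch: separate thinking blocks from the final text, then read token usage
            thinking_text = None
            content_text = ""
            for block in message.content:
                if block.type == "thinking":
                    thinking_text = block.thinking
                elif block.type == "text":
                    content_text += block.text
            input_tokens = message.usage.input_tokens
            output_tokens = message.usage.output_tokens
            total_tokens = input_tokens + output_tokens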

All three model branches converge to return the same structure (end of the function):

        return {
            "content": content_text,
            "reasoning_content": thinking_text,
            "response_time": time.time() - start_time,
            "tokens_used": {
                "input": input_tokens,
                "output": output_tokens,
                "total": total_tokens
            },
            "error": None
        }

   except Exception as e:
       return {
           "content": None,
           "reasoning_content": None,
           "response_time": 0,
           "tokens_used": {"input": 0, "output": 0, "total": 0},
           "error": f"{model_name} Error: {str(e)}"
       }

This standardized return format makes it easy to display responses from any model using the same code.

Step 4: Build the Streamlit interface

Now that you understand how each model’s API works, create the web interface. Start with the basic Streamlit page configuration:

def main():
   st.set_page_config(
       page_title="Multi-Model Comparison Chat",
       page_icon="🤖",
       layout="wide",
       initial_sidebar_state="expanded"
   )

   st.session_state.setdefault("messages", [])
   st.session_state.setdefault("model_responses", [])

Session state tracks the conversation history and model responses across reruns (when Streamlit refreshes the page after user interactions).

Add the title and sidebar controls for thinking mode configuration:

   st.title("🤖 Multi-Model Comparison Chat")
   st.markdown("""
   Compare **Qwen3 Max Thinking**, **GPT-5**, and **Claude Sonnet 4.5** side-by-side.
   All three models support reasoning modes - see how they approach problems differently.
   """)

   with st.sidebar:
       st.header("⚙️ Settings")

       st.subheader("Thinking Mode")
       qwen_thinking = st.checkbox("Enable Qwen3 Thinking", value=True)
       gpt5_reasoning = st.selectbox(
           "GPT-5 Reasoning Effort",
           options=["minimal", "low", "medium", "high"],
           index=2
       )
       claude_thinking = st.checkbox("Enable Claude Thinking", value=True)

Add API status indicators:

        st.subheader("API Status")
       st.markdown(f"""
       - Qwen3: {"✅" if os.getenv("DASHSCOPE_API_KEY") else "❌"}
       - GPT-5: {"✅" if os.getenv("OPENAI_API_KEY") else "❌"}
       - Claude: {"✅" if os.getenv("ANTHROPIC_API_KEY") else "❌"}
       """)

This shows which API keys are configured. If you see red X marks, check your .env file.

Step 5: Implement the comparison logic

def call_models_parallel(messages: List[Dict], selected_models: List[str], qwen_thinking: bool, gpt5_reasoning: str, claude_thinking: bool) -> Dict[str, Dict]:
   """Call multiple models in parallel."""
   model_kwargs = {
       "Qwen3 Max Thinking": {"enable_thinking": qwen_thinking},
       "GPT-5": {"reasoning_effort": gpt5_reasoning},
       "Claude Sonnet 4.5": {"enable_thinking": claude_thinking}
   }

   with ThreadPoolExecutor(max_workers=3) as executor:
       futures = {
           executor.submit(call_model, m, messages, **model_kwargs[m]): m
           for m in selected_models if m in model_kwargs
       }
       return {futures[f]: f.result() for f in as_completed(futures)}

Instead of waiting for each model to respond sequentially, parallel execution calls all three at once, reducing total wait time.

Integrate it into your chat interface and display responses side-by-side:

    if prompt := st.chat_input("Ask a question to all models..."):
       st.session_state.messages.append({"role": "user", "content": prompt})

       with st.spinner("🤔 Models are thinking..."):
           responses = call_models_parallel(
               st.session_state.messages,
               ["Qwen3 Max Thinking", "GPT-5", "Claude Sonnet 4.5"],
               qwen_thinking,
               gpt5_reasoning,
               claude_thinking
           )

       cols = st.columns(3)
       for idx, (model_name, response_data) in enumerate(responses.items()):
           with cols[idx]:
               st.markdown(f"### {model_name}")
               if response_data["reasoning_content"]:
                   with st.expander("🧠 Thinking Process", expanded=False):
                       st.markdown(response_data["reasoning_content"])
               st.markdown(response_data["content"])

This creates three columns, one for each model. The thinking process stays collapsed by default, so users can focus on the final answers.

Step 6: Test and refine

Run the app with:

streamlit run model_comparison_chat.py

Start with a simple sanity check question like “What is 2+2?” to confirm your API connections work and parallel execution is functioning.

Once you confirm the basic functionality works, you’re ready to test the models with complex prompts that reveal their different reasoning approaches.

Testing The App With Complex Prompts

With the comparison app running, test it with harder problems. Try running these prompts in your app to see how the different models approach them.

The test prompts

Here are three prompts that show clear patterns in how each model thinks.

Test 1: Advanced mathematical reasoning

A circular pizza is cut into 8 equal slices. Alice eats 3 slices, then Bob eats half of what's left. Then Charlie takes 1 slice, and finally Dana eats 40% of the remaining slices. How many slices are left? Also, if the original pizza had a diameter of 16 inches, what is the total area (in square inches) of the pizza that was consumed?
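
Before sending this to the models, you can settle the expected answer yourself with a few lines of arithmetic (a quick reference check, not part of the app):

import math

slices = 8 - 3                     # Alice eats 3 slices -> 5 left
slices /= 2                        # Bob eats half of what's left -> 2.5
slices -= 1                        # Charlie takes 1 slice -> 1.5
slices *= 1 - 0.40                 # Dana eats 40% of the remainder -> 0.9 left
consumed = 8 - slices              # 7.1 slices eaten in total
slice_area = math.pi * 8**2 / 8    # 16-inch diameter -> radius 8 -> 8*pi sq in per slice
print(slices, round(consumed * slice_area, 2))   # 0.9 178.44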

Test 2: Multi-step logical reasoning

Five friends - Amy, Ben, Clara, David, and Emma - are sitting in a row at a movie theater. The following conditions must be met:

1. Amy must sit next to Ben
2. Clara cannot sit next to David
3. Emma must sit at one of the ends
4. Ben cannot sit at either end

Given these constraints, list all possible seating arrangements from left to right.
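
If you want to verify the models' answers yourself, a brute-force check over all 120 orderings takes only a few lines (a reference script, not part of the app):

import itertools

friends = ["Amy", "Ben", "Clara", "David", "Emma"]

def valid(order):
    pos = {name: i for i, name in enumerate(order)}
    return (
        abs(pos["Amy"] - pos["Ben"]) == 1            # Amy next to Ben
        and abs(pos["Clara"] - pos["David"]) != 1    # Clara not next to David
        and pos["Emma"] in (0, 4)                    # Emma at an end
        and pos["Ben"] not in (0, 4)                 # Ben not at an end
    )

arrangements = [order for order in itertools.permutations(friends) if valid(order)]
for order in arrangements:
    print(" - ".join(order))
print(len(arrangements), "valid arrangements")       # expect 8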

Test 3: Code generation with constraints

Write a Python function that finds the longest palindromic substring in a given string, but with these specific requirements:

1. Time complexity must be O(n²) or better
2. Must handle Unicode characters correctly
3. Must return the first occurrence if there are multiple palindromes of the same length
4. Include detailed docstring with examples
5. Add type hints
6. Handle edge cases (empty string, single character, no palindromes)

Also explain your algorithm choice and why it meets the complexity requirement.
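
For reference, one common way to meet these requirements is the expand-around-center approach. Here's a sketch you can compare the models' answers against; it's one valid solution, not the only one (Manacher's algorithm gets you O(n), for example):

def longest_palindromic_substring(s: str) -> str:
    """Return the longest palindromic substring of s.

    Expand-around-center: O(n^2) time, O(1) extra space. Returns the
    first occurrence when several palindromes share the maximum length.
    Works on Unicode code points (each code point is one character).

    Examples:
        >>> longest_palindromic_substring("babad")
        'bab'
        >>> longest_palindromic_substring("a")
        'a'
        >>> longest_palindromic_substring("")
        ''
    """
    if len(s) < 2:
        return s

    best_start, best_len = 0, 1

    def expand(left: int, right: int) -> None:
        nonlocal best_start, best_len
        while left >= 0 and right < len(s) and s[left] == s[right]:
            left -= 1
            right += 1
        length = right - left - 1
        if length > best_len:  # strictly greater keeps the first occurrence on ties
            best_start, best_len = left + 1, length

    for i in range(len(s)):
        expand(i, i)      # odd-length centers
        expand(i, i + 1)  # even-length centers

    return s[best_start:best_start + best_len]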

Results

All three models got every problem right. Here’s how they performed:

| Test | Model | Time (sec) | Tokens | Result |
| --- | --- | --- | --- | --- |
| Pizza Problem | Qwen3-Max-Thinking | 157 | 4,419 | ✓ Correct (0.9 slices, 178.44 sq in) |
| Pizza Problem | GPT-5 | 45 | 1,490 | ✓ Correct (0.9 slices, 178.44 sq in) |
| Pizza Problem | Claude | 14 | 1,010 | ✓ Correct (0.9 slices, 178.44 sq in) |
| Theater Seating | Qwen3-Max-Thinking | 166 | 5,728 | ✓ Found all 8 arrangements |
| Theater Seating | GPT-5 | 83 | 4,607 | ✓ Found all 8 arrangements |
| Theater Seating | Claude | 32 | 3,081 | ✓ Found all 8 arrangements |
| Palindrome Code | Qwen3-Max-Thinking | 191 | 7,345 | ✓ Working O(n²) solution |
| Palindrome Code | GPT-5 | 58 | 3,263 | ✓ Working O(n²) solution |
| Palindrome Code | Claude | 48 | 3,279 | ✓ Working O(n²) solution |

Across these tests, Qwen3 took roughly 4 to 11 times longer than Claude and used two to four times as many tokens.

Qwen3’s approach

It considers everything. On the pizza problem, it questioned whether fractional slices make sense, considered if the math should round to whole numbers, validated calculations multiple times, and examined alternative interpretations before committing to an answer. 

You can see it thinking “what if this” and “but maybe that” throughout. For the seating problem, it worked through “if Ben sits here, where can Amy go” for every possibility. 

The palindrome solution came with deep analysis of why it picked that algorithm over alternatives, how it handles Unicode edge cases, and thorough complexity breakdowns. It’s like watching someone think out loud and check their work repeatedly.

GPT-5’s approach

It’s structured and methodical. It laid out each problem, worked through steps in order, and verified results. Less exploration, more execution. 

On the seating puzzle, it placed Emma at an end first, then filled positions while checking rules. 

The palindrome code came with clear reasoning about algorithm choice and solid documentation. It follows a plan without much backtracking. You get enough detail to understand the thinking, but it doesn’t second-guess itself constantly.

Claude’s approach 

It moves fast. It spots the solution path quickly and executes cleanly. For the seating problem, it organized by cases (Emma at position 1 vs position 5) and worked through each one. The palindrome code was clean and well-documented, but got to the point faster. It shows its work but doesn’t linger. You see the thinking but it’s concise.

The practical takeaway: Use Qwen3 when you need to see every consideration that went into a solution. The detailed thinking helps when you’re learning from the model or validating complex decisions. Use Claude when speed matters and you trust it to get the answer without showing you every step. Use GPT-5 when you want something between those extremes.

Note: These tests can’t fully capture all three models’ abilities. Your results may vary with different prompts.

Conclusion

In this tutorial, I demonstrated how to set up Qwen3-Max-Thinking, tested its thinking mode on different types of problems, and built a comparison app that runs the same prompts through Qwen3, GPT-5, and Claude Sonnet 4.5 side-by-side.

The key insight is that each model has strengths for different scenarios. Qwen3 excels when you need detailed reasoning and want to understand every step. Claude wins on speed and clean execution. GPT-5 balances both approaches.

Use your comparison app to test these models with your own problems. As you experiment with different tasks, you’ll develop a sense for which model works best for your needs.

To go deeper with Qwen3 and multi-model development, check out the OpenRouter tutorial for managing multiple model APIs, or learn how to run smaller Qwen3 models locally with Ollama for offline development.


Author: Bex Tuychiev

I am a data science content creator with over 2 years of experience and one of the largest followings on Medium. I like to write detailed articles on AI and ML with a bit of a sarcastic style, because you've got to do something to make them a bit less dull. I have produced over 130 articles and a DataCamp course to boot, with another one in the making. My content has been seen by over 5 million pairs of eyes, 20k of whom became followers on both Medium and LinkedIn.
