Qwen3-Max-Thinking became publicly available in November 2025. When you enable thinking mode, the model shows its reasoning before answering. You get to see how it works through math problems, logic puzzles, and complex code. The model has over 1 trillion parameters and handles up to 262k tokens.
In this tutorial, you’ll set up API access, test thinking mode on different problems, and build a Streamlit app that runs the same prompt through Qwen3-Max-Thinking, GPT-5, and Claude Sonnet 4.5. You’ll see how each model approaches the problem and when thinking mode helps. Here is a preview of the app:
What is Qwen3-Max-Thinking?
Qwen3-Max-Thinking is the thinking-enabled variant of Alibaba Cloud's Qwen3-Max model, which launched on September 5, 2025. The "thinking" in its name refers to a special mode in which the model exposes its step-by-step reasoning process before providing an answer.
The model’s performance on math and reasoning benchmarks shows what this feature provides. When combined with tool usage (like code interpreters for running calculations) and scaled test-time compute, it scored 100% on AIME25, the American Invitational Mathematics Examination that tests advanced high school math requiring multi-step reasoning. It also achieved 100% on HMMT, the Harvard-MIT Math Tournament that measures complex problem-solving. These benchmarks test actual reasoning ability rather than pattern matching or memorization.
How test-time compute works
Test-time compute is how the model achieves its exceptional benchmark performance. Instead of only using computational resources during training (when the model learns), it also uses them during inference (when you make an API call and the model generates a response). When combined with tool usage, the model can verify its reasoning and check its work against external resources.
How Qwen3-Max-Thinking compares to GPT-5 and Claude Sonnet 4.5
While Qwen3 matches GPT-5 and Claude Sonnet 4.5 on certain benchmarks like AIME25, the models differ in other areas. On software engineering tasks measured by SWE-bench Verified, Claude leads at 77.2%, followed by GPT-5 at 72.8% and Qwen3 at 69.6%. GPT-5 offers the largest context window at 400k tokens compared to Qwen3’s 262k and Claude’s 200k.
The pricing advantage goes to Qwen3 at $1.20/$6.00 per million input/output tokens: its input price is close to GPT-5's ($1.25/$10.00) but its output price is 40% lower, and it is roughly 2.5 times cheaper than Claude on both ($3.00/$15.00). The thinking mode is what sets Qwen3 apart, since it shows explicit reasoning steps.
Since the model is still in preview and training, comprehensive benchmark results across all tasks will be available once Alibaba releases the final version.
| Feature | Qwen3-Max-Thinking | GPT-5 | Claude Sonnet 4.5 |
|---|---|---|---|
| Provider | Alibaba Cloud | OpenAI | Anthropic |
| Parameters (approx.) | ~1 trillion+ | Not publicly disclosed | Not publicly disclosed |
| Context Window | 262k tokens | 400k tokens | 200k tokens (standard) / up to 1M tokens (API) |
| Reasoning / Thinking Mode | Explicit thinking mode | Reasoning effort levels | Thinking blocks / budgeted mode |
| SWE-bench Verified (software tasks) | 69.6% | 72.8% | 77.2% |
| AIME25 (math reasoning) | 100% | 100% | 100% |
| Input/Output Pricing | $1.20 / $6.00 per million tokens | $1.25 / $10.00 | $3.00 / $15.00 |
| Test-Time Compute | Yes | Yes | Yes |
| Status | Preview | Production | Production |
| Best For | Learning + detailed reasoning | Balanced reasoning + capacity | Fast, concise reasoning |
However, benchmarks aren’t the full story. To get a feel for how the model actually performs, we’ll build a comparison app in a later section that pits it against GPT-5 and Claude Sonnet 4.5 on the same prompts.
Setting Up Qwen3-Max-Thinking API Access
Before you can use Qwen3 Max Thinking, you’ll need Python 3.8 or later, pip, and a credit card to register with Alibaba Cloud. When you first create an account, you’ll get a free usage limit, but you’ll still need to provide payment details.
Register for Alibaba Cloud and activate Model Studio
Go to Alibaba Cloud and create an account. During signup, you’ll choose between two regions: International (Singapore) or China (Beijing). Most readers outside mainland China should pick the International region.
This choice matters because each region has its own API endpoint and key that aren’t interchangeable.
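For reference, the two OpenAI-compatible base URLs look like this. This tutorial uses the international endpoint throughout; confirm the mainland China URL against the Model Studio documentation before relying on it:
# OpenAI-compatible endpoints differ by region (keys are not interchangeable)
INTL_BASE_URL = "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"  # International (Singapore)
CN_BASE_URL = "https://dashscope.aliyuncs.com/compatible-mode/v1"         # China (Beijing)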
After registration, navigate to Model Studio in the console. You’ll need to activate the service before you can generate API keys. The Qwen (Alibaba Cloud) tutorial walks through the full account setup process if you want more detailed screenshots and explanations.
Create and secure your API key
Once Model Studio is activated, go to the API Keys section in the console. Generate a new key and copy it immediately. Alibaba Cloud shows the key only once.
Store your key in a .env file to keep it separate from your code, and add this file to .gitignore to prevent committing your API key to version control by mistake.
Install Python packages and configure environment
Qwen3 Max Thinking uses an OpenAI-compatible API, so you can use the standard OpenAI Python SDK. Install it along with python-dotenv for environment variable management:
pip install openai python-dotenv
Create a .env file in your project directory:
DASHSCOPE_API_KEY=your_key_here
Replace your_key_here with your API key. Now write a quick test script to verify everything works:
import os
from openai import OpenAI
from dotenv import load_dotenv
load_dotenv()
client = OpenAI(
api_key=os.getenv("DASHSCOPE_API_KEY"),
base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
)
completion = client.chat.completions.create(
model="qwen3-max-preview",
messages=[{"role": "user", "content": "Say hello in one sentence."}]
)
print(completion.choices[0].message.content)
Run this script. If you see an error about invalid API keys, double-check that you copied the key correctly and that you’re using the right base URL for your region.
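If you’d rather the script fail with a clearer message instead of a raw traceback, you can catch the OpenAI SDK’s exception classes explicitly. This is an optional sketch, not required for the rest of the tutorial:
from openai import AuthenticationError, APIConnectionError

try:
    completion = client.chat.completions.create(
        model="qwen3-max-preview",
        messages=[{"role": "user", "content": "Say hello in one sentence."}]
    )
    print(completion.choices[0].message.content)
except AuthenticationError:
    print("Invalid API key - check DASHSCOPE_API_KEY in your .env file.")
except APIConnectionError:
    print("Could not reach the endpoint - check the base_url for your region.")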
If the script runs without errors, you should get an output like below:
Hello! How can I assist you today?
Understanding Qwen3-Max Thinking Mode
You just made a basic API call. Now turn on thinking mode to see the model work through problems step-by-step.
Enable thinking mode with extra_body
To turn on thinking mode, add enable_thinking: True inside the extra_body parameter. Here's an example with a prompt that works well with step-by-step reasoning:
import os
from openai import OpenAI
from dotenv import load_dotenv
load_dotenv()
client = OpenAI(
api_key=os.getenv("DASHSCOPE_API_KEY"),
base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
)
completion = client.chat.completions.create(
model="qwen3-max-preview",
messages=[
{"role": "user", "content": "If I have 3 apples and buy 2 more packs of 4 apples each, then give away half, how many do I have left?"}
],
extra_body={"enable_thinking": True}
)
print("Reasoning:", completion.choices[0].message.reasoning_content)
print("\nFinal answer:", completion.choices[0].message.content)
The response has two parts. The reasoning_content field shows the model's internal thinking process, where it works through the problem step-by-step. The content field gives you the final answer. When thinking mode is disabled, reasoning_content returns None.
Here’s what the output looks like (reasoning truncated):
Reasoning: The problem: "If I have 3 apples and buy 2 more packs of 4 apples each, then give away half, how many do I have left?" Starting with 3 apples. Then buy 2 packs of 4 apples each: that's 2 * 4 = 8 apples. So total after buying is 3 + 8 = 11 apples. Then give away half: half of 11 is 5.5 apples...
Final answer:
1. Start with: 3 apples.
2. Buy 2 packs of 4 apples each → 2 × 4 = 8 apples.
3. Total after buying: 3 + 8 = 11 apples.
4. Give away half: 11/2 = 5.5 apples.
Answer: You have 5½ apples left.
Control costs with thinking_budget
Since reasoning uses additional tokens, you can control costs with the thinking_budget parameter. Add it to extra_body alongside enable_thinking:
completion = client.chat.completions.create(
model="qwen3-max-preview",
messages=[
{"role": "user", "content": "Explain why 0.1 + 0.2 doesn't equal 0.3 in Python."}
],
extra_body={
"enable_thinking": True,
"thinking_budget": 1000
}
)
The thinking_budget parameter sets the maximum number of tokens the model can use for reasoning. For simple problems, 500-1,000 tokens is usually enough; complex multi-step reasoning might need 2,000-5,000. If the model hits the budget limit, it stops the reasoning process and provides a final answer based on the reasoning completed up to that point.
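You can check how many tokens a request actually consumed by reading the usage object on the completion. Total counts are part of the OpenAI-compatible response; finer-grained reasoning-token fields vary by provider, so treat this as a rough cost check:
print("Reasoning:", completion.choices[0].message.reasoning_content)
print("Answer:", completion.choices[0].message.content)

# Rough cost check: prompt vs. completion tokens for this request
usage = completion.usage
print(f"Prompt tokens: {usage.prompt_tokens}, "
      f"completion tokens: {usage.completion_tokens}, "
      f"total: {usage.total_tokens}")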
You’ve now worked with Qwen3’s thinking mode. GPT-5 and Claude also have reasoning features with different syntax and approaches. To see how these models compare, let’s build an app that tests all three side-by-side.
Building a Multi-Model Comparison Chat App
You’ll build a Streamlit app that sends the same prompt to Qwen3 Max Thinking, GPT-5, and Claude Sonnet 4.5 at the same time. The app shows all three responses side-by-side so you can compare how each model reasons through problems.
You can see the full project code in this GitHub Gist.
The following steps provide a detailed breakdown of that script. If you only want to see the results, skip ahead to the testing section; the breakdown is meant to show how to structure similar LLM-based applications with a Streamlit UI.
Step 1: Project setup and dependencies
Install Streamlit for the web interface and Anthropic’s SDK for Claude. You already have the OpenAI SDK from the API setup section.
pip install streamlit anthropic
Create a new file called model_comparison_chat.py with the imports and environment setup:
import os
import time
from typing import Dict, List
from concurrent.futures import ThreadPoolExecutor, as_completed
from dotenv import load_dotenv
import streamlit as st
from openai import OpenAI
from anthropic import Anthropic
load_dotenv()
Step 2: Configure API connections
You already set up Qwen3 in Section 3. Now add GPT-5 and Claude API keys to your .env file:
DASHSCOPE_API_KEY=your_alibaba_cloud_key
OPENAI_API_KEY=your_openai_key
ANTHROPIC_API_KEY=your_anthropic_key
For GPT-5 access and pricing details, see the GPT-5 guide. For Claude Sonnet 4.5, check the Claude Sonnet 4.5 overview.
Initialize all three API clients:
QWEN_CLIENT = OpenAI(
api_key=os.getenv("DASHSCOPE_API_KEY"),
base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
)
GPT_CLIENT = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
CLAUDE_CLIENT = Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
Step 3: Create helper functions for each model
Instead of writing separate functions for each API, use one unified function that handles all three models. Here’s the start of the function for Qwen3:
def call_model(model_name: str, messages: List[Dict], **kwargs) -> Dict:
"""Call any model with unified interface."""
try:
start_time = time.time()
if model_name == "Qwen3 Max Thinking":
enable_thinking = kwargs.get("enable_thinking", True)
extra_body = {"enable_thinking": enable_thinking}
completion = QWEN_CLIENT.chat.completions.create(
model="qwen3-max-preview",
messages=messages,
extra_body=extra_body
)
reasoning_content = getattr(
completion.choices[0].message,
'reasoning_content',
None
)
# Extract content and token info, then continue below
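If you want to fill in that extraction without jumping to the Gist, the Qwen branch can finish roughly like this. It’s a sketch that continues the same if block; the usage fields come from the OpenAI-compatible response:
            # Sketch: pull out the answer text and token usage for the Qwen branch
            content_text = completion.choices[0].message.content
            thinking_text = reasoning_content
            input_tokens = completion.usage.prompt_tokens
            output_tokens = completion.usage.completion_tokens
            total_tokens = completion.usage.total_tokens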
GPT-5 works differently, using the responses API with configurable reasoning effort levels (continuing the same function):
elif model_name == "GPT-5":
reasoning_effort = kwargs.get("reasoning_effort", "medium")
response = GPT_CLIENT.responses.create(
model="gpt-5",
input=messages,
reasoning={
"effort": reasoning_effort,
"summary": "auto"
}
)
# Extract response data, then continue below
GPT-5 has four reasoning effort levels: minimal, low, medium, and high. Higher effort means better quality but slower responses and higher costs.
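The extraction for the GPT-5 branch could look roughly like the sketch below. The exact shape of reasoning summaries in the Responses API may differ, so treat the field names as assumptions and compare against the Gist:
            # Sketch: final text plus any reasoning summaries from the Responses API
            content_text = response.output_text
            summaries = [
                part.text
                for item in response.output
                if item.type == "reasoning"
                for part in (getattr(item, "summary", None) or [])
            ]
            thinking_text = "\n\n".join(summaries) if summaries else None
            input_tokens = response.usage.input_tokens
            output_tokens = response.usage.output_tokens
            total_tokens = response.usage.total_tokens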
Claude Sonnet 4.5 takes yet another approach (continuing the same function):
else: # Claude Sonnet 4.5
enable_thinking = kwargs.get("enable_thinking", True)
params = {
"model": "claude-sonnet-4-5",
"max_tokens": 10000,
"messages": messages
}
if enable_thinking:
params["thinking"] = {
"type": "enabled",
"budget_tokens": 5000
}
message = CLAUDE_CLIENT.messages.create(**params)
# Parse thinking blocks from response, then continue below
Note: Claude’s thinking mode works differently from Qwen3. Instead of a separate reasoning_content, Claude includes thinking blocks in the response content that you need to parse out. This is a key difference when comparing the models.
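Here’s roughly what that parsing could look like for the Claude branch, continuing the same else block. The block types follow Anthropic’s extended thinking response format:
            # Sketch: separate thinking blocks from the answer text
            thinking_text = None
            content_text = ""
            for block in message.content:
                if block.type == "thinking":
                    thinking_text = block.thinking
                elif block.type == "text":
                    content_text += block.text
            input_tokens = message.usage.input_tokens
            output_tokens = message.usage.output_tokens
            total_tokens = input_tokens + output_tokens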
All three model branches converge to return the same structure (end of the function):
return {
"content": content_text,
"reasoning_content": thinking_text,
"response_time": time.time() - start_time,
"tokens_used": {
"input": input_tokens,
"output": output_tokens,
"total": total_tokens
},
"error": None
}
except Exception as e:
return {
"content": None,
"reasoning_content": None,
"response_time": 0,
"tokens_used": {"input": 0, "output": 0, "total": 0},
"error": f"{model_name} Error: {str(e)}"
}
This standardized return format makes it easy to display responses from any model using the same code.
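Before wiring this into Streamlit, you can sanity-check the function from a plain script. For example:
result = call_model(
    "Qwen3 Max Thinking",
    [{"role": "user", "content": "What is 2 + 2?"}],
    enable_thinking=True
)
if result["error"]:
    print(result["error"])
else:
    print(f"Answered in {result['response_time']:.1f}s "
          f"({result['tokens_used']['total']} tokens)")
    print(result["content"])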
Step 4: Build the Streamlit interface
Now that you understand how each model’s API works, create the web interface. Start with the basic Streamlit page configuration:
def main():
st.set_page_config(
page_title="Multi-Model Comparison Chat",
page_icon="🤖",
layout="wide",
initial_sidebar_state="expanded"
)
st.session_state.setdefault("messages", [])
st.session_state.setdefault("model_responses", [])
Session state tracks the conversation history and model responses across reruns (when Streamlit refreshes the page after user interactions).
Add the title and sidebar controls for thinking mode configuration:
st.title("🤖 Multi-Model Comparison Chat")
st.markdown("""
Compare **Qwen3 Max Thinking**, **GPT-5**, and **Claude Sonnet 4.5** side-by-side.
All three models support reasoning modes - see how they approach problems differently.
""")
with st.sidebar:
st.header("⚙️ Settings")
st.subheader("Thinking Mode")
qwen_thinking = st.checkbox("Enable Qwen3 Thinking", value=True)
gpt5_reasoning = st.selectbox(
"GPT-5 Reasoning Effort",
options=["minimal", "low", "medium", "high"],
index=2
)
claude_thinking = st.checkbox("Enable Claude Thinking", value=True)
Add API status indicators:
st.subheader("API Status")
st.markdown(f"""
- Qwen3: {"✅" if os.getenv("DASHSCOPE_API_KEY") else "❌"}
- GPT-5: {"✅" if os.getenv("OPENAI_API_KEY") else "❌"}
- Claude: {"✅" if os.getenv("ANTHROPIC_API_KEY") else "❌"}
""")
This shows which API keys are configured. If you see red X marks, check your .env file.
Step 5: Implement the comparison logic
def call_models_parallel(messages: List[Dict], selected_models: List[str], qwen_thinking: bool, gpt5_reasoning: str, claude_thinking: bool) -> Dict[str, Dict]:
"""Call multiple models in parallel."""
model_kwargs = {
"Qwen3 Max Thinking": {"enable_thinking": qwen_thinking},
"GPT-5": {"reasoning_effort": gpt5_reasoning},
"Claude Sonnet 4.5": {"enable_thinking": claude_thinking}
}
with ThreadPoolExecutor(max_workers=3) as executor:
futures = {
executor.submit(call_model, m, messages, **model_kwargs[m]): m
for m in selected_models if m in model_kwargs
}
return {futures[f]: f.result() for f in as_completed(futures)}
Instead of waiting for each model to respond sequentially, parallel execution calls all three at once, reducing total wait time.
Integrate it into your chat interface and display responses side-by-side:
if prompt := st.chat_input("Ask a question to all models..."):
st.session_state.messages.append({"role": "user", "content": prompt})
with st.spinner("🤔 Models are thinking..."):
responses = call_models_parallel(
st.session_state.messages,
["Qwen3 Max Thinking", "GPT-5", "Claude Sonnet 4.5"],
qwen_thinking,
gpt5_reasoning,
claude_thinking
)
cols = st.columns(3)
for idx, (model_name, response_data) in enumerate(responses.items()):
with cols[idx]:
st.markdown(f"### {model_name}")
if response_data["reasoning_content"]:
with st.expander("🧠 Thinking Process", expanded=False):
st.markdown(response_data["reasoning_content"])
st.markdown(response_data["content"])
This creates three columns, one for each model. The thinking process stays collapsed by default, so users can focus on the final answers.
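As an optional touch, you can also surface the timing and token data that call_model records by adding a caption at the bottom of the same column block:
st.caption(
    f"⏱ {response_data['response_time']:.1f}s · "
    f"{response_data['tokens_used']['total']} tokens"
)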
Step 6: Test and refine
Run the app with:
streamlit run model_comparison_chat.py
Start with a simple sanity check question like “What is 2+2?” to confirm your API connections work and parallel execution is functioning.
Once you confirm the basic functionality works, you’re ready to test the models with complex prompts that reveal their different reasoning approaches.
Testing The App With Complex Prompts
With the comparison app running, test it with harder problems. Try running these prompts in your app to see how the different models approach them.
The test prompts
Here are three prompts that show clear patterns in how each model thinks.
Test 1: Advanced mathematical reasoning
A circular pizza is cut into 8 equal slices. Alice eats 3 slices, then Bob eats half of what's left. Then Charlie takes 1 slice, and finally Dana eats 40% of the remaining slices. How many slices are left? Also, if the original pizza had a diameter of 16 inches, what is the total area (in square inches) of the pizza that was consumed?
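You can verify the expected answer yourself with a few lines of arithmetic. This is my own quick check, independent of any model’s output:
import math

slices_left = 8 - 3      # Alice eats 3 of 8 slices -> 5 left
slices_left /= 2         # Bob eats half of what's left -> 2.5 left
slices_left -= 1         # Charlie takes 1 slice -> 1.5 left
slices_left *= 0.6       # Dana eats 40% of the rest -> 0.9 left

pizza_area = math.pi * 8 ** 2                          # 16-inch diameter -> radius 8
consumed_area = (8 - slices_left) / 8 * pizza_area     # area of the 7.1 slices eaten

print(round(slices_left, 2), round(consumed_area, 2))  # 0.9 178.44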
Test 2: Multi-step logical reasoning
Five friends - Amy, Ben, Clara, David, and Emma - are sitting in a row at a movie theater. The following conditions must be met:
1. Amy must sit next to Ben
2. Clara cannot sit next to David
3. Emma must sit at one of the ends
4. Ben cannot sit at either end.
Given these constraints, list all possible seating arrangements from left to right.
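The correct answer (8 valid arrangements) is easy to confirm with a brute-force check before you trust any model’s output:
from itertools import permutations

friends = ["Amy", "Ben", "Clara", "David", "Emma"]
valid = []
for order in permutations(friends):
    pos = {name: i for i, name in enumerate(order)}
    if abs(pos["Amy"] - pos["Ben"]) != 1:        # Amy must sit next to Ben
        continue
    if abs(pos["Clara"] - pos["David"]) == 1:    # Clara cannot sit next to David
        continue
    if pos["Emma"] not in (0, 4):                # Emma must sit at an end
        continue
    if pos["Ben"] in (0, 4):                     # Ben cannot sit at either end
        continue
    valid.append(" - ".join(order))

print(len(valid))          # 8
print("\n".join(valid))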
Test 3: Code generation with constraints
Write a Python function that finds the longest palindromic substring in a given string, but with these specific requirements:
1. Time complexity must be O(n²) or better
2. Must handle Unicode characters correctly
3. Must return the first occurrence if there are multiple palindromes of the same length
4. Include detailed docstring with examples
5. Add type hints
6. Handle edge cases (empty string, single character, no palindromes).
Also explain your algorithm choice and why it meets the complexity requirement.
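For the third prompt, it helps to have a reference implementation to compare the models’ code against. Here’s a compact expand-around-center version (my own sketch, not taken from any of the models):
def longest_palindrome(s: str) -> str:
    """Return the first longest palindromic substring of s (O(n^2) time)."""
    if len(s) < 2:
        return s
    best_start, best_len = 0, 1
    for center in range(len(s)):
        # Try odd-length (center, center) and even-length (center, center + 1) palindromes
        for left, right in ((center, center), (center, center + 1)):
            while left >= 0 and right < len(s) and s[left] == s[right]:
                left -= 1
                right += 1
            length = right - left - 1
            if length > best_len:          # strict '>' keeps the first occurrence
                best_start, best_len = left + 1, length
    return s[best_start:best_start + best_len]

print(longest_palindrome("babad"))    # "bab" (first of the two length-3 palindromes)
print(longest_palindrome(""))         # ""
print(longest_palindrome("日本本日"))   # "日本本日" (the whole string is a palindrome)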
Results
All three models got every problem right. Here’s how they performed:
| Test | Model | Time (sec) | Tokens | Result |
|---|---|---|---|---|
| Pizza Problem | Qwen3-Max-Thinking | 157 | 4,419 | ✓ Correct (0.9 slices, 178.44 sq in) |
| | GPT-5 | 45 | 1,490 | ✓ Correct (0.9 slices, 178.44 sq in) |
| | Claude | 14 | 1,010 | ✓ Correct (0.9 slices, 178.44 sq in) |
| Theater Seating | Qwen3-Max-Thinking | 166 | 5,728 | ✓ Found all 8 arrangements |
| | GPT-5 | 83 | 4,607 | ✓ Found all 8 arrangements |
| | Claude | 32 | 3,081 | ✓ Found all 8 arrangements |
| Palindrome Code | Qwen3-Max-Thinking | 191 | 7,345 | ✓ Working O(n²) solution |
| | GPT-5 | 58 | 3,263 | ✓ Working O(n²) solution |
| | Claude | 48 | 3,279 | ✓ Working O(n²) solution |
The pattern is clear: Qwen3 consistently takes the longest, roughly 4-11x slower than Claude depending on the task, and uses about 2-4x more tokens.
Qwen3’s approach
It considers everything. On the pizza problem, it questioned whether fractional slices make sense, considered if the math should round to whole numbers, validated calculations multiple times, and examined alternative interpretations before committing to an answer.
You can see it thinking “what if this” and “but maybe that” throughout. For the seating problem, it worked through “if Ben sits here, where can Amy go” for every possibility.
The palindrome solution came with deep analysis of why it picked that algorithm over alternatives, how it handles Unicode edge cases, and thorough complexity breakdowns. It’s like watching someone think out loud and check their work repeatedly.
GPT-5’s approach
It’s structured and methodical. It laid out each problem, worked through steps in order, and verified results. Less exploration, more execution.
On the seating puzzle, it placed Emma at an end first, then filled positions while checking rules.
The palindrome code came with clear reasoning about algorithm choice and solid documentation. It follows a plan without much backtracking. You get enough detail to understand the thinking, but it doesn’t second-guess itself constantly.
Claude’s approach
It moves fast. It spots the solution path quickly and executes cleanly. For the seating problem, it organized by cases (Emma at position 1 vs position 5) and worked through each one. The palindrome code was clean and well-documented, but got to the point faster. It shows its work but doesn’t linger. You see the thinking but it’s concise.
The practical takeaway: Use Qwen3 when you need to see every consideration that went into a solution. The detailed thinking helps when you’re learning from the model or validating complex decisions. Use Claude when speed matters and you trust it to get the answer without showing you every step. Use GPT-5 when you want something between those extremes.
Note: These tests can’t fully capture all three models’ abilities. Your results may vary with different prompts.
Conclusion
In this tutorial, you set up Qwen3-Max-Thinking, tested its thinking mode on different types of problems, and built a comparison app that runs the same prompts through Qwen3, GPT-5, and Claude Sonnet 4.5 side-by-side.
The key insight is that each model has strengths for different scenarios. Qwen3 excels when you need detailed reasoning and want to understand every step. Claude wins on speed and clean execution. GPT-5 balances both approaches.
Use your comparison app to test these models with your own problems. As you experiment with different tasks, you’ll develop a sense for which model works best for your needs.
To go deeper with Qwen3 and multi-model development, check out the OpenRouter tutorial for managing multiple model APIs, or learn how to run smaller Qwen3 models locally with Ollama for offline development.

I am a data science content creator with over 2 years of experience and one of the largest followings on Medium. I like to write detailed articles on AI and ML with a bit of a sarcastic style because you've got to do something to make them a bit less dull. I have produced over 130 articles and a DataCamp course to boot, with another one in the making. My content has been seen by over 5 million pairs of eyes, 20k of whom became followers on both Medium and LinkedIn.



