An open-source model beats GPT-5 and Claude Sonnet 4.5 on several major benchmarks at a fraction of the cost. That model is Kimi K2 Thinking, Moonshot AI’s reasoning variant, released in November 2025.
Kimi K2 Thinking can execute 200–300 sequential tool calls autonomously, making it ideal for complex agentic workflows. It also exposes its reasoning process through a dedicated API field, so you can see exactly how it thinks through problems.
This tutorial shows you how to use the Kimi K2 Thinking API for reasoning tasks. You’ll implement tool-calling workflows, build a comparison chat app to test K2 against GPT-5 and Claude, and learn when K2’s strengths make it the better choice. If you’re keen to learn more about building apps with LLMs, I recommend taking the Developing LLM Applications with LangChain course.
Here’s a preview of the app:
Before diving into the hands-on work, let’s understand what makes K2 Thinking different from traditional language models.
What Is Kimi K2 Thinking?
Most language models generate responses right away. K2 Thinking works differently. It’s built for multi-step problems where the model needs to plan, reason, and act on its own.
Architecture and design
The model uses a Mixture-of-Experts (MoE) architecture with 1 trillion total parameters. Only 32 billion activate for any given input. This gives K2 the power of a massive model without the high inference costs. The 256,000-token context window means you can feed it entire codebases, long documents, or extensive conversation histories without chunking.
Moonshot AI released two versions. K2 Instruct handles straightforward tasks like text generation, classification, and simple Q&A, where speed matters. K2 Thinking is for complex reasoning tasks.
Two standout capabilities
K2 Thinking’s first major feature is transparent reasoning. You can see how it breaks down problems, evaluates options, and arrives at conclusions.
The second is tool orchestration. K2 handles extensive sequential tool calling, far beyond what most models manage. It decides which tools to use, when to use them, and how to combine results without you stepping in at each stage.
Additional features
K2 includes features for different deployment needs. For production scenarios where accuracy is critical, Heavy Mode runs eight reasoning paths in parallel and selects the best answer, though at higher computational cost.
For latency-sensitive applications, INT4 quantization doubles inference speed with minimal accuracy loss, making it practical for high-throughput environments.
Kimi K2 Thinking vs GPT-5 vs Claude Sonnet 4.5 vs DeepSeek
Here’s how K2 compares to GPT-5, Claude Sonnet 4.5, and DeepSeek V3.2.
| Metric | Kimi K2 Thinking | GPT-5 (High) | Claude Sonnet 4.5 | DeepSeek-V3.2 |
|---|---|---|---|---|
| HLE (w/ tools) | 44.9 | 41.7 | 32 | 20.3 |
| HLE (Heavy Mode) | 51 | 42 | — | — |
| AIME25 (w/ Python) | 99.1% | 99.6% | 100% | 58.1% |
| GPQA | 84.5 | 85.7 | 83.4 | 79.9 |
| BrowseComp | 60.2 | 54.9 | 24.1 | 40.1 |
| Frames | 87 | 86 | 85 | 80.2 |
| SWE-bench Verified | 71.3% | 74.9% | 77.2% | 67.8% |
| LiveCodeBench | 83.1% | 87.0% | 64.0% | 74.1% |
| Context window | 256k tokens | 400k tokens | 200k tokens | 128k tokens |
| Input pricing | $0.60 / 1M | $1.25 / 1M | $3.00 / 1M | $0.55 / 1M |
| Output pricing | $2.50 / 1M | $10.00 / 1M | $15.00 / 1M | $2.19 / 1M |
| Max tool calls | 200–300 | Dozens | Dozens | Not specified |
Quick benchmark definitions: HLE (Humanity’s Last Exam) tests expert-level knowledge across many disciplines, AIME25 uses competition math problems, BrowseComp and Frames test web navigation and information synthesis, SWE-bench Verified measures bug fixing in real open-source repositories, LiveCodeBench tests code generation, and GPQA covers graduate-level science questions.
K2’s tool training shows up most clearly on agentic benchmarks. Heavy Mode beats GPT-5 by 9 points on HLE because it runs eight reasoning paths in parallel. Claude’s code training gives it the edge on SWE-bench Verified, making it the top pick for software engineering projects.
On pure reasoning tasks, all three models perform about the same. The differences appear when tools get involved. K2’s 200–300 tool call capacity lets it handle autonomous research workflows, debugging pipelines, and multi-step data analysis without you stepping in at each stage.
When to pick each model
- Pick K2 Thinking for agentic workflows that need extensive tool orchestration, web research, and information synthesis tasks, or when you need transparent reasoning chains for debugging or compliance.
- Pick GPT-5 for the largest context window at 400k tokens or when you need balanced performance across different tasks with mature ecosystem support.
- Pick Claude Sonnet 4.5 for software engineering projects where debugging and code fixes matter most.
- Pick DeepSeek for budget priority with open-source requirements under MIT license.
Setting Up Kimi K2 Thinking API Access
You have two options for accessing K2 Thinking: directly through Moonshot AI’s platform.moonshot.ai, or through OpenRouter, a unified API gateway. We’ll use OpenRouter because it provides access to K2, GPT-5, Claude, and dozens of other models with a single API key. This unified access becomes essential later when you build the comparison chat app. OpenRouter also handles rate limiting and failover automatically, so you don’t need to manage multiple provider accounts.
First, head to openrouter.ai and create an account. You can sign up with Google or GitHub. Once you’re in, go to the “Keys” section in the dashboard and generate a new API key.
OpenRouter offers $5 in free credits when you sign up, which is enough to test K2 Thinking on several queries. If you need more credits, you can add a payment method in the “Credits” section.
Copy your API key and keep it somewhere safe. You’ll need it in a moment. Don’t commit this key to version control or share it publicly, since anyone with your key can make API calls on your account.
Now let’s set up your Python environment. You’ll need the OpenAI Python SDK, which works with OpenRouter’s API since they maintain OpenAI compatibility. You’ll also want python-dotenv to manage your API key securely. Install both packages:
pip install openai python-dotenv
Create a .env file in your project directory and add your OpenRouter API key:
OPENROUTER_API_KEY=your_key_here
If you’re using git, add .env to your .gitignore file so you don't accidentally commit your credentials.
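For example, adding this single line to your .gitignore (create the file if it doesn’t exist) is enough to keep the key out of your commits:
# .gitignore
.env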
Here’s a quick test to verify everything works:
import os
from openai import OpenAI
from dotenv import load_dotenv
# Load API key from .env file
load_dotenv()
# Configure client for OpenRouter
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.getenv("OPENROUTER_API_KEY")
)
# Test call to K2 Thinking
response = client.chat.completions.create(
    model="moonshotai/kimi-k2-thinking",
    messages=[
        {"role": "user", "content": "What is 15 * 24?"}
    ]
)
print(response.choices[0].message.content)
Output:
15 * 24 = **360**
If you see the answer, you’re all set.
Understanding Kimi K2’s Thinking Mode
Earlier, we mentioned K2’s transparent reasoning. Now let’s see exactly how to access and use it in your code. K2 Thinking automatically exposes its reasoning process through a dedicated API field.
Unlike some other models, where you need to enable thinking mode manually, K2 includes reasoning content in every response by default.
How to access reasoning content
When you make an API call to K2 Thinking, the response contains two fields. The content field holds the final answer, and the reasoning field shows the step-by-step thinking process. Here's a practical example with a discount calculation that requires multiple steps:
import os
from openai import OpenAI
from dotenv import load_dotenv
load_dotenv()
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.getenv("OPENROUTER_API_KEY")
)
response = client.chat.completions.create(
    model="moonshotai/kimi-k2-thinking",
    messages=[
        {"role": "user", "content": "A laptop costs $850. There’s a 20% discount today, then 8% sales tax applies to the discounted price. What’s the final amount I pay?"}
    ],
    temperature=1.0
)
print("Final answer:")
print(response.choices[0].message.content)
print("\nReasoning process:")
print(response.choices[0].message.reasoning)
Output:
Final answer:
The final amount you pay is **$734.40**.
Here's the breakdown:
1. **Discount:** 20% of $850 = $170
2. **Discounted price:** $850 - $170 = $680
3. **Sales tax:** 8% of $680 = $54.40
4. **Final amount:** $680 + $54.40 = **$734.40**
Reasoning process (truncated for readability):
The user wants to know the final amount they will pay for a laptop with a discount and then sales tax applied.
Step 1: Calculate the discount amount.
Discount amount = $850 * 0.20 = $170
Step 2: Calculate the discounted price.
Discounted price = $850 - $170 = $680
Step 3: Calculate the sales tax amount.
Sales tax amount = $680 * 0.08 = $54.40
Step 4: Calculate the final amount to pay.
Final amount = $680 + $54.40 = $734.40
Let me double-check the calculations...
[verification steps omitted]
The final amount is $734.40
The content field gives you the clean answer your users see. The reasoning field shows how K2 worked through the problem, which helps you understand whether the model actually reasoned correctly or just got lucky.
K2 generates reasoning tokens during inference, not after. This means the model thinks through the problem before committing to an answer, similar to how you might work through a math problem on scratch paper before writing the final answer. The reasoning happens in real-time as part of the generation process.
Recommended parameters for Thinking mode
Set temperature=1.0 when using K2 Thinking. Lower temperatures restrict the model's reasoning exploration, which defeats the purpose of Thinking mode. You also want max_tokens=4096 or higher since reasoning chains take up tokens before the final answer even starts.
response = client.chat.completions.create(
    model="moonshotai/kimi-k2-thinking",
    messages=[{"role": "user", "content": "Your question here"}],
    temperature=1.0,
    max_tokens=4096
)
Now that you understand thinking mode, let’s explore K2’s other key capability: tool calling. While many models support basic tool use, K2’s approach is different in scale and autonomy.
Kimi K2’s Tool Calling Capabilities
Tool calling lets K2 execute functions during reasoning. You define tools as JSON schemas describing what each function does and what parameters it needs. When K2 needs external data or computation, it calls the right tool and waits for results.
This creates a back-and-forth loop: K2 reads your prompt and reasons about it. If it needs a tool, it returns finish_reason: "tool_calls". You execute the function, send results back with role: "tool", and K2 continues. This repeats until K2 returns finish_reason: "stop" with the final answer.
The schema follows OpenAI’s format with a name, description, and parameters. K2 reads these to understand when to use each tool. Clear descriptions matter because they guide the model’s decisions.
Building the CSV analyzer tool
Let’s build a practical tool that analyzes CSV files, a common data science task. This tool reads a CSV, extracts column information, and returns summary statistics. It demonstrates how K2 can access external data during reasoning, which is essential for data-driven applications.
Here’s the complete implementation:
import csv
import os
import json
# Define the tool schema
tools = [
    {
        "type": "function",
        "function": {
            "name": "analyze_csv",
            "description": "Read and analyze the first few rows of a CSV file. Returns columns, sample rows, total row count, and file size.",
            "parameters": {
                "type": "object",
                "properties": {
                    "filepath": {
                        "type": "string",
                        "description": "Path to the CSV file"
                    },
                    "num_rows": {
                        "type": "integer",
                        "description": "Number of rows to read",
                        "default": 10
                    }
                },
                "required": ["filepath"]
            }
        }
    }
]
The schema defines a function tool following OpenAI’s format. The description field tells K2 what the tool does and guides its decision on when to use it.
The parameters object specifies the inputs: filepath is required, while num_rows is optional with a default value of 10. Clear descriptions help K2 understand when this tool matches the user's needs.
The implementation handles the actual file processing:
# Implement the function
def analyze_csv(filepath: str, num_rows: int = 10) -> dict:
    if not os.path.exists(filepath):
        return {"error": f"File not found: {filepath}"}
    try:
        with open(filepath, 'r') as f:
            reader = csv.DictReader(f)
            columns = reader.fieldnames
            sample_rows = [dict(row) for i, row in enumerate(reader) if i < num_rows]
            f.seek(0)
            total_rows = sum(1 for _ in f) - 1
            return {
                "columns": list(columns),
                "sample_rows": sample_rows,
                "total_rows": total_rows,
                "file_size_kb": round(os.path.getsize(filepath) / 1024, 2)
            }
    except Exception as e:
        return {"error": str(e)}
The function validates the file path, then uses csv.DictReader to parse CSV data into dictionaries where keys are column names. It reads the first num_rows entries, resets the file pointer with seek(0), and counts total rows.
The return dictionary includes column names, sample data, row count, and file size in kilobytes. Error handling ensures the tool returns useful feedback if something fails.
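Before handing the tool to K2, it’s worth sanity-checking it directly. A quick manual call (assuming a hypothetical sample_employees.csv sits in your working directory) should return a dictionary shaped like this:
# Quick manual test of the tool before wiring it into K2
# (assumes a hypothetical sample_employees.csv exists locally)
result = analyze_csv("sample_employees.csv", num_rows=3)
print(json.dumps(result, indent=2))
# Expected shape (exact values depend on your file):
# {
#   "columns": ["name", "department", "salary"],
#   "sample_rows": [...],
#   "total_rows": 12,
#   "file_size_kb": 0.45
# }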
Implementing the tool execution loop
With your tool defined and implemented, you need a loop that handles the conversation between K2 and your functions. Tool calling isn’t a one-shot operation. K2 might need multiple iterations: call a tool, analyze results, decide if another tool is needed, and repeat. This requires a loop that handles the back-and-forth conversation until K2 reaches its final answer.
Here’s the complete implementation with a real example:
from openai import OpenAI
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.getenv("OPENROUTER_API_KEY")
)
messages = [
    {"role": "user", "content": "Analyze sample_employees.csv and tell me the average salary for Engineering department employees."}
]
This initializes the OpenAI client configured for OpenRouter’s API and creates the initial message list with the user’s query about analyzing employee salaries.
while True:
    response = client.chat.completions.create(
        model="moonshotai/kimi-k2-thinking",
        messages=messages,
        tools=tools,
        temperature=1.0
    )
    message = response.choices[0].message
    finish_reason = response.choices[0].finish_reason
    messages.append({
        "role": "assistant",
        "content": message.content,
        "tool_calls": message.tool_calls
    })
The main loop sends the conversation history to K2 along with available tools. After extracting the assistant’s response and finish reason, append it to the message history to maintain conversation context.
    if finish_reason == "tool_calls":
        for tool_call in message.tool_calls:
            function_name = tool_call.function.name
            arguments = json.loads(tool_call.function.arguments)
            if function_name == "analyze_csv":
                result = analyze_csv(**arguments)
                messages.append({
                    "role": "tool",
                    "content": json.dumps(result),
                    "tool_call_id": tool_call.id
                })
    elif finish_reason == "stop":
        print(message.content)
        break
When K2 requests tool calls, we execute each function with the provided arguments and append results back to the conversation with matching tool_call_id values. When K2 finishes reasoning and returns "stop", we print the final answer and exit the loop.
For this example, create a sample_employees.csv file with employee data including name, department, and salary columns. Use about 12 employees across the Engineering, Marketing, Sales, and HR departments for testing. When you run this code, K2 typically calls analyze_csv() with num_rows set to 12 or higher to ensure complete data coverage, then identifies Engineering employees and calculates their average salary.
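If you don’t have such a file handy, here’s a quick way to generate one. The Engineering rows match the names and salaries in the output below; the other departments’ names and salaries are arbitrary placeholders:
import csv

# Hypothetical sample data: 5 Engineering employees (matching the output below)
# plus 7 placeholder rows across Marketing, Sales, and HR.
employees = [
    ("Alice Johnson", "Engineering", 95000),
    ("Bob Smith", "Marketing", 72000),
    ("Carol White", "Engineering", 110000),
    ("Dan Brown", "Sales", 65000),
    ("Eve Davis", "Engineering", 88000),
    ("Frank Miller", "HR", 60000),
    ("Grace Lee", "Marketing", 70000),
    ("Henry Wilson", "Sales", 68000),
    ("Iris Martinez", "Engineering", 102000),
    ("Jack Taylor", "HR", 62000),
    ("Kate Thomas", "Engineering", 92000),
    ("Liam Anderson", "Sales", 71000),
]

with open("sample_employees.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "department", "salary"])
    writer.writerows(employees)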
Output:
Based on the CSV data, I found 5 Engineering employees in the sample:
- Alice Johnson: $95,000
- Carol White: $110,000
- Eve Davis: $88,000
- Iris Martinez: $102,000
- Kate Thomas: $92,000
The average salary for Engineering department employees is $97,400.
The tool_call_id in each response links results back to their function calls. K2 might request multiple tools in one response, so each needs its ID for proper matching. Preserving the full conversation history gives K2 the context it needs for coherent reasoning across tool calls.
Understanding K2’s independent orchestration
Once you’re comfortable with basic tool calling, K2’s real strength becomes apparent: extended orchestration. K2 was trained for tool orchestration from the start, not as an add-on. This changes how it handles multi-step tasks. Most models manage 20–50 sequential tool calls before performance drops. K2 handles 200–300 while maintaining coherent reasoning.
This capacity matters for independent workflows where the model decides, gathers information, validates results, and iterates until it solves the problem. Consider research tasks where K2 searches databases, cross-references findings, identifies gaps, refines queries, and synthesizes everything. Or data pipelines that need validation at each step. Or debugging workflows that test multiple hypotheses.
If your use case is “take this single action,” most models work fine. If it’s “keep working until you solve this,” K2’s extended tool capacity becomes the deciding factor.
Building a Multi-Model Comparison Chat App
You’ve seen how K2 Thinking handles individual tasks with tool calling and transparent reasoning. Now let’s put it to the test against its competitors. You’ll build a Streamlit app that queries Kimi K2 Thinking, GPT-5, and Claude Sonnet 4.5 with the same prompt at once. The interface displays all three responses side-by-side so you can see how each model tackles the same problem.
You can find the full application script in this GitHub Gist. If you only want the test results, feel free to skip ahead to the next section. The steps below break down the script (over 200 lines) in detail, so you can also return to them later when you have time to follow along. With that said, let’s get started.
Step 1: Project setup and dependencies
This comparison app requires API access to all three models. You’ll need your OpenRouter account (already set up with $5 in free credits), and optionally OpenAI and Anthropic API keys. The $5 in OpenRouter credits should be enough to test all three models on multiple queries. If you only have OpenRouter access, you can still compare K2 with other models available through OpenRouter.
Install Streamlit for the web interface and Anthropic’s SDK for Claude access:
pip install streamlit anthropic
Create a file named model_comparison_chat.py. The entire application lives in this single file. Begin with the imports:
import os
import time
from typing import Dict, List
from concurrent.futures import ThreadPoolExecutor, as_completed
from dotenv import load_dotenv
import streamlit as st
from openai import OpenAI
from anthropic import Anthropic
load_dotenv()
Step 2: Configure API connections
You already have OPENROUTER_API_KEY in your .env file. Add keys for GPT-5 and Claude:
OPENROUTER_API_KEY=your_openrouter_key # Already set up
OPENAI_API_KEY=your_openai_key # Add this
ANTHROPIC_API_KEY=your_anthropic_key # Add this
Get your OpenAI API key from platform.openai.com and your Anthropic key from console.anthropic.com. For more details on GPT-5 setup, see the GPT-5 guide. For Claude Sonnet 4.5, check the Claude Sonnet 4.5 overview.
Initialize the three clients:
KIMI_CLIENT = OpenAI(
    api_key=os.getenv("OPENROUTER_API_KEY"),
    base_url="https://openrouter.ai/api/v1"
)
GPT_CLIENT = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
CLAUDE_CLIENT = Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
Step 3: Create helper functions for each model
Build one unified function that routes to the correct model based on the name. This keeps your code clean and makes it easy to add more models later.
Start with the Kimi K2 call:
def call_model(model_name: str, messages: List[Dict], **kwargs) -> Dict:
    """Call any model with unified interface."""
    try:
        start_time = time.time()
        if model_name == "Kimi K2 Thinking":
            enable_thinking = kwargs.get("enable_thinking", True)
            completion = KIMI_CLIENT.chat.completions.create(
                model="moonshotai/kimi-k2-thinking",
                messages=messages,
                temperature=1.0,
                extra_body={"include_reasoning": True}
            )
            message = completion.choices[0].message
            content = message.content
            # Extract reasoning from the dedicated field
            reasoning_content = None
            if enable_thinking and hasattr(message, 'reasoning') and message.reasoning:
                reasoning_content = message.reasoning
            return {
                "content": content,
                "reasoning_content": reasoning_content,
                "response_time": time.time() - start_time,
                "tokens_used": {
                    "input": completion.usage.prompt_tokens,
                    "output": completion.usage.completion_tokens,
                    "total": completion.usage.total_tokens
                },
                "error": None
            }
GPT-5 takes a different approach with a responses endpoint and configurable reasoning effort:
        elif model_name == "GPT-5":
            reasoning_effort = kwargs.get("reasoning_effort", "medium")
            input_messages = [
                {"role": m["role"], "content": m["content"]}
                for m in messages
            ]
            response = GPT_CLIENT.responses.create(
                model="gpt-5",
                input=input_messages,
                reasoning={
                    "effort": reasoning_effort,
                    "summary": "auto"
                }
            )
            # Parse response structure
            reasoning_text = None
            content_text = ""
            for item in response.output:
                if item.type == "reasoning" and hasattr(item, "summary"):
                    summaries = [s.text for s in item.summary if hasattr(s, "text")]
                    reasoning_text = "\n\n".join(summaries) if summaries else None
                elif item.type == "message" and hasattr(item, "content"):
                    content_text += "".join(c.text for c in item.content if hasattr(c, "text"))
            return {
                "content": content_text,
                "reasoning_content": reasoning_text,
                "response_time": time.time() - start_time,
                "tokens_used": {
                    "input": getattr(response.usage, "input_tokens", 0),
                    "output": getattr(response.usage, "output_tokens", 0),
                    "total": getattr(response.usage, "total_tokens", 0)
                },
                "error": None
            }
GPT-5 offers four reasoning effort levels: minimal, low, medium, and high. The response structure separates reasoning into summary blocks and content into message blocks, which you parse separately before returning the same standardized dictionary format as Kimi K2.
Claude Sonnet 4.5 implements extended thinking mode with a token budget approach:
        else:  # Claude Sonnet 4.5
            enable_thinking = kwargs.get("enable_thinking", True)
            params = {
                "model": "claude-sonnet-4-5",
                "max_tokens": 10000,
                "messages": messages
            }
            if enable_thinking:
                params["thinking"] = {
                    "type": "enabled",
                    "budget_tokens": 5000
                }
            message = CLAUDE_CLIENT.messages.create(**params)
            # Extract content and thinking blocks
            content_text = ""
            thinking_text = None
            for block in message.content:
                if block.type == "thinking":
                    thinking_text = block.thinking
                elif block.type == "text":
                    content_text += block.text
            return {
                "content": content_text,
                "reasoning_content": thinking_text,
                "response_time": time.time() - start_time,
                "tokens_used": {
                    "input": message.usage.input_tokens,
                    "output": message.usage.output_tokens,
                    "total": message.usage.input_tokens + message.usage.output_tokens
                },
                "error": None
            }
    except Exception as e:
        return {
            "content": None,
            "reasoning_content": None,
            "response_time": 0,
            "tokens_used": {"input": 0, "output": 0, "total": 0},
            "error": f"{model_name} Error: {str(e)}"
        }
Claude allocates a token budget for internal reasoning before generating the final answer. The response contains multiple content blocks that you iterate through, separating thinking blocks from text blocks. The error handling at the end returns the same standardized dictionary structure with an error message, ensuring failed API calls don’t break the interface.
Step 4: Build the Streamlit interface
With the API layer ready, you can build the web interface. Configure the Streamlit page first:
def main():
    st.set_page_config(
        page_title="Multi-Model Comparison Chat",
        page_icon="🤖",
        layout="wide",
        initial_sidebar_state="expanded"
    )
    st.session_state.setdefault("messages", [])
    st.session_state.setdefault("model_responses", [])
The wide layout creates space for three side-by-side columns. Session state preserves conversation history and model responses when Streamlit reruns the script.
Add the page title and sidebar controls:
    st.title("🤖 Multi-Model Comparison Chat")
    st.markdown("""
Compare **Kimi K2 Thinking**, **GPT-5**, and **Claude Sonnet 4.5** side-by-side.
All three models support reasoning modes - see how they approach problems differently.
""")
    with st.sidebar:
        st.header("⚙️ Settings")
        st.subheader("Thinking Mode")
        kimi_thinking = st.checkbox("Enable Kimi K2 Thinking", value=True)
        gpt5_reasoning = st.selectbox(
            "GPT-5 Reasoning Effort",
            options=["minimal", "low", "medium", "high"],
            index=2
        )
        claude_thinking = st.checkbox("Enable Claude Thinking", value=True)
The sidebar gives users control over thinking modes. Kimi K2 and Claude have binary toggles. GPT-5 provides four reasoning levels because its API exposes this granular control.
Below the thinking controls, add an API status indicator so users know which services are connected:
        st.subheader("API Status")
        st.markdown(f"""
- Kimi K2: {"✅" if os.getenv("OPENROUTER_API_KEY") else "❌"}
- GPT-5: {"✅" if os.getenv("OPENAI_API_KEY") else "❌"}
- Claude: {"✅" if os.getenv("ANTHROPIC_API_KEY") else "❌"}
""")
The checkmarks appear when environment variables are set. Red X marks mean you need to add the corresponding key to your .env file.
Step 5: Implement the comparison logic
Now comes the core functionality. When users submit a message, you’ll call all three models in parallel and display results side-by-side.
def call_models_parallel(messages: List[Dict], selected_models: List[str],
                         kimi_thinking: bool, gpt5_reasoning: str,
                         claude_thinking: bool) -> Dict[str, Dict]:
    """Call multiple models in parallel."""
    model_kwargs = {
        "Kimi K2 Thinking": {"enable_thinking": kimi_thinking},
        "GPT-5": {"reasoning_effort": gpt5_reasoning},
        "Claude Sonnet 4.5": {"enable_thinking": claude_thinking}
    }
    with ThreadPoolExecutor(max_workers=3) as executor:
        futures = {
            executor.submit(call_model, m, messages, **model_kwargs[m]): m
            for m in selected_models if m in model_kwargs
        }
        return {futures[f]: f.result() for f in as_completed(futures)}
This function submits three API calls at once and collects results as they complete. Sequential calls would take 15–30 seconds. Parallel execution cuts this to 5–10 seconds because you’re only waiting for the slowest model instead of the sum of all three.
Integrate parallel execution into the chat interface:
    if prompt := st.chat_input("Ask a question to all models..."):
        st.session_state.messages.append({"role": "user", "content": prompt})
        with st.chat_message("user"):
            st.markdown(prompt)
        api_messages = [
            {"role": m["role"], "content": m["content"]}
            for m in st.session_state.messages
        ]
        with st.spinner("🤔 Models are thinking..."):
            responses = call_models_parallel(
                api_messages,
                ["Kimi K2 Thinking", "GPT-5", "Claude Sonnet 4.5"],
                kimi_thinking,
                gpt5_reasoning,
                claude_thinking
            )
        cols = st.columns(3)
        for idx, (model_name, response_data) in enumerate(responses.items()):
            with cols[idx]:
                st.markdown(f"### {model_name}")
                if response_data["reasoning_content"]:
                    with st.expander("🧠 Thinking Process", expanded=False):
                        st.markdown(response_data["reasoning_content"])
                st.markdown(response_data["content"])
This creates three columns and displays each model’s response in its own column. The thinking process starts collapsed so users see the final answers first. They can expand the thinking section when they want to examine the reasoning.
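One detail these snippets don’t show: Streamlit executes the script top to bottom, so main() has to be called at the end of the file. If you’re assembling the app from the snippets here rather than copying the Gist, close the file with:
# Entry point: call main() when Streamlit executes the script
if __name__ == "__main__":
    main()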
Step 6: Test and refine
Run the app with:
streamlit run model_comparison_chat.py
Start with a basic test to confirm everything works:
What is 2+2?
All three models should respond within seconds. This verifies your API connections and confirms parallel execution is working. The app runs locally, giving you full control over when APIs are called. Each query incurs API costs (approximately $0.01–0.05 per three-model comparison depending on response length), but running locally means you decide exactly when to spend.
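As a rough sanity check on that estimate, you can plug hypothetical token counts (say, 500 input and 1,500 output tokens per model) into the pricing from the comparison table:
# Back-of-the-envelope cost per three-model comparison (hypothetical token counts)
PRICES = {  # (input $/1M tokens, output $/1M tokens) from the comparison table
    "Kimi K2 Thinking": (0.60, 2.50),
    "GPT-5": (1.25, 10.00),
    "Claude Sonnet 4.5": (3.00, 15.00),
}
input_tokens, output_tokens = 500, 1_500  # assumed typical query size

total = 0.0
for model, (p_in, p_out) in PRICES.items():
    cost = input_tokens / 1e6 * p_in + output_tokens / 1e6 * p_out
    total += cost
    print(f"{model}: ${cost:.4f}")
print(f"Total per comparison: ${total:.4f}")  # roughly $0.04 with these assumptions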
Testing The App With Complex Prompts
You built the comparison app and saw it work with simple questions. But the real differences between these models show up when you push them harder. To demonstrate these differences, I tested all three models with complex prompts. You can run these same tests in your app to see the results firsthand. Here are the key insights from that testing.
The test prompts
I chose three challenges, each testing a distinct reasoning ability.
Test 1: Advanced mathematical reasoning
A rectangular garden is 24 feet long and 18 feet wide. A path 2 feet wide runs around the outside perimeter of the garden. If Sarah plants flowers in the garden at a density of 4 flowers per square foot, and then decides to add decorative stones to 30% of the planted area (removing those flowers), how many flowers remain? Also, what is the total area (in square feet) covered by the path?
Test 2: Multi-step logical reasoning
Four friends - Alice, Bob, Carol, and David - need to arrange themselves in a line for a photo. The following rules must be followed:
1. Alice must stand next to Bob
2. Carol cannot stand at either end
3. David must stand to the left of Alice
List all valid arrangements from left to right.
Test 3: Code generation with constraints
Write a Python function that checks if a string is a valid IPv4 address, with these requirements:
1. Must validate format (four octets separated by dots)
2. Each octet must be 0-255 (no leading zeros except for "0" itself)
3. Must reject invalid formats (empty strings, too many/few octets, non-numeric characters)
4. Time complexity must be O(n) where n is the string length
5. Include complete docstring with examples
6. Add type hints for all parameters and return values
7. Handle edge cases (empty string, None, special characters)
Also explain your validation approach and why it's O(n).
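For reference, here’s roughly the shape of solution all three models produced. This is my own minimal sketch of the spec above, not a transcript of any model’s output:
from typing import Optional

def is_valid_ipv4(address: Optional[str]) -> bool:
    """Check whether a string is a valid IPv4 address.

    Examples:
        >>> is_valid_ipv4("192.168.1.1")
        True
        >>> is_valid_ipv4("256.1.1.1")
        False
        >>> is_valid_ipv4("01.2.3.4")
        False
        >>> is_valid_ipv4(None)
        False
    """
    if not isinstance(address, str) or not address:
        return False
    octets = address.split(".")  # one pass over the string: O(n)
    if len(octets) != 4:
        return False
    for octet in octets:
        # Reject empty octets, signs, spaces, and non-ASCII digit characters
        if not octet or any(c not in "0123456789" for c in octet):
            return False
        # No leading zeros except for "0" itself
        if len(octet) > 1 and octet[0] == "0":
            return False
        if int(octet) > 255:
            return False
    return True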
Results
All three models solved every problem correctly. But they took very different paths to get there. Here’s the performance breakdown:
| Test | Model | Time (sec) | Tokens | Result |
|---|---|---|---|---|
| Garden Problem | Kimi K2 | 28.5 | 3,420 | ✓ Correct (1,209.6 flowers, 184 sq ft) |
| | GPT-5 | 44.6 | 1,476 | ✓ Correct (1,209.6 flowers, 184 sq ft) |
| | Claude | 19.7 | 1,386 | ✓ Correct (1,210 flowers, 184 sq ft) |
| Photo Arrangement | Kimi K2 | 50.5 | 6,249 | ✓ Found all 2 valid arrangements |
| | GPT-5 | 39.4 | 1,914 | ✓ Found all 2 valid arrangements |
| | Claude | 31.3 | 2,310 | ✓ Found all 2 valid arrangements |
| IPv4 Validator | Kimi K2 | 192.6 | 5,245 | ✓ Working O(n) solution |
| | GPT-5 | 73.7 | 4,233 | ✓ Working O(n) solution |
| | Claude | 43.1 | 3,330 | ✓ Working O(n) solution |
Here’s what each model actually does during that time.
Kimi K2’s approach
It explores and questions itself constantly. On the garden problem, it visualized the setup, worked through the math, then stopped to say “Wait, let me double-check that…” before continuing. For the photo arrangement, it tested each constraint one by one, showing its work for every possibility.
The IPv4 validator took over 3 minutes because it considered edge cases one by one, walked through the O(n) complexity proof, and verified the logic multiple times. You’re watching someone think out loud, catch themselves, and verify their reasoning at each step.
GPT-5’s approach
It’s methodical and organized. It broke the garden problem into clear sections, numbered the steps, and worked through them in order. The photo arrangement got a structured walkthrough where it placed people position by position while checking rules.
The code came with clear explanations and well-organized documentation. It doesn’t second-guess itself as much as K2, and it doesn’t rush through like Claude. You get a balanced view of the thinking without excessive detail.
Claude’s approach
It moves fast. It identified the solution path quickly on each test and executed cleanly. For the photo problem, it organized by cases and worked through each one without lingering. The IPv4 code was clean and well-documented, but got to the point faster.
Claude shows its thinking but doesn’t dwell on verification steps. Speed comes from confidence in the approach and minimal backtracking.
These different approaches create a clear trade-off pattern. Pick K2 when you need detailed reasoning for learning, debugging, or audit requirements. Pick Claude when speed matters and you trust the results without inspecting reasoning chains. Pick GPT-5 when you want balanced performance between K2’s verbosity and Claude’s speed.
Note: Take these results with a grain of salt; three prompts can’t realistically probe the full range of all three models’ abilities.
Conclusion
You’ve set up Kimi K2 Thinking through OpenRouter, explored its reasoning capabilities, built tool-calling workflows, and created a comparison app to test it against GPT-5 and Claude. K2’s strengths are independent tool orchestration and transparent reasoning, making it well-suited for complex workflows where you need to verify the model’s decisions.
K2’s capacity and affordability make this practical. It can execute 200–300 sequential tool calls, compared with a few dozen for most competitors, and the reasoning field exposes every thinking step. At roughly a quarter of GPT-5’s output-token price and a sixth of Claude’s, you can run extensive experiments without budget concerns.
The comparison app you built gives you direct evidence of these tradeoffs. Test it with your own problems and workflows to develop intuition for whether K2’s detailed reasoning justifies the longer response times. Real-world testing with your specific use cases will reveal patterns that generic benchmarks can’t capture.
To expand on these concepts, check out DataCamp’s OpenRouter tutorial for managing multiple LLM providers, GPT-5 guide for reasoning effort levels, and Claude Sonnet 4 guide for extended thinking mode.

I am a data science content creator with over 2 years of experience and one of the largest followings on Medium. I like to write detailed articles on AI and ML with a bit of a sarcastic style because you've got to do something to make them a bit less dull. I have produced over 130 articles and a DataCamp course to boot, with another one in the making. My content has been seen by over 5 million pairs of eyes, 20k of whom became followers on both Medium and LinkedIn.

