An open-source model beats GPT-5 and Claude Sonnet 4.5 on several major benchmarks at a fraction of the cost. That model is Kimi K2 Thinking, Moonshot AI’s reasoning variant, released in November 2025.
Kimi K2 Thinking can execute 200–300 sequential tool calls autonomously, making it ideal for complex agentic workflows. It also exposes its reasoning process through a dedicated API field, so you can see exactly how it thinks through problems.
This tutorial shows you how to use the Kimi K2 Thinking API for reasoning tasks. You’ll implement tool-calling workflows, build a comparison chat app to test K2 against GPT-5 and Claude, and learn when K2’s strengths make it the better choice. If you’re keen to learn more about building apps with LLMs, I recommend taking the Developing LLM Applications with LangChain course.
Here’s a preview of the app:
Before diving into the hands-on work, let’s understand what makes K2 Thinking different from traditional language models.
What Is Kimi K2 Thinking?
Most language models generate responses right away. K2 Thinking works differently. It’s built for multi-step problems where the model needs to plan, reason, and act on its own.
Architecture and design
The model uses a Mixture-of-Experts (MoE) architecture with 1 trillion total parameters. Only 32 billion activate for any given input. This gives K2 the power of a massive model without the high inference costs. The 256,000-token context window means you can feed it entire codebases, long documents, or extensive conversation histories without chunking.
Moonshot AI released two versions. K2 Instruct handles straightforward tasks like text generation, classification, and simple Q&A, where speed matters. K2 Thinking is for complex reasoning tasks.
Two standout capabilities
K2 Thinking’s first major feature is transparent reasoning. You can see how it breaks down problems, evaluates options, and arrives at conclusions.
The second is tool orchestration. K2 handles extensive sequential tool calling, far beyond what most models manage. It decides which tools to use, when to use them, and how to combine results without you stepping in at each stage.
Additional features
K2 includes features for different deployment needs. For production scenarios where accuracy is critical, Heavy Mode runs eight reasoning paths in parallel and selects the best answer, though at higher computational cost.
For latency-sensitive applications, INT4 quantization doubles inference speed with minimal accuracy loss, making it practical for high-throughput environments.
Kimi K2 Thinking vs GPT-5 vs Claude Sonnet 4.5 vs DeepSeek
Here’s how K2 compares to GPT-5, Claude Sonnet 4.5, and DeepSeek V3.2.
| Metric | Kimi K2 Thinking | GPT-5 (High) | Claude Sonnet 4.5 | DeepSeek-V3.2 |
|---|---|---|---|---|
| HLE (w/ tools) | 44.9 | 41.7 | 32 | 20.3 |
| HLE (Heavy Mode) | 51 | 42 | — | — |
| AIME25 (w/ Python) | 99.1% | 99.6% | 100% | 58.1% |
| GPQA | 84.5 | 85.7 | 83.4 | 79.9 |
| BrowseComp | 60.2 | 54.9 | 24.1 | 40.1 |
| Frames | 87 | 86 | 85 | 80.2 |
| SWE-bench Verified | 71.3% | 74.9% | 77.2% | 67.8% |
| LiveCodeBench | 83.1% | 87.0% | 64.0% | 74.1% |
| Context window | 256k tokens | 400k tokens | 200k tokens | 128k tokens |
| Input pricing | $0.60 / 1M | $1.25 / 1M | $3.00 / 1M | $0.55 / 1M |
| Output pricing | $2.50 / 1M | $10.00 / 1M | $15.00 / 1M | $2.19 / 1M |
| Max tool calls | 200–300 | Dozens | Dozens | Not specified |
Quick benchmark definitions: HLE (Humanity’s Last Exam) tests expert-level knowledge across many disciplines, AIME25 uses competition math problems, BrowseComp and Frames test web navigation and information synthesis, SWE-bench Verified measures bug fixing in real open-source repositories, LiveCodeBench tests code generation, and GPQA covers graduate-level science questions.
K2’s tool training shows up most clearly on agentic benchmarks. Heavy Mode beats GPT-5 by 9 points on HLE because it runs eight reasoning paths in parallel. Claude’s code training gives it the edge on SWE-bench Verified, making it the top pick for software engineering projects.
On pure reasoning tasks, all three models perform about the same. The differences appear when tools get involved. K2’s 200–300 tool call capacity lets it handle autonomous research workflows, debugging pipelines, and multi-step data analysis without you stepping in at each stage.
When to pick each model
- Pick K2 Thinking for agentic workflows that need extensive tool orchestration, web research, and information synthesis tasks, or when you need transparent reasoning chains for debugging or compliance.
- Pick GPT-5 for the largest context window at 400k tokens or when you need balanced performance across different tasks with mature ecosystem support.
- Pick Claude Sonnet 4.5 for software engineering projects where debugging and code fixes matter most.
- Pick DeepSeek for budget priority with open-source requirements under MIT license.
Setting Up Kimi K2 Thinking API Access
You have two options for accessing K2 Thinking: directly through Moonshot AI’s platform.moonshot.ai, or through OpenRouter, a unified API gateway. We’ll use OpenRouter because it provides access to K2, GPT-5, Claude, and dozens of other models with a single API key. This unified access becomes essential later when you build the comparison chat app. OpenRouter also handles rate limiting and failover automatically, so you don’t need to manage multiple provider accounts.
First, head to openrouter.ai and create an account. You can sign up with Google or GitHub. Once you’re in, go to the “Keys” section in the dashboard and generate a new API key.
OpenRouter offers $5 in free credits when you sign up, which is enough to test K2 Thinking on several queries. If you need more credits, you can add a payment method in the “Credits” section.
Copy your API key and keep it somewhere safe. You’ll need it in a moment. Don’t commit this key to version control or share it publicly, since anyone with your key can make API calls on your account.
Now let’s set up your Python environment. You’ll need the OpenAI Python SDK, which works with OpenRouter’s API since they maintain OpenAI compatibility. You’ll also want python-dotenv to manage your API key securely. Install both packages:
pip install openai python-dotenv
Create a .env file in your project directory and add your OpenRouter API key:
OPENROUTER_API_KEY=your_key_here
If you’re using git, add .env to your .gitignore file so you don't accidentally commit your credentials.
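For example, adding this single line to your .gitignore (create the file if it doesn’t exist) is enough to keep the key out of your commits:
# .gitignore
.env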
Here’s a quick test to verify everything works:
import os
from openai import OpenAI
from dotenv import load_dotenv
# Load API key from .env file
load_dotenv()
# Configure client for OpenRouter
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.getenv("OPENROUTER_API_KEY")
)
# Test call to K2 Thinking
response = client.chat.completions.create(
    model="moonshotai/kimi-k2-thinking",
    messages=[
        {"role": "user", "content": "What is 15 * 24?"}
    ]
)
print(response.choices[0].message.content)
Output:
15 * 24 = **360**
If you see the answer, you’re all set.
Understanding Kimi K2’s Thinking Mode
Earlier, we mentioned K2’s transparent reasoning. Now let’s see exactly how to access and use it in your code. K2 Thinking automatically exposes its reasoning process through a dedicated API field.
Unlike some other models, where you need to enable thinking mode manually, K2 includes reasoning content in every response by default.
How to access reasoning content
When you make an API call to K2 Thinking, the response contains two fields. The content field holds the final answer, and the reasoning field shows the step-by-step thinking process. Here's a practical example with a discount calculation that requires multiple steps:
import os
from openai import OpenAI
from dotenv import load_dotenv
load_dotenv()
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.getenv("OPENROUTER_API_KEY")
)
response = client.chat.completions.create(
    model="moonshotai/kimi-k2-thinking",
    messages=[
        {"role": "user", "content": "A laptop costs $850. There’s a 20% discount today, then 8% sales tax applies to the discounted price. What’s the final amount I pay?"}
    ],
    temperature=1.0
)
print("Final answer:")
print(response.choices[0].message.content)
print("\nReasoning process:")
print(response.choices[0].message.reasoning)
Output:
Final answer:
The final amount you pay is **$734.40**.
Here's the breakdown:
1. **Discount:** 20% of $850 = $170
2. **Discounted price:** $850 - $170 = $680
3. **Sales tax:** 8% of $680 = $54.40
4. **Final amount:** $680 + $54.40 = **$734.40**
Reasoning process (truncated for readability):
The user wants to know the final amount they will pay for a laptop with a discount and then sales tax applied.
Step 1: Calculate the discount amount.
Discount amount = $850 * 0.20 = $170
Step 2: Calculate the discounted price.
Discounted price = $850 - $170 = $680
Step 3: Calculate the sales tax amount.
Sales tax amount = $680 * 0.08 = $54.40
Step 4: Calculate the final amount to pay.
Final amount = $680 + $54.40 = $734.40
Let me double-check the calculations...
[verification steps omitted]
The final amount is $734.40
The content field gives you the clean answer your users see. The reasoning field shows how K2 worked through the problem, which helps you understand whether the model actually reasoned correctly or just got lucky.
K2 generates reasoning tokens during inference, not after. This means the model thinks through the problem before committing to an answer, similar to how you might work through a math problem on scratch paper before writing the final answer. The reasoning happens in real-time as part of the generation process.
Recommended parameters for Thinking mode
Set temperature=1.0 when using K2 Thinking. Lower temperatures restrict the model's reasoning exploration, which defeats the purpose of Thinking mode. You also want max_tokens=4096 or higher since reasoning chains take up tokens before the final answer even starts.
response = client.chat.completions.create(
    model="moonshotai/kimi-k2-thinking",
    messages=[{"role": "user", "content": "Your question here"}],
    temperature=1.0,
    max_tokens=4096
)
Now that you understand thinking mode, let’s explore K2’s other key capability: tool calling. While many models support basic tool use, K2’s approach is different in scale and autonomy.
Kimi K2’s Tool Calling Capabilities
Tool calling lets K2 execute functions during reasoning. You define tools as JSON schemas describing what each function does and what parameters it needs. When K2 needs external data or computation, it calls the right tool and waits for results.
This creates a back-and-forth loop: K2 reads your prompt and reasons about it. If it needs a tool, it returns finish_reason: "tool_calls". You execute the function, send results back with role: "tool", and K2 continues. This repeats until K2 returns finish_reason: "stop" with the final answer.
The schema follows OpenAI’s format with a name, description, and parameters. K2 reads these to understand when to use each tool. Clear descriptions matter because they guide the model’s decisions.
Building the CSV analyzer tool
Let’s build a practical tool that analyzes CSV files, a common data science task. This tool reads a CSV, extracts column information, and returns summary statistics. It demonstrates how K2 can access external data during reasoning, which is essential for data-driven applications.
Here’s the complete implementation:
import csv
import os
import json
# Define the tool schema
tools = [
    {
        "type": "function",
        "function": {
            "name": "analyze_csv",
            "description": "Read and analyze the first few rows of a CSV file. Returns columns, sample rows, total row count, and file size.",
            "parameters": {
                "type": "object",
                "properties": {
                    "filepath": {
                        "type": "string",
                        "description": "Path to the CSV file"
                    },
                    "num_rows": {
                        "type": "integer",
                        "description": "Number of rows to read",
                        "default": 10
                    }
                },
                "required": ["filepath"]
            }
        }
    }
]
The schema defines a function tool following OpenAI’s format. The description field tells K2 what the tool does and guides its decision on when to use it.
The parameters object specifies the inputs: filepath is required, while num_rows is optional with a default value of 10. Clear descriptions help K2 understand when this tool matches the user's needs.
The implementation handles the actual file processing:
# Implement the function
def analyze_csv(filepath: str, num_rows: int = 10) -> dict:
    if not os.path.exists(filepath):
        return {"error": f"File not found: {filepath}"}
    try:
        with open(filepath, 'r') as f:
            reader = csv.DictReader(f)
            columns = reader.fieldnames
            sample_rows = [dict(row) for i, row in enumerate(reader) if i < num_rows]
            f.seek(0)
            total_rows = sum(1 for _ in f) - 1
            return {
                "columns": list(columns),
                "sample_rows": sample_rows,
                "total_rows": total_rows,
                "file_size_kb": round(os.path.getsize(filepath) / 1024, 2)
            }
    except Exception as e:
        return {"error": str(e)}
The function validates the file path, then uses csv.DictReader to parse CSV data into dictionaries where keys are column names. It reads the first num_rows entries, resets the file pointer with seek(0), and counts total rows.
The return dictionary includes column names, sample data, row count, and file size in kilobytes. Error handling ensures the tool returns useful feedback if something fails.
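Before handing the tool to K2, it’s worth sanity-checking it directly. A quick manual call (assuming a hypothetical sample_employees.csv sits in your working directory) should return a dictionary shaped like this:
# Quick manual test of the tool before wiring it into K2
# (assumes a hypothetical sample_employees.csv exists locally)
result = analyze_csv("sample_employees.csv", num_rows=3)
print(json.dumps(result, indent=2))
# Expected shape (exact values depend on your file):
# {
#   "columns": ["name", "department", "salary"],
#   "sample_rows": [...],
#   "total_rows": 12,
#   "file_size_kb": 0.45
# }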
Implementing the tool execution loop
With your tool defined and implemented, you need a loop that handles the conversation between K2 and your functions. Tool calling isn’t a one-shot operation. K2 might need multiple iterations: call a tool, analyze results, decide if another tool is needed, and repeat. This requires a loop that handles the back-and-forth conversation until K2 reaches its final answer.
Here’s the complete implementation with a real example:
from openai import OpenAI
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.getenv("OPENROUTER_API_KEY")
)
messages = [
    {"role": "user", "content": "Analyze sample_employees.csv and tell me the average salary for Engineering department employees."}
]
This initializes the OpenAI client configured for OpenRouter’s API and creates the initial message list with the user’s query about analyzing employee salaries.
while True:
    response = client.chat.completions.create(
        model="moonshotai/kimi-k2-thinking",
        messages=messages,
        tools=tools,
        temperature=1.0
    )
    message = response.choices[0].message
    finish_reason = response.choices[0].finish_reason
    messages.append({
        "role": "assistant",
        "content": message.content,
        "tool_calls": message.tool_calls
    })
The main loop sends the conversation history to K2 along with available tools. After extracting the assistant’s response and finish reason, append it to the message history to maintain conversation context.
    if finish_reason == "tool_calls":
        for tool_call in message.tool_calls:
            function_name = tool_call.function.name
            arguments = json.loads(tool_call.function.arguments)
            if function_name == "analyze_csv":
                result = analyze_csv(**arguments)
                messages.append({
                    "role": "tool",
                    "content": json.dumps(result),
                    "tool_call_id": tool_call.id
                })
    elif finish_reason == "stop":
        print(message.content)
        break
When K2 requests tool calls, we execute each function with the provided arguments and append results back to the conversation with matching tool_call_id values. When K2 finishes reasoning and returns "stop", we print the final answer and exit the loop.
For this example, create a sample_employees.csv file with employee data including name, department, and salary columns. Use about 12 employees across the Engineering, Marketing, Sales, and HR departments for testing. When you run this code, K2 typically calls analyze_csv() with num_rows set to 12 or higher to ensure complete data coverage, then identifies Engineering employees and calculates their average salary.
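If you don’t have such a file handy, here’s a quick way to generate one. The Engineering rows match the names and salaries in the output below; the other departments’ names and salaries are arbitrary placeholders:
import csv

# Hypothetical sample data: 5 Engineering employees (matching the output below)
# plus 7 placeholder rows across Marketing, Sales, and HR.
employees = [
    ("Alice Johnson", "Engineering", 95000),
    ("Bob Smith", "Marketing", 72000),
    ("Carol White", "Engineering", 110000),
    ("Dan Brown", "Sales", 65000),
    ("Eve Davis", "Engineering", 88000),
    ("Frank Miller", "HR", 60000),
    ("Grace Lee", "Marketing", 70000),
    ("Henry Wilson", "Sales", 68000),
    ("Iris Martinez", "Engineering", 102000),
    ("Jack Taylor", "HR", 62000),
    ("Kate Thomas", "Engineering", 92000),
    ("Liam Anderson", "Sales", 71000),
]

with open("sample_employees.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "department", "salary"])
    writer.writerows(employees)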
Output:
Based on the CSV data, I found 5 Engineering employees in the sample:
- Alice Johnson: $95,000
- Carol White: $110,000
- Eve Davis: $88,000
- Iris Martinez: $102,000
- Kate Thomas: $92,000
The average salary for Engineering department employees is $97,400.
The tool_call_id in each response links results back to their function calls. K2 might request multiple tools in one response, so each needs its ID for proper matching. Preserving the full conversation history gives K2 the context it needs for coherent reasoning across tool calls.
Understanding K2’s independent orchestration
Once you’re comfortable with basic tool calling, K2’s real strength becomes apparent: extended orchestration. K2 was trained for tool orchestration from the start, not as an add-on. This changes how it handles multi-step tasks. Most models manage 20–50 sequential tool calls before performance drops. K2 handles 200–300 while maintaining coherent reasoning.
This capacity matters for independent workflows where the model decides, gathers information, validates results, and iterates until it solves the problem. Consider research tasks where K2 searches databases, cross-references findings, identifies gaps, refines queries, and synthesizes everything. Or data pipelines that need validation at each step. Or debugging workflows that test multiple hypotheses.
If your use case is “take this single action,” most models work fine. If it’s “keep working until you solve this,” K2’s extended tool capacity becomes the deciding factor.
Building a Multi-Model Comparison Chat App
You’ve seen how K2 Thinking handles individual tasks with tool calling and transparent reasoning. Now let’s put it to the test against its competitors. You’ll build a Streamlit app that queries Kimi K2 Thinking, GPT-5, and Claude Sonnet 4.5 with the same prompt at once. The interface displays all three responses side-by-side so you can see how each model tackles the same problem.
You can find the full application script in this GitHub Gist. If you only want the test results, feel free to skip ahead to the next section. The steps below break down the script (over 200 lines) in detail, so you can also return to them later when you have time to follow along. With that said, let’s get started.
Step 1: Project setup and dependencies
This comparison app requires API access to all three models. You’ll need your OpenRouter account (already set up with $5 in free credits), and optionally OpenAI and Anthropic API keys. The $5 in OpenRouter credits should be enough to test all three models on multiple queries. If you only have OpenRouter access, you can still compare K2 with other models available through OpenRouter.
Install Streamlit for the web interface and Anthropic’s SDK for Claude access:
pip install streamlit anthropic
Create a file named model_comparison_chat.py. The entire application lives in this single file. Begin with the imports:
import os
import time
from typing import Dict, List
from concurrent.futures import ThreadPoolExecutor, as_completed
from dotenv import load_dotenv
import streamlit as st
from openai import OpenAI
from anthropic import Anthropic
load_dotenv()
Step 2: Configure API connections
You already have OPENROUTER_API_KEY in your .env file. Add keys for GPT-5 and Claude:
OPENROUTER_API_KEY=your_openrouter_key # Already set up
OPENAI_API_KEY=your_openai_key # Add this
ANTHROPIC_API_KEY=your_anthropic_key # Add this
Get your OpenAI API key from platform.openai.com and your Anthropic key from console.anthropic.com. For more details on GPT-5 setup, see the GPT-5 guide. For Claude Sonnet 4.5, check the Claude Sonnet 4.5 overview.
Initialize the three clients:
KIMI_CLIENT = OpenAI(
    api_key=os.getenv("OPENROUTER_API_KEY"),
    base_url="https://openrouter.ai/api/v1"
)
GPT_CLIENT = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
CLAUDE_CLIENT = Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
Step 3: Create helper functions for each model
Build one unified function that routes to the correct model based on the name. This keeps your code clean and makes it easy to add more models later.
Start with the Kimi K2 call:
def call_model(model_name: str, messages: List[Dict], **kwargs) -> Dict:
    """Call any model with unified interface."""
    try:
        start_time = time.time()
        if model_name == "Kimi K2 Thinking":
            enable_thinking = kwargs.get("enable_thinking", True)
            completion = KIMI_CLIENT.chat.completions.create(
                model="moonshotai/kimi-k2-thinking",
                messages=messages,
                temperature=1.0,
                extra_body={"include_reasoning": True}
            )
            message = completion.choices[0].message
            content = message.content
            # Extract reasoning from the dedicated field
            reasoning_content = None
            if enable_thinking and hasattr(message, 'reasoning') and message.reasoning:
                reasoning_content = message.reasoning
            return {
                "content": content,
                "reasoning_content": reasoning_content,
                "response_time": time.time() - start_time,
                "tokens_used": {
                    "input": completion.usage.prompt_tokens,
                    "output": completion.usage.completion_tokens,
                    "total": completion.usage.total_tokens
                },
                "error": None
            }
GPT-5 takes a different approach with a responses endpoint and configurable reasoning effort:
        elif model_name == "GPT-5":
            reasoning_effort = kwargs.get("reasoning_effort", "medium")
            input_messages = [
                {"role": m["role"], "content": m["content"]}
                for m in messages
            ]
            response = GPT_CLIENT.responses.create(
                model="gpt-5",
                input=input_messages,
                reasoning={
                    "effort": reasoning_effort,
                    "summary": "auto"
                }
            )
            # Parse response structure
            reasoning_text = None
            content_text = ""
            for item in response.output:
                if item.type == "reasoning" and hasattr(item, "summary"):
                    summaries = [s.text for s in item.summary if hasattr(s, "text")]
                    reasoning_text = "\n\n".join(summaries) if summaries else None
                elif item.type == "message" and hasattr(item, "content"):
                    content_text += "".join(c.text for c in item.content if hasattr(c, "text"))
            return {
                "content": content_text,
                "reasoning_content": reasoning_text,
                "response_time": time.time() - start_time,
                "tokens_used": {
                    "input": getattr(response.usage, "input_tokens", 0),
                    "output": getattr(response.usage, "output_tokens", 0),
                    "total": getattr(response.usage, "total_tokens", 0)
                },
                "error": None
            }
GPT-5 offers four reasoning effort levels: minimal, low, medium, and high. The response structure separates reasoning into summary blocks and content into message blocks, which you parse separately before returning the same standardized dictionary format as Kimi K2.
Claude Sonnet 4.5 implements extended thinking mode with a token budget approach:
        else:  # Claude Sonnet 4.5
            enable_thinking = kwargs.get("enable_thinking", True)
            params = {
                "model": "claude-sonnet-4-5",
                "max_tokens": 10000,
                "messages": messages
            }
            if enable_thinking:
                params["thinking"] = {
                    "type": "enabled",
                    "budget_tokens": 5000
                }
            message = CLAUDE_CLIENT.messages.create(**params)
            # Extract content and thinking blocks
            content_text = ""
            thinking_text = None
            for block in message.content:
                if block.type == "thinking":
                    thinking_text = block.thinking
                elif block.type == "text":
                    content_text += block.text
            return {
                "content": content_text,
                "reasoning_content": thinking_text,
                "response_time": time.time() - start_time,
                "tokens_used": {
                    "input": message.usage.input_tokens,
                    "output": message.usage.output_tokens,
                    "total": message.usage.input_tokens + message.usage.output_tokens
                },
                "error": None
            }
    except Exception as e:
        return {
            "content": None,
            "reasoning_content": None,
            "response_time": 0,
            "tokens_used": {"input": 0, "output": 0, "total": 0},
            "error": f"{model_name} Error: {str(e)}"
        }
Claude allocates a token budget for internal reasoning before generating the final answer. The response contains multiple content blocks that you iterate through, separating thinking blocks from text blocks. The error handling at the end returns the same standardized dictionary structure with an error message, ensuring failed API calls don’t break the interface.
Step 4: Build the Streamlit interface
With the API layer ready, you can build the web interface. Configure the Streamlit page first:
def main():
    st.set_page_config(
        page_title="Multi-Model Comparison Chat",
        page_icon="🤖",
        layout="wide",
        initial_sidebar_state="expanded"
    )
    st.session_state.setdefault("messages", [])
    st.session_state.setdefault("model_responses", [])
The wide layout creates space for three side-by-side columns. Session state preserves conversation history and model responses when Streamlit reruns the script.
Add the page title and sidebar controls:
    st.title("🤖 Multi-Model Comparison Chat")
    st.markdown("""
Compare **Kimi K2 Thinking**, **GPT-5**, and **Claude Sonnet 4.5** side-by-side.
All three models support reasoning modes - see how they approach problems differently.
""")
    with st.sidebar:
        st.header("⚙️ Settings")
        st.subheader("Thinking Mode")
        kimi_thinking = st.checkbox("Enable Kimi K2 Thinking", value=True)
        gpt5_reasoning = st.selectbox(
            "GPT-5 Reasoning Effort",
            options=["minimal", "low", "medium", "high"],
            index=2
        )
        claude_thinking = st.checkbox("Enable Claude Thinking", value=True)
The sidebar gives users control over thinking modes. Kimi K2 and Claude have binary toggles. GPT-5 provides four reasoning levels because its API exposes this granular control.
Below the thinking controls, add an API status indicator so users know which services are connected:
        st.subheader("API Status")
        st.markdown(f"""
- Kimi K2: {"✅" if os.getenv("OPENROUTER_API_KEY") else "❌"}
- GPT-5: {"✅" if os.getenv("OPENAI_API_KEY") else "❌"}
- Claude: {"✅" if os.getenv("ANTHROPIC_API_KEY") else "❌"}
""")
The checkmarks appear when environment variables are set. Red X marks mean you need to add the corresponding key to your .env file.
Step 5: Implement the comparison logic
Now comes the core functionality. When users submit a message, you’ll call all three models in parallel and display results side-by-side.
def call_models_parallel(messages: List[Dict], selected_models: List[str],
                         kimi_thinking: bool, gpt5_reasoning: str,
                         claude_thinking: bool) -> Dict[str, Dict]:
    """Call multiple models in parallel."""
    model_kwargs = {
        "Kimi K2 Thinking": {"enable_thinking": kimi_thinking},
        "GPT-5": {"reasoning_effort": gpt5_reasoning},
        "Claude Sonnet 4.5": {"enable_thinking": claude_thinking}
    }
    with ThreadPoolExecutor(max_workers=3) as executor:
        futures = {
            executor.submit(call_model, m, messages, **model_kwargs[m]): m
            for m in selected_models if m in model_kwargs
        }
        return {futures[f]: f.result() for f in as_completed(futures)}
This function submits three API calls at once and collects results as they complete. Sequential calls would take 15–30 seconds. Parallel execution cuts this to 5–10 seconds because you’re only waiting for the slowest model instead of the sum of all three.
Integrate parallel execution into the chat interface:
    if prompt := st.chat_input("Ask a question to all models..."):
        st.session_state.messages.append({"role": "user", "content": prompt})
        with st.chat_message("user"):
            st.markdown(prompt)
        api_messages = [
            {"role": m["role"], "content": m["content"]}
            for m in st.session_state.messages
        ]
        with st.spinner("🤔 Models are thinking..."):
            responses = call_models_parallel(
                api_messages,
                ["Kimi K2 Thinking", "GPT-5", "Claude Sonnet 4.5"],
                kimi_thinking,
                gpt5_reasoning,
                claude_thinking
            )
        cols = st.columns(3)
        for idx, (model_name, response_data) in enumerate(responses.items()):
            with cols[idx]:
                st.markdown(f"### {model_name}")
                if response_data["reasoning_content"]:
                    with st.expander("🧠 Thinking Process", expanded=False):
                        st.markdown(response_data["reasoning_content"])
                st.markdown(response_data["content"])
This creates three columns and displays each model’s response in its own column. The thinking process starts collapsed so users see the final answers first. They can expand the thinking section when they want to examine the reasoning.
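One detail these snippets don’t show: Streamlit executes the script top to bottom, so main() has to be called at the end of the file. If you’re assembling the app from the snippets here rather than copying the Gist, close the file with:
# Entry point: call main() when Streamlit executes the script
if __name__ == "__main__":
    main()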
Step 6: Test and refine
Run the app with:
streamlit run model_comparison_chat.py
Start with a basic test to confirm everything works:
What is 2+2?
All three models should respond within seconds. This verifies your API connections and confirms parallel execution is working. The app runs locally, giving you full control over when APIs are called. Each query incurs API costs (approximately $0.01–0.05 per three-model comparison depending on response length), but running locally means you decide exactly when to spend.
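As a rough sanity check on that estimate, you can plug hypothetical token counts (say, 500 input and 1,500 output tokens per model) into the pricing from the comparison table:
# Back-of-the-envelope cost per three-model comparison (hypothetical token counts)
PRICES = {  # (input $/1M tokens, output $/1M tokens) from the comparison table
    "Kimi K2 Thinking": (0.60, 2.50),
    "GPT-5": (1.25, 10.00),
    "Claude Sonnet 4.5": (3.00, 15.00),
}
input_tokens, output_tokens = 500, 1_500  # assumed typical query size

total = 0.0
for model, (p_in, p_out) in PRICES.items():
    cost = input_tokens / 1e6 * p_in + output_tokens / 1e6 * p_out
    total += cost
    print(f"{model}: ${cost:.4f}")
print(f"Total per comparison: ${total:.4f}")  # roughly $0.04 with these assumptions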
Testing The App With Complex Prompts
You built the comparison app and saw it work with simple questions. But the real differences between these models show up when you push them harder. To demonstrate these differences, I tested all three models with complex prompts. You can run these same tests in your app to see the results firsthand. Here are the key insights from that testing.
The test prompts
I chose three challenges, each testing a distinct reasoning ability.
Test 1: Advanced mathematical reasoning
A rectangular garden is 24 feet long and 18 feet wide. A path 2 feet wide runs around the outside perimeter of the garden. If Sarah plants flowers in the garden at a density of 4 flowers per square foot, and then decides to add decorative stones to 30% of the planted area (removing those flowers), how many flowers remain? Also, what is the total area (in square feet) covered by the path?
Test 2: Multi-step logical reasoning
Four friends - Alice, Bob, Carol, and David - need to arrange themselves in a line for a photo. The following rules must be followed:
1. Alice must stand next to Bob
2. Carol cannot stand at either end
3. David must stand to the left of Alice
List all valid arrangements from left to right.
Test 3: Code generation with constraints
Write a Python function that checks if a string is a valid IPv4 address, with these requirements:
1. Must validate format (four octets separated by dots)
2. Each octet must be 0-255 (no leading zeros except for "0" itself)
3. Must reject invalid formats (empty strings, too many/few octets, non-numeric characters)
4. Time complexity must be O(n) where n is the string length
5. Include complete docstring with examples
6. Add type hints for all parameters and return values
7. Handle edge cases (empty string, None, special characters)
Also explain your validation approach and why it's O(n).
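For reference, here’s roughly the shape of solution all three models produced. This is my own minimal sketch of the spec above, not a transcript of any model’s output:
from typing import Optional

def is_valid_ipv4(address: Optional[str]) -> bool:
    """Check whether a string is a valid IPv4 address.

    Examples:
        >>> is_valid_ipv4("192.168.1.1")
        True
        >>> is_valid_ipv4("256.1.1.1")
        False
        >>> is_valid_ipv4("01.2.3.4")
        False
        >>> is_valid_ipv4(None)
        False
    """
    if not isinstance(address, str) or not address:
        return False
    octets = address.split(".")  # one pass over the string: O(n)
    if len(octets) != 4:
        return False
    for octet in octets:
        # Reject empty octets, signs, spaces, and non-ASCII digit characters
        if not octet or any(c not in "0123456789" for c in octet):
            return False
        # No leading zeros except for "0" itself
        if len(octet) > 1 and octet[0] == "0":
            return False
        if int(octet) > 255:
            return False
    return True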
Results
All three models solved every problem correctly. But they took very different paths to get there. Here’s the performance breakdown:
| Test | Model | Time (sec) | Tokens | Result |
|---|---|---|---|---|
| Garden Problem | Kimi K2 | 28.5 | 3,420 | ✓ Correct (1,209.6 flowers, 184 sq ft) |
| | GPT-5 | 44.6 | 1,476 | ✓ Correct (1,209.6 flowers, 184 sq ft) |
| | Claude | 19.7 | 1,386 | ✓ Correct (1,210 flowers, 184 sq ft) |
| Photo Arrangement | Kimi K2 | 50.5 | 6,249 | ✓ Found all 2 valid arrangements |
| | GPT-5 | 39.4 | 1,914 | ✓ Found all 2 valid arrangements |
| | Claude | 31.3 | 2,310 | ✓ Found all 2 valid arrangements |
| IPv4 Validator | Kimi K2 | 192.6 | 5,245 | ✓ Working O(n) solution |
| | GPT-5 | 73.7 | 4,233 | ✓ Working O(n) solution |
| | Claude | 43.1 | 3,330 | ✓ Working O(n) solution |
Here’s what each model actually does during that time.
Kimi K2’s approach
It explores and questions itself constantly. On the garden problem, it visualized the setup, worked through the math, then stopped to say “Wait, let me double-check that…” before continuing. For the photo arrangement, it tested each constraint one by one, showing its work for every possibility.
The IPv4 validator took over 3 minutes because it considered edge cases one by one, walked through the O(n) complexity proof, and verified the logic multiple times. You’re watching someone think out loud, catch themselves, and verify their reasoning at each step.
GPT-5’s approach
It’s methodical and organized. It broke the garden problem into clear sections, numbered the steps, and worked through them in order. The photo arrangement got a structured walkthrough where it placed people position by position while checking rules.
The code came with clear explanations and well-organized documentation. It doesn’t second-guess itself as much as K2, and it doesn’t rush through like Claude. You get a balanced view of the thinking without excessive detail.
Claude’s approach
It moves fast. It identified the solution path quickly on each test and executed cleanly. For the photo problem, it organized by cases and worked through each one without lingering. The IPv4 code was clean and well-documented, but got to the point faster.
Claude shows its thinking but doesn’t dwell on verification steps. Speed comes from confidence in the approach and minimal backtracking.
These different approaches create a clear trade-off pattern. Pick K2 when you need detailed reasoning for learning, debugging, or audit requirements. Pick Claude when speed matters and you trust the results without inspecting reasoning chains. Pick GPT-5 when you want balanced performance between K2’s verbosity and Claude’s speed.
Note: Take these results with a grain of salt; three prompts can’t realistically probe the full range of all three models’ abilities.
Conclusion
You’ve set up Kimi K2 Thinking through OpenRouter, explored its reasoning capabilities, built tool-calling workflows, and created a comparison app to test it against GPT-5 and Claude. K2’s strengths are independent tool orchestration and transparent reasoning, making it well-suited for complex workflows where you need to verify the model’s decisions.
K2’s capacity and affordability make this practical. It can execute 200–300 sequential tool calls, compared with a few dozen for most competitors, and the reasoning field exposes every thinking step. At roughly a quarter of GPT-5’s output-token price and a sixth of Claude’s, you can run extensive experiments without budget concerns.
The comparison app you built gives you direct evidence of these tradeoffs. Test it with your own problems and workflows to develop intuition for whether K2’s detailed reasoning justifies the longer response times. Real-world testing with your specific use cases will reveal patterns that generic benchmarks can’t capture.
To expand on these concepts, check out DataCamp’s OpenRouter tutorial for managing multiple LLM providers, GPT-5 guide for reasoning effort levels, and Claude Sonnet 4 guide for extended thinking mode.

I am a data science content creator with over 2 years of experience and one of the largest followings on Medium. I like to write detailed articles on AI and ML with a bit of a sarcastic style because you've got to do something to make them a bit less dull. I have produced over 130 articles and a DataCamp course to boot, with another one in the making. My content has been seen by over 5 million pairs of eyes, 20k of whom became followers on both Medium and LinkedIn.

