
OpenRouter: A Guide With Practical Examples

Learn how to use OpenRouter's unified API to access various AI models and implement features like streaming, reasoning tokens, and structured outputs.
Aug 14, 2025  · 12 min read

Managing multiple AI provider APIs quickly becomes overwhelming. Each provider has different authentication methods, pricing models, and API specifications. Developers waste countless hours switching between OpenAI, Anthropic, Google, and other platforms just to access different models.

OpenRouter solves this complexity by providing a unified API that connects you to over 400 models from dozens of providers. You can access GPT-5, Claude 4, Gemini 2.5 Pro, and hundreds of other models using a single API key and consistent interface. The platform handles automatic fallbacks, cost management, and provider routing behind the scenes.

In this tutorial, I explain everything you need to know about OpenRouter, from setting up your first API call to implementing advanced features like structured outputs. By the end, you’ll know how to build reliable applications that aren’t tied to a single provider.

What Is OpenRouter?

OpenRouter is a unified API platform that gives you access to over 400 AI models from dozens of providers through a single endpoint. Instead of juggling separate API keys for OpenAI, Anthropic, Google, Meta, and others, you use one key to reach their entire model catalog.

The platform works as an intelligent router, sending your requests to the right provider while taking care of authentication, billing, and error handling. This approach fixes several headaches that come with using multiple AI providers.

Problems OpenRouter solves

Working with multiple AI providers gets messy fast. Each one has its own API format, login process, and billing system. You end up maintaining separate code for each service, which slows down development and makes testing new models a pain.

Things get worse when providers go down or hit you with rate limits. Your app breaks, and there’s nothing you can do except wait. Plus, figuring out which provider offers the best price for similar models means tracking costs manually across different platforms.

The biggest issue is getting locked into one provider. When you build everything around their specific API, switching to better models or cheaper options later becomes a major project.

How OpenRouter fixes this

OpenRouter solves these problems with a set of connected features:

  • One API key works with 400+ models from all major providers
  • Automatic switching to backup providers when your first choice fails
  • Side-by-side pricing for all models so you can compare costs instantly
  • Works with existing OpenAI code — just change the endpoint URL
  • Real-time monitoring that routes requests to the fastest available provider

These pieces work together to make AI development smoother and more reliable.

Who should use OpenRouter?

Different types of users get value from this unified approach:

  • Developers can try new models without setting up accounts everywhere, making experimentation faster
  • Enterprise teams get the uptime they need through automatic backups when providers fail
  • Budget-conscious users can find the cheapest option for their needs without spreadsheet math
  • Researchers get instant access to cutting-edge models without account setup overhead

Now that you understand what OpenRouter brings to the table, let’s get you set up with your first API call.

Prerequisites

Before diving into OpenRouter, you’ll need a few things set up on your machine. This tutorial assumes you’re comfortable with basic Python programming and have worked with APIs before. You don’t need to be an expert, but you should understand concepts like making HTTP requests and handling JSON responses.

You’ll need Python 3.7 or later installed on your system. We’ll be using the openai Python package to interact with OpenRouter's API, the requests library for a few direct calls to OpenRouter's REST endpoints, and python-dotenv to handle environment variables securely. You can install all three with:

pip install requests openai python-dotenv

You’ll also need an OpenRouter account and API key. Head to openrouter.ai to create a free account — you’ll get a small credit allowance to test things out. Once you’re logged in, go to the API Keys section in your account settings and generate a new key.

After getting your API key, create a .env file in your project directory and add your key like this:

OPENROUTER_API_KEY=your_api_key_here

This keeps your API key secure and out of your code. If you plan to use OpenRouter beyond testing, you’ll need to add credits to your account through the Credits page.

With these basics in place, you’re ready to make your first API call through OpenRouter.

Making Your First API Call in OpenRouter

Getting started with OpenRouter is remarkably simple if you’ve used the OpenAI SDK before. You just change one line of code and suddenly have access to hundreds of models from different providers.

Your first request and setup

Let’s jump right in with a working example that demonstrates OpenRouter’s approach:

import os
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()

client = OpenAI(
   base_url="https://openrouter.ai/api/v1",
   api_key=os.getenv("OPENROUTER_API_KEY"),
)

response = client.chat.completions.create(
   model="openai/gpt-5-mini",
   messages=[
       {
           "role": "user",
           "content": "Write a haiku about debugging code at 2 AM"
       }
   ]
)

print(response.choices[0].message.content)
Night hum, coffee cooled
cursor blinks, bug hides somewhere
I chase ghosts 'til dawn

The magic happens in two places. First, the base_url parameter redirects your requests to OpenRouter's servers instead of OpenAI's. Second, the model name follows a provider/model-name format: openai/gpt-5-mini instead of just gpt-5-mini. This tells OpenRouter which provider's version you want while keeping the familiar interface. Here are some common model IDs you can plug into the example above without any other changes (a quick comparison sketch follows the list):

  • google/gemini-2.0-flash-001
  • google/gemini-2.5-pro
  • mistralai/mistral-nemo
  • deepseek/deepseek-r1-distill-qwen-32b
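
If you want to see how the same prompt behaves across providers, a quick loop over model IDs is enough. This is a minimal sketch that reuses the client object from the example above; the prompt and the model list are placeholders you can swap out:

# Compare the same prompt across a few providers (reusing `client` from above)
models_to_try = [
    "google/gemini-2.0-flash-001",
    "mistralai/mistral-nemo",
    "deepseek/deepseek-r1-distill-qwen-32b",
]

for model_id in models_to_try:
    response = client.chat.completions.create(
        model=model_id,
        messages=[{"role": "user", "content": "Summarize what an API gateway does in one sentence."}],
    )
    print(f"--- {model_id} ---")
    print(response.choices[0].message.content)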

Now that you’ve seen how easy it is to work with different models, you might be wondering: what happens when your chosen model is unavailable? How do you build applications that stay reliable even when providers face issues? That’s where OpenRouter’s routing and resilience features come in.

Model Routing For Resilience

Building reliable AI applications means preparing for the unexpected. Providers experience downtime, models hit rate limits, and sometimes content moderation blocks your requests. Model routing is OpenRouter’s solution — it automatically switches between different models to keep your application running smoothly.

Setting up manual fallbacks

The most straightforward way to add resilience is to specify backup models. When your primary choice fails, OpenRouter tries your alternatives in order. The extra_body parameter passes these routing instructions to OpenRouter's API since the OpenAI SDK doesn't natively support this feature:

response = client.chat.completions.create(
   model="moonshotai/kimi-k2",  # Primary choice
   messages=[
       {"role": "user", "content": "Explain quantum computing in simple terms"}
   ],
   extra_body={
       "models": ["anthropic/claude-sonnet-4", "deepseek/deepseek-r1"]
   }   
)

print(f"Response from: {response.model}")
print(response.choices[0].message.content)
Response from: moonshotai/kimi-k2
Imagine a normal computer bit as a tiny light-switch that can only be OFF (0) or ON (1)...

OpenRouter tries Kimi-K2 first. If it’s unavailable, rate-limited, or blocked, it automatically tries Claude Sonnet 4, then DeepSeek R1. The response.model field shows which model actually responded.
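
If you want to log when a fallback actually kicks in, you can wrap this pattern in a small helper. The function below is only a sketch (the name and signature are my own), built on the client from earlier:

def complete_with_fallbacks(primary, fallbacks, messages):
    # Primary model plus OpenRouter-side fallbacks passed through extra_body
    response = client.chat.completions.create(
        model=primary,
        messages=messages,
        extra_body={"models": fallbacks},
    )
    # response.model reports which model actually answered
    if response.model != primary:
        print(f"Note: fell back from {primary} to {response.model}")
    return response

answer = complete_with_fallbacks(
    "moonshotai/kimi-k2",
    ["anthropic/claude-sonnet-4", "deepseek/deepseek-r1"],
    [{"role": "user", "content": "Explain quantum computing in simple terms"}],
)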

Auto router for maximum convenience

Once you understand manual fallbacks, the auto router becomes really appealing. It handles model selection and fallbacks automatically, powered by NotDiamond’s evaluation system:

response = client.chat.completions.create(
   model="openrouter/auto",
   messages=[
       {"role": "user", "content": "Debug this Python code in 3 sentences: def factorial(n): return n * factorial(n-1)"}
   ]
)

print(f"Auto router selected: {response.model}")
print(response.choices[0].message.content)
Auto router selected: openai/chatgpt-4o-latest
The given code is missing a base case, which causes infinite recursion and eventually a RecursionError. To fix it, add a base case like `if n == 0: return 1` before the recursive call. Here's the corrected version:

```python
def factorial(n):
   if n == 0:
       return 1
   return n * factorial(n - 1)
```

The auto router analyzes your prompt and picks the best available model, with built-in fallbacks if your first choice is unavailable. You get resilience without any configuration. However, use the auto router with care in sensitive or high-stakes scenarios: it can underestimate the complexity of your problem and route it to a lower-capacity model.

Building effective fallback strategies

Not all models make good backups for each other. Provider downtime may affect all models from that company, so choose fallbacks from different providers. Rate limits and costs vary dramatically, so pair expensive models with cheaper alternatives as well:

# Good fallback chain: different providers, decreasing cost
response = client.chat.completions.create(
   model="anthropic/claude-sonnet-4",
   messages=[
       {"role": "user", "content": "Your prompt here"}
   ],
   extra_body={
       "models": [
           "x-ai/grok-4",                  # Close performance
           "moonshotai/kimi-k2",           # Cheaper
           "deepseek/deepseek-r1:free"     # Free backup
       ]
   }   
)

This gives you premium quality when available, solid performance as backup, and guaranteed availability as a last resort. Content moderation policies also differ between providers, so diversifying your chain gives better coverage for sensitive topics.

Finding models for your fallback chain

The models page lets you filter by provider and capabilities to build your chain. Many powerful models like DeepSeek R1 and Kimi-K2 have free variants since they’re open-source (for example, deepseek/deepseek-r1:free), making them excellent fallbacks. Free models have rate limits of 50 requests per day for new users, or 1,000 requests per day if you’ve purchased 10 credits.

For dynamic applications, you can discover models programmatically:

import requests

def get_provider_models(api_key: str, provider: str) -> list[str]:
    """Return every model ID in OpenRouter's catalog that starts with a provider prefix."""
    r = requests.get(
        "https://openrouter.ai/api/v1/models",
        headers={"Authorization": f"Bearer {api_key}"}
    )
    return [m["id"] for m in r.json()["data"] if m["id"].startswith(provider)]

# Build fallbacks across providers
api_key = os.getenv("OPENROUTER_API_KEY")
openai_models = get_provider_models(api_key, "openai/")
anthropic_models = get_provider_models(api_key, "anthropic/")

This approach lets you build robust fallback chains that adapt as new models become available.
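
For example, you could append the first free DeepSeek variant from the live catalog to a hand-picked chain. This is only a sketch on top of the helper above; the selection logic is a placeholder:

# Pick a free DeepSeek variant from the catalog, if one exists
deepseek_models = get_provider_models(api_key, "deepseek/")
free_deepseek = [m for m in deepseek_models if m.endswith(":free")]

fallback_chain = ["x-ai/grok-4", "moonshotai/kimi-k2"] + free_deepseek[:1]

response = client.chat.completions.create(
    model="anthropic/claude-sonnet-4",
    messages=[{"role": "user", "content": "Your prompt here"}],
    extra_body={"models": fallback_chain},
)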

Streaming For Real-time Responses

When working with AI models, especially for longer responses, users expect to see output appear progressively rather than waiting for the complete response. Streaming solves this by sending response chunks as they’re generated, creating a more interactive experience similar to ChatGPT’s interface.

Basic streaming setup

To set up streaming in OpenRouter, add stream=True to your request. The response becomes an iterator that yields chunks as the model generates them:

response = client.chat.completions.create(
   model="openai/gpt-5",
   messages=[
       {"role": "user", "content": "Write a detailed explanation of how neural networks learn"}
   ],
   stream=True
)

for chunk in response:
   if chunk.choices[0].delta.content is not None:
       print(chunk.choices[0].delta.content, end="")

Each chunk contains a small piece of the response. The delta.content field holds the new text fragment, and we print it immediately without a newline to create the streaming effect. The end="" parameter prevents print from adding newlines between chunks.
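
If you want to quantify the benefit, you can time how quickly the first chunk arrives compared to the full response. A small sketch using the same request shape:

import time

start = time.time()
first_token_at = None
chunks = []

stream = client.chat.completions.create(
    model="openai/gpt-5",
    messages=[{"role": "user", "content": "Write a detailed explanation of how neural networks learn"}],
    stream=True,
)

for chunk in stream:
    content = chunk.choices[0].delta.content
    if content:
        if first_token_at is None:
            first_token_at = time.time() - start  # time to first visible output
        chunks.append(content)

total = time.time() - start
print(f"First token after {first_token_at:.2f}s, full response after {total:.2f}s")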

Building a better streaming handler

For production applications, you’ll want more control over the streaming process. Here’s a more comprehensive handler that manages the complete response:

def stream_response(model, messages, show_progress=True):
   response = client.chat.completions.create(
       model=model,
       messages=messages,
       stream=True
   )
  
   complete_response = ""
  
   for chunk in response:
       if chunk.choices[0].delta.content is not None:
           content = chunk.choices[0].delta.content
           complete_response += content
          
           if show_progress:
               print(content, end="", flush=True)
  
   if show_progress:
       print()  # Add final newline
  
   return complete_response

# Use it with different models
result = stream_response(
   "anthropic/claude-sonnet-4",
   [{"role": "user", "content": "Explain quantum entanglement like I'm 12 years old"}]
)

This handler captures the complete response while displaying progress, gives you both the streaming experience and the final text, and includes proper output formatting.

Streaming changes the user experience from “waiting and hoping” to “watching progress happen.” This makes your AI applications feel much more responsive and engaging for users.
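
Streaming also composes with the routing features from earlier. The sketch below assumes the extra_body fallback list can be combined with stream=True in the same way it is used for regular requests:

response = client.chat.completions.create(
    model="moonshotai/kimi-k2",
    messages=[{"role": "user", "content": "Explain backpropagation in three short paragraphs"}],
    stream=True,
    extra_body={"models": ["anthropic/claude-sonnet-4"]},  # backup if the primary is unavailable
)

for chunk in response:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()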

Handling Reasoning Tokens In OpenRouter

Some AI models can show you their “thinking” process before giving their final answer. These reasoning tokens provide a transparent look into how the model approaches complex problems, showing the step-by-step logic that leads to their conclusions. Understanding this internal reasoning can help you verify answers, debug model behavior, and build more trustworthy applications.

What are reasoning tokens?

Reasoning tokens appear in a separate reasoning field in the response, distinct from the main content. Different models support reasoning in different ways—some use effort levels while others use token budgets. 

Here’s a simple example that shows reasoning in action:

response = client.chat.completions.create(
   model="anthropic/claude-sonnet-4",
   messages=[
       {"role": "user", "content": "How many 'r's are in the word 'strrawberry'?"}
   ],
   max_tokens=2048,
   extra_body={
       "reasoning": {
           "max_tokens": 512
       }
   }
)

print("Final answer:")
print(response.choices[0].message.content)
print("\nReasoning process:")
print(response.choices[0].message.reasoning)
Final answer:
To count the 'r's in 'strrawberry', I'll go through each letter:
...
There are **4** 'r's in the word 'strrawberry'.

Reasoning process:
...

The model will show both its final answer and the internal reasoning that led to that conclusion. This dual output helps you understand whether the model approached the problem correctly.

Controlling reasoning intensity

You can control how much reasoning effort models put into their responses using two approaches. The effort parameter works with models like OpenAI's o-series and uses levels that correspond to specific token percentages based on your max_tokens setting:

  • High effort: Uses approximately 80% of max_tokens for reasoning
  • Medium effort: Uses approximately 50% of max_tokens for reasoning
  • Low effort: Uses approximately 20% of max_tokens for reasoning

# High effort reasoning for complex problems
response = client.chat.completions.create(
   model="deepseek/deepseek-r1",
   messages=[
       {"role": "user", "content": "Solve this step by step: If a train travels 240 miles in 3 hours, then speeds up by 20 mph for the next 2 hours, how far does it travel total?"}
   ],
   max_tokens=4000,  # High effort will use ~3200 tokens for reasoning
   extra_body={
       "reasoning": {
           "effort": "high" 
       }
   }
)

print("Problem solution:")
print(response.choices[0].message.content)
print("\nStep-by-step reasoning:")
print(response.choices[0].message.reasoning)

For models that support direct token allocation, like Anthropic’s models, you can specify exact reasoning budgets:

def get_reasoning_response(question, reasoning_budget=2000):
   response = client.chat.completions.create(
       model="anthropic/claude-sonnet-4",
       messages=[{"role": "user", "content": question}],
       max_tokens=10000,
       extra_body={
           "reasoning": {
               "max_tokens": reasoning_budget  # Exact token allocation
           }
       }
   )
   return response

# Compare different reasoning budgets
response = get_reasoning_response(
   "What's bigger: 9.9 or 9.11? Explain your reasoning carefully.",
   reasoning_budget=3000
)

print("Answer:", response.choices[0].message.content)
print("Detailed reasoning:", response.choices[0].message.reasoning)

Higher token budgets generally produce more thorough reasoning, while lower budgets give quicker but less detailed thought processes.
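
To see the effect in practice, you can call the helper above with different budgets and compare how much reasoning comes back. A rough sketch (the word count is just a crude proxy for thoroughness):

question = "What's bigger: 9.9 or 9.11? Explain your reasoning carefully."

for budget in (500, 3000):
    response = get_reasoning_response(question, reasoning_budget=budget)
    reasoning = response.choices[0].message.reasoning or ""  # may be empty for some models
    print(f"Budget {budget}: ~{len(reasoning.split())} words of reasoning")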

Preserving reasoning in conversations

When building multi-turn conversations, you need to preserve both the reasoning and the final answer to maintain context. This is particularly important for complex discussions where the model’s thinking process informs subsequent responses:

# First message with reasoning
response = client.chat.completions.create(
   model="anthropic/claude-sonnet-4",
   messages=[
       {"role": "user", "content": "Should I invest in renewable energy stocks? Consider both risks and opportunities."}
   ],
   extra_body={
       "reasoning": {
           "max_tokens": 3000
       }
   }
)

# Build conversation history with reasoning preserved
messages = [
   {"role": "user", "content": "Should I invest in renewable energy stocks? Consider both risks and opportunities."},
   {
       "role": "assistant",
       "content": response.choices[0].message.content,
       "reasoning_details": response.choices[0].message.reasoning_details  # Preserve reasoning
   },
   {"role": "user", "content": "What about solar energy specifically? How does that change your analysis?"}
]

# Continue conversation with reasoning context
follow_up = client.chat.completions.create(
   model="anthropic/claude-sonnet-4",
   messages=messages,
   extra_body={
       "reasoning": {
           "max_tokens": 2000
       }
   }
)

print("Follow-up answer:")
print(follow_up.choices[0].message.content)
print("\nContinued reasoning:")
print(follow_up.choices[0].message.reasoning)

The reasoning_details field keeps the complete reasoning chain, allowing the model to build on its previous analysis when answering follow-up questions. This creates more coherent and contextually aware conversations.

Cost and billing considerations

Reasoning tokens are billed as output tokens, so they increase your usage costs. However, they often improve response quality enough to justify the expense, especially for complex tasks where accuracy matters more than speed. According to OpenRouter’s documentation, reasoning tokens can improve model performance on challenging problems while providing transparency into the decision process.

For cost-conscious applications, you can balance reasoning quality against expense by adjusting effort levels or token budgets based on task complexity. Simple questions might not need reasoning at all, while complex problems benefit from high-effort reasoning.
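
One way to put that into practice is to decide the reasoning budget per request with a simple complexity flag. This is only a sketch; the helper name, model, and budget are arbitrary choices:

def ask(question, complex_task=False):
    # Only pay for reasoning tokens when the task seems to warrant it
    extra = {"reasoning": {"max_tokens": 3000}} if complex_task else {}
    return client.chat.completions.create(
        model="anthropic/claude-sonnet-4",
        messages=[{"role": "user", "content": question}],
        max_tokens=8000,
        extra_body=extra,
    )

quick = ask("What year was the transistor invented?")
deep = ask("Prove that the sum of two odd numbers is even.", complex_task=True)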

Working With Multimodal Models in OpenRouter

You’ve been working with text so far, but what happens when you need to analyze images or documents? Maybe you want to ask questions about a chart, extract information from a PDF, or describe what’s happening in a photo. That’s where multimodal models come in — they can understand both text and visual content in the same request.

Understanding multimodal capabilities

Instead of trying to describe an image in text, you can send the actual image and ask questions about it directly. This makes your applications way more intuitive since the model sees exactly what you’re working with. You don’t have to guess whether your text description captured all the important details.

You use multimodal models through the same interface you’ve been using, just with an extra attachments parameter to include your visual content. File attachments work with all models on OpenRouter. Even if a model doesn't natively support PDFs or images, OpenRouter internally parses these files and passes the content to the model.

Working with images

You can include images in your requests through URLs or base64 encoding. If your image is already online, the URL approach is simpler:

response = client.chat.completions.create(
   model="openai/gpt-5-mini",
   messages=[
       {
           "role": "user",
           "content": "What's happening in this image? Describe the scene in detail."
       }
   ],
   extra_body={
       "attachments": [
           {
               "type": "image/jpeg",
               "url": "https://example.com/photo.jpg"
           }
       ]
   }
)

print(response.choices[0].message.content)

For local images, you can use base64 encoding:

import base64

def encode_image_to_base64(image_path):
   with open(image_path, "rb") as image_file:
       encoded_string = base64.b64encode(image_file.read()).decode('utf-8')
   return encoded_string

# Analyze a local screenshot
encoded_image = encode_image_to_base64("screenshot.png")

response = client.chat.completions.create(
   model="openai/gpt-5-mini",
   messages=[
       {
           "role": "user",
           "content": "This is a screenshot of a data dashboard. What insights can you extract from the charts and metrics shown?"
       }
   ],
   extra_body={
       "attachments": [
           {
               "type": "image/png",
               "data": encoded_image
           }
       ]
   }
)

print(response.choices[0].message.content)

The model will look at the actual image and give you specific insights about what it sees, not just generic responses.

Processing PDF documents

PDF processing works the same way but opens up document analysis. You can ask questions about reports, analyze forms, or pull information from complex documents:

def encode_pdf_to_base64(pdf_path):
   with open(pdf_path, "rb") as pdf_file:
       encoded_string = base64.b64encode(pdf_file.read()).decode('utf-8')
   return encoded_string

# Analyze a research paper
encoded_pdf = encode_pdf_to_base64("research_paper.pdf")

response = client.chat.completions.create(
   model="openai/gpt-5-mini",
   messages=[
       {
           "role": "user",
           "content": "Summarize the key findings from this research paper. What are the main conclusions and methodology used?"
       }
   ],
   extra_body={
       "attachments": [
           {
               "type": "application/pdf",
               "data": encoded_pdf
           }
       ]
   }
)

print(response.choices[0].message.content)

This works great for financial reports, academic papers, contracts, or any PDF where you need AI analysis of the actual content. You can also include multiple attachments in a single request if you need to compare images or analyze multiple documents together.
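
Following the same attachments pattern used in this section, a request with two documents might look like the sketch below (the filenames are placeholders):

# Compare two PDFs in one request, reusing encode_pdf_to_base64 from above
report_q1 = encode_pdf_to_base64("q1_report.pdf")
report_q2 = encode_pdf_to_base64("q2_report.pdf")

response = client.chat.completions.create(
    model="openai/gpt-5-mini",
    messages=[
        {"role": "user", "content": "Compare these two quarterly reports. What changed the most?"}
    ],
    extra_body={
        "attachments": [
            {"type": "application/pdf", "data": report_q1},
            {"type": "application/pdf", "data": report_q2},
        ]
    },
)

print(response.choices[0].message.content)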

Cost and model selection

Multimodal requests cost more than text-only requests since you’re processing additional data types. Images and PDFs need more computational power, which shows up in the pricing. You can check each model’s specific multimodal pricing on the models page.

Different models have different strengths with visual content. Some are better at detailed image analysis, while others excel at document understanding. You’ll want to experiment with different models to find what works best for your specific needs and budget.
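
If you prefer to compare prices programmatically, the /api/v1/models endpoint used earlier also returns pricing information for each model. The sketch below assumes the response includes a pricing object with per-token prompt and completion prices, which is how the catalog is documented:

import os
import requests

api_key = os.getenv("OPENROUTER_API_KEY")
catalog = requests.get(
    "https://openrouter.ai/api/v1/models",
    headers={"Authorization": f"Bearer {api_key}"},
).json()["data"]

# Sort a couple of candidate models by their (assumed) prompt price
candidates = [m for m in catalog if m["id"] in ("openai/gpt-5-mini", "google/gemini-2.5-pro")]
for m in sorted(candidates, key=lambda m: float(m["pricing"]["prompt"])):
    print(m["id"], "prompt:", m["pricing"]["prompt"], "completion:", m["pricing"]["completion"])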

Using Structured Outputs

When you’re building real applications, you need predictable data formats that your code can reliably parse. Free-form text responses are great for chat interfaces, but terrible for applications that need to extract specific information. Instead of getting back unpredictable text that you have to parse with regex or hope the model formatted correctly, structured outputs force models to return guaranteed JSON with the exact fields and data types you need. This eliminates parsing errors and makes your application code much simpler.

Anatomy of structured output requests

Structured outputs use a response_format parameter with this basic structure:

"response_format": {
   "type": "json_schema",           # Always this for structured outputs
   "json_schema": {
       "name": "your_schema_name",  # Name for your schema
       "strict": True,              # Enforce strict compliance
       "schema": {
           # Your actual JSON schema definition goes here
       }
   }
}

Sentiment analysis example

Let’s walk through a complete example that extracts sentiment from text. This shows how structured outputs work in practice:

response = client.chat.completions.create(
   model="openai/gpt-5-mini",
   messages=[
       {"role": "user", "content": "Analyze the sentiment: 'This movie was absolutely terrible!'"}
   ],
   extra_body={
       "response_format": {
           "type": "json_schema",
           "json_schema": {
               "name": "sentiment_analysis",
               "strict": True,
               "schema": {
                   "type": "object",
                   "properties": {
                       "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
                       "confidence": {"type": "number"}
                   },
                   "required": ["sentiment", "confidence"]
               }
           }
       }
   }
)

import json
result = json.loads(response.choices[0].message.content)
print(result)
{'sentiment': 'negative', 'confidence': 0.98}

Here’s what’s happening in this schema:

  • sentiment: A string field restricted to three specific values using enum. The model can't return anything outside of "positive", "negative", or "neutral"
  • confidence: A number field for the model's confidence score
  • required: Both fields must be present in the response - the model can't skip them
  • strict: Set to True to enforce rigid compliance with the schema structure

Without structured outputs, you might get responses like “The sentiment is very negative with high confidence” or “Negative (95% sure)”. With the schema, you always get parseable JSON you can immediately use in your code.

Setting strict: True enforces the schema rigorously—the model can't deviate from your structure. The required array specifies which fields must be present. You can use enum to restrict values to specific choices, array for lists, and nested object types for complex data.
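
For more complex extractions, the same pattern scales to nested objects and arrays. Here is a sketch of a schema that pulls multiple product mentions out of a review; the field names are just an example:

import json

schema = {
    "type": "object",
    "properties": {
        "products": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
                },
                "required": ["name", "sentiment"],
            },
        }
    },
    "required": ["products"],
}

response = client.chat.completions.create(
    model="openai/gpt-5-mini",
    messages=[{"role": "user", "content": "The camera is great but the battery life is disappointing."}],
    extra_body={
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": "product_mentions", "strict": True, "schema": schema},
        }
    },
)

print(json.loads(response.choices[0].message.content))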

Model compatibility

Not all models support structured outputs, but most modern ones do. You can check the models page for compatibility. When a model doesn’t natively support structured outputs, OpenRouter often handles the formatting internally.

Structured outputs turn AI responses from unpredictable text into reliable data that your applications can depend on. For any production use case where you need consistent data extraction, this feature is essential.

Conclusion

You’ve learned how to access hundreds of AI models through OpenRouter’s unified API, from making your first request to implementing features like streaming, reasoning tokens, and structured outputs.

The platform’s automatic fallbacks and model routing mean your applications stay reliable even when individual providers face issues. With the same code patterns, you can compare models, switch providers, and find the perfect fit for each task without managing multiple API keys.

Start experimenting with simple requests and gradually try more features as your needs grow. Test different models for different types of tasks — some work better for creative writing, while others are stronger at data analysis or reasoning problems.

The knowledge you’ve gained here gives you what you need to build AI applications that aren’t locked into any single provider, giving you the freedom to adapt as new models and capabilities become available.


