Managing multiple AI provider APIs quickly becomes overwhelming. Each provider has different authentication methods, pricing models, and API specifications. Developers waste countless hours switching between OpenAI, Anthropic, Google, and other platforms just to access different models.
OpenRouter solves this complexity by providing a unified API that connects you to over 400 models from dozens of providers. You can access GPT-5, Claude 4, Gemini 2.5 Pro, and hundreds of other models using a single API key and consistent interface. The platform handles automatic fallbacks, cost management, and provider routing behind the scenes.
In this tutorial, I explain everything you need to know about OpenRouter, from setting up your first API call to implementing advanced features like structured outputs. By the end, you will know how to build reliable applications that aren’t tied to a single provider.
What Is OpenRouter?
OpenRouter is a unified API platform that gives you access to over 400 AI models from dozens of providers through a single endpoint. Instead of juggling separate API keys for OpenAI, Anthropic, Google, Meta, and others, you use one key to reach their entire model catalog.
The platform works as an intelligent router, sending your requests to the right provider while taking care of authentication, billing, and error handling. This approach fixes several headaches that come with using multiple AI providers.
Problems OpenRouter solves
Working with multiple AI providers gets messy fast. Each one has its own API format, login process, and billing system. You end up maintaining separate code for each service, which slows down development and makes testing new models a pain.
Things get worse when providers go down or hit you with rate limits. Your app breaks, and there’s nothing you can do except wait. Plus, figuring out which provider offers the best price for similar models means tracking costs manually across different platforms.
The biggest issue is getting locked into one provider. When you build everything around their specific API, switching to better models or cheaper options later becomes a major project.
How OpenRouter fixes this
OpenRouter solves these problems with a set of connected features:
- One API key works with 400+ models from all major providers
- Automatic switching to backup providers when your first choice fails
- Side-by-side pricing for all models so you can compare costs instantly
- Works with existing OpenAI code — just change the endpoint URL
- Real-time monitoring that routes requests to the fastest available provider
These pieces work together to make AI development smoother and more reliable.
Who should use OpenRouter?
Different types of users get value from this unified approach:
- Developers can try new models without setting up accounts everywhere, making experimentation faster
- Enterprise teams get the uptime they need through automatic backups when providers fail
- Budget-conscious users can find the cheapest option for their needs without spreadsheet math
- Researchers get instant access to cutting-edge models without account setup overhead
Now that you understand what OpenRouter brings to the table, let’s get you set up with your first API call.
Prerequisites
Before diving into OpenRouter, you’ll need a few things set up on your machine. This tutorial assumes you’re comfortable with basic Python programming and have worked with APIs before. You don’t need to be an expert, but you should understand concepts like making HTTP requests and handling JSON responses.
You’ll need Python 3.9 or later installed on your system. We’ll be using the openai Python package to interact with OpenRouter's API, requests for a couple of direct calls to OpenRouter's REST endpoints, and python-dotenv to handle environment variables securely. You can install all three with:
pip install requests openai python-dotenv
You’ll also need an OpenRouter account and API key. Head to openrouter.ai to create a free account — you’ll get a small credit allowance to test things out. Once you’re logged in, go to the API Keys section in your account settings and generate a new key.
After getting your API key, create a .env file in your project directory and add your key like this:
OPENROUTER_API_KEY=your_api_key_here
This keeps your API key secure and out of your code. If you plan to use OpenRouter beyond testing, you’ll need to add credits to your account through the Credits page.
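If you want to confirm the key loads correctly before making any calls, a quick check like this works (a minimal sketch; it just reads the variable defined in your .env file):

```python
import os
from dotenv import load_dotenv

load_dotenv()  # reads the .env file in the current directory

# Fail fast if the key is missing or still set to the placeholder value
api_key = os.getenv("OPENROUTER_API_KEY")
if not api_key or api_key == "your_api_key_here":
    raise RuntimeError("OPENROUTER_API_KEY is not set - check your .env file")
print("OpenRouter API key loaded.")
```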
With these basics in place, you’re ready to make your first API call through OpenRouter.
Making Your First API Call in OpenRouter
Getting started with OpenRouter is remarkably simple if you’ve used the OpenAI SDK before. You just change one line of code and suddenly have access to hundreds of models from different providers.
Your first request and setup
Let’s jump right in with a working example that demonstrates OpenRouter’s approach:
import os
from openai import OpenAI
from dotenv import load_dotenv
load_dotenv()
client = OpenAI(
base_url="https://openrouter.ai/api/v1",
api_key=os.getenv("OPENROUTER_API_KEY"),
)
response = client.chat.completions.create(
model="openai/gpt-5-mini",
messages=[
{
"role": "user",
"content": "Write a haiku about debugging code at 2 AM"
}
]
)
print(response.choices[0].message.content)
Night hum, coffee cooled
cursor blinks, bug hides somewhere
I chase ghosts 'til dawn
The magic happens in two places. First, the base_url parameter redirects your requests to OpenRouter's servers instead of OpenAI's. Second, the model name follows a provider/model-name format - openai/gpt-5-mini instead of just gpt-5-mini. This tells OpenRouter which provider's version you want while keeping the familiar interface. Here are some common models that you can plug into the above example without any errors:
- google/gemini-2.0-flash-001
- google/gemini-2.5-pro
- mistralai/mistral-nemo
- deepseek/deepseek-r1-distill-qwen-32b
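Because swapping models is just a string change, you can compare several of them with the same prompt in a quick loop. Here's a minimal sketch that reuses the client configured above with the model IDs listed here:

```python
# Try the same prompt against several models by swapping the model ID
models_to_try = [
    "google/gemini-2.0-flash-001",
    "mistralai/mistral-nemo",
    "deepseek/deepseek-r1-distill-qwen-32b",
]

for model_id in models_to_try:
    response = client.chat.completions.create(
        model=model_id,
        messages=[{"role": "user", "content": "Write a haiku about debugging code at 2 AM"}],
    )
    print(f"--- {model_id} ---")
    print(response.choices[0].message.content)
```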
Now that you’ve seen how easy it is to work with different models, you might be wondering: what happens when your chosen model is unavailable? How do you build applications that stay reliable even when providers face issues? That’s where OpenRouter’s routing and resilience features come in.
Model Routing For Resilience
Building reliable AI applications means preparing for the unexpected. Providers experience downtime, models hit rate limits, and sometimes content moderation blocks your requests. Model routing is OpenRouter’s solution — it automatically switches between different models to keep your application running smoothly.
Setting up manual fallbacks
The most straightforward way to add resilience is to specify backup models. When your primary choice fails, OpenRouter tries your alternatives in order. The extra_body parameter passes these routing instructions to OpenRouter's API since the OpenAI SDK doesn't natively support this feature:
response = client.chat.completions.create(
model="moonshotai/kimi-k2", # Primary choice
messages=[
{"role": "user", "content": "Explain quantum computing in simple terms"}
],
extra_body={
"models": ["anthropic/claude-sonnet-4", "deepseek/deepseek-r1"]
}
)
print(f"Response from: {response.model}")
print(response.choices[0].message.content)
Response from: moonshotai/kimi-k2
Imagine a normal computer bit as a tiny light-switch that can only be OFF (0) or ON (1)...
OpenRouter tries Kimi-K2 first. If it’s unavailable, rate-limited, or blocked, it automatically tries Claude Sonnet 4, then DeepSeek R1. The response.model field shows which model actually responded.
Auto router for maximum convenience
Once you understand manual fallbacks, the auto router becomes really appealing. It handles model selection and fallbacks automatically, powered by NotDiamond’s evaluation system:
response = client.chat.completions.create(
model="openrouter/auto",
messages=[
{"role": "user", "content": "Debug this Python code in 3 sentences: def factorial(n): return n * factorial(n-1)"}
]
)
print(f"Auto router selected: {response.model}")
print(response.choices[0].message.content)
Auto router selected: openai/chatgpt-4o-latest
The given code is missing a base case, which causes infinite recursion and eventually a RecursionError. To fix it, add a base case like `if n == 0: return 1` before the recursive call. Here's the corrected version:
```python
def factorial(n):
    if n == 0:
        return 1
    return n * factorial(n - 1)
```
The auto router analyzes your prompt and picks the best available model, with built-in fallbacks if your first choice is unavailable. You get resilience without any configuration. However, use the auto router with care in sensitive or high-stakes scenarios: it tends to underestimate the complexity of a problem and route it to a lower-capacity model.
Building effective fallback strategies
Not all models make good backups for each other. Provider downtime may affect all models from that company, so choose fallbacks from different providers. Rate limits and costs vary dramatically, so pair expensive models with cheaper alternatives as well:
# Good fallback chain: different providers, decreasing cost
response = client.chat.completions.create(
model="anthropic/claude-sonnet-4",
messages=[
{"role": "user", "content": "Your prompt here"}
],
extra_body={
"models": [
"x-ai/grok-4", # Close performance
"moonshotai/kimi-k2", # Cheaper
"deepseek/deepseek-r1:free" # Free backup
]
}
)
This gives you premium quality when available, solid performance as backup, and guaranteed availability as a last resort. Content moderation policies also differ between providers, so diversifying your chain gives better coverage for sensitive topics.
Finding models for your fallback chain
The models page lets you filter by provider and capabilities to build your chain. Many powerful open-weight models like DeepSeek R1 and Kimi-K2 have free variants, which make excellent fallbacks. Free models are rate-limited to 50 requests per day for new users, or 1,000 requests per day if you’ve purchased 10 credits.
For dynamic applications, you can discover models programmatically:
import requests

def get_provider_models(api_key: str, provider: str) -> list[str]:
    # Fetch the full model catalog and keep IDs that start with the provider prefix
    r = requests.get(
        "https://openrouter.ai/api/v1/models",
        headers={"Authorization": f"Bearer {api_key}"}
    )
    return [m["id"] for m in r.json()["data"] if m["id"].startswith(provider)]

# Build fallbacks across providers
api_key = os.getenv("OPENROUTER_API_KEY")
openai_models = get_provider_models(api_key, "openai/")
anthropic_models = get_provider_models(api_key, "anthropic/")
This approach lets you build robust fallback chains that adapt as new models become available.
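For example, assuming the get_provider_models helper above, you could assemble a cross-provider fallback chain at runtime and pass it straight into a request. The selection logic here (taking the first model from each list) is purely illustrative:

```python
# Build a cross-provider fallback chain from the discovered model lists (illustrative selection)
fallback_chain = [anthropic_models[0], openai_models[0]]

response = client.chat.completions.create(
    model="moonshotai/kimi-k2",  # primary choice
    messages=[{"role": "user", "content": "Your prompt here"}],
    extra_body={"models": fallback_chain},  # fallbacks tried in order
)
print(f"Response from: {response.model}")
```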
Streaming For Real-time Responses
When working with AI models, especially for longer responses, users expect to see output appear progressively rather than waiting for the complete response. Streaming solves this by sending response chunks as they’re generated, creating a more interactive experience similar to ChatGPT’s interface.
Basic streaming setup
To set up streaming in OpenRouter, add stream=True to your request. The response becomes an iterator that yields chunks as the model generates them:
response = client.chat.completions.create(
model="openai/gpt-5",
messages=[
{"role": "user", "content": "Write a detailed explanation of how neural networks learn"}
],
stream=True
)
for chunk in response:
if chunk.choices[0].delta.content is not None:
print(chunk.choices[0].delta.content, end="")
Each chunk contains a small piece of the response. The delta.content field holds the new text fragment, and we print it immediately without a newline to create the streaming effect. The end="" parameter prevents print from adding newlines between chunks.
Building a better streaming handler
For production applications, you’ll want more control over the streaming process. Here’s a more comprehensive handler that manages the complete response:
def stream_response(model, messages, show_progress=True):
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        stream=True
    )

    complete_response = ""
    for chunk in response:
        if chunk.choices[0].delta.content is not None:
            content = chunk.choices[0].delta.content
            complete_response += content
            if show_progress:
                print(content, end="", flush=True)

    if show_progress:
        print()  # Add final newline
    return complete_response
# Use it with different models
result = stream_response(
"anthropic/claude-sonnet-4",
[{"role": "user", "content": "Explain quantum entanglement like I'm 12 years old"}]
)
This handler captures the complete response while displaying progress, gives you both the streaming experience and the final text, and includes proper output formatting.
Streaming changes the user experience from “waiting and hoping” to “watching progress happen.” This makes your AI applications feel much more responsive and engaging for users.
Handling Reasoning Tokens In OpenRouter
Some AI models can show you their “thinking” process before giving their final answer. These reasoning tokens provide a transparent look into how the model approaches complex problems, showing the step-by-step logic that leads to their conclusions. Understanding this internal reasoning can help you verify answers, debug model behavior, and build more trustworthy applications.
What are reasoning tokens
Reasoning tokens appear in a separate reasoning field in the response, distinct from the main content. Different models support reasoning in different ways—some use effort levels while others use token budgets.
Here’s a simple example that shows reasoning in action:
response = client.chat.completions.create(
model="anthropic/claude-sonnet-4",
messages=[
{"role": "user", "content": "How many 'r's are in the word 'strrawberry'?"}
],
max_tokens=2048,
extra_body={
"reasoning": {
"max_tokens": 512
}
}
)
print("Final answer:")
print(response.choices[0].message.content)
print("\nReasoning process:")
print(response.choices[0].message.reasoning)
Final answer:
To count the 'r's in 'strrawberry', I'll go through each letter:
...
There are **4** 'r's in the word 'strrawberry'.
Reasoning process:
...
The model will show both its final answer and the internal reasoning that led to that conclusion. This dual output helps you understand whether the model approached the problem correctly.
Controlling reasoning intensity
You can control how much reasoning effort models put into their responses using two approaches. The effort parameter works with models like OpenAI's o-series and uses levels that correspond to specific token percentages based on your max_tokens setting:
- High effort: Uses approximately 80% of max_tokens for reasoning
- Medium effort: Uses approximately 50% of max_tokens for reasoning
- Low effort: Uses approximately 20% of max_tokens for reasoning
# High effort reasoning for complex problems
response = client.chat.completions.create(
model="deepseek/deepseek-r1",
messages=[
{"role": "user", "content": "Solve this step by step: If a train travels 240 miles in 3 hours, then speeds up by 20 mph for the next 2 hours, how far does it travel total?"}
],
max_tokens=4000, # High effort will use ~3200 tokens for reasoning
extra_body={
"reasoning": {
"effort": "high"
}
}
)
print("Problem solution:")
print(response.choices[0].message.content)
print("\nStep-by-step reasoning:")
print(response.choices[0].message.reasoning)
For models that support direct token allocation, like Anthropic’s models, you can specify exact reasoning budgets:
def get_reasoning_response(question, reasoning_budget=2000):
    response = client.chat.completions.create(
        model="anthropic/claude-sonnet-4",
        messages=[{"role": "user", "content": question}],
        max_tokens=10000,
        extra_body={
            "reasoning": {
                "max_tokens": reasoning_budget  # Exact token allocation
            }
        }
    )
    return response
# Compare different reasoning budgets
response = get_reasoning_response(
"What's bigger: 9.9 or 9.11? Explain your reasoning carefully.",
reasoning_budget=3000
)
print("Answer:", response.choices[0].message.content)
print("Detailed reasoning:", response.choices[0].message.reasoning)
Higher token budgets generally produce more thorough reasoning, while lower budgets give quicker but less detailed thought processes.
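To see this in practice, you can call the helper above with a few different budgets and compare how much reasoning comes back. This is a rough sketch; word count is just a crude proxy for thoroughness:

```python
# Compare how the reasoning budget affects the amount of reasoning returned
for budget in [500, 2000, 4000]:
    response = get_reasoning_response(
        "What's bigger: 9.9 or 9.11? Explain your reasoning carefully.",
        reasoning_budget=budget,
    )
    reasoning = response.choices[0].message.reasoning or ""
    print(f"Budget {budget}: ~{len(reasoning.split())} words of reasoning")
```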
Preserving reasoning in conversations
When building multi-turn conversations, you need to preserve both the reasoning and the final answer to maintain context. This is particularly important for complex discussions where the model’s thinking process informs subsequent responses:
# First message with reasoning
response = client.chat.completions.create(
model="anthropic/claude-sonnet-4",
messages=[
{"role": "user", "content": "Should I invest in renewable energy stocks? Consider both risks and opportunities."}
],
extra_body={
"reasoning": {
"max_tokens": 3000
}
}
)
# Build conversation history with reasoning preserved
messages = [
{"role": "user", "content": "Should I invest in renewable energy stocks? Consider both risks and opportunities."},
{
"role": "assistant",
"content": response.choices[0].message.content,
"reasoning_details": response.choices[0].message.reasoning_details # Preserve reasoning
},
{"role": "user", "content": "What about solar energy specifically? How does that change your analysis?"}
]
# Continue conversation with reasoning context
follow_up = client.chat.completions.create(
model="anthropic/claude-sonnet-4",
messages=messages,
extra_body={
"reasoning": {
"max_tokens": 2000
}
}
)
print("Follow-up answer:")
print(follow_up.choices[0].message.content)
print("\nContinued reasoning:")
print(follow_up.choices[0].message.reasoning)
The reasoning_details field keeps the complete reasoning chain, allowing the model to build on its previous analysis when answering follow-up questions. This creates more coherent and contextually aware conversations.
Cost and billing considerations
Reasoning tokens are billed as output tokens, so they increase your usage costs. However, they often improve response quality enough to justify the expense, especially for complex tasks where accuracy matters more than speed. According to OpenRouter’s documentation, reasoning tokens can improve model performance on challenging problems while providing transparency into the decision process.
For cost-conscious applications, you can balance reasoning quality against expense by adjusting effort levels or token budgets based on task complexity. Simple questions might not need reasoning at all, while complex problems benefit from high-effort reasoning.
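One simple pattern is to size the reasoning configuration up front based on task difficulty. The helper and thresholds below are made up for illustration, not an OpenRouter feature:

```python
def reasoning_config(is_complex: bool) -> dict:
    """Return an extra_body reasoning config sized to the task (illustrative values)."""
    if is_complex:
        return {"reasoning": {"max_tokens": 3000}}  # spend more on hard problems
    return {}  # skip reasoning entirely for simple questions

response = client.chat.completions.create(
    model="anthropic/claude-sonnet-4",
    messages=[{"role": "user", "content": "What year did the Apollo 11 landing happen?"}],
    extra_body=reasoning_config(is_complex=False),
)
print(response.choices[0].message.content)
```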
Working With Multimodal Models in OpenRouter
You’ve been working with text so far, but what happens when you need to analyze images or documents? Maybe you want to ask questions about a chart, extract information from a PDF, or describe what’s happening in a photo. That’s where multimodal models come in — they can understand both text and visual content in the same request.
Understanding multimodal capabilities
Instead of trying to describe an image in text, you can send the actual image and ask questions about it directly. This makes your applications way more intuitive since the model sees exactly what you’re working with. You don’t have to guess whether your text description captured all the important details.
You use multimodal models through the same chat interface you’ve been using; the only change is that a message’s content becomes a list of parts that mixes text with images or files, following the OpenAI-style content format. PDF attachments work with any model on OpenRouter: even if a model doesn’t natively support PDFs, OpenRouter parses the file and passes the extracted content to the model. Image inputs, on the other hand, need a model with vision support.
Working with images
You can include images in your requests through URLs or base64 encoding. If your image is already online, the URL approach is simpler:
response = client.chat.completions.create(
    model="openai/gpt-5-mini",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What's happening in this image? Describe the scene in detail."
                },
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/photo.jpg"}
                }
            ]
        }
    ]
)

print(response.choices[0].message.content)
For local images, you can use base64 encoding:
import base64

def encode_image_to_base64(image_path):
    with open(image_path, "rb") as image_file:
        encoded_string = base64.b64encode(image_file.read()).decode("utf-8")
    return encoded_string

# Analyze a local screenshot
encoded_image = encode_image_to_base64("screenshot.png")

response = client.chat.completions.create(
    model="openai/gpt-5-mini",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "This is a screenshot of a data dashboard. What insights can you extract from the charts and metrics shown?"
                },
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{encoded_image}"}
                }
            ]
        }
    ]
)

print(response.choices[0].message.content)
The model will look at the actual image and give you specific insights about what it sees, not just generic responses.
Processing PDF documents
PDF processing works the same way but opens up document analysis. You can ask questions about reports, analyze forms, or pull information from complex documents:
def encode_pdf_to_base64(pdf_path):
    with open(pdf_path, "rb") as pdf_file:
        encoded_string = base64.b64encode(pdf_file.read()).decode("utf-8")
    return encoded_string

# Analyze a research paper
encoded_pdf = encode_pdf_to_base64("research_paper.pdf")

response = client.chat.completions.create(
    model="openai/gpt-5-mini",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Summarize the key findings from this research paper. What are the main conclusions and methodology used?"
                },
                {
                    "type": "file",
                    "file": {
                        "filename": "research_paper.pdf",
                        "file_data": f"data:application/pdf;base64,{encoded_pdf}"
                    }
                }
            ]
        }
    ]
)

print(response.choices[0].message.content)
This works great for financial reports, academic papers, contracts, or any PDF where you need AI analysis of the actual content. You can also include multiple images or files in a single request if you need to compare images or analyze multiple documents together, as shown in the sketch below.
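Here’s a minimal sketch of that pattern with two image URLs in one message (the URLs are placeholders):

```python
# Compare two images in a single request by adding multiple image parts
response = client.chat.completions.create(
    model="openai/gpt-5-mini",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Compare these two charts. Which one shows a clearer upward trend?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/chart_q1.png"}},
                {"type": "image_url", "image_url": {"url": "https://example.com/chart_q2.png"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```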
Cost and model selection
Multimodal requests cost more than text-only requests since you’re processing additional data types. Images and PDFs need more computational power, which shows up in the pricing. You can check each model’s specific multimodal pricing on the models page.
Different models have different strengths with visual content. Some are better at detailed image analysis, while others excel at document understanding. You’ll want to experiment with different models to find what works best for your specific needs and budget.
Using Structured Outputs
When you’re building real applications, you need predictable data formats that your code can reliably parse. Free-form text responses are great for chat interfaces, but terrible for applications that need to extract specific information. Instead of getting back unpredictable text that you have to parse with regex or hope the model formatted correctly, structured outputs force models to return guaranteed JSON with the exact fields and data types you need. This eliminates parsing errors and makes your application code much simpler.
Anatomy of structured output requests
Structured outputs use a response_format parameter with this basic structure:
"response_format": {
"type": "json_schema", # Always this for structured outputs
"json_schema": {
"name": "your_schema_name", # Name for your schema
"strict": True, # Enforce strict compliance
"schema": {
# Your actual JSON schema definition goes here
}
}
}
Sentiment analysis example
Let’s walk through a complete example that extracts sentiment from text. This shows how structured outputs work in practice:
response = client.chat.completions.create(
model="openai/gpt-5-mini",
messages=[
{"role": "user", "content": "Analyze the sentiment: 'This movie was absolutely terrible!'"}
],
extra_body={
"response_format": {
"type": "json_schema",
"json_schema": {
"name": "sentiment_analysis",
"strict": True,
"schema": {
"type": "object",
"properties": {
"sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
"confidence": {"type": "number"}
},
"required": ["sentiment", "confidence"]
}
}
}
}
)
import json
result = json.loads(response.choices[0].message.content)
print(result)
{'sentiment': 'negative', 'confidence': 0.98}
Here’s what’s happening in this schema:
- sentiment: A string field restricted to three specific values using enum. The model can't return anything outside of "positive", "negative", or "neutral"
- confidence: A number field for the model's confidence score
- required: Both fields must be present in the response - the model can't skip them
- strict: True: Enforces rigid compliance with the schema structure
Without structured outputs, you might get responses like “The sentiment is very negative with high confidence” or “Negative (95% sure)”. With the schema, you always get parseable JSON you can immediately use in your code.
Setting strict: True enforces the schema rigorously—the model can't deviate from your structure. The required array specifies which fields must be present. You can use enum to restrict values to specific choices, array for lists, and nested object types for complex data.
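For instance, here’s a sketch of a richer schema that combines enum, array, and a nested object to pull structured details out of a product review. The schema and field names are just an example, not a fixed format:

```python
import json

# A richer schema: nested object plus an array of strings
response = client.chat.completions.create(
    model="openai/gpt-5-mini",
    messages=[
        {"role": "user", "content": "Extract details from this review: 'The UltraBlend 3000 is great, but it's loud and the lid leaks.'"}
    ],
    extra_body={
        "response_format": {
            "type": "json_schema",
            "json_schema": {
                "name": "review_extraction",
                "strict": True,
                "schema": {
                    "type": "object",
                    "properties": {
                        "product": {
                            "type": "object",
                            "properties": {
                                "name": {"type": "string"},
                                "sentiment": {"type": "string", "enum": ["positive", "negative", "mixed"]}
                            },
                            "required": ["name", "sentiment"]
                        },
                        "complaints": {"type": "array", "items": {"type": "string"}}
                    },
                    "required": ["product", "complaints"]
                }
            }
        }
    }
)

print(json.loads(response.choices[0].message.content))
```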
Model compatibility
Not all models support structured outputs, but most modern ones do. You can check the models page for compatibility. When a model doesn’t natively support structured outputs, OpenRouter often handles the formatting internally.
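One way to check programmatically is to look at the supported parameters that the models endpoint reports for each model. Treat this as a sketch, since the exact field contents may change over time:

```python
import os
import requests

# List models that advertise structured output support in their supported_parameters field
r = requests.get(
    "https://openrouter.ai/api/v1/models",
    headers={"Authorization": f"Bearer {os.getenv('OPENROUTER_API_KEY')}"}
)
structured_models = [
    m["id"] for m in r.json()["data"]
    if "structured_outputs" in m.get("supported_parameters", [])
]
print(structured_models[:10])
```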
Structured outputs turn AI responses from unpredictable text into reliable data that your applications can depend on. For any production use case where you need consistent data extraction, this feature is essential.
Conclusion
We’ve learned how to access hundreds of AI models through OpenRouter’s unified API, from making your first request to implementing features like streaming, reasoning tokens, and structured outputs.
The platform’s automatic fallbacks and model routing mean your applications stay reliable even when individual providers face issues. With the same code patterns, we can compare models, switch providers, and find the perfect fit for each task without managing multiple API keys.
Start experimenting with simple requests and gradually try more features as your needs grow. Test different models for different types of tasks — some work better for creative writing, while others are stronger at data analysis or reasoning problems.
The knowledge you’ve gained here gives you what you need to build AI applications that aren’t locked into any single provider, giving you the freedom to adapt as new models and capabilities become available.

I am a data science content creator with over 2 years of experience and one of the largest followings on Medium. I like to write detailed articles on AI and ML with a bit of a sarcastic style because you've got to do something to make them a bit less dull. I have produced over 130 articles and a DataCamp course to boot, with another one in the making. My content has been seen by over 5 million pairs of eyes, 20k of whom became followers on both Medium and LinkedIn.


