Switching between different LLM providers shouldn’t require rewriting your entire codebase. But that’s exactly what happens when you work with multiple AI services today. Each provider has its own SDK, authentication method, and response format. LiteLLM fixes this with a unified interface.
In this tutorial, I’ll walk through the main parts of LiteLLM, starting with basic API calls and working up to streaming responses, structured outputs, and cost tracking.
What Is LiteLLM?
LiteLLM is an open-source Python library that works as a universal translator for AI models. Instead of learning different APIs for each provider, you use one interface that connects to over 100 LLM services.
The framework has two main parts. The Python SDK lets you write code once and run it with any provider. The proxy server acts as a central gateway for teams and companies that need to manage AI services at scale.
LiteLLM gives you these additional benefits:
- Built-in cost tracking: See spending across all providers in one dashboard instead of checking multiple billing systems
- Automatic failovers: When one provider goes down or hits rate limits, LiteLLM automatically tries your backup options
- Self-hosting option: Run everything on your own servers if you need data privacy or compliance
Note that OpenRouter provides similar unified access to multiple LLMs, though with different pricing and feature sets.
The proxy server adds spending limits per team, usage tracking across projects, and centralized API key management. All configuration happens through simple YAML files.
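For reference, here’s a minimal sketch of what such a config might look like (the model aliases and environment variable references are placeholders; check the proxy docs for the full schema):
# config.yaml - minimal sketch of a LiteLLM proxy configuration (placeholder values)
model_list:
  - model_name: gpt-5                         # alias your apps will request
    litellm_params:
      model: openai/gpt-5
      api_key: os.environ/OPENAI_API_KEY      # read from an environment variable
  - model_name: claude-sonnet-4
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY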
LiteLLM works well for different types of users:
- Individual developers who want to test multiple models without setting up accounts everywhere
- Small teams that need cost visibility and don’t want vendor lock-in
- Large companies that require centralized management, budget controls, and detailed usage analytics
The library is MIT-licensed and completely open-source. You can inspect the code, modify it for your needs, or contribute back to the project.
Now that you know what LiteLLM can do, here’s how to set up your environment.
Prerequisites
Before you can start using LiteLLM, you need a few things set up. Most of this is probably already done if you’ve worked with AI models before.
Python environment
You need Python 3.7 or newer. Most systems today have this, but you can check your version with:
import sys
print(f"Python version: {sys.version_info.major}.{sys.version_info.minor}")
Package installation
Install LiteLLM and python-dotenv for managing environment variables:
uv add litellm python-dotenv
# or with pip: pip install litellm python-dotenv
API keys
You’ll need API keys from the providers you want to use. For this tutorial, we’ll work with two models:
- OpenAI API key: You can get a key from platform.openai.com (for GPT-5)
- Anthropic API key: You can get a key from console.anthropic.com (for Claude Sonnet 4)
Store these in a .env file in your project directory:
OPENAI_API_KEY=your_openai_key_here
ANTHROPIC_API_KEY=your_anthropic_key_here
Environment check
Run this script to check your setup:
import os
import sys
from dotenv import load_dotenv
# Load environment variables
load_dotenv()
# Check Python version
print(f"Python version: {sys.version_info.major}.{sys.version_info.minor}")
# Check LiteLLM
try:
    import litellm
    print("✓ LiteLLM installed")
except ImportError:
    print("✗ LiteLLM not installed")

# Check API keys
openai_key = os.getenv("OPENAI_API_KEY")
anthropic_key = os.getenv("ANTHROPIC_API_KEY")

if openai_key:
    print("✓ OpenAI API key found")
else:
    print("✗ OpenAI API key not found")

if anthropic_key:
    print("✓ Anthropic API key found")
else:
    print("✗ Anthropic API key not found")
Python version: 3.11
✓ LiteLLM installed
✓ OpenAI API key found
✓ Anthropic API key found
If you see all green checkmarks, you’re ready to go. If any checks fail, make sure you’ve installed the packages and added your API keys to the .env file.
Making Your First API Call With LiteLLM
LiteLLM uses the same interface you might know from OpenAI, but it works with any provider.
Basic setup
Start with these imports:
import litellm
from dotenv import load_dotenv
# Load your API keys
load_dotenv()
A basic request to GPT-5 follows this pattern: define your messages, call litellm.completion(), and get the response.
messages = [
    {"role": "user", "content": "Write a simple Python function that adds two numbers"}
]

response = litellm.completion(
    model="gpt-5",
    messages=messages
)
print(response.choices[0].message.content)
def add(a, b):
    """Return the sum of a and b."""
    return a + b
You just called GPT-5 through LiteLLM. The response object follows OpenAI’s format, so if you’ve used their API before, this will look familiar. You might want to inspect what model actually responded and check token usage:
print(f"Model used: {response.model}")
print(f"Response ID: {response.id}")
print(f"Total tokens: {response.usage.total_tokens}")
Model used: gpt-5-2025-08-07
Response ID: chatcmpl-CIbZWUDE1amW2vHtHDmieTX6hnZUq
Total tokens: 557
Switching providers
LiteLLM’s real power shows up when switching providers. Want to try Claude Sonnet 4 instead? Change one parameter:
response = litellm.completion(
    model="anthropic/claude-sonnet-4-20250514",
    messages=messages
)
print(response.choices[0].message.content)
Here's a simple Python function that adds two numbers:
def add_numbers(a, b):
    """
    Adds two numbers and returns the result.

    Args:
        a: First number
        b: Second number

    Returns:
        The sum of a and b
    """
    return a + b

# Example usage:
result = add_numbers(5, 3)
print(result)  # Output: 8
This function:
- Takes two parameters a and b
- Returns their sum using the + operator
- Works with integers, floats, and even negative numbers
- Includes a docstring to explain what the function does
Same code structure, different provider. Claude gave a much more detailed response with documentation and examples, while GPT-5 was more concise.
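If you want to compare several providers side by side, you can loop over model names with the same messages list — a quick sketch (the model list here is just an example):
# Run the same prompt against multiple providers (model list is an example)
for model in ["gpt-5", "anthropic/claude-sonnet-4-20250514"]:
    response = litellm.completion(model=model, messages=messages)
    print(f"\n--- {model} ---")
    print(response.choices[0].message.content)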
Basic error handling
Add some error handling for when things go wrong:
try:
    response = litellm.completion(
        model="gpt-5",
        messages=messages
    )
    print(response.choices[0].message.content)
except Exception as e:
    print(f"API call failed: {e}")
This pattern catches common issues like invalid API keys, network problems, or model unavailability.
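If you want to react differently to each failure mode, LiteLLM maps provider errors to OpenAI-style exception classes that you can catch individually. Here’s a sketch, assuming the standard names from LiteLLM’s exception mapping:
# Handle specific failure modes (assumes LiteLLM's OpenAI-style exception classes)
try:
    response = litellm.completion(model="gpt-5", messages=messages)
    print(response.choices[0].message.content)
except litellm.AuthenticationError:
    print("Check your API key")
except litellm.RateLimitError:
    print("Rate limited - wait or switch providers")
except litellm.APIConnectionError:
    print("Network problem - try again later")
except Exception as e:
    print(f"API call failed: {e}")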
What’s next
You now know the basic pattern for using LiteLLM with any provider. The Getting Started documentation covers more details about supported models and parameters, and the Completion function reference shows all available options.
For building production applications, you’ll also want to follow API design best practices to create reliable, high-performance systems.
Provider Switching and Fallbacks
LiteLLM’s fallback system automatically handles provider failures so your users never see error messages. When you’re building real applications, you need backup plans that work without any extra code on your part.
Basic fallback setup
To set up a fallback, add a fallbacks list to your completion call:
messages = [
    {"role": "user", "content": "Explain what a Python decorator is in one sentence"}
]

response = litellm.completion(
    model="gpt-5",
    messages=messages,
    fallbacks=["anthropic/claude-sonnet-4-20250514"]
)
print(f"Response: {response.choices[0].message.content}")
print(f"Model used: {response.model}")
Response: A Python decorator is a callable that wraps another function or class to augment its behavior and returns it, applied with the @ syntax so you can add reusable functionality without changing the original code.
Model used: gpt-5-2025-08-07
If GPT-5 fails, LiteLLM automatically tries Claude Sonnet 4. Users get a response either way.
Building fallback chains
You can chain multiple providers for better reliability. Using the same messages from before:
response = litellm.completion(
    model="gpt-5",
    messages=messages,
    fallbacks=["anthropic/claude-sonnet-4-20250514", "gpt-3.5-turbo"]
)
print(f"Model used: {response.model}")
Model used: gpt-5-2025-08-07
LiteLLM tries models in order. If the primary fails, it tries the first fallback. If that fails, it tries the second. Multiple layers of protection.
Adding retries
Combine fallbacks with retries for even better reliability:
response = litellm.completion(
    model="gpt-5",
    messages=messages,  # Same messages as before
    num_retries=2
)
Retries the same model twice before moving to fallbacks. Good for temporary network issues or rate limit spikes.
Fallbacks handle most failures automatically, but you should still catch the case where every model fails — for example, inside a small helper function:
def ask_with_fallbacks(messages):
    """Return a response, retrying and falling back before giving up."""
    try:
        response = litellm.completion(
            model="gpt-5",
            messages=messages,  # Same messages as before
            fallbacks=["anthropic/claude-sonnet-4-20250514"],
            num_retries=2
        )
        return response.choices[0].message.content
    except Exception:
        # All models failed - handle gracefully
        return "Sorry, I'm having trouble right now. Please try again."
Even when providers work perfectly, users still face another problem: frustrating wait time. Someone clicks submit in your app and sees nothing for several seconds. They wonder if something broke.
Streaming for Real-Time Responses
Most AI API responses arrive as one complete block, like downloading a file. Users wait with no feedback until the entire response is ready. Streaming changes this by showing text as it generates, word by word.
Why streaming changes the user experience
Instead of waiting for the complete response, streaming shows text as the AI generates it. Watch the difference:
import litellm
from dotenv import load_dotenv
load_dotenv()
messages = [{"role": "user", "content": "What is Python in one sentence?"}]
response = litellm.completion(
    model="gpt-5",
    messages=messages,
    stream=True
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
Python is a high-level, interpreted, dynamically typed, general-purpose programming language known for its readability and vast ecosystem, used across web development, data science, automation, scripting, and more.
When you run the snippet above, the text appears gradually instead of all at once. Users see progress immediately, which feels much more responsive than staring at a loading spinner.
How streaming actually works
Streaming works at the chunk level: instead of one big response, you get many small pieces called deltas. Some chunks arrive without content, so check for it before printing:
response = litellm.completion(
    model="anthropic/claude-sonnet-4-20250514",
    messages=[{"role": "user", "content": "Define JavaScript in exactly two sentences"}],
    stream=True
)

for chunk in response:
    if hasattr(chunk.choices[0].delta, 'content') and chunk.choices[0].delta.content:
        print(f"'{chunk.choices[0].delta.content}'", end=" ")
'JavaScript is a high' '-level, interpreted programming language primarily used for creating' ' interactive and dynamic content on web pages, running' ' in web browsers to manip' 'ulate HTML elements, handle user events' ', and communicate with servers.' ' Originally designed for client-side web development, JavaScript has evolve' 'd into a versatile language that can also' ' be used for server-side development, mobile applications' ', and desktop software through various runtime' ' environments like Node.js.'
Making streaming reusable
For real applications, you’ll want to capture the complete response while still showing the streaming output. Here is a function to do this:
def stream_and_capture(model, messages):
    """Stream response while capturing the complete text."""
    response = litellm.completion(
        model=model,
        messages=messages,
        stream=True
    )
    full_response = ""
    for chunk in response:
        if chunk.choices[0].delta.content:
            content = chunk.choices[0].delta.content
            print(content, end="", flush=True)
            full_response += content
    print()  # New line after response
    return full_response
# Test it
messages = [{"role": "user", "content": "Define APIs in 2 sentences max"}]
result = stream_and_capture("gpt-5", messages)
APIs (Application Programming Interfaces) are standardized contracts, often exposed as endpoints, that specify how software components or services can communicate, including the allowed operations, inputs, and outputs. They let developers access functionality or data without knowing the underlying implementation, enabling modular, interoperable systems.
The end="" parameter prevents line breaks between chunks. flush=True forces immediate display instead of buffering.
Streaming with fallbacks
You can also combine streaming with the fallback system for better reliability:
response = litellm.completion(
    model="gpt-5",
    messages=[{"role": "user", "content": "Define machine learning in one sentence"}],
    fallbacks=["anthropic/claude-sonnet-4-20250514"],
    stream=True
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
Machine learning is a branch of artificial intelligence where algorithms learn patterns from data to make predictions or decisions without being explicitly programmed.
Immediate feedback plus automatic provider switching if the primary model fails.
Streaming gives users immediate feedback, but you still face another problem when building real applications. Once that text arrives, your code needs to work with it. You might extract data, store information in databases, or pass values to other functions. But parsing text responses is messy and breaks easily when the format changes.
Using Structured Outputs
Here’s what actually happens: You build an app that analyzes customer feedback. Your AI returns text like “The sentiment is positive with confidence 0.85 and top keywords: responsive, helpful, fast.” You write regex patterns to extract the confidence score and keywords. Everything works in testing.
Then the AI changes its response format slightly to “Sentiment: positive (confidence: 0.85). Keywords include responsive, helpful, and fast.” Your regex breaks. Your data pipeline stops working. Customer sentiment reports show empty values instead of scores. Your dashboard crashes. You spend hours fixing parsing code that worked yesterday.
Structured outputs skip this problem by returning data in predictable formats from the start.
Getting JSON instead of text
LiteLLM can return JSON instead of plain text. You tell it what structure you want through your prompt, and it comes back formatted correctly:
import litellm
from dotenv import load_dotenv
load_dotenv()
messages = [{"role": "user", "content": "Describe Python as a programming language. Return as JSON with fields: language, difficulty, description"}]
response = litellm.completion(
    model="gpt-5",
    messages=messages,
    response_format={"type": "json_object"}
)
print(response.choices[0].message.content)
{
  "language": "Python",
  "difficulty": "Beginner-friendly (easy to learn, moderate to master)",
  "description": "Python is a high-level, interpreted, dynamically typed, general-purpose programming language focused on readability and developer productivity. It supports multiple paradigms (object-oriented, functional, procedural), uses indentation for block structure, and includes a large standard library with an extensive third-party ecosystem (PyPI). Common uses include web development, data science, machine learning, automation/scripting, scientific computing, and DevOps. While not the fastest for CPU-bound tasks, it integrates well with C/C++ and accelerators, and runs cross-platform."
}
This structured approach eliminates common parsing problems. Instead of hunting through text, you get back valid JSON with exactly the fields you requested. No parsing headaches, no format surprises.
Working with the JSON data
A raw JSON string isn’t useful on its own. You need to turn it into Python objects your application can actually work with:
import json
messages = [{"role": "user", "content": "Describe JavaScript in 2 sentences. Return as JSON with fields: name, paradigm, use_cases (array of exactly 3 items)"}]
response = litellm.completion(
    model="anthropic/claude-sonnet-4-20250514",
    messages=messages,
    response_format={"type": "json_object"}
)
# Parse the JSON response
data = json.loads(response.choices[0].message.content)
print(f"Language: {data['name']}")
print(f"Paradigm: {data['paradigm']}")
print(f"Use cases: {', '.join(data['use_cases'])}")
Language: JavaScript
Paradigm: multi-paradigm (object-oriented, functional, event-driven)
Use cases: web development and browser scripting, server-side development with Node.js, mobile app development with frameworks like React Native
Now that you have structured data, you can pass it directly to other functions, store it in databases, or display it in your UI without any text manipulation. For data analysis workflows, tools like Pandas AI can work directly with this structured output to generate insights.
Adding type safety with Pydantic
JSON parsing catches syntax errors, but wrong data types or missing fields are different problems. Pydantic models solve this by validating your data structure and catching problems before they break your application:
from pydantic import BaseModel
import json
class ProgrammingLanguage(BaseModel):
    name: str
    paradigm: str
    use_cases: list[str]
messages = [{"role": "user", "content": "Describe CSS briefly. Return as JSON with fields: name, paradigm, use_cases (array of 3 items)"}]
response = litellm.completion(
    model="gpt-5",
    messages=messages,
    response_format={"type": "json_object"}
)
# Parse and validate with Pydantic
data = json.loads(response.choices[0].message.content)
language = ProgrammingLanguage(**data)
print(f"Validated data: {language.name} - {language.paradigm}")
print(f"Use cases: {language.use_cases}")
Validated data: CSS (Cascading Style Sheets) - Declarative, rule-based stylesheet language for describing the presentation of structured documents.
Use cases: ['Styling layout, colors, and typography of web pages', 'Responsive design across devices and viewports', 'Animations, transitions, and visual effects']
Pydantic automatically validates that name and paradigm are strings, and use_cases is a list of strings. If the AI returns the wrong types or forgets a required field, you'll get a clear error instead of mysterious bugs later.
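When validation does fail, Pydantic raises a ValidationError that pinpoints the offending fields — a small sketch of how you might catch it:
from pydantic import ValidationError

try:
    language = ProgrammingLanguage(**data)
except ValidationError as e:
    # Lists every field that is missing or has the wrong type
    print(f"Model returned an unexpected structure: {e}")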
Handling JSON errors
JSON parsing can fail if the AI returns malformed JSON or non-JSON text. Add basic error handling:
import json
messages = [{"role": "user", "content": "Summarize machine learning in 1 sentence. Return as JSON with topic and summary fields"}]
try:
    response = litellm.completion(
        model="anthropic/claude-sonnet-4-20250514",
        messages=messages,
        response_format={"type": "json_object"}
    )
    data = json.loads(response.choices[0].message.content)
    print(f"Topic: {data['topic']}")
    print(f"Summary: {data['summary']}")
except json.JSONDecodeError:
    print("Invalid JSON returned")
except Exception as e:
    print(f"Request failed: {e}")
Topic: machine learning
Summary: Machine learning is a subset of artificial intelligence that allows computers to learn and make predictions or decisions from data without being explicitly programmed for each specific task.
Structured outputs replace messy text parsing with clean data structures. You can combine JSON mode with fallbacks just like regular completions.
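For example, a request that asks for JSON but still falls back to Claude if GPT-5 fails might look like this (the prompt is just an illustration):
# JSON mode plus fallbacks, same pattern as before (prompt is illustrative)
response = litellm.completion(
    model="gpt-5",
    messages=[{"role": "user", "content": "Describe SQL briefly. Return as JSON with fields: name, paradigm, use_cases"}],
    response_format={"type": "json_object"},
    fallbacks=["anthropic/claude-sonnet-4-20250514"]
)
data = json.loads(response.choices[0].message.content)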
Clean, predictable data solves the parsing problem. But another problem hits when you move from prototype to production: the bills. Every API call costs money, and those costs can surprise you.
Cost Tracking
Here’s what happens: Your prototype works great in testing with 50 API calls per day. You’re spending about $2 per week. Everything looks fine. You launch to production and get 500 users. Now you’re making 5,000 API calls daily. Your monthly bill jumps from $8 to $800.
Without cost tracking, you had no warning. You can’t tell which features cost the most, which models are more expensive, or how to budget for growth. You discover the bill spike weeks later when you get the credit card statement.
LiteLLM tracks exactly what you spend across all providers.
Understanding token usage
Every AI response includes detailed usage information that shows exactly what you’re being charged for:
import litellm
from dotenv import load_dotenv
load_dotenv()
messages = [{"role": "user", "content": "Explain what a REST API is in two sentences"}]
response = litellm.completion(
    model="gpt-5",
    messages=messages
)
print(f"Response: {response.choices[0].message.content}")
print(f"Prompt tokens: {response.usage.prompt_tokens}")
print(f"Completion tokens: {response.usage.completion_tokens}")
print(f"Total tokens: {response.usage.total_tokens}")
Response: A REST API is a web service that follows the Representational State Transfer architectural style, exposing resources identified by URLs and manipulated with HTTP methods like GET, POST, PUT, and DELETE. It uses stateless client-server communication, cacheability, and uniform representations (often JSON) to support scalable, decoupled interactions.
Prompt tokens: 15
Completion tokens: 201
Total tokens: 216
Prompt tokens cover your input and completion tokens cover the AI’s response. Different providers charge different rates for each type.
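You can also count prompt tokens before sending anything with litellm.token_counter, which is handy for estimating request size up front — a minimal sketch:
# Count tokens locally before making the call
prompt_tokens = litellm.token_counter(model="gpt-5", messages=messages)
print(f"This prompt uses about {prompt_tokens} input tokens")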
Automatic cost calculation
Raw token counts don’t tell you the dollar amount. LiteLLM calculates the actual cost for you:
# Get the cost for a completed response
cost = litellm.completion_cost(completion_response=response)
print(f"Total cost: ${cost:.4f}")
Total cost: $0.0020
LiteLLM’s pricing database converts tokens to dollars based on your model.
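You can inspect that database directly through litellm.model_cost, a dictionary of per-token rates — a sketch (exact keys vary by model and may change as pricing is updated):
# Look up per-token pricing for a model (keys may vary by model/version)
pricing = litellm.model_cost.get("claude-sonnet-4-20250514", {})
print(f"Input: ${pricing.get('input_cost_per_token', 0)} per token")
print(f"Output: ${pricing.get('output_cost_per_token', 0)} per token")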
Manual cost calculations
Sometimes you want to estimate costs before making a call, or calculate costs for text you already have:
# Calculate cost without making an API call
prompt_text = "Explain machine learning"
completion_text = "Machine learning is a method of data analysis that automates analytical model building."
estimated_cost = litellm.completion_cost(
    model="anthropic/claude-sonnet-4-20250514",
    prompt=prompt_text,
    completion=completion_text
)
print(f"Estimated cost: ${estimated_cost:.4f}")
Estimated cost: $0.0002
This is useful for budgeting and comparing models.
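That makes it easy to compare what the same exchange would cost on different models, for example:
# Compare estimated costs for the same text across the two tutorial models
for model in ["gpt-5", "anthropic/claude-sonnet-4-20250514"]:
    cost = litellm.completion_cost(
        model=model,
        prompt=prompt_text,
        completion=completion_text
    )
    print(f"{model}: ${cost:.4f}")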
Tracking multiple API calls
For applications that make many API calls, you can track cumulative costs:
total_cost = 0.0

conversation = [
    "What is Python?",
    "How do you install Python packages?",
    "What are virtual environments?"
]

for i, question in enumerate(conversation):
    response = litellm.completion(
        model="gpt-5",
        messages=[{"role": "user", "content": question}]
    )
    call_cost = litellm.completion_cost(completion_response=response)
    total_cost += call_cost
    print(f"Call {i+1}: ${call_cost:.4f}")

print(f"Total cost: ${total_cost:.4f}")
Call 1: $0.0057
Call 2: $0.0134
Call 3: $0.0102
Total cost: $0.0292
Cost tracking helps you make informed decisions about which models to use and avoid surprise bills: you see exactly what each API call costs and can compare providers directly.
Conclusion
You now have the tools to build AI applications without vendor lock-in. You’ve learned basic API calls, automatic fallbacks, streaming responses, structured outputs, and cost tracking.
This gives you the freedom to choose the best models for each task. Switch providers easily, set up automatic fallbacks, and track costs. When better models come along, you can adopt them without rewriting code.

I am a data science content creator with over 2 years of experience and one of the largest followings on Medium. I like to write detailed articles on AI and ML with a bit of a sarcastic style because you've got to do something to make them a bit less dull. I have produced over 130 articles and a DataCamp course to boot, with another one in the making. My content has been seen by over 5 million pairs of eyes, 20k of whom became followers on both Medium and LinkedIn.


