Switching between different LLM providers shouldn’t require rewriting your entire codebase. But that’s exactly what happens when you work with multiple AI services today. Each provider has its own SDK, authentication method, and response format. LiteLLM fixes this with a unified interface.
In this tutorial, I’ll walk through the main parts of LiteLLM, starting with basic API calls and working up to streaming responses, structured outputs, and cost tracking.
What Is LiteLLM?
LiteLLM is an open-source Python library that works as a universal translator for AI models. Instead of learning different APIs for each provider, you use one interface that connects to over 100 LLM services.
The framework has two main parts. The Python SDK lets you write code once and run it with any provider. The proxy server acts as a central gateway for teams and companies that need to manage AI services at scale.
LiteLLM gives you these additional benefits:
- Built-in cost tracking: See spending across all providers in one dashboard instead of checking multiple billing systems
- Automatic failovers: When one provider goes down or hits rate limits, LiteLLM automatically tries your backup options
- Self-hosting option: Run everything on your own servers if you need data privacy or compliance
Note that OpenRouter provides similar unified access to multiple LLMs, though with different pricing and feature sets.
The proxy server adds spending limits per team, usage tracking across projects, and centralized API key management. All configuration happens through simple YAML files.
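For reference, here’s a minimal sketch of what such a config might look like (the model aliases and environment variable references are placeholders; check the proxy docs for the full schema):
# config.yaml - minimal sketch of a LiteLLM proxy configuration (placeholder values)
model_list:
  - model_name: gpt-5                         # alias your apps will request
    litellm_params:
      model: openai/gpt-5
      api_key: os.environ/OPENAI_API_KEY      # read from an environment variable
  - model_name: claude-sonnet-4
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY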
LiteLLM works well for different types of users:
- Individual developers who want to test multiple models without setting up accounts everywhere
- Small teams that need cost visibility and don’t want vendor lock-in
- Large companies that require centralized management, budget controls, and detailed usage analytics
The library is MIT-licensed and completely open-source. You can inspect the code, modify it for your needs, or contribute back to the project.
Now that you know what LiteLLM can do, here’s how to set up your environment.
Prerequisites
Before you can start using LiteLLM, you need a few things set up. Most of this is probably already done if you’ve worked with AI models before.
Python environment
You need Python 3.7 or newer. Most systems today have this, but you can check your version with:
import sys
print(f"Python version: {sys.version_info.major}.{sys.version_info.minor}")
Package installation
Install LiteLLM and python-dotenv for managing environment variables:
uv add litellm python-dotenv
# or with pip: pip install litellm python-dotenv
API keys
You’ll need API keys from the providers you want to use. For this tutorial, we’ll work with two models:
- OpenAI API key: You can get a key from platform.openai.com (for GPT-5)
- Anthropic API key: You can get a key from console.anthropic.com (for Claude Sonnet 4)
Store these in a .env file in your project directory:
OPENAI_API_KEY=your_openai_key_here
ANTHROPIC_API_KEY=your_anthropic_key_here
Environment check
Run this script to check your setup:
import os
import sys
from dotenv import load_dotenv
# Load environment variables
load_dotenv()
# Check Python version
print(f"Python version: {sys.version_info.major}.{sys.version_info.minor}")
# Check LiteLLM
try:
    import litellm
    print("✓ LiteLLM installed")
except ImportError:
    print("✗ LiteLLM not installed")

# Check API keys
openai_key = os.getenv("OPENAI_API_KEY")
anthropic_key = os.getenv("ANTHROPIC_API_KEY")

if openai_key:
    print("✓ OpenAI API key found")
else:
    print("✗ OpenAI API key not found")

if anthropic_key:
    print("✓ Anthropic API key found")
else:
    print("✗ Anthropic API key not found")
Python version: 3.11
✓ LiteLLM installed
✓ OpenAI API key found
✓ Anthropic API key found
If you see all green checkmarks, you’re ready to go. If any checks fail, make sure you’ve installed the packages and added your API keys to the .env file.
Making Your First API Call With LiteLLM
LiteLLM uses the same interface you might know from OpenAI, but it works with any provider.
Basic setup
Start with these imports:
import litellm
from dotenv import load_dotenv
# Load your API keys
load_dotenv()
A basic request to GPT-5 follows this pattern: define your messages, call litellm.completion(), and get the response.
messages = [
    {"role": "user", "content": "Write a simple Python function that adds two numbers"}
]

response = litellm.completion(
    model="gpt-5",
    messages=messages
)
print(response.choices[0].message.content)
def add(a, b):
    """Return the sum of a and b."""
    return a + b
You just called GPT-5 through LiteLLM. The response object follows OpenAI’s format, so if you’ve used their API before, this will look familiar. You might want to inspect what model actually responded and check token usage:
print(f"Model used: {response.model}")
print(f"Response ID: {response.id}")
print(f"Total tokens: {response.usage.total_tokens}")
Model used: gpt-5-2025-08-07
Response ID: chatcmpl-CIbZWUDE1amW2vHtHDmieTX6hnZUq
Total tokens: 557
Switching providers
LiteLLM’s real power shows up when switching providers. Want to try Claude Sonnet 4 instead? Change one parameter:
response = litellm.completion(
    model="anthropic/claude-sonnet-4-20250514",
    messages=messages
)
print(response.choices[0].message.content)
Here's a simple Python function that adds two numbers:
def add_numbers(a, b):
    """
    Adds two numbers and returns the result.

    Args:
        a: First number
        b: Second number

    Returns:
        The sum of a and b
    """
    return a + b

# Example usage:
result = add_numbers(5, 3)
print(result)  # Output: 8
This function:
- Takes two parameters a and b
- Returns their sum using the + operator
- Works with integers, floats, and even negative numbers
- Includes a docstring to explain what the function does
Same code structure, different provider. Claude gave a much more detailed response with documentation and examples, while GPT-5 was more concise.
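If you want to compare several providers side by side, you can loop over model names with the same messages list — a quick sketch (the model list here is just an example):
# Run the same prompt against multiple providers (model list is an example)
for model in ["gpt-5", "anthropic/claude-sonnet-4-20250514"]:
    response = litellm.completion(model=model, messages=messages)
    print(f"\n--- {model} ---")
    print(response.choices[0].message.content)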
Basic error handling
Add some error handling for when things go wrong:
try:
    response = litellm.completion(
        model="gpt-5",
        messages=messages
    )
    print(response.choices[0].message.content)
except Exception as e:
    print(f"API call failed: {e}")
This pattern catches common issues like invalid API keys, network problems, or model unavailability.
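If you want to react differently to each failure mode, LiteLLM maps provider errors to OpenAI-style exception classes that you can catch individually. Here’s a sketch, assuming the standard names from LiteLLM’s exception mapping:
# Handle specific failure modes (assumes LiteLLM's OpenAI-style exception classes)
try:
    response = litellm.completion(model="gpt-5", messages=messages)
    print(response.choices[0].message.content)
except litellm.AuthenticationError:
    print("Check your API key")
except litellm.RateLimitError:
    print("Rate limited - wait or switch providers")
except litellm.APIConnectionError:
    print("Network problem - try again later")
except Exception as e:
    print(f"API call failed: {e}")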
What’s next
You now know the basic pattern for using LiteLLM with any provider. The Getting Started documentation covers more details about supported models and parameters, and the Completion function reference shows all available options.
For building production applications, you’ll also want to follow API design best practices to create reliable, high-performance systems.
Provider Switching and Fallbacks
LiteLLM’s fallback system automatically handles provider failures so your users never see error messages. When you’re building real applications, you need backup plans that work without any extra code on your part.
Basic fallback setup
To set up a fallback, add a fallbacks list to your completion call:
messages = [
    {"role": "user", "content": "Explain what a Python decorator is in one sentence"}
]

response = litellm.completion(
    model="gpt-5",
    messages=messages,
    fallbacks=["anthropic/claude-sonnet-4-20250514"]
)
print(f"Response: {response.choices[0].message.content}")
print(f"Model used: {response.model}")
Response: A Python decorator is a callable that wraps another function or class to augment its behavior and returns it, applied with the @ syntax so you can add reusable functionality without changing the original code.
Model used: gpt-5-2025-08-07
If GPT-5 fails, LiteLLM automatically tries Claude Sonnet 4. Users get a response either way.
Building fallback chains
You can chain multiple providers for better reliability. Using the same messages from before:
response = litellm.completion(
    model="gpt-5",
    messages=messages,
    fallbacks=["anthropic/claude-sonnet-4-20250514", "gpt-3.5-turbo"]
)
print(f"Model used: {response.model}")
Model used: gpt-5-2025-08-07
LiteLLM tries models in order. If the primary fails, it tries the first fallback. If that fails, it tries the second. Multiple layers of protection.
Adding retries
Combine fallbacks with retries for even better reliability:
response = litellm.completion(
    model="gpt-5",
    messages=messages,  # Same messages as before
    num_retries=2
)
Retries the same model twice before moving to fallbacks. Good for temporary network issues or rate limit spikes.
Fallbacks handle most failures automatically, but you should still catch the case where every model fails — for example, inside a small helper function:
def ask_with_fallbacks(messages):
    """Return a response, retrying and falling back before giving up."""
    try:
        response = litellm.completion(
            model="gpt-5",
            messages=messages,  # Same messages as before
            fallbacks=["anthropic/claude-sonnet-4-20250514"],
            num_retries=2
        )
        return response.choices[0].message.content
    except Exception:
        # All models failed - handle gracefully
        return "Sorry, I'm having trouble right now. Please try again."
Even when providers work perfectly, users still face another problem: frustrating wait time. Someone clicks submit in your app and sees nothing for several seconds. They wonder if something broke.
Streaming for Real-Time Responses
Most AI API responses arrive as one complete block, like downloading a file. Users wait with no feedback until the entire response is ready. Streaming changes this by showing text as it generates, word by word.
Why streaming changes the user experience
Instead of waiting for the complete response, streaming shows text as the AI generates it. Watch the difference:
import litellm
from dotenv import load_dotenv
load_dotenv()
messages = [{"role": "user", "content": "What is Python in one sentence?"}]
response = litellm.completion(
    model="gpt-5",
    messages=messages,
    stream=True
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
Python is a high-level, interpreted, dynamically typed, general-purpose programming language known for its readability and vast ecosystem, used across web development, data science, automation, scripting, and more.
When you run the snippet above, the text appears gradually instead of all at once. Users see progress immediately, which feels much more responsive than staring at a loading spinner.
How streaming actually works
Streaming works at the chunk level: instead of one big response, you get many small pieces called deltas. Some chunks arrive without content, so check for it before printing:
response = litellm.completion(
    model="anthropic/claude-sonnet-4-20250514",
    messages=[{"role": "user", "content": "Define JavaScript in exactly two sentences"}],
    stream=True
)

for chunk in response:
    if hasattr(chunk.choices[0].delta, 'content') and chunk.choices[0].delta.content:
        print(f"'{chunk.choices[0].delta.content}'", end=" ")
'JavaScript is a high' '-level, interpreted programming language primarily used for creating' ' interactive and dynamic content on web pages, running' ' in web browsers to manip' 'ulate HTML elements, handle user events' ', and communicate with servers.' ' Originally designed for client-side web development, JavaScript has evolve' 'd into a versatile language that can also' ' be used for server-side development, mobile applications' ', and desktop software through various runtime' ' environments like Node.js.'
Making streaming reusable
For real applications, you’ll want to capture the complete response while still showing the streaming output. Here is a function to do this:
def stream_and_capture(model, messages):
    """Stream response while capturing the complete text."""
    response = litellm.completion(
        model=model,
        messages=messages,
        stream=True
    )
    full_response = ""
    for chunk in response:
        if chunk.choices[0].delta.content:
            content = chunk.choices[0].delta.content
            print(content, end="", flush=True)
            full_response += content
    print()  # New line after response
    return full_response
# Test it
messages = [{"role": "user", "content": "Define APIs in 2 sentences max"}]
result = stream_and_capture("gpt-5", messages)
APIs (Application Programming Interfaces) are standardized contracts, often exposed as endpoints, that specify how software components or services can communicate, including the allowed operations, inputs, and outputs. They let developers access functionality or data without knowing the underlying implementation, enabling modular, interoperable systems.
The end="" parameter prevents line breaks between chunks. flush=True forces immediate display instead of buffering.
Streaming with fallbacks
You can also combine streaming with the fallback system for better reliability:
response = litellm.completion(
    model="gpt-5",
    messages=[{"role": "user", "content": "Define machine learning in one sentence"}],
    fallbacks=["anthropic/claude-sonnet-4-20250514"],
    stream=True
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
Machine learning is a branch of artificial intelligence where algorithms learn patterns from data to make predictions or decisions without being explicitly programmed.
Immediate feedback plus automatic provider switching if the primary model fails.
Streaming gives users immediate feedback, but you still face another problem when building real applications. Once that text arrives, your code needs to work with it. You might extract data, store information in databases, or pass values to other functions. But parsing text responses is messy and breaks easily when the format changes.
Using Structured Outputs
Here’s what actually happens: You build an app that analyzes customer feedback. Your AI returns text like “The sentiment is positive with confidence 0.85 and top keywords: responsive, helpful, fast.” You write regex patterns to extract the confidence score and keywords. Everything works in testing.
Then the AI changes its response format slightly to “Sentiment: positive (confidence: 0.85). Keywords include responsive, helpful, and fast.” Your regex breaks. Your data pipeline stops working. Customer sentiment reports show empty values instead of scores. Your dashboard crashes. You spend hours fixing parsing code that worked yesterday.
Structured outputs skip this problem by returning data in predictable formats from the start.
Getting JSON instead of text
LiteLLM can return JSON instead of plain text. You tell it what structure you want through your prompt, and it comes back formatted correctly:
import litellm
from dotenv import load_dotenv
load_dotenv()
messages = [{"role": "user", "content": "Describe Python as a programming language. Return as JSON with fields: language, difficulty, description"}]
response = litellm.completion(
    model="gpt-5",
    messages=messages,
    response_format={"type": "json_object"}
)
print(response.choices[0].message.content)
{
  "language": "Python",
  "difficulty": "Beginner-friendly (easy to learn, moderate to master)",
  "description": "Python is a high-level, interpreted, dynamically typed, general-purpose programming language focused on readability and developer productivity. It supports multiple paradigms (object-oriented, functional, procedural), uses indentation for block structure, and includes a large standard library with an extensive third-party ecosystem (PyPI). Common uses include web development, data science, machine learning, automation/scripting, scientific computing, and DevOps. While not the fastest for CPU-bound tasks, it integrates well with C/C++ and accelerators, and runs cross-platform."
}
This structured approach eliminates common parsing problems. Instead of hunting through text, you get back valid JSON with exactly the fields you requested. No parsing headaches, no format surprises.
Working with the JSON data
A raw JSON string isn’t useful on its own. You need to turn it into Python objects your application can actually work with:
import json
messages = [{"role": "user", "content": "Describe JavaScript in 2 sentences. Return as JSON with fields: name, paradigm, use_cases (array of exactly 3 items)"}]
response = litellm.completion(
    model="anthropic/claude-sonnet-4-20250514",
    messages=messages,
    response_format={"type": "json_object"}
)
# Parse the JSON response
data = json.loads(response.choices[0].message.content)
print(f"Language: {data['name']}")
print(f"Paradigm: {data['paradigm']}")
print(f"Use cases: {', '.join(data['use_cases'])}")
Language: JavaScript
Paradigm: multi-paradigm (object-oriented, functional, event-driven)
Use cases: web development and browser scripting, server-side development with Node.js, mobile app development with frameworks like React Native
Now that you have structured data, you can pass it directly to other functions, store it in databases, or display it in your UI without any text manipulation. For data analysis workflows, tools like Pandas AI can work directly with this structured output to generate insights.
Adding type safety with Pydantic
JSON parsing catches syntax errors, but wrong data types or missing fields are different problems. Pydantic models solve this by validating your data structure and catching problems before they break your application:
from pydantic import BaseModel
import json
class ProgrammingLanguage(BaseModel):
    name: str
    paradigm: str
    use_cases: list[str]
messages = [{"role": "user", "content": "Describe CSS briefly. Return as JSON with fields: name, paradigm, use_cases (array of 3 items)"}]
response = litellm.completion(
    model="gpt-5",
    messages=messages,
    response_format={"type": "json_object"}
)
# Parse and validate with Pydantic
data = json.loads(response.choices[0].message.content)
language = ProgrammingLanguage(**data)
print(f"Validated data: {language.name} - {language.paradigm}")
print(f"Use cases: {language.use_cases}")
Validated data: CSS (Cascading Style Sheets) - Declarative, rule-based stylesheet language for describing the presentation of structured documents.
Use cases: ['Styling layout, colors, and typography of web pages', 'Responsive design across devices and viewports', 'Animations, transitions, and visual effects']
Pydantic automatically validates that name and paradigm are strings, and use_cases is a list of strings. If the AI returns the wrong types or forgets a required field, you'll get a clear error instead of mysterious bugs later.
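When validation does fail, Pydantic raises a ValidationError that pinpoints the offending fields — a small sketch of how you might catch it:
from pydantic import ValidationError

try:
    language = ProgrammingLanguage(**data)
except ValidationError as e:
    # Lists every field that is missing or has the wrong type
    print(f"Model returned an unexpected structure: {e}")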
Handling JSON errors
JSON parsing can fail if the AI returns malformed JSON or non-JSON text. Add basic error handling:
import json
messages = [{"role": "user", "content": "Summarize machine learning in 1 sentence. Return as JSON with topic and summary fields"}]
try:
    response = litellm.completion(
        model="anthropic/claude-sonnet-4-20250514",
        messages=messages,
        response_format={"type": "json_object"}
    )
    data = json.loads(response.choices[0].message.content)
    print(f"Topic: {data['topic']}")
    print(f"Summary: {data['summary']}")
except json.JSONDecodeError:
    print("Invalid JSON returned")
except Exception as e:
    print(f"Request failed: {e}")
Topic: machine learning
Summary: Machine learning is a subset of artificial intelligence that allows computers to learn and make predictions or decisions from data without being explicitly programmed for each specific task.
Structured outputs replace messy text parsing with clean data structures. You can combine JSON mode with fallbacks just like regular completions.
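For example, a request that asks for JSON but still falls back to Claude if GPT-5 fails might look like this (the prompt is just an illustration):
# JSON mode plus fallbacks, same pattern as before (prompt is illustrative)
response = litellm.completion(
    model="gpt-5",
    messages=[{"role": "user", "content": "Describe SQL briefly. Return as JSON with fields: name, paradigm, use_cases"}],
    response_format={"type": "json_object"},
    fallbacks=["anthropic/claude-sonnet-4-20250514"]
)
data = json.loads(response.choices[0].message.content)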
Clean, predictable data solves the parsing problem. But another problem hits when you move from prototype to production: the bills. Every API call costs money, and those costs can surprise you.
Cost Tracking
Here’s what happens: Your prototype works great in testing with 50 API calls per day. You’re spending about $2 per week. Everything looks fine. You launch to production and get 500 users. Now you’re making 5,000 API calls daily. Your monthly bill jumps from $8 to $800.
Without cost tracking, you had no warning. You can’t tell which features cost the most, which models are more expensive, or how to budget for growth. You discover the bill spike weeks later when you get the credit card statement.
LiteLLM tracks exactly what you spend across all providers.
Understanding token usage
Every AI response includes detailed usage information that shows exactly what you’re being charged for:
import litellm
from dotenv import load_dotenv
load_dotenv()
messages = [{"role": "user", "content": "Explain what a REST API is in two sentences"}]
response = litellm.completion(
    model="gpt-5",
    messages=messages
)
print(f"Response: {response.choices[0].message.content}")
print(f"Prompt tokens: {response.usage.prompt_tokens}")
print(f"Completion tokens: {response.usage.completion_tokens}")
print(f"Total tokens: {response.usage.total_tokens}")
Response: A REST API is a web service that follows the Representational State Transfer architectural style, exposing resources identified by URLs and manipulated with HTTP methods like GET, POST, PUT, and DELETE. It uses stateless client-server communication, cacheability, and uniform representations (often JSON) to support scalable, decoupled interactions.
Prompt tokens: 15
Completion tokens: 201
Total tokens: 216
Prompt tokens cover your input and completion tokens cover the AI’s response. Different providers charge different rates for each type.
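You can also count prompt tokens before sending anything with litellm.token_counter, which is handy for estimating request size up front — a minimal sketch:
# Count tokens locally before making the call
prompt_tokens = litellm.token_counter(model="gpt-5", messages=messages)
print(f"This prompt uses about {prompt_tokens} input tokens")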
Automatic cost calculation
Raw token counts don’t tell you the dollar amount. LiteLLM calculates the actual cost for you:
# Get the cost for a completed response
cost = litellm.completion_cost(completion_response=response)
print(f"Total cost: ${cost:.4f}")
Total cost: $0.0020
LiteLLM’s pricing database converts tokens to dollars based on your model.
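You can inspect that database directly through litellm.model_cost, a dictionary of per-token rates — a sketch (exact keys vary by model and may change as pricing is updated):
# Look up per-token pricing for a model (keys may vary by model/version)
pricing = litellm.model_cost.get("claude-sonnet-4-20250514", {})
print(f"Input: ${pricing.get('input_cost_per_token', 0)} per token")
print(f"Output: ${pricing.get('output_cost_per_token', 0)} per token")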
Manual cost calculations
Sometimes you want to estimate costs before making a call, or calculate costs for text you already have:
# Calculate cost without making an API call
prompt_text = "Explain machine learning"
completion_text = "Machine learning is a method of data analysis that automates analytical model building."
estimated_cost = litellm.completion_cost(
    model="anthropic/claude-sonnet-4-20250514",
    prompt=prompt_text,
    completion=completion_text
)
print(f"Estimated cost: ${estimated_cost:.4f}")
Estimated cost: $0.0002
This is useful for budgeting and comparing models.
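That makes it easy to compare what the same exchange would cost on different models, for example:
# Compare estimated costs for the same text across the two tutorial models
for model in ["gpt-5", "anthropic/claude-sonnet-4-20250514"]:
    cost = litellm.completion_cost(
        model=model,
        prompt=prompt_text,
        completion=completion_text
    )
    print(f"{model}: ${cost:.4f}")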
Tracking multiple API calls
For applications that make many API calls, you can track cumulative costs:
total_cost = 0.0

conversation = [
    "What is Python?",
    "How do you install Python packages?",
    "What are virtual environments?"
]

for i, question in enumerate(conversation):
    response = litellm.completion(
        model="gpt-5",
        messages=[{"role": "user", "content": question}]
    )
    call_cost = litellm.completion_cost(completion_response=response)
    total_cost += call_cost
    print(f"Call {i+1}: ${call_cost:.4f}")

print(f"Total cost: ${total_cost:.4f}")
Call 1: $0.0057
Call 2: $0.0134
Call 3: $0.0102
Total cost: $0.0292
Cost tracking helps you make informed decisions about which models to use and avoid surprise bills: you see exactly what each API call costs and can compare providers directly.
Conclusion
You now have the tools to build AI applications without vendor lock-in. You’ve learned basic API calls, automatic fallbacks, streaming responses, structured outputs, and cost tracking.
This gives you the freedom to choose the best models for each task. Switch providers easily, set up automatic fallbacks, and track costs. When better models come along, you can adopt them without rewriting code.

I am a data science content creator with over 2 years of experience and one of the largest followings on Medium. I like to write detailed articles on AI and ML with a bit of a sarcastic style because you've got to do something to make them a bit less dull. I have produced over 130 articles and a DataCamp course to boot, with another one in the making. My content has been seen by over 5 million pairs of eyes, 20k of whom became followers on both Medium and LinkedIn.


