Cloud LLM APIs are powerful, but they come with trade-offs: usage-based pricing, rate limits, and the constant uncertainty around where your data is being processed. For developers working with sensitive data or experimenting heavily, these constraints can quickly become friction.
This is where local-first approaches stand out. The Ollama Python library removes that friction by allowing you to run large language models locally while interacting with them using clean, Python-native code. This gives you full control over performance, cost, and privacy.
In this article, I will walk you through the complete Ollama Python library API, from simple text generation with generate() to tool calling and vision models.
I also recommend checking out our other recent Ollama tutorials:
- Gemma 4 Tutorial: Building a Local AI Coding Agent with Gradio and Ollama
- Qwen 3.5 Small Models Tutorial: Build a Video-to-Game Generator with Ollama
- Using OpenClaw with Ollama: Building a Local Data Analyst
- Using Claude Code with Ollama Local Models
Prerequisites to Run Ollama with Python
Before getting started, ensure you have the following setup on your device:
- Python 3.8 or higher
- Ollama downloaded from its website, installed, and running (ollama serve)
- At least one model pulled (e.g., ollama pull llama3.2)

These prerequisites matter because the Python SDK is only a client; the actual inference happens in the Ollama runtime. If the runtime is unavailable or no suitable model is present, calls will fail.
You may also consider using Docker with Ollama for version consistency.
What Is the Ollama Python Library?

The Ollama Python library is the official SDK that wraps the Ollama REST API into a simple, Pythonic interface. In other words, it turns low-level HTTP requests and JSON payloads into high-level Python functions so you can focus on intent rather than transport details.
As your application grows, this abstraction removes repetitive request construction, standardizes how responses are handled, and centralizes error handling in one place.
For comparison, a raw request might look like this:
import requests
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.2",
        "prompt": "Explain recursion"
    }
)
This works, but it quickly becomes verbose and error-prone. With the SDK, the same task becomes:
import ollama
response = ollama.generate(
    model='llama3.2',
    prompt='Explain recursion'
)
How the library communicates with the Ollama server
Under the hood, each SDK call becomes an HTTP request to the Ollama server at http://localhost:11434. Your Python script acts as a client, while the Ollama runtime acts as a server that hosts and executes models.
This separation is important because it allows the model to run as a dedicated service, making resource management (CPU/GPU) more efficient and enabling multiple applications to share the same model instance.
If you need to connect to a different machine, you can configure a custom client:
from ollama import Client
client = Client(host='http://remote-server:11434')
response = client.generate(model='llama3.2', prompt='Hello')
Installing and configuring the library
Installation is straightforward and requires minimal dependencies:
pip install ollama
After installation, it’s good practice to verify connectivity by listing available models.
This helps you confirm that your Python environment, SDK, and Ollama runtime are all correctly connected.
To do that, run the following:
import ollama
print(ollama.list())
Generating Text Using generate()
The generate() function is designed for stateless tasks, meaning each request is handled independently without any memory of previous interactions. This makes it ideal for tasks like summarization, rewriting, or code generation.
Because there is no retained context, the quality of the output depends entirely on how clearly the prompt is written.
Basic text generation
The following example demonstrates the simplest workflow: send a prompt, receive a response, and extract the generated text.
import ollama
response = ollama.generate(
    model='llama3.2',
    prompt='Write a Python docstring for a function that calculates factorial'
)
print(response['response'])
The response also includes metadata such as execution time and token counts, which are useful when optimizing performance.
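As a sketch of how you might use that metadata: the `eval_count` (number of generated tokens) and `eval_duration` (nanoseconds) fields below follow the Ollama REST API response format, so verify them against your version. The `tokens_per_second` helper is hypothetical, not part of the SDK, and the response dict here is hardcoded for illustration.

```python
def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Convert the response's token count and nanosecond duration to tokens/sec."""
    return eval_count / (eval_duration_ns / 1e9)

# Values shaped like a real generate() response, hardcoded for illustration:
metadata = {'eval_count': 120, 'eval_duration': 2_000_000_000}
print(tokens_per_second(metadata['eval_count'], metadata['eval_duration']))  # 60.0
```

Tracking this number across runs is a quick way to compare models or quantizations on your hardware.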
Customizing output with parameters
Generation behavior can be adjusted using sampling parameters, which control how the model selects tokens.
Lower temperature values produce more deterministic outputs, while higher values introduce more variability. You can use parameters like top_p and num_predict to further refine output diversity and length.
Here are some important parameters you can use:
| Parameter | What It Controls | How It Affects Output | When to Use |
| --- | --- | --- | --- |
| temperature | Randomness of token selection | Lower = more predictable, higher = more creative/random | Use low (0.1–0.3) for factual tasks, higher (0.7–1.0) for creative writing |
| top_p | Nucleus sampling (probability mass cutoff) | Model only considers tokens within top cumulative probability p | Use to limit weird outputs while keeping some diversity |
| top_k | Limits the number of candidate tokens | Model picks from the top k most likely tokens only | Useful for tighter control in structured outputs |
| num_predict | Maximum tokens to generate | Controls the length of the response | Increase for long explanations, reduce for concise answers |
Here’s an example of the use of top_p, temperature, and num_predict parameters:
response = ollama.generate(
    model='llama3.2',
    prompt='Explain machine learning in one paragraph',
    options={
        'temperature': 0.2,
        'top_p': 0.9,
        'num_predict': 100
    }
)
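To build intuition for what top_p does, here is a toy sketch of nucleus filtering over a made-up token distribution. This illustrates the idea only; it is not Ollama's actual sampling code.

```python
def nucleus_filter(probs: dict, top_p: float) -> list:
    """Keep the smallest set of tokens whose cumulative probability reaches top_p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, total = [], 0.0
    for token, p in ranked:
        kept.append(token)
        total += p
        if total >= top_p:
            break
    return kept

# With top_p=0.9, the low-probability token 'd' is excluded from sampling:
print(nucleus_filter({'a': 0.5, 'b': 0.3, 'c': 0.15, 'd': 0.05}, 0.9))  # ['a', 'b', 'c']
```

Lowering top_p shrinks the candidate set further, which is why it tames rare, off-topic tokens while keeping some diversity among the likely ones.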
Building Conversations Using chat()
Unlike generate(), the chat() API supports stateful interactions by working with a sequence of messages. This allows the model to maintain context across multiple turns.
Each message includes a role, such as user, assistant, or system, which helps structure the conversation.
Single-turn chat requests
Even a single-turn interaction uses the message format, which lays the foundation for more complex conversations.
response = ollama.chat(
    model='llama3.2',
    messages=[
        {'role': 'user', 'content': 'Explain Python decorators'}
    ]
)
print(response['message']['content'])
Maintaining multi-turn context
To maintain context, you explicitly store and resend the full conversation history with each request. This gives you complete control over what the model remembers.
messages = [
    {'role': 'user', 'content': 'What is recursion?'}
]
response = ollama.chat(model='llama3.2', messages=messages)
messages.append(response['message'])
messages.append({'role': 'user', 'content': 'Give an example in Python'})
response = ollama.chat(model='llama3.2', messages=messages)
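Because you resend the full history on every call, long conversations eventually exceed the model's context window. A simple mitigation is trimming old turns while always preserving the system prompt. The `trim_history` helper below is a hypothetical sketch, not part of the SDK.

```python
def trim_history(messages: list, max_turns: int = 10) -> list:
    """Keep any system messages plus only the most recent max_turns other messages."""
    system = [m for m in messages if m['role'] == 'system']
    rest = [m for m in messages if m['role'] != 'system']
    return system + rest[-max_turns:]

# A long, hardcoded conversation: one system prompt plus 20 user turns.
history = [{'role': 'system', 'content': 'Be concise.'}] + [
    {'role': 'user', 'content': f'question {i}'} for i in range(20)
]
print(len(trim_history(history, max_turns=10)))  # 11: system prompt + last 10 turns
```

You would call `trim_history(messages)` right before each `ollama.chat()` call so the request stays within budget without losing the instructions that shape behavior.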
Using system prompts to shape behavior
A system prompt is used to define the model’s behavior upfront, such as tone, constraints, or role.
messages = [
    {'role': 'system', 'content': 'You are a strict Python code reviewer.'},
    {'role': 'user', 'content': 'Review this code: def add(a,b): return a+b'}
]
response = ollama.chat(model='llama3.2', messages=messages)
Streaming and Async Support in the Ollama Python Library
For interactive applications, responsiveness is just as important as correctness. Ollama supports both streaming and asynchronous execution to improve performance and user experience.
Streaming responses in real time
Streaming allows you to process output incrementally as it is generated, rather than waiting for the full response.
for chunk in ollama.chat(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'Write a story'}],
    stream=True
):
    print(chunk['message']['content'], end='', flush=True)
Using AsyncClient for async applications
Asynchronous execution allows your application to handle multiple requests concurrently without blocking. You’ll need to use the asyncio Python library to implement this.
Let’s look at an example below:
import asyncio
from ollama import AsyncClient
async def main():
    client = AsyncClient()
    async for chunk in await client.chat(
        model='llama3.2',
        messages=[{'role': 'user', 'content': 'Explain async programming'}],
        stream=True
    ):
        print(chunk['message']['content'], end='')
asyncio.run(main())
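The same client can also fan out several non-streaming requests concurrently with asyncio.gather. In the sketch below, a stub coroutine (`fake_chat`) stands in for `client.chat()` so the concurrency pattern is visible on its own; in real code you would await the AsyncClient call instead.

```python
import asyncio

async def fake_chat(prompt: str) -> str:
    # Stand-in for `await client.chat(...)`; swap in a real AsyncClient call.
    await asyncio.sleep(0)
    return f'answer to {prompt}'

async def run_all(prompts):
    # gather() schedules every request at once and waits for all results,
    # so total latency approaches that of the slowest single request.
    return await asyncio.gather(*(fake_chat(p) for p in prompts))

results = asyncio.run(run_all(['What is Python?', 'What is Go?']))
print(results)
```

Note that concurrent requests still share one Ollama runtime, so throughput is ultimately bounded by your hardware, not by the client.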
Managing Ollama Models From Python
The Ollama SDK also provides tools for managing models programmatically, which is especially useful in automated environments.
Listing and inspecting local models
You can retrieve the list of available models and inspect their properties, such as size and configuration.
models = ollama.list()
print(models)
info = ollama.show('llama3.2')
print(info)
Pulling and deleting models programmatically
Models can be downloaded or removed directly from Python, making it easier to manage dependencies dynamically.
ollama.pull('llama3.2')
ollama.delete('llama3.2')
Generating and Using Embeddings with the Ollama Python Library
Embeddings represent text as numerical vectors that capture semantic meaning. This allows you to compare texts based on similarity rather than exact wording.
Creating text embeddings
The following example converts text into a vector representation that can be used for search or clustering.
response = ollama.embed(
    model='nomic-embed-text',
    input='Ollama is a local LLM runtime'
)
embedding = response['embeddings'][0]
Building a basic similarity search
Once embeddings are generated, similarity can be measured using cosine similarity, which compares the angle between vectors.
Here’s a simple example of the search function:
import numpy as np
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
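Putting it together, you can rank documents by similarity to a query. In a real pipeline the vectors would come from ollama.embed(); here they are tiny hardcoded toy vectors so the ranking logic is easy to follow.

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy 3-dimensional "embeddings" standing in for real model output:
docs = {
    'cats': np.array([1.0, 0.1, 0.0]),
    'dogs': np.array([0.9, 0.2, 0.1]),
    'stocks': np.array([0.0, 0.1, 1.0]),
}
query = np.array([1.0, 0.0, 0.0])  # pretend this is the embedded search query

# Sort document names by similarity to the query, most similar first:
ranked = sorted(docs, key=lambda name: cosine_similarity(query, docs[name]), reverse=True)
print(ranked)  # ['cats', 'dogs', 'stocks']
```

With real embeddings the same three lines of ranking code power semantic search: embed the corpus once, embed each query, and sort.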
Tool Calling and Structured Output in the Ollama Python Library
To build more advanced applications, models often need to interact with external functions or return structured data.
Implementing tool calling with Python functions
Tool calling allows the model to invoke predefined Python functions based on user intent.
Let’s create a function that uses such tools:
def get_weather(city: str) -> str:
    """Get current weather for a city"""
    return f"Weather in {city} is sunny"

response = ollama.chat(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'What is the weather in Paris?'}],
    tools=[get_weather]
)
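Note that the model never runs your function itself: it returns a tool_calls entry in the message, and your code executes it. Below is a minimal dispatch sketch; the message shape mirrors the SDK's dict form of a tool-calling response (hardcoded here for illustration), so verify it against your version.

```python
def get_weather(city: str) -> str:
    """Get current weather for a city"""
    return f"Weather in {city} is sunny"

# Registry of functions the model is allowed to call:
available = {'get_weather': get_weather}

def dispatch(message: dict) -> list:
    """Run each tool call the model requested and collect the results."""
    results = []
    for call in message.get('tool_calls', []):
        fn = available[call['function']['name']]
        results.append(fn(**call['function']['arguments']))
    return results

# A hardcoded message shaped like a tool-calling chat() response:
message = {'tool_calls': [{'function': {'name': 'get_weather',
                                        'arguments': {'city': 'Paris'}}}]}
print(dispatch(message))  # ['Weather in Paris is sunny']
```

In a full agent loop, you would append each tool result back to the conversation as a `tool` role message and call chat() again so the model can phrase the final answer.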
Getting structured JSON responses
Structured outputs ensure that responses are returned in a consistent, machine-readable format such as JSON.
response = ollama.chat(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'Review: Great product, 5 stars!'}],
    format='json'
)
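With format='json', the payload still arrives as a string in the message content, so you parse it yourself. The sketch below uses a hardcoded string shaped like such a response (the keys are illustrative, since the model chooses them unless you constrain the schema).

```python
import json

# Stand-in for response['message']['content'] when format='json' is used:
content = '{"sentiment": "positive", "stars": 5}'

data = json.loads(content)
print(data['sentiment'], data['stars'])  # positive 5
```

Wrapping the `json.loads` call in a try/except for `json.JSONDecodeError` is worthwhile, since smaller models occasionally emit malformed JSON.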
Advanced: Vision Models and Ollama Cloud in Python
Ollama supports multimodal models and cloud-based inference for more advanced use cases.
Sending images to vision models
Vision models can process both text and images, enabling tasks like image description and visual analysis.
response = ollama.chat(
    model='llama3.2-vision',
    messages=[{
        'role': 'user',
        'content': 'Describe this image',
        'images': ['image.jpg']
    }]
)
Running cloud models from Python
For larger models that cannot run locally, Ollama Cloud provides hosted inference.
First, authenticate from your terminal:

ollama signin

Then reference a cloud model by its tag from Python:

ollama.chat(model='deepseek-v3.1:671b-cloud', messages=[...])

Alternatively, connect with an API key through a custom client:

from ollama import Client

client = Client(
    host='https://ollama.com',
    headers={'Authorization': 'Bearer YOUR_API_KEY'}
)
Error Handling of the Most Common Ollama Python Pitfalls
When building real applications, handling errors explicitly helps prevent silent failures and improves reliability.
Handling ResponseError exceptions
The Ollama SDK raises structured exceptions for server-side errors, allowing you to inspect what went wrong.
import ollama
try:
    ollama.generate(model='unknown', prompt='test')
except ollama.ResponseError as e:
    print(e.status_code, e.error)
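A common preventive guard is verifying that a model is present before calling it. The helper below inspects a dict shaped like an ollama.list() result (hardcoded here; newer SDK versions return typed objects, so check the exact shape on your version). `has_model` is a hypothetical helper, not part of the SDK.

```python
def has_model(listing: dict, name: str) -> bool:
    """Return True if any local model matches the given name, ignoring the :tag suffix."""
    return any(m.get('name', '').split(':')[0] == name.split(':')[0]
               for m in listing.get('models', []))

# Hardcoded listing shaped like an ollama.list() result:
listing = {'models': [{'name': 'llama3.2:latest'}, {'name': 'nomic-embed-text:latest'}]}
print(has_model(listing, 'llama3.2'))  # True
print(has_model(listing, 'mistral'))   # False
```

Running this check at startup lets you fail fast with a clear message (or trigger a pull) instead of surfacing a ResponseError mid-request.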
Debugging connection and model issues
Common issues include the server not running, missing models, insufficient memory, or context limits being exceeded.
- Server not running: Start with ollama serve
- Model not found: Run ollama pull
- Out of memory: Use smaller models or quantization
- Context issues: Adjust num_ctx
num_ctx controls the maximum number of tokens the model can “see” at once, including:
- your prompt
- system instructions
- conversation history
- retrieved documents (RAG)
- and the model’s own generated tokens
Managing this parameter will help prevent the LLM from truncating earlier content (usually from the beginning) or losing important instructions or data silently.
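Like the sampling parameters, num_ctx is passed through options, e.g. `ollama.generate(..., options={'num_ctx': 8192})`. A rough rule of thumb is about 4 characters per token for English text, which you can use to sanity-check whether a prompt will fit. The helpers below are a heuristic sketch, not a real tokenizer.

```python
def rough_token_estimate(text: str) -> int:
    """Heuristic: roughly 4 characters per token for English text."""
    return max(1, len(text) // 4)

def fits_context(text: str, num_ctx: int, reply_budget: int = 512) -> bool:
    """Check whether the prompt plus a reply budget fits within num_ctx tokens."""
    return rough_token_estimate(text) + reply_budget <= num_ctx

prompt = 'word ' * 2000  # ~10,000 characters -> ~2,500 estimated tokens
print(fits_context(prompt, num_ctx=2048))  # False: raise num_ctx or trim the input
print(fits_context(prompt, num_ctx=4096))  # True
```

When the check fails, either raise num_ctx (at the cost of more memory) or trim the oldest conversation turns and retrieved documents before sending the request.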
Final Thoughts
The Ollama Python library provides a complete interface for working with local and cloud LLMs, from simple text generation to advanced capabilities like embeddings, tool calling, and multimodal inputs. LLMs become a local service you can script against, test, and scale like any other component in your stack.
In my experience, Ollama is a valuable option to have alongside cloud LLMs: it lets me use open-source models more freely, and it makes switching between models easy. If that kind of flexibility matters to you, Ollama is a good gateway to all of it.
If you want to deepen your skills, I recommend taking our Developing LLM Applications with LangChain course or pursuing the Associate AI Engineer for Developers certification.
Ollama Python Library FAQs
Do I need a powerful GPU to use Ollama with Python?
Not necessarily. Ollama can run on a CPU, but performance will be slower compared to using a GPU. Many smaller or quantized models are designed to run efficiently on standard laptops. If you are just getting started or experimenting, a CPU is usually sufficient. For heavier workloads or larger models, a GPU will significantly improve speed and responsiveness.
What is the difference between running models locally and using Ollama Cloud?
Running models locally means everything happens on your own machine, which gives you full control over data privacy and eliminates usage costs. Ollama Cloud, on the other hand, allows you to access much larger models that your local hardware may not support.
When should I use generate() vs chat()?
Use generate() for simple, one-off tasks like summarizing text or generating code. It is straightforward and does not require managing conversation history. Use chat() when you need context across multiple interactions, such as building a chatbot or assistant.
What are embeddings, and why are they useful?
Embeddings convert text into numerical vectors that represent meaning. This allows you to compare different pieces of text based on similarity rather than exact wording. They are commonly used in search systems, recommendation engines, and retrieval-augmented generation (RAG).
How do I handle errors when using the Ollama Python library?
Most errors come from simple issues, such as the Ollama server not running or a model not being available locally. The library raises structured exceptions like ResponseError, which you can catch using try/except blocks.

I'm Austin, a blogger and tech writer with years of experience both as a data scientist and a data analyst in healthcare. Starting my tech journey with a background in biology, I now help others make the same transition through my tech blog. My passion for technology has led me to my writing contributions to dozens of SaaS companies, inspiring others and sharing my experiences.
