So you've trained a model in PyTorch, but your production system runs TensorFlow. Do you need to rewrite everything from scratch?
No, you don’t.
Model interoperability can cause headaches for machine learning engineers. Different frameworks use different formats, and converting between them can break your model or introduce unexpected behavior. You don’t want that, but you also don’t want to maintain multiple versions of the same model.
ONNX (Open Neural Network Exchange) solves this by providing a universal format for machine learning models. It lets you train in one framework and deploy in another without rewriting code or introducing conversion bugs.
In this article, I’ll show you how to convert models to ONNX format, run inference with ONNX Runtime, optimize models for production, and deploy them across various platforms, from edge devices to cloud servers.
Getting Started with ONNX
You can't use ONNX without setting up your development environment first.
In this section, I’ll walk you through everything you need - from software requirements to creating a clean, reproducible workspace. I'll show you how to install ONNX and ONNX Runtime on Windows, Linux, and macOS.
Prerequisites
ONNX works on any major operating system.
You need Python 3.8 or higher installed on your machine. That's it for the basics. ONNX ships wheels built against Python's stable ABI (abi3), which means you can use pre-built binary packages without compiling from source.
Here's what you'll install:
- uv (grab it from astral.sh/uv)
- Python 3.8+ (uv handles this for you)
- pip (comes with uv)
If you want to learn more about how uv works and why it should be your go-to choice, read our guide to the fastest Python package manager.
For Windows users, you can install uv using PowerShell:
powershell -c "irm https://astral.sh/uv/install.ps1 | iex"
If you get an execution policy error, run PowerShell as administrator first.
Linux users can install uv with this shell command:
curl -LsSf https://astral.sh/uv/install.sh | sh
This works on any Linux distribution.
Finally, macOS users can install uv using Homebrew, or by using a curl command:
brew install uv
# or
curl -LsSf https://astral.sh/uv/install.sh | sh
Setting up the environment
Here's how to set up a project with uv:
mkdir onnx-project
cd onnx-project
# Initialize a uv project with Python 3.13
uv init --python 3.13
uv creates a pyproject.toml file, and the first time you add a package or run a command it also creates a .venv virtual environment. That environment activates automatically when you run commands through uv.

Now install ONNX and ONNX Runtime:
uv add onnx onnxruntime
# Verify the installation
uv run python -c "import onnx; print(onnx.__version__)"
uv run python -c "import onnxruntime; print(onnxruntime.__version__)"
The verification commands should print version numbers:

Your dependencies are already pinned. uv automatically writes all package versions to uv.lock. This makes your builds reproducible - anyone can recreate your exact environment by running uv sync.
Here's what your pyproject.toml looks like:
[project]
name = "onnx-project"
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.13"
dependencies = [
"onnx>=1.19.1",
"onnxruntime>=1.23.2",
]
And that’s it, the environment is configured! Let’s proceed with the fundamentals.
Understanding ONNX Basics
You need to understand how ONNX works before you start converting models.
In this section, I’ll break down the ONNX format and explain its graph-based architecture. I'll show you what's inside an ONNX file and how data flows through it.
The ONNX model format
An ONNX model is a file that contains your neural network's structure and weights.
Think of it as a blueprint. The file describes what operations to perform, in what order, and with what parameters. Your trained weights are stored alongside this blueprint, so the model is ready to run without additional files.
ONNX launched in 2017 as a collaboration between Microsoft and Facebook (now Meta). The goal was simple - stop rewriting models every time you switch frameworks. PyTorch, TensorFlow, and other frameworks kept improving, but models couldn't move between them without manual conversion.
ONNX changed that. Version 1.0 supported basic operations for computer vision and simple neural networks. Today's ONNX supports transformers, large language models, and complex architectures that didn't exist in 2017.
ONNX uses Protocol Buffers (protobuf) to serialize this data. Protobuf is a binary format developed by Google for efficient data storage and transmission. It's much faster to read and write than JSON or XML, and the files are smaller. Win-win.
Here's what's inside an ONNX model:
- Graph: The network architecture
- Weights: Trained model parameters
- Metadata: Version info, producer name, and model documentation
- Opset version: Which operations are available
The opset version matters. ONNX evolves over time, and it continues adding new operations and improving existing ones. Your model file specifies which opset version it uses, so the runtime knows how to execute it.
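If you're curious what this metadata looks like in practice, here's a short snippet that loads a saved model and prints its producer, IR version, opset, and node count. It assumes a simple_model.onnx file exists on disk - you'll create one in the next subsection:
import onnx

# Load a saved ONNX model and inspect its metadata
# (assumes simple_model.onnx exists - you'll create it in the next subsection)
model = onnx.load("simple_model.onnx")

print("Producer:", model.producer_name)          # Which tool exported the model
print("IR version:", model.ir_version)           # ONNX file-format version
for opset in model.opset_import:
    print("Opset:", opset.domain or "ai.onnx", opset.version)
print("Nodes in graph:", len(model.graph.node))  # Number of operations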
Key concepts and architecture
ONNX represents your model as a computational graph.
A graph has nodes and edges. Nodes are operations (like matrix multiplication or activation functions), and edges are tensors (your data) flowing between those operations. This is how ONNX describes "multiply these matrices, then apply ReLU, then multiply again."
Here's a simple example:
import onnx
from onnx import helper, TensorProto

# Create input and output tensors (Relu preserves the input shape)
input_tensor = helper.make_tensor_value_info(
    "input", TensorProto.FLOAT, [1, 3, 224, 224]
)
output_tensor = helper.make_tensor_value_info(
    "output", TensorProto.FLOAT, [1, 3, 224, 224]
)

# Create a node (operation)
node = helper.make_node(
    "Relu",      # Operation type
    ["input"],   # Input edges
    ["output"],  # Output edges
)

# Create the graph
graph = helper.make_graph(
    [node],            # List of nodes
    "simple_model",    # Graph name
    [input_tensor],    # Inputs
    [output_tensor],   # Outputs
)

# Create the model
model = helper.make_model(graph)
onnx.save(model, "simple_model.onnx")

Each node performs one operation. The operation type (like Relu, Conv, or MatMul) comes from the ONNX operator set. You can't just make up operation names - they must exist in the opset version you're using.
Edges connect nodes and carry tensors. A tensor has a shape and a data type. When you define [1, 3, 224, 224], you're saying "this tensor has 4 dimensions with these sizes." The runtime uses this information to allocate memory and validate the graph.
The graph flows in one direction. Data enters through input nodes, passes through operations, and exits through output nodes. No cycles allowed - ONNX doesn't support recurrent connections directly. You need to unroll loops or use specific operations designed for sequences.
This graph structure enables framework interoperability. PyTorch thinks in terms of eager execution, TensorFlow uses static graphs, and scikit-learn has a different abstraction entirely. But they can all export to the same graph format.
The graph also enables shared optimization. An ONNX optimizer can fuse operations (combine multiple nodes into one), eliminate dead code (remove unused nodes), or replace slow operations with faster equivalents. These optimizations work regardless of which framework created the model.
Up next, let me show you the ins and outs of converting models to ONNX format.
Converting Models to ONNX Format
In this section, I’ll show you how to export models from PyTorch, TensorFlow, and scikit-learn. I'll walk you through the conversion process and show you how to validate that your model works correctly after conversion.
Framework support and tools
ONNX supports the most popular machine learning frameworks.
PyTorch has built-in ONNX export through torch.onnx.export(). It's the most mature converter - Meta co-created ONNX and has shipped the exporter with PyTorch since its early releases.
TensorFlow uses tf2onnx for conversion. It's a separate library that converts TensorFlow graphs to ONNX format. Install it with uv add tf2onnx, and you're ready to go.
scikit-learn converts through skl2onnx. This library handles traditional machine learning models like random forests, linear regression, and SVMs. Deep learning frameworks get more attention, but scikit-learn models work just as well.
These three are the most popular, but you can use any of these other supported frameworks - just install the dependency via uv add <name>:
- Keras: Use tf2onnx (Keras is part of TensorFlow)
- XGBoost: Use onnxmltools
- LightGBM: Use onnxmltools
- MATLAB: Built-in ONNX export
- Paddle: Use paddle2onnx
Each framework has its own converter because they represent models differently. PyTorch uses dynamic computation graphs, TensorFlow uses static graphs, and scikit-learn doesn't use graphs at all. The converters translate these representations into ONNX's graph format.
You can also use models trained by others. The ONNX Model Zoo hosts pre-trained models ready to use. You'll find computer vision models (ResNet, YOLO, EfficientNet), NLP models (BERT, GPT-2), and more. Download them from the official GitHub repository.
Hugging Face also hosts ONNX models. Search for models with "onnx" in the name or filter by ONNX format. Many popular transformer models are available in ONNX format already optimized for inference.
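As a quick illustration, here's one way to pull an ONNX file from the Hugging Face Hub with the huggingface_hub package (uv add huggingface_hub). The repository ID and filename below are placeholders - check the model card of the model you want for its actual ONNX file path:
from huggingface_hub import hf_hub_download

# Download an ONNX file from a Hugging Face repository.
# "some-org/some-onnx-model" and "model.onnx" are placeholders -
# replace them with a real repo ID and the filename listed on its model card.
model_path = hf_hub_download(
    repo_id="some-org/some-onnx-model",
    filename="model.onnx",
)
print("Model downloaded to:", model_path)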
Before you proceed, run this command to fetch all required libraries with uv:
uv add torch torchvision tensorflow tf2onnx scikit-learn skl2onnx onnxscript
Step-by-step conversion process
Here's how to convert a PyTorch model to ONNX:
import torch
import torch.nn as nn

# Define a simple model
class SimpleModel(nn.Module):
    def __init__(self):
        super(SimpleModel, self).__init__()
        self.fc = nn.Linear(10, 5)

    def forward(self, x):
        return self.fc(x)

# Create and prepare the model
model = SimpleModel()
model.eval()  # Set to evaluation mode

# Create dummy input with the same shape as your real data
dummy_input = torch.randn(1, 10)

# Export to ONNX
torch.onnx.export(
    model,                                      # Model to export
    dummy_input,                                # Example input
    "simple_model.onnx",                        # Output file
    input_names=["input"],                      # Name for the input
    output_names=["output"],                    # Name for the output
    dynamic_shapes={"x": {0: "batch_size"}},    # Allow variable batch size
)
New to PyTorch? Don’t let that hold you back. Our Deep Learning with PyTorch course covers the fundamentals in hours.
The dummy input matters. ONNX needs to know the input shape to build the graph. If your model accepts variable-length sequences or different batch sizes, use dynamic_shapes to mark those dimensions as dynamic.
Always set your model to evaluation mode with model.eval() before export. This disables dropout and batch normalization training behavior. If you forget this step, your converted model won't match the original.
Dynamic shapes cause problems. If your converter complains about unknown dimensions, make sure you specify dynamic_shapes correctly or provide concrete shapes.
Now switching to TensorFlow, here's the equivalent code you can run:
import numpy as np

# Temporary compatibility fixes for tf2onnx
if not hasattr(np, "object"):
    np.object = object
if not hasattr(np, "cast"):
    np.cast = lambda dtype: np.asarray

import tensorflow as tf
import tf2onnx

# Define a simple Keras model (same spirit as the PyTorch one)
inputs = tf.keras.Input(shape=(10,), name="input")
outputs = tf.keras.layers.Dense(5, name="output")(inputs)
model = tf.keras.Model(inputs, outputs)

# Save the model (optional, if you want a .h5 file)
model.save("simple_tf_model.h5")

# Define a dynamic input spec (None = dynamic batch dimension)
spec = (tf.TensorSpec((None, 10), tf.float32, name="input"),)

# Convert to ONNX
model_proto, _ = tf2onnx.convert.from_keras(
    model, input_signature=spec, output_path="simple_tf_model.onnx"
)
If you’ve never worked with TensorFlow, we recommend that you take our course, Introduction to TensorFlow in Python.
Unfortunately, tf2onnx currently conflicts with the latest NumPy release, so I added a temporary fix. By the time you read this, you can hopefully run the snippet without the first six lines.
And finally, let’s see ONNX conversion for scikit-learn models:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType

# Load a small dataset with 4 features (matches the input shape below)
X_train, y_train = load_iris(return_X_y=True)

# Train a simple model
model = RandomForestClassifier(n_estimators=10)
model.fit(X_train, y_train)

# Define input type and shape
initial_type = [("float_input", FloatTensorType([None, 4]))]

# Convert to ONNX
onnx_model = convert_sklearn(model, initial_types=initial_type)

# Save the model
with open("sklearn_model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())
Similar to TensorFlow, DataCamp also has a free Supervised Learning with scikit-learn course - check it out if you find the snippet above confusing.
And that’s it!
You can now run this snippet to validate your converted model:
import numpy as np
import onnx
import onnxruntime as ort
import torch

# Here, `model` is the PyTorch SimpleModel exported earlier in this section

# Load and check the ONNX model
onnx_model = onnx.load("simple_model.onnx")
onnx.checker.check_model(onnx_model)

# Run inference with ONNX Runtime
session = ort.InferenceSession("simple_model.onnx")
input_name = session.get_inputs()[0].name

# Create test input
test_input = np.random.randn(1, 10).astype(np.float32)

# Get prediction
onnx_output = session.run(None, {input_name: test_input})

# Compare with original model
with torch.no_grad():
    original_output = model(torch.from_numpy(test_input))

# Check if outputs match (within floating point tolerance)
np.testing.assert_allclose(
    original_output.numpy(),
    onnx_output[0],
    rtol=1e-3,
    atol=1e-5,
)
print("Model conversion successful - outputs match!")

Run this validation every time you convert a model. Small numerical differences are normal (floating point math isn't exact), but large differences mean something went wrong during conversion.
To finish things off, here are two important caveats you must be aware of:
- Custom operations won't convert: If you wrote custom PyTorch operations or TensorFlow layers, the converter doesn't know how to translate them. You need to either rewrite them using standard operations or register custom ONNX operators.
- Version mismatches break things: Your framework version, converter version, and ONNX version must be compatible. Check the converter's documentation for supported version combinations (a quick version check is shown below).
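When you're chasing down a version mismatch, start by printing what you actually have installed - a minimal sketch for the PyTorch path:
import onnx
import onnxruntime
import torch

# Print the versions involved in a PyTorch -> ONNX conversion
# so you can compare them against the converter's compatibility matrix
print("torch:", torch.__version__)
print("onnx:", onnx.__version__)
print("onnxruntime:", onnxruntime.__version__)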
Up next, let’s discuss ONNX Runtime.
Running Inference with ONNX Runtime
In this section, I’ll show you how to use ONNX Runtime to load models and make predictions. I'll walk you through hardware acceleration options and show you where ONNX Runtime shines in production.
Introduction to ONNX Runtime
ONNX Runtime is a cross-platform inference engine. Microsoft developed it to run ONNX models fast on any hardware - CPUs, GPUs, mobile chips, and specialized AI accelerators.
Here's how it works: You load an ONNX model, and ONNX Runtime builds an execution plan. It analyzes the computational graph, applies optimizations, and figures out the best way to execute operations on your hardware. Then it runs inference using the optimized plan.
The architecture separates the execution logic from hardware-specific code. Execution providers handle the hardware interface. Want to run on NVIDIA GPUs? Use the CUDA execution provider. Want Apple Silicon? Use the CoreML execution provider. Either way, your code stays the same.
ONNX Runtime has these execution providers:
- CPUExecutionProvider: Default provider, works everywhere
- CUDAExecutionProvider: NVIDIA GPUs with CUDA
- TensorRTExecutionProvider: NVIDIA GPUs with TensorRT optimization
- CoreMLExecutionProvider: Apple devices (macOS, iOS)
- DmlExecutionProvider: DirectML for Windows (works with any GPU)
- OpenVINOExecutionProvider: Intel CPUs and GPUs
- ROCMExecutionProvider: AMD GPUs
Each provider translates ONNX operations into hardware-specific instructions. The CPU provider uses standard CPU operations. The CUDA provider uses cuDNN and cuBLAS. The TensorRT provider compiles your model into optimized GPU kernels, and so on.
Setting up and executing a model
You can load an ONNX model with three lines of code:
import onnxruntime as ort
# Create an inference session
session = ort.InferenceSession("simple_model.onnx")
# Get input and output names
input_name = session.get_inputs()[0].name
output_name = session.get_outputs()[0].name
The InferenceSession does all the heavy work. It loads the model, validates the graph, and prepares for inference. By default, it uses the CPU execution provider.
Now run inference:
import numpy as np
import onnxruntime as ort

# Create an inference session
session = ort.InferenceSession("simple_model.onnx")

# Get input and output names
input_name = session.get_inputs()[0].name
output_name = session.get_outputs()[0].name

# Prepare input data
input_data = np.random.randn(1, 10).astype(np.float32)

# Run inference
outputs = session.run(
    [output_name],
    {input_name: input_data},
)

# Get predictions
predictions = outputs[0]
print(predictions.shape)

The run() method takes two arguments. First, a list of output names (or None for all outputs). Second, a dictionary mapping input names to numpy arrays.
Input shapes must match. If your model expects [1, 10] and you pass [1, 10, 3], inference will fail.
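When you're unsure what a model expects, ask the session instead of guessing - this small sketch prints each input's name, shape, and type before you build the input array:
import onnxruntime as ort

session = ort.InferenceSession("simple_model.onnx")

# Inspect what the model expects before building your input array
for model_input in session.get_inputs():
    print("Name:", model_input.name)    # e.g. "input"
    print("Shape:", model_input.shape)  # e.g. [1, 10] or ["batch_size", 10]
    print("Type:", model_input.type)    # e.g. "tensor(float)"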
If you want GPU acceleration, just specify execution providers when creating the session:
# For NVIDIA GPUs with CUDA
session = ort.InferenceSession(
    "model.onnx",
    providers=['CUDAExecutionProvider', 'CPUExecutionProvider']
)

# For Apple Silicon
session = ort.InferenceSession(
    "model.onnx",
    providers=['CoreMLExecutionProvider', 'CPUExecutionProvider']
)

# For Intel hardware
session = ort.InferenceSession(
    "model.onnx",
    providers=['OpenVINOExecutionProvider', 'CPUExecutionProvider']
)
List providers in priority order. ONNX Runtime tries the first provider, and falls back to the next if it's not available. Always include CPUExecutionProvider as the last fallback.
You can check which provider is actually being used:
import numpy as np
import onnxruntime as ort

# Create an inference session
session = ort.InferenceSession(
    "simple_model.onnx", providers=["CoreMLExecutionProvider", "CPUExecutionProvider"]
)

# Get input and output names
input_name = session.get_inputs()[0].name
output_name = session.get_outputs()[0].name

# Prepare input data
input_data = np.random.randn(1, 10).astype(np.float32)

# Run inference
outputs = session.run(
    [output_name],
    {input_name: input_data},
)

# Get predictions
predictions = outputs[0]
print(predictions.shape)

print("Providers:")
print(session.get_providers())

If you see ["CPUExecutionProvider"] when you expect a GPU, the GPU provider isn't available. Common reasons are missing drivers, the wrong ONNX Runtime package (CUDA support requires onnxruntime-gpu), or an incompatible GPU.
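You can also ask your ONNX Runtime installation which providers it was built with, before even creating a session - a minimal check:
import onnxruntime as ort

# Providers compiled into the installed ONNX Runtime package
print(ort.get_available_providers())

# If CUDAExecutionProvider is missing here, installing drivers won't help -
# you need the GPU-enabled package (onnxruntime-gpu) instead of onnxruntime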
The configuration doesn’t stop here. You can further optimize the execution provider to get better performance:
providers = [
    (
        "TensorRTExecutionProvider",
        {
            "device_id": 0,
            "trt_max_workspace_size": 2147483648,  # 2GB
            "trt_fp16_enable": True,  # Use FP16 precision
        },
    ),
    "CUDAExecutionProvider",
    "CPUExecutionProvider",
]

session = ort.InferenceSession("simple_model.onnx", providers=providers)
You can refer to the official ONNX Runtime documentation to get a list of all available options for the provider of your choice.
ONNX Runtime applications
The impressive thing about ONNX Runtime is that it runs everywhere - servers, browsers, and mobile devices.
ONNX Runtime Web brings inference to browsers using WebAssembly and WebGL. Your model runs client-side without sending data to servers. This works great for privacy-sensitive applications like medical image analysis or document processing.
Here's a simple example of using ONNX Runtime Web in a plain HTML file:
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8" />
  <meta name="viewport" content="width=device-width, initial-scale=1.0" />
  <title>ONNXRuntime Web Demo</title>
</head>
<body>
  <h1>ONNXRuntime Web Demo</h1>

  <!-- Load ONNXRuntime-Web from CDN -->
  <script src="https://cdn.jsdelivr.net/npm/onnxruntime-web/dist/ort.min.js"></script>

  <script>
    async function runModel() {
      // Create inference session
      const session = await ort.InferenceSession.create('./simple_model.onnx');

      // Prepare input
      const inputData = new Float32Array(10).fill(0).map(() => Math.random());
      const tensor = new ort.Tensor('float32', inputData, [1, 10]);

      // Run inference
      const results = await session.run({ input: tensor });
      const output = results.output;

      console.log('Output shape:', output.dims);
      console.log('Output values:', output.data);
    }

    runModel();
  </script>
</body>
</html>

For mobile deployment, you can use ONNX Runtime Mobile - a lightweight version optimized for iOS and Android. It strips out unnecessary features and reduces binary size.
This is an example iOS configuration:
import onnxruntime_objc
let modelPath = Bundle.main.path(forResource: "simple_model", ofType: "onnx")!
let session = try ORTSession(modelPath: modelPath)
let inputData = // Your input as Data
let inputTensor = try ORTValue(tensorData: inputData,
elementType: .float,
shape: [1, 10])
let outputs = try session.run(withInputs: ["input": inputTensor])
Or, if you’re using Python to build a REST API, you can include ONNX Runtime in one of your endpoints by following this approach:
from fastapi import FastAPI
import onnxruntime as ort
import numpy as np

app = FastAPI()
session = ort.InferenceSession("simple_model.onnx")

@app.post("/predict")
async def predict(data: dict):
    input_data = np.array(data["input"]).astype(np.float32)
    outputs = session.run(None, {"input": input_data})
    return {"prediction": outputs[0].tolist()}
And that’s it! Other environments will use a similar approach, but you get the gist - ONNX Runtime is easy to use. Up next, let’s discuss optimization.
Optimizing ONNX Models
In this section, I’ll demonstrate how to make ONNX models faster and smaller through quantization and graph optimizations. I'll walk you through the techniques that matter most for real-world deployments.
Quantization techniques
Quantization reduces your model's precision to make it faster and smaller.
Instead of storing weights as 32-bit floats, you store them as 8-bit integers. This cuts memory usage by 75% and speeds up inference because integer math is faster than floating point math on most hardware.
Dynamic quantization converts weights to lower precision but keeps activations (intermediate values during inference) in full precision. It's the easiest quantization method because you don't need calibration data.
Here's how to apply dynamic quantization:
from onnxruntime.quantization import quantize_dynamic, QuantType

# Quantize the model
quantize_dynamic(
    model_input="model.onnx",
    model_output="model_quantized.onnx",
    weight_type=QuantType.QUInt8,  # 8-bit unsigned integers
)
That's it. Your quantized model now uses 8-bit unsigned integers and is ready to use.
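To see the size reduction for yourself, compare the files on disk - a small sketch assuming both model.onnx and model_quantized.onnx from the snippet above exist:
import os

# Compare the original and quantized model sizes on disk
original_mb = os.path.getsize("model.onnx") / 1e6
quantized_mb = os.path.getsize("model_quantized.onnx") / 1e6

print(f"Original:  {original_mb:.2f} MB")
print(f"Quantized: {quantized_mb:.2f} MB")
print(f"Reduction: {100 * (1 - quantized_mb / original_mb):.1f}%")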
Static quantization converts both weights and activations to lower precision. It's more aggressive and faster than dynamic quantization, but you need representative calibration data to measure activation ranges.
Here’s a general strategy for applying static quantization:
import numpy as np
from onnxruntime.quantization import quantize_static, CalibrationDataReader

# Create a calibration data reader
class DataReader(CalibrationDataReader):
    def __init__(self, calibration_data):
        self.data = calibration_data
        self.index = 0

    def get_next(self):
        if self.index >= len(self.data):
            return None
        batch = {"input": self.data[self.index]}
        self.index += 1
        return batch

# Load calibration data (100-1000 samples from your dataset)
calibration_data = [np.random.randn(1, 10).astype(np.float32) for _ in range(100)]
data_reader = DataReader(calibration_data)

# Quantize
quantize_static(
    model_input="model.onnx",
    model_output="model_static_quantized.onnx",
    calibration_data_reader=data_reader,
)
The calibration data should represent your actual inference workload, so don’t use random data as I’ve shown above. Use real samples from your dataset.
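For example, if you keep a few hundred preprocessed samples in a NumPy file, you can feed them through the same DataReader class defined above - calibration_samples.npy is a hypothetical file name here:
import numpy as np

# Load real, preprocessed samples saved earlier (hypothetical file name).
# Expected shape: (num_samples, 10) to match the model input.
samples = np.load("calibration_samples.npy").astype(np.float32)

# Wrap each sample as a single-item batch for the DataReader defined above
calibration_data = [sample.reshape(1, -1) for sample in samples]
data_reader = DataReader(calibration_data)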
In general, quantization has benefits and trade-offs you need to know as a machine learning engineer. Here are the benefits:
- 4x smaller models: INT8 uses one-quarter the memory of FP32
- 2-4x faster inference: Integer operations run faster on CPUs
- Lower power consumption: Less computation means less energy
- Better cache utilization: Smaller models fit in CPU cache
And here are the trade-offs:
- Slight accuracy loss: Usually 1-2% for INT8, acceptable for most applications
- Not all operations quantize well: Batch normalization and softmax are tricky
- Hardware dependency: GPU speedups vary, some GPUs don't accelerate INT8
Quantization also shines for large language models and generative AI. A 7B parameter model in FP32 takes 28GB of memory. Quantized to INT8, it drops to 7GB. Quantized to INT4, it's 3.5GB - small enough to run on consumer hardware.
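The arithmetic is simple enough to check yourself - parameter count times bytes per parameter:
# Rough memory footprint of model weights: parameters x bytes per parameter
params = 7e9  # 7B parameter model

for name, bytes_per_param in [("FP32", 4), ("FP16", 2), ("INT8", 1), ("INT4", 0.5)]:
    gigabytes = params * bytes_per_param / 1e9
    print(f"{name}: {gigabytes:.1f} GB")

# Prints: FP32: 28.0 GB, FP16: 14.0 GB, INT8: 7.0 GB, INT4: 3.5 GB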
Graph optimizations
Graph optimizations rewrite your model's computational graph to make it faster.
The concept of node fusion combines multiple operations into a single operation. If your model has a Conv layer followed by BatchNorm followed by ReLU, the optimizer fuses them into one ConvBnRelu node. This reduces memory traffic and speeds up execution.
This is what the graph looks like before fusion:
Input -> Conv -> BatchNorm -> ReLU -> Output
It’s simplified after fusion:
Input -> ConvBnRelu -> Output
There are two more concepts you need to know when it comes to graph optimization:
- Constant folding pre-computes operations that don't depend on input data. If your model multiplies a weight matrix by a constant, the optimizer does that multiplication once during optimization instead of every inference.
- Dead code elimination removes unused operations. If your model has branches that never execute or outputs you never use, the optimizer cuts them out.
This is how you can enable graph optimizations in ONNX Runtime:
import onnxruntime as ort

# Set optimization level
sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

# Create session with optimizations
session = ort.InferenceSession(
    "simple_model.onnx",
    sess_options,
    providers=["CPUExecutionProvider"],
)
ONNX Runtime has the following optimization levels:
- ORT_DISABLE_ALL: No optimizations (useful for debugging)
- ORT_ENABLE_BASIC: Safe optimizations that don't change numerical results
- ORT_ENABLE_EXTENDED: Aggressive optimizations that may introduce small numerical differences
- ORT_ENABLE_ALL: All optimizations, including layout transformations
Use ORT_ENABLE_ALL for production. The speedup is worth the tiny numerical differences.
You can run this snippet to save optimized models for reuse:
sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
sess_options.optimized_model_filepath = "model_optimized.onnx"
# This creates and saves the optimized model
session = ort.InferenceSession("simple_model.onnx", sess_options)
Graph optimizations work regardless of which framework created your model. A PyTorch Conv-BN-ReLU pattern and a TensorFlow Conv-BN-ReLU pattern both get fused the same way. This is the benefit of shared optimization - you write the optimization once, and apply it to models from any framework.
Up next, let’s discuss deployment.
Deployment Scenarios
In this section, I’ll go over three deployment scenarios - edge devices, cloud infrastructure, and web browsers. I'll show you the challenges and solutions for each environment.
Edge device deployment
The core challenge of edge devices is that they have limited resources.
Your smartphone has 4-16GB of RAM. A Raspberry Pi has even less. IoT devices might have 512MB or less. You can't just throw a 7B parameter model at these devices and expect it to work.
Start with quantization. INT8 models should be your baseline. Go down to INT4 if you can tolerate the accuracy loss. This gets your model small enough to fit in memory.
When it comes to deploying on edge devices, ONNX Runtime Mobile is your friend. It strips out server features you don't need and optimizes for battery life. The binary is smaller, startup is faster, and power consumption is lower.
Here’s how you can add it to your mobile app project:
# iOS
pod 'onnxruntime-mobile-objc'
# Android
implementation 'com.microsoft.onnxruntime:onnxruntime-android:latest.version'
Execution providers handle hardware differences. Android devices use ARM CPUs, some have GPUs from different vendors, and newer phones have NPUs (Neural Processing Units). You don't want to write code for each chip.
Refer to this snippet to choose the right execution provider:
# iOS - use CoreML for Apple Silicon optimization
session = ort.InferenceSession(
    "simple_model.onnx",
    providers=["CoreMLExecutionProvider", "CPUExecutionProvider"]
)

# Android - use NNAPI for hardware acceleration
session = ort.InferenceSession(
    "simple_model.onnx",
    providers=["NnapiExecutionProvider", "CPUExecutionProvider"]
)
CoreML on iOS uses the Neural Engine (Apple's NPU). NNAPI on Android uses whatever accelerator your device has - GPU, DSP, or NPU. The provider abstracts the hardware so your code stays the same. When a new chip comes out with better performance, just update the execution provider, and your app runs faster without code changes.
Cloud deployment
Cloud deployment gives you unlimited scale and the latest hardware.
You can deploy ONNX models on any cloud platform - AWS, Azure, GCP, or your own servers. The pattern is the same - containerize your inference service and deploy it behind a load balancer.
Here's a small Python FastAPI example:
import logging
import time

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import onnxruntime as ort
import numpy as np

app = FastAPI()
logger = logging.getLogger(__name__)

# Load model at startup
session = None

@app.on_event("startup")
async def load_model():
    global session
    session = ort.InferenceSession(
        "simple_model.onnx",
        providers=[
            "CUDAExecutionProvider",
            "CoreMLExecutionProvider",
            "CPUExecutionProvider",
        ],
    )
    logger.info("Model loaded successfully")

class PredictionRequest(BaseModel):
    input: list

class PredictionResponse(BaseModel):
    prediction: list
    inference_time_ms: float

@app.post("/predict", response_model=PredictionResponse)
async def predict(request: PredictionRequest):
    try:
        start = time.time()

        # Input shape: (1, 10)
        input_data = np.array(request.input, dtype=np.float32).reshape(1, 10)
        input_name = session.get_inputs()[0].name
        outputs = session.run(None, {input_name: input_data})

        inference_time = (time.time() - start) * 1000

        return PredictionResponse(
            prediction=outputs[0].tolist(), inference_time_ms=inference_time
        )
    except Exception as e:
        logger.error(f"Prediction failed: {str(e)}")
        raise HTTPException(status_code=500, detail=str(e))

@app.get("/health")
async def health():
    return {"status": "healthy", "model_loaded": session is not None}
This FastAPI example, while small, isn’t the easiest to comprehend for machine learning engineers. You can master the fundamentals with our free Introduction to FastAPI course.
You can then containerize it with Docker:
FROM python:3.13-slim
WORKDIR /app
# Install dependencies
RUN pip install onnx onnxruntime onnxscript fastapi pydantic numpy uvicorn
# Copy model and code
COPY simple_model.onnx .
COPY main.py .
# Expose port
EXPOSE 8000
# Run the service
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
Every machine learning enthusiast must know the basics of Docker. Take our Introduction to Docker course to finally learn how to deploy a machine learning model.
You’ll get an application running on port 8000. Open the auto-generated docs at /docs and try out the prediction endpoint:

If you’re using Azure Machine Learning, you’ll be pleased to learn that it integrates directly with ONNX. You can deploy your model in three steps:
from azureml.core import Workspace, Model
from azureml.core.webservice import AciWebservice, Webservice

# Register model
ws = Workspace.from_config()
model = Model.register(
    workspace=ws, model_path="simple_model.onnx", model_name="my-onnx-model"
)

# Deploy
deployment_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=1)
service = Model.deploy(
    workspace=ws,
    name="onnx-service",
    models=[model],
    deployment_config=deployment_config,
)
Azure handles scaling, monitoring, and updates. You get automatic logging, request tracing, and health checks.
If you’re new to Azure Machine Learning, read our Beginner’s Guide to quickly learn the fundamentals.
If we’re talking production, MLOps workflows need these components:
- Version control: Track model versions with DVC or MLflow
- CI/CD: Automate testing and deployment with GitHub Actions or Azure Pipelines
- Monitoring: Track inference latency, throughput, and error rates
- A/B testing: Compare model versions in production
- Rollback: Revert to previous versions when problems occur
Here’s an example snippet that adds Prometheus metrics to the FastAPI service from earlier:
import prometheus_client as prom
import onnxruntime as ort
import numpy as np

# Define metrics
inference_duration = prom.Histogram(
    "model_inference_duration_seconds", "Time spent processing inference"
)
inference_count = prom.Counter("model_inference_total", "Total number of inferences")

# Load model
session = ort.InferenceSession("simple_model.onnx")

@app.post("/predict")
@inference_duration.time()
async def predict(request: PredictionRequest):
    inference_count.inc()

    # Input shape: (1, 10)
    input_data = np.array(request.input, dtype=np.float32).reshape(1, 10)
    input_name = session.get_inputs()[0].name
    outputs = session.run(None, {input_name: input_data})

    return {"prediction": outputs[0].tolist()}
From there, you can scale horizontally by running multiple service instances behind a load balancer. ONNX Runtime is stateless, so any instance can handle any request.
Browser-based inference
ONNX Runtime Web runs models directly in browsers using WebAssembly and WebGL.
Instead of sending data to a server, you send the model to the browser once and run inference locally. Users get instant responses without network latency.
To start, just create an HTML file that uses an inline JS script:
<!doctype html>
<html>
<head>
  <meta charset="utf-8" />
  <script src="https://cdn.jsdelivr.net/npm/onnxruntime-web/dist/ort.min.js"></script>
</head>
<body>
  <script>
    async function run() {
      try {
        const session = await ort.InferenceSession.create("simple_model.onnx", {
          executionProviders: ["wasm"]
        });
        console.log("Session loaded:", session);

        // Inspect names
        console.log("Inputs:", session.inputNames);
        console.log("Outputs:", session.outputNames);

        const inputName = session.inputNames[0];
        const inputData = new Float32Array(10).fill(0).map(() => Math.random());
        const tensor = new ort.Tensor("float32", inputData, [1, 10]);

        const feeds = {};
        feeds[inputName] = tensor;

        const results = await session.run(feeds);
        const outputName = session.outputNames[0];
        console.log("Prediction:", results[outputName].data);
      } catch (err) {
        console.error("ERROR:", err);
      }
    }

    run();
  </script>
</body>
</html>
You can see the output in the browser console as soon as you serve the file with a local web server (for example, the Live Server extension):

And finally, let’s discuss a couple of advanced topics.
Advanced Topics
Standard ONNX operations cover most use cases, but if you need more, this is the section for you.
I’ll cover custom operators and large language model deployment. I'll explain when you need these advanced features and what trade-offs they bring.
Custom operators
ONNX ships with hundreds of operations, but your model might use something unique.
Custom operators let you define operations that don't exist in the standard ONNX operator set. You write the operation logic yourself and tell ONNX Runtime how to execute it. This extends ONNX beyond its built-in capabilities.
When do you need custom operators? When you're using cutting-edge research that ONNX doesn't support yet. Or when you've built proprietary operations specific to your domain. Or when you need hardware-specific optimizations that standard ops can't provide.
But here's the catch - custom operators break portability. Your model won't run on systems that don't have your custom operator implementation. If you export a PyTorch model with custom ops to ONNX, you need to package those ops separately and register them with ONNX Runtime.
Deployment becomes harder. Every environment needs your custom operator library. Edge devices, cloud servers, and browsers all need the same implementation. You lose ONNX's biggest advantage: write once, run anywhere.
Here’s a small working example of custom operators, just so you can see how they work:
import numpy as np
import onnx
from onnx import helper, TensorProto

# Reference implementation of the custom operation.
# This is what the C++ kernel registered with ONNX Runtime would compute -
# defining it in Python alone does NOT make the op executable.
def custom_square(x):
    return np.square(x).astype(np.float32)

# Create a graph that uses the custom operator
node = helper.make_node(
    "CustomSquare",       # Custom operation name
    ["input"],
    ["output"],
    domain="custom.ops",  # Custom domain
)

graph = helper.make_graph(
    [node],
    "custom_op_model",
    [helper.make_tensor_value_info("input", TensorProto.FLOAT, [1, 10])],
    [helper.make_tensor_value_info("output", TensorProto.FLOAT, [1, 10])],
)

# Declare both the standard opset and the custom domain
model = helper.make_model(
    graph,
    opset_imports=[
        helper.make_opsetid("", 18),
        helper.make_opsetid("custom.ops", 1),
    ],
)
onnx.save(model, "custom_op_model.onnx")

# To run this model, you'd need to implement CustomSquare in C++,
# compile it as a shared library, and register it with ONNX Runtime -
# Python-only custom ops aren't supported
print("Custom operator model created - requires C++ implementation to run")
Most teams avoid custom operators when possible. Try using combinations of standard ops first. Only go custom when standard ops can't express your operation or when performance demands it.
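For the square example above, a custom operator isn't actually necessary - a standard Mul node that multiplies the input by itself does the same thing and runs on any ONNX Runtime build. Here's a minimal sketch:
import numpy as np
import onnx
import onnxruntime as ort
from onnx import helper, TensorProto

# Express "square" with the standard Mul operator: output = input * input
node = helper.make_node("Mul", ["input", "input"], ["output"])

graph = helper.make_graph(
    [node],
    "standard_square_model",
    [helper.make_tensor_value_info("input", TensorProto.FLOAT, [1, 10])],
    [helper.make_tensor_value_info("output", TensorProto.FLOAT, [1, 10])],
)
model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 18)])
onnx.save(model, "standard_square_model.onnx")

# This one runs everywhere - no custom kernel required
session = ort.InferenceSession("standard_square_model.onnx")
x = np.random.randn(1, 10).astype(np.float32)
result = session.run(None, {"input": x})[0]
np.testing.assert_allclose(result, x * x, rtol=1e-6)
print("Standard-op square matches NumPy")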
Handling large language models
Memory is a problem when it comes to LLMs. A 7B parameter model needs 28GB in FP32, 14GB in FP16, and 7GB in INT8. Most consumer hardware can't hold that. You need quantization, and often aggressive quantization to INT4 or INT3.
If you’re new to LLMs and want to learn more, our Large Language Models (LLMs) Concepts course is a great place to start.
KV cache management matters for autoregressive generation. Every token you generate needs to reference all previous tokens. The cache grows with sequence length. Long conversations eat memory fast. ONNX Runtime GenAI includes optimized KV cache handling that reuses memory and reduces overhead.
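To get a feel for the numbers, here's a back-of-the-envelope KV cache estimate, assuming a Llama-7B-style configuration (32 layers, hidden size 4096, FP16 values):
# Back-of-the-envelope KV cache size for an autoregressive transformer.
# Assumed Llama-7B-style config: 32 layers, hidden size 4096, FP16 values.
num_layers = 32
hidden_size = 4096
bytes_per_value = 2  # FP16

# Both keys and values are cached for every layer, for every generated token
bytes_per_token = 2 * num_layers * hidden_size * bytes_per_value
print(f"KV cache per token: {bytes_per_token / 1e6:.2f} MB")  # ~0.52 MB

sequence_length = 4096
total_gb = bytes_per_token * sequence_length / 1e9
print(f"KV cache for a {sequence_length}-token context: {total_gb:.1f} GB")  # ~2.1 GB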
Inference speed depends on your hardware and model size. Smaller models (under 3B parameters) run well on consumer GPUs. Larger models need multiple GPUs or specialized inference accelerators. ONNX Runtime supports multi-GPU inference through execution providers, but you need to split your model across devices manually.
Flash Attention speeds up the attention mechanism in transformers. Standard attention is O(n²) in sequence length - it gets slow fast. Flash Attention reduces memory movement and improves speed without changing results. ONNX Runtime includes Flash Attention optimizations for supported hardware.
Training LLMs with ONNX Runtime is possible but rare. Most teams train with PyTorch or JAX, then export to ONNX for inference only. ONNX Runtime Training exists for fine-tuning and continued pre-training, but the ecosystem around PyTorch training is more mature.
Generative AI workloads get special treatment in ONNX Runtime. The GenAI extensions provide high-level APIs for text generation, maintaining state between calls, and managing beam search or sampling strategies. This saves you from implementing generation logic yourself.
The LLM landscape changes fast, and ONNX Runtime does its best to keep up. The operator set and optimizations are updated regularly, but there's always a lag between research and production-ready support.
Conclusion
You now know how to convert models from any framework, optimize them for production, and deploy them anywhere.
ONNX removes the friction between training and deployment. Train in PyTorch because it's great for research. Deploy with ONNX Runtime because it's fast and runs everywhere. Quantize to INT8 and cut your inference costs in half. Switch from CPU to GPU without changing code. Move models between cloud providers without vendor lock-in.
ONNX is what makes it possible.
The ecosystem keeps improving. Generative AI support gets better with each release - faster LLM inference, better memory management, and new optimization techniques. Hardware support expands to more accelerators and edge devices. The community builds tools that make ONNX easier to use. You can participate by testing new features, reporting issues, or contributing code through the GitHub repositories.
Ready for the next level? Master Explainable Artificial Intelligence (XAI) Concepts to finally stop thinking of models as black boxes.
FAQs
What is ONNX, and why should I use it?
ONNX (Open Neural Network Exchange) is a universal format for machine learning models that works across different frameworks. It lets you train a model in PyTorch or TensorFlow and deploy it anywhere without rewriting code. You get framework independence, better performance through optimizations, and the ability to run models on any hardware, from edge devices to cloud servers.
How does ONNX improve model performance?
ONNX Runtime applies graph optimizations like node fusion and constant folding that speed up inference. Quantization reduces model size by up to 75% and makes inference 2-4x faster by converting weights from 32-bit floats to 8-bit integers. These optimizations work automatically and don't require you to manually tune your model.
Can ONNX run on mobile devices and browsers?
Yes. ONNX Runtime Mobile runs on iOS and Android devices with optimizations for battery life and memory usage. ONNX Runtime Web runs models directly in browsers using WebAssembly and WebGL, so you can do client-side inference without sending data to servers. Both options support hardware acceleration through execution providers like CoreML for Apple devices and NNAPI for Android.
What's the difference between dynamic and static quantization?
Dynamic quantization converts weights to lower precision but keeps activations in full precision, and it doesn't require calibration data. Static quantization converts both weights and activations to lower precision, which is faster but requires representative calibration data to measure activation ranges. Static quantization gives better performance but needs more setup work.
How do I handle models with custom operations when converting to ONNX?
Custom operations require C++ implementations registered with ONNX Runtime - Python-only custom ops aren't supported for production use. You need to implement the custom operator in C++, compile it as a shared library, and register it with ONNX Runtime in every deployment environment. Most teams avoid custom operators when possible and use combinations of standard ONNX operations instead, since custom ops break portability across different platforms.



