Bring Your Own Codegen#
TVM’s Bring Your Own Codegen (BYOC) framework lets you offload parts of a model to a custom backend – a hardware accelerator, an inference library, or your own kernels – while TVM compiles the rest. This tutorial has two parts:
How BYOC works – we teach the flow with a bundled, hardware-free example NPU backend and then drive the same flow on a real production backend, NVIDIA TensorRT. Both run a small, hand-written model so every step is visible; the only thing that changes between them is the backend, and that contrast is the lesson.
Deploying a real model – we then put it to work, taking an actual PyTorch nn.Module from export through TensorRT and running it on the GPU.
The example NPU is a teaching stub: its runtime logs the dispatch decisions an NPU would make (memory tier, execution engine, fusion) but performs no real computation, so its output buffers are left uninitialized. We therefore check shapes, not values, in the NPU sections – its job is to make every BYOC step visible with nothing hidden. TensorRT then runs the identical flow for real, so we cross-check its result against a reference.
Prerequisites: the example NPU sections need TVM built with
USE_EXAMPLE_NPU_CODEGEN=ON and USE_EXAMPLE_NPU_RUNTIME=ON; the TensorRT
sections need USE_TENSORRT_CODEGEN=ON, USE_TENSORRT_RUNTIME=ON and
USE_CUDA=ON plus a CUDA GPU and a matching TensorRT install (from NVIDIA’s
pip install tensorrt packages or the TensorRT archive); the final deployment
section also needs PyTorch. Each section degrades gracefully when its backend is
unavailable.
Overview of the BYOC flow#
BYOC plugs a custom backend into TVM’s compilation pipeline in four steps:
Register patterns - describe which sequences of Relax ops the backend can handle.
Partition the graph - group matched ops into composite functions.
Run codegen - lower each composite to the backend’s representation (a JSON graph for both backends in this tutorial).
Execute - the runtime dispatches each composite to the backend.
Steps 1 and 2 are pure Python and run anywhere; steps 3 and 4 need the backend’s codegen and runtime compiled into TVM, which is why the build-and-run cells below are guarded.
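Before touching TVM, the four steps can be pictured with a toy sketch in plain Python. Every name in it (PATTERNS, partition, codegen, execute, the toy_npu prefix) is hypothetical and none of it is TVM's API; it only mirrors the shape of the flow: a prioritized pattern table, greedy grouping of matched ops, JSON serialization, and dispatch.

```python
import json

# Step 1: a hypothetical pattern table of (composite name, op sequence, priority).
PATTERNS = [
    ("toy_npu.conv2d", ["conv2d"], 0),
    ("toy_npu.relu", ["relu"], 0),
    ("toy_npu.conv2d_relu_fused", ["conv2d", "relu"], 1),
]


def partition(ops, patterns):
    """Step 2: greedily group ops into composites, trying higher priority first."""
    ordered = sorted(patterns, key=lambda p: -p[2])
    groups, i = [], 0
    while i < len(ops):
        for name, seq, _ in ordered:
            if ops[i : i + len(seq)] == seq:
                groups.append({"composite": name, "ops": seq})
                i += len(seq)
                break
        else:
            # No pattern matched: the op stays on the host.
            groups.append({"composite": None, "ops": [ops[i]]})
            i += 1
    return groups


def codegen(groups):
    """Step 3: serialize each offloaded composite to a JSON graph string."""
    return [json.dumps(g) if g["composite"] else None for g in groups]


def execute(groups, graphs):
    """Step 4: record whether each group goes to the backend or the host."""
    log = []
    for group, graph in zip(groups, graphs):
        target = "backend" if graph is not None else "host"
        log.append((target, group["ops"]))
    return log


model_ops = ["conv2d", "relu", "softmax"]
groups = partition(model_ops, PATTERNS)
dispatch_log = execute(groups, codegen(groups))
print(dispatch_log)
```

In this toy, the fused pattern wins over the standalone conv2d pattern because of its higher priority, and softmax falls back to the host because no pattern covers it.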
Step 1: Import the backends to register their patterns#
Importing a backend module registers its patterns with TVM’s global registry. Pattern registration is independent of the C++ build – only codegen and the runtime require the backend to be compiled in – so we probe each backend and guard the build-and-run cells accordingly.
import os
import tempfile
import numpy as np
import tvm
import tvm.relax.backend.contrib.example_npu
from tvm import relax
from tvm.relax.backend.contrib.tensorrt import partition_for_tensorrt
from tvm.relax.backend.pattern_registry import get_patterns_with_prefix
from tvm.relax.transform import FuseOpsByPattern, MergeCompositeFunctions, RunCodegen
from tvm.script import relax as R
# get_global_func(name, True) returns None instead of raising when the function is missing.
has_example_npu_codegen = tvm.get_global_func("relax.ext.example_npu", True) is not None
has_example_npu_runtime = (
    tvm.get_global_func("runtime.ExampleNPUJSONRuntimeCreate", True) is not None
)
has_example_npu = has_example_npu_codegen and has_example_npu_runtime
has_tensorrt_codegen = tvm.get_global_func("relax.ext.tensorrt", True) is not None
_is_trt_runtime_enabled = tvm.get_global_func("relax.is_tensorrt_runtime_enabled", True)
has_tensorrt = (
    has_tensorrt_codegen and _is_trt_runtime_enabled is not None and _is_trt_runtime_enabled()
)
has_cuda = tvm.cuda(0).exist
Step 2: Define the model#
A single convolution followed by a ReLU. This one model is used for both backends.
@tvm.script.ir_module
class ConvReLU:
    @R.function
    def main(
        data: R.Tensor((1, 3, 32, 32), "float32"),
        weight: R.Tensor((16, 3, 3, 3), "float32"),
    ) -> R.Tensor((1, 16, 30, 30), "float32"):
        with R.dataflow():
            conv = relax.op.nn.conv2d(data, weight)
            out = relax.op.nn.relu(conv)
            R.output(out)
        return out
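The (1, 16, 30, 30) return annotation follows from the standard convolution output-size formula with the default stride 1, padding 0 and dilation 1; the ReLU is elementwise and leaves the shape unchanged. A small sketch of that arithmetic, where conv2d_output_shape is a hypothetical helper written for this tutorial and not part of TVM:

```python
def conv2d_output_shape(data_shape, weight_shape, stride=1, padding=0, dilation=1):
    """Output shape of an NCHW conv2d with an OIHW kernel."""
    n, _, h, w = data_shape
    out_channels, _, kh, kw = weight_shape

    def out_dim(size, kernel):
        # Span covered by a dilated kernel along one axis.
        effective = dilation * (kernel - 1) + 1
        return (size + 2 * padding - effective) // stride + 1

    return (n, out_channels, out_dim(h, kh), out_dim(w, kw))


print(conv2d_output_shape((1, 3, 32, 32), (16, 3, 3, 3)))  # (1, 16, 30, 30)
```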
Step 3: Partition for the example NPU#
FuseOpsByPattern groups ops matching a registered pattern into composite
functions; MergeCompositeFunctions then consolidates adjacent composites
bound for the same backend into a single external call. Two flags steer
partitioning:
bind_constants=False keeps weights as function arguments, so the host stays in charge of the parameters. (TensorRT below makes the opposite choice: it binds weights as constants because it bakes them into its engine.)
annotate_codegen=True wraps each matched composite in a function tagged with the backend name – the tag RunCodegen routes on. (The follow-up MergeCompositeFunctions also attaches this tag when it groups composites, which is why partition_for_tensorrt below can leave the flag off.)
The example NPU registers a fused conv2d + relu pattern with higher
priority than the standalone conv2d pattern, so the two ops collapse into a
single example_npu.conv2d_relu_fused composite – look for it in the
printed module.
npu_patterns = get_patterns_with_prefix("example_npu")
npu_mod = FuseOpsByPattern(npu_patterns, bind_constants=False, annotate_codegen=True)(ConvReLU)
npu_mod = MergeCompositeFunctions()(npu_mod)
print("After partitioning for the example NPU:")
print(npu_mod)
After partitioning for the example NPU:
# from tvm.script import ir as I
# from tvm.script import relax as R
@I.ir_module
class Module:
    @R.function
    def fused_relax_nn_conv2d_relax_nn_relu_example_npu_example_npu(data: R.Tensor((1, 3, 32, 32), dtype="float32"), weight: R.Tensor((16, 3, 3, 3), dtype="float32")) -> R.Tensor((1, 16, 30, 30), dtype="float32"):
        R.func_attr({"Codegen": "example_npu"})
        # from tvm.script import relax as R
        @R.function
        def local_func(data_1: R.Tensor((1, 3, 32, 32), dtype="float32"), weight_1: R.Tensor((16, 3, 3, 3), dtype="float32")) -> R.Tensor((1, 16, 30, 30), dtype="float32"):
            R.func_attr({"Composite": "example_npu.conv2d_relu_fused"})
            conv: R.Tensor((1, 16, 30, 30), dtype="float32") = R.nn.conv2d(data_1, weight_1, strides=[1, 1], padding=[0, 0, 0, 0], dilation=[1, 1], groups=1, data_layout="NCHW", kernel_layout="OIHW", out_layout="NCHW", out_dtype=None)
            gv: R.Tensor((1, 16, 30, 30), dtype="float32") = R.nn.relu(conv)
            return gv
        output: R.Tensor((1, 16, 30, 30), dtype="float32") = local_func(data, weight)
        return output
    @R.function
    def main(data: R.Tensor((1, 3, 32, 32), dtype="float32"), weight: R.Tensor((16, 3, 3, 3), dtype="float32")) -> R.Tensor((1, 16, 30, 30), dtype="float32"):
        cls = Module
        with R.dataflow():
            gv: R.Tensor((1, 16, 30, 30), dtype="float32") = cls.fused_relax_nn_conv2d_relax_nn_relu_example_npu_example_npu(data, weight)
            R.output(gv)
        return gv
Step 4: Codegen, build and run on the example NPU#
RunCodegen invokes each annotated composite’s backend codegen, replacing it
with the backend runtime module (here, the NPU’s JSON graph); relax.build
then compiles the remaining host-side program and links everything. Because
the runtime is a stub that computes nothing, we assert on the output shape
only – the values are uninitialized.
np.random.seed(0)
data_np = np.random.randn(1, 3, 32, 32).astype("float32")
weight_np = np.random.randn(16, 3, 3, 3).astype("float32")
if has_example_npu:
    npu_mod = RunCodegen()(npu_mod)
    with tvm.transform.PassContext(opt_level=3):
        npu_exec = relax.build(npu_mod, tvm.target.Target("llvm"))
    npu_vm = relax.VirtualMachine(npu_exec, tvm.cpu())
    npu_out = npu_vm["main"](
        tvm.runtime.tensor(data_np, tvm.cpu()), tvm.runtime.tensor(weight_np, tvm.cpu())
    )
    assert npu_out.numpy().shape == (1, 16, 30, 30)
    print("Example NPU run completed. Output shape:", npu_out.numpy().shape)
else:
    print("Example NPU backend unavailable; skipping its build and run.")
Example NPU backend unavailable; skipping its build and run.
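Because the stub leaves its output uninitialized, it can help to see what values a real backend would be expected to produce for this model. The following is a naive NumPy reference, a sketch that assumes stride 1, no padding, and NCHW/OIHW layouts; conv2d_relu_reference is a hypothetical helper, not a TVM function, and it is written for clarity rather than speed.

```python
import numpy as np


def conv2d_relu_reference(data, weight):
    """Naive NCHW/OIHW conv2d (stride 1, no padding) followed by ReLU."""
    n, c, h, w = data.shape
    o, c_w, kh, kw = weight.shape
    assert c == c_w, "input channels must match"
    out_h, out_w = h - kh + 1, w - kw + 1
    out = np.zeros((n, o, out_h, out_w), dtype=data.dtype)
    for i in range(kh):
        for j in range(kw):
            # Window of the input aligned with kernel tap (i, j).
            patch = data[:, :, i : i + out_h, j : j + out_w]
            # Contract over input channels: (n, c, h, w) x (o, c) -> (n, o, h, w).
            out += np.einsum("nchw,oc->nohw", patch, weight[:, :, i, j])
    return np.maximum(out, 0)


rng = np.random.default_rng(0)
ref_data = rng.standard_normal((1, 3, 32, 32)).astype("float32")
ref_weight = rng.standard_normal((16, 3, 3, 3)).astype("float32")
ref_out = conv2d_relu_reference(ref_data, ref_weight)
print(ref_out.shape)
```

A reference like this plays the same role as the CPU build used later for TensorRT: it gives real values to compare against once a backend actually computes.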
The same flow on a real backend: TensorRT#
Steps 1-4 above are the whole mechanism. Aiming them at a real backend changes very little, so rather than repeat the walkthrough, here is only what differs for NVIDIA TensorRT:
Partition in one call. partition_for_tensorrt bundles the FuseOpsByPattern + MergeCompositeFunctions you ran by hand, using TensorRT’s own pattern table.
Weights become constants (bind_constants=True): TensorRT bakes them into the engine it builds, so bind the parameters before partitioning.
Real values. TensorRT actually computes, so we build for CUDA, run on the GPU, and cross-check against a plain CPU build – not just the shape.
trt_mod = relax.transform.BindParams("main", {"weight": weight_np})(ConvReLU)
trt_mod = partition_for_tensorrt(trt_mod)
print("After partition_for_tensorrt:")
print(trt_mod)
After partition_for_tensorrt:
# from tvm.script import ir as I
# from tvm.script import relax as R
@I.ir_module
class Module:
    @R.function
    def fused_relax_nn_conv2d_relax_nn_relu_tensorrt(data: R.Tensor((1, 3, 32, 32), dtype="float32")) -> R.Tensor((1, 16, 30, 30), dtype="float32"):
        R.func_attr({"Codegen": "tensorrt"})
        # from tvm.script import relax as R
        @R.function
        def gv(data_1: R.Tensor((1, 3, 32, 32), dtype="float32")) -> R.Tensor((1, 16, 30, 30), dtype="float32"):
            R.func_attr({"Composite": "tensorrt.nn.conv2d"})
            with R.dataflow():
                gv_1: R.Tensor((1, 16, 30, 30), dtype="float32") = R.nn.conv2d(data_1, metadata["relax.expr.Constant"][0], strides=[1, 1], padding=[0, 0, 0, 0], dilation=[1, 1], groups=1, data_layout="NCHW", kernel_layout="OIHW", out_layout="NCHW", out_dtype=None)
                R.output(gv_1)
            return gv_1
        lv: R.Tensor((1, 16, 30, 30), dtype="float32") = gv(data)
        # from tvm.script import relax as R
        @R.function
        def gv1(lv_1: R.Tensor((1, 16, 30, 30), dtype="float32")) -> R.Tensor((1, 16, 30, 30), dtype="float32"):
            R.func_attr({"Composite": "tensorrt.nn.relu"})
            with R.dataflow():
                gv_1: R.Tensor((1, 16, 30, 30), dtype="float32") = R.nn.relu(lv_1)
                R.output(gv_1)
            return gv_1
        gv_1: R.Tensor((1, 16, 30, 30), dtype="float32") = gv1(lv)
        return gv_1
    @R.function
    def main(data: R.Tensor((1, 3, 32, 32), dtype="float32")) -> R.Tensor((1, 16, 30, 30), dtype="float32"):
        cls = Module
        with R.dataflow():
            gv: R.Tensor((1, 16, 30, 30), dtype="float32") = cls.fused_relax_nn_conv2d_relax_nn_relu_tensorrt(data)
            R.output(gv)
        return gv
# Metadata omitted. Use show_meta=True in script() method to show it.
Build for CUDA, run on the GPU, and compare against the CPU reference.
if has_tensorrt and has_cuda:
    dev = tvm.cuda(0)
    with tvm.transform.PassContext(opt_level=3):
        trt_exec = relax.build(RunCodegen()(trt_mod), "cuda")
    trt_out = relax.VirtualMachine(trt_exec, dev)["main"](tvm.runtime.tensor(data_np, dev)).numpy()
    cpu_mod = relax.transform.LegalizeOps()(
        relax.transform.BindParams("main", {"weight": weight_np})(ConvReLU)
    )
    cpu_exec = relax.build(cpu_mod, "llvm")
    cpu_out = relax.VirtualMachine(cpu_exec, tvm.cpu())["main"](
        tvm.runtime.tensor(data_np, tvm.cpu())
    ).numpy()
    np.testing.assert_allclose(trt_out, cpu_out, rtol=1e-2, atol=1e-2)
    print("TensorRT output shape:", trt_out.shape, "- matches the CPU reference.")
else:
    print("TensorRT/CUDA unavailable; skipping the GPU build and run.")
TensorRT/CUDA unavailable; skipping the GPU build and run.
A real backend also exposes knobs the stub does not. Setting use_fp16
through the relax.ext.tensorrt.options config lets TensorRT pick FP16
kernels, trading a little accuracy for speed; nothing else about the flow
changes. (Other options are environment-driven: TVM_TENSORRT_USE_INT8
enables INT8 with calibration, TVM_TENSORRT_MAX_WORKSPACE_SIZE caps the
build workspace, and TVM_TENSORRT_CACHE_DIR caches built engines to disk
for reuse across runs.)
if has_tensorrt and has_cuda:
    fp16_mod = partition_for_tensorrt(
        relax.transform.BindParams("main", {"weight": weight_np})(ConvReLU)
    )
    with tvm.transform.PassContext(
        opt_level=3, config={"relax.ext.tensorrt.options": {"use_fp16": True}}
    ):
        fp16_exec = relax.build(RunCodegen()(fp16_mod), "cuda")
    fp16_out = relax.VirtualMachine(fp16_exec, tvm.cuda(0))["main"](
        tvm.runtime.tensor(data_np, tvm.cuda(0))
    ).numpy()
    np.testing.assert_allclose(fp16_out, cpu_out, rtol=5e-2, atol=5e-2)
    print("TensorRT FP16 output shape:", fp16_out.shape, "- matches within FP16 tolerance.")
else:
    print("TensorRT/CUDA unavailable; skipping the FP16 build.")
TensorRT/CUDA unavailable; skipping the FP16 build.
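The looser tolerance in the FP16 check reflects half-precision rounding. np.testing.assert_allclose passes when |actual - desired| <= atol + rtol * |desired| holds elementwise. The NumPy-only sketch below rounds a float32 array through float16 to show the scale of error involved; it only mimics storage rounding and is not a model of TensorRT's FP16 kernels, whose accumulated error can be larger.

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.standard_normal(10_000).astype("float32")

# Round-trip through half precision to mimic FP16 storage rounding.
rounded = values.astype("float16").astype("float32")
max_abs_err = float(np.max(np.abs(rounded - values)))
print("max absolute rounding error:", max_abs_err)

# The FP16 tolerance used in this tutorial accepts the rounded values...
np.testing.assert_allclose(rounded, values, rtol=5e-2, atol=5e-2)

# ...while a float32-level tolerance rejects them.
try:
    np.testing.assert_allclose(rounded, values, rtol=1e-6, atol=0)
    tight_passes = True
except AssertionError:
    tight_passes = False
print("passes at rtol=1e-6:", tight_passes)
```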
Example NPU vs TensorRT at a glance#
The same four-step flow, two backends:
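| Aspect | Example NPU (teaching stub) | TensorRT (real backend) |
|---|---|---|
| Runtime | logs decisions, no compute | builds and runs an nvinfer engine |
| Output | uninitialized (check shape) | real values (cross-checked vs CPU) |
| Weights | function arguments (bind_constants=False) | constants baked into the engine (bind_constants=True) |
| Partition | two passes, by hand | one call (partition_for_tensorrt) |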
Deploying a PyTorch model with TensorRT#
Everything above used a hand-written IRModule so each op was visible. In
practice you start from a trained model. This final section runs the same
partition_for_tensorrt flow on a real PyTorch nn.Module, end to end:
export it, import it into Relax with the PyTorch frontend (the weights come in
as constants – exactly what TensorRT bakes into its engine), partition, build
for CUDA, and check the GPU result against PyTorch’s own output. Beyond the
frontend import, the only difference is that the imported program returns its
outputs as a tuple, so we index [0] for the single result tensor; the
partition-build-run flow is otherwise unchanged.
This section additionally requires PyTorch.
try:
    import torch
    from torch import nn

    has_torch = True
except ImportError:
    has_torch = False
if has_torch and has_tensorrt and has_cuda:
    from tvm.relax.frontend.torch import from_exported_program

    class SmallConvNet(nn.Module):
        def __init__(self):
            super().__init__()
            self.conv1 = nn.Conv2d(3, 8, 3)
            self.conv2 = nn.Conv2d(8, 16, 3)
            self.pool = nn.MaxPool2d(2)

        def forward(self, x):
            x = torch.relu(self.conv1(x))
            x = self.pool(x)
            x = torch.relu(self.conv2(x))
            return x

    torch_model = SmallConvNet().eval()
    example_input = torch.randn(1, 3, 32, 32)
    with torch.no_grad():
        torch_ref = torch_model(example_input).numpy()
    exported = torch.export.export(torch_model, (example_input,))
    torch_mod = from_exported_program(exported)
    torch_mod = partition_for_tensorrt(torch_mod)
    print("After importing and partitioning the PyTorch model:")
    print(torch_mod)
    torch_dev = tvm.cuda(0)
    with tvm.transform.PassContext(opt_level=3):
        torch_exec = relax.build(RunCodegen()(torch_mod), "cuda")
    deployed = relax.VirtualMachine(torch_exec, torch_dev)["main"](
        tvm.runtime.tensor(example_input.numpy(), torch_dev)
    )[0].numpy()
    np.testing.assert_allclose(deployed, torch_ref, rtol=1e-2, atol=1e-2)
    print("Deployed PyTorch model on TensorRT; output", deployed.shape, "matches PyTorch.")
else:
    print("PyTorch / TensorRT / CUDA unavailable; skipping the deployment example.")
PyTorch / TensorRT / CUDA unavailable; skipping the deployment example.
Real deployment builds once and reuses the artifact. Export the compiled module to a shared library, then load and run it later – in a fresh process, with no PyTorch and no rebuild needed.
if has_torch and has_tensorrt and has_cuda:
    with tempfile.TemporaryDirectory() as tmpdir:
        lib_path = os.path.join(tmpdir, "deployed_trt.so")
        torch_exec.export_library(lib_path)
        loaded = tvm.runtime.load_module(lib_path)
        reran = relax.VirtualMachine(loaded, torch_dev)["main"](
            tvm.runtime.tensor(example_input.numpy(), torch_dev)
        )[0].numpy()
        np.testing.assert_allclose(reran, torch_ref, rtol=1e-2, atol=1e-2)
        print("Reloaded the exported library and reran; output", reran.shape, "still matches.")
else:
    print("PyTorch / TensorRT / CUDA unavailable; skipping the export/reload step.")
PyTorch / TensorRT / CUDA unavailable; skipping the export/reload step.
Notes for real deployments#
Operator coverage and fallback. TensorRT offloads only the ops in its pattern table (see python/tvm/relax/backend/contrib/tensorrt.py); anything unsupported simply stays on the host. Print the partitioned module and look for the Codegen: "tensorrt" functions to see what was offloaded.
Dynamic shapes. The builder sets up an optimization profile for a dynamic leading (batch) dimension, so the integration can serve a model exported with a symbolic batch size.
Engine build cost. Building a TensorRT engine is slow the first time (it is not a hang). Set TVM_TENSORRT_CACHE_DIR to cache built engines to disk and skip the rebuild on later runs.
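The environment-driven options named in this tutorial can be set from Python before the build-and-run cells execute. This is a minimal sketch: the variable names come from the text above, but the cache location is hypothetical and the value format for the workspace size (a byte count) is an assumption to verify against the TensorRT runtime sources.

```python
import os
import tempfile

# Hypothetical cache location; any writable directory should do.
cache_dir = os.path.join(tempfile.gettempdir(), "tvm_trt_engine_cache")
os.makedirs(cache_dir, exist_ok=True)

# Set these before building and running so the TensorRT runtime can see them.
os.environ["TVM_TENSORRT_CACHE_DIR"] = cache_dir
# Assumed to be a byte count; here 1 GiB.
os.environ["TVM_TENSORRT_MAX_WORKSPACE_SIZE"] = str(1 << 30)

print("engine cache:", os.environ["TVM_TENSORRT_CACHE_DIR"])
```

For a cache that should survive reboots, a directory outside the system temp location would be the more natural choice.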
Next steps#
To build your own backend using the example NPU as a starting point:
Replace the stub runtime in src/runtime/extra/contrib/example_npu/example_npu_runtime.cc with your hardware SDK calls.
Extend patterns.py with the ops your hardware supports.
Add a C++ codegen under src/relax/backend/contrib/ if your backend needs a non-JSON serialization format.
Add a CMake module under cmake/modules/contrib/ following ExampleNPU.cmake.
For a complete real-backend implementation to study, see the TensorRT
integration: the pattern table and partition_for_tensorrt in
python/tvm/relax/backend/contrib/tensorrt.py, the codegen in
src/relax/backend/contrib/tensorrt/, and the runtime in
src/runtime/extra/contrib/tensorrt/.