Bring Your Own Codegen: NPU Backend Example
This tutorial shows how to integrate a custom hardware backend with TVM’s BYOC framework, using the bundled example NPU backend (CPU emulation, no real hardware required) as the worked example. You will see the key concepts needed to offload operations to a custom accelerator: pattern registration, graph partitioning, codegen, and runtime dispatch.
NPUs are purpose-built accelerators designed around a fixed set of operations common in neural network inference, such as matrix multiplication, convolution, and activation functions. The example backend’s runtime is a stub: it logs the dispatch decisions an NPU would make (memory tier, execution engine, fusion) but performs no real computation, so output buffers are uninitialized. Assertions in this tutorial therefore check shapes, not values. When you replace the runtime with your hardware SDK calls, the same flow produces real results.
Prerequisites: Build TVM with USE_EXAMPLE_NPU_CODEGEN=ON and
USE_EXAMPLE_NPU_RUNTIME=ON.
Overview of the BYOC Flow
The BYOC framework lets you plug a custom backend into TVM’s compilation pipeline in four steps:
Register patterns - describe which sequences of Relax ops the backend can handle.
Partition the graph - group matched ops into composite functions.
Run codegen - lower composite functions to backend-specific representation (JSON graph for the example NPU).
Execute - the runtime dispatches composite functions to the registered backend runtime.
Step 1: Import the backend to register its patterns
Importing the module is enough to register all supported patterns with TVM’s pattern registry.
import numpy as np
import tvm
import tvm.relax.backend.contrib.example_npu # registers patterns
from tvm import relax
from tvm.relax.backend.pattern_registry import get_patterns_with_prefix
from tvm.relax.transform import FuseOpsByPattern, MergeCompositeFunctions, RunCodegen
from tvm.script import relax as R
has_example_npu_codegen = tvm.get_global_func("relax.ext.example_npu", True)
has_example_npu_runtime = tvm.get_global_func("runtime.ExampleNPUJSONRuntimeCreate", True)
has_example_npu = has_example_npu_codegen and has_example_npu_runtime
target = tvm.target.Target("llvm")
patterns = get_patterns_with_prefix("example_npu")
print("Registered patterns:", [p.name for p in patterns])
Registered patterns: ['example_npu.conv2d_relu_fused', 'example_npu.matmul_relu_fused', 'example_npu.depthwise_conv2d', 'example_npu.conv2d', 'example_npu.conv1d', 'example_npu.matmul', 'example_npu.dense', 'example_npu.avg_pool2d', 'example_npu.max_pool2d', 'example_npu.batch_norm', 'example_npu.softmax', 'example_npu.gelu', 'example_npu.relu', 'example_npu.divide', 'example_npu.subtract', 'example_npu.multiply', 'example_npu.add', 'example_npu.dequantize', 'example_npu.quantize']
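Each name corresponds to an entry in the backend's patterns.py, written in the Relax dataflow pattern language. As a rough illustration of what such an entry looks like (a sketch, not the literal source, which may also attach annotation maps and check functions), the fused matmul + relu pattern can be built like this:
# Sketch of a dataflow pattern along the lines of "example_npu.matmul_relu_fused";
# the real entry in patterns.py may differ in detail.
from tvm.relax.dpl import is_op, wildcard

lhs, rhs = wildcard(), wildcard()              # match any matmul operands
matmul = is_op("relax.matmul")(lhs, rhs)       # a relax.matmul call...
matmul_relu = is_op("relax.nn.relu")(matmul)   # ...whose result feeds relax.nn.relu
print(matmul_relu)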
Step 2: Define a model
We use a simple MatMul + ReLU module to illustrate the flow.
@tvm.script.ir_module
class MatmulReLU:
@R.function
def main(
x: R.Tensor((2, 4), "float32"),
w: R.Tensor((4, 8), "float32"),
) -> R.Tensor((2, 8), "float32"):
with R.dataflow():
y = relax.op.matmul(x, w)
z = relax.op.nn.relu(y)
R.output(z)
return z
Step 3: Partition the graph
FuseOpsByPattern groups ops that match a registered pattern into
composite functions, controlled by two flags:
bind_constants=False - keeps weights as function arguments instead of baking them in, so the host stays in charge of parameter ownership.
annotate_codegen=True - tags each composite with its backend name (example_npu); without this tag, RunCodegen has no way to route the composite to a backend.
MergeCompositeFunctions then consolidates adjacent composites
that target the same backend so each group becomes a single external
call. Note that consolidation depends on the patterns themselves: an
op_a + op_b chain only collapses into one composite if a fused
pattern (e.g. matmul_relu_fused) was registered for it; otherwise
each op stays as its own composite even when both target the same
backend.
mod = MatmulReLU
mod = FuseOpsByPattern(patterns, bind_constants=False, annotate_codegen=True)(mod)
mod = MergeCompositeFunctions()(mod)
print("After partitioning:")
print(mod)
After partitioning:
# from tvm.script import ir as I
# from tvm.script import relax as R
@I.ir_module
class Module:
@R.function
def fused_relax_matmul_relax_nn_relu_example_npu_example_npu(x: R.Tensor((2, 4), dtype="float32"), w: R.Tensor((4, 8), dtype="float32")) -> R.Tensor((2, 8), dtype="float32"):
R.func_attr({"Codegen": "example_npu"})
# from tvm.script import relax as R
@R.function
def local_func(x_1: R.Tensor((2, 4), dtype="float32"), w_1: R.Tensor((4, 8), dtype="float32")) -> R.Tensor((2, 8), dtype="float32"):
R.func_attr({"Composite": "example_npu.matmul_relu_fused"})
y: R.Tensor((2, 8), dtype="float32") = R.matmul(x_1, w_1, out_dtype="void")
gv: R.Tensor((2, 8), dtype="float32") = R.nn.relu(y)
return gv
output: R.Tensor((2, 8), dtype="float32") = local_func(x, w)
return output
@R.function
def main(x: R.Tensor((2, 4), dtype="float32"), w: R.Tensor((4, 8), dtype="float32")) -> R.Tensor((2, 8), dtype="float32"):
cls = Module
with R.dataflow():
gv: R.Tensor((2, 8), dtype="float32") = cls.fused_relax_matmul_relax_nn_relu_example_npu_example_npu(x, w)
R.output(gv)
return gv
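As an optional sanity check, you can confirm programmatically that partitioning produced what RunCodegen expects: the outer wrapper carries the Codegen attribute naming the backend, while the Composite-annotated function is nested inside it.
# Optional sanity check: list the functions marked for the example_npu backend.
# Only the outer "Codegen"-annotated wrapper appears at module level; the
# "Composite" function is nested inside it, as shown in the printout above.
for gv, func in mod.functions.items():
    if isinstance(func, relax.Function) and func.attrs and "Codegen" in func.attrs:
        print(gv.name_hint, "-> Codegen:", func.attrs["Codegen"])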
Step 4: Run codegen
RunCodegen lowers each annotated composite function to the backend’s
serialization format. For the example NPU this produces a JSON graph
that the C++ runtime can execute.
Steps 4 and 5 require TVM to be built with USE_EXAMPLE_NPU_CODEGEN=ON
and USE_EXAMPLE_NPU_RUNTIME=ON.
if has_example_npu:
mod = RunCodegen()(mod)
print("After codegen:")
print(mod)
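Besides rewriting main to call an extern function, RunCodegen attaches the generated runtime module, which carries the JSON graph, to the IRModule, conventionally under the external_mods attribute. The snippet below is a sketch for peeking at it; the attribute key is an implementation detail of the BYOC flow.
if has_example_npu:
    # The backend-generated runtime modules travel with the IRModule until
    # relax.build links them into the final executable.
    ext_mods = mod.get_attr("external_mods") or []
    for ext_mod in ext_mods:
        print("External runtime module kind:", ext_mod.type_key)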
Step 5: Build and run
Build the module for the host target, create a virtual machine, and
execute the compiled function.
np.random.seed(0)
x_np = np.random.randn(2, 4).astype("float32")
w_np = np.random.randn(4, 8).astype("float32")
with tvm.transform.PassContext(opt_level=3):
built = relax.build(mod, target)
vm = relax.VirtualMachine(built, tvm.cpu())
result = vm["main"](tvm.runtime.tensor(x_np, tvm.cpu()), tvm.runtime.tensor(w_np, tvm.cpu()))
assert result.numpy().shape == (2, 8)
print("Execution completed. Output shape:", result.numpy().shape)
Step 6: Conv2D + ReLU
The same flow applies to convolution workloads. Because the fused
conv2d + relu pattern is listed ahead of the standalone conv2d
pattern (earlier entries in the pattern list passed to
FuseOpsByPattern take priority), both ops are offloaded as a single
composite function.
@tvm.script.ir_module
class Conv2dReLU:
@R.function
def main(
x: R.Tensor((1, 3, 32, 32), "float32"),
w: R.Tensor((16, 3, 3, 3), "float32"),
) -> R.Tensor((1, 16, 30, 30), "float32"):
with R.dataflow():
y = relax.op.nn.conv2d(x, w)
z = relax.op.nn.relu(y)
R.output(z)
return z
if has_example_npu:
mod2 = Conv2dReLU
mod2 = FuseOpsByPattern(patterns, bind_constants=False, annotate_codegen=True)(mod2)
mod2 = MergeCompositeFunctions()(mod2)
mod2 = RunCodegen()(mod2)
with tvm.transform.PassContext(opt_level=3):
built2 = relax.build(mod2, target)
x2_np = np.random.randn(1, 3, 32, 32).astype("float32")
w2_np = np.random.randn(16, 3, 3, 3).astype("float32")
vm2 = relax.VirtualMachine(built2, tvm.cpu())
result2 = vm2["main"](
tvm.runtime.tensor(x2_np, tvm.cpu()), tvm.runtime.tensor(w2_np, tvm.cpu())
)
assert result2.numpy().shape == (1, 16, 30, 30)
print("Conv2dReLU output shape:", result2.numpy().shape)
Next steps
To build a real NPU backend using this example as a starting point:
Replace example_npu_runtime.cc with your hardware SDK calls.
Extend patterns.py with the ops your hardware supports (see the sketch after this list).
Add a C++ codegen under src/relax/backend/contrib/ if your hardware requires a non-JSON serialization format.
Add your cmake module under cmake/modules/contrib/, following the pattern in cmake/modules/contrib/ExampleNPU.cmake.
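For instance, to advertise a new op under the example_npu prefix you would add an entry to patterns.py using the same dataflow-pattern helpers shown in Step 1. The sketch below is hypothetical: the op choice, the annotation map, and the exact register_patterns tuple layout are assumptions meant to illustrate the shape of an entry, not code taken from the example backend.
# Hypothetical addition to patterns.py: offload relax.tanh to the NPU.
from tvm.relax.backend.pattern_registry import register_patterns
from tvm.relax.dpl import is_op, wildcard

inp = wildcard()
tanh_pattern = is_op("relax.tanh")(inp)

# Registering under the backend prefix lets get_patterns_with_prefix("example_npu")
# pick it up; the annotation map names the matched input for the codegen.
register_patterns([("example_npu.tanh", tanh_pattern, {"input": inp})])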