
.. DO NOT EDIT. THIS FILE WAS AUTOMATICALLY GENERATED BY
.. TVM'S MONKEY-PATCHED VERSION OF SPHINX-GALLERY. TO MAKE
.. CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "how_to/tutorials/bring_your_own_codegen.py"

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        You can click :ref:`here <sphx_glr_download_how_to_tutorials_bring_your_own_codegen.py>` to run the Jupyter notebook locally.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_how_to_tutorials_bring_your_own_codegen.py:


.. _tutorial-bring-your-own-codegen:

Bring Your Own Codegen
======================

TVM's Bring Your Own Codegen (BYOC) framework lets you offload parts of a model
to a custom backend -- a hardware accelerator, an inference library, or your own
kernels -- while TVM compiles the rest.  This tutorial has two parts:

- **How BYOC works** -- we teach the flow with a bundled, hardware-free *example
  NPU* backend and then drive the **same flow** on a real production backend,
  NVIDIA TensorRT.  Both run a small, hand-written model so every step is
  visible; the only thing that changes between them is the backend, and that
  contrast is the lesson.
- **Deploying a real model** -- we then put it to work, taking an actual PyTorch
  ``nn.Module`` from export through TensorRT and running it on the GPU.

The example NPU is a teaching stub: its runtime logs the dispatch decisions an
NPU would make (memory tier, execution engine, fusion) but performs no real
computation, so its output buffers are left uninitialized.  In the NPU sections
we therefore check *shapes*, not values; the stub exists only to make every BYOC
step visible.  TensorRT then runs the identical flow for real, so we cross-check
its result against a reference.

**Prerequisites**:

- Example NPU sections: TVM built with ``USE_EXAMPLE_NPU_CODEGEN=ON`` and
  ``USE_EXAMPLE_NPU_RUNTIME=ON``.
- TensorRT sections: TVM built with ``USE_TENSORRT_CODEGEN=ON``,
  ``USE_TENSORRT_RUNTIME=ON`` and ``USE_CUDA=ON``, plus a CUDA GPU and a
  matching TensorRT install (from NVIDIA's ``pip install tensorrt`` packages or
  the TensorRT archive).
- Final deployment section: PyTorch, in addition to the TensorRT requirements.

When a backend is unavailable, its sections skip their build-and-run cells and
print a notice instead of failing.

.. GENERATED FROM PYTHON SOURCE LINES 53-68

Overview of the BYOC flow
-------------------------

BYOC plugs a custom backend into TVM's compilation pipeline in four steps:

1. **Register patterns** -- describe which sequences of Relax ops the backend
   can handle.
2. **Partition the graph** -- group matched ops into composite functions.
3. **Run codegen** -- lower each composite to the backend's representation
   (a JSON graph for both backends in this tutorial).
4. **Execute** -- the runtime dispatches each composite to the backend.

Steps 1 and 2 are pure Python and run anywhere; steps 3 and 4 need the
backend's codegen and runtime compiled into TVM, which is why the
build-and-run cells below are guarded.
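
As a rough mental model of the four steps, the following sketch runs a toy
version in plain Python.  It is not TVM's implementation: the backend name
``toy_backend``, the op names, and every function in it are hypothetical, and a
flat list of op names stands in for a Relax graph.

.. code-block:: Python

    import json

    # Step 1: the ops this hypothetical backend claims to support.
    SUPPORTED_OPS = {"conv2d", "relu"}


    def partition(ops):
        """Step 2: group consecutive supported ops into offloaded regions."""
        regions, current = [], []
        for op in ops:
            if op in SUPPORTED_OPS:
                current.append(op)
                continue
            if current:
                regions.append(("toy_backend", current))
                current = []
            regions.append(("host", [op]))
        if current:
            regions.append(("toy_backend", current))
        return regions


    def run_codegen(regions):
        """Step 3: lower offloaded regions to a JSON graph; host ops pass through."""
        program = []
        for target, ops in regions:
            if target == "toy_backend":
                program.append((target, json.dumps({"nodes": ops})))
            else:
                program.append((target, ops))
        return program


    def execute(program):
        """Step 4: dispatch each region and record where every op ran."""
        trace = []
        for target, payload in program:
            ops = json.loads(payload)["nodes"] if target == "toy_backend" else payload
            trace.extend(f"{target}:{op}" for op in ops)
        return trace


    regions = partition(["conv2d", "relu", "softmax"])
    trace = execute(run_codegen(regions))
    print(regions)
    print(trace)

In this toy run ``conv2d`` and ``relu`` travel together as one offloaded region,
while ``softmax`` stays on the host because the backend never claimed it.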

.. GENERATED FROM PYTHON SOURCE LINES 70-77

Step 1: Import the backends to register their patterns
------------------------------------------------------

Importing a backend module registers its patterns with TVM's global registry.
Pattern registration is independent of the C++ build -- only codegen and the
runtime require the backend to be compiled in -- so we probe each backend and
guard the build-and-run cells accordingly.
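
The probing logic boils down to "every required global function must exist".
A minimal sketch of that check follows; ``backend_available`` and
``fake_registry`` are hypothetical names introduced here, and the lookup is
injected so the snippet runs even without TVM installed.

.. code-block:: Python

    def backend_available(required_funcs, lookup):
        """Return True only if every named global function can be found.

        ``lookup`` is expected to behave like ``tvm.get_global_func(name, True)``:
        it returns the function when it exists and ``None`` when it does not.
        """
        return all(lookup(name) is not None for name in required_funcs)


    # Stand-in lookup table (hypothetical): only the codegen entry is present.
    fake_registry = {"relax.ext.example_npu": object()}

    codegen_only = backend_available(["relax.ext.example_npu"], fake_registry.get)
    codegen_and_runtime = backend_available(
        ["relax.ext.example_npu", "runtime.ExampleNPUJSONRuntimeCreate"],
        fake_registry.get,
    )
    print(codegen_only, codegen_and_runtime)

With TVM installed, passing ``lambda name: tvm.get_global_func(name, True)`` as
``lookup`` would query the real registry instead of the stand-in table.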

.. GENERATED FROM PYTHON SOURCE LINES 77-102

.. code-block:: Python


    import os
    import tempfile

    import numpy as np

    import tvm
    import tvm.relax.backend.contrib.example_npu
    from tvm import relax
    from tvm.relax.backend.contrib.tensorrt import partition_for_tensorrt
    from tvm.relax.backend.pattern_registry import get_patterns_with_prefix
    from tvm.relax.transform import FuseOpsByPattern, MergeCompositeFunctions, RunCodegen
    from tvm.script import relax as R

    # Compare against None so the flags are plain booleans, like the TensorRT ones.
    has_example_npu_codegen = tvm.get_global_func("relax.ext.example_npu", True) is not None
    has_example_npu_runtime = (
        tvm.get_global_func("runtime.ExampleNPUJSONRuntimeCreate", True) is not None
    )
    has_example_npu = has_example_npu_codegen and has_example_npu_runtime

    has_tensorrt_codegen = tvm.get_global_func("relax.ext.tensorrt", True) is not None
    _is_trt_runtime_enabled = tvm.get_global_func("relax.is_tensorrt_runtime_enabled", True)
    has_tensorrt = (
        has_tensorrt_codegen and _is_trt_runtime_enabled is not None and _is_trt_runtime_enabled()
    )
    has_cuda = tvm.cuda(0).exist








.. GENERATED FROM PYTHON SOURCE LINES 103-108

Step 2: Define the model
------------------------

The model is a single convolution followed by a ReLU.  The same model is used
for both backends.

.. GENERATED FROM PYTHON SOURCE LINES 108-124

.. code-block:: Python



    @tvm.script.ir_module
    class ConvReLU:
        @R.function
        def main(
            data: R.Tensor((1, 3, 32, 32), "float32"),
            weight: R.Tensor((16, 3, 3, 3), "float32"),
        ) -> R.Tensor((1, 16, 30, 30), "float32"):
            with R.dataflow():
                conv = relax.op.nn.conv2d(data, weight)
                out = relax.op.nn.relu(conv)
                R.output(out)
            return out









.. GENERATED FROM PYTHON SOURCE LINES 125-145

Step 3: Partition for the example NPU
-------------------------------------

``FuseOpsByPattern`` groups ops matching a registered pattern into composite
functions; ``MergeCompositeFunctions`` then consolidates adjacent composites
bound for the same backend into a single external call.  Two flags steer
partitioning:

- ``bind_constants=False`` keeps weights as function arguments, so the host
  stays in charge of the parameters.  (TensorRT below makes the opposite
  choice: it binds weights as constants because it bakes them into its engine.)
- ``annotate_codegen=True`` wraps each matched composite in a function tagged
  with the backend name -- the tag ``RunCodegen`` routes on.  (The follow-up
  ``MergeCompositeFunctions`` also attaches this tag when it groups composites,
  which is why ``partition_for_tensorrt`` below can leave the flag off.)

The example NPU registers a fused ``conv2d + relu`` pattern with higher
priority than the standalone ``conv2d`` pattern, so the two ops collapse into a
single ``example_npu.conv2d_relu_fused`` composite -- look for it in the
printed module.
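
To see why priority matters, here is a toy matcher over a flat list of op
names.  It is only an illustration: the ``toy_npu`` names and the numeric
priority field are assumptions of this sketch, and TVM's matcher works on
dataflow graphs with its own ordering rules.

.. code-block:: Python

    # (composite name, op sequence, priority); all three fields are illustrative.
    PATTERNS = [
        ("toy_npu.conv2d", ("conv2d",), 1),
        ("toy_npu.relu", ("relu",), 1),
        ("toy_npu.conv2d_relu_fused", ("conv2d", "relu"), 2),
    ]


    def match_composites(ops):
        """Greedily cover a linear op list, trying higher-priority patterns first."""
        ordered = sorted(PATTERNS, key=lambda p: p[2], reverse=True)
        composites, i = [], 0
        while i < len(ops):
            for name, seq, _priority in ordered:
                if tuple(ops[i : i + len(seq)]) == seq:
                    composites.append(name)
                    i += len(seq)
                    break
            else:
                # No pattern matched: the op stays on the host.
                composites.append(f"host.{ops[i]}")
                i += 1
        return composites


    fused = match_composites(["conv2d", "relu"])
    unfused = match_composites(["relu", "conv2d", "add"])
    print(fused)
    print(unfused)

Because the fused pattern is tried first, ``conv2d`` followed by ``relu``
becomes a single composite; in the other order the two ops only match their
standalone patterns.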

.. GENERATED FROM PYTHON SOURCE LINES 145-152

.. code-block:: Python


    npu_patterns = get_patterns_with_prefix("example_npu")
    npu_mod = FuseOpsByPattern(npu_patterns, bind_constants=False, annotate_codegen=True)(ConvReLU)
    npu_mod = MergeCompositeFunctions()(npu_mod)
    print("After partitioning for the example NPU:")
    print(npu_mod)





.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    After partitioning for the example NPU:
    # from tvm.script import ir as I
    # from tvm.script import relax as R

    @I.ir_module
    class Module:
        @R.function
        def fused_relax_nn_conv2d_relax_nn_relu_example_npu_example_npu(data: R.Tensor((1, 3, 32, 32), dtype="float32"), weight: R.Tensor((16, 3, 3, 3), dtype="float32")) -> R.Tensor((1, 16, 30, 30), dtype="float32"):
            R.func_attr({"Codegen": "example_npu"})
            # from tvm.script import relax as R
        
            @R.function
            def local_func(data_1: R.Tensor((1, 3, 32, 32), dtype="float32"), weight_1: R.Tensor((16, 3, 3, 3), dtype="float32")) -> R.Tensor((1, 16, 30, 30), dtype="float32"):
                R.func_attr({"Composite": "example_npu.conv2d_relu_fused"})
                conv: R.Tensor((1, 16, 30, 30), dtype="float32") = R.nn.conv2d(data_1, weight_1, strides=[1, 1], padding=[0, 0, 0, 0], dilation=[1, 1], groups=1, data_layout="NCHW", kernel_layout="OIHW", out_layout="NCHW", out_dtype=None)
                gv: R.Tensor((1, 16, 30, 30), dtype="float32") = R.nn.relu(conv)
                return gv

            output: R.Tensor((1, 16, 30, 30), dtype="float32") = local_func(data, weight)
            return output

        @R.function
        def main(data: R.Tensor((1, 3, 32, 32), dtype="float32"), weight: R.Tensor((16, 3, 3, 3), dtype="float32")) -> R.Tensor((1, 16, 30, 30), dtype="float32"):
            cls = Module
            with R.dataflow():
                gv: R.Tensor((1, 16, 30, 30), dtype="float32") = cls.fused_relax_nn_conv2d_relax_nn_relu_example_npu_example_npu(data, weight)
                R.output(gv)
            return gv




.. GENERATED FROM PYTHON SOURCE LINES 153-161

Step 4: Codegen, build and run on the example NPU
-------------------------------------------------

``RunCodegen`` invokes each annotated composite's backend codegen, replacing it
with the backend runtime module (here, the NPU's JSON graph); ``relax.build``
then compiles the remaining host-side program and links everything.  Because
the runtime is a stub that computes nothing, we assert on the output *shape*
only -- the values are uninitialized.
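
The asserted shape can be derived by hand.  Assuming the attributes shown in
the printed module (stride 1, no padding, dilation 1), a sketch of the standard
convolution size arithmetic is below; ``conv2d_out_size`` and
``conv2d_out_shape`` are helper names invented for this example.

.. code-block:: Python

    def conv2d_out_size(size, kernel, stride=1, padding=0, dilation=1):
        """Spatial output size of a convolution along one axis."""
        return (size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1


    def conv2d_out_shape(data_shape, weight_shape, **kwargs):
        """Output shape for NCHW data and OIHW weights."""
        batch, _in_channels, height, width = data_shape
        out_channels, _, kernel_h, kernel_w = weight_shape
        return (
            batch,
            out_channels,
            conv2d_out_size(height, kernel_h, **kwargs),
            conv2d_out_size(width, kernel_w, **kwargs),
        )


    expected_shape = conv2d_out_shape((1, 3, 32, 32), (16, 3, 3, 3))
    print(expected_shape)  # (1, 16, 30, 30)

The ReLU is elementwise, so it leaves this shape unchanged.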

.. GENERATED FROM PYTHON SOURCE LINES 161-181

.. code-block:: Python


    np.random.seed(0)
    data_np = np.random.randn(1, 3, 32, 32).astype("float32")
    weight_np = np.random.randn(16, 3, 3, 3).astype("float32")

    if has_example_npu:
        npu_mod = RunCodegen()(npu_mod)

        with tvm.transform.PassContext(opt_level=3):
            npu_exec = relax.build(npu_mod, tvm.target.Target("llvm"))

        npu_vm = relax.VirtualMachine(npu_exec, tvm.cpu())
        npu_out = npu_vm["main"](
            tvm.runtime.tensor(data_np, tvm.cpu()), tvm.runtime.tensor(weight_np, tvm.cpu())
        )
        assert npu_out.numpy().shape == (1, 16, 30, 30)
        print("Example NPU run completed. Output shape:", npu_out.numpy().shape)
    else:
        print("Example NPU backend unavailable; skipping its build and run.")





.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Example NPU backend unavailable; skipping its build and run.




.. GENERATED FROM PYTHON SOURCE LINES 182-196

The same flow on a real backend: TensorRT
-----------------------------------------

Steps 1-4 above are the whole mechanism.  Aiming them at a real backend
changes very little, so rather than repeat the walkthrough, here is only what
differs for NVIDIA TensorRT:

- **Partition in one call.** ``partition_for_tensorrt`` bundles the
  ``FuseOpsByPattern`` + ``MergeCompositeFunctions`` you ran by hand, using
  TensorRT's own pattern table.
- **Weights become constants** (``bind_constants=True``): TensorRT bakes them
  into the engine it builds, so bind the parameters before partitioning.
- **Real values.** TensorRT actually computes, so we build for CUDA, run on
  the GPU, and cross-check against a plain CPU build -- not just the shape.

.. GENERATED FROM PYTHON SOURCE LINES 196-202

.. code-block:: Python


    trt_mod = relax.transform.BindParams("main", {"weight": weight_np})(ConvReLU)
    trt_mod = partition_for_tensorrt(trt_mod)
    print("After partition_for_tensorrt:")
    print(trt_mod)





.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    After partition_for_tensorrt:
    # from tvm.script import ir as I
    # from tvm.script import relax as R

    @I.ir_module
    class Module:
        @R.function
        def fused_relax_nn_conv2d_relax_nn_relu_tensorrt(data: R.Tensor((1, 3, 32, 32), dtype="float32")) -> R.Tensor((1, 16, 30, 30), dtype="float32"):
            R.func_attr({"Codegen": "tensorrt"})
            # from tvm.script import relax as R
        
            @R.function
            def gv(data_1: R.Tensor((1, 3, 32, 32), dtype="float32")) -> R.Tensor((1, 16, 30, 30), dtype="float32"):
                R.func_attr({"Composite": "tensorrt.nn.conv2d"})
                with R.dataflow():
                    gv_1: R.Tensor((1, 16, 30, 30), dtype="float32") = R.nn.conv2d(data_1, metadata["relax.expr.Constant"][0], strides=[1, 1], padding=[0, 0, 0, 0], dilation=[1, 1], groups=1, data_layout="NCHW", kernel_layout="OIHW", out_layout="NCHW", out_dtype=None)
                    R.output(gv_1)
                return gv_1

            lv: R.Tensor((1, 16, 30, 30), dtype="float32") = gv(data)
            # from tvm.script import relax as R
        
            @R.function
            def gv1(lv_1: R.Tensor((1, 16, 30, 30), dtype="float32")) -> R.Tensor((1, 16, 30, 30), dtype="float32"):
                R.func_attr({"Composite": "tensorrt.nn.relu"})
                with R.dataflow():
                    gv_1: R.Tensor((1, 16, 30, 30), dtype="float32") = R.nn.relu(lv_1)
                    R.output(gv_1)
                return gv_1

            gv_1: R.Tensor((1, 16, 30, 30), dtype="float32") = gv1(lv)
            return gv_1

        @R.function
        def main(data: R.Tensor((1, 3, 32, 32), dtype="float32")) -> R.Tensor((1, 16, 30, 30), dtype="float32"):
            cls = Module
            with R.dataflow():
                gv: R.Tensor((1, 16, 30, 30), dtype="float32") = cls.fused_relax_nn_conv2d_relax_nn_relu_tensorrt(data)
                R.output(gv)
            return gv

    # Metadata omitted. Use show_meta=True in script() method to show it.




.. GENERATED FROM PYTHON SOURCE LINES 203-204

Build for CUDA, run on the GPU, and compare against the CPU reference.
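
The comparison is an elementwise tolerance test rather than exact equality.
The sketch below restates, in plain Python, the criterion documented for
``numpy.testing.assert_allclose``; ``within_tolerance`` is a hypothetical
helper, not part of NumPy or TVM.

.. code-block:: Python

    def within_tolerance(actual, desired, rtol, atol):
        """Elementwise check: |actual - desired| <= atol + rtol * |desired|."""
        return all(
            abs(a - d) <= atol + rtol * abs(d)
            for a, d in zip(actual, desired)
        )


    reference = [0.0, 1.0, 100.0]
    close = [0.005, 1.015, 100.9]  # errors of about 0.005, 0.015 and 0.9
    far = [0.0, 1.0, 102.0]  # error of 2.0 on the last element

    passes = within_tolerance(close, reference, rtol=1e-2, atol=1e-2)
    fails = within_tolerance(far, reference, rtol=1e-2, atol=1e-2)
    print(passes, fails)

The absolute term dominates near zero and the relative term dominates for large
values, which is why an error of 0.9 is accepted next to 100 but would not be
next to 1.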

.. GENERATED FROM PYTHON SOURCE LINES 204-224

.. code-block:: Python


    if has_tensorrt and has_cuda:
        dev = tvm.cuda(0)
        with tvm.transform.PassContext(opt_level=3):
            trt_exec = relax.build(RunCodegen()(trt_mod), "cuda")
        trt_out = relax.VirtualMachine(trt_exec, dev)["main"](tvm.runtime.tensor(data_np, dev)).numpy()

        cpu_mod = relax.transform.LegalizeOps()(
            relax.transform.BindParams("main", {"weight": weight_np})(ConvReLU)
        )
        cpu_exec = relax.build(cpu_mod, "llvm")
        cpu_out = relax.VirtualMachine(cpu_exec, tvm.cpu())["main"](
            tvm.runtime.tensor(data_np, tvm.cpu())
        ).numpy()

        np.testing.assert_allclose(trt_out, cpu_out, rtol=1e-2, atol=1e-2)
        print("TensorRT output shape:", trt_out.shape, "- matches the CPU reference.")
    else:
        print("TensorRT/CUDA unavailable; skipping the GPU build and run.")





.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    TensorRT/CUDA unavailable; skipping the GPU build and run.




.. GENERATED FROM PYTHON SOURCE LINES 225-232

A real backend also exposes knobs the stub does not.  Setting ``use_fp16``
through the ``relax.ext.tensorrt.options`` config lets TensorRT pick FP16
kernels, trading a little accuracy for speed; nothing else about the flow
changes.  (Other options are environment-driven: ``TVM_TENSORRT_USE_INT8``
enables INT8 with calibration, ``TVM_TENSORRT_MAX_WORKSPACE_SIZE`` caps the
build workspace, and ``TVM_TENSORRT_CACHE_DIR`` caches built engines to disk
for reuse across runs.)
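
Because these options are read from the environment, one way to keep them from
leaking into unrelated code is to set them only around the build and run.  The
helper below is a sketch: ``scoped_env`` and the cache path are hypothetical,
and it assumes the integration reads the variable at some point during the
build or first run, which this tutorial does not spell out.

.. code-block:: Python

    import contextlib
    import os
    import tempfile


    @contextlib.contextmanager
    def scoped_env(**overrides):
        """Temporarily set environment variables, restoring old values on exit."""
        saved = {name: os.environ.get(name) for name in overrides}
        os.environ.update(overrides)
        try:
            yield
        finally:
            for name, old in saved.items():
                if old is None:
                    os.environ.pop(name, None)
                else:
                    os.environ[name] = old


    # Hypothetical cache location; any writable directory would do.
    cache_dir = os.path.join(tempfile.gettempdir(), "tvm_trt_engine_cache")
    previous = os.environ.get("TVM_TENSORRT_CACHE_DIR")

    with scoped_env(TVM_TENSORRT_CACHE_DIR=cache_dir):
        # The TensorRT build and run would go here.
        seen_inside = os.environ["TVM_TENSORRT_CACHE_DIR"]

    seen_after = os.environ.get("TVM_TENSORRT_CACHE_DIR")

Exporting the variable in the shell before starting Python is the simpler
alternative when the setting should apply to the whole process.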

.. GENERATED FROM PYTHON SOURCE LINES 232-250

.. code-block:: Python


    if has_tensorrt and has_cuda:
        fp16_mod = partition_for_tensorrt(
            relax.transform.BindParams("main", {"weight": weight_np})(ConvReLU)
        )
        with tvm.transform.PassContext(
            opt_level=3, config={"relax.ext.tensorrt.options": {"use_fp16": True}}
        ):
            fp16_exec = relax.build(RunCodegen()(fp16_mod), "cuda")
        fp16_out = relax.VirtualMachine(fp16_exec, tvm.cuda(0))["main"](
            tvm.runtime.tensor(data_np, tvm.cuda(0))
        ).numpy()

        np.testing.assert_allclose(fp16_out, cpu_out, rtol=5e-2, atol=5e-2)
        print("TensorRT FP16 output shape:", fp16_out.shape, "- matches within FP16 tolerance.")
    else:
        print("TensorRT/CUDA unavailable; skipping the FP16 build.")





.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    TensorRT/CUDA unavailable; skipping the FP16 build.




.. GENERATED FROM PYTHON SOURCE LINES 251-264

Example NPU vs TensorRT at a glance
-----------------------------------

The same four-step flow, two backends:

=========  ==============================  ==================================
Aspect     Example NPU (teaching stub)     TensorRT (real backend)
=========  ==============================  ==================================
Runtime    logs decisions, no compute      builds and runs an nvinfer engine
Output     uninitialized (check shape)     real values (cross-checked vs CPU)
Weights    ``bind_constants=False``        ``bind_constants=True`` (baked in)
Partition  two passes, by hand             ``partition_for_tensorrt`` one call
=========  ==============================  ==================================

.. GENERATED FROM PYTHON SOURCE LINES 266-280

Deploying a PyTorch model with TensorRT
---------------------------------------

Everything above used a hand-written ``IRModule`` so each op was visible.  In
practice you start from a trained model.  This final section runs the *same*
``partition_for_tensorrt`` flow on a real PyTorch ``nn.Module``, end to end:
export it, import it into Relax with the PyTorch frontend (the weights come in
as constants -- exactly what TensorRT bakes into its engine), partition, build
for CUDA, and check the GPU result against PyTorch's own output.  Beyond the
frontend import, the only difference is that the imported program returns its
outputs as a tuple, so we index ``[0]`` for the single result tensor; the
partition-build-run flow is otherwise unchanged.

This section additionally requires PyTorch.

.. GENERATED FROM PYTHON SOURCE LINES 280-328

.. code-block:: Python


    try:
        import torch
        from torch import nn

        has_torch = True
    except ImportError:
        has_torch = False

    if has_torch and has_tensorrt and has_cuda:
        from tvm.relax.frontend.torch import from_exported_program

        class SmallConvNet(nn.Module):
            def __init__(self):
                super().__init__()
                self.conv1 = nn.Conv2d(3, 8, 3)
                self.conv2 = nn.Conv2d(8, 16, 3)
                self.pool = nn.MaxPool2d(2)

            def forward(self, x):
                x = torch.relu(self.conv1(x))
                x = self.pool(x)
                x = torch.relu(self.conv2(x))
                return x

        torch_model = SmallConvNet().eval()
        example_input = torch.randn(1, 3, 32, 32)
        with torch.no_grad():
            torch_ref = torch_model(example_input).numpy()
            exported = torch.export.export(torch_model, (example_input,))

        torch_mod = from_exported_program(exported)
        torch_mod = partition_for_tensorrt(torch_mod)
        print("After importing and partitioning the PyTorch model:")
        print(torch_mod)

        torch_dev = tvm.cuda(0)
        with tvm.transform.PassContext(opt_level=3):
            torch_exec = relax.build(RunCodegen()(torch_mod), "cuda")
        deployed = relax.VirtualMachine(torch_exec, torch_dev)["main"](
            tvm.runtime.tensor(example_input.numpy(), torch_dev)
        )[0].numpy()

        np.testing.assert_allclose(deployed, torch_ref, rtol=1e-2, atol=1e-2)
        print("Deployed PyTorch model on TensorRT; output", deployed.shape, "matches PyTorch.")
    else:
        print("PyTorch / TensorRT / CUDA unavailable; skipping the deployment example.")





.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    PyTorch / TensorRT / CUDA unavailable; skipping the deployment example.




.. GENERATED FROM PYTHON SOURCE LINES 329-332

Real deployment builds once and reuses the artifact.  Export the compiled
module to a shared library, then load and run it later -- in a fresh process,
with no PyTorch and no rebuild needed.

.. GENERATED FROM PYTHON SOURCE LINES 332-346

.. code-block:: Python


    if has_torch and has_tensorrt and has_cuda:
        with tempfile.TemporaryDirectory() as tmpdir:
            lib_path = os.path.join(tmpdir, "deployed_trt.so")
            torch_exec.export_library(lib_path)
            loaded = tvm.runtime.load_module(lib_path)
            reran = relax.VirtualMachine(loaded, torch_dev)["main"](
                tvm.runtime.tensor(example_input.numpy(), torch_dev)
            )[0].numpy()
            np.testing.assert_allclose(reran, torch_ref, rtol=1e-2, atol=1e-2)
            print("Reloaded the exported library and reran; output", reran.shape, "still matches.")
    else:
        print("PyTorch / TensorRT / CUDA unavailable; skipping the export/reload step.")





.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    PyTorch / TensorRT / CUDA unavailable; skipping the export/reload step.




.. GENERATED FROM PYTHON SOURCE LINES 347-360

Notes for real deployments
--------------------------

- **Operator coverage and fallback.** TensorRT offloads only the ops in its
  pattern table (see ``python/tvm/relax/backend/contrib/tensorrt.py``);
  anything unsupported simply stays on the host.  Print the partitioned module
  and look for the ``Codegen: "tensorrt"`` functions to see what was offloaded.
- **Dynamic shapes.** The builder sets up an optimization profile for a dynamic
  leading (batch) dimension, so the integration can serve a model exported with
  a symbolic batch size.
- **Engine build cost.** Building a TensorRT engine is slow the first time (it
  is not a hang).  Set ``TVM_TENSORRT_CACHE_DIR`` to cache built engines to
  disk and skip the rebuild on later runs.
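
One quick way to follow the first note is to scan the printed module for its
``Codegen`` and ``Composite`` attributes.  The sketch below does this with a
regular expression over the text; ``offloaded_composites`` is a hypothetical
helper, and it assumes each attribute is printed as its own single-key
``R.func_attr`` call, as in the output shown earlier in this tutorial.

.. code-block:: Python

    import re

    ATTR_RE = re.compile(r'R\.func_attr\(\{"(Codegen|Composite)": "([^"]+)"\}\)')


    def offloaded_composites(module_text):
        """Map each codegen target to the composite names printed beneath it."""
        offloaded, current = {}, None
        for kind, value in ATTR_RE.findall(module_text):
            if kind == "Codegen":
                current = value
                offloaded.setdefault(current, [])
            elif current is not None:
                offloaded[current].append(value)
        return offloaded


    # Attribute lines excerpted from the TensorRT module printed earlier.
    sample = '''
    R.func_attr({"Codegen": "tensorrt"})
    R.func_attr({"Composite": "tensorrt.nn.conv2d"})
    R.func_attr({"Composite": "tensorrt.nn.relu"})
    '''
    summary = offloaded_composites(sample)
    print(summary)

On a real module, ``offloaded_composites(str(trt_mod))`` would be the analogous
call; anything absent from the result stayed on the host.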

.. GENERATED FROM PYTHON SOURCE LINES 362-381

Next steps
----------

To build your own backend using the example NPU as a starting point:

- Replace the stub runtime in
  ``src/runtime/extra/contrib/example_npu/example_npu_runtime.cc`` with your
  hardware SDK calls.
- Extend ``patterns.py`` with the ops your hardware supports.
- Add a C++ codegen under ``src/relax/backend/contrib/`` if your backend needs
  a non-JSON serialization format.
- Add a CMake module under ``cmake/modules/contrib/`` following
  ``ExampleNPU.cmake``.

For a complete real-backend implementation to study, see the TensorRT
integration: the pattern table and ``partition_for_tensorrt`` in
``python/tvm/relax/backend/contrib/tensorrt.py``, the codegen in
``src/relax/backend/contrib/tensorrt/``, and the runtime in
``src/runtime/extra/contrib/tensorrt/``.


.. _sphx_glr_download_how_to_tutorials_bring_your_own_codegen.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: bring_your_own_codegen.ipynb <bring_your_own_codegen.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: bring_your_own_codegen.py <bring_your_own_codegen.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: bring_your_own_codegen.zip <bring_your_own_codegen.zip>`
