..  Licensed to the Apache Software Foundation (ASF) under one
    or more contributor license agreements.  See the NOTICE file
    distributed with this work for additional information
    regarding copyright ownership.  The ASF licenses this file
    to you under the Apache License, Version 2.0 (the
    "License"); you may not use this file except in compliance
    with the License.  You may obtain a copy of the License at

..    http://www.apache.org/licenses/LICENSE-2.0

..  Unless required by applicable law or agreed to in writing,
    software distributed under the License is distributed on an
    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    KIND, either express or implied.  See the License for the
    specific language governing permissions and limitations
    under the License.

.. _relax-vm-arch:

Relax Virtual Machine
=====================

This document explains the Relax VM architecture in detail, covering the compilation pipeline
from Relax IR to bytecode, the instruction set, the execution model, and the Python-level user
interface.

Overview
--------

The end-to-end flow from model to execution is:

1. **Relax IR** — a high-level computational graph (``relax.Function`` inside an ``IRModule``).
2. **Compilation** — ``tvm.compile()`` applies the Relax transformation pipeline, then invokes
   ``VMCodeGen`` to translate each Relax function into bytecode instructions.
3. **Linking** — TIR functions are compiled to native kernels (via LLVM, CUDA, etc.); the bytecode,
   constant pool, and compiled kernels are packaged together into a ``VMExecutable``.
4. **Execution** — at runtime, a ``VirtualMachine`` loads the executable, initializes devices and
   memory allocators, and runs the bytecode.

.. code-block:: text

   IRModule (Relax + TIR)
        │
        ▼  relax_pipeline (FuseOps, LegalizeOps, ...)
   IRModule (optimized)
        │
        ▼  VMCodeGen
   ExecBuilder (bytecode) + IRModule (TIR only)
        │                        │
        │                        ▼  tirx.build()
        │                   runtime.Module (native kernels)
        │                        │
        ▼  VMLink               ▼
   VMExecutable ◄───────── linked together
        │
        ▼  VirtualMachine(exec, device)
   Runtime execution


Compilation: From Relax IR to Bytecode
--------------------------------------

Build entry point
~~~~~~~~~~~~~~~~~

The main entry point is ``tvm.compile()`` (which delegates to ``relax.build()`` in
``python/tvm/relax/vm_build.py``):

.. code-block:: python

   import tvm
   from tvm import relax
   from tvm.script import relax as R

   @tvm.script.ir_module
   class MyModule:
       @R.function
       def main(x: R.Tensor((3, 4), "float32")):
           return R.add(x, x)

   target = tvm.target.Target("llvm")
   ex = tvm.compile(MyModule, target)

Internally, ``relax.build()`` performs these steps:

1. Apply the **Relax pipeline** (``relax.get_pipeline("default")``), which includes operator
   legalization, fusion, buffer planning, and other graph-level passes.
2. Create an ``ExecBuilder`` and run **VMCodeGen** (``src/relax/backend/vm/codegen_vm.cc``),
   which walks each ``relax.Function`` and emits bytecode instructions. The Relax functions are
   removed from the IRModule; only TIR functions remain.
3. Compile the remaining TIR functions to native code via ``tirx.build()``.
4. **Link** the bytecode executable with the compiled native module using ``VMLink``, producing
   a ``VMExecutable``.

Two execution modes are supported:

- ``exec_mode="bytecode"`` (default): Relax functions are interpreted by the VM's bytecode
  dispatch loop.
- ``exec_mode="compiled"``: Relax functions are compiled into TIR functions (``VMTIRCodeGen``)
  that directly manipulate the register file, bypassing the interpreter loop. This avoids
  dispatch overhead but produces more code.

Bytecode generation
~~~~~~~~~~~~~~~~~~~

The ``CodeGenVM`` class (``src/relax/backend/vm/codegen_vm.cc``) is an ``ExprFunctor`` that visits
each Relax expression and emits instructions through the ``ExecBuilder``:

- Each ``relax.Var`` is mapped to a register.
- Function parameters occupy registers 0 through N-1.
- Each binding in a ``SeqExpr`` generates one or more instructions; the result is stored in a
  new register.
- Function calls (``R.call_tir``, ``R.call_packed``, operator calls) become ``Call`` instructions.
- Conditional expressions (``relax.If``, written as Python ``if`` in TVMScript) become an ``If``
  instruction; a ``Goto`` at the end of the true branch jumps over the false branch.
- The function body ends with a ``Ret`` instruction.
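
As a rough illustration of the register-assignment rules above, the following sketch lowers a
list of bindings to a toy instruction list. It is not the real ``CodeGenVM``: ``ToyBuilder``,
``toy_codegen``, and the tuple-based instruction format are hypothetical stand-ins, and only
plain calls are modeled.

.. code-block:: python

   from dataclasses import dataclass, field


   @dataclass
   class ToyBuilder:
       """Hypothetical stand-in for ExecBuilder: collects instructions as tuples."""

       instrs: list = field(default_factory=list)

       def emit_call(self, dst, func, args):
           self.instrs.append(("Call", dst, func, tuple(args)))

       def emit_ret(self, result):
           self.instrs.append(("Ret", result))


   def toy_codegen(params, bindings, result_var):
       """Lower bindings of the form (var, func_name, arg_vars) to toy bytecode."""
       builder = ToyBuilder()
       # Function parameters occupy registers 0 through N-1.
       reg_of = {name: idx for idx, name in enumerate(params)}
       for var, func, args in bindings:
           dst = len(reg_of)  # each binding stores its result in a new register
           builder.emit_call(dst, func, [reg_of[a] for a in args])
           reg_of[var] = dst
       # The function body ends with a Ret instruction.
       builder.emit_ret(reg_of[result_var])
       return builder.instrs, len(reg_of)


   instrs, num_regs = toy_codegen(
       params=["x", "y"],
       bindings=[("lv0", "add", ["x", "y"]), ("lv1", "multiply", ["lv0", "x"])],
       result_var="lv1",
   )

Here ``instrs`` holds two ``Call`` tuples followed by a ``Ret``, and ``num_regs`` (4) plays the
role of the register file size recorded for the function.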


Instruction Set
---------------

The VM uses a **register-based** architecture with an intentionally minimal instruction set.
There are only four opcodes:

.. list-table::
   :header-rows: 1
   :widths: 15 30 55

   * - Opcode
     - Fields
     - Semantics
   * - ``Call``
     - ``dst``, ``func_idx``, ``num_args``, ``args[]``
     - Call function ``func_idx`` with the given arguments; store the result in register ``dst``.
   * - ``Ret``
     - ``result``
     - Return the value in register ``result`` to the caller.
   * - ``Goto``
     - ``pc_offset``
     - Jump forward or backward by ``pc_offset`` instructions.
   * - ``If``
     - ``cond``, ``false_offset``
     - If register ``cond`` is nonzero, fall through (pc++); otherwise jump by ``false_offset``.

The VM itself performs **no mathematical computation**. All actual work — matrix multiplications,
convolutions, elementwise operations — is carried out by compiled TIR kernels or external
libraries (cuBLAS, cuDNN, etc.), dispatched through ``Call`` instructions.

Instruction encoding
~~~~~~~~~~~~~~~~~~~~

Each instruction argument (``Instruction::Arg``) is a 64-bit word encoded as:

- **Bits [63:56]** — ``ArgKind`` (8 bits): ``kRegister`` (0), ``kImmediate`` (1), ``kConstIdx`` (2),
  or ``kFuncIdx`` (3).
- **Bits [55:0]** — value (56 bits, sign-extended).
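
The bit layout above can be checked with a small sketch. The helper names ``encode_arg`` and
``decode_arg`` are hypothetical; only the 8-bit kind, the 56-bit sign-extended value, and the
``ArgKind`` numbers are taken from the description above.

.. code-block:: python

   KIND_BITS = 8
   VALUE_BITS = 64 - KIND_BITS  # 56
   VALUE_MASK = (1 << VALUE_BITS) - 1

   # ArgKind values as listed above.
   K_REGISTER, K_IMMEDIATE, K_CONST_IDX, K_FUNC_IDX = 0, 1, 2, 3


   def encode_arg(kind: int, value: int) -> int:
       """Pack an ArgKind and a signed 56-bit value into one 64-bit word."""
       limit = 1 << (VALUE_BITS - 1)
       if not -limit <= value < limit:
           raise ValueError("value does not fit in 56 signed bits")
       return ((kind & 0xFF) << VALUE_BITS) | (value & VALUE_MASK)


   def decode_arg(word: int) -> tuple:
       """Split a 64-bit word into (kind, value), sign-extending the value."""
       kind = (word >> VALUE_BITS) & 0xFF
       value = word & VALUE_MASK
       if value & (1 << (VALUE_BITS - 1)):  # bit 55 set: negative value
           value -= 1 << VALUE_BITS
       return kind, value


   word = encode_arg(K_IMMEDIATE, -1)
   kind, value = decode_arg(word)

Encoding the immediate ``-1`` yields ``0x01FFFFFFFFFFFFFF``: the top byte carries
``kImmediate`` and the low 56 bits hold the two's-complement value.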

Two special register values exist:

- ``kVoidRegister``: indicates "no destination" (the return value is discarded).
- ``kVMRegister``: refers to the VM context pointer itself, passed as the first argument to
  closures.

The instruction stream is stored as a flat ``vector<ExecWord>`` (``instr_data``) with an offset
table (``instr_offset``) for random access.
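
The role of the offset table can be sketched as follows. This is a simplified model: the
opcode numbers, the per-instruction word layout, and the helpers ``pack`` and ``fetch`` are
assumptions made for illustration, not the exact layout used by the executable.

.. code-block:: python

   # Illustrative opcode numbers; treat them as placeholders.
   CALL, RET, GOTO, IF = 1, 2, 3, 4


   def pack(instructions):
       """Flatten variable-length instructions into instr_data + instr_offset."""
       instr_data, instr_offset = [], []
       for words in instructions:
           instr_offset.append(len(instr_data))  # where this instruction starts
           instr_data.extend(words)
       return instr_data, instr_offset


   def fetch(instr_data, instr_offset, index):
       """Random access: return the words of instruction number `index`."""
       start = instr_offset[index]
       if index + 1 < len(instr_offset):
           end = instr_offset[index + 1]
       else:
           end = len(instr_data)
       return instr_data[start:end]


   program = [
       [CALL, 2, 7, 2, 0, 1],  # dst, func_idx, num_args, args[]
       [IF, 2, 2],             # cond, false_offset
       [RET, 2],               # result
   ]
   instr_data, instr_offset = pack(program)
   second = fetch(instr_data, instr_offset, 1)

Because ``Call`` carries a variable number of arguments, instructions differ in length; the
offset table lets the program counter index instructions rather than raw words.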


Executable
----------

A ``VMExecutable`` (``include/tvm/runtime/vm/executable.h``) bundles everything needed for
execution:

- **Function table** (``func_table``): a ``vector<VMFuncInfo>`` describing every function. Each
  entry records the function's kind, name, instruction range (``start_instr`` to ``end_instr``),
  number of arguments, register file size, and parameter names.
- **Constant pool** (``constants``): model weights, shape tuples, and other compile-time constants.
- **Bytecode** (``instr_data`` + ``instr_offset``): the instruction stream.
- **Imported modules**: the compiled TIR kernels and external libraries.

Function kinds
~~~~~~~~~~~~~~

The VM recognizes three function kinds (``VMFuncInfo::FuncKind``):

.. list-table::
   :header-rows: 1
   :widths: 20 80

   * - Kind
     - Description
   * - ``kPackedFunc``
     - An external C/C++ function looked up from imported modules or the global PackedFunc
       registry. Examples: ``vm.builtin.alloc_shape_heap``, ``vm.builtin.match_shape``.
   * - ``kVMFunc``
     - A bytecode-interpreted Relax function. The VM interprets its instructions in ``RunLoop()``.
   * - ``kVMTIRFunc``
     - A Relax function compiled to a TIR function (``exec_mode="compiled"``). Found in
       imports under the name ``__vmtir__<func_name>``. Called directly with register file
       pointers, bypassing the interpreter loop.

Serialization
~~~~~~~~~~~~~

The executable supports binary serialization for deployment:

.. code-block:: python

   # Save
   ex.export_library("model.so")

   # Load
   loaded = tvm.runtime.load_module("model.so")
   vm = relax.VirtualMachine(loaded, tvm.cuda())

The binary format includes a magic number (``0xD225DE2F4214151E``), a version string
(currently ``"0.14"``), followed by four sections: globals (the function table), memory scopes,
constant pool, and bytecode. ``AsText()`` and ``AsPython()`` provide human-readable representations
for debugging.
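
A loader typically validates the header before reading the sections. The sketch below shows
that check in isolation. The magic number and version string are the ones quoted above; the
little-endian byte order, the 64-bit length prefix on the version string, and the helper names
``write_header`` and ``check_header`` are assumptions, so the real on-disk layout may differ.

.. code-block:: python

   import io
   import struct

   MAGIC = 0xD225DE2F4214151E  # magic number quoted above
   VERSION = "0.14"            # version string quoted above


   def write_header(stream):
       """Write the magic number, then a length-prefixed version string."""
       stream.write(struct.pack("<Q", MAGIC))
       raw = VERSION.encode("utf-8")
       stream.write(struct.pack("<Q", len(raw)))  # assumed length prefix
       stream.write(raw)


   def check_header(stream):
       """Validate magic and version; return the version string on success."""
       (magic,) = struct.unpack("<Q", stream.read(8))
       if magic != MAGIC:
           raise ValueError(f"unexpected magic number: {magic:#x}")
       (length,) = struct.unpack("<Q", stream.read(8))
       version = stream.read(length).decode("utf-8")
       if version != VERSION:
           raise ValueError(f"unsupported bytecode version: {version}")
       return version


   buf = io.BytesIO()
   write_header(buf)
   buf.seek(0)
   version = check_header(buf)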


Runtime Execution
-----------------

VM initialization
~~~~~~~~~~~~~~~~~

At runtime, a ``VirtualMachine`` is created and initialized:

.. code-block:: python

   import tvm
   from tvm.relax import VirtualMachine

   # exec_module is a compiled executable, e.g. the result of tvm.compile()
   vm = VirtualMachine(exec_module, tvm.cuda())

Under the hood:

1. **LoadExecutable**: the bytecode and metadata are loaded from the ``VMExecutable``.
2. **Init**: devices and memory allocators are set up. Each device gets an ``Allocator``
   (either ``NAIVE_ALLOCATOR`` or ``POOLED_ALLOCATOR``, defaulting to pooled). A CPU device
   is always added for shape computations.
3. **InitFuncPool**: the function pool is populated — ``kPackedFunc`` entries are resolved from
   imports or the global registry; ``kVMFunc`` and ``kVMTIRFunc`` entries are wrapped in
   ``VMClosure`` objects.
4. **Constant pool**: model constants are loaded and optionally transferred to the target device.

The bytecode dispatch loop
~~~~~~~~~~~~~~~~~~~~~~~~~~

When a ``kVMFunc`` is invoked, the VM enters ``InvokeBytecode()``:

1. A new ``VMFrame`` is pushed onto the call stack. Each frame contains:

   - A **register file** (``vector<ffi::Any>``) — type-erased slots that can hold tensors,
     shapes, closures, or any TVM object. The size is determined at compile time
     (``VMFuncInfo::register_file_size``).
   - The **return program counter** — where to resume after the function returns.
   - The **caller's return register** — which register in the parent frame receives the result.

2. Function arguments are written to registers 0..N-1.
3. The program counter (``pc_``) is set to the function's ``start_instr``.
4. ``RunLoop()`` executes instructions until a ``Ret`` is encountered:

   - **Call**: resolve arguments (from registers, immediates, constant pool, or function pool),
     invoke the target function via ``InvokeClosurePacked()``, store the result in ``dst``.
   - **Ret**: read the return value from the specified register, write the result to the
     caller's return register, and return from ``RunLoop()`` (the frame is popped by an RAII
     guard when ``InvokeBytecode()`` exits).
   - **Goto**: adjust ``pc_`` by the offset.
   - **If**: check the condition register; if nonzero, fall through; otherwise jump by
     ``false_offset``.

The dispatch loop is implemented in ``src/runtime/vm/vm.cc`` (``VirtualMachineImpl::RunLoop``).

.. code-block:: text

   Frame Stack              Register File (per frame)
   ┌─────────────┐          ┌────┬────┬────┬─────┬────┐
   │  Frame 2    │ ───────► │ R0 │ R1 │ R2 │ ... │ Rn │
   ├─────────────┤          └────┴────┴────┴─────┴────┘
   │  Frame 1    │ ───────► [register file]
   ├─────────────┤
   │  Frame 0    │ ───────► [register file]
   └─────────────┘
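
The dispatch loop described above can be modeled in a few lines of Python. This sketch is a
simplification and not the code in ``vm.cc``: it runs a single function with one register file,
so no frames are pushed, and ``run``, the tuple instruction format, and the entries of
``packed_funcs`` are hypothetical.

.. code-block:: python

   def run(instrs, packed_funcs, args, register_file_size):
       """Interpret one toy function and return the value of its Ret."""
       regs = [None] * register_file_size
       regs[: len(args)] = args  # arguments are written to registers 0..N-1
       pc = 0
       while True:
           op, *fields = instrs[pc]
           if op == "Call":
               dst, func_name, arg_regs = fields
               func = packed_funcs[func_name]
               regs[dst] = func(*(regs[r] for r in arg_regs))
               pc += 1
           elif op == "Ret":
               return regs[fields[0]]
           elif op == "Goto":
               pc += fields[0]
           elif op == "If":
               cond, false_offset = fields
               # Nonzero condition falls through; otherwise jump.
               pc += 1 if regs[cond] else false_offset
           else:
               raise ValueError(f"unknown opcode {op!r}")


   packed_funcs = {
       "is_nonneg": lambda x: x >= 0,
       "negate": lambda x: -x,
       "copy": lambda x: x,
   }

   # Computes abs(x). Registers: r0 = x, r1 = condition, r2 = merged result.
   abs_program = [
       ("Call", 1, "is_nonneg", (0,)),
       ("If", 1, 3),                # false: jump to the else branch at pc 4
       ("Call", 2, "copy", (0,)),   # then branch
       ("Goto", 2),                 # skip the else branch
       ("Call", 2, "negate", (0,)), # else branch
       ("Ret", 2),
   ]

   positive = run(abs_program, packed_funcs, [3], register_file_size=3)
   negative = run(abs_program, packed_funcs, [-4], register_file_size=3)

Both branches write their result to the same register before reaching ``Ret``, which mirrors
how a copy is used to merge if/else results.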

VMClosure and function dispatch
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Functions in the VM are stored in a ``func_pool_`` indexed by function table position.
``kVMFunc`` and ``kVMTIRFunc`` entries are wrapped as ``VMClosure`` objects, while ``kPackedFunc``
entries are stored as plain ``ffi::Function``. A ``VMClosure`` stores:

- ``func_name``: the function's string name.
- ``impl``: a ``ffi::Function`` that takes the VM context pointer as its first argument, followed
  by the actual parameters.

When the VM encounters a ``Call`` instruction, it looks up the function in ``func_pool_`` by
index and dispatches via ``InvokeClosurePacked()``. If the target is a ``VMClosure``, the VM
pointer is prepended to the arguments and ``impl`` is invoked. If it is a plain
``ffi::Function``, it is called directly.

``VMClosure::BindLastArgs`` enables partial application — it creates a new function with
some arguments pre-bound at the end, useful for implementing captured closures in Relax.
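
The two dispatch paths and the effect of binding trailing arguments can be sketched as
follows. ``ToyClosure``, ``invoke_closure_packed``, and ``bind_last_args`` are hypothetical
Python stand-ins that only mirror the behavior described above; the string ``"vm"`` stands in
for the VM context pointer.

.. code-block:: python

   class ToyClosure:
       """Stand-in for VMClosure: a name plus an impl taking the VM first."""

       def __init__(self, func_name, impl):
           self.func_name = func_name
           self.impl = impl  # callable(vm, *args)


   def invoke_closure_packed(vm, target, args):
       """Dispatch a call to either a closure or a plain function."""
       if isinstance(target, ToyClosure):
           # Closures receive the VM context as their first argument.
           return target.impl(vm, *args)
       # Plain functions are called directly.
       return target(*args)


   def bind_last_args(impl, last_args):
       """Return an impl with `last_args` appended to every call."""

       def bound(vm, *args):
           return impl(vm, *args, *last_args)

       return bound


   def scale_impl(vm, x, factor):
       return x * factor


   scale = ToyClosure("scale", scale_impl)
   scale_by_10 = ToyClosure("scale_by_10", bind_last_args(scale_impl, (10,)))

   direct = invoke_closure_packed("vm", max, (3, 2))
   via_closure = invoke_closure_packed("vm", scale, (3, 2))
   via_bound = invoke_closure_packed("vm", scale_by_10, (4,))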

Built-in operations
~~~~~~~~~~~~~~~~~~~

The VM relies on several built-in PackedFuncs (registered in ``src/runtime/vm/builtin.cc``)
for runtime support:

- ``vm.builtin.alloc_shape_heap``: allocate workspace for symbolic shape computations.
- ``vm.builtin.match_shape``: validate tensor shapes against expected patterns at runtime,
  supporting assertions (``kAssertEqualToImm``, ``kAssertEqualToLoad``), storing symbolic
  dimensions to the shape heap (``kStoreToHeap``), or no-ops (``kNoOp``).
- ``vm.builtin.make_shape``: construct shape tuples from immediates or heap-loaded values.
- ``vm.builtin.match_prim_value``: validate primitive values (e.g., integers) against expected
  patterns.
- ``vm.builtin.copy``: copy a value into a register. Used in several codegen scenarios:
  materializing non-register arguments (immediates, constants) into registers, ensuring each
  variable binding gets its own register, and merging results from if/else branches.
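
To make the shape-matching actions concrete, here is a toy model of the four codes. It is only
a sketch: ``toy_match_shape`` and its ``(code, operand)`` argument format are hypothetical, the
enum values are arbitrary, and the real built-in has a different calling convention.

.. code-block:: python

   from enum import Enum, auto


   class MatchShapeCode(Enum):
       # Names follow the codes listed above; numeric values are arbitrary.
       ASSERT_EQUAL_TO_IMM = auto()
       STORE_TO_HEAP = auto()
       NO_OP = auto()
       ASSERT_EQUAL_TO_LOAD = auto()


   def toy_match_shape(shape, heap, actions):
       """Apply one (code, operand) action to each dimension of `shape`."""
       if len(shape) != len(actions):
           raise ValueError(f"expected {len(actions)} dimensions, got {len(shape)}")
       for axis, (dim, (code, operand)) in enumerate(zip(shape, actions)):
           if code is MatchShapeCode.STORE_TO_HEAP:
               heap[operand] = dim  # bind a symbolic dimension
           elif code is MatchShapeCode.ASSERT_EQUAL_TO_IMM:
               if dim != operand:
                   raise ValueError(f"axis {axis}: expected {operand}, got {dim}")
           elif code is MatchShapeCode.ASSERT_EQUAL_TO_LOAD:
               if dim != heap[operand]:
                   raise ValueError(f"axis {axis}: expected {heap[operand]}, got {dim}")
           # NO_OP: nothing to check for this dimension


   # Pattern (n, n, 3): bind n from axis 0, require axis 1 == n and axis 2 == 3.
   pattern = [
       (MatchShapeCode.STORE_TO_HEAP, 0),
       (MatchShapeCode.ASSERT_EQUAL_TO_LOAD, 0),
       (MatchShapeCode.ASSERT_EQUAL_TO_IMM, 3),
   ]
   heap = [0]
   toy_match_shape((8, 8, 3), heap, pattern)

After the call, ``heap[0]`` holds 8; a shape such as ``(8, 4, 3)`` would raise because the
second axis no longer matches the stored value.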


Python Interface
----------------

Users interact with the VM through ``tvm.relax.VirtualMachine``:

.. code-block:: python

   import tvm
   from tvm import relax
   import numpy as np

   # Compile
   ex = tvm.compile(MyModule, target="llvm")

   # Create VM
   vm = relax.VirtualMachine(ex, tvm.cpu())

   # Direct invocation
   inp = tvm.runtime.tensor(np.random.rand(3, 4).astype("float32"))
   result = vm["main"](inp)

   # Stateful interface (useful for RPC)
   vm.set_input("main", inp)
   vm.invoke_stateful("main")
   output = vm.get_outputs("main")

Key methods:

- ``vm["func_name"](*args)`` — direct invocation, returns the result.
- ``vm.set_input()`` / ``vm.invoke_stateful()`` / ``vm.get_outputs()`` — stateful interface
  that avoids sending output over the wire, useful for RPC-based remote execution.
- ``vm.save_function(func_name, saved_name, *args)`` — pre-bind arguments for repeated calls,
  reducing dictionary lookup overhead during benchmarking.
- ``vm.time_evaluator(func_name, dev)`` — returns a timing function following the same convention
  as ``tvm.runtime.Module.time_evaluator``.
- ``vm.set_instrument(func)`` — register an instrumentation callback that is invoked before/after
  every ``Call`` instruction. The callback can return ``VMInstrumentReturnKind.SKIP_RUN`` to
  skip the call.

Instrumentation
~~~~~~~~~~~~~~~

The VM supports observability through an instrumentation callback registered with
``set_instrument()``:

.. code-block:: python

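   from tvm.runtime.vm import VMInstrumentReturnKind

   def my_instrument(func, func_symbol, before_run, ret_value, *args):
       # Called twice per Call instruction: before (before_run=True) and after.
       if before_run:
           print(f"About to call: {func_symbol}")
       # NO_OP lets the call proceed; SKIP_RUN would skip it.
       return VMInstrumentReturnKind.NO_OP

   vm.set_instrument(my_instrument)
   vm["main"](inp)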

The instrument function is called before and after every ``Call`` instruction, receiving the
function object, its symbol name, a flag indicating before/after, the return value (only valid
after), and all arguments.
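
A callback with this signature can also gather statistics. The sketch below counts calls per
function symbol. To stay runnable without TVM, the hypothetical factory ``make_call_counter``
takes the return value as a parameter; with a real VM one would pass
``VMInstrumentReturnKind.NO_OP`` and register the callback via ``vm.set_instrument()``.

.. code-block:: python

   from collections import Counter


   def make_call_counter(no_op):
       """Build an instrument callback that counts calls per function symbol."""
       counts = Counter()

       def instrument(func, func_symbol, before_run, ret_value, *args):
           # Count each call once, on the "before" notification only.
           if before_run:
               counts[func_symbol] += 1
           return no_op

       return instrument, counts


   # Exercise the callback by hand; 0 is a placeholder for the no-op return kind.
   instrument, counts = make_call_counter(no_op=0)
   for before_run in (True, False):
       instrument(None, "vm.builtin.copy", before_run, None)
   instrument(None, "add", True, None)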


Inspecting Bytecode
-------------------

The executable provides text and Python representations of the compiled bytecode:

.. code-block:: python

   ex = tvm.compile(MyModule, target="llvm")
   print(ex.as_text())    # Human-readable instruction listing
   print(ex.as_python())  # Equivalent Python program
   print(ex.stats())      # Summary statistics

These are invaluable for debugging compilation issues — they show exactly which functions
are called, in what order, and how registers are used.


Source Code Map
---------------

.. list-table::
   :header-rows: 1
   :widths: 45 55

   * - Path
     - Contents
   * - ``include/tvm/runtime/vm/bytecode.h``
     - Instruction, Opcode, and Arg definitions
   * - ``include/tvm/runtime/vm/executable.h``
     - VMExecutable, VMFuncInfo, serialization
   * - ``include/tvm/runtime/vm/vm.h``
     - VirtualMachine base class, VMClosure
   * - ``src/runtime/vm/vm.cc``
     - VirtualMachineImpl, RunLoop, InvokeBytecode
   * - ``src/runtime/vm/executable.cc``
     - Serialization/deserialization, text output
   * - ``src/runtime/vm/builtin.cc``
     - Built-in operations (shape matching, allocation)
   * - ``src/relax/backend/vm/codegen_vm.cc``
     - CodeGenVM: Relax IR → bytecode
   * - ``src/relax/backend/vm/codegen_vm_tir.cc``
     - VMTIRCodeGen: Relax IR → compiled TIR
   * - ``python/tvm/runtime/vm.py``
     - Python VirtualMachine wrapper
   * - ``python/tvm/relax/vm_build.py``
     - ``relax.build()`` and VMExecutable Python class