Relax Virtual Machine

This document explains the Relax VM architecture in detail, covering the compilation pipeline from Relax IR to bytecode, the instruction set, the execution model, and the Python-level user interface.

Overview

The end-to-end flow from model to execution is:

Relax IR — a high-level computational graph (relax.Function inside an IRModule).
Compilation — tvm.compile() applies the Relax transformation pipeline, then invokes VMCodeGen to translate each Relax function into bytecode instructions.
Linking — TIR functions are compiled to native kernels (via LLVM, CUDA, etc.); the bytecode, constant pool, and compiled kernels are packaged together into a VMExecutable.
Execution — at runtime, a VirtualMachine loads the executable, initializes devices and memory allocators, and runs the bytecode.

IRModule (Relax + TIR)
     │
     ▼  relax_pipeline (FuseOps, LegalizeOps, ...)
IRModule (optimized)
     │
     ▼  VMCodeGen
ExecBuilder (bytecode) + IRModule (TIR only)
     │                        │
     │                        ▼  tirx.build()
     │                   runtime.Module (native kernels)
     │                        │
     ▼  VMLink               ▼
VMExecutable ◄───────── linked together
     │
     ▼  VirtualMachine(exec, device)
Runtime execution

Compilation: From Relax IR to Bytecode

Build entry point

The main entry point is tvm.compile() (which delegates to relax.build() in python/tvm/relax/vm_build.py):

import tvm
from tvm import relax
from tvm.script import relax as R  # provides R.function, R.Tensor, R.add

@tvm.script.ir_module
class MyModule:
    @R.function
    def main(x: R.Tensor((3, 4), "float32")):
        return R.add(x, x)

target = tvm.target.Target("llvm")
ex = tvm.compile(MyModule, target)

Internally, relax.build() performs these steps:

Apply the Relax pipeline (relax.get_pipeline("default")), which includes operator legalization, fusion, buffer planning, and other graph-level passes.
Create an ExecBuilder and run VMCodeGen (src/relax/backend/vm/codegen_vm.cc), which walks each relax.Function and emits bytecode instructions. The Relax functions are removed from the IRModule; only TIR functions remain.
Compile the remaining TIR functions to native code via tirx.build().
Link the bytecode executable with the compiled native module using VMLink, producing a VMExecutable.

Two execution modes are supported:

exec_mode="bytecode" (default): Relax functions are interpreted by the VM’s bytecode dispatch loop.
exec_mode="compiled": Relax functions are compiled into TIR functions (VMTIRCodeGen) that directly manipulate the register file, bypassing the interpreter loop. This avoids dispatch overhead but produces more code.

Bytecode generation

The CodeGenVM class (src/relax/backend/vm/codegen_vm.cc) is an ExprFunctor that visits each Relax expression and emits instructions through the ExecBuilder:

Each relax.Var is mapped to a register.
Function parameters occupy registers 0 through N-1.
Each binding in a SeqExpr generates one or more instructions; the result is stored in a new register.
Function calls (R.call_tir, R.call_packed, operator calls) become Call instructions.
Conditional expressions (relax.If, written as Python if in TVMScript) become an If instruction followed by Goto to skip branches.
The function body ends with a Ret instruction.
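
The register assignment rules above can be illustrated with a small pure-Python model. This is only a sketch: ToyCodeGen, its tuple-based instruction format, and the way bindings are passed in are hypothetical and do not mirror the real CodeGenVM or ExecBuilder interfaces.

```python
# Toy model of register assignment during bytecode generation.
# All names here are hypothetical; this is not the CodeGenVM API.

class ToyCodeGen:
    def __init__(self, params):
        # Function parameters occupy registers 0 through N-1.
        self.var_to_reg = {name: i for i, name in enumerate(params)}
        self.next_reg = len(params)
        self.instrs = []

    def emit_binding(self, var, func, arg_vars):
        # Each binding stores its result in a fresh register.
        dst = self.next_reg
        self.next_reg += 1
        args = [self.var_to_reg[a] for a in arg_vars]
        self.instrs.append(("Call", dst, func, args))
        self.var_to_reg[var] = dst

    def emit_ret(self, var):
        # The function body ends with a Ret instruction.
        self.instrs.append(("Ret", self.var_to_reg[var]))


gen = ToyCodeGen(params=["x", "y"])
gen.emit_binding("lv0", "add", ["x", "y"])
gen.emit_binding("lv1", "multiply", ["lv0", "x"])
gen.emit_ret("lv1")
```

After these calls, gen.instrs holds two Call entries followed by one Ret, and gen.next_reg gives the register file size the function would need.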

Instruction Set

The VM uses a register-based architecture with an intentionally minimal instruction set. There are only four opcodes:

Opcode	Fields	Semantics
`Call`	`dst`, `func_idx`, `num_args`, `args[]`	Call function `func_idx` with the given arguments; store the result in register `dst`.
`Ret`	`result`	Return the value in register `result` to the caller.
`Goto`	`pc_offset`	Jump forward or backward by `pc_offset` instructions.
`If`	`cond`, `false_offset`	If register `cond` is nonzero, fall through (pc++); otherwise jump by `false_offset`.

The VM itself performs no mathematical computation. All actual work — matrix multiplications, convolutions, elementwise operations — is carried out by compiled TIR kernels or external libraries (cuBLAS, cuDNN, etc.), dispatched through Call instructions.

Instruction encoding

Each instruction argument (Instruction::Arg) is a 64-bit word encoded as:

Bits [63:56] — ArgKind (8 bits): kRegister (0), kImmediate (1), kConstIdx (2), or kFuncIdx (3).
Bits [55:0] — value (56 bits, sign-extended).
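
As a rough illustration of this layout, the following Python sketch packs and unpacks such a word. The helper names encode_arg and decode_arg are hypothetical, and the arithmetic is only one way to realize the described bit fields; it is not taken from the TVM sources.

```python
# Sketch of the 64-bit argument layout: 8-bit kind, 56-bit signed value.
# Kind codes follow the list above; helper names are hypothetical.
K_REGISTER, K_IMMEDIATE, K_CONST_IDX, K_FUNC_IDX = 0, 1, 2, 3

VALUE_BITS = 56
VALUE_MASK = (1 << VALUE_BITS) - 1


def encode_arg(kind, value):
    # Reject values that cannot be represented in 56 signed bits.
    if not -(1 << (VALUE_BITS - 1)) <= value < (1 << (VALUE_BITS - 1)):
        raise ValueError("value does not fit in 56 signed bits")
    # Masking yields the two's complement form for negative values.
    return (kind << VALUE_BITS) | (value & VALUE_MASK)


def decode_arg(word):
    kind = (word >> VALUE_BITS) & 0xFF
    value = word & VALUE_MASK
    # Sign-extend from bit 55.
    if value & (1 << (VALUE_BITS - 1)):
        value -= 1 << VALUE_BITS
    return kind, value
```

For example, decode_arg(encode_arg(K_IMMEDIATE, -1)) gives back (K_IMMEDIATE, -1), which shows why the value field has to be sign-extended on decode.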

Two special register values exist:

kVoidRegister: indicates “no destination” (the return value is discarded).
kVMRegister: refers to the VM context pointer itself, passed as the first argument to closures.

The instruction stream is stored as a flat vector<ExecWord> (instr_data) with an offset table (instr_offset) for random access.
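
The flat-vector-plus-offset-table arrangement can be modeled in a few lines of Python. This is a sketch under assumptions: the per-instruction word layout shown in the comments is invented for illustration, and a real decoder would more likely read fields starting at the offset than slice up to the next one.

```python
# Toy model of a flat instruction stream with an offset table.
# The per-instruction word layout used here is hypothetical.

instr_data = []    # flat list of words for all instructions
instr_offset = []  # instr_offset[i] = index in instr_data where instruction i starts


def append_instr(words):
    # Record where this instruction starts, then append its words.
    instr_offset.append(len(instr_data))
    instr_data.extend(words)


def get_instr(index):
    # Random access: look up the start offset instead of scanning the stream.
    start = instr_offset[index]
    if index + 1 < len(instr_offset):
        end = instr_offset[index + 1]
    else:
        end = len(instr_data)
    return instr_data[start:end]


append_instr(["Call", 2, 0, 2, 0, 1])  # opcode, dst, func_idx, num_args, args...
append_instr(["Goto", 1])
append_instr(["Ret", 2])
```

Because instructions have variable length (a Call carries its argument list), the offset table is what makes jumping to instruction i a constant-time operation.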

Executable

A VMExecutable (include/tvm/runtime/vm/executable.h) bundles everything needed for execution:

Function table (func_table): a vector<VMFuncInfo> describing every function. Each entry records the function’s kind, name, instruction range (start_instr to end_instr), number of arguments, register file size, and parameter names.
Constant pool (constants): model weights, shape tuples, and other compile-time constants.
Bytecode (instr_data + instr_offset): the instruction stream.
Imported modules: the compiled TIR kernels and external libraries.

Function kinds

The VM recognizes three function kinds (VMFuncInfo::FuncKind):

Kind	Description
`kPackedFunc`	An external C/C++ function looked up from imported modules or the global PackedFunc registry. Examples: `vm.builtin.alloc_shape_heap`, `vm.builtin.match_shape`.
`kVMFunc`	A bytecode-interpreted Relax function. The VM interprets its instructions in `RunLoop()`.
`kVMTIRFunc`	A Relax function compiled to a TIR function (`exec_mode="compiled"`). Found in imports under the name `__vmtir__<func_name>`. Called directly with register file pointers, bypassing the interpreter loop.
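
To make the function table concrete, here is a hedged Python model of its entries. The field names follow the description of VMFuncInfo above, but the FuncInfo class, the enum values, and the helper function are hypothetical and only illustrate the shape of the data.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import List


class FuncKind(Enum):
    # Numeric values are arbitrary here, not those of VMFuncInfo::FuncKind.
    PACKED_FUNC = auto()
    VM_FUNC = auto()
    VM_TIR_FUNC = auto()


@dataclass
class FuncInfo:
    kind: FuncKind
    name: str
    start_instr: int = 0
    end_instr: int = 0
    num_args: int = 0
    register_file_size: int = 0
    param_names: List[str] = field(default_factory=list)


func_table = [
    FuncInfo(FuncKind.PACKED_FUNC, "vm.builtin.match_shape"),
    FuncInfo(FuncKind.VM_FUNC, "main", start_instr=0, end_instr=3,
             num_args=1, register_file_size=3, param_names=["x"]),
]


def interpreted_functions(table):
    # Only VM_FUNC entries own a range of bytecode instructions.
    return [f.name for f in table if f.kind is FuncKind.VM_FUNC]
```

In this model only the "main" entry carries a meaningful instruction range; the packed function entry is just a name to be resolved at load time.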

Serialization

The executable supports binary serialization for deployment:

# Save
ex.export_library("model.so")

# Load (pass the device the module was built for, here CUDA)
loaded = tvm.runtime.load_module("model.so")
vm = relax.VirtualMachine(loaded, tvm.cuda())

The binary format includes a magic number (0xD225DE2F4214151E), a version string (currently "0.14"), followed by four sections: globals (the function table), memory scopes, constant pool, and bytecode. AsText() and AsPython() provide human-readable representations for debugging.
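
The role of the magic number and version string can be sketched as a header check. The byte layout below (a little-endian 64-bit magic followed by a length-prefixed string) is an assumption made for illustration; only the two constant values come from the text above, and the function names are hypothetical.

```python
import io
import struct

MAGIC = 0xD225DE2F4214151E  # value quoted in the text above
VERSION = "0.14"            # value quoted in the text above


def write_header(stream):
    # Assumed layout: little-endian uint64 magic, then a length-prefixed string.
    stream.write(struct.pack("<Q", MAGIC))
    data = VERSION.encode("utf-8")
    stream.write(struct.pack("<Q", len(data)))
    stream.write(data)


def check_header(stream):
    # Reject input that does not start with the expected magic number.
    (magic,) = struct.unpack("<Q", stream.read(8))
    if magic != MAGIC:
        raise ValueError("not a VM executable: bad magic number")
    (length,) = struct.unpack("<Q", stream.read(8))
    version = stream.read(length).decode("utf-8")
    if version != VERSION:
        raise ValueError(f"unsupported version {version!r}")
    return version


buf = io.BytesIO()
write_header(buf)
buf.seek(0)
```

Checking the magic number first lets a loader fail fast on unrelated files, and the version check guards against reading sections written by an incompatible format revision.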

Runtime Execution

VM initialization

At runtime, a VirtualMachine is created and initialized:

import tvm
from tvm.relax import VirtualMachine

# exec_module is a VMExecutable, e.g. the result of tvm.compile()
vm = VirtualMachine(exec_module, tvm.cuda())

Under the hood:

LoadExecutable: the bytecode and metadata are loaded from the VMExecutable.
Init: devices and memory allocators are set up. Each device gets an Allocator (either NAIVE_ALLOCATOR or POOLED_ALLOCATOR, defaulting to pooled). A CPU device is always added for shape computations.
InitFuncPool: the function pool is populated — kPackedFunc entries are resolved from imports or the global registry; kVMFunc and kVMTIRFunc entries are wrapped in VMClosure objects.
Constant pool: model constants are loaded and optionally transferred to the target device.
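
The function pool step can be pictured with the following toy model. Everything here is an assumption made for illustration: the tuple-based function table, the dictionary standing in for imported modules, the lookup order (imports first, then the global registry), and the dictionary used in place of a VMClosure.

```python
# Toy model of function pool setup. The data layout, lookup order and all names
# are assumptions made for illustration, not the VirtualMachineImpl code.

GLOBAL_REGISTRY = {"vm.builtin.copy": lambda value: value}


def init_func_pool(func_table, imports):
    pool = []
    for kind, name in func_table:
        if kind == "kPackedFunc":
            # Assumed order: imported modules first, then the global registry.
            func = imports.get(name)
            if func is None:
                func = GLOBAL_REGISTRY.get(name)
            if func is None:
                raise KeyError(f"cannot resolve packed function {name!r}")
            pool.append(func)
        else:
            # kVMFunc / kVMTIRFunc: store a closure-like record keyed by name.
            pool.append({"closure": name, "kind": kind})
    return pool


func_table = [
    ("kPackedFunc", "my_kernel"),
    ("kPackedFunc", "vm.builtin.copy"),
    ("kVMFunc", "main"),
]
imports = {"my_kernel": lambda a, b: a + b}
pool = init_func_pool(func_table, imports)
```

The resulting pool is indexed by function table position, so a later call by index needs no name lookup.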

The bytecode dispatch loop

When a kVMFunc is invoked, the VM enters InvokeBytecode():

A new VMFrame is pushed onto the call stack. Each frame contains:
- A register file (vector<ffi::Any>) — type-erased slots that can hold tensors, shapes, closures, or any TVM object. The size is determined at compile time (VMFuncInfo::register_file_size).
- The return program counter — where to resume after the function returns.
- The caller’s return register — which register in the parent frame receives the result.
Function arguments are written to registers 0..N-1.
The program counter (pc_) is set to the function’s start_instr.
RunLoop() executes instructions until a Ret is encountered:
- Call: resolve arguments (from registers, immediates, constant pool, or function pool), invoke the target function via InvokeClosurePacked(), store the result in dst.
- Ret: read the return value from the specified register, write the result to the caller’s return register, and return from RunLoop() (the frame is popped by an RAII guard when InvokeBytecode() exits).
- Goto: adjust pc_ by the offset.
- If: check the condition register; if nonzero, fall through; otherwise jump by false_offset.

The dispatch loop is implemented in src/runtime/vm/vm.cc (VirtualMachineImpl::RunLoop).

Frame Stack              Register File (per frame)
┌─────────────┐          ┌────┬────┬────┬─────┬────┐
│  Frame 2    │ ───────► │ R0 │ R1 │ R2 │ ... │ Rn │
├─────────────┤          └────┴────┴────┴─────┴────┘
│  Frame 1    │ ───────► [register file]
├─────────────┤
│  Frame 0    │ ───────► [register file]
└─────────────┘
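
The dispatch loop described above can be approximated with a toy interpreter for the four opcodes. This is a minimal sketch, not the RunLoop implementation: the tuple-based instruction format and the run function are hypothetical, it models a single frame only, and nested calls would simply reuse the Python call stack.

```python
import operator

# Toy interpreter for the four opcodes. Names and instruction layout are
# hypothetical; the real loop lives in VirtualMachineImpl::RunLoop.


def run(instrs, args, num_regs):
    regs = [None] * num_regs      # register file for this frame
    regs[:len(args)] = args       # arguments go to registers 0..N-1
    pc = 0
    while True:
        op = instrs[pc]
        if op[0] == "Call":
            _, dst, func, call_args = op
            # Resolve each argument from a register or an immediate.
            values = [regs[v] if kind == "reg" else v for kind, v in call_args]
            result = func(*values)
            if dst is not None:   # None models "no destination"
                regs[dst] = result
            pc += 1
        elif op[0] == "Ret":
            return regs[op[1]]
        elif op[0] == "Goto":
            pc += op[1]
        elif op[0] == "If":
            _, cond, false_offset = op
            # Nonzero condition falls through; otherwise jump by false_offset.
            pc += 1 if regs[cond] else false_offset
        else:
            raise ValueError(f"unknown opcode {op[0]}")


# Computes x * 2 when x > 0, and -x otherwise.
program = [
    ("Call", 1, operator.gt, [("reg", 0), ("imm", 0)]),   # 0: r1 = x > 0
    ("If", 1, 3),                                         # 1: false jumps to 4
    ("Call", 2, operator.mul, [("reg", 0), ("imm", 2)]),  # 2: r2 = x * 2
    ("Goto", 2),                                          # 3: skip else, go to 5
    ("Call", 2, operator.neg, [("reg", 0)]),              # 4: r2 = -x
    ("Ret", 2),                                           # 5: return r2
]
```

Calling run(program, [3], 3) takes the true branch, while run(program, [-4], 3) jumps over it. Both branches write to the same register, which loosely mirrors how results from if/else branches are merged.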

VMClosure and function dispatch

Functions in the VM are stored in a func_pool_ indexed by function table position. kVMFunc and kVMTIRFunc entries are wrapped as VMClosure objects, while kPackedFunc entries are stored as plain ffi::Function. A VMClosure stores:

func_name: the function’s string name.
impl: a ffi::Function that takes the VM context pointer as its first argument, followed by the actual parameters.

When the VM encounters a Call instruction, it looks up the function in func_pool_ by index and dispatches via InvokeClosurePacked(). If the target is a VMClosure, the VM pointer is prepended to the arguments and impl is invoked. If it is a plain ffi::Function, it is called directly.

VMClosure::BindLastArgs enables partial application — it creates a new function with some arguments pre-bound at the end, useful for implementing captured closures in Relax.
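
The two dispatch paths and the idea behind binding trailing arguments can be sketched in plain Python. ToyClosure, invoke_closure, and bind_last_args are hypothetical stand-ins for VMClosure, InvokeClosurePacked(), and VMClosure::BindLastArgs; their signatures are assumptions, not the C++ interfaces.

```python
# Toy model of closure dispatch and trailing-argument binding.
# All names are hypothetical stand-ins for the C++ runtime pieces.


class ToyClosure:
    def __init__(self, func_name, impl):
        self.func_name = func_name
        self.impl = impl  # callable taking the VM as its first argument


def invoke_closure(vm, target, *args):
    if isinstance(target, ToyClosure):
        return target.impl(vm, *args)  # prepend the VM context
    return target(*args)               # plain function: call directly


def bind_last_args(func, *last_args):
    # Return a function that appends last_args to every call.
    def bound(*args):
        return func(*args, *last_args)
    return bound


class ToyVM:
    def __init__(self):
        self.calls = 0


def scaled_add(vm, a, b, scale):
    vm.calls += 1
    return (a + b) * scale


vm = ToyVM()
# "scale" plays the role of a captured variable bound at closure creation.
closure = ToyClosure("scaled_add", bind_last_args(scaled_add, 10))
```

Binding at the end rather than the front keeps the VM pointer in the first position, so the bound function still fits the closure calling convention.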

Built-in operations

The VM relies on several built-in PackedFuncs (registered in src/runtime/vm/builtin.cc) for runtime support:

vm.builtin.alloc_shape_heap: allocate workspace for symbolic shape computations.
vm.builtin.match_shape: validate tensor shapes against expected patterns at runtime, supporting assertions (kAssertEqualToImm, kAssertEqualToLoad), storing symbolic dimensions to the shape heap (kStoreToHeap), or no-ops (kNoOp).
vm.builtin.make_shape: construct shape tuples from immediates or heap-loaded values.
vm.builtin.match_prim_value: validate primitive values (e.g., integers) against expected patterns.
vm.builtin.copy: copy a value into a register. Used in several codegen scenarios: materializing non-register arguments (immediates, constants) into registers, ensuring each variable binding gets its own register, and merging results from if/else branches.
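
As an illustration of the four shape-matching actions, here is a simplified Python model. The MatchCode enum, the pattern representation, and the match_shape signature are hypothetical; the actual builtin has a different calling convention, and only the meaning of the four actions is taken from the description above.

```python
from enum import Enum, auto


class MatchCode(Enum):
    # Numeric values are arbitrary in this sketch.
    ASSERT_EQUAL_TO_IMM = auto()
    ASSERT_EQUAL_TO_LOAD = auto()
    STORE_TO_HEAP = auto()
    NO_OP = auto()


def match_shape(shape, heap, pattern):
    # pattern holds one (code, operand) pair per dimension.
    if len(shape) != len(pattern):
        raise ValueError("rank mismatch")
    for dim, (code, operand) in zip(shape, pattern):
        if code is MatchCode.ASSERT_EQUAL_TO_IMM:
            if dim != operand:
                raise ValueError(f"expected {operand}, got {dim}")
        elif code is MatchCode.ASSERT_EQUAL_TO_LOAD:
            if dim != heap[operand]:
                raise ValueError(f"expected {heap[operand]}, got {dim}")
        elif code is MatchCode.STORE_TO_HEAP:
            heap[operand] = dim
        # NO_OP: accept any value without recording it.


heap = [0] * 2
# Shape (n, 4): store n in heap slot 0 and require the second dimension to be 4.
match_shape((8, 4), heap,
            [(MatchCode.STORE_TO_HEAP, 0), (MatchCode.ASSERT_EQUAL_TO_IMM, 4)])
# Shape (n, m): n must equal the stored value, m is left unchecked.
match_shape((8, 5), heap,
            [(MatchCode.ASSERT_EQUAL_TO_LOAD, 0), (MatchCode.NO_OP, None)])
```

The first tensor seen with a symbolic dimension populates the heap, and later tensors that share the dimension are checked against it.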

Python Interface

Users interact with the VM through tvm.relax.VirtualMachine:

import tvm
from tvm import relax
import numpy as np

# Compile
ex = tvm.compile(MyModule, target="llvm")

# Create VM
vm = relax.VirtualMachine(ex, tvm.cpu())

# Direct invocation
inp = tvm.runtime.tensor(np.random.rand(3, 4).astype("float32"))
result = vm["main"](inp)

# Stateful interface (useful for RPC)
vm.set_input("main", inp)
vm.invoke_stateful("main")
output = vm.get_outputs("main")

Key methods:

vm["func_name"](*args) — direct invocation, returns the result.
vm.set_input() / vm.invoke_stateful() / vm.get_outputs() — stateful interface that avoids sending output over the wire, useful for RPC-based remote execution.
vm.save_function(func_name, saved_name, *args) — pre-bind arguments for repeated calls, reducing dictionary lookup overhead during benchmarking.
vm.time_evaluator(func_name, dev) — returns a timing function following the same convention as tvm.runtime.Module.time_evaluator.
vm.set_instrument(func) — register an instrumentation callback that is invoked before/after every Call instruction. The callback can return VMInstrumentReturnKind.SKIP_RUN to skip the call.

Instrumentation

The VM supports observability through an instrumentation callback registered with set_instrument():

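# The enum is defined alongside the Python VirtualMachine wrapper
from tvm.runtime.vm import VMInstrumentReturnKind

def my_instrument(func, func_symbol, before_run, ret_value, *args):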
    if before_run:
        print(f"About to call: {func_symbol}")
    return VMInstrumentReturnKind.NO_OP

vm.set_instrument(my_instrument)
vm["main"](inp)

The instrument function is called before and after every Call instruction, receiving the function object, its symbol name, a flag indicating before/after, the return value (only valid after), and all arguments.
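
The before/after calling convention can be modeled without TVM. In this sketch, ReturnKind stands in for VMInstrumentReturnKind with assumed numeric values, and instrumented_call is a hypothetical helper; returning None for a skipped call is likewise an assumption, since the text above does not say what the VM stores in that case.

```python
from enum import IntEnum


class ReturnKind(IntEnum):
    # Stand-in for VMInstrumentReturnKind; the numeric values are assumptions.
    NO_OP = 0
    SKIP_RUN = 1


def instrumented_call(instrument, func, func_symbol, *args):
    # Before the call: the return value slot is not meaningful yet.
    if instrument(func, func_symbol, True, None, *args) == ReturnKind.SKIP_RUN:
        return None
    result = func(*args)
    # After the call: the instrument sees the actual return value.
    instrument(func, func_symbol, False, result, *args)
    return result


log = []


def skip_expensive(func, func_symbol, before_run, ret_value, *args):
    log.append((func_symbol, before_run, ret_value))
    if before_run and func_symbol == "expensive":
        return ReturnKind.SKIP_RUN
    return ReturnKind.NO_OP


out_add = instrumented_call(skip_expensive, lambda a, b: a + b, "add", 1, 2)
# The skipped function would raise if it ran, so a result proves it was skipped.
out_skip = instrumented_call(skip_expensive, lambda: 1 / 0, "expensive")
```

The log ends up with two entries for "add" (before and after) and a single entry for "expensive", since the skipped call never reaches the after hook in this model.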

Inspecting Bytecode

The executable provides text and Python representations of the compiled bytecode:

ex = tvm.compile(MyModule, target="llvm")
print(ex.as_text())    # Human-readable instruction listing
print(ex.as_python())  # Equivalent Python program
print(ex.stats())      # Summary statistics

These listings are useful for debugging compilation issues: they show exactly which functions are called, in what order, and how registers are used.

Source Code Map

Path	Contents
`include/tvm/runtime/vm/bytecode.h`	Instruction, Opcode, and Arg definitions
`include/tvm/runtime/vm/executable.h`	VMExecutable, VMFuncInfo, serialization
`include/tvm/runtime/vm/vm.h`	VirtualMachine base class, VMClosure
`src/runtime/vm/vm.cc`	VirtualMachineImpl, RunLoop, InvokeBytecode
`src/runtime/vm/executable.cc`	Serialization/deserialization, text output
`src/runtime/vm/builtin.cc`	Built-in operations (shape matching, allocation)
`src/relax/backend/vm/codegen_vm.cc`	CodeGenVM: Relax IR → bytecode
`src/relax/backend/vm/codegen_vm_tir.cc`	VMTIRCodeGen: Relax IR → compiled TIR
`python/tvm/runtime/vm.py`	Python VirtualMachine wrapper
`python/tvm/relax/vm_build.py`	`relax.build()` and VMExecutable Python class