Compiler Integration#

TVM FFI is a standard ABI designed as a standalone module that is independent from compiler or intermediate representation implementations. It specifies a runtime ABI that DSL compilers and languages can integrate with.

Kernel Language Compilers#

Kernel languages such as OpenAI Triton, TileLang, Mojo, cuteDSL, Helion, and Hidet usually rely on their own internal compilation mechanisms to build code. To connect the generated functions to the FFI convention, one can use the following options:

  • For compilers that generate host functions via codegen (e.g., LLVM), one can generate the symbol __tvm_ffi_<func_name>, where <func_name> is the name of the exported function.

  • For kernel generators that generate C++ host code, one can directly use TVM_FFI_DLL_EXPORT_TYPED_FUNC to expose the symbol, as the sketch after this list shows.
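
As a minimal sketch of the second option: the macro name comes from the bullet above, while the exported name add_one_int, the scalar signature, and the include path are our illustrative choices; tensor arguments work similarly through the library's tensor types.

#include <tvm/ffi/function.h>  // assumed include path for the export macro

// A plain typed C++ function. The macro below wraps it into the FFI calling
// convention and emits the corresponding __tvm_ffi_-prefixed symbol so that
// loaders can find it under the exported name.
int AddOneInt(int x) { return x + 1; }

TVM_FFI_DLL_EXPORT_TYPED_FUNC(add_one_int, AddOneInt);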

The following code snippet shows C code that implements an add_one_c function under the ABI. It is reasonably straightforward for low-level code generators to replicate this C logic. You can run this code as part of the quick start example.

#include <tvm/ffi/c_api.h>
#include <tvm/ffi/extra/c_env_api.h>

// Helper function to extract DLTensor from TVMFFIAny (can be inlined into generated code)
int ReadDLTensorPtr(const TVMFFIAny *value, DLTensor** out) {
  if (value->type_index == kTVMFFIDLTensorPtr) {
    *out = (DLTensor*)(value->v_ptr);
    return 0;
  }
  if (value->type_index != kTVMFFITensor) {
    // Use TVMFFIErrorSetRaisedFromCStr to set an error which will
    // be propagated to the caller
    TVMFFIErrorSetRaisedFromCStr("ValueError", "Expects a Tensor input");
    return -1;
  }
  *out = (DLTensor*)((char*)(value->v_obj) + sizeof(TVMFFIObject));
  return 0;
}

// FFI function implementing add_one operation
int __tvm_ffi_add_one_c(
  void* handle, const TVMFFIAny* args, int32_t num_args, TVMFFIAny* result
) {
  DLTensor *x, *y;
  // Extract tensor arguments
  // return -1 for error, error is set through TVMFFIErrorSetRaisedFromCStr
  if (ReadDLTensorPtr(&args[0], &x) == -1) return -1;
  if (ReadDLTensorPtr(&args[1], &y) == -1) return -1;

  // Get the current stream for device synchronization (e.g., CUDA).
  // Not needed on CPU; kept here for demonstration purposes.
  void* stream = TVMFFIEnvGetStream(x->device.device_type, x->device.device_id);
  (void)stream;  // unused in this CPU-only example

  // perform the actual operation
  for (int64_t i = 0; i < x->shape[0]; ++i) {
    ((float*)(y->data))[i] = ((float*)(x->data))[i] + 1;
  }
  // return 0 for success run
  return 0;
}

Some of the key takeaways include:

  • Arguments arrive as a packed array of TVMFFIAny values, and the callee decodes each one by inspecting its type_index.

  • Errors are reported by setting them through TVMFFIErrorSetRaisedFromCStr and returning -1; returning 0 signals success.

  • Environment state such as the current device stream is obtained through environment functions like TVMFFIEnvGetStream rather than compiler-specific mechanisms.

You can also check out the ABI overview for a more complete guide.
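
To illustrate the caller side of the same convention, here is a hedged C/C++ sketch that packs two DLTensor* arguments into TVMFFIAny slots and invokes the exported symbol directly. It relies only on the fields and type indices used in the example above; the helper name CallAddOne is hypothetical.

#include <string.h>
#include <tvm/ffi/c_api.h>

// Forward declaration of the FFI symbol defined in the example above.
int __tvm_ffi_add_one_c(
  void* handle, const TVMFFIAny* args, int32_t num_args, TVMFFIAny* result);

int CallAddOne(DLTensor* x, DLTensor* y) {
  TVMFFIAny args[2], result;
  memset(args, 0, sizeof(args));
  memset(&result, 0, sizeof(result));
  // Pack the tensors the way the callee's ReadDLTensorPtr expects them.
  args[0].type_index = kTVMFFIDLTensorPtr;
  args[0].v_ptr = x;
  args[1].type_index = kTVMFFIDLTensorPtr;
  args[1].v_ptr = y;
  // NULL closure handle, two arguments; non-zero means an error was raised.
  return __tvm_ffi_add_one_c(NULL, args, 2, &result);
}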

Graph Compilers#

Machine learning graph compilers consume computational graphs and can integrate with TVM FFI through:

  • Supporting a call_tvm_ffi primitive that calls into a function my_func that follows the ABI:

    Op.call_tvm_ffi("my_func", *args)
    
  • Using the module API to load modules into the execution context and run them; alternatively, looking up registered global functions and invoking them.

  • For ahead-of-time (AOT) compilation with a minimal runtime, the AOT compiler can generate direct calls into FFI functions (see the sketch after this list):

    • Use the TVMFFIFunctionCall API to call into custom tvm::ffi::Functions

    • If the function exposes a C symbol following the FFI ABI, call it directly.
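
As a hedged sketch of the TVMFFIFunctionCall path: we assume here that the API takes a function handle, a packed argument array, the argument count, and a result slot, and that the handle type is TVMFFIObjectHandle; verify both against tvm/ffi/c_api.h.

#include <string.h>
#include <tvm/ffi/c_api.h>

// Hedged sketch of AOT-generated code invoking a tvm::ffi::Function through
// its handle (obtained earlier from a loaded module or the global registry).
int CallThroughHandle(TVMFFIObjectHandle func, DLTensor* x, DLTensor* y) {
  TVMFFIAny args[2], result;
  memset(args, 0, sizeof(args));
  memset(&result, 0, sizeof(result));
  args[0].type_index = kTVMFFIDLTensorPtr;
  args[0].v_ptr = x;
  args[1].type_index = kTVMFFIDLTensorPtr;
  args[1].v_ptr = y;
  // Non-zero return indicates an error raised through the FFI error state.
  return TVMFFIFunctionCall(func, args, 2, &result);
}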

This approach provides a unified mechanism to call into any library or DSL that exposes kernels following the FFI convention, enabling seamless interoperability across kernel DSLs and libraries.

Runtime and State Management for Compilers#

While TVM FFI provides a standard ABI for compiler-generated kernels, many compilers and domain-specific languages (DSLs) require their own runtime to manage state such as dynamic shapes, workspace memory, or other application-specific data. This runtime can be a separate shared library accessible to all kernels from a specific compiler.

Distributing the Runtime#

For a user to use a kernel from your compiler, they must have access to your runtime library. The preferred method is to package the runtime shared library (e.g., libmylang_runtime.so) as part of a Python or C++ package. Users must install and import this package before loading any kernels compiled by your system. This approach ensures the state is shared among different kernels.

Common vs. Custom State#

It is important to distinguish between compiler-specific state and common state managed by TVM FFI. TVM FFI handles common state such as streams and memory allocators through environment functions (e.g., TVMFFIEnvGetStream), allowing kernels to access these without managing their own. For any state unique to your compiler, however, the global function registration approach sketched below is the most suitable method.
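
As a hedged sketch of that approach: the registration helper shown here (a reflection GlobalDef inside TVM_FFI_STATIC_INIT_BLOCK) follows a pattern from the tvm-ffi examples but may differ across versions, and the mylang workspace names are hypothetical.

#include <cstdint>
#include <new>
#include <tvm/ffi/function.h>
#include <tvm/ffi/reflection/registry.h>  // assumed include path

namespace {
// Hypothetical compiler-specific state: one workspace buffer shared by all
// kernels emitted by this compiler, living in the runtime shared library.
void* mylang_workspace = nullptr;
int64_t mylang_workspace_bytes = 0;

void* MyLangRequestWorkspace(int64_t nbytes) {
  if (nbytes > mylang_workspace_bytes) {
    ::operator delete(mylang_workspace);
    mylang_workspace = ::operator new(static_cast<size_t>(nbytes));
    mylang_workspace_bytes = nbytes;
  }
  return mylang_workspace;
}
}  // namespace

// Expose the accessor through the global function registry so every kernel
// (and the host program) can look it up by name.
TVM_FFI_STATIC_INIT_BLOCK({
  namespace refl = tvm::ffi::reflection;
  refl::GlobalDef().def("mylang.request_workspace", MyLangRequestWorkspace);
});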

Advanced: Custom Modules#

While the standard dynamic library module is sufficient for many use cases, it is sometimes helpful to package a custom runtime module that wraps a driver API, for example calling cuModuleLoad explicitly to load generated PTX code and exposing it as a tvm::ffi::Function. The tvm::ffi::ModuleObj interface provides a way to support this need. Generally, this involves subclassing tvm::ffi::ModuleObj, as the hedged sketch below illustrates.
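
The sketch below shows the general shape. The overridden method names (kind, GetFunction), their signatures, and the include path are our reading of the interface and should be checked against the tvm-ffi headers; the CUDA driver calls are left as comments.

#include <optional>
#include <string>
#include <tvm/ffi/extra/module.h>  // assumed include path for ModuleObj
#include <tvm/ffi/function.h>

// Hedged sketch: a module kind that loads generated PTX through the CUDA
// driver API and serves kernels as tvm::ffi::Function objects.
class MyPTXModuleObj : public tvm::ffi::ModuleObj {
 public:
  explicit MyPTXModuleObj(std::string ptx) : ptx_(std::move(ptx)) {
    // e.g., cuModuleLoadData(&cu_module_, ptx_.c_str());
  }

  // The module kind; it also names the loader global function
  // (ffi.Module.load_from_bytes.my_ptx) described in the next subsection.
  const char* kind() const final { return "my_ptx"; }

  // Look up a kernel by name and wrap its launch logic as an FFI function.
  std::optional<tvm::ffi::Function> GetFunction(const std::string& name) {
    // e.g., cuModuleGetFunction(&f, cu_module_, name.c_str()), then return a
    // tvm::ffi::Function that packs arguments and calls cuLaunchKernel.
    return std::nullopt;
  }

 private:
  std::string ptx_;
  // CUmodule cu_module_;
};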

Enable Export and Loading#

We also support exporting and loading modules that import custom modules. Libraries can embed a binary symbol __tvm_ffi__library_bin with the following binary layout:

  • <nbytes : u64> <import_tree> <key0: str> [val0: bytes] <key1: str> [val1: bytes] ...

  • nbytes indicates the total number of bytes following the nbytes header

  • <import_tree> uses CSR sparse array format: <indptr: vec<u64>> <child_indices: vec<u64>> to store child indices of each node (each node is a Module instance)

  • <key> stores the module kind, or can be _lib:

    • _lib indicates the module corresponds to the dynamic library itself

    • For other cases, val: bytes contains the serialized bytes from the custom module

  • Both bytes and str are serialized as <size: u64> <content>

This information allows us to deserialize the custom modules by calling ffi.Module.load_from_bytes.<kind>, reconstruct the overall import relations from <import_tree>, and return the final composed module to the user. As long as the compiler generates __tvm_ffi__library_bin in the above format, tvm_ffi.load_module() will correctly handle the loading and recover the original module. Note that the custom module class definition must be available during loading, either by importing another runtime DLL or by embedding it in the generated library.
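
To make the layout concrete, here is a hedged C++ parsing sketch, not the authoritative loader. It assumes vec<u64> is serialized as <size: u64> <elements>, mirroring the <size: u64> <content> rule stated for bytes and str, and it omits bounds checking.

#include <cstdint>
#include <cstring>
#include <string>
#include <utility>
#include <vector>

static uint64_t ReadU64(const char*& p) {
  uint64_t v;
  std::memcpy(&v, p, sizeof(v));
  p += sizeof(v);
  return v;
}

static std::string ReadStr(const char*& p) {
  uint64_t size = ReadU64(p);  // <size: u64> <content>
  std::string out(p, p + size);
  p += size;
  return out;
}

static std::vector<uint64_t> ReadU64Vec(const char*& p) {
  std::vector<uint64_t> out(ReadU64(p));  // assumed <size: u64> prefix
  for (uint64_t& v : out) v = ReadU64(p);
  return out;
}

// Returns <kind, serialized bytes> pairs; "_lib" entries carry no payload.
std::vector<std::pair<std::string, std::string>> ParseLibraryBin(const char* p) {
  uint64_t nbytes = ReadU64(p);  // total bytes following this header
  const char* end = p + nbytes;
  // <import_tree> in CSR form; a real loader would use these two arrays to
  // reconnect each module with its imported children.
  std::vector<uint64_t> indptr = ReadU64Vec(p);
  std::vector<uint64_t> child_indices = ReadU64Vec(p);
  std::vector<std::pair<std::string, std::string>> modules;
  while (p < end) {
    std::string key = ReadStr(p);  // module kind, or "_lib"
    std::string val = (key == "_lib") ? std::string() : ReadStr(p);
    modules.emplace_back(std::move(key), std::move(val));
  }
  return modules;
}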