microTVM Design Document

Background

TVM is a model deployment framework that has demonstrated good performance across a wide range of models on traditional operating systems. Given TVM’s layered approach to compilation, it is a natural extension to target bare metal devices. While most of the compilation flow does not need to change for a proof-of-concept implementation on such devices, the runtime cannot depend on:

  • Virtual Memory, and by extension any system-provided malloc. Additionally, bare metal devices typically have very limited memory (measured in KB). Because of this, libraries designed for such platforms typically need to be more judicious in using memory, and need to release memory when it is not in use.

  • Traditional OS abstractions, such as files, libraries, and kernel functions. Some projects implement support for these, but they are by no means standard.

  • Support for programming languages other than C.

These constraints require a different approach from the TVM C++ runtime typically used on traditional operating systems.

Typical Use

This section discusses our vision of the “typical” microTVM use case. Each component used to achieve this typical use case is intended to be designed for flexibility, but this unifying vision serves to motivate the inclusion of each part of the design.

https://raw.githubusercontent.com/tvmai/web-data/main/images/dev/microtvm_workflow.svg

The parts of this process are described below:

  1. Model Import. The user imports an existing model or describes a new model to TVM, producing a Relay module.

  2. Model Transformations. The user can apply transformations, such as quantization, to the model. After each transformation, the user should still have a Relay module.

  3. Compilation (Scheduling and Code Generation). TVM lowers each Relay operator to Tensor IR by assigning it a schedule and schedule configuration. Then, code (C source or compiled object) is generated for each operator.

  4. Integration. The generated code is integrated along with the TVM C Runtime library into a user-supplied binary project. In some cases (such as when the project is standardized across multiple SoC/development boards), this process is handled automatically.

  5. Deployment. The project is built and the resulting firmware binary is flashed onto the device. Model inference is driven either by TVM using an on-device RPC server, or on the device using the on-device Graph Runtime.

Design Goals

microTVM aims to achieve these design goals:

  1. Portable Code. microTVM can translate any Relay model into C code that can compile with only a C standard library.

  2. Minimal Overhead. microTVM generates target-specific, highly optimized code. As much runtime overhead as possible should be removed.

  3. Accessible Code. microTVM considers C source code as a first-class output mechanism so that it is easier for a firmware engineer to understand and tweak.

Overview

microTVM requires changes at all levels of the TVM compiler stack. The following sub-sections enumerate these changes at a high level, and follow-on sections discuss the specifics in more detail.

Modeling Target Platforms

TVM’s search-based optimization approach allows it to largely avoid system-level modeling of targets in favor of experimental results. However, some modeling is necessary in order to ensure TVM is comparing apples-to-apples search results, and to avoid wasting time during the search by attempting to compile invalid code for a target.

microTVM models these parts of the target:

  • The CPU used, through the -mcpu and -march target flags.

  • The presence or absence of accelerators, through the device components of the target. Currently only the absence of accelerators can be expressed, but this mechanism should extend well.

microTVM aims to model these parts of the target in the future:

  • Memory, modeled as a set of disjoint memory spaces, each with a label and size and prefetch/flush behavior. Some memory may be shared with accelerators.

  • Target runtime configuration (e.g. clock tree configuration and clock speed). This is intended only to contribute to the AutoTVM schedule key and not for any other use.

At this time, TVM does not intend to model:

  • Size, type, or relationship of caches, with the exception of prefetching or cache flushing.

TVM Targets for microTVM

A central data structure in the compilation process is the tvm::target::Target class. TVM uses Target to decide which TIR schedules to enable and how to configure the code generator. The Target class should also uniquely identify the generated code for a particular operator, as autotuning logs use it to rank measured performance (but see Future Work).

Targets are currently represented as strings structured similarly to command-line arguments. An example target is shown below:

c -keys=arm_cpu -mcpu=cortex-m7 -link-params -model=stm32f746xx -runtime=c -system-lib=1

The relevant parts to microTVM are:

  • Code generator (llvm or c)

  • -mcpu=cortex-m7: used by TOPI to enable Cortex-M schedules, and, when the C source code generator is selected, included in the output as a comment to help identify the code and configure the downstream C compiler.

  • -link-params: include parameters as global constants to load from flash.

  • -runtime=c: build glue code to allow operators to work with the C runtime.

  • -system-lib=1: emit a system library (i.e. one which can be loaded by calling the PackedFunc runtime.SystemLib).
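To make the structure of the example target string concrete, the following is a minimal sketch that splits it into a code generator name and a dictionary of options. It is not how TVM itself parses targets; the helper name parse_target and the choice to represent a bare flag as True are assumptions made for illustration only.

```python
import shlex


def parse_target(target_str):
    """Split a command-line-style target string into (codegen, options).

    Illustrative helper only; it does not reproduce TVM's own parsing.
    """
    tokens = shlex.split(target_str)
    codegen, options = tokens[0], {}
    for tok in tokens[1:]:
        name, sep, value = tok.lstrip("-").partition("=")
        # A bare flag such as -link-params carries no value; record it as True.
        options[name] = value if sep else True
    return codegen, options


codegen, opts = parse_target(
    "c -keys=arm_cpu -mcpu=cortex-m7 -link-params "
    "-model=stm32f746xx -runtime=c -system-lib=1"
)
print(codegen, opts["mcpu"], opts["link-params"], opts["system-lib"])
```

Values are kept as strings here, so -system-lib=1 yields the string "1" rather than an integer.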

Writing Schedules for microTVM

For operations scheduled on the CPU, microTVM initially plans to make use of specialized instructions and extern (i.e. hand-optimized) functions to achieve good performance. In TVM, this approach is generally accomplished through tensorization, in which TVM breaks a computation into small pieces, and a TIR extern function accelerates each small piece.
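The idea behind tensorization can be illustrated without TVM. The sketch below, written in plain Python, breaks a matrix multiply into fixed-size tiles and hands each tile to a small kernel function, which stands in for a hand-optimized extern function. The tile size, the function names, and the restriction to square matrices whose size is a multiple of the tile size are all assumptions of this sketch.

```python
TILE = 4  # assumed micro-kernel size


def gemm_microkernel(a_tile, b_tile, c_tile):
    """Stand-in for a hand-optimized extern function: c_tile += a_tile @ b_tile."""
    for i in range(TILE):
        for j in range(TILE):
            acc = 0
            for k in range(TILE):
                acc += a_tile[i][k] * b_tile[k][j]
            c_tile[i][j] += acc


def take_tile(matrix, row, col):
    """Copy the TILE x TILE block whose top-left corner is (row, col)."""
    return [r[col:col + TILE] for r in matrix[row:row + TILE]]


def tiled_matmul(a, b, n):
    """Multiply two n x n matrices (n a multiple of TILE) one tile at a time."""
    c = [[0] * n for _ in range(n)]
    for i0 in range(0, n, TILE):
        for j0 in range(0, n, TILE):
            c_tile = [[0] * TILE for _ in range(TILE)]
            for k0 in range(0, n, TILE):
                # Each small piece of the computation goes to the kernel.
                gemm_microkernel(take_tile(a, i0, k0), take_tile(b, k0, j0), c_tile)
            for i in range(TILE):
                c[i0 + i][j0:j0 + TILE] = c_tile[i]
    return c


n = 8
a = [[i + j for j in range(n)] for i in range(n)]
b = [[(i * j) % 5 for j in range(n)] for i in range(n)]
reference = [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]
print(tiled_matmul(a, b, n) == reference)
```

Only the kernel would need to be replaced by specialized instructions; the outer tiling loops stay portable.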

TVM currently accommodates both approaches using tir.call_extern. First, a pragma is attached to the schedule defining the extern function in portable C.

sched[output].pragma(n, "import_c", "void call_asm(int32_t* a, int32_t* b) { /* ... */ }")

Next, tensorize is used to split the computation.

sched[output].tensorize(owi, gemm)

There are a few caveats to this approach, all of which could be resolved by linking generated code against external libraries:

  • Inline assembly is compiler-specific. While Clang and GCC have standardized on one syntax, this may not be portable to other compilers. SDKs solve this by conditionally including a header file depending on the compiler being used. However, taking this approach means that the generated code needs additional compiler flags (i.e. -Isystempath/to/header).

  • It may be helpful to reference helper functions from the generated code (e.g. to inline common sequences of hand-optimized assembly).

  • Finally, the extern function invoked may be wholly written in an external library. If those functions can be wholly inlined, this caveat is the same as the previous. If not, then additional C code needs to be compiled and linked against the operator.

At present, microTVM presumes that all eligible schedules can be compiled. This means that the user-supplied project (see next section) must include all libraries that are used by the generated code. When not using autotuning, TVM randomly chooses a fallback schedule, so all libraries would need to be supported. When using autotuning, TVM selects the best-performing schedule, so only that library is needed. There isn’t currently a way to force TVM to pick a particular schedule outside of autotuning logs, but that would be a good addition.
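The consequence for the user-supplied project can be sketched in a few lines. The example below is not TVM code: the candidate schedule names, the library names, and the shape of the tuning log (schedule name mapped to measured seconds) are hypothetical, chosen only to show why every library is needed without autotuning and only one with it.

```python
import random

# Hypothetical candidates: schedule name -> external library it calls (or None).
CANDIDATES = {
    "generic_c": None,
    "simd_intrinsics": "vendor_dsp_lib",
    "hand_asm_gemm": "asm_helpers",
}


def pick_schedule(candidates, tuning_log=None, rng=random):
    """Pick the fastest measured schedule, or an arbitrary fallback without a log."""
    if tuning_log:
        measured = {name: t for name, t in tuning_log.items() if name in candidates}
        if measured:
            return min(measured, key=measured.get)
    return rng.choice(sorted(candidates))


def libraries_required(candidates, tuning_log=None):
    """Libraries the user-supplied project must be able to link."""
    if tuning_log:
        chosen = pick_schedule(candidates, tuning_log)
        return {candidates[chosen]} - {None}
    # Any candidate may be chosen as the fallback, so all libraries are needed.
    return set(candidates.values()) - {None}


log = {"generic_c": 4.1e-3, "simd_intrinsics": 1.3e-3, "hand_asm_gemm": 0.9e-3}
print(pick_schedule(CANDIDATES, log))
print(sorted(libraries_required(CANDIDATES)))
print(sorted(libraries_required(CANDIDATES, log)))
```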

Finally, when using the llvm backend, the process is similar except that LLVM bitcode is included in the generated code (with an import_llvm pragma). LLVM bitcode provides a portable way to call inline assembly. However, it may be more complex to call external C functions, and helper functions are of course not easy to use from LLVM bitcode.

Executing Models

The TVM compiler traditionally outputs three pieces:

  1. Model operator implementations, as discussed above;

  2. A model execution graph, encoded as JSON; and

  3. Simplified parameters.

To correctly execute the model, a Graph Runtime needs to reconstruct the graph in memory, load the parameters, and then invoke the operator implementations in the correct order.
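As a rough illustration of those three responsibilities, the sketch below parses a graph from JSON, binds parameters and inputs, and calls operators in node order. The JSON layout here is a simplified, hypothetical one and not TVM's actual graph JSON schema; the operator table and all names are likewise assumptions.

```python
import json

# Hypothetical, simplified graph description (not TVM's real schema).
GRAPH_JSON = json.dumps({
    "nodes": [
        {"name": "x", "op": "input", "inputs": []},
        {"name": "w", "op": "param", "inputs": []},
        {"name": "scaled", "op": "multiply", "inputs": ["x", "w"]},
        {"name": "out", "op": "add_one", "inputs": ["scaled"]},
    ],
    "outputs": ["out"],
})

# Stand-ins for the generated operator implementations.
OPERATORS = {
    "multiply": lambda a, b: [p * q for p, q in zip(a, b)],
    "add_one": lambda a: [p + 1 for p in a],
}


def run_graph(graph_json, params, inputs):
    """Reconstruct the graph, load parameters, and run operators in order."""
    graph = json.loads(graph_json)
    values = {}
    for node in graph["nodes"]:  # nodes are assumed to be topologically sorted
        if node["op"] == "input":
            values[node["name"]] = inputs[node["name"]]
        elif node["op"] == "param":
            values[node["name"]] = params[node["name"]]
        else:
            args = [values[name] for name in node["inputs"]]
            values[node["name"]] = OPERATORS[node["op"]](*args)
    return [values[name] for name in graph["outputs"]]


result = run_graph(GRAPH_JSON, params={"w": [2, 2, 2]}, inputs={"x": [1, 2, 3]})
print(result)
```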

microTVM supports two ways to do this:

  1. Host-Driven. The Graph Runtime can run on the host and carry out execution by issuing commands to the device using an RPC link with a UART-like transport.

  2. Standalone. A C Graph Runtime is available to be compiled on-device, but it is not particularly memory efficient. This approach enables standalone execution without any attached host.

Host-Driven is designed for experimenting with models on-device and, like AutoTVM, uses the RPC server to drive computation on-device. Standalone is intended for deployment.

Host-Driven Execution

In Host-Driven execution, the firmware binary consists of the following:

  1. Generated operator implementations from TVM.

  2. The TVM C runtime.

  3. SoC-specific initialization.

  4. The TVM RPC server.

  5. (optional) Simplified Parameters.

This firmware image is flashed onto the device and a GraphRuntime instance is created on the host. The GraphRuntime drives execution by sending RPC commands over a UART:

https://raw.githubusercontent.com/tvmai/web-data/main/images/dev/microtvm_host_driven.svg
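The division of labor in host-driven execution can be sketched with an in-memory loopback in place of a UART. This is not the TVM RPC protocol: the JSON message framing, the class names, and the fused_add operator are all hypothetical, and the device is stepped by hand rather than running concurrently.

```python
import json
from collections import deque


class LoopbackTransport:
    """In-memory stand-in for a UART: one message queue per direction."""

    def __init__(self):
        self.to_device = deque()
        self.to_host = deque()


class DeviceRpcServer:
    """Plays the role of the on-device RPC server."""

    def __init__(self, transport, operators):
        self.transport = transport
        self.operators = operators

    def poll(self):
        # Serve every pending request and queue one reply for each.
        while self.transport.to_device:
            request = json.loads(self.transport.to_device.popleft())
            result = self.operators[request["func"]](*request["args"])
            self.transport.to_host.append(json.dumps({"result": result}).encode())


class HostSession:
    """Plays the role of the host, which owns the graph and drives execution."""

    def __init__(self, transport, device):
        self.transport = transport
        self.device = device

    def call(self, func, *args):
        message = json.dumps({"func": func, "args": list(args)}).encode()
        self.transport.to_device.append(message)
        self.device.poll()  # real firmware runs on its own; here it is stepped by hand
        return json.loads(self.transport.to_host.popleft())["result"]


transport = LoopbackTransport()
device = DeviceRpcServer(
    transport, {"fused_add": lambda a, b: [x + y for x, y in zip(a, b)]}
)
host = HostSession(transport, device)

# The host asks the device to run one operator at a time.
hidden = host.call("fused_add", [1, 2, 3], [10, 20, 30])
output = host.call("fused_add", hidden, [1, 1, 1])
print(output)
```

The device side holds only operators and a request loop, which is why this mode suits experimentation: the graph logic stays on the host.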

Standalone Execution

In Standalone execution, the GraphRuntime is instantiated on device:

https://raw.githubusercontent.com/tvmai/web-data/main/images/dev/microtvm_standalone.svg

microTVM Firmware

We can now discuss how microTVM firmware should behave. An important task common to both model execution strategies is configuring the SoC to match the way it performs in production. microTVM considers this task project- and SoC-dependent. Whether for AutoTVM, host-driven model inference, or in standalone deployment, the user is expected to supply a project whose main() does the following:

  1. Configure the SoC to match deployment performance.

  2. Initialize the TVM C Runtime.

When configuring for host-driven inference or AutoTVM, the remaining tasks are well-defined:

  1. Initialize a transport (i.e. a UART) for use with the TVM RPC server.

  2. Launch the TVM RPC Server.

When configuring for standalone deployment, the firmware needs to:

  1. Instantiate the system library by calling the runtime.SystemLib PackedFunc.

  2. Instantiate a GraphRuntime, passing it the system library module.

  3. Configure parameters and inputs as needed.

  4. Run the model.

Parts of a microTVM Binary

To summarize, a microTVM firmware binary image must contain these parts:

  1. Operator implementations, produced by TVM.

  2. The TVM C runtime library, supplied by TVM as a static library.

  3. SoC Initialization, supplied by the user.

For Host-driven model execution, firmware also needs:

  1. The TVM RPC Server library.

For Standalone model execution, firmware also needs:

  1. The TVM C GraphRuntime library, supplied by TVM as a static library.

  2. The remaining compiler outputs (Simplified Parameters and Graph JSON).

The Automated Build Flow

Once code generation is complete, tvm.relay.build returns a tvm.runtime.Module and the user can save the generated C source or binary object to a .c or .o file. From this point, TVM can theoretically step back and the user can compile and run the code separately.

However, for AutoTVM, TVM needs some automated flow to handle the following tasks:

  1. Integrate operator implementations, the TVM C Runtime library, and the TVM RPC Server library into the firmware project containing user-supplied SoC Initialization.

  2. Build the resulting project.

  3. Program the built firmware onto a (specific) attached device.

  4. Identify the serial port or other transport to be used by TVM to drive remote execution.

At present, TVM expects the user to supply an implementation of the tvm.micro.Compiler, tvm.micro.Flasher, and tvm.micro.Transport interfaces. TVM then:

  1. Builds each piece separately as a library.

  2. Builds the libraries into a binary firmware image.

  3. Programs the firmware image onto an attached device.

  4. Opens a serial port to serve as the RPC server transport.

This design was chosen to reduce build times for microTVM (the common libraries need to be built only once per candidate operator implementation). In practice, these projects are extremely small and compile relatively quickly. Compared with the added complexity of this tighter build integration with TVM, the performance gains are likely not worth it. A future design will consolidate the build tasks into a single step and narrow the interface to provide a better integration.
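The current four-step flow can be sketched with stand-in classes. The abstract classes below are hypothetical and do not reproduce the method names or signatures of tvm.micro.Compiler, tvm.micro.Flasher, or tvm.micro.Transport; the fake implementations and the port name exist only to make the sequence of calls visible.

```python
from abc import ABC, abstractmethod


class SketchCompiler(ABC):
    @abstractmethod
    def library(self, name, sources):
        """Build one piece as a library and return its artifact."""

    @abstractmethod
    def binary(self, libraries):
        """Link libraries into a firmware image."""


class SketchFlasher(ABC):
    @abstractmethod
    def flash(self, image):
        """Program the device and return a transport to it."""


class SketchTransport(ABC):
    @abstractmethod
    def open(self):
        """Open the link used by the RPC server."""


class FakeCompiler(SketchCompiler):
    def library(self, name, sources):
        return f"lib{name}.a"

    def binary(self, libraries):
        return "firmware.bin(" + ",".join(libraries) + ")"


class FakeTransport(SketchTransport):
    def __init__(self, port):
        self.port = port
        self.is_open = False

    def open(self):
        self.is_open = True


class FakeFlasher(SketchFlasher):
    def __init__(self):
        self.flashed = None

    def flash(self, image):
        self.flashed = image
        return FakeTransport("/dev/ttyFAKE0")  # hypothetical port name


def build_flash_connect(compiler, flasher, pieces):
    libs = [compiler.library(name, srcs) for name, srcs in pieces.items()]  # step 1
    image = compiler.binary(libs)                                           # step 2
    transport = flasher.flash(image)                                        # step 3
    transport.open()                                                        # step 4
    return transport


flasher = FakeFlasher()
link = build_flash_connect(
    FakeCompiler(),
    flasher,
    {"operators": ["ops.c"], "runtime": ["crt.c"], "rpc_server": ["rpc.c"]},
)
print(flasher.flashed, link.port, link.is_open)
```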

Measuring operator performance

The TVM C runtime depends on user-supplied functions to measure time on-device. Users should implement TVMPlatformTimerStart and TVMPlatformTimerStop. These functions should measure wall clock time, so there are some pitfalls in implementing these functions:

  1. If the CPU could halt or sleep during a computation (e.g. if it is being done on an accelerator), a cycle counter should likely not be used as these tend to stop counting while the CPU is asleep.

  2. The granularity of these functions can be relaxed as needed to extend the range of the timer device. However, if granularity is too coarse, a sub-optimal schedule may be used.

  3. An error should be raised if the timer overflows.

  4. The timer should not interrupt computation unless absolutely necessary. Doing so may affect the accuracy of the results.

  5. Calibrating the output against a wall clock is ideal, but it will likely be too cumbersome. A future PR could enable some characterization of the platform timer by, e.g., measuring the internal oscillator against a reference such as an external crystal.
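The trade-off between granularity and range, and the requirement to report overflow, can be seen in a small simulation. This is not the C runtime's timer interface: the counter width, the tick rates, and the class names are assumptions, and the simulated counter stands in for a hardware timer peripheral with an overflow flag.

```python
class SimulatedCounter:
    """Free-running counter of a fixed bit width with a sticky overflow flag."""

    def __init__(self, bits):
        self.modulus = 1 << bits
        self.value = 0
        self.overflowed = False

    def reset(self):
        self.value = 0
        self.overflowed = False

    def advance(self, ticks):
        total = self.value + ticks
        if total >= self.modulus:
            self.overflowed = True
        self.value = total % self.modulus


class PlatformTimer:
    """Start/stop timer that refuses to report a wrapped measurement."""

    def __init__(self, counter, tick_hz):
        self.counter = counter
        self.tick_hz = tick_hz
        self.running = False

    def start(self):
        if self.running:
            raise RuntimeError("timer already running")
        self.counter.reset()
        self.running = True

    def stop(self):
        if not self.running:
            raise RuntimeError("timer not running")
        self.running = False
        if self.counter.overflowed:
            raise OverflowError("timer overflowed; use a coarser tick rate")
        return self.counter.value / self.tick_hz


def measure(tick_hz, workload_seconds, bits=16):
    counter = SimulatedCounter(bits)
    timer = PlatformTimer(counter, tick_hz)
    timer.start()
    counter.advance(int(workload_seconds * tick_hz))  # simulated elapsed time
    return timer.stop()


# A 16-bit counter at 1 MHz covers about 65 ms, so a 100 ms workload overflows.
try:
    measure(tick_hz=1_000_000, workload_seconds=0.1)
except OverflowError as err:
    print("error:", err)

# Relaxing the granularity to 10 kHz extends the range to about 6.5 s.
print(measure(tick_hz=10_000, workload_seconds=0.1))
```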

Future Work

Ahead-of-Time Runtime

A limitation of the Graph Runtime is the amount of memory overhead required in parsing the JSON. The current implementation contributes significantly to the dynamic memory usage of microTVM, limiting its utility. An ahead-of-time runtime can avoid the need for any Graph JSON parsing and improve inference speed by generating C code to call the generated operator implementations directly rather than relying on a data-driven approach with the Graph Runtime.
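The difference from the data-driven approach can be sketched by generating the call sequence at compile time. The generator below is purely illustrative: the graph layout, the operator names, the BUF_* size macros, and the plain C signatures it emits are hypothetical, and real TVM operators use their own calling convention.

```python
# Hypothetical linear graph: each node names the C function that implements it.
GRAPH = [
    {"name": "conv", "func": "fused_conv2d", "inputs": ["input"]},
    {"name": "relu", "func": "fused_relu", "inputs": ["conv"]},
]


def emit_aot_c(graph, entry="model_run"):
    """Emit a C function that calls each operator directly, in graph order."""
    last = graph[-1]["name"]
    lines = [f"void {entry}(float* input, float* output) {{"]
    for node in graph:
        if node["name"] != last:
            # Intermediate tensors get statically sized buffers (sizes assumed
            # to be provided elsewhere as BUF_* macros).
            lines.append(f"  static float {node['name']}[BUF_{node['name'].upper()}];")
    for node in graph:
        dest = "output" if node["name"] == last else node["name"]
        args = ", ".join(node["inputs"] + [dest])
        lines.append(f"  {node['func']}({args});")
    lines.append("}")
    return "\n".join(lines)


source = emit_aot_c(GRAPH)
print(source)
```

No JSON is parsed on the device in this scheme; the execution order is fixed in the emitted source.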

Memory Planning

The current memory planner attempts to limit the number of TVMBackendDeviceAlloc() calls issued for intermediate tensors only. Because scratchpads can vary widely, and because the planner coalesces memory allocations within 16x of each other, this strategy typically results in high peak memory usage.

Heterogeneous Execution

Newer Cortex-M SoCs can contain multiple CPUs and onboard ML accelerators.

Autotuning Target

As discussed previously,