Device/Target Interactions
This document is intended for developers interested in understanding how the TVM framework interacts with specific device APIs, or who may want to implement support for a new API or new hardware.
There are three main aspects that must be implemented for any new runtime environment.
The DeviceAPI class gives a handle to a specific device, and the API used to interact with it. It defines a common interface for querying device parameters (e.g. memory available, number of threads, etc.) and for performing simple actions (e.g. copying memory from the host, or between buffers on the device).
The Target class contains a description of the device on which a function will run. It is exposed both to the target code generators and to the optimization passes.
The target code generators construct a Module consisting of one or more PackedFunc, from an IRModule.
DeviceAPI
The DeviceAPI represents a handle to a specific hardware device API. (e.g. CUDADeviceAPI handles all interactions through the CUDA framework.) Most DeviceAPI methods accept a device_id parameter to specify which device should be accessed. In Python, these are typically accessed using the tvm.runtime.device() function, which returns a handle to a specific device, accessed through a specific API. (e.g. tvm.runtime.device('cuda', 0) gives access to physical device 0, accessed through the CUDA API.)
Attribute queries - GetAttr allows different device-specific parameters to be queried, such as the device name, number of threads, etc. The parameters that can be queried are defined in enum DeviceAttrKind in device_api.h. Not all query-able parameters are supported by all devices. If a parameter cannot be queried (e.g. kMaxClockRate on Vulkan), or if a parameter isn’t applicable (e.g. kWarpSize on CPU), then those queries should return nullptr.
Setting active device - SetDevice should set a particular device as being active. If a PackedFunc generated by the target-specific code gen requires execution on a device, it should run on the active device.
Memory management - Utilities for allocating and deallocating memory on the device.
Allocate data space - AllocDataSpace and FreeDataSpace allocate and free space on the device. These allocations can be provided as inputs and outputs to an operator and make up the primary data flow of the operator graph. It must be possible to transfer data between the host and a data space. The return value is an opaque void*. While some implementations return a memory address, this is not required, and the void* may be an opaque handle that is interpretable only by the device backend that generated it. The void* is used as an argument to other backend-specific functions, such as CopyDataFromTo.
Allocate work space - AllocWorkspace and FreeWorkspace allocate and free space on the device. Unlike data space, these are used for storage of intermediate values within an operator definition, and are not required to be transferable to/from the host device. If a DeviceAPI subclass does not implement these methods, they will default to calling the corresponding DataSpace functions.
Copy data - CopyDataFromTo should copy data from one location to another. The type of copy is determined by the dev_from and dev_to parameters. Implementations should support copying memory from CPU to device, from device to CPU, and from one buffer to another on a single device. If the source or destination location is on the CPU, the corresponding void* points to a CPU address that can be passed into memcpy. If the source or destination location is on the device, the corresponding void* was previously generated by either AllocDataSpace or AllocWorkspace.
These copies are queued to execute on a specific TVMStreamHandle. However, implementations should not assume that CPU buffers remain valid or accessible after the call to CopyDataFromTo completes.
Execution stream management - Utilities for handling TVMStreamHandle, which represents parallel streams of execution used to execute commands.
Create stream - CreateStream and FreeStream should allocate/free a handle to a stream of execution. If a device implements only a single queue of commands, then CreateStream should return nullptr.
Set active stream - SetStream should set a stream as being active. While active, if a PackedFunc generated by the target-specific code gen requires execution on a device, the work should be submitted to the active stream.
Synchronize to CPU - StreamSync should synchronize a stream of execution to the CPU. The call to StreamSync should return once all memory transfers and computations submitted prior to the StreamSync call have completed.
Synchronize between streams - SyncStreamFromTo should introduce a synchronization barrier between the source and destination stream. That is, the destination stream may not proceed beyond commands currently queued until the source stream has completed all commands that are currently queued.
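The memory-related parts of this interface can be illustrated with a small model. The sketch below is not TVM code: ToyDeviceAPI, its snake_case method names, and the "toy" device label are all hypothetical, and device memory is simulated with Python bytearrays. It only aims to show three behaviours described above: unsupported attribute queries returning a null value, workspace allocation falling back to data-space allocation, and copies being dispatched on dev_from and dev_to with opaque handles on the device side.

```python
from typing import Dict, Optional

CPU = "cpu"  # label for host memory in this toy model


class ToyDeviceAPI:
    """Hypothetical in-memory model of a device backend (not TVM code)."""

    def __init__(self, name: str, attrs: Dict[str, object]) -> None:
        self.name = name
        self._attrs = attrs
        self._buffers: Dict[int, bytearray] = {}  # opaque handle -> storage
        self._next_handle = 1

    def get_attr(self, kind: str) -> Optional[object]:
        # Unsupported or inapplicable queries yield None (nullptr in C++).
        return self._attrs.get(kind)

    def alloc_data_space(self, nbytes: int) -> int:
        # The returned handle is opaque: only this backend can interpret it.
        handle = self._next_handle
        self._next_handle += 1
        self._buffers[handle] = bytearray(nbytes)
        return handle

    def free_data_space(self, handle: int) -> None:
        del self._buffers[handle]

    # Workspace calls fall back to the data-space calls unless overridden.
    def alloc_workspace(self, nbytes: int) -> int:
        return self.alloc_data_space(nbytes)

    def free_workspace(self, handle: int) -> None:
        self.free_data_space(handle)

    def copy_data_from_to(self, src, dst, nbytes, dev_from, dev_to) -> None:
        # A CPU-side argument is the host buffer itself; a device-side
        # argument is a handle produced by one of the alloc_* methods.
        src_buf = src if dev_from == CPU else self._buffers[src]
        dst_buf = dst if dev_to == CPU else self._buffers[dst]
        dst_buf[:nbytes] = src_buf[:nbytes]


api = ToyDeviceAPI("toy", {"kWarpSize": 32})
host_in = bytearray(b"abcd")
host_out = bytearray(4)
dev_a = api.alloc_data_space(4)
dev_b = api.alloc_workspace(4)
api.copy_data_from_to(host_in, dev_a, 4, CPU, "toy")   # host -> device
api.copy_data_from_to(dev_a, dev_b, 4, "toy", "toy")   # device -> device
api.copy_data_from_to(dev_b, host_out, 4, "toy", CPU)  # device -> host
```

Streams are left out of this model; a real backend would queue each copy on a stream rather than performing it immediately.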
In order to be usable by the TVM framework, the new DeviceAPI should then be registered with the following steps.
Create a function that instantiates the new DeviceAPI, and returns a pointer to it:
FooDeviceAPI* FooDeviceAPI::Global() { static FooDeviceAPI inst; return &inst; }
Register the function to the tvm registry:
TVM_REGISTER_GLOBAL("device_api.foo").set_body_typed(FooDeviceAPI::Global);
Add an entry for the new DeviceAPI to the TVMDeviceExtType enum in c_runtime_api.h. The value should be an unused value greater than DLDeviceType::kDLExtDev, but less than DeviceAPIManager::kMaxDeviceAPI.
Add a case in DeviceName in device_api.h to convert from the enum value to a string representation. This string representation should match the name given to TVM_REGISTER_GLOBAL.
Add entries to the MASK2STR and STR2MASK dictionaries of tvm.runtime.Device for the new enum value.
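The reason the DeviceName string must match the registered name can be seen in a small model of the lookup. The sketch below is a Python illustration rather than TVM's implementation: the registry dictionary, the helper names, and all three numeric constants are placeholders chosen for the example (the real bounds are defined in the headers named above). It assumes the framework resolves a device type by converting the enum value to a string and looking up "device_api." plus that string.

```python
from typing import Callable, Dict

# Placeholder values for illustration only; consult the real headers.
K_DL_EXT_DEV = 12       # stand-in for DLDeviceType::kDLExtDev
K_MAX_DEVICE_API = 32   # stand-in for DeviceAPIManager::kMaxDeviceAPI
K_DL_FOO = 20           # hypothetical new entry between the two bounds

GLOBAL_REGISTRY: Dict[str, Callable[[], object]] = {}


def register_global(name: str, fn: Callable[[], object]) -> None:
    """Toy counterpart of TVM_REGISTER_GLOBAL."""
    GLOBAL_REGISTRY[name] = fn


def device_name(device_type: int) -> str:
    """Toy counterpart of DeviceName: enum value -> string."""
    return {K_DL_FOO: "foo"}[device_type]


class FooDeviceAPI:
    _inst = None

    @classmethod
    def global_instance(cls) -> "FooDeviceAPI":
        # Mirrors the function-local static in FooDeviceAPI::Global().
        if cls._inst is None:
            cls._inst = cls()
        return cls._inst


register_global("device_api.foo", FooDeviceAPI.global_instance)


def get_device_api(device_type: int) -> object:
    # The enum value must fall strictly between the two bounds.
    assert K_DL_EXT_DEV < device_type < K_MAX_DEVICE_API
    # The lookup key is built from the string returned by device_name,
    # so that string has to match the registered name.
    factory = GLOBAL_REGISTRY["device_api." + device_name(device_type)]
    return factory()


api = get_device_api(K_DL_FOO)
```

If device_name returned a different string for the new enum value, the dictionary lookup would fail even though the factory function was registered.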
Target Definition
The Target object is a lookup table of properties about a physical device, its hardware/driver limits, and its capabilities. The Target is accessible both during optimization and code generation stages. While the same Target class is used for all runtime targets, each runtime target may need to add target-specific options.
In target_kind.cc, add a new declaration of TVM_REGISTER_TARGET_KIND, passing a string name of the new target, and the TVMDeviceExtType or DLDeviceType enum value for the device on which that target should run. Typically, the target name and the device name will match. (e.g. The "cuda" target runs on the kDLCUDA device.) There are exceptions, such as when multiple different code generation targets can run on the same physical device. (e.g. The "llvm" and "c" targets both run on the kDLCPU device type.)
All options for a specific target kind are added with the add_attr_option function, with optional default values. A Target parser can be added with set_target_parser to process any parameters that are dynamically based on other parameters or queried from device properties.
Together, these option definitions define a parser that can unpack a string description of a target. This is done in the Target::Target(const String&) constructor in C++, which accepts a JSON-formatted string and is typically called using the tvm.target.Target Python object. For example, tvm.target.Target('{"kind": "cuda", "max_num_threads": 1024}') will create a cuda target, while overriding the default maximum number of threads.
In a code generator, the target properties can be accessed using target->GetAttr<T>(param_name) in C++, or with the target.attrs dictionary in Python.
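How registered options, defaults, a target parser, and a JSON description fit together can be sketched with a small stand-alone model. Nothing below is TVM's implementation: ToyTargetKind, register_target_kind, make_target, the "foo" kind, its default of 256, and the max_threads_per_block option are all hypothetical names and values invented for the example. The method names add_attr_option and set_target_parser are reused only to mirror the roles described above.

```python
import json
from typing import Any, Callable, Dict, Optional

Attrs = Dict[str, Any]


class ToyTargetKind:
    """Hypothetical model of a registered target kind (not TVM code)."""

    def __init__(self, name: str, device_type: str) -> None:
        self.name = name
        self.device_type = device_type
        self.defaults: Attrs = {}
        self.parser: Optional[Callable[[Attrs], Attrs]] = None

    def add_attr_option(self, key: str, default: Any = None) -> "ToyTargetKind":
        self.defaults[key] = default
        return self

    def set_target_parser(self, parser: Callable[[Attrs], Attrs]) -> "ToyTargetKind":
        self.parser = parser
        return self


KINDS: Dict[str, ToyTargetKind] = {}


def register_target_kind(name: str, device_type: str) -> ToyTargetKind:
    kind = ToyTargetKind(name, device_type)
    KINDS[name] = kind
    return kind


def derive_block_limit(attrs: Attrs) -> Attrs:
    # Example of a parameter that depends on another parameter.
    attrs = dict(attrs)
    if attrs["max_threads_per_block"] is None:
        attrs["max_threads_per_block"] = attrs["max_num_threads"]
    return attrs


foo = register_target_kind("foo", "kDLFoo")
foo.add_attr_option("max_num_threads", 256)
foo.add_attr_option("max_threads_per_block")
foo.set_target_parser(derive_block_limit)


def make_target(config: str) -> Dict[str, Any]:
    user = json.loads(config)
    kind = KINDS[user.pop("kind")]
    unknown = set(user) - set(kind.defaults)
    if unknown:
        raise ValueError(f"unknown options for {kind.name}: {sorted(unknown)}")
    attrs = {**kind.defaults, **user}  # user values override defaults
    if kind.parser is not None:
        attrs = kind.parser(attrs)
    return {"kind": kind.name, "attrs": attrs}


target = make_target('{"kind": "foo", "max_num_threads": 1024}')
default_target = make_target('{"kind": "foo"}')
```

In this model, target["attrs"] plays the role of the target.attrs dictionary: a code generator reads its limits from there rather than from the device.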
Target Code Generators
The code generators take an optimized IRModule and convert it into an executable representation. Each code generator must be registered in order to be used by the TVM framework. This is done by registering a function named "target.build.foo", where foo is the same name as was used in the TVM_REGISTER_TARGET_KIND definition above.
tvm::runtime::Module GeneratorFooCode(IRModule mod, Target target);
TVM_REGISTER_GLOBAL("target.build.foo").set_body_typed(GeneratorFooCode);
The code generator takes two arguments. The first is the IRModule to compile, and the second is the Target that describes the device on which the code should run. Because the environment performing the compilation is not necessarily the same as the environment that will be executing the code, code generators should not perform any attribute lookups on the device itself, and should instead access parameters stored in the Target.
Each function in the input IRModule should be accessible by name in the output runtime::Module.
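The contract of a code generator can be summarized in a toy model. The sketch below is purely illustrative: the "IR" is a dictionary mapping function names to hypothetical (op, constant) pairs, ToyModule stands in for runtime::Module, and the registry, build helper, and target dictionary layout are assumptions made for the example. It shows the generator being found through the "target.build." name, reading its limits from the target rather than from a device, and exposing every input function by name in the output.

```python
from typing import Callable, Dict, Tuple

REGISTRY: Dict[str, Callable] = {}


def register_global(name: str) -> Callable:
    """Toy counterpart of TVM_REGISTER_GLOBAL, used as a decorator."""
    def decorator(fn: Callable) -> Callable:
        REGISTRY[name] = fn
        return fn
    return decorator


class ToyModule:
    """Hypothetical stand-in for runtime::Module."""

    def __init__(self, functions: Dict[str, Callable], max_threads: int) -> None:
        self._functions = functions
        self.max_threads = max_threads

    def get_function(self, name: str) -> Callable:
        return self._functions[name]


def _compile(op: str, constant: int) -> Callable[[int], int]:
    # "Compiles" one toy IR function into a callable.
    if op == "add":
        return lambda x: x + constant
    if op == "mul":
        return lambda x: x * constant
    raise ValueError(f"unsupported op: {op}")


@register_global("target.build.foo")
def generate_foo_code(ir_module: Dict[str, Tuple[str, int]], target: dict) -> ToyModule:
    # Limits come from the target description, never from a device query:
    # the compiling machine may not have the device attached.
    max_threads = target["attrs"]["max_num_threads"]
    functions = {name: _compile(op, c) for name, (op, c) in ir_module.items()}
    return ToyModule(functions, max_threads)


def build(ir_module: Dict[str, Tuple[str, int]], target: dict) -> ToyModule:
    # The generator is looked up by the target kind's name.
    return REGISTRY["target.build." + target["kind"]](ir_module, target)


target = {"kind": "foo", "attrs": {"max_num_threads": 1024}}
ir_module = {"add_one": ("add", 1), "double": ("mul", 2)}
mod = build(ir_module, target)
result = mod.get_function("add_one")(41)
```

Every key of ir_module can be retrieved from mod by the same name, which is the property required of the output module.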