Relay Operator Strategy¶
In order to lower Relay operators to the implementations defined in TOPI library, a compute and schedule function need to be registered to each Relay operator. However, compute and schedule functions are usually specialized for each target, and further, even for the same target, we may have multiple algorithms and implementations available. To deal with the complexity, we introduce operator strategy to allow developers to define a flexible lowering strategy for each operator and target.
Operator Strategy Design¶
The basic element in operator strategy is an
OpImplementation. It includes
the a pair of compute and schedule function, the name of the implementation,
and a priority level (the use of priority level is explained in
Select Implementation from Op Strategy).
OpStrategy includes a list of
contains a list of
OpImplementation associated with a
(see definition in
can be null, indicating the implementations are generally applicable;
otherwise, the implementations are only considered when the specialized
condition is satisfied.
SpecializedCondition consists of a list
of clauses defined in Tensor Expression in conjunctive normal form (CNF) and
only supports conditions on tensor shapes.
Last, a strategy function, or
FTVMStrategy, determines which pair(s) of
compute and schedule functions should be used given a workload, and needs to be
registered to each Relay operator.
FTVMStrategy is a generic function (see
include/tvm/target/generic_func.h), that can be overwritten for each
target. The function signature is
OpStrategy(const Attrs& attrs, const Array<Tensor>& inputs, const Type& out_type, const Target& target)
that the function returns an
OpStrategy given the op attributes, input
tensors, output types, and target to compile to.
Write A Strategy Function¶
We recommend developers to write strategy function in Python as
most TOPI compute and schedule functions are written in Python.
In python, we provide
OpStrategy class in
It only has one API, which is to add an implementation to the strategy:
def add_implementation(self, compute, schedule, name="default", plevel=10)
We now take
topk as an example to explain how to write the
# add to python/tvm/relay/op/strategy/generic.py @override_native_generic_func("topk_strategy") def topk_strategy(attrs, inputs, out_type, target): strategy = _op.OpStrategy() strategy.add_implementation( wrap_compute_topk(topi.topk), wrap_topi_schedule(topi.generic.schedule_topk), name="topk.generic") return strategy # add to each target file in python/tvm/relay/op/strategy, e.g., x86.py, cuda.py, etc. @topk_strategy.register(["cuda", "gpu"]) def topk_strategy_cuda(attrs, inputs, out_type, target): strategy = _op.OpStrategy() strategy.add_implementation( wrap_compute_my_new_op(topi.cuda.topk), wrap_topi_schedule(topi.cuda.schedule_topk), name="topk.cuda") return strategy
In this example, we use
as the compute and schedule function for CUDA or GPU target, while use TOPI
generic compute and schedule for the rest of targets.
Note that we use two wrapper functions that wrap the topi
compute and schedule to conform with the required function signature (
Usually we need to write a customized compute wrapper function for each operator
to get different fields from op attributes.
The example above shows a very basic strategy function that only adds one implementation in the strategy. But for many complicated operators, we may need to add multiple implementations that use different algorithms. For example, we can use both direct and winograd algorithm to compute a conv2d op. In order to achieve this, we can write the strategy function as follows:
strategy.add_implementation( wrap_compute_conv2d(topi.cuda.conv2d_nchw), wrap_topi_schedule(topi.cuda.schedule_conv2d_nchw), name="conv2d_nchw.cuda", plevel=10) if winograd_condition: strategy.add_implementation( wrap_compute_conv2d(topi.cuda.conv2d_nchw_winograd), wrap_topi_schedule(topi.cuda.schedule_conv2d_nchw_winograd), name="conv2d_nchw_winograd.cuda", plevel=15)
In this example, we add two implementations to the conv2d strategy where
winograd algorithm is only added when
winograd_condition is true.
"conv2d_nchw_winograd.cuda" will be used to compile
winograd_condition is true as it has higher
priority level (this could be changed if certain implementation is an AutoTVM
template. See Select Implementation from Op Strategy for more
"conv2d_nchw.cuda" is used.
We can extend the example above to third party library implementation. For example, we can add the implementation that invokes kernel in the cblas library when cblas is included in the target.
if "cblas" in target.libs: strategy.add_implementation( wrap_compute_dense(topi.x86.dense_cblas), wrap_topi_schedule(topi.x86.schedule_dense_cblas), name="dense_cblas.x86", plevel=15)
Further, we can add implementation specialized for a certain range of shapes.
The code below shows an example of dense strategy that adds an implementation
that is specialized for
m greater than 16. The main difference between
hardcode python condition like examples above and specialized condition is that
it allows TVM to generate multiple kernels when the input tensors have symbolic
shapes. The compile engine will generate a dispatch function that invokes the
specialized kernel when the corresponding condition is met; otherwise,
invoke the kernel that has no associated specialized condition (
in this example). This part is still work in progress. More details will be
provided after it is done.
def dense_strategy(attrs, inputs, out_type, target): m = inputs.shape strategy = _op.OpStrategy() strategy.add_implementation( wrap_compute_dense(dense_compute1), wrap_topi_schedule(dense_schedule1), name="dense_common") with tvm.te.SpecializedCondition(m > 16): strategy.add_implementation( wrap_compute_dense(dense_compute2), wrap_topi_schedule(dense_schedule2), name="dense_for_large_m", plevel=15) return strategy
Register Strategy Function to An Operator¶
After we define the strategy function for an operator, we can now register the strategy function to this operator with
However, it takes much effort to write a strategy function for an operator. Therefore, we provide two other methods for simpler operators.
First, for operators that have injective, broadcast, or reduction pattern, we
register_reduce_schedule repsectively. The schedule function for these
patterns are already registered by each target and can be applied to these
operators. We assume the compute function should be the same across all targets,
FTVMCompute needs to be registered to the op before invoking register
Second, for operators that doesn’t have these common patterns mentioned before,
but also have the same compute function for all targets, we can use
register_schedule API. It is easier to write
as we only need to provide which schedule function to use. The following
code snippet shows
FTVMSchedule function for pooling.
# add to python/tvm/relay/op/strategy/generic.py @generic_func def schedule_pool(attrs, outs, target): with target: return topi.generic.schedule_pool(outs, attrs.layout) # add to each target file in python/tvm/relay/op/strategy, e.g., x86.py, cuda.py, etc. @schedule_pool.register("cpu") def schedule_pool_cpu(attrs, outs, target): ...
After we created the
FTVMSchedule for an operator, we can
register the strategy using
Register Strategies for A New Target¶
There are two ways to register strategies for a new target. The more
straightforward one is adding a new target file in the directory
python/tvm/relay/op/strategy. You only need to customize the strategy for
ops that have been implemented for this new target and reuse the generic
strategies for the rest.
Alternatively, you can also register the strategy for the new target outside the
TVM python library. The following code snippet shows an example how to do
so. You can find more examples in
@relay.op.strategy.conv2d_strategy.register("mytarget") def conv2d_strategy_mytarget(attrs, inputs, out_type, target): ...
Select Implementation from Op Strategy¶
During the compilation, Relay compile engine needs to determine which implementation to use for an operator when there are multiple. The selection policy works as follows.
When the input tensors to an operator or a fused op all have constant shapes, the compile engine first finds the best implementation based on AutoTVM tuning logs. If there is no implementation that is an AutoTVM template or all AutoTVM templates have fallback configs, the implementation with highest priority level will then be chosen. Implementations with same priority level in this case leads to an undefined behavior, and any of them might be selected.
The selection policy for ops with symbolic input shapes is still work in progess. Currently, if any input tensor has a symbolic shape, only the implementation with highest priority level will be used for this operator. This will be updated after the implemention finishes.
For debug purpose, you can add the following lines before you compile the Relay model to learn which implementation is used for each operator.