Relay TensorRT Integration

Author: Trevor Morris

Introduction

NVIDIA TensorRT is a library for optimized deep learning inference. This integration will offload as many operators as possible from Relay to TensorRT, providing a performance boost on NVIDIA GPUs without the need to tune schedules.

This guide will demonstrate how to install TensorRT and build TVM with TensorRT BYOC and runtime enabled. It will also provide example code to compile and run a ResNet-18 model using TensorRT, and show how to configure the compilation and runtime settings. Finally, we document the supported operators and how to extend the integration to support other operators.

Installing TensorRT

In order to download TensorRT, you will need to create an NVIDIA Developer program account. Please see NVIDIA’s documentation for more info: https://docs.nvidia.com/deeplearning/tensorrt/install-guide/index.html. If you have a Jetson device such as a TX1, TX2, Xavier, or Nano, TensorRT will already be installed on the device via the JetPack SDK.

There are two methods to install TensorRT:

  • System install via deb or rpm package.

  • Tar file installation.

With the tar file installation method, you must provide the path of the extracted tar archive to USE_TENSORRT_RUNTIME=/path/to/TensorRT. With the system install method, USE_TENSORRT_RUNTIME=ON will automatically locate your installation.

Building TVM with TensorRT support

There are two separate build flags for TensorRT integration in TVM. These flags also enable cross-compilation: USE_TENSORRT_CODEGEN=ON will allow you to build a module with TensorRT support on a host machine, while USE_TENSORRT_RUNTIME=ON will enable the TVM runtime on an edge device to execute the TensorRT module. You should enable both if you want to compile and also execute models with the same TVM build.

  • USE_TENSORRT_CODEGEN=ON/OFF - This flag will enable compiling a TensorRT module, which does not require any TensorRT library.

  • USE_TENSORRT_RUNTIME=ON/OFF/path-to-TensorRT - This flag will enable the TensorRT runtime module. This will build TVM against the installed TensorRT library.

Example setting in config.cmake file:

set(USE_TENSORRT_CODEGEN ON)
set(USE_TENSORRT_RUNTIME /home/ubuntu/TensorRT-7.0.0.11)
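If you want to confirm what a particular TVM build supports, the helpers in tvm.relay.op.contrib.tensorrt can be queried from Python. This is a small sketch; the helper names below are an assumption based on that module and may vary between TVM versions.

from tvm.relay.op.contrib import tensorrt

# True when TVM was built with USE_TENSORRT_RUNTIME and can execute
# TensorRT engines (codegen-only builds report False here).
print(tensorrt.is_tensorrt_runtime_enabled())

# The TensorRT version TVM is linked against (or is targeting when only
# codegen is enabled), as a (major, minor, patch) tuple.
print(tensorrt.get_tensorrt_version())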

Build and Deploy ResNet-18 with TensorRT

Create a Relay graph from an MXNet ResNet-18 model.

import tvm
from tvm import relay
import mxnet
from mxnet.gluon.model_zoo.vision import get_model

dtype = "float32"
input_shape = (1, 3, 224, 224)
block = get_model('resnet18_v1', pretrained=True)
mod, params = relay.frontend.from_mxnet(block, shape={'data': input_shape}, dtype=dtype)

Annotate and partition the graph for TensorRT. All ops which are supported by the TensorRT integration will be marked and offloaded to TensorRT. The rest of the ops will go through the regular TVM CUDA compilation and code generation.

from tvm.relay.op.contrib.tensorrt import partition_for_tensorrt
mod, config = partition_for_tensorrt(mod, params)

Build the Relay graph, using the new module and config returned by partition_for_tensorrt. The target must always be a cuda target. partition_for_tensorrt will automatically fill out the required values in the config, so there is no need to modify it - just pass it along to the PassContext so the values can be read during compilation.

target = "cuda"
with tvm.transform.PassContext(opt_level=3, config={'relay.ext.tensorrt.options': config}):
    lib = relay.build(mod, target=target, params=params)

Export the module.

lib.export_library('compiled.so')

Load module and run inference on the target machine, which must be built with USE_TENSORRT_RUNTIME enabled. The first run will take longer because the TensorRT engine will have to be built.

import numpy as np
import tvm
from tvm.contrib import graph_executor

dev = tvm.cuda(0)
loaded_lib = tvm.runtime.load_module('compiled.so')
gen_module = graph_executor.GraphModule(loaded_lib['default'](dev))
input_data = np.random.uniform(0, 1, input_shape).astype(dtype)
gen_module.run(data=input_data)
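
After run completes, the output can be read back from the graph executor. The snippet below is a short sketch that assumes the single-output ResNet-18 model compiled above:

# Retrieve output 0 as a NumPy array and report the predicted class.
out = gen_module.get_output(0).numpy()
print("Output shape:", out.shape)            # (1, 1000) for ResNet-18
print("Predicted class:", out.argmax(axis=1)[0])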

Partitioning and Compilation Settings

There are some options which can be configured in partition_for_tensorrt; an example follows the list below.

  • version - TensorRT version to target as tuple of (major, minor, patch). If TVM is compiled with USE_TENSORRT_RUNTIME=ON, the linked TensorRT version will be used instead. The version will affect which ops can be partitioned to TensorRT.

  • use_implicit_batch - Use TensorRT implicit batch mode (default true). Setting to false will enable explicit batch mode which will widen supported operators to include those which modify the batch dimension, but may reduce performance for some models.

  • remove_no_mac_subgraphs - A heuristic to improve performance. Removes subgraphs which have been partitioned for TensorRT if they do not have any multiply-accumulate operations. The removed subgraphs will go through TVM’s standard compilation instead.

  • max_workspace_size - How many bytes of workspace size to allow each subgraph to use for TensorRT engine creation. See TensorRT documentation for more info. Can be overridden at runtime.
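
For example, these options can be passed as keyword arguments to partition_for_tensorrt. The sketch below assumes the keyword names match the option names listed above; the values shown are illustrative.

from tvm.relay.op.contrib.tensorrt import partition_for_tensorrt

mod, config = partition_for_tensorrt(
    mod,
    params,
    version=(7, 0, 0),             # target TensorRT 7.0.0 when not linked at build time
    use_implicit_batch=False,      # switch to explicit batch mode
    remove_no_mac_subgraphs=True,  # drop TensorRT subgraphs with no multiply-accumulates
    max_workspace_size=1 << 30,    # 1 GiB of workspace per subgraph
)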

Runtime Settings

There are some additional options which can be configured at runtime using environment variables; a short example of setting them follows the list below.

  • Automatic FP16 Conversion - Environment variable TVM_TENSORRT_USE_FP16=1 can be set to automatically convert the TensorRT components of your model to 16-bit floating point precision. This can greatly increase performance, but may cause a slight loss in model accuracy.

  • Caching TensorRT Engines - During the first inference, the runtime will invoke the TensorRT API to build an engine. This can be time consuming, so you can set TVM_TENSORRT_CACHE_DIR to point to a directory where these built engines will be saved to disk. The next time you load the model and give it the same directory, the runtime will load the already built engines to avoid the long warmup time. A unique directory is required for each model.

  • TensorRT has a parameter to configure the maximum amount of scratch space that each layer in the model can use. It is generally best to use the highest value which does not cause you to run out of memory. You can use TVM_TENSORRT_MAX_WORKSPACE_SIZE to override this by specifying the workspace size in bytes you would like to use.

  • For models which contain a dynamic batch dimension, the variable TVM_TENSORRT_MULTI_ENGINE can be used to determine how TensorRT engines will be created at runtime. The default mode, TVM_TENSORRT_MULTI_ENGINE=0, will maintain only one engine in memory at a time. If an input is encountered with a higher batch size, the engine will be rebuilt with the new max_batch_size setting. That engine will be compatible with all batch sizes from 1 to max_batch_size. This mode reduces the amount of memory used at runtime. The second mode, TVM_TENSORRT_MULTI_ENGINE=1, will build a unique TensorRT engine optimized for each batch size that is encountered. This will give greater performance, but will consume more memory.
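
These variables are read by the TensorRT runtime module, so they must be set before the module runs its first inference. A minimal sketch of setting them from Python (the values, including the cache directory path, are only illustrative):

import os

# Convert the TensorRT portions of the model to FP16.
os.environ["TVM_TENSORRT_USE_FP16"] = "1"

# Cache built engines in a directory unique to this model.
os.environ["TVM_TENSORRT_CACHE_DIR"] = "/path/to/resnet18_trt_cache"

# Allow up to 1 GiB of workspace per layer.
os.environ["TVM_TENSORRT_MAX_WORKSPACE_SIZE"] = str(1 << 30)

# Build a separate engine per encountered batch size (dynamic batch models).
os.environ["TVM_TENSORRT_MULTI_ENGINE"] = "1"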

Operator support

Relay Node                Remarks
------------------------  ----------------------------------
nn.relu
sigmoid
tanh
nn.batch_norm
nn.layer_norm
nn.softmax
nn.conv1d
nn.conv2d
nn.dense
nn.bias_add
add
subtract
multiply
divide
power
maximum
minimum
nn.max_pool2d
nn.avg_pool2d
nn.global_max_pool2d
nn.global_avg_pool2d
exp
log
sqrt
abs
negative
nn.batch_flatten
expand_dims
squeeze
concatenate
nn.conv2d_transpose
transpose
layout_transform
reshape
nn.pad
sum
prod
max
min
mean
nn.adaptive_max_pool2d
nn.adaptive_avg_pool2d
nn.batch_matmul
clip                      Requires TensorRT 5.1.5 or greater
nn.leaky_relu             Requires TensorRT 5.1.5 or greater
sin                       Requires TensorRT 5.1.5 or greater
cos                       Requires TensorRT 5.1.5 or greater
atan                      Requires TensorRT 5.1.5 or greater
ceil                      Requires TensorRT 5.1.5 or greater
floor                     Requires TensorRT 5.1.5 or greater
split                     Requires TensorRT 5.1.5 or greater
strided_slice             Requires TensorRT 5.1.5 or greater
nn.conv3d                 Requires TensorRT 6.0.1 or greater
nn.max_pool3d             Requires TensorRT 6.0.1 or greater
nn.avg_pool3d             Requires TensorRT 6.0.1 or greater
nn.conv3d_transpose       Requires TensorRT 6.0.1 or greater
erf                       Requires TensorRT 7.0.0 or greater

Adding a new operator

To add support for a new operator, there are several files you will need to modify:

  • src/runtime/contrib/tensorrt/tensorrt_ops.cc Create a new op converter class which implements the TensorRTOpConverter interface. You must implement the constructor to specify how many inputs there are and whether they are tensors or weights. You must also implement the Convert method to perform the conversion. This is done by using the inputs, attributes, and network from params to add the new TensorRT layers and push the layer outputs. You can use the existing converters as an example. Finally, register your new op converter in the GetOpConverters() map.

  • python/relay/op/contrib/tensorrt.py This file contains the annotation rules for TensorRT. These determine which operators, and which of their attributes, are supported. You must register an annotation function for the relay operator which checks the attributes and returns true or false to indicate what your converter supports; see the Python sketch after this list.

  • tests/python/contrib/test_tensorrt.py Add unit tests for the given operator.
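
As a reference for the annotation side, the sketch below shows the general shape of such a rule using the BYOC attribute key target.tensorrt. The operator, checks, and function name are hypothetical; mirror an existing rule in tensorrt.py when adding your own.

import tvm

# Hypothetical annotation rule for nn.softmax: return True only when the
# attributes are supported by the corresponding TensorRT converter.
@tvm.ir.register_op_attr("nn.softmax", "target.tensorrt")
def softmax_annotate_fn(expr):
    attrs, args = expr.attrs, expr.args
    # TensorRT layers generally require float32 inputs.
    if any(arg.checked_type.dtype != "float32" for arg in args):
        return False
    # Reject softmax over the outermost (batch) axis, which implicit batch
    # mode cannot handle.
    if int(attrs.axis) == 0:
        return False
    return True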