Relay TensorRT Integration

Author: Trevor Morris

Introduction

NVIDIA TensorRT is a library for optimized deep learning inference. This integration will offload as many operators as possible from Relay to TensorRT, providing a performance boost on NVIDIA GPUs without the need to tune schedules.

This guide will demonstrate how to install TensorRT and build TVM with the TensorRT BYOC and runtime enabled. It will also provide example code to compile and run a ResNet-18 model using TensorRT, and explain how to configure the compilation and runtime settings. Finally, we document the supported operators and how to extend the integration to support other operators.

Installing TensorRT

In order to download TensorRT, you will need to create an NVIDIA Developer program account. Please see NVIDIA’s documentation for more info: https://docs.nvidia.com/deeplearning/tensorrt/install-guide/index.html. If you have a Jetson device such as a TX1, TX2, Xavier, or Nano, TensorRT will already be installed on the device via the JetPack SDK.

There are two methods to install TensorRT:

  • System install via deb or rpm package.

  • Tar file installation.

With the tar file installation method, you must provide the path of the extracted tar archive to USE_TENSORRT_RUNTIME=/path/to/TensorRT. With the system install method, USE_TENSORRT_RUNTIME=ON will automatically locate your installation.

Building TVM with TensorRT support

There are two separate build flags for TensorRT integration in TVM. These flags also enable cross-compilation: USE_TENSORRT_CODEGEN=ON allows you to build a module with TensorRT support on a host machine, while USE_TENSORRT_RUNTIME=ON enables the TVM runtime on an edge device to execute the TensorRT module. You should enable both if you want to compile and also execute models with the same TVM build.

  • USE_TENSORRT_CODEGEN=ON/OFF - This flag will enable compiling a TensorRT module, which does not require any TensorRT library.

  • USE_TENSORRT_RUNTIME=ON/OFF/path-to-TensorRT - This flag will enable the TensorRT runtime module. This will build TVM against the installed TensorRT library.

Example setting in config.cmake file:

set(USE_TENSORRT_CODEGEN ON)
set(USE_TENSORRT_RUNTIME /home/ubuntu/TensorRT-7.0.0.11)

Build and Deploy ResNet-18 with TensorRT

Create a Relay graph from an MXNet ResNet-18 model.

import tvm
from tvm import relay
import mxnet
from mxnet.gluon.model_zoo.vision import get_model

dtype = "float32"
input_shape = (1, 3, 224, 224)
block = get_model('resnet18_v1', pretrained=True)
mod, params = relay.frontend.from_mxnet(block, shape={'data': input_shape}, dtype=dtype)

Annotate and partition the graph for TensorRT. All ops which are supported by the TensorRT integration will be marked and offloaded to TensorRT. The rest of the ops will go through the regular TVM CUDA compilation and code generation.

from tvm.relay.op.contrib.tensorrt import partition_for_tensorrt
mod, config = partition_for_tensorrt(mod, params)
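
After partitioning, you can optionally check how much of the graph was offloaded. This is a minimal sketch; it assumes the partitioned subgraphs carry the standard BYOC "Compiler" attribute, which the TensorRT partitioning sets to "tensorrt".

# Optional: list the subgraphs that were offloaded to TensorRT. Each
# offloaded subgraph becomes a global function whose "Compiler" attribute
# is "tensorrt"; everything else stays in "main" and goes through the
# regular TVM CUDA path.
for gv in mod.get_global_vars():
    func = mod[gv]
    if func.attrs and "Compiler" in func.attrs.keys():
        print(gv.name_hint, "->", func.attrs["Compiler"])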

Build the Relay graph, using the new module and config returned by partition_for_tensorrt. The target must always be a cuda target. partition_for_tensorrt will automatically fill out the required values in the config, so there is no need to modify it - just pass it along to the PassContext so the values can be read during compilation.

target = "cuda"
with tvm.transform.PassContext(opt_level=3, config={'relay.ext.tensorrt.options': config}):
    lib = relay.build(mod, target=target, params=params)

Export the module.

lib.export_library('compiled.so')
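
If you compiled the module on an x86 host with USE_TENSORRT_CODEGEN=ON but will deploy it to an ARM edge device such as a Jetson, export_library can be pointed at a cross compiler. This is a hedged sketch: the toolchain path is an assumption, and your setup may require additional configuration.

# Hypothetical cross-compilation for an aarch64 edge device. Substitute
# the path of your own cross-compiler toolchain.
lib.export_library('compiled.so', cc='/usr/bin/aarch64-linux-gnu-g++')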

Load module and run inference on the target machine, which must be built with USE_TENSORRT_RUNTIME enabled. The first run will take longer because the TensorRT engine will have to be built.

import numpy as np
import tvm
from tvm.contrib import graph_executor

dev = tvm.cuda(0)
loaded_lib = tvm.runtime.load_module('compiled.so')
gen_module = graph_executor.GraphModule(loaded_lib['default'](dev))
input_data = np.random.uniform(0, 1, input_shape).astype(dtype)
gen_module.run(data=input_data)
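
To observe the engine-build cost of the first run, you can time consecutive calls and then read back the result. A minimal sketch using only the standard library:

import time

# Rough wall-clock timing: the first call includes TensorRT engine
# construction, later calls reuse the engine and should be much faster.
for i in range(2):
    start = time.time()
    gen_module.run(data=input_data)
    dev.sync()  # wait for GPU work to finish before stopping the clock
    print("run %d took %.2f s" % (i, time.time() - start))

# Read the output of the last run back as a numpy array.
out = gen_module.get_output(0).numpy()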

Partitioning and Compilation Settings

There are some options which can be configured in partition_for_tensorrt; a usage sketch follows this list.

  • version - TensorRT version to target as tuple of (major, minor, patch). If TVM is compiled with USE_TENSORRT_RUNTIME=ON, the linked TensorRT version will be used instead. The version will affect which ops can be partitioned to TensorRT.

  • use_implicit_batch - Use TensorRT implicit batch mode (default true). Setting to false will enable explicit batch mode which will widen supported operators to include those which modify the batch dimension, but may reduce performance for some models.

  • remove_no_mac_subgraphs - A heuristic to improve performance. Removes subgraphs which have been partitioned for TensorRT if they do not have any multiply-accumulate operations. The removed subgraphs will go through TVM’s standard compilation instead.

  • max_workspace_size - How many bytes of workspace to allow each subgraph to use for TensorRT engine creation. See the TensorRT documentation for more info. Can be overridden at runtime.
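
A minimal sketch showing how these options can be passed as keyword arguments to partition_for_tensorrt (the values are illustrative, not recommendations):

mod, config = partition_for_tensorrt(
    mod,
    params,
    # Target a specific TensorRT version when TVM is built without
    # USE_TENSORRT_RUNTIME; ignored if a TensorRT library is linked.
    version=(7, 0, 0),
    # Explicit batch mode: widens operator coverage to ops that modify
    # the batch dimension.
    use_implicit_batch=False,
    # Send subgraphs with no multiply-accumulate ops through regular
    # TVM compilation instead of TensorRT.
    remove_no_mac_subgraphs=True,
    # 1 GiB of workspace per subgraph for engine creation.
    max_workspace_size=1 << 30,
)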

Runtime Settings

There are some additional options which can be configured at runtime using environment variables; a sketch showing how to set them from Python follows this list.

  • Automatic FP16 Conversion - Environment variable TVM_TENSORRT_USE_FP16=1 can be set to automatically convert the TensorRT components of your model to 16-bit floating point precision. This can greatly increase performance, but may cause a slight loss in model accuracy.

  • Caching TensorRT Engines - During the first inference, the runtime will invoke the TensorRT API to build an engine. This can be time consuming, so you can set TVM_TENSORRT_CACHE_DIR to point to a directory where the built engines will be saved on disk. The next time you load the model and give it the same directory, the runtime will load the already built engines to avoid the long warmup time. A unique directory is required for each model.

  • TensorRT has a parameter to configure the maximum amount of scratch space that each layer in the model can use. It is generally best to use the highest value which does not cause you to run out of memory. You can set TVM_TENSORRT_MAX_WORKSPACE_SIZE to override this by specifying the workspace size in bytes you would like to use.

  • For models which contain a dynamic batch dimension, the variable TVM_TENSORRT_MULTI_ENGINE can be used to determine how TensorRT engines will be created at runtime. The default mode, TVM_TENSORRT_MULTI_ENGINE=0, will maintain only one engine in memory at a time. If an input is encountered with a higher batch size, the engine will be rebuilt with the new max_batch_size setting. That engine will be compatible with all batch sizes from 1 to max_batch_size. This mode reduces the amount of memory used at runtime. The second mode, TVM_TENSORRT_MULTI_ENGINE=1, will build a unique TensorRT engine optimized for each batch size that is encountered. This will give greater performance, but will consume more memory.
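
These variables only need to be visible to the process that loads and runs the compiled module. A minimal sketch setting them from Python before the first inference; the cache directory path is illustrative.

import os

# Enable FP16 conversion, cache built engines on disk, and cap the
# per-layer workspace. Set these before the first inference; use a
# unique cache directory for each model.
os.environ["TVM_TENSORRT_USE_FP16"] = "1"
os.environ["TVM_TENSORRT_CACHE_DIR"] = "/path/to/engine_cache"  # illustrative path
os.environ["TVM_TENSORRT_MAX_WORKSPACE_SIZE"] = str(1 << 30)    # bytes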

Operator support

Relay Node                Remarks
nn.relu
sigmoid
tanh
nn.batch_norm
nn.layer_norm
nn.softmax
nn.conv1d
nn.conv2d
nn.dense
nn.bias_add
add
subtract
multiply
divide
power
maximum
minimum
nn.max_pool2d
nn.avg_pool2d
nn.global_max_pool2d
nn.global_avg_pool2d
exp
log
sqrt
abs
negative
nn.batch_flatten
expand_dims
squeeze
concatenate
nn.conv2d_transpose
transpose
layout_transform
reshape
nn.pad
sum
prod
max
min
mean
nn.adaptive_max_pool2d
nn.adaptive_avg_pool2d
nn.batch_matmul
clip                      Requires TensorRT 5.1.5 or greater
nn.leaky_relu             Requires TensorRT 5.1.5 or greater
sin                       Requires TensorRT 5.1.5 or greater
cos                       Requires TensorRT 5.1.5 or greater
atan                      Requires TensorRT 5.1.5 or greater
ceil                      Requires TensorRT 5.1.5 or greater
floor                     Requires TensorRT 5.1.5 or greater
split                     Requires TensorRT 5.1.5 or greater
strided_slice             Requires TensorRT 5.1.5 or greater
nn.conv3d                 Requires TensorRT 6.0.1 or greater
nn.max_pool3d             Requires TensorRT 6.0.1 or greater
nn.avg_pool3d             Requires TensorRT 6.0.1 or greater
nn.conv3d_transpose       Requires TensorRT 6.0.1 or greater
erf                       Requires TensorRT 7.0.0 or greater

Adding a new operator

To add support for a new operator, there are several files you will need to modify:

  • src/runtime/contrib/tensorrt/tensorrt_ops.cc Create a new op converter class which implements the TensorRTOpConverter interface. You must implement the constructor to specify how many inputs there are and whether they are tensors or weights. You must also implement the Convert method to perform the conversion. This is done by using the inputs, attributes, and network from params to add the new TensorRT layers and push the layer outputs. You can use the existing converters as an example. Finally, register your new op converter in the GetOpConverters() map.

  • python/relay/op/contrib/tensorrt.py This file contains the annotation rules for TensorRT, which determine which operators and which of their attributes are supported. You must register an annotation function for the Relay operator that checks the attributes your converter supports and returns true or false; a sketch of this pattern follows this list.

  • tests/python/contrib/test_tensorrt.py Add unit tests for the given operator.
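
As an illustration of the annotation step, the general BYOC pattern is to register a "target.tensorrt" check for the operator. The file uses its own helper decorators around this mechanism, so treat the following as a hedged sketch of the pattern rather than the exact code in tensorrt.py; the nn.softmax check is shown only as an example.

import tvm

# Hedged sketch of an annotation rule: return True only if this call
# (and the attributes your converter handles) is supported.
@tvm.ir.register_op_attr("nn.softmax", "target.tensorrt")
def softmax_annotate_fn(expr):
    attrs = expr.attrs
    # Example attribute check: a softmax over the batch axis cannot be
    # expressed in implicit batch mode.
    if int(attrs.axis) == 0:
        return False
    return True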