# Relay Arm|reg| Compute Library Integration¶

## Introduction¶

Arm Compute Library (ACL) is an open source project that provides accelerated kernels for Arm CPU’s and GPU’s. Currently the integration offloads operators to ACL to use hand-crafted assembler routines in the library. By offloading select operators from a relay graph to ACL we can achieve a performance boost on such devices.

## Building with ACL support¶

The current implementation has two separate build options in cmake. The reason for this split is because ACL cannot be used on an x86 machine. However, we still want to be able compile an ACL runtime module on an x86 machine.

• USE_ARM_COMPUTE_LIB=ON/OFF - Enabling this flag will add support for compiling an ACL runtime module.

• USE_ARM_COMPUTE_LIB_GRAPH_RUNTIME=ON/OFF/path-to-acl - Enabling this flag will allow the graph runtime to compute the ACL offloaded functions.

These flags can be used in different scenarios depending on your setup. For example, if you want to compile an ACL module on an x86 machine and then run the module on a remote Arm device via RPC, you will need to use USE_ARM_COMPUTE_LIB=ON on the x86 machine and USE_ARM_COMPUTE_LIB_GRAPH_RUNTIME=ON on the remote AArch64 device.

## Usage¶

Note

This section may not stay up-to-date with changes to the API.

Create a relay graph. This may be a single operator or a whole graph. The intention is that any relay graph can be input. The ACL integration will only pick supported operators to be offloaded whilst the rest will be computed via TVM. (For this example we will use a single max_pool2d operator).

import tvm
from tvm import relay

data_type = "float32"
data_shape = (1, 14, 14, 512)
strides = (2, 2)
padding = (0, 0, 0, 0)
pool_size = (2, 2)
layout = "NHWC"
output_shape = (1, 7, 7, 512)

data = relay.var('data', shape=data_shape, dtype=data_type)
module = tvm.IRModule.from_expr(out)


Annotate and partition the graph for ACL.

..code:: python

from tvm.relay.op.contrib.arm_compute_lib import partition_for_arm_compute_lib module = partition_for_arm_compute_lib(module)

Build the Relay graph.

target = "llvm -mtriple=aarch64-linux-gnu -mattr=+neon"
with tvm.transform.PassContext(opt_level=3, disabled_pass=["AlterOpLayout"]):
lib = relay.build(module, target=target)


Export the module.

lib_path = '~/lib_acl.so'
cross_compile = 'aarch64-linux-gnu-c++'
lib.export_library(lib_path, cc=cross_compile)


Run Inference. This must be on an Arm device. If compiling on x86 device and running on AArch64, consider using the RPC mechanism. Tutorials for using the RPC mechanism: https://tvm.apache.org/docs/tutorials/cross_compilation_and_rpc.html#sphx-glr-tutorials-cross-compilation-and-rpc-py

ctx = tvm.cpu(0)
d_data = np.random.uniform(0, 1, data_shape).astype(data_type)
map_inputs = {'data': d_data}
gen_module.set_input(**map_inputs)
gen_module.run()


## More examples¶

The example above only shows a basic example of how ACL can be used for offloading a single Maxpool2D. If you would like to see more examples for each implemented operator and for networks refer to the tests: tests/python/contrib/test_arm_compute_lib. Here you can modify infrastructure.py to use the remote device you have setup.

## Operator support¶

Relay Node

Remarks

nn.conv2d

fp32:

(only groups = 1 supported)

qnn.conv2d

uint8:

(only groups = 1 supported)

nn.maxpool2d

fp32, uint8

reshape

fp32, uint8

Note

A composite operator is a series of operators that map to a single Arm Compute Library operator. You can view this as being a single fused operator from the view point of Arm Compute Library. ‘?’ denotes an optional operator in the series of operators that make up a composite operator.