# Relay Arm® Compute Library Integration

Author: Luke Hutton

## Introduction

Arm Compute Library (ACL) is an open source project that provides accelerated kernels for Arm CPUs and GPUs. Currently the integration offloads operators to ACL so that they use the hand-crafted assembler routines in the library. By offloading select operators from a Relay graph to ACL, we can achieve a performance boost on such devices.

## Installing Arm Compute Library

Before installing Arm Compute Library, it is important to know which architecture to build for. One way to determine this is to run lscpu and look for the “Model name” of the CPU. You can then look up that model online to find its architecture.
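When working on the device itself, a similar check can be scripted. The sketch below is only an illustration: `guess_acl_arch` and the `MACHINE_TO_ACL_ARCH` table are hypothetical names, and the mapping from machine strings to ACL architecture names is an assumption that you should verify against the ACL release you are building.

```python
import platform

# Assumed mapping from `uname -m` style machine strings to ACL build
# architecture names; extend or correct it for the devices you care about.
MACHINE_TO_ACL_ARCH = {
    "aarch64": "arm64-v8a",
    "arm64": "arm64-v8a",
    "armv7l": "armv7a",
}


def guess_acl_arch(machine=None):
    """Return a candidate ACL architecture name, or None if unknown."""
    machine = (machine or platform.machine()).lower()
    return MACHINE_TO_ACL_ARCH.get(machine)


# Report what this machine looks like; None means "look it up manually".
print(platform.machine(), "->", guess_acl_arch())
```

A result of `None` (for example on an x86 host) simply means the table has no entry, in which case the lscpu approach above still applies.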

We recommend two different ways to build and install ACL:

• Use the script located at docker/install/ubuntu_install_arm_compute_library.sh. You can use this script for building ACL from source natively or for cross-compiling the library on an x86 machine. You may need to change the architecture of the device you wish to compile for by altering the target_arch variable. Binaries will be built from source and installed to the location denoted by install_path.

• Alternatively, you can download and use pre-built binaries from: https://github.com/ARM-software/ComputeLibrary/releases. When using this package, you will need to select the binaries for the architecture you require and make sure they are visible to cmake. This can be done like so:

cd <acl-prebuilt-package>/lib
mv ./linux-<architecture-to-build-for>-neon/* .


In both cases you will need to set USE_ARM_COMPUTE_LIB_GRAPH_RUNTIME to the path where the ACL package is located. CMake will look in /path-to-acl/ along with /path-to-acl/lib and /path-to-acl/build for the required binaries. See the section below for more information on how to use these configuration options.

## Building with ACL support

The current implementation has two separate build options in CMake. The split exists because ACL cannot be used on an x86 machine, yet we still want to be able to compile an ACL runtime module on an x86 machine.

• USE_ARM_COMPUTE_LIB=ON/OFF - Enabling this flag will add support for compiling an ACL runtime module.

• USE_ARM_COMPUTE_LIB_GRAPH_RUNTIME=ON/OFF/path-to-acl - Enabling this flag will allow the graph runtime to compute the ACL offloaded functions.

These flags can be used in different scenarios depending on your setup. For example, if you want to compile an ACL module on an x86 machine and then run the module on a remote Arm device via RPC, you will need to use USE_ARM_COMPUTE_LIB=ON on the x86 machine and USE_ARM_COMPUTE_LIB_GRAPH_RUNTIME=ON on the remote AArch64 device.

By default both options are set to OFF. Using USE_ARM_COMPUTE_LIB_GRAPH_RUNTIME=ON will mean that ACL binaries are searched for by cmake in the default locations (see https://cmake.org/cmake/help/v3.4/command/find_library.html). In addition to this, /path-to-tvm-project/acl/ will also be searched. It is likely that you will need to set your own path to locate ACL. This can be done by specifying a path in the place of ON.

These flags should be set in your config.cmake file. For example:

set(USE_ARM_COMPUTE_LIB ON)
set(USE_ARM_COMPUTE_LIB_GRAPH_RUNTIME /path/to/acl)
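To make the split between the two flags concrete, here is a small sketch that prints the config.cmake lines for a given machine role. The helper `acl_cmake_flags` and its parameters are hypothetical and exist only for illustration; they are not part of TVM.

```python
def acl_cmake_flags(compiles_modules, runs_modules, acl_path=None):
    """Return the two ACL-related config.cmake lines for one machine."""
    codegen = "ON" if compiles_modules else "OFF"
    if runs_modules:
        # A path takes the place of ON when ACL is not in a default location.
        runtime = acl_path if acl_path else "ON"
    else:
        runtime = "OFF"
    return [
        f"set(USE_ARM_COMPUTE_LIB {codegen})",
        f"set(USE_ARM_COMPUTE_LIB_GRAPH_RUNTIME {runtime})",
    ]


# An x86 host that only compiles, and an AArch64 device that only runs.
print("\n".join(acl_cmake_flags(True, False)))
print("\n".join(acl_cmake_flags(False, True, acl_path="/path/to/acl")))
```

This mirrors the RPC scenario described earlier: the x86 host needs only the compilation flag, while the remote device needs only the runtime flag.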


## Usage

Note: This section may not stay up-to-date with changes to the API.

Create a Relay graph. This may be a single operator or a whole graph; the intention is that any Relay graph can be used as input. The ACL integration will only pick supported operators to offload, whilst the rest will be computed via TVM. For this example we will use a single max_pool2d operator.

import tvm
from tvm import relay

data_type = "float32"
data_shape = (1, 14, 14, 512)
strides = (2, 2)
padding = (0, 0, 0, 0)
pool_size = (2, 2)
layout = "NHWC"
output_shape = (1, 7, 7, 512)

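data = relay.var('data', shape=data_shape, dtype=data_type)
# Define the max_pool2d operator whose output becomes the module's result.
out = relay.nn.max_pool2d(data, pool_size=pool_size, strides=strides, layout=layout, padding=padding)
module = tvm.IRModule.from_expr(out)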
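The output_shape value in the snippet above follows from the usual pooling shape arithmetic. The sketch below reproduces it in plain Python; `pool2d_output_shape` is a hypothetical helper, and it assumes NHWC data, floor rounding, no dilation, and padding given in (top, left, bottom, right) order.

```python
def pool2d_output_shape(data_shape, pool_size, strides, padding):
    """Output shape of a 2D pooling op on NHWC data (floor mode, no dilation)."""
    batch, height, width, channels = data_shape
    # Assumed padding order: (top, left, bottom, right).
    pad_top, pad_left, pad_bottom, pad_right = padding
    out_h = (height + pad_top + pad_bottom - pool_size[0]) // strides[0] + 1
    out_w = (width + pad_left + pad_right - pool_size[1]) // strides[1] + 1
    return (batch, out_h, out_w, channels)


# Same values as the example: prints (1, 7, 7, 512)
print(pool2d_output_shape((1, 14, 14, 512), (2, 2), (2, 2), (0, 0, 0, 0)))
```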


Annotate and partition the graph for ACL.

from tvm.relay.op.contrib.arm_compute_lib import partition_for_arm_compute_lib
module = partition_for_arm_compute_lib(module)
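Conceptually, partitioning splits the graph into regions that ACL will run and regions that stay with TVM. The toy below illustrates that idea on a flat list of operator names. It is a deliberate simplification and not how TVM implements the pass, which works on a dataflow graph; `partition_ops` and the `supported` set are hypothetical.

```python
def partition_ops(ops, supported):
    """Group a linear sequence of operator names into (target, [ops]) regions."""
    regions = []
    for op in ops:
        target = "acl" if op in supported else "tvm"
        if regions and regions[-1][0] == target:
            # Extend the current region while the target stays the same.
            regions[-1][1].append(op)
        else:
            regions.append((target, [op]))
    return regions


# Illustrative inputs only; the real supported set is listed under "Operator support".
supported = {"nn.conv2d", "nn.max_pool2d", "reshape"}
ops = ["nn.conv2d", "nn.max_pool2d", "nn.softmax", "reshape"]
print(partition_ops(ops, supported))
```

In this toy, consecutive supported operators end up in one offloaded region, and the unsupported operator forms its own region that TVM computes.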


Build the Relay graph.

target = "llvm -mtriple=aarch64-linux-gnu -mattr=+neon"
with tvm.transform.PassContext(opt_level=3, disabled_pass=["AlterOpLayout"]):
    lib = relay.build(module, target=target)


Export the module.

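import os

# Expand '~' explicitly; Python does not expand it in plain path strings.
lib_path = os.path.expanduser('~/lib_acl.so')
cross_compile = 'aarch64-linux-gnu-c++'
lib.export_library(lib_path, cc=cross_compile)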


Run inference. This must be done on an Arm device. If you are compiling on an x86 device and running on AArch64, consider using the RPC mechanism. A tutorial on the RPC mechanism is available at: https://tvm.apache.org/docs/tutorials/get_started/cross_compilation_and_rpc.html

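import numpy as np
from tvm.contrib import graph_runtime  # graph_executor in newer TVM releases
ctx = tvm.cpu(0)
# Load the exported library on the device and create the graph runtime module.
loaded_lib = tvm.runtime.load_module('lib_acl.so')
gen_module = graph_runtime.GraphModule(loaded_lib['default'](ctx))
d_data = np.random.uniform(0, 1, data_shape).astype(data_type)
map_inputs = {'data': d_data}
gen_module.set_input(**map_inputs)
gen_module.run()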


## More examples

The example above shows only the basics of offloading a single max_pool2d operator to ACL. For more examples covering each implemented operator, as well as whole networks, refer to the tests in tests/python/contrib/test_arm_compute_lib. There you can modify test_config.json to configure how a remote device is created in infrastructure.py and, as a result, how the runtime tests are run.

An example configuration for test_config.json:

• connection_type - The type of RPC connection. Options: local, tracker, remote.

• host - The host device to connect to.

• port - The port to use when connecting.

• target - The target to use for compilation.

• device_key - The device key when connecting via a tracker.

• cross_compile - Path to the cross compiler when connecting from a non-Arm platform, e.g. aarch64-linux-gnu-g++.

{
    "connection_type": "local",
    "host": "localhost",
    "port": 9090,
    "target": "llvm -mtriple=aarch64-linux-gnu -mattr=+neon",
    "device_key": "",
    "cross_compile": ""
}
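Before launching the tests it can help to sanity-check an edited configuration. The following is a minimal sketch using only the standard library; `parse_test_config` is a hypothetical helper, and the validation rules (all six keys present, a tracker connection needing a non-empty device_key) are assumptions drawn from the option descriptions above rather than checks the test infrastructure is known to perform.

```python
import json

REQUIRED_KEYS = {"connection_type", "host", "port", "target", "device_key", "cross_compile"}
CONNECTION_TYPES = {"local", "tracker", "remote"}


def parse_test_config(text):
    """Parse test_config.json content and reject obviously invalid settings."""
    config = json.loads(text)
    missing = REQUIRED_KEYS - config.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if config["connection_type"] not in CONNECTION_TYPES:
        raise ValueError(f"unknown connection_type: {config['connection_type']!r}")
    # Assumed rule: a tracker connection needs a device key to request a device.
    if config["connection_type"] == "tracker" and not config["device_key"]:
        raise ValueError("device_key is required when connecting via a tracker")
    return config


example = """{
    "connection_type": "local",
    "host": "localhost",
    "port": 9090,
    "target": "llvm -mtriple=aarch64-linux-gnu -mattr=+neon",
    "device_key": "",
    "cross_compile": ""
}"""
print(parse_test_config(example)["connection_type"])
```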


## Operator support

| Relay Node | Remarks |
| --- | --- |
| nn.conv2d | fp32 (only groups = 1 supported) |
| qnn.conv2d | uint8 (only groups = 1 supported) |
| nn.dense | fp32 |
| qnn.dense | uint8 |
| nn.max_pool2d | fp32, uint8 |
| nn.global_max_pool2d | fp32, uint8 |
| nn.avg_pool2d | fp32: Simple: nn.avg_pool2d; uint8: Composite: cast(int32), nn.avg_pool2d, cast(uint8) |
| nn.global_avg_pool2d | fp32: Simple: nn.global_avg_pool2d; uint8: Composite: cast(int32), nn.avg_pool2d, cast(uint8) |
| power(of 2) + nn.avg_pool2d + sqrt | A special case for L2 pooling. fp32: Composite: power(of 2), nn.avg_pool2d, sqrt |
| reshape | fp32, uint8 |
| maximum | fp32 |
| | fp32 |
| | uint8 |

Note: A composite operator is a series of operators that map to a single Arm Compute Library operator. You can view this as a single fused operator from the viewpoint of Arm Compute Library. ‘?’ denotes an optional operator in the series of operators that make up a composite operator.
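The ‘?’ notation can be read as a small pattern language over operator sequences. The sketch below shows one way to interpret it. `match_composite` is a hypothetical helper rather than TVM's pattern matcher, and the pattern containing optional operators is illustrative only.

```python
def match_composite(ops, pattern):
    """Return True if `ops` matches `pattern`, where 'name?' marks an optional op."""
    if not pattern:
        return not ops
    head, rest = pattern[0], pattern[1:]
    optional = head.endswith("?")
    name = head[:-1] if optional else head
    # Try consuming the next operator with this pattern element first.
    if ops and ops[0] == name and match_composite(ops[1:], rest):
        return True
    # Otherwise the element may be skipped only if it is optional.
    return optional and match_composite(ops, rest)


# Illustrative pattern with optional operators (an assumption, not from the table).
pattern = ["nn.pad?", "nn.conv2d", "nn.bias_add?", "nn.relu?"]
print(match_composite(["nn.conv2d"], pattern))
print(match_composite(["nn.pad", "nn.conv2d", "nn.relu"], pattern))
print(match_composite(["nn.relu", "nn.conv2d"], pattern))
```

The same function handles patterns without optional parts, such as the cast(int32), nn.avg_pool2d, cast(uint8) sequence from the table, when the operators are written as plain names.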