..  Licensed to the Apache Software Foundation (ASF) under one
    or more contributor license agreements.  See the NOTICE file
    distributed with this work for additional information
    regarding copyright ownership.  The ASF licenses this file
    to you under the Apache License, Version 2.0 (the
    "License"); you may not use this file except in compliance
    with the License.  You may obtain a copy of the License at

..    http://www.apache.org/licenses/LICENSE-2.0

..  Unless required by applicable law or agreed to in writing,
    software distributed under the License is distributed on an
    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    KIND, either express or implied.  See the License for the
    specific language governing permissions and limitations
    under the License.

Vitis AI Integration
====================

`Vitis AI `__ is Xilinx's development stack for hardware-accelerated AI inference
on Xilinx platforms, including both edge devices and Alveo cards. It consists of
optimized IP, tools, libraries, models, and example designs. It is designed with
high efficiency and ease of use in mind, unleashing the full potential of AI
acceleration on Xilinx FPGA and ACAP.

The current Vitis AI flow inside TVM enables acceleration of Neural Network model
inference on edge and cloud with the `Zynq Ultrascale+ MPSoc `__, `Alveo `__ and
`Versal `__ platforms. The identifiers for the supported edge and cloud Deep
Learning Processor Units (DPUs) are:

+--------------------+-----------------------+----------------------------+
| **Target Board**   | **DPU ID**            | **TVM Target ID**          |
+====================+=======================+============================+
| `ZCU104 `__        | DPUCZDX8G             | DPUCZDX8G-zcu104           |
+--------------------+-----------------------+----------------------------+
| `ZCU102 `__        | DPUCZDX8G             | DPUCZDX8G-zcu102           |
+--------------------+-----------------------+----------------------------+
| `Kria KV260 `__    | DPUCZDX8G             | DPUCZDX8G-kv260            |
+--------------------+-----------------------+----------------------------+
| `VCK190 `__        | DPUCVDX8G             | DPUCVDX8G                  |
+--------------------+-----------------------+----------------------------+
| `VCK5000 `__       | DPUCVDX8H             | DPUCVDX8H                  |
+--------------------+-----------------------+----------------------------+
| `U200 `__          | DPUCADF8H             | DPUCADF8H                  |
+--------------------+-----------------------+----------------------------+
| `U250 `__          | DPUCADF8H             | DPUCADF8H                  |
+--------------------+-----------------------+----------------------------+
| `U50 `__           | DPUCAHX8H / DPUCAHX8L | DPUCAHX8H-u50 / DPUCAHX8L  |
+--------------------+-----------------------+----------------------------+
| `U280 `__          | DPUCAHX8H / DPUCAHX8L | DPUCAHX8H-u280 / DPUCAHX8L |
+--------------------+-----------------------+----------------------------+
For more information about the DPU identifiers, see the following table:

+-------------------+-------------+--------------------------------+------------------------+------------------------+------------------------+
| DPU               | Application | HW Platform                    | Quantization Method    | Quantization Bitwidth  | Design Target          |
+===================+=============+================================+========================+========================+========================+
| | Deep Learning   | | C: CNN    | | AD: Alveo DDR                | | X: DECENT            | | 4: 4-bit             | | G: General purpose   |
| | Processing Unit | | R: RNN    | | AH: Alveo HBM                | | I: Integer threshold | | 8: 8-bit             | | H: High throughput   |
| |                 | |           | | VD: Versal DDR with AIE & PL | | F: Float threshold   | | 16: 16-bit           | | L: Low latency       |
| |                 | |           | | ZD: Zynq DDR                 | | R: RNN               | | M: Mixed Precision   | | C: Cost optimized    |
+-------------------+-------------+--------------------------------+------------------------+------------------------+------------------------+

On this page you will find information on how to `set up <#setup-instructions>`__ TVM with
Vitis AI on the different platforms (Zynq, Alveo, Versal), how to get started with
`Compiling a Model <#compiling-a-model>`__, and how to execute compiled models on those
platforms: `Inference <#inference>`__.

System Requirements
-------------------

The `Vitis AI System Requirements page `__ lists the system requirements for running
docker containers as well as for executing on Alveo cards. For edge devices (e.g. Zynq),
deploying models requires a host machine for compiling models using the TVM with Vitis AI
flow, and an edge device for running the compiled models. The host system requirements are
the same as specified in the link above.

Setup instructions
------------------

This section provides instructions for setting up the TVM with Vitis AI flow for both
cloud and edge. TVM with Vitis AI support is provided through a docker container. The
provided scripts and Dockerfile compile TVM and Vitis AI into a single image.

1. Clone the TVM repo

   .. code:: bash

      git clone --recursive https://github.com/apache/tvm.git
      cd tvm

2. Build and start the TVM - Vitis AI docker container.

   .. code:: bash

      ./docker/build.sh demo_vitis_ai bash
      ./docker/bash.sh tvm.demo_vitis_ai

      # Setup inside container
      conda activate vitis-ai-tensorflow

3. Build TVM inside the container with Vitis AI (inside the tvm directory)

   .. code:: bash

      mkdir build
      cp cmake/config.cmake build
      cd build
      echo set\(USE_LLVM ON\) >> config.cmake
      echo set\(USE_VITIS_AI ON\) >> config.cmake
      cmake ..
      make -j$(nproc)

4. Install TVM

   .. code:: bash

      cd ../python
      pip3 install -e . --user

Inside this docker container you can now compile models for both cloud and edge targets.
To run on cloud Alveo or Versal VCK5000 cards inside the docker container, please follow
the `Alveo <#alveo-setup>`__ or `Versal VCK5000 <#versal-vck5000-setup>`__ setup
instructions, respectively. To set up your Zynq or Versal VCK190 evaluation board for
inference, please follow the `Zynq <#zynq-setup>`__ or
`Versal VCK190 <#versal-vck190-setup>`__ instructions, respectively.

Alveo Setup
~~~~~~~~~~~

Check out the following page for setup information: `Alveo Setup `__.

After setup, you can select the right DPU inside the docker container in the following
way:

.. code:: bash

   cd /workspace
   git clone --branch v1.4 --single-branch --recursive https://github.com/Xilinx/Vitis-AI.git
   cd Vitis-AI/setup/alveo
   source setup.sh [DPU-IDENTIFIER]

The DPU identifier for this can be found in the second column of the DPU Targets table at
the top of this page.
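For example, for a U200 or U250 card the second column of the table gives DPUCADF8H, so
the last line above would become (illustrative only; substitute the identifier that
matches your card):

.. code:: bash

   # Example: select the DPUCADF8H overlay for Alveo U200/U250 cards
   source setup.sh DPUCADF8H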
Versal VCK5000 Setup
~~~~~~~~~~~~~~~~~~~~

Check out the following page for setup information: `VCK5000 Setup `__.

After setup, you can select the right DPU inside the docker container in the following
way:

.. code:: bash

   cd /workspace
   git clone --branch v1.4 --single-branch --recursive https://github.com/Xilinx/Vitis-AI.git
   cd Vitis-AI/setup/vck5000
   source setup.sh

Zynq Setup
~~~~~~~~~~

For the Zynq target (DPUCZDX8G) the compilation stage runs inside the docker container on
a host machine. This doesn't require any specific setup except for building the TVM -
Vitis AI docker image. For executing the model, the Zynq board first has to be set up;
more information on that can be found in the steps below.

1. Download the Petalinux image for your target:

   - `ZCU104 `__
   - `ZCU102 `__
   - `Kria KV260 `__

2. Use Etcher software to burn the image file onto the SD card.
3. Insert the SD card with the image into the destination board.
4. Plug in the power and boot the board using the serial port to operate on the system.
5. Set up the IP information of the board using the serial port. For more details on
   steps 1 to 5, please refer to `Setting Up The Evaluation Board `__.
6. Create 4GB of swap space on the board:

   .. code:: bash

      fallocate -l 4G /swapfile
      chmod 600 /swapfile
      mkswap /swapfile
      swapon /swapfile
      echo "/swapfile swap swap defaults 0 0" >> /etc/fstab

7. Install the hdf5 dependency (this will take between 30 minutes and 1 hour to finish):

   .. code:: bash

      cd /tmp && \
      wget https://support.hdfgroup.org/ftp/HDF5/releases/hdf5-1.10/hdf5-1.10.7/src/hdf5-1.10.7.tar.gz && \
      tar -zxvf hdf5-1.10.7.tar.gz && \
      cd hdf5-1.10.7 && \
      ./configure --prefix=/usr && \
      make -j$(nproc) && \
      make install && \
      cd /tmp && rm -rf hdf5-1.10.7*

8. Install Python dependencies:

   .. code:: bash

      pip3 install Cython==0.29.23 h5py==2.10.0 pillow

9. Install PyXIR:

   .. code:: bash

      git clone --recursive --branch rel-v0.3.1 --single-branch https://github.com/Xilinx/pyxir.git
      cd pyxir
      sudo python3 setup.py install --use_vart_edge_dpu

10. Build and install TVM with Vitis AI:

    .. code:: bash

       git clone --recursive https://github.com/apache/tvm
       cd tvm
       mkdir build
       cp cmake/config.cmake build
       cd build
       echo set\(USE_LLVM OFF\) >> config.cmake
       echo set\(USE_VITIS_AI ON\) >> config.cmake
       cmake ..
       make tvm_runtime -j$(nproc)
       cd ../python
       pip3 install --no-deps -e .

11. Check whether the setup was successful in the Python shell:

    .. code:: bash

       python3 -c 'import pyxir; import tvm'

    .. note::

        You might see a warning about the 'cpu-tf' runtime not being found. This warning
        is expected on the board and can be ignored.

Versal VCK190 Setup
~~~~~~~~~~~~~~~~~~~

For the Versal VCK190 setup, please follow the instructions for
`Zynq Setup <#zynq-setup>`__, but use the `VCK190 image `__ in step 1 instead. The other
steps are the same.

Compiling a Model
-----------------

The TVM with Vitis AI flow contains two stages: Compilation and Inference. During
compilation a user can choose a model to compile for the cloud or edge target devices that
are currently supported. Once a model is compiled, the generated files can be used to run
the model on the specified target device during the `Inference <#inference>`__ stage.
Currently, the TVM with Vitis AI flow supports a select number of Xilinx data center and
edge devices.

In this section we walk through the typical flow for compiling models with Vitis AI inside
TVM.

**Imports**

Make sure to import PyXIR and the DPU target (``import pyxir.contrib.target.DPUCADF8H``
for DPUCADF8H):
.. code:: python

   import os

   import pyxir
   import pyxir.contrib.target.DPUCADF8H

   import tvm
   import tvm.relay as relay
   from tvm.contrib.target import vitis_ai
   from tvm.contrib import cc, utils, graph_executor
   from tvm.relay.op.contrib.vitis_ai import partition_for_vitis_ai

**Declare the Target**

.. code:: python

   tvm_target = 'llvm'
   dpu_target = 'DPUCADF8H' # options: 'DPUCADF8H', 'DPUCAHX8H-u50', 'DPUCAHX8H-u280', 'DPUCAHX8L', 'DPUCVDX8H', 'DPUCZDX8G-zcu104', 'DPUCZDX8G-zcu102', 'DPUCZDX8G-kv260'

The TVM with Vitis AI flow currently supports the DPU targets listed in the table at the
top of this page. Once the appropriate targets are defined, we invoke the TVM compiler to
build the graph for the specified target.

**Import the Model**

Example code to import an MXNet model:

.. code:: python

   mod, params = relay.frontend.from_mxnet(block, input_shape)

**Partition the Model**

After importing the model, we utilize the Relay API to annotate the Relay expression for
the provided DPU target and partition the graph.

.. code:: python

   mod = partition_for_vitis_ai(mod, params, dpu=dpu_target)

**Build the Model**

The partitioned model is passed to the TVM compiler to generate the runtime libraries for
the TVM Runtime.

.. code:: python

   export_rt_mod_file = os.path.join(os.getcwd(), 'vitis_ai.rtmod')
   build_options = {
       'dpu': dpu_target,
       'export_runtime_module': export_rt_mod_file
   }
   with tvm.transform.PassContext(opt_level=3, config={'relay.ext.vitis_ai.options': build_options}):
       lib = relay.build(mod, tvm_target, params=params)

**Quantize the Model**

Usually, to be able to accelerate inference of Neural Network models with Vitis AI DPU
accelerators, those models need to be quantized upfront. In the TVM - Vitis AI flow, we
make use of on-the-fly quantization to remove this additional preprocessing step. In this
flow, one doesn't need to quantize the model upfront but can make use of the typical
inference execution calls (module.run) to quantize the model on-the-fly using the first N
inputs that are provided (see more information below). This will set up and calibrate the
Vitis AI DPU and from that point onwards inference will be accelerated for all subsequent
inputs. Note that the edge flow deviates slightly from the explained flow in that
inference won't be accelerated after the first N inputs, but the model will have been
quantized and compiled and can be moved to the edge device for deployment. Please check
out the `Running on Zynq and VCK190 <#running-on-zynq-and-vck190>`__ section below for
more information.

.. code:: python

   module = graph_executor.GraphModule(lib["default"](tvm.cpu()))

   # First N (default = 128) inputs are used for quantization calibration and will
   # be executed on the CPU
   # This config can be changed by setting the 'PX_QUANT_SIZE' (e.g. export PX_QUANT_SIZE=64)
   for i in range(128):
       module.set_input(input_name, inputs[i])
       module.run()

By default, the number of images used for quantization is set to 128. You can change the
number of images used for On-The-Fly Quantization with the PX_QUANT_SIZE environment
variable. For example, execute the following line in the terminal before calling the
compilation script to reduce the quantization calibration dataset to eight images. This
can be used for quick testing.

.. code:: bash

   export PX_QUANT_SIZE=8
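As a small sketch of an alternative, assuming PyXIR picks up ``PX_QUANT_SIZE`` from the
process environment as described above, the variable can also be set at the very top of
the compilation script, before PyXIR is imported and before the calibration loop runs:

.. code:: python

   import os

   # Hypothetical alternative to exporting the variable in the terminal:
   # reduce the on-the-fly quantization calibration dataset to 8 inputs.
   os.environ["PX_QUANT_SIZE"] = "8"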
Lastly, we store the compiled output from the TVM compiler on disk for running the model
on the target device. This happens as follows for cloud DPU's (Alveo, VCK5000):

.. code:: python

   lib_path = "deploy_lib.so"
   lib.export_library(lib_path)

For edge targets (Zynq, VCK190) we have to rebuild for aarch64. To do this we first have
to normally export the module to also serialize the Vitis AI runtime module
(vitis_ai.rtmod). We will load this runtime module again afterwards to rebuild and export
for aarch64.

.. code:: python

   temp = utils.tempdir()
   lib.export_library(temp.relpath("tvm_lib.so"))

   # Build and export lib for aarch64 target
   tvm_target = tvm.target.arm_cpu('ultra96')
   lib_kwargs = {
       'fcompile': cc.create_shared,
       'cc': "/usr/aarch64-linux-gnu/bin/ld"
   }

   build_options = {
       'load_runtime_module': export_rt_mod_file
   }
   with tvm.transform.PassContext(opt_level=3, config={'relay.ext.vitis_ai.options': build_options}):
       lib_edge = relay.build(mod, tvm_target, params=params)

   lib_edge.export_library('deploy_lib_edge.so', **lib_kwargs)

This concludes the tutorial for compiling a model using TVM with Vitis AI. For
instructions on how to run a compiled model please refer to the next section.

Inference
---------

The TVM with Vitis AI flow contains two stages: Compilation and Inference. During
compilation a user can choose to compile a model for any of the target devices that are
currently supported. Once a model is compiled, the generated files can be used to run the
model on a target device during the Inference stage.

Check out the `Running on Alveo and VCK5000 <#running-on-alveo-and-vck5000>`__ and
`Running on Zynq and VCK190 <#running-on-zynq-and-vck190>`__ sections for doing inference
on cloud accelerator cards and edge boards, respectively.

Running on Alveo and VCK5000
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

After having followed the steps in the `Compiling a Model <#compiling-a-model>`__ section,
you can continue running on new inputs inside the docker container for accelerated
inference:

.. code:: python

   module.set_input(input_name, inputs[i])
   module.run()

Alternatively, you can load the exported runtime module (the deploy_lib.so exported in
`Compiling a Model <#compiling-a-model>`__):

.. code:: python

   import pyxir
   import tvm
   from tvm.contrib import graph_executor

   dev = tvm.cpu()

   # input_name = ...
   # input_data = ...

   # load the module into memory
   lib = tvm.runtime.load_module("deploy_lib.so")
   module = graph_executor.GraphModule(lib["default"](dev))

   module.set_input(input_name, input_data)
   module.run()

Running on Zynq and VCK190
~~~~~~~~~~~~~~~~~~~~~~~~~~

Before proceeding, please follow the `Zynq <#zynq-setup>`__ or
`Versal VCK190 <#versal-vck190-setup>`__ setup instructions.

Prior to running a model on the board, you need to compile the model for your target
evaluation board and transfer the compiled model onto the board. Please refer to the
`Compiling a Model <#compiling-a-model>`__ section for information on how to compile a
model.

Afterwards, you will have to transfer the compiled model (deploy_lib_edge.so) to the
evaluation board. Then, on the board you can use the typical "load_module" and
"module.run" APIs to execute. For this, please make sure to run the script as root
(execute ``su`` in terminal to log into root).

.. note::

    Note also that you **shouldn't** import the PyXIR DPU targets in the run script
    (``import pyxir.contrib.target.DPUCZDX8G``).

.. code:: python

   import pyxir
   import tvm
   from tvm.contrib import graph_executor

   dev = tvm.cpu()

   # input_name = ...
   # input_data = ...

   # load the module into memory
   lib = tvm.runtime.load_module("deploy_lib_edge.so")
   module = graph_executor.GraphModule(lib["default"](dev))

   module.set_input(input_name, input_data)
   module.run()
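After ``module.run()`` completes, results can be read back from the graph executor. As a
minimal sketch (the output index and any post-processing depend on your model; index ``0``
is just an assumption for a single-output classification network):

.. code:: python

   # Fetch the first output tensor as a NumPy array and, for example,
   # take the top-1 class index of a classification model.
   output = module.get_output(0).numpy()
   top1 = output.argmax()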