.. DO NOT EDIT. THIS FILE WAS AUTOMATICALLY GENERATED BY
.. TVM'S MONKEY-PATCHED VERSION OF SPHINX-GALLERY. TO MAKE
.. CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "topic/vta/tutorials/vta_get_started.py"

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        This tutorial can be used interactively with Google Colab! You can also click
        :ref:`here <sphx_glr_download_topic_vta_tutorials_vta_get_started.py>` to run the Jupyter notebook locally.

        .. image:: https://raw.githubusercontent.com/tlc-pack/web-data/main/images/utilities/colab_button.svg
            :align: center
            :target: https://colab.research.google.com/github/apache/tvm-site/blob/asf-site/docs/_downloads/83b9961c758069912464db3443fffc06/vta_get_started.ipynb
            :width: 300px

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_topic_vta_tutorials_vta_get_started.py:


.. _vta-get-started:

Get Started with VTA
====================
**Author**: `Thierry Moreau <https://homes.cs.washington.edu/~moreau/>`_

This is an introductory tutorial on how to use TVM to program the VTA design.

In this tutorial, we will demonstrate the basic TVM workflow to implement
a vector addition on the VTA design's vector ALU.
This process includes specific scheduling transformations necessary to lower
computation down to low-level accelerator operations.

To begin, we need to import TVM, which is our deep learning optimizing compiler.
We also need to import the VTA Python package, which contains VTA-specific
extensions for TVM to target the VTA design.

.. GENERATED FROM PYTHON SOURCE LINES 35-43

.. code-block:: default

    from __future__ import absolute_import, print_function

    import os
    import tvm
    from tvm import te
    import vta
    import numpy as np


.. GENERATED FROM PYTHON SOURCE LINES 44-59

Loading in VTA Parameters
~~~~~~~~~~~~~~~~~~~~~~~~~
VTA is a modular and customizable design. Consequently, the user
is free to modify high-level hardware parameters that affect
the hardware design layout.
These parameters are specified in the :code:`vta_config.json` file by their
:code:`log2` values.
These VTA parameters can be loaded with the :code:`vta.get_env`
function.

Finally, the TVM target is also specified in the :code:`vta_config.json` file.
When set to *sim*, execution will take place inside of a behavioral
VTA simulator.
If you want to run this tutorial on the Pynq FPGA development platform,
follow the *VTA Pynq-Based Testing Setup* guide.
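
To make the :code:`log2` encoding mentioned above concrete, here is a minimal
sketch that decodes such values into actual sizes.
The dictionary is an illustrative stand-in, not the contents of your
:code:`vta_config.json`: its key names and values are assumptions chosen to
agree with the defaults quoted in this tutorial (a (1, 16) tile, a 32-bit
accumulator type and an 8-bit input type).

.. code-block:: python

    # Illustrative stand-in for a few log2-encoded configuration entries.
    # Key names and values are assumptions made for this sketch.
    log2_params = {
        "LOG_BATCH": 0,
        "LOG_BLOCK": 4,
        "LOG_ACC_WIDTH": 5,
        "LOG_INP_WIDTH": 3,
    }


    def decode(log2_value):
        """Turn a log2-encoded parameter into the size it stands for."""
        return 1 << log2_value


    batch = decode(log2_params["LOG_BATCH"])
    block = decode(log2_params["LOG_BLOCK"])
    acc_width = decode(log2_params["LOG_ACC_WIDTH"])
    inp_width = decode(log2_params["LOG_INP_WIDTH"])

    print("tile:", (batch, block))
    print("accumulator bits:", acc_width, "input bits:", inp_width)

Storing the logarithm rather than the size itself guarantees that every
decoded parameter is a power of two.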

.. GENERATED FROM PYTHON SOURCE LINES 59-62

.. code-block:: default


    env = vta.get_env()


.. GENERATED FROM PYTHON SOURCE LINES 63-67

FPGA Programming
----------------
When targeting the Pynq FPGA development board, we need to configure
the board with a VTA bitstream.

.. GENERATED FROM PYTHON SOURCE LINES 67-101

.. code-block:: default


    # We'll need the TVM RPC module and the VTA simulator module
    from tvm import rpc
    from tvm.contrib import utils
    from vta.testing import simulator

    # We read the Pynq RPC host IP address and port number from the OS environment
    host = os.environ.get("VTA_RPC_HOST", "192.168.2.99")
    port = int(os.environ.get("VTA_RPC_PORT", "9091"))

    # We configure both the bitstream and the runtime system on the Pynq
    # to match the VTA configuration specified by the vta_config.json file.
    if env.TARGET == "pynq" or env.TARGET == "de10nano":

        # Make sure that TVM was compiled with RPC=1
        assert tvm.runtime.enabled("rpc")
        remote = rpc.connect(host, port)

        # Reconfigure the JIT runtime
        vta.reconfig_runtime(remote)

        # Program the FPGA with a pre-compiled VTA bitstream.
        # You can program the FPGA with your own custom bitstream
        # by passing the path to the bitstream file instead of None.
        vta.program_fpga(remote, bitstream=None)

    # In simulation mode, host the RPC server locally.
    elif env.TARGET in ("sim", "tsim", "intelfocl"):
        remote = rpc.LocalSession()

        if env.TARGET in ["intelfocl"]:
            # program intelfocl aocx
            vta.program_fpga(remote, bitstream="vta.bitstream")


.. GENERATED FROM PYTHON SOURCE LINES 102-124

Computation Declaration
-----------------------
As a first step, we need to describe our computation.
TVM adopts tensor semantics, with each intermediate result
represented as a multi-dimensional array. The user needs to describe
the computation rule that generates the output tensors.

In this example we describe a vector addition, which requires multiple
computation stages, as shown in the dataflow diagram below.
First, we describe the input tensors :code:`A` and :code:`B` that live
in main memory.
Second, we need to declare intermediate tensors :code:`A_buf` and
:code:`B_buf`, which will live in VTA's on-chip buffers.
Having this extra computational stage allows us to explicitly
stage cached reads and writes.
Third, we describe the vector addition computation which will
add :code:`A_buf` to :code:`B_buf` to produce :code:`C_buf`.
The last operation is a cast and copy back to DRAM, into the result tensor
:code:`C`.

.. image:: https://raw.githubusercontent.com/uwsampl/web-data/main/vta/tutorial/vadd_dataflow.png
     :align: center

.. GENERATED FROM PYTHON SOURCE LINES 126-139

Input Placeholders
~~~~~~~~~~~~~~~~~~
We describe the placeholder tensors :code:`A` and :code:`B` in a tiled data
format to match the data layout requirements imposed by the VTA vector ALU.

For VTA's general purpose operations such as vector adds, the tile size is
:code:`(env.BATCH, env.BLOCK_OUT)`.
The dimensions are specified in
the :code:`vta_config.json` configuration file and are set by default to
a (1, 16) vector.

In addition, the data types of :code:`A` and :code:`B` need to match
:code:`env.acc_dtype`, which is set by the :code:`vta_config.json` file to be
a 32-bit integer.
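
As a rough illustration of the tiled layout, the NumPy sketch below builds an
array with the same four-dimensional shape.
The constants :code:`BATCH` and :code:`BLOCK_OUT` are hard-coded to the
default (1, 16) tile and :code:`int32` stands in for :code:`env.acc_dtype`;
both are assumptions of this sketch, and a real program should read them
from :code:`env`.

.. code-block:: python

    import numpy as np

    # Assumed default tile size; a real program reads these from vta.get_env().
    BATCH, BLOCK_OUT = 1, 16
    # Batch factor and output channel factor, as in this tutorial.
    o, m = 1, 64

    tiled_shape = (o, m, BATCH, BLOCK_OUT)
    # int32 stands in for env.acc_dtype.
    a_like = np.zeros(tiled_shape, dtype="int32")

    # Total number of vector elements: 1 * 64 * 1 * 16.
    num_elements = a_like.size

    # Row-major offset of element [0, 5, 0, 3], that is tile 5, lane 3.
    flat = int(np.ravel_multi_index((0, 5, 0, 3), tiled_shape))

    print(tiled_shape, num_elements, flat)

With a batch of 1, the offset of an element reduces to
:code:`tile * BLOCK_OUT + lane`, so the tiled tensor is simply a 1024-element
vector cut into 64 tiles of 16 lanes.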

.. GENERATED FROM PYTHON SOURCE LINES 139-149

.. code-block:: default


    # Output channel factor m - total 64 x 16 = 1024 output channels
    m = 64
    # Batch factor o - total 1 x 1 = 1
    o = 1
    # A placeholder tensor in tiled data format
    A = te.placeholder((o, m, env.BATCH, env.BLOCK_OUT), name="A", dtype=env.acc_dtype)
    # B placeholder tensor in tiled data format
    B = te.placeholder((o, m, env.BATCH, env.BLOCK_OUT), name="B", dtype=env.acc_dtype)


.. GENERATED FROM PYTHON SOURCE LINES 150-164

Copy Buffers
~~~~~~~~~~~~
One particularity of hardware accelerators is that on-chip memory has to be
explicitly managed.
This means that we'll need to describe intermediate tensors :code:`A_buf`
and :code:`B_buf` that can have a different memory scope than the original
placeholder tensors :code:`A` and :code:`B`.

Later in the scheduling phase, we can tell the compiler that :code:`A_buf`
and :code:`B_buf` will live in the VTA's on-chip buffers (SRAM), while
:code:`A` and :code:`B` will live in main memory (DRAM).
We describe A_buf and B_buf as the result of a compute
operation that is the identity function.
This can later be interpreted by the compiler as a cached read operation.

.. GENERATED FROM PYTHON SOURCE LINES 164-170

.. code-block:: default


    # A copy buffer
    A_buf = te.compute((o, m, env.BATCH, env.BLOCK_OUT), lambda *i: A(*i), "A_buf")
    # B copy buffer
    B_buf = te.compute((o, m, env.BATCH, env.BLOCK_OUT), lambda *i: B(*i), "B_buf")


.. GENERATED FROM PYTHON SOURCE LINES 171-180

Vector Addition
~~~~~~~~~~~~~~~
Now we're ready to describe the vector addition result tensor :code:`C`,
with another compute operation.
The compute function takes the shape of the tensor, as well as a lambda
function that describes the computation rule for each position of the tensor.

No computation happens during this phase, as we are only declaring how
the computation should be done.
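
The shape-plus-rule style can be emulated in plain Python.
The helper :code:`eager_compute` below is hypothetical and not part of TVM:
it evaluates the rule at every index immediately, whereas
:code:`te.compute` only records the rule.
It is meant as a sketch of what the identity copies and the addition rule
mean, using a small stand-in shape.

.. code-block:: python

    import itertools


    def eager_compute(shape, rule):
        """Hypothetical helper: evaluate rule(*index) at every index of shape."""
        indices = itertools.product(*(range(n) for n in shape))
        return {idx: rule(*idx) for idx in indices}


    # Small stand-in for (o, m, env.BATCH, env.BLOCK_OUT).
    shape = (1, 2, 1, 4)

    # Arbitrary input values, playing the role of the placeholders.
    a = eager_compute(shape, lambda *i: sum(i))
    b = eager_compute(shape, lambda *i: 10 * i[1] + i[3])

    # Identity rules: these correspond to the copy buffers.
    a_buf = eager_compute(shape, lambda *i: a[i])
    b_buf = eager_compute(shape, lambda *i: b[i])

    # Element-wise addition rule.
    c_buf = eager_compute(shape, lambda *i: a_buf[i] + b_buf[i])

    print(c_buf[(0, 1, 0, 3)])

Because the rule is a function of the index alone, a compiler is free to
decide later in which order, and on which hardware unit, each element is
produced.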

.. GENERATED FROM PYTHON SOURCE LINES 180-188

.. code-block:: default


    # Describe the in-VTA vector addition
    C_buf = te.compute(
        (o, m, env.BATCH, env.BLOCK_OUT),
        lambda *i: A_buf(*i).astype(env.acc_dtype) + B_buf(*i).astype(env.acc_dtype),
        name="C_buf",
    )


.. GENERATED FROM PYTHON SOURCE LINES 189-193

Casting the Results
~~~~~~~~~~~~~~~~~~~
After the computation is done, we'll need to send the results computed by VTA
back to main memory.

.. GENERATED FROM PYTHON SOURCE LINES 195-206

.. note::

  **Memory Store Restrictions**

  One specificity of VTA is that it only supports DRAM stores in the narrow
  :code:`env.inp_dtype` data type format.
  This lets us reduce the data footprint for memory transfers (more on this
  in the basic matrix multiply example).

We perform one last typecast operation to the narrow
input activation data format.
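
Narrowing a 32-bit value to 8 bits discards the upper bits, so sums outside
the 8-bit range wrap around.
The NumPy sketch below shows this behavior; :code:`int32` and :code:`int8`
are assumed here to stand for :code:`env.acc_dtype` and
:code:`env.inp_dtype`.
The verification step at the end of this tutorial builds its reference
result with the same kind of NumPy cast.

.. code-block:: python

    import numpy as np

    # Accumulator-width values, some of which do not fit in 8 bits.
    acc = np.array([100, 127, 128, 200, -129], dtype="int32")

    # Cast to the narrow type: values outside [-128, 127] wrap around.
    narrow = acc.astype("int8")

    print(narrow.tolist())

Only values already within the 8-bit range survive the cast unchanged, which
is worth keeping in mind when choosing input ranges.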

.. GENERATED FROM PYTHON SOURCE LINES 206-212

.. code-block:: default


    # Cast to output type, and send to main memory
    C = te.compute(
        (o, m, env.BATCH, env.BLOCK_OUT), lambda *i: C_buf(*i).astype(env.inp_dtype), name="C"
    )


.. GENERATED FROM PYTHON SOURCE LINES 213-214

This concludes the computation declaration part of this tutorial.

.. GENERATED FROM PYTHON SOURCE LINES 217-230

Scheduling the Computation
--------------------------
While the above lines describe the computation rule, we can obtain
:code:`C` in many ways.
TVM asks the user to provide an implementation of the computation, called
a *schedule*.

A schedule is a set of transformations to an original computation that
transforms the implementation of the computation without affecting
correctness.
This simple VTA programming tutorial aims to demonstrate basic schedule
transformations that will map the original schedule down to VTA hardware
primitives.
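
The idea that a transformation changes the implementation but not the result
can be sketched in plain Python.
Loop splitting is used below only as a familiar example of such a
transformation; the function names are hypothetical, and the schedule
primitives actually applied in this tutorial are buffer scopes and pragmas.

.. code-block:: python

    def add_default(a, b):
        """Element-wise add with a single loop over all elements."""
        return [a[i] + b[i] for i in range(len(a))]


    def add_split(a, b, factor):
        """Same computation, with the loop split into outer and inner loops."""
        n = len(a)
        assert n % factor == 0, "factor must divide the vector length"
        out = [0] * n
        for outer in range(n // factor):
            for inner in range(factor):
                i = outer * factor + inner
                out[i] = a[i] + b[i]
        return out


    a = list(range(64))
    b = [3 * x for x in range(64)]

    same = add_default(a, b) == add_split(a, b, 16)
    print(same)

Both functions visit every index exactly once, so they produce identical
outputs even though their loop structures differ.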

.. GENERATED FROM PYTHON SOURCE LINES 233-237

Default Schedule
~~~~~~~~~~~~~~~~
After we construct the schedule, by default the schedule computes
:code:`C` in the following way:

.. GENERATED FROM PYTHON SOURCE LINES 237-243

.. code-block:: default


    # Let's take a look at the generated schedule
    s = te.create_schedule(C.op)

    print(tvm.lower(s, [A, B, C], simple_mode=True))


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    # from tvm.script import ir as I
    # from tvm.script import tir as T

    @I.ir_module
    class Module:
        @T.prim_func
        def main(A: T.Buffer((1, 64, 1, 16), "int32"), B: T.Buffer((1, 64, 1, 16), "int32"), C: T.Buffer((1, 64, 1, 16), "int8")):
            T.func_attr({"from_legacy_te_schedule": T.bool(True), "tir.noalias": T.bool(True)})
            A_buf = T.allocate([1024], "int32", "global")
            B_buf = T.allocate([1024], "int32", "global")
            A_buf_1 = T.Buffer((1024,), "int32", data=A_buf)
            for i1, i3 in T.grid(64, 16):
                cse_var_1: T.int32 = i1 * 16 + i3
                A_1 = T.Buffer((1024,), "int32", data=A.data)
                A_buf_1[cse_var_1] = A_1[cse_var_1]
            B_buf_1 = T.Buffer((1024,), "int32", data=B_buf)
            for i1, i3 in T.grid(64, 16):
                cse_var_2: T.int32 = i1 * 16 + i3
                B_1 = T.Buffer((1024,), "int32", data=B.data)
                B_buf_1[cse_var_2] = B_1[cse_var_2]
            A_buf_2 = T.Buffer((1024,), "int32", data=A_buf)
            for i1, i3 in T.grid(64, 16):
                cse_var_3: T.int32 = i1 * 16 + i3
                A_buf_2[cse_var_3] = A_buf_1[cse_var_3] + B_buf_1[cse_var_3]
            for i1, i3 in T.grid(64, 16):
                cse_var_4: T.int32 = i1 * 16 + i3
                C_1 = T.Buffer((1024,), "int8", data=C.data)
                C_1[cse_var_4] = T.Cast("int8", A_buf_2[cse_var_4])


.. GENERATED FROM PYTHON SOURCE LINES 244-253

Although this schedule makes sense, it won't compile to VTA.
In order to obtain correct code generation, we need to apply scheduling
primitives and code annotation that will transform the schedule into
one that can be directly lowered onto VTA hardware intrinsics.
Those include:

 - DMA copy operations which will take globally-scoped tensors and copy
   those into locally-scoped tensors.
 - Vector ALU operations that will perform the vector add.

.. GENERATED FROM PYTHON SOURCE LINES 255-264

Buffer Scopes
~~~~~~~~~~~~~
First, we set the scope of the copy buffers to indicate to TVM that these
intermediate tensors will be stored in the VTA's on-chip SRAM buffers.
Below, we tell TVM that :code:`A_buf`, :code:`B_buf`, :code:`C_buf`
will live in VTA's on-chip *accumulator buffer* which serves as
VTA's general purpose register file.

Set the intermediate tensors' scope to VTA's on-chip accumulator buffer

.. GENERATED FROM PYTHON SOURCE LINES 264-268

.. code-block:: default

    s[A_buf].set_scope(env.acc_scope)
    s[B_buf].set_scope(env.acc_scope)
    s[C_buf].set_scope(env.acc_scope)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    stage(C_buf, compute(C_buf, body=[A_buf[i0, i1, i2, i3] + B_buf[i0, i1, i2, i3]], axis=[T.iter_var(i0, T.Range(0, 1), "DataPar", ""), T.iter_var(i1, T.Range(0, 64), "DataPar", ""), T.iter_var(i2, T.Range(0, 1), "DataPar", ""), T.iter_var(i3, T.Range(0, 16), "DataPar", "")], reduce_axis=[], tag=, attrs={}))


.. GENERATED FROM PYTHON SOURCE LINES 269-276

DMA Transfers
~~~~~~~~~~~~~
We need to schedule DMA transfers to move data living in DRAM to
and from the VTA on-chip buffers.
We insert :code:`dma_copy` pragmas to indicate to the compiler
that the copy operations will be performed in bulk via DMA,
which is common in hardware accelerators.

.. GENERATED FROM PYTHON SOURCE LINES 276-283

.. code-block:: default


    # Tag the buffer copies with the DMA pragma to map a copy loop to a
    # DMA transfer operation
    s[A_buf].pragma(s[A_buf].op.axis[0], env.dma_copy)
    s[B_buf].pragma(s[B_buf].op.axis[0], env.dma_copy)
    s[C].pragma(s[C].op.axis[0], env.dma_copy)


.. GENERATED FROM PYTHON SOURCE LINES 284-291

ALU Operations
~~~~~~~~~~~~~~
VTA has a vector ALU that can perform vector operations on tensors
in the accumulator buffer.
In order to tell TVM that a given operation needs to be mapped to the
VTA's vector ALU, we need to explicitly tag the vector addition loop
with an :code:`env.alu` pragma.
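
Taken together, the buffer scopes, the DMA pragmas and the ALU pragma describe
a load, compute, store pipeline.
The NumPy sketch below is a conceptual emulation of that pipeline only: the
array names are hypothetical, the sizes assume the default 64 tiles of 16
lanes, and nothing here reflects VTA's actual instruction encoding.

.. code-block:: python

    import numpy as np

    rng = np.random.default_rng(0)

    # Inputs living in "DRAM": 64 tiles of 16 lanes, accumulator width.
    dram_a = rng.integers(-128, 128, size=(64, 16), dtype=np.int32)
    dram_b = rng.integers(-128, 128, size=(64, 16), dtype=np.int32)

    # "DMA load": bulk copies into separate on-chip buffers.
    acc_a = dram_a.copy()
    acc_b = dram_b.copy()

    # "Vector ALU": one 16-lane addition per tile.
    acc_c = np.empty_like(acc_a)
    for tile in range(acc_a.shape[0]):
        acc_c[tile, :] = acc_a[tile, :] + acc_b[tile, :]

    # "DMA store": narrow to 8 bits and copy the result back to "DRAM".
    dram_c = acc_c.astype(np.int8)

    print(dram_c.shape, dram_c.dtype)

Each stage of the emulation corresponds to one kind of annotation in the
schedule: the copies to the DMA pragmas, the loop to the ALU pragma.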

.. GENERATED FROM PYTHON SOURCE LINES 291-299

.. code-block:: default


    # Tell TVM that the computation needs to be performed
    # on VTA's vector ALU
    s[C_buf].pragma(C_buf.op.axis[0], env.alu)

    # Let's take a look at the finalized schedule
    print(vta.lower(s, [A, B, C], simple_mode=True))


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    # from tvm.script import ir as I
    # from tvm.script import tir as T

    @I.ir_module
    class Module:
        @T.prim_func
        def main(A: T.Buffer((1, 64, 1, 16), "int32"), B: T.Buffer((1, 64, 1, 16), "int32"), C: T.Buffer((1, 64, 1, 16), "int8")):
            T.func_attr({"from_legacy_te_schedule": T.bool(True), "tir.noalias": T.bool(True)})
            vta = T.int32()
            with T.attr(T.iter_var(vta, None, "ThreadIndex", "vta"), "coproc_scope", 2):
                T.call_extern("int32", "VTALoadBuffer2D", T.tvm_thread_context(T.tir.vta.command_handle()), A.data, 0, 64, 1, 64, 0, 0, 0, 0, 0, 3)
                T.call_extern("int32", "VTALoadBuffer2D", T.tvm_thread_context(T.tir.vta.command_handle()), B.data, 0, 64, 1, 64, 0, 0, 0, 0, 64, 3)
                with T.attr(T.iter_var(vta, None, "ThreadIndex", "vta"), "coproc_uop_scope", "VTAPushALUOp"):
                    T.call_extern("int32", "VTAUopLoopBegin", 64, 1, 1, 0)
                    T.tir.vta.uop_push(1, 0, 0, 64, 0, 2, 0, 0)
                    T.call_extern("int32", "VTAUopLoopEnd")
                T.tir.vta.coproc_dep_push(2, 3)
            with T.attr(T.iter_var(vta, None, "ThreadIndex", "vta"), "coproc_scope", 3):
                T.tir.vta.coproc_dep_pop(2, 3)
                T.call_extern("int32", "VTAStoreBuffer2D", T.tvm_thread_context(T.tir.vta.command_handle()), 0, 4, C.data, 0, 64, 1, 64)
            T.tir.vta.coproc_sync()


.. GENERATED FROM PYTHON SOURCE LINES 300-301

This concludes the scheduling portion of this tutorial.

.. GENERATED FROM PYTHON SOURCE LINES 303-314

TVM Compilation
---------------
After we have finished specifying the schedule, we can compile it
into a TVM function. By default, TVM compiles into a type-erased
function that can be called directly from the Python side.

In the following line, we use :code:`vta.build` to create a function.
The build function takes the schedule, the desired signature of the
function (including the inputs and outputs), and the target
we want to compile to.


.. GENERATED FROM PYTHON SOURCE LINES 314-318

.. code-block:: default

    my_vadd = vta.build(
        s, [A, B, C], tvm.target.Target("ext_dev", host=env.target_host), name="my_vadd"
    )


.. GENERATED FROM PYTHON SOURCE LINES 319-327

Saving the Module
~~~~~~~~~~~~~~~~~
TVM lets us save our module into a file so it can be loaded back later. This
is called ahead-of-time compilation and allows us to save some compilation
time.
More importantly, this allows us to cross-compile the executable on our
development machine and send it over to the Pynq FPGA board over RPC for
execution.

.. GENERATED FROM PYTHON SOURCE LINES 327-335

.. code-block:: default


    # Write the compiled module into an object file.
    temp = utils.tempdir()
    my_vadd.save(temp.relpath("vadd.o"))

    # Send the executable over RPC
    remote.upload(temp.relpath("vadd.o"))


.. GENERATED FROM PYTHON SOURCE LINES 336-339

Loading the Module
~~~~~~~~~~~~~~~~~~
We can load the compiled module from the file system to run the code.

.. GENERATED FROM PYTHON SOURCE LINES 339-342

.. code-block:: default


    f = remote.load_module("vadd.o")


.. GENERATED FROM PYTHON SOURCE LINES 343-357

Running the Function
--------------------
The compiled TVM function uses a concise C API and can be invoked from
any language.

TVM provides an array API in Python to aid quick testing and prototyping.
The array API is based on the `DLPack <https://github.com/dmlc/dlpack>`_ standard.

- We first create a remote context (for remote execution on the Pynq).
- Then :code:`tvm.nd.array` formats the data accordingly.
- :code:`f()` runs the actual computation.
- :code:`numpy()` copies the result array back in a format that can be
  interpreted.
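
Before the data is handed to :code:`tvm.nd.array`, it has to be rearranged
from a plain 2D layout into the 4D tiled layout of the placeholders.
The NumPy sketch below shows that packing step and its inverse on a small
example; the sizes are assumptions chosen so that the transpose is visible
(with a batch of 1 it would be a no-op on the data order).

.. code-block:: python

    import numpy as np

    # Small assumed sizes for illustration; the tutorial reads them from env.
    BATCH, BLOCK_OUT = 2, 4
    o, m = 1, 3

    # Plain 2D layout: (o * BATCH) rows by (m * BLOCK_OUT) columns.
    orig = np.arange(o * BATCH * m * BLOCK_OUT).reshape(o * BATCH, m * BLOCK_OUT)

    # Pack: split both axes into (outer, tile) pairs, then group the tile axes last.
    packed = orig.reshape(o, BATCH, m, BLOCK_OUT).transpose((0, 2, 1, 3))

    # Unpack: undo the transpose, then merge the axes back into 2D.
    unpacked = packed.transpose((0, 2, 1, 3)).reshape(o * BATCH, m * BLOCK_OUT)

    print(packed.shape)
    print(packed[0, 2, 1, 1] == orig[1, 2 * BLOCK_OUT + 1])

The same reshape and transpose pair is applied to the NumPy reference result
before it is compared with the output of the accelerator.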


.. GENERATED FROM PYTHON SOURCE LINES 357-377

.. code-block:: default


    # Get the remote device context
    ctx = remote.ext_dev(0)

    # Initialize the A and B arrays randomly in the int range of [-128, 128)
    A_orig = np.random.randint(-128, 128, size=(o * env.BATCH, m * env.BLOCK_OUT)).astype(A.dtype)
    B_orig = np.random.randint(-128, 128, size=(o * env.BATCH, m * env.BLOCK_OUT)).astype(B.dtype)

    # Apply packing to the A and B arrays from a 2D to a 4D packed layout
    A_packed = A_orig.reshape(o, env.BATCH, m, env.BLOCK_OUT).transpose((0, 2, 1, 3))
    B_packed = B_orig.reshape(o, env.BATCH, m, env.BLOCK_OUT).transpose((0, 2, 1, 3))

    # Format the input/output arrays with tvm.nd.array to the DLPack standard
    A_nd = tvm.nd.array(A_packed, ctx)
    B_nd = tvm.nd.array(B_packed, ctx)
    C_nd = tvm.nd.array(np.zeros((o, m, env.BATCH, env.BLOCK_OUT)).astype(C.dtype), ctx)

    # Invoke the module to perform the computation
    f(A_nd, B_nd, C_nd)


.. GENERATED FROM PYTHON SOURCE LINES 378-382

Verifying Correctness
---------------------
Compute the reference result with NumPy and assert that the output of the
vector addition is indeed correct.

.. GENERATED FROM PYTHON SOURCE LINES 382-389

.. code-block:: default


    # Compute reference result with numpy
    C_ref = (A_orig.astype(env.acc_dtype) + B_orig.astype(env.acc_dtype)).astype(C.dtype)
    C_ref = C_ref.reshape(o, env.BATCH, m, env.BLOCK_OUT).transpose((0, 2, 1, 3))
    np.testing.assert_equal(C_ref, C_nd.numpy())
    print("Successful vector add test!")


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Successful vector add test!


.. GENERATED FROM PYTHON SOURCE LINES 390-406

Summary
-------
This tutorial provides a walk-through of TVM for programming the
deep learning accelerator VTA with a simple vector addition example.
The general workflow includes:

- Programming the FPGA with the VTA bitstream over RPC.
- Describing the vector add computation via a series of computations.
- Describing how we want to perform the computation using schedule primitives.
- Compiling the function to the VTA target.
- Running the compiled module and verifying it against a numpy implementation.

You are more than welcome to check out the other examples and tutorials
to learn more about the supported operations, schedule primitives,
and other features supported by TVM to program VTA.


.. _sphx_glr_download_topic_vta_tutorials_vta_get_started.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example


    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: vta_get_started.py <vta_get_started.py>`

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: vta_get_started.ipynb <vta_get_started.ipynb>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_