.. DO NOT EDIT. THIS FILE WAS AUTOMATICALLY GENERATED BY
.. TVM'S MONKEY-PATCHED VERSION OF SPHINX-GALLERY. TO MAKE
.. CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "topic/vta/tutorials/optimize/matrix_multiply_opt.py"

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        This tutorial can be used interactively with Google Colab! You can also click
        :ref:`here <sphx_glr_download_topic_vta_tutorials_optimize_matrix_multiply_opt.py>`
        to run the Jupyter notebook locally.

        .. image:: https://raw.githubusercontent.com/tlc-pack/web-data/main/images/utilities/colab_button.svg
            :align: center
            :target: https://colab.research.google.com/github/apache/tvm-site/blob/asf-site/docs/_downloads/4d3f955a709b320db0d42740fead8ac1/matrix_multiply_opt.ipynb
            :width: 300px

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_topic_vta_tutorials_optimize_matrix_multiply_opt.py:


.. _vta-mat-mult-opt:

Matrix Multiply Blocking
========================
**Author**: `Thierry Moreau <https://homes.cs.washington.edu/~moreau/>`_

This tutorial provides an overview of how to use TVM to map matrix
multiplication efficiently onto the VTA design.
We recommend covering the :ref:`basic-mat-mult` tutorial first.

In this tutorial, we demonstrate TVM schedule optimizations that break large
neural network operators down into smaller blocks, so that the computation
fits within VTA's limited hardware accelerator resources.

.. GENERATED FROM PYTHON SOURCE LINES 34-37

RPC Setup
---------
We start by programming the Pynq's FPGA and building its RPC runtime.

.. GENERATED FROM PYTHON SOURCE LINES 37-76

.. code-block:: default


    from __future__ import absolute_import, print_function

    import os
    import tvm
    from tvm import te
    import vta
    import numpy as np
    from tvm import rpc
    from tvm.contrib import utils
    from vta.testing import simulator

    # Load VTA parameters from the 3rdparty/vta-hw/config/vta_config.json file
    env = vta.get_env()

    # We read the Pynq RPC host IP address and port number from the OS environment
    host = os.environ.get("VTA_RPC_HOST", "192.168.2.99")
    port = int(os.environ.get("VTA_RPC_PORT", "9091"))

    # We configure both the bitstream and the runtime system on the Pynq
    # to match the VTA configuration specified by the vta_config.json file.
    if env.TARGET == "pynq":

        # Make sure that TVM was compiled with RPC=1
        assert tvm.runtime.enabled("rpc")
        remote = rpc.connect(host, port)

        # Reconfigure the JIT runtime
        vta.reconfig_runtime(remote)

        # Program the FPGA with a pre-compiled VTA bitstream.
        # You can program the FPGA with your own custom bitstream
        # by passing the path to the bitstream file instead of None.
        vta.program_fpga(remote, bitstream=None)

    # In simulation mode, host the RPC server locally.
    elif env.TARGET in ["sim", "tsim"]:
        remote = rpc.LocalSession()
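Before moving on, it can help to confirm which configuration the runtime was
built against. The sketch below simply prints fields of the environment
object created above; all attributes shown (:code:`TARGET`, :code:`BATCH`,
:code:`BLOCK_IN`, :code:`BLOCK_OUT`, :code:`INP_WIDTH`) are the same ones
used throughout this tutorial.

.. code-block:: python

    # Sanity-check the loaded VTA configuration (a sketch): these
    # attributes come from vta_config.json via vta.get_env().
    print("target:", env.TARGET)
    print("GEMM tile: {} x {} x {}".format(env.BATCH, env.BLOCK_IN, env.BLOCK_OUT))
    print("input dtype width: int{}".format(env.INP_WIDTH))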
.. GENERATED FROM PYTHON SOURCE LINES 77-98

Computation Declaration
-----------------------
As a first step, we need to describe our matrix multiplication computation.
We define the matrix multiplication as the computation one would find in a
fully connected layer, defined by its batch size, input channels, and output
channels.
These have to be integer multiples of the VTA tensor shape:
:code:`BATCH`, :code:`BLOCK_IN`, and :code:`BLOCK_OUT` respectively.

We've added extra operators to the matrix multiplication that apply
shifting and clipping to the output in order to mimic a fixed-point
matrix multiplication followed by a rectified linear activation.
We describe the TVM dataflow graph of the fully connected layer below:

.. image:: https://raw.githubusercontent.com/uwsampl/web-data/main/vta/tutorial/fc_dataflow.png
     :align: center

This computation is intentionally too large to fit onto VTA's on-chip
buffers all at once. Therefore, in the scheduling phase we'll rely on
computation blocking strategies to break the computation down into
manageable chunks.

.. GENERATED FROM PYTHON SOURCE LINES 98-152

.. code-block:: default


    # Fully connected layer dimensions: 1024 x 1024
    batch_size = 1
    in_channels = 1024
    out_channels = 1024
    assert batch_size % env.BATCH == 0
    assert in_channels % env.BLOCK_IN == 0
    assert out_channels % env.BLOCK_OUT == 0

    # Let's derive the tiled input tensor shapes
    data_shape = (batch_size // env.BATCH, in_channels // env.BLOCK_IN, env.BATCH, env.BLOCK_IN)
    weight_shape = (
        out_channels // env.BLOCK_OUT,
        in_channels // env.BLOCK_IN,
        env.BLOCK_OUT,
        env.BLOCK_IN,
    )
    output_shape = (batch_size // env.BATCH, out_channels // env.BLOCK_OUT, env.BATCH, env.BLOCK_OUT)
    num_ops = in_channels * out_channels * batch_size * 2

    # Reduction axes
    ic = te.reduce_axis((0, in_channels // env.BLOCK_IN), name="ic")
    ic_tns = te.reduce_axis((0, env.BLOCK_IN), name="ic_tns")

    # Input placeholder tensors
    data = te.placeholder(data_shape, name="data", dtype=env.inp_dtype)
    weight = te.placeholder(weight_shape, name="weight", dtype=env.wgt_dtype)

    # Copy buffers
    data_buf = te.compute(data_shape, lambda *i: data(*i), "data_buf")
    weight_buf = te.compute(weight_shape, lambda *i: weight(*i), "weight_buf")

    # Declare matrix multiply computation
    res_gemm = te.compute(
        output_shape,
        lambda bo, co, bi, ci: te.sum(
            data_buf[bo, ic, bi, ic_tns].astype(env.acc_dtype)
            * weight_buf[co, ic, ci, ic_tns].astype(env.acc_dtype),
            axis=[ic, ic_tns],
        ),
        name="res_gem",
    )

    # Add shift stage for fixed-point normalization
    res_shr = te.compute(output_shape, lambda *i: res_gemm(*i) >> env.INP_WIDTH, name="res_shr")

    # Apply clipping between (0, input max value)
    inp_max = (1 << (env.INP_WIDTH - 1)) - 1
    res_max = te.compute(output_shape, lambda *i: tvm.te.max(res_shr(*i), 0), "res_max")
    res_min = te.compute(output_shape, lambda *i: tvm.te.min(res_max(*i), inp_max), "res_min")

    # Apply typecast to input data type before sending results back
    res = te.compute(output_shape, lambda *i: res_min(*i).astype(env.inp_dtype), name="res")
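To see why blocking is unavoidable here, a quick back-of-the-envelope
calculation helps. The sketch below computes the raw footprint of the weight
tensor declared above from :code:`env.WGT_WIDTH` (the weight bit width); the
commented-out comparison against an on-chip buffer capacity is an assumption
about the environment object (a :code:`WGT_BUFF_SIZE` attribute may be named
differently in your VTA version).

.. code-block:: python

    # The full int8 weight matrix is 1024 x 1024 bytes = 1 MB, far more
    # than VTA's on-chip weight buffer can hold, hence the blocking below.
    weight_bytes = out_channels * in_channels * env.WGT_WIDTH // 8
    print("weight tensor footprint: {} bytes".format(weight_bytes))  # 1048576
    # If your environment exposes buffer capacities (an assumption):
    # print("on-chip weight buffer: {} bytes".format(env.WGT_BUFF_SIZE))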
.. GENERATED FROM PYTHON SOURCE LINES 153-161

Scheduling the Computation
--------------------------
We'll look at a set of schedule transformations necessary to map the
matrix multiplication onto VTA in an efficient fashion.
Those include:

- Computation blocking
- Lowering to VTA hardware intrinsics

.. GENERATED FROM PYTHON SOURCE LINES 161-168

.. code-block:: default


    # Create TVM schedule
    s = te.create_schedule(res.op)
    # Let's look at the default TVM schedule
    print(tvm.lower(s, [data, weight, res], simple_mode=True))

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    # from tvm.script import ir as I
    # from tvm.script import tir as T

    @I.ir_module
    class Module:
        @T.prim_func
        def main(data: T.Buffer((1, 64, 1, 16), "int8"), weight: T.Buffer((64, 64, 16, 16), "int8"), res: T.Buffer((1, 64, 1, 16), "int8")):
            T.func_attr({"from_legacy_te_schedule": T.bool(True), "tir.noalias": T.bool(True)})
            data_buf = T.allocate([1024], "int8", "global")
            weight_buf = T.allocate([1048576], "int8", "global")
            res_gem = T.allocate([1024], "int32", "global")
            data_buf_1 = T.Buffer((1024,), "int8", data=data_buf)
            for i1, i3 in T.grid(64, 16):
                cse_var_1: T.int32 = i1 * 16 + i3
                data_1 = T.Buffer((1024,), "int8", data=data.data)
                data_buf_1[cse_var_1] = data_1[cse_var_1]
            weight_buf_1 = T.Buffer((1048576,), "int8", data=weight_buf)
            for i0, i1, i2, i3 in T.grid(64, 64, 16, 16):
                cse_var_2: T.int32 = i0 * 16384 + i1 * 256 + i2 * 16 + i3
                weight_1 = T.Buffer((1048576,), "int8", data=weight.data)
                weight_buf_1[cse_var_2] = weight_1[cse_var_2]
            res_gem_1 = T.Buffer((1024,), "int32", data=res_gem)
            for co, ci in T.grid(64, 16):
                res_gem_1[co * 16 + ci] = 0
                for ic, ic_tns in T.grid(64, 16):
                    cse_var_3: T.int32 = co * 16 + ci
                    res_gem_1[cse_var_3] = res_gem_1[cse_var_3] + T.Cast("int32", data_buf_1[ic * 16 + ic_tns]) * T.Cast("int32", weight_buf_1[co * 16384 + ic * 256 + ci * 16 + ic_tns])
            res_gem_2 = T.Buffer((1024,), "int32", data=res_gem)
            for i1, i3 in T.grid(64, 16):
                cse_var_4: T.int32 = i1 * 16 + i3
                res_gem_2[cse_var_4] = T.shift_right(res_gem_1[cse_var_4], 8)
            res_gem_3 = T.Buffer((1024,), "int32", data=res_gem)
            for i1, i3 in T.grid(64, 16):
                cse_var_5: T.int32 = i1 * 16 + i3
                res_gem_3[cse_var_5] = T.max(res_gem_2[cse_var_5], 0)
            res_gem_4 = T.Buffer((1024,), "int32", data=res_gem)
            for i1, i3 in T.grid(64, 16):
                cse_var_6: T.int32 = i1 * 16 + i3
                res_gem_4[cse_var_6] = T.min(res_gem_3[cse_var_6], 127)
            for i1, i3 in T.grid(64, 16):
                cse_var_7: T.int32 = i1 * 16 + i3
                res_1 = T.Buffer((1024,), "int8", data=res.data)
                res_1[cse_var_7] = T.Cast("int8", res_gem_4[cse_var_7])

.. GENERATED FROM PYTHON SOURCE LINES 169-223

Blocking the Computation
~~~~~~~~~~~~~~~~~~~~~~~~
The matrix multiplication is by default too large for activations or weights
to fit on VTA's on-chip buffers all at once.
We block the (1, 1024) by (1024, 1024) matrix multiplication into
smaller (1, 256) by (256, 256) matrix multiplications so the intermediate
tensors can fit on the accelerator's on-chip SRAM.
This approach is similar to blocking techniques applied to CPUs and GPUs in
order to increase cache hit rate.

We perform blocking along each axis (the batch axis being untouched since
we are performing single-batch inference).
We also leave the inner-most tensorization axes as-is in order to allow
TVM to pattern-match tensorization.
We show the outcome of blocking on the computation schedule in the diagram
below:

.. image:: https://raw.githubusercontent.com/uwsampl/web-data/main/vta/tutorial/blocking.png
     :align: center
     :width: 480px
.. note::

    The code after loop splitting and reordering is equivalent to the following
    pseudo-code. We ignore the batch axis since we are only performing single-batch
    inference in this example:

    .. code-block:: c

        for (int oc_out = 0; oc_out < 4; ++oc_out) {
          // Initialization loop
          for (int oc_inn = 0; oc_inn < 16; ++oc_inn) {
            for (int oc_tns = 0; oc_tns < 16; ++oc_tns) {
              int j = (oc_out * 16 + oc_inn) * 16 + oc_tns;
              C[0][j] = 0;
            }
          }
          for (int ic_out = 0; ic_out < 4; ++ic_out) {
            // Block loop
            for (int oc_inn = 0; oc_inn < 16; ++oc_inn) {
              for (int ic_inn = 0; ic_inn < 16; ++ic_inn) {
                // Tensorization loop
                for (int oc_tns = 0; oc_tns < 16; ++oc_tns) {
                  for (int ic_tns = 0; ic_tns < 16; ++ic_tns) {
                    int i = (ic_out * 16 + ic_inn) * 16 + ic_tns;
                    int j = (oc_out * 16 + oc_inn) * 16 + oc_tns;
                    C[0][j] = C[0][j] + A[0][i] * B[j][i];
                  }
                }
              }
            }
          }
        }
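Before applying the splits in TVM, it may help to spell out the tile
arithmetic. The sketch below is plain Python with the default VTA
configuration (:code:`BLOCK_IN = BLOCK_OUT = 16`) hard-coded, and mirrors
the loop bounds in the pseudo-code above:

.. code-block:: python

    # Each 1024-wide channel axis decomposes into
    # 4 outer blocks x 16 intra-block tiles x 16 tensor lanes.
    channels, block, tensor_dim = 1024, 256, 16
    outer = channels // block        # 4  -> the oc_out / ic_out loops
    inner = block // tensor_dim      # 16 -> the oc_inn / ic_inn loops
    assert outer * inner * tensor_dim == channels
    print(outer, inner, tensor_dim)  # 4 16 16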
.. GENERATED FROM PYTHON SOURCE LINES 223-254

.. code-block:: default


    # Let's define tiling sizes (expressed in multiples of VTA tensor shape size)
    b_block = 1 // env.BATCH
    i_block = 256 // env.BLOCK_IN
    o_block = 256 // env.BLOCK_OUT

    # Tile the output tensor along the batch and output channel dimensions
    # (since by default we are doing single batch inference, the split along
    #  the batch dimension has no effect)
    b, oc, b_tns, oc_tns = s[res].op.axis
    b_out, b_inn = s[res].split(b, b_block)
    oc_out, oc_inn = s[res].split(oc, o_block)
    s[res].reorder(b_out, oc_out, b_inn, oc_inn)

    # Move intermediate computation into each output compute tile
    s[res_gemm].compute_at(s[res], oc_out)
    s[res_shr].compute_at(s[res], oc_out)
    s[res_max].compute_at(s[res], oc_out)
    s[res_min].compute_at(s[res], oc_out)

    # Apply additional loop split along reduction axis (input channel)
    b_inn, oc_inn, b_tns, oc_tns = s[res_gemm].op.axis
    ic_out, ic_inn = s[res_gemm].split(ic, i_block)

    # Reorder axes. We move the ic_out axis all the way out of the GEMM
    # loop to block along the reduction axis
    s[res_gemm].reorder(ic_out, b_inn, oc_inn, ic_inn, b_tns, oc_tns, ic_tns)

    # Let's look at the current TVM schedule after blocking
    print(tvm.lower(s, [data, weight, res], simple_mode=True))

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    # from tvm.script import ir as I
    # from tvm.script import tir as T

    @I.ir_module
    class Module:
        @T.prim_func
        def main(data: T.Buffer((1, 64, 1, 16), "int8"), weight: T.Buffer((64, 64, 16, 16), "int8"), res: T.Buffer((1, 64, 1, 16), "int8")):
            T.func_attr({"from_legacy_te_schedule": T.bool(True), "tir.noalias": T.bool(True)})
            data_buf = T.allocate([1024], "int8", "global")
            weight_buf = T.allocate([1048576], "int8", "global")
            res_gem = T.allocate([256], "int32", "global")
            data_buf_1 = T.Buffer((1024,), "int8", data=data_buf)
            for i1, i3 in T.grid(64, 16):
                cse_var_1: T.int32 = i1 * 16 + i3
                data_1 = T.Buffer((1024,), "int8", data=data.data)
                data_buf_1[cse_var_1] = data_1[cse_var_1]
            weight_buf_1 = T.Buffer((1048576,), "int8", data=weight_buf)
            for i0, i1, i2, i3 in T.grid(64, 64, 16, 16):
                cse_var_2: T.int32 = i0 * 16384 + i1 * 256 + i2 * 16 + i3
                weight_1 = T.Buffer((1048576,), "int8", data=weight.data)
                weight_buf_1[cse_var_2] = weight_1[cse_var_2]
            for i1_outer in range(4):
                res_gem_1 = T.Buffer((256,), "int32", data=res_gem)
                for co_init, ci_init in T.grid(16, 16):
                    res_gem_1[co_init * 16 + ci_init] = 0
                for ic_outer, co, ic_inner, ci, ic_tns in T.grid(4, 16, 16, 16, 16):
                    cse_var_3: T.int32 = co * 16 + ci
                    res_gem_1[cse_var_3] = res_gem_1[cse_var_3] + T.Cast("int32", data_buf_1[ic_outer * 256 + ic_inner * 16 + ic_tns]) * T.Cast("int32", weight_buf_1[i1_outer * 262144 + co * 16384 + ic_outer * 4096 + ic_inner * 256 + ci * 16 + ic_tns])
                res_gem_2 = T.Buffer((256,), "int32", data=res_gem)
                for i1, i3 in T.grid(16, 16):
                    cse_var_4: T.int32 = i1 * 16 + i3
                    res_gem_2[cse_var_4] = T.shift_right(res_gem_1[cse_var_4], 8)
                res_gem_3 = T.Buffer((256,), "int32", data=res_gem)
                for i1, i3 in T.grid(16, 16):
                    cse_var_5: T.int32 = i1 * 16 + i3
                    res_gem_3[cse_var_5] = T.max(res_gem_2[cse_var_5], 0)
                res_gem_4 = T.Buffer((256,), "int32", data=res_gem)
                for i1, i3 in T.grid(16, 16):
                    cse_var_6: T.int32 = i1 * 16 + i3
                    res_gem_4[cse_var_6] = T.min(res_gem_3[cse_var_6], 127)
                for i1_inner, i3 in T.grid(16, 16):
                    cse_var_7: T.int32 = i1_inner * 16
                    res_1 = T.Buffer((1024,), "int8", data=res.data)
                    res_1[i1_outer * 256 + cse_var_7 + i3] = T.Cast("int8", res_gem_4[cse_var_7 + i3])

.. GENERATED FROM PYTHON SOURCE LINES 255-262

Lowering Copies to DMA Transfers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Next we set the buffer scopes to the corresponding on-chip VTA SRAM buffers.
We move the load loops into the matrix multiply computation loop to stage
memory loads such that they fit in the on-chip SRAM buffers.
Finally we annotate the load/store loop outer axes with the DMA copy pragma
to perform bulk memory transfers on VTA.

.. GENERATED FROM PYTHON SOURCE LINES 262-284

.. code-block:: default


    # Set scope of SRAM buffers
    s[data_buf].set_scope(env.inp_scope)
    s[weight_buf].set_scope(env.wgt_scope)
    s[res_gemm].set_scope(env.acc_scope)
    s[res_shr].set_scope(env.acc_scope)
    s[res_min].set_scope(env.acc_scope)
    s[res_max].set_scope(env.acc_scope)

    # Block data and weight cache reads
    s[data_buf].compute_at(s[res_gemm], ic_out)
    s[weight_buf].compute_at(s[res_gemm], ic_out)

    # Use DMA copy pragma on DRAM->SRAM operations
    s[data_buf].pragma(s[data_buf].op.axis[0], env.dma_copy)
    s[weight_buf].pragma(s[weight_buf].op.axis[0], env.dma_copy)

    # Use DMA copy pragma on SRAM->DRAM operation
    # (this implies that these copies should be performed along b_inn,
    #  or result axis 2)
    s[res].pragma(s[res].op.axis[2], env.dma_copy)
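If you are curious what the schedule looks like at this intermediate point,
you can print it the same way as before; this is optional and purely for
inspection. You should see the copy loops annotated with the DMA-copy pragma
(the exact attribute spelling in the printed TIR may vary across TVM
versions):

.. code-block:: python

    # Optional: inspect the schedule after scoping and DMA annotation,
    # before tensorization (a sketch; output omitted here).
    print(tvm.lower(s, [data, weight, res], simple_mode=True))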
.. GENERATED FROM PYTHON SOURCE LINES 285-290

Lowering Computation to VTA Compute Intrinsics
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The last phase is to lower the computation loops down to VTA hardware
intrinsics by mapping the matrix multiplication to tensor intrinsics,
and mapping the shift and clipping computations to the vector ALU.

.. GENERATED FROM PYTHON SOURCE LINES 290-304

.. code-block:: default


    # Apply tensorization over the batch tensor tile axis
    s[res_gemm].tensorize(b_tns, env.gemm)

    # Add an ALU pragma over the shift and clipping operations
    s[res_shr].pragma(s[res_shr].op.axis[0], env.alu)
    s[res_min].pragma(s[res_min].op.axis[0], env.alu)
    s[res_max].pragma(s[res_max].op.axis[0], env.alu)

    # Let's look at the final lowered TVM schedule after lowering memory
    # loads/stores down to DMA copy intrinsics, and the computation down to
    # VTA compute intrinsics.
    print(vta.lower(s, [data, weight, res], simple_mode=True))

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    # from tvm.script import ir as I
    # from tvm.script import tir as T

    @I.ir_module
    class Module:
        @T.prim_func
        def main(data: T.Buffer((1, 64, 1, 16), "int8"), weight: T.Buffer((64, 64, 16, 16), "int8"), res: T.Buffer((1, 64, 1, 16), "int8")):
            T.func_attr({"from_legacy_te_schedule": T.bool(True), "tir.noalias": T.bool(True)})
            T.tir.vta.coproc_dep_push(3, 2)
            for i1_outer in range(4):
                vta = T.int32()
                with T.attr(T.iter_var(vta, None, "ThreadIndex", "vta"), "coproc_scope", 2):
                    T.tir.vta.coproc_dep_pop(3, 2)
                    with T.attr(T.iter_var(vta, None, "ThreadIndex", "vta"), "coproc_uop_scope", "VTAPushGEMMOp"):
                        T.call_extern("int32", "VTAUopLoopBegin", 16, 1, 0, 0)
                        T.tir.vta.uop_push(0, 1, 0, 0, 0, 0, 0, 0)
                        T.call_extern("int32", "VTAUopLoopEnd")
                    T.tir.vta.coproc_dep_push(2, 1)
                for ic_outer in range(4):
                    cse_var_1: T.int32 = ic_outer * 16
                    with T.attr(T.iter_var(vta, None, "ThreadIndex", "vta"), "coproc_scope", 1):
                        T.tir.vta.coproc_dep_pop(2, 1)
                        T.call_extern("int32", "VTALoadBuffer2D", T.tvm_thread_context(T.tir.vta.command_handle()), data.data, cse_var_1, 16, 1, 16, 0, 0, 0, 0, 0, 2)
                        T.call_extern("int32", "VTALoadBuffer2D", T.tvm_thread_context(T.tir.vta.command_handle()), weight.data, i1_outer * 1024 + cse_var_1, 16, 16, 64, 0, 0, 0, 0, 0, 1)
                        T.tir.vta.coproc_dep_push(1, 2)
                    T.attr(T.iter_var(vta, None, "ThreadIndex", "vta"), "coproc_scope", 2)
                    T.tir.vta.coproc_dep_pop(1, 2)
                    with T.attr(T.iter_var(vta, None, "ThreadIndex", "vta"), "coproc_uop_scope", "VTAPushGEMMOp"):
                        T.call_extern("int32", "VTAUopLoopBegin", 16, 1, 0, 16)
                        T.call_extern("int32", "VTAUopLoopBegin", 16, 0, 1, 1)
                        T.tir.vta.uop_push(0, 0, 0, 0, 0, 0, 0, 0)
                        T.call_extern("int32", "VTAUopLoopEnd")
                        T.call_extern("int32", "VTAUopLoopEnd")
                    T.tir.vta.coproc_dep_push(2, 1)
                T.tir.vta.coproc_dep_pop(2, 1)
                with T.attr(T.iter_var(vta, None, "ThreadIndex", "vta"), "coproc_scope", 2):
                    with T.attr(T.iter_var(vta, None, "ThreadIndex", "vta"), "coproc_uop_scope", "VTAPushALUOp"):
                        T.call_extern("int32", "VTAUopLoopBegin", 16, 1, 1, 0)
                        T.tir.vta.uop_push(1, 0, 0, 0, 0, 3, 1, 8)
                        T.call_extern("int32", "VTAUopLoopEnd")
                    with T.attr(T.iter_var(vta, None, "ThreadIndex", "vta"), "coproc_uop_scope", "VTAPushALUOp"):
                        T.call_extern("int32", "VTAUopLoopBegin", 16, 1, 1, 0)
                        T.tir.vta.uop_push(1, 0, 0, 0, 0, 1, 1, 0)
                        T.call_extern("int32", "VTAUopLoopEnd")
                    with T.attr(T.iter_var(vta, None, "ThreadIndex", "vta"), "coproc_uop_scope", "VTAPushALUOp"):
                        T.call_extern("int32", "VTAUopLoopBegin", 16, 1, 1, 0)
                        T.tir.vta.uop_push(1, 0, 0, 0, 0, 0, 1, 127)
                        T.call_extern("int32", "VTAUopLoopEnd")
                    T.tir.vta.coproc_dep_push(2, 3)
                T.attr(T.iter_var(vta, None, "ThreadIndex", "vta"), "coproc_scope", 3)
                T.tir.vta.coproc_dep_pop(2, 3)
                for i1_inner in range(16):
                    T.call_extern("int32", "VTAStoreBuffer2D", T.tvm_thread_context(T.tir.vta.command_handle()), i1_inner, 4, res.data, i1_outer * 16 + i1_inner, 1, 1, 1)
                T.tir.vta.coproc_dep_push(3, 2)
            T.tir.vta.coproc_sync()
            T.tir.vta.coproc_dep_pop(3, 2)
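A note on reading the dump above: the integer arguments to
:code:`coproc_dep_push`/:code:`coproc_dep_pop` name VTA's command queues,
which is how the load, compute, and store stages stay synchronized while
running concurrently. The mapping below is inferred from the
:code:`coproc_scope` values visible in the dump (loads run under scope 1,
GEMM/ALU under 2, stores under 3) and is offered purely as a reading aid,
not an official API:

.. code-block:: python

    # Decode dependence tokens in the dump (an interpretation aid).
    QUEUES = {1: "load", 2: "compute", 3: "store"}

    def describe_dep(src, dst):
        return "token: {} -> {}".format(QUEUES[src], QUEUES[dst])

    print(describe_dep(3, 2))  # store signals compute that a buffer is free
    print(describe_dep(1, 2))  # load signals compute that inputs are ready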
.. GENERATED FROM PYTHON SOURCE LINES 305-311

TVM Compilation and Verification
--------------------------------
After specifying the schedule, we can compile it into a TVM function.
We save the module so we can send it over RPC.
We run the function and verify it against a numpy implementation to
ensure correctness.

.. GENERATED FROM PYTHON SOURCE LINES 311-367

.. code-block:: default


    # Compile the TVM module
    my_gemm = vta.build(
        s, [data, weight, res], tvm.target.Target("ext_dev", host=env.target_host), name="my_gemm"
    )
    temp = utils.tempdir()
    my_gemm.save(temp.relpath("gemm.o"))
    remote.upload(temp.relpath("gemm.o"))
    f = remote.load_module("gemm.o")

    # Get the remote device context
    ctx = remote.ext_dev(0)

    # Initialize the data and weight arrays randomly in the int range [-128, 128)
    data_np = np.random.randint(-128, 128, size=(batch_size, in_channels)).astype(data.dtype)
    weight_np = np.random.randint(-128, 128, size=(out_channels, in_channels)).astype(weight.dtype)

    # Apply packing to the data and weight arrays from a 2D to a 4D packed layout
    data_packed = data_np.reshape(
        batch_size // env.BATCH, env.BATCH, in_channels // env.BLOCK_IN, env.BLOCK_IN
    ).transpose((0, 2, 1, 3))
    weight_packed = weight_np.reshape(
        out_channels // env.BLOCK_OUT, env.BLOCK_OUT, in_channels // env.BLOCK_IN, env.BLOCK_IN
    ).transpose((0, 2, 1, 3))

    # Format the input/output arrays with tvm.nd.array to the DLPack standard
    data_nd = tvm.nd.array(data_packed, ctx)
    weight_nd = tvm.nd.array(weight_packed, ctx)
    res_nd = tvm.nd.array(np.zeros(output_shape).astype(res.dtype), ctx)

    # Clear stats
    if env.TARGET in ["sim", "tsim"]:
        simulator.clear_stats()

    # Invoke the module to perform the computation
    f(data_nd, weight_nd, res_nd)

    # Verify against numpy implementation
    res_ref = np.dot(data_np.astype(env.acc_dtype), weight_np.T.astype(env.acc_dtype))
    res_ref = res_ref >> env.INP_WIDTH
    res_ref = np.clip(res_ref, 0, inp_max)
    res_ref = res_ref.astype(res.dtype)
    res_ref = res_ref.reshape(
        batch_size // env.BATCH, env.BATCH, out_channels // env.BLOCK_OUT, env.BLOCK_OUT
    ).transpose((0, 2, 1, 3))
    np.testing.assert_equal(res_ref, res_nd.numpy())

    # Print stats
    if env.TARGET in ["sim", "tsim"]:
        sim_stats = simulator.stats()
        print("Execution statistics:")
        for k, v in sim_stats.items():
            print("\t{:<16}: {:>16}".format(k, v))

    print("Successful blocked matrix multiply test!")

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Execution statistics:
        inp_load_nbytes :             4096
        wgt_load_nbytes :          1048576
        acc_load_nbytes :                0
        uop_load_nbytes :               20
        out_store_nbytes:             1024
        gemm_counter    :             4096
        alu_counter     :              192

    Successful blocked matrix multiply test!
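As a sanity check, the two compute counters above can be derived by hand
from the blocked loop structure, assuming the counters tally intrinsic
executions. The sketch below is plain arithmetic using the loop bounds
established earlier (4 outer output blocks, 4 outer reduction blocks,
16 x 16 intra-block tiles, and three ALU stages for shift, max, and min):

.. code-block:: python

    # gemm_counter: one GEMM intrinsic per (oc_out, ic_out, oc_inn, ic_inn)
    oc_out, ic_out, oc_inn, ic_inn = 4, 4, 16, 16
    print(oc_out * ic_out * oc_inn * ic_inn)  # 4096, matches gemm_counter

    # alu_counter: three ALU stages (shift, max, min), each looping over
    # 16 tiles, inside the 4 outer output blocks
    print(3 * oc_out * 16)  # 192, matches alu_counter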
.. GENERATED FROM PYTHON SOURCE LINES 368-375

Summary
-------
This tutorial demonstrates how TVM scheduling primitives can achieve
computation blocking for a matrix multiplication example.
This allows us to map arbitrarily large computations onto limited
hardware accelerator resources.

.. _sphx_glr_download_topic_vta_tutorials_optimize_matrix_multiply_opt.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: matrix_multiply_opt.py <matrix_multiply_opt.py>`

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: matrix_multiply_opt.ipynb <matrix_multiply_opt.ipynb>`

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_