.. DO NOT EDIT. THIS FILE WAS AUTOMATICALLY GENERATED BY
.. TVM'S MONKEY-PATCHED VERSION OF SPHINX-GALLERY. TO MAKE
.. CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "how_to/tune_with_autoscheduler/tune_network_arm.py"

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        This tutorial can be used interactively with Google Colab! You can also click
        :ref:`here <sphx_glr_download_how_to_tune_with_autoscheduler_tune_network_arm.py>` to run the Jupyter notebook locally.

        .. image:: https://raw.githubusercontent.com/tlc-pack/web-data/main/images/utilities/colab_button.svg
            :align: center
            :target: https://colab.research.google.com/github/apache/tvm-site/blob/asf-site/docs/_downloads/4dc30a43f3a6aa3ed4bc3077ad35ff70/tune_network_arm.ipynb
            :width: 300px

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_how_to_tune_with_autoscheduler_tune_network_arm.py:


Auto-scheduling a Neural Network for ARM CPU
=============================================
**Author**: `Thierry Moreau <https://github.com/tmoreau89>`_, `Lianmin Zheng <https://github.com/merrymercy>`_, `Chengfan Jia <https://github.com/jcf94/>`_

Auto-tuning for specific devices and workloads is critical for getting the
best performance. This is a tutorial on how to tune a whole neural
network for ARM CPU with the auto-scheduler via RPC.

To auto-tune a neural network, we partition the network into small subgraphs and
tune them independently. Each subgraph is treated as one search task.
A task scheduler slices the tuning time and dynamically allocates time resources to
these tasks. It predicts the impact of each task on the end-to-end
execution time and prioritizes the task that can reduce the execution time the most.

For each subgraph, we use the compute declaration in :code:`tvm/python/topi` to
get the computational DAG in the tensor expression form.
We then use the auto-scheduler to construct a search space of this DAG and search
for good schedules (low-level optimizations).

Unlike the template-based :ref:`autotvm <tutorials-autotvm-sec>`, which relies on
manual templates to define the search space, the auto-scheduler does not require any
schedule templates. In other words, the auto-scheduler only uses the compute declarations
in :code:`tvm/python/topi` and does not use existing schedule templates.

Note that this tutorial will not run on Windows or recent versions of macOS. To
get it to run, you will need to wrap the body of this tutorial in a :code:`if
__name__ == "__main__":` block.
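
The wrapping described above can be sketched as follows. This is a minimal
illustration, not part of the original script: the function name :code:`main`
is hypothetical, and its body stands in for everything the tutorial otherwise
runs at module level. A likely reason for the requirement (an assumption, not
something this tutorial states) is that Python's :code:`multiprocessing` uses
the spawn start method on those platforms, which re-imports the main module in
every worker process.

.. code-block:: python

    def main():
        """Hypothetical wrapper for the module-level body of this tutorial."""
        # The network definition, task extraction, and the call to
        # tune_and_evaluate() would be placed here instead of at module level.
        print("Get model...")


    # Only the process that runs this file as a script executes main().
    # Worker processes that re-import the module skip it.
    if __name__ == "__main__":
        main()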

.. GENERATED FROM PYTHON SOURCE LINES 48-61

.. code-block:: default


    import numpy as np
    import os
    import sys

    import tvm
    from tvm import relay, auto_scheduler
    from tvm.relay import data_dep_optimization as ddo
    import tvm.relay.testing
    from tvm.contrib import graph_executor
    from tvm.contrib.utils import tempdir


.. GENERATED FROM PYTHON SOURCE LINES 62-74

Define a Network
----------------
First, we need to define the network with the Relay frontend API.
We can load a pre-defined network from :code:`tvm.relay.testing`.
We can also load models from MXNet, ONNX, PyTorch, and TensorFlow
(see :ref:`front end tutorials<tutorial-frontend>`).

For convolutional neural networks, although the auto-scheduler works correctly
with any layout, we found that the best performance is typically achieved with the NHWC layout.
We also implemented more optimizations for the NHWC layout in the auto-scheduler,
so we recommend converting your models to NHWC before using it.
You can use the :ref:`ConvertLayout <convert-layout-usage>` pass to do the layout conversion in TVM.
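
As a rough sketch of the layout conversion mentioned above, the following
helper applies the :code:`ConvertLayout` pass to a Relay module. The helper
name :code:`convert_to_nhwc` is hypothetical, and only :code:`nn.conv2d` is
converted here. Other operators in your model may need their own entries in
the dictionary.

.. code-block:: python

    import tvm
    from tvm import relay


    def convert_to_nhwc(mod):
        """Return a Relay module whose conv2d ops use the NHWC data layout."""
        # "default" lets TVM pick the kernel layout that goes with NHWC data.
        desired_layouts = {"nn.conv2d": ["NHWC", "default"]}
        seq = tvm.transform.Sequential(
            [
                relay.transform.RemoveUnusedFunctions(),
                relay.transform.ConvertLayout(desired_layouts),
            ]
        )
        with tvm.transform.PassContext(opt_level=3):
            return seq(mod)

The networks returned by :code:`get_network` below already accept a
:code:`layout` argument, so a helper like this is mainly of interest for
models imported from another framework.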

.. GENERATED FROM PYTHON SOURCE LINES 74-138

.. code-block:: default


    def get_network(name, batch_size, layout="NHWC", dtype="float32", use_sparse=False):
        """Get the symbol definition and random weight of a network"""

        # auto-scheduler prefers NHWC layout
        if layout == "NHWC":
            image_shape = (224, 224, 3)
        elif layout == "NCHW":
            image_shape = (3, 224, 224)
        else:
            raise ValueError("Invalid layout: " + layout)

        input_shape = (batch_size,) + image_shape
        output_shape = (batch_size, 1000)

        if name.startswith("resnet-"):
            n_layer = int(name.split("-")[1])
            mod, params = relay.testing.resnet.get_workload(
                num_layers=n_layer,
                batch_size=batch_size,
                layout=layout,
                dtype=dtype,
                image_shape=image_shape,
            )
        elif name.startswith("resnet3d-"):
            n_layer = int(name.split("-")[1])
            mod, params = relay.testing.resnet.get_workload(
                num_layers=n_layer,
                batch_size=batch_size,
                layout=layout,
                dtype=dtype,
                image_shape=image_shape,
            )
        elif name == "mobilenet":
            mod, params = relay.testing.mobilenet.get_workload(
                batch_size=batch_size, layout=layout, dtype=dtype, image_shape=image_shape
            )
        elif name == "squeezenet_v1.1":
            assert layout == "NCHW", "squeezenet_v1.1 only supports NCHW layout"
            mod, params = relay.testing.squeezenet.get_workload(
                version="1.1",
                batch_size=batch_size,
                dtype=dtype,
                image_shape=image_shape,
            )
        elif name == "inception_v3":
            input_shape = (batch_size, 3, 299, 299) if layout == "NCHW" else (batch_size, 299, 299, 3)
            mod, params = relay.testing.inception_v3.get_workload(batch_size=batch_size, dtype=dtype)
        elif name == "mlp":
            mod, params = relay.testing.mlp.get_workload(
                batch_size=batch_size, dtype=dtype, image_shape=image_shape, num_classes=1000
            )
        else:
            raise ValueError("Network not found.")

        if use_sparse:
            from tvm.topi.sparse.utils import convert_model_dense_to_sparse

            mod, params = convert_model_dense_to_sparse(mod, params, random_params=True)

        return mod, params, input_shape, output_shape


.. GENERATED FROM PYTHON SOURCE LINES 139-163

Start RPC Tracker
-----------------
TVM uses an RPC session to communicate with ARM boards.
During tuning, the tuner sends the generated code to the board and
measures the speed of the code on the board.

To scale up the tuning, TVM uses an RPC Tracker to manage distributed devices.
The RPC Tracker is a centralized controller node. We can register all devices with
the tracker. For example, if we have 10 phones, we can register all of them
with the tracker and run 10 measurements in parallel, accelerating the tuning process.

To start an RPC tracker, run this command on the host machine. The tracker is
required during the whole tuning process, so we need to open a new terminal for
this command:

.. code-block:: bash

  python -m tvm.exec.rpc_tracker --host=0.0.0.0 --port=9190

The expected output is:

.. code-block:: bash

  INFO:RPCTracker:bind to 0.0.0.0:9190

.. GENERATED FROM PYTHON SOURCE LINES 165-207

Register Devices to RPC Tracker
-----------------------------------
Now we can register our devices to the tracker. The first step is to
build the TVM runtime for the ARM devices.

* For Linux:
  Follow this section :ref:`build-tvm-runtime-on-device` to build
  the TVM runtime on the device. Then register the device with the tracker by running:

  .. code-block:: bash

    python -m tvm.exec.rpc_server --tracker=[HOST_IP]:9190 --key=rasp4b-64

  (replace :code:`[HOST_IP]` with the IP address of your host machine)

* For Android:
  Follow this `readme page <https://github.com/apache/tvm/tree/main/apps/android_rpc>`_ to
  install the TVM RPC APK on the Android device. Make sure you can pass the Android RPC test.
  Once it passes, your device is already registered. During tuning, go to the developer options,
  enable "Keep screen awake during charging", and keep your phone charging to make it stable.

After registering the devices, we can confirm the registration by querying the RPC tracker:

.. code-block:: bash

  python -m tvm.exec.query_rpc_tracker --host=0.0.0.0 --port=9190

For example, if we have 2 Huawei Mate 10 Pro phones, 11 Raspberry Pi 4B boards with a 64-bit OS,
and 2 RK3399 boards, the output can be:

.. code-block:: bash

   Queue Status
   ----------------------------------
   key          total  free  pending
   ----------------------------------
   mate10pro    2      2     0
   rk3399       2      2     0
   rasp4b-64    11     11    0
   ----------------------------------

You can register multiple devices to the tracker to accelerate the measurement in tuning.

.. GENERATED FROM PYTHON SOURCE LINES 209-215

Set Tuning Options
------------------
Before tuning, we need to apply some configurations. Here we use a Raspberry Pi 4B 4GB board
with a 64-bit OS (Ubuntu 20.04) as an example. In your setup, you should modify the target
and :code:`device_key` accordingly.
Set :code:`use_ndk` to True if you use an Android phone.
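
If you tune for several boards, one way to keep these settings together is a
small lookup table keyed by the device key. The sketch below is only an
illustration: :code:`DEVICE_PRESETS` and :code:`get_device_config` are
hypothetical names, the Raspberry Pi entry repeats the values used in this
tutorial, and the Android entry (including its target triple) is an assumption
that you should verify for your own phone.

.. code-block:: python

    # Hypothetical presets keyed by the device key registered with the tracker.
    DEVICE_PRESETS = {
        # Values taken from this tutorial's Raspberry Pi 4B configuration.
        "rasp4b-64": {
            "target": "llvm -mtriple=aarch64-linux-gnu -mattr=+neon",
            "use_ndk": False,
        },
        # Assumed values for a 64-bit Android phone; check the triple on your device.
        "mate10pro": {
            "target": "llvm -mtriple=aarch64-linux-android -mattr=+neon",
            "use_ndk": True,
        },
    }


    def get_device_config(device_key):
        """Look up the target string and NDK flag for a registered device key."""
        try:
            return DEVICE_PRESETS[device_key]
        except KeyError:
            raise ValueError("Unknown device key: " + device_key) from None


    config = get_device_config("rasp4b-64")
    print(config["target"], config["use_ndk"])

The returned target string can then be passed to :code:`tvm.target.Target`,
as done in the configuration block below.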

.. GENERATED FROM PYTHON SOURCE LINES 215-242

.. code-block:: default


    #### DEVICE CONFIG ####

    # Replace "aarch64-linux-gnu" with the correct target of your board.
    # This target is used for cross compilation. You can query it by running `gcc -v` on your device.
    # FIXME(tmoreau89, merrymercy): We leave '-device=arm_cpu' out of the target string
    #                               because we're sharing x86 op strategy.
    target = tvm.target.Target("llvm -mtriple=aarch64-linux-gnu -mattr=+neon")

    # Also replace this with the device key, rpc host and rpc port in your tracker
    device_key = "rasp4b-64"
    rpc_host = "127.0.0.1"
    rpc_port = 9190

    # Set this to True if you use ndk tools for cross compiling
    # And also set the environment variable below to point to the cross compiler
    use_ndk = False
    # os.environ["TVM_NDK_CC"] = "/usr/bin/aarch64-linux-gnu-g++"

    #### TUNING OPTION ####
    network = "mobilenet"
    use_sparse = False
    batch_size = 1
    layout = "NHWC"
    dtype = "float32"
    log_file = "%s-%s-B%d-%s.json" % (network, layout, batch_size, target.kind.name)


.. GENERATED FROM PYTHON SOURCE LINES 243-252

Extract Search Tasks
--------------------
Next, we extract the search tasks and their weights from a network.
The weight of a task is the number of appearances of the task's subgraph
in the whole network.
By using the weight, we can approximate the end-to-end latency of the network
as :code:`sum(latency[t] * weight[t])`, where :code:`latency[t]` is the
latency of a task and :code:`weight[t]` is the weight of the task.
The task scheduler optimizes exactly this objective.
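
The objective can be illustrated with plain Python. The function name and the
numbers below are made up for illustration and are not measurements from this
tutorial.

.. code-block:: python

    def estimate_end_to_end_latency(latencies, weights):
        """Approximate the network latency as sum(latency[t] * weight[t])."""
        if len(latencies) != len(weights):
            raise ValueError("latencies and weights must have the same length")
        return sum(lat * w for lat, w in zip(latencies, weights))


    # Hypothetical per-task latencies (in ms) and the number of times each
    # task's subgraph appears in the network.
    example_latencies_ms = [0.5, 2.0, 0.25]
    example_weights = [1, 4, 2]

    total_ms = estimate_end_to_end_latency(example_latencies_ms, example_weights)
    print("Estimated end-to-end latency: %.2f ms" % total_ms)

In this example the second task dominates the estimate because it is both slow
and repeated four times, so improving it would reduce the total the most.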

.. GENERATED FROM PYTHON SOURCE LINES 252-266

.. code-block:: default


    # Extract tasks from the network
    print("Get model...")
    mod, params, input_shape, output_shape = get_network(
        network, batch_size, layout, dtype=dtype, use_sparse=use_sparse
    )
    print("Extract tasks...")
    tasks, task_weights = auto_scheduler.extract_tasks(mod["main"], params, target)

    for idx, task in enumerate(tasks):
        print("========== Task %d  (workload key: %s) ==========" % (idx, task.workload_key))
        print(task.compute_dag)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Get model...
    Extract tasks...
    ========== Task 0  (workload key: ["f439cb5bb83bb6be4824f8b51c68bdb9", [1, 28, 28, 256], [3, 3, 256, 1], [1, 1, 1, 256], [1, 14, 14, 256]]) ==========
    p0 = PLACEHOLDER [1, 28, 28, 256]
    data_pad(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 1) && (i1 < 29)) && (i2 >= 1)) && (i2 < 29)), p0[i0, (i1 - 1), (i2 - 1), i3], 0f)
    p1 = PLACEHOLDER [3, 3, 256, 1]
    depthwise_conv2d_nhwc_output(n, h, w, c) += (data_pad[n, ((2*h) + reduce_h), ((w*2) + reduce_w), c]*p1[reduce_h, reduce_w, c, 0])
    p2 = PLACEHOLDER [1, 1, 1, 256]
    T_add(ax0, ax1, ax2, ax3) = (depthwise_conv2d_nhwc_output[ax0, ax1, ax2, ax3] + p2[ax0, 0, 0, ax3])
    T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)

    ========== Task 1  (workload key: ["f439cb5bb83bb6be4824f8b51c68bdb9", [1, 112, 112, 64], [3, 3, 64, 1], [1, 1, 1, 64], [1, 56, 56, 64]]) ==========
    p0 = PLACEHOLDER [1, 112, 112, 64]
    data_pad(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 1) && (i1 < 113)) && (i2 >= 1)) && (i2 < 113)), p0[i0, (i1 - 1), (i2 - 1), i3], 0f)
    p1 = PLACEHOLDER [3, 3, 64, 1]
    depthwise_conv2d_nhwc_output(n, h, w, c) += (data_pad[n, ((2*h) + reduce_h), ((w*2) + reduce_w), c]*p1[reduce_h, reduce_w, c, 0])
    p2 = PLACEHOLDER [1, 1, 1, 64]
    T_add(ax0, ax1, ax2, ax3) = (depthwise_conv2d_nhwc_output[ax0, ax1, ax2, ax3] + p2[ax0, 0, 0, ax3])
    T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)

    ========== Task 2  (workload key: ["08f7449d79e570b7274174709e5e5e01", [1, 1024], [1000, 1024], [1, 1000], [1, 1000]]) ==========
    p0 = PLACEHOLDER [1, 1024]
    p1 = PLACEHOLDER [1000, 1024]
    T_matmul_NT(i0, i1) += (p0[i0, k]*p1[i1, k])
    p2 = PLACEHOLDER [1, 1000]
    T_add(ax0, ax1) = (T_matmul_NT[ax0, ax1] + p2[ax0, ax1])

    ========== Task 3  (workload key: ["8c53ca2904398da2889aa7508082d7bb", [1, 7, 7, 1024], [1, 1, 1, 1024]]) ==========
    p0 = PLACEHOLDER [1, 7, 7, 1024]
    adaptive_pool_sum(ax0, ax1, ax2, ax3) += p0[ax0, ((ax1*7) + rv0), ((ax2*7) + rv1), ax3]
    adaptive_pool_avg(ax0, ax1, ax2, ax3) = (adaptive_pool_sum[ax0, ax1, ax2, ax3]/(float32((select((bool)1, ((ax1 + 1)*7), (((ax1 + 1)*7) + 1)) - (ax1*7)))*float32((select((bool)1, ((ax2 + 1)*7), (((ax2 + 1)*7) + 1)) - (ax2*7)))))

    ========== Task 4  (workload key: ["52ea56a3ba2b8548ac83748ee8a69093", [1, 7, 7, 1024], [3, 3, 1024, 1], [1, 1, 1, 1024], [1, 7, 7, 1024]]) ==========
    p0 = PLACEHOLDER [1, 7, 7, 1024]
    data_pad(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 1) && (i1 < 8)) && (i2 >= 1)) && (i2 < 8)), p0[i0, (i1 - 1), (i2 - 1), i3], 0f)
    p1 = PLACEHOLDER [3, 3, 1024, 1]
    depthwise_conv2d_nhwc_output(n, h, w, c) += (data_pad[n, (h + reduce_h), (w + reduce_w), c]*p1[reduce_h, reduce_w, c, 0])
    p2 = PLACEHOLDER [1, 1, 1, 1024]
    T_add(ax0, ax1, ax2, ax3) = (depthwise_conv2d_nhwc_output[ax0, ax1, ax2, ax3] + p2[ax0, 0, 0, ax3])
    T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)

    ========== Task 5  (workload key: ["4af5640cbcabd364154463f55de7967a", [1, 112, 112, 32], [4, 8, 4, 16], [1, 1, 1, 64], [1, 112, 112, 64]]) ==========
    p0 = PLACEHOLDER [1, 112, 112, 32]
    T_reshape(ax0, ax1, ax2) = p0[0, floormod(floordiv(floordiv(((((ax0*12544) + ax1)*32) + ax2), 32), 112), 112), floormod(floordiv(((((ax0*12544) + ax1)*32) + ax2), 32), 112), floormod(((((ax0*12544) + ax1)*32) + ax2), 32)]
    p1 = PLACEHOLDER [4, 8, 4, 16]
    C(b, x, y) += (T_reshape[b, x, k]*p1[floordiv(y, 16), floordiv(k, 4), floormod(k, 4), floormod(y, 16)])
    conv2d_gemm_output(b, x, y, z) = C[b, (y + (112*x)), z]
    p2 = PLACEHOLDER [1, 1, 1, 64]
    T_add(ax0, ax1, ax2, ax3) = (conv2d_gemm_output[ax0, ax1, ax2, ax3] + p2[ax0, 0, 0, ax3])
    T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)

    ========== Task 6  (workload key: ["d3515ea31fce37df25c44b9fc6df3249", [1, 7, 7, 1024], [64, 256, 4, 16], [1, 1, 1, 1024], [1, 7, 7, 1024]]) ==========
    p0 = PLACEHOLDER [1, 7, 7, 1024]
    T_reshape(ax0, ax1, ax2) = p0[0, floormod(floordiv(floordiv(((((ax0*49) + ax1)*1024) + ax2), 1024), 7), 7), floormod(floordiv(((((ax0*49) + ax1)*1024) + ax2), 1024), 7), floormod(((((ax0*49) + ax1)*1024) + ax2), 1024)]
    A_padded_M(i0, i1, i2) = tir.if_then_else(((i1 >= 0) && (i1 < 49)), T_reshape[i0, i1, i2], 0f)
    p1 = PLACEHOLDER [64, 256, 4, 16]
    C(b, x, y) += (A_padded_M[b, x, k]*p1[floordiv(y, 16), floordiv(k, 4), floormod(k, 4), floormod(y, 16)])
    conv2d_gemm_output(b, x, y, z) = C[b, (y + (7*x)), z]
    p2 = PLACEHOLDER [1, 1, 1, 1024]
    T_add(ax0, ax1, ax2, ax3) = (conv2d_gemm_output[ax0, ax1, ax2, ax3] + p2[ax0, 0, 0, ax3])
    T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)

    ========== Task 7  (workload key: ["52ea56a3ba2b8548ac83748ee8a69093", [1, 28, 28, 256], [3, 3, 256, 1], [1, 1, 1, 256], [1, 28, 28, 256]]) ==========
    p0 = PLACEHOLDER [1, 28, 28, 256]
    data_pad(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 1) && (i1 < 29)) && (i2 >= 1)) && (i2 < 29)), p0[i0, (i1 - 1), (i2 - 1), i3], 0f)
    p1 = PLACEHOLDER [3, 3, 256, 1]
    depthwise_conv2d_nhwc_output(n, h, w, c) += (data_pad[n, (h + reduce_h), (w + reduce_w), c]*p1[reduce_h, reduce_w, c, 0])
    p2 = PLACEHOLDER [1, 1, 1, 256]
    T_add(ax0, ax1, ax2, ax3) = (depthwise_conv2d_nhwc_output[ax0, ax1, ax2, ax3] + p2[ax0, 0, 0, ax3])
    T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)

    ========== Task 8  (workload key: ["bdc285eec5e051bc3f230cdb78c10af1", [1, 28, 28, 256], [16, 64, 4, 16], [1, 1, 1, 256], [1, 28, 28, 256]]) ==========
    p0 = PLACEHOLDER [1, 28, 28, 256]
    T_reshape(ax0, ax1, ax2) = p0[0, floormod(floordiv(floordiv(((((ax0*784) + ax1)*256) + ax2), 256), 28), 28), floormod(floordiv(((((ax0*784) + ax1)*256) + ax2), 256), 28), floormod(((((ax0*784) + ax1)*256) + ax2), 256)]
    p1 = PLACEHOLDER [16, 64, 4, 16]
    C(b, x, y) += (T_reshape[b, x, k]*p1[floordiv(y, 16), floordiv(k, 4), floormod(k, 4), floormod(y, 16)])
    conv2d_gemm_output(b, x, y, z) = C[b, (y + (28*x)), z]
    p2 = PLACEHOLDER [1, 1, 1, 256]
    T_add(ax0, ax1, ax2, ax3) = (conv2d_gemm_output[ax0, ax1, ax2, ax3] + p2[ax0, 0, 0, ax3])
    T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)

    ========== Task 9  (workload key: ["6065995ec7d6d6ae2eae7d4e874e8441", [1, 28, 28, 128], [16, 32, 4, 16], [1, 1, 1, 256], [1, 28, 28, 256]]) ==========
    p0 = PLACEHOLDER [1, 28, 28, 128]
    T_reshape(ax0, ax1, ax2) = p0[0, floormod(floordiv(floordiv(((((ax0*784) + ax1)*128) + ax2), 128), 28), 28), floormod(floordiv(((((ax0*784) + ax1)*128) + ax2), 128), 28), floormod(((((ax0*784) + ax1)*128) + ax2), 128)]
    p1 = PLACEHOLDER [16, 32, 4, 16]
    C(b, x, y) += (T_reshape[b, x, k]*p1[floordiv(y, 16), floordiv(k, 4), floormod(k, 4), floormod(y, 16)])
    conv2d_gemm_output(b, x, y, z) = C[b, (y + (28*x)), z]
    p2 = PLACEHOLDER [1, 1, 1, 256]
    T_add(ax0, ax1, ax2, ax3) = (conv2d_gemm_output[ax0, ax1, ax2, ax3] + p2[ax0, 0, 0, ax3])
    T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)

    ========== Task 10  (workload key: ["7d79c516e212fe1d73f5dbb90eaca2cf", [1, 1000], [1, 1000]]) ==========
    p0 = PLACEHOLDER [1, 1000]
    T_softmax_maxelem(i0) max= p0[i0, k]
    T_softmax_exp(i0, i1) = tir.exp((p0[i0, i1] - T_softmax_maxelem[i0]))
    T_softmax_expsum(i0) += T_softmax_exp[i0, k]
    T_softmax_norm(i0, i1) = (T_softmax_exp[i0, i1]/T_softmax_expsum[i0])

    ========== Task 11  (workload key: ["52ea56a3ba2b8548ac83748ee8a69093", [1, 112, 112, 32], [3, 3, 32, 1], [1, 1, 1, 32], [1, 112, 112, 32]]) ==========
    p0 = PLACEHOLDER [1, 112, 112, 32]
    data_pad(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 1) && (i1 < 113)) && (i2 >= 1)) && (i2 < 113)), p0[i0, (i1 - 1), (i2 - 1), i3], 0f)
    p1 = PLACEHOLDER [3, 3, 32, 1]
    depthwise_conv2d_nhwc_output(n, h, w, c) += (data_pad[n, (h + reduce_h), (w + reduce_w), c]*p1[reduce_h, reduce_w, c, 0])
    p2 = PLACEHOLDER [1, 1, 1, 32]
    T_add(ax0, ax1, ax2, ax3) = (depthwise_conv2d_nhwc_output[ax0, ax1, ax2, ax3] + p2[ax0, 0, 0, ax3])
    T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)

    ========== Task 12  (workload key: ["f439cb5bb83bb6be4824f8b51c68bdb9", [1, 14, 14, 512], [3, 3, 512, 1], [1, 1, 1, 512], [1, 7, 7, 512]]) ==========
    p0 = PLACEHOLDER [1, 14, 14, 512]
    data_pad(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 1) && (i1 < 15)) && (i2 >= 1)) && (i2 < 15)), p0[i0, (i1 - 1), (i2 - 1), i3], 0f)
    p1 = PLACEHOLDER [3, 3, 512, 1]
    depthwise_conv2d_nhwc_output(n, h, w, c) += (data_pad[n, ((2*h) + reduce_h), ((w*2) + reduce_w), c]*p1[reduce_h, reduce_w, c, 0])
    p2 = PLACEHOLDER [1, 1, 1, 512]
    T_add(ax0, ax1, ax2, ax3) = (depthwise_conv2d_nhwc_output[ax0, ax1, ax2, ax3] + p2[ax0, 0, 0, ax3])
    T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)

    ========== Task 13  (workload key: ["6d2a3ff0fc84a990f7c7fc00e597955f", [1, 14, 14, 256], [32, 64, 4, 16], [1, 1, 1, 512], [1, 14, 14, 512]]) ==========
    p0 = PLACEHOLDER [1, 14, 14, 256]
    T_reshape(ax0, ax1, ax2) = p0[0, floormod(floordiv(floordiv(((((ax0*196) + ax1)*256) + ax2), 256), 14), 14), floormod(floordiv(((((ax0*196) + ax1)*256) + ax2), 256), 14), floormod(((((ax0*196) + ax1)*256) + ax2), 256)]
    p1 = PLACEHOLDER [32, 64, 4, 16]
    C(b, x, y) += (T_reshape[b, x, k]*p1[floordiv(y, 16), floordiv(k, 4), floormod(k, 4), floormod(y, 16)])
    conv2d_gemm_output(b, x, y, z) = C[b, (y + (14*x)), z]
    p2 = PLACEHOLDER [1, 1, 1, 512]
    T_add(ax0, ax1, ax2, ax3) = (conv2d_gemm_output[ax0, ax1, ax2, ax3] + p2[ax0, 0, 0, ax3])
    T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)

    ========== Task 14  (workload key: ["f439cb5bb83bb6be4824f8b51c68bdb9", [1, 56, 56, 128], [3, 3, 128, 1], [1, 1, 1, 128], [1, 28, 28, 128]]) ==========
    p0 = PLACEHOLDER [1, 56, 56, 128]
    data_pad(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 1) && (i1 < 57)) && (i2 >= 1)) && (i2 < 57)), p0[i0, (i1 - 1), (i2 - 1), i3], 0f)
    p1 = PLACEHOLDER [3, 3, 128, 1]
    depthwise_conv2d_nhwc_output(n, h, w, c) += (data_pad[n, ((2*h) + reduce_h), ((w*2) + reduce_w), c]*p1[reduce_h, reduce_w, c, 0])
    p2 = PLACEHOLDER [1, 1, 1, 128]
    T_add(ax0, ax1, ax2, ax3) = (depthwise_conv2d_nhwc_output[ax0, ax1, ax2, ax3] + p2[ax0, 0, 0, ax3])
    T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)

    ========== Task 15  (workload key: ["e2c58e45605443e8e4dd0ce586633232", [1, 56, 56, 64], [8, 16, 4, 16], [1, 1, 1, 128], [1, 56, 56, 128]]) ==========
    p0 = PLACEHOLDER [1, 56, 56, 64]
    T_reshape(ax0, ax1, ax2) = p0[0, floormod(floordiv(floordiv(((((ax0*3136) + ax1)*64) + ax2), 64), 56), 56), floormod(floordiv(((((ax0*3136) + ax1)*64) + ax2), 64), 56), floormod(((((ax0*3136) + ax1)*64) + ax2), 64)]
    p1 = PLACEHOLDER [8, 16, 4, 16]
    C(b, x, y) += (T_reshape[b, x, k]*p1[floordiv(y, 16), floordiv(k, 4), floormod(k, 4), floormod(y, 16)])
    conv2d_gemm_output(b, x, y, z) = C[b, (y + (56*x)), z]
    p2 = PLACEHOLDER [1, 1, 1, 128]
    T_add(ax0, ax1, ax2, ax3) = (conv2d_gemm_output[ax0, ax1, ax2, ax3] + p2[ax0, 0, 0, ax3])
    T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)

    ========== Task 16  (workload key: ["56da86c7b2bc22e4d100fe5b39dce716", [1, 224, 224, 3], [2, 8, 4, 16], [1, 112, 1, 1], [1, 112, 1, 1], [1, 112, 112, 32]]) ==========
    p0 = PLACEHOLDER [1, 224, 224, 3]
    data_pad(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 1) && (i1 < 225)) && (i2 >= 1)) && (i2 < 225)), p0[i0, (i1 - 1), (i2 - 1), i3], 0f)
    data_im2col(n, x, y) = data_pad[n, ((2*floordiv(x, 112)) + floordiv(floordiv(y, 3), 3)), ((2*floormod(x, 112)) + floormod(floordiv(y, 3), 3)), floormod(y, 3)]
    A_padded_K(i0, i1, i2) = tir.if_then_else(((i2 >= 0) && (i2 < 27)), data_im2col[i0, i1, i2], 0f)
    p1 = PLACEHOLDER [2, 8, 4, 16]
    C(b, x, y) += (A_padded_K[b, x, k]*p1[floordiv(y, 16), floordiv(k, 4), floormod(k, 4), floormod(y, 16)])
    conv2d_gemm_output(b, x, y, z) = C[b, (y + (112*x)), z]
    p2 = PLACEHOLDER [1, 112, 1, 1]
    T_multiply(ax0, ax1, ax2, ax3) = (conv2d_gemm_output[ax0, ax1, ax2, ax3]*p2[ax0, ax1, 0, 0])
    p3 = PLACEHOLDER [1, 112, 1, 1]
    T_add(ax0, ax1, ax2, ax3) = (T_multiply[ax0, ax1, ax2, ax3] + p3[ax0, ax1, 0, 0])
    T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)

    ========== Task 17  (workload key: ["f1b4831e3be69b07c24f46d5513875ea", [1, 14, 14, 512], [32, 128, 4, 16], [1, 1, 1, 512], [1, 14, 14, 512]]) ==========
    p0 = PLACEHOLDER [1, 14, 14, 512]
    T_reshape(ax0, ax1, ax2) = p0[0, floormod(floordiv(floordiv(((((ax0*196) + ax1)*512) + ax2), 512), 14), 14), floormod(floordiv(((((ax0*196) + ax1)*512) + ax2), 512), 14), floormod(((((ax0*196) + ax1)*512) + ax2), 512)]
    p1 = PLACEHOLDER [32, 128, 4, 16]
    C(b, x, y) += (T_reshape[b, x, k]*p1[floordiv(y, 16), floordiv(k, 4), floormod(k, 4), floormod(y, 16)])
    conv2d_gemm_output(b, x, y, z) = C[b, (y + (14*x)), z]
    p2 = PLACEHOLDER [1, 1, 1, 512]
    T_add(ax0, ax1, ax2, ax3) = (conv2d_gemm_output[ax0, ax1, ax2, ax3] + p2[ax0, 0, 0, ax3])
    T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)

    ========== Task 18  (workload key: ["52ea56a3ba2b8548ac83748ee8a69093", [1, 14, 14, 512], [3, 3, 512, 1], [1, 1, 1, 512], [1, 14, 14, 512]]) ==========
    p0 = PLACEHOLDER [1, 14, 14, 512]
    data_pad(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 1) && (i1 < 15)) && (i2 >= 1)) && (i2 < 15)), p0[i0, (i1 - 1), (i2 - 1), i3], 0f)
    p1 = PLACEHOLDER [3, 3, 512, 1]
    depthwise_conv2d_nhwc_output(n, h, w, c) += (data_pad[n, (h + reduce_h), (w + reduce_w), c]*p1[reduce_h, reduce_w, c, 0])
    p2 = PLACEHOLDER [1, 1, 1, 512]
    T_add(ax0, ax1, ax2, ax3) = (depthwise_conv2d_nhwc_output[ax0, ax1, ax2, ax3] + p2[ax0, 0, 0, ax3])
    T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)

    ========== Task 19  (workload key: ["52ea56a3ba2b8548ac83748ee8a69093", [1, 56, 56, 128], [3, 3, 128, 1], [1, 1, 1, 128], [1, 56, 56, 128]]) ==========
    p0 = PLACEHOLDER [1, 56, 56, 128]
    data_pad(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 1) && (i1 < 57)) && (i2 >= 1)) && (i2 < 57)), p0[i0, (i1 - 1), (i2 - 1), i3], 0f)
    p1 = PLACEHOLDER [3, 3, 128, 1]
    depthwise_conv2d_nhwc_output(n, h, w, c) += (data_pad[n, (h + reduce_h), (w + reduce_w), c]*p1[reduce_h, reduce_w, c, 0])
    p2 = PLACEHOLDER [1, 1, 1, 128]
    T_add(ax0, ax1, ax2, ax3) = (depthwise_conv2d_nhwc_output[ax0, ax1, ax2, ax3] + p2[ax0, 0, 0, ax3])
    T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)

    ========== Task 20  (workload key: ["bcb562e2cf839ad7d45b93ea7b0abb0d", [1, 56, 56, 128], [8, 32, 4, 16], [1, 1, 1, 128], [1, 56, 56, 128]]) ==========
    p0 = PLACEHOLDER [1, 56, 56, 128]
    T_reshape(ax0, ax1, ax2) = p0[0, floormod(floordiv(floordiv(((((ax0*3136) + ax1)*128) + ax2), 128), 56), 56), floormod(floordiv(((((ax0*3136) + ax1)*128) + ax2), 128), 56), floormod(((((ax0*3136) + ax1)*128) + ax2), 128)]
    p1 = PLACEHOLDER [8, 32, 4, 16]
    C(b, x, y) += (T_reshape[b, x, k]*p1[floordiv(y, 16), floordiv(k, 4), floormod(k, 4), floormod(y, 16)])
    conv2d_gemm_output(b, x, y, z) = C[b, (y + (56*x)), z]
    p2 = PLACEHOLDER [1, 1, 1, 128]
    T_add(ax0, ax1, ax2, ax3) = (conv2d_gemm_output[ax0, ax1, ax2, ax3] + p2[ax0, 0, 0, ax3])
    T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)

    ========== Task 21  (workload key: ["ec95d1874524b128a6afe8f7bff0c738", [1, 7, 7, 512], [64, 128, 4, 16], [1, 1, 1, 1024], [1, 7, 7, 1024]]) ==========
    p0 = PLACEHOLDER [1, 7, 7, 512]
    T_reshape(ax0, ax1, ax2) = p0[0, floormod(floordiv(floordiv(((((ax0*49) + ax1)*512) + ax2), 512), 7), 7), floormod(floordiv(((((ax0*49) + ax1)*512) + ax2), 512), 7), floormod(((((ax0*49) + ax1)*512) + ax2), 512)]
    A_padded_M(i0, i1, i2) = tir.if_then_else(((i1 >= 0) && (i1 < 49)), T_reshape[i0, i1, i2], 0f)
    p1 = PLACEHOLDER [64, 128, 4, 16]
    C(b, x, y) += (A_padded_M[b, x, k]*p1[floordiv(y, 16), floordiv(k, 4), floormod(k, 4), floormod(y, 16)])
    conv2d_gemm_output(b, x, y, z) = C[b, (y + (7*x)), z]
    p2 = PLACEHOLDER [1, 1, 1, 1024]
    T_add(ax0, ax1, ax2, ax3) = (conv2d_gemm_output[ax0, ax1, ax2, ax3] + p2[ax0, 0, 0, ax3])
    T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)


.. GENERATED FROM PYTHON SOURCE LINES 267-286

Tuning and Evaluation
---------------------
Now, we set some options for tuning and launch the search tasks:

* :code:`num_measure_trials` is the number of measurement trials we can use during the tuning.
  You can set it to a small number (e.g., 200) for a fast demonstrative run.
  In practice, we recommend setting it around :code:`800 * len(tasks)`,
  which is typically enough for the search to converge.
  For example, there are 29 tasks in resnet-50, so we can set it to about 20000.
  You can adjust this parameter according to your time budget.
* In addition, we use :code:`RecordToFile` to dump measurement records into a log file.
  The measurement records can be used to query the history best, resume the search,
  and do more analyses later.
* See :any:`auto_scheduler.TuningOptions` and
  :any:`auto_scheduler.RPCRunner` for more parameters.

After auto-tuning, we can compile the network with the best schedules we found.
All measurement records are dumped into the log file during auto-tuning,
so we can read the log file and load the best schedules.
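
The rule of thumb for :code:`num_measure_trials` can be written as a small
helper. The function name is hypothetical, and the default of 800 trials per
task is simply the recommendation given above.

.. code-block:: python

    def suggest_num_measure_trials(num_tasks, trials_per_task=800):
        """Return a trial budget of roughly trials_per_task for every task."""
        if num_tasks <= 0:
            raise ValueError("num_tasks must be positive")
        return num_tasks * trials_per_task


    # The mobilenet example above extracts 22 tasks (Task 0 through Task 21).
    full_run_trials = suggest_num_measure_trials(22)
    print(full_run_trials)

In the script you would pass :code:`len(tasks)` instead of a literal number
and reduce the result if your time budget is limited.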

.. GENERATED FROM PYTHON SOURCE LINES 286-351

.. code-block:: default


    def tune_and_evaluate():
        print("Begin tuning...")
        tuner = auto_scheduler.TaskScheduler(tasks, task_weights)
        tune_option = auto_scheduler.TuningOptions(
            num_measure_trials=200,  # change this to 20000 to achieve the best performance
            builder=auto_scheduler.LocalBuilder(build_func="ndk" if use_ndk else "default"),
            runner=auto_scheduler.RPCRunner(
                device_key,
                host=rpc_host,
                port=rpc_port,
                timeout=30,
                repeat=1,
                min_repeat_ms=200,
                enable_cpu_cache_flush=True,
            ),
            measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
        )

        tuner.tune(tune_option)

        # Compile with the history best
        print("Compile...")
        with auto_scheduler.ApplyHistoryBest(log_file):
            with tvm.transform.PassContext(
                opt_level=3, config={"relay.backend.use_auto_scheduler": True}
            ):
                lib = relay.build(mod, target=target, params=params)

        # Export library
        tmp = tempdir()
        if use_ndk:
            from tvm.contrib import ndk

            filename = "net.so"
            lib.export_library(tmp.relpath(filename), fcompile=ndk.create_shared)
        else:
            filename = "net.tar"
            lib.export_library(tmp.relpath(filename))

        # Upload module to device
        print("Upload...")
        remote = auto_scheduler.utils.request_remote(device_key, rpc_host, rpc_port, timeout=10000)
        remote.upload(tmp.relpath(filename))
        rlib = remote.load_module(filename)

        # Create graph executor
        dev = remote.cpu()
        module = graph_executor.GraphModule(rlib["default"](dev))
        data_tvm = tvm.nd.array((np.random.uniform(size=input_shape)).astype(dtype))
        module.set_input("data", data_tvm)

        # Evaluate
        print("Evaluate inference time cost...")
        print(module.benchmark(dev, repeat=3, min_repeat_ms=500))


    # We do not run the tuning on our webpage server since the server has neither a Raspberry Pi
    # nor a device tracker running.
    # Uncomment the following line to run it by yourself.

    # tune_and_evaluate()


.. GENERATED FROM PYTHON SOURCE LINES 352-403

.. note:: Explaining the printed information during tuning

  During the tuning, a lot of information will be printed on the console.
  Most of it is used for debugging purposes. The most important information is the output
  of the task scheduler. The following table is a sample output.

  .. code-block:: c

   ----------------------------------------------------------------------
   ------------------------------  [ Task Scheduler ]
   ----------------------------------------------------------------------
   |  ID  | Latency (ms) | Speed (GFLOPS) | Trials |
   -------------------------------------------------
   |    0 |        0.013 |           0.31 |     64 |
   |    1 |        0.845 |           2.43 |    448 |
   |    2 |        0.046 |          -0.00 |     64 |
   |    3 |        4.194 |          24.53 |   2112 |
   |    4 |        0.109 |           9.21 |     64 |
   |    5 |        1.759 |          29.27 |    896 |
   |    6 |        0.083 |           6.01 |     64 |
   |    7 |        3.084 |          33.38 |   7680 |
   |    8 |        0.136 |          14.78 |    384 |
   |    9 |        1.349 |          38.23 |    768 |
   |   10 |        0.133 |           7.55 |    128 |
   |   11 |        2.747 |          37.56 |   1536 |
   |   12 |        0.338 |          11.87 |    192 |
   |   13 |        1.295 |          40.00 |    704 |
   |   14 |        0.482 |           4.16 |    256 |
   |   15 |        2.686 |          38.56 |   1344 |
   |   16 |        0.884 |           9.08 |    448 |
   |   17 |        1.332 |          39.18 |    704 |
   |   18 |        1.045 |           3.84 |    576 |
   |   19 |        1.391 |          38.09 |    704 |
   |   20 |        0.777 |          10.34 |    448 |
   |   21 |        0.739 |          30.97 |    448 |
   -------------------------------------------------
    Estimated total latency: 38.347 ms      Trials: 19992   Used time : 19260 s     Next ID: 3

  This table lists the latency and (estimated) speed of all tasks.
  It also lists the allocation of measurement trials for all tasks.
  The last line prints the total weighted latency of these tasks,
  which is a rough estimate of the end-to-end execution time
  of the network.
  The last line also prints the total number of measurement trials,
  the total time spent on auto-tuning, and the ID of the next task to tune.

  There will also be some "dmlc::Error" messages, because the
  auto-scheduler will try some invalid schedules.
  You can safely ignore them if the tuning can continue, because these
  errors are isolated from the main process.


.. GENERATED FROM PYTHON SOURCE LINES 405-411

.. note:: Terminate the tuning earlier

  You can terminate the tuning earlier by forcibly killing this process.
  As long as you get at least one valid schedule for each task in the log file,
  you should be able to do the compilation (the compile step in :code:`tune_and_evaluate` above).


.. GENERATED FROM PYTHON SOURCE LINES 413-429

Other Tips
----------
1. During the tuning, the auto-scheduler needs to compile many programs and
   extract features from them. This part is CPU-intensive,
   so a high-performance CPU with many cores is recommended for faster search.
2. You can use :code:`python3 -m tvm.auto_scheduler.measure_record --mode distill -i log.json`
   to distill the large log file and save only the most useful records.
3. You can resume a search from the previous log file. You just need to
   add a new argument :code:`load_log_file` when creating the task scheduler
   in function :code:`tune_and_evaluate`. For example,
   :code:`tuner = auto_scheduler.TaskScheduler(tasks, task_weights, load_log_file=log_file)`
4. If you have multiple target CPUs, you can use all of them for measurements to
   parallelize the measurements. Check this :ref:`section <tutorials-autotvm-scale-up-rpc-tracker>`
   to learn how to use the RPC Tracker and RPC Server.
   To use the RPC Tracker in auto-scheduler, replace the runner in :code:`TuningOptions`
   with :any:`auto_scheduler.RPCRunner`.
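
For tip 3, a script that is run repeatedly can decide at start-up whether
there is a previous log to resume from. The helper below is a sketch with a
hypothetical name. It only builds the extra keyword argument and does not
import TVM itself.

.. code-block:: python

    import os


    def resume_kwargs(log_file):
        """Return TaskScheduler keyword arguments that resume from log_file.

        An empty dict is returned when the file is missing or empty, so the
        first run starts a fresh search.
        """
        if os.path.isfile(log_file) and os.path.getsize(log_file) > 0:
            return {"load_log_file": log_file}
        return {}


    # Intended use inside tune_and_evaluate (requires TVM, so shown as a comment):
    #   tuner = auto_scheduler.TaskScheduler(tasks, task_weights, **resume_kwargs(log_file))
    print(resume_kwargs("does-not-exist.json"))

Checking for a non-empty file is a design choice of this sketch. Passing
:code:`load_log_file` directly, as in tip 3, is enough when you know the log
exists.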


.. _sphx_glr_download_how_to_tune_with_autoscheduler_tune_network_arm.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example


    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: tune_network_arm.py <tune_network_arm.py>`

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: tune_network_arm.ipynb <tune_network_arm.ipynb>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_