.. rst-class:: sphx-glr-example-title

.. _sphx_glr_how_to_work_with_schedules_tuple_inputs.py:


Compute and Reduce with Tuple Inputs
====================================
**Author**: `Ziheng Jiang `_

Often we want to compute multiple outputs with the same shape within
a single loop, or perform a reduction that involves multiple values,
such as :code:`argmax`. These problems can be addressed with tuple inputs.

In this tutorial, we will introduce the usage of tuple inputs in TVM.

.. code-block:: default


    from __future__ import absolute_import, print_function

    import tvm
    from tvm import te
    import numpy as np


Describe Batchwise Computation
------------------------------
For operators whose outputs have the same shape, we can put them together
as the inputs of :any:`te.compute` if we want them to be scheduled
together in the subsequent schedule procedure.

.. code-block:: default


    n = te.var("n")
    m = te.var("m")
    A0 = te.placeholder((m, n), name="A0")
    A1 = te.placeholder((m, n), name="A1")
    B0, B1 = te.compute((m, n), lambda i, j: (A0[i, j] + 2, A1[i, j] * 3), name="B")

    # The generated IR code would be:
    s = te.create_schedule(B0.op)
    print(tvm.lower(s, [A0, A1, B0, B1], simple_mode=True))


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    primfn(A0_1: handle, A1_1: handle, B_2: handle, B_3: handle) -> ()
      attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
      buffers = {B_1: Buffer(B_4: Pointer(float32), float32, [m: int32, n: int32], [stride: int32, stride_1: int32], type="auto"),
                 B: Buffer(B_5: Pointer(float32), float32, [m, n], [stride_2: int32, stride_3: int32], type="auto"),
                 A0: Buffer(A0_2: Pointer(float32), float32, [m, n], [stride_4: int32, stride_5: int32], type="auto"),
                 A1: Buffer(A1_2: Pointer(float32), float32, [m, n], [stride_6: int32, stride_7: int32], type="auto")}
      buffer_map = {A0_1: A0, A1_1: A1, B_2: B, B_3: B_1} {
      for (i: int32, 0, m) {
        for (j: int32, 0, n) {
          B_5[((i*stride_2) + (j*stride_3))] = ((float32*)A0_2[((i*stride_4) + (j*stride_5))] + 2f32)
          B_4[((i*stride) + (j*stride_1))] = ((float32*)A1_2[((i*stride_6) + (j*stride_7))]*3f32)
        }
      }
    }
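
If you want to actually execute the batch operator, the following minimal
sketch (not part of the original example) compiles it with :code:`tvm.build`.
It assumes an ``llvm`` target is available in your TVM build, reuses ``A0``,
``A1``, ``B0``, ``B1`` and the schedule ``s`` defined above, and picks the
concrete shape ``(4, 5)`` arbitrarily; on older TVM releases you may need
``asnumpy()`` in place of ``numpy()``.

.. code-block:: default


    # Sketch: compile for a generic CPU target and verify both outputs with numpy.
    fbatch = tvm.build(s, [A0, A1, B0, B1], "llvm", name="batch")

    dev = tvm.cpu(0)
    a0_np = np.random.uniform(size=(4, 5)).astype(A0.dtype)
    a1_np = np.random.uniform(size=(4, 5)).astype(A1.dtype)
    a0 = tvm.nd.array(a0_np, dev)
    a1 = tvm.nd.array(a1_np, dev)
    b0 = tvm.nd.array(np.zeros((4, 5), dtype=B0.dtype), dev)
    b1 = tvm.nd.array(np.zeros((4, 5), dtype=B1.dtype), dev)

    # The compiled function takes all inputs and all outputs of the tuple op.
    fbatch(a0, a1, b0, b1)
    np.testing.assert_allclose(b0.numpy(), a0_np + 2, rtol=1e-5)
    np.testing.assert_allclose(b1.numpy(), a1_np * 3, rtol=1e-5)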

.. _reduction-with-tuple-inputs:

Describe Reduction with Collaborative Inputs
--------------------------------------------
Sometimes we require multiple inputs to express a reduction operator,
and the inputs have to collaborate with each other, e.g. :code:`argmax`.
During the reduction, :code:`argmax` needs to compare the values of the
operands and also needs to keep track of their indices. This can be
expressed with :py:func:`te.comm_reducer` as below:

.. code-block:: default


    # x and y are the operands of the reduction; each of them is a tuple of
    # (index, value).
    def fcombine(x, y):
        lhs = tvm.tir.Select((x[1] >= y[1]), x[0], y[0])
        rhs = tvm.tir.Select((x[1] >= y[1]), x[1], y[1])
        return lhs, rhs


    # Our identity element also needs to be a tuple, so `fidentity` accepts
    # two dtypes as inputs.
    def fidentity(t0, t1):
        return tvm.tir.const(-1, t0), tvm.te.min_value(t1)


    argmax = te.comm_reducer(fcombine, fidentity, name="argmax")

    # describe the reduction computation
    m = te.var("m")
    n = te.var("n")
    idx = te.placeholder((m, n), name="idx", dtype="int32")
    val = te.placeholder((m, n), name="val", dtype="int32")
    k = te.reduce_axis((0, n), "k")
    T0, T1 = te.compute((m,), lambda i: argmax((idx[i, k], val[i, k]), axis=k), name="T")

    # the generated IR code would be:
    s = te.create_schedule(T0.op)
    print(tvm.lower(s, [idx, val, T0, T1], simple_mode=True))


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    primfn(idx_1: handle, val_1: handle, T_2: handle, T_3: handle) -> ()
      attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
      buffers = {T_1: Buffer(T_4: Pointer(int32), int32, [m: int32], [stride: int32], type="auto"),
                 T: Buffer(T_5: Pointer(int32), int32, [m], [stride_1: int32], type="auto"),
                 idx: Buffer(idx_2: Pointer(int32), int32, [m, n: int32], [stride_2: int32, stride_3: int32], type="auto"),
                 val: Buffer(val_2: Pointer(int32), int32, [m, n], [stride_4: int32, stride_5: int32], type="auto")}
      buffer_map = {idx_1: idx, val_1: val, T_2: T, T_3: T_1} {
      for (i: int32, 0, m) {
        T_5[(i*stride_1)] = -1
        T_4[(i*stride)] = -2147483648
        for (k: int32, 0, n) {
          T_5[(i*stride_1)] = @tir.if_then_else(((int32*)val_2[((i*stride_4) + (k*stride_5))] <= (int32*)T_4[(i*stride)]), (int32*)T_5[(i*stride_1)], (int32*)idx_2[((i*stride_2) + (k*stride_3))], dtype=int32)
          T_4[(i*stride)] = @tir.if_then_else(((int32*)val_2[((i*stride_4) + (k*stride_5))] <= (int32*)T_4[(i*stride)]), (int32*)T_4[(i*stride)], (int32*)val_2[((i*stride_4) + (k*stride_5))], dtype=int32)
        }
      }
    }


.. note::

  For readers who are not familiar with reduction, please refer to
  :ref:`general-reduction`.
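
The argmax operator can be compiled and checked against numpy in the same
way. The sketch below is not part of the original example; it assumes an
``llvm`` target, reuses ``idx``, ``val``, ``T0``, ``T1`` and the schedule
``s`` from above, and uses an arbitrary concrete shape of ``(8, 16)`` with a
tie-free value matrix purely for illustration.

.. code-block:: default


    # Sketch: compile the argmax reduction and compare with numpy's argmax/max.
    fargmax = tvm.build(s, [idx, val, T0, T1], "llvm", name="fargmax")

    dev = tvm.cpu(0)
    rows, cols = 8, 16
    # Use a permutation so every value is unique and ties cannot occur.
    val_np = np.random.permutation(rows * cols).reshape(rows, cols).astype("int32")
    idx_np = np.tile(np.arange(cols, dtype="int32"), (rows, 1))

    idx_nd = tvm.nd.array(idx_np, dev)
    val_nd = tvm.nd.array(val_np, dev)
    t0_nd = tvm.nd.array(np.zeros(rows, dtype="int32"), dev)
    t1_nd = tvm.nd.array(np.zeros(rows, dtype="int32"), dev)

    fargmax(idx_nd, val_nd, t0_nd, t1_nd)
    np.testing.assert_equal(t0_nd.numpy(), np.argmax(val_np, axis=1))
    np.testing.assert_equal(t1_nd.numpy(), np.max(val_np, axis=1))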

Schedule Operation with Tuple Inputs
------------------------------------
It is worth mentioning that although you get multiple outputs from one
batch operation, they can only be scheduled together as a single operation.

.. code-block:: default


    n = te.var("n")
    m = te.var("m")
    A0 = te.placeholder((m, n), name="A0")
    B0, B1 = te.compute((m, n), lambda i, j: (A0[i, j] + 2, A0[i, j] * 3), name="B")
    A1 = te.placeholder((m, n), name="A1")
    C = te.compute((m, n), lambda i, j: A1[i, j] + B0[i, j], name="C")

    s = te.create_schedule(C.op)
    s[B0].compute_at(s[C], C.op.axis[0])
    # as you can see in the below generated IR code:
    print(tvm.lower(s, [A0, A1, C], simple_mode=True))


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    primfn(A0_1: handle, A1_1: handle, C_1: handle) -> ()
      attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
      buffers = {C: Buffer(C_2: Pointer(float32), float32, [m: int32, n: int32], [stride: int32, stride_1: int32], type="auto"),
                 A0: Buffer(A0_2: Pointer(float32), float32, [m, n], [stride_2: int32, stride_3: int32], type="auto"),
                 A1: Buffer(A1_2: Pointer(float32), float32, [m, n], [stride_4: int32, stride_5: int32], type="auto")}
      buffer_map = {A0_1: A0, A1_1: A1, C_1: C} {
      allocate(B.v0: Pointer(global float32), float32, [n]), storage_scope = global;
      allocate(B.v1: Pointer(global float32), float32, [n]), storage_scope = global;
      for (i: int32, 0, m) {
        for (j: int32, 0, n) {
          B.v0[j] = ((float32*)A0_2[((i*stride_2) + (j*stride_3))] + 2f32)
          B.v1[j] = ((float32*)A0_2[((i*stride_2) + (j*stride_3))]*3f32)
        }
        for (j_1: int32, 0, n) {
          C_2[((i*stride) + (j_1*stride_1))] = ((float32*)A1_2[((i*stride_4) + (j_1*stride_5))] + (float32*)B.v0[j_1])
        }
      }
    }


Summary
-------
This tutorial introduces the usage of tuple-input operations.

- Describe normal batchwise computation.
- Describe reduction operations with tuple inputs.
- Notice that you can only schedule the computation in terms of the
  operation, not its individual output tensors; the sketch after this list
  illustrates the point.
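
To make the last point concrete, here is a short sketch that is not part of
the original example. It reuses ``A0``, ``A1``, ``B0``, ``B1`` and ``C`` from
the previous section: both outputs are produced by one operation object, so a
schedule primitive applied through either tensor affects the whole operation.

.. code-block:: default


    # Sketch: B0 and B1 share a single operation, so their stages are the same.
    print(B0.op is B1.op)      # True: one op produces both outputs
    print(B0.op.num_outputs)   # 2

    # Applying compute_at through B1 has exactly the same effect as applying
    # it through B0: both B.v0 and B.v1 are moved under C's outer loop.
    s = te.create_schedule(C.op)
    s[B1].compute_at(s[C], C.op.axis[0])
    print(tvm.lower(s, [A0, A1, C], simple_mode=True))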