.. rst-class:: sphx-glr-example-title

.. _sphx_glr_how_to_work_with_schedules_tuple_inputs.py:


Compute and Reduce with Tuple Inputs
====================================
**Author**: `Ziheng Jiang `_

Often we want to compute multiple outputs with the same shape within
a single loop, or perform a reduction that involves multiple values,
such as :code:`argmax`. These problems can be addressed with tuple inputs.

In this tutorial, we will introduce the usage of tuple inputs in TVM.

.. code-block:: default


    from __future__ import absolute_import, print_function

    import tvm
    from tvm import te
    import numpy as np


Describe Batchwise Computation
------------------------------
For operators whose outputs have the same shape, we can put them together
as the inputs of :any:`te.compute` if we want them to be scheduled
together in the subsequent schedule procedure.

.. code-block:: default


    n = te.var("n")
    m = te.var("m")
    A0 = te.placeholder((m, n), name="A0")
    A1 = te.placeholder((m, n), name="A1")
    B0, B1 = te.compute((m, n), lambda i, j: (A0[i, j] + 2, A1[i, j] * 3), name="B")

    # The generated IR code would be:
    s = te.create_schedule(B0.op)
    print(tvm.lower(s, [A0, A1, B0, B1], simple_mode=True))


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    primfn(A0_1: handle, A1_1: handle, B_2: handle, B_3: handle) -> ()
      attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
      buffers = {B_1: Buffer(B_4: Pointer(float32), float32, [m: int32, n: int32], [stride: int32, stride_1: int32], type="auto"),
                 B: Buffer(B_5: Pointer(float32), float32, [m, n], [stride_2: int32, stride_3: int32], type="auto"),
                 A0: Buffer(A0_2: Pointer(float32), float32, [m, n], [stride_4: int32, stride_5: int32], type="auto"),
                 A1: Buffer(A1_2: Pointer(float32), float32, [m, n], [stride_6: int32, stride_7: int32], type="auto")}
      buffer_map = {A0_1: A0, A1_1: A1, B_2: B, B_3: B_1} {
      for (i: int32, 0, m) {
        for (j: int32, 0, n) {
          B_5[((i*stride_2) + (j*stride_3))] = ((float32*)A0_2[((i*stride_4) + (j*stride_5))] + 2f32)
          B_4[((i*stride) + (j*stride_1))] = ((float32*)A1_2[((i*stride_6) + (j*stride_7))]*3f32)
        }
      }
    }
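
If you want to actually execute the batch operator, the following minimal
sketch (not part of the original example) compiles it with :code:`tvm.build`.
It assumes an ``llvm`` target is available in your TVM build, reuses ``A0``,
``A1``, ``B0``, ``B1`` and the schedule ``s`` defined above, and picks the
concrete shape ``(4, 5)`` arbitrarily; on older TVM releases you may need
``asnumpy()`` in place of ``numpy()``.

.. code-block:: default


    # Sketch: compile for a generic CPU target and verify both outputs with numpy.
    fbatch = tvm.build(s, [A0, A1, B0, B1], "llvm", name="batch")

    dev = tvm.cpu(0)
    a0_np = np.random.uniform(size=(4, 5)).astype(A0.dtype)
    a1_np = np.random.uniform(size=(4, 5)).astype(A1.dtype)
    a0 = tvm.nd.array(a0_np, dev)
    a1 = tvm.nd.array(a1_np, dev)
    b0 = tvm.nd.array(np.zeros((4, 5), dtype=B0.dtype), dev)
    b1 = tvm.nd.array(np.zeros((4, 5), dtype=B1.dtype), dev)

    # The compiled function takes all inputs and all outputs of the tuple op.
    fbatch(a0, a1, b0, b1)
    np.testing.assert_allclose(b0.numpy(), a0_np + 2, rtol=1e-5)
    np.testing.assert_allclose(b1.numpy(), a1_np * 3, rtol=1e-5)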

.. _reduction-with-tuple-inputs:

Describe Reduction with Collaborative Inputs
--------------------------------------------
Sometimes we require multiple inputs to express a reduction operator,
and the inputs have to collaborate with each other, e.g. :code:`argmax`.
During the reduction, :code:`argmax` needs to compare the values of the
operands and also needs to keep track of their indices. This can be
expressed with :py:func:`te.comm_reducer` as below:

.. code-block:: default


    # x and y are the operands of the reduction; each of them is a tuple of
    # (index, value).
    def fcombine(x, y):
        lhs = tvm.tir.Select((x[1] >= y[1]), x[0], y[0])
        rhs = tvm.tir.Select((x[1] >= y[1]), x[1], y[1])
        return lhs, rhs


    # Our identity element also needs to be a tuple, so `fidentity` accepts
    # two dtypes as inputs.
    def fidentity(t0, t1):
        return tvm.tir.const(-1, t0), tvm.te.min_value(t1)


    argmax = te.comm_reducer(fcombine, fidentity, name="argmax")

    # describe the reduction computation
    m = te.var("m")
    n = te.var("n")
    idx = te.placeholder((m, n), name="idx", dtype="int32")
    val = te.placeholder((m, n), name="val", dtype="int32")
    k = te.reduce_axis((0, n), "k")
    T0, T1 = te.compute((m,), lambda i: argmax((idx[i, k], val[i, k]), axis=k), name="T")

    # the generated IR code would be:
    s = te.create_schedule(T0.op)
    print(tvm.lower(s, [idx, val, T0, T1], simple_mode=True))


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    primfn(idx_1: handle, val_1: handle, T_2: handle, T_3: handle) -> ()
      attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
      buffers = {T_1: Buffer(T_4: Pointer(int32), int32, [m: int32], [stride: int32], type="auto"),
                 T: Buffer(T_5: Pointer(int32), int32, [m], [stride_1: int32], type="auto"),
                 idx: Buffer(idx_2: Pointer(int32), int32, [m, n: int32], [stride_2: int32, stride_3: int32], type="auto"),
                 val: Buffer(val_2: Pointer(int32), int32, [m, n], [stride_4: int32, stride_5: int32], type="auto")}
      buffer_map = {idx_1: idx, val_1: val, T_2: T, T_3: T_1} {
      for (i: int32, 0, m) {
        T_5[(i*stride_1)] = -1
        T_4[(i*stride)] = -2147483648
        for (k: int32, 0, n) {
          T_5[(i*stride_1)] = @tir.if_then_else(((int32*)val_2[((i*stride_4) + (k*stride_5))] <= (int32*)T_4[(i*stride)]), (int32*)T_5[(i*stride_1)], (int32*)idx_2[((i*stride_2) + (k*stride_3))], dtype=int32)
          T_4[(i*stride)] = @tir.if_then_else(((int32*)val_2[((i*stride_4) + (k*stride_5))] <= (int32*)T_4[(i*stride)]), (int32*)T_4[(i*stride)], (int32*)val_2[((i*stride_4) + (k*stride_5))], dtype=int32)
        }
      }
    }


.. note::

  For readers who are not familiar with reduction, please refer to
  :ref:`general-reduction`.
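
The argmax operator can be compiled and checked against numpy in the same
way. The sketch below is not part of the original example; it assumes an
``llvm`` target, reuses ``idx``, ``val``, ``T0``, ``T1`` and the schedule
``s`` from above, and uses an arbitrary concrete shape of ``(8, 16)`` with a
tie-free value matrix purely for illustration.

.. code-block:: default


    # Sketch: compile the argmax reduction and compare with numpy's argmax/max.
    fargmax = tvm.build(s, [idx, val, T0, T1], "llvm", name="fargmax")

    dev = tvm.cpu(0)
    rows, cols = 8, 16
    # Use a permutation so every value is unique and ties cannot occur.
    val_np = np.random.permutation(rows * cols).reshape(rows, cols).astype("int32")
    idx_np = np.tile(np.arange(cols, dtype="int32"), (rows, 1))

    idx_nd = tvm.nd.array(idx_np, dev)
    val_nd = tvm.nd.array(val_np, dev)
    t0_nd = tvm.nd.array(np.zeros(rows, dtype="int32"), dev)
    t1_nd = tvm.nd.array(np.zeros(rows, dtype="int32"), dev)

    fargmax(idx_nd, val_nd, t0_nd, t1_nd)
    np.testing.assert_equal(t0_nd.numpy(), np.argmax(val_np, axis=1))
    np.testing.assert_equal(t1_nd.numpy(), np.max(val_np, axis=1))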

Schedule Operation with Tuple Inputs
------------------------------------
It is worth mentioning that although you get multiple outputs from one
batch operation, they can only be scheduled together as a single operation.

.. code-block:: default


    n = te.var("n")
    m = te.var("m")
    A0 = te.placeholder((m, n), name="A0")
    B0, B1 = te.compute((m, n), lambda i, j: (A0[i, j] + 2, A0[i, j] * 3), name="B")
    A1 = te.placeholder((m, n), name="A1")
    C = te.compute((m, n), lambda i, j: A1[i, j] + B0[i, j], name="C")

    s = te.create_schedule(C.op)
    s[B0].compute_at(s[C], C.op.axis[0])
    # as you can see in the below generated IR code:
    print(tvm.lower(s, [A0, A1, C], simple_mode=True))


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    primfn(A0_1: handle, A1_1: handle, C_1: handle) -> ()
      attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
      buffers = {C: Buffer(C_2: Pointer(float32), float32, [m: int32, n: int32], [stride: int32, stride_1: int32], type="auto"),
                 A0: Buffer(A0_2: Pointer(float32), float32, [m, n], [stride_2: int32, stride_3: int32], type="auto"),
                 A1: Buffer(A1_2: Pointer(float32), float32, [m, n], [stride_4: int32, stride_5: int32], type="auto")}
      buffer_map = {A0_1: A0, A1_1: A1, C_1: C} {
      allocate(B.v0: Pointer(global float32), float32, [n]), storage_scope = global;
      allocate(B.v1: Pointer(global float32), float32, [n]), storage_scope = global;
      for (i: int32, 0, m) {
        for (j: int32, 0, n) {
          B.v0[j] = ((float32*)A0_2[((i*stride_2) + (j*stride_3))] + 2f32)
          B.v1[j] = ((float32*)A0_2[((i*stride_2) + (j*stride_3))]*3f32)
        }
        for (j_1: int32, 0, n) {
          C_2[((i*stride) + (j_1*stride_1))] = ((float32*)A1_2[((i*stride_4) + (j_1*stride_5))] + (float32*)B.v0[j_1])
        }
      }
    }


Summary
-------
This tutorial introduces the usage of tuple-input operations.

- Describe normal batchwise computation.
- Describe reduction operations with tuple inputs.
- Notice that you can only schedule the computation in terms of the
  operation, not its individual output tensors; the sketch after this list
  illustrates the point.
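
To make the last point concrete, here is a short sketch that is not part of
the original example. It reuses ``A0``, ``A1``, ``B0``, ``B1`` and ``C`` from
the previous section: both outputs are produced by one operation object, so a
schedule primitive applied through either tensor affects the whole operation.

.. code-block:: default


    # Sketch: B0 and B1 share a single operation, so their stages are the same.
    print(B0.op is B1.op)      # True: one op produces both outputs
    print(B0.op.num_outputs)   # 2

    # Applying compute_at through B1 has exactly the same effect as applying
    # it through B0: both B.v0 and B.v1 are moved under C's outer loop.
    s = te.create_schedule(C.op)
    s[B1].compute_at(s[C], C.op.axis[0])
    print(tvm.lower(s, [A0, A1, C], simple_mode=True))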