# Optimizing Operators with Auto-scheduling

Author: Lianmin Zheng, Chengfan Jia

In this tutorial, we will show how TVM’s Auto Scheduling feature can find optimal schedules without the need for writing a custom template.

Unlike the template-based AutoTVM, which relies on manual templates to define the search space, the auto-scheduler does not require any templates. Users only need to write the computation declaration, without any schedule commands or templates. The auto-scheduler then generates a large search space automatically and finds a good schedule in it.

We use matrix multiplication as an example in this tutorial.

Note

Note that this tutorial will not run on Windows or recent versions of macOS. To get it to run, you will need to wrap the body of this tutorial in an `if __name__ == "__main__":` block.
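A minimal sketch of that wrapping is shown here. The function name `main` and the placeholder return value are only illustrative; in practice the function body would contain the tutorial code from the sections that follow.

```python
def main():
    # The tutorial body (imports, workload definition, tuning, evaluation)
    # would go here. A placeholder value stands in for it in this sketch.
    return "tutorial body finished"


if __name__ == "__main__":
    # Runs only when the file is executed as a script, not when the module
    # is imported again by another process.
    print(main())
```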

```python
import numpy as np
import tvm
from tvm import te, auto_scheduler
```

## Defining the Matrix Multiplication

To start, we define a matrix multiplication with a bias addition. Note that this uses standard operations available in TVM's Tensor Expression language. The major difference is the use of the `register_workload` decorator at the top of the function definition. The function should return a list of input/output tensors. From these tensors, the auto-scheduler can get the whole computational graph.

```python
@auto_scheduler.register_workload  # Note the auto_scheduler decorator
def matmul_add(N, L, M, dtype):
    A = te.placeholder((N, L), name="A", dtype=dtype)
    B = te.placeholder((L, M), name="B", dtype=dtype)
    C = te.placeholder((N, M), name="C", dtype=dtype)

    k = te.reduce_axis((0, L), name="k")
    matmul = te.compute(
        (N, M),
        lambda i, j: te.sum(A[i, k] * B[k, j], axis=k),
        name="matmul",
        attrs={"layout_free_placeholders": [B]},  # enable automatic layout transform for tensor B
    )
    out = te.compute((N, M), lambda i, j: matmul[i, j] + C[i, j], name="out")

    return [A, B, C, out]
```

With the function defined, we can now create the task for the auto-scheduler to search against. We specify the particular parameters for this matrix multiplication, in this case a multiplication of two square matrices of size 1024x1024. We then create a search task with `N=L=M=1024` and `dtype="float32"`.

Improve performance with custom targets

In order for TVM to take full advantage of specific hardware platforms, you will want to manually specify your CPU capabilities. For example:

• replace `llvm` below with `llvm -mcpu=core-avx2` to enable AVX2

• replace `llvm` below with `llvm -mcpu=skylake-avx512` to enable AVX-512

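```python
target = tvm.target.Target("llvm")
N = L = M = 1024
# Create the search task for the registered workload (matmul_add, defined above)
task = auto_scheduler.SearchTask(func=matmul_add, args=(N, L, M, "float32"), target=target)

# Inspect the computational graph
print("Computational DAG:")
print(task.compute_dag)
```
```
Computational DAG: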
A = PLACEHOLDER [1024, 1024]
B = PLACEHOLDER [1024, 1024]
matmul(i, j) += (A[i, k]*B[k, j])
C = PLACEHOLDER [1024, 1024]
out(i, j) = (matmul[i, j] + C[i, j])
```

## Set Parameters for Auto-Scheduler

Next, we set parameters for the auto-scheduler.

• `num_measure_trials` is the number of measurement trials we can use during the search. We only make 10 trials in this tutorial for a fast demonstration. In practice, 1000 is a good value for the search to converge. You can do more trials according to your time budget.

• In addition, we use `RecordToFile` to log measurement records into the file `matmul.json`. The measurement records can be used to look up the best result found so far, resume the search, and do further analysis later.

• See `auto_scheduler.TuningOptions` for more parameters.

```python
log_file = "matmul.json"
tune_option = auto_scheduler.TuningOptions(
    num_measure_trials=10,
    measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
    verbose=2,
)
```
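The later sections use a schedule `sch` and its arguments `args`, which come from running the search and then re-applying the best record. A minimal sketch of that step follows, assuming `task` is the search task described earlier and `tune_option` and `log_file` are the objects defined above. The wrapper name `tune_and_apply_best` is hypothetical; `tune` and `apply_best` are the `SearchTask` methods it relies on.

```python
def tune_and_apply_best(task, tune_option, log_file):
    """Run the search, then load the best schedule recorded in log_file.

    Assumes `task` is an auto_scheduler.SearchTask and `tune_option` is an
    auto_scheduler.TuningOptions whose callbacks record into `log_file`.
    """
    # Run auto-tuning; measurement records are written to the log file
    # through the RecordToFile callback configured in tune_option.
    task.tune(tune_option)
    # Rebuild the best schedule found so far from the log file.
    sch, args = task.apply_best(log_file)
    return sch, args
```

It would be called as `sch, args = tune_and_apply_best(task, tune_option, log_file)`. With only 10 measurement trials the search finishes quickly, but the resulting schedule should not be expected to have converged.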

## Inspecting the Optimized Schedule

We can lower the schedule to see the IR after auto-scheduling. The auto-scheduler correctly performs optimizations including multi-level tiling, layout transformation, parallelization, vectorization, unrolling, and operator fusion.

```python
print("Lowered TIR:")
print(tvm.lower(sch, args, simple_mode=True))
```
```
Lowered TIR:
# from tvm.script import ir as I
# from tvm.script import tir as T

@I.ir_module
class Module:
    @T.prim_func
    def main(A: T.Buffer((1024, 1024), "float32"), B: T.Buffer((1024, 1024), "float32"), C: T.Buffer((1024, 1024), "float32"), out: T.Buffer((1024, 1024), "float32")):
        T.func_attr({"from_legacy_te_schedule": T.bool(True), "tir.noalias": T.bool(True)})
        auto_scheduler_layout_transform = T.allocate([1048576], "float32", "global")
        auto_scheduler_layout_transform_1 = T.Buffer((1048576,), data=auto_scheduler_layout_transform)
        for ax0_ax1_fused_ax2_fused in T.parallel(128):
            for ax4, ax6, ax7 in T.grid(256, 4, 8):
                B_1 = T.Buffer((1048576,), data=B.data)
                auto_scheduler_layout_transform_1[ax0_ax1_fused_ax2_fused * 8192 + ax4 * 32 + ax6 * 8 + ax7] = B_1[ax4 * 4096 + ax6 * 1024 + ax0_ax1_fused_ax2_fused * 8 + ax7]
        for i_outer_outer_j_outer_outer_fused in T.parallel(16384):
            matmul = T.allocate([4], "float32x8", "global")
            for i_outer_inner in range(2):
                matmul_1 = T.Buffer((4,), "float32x8", data=matmul)
                for k_outer, k_inner in T.grid(256, 4):
                    cse_var_2: T.int32 = i_outer_outer_j_outer_outer_fused % 128 * 8192 + k_outer * 32 + k_inner * 8
                    cse_var_1: T.int32 = i_outer_outer_j_outer_outer_fused // 128 * 8192 + i_outer_inner * 4096 + k_outer * 4 + k_inner
                    A_1 = T.Buffer((1048576,), data=A.data)
                    matmul_1[0] = matmul_1[0] + T.Broadcast(A_1[cse_var_1], 8) * auto_scheduler_layout_transform_1[cse_var_2:cse_var_2 + 8]
                    matmul_1[1] = matmul_1[1] + T.Broadcast(A_1[cse_var_1 + 1024], 8) * auto_scheduler_layout_transform_1[cse_var_2:cse_var_2 + 8]
                    matmul_1[2] = matmul_1[2] + T.Broadcast(A_1[cse_var_1 + 2048], 8) * auto_scheduler_layout_transform_1[cse_var_2:cse_var_2 + 8]
                    matmul_1[3] = matmul_1[3] + T.Broadcast(A_1[cse_var_1 + 3072], 8) * auto_scheduler_layout_transform_1[cse_var_2:cse_var_2 + 8]
                for i_inner in range(4):
                    cse_var_3: T.int32 = i_outer_outer_j_outer_outer_fused // 128 * 8192 + i_outer_inner * 4096 + i_inner * 1024 + i_outer_outer_j_outer_outer_fused % 128 * 8
                    out_1 = T.Buffer((1048576,), data=out.data)
                    C_1 = T.Buffer((1048576,), data=C.data)
                    out_1[cse_var_3:cse_var_3 + 8] = matmul_1[i_inner] + C_1[cse_var_3:cse_var_3 + 8]
```
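The index arithmetic in the lowered TIR is easier to follow on a small example. The pure-Python sketch below mirrors two of the transformations visible above: the layout transform that repacks `B` into `[j_outer][k_outer][k_inner][j_inner]` order, and the tiled loop nest that accumulates a 4-row by 8-lane tile before adding the bias. It is an illustration under simplifying assumptions, not what TVM generates: parallelization and vectorization are omitted, the two outer `i` loops are collapsed into one, the accumulator is zero-initialized explicitly, and the function names `pack_b` and `matmul_add_tiled` are hypothetical.

```python
def pack_b(b_flat, K, M, k_inner=4, j_inner=8):
    """Repack row-major B (K x M) into [j_outer][k_outer][k_inner][j_inner].

    Assumes K is divisible by k_inner and M by j_inner.
    """
    k_outer_n = K // k_inner
    packed = [0.0] * (K * M)
    for jo in range(M // j_inner):
        for ko in range(k_outer_n):
            for ki in range(k_inner):
                for ji in range(j_inner):
                    dst = ((jo * k_outer_n + ko) * k_inner + ki) * j_inner + ji
                    src = (ko * k_inner + ki) * M + jo * j_inner + ji
                    packed[dst] = b_flat[src]
    return packed


def matmul_add_tiled(a_flat, packed_b, c_flat, N, K, M, i_tile=4, k_inner=4, j_inner=8):
    """Compute A @ B + C with a tiled loop nest that reads the packed B."""
    out = [0.0] * (N * M)
    k_outer_n = K // k_inner
    for io in range(N // i_tile):
        for jo in range(M // j_inner):
            # Accumulator tile: i_tile rows, each holding j_inner lanes.
            acc = [[0.0] * j_inner for _ in range(i_tile)]
            for ko in range(k_outer_n):
                for ki in range(k_inner):
                    k = ko * k_inner + ki
                    b_base = ((jo * k_outer_n + ko) * k_inner + ki) * j_inner
                    for ii in range(i_tile):
                        a_val = a_flat[(io * i_tile + ii) * K + k]  # broadcast over lanes
                        for ji in range(j_inner):
                            acc[ii][ji] += a_val * packed_b[b_base + ji]
            # Fused epilogue: add the bias tile and write the result.
            for ii in range(i_tile):
                row = (io * i_tile + ii) * M + jo * j_inner
                for ji in range(j_inner):
                    out[row + ji] = acc[ii][ji] + c_flat[row + ji]
    return out


# Small demo: compare against a straightforward triple loop.
N, K, M = 8, 8, 16
a = [float(v % 7) for v in range(N * K)]
b = [float(v % 11) for v in range(K * M)]
c = [float(v % 5) for v in range(N * M)]
naive = [sum(a[i * K + k] * b[k * M + j] for k in range(K)) + c[i * M + j]
         for i in range(N) for j in range(M)]
print(matmul_add_tiled(a, pack_b(b, K, M), c, N, K, M) == naive)
```

For `K = M = 1024`, the destination index in `pack_b` reduces to `jo * 8192 + ko * 32 + ki * 8 + ji` and the source index to `ko * 4096 + ki * 1024 + jo * 8 + ji`, which are the expressions in the first loop nest of the lowered TIR.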

## Check correctness and evaluate performance

We build the binary and check its correctness and performance.

```python
func = tvm.build(sch, args, target)
a_np = np.random.uniform(size=(N, L)).astype(np.float32)
b_np = np.random.uniform(size=(L, M)).astype(np.float32)
c_np = np.random.uniform(size=(N, M)).astype(np.float32)
out_np = a_np.dot(b_np) + c_np

dev = tvm.cpu()
a_tvm = tvm.nd.array(a_np, device=dev)
b_tvm = tvm.nd.array(b_np, device=dev)
c_tvm = tvm.nd.array(c_np, device=dev)
out_tvm = tvm.nd.empty(out_np.shape, device=dev)
func(a_tvm, b_tvm, c_tvm, out_tvm)

# Check results
np.testing.assert_allclose(out_np, out_tvm.numpy(), rtol=1e-3)

# Evaluate execution time.
evaluator = func.time_evaluator(func.entry_name, dev, min_repeat_ms=500)
print(
    "Execution time of this operator: %.3f ms"
    % (np.median(evaluator(a_tvm, b_tvm, c_tvm, out_tvm).results) * 1000)
)
```
```
Execution time of this operator: 93.277 ms
```
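To put the measured time in context, it can be converted into an approximate arithmetic throughput. The sketch below assumes the usual operation count for this workload (one multiply and one add per reduction step, plus one add per output element for the bias). The helper name `gflops` is only illustrative, and the timing is the figure printed above, which will differ from machine to machine.

```python
def gflops(n, l, m, seconds):
    """Approximate throughput of out = A @ B + C in GFLOP/s."""
    matmul_flops = 2 * n * l * m  # one multiply and one add per (i, j, k)
    bias_flops = n * m            # one add per output element
    return (matmul_flops + bias_flops) / seconds / 1e9


# 93.277 ms is the execution time reported in the output above.
print("Approximate throughput: %.2f GFLOP/s" % gflops(1024, 1024, 1024, 93.277e-3))
```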

## Using the record file

During the search, all measurement records are logged into the record file `matmul.json`. The measurement records can be used to re-apply search results, resume the search, and perform other analyses.

Here is an example where we load the best schedule from a file and print the equivalent Python schedule API. This can be used for debugging and learning the behavior of the auto-scheduler.

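```python
print("Equivalent python schedule:")
# Load the best record from the log file and print it as Python schedule code
print(task.print_best(log_file))
```
```
Equivalent python schedule: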
matmul_i, matmul_j, matmul_k = tuple(matmul.op.axis) + tuple(matmul.op.reduce_axis)
out_i, out_j = tuple(out.op.axis) + tuple(out.op.reduce_axis)
matmul_i_o_i, matmul_i_i = s[matmul].split(matmul_i, factor=4)
matmul_i_o_o_i, matmul_i_o_i = s[matmul].split(matmul_i_o_i, factor=1)
matmul_i_o_o_o, matmul_i_o_o_i = s[matmul].split(matmul_i_o_o_i, factor=2)
matmul_j_o_i, matmul_j_i = s[matmul].split(matmul_j, factor=8)
matmul_j_o_o_i, matmul_j_o_i = s[matmul].split(matmul_j_o_i, factor=1)
matmul_j_o_o_o, matmul_j_o_o_i = s[matmul].split(matmul_j_o_o_i, factor=1)
matmul_k_o, matmul_k_i = s[matmul].split(matmul_k, factor=4)
s[matmul].reorder(matmul_i_o_o_o, matmul_j_o_o_o, matmul_i_o_o_i, matmul_j_o_o_i, matmul_k_o, matmul_i_o_i, matmul_j_o_i, matmul_k_i, matmul_i_i, matmul_j_i)
out_i_o_i, out_i_i = s[out].split(out_i, factor=4)
out_i_o_o, out_i_o_i = s[out].split(out_i_o_i, factor=2)
out_j_o_i, out_j_i = s[out].split(out_j, factor=8)
out_j_o_o, out_j_o_i = s[out].split(out_j_o_i, factor=1)
s[out].reorder(out_i_o_o, out_j_o_o, out_i_o_i, out_j_o_i, out_i_i, out_j_i)
s[matmul].compute_at(s[out], out_j_o_i)
out_i_o_o_j_o_o_fused = s[out].fuse(out_i_o_o, out_j_o_o)
s[out].parallel(out_i_o_o_j_o_o_fused)
s[matmul].pragma(matmul_i_o_o_o, "auto_unroll_max_step", 8)
s[matmul].pragma(matmul_i_o_o_o, "unroll_explicit", True)
s[matmul].vectorize(matmul_j_i)
s[out].vectorize(out_j_i)
```
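The chained `split` calls in the printed schedule can be read as a decomposition of each axis into nested loops. The sketch below is a small pure-Python helper (the name `split_extents` is hypothetical) that computes the resulting loop extents, assuming each `split(axis, factor=f)` produces an inner loop of extent `f` and an outer loop of extent `ceil(extent / f)`, and that each later split is applied to the previous outer loop, as in the output above.

```python
def split_extents(extent, factors):
    """Loop extents after repeatedly splitting the outer loop, outermost first."""
    inner = []
    outer = extent
    for f in factors:
        inner.append(f)
        outer = -(-outer // f)  # ceiling division
    return [outer] + inner[::-1]


i_extents = split_extents(1024, [4, 1, 2])  # matmul_i: factors 4, then 1, then 2
j_extents = split_extents(1024, [8, 1, 1])  # matmul_j: factors 8, then 1, then 1
k_extents = split_extents(1024, [4])        # matmul_k: factor 4
print(i_extents, j_extents, k_extents)
print("fused outer loop extent:", i_extents[0] * j_extents[0])
```

Under these assumptions the outermost `i` and `j` loops each have extent 128, so fusing them gives 128 * 128 = 16384 iterations, which matches the `T.parallel(16384)` loop in the lowered TIR, and the `k` split gives the `T.grid(256, 4)` reduction loops.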

A more complicated example is resuming the search. In this case, we need to create the search policy and cost model ourselves and restore their state from the log file. In the example below, we restore the state and run 5 more trials.

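```python
def resume_search(task, log_file):
    print("Resume search:")
    # Rebuild the cost model from the existing measurement records
    cost_model = auto_scheduler.XGBModel()
    cost_model.update_from_file(log_file)
    # Restore the search policy state from the same log file
    search_policy = auto_scheduler.SketchPolicy(
        task, cost_model, init_search_callbacks=[auto_scheduler.PreloadMeasuredStates(log_file)]
    )
    tune_option = auto_scheduler.TuningOptions(
        num_measure_trials=5, measure_callbacks=[auto_scheduler.RecordToFile(log_file)]
    )
    task.tune(tune_option, search_policy=search_policy)


resume_search(task, log_file)
```

```
Resume search: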