Note
This tutorial can be used interactively with Google Colab! You can also click here to run the Jupyter notebook locally.
Auto-scheduling a Convolution Layer for GPU¶
Author: Lianmin Zheng, Chengfan Jia
This is a tutorial on how to use the auto-scheduler for GPUs.
Different from the template-based autotvm which relies on manual templates to define the search space, the auto-scheduler does not require any templates. Users only need to write the computation declaration without any schedule commands or templates. The auto-scheduler can automatically generate a large search space and find a good schedule in the space.
We use a convolution layer as an example in this tutorial.
Note that this tutorial will not run on Windows or recent versions of macOS. To
get it to run, you will need to wrap the body of this tutorial in a if
__name__ == "__main__":
block.
import os
import numpy as np
import tvm
from tvm import te, auto_scheduler, topi
from tvm.topi.testing import conv2d_nchw_python
Define the computation¶
To begin with, let us define the computation of a convolution layer. The function should return the list of input/output tensors. From these tensors, the auto-scheduler can get the whole computational graph.
@auto_scheduler.register_workload
def conv2d_layer(N, H, W, CO, CI, KH, KW, stride, padding):
data = te.placeholder((N, CI, H, W), name="data")
kernel = te.placeholder((CO, CI, KH, KW), name="kernel")
bias = te.placeholder((1, CO, 1, 1), name="bias")
conv = topi.nn.conv2d_nchw(data, kernel, stride, padding, dilation=1, out_dtype="float32")
out = topi.nn.relu(conv + bias)
return [data, kernel, bias, out]
Create the search task¶
We then create a search task for the last convolution layer in the resnet.
target = tvm.target.Target("cuda")
# Use the last layer in ResNet-50
N, H, W, CO, CI, KH, KW, strides, padding = 1, 7, 7, 512, 512, 3, 3, (1, 1), (1, 1)
task = auto_scheduler.SearchTask(
func=conv2d_layer, args=(N, H, W, CO, CI, KH, KW, strides, padding), target=target
)
# Inspect the computational graph
print("Computational DAG:")
print(task.compute_dag)
Computational DAG:
data = PLACEHOLDER [1, 512, 7, 7]
pad_temp(i0, i1, i2, i3) = tir.if_then_else(((((i2 >= 1) && (i2 < 8)) && (i3 >= 1)) && (i3 < 8)), data[i0, i1, (i2 - 1), (i3 - 1)], 0f)
kernel = PLACEHOLDER [512, 512, 3, 3]
conv2d_nchw(nn, ff, yy, xx) += (pad_temp[nn, rc, (yy + ry), (xx + rx)]*kernel[ff, rc, ry, rx])
bias = PLACEHOLDER [1, 512, 1, 1]
T_add(ax0, ax1, ax2, ax3) = (conv2d_nchw[ax0, ax1, ax2, ax3] + bias[ax0, ax1, 0, 0])
compute(i0, i1, i2, i3) = max(T_add[i0, i1, i2, i3], 0f)
Next, we set parameters for the auto-scheduler. These parameters mainly specify how we do the measurement during the search.
measure_ctx
launches a different process for measurement to provide isolation. It can protect the main process from GPU crashes during measurement and avoid other runtime conflicts.min_repeat_ms
defines the minimum duration of one “repeat” in every measurement. This can warmup the GPU, which is necessary to get accurate measurement results. Typically, we recommend a value >= 300 ms.num_measure_trials
is the number of measurement trials we can use during the search. We only make 10 trials in this tutorial for a fast demonstration. In practice, 1000 is a good value for the search to converge. You can do more trials according to your time budget.In addition, we use
RecordToFile
to dump measurement records into a file conv2d.json. The measurement records can be used to query the history best, resume the search, and do more analyses later.see
auto_scheduler.TuningOptions
,auto_scheduler.LocalRPCMeasureContext
for more parameters.
log_file = "conv2d.json"
measure_ctx = auto_scheduler.LocalRPCMeasureContext(min_repeat_ms=300)
tune_option = auto_scheduler.TuningOptions(
num_measure_trials=10, # change this to 1000 to achieve the best performance
runner=measure_ctx.runner,
measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
verbose=2,
)
Get devices for measurement successfully!
Run the search¶
Now we get all inputs ready. Pretty simple, isn’t it? We can kick off the search and let the auto-scheduler do its magic. After some measurement trials, we can load the best schedule from the log file and apply it.
# Run auto-tuning (search)
task.tune(tune_option)
# Apply the best schedule
sch, args = task.apply_best(log_file)
# Kill the measurement process
del measure_ctx
We can lower the schedule to see the IR after auto-scheduling. The auto-scheduler correctly performs optimizations including multi-level tiling, cooperative fetching, unrolling and operator fusion.
Lowered TIR:
# from tvm.script import ir as I
# from tvm.script import tir as T
@I.ir_module
class Module:
@T.prim_func
def main(data: T.Buffer((1, 512, 7, 7), "float32"), kernel: T.Buffer((512, 512, 3, 3), "float32"), bias: T.Buffer((1, 512, 1, 1), "float32"), compute: T.Buffer((1, 512, 7, 7), "float32")):
T.func_attr({"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True})
blockIdx_x = T.env_thread("blockIdx.x")
T.launch_thread(blockIdx_x, 128)
conv2d_nchw = T.allocate([14], "float32", "local")
pad_temp_shared = T.allocate([162], "float32", "shared")
kernel_shared = T.allocate([72], "float32", "shared")
threadIdx_x = T.env_thread("threadIdx.x")
T.launch_thread(threadIdx_x, 14)
conv2d_nchw_1 = T.Buffer((14,), data=conv2d_nchw, scope="local", align=32)
conv2d_nchw_1[0] = T.float32(0)
conv2d_nchw_1[1] = T.float32(0)
conv2d_nchw_1[2] = T.float32(0)
conv2d_nchw_1[3] = T.float32(0)
conv2d_nchw_1[4] = T.float32(0)
conv2d_nchw_1[5] = T.float32(0)
conv2d_nchw_1[6] = T.float32(0)
conv2d_nchw_1[7] = T.float32(0)
conv2d_nchw_1[8] = T.float32(0)
conv2d_nchw_1[9] = T.float32(0)
conv2d_nchw_1[10] = T.float32(0)
conv2d_nchw_1[11] = T.float32(0)
conv2d_nchw_1[12] = T.float32(0)
conv2d_nchw_1[13] = T.float32(0)
for rc_outer_outer in range(256):
cse_var_1: T.int32 = rc_outer_outer * 98
threadIdx_x_1 = T.env_thread("threadIdx.x")
pad_temp_shared_1 = T.Buffer((162,), data=pad_temp_shared, scope="shared")
data_1 = T.Buffer((25088,), data=data.data)
with T.launch_thread(threadIdx_x_1, 14):
pad_temp_shared_1[threadIdx_x_1] = T.if_then_else(9 <= threadIdx_x_1 and 1 <= threadIdx_x_1 % 9 and threadIdx_x_1 % 9 < 8, data_1[cse_var_1 + threadIdx_x_1 // 9 * 7 + threadIdx_x_1 % 9 - 8], T.float32(0))
with T.launch_thread(threadIdx_x_1, 14):
pad_temp_shared_1[threadIdx_x_1 + 14] = T.if_then_else(1 <= (threadIdx_x_1 + 5) % 9 and (threadIdx_x_1 + 5) % 9 < 8, data_1[cse_var_1 + (threadIdx_x_1 + 14) // 9 * 7 + (threadIdx_x_1 + 5) % 9 - 8], T.float32(0))
with T.launch_thread(threadIdx_x_1, 14):
pad_temp_shared_1[threadIdx_x_1 + 28] = T.if_then_else(1 <= (threadIdx_x_1 + 1) % 9 and (threadIdx_x_1 + 1) % 9 < 8, data_1[cse_var_1 + (threadIdx_x_1 + 28) // 9 * 7 + (threadIdx_x_1 + 1) % 9 - 8], T.float32(0))
with T.launch_thread(threadIdx_x_1, 14):
pad_temp_shared_1[threadIdx_x_1 + 42] = T.if_then_else(1 <= (threadIdx_x_1 + 6) % 9 and (threadIdx_x_1 + 6) % 9 < 8, data_1[cse_var_1 + (threadIdx_x_1 + 42) // 9 * 7 + (threadIdx_x_1 + 6) % 9 - 8], T.float32(0))
with T.launch_thread(threadIdx_x_1, 14):
pad_temp_shared_1[threadIdx_x_1 + 56] = T.if_then_else(1 <= (threadIdx_x_1 + 2) % 9 and (threadIdx_x_1 + 2) % 9 < 8, data_1[cse_var_1 + (threadIdx_x_1 + 56) // 9 * 7 + (threadIdx_x_1 + 2) % 9 - 8], T.float32(0))
with T.launch_thread(threadIdx_x_1, 14):
pad_temp_shared_1[threadIdx_x_1 + 70] = T.if_then_else(9 <= (threadIdx_x_1 + 70) % 81 and (threadIdx_x_1 + 70) % 81 < 72 and 1 <= (threadIdx_x_1 + 7) % 9 and (threadIdx_x_1 + 7) % 9 < 8, data_1[cse_var_1 + (threadIdx_x_1 + 70) // 81 * 49 + (threadIdx_x_1 + 70) % 81 // 9 * 7 + (threadIdx_x_1 + 7) % 9 - 8], T.float32(0))
with T.launch_thread(threadIdx_x_1, 14):
pad_temp_shared_1[threadIdx_x_1 + 84] = T.if_then_else(6 <= threadIdx_x_1 and 1 <= (threadIdx_x_1 + 3) % 9 and (threadIdx_x_1 + 3) % 9 < 8, data_1[cse_var_1 + (threadIdx_x_1 + 84) // 81 * 49 + (threadIdx_x_1 + 3) // 9 * 7 + (threadIdx_x_1 + 3) % 9 - 8], T.float32(0))
with T.launch_thread(threadIdx_x_1, 14):
pad_temp_shared_1[threadIdx_x_1 + 98] = T.if_then_else(1 <= (threadIdx_x_1 + 8) % 9 and (threadIdx_x_1 + 8) % 9 < 8, data_1[cse_var_1 + (threadIdx_x_1 + 98) // 81 * 49 + (threadIdx_x_1 + 17) // 9 * 7 + (threadIdx_x_1 + 8) % 9 - 8], T.float32(0))
with T.launch_thread(threadIdx_x_1, 14):
pad_temp_shared_1[threadIdx_x_1 + 112] = T.if_then_else(1 <= (threadIdx_x_1 + 4) % 9 and (threadIdx_x_1 + 4) % 9 < 8, data_1[cse_var_1 + (threadIdx_x_1 + 112) // 81 * 49 + (threadIdx_x_1 + 31) // 9 * 7 + (threadIdx_x_1 + 4) % 9 - 8], T.float32(0))
with T.launch_thread(threadIdx_x_1, 14):
pad_temp_shared_1[threadIdx_x_1 + 126] = T.if_then_else(1 <= threadIdx_x_1 % 9 and threadIdx_x_1 % 9 < 8, data_1[cse_var_1 + (threadIdx_x_1 + 126) // 81 * 49 + threadIdx_x_1 // 9 * 7 + threadIdx_x_1 % 9 + 27], T.float32(0))
with T.launch_thread(threadIdx_x_1, 14):
pad_temp_shared_1[threadIdx_x_1 + 140] = T.if_then_else(threadIdx_x_1 < 13 and 1 <= (threadIdx_x_1 + 5) % 9 and (threadIdx_x_1 + 5) % 9 < 8, data_1[cse_var_1 + (threadIdx_x_1 + 140) // 81 * 49 + (threadIdx_x_1 + 59) // 9 * 7 + (threadIdx_x_1 + 5) % 9 - 8], T.float32(0))
with T.launch_thread(threadIdx_x_1, 14):
if T.likely(threadIdx_x_1 < 8):
pad_temp_shared_1[threadIdx_x_1 + 154] = T.float32(0)
threadIdx_x_2 = T.env_thread("threadIdx.x")
kernel_shared_1 = T.Buffer((72,), data=kernel_shared, scope="shared")
with T.launch_thread(threadIdx_x_2, 14):
kernel_1 = T.Buffer((2359296,), data=kernel.data)
if T.likely(threadIdx_x_2 < 12):
kernel_shared_1[threadIdx_x_2 * 6] = kernel_1[blockIdx_x * 18432 + threadIdx_x_2 // 3 * 4608 + rc_outer_outer * 18 + threadIdx_x_2 % 3 * 6]
if T.likely(threadIdx_x_2 < 12):
kernel_shared_1[threadIdx_x_2 * 6 + 1] = kernel_1[blockIdx_x * 18432 + threadIdx_x_2 // 3 * 4608 + rc_outer_outer * 18 + threadIdx_x_2 % 3 * 6 + 1]
if T.likely(threadIdx_x_2 < 12):
kernel_shared_1[threadIdx_x_2 * 6 + 2] = kernel_1[blockIdx_x * 18432 + threadIdx_x_2 // 3 * 4608 + rc_outer_outer * 18 + threadIdx_x_2 % 3 * 6 + 2]
if T.likely(threadIdx_x_2 < 12):
kernel_shared_1[threadIdx_x_2 * 6 + 3] = kernel_1[blockIdx_x * 18432 + threadIdx_x_2 // 3 * 4608 + rc_outer_outer * 18 + threadIdx_x_2 % 3 * 6 + 3]
if T.likely(threadIdx_x_2 < 12):
kernel_shared_1[threadIdx_x_2 * 6 + 4] = kernel_1[blockIdx_x * 18432 + threadIdx_x_2 // 3 * 4608 + rc_outer_outer * 18 + threadIdx_x_2 % 3 * 6 + 4]
if T.likely(threadIdx_x_2 < 12):
kernel_shared_1[threadIdx_x_2 * 6 + 5] = kernel_1[blockIdx_x * 18432 + threadIdx_x_2 // 3 * 4608 + rc_outer_outer * 18 + threadIdx_x_2 % 3 * 6 + 5]
conv2d_nchw_1[0] = conv2d_nchw_1[0] + pad_temp_shared_1[threadIdx_x % 7] * kernel_shared_1[threadIdx_x // 7 * 36]
conv2d_nchw_1[1] = conv2d_nchw_1[1] + pad_temp_shared_1[threadIdx_x % 7 + 9] * kernel_shared_1[threadIdx_x // 7 * 36]
conv2d_nchw_1[2] = conv2d_nchw_1[2] + pad_temp_shared_1[threadIdx_x % 7 + 18] * kernel_shared_1[threadIdx_x // 7 * 36]
conv2d_nchw_1[3] = conv2d_nchw_1[3] + pad_temp_shared_1[threadIdx_x % 7 + 27] * kernel_shared_1[threadIdx_x // 7 * 36]
conv2d_nchw_1[4] = conv2d_nchw_1[4] + pad_temp_shared_1[threadIdx_x % 7 + 36] * kernel_shared_1[threadIdx_x // 7 * 36]
conv2d_nchw_1[5] = conv2d_nchw_1[5] + pad_temp_shared_1[threadIdx_x % 7 + 45] * kernel_shared_1[threadIdx_x // 7 * 36]
conv2d_nchw_1[6] = conv2d_nchw_1[6] + pad_temp_shared_1[threadIdx_x % 7 + 54] * kernel_shared_1[threadIdx_x // 7 * 36]
conv2d_nchw_1[7] = conv2d_nchw_1[7] + pad_temp_shared_1[threadIdx_x % 7] * kernel_shared_1[threadIdx_x // 7 * 36 + 18]
conv2d_nchw_1[8] = conv2d_nchw_1[8] + pad_temp_shared_1[threadIdx_x % 7 + 9] * kernel_shared_1[threadIdx_x // 7 * 36 + 18]
conv2d_nchw_1[9] = conv2d_nchw_1[9] + pad_temp_shared_1[threadIdx_x % 7 + 18] * kernel_shared_1[threadIdx_x // 7 * 36 + 18]
conv2d_nchw_1[10] = conv2d_nchw_1[10] + pad_temp_shared_1[threadIdx_x % 7 + 27] * kernel_shared_1[threadIdx_x // 7 * 36 + 18]
conv2d_nchw_1[11] = conv2d_nchw_1[11] + pad_temp_shared_1[threadIdx_x % 7 + 36] * kernel_shared_1[threadIdx_x // 7 * 36 + 18]
conv2d_nchw_1[12] = conv2d_nchw_1[12] + pad_temp_shared_1[threadIdx_x % 7 + 45] * kernel_shared_1[threadIdx_x // 7 * 36 + 18]
conv2d_nchw_1[13] = conv2d_nchw_1[13] + pad_temp_shared_1[threadIdx_x % 7 + 54] * kernel_shared_1[threadIdx_x // 7 * 36 + 18]
conv2d_nchw_1[0] = conv2d_nchw_1[0] + pad_temp_shared_1[threadIdx_x % 7 + 9] * kernel_shared_1[threadIdx_x // 7 * 36 + 3]
conv2d_nchw_1[1] = conv2d_nchw_1[1] + pad_temp_shared_1[threadIdx_x % 7 + 18] * kernel_shared_1[threadIdx_x // 7 * 36 + 3]
conv2d_nchw_1[2] = conv2d_nchw_1[2] + pad_temp_shared_1[threadIdx_x % 7 + 27] * kernel_shared_1[threadIdx_x // 7 * 36 + 3]
conv2d_nchw_1[3] = conv2d_nchw_1[3] + pad_temp_shared_1[threadIdx_x % 7 + 36] * kernel_shared_1[threadIdx_x // 7 * 36 + 3]
conv2d_nchw_1[4] = conv2d_nchw_1[4] + pad_temp_shared_1[threadIdx_x % 7 + 45] * kernel_shared_1[threadIdx_x // 7 * 36 + 3]
conv2d_nchw_1[5] = conv2d_nchw_1[5] + pad_temp_shared_1[threadIdx_x % 7 + 54] * kernel_shared_1[threadIdx_x // 7 * 36 + 3]
conv2d_nchw_1[6] = conv2d_nchw_1[6] + pad_temp_shared_1[threadIdx_x % 7 + 63] * kernel_shared_1[threadIdx_x // 7 * 36 + 3]
conv2d_nchw_1[7] = conv2d_nchw_1[7] + pad_temp_shared_1[threadIdx_x % 7 + 9] * kernel_shared_1[threadIdx_x // 7 * 36 + 21]
conv2d_nchw_1[8] = conv2d_nchw_1[8] + pad_temp_shared_1[threadIdx_x % 7 + 18] * kernel_shared_1[threadIdx_x // 7 * 36 + 21]
conv2d_nchw_1[9] = conv2d_nchw_1[9] + pad_temp_shared_1[threadIdx_x % 7 + 27] * kernel_shared_1[threadIdx_x // 7 * 36 + 21]
conv2d_nchw_1[10] = conv2d_nchw_1[10] + pad_temp_shared_1[threadIdx_x % 7 + 36] * kernel_shared_1[threadIdx_x // 7 * 36 + 21]
conv2d_nchw_1[11] = conv2d_nchw_1[11] + pad_temp_shared_1[threadIdx_x % 7 + 45] * kernel_shared_1[threadIdx_x // 7 * 36 + 21]
conv2d_nchw_1[12] = conv2d_nchw_1[12] + pad_temp_shared_1[threadIdx_x % 7 + 54] * kernel_shared_1[threadIdx_x // 7 * 36 + 21]
conv2d_nchw_1[13] = conv2d_nchw_1[13] + pad_temp_shared_1[threadIdx_x % 7 + 63] * kernel_shared_1[threadIdx_x // 7 * 36 + 21]
conv2d_nchw_1[0] = conv2d_nchw_1[0] + pad_temp_shared_1[threadIdx_x % 7 + 18] * kernel_shared_1[threadIdx_x // 7 * 36 + 6]
conv2d_nchw_1[1] = conv2d_nchw_1[1] + pad_temp_shared_1[threadIdx_x % 7 + 27] * kernel_shared_1[threadIdx_x // 7 * 36 + 6]
conv2d_nchw_1[2] = conv2d_nchw_1[2] + pad_temp_shared_1[threadIdx_x % 7 + 36] * kernel_shared_1[threadIdx_x // 7 * 36 + 6]
conv2d_nchw_1[3] = conv2d_nchw_1[3] + pad_temp_shared_1[threadIdx_x % 7 + 45] * kernel_shared_1[threadIdx_x // 7 * 36 + 6]
conv2d_nchw_1[4] = conv2d_nchw_1[4] + pad_temp_shared_1[threadIdx_x % 7 + 54] * kernel_shared_1[threadIdx_x // 7 * 36 + 6]
conv2d_nchw_1[5] = conv2d_nchw_1[5] + pad_temp_shared_1[threadIdx_x % 7 + 63] * kernel_shared_1[threadIdx_x // 7 * 36 + 6]
conv2d_nchw_1[6] = conv2d_nchw_1[6] + pad_temp_shared_1[threadIdx_x % 7 + 72] * kernel_shared_1[threadIdx_x // 7 * 36 + 6]
conv2d_nchw_1[7] = conv2d_nchw_1[7] + pad_temp_shared_1[threadIdx_x % 7 + 18] * kernel_shared_1[threadIdx_x // 7 * 36 + 24]
conv2d_nchw_1[8] = conv2d_nchw_1[8] + pad_temp_shared_1[threadIdx_x % 7 + 27] * kernel_shared_1[threadIdx_x // 7 * 36 + 24]
conv2d_nchw_1[9] = conv2d_nchw_1[9] + pad_temp_shared_1[threadIdx_x % 7 + 36] * kernel_shared_1[threadIdx_x // 7 * 36 + 24]
conv2d_nchw_1[10] = conv2d_nchw_1[10] + pad_temp_shared_1[threadIdx_x % 7 + 45] * kernel_shared_1[threadIdx_x // 7 * 36 + 24]
conv2d_nchw_1[11] = conv2d_nchw_1[11] + pad_temp_shared_1[threadIdx_x % 7 + 54] * kernel_shared_1[threadIdx_x // 7 * 36 + 24]
conv2d_nchw_1[12] = conv2d_nchw_1[12] + pad_temp_shared_1[threadIdx_x % 7 + 63] * kernel_shared_1[threadIdx_x // 7 * 36 + 24]
conv2d_nchw_1[13] = conv2d_nchw_1[13] + pad_temp_shared_1[threadIdx_x % 7 + 72] * kernel_shared_1[threadIdx_x // 7 * 36 + 24]
conv2d_nchw_1[0] = conv2d_nchw_1[0] + pad_temp_shared_1[threadIdx_x % 7 + 81] * kernel_shared_1[threadIdx_x // 7 * 36 + 9]
conv2d_nchw_1[1] = conv2d_nchw_1[1] + pad_temp_shared_1[threadIdx_x % 7 + 90] * kernel_shared_1[threadIdx_x // 7 * 36 + 9]
conv2d_nchw_1[2] = conv2d_nchw_1[2] + pad_temp_shared_1[threadIdx_x % 7 + 99] * kernel_shared_1[threadIdx_x // 7 * 36 + 9]
conv2d_nchw_1[3] = conv2d_nchw_1[3] + pad_temp_shared_1[threadIdx_x % 7 + 108] * kernel_shared_1[threadIdx_x // 7 * 36 + 9]
conv2d_nchw_1[4] = conv2d_nchw_1[4] + pad_temp_shared_1[threadIdx_x % 7 + 117] * kernel_shared_1[threadIdx_x // 7 * 36 + 9]
conv2d_nchw_1[5] = conv2d_nchw_1[5] + pad_temp_shared_1[threadIdx_x % 7 + 126] * kernel_shared_1[threadIdx_x // 7 * 36 + 9]
conv2d_nchw_1[6] = conv2d_nchw_1[6] + pad_temp_shared_1[threadIdx_x % 7 + 135] * kernel_shared_1[threadIdx_x // 7 * 36 + 9]
conv2d_nchw_1[7] = conv2d_nchw_1[7] + pad_temp_shared_1[threadIdx_x % 7 + 81] * kernel_shared_1[threadIdx_x // 7 * 36 + 27]
conv2d_nchw_1[8] = conv2d_nchw_1[8] + pad_temp_shared_1[threadIdx_x % 7 + 90] * kernel_shared_1[threadIdx_x // 7 * 36 + 27]
conv2d_nchw_1[9] = conv2d_nchw_1[9] + pad_temp_shared_1[threadIdx_x % 7 + 99] * kernel_shared_1[threadIdx_x // 7 * 36 + 27]
conv2d_nchw_1[10] = conv2d_nchw_1[10] + pad_temp_shared_1[threadIdx_x % 7 + 108] * kernel_shared_1[threadIdx_x // 7 * 36 + 27]
conv2d_nchw_1[11] = conv2d_nchw_1[11] + pad_temp_shared_1[threadIdx_x % 7 + 117] * kernel_shared_1[threadIdx_x // 7 * 36 + 27]
conv2d_nchw_1[12] = conv2d_nchw_1[12] + pad_temp_shared_1[threadIdx_x % 7 + 126] * kernel_shared_1[threadIdx_x // 7 * 36 + 27]
conv2d_nchw_1[13] = conv2d_nchw_1[13] + pad_temp_shared_1[threadIdx_x % 7 + 135] * kernel_shared_1[threadIdx_x // 7 * 36 + 27]
conv2d_nchw_1[0] = conv2d_nchw_1[0] + pad_temp_shared_1[threadIdx_x % 7 + 90] * kernel_shared_1[threadIdx_x // 7 * 36 + 12]
conv2d_nchw_1[1] = conv2d_nchw_1[1] + pad_temp_shared_1[threadIdx_x % 7 + 99] * kernel_shared_1[threadIdx_x // 7 * 36 + 12]
conv2d_nchw_1[2] = conv2d_nchw_1[2] + pad_temp_shared_1[threadIdx_x % 7 + 108] * kernel_shared_1[threadIdx_x // 7 * 36 + 12]
conv2d_nchw_1[3] = conv2d_nchw_1[3] + pad_temp_shared_1[threadIdx_x % 7 + 117] * kernel_shared_1[threadIdx_x // 7 * 36 + 12]
conv2d_nchw_1[4] = conv2d_nchw_1[4] + pad_temp_shared_1[threadIdx_x % 7 + 126] * kernel_shared_1[threadIdx_x // 7 * 36 + 12]
conv2d_nchw_1[5] = conv2d_nchw_1[5] + pad_temp_shared_1[threadIdx_x % 7 + 135] * kernel_shared_1[threadIdx_x // 7 * 36 + 12]
conv2d_nchw_1[6] = conv2d_nchw_1[6] + pad_temp_shared_1[threadIdx_x % 7 + 144] * kernel_shared_1[threadIdx_x // 7 * 36 + 12]
conv2d_nchw_1[7] = conv2d_nchw_1[7] + pad_temp_shared_1[threadIdx_x % 7 + 90] * kernel_shared_1[threadIdx_x // 7 * 36 + 30]
conv2d_nchw_1[8] = conv2d_nchw_1[8] + pad_temp_shared_1[threadIdx_x % 7 + 99] * kernel_shared_1[threadIdx_x // 7 * 36 + 30]
conv2d_nchw_1[9] = conv2d_nchw_1[9] + pad_temp_shared_1[threadIdx_x % 7 + 108] * kernel_shared_1[threadIdx_x // 7 * 36 + 30]
conv2d_nchw_1[10] = conv2d_nchw_1[10] + pad_temp_shared_1[threadIdx_x % 7 + 117] * kernel_shared_1[threadIdx_x // 7 * 36 + 30]
conv2d_nchw_1[11] = conv2d_nchw_1[11] + pad_temp_shared_1[threadIdx_x % 7 + 126] * kernel_shared_1[threadIdx_x // 7 * 36 + 30]
conv2d_nchw_1[12] = conv2d_nchw_1[12] + pad_temp_shared_1[threadIdx_x % 7 + 135] * kernel_shared_1[threadIdx_x // 7 * 36 + 30]
conv2d_nchw_1[13] = conv2d_nchw_1[13] + pad_temp_shared_1[threadIdx_x % 7 + 144] * kernel_shared_1[threadIdx_x // 7 * 36 + 30]
conv2d_nchw_1[0] = conv2d_nchw_1[0] + pad_temp_shared_1[threadIdx_x % 7 + 99] * kernel_shared_1[threadIdx_x // 7 * 36 + 15]
conv2d_nchw_1[1] = conv2d_nchw_1[1] + pad_temp_shared_1[threadIdx_x % 7 + 108] * kernel_shared_1[threadIdx_x // 7 * 36 + 15]
conv2d_nchw_1[2] = conv2d_nchw_1[2] + pad_temp_shared_1[threadIdx_x % 7 + 117] * kernel_shared_1[threadIdx_x // 7 * 36 + 15]
conv2d_nchw_1[3] = conv2d_nchw_1[3] + pad_temp_shared_1[threadIdx_x % 7 + 126] * kernel_shared_1[threadIdx_x // 7 * 36 + 15]
conv2d_nchw_1[4] = conv2d_nchw_1[4] + pad_temp_shared_1[threadIdx_x % 7 + 135] * kernel_shared_1[threadIdx_x // 7 * 36 + 15]
conv2d_nchw_1[5] = conv2d_nchw_1[5] + pad_temp_shared_1[threadIdx_x % 7 + 144] * kernel_shared_1[threadIdx_x // 7 * 36 + 15]
conv2d_nchw_1[6] = conv2d_nchw_1[6] + pad_temp_shared_1[threadIdx_x % 7 + 153] * kernel_shared_1[threadIdx_x // 7 * 36 + 15]
conv2d_nchw_1[7] = conv2d_nchw_1[7] + pad_temp_shared_1[threadIdx_x % 7 + 99] * kernel_shared_1[threadIdx_x // 7 * 36 + 33]
conv2d_nchw_1[8] = conv2d_nchw_1[8] + pad_temp_shared_1[threadIdx_x % 7 + 108] * kernel_shared_1[threadIdx_x // 7 * 36 + 33]
conv2d_nchw_1[9] = conv2d_nchw_1[9] + pad_temp_shared_1[threadIdx_x % 7 + 117] * kernel_shared_1[threadIdx_x // 7 * 36 + 33]
conv2d_nchw_1[10] = conv2d_nchw_1[10] + pad_temp_shared_1[threadIdx_x % 7 + 126] * kernel_shared_1[threadIdx_x // 7 * 36 + 33]
conv2d_nchw_1[11] = conv2d_nchw_1[11] + pad_temp_shared_1[threadIdx_x % 7 + 135] * kernel_shared_1[threadIdx_x // 7 * 36 + 33]
conv2d_nchw_1[12] = conv2d_nchw_1[12] + pad_temp_shared_1[threadIdx_x % 7 + 144] * kernel_shared_1[threadIdx_x // 7 * 36 + 33]
conv2d_nchw_1[13] = conv2d_nchw_1[13] + pad_temp_shared_1[threadIdx_x % 7 + 153] * kernel_shared_1[threadIdx_x // 7 * 36 + 33]
conv2d_nchw_1[0] = conv2d_nchw_1[0] + pad_temp_shared_1[threadIdx_x % 7 + 1] * kernel_shared_1[threadIdx_x // 7 * 36 + 1]
conv2d_nchw_1[1] = conv2d_nchw_1[1] + pad_temp_shared_1[threadIdx_x % 7 + 10] * kernel_shared_1[threadIdx_x // 7 * 36 + 1]
conv2d_nchw_1[2] = conv2d_nchw_1[2] + pad_temp_shared_1[threadIdx_x % 7 + 19] * kernel_shared_1[threadIdx_x // 7 * 36 + 1]
conv2d_nchw_1[3] = conv2d_nchw_1[3] + pad_temp_shared_1[threadIdx_x % 7 + 28] * kernel_shared_1[threadIdx_x // 7 * 36 + 1]
conv2d_nchw_1[4] = conv2d_nchw_1[4] + pad_temp_shared_1[threadIdx_x % 7 + 37] * kernel_shared_1[threadIdx_x // 7 * 36 + 1]
conv2d_nchw_1[5] = conv2d_nchw_1[5] + pad_temp_shared_1[threadIdx_x % 7 + 46] * kernel_shared_1[threadIdx_x // 7 * 36 + 1]
conv2d_nchw_1[6] = conv2d_nchw_1[6] + pad_temp_shared_1[threadIdx_x % 7 + 55] * kernel_shared_1[threadIdx_x // 7 * 36 + 1]
conv2d_nchw_1[7] = conv2d_nchw_1[7] + pad_temp_shared_1[threadIdx_x % 7 + 1] * kernel_shared_1[threadIdx_x // 7 * 36 + 19]
conv2d_nchw_1[8] = conv2d_nchw_1[8] + pad_temp_shared_1[threadIdx_x % 7 + 10] * kernel_shared_1[threadIdx_x // 7 * 36 + 19]
conv2d_nchw_1[9] = conv2d_nchw_1[9] + pad_temp_shared_1[threadIdx_x % 7 + 19] * kernel_shared_1[threadIdx_x // 7 * 36 + 19]
conv2d_nchw_1[10] = conv2d_nchw_1[10] + pad_temp_shared_1[threadIdx_x % 7 + 28] * kernel_shared_1[threadIdx_x // 7 * 36 + 19]
conv2d_nchw_1[11] = conv2d_nchw_1[11] + pad_temp_shared_1[threadIdx_x % 7 + 37] * kernel_shared_1[threadIdx_x // 7 * 36 + 19]
conv2d_nchw_1[12] = conv2d_nchw_1[12] + pad_temp_shared_1[threadIdx_x % 7 + 46] * kernel_shared_1[threadIdx_x // 7 * 36 + 19]
conv2d_nchw_1[13] = conv2d_nchw_1[13] + pad_temp_shared_1[threadIdx_x % 7 + 55] * kernel_shared_1[threadIdx_x // 7 * 36 + 19]
conv2d_nchw_1[0] = conv2d_nchw_1[0] + pad_temp_shared_1[threadIdx_x % 7 + 10] * kernel_shared_1[threadIdx_x // 7 * 36 + 4]
conv2d_nchw_1[1] = conv2d_nchw_1[1] + pad_temp_shared_1[threadIdx_x % 7 + 19] * kernel_shared_1[threadIdx_x // 7 * 36 + 4]
conv2d_nchw_1[2] = conv2d_nchw_1[2] + pad_temp_shared_1[threadIdx_x % 7 + 28] * kernel_shared_1[threadIdx_x // 7 * 36 + 4]
conv2d_nchw_1[3] = conv2d_nchw_1[3] + pad_temp_shared_1[threadIdx_x % 7 + 37] * kernel_shared_1[threadIdx_x // 7 * 36 + 4]
conv2d_nchw_1[4] = conv2d_nchw_1[4] + pad_temp_shared_1[threadIdx_x % 7 + 46] * kernel_shared_1[threadIdx_x // 7 * 36 + 4]
conv2d_nchw_1[5] = conv2d_nchw_1[5] + pad_temp_shared_1[threadIdx_x % 7 + 55] * kernel_shared_1[threadIdx_x // 7 * 36 + 4]
conv2d_nchw_1[6] = conv2d_nchw_1[6] + pad_temp_shared_1[threadIdx_x % 7 + 64] * kernel_shared_1[threadIdx_x // 7 * 36 + 4]
conv2d_nchw_1[7] = conv2d_nchw_1[7] + pad_temp_shared_1[threadIdx_x % 7 + 10] * kernel_shared_1[threadIdx_x // 7 * 36 + 22]
conv2d_nchw_1[8] = conv2d_nchw_1[8] + pad_temp_shared_1[threadIdx_x % 7 + 19] * kernel_shared_1[threadIdx_x // 7 * 36 + 22]
conv2d_nchw_1[9] = conv2d_nchw_1[9] + pad_temp_shared_1[threadIdx_x % 7 + 28] * kernel_shared_1[threadIdx_x // 7 * 36 + 22]
conv2d_nchw_1[10] = conv2d_nchw_1[10] + pad_temp_shared_1[threadIdx_x % 7 + 37] * kernel_shared_1[threadIdx_x // 7 * 36 + 22]
conv2d_nchw_1[11] = conv2d_nchw_1[11] + pad_temp_shared_1[threadIdx_x % 7 + 46] * kernel_shared_1[threadIdx_x // 7 * 36 + 22]
conv2d_nchw_1[12] = conv2d_nchw_1[12] + pad_temp_shared_1[threadIdx_x % 7 + 55] * kernel_shared_1[threadIdx_x // 7 * 36 + 22]
conv2d_nchw_1[13] = conv2d_nchw_1[13] + pad_temp_shared_1[threadIdx_x % 7 + 64] * kernel_shared_1[threadIdx_x // 7 * 36 + 22]
conv2d_nchw_1[0] = conv2d_nchw_1[0] + pad_temp_shared_1[threadIdx_x % 7 + 19] * kernel_shared_1[threadIdx_x // 7 * 36 + 7]
conv2d_nchw_1[1] = conv2d_nchw_1[1] + pad_temp_shared_1[threadIdx_x % 7 + 28] * kernel_shared_1[threadIdx_x // 7 * 36 + 7]
conv2d_nchw_1[2] = conv2d_nchw_1[2] + pad_temp_shared_1[threadIdx_x % 7 + 37] * kernel_shared_1[threadIdx_x // 7 * 36 + 7]
conv2d_nchw_1[3] = conv2d_nchw_1[3] + pad_temp_shared_1[threadIdx_x % 7 + 46] * kernel_shared_1[threadIdx_x // 7 * 36 + 7]
conv2d_nchw_1[4] = conv2d_nchw_1[4] + pad_temp_shared_1[threadIdx_x % 7 + 55] * kernel_shared_1[threadIdx_x // 7 * 36 + 7]
conv2d_nchw_1[5] = conv2d_nchw_1[5] + pad_temp_shared_1[threadIdx_x % 7 + 64] * kernel_shared_1[threadIdx_x // 7 * 36 + 7]
conv2d_nchw_1[6] = conv2d_nchw_1[6] + pad_temp_shared_1[threadIdx_x % 7 + 73] * kernel_shared_1[threadIdx_x // 7 * 36 + 7]
conv2d_nchw_1[7] = conv2d_nchw_1[7] + pad_temp_shared_1[threadIdx_x % 7 + 19] * kernel_shared_1[threadIdx_x // 7 * 36 + 25]
conv2d_nchw_1[8] = conv2d_nchw_1[8] + pad_temp_shared_1[threadIdx_x % 7 + 28] * kernel_shared_1[threadIdx_x // 7 * 36 + 25]
conv2d_nchw_1[9] = conv2d_nchw_1[9] + pad_temp_shared_1[threadIdx_x % 7 + 37] * kernel_shared_1[threadIdx_x // 7 * 36 + 25]
conv2d_nchw_1[10] = conv2d_nchw_1[10] + pad_temp_shared_1[threadIdx_x % 7 + 46] * kernel_shared_1[threadIdx_x // 7 * 36 + 25]
conv2d_nchw_1[11] = conv2d_nchw_1[11] + pad_temp_shared_1[threadIdx_x % 7 + 55] * kernel_shared_1[threadIdx_x // 7 * 36 + 25]
conv2d_nchw_1[12] = conv2d_nchw_1[12] + pad_temp_shared_1[threadIdx_x % 7 + 64] * kernel_shared_1[threadIdx_x // 7 * 36 + 25]
conv2d_nchw_1[13] = conv2d_nchw_1[13] + pad_temp_shared_1[threadIdx_x % 7 + 73] * kernel_shared_1[threadIdx_x // 7 * 36 + 25]
conv2d_nchw_1[0] = conv2d_nchw_1[0] + pad_temp_shared_1[threadIdx_x % 7 + 82] * kernel_shared_1[threadIdx_x // 7 * 36 + 10]
conv2d_nchw_1[1] = conv2d_nchw_1[1] + pad_temp_shared_1[threadIdx_x % 7 + 91] * kernel_shared_1[threadIdx_x // 7 * 36 + 10]
conv2d_nchw_1[2] = conv2d_nchw_1[2] + pad_temp_shared_1[threadIdx_x % 7 + 100] * kernel_shared_1[threadIdx_x // 7 * 36 + 10]
conv2d_nchw_1[3] = conv2d_nchw_1[3] + pad_temp_shared_1[threadIdx_x % 7 + 109] * kernel_shared_1[threadIdx_x // 7 * 36 + 10]
conv2d_nchw_1[4] = conv2d_nchw_1[4] + pad_temp_shared_1[threadIdx_x % 7 + 118] * kernel_shared_1[threadIdx_x // 7 * 36 + 10]
conv2d_nchw_1[5] = conv2d_nchw_1[5] + pad_temp_shared_1[threadIdx_x % 7 + 127] * kernel_shared_1[threadIdx_x // 7 * 36 + 10]
conv2d_nchw_1[6] = conv2d_nchw_1[6] + pad_temp_shared_1[threadIdx_x % 7 + 136] * kernel_shared_1[threadIdx_x // 7 * 36 + 10]
conv2d_nchw_1[7] = conv2d_nchw_1[7] + pad_temp_shared_1[threadIdx_x % 7 + 82] * kernel_shared_1[threadIdx_x // 7 * 36 + 28]
conv2d_nchw_1[8] = conv2d_nchw_1[8] + pad_temp_shared_1[threadIdx_x % 7 + 91] * kernel_shared_1[threadIdx_x // 7 * 36 + 28]
conv2d_nchw_1[9] = conv2d_nchw_1[9] + pad_temp_shared_1[threadIdx_x % 7 + 100] * kernel_shared_1[threadIdx_x // 7 * 36 + 28]
conv2d_nchw_1[10] = conv2d_nchw_1[10] + pad_temp_shared_1[threadIdx_x % 7 + 109] * kernel_shared_1[threadIdx_x // 7 * 36 + 28]
conv2d_nchw_1[11] = conv2d_nchw_1[11] + pad_temp_shared_1[threadIdx_x % 7 + 118] * kernel_shared_1[threadIdx_x // 7 * 36 + 28]
conv2d_nchw_1[12] = conv2d_nchw_1[12] + pad_temp_shared_1[threadIdx_x % 7 + 127] * kernel_shared_1[threadIdx_x // 7 * 36 + 28]
conv2d_nchw_1[13] = conv2d_nchw_1[13] + pad_temp_shared_1[threadIdx_x % 7 + 136] * kernel_shared_1[threadIdx_x // 7 * 36 + 28]
conv2d_nchw_1[0] = conv2d_nchw_1[0] + pad_temp_shared_1[threadIdx_x % 7 + 91] * kernel_shared_1[threadIdx_x // 7 * 36 + 13]
conv2d_nchw_1[1] = conv2d_nchw_1[1] + pad_temp_shared_1[threadIdx_x % 7 + 100] * kernel_shared_1[threadIdx_x // 7 * 36 + 13]
conv2d_nchw_1[2] = conv2d_nchw_1[2] + pad_temp_shared_1[threadIdx_x % 7 + 109] * kernel_shared_1[threadIdx_x // 7 * 36 + 13]
conv2d_nchw_1[3] = conv2d_nchw_1[3] + pad_temp_shared_1[threadIdx_x % 7 + 118] * kernel_shared_1[threadIdx_x // 7 * 36 + 13]
conv2d_nchw_1[4] = conv2d_nchw_1[4] + pad_temp_shared_1[threadIdx_x % 7 + 127] * kernel_shared_1[threadIdx_x // 7 * 36 + 13]
conv2d_nchw_1[5] = conv2d_nchw_1[5] + pad_temp_shared_1[threadIdx_x % 7 + 136] * kernel_shared_1[threadIdx_x // 7 * 36 + 13]
conv2d_nchw_1[6] = conv2d_nchw_1[6] + pad_temp_shared_1[threadIdx_x % 7 + 145] * kernel_shared_1[threadIdx_x // 7 * 36 + 13]
conv2d_nchw_1[7] = conv2d_nchw_1[7] + pad_temp_shared_1[threadIdx_x % 7 + 91] * kernel_shared_1[threadIdx_x // 7 * 36 + 31]
conv2d_nchw_1[8] = conv2d_nchw_1[8] + pad_temp_shared_1[threadIdx_x % 7 + 100] * kernel_shared_1[threadIdx_x // 7 * 36 + 31]
conv2d_nchw_1[9] = conv2d_nchw_1[9] + pad_temp_shared_1[threadIdx_x % 7 + 109] * kernel_shared_1[threadIdx_x // 7 * 36 + 31]
conv2d_nchw_1[10] = conv2d_nchw_1[10] + pad_temp_shared_1[threadIdx_x % 7 + 118] * kernel_shared_1[threadIdx_x // 7 * 36 + 31]
conv2d_nchw_1[11] = conv2d_nchw_1[11] + pad_temp_shared_1[threadIdx_x % 7 + 127] * kernel_shared_1[threadIdx_x // 7 * 36 + 31]
conv2d_nchw_1[12] = conv2d_nchw_1[12] + pad_temp_shared_1[threadIdx_x % 7 + 136] * kernel_shared_1[threadIdx_x // 7 * 36 + 31]
conv2d_nchw_1[13] = conv2d_nchw_1[13] + pad_temp_shared_1[threadIdx_x % 7 + 145] * kernel_shared_1[threadIdx_x // 7 * 36 + 31]
conv2d_nchw_1[0] = conv2d_nchw_1[0] + pad_temp_shared_1[threadIdx_x % 7 + 100] * kernel_shared_1[threadIdx_x // 7 * 36 + 16]
conv2d_nchw_1[1] = conv2d_nchw_1[1] + pad_temp_shared_1[threadIdx_x % 7 + 109] * kernel_shared_1[threadIdx_x // 7 * 36 + 16]
conv2d_nchw_1[2] = conv2d_nchw_1[2] + pad_temp_shared_1[threadIdx_x % 7 + 118] * kernel_shared_1[threadIdx_x // 7 * 36 + 16]
conv2d_nchw_1[3] = conv2d_nchw_1[3] + pad_temp_shared_1[threadIdx_x % 7 + 127] * kernel_shared_1[threadIdx_x // 7 * 36 + 16]
conv2d_nchw_1[4] = conv2d_nchw_1[4] + pad_temp_shared_1[threadIdx_x % 7 + 136] * kernel_shared_1[threadIdx_x // 7 * 36 + 16]
conv2d_nchw_1[5] = conv2d_nchw_1[5] + pad_temp_shared_1[threadIdx_x % 7 + 145] * kernel_shared_1[threadIdx_x // 7 * 36 + 16]
conv2d_nchw_1[6] = conv2d_nchw_1[6] + pad_temp_shared_1[threadIdx_x % 7 + 154] * kernel_shared_1[threadIdx_x // 7 * 36 + 16]
conv2d_nchw_1[7] = conv2d_nchw_1[7] + pad_temp_shared_1[threadIdx_x % 7 + 100] * kernel_shared_1[threadIdx_x // 7 * 36 + 34]
conv2d_nchw_1[8] = conv2d_nchw_1[8] + pad_temp_shared_1[threadIdx_x % 7 + 109] * kernel_shared_1[threadIdx_x // 7 * 36 + 34]
conv2d_nchw_1[9] = conv2d_nchw_1[9] + pad_temp_shared_1[threadIdx_x % 7 + 118] * kernel_shared_1[threadIdx_x // 7 * 36 + 34]
conv2d_nchw_1[10] = conv2d_nchw_1[10] + pad_temp_shared_1[threadIdx_x % 7 + 127] * kernel_shared_1[threadIdx_x // 7 * 36 + 34]
conv2d_nchw_1[11] = conv2d_nchw_1[11] + pad_temp_shared_1[threadIdx_x % 7 + 136] * kernel_shared_1[threadIdx_x // 7 * 36 + 34]
conv2d_nchw_1[12] = conv2d_nchw_1[12] + pad_temp_shared_1[threadIdx_x % 7 + 145] * kernel_shared_1[threadIdx_x // 7 * 36 + 34]
conv2d_nchw_1[13] = conv2d_nchw_1[13] + pad_temp_shared_1[threadIdx_x % 7 + 154] * kernel_shared_1[threadIdx_x // 7 * 36 + 34]
conv2d_nchw_1[0] = conv2d_nchw_1[0] + pad_temp_shared_1[threadIdx_x % 7 + 2] * kernel_shared_1[threadIdx_x // 7 * 36 + 2]
conv2d_nchw_1[1] = conv2d_nchw_1[1] + pad_temp_shared_1[threadIdx_x % 7 + 11] * kernel_shared_1[threadIdx_x // 7 * 36 + 2]
conv2d_nchw_1[2] = conv2d_nchw_1[2] + pad_temp_shared_1[threadIdx_x % 7 + 20] * kernel_shared_1[threadIdx_x // 7 * 36 + 2]
conv2d_nchw_1[3] = conv2d_nchw_1[3] + pad_temp_shared_1[threadIdx_x % 7 + 29] * kernel_shared_1[threadIdx_x // 7 * 36 + 2]
conv2d_nchw_1[4] = conv2d_nchw_1[4] + pad_temp_shared_1[threadIdx_x % 7 + 38] * kernel_shared_1[threadIdx_x // 7 * 36 + 2]
conv2d_nchw_1[5] = conv2d_nchw_1[5] + pad_temp_shared_1[threadIdx_x % 7 + 47] * kernel_shared_1[threadIdx_x // 7 * 36 + 2]
conv2d_nchw_1[6] = conv2d_nchw_1[6] + pad_temp_shared_1[threadIdx_x % 7 + 56] * kernel_shared_1[threadIdx_x // 7 * 36 + 2]
conv2d_nchw_1[7] = conv2d_nchw_1[7] + pad_temp_shared_1[threadIdx_x % 7 + 2] * kernel_shared_1[threadIdx_x // 7 * 36 + 20]
conv2d_nchw_1[8] = conv2d_nchw_1[8] + pad_temp_shared_1[threadIdx_x % 7 + 11] * kernel_shared_1[threadIdx_x // 7 * 36 + 20]
conv2d_nchw_1[9] = conv2d_nchw_1[9] + pad_temp_shared_1[threadIdx_x % 7 + 20] * kernel_shared_1[threadIdx_x // 7 * 36 + 20]
conv2d_nchw_1[10] = conv2d_nchw_1[10] + pad_temp_shared_1[threadIdx_x % 7 + 29] * kernel_shared_1[threadIdx_x // 7 * 36 + 20]
conv2d_nchw_1[11] = conv2d_nchw_1[11] + pad_temp_shared_1[threadIdx_x % 7 + 38] * kernel_shared_1[threadIdx_x // 7 * 36 + 20]
conv2d_nchw_1[12] = conv2d_nchw_1[12] + pad_temp_shared_1[threadIdx_x % 7 + 47] * kernel_shared_1[threadIdx_x // 7 * 36 + 20]
conv2d_nchw_1[13] = conv2d_nchw_1[13] + pad_temp_shared_1[threadIdx_x % 7 + 56] * kernel_shared_1[threadIdx_x // 7 * 36 + 20]
conv2d_nchw_1[0] = conv2d_nchw_1[0] + pad_temp_shared_1[threadIdx_x % 7 + 11] * kernel_shared_1[threadIdx_x // 7 * 36 + 5]
conv2d_nchw_1[1] = conv2d_nchw_1[1] + pad_temp_shared_1[threadIdx_x % 7 + 20] * kernel_shared_1[threadIdx_x // 7 * 36 + 5]
conv2d_nchw_1[2] = conv2d_nchw_1[2] + pad_temp_shared_1[threadIdx_x % 7 + 29] * kernel_shared_1[threadIdx_x // 7 * 36 + 5]
conv2d_nchw_1[3] = conv2d_nchw_1[3] + pad_temp_shared_1[threadIdx_x % 7 + 38] * kernel_shared_1[threadIdx_x // 7 * 36 + 5]
conv2d_nchw_1[4] = conv2d_nchw_1[4] + pad_temp_shared_1[threadIdx_x % 7 + 47] * kernel_shared_1[threadIdx_x // 7 * 36 + 5]
conv2d_nchw_1[5] = conv2d_nchw_1[5] + pad_temp_shared_1[threadIdx_x % 7 + 56] * kernel_shared_1[threadIdx_x // 7 * 36 + 5]
conv2d_nchw_1[6] = conv2d_nchw_1[6] + pad_temp_shared_1[threadIdx_x % 7 + 65] * kernel_shared_1[threadIdx_x // 7 * 36 + 5]
conv2d_nchw_1[7] = conv2d_nchw_1[7] + pad_temp_shared_1[threadIdx_x % 7 + 11] * kernel_shared_1[threadIdx_x // 7 * 36 + 23]
conv2d_nchw_1[8] = conv2d_nchw_1[8] + pad_temp_shared_1[threadIdx_x % 7 + 20] * kernel_shared_1[threadIdx_x // 7 * 36 + 23]
conv2d_nchw_1[9] = conv2d_nchw_1[9] + pad_temp_shared_1[threadIdx_x % 7 + 29] * kernel_shared_1[threadIdx_x // 7 * 36 + 23]
conv2d_nchw_1[10] = conv2d_nchw_1[10] + pad_temp_shared_1[threadIdx_x % 7 + 38] * kernel_shared_1[threadIdx_x // 7 * 36 + 23]
conv2d_nchw_1[11] = conv2d_nchw_1[11] + pad_temp_shared_1[threadIdx_x % 7 + 47] * kernel_shared_1[threadIdx_x // 7 * 36 + 23]
conv2d_nchw_1[12] = conv2d_nchw_1[12] + pad_temp_shared_1[threadIdx_x % 7 + 56] * kernel_shared_1[threadIdx_x // 7 * 36 + 23]
conv2d_nchw_1[13] = conv2d_nchw_1[13] + pad_temp_shared_1[threadIdx_x % 7 + 65] * kernel_shared_1[threadIdx_x // 7 * 36 + 23]
conv2d_nchw_1[0] = conv2d_nchw_1[0] + pad_temp_shared_1[threadIdx_x % 7 + 20] * kernel_shared_1[threadIdx_x // 7 * 36 + 8]
conv2d_nchw_1[1] = conv2d_nchw_1[1] + pad_temp_shared_1[threadIdx_x % 7 + 29] * kernel_shared_1[threadIdx_x // 7 * 36 + 8]
conv2d_nchw_1[2] = conv2d_nchw_1[2] + pad_temp_shared_1[threadIdx_x % 7 + 38] * kernel_shared_1[threadIdx_x // 7 * 36 + 8]
conv2d_nchw_1[3] = conv2d_nchw_1[3] + pad_temp_shared_1[threadIdx_x % 7 + 47] * kernel_shared_1[threadIdx_x // 7 * 36 + 8]
conv2d_nchw_1[4] = conv2d_nchw_1[4] + pad_temp_shared_1[threadIdx_x % 7 + 56] * kernel_shared_1[threadIdx_x // 7 * 36 + 8]
conv2d_nchw_1[5] = conv2d_nchw_1[5] + pad_temp_shared_1[threadIdx_x % 7 + 65] * kernel_shared_1[threadIdx_x // 7 * 36 + 8]
conv2d_nchw_1[6] = conv2d_nchw_1[6] + pad_temp_shared_1[threadIdx_x % 7 + 74] * kernel_shared_1[threadIdx_x // 7 * 36 + 8]
conv2d_nchw_1[7] = conv2d_nchw_1[7] + pad_temp_shared_1[threadIdx_x % 7 + 20] * kernel_shared_1[threadIdx_x // 7 * 36 + 26]
conv2d_nchw_1[8] = conv2d_nchw_1[8] + pad_temp_shared_1[threadIdx_x % 7 + 29] * kernel_shared_1[threadIdx_x // 7 * 36 + 26]
conv2d_nchw_1[9] = conv2d_nchw_1[9] + pad_temp_shared_1[threadIdx_x % 7 + 38] * kernel_shared_1[threadIdx_x // 7 * 36 + 26]
conv2d_nchw_1[10] = conv2d_nchw_1[10] + pad_temp_shared_1[threadIdx_x % 7 + 47] * kernel_shared_1[threadIdx_x // 7 * 36 + 26]
conv2d_nchw_1[11] = conv2d_nchw_1[11] + pad_temp_shared_1[threadIdx_x % 7 + 56] * kernel_shared_1[threadIdx_x // 7 * 36 + 26]
conv2d_nchw_1[12] = conv2d_nchw_1[12] + pad_temp_shared_1[threadIdx_x % 7 + 65] * kernel_shared_1[threadIdx_x // 7 * 36 + 26]
conv2d_nchw_1[13] = conv2d_nchw_1[13] + pad_temp_shared_1[threadIdx_x % 7 + 74] * kernel_shared_1[threadIdx_x // 7 * 36 + 26]
conv2d_nchw_1[0] = conv2d_nchw_1[0] + pad_temp_shared_1[threadIdx_x % 7 + 83] * kernel_shared_1[threadIdx_x // 7 * 36 + 11]
conv2d_nchw_1[1] = conv2d_nchw_1[1] + pad_temp_shared_1[threadIdx_x % 7 + 92] * kernel_shared_1[threadIdx_x // 7 * 36 + 11]
conv2d_nchw_1[2] = conv2d_nchw_1[2] + pad_temp_shared_1[threadIdx_x % 7 + 101] * kernel_shared_1[threadIdx_x // 7 * 36 + 11]
conv2d_nchw_1[3] = conv2d_nchw_1[3] + pad_temp_shared_1[threadIdx_x % 7 + 110] * kernel_shared_1[threadIdx_x // 7 * 36 + 11]
conv2d_nchw_1[4] = conv2d_nchw_1[4] + pad_temp_shared_1[threadIdx_x % 7 + 119] * kernel_shared_1[threadIdx_x // 7 * 36 + 11]
conv2d_nchw_1[5] = conv2d_nchw_1[5] + pad_temp_shared_1[threadIdx_x % 7 + 128] * kernel_shared_1[threadIdx_x // 7 * 36 + 11]
conv2d_nchw_1[6] = conv2d_nchw_1[6] + pad_temp_shared_1[threadIdx_x % 7 + 137] * kernel_shared_1[threadIdx_x // 7 * 36 + 11]
conv2d_nchw_1[7] = conv2d_nchw_1[7] + pad_temp_shared_1[threadIdx_x % 7 + 83] * kernel_shared_1[threadIdx_x // 7 * 36 + 29]
conv2d_nchw_1[8] = conv2d_nchw_1[8] + pad_temp_shared_1[threadIdx_x % 7 + 92] * kernel_shared_1[threadIdx_x // 7 * 36 + 29]
conv2d_nchw_1[9] = conv2d_nchw_1[9] + pad_temp_shared_1[threadIdx_x % 7 + 101] * kernel_shared_1[threadIdx_x // 7 * 36 + 29]
conv2d_nchw_1[10] = conv2d_nchw_1[10] + pad_temp_shared_1[threadIdx_x % 7 + 110] * kernel_shared_1[threadIdx_x // 7 * 36 + 29]
conv2d_nchw_1[11] = conv2d_nchw_1[11] + pad_temp_shared_1[threadIdx_x % 7 + 119] * kernel_shared_1[threadIdx_x // 7 * 36 + 29]
conv2d_nchw_1[12] = conv2d_nchw_1[12] + pad_temp_shared_1[threadIdx_x % 7 + 128] * kernel_shared_1[threadIdx_x // 7 * 36 + 29]
conv2d_nchw_1[13] = conv2d_nchw_1[13] + pad_temp_shared_1[threadIdx_x % 7 + 137] * kernel_shared_1[threadIdx_x // 7 * 36 + 29]
conv2d_nchw_1[0] = conv2d_nchw_1[0] + pad_temp_shared_1[threadIdx_x % 7 + 92] * kernel_shared_1[threadIdx_x // 7 * 36 + 14]
conv2d_nchw_1[1] = conv2d_nchw_1[1] + pad_temp_shared_1[threadIdx_x % 7 + 101] * kernel_shared_1[threadIdx_x // 7 * 36 + 14]
conv2d_nchw_1[2] = conv2d_nchw_1[2] + pad_temp_shared_1[threadIdx_x % 7 + 110] * kernel_shared_1[threadIdx_x // 7 * 36 + 14]
conv2d_nchw_1[3] = conv2d_nchw_1[3] + pad_temp_shared_1[threadIdx_x % 7 + 119] * kernel_shared_1[threadIdx_x // 7 * 36 + 14]
conv2d_nchw_1[4] = conv2d_nchw_1[4] + pad_temp_shared_1[threadIdx_x % 7 + 128] * kernel_shared_1[threadIdx_x // 7 * 36 + 14]
conv2d_nchw_1[5] = conv2d_nchw_1[5] + pad_temp_shared_1[threadIdx_x % 7 + 137] * kernel_shared_1[threadIdx_x // 7 * 36 + 14]
conv2d_nchw_1[6] = conv2d_nchw_1[6] + pad_temp_shared_1[threadIdx_x % 7 + 146] * kernel_shared_1[threadIdx_x // 7 * 36 + 14]
conv2d_nchw_1[7] = conv2d_nchw_1[7] + pad_temp_shared_1[threadIdx_x % 7 + 92] * kernel_shared_1[threadIdx_x // 7 * 36 + 32]
conv2d_nchw_1[8] = conv2d_nchw_1[8] + pad_temp_shared_1[threadIdx_x % 7 + 101] * kernel_shared_1[threadIdx_x // 7 * 36 + 32]
conv2d_nchw_1[9] = conv2d_nchw_1[9] + pad_temp_shared_1[threadIdx_x % 7 + 110] * kernel_shared_1[threadIdx_x // 7 * 36 + 32]
conv2d_nchw_1[10] = conv2d_nchw_1[10] + pad_temp_shared_1[threadIdx_x % 7 + 119] * kernel_shared_1[threadIdx_x // 7 * 36 + 32]
conv2d_nchw_1[11] = conv2d_nchw_1[11] + pad_temp_shared_1[threadIdx_x % 7 + 128] * kernel_shared_1[threadIdx_x // 7 * 36 + 32]
conv2d_nchw_1[12] = conv2d_nchw_1[12] + pad_temp_shared_1[threadIdx_x % 7 + 137] * kernel_shared_1[threadIdx_x // 7 * 36 + 32]
conv2d_nchw_1[13] = conv2d_nchw_1[13] + pad_temp_shared_1[threadIdx_x % 7 + 146] * kernel_shared_1[threadIdx_x // 7 * 36 + 32]
conv2d_nchw_1[0] = conv2d_nchw_1[0] + pad_temp_shared_1[threadIdx_x % 7 + 101] * kernel_shared_1[threadIdx_x // 7 * 36 + 17]
conv2d_nchw_1[1] = conv2d_nchw_1[1] + pad_temp_shared_1[threadIdx_x % 7 + 110] * kernel_shared_1[threadIdx_x // 7 * 36 + 17]
conv2d_nchw_1[2] = conv2d_nchw_1[2] + pad_temp_shared_1[threadIdx_x % 7 + 119] * kernel_shared_1[threadIdx_x // 7 * 36 + 17]
conv2d_nchw_1[3] = conv2d_nchw_1[3] + pad_temp_shared_1[threadIdx_x % 7 + 128] * kernel_shared_1[threadIdx_x // 7 * 36 + 17]
conv2d_nchw_1[4] = conv2d_nchw_1[4] + pad_temp_shared_1[threadIdx_x % 7 + 137] * kernel_shared_1[threadIdx_x // 7 * 36 + 17]
conv2d_nchw_1[5] = conv2d_nchw_1[5] + pad_temp_shared_1[threadIdx_x % 7 + 146] * kernel_shared_1[threadIdx_x // 7 * 36 + 17]
conv2d_nchw_1[6] = conv2d_nchw_1[6] + pad_temp_shared_1[threadIdx_x % 7 + 155] * kernel_shared_1[threadIdx_x // 7 * 36 + 17]
conv2d_nchw_1[7] = conv2d_nchw_1[7] + pad_temp_shared_1[threadIdx_x % 7 + 101] * kernel_shared_1[threadIdx_x // 7 * 36 + 35]
conv2d_nchw_1[8] = conv2d_nchw_1[8] + pad_temp_shared_1[threadIdx_x % 7 + 110] * kernel_shared_1[threadIdx_x // 7 * 36 + 35]
conv2d_nchw_1[9] = conv2d_nchw_1[9] + pad_temp_shared_1[threadIdx_x % 7 + 119] * kernel_shared_1[threadIdx_x // 7 * 36 + 35]
conv2d_nchw_1[10] = conv2d_nchw_1[10] + pad_temp_shared_1[threadIdx_x % 7 + 128] * kernel_shared_1[threadIdx_x // 7 * 36 + 35]
conv2d_nchw_1[11] = conv2d_nchw_1[11] + pad_temp_shared_1[threadIdx_x % 7 + 137] * kernel_shared_1[threadIdx_x // 7 * 36 + 35]
conv2d_nchw_1[12] = conv2d_nchw_1[12] + pad_temp_shared_1[threadIdx_x % 7 + 146] * kernel_shared_1[threadIdx_x // 7 * 36 + 35]
conv2d_nchw_1[13] = conv2d_nchw_1[13] + pad_temp_shared_1[threadIdx_x % 7 + 155] * kernel_shared_1[threadIdx_x // 7 * 36 + 35]
for i1_inner, i2_inner in T.grid(2, 7):
compute_1 = T.Buffer((25088,), data=compute.data)
bias_1 = T.Buffer((512,), data=bias.data)
compute_1[blockIdx_x * 196 + threadIdx_x // 7 * 98 + i1_inner * 49 + i2_inner * 7 + threadIdx_x % 7] = T.max(conv2d_nchw_1[i1_inner * 7 + i2_inner] + bias_1[blockIdx_x * 4 + threadIdx_x // 7 * 2 + i1_inner], T.float32(0))
Check correctness and evaluate performance¶
We build the binary and check its correctness and performance.
func = tvm.build(sch, args, target)
# Check correctness
data_np = np.random.uniform(size=(N, CI, H, W)).astype(np.float32)
weight_np = np.random.uniform(size=(CO, CI, KH, KW)).astype(np.float32)
bias_np = np.random.uniform(size=(1, CO, 1, 1)).astype(np.float32)
conv_np = conv2d_nchw_python(data_np, weight_np, strides, padding)
out_np = np.maximum(conv_np + bias_np, 0.0)
dev = tvm.cuda()
data_tvm = tvm.nd.array(data_np, device=dev)
weight_tvm = tvm.nd.array(weight_np, device=dev)
bias_tvm = tvm.nd.array(bias_np, device=dev)
out_tvm = tvm.nd.empty(out_np.shape, device=dev)
func(data_tvm, weight_tvm, bias_tvm, out_tvm)
# Check results
np.testing.assert_allclose(out_np, out_tvm.numpy(), rtol=1e-3)
# Evaluate execution time
evaluator = func.time_evaluator(func.entry_name, dev, min_repeat_ms=500)
print(
"Execution time of this operator: %.3f ms"
% (np.median(evaluator(data_tvm, weight_tvm, bias_tvm, out_tvm).results) * 1000)
)
Execution time of this operator: 0.264 ms
Using the record file¶
During the search, all measurement records are dumped into the record file “conv2d.json”. The measurement records can be used to re-apply search results, resume the search, and perform other analyses.
Here is an example where we load the best schedule from a file, print the equivalent python schedule API and CUDA source code. They can be used for debugging and learning the behavior of the auto-scheduler.
print("Equivalent python schedule:")
print(task.print_best(log_file, print_mode="schedule"))
print("CUDA source code:")
print(task.print_best(log_file, print_mode="cuda"))
Equivalent python schedule:
pad_temp_i0, pad_temp_i1, pad_temp_i2, pad_temp_i3 = tuple(pad_temp.op.axis) + tuple(pad_temp.op.reduce_axis)
conv2d_nchw_nn, conv2d_nchw_ff, conv2d_nchw_yy, conv2d_nchw_xx, conv2d_nchw_rc, conv2d_nchw_ry, conv2d_nchw_rx = tuple(conv2d_nchw.op.axis) + tuple(conv2d_nchw.op.reduce_axis)
T_add_ax0, T_add_ax1, T_add_ax2, T_add_ax3 = tuple(T_add.op.axis) + tuple(T_add.op.reduce_axis)
compute_i0, compute_i1, compute_i2, compute_i3 = tuple(compute.op.axis) + tuple(compute.op.reduce_axis)
s[T_add].compute_inline()
conv2d_nchw_nn_o_i, conv2d_nchw_nn_i = s[conv2d_nchw].split(conv2d_nchw_nn, factor=1)
conv2d_nchw_nn_o_o_i, conv2d_nchw_nn_o_i = s[conv2d_nchw].split(conv2d_nchw_nn_o_i, factor=1)
conv2d_nchw_nn_o_o_o_i, conv2d_nchw_nn_o_o_i = s[conv2d_nchw].split(conv2d_nchw_nn_o_o_i, factor=1)
conv2d_nchw_nn_o_o_o_o, conv2d_nchw_nn_o_o_o_i = s[conv2d_nchw].split(conv2d_nchw_nn_o_o_o_i, factor=1)
conv2d_nchw_ff_o_i, conv2d_nchw_ff_i = s[conv2d_nchw].split(conv2d_nchw_ff, factor=2)
conv2d_nchw_ff_o_o_i, conv2d_nchw_ff_o_i = s[conv2d_nchw].split(conv2d_nchw_ff_o_i, factor=1)
conv2d_nchw_ff_o_o_o_i, conv2d_nchw_ff_o_o_i = s[conv2d_nchw].split(conv2d_nchw_ff_o_o_i, factor=2)
conv2d_nchw_ff_o_o_o_o, conv2d_nchw_ff_o_o_o_i = s[conv2d_nchw].split(conv2d_nchw_ff_o_o_o_i, factor=1)
conv2d_nchw_yy_o_i, conv2d_nchw_yy_i = s[conv2d_nchw].split(conv2d_nchw_yy, factor=7)
conv2d_nchw_yy_o_o_i, conv2d_nchw_yy_o_i = s[conv2d_nchw].split(conv2d_nchw_yy_o_i, factor=1)
conv2d_nchw_yy_o_o_o_i, conv2d_nchw_yy_o_o_i = s[conv2d_nchw].split(conv2d_nchw_yy_o_o_i, factor=1)
conv2d_nchw_yy_o_o_o_o, conv2d_nchw_yy_o_o_o_i = s[conv2d_nchw].split(conv2d_nchw_yy_o_o_o_i, factor=1)
conv2d_nchw_xx_o_i, conv2d_nchw_xx_i = s[conv2d_nchw].split(conv2d_nchw_xx, factor=1)
conv2d_nchw_xx_o_o_i, conv2d_nchw_xx_o_i = s[conv2d_nchw].split(conv2d_nchw_xx_o_i, factor=1)
conv2d_nchw_xx_o_o_o_i, conv2d_nchw_xx_o_o_i = s[conv2d_nchw].split(conv2d_nchw_xx_o_o_i, factor=7)
conv2d_nchw_xx_o_o_o_o, conv2d_nchw_xx_o_o_o_i = s[conv2d_nchw].split(conv2d_nchw_xx_o_o_o_i, factor=1)
conv2d_nchw_rc_o_i, conv2d_nchw_rc_i = s[conv2d_nchw].split(conv2d_nchw_rc, factor=2)
conv2d_nchw_rc_o_o, conv2d_nchw_rc_o_i = s[conv2d_nchw].split(conv2d_nchw_rc_o_i, factor=1)
conv2d_nchw_ry_o_i, conv2d_nchw_ry_i = s[conv2d_nchw].split(conv2d_nchw_ry, factor=3)
conv2d_nchw_ry_o_o, conv2d_nchw_ry_o_i = s[conv2d_nchw].split(conv2d_nchw_ry_o_i, factor=1)
conv2d_nchw_rx_o_i, conv2d_nchw_rx_i = s[conv2d_nchw].split(conv2d_nchw_rx, factor=1)
conv2d_nchw_rx_o_o, conv2d_nchw_rx_o_i = s[conv2d_nchw].split(conv2d_nchw_rx_o_i, factor=3)
s[conv2d_nchw].reorder(conv2d_nchw_nn_o_o_o_o, conv2d_nchw_ff_o_o_o_o, conv2d_nchw_yy_o_o_o_o, conv2d_nchw_xx_o_o_o_o, conv2d_nchw_nn_o_o_o_i, conv2d_nchw_ff_o_o_o_i, conv2d_nchw_yy_o_o_o_i, conv2d_nchw_xx_o_o_o_i, conv2d_nchw_nn_o_o_i, conv2d_nchw_ff_o_o_i, conv2d_nchw_yy_o_o_i, conv2d_nchw_xx_o_o_i, conv2d_nchw_rc_o_o, conv2d_nchw_ry_o_o, conv2d_nchw_rx_o_o, conv2d_nchw_rc_o_i, conv2d_nchw_ry_o_i, conv2d_nchw_rx_o_i, conv2d_nchw_nn_o_i, conv2d_nchw_ff_o_i, conv2d_nchw_yy_o_i, conv2d_nchw_xx_o_i, conv2d_nchw_rc_i, conv2d_nchw_ry_i, conv2d_nchw_rx_i, conv2d_nchw_nn_i, conv2d_nchw_ff_i, conv2d_nchw_yy_i, conv2d_nchw_xx_i)
compute_i0_o_i, compute_i0_i = s[compute].split(compute_i0, factor=1)
compute_i0_o_o_i, compute_i0_o_i = s[compute].split(compute_i0_o_i, factor=1)
compute_i0_o_o_o, compute_i0_o_o_i = s[compute].split(compute_i0_o_o_i, factor=1)
compute_i1_o_i, compute_i1_i = s[compute].split(compute_i1, factor=2)
compute_i1_o_o_i, compute_i1_o_i = s[compute].split(compute_i1_o_i, factor=2)
compute_i1_o_o_o, compute_i1_o_o_i = s[compute].split(compute_i1_o_o_i, factor=1)
compute_i2_o_i, compute_i2_i = s[compute].split(compute_i2, factor=7)
compute_i2_o_o_i, compute_i2_o_i = s[compute].split(compute_i2_o_i, factor=1)
compute_i2_o_o_o, compute_i2_o_o_i = s[compute].split(compute_i2_o_o_i, factor=1)
compute_i3_o_i, compute_i3_i = s[compute].split(compute_i3, factor=1)
compute_i3_o_o_i, compute_i3_o_i = s[compute].split(compute_i3_o_i, factor=7)
compute_i3_o_o_o, compute_i3_o_o_i = s[compute].split(compute_i3_o_o_i, factor=1)
s[compute].reorder(compute_i0_o_o_o, compute_i1_o_o_o, compute_i2_o_o_o, compute_i3_o_o_o, compute_i0_o_o_i, compute_i1_o_o_i, compute_i2_o_o_i, compute_i3_o_o_i, compute_i0_o_i, compute_i1_o_i, compute_i2_o_i, compute_i3_o_i, compute_i0_i, compute_i1_i, compute_i2_i, compute_i3_i)
s[conv2d_nchw].compute_at(s[compute], compute_i3_o_i)
kernel_shared = s.cache_read(kernel, "shared", [conv2d_nchw])
kernel_shared_ax0, kernel_shared_ax1, kernel_shared_ax2, kernel_shared_ax3 = tuple(kernel_shared.op.axis)
s[kernel_shared].compute_at(s[conv2d_nchw], conv2d_nchw_rx_o_o)
pad_temp_shared = s.cache_read(pad_temp, "shared", [conv2d_nchw])
pad_temp_shared_ax0, pad_temp_shared_ax1, pad_temp_shared_ax2, pad_temp_shared_ax3 = tuple(pad_temp_shared.op.axis)
s[pad_temp_shared].compute_at(s[conv2d_nchw], conv2d_nchw_rx_o_o)
s[pad_temp].compute_inline()
compute_i0_o_o_o_i1_o_o_o_fused_i2_o_o_o_fused_i3_o_o_o_fused = s[compute].fuse(compute_i0_o_o_o, compute_i1_o_o_o, compute_i2_o_o_o, compute_i3_o_o_o)
s[compute].bind(compute_i0_o_o_o_i1_o_o_o_fused_i2_o_o_o_fused_i3_o_o_o_fused, te.thread_axis("blockIdx.x"))
compute_i0_o_o_i_i1_o_o_i_fused_i2_o_o_i_fused_i3_o_o_i_fused = s[compute].fuse(compute_i0_o_o_i, compute_i1_o_o_i, compute_i2_o_o_i, compute_i3_o_o_i)
s[compute].bind(compute_i0_o_o_i_i1_o_o_i_fused_i2_o_o_i_fused_i3_o_o_i_fused, te.thread_axis("vthread"))
compute_i0_o_i_i1_o_i_fused_i2_o_i_fused_i3_o_i_fused = s[compute].fuse(compute_i0_o_i, compute_i1_o_i, compute_i2_o_i, compute_i3_o_i)
s[compute].bind(compute_i0_o_i_i1_o_i_fused_i2_o_i_fused_i3_o_i_fused, te.thread_axis("threadIdx.x"))
kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused = s[kernel_shared].fuse(kernel_shared_ax0, kernel_shared_ax1, kernel_shared_ax2, kernel_shared_ax3)
kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_i = s[kernel_shared].split(kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused, factor=6)
s[kernel_shared].vectorize(kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_i)
kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_o, kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i = s[kernel_shared].split(kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, factor=14)
s[kernel_shared].bind(kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i, te.thread_axis("threadIdx.x"))
pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused = s[pad_temp_shared].fuse(pad_temp_shared_ax0, pad_temp_shared_ax1, pad_temp_shared_ax2, pad_temp_shared_ax3)
pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_i = s[pad_temp_shared].split(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused, factor=1)
s[pad_temp_shared].vectorize(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_i)
pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_o, pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i = s[pad_temp_shared].split(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, factor=14)
s[pad_temp_shared].bind(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i, te.thread_axis("threadIdx.x"))
s[conv2d_nchw].pragma(conv2d_nchw_nn_o_o_o_o, "auto_unroll_max_step", 1024)
s[conv2d_nchw].pragma(conv2d_nchw_nn_o_o_o_o, "unroll_explicit", True)
CUDA source code:
#ifdef _WIN32
using uint = unsigned int;
using uchar = unsigned char;
using ushort = unsigned short;
using int64_t = long long;
using uint64_t = unsigned long long;
#else
#define uint unsigned int
#define uchar unsigned char
#define ushort unsigned short
#define int64_t long long
#define uint64_t unsigned long long
#endif
extern "C" __global__ void __launch_bounds__(14) default_function_kernel0(float* __restrict__ data, float* __restrict__ kernel, float* __restrict__ compute, float* __restrict__ bias) {
float conv2d_nchw[14];
__shared__ float pad_temp_shared[162];
__shared__ float kernel_shared[72];
conv2d_nchw[0] = 0.000000e+00f;
conv2d_nchw[1] = 0.000000e+00f;
conv2d_nchw[2] = 0.000000e+00f;
conv2d_nchw[3] = 0.000000e+00f;
conv2d_nchw[4] = 0.000000e+00f;
conv2d_nchw[5] = 0.000000e+00f;
conv2d_nchw[6] = 0.000000e+00f;
conv2d_nchw[7] = 0.000000e+00f;
conv2d_nchw[8] = 0.000000e+00f;
conv2d_nchw[9] = 0.000000e+00f;
conv2d_nchw[10] = 0.000000e+00f;
conv2d_nchw[11] = 0.000000e+00f;
conv2d_nchw[12] = 0.000000e+00f;
conv2d_nchw[13] = 0.000000e+00f;
for (int rc_outer_outer = 0; rc_outer_outer < 256; ++rc_outer_outer) {
__syncthreads();
pad_temp_shared[((int)threadIdx.x)] = ((((9 <= ((int)threadIdx.x)) && (1 <= (((int)threadIdx.x) % 9))) && ((((int)threadIdx.x) % 9) < 8)) ? data[((((rc_outer_outer * 98) + ((((int)threadIdx.x) / 9) * 7)) + (((int)threadIdx.x) % 9)) - 8)] : 0.000000e+00f);
pad_temp_shared[(((int)threadIdx.x) + 14)] = (((1 <= ((((int)threadIdx.x) + 5) % 9)) && (((((int)threadIdx.x) + 5) % 9) < 8)) ? data[((((rc_outer_outer * 98) + (((((int)threadIdx.x) + 14) / 9) * 7)) + ((((int)threadIdx.x) + 5) % 9)) - 8)] : 0.000000e+00f);
pad_temp_shared[(((int)threadIdx.x) + 28)] = (((1 <= ((((int)threadIdx.x) + 1) % 9)) && (((((int)threadIdx.x) + 1) % 9) < 8)) ? data[((((rc_outer_outer * 98) + (((((int)threadIdx.x) + 28) / 9) * 7)) + ((((int)threadIdx.x) + 1) % 9)) - 8)] : 0.000000e+00f);
pad_temp_shared[(((int)threadIdx.x) + 42)] = (((1 <= ((((int)threadIdx.x) + 6) % 9)) && (((((int)threadIdx.x) + 6) % 9) < 8)) ? data[((((rc_outer_outer * 98) + (((((int)threadIdx.x) + 42) / 9) * 7)) + ((((int)threadIdx.x) + 6) % 9)) - 8)] : 0.000000e+00f);
pad_temp_shared[(((int)threadIdx.x) + 56)] = (((1 <= ((((int)threadIdx.x) + 2) % 9)) && (((((int)threadIdx.x) + 2) % 9) < 8)) ? data[((((rc_outer_outer * 98) + (((((int)threadIdx.x) + 56) / 9) * 7)) + ((((int)threadIdx.x) + 2) % 9)) - 8)] : 0.000000e+00f);
pad_temp_shared[(((int)threadIdx.x) + 70)] = (((((9 <= ((((int)threadIdx.x) + 70) % 81)) && (((((int)threadIdx.x) + 70) % 81) < 72)) && (1 <= ((((int)threadIdx.x) + 7) % 9))) && (((((int)threadIdx.x) + 7) % 9) < 8)) ? data[(((((rc_outer_outer * 98) + (((((int)threadIdx.x) + 70) / 81) * 49)) + ((((((int)threadIdx.x) + 70) % 81) / 9) * 7)) + ((((int)threadIdx.x) + 7) % 9)) - 8)] : 0.000000e+00f);
pad_temp_shared[(((int)threadIdx.x) + 84)] = ((((6 <= ((int)threadIdx.x)) && (1 <= ((((int)threadIdx.x) + 3) % 9))) && (((((int)threadIdx.x) + 3) % 9) < 8)) ? data[(((((rc_outer_outer * 98) + (((((int)threadIdx.x) + 84) / 81) * 49)) + (((((int)threadIdx.x) + 3) / 9) * 7)) + ((((int)threadIdx.x) + 3) % 9)) - 8)] : 0.000000e+00f);
pad_temp_shared[(((int)threadIdx.x) + 98)] = (((1 <= ((((int)threadIdx.x) + 8) % 9)) && (((((int)threadIdx.x) + 8) % 9) < 8)) ? data[(((((rc_outer_outer * 98) + (((((int)threadIdx.x) + 98) / 81) * 49)) + (((((int)threadIdx.x) + 17) / 9) * 7)) + ((((int)threadIdx.x) + 8) % 9)) - 8)] : 0.000000e+00f);
pad_temp_shared[(((int)threadIdx.x) + 112)] = (((1 <= ((((int)threadIdx.x) + 4) % 9)) && (((((int)threadIdx.x) + 4) % 9) < 8)) ? data[(((((rc_outer_outer * 98) + (((((int)threadIdx.x) + 112) / 81) * 49)) + (((((int)threadIdx.x) + 31) / 9) * 7)) + ((((int)threadIdx.x) + 4) % 9)) - 8)] : 0.000000e+00f);
pad_temp_shared[(((int)threadIdx.x) + 126)] = (((1 <= (((int)threadIdx.x) % 9)) && ((((int)threadIdx.x) % 9) < 8)) ? data[(((((rc_outer_outer * 98) + (((((int)threadIdx.x) + 126) / 81) * 49)) + ((((int)threadIdx.x) / 9) * 7)) + (((int)threadIdx.x) % 9)) + 27)] : 0.000000e+00f);
pad_temp_shared[(((int)threadIdx.x) + 140)] = ((((((int)threadIdx.x) < 13) && (1 <= ((((int)threadIdx.x) + 5) % 9))) && (((((int)threadIdx.x) + 5) % 9) < 8)) ? data[(((((rc_outer_outer * 98) + (((((int)threadIdx.x) + 140) / 81) * 49)) + (((((int)threadIdx.x) + 59) / 9) * 7)) + ((((int)threadIdx.x) + 5) % 9)) - 8)] : 0.000000e+00f);
if (((int)threadIdx.x) < 8) {
pad_temp_shared[(((int)threadIdx.x) + 154)] = 0.000000e+00f;
}
if (((int)threadIdx.x) < 12) {
kernel_shared[(((int)threadIdx.x) * 6)] = kernel[((((((int)blockIdx.x) * 18432) + ((((int)threadIdx.x) / 3) * 4608)) + (rc_outer_outer * 18)) + ((((int)threadIdx.x) % 3) * 6))];
}
if (((int)threadIdx.x) < 12) {
kernel_shared[((((int)threadIdx.x) * 6) + 1)] = kernel[(((((((int)blockIdx.x) * 18432) + ((((int)threadIdx.x) / 3) * 4608)) + (rc_outer_outer * 18)) + ((((int)threadIdx.x) % 3) * 6)) + 1)];
}
if (((int)threadIdx.x) < 12) {
kernel_shared[((((int)threadIdx.x) * 6) + 2)] = kernel[(((((((int)blockIdx.x) * 18432) + ((((int)threadIdx.x) / 3) * 4608)) + (rc_outer_outer * 18)) + ((((int)threadIdx.x) % 3) * 6)) + 2)];
}
if (((int)threadIdx.x) < 12) {
kernel_shared[((((int)threadIdx.x) * 6) + 3)] = kernel[(((((((int)blockIdx.x) * 18432) + ((((int)threadIdx.x) / 3) * 4608)) + (rc_outer_outer * 18)) + ((((int)threadIdx.x) % 3) * 6)) + 3)];
}
if (((int)threadIdx.x) < 12) {
kernel_shared[((((int)threadIdx.x) * 6) + 4)] = kernel[(((((((int)blockIdx.x) * 18432) + ((((int)threadIdx.x) / 3) * 4608)) + (rc_outer_outer * 18)) + ((((int)threadIdx.x) % 3) * 6)) + 4)];
}
if (((int)threadIdx.x) < 12) {
kernel_shared[((((int)threadIdx.x) * 6) + 5)] = kernel[(((((((int)blockIdx.x) * 18432) + ((((int)threadIdx.x) / 3) * 4608)) + (rc_outer_outer * 18)) + ((((int)threadIdx.x) % 3) * 6)) + 5)];
}
__syncthreads();
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((int)threadIdx.x) % 7)] * kernel_shared[((((int)threadIdx.x) / 7) * 36)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 9)] * kernel_shared[((((int)threadIdx.x) / 7) * 36)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 18)] * kernel_shared[((((int)threadIdx.x) / 7) * 36)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 27)] * kernel_shared[((((int)threadIdx.x) / 7) * 36)]));
conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 36)] * kernel_shared[((((int)threadIdx.x) / 7) * 36)]));
conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 45)] * kernel_shared[((((int)threadIdx.x) / 7) * 36)]));
conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 54)] * kernel_shared[((((int)threadIdx.x) / 7) * 36)]));
conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[(((int)threadIdx.x) % 7)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 18)]));
conv2d_nchw[8] = (conv2d_nchw[8] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 9)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 18)]));
conv2d_nchw[9] = (conv2d_nchw[9] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 18)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 18)]));
conv2d_nchw[10] = (conv2d_nchw[10] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 27)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 18)]));
conv2d_nchw[11] = (conv2d_nchw[11] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 36)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 18)]));
conv2d_nchw[12] = (conv2d_nchw[12] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 45)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 18)]));
conv2d_nchw[13] = (conv2d_nchw[13] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 54)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 18)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 9)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 3)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 18)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 3)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 27)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 3)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 36)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 3)]));
conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 45)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 3)]));
conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 54)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 3)]));
conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 63)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 3)]));
conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 9)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 21)]));
conv2d_nchw[8] = (conv2d_nchw[8] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 18)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 21)]));
conv2d_nchw[9] = (conv2d_nchw[9] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 27)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 21)]));
conv2d_nchw[10] = (conv2d_nchw[10] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 36)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 21)]));
conv2d_nchw[11] = (conv2d_nchw[11] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 45)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 21)]));
conv2d_nchw[12] = (conv2d_nchw[12] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 54)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 21)]));
conv2d_nchw[13] = (conv2d_nchw[13] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 63)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 21)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 18)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 6)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 27)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 6)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 36)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 6)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 45)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 6)]));
conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 54)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 6)]));
conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 63)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 6)]));
conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 72)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 6)]));
conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 18)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 24)]));
conv2d_nchw[8] = (conv2d_nchw[8] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 27)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 24)]));
conv2d_nchw[9] = (conv2d_nchw[9] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 36)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 24)]));
conv2d_nchw[10] = (conv2d_nchw[10] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 45)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 24)]));
conv2d_nchw[11] = (conv2d_nchw[11] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 54)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 24)]));
conv2d_nchw[12] = (conv2d_nchw[12] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 63)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 24)]));
conv2d_nchw[13] = (conv2d_nchw[13] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 72)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 24)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 81)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 9)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 90)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 9)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 99)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 9)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 108)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 9)]));
conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 117)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 9)]));
conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 126)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 9)]));
conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 135)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 9)]));
conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 81)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 27)]));
conv2d_nchw[8] = (conv2d_nchw[8] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 90)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 27)]));
conv2d_nchw[9] = (conv2d_nchw[9] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 99)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 27)]));
conv2d_nchw[10] = (conv2d_nchw[10] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 108)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 27)]));
conv2d_nchw[11] = (conv2d_nchw[11] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 117)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 27)]));
conv2d_nchw[12] = (conv2d_nchw[12] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 126)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 27)]));
conv2d_nchw[13] = (conv2d_nchw[13] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 135)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 27)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 90)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 12)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 99)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 12)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 108)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 12)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 117)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 12)]));
conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 126)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 12)]));
conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 135)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 12)]));
conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 144)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 12)]));
conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 90)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 30)]));
conv2d_nchw[8] = (conv2d_nchw[8] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 99)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 30)]));
conv2d_nchw[9] = (conv2d_nchw[9] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 108)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 30)]));
conv2d_nchw[10] = (conv2d_nchw[10] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 117)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 30)]));
conv2d_nchw[11] = (conv2d_nchw[11] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 126)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 30)]));
conv2d_nchw[12] = (conv2d_nchw[12] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 135)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 30)]));
conv2d_nchw[13] = (conv2d_nchw[13] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 144)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 30)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 99)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 15)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 108)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 15)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 117)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 15)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 126)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 15)]));
conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 135)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 15)]));
conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 144)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 15)]));
conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 153)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 15)]));
conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 99)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 33)]));
conv2d_nchw[8] = (conv2d_nchw[8] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 108)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 33)]));
conv2d_nchw[9] = (conv2d_nchw[9] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 117)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 33)]));
conv2d_nchw[10] = (conv2d_nchw[10] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 126)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 33)]));
conv2d_nchw[11] = (conv2d_nchw[11] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 135)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 33)]));
conv2d_nchw[12] = (conv2d_nchw[12] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 144)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 33)]));
conv2d_nchw[13] = (conv2d_nchw[13] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 153)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 33)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 1)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 1)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 10)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 1)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 19)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 1)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 28)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 1)]));
conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 37)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 1)]));
conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 46)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 1)]));
conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 55)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 1)]));
conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 1)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 19)]));
conv2d_nchw[8] = (conv2d_nchw[8] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 10)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 19)]));
conv2d_nchw[9] = (conv2d_nchw[9] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 19)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 19)]));
conv2d_nchw[10] = (conv2d_nchw[10] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 28)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 19)]));
conv2d_nchw[11] = (conv2d_nchw[11] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 37)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 19)]));
conv2d_nchw[12] = (conv2d_nchw[12] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 46)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 19)]));
conv2d_nchw[13] = (conv2d_nchw[13] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 55)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 19)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 10)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 4)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 19)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 4)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 28)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 4)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 37)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 4)]));
conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 46)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 4)]));
conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 55)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 4)]));
conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 64)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 4)]));
conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 10)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 22)]));
conv2d_nchw[8] = (conv2d_nchw[8] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 19)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 22)]));
conv2d_nchw[9] = (conv2d_nchw[9] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 28)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 22)]));
conv2d_nchw[10] = (conv2d_nchw[10] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 37)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 22)]));
conv2d_nchw[11] = (conv2d_nchw[11] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 46)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 22)]));
conv2d_nchw[12] = (conv2d_nchw[12] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 55)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 22)]));
conv2d_nchw[13] = (conv2d_nchw[13] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 64)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 22)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 19)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 7)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 28)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 7)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 37)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 7)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 46)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 7)]));
conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 55)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 7)]));
conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 64)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 7)]));
conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 73)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 7)]));
conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 19)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 25)]));
conv2d_nchw[8] = (conv2d_nchw[8] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 28)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 25)]));
conv2d_nchw[9] = (conv2d_nchw[9] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 37)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 25)]));
conv2d_nchw[10] = (conv2d_nchw[10] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 46)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 25)]));
conv2d_nchw[11] = (conv2d_nchw[11] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 55)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 25)]));
conv2d_nchw[12] = (conv2d_nchw[12] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 64)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 25)]));
conv2d_nchw[13] = (conv2d_nchw[13] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 73)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 25)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 82)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 10)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 91)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 10)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 100)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 10)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 109)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 10)]));
conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 118)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 10)]));
conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 127)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 10)]));
conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 136)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 10)]));
conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 82)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 28)]));
conv2d_nchw[8] = (conv2d_nchw[8] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 91)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 28)]));
conv2d_nchw[9] = (conv2d_nchw[9] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 100)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 28)]));
conv2d_nchw[10] = (conv2d_nchw[10] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 109)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 28)]));
conv2d_nchw[11] = (conv2d_nchw[11] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 118)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 28)]));
conv2d_nchw[12] = (conv2d_nchw[12] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 127)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 28)]));
conv2d_nchw[13] = (conv2d_nchw[13] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 136)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 28)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 91)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 13)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 100)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 13)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 109)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 13)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 118)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 13)]));
conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 127)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 13)]));
conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 136)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 13)]));
conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 145)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 13)]));
conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 91)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 31)]));
conv2d_nchw[8] = (conv2d_nchw[8] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 100)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 31)]));
conv2d_nchw[9] = (conv2d_nchw[9] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 109)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 31)]));
conv2d_nchw[10] = (conv2d_nchw[10] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 118)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 31)]));
conv2d_nchw[11] = (conv2d_nchw[11] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 127)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 31)]));
conv2d_nchw[12] = (conv2d_nchw[12] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 136)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 31)]));
conv2d_nchw[13] = (conv2d_nchw[13] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 145)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 31)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 100)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 16)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 109)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 16)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 118)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 16)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 127)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 16)]));
conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 136)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 16)]));
conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 145)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 16)]));
conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 154)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 16)]));
conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 100)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 34)]));
conv2d_nchw[8] = (conv2d_nchw[8] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 109)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 34)]));
conv2d_nchw[9] = (conv2d_nchw[9] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 118)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 34)]));
conv2d_nchw[10] = (conv2d_nchw[10] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 127)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 34)]));
conv2d_nchw[11] = (conv2d_nchw[11] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 136)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 34)]));
conv2d_nchw[12] = (conv2d_nchw[12] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 145)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 34)]));
conv2d_nchw[13] = (conv2d_nchw[13] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 154)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 34)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 2)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 2)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 11)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 2)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 20)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 2)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 29)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 2)]));
conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 38)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 2)]));
conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 47)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 2)]));
conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 56)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 2)]));
conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 2)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 20)]));
conv2d_nchw[8] = (conv2d_nchw[8] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 11)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 20)]));
conv2d_nchw[9] = (conv2d_nchw[9] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 20)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 20)]));
conv2d_nchw[10] = (conv2d_nchw[10] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 29)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 20)]));
conv2d_nchw[11] = (conv2d_nchw[11] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 38)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 20)]));
conv2d_nchw[12] = (conv2d_nchw[12] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 47)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 20)]));
conv2d_nchw[13] = (conv2d_nchw[13] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 56)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 20)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 11)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 5)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 20)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 5)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 29)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 5)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 38)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 5)]));
conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 47)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 5)]));
conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 56)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 5)]));
conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 65)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 5)]));
conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 11)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 23)]));
conv2d_nchw[8] = (conv2d_nchw[8] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 20)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 23)]));
conv2d_nchw[9] = (conv2d_nchw[9] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 29)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 23)]));
conv2d_nchw[10] = (conv2d_nchw[10] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 38)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 23)]));
conv2d_nchw[11] = (conv2d_nchw[11] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 47)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 23)]));
conv2d_nchw[12] = (conv2d_nchw[12] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 56)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 23)]));
conv2d_nchw[13] = (conv2d_nchw[13] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 65)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 23)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 20)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 8)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 29)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 8)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 38)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 8)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 47)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 8)]));
conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 56)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 8)]));
conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 65)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 8)]));
conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 74)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 8)]));
conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 20)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 26)]));
conv2d_nchw[8] = (conv2d_nchw[8] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 29)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 26)]));
conv2d_nchw[9] = (conv2d_nchw[9] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 38)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 26)]));
conv2d_nchw[10] = (conv2d_nchw[10] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 47)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 26)]));
conv2d_nchw[11] = (conv2d_nchw[11] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 56)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 26)]));
conv2d_nchw[12] = (conv2d_nchw[12] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 65)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 26)]));
conv2d_nchw[13] = (conv2d_nchw[13] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 74)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 26)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 83)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 11)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 92)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 11)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 101)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 11)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 110)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 11)]));
conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 119)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 11)]));
conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 128)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 11)]));
conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 137)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 11)]));
conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 83)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 29)]));
conv2d_nchw[8] = (conv2d_nchw[8] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 92)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 29)]));
conv2d_nchw[9] = (conv2d_nchw[9] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 101)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 29)]));
conv2d_nchw[10] = (conv2d_nchw[10] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 110)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 29)]));
conv2d_nchw[11] = (conv2d_nchw[11] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 119)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 29)]));
conv2d_nchw[12] = (conv2d_nchw[12] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 128)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 29)]));
conv2d_nchw[13] = (conv2d_nchw[13] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 137)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 29)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 92)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 14)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 101)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 14)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 110)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 14)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 119)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 14)]));
conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 128)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 14)]));
conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 137)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 14)]));
conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 146)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 14)]));
conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 92)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 32)]));
conv2d_nchw[8] = (conv2d_nchw[8] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 101)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 32)]));
conv2d_nchw[9] = (conv2d_nchw[9] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 110)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 32)]));
conv2d_nchw[10] = (conv2d_nchw[10] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 119)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 32)]));
conv2d_nchw[11] = (conv2d_nchw[11] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 128)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 32)]));
conv2d_nchw[12] = (conv2d_nchw[12] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 137)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 32)]));
conv2d_nchw[13] = (conv2d_nchw[13] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 146)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 32)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 101)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 17)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 110)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 17)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 119)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 17)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 128)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 17)]));
conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 137)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 17)]));
conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 146)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 17)]));
conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 155)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 17)]));
conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 101)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 35)]));
conv2d_nchw[8] = (conv2d_nchw[8] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 110)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 35)]));
conv2d_nchw[9] = (conv2d_nchw[9] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 119)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 35)]));
conv2d_nchw[10] = (conv2d_nchw[10] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 128)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 35)]));
conv2d_nchw[11] = (conv2d_nchw[11] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 137)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 35)]));
conv2d_nchw[12] = (conv2d_nchw[12] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 146)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 35)]));
conv2d_nchw[13] = (conv2d_nchw[13] + (pad_temp_shared[((((int)threadIdx.x) % 7) + 155)] * kernel_shared[(((((int)threadIdx.x) / 7) * 36) + 35)]));
}
for (int i1_inner = 0; i1_inner < 2; ++i1_inner) {
for (int i2_inner = 0; i2_inner < 7; ++i2_inner) {
compute[(((((((int)blockIdx.x) * 196) + ((((int)threadIdx.x) / 7) * 98)) + (i1_inner * 49)) + (i2_inner * 7)) + (((int)threadIdx.x) % 7))] = max((conv2d_nchw[((i1_inner * 7) + i2_inner)] + bias[(((((int)blockIdx.x) * 4) + ((((int)threadIdx.x) / 7) * 2)) + i1_inner)]), 0.000000e+00f);
}
}
}
A more complicated example is to resume the search. In this case, we need to create the search policy and cost model by ourselves and resume the status of search policy and cost model with the log file. In the example below we resume the status and do more 5 trials.
def resume_search(task, log_file):
print("Resume search:")
cost_model = auto_scheduler.XGBModel()
cost_model.update_from_file(log_file)
search_policy = auto_scheduler.SketchPolicy(
task, cost_model, init_search_callbacks=[auto_scheduler.PreloadMeasuredStates(log_file)]
)
measure_ctx = auto_scheduler.LocalRPCMeasureContext(min_repeat_ms=300)
tune_option = auto_scheduler.TuningOptions(
num_measure_trials=5,
runner=measure_ctx.runner,
measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
)
task.tune(tune_option, search_policy=search_policy)
# Kill the measurement process
del measure_ctx
resume_search(task, log_file)
Resume search:
/venv/apache-tvm-py3.7/lib/python3.7/site-packages/xgboost/training.py:17: UserWarning: Old style callback is deprecated. See: https://xgboost.readthedocs.io/en/latest/python/callbacks.html
warnings.warn(f'Old style callback is deprecated. See: {link}', UserWarning)
Get devices for measurement successfully!
Total running time of the script: ( 6 minutes 8.038 seconds)