Auto-scheduling a Convolution Layer for GPU
Author: Lianmin Zheng, Chengfan Jia
This is a tutorial on how to use the auto-scheduler for GPUs.
Unlike template-based AutoTVM, which relies on manual templates to define the search space, the auto-scheduler does not require any templates. Users only need to write the computation declaration, without any schedule commands or templates. The auto-scheduler then generates a large search space automatically and finds a good schedule in it.
We use a convolution layer as an example in this tutorial.
Note that this tutorial will not run on Windows or recent versions of macOS. To get it to run, you will need to wrap the body of this tutorial in an if __name__ == "__main__": block.
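A minimal sketch of that wrapping is shown here. The function name `run_tutorial` is hypothetical and its body is only a placeholder; the actual tutorial statements would go inside it.

```python
def run_tutorial():
    """Hypothetical wrapper: the tutorial body (task creation, tuning,
    evaluation) would be placed here instead of at module level."""
    steps = ["define computation", "create search task", "run search"]
    for step in steps:
        print("running:", step)
    return steps


# Code under this guard runs only when the file is executed as a script,
# not when the module is imported again (for example by a worker process).
if __name__ == "__main__":
    run_tutorial()
```

Keeping the body inside a function also means nothing heavy happens at import time, which is presumably what matters on platforms that start worker processes by re-importing the main module.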
import os
import numpy as np
import tvm
from tvm import te, auto_scheduler, topi
from tvm.topi.testing import conv2d_nchw_python
Define the computation
To begin with, let us define the computation of a convolution layer. The function should return the list of input/output tensors. From these tensors, the auto-scheduler can get the whole computational graph.
@auto_scheduler.register_workload
def conv2d_layer(N, H, W, CO, CI, KH, KW, stride, padding):
    data = te.placeholder((N, CI, H, W), name="data")
    kernel = te.placeholder((CO, CI, KH, KW), name="kernel")
    bias = te.placeholder((1, CO, 1, 1), name="bias")
    conv = topi.nn.conv2d_nchw(data, kernel, stride, padding, dilation=1, out_dtype="float32")
    out = topi.nn.relu(conv + bias)
    return [data, kernel, bias, out]
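To make the semantics of this declaration concrete, here is a naive pure-Python reference for the same computation (NCHW convolution, per-channel bias, ReLU). It is a sketch for small inputs only, not part of TVM; the function name `conv2d_bias_relu` and the nested-list layout are assumptions made for illustration.

```python
def conv2d_bias_relu(data, kernel, bias, stride=1, padding=1):
    """Naive NCHW convolution + bias + ReLU on nested lists.

    data:   [N][CI][H][W]
    kernel: [CO][CI][KH][KW]
    bias:   [CO] (one value per output channel)
    """
    n, ci = len(data), len(data[0])
    h, w = len(data[0][0]), len(data[0][0][0])
    co, kh, kw = len(kernel), len(kernel[0][0]), len(kernel[0][0][0])
    oh = (h + 2 * padding - kh) // stride + 1
    ow = (w + 2 * padding - kw) // stride + 1

    def padded(b, c, y, x):
        # Zero padding: reads outside the image return 0.
        y, x = y - padding, x - padding
        return data[b][c][y][x] if 0 <= y < h and 0 <= x < w else 0.0

    out = [[[[0.0] * ow for _ in range(oh)] for _ in range(co)] for _ in range(n)]
    for b in range(n):
        for f in range(co):
            for y in range(oh):
                for x in range(ow):
                    acc = 0.0
                    for c in range(ci):
                        for ry in range(kh):
                            for rx in range(kw):
                                acc += (
                                    padded(b, c, y * stride + ry, x * stride + rx)
                                    * kernel[f][c][ry][rx]
                                )
                    # Bias is broadcast over the spatial axes, then ReLU.
                    out[b][f][y][x] = max(acc + bias[f], 0.0)
    return out


# Tiny example: a 3x3 image of ones, one 3x3 kernel of ones, bias -5.
data = [[[[1.0] * 3 for _ in range(3)]]]
kernel = [[[[1.0] * 3 for _ in range(3)]]]
result = conv2d_bias_relu(data, kernel, bias=[-5.0])
print(result[0][0])  # [[0.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 0.0]]
```

The corner outputs see only four in-bounds inputs, so 4 - 5 is clipped to 0 by the ReLU, while the center sees all nine and yields 4.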
Create the search task
We then create a search task for the last convolution layer in ResNet-50.
target = tvm.target.Target("cuda")
# Use the last layer in ResNet-50
N, H, W, CO, CI, KH, KW, strides, padding = 1, 7, 7, 512, 512, 3, 3, (1, 1), (1, 1)
task = auto_scheduler.SearchTask(
    func=conv2d_layer, args=(N, H, W, CO, CI, KH, KW, strides, padding), target=target
)
# Inspect the computational graph
print("Computational DAG:")
print(task.compute_dag)
Computational DAG:
data = PLACEHOLDER [1, 512, 7, 7]
pad_temp(i0, i1, i2, i3) = tir.if_then_else(((((i2 >= 1) && (i2 < 8)) && (i3 >= 1)) && (i3 < 8)), data[i0, i1, (i2 - 1), (i3 - 1)], 0f)
kernel = PLACEHOLDER [512, 512, 3, 3]
conv2d_nchw(nn, ff, yy, xx) += (pad_temp[nn, rc, (yy + ry), (xx + rx)]*kernel[ff, rc, ry, rx])
bias = PLACEHOLDER [1, 512, 1, 1]
T_add(ax0, ax1, ax2, ax3) = (conv2d_nchw[ax0, ax1, ax2, ax3] + bias[ax0, ax1, 0, 0])
compute(i0, i1, i2, i3) = max(T_add[i0, i1, i2, i3], 0f)
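The shapes in the printed DAG can be checked by hand. The sketch below recomputes them from the task arguments using the standard convolution output-size formula; the helper name `conv_out_size` is an assumption for illustration and is not a TVM function.

```python
def conv_out_size(in_size, kernel, stride, padding, dilation=1):
    """Output extent of a convolution along one spatial axis."""
    effective_kernel = (kernel - 1) * dilation + 1
    return (in_size + 2 * padding - effective_kernel) // stride + 1


N, H, W, CO, CI, KH, KW = 1, 7, 7, 512, 512, 3, 3
stride, padding = 1, 1

# pad_temp has H + 2 * padding rows; rows 1..7 hold real data, which
# matches the (i2 >= 1) && (i2 < 8) condition in the DAG.
padded_h = H + 2 * padding
out_h = conv_out_size(H, KH, stride, padding)
out_w = conv_out_size(W, KW, stride, padding)
out_elems = N * CO * out_h * out_w  # elements of the output tensor
macs = out_elems * CI * KH * KW     # multiply-accumulates in conv2d_nchw

print(padded_h, (N, CO, out_h, out_w), out_elems, macs)
# 9 (1, 512, 7, 7) 25088 115605504
```

With stride 1 and padding 1, a 3x3 kernel preserves the 7x7 spatial size, so the output has the same shape as the input.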
Next, we set parameters for the auto-scheduler. These parameters mainly specify how we do the measurement during the search.
measure_ctx launches a separate process for measurement to provide isolation. It protects the main process from GPU crashes during measurement and avoids other runtime conflicts.
min_repeat_ms defines the minimum duration of one “repeat” in every measurement. This warms up the GPU, which is necessary to get accurate measurement results. Typically, we recommend a value >= 300 ms.
num_measure_trials is the number of measurement trials we can use during the search. We only make 10 trials in this tutorial for a fast demonstration. In practice, 1000 is a good value for the search to converge. You can do more trials according to your time budget.
In addition, we use RecordToFile to dump measurement records into the file conv2d.json. The measurement records can be used to query the history best, resume the search, and do more analyses later.
See auto_scheduler.TuningOptions and auto_scheduler.LocalRPCMeasureContext for more parameters.
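The idea behind a minimum repeat duration can be sketched in plain Python: keep increasing the number of runs per repeat until one repeat lasts long enough to be timed reliably. This is only an illustration of the concept, not TVM's measurement code; `time_one_repeat` and `workload` are hypothetical names, and a CPU loop stands in for a GPU kernel.

```python
import time


def time_one_repeat(func, min_repeat_ms, number=1):
    """Run `func` `number` times per repeat, doubling `number` until one
    repeat lasts at least `min_repeat_ms`. Returns (mean_seconds, number)."""
    while True:
        start = time.perf_counter()
        for _ in range(number):
            func()
        elapsed = time.perf_counter() - start
        if elapsed * 1000.0 >= min_repeat_ms:
            return elapsed / number, number
        number *= 2  # repeat was too short to be trusted: run more iterations


def workload():
    # Stand-in for a kernel launch (hypothetical CPU workload).
    return sum(i * i for i in range(1000))


mean_s, runs = time_one_repeat(workload, min_repeat_ms=20)
print(f"{runs} runs per repeat, {mean_s * 1e6:.1f} us per run")
```

A longer repeat averages over more runs, so one-off effects such as clock ramp-up have less influence on the reported mean.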
log_file = "conv2d.json"
measure_ctx = auto_scheduler.LocalRPCMeasureContext(min_repeat_ms=300)
tune_option = auto_scheduler.TuningOptions(
    num_measure_trials=10,  # change this to 1000 to achieve the best performance
    runner=measure_ctx.runner,
    measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
    verbose=2,
)
Get devices for measurement successfully!
Run the search
Now we have all inputs ready. Pretty simple, isn’t it? We can kick off the search and let the auto-scheduler do its magic. After some measurement trials, we can load the best schedule from the log file and apply it.
# Run auto-tuning (search)
task.tune(tune_option)
# Apply the best schedule
sch, args = task.apply_best(log_file)
# Kill the measurement process
del measure_ctx
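Conceptually, picking the best schedule from a log means scanning the records and keeping the valid one with the lowest measured cost. The sketch below shows that selection on a hypothetical, simplified record layout; the field names `schedule`, `costs`, and `error` are assumptions for illustration and are not the format written by RecordToFile.

```python
import json

# Hypothetical, simplified records: one JSON object per line with a
# schedule identifier, measured costs in seconds, and an error flag.
# This is NOT the on-disk format written by RecordToFile.
log_lines = [
    '{"schedule": "s0", "costs": [0.0031, 0.0030], "error": 0}',
    '{"schedule": "s1", "costs": [0.0012, 0.0014], "error": 0}',
    '{"schedule": "s2", "costs": [0.0001], "error": 1}',
]


def best_record(lines):
    """Return the valid record with the lowest mean cost, or None."""
    best, best_cost = None, float("inf")
    for line in lines:
        rec = json.loads(line)
        if rec["error"] != 0:
            continue  # failed measurements are not candidates
        mean_cost = sum(rec["costs"]) / len(rec["costs"])
        if mean_cost < best_cost:
            best, best_cost = rec, mean_cost
    return best


print(best_record(log_lines)["schedule"])  # s1
```

Record "s2" reports the smallest cost but is skipped because its measurement failed, which is why the error flag has to be checked before comparing costs.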
We can lower the schedule to see the IR after auto-scheduling. The auto-scheduler correctly performs optimizations including multi-level tiling, cooperative fetching, unrolling and operator fusion.
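One way to read the multi-level tiling in the TIR printed for this run is to check that the launch configuration covers the output exactly. The sketch below uses the extents that appear in that printout (128 blocks, 49 threads, a 4-element local accumulator, shared buffers of 1296 and 576 elements); a different tuning run may choose different values, and the interpretation of each number is an assumption based on the index expressions.

```python
# Extents read from the TIR printed for this run (other runs may differ).
num_blocks = 128            # blockIdx.x thread_extent
threads_per_block = 49      # threadIdx.x thread_extent
accum_per_thread = 4        # size of the local conv2d_nchw buffer
rc_outer_outer = 32         # outer reduction loop over input channels
pad_shared_elems = 1296     # pad_temp.shared
kernel_shared_elems = 576   # kernel.shared

N, CO, CI, H, W, KH, KW, padding = 1, 512, 512, 7, 7, 3, 3, 1
padded_hw = (H + 2 * padding) * (W + 2 * padding)  # 9 x 9 per input channel

# Every output element is owned by one (block, thread, accumulator) triple.
total_outputs = num_blocks * threads_per_block * accum_per_thread

# Input channels staged in shared memory per outer reduction step.
ci_per_step = pad_shared_elems // padded_hw
# Output channels whose weights are staged per block.
co_per_block = kernel_shared_elems // (ci_per_step * KH * KW)

print(total_outputs, ci_per_step, co_per_block)  # 25088 16 4
```

Under this reading, each block handles 4 output channels over the full 7x7 plane, each thread owns one spatial position, and the 512 input channels are consumed 16 at a time through shared memory.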
Lowered TIR:
@main = primfn(data_1: handle, kernel_1: handle, bias_1: handle, compute_1: handle) -> ()
attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
buffers = {data: Buffer(data_2: Pointer(float32), float32, [25088], []),
kernel: Buffer(kernel_2: Pointer(float32), float32, [2359296], []),
bias: Buffer(bias_2: Pointer(float32), float32, [512], []),
compute: Buffer(compute_2: Pointer(float32), float32, [25088], [])}
buffer_map = {data_1: data, kernel_1: kernel, bias_1: bias, compute_1: compute}
preflattened_buffer_map = {data_1: data_3: Buffer(data_2, float32, [1, 512, 7, 7], []), kernel_1: kernel_3: Buffer(kernel_2, float32, [512, 512, 3, 3], []), bias_1: bias_3: Buffer(bias_2, float32, [1, 512, 1, 1], []), compute_1: compute_3: Buffer(compute_2, float32, [1, 512, 7, 7], [])} {
attr [IterVar(blockIdx.x: int32, (nullptr), "ThreadIndex", "blockIdx.x")] "thread_extent" = 128;
allocate(conv2d_nchw: Pointer(local float32), float32, [4]), storage_scope = local;
allocate(pad_temp.shared: Pointer(shared float32), float32, [1296]), storage_scope = shared;
allocate(kernel.shared: Pointer(shared float32), float32, [576]), storage_scope = shared;
attr [IterVar(threadIdx.x: int32, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49 {
conv2d_nchw_1: Buffer(conv2d_nchw, float32, [4], [], scope="local", align=16)[0] = 0f32
conv2d_nchw_1[1] = 0f32
conv2d_nchw_1[2] = 0f32
conv2d_nchw_1[3] = 0f32
for (rc.outer.outer: int32, 0, 32) {
let cse_var_2: int32 = (rc.outer.outer*784)
let cse_var_1: int32 = (rc.outer.outer*144)
{
attr [IterVar(threadIdx.x_1: int32, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
pad_temp.shared_1: Buffer(pad_temp.shared, float32, [1296], [], scope="shared")[threadIdx.x_1] = @tir.if_then_else((((9 <= threadIdx.x_1) && (1 <= floormod(threadIdx.x_1, 9))) && (floormod(threadIdx.x_1, 9) < 8)), data[(((cse_var_2 + (floordiv(threadIdx.x_1, 9)*7)) + floormod(threadIdx.x_1, 9)) - 8)], 0f32, dtype=float32)
attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
pad_temp.shared_1[(threadIdx.x_1 + 49)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 49), 81)) && (floormod((threadIdx.x_1 + 49), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 4), 9))) && (floormod((threadIdx.x_1 + 4), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 49), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 49), 81), 9)*7)) + floormod((threadIdx.x_1 + 4), 9)) - 8)], 0f32, dtype=float32)
attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
pad_temp.shared_1[(threadIdx.x_1 + 98)] = @tir.if_then_else(((1 <= floormod((threadIdx.x_1 + 8), 9)) && (floormod((threadIdx.x_1 + 8), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 98), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 17), 81), 9)*7)) + floormod((threadIdx.x_1 + 8), 9)) - 8)], 0f32, dtype=float32)
attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
pad_temp.shared_1[(threadIdx.x_1 + 147)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 66), 81)) && (floormod((threadIdx.x_1 + 66), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 3), 9))) && (floormod((threadIdx.x_1 + 3), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 147), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 66), 81), 9)*7)) + floormod((threadIdx.x_1 + 3), 9)) - 8)], 0f32, dtype=float32)
attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
pad_temp.shared_1[(threadIdx.x_1 + 196)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 34), 81)) && (floormod((threadIdx.x_1 + 34), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 7), 9))) && (floormod((threadIdx.x_1 + 7), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 196), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 34), 81), 9)*7)) + floormod((threadIdx.x_1 + 7), 9)) - 8)], 0f32, dtype=float32)
attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
pad_temp.shared_1[(threadIdx.x_1 + 245)] = @tir.if_then_else((((9 <= floormod((threadIdx.x_1 + 2), 81)) && (1 <= floormod((threadIdx.x_1 + 2), 9))) && (floormod((threadIdx.x_1 + 2), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 245), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 2), 81), 9)*7)) + floormod((threadIdx.x_1 + 2), 9)) - 8)], 0f32, dtype=float32)
attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
pad_temp.shared_1[(threadIdx.x_1 + 294)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 51), 81)) && (floormod((threadIdx.x_1 + 51), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 6), 9))) && (floormod((threadIdx.x_1 + 6), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 294), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 51), 81), 9)*7)) + floormod((threadIdx.x_1 + 6), 9)) - 8)], 0f32, dtype=float32)
attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
pad_temp.shared_1[(threadIdx.x_1 + 343)] = @tir.if_then_else(((1 <= floormod((threadIdx.x_1 + 1), 9)) && (floormod((threadIdx.x_1 + 1), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 343), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 19), 81), 9)*7)) + floormod((threadIdx.x_1 + 1), 9)) - 8)], 0f32, dtype=float32)
attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
pad_temp.shared_1[(threadIdx.x_1 + 392)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 68), 81)) && (floormod((threadIdx.x_1 + 68), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 5), 9))) && (floormod((threadIdx.x_1 + 5), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 392), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 68), 81), 9)*7)) + floormod((threadIdx.x_1 + 5), 9)) - 8)], 0f32, dtype=float32)
attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
pad_temp.shared_1[(threadIdx.x_1 + 441)] = @tir.if_then_else(((((1 <= floormod((floordiv(threadIdx.x_1, 9) + 4), 9)) && (floormod((threadIdx.x_1 + 36), 81) < 72)) && (1 <= floormod(threadIdx.x_1, 9))) && (floormod(threadIdx.x_1, 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 441), 81)*49)) + (floormod((floordiv(threadIdx.x_1, 9) + 4), 9)*7)) + floormod(threadIdx.x_1, 9)) - 8)], 0f32, dtype=float32)
attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
pad_temp.shared_1[(threadIdx.x_1 + 490)] = @tir.if_then_else((((9 <= floormod((threadIdx.x_1 + 4), 81)) && (1 <= floormod((threadIdx.x_1 + 4), 9))) && (floormod((threadIdx.x_1 + 4), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 490), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 4), 81), 9)*7)) + floormod((threadIdx.x_1 + 4), 9)) - 8)], 0f32, dtype=float32)
attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
pad_temp.shared_1[(threadIdx.x_1 + 539)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 53), 81)) && (floormod((threadIdx.x_1 + 53), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 8), 9))) && (floormod((threadIdx.x_1 + 8), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 539), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 53), 81), 9)*7)) + floormod((threadIdx.x_1 + 8), 9)) - 8)], 0f32, dtype=float32)
attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
pad_temp.shared_1[(threadIdx.x_1 + 588)] = @tir.if_then_else(((1 <= floormod((threadIdx.x_1 + 3), 9)) && (floormod((threadIdx.x_1 + 3), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 588), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 21), 81), 9)*7)) + floormod((threadIdx.x_1 + 3), 9)) - 8)], 0f32, dtype=float32)
attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
pad_temp.shared_1[(threadIdx.x_1 + 637)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 70), 81)) && (floormod((threadIdx.x_1 + 70), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 7), 9))) && (floormod((threadIdx.x_1 + 7), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 637), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 70), 81), 9)*7)) + floormod((threadIdx.x_1 + 7), 9)) - 8)], 0f32, dtype=float32)
attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
pad_temp.shared_1[(threadIdx.x_1 + 686)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 38), 81)) && (floormod((threadIdx.x_1 + 38), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 2), 9))) && (floormod((threadIdx.x_1 + 2), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 686), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 38), 81), 9)*7)) + floormod((threadIdx.x_1 + 2), 9)) - 8)], 0f32, dtype=float32)
attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
pad_temp.shared_1[(threadIdx.x_1 + 735)] = @tir.if_then_else((((9 <= floormod((threadIdx.x_1 + 6), 81)) && (1 <= floormod((threadIdx.x_1 + 6), 9))) && (floormod((threadIdx.x_1 + 6), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 735), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 6), 81), 9)*7)) + floormod((threadIdx.x_1 + 6), 9)) - 8)], 0f32, dtype=float32)
attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
pad_temp.shared_1[(threadIdx.x_1 + 784)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 55), 81)) && (floormod((threadIdx.x_1 + 55), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 1), 9))) && (floormod((threadIdx.x_1 + 1), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 784), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 55), 81), 9)*7)) + floormod((threadIdx.x_1 + 1), 9)) - 8)], 0f32, dtype=float32)
attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
pad_temp.shared_1[(threadIdx.x_1 + 833)] = @tir.if_then_else(((1 <= floormod((threadIdx.x_1 + 5), 9)) && (floormod((threadIdx.x_1 + 5), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 833), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 23), 81), 9)*7)) + floormod((threadIdx.x_1 + 5), 9)) - 8)], 0f32, dtype=float32)
attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
pad_temp.shared_1[(threadIdx.x_1 + 882)] = @tir.if_then_else(((((1 <= floormod((floordiv(threadIdx.x_1, 9) + 8), 9)) && (floormod((threadIdx.x_1 + 72), 81) < 72)) && (1 <= floormod(threadIdx.x_1, 9))) && (floormod(threadIdx.x_1, 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 882), 81)*49)) + (floormod((floordiv(threadIdx.x_1, 9) + 8), 9)*7)) + floormod(threadIdx.x_1, 9)) - 8)], 0f32, dtype=float32)
attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
pad_temp.shared_1[(threadIdx.x_1 + 931)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 40), 81)) && (floormod((threadIdx.x_1 + 40), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 4), 9))) && (floormod((threadIdx.x_1 + 4), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 931), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 40), 81), 9)*7)) + floormod((threadIdx.x_1 + 4), 9)) - 8)], 0f32, dtype=float32)
attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
pad_temp.shared_1[(threadIdx.x_1 + 980)] = @tir.if_then_else((((9 <= floormod((threadIdx.x_1 + 8), 81)) && (1 <= floormod((threadIdx.x_1 + 8), 9))) && (floormod((threadIdx.x_1 + 8), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 980), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 8), 81), 9)*7)) + floormod((threadIdx.x_1 + 8), 9)) - 8)], 0f32, dtype=float32)
attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
pad_temp.shared_1[(threadIdx.x_1 + 1029)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 57), 81)) && (floormod((threadIdx.x_1 + 57), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 3), 9))) && (floormod((threadIdx.x_1 + 3), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 1029), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 57), 81), 9)*7)) + floormod((threadIdx.x_1 + 3), 9)) - 8)], 0f32, dtype=float32)
attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
pad_temp.shared_1[(threadIdx.x_1 + 1078)] = @tir.if_then_else((((threadIdx.x_1 < 47) && (1 <= floormod((threadIdx.x_1 + 7), 9))) && (floormod((threadIdx.x_1 + 7), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 1078), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 25), 81), 9)*7)) + floormod((threadIdx.x_1 + 7), 9)) - 8)], 0f32, dtype=float32)
attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
pad_temp.shared_1[(threadIdx.x_1 + 1127)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 74), 81)) && (floormod((threadIdx.x_1 + 74), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 2), 9))) && (floormod((threadIdx.x_1 + 2), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 1127), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 74), 81), 9)*7)) + floormod((threadIdx.x_1 + 2), 9)) - 8)], 0f32, dtype=float32)
attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
pad_temp.shared_1[(threadIdx.x_1 + 1176)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 42), 81)) && (floormod((threadIdx.x_1 + 42), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 6), 9))) && (floormod((threadIdx.x_1 + 6), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 1176), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 42), 81), 9)*7)) + floormod((threadIdx.x_1 + 6), 9)) - 8)], 0f32, dtype=float32)
attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
pad_temp.shared_1[(threadIdx.x_1 + 1225)] = @tir.if_then_else(((1 <= floormod((threadIdx.x_1 + 1), 9)) && (floormod((threadIdx.x_1 + 1), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 1225), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 10), 81), 9)*7)) + floormod((threadIdx.x_1 + 1), 9)) - 8)], 0f32, dtype=float32)
attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
if @tir.likely((threadIdx.x_1 < 22), dtype=bool) {
pad_temp.shared_1[(threadIdx.x_1 + 1274)] = @tir.if_then_else((((threadIdx.x_1 < 13) && (1 <= floormod((threadIdx.x_1 + 5), 9))) && (floormod((threadIdx.x_1 + 5), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 1274), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 59), 81), 9)*7)) + floormod((threadIdx.x_1 + 5), 9)) - 8)], 0f32, dtype=float32)
}
attr [IterVar(threadIdx.x_2: int32, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
kernel.shared_1: Buffer(kernel.shared, float32, [576], [], scope="shared")[threadIdx.x_2] = kernel[(((blockIdx.x*18432) + cse_var_1) + threadIdx.x_2)]
attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
kernel.shared_1[(threadIdx.x_2 + 49)] = kernel[((((blockIdx.x*18432) + cse_var_1) + (floordiv((threadIdx.x_2 + 49), 3)*3)) + floormod((threadIdx.x_2 + 1), 3))]
attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
kernel.shared_1[(threadIdx.x_2 + 98)] = kernel[(((((blockIdx.x*18432) + (floordiv((threadIdx.x_2 + 98), 144)*4608)) + cse_var_1) + (floordiv(floormod((threadIdx.x_2 + 98), 144), 3)*3)) + floormod((threadIdx.x_2 + 2), 3))]
attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
kernel.shared_1[(threadIdx.x_2 + 147)] = kernel[(((((blockIdx.x*18432) + (floordiv((threadIdx.x_2 + 147), 144)*4608)) + cse_var_1) + ((floordiv(threadIdx.x_2, 3) + 1)*3)) + floormod(threadIdx.x_2, 3))]
attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
kernel.shared_1[(threadIdx.x_2 + 196)] = kernel[(((((blockIdx.x*18432) + (floordiv((threadIdx.x_2 + 196), 144)*4608)) + cse_var_1) + (floordiv(floormod((threadIdx.x_2 + 52), 144), 3)*3)) + floormod((threadIdx.x_2 + 1), 3))]
attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
kernel.shared_1[(threadIdx.x_2 + 245)] = kernel[(((((blockIdx.x*18432) + (floordiv((threadIdx.x_2 + 245), 144)*4608)) + cse_var_1) + (floordiv(floormod((threadIdx.x_2 + 101), 144), 3)*3)) + floormod((threadIdx.x_2 + 2), 3))]
attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
kernel.shared_1[(threadIdx.x_2 + 294)] = kernel[(((((blockIdx.x*18432) + (floordiv((threadIdx.x_2 + 294), 144)*4608)) + cse_var_1) + ((floordiv(threadIdx.x_2, 3) + 2)*3)) + floormod(threadIdx.x_2, 3))]
attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
kernel.shared_1[(threadIdx.x_2 + 343)] = kernel[(((((blockIdx.x*18432) + (floordiv((threadIdx.x_2 + 343), 144)*4608)) + cse_var_1) + (floordiv(floormod((threadIdx.x_2 + 55), 144), 3)*3)) + floormod((threadIdx.x_2 + 1), 3))]
attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
kernel.shared_1[(threadIdx.x_2 + 392)] = kernel[(((((blockIdx.x*18432) + (floordiv((threadIdx.x_2 + 392), 144)*4608)) + cse_var_1) + (floordiv(floormod((threadIdx.x_2 + 104), 144), 3)*3)) + floormod((threadIdx.x_2 + 2), 3))]
attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
kernel.shared_1[(threadIdx.x_2 + 441)] = kernel[(((((blockIdx.x*18432) + (floordiv((threadIdx.x_2 + 441), 144)*4608)) + cse_var_1) + ((floordiv(threadIdx.x_2, 3) + 3)*3)) + floormod(threadIdx.x_2, 3))]
attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
kernel.shared_1[(threadIdx.x_2 + 490)] = kernel[(((((blockIdx.x*18432) + (floordiv((threadIdx.x_2 + 490), 144)*4608)) + cse_var_1) + (floordiv(floormod((threadIdx.x_2 + 58), 144), 3)*3)) + floormod((threadIdx.x_2 + 1), 3))]
attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
if @tir.likely((threadIdx.x_2 < 37), dtype=bool) {
kernel.shared_1[(threadIdx.x_2 + 539)] = kernel[(((((blockIdx.x*18432) + (floordiv((threadIdx.x_2 + 539), 144)*4608)) + cse_var_1) + (floordiv(floormod((threadIdx.x_2 + 107), 144), 3)*3)) + floormod((threadIdx.x_2 + 2), 3))]
}
for (rc.outer.inner: int32, 0, 2) {
let cse_var_3: int32 = (rc.outer.inner*72)
{
conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7))]*kernel.shared_1[cse_var_3]))
conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 1)]*kernel.shared_1[(cse_var_3 + 1)]))
conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 2)]*kernel.shared_1[(cse_var_3 + 2)]))
conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 9)]*kernel.shared_1[(cse_var_3 + 3)]))
conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 10)]*kernel.shared_1[(cse_var_3 + 4)]))
conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 11)]*kernel.shared_1[(cse_var_3 + 5)]))
conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 18)]*kernel.shared_1[(cse_var_3 + 6)]))
conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 19)]*kernel.shared_1[(cse_var_3 + 7)]))
conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 20)]*kernel.shared_1[(cse_var_3 + 8)]))
conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 81)]*kernel.shared_1[(cse_var_3 + 9)]))
conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 82)]*kernel.shared_1[(cse_var_3 + 10)]))
conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 83)]*kernel.shared_1[(cse_var_3 + 11)]))
conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 90)]*kernel.shared_1[(cse_var_3 + 12)]))
conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 91)]*kernel.shared_1[(cse_var_3 + 13)]))
conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 92)]*kernel.shared_1[(cse_var_3 + 14)]))
conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 99)]*kernel.shared_1[(cse_var_3 + 15)]))
conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 100)]*kernel.shared_1[(cse_var_3 + 16)]))
conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 101)]*kernel.shared_1[(cse_var_3 + 17)]))
conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 162)]*kernel.shared_1[(cse_var_3 + 18)]))
conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 163)]*kernel.shared_1[(cse_var_3 + 19)]))
conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 164)]*kernel.shared_1[(cse_var_3 + 20)]))
conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 171)]*kernel.shared_1[(cse_var_3 + 21)]))
conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 172)]*kernel.shared_1[(cse_var_3 + 22)]))
conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 173)]*kernel.shared_1[(cse_var_3 + 23)]))
conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 180)]*kernel.shared_1[(cse_var_3 + 24)]))
conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 181)]*kernel.shared_1[(cse_var_3 + 25)]))
conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 182)]*kernel.shared_1[(cse_var_3 + 26)]))
conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 243)]*kernel.shared_1[(cse_var_3 + 27)]))
conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 244)]*kernel.shared_1[(cse_var_3 + 28)]))
conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 245)]*kernel.shared_1[(cse_var_3 + 29)]))
conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 252)]*kernel.shared_1[(cse_var_3 + 30)]))
conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 253)]*kernel.shared_1[(cse_var_3 + 31)]))
conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 254)]*kernel.shared_1[(cse_var_3 + 32)]))
conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 261)]*kernel.shared_1[(cse_var_3 + 33)]))
conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 262)]*kernel.shared_1[(cse_var_3 + 34)]))
conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 263)]*kernel.shared_1[(cse_var_3 + 35)]))
conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 324)]*kernel.shared_1[(cse_var_3 + 36)]))
conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 325)]*kernel.shared_1[(cse_var_3 + 37)]))
conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 326)]*kernel.shared_1[(cse_var_3 + 38)]))
conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 333)]*kernel.shared_1[(cse_var_3 + 39)]))
conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 334)]*kernel.shared_1[(cse_var_3 + 40)]))
conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 335)]*kernel.shared_1[(cse_var_3 + 41)]))
conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 342)]*kernel.shared_1[(cse_var_3 + 42)]))
conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 343)]*kernel.shared_1[(cse_var_3 + 43)]))
conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 344)]*kernel.shared_1[(cse_var_3 + 44)]))
conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 405)]*kernel.shared_1[(cse_var_3 + 45)]))
conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 406)]*kernel.shared_1[(cse_var_3 + 46)]))
conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 407)]*kernel.shared_1[(cse_var_3 + 47)]))
conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 414)]*kernel.shared_1[(cse_var_3 + 48)]))
conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 415)]*kernel.shared_1[(cse_var_3 + 49)]))
conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 416)]*kernel.shared_1[(cse_var_3 + 50)]))
conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 423)]*kernel.shared_1[(cse_var_3 + 51)]))
conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 424)]*kernel.shared_1[(cse_var_3 + 52)]))
conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 425)]*kernel.shared_1[(cse_var_3 + 53)]))
conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 486)]*kernel.shared_1[(cse_var_3 + 54)]))
conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 487)]*kernel.shared_1[(cse_var_3 + 55)]))
conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 488)]*kernel.shared_1[(cse_var_3 + 56)]))
conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 495)]*kernel.shared_1[(cse_var_3 + 57)]))
conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 496)]*kernel.shared_1[(cse_var_3 + 58)]))
conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 497)]*kernel.shared_1[(cse_var_3 + 59)]))
conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 504)]*kernel.shared_1[(cse_var_3 + 60)]))
conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 505)]*kernel.shared_1[(cse_var_3 + 61)]))
conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 506)]*kernel.shared_1[(cse_var_3 + 62)]))
conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 567)]*kernel.shared_1[(cse_var_3 + 63)]))
conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 568)]*kernel.shared_1[(cse_var_3 + 64)]))
conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 569)]*kernel.shared_1[(cse_var_3 + 65)]))
conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 576)]*kernel.shared_1[(cse_var_3 + 66)]))
conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 577)]*kernel.shared_1[(cse_var_3 + 67)]))
conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 578)]*kernel.shared_1[(cse_var_3 + 68)]))
conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 585)]*kernel.shared_1[(cse_var_3 + 69)]))
conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 586)]*kernel.shared_1[(cse_var_3 + 70)]))
conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 587)]*kernel.shared_1[(cse_var_3 + 71)]))
conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7))]*kernel.shared_1[(cse_var_3 + 144)]))
conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 1)]*kernel.shared_1[(cse_var_3 + 145)]))
conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 2)]*kernel.shared_1[(cse_var_3 + 146)]))
conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 9)]*kernel.shared_1[(cse_var_3 + 147)]))
conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 10)]*kernel.shared_1[(cse_var_3 + 148)]))
conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 11)]*kernel.shared_1[(cse_var_3 + 149)]))
conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 18)]*kernel.shared_1[(cse_var_3 + 150)]))
conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 19)]*kernel.shared_1[(cse_var_3 + 151)]))
conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 20)]*kernel.shared_1[(cse_var_3 + 152)]))
conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 81)]*kernel.shared_1[(cse_var_3 + 153)]))
conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 82)]*kernel.shared_1[(cse_var_3 + 154)]))
conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 83)]*kernel.shared_1[(cse_var_3 + 155)]))
conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 90)]*kernel.shared_1[(cse_var_3 + 156)]))
conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 91)]*kernel.shared_1[(cse_var_3 + 157)]))
conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 92)]*kernel.shared_1[(cse_var_3 + 158)]))
conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 99)]*kernel.shared_1[(cse_var_3 + 159)]))
conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 100)]*kernel.shared_1[(cse_var_3 + 160)]))
conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 101)]*kernel.shared_1[(cse_var_3 + 161)]))
conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 162)]*kernel.shared_1[(cse_var_3 + 162)]))
conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 163)]*kernel.shared_1[(cse_var_3 + 163)]))
conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 164)]*kernel.shared_1[(cse_var_3 + 164)]))
conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 171)]*kernel.shared_1[(cse_var_3 + 165)]))
conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 172)]*kernel.shared_1[(cse_var_3 + 166)]))
conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 173)]*kernel.shared_1[(cse_var_3 + 167)]))
conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 180)]*kernel.shared_1[(cse_var_3 + 168)]))
conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 181)]*kernel.shared_1[(cse_var_3 + 169)]))
conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 182)]*kernel.shared_1[(cse_var_3 + 170)]))
conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 243)]*kernel.shared_1[(cse_var_3 + 171)]))
conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 244)]*kernel.shared_1[(cse_var_3 + 172)]))
conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 245)]*kernel.shared_1[(cse_var_3 + 173)]))
conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 252)]*kernel.shared_1[(cse_var_3 + 174)]))
conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 253)]*kernel.shared_1[(cse_var_3 + 175)]))
conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 254)]*kernel.shared_1[(cse_var_3 + 176)]))
conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 261)]*kernel.shared_1[(cse_var_3 + 177)]))
conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 262)]*kernel.shared_1[(cse_var_3 + 178)]))
conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 263)]*kernel.shared_1[(cse_var_3 + 179)]))
conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 324)]*kernel.shared_1[(cse_var_3 + 180)]))
conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 325)]*kernel.shared_1[(cse_var_3 + 181)]))
conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 326)]*kernel.shared_1[(cse_var_3 + 182)]))
conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 333)]*kernel.shared_1[(cse_var_3 + 183)]))
conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 334)]*kernel.shared_1[(cse_var_3 + 184)]))
conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 335)]*kernel.shared_1[(cse_var_3 + 185)]))
conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 342)]*kernel.shared_1[(cse_var_3 + 186)]))
conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 343)]*kernel.shared_1[(cse_var_3 + 187)]))
conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 344)]*kernel.shared_1[(cse_var_3 + 188)]))
conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 405)]*kernel.shared_1[(cse_var_3 + 189)]))
conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 406)]*kernel.shared_1[(cse_var_3 + 190)]))
conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 407)]*kernel.shared_1[(cse_var_3 + 191)]))
conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 414)]*kernel.shared_1[(cse_var_3 + 192)]))
conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 415)]*kernel.shared_1[(cse_var_3 + 193)]))
conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 416)]*kernel.shared_1[(cse_var_3 + 194)]))
conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 423)]*kernel.shared_1[(cse_var_3 + 195)]))
conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 424)]*kernel.shared_1[(cse_var_3 + 196)]))
conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 425)]*kernel.shared_1[(cse_var_3 + 197)]))
conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 486)]*kernel.shared_1[(cse_var_3 + 198)]))
conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 487)]*kernel.shared_1[(cse_var_3 + 199)]))
conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 488)]*kernel.shared_1[(cse_var_3 + 200)]))
conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 495)]*kernel.shared_1[(cse_var_3 + 201)]))
conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 496)]*kernel.shared_1[(cse_var_3 + 202)]))
conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 497)]*kernel.shared_1[(cse_var_3 + 203)]))
conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 504)]*kernel.shared_1[(cse_var_3 + 204)]))
conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 505)]*kernel.shared_1[(cse_var_3 + 205)]))
conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 506)]*kernel.shared_1[(cse_var_3 + 206)]))
conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 567)]*kernel.shared_1[(cse_var_3 + 207)]))
conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 568)]*kernel.shared_1[(cse_var_3 + 208)]))
conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 569)]*kernel.shared_1[(cse_var_3 + 209)]))
conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 576)]*kernel.shared_1[(cse_var_3 + 210)]))
conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 577)]*kernel.shared_1[(cse_var_3 + 211)]))
conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 578)]*kernel.shared_1[(cse_var_3 + 212)]))
conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 585)]*kernel.shared_1[(cse_var_3 + 213)]))
conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 586)]*kernel.shared_1[(cse_var_3 + 214)]))
conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 587)]*kernel.shared_1[(cse_var_3 + 215)]))
conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7))]*kernel.shared_1[(cse_var_3 + 288)]))
conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 1)]*kernel.shared_1[(cse_var_3 + 289)]))
conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 2)]*kernel.shared_1[(cse_var_3 + 290)]))
conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 9)]*kernel.shared_1[(cse_var_3 + 291)]))
conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 10)]*kernel.shared_1[(cse_var_3 + 292)]))
conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 11)]*kernel.shared_1[(cse_var_3 + 293)]))
conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 18)]*kernel.shared_1[(cse_var_3 + 294)]))
conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 19)]*kernel.shared_1[(cse_var_3 + 295)]))
conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 20)]*kernel.shared_1[(cse_var_3 + 296)]))
conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 81)]*kernel.shared_1[(cse_var_3 + 297)]))
conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 82)]*kernel.shared_1[(cse_var_3 + 298)]))
conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 83)]*kernel.shared_1[(cse_var_3 + 299)]))
conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 90)]*kernel.shared_1[(cse_var_3 + 300)]))
conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 91)]*kernel.shared_1[(cse_var_3 + 301)]))
conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 92)]*kernel.shared_1[(cse_var_3 + 302)]))
conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 99)]*kernel.shared_1[(cse_var_3 + 303)]))
conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 100)]*kernel.shared_1[(cse_var_3 + 304)]))
conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 101)]*kernel.shared_1[(cse_var_3 + 305)]))
conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 162)]*kernel.shared_1[(cse_var_3 + 306)]))
conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 163)]*kernel.shared_1[(cse_var_3 + 307)]))
conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 164)]*kernel.shared_1[(cse_var_3 + 308)]))
conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 171)]*kernel.shared_1[(cse_var_3 + 309)]))
conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 172)]*kernel.shared_1[(cse_var_3 + 310)]))
conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 173)]*kernel.shared_1[(cse_var_3 + 311)]))
conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 180)]*kernel.shared_1[(cse_var_3 + 312)]))
conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 181)]*kernel.shared_1[(cse_var_3 + 313)]))
conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 182)]*kernel.shared_1[(cse_var_3 + 314)]))
conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 243)]*kernel.shared_1[(cse_var_3 + 315)]))
conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 244)]*kernel.shared_1[(cse_var_3 + 316)]))
conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 245)]*kernel.shared_1[(cse_var_3 + 317)]))
conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 252)]*kernel.shared_1[(cse_var_3 + 318)]))
conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 253)]*kernel.shared_1[(cse_var_3 + 319)]))
conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 254)]*kernel.shared_1[(cse_var_3 + 320)]))
conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 261)]*kernel.shared_1[(cse_var_3 + 321)]))
conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 262)]*kernel.shared_1[(cse_var_3 + 322)]))
conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 263)]*kernel.shared_1[(cse_var_3 + 323)]))
conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 324)]*kernel.shared_1[(cse_var_3 + 324)]))
conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 325)]*kernel.shared_1[(cse_var_3 + 325)]))
conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 326)]*kernel.shared_1[(cse_var_3 + 326)]))
conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 333)]*kernel.shared_1[(cse_var_3 + 327)]))
conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 334)]*kernel.shared_1[(cse_var_3 + 328)]))
conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 335)]*kernel.shared_1[(cse_var_3 + 329)]))
conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 342)]*kernel.shared_1[(cse_var_3 + 330)]))
conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 343)]*kernel.shared_1[(cse_var_3 + 331)]))
conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 344)]*kernel.shared_1[(cse_var_3 + 332)]))
conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 405)]*kernel.shared_1[(cse_var_3 + 333)]))
conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 406)]*kernel.shared_1[(cse_var_3 + 334)]))
conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 407)]*kernel.shared_1[(cse_var_3 + 335)]))
conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 414)]*kernel.shared_1[(cse_var_3 + 336)]))
conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 415)]*kernel.shared_1[(cse_var_3 + 337)]))
conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 416)]*kernel.shared_1[(cse_var_3 + 338)]))
conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 423)]*kernel.shared_1[(cse_var_3 + 339)]))
conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 424)]*kernel.shared_1[(cse_var_3 + 340)]))
conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 425)]*kernel.shared_1[(cse_var_3 + 341)]))
conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 486)]*kernel.shared_1[(cse_var_3 + 342)]))
conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 487)]*kernel.shared_1[(cse_var_3 + 343)]))
conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 488)]*kernel.shared_1[(cse_var_3 + 344)]))
conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 495)]*kernel.shared_1[(cse_var_3 + 345)]))
conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 496)]*kernel.shared_1[(cse_var_3 + 346)]))
conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 497)]*kernel.shared_1[(cse_var_3 + 347)]))
conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 504)]*kernel.shared_1[(cse_var_3 + 348)]))
conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 505)]*kernel.shared_1[(cse_var_3 + 349)]))
conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 506)]*kernel.shared_1[(cse_var_3 + 350)]))
conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 567)]*kernel.shared_1[(cse_var_3 + 351)]))
conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 568)]*kernel.shared_1[(cse_var_3 + 352)]))
conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 569)]*kernel.shared_1[(cse_var_3 + 353)]))
conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 576)]*kernel.shared_1[(cse_var_3 + 354)]))
conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 577)]*kernel.shared_1[(cse_var_3 + 355)]))
conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 578)]*kernel.shared_1[(cse_var_3 + 356)]))
conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 585)]*kernel.shared_1[(cse_var_3 + 357)]))
conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 586)]*kernel.shared_1[(cse_var_3 + 358)]))
conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 587)]*kernel.shared_1[(cse_var_3 + 359)]))
conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7))]*kernel.shared_1[(cse_var_3 + 432)]))
conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 1)]*kernel.shared_1[(cse_var_3 + 433)]))
conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 2)]*kernel.shared_1[(cse_var_3 + 434)]))
conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 9)]*kernel.shared_1[(cse_var_3 + 435)]))
conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 10)]*kernel.shared_1[(cse_var_3 + 436)]))
conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 11)]*kernel.shared_1[(cse_var_3 + 437)]))
conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 18)]*kernel.shared_1[(cse_var_3 + 438)]))
conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 19)]*kernel.shared_1[(cse_var_3 + 439)]))
conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 20)]*kernel.shared_1[(cse_var_3 + 440)]))
conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 81)]*kernel.shared_1[(cse_var_3 + 441)]))
conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 82)]*kernel.shared_1[(cse_var_3 + 442)]))
conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 83)]*kernel.shared_1[(cse_var_3 + 443)]))
conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 90)]*kernel.shared_1[(cse_var_3 + 444)]))
conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 91)]*kernel.shared_1[(cse_var_3 + 445)]))
conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 92)]*kernel.shared_1[(cse_var_3 + 446)]))
conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 99)]*kernel.shared_1[(cse_var_3 + 447)]))
conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 100)]*kernel.shared_1[(cse_var_3 + 448)]))
conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 101)]*kernel.shared_1[(cse_var_3 + 449)]))
conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 162)]*kernel.shared_1[(cse_var_3 + 450)]))
conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 163)]*kernel.shared_1[(cse_var_3 + 451)]))
conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 164)]*kernel.shared_1[(cse_var_3 + 452)]))
conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 171)]*kernel.shared_1[(cse_var_3 + 453)]))
conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 172)]*kernel.shared_1[(cse_var_3 + 454)]))
conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 173)]*kernel.shared_1[(cse_var_3 + 455)]))
conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 180)]*kernel.shared_1[(cse_var_3 + 456)]))
conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 181)]*kernel.shared_1[(cse_var_3 + 457)]))
conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 182)]*kernel.shared_1[(cse_var_3 + 458)]))
conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 243)]*kernel.shared_1[(cse_var_3 + 459)]))
conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 244)]*kernel.shared_1[(cse_var_3 + 460)]))
conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 245)]*kernel.shared_1[(cse_var_3 + 461)]))
conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 252)]*kernel.shared_1[(cse_var_3 + 462)]))
conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 253)]*kernel.shared_1[(cse_var_3 + 463)]))
conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 254)]*kernel.shared_1[(cse_var_3 + 464)]))
conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 261)]*kernel.shared_1[(cse_var_3 + 465)]))
conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 262)]*kernel.shared_1[(cse_var_3 + 466)]))
conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 263)]*kernel.shared_1[(cse_var_3 + 467)]))
conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 324)]*kernel.shared_1[(cse_var_3 + 468)]))
conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 325)]*kernel.shared_1[(cse_var_3 + 469)]))
conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 326)]*kernel.shared_1[(cse_var_3 + 470)]))
conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 333)]*kernel.shared_1[(cse_var_3 + 471)]))
conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 334)]*kernel.shared_1[(cse_var_3 + 472)]))
conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 335)]*kernel.shared_1[(cse_var_3 + 473)]))
conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 342)]*kernel.shared_1[(cse_var_3 + 474)]))
conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 343)]*kernel.shared_1[(cse_var_3 + 475)]))
conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 344)]*kernel.shared_1[(cse_var_3 + 476)]))
conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 405)]*kernel.shared_1[(cse_var_3 + 477)]))
conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 406)]*kernel.shared_1[(cse_var_3 + 478)]))
conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 407)]*kernel.shared_1[(cse_var_3 + 479)]))
conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 414)]*kernel.shared_1[(cse_var_3 + 480)]))
conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 415)]*kernel.shared_1[(cse_var_3 + 481)]))
conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 416)]*kernel.shared_1[(cse_var_3 + 482)]))
conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 423)]*kernel.shared_1[(cse_var_3 + 483)]))
conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 424)]*kernel.shared_1[(cse_var_3 + 484)]))
conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 425)]*kernel.shared_1[(cse_var_3 + 485)]))
conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 486)]*kernel.shared_1[(cse_var_3 + 486)]))
conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 487)]*kernel.shared_1[(cse_var_3 + 487)]))
conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 488)]*kernel.shared_1[(cse_var_3 + 488)]))
conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 495)]*kernel.shared_1[(cse_var_3 + 489)]))
conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 496)]*kernel.shared_1[(cse_var_3 + 490)]))
conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 497)]*kernel.shared_1[(cse_var_3 + 491)]))
conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 504)]*kernel.shared_1[(cse_var_3 + 492)]))
conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 505)]*kernel.shared_1[(cse_var_3 + 493)]))
conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 506)]*kernel.shared_1[(cse_var_3 + 494)]))
conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 567)]*kernel.shared_1[(cse_var_3 + 495)]))
conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 568)]*kernel.shared_1[(cse_var_3 + 496)]))
conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 569)]*kernel.shared_1[(cse_var_3 + 497)]))
conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 576)]*kernel.shared_1[(cse_var_3 + 498)]))
conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 577)]*kernel.shared_1[(cse_var_3 + 499)]))
conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 578)]*kernel.shared_1[(cse_var_3 + 500)]))
conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 585)]*kernel.shared_1[(cse_var_3 + 501)]))
conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 586)]*kernel.shared_1[(cse_var_3 + 502)]))
conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 587)]*kernel.shared_1[(cse_var_3 + 503)]))
}
}
}
}
for (i1.inner: int32, 0, 4) {
compute[(((blockIdx.x*196) + (i1.inner*49)) + threadIdx.x)] = max((conv2d_nchw_1[i1.inner] + bias[((blockIdx.x*4) + i1.inner)]), 0f32)
}
}
}
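The final store in the lowered IR above writes to `compute[blockIdx.x*196 + i1.inner*49 + threadIdx.x]`. The sketch below, in plain Python with no TVM dependency, checks how that flat index maps onto NCHW output coordinates. It assumes the output shape (1, 512, 7, 7) given by the task arguments and a grid of 512 / 4 = 128 blocks with 49 threads each. The block count is inferred from the `bias[blockIdx.x*4 + i1.inner]` access rather than printed in the IR, and the helper names are illustrative only.

```python
# Shapes from the search task: N=1, CO=512, H=W=7 (stride 1, padding 1).
CO, H, W = 512, 7, 7
CHANNELS_PER_BLOCK = 4                  # extent of the i1.inner loop
THREADS_PER_BLOCK = H * W               # 49: one thread per output pixel
NUM_BLOCKS = CO // CHANNELS_PER_BLOCK   # 128 (inferred, not printed in the IR)


def flat_output_index(block_idx, i1_inner, thread_idx):
    """Flat index used by the final store in the lowered IR."""
    return block_idx * 196 + i1_inner * 49 + thread_idx


def output_coords(block_idx, i1_inner, thread_idx):
    """(channel, y, x) that the same store corresponds to in NCHW layout."""
    channel = block_idx * CHANNELS_PER_BLOCK + i1_inner
    return channel, thread_idx // W, thread_idx % W


seen = set()
for b in range(NUM_BLOCKS):
    for i in range(CHANNELS_PER_BLOCK):
        for t in range(THREADS_PER_BLOCK):
            c, y, x = output_coords(b, i, t)
            # The flat index must agree with row-major NCHW addressing (N=1).
            assert flat_output_index(b, i, t) == (c * H + y) * W + x
            seen.add(flat_output_index(b, i, t))

# Every one of the 512*7*7 output elements is written exactly once.
assert len(seen) == CO * H * W
print(len(seen))
```

Under these assumptions, each block produces four output channels and each thread produces one pixel position in all four, which is why the thread keeps four accumulators `conv2d_nchw_1[0]` to `conv2d_nchw_1[3]`.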
Check correctness and evaluate performance
We build the binary, compare its output against a NumPy reference implementation, and measure its execution time.
func = tvm.build(sch, args, target)
# Check correctness
data_np = np.random.uniform(size=(N, CI, H, W)).astype(np.float32)
weight_np = np.random.uniform(size=(CO, CI, KH, KW)).astype(np.float32)
bias_np = np.random.uniform(size=(1, CO, 1, 1)).astype(np.float32)
conv_np = conv2d_nchw_python(data_np, weight_np, strides, padding)
out_np = np.maximum(conv_np + bias_np, 0.0)
dev = tvm.cuda()
data_tvm = tvm.nd.array(data_np, device=dev)
weight_tvm = tvm.nd.array(weight_np, device=dev)
bias_tvm = tvm.nd.array(bias_np, device=dev)
out_tvm = tvm.nd.empty(out_np.shape, device=dev)
func(data_tvm, weight_tvm, bias_tvm, out_tvm)
# Check results
np.testing.assert_allclose(out_np, out_tvm.numpy(), rtol=1e-3)
# Evaluate execution time
evaluator = func.time_evaluator(func.entry_name, dev, min_repeat_ms=500)
print(
    "Execution time of this operator: %.3f ms"
    % (np.median(evaluator(data_tvm, weight_tvm, bias_tvm, out_tvm).results) * 1000)
)
Execution time of this operator: 0.236 ms
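To put the measured time in context, it can be converted into an approximate arithmetic throughput. The sketch below counts two floating-point operations per multiply-accumulate of the convolution and ignores the bias addition and ReLU. The function name `conv2d_flops` is illustrative, and the 0.236 ms figure is the value printed above, which will differ between GPUs and runs.

```python
def conv2d_flops(n, h, w, co, ci, kh, kw, stride, padding):
    """Approximate FLOP count of an NCHW convolution (2 FLOPs per MAC)."""
    out_h = (h + 2 * padding[0] - kh) // stride[0] + 1
    out_w = (w + 2 * padding[1] - kw) // stride[1] + 1
    return 2 * n * co * out_h * out_w * ci * kh * kw


# Same layer as the search task: the last 3x3 convolution in ResNet-50.
flops = conv2d_flops(1, 7, 7, 512, 512, 3, 3, (1, 1), (1, 1))

time_ms = 0.236  # taken from the output above; hardware dependent
gflops = flops / (time_ms * 1e-3) / 1e9

print("%d FLOPs, about %.0f GFLOP/s" % (flops, gflops))
```

For this layer the count is 231,211,008 FLOPs, so the time shown above corresponds to roughly 980 GFLOP/s.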
Using the record file
During the search, all measurement records are dumped into the record file “conv2d.json”. The measurement records can be used to re-apply search results, resume the search, and perform other analyses.
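As one example of such an analysis, the sketch below scans a record file line by line and picks the fastest valid measurement using only the standard library. It assumes that each line is a JSON object whose `"r"` field holds `[costs, error_no, all_cost, timestamp]`, with `error_no == 0` meaning success. That layout is an assumption about the log format and may differ between TVM versions, so the auto_scheduler API remains the supported way to read records. The function `best_record` and the sample lines are made up for illustration.

```python
import json


def best_record(lines):
    """Return (mean_cost_seconds, record) for the fastest valid record.

    Assumes record["r"] == [costs, error_no, all_cost, timestamp].
    Returns None if no valid record is found.
    """
    best = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        costs, error_no = record["r"][0], record["r"][1]
        if error_no != 0 or not costs:
            continue  # skip failed measurements
        mean_cost = sum(costs) / len(costs)
        if best is None or mean_cost < best[0]:
            best = (mean_cost, record)
    return best


# Made-up records for illustration only; a real log has a populated "i" field.
sample_lines = [
    '{"i": null, "r": [[0.00041, 0.00043], 0, 1.9, 0], "v": "hypothetical"}',
    '{"i": null, "r": [[1e+10], 2, 0.1, 0], "v": "hypothetical"}',
    '{"i": null, "r": [[0.00030, 0.00032], 0, 2.1, 0], "v": "hypothetical"}',
]

cost, record = best_record(sample_lines)
print("best mean cost: %.3f ms" % (cost * 1000))
```

If the assumed layout holds for your TVM version, the same function could be applied to the real log with `with open("conv2d.json") as f: best_record(f)`.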
Here is an example where we load the best schedule from the record file and print both the equivalent Python schedule API calls and the generated CUDA source code. Both are useful for debugging and for learning how the auto-scheduler behaves.
print("Equivalent python schedule:")
print(task.print_best(log_file, print_mode="schedule"))
print("CUDA source code:")
print(task.print_best(log_file, print_mode="cuda"))
Equivalent python schedule:
pad_temp_i0, pad_temp_i1, pad_temp_i2, pad_temp_i3 = tuple(pad_temp.op.axis) + tuple(pad_temp.op.reduce_axis)
conv2d_nchw_nn, conv2d_nchw_ff, conv2d_nchw_yy, conv2d_nchw_xx, conv2d_nchw_rc, conv2d_nchw_ry, conv2d_nchw_rx = tuple(conv2d_nchw.op.axis) + tuple(conv2d_nchw.op.reduce_axis)
T_add_ax0, T_add_ax1, T_add_ax2, T_add_ax3 = tuple(T_add.op.axis) + tuple(T_add.op.reduce_axis)
compute_i0, compute_i1, compute_i2, compute_i3 = tuple(compute.op.axis) + tuple(compute.op.reduce_axis)
s[T_add].compute_inline()
conv2d_nchw_nn_o_i, conv2d_nchw_nn_i = s[conv2d_nchw].split(conv2d_nchw_nn, factor=1)
conv2d_nchw_nn_o_o_i, conv2d_nchw_nn_o_i = s[conv2d_nchw].split(conv2d_nchw_nn_o_i, factor=1)
conv2d_nchw_nn_o_o_o_i, conv2d_nchw_nn_o_o_i = s[conv2d_nchw].split(conv2d_nchw_nn_o_o_i, factor=1)
conv2d_nchw_nn_o_o_o_o, conv2d_nchw_nn_o_o_o_i = s[conv2d_nchw].split(conv2d_nchw_nn_o_o_o_i, factor=1)
conv2d_nchw_ff_o_i, conv2d_nchw_ff_i = s[conv2d_nchw].split(conv2d_nchw_ff, factor=1)
conv2d_nchw_ff_o_o_i, conv2d_nchw_ff_o_i = s[conv2d_nchw].split(conv2d_nchw_ff_o_i, factor=4)
conv2d_nchw_ff_o_o_o_i, conv2d_nchw_ff_o_o_i = s[conv2d_nchw].split(conv2d_nchw_ff_o_o_i, factor=1)
conv2d_nchw_ff_o_o_o_o, conv2d_nchw_ff_o_o_o_i = s[conv2d_nchw].split(conv2d_nchw_ff_o_o_o_i, factor=1)
conv2d_nchw_yy_o_i, conv2d_nchw_yy_i = s[conv2d_nchw].split(conv2d_nchw_yy, factor=1)
conv2d_nchw_yy_o_o_i, conv2d_nchw_yy_o_i = s[conv2d_nchw].split(conv2d_nchw_yy_o_i, factor=1)
conv2d_nchw_yy_o_o_o_i, conv2d_nchw_yy_o_o_i = s[conv2d_nchw].split(conv2d_nchw_yy_o_o_i, factor=7)
conv2d_nchw_yy_o_o_o_o, conv2d_nchw_yy_o_o_o_i = s[conv2d_nchw].split(conv2d_nchw_yy_o_o_o_i, factor=1)
conv2d_nchw_xx_o_i, conv2d_nchw_xx_i = s[conv2d_nchw].split(conv2d_nchw_xx, factor=1)
conv2d_nchw_xx_o_o_i, conv2d_nchw_xx_o_i = s[conv2d_nchw].split(conv2d_nchw_xx_o_i, factor=1)
conv2d_nchw_xx_o_o_o_i, conv2d_nchw_xx_o_o_i = s[conv2d_nchw].split(conv2d_nchw_xx_o_o_i, factor=7)
conv2d_nchw_xx_o_o_o_o, conv2d_nchw_xx_o_o_o_i = s[conv2d_nchw].split(conv2d_nchw_xx_o_o_o_i, factor=1)
conv2d_nchw_rc_o_i, conv2d_nchw_rc_i = s[conv2d_nchw].split(conv2d_nchw_rc, factor=8)
conv2d_nchw_rc_o_o, conv2d_nchw_rc_o_i = s[conv2d_nchw].split(conv2d_nchw_rc_o_i, factor=2)
conv2d_nchw_ry_o_i, conv2d_nchw_ry_i = s[conv2d_nchw].split(conv2d_nchw_ry, factor=3)
conv2d_nchw_ry_o_o, conv2d_nchw_ry_o_i = s[conv2d_nchw].split(conv2d_nchw_ry_o_i, factor=1)
conv2d_nchw_rx_o_i, conv2d_nchw_rx_i = s[conv2d_nchw].split(conv2d_nchw_rx, factor=3)
conv2d_nchw_rx_o_o, conv2d_nchw_rx_o_i = s[conv2d_nchw].split(conv2d_nchw_rx_o_i, factor=1)
s[conv2d_nchw].reorder(conv2d_nchw_nn_o_o_o_o, conv2d_nchw_ff_o_o_o_o, conv2d_nchw_yy_o_o_o_o, conv2d_nchw_xx_o_o_o_o, conv2d_nchw_nn_o_o_o_i, conv2d_nchw_ff_o_o_o_i, conv2d_nchw_yy_o_o_o_i, conv2d_nchw_xx_o_o_o_i, conv2d_nchw_nn_o_o_i, conv2d_nchw_ff_o_o_i, conv2d_nchw_yy_o_o_i, conv2d_nchw_xx_o_o_i, conv2d_nchw_rc_o_o, conv2d_nchw_ry_o_o, conv2d_nchw_rx_o_o, conv2d_nchw_rc_o_i, conv2d_nchw_ry_o_i, conv2d_nchw_rx_o_i, conv2d_nchw_nn_o_i, conv2d_nchw_ff_o_i, conv2d_nchw_yy_o_i, conv2d_nchw_xx_o_i, conv2d_nchw_rc_i, conv2d_nchw_ry_i, conv2d_nchw_rx_i, conv2d_nchw_nn_i, conv2d_nchw_ff_i, conv2d_nchw_yy_i, conv2d_nchw_xx_i)
compute_i0_o_i, compute_i0_i = s[compute].split(compute_i0, factor=1)
compute_i0_o_o_i, compute_i0_o_i = s[compute].split(compute_i0_o_i, factor=1)
compute_i0_o_o_o, compute_i0_o_o_i = s[compute].split(compute_i0_o_o_i, factor=1)
compute_i1_o_i, compute_i1_i = s[compute].split(compute_i1, factor=4)
compute_i1_o_o_i, compute_i1_o_i = s[compute].split(compute_i1_o_i, factor=1)
compute_i1_o_o_o, compute_i1_o_o_i = s[compute].split(compute_i1_o_o_i, factor=1)
compute_i2_o_i, compute_i2_i = s[compute].split(compute_i2, factor=1)
compute_i2_o_o_i, compute_i2_o_i = s[compute].split(compute_i2_o_i, factor=7)
compute_i2_o_o_o, compute_i2_o_o_i = s[compute].split(compute_i2_o_o_i, factor=1)
compute_i3_o_i, compute_i3_i = s[compute].split(compute_i3, factor=1)
compute_i3_o_o_i, compute_i3_o_i = s[compute].split(compute_i3_o_i, factor=7)
compute_i3_o_o_o, compute_i3_o_o_i = s[compute].split(compute_i3_o_o_i, factor=1)
s[compute].reorder(compute_i0_o_o_o, compute_i1_o_o_o, compute_i2_o_o_o, compute_i3_o_o_o, compute_i0_o_o_i, compute_i1_o_o_i, compute_i2_o_o_i, compute_i3_o_o_i, compute_i0_o_i, compute_i1_o_i, compute_i2_o_i, compute_i3_o_i, compute_i0_i, compute_i1_i, compute_i2_i, compute_i3_i)
s[conv2d_nchw].compute_at(s[compute], compute_i3_o_i)
kernel_shared = s.cache_read(kernel, "shared", [conv2d_nchw])
kernel_shared_ax0, kernel_shared_ax1, kernel_shared_ax2, kernel_shared_ax3 = tuple(kernel_shared.op.axis)
s[kernel_shared].compute_at(s[conv2d_nchw], conv2d_nchw_rx_o_o)
pad_temp_shared = s.cache_read(pad_temp, "shared", [conv2d_nchw])
pad_temp_shared_ax0, pad_temp_shared_ax1, pad_temp_shared_ax2, pad_temp_shared_ax3 = tuple(pad_temp_shared.op.axis)
s[pad_temp_shared].compute_at(s[conv2d_nchw], conv2d_nchw_rx_o_o)
s[pad_temp].compute_inline()
compute_i0_o_o_o_i1_o_o_o_fused_i2_o_o_o_fused_i3_o_o_o_fused = s[compute].fuse(compute_i0_o_o_o, compute_i1_o_o_o, compute_i2_o_o_o, compute_i3_o_o_o)
s[compute].bind(compute_i0_o_o_o_i1_o_o_o_fused_i2_o_o_o_fused_i3_o_o_o_fused, te.thread_axis("blockIdx.x"))
compute_i0_o_o_i_i1_o_o_i_fused_i2_o_o_i_fused_i3_o_o_i_fused = s[compute].fuse(compute_i0_o_o_i, compute_i1_o_o_i, compute_i2_o_o_i, compute_i3_o_o_i)
s[compute].bind(compute_i0_o_o_i_i1_o_o_i_fused_i2_o_o_i_fused_i3_o_o_i_fused, te.thread_axis("vthread"))
compute_i0_o_i_i1_o_i_fused_i2_o_i_fused_i3_o_i_fused = s[compute].fuse(compute_i0_o_i, compute_i1_o_i, compute_i2_o_i, compute_i3_o_i)
s[compute].bind(compute_i0_o_i_i1_o_i_fused_i2_o_i_fused_i3_o_i_fused, te.thread_axis("threadIdx.x"))
kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused = s[kernel_shared].fuse(kernel_shared_ax0, kernel_shared_ax1, kernel_shared_ax2, kernel_shared_ax3)
kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_i = s[kernel_shared].split(kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused, factor=1)
s[kernel_shared].vectorize(kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_i)
kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_o, kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i = s[kernel_shared].split(kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, factor=49)
s[kernel_shared].bind(kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i, te.thread_axis("threadIdx.x"))
pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused = s[pad_temp_shared].fuse(pad_temp_shared_ax0, pad_temp_shared_ax1, pad_temp_shared_ax2, pad_temp_shared_ax3)
pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_i = s[pad_temp_shared].split(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused, factor=1)
s[pad_temp_shared].vectorize(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_i)
pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_o, pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i = s[pad_temp_shared].split(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, factor=49)
s[pad_temp_shared].bind(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i, te.thread_axis("threadIdx.x"))
s[conv2d_nchw].pragma(conv2d_nchw_nn_o_o_o_o, "auto_unroll_max_step", 512)
s[conv2d_nchw].pragma(conv2d_nchw_nn_o_o_o_o, "unroll_explicit", True)
CUDA source code:
#ifdef _WIN32
using uint = unsigned int;
using uchar = unsigned char;
using ushort = unsigned short;
using int64_t = long long;
using uint64_t = unsigned long long;
#else
#define uint unsigned int
#define uchar unsigned char
#define ushort unsigned short
#define int64_t long long
#define uint64_t unsigned long long
#endif
extern "C" __global__ void __launch_bounds__(49) default_function_kernel0(float* __restrict__ data, float* __restrict__ kernel, float* __restrict__ compute, float* __restrict__ bias) {
float conv2d_nchw[4];
__shared__ float pad_temp_shared[1296];
__shared__ float kernel_shared[576];
conv2d_nchw[0] = 0.000000e+00f;
conv2d_nchw[1] = 0.000000e+00f;
conv2d_nchw[2] = 0.000000e+00f;
conv2d_nchw[3] = 0.000000e+00f;
for (int rc_outer_outer = 0; rc_outer_outer < 32; ++rc_outer_outer) {
__syncthreads();
pad_temp_shared[((int)threadIdx.x)] = ((((9 <= ((int)threadIdx.x)) && (1 <= (((int)threadIdx.x) % 9))) && ((((int)threadIdx.x) % 9) < 8)) ? data[((((rc_outer_outer * 784) + ((((int)threadIdx.x) / 9) * 7)) + (((int)threadIdx.x) % 9)) - 8)] : 0.000000e+00f);
pad_temp_shared[(((int)threadIdx.x) + 49)] = (((((9 <= ((((int)threadIdx.x) + 49) % 81)) && (((((int)threadIdx.x) + 49) % 81) < 72)) && (1 <= ((((int)threadIdx.x) + 4) % 9))) && (((((int)threadIdx.x) + 4) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 49) / 81) * 49)) + ((((((int)threadIdx.x) + 49) % 81) / 9) * 7)) + ((((int)threadIdx.x) + 4) % 9)) - 8)] : 0.000000e+00f);
pad_temp_shared[(((int)threadIdx.x) + 98)] = (((1 <= ((((int)threadIdx.x) + 8) % 9)) && (((((int)threadIdx.x) + 8) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 98) / 81) * 49)) + (((((int)threadIdx.x) + 17) / 9) * 7)) + ((((int)threadIdx.x) + 8) % 9)) - 8)] : 0.000000e+00f);
pad_temp_shared[(((int)threadIdx.x) + 147)] = (((((9 <= ((((int)threadIdx.x) + 66) % 81)) && (((((int)threadIdx.x) + 66) % 81) < 72)) && (1 <= ((((int)threadIdx.x) + 3) % 9))) && (((((int)threadIdx.x) + 3) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 147) / 81) * 49)) + ((((((int)threadIdx.x) + 66) % 81) / 9) * 7)) + ((((int)threadIdx.x) + 3) % 9)) - 8)] : 0.000000e+00f);
pad_temp_shared[(((int)threadIdx.x) + 196)] = (((((9 <= ((((int)threadIdx.x) + 34) % 81)) && (((((int)threadIdx.x) + 34) % 81) < 72)) && (1 <= ((((int)threadIdx.x) + 7) % 9))) && (((((int)threadIdx.x) + 7) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 196) / 81) * 49)) + ((((((int)threadIdx.x) + 34) % 81) / 9) * 7)) + ((((int)threadIdx.x) + 7) % 9)) - 8)] : 0.000000e+00f);
pad_temp_shared[(((int)threadIdx.x) + 245)] = ((((7 <= ((int)threadIdx.x)) && (1 <= ((((int)threadIdx.x) + 2) % 9))) && (((((int)threadIdx.x) + 2) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 245) / 81) * 49)) + (((((int)threadIdx.x) + 2) / 9) * 7)) + ((((int)threadIdx.x) + 2) % 9)) - 8)] : 0.000000e+00f);
pad_temp_shared[(((int)threadIdx.x) + 294)] = (((((9 <= ((((int)threadIdx.x) + 51) % 81)) && (((((int)threadIdx.x) + 51) % 81) < 72)) && (1 <= ((((int)threadIdx.x) + 6) % 9))) && (((((int)threadIdx.x) + 6) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 294) / 81) * 49)) + ((((((int)threadIdx.x) + 51) % 81) / 9) * 7)) + ((((int)threadIdx.x) + 6) % 9)) - 8)] : 0.000000e+00f);
pad_temp_shared[(((int)threadIdx.x) + 343)] = (((1 <= ((((int)threadIdx.x) + 1) % 9)) && (((((int)threadIdx.x) + 1) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 343) / 81) * 49)) + (((((int)threadIdx.x) + 19) / 9) * 7)) + ((((int)threadIdx.x) + 1) % 9)) - 8)] : 0.000000e+00f);
pad_temp_shared[(((int)threadIdx.x) + 392)] = (((((9 <= ((((int)threadIdx.x) + 68) % 81)) && (((((int)threadIdx.x) + 68) % 81) < 72)) && (1 <= ((((int)threadIdx.x) + 5) % 9))) && (((((int)threadIdx.x) + 5) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 392) / 81) * 49)) + ((((((int)threadIdx.x) + 68) % 81) / 9) * 7)) + ((((int)threadIdx.x) + 5) % 9)) - 8)] : 0.000000e+00f);
pad_temp_shared[(((int)threadIdx.x) + 441)] = (((((1 <= (((((int)threadIdx.x) / 9) + 4) % 9)) && (((((int)threadIdx.x) + 36) % 81) < 72)) && (1 <= (((int)threadIdx.x) % 9))) && ((((int)threadIdx.x) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 441) / 81) * 49)) + ((((((int)threadIdx.x) / 9) + 4) % 9) * 7)) + (((int)threadIdx.x) % 9)) - 8)] : 0.000000e+00f);
pad_temp_shared[(((int)threadIdx.x) + 490)] = ((((5 <= ((int)threadIdx.x)) && (1 <= ((((int)threadIdx.x) + 4) % 9))) && (((((int)threadIdx.x) + 4) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 490) / 81) * 49)) + (((((int)threadIdx.x) + 4) / 9) * 7)) + ((((int)threadIdx.x) + 4) % 9)) - 8)] : 0.000000e+00f);
pad_temp_shared[(((int)threadIdx.x) + 539)] = (((((9 <= ((((int)threadIdx.x) + 53) % 81)) && (((((int)threadIdx.x) + 53) % 81) < 72)) && (1 <= ((((int)threadIdx.x) + 8) % 9))) && (((((int)threadIdx.x) + 8) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 539) / 81) * 49)) + ((((((int)threadIdx.x) + 53) % 81) / 9) * 7)) + ((((int)threadIdx.x) + 8) % 9)) - 8)] : 0.000000e+00f);
pad_temp_shared[(((int)threadIdx.x) + 588)] = (((1 <= ((((int)threadIdx.x) + 3) % 9)) && (((((int)threadIdx.x) + 3) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 588) / 81) * 49)) + (((((int)threadIdx.x) + 21) / 9) * 7)) + ((((int)threadIdx.x) + 3) % 9)) - 8)] : 0.000000e+00f);
pad_temp_shared[(((int)threadIdx.x) + 637)] = (((((9 <= ((((int)threadIdx.x) + 70) % 81)) && (((((int)threadIdx.x) + 70) % 81) < 72)) && (1 <= ((((int)threadIdx.x) + 7) % 9))) && (((((int)threadIdx.x) + 7) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 637) / 81) * 49)) + ((((((int)threadIdx.x) + 70) % 81) / 9) * 7)) + ((((int)threadIdx.x) + 7) % 9)) - 8)] : 0.000000e+00f);
pad_temp_shared[(((int)threadIdx.x) + 686)] = (((((9 <= ((((int)threadIdx.x) + 38) % 81)) && (((((int)threadIdx.x) + 38) % 81) < 72)) && (1 <= ((((int)threadIdx.x) + 2) % 9))) && (((((int)threadIdx.x) + 2) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 686) / 81) * 49)) + ((((((int)threadIdx.x) + 38) % 81) / 9) * 7)) + ((((int)threadIdx.x) + 2) % 9)) - 8)] : 0.000000e+00f);
pad_temp_shared[(((int)threadIdx.x) + 735)] = ((((3 <= ((int)threadIdx.x)) && (1 <= ((((int)threadIdx.x) + 6) % 9))) && (((((int)threadIdx.x) + 6) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 735) / 81) * 49)) + (((((int)threadIdx.x) + 6) / 9) * 7)) + ((((int)threadIdx.x) + 6) % 9)) - 8)] : 0.000000e+00f);
pad_temp_shared[(((int)threadIdx.x) + 784)] = (((((9 <= ((((int)threadIdx.x) + 55) % 81)) && (((((int)threadIdx.x) + 55) % 81) < 72)) && (1 <= ((((int)threadIdx.x) + 1) % 9))) && (((((int)threadIdx.x) + 1) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 784) / 81) * 49)) + ((((((int)threadIdx.x) + 55) % 81) / 9) * 7)) + ((((int)threadIdx.x) + 1) % 9)) - 8)] : 0.000000e+00f);
pad_temp_shared[(((int)threadIdx.x) + 833)] = (((1 <= ((((int)threadIdx.x) + 5) % 9)) && (((((int)threadIdx.x) + 5) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 833) / 81) * 49)) + (((((int)threadIdx.x) + 23) / 9) * 7)) + ((((int)threadIdx.x) + 5) % 9)) - 8)] : 0.000000e+00f);
pad_temp_shared[(((int)threadIdx.x) + 882)] = (((((1 <= (((((int)threadIdx.x) / 9) + 8) % 9)) && (((((int)threadIdx.x) + 72) % 81) < 72)) && (1 <= (((int)threadIdx.x) % 9))) && ((((int)threadIdx.x) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 882) / 81) * 49)) + ((((((int)threadIdx.x) / 9) + 8) % 9) * 7)) + (((int)threadIdx.x) % 9)) - 8)] : 0.000000e+00f);
pad_temp_shared[(((int)threadIdx.x) + 931)] = (((((9 <= ((((int)threadIdx.x) + 40) % 81)) && (((((int)threadIdx.x) + 40) % 81) < 72)) && (1 <= ((((int)threadIdx.x) + 4) % 9))) && (((((int)threadIdx.x) + 4) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 931) / 81) * 49)) + ((((((int)threadIdx.x) + 40) % 81) / 9) * 7)) + ((((int)threadIdx.x) + 4) % 9)) - 8)] : 0.000000e+00f);
pad_temp_shared[(((int)threadIdx.x) + 980)] = ((((1 <= ((int)threadIdx.x)) && (1 <= ((((int)threadIdx.x) + 8) % 9))) && (((((int)threadIdx.x) + 8) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 980) / 81) * 49)) + (((((int)threadIdx.x) + 8) / 9) * 7)) + ((((int)threadIdx.x) + 8) % 9)) - 8)] : 0.000000e+00f);
pad_temp_shared[(((int)threadIdx.x) + 1029)] = (((((9 <= ((((int)threadIdx.x) + 57) % 81)) && (((((int)threadIdx.x) + 57) % 81) < 72)) && (1 <= ((((int)threadIdx.x) + 3) % 9))) && (((((int)threadIdx.x) + 3) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 1029) / 81) * 49)) + ((((((int)threadIdx.x) + 57) % 81) / 9) * 7)) + ((((int)threadIdx.x) + 3) % 9)) - 8)] : 0.000000e+00f);
pad_temp_shared[(((int)threadIdx.x) + 1078)] = ((((((int)threadIdx.x) < 47) && (1 <= ((((int)threadIdx.x) + 7) % 9))) && (((((int)threadIdx.x) + 7) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 1078) / 81) * 49)) + (((((int)threadIdx.x) + 25) / 9) * 7)) + ((((int)threadIdx.x) + 7) % 9)) - 8)] : 0.000000e+00f);
pad_temp_shared[(((int)threadIdx.x) + 1127)] = (((((9 <= ((((int)threadIdx.x) + 74) % 81)) && (((((int)threadIdx.x) + 74) % 81) < 72)) && (1 <= ((((int)threadIdx.x) + 2) % 9))) && (((((int)threadIdx.x) + 2) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 1127) / 81) * 49)) + ((((((int)threadIdx.x) + 74) % 81) / 9) * 7)) + ((((int)threadIdx.x) + 2) % 9)) - 8)] : 0.000000e+00f);
pad_temp_shared[(((int)threadIdx.x) + 1176)] = (((((9 <= ((((int)threadIdx.x) + 42) % 81)) && (((((int)threadIdx.x) + 42) % 81) < 72)) && (1 <= ((((int)threadIdx.x) + 6) % 9))) && (((((int)threadIdx.x) + 6) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 1176) / 81) * 49)) + ((((((int)threadIdx.x) + 42) % 81) / 9) * 7)) + ((((int)threadIdx.x) + 6) % 9)) - 8)] : 0.000000e+00f);
pad_temp_shared[(((int)threadIdx.x) + 1225)] = (((1 <= ((((int)threadIdx.x) + 1) % 9)) && (((((int)threadIdx.x) + 1) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 1225) / 81) * 49)) + (((((int)threadIdx.x) + 10) / 9) * 7)) + ((((int)threadIdx.x) + 1) % 9)) - 8)] : 0.000000e+00f);
if (((int)threadIdx.x) < 22) {
pad_temp_shared[(((int)threadIdx.x) + 1274)] = ((((((int)threadIdx.x) < 13) && (1 <= ((((int)threadIdx.x) + 5) % 9))) && (((((int)threadIdx.x) + 5) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 1274) / 81) * 49)) + (((((int)threadIdx.x) + 59) / 9) * 7)) + ((((int)threadIdx.x) + 5) % 9)) - 8)] : 0.000000e+00f);
}
kernel_shared[((int)threadIdx.x)] = kernel[(((((int)blockIdx.x) * 18432) + (rc_outer_outer * 144)) + ((int)threadIdx.x))];
kernel_shared[(((int)threadIdx.x) + 49)] = kernel[((((((int)blockIdx.x) * 18432) + (rc_outer_outer * 144)) + (((((int)threadIdx.x) + 49) / 3) * 3)) + ((((int)threadIdx.x) + 1) % 3))];
kernel_shared[(((int)threadIdx.x) + 98)] = kernel[(((((((int)blockIdx.x) * 18432) + (((((int)threadIdx.x) + 98) / 144) * 4608)) + (rc_outer_outer * 144)) + ((((((int)threadIdx.x) + 98) % 144) / 3) * 3)) + ((((int)threadIdx.x) + 2) % 3))];
kernel_shared[(((int)threadIdx.x) + 147)] = kernel[(((((((int)blockIdx.x) * 18432) + (((((int)threadIdx.x) + 147) / 144) * 4608)) + (rc_outer_outer * 144)) + ((int)threadIdx.x)) + 3)];
kernel_shared[(((int)threadIdx.x) + 196)] = kernel[(((((((int)blockIdx.x) * 18432) + (((((int)threadIdx.x) + 196) / 144) * 4608)) + (rc_outer_outer * 144)) + (((((int)threadIdx.x) + 52) / 3) * 3)) + ((((int)threadIdx.x) + 1) % 3))];
kernel_shared[(((int)threadIdx.x) + 245)] = kernel[(((((((int)blockIdx.x) * 18432) + (((((int)threadIdx.x) + 245) / 144) * 4608)) + (rc_outer_outer * 144)) + ((((((int)threadIdx.x) + 101) % 144) / 3) * 3)) + ((((int)threadIdx.x) + 2) % 3))];
kernel_shared[(((int)threadIdx.x) + 294)] = kernel[(((((((int)blockIdx.x) * 18432) + (((((int)threadIdx.x) + 294) / 144) * 4608)) + (rc_outer_outer * 144)) + ((int)threadIdx.x)) + 6)];
kernel_shared[(((int)threadIdx.x) + 343)] = kernel[(((((((int)blockIdx.x) * 18432) + (((((int)threadIdx.x) + 343) / 144) * 4608)) + (rc_outer_outer * 144)) + (((((int)threadIdx.x) + 55) / 3) * 3)) + ((((int)threadIdx.x) + 1) % 3))];
kernel_shared[(((int)threadIdx.x) + 392)] = kernel[(((((((int)blockIdx.x) * 18432) + (((((int)threadIdx.x) + 392) / 144) * 4608)) + (rc_outer_outer * 144)) + ((((((int)threadIdx.x) + 104) % 144) / 3) * 3)) + ((((int)threadIdx.x) + 2) % 3))];
kernel_shared[(((int)threadIdx.x) + 441)] = kernel[(((((((int)blockIdx.x) * 18432) + (((((int)threadIdx.x) + 441) / 144) * 4608)) + (rc_outer_outer * 144)) + ((int)threadIdx.x)) + 9)];
kernel_shared[(((int)threadIdx.x) + 490)] = kernel[(((((((int)blockIdx.x) * 18432) + (((((int)threadIdx.x) + 490) / 144) * 4608)) + (rc_outer_outer * 144)) + (((((int)threadIdx.x) + 58) / 3) * 3)) + ((((int)threadIdx.x) + 1) % 3))];
if (((int)threadIdx.x) < 37) {
kernel_shared[(((int)threadIdx.x) + 539)] = kernel[(((((((int)blockIdx.x) * 18432) + (((((int)threadIdx.x) + 539) / 144) * 4608)) + (rc_outer_outer * 144)) + (((((int)threadIdx.x) + 107) / 3) * 3)) + ((((int)threadIdx.x) + 2) % 3))];
}
__syncthreads();
for (int rc_outer_inner = 0; rc_outer_inner < 2; ++rc_outer_inner) {
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7))] * kernel_shared[(rc_outer_inner * 72)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 1)] * kernel_shared[((rc_outer_inner * 72) + 1)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 2)] * kernel_shared[((rc_outer_inner * 72) + 2)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 9)] * kernel_shared[((rc_outer_inner * 72) + 3)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 10)] * kernel_shared[((rc_outer_inner * 72) + 4)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 11)] * kernel_shared[((rc_outer_inner * 72) + 5)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 18)] * kernel_shared[((rc_outer_inner * 72) + 6)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 19)] * kernel_shared[((rc_outer_inner * 72) + 7)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 20)] * kernel_shared[((rc_outer_inner * 72) + 8)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 81)] * kernel_shared[((rc_outer_inner * 72) + 9)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 82)] * kernel_shared[((rc_outer_inner * 72) + 10)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 83)] * kernel_shared[((rc_outer_inner * 72) + 11)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 90)] * kernel_shared[((rc_outer_inner * 72) + 12)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 91)] * kernel_shared[((rc_outer_inner * 72) + 13)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 92)] * kernel_shared[((rc_outer_inner * 72) + 14)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 99)] * kernel_shared[((rc_outer_inner * 72) + 15)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 100)] * kernel_shared[((rc_outer_inner * 72) + 16)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 101)] * kernel_shared[((rc_outer_inner * 72) + 17)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 162)] * kernel_shared[((rc_outer_inner * 72) + 18)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 163)] * kernel_shared[((rc_outer_inner * 72) + 19)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 164)] * kernel_shared[((rc_outer_inner * 72) + 20)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 171)] * kernel_shared[((rc_outer_inner * 72) + 21)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 172)] * kernel_shared[((rc_outer_inner * 72) + 22)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 173)] * kernel_shared[((rc_outer_inner * 72) + 23)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 180)] * kernel_shared[((rc_outer_inner * 72) + 24)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 181)] * kernel_shared[((rc_outer_inner * 72) + 25)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 182)] * kernel_shared[((rc_outer_inner * 72) + 26)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 243)] * kernel_shared[((rc_outer_inner * 72) + 27)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 244)] * kernel_shared[((rc_outer_inner * 72) + 28)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 245)] * kernel_shared[((rc_outer_inner * 72) + 29)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 252)] * kernel_shared[((rc_outer_inner * 72) + 30)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 253)] * kernel_shared[((rc_outer_inner * 72) + 31)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 254)] * kernel_shared[((rc_outer_inner * 72) + 32)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 261)] * kernel_shared[((rc_outer_inner * 72) + 33)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 262)] * kernel_shared[((rc_outer_inner * 72) + 34)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 263)] * kernel_shared[((rc_outer_inner * 72) + 35)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 324)] * kernel_shared[((rc_outer_inner * 72) + 36)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 325)] * kernel_shared[((rc_outer_inner * 72) + 37)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 326)] * kernel_shared[((rc_outer_inner * 72) + 38)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 333)] * kernel_shared[((rc_outer_inner * 72) + 39)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 334)] * kernel_shared[((rc_outer_inner * 72) + 40)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 335)] * kernel_shared[((rc_outer_inner * 72) + 41)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 342)] * kernel_shared[((rc_outer_inner * 72) + 42)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 343)] * kernel_shared[((rc_outer_inner * 72) + 43)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 344)] * kernel_shared[((rc_outer_inner * 72) + 44)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 405)] * kernel_shared[((rc_outer_inner * 72) + 45)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 406)] * kernel_shared[((rc_outer_inner * 72) + 46)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 407)] * kernel_shared[((rc_outer_inner * 72) + 47)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 414)] * kernel_shared[((rc_outer_inner * 72) + 48)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 415)] * kernel_shared[((rc_outer_inner * 72) + 49)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 416)] * kernel_shared[((rc_outer_inner * 72) + 50)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 423)] * kernel_shared[((rc_outer_inner * 72) + 51)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 424)] * kernel_shared[((rc_outer_inner * 72) + 52)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 425)] * kernel_shared[((rc_outer_inner * 72) + 53)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 486)] * kernel_shared[((rc_outer_inner * 72) + 54)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 487)] * kernel_shared[((rc_outer_inner * 72) + 55)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 488)] * kernel_shared[((rc_outer_inner * 72) + 56)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 495)] * kernel_shared[((rc_outer_inner * 72) + 57)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 496)] * kernel_shared[((rc_outer_inner * 72) + 58)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 497)] * kernel_shared[((rc_outer_inner * 72) + 59)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 504)] * kernel_shared[((rc_outer_inner * 72) + 60)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 505)] * kernel_shared[((rc_outer_inner * 72) + 61)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 506)] * kernel_shared[((rc_outer_inner * 72) + 62)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 567)] * kernel_shared[((rc_outer_inner * 72) + 63)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 568)] * kernel_shared[((rc_outer_inner * 72) + 64)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 569)] * kernel_shared[((rc_outer_inner * 72) + 65)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 576)] * kernel_shared[((rc_outer_inner * 72) + 66)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 577)] * kernel_shared[((rc_outer_inner * 72) + 67)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 578)] * kernel_shared[((rc_outer_inner * 72) + 68)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 585)] * kernel_shared[((rc_outer_inner * 72) + 69)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 586)] * kernel_shared[((rc_outer_inner * 72) + 70)]));
conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 587)] * kernel_shared[((rc_outer_inner * 72) + 71)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7))] * kernel_shared[((rc_outer_inner * 72) + 144)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 1)] * kernel_shared[((rc_outer_inner * 72) + 145)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 2)] * kernel_shared[((rc_outer_inner * 72) + 146)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 9)] * kernel_shared[((rc_outer_inner * 72) + 147)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 10)] * kernel_shared[((rc_outer_inner * 72) + 148)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 11)] * kernel_shared[((rc_outer_inner * 72) + 149)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 18)] * kernel_shared[((rc_outer_inner * 72) + 150)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 19)] * kernel_shared[((rc_outer_inner * 72) + 151)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 20)] * kernel_shared[((rc_outer_inner * 72) + 152)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 81)] * kernel_shared[((rc_outer_inner * 72) + 153)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 82)] * kernel_shared[((rc_outer_inner * 72) + 154)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 83)] * kernel_shared[((rc_outer_inner * 72) + 155)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 90)] * kernel_shared[((rc_outer_inner * 72) + 156)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 91)] * kernel_shared[((rc_outer_inner * 72) + 157)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 92)] * kernel_shared[((rc_outer_inner * 72) + 158)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 99)] * kernel_shared[((rc_outer_inner * 72) + 159)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 100)] * kernel_shared[((rc_outer_inner * 72) + 160)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 101)] * kernel_shared[((rc_outer_inner * 72) + 161)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 162)] * kernel_shared[((rc_outer_inner * 72) + 162)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 163)] * kernel_shared[((rc_outer_inner * 72) + 163)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 164)] * kernel_shared[((rc_outer_inner * 72) + 164)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 171)] * kernel_shared[((rc_outer_inner * 72) + 165)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 172)] * kernel_shared[((rc_outer_inner * 72) + 166)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 173)] * kernel_shared[((rc_outer_inner * 72) + 167)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 180)] * kernel_shared[((rc_outer_inner * 72) + 168)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 181)] * kernel_shared[((rc_outer_inner * 72) + 169)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 182)] * kernel_shared[((rc_outer_inner * 72) + 170)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 243)] * kernel_shared[((rc_outer_inner * 72) + 171)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 244)] * kernel_shared[((rc_outer_inner * 72) + 172)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 245)] * kernel_shared[((rc_outer_inner * 72) + 173)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 252)] * kernel_shared[((rc_outer_inner * 72) + 174)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 253)] * kernel_shared[((rc_outer_inner * 72) + 175)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 254)] * kernel_shared[((rc_outer_inner * 72) + 176)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 261)] * kernel_shared[((rc_outer_inner * 72) + 177)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 262)] * kernel_shared[((rc_outer_inner * 72) + 178)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 263)] * kernel_shared[((rc_outer_inner * 72) + 179)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 324)] * kernel_shared[((rc_outer_inner * 72) + 180)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 325)] * kernel_shared[((rc_outer_inner * 72) + 181)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 326)] * kernel_shared[((rc_outer_inner * 72) + 182)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 333)] * kernel_shared[((rc_outer_inner * 72) + 183)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 334)] * kernel_shared[((rc_outer_inner * 72) + 184)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 335)] * kernel_shared[((rc_outer_inner * 72) + 185)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 342)] * kernel_shared[((rc_outer_inner * 72) + 186)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 343)] * kernel_shared[((rc_outer_inner * 72) + 187)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 344)] * kernel_shared[((rc_outer_inner * 72) + 188)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 405)] * kernel_shared[((rc_outer_inner * 72) + 189)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 406)] * kernel_shared[((rc_outer_inner * 72) + 190)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 407)] * kernel_shared[((rc_outer_inner * 72) + 191)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 414)] * kernel_shared[((rc_outer_inner * 72) + 192)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 415)] * kernel_shared[((rc_outer_inner * 72) + 193)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 416)] * kernel_shared[((rc_outer_inner * 72) + 194)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 423)] * kernel_shared[((rc_outer_inner * 72) + 195)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 424)] * kernel_shared[((rc_outer_inner * 72) + 196)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 425)] * kernel_shared[((rc_outer_inner * 72) + 197)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 486)] * kernel_shared[((rc_outer_inner * 72) + 198)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 487)] * kernel_shared[((rc_outer_inner * 72) + 199)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 488)] * kernel_shared[((rc_outer_inner * 72) + 200)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 495)] * kernel_shared[((rc_outer_inner * 72) + 201)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 496)] * kernel_shared[((rc_outer_inner * 72) + 202)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 497)] * kernel_shared[((rc_outer_inner * 72) + 203)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 504)] * kernel_shared[((rc_outer_inner * 72) + 204)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 505)] * kernel_shared[((rc_outer_inner * 72) + 205)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 506)] * kernel_shared[((rc_outer_inner * 72) + 206)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 567)] * kernel_shared[((rc_outer_inner * 72) + 207)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 568)] * kernel_shared[((rc_outer_inner * 72) + 208)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 569)] * kernel_shared[((rc_outer_inner * 72) + 209)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 576)] * kernel_shared[((rc_outer_inner * 72) + 210)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 577)] * kernel_shared[((rc_outer_inner * 72) + 211)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 578)] * kernel_shared[((rc_outer_inner * 72) + 212)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 585)] * kernel_shared[((rc_outer_inner * 72) + 213)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 586)] * kernel_shared[((rc_outer_inner * 72) + 214)]));
conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 587)] * kernel_shared[((rc_outer_inner * 72) + 215)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7))] * kernel_shared[((rc_outer_inner * 72) + 288)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 1)] * kernel_shared[((rc_outer_inner * 72) + 289)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 2)] * kernel_shared[((rc_outer_inner * 72) + 290)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 9)] * kernel_shared[((rc_outer_inner * 72) + 291)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 10)] * kernel_shared[((rc_outer_inner * 72) + 292)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 11)] * kernel_shared[((rc_outer_inner * 72) + 293)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 18)] * kernel_shared[((rc_outer_inner * 72) + 294)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 19)] * kernel_shared[((rc_outer_inner * 72) + 295)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 20)] * kernel_shared[((rc_outer_inner * 72) + 296)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 81)] * kernel_shared[((rc_outer_inner * 72) + 297)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 82)] * kernel_shared[((rc_outer_inner * 72) + 298)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 83)] * kernel_shared[((rc_outer_inner * 72) + 299)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 90)] * kernel_shared[((rc_outer_inner * 72) + 300)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 91)] * kernel_shared[((rc_outer_inner * 72) + 301)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 92)] * kernel_shared[((rc_outer_inner * 72) + 302)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 99)] * kernel_shared[((rc_outer_inner * 72) + 303)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 100)] * kernel_shared[((rc_outer_inner * 72) + 304)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 101)] * kernel_shared[((rc_outer_inner * 72) + 305)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 162)] * kernel_shared[((rc_outer_inner * 72) + 306)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 163)] * kernel_shared[((rc_outer_inner * 72) + 307)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 164)] * kernel_shared[((rc_outer_inner * 72) + 308)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 171)] * kernel_shared[((rc_outer_inner * 72) + 309)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 172)] * kernel_shared[((rc_outer_inner * 72) + 310)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 173)] * kernel_shared[((rc_outer_inner * 72) + 311)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 180)] * kernel_shared[((rc_outer_inner * 72) + 312)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 181)] * kernel_shared[((rc_outer_inner * 72) + 313)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 182)] * kernel_shared[((rc_outer_inner * 72) + 314)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 243)] * kernel_shared[((rc_outer_inner * 72) + 315)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 244)] * kernel_shared[((rc_outer_inner * 72) + 316)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 245)] * kernel_shared[((rc_outer_inner * 72) + 317)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 252)] * kernel_shared[((rc_outer_inner * 72) + 318)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 253)] * kernel_shared[((rc_outer_inner * 72) + 319)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 254)] * kernel_shared[((rc_outer_inner * 72) + 320)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 261)] * kernel_shared[((rc_outer_inner * 72) + 321)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 262)] * kernel_shared[((rc_outer_inner * 72) + 322)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 263)] * kernel_shared[((rc_outer_inner * 72) + 323)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 324)] * kernel_shared[((rc_outer_inner * 72) + 324)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 325)] * kernel_shared[((rc_outer_inner * 72) + 325)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 326)] * kernel_shared[((rc_outer_inner * 72) + 326)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 333)] * kernel_shared[((rc_outer_inner * 72) + 327)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 334)] * kernel_shared[((rc_outer_inner * 72) + 328)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 335)] * kernel_shared[((rc_outer_inner * 72) + 329)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 342)] * kernel_shared[((rc_outer_inner * 72) + 330)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 343)] * kernel_shared[((rc_outer_inner * 72) + 331)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 344)] * kernel_shared[((rc_outer_inner * 72) + 332)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 405)] * kernel_shared[((rc_outer_inner * 72) + 333)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 406)] * kernel_shared[((rc_outer_inner * 72) + 334)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 407)] * kernel_shared[((rc_outer_inner * 72) + 335)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 414)] * kernel_shared[((rc_outer_inner * 72) + 336)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 415)] * kernel_shared[((rc_outer_inner * 72) + 337)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 416)] * kernel_shared[((rc_outer_inner * 72) + 338)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 423)] * kernel_shared[((rc_outer_inner * 72) + 339)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 424)] * kernel_shared[((rc_outer_inner * 72) + 340)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 425)] * kernel_shared[((rc_outer_inner * 72) + 341)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 486)] * kernel_shared[((rc_outer_inner * 72) + 342)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 487)] * kernel_shared[((rc_outer_inner * 72) + 343)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 488)] * kernel_shared[((rc_outer_inner * 72) + 344)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 495)] * kernel_shared[((rc_outer_inner * 72) + 345)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 496)] * kernel_shared[((rc_outer_inner * 72) + 346)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 497)] * kernel_shared[((rc_outer_inner * 72) + 347)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 504)] * kernel_shared[((rc_outer_inner * 72) + 348)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 505)] * kernel_shared[((rc_outer_inner * 72) + 349)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 506)] * kernel_shared[((rc_outer_inner * 72) + 350)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 567)] * kernel_shared[((rc_outer_inner * 72) + 351)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 568)] * kernel_shared[((rc_outer_inner * 72) + 352)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 569)] * kernel_shared[((rc_outer_inner * 72) + 353)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 576)] * kernel_shared[((rc_outer_inner * 72) + 354)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 577)] * kernel_shared[((rc_outer_inner * 72) + 355)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 578)] * kernel_shared[((rc_outer_inner * 72) + 356)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 585)] * kernel_shared[((rc_outer_inner * 72) + 357)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 586)] * kernel_shared[((rc_outer_inner * 72) + 358)]));
conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 587)] * kernel_shared[((rc_outer_inner * 72) + 359)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7))] * kernel_shared[((rc_outer_inner * 72) + 432)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 1)] * kernel_shared[((rc_outer_inner * 72) + 433)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 2)] * kernel_shared[((rc_outer_inner * 72) + 434)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 9)] * kernel_shared[((rc_outer_inner * 72) + 435)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 10)] * kernel_shared[((rc_outer_inner * 72) + 436)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 11)] * kernel_shared[((rc_outer_inner * 72) + 437)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 18)] * kernel_shared[((rc_outer_inner * 72) + 438)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 19)] * kernel_shared[((rc_outer_inner * 72) + 439)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 20)] * kernel_shared[((rc_outer_inner * 72) + 440)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 81)] * kernel_shared[((rc_outer_inner * 72) + 441)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 82)] * kernel_shared[((rc_outer_inner * 72) + 442)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 83)] * kernel_shared[((rc_outer_inner * 72) + 443)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 90)] * kernel_shared[((rc_outer_inner * 72) + 444)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 91)] * kernel_shared[((rc_outer_inner * 72) + 445)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 92)] * kernel_shared[((rc_outer_inner * 72) + 446)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 99)] * kernel_shared[((rc_outer_inner * 72) + 447)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 100)] * kernel_shared[((rc_outer_inner * 72) + 448)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 101)] * kernel_shared[((rc_outer_inner * 72) + 449)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 162)] * kernel_shared[((rc_outer_inner * 72) + 450)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 163)] * kernel_shared[((rc_outer_inner * 72) + 451)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 164)] * kernel_shared[((rc_outer_inner * 72) + 452)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 171)] * kernel_shared[((rc_outer_inner * 72) + 453)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 172)] * kernel_shared[((rc_outer_inner * 72) + 454)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 173)] * kernel_shared[((rc_outer_inner * 72) + 455)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 180)] * kernel_shared[((rc_outer_inner * 72) + 456)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 181)] * kernel_shared[((rc_outer_inner * 72) + 457)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 182)] * kernel_shared[((rc_outer_inner * 72) + 458)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 243)] * kernel_shared[((rc_outer_inner * 72) + 459)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 244)] * kernel_shared[((rc_outer_inner * 72) + 460)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 245)] * kernel_shared[((rc_outer_inner * 72) + 461)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 252)] * kernel_shared[((rc_outer_inner * 72) + 462)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 253)] * kernel_shared[((rc_outer_inner * 72) + 463)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 254)] * kernel_shared[((rc_outer_inner * 72) + 464)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 261)] * kernel_shared[((rc_outer_inner * 72) + 465)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 262)] * kernel_shared[((rc_outer_inner * 72) + 466)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 263)] * kernel_shared[((rc_outer_inner * 72) + 467)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 324)] * kernel_shared[((rc_outer_inner * 72) + 468)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 325)] * kernel_shared[((rc_outer_inner * 72) + 469)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 326)] * kernel_shared[((rc_outer_inner * 72) + 470)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 333)] * kernel_shared[((rc_outer_inner * 72) + 471)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 334)] * kernel_shared[((rc_outer_inner * 72) + 472)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 335)] * kernel_shared[((rc_outer_inner * 72) + 473)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 342)] * kernel_shared[((rc_outer_inner * 72) + 474)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 343)] * kernel_shared[((rc_outer_inner * 72) + 475)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 344)] * kernel_shared[((rc_outer_inner * 72) + 476)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 405)] * kernel_shared[((rc_outer_inner * 72) + 477)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 406)] * kernel_shared[((rc_outer_inner * 72) + 478)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 407)] * kernel_shared[((rc_outer_inner * 72) + 479)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 414)] * kernel_shared[((rc_outer_inner * 72) + 480)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 415)] * kernel_shared[((rc_outer_inner * 72) + 481)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 416)] * kernel_shared[((rc_outer_inner * 72) + 482)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 423)] * kernel_shared[((rc_outer_inner * 72) + 483)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 424)] * kernel_shared[((rc_outer_inner * 72) + 484)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 425)] * kernel_shared[((rc_outer_inner * 72) + 485)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 486)] * kernel_shared[((rc_outer_inner * 72) + 486)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 487)] * kernel_shared[((rc_outer_inner * 72) + 487)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 488)] * kernel_shared[((rc_outer_inner * 72) + 488)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 495)] * kernel_shared[((rc_outer_inner * 72) + 489)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 496)] * kernel_shared[((rc_outer_inner * 72) + 490)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 497)] * kernel_shared[((rc_outer_inner * 72) + 491)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 504)] * kernel_shared[((rc_outer_inner * 72) + 492)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 505)] * kernel_shared[((rc_outer_inner * 72) + 493)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 506)] * kernel_shared[((rc_outer_inner * 72) + 494)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 567)] * kernel_shared[((rc_outer_inner * 72) + 495)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 568)] * kernel_shared[((rc_outer_inner * 72) + 496)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 569)] * kernel_shared[((rc_outer_inner * 72) + 497)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 576)] * kernel_shared[((rc_outer_inner * 72) + 498)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 577)] * kernel_shared[((rc_outer_inner * 72) + 499)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 578)] * kernel_shared[((rc_outer_inner * 72) + 500)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 585)] * kernel_shared[((rc_outer_inner * 72) + 501)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 586)] * kernel_shared[((rc_outer_inner * 72) + 502)]));
conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 587)] * kernel_shared[((rc_outer_inner * 72) + 503)]));
}
}
for (int i1_inner = 0; i1_inner < 4; ++i1_inner) {
compute[(((((int)blockIdx.x) * 196) + (i1_inner * 49)) + ((int)threadIdx.x))] = max((conv2d_nchw[i1_inner] + bias[((((int)blockIdx.x) * 4) + i1_inner)]), 0.000000e+00f);
}
}
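The long run of unrolled multiply-accumulate statements in the kernel above follows a regular pattern. As a reading aid, the sketch below regenerates the constant offsets of those statements from a small loop nest and decodes the flat index used in the final store. It is not produced by TVM: the tile sizes (a 9x9 padded tile, 8 input channels per step, 144 weights per output channel) are inferred from the constants in the listing, the 1x512x7x7 output shape is taken from the task definition, and the helper names `unrolled_offsets` and `decode_output_index` are hypothetical. The per-thread base `(threadIdx.x / 7) * 9 + threadIdx.x % 7` and the `rc_outer_inner` terms are left out, since they are the same in every statement.

```python
# Reading aid for the unrolled kernel body; this is not TVM output.
# All sizes below are inferred from the constants visible in the listing.

TILE_W = 9             # row stride of the padded tile in pad_temp_shared
TILE_HW = 81           # 9 * 9 elements per input channel
RC_INNER = 8           # input channels per rc_outer_inner step (648 / 81)
KH = KW = 3            # 3x3 convolution window
WEIGHTS_PER_OUT = 144  # kernel_shared stride between output channels


def unrolled_offsets(f):
    """Return (pad_temp_shared offset, kernel_shared offset) pairs for the
    accumulator conv2d_nchw[f], in the order the statements are listed."""
    pairs = []
    for rc in range(RC_INNER):
        for ky in range(KH):
            for kx in range(KW):
                pad_off = rc * TILE_HW + ky * TILE_W + kx
                ker_off = f * WEIGHTS_PER_OUT + (rc * KH + ky) * KW + kx
                pairs.append((pad_off, ker_off))
    return pairs


def decode_output_index(block_idx, i1_inner, thread_idx):
    """Decode the flat index of the final store into (channel, y, x),
    assuming a 1 x 512 x 7 x 7 output in NCHW layout."""
    flat = block_idx * 196 + i1_inner * 49 + thread_idx
    channel, spatial = divmod(flat, 49)
    return channel, spatial // 7, spatial % 7


pairs = unrolled_offsets(2)
print(pairs[:4])   # [(0, 288), (1, 289), (2, 290), (9, 291)]
print(pairs[-1])   # (587, 359)
print(decode_output_index(5, 3, 10))  # (23, 1, 3)
```

Read this way, each thread handles one of the 49 output pixels and keeps 4 accumulators, one per output channel of its block. The decoded channel equals `blockIdx.x * 4 + i1_inner`, which is the same expression the kernel uses to index `bias`.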
A more complicated example is to resume the search. In this case, we need to create the search policy and cost model ourselves and restore their state from the log file. In the example below, we restore that state and run 5 more trials.
def resume_search(task, log_file):
    print("Resume search:")
    cost_model = auto_scheduler.XGBModel()
    cost_model.update_from_file(log_file)
    search_policy = auto_scheduler.SketchPolicy(
        task, cost_model, init_search_callbacks=[auto_scheduler.PreloadMeasuredStates(log_file)]
    )
    measure_ctx = auto_scheduler.LocalRPCMeasureContext(min_repeat_ms=300)
    tune_option = auto_scheduler.TuningOptions(
        num_measure_trials=5,
        runner=measure_ctx.runner,
        measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
    )
    task.tune(tune_option, search_policy=search_policy)

    # Kill the measurement process
    del measure_ctx


resume_search(task, log_file)
Resume search:
/usr/local/lib/python3.7/dist-packages/xgboost/training.py:17: UserWarning: Old style callback is deprecated. See: https://xgboost.readthedocs.io/en/latest/python/callbacks.html
warnings.warn(f'Old style callback is deprecated. See: {link}', UserWarning)
Get devices for measurement successfully!
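One simple way to check that a resumed search actually appended new measurements is to count the records in the log file before and after the call. The sketch below is a plain-Python helper, not part of the auto-scheduler API. It assumes the log is a text file with one record per line, and it skips blank lines and lines starting with `#` as a defensive choice. The name `count_records` and the placeholder file contents are hypothetical.

```python
import os
import tempfile


def count_records(log_path):
    """Count record lines in a tuning log, skipping blank and '#' lines.

    Assumes one record per line; returns 0 if the file does not exist.
    """
    if not os.path.exists(log_path):
        return 0
    count = 0
    with open(log_path, "r") as f:
        for line in f:
            if line.strip() and not line.startswith("#"):
                count += 1
    return count


# Self-contained demonstration on a throwaway file. The lines below are
# placeholders, not real tuning records.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.json")
    with open(demo_path, "w") as f:
        f.write("record-1\nrecord-2\n\n# comment\nrecord-3\n")
    demo_count = count_records(demo_path)

print(demo_count)  # 3
```

In the setting of this tutorial, calling `count_records(log_file)` before and after `resume_search(task, log_file)` should show the count growing by at most 5, matching `num_measure_trials=5`.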
Total running time of the script: (2 minutes 35.003 seconds)