Auto-scheduling a Convolution Layer for GPU

Authors: Lianmin Zheng, Chengfan Jia

This is a tutorial on how to use the auto-scheduler for GPUs.

Unlike the template-based AutoTVM, which relies on manual templates to define the search space, the auto-scheduler does not require any templates. Users only need to write the computation declaration, without any schedule commands or templates. The auto-scheduler automatically generates a large search space and finds a good schedule in that space.

We use a convolution layer as an example in this tutorial.

Note that this tutorial will not run as written on Windows or recent versions of macOS. To get it to run, you will need to wrap the body of this tutorial in an if __name__ == "__main__": block.
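A minimal sketch of that wrapping pattern follows. The function name `main` and its placeholder body are illustrative and not part of the tutorial. The idea is only that the tutorial's statements move inside a function that runs when the file is executed as a script, and not when the module is imported again, which is presumably what happens on platforms where helper processes are spawned rather than forked.

```python
def main():
    # Placeholder for the tutorial body: define the workload, create the
    # search task, tune, and evaluate. Here it only returns a status string.
    return "tutorial body finished"


if __name__ == "__main__":
    # Runs only when the file is executed as a script, not when it is imported.
    print(main())
```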

import os

import numpy as np
import tvm
from tvm import te, auto_scheduler, topi
from tvm.topi.testing import conv2d_nchw_python

Define the computation

To begin with, let us define the computation of a convolution layer. The function should return the list of input/output tensors. From these tensors, the auto-scheduler can get the whole computational graph.

@auto_scheduler.register_workload
def conv2d_layer(N, H, W, CO, CI, KH, KW, stride, padding):
    data = te.placeholder((N, CI, H, W), name="data")
    kernel = te.placeholder((CO, CI, KH, KW), name="kernel")
    bias = te.placeholder((1, CO, 1, 1), name="bias")
    conv = topi.nn.conv2d_nchw(data, kernel, stride, padding, dilation=1, out_dtype="float32")
    out = topi.nn.relu(conv + bias)
    return [data, kernel, bias, out]
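As a rough aid for reading the shapes in the rest of the tutorial, the sketch below computes the output size of such a layer with the standard convolution arithmetic. The helper name `conv2d_output_shape` is an assumption made for illustration and is not a TVM function; dilation is fixed to 1, as in the declaration above.

```python
def conv2d_output_shape(N, H, W, CO, KH, KW, stride, padding):
    """Output shape (NCHW) of a 2-D convolution with dilation 1.

    stride and padding are (height, width) pairs; padding is applied
    symmetrically on both sides of each spatial axis.
    """
    stride_h, stride_w = stride
    pad_h, pad_w = padding
    out_h = (H + 2 * pad_h - KH) // stride_h + 1
    out_w = (W + 2 * pad_w - KW) // stride_w + 1
    return (N, CO, out_h, out_w)


# A 3x3 kernel with stride 1 and padding 1 preserves the 7x7 spatial size.
print(conv2d_output_shape(1, 7, 7, 512, 3, 3, (1, 1), (1, 1)))
```

With the values used in this tutorial, the output keeps the 7x7 spatial size of the input, so the final tensor has shape (1, 512, 7, 7).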

Create the search task

We then create a search task for the last convolution layer in ResNet-50.

target = tvm.target.Target("cuda")

# Use the last layer in ResNet-50
N, H, W, CO, CI, KH, KW, strides, padding = 1, 7, 7, 512, 512, 3, 3, (1, 1), (1, 1)
task = auto_scheduler.SearchTask(
    func=conv2d_layer, args=(N, H, W, CO, CI, KH, KW, strides, padding), target=target
)

# Inspect the computational graph
print("Computational DAG:")
print(task.compute_dag)

Out:

Computational DAG:
data = PLACEHOLDER [1, 512, 7, 7]
pad_temp(i0, i1, i2, i3) = tir.if_then_else(((((i2 >= 1) && (i2 < 8)) && (i3 >= 1)) && (i3 < 8)), data[i0, i1, (i2 - 1), (i3 - 1)], 0f)
kernel = PLACEHOLDER [512, 512, 3, 3]
compute(nn, ff, yy, xx) += (pad_temp[nn, rc, (yy + ry), (xx + rx)]*kernel[ff, rc, ry, rx])
bias = PLACEHOLDER [1, 512, 1, 1]
T_add(ax0, ax1, ax2, ax3) = (compute[ax0, ax1, ax2, ax3] + bias[ax0, ax1, 0, 0])
compute(i0, i1, i2, i3) = max(T_add[i0, i1, i2, i3], 0f)
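The `pad_temp` stage in the printed DAG can be read as zero padding: an element of the padded tensor is copied from `data` when its row and column fall inside the original 7x7 image (shifted by the padding of 1), and is 0 otherwise. The plain-Python sketch below mirrors that condition for a single 2-D channel. The function name `pad_temp_at` and the toy input are assumptions made for illustration.

```python
def pad_temp_at(data, i2, i3, pad=1):
    """Value of the zero-padded image at padded coordinates (i2, i3).

    data is a list of rows (one 2-D channel). Mirrors the if_then_else in
    the DAG: inside the original image, read data[i2 - pad][i3 - pad];
    in the padded border, return 0.
    """
    height, width = len(data), len(data[0])
    inside = pad <= i2 < height + pad and pad <= i3 < width + pad
    return data[i2 - pad][i3 - pad] if inside else 0.0


# Toy 7x7 channel whose element (r, c) holds the value 10*r + c + 1.
channel = [[float(10 * r + c + 1) for c in range(7)] for r in range(7)]

print(pad_temp_at(channel, 0, 0))  # top-left border element
print(pad_temp_at(channel, 1, 1))  # maps to channel[0][0]
print(pad_temp_at(channel, 7, 7))  # maps to channel[6][6]
print(pad_temp_at(channel, 8, 3))  # bottom border element
```

For a 7x7 input with padding 1, the `inside` test reduces to `1 <= i2 < 8 and 1 <= i3 < 8`, which is the condition shown in the DAG.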

Next, we set parameters for the auto-scheduler. These parameters mainly specify how we do the measurement during the search.

  • measure_ctx launches a separate process for measurement to provide isolation. It protects the main process from GPU crashes during measurement and avoids other runtime conflicts.

  • min_repeat_ms defines the minimum duration of one “repeat” in every measurement. This warms up the GPU, which is necessary to get accurate measurement results. Typically, we recommend a value >= 300 ms.

  • num_measure_trials is the number of measurement trials we can use during the search. We only make 10 trials in this tutorial for a fast demonstration. In practice, 1000 is a good value for the search to converge. You can do more trials according to your time budget.

  • In addition, we use RecordToFile to dump measurement records into the file conv2d.json. The measurement records can be used to query the best result in the history, resume the search, and do more analyses later.

  • See auto_scheduler.TuningOptions and auto_scheduler.LocalRPCMeasureContext for more parameters.

log_file = "conv2d.json"
measure_ctx = auto_scheduler.LocalRPCMeasureContext(min_repeat_ms=300)
tune_option = auto_scheduler.TuningOptions(
    num_measure_trials=10,  # change this to 1000 to achieve the best performance
    runner=measure_ctx.runner,
    measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
    verbose=2,
)

Out:

Get devices for measurement successfully!
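To give a feel for what min_repeat_ms implies, the sketch below estimates how many back-to-back kernel runs are needed for one repeat to last at least the requested duration. This is only an arithmetic illustration, not TVM's measurement code; the helper name `runs_per_repeat` and the per-run times are hypothetical.

```python
import math


def runs_per_repeat(per_run_ms, min_repeat_ms):
    """Smallest number of runs whose total time reaches min_repeat_ms."""
    if per_run_ms <= 0:
        raise ValueError("per_run_ms must be positive")
    return max(1, math.ceil(min_repeat_ms / per_run_ms))


# Hypothetical kernel times in milliseconds.
for per_run_ms in (0.25, 0.5, 5.0, 500.0):
    print(per_run_ms, runs_per_repeat(per_run_ms, min_repeat_ms=300))
```

Under this reading, a sub-millisecond kernel is executed hundreds of times within a single repeat, which is consistent with the warm-up effect described for min_repeat_ms.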

Check correctness and evaluate performance

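We run the search, apply the best schedule recorded in the log file, then build the binary and check its correctness and performance.

# Run auto-tuning (search); measurement records are written to log_file
task.tune(tune_option)
# Apply the best schedule found during the search
sch, args = task.apply_best(log_file)

func = tvm.build(sch, args, target)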

# Check correctness
data_np = np.random.uniform(size=(N, CI, H, W)).astype(np.float32)
weight_np = np.random.uniform(size=(CO, CI, KH, KW)).astype(np.float32)
bias_np = np.random.uniform(size=(1, CO, 1, 1)).astype(np.float32)
conv_np = conv2d_nchw_python(data_np, weight_np, strides, padding)
out_np = np.maximum(conv_np + bias_np, 0.0)

dev = tvm.cuda()
data_tvm = tvm.nd.array(data_np, device=dev)
weight_tvm = tvm.nd.array(weight_np, device=dev)
bias_tvm = tvm.nd.array(bias_np, device=dev)
out_tvm = tvm.nd.empty(out_np.shape, device=dev)
func(data_tvm, weight_tvm, bias_tvm, out_tvm)

# Check results
np.testing.assert_allclose(out_np, out_tvm.numpy(), rtol=1e-3)

# Evaluate execution time
evaluator = func.time_evaluator(func.entry_name, dev, min_repeat_ms=500)
print(
    "Execution time of this operator: %.3f ms"
    % (np.median(evaluator(data_tvm, weight_tvm, bias_tvm, out_tvm).results) * 1000)
)

Out:

Execution time of this operator: 0.353 ms
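One way to put the measured time in context is to convert it into arithmetic throughput. The sketch below counts the multiply-add operations of the convolution alone (ignoring the bias addition and ReLU, which are comparatively negligible) and divides by the reported time. The helper names are assumptions for illustration, and the 0.353 ms figure is simply the value printed above; it will differ on other hardware and with more tuning trials.

```python
def conv2d_flops(N, CO, OH, OW, CI, KH, KW):
    """Floating-point operations of a 2-D convolution.

    Each output element needs CI * KH * KW multiply-adds, and a
    multiply-add is counted as 2 operations.
    """
    return 2 * N * CO * OH * OW * CI * KH * KW


def gflops_per_second(flops, time_ms):
    """Throughput in GFLOP/s for a run that took time_ms milliseconds."""
    return flops / (time_ms * 1e-3) / 1e9


flops = conv2d_flops(N=1, CO=512, OH=7, OW=7, CI=512, KH=3, KW=3)
print("FLOPs: %d" % flops)
print("Throughput: %.1f GFLOP/s" % gflops_per_second(flops, time_ms=0.353))
```

Comparing such a figure with the peak throughput of the target GPU gives a rough idea of how much headroom remains for further tuning.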

Using the record file

During the search, all measurement records are dumped into the record file “conv2d.json”. The measurement records can be used to re-apply search results, resume the search, and perform other analyses.

Here is an example where we load the best schedule from the record file and print the equivalent Python schedule API calls and the CUDA source code. Both are useful for debugging and for learning the behavior of the auto-scheduler.

print("Equivalent python schedule:")
print(task.print_best(log_file, print_mode="schedule"))

print("CUDA source code:")
print(task.print_best(log_file, print_mode="cuda"))

Out:

Equivalent python schedule:
pad_temp_i0, pad_temp_i1, pad_temp_i2, pad_temp_i3 = tuple(pad_temp.op.axis) + tuple(pad_temp.op.reduce_axis)
compute_nn, compute_ff, compute_yy, compute_xx, compute_rc, compute_ry, compute_rx = tuple(compute.op.axis) + tuple(compute.op.reduce_axis)
T_add_ax0, T_add_ax1, T_add_ax2, T_add_ax3 = tuple(T_add.op.axis) + tuple(T_add.op.reduce_axis)
compute_i0, compute_i1, compute_i2, compute_i3 = tuple(compute.op.axis) + tuple(compute.op.reduce_axis)
s[T_add].compute_inline()
compute_nn_o_i, compute_nn_i = s[compute].split(compute_nn, factor=1)
compute_nn_o_o_i, compute_nn_o_i = s[compute].split(compute_nn_o_i, factor=1)
compute_nn_o_o_o_i, compute_nn_o_o_i = s[compute].split(compute_nn_o_o_i, factor=1)
compute_nn_o_o_o_o, compute_nn_o_o_o_i = s[compute].split(compute_nn_o_o_o_i, factor=1)
compute_ff_o_i, compute_ff_i = s[compute].split(compute_ff, factor=1)
compute_ff_o_o_i, compute_ff_o_i = s[compute].split(compute_ff_o_i, factor=1)
compute_ff_o_o_o_i, compute_ff_o_o_i = s[compute].split(compute_ff_o_o_i, factor=8)
compute_ff_o_o_o_o, compute_ff_o_o_o_i = s[compute].split(compute_ff_o_o_o_i, factor=1)
compute_yy_o_i, compute_yy_i = s[compute].split(compute_yy, factor=1)
compute_yy_o_o_i, compute_yy_o_i = s[compute].split(compute_yy_o_i, factor=1)
compute_yy_o_o_o_i, compute_yy_o_o_i = s[compute].split(compute_yy_o_o_i, factor=7)
compute_yy_o_o_o_o, compute_yy_o_o_o_i = s[compute].split(compute_yy_o_o_o_i, factor=1)
compute_xx_o_i, compute_xx_i = s[compute].split(compute_xx, factor=1)
compute_xx_o_o_i, compute_xx_o_i = s[compute].split(compute_xx_o_i, factor=1)
compute_xx_o_o_o_i, compute_xx_o_o_i = s[compute].split(compute_xx_o_o_i, factor=1)
compute_xx_o_o_o_o, compute_xx_o_o_o_i = s[compute].split(compute_xx_o_o_o_i, factor=7)
compute_rc_o_i, compute_rc_i = s[compute].split(compute_rc, factor=16)
compute_rc_o_o, compute_rc_o_i = s[compute].split(compute_rc_o_i, factor=1)
compute_ry_o_i, compute_ry_i = s[compute].split(compute_ry, factor=3)
compute_ry_o_o, compute_ry_o_i = s[compute].split(compute_ry_o_i, factor=1)
compute_rx_o_i, compute_rx_i = s[compute].split(compute_rx, factor=1)
compute_rx_o_o, compute_rx_o_i = s[compute].split(compute_rx_o_i, factor=1)
s[compute].reorder(compute_nn_o_o_o_o, compute_ff_o_o_o_o, compute_yy_o_o_o_o, compute_xx_o_o_o_o, compute_nn_o_o_o_i, compute_ff_o_o_o_i, compute_yy_o_o_o_i, compute_xx_o_o_o_i, compute_nn_o_o_i, compute_ff_o_o_i, compute_yy_o_o_i, compute_xx_o_o_i, compute_rc_o_o, compute_ry_o_o, compute_rx_o_o, compute_rc_o_i, compute_ry_o_i, compute_rx_o_i, compute_nn_o_i, compute_ff_o_i, compute_yy_o_i, compute_xx_o_i, compute_rc_i, compute_ry_i, compute_rx_i, compute_nn_i, compute_ff_i, compute_yy_i, compute_xx_i)
compute_i0_o_i, compute_i0_i = s[compute].split(compute_i0, factor=1)
compute_i0_o_o_i, compute_i0_o_i = s[compute].split(compute_i0_o_i, factor=1)
compute_i0_o_o_o, compute_i0_o_o_i = s[compute].split(compute_i0_o_o_i, factor=1)
compute_i1_o_i, compute_i1_i = s[compute].split(compute_i1, factor=1)
compute_i1_o_o_i, compute_i1_o_i = s[compute].split(compute_i1_o_i, factor=8)
compute_i1_o_o_o, compute_i1_o_o_i = s[compute].split(compute_i1_o_o_i, factor=1)
compute_i2_o_i, compute_i2_i = s[compute].split(compute_i2, factor=1)
compute_i2_o_o_i, compute_i2_o_i = s[compute].split(compute_i2_o_i, factor=7)
compute_i2_o_o_o, compute_i2_o_o_i = s[compute].split(compute_i2_o_o_i, factor=1)
compute_i3_o_i, compute_i3_i = s[compute].split(compute_i3, factor=1)
compute_i3_o_o_i, compute_i3_o_i = s[compute].split(compute_i3_o_i, factor=1)
compute_i3_o_o_o, compute_i3_o_o_i = s[compute].split(compute_i3_o_o_i, factor=7)
s[compute].reorder(compute_i0_o_o_o, compute_i1_o_o_o, compute_i2_o_o_o, compute_i3_o_o_o, compute_i0_o_o_i, compute_i1_o_o_i, compute_i2_o_o_i, compute_i3_o_o_i, compute_i0_o_i, compute_i1_o_i, compute_i2_o_i, compute_i3_o_i, compute_i0_i, compute_i1_i, compute_i2_i, compute_i3_i)
s[compute].compute_at(s[compute], compute_i3_o_i)
kernel_shared = s.cache_read(kernel, "shared", [compute])
kernel_shared_ax0, kernel_shared_ax1, kernel_shared_ax2, kernel_shared_ax3 = tuple(kernel_shared.op.axis)
s[kernel_shared].compute_at(s[compute], compute_rx_o_o)
pad_temp_shared = s.cache_read(pad_temp, "shared", [compute])
pad_temp_shared_ax0, pad_temp_shared_ax1, pad_temp_shared_ax2, pad_temp_shared_ax3 = tuple(pad_temp_shared.op.axis)
s[pad_temp_shared].compute_at(s[compute], compute_rx_o_o)
s[pad_temp].compute_inline()
compute_i0_o_o_o_i1_o_o_o_fused_i2_o_o_o_fused_i3_o_o_o_fused = s[compute].fuse(compute_i0_o_o_o, compute_i1_o_o_o, compute_i2_o_o_o, compute_i3_o_o_o)
s[compute].bind(compute_i0_o_o_o_i1_o_o_o_fused_i2_o_o_o_fused_i3_o_o_o_fused, te.thread_axis("blockIdx.x"))
compute_i0_o_o_i_i1_o_o_i_fused_i2_o_o_i_fused_i3_o_o_i_fused = s[compute].fuse(compute_i0_o_o_i, compute_i1_o_o_i, compute_i2_o_o_i, compute_i3_o_o_i)
s[compute].bind(compute_i0_o_o_i_i1_o_o_i_fused_i2_o_o_i_fused_i3_o_o_i_fused, te.thread_axis("vthread"))
compute_i0_o_i_i1_o_i_fused_i2_o_i_fused_i3_o_i_fused = s[compute].fuse(compute_i0_o_i, compute_i1_o_i, compute_i2_o_i, compute_i3_o_i)
s[compute].bind(compute_i0_o_i_i1_o_i_fused_i2_o_i_fused_i3_o_i_fused, te.thread_axis("threadIdx.x"))
kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused = s[kernel_shared].fuse(kernel_shared_ax0, kernel_shared_ax1, kernel_shared_ax2, kernel_shared_ax3)
kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_i = s[kernel_shared].split(kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused, factor=1)
s[kernel_shared].vectorize(kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_i)
kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_o, kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i = s[kernel_shared].split(kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, factor=56)
s[kernel_shared].bind(kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i, te.thread_axis("threadIdx.x"))
pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused = s[pad_temp_shared].fuse(pad_temp_shared_ax0, pad_temp_shared_ax1, pad_temp_shared_ax2, pad_temp_shared_ax3)
pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_i = s[pad_temp_shared].split(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused, factor=1)
s[pad_temp_shared].vectorize(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_i)
pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_o, pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i = s[pad_temp_shared].split(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, factor=56)
s[pad_temp_shared].bind(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i, te.thread_axis("threadIdx.x"))
s[compute].pragma(compute_nn_o_o_o_o, "auto_unroll_max_step", 1024)
s[compute].pragma(compute_nn_o_o_o_o, "unroll_explicit", True)

CUDA source code:

#ifdef _WIN32
  using uint = unsigned int;
  using uchar = unsigned char;
  using ushort = unsigned short;
  using int64_t = long long;
  using uint64_t = unsigned long long;
#else
  #define uint unsigned int
  #define uchar unsigned char
  #define ushort unsigned short
  #define int64_t long long
  #define uint64_t unsigned long long
#endif
extern "C" __global__ void default_function_kernel0(float* __restrict__ data, float* __restrict__ kernel, float* __restrict__ compute, float* __restrict__ bias) {
  float compute1[7];
  __shared__ float pad_temp_shared[1008];
  __shared__ float kernel_shared[384];
  compute1[(0)] = 0.000000e+00f;
  compute1[(1)] = 0.000000e+00f;
  compute1[(2)] = 0.000000e+00f;
  compute1[(3)] = 0.000000e+00f;
  compute1[(4)] = 0.000000e+00f;
  compute1[(5)] = 0.000000e+00f;
  compute1[(6)] = 0.000000e+00f;
  for (int rc_outer_outer = 0; rc_outer_outer < 32; ++rc_outer_outer) {
    for (int rx_outer_outer = 0; rx_outer_outer < 3; ++rx_outer_outer) {
      __syncthreads();
      pad_temp_shared[(((int)threadIdx.x))] = ((((7 <= ((int)threadIdx.x)) && (1 <= (rx_outer_outer + (((int)threadIdx.x) % 7)))) && ((rx_outer_outer + (((int)threadIdx.x) % 7)) < 8)) ? data[(((((rc_outer_outer * 784) + ((int)threadIdx.x)) + rx_outer_outer) - 8))] : 0.000000e+00f);
      pad_temp_shared[((((int)threadIdx.x) + 56))] = (((((1 <= (((((int)threadIdx.x) / 7) + 8) % 9)) && ((((((int)threadIdx.x) / 7) + 8) % 9) < 8)) && (1 <= (rx_outer_outer + (((int)threadIdx.x) % 7)))) && ((rx_outer_outer + (((int)threadIdx.x) % 7)) < 8)) ? data[(((((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 56) / 63) * 49)) + ((((((int)threadIdx.x) / 7) + 8) % 9) * 7)) + rx_outer_outer) + (((int)threadIdx.x) % 7)) - 8))] : 0.000000e+00f);
      pad_temp_shared[((((int)threadIdx.x) + 112))] = (((((1 <= (((((int)threadIdx.x) / 7) + 7) % 9)) && ((((((int)threadIdx.x) / 7) + 7) % 9) < 8)) && (1 <= (rx_outer_outer + (((int)threadIdx.x) % 7)))) && ((rx_outer_outer + (((int)threadIdx.x) % 7)) < 8)) ? data[(((((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 112) / 63) * 49)) + ((((((int)threadIdx.x) / 7) + 7) % 9) * 7)) + rx_outer_outer) + (((int)threadIdx.x) % 7)) - 8))] : 0.000000e+00f);
      pad_temp_shared[((((int)threadIdx.x) + 168))] = (((((1 <= (((((int)threadIdx.x) / 7) + 6) % 9)) && ((((((int)threadIdx.x) / 7) + 6) % 9) < 8)) && (1 <= (rx_outer_outer + (((int)threadIdx.x) % 7)))) && ((rx_outer_outer + (((int)threadIdx.x) % 7)) < 8)) ? data[(((((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 168) / 63) * 49)) + ((((((int)threadIdx.x) / 7) + 6) % 9) * 7)) + rx_outer_outer) + (((int)threadIdx.x) % 7)) - 8))] : 0.000000e+00f);
      pad_temp_shared[((((int)threadIdx.x) + 224))] = (((((1 <= (((((int)threadIdx.x) / 7) + 5) % 9)) && ((((((int)threadIdx.x) / 7) + 5) % 9) < 8)) && (1 <= (rx_outer_outer + (((int)threadIdx.x) % 7)))) && ((rx_outer_outer + (((int)threadIdx.x) % 7)) < 8)) ? data[(((((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 224) / 63) * 49)) + ((((((int)threadIdx.x) / 7) + 5) % 9) * 7)) + rx_outer_outer) + (((int)threadIdx.x) % 7)) - 8))] : 0.000000e+00f);
      pad_temp_shared[((((int)threadIdx.x) + 280))] = (((((1 <= (((((int)threadIdx.x) / 7) + 4) % 9)) && ((((((int)threadIdx.x) / 7) + 4) % 9) < 8)) && (1 <= (rx_outer_outer + (((int)threadIdx.x) % 7)))) && ((rx_outer_outer + (((int)threadIdx.x) % 7)) < 8)) ? data[(((((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 280) / 63) * 49)) + ((((((int)threadIdx.x) / 7) + 4) % 9) * 7)) + rx_outer_outer) + (((int)threadIdx.x) % 7)) - 8))] : 0.000000e+00f);
      pad_temp_shared[((((int)threadIdx.x) + 336))] = (((((1 <= (((((int)threadIdx.x) / 7) + 3) % 9)) && ((((((int)threadIdx.x) / 7) + 3) % 9) < 8)) && (1 <= (rx_outer_outer + (((int)threadIdx.x) % 7)))) && ((rx_outer_outer + (((int)threadIdx.x) % 7)) < 8)) ? data[(((((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 336) / 63) * 49)) + ((((((int)threadIdx.x) / 7) + 3) % 9) * 7)) + rx_outer_outer) + (((int)threadIdx.x) % 7)) - 8))] : 0.000000e+00f);
      pad_temp_shared[((((int)threadIdx.x) + 392))] = (((((1 <= (((((int)threadIdx.x) / 7) + 2) % 9)) && ((((((int)threadIdx.x) / 7) + 2) % 9) < 8)) && (1 <= (rx_outer_outer + (((int)threadIdx.x) % 7)))) && ((rx_outer_outer + (((int)threadIdx.x) % 7)) < 8)) ? data[(((((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 392) / 63) * 49)) + ((((((int)threadIdx.x) / 7) + 2) % 9) * 7)) + rx_outer_outer) + (((int)threadIdx.x) % 7)) - 8))] : 0.000000e+00f);
      pad_temp_shared[((((int)threadIdx.x) + 448))] = ((((((int)threadIdx.x) < 49) && (1 <= (rx_outer_outer + (((int)threadIdx.x) % 7)))) && ((rx_outer_outer + (((int)threadIdx.x) % 7)) < 8)) ? data[(((((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 448) / 63) * 49)) + (((((int)threadIdx.x) / 7) + 1) * 7)) + rx_outer_outer) + (((int)threadIdx.x) % 7)) - 8))] : 0.000000e+00f);
      pad_temp_shared[((((int)threadIdx.x) + 504))] = ((((7 <= ((int)threadIdx.x)) && (1 <= (rx_outer_outer + (((int)threadIdx.x) % 7)))) && ((rx_outer_outer + (((int)threadIdx.x) % 7)) < 8)) ? data[(((((rc_outer_outer * 784) + ((int)threadIdx.x)) + rx_outer_outer) + 384))] : 0.000000e+00f);
      pad_temp_shared[((((int)threadIdx.x) + 560))] = (((((1 <= (((((int)threadIdx.x) / 7) + 8) % 9)) && ((((((int)threadIdx.x) / 7) + 8) % 9) < 8)) && (1 <= (rx_outer_outer + (((int)threadIdx.x) % 7)))) && ((rx_outer_outer + (((int)threadIdx.x) % 7)) < 8)) ? data[(((((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 560) / 63) * 49)) + ((((((int)threadIdx.x) / 7) + 8) % 9) * 7)) + rx_outer_outer) + (((int)threadIdx.x) % 7)) - 8))] : 0.000000e+00f);
      pad_temp_shared[((((int)threadIdx.x) + 616))] = (((((1 <= (((((int)threadIdx.x) / 7) + 7) % 9)) && ((((((int)threadIdx.x) / 7) + 7) % 9) < 8)) && (1 <= (rx_outer_outer + (((int)threadIdx.x) % 7)))) && ((rx_outer_outer + (((int)threadIdx.x) % 7)) < 8)) ? data[(((((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 616) / 63) * 49)) + ((((((int)threadIdx.x) / 7) + 7) % 9) * 7)) + rx_outer_outer) + (((int)threadIdx.x) % 7)) - 8))] : 0.000000e+00f);
      pad_temp_shared[((((int)threadIdx.x) + 672))] = (((((1 <= (((((int)threadIdx.x) / 7) + 6) % 9)) && ((((((int)threadIdx.x) / 7) + 6) % 9) < 8)) && (1 <= (rx_outer_outer + (((int)threadIdx.x) % 7)))) && ((rx_outer_outer + (((int)threadIdx.x) % 7)) < 8)) ? data[(((((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 672) / 63) * 49)) + ((((((int)threadIdx.x) / 7) + 6) % 9) * 7)) + rx_outer_outer) + (((int)threadIdx.x) % 7)) - 8))] : 0.000000e+00f);
      pad_temp_shared[((((int)threadIdx.x) + 728))] = (((((1 <= (((((int)threadIdx.x) / 7) + 5) % 9)) && ((((((int)threadIdx.x) / 7) + 5) % 9) < 8)) && (1 <= (rx_outer_outer + (((int)threadIdx.x) % 7)))) && ((rx_outer_outer + (((int)threadIdx.x) % 7)) < 8)) ? data[(((((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 728) / 63) * 49)) + ((((((int)threadIdx.x) / 7) + 5) % 9) * 7)) + rx_outer_outer) + (((int)threadIdx.x) % 7)) - 8))] : 0.000000e+00f);
      pad_temp_shared[((((int)threadIdx.x) + 784))] = (((((1 <= (((((int)threadIdx.x) / 7) + 4) % 9)) && ((((((int)threadIdx.x) / 7) + 4) % 9) < 8)) && (1 <= (rx_outer_outer + (((int)threadIdx.x) % 7)))) && ((rx_outer_outer + (((int)threadIdx.x) % 7)) < 8)) ? data[(((((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 784) / 63) * 49)) + ((((((int)threadIdx.x) / 7) + 4) % 9) * 7)) + rx_outer_outer) + (((int)threadIdx.x) % 7)) - 8))] : 0.000000e+00f);
      pad_temp_shared[((((int)threadIdx.x) + 840))] = (((((1 <= (((((int)threadIdx.x) / 7) + 3) % 9)) && ((((((int)threadIdx.x) / 7) + 3) % 9) < 8)) && (1 <= (rx_outer_outer + (((int)threadIdx.x) % 7)))) && ((rx_outer_outer + (((int)threadIdx.x) % 7)) < 8)) ? data[(((((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 840) / 63) * 49)) + ((((((int)threadIdx.x) / 7) + 3) % 9) * 7)) + rx_outer_outer) + (((int)threadIdx.x) % 7)) - 8))] : 0.000000e+00f);
      pad_temp_shared[((((int)threadIdx.x) + 896))] = (((((1 <= (((((int)threadIdx.x) / 7) + 2) % 9)) && ((((((int)threadIdx.x) / 7) + 2) % 9) < 8)) && (1 <= (rx_outer_outer + (((int)threadIdx.x) % 7)))) && ((rx_outer_outer + (((int)threadIdx.x) % 7)) < 8)) ? data[(((((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 896) / 63) * 49)) + ((((((int)threadIdx.x) / 7) + 2) % 9) * 7)) + rx_outer_outer) + (((int)threadIdx.x) % 7)) - 8))] : 0.000000e+00f);
      pad_temp_shared[((((int)threadIdx.x) + 952))] = ((((((int)threadIdx.x) < 49) && (1 <= (rx_outer_outer + (((int)threadIdx.x) % 7)))) && ((rx_outer_outer + (((int)threadIdx.x) % 7)) < 8)) ? data[(((((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 952) / 63) * 49)) + (((((int)threadIdx.x) / 7) + 1) * 7)) + rx_outer_outer) + (((int)threadIdx.x) % 7)) - 8))] : 0.000000e+00f);
      kernel_shared[(((int)threadIdx.x))] = kernel[((((((((int)blockIdx.x) * 36864) + ((((int)threadIdx.x) / 48) * 4608)) + (rc_outer_outer * 144)) + ((((int)threadIdx.x) % 48) * 3)) + rx_outer_outer))];
      kernel_shared[((((int)threadIdx.x) + 56))] = kernel[((((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 56) / 48) * 4608)) + (rc_outer_outer * 144)) + (((((int)threadIdx.x) + 8) % 48) * 3)) + rx_outer_outer))];
      kernel_shared[((((int)threadIdx.x) + 112))] = kernel[((((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 112) / 48) * 4608)) + (rc_outer_outer * 144)) + (((((int)threadIdx.x) + 16) % 48) * 3)) + rx_outer_outer))];
      kernel_shared[((((int)threadIdx.x) + 168))] = kernel[((((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 168) / 48) * 4608)) + (rc_outer_outer * 144)) + (((((int)threadIdx.x) + 24) % 48) * 3)) + rx_outer_outer))];
      kernel_shared[((((int)threadIdx.x) + 224))] = kernel[((((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 224) / 48) * 4608)) + (rc_outer_outer * 144)) + (((((int)threadIdx.x) + 32) % 48) * 3)) + rx_outer_outer))];
      kernel_shared[((((int)threadIdx.x) + 280))] = kernel[((((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 280) / 48) * 4608)) + (rc_outer_outer * 144)) + (((((int)threadIdx.x) + 40) % 48) * 3)) + rx_outer_outer))];
      if (((int)threadIdx.x) < 48) {
        kernel_shared[((((int)threadIdx.x) + 336))] = kernel[((((((((int)blockIdx.x) * 36864) + (rc_outer_outer * 144)) + (((int)threadIdx.x) * 3)) + rx_outer_outer) + 32256))];
      }
      __syncthreads();
      compute1[(0)] = (compute1[(0)] + (pad_temp_shared[(((((int)threadIdx.x) % 7) * 7))] * kernel_shared[(((((int)threadIdx.x) / 7) * 48))]));
      compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 1))] * kernel_shared[(((((int)threadIdx.x) / 7) * 48))]));
      compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 2))] * kernel_shared[(((((int)threadIdx.x) / 7) * 48))]));
      compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 3))] * kernel_shared[(((((int)threadIdx.x) / 7) * 48))]));
      compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 4))] * kernel_shared[(((((int)threadIdx.x) / 7) * 48))]));
      compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 5))] * kernel_shared[(((((int)threadIdx.x) / 7) * 48))]));
      compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 6))] * kernel_shared[(((((int)threadIdx.x) / 7) * 48))]));
      compute1[(0)] = (compute1[(0)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 7))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 1))]));
      compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 8))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 1))]));
      compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 9))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 1))]));
      compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 10))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 1))]));
      compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 11))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 1))]));
      compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 12))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 1))]));
      compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 13))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 1))]));
      compute1[(0)] = (compute1[(0)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 14))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 2))]));
      compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 15))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 2))]));
      compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 16))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 2))]));
      compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 17))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 2))]));
      compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 18))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 2))]));
      compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 19))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 2))]));
      compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 20))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 2))]));
      compute1[(0)] = (compute1[(0)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 63))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 3))]));
      compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 64))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 3))]));
      compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 65))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 3))]));
      compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 66))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 3))]));
      compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 67))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 3))]));
      compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 68))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 3))]));
      compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 69))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 3))]));
      compute1[(0)] = (compute1[(0)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 70))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 4))]));
      compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 71))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 4))]));
      compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 72))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 4))]));
      compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 73))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 4))]));
      compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 74))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 4))]));
      compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 75))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 4))]));
      compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 76))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 4))]));
      compute1[(0)] = (compute1[(0)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 77))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 5))]));
      compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 78))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 5))]));
      compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 79))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 5))]));
      compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 80))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 5))]));
      compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 81))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 5))]));
      compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 82))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 5))]));
      compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 83))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 5))]));
      compute1[(0)] = (compute1[(0)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 126))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 6))]));
      compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 127))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 6))]));
      compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 128))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 6))]));
      compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 129))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 6))]));
      compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 130))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 6))]));
      compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 131))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 6))]));
      compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 132))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 6))]));
      compute1[(0)] = (compute1[(0)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 133))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 7))]));
      compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 134))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 7))]));
      compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 135))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 7))]));
      compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 136))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 7))]));
      compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 137))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 7))]));
      compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 138))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 7))]));
      compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 139))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 7))]));
      compute1[(0)] = (compute1[(0)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 140))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 8))]));
      compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 141))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 8))]));
      compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 142))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 8))]));
      compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 143))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 8))]));
      compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 144))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 8))]));
      compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 145))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 8))]));
      compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 146))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 8))]));
      compute1[(0)] = (compute1[(0)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 189))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 9))]));
      compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 190))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 9))]));
      compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 191))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 9))]));
      compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 192))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 9))]));
      compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 193))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 9))]));
      compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 194))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 9))]));
      compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 195))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 9))]));
      compute1[(0)] = (compute1[(0)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 196))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 10))]));
      compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 197))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 10))]));
      compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 198))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 10))]));
      compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 199))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 10))]));
      compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 200))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 10))]));
      compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 201))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 10))]));
      compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 202))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 10))]));
      compute1[(0)] = (compute1[(0)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 203))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 11))]));
      compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 204))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 11))]));
      compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 205))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 11))]));
      compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 206))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 11))]));
      compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 207))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 11))]));
      compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 208))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 11))]));
      compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 209))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 11))]));
      compute1[(0)] = (compute1[(0)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 252))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 12))]));
      compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 253))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 12))]));
      compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 254))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 12))]));
      compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 255))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 12))]));
      compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 256))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 12))]));
      compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 257))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 12))]));
      compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 258))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 12))]));
      compute1[(0)] = (compute1[(0)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 259))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 13))]));
      compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 260))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 13))]));
      compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 261))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 13))]));
      compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 262))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 13))]));
      compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 263))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 13))]));
      compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 264))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 13))]));
      compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 265))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 13))]));
      compute1[(0)] = (compute1[(0)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 266))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 14))]));
      compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 267))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 14))]));
      compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 268))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 14))]));
      compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 269))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 14))]));
      compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 270))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 14))]));
      compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 271))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 14))]));
      compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 272))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 14))]));
      compute1[(0)] = (compute1[(0)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 315))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 15))]));
      compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 316))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 15))]));
      compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 317))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 15))]));
      compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 318))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 15))]));
      compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 319))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 15))]));
      compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 320))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 15))]));
      compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 321))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 15))]));
      compute1[(0)] = (compute1[(0)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 322))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 16))]));
      compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 323))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 16))]));
      compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 324))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 16))]));
      compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 325))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 16))]));
      compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 326))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 16))]));
      compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 327))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 16))]));
      compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 328))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 16))]));
      compute1[(0)] = (compute1[(0)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 329))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 17))]));
      compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 330))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 17))]));
      compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 331))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 17))]));
      compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 332))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 17))]));
      compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 333))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 17))]));
      compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 334))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 17))]));
      compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 335))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 17))]));
      compute1[(0)] = (compute1[(0)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 378))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 18))]));
      compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 379))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 18))]));
      compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 380))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 18))]));
      compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 381))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 18))]));
      compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 382))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 18))]));
      compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 383))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 18))]));
      compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 384))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 18))]));
      compute1[(0)] = (compute1[(0)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 385))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 19))]));
      compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 386))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 19))]));
      compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 387))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 19))]));
      compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 388))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 19))]));
      compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 389))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 19))]));
      compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 390))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 19))]));
      compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 391))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 19))]));
      compute1[(0)] = (compute1[(0)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 392))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 20))]));
      compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 393))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 20))]));
      compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 394))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 20))]));
      compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 395))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 20))]));
      compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 396))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 20))]));
      compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 397))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 20))]));
      compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 398))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 20))]));
      compute1[(0)] = (compute1[(0)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 441))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 21))]));
      compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 442))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 21))]));
      compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 443))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 21))]));
      compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 444))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 21))]));
      compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 445))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 21))]));
      compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 446))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 21))]));
      compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 447))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 21))]));
      compute1[(0)] = (compute1[(0)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 448))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 22))]));
      compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 449))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 22))]));
      compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 450))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 22))]));
      compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 451))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 22))]));
      compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 452))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 22))]));
      compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 453))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 22))]));
      compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 454))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 22))]));
      compute1[(0)] = (compute1[(0)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 455))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 23))]));
      compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 456))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 23))]));
      compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 457))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 23))]));
      compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 458))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 23))]));
      compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 459))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 23))]));
      compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 460))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 23))]));
      compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 461))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 23))]));
      compute1[(0)] = (compute1[(0)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 504))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 24))]));
      compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 505))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 24))]));
      compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 506))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 24))]));
      compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 507))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 24))]));
      compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 508))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 24))]));
      compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 509))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 24))]));
      compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 510))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 24))]));
      compute1[(0)] = (compute1[(0)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 511))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 25))]));
      compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 512))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 25))]));
      compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 513))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 25))]));
      compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 514))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 25))]));
      compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 515))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 25))]));
      compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 516))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 25))]));
      compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 517))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 25))]));
      compute1[(0)] = (compute1[(0)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 518))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 26))]));
      compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 519))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 26))]));
      compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 520))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 26))]));
      compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 521))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 26))]));
      compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 522))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 26))]));
      compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 523))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 26))]));
      compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 524))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 26))]));
      compute1[(0)] = (compute1[(0)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 567))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 27))]));
      compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 568))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 27))]));
      compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 569))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 27))]));
      compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 570))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 27))]));
      compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 571))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 27))]));
      compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 572))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 27))]));
      compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 573))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 27))]));
      compute1[(0)] = (compute1[(0)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 574))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 28))]));
      compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 575))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 28))]));
      compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 576))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 28))]));
      compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 577))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 28))]));
      compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 578))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 28))]));
      compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 579))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 28))]));
      compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 580))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 28))]));
      compute1[(0)] = (compute1[(0)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 581))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 29))]));
      compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 582))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 29))]));
      compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 583))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 29))]));
      compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 584))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 29))]));
      compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 585))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 29))]));
      compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 586))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 29))]));
      compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 587))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 29))]));
      compute1[(0)] = (compute1[(0)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 630))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 30))]));
      compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 631))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 30))]));
      compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 632))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 30))]));
      compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 633))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 30))]));
      compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 634))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 30))]));
      compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 635))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 30))]));
      compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 636))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 30))]));
      compute1[(0)] = (compute1[(0)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 637))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 31))]));
      compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 638))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 31))]));
      compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 639))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 31))]));
      compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 640))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 31))]));
      compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 641))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 31))]));
      compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 642))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 31))]));
      compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 643))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 31))]));
      compute1[(0)] = (compute1[(0)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 644))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 32))]));
      compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 645))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 32))]));
      compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 646))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 32))]));
      compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 647))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 32))]));
      compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 648))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 32))]));
      compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 649))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 32))]));
      compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 650))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 32))]));
      compute1[(0)] = (compute1[(0)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 693))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 33))]));
      compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 694))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 33))]));
      compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 695))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 33))]));
      compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 696))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 33))]));
      compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 697))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 33))]));
      compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 698))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 33))]));
      compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 699))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 33))]));
      compute1[(0)] = (compute1[(0)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 700))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 34))]));
      compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 701))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 34))]));
      compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 702))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 34))]));
      compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 703))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 34))]));
      compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 704))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 34))]));
      compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 705))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 34))]));
      compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 706))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 34))]));
      compute1[(0)] = (compute1[(0)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 707))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 35))]));
      compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 708))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 35))]));
      compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 709))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 35))]));
      compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 710))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 35))]));
      compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 711))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 35))]));
      compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 712))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 35))]));
      compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 713))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 35))]));
      compute1[(0)] = (compute1[(0)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 756))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 36))]));
      compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 757))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 36))]));
      compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 758))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 36))]));
      compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 759))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 36))]));
      compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 760))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 36))]));
      compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 761))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 36))]));
      compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 762))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 36))]));
      compute1[(0)] = (compute1[(0)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 763))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 37))]));
      compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 764))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 37))]));
      compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 765))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 37))]));
      compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 766))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 37))]));
      compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 767))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 37))]));
      compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 768))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 37))]));
      compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 769))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 37))]));
      compute1[(0)] = (compute1[(0)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 770))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 38))]));
      compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 771))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 38))]));
      compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 772))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 38))]));
      compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 773))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 38))]));
      compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 774))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 38))]));
      compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 775))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 38))]));
      compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 776))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 38))]));
      compute1[(0)] = (compute1[(0)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 819))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 39))]));
      compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 820))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 39))]));
      compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 821))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 39))]));
      compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 822))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 39))]));
      compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 823))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 39))]));
      compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 824))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 39))]));
      compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 825))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 39))]));
      compute1[(0)] = (compute1[(0)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 826))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 40))]));
      compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 827))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 40))]));
      compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 828))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 40))]));
      compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 829))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 40))]));
      compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 830))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 40))]));
      compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 831))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 40))]));
      compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 832))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 40))]));
      compute1[(0)] = (compute1[(0)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 833))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 41))]));
      compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 834))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 41))]));
      compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 835))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 41))]));
      compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 836))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 41))]));
      compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 837))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 41))]));
      compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 838))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 41))]));
      compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 839))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 41))]));
      compute1[(0)] = (compute1[(0)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 882))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 42))]));
      compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 883))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 42))]));
      compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 884))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 42))]));
      compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 885))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 42))]));
      compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 886))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 42))]));
      compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 887))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 42))]));
      compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 888))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 42))]));
      compute1[(0)] = (compute1[(0)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 889))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 43))]));
      compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 890))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 43))]));
      compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 891))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 43))]));
      compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 892))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 43))]));
      compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 893))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 43))]));
      compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 894))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 43))]));
      compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 895))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 43))]));
      compute1[(0)] = (compute1[(0)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 896))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 44))]));
      compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 897))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 44))]));
      compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 898))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 44))]));
      compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 899))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 44))]));
      compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 900))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 44))]));
      compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 901))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 44))]));
      compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 902))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 44))]));
      compute1[(0)] = (compute1[(0)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 945))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 45))]));
      compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 946))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 45))]));
      compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 947))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 45))]));
      compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 948))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 45))]));
      compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 949))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 45))]));
      compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 950))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 45))]));
      compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 951))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 45))]));
      compute1[(0)] = (compute1[(0)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 952))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 46))]));
      compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 953))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 46))]));
      compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 954))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 46))]));
      compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 955))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 46))]));
      compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 956))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 46))]));
      compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 957))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 46))]));
      compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 958))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 46))]));
      compute1[(0)] = (compute1[(0)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 959))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 47))]));
      compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 960))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 47))]));
      compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 961))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 47))]));
      compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 962))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 47))]));
      compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 963))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 47))]));
      compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 964))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 47))]));
      compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((((int)threadIdx.x) % 7) * 7) + 965))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + 47))]));
    }
  }
  compute[(((((int)blockIdx.x) * 392) + (((int)threadIdx.x) * 7)))] = max((compute1[(0)] + bias[(((((int)blockIdx.x) * 8) + (((int)threadIdx.x) / 7)))]), 0.000000e+00f);
  compute[((((((int)blockIdx.x) * 392) + (((int)threadIdx.x) * 7)) + 1))] = max((compute1[(1)] + bias[(((((int)blockIdx.x) * 8) + (((int)threadIdx.x) / 7)))]), 0.000000e+00f);
  compute[((((((int)blockIdx.x) * 392) + (((int)threadIdx.x) * 7)) + 2))] = max((compute1[(2)] + bias[(((((int)blockIdx.x) * 8) + (((int)threadIdx.x) / 7)))]), 0.000000e+00f);
  compute[((((((int)blockIdx.x) * 392) + (((int)threadIdx.x) * 7)) + 3))] = max((compute1[(3)] + bias[(((((int)blockIdx.x) * 8) + (((int)threadIdx.x) / 7)))]), 0.000000e+00f);
  compute[((((((int)blockIdx.x) * 392) + (((int)threadIdx.x) * 7)) + 4))] = max((compute1[(4)] + bias[(((((int)blockIdx.x) * 8) + (((int)threadIdx.x) / 7)))]), 0.000000e+00f);
  compute[((((((int)blockIdx.x) * 392) + (((int)threadIdx.x) * 7)) + 5))] = max((compute1[(5)] + bias[(((((int)blockIdx.x) * 8) + (((int)threadIdx.x) / 7)))]), 0.000000e+00f);
  compute[((((((int)blockIdx.x) * 392) + (((int)threadIdx.x) * 7)) + 6))] = max((compute1[(6)] + bias[(((((int)blockIdx.x) * 8) + (((int)threadIdx.x) / 7)))]), 0.000000e+00f);
}
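
The last seven statements of the kernel add the bias, apply ReLU, and write the results back to global memory. The index arithmetic suggests that each thread block writes 392 output elements (8 output channels of 7x7 each) and each thread writes 7 of them. The launch shape is not part of the printed kernel, so the 64 blocks of 56 threads used below are an assumption inferred from those constants. Under that assumption, a minimal Python sketch can check that the write indices cover the whole output exactly once and that each write reads the bias of its own output channel:

```python
# Assumed problem and launch shape. NUM_BLOCKS and THREADS_PER_BLOCK are
# inferred from the constants 392, 8 and 7 in the kernel, not printed by TVM.
CO, H, W = 512, 7, 7
ELEMS_PER_BLOCK = 392
ELEMS_PER_THREAD = 7
NUM_BLOCKS = CO * H * W // ELEMS_PER_BLOCK
THREADS_PER_BLOCK = ELEMS_PER_BLOCK // ELEMS_PER_THREAD


def epilogue_indices(block_idx, thread_idx):
    """Return the (output_index, bias_index) pairs written by one thread."""
    bias_index = block_idx * 8 + thread_idx // 7
    base = block_idx * ELEMS_PER_BLOCK + thread_idx * ELEMS_PER_THREAD
    return [(base + i, bias_index) for i in range(ELEMS_PER_THREAD)]


# Collect every write performed by every thread of every block.
writes = [
    pair
    for b in range(NUM_BLOCKS)
    for t in range(THREADS_PER_BLOCK)
    for pair in epilogue_indices(b, t)
]

# Every flat output index in [0, CO * H * W) is written exactly once.
covers_all_once = sorted(out for out, _ in writes) == list(range(CO * H * W))

# With NCHW layout and N = 1, the channel of a flat index is index // (H * W).
bias_matches_channel = all(bias == out // (H * W) for out, bias in writes)

print(NUM_BLOCKS, THREADS_PER_BLOCK, covers_all_once, bias_matches_channel)
```

This check only models the final write-back. It says nothing about the shared-memory loads or the accumulation loop above it.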

A more complicated example is to resume the search. In this case, we need to create the search policy and cost model ourselves and restore their state from the log file. The example below restores that state and runs 5 more trials.

def resume_search(task, log_file):
    print("Resume search:")
    # Rebuild the cost model and train it on the records already in the log
    cost_model = auto_scheduler.XGBModel()
    cost_model.update_from_file(log_file)
    # Preload the measured states so the policy does not measure them again
    search_policy = auto_scheduler.SketchPolicy(
        task, cost_model, init_search_callbacks=[auto_scheduler.PreloadMeasuredStates(log_file)]
    )
    # Run measurements in a separate local RPC process
    measure_ctx = auto_scheduler.LocalRPCMeasureContext(min_repeat_ms=300)
    tune_option = auto_scheduler.TuningOptions(
        num_measure_trials=5,
        runner=measure_ctx.runner,
        # Write the new measurement records to the same log file
        measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
    )
    task.tune(tune_option, search_policy=search_policy)

    # Kill the measurement process
    del measure_ctx


resume_search(task, log_file)

Out:

Resume search:
/usr/local/lib/python3.6/dist-packages/xgboost/training.py:17: UserWarning: Old style callback is deprecated.  See: https://xgboost.readthedocs.io/en/latest/python/callbacks.html
  warnings.warn(f'Old style callback is deprecated.  See: {link}', UserWarning)
Get devices for measurement successfully!

Total running time of the script: ( 1 minutes 3.026 seconds)
