.. note:: :class: sphx-glr-download-link-note Click :ref:`here ` to download the full example code .. rst-class:: sphx-glr-example-title .. _sphx_glr_how_to_tune_with_autoscheduler_tune_conv2d_layer_cuda.py: .. _auto-scheduler-conv-gpu: Auto-scheduling a Convolution Layer for GPU =========================================== **Author**: `Lianmin Zheng `_, `Chengfan Jia `_ This is a tutorial on how to use the auto-scheduler for GPUs. Different from the template-based :ref:`autotvm ` which relies on manual templates to define the search space, the auto-scheduler does not require any templates. Users only need to write the computation declaration without any schedule commands or templates. The auto-scheduler can automatically generate a large search space and find a good schedule in the space. We use a convolution layer as an example in this tutorial. Note that this tutorial will not run on Windows or recent versions of macOS. To get it to run, you will need to wrap the body of this tutorial in a :code:`if __name__ == "__main__":` block. .. code-block:: default import os import numpy as np import tvm from tvm import te, auto_scheduler, topi from tvm.topi.testing import conv2d_nchw_python Define the computation ^^^^^^^^^^^^^^^^^^^^^^ To begin with, let us define the computation of a convolution layer. The function should return the list of input/output tensors. From these tensors, the auto-scheduler can get the whole computational graph. .. code-block:: default @auto_scheduler.register_workload def conv2d_layer(N, H, W, CO, CI, KH, KW, stride, padding): data = te.placeholder((N, CI, H, W), name="data") kernel = te.placeholder((CO, CI, KH, KW), name="kernel") bias = te.placeholder((1, CO, 1, 1), name="bias") conv = topi.nn.conv2d_nchw(data, kernel, stride, padding, dilation=1, out_dtype="float32") out = topi.nn.relu(conv + bias) return [data, kernel, bias, out] Create the search task ^^^^^^^^^^^^^^^^^^^^^^ We then create a search task for the last convolution layer in the resnet. .. code-block:: default target = tvm.target.Target("cuda") # Use the last layer in ResNet-50 N, H, W, CO, CI, KH, KW, strides, padding = 1, 7, 7, 512, 512, 3, 3, (1, 1), (1, 1) task = auto_scheduler.SearchTask( func=conv2d_layer, args=(N, H, W, CO, CI, KH, KW, strides, padding), target=target ) # Inspect the computational graph print("Computational DAG:") print(task.compute_dag) .. rst-class:: sphx-glr-script-out Out: .. code-block:: none Computational DAG: data = PLACEHOLDER [1, 512, 7, 7] pad_temp(i0, i1, i2, i3) = tir.if_then_else(((((i2 >= 1) && (i2 < 8)) && (i3 >= 1)) && (i3 < 8)), data[i0, i1, (i2 - 1), (i3 - 1)], 0f) kernel = PLACEHOLDER [512, 512, 3, 3] compute(nn, ff, yy, xx) += (pad_temp[nn, rc, (yy + ry), (xx + rx)]*kernel[ff, rc, ry, rx]) bias = PLACEHOLDER [1, 512, 1, 1] T_add(ax0, ax1, ax2, ax3) = (compute[ax0, ax1, ax2, ax3] + bias[ax0, ax1, 0, 0]) compute(i0, i1, i2, i3) = max(T_add[i0, i1, i2, i3], 0f) Next, we set parameters for the auto-scheduler. These parameters mainly specify how we do the measurement during the search. * :code:`measure_ctx` launches a different process for measurement to provide isolation. It can protect the master process from GPU crashes during measurement and avoid other runtime conflicts. * :code:`min_repeat_ms` defines the minimum duration of one "repeat" in every measurement. This can warmup the GPU, which is necessary to get accurate measurement results. Typically, we recommend a value >= 300 ms. * :code:`num_measure_trials` is the number of measurement trials we can use during the search. We only make 10 trials in this tutorial for a fast demonstration. In practice, 1000 is a good value for the search to converge. You can do more trials according to your time budget. * In addition, we use :code:`RecordToFile` to dump measurement records into a file `conv2d.json`. The measurement records can be used to query the history best, resume the search, and do more analyses later. * see :any:`auto_scheduler.TuningOptions`, :any:`auto_scheduler.LocalRPCMeasureContext` for more parameters. .. code-block:: default log_file = "conv2d.json" measure_ctx = auto_scheduler.LocalRPCMeasureContext(min_repeat_ms=300) tune_option = auto_scheduler.TuningOptions( num_measure_trials=10, # change this to 1000 to achieve the best performance runner=measure_ctx.runner, measure_callbacks=[auto_scheduler.RecordToFile(log_file)], verbose=2, ) .. rst-class:: sphx-glr-script-out Out: .. code-block:: none Get devices for measurement successfully! Run the search ^^^^^^^^^^^^^^ Now we get all inputs ready. Pretty simple, isn't it? We can kick off the search and let the auto-scheduler do its magic. After some measurement trials, we can load the best schedule from the log file and apply it. .. code-block:: default # Run auto-tuning (search) task.tune(tune_option) # Apply the best schedule sch, args = task.apply_best(log_file) # Kill the measurement process del measure_ctx .. rst-class:: sphx-glr-script-out Out: .. code-block:: none We can lower the schedule to see the IR after auto-scheduling. The auto-scheduler correctly performs optimizations including multi-level tiling, cooperative fetching, unrolling and operator fusion. .. code-block:: default print("Lowered TIR:") print(tvm.lower(sch, args, simple_mode=True)) .. rst-class:: sphx-glr-script-out Out: .. code-block:: none Lowered TIR: primfn(data_1: handle, kernel_1: handle, bias_1: handle, compute_1: handle) -> () attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True} buffers = {compute: Buffer(compute_2: Pointer(float32), float32, [1, 512, 7, 7], []), bias: Buffer(bias_2: Pointer(float32), float32, [1, 512, 1, 1], []), data: Buffer(data_2: Pointer(float32), float32, [1, 512, 7, 7], []), kernel: Buffer(kernel_2: Pointer(float32), float32, [512, 512, 3, 3], [])} buffer_map = {data_1: data, kernel_1: kernel, bias_1: bias, compute_1: compute} { attr [IterVar(blockIdx.x: int32, (nullptr), "ThreadIndex", "blockIdx.x")] "thread_extent" = 28; allocate(compute_3: Pointer(local float32), float32, [14]), storage_scope = local; allocate(pad_temp.shared: Pointer(shared float32), float32, [72]), storage_scope = shared; allocate(kernel.shared: Pointer(shared float32), float32, [3072]), storage_scope = shared; attr [IterVar(threadIdx.x: int32, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64 { compute_3[0] = 0f32 compute_3[1] = 0f32 compute_3[2] = 0f32 compute_3[3] = 0f32 compute_3[4] = 0f32 compute_3[5] = 0f32 compute_3[6] = 0f32 compute_3[7] = 0f32 compute_3[8] = 0f32 compute_3[9] = 0f32 compute_3[10] = 0f32 compute_3[11] = 0f32 compute_3[12] = 0f32 compute_3[13] = 0f32 for (rc.outer.outer: int32, 0, 64) { for (ry.outer.outer: int32, 0, 3) { attr [IterVar(threadIdx.x_1: int32, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64 { if @tir.likely((threadIdx.x_1 < 18), dtype=bool) { pad_temp.shared[(threadIdx.x_1*4)] = @tir.if_then_else(((((1 <= (ry.outer.outer + floormod(blockIdx.x, 7))) && ((ry.outer.outer + floormod(blockIdx.x, 7)) < 8)) && (1 <= floormod((threadIdx.x_1*4), 9))) && (floormod((threadIdx.x_1*4), 9) < 8)), (float32*)data_2[((((((rc.outer.outer*392) + (floordiv((threadIdx.x_1*4), 9)*49)) + (ry.outer.outer*7)) + (floormod(blockIdx.x, 7)*7)) + floormod((threadIdx.x_1*4), 9)) - 8)], 0f32, dtype=float32) } if @tir.likely((threadIdx.x_1 < 18), dtype=bool) { pad_temp.shared[((threadIdx.x_1*4) + 1)] = @tir.if_then_else(((((1 <= (ry.outer.outer + floormod(blockIdx.x, 7))) && ((ry.outer.outer + floormod(blockIdx.x, 7)) < 8)) && (1 <= floormod(((threadIdx.x_1*4) + 1), 9))) && (floormod(((threadIdx.x_1*4) + 1), 9) < 8)), (float32*)data_2[((((((rc.outer.outer*392) + (floordiv(((threadIdx.x_1*4) + 1), 9)*49)) + (ry.outer.outer*7)) + (floormod(blockIdx.x, 7)*7)) + floormod(((threadIdx.x_1*4) + 1), 9)) - 8)], 0f32, dtype=float32) } if @tir.likely((threadIdx.x_1 < 18), dtype=bool) { pad_temp.shared[((threadIdx.x_1*4) + 2)] = @tir.if_then_else(((((1 <= (ry.outer.outer + floormod(blockIdx.x, 7))) && ((ry.outer.outer + floormod(blockIdx.x, 7)) < 8)) && (1 <= floormod(((threadIdx.x_1*4) + 2), 9))) && (floormod(((threadIdx.x_1*4) + 2), 9) < 8)), (float32*)data_2[((((((rc.outer.outer*392) + (floordiv(((threadIdx.x_1*4) + 2), 9)*49)) + (ry.outer.outer*7)) + (floormod(blockIdx.x, 7)*7)) + floormod(((threadIdx.x_1*4) + 2), 9)) - 8)], 0f32, dtype=float32) } if @tir.likely((threadIdx.x_1 < 18), dtype=bool) { pad_temp.shared[((threadIdx.x_1*4) + 3)] = @tir.if_then_else(((((1 <= (ry.outer.outer + floormod(blockIdx.x, 7))) && ((ry.outer.outer + floormod(blockIdx.x, 7)) < 8)) && (1 <= floormod(((threadIdx.x_1*4) + 3), 9))) && (floormod(((threadIdx.x_1*4) + 3), 9) < 8)), (float32*)data_2[((((((rc.outer.outer*392) + (floordiv(((threadIdx.x_1*4) + 3), 9)*49)) + (ry.outer.outer*7)) + (floormod(blockIdx.x, 7)*7)) + floormod(((threadIdx.x_1*4) + 3), 9)) - 8)], 0f32, dtype=float32) } } attr [IterVar(threadIdx.x_2: int32, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64; kernel.shared[threadIdx.x_2] = (float32*)kernel_2[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv(threadIdx.x_2, 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod(threadIdx.x_2, 24), 3)*9)) + (ry.outer.outer*3)) + floormod(threadIdx.x_2, 3))] attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64; kernel.shared[(threadIdx.x_2 + 64)] = (float32*)kernel_2[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((threadIdx.x_2 + 64), 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod((threadIdx.x_2 + 16), 24), 3)*9)) + (ry.outer.outer*3)) + floormod((threadIdx.x_2 + 1), 3))] attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64; kernel.shared[(threadIdx.x_2 + 128)] = (float32*)kernel_2[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((threadIdx.x_2 + 128), 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod((threadIdx.x_2 + 8), 24), 3)*9)) + (ry.outer.outer*3)) + floormod((threadIdx.x_2 + 2), 3))] attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64; kernel.shared[(threadIdx.x_2 + 192)] = (float32*)kernel_2[(((((((floordiv(blockIdx.x, 7)*589824) + (floordiv(threadIdx.x_2, 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod(threadIdx.x_2, 24), 3)*9)) + (ry.outer.outer*3)) + floormod(threadIdx.x_2, 3)) + 36864)] attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64; kernel.shared[(threadIdx.x_2 + 256)] = (float32*)kernel_2[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((threadIdx.x_2 + 256), 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod((threadIdx.x_2 + 16), 24), 3)*9)) + (ry.outer.outer*3)) + floormod((threadIdx.x_2 + 1), 3))] attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64; kernel.shared[(threadIdx.x_2 + 320)] = (float32*)kernel_2[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((threadIdx.x_2 + 320), 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod((threadIdx.x_2 + 8), 24), 3)*9)) + (ry.outer.outer*3)) + floormod((threadIdx.x_2 + 2), 3))] attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64; kernel.shared[(threadIdx.x_2 + 384)] = (float32*)kernel_2[(((((((floordiv(blockIdx.x, 7)*589824) + (floordiv(threadIdx.x_2, 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod(threadIdx.x_2, 24), 3)*9)) + (ry.outer.outer*3)) + floormod(threadIdx.x_2, 3)) + 73728)] attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64; kernel.shared[(threadIdx.x_2 + 448)] = (float32*)kernel_2[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((threadIdx.x_2 + 448), 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod((threadIdx.x_2 + 16), 24), 3)*9)) + (ry.outer.outer*3)) + floormod((threadIdx.x_2 + 1), 3))] attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64; kernel.shared[(threadIdx.x_2 + 512)] = (float32*)kernel_2[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((threadIdx.x_2 + 512), 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod((threadIdx.x_2 + 8), 24), 3)*9)) + (ry.outer.outer*3)) + floormod((threadIdx.x_2 + 2), 3))] attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64; kernel.shared[(threadIdx.x_2 + 576)] = (float32*)kernel_2[(((((((floordiv(blockIdx.x, 7)*589824) + (floordiv(threadIdx.x_2, 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod(threadIdx.x_2, 24), 3)*9)) + (ry.outer.outer*3)) + floormod(threadIdx.x_2, 3)) + 110592)] attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64; kernel.shared[(threadIdx.x_2 + 640)] = (float32*)kernel_2[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((threadIdx.x_2 + 640), 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod((threadIdx.x_2 + 16), 24), 3)*9)) + (ry.outer.outer*3)) + floormod((threadIdx.x_2 + 1), 3))] attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64; kernel.shared[(threadIdx.x_2 + 704)] = (float32*)kernel_2[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((threadIdx.x_2 + 704), 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod((threadIdx.x_2 + 8), 24), 3)*9)) + (ry.outer.outer*3)) + floormod((threadIdx.x_2 + 2), 3))] attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64; kernel.shared[(threadIdx.x_2 + 768)] = (float32*)kernel_2[(((((((floordiv(blockIdx.x, 7)*589824) + (floordiv(threadIdx.x_2, 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod(threadIdx.x_2, 24), 3)*9)) + (ry.outer.outer*3)) + floormod(threadIdx.x_2, 3)) + 147456)] attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64; kernel.shared[(threadIdx.x_2 + 832)] = (float32*)kernel_2[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((threadIdx.x_2 + 832), 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod((threadIdx.x_2 + 16), 24), 3)*9)) + (ry.outer.outer*3)) + floormod((threadIdx.x_2 + 1), 3))] attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64; kernel.shared[(threadIdx.x_2 + 896)] = (float32*)kernel_2[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((threadIdx.x_2 + 896), 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod((threadIdx.x_2 + 8), 24), 3)*9)) + (ry.outer.outer*3)) + floormod((threadIdx.x_2 + 2), 3))] attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64; kernel.shared[(threadIdx.x_2 + 960)] = (float32*)kernel_2[(((((((floordiv(blockIdx.x, 7)*589824) + (floordiv(threadIdx.x_2, 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod(threadIdx.x_2, 24), 3)*9)) + (ry.outer.outer*3)) + floormod(threadIdx.x_2, 3)) + 184320)] attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64; kernel.shared[(threadIdx.x_2 + 1024)] = (float32*)kernel_2[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((threadIdx.x_2 + 1024), 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod((threadIdx.x_2 + 16), 24), 3)*9)) + (ry.outer.outer*3)) + floormod((threadIdx.x_2 + 1), 3))] attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64; kernel.shared[(threadIdx.x_2 + 1088)] = (float32*)kernel_2[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((threadIdx.x_2 + 1088), 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod((threadIdx.x_2 + 8), 24), 3)*9)) + (ry.outer.outer*3)) + floormod((threadIdx.x_2 + 2), 3))] attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64; kernel.shared[(threadIdx.x_2 + 1152)] = (float32*)kernel_2[(((((((floordiv(blockIdx.x, 7)*589824) + (floordiv(threadIdx.x_2, 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod(threadIdx.x_2, 24), 3)*9)) + (ry.outer.outer*3)) + floormod(threadIdx.x_2, 3)) + 221184)] attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64; kernel.shared[(threadIdx.x_2 + 1216)] = (float32*)kernel_2[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((threadIdx.x_2 + 1216), 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod((threadIdx.x_2 + 16), 24), 3)*9)) + (ry.outer.outer*3)) + floormod((threadIdx.x_2 + 1), 3))] attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64; kernel.shared[(threadIdx.x_2 + 1280)] = (float32*)kernel_2[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((threadIdx.x_2 + 1280), 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod((threadIdx.x_2 + 8), 24), 3)*9)) + (ry.outer.outer*3)) + floormod((threadIdx.x_2 + 2), 3))] attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64; kernel.shared[(threadIdx.x_2 + 1344)] = (float32*)kernel_2[(((((((floordiv(blockIdx.x, 7)*589824) + (floordiv(threadIdx.x_2, 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod(threadIdx.x_2, 24), 3)*9)) + (ry.outer.outer*3)) + floormod(threadIdx.x_2, 3)) + 258048)] attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64; kernel.shared[(threadIdx.x_2 + 1408)] = (float32*)kernel_2[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((threadIdx.x_2 + 1408), 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod((threadIdx.x_2 + 16), 24), 3)*9)) + (ry.outer.outer*3)) + floormod((threadIdx.x_2 + 1), 3))] attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64; kernel.shared[(threadIdx.x_2 + 1472)] = (float32*)kernel_2[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((threadIdx.x_2 + 1472), 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod((threadIdx.x_2 + 8), 24), 3)*9)) + (ry.outer.outer*3)) + floormod((threadIdx.x_2 + 2), 3))] attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64; kernel.shared[(threadIdx.x_2 + 1536)] = (float32*)kernel_2[(((((((floordiv(blockIdx.x, 7)*589824) + (floordiv(threadIdx.x_2, 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod(threadIdx.x_2, 24), 3)*9)) + (ry.outer.outer*3)) + floormod(threadIdx.x_2, 3)) + 294912)] attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64; kernel.shared[(threadIdx.x_2 + 1600)] = (float32*)kernel_2[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((threadIdx.x_2 + 1600), 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod((threadIdx.x_2 + 16), 24), 3)*9)) + (ry.outer.outer*3)) + floormod((threadIdx.x_2 + 1), 3))] attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64; kernel.shared[(threadIdx.x_2 + 1664)] = (float32*)kernel_2[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((threadIdx.x_2 + 1664), 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod((threadIdx.x_2 + 8), 24), 3)*9)) + (ry.outer.outer*3)) + floormod((threadIdx.x_2 + 2), 3))] attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64; kernel.shared[(threadIdx.x_2 + 1728)] = (float32*)kernel_2[(((((((floordiv(blockIdx.x, 7)*589824) + (floordiv(threadIdx.x_2, 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod(threadIdx.x_2, 24), 3)*9)) + (ry.outer.outer*3)) + floormod(threadIdx.x_2, 3)) + 331776)] attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64; kernel.shared[(threadIdx.x_2 + 1792)] = (float32*)kernel_2[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((threadIdx.x_2 + 1792), 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod((threadIdx.x_2 + 16), 24), 3)*9)) + (ry.outer.outer*3)) + floormod((threadIdx.x_2 + 1), 3))] attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64; kernel.shared[(threadIdx.x_2 + 1856)] = (float32*)kernel_2[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((threadIdx.x_2 + 1856), 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod((threadIdx.x_2 + 8), 24), 3)*9)) + (ry.outer.outer*3)) + floormod((threadIdx.x_2 + 2), 3))] attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64; kernel.shared[(threadIdx.x_2 + 1920)] = (float32*)kernel_2[(((((((floordiv(blockIdx.x, 7)*589824) + (floordiv(threadIdx.x_2, 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod(threadIdx.x_2, 24), 3)*9)) + (ry.outer.outer*3)) + floormod(threadIdx.x_2, 3)) + 368640)] attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64; kernel.shared[(threadIdx.x_2 + 1984)] = (float32*)kernel_2[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((threadIdx.x_2 + 1984), 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod((threadIdx.x_2 + 16), 24), 3)*9)) + (ry.outer.outer*3)) + floormod((threadIdx.x_2 + 1), 3))] attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64; kernel.shared[(threadIdx.x_2 + 2048)] = (float32*)kernel_2[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((threadIdx.x_2 + 2048), 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod((threadIdx.x_2 + 8), 24), 3)*9)) + (ry.outer.outer*3)) + floormod((threadIdx.x_2 + 2), 3))] attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64; kernel.shared[(threadIdx.x_2 + 2112)] = (float32*)kernel_2[(((((((floordiv(blockIdx.x, 7)*589824) + (floordiv(threadIdx.x_2, 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod(threadIdx.x_2, 24), 3)*9)) + (ry.outer.outer*3)) + floormod(threadIdx.x_2, 3)) + 405504)] attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64; kernel.shared[(threadIdx.x_2 + 2176)] = (float32*)kernel_2[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((threadIdx.x_2 + 2176), 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod((threadIdx.x_2 + 16), 24), 3)*9)) + (ry.outer.outer*3)) + floormod((threadIdx.x_2 + 1), 3))] attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64; kernel.shared[(threadIdx.x_2 + 2240)] = (float32*)kernel_2[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((threadIdx.x_2 + 2240), 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod((threadIdx.x_2 + 8), 24), 3)*9)) + (ry.outer.outer*3)) + floormod((threadIdx.x_2 + 2), 3))] attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64; kernel.shared[(threadIdx.x_2 + 2304)] = (float32*)kernel_2[(((((((floordiv(blockIdx.x, 7)*589824) + (floordiv(threadIdx.x_2, 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod(threadIdx.x_2, 24), 3)*9)) + (ry.outer.outer*3)) + floormod(threadIdx.x_2, 3)) + 442368)] attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64; kernel.shared[(threadIdx.x_2 + 2368)] = (float32*)kernel_2[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((threadIdx.x_2 + 2368), 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod((threadIdx.x_2 + 16), 24), 3)*9)) + (ry.outer.outer*3)) + floormod((threadIdx.x_2 + 1), 3))] attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64; kernel.shared[(threadIdx.x_2 + 2432)] = (float32*)kernel_2[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((threadIdx.x_2 + 2432), 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod((threadIdx.x_2 + 8), 24), 3)*9)) + (ry.outer.outer*3)) + floormod((threadIdx.x_2 + 2), 3))] attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64; kernel.shared[(threadIdx.x_2 + 2496)] = (float32*)kernel_2[(((((((floordiv(blockIdx.x, 7)*589824) + (floordiv(threadIdx.x_2, 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod(threadIdx.x_2, 24), 3)*9)) + (ry.outer.outer*3)) + floormod(threadIdx.x_2, 3)) + 479232)] attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64; kernel.shared[(threadIdx.x_2 + 2560)] = (float32*)kernel_2[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((threadIdx.x_2 + 2560), 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod((threadIdx.x_2 + 16), 24), 3)*9)) + (ry.outer.outer*3)) + floormod((threadIdx.x_2 + 1), 3))] attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64; kernel.shared[(threadIdx.x_2 + 2624)] = (float32*)kernel_2[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((threadIdx.x_2 + 2624), 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod((threadIdx.x_2 + 8), 24), 3)*9)) + (ry.outer.outer*3)) + floormod((threadIdx.x_2 + 2), 3))] attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64; kernel.shared[(threadIdx.x_2 + 2688)] = (float32*)kernel_2[(((((((floordiv(blockIdx.x, 7)*589824) + (floordiv(threadIdx.x_2, 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod(threadIdx.x_2, 24), 3)*9)) + (ry.outer.outer*3)) + floormod(threadIdx.x_2, 3)) + 516096)] attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64; kernel.shared[(threadIdx.x_2 + 2752)] = (float32*)kernel_2[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((threadIdx.x_2 + 2752), 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod((threadIdx.x_2 + 16), 24), 3)*9)) + (ry.outer.outer*3)) + floormod((threadIdx.x_2 + 1), 3))] attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64; kernel.shared[(threadIdx.x_2 + 2816)] = (float32*)kernel_2[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((threadIdx.x_2 + 2816), 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod((threadIdx.x_2 + 8), 24), 3)*9)) + (ry.outer.outer*3)) + floormod((threadIdx.x_2 + 2), 3))] attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64; kernel.shared[(threadIdx.x_2 + 2880)] = (float32*)kernel_2[(((((((floordiv(blockIdx.x, 7)*589824) + (floordiv(threadIdx.x_2, 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod(threadIdx.x_2, 24), 3)*9)) + (ry.outer.outer*3)) + floormod(threadIdx.x_2, 3)) + 552960)] attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64; kernel.shared[(threadIdx.x_2 + 2944)] = (float32*)kernel_2[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((threadIdx.x_2 + 2944), 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod((threadIdx.x_2 + 16), 24), 3)*9)) + (ry.outer.outer*3)) + floormod((threadIdx.x_2 + 1), 3))] attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64; kernel.shared[(threadIdx.x_2 + 3008)] = (float32*)kernel_2[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((threadIdx.x_2 + 3008), 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod((threadIdx.x_2 + 8), 24), 3)*9)) + (ry.outer.outer*3)) + floormod((threadIdx.x_2 + 2), 3))] compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[0]*(float32*)kernel.shared[(threadIdx.x*48)])) compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[9]*(float32*)kernel.shared[((threadIdx.x*48) + 3)])) compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[1]*(float32*)kernel.shared[(threadIdx.x*48)])) compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[10]*(float32*)kernel.shared[((threadIdx.x*48) + 3)])) compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[2]*(float32*)kernel.shared[(threadIdx.x*48)])) compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[11]*(float32*)kernel.shared[((threadIdx.x*48) + 3)])) compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[3]*(float32*)kernel.shared[(threadIdx.x*48)])) compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[12]*(float32*)kernel.shared[((threadIdx.x*48) + 3)])) compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[4]*(float32*)kernel.shared[(threadIdx.x*48)])) compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[13]*(float32*)kernel.shared[((threadIdx.x*48) + 3)])) compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[5]*(float32*)kernel.shared[(threadIdx.x*48)])) compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[14]*(float32*)kernel.shared[((threadIdx.x*48) + 3)])) compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[6]*(float32*)kernel.shared[(threadIdx.x*48)])) compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[15]*(float32*)kernel.shared[((threadIdx.x*48) + 3)])) compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[0]*(float32*)kernel.shared[((threadIdx.x*48) + 24)])) compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[9]*(float32*)kernel.shared[((threadIdx.x*48) + 27)])) compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[1]*(float32*)kernel.shared[((threadIdx.x*48) + 24)])) compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[10]*(float32*)kernel.shared[((threadIdx.x*48) + 27)])) compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[2]*(float32*)kernel.shared[((threadIdx.x*48) + 24)])) compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[11]*(float32*)kernel.shared[((threadIdx.x*48) + 27)])) compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[3]*(float32*)kernel.shared[((threadIdx.x*48) + 24)])) compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[12]*(float32*)kernel.shared[((threadIdx.x*48) + 27)])) compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[4]*(float32*)kernel.shared[((threadIdx.x*48) + 24)])) compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[13]*(float32*)kernel.shared[((threadIdx.x*48) + 27)])) compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[5]*(float32*)kernel.shared[((threadIdx.x*48) + 24)])) compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[14]*(float32*)kernel.shared[((threadIdx.x*48) + 27)])) compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[6]*(float32*)kernel.shared[((threadIdx.x*48) + 24)])) compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[15]*(float32*)kernel.shared[((threadIdx.x*48) + 27)])) compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[1]*(float32*)kernel.shared[((threadIdx.x*48) + 1)])) compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[10]*(float32*)kernel.shared[((threadIdx.x*48) + 4)])) compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[2]*(float32*)kernel.shared[((threadIdx.x*48) + 1)])) compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[11]*(float32*)kernel.shared[((threadIdx.x*48) + 4)])) compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[3]*(float32*)kernel.shared[((threadIdx.x*48) + 1)])) compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[12]*(float32*)kernel.shared[((threadIdx.x*48) + 4)])) compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[4]*(float32*)kernel.shared[((threadIdx.x*48) + 1)])) compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[13]*(float32*)kernel.shared[((threadIdx.x*48) + 4)])) compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[5]*(float32*)kernel.shared[((threadIdx.x*48) + 1)])) compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[14]*(float32*)kernel.shared[((threadIdx.x*48) + 4)])) compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[6]*(float32*)kernel.shared[((threadIdx.x*48) + 1)])) compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[15]*(float32*)kernel.shared[((threadIdx.x*48) + 4)])) compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[7]*(float32*)kernel.shared[((threadIdx.x*48) + 1)])) compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[16]*(float32*)kernel.shared[((threadIdx.x*48) + 4)])) compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[1]*(float32*)kernel.shared[((threadIdx.x*48) + 25)])) compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[10]*(float32*)kernel.shared[((threadIdx.x*48) + 28)])) compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[2]*(float32*)kernel.shared[((threadIdx.x*48) + 25)])) compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[11]*(float32*)kernel.shared[((threadIdx.x*48) + 28)])) compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[3]*(float32*)kernel.shared[((threadIdx.x*48) + 25)])) compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[12]*(float32*)kernel.shared[((threadIdx.x*48) + 28)])) compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[4]*(float32*)kernel.shared[((threadIdx.x*48) + 25)])) compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[13]*(float32*)kernel.shared[((threadIdx.x*48) + 28)])) compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[5]*(float32*)kernel.shared[((threadIdx.x*48) + 25)])) compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[14]*(float32*)kernel.shared[((threadIdx.x*48) + 28)])) compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[6]*(float32*)kernel.shared[((threadIdx.x*48) + 25)])) compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[15]*(float32*)kernel.shared[((threadIdx.x*48) + 28)])) compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[7]*(float32*)kernel.shared[((threadIdx.x*48) + 25)])) compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[16]*(float32*)kernel.shared[((threadIdx.x*48) + 28)])) compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[2]*(float32*)kernel.shared[((threadIdx.x*48) + 2)])) compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[11]*(float32*)kernel.shared[((threadIdx.x*48) + 5)])) compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[3]*(float32*)kernel.shared[((threadIdx.x*48) + 2)])) compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[12]*(float32*)kernel.shared[((threadIdx.x*48) + 5)])) compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[4]*(float32*)kernel.shared[((threadIdx.x*48) + 2)])) compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[13]*(float32*)kernel.shared[((threadIdx.x*48) + 5)])) compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[5]*(float32*)kernel.shared[((threadIdx.x*48) + 2)])) compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[14]*(float32*)kernel.shared[((threadIdx.x*48) + 5)])) compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[6]*(float32*)kernel.shared[((threadIdx.x*48) + 2)])) compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[15]*(float32*)kernel.shared[((threadIdx.x*48) + 5)])) compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[7]*(float32*)kernel.shared[((threadIdx.x*48) + 2)])) compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[16]*(float32*)kernel.shared[((threadIdx.x*48) + 5)])) compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[8]*(float32*)kernel.shared[((threadIdx.x*48) + 2)])) compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[17]*(float32*)kernel.shared[((threadIdx.x*48) + 5)])) compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[2]*(float32*)kernel.shared[((threadIdx.x*48) + 26)])) compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[11]*(float32*)kernel.shared[((threadIdx.x*48) + 29)])) compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[3]*(float32*)kernel.shared[((threadIdx.x*48) + 26)])) compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[12]*(float32*)kernel.shared[((threadIdx.x*48) + 29)])) compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[4]*(float32*)kernel.shared[((threadIdx.x*48) + 26)])) compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[13]*(float32*)kernel.shared[((threadIdx.x*48) + 29)])) compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[5]*(float32*)kernel.shared[((threadIdx.x*48) + 26)])) compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[14]*(float32*)kernel.shared[((threadIdx.x*48) + 29)])) compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[6]*(float32*)kernel.shared[((threadIdx.x*48) + 26)])) compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[15]*(float32*)kernel.shared[((threadIdx.x*48) + 29)])) compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[7]*(float32*)kernel.shared[((threadIdx.x*48) + 26)])) compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[16]*(float32*)kernel.shared[((threadIdx.x*48) + 29)])) compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[8]*(float32*)kernel.shared[((threadIdx.x*48) + 26)])) compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[17]*(float32*)kernel.shared[((threadIdx.x*48) + 29)])) compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[18]*(float32*)kernel.shared[((threadIdx.x*48) + 6)])) compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[27]*(float32*)kernel.shared[((threadIdx.x*48) + 9)])) compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[19]*(float32*)kernel.shared[((threadIdx.x*48) + 6)])) compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[28]*(float32*)kernel.shared[((threadIdx.x*48) + 9)])) compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[20]*(float32*)kernel.shared[((threadIdx.x*48) + 6)])) compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[29]*(float32*)kernel.shared[((threadIdx.x*48) + 9)])) compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[21]*(float32*)kernel.shared[((threadIdx.x*48) + 6)])) compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[30]*(float32*)kernel.shared[((threadIdx.x*48) + 9)])) compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[22]*(float32*)kernel.shared[((threadIdx.x*48) + 6)])) compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[31]*(float32*)kernel.shared[((threadIdx.x*48) + 9)])) compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[23]*(float32*)kernel.shared[((threadIdx.x*48) + 6)])) compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[32]*(float32*)kernel.shared[((threadIdx.x*48) + 9)])) compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[24]*(float32*)kernel.shared[((threadIdx.x*48) + 6)])) compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[33]*(float32*)kernel.shared[((threadIdx.x*48) + 9)])) compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[18]*(float32*)kernel.shared[((threadIdx.x*48) + 30)])) compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[27]*(float32*)kernel.shared[((threadIdx.x*48) + 33)])) compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[19]*(float32*)kernel.shared[((threadIdx.x*48) + 30)])) compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[28]*(float32*)kernel.shared[((threadIdx.x*48) + 33)])) compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[20]*(float32*)kernel.shared[((threadIdx.x*48) + 30)])) compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[29]*(float32*)kernel.shared[((threadIdx.x*48) + 33)])) compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[21]*(float32*)kernel.shared[((threadIdx.x*48) + 30)])) compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[30]*(float32*)kernel.shared[((threadIdx.x*48) + 33)])) compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[22]*(float32*)kernel.shared[((threadIdx.x*48) + 30)])) compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[31]*(float32*)kernel.shared[((threadIdx.x*48) + 33)])) compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[23]*(float32*)kernel.shared[((threadIdx.x*48) + 30)])) compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[32]*(float32*)kernel.shared[((threadIdx.x*48) + 33)])) compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[24]*(float32*)kernel.shared[((threadIdx.x*48) + 30)])) compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[33]*(float32*)kernel.shared[((threadIdx.x*48) + 33)])) compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[19]*(float32*)kernel.shared[((threadIdx.x*48) + 7)])) compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[28]*(float32*)kernel.shared[((threadIdx.x*48) + 10)])) compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[20]*(float32*)kernel.shared[((threadIdx.x*48) + 7)])) compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[29]*(float32*)kernel.shared[((threadIdx.x*48) + 10)])) compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[21]*(float32*)kernel.shared[((threadIdx.x*48) + 7)])) compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[30]*(float32*)kernel.shared[((threadIdx.x*48) + 10)])) compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[22]*(float32*)kernel.shared[((threadIdx.x*48) + 7)])) compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[31]*(float32*)kernel.shared[((threadIdx.x*48) + 10)])) compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[23]*(float32*)kernel.shared[((threadIdx.x*48) + 7)])) compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[32]*(float32*)kernel.shared[((threadIdx.x*48) + 10)])) compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[24]*(float32*)kernel.shared[((threadIdx.x*48) + 7)])) compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[33]*(float32*)kernel.shared[((threadIdx.x*48) + 10)])) compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[25]*(float32*)kernel.shared[((threadIdx.x*48) + 7)])) compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[34]*(float32*)kernel.shared[((threadIdx.x*48) + 10)])) compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[19]*(float32*)kernel.shared[((threadIdx.x*48) + 31)])) compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[28]*(float32*)kernel.shared[((threadIdx.x*48) + 34)])) compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[20]*(float32*)kernel.shared[((threadIdx.x*48) + 31)])) compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[29]*(float32*)kernel.shared[((threadIdx.x*48) + 34)])) compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[21]*(float32*)kernel.shared[((threadIdx.x*48) + 31)])) compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[30]*(float32*)kernel.shared[((threadIdx.x*48) + 34)])) compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[22]*(float32*)kernel.shared[((threadIdx.x*48) + 31)])) compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[31]*(float32*)kernel.shared[((threadIdx.x*48) + 34)])) compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[23]*(float32*)kernel.shared[((threadIdx.x*48) + 31)])) compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[32]*(float32*)kernel.shared[((threadIdx.x*48) + 34)])) compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[24]*(float32*)kernel.shared[((threadIdx.x*48) + 31)])) compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[33]*(float32*)kernel.shared[((threadIdx.x*48) + 34)])) compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[25]*(float32*)kernel.shared[((threadIdx.x*48) + 31)])) compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[34]*(float32*)kernel.shared[((threadIdx.x*48) + 34)])) compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[20]*(float32*)kernel.shared[((threadIdx.x*48) + 8)])) compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[29]*(float32*)kernel.shared[((threadIdx.x*48) + 11)])) compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[21]*(float32*)kernel.shared[((threadIdx.x*48) + 8)])) compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[30]*(float32*)kernel.shared[((threadIdx.x*48) + 11)])) compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[22]*(float32*)kernel.shared[((threadIdx.x*48) + 8)])) compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[31]*(float32*)kernel.shared[((threadIdx.x*48) + 11)])) compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[23]*(float32*)kernel.shared[((threadIdx.x*48) + 8)])) compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[32]*(float32*)kernel.shared[((threadIdx.x*48) + 11)])) compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[24]*(float32*)kernel.shared[((threadIdx.x*48) + 8)])) compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[33]*(float32*)kernel.shared[((threadIdx.x*48) + 11)])) compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[25]*(float32*)kernel.shared[((threadIdx.x*48) + 8)])) compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[34]*(float32*)kernel.shared[((threadIdx.x*48) + 11)])) compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[26]*(float32*)kernel.shared[((threadIdx.x*48) + 8)])) compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[35]*(float32*)kernel.shared[((threadIdx.x*48) + 11)])) compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[20]*(float32*)kernel.shared[((threadIdx.x*48) + 32)])) compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[29]*(float32*)kernel.shared[((threadIdx.x*48) + 35)])) compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[21]*(float32*)kernel.shared[((threadIdx.x*48) + 32)])) compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[30]*(float32*)kernel.shared[((threadIdx.x*48) + 35)])) compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[22]*(float32*)kernel.shared[((threadIdx.x*48) + 32)])) compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[31]*(float32*)kernel.shared[((threadIdx.x*48) + 35)])) compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[23]*(float32*)kernel.shared[((threadIdx.x*48) + 32)])) compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[32]*(float32*)kernel.shared[((threadIdx.x*48) + 35)])) compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[24]*(float32*)kernel.shared[((threadIdx.x*48) + 32)])) compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[33]*(float32*)kernel.shared[((threadIdx.x*48) + 35)])) compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[25]*(float32*)kernel.shared[((threadIdx.x*48) + 32)])) compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[34]*(float32*)kernel.shared[((threadIdx.x*48) + 35)])) compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[26]*(float32*)kernel.shared[((threadIdx.x*48) + 32)])) compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[35]*(float32*)kernel.shared[((threadIdx.x*48) + 35)])) compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[36]*(float32*)kernel.shared[((threadIdx.x*48) + 12)])) compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[45]*(float32*)kernel.shared[((threadIdx.x*48) + 15)])) compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[37]*(float32*)kernel.shared[((threadIdx.x*48) + 12)])) compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[46]*(float32*)kernel.shared[((threadIdx.x*48) + 15)])) compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[38]*(float32*)kernel.shared[((threadIdx.x*48) + 12)])) compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[47]*(float32*)kernel.shared[((threadIdx.x*48) + 15)])) compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[39]*(float32*)kernel.shared[((threadIdx.x*48) + 12)])) compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[48]*(float32*)kernel.shared[((threadIdx.x*48) + 15)])) compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[40]*(float32*)kernel.shared[((threadIdx.x*48) + 12)])) compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[49]*(float32*)kernel.shared[((threadIdx.x*48) + 15)])) compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[41]*(float32*)kernel.shared[((threadIdx.x*48) + 12)])) compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[50]*(float32*)kernel.shared[((threadIdx.x*48) + 15)])) compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[42]*(float32*)kernel.shared[((threadIdx.x*48) + 12)])) compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[51]*(float32*)kernel.shared[((threadIdx.x*48) + 15)])) compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[36]*(float32*)kernel.shared[((threadIdx.x*48) + 36)])) compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[45]*(float32*)kernel.shared[((threadIdx.x*48) + 39)])) compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[37]*(float32*)kernel.shared[((threadIdx.x*48) + 36)])) compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[46]*(float32*)kernel.shared[((threadIdx.x*48) + 39)])) compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[38]*(float32*)kernel.shared[((threadIdx.x*48) + 36)])) compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[47]*(float32*)kernel.shared[((threadIdx.x*48) + 39)])) compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[39]*(float32*)kernel.shared[((threadIdx.x*48) + 36)])) compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[48]*(float32*)kernel.shared[((threadIdx.x*48) + 39)])) compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[40]*(float32*)kernel.shared[((threadIdx.x*48) + 36)])) compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[49]*(float32*)kernel.shared[((threadIdx.x*48) + 39)])) compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[41]*(float32*)kernel.shared[((threadIdx.x*48) + 36)])) compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[50]*(float32*)kernel.shared[((threadIdx.x*48) + 39)])) compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[42]*(float32*)kernel.shared[((threadIdx.x*48) + 36)])) compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[51]*(float32*)kernel.shared[((threadIdx.x*48) + 39)])) compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[37]*(float32*)kernel.shared[((threadIdx.x*48) + 13)])) compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[46]*(float32*)kernel.shared[((threadIdx.x*48) + 16)])) compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[38]*(float32*)kernel.shared[((threadIdx.x*48) + 13)])) compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[47]*(float32*)kernel.shared[((threadIdx.x*48) + 16)])) compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[39]*(float32*)kernel.shared[((threadIdx.x*48) + 13)])) compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[48]*(float32*)kernel.shared[((threadIdx.x*48) + 16)])) compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[40]*(float32*)kernel.shared[((threadIdx.x*48) + 13)])) compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[49]*(float32*)kernel.shared[((threadIdx.x*48) + 16)])) compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[41]*(float32*)kernel.shared[((threadIdx.x*48) + 13)])) compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[50]*(float32*)kernel.shared[((threadIdx.x*48) + 16)])) compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[42]*(float32*)kernel.shared[((threadIdx.x*48) + 13)])) compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[51]*(float32*)kernel.shared[((threadIdx.x*48) + 16)])) compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[43]*(float32*)kernel.shared[((threadIdx.x*48) + 13)])) compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[52]*(float32*)kernel.shared[((threadIdx.x*48) + 16)])) compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[37]*(float32*)kernel.shared[((threadIdx.x*48) + 37)])) compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[46]*(float32*)kernel.shared[((threadIdx.x*48) + 40)])) compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[38]*(float32*)kernel.shared[((threadIdx.x*48) + 37)])) compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[47]*(float32*)kernel.shared[((threadIdx.x*48) + 40)])) compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[39]*(float32*)kernel.shared[((threadIdx.x*48) + 37)])) compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[48]*(float32*)kernel.shared[((threadIdx.x*48) + 40)])) compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[40]*(float32*)kernel.shared[((threadIdx.x*48) + 37)])) compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[49]*(float32*)kernel.shared[((threadIdx.x*48) + 40)])) compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[41]*(float32*)kernel.shared[((threadIdx.x*48) + 37)])) compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[50]*(float32*)kernel.shared[((threadIdx.x*48) + 40)])) compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[42]*(float32*)kernel.shared[((threadIdx.x*48) + 37)])) compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[51]*(float32*)kernel.shared[((threadIdx.x*48) + 40)])) compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[43]*(float32*)kernel.shared[((threadIdx.x*48) + 37)])) compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[52]*(float32*)kernel.shared[((threadIdx.x*48) + 40)])) compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[38]*(float32*)kernel.shared[((threadIdx.x*48) + 14)])) compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[47]*(float32*)kernel.shared[((threadIdx.x*48) + 17)])) compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[39]*(float32*)kernel.shared[((threadIdx.x*48) + 14)])) compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[48]*(float32*)kernel.shared[((threadIdx.x*48) + 17)])) compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[40]*(float32*)kernel.shared[((threadIdx.x*48) + 14)])) compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[49]*(float32*)kernel.shared[((threadIdx.x*48) + 17)])) compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[41]*(float32*)kernel.shared[((threadIdx.x*48) + 14)])) compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[50]*(float32*)kernel.shared[((threadIdx.x*48) + 17)])) compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[42]*(float32*)kernel.shared[((threadIdx.x*48) + 14)])) compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[51]*(float32*)kernel.shared[((threadIdx.x*48) + 17)])) compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[43]*(float32*)kernel.shared[((threadIdx.x*48) + 14)])) compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[52]*(float32*)kernel.shared[((threadIdx.x*48) + 17)])) compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[44]*(float32*)kernel.shared[((threadIdx.x*48) + 14)])) compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[53]*(float32*)kernel.shared[((threadIdx.x*48) + 17)])) compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[38]*(float32*)kernel.shared[((threadIdx.x*48) + 38)])) compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[47]*(float32*)kernel.shared[((threadIdx.x*48) + 41)])) compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[39]*(float32*)kernel.shared[((threadIdx.x*48) + 38)])) compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[48]*(float32*)kernel.shared[((threadIdx.x*48) + 41)])) compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[40]*(float32*)kernel.shared[((threadIdx.x*48) + 38)])) compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[49]*(float32*)kernel.shared[((threadIdx.x*48) + 41)])) compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[41]*(float32*)kernel.shared[((threadIdx.x*48) + 38)])) compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[50]*(float32*)kernel.shared[((threadIdx.x*48) + 41)])) compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[42]*(float32*)kernel.shared[((threadIdx.x*48) + 38)])) compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[51]*(float32*)kernel.shared[((threadIdx.x*48) + 41)])) compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[43]*(float32*)kernel.shared[((threadIdx.x*48) + 38)])) compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[52]*(float32*)kernel.shared[((threadIdx.x*48) + 41)])) compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[44]*(float32*)kernel.shared[((threadIdx.x*48) + 38)])) compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[53]*(float32*)kernel.shared[((threadIdx.x*48) + 41)])) compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[54]*(float32*)kernel.shared[((threadIdx.x*48) + 18)])) compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[63]*(float32*)kernel.shared[((threadIdx.x*48) + 21)])) compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[55]*(float32*)kernel.shared[((threadIdx.x*48) + 18)])) compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[64]*(float32*)kernel.shared[((threadIdx.x*48) + 21)])) compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[56]*(float32*)kernel.shared[((threadIdx.x*48) + 18)])) compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[65]*(float32*)kernel.shared[((threadIdx.x*48) + 21)])) compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[57]*(float32*)kernel.shared[((threadIdx.x*48) + 18)])) compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[66]*(float32*)kernel.shared[((threadIdx.x*48) + 21)])) compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[58]*(float32*)kernel.shared[((threadIdx.x*48) + 18)])) compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[67]*(float32*)kernel.shared[((threadIdx.x*48) + 21)])) compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[59]*(float32*)kernel.shared[((threadIdx.x*48) + 18)])) compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[68]*(float32*)kernel.shared[((threadIdx.x*48) + 21)])) compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[60]*(float32*)kernel.shared[((threadIdx.x*48) + 18)])) compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[69]*(float32*)kernel.shared[((threadIdx.x*48) + 21)])) compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[54]*(float32*)kernel.shared[((threadIdx.x*48) + 42)])) compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[63]*(float32*)kernel.shared[((threadIdx.x*48) + 45)])) compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[55]*(float32*)kernel.shared[((threadIdx.x*48) + 42)])) compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[64]*(float32*)kernel.shared[((threadIdx.x*48) + 45)])) compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[56]*(float32*)kernel.shared[((threadIdx.x*48) + 42)])) compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[65]*(float32*)kernel.shared[((threadIdx.x*48) + 45)])) compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[57]*(float32*)kernel.shared[((threadIdx.x*48) + 42)])) compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[66]*(float32*)kernel.shared[((threadIdx.x*48) + 45)])) compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[58]*(float32*)kernel.shared[((threadIdx.x*48) + 42)])) compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[67]*(float32*)kernel.shared[((threadIdx.x*48) + 45)])) compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[59]*(float32*)kernel.shared[((threadIdx.x*48) + 42)])) compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[68]*(float32*)kernel.shared[((threadIdx.x*48) + 45)])) compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[60]*(float32*)kernel.shared[((threadIdx.x*48) + 42)])) compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[69]*(float32*)kernel.shared[((threadIdx.x*48) + 45)])) compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[55]*(float32*)kernel.shared[((threadIdx.x*48) + 19)])) compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[64]*(float32*)kernel.shared[((threadIdx.x*48) + 22)])) compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[56]*(float32*)kernel.shared[((threadIdx.x*48) + 19)])) compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[65]*(float32*)kernel.shared[((threadIdx.x*48) + 22)])) compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[57]*(float32*)kernel.shared[((threadIdx.x*48) + 19)])) compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[66]*(float32*)kernel.shared[((threadIdx.x*48) + 22)])) compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[58]*(float32*)kernel.shared[((threadIdx.x*48) + 19)])) compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[67]*(float32*)kernel.shared[((threadIdx.x*48) + 22)])) compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[59]*(float32*)kernel.shared[((threadIdx.x*48) + 19)])) compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[68]*(float32*)kernel.shared[((threadIdx.x*48) + 22)])) compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[60]*(float32*)kernel.shared[((threadIdx.x*48) + 19)])) compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[69]*(float32*)kernel.shared[((threadIdx.x*48) + 22)])) compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[61]*(float32*)kernel.shared[((threadIdx.x*48) + 19)])) compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[70]*(float32*)kernel.shared[((threadIdx.x*48) + 22)])) compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[55]*(float32*)kernel.shared[((threadIdx.x*48) + 43)])) compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[64]*(float32*)kernel.shared[((threadIdx.x*48) + 46)])) compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[56]*(float32*)kernel.shared[((threadIdx.x*48) + 43)])) compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[65]*(float32*)kernel.shared[((threadIdx.x*48) + 46)])) compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[57]*(float32*)kernel.shared[((threadIdx.x*48) + 43)])) compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[66]*(float32*)kernel.shared[((threadIdx.x*48) + 46)])) compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[58]*(float32*)kernel.shared[((threadIdx.x*48) + 43)])) compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[67]*(float32*)kernel.shared[((threadIdx.x*48) + 46)])) compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[59]*(float32*)kernel.shared[((threadIdx.x*48) + 43)])) compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[68]*(float32*)kernel.shared[((threadIdx.x*48) + 46)])) compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[60]*(float32*)kernel.shared[((threadIdx.x*48) + 43)])) compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[69]*(float32*)kernel.shared[((threadIdx.x*48) + 46)])) compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[61]*(float32*)kernel.shared[((threadIdx.x*48) + 43)])) compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[70]*(float32*)kernel.shared[((threadIdx.x*48) + 46)])) compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[56]*(float32*)kernel.shared[((threadIdx.x*48) + 20)])) compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[65]*(float32*)kernel.shared[((threadIdx.x*48) + 23)])) compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[57]*(float32*)kernel.shared[((threadIdx.x*48) + 20)])) compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[66]*(float32*)kernel.shared[((threadIdx.x*48) + 23)])) compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[58]*(float32*)kernel.shared[((threadIdx.x*48) + 20)])) compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[67]*(float32*)kernel.shared[((threadIdx.x*48) + 23)])) compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[59]*(float32*)kernel.shared[((threadIdx.x*48) + 20)])) compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[68]*(float32*)kernel.shared[((threadIdx.x*48) + 23)])) compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[60]*(float32*)kernel.shared[((threadIdx.x*48) + 20)])) compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[69]*(float32*)kernel.shared[((threadIdx.x*48) + 23)])) compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[61]*(float32*)kernel.shared[((threadIdx.x*48) + 20)])) compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[70]*(float32*)kernel.shared[((threadIdx.x*48) + 23)])) compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[62]*(float32*)kernel.shared[((threadIdx.x*48) + 20)])) compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[71]*(float32*)kernel.shared[((threadIdx.x*48) + 23)])) compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[56]*(float32*)kernel.shared[((threadIdx.x*48) + 44)])) compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[65]*(float32*)kernel.shared[((threadIdx.x*48) + 47)])) compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[57]*(float32*)kernel.shared[((threadIdx.x*48) + 44)])) compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[66]*(float32*)kernel.shared[((threadIdx.x*48) + 47)])) compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[58]*(float32*)kernel.shared[((threadIdx.x*48) + 44)])) compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[67]*(float32*)kernel.shared[((threadIdx.x*48) + 47)])) compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[59]*(float32*)kernel.shared[((threadIdx.x*48) + 44)])) compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[68]*(float32*)kernel.shared[((threadIdx.x*48) + 47)])) compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[60]*(float32*)kernel.shared[((threadIdx.x*48) + 44)])) compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[69]*(float32*)kernel.shared[((threadIdx.x*48) + 47)])) compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[61]*(float32*)kernel.shared[((threadIdx.x*48) + 44)])) compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[70]*(float32*)kernel.shared[((threadIdx.x*48) + 47)])) compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[62]*(float32*)kernel.shared[((threadIdx.x*48) + 44)])) compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[71]*(float32*)kernel.shared[((threadIdx.x*48) + 47)])) } } for (i1.inner: int32, 0, 2) { for (i3.inner: int32, 0, 7) { compute_2[(((((floordiv(blockIdx.x, 7)*6272) + (threadIdx.x*98)) + (i1.inner*49)) + (floormod(blockIdx.x, 7)*7)) + i3.inner)] = max(((float32*)compute_3[((i1.inner*7) + i3.inner)] + (float32*)bias_2[(((floordiv(blockIdx.x, 7)*128) + (threadIdx.x*2)) + i1.inner)]), 0f32) } } } } Check correctness and evaluate performance ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ We build the binary and check its correctness and performance. .. code-block:: default func = tvm.build(sch, args, target) # Check correctness data_np = np.random.uniform(size=(N, CI, H, W)).astype(np.float32) weight_np = np.random.uniform(size=(CO, CI, KH, KW)).astype(np.float32) bias_np = np.random.uniform(size=(1, CO, 1, 1)).astype(np.float32) conv_np = conv2d_nchw_python(data_np, weight_np, strides, padding) out_np = np.maximum(conv_np + bias_np, 0.0) dev = tvm.cuda() data_tvm = tvm.nd.array(data_np, device=dev) weight_tvm = tvm.nd.array(weight_np, device=dev) bias_tvm = tvm.nd.array(bias_np, device=dev) out_tvm = tvm.nd.empty(out_np.shape, device=dev) func(data_tvm, weight_tvm, bias_tvm, out_tvm) # Check results np.testing.assert_allclose(out_np, out_tvm.numpy(), rtol=1e-3) # Evaluate execution time evaluator = func.time_evaluator(func.entry_name, dev, min_repeat_ms=500) print( "Execution time of this operator: %.3f ms" % (np.median(evaluator(data_tvm, weight_tvm, bias_tvm, out_tvm).results) * 1000) ) .. rst-class:: sphx-glr-script-out Out: .. code-block:: none Execution time of this operator: 0.362 ms Using the record file ^^^^^^^^^^^^^^^^^^^^^ During the search, all measurement records are dumped into the record file "conv2d.json". The measurement records can be used to re-apply search results, resume the search, and perform other analyses. Here is an example where we load the best schedule from a file, print the equivalent python schedule API and CUDA source code. They can be used for debugging and learning the behavior of the auto-scheduler. .. code-block:: default print("Equivalent python schedule:") print(task.print_best(log_file, print_mode="schedule")) print("CUDA source code:") print(task.print_best(log_file, print_mode="cuda")) .. rst-class:: sphx-glr-script-out Out: .. code-block:: none Equivalent python schedule: pad_temp_i0, pad_temp_i1, pad_temp_i2, pad_temp_i3 = tuple(pad_temp.op.axis) + tuple(pad_temp.op.reduce_axis) compute_nn, compute_ff, compute_yy, compute_xx, compute_rc, compute_ry, compute_rx = tuple(compute.op.axis) + tuple(compute.op.reduce_axis) T_add_ax0, T_add_ax1, T_add_ax2, T_add_ax3 = tuple(T_add.op.axis) + tuple(T_add.op.reduce_axis) compute_i0, compute_i1, compute_i2, compute_i3 = tuple(compute.op.axis) + tuple(compute.op.reduce_axis) s[T_add].compute_inline() compute_nn_o_i, compute_nn_i = s[compute].split(compute_nn, factor=1) compute_nn_o_o_i, compute_nn_o_i = s[compute].split(compute_nn_o_i, factor=1) compute_nn_o_o_o_i, compute_nn_o_o_i = s[compute].split(compute_nn_o_o_i, factor=1) compute_nn_o_o_o_o, compute_nn_o_o_o_i = s[compute].split(compute_nn_o_o_o_i, factor=1) compute_ff_o_i, compute_ff_i = s[compute].split(compute_ff, factor=1) compute_ff_o_o_i, compute_ff_o_i = s[compute].split(compute_ff_o_i, factor=2) compute_ff_o_o_o_i, compute_ff_o_o_i = s[compute].split(compute_ff_o_o_i, factor=64) compute_ff_o_o_o_o, compute_ff_o_o_o_i = s[compute].split(compute_ff_o_o_o_i, factor=1) compute_yy_o_i, compute_yy_i = s[compute].split(compute_yy, factor=1) compute_yy_o_o_i, compute_yy_o_i = s[compute].split(compute_yy_o_i, factor=1) compute_yy_o_o_o_i, compute_yy_o_o_i = s[compute].split(compute_yy_o_o_i, factor=1) compute_yy_o_o_o_o, compute_yy_o_o_o_i = s[compute].split(compute_yy_o_o_o_i, factor=1) compute_xx_o_i, compute_xx_i = s[compute].split(compute_xx, factor=1) compute_xx_o_o_i, compute_xx_o_i = s[compute].split(compute_xx_o_i, factor=7) compute_xx_o_o_o_i, compute_xx_o_o_i = s[compute].split(compute_xx_o_o_i, factor=1) compute_xx_o_o_o_o, compute_xx_o_o_o_i = s[compute].split(compute_xx_o_o_o_i, factor=1) compute_rc_o_i, compute_rc_i = s[compute].split(compute_rc, factor=2) compute_rc_o_o, compute_rc_o_i = s[compute].split(compute_rc_o_i, factor=4) compute_ry_o_i, compute_ry_i = s[compute].split(compute_ry, factor=1) compute_ry_o_o, compute_ry_o_i = s[compute].split(compute_ry_o_i, factor=1) compute_rx_o_i, compute_rx_i = s[compute].split(compute_rx, factor=1) compute_rx_o_o, compute_rx_o_i = s[compute].split(compute_rx_o_i, factor=3) s[compute].reorder(compute_nn_o_o_o_o, compute_ff_o_o_o_o, compute_yy_o_o_o_o, compute_xx_o_o_o_o, compute_nn_o_o_o_i, compute_ff_o_o_o_i, compute_yy_o_o_o_i, compute_xx_o_o_o_i, compute_nn_o_o_i, compute_ff_o_o_i, compute_yy_o_o_i, compute_xx_o_o_i, compute_rc_o_o, compute_ry_o_o, compute_rx_o_o, compute_rc_o_i, compute_ry_o_i, compute_rx_o_i, compute_nn_o_i, compute_ff_o_i, compute_yy_o_i, compute_xx_o_i, compute_rc_i, compute_ry_i, compute_rx_i, compute_nn_i, compute_ff_i, compute_yy_i, compute_xx_i) compute_i0_o_i, compute_i0_i = s[compute].split(compute_i0, factor=1) compute_i0_o_o_i, compute_i0_o_i = s[compute].split(compute_i0_o_i, factor=1) compute_i0_o_o_o, compute_i0_o_o_i = s[compute].split(compute_i0_o_o_i, factor=1) compute_i1_o_i, compute_i1_i = s[compute].split(compute_i1, factor=2) compute_i1_o_o_i, compute_i1_o_i = s[compute].split(compute_i1_o_i, factor=64) compute_i1_o_o_o, compute_i1_o_o_i = s[compute].split(compute_i1_o_o_i, factor=1) compute_i2_o_i, compute_i2_i = s[compute].split(compute_i2, factor=1) compute_i2_o_o_i, compute_i2_o_i = s[compute].split(compute_i2_o_i, factor=1) compute_i2_o_o_o, compute_i2_o_o_i = s[compute].split(compute_i2_o_o_i, factor=1) compute_i3_o_i, compute_i3_i = s[compute].split(compute_i3, factor=7) compute_i3_o_o_i, compute_i3_o_i = s[compute].split(compute_i3_o_i, factor=1) compute_i3_o_o_o, compute_i3_o_o_i = s[compute].split(compute_i3_o_o_i, factor=1) s[compute].reorder(compute_i0_o_o_o, compute_i1_o_o_o, compute_i2_o_o_o, compute_i3_o_o_o, compute_i0_o_o_i, compute_i1_o_o_i, compute_i2_o_o_i, compute_i3_o_o_i, compute_i0_o_i, compute_i1_o_i, compute_i2_o_i, compute_i3_o_i, compute_i0_i, compute_i1_i, compute_i2_i, compute_i3_i) s[compute].compute_at(s[compute], compute_i3_o_i) kernel_shared = s.cache_read(kernel, "shared", [compute]) kernel_shared_ax0, kernel_shared_ax1, kernel_shared_ax2, kernel_shared_ax3 = tuple(kernel_shared.op.axis) s[kernel_shared].compute_at(s[compute], compute_rx_o_o) pad_temp_shared = s.cache_read(pad_temp, "shared", [compute]) pad_temp_shared_ax0, pad_temp_shared_ax1, pad_temp_shared_ax2, pad_temp_shared_ax3 = tuple(pad_temp_shared.op.axis) s[pad_temp_shared].compute_at(s[compute], compute_rx_o_o) s[pad_temp].compute_inline() compute_i0_o_o_o_i1_o_o_o_fused_i2_o_o_o_fused_i3_o_o_o_fused = s[compute].fuse(compute_i0_o_o_o, compute_i1_o_o_o, compute_i2_o_o_o, compute_i3_o_o_o) s[compute].bind(compute_i0_o_o_o_i1_o_o_o_fused_i2_o_o_o_fused_i3_o_o_o_fused, te.thread_axis("blockIdx.x")) compute_i0_o_o_i_i1_o_o_i_fused_i2_o_o_i_fused_i3_o_o_i_fused = s[compute].fuse(compute_i0_o_o_i, compute_i1_o_o_i, compute_i2_o_o_i, compute_i3_o_o_i) s[compute].bind(compute_i0_o_o_i_i1_o_o_i_fused_i2_o_o_i_fused_i3_o_o_i_fused, te.thread_axis("vthread")) compute_i0_o_i_i1_o_i_fused_i2_o_i_fused_i3_o_i_fused = s[compute].fuse(compute_i0_o_i, compute_i1_o_i, compute_i2_o_i, compute_i3_o_i) s[compute].bind(compute_i0_o_i_i1_o_i_fused_i2_o_i_fused_i3_o_i_fused, te.thread_axis("threadIdx.x")) kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused = s[kernel_shared].fuse(kernel_shared_ax0, kernel_shared_ax1, kernel_shared_ax2, kernel_shared_ax3) kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_i = s[kernel_shared].split(kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused, factor=1) s[kernel_shared].vectorize(kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_i) kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_o, kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i = s[kernel_shared].split(kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, factor=64) s[kernel_shared].bind(kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i, te.thread_axis("threadIdx.x")) pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused = s[pad_temp_shared].fuse(pad_temp_shared_ax0, pad_temp_shared_ax1, pad_temp_shared_ax2, pad_temp_shared_ax3) pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_i = s[pad_temp_shared].split(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused, factor=4) s[pad_temp_shared].vectorize(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_i) pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_o, pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i = s[pad_temp_shared].split(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, factor=64) s[pad_temp_shared].bind(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i, te.thread_axis("threadIdx.x")) s[compute].pragma(compute_nn_o_o_o_o, "auto_unroll_max_step", 512) s[compute].pragma(compute_nn_o_o_o_o, "unroll_explicit", True) CUDA source code: #ifdef _WIN32 using uint = unsigned int; using uchar = unsigned char; using ushort = unsigned short; using int64_t = long long; using uint64_t = unsigned long long; #else #define uint unsigned int #define uchar unsigned char #define ushort unsigned short #define int64_t long long #define uint64_t unsigned long long #endif extern "C" __global__ void __launch_bounds__(64) default_function_kernel0(float* __restrict__ data, float* __restrict__ kernel, float* __restrict__ compute, float* __restrict__ bias) { float compute1[14]; __shared__ float pad_temp_shared[72]; __shared__ float kernel_shared[3072]; compute1[(0)] = 0.000000e+00f; compute1[(1)] = 0.000000e+00f; compute1[(2)] = 0.000000e+00f; compute1[(3)] = 0.000000e+00f; compute1[(4)] = 0.000000e+00f; compute1[(5)] = 0.000000e+00f; compute1[(6)] = 0.000000e+00f; compute1[(7)] = 0.000000e+00f; compute1[(8)] = 0.000000e+00f; compute1[(9)] = 0.000000e+00f; compute1[(10)] = 0.000000e+00f; compute1[(11)] = 0.000000e+00f; compute1[(12)] = 0.000000e+00f; compute1[(13)] = 0.000000e+00f; for (int rc_outer_outer = 0; rc_outer_outer < 64; ++rc_outer_outer) { for (int ry_outer_outer = 0; ry_outer_outer < 3; ++ry_outer_outer) { __syncthreads(); if (((int)threadIdx.x) < 18) { pad_temp_shared[((((int)threadIdx.x) * 4))] = (((((1 <= (ry_outer_outer + (((int)blockIdx.x) % 7))) && ((ry_outer_outer + (((int)blockIdx.x) % 7)) < 8)) && (1 <= ((((int)threadIdx.x) * 4) % 9))) && (((((int)threadIdx.x) * 4) % 9) < 8)) ? data[(((((((rc_outer_outer * 392) + (((((int)threadIdx.x) * 4) / 9) * 49)) + (ry_outer_outer * 7)) + ((((int)blockIdx.x) % 7) * 7)) + ((((int)threadIdx.x) * 4) % 9)) - 8))] : 0.000000e+00f); } if (((int)threadIdx.x) < 18) { pad_temp_shared[(((((int)threadIdx.x) * 4) + 1))] = (((((1 <= (ry_outer_outer + (((int)blockIdx.x) % 7))) && ((ry_outer_outer + (((int)blockIdx.x) % 7)) < 8)) && (1 <= (((((int)threadIdx.x) * 4) + 1) % 9))) && ((((((int)threadIdx.x) * 4) + 1) % 9) < 8)) ? data[(((((((rc_outer_outer * 392) + ((((((int)threadIdx.x) * 4) + 1) / 9) * 49)) + (ry_outer_outer * 7)) + ((((int)blockIdx.x) % 7) * 7)) + (((((int)threadIdx.x) * 4) + 1) % 9)) - 8))] : 0.000000e+00f); } if (((int)threadIdx.x) < 18) { pad_temp_shared[(((((int)threadIdx.x) * 4) + 2))] = (((((1 <= (ry_outer_outer + (((int)blockIdx.x) % 7))) && ((ry_outer_outer + (((int)blockIdx.x) % 7)) < 8)) && (1 <= (((((int)threadIdx.x) * 4) + 2) % 9))) && ((((((int)threadIdx.x) * 4) + 2) % 9) < 8)) ? data[(((((((rc_outer_outer * 392) + ((((((int)threadIdx.x) * 4) + 2) / 9) * 49)) + (ry_outer_outer * 7)) + ((((int)blockIdx.x) % 7) * 7)) + (((((int)threadIdx.x) * 4) + 2) % 9)) - 8))] : 0.000000e+00f); } if (((int)threadIdx.x) < 18) { pad_temp_shared[(((((int)threadIdx.x) * 4) + 3))] = (((((1 <= (ry_outer_outer + (((int)blockIdx.x) % 7))) && ((ry_outer_outer + (((int)blockIdx.x) % 7)) < 8)) && (1 <= (((((int)threadIdx.x) * 4) + 3) % 9))) && ((((((int)threadIdx.x) * 4) + 3) % 9) < 8)) ? data[(((((((rc_outer_outer * 392) + ((((((int)threadIdx.x) * 4) + 3) / 9) * 49)) + (ry_outer_outer * 7)) + ((((int)blockIdx.x) % 7) * 7)) + (((((int)threadIdx.x) * 4) + 3) % 9)) - 8))] : 0.000000e+00f); } kernel_shared[(((int)threadIdx.x))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)))]; kernel_shared[((((int)threadIdx.x) + 64))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 64) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3)))]; kernel_shared[((((int)threadIdx.x) + 128))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 128) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3)))]; kernel_shared[((((int)threadIdx.x) + 192))] = kernel[(((((((((((int)blockIdx.x) / 7) * 589824) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 36864))]; kernel_shared[((((int)threadIdx.x) + 256))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 256) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3)))]; kernel_shared[((((int)threadIdx.x) + 320))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 320) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3)))]; kernel_shared[((((int)threadIdx.x) + 384))] = kernel[(((((((((((int)blockIdx.x) / 7) * 589824) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 73728))]; kernel_shared[((((int)threadIdx.x) + 448))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 448) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3)))]; kernel_shared[((((int)threadIdx.x) + 512))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 512) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3)))]; kernel_shared[((((int)threadIdx.x) + 576))] = kernel[(((((((((((int)blockIdx.x) / 7) * 589824) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 110592))]; kernel_shared[((((int)threadIdx.x) + 640))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 640) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3)))]; kernel_shared[((((int)threadIdx.x) + 704))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 704) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3)))]; kernel_shared[((((int)threadIdx.x) + 768))] = kernel[(((((((((((int)blockIdx.x) / 7) * 589824) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 147456))]; kernel_shared[((((int)threadIdx.x) + 832))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 832) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3)))]; kernel_shared[((((int)threadIdx.x) + 896))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 896) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3)))]; kernel_shared[((((int)threadIdx.x) + 960))] = kernel[(((((((((((int)blockIdx.x) / 7) * 589824) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 184320))]; kernel_shared[((((int)threadIdx.x) + 1024))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 1024) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3)))]; kernel_shared[((((int)threadIdx.x) + 1088))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 1088) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3)))]; kernel_shared[((((int)threadIdx.x) + 1152))] = kernel[(((((((((((int)blockIdx.x) / 7) * 589824) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 221184))]; kernel_shared[((((int)threadIdx.x) + 1216))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 1216) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3)))]; kernel_shared[((((int)threadIdx.x) + 1280))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 1280) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3)))]; kernel_shared[((((int)threadIdx.x) + 1344))] = kernel[(((((((((((int)blockIdx.x) / 7) * 589824) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 258048))]; kernel_shared[((((int)threadIdx.x) + 1408))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 1408) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3)))]; kernel_shared[((((int)threadIdx.x) + 1472))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 1472) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3)))]; kernel_shared[((((int)threadIdx.x) + 1536))] = kernel[(((((((((((int)blockIdx.x) / 7) * 589824) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 294912))]; kernel_shared[((((int)threadIdx.x) + 1600))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 1600) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3)))]; kernel_shared[((((int)threadIdx.x) + 1664))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 1664) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3)))]; kernel_shared[((((int)threadIdx.x) + 1728))] = kernel[(((((((((((int)blockIdx.x) / 7) * 589824) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 331776))]; kernel_shared[((((int)threadIdx.x) + 1792))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 1792) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3)))]; kernel_shared[((((int)threadIdx.x) + 1856))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 1856) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3)))]; kernel_shared[((((int)threadIdx.x) + 1920))] = kernel[(((((((((((int)blockIdx.x) / 7) * 589824) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 368640))]; kernel_shared[((((int)threadIdx.x) + 1984))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 1984) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3)))]; kernel_shared[((((int)threadIdx.x) + 2048))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 2048) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3)))]; kernel_shared[((((int)threadIdx.x) + 2112))] = kernel[(((((((((((int)blockIdx.x) / 7) * 589824) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 405504))]; kernel_shared[((((int)threadIdx.x) + 2176))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 2176) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3)))]; kernel_shared[((((int)threadIdx.x) + 2240))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 2240) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3)))]; kernel_shared[((((int)threadIdx.x) + 2304))] = kernel[(((((((((((int)blockIdx.x) / 7) * 589824) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 442368))]; kernel_shared[((((int)threadIdx.x) + 2368))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 2368) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3)))]; kernel_shared[((((int)threadIdx.x) + 2432))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 2432) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3)))]; kernel_shared[((((int)threadIdx.x) + 2496))] = kernel[(((((((((((int)blockIdx.x) / 7) * 589824) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 479232))]; kernel_shared[((((int)threadIdx.x) + 2560))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 2560) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3)))]; kernel_shared[((((int)threadIdx.x) + 2624))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 2624) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3)))]; kernel_shared[((((int)threadIdx.x) + 2688))] = kernel[(((((((((((int)blockIdx.x) / 7) * 589824) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 516096))]; kernel_shared[((((int)threadIdx.x) + 2752))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 2752) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3)))]; kernel_shared[((((int)threadIdx.x) + 2816))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 2816) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3)))]; kernel_shared[((((int)threadIdx.x) + 2880))] = kernel[(((((((((((int)blockIdx.x) / 7) * 589824) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 552960))]; kernel_shared[((((int)threadIdx.x) + 2944))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 2944) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3)))]; kernel_shared[((((int)threadIdx.x) + 3008))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 3008) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3)))]; __syncthreads(); compute1[(0)] = (compute1[(0)] + (pad_temp_shared[(0)] * kernel_shared[((((int)threadIdx.x) * 48))])); compute1[(0)] = (compute1[(0)] + (pad_temp_shared[(9)] * kernel_shared[(((((int)threadIdx.x) * 48) + 3))])); compute1[(1)] = (compute1[(1)] + (pad_temp_shared[(1)] * kernel_shared[((((int)threadIdx.x) * 48))])); compute1[(1)] = (compute1[(1)] + (pad_temp_shared[(10)] * kernel_shared[(((((int)threadIdx.x) * 48) + 3))])); compute1[(2)] = (compute1[(2)] + (pad_temp_shared[(2)] * kernel_shared[((((int)threadIdx.x) * 48))])); compute1[(2)] = (compute1[(2)] + (pad_temp_shared[(11)] * kernel_shared[(((((int)threadIdx.x) * 48) + 3))])); compute1[(3)] = (compute1[(3)] + (pad_temp_shared[(3)] * kernel_shared[((((int)threadIdx.x) * 48))])); compute1[(3)] = (compute1[(3)] + (pad_temp_shared[(12)] * kernel_shared[(((((int)threadIdx.x) * 48) + 3))])); compute1[(4)] = (compute1[(4)] + (pad_temp_shared[(4)] * kernel_shared[((((int)threadIdx.x) * 48))])); compute1[(4)] = (compute1[(4)] + (pad_temp_shared[(13)] * kernel_shared[(((((int)threadIdx.x) * 48) + 3))])); compute1[(5)] = (compute1[(5)] + (pad_temp_shared[(5)] * kernel_shared[((((int)threadIdx.x) * 48))])); compute1[(5)] = (compute1[(5)] + (pad_temp_shared[(14)] * kernel_shared[(((((int)threadIdx.x) * 48) + 3))])); compute1[(6)] = (compute1[(6)] + (pad_temp_shared[(6)] * kernel_shared[((((int)threadIdx.x) * 48))])); compute1[(6)] = (compute1[(6)] + (pad_temp_shared[(15)] * kernel_shared[(((((int)threadIdx.x) * 48) + 3))])); compute1[(7)] = (compute1[(7)] + (pad_temp_shared[(0)] * kernel_shared[(((((int)threadIdx.x) * 48) + 24))])); compute1[(7)] = (compute1[(7)] + (pad_temp_shared[(9)] * kernel_shared[(((((int)threadIdx.x) * 48) + 27))])); compute1[(8)] = (compute1[(8)] + (pad_temp_shared[(1)] * kernel_shared[(((((int)threadIdx.x) * 48) + 24))])); compute1[(8)] = (compute1[(8)] + (pad_temp_shared[(10)] * kernel_shared[(((((int)threadIdx.x) * 48) + 27))])); compute1[(9)] = (compute1[(9)] + (pad_temp_shared[(2)] * kernel_shared[(((((int)threadIdx.x) * 48) + 24))])); compute1[(9)] = (compute1[(9)] + (pad_temp_shared[(11)] * kernel_shared[(((((int)threadIdx.x) * 48) + 27))])); compute1[(10)] = (compute1[(10)] + (pad_temp_shared[(3)] * kernel_shared[(((((int)threadIdx.x) * 48) + 24))])); compute1[(10)] = (compute1[(10)] + (pad_temp_shared[(12)] * kernel_shared[(((((int)threadIdx.x) * 48) + 27))])); compute1[(11)] = (compute1[(11)] + (pad_temp_shared[(4)] * kernel_shared[(((((int)threadIdx.x) * 48) + 24))])); compute1[(11)] = (compute1[(11)] + (pad_temp_shared[(13)] * kernel_shared[(((((int)threadIdx.x) * 48) + 27))])); compute1[(12)] = (compute1[(12)] + (pad_temp_shared[(5)] * kernel_shared[(((((int)threadIdx.x) * 48) + 24))])); compute1[(12)] = (compute1[(12)] + (pad_temp_shared[(14)] * kernel_shared[(((((int)threadIdx.x) * 48) + 27))])); compute1[(13)] = (compute1[(13)] + (pad_temp_shared[(6)] * kernel_shared[(((((int)threadIdx.x) * 48) + 24))])); compute1[(13)] = (compute1[(13)] + (pad_temp_shared[(15)] * kernel_shared[(((((int)threadIdx.x) * 48) + 27))])); compute1[(0)] = (compute1[(0)] + (pad_temp_shared[(1)] * kernel_shared[(((((int)threadIdx.x) * 48) + 1))])); compute1[(0)] = (compute1[(0)] + (pad_temp_shared[(10)] * kernel_shared[(((((int)threadIdx.x) * 48) + 4))])); compute1[(1)] = (compute1[(1)] + (pad_temp_shared[(2)] * kernel_shared[(((((int)threadIdx.x) * 48) + 1))])); compute1[(1)] = (compute1[(1)] + (pad_temp_shared[(11)] * kernel_shared[(((((int)threadIdx.x) * 48) + 4))])); compute1[(2)] = (compute1[(2)] + (pad_temp_shared[(3)] * kernel_shared[(((((int)threadIdx.x) * 48) + 1))])); compute1[(2)] = (compute1[(2)] + (pad_temp_shared[(12)] * kernel_shared[(((((int)threadIdx.x) * 48) + 4))])); compute1[(3)] = (compute1[(3)] + (pad_temp_shared[(4)] * kernel_shared[(((((int)threadIdx.x) * 48) + 1))])); compute1[(3)] = (compute1[(3)] + (pad_temp_shared[(13)] * kernel_shared[(((((int)threadIdx.x) * 48) + 4))])); compute1[(4)] = (compute1[(4)] + (pad_temp_shared[(5)] * kernel_shared[(((((int)threadIdx.x) * 48) + 1))])); compute1[(4)] = (compute1[(4)] + (pad_temp_shared[(14)] * kernel_shared[(((((int)threadIdx.x) * 48) + 4))])); compute1[(5)] = (compute1[(5)] + (pad_temp_shared[(6)] * kernel_shared[(((((int)threadIdx.x) * 48) + 1))])); compute1[(5)] = (compute1[(5)] + (pad_temp_shared[(15)] * kernel_shared[(((((int)threadIdx.x) * 48) + 4))])); compute1[(6)] = (compute1[(6)] + (pad_temp_shared[(7)] * kernel_shared[(((((int)threadIdx.x) * 48) + 1))])); compute1[(6)] = (compute1[(6)] + (pad_temp_shared[(16)] * kernel_shared[(((((int)threadIdx.x) * 48) + 4))])); compute1[(7)] = (compute1[(7)] + (pad_temp_shared[(1)] * kernel_shared[(((((int)threadIdx.x) * 48) + 25))])); compute1[(7)] = (compute1[(7)] + (pad_temp_shared[(10)] * kernel_shared[(((((int)threadIdx.x) * 48) + 28))])); compute1[(8)] = (compute1[(8)] + (pad_temp_shared[(2)] * kernel_shared[(((((int)threadIdx.x) * 48) + 25))])); compute1[(8)] = (compute1[(8)] + (pad_temp_shared[(11)] * kernel_shared[(((((int)threadIdx.x) * 48) + 28))])); compute1[(9)] = (compute1[(9)] + (pad_temp_shared[(3)] * kernel_shared[(((((int)threadIdx.x) * 48) + 25))])); compute1[(9)] = (compute1[(9)] + (pad_temp_shared[(12)] * kernel_shared[(((((int)threadIdx.x) * 48) + 28))])); compute1[(10)] = (compute1[(10)] + (pad_temp_shared[(4)] * kernel_shared[(((((int)threadIdx.x) * 48) + 25))])); compute1[(10)] = (compute1[(10)] + (pad_temp_shared[(13)] * kernel_shared[(((((int)threadIdx.x) * 48) + 28))])); compute1[(11)] = (compute1[(11)] + (pad_temp_shared[(5)] * kernel_shared[(((((int)threadIdx.x) * 48) + 25))])); compute1[(11)] = (compute1[(11)] + (pad_temp_shared[(14)] * kernel_shared[(((((int)threadIdx.x) * 48) + 28))])); compute1[(12)] = (compute1[(12)] + (pad_temp_shared[(6)] * kernel_shared[(((((int)threadIdx.x) * 48) + 25))])); compute1[(12)] = (compute1[(12)] + (pad_temp_shared[(15)] * kernel_shared[(((((int)threadIdx.x) * 48) + 28))])); compute1[(13)] = (compute1[(13)] + (pad_temp_shared[(7)] * kernel_shared[(((((int)threadIdx.x) * 48) + 25))])); compute1[(13)] = (compute1[(13)] + (pad_temp_shared[(16)] * kernel_shared[(((((int)threadIdx.x) * 48) + 28))])); compute1[(0)] = (compute1[(0)] + (pad_temp_shared[(2)] * kernel_shared[(((((int)threadIdx.x) * 48) + 2))])); compute1[(0)] = (compute1[(0)] + (pad_temp_shared[(11)] * kernel_shared[(((((int)threadIdx.x) * 48) + 5))])); compute1[(1)] = (compute1[(1)] + (pad_temp_shared[(3)] * kernel_shared[(((((int)threadIdx.x) * 48) + 2))])); compute1[(1)] = (compute1[(1)] + (pad_temp_shared[(12)] * kernel_shared[(((((int)threadIdx.x) * 48) + 5))])); compute1[(2)] = (compute1[(2)] + (pad_temp_shared[(4)] * kernel_shared[(((((int)threadIdx.x) * 48) + 2))])); compute1[(2)] = (compute1[(2)] + (pad_temp_shared[(13)] * kernel_shared[(((((int)threadIdx.x) * 48) + 5))])); compute1[(3)] = (compute1[(3)] + (pad_temp_shared[(5)] * kernel_shared[(((((int)threadIdx.x) * 48) + 2))])); compute1[(3)] = (compute1[(3)] + (pad_temp_shared[(14)] * kernel_shared[(((((int)threadIdx.x) * 48) + 5))])); compute1[(4)] = (compute1[(4)] + (pad_temp_shared[(6)] * kernel_shared[(((((int)threadIdx.x) * 48) + 2))])); compute1[(4)] = (compute1[(4)] + (pad_temp_shared[(15)] * kernel_shared[(((((int)threadIdx.x) * 48) + 5))])); compute1[(5)] = (compute1[(5)] + (pad_temp_shared[(7)] * kernel_shared[(((((int)threadIdx.x) * 48) + 2))])); compute1[(5)] = (compute1[(5)] + (pad_temp_shared[(16)] * kernel_shared[(((((int)threadIdx.x) * 48) + 5))])); compute1[(6)] = (compute1[(6)] + (pad_temp_shared[(8)] * kernel_shared[(((((int)threadIdx.x) * 48) + 2))])); compute1[(6)] = (compute1[(6)] + (pad_temp_shared[(17)] * kernel_shared[(((((int)threadIdx.x) * 48) + 5))])); compute1[(7)] = (compute1[(7)] + (pad_temp_shared[(2)] * kernel_shared[(((((int)threadIdx.x) * 48) + 26))])); compute1[(7)] = (compute1[(7)] + (pad_temp_shared[(11)] * kernel_shared[(((((int)threadIdx.x) * 48) + 29))])); compute1[(8)] = (compute1[(8)] + (pad_temp_shared[(3)] * kernel_shared[(((((int)threadIdx.x) * 48) + 26))])); compute1[(8)] = (compute1[(8)] + (pad_temp_shared[(12)] * kernel_shared[(((((int)threadIdx.x) * 48) + 29))])); compute1[(9)] = (compute1[(9)] + (pad_temp_shared[(4)] * kernel_shared[(((((int)threadIdx.x) * 48) + 26))])); compute1[(9)] = (compute1[(9)] + (pad_temp_shared[(13)] * kernel_shared[(((((int)threadIdx.x) * 48) + 29))])); compute1[(10)] = (compute1[(10)] + (pad_temp_shared[(5)] * kernel_shared[(((((int)threadIdx.x) * 48) + 26))])); compute1[(10)] = (compute1[(10)] + (pad_temp_shared[(14)] * kernel_shared[(((((int)threadIdx.x) * 48) + 29))])); compute1[(11)] = (compute1[(11)] + (pad_temp_shared[(6)] * kernel_shared[(((((int)threadIdx.x) * 48) + 26))])); compute1[(11)] = (compute1[(11)] + (pad_temp_shared[(15)] * kernel_shared[(((((int)threadIdx.x) * 48) + 29))])); compute1[(12)] = (compute1[(12)] + (pad_temp_shared[(7)] * kernel_shared[(((((int)threadIdx.x) * 48) + 26))])); compute1[(12)] = (compute1[(12)] + (pad_temp_shared[(16)] * kernel_shared[(((((int)threadIdx.x) * 48) + 29))])); compute1[(13)] = (compute1[(13)] + (pad_temp_shared[(8)] * kernel_shared[(((((int)threadIdx.x) * 48) + 26))])); compute1[(13)] = (compute1[(13)] + (pad_temp_shared[(17)] * kernel_shared[(((((int)threadIdx.x) * 48) + 29))])); compute1[(0)] = (compute1[(0)] + (pad_temp_shared[(18)] * kernel_shared[(((((int)threadIdx.x) * 48) + 6))])); compute1[(0)] = (compute1[(0)] + (pad_temp_shared[(27)] * kernel_shared[(((((int)threadIdx.x) * 48) + 9))])); compute1[(1)] = (compute1[(1)] + (pad_temp_shared[(19)] * kernel_shared[(((((int)threadIdx.x) * 48) + 6))])); compute1[(1)] = (compute1[(1)] + (pad_temp_shared[(28)] * kernel_shared[(((((int)threadIdx.x) * 48) + 9))])); compute1[(2)] = (compute1[(2)] + (pad_temp_shared[(20)] * kernel_shared[(((((int)threadIdx.x) * 48) + 6))])); compute1[(2)] = (compute1[(2)] + (pad_temp_shared[(29)] * kernel_shared[(((((int)threadIdx.x) * 48) + 9))])); compute1[(3)] = (compute1[(3)] + (pad_temp_shared[(21)] * kernel_shared[(((((int)threadIdx.x) * 48) + 6))])); compute1[(3)] = (compute1[(3)] + (pad_temp_shared[(30)] * kernel_shared[(((((int)threadIdx.x) * 48) + 9))])); compute1[(4)] = (compute1[(4)] + (pad_temp_shared[(22)] * kernel_shared[(((((int)threadIdx.x) * 48) + 6))])); compute1[(4)] = (compute1[(4)] + (pad_temp_shared[(31)] * kernel_shared[(((((int)threadIdx.x) * 48) + 9))])); compute1[(5)] = (compute1[(5)] + (pad_temp_shared[(23)] * kernel_shared[(((((int)threadIdx.x) * 48) + 6))])); compute1[(5)] = (compute1[(5)] + (pad_temp_shared[(32)] * kernel_shared[(((((int)threadIdx.x) * 48) + 9))])); compute1[(6)] = (compute1[(6)] + (pad_temp_shared[(24)] * kernel_shared[(((((int)threadIdx.x) * 48) + 6))])); compute1[(6)] = (compute1[(6)] + (pad_temp_shared[(33)] * kernel_shared[(((((int)threadIdx.x) * 48) + 9))])); compute1[(7)] = (compute1[(7)] + (pad_temp_shared[(18)] * kernel_shared[(((((int)threadIdx.x) * 48) + 30))])); compute1[(7)] = (compute1[(7)] + (pad_temp_shared[(27)] * kernel_shared[(((((int)threadIdx.x) * 48) + 33))])); compute1[(8)] = (compute1[(8)] + (pad_temp_shared[(19)] * kernel_shared[(((((int)threadIdx.x) * 48) + 30))])); compute1[(8)] = (compute1[(8)] + (pad_temp_shared[(28)] * kernel_shared[(((((int)threadIdx.x) * 48) + 33))])); compute1[(9)] = (compute1[(9)] + (pad_temp_shared[(20)] * kernel_shared[(((((int)threadIdx.x) * 48) + 30))])); compute1[(9)] = (compute1[(9)] + (pad_temp_shared[(29)] * kernel_shared[(((((int)threadIdx.x) * 48) + 33))])); compute1[(10)] = (compute1[(10)] + (pad_temp_shared[(21)] * kernel_shared[(((((int)threadIdx.x) * 48) + 30))])); compute1[(10)] = (compute1[(10)] + (pad_temp_shared[(30)] * kernel_shared[(((((int)threadIdx.x) * 48) + 33))])); compute1[(11)] = (compute1[(11)] + (pad_temp_shared[(22)] * kernel_shared[(((((int)threadIdx.x) * 48) + 30))])); compute1[(11)] = (compute1[(11)] + (pad_temp_shared[(31)] * kernel_shared[(((((int)threadIdx.x) * 48) + 33))])); compute1[(12)] = (compute1[(12)] + (pad_temp_shared[(23)] * kernel_shared[(((((int)threadIdx.x) * 48) + 30))])); compute1[(12)] = (compute1[(12)] + (pad_temp_shared[(32)] * kernel_shared[(((((int)threadIdx.x) * 48) + 33))])); compute1[(13)] = (compute1[(13)] + (pad_temp_shared[(24)] * kernel_shared[(((((int)threadIdx.x) * 48) + 30))])); compute1[(13)] = (compute1[(13)] + (pad_temp_shared[(33)] * kernel_shared[(((((int)threadIdx.x) * 48) + 33))])); compute1[(0)] = (compute1[(0)] + (pad_temp_shared[(19)] * kernel_shared[(((((int)threadIdx.x) * 48) + 7))])); compute1[(0)] = (compute1[(0)] + (pad_temp_shared[(28)] * kernel_shared[(((((int)threadIdx.x) * 48) + 10))])); compute1[(1)] = (compute1[(1)] + (pad_temp_shared[(20)] * kernel_shared[(((((int)threadIdx.x) * 48) + 7))])); compute1[(1)] = (compute1[(1)] + (pad_temp_shared[(29)] * kernel_shared[(((((int)threadIdx.x) * 48) + 10))])); compute1[(2)] = (compute1[(2)] + (pad_temp_shared[(21)] * kernel_shared[(((((int)threadIdx.x) * 48) + 7))])); compute1[(2)] = (compute1[(2)] + (pad_temp_shared[(30)] * kernel_shared[(((((int)threadIdx.x) * 48) + 10))])); compute1[(3)] = (compute1[(3)] + (pad_temp_shared[(22)] * kernel_shared[(((((int)threadIdx.x) * 48) + 7))])); compute1[(3)] = (compute1[(3)] + (pad_temp_shared[(31)] * kernel_shared[(((((int)threadIdx.x) * 48) + 10))])); compute1[(4)] = (compute1[(4)] + (pad_temp_shared[(23)] * kernel_shared[(((((int)threadIdx.x) * 48) + 7))])); compute1[(4)] = (compute1[(4)] + (pad_temp_shared[(32)] * kernel_shared[(((((int)threadIdx.x) * 48) + 10))])); compute1[(5)] = (compute1[(5)] + (pad_temp_shared[(24)] * kernel_shared[(((((int)threadIdx.x) * 48) + 7))])); compute1[(5)] = (compute1[(5)] + (pad_temp_shared[(33)] * kernel_shared[(((((int)threadIdx.x) * 48) + 10))])); compute1[(6)] = (compute1[(6)] + (pad_temp_shared[(25)] * kernel_shared[(((((int)threadIdx.x) * 48) + 7))])); compute1[(6)] = (compute1[(6)] + (pad_temp_shared[(34)] * kernel_shared[(((((int)threadIdx.x) * 48) + 10))])); compute1[(7)] = (compute1[(7)] + (pad_temp_shared[(19)] * kernel_shared[(((((int)threadIdx.x) * 48) + 31))])); compute1[(7)] = (compute1[(7)] + (pad_temp_shared[(28)] * kernel_shared[(((((int)threadIdx.x) * 48) + 34))])); compute1[(8)] = (compute1[(8)] + (pad_temp_shared[(20)] * kernel_shared[(((((int)threadIdx.x) * 48) + 31))])); compute1[(8)] = (compute1[(8)] + (pad_temp_shared[(29)] * kernel_shared[(((((int)threadIdx.x) * 48) + 34))])); compute1[(9)] = (compute1[(9)] + (pad_temp_shared[(21)] * kernel_shared[(((((int)threadIdx.x) * 48) + 31))])); compute1[(9)] = (compute1[(9)] + (pad_temp_shared[(30)] * kernel_shared[(((((int)threadIdx.x) * 48) + 34))])); compute1[(10)] = (compute1[(10)] + (pad_temp_shared[(22)] * kernel_shared[(((((int)threadIdx.x) * 48) + 31))])); compute1[(10)] = (compute1[(10)] + (pad_temp_shared[(31)] * kernel_shared[(((((int)threadIdx.x) * 48) + 34))])); compute1[(11)] = (compute1[(11)] + (pad_temp_shared[(23)] * kernel_shared[(((((int)threadIdx.x) * 48) + 31))])); compute1[(11)] = (compute1[(11)] + (pad_temp_shared[(32)] * kernel_shared[(((((int)threadIdx.x) * 48) + 34))])); compute1[(12)] = (compute1[(12)] + (pad_temp_shared[(24)] * kernel_shared[(((((int)threadIdx.x) * 48) + 31))])); compute1[(12)] = (compute1[(12)] + (pad_temp_shared[(33)] * kernel_shared[(((((int)threadIdx.x) * 48) + 34))])); compute1[(13)] = (compute1[(13)] + (pad_temp_shared[(25)] * kernel_shared[(((((int)threadIdx.x) * 48) + 31))])); compute1[(13)] = (compute1[(13)] + (pad_temp_shared[(34)] * kernel_shared[(((((int)threadIdx.x) * 48) + 34))])); compute1[(0)] = (compute1[(0)] + (pad_temp_shared[(20)] * kernel_shared[(((((int)threadIdx.x) * 48) + 8))])); compute1[(0)] = (compute1[(0)] + (pad_temp_shared[(29)] * kernel_shared[(((((int)threadIdx.x) * 48) + 11))])); compute1[(1)] = (compute1[(1)] + (pad_temp_shared[(21)] * kernel_shared[(((((int)threadIdx.x) * 48) + 8))])); compute1[(1)] = (compute1[(1)] + (pad_temp_shared[(30)] * kernel_shared[(((((int)threadIdx.x) * 48) + 11))])); compute1[(2)] = (compute1[(2)] + (pad_temp_shared[(22)] * kernel_shared[(((((int)threadIdx.x) * 48) + 8))])); compute1[(2)] = (compute1[(2)] + (pad_temp_shared[(31)] * kernel_shared[(((((int)threadIdx.x) * 48) + 11))])); compute1[(3)] = (compute1[(3)] + (pad_temp_shared[(23)] * kernel_shared[(((((int)threadIdx.x) * 48) + 8))])); compute1[(3)] = (compute1[(3)] + (pad_temp_shared[(32)] * kernel_shared[(((((int)threadIdx.x) * 48) + 11))])); compute1[(4)] = (compute1[(4)] + (pad_temp_shared[(24)] * kernel_shared[(((((int)threadIdx.x) * 48) + 8))])); compute1[(4)] = (compute1[(4)] + (pad_temp_shared[(33)] * kernel_shared[(((((int)threadIdx.x) * 48) + 11))])); compute1[(5)] = (compute1[(5)] + (pad_temp_shared[(25)] * kernel_shared[(((((int)threadIdx.x) * 48) + 8))])); compute1[(5)] = (compute1[(5)] + (pad_temp_shared[(34)] * kernel_shared[(((((int)threadIdx.x) * 48) + 11))])); compute1[(6)] = (compute1[(6)] + (pad_temp_shared[(26)] * kernel_shared[(((((int)threadIdx.x) * 48) + 8))])); compute1[(6)] = (compute1[(6)] + (pad_temp_shared[(35)] * kernel_shared[(((((int)threadIdx.x) * 48) + 11))])); compute1[(7)] = (compute1[(7)] + (pad_temp_shared[(20)] * kernel_shared[(((((int)threadIdx.x) * 48) + 32))])); compute1[(7)] = (compute1[(7)] + (pad_temp_shared[(29)] * kernel_shared[(((((int)threadIdx.x) * 48) + 35))])); compute1[(8)] = (compute1[(8)] + (pad_temp_shared[(21)] * kernel_shared[(((((int)threadIdx.x) * 48) + 32))])); compute1[(8)] = (compute1[(8)] + (pad_temp_shared[(30)] * kernel_shared[(((((int)threadIdx.x) * 48) + 35))])); compute1[(9)] = (compute1[(9)] + (pad_temp_shared[(22)] * kernel_shared[(((((int)threadIdx.x) * 48) + 32))])); compute1[(9)] = (compute1[(9)] + (pad_temp_shared[(31)] * kernel_shared[(((((int)threadIdx.x) * 48) + 35))])); compute1[(10)] = (compute1[(10)] + (pad_temp_shared[(23)] * kernel_shared[(((((int)threadIdx.x) * 48) + 32))])); compute1[(10)] = (compute1[(10)] + (pad_temp_shared[(32)] * kernel_shared[(((((int)threadIdx.x) * 48) + 35))])); compute1[(11)] = (compute1[(11)] + (pad_temp_shared[(24)] * kernel_shared[(((((int)threadIdx.x) * 48) + 32))])); compute1[(11)] = (compute1[(11)] + (pad_temp_shared[(33)] * kernel_shared[(((((int)threadIdx.x) * 48) + 35))])); compute1[(12)] = (compute1[(12)] + (pad_temp_shared[(25)] * kernel_shared[(((((int)threadIdx.x) * 48) + 32))])); compute1[(12)] = (compute1[(12)] + (pad_temp_shared[(34)] * kernel_shared[(((((int)threadIdx.x) * 48) + 35))])); compute1[(13)] = (compute1[(13)] + (pad_temp_shared[(26)] * kernel_shared[(((((int)threadIdx.x) * 48) + 32))])); compute1[(13)] = (compute1[(13)] + (pad_temp_shared[(35)] * kernel_shared[(((((int)threadIdx.x) * 48) + 35))])); compute1[(0)] = (compute1[(0)] + (pad_temp_shared[(36)] * kernel_shared[(((((int)threadIdx.x) * 48) + 12))])); compute1[(0)] = (compute1[(0)] + (pad_temp_shared[(45)] * kernel_shared[(((((int)threadIdx.x) * 48) + 15))])); compute1[(1)] = (compute1[(1)] + (pad_temp_shared[(37)] * kernel_shared[(((((int)threadIdx.x) * 48) + 12))])); compute1[(1)] = (compute1[(1)] + (pad_temp_shared[(46)] * kernel_shared[(((((int)threadIdx.x) * 48) + 15))])); compute1[(2)] = (compute1[(2)] + (pad_temp_shared[(38)] * kernel_shared[(((((int)threadIdx.x) * 48) + 12))])); compute1[(2)] = (compute1[(2)] + (pad_temp_shared[(47)] * kernel_shared[(((((int)threadIdx.x) * 48) + 15))])); compute1[(3)] = (compute1[(3)] + (pad_temp_shared[(39)] * kernel_shared[(((((int)threadIdx.x) * 48) + 12))])); compute1[(3)] = (compute1[(3)] + (pad_temp_shared[(48)] * kernel_shared[(((((int)threadIdx.x) * 48) + 15))])); compute1[(4)] = (compute1[(4)] + (pad_temp_shared[(40)] * kernel_shared[(((((int)threadIdx.x) * 48) + 12))])); compute1[(4)] = (compute1[(4)] + (pad_temp_shared[(49)] * kernel_shared[(((((int)threadIdx.x) * 48) + 15))])); compute1[(5)] = (compute1[(5)] + (pad_temp_shared[(41)] * kernel_shared[(((((int)threadIdx.x) * 48) + 12))])); compute1[(5)] = (compute1[(5)] + (pad_temp_shared[(50)] * kernel_shared[(((((int)threadIdx.x) * 48) + 15))])); compute1[(6)] = (compute1[(6)] + (pad_temp_shared[(42)] * kernel_shared[(((((int)threadIdx.x) * 48) + 12))])); compute1[(6)] = (compute1[(6)] + (pad_temp_shared[(51)] * kernel_shared[(((((int)threadIdx.x) * 48) + 15))])); compute1[(7)] = (compute1[(7)] + (pad_temp_shared[(36)] * kernel_shared[(((((int)threadIdx.x) * 48) + 36))])); compute1[(7)] = (compute1[(7)] + (pad_temp_shared[(45)] * kernel_shared[(((((int)threadIdx.x) * 48) + 39))])); compute1[(8)] = (compute1[(8)] + (pad_temp_shared[(37)] * kernel_shared[(((((int)threadIdx.x) * 48) + 36))])); compute1[(8)] = (compute1[(8)] + (pad_temp_shared[(46)] * kernel_shared[(((((int)threadIdx.x) * 48) + 39))])); compute1[(9)] = (compute1[(9)] + (pad_temp_shared[(38)] * kernel_shared[(((((int)threadIdx.x) * 48) + 36))])); compute1[(9)] = (compute1[(9)] + (pad_temp_shared[(47)] * kernel_shared[(((((int)threadIdx.x) * 48) + 39))])); compute1[(10)] = (compute1[(10)] + (pad_temp_shared[(39)] * kernel_shared[(((((int)threadIdx.x) * 48) + 36))])); compute1[(10)] = (compute1[(10)] + (pad_temp_shared[(48)] * kernel_shared[(((((int)threadIdx.x) * 48) + 39))])); compute1[(11)] = (compute1[(11)] + (pad_temp_shared[(40)] * kernel_shared[(((((int)threadIdx.x) * 48) + 36))])); compute1[(11)] = (compute1[(11)] + (pad_temp_shared[(49)] * kernel_shared[(((((int)threadIdx.x) * 48) + 39))])); compute1[(12)] = (compute1[(12)] + (pad_temp_shared[(41)] * kernel_shared[(((((int)threadIdx.x) * 48) + 36))])); compute1[(12)] = (compute1[(12)] + (pad_temp_shared[(50)] * kernel_shared[(((((int)threadIdx.x) * 48) + 39))])); compute1[(13)] = (compute1[(13)] + (pad_temp_shared[(42)] * kernel_shared[(((((int)threadIdx.x) * 48) + 36))])); compute1[(13)] = (compute1[(13)] + (pad_temp_shared[(51)] * kernel_shared[(((((int)threadIdx.x) * 48) + 39))])); compute1[(0)] = (compute1[(0)] + (pad_temp_shared[(37)] * kernel_shared[(((((int)threadIdx.x) * 48) + 13))])); compute1[(0)] = (compute1[(0)] + (pad_temp_shared[(46)] * kernel_shared[(((((int)threadIdx.x) * 48) + 16))])); compute1[(1)] = (compute1[(1)] + (pad_temp_shared[(38)] * kernel_shared[(((((int)threadIdx.x) * 48) + 13))])); compute1[(1)] = (compute1[(1)] + (pad_temp_shared[(47)] * kernel_shared[(((((int)threadIdx.x) * 48) + 16))])); compute1[(2)] = (compute1[(2)] + (pad_temp_shared[(39)] * kernel_shared[(((((int)threadIdx.x) * 48) + 13))])); compute1[(2)] = (compute1[(2)] + (pad_temp_shared[(48)] * kernel_shared[(((((int)threadIdx.x) * 48) + 16))])); compute1[(3)] = (compute1[(3)] + (pad_temp_shared[(40)] * kernel_shared[(((((int)threadIdx.x) * 48) + 13))])); compute1[(3)] = (compute1[(3)] + (pad_temp_shared[(49)] * kernel_shared[(((((int)threadIdx.x) * 48) + 16))])); compute1[(4)] = (compute1[(4)] + (pad_temp_shared[(41)] * kernel_shared[(((((int)threadIdx.x) * 48) + 13))])); compute1[(4)] = (compute1[(4)] + (pad_temp_shared[(50)] * kernel_shared[(((((int)threadIdx.x) * 48) + 16))])); compute1[(5)] = (compute1[(5)] + (pad_temp_shared[(42)] * kernel_shared[(((((int)threadIdx.x) * 48) + 13))])); compute1[(5)] = (compute1[(5)] + (pad_temp_shared[(51)] * kernel_shared[(((((int)threadIdx.x) * 48) + 16))])); compute1[(6)] = (compute1[(6)] + (pad_temp_shared[(43)] * kernel_shared[(((((int)threadIdx.x) * 48) + 13))])); compute1[(6)] = (compute1[(6)] + (pad_temp_shared[(52)] * kernel_shared[(((((int)threadIdx.x) * 48) + 16))])); compute1[(7)] = (compute1[(7)] + (pad_temp_shared[(37)] * kernel_shared[(((((int)threadIdx.x) * 48) + 37))])); compute1[(7)] = (compute1[(7)] + (pad_temp_shared[(46)] * kernel_shared[(((((int)threadIdx.x) * 48) + 40))])); compute1[(8)] = (compute1[(8)] + (pad_temp_shared[(38)] * kernel_shared[(((((int)threadIdx.x) * 48) + 37))])); compute1[(8)] = (compute1[(8)] + (pad_temp_shared[(47)] * kernel_shared[(((((int)threadIdx.x) * 48) + 40))])); compute1[(9)] = (compute1[(9)] + (pad_temp_shared[(39)] * kernel_shared[(((((int)threadIdx.x) * 48) + 37))])); compute1[(9)] = (compute1[(9)] + (pad_temp_shared[(48)] * kernel_shared[(((((int)threadIdx.x) * 48) + 40))])); compute1[(10)] = (compute1[(10)] + (pad_temp_shared[(40)] * kernel_shared[(((((int)threadIdx.x) * 48) + 37))])); compute1[(10)] = (compute1[(10)] + (pad_temp_shared[(49)] * kernel_shared[(((((int)threadIdx.x) * 48) + 40))])); compute1[(11)] = (compute1[(11)] + (pad_temp_shared[(41)] * kernel_shared[(((((int)threadIdx.x) * 48) + 37))])); compute1[(11)] = (compute1[(11)] + (pad_temp_shared[(50)] * kernel_shared[(((((int)threadIdx.x) * 48) + 40))])); compute1[(12)] = (compute1[(12)] + (pad_temp_shared[(42)] * kernel_shared[(((((int)threadIdx.x) * 48) + 37))])); compute1[(12)] = (compute1[(12)] + (pad_temp_shared[(51)] * kernel_shared[(((((int)threadIdx.x) * 48) + 40))])); compute1[(13)] = (compute1[(13)] + (pad_temp_shared[(43)] * kernel_shared[(((((int)threadIdx.x) * 48) + 37))])); compute1[(13)] = (compute1[(13)] + (pad_temp_shared[(52)] * kernel_shared[(((((int)threadIdx.x) * 48) + 40))])); compute1[(0)] = (compute1[(0)] + (pad_temp_shared[(38)] * kernel_shared[(((((int)threadIdx.x) * 48) + 14))])); compute1[(0)] = (compute1[(0)] + (pad_temp_shared[(47)] * kernel_shared[(((((int)threadIdx.x) * 48) + 17))])); compute1[(1)] = (compute1[(1)] + (pad_temp_shared[(39)] * kernel_shared[(((((int)threadIdx.x) * 48) + 14))])); compute1[(1)] = (compute1[(1)] + (pad_temp_shared[(48)] * kernel_shared[(((((int)threadIdx.x) * 48) + 17))])); compute1[(2)] = (compute1[(2)] + (pad_temp_shared[(40)] * kernel_shared[(((((int)threadIdx.x) * 48) + 14))])); compute1[(2)] = (compute1[(2)] + (pad_temp_shared[(49)] * kernel_shared[(((((int)threadIdx.x) * 48) + 17))])); compute1[(3)] = (compute1[(3)] + (pad_temp_shared[(41)] * kernel_shared[(((((int)threadIdx.x) * 48) + 14))])); compute1[(3)] = (compute1[(3)] + (pad_temp_shared[(50)] * kernel_shared[(((((int)threadIdx.x) * 48) + 17))])); compute1[(4)] = (compute1[(4)] + (pad_temp_shared[(42)] * kernel_shared[(((((int)threadIdx.x) * 48) + 14))])); compute1[(4)] = (compute1[(4)] + (pad_temp_shared[(51)] * kernel_shared[(((((int)threadIdx.x) * 48) + 17))])); compute1[(5)] = (compute1[(5)] + (pad_temp_shared[(43)] * kernel_shared[(((((int)threadIdx.x) * 48) + 14))])); compute1[(5)] = (compute1[(5)] + (pad_temp_shared[(52)] * kernel_shared[(((((int)threadIdx.x) * 48) + 17))])); compute1[(6)] = (compute1[(6)] + (pad_temp_shared[(44)] * kernel_shared[(((((int)threadIdx.x) * 48) + 14))])); compute1[(6)] = (compute1[(6)] + (pad_temp_shared[(53)] * kernel_shared[(((((int)threadIdx.x) * 48) + 17))])); compute1[(7)] = (compute1[(7)] + (pad_temp_shared[(38)] * kernel_shared[(((((int)threadIdx.x) * 48) + 38))])); compute1[(7)] = (compute1[(7)] + (pad_temp_shared[(47)] * kernel_shared[(((((int)threadIdx.x) * 48) + 41))])); compute1[(8)] = (compute1[(8)] + (pad_temp_shared[(39)] * kernel_shared[(((((int)threadIdx.x) * 48) + 38))])); compute1[(8)] = (compute1[(8)] + (pad_temp_shared[(48)] * kernel_shared[(((((int)threadIdx.x) * 48) + 41))])); compute1[(9)] = (compute1[(9)] + (pad_temp_shared[(40)] * kernel_shared[(((((int)threadIdx.x) * 48) + 38))])); compute1[(9)] = (compute1[(9)] + (pad_temp_shared[(49)] * kernel_shared[(((((int)threadIdx.x) * 48) + 41))])); compute1[(10)] = (compute1[(10)] + (pad_temp_shared[(41)] * kernel_shared[(((((int)threadIdx.x) * 48) + 38))])); compute1[(10)] = (compute1[(10)] + (pad_temp_shared[(50)] * kernel_shared[(((((int)threadIdx.x) * 48) + 41))])); compute1[(11)] = (compute1[(11)] + (pad_temp_shared[(42)] * kernel_shared[(((((int)threadIdx.x) * 48) + 38))])); compute1[(11)] = (compute1[(11)] + (pad_temp_shared[(51)] * kernel_shared[(((((int)threadIdx.x) * 48) + 41))])); compute1[(12)] = (compute1[(12)] + (pad_temp_shared[(43)] * kernel_shared[(((((int)threadIdx.x) * 48) + 38))])); compute1[(12)] = (compute1[(12)] + (pad_temp_shared[(52)] * kernel_shared[(((((int)threadIdx.x) * 48) + 41))])); compute1[(13)] = (compute1[(13)] + (pad_temp_shared[(44)] * kernel_shared[(((((int)threadIdx.x) * 48) + 38))])); compute1[(13)] = (compute1[(13)] + (pad_temp_shared[(53)] * kernel_shared[(((((int)threadIdx.x) * 48) + 41))])); compute1[(0)] = (compute1[(0)] + (pad_temp_shared[(54)] * kernel_shared[(((((int)threadIdx.x) * 48) + 18))])); compute1[(0)] = (compute1[(0)] + (pad_temp_shared[(63)] * kernel_shared[(((((int)threadIdx.x) * 48) + 21))])); compute1[(1)] = (compute1[(1)] + (pad_temp_shared[(55)] * kernel_shared[(((((int)threadIdx.x) * 48) + 18))])); compute1[(1)] = (compute1[(1)] + (pad_temp_shared[(64)] * kernel_shared[(((((int)threadIdx.x) * 48) + 21))])); compute1[(2)] = (compute1[(2)] + (pad_temp_shared[(56)] * kernel_shared[(((((int)threadIdx.x) * 48) + 18))])); compute1[(2)] = (compute1[(2)] + (pad_temp_shared[(65)] * kernel_shared[(((((int)threadIdx.x) * 48) + 21))])); compute1[(3)] = (compute1[(3)] + (pad_temp_shared[(57)] * kernel_shared[(((((int)threadIdx.x) * 48) + 18))])); compute1[(3)] = (compute1[(3)] + (pad_temp_shared[(66)] * kernel_shared[(((((int)threadIdx.x) * 48) + 21))])); compute1[(4)] = (compute1[(4)] + (pad_temp_shared[(58)] * kernel_shared[(((((int)threadIdx.x) * 48) + 18))])); compute1[(4)] = (compute1[(4)] + (pad_temp_shared[(67)] * kernel_shared[(((((int)threadIdx.x) * 48) + 21))])); compute1[(5)] = (compute1[(5)] + (pad_temp_shared[(59)] * kernel_shared[(((((int)threadIdx.x) * 48) + 18))])); compute1[(5)] = (compute1[(5)] + (pad_temp_shared[(68)] * kernel_shared[(((((int)threadIdx.x) * 48) + 21))])); compute1[(6)] = (compute1[(6)] + (pad_temp_shared[(60)] * kernel_shared[(((((int)threadIdx.x) * 48) + 18))])); compute1[(6)] = (compute1[(6)] + (pad_temp_shared[(69)] * kernel_shared[(((((int)threadIdx.x) * 48) + 21))])); compute1[(7)] = (compute1[(7)] + (pad_temp_shared[(54)] * kernel_shared[(((((int)threadIdx.x) * 48) + 42))])); compute1[(7)] = (compute1[(7)] + (pad_temp_shared[(63)] * kernel_shared[(((((int)threadIdx.x) * 48) + 45))])); compute1[(8)] = (compute1[(8)] + (pad_temp_shared[(55)] * kernel_shared[(((((int)threadIdx.x) * 48) + 42))])); compute1[(8)] = (compute1[(8)] + (pad_temp_shared[(64)] * kernel_shared[(((((int)threadIdx.x) * 48) + 45))])); compute1[(9)] = (compute1[(9)] + (pad_temp_shared[(56)] * kernel_shared[(((((int)threadIdx.x) * 48) + 42))])); compute1[(9)] = (compute1[(9)] + (pad_temp_shared[(65)] * kernel_shared[(((((int)threadIdx.x) * 48) + 45))])); compute1[(10)] = (compute1[(10)] + (pad_temp_shared[(57)] * kernel_shared[(((((int)threadIdx.x) * 48) + 42))])); compute1[(10)] = (compute1[(10)] + (pad_temp_shared[(66)] * kernel_shared[(((((int)threadIdx.x) * 48) + 45))])); compute1[(11)] = (compute1[(11)] + (pad_temp_shared[(58)] * kernel_shared[(((((int)threadIdx.x) * 48) + 42))])); compute1[(11)] = (compute1[(11)] + (pad_temp_shared[(67)] * kernel_shared[(((((int)threadIdx.x) * 48) + 45))])); compute1[(12)] = (compute1[(12)] + (pad_temp_shared[(59)] * kernel_shared[(((((int)threadIdx.x) * 48) + 42))])); compute1[(12)] = (compute1[(12)] + (pad_temp_shared[(68)] * kernel_shared[(((((int)threadIdx.x) * 48) + 45))])); compute1[(13)] = (compute1[(13)] + (pad_temp_shared[(60)] * kernel_shared[(((((int)threadIdx.x) * 48) + 42))])); compute1[(13)] = (compute1[(13)] + (pad_temp_shared[(69)] * kernel_shared[(((((int)threadIdx.x) * 48) + 45))])); compute1[(0)] = (compute1[(0)] + (pad_temp_shared[(55)] * kernel_shared[(((((int)threadIdx.x) * 48) + 19))])); compute1[(0)] = (compute1[(0)] + (pad_temp_shared[(64)] * kernel_shared[(((((int)threadIdx.x) * 48) + 22))])); compute1[(1)] = (compute1[(1)] + (pad_temp_shared[(56)] * kernel_shared[(((((int)threadIdx.x) * 48) + 19))])); compute1[(1)] = (compute1[(1)] + (pad_temp_shared[(65)] * kernel_shared[(((((int)threadIdx.x) * 48) + 22))])); compute1[(2)] = (compute1[(2)] + (pad_temp_shared[(57)] * kernel_shared[(((((int)threadIdx.x) * 48) + 19))])); compute1[(2)] = (compute1[(2)] + (pad_temp_shared[(66)] * kernel_shared[(((((int)threadIdx.x) * 48) + 22))])); compute1[(3)] = (compute1[(3)] + (pad_temp_shared[(58)] * kernel_shared[(((((int)threadIdx.x) * 48) + 19))])); compute1[(3)] = (compute1[(3)] + (pad_temp_shared[(67)] * kernel_shared[(((((int)threadIdx.x) * 48) + 22))])); compute1[(4)] = (compute1[(4)] + (pad_temp_shared[(59)] * kernel_shared[(((((int)threadIdx.x) * 48) + 19))])); compute1[(4)] = (compute1[(4)] + (pad_temp_shared[(68)] * kernel_shared[(((((int)threadIdx.x) * 48) + 22))])); compute1[(5)] = (compute1[(5)] + (pad_temp_shared[(60)] * kernel_shared[(((((int)threadIdx.x) * 48) + 19))])); compute1[(5)] = (compute1[(5)] + (pad_temp_shared[(69)] * kernel_shared[(((((int)threadIdx.x) * 48) + 22))])); compute1[(6)] = (compute1[(6)] + (pad_temp_shared[(61)] * kernel_shared[(((((int)threadIdx.x) * 48) + 19))])); compute1[(6)] = (compute1[(6)] + (pad_temp_shared[(70)] * kernel_shared[(((((int)threadIdx.x) * 48) + 22))])); compute1[(7)] = (compute1[(7)] + (pad_temp_shared[(55)] * kernel_shared[(((((int)threadIdx.x) * 48) + 43))])); compute1[(7)] = (compute1[(7)] + (pad_temp_shared[(64)] * kernel_shared[(((((int)threadIdx.x) * 48) + 46))])); compute1[(8)] = (compute1[(8)] + (pad_temp_shared[(56)] * kernel_shared[(((((int)threadIdx.x) * 48) + 43))])); compute1[(8)] = (compute1[(8)] + (pad_temp_shared[(65)] * kernel_shared[(((((int)threadIdx.x) * 48) + 46))])); compute1[(9)] = (compute1[(9)] + (pad_temp_shared[(57)] * kernel_shared[(((((int)threadIdx.x) * 48) + 43))])); compute1[(9)] = (compute1[(9)] + (pad_temp_shared[(66)] * kernel_shared[(((((int)threadIdx.x) * 48) + 46))])); compute1[(10)] = (compute1[(10)] + (pad_temp_shared[(58)] * kernel_shared[(((((int)threadIdx.x) * 48) + 43))])); compute1[(10)] = (compute1[(10)] + (pad_temp_shared[(67)] * kernel_shared[(((((int)threadIdx.x) * 48) + 46))])); compute1[(11)] = (compute1[(11)] + (pad_temp_shared[(59)] * kernel_shared[(((((int)threadIdx.x) * 48) + 43))])); compute1[(11)] = (compute1[(11)] + (pad_temp_shared[(68)] * kernel_shared[(((((int)threadIdx.x) * 48) + 46))])); compute1[(12)] = (compute1[(12)] + (pad_temp_shared[(60)] * kernel_shared[(((((int)threadIdx.x) * 48) + 43))])); compute1[(12)] = (compute1[(12)] + (pad_temp_shared[(69)] * kernel_shared[(((((int)threadIdx.x) * 48) + 46))])); compute1[(13)] = (compute1[(13)] + (pad_temp_shared[(61)] * kernel_shared[(((((int)threadIdx.x) * 48) + 43))])); compute1[(13)] = (compute1[(13)] + (pad_temp_shared[(70)] * kernel_shared[(((((int)threadIdx.x) * 48) + 46))])); compute1[(0)] = (compute1[(0)] + (pad_temp_shared[(56)] * kernel_shared[(((((int)threadIdx.x) * 48) + 20))])); compute1[(0)] = (compute1[(0)] + (pad_temp_shared[(65)] * kernel_shared[(((((int)threadIdx.x) * 48) + 23))])); compute1[(1)] = (compute1[(1)] + (pad_temp_shared[(57)] * kernel_shared[(((((int)threadIdx.x) * 48) + 20))])); compute1[(1)] = (compute1[(1)] + (pad_temp_shared[(66)] * kernel_shared[(((((int)threadIdx.x) * 48) + 23))])); compute1[(2)] = (compute1[(2)] + (pad_temp_shared[(58)] * kernel_shared[(((((int)threadIdx.x) * 48) + 20))])); compute1[(2)] = (compute1[(2)] + (pad_temp_shared[(67)] * kernel_shared[(((((int)threadIdx.x) * 48) + 23))])); compute1[(3)] = (compute1[(3)] + (pad_temp_shared[(59)] * kernel_shared[(((((int)threadIdx.x) * 48) + 20))])); compute1[(3)] = (compute1[(3)] + (pad_temp_shared[(68)] * kernel_shared[(((((int)threadIdx.x) * 48) + 23))])); compute1[(4)] = (compute1[(4)] + (pad_temp_shared[(60)] * kernel_shared[(((((int)threadIdx.x) * 48) + 20))])); compute1[(4)] = (compute1[(4)] + (pad_temp_shared[(69)] * kernel_shared[(((((int)threadIdx.x) * 48) + 23))])); compute1[(5)] = (compute1[(5)] + (pad_temp_shared[(61)] * kernel_shared[(((((int)threadIdx.x) * 48) + 20))])); compute1[(5)] = (compute1[(5)] + (pad_temp_shared[(70)] * kernel_shared[(((((int)threadIdx.x) * 48) + 23))])); compute1[(6)] = (compute1[(6)] + (pad_temp_shared[(62)] * kernel_shared[(((((int)threadIdx.x) * 48) + 20))])); compute1[(6)] = (compute1[(6)] + (pad_temp_shared[(71)] * kernel_shared[(((((int)threadIdx.x) * 48) + 23))])); compute1[(7)] = (compute1[(7)] + (pad_temp_shared[(56)] * kernel_shared[(((((int)threadIdx.x) * 48) + 44))])); compute1[(7)] = (compute1[(7)] + (pad_temp_shared[(65)] * kernel_shared[(((((int)threadIdx.x) * 48) + 47))])); compute1[(8)] = (compute1[(8)] + (pad_temp_shared[(57)] * kernel_shared[(((((int)threadIdx.x) * 48) + 44))])); compute1[(8)] = (compute1[(8)] + (pad_temp_shared[(66)] * kernel_shared[(((((int)threadIdx.x) * 48) + 47))])); compute1[(9)] = (compute1[(9)] + (pad_temp_shared[(58)] * kernel_shared[(((((int)threadIdx.x) * 48) + 44))])); compute1[(9)] = (compute1[(9)] + (pad_temp_shared[(67)] * kernel_shared[(((((int)threadIdx.x) * 48) + 47))])); compute1[(10)] = (compute1[(10)] + (pad_temp_shared[(59)] * kernel_shared[(((((int)threadIdx.x) * 48) + 44))])); compute1[(10)] = (compute1[(10)] + (pad_temp_shared[(68)] * kernel_shared[(((((int)threadIdx.x) * 48) + 47))])); compute1[(11)] = (compute1[(11)] + (pad_temp_shared[(60)] * kernel_shared[(((((int)threadIdx.x) * 48) + 44))])); compute1[(11)] = (compute1[(11)] + (pad_temp_shared[(69)] * kernel_shared[(((((int)threadIdx.x) * 48) + 47))])); compute1[(12)] = (compute1[(12)] + (pad_temp_shared[(61)] * kernel_shared[(((((int)threadIdx.x) * 48) + 44))])); compute1[(12)] = (compute1[(12)] + (pad_temp_shared[(70)] * kernel_shared[(((((int)threadIdx.x) * 48) + 47))])); compute1[(13)] = (compute1[(13)] + (pad_temp_shared[(62)] * kernel_shared[(((((int)threadIdx.x) * 48) + 44))])); compute1[(13)] = (compute1[(13)] + (pad_temp_shared[(71)] * kernel_shared[(((((int)threadIdx.x) * 48) + 47))])); } } for (int i1_inner = 0; i1_inner < 2; ++i1_inner) { for (int i3_inner = 0; i3_inner < 7; ++i3_inner) { compute[(((((((((int)blockIdx.x) / 7) * 6272) + (((int)threadIdx.x) * 98)) + (i1_inner * 49)) + ((((int)blockIdx.x) % 7) * 7)) + i3_inner))] = max((compute1[(((i1_inner * 7) + i3_inner))] + bias[(((((((int)blockIdx.x) / 7) * 128) + (((int)threadIdx.x) * 2)) + i1_inner))]), 0.000000e+00f); } } } A more complicated example is to resume the search. In this case, we need to create the search policy and cost model by ourselves and resume the status of search policy and cost model with the log file. In the example below we resume the status and do more 5 trials. .. code-block:: default def resume_search(task, log_file): print("Resume search:") cost_model = auto_scheduler.XGBModel() cost_model.update_from_file(log_file) search_policy = auto_scheduler.SketchPolicy( task, cost_model, init_search_callbacks=[auto_scheduler.PreloadMeasuredStates(log_file)] ) measure_ctx = auto_scheduler.LocalRPCMeasureContext(min_repeat_ms=300) tune_option = auto_scheduler.TuningOptions( num_measure_trials=5, runner=measure_ctx.runner, measure_callbacks=[auto_scheduler.RecordToFile(log_file)], ) task.tune(tune_option, search_policy=search_policy) # Kill the measurement process del measure_ctx resume_search(task, log_file) .. rst-class:: sphx-glr-script-out Out: .. code-block:: none Resume search: /usr/local/lib/python3.6/dist-packages/xgboost/training.py:17: UserWarning: Old style callback is deprecated. See: https://xgboost.readthedocs.io/en/latest/python/callbacks.html warnings.warn(f'Old style callback is deprecated. See: {link}', UserWarning) Get devices for measurement successfully! .. rst-class:: sphx-glr-timing **Total running time of the script:** ( 2 minutes 13.406 seconds) .. _sphx_glr_download_how_to_tune_with_autoscheduler_tune_conv2d_layer_cuda.py: .. only :: html .. container:: sphx-glr-footer :class: sphx-glr-footer-example .. container:: sphx-glr-download :download:`Download Python source code: tune_conv2d_layer_cuda.py ` .. container:: sphx-glr-download :download:`Download Jupyter notebook: tune_conv2d_layer_cuda.ipynb ` .. only:: html .. rst-class:: sphx-glr-signature `Gallery generated by Sphinx-Gallery `_