.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "how_to/tune_with_autoscheduler/tune_conv2d_layer_cuda.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        Click :ref:`here ` to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_how_to_tune_with_autoscheduler_tune_conv2d_layer_cuda.py:

.. _auto-scheduler-conv-gpu:

Auto-scheduling a Convolution Layer for GPU
===========================================
**Author**: `Lianmin Zheng `_, `Chengfan Jia `_

This is a tutorial on how to use the auto-scheduler for GPUs.

Unlike the template-based :ref:`autotvm `, which relies on
manual templates to define the search space, the auto-scheduler does not require any templates.
Users only need to write the computation declaration, without any schedule commands or templates.
The auto-scheduler can automatically generate a large search space and
find a good schedule in that space.

We use a convolution layer as an example in this tutorial.

Note that this tutorial will not run on Windows or recent versions of macOS. To
get it to run, you will need to wrap the body of this tutorial in an :code:`if
__name__ == "__main__":` block.

.. GENERATED FROM PYTHON SOURCE LINES 39-47

.. code-block:: default


    import os

    import numpy as np
    import tvm
    from tvm import te, auto_scheduler, topi
    from tvm.topi.testing import conv2d_nchw_python


.. GENERATED FROM PYTHON SOURCE LINES 48-53

Define the computation
^^^^^^^^^^^^^^^^^^^^^^
To begin with, let us define the computation of a convolution layer.
The function should return the list of input/output tensors.
From these tensors, the auto-scheduler can get the whole computational graph.

.. GENERATED FROM PYTHON SOURCE LINES 53-65

.. code-block:: default


    @auto_scheduler.register_workload
    def conv2d_layer(N, H, W, CO, CI, KH, KW, stride, padding):
        # Input placeholders in NCHW / OIHW layout
        data = te.placeholder((N, CI, H, W), name="data")
        kernel = te.placeholder((CO, CI, KH, KW), name="kernel")
        bias = te.placeholder((1, CO, 1, 1), name="bias")
        # Convolution followed by bias add and ReLU
        conv = topi.nn.conv2d_nchw(data, kernel, stride, padding, dilation=1, out_dtype="float32")
        out = topi.nn.relu(conv + bias)
        # Return all input tensors followed by the output tensor
        return [data, kernel, bias, out]


.. GENERATED FROM PYTHON SOURCE LINES 66-69

Create the search task
^^^^^^^^^^^^^^^^^^^^^^
We then create a search task for the last convolution layer in ResNet-50.

.. GENERATED FROM PYTHON SOURCE LINES 69-82

.. code-block:: default


    target = tvm.target.Target("cuda")

    # Use the last layer in ResNet-50
    N, H, W, CO, CI, KH, KW, strides, padding = 1, 7, 7, 512, 512, 3, 3, (1, 1), (1, 1)
    task = auto_scheduler.SearchTask(
        func=conv2d_layer, args=(N, H, W, CO, CI, KH, KW, strides, padding), target=target
    )

    # Inspect the computational graph
    print("Computational DAG:")
    print(task.compute_dag)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Computational DAG:
    data = PLACEHOLDER [1, 512, 7, 7]
    pad_temp(i0, i1, i2, i3) = tir.if_then_else(((((i2 >= 1) && (i2 < 8)) && (i3 >= 1)) && (i3 < 8)), data[i0, i1, (i2 - 1), (i3 - 1)], 0f)
    kernel = PLACEHOLDER [512, 512, 3, 3]
    conv2d_nchw(nn, ff, yy, xx) += (pad_temp[nn, rc, (yy + ry), (xx + rx)]*kernel[ff, rc, ry, rx])
    bias = PLACEHOLDER [1, 512, 1, 1]
    T_add(ax0, ax1, ax2, ax3) = (conv2d_nchw[ax0, ax1, ax2, ax3] + bias[ax0, ax1, 0, 0])
    compute(i0, i1, i2, i3) = max(T_add[i0, i1, i2, i3], 0f)


.. GENERATED FROM PYTHON SOURCE LINES 83-100

Next, we set parameters for the auto-scheduler. These parameters
mainly specify how we do the measurement during the search.

* :code:`measure_ctx` launches a separate process for measurement to
  provide isolation. It can protect the main process from GPU crashes
  during measurement and avoid other runtime conflicts.
* :code:`min_repeat_ms` defines the minimum duration of one "repeat" in every measurement.
  This can warm up the GPU, which is necessary to get accurate measurement results.
  Typically, we recommend a value >= 300 ms.
* :code:`num_measure_trials` is the number of measurement trials we can use during the search.
  We only make 10 trials in this tutorial for a fast demonstration. In practice, 1000 is a
  good value for the search to converge. You can do more trials according to your time budget.
* In addition, we use :code:`RecordToFile` to dump measurement records into the file
  :code:`conv2d.json`. The measurement records can be used to query the best result so far,
  resume the search, and do more analyses later.
* See :any:`auto_scheduler.TuningOptions` and
  :any:`auto_scheduler.LocalRPCMeasureContext` for more parameters.

.. GENERATED FROM PYTHON SOURCE LINES 100-110

.. code-block:: default


    log_file = "conv2d.json"
    measure_ctx = auto_scheduler.LocalRPCMeasureContext(min_repeat_ms=300)
    tune_option = auto_scheduler.TuningOptions(
        num_measure_trials=10,  # change this to 1000 to achieve the best performance
        runner=measure_ctx.runner,
        measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
        verbose=2,
    )


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Get devices for measurement successfully!


.. GENERATED FROM PYTHON SOURCE LINES 111-117

Run the search
^^^^^^^^^^^^^^
Now we have all inputs ready. Pretty simple, isn't it?
We can kick off the search and let the auto-scheduler do its magic.
After some measurement trials, we can load the best schedule from the log
file and apply it.

.. GENERATED FROM PYTHON SOURCE LINES 117-126

.. code-block:: default


    # Run auto-tuning (search)
    task.tune(tune_option)
    # Apply the best schedule
    sch, args = task.apply_best(log_file)

    # Kill the measurement process
    del measure_ctx


.. GENERATED FROM PYTHON SOURCE LINES 127-130

We can lower the schedule to see the IR after auto-scheduling.
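
The buffer sizes and thread extents in the IR printed below can be cross-checked
with a little arithmetic. The following is a minimal sketch in plain Python (no TVM
required). The names ``CO_PER_BLOCK`` and ``CI_PER_STEP`` are hypothetical labels
introduced here for illustration, and their values are read off the output of this
particular run; another tuning run will likely choose different tile sizes.

.. code-block:: python

    # Problem size from the search task above (last conv layer of ResNet-50).
    N, H, W, CO, CI, KH, KW = 1, 7, 7, 512, 512, 3, 3
    PAD = 1

    # Assumed tile sizes, inferred from this run's lowered IR:
    # each block produces 4 output channels, and each outer reduction
    # step stages 16 input channels in shared memory.
    CO_PER_BLOCK = 4
    CI_PER_STEP = 16

    padded_h, padded_w = H + 2 * PAD, W + 2 * PAD  # 9 x 9 padded input plane

    num_blocks = CO // CO_PER_BLOCK                       # blockIdx.x extent
    threads_per_block = H * W                             # threadIdx.x extent
    outer_reduction_steps = CI // CI_PER_STEP             # rc.outer.outer extent
    pad_temp_shared = CI_PER_STEP * padded_h * padded_w   # floats in pad_temp.shared
    kernel_shared = CO_PER_BLOCK * CI_PER_STEP * KH * KW  # floats in kernel.shared
    kernel_block_stride = CO_PER_BLOCK * CI * KH * KW     # factor multiplying blockIdx.x

    print(num_blocks, threads_per_block, outer_reduction_steps)  # 128 49 32
    print(pad_temp_shared, kernel_shared, kernel_block_stride)   # 1296 576 18432

Under these assumptions, the values agree with the ``thread_extent`` attributes,
the ``allocate`` sizes, and the ``blockIdx.x*18432`` index term in the printed IR.
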
The auto-scheduler correctly performs optimizations including multi-level tiling,
cooperative fetching, unrolling and operator fusion.

.. GENERATED FROM PYTHON SOURCE LINES 130-134

.. code-block:: default


    print("Lowered TIR:")
    print(tvm.lower(sch, args, simple_mode=True))


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Lowered TIR:
    @main = primfn(data_1: handle, kernel_1: handle, bias_1: handle, compute_1: handle) -> ()
      attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
      buffers = {data: Buffer(data_2: Pointer(float32), float32, [25088], []),
                 kernel: Buffer(kernel_2: Pointer(float32), float32, [2359296], []),
                 bias: Buffer(bias_2: Pointer(float32), float32, [512], []),
                 compute: Buffer(compute_2: Pointer(float32), float32, [25088], [])}
      buffer_map = {data_1: data, kernel_1: kernel, bias_1: bias, compute_1: compute}
      preflattened_buffer_map = {data_1: data_3: Buffer(data_2, float32, [1, 512, 7, 7], []), kernel_1: kernel_3: Buffer(kernel_2, float32, [512, 512, 3, 3], []), bias_1: bias_3: Buffer(bias_2, float32, [1, 512, 1, 1], []), compute_1: compute_3: Buffer(compute_2, float32, [1, 512, 7, 7], [])} {
      attr [IterVar(blockIdx.x: int32, (nullptr), "ThreadIndex", "blockIdx.x")] "thread_extent" = 128;
      allocate(conv2d_nchw: Pointer(local float32), float32, [4]), storage_scope = local;
      allocate(pad_temp.shared: Pointer(shared float32), float32, [1296]), storage_scope = shared;
      allocate(kernel.shared: Pointer(shared float32), float32, [576]), storage_scope = shared;
      attr [IterVar(threadIdx.x: int32, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49 {
        conv2d_nchw_1: Buffer(conv2d_nchw, float32, [4], [], scope="local", align=16)[0] = 0f32
        conv2d_nchw_1[1] = 0f32
        conv2d_nchw_1[2] = 0f32
        conv2d_nchw_1[3] = 0f32
        for (rc.outer.outer: int32, 0, 32) {
          let cse_var_2: int32 = (rc.outer.outer*784)
          let cse_var_1: int32 = (rc.outer.outer*144)
           {
            attr [IterVar(threadIdx.x_1: int32, (nullptr), "ThreadIndex", "threadIdx.x")]
"thread_extent" = 49; pad_temp.shared_1: Buffer(pad_temp.shared, float32, [1296], [], scope="shared")[threadIdx.x_1] = @tir.if_then_else((((9 <= threadIdx.x_1) && (1 <= floormod(threadIdx.x_1, 9))) && (floormod(threadIdx.x_1, 9) < 8)), data[(((cse_var_2 + (floordiv(threadIdx.x_1, 9)*7)) + floormod(threadIdx.x_1, 9)) - 8)], 0f32, dtype=float32) attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49; pad_temp.shared_1[(threadIdx.x_1 + 49)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 49), 81)) && (floormod((threadIdx.x_1 + 49), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 4), 9))) && (floormod((threadIdx.x_1 + 4), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 49), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 49), 81), 9)*7)) + floormod((threadIdx.x_1 + 4), 9)) - 8)], 0f32, dtype=float32) attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49; pad_temp.shared_1[(threadIdx.x_1 + 98)] = @tir.if_then_else(((1 <= floormod((threadIdx.x_1 + 8), 9)) && (floormod((threadIdx.x_1 + 8), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 98), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 17), 81), 9)*7)) + floormod((threadIdx.x_1 + 8), 9)) - 8)], 0f32, dtype=float32) attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49; pad_temp.shared_1[(threadIdx.x_1 + 147)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 66), 81)) && (floormod((threadIdx.x_1 + 66), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 3), 9))) && (floormod((threadIdx.x_1 + 3), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 147), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 66), 81), 9)*7)) + floormod((threadIdx.x_1 + 3), 9)) - 8)], 0f32, dtype=float32) attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49; pad_temp.shared_1[(threadIdx.x_1 + 196)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 34), 81)) && 
(floormod((threadIdx.x_1 + 34), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 7), 9))) && (floormod((threadIdx.x_1 + 7), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 196), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 34), 81), 9)*7)) + floormod((threadIdx.x_1 + 7), 9)) - 8)], 0f32, dtype=float32) attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49; pad_temp.shared_1[(threadIdx.x_1 + 245)] = @tir.if_then_else((((9 <= floormod((threadIdx.x_1 + 2), 81)) && (1 <= floormod((threadIdx.x_1 + 2), 9))) && (floormod((threadIdx.x_1 + 2), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 245), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 2), 81), 9)*7)) + floormod((threadIdx.x_1 + 2), 9)) - 8)], 0f32, dtype=float32) attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49; pad_temp.shared_1[(threadIdx.x_1 + 294)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 51), 81)) && (floormod((threadIdx.x_1 + 51), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 6), 9))) && (floormod((threadIdx.x_1 + 6), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 294), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 51), 81), 9)*7)) + floormod((threadIdx.x_1 + 6), 9)) - 8)], 0f32, dtype=float32) attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49; pad_temp.shared_1[(threadIdx.x_1 + 343)] = @tir.if_then_else(((1 <= floormod((threadIdx.x_1 + 1), 9)) && (floormod((threadIdx.x_1 + 1), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 343), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 19), 81), 9)*7)) + floormod((threadIdx.x_1 + 1), 9)) - 8)], 0f32, dtype=float32) attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49; pad_temp.shared_1[(threadIdx.x_1 + 392)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 68), 81)) && (floormod((threadIdx.x_1 + 68), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 5), 
9))) && (floormod((threadIdx.x_1 + 5), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 392), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 68), 81), 9)*7)) + floormod((threadIdx.x_1 + 5), 9)) - 8)], 0f32, dtype=float32) attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49; pad_temp.shared_1[(threadIdx.x_1 + 441)] = @tir.if_then_else(((((1 <= floormod((floordiv(threadIdx.x_1, 9) + 4), 9)) && (floormod((threadIdx.x_1 + 36), 81) < 72)) && (1 <= floormod(threadIdx.x_1, 9))) && (floormod(threadIdx.x_1, 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 441), 81)*49)) + (floormod((floordiv(threadIdx.x_1, 9) + 4), 9)*7)) + floormod(threadIdx.x_1, 9)) - 8)], 0f32, dtype=float32) attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49; pad_temp.shared_1[(threadIdx.x_1 + 490)] = @tir.if_then_else((((9 <= floormod((threadIdx.x_1 + 4), 81)) && (1 <= floormod((threadIdx.x_1 + 4), 9))) && (floormod((threadIdx.x_1 + 4), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 490), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 4), 81), 9)*7)) + floormod((threadIdx.x_1 + 4), 9)) - 8)], 0f32, dtype=float32) attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49; pad_temp.shared_1[(threadIdx.x_1 + 539)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 53), 81)) && (floormod((threadIdx.x_1 + 53), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 8), 9))) && (floormod((threadIdx.x_1 + 8), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 539), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 53), 81), 9)*7)) + floormod((threadIdx.x_1 + 8), 9)) - 8)], 0f32, dtype=float32) attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49; pad_temp.shared_1[(threadIdx.x_1 + 588)] = @tir.if_then_else(((1 <= floormod((threadIdx.x_1 + 3), 9)) && (floormod((threadIdx.x_1 + 3), 9) < 8)), data[((((cse_var_2 + 
(floordiv((threadIdx.x_1 + 588), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 21), 81), 9)*7)) + floormod((threadIdx.x_1 + 3), 9)) - 8)], 0f32, dtype=float32) attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49; pad_temp.shared_1[(threadIdx.x_1 + 637)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 70), 81)) && (floormod((threadIdx.x_1 + 70), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 7), 9))) && (floormod((threadIdx.x_1 + 7), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 637), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 70), 81), 9)*7)) + floormod((threadIdx.x_1 + 7), 9)) - 8)], 0f32, dtype=float32) attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49; pad_temp.shared_1[(threadIdx.x_1 + 686)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 38), 81)) && (floormod((threadIdx.x_1 + 38), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 2), 9))) && (floormod((threadIdx.x_1 + 2), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 686), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 38), 81), 9)*7)) + floormod((threadIdx.x_1 + 2), 9)) - 8)], 0f32, dtype=float32) attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49; pad_temp.shared_1[(threadIdx.x_1 + 735)] = @tir.if_then_else((((9 <= floormod((threadIdx.x_1 + 6), 81)) && (1 <= floormod((threadIdx.x_1 + 6), 9))) && (floormod((threadIdx.x_1 + 6), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 735), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 6), 81), 9)*7)) + floormod((threadIdx.x_1 + 6), 9)) - 8)], 0f32, dtype=float32) attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49; pad_temp.shared_1[(threadIdx.x_1 + 784)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 55), 81)) && (floormod((threadIdx.x_1 + 55), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 1), 9))) && (floormod((threadIdx.x_1 + 1), 9) < 8)), 
data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 784), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 55), 81), 9)*7)) + floormod((threadIdx.x_1 + 1), 9)) - 8)], 0f32, dtype=float32) attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49; pad_temp.shared_1[(threadIdx.x_1 + 833)] = @tir.if_then_else(((1 <= floormod((threadIdx.x_1 + 5), 9)) && (floormod((threadIdx.x_1 + 5), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 833), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 23), 81), 9)*7)) + floormod((threadIdx.x_1 + 5), 9)) - 8)], 0f32, dtype=float32) attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49; pad_temp.shared_1[(threadIdx.x_1 + 882)] = @tir.if_then_else(((((1 <= floormod((floordiv(threadIdx.x_1, 9) + 8), 9)) && (floormod((threadIdx.x_1 + 72), 81) < 72)) && (1 <= floormod(threadIdx.x_1, 9))) && (floormod(threadIdx.x_1, 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 882), 81)*49)) + (floormod((floordiv(threadIdx.x_1, 9) + 8), 9)*7)) + floormod(threadIdx.x_1, 9)) - 8)], 0f32, dtype=float32) attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49; pad_temp.shared_1[(threadIdx.x_1 + 931)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 40), 81)) && (floormod((threadIdx.x_1 + 40), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 4), 9))) && (floormod((threadIdx.x_1 + 4), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 931), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 40), 81), 9)*7)) + floormod((threadIdx.x_1 + 4), 9)) - 8)], 0f32, dtype=float32) attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49; pad_temp.shared_1[(threadIdx.x_1 + 980)] = @tir.if_then_else((((9 <= floormod((threadIdx.x_1 + 8), 81)) && (1 <= floormod((threadIdx.x_1 + 8), 9))) && (floormod((threadIdx.x_1 + 8), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 980), 81)*49)) + 
(floordiv(floormod((threadIdx.x_1 + 8), 81), 9)*7)) + floormod((threadIdx.x_1 + 8), 9)) - 8)], 0f32, dtype=float32) attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49; pad_temp.shared_1[(threadIdx.x_1 + 1029)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 57), 81)) && (floormod((threadIdx.x_1 + 57), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 3), 9))) && (floormod((threadIdx.x_1 + 3), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 1029), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 57), 81), 9)*7)) + floormod((threadIdx.x_1 + 3), 9)) - 8)], 0f32, dtype=float32) attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49; pad_temp.shared_1[(threadIdx.x_1 + 1078)] = @tir.if_then_else((((threadIdx.x_1 < 47) && (1 <= floormod((threadIdx.x_1 + 7), 9))) && (floormod((threadIdx.x_1 + 7), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 1078), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 25), 81), 9)*7)) + floormod((threadIdx.x_1 + 7), 9)) - 8)], 0f32, dtype=float32) attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49; pad_temp.shared_1[(threadIdx.x_1 + 1127)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 74), 81)) && (floormod((threadIdx.x_1 + 74), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 2), 9))) && (floormod((threadIdx.x_1 + 2), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 1127), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 74), 81), 9)*7)) + floormod((threadIdx.x_1 + 2), 9)) - 8)], 0f32, dtype=float32) attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49; pad_temp.shared_1[(threadIdx.x_1 + 1176)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 42), 81)) && (floormod((threadIdx.x_1 + 42), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 6), 9))) && (floormod((threadIdx.x_1 + 6), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 1176), 81)*49)) + 
(floordiv(floormod((threadIdx.x_1 + 42), 81), 9)*7)) + floormod((threadIdx.x_1 + 6), 9)) - 8)], 0f32, dtype=float32) attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49; pad_temp.shared_1[(threadIdx.x_1 + 1225)] = @tir.if_then_else(((1 <= floormod((threadIdx.x_1 + 1), 9)) && (floormod((threadIdx.x_1 + 1), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 1225), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 10), 81), 9)*7)) + floormod((threadIdx.x_1 + 1), 9)) - 8)], 0f32, dtype=float32) attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49; if @tir.likely((threadIdx.x_1 < 22), dtype=bool) { pad_temp.shared_1[(threadIdx.x_1 + 1274)] = @tir.if_then_else((((threadIdx.x_1 < 13) && (1 <= floormod((threadIdx.x_1 + 5), 9))) && (floormod((threadIdx.x_1 + 5), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 1274), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 59), 81), 9)*7)) + floormod((threadIdx.x_1 + 5), 9)) - 8)], 0f32, dtype=float32) } attr [IterVar(threadIdx.x_2: int32, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49; kernel.shared_1: Buffer(kernel.shared, float32, [576], [], scope="shared")[threadIdx.x_2] = kernel[(((blockIdx.x*18432) + cse_var_1) + threadIdx.x_2)] attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49; kernel.shared_1[(threadIdx.x_2 + 49)] = kernel[((((blockIdx.x*18432) + cse_var_1) + (floordiv((threadIdx.x_2 + 49), 3)*3)) + floormod((threadIdx.x_2 + 1), 3))] attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49; kernel.shared_1[(threadIdx.x_2 + 98)] = kernel[(((((blockIdx.x*18432) + (floordiv((threadIdx.x_2 + 98), 144)*4608)) + cse_var_1) + (floordiv(floormod((threadIdx.x_2 + 98), 144), 3)*3)) + floormod((threadIdx.x_2 + 2), 3))] attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49; kernel.shared_1[(threadIdx.x_2 + 147)] = 
kernel[(((((blockIdx.x*18432) + (floordiv((threadIdx.x_2 + 147), 144)*4608)) + cse_var_1) + ((floordiv(threadIdx.x_2, 3) + 1)*3)) + floormod(threadIdx.x_2, 3))] attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49; kernel.shared_1[(threadIdx.x_2 + 196)] = kernel[(((((blockIdx.x*18432) + (floordiv((threadIdx.x_2 + 196), 144)*4608)) + cse_var_1) + (floordiv(floormod((threadIdx.x_2 + 52), 144), 3)*3)) + floormod((threadIdx.x_2 + 1), 3))] attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49; kernel.shared_1[(threadIdx.x_2 + 245)] = kernel[(((((blockIdx.x*18432) + (floordiv((threadIdx.x_2 + 245), 144)*4608)) + cse_var_1) + (floordiv(floormod((threadIdx.x_2 + 101), 144), 3)*3)) + floormod((threadIdx.x_2 + 2), 3))] attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49; kernel.shared_1[(threadIdx.x_2 + 294)] = kernel[(((((blockIdx.x*18432) + (floordiv((threadIdx.x_2 + 294), 144)*4608)) + cse_var_1) + ((floordiv(threadIdx.x_2, 3) + 2)*3)) + floormod(threadIdx.x_2, 3))] attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49; kernel.shared_1[(threadIdx.x_2 + 343)] = kernel[(((((blockIdx.x*18432) + (floordiv((threadIdx.x_2 + 343), 144)*4608)) + cse_var_1) + (floordiv(floormod((threadIdx.x_2 + 55), 144), 3)*3)) + floormod((threadIdx.x_2 + 1), 3))] attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49; kernel.shared_1[(threadIdx.x_2 + 392)] = kernel[(((((blockIdx.x*18432) + (floordiv((threadIdx.x_2 + 392), 144)*4608)) + cse_var_1) + (floordiv(floormod((threadIdx.x_2 + 104), 144), 3)*3)) + floormod((threadIdx.x_2 + 2), 3))] attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49; kernel.shared_1[(threadIdx.x_2 + 441)] = kernel[(((((blockIdx.x*18432) + (floordiv((threadIdx.x_2 + 441), 144)*4608)) + cse_var_1) + ((floordiv(threadIdx.x_2, 3) + 3)*3)) + 
floormod(threadIdx.x_2, 3))] attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49; kernel.shared_1[(threadIdx.x_2 + 490)] = kernel[(((((blockIdx.x*18432) + (floordiv((threadIdx.x_2 + 490), 144)*4608)) + cse_var_1) + (floordiv(floormod((threadIdx.x_2 + 58), 144), 3)*3)) + floormod((threadIdx.x_2 + 1), 3))] attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49; if @tir.likely((threadIdx.x_2 < 37), dtype=bool) { kernel.shared_1[(threadIdx.x_2 + 539)] = kernel[(((((blockIdx.x*18432) + (floordiv((threadIdx.x_2 + 539), 144)*4608)) + cse_var_1) + (floordiv(floormod((threadIdx.x_2 + 107), 144), 3)*3)) + floormod((threadIdx.x_2 + 2), 3))] } for (rc.outer.inner: int32, 0, 2) { let cse_var_3: int32 = (rc.outer.inner*72) { conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7))]*kernel.shared_1[cse_var_3])) conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 1)]*kernel.shared_1[(cse_var_3 + 1)])) conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 2)]*kernel.shared_1[(cse_var_3 + 2)])) conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 9)]*kernel.shared_1[(cse_var_3 + 3)])) conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 10)]*kernel.shared_1[(cse_var_3 + 4)])) conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 11)]*kernel.shared_1[(cse_var_3 + 5)])) conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) 
+ floormod(threadIdx.x, 7)) + 18)]*kernel.shared_1[(cse_var_3 + 6)])) conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 19)]*kernel.shared_1[(cse_var_3 + 7)])) conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 20)]*kernel.shared_1[(cse_var_3 + 8)])) conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 81)]*kernel.shared_1[(cse_var_3 + 9)])) conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 82)]*kernel.shared_1[(cse_var_3 + 10)])) conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 83)]*kernel.shared_1[(cse_var_3 + 11)])) conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 90)]*kernel.shared_1[(cse_var_3 + 12)])) conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 91)]*kernel.shared_1[(cse_var_3 + 13)])) conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 92)]*kernel.shared_1[(cse_var_3 + 14)])) conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 99)]*kernel.shared_1[(cse_var_3 + 15)])) conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 100)]*kernel.shared_1[(cse_var_3 + 16)])) conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) 
+ (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 101)]*kernel.shared_1[(cse_var_3 + 17)])) conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 162)]*kernel.shared_1[(cse_var_3 + 18)])) conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 163)]*kernel.shared_1[(cse_var_3 + 19)])) conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 164)]*kernel.shared_1[(cse_var_3 + 20)])) conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 171)]*kernel.shared_1[(cse_var_3 + 21)])) conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 172)]*kernel.shared_1[(cse_var_3 + 22)])) conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 173)]*kernel.shared_1[(cse_var_3 + 23)])) conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 180)]*kernel.shared_1[(cse_var_3 + 24)])) conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 181)]*kernel.shared_1[(cse_var_3 + 25)])) conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 182)]*kernel.shared_1[(cse_var_3 + 26)])) conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 243)]*kernel.shared_1[(cse_var_3 + 27)])) conv2d_nchw_1[0] = 
(conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 244)]*kernel.shared_1[(cse_var_3 + 28)])) conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 245)]*kernel.shared_1[(cse_var_3 + 29)])) conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 252)]*kernel.shared_1[(cse_var_3 + 30)])) conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 253)]*kernel.shared_1[(cse_var_3 + 31)])) conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 254)]*kernel.shared_1[(cse_var_3 + 32)])) conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 261)]*kernel.shared_1[(cse_var_3 + 33)])) conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 262)]*kernel.shared_1[(cse_var_3 + 34)])) conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 263)]*kernel.shared_1[(cse_var_3 + 35)])) conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 324)]*kernel.shared_1[(cse_var_3 + 36)])) conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 325)]*kernel.shared_1[(cse_var_3 + 37)])) conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 
326)]*kernel.shared_1[(cse_var_3 + 38)])) conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 333)]*kernel.shared_1[(cse_var_3 + 39)])) conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 334)]*kernel.shared_1[(cse_var_3 + 40)])) conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 335)]*kernel.shared_1[(cse_var_3 + 41)])) conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 342)]*kernel.shared_1[(cse_var_3 + 42)])) conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 343)]*kernel.shared_1[(cse_var_3 + 43)])) conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 344)]*kernel.shared_1[(cse_var_3 + 44)])) conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 405)]*kernel.shared_1[(cse_var_3 + 45)])) conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 406)]*kernel.shared_1[(cse_var_3 + 46)])) conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 407)]*kernel.shared_1[(cse_var_3 + 47)])) conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 414)]*kernel.shared_1[(cse_var_3 + 48)])) conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + 
(floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 415)]*kernel.shared_1[(cse_var_3 + 49)])) conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 416)]*kernel.shared_1[(cse_var_3 + 50)])) conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 423)]*kernel.shared_1[(cse_var_3 + 51)])) conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 424)]*kernel.shared_1[(cse_var_3 + 52)])) conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 425)]*kernel.shared_1[(cse_var_3 + 53)])) conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 486)]*kernel.shared_1[(cse_var_3 + 54)])) conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 487)]*kernel.shared_1[(cse_var_3 + 55)])) conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 488)]*kernel.shared_1[(cse_var_3 + 56)])) conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 495)]*kernel.shared_1[(cse_var_3 + 57)])) conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 496)]*kernel.shared_1[(cse_var_3 + 58)])) conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 497)]*kernel.shared_1[(cse_var_3 + 59)])) conv2d_nchw_1[0] = (conv2d_nchw_1[0] 
+ (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 504)]*kernel.shared_1[(cse_var_3 + 60)])) conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 505)]*kernel.shared_1[(cse_var_3 + 61)])) conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 506)]*kernel.shared_1[(cse_var_3 + 62)])) conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 567)]*kernel.shared_1[(cse_var_3 + 63)])) conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 568)]*kernel.shared_1[(cse_var_3 + 64)])) conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 569)]*kernel.shared_1[(cse_var_3 + 65)])) conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 576)]*kernel.shared_1[(cse_var_3 + 66)])) conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 577)]*kernel.shared_1[(cse_var_3 + 67)])) conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 578)]*kernel.shared_1[(cse_var_3 + 68)])) conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 585)]*kernel.shared_1[(cse_var_3 + 69)])) conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 586)]*kernel.shared_1[(cse_var_3 
+ 70)])) conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 587)]*kernel.shared_1[(cse_var_3 + 71)])) conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7))]*kernel.shared_1[(cse_var_3 + 144)])) conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 1)]*kernel.shared_1[(cse_var_3 + 145)])) conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 2)]*kernel.shared_1[(cse_var_3 + 146)])) conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 9)]*kernel.shared_1[(cse_var_3 + 147)])) conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 10)]*kernel.shared_1[(cse_var_3 + 148)])) conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 11)]*kernel.shared_1[(cse_var_3 + 149)])) conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 18)]*kernel.shared_1[(cse_var_3 + 150)])) conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 19)]*kernel.shared_1[(cse_var_3 + 151)])) conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 20)]*kernel.shared_1[(cse_var_3 + 152)])) conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 
7)) + 81)]*kernel.shared_1[(cse_var_3 + 153)])) conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 82)]*kernel.shared_1[(cse_var_3 + 154)])) conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 83)]*kernel.shared_1[(cse_var_3 + 155)])) conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 90)]*kernel.shared_1[(cse_var_3 + 156)])) conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 91)]*kernel.shared_1[(cse_var_3 + 157)])) conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 92)]*kernel.shared_1[(cse_var_3 + 158)])) conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 99)]*kernel.shared_1[(cse_var_3 + 159)])) conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 100)]*kernel.shared_1[(cse_var_3 + 160)])) conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 101)]*kernel.shared_1[(cse_var_3 + 161)])) conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 162)]*kernel.shared_1[(cse_var_3 + 162)])) conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 163)]*kernel.shared_1[(cse_var_3 + 163)])) conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + 
(floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 164)]*kernel.shared_1[(cse_var_3 + 164)])) conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 171)]*kernel.shared_1[(cse_var_3 + 165)])) conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 172)]*kernel.shared_1[(cse_var_3 + 166)])) conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 173)]*kernel.shared_1[(cse_var_3 + 167)])) conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 180)]*kernel.shared_1[(cse_var_3 + 168)])) conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 181)]*kernel.shared_1[(cse_var_3 + 169)])) conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 182)]*kernel.shared_1[(cse_var_3 + 170)])) conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 243)]*kernel.shared_1[(cse_var_3 + 171)])) conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 244)]*kernel.shared_1[(cse_var_3 + 172)])) conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 245)]*kernel.shared_1[(cse_var_3 + 173)])) conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 252)]*kernel.shared_1[(cse_var_3 + 174)])) conv2d_nchw_1[1] = 
(conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 253)]*kernel.shared_1[(cse_var_3 + 175)])) conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 254)]*kernel.shared_1[(cse_var_3 + 176)])) conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 261)]*kernel.shared_1[(cse_var_3 + 177)])) conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 262)]*kernel.shared_1[(cse_var_3 + 178)])) conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 263)]*kernel.shared_1[(cse_var_3 + 179)])) conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 324)]*kernel.shared_1[(cse_var_3 + 180)])) conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 325)]*kernel.shared_1[(cse_var_3 + 181)])) conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 326)]*kernel.shared_1[(cse_var_3 + 182)])) conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 333)]*kernel.shared_1[(cse_var_3 + 183)])) conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 334)]*kernel.shared_1[(cse_var_3 + 184)])) conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 
335)]*kernel.shared_1[(cse_var_3 + 185)])) conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 342)]*kernel.shared_1[(cse_var_3 + 186)])) conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 343)]*kernel.shared_1[(cse_var_3 + 187)])) conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 344)]*kernel.shared_1[(cse_var_3 + 188)])) conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 405)]*kernel.shared_1[(cse_var_3 + 189)])) conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 406)]*kernel.shared_1[(cse_var_3 + 190)])) conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 407)]*kernel.shared_1[(cse_var_3 + 191)])) conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 414)]*kernel.shared_1[(cse_var_3 + 192)])) conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 415)]*kernel.shared_1[(cse_var_3 + 193)])) conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 416)]*kernel.shared_1[(cse_var_3 + 194)])) conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 423)]*kernel.shared_1[(cse_var_3 + 195)])) conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + 
(floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 424)]*kernel.shared_1[(cse_var_3 + 196)])) conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 425)]*kernel.shared_1[(cse_var_3 + 197)])) conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 486)]*kernel.shared_1[(cse_var_3 + 198)])) conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 487)]*kernel.shared_1[(cse_var_3 + 199)])) conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 488)]*kernel.shared_1[(cse_var_3 + 200)])) conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 495)]*kernel.shared_1[(cse_var_3 + 201)])) conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 496)]*kernel.shared_1[(cse_var_3 + 202)])) conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 497)]*kernel.shared_1[(cse_var_3 + 203)])) conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 504)]*kernel.shared_1[(cse_var_3 + 204)])) conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 505)]*kernel.shared_1[(cse_var_3 + 205)])) conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 506)]*kernel.shared_1[(cse_var_3 + 206)])) conv2d_nchw_1[1] = 
(conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 567)]*kernel.shared_1[(cse_var_3 + 207)])) conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 568)]*kernel.shared_1[(cse_var_3 + 208)])) conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 569)]*kernel.shared_1[(cse_var_3 + 209)])) conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 576)]*kernel.shared_1[(cse_var_3 + 210)])) conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 577)]*kernel.shared_1[(cse_var_3 + 211)])) conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 578)]*kernel.shared_1[(cse_var_3 + 212)])) conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 585)]*kernel.shared_1[(cse_var_3 + 213)])) conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 586)]*kernel.shared_1[(cse_var_3 + 214)])) conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 587)]*kernel.shared_1[(cse_var_3 + 215)])) conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7))]*kernel.shared_1[(cse_var_3 + 288)])) conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 
1)]*kernel.shared_1[(cse_var_3 + 289)])) conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 2)]*kernel.shared_1[(cse_var_3 + 290)])) conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 9)]*kernel.shared_1[(cse_var_3 + 291)])) conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 10)]*kernel.shared_1[(cse_var_3 + 292)])) conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 11)]*kernel.shared_1[(cse_var_3 + 293)])) conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 18)]*kernel.shared_1[(cse_var_3 + 294)])) conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 19)]*kernel.shared_1[(cse_var_3 + 295)])) conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 20)]*kernel.shared_1[(cse_var_3 + 296)])) conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 81)]*kernel.shared_1[(cse_var_3 + 297)])) conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 82)]*kernel.shared_1[(cse_var_3 + 298)])) conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 83)]*kernel.shared_1[(cse_var_3 + 299)])) conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + 
(floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 90)]*kernel.shared_1[(cse_var_3 + 300)])) conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 91)]*kernel.shared_1[(cse_var_3 + 301)])) conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 92)]*kernel.shared_1[(cse_var_3 + 302)])) conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 99)]*kernel.shared_1[(cse_var_3 + 303)])) conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 100)]*kernel.shared_1[(cse_var_3 + 304)])) conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 101)]*kernel.shared_1[(cse_var_3 + 305)])) conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 162)]*kernel.shared_1[(cse_var_3 + 306)])) conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 163)]*kernel.shared_1[(cse_var_3 + 307)])) conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 164)]*kernel.shared_1[(cse_var_3 + 308)])) conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 171)]*kernel.shared_1[(cse_var_3 + 309)])) conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 172)]*kernel.shared_1[(cse_var_3 + 310)])) conv2d_nchw_1[2] = 
(conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 173)]*kernel.shared_1[(cse_var_3 + 311)])) conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 180)]*kernel.shared_1[(cse_var_3 + 312)])) conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 181)]*kernel.shared_1[(cse_var_3 + 313)])) conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 182)]*kernel.shared_1[(cse_var_3 + 314)])) conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 243)]*kernel.shared_1[(cse_var_3 + 315)])) conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 244)]*kernel.shared_1[(cse_var_3 + 316)])) conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 245)]*kernel.shared_1[(cse_var_3 + 317)])) conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 252)]*kernel.shared_1[(cse_var_3 + 318)])) conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 253)]*kernel.shared_1[(cse_var_3 + 319)])) conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 254)]*kernel.shared_1[(cse_var_3 + 320)])) conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 
261)]*kernel.shared_1[(cse_var_3 + 321)])) conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 262)]*kernel.shared_1[(cse_var_3 + 322)])) conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 263)]*kernel.shared_1[(cse_var_3 + 323)])) conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 324)]*kernel.shared_1[(cse_var_3 + 324)])) conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 325)]*kernel.shared_1[(cse_var_3 + 325)])) conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 326)]*kernel.shared_1[(cse_var_3 + 326)])) conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 333)]*kernel.shared_1[(cse_var_3 + 327)])) conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 334)]*kernel.shared_1[(cse_var_3 + 328)])) conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 335)]*kernel.shared_1[(cse_var_3 + 329)])) conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 342)]*kernel.shared_1[(cse_var_3 + 330)])) conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 343)]*kernel.shared_1[(cse_var_3 + 331)])) conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + 
(floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 344)]*kernel.shared_1[(cse_var_3 + 332)])) conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 405)]*kernel.shared_1[(cse_var_3 + 333)])) conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 406)]*kernel.shared_1[(cse_var_3 + 334)])) conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 407)]*kernel.shared_1[(cse_var_3 + 335)])) conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 414)]*kernel.shared_1[(cse_var_3 + 336)])) conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 415)]*kernel.shared_1[(cse_var_3 + 337)])) conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 416)]*kernel.shared_1[(cse_var_3 + 338)])) conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 423)]*kernel.shared_1[(cse_var_3 + 339)])) conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 424)]*kernel.shared_1[(cse_var_3 + 340)])) conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 425)]*kernel.shared_1[(cse_var_3 + 341)])) conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 486)]*kernel.shared_1[(cse_var_3 + 342)])) conv2d_nchw_1[2] = 
(conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 487)]*kernel.shared_1[(cse_var_3 + 343)])) conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 488)]*kernel.shared_1[(cse_var_3 + 344)])) conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 495)]*kernel.shared_1[(cse_var_3 + 345)])) conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 496)]*kernel.shared_1[(cse_var_3 + 346)])) conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 497)]*kernel.shared_1[(cse_var_3 + 347)])) conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 504)]*kernel.shared_1[(cse_var_3 + 348)])) conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 505)]*kernel.shared_1[(cse_var_3 + 349)])) conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 506)]*kernel.shared_1[(cse_var_3 + 350)])) conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 567)]*kernel.shared_1[(cse_var_3 + 351)])) conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 568)]*kernel.shared_1[(cse_var_3 + 352)])) conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 
569)]*kernel.shared_1[(cse_var_3 + 353)])) conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 576)]*kernel.shared_1[(cse_var_3 + 354)])) conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 577)]*kernel.shared_1[(cse_var_3 + 355)])) conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 578)]*kernel.shared_1[(cse_var_3 + 356)])) conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 585)]*kernel.shared_1[(cse_var_3 + 357)])) conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 586)]*kernel.shared_1[(cse_var_3 + 358)])) conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 587)]*kernel.shared_1[(cse_var_3 + 359)])) conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7))]*kernel.shared_1[(cse_var_3 + 432)])) conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 1)]*kernel.shared_1[(cse_var_3 + 433)])) conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 2)]*kernel.shared_1[(cse_var_3 + 434)])) conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 9)]*kernel.shared_1[(cse_var_3 + 435)])) conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + 
(floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 10)]*kernel.shared_1[(cse_var_3 + 436)])) conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 11)]*kernel.shared_1[(cse_var_3 + 437)])) conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 18)]*kernel.shared_1[(cse_var_3 + 438)])) conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 19)]*kernel.shared_1[(cse_var_3 + 439)])) conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 20)]*kernel.shared_1[(cse_var_3 + 440)])) conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 81)]*kernel.shared_1[(cse_var_3 + 441)])) conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 82)]*kernel.shared_1[(cse_var_3 + 442)])) conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 83)]*kernel.shared_1[(cse_var_3 + 443)])) conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 90)]*kernel.shared_1[(cse_var_3 + 444)])) conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 91)]*kernel.shared_1[(cse_var_3 + 445)])) conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 92)]*kernel.shared_1[(cse_var_3 + 446)])) conv2d_nchw_1[3] = (conv2d_nchw_1[3] 
+ (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 99)]*kernel.shared_1[(cse_var_3 + 447)])) conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 100)]*kernel.shared_1[(cse_var_3 + 448)])) conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 101)]*kernel.shared_1[(cse_var_3 + 449)])) conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 162)]*kernel.shared_1[(cse_var_3 + 450)])) conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 163)]*kernel.shared_1[(cse_var_3 + 451)])) conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 164)]*kernel.shared_1[(cse_var_3 + 452)])) conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 171)]*kernel.shared_1[(cse_var_3 + 453)])) conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 172)]*kernel.shared_1[(cse_var_3 + 454)])) conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 173)]*kernel.shared_1[(cse_var_3 + 455)])) conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 180)]*kernel.shared_1[(cse_var_3 + 456)])) conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 
181)]*kernel.shared_1[(cse_var_3 + 457)])) conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 182)]*kernel.shared_1[(cse_var_3 + 458)])) conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 243)]*kernel.shared_1[(cse_var_3 + 459)])) conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 244)]*kernel.shared_1[(cse_var_3 + 460)])) conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 245)]*kernel.shared_1[(cse_var_3 + 461)])) conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 252)]*kernel.shared_1[(cse_var_3 + 462)])) conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 253)]*kernel.shared_1[(cse_var_3 + 463)])) conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 254)]*kernel.shared_1[(cse_var_3 + 464)])) conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 261)]*kernel.shared_1[(cse_var_3 + 465)])) conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 262)]*kernel.shared_1[(cse_var_3 + 466)])) conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 263)]*kernel.shared_1[(cse_var_3 + 467)])) conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + 
(floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 324)]*kernel.shared_1[(cse_var_3 + 468)])) conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 325)]*kernel.shared_1[(cse_var_3 + 469)])) conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 326)]*kernel.shared_1[(cse_var_3 + 470)])) conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 333)]*kernel.shared_1[(cse_var_3 + 471)])) conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 334)]*kernel.shared_1[(cse_var_3 + 472)])) conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 335)]*kernel.shared_1[(cse_var_3 + 473)])) conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 342)]*kernel.shared_1[(cse_var_3 + 474)])) conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 343)]*kernel.shared_1[(cse_var_3 + 475)])) conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 344)]*kernel.shared_1[(cse_var_3 + 476)])) conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 405)]*kernel.shared_1[(cse_var_3 + 477)])) conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 406)]*kernel.shared_1[(cse_var_3 + 478)])) conv2d_nchw_1[3] = 
(conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 407)]*kernel.shared_1[(cse_var_3 + 479)])) conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 414)]*kernel.shared_1[(cse_var_3 + 480)])) conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 415)]*kernel.shared_1[(cse_var_3 + 481)])) conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 416)]*kernel.shared_1[(cse_var_3 + 482)])) conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 423)]*kernel.shared_1[(cse_var_3 + 483)])) conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 424)]*kernel.shared_1[(cse_var_3 + 484)])) conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 425)]*kernel.shared_1[(cse_var_3 + 485)])) conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 486)]*kernel.shared_1[(cse_var_3 + 486)])) conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 487)]*kernel.shared_1[(cse_var_3 + 487)])) conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 488)]*kernel.shared_1[(cse_var_3 + 488)])) conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 
495)]*kernel.shared_1[(cse_var_3 + 489)])) conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 496)]*kernel.shared_1[(cse_var_3 + 490)])) conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 497)]*kernel.shared_1[(cse_var_3 + 491)])) conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 504)]*kernel.shared_1[(cse_var_3 + 492)])) conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 505)]*kernel.shared_1[(cse_var_3 + 493)])) conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 506)]*kernel.shared_1[(cse_var_3 + 494)])) conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 567)]*kernel.shared_1[(cse_var_3 + 495)])) conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 568)]*kernel.shared_1[(cse_var_3 + 496)])) conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 569)]*kernel.shared_1[(cse_var_3 + 497)])) conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 576)]*kernel.shared_1[(cse_var_3 + 498)])) conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 577)]*kernel.shared_1[(cse_var_3 + 499)])) conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + 
(floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 578)]*kernel.shared_1[(cse_var_3 + 500)])) conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 585)]*kernel.shared_1[(cse_var_3 + 501)])) conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 586)]*kernel.shared_1[(cse_var_3 + 502)])) conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*648) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 587)]*kernel.shared_1[(cse_var_3 + 503)])) } } } } for (i1.inner: int32, 0, 4) { compute[(((blockIdx.x*196) + (i1.inner*49)) + threadIdx.x)] = max((conv2d_nchw_1[i1.inner] + bias[((blockIdx.x*4) + i1.inner)]), 0f32) } } }

.. GENERATED FROM PYTHON SOURCE LINES 135-138

Check correctness and evaluate performance
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
We build the binary, check its output against a NumPy reference
implementation, and measure its execution time.

.. GENERATED FROM PYTHON SOURCE LINES 138-165

.. code-block:: default


    # Compile the tuned schedule into a CUDA module
    func = tvm.build(sch, args, target)

    # Check correctness
    # Compute a reference result on the CPU with NumPy
    data_np = np.random.uniform(size=(N, CI, H, W)).astype(np.float32)
    weight_np = np.random.uniform(size=(CO, CI, KH, KW)).astype(np.float32)
    bias_np = np.random.uniform(size=(1, CO, 1, 1)).astype(np.float32)
    conv_np = conv2d_nchw_python(data_np, weight_np, strides, padding)
    out_np = np.maximum(conv_np + bias_np, 0.0)

    # Copy the inputs to the GPU and run the compiled kernel
    dev = tvm.cuda()
    data_tvm = tvm.nd.array(data_np, device=dev)
    weight_tvm = tvm.nd.array(weight_np, device=dev)
    bias_tvm = tvm.nd.array(bias_np, device=dev)
    out_tvm = tvm.nd.empty(out_np.shape, device=dev)
    func(data_tvm, weight_tvm, bias_tvm, out_tvm)

    # Check results
    np.testing.assert_allclose(out_np, out_tvm.numpy(), rtol=1e-3)

    # Evaluate execution time
    evaluator = func.time_evaluator(func.entry_name, dev, min_repeat_ms=500)
    print(
        "Execution time of this operator: %.3f ms"
        % (np.median(evaluator(data_tvm, weight_tvm, bias_tvm, out_tvm).results) * 1000)
    )


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Execution time of this operator: 0.236 ms


.. GENERATED FROM PYTHON SOURCE LINES 166-171

Using the record file
^^^^^^^^^^^^^^^^^^^^^
During the search, all measurement records are written to the record
file "conv2d.json". The measurement records can be used to re-apply search results,
resume the search, and perform other analyses.

.. GENERATED FROM PYTHON SOURCE LINES 173-176

The following example loads the best schedule from the record file and prints
the equivalent Python schedule API calls and the generated CUDA source code.
Both are useful for debugging and for learning how the auto-scheduler behaves.

.. GENERATED FROM PYTHON SOURCE LINES 176-183

.. code-block:: default


    print("Equivalent python schedule:")
    print(task.print_best(log_file, print_mode="schedule"))

    print("CUDA source code:")
    print(task.print_best(log_file, print_mode="cuda"))


.. rst-class:: sphx-glr-script-out

 ..
code-block:: none Equivalent python schedule: pad_temp_i0, pad_temp_i1, pad_temp_i2, pad_temp_i3 = tuple(pad_temp.op.axis) + tuple(pad_temp.op.reduce_axis) conv2d_nchw_nn, conv2d_nchw_ff, conv2d_nchw_yy, conv2d_nchw_xx, conv2d_nchw_rc, conv2d_nchw_ry, conv2d_nchw_rx = tuple(conv2d_nchw.op.axis) + tuple(conv2d_nchw.op.reduce_axis) T_add_ax0, T_add_ax1, T_add_ax2, T_add_ax3 = tuple(T_add.op.axis) + tuple(T_add.op.reduce_axis) compute_i0, compute_i1, compute_i2, compute_i3 = tuple(compute.op.axis) + tuple(compute.op.reduce_axis) s[T_add].compute_inline() conv2d_nchw_nn_o_i, conv2d_nchw_nn_i = s[conv2d_nchw].split(conv2d_nchw_nn, factor=1) conv2d_nchw_nn_o_o_i, conv2d_nchw_nn_o_i = s[conv2d_nchw].split(conv2d_nchw_nn_o_i, factor=1) conv2d_nchw_nn_o_o_o_i, conv2d_nchw_nn_o_o_i = s[conv2d_nchw].split(conv2d_nchw_nn_o_o_i, factor=1) conv2d_nchw_nn_o_o_o_o, conv2d_nchw_nn_o_o_o_i = s[conv2d_nchw].split(conv2d_nchw_nn_o_o_o_i, factor=1) conv2d_nchw_ff_o_i, conv2d_nchw_ff_i = s[conv2d_nchw].split(conv2d_nchw_ff, factor=1) conv2d_nchw_ff_o_o_i, conv2d_nchw_ff_o_i = s[conv2d_nchw].split(conv2d_nchw_ff_o_i, factor=4) conv2d_nchw_ff_o_o_o_i, conv2d_nchw_ff_o_o_i = s[conv2d_nchw].split(conv2d_nchw_ff_o_o_i, factor=1) conv2d_nchw_ff_o_o_o_o, conv2d_nchw_ff_o_o_o_i = s[conv2d_nchw].split(conv2d_nchw_ff_o_o_o_i, factor=1) conv2d_nchw_yy_o_i, conv2d_nchw_yy_i = s[conv2d_nchw].split(conv2d_nchw_yy, factor=1) conv2d_nchw_yy_o_o_i, conv2d_nchw_yy_o_i = s[conv2d_nchw].split(conv2d_nchw_yy_o_i, factor=1) conv2d_nchw_yy_o_o_o_i, conv2d_nchw_yy_o_o_i = s[conv2d_nchw].split(conv2d_nchw_yy_o_o_i, factor=7) conv2d_nchw_yy_o_o_o_o, conv2d_nchw_yy_o_o_o_i = s[conv2d_nchw].split(conv2d_nchw_yy_o_o_o_i, factor=1) conv2d_nchw_xx_o_i, conv2d_nchw_xx_i = s[conv2d_nchw].split(conv2d_nchw_xx, factor=1) conv2d_nchw_xx_o_o_i, conv2d_nchw_xx_o_i = s[conv2d_nchw].split(conv2d_nchw_xx_o_i, factor=1) conv2d_nchw_xx_o_o_o_i, conv2d_nchw_xx_o_o_i = s[conv2d_nchw].split(conv2d_nchw_xx_o_o_i, factor=7) 
conv2d_nchw_xx_o_o_o_o, conv2d_nchw_xx_o_o_o_i = s[conv2d_nchw].split(conv2d_nchw_xx_o_o_o_i, factor=1) conv2d_nchw_rc_o_i, conv2d_nchw_rc_i = s[conv2d_nchw].split(conv2d_nchw_rc, factor=8) conv2d_nchw_rc_o_o, conv2d_nchw_rc_o_i = s[conv2d_nchw].split(conv2d_nchw_rc_o_i, factor=2) conv2d_nchw_ry_o_i, conv2d_nchw_ry_i = s[conv2d_nchw].split(conv2d_nchw_ry, factor=3) conv2d_nchw_ry_o_o, conv2d_nchw_ry_o_i = s[conv2d_nchw].split(conv2d_nchw_ry_o_i, factor=1) conv2d_nchw_rx_o_i, conv2d_nchw_rx_i = s[conv2d_nchw].split(conv2d_nchw_rx, factor=3) conv2d_nchw_rx_o_o, conv2d_nchw_rx_o_i = s[conv2d_nchw].split(conv2d_nchw_rx_o_i, factor=1) s[conv2d_nchw].reorder(conv2d_nchw_nn_o_o_o_o, conv2d_nchw_ff_o_o_o_o, conv2d_nchw_yy_o_o_o_o, conv2d_nchw_xx_o_o_o_o, conv2d_nchw_nn_o_o_o_i, conv2d_nchw_ff_o_o_o_i, conv2d_nchw_yy_o_o_o_i, conv2d_nchw_xx_o_o_o_i, conv2d_nchw_nn_o_o_i, conv2d_nchw_ff_o_o_i, conv2d_nchw_yy_o_o_i, conv2d_nchw_xx_o_o_i, conv2d_nchw_rc_o_o, conv2d_nchw_ry_o_o, conv2d_nchw_rx_o_o, conv2d_nchw_rc_o_i, conv2d_nchw_ry_o_i, conv2d_nchw_rx_o_i, conv2d_nchw_nn_o_i, conv2d_nchw_ff_o_i, conv2d_nchw_yy_o_i, conv2d_nchw_xx_o_i, conv2d_nchw_rc_i, conv2d_nchw_ry_i, conv2d_nchw_rx_i, conv2d_nchw_nn_i, conv2d_nchw_ff_i, conv2d_nchw_yy_i, conv2d_nchw_xx_i) compute_i0_o_i, compute_i0_i = s[compute].split(compute_i0, factor=1) compute_i0_o_o_i, compute_i0_o_i = s[compute].split(compute_i0_o_i, factor=1) compute_i0_o_o_o, compute_i0_o_o_i = s[compute].split(compute_i0_o_o_i, factor=1) compute_i1_o_i, compute_i1_i = s[compute].split(compute_i1, factor=4) compute_i1_o_o_i, compute_i1_o_i = s[compute].split(compute_i1_o_i, factor=1) compute_i1_o_o_o, compute_i1_o_o_i = s[compute].split(compute_i1_o_o_i, factor=1) compute_i2_o_i, compute_i2_i = s[compute].split(compute_i2, factor=1) compute_i2_o_o_i, compute_i2_o_i = s[compute].split(compute_i2_o_i, factor=7) compute_i2_o_o_o, compute_i2_o_o_i = s[compute].split(compute_i2_o_o_i, factor=1) compute_i3_o_i, compute_i3_i = 
s[compute].split(compute_i3, factor=1) compute_i3_o_o_i, compute_i3_o_i = s[compute].split(compute_i3_o_i, factor=7) compute_i3_o_o_o, compute_i3_o_o_i = s[compute].split(compute_i3_o_o_i, factor=1) s[compute].reorder(compute_i0_o_o_o, compute_i1_o_o_o, compute_i2_o_o_o, compute_i3_o_o_o, compute_i0_o_o_i, compute_i1_o_o_i, compute_i2_o_o_i, compute_i3_o_o_i, compute_i0_o_i, compute_i1_o_i, compute_i2_o_i, compute_i3_o_i, compute_i0_i, compute_i1_i, compute_i2_i, compute_i3_i) s[conv2d_nchw].compute_at(s[compute], compute_i3_o_i) kernel_shared = s.cache_read(kernel, "shared", [conv2d_nchw]) kernel_shared_ax0, kernel_shared_ax1, kernel_shared_ax2, kernel_shared_ax3 = tuple(kernel_shared.op.axis) s[kernel_shared].compute_at(s[conv2d_nchw], conv2d_nchw_rx_o_o) pad_temp_shared = s.cache_read(pad_temp, "shared", [conv2d_nchw]) pad_temp_shared_ax0, pad_temp_shared_ax1, pad_temp_shared_ax2, pad_temp_shared_ax3 = tuple(pad_temp_shared.op.axis) s[pad_temp_shared].compute_at(s[conv2d_nchw], conv2d_nchw_rx_o_o) s[pad_temp].compute_inline() compute_i0_o_o_o_i1_o_o_o_fused_i2_o_o_o_fused_i3_o_o_o_fused = s[compute].fuse(compute_i0_o_o_o, compute_i1_o_o_o, compute_i2_o_o_o, compute_i3_o_o_o) s[compute].bind(compute_i0_o_o_o_i1_o_o_o_fused_i2_o_o_o_fused_i3_o_o_o_fused, te.thread_axis("blockIdx.x")) compute_i0_o_o_i_i1_o_o_i_fused_i2_o_o_i_fused_i3_o_o_i_fused = s[compute].fuse(compute_i0_o_o_i, compute_i1_o_o_i, compute_i2_o_o_i, compute_i3_o_o_i) s[compute].bind(compute_i0_o_o_i_i1_o_o_i_fused_i2_o_o_i_fused_i3_o_o_i_fused, te.thread_axis("vthread")) compute_i0_o_i_i1_o_i_fused_i2_o_i_fused_i3_o_i_fused = s[compute].fuse(compute_i0_o_i, compute_i1_o_i, compute_i2_o_i, compute_i3_o_i) s[compute].bind(compute_i0_o_i_i1_o_i_fused_i2_o_i_fused_i3_o_i_fused, te.thread_axis("threadIdx.x")) kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused = s[kernel_shared].fuse(kernel_shared_ax0, kernel_shared_ax1, kernel_shared_ax2, kernel_shared_ax3) 
kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_i = s[kernel_shared].split(kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused, factor=1) s[kernel_shared].vectorize(kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_i) kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_o, kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i = s[kernel_shared].split(kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, factor=49) s[kernel_shared].bind(kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i, te.thread_axis("threadIdx.x")) pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused = s[pad_temp_shared].fuse(pad_temp_shared_ax0, pad_temp_shared_ax1, pad_temp_shared_ax2, pad_temp_shared_ax3) pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_i = s[pad_temp_shared].split(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused, factor=1) s[pad_temp_shared].vectorize(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_i) pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_o, pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i = s[pad_temp_shared].split(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, factor=49) s[pad_temp_shared].bind(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i, te.thread_axis("threadIdx.x")) s[conv2d_nchw].pragma(conv2d_nchw_nn_o_o_o_o, "auto_unroll_max_step", 512) s[conv2d_nchw].pragma(conv2d_nchw_nn_o_o_o_o, "unroll_explicit", True) CUDA source code: #ifdef _WIN32 using uint = unsigned int; using uchar = unsigned char; using ushort = unsigned short; using int64_t = long long; using uint64_t = unsigned long long; #else #define uint unsigned int #define uchar unsigned char #define ushort unsigned short #define int64_t long long #define uint64_t unsigned long long #endif extern "C" __global__ void __launch_bounds__(49) default_function_kernel0(float* __restrict__ data, float* __restrict__ kernel, float* __restrict__ compute, float* __restrict__ bias) { float 
conv2d_nchw[4]; __shared__ float pad_temp_shared[1296]; __shared__ float kernel_shared[576]; conv2d_nchw[0] = 0.000000e+00f; conv2d_nchw[1] = 0.000000e+00f; conv2d_nchw[2] = 0.000000e+00f; conv2d_nchw[3] = 0.000000e+00f; for (int rc_outer_outer = 0; rc_outer_outer < 32; ++rc_outer_outer) { __syncthreads(); pad_temp_shared[((int)threadIdx.x)] = ((((9 <= ((int)threadIdx.x)) && (1 <= (((int)threadIdx.x) % 9))) && ((((int)threadIdx.x) % 9) < 8)) ? data[((((rc_outer_outer * 784) + ((((int)threadIdx.x) / 9) * 7)) + (((int)threadIdx.x) % 9)) - 8)] : 0.000000e+00f); pad_temp_shared[(((int)threadIdx.x) + 49)] = (((((9 <= ((((int)threadIdx.x) + 49) % 81)) && (((((int)threadIdx.x) + 49) % 81) < 72)) && (1 <= ((((int)threadIdx.x) + 4) % 9))) && (((((int)threadIdx.x) + 4) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 49) / 81) * 49)) + ((((((int)threadIdx.x) + 49) % 81) / 9) * 7)) + ((((int)threadIdx.x) + 4) % 9)) - 8)] : 0.000000e+00f); pad_temp_shared[(((int)threadIdx.x) + 98)] = (((1 <= ((((int)threadIdx.x) + 8) % 9)) && (((((int)threadIdx.x) + 8) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 98) / 81) * 49)) + (((((int)threadIdx.x) + 17) / 9) * 7)) + ((((int)threadIdx.x) + 8) % 9)) - 8)] : 0.000000e+00f); pad_temp_shared[(((int)threadIdx.x) + 147)] = (((((9 <= ((((int)threadIdx.x) + 66) % 81)) && (((((int)threadIdx.x) + 66) % 81) < 72)) && (1 <= ((((int)threadIdx.x) + 3) % 9))) && (((((int)threadIdx.x) + 3) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 147) / 81) * 49)) + ((((((int)threadIdx.x) + 66) % 81) / 9) * 7)) + ((((int)threadIdx.x) + 3) % 9)) - 8)] : 0.000000e+00f); pad_temp_shared[(((int)threadIdx.x) + 196)] = (((((9 <= ((((int)threadIdx.x) + 34) % 81)) && (((((int)threadIdx.x) + 34) % 81) < 72)) && (1 <= ((((int)threadIdx.x) + 7) % 9))) && (((((int)threadIdx.x) + 7) % 9) < 8)) ? 
data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 196) / 81) * 49)) + ((((((int)threadIdx.x) + 34) % 81) / 9) * 7)) + ((((int)threadIdx.x) + 7) % 9)) - 8)] : 0.000000e+00f); pad_temp_shared[(((int)threadIdx.x) + 245)] = ((((7 <= ((int)threadIdx.x)) && (1 <= ((((int)threadIdx.x) + 2) % 9))) && (((((int)threadIdx.x) + 2) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 245) / 81) * 49)) + (((((int)threadIdx.x) + 2) / 9) * 7)) + ((((int)threadIdx.x) + 2) % 9)) - 8)] : 0.000000e+00f); pad_temp_shared[(((int)threadIdx.x) + 294)] = (((((9 <= ((((int)threadIdx.x) + 51) % 81)) && (((((int)threadIdx.x) + 51) % 81) < 72)) && (1 <= ((((int)threadIdx.x) + 6) % 9))) && (((((int)threadIdx.x) + 6) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 294) / 81) * 49)) + ((((((int)threadIdx.x) + 51) % 81) / 9) * 7)) + ((((int)threadIdx.x) + 6) % 9)) - 8)] : 0.000000e+00f); pad_temp_shared[(((int)threadIdx.x) + 343)] = (((1 <= ((((int)threadIdx.x) + 1) % 9)) && (((((int)threadIdx.x) + 1) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 343) / 81) * 49)) + (((((int)threadIdx.x) + 19) / 9) * 7)) + ((((int)threadIdx.x) + 1) % 9)) - 8)] : 0.000000e+00f); pad_temp_shared[(((int)threadIdx.x) + 392)] = (((((9 <= ((((int)threadIdx.x) + 68) % 81)) && (((((int)threadIdx.x) + 68) % 81) < 72)) && (1 <= ((((int)threadIdx.x) + 5) % 9))) && (((((int)threadIdx.x) + 5) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 392) / 81) * 49)) + ((((((int)threadIdx.x) + 68) % 81) / 9) * 7)) + ((((int)threadIdx.x) + 5) % 9)) - 8)] : 0.000000e+00f); pad_temp_shared[(((int)threadIdx.x) + 441)] = (((((1 <= (((((int)threadIdx.x) / 9) + 4) % 9)) && (((((int)threadIdx.x) + 36) % 81) < 72)) && (1 <= (((int)threadIdx.x) % 9))) && ((((int)threadIdx.x) % 9) < 8)) ? 
data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 441) / 81) * 49)) + ((((((int)threadIdx.x) / 9) + 4) % 9) * 7)) + (((int)threadIdx.x) % 9)) - 8)] : 0.000000e+00f); pad_temp_shared[(((int)threadIdx.x) + 490)] = ((((5 <= ((int)threadIdx.x)) && (1 <= ((((int)threadIdx.x) + 4) % 9))) && (((((int)threadIdx.x) + 4) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 490) / 81) * 49)) + (((((int)threadIdx.x) + 4) / 9) * 7)) + ((((int)threadIdx.x) + 4) % 9)) - 8)] : 0.000000e+00f); pad_temp_shared[(((int)threadIdx.x) + 539)] = (((((9 <= ((((int)threadIdx.x) + 53) % 81)) && (((((int)threadIdx.x) + 53) % 81) < 72)) && (1 <= ((((int)threadIdx.x) + 8) % 9))) && (((((int)threadIdx.x) + 8) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 539) / 81) * 49)) + ((((((int)threadIdx.x) + 53) % 81) / 9) * 7)) + ((((int)threadIdx.x) + 8) % 9)) - 8)] : 0.000000e+00f); pad_temp_shared[(((int)threadIdx.x) + 588)] = (((1 <= ((((int)threadIdx.x) + 3) % 9)) && (((((int)threadIdx.x) + 3) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 588) / 81) * 49)) + (((((int)threadIdx.x) + 21) / 9) * 7)) + ((((int)threadIdx.x) + 3) % 9)) - 8)] : 0.000000e+00f); pad_temp_shared[(((int)threadIdx.x) + 637)] = (((((9 <= ((((int)threadIdx.x) + 70) % 81)) && (((((int)threadIdx.x) + 70) % 81) < 72)) && (1 <= ((((int)threadIdx.x) + 7) % 9))) && (((((int)threadIdx.x) + 7) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 637) / 81) * 49)) + ((((((int)threadIdx.x) + 70) % 81) / 9) * 7)) + ((((int)threadIdx.x) + 7) % 9)) - 8)] : 0.000000e+00f); pad_temp_shared[(((int)threadIdx.x) + 686)] = (((((9 <= ((((int)threadIdx.x) + 38) % 81)) && (((((int)threadIdx.x) + 38) % 81) < 72)) && (1 <= ((((int)threadIdx.x) + 2) % 9))) && (((((int)threadIdx.x) + 2) % 9) < 8)) ? 
data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 686) / 81) * 49)) + ((((((int)threadIdx.x) + 38) % 81) / 9) * 7)) + ((((int)threadIdx.x) + 2) % 9)) - 8)] : 0.000000e+00f); pad_temp_shared[(((int)threadIdx.x) + 735)] = ((((3 <= ((int)threadIdx.x)) && (1 <= ((((int)threadIdx.x) + 6) % 9))) && (((((int)threadIdx.x) + 6) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 735) / 81) * 49)) + (((((int)threadIdx.x) + 6) / 9) * 7)) + ((((int)threadIdx.x) + 6) % 9)) - 8)] : 0.000000e+00f); pad_temp_shared[(((int)threadIdx.x) + 784)] = (((((9 <= ((((int)threadIdx.x) + 55) % 81)) && (((((int)threadIdx.x) + 55) % 81) < 72)) && (1 <= ((((int)threadIdx.x) + 1) % 9))) && (((((int)threadIdx.x) + 1) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 784) / 81) * 49)) + ((((((int)threadIdx.x) + 55) % 81) / 9) * 7)) + ((((int)threadIdx.x) + 1) % 9)) - 8)] : 0.000000e+00f); pad_temp_shared[(((int)threadIdx.x) + 833)] = (((1 <= ((((int)threadIdx.x) + 5) % 9)) && (((((int)threadIdx.x) + 5) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 833) / 81) * 49)) + (((((int)threadIdx.x) + 23) / 9) * 7)) + ((((int)threadIdx.x) + 5) % 9)) - 8)] : 0.000000e+00f); pad_temp_shared[(((int)threadIdx.x) + 882)] = (((((1 <= (((((int)threadIdx.x) / 9) + 8) % 9)) && (((((int)threadIdx.x) + 72) % 81) < 72)) && (1 <= (((int)threadIdx.x) % 9))) && ((((int)threadIdx.x) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 882) / 81) * 49)) + ((((((int)threadIdx.x) / 9) + 8) % 9) * 7)) + (((int)threadIdx.x) % 9)) - 8)] : 0.000000e+00f); pad_temp_shared[(((int)threadIdx.x) + 931)] = (((((9 <= ((((int)threadIdx.x) + 40) % 81)) && (((((int)threadIdx.x) + 40) % 81) < 72)) && (1 <= ((((int)threadIdx.x) + 4) % 9))) && (((((int)threadIdx.x) + 4) % 9) < 8)) ? 
data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 931) / 81) * 49)) + ((((((int)threadIdx.x) + 40) % 81) / 9) * 7)) + ((((int)threadIdx.x) + 4) % 9)) - 8)] : 0.000000e+00f); pad_temp_shared[(((int)threadIdx.x) + 980)] = ((((1 <= ((int)threadIdx.x)) && (1 <= ((((int)threadIdx.x) + 8) % 9))) && (((((int)threadIdx.x) + 8) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 980) / 81) * 49)) + (((((int)threadIdx.x) + 8) / 9) * 7)) + ((((int)threadIdx.x) + 8) % 9)) - 8)] : 0.000000e+00f); pad_temp_shared[(((int)threadIdx.x) + 1029)] = (((((9 <= ((((int)threadIdx.x) + 57) % 81)) && (((((int)threadIdx.x) + 57) % 81) < 72)) && (1 <= ((((int)threadIdx.x) + 3) % 9))) && (((((int)threadIdx.x) + 3) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 1029) / 81) * 49)) + ((((((int)threadIdx.x) + 57) % 81) / 9) * 7)) + ((((int)threadIdx.x) + 3) % 9)) - 8)] : 0.000000e+00f); pad_temp_shared[(((int)threadIdx.x) + 1078)] = ((((((int)threadIdx.x) < 47) && (1 <= ((((int)threadIdx.x) + 7) % 9))) && (((((int)threadIdx.x) + 7) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 1078) / 81) * 49)) + (((((int)threadIdx.x) + 25) / 9) * 7)) + ((((int)threadIdx.x) + 7) % 9)) - 8)] : 0.000000e+00f); pad_temp_shared[(((int)threadIdx.x) + 1127)] = (((((9 <= ((((int)threadIdx.x) + 74) % 81)) && (((((int)threadIdx.x) + 74) % 81) < 72)) && (1 <= ((((int)threadIdx.x) + 2) % 9))) && (((((int)threadIdx.x) + 2) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 1127) / 81) * 49)) + ((((((int)threadIdx.x) + 74) % 81) / 9) * 7)) + ((((int)threadIdx.x) + 2) % 9)) - 8)] : 0.000000e+00f); pad_temp_shared[(((int)threadIdx.x) + 1176)] = (((((9 <= ((((int)threadIdx.x) + 42) % 81)) && (((((int)threadIdx.x) + 42) % 81) < 72)) && (1 <= ((((int)threadIdx.x) + 6) % 9))) && (((((int)threadIdx.x) + 6) % 9) < 8)) ? 
data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 1176) / 81) * 49)) + ((((((int)threadIdx.x) + 42) % 81) / 9) * 7)) + ((((int)threadIdx.x) + 6) % 9)) - 8)] : 0.000000e+00f); pad_temp_shared[(((int)threadIdx.x) + 1225)] = (((1 <= ((((int)threadIdx.x) + 1) % 9)) && (((((int)threadIdx.x) + 1) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 1225) / 81) * 49)) + (((((int)threadIdx.x) + 10) / 9) * 7)) + ((((int)threadIdx.x) + 1) % 9)) - 8)] : 0.000000e+00f); if (((int)threadIdx.x) < 22) { pad_temp_shared[(((int)threadIdx.x) + 1274)] = ((((((int)threadIdx.x) < 13) && (1 <= ((((int)threadIdx.x) + 5) % 9))) && (((((int)threadIdx.x) + 5) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 1274) / 81) * 49)) + (((((int)threadIdx.x) + 59) / 9) * 7)) + ((((int)threadIdx.x) + 5) % 9)) - 8)] : 0.000000e+00f); } kernel_shared[((int)threadIdx.x)] = kernel[(((((int)blockIdx.x) * 18432) + (rc_outer_outer * 144)) + ((int)threadIdx.x))]; kernel_shared[(((int)threadIdx.x) + 49)] = kernel[((((((int)blockIdx.x) * 18432) + (rc_outer_outer * 144)) + (((((int)threadIdx.x) + 49) / 3) * 3)) + ((((int)threadIdx.x) + 1) % 3))]; kernel_shared[(((int)threadIdx.x) + 98)] = kernel[(((((((int)blockIdx.x) * 18432) + (((((int)threadIdx.x) + 98) / 144) * 4608)) + (rc_outer_outer * 144)) + ((((((int)threadIdx.x) + 98) % 144) / 3) * 3)) + ((((int)threadIdx.x) + 2) % 3))]; kernel_shared[(((int)threadIdx.x) + 147)] = kernel[(((((((int)blockIdx.x) * 18432) + (((((int)threadIdx.x) + 147) / 144) * 4608)) + (rc_outer_outer * 144)) + ((int)threadIdx.x)) + 3)]; kernel_shared[(((int)threadIdx.x) + 196)] = kernel[(((((((int)blockIdx.x) * 18432) + (((((int)threadIdx.x) + 196) / 144) * 4608)) + (rc_outer_outer * 144)) + (((((int)threadIdx.x) + 52) / 3) * 3)) + ((((int)threadIdx.x) + 1) % 3))]; kernel_shared[(((int)threadIdx.x) + 245)] = kernel[(((((((int)blockIdx.x) * 18432) + (((((int)threadIdx.x) + 245) / 144) * 4608)) + (rc_outer_outer * 144)) + 
((((((int)threadIdx.x) + 101) % 144) / 3) * 3)) + ((((int)threadIdx.x) + 2) % 3))]; kernel_shared[(((int)threadIdx.x) + 294)] = kernel[(((((((int)blockIdx.x) * 18432) + (((((int)threadIdx.x) + 294) / 144) * 4608)) + (rc_outer_outer * 144)) + ((int)threadIdx.x)) + 6)]; kernel_shared[(((int)threadIdx.x) + 343)] = kernel[(((((((int)blockIdx.x) * 18432) + (((((int)threadIdx.x) + 343) / 144) * 4608)) + (rc_outer_outer * 144)) + (((((int)threadIdx.x) + 55) / 3) * 3)) + ((((int)threadIdx.x) + 1) % 3))]; kernel_shared[(((int)threadIdx.x) + 392)] = kernel[(((((((int)blockIdx.x) * 18432) + (((((int)threadIdx.x) + 392) / 144) * 4608)) + (rc_outer_outer * 144)) + ((((((int)threadIdx.x) + 104) % 144) / 3) * 3)) + ((((int)threadIdx.x) + 2) % 3))]; kernel_shared[(((int)threadIdx.x) + 441)] = kernel[(((((((int)blockIdx.x) * 18432) + (((((int)threadIdx.x) + 441) / 144) * 4608)) + (rc_outer_outer * 144)) + ((int)threadIdx.x)) + 9)]; kernel_shared[(((int)threadIdx.x) + 490)] = kernel[(((((((int)blockIdx.x) * 18432) + (((((int)threadIdx.x) + 490) / 144) * 4608)) + (rc_outer_outer * 144)) + (((((int)threadIdx.x) + 58) / 3) * 3)) + ((((int)threadIdx.x) + 1) % 3))]; if (((int)threadIdx.x) < 37) { kernel_shared[(((int)threadIdx.x) + 539)] = kernel[(((((((int)blockIdx.x) * 18432) + (((((int)threadIdx.x) + 539) / 144) * 4608)) + (rc_outer_outer * 144)) + (((((int)threadIdx.x) + 107) / 3) * 3)) + ((((int)threadIdx.x) + 2) % 3))]; } __syncthreads(); for (int rc_outer_inner = 0; rc_outer_inner < 2; ++rc_outer_inner) { conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7))] * kernel_shared[(rc_outer_inner * 72)])); conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 1)] * kernel_shared[((rc_outer_inner * 72) + 1)])); conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 
9)) + (((int)threadIdx.x) % 7)) + 2)] * kernel_shared[((rc_outer_inner * 72) + 2)])); conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 9)] * kernel_shared[((rc_outer_inner * 72) + 3)])); conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 10)] * kernel_shared[((rc_outer_inner * 72) + 4)])); conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 11)] * kernel_shared[((rc_outer_inner * 72) + 5)])); conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 18)] * kernel_shared[((rc_outer_inner * 72) + 6)])); conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 19)] * kernel_shared[((rc_outer_inner * 72) + 7)])); conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 20)] * kernel_shared[((rc_outer_inner * 72) + 8)])); conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 81)] * kernel_shared[((rc_outer_inner * 72) + 9)])); conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 82)] * kernel_shared[((rc_outer_inner * 72) + 10)])); conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 83)] * kernel_shared[((rc_outer_inner * 72) + 11)])); conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 90)] * 
kernel_shared[((rc_outer_inner * 72) + 12)])); conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 91)] * kernel_shared[((rc_outer_inner * 72) + 13)])); conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 92)] * kernel_shared[((rc_outer_inner * 72) + 14)])); conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 99)] * kernel_shared[((rc_outer_inner * 72) + 15)])); conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 100)] * kernel_shared[((rc_outer_inner * 72) + 16)])); conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 101)] * kernel_shared[((rc_outer_inner * 72) + 17)])); conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 162)] * kernel_shared[((rc_outer_inner * 72) + 18)])); conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 163)] * kernel_shared[((rc_outer_inner * 72) + 19)])); conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 164)] * kernel_shared[((rc_outer_inner * 72) + 20)])); conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 171)] * kernel_shared[((rc_outer_inner * 72) + 21)])); conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 172)] * kernel_shared[((rc_outer_inner * 
72) + 22)])); conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 173)] * kernel_shared[((rc_outer_inner * 72) + 23)])); conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 180)] * kernel_shared[((rc_outer_inner * 72) + 24)])); conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 181)] * kernel_shared[((rc_outer_inner * 72) + 25)])); conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 182)] * kernel_shared[((rc_outer_inner * 72) + 26)])); conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 243)] * kernel_shared[((rc_outer_inner * 72) + 27)])); conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 244)] * kernel_shared[((rc_outer_inner * 72) + 28)])); conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 245)] * kernel_shared[((rc_outer_inner * 72) + 29)])); conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 252)] * kernel_shared[((rc_outer_inner * 72) + 30)])); conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 253)] * kernel_shared[((rc_outer_inner * 72) + 31)])); conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 254)] * kernel_shared[((rc_outer_inner * 72) + 32)])); conv2d_nchw[0] 
= (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 261)] * kernel_shared[((rc_outer_inner * 72) + 33)])); conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 262)] * kernel_shared[((rc_outer_inner * 72) + 34)])); conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 263)] * kernel_shared[((rc_outer_inner * 72) + 35)])); conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 324)] * kernel_shared[((rc_outer_inner * 72) + 36)])); conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 325)] * kernel_shared[((rc_outer_inner * 72) + 37)])); conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 326)] * kernel_shared[((rc_outer_inner * 72) + 38)])); conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 333)] * kernel_shared[((rc_outer_inner * 72) + 39)])); conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 334)] * kernel_shared[((rc_outer_inner * 72) + 40)])); conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 335)] * kernel_shared[((rc_outer_inner * 72) + 41)])); conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 342)] * kernel_shared[((rc_outer_inner * 72) + 42)])); conv2d_nchw[0] = (conv2d_nchw[0] + 
(pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 343)] * kernel_shared[((rc_outer_inner * 72) + 43)])); conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 344)] * kernel_shared[((rc_outer_inner * 72) + 44)])); conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 405)] * kernel_shared[((rc_outer_inner * 72) + 45)])); conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 406)] * kernel_shared[((rc_outer_inner * 72) + 46)])); conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 407)] * kernel_shared[((rc_outer_inner * 72) + 47)])); conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 414)] * kernel_shared[((rc_outer_inner * 72) + 48)])); conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 415)] * kernel_shared[((rc_outer_inner * 72) + 49)])); conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 416)] * kernel_shared[((rc_outer_inner * 72) + 50)])); conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 423)] * kernel_shared[((rc_outer_inner * 72) + 51)])); conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 424)] * kernel_shared[((rc_outer_inner * 72) + 52)])); conv2d_nchw[0] = (conv2d_nchw[0] + 
(pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 425)] * kernel_shared[((rc_outer_inner * 72) + 53)])); conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 486)] * kernel_shared[((rc_outer_inner * 72) + 54)])); conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 487)] * kernel_shared[((rc_outer_inner * 72) + 55)])); conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 488)] * kernel_shared[((rc_outer_inner * 72) + 56)])); conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 495)] * kernel_shared[((rc_outer_inner * 72) + 57)])); conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 496)] * kernel_shared[((rc_outer_inner * 72) + 58)])); conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 497)] * kernel_shared[((rc_outer_inner * 72) + 59)])); conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 504)] * kernel_shared[((rc_outer_inner * 72) + 60)])); conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 505)] * kernel_shared[((rc_outer_inner * 72) + 61)])); conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 506)] * kernel_shared[((rc_outer_inner * 72) + 62)])); conv2d_nchw[0] = (conv2d_nchw[0] + 
(pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 567)] * kernel_shared[((rc_outer_inner * 72) + 63)])); conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 568)] * kernel_shared[((rc_outer_inner * 72) + 64)])); conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 569)] * kernel_shared[((rc_outer_inner * 72) + 65)])); conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 576)] * kernel_shared[((rc_outer_inner * 72) + 66)])); conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 577)] * kernel_shared[((rc_outer_inner * 72) + 67)])); conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 578)] * kernel_shared[((rc_outer_inner * 72) + 68)])); conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 585)] * kernel_shared[((rc_outer_inner * 72) + 69)])); conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 586)] * kernel_shared[((rc_outer_inner * 72) + 70)])); conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 587)] * kernel_shared[((rc_outer_inner * 72) + 71)])); conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7))] * kernel_shared[((rc_outer_inner * 72) + 144)])); conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner 
* 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 1)] * kernel_shared[((rc_outer_inner * 72) + 145)])); conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 2)] * kernel_shared[((rc_outer_inner * 72) + 146)])); conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 9)] * kernel_shared[((rc_outer_inner * 72) + 147)])); conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 10)] * kernel_shared[((rc_outer_inner * 72) + 148)])); conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 11)] * kernel_shared[((rc_outer_inner * 72) + 149)])); conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 18)] * kernel_shared[((rc_outer_inner * 72) + 150)])); conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 19)] * kernel_shared[((rc_outer_inner * 72) + 151)])); conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 20)] * kernel_shared[((rc_outer_inner * 72) + 152)])); conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 81)] * kernel_shared[((rc_outer_inner * 72) + 153)])); conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 82)] * kernel_shared[((rc_outer_inner * 72) + 154)])); conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 
7) * 9)) + (((int)threadIdx.x) % 7)) + 83)] * kernel_shared[((rc_outer_inner * 72) + 155)])); conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 90)] * kernel_shared[((rc_outer_inner * 72) + 156)])); conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 91)] * kernel_shared[((rc_outer_inner * 72) + 157)])); conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 92)] * kernel_shared[((rc_outer_inner * 72) + 158)])); conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 99)] * kernel_shared[((rc_outer_inner * 72) + 159)])); conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 100)] * kernel_shared[((rc_outer_inner * 72) + 160)])); conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 101)] * kernel_shared[((rc_outer_inner * 72) + 161)])); conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 162)] * kernel_shared[((rc_outer_inner * 72) + 162)])); conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 163)] * kernel_shared[((rc_outer_inner * 72) + 163)])); conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 164)] * kernel_shared[((rc_outer_inner * 72) + 164)])); conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + 
(((int)threadIdx.x) % 7)) + 171)] * kernel_shared[((rc_outer_inner * 72) + 165)])); conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 172)] * kernel_shared[((rc_outer_inner * 72) + 166)])); conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 173)] * kernel_shared[((rc_outer_inner * 72) + 167)])); conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 180)] * kernel_shared[((rc_outer_inner * 72) + 168)])); conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 181)] * kernel_shared[((rc_outer_inner * 72) + 169)])); conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 182)] * kernel_shared[((rc_outer_inner * 72) + 170)])); conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 243)] * kernel_shared[((rc_outer_inner * 72) + 171)])); conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 244)] * kernel_shared[((rc_outer_inner * 72) + 172)])); conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 245)] * kernel_shared[((rc_outer_inner * 72) + 173)])); conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 252)] * kernel_shared[((rc_outer_inner * 72) + 174)])); conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) 
% 7)) + 253)] * kernel_shared[((rc_outer_inner * 72) + 175)])); conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 254)] * kernel_shared[((rc_outer_inner * 72) + 176)])); conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 261)] * kernel_shared[((rc_outer_inner * 72) + 177)])); conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 262)] * kernel_shared[((rc_outer_inner * 72) + 178)])); conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 263)] * kernel_shared[((rc_outer_inner * 72) + 179)])); conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 324)] * kernel_shared[((rc_outer_inner * 72) + 180)])); conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 325)] * kernel_shared[((rc_outer_inner * 72) + 181)])); conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 326)] * kernel_shared[((rc_outer_inner * 72) + 182)])); conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 333)] * kernel_shared[((rc_outer_inner * 72) + 183)])); conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 334)] * kernel_shared[((rc_outer_inner * 72) + 184)])); conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 335)] * 
kernel_shared[((rc_outer_inner * 72) + 185)])); conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 342)] * kernel_shared[((rc_outer_inner * 72) + 186)])); conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 343)] * kernel_shared[((rc_outer_inner * 72) + 187)])); conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 344)] * kernel_shared[((rc_outer_inner * 72) + 188)])); conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 405)] * kernel_shared[((rc_outer_inner * 72) + 189)])); conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 406)] * kernel_shared[((rc_outer_inner * 72) + 190)])); conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 407)] * kernel_shared[((rc_outer_inner * 72) + 191)])); conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 414)] * kernel_shared[((rc_outer_inner * 72) + 192)])); conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 415)] * kernel_shared[((rc_outer_inner * 72) + 193)])); conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 416)] * kernel_shared[((rc_outer_inner * 72) + 194)])); conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 423)] * 
kernel_shared[((rc_outer_inner * 72) + 195)])); conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 424)] * kernel_shared[((rc_outer_inner * 72) + 196)])); conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 425)] * kernel_shared[((rc_outer_inner * 72) + 197)])); conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 486)] * kernel_shared[((rc_outer_inner * 72) + 198)])); conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 487)] * kernel_shared[((rc_outer_inner * 72) + 199)])); conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 488)] * kernel_shared[((rc_outer_inner * 72) + 200)])); conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 495)] * kernel_shared[((rc_outer_inner * 72) + 201)])); conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 496)] * kernel_shared[((rc_outer_inner * 72) + 202)])); conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 497)] * kernel_shared[((rc_outer_inner * 72) + 203)])); conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 504)] * kernel_shared[((rc_outer_inner * 72) + 204)])); conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 505)] * 
kernel_shared[((rc_outer_inner * 72) + 205)])); conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 506)] * kernel_shared[((rc_outer_inner * 72) + 206)])); conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 567)] * kernel_shared[((rc_outer_inner * 72) + 207)])); conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 568)] * kernel_shared[((rc_outer_inner * 72) + 208)])); conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 569)] * kernel_shared[((rc_outer_inner * 72) + 209)])); conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 576)] * kernel_shared[((rc_outer_inner * 72) + 210)])); conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 577)] * kernel_shared[((rc_outer_inner * 72) + 211)])); conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 578)] * kernel_shared[((rc_outer_inner * 72) + 212)])); conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 585)] * kernel_shared[((rc_outer_inner * 72) + 213)])); conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 586)] * kernel_shared[((rc_outer_inner * 72) + 214)])); conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 587)] * 
kernel_shared[((rc_outer_inner * 72) + 215)])); conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7))] * kernel_shared[((rc_outer_inner * 72) + 288)])); conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 1)] * kernel_shared[((rc_outer_inner * 72) + 289)])); conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 2)] * kernel_shared[((rc_outer_inner * 72) + 290)])); conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 9)] * kernel_shared[((rc_outer_inner * 72) + 291)])); conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 10)] * kernel_shared[((rc_outer_inner * 72) + 292)])); conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 11)] * kernel_shared[((rc_outer_inner * 72) + 293)])); conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 18)] * kernel_shared[((rc_outer_inner * 72) + 294)])); conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 19)] * kernel_shared[((rc_outer_inner * 72) + 295)])); conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 20)] * kernel_shared[((rc_outer_inner * 72) + 296)])); conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 81)] * kernel_shared[((rc_outer_inner * 72) + 
297)])); conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 82)] * kernel_shared[((rc_outer_inner * 72) + 298)])); conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 83)] * kernel_shared[((rc_outer_inner * 72) + 299)])); conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 90)] * kernel_shared[((rc_outer_inner * 72) + 300)])); conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 91)] * kernel_shared[((rc_outer_inner * 72) + 301)])); conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 92)] * kernel_shared[((rc_outer_inner * 72) + 302)])); conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 99)] * kernel_shared[((rc_outer_inner * 72) + 303)])); conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 100)] * kernel_shared[((rc_outer_inner * 72) + 304)])); conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 101)] * kernel_shared[((rc_outer_inner * 72) + 305)])); conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 162)] * kernel_shared[((rc_outer_inner * 72) + 306)])); conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 163)] * kernel_shared[((rc_outer_inner * 72) + 307)])); conv2d_nchw[2] = 
(conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 164)] * kernel_shared[((rc_outer_inner * 72) + 308)])); conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 171)] * kernel_shared[((rc_outer_inner * 72) + 309)])); conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 172)] * kernel_shared[((rc_outer_inner * 72) + 310)])); conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 173)] * kernel_shared[((rc_outer_inner * 72) + 311)])); conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 180)] * kernel_shared[((rc_outer_inner * 72) + 312)])); conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 181)] * kernel_shared[((rc_outer_inner * 72) + 313)])); conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 182)] * kernel_shared[((rc_outer_inner * 72) + 314)])); conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 243)] * kernel_shared[((rc_outer_inner * 72) + 315)])); conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 244)] * kernel_shared[((rc_outer_inner * 72) + 316)])); conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 245)] * kernel_shared[((rc_outer_inner * 72) + 317)])); conv2d_nchw[2] = (conv2d_nchw[2] + 
(pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 252)] * kernel_shared[((rc_outer_inner * 72) + 318)])); conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 253)] * kernel_shared[((rc_outer_inner * 72) + 319)])); conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 254)] * kernel_shared[((rc_outer_inner * 72) + 320)])); conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 261)] * kernel_shared[((rc_outer_inner * 72) + 321)])); conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 262)] * kernel_shared[((rc_outer_inner * 72) + 322)])); conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 263)] * kernel_shared[((rc_outer_inner * 72) + 323)])); conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 324)] * kernel_shared[((rc_outer_inner * 72) + 324)])); conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 325)] * kernel_shared[((rc_outer_inner * 72) + 325)])); conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 326)] * kernel_shared[((rc_outer_inner * 72) + 326)])); conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 333)] * kernel_shared[((rc_outer_inner * 72) + 327)])); conv2d_nchw[2] = (conv2d_nchw[2] + 
(pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 334)] * kernel_shared[((rc_outer_inner * 72) + 328)])); conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 335)] * kernel_shared[((rc_outer_inner * 72) + 329)])); conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 342)] * kernel_shared[((rc_outer_inner * 72) + 330)])); conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 343)] * kernel_shared[((rc_outer_inner * 72) + 331)])); conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 344)] * kernel_shared[((rc_outer_inner * 72) + 332)])); conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 405)] * kernel_shared[((rc_outer_inner * 72) + 333)])); conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 406)] * kernel_shared[((rc_outer_inner * 72) + 334)])); conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 407)] * kernel_shared[((rc_outer_inner * 72) + 335)])); conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 414)] * kernel_shared[((rc_outer_inner * 72) + 336)])); conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 415)] * kernel_shared[((rc_outer_inner * 72) + 337)])); conv2d_nchw[2] = (conv2d_nchw[2] + 
(pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 416)] * kernel_shared[((rc_outer_inner * 72) + 338)])); conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 423)] * kernel_shared[((rc_outer_inner * 72) + 339)])); conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 424)] * kernel_shared[((rc_outer_inner * 72) + 340)])); conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 425)] * kernel_shared[((rc_outer_inner * 72) + 341)])); conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 486)] * kernel_shared[((rc_outer_inner * 72) + 342)])); conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 487)] * kernel_shared[((rc_outer_inner * 72) + 343)])); conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 488)] * kernel_shared[((rc_outer_inner * 72) + 344)])); conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 495)] * kernel_shared[((rc_outer_inner * 72) + 345)])); conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 496)] * kernel_shared[((rc_outer_inner * 72) + 346)])); conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 497)] * kernel_shared[((rc_outer_inner * 72) + 347)])); conv2d_nchw[2] = (conv2d_nchw[2] + 
(pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 504)] * kernel_shared[((rc_outer_inner * 72) + 348)])); conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 505)] * kernel_shared[((rc_outer_inner * 72) + 349)])); conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 506)] * kernel_shared[((rc_outer_inner * 72) + 350)])); conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 567)] * kernel_shared[((rc_outer_inner * 72) + 351)])); conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 568)] * kernel_shared[((rc_outer_inner * 72) + 352)])); conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 569)] * kernel_shared[((rc_outer_inner * 72) + 353)])); conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 576)] * kernel_shared[((rc_outer_inner * 72) + 354)])); conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 577)] * kernel_shared[((rc_outer_inner * 72) + 355)])); conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 578)] * kernel_shared[((rc_outer_inner * 72) + 356)])); conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 585)] * kernel_shared[((rc_outer_inner * 72) + 357)])); conv2d_nchw[2] = (conv2d_nchw[2] + 
(pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 586)] * kernel_shared[((rc_outer_inner * 72) + 358)])); conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 587)] * kernel_shared[((rc_outer_inner * 72) + 359)])); conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7))] * kernel_shared[((rc_outer_inner * 72) + 432)])); conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 1)] * kernel_shared[((rc_outer_inner * 72) + 433)])); conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 2)] * kernel_shared[((rc_outer_inner * 72) + 434)])); conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 9)] * kernel_shared[((rc_outer_inner * 72) + 435)])); conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 10)] * kernel_shared[((rc_outer_inner * 72) + 436)])); conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 11)] * kernel_shared[((rc_outer_inner * 72) + 437)])); conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 18)] * kernel_shared[((rc_outer_inner * 72) + 438)])); conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 19)] * kernel_shared[((rc_outer_inner * 72) + 439)])); conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 
648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 20)] * kernel_shared[((rc_outer_inner * 72) + 440)])); conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 81)] * kernel_shared[((rc_outer_inner * 72) + 441)])); conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 82)] * kernel_shared[((rc_outer_inner * 72) + 442)])); conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 83)] * kernel_shared[((rc_outer_inner * 72) + 443)])); conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 90)] * kernel_shared[((rc_outer_inner * 72) + 444)])); conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 91)] * kernel_shared[((rc_outer_inner * 72) + 445)])); conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 92)] * kernel_shared[((rc_outer_inner * 72) + 446)])); conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 99)] * kernel_shared[((rc_outer_inner * 72) + 447)])); conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 100)] * kernel_shared[((rc_outer_inner * 72) + 448)])); conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 101)] * kernel_shared[((rc_outer_inner * 72) + 449)])); conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) 
/ 7) * 9)) + (((int)threadIdx.x) % 7)) + 162)] * kernel_shared[((rc_outer_inner * 72) + 450)])); conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 163)] * kernel_shared[((rc_outer_inner * 72) + 451)])); conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 164)] * kernel_shared[((rc_outer_inner * 72) + 452)])); conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 171)] * kernel_shared[((rc_outer_inner * 72) + 453)])); conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 172)] * kernel_shared[((rc_outer_inner * 72) + 454)])); conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 173)] * kernel_shared[((rc_outer_inner * 72) + 455)])); conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 180)] * kernel_shared[((rc_outer_inner * 72) + 456)])); conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 181)] * kernel_shared[((rc_outer_inner * 72) + 457)])); conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 182)] * kernel_shared[((rc_outer_inner * 72) + 458)])); conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 243)] * kernel_shared[((rc_outer_inner * 72) + 459)])); conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + 
(((int)threadIdx.x) % 7)) + 244)] * kernel_shared[((rc_outer_inner * 72) + 460)])); conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 245)] * kernel_shared[((rc_outer_inner * 72) + 461)])); conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 252)] * kernel_shared[((rc_outer_inner * 72) + 462)])); conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 253)] * kernel_shared[((rc_outer_inner * 72) + 463)])); conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 254)] * kernel_shared[((rc_outer_inner * 72) + 464)])); conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 261)] * kernel_shared[((rc_outer_inner * 72) + 465)])); conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 262)] * kernel_shared[((rc_outer_inner * 72) + 466)])); conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 263)] * kernel_shared[((rc_outer_inner * 72) + 467)])); conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 324)] * kernel_shared[((rc_outer_inner * 72) + 468)])); conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 325)] * kernel_shared[((rc_outer_inner * 72) + 469)])); conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) 
% 7)) + 326)] * kernel_shared[((rc_outer_inner * 72) + 470)])); conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 333)] * kernel_shared[((rc_outer_inner * 72) + 471)])); conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 334)] * kernel_shared[((rc_outer_inner * 72) + 472)])); conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 335)] * kernel_shared[((rc_outer_inner * 72) + 473)])); conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 342)] * kernel_shared[((rc_outer_inner * 72) + 474)])); conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 343)] * kernel_shared[((rc_outer_inner * 72) + 475)])); conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 344)] * kernel_shared[((rc_outer_inner * 72) + 476)])); conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 405)] * kernel_shared[((rc_outer_inner * 72) + 477)])); conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 406)] * kernel_shared[((rc_outer_inner * 72) + 478)])); conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 407)] * kernel_shared[((rc_outer_inner * 72) + 479)])); conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 414)] * 
kernel_shared[((rc_outer_inner * 72) + 480)])); conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 415)] * kernel_shared[((rc_outer_inner * 72) + 481)])); conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 416)] * kernel_shared[((rc_outer_inner * 72) + 482)])); conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 423)] * kernel_shared[((rc_outer_inner * 72) + 483)])); conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 424)] * kernel_shared[((rc_outer_inner * 72) + 484)])); conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 425)] * kernel_shared[((rc_outer_inner * 72) + 485)])); conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 486)] * kernel_shared[((rc_outer_inner * 72) + 486)])); conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 487)] * kernel_shared[((rc_outer_inner * 72) + 487)])); conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 488)] * kernel_shared[((rc_outer_inner * 72) + 488)])); conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 495)] * kernel_shared[((rc_outer_inner * 72) + 489)])); conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 496)] * 
kernel_shared[((rc_outer_inner * 72) + 490)])); conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 497)] * kernel_shared[((rc_outer_inner * 72) + 491)])); conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 504)] * kernel_shared[((rc_outer_inner * 72) + 492)])); conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 505)] * kernel_shared[((rc_outer_inner * 72) + 493)])); conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 506)] * kernel_shared[((rc_outer_inner * 72) + 494)])); conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 567)] * kernel_shared[((rc_outer_inner * 72) + 495)])); conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 568)] * kernel_shared[((rc_outer_inner * 72) + 496)])); conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 569)] * kernel_shared[((rc_outer_inner * 72) + 497)])); conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 576)] * kernel_shared[((rc_outer_inner * 72) + 498)])); conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 577)] * kernel_shared[((rc_outer_inner * 72) + 499)])); conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 578)] * 
kernel_shared[((rc_outer_inner * 72) + 500)])); conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 585)] * kernel_shared[((rc_outer_inner * 72) + 501)])); conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 586)] * kernel_shared[((rc_outer_inner * 72) + 502)])); conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 648) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 587)] * kernel_shared[((rc_outer_inner * 72) + 503)])); } } for (int i1_inner = 0; i1_inner < 4; ++i1_inner) { compute[(((((int)blockIdx.x) * 196) + (i1_inner * 49)) + ((int)threadIdx.x))] = max((conv2d_nchw[i1_inner] + bias[((((int)blockIdx.x) * 4) + i1_inner)]), 0.000000e+00f); } }

.. GENERATED FROM PYTHON SOURCE LINES 184-188

A more complicated example is to resume the search.
In this case, we need to create the search policy and cost model ourselves
and restore their state from the log file.
In the example below we restore that state and run 5 more trials.

.. GENERATED FROM PYTHON SOURCE LINES 188-210

.. code-block:: default


    def resume_search(task, log_file):
        print("Resume search:")
        # Rebuild the cost model from the measurement records in the log file
        cost_model = auto_scheduler.XGBModel()
        cost_model.update_from_file(log_file)
        # Preload the already-measured states so the policy continues from them
        search_policy = auto_scheduler.SketchPolicy(
            task, cost_model, init_search_callbacks=[auto_scheduler.PreloadMeasuredStates(log_file)]
        )
        measure_ctx = auto_scheduler.LocalRPCMeasureContext(min_repeat_ms=300)
        tune_option = auto_scheduler.TuningOptions(
            num_measure_trials=5,
            runner=measure_ctx.runner,
            measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
        )
        task.tune(tune_option, search_policy=search_policy)

        # Kill the measurement process
        del measure_ctx


    resume_search(task, log_file)

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Resume search:
    /usr/local/lib/python3.7/dist-packages/xgboost/training.py:17: UserWarning: Old style callback is deprecated. See: https://xgboost.readthedocs.io/en/latest/python/callbacks.html
      warnings.warn(f'Old style callback is deprecated. See: {link}', UserWarning)
    Get devices for measurement successfully!

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 2 minutes 35.003 seconds)

.. _sphx_glr_download_how_to_tune_with_autoscheduler_tune_conv2d_layer_cuda.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: tune_conv2d_layer_cuda.py `

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: tune_conv2d_layer_cuda.ipynb `

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery `_
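
As a reading aid for the generated CUDA kernel shown earlier, the following plain-Python sketch decodes the flat index used by the kernel's final store loop, ``compute[blockIdx.x * 196 + i1_inner * 49 + threadIdx.x]``. It is not part of TVM. It assumes the output tensor has shape ``(1, 512, 7, 7)`` in NCHW layout (as in the task defined in this tutorial), and that the kernel is launched with 128 blocks of 49 threads. The launch configuration is an assumption inferred from the index arithmetic, and the helper names ``decode_store_index`` and ``covered_indices`` are hypothetical.

```python
# Illustrative sketch only (not TVM code): decode the flat output index
#   compute[blockIdx.x * 196 + i1_inner * 49 + threadIdx.x]
# written by the final store loop of the generated kernel.
# Assumption: output shape is (1, 512, 7, 7) in NCHW layout, so one output
# channel holds 7 * 7 = 49 elements and 196 = 4 channels * 49 elements.

CO, H, W = 512, 7, 7
CHANNELS_PER_BLOCK = 4   # trip count of the i1_inner loop in the kernel
THREADS_PER_BLOCK = 49   # assumed: one thread per spatial position


def decode_store_index(block_idx, i1_inner, thread_idx):
    """Return (flat_index, channel, y, x) for a single store."""
    flat = block_idx * 196 + i1_inner * 49 + thread_idx
    channel = block_idx * 4 + i1_inner  # same expression as the bias index
    y, x = divmod(thread_idx, W)        # row-major position inside a channel
    return flat, channel, y, x


def covered_indices():
    """Collect every flat index written across all blocks, steps, threads."""
    num_blocks = CO // CHANNELS_PER_BLOCK  # assumed grid size: 128 blocks
    return sorted(
        decode_store_index(b, i, t)[0]
        for b in range(num_blocks)
        for i in range(CHANNELS_PER_BLOCK)
        for t in range(THREADS_PER_BLOCK)
    )


if __name__ == "__main__":
    flat, channel, y, x = decode_store_index(block_idx=2, i1_inner=1, thread_idx=10)
    # The flat index should match the usual NCHW formula for batch 0.
    print(flat == (channel * H + y) * W + x)
```

Under these assumptions each block writes four consecutive output channels and each thread handles one spatial position per channel, so every output element is written exactly once. This is also why the bias lookup ``bias[blockIdx.x * 4 + i1_inner]`` depends only on the channel.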