#### Note

**Data Tiling**

One source of complexity when targeting accelerators is making sure that the data layout matches the layout imposed by the accelerator design. VTA is designed around a *tensor core* that performs one matrix-matrix operation per cycle between an activation matrix and a weight matrix, adding the result matrix to an accumulator matrix, as shown in the figure below.

.. image:: https://raw.githubusercontent.com/uwsaml/web-data/master/vta/tutorial/tensor_core.png
   :align: center
   :width: 480px

The dimensions of that matrix-matrix multiplication are specified in the :code:`vta_config.json` configuration file. The activation matrix has a :code:`(BATCH, BLOCK_IN)` shape and the transposed weight matrix has a :code:`(BLOCK_OUT, BLOCK_IN)` shape, so the resulting output matrix has a :code:`(BATCH, BLOCK_OUT)` shape. Consequently, input and output tensors processed by VTA need to be tiled according to these dimensions.

The diagram below shows the impact of data tiling on a matrix originally of shape (4, 8). Tiling by a (2, 2) tile shape ensures that data within each tile is contiguous. The resulting tiled tensor has a shape of (2, 4, 2, 2).

.. image:: https://raw.githubusercontent.com/uwsaml/web-data/master/vta/tutorial/data_tiling.png
   :align: center
   :width: 480px

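The (4, 8) to (2, 4, 2, 2) tiling above can be reproduced with a plain NumPy reshape/transpose; a minimal sketch (the array values are only illustrative):

```python
import numpy as np

# A (4, 8) matrix to be tiled with a (2, 2) tile shape
x = np.arange(32).reshape(4, 8)
# Split each axis into (tile index, intra-tile index), then move both
# tile indices to the front: (4, 8) -> (2, 2, 4, 2) -> (2, 4, 2, 2)
tiled = np.ascontiguousarray(x.reshape(2, 2, 4, 2).transpose(0, 2, 1, 3))
print(tiled.shape)  # (2, 4, 2, 2)
# tiled[i, j] is the (2, 2) block starting at row 2*i, column 2*j
print(np.array_equal(tiled[1, 2], x[2:4, 4:6]))  # True
```

After the `ascontiguousarray` copy, the four elements of each (2, 2) tile are adjacent in memory, which is exactly the property VTA's tensor core relies on.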
We first define the variables :code:`m`, :code:`n`, :code:`o` to represent the shape of the matrix multiplication. These variables are multiplicative factors over the :code:`BLOCK_OUT`, :code:`BLOCK_IN`, and :code:`BATCH` tensor dimensions respectively. By default, the configuration file sets :code:`BATCH`, :code:`BLOCK_IN`, and :code:`BLOCK_OUT` to 1, 16, and 16 respectively (:code:`BATCH` being set to 1 implies that our compute building block is a vector-matrix multiply).

#### Note

**Data Types**

It's important to match not only the inner-tile dimensions of VTA's tensor core, but also the specific data types expected by VTA. VTA for now only supports fixed-point data types, whose integer width is specified in the :code:`vta_config.json` file by :code:`INP_WIDTH` and :code:`WGT_WIDTH` for the activation and weight data types respectively. In addition, the accumulator data type integer width is specified by :code:`ACC_WIDTH`.

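The note above hints at why the accumulator must be wider than the operands; quick back-of-the-envelope arithmetic (assuming the default 8-bit operands and a 16 x 16 = 256-element reduction, as used later in this tutorial) makes it concrete:

```python
import numpy as np

# Worst-case 8-bit product: 127 * 127 = 16129; reducing over 256 input
# channels (n * BLOCK_IN = 16 * 16) gives a worst-case sum of:
worst = 127 * 127 * 256
print(worst)                           # 4129024
# This overflows 8- and 16-bit integers, but fits comfortably in 32 bits
print(worst > np.iinfo(np.int16).max)  # True
print(worst < np.iinfo(np.int32).max)  # True
```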
By default, the configuration file sets :code:`INP_WIDTH` and :code:`WGT_WIDTH` to 8. The accumulator width :code:`ACC_WIDTH` is set to 32 in order to avoid overflow during accumulation. As a result, :code:`env.inp_dtype` and :code:`env.wgt_dtype` are both narrow 8-bit integers, while :code:`env.acc_dtype` is a standard 32-bit integer.

```python
# Output channel factor m - total 16x16=256 output channels
m = 16
# Input channel factor n - total 16x16=256 input channels
n = 16
# Batch factor o (we use single batch inference)
o = 1
# A placeholder tensor in tiled data format
A = te.placeholder((o, n, env.BATCH, env.BLOCK_IN), name="A", dtype=env.inp_dtype)
# B placeholder tensor in tiled data format
B = te.placeholder((m, n, env.BLOCK_OUT, env.BLOCK_IN), name="B", dtype=env.wgt_dtype)
# A copy buffer
A_buf = te.compute((o, n, env.BATCH, env.BLOCK_IN), lambda *i: A(*i), "A_buf")
# B copy buffer
B_buf = te.compute((m, n, env.BLOCK_OUT, env.BLOCK_IN), lambda *i: B(*i), "B_buf")
```

Matrix Multiplication
~~~~~~~~~~~~~~~~~~~~~

Now we're ready to describe the matrix multiplication result tensor :code:`C` with another compute operation. The compute function takes the shape of the tensor, as well as a lambda function that describes the computation rule for each position of the tensor.

In order to implement matrix multiplication, the lambda function needs to include a reduction formula over the input channel dimension axes. To create a reduction formula, we can declare a reduction axis using :code:`te.reduce_axis`, which takes in the range of reductions. :code:`te.sum` takes in the expression to be reduced as well as the reduction axes, and computes the sum of values over all k in the declared ranges.

Note that the reduction needs to be performed over the 32-bit :code:`env.acc_dtype` accumulator data type.

No computation happens during this phase, as we are only declaring how the computation should be done.

```python
# Outer input feature reduction axis
ko = te.reduce_axis((0, n), name="ko")
# Inner input feature reduction axis
ki = te.reduce_axis((0, env.BLOCK_IN), name="ki")
# Describe the in-VTA matrix multiplication
C_buf = te.compute(
    (o, m, env.BATCH, env.BLOCK_OUT),
    lambda bo, co, bi, ci: te.sum(
        A_buf[bo, ko, bi, ki].astype(env.acc_dtype) * B_buf[co, ko, ci, ki].astype(env.acc_dtype),
        axis=[ko, ki],
    ),
    name="C_buf",
)
```

Casting the Results
~~~~~~~~~~~~~~~~~~~

After the computation is done, we'll need to send the results computed by VTA back to main memory.
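As a sanity check that the tiled reduction declared earlier really implements an ordinary matrix multiply, the same index pattern can be mirrored in NumPy. All block sizes below are hypothetical stand-ins, not the VTA defaults:

```python
import numpy as np

# Hypothetical small sizes standing in for VTA's (BATCH, BLOCK_IN, BLOCK_OUT)
o, m, n, BATCH, BLOCK_IN, BLOCK_OUT = 1, 2, 3, 1, 4, 5
rng = np.random.default_rng(0)
A = rng.integers(-128, 128, size=(o, n, BATCH, BLOCK_IN), dtype=np.int8)
B = rng.integers(-128, 128, size=(m, n, BLOCK_OUT, BLOCK_IN), dtype=np.int8)

# Tiled reduction: C[bo, co, bi, ci] = sum over (ko, ki), in a wide dtype
C = np.einsum("bkij,ckpj->bcip", A.astype(np.int32), B.astype(np.int32))

# Equivalent plain matmul on the un-tiled (row, col) layouts
A2 = A.transpose(0, 2, 1, 3).reshape(o * BATCH, n * BLOCK_IN)
B2 = B.transpose(0, 2, 1, 3).reshape(m * BLOCK_OUT, n * BLOCK_IN)
C2 = A2.astype(np.int32) @ B2.astype(np.int32).T
assert np.array_equal(C.transpose(0, 2, 1, 3).reshape(o * BATCH, m * BLOCK_OUT), C2)
```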

#### Note

**Memory Store Restrictions**

One particularity of VTA is that it only supports DRAM stores in the narrow :code:`env.inp_dtype` data type format. This lets us reduce the data footprint for memory transfers, but also lets us quantize the wide accumulator data type down to a data format that matches the input activation data type. This means that in the context of neural network inference, the outputs of a given layer after activation can be consumed directly by the next layer.

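The quantization step implied by this store restriction can be illustrated in NumPy. Note this is a hedged sketch with made-up accumulator values: a raw narrowing cast wraps modulo 256, which is why real inference pipelines typically shift and clip accumulator values into range before the cast:

```python
import numpy as np

acc = np.array([100, 300, -500], dtype=np.int32)   # hypothetical accumulator values
wrapped = acc.astype(np.int8)                      # raw narrowing cast: wraps modulo 256
clipped = np.clip(acc, -128, 127).astype(np.int8)  # saturating alternative
print(wrapped.tolist())  # [100, 44, 12]
print(clipped.tolist())  # [100, 127, -128]
```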
We perform one last typecast operation to the narrow input activation data format.

```python
# Cast to output type, and send to main memory
C = te.compute(
    (o, m, env.BATCH, env.BLOCK_OUT), lambda *i: C_buf(*i).astype(env.inp_dtype), name="C"
)
```

This concludes the computation declaration part of this tutorial.

Scheduling the Computation
--------------------------

While the lines above describe the computation rule, we can obtain :code:`C` in many ways. TVM asks the user to provide an implementation of the computation, called a *schedule*.

A schedule is a set of transformations to an original computation that changes the implementation of the computation without affecting correctness. This simple VTA programming tutorial aims to demonstrate basic schedule transformations that will map the original schedule down to VTA hardware primitives.

Default Schedule
~~~~~~~~~~~~~~~~

After we construct the schedule, by default the schedule computes :code:`C` in the following way:

```python
# Let's take a look at the generated schedule
s = te.create_schedule(C.op)
print(tvm.lower(s, [A, B, C], simple_mode=True))
```

Although this schedule makes sense, it won't compile to VTA. In order to obtain correct code generation, we need to apply scheduling primitives and code annotations that will transform the schedule into one that can be directly lowered onto VTA hardware intrinsics. Those include:

- DMA copy operations which will take globally-scoped tensors and copy those into locally-scoped tensors.
- Tensor operations that will perform the matrix multiplication.

Buffer Scopes
~~~~~~~~~~~~~

First, we set the scope of the buffers to tell TVM that these buffers will live in VTA's on-chip SRAM caches. Below, we tell TVM that :code:`A_buf`, :code:`B_buf`, and :code:`C_buf` will respectively live in VTA's on-chip input, weight, and accumulator memory.

#### Note

**VTA's On-Chip SRAMs**

VTA has three different memory scopes, each corresponding to a different on-chip SRAM buffer.

- :code:`env.inp_scope`: Input buffer, a read-only SRAM buffer that stores input matrices of shape :code:`(env.BATCH, env.BLOCK_IN)` of type :code:`env.inp_dtype`. The input buffer contains `2 ^ LOG_INP_BUFF_SIZE` matrix elements (as specified in the :code:`vta_config.json` file).
- :code:`env.wgt_scope`: Weight buffer, a read-only SRAM buffer that stores weight matrices of shape :code:`(env.BLOCK_OUT, env.BLOCK_IN)` of type :code:`env.wgt_dtype`. The weight buffer contains `2 ^ LOG_WGT_BUFF_SIZE` matrix elements.
- :code:`env.acc_scope`: Accumulator buffer, a read/write SRAM buffer that stores accumulator matrices of shape :code:`(env.BATCH, env.BLOCK_OUT)` of type :code:`env.acc_dtype`. The accumulator buffer is VTA's general-purpose register file: it holds both intermediate results of convolutions and matrix multiplications as well as intermediate results of pooling, batch normalization, and activation layers. The accumulator buffer contains `2 ^ LOG_ACC_BUFF_SIZE` matrix elements.
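To get a feel for these buffer geometries, the per-matrix footprint of each scope under the default configuration works out as plain arithmetic from the widths quoted earlier (the total buffer capacities additionally depend on the `LOG_*_BUFF_SIZE` values in :code:`vta_config.json`):

```python
# Default vta_config.json values quoted in this tutorial
BATCH, BLOCK_IN, BLOCK_OUT = 1, 16, 16
INP_WIDTH, WGT_WIDTH, ACC_WIDTH = 8, 8, 32

inp_bytes = BATCH * BLOCK_IN * INP_WIDTH // 8      # one (1, 16) int8 input matrix
wgt_bytes = BLOCK_OUT * BLOCK_IN * WGT_WIDTH // 8  # one (16, 16) int8 weight matrix
acc_bytes = BATCH * BLOCK_OUT * ACC_WIDTH // 8     # one (1, 16) int32 accumulator matrix
print(inp_bytes, wgt_bytes, acc_bytes)  # 16 256 64
```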