{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "%%shell\n# Installs the latest dev build of TVM from PyPI. If you wish to build\n# from source, see https://tvm.apache.org/docs/install/from_source.html\npip install apache-tvm --pre" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n\n# Matrix Multiply Blocking\n**Author**: [Thierry Moreau](https://homes.cs.washington.edu/~moreau/)\n\nThis tutorial provides an overview on how to use TVM to map matrix\nmultiplication efficiently on the VTA design.\nWe recommend covering the `basic-mat-mult` tutorial first.\n\nIn this tutorial, we will demonstrate TVM schedule optimizations to break large\nneural network operators down onto smaller blocks to achieve computation within\nlimited hardware accelerator resources.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## RPC Setup\nWe start by programming the Pynq's FPGA and building its RPC runtime.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from __future__ import absolute_import, print_function\n\nimport os\nimport tvm\nfrom tvm import te\nimport vta\nimport numpy as np\nfrom tvm import rpc\nfrom tvm.contrib import utils\nfrom vta.testing import simulator\n\n# Load VTA parameters from the 3rdparty/vta-hw/config/vta_config.json file\nenv = vta.get_env()\n\n# We read the Pynq RPC host IP address and port number from the OS environment\nhost = os.environ.get(\"VTA_RPC_HOST\", \"192.168.2.99\")\nport = int(os.environ.get(\"VTA_RPC_PORT\", \"9091\"))\n\n# We configure both the bitstream and the runtime system on the Pynq\n# to match the VTA configuration specified by the vta_config.json file.\nif env.TARGET == \"pynq\":\n\n # Make sure that TVM was compiled with RPC=1\n assert tvm.runtime.enabled(\"rpc\")\n remote = rpc.connect(host, port)\n\n # Reconfigure the JIT runtime\n vta.reconfig_runtime(remote)\n\n # Program the FPGA with a pre-compiled VTA bitstream.\n # You can program the FPGA with your own custom bitstream\n # by passing the path to the bitstream file instead of None.\n vta.program_fpga(remote, bitstream=None)\n\n# In simulation mode, host the RPC server locally.\nelif env.TARGET in [\"sim\", \"tsim\"]:\n remote = rpc.LocalSession()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Computation Declaration\nAs a first step, we need to describe our matrix multiplication computation.\nWe define the matrix multiplication as the computation one would find in a\nfully connected layer, defined by its batch size, input channels, and output\nchannels.\nThese have to be integer multiples of the VTA tensor shape:\n:code:`BATCH`, :code:`BLOCK_IN`, and :code:`BLOCK_OUT` respectively.\n\nWe've added extra operators to the matrix multiplication that apply\nshifting and clipping to the output in order to mimic a fixed-point\nmatrix multiplication followed by a rectified linear activation.\nWe describe the TVM dataflow graph of the fully connected layer below:\n\n \n\nThis computation is intentionally too large to fit onto VTA's on-chip\nbuffers all at once. Therefore in the scheduling phase we'll\nrely on computation blocking strategies to break the computation down into\nmanageable chunks.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Fully connected layer dimensions: 1024 x 1024\nbatch_size = 1\nin_channels = 1024\nout_channels = 1024\nassert batch_size % env.BATCH == 0\nassert in_channels % env.BLOCK_IN == 0\nassert out_channels % env.BLOCK_OUT == 0\n\n# Let's derive the tiled input tensor shapes\ndata_shape = (batch_size // env.BATCH, in_channels // env.BLOCK_IN, env.BATCH, env.BLOCK_IN)\nweight_shape = (\n out_channels // env.BLOCK_OUT,\n in_channels // env.BLOCK_IN,\n env.BLOCK_OUT,\n env.BLOCK_IN,\n)\noutput_shape = (batch_size // env.BATCH, out_channels // env.BLOCK_OUT, env.BATCH, env.BLOCK_OUT)\nnum_ops = in_channels * out_channels * batch_size * 2\n\n# Reduction axes\nic = te.reduce_axis((0, in_channels // env.BLOCK_IN), name=\"ic\")\nic_tns = te.reduce_axis((0, env.BLOCK_IN), name=\"ic_tns\")\n\n# Input placeholder tensors\ndata = te.placeholder(data_shape, name=\"data\", dtype=env.inp_dtype)\nweight = te.placeholder(weight_shape, name=\"weight\", dtype=env.wgt_dtype)\n\n# Copy buffers\ndata_buf = te.compute(data_shape, lambda *i: data(*i), \"data_buf\")\nweight_buf = te.compute(weight_shape, lambda *i: weight(*i), \"weight_buf\")\n\n# Declare matrix multiply computation\nres_gemm = te.compute(\n output_shape,\n lambda bo, co, bi, ci: te.sum(\n data_buf[bo, ic, bi, ic_tns].astype(env.acc_dtype)\n * weight_buf[co, ic, ci, ic_tns].astype(env.acc_dtype),\n axis=[ic, ic_tns],\n ),\n name=\"res_gem\",\n)\n\n# Add shift stage for fix-point normalization\nres_shr = te.compute(output_shape, lambda *i: res_gemm(*i) >> env.INP_WIDTH, name=\"res_shr\")\n\n# Apply clipping between (0, input max value)\ninp_max = (1 << (env.INP_WIDTH - 1)) - 1\nres_max = te.compute(output_shape, lambda *i: tvm.te.max(res_shr(*i), 0), \"res_max\")\nres_min = te.compute(output_shape, lambda *i: tvm.te.min(res_max(*i), inp_max), \"res_min\")\n\n# Apply typecast to input data type before sending results back\nres = te.compute(output_shape, lambda *i: res_min(*i).astype(env.inp_dtype), name=\"res\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Scheduling the Computation\nWe'll look at a set of schedule transformations necessary to map the\nmatrix multiplications onto VTA in an efficient fashion.\nThose include:\n\n- Computation blocking\n- Lowering to VTA hardware intrinsics\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Create TVM schedule\ns = te.create_schedule(res.op)\n# Let's look at the default TVM schedule\nprint(tvm.lower(s, [data, weight, res], simple_mode=True))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Blocking the Computation\nThe matrix multiplication is by default too large for activations or weights\nto fit on VTA's on-chip buffers all at once.\nWe block the (1, 1024) by (1024, 1024) matrix multiplication into\nsmaller (1, 256) by (256, 256) matrix multiplications so the intermediate\ntensors can fit on the accelerator's on-chip SRAM.\nThis approach is similar to blocking techniques applied to CPUs and GPUs in\norder to increase cache hit rate.\n\nWe perform blocking along each axes (the batch axis being untouched since\nwe are performing singe-batch inference).\nWe also leave the inner-most tensorization axes as-is in order to allow\nTVM to pattern-match tensorization.\nWe show the outcome of blocking on the computation schedule in the diagram\nbelow:\n\n \n\n

#### Note

The code after loop splitting and reordering is equivalent to the following\n pseudo-code. We ignore the batch axis since we are only performing single-batch\n inference in this example:\n\n```c\nfor (int oc_out = 0; oc_out < 4; ++oc_out) {\n // Initialization loop\n for (int oc_inn = 0; oc_inn < 16; ++oc_inn) {\n for (int oc_tns = 0; oc_tns < 16; ++oc_tns) {\n int j = (oc_out * 16 + oc_inn) * 16 + oc_tns;\n C[j] = 0;\n }\n }\n for (int ic_out = 0; ic_out < 4; ++ic_out) {\n // Block loop\n for (int oc_inn = 0; oc_inn < 16; ++oc_inn) {\n for (int ic_inn = 0; ic_inn < 16; ++ic_inn) {\n // Tensorization loop\n for (int oc_tns = 0; oc_tns < 16; ++oc_tns) {\n for (int ic_tns = 0; ic_tns < 16; ++ic_tns) {\n int i = (ic_out * 16 + ic_inn) * 16 + ic_tns;\n int j = (oc_out * 16 + oc_inn) * 16 + oc_tns;\n C[i] = C[i] + A[i] * B[j][i];\n }\n }\n }\n }\n }\n }\n}