tvm.s_tir.tensor_intrin

Intrinsics for tensorization.

tvm.s_tir.tensor_intrin.enabled(target)

Whether the module runtime is enabled for the given target.

Parameters:

target (str or Dict[str, Any] or tvm.target.Target) – The target device type.

Returns:

enabled – Whether runtime is enabled.

Return type:

bool

Examples

The following code checks whether the "gpu" runtime is enabled.

>>> tvm.runtime.enabled("gpu")

tvm.s_tir.tensor_intrin.cuda

Intrinsics for tensorization on NVIDIA GPU.

tvm.s_tir.tensor_intrin.cuda.get_mma_intrin_group(load_scope: Literal['shared', 'shared.dyn'], store_scope: Literal['global', 'shared', 'shared.dyn'], in_dtype: Literal['float16', 'int8', 'float8_e4m3fn', 'float8_e5m2'], out_dtype: Literal['float16', 'float32', 'int32'], trans_a: bool, trans_b: bool, not_use_mma_store_intrinic: bool = True, store_to_smem_dtype: Literal['float16', 'float32', 'int32'] | None = None) dict[str, str]

Get a group of intrinsics for the mma tensor core with the given configurations.

Parameters:
  • load_scope (Literal["shared", "shared.dyn"]) – The memory scope of the input buffer.

  • store_scope (Literal["global", "shared", "shared.dyn"]) – The memory scope of the result buffer.

  • in_dtype (Literal["float16", "int8", "float8_e4m3fn", "float8_e5m2"]) – The input data type.

  • out_dtype (Literal["float16", "float32", "int32"]) – The output data type.

  • trans_a (bool) – Whether the input matrix A is transposed.

  • trans_b (bool) – Whether the input matrix B is transposed.

  • not_use_mma_store_intrinic (bool) –

    Whether to avoid the mma_store intrinsic. If True, BufferStore stmts are used to store the result of mma; otherwise, the mma_store intrinsic is used.

    The option exists because, when swizzling shared memory accesses, the rearrangement scheme would span areas accessed by different mma_store calls, which makes the swizzle quite complex. BufferStore stmts do not have this problem.

  • store_to_smem_dtype (Optional[Literal["float16", "float32", "int32"]]) – The dtype that we use to store from register to shared memory. By default it is out_dtype.

Returns:

ret – A group of tensor intrinsics.

Return type:

Dict[str, str]
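The returned value is a plain mapping from a role in the tensorized schedule to the name of a registered tensor intrinsic. A minimal sketch of the shape of that mapping follows; the key and intrin names are hypothetical, chosen for illustration, and are not confirmed by this reference:

```python
# Illustrative sketch only: mimic how a dtype/layout configuration could
# map roles in a tensorized schedule to registered intrin names.
# All names below are assumptions for illustration.
def select_mma_group(in_dtype: str, out_dtype: str, trans_b: bool) -> dict:
    tag = f"mma_{in_dtype}_{out_dtype}" + ("_trans_b" if trans_b else "")
    return {
        "init": f"{tag}_init",
        "load_a": f"{tag}_load_a",
        "load_b": f"{tag}_load_b",
        "compute": f"{tag}_sync",
        "store": f"{tag}_store",
    }

group = select_mma_group("float16", "float32", trans_b=True)
# Each value would then be looked up when tensorizing the matching block
# of a schedule (init block, load blocks, compute block, store block).
```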

tvm.s_tir.tensor_intrin.cuda.get_wmma_fragment_index(buffer, stride, m_dim, n_dim)

Compute wmma fragment index using elem_offset of the buffer

tvm.s_tir.tensor_intrin.cuda.get_wmma_load_intrin(m_dim: int, n_dim: int, k_dim: int, dtype: str, shared_scope: str, is_b: bool, is_col_major: bool) tuple[PrimFunc, PrimFunc]

Generator of wmma_load intrins

tvm.s_tir.tensor_intrin.cuda.get_wmma_fill_intrin(m_dim: int, n_dim: int, k_dim: int, dtype: str) tuple[PrimFunc, PrimFunc]

Generator of wmma_fill intrins

tvm.s_tir.tensor_intrin.cuda.get_wmma_store_intrin(m_dim: int, n_dim: int, k_dim: int, dtype: str, scope: str) tuple[PrimFunc, PrimFunc]

Generator of wmma_store intrins

tvm.s_tir.tensor_intrin.cuda.get_wmma_sync_intrin(m_dim: int, n_dim: int, k_dim: int, in_dtype: str, out_dtype: str, b_transposed: bool) tuple[PrimFunc, PrimFunc]

Generator of wmma_sync intrins

tvm.s_tir.tensor_intrin.cuda.get_wmma_intrin_group(load_scope: Literal['shared', 'shared.dyn'], store_scope: Literal['global', 'shared', 'shared.dyn'], in_dtype: str, out_dtype: str, trans_b: bool) dict[str, str]

Get a group of intrinsics for the wmma tensor core with the given configurations.

Parameters:
  • load_scope (Literal["shared", "shared.dyn"]) – The memory scope of the input buffer.

  • store_scope (Literal["global", "shared", "shared.dyn"]) – The memory scope of the result buffer.

  • in_dtype (str) – The input data type.

  • out_dtype (str) – The output data type.

  • trans_b (bool) – Whether the input matrix B is transposed.

Returns:

ret – A group of tensor intrinsics.

Return type:

Dict[str, str]

tvm.s_tir.tensor_intrin.cuda.get_mma_init_intrin(m_dim: int, n_dim: int, k_dim: int, dtype: str) tuple[PrimFunc, PrimFunc]

Generator of mma init intrins

tvm.s_tir.tensor_intrin.cuda.get_mma_load_intrin(m_dim: int, n_dim: int, k_dim: int, dtype: str, shared_scope: str, is_b: bool, is_col_major: bool) tuple[PrimFunc, PrimFunc]

Generator of mma ldmatrix intrins

tvm.s_tir.tensor_intrin.cuda.get_mma_sync_intrin(m_dim: int, n_dim: int, k_dim: int, in_dtype: str, out_dtype: str, b_transposed: bool) tuple[PrimFunc, PrimFunc]

Generator of mma sync intrins

tvm.s_tir.tensor_intrin.cuda.get_mma_store_dummy_intrin(m_dim: int, n_dim: int, k_dim: int, dtype: str) tuple[PrimFunc, PrimFunc]

Disable mma store intrin for now.

tvm.s_tir.tensor_intrin.arm_cpu

Intrinsics for ARM tensorization.

tvm.s_tir.tensor_intrin.arm_cpu.get_sme_transpose_interleave_2svlx2svl_fp32_intrin(cols, rows)

Transpose a matrix of size 2SVL x 2SVL (where ‘SVL’ is the Scalable Vector Length) using the Scalable Matrix Extension (SME).

This is completed by loading rows of the input matrix into the accumulator tile, then storing the columns. The SME accumulator tile is divided into a series of sub-tiles which must be loaded and stored independently.

Example

An example case for float32. In this instance the accumulator tile is divided into 4 sub-tiles of size SVLxSVL numbered 0-3. We start by loading rows of A, each SVL in length, into each of the sub-tiles. In the diagram below, each load for a sub-tile is sequenced by a, b, … till the tile is full.

The columns of each sub-tile are then stored into A_t. Note that to perform a transpose, the contents of sub-tile 1 and 2 are stored in opposite locations - see the diagram below.

A:                                  Accumulator tile:                     A_t:
            2SVL                                2SVL                               2SVL
     +----------------+                 +-----------------+                +-------------------+
     | --0a--  --1a-- |                 |                 |                | |  |     |  |     |
     | --0b--  --1b-- |                 |    0       1    |                | 0a 0b .. 2a 2b .. |
     |   ...     ...  | ld1w.horiz      |                 | st1w.vert      | |  |     |  |     |
2SVL | --2a--  --3a-- |   ====>    2SVL |                 |   ====>   2SVL | |  |     |  |     |
     | --2b--  --3b-- |                 |    2       3    |                | 1a 1b .. 3a 3b .. |
     |   ...     ...  |                 |                 |                | |  |     |  |     |
     +----------------+                 +-----------------+                +-------------------+
Returns:

intrin – The SME TensorIntrin that can be used in tensorizing a schedule.

Return type:

TensorIntrin
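The transpose-via-sub-tiles semantics above can be modelled in NumPy. The SVL value below is an illustrative assumption; real hardware fixes the scalable vector length at runtime:

```python
import numpy as np

# NumPy model of the 2SVL x 2SVL transpose described above.
SVL = 4  # number of fp32 elements per scalable vector (assumption)

A = np.arange((2 * SVL) ** 2, dtype=np.float32).reshape(2 * SVL, 2 * SVL)

# Load each SVL x SVL quadrant of A into a sub-tile row by row, then
# store its columns. Sub-tiles 1 and 2 land in swapped positions in the
# output, which is what makes the whole operation a transpose.
A_t = np.empty_like(A)
for i in range(2):        # sub-tile row index within A
    for j in range(2):    # sub-tile column index within A
        sub = A[i * SVL:(i + 1) * SVL, j * SVL:(j + 1) * SVL]
        A_t[j * SVL:(j + 1) * SVL, i * SVL:(i + 1) * SVL] = sub.T
```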

tvm.s_tir.tensor_intrin.arm_cpu.get_sme_transpose_interleave_block2_2svl_fp16_intrin()

Transpose and block pack a matrix of size 2SVL x 1SVL (where ‘SVL’ is the Scalable Vector Length for the fp16 datatype) using the Scalable Matrix Extension (SME).

Rows of the fp16 input matrix are loaded into the accumulator tile and columns are stored as fp32 SVL length vectors to the output matrix. When loading, the accumulator tile is interpreted to be of shape 2 * 8 * vscale x 8 * vscale. When storing, we interpret the accumulator tile to be of shape 2 * 4 * vscale x 2 * 4 * vscale.

Example

In the fp16 instance, the accumulator tile consists of two sub-tiles numbered 0-1. Rows of A are loaded onto the accumulator tile by interleaving rows in the first half (0, SVL//2] of the tile and rows in the second half (SVL//2, SVL]. Columns of fp32 values are stored into the output buffer. The fp32 store is used to group pairs of consecutive values together, resulting in the arrangement displayed below.

A:                            Accumulator tile:              A_t:
+----------------+            +----------------+             +-------------------------------------------------+
|-------0a-------|            |-------0a-------|             | 0a.0 0a.1 0b.0 0b.1 ... | 1a.0 1a.1 1b.0 1b.1 ... |
|-------0b-------|            |-------0x-------|             | 0x.0 0x.1 0y.0 0y.1 ... | 1x.0 1x.1 1y.0 1y.1 ... |
|      ...       |            |-------0b-------|             | 0a.2 0a.3 0b.2 0b.3 ... | 1a.2 1a.3 1b.2 1b.3 ... |
|-------0x-------| ld1h.horiz |-------0y-------| st1w.vert   | 0x.2 0x.3 0y.2 0y.3 ... | 1x.2 1x.3 1y.2 1y.3 ... |
|-------0y-------|   ====>    |      ...       |   ====>     |  ...  ...  ...  ...     |  ...  ...  ...  ...     |
|      ...       |            |                |             +-------------------------------------------------+
|================|            |================|
|-------1a-------|            |-------1a-------|
|-------1b-------|            |-------1x-------|
|      ...       |            |-------1b-------|
|-------1x-------|            |-------1y-------|
|-------1y-------|            |      ...       |
|      ...       |            |                |
+----------------+            +----------------+

In the A_t output matrix in the diagram above, .x is used to denote the offset into the labelled row.

Returns:

intrin – The SME TensorIntrin that can be used in tensorizing a schedule.

Return type:

TensorIntrin

tvm.s_tir.tensor_intrin.arm_cpu.get_sme_gemm_interleaved_mopa_2svlx2svl_intrin(M, K, in_dtype)

Compute a GEMM of size 2SVL x 2SVL (where ‘SVL’ is the Scalable Vector Length) using outer product operations from the Scalable Matrix Extension (SME).

The inputs A and B are expected to be of size K x 2SVL and produce a result C of size 2SVL x 2SVL.

The SME accumulator tile is divided into sub-tiles, each of which is utilized to calculate the outer-product using columns / rows of A and B respectively. For each sub-tile, elements in the first column of input matrix A (accessed sequentially due to being transpose-interleaved) and first row of input matrix B are used to calculate an outer-product. This is then accumulated with the result of performing an outer-product on the second column and row of A and B respectively. This process is repeated K times. Finally, the results of the accumulation are stored.

Note: The input tensor ‘A’ must be transpose-interleaved.

Example

Diagram showing outer-product performed on each of the accumulator sub-tiles for the fp32 datatype:

                SVL           SVL
         +----------------------------+
         |       l     |       h      | K
     K   +----------------------------+
  +---+  +----------------------------+
  |   |  |  0:            1:          |-+
  |   |  |  mopa(l, l)    mopa(l, h)  | |-+
l |   |  |                            | | |
  |   |  |                            | | |
  |---|  |                            | | |
  |   |  |  2:            3:          | | |
h |   |  |  mopa(h, l)    mopa(h, h)  | | |
  |   |  |                            | | |
  |   |  |                            | | |
  +---+  +----------------------------+ | |
           +----------------------------+ |
              +---------------------------+
                             (accumulate K times)

Pseudo code computing 2SVL x 2SVL GEMM for fp32 inputs:

// Number of fp32 elements in a scalable vector
int SVF = SVL / 32;

// Reset the accumulator tile
sme.zero();

// Calculate outer products and accumulate
for (k = 0; k < K; k++) {
    float32xSVF A_row_0 = A[k][0];
    float32xSVF A_row_1 = A[k][SVF];
    float32xSVF B_row_0 = B[k][0];
    float32xSVF B_row_1 = B[k][SVF];

    float32xSVFxSVF sub_tile_0 += sme.mopa(A_row_0, B_row_0);
    float32xSVFxSVF sub_tile_1 += sme.mopa(A_row_0, B_row_1);
    float32xSVFxSVF sub_tile_2 += sme.mopa(A_row_1, B_row_0);
    float32xSVFxSVF sub_tile_3 += sme.mopa(A_row_1, B_row_1);
}

// Store the results of accumulation
for (i = 0; i < SVF; i++) {
    C[i][0]         = sme.horiz(sub_tile_0[i]);
    C[i][SVF]       = sme.horiz(sub_tile_1[i]);
    C[i + SVF][0]   = sme.horiz(sub_tile_2[i]);
    C[i + SVF][SVF] = sme.horiz(sub_tile_3[i]);
}

Notes:

  • Recall that A has been transposed beforehand such that each column is now accessed by row.

  • ‘sme.zero’ resets the accumulator tile to contain all zeros.

  • ‘sme.mopa’ is the outer product and accumulate intrinsic.

  • ‘sme.horiz’ stores rows of an accumulator sub-tile to memory.
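The accumulation scheme can be checked against a NumPy model. SVF below is an illustrative assumption for the number of fp32 lanes per scalable vector:

```python
import numpy as np

# NumPy model of the outer-product accumulation described above.
SVF, K = 4, 8
rng = np.random.default_rng(0)

# A is the transpose-interleaved LHS (K x 2SVL), so column c of the
# logical LHS matrix is A[:, c]; B is K x 2SVL.
A = rng.random((K, 2 * SVF), dtype=np.float32)
B = rng.random((K, 2 * SVF), dtype=np.float32)

# One SVF x SVF accumulator sub-tile per (half of A row, half of B row).
tiles = [np.zeros((SVF, SVF), dtype=np.float32) for _ in range(4)]
for k in range(K):
    a_l, a_h = A[k, :SVF], A[k, SVF:]
    b_l, b_h = B[k, :SVF], B[k, SVF:]
    tiles[0] += np.outer(a_l, b_l)  # mopa(l, l)
    tiles[1] += np.outer(a_l, b_h)  # mopa(l, h)
    tiles[2] += np.outer(a_h, b_l)  # mopa(h, l)
    tiles[3] += np.outer(a_h, b_h)  # mopa(h, h)

# Assemble the 2SVL x 2SVL result from the four sub-tiles.
C = np.block([[tiles[0], tiles[1]], [tiles[2], tiles[3]]])
```

Since A holds the transposed LHS, the result equals A.T @ B.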

Returns:

intrin – The SME TensorIntrin that can be used in tensorizing a schedule.

Return type:

TensorIntrin

tvm.s_tir.tensor_intrin.arm_cpu.get_sme_init_intrin()

Reset the entire matrix tile storage to 0.

tvm.s_tir.tensor_intrin.x86

Intrinsics for x86 tensorization.

tvm.s_tir.tensor_intrin.rocm

Intrinsics for AMDGPU tensorization.

tvm.s_tir.tensor_intrin.metal

Intrinsics for tensorization on Apple GPU.

tvm.s_tir.tensor_intrin.metal.get_simdgroup_index(buffer: Buffer, stride: PrimExpr, col: int, row: int)

Compute simdgroup index using elem_offset of the buffer

tvm.s_tir.tensor_intrin.metal.get_simdgroup_intrin_group(load_scope: Literal['shared'], store_scope: Literal['global', 'shared'], dtype: str, trans_a: bool = False, trans_b: bool = False) dict[str, str]

Get a group of intrinsics for tensorization on Apple GPU.

Parameters:
  • load_scope (Literal["shared"]) – The memory scope of the input buffer.

  • store_scope (Literal["global", "shared"]) – The memory scope of the result buffer.

  • dtype (str) – The data type of the input and output buffers.

  • trans_a (bool) – Whether the input matrix A is transposed.

  • trans_b (bool) – Whether the input matrix B is transposed.

Returns:

ret – A group of tensor intrinsics.

Return type:

Dict[str, str]

tvm.s_tir.tensor_intrin.hexagon

Intrinsics for Hexagon tensorization.

tvm.s_tir.tensor_intrin.hexagon.generate_dma_load_intrin(size: int, dtype: str)

Generator of dma_load intrins

tvm.s_tir.tensor_intrin.riscv_cpu

Intrinsics for RISC-V tensorization.

tvm.s_tir.tensor_intrin.riscv_cpu.get_max_elems(vlen: int, lmul: int, sew: int) int

Returns the number of elements of a given width (SEW) that fit in a group (LMUL) of vector registers of length VLEN.

Parameters:
  • vlen (int) – VLEN vector length in bits

  • lmul (int) – LMUL vector length multiplier

  • sew (int) – SEW standard (single) element width

Returns:

Number of elements

Return type:

int
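The computation reduces to a one-line formula. A hedged sketch follows; this is an illustrative reimplementation, not the library function itself:

```python
# A group of LMUL vector registers holds VLEN * LMUL bits in total, and
# each element occupies SEW bits, so the element count is their quotient.
def max_elems(vlen: int, lmul: int, sew: int) -> int:
    return (vlen * lmul) // sew

# e.g. 128-bit registers, LMUL=4 grouping, 8-bit elements
n = max_elems(128, 4, 8)  # 64 elements
```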

tvm.s_tir.tensor_intrin.riscv_cpu.rvv_vec_dot_product_kernels(n_elems: int, n_lanes: int, data_dtype: str, weight_dtype: str, out_dtype: str, lmul: int)

Dot product of a vector and matrix rows using RISC-V vector instructions.

These kernels take two arrays A[ELEMS] and B[LANES][ELEMS] and compute the dot product of A[ELEMS] with each row of B, accumulating the results into C[LANES].

The pseudo code is as follows:

void vec_dot_prod(A[ELEMS], B[LANES][ELEMS], C[LANES]){
    for (j = 0; j < LANES; j++) {
        for (k = 0; k < ELEMS; k++) {
            C[j] += A[k] * B[j][k];
        }
    }
}
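The pseudo code above can be mirrored in NumPy to make the accumulation concrete; the sizes below are illustrative:

```python
import numpy as np

# NumPy equivalent of the pseudo code above: the dot product of A with
# each row of B is accumulated into C.
LANES, ELEMS = 4, 16
rng = np.random.default_rng(0)
A = rng.integers(-8, 8, size=ELEMS).astype(np.int32)
B = rng.integers(-8, 8, size=(LANES, ELEMS)).astype(np.int32)
C = rng.integers(-8, 8, size=LANES).astype(np.int32)

expected = C + B @ A  # what the kernel should leave in C

for j in range(LANES):
    for k in range(ELEMS):
        C[j] += A[k] * B[j, k]
```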
tvm.s_tir.tensor_intrin.riscv_cpu.register_riscv_intrinsics(target: Target)

Register RISC-V intrinsics.

Parameters:

target (Target) – TVM target

tvm.s_tir.tensor_intrin.dot_product_common

Dot product related intrinsics.