tvm.s_tir.tensor_intrin
Intrinsics for tensorization.
- tvm.s_tir.tensor_intrin.enabled(target)
Check whether the module runtime is enabled for the given target.
- Parameters:
target (str or Dict[str, Any] or tvm.target.Target) – The target device type.
- Returns:
enabled – Whether runtime is enabled.
- Return type:
bool
Examples
The following code checks if gpu is enabled.
>>> tvm.runtime.enabled("gpu")
tvm.s_tir.tensor_intrin.cuda
Intrinsics for tensorization on NVIDIA GPU.
- tvm.s_tir.tensor_intrin.cuda.get_mma_intrin_group(load_scope: Literal['shared', 'shared.dyn'], store_scope: Literal['global', 'shared', 'shared.dyn'], in_dtype: Literal['float16', 'int8', 'float8_e4m3fn', 'float8_e5m2'], out_dtype: Literal['float16', 'float32', 'int32'], trans_a: bool, trans_b: bool, not_use_mma_store_intrinic: bool = True, store_to_smem_dtype: Literal['float16', 'float32', 'int32'] | None = None) dict[str, str]
Get a group of intrinsics for mma tensor core with the given configurations
- Parameters:
load_scope (Literal["shared", "shared.dyn"]) – The memory scope of the input buffer.
store_scope (Literal["global", "shared", "shared.dyn"]) – The memory scope of the result buffer.
in_dtype (str) – The input data type.
out_dtype (str) – The output data dtype.
trans_a (bool) – Whether the input matrix A is transposed.
trans_b (bool) – Whether the input matrix B is transposed.
not_use_mma_store_intrinic (bool) –
Whether to avoid the mma_store intrinsic. If True, plain BufferStore statements are used to store the result of mma; otherwise the mma_store intrinsic is used.
The reason: when shared-memory accesses are swizzled, the rearrangement scheme spans areas accessed by different mma_store calls, which makes swizzling considerably more complex. BufferStore statements do not have this problem.
store_to_smem_dtype (Optional[Literal["float16", "float32", "int32"]]) – The dtype that we use to store from register to shared memory. By default it is out_dtype.
- Returns:
ret – A group of tensor intrinsics.
- Return type:
dict[str, str]
- tvm.s_tir.tensor_intrin.cuda.get_wmma_fragment_index(buffer, stride, m_dim, n_dim)
Compute wmma fragment index using elem_offset of the buffer
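The fragment-index arithmetic can be sketched in plain Python (a hedged model of the computation, not the TIR implementation; the buffer is viewed as a row-major grid of m_dim x n_dim fragments, with `stride` elements per buffer row):

```python
def wmma_fragment_index(elem_offset: int, stride: int, m_dim: int, n_dim: int) -> int:
    """Linear index of the m_dim x n_dim fragment starting at elem_offset."""
    frag_row = elem_offset // stride // m_dim   # fragment row in the grid
    frag_col = elem_offset % stride // n_dim    # fragment column in the grid
    fragments_per_row = stride // n_dim
    return frag_row * fragments_per_row + frag_col

# A 32x32 buffer of 16x16 fragments has two fragments per row:
print(wmma_fragment_index(0, 32, 16, 16))    # -> 0 (fragment (0, 0))
print(wmma_fragment_index(16, 32, 16, 16))   # -> 1 (fragment (0, 1))
print(wmma_fragment_index(512, 32, 16, 16))  # -> 2 (fragment (1, 0))
```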
- tvm.s_tir.tensor_intrin.cuda.get_wmma_load_intrin(m_dim: int, n_dim: int, k_dim: int, dtype: str, shared_scope: str, is_b: bool, is_col_major: bool) tuple[PrimFunc, PrimFunc]
Generator of wmma_load intrins
- tvm.s_tir.tensor_intrin.cuda.get_wmma_fill_intrin(m_dim: int, n_dim: int, k_dim: int, dtype: str) tuple[PrimFunc, PrimFunc]
Generator of wmma_fill intrins
- tvm.s_tir.tensor_intrin.cuda.get_wmma_store_intrin(m_dim: int, n_dim: int, k_dim: int, dtype: str, scope: str) tuple[PrimFunc, PrimFunc]
Generator of wmma_store intrins
- tvm.s_tir.tensor_intrin.cuda.get_wmma_sync_intrin(m_dim: int, n_dim: int, k_dim: int, in_dtype: str, out_dtype: str, b_transposed: bool) tuple[PrimFunc, PrimFunc]
Generator of wmma_sync intrins
- tvm.s_tir.tensor_intrin.cuda.get_wmma_intrin_group(load_scope: Literal['shared', 'shared.dyn'], store_scope: Literal['global', 'shared', 'shared.dyn'], in_dtype: str, out_dtype: str, trans_b: bool) dict[str, str]
Get a group of intrinsics for wmma tensor core with the given configurations
- Parameters:
load_scope (Literal["shared", "shared.dyn"]) – The memory scope of the input buffer.
store_scope (Literal["global", "shared", "shared.dyn"]) – The memory scope of the result buffer.
in_dtype (str) – The input data type.
out_dtype (str) – The output data dtype.
trans_b (bool) – Whether the input matrix B is transposed.
- Returns:
ret – A group of tensor intrinsics.
- Return type:
dict[str, str]
- tvm.s_tir.tensor_intrin.cuda.get_mma_init_intrin(m_dim: int, n_dim: int, k_dim: int, dtype: str) tuple[PrimFunc, PrimFunc]
Generator of mma init intrins
- tvm.s_tir.tensor_intrin.cuda.get_mma_load_intrin(m_dim: int, n_dim: int, k_dim: int, dtype: str, shared_scope: str, is_b: bool, is_col_major: bool) tuple[PrimFunc, PrimFunc]
Generator of mma ldmatrix intrins
tvm.s_tir.tensor_intrin.arm_cpu
Intrinsics for ARM tensorization.
- tvm.s_tir.tensor_intrin.arm_cpu.get_sme_transpose_interleave_2svlx2svl_fp32_intrin(cols, rows)
Transpose a matrix of size 2SVL x 2SVL (where ‘SVL’ is the Scalable Vector Length) using the Scalable Matrix Extension (SME).
This is completed by loading rows of the input matrix into the accumulator tile, then storing the columns. The SME accumulator tile is divided into a series of sub-tiles which must be loaded and stored independently.
Example
An example case for float32. In this instance the accumulator tile is divided into 4 sub-tiles of size SVLxSVL numbered 0-3. We start by loading rows of A, each SVL in length, into each of the sub-tiles. In the diagram below, each load for a sub-tile is sequenced by a, b, … till the tile is full.
The columns of each sub-tile are then stored into A_t. Note that to perform a transpose, the contents of sub-tile 1 and 2 are stored in opposite locations - see the diagram below.
```
           A: 2SVL                  Accumulator tile: 2SVL      A_t: 2SVL
     +----------------+             +-----------------+      +-------------------+
     | --0a-- --1a--  |             |        |        |      |                   |
     | --0b-- --1b--  |             |   0    |   1    |      | 0a 0b .. 2a 2b .. |
     |  ...    ...    | ld1w.horiz  |        |        |      |        ...        |
2SVL | --2a-- --3a--  |   ====>     |--------+--------| st1w.|                   |
     | --2b-- --3b--  |             |   2    |   3    | vert | 1a 1b .. 3a 3b .. |
     |  ...    ...    |             |        |        | ====>|        ...        |
     +----------------+             +-----------------+      +-------------------+
```
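The sub-tile scheme above can be checked with a small pure-Python model (an illustrative sketch with a concrete `svl`, not the SME implementation; `svl` stands in for the number of fp32 lanes in a scalable vector):

```python
def sme_transpose_2svl(a, svl):
    """Transpose a 2*svl x 2*svl matrix the way the intrinsic does:
    load rows into four svl x svl sub-tiles, then store the columns of
    each sub-tile, with sub-tiles 1 and 2 written to swapped positions."""
    n = 2 * svl
    # Load: sub-tile (ti, tj) holds block row ti, block column tj of A
    tiles = {(ti, tj): [[a[ti * svl + i][tj * svl + j] for j in range(svl)]
                        for i in range(svl)]
             for ti in range(2) for tj in range(2)}
    out = [[0] * n for _ in range(n)]
    # Store: columns of tile (ti, tj) land at block (tj, ti) of the output
    for (ti, tj), tile in tiles.items():
        for j in range(svl):          # st1w.vert walks columns
            for i in range(svl):
                out[tj * svl + j][ti * svl + i] = tile[i][j]
    return out

a = [[r * 4 + c for c in range(4)] for r in range(4)]
a_t = sme_transpose_2svl(a, svl=2)   # equals the transpose of a
```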
- Returns:
intrin – The SME TensorIntrin that can be used in tensorizing a schedule.
- Return type:
TensorIntrin
- tvm.s_tir.tensor_intrin.arm_cpu.get_sme_transpose_interleave_block2_2svl_fp16_intrin()
Transpose and block pack a matrix of size 2SVL x 1SVL (where ‘SVL’ is the Scalable Vector Length for the fp16 datatype) using the Scalable Matrix Extension (SME).
Rows of the fp16 input matrix are loaded into the accumulator tile and columns are stored as fp32 SVL length vectors to the output matrix. When loading, the accumulator tile is interpreted to be of shape 2 * 8 * vscale x 8 * vscale. When storing, we interpret the accumulator tile to be of shape 2 * 4 * vscale x 2 * 4 * vscale.
Example
In the fp16 instance, the accumulator tile consists of two sub-tiles numbered 0-1. Rows of A are loaded onto the accumulator tile by interleaving rows in the first half (0, SVL//2] of the tile and rows in the second half (SVL//2, SVL]. Columns of fp32 values are stored into the output buffer. The fp32 store is used to group pairs of consecutive values together, resulting in the arrangement displayed below.
```
      A:                       Accumulator tile:
+----------------+             +----------------+
|-------0a-------|             |-------0a-------|
|-------0b-------|             |-------0x-------|
|      ...       |             |-------0b-------|
|-------0x-------|             |-------0y-------|
|-------0y-------|             |      ...       |
|================|    ====>    |================|    ====>    A_t (below)
|-------1a-------|             |-------1a-------|
|-------1b-------|             |-------1x-------|
|      ...       |             |-------1b-------|
|-------1x-------|             |-------1y-------|
|-------1y-------|             |      ...       |
+----------------+             +----------------+
    ld1h.horiz                     st1w.vert

A_t:
+---------------------------------------------------+
| 0a.0 0a.1 0b.0 0b.1 ... | 1a.0 1a.1 1b.0 1b.1 ... |
| 0x.0 0x.1 0y.0 0y.1 ... | 1x.0 1x.1 1y.0 1y.1 ... |
| 0a.2 0a.3 0b.2 0b.3 ... | 1a.2 1a.3 1b.2 1b.3 ... |
| 0x.2 0x.3 0y.2 0y.3 ... | 1x.2 1x.3 1y.2 1y.3 ... |
|  ...  ...  ...  ...     |  ...  ...  ...  ...     |
+---------------------------------------------------+
```
In the A_t output matrix in the diagram above, .x is used to denote the offset into the labelled row.
- Returns:
intrin – The SME TensorIntrin that can be used in tensorizing a schedule.
- Return type:
TensorIntrin
- tvm.s_tir.tensor_intrin.arm_cpu.get_sme_gemm_interleaved_mopa_2svlx2svl_intrin(M, K, in_dtype)
Compute a GEMM of size 2SVL x 2SVL (where ‘SVL’ is the Scalable Vector Length) using outer product operations from the Scalable Matrix Extension (SME).
The inputs A and B are expected to be of size K x 2SVL and produce a result C of size 2SVL x 2SVL.
The SME accumulator tile is divided into sub-tiles, each of which is utilized to calculate the outer-product using columns / rows of A and B respectively. For each sub-tile, elements in the first column of input matrix A (accessed sequentially due to being transpose-interleaved) and first row of input matrix B are used to calculate an outer-product. This is then accumulated with the result of performing an outer-product on the second column and row of A and B respectively. This process is repeated K times. Finally, the results of the accumulation are stored.
Note: The input tensor ‘A’ must be transpose-interleaved.
Example
Diagram showing outer-product performed on each of the accumulator sub-tiles for the fp32 datatype:
```
                        SVL            SVL
                 +----------------------------+
             B:  |      l       |      h      |
                 +----------------------------+
       K
     +---+       +----------------------------+
     |   |       | 0:           | 1:          |
     | l |       |  mopa(l, l)  |  mopa(l, h) |
     |   |       |              |             |
     |---|  ===> |--------------+-------------|
     |   |       | 2:           | 3:          |
     | h |       |  mopa(h, l)  |  mopa(h, h) |
     |   |       |              |             |
     +---+       +----------------------------+
       A              (accumulate K times)
```
Pseudo code computing 2SVL x 2SVL GEMM for fp32 inputs:
```c
// Number of fp32 elements in a scalable vector
int SVF = SVL / 32;

// Reset the accumulator tile
sme.zero();

// Calculate outer products and accumulate
for (k = 0; k < K; k++) {
    float32xSVF A_row_0 = A[k][0];
    float32xSVF A_row_1 = A[k][SVF];
    float32xSVF B_row_0 = B[k][0];
    float32xSVF B_row_1 = B[k][SVF];

    float32xSVFxSVF sub_tile_0 += sme.mopa(A_row_0, B_row_0);
    float32xSVFxSVF sub_tile_1 += sme.mopa(A_row_0, B_row_1);
    float32xSVFxSVF sub_tile_2 += sme.mopa(A_row_1, B_row_0);
    float32xSVFxSVF sub_tile_3 += sme.mopa(A_row_1, B_row_1);
}

// Store the results of accumulation
for (i = 0; i < SVF; i++) {
    C[i][0]         = sme.horiz(sub_tile_0[i]);
    C[i][SVF]       = sme.horiz(sub_tile_1[i]);
    C[i + SVF][0]   = sme.horiz(sub_tile_2[i]);
    C[i + SVF][SVF] = sme.horiz(sub_tile_3[i]);
}
```
Notes:
Recall that A has been transposed beforehand such that each column is now accessed by row.
‘sme.zero’ resets the accumulator tile to contain all zeros.
‘sme.mopa’ is the outer product and accumulate intrinsic.
‘sme.horiz’ stores rows of an accumulator sub-tile to memory.
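The accumulation scheme can be verified with a small Python model (an illustrative sketch, not the SME code: `mopa` is modelled as a plain outer-product accumulate, and A is assumed already transpose-interleaved, i.e. stored as K rows each holding one column of the original matrix):

```python
def sme_gemm(a_t, b, m, n, k):
    """C = A @ B, computed as a sum of K outer products.

    a_t is K x M (each row is a column of the original A), b is K x N."""
    c = [[0.0] * n for _ in range(m)]      # sme.zero(): clear the accumulator
    for kk in range(k):
        for i in range(m):                 # sme.mopa: outer product + accumulate
            for j in range(n):
                c[i][j] += a_t[kk][i] * b[kk][j]
    return c

a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]                       # 2 x 3
a_t = [[a[i][kk] for i in range(2)] for kk in range(3)]      # 3 x 2, transposed
b = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]                     # 3 x 2
c = sme_gemm(a_t, b, m=2, n=2, k=3)                          # -> [[4.0, 5.0], [10.0, 11.0]]
```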
- Returns:
intrin – The SME TensorIntrin that can be used in tensorizing a schedule.
- Return type:
TensorIntrin
- tvm.s_tir.tensor_intrin.arm_cpu.get_sme_init_intrin()
Reset the entire matrix tile storage to 0.
tvm.s_tir.tensor_intrin.x86
Intrinsics for x86 tensorization.
tvm.s_tir.tensor_intrin.rocm
Intrinsics for AMDGPU tensorization.
tvm.s_tir.tensor_intrin.metal
Intrinsics for tensorization on Apple GPU.
- tvm.s_tir.tensor_intrin.metal.get_simdgroup_index(buffer: Buffer, stride: PrimExpr, col: int, row: int)
Compute simdgroup index using elem_offset of the buffer
- tvm.s_tir.tensor_intrin.metal.get_simdgroup_intrin_group(load_scope: Literal['shared'], store_scope: Literal['global', 'shared'], dtype: str, trans_a: bool = False, trans_b: bool = False) dict[str, str]
Get a group of intrinsics for tensorization on Apple GPU.
- Parameters:
load_scope (Literal["shared"]) – The memory scope of the input buffer.
store_scope (Literal["global", "shared"]) – The memory scope of the result buffer.
dtype (str) – The data type of the input and output buffers.
trans_a (bool) – Whether the input matrix A is transposed.
trans_b (bool) – Whether the input matrix B is transposed.
- Returns:
ret – A group of tensor intrinsics.
- Return type:
dict[str, str]
tvm.s_tir.tensor_intrin.hexagon
Intrinsics for Hexagon tensorization.
tvm.s_tir.tensor_intrin.riscv_cpu
Intrinsics for RISC-V tensorization.
- tvm.s_tir.tensor_intrin.riscv_cpu.get_max_elems(vlen: int, lmul: int, sew: int) int
Returns the number of elements of a given width (SEW, in bits) that fit in a group of LMUL vector registers, each VLEN bits long.
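The relationship is simply VLEN * LMUL / SEW; a sketch of the arithmetic (names mirror the parameters above, not the TVM implementation):

```python
def max_elems(vlen: int, lmul: int, sew: int) -> int:
    """Elements of width `sew` bits that fit in a group of `lmul`
    vector registers, each `vlen` bits long."""
    return vlen * lmul // sew

# A 128-bit vector unit with LMUL=4 holds 16 fp32 (SEW=32) elements:
print(max_elems(128, 4, 32))  # -> 16
```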
- tvm.s_tir.tensor_intrin.riscv_cpu.rvv_vec_dot_product_kernels(n_elems: int, n_lanes: int, data_dtype: str, weight_dtype: str, out_dtype: str, lmul: int)
Dot product of vector and matrix rows using RISC-V vector instructions.
These kernels take a vector A[ELEMS] and a matrix B[LANES][ELEMS] and compute the dot product of A with each of the LANES rows of B, accumulating the results into C[LANES].
The pseudo code is as follows:
```c
void vec_dot_prod(A[ELEMS], B[LANES][ELEMS], C[LANES]) {
    for (j = 0; j < LANES; j++) {
        for (k = 0; k < ELEMS; k++) {
            C[j] += A[k] * B[j][k];
        }
    }
}
```
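The pseudo code translates directly into a Python reference model (a sketch of the semantics only, not the vectorized RVV kernel):

```python
def vec_dot_prod(a, b, c):
    """c[j] += dot(a, b[j]) for each of the LANES rows of b."""
    lanes, elems = len(b), len(a)
    for j in range(lanes):
        for k in range(elems):
            c[j] += a[k] * b[j][k]
    return c

a = [1, 2, 3]
b = [[1, 1, 1], [0, 1, 0]]
c = vec_dot_prod(a, b, [0, 0])  # -> [6, 2]
```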
tvm.s_tir.tensor_intrin.dot_product_common
Dot product related intrinsics.