tvm.backend.cuda
The CUDA backend — the tile-primitive dispatch, intrinsic builders, the T.cuda
/ T.ptx script namespaces, and the shared/tensor-memory pools — lives under
tvm.backend.cuda, separate from the TIRx frontend (tvm.tirx). Other
backends sit alongside it (tvm.backend.rocm and so on).
tvm.backend.cuda
CUDA-owned TIRx modules.
- tvm.backend.cuda.register_backend()
Register CUDA-owned Python semantics.
- tvm.backend.cuda.script_namespace(**kwargs)
Return the CUDA TVMScript namespace object.
- tvm.backend.cuda.script_namespaces(**_)
Return CUDA-owned TVMScript namespaces.
tvm.backend.cuda.lang
CUDA-specific TIRx language helpers.
tvm.backend.cuda.op
CUDA, PTX, and NVSHMEM TIR intrinsic builders.
- tvm.backend.cuda.op.const(value, dtype=None, span=None)
construct a constant
- tvm.backend.cuda.op.bitwise_and(x, y, span=None)
Take bitwise and of two values
- tvm.backend.cuda.op.call_intrin(dtype, func_name, *args, attrs=None, span=None)
Build expression by calling an intrinsic function.
Intrinsics can be overloaded with multiple data types via the intrinsic translation rule.
- Parameters:
- Returns:
call – The call expression.
- Return type:
- tvm.backend.cuda.op.tvm_access_ptr(ptype, data, offset, extent, rw_mask)
Get head access address with memory access pattern info
- Parameters:
ptype (Expr or str) – The data type of pointer. If a
str, it is wrapped viatype_annotation()so that the lowering rule (which readsargs[0].dtype()for the cast type) sees the intended dtype instead ofvoidfrom a raw StringImm.data (DType*) – The data of pointer.
offset (int) – The offset of pointer.
extent (int) – The extent of pointer.
rw_mask (int) – The read write mask.
- Returns:
call – The call expression.
- Return type:
- tvm.backend.cuda.op.cuda_func_call(func_name, *args, source_code, return_type='void')
TVM intrinsic to call a CUDA function. Source code is provided as a string.
- tvm.backend.cuda.op.cuda_warp_reduce(value, op, width=32)
Warp-level butterfly shuffle-XOR reduction.
Reduces
valueacrosswidthadjacent lanes using the specified operation. Codegen emitslog2(width)steps of__shfl_xor_sync(0xFFFFFFFF, val, mask)with descending XOR masks.- Parameters:
- Returns:
call – The reduced value (same dtype as value).
- Return type:
- tvm.backend.cuda.op.cuda_warp_sum(value, width=32)
Convenience wrapper:
cuda_warp_reduce(value, "sum", width).
- tvm.backend.cuda.op.cuda_warp_max(value, width=32)
Convenience wrapper:
cuda_warp_reduce(value, "max", width).
- tvm.backend.cuda.op.cuda_warp_min(value, width=32)
Convenience wrapper:
cuda_warp_reduce(value, "min", width).
- tvm.backend.cuda.op.cuda_cta_reduce(value, op, num_warps, scratch)
CTA-wide reduction via warp shuffle + shared memory.
Two-step reduction: (1) intra-warp shuffle reduction, (2) warp-0 collects per-warp partials from
scratch, reduces, broadcasts via__syncthreads(). All CTA threads must participate.- Parameters:
- Returns:
call – The reduced value broadcast to all threads (same dtype as value).
- Return type:
- tvm.backend.cuda.op.cuda_cta_sum(value, num_warps, scratch)
Convenience wrapper:
cuda_cta_reduce(value, "sum", num_warps, scratch).
- tvm.backend.cuda.op.cuda_cta_max(value, num_warps, scratch)
Convenience wrapper:
cuda_cta_reduce(value, "max", num_warps, scratch).
- tvm.backend.cuda.op.cuda_cta_min(value, num_warps, scratch)
Convenience wrapper:
cuda_cta_reduce(value, "min", num_warps, scratch).
- tvm.backend.cuda.op.cuda_copy_bytes(dst, src, num_bytes)
Typed load/store copy of
num_bytesbytes.Copies
num_bytesbytes fromsrctodstusing a single typed load/store instruction. Codegen selects the appropriate C++ vector type (uint4,uint2,unsigned int, etc.).
- tvm.backend.cuda.op.cuda_copy_128b(dst, src)
Convenience wrapper:
cuda_copy_bytes(dst, src, 16)— copies 128 bits.
- tvm.backend.cuda.op.cuda_copy_64b(dst, src)
Convenience wrapper:
cuda_copy_bytes(dst, src, 8)— copies 64 bits.
- tvm.backend.cuda.op.cuda_copy_32b(dst, src)
Convenience wrapper:
cuda_copy_bytes(dst, src, 4)— copies 32 bits.
- tvm.backend.cuda.op.cuda_copy_16b(dst, src)
Convenience wrapper:
cuda_copy_bytes(dst, src, 2)— copies 16 bits.
- tvm.backend.cuda.op.cuda_copy_8b(dst, src)
Convenience wrapper:
cuda_copy_bytes(dst, src, 1)— copies 8 bits.
- tvm.backend.cuda.op.cuda_warp_sync()
TVM intrinsic to synchronize threads within the current warp.
This lowers to a CUDA __syncwarp() call.
- Returns:
call – The call expression.
- Return type:
- tvm.backend.cuda.op.cuda_cta_sync()
TVM intrinsic to call CUDA syncthreads (block-wide barrier)
- Returns:
call – The call expression.
- Return type:
- tvm.backend.cuda.op.cuda_grid_sync()
TVM intrinsic to call CUDA grid-wide sync (cooperative groups)
- Returns:
call – The call expression.
- Return type:
- tvm.backend.cuda.op.cuda_cluster_sync()
TVM intrinsic to call CUDA cluster-wide barrier sync
- Returns:
call – The call expression.
- Return type:
- tvm.backend.cuda.op.cuda_thread_rank()
TVM intrinsic that returns
cooperative_groups::thread_rank()for the enclosing CTA – the linear thread index within the block.Useful for building “single thread of CTA” predicates without referencing user-declared scope_id vars. For example, the idiomatic mbarrier.init leader predicate is:
T.cuda.thread_rank() == 0
- Returns:
call – The call expression (
int32).- Return type:
- tvm.backend.cuda.op.cuda_half2float(src)
TVM intrinsic to convert half to float
- tvm.backend.cuda.op.cuda_bfloat162float(src)
TVM intrinsic to convert bfloat16 to float
- tvm.backend.cuda.op.cuda_float22half2(dst, src)
TVM intrinsic to convert float2 to half2 with rounding
- tvm.backend.cuda.op.cuda_trap_when_assert_failed(cond)
TVM intrinsic to trap when assertion failed (cond == false)
- tvm.backend.cuda.op.cuda_runtime_instr_desc(desc, sf_id)
TVM intrinsic to update runtime instruction descriptor
- tvm.backend.cuda.op.cuda_half8tofloat8(src_addr, dst_addr)
TVM intrinsic to convert 8 half2s to 8 float2s
- tvm.backend.cuda.op.cuda_float8tohalf8(src_addr, dst_addr)
TVM intrinsic to convert 8 float2s to 8 half2s
- tvm.backend.cuda.op.ptx_mma_sp(dtype, shape, A_layout, B_layout, A_dtype, B_dtype, C_dtype, multiplicand_a, a_index, multiplicand_b, b_index, accumulator, c_index, metadata, meta_index, sparse_selector, saturate)
TVM intrinsic for sparse tensor core ptx instructions https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-for-sparse-mma
- Parameters:
dtype (str) – The data type of the result.
shape (str) – The shape of mma fragment.
A_layout (Literal["row", "col"]) – The layout of multiplicand fragment A.
B_layout (Literal["row", "col"]) – The layout of multiplicand fragment B.
A_dtype (str) – The data type of multiplicand fragment A.
B_dtype (str) – The data type of multiplicand fragment B.
C_dtype (str) – The data type of multiplicand fragment C.
multiplicand_a (Var) – The multiplicand fragment A variable.
a_index (Expr) – The index of multiplicand fragment A.
multiplicand_b (Var) – The multiplicand fragment B variable.
b_index (Expr) – The index of multiplicand fragment B.
accumulator (Var) – The accumulator fragment C variable.
c_index (Expr) – The index of accumulator fragment C.
metadata (Expr) – The metadata of operand.
meta_index (Expr) – The metadata index of operand.
sparse_selector (Expr) – The sparse selector indicating the thread that stores the metadata.
saturate (bool) – The optional saturation at the output.
- Returns:
call – The call expression.
- Return type:
- tvm.backend.cuda.op.ptx_cp_async_bulk(dtype, shared_ptr, shared_offset, global_ptr, global_offset, bytes, barrier_id)
TVM intrinsic for ptx async copy from global to shared memory using cp.async.bulk https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async-bulk
- Parameters:
dtype (str) – The data type of the result.
shared_ptr (Var) – The shared memory pointer variable.
shared_offset (Expr) – The offset of shared memory pointer.
global_ptr (Var) – The global memory pointer variable.
global_offset (Expr) – The offset of global memory pointer.
bytes (int) – The data size to copy.
barrier_id (int) – The ID of the barrier shared memory pointer.
- Returns:
call – The call expression.
- Return type:
PTX cp.async.bulk.shared::cluster.shared::cta.mbarrier::complete_tx::bytes
Asynchronous bulk copy from executing CTA’s shared memory to a remote CTA’s shared memory within the same cluster.
- Parameters:
dst_ptr (PrimExpr) – Destination pointer in shared::cluster address space (remote CTA).
src_ptr (PrimExpr) – Source pointer in shared::cta address space (local CTA).
size (PrimExpr) – Number of bytes to copy (must be multiple of 16).
mbar (PrimExpr) – Mbarrier address in shared::cluster space for completion signaling, usually produced by
T.ptx.map_shared_rank.
- Returns:
call – The call expression.
- Return type:
- tvm.backend.cuda.op.ptx_cp_async_mbarrier_arrive(barrier_id)
TVM intrinsic for ptx async copy barrier using cp.async.mbarrier.arrive https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#parallel-synchronization-and-communication-instructions-cp-async-mbarrier-arrive
- tvm.backend.cuda.op.ptx_fence(sem: str, scope: str)
TVM intrinsic for PTX fence instruction.
Generates: fence.{sem}.{scope};
- tvm.backend.cuda.op.ptx_fence_proxy_async(space: str = '')
TVM intrinsic for PTX fence.proxy.async instruction.
Generates: fence.proxy.async[.{space}];
- tvm.backend.cuda.op.ptx_mbarrier_init(bar, thread_count)
TVM intrinsic to call mbarrier.init.shared::cta.b64
- tvm.backend.cuda.op.ptx_mbarrier_arrive(bar, cta_id=None, pred=None, count=None)
- TVM intrinsic to call
mbarrier.arrive.shared::cta.b64
- or
@p mapa.shared::cluster.u32 @p mbarrier.arrive.shared::cluster.b64 [, count]
- Parameters:
bar (Var) – The pointer to barrier variable.
cta_id (Optional[PrimExpr]) – The cta id.
pred (Optional[PrimExpr]) – The predicate to guard the operation.
count (Optional[PrimExpr]) – Explicit arrival count operand for the cross-CTA (cluster) form. When
Nonethe implicit count-of-1 form is emitted; when given, emitsmbarrier.arrive.shared::cluster.b64 _, [addr], count.
- tvm.backend.cuda.op.ptx_mbarrier_arrive_cluster_count(bar, cta_id, count)
Cross-CTA
mbarrier.arriveon CTActa_idwith an explicit count.Convenience for an already-elected thread: emits
@p mapa.shared::cluster.u32+@p mbarrier.arrive.shared::cluster.b64 _, [addr], countwith the guard defaulted to 1.
- tvm.backend.cuda.op.ptx_mbarrier_arrive_expect_tx(bar, byte_count, cta_id=None, pred=None)
- TVM intrinsic to call
mbarrier.arrive_expect_tx.shared::cta.b64
- or
@p mapa.shared::cluster.u32 @p mbarrier.arrive_expect_tx.shared::cluster.b64
- Parameters:
- Returns:
call – The call expression.
- Return type:
- tvm.backend.cuda.op.ptx_mbarrier_try_wait(bar, phase)
TVM intrinsic to call mbarrier.try_wait.parity repeatedly until it returns true
- tvm.backend.cuda.op.ptx_mbarrier_try_wait_acquire_cluster(bar, phase)
mbarrier.try_wait.parity.acquire.clusterretry loop.Cluster-scope acquire wait — used to wait on a barrier that a remote CTA in the cluster arrives on (a group cluster wait).
- tvm.backend.cuda.op.ptx_mbarrier_try_wait_once(bar, phase, ticks)
TVM intrinsic for one-shot non-blocking
mbarrier.try_wait.parity.Returns
1if the requested parity has been reached and0otherwise. This is intended for bounded debug waits; production waits should useptx_mbarrier_try_wait().
- tvm.backend.cuda.op.ptx_bar_arrive(name_bar_id, thread_count)
TVM intrinsic to call bar.arrive a, b
- tvm.backend.cuda.op.ptx_bar_sync(name_bar_id, thread_count)
TVM intrinsic to call bar.sync a, {b}
- tvm.backend.cuda.op.ptx_cp_async(dst_ptr, src_ptr, cp_size, *, cache_hint='', cache_policy=None, prefetch_size=-1, predicate=-1, fill_mode='')
TVM intrinsic for ptx async copy from global to shared memory using cp.async https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async
Dispatches to one of three PTX-form-aligned ops:
ptx_cp_async_src_sizeforfill_mode == "zero"(zero-fill viasrc_size = pred ? cp_size : 0).ptx_cp_async_ignore_srcfor a non-emptypredicatewith no fill_mode (setp+@pguards the asm).ptx_cp_async_plainfor the no-predicate / no-fill_mode case.
- Parameters:
shared_ptr (PrimExpr) – The pointer to the shared memory.
global_ptr (PrimExpr) – The pointer to the global memory.
cp_size (int) – The data size to copy.
cache_hint (str["evict_last", "evict_first", "evict_normal", ""]) – The cache hint.
prefetch_size (int[-1, 64, 128, 256]) – The prefetch size.
predicate (PrimExpr) – The predicate to guard the operation.
fill_mode (str["zero", ""]) – The fill mode.
- Returns:
call – The call expression.
- Return type:
- tvm.backend.cuda.op.ptx_cp_async_legacy(*all_args)
Legacy
ptx_cp_asyncAPI taking explicit src/dst offsets.Signature:
(dst_ptr, dst_offset, src_ptr, src_offset, cp_size). Offsets are folded into the pointers viatvm_access_ptrthen dispatched to fork-nativeptx_cp_async().T.ptx.cp_async_legacyruns through_dtype_forwardwhich prepends adtype=kwarg as a leading positional. The dtype names the element type of the buffer (offsets are in elements of that dtype, not bytes), so this function accepts either 5 or 6 positional args.
- tvm.backend.cuda.op.ptx_cp_async_commit_group()
TVM intrinsic for ptx async copy commit https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async-commit-group
- Returns:
call – The call expression.
- Return type:
- tvm.backend.cuda.op.ptx_cp_async_wait_group(num=0)
TVM intrinsic for ptx async copy wait https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async-wait-group
- tvm.backend.cuda.op.ptx_cp_async_bulk_tensor_global_to_cluster(dim, dst_ptr, bar, tensormap_addr, cta_mask, cta_group, cache_hint, *coords, cache_policy=None)
TVM intrinsic to call cp.async.bulk.tensor.dim.shared::cluster.global.tile.mbarrier::complete_tx::bytes
- Parameters:
dim (int) – The dimension of the source tensor.
dst_ptr (PrimExpr) – The destination pointer to the shared memory.
bar (PrimExpr) – The pointer to mbarrier variable.
tensormap_addr (PrimExpr) – The generic address of the tensor map object.
cta_mask (int) – The mask of the cta for multicast.
cta_group (int) –
Must be either 1 or 2. If set to 1, mbarrier must be in the shared memory of the same CTA as the shared memory destination If set to 2, mbarrier can be in shared memory of either the same CTA as the shared memory destination
or the shared memory of the peer CTA.
cache_hint (str) – The cache hint.
coords (List[PrimExpr]) – specifies the starting coordinates in the tensor data in the global memory
- Returns:
call – The call expression.
- Return type:
- tvm.backend.cuda.op.ptx_cp_async_bulk_tensor_tile_gather4_global_to_cluster(dim, dst_ptr, bar, tensormap_addr, cta_mask, cta_group, cache_hint, *coords, cache_policy=None)
TVM intrinsic to call cp.async.bulk.tensor.dim.shared::cluster.global.tile::gather4.mbarrier::complete_tx::bytes
- Parameters:
dim (int) – The dimension of the source tensor.
dst_ptr (PrimExpr) – The destination pointer to the shared memory.
bar (PrimExpr) – The pointer to mbarrier variable.
tensormap_addr (PrimExpr) – The generic address of the tensor map object.
cta_mask (int) – The mask of the cta for multicast.
cta_group (int) – Must be either 1 or 2.
cache_hint (str) – The cache hint.
coords (List[PrimExpr]) – The TMA coordinates followed by the 4 gather row indices.
- Returns:
call – The call expression.
- Return type:
TVM intrinsic to call cp.async.bulk.tensor.dim.global.shared::cta.tile.bulk_group
- Parameters:
dim (int) – The dimension of the copy tensor.
src_ptr (PrimExpr) – The source pointer to the shared memory.
tensormap_addr (PrimExpr) – The generic address of the tensor map object.
cache_hint (str) – The cache hint.
coords (List[PrimExpr]) – specifies the starting coordinates in the tensor data in the global memory
- Returns:
call – The call expression.
- Return type:
- tvm.backend.cuda.op.ptx_cp_async_bulk_tensor_global_to_cluster_prefetch(dim, tensormap_addr, cache_hint, *coords, cache_policy=None)
TVM intrinsic to call cp.async.bulk.prefetch.tensor.dim.L2.global.tile
- Parameters:
- Returns:
call – The call expression.
- Return type:
TVM intrinsic to call cp.reduce.async.bulk.tensor.dim.dst.src.redOp
- Parameters:
dim (int) – The dimension of the copy tensor.
src_ptr (PrimExpr) – The source pointer to the shared memory.
tensormap_addr (PrimExpr) – The generic address of the tensor map object.
cache_hint (str) – The cache hint.
red_op (str) – The reduction operator.
coords (List[PrimExpr]) – The coordinates of the tensor.
- Returns:
call – The call expression.
- Return type:
- tvm.backend.cuda.op.ptx_cp_async_bulk_commit_group()
TVM intrinsic to call cp.async.bulk.tensor.commit_group
- Returns:
call – The call expression.
- Return type:
- tvm.backend.cuda.op.ptx_cp_async_bulk_wait_group(n=0, read=True)
TVM intrinsic to call cp.async.bulk.tensor.wait_group
- tvm.backend.cuda.op.ptx_barrier_cluster_arrive(sem='', aligned=True)
TVM intrinsic to call barrier.cluster.arrive{.sem}{.aligned}
- tvm.backend.cuda.op.ptx_barrier_cluster_wait(acquire=False, aligned=True)
TVM intrinsic to call barrier.cluster.wait{.acquire}{.aligned}
- tvm.backend.cuda.op.ptx_clc_try_cancel(handle, mbar)
TVM intrinsic to call clusterlaunchcontrol.try_cancel.
Async-requests cancelling the next cluster’s launch (work-stealing): writes the 16B response handle to smem and signals
mbar(complete_tx, multicast to both cluster CTAs).
- tvm.backend.cuda.op.ptx_clc_query_cancel(handle)
TVM intrinsic to call clusterlaunchcontrol.query_cancel.
Decodes the response handle written by
ptx_clc_try_cancel(). Returns the cancelled cluster’s firstctaid.x, or0xFFFFFFFFwhen no work was stolen.- Parameters:
handle (PrimExpr) – Pointer to the 16B (uint4) smem response handle.
- tvm.backend.cuda.op.ptx_elect_sync()
TVM intrinsic to call elect.sync
- tvm.backend.cuda.op.ptx_fence_mbarrier_init()
TVM intrinsic for PTX fence.mbarrier_init.release.cluster instruction.
Generates: fence.mbarrier_init.release.cluster;
- Returns:
call – The call expression.
- Return type:
- tvm.backend.cuda.op.ptx_fetch_register(bits, reg_name)
TVM intrinsic to tvm instrinsics to fetch PTX pre-defined registers
- tvm.backend.cuda.op.ptx_mma(shape, a_layout, b_layout, d_type, a_type, b_type, c_type, d_ptrs, a_ptrs, b_ptrs, c_ptrs=None, saturate=False, bit_op=None)
TVM intrinsic for ptx tensor core mma instructions. https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-for-mma
Each per-thread register of every operand is addressed by its OWN pointer (one
void*per b32/f32 register), so the register fragments need not be contiguous in the register file.d_ptrs/a_ptrs/b_ptrs/c_ptrsare lists of one pointer per 32-bit register (b32 for fp16/bf16/tf32/int8 multiplicands, f32/f64 for the accumulator), enumerated in the fixed PTX register order (see the gemm dispatch /tests/python/tirx-base/test_tir_ptx_mma.py).Within one b32 register the packed elements (e.g. 2 fp16 along k_pack) must stay contiguous (stride 1); only the b32 registers themselves may be scattered.
- Parameters:
shape (str) – The shape of mma fragment.
a_layout (Literal["row", "col"]) – The layout of multiplicand fragment A.
b_layout (Literal["row", "col"]) – The layout of multiplicand fragment B.
d_type (str) – The data type of result fragment D.
a_type (str) – The data type of multiplicand fragment A.
b_type (str) – The data type of multiplicand fragment B.
c_type (str) – The data type of accumulator fragment C.
d_ptrs (List[PrimExpr]) – One pointer per result-fragment D register, in PTX order.
a_ptrs (List[PrimExpr]) – One pointer per multiplicand-A register, in PTX order.
b_ptrs (List[PrimExpr]) – One pointer per multiplicand-B register, in PTX order.
c_ptrs (Optional[List[PrimExpr]]) – One pointer per accumulator-C register, in PTX order.
None(the default) means the accumulator is not used (beta == 0): codegen feeds a literal 0 for each C slot.saturate (bool) – The optional saturation at the output.
bit_op (Optional[Literal["xor", "and"]]) – The 1-bit operator (for the b1 subbyte form).
Nonemeans unused.
- Returns:
call – The call expression.
- Return type:
- tvm.backend.cuda.op.ptx_mma_legacy(*all_args, operator=None)
Legacy
ptx_mmaAPI.Signature:
(shape, A_layout, B_layout, A_dtype, B_dtype, C_dtype, multiplicand_a, a_index, multiplicand_b, b_index, accumulator, c_index, saturate, operator=None). The accumulator is reused as both input and output (no separated/cslot), unlike fork-nativeptx_mma()which distinguishes them. Translation:a_dtype, b_dtype, c_dtype→ forka_type, b_type, c_type(and reusec_dtypeas forkd_typesince the accumulator dtype is the output dtype here).(a_ptr, a_offset)and(b_ptr, b_offset)→ folded viatvm_access_ptr().(accumulator, c_index)→ folded; passed for bothd_ptrandc_ptrsince the accumulator is reused as the output.
T.ptx.mma.legacyruns through_dtype_forwardwhich prepends adtype=kwarg as a leading positional, so this function accepts either 13 or 14 positional args.
- tvm.backend.cuda.op.ptx_mma_sp_legacy(*all_args)
Legacy
ptx_mma_spAPI.Signature:
(shape, A_layout, B_layout, A_dtype, B_dtype, C_dtype, multiplicand_a, a_index, multiplicand_b, b_index, accumulator, c_index, metadata, meta_index, sparse_selector, saturate).T.ptx.mma_sp.legacyruns through_dtype_forwardwhich prepends adtype=kwarg as a leading positional, so this function accepts either 16 or 17 positional args.
- tvm.backend.cuda.op.mma_store(dtype, m, n, dst_ptr, src_ptr, src_offset, dst_stride)
Store the result of PTX MMA into a destination pointer.
- tvm.backend.cuda.op.mma_store_legacy(dtype, m, n, dst_ptr, src_ptr, src_offset, dst_stride)
mma_store with apache-style pointer/offset semantics.
- tvm.backend.cuda.op.mma_fill(dtype, local_size, local_ptr, offset)
Zero-initialize an MMA accumulation register.
- tvm.backend.cuda.op.mma_fill_legacy(dtype, local_size, local_ptr, offset)
mma_fill with apache-style pointer/offset semantics.
- tvm.backend.cuda.op.ptx_ldmatrix(trans, num, dtype, smem_ptr, *dst_handles)
TVM intrinsic for ldmatrix.sync.aligned.m8n8.x{num}{.trans}.shared.{dtype}.
Mirrors the PTX ISA destination form: each output register is a separate operand. Pass
T.address_of(buf[idx])(orbuf.ptr_to([idx])) for each destination — the slots may be non-contiguous.- Parameters:
trans (bool) – Apply the
.transmodifier.num (int) – One of 1, 2, 4 — number of m8n8 fragments.
dtype (str) –
"b16"(4 bytes per fragment register) or"b8"(2 bytes per).smem_ptr (PrimExpr) – Generic pointer to source shared memory.
*dst_handles (PrimExpr) – N pointer-to-uint32 destinations, where
N = num if dtype == "b16" else num // 2.https (//docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-ldmatrix)
- tvm.backend.cuda.op.ptx_ldmatrix_legacy(*all_args)
Legacy
ptx_ldmatrixAPI taking explicit offsets.Signature:
(trans, num, dtype, local_ptr, local_offset, smem_ptr, smem_offset). Offsets are folded into the pointers viatvm_access_ptrand dispatched to the fork-nativeptx_ldmatrix().T.ptx.ldmatrix_legacyruns through_dtype_forwardwhich prepends adtype=kwarg as a leading positional naming the buffer element type — offsets are in elements of that dtype, not bytes, so we forward it totvm_access_ptrfor correct scaling.
- tvm.backend.cuda.op.ptx_stmatrix(trans, num, dtype, smem_ptr, *src_handles, shape='m8n8', space='shared')
TVM intrinsic for
stmatrix.sync.aligned.shape.x{num}{.trans}.space.{dtype}.Mirrors
ptx_ldmatrix(): each source register is a separate operand. PassT.address_of(buf[idx])(orbuf.ptr_to([idx])) for each source — the slots may be non-contiguous.- Parameters:
trans (bool) – Apply the
.transmodifier (required forshape == "m16n8").num (int) – One of 1, 2, 4 — number of m8n8 fragments per warp.
dtype (str) –
".b16"(4 bytes per fragment register) or".b8"(2 bytes per).smem_ptr (PrimExpr) – Destination pointer in shared memory.
*src_handles (PrimExpr) –
numpointer-to-uint32 sources.shape (str, keyword-only, default "m8n8") –
"m8n8"or"m16n8".space (str, keyword-only, default "shared") –
"shared"or"shared::cta".https (//docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-stmatrix)
- tvm.backend.cuda.op.ptx_wgmma_encode_matrix_descriptor(desc, addr, ldo, sdo, swizzle)
TVM intrinsic to create memory descriptor for wgmma instructions
- tvm.backend.cuda.op.ptx_wgmma_noop_barrier(reg)
TVM intrinsic to call “” : “+{format}”(reg)::”memory”
- tvm.backend.cuda.op.ptx_wgmma_mma_async_ss(descA, descB, *accums, M, N, K, in_dtype, out_dtype, transA, transB, scaleA, scaleB, scaleD)
TVM intrinsic to call wgmma.mma_async.sync.aligned.shape.dtype.atype.btype over 2 smem operators
- Parameters:
M (int) – The number of rows in matrix A and D.
N (int) – The number of columns in matrix B and D.
K (int) – The number of columns in matrix A and rows in matrix B.
in_dtype (str) – The data type of the input matrices.
out_type (str) – The data type of the output matrices.
transA (bool) – True for M/N major, False for K major.
transB (bool) – True for M/N major, False for K major.
scaleA (float) – The scaling factor for matrix A.
scaleB (float) – The scaling factor for matrix B.
scaleD (PrimExpr) – True: D = A * B + D, False: D = A * B.
descA (PrimExpr) – The SMEM descriptor of matrix A
descB (PrimExpr) – The SMEM descriptor of matrix B
accums (list) – The accumulators registers.
- tvm.backend.cuda.op.ptx_wgmma_mma_async_rs(descB, *reg_list, M, N, K, in_dtype, out_dtype, transA, transB, scaleA, scaleB, scaleD)
- TVM intrinsic to call wgmma.mma_async.sync.aligned.shape.dtype.atype.btype
When A is in register and B is in shared memory
- Parameters:
M (int) – The number of rows in matrix A and D.
N (int) – The number of columns in matrix B and D.
K (int) – The number of columns in matrix A and rows in matrix B.
in_dtype (str) – The data type of the input matrices.
out_type (str) – The data type of the output matrices.
transA (bool) – True for M/N major, False for K major.
transB (bool) – True for M/N major, False for K major.
scaleA (float) – The scaling factor for matrix A.
scaleB (float) – The scaling factor for matrix B.
scaleD (PrimExpr) – True: D = A * B + D, False: D = A * B.
descB (PrimExpr) – The SMEM descriptor of matrix B
reg_list (list) – The A registers and accumulators registers.
- tvm.backend.cuda.op.ptx_wgmma_fence()
TVM intrinsic to call wgmma.fence.sync.aligned
- Returns:
call – The call expression.
- Return type:
- tvm.backend.cuda.op.ptx_wgmma_commit_group()
TVM intrinsic to call wgmma.commit_group.sync.aligned
- Returns:
call – The call expression.
- Return type:
- tvm.backend.cuda.op.ptx_wgmma_wait_group(n)
TVM intrinsic to call wgmma.wait_group.sync.aligned
- tvm.backend.cuda.op.ptx_setmaxnreg(inc: bool, reg_count)
TVM intrinsic to call setmaxnreg.action.sync.aligned.u32 imm-reg-count
- tvm.backend.cuda.op.ptx_tcgen05_alloc(dst_ptr, n_cols, cta_group=1)
- TVM intrinsic to call tcgen05.alloc.cta_group.sync.aligned
Dynamically allocates the number of cols in tensor memory, and write the address of allocated memory to shared memory.
- Parameters:
dst_ptr (Var) – The pointer to the destination shared memory.
n_cols (int) – The number of columns to allocate in tensor memory. Must be a multiple of 32 and a power of 2, and within the range [32, 512].
cta_group (int) – The number of CTA groups involved in the allocation. If cta_group=1, one warp from CTA performs the allocation. Else, if cta_group=2, one warp from each of the peer CTAs perform the allocation.
- tvm.backend.cuda.op.ptx_tcgen05_dealloc(taddr, n_cols, cta_group=1)
- TVM intrinsic to call tcgen05.dealloc.cta_group.sync.aligned
Deallocates the tensor memory specified by the tensor memory address taddr.
- Parameters:
taddr (PrimExpr) – The address of previously allocated tensor memory, should be uint32_t.
n_cols (int) – The number of columns to deallocate in tensor memory. Must be a multiple of 32 and a power of 2, and within the range [32, 512].
cta_group (int) – The number of CTA groups involved in the deallocation. If cta_group=1, one warp from CTA performs the deallocation. Else, if cta_group=2, one warp from each of the peer CTAs perform the deallocation.
- tvm.backend.cuda.op.ptx_tcgen05_relinquish_alloc_permit(cta_group=1)
- TVM intrinsic to call tcgen05.relinquish_alloc_permit.cta_group.sync.aligned
The CTA of the executing thread is relinquishing the right to allocate Tensor Memory after calling this op.
- Parameters:
cta_group (int) – The number of CTA groups involved in relinquishing. If cta_group=1, one warp from CTA performs the relinquishing. Else, if cta_group=2, one warp from each of the peer CTAs perform the relinquishing.
- tvm.backend.cuda.op.ptx_tcgen05_encode_matrix_descriptor(desc, addr, ldo, sdo, swizzle)
TVM intrinsic to create memory descriptor for tcgen05 instructions
- tvm.backend.cuda.op.ptx_tcgen05_encode_instr_descriptor(desc, *, d_dtype, a_dtype, b_dtype, M, N, K, trans_a, trans_b, n_cta_groups=1, neg_a=False, neg_b=False, sat_d=False, is_sparse=False)
TVM intrinsic to create instruction descriptor for tcgen05 MMA without block scaling
- Parameters:
desc (PrimExpr) – The pointer to the instruction descriptor.
d_dtype (str) – The datatype of resultant matrix D.
a_dtype (str) – The datatype of multiplicand matrix A.
b_dtype (str) – The datatype of multiplicand matrix B.
M (int) – The size of non-reduction dimension of Matrix A.
N (int) – The size of non-reduction dimension of Matrix B.
K (int) – The size of reduction dimension of Matrix A/B.
trans_a (bool) – Whether the multiplicand matrix A is transposed. True for M/N major, False for K major.
trans_b (bool) – Whether the multiplicand matrix B is transposed. True for M/N major, False for K major.
n_cta_groups (int) – The number of CTA groups involved in the MMA operation.
neg_a (bool) – Whether to negate the multiplicand matrix A.
neg_b (bool) – Whether to negate the multiplicand matrix B.
sat_d (bool) – Whether to saturate the resultant matrix D.
is_sparse (bool) – Whether the MMA operation is sparse.
- tvm.backend.cuda.op.ptx_tcgen05_encode_instr_descriptor_block_scaled(desc, *, d_dtype, a_dtype, b_dtype, sfa_dtype, sfb_dtype, sfa_tmem_addr, sfb_tmem_addr, M, N, K, trans_a, trans_b, n_cta_groups=1, neg_a=False, neg_b=False, is_sparse=False)
TVM intrinsic to create instruction descriptor for tcgen05 MMA with block scaling
- Parameters:
desc (PrimExpr) – The pointer to the instruction descriptor.
d_dtype (str) – The datatype of resultant matrix D.
a_dtype (str) – The datatype of multiplicand matrix A.
b_dtype (str) – The datatype of multiplicand matrix B.
sfa_dtype (str) – The datatype of scale factor matrix A.
sfb_dtype (str) – The datatype of scale factor matrix B.
sfa_tmem_addr (PrimExpr) – The address of the scale factor matrix A in tensor memory, should be uint32_t.
sfb_tmem_addr (PrimExpr) – The address of the scale factor matrix B in tensor memory, should be uint32_t.
M (int) – The size of non-reduction dimension of Matrix A.
N (int) – The size of non-reduction dimension of Matrix B.
K (int) – The size of reduction dimension of Matrix A/B.
trans_a (bool) – Whether the multiplicand matrix A is transposed. True for M/N major, False for K major.
trans_b (bool) – Whether the multiplicand matrix B is transposed. True for M/N major, False for K major.
n_cta_groups (int) – The number of CTA groups involved in the MMA operation.
neg_a (bool) – Whether to negate the multiplicand matrix A.
neg_b (bool) – Whether to negate the multiplicand matrix B.
is_sparse (bool) – Whether the MMA operation is sparse.
- tvm.backend.cuda.op.ptx_tcgen05_mma(d_tmem_addr, a_operand, b_desc, i_desc, *disable_output_lane, d_dtype, a_dtype, b_dtype, use_a_tmem, cta_group, enable_input_d=1, scale_input_d=0, pred=None)
TVM intrinsic to call tcgen05.mma.cta_group.kind without block scaling.
- Parameters:
d_dtype (str) – The datatype of resultant matrix D.
a_dtype (str) – The datatype of multiplicand matrix A.
b_dtype (str) – The datatype of multiplicand matrix B.
d_tmem_addr (PrimExpr) – The address of the resultant matrix D in tensor memory, should be uint32_t.
a_operand (PrimExpr) – Either the matrix descriptor of multiplicand matrix A in shared memory, or the address of the multiplicand matrix A in tensor memory (uint32_t).
b_desc (PrimExpr) – The matrix descriptor of multiplicand matrix B in shared memory.
i_desc (PrimExpr) – The instruction descriptor of the MMA operation.
use_a_tmem (bool) – Whether the multiplicand matrix A is in tensor memory.
cta_group (int) – The number of CTA groups involved in the MMA operation.
enable_input_d (PrimExpr) – Scale operand for the input accumulator C/D. The inline asm tests enable_input_d != 0: zero means D = A*B, non-zero means D = A*B + D.
scale_input_d (int) – The optional scaling factor to scale input matrix D. D = A*B+D * (2 ^ - scale-input-d)
disable_output_lane (list) – The lanes that should not be updated in the resultant matrix D.
pred (Optional[PrimExpr]) – Runtime
uint32instruction-level predicate. When given, emit@p_issue tcgen05.mma...withp_issue = (pred != 0). Preserves PTX-level predicate semantics (single predicated SASS instruction).
- tvm.backend.cuda.op.ptx_tcgen05_mma_block_scale(d_tmem_addr, a_operand, b_desc, sfa_tmem_addr, sfb_tmem_addr, i_desc, *, d_dtype, a_dtype, b_dtype, sfa_dtype, sfb_dtype, use_a_tmem, cta_group, enable_input_d=1)
- TVM intrinsic to call tcgen05.mma.cta_group.kind.block_scale
Performs matrix multiplication with block scaling: (A * scale_A) * (B * scale_B) + D
- Parameters:
d_dtype (str) – The datatype of resultant matrix D.
a_dtype (str) – The datatype of multiplicand matrix A.
b_dtype (str) – The datatype of multiplicand matrix B.
sfa_dtype (str) – The datatype of scale factor matrix A.
sfb_dtype (str) – The datatype of scale factor matrix B.
d_tmem_addr (PrimExpr) – The address of the resultant matrix D in tensor memory, should be uint32_t.
a_operand (PrimExpr) – Either the matrix descriptor of multiplicand matrix A in shared memory, or the address of the multiplicand matrix A in tensor memory (uint32_t).
b_desc (PrimExpr) – The matrix descriptor of multiplicand matrix B in shared memory.
sfa_tmem_addr (PrimExpr) – The address of the scale factor matrix A in tensor memory, should be uint32_t.
sfb_tmem_addr (PrimExpr) – The address of the scale factor matrix B in tensor memory, should be uint32_t.
i_desc (PrimExpr) – The instruction descriptor of the MMA operation.
use_a_tmem (bool) – Whether the multiplicand matrix A is in tensor memory.
cta_group (int) – The number of CTA groups involved in the MMA operation.
enable_input_d (PrimExpr) – Scale operand for the input accumulator C/D. Zero means D = A*B, non-zero means D = A*B + D.
- tvm.backend.cuda.op.ptx_tcgen05_mma_sp(d_tmem_addr, a_operand, b_desc, sp_tmem_addr, i_desc, *disable_output_lane, d_dtype, a_dtype, b_dtype, use_a_tmem, cta_group, enable_input_d=1, scale_input_d=0)
TVM intrinsic to call tcgen05.mma.sp.cta_group.kind without block scaling.
- Parameters:
d_dtype (str) – The datatype of resultant matrix D.
a_dtype (str) – The datatype of multiplicand matrix A.
b_dtype (str) – The datatype of multiplicand matrix B.
d_tmem_addr (PrimExpr) – The address of the resultant matrix D in tensor memory, should be uint32_t.
a_operand (PrimExpr) – Either the matrix descriptor of multiplicand matrix A in shared memory, or the address of the multiplicand matrix A in tensor memory (uint32_t).
b_desc (PrimExpr) – The matrix descriptor of multiplicand matrix B in shared memory.
sp_tmem_addr (PrimExpr) – The address of the metadata of sparse matrix in tensor memory, should be uint32_t.
i_desc (PrimExpr) – The instruction descriptor of the MMA operation.
use_a_tmem (bool) – Whether the multiplicand matrix A is in tensor memory.
cta_group (int) – The number of CTA groups involved in the MMA operation.
enable_input_d (PrimExpr) – Scale operand for the input accumulator C/D. The inline asm tests enable_input_d != 0: zero means D = A*B, non-zero means D = A*B + D.
scale_input_d (int) – The optional scaling factor to scale input matrix D. D = A*B+D * (2 ^ - scale-input-d)
disable_output_lane (list) – The lanes that should not be updated in the resultant matrix D.
- tvm.backend.cuda.op.ptx_tcgen05_mma_sp_block_scale(d_tmem_addr, a_operand, b_desc, sfa_tmem_addr, sfb_tmem_addr, sp_tmem_addr, i_desc, *, d_dtype, a_dtype, b_dtype, sfa_dtype, sfb_dtype, use_a_tmem, cta_group, enable_input_d=1)
- TVM intrinsic to call tcgen05.mma.sp.cta_group.kind.block_scale
Performs sparse matrix multiplication with block scaling: (A * scale_A) * (B * scale_B) + D
- Parameters:
d_dtype (str) – The datatype of resultant matrix D.
a_dtype (str) – The datatype of multiplicand matrix A.
b_dtype (str) – The datatype of multiplicand matrix B.
sfa_dtype (str) – The datatype of scale factor matrix A.
sfb_dtype (str) – The datatype of scale factor matrix B.
d_tmem_addr (PrimExpr) – The address of the resultant matrix D in tensor memory, should be uint32_t.
a_operand (PrimExpr) – Either the matrix descriptor of multiplicand matrix A in shared memory, or the address of the multiplicand matrix A in tensor memory (uint32_t).
b_desc (PrimExpr) – The matrix descriptor of multiplicand matrix B in shared memory.
sfa_tmem_addr (PrimExpr) – The address of the scale factor matrix A in tensor memory, should be uint32_t.
sfb_tmem_addr (PrimExpr) – The address of the scale factor matrix B in tensor memory, should be uint32_t.
sp_tmem_addr (PrimExpr) – The address of the metadata of sparse matrix in tensor memory, should be uint32_t.
i_desc (PrimExpr) – The instruction descriptor of the MMA operation.
use_a_tmem (bool) – Whether the multiplicand matrix A is in tensor memory.
cta_group (int) – The number of CTA groups involved in the MMA operation.
enable_input_d (PrimExpr) – Scale operand for the input accumulator C/D. Zero means D = A*B, non-zero means D = A*B + D.
- tvm.backend.cuda.op.ptx_tcgen05_fence_before_thread_sync()
TVM intrinsic to call tcgen05.fence::before_thread_sync Orders all prior asynchronous tcgen05 operations relative to subsequent operations.
- tvm.backend.cuda.op.ptx_tcgen05_fence_after_thread_sync()
TVM intrinsic to call tcgen05.fence::after_thread_sync Orders all subsequent asynchronous tcgen05 operations relative to previous operations.
- tvm.backend.cuda.op.ptx_tcgen05_cp(taddr, src_desc, *, shape, cta_group=1, multicast='', decompress='', row=0, col=0)
TVM intrinsic for the Blackwell tcgen05.cp PTX instruction.
The emitted PTX is:
tcgen05.cp.cta_group::{cta_group}.{shape}[.{multicast}][.{decompress}] [taddr], src_desc;
Each keyword argument maps 1:1 to a PTX token: read the call and you know what instruction is emitted.
- Parameters:
taddr (PrimExpr) – Destination tensor-memory address (uint32). Callers typically pass
tmem_base + column_offset_in_uint32sdirectly. Use the optionalrow/colkeyword arguments only when the address needs runtime row/col composition viaget_tmem_addr(high 16 bits row, low 16 bits col).src_desc (PrimExpr) – The 64-bit shared-memory matrix descriptor.
shape (str) – One of
"32x128b","4x256b","128x128b","128x256b","64x128b".cta_group (int) – 1 or 2.
multicast (str) – One of
"","warpx4","warpx2::02_13","warpx2::01_23"."32x128b"requires"warpx4";"64x128b"requires one of thewarpx2::*values; other shapes require"".decompress (str) – Trailing PTX suffix for fp4/fp6 → fp8 on-the-fly decompression. One of
"","b8x16.b4x16_p64","b8x16.b6x16_p32".row (PrimExpr) – Optional row/col offsets added to
taddrat runtime. Default 0.col (PrimExpr) – Optional row/col offsets added to
taddrat runtime. Default 0.
- tvm.backend.cuda.op.ptx_tcgen05_shift(taddr, cta_group=1)
- TVM intrinsic to call tcgen05.shift.cta_group.down
Asynchronously shift down the rows of the matrix in Tensor Memory for a warp.
- Parameters:
taddr (PrimExpr) – The address of matrix in tensor memory, should be uint32_t.
cta_group (int) – The number of CTA groups involved in the shift. If cta_group=1, shift operation is performed in the Tensor Memory of current CTA. Else, shift operation is performed in the Tensor Memory of both the current CTA and the peer CTA.
- tvm.backend.cuda.op.ptx_tcgen05_ld(src_addr, *regs, shape, num, row=0, col=0, pack=False)
TVM intrinsic for tcgen05.ld.sync.aligned — async collective load from TMEM.
Emits
tcgen05.ld.sync.aligned.{shape}.x{num}[.pack::16b].b32 {regs}, [addr];- Parameters:
src_addr (PrimExpr) – Tensor-memory source address (uint32).
regs (list[PrimExpr]) – Destination registers. Count depends on shape x num.
shape (str) – One of
"16x32bx2","16x64b","16x128b","16x256b","32x32b".num (int) – Repeat factor along the columns. Power-of-two in [1, 128].
row (PrimExpr) – Optional TMEM row/col offsets added to
src_addrat runtime (row must be a multiple of 32). Default 0.col (PrimExpr) – Optional TMEM row/col offsets added to
src_addrat runtime (row must be a multiple of 32). Default 0.pack (bool) – Pack two 16-bit chunks into a single 32-bit register.
- tvm.backend.cuda.op.ptx_tcgen05_st(dst_addr, *regs, shape, num, row=0, col=0, unpack=False)
TVM intrinsic for tcgen05.st.sync.aligned — async collective store to TMEM.
Emits
tcgen05.st.sync.aligned.{shape}.x{num}[.unpack::16b].b32 [addr], {regs};- Parameters:
dst_addr (PrimExpr) – Tensor-memory destination address (uint32).
regs (list[PrimExpr]) – Source registers. Count depends on shape x num.
shape (str) – One of
"16x32bx2","16x64b","16x128b","16x256b","32x32b".num (int) – Repeat factor along the columns. Power-of-two in [1, 128].
row (PrimExpr) – Optional TMEM row/col offsets added to
dst_addrat runtime (row must be a multiple of 32). Default 0.col (PrimExpr) – Optional TMEM row/col offsets added to
dst_addrat runtime (row must be a multiple of 32). Default 0.unpack (bool) – Unpack a 32-bit register into two 16-bit chunks.
- tvm.backend.cuda.op.ptx_tcgen05_wait_ld()
TVM intrinsic to call tcgen05.wait::ld.sync.aligned Wait for the completion of all prior async tcgen05.ld operations.
- tvm.backend.cuda.op.ptx_tcgen05_wait_st()
TVM intrinsic to call tcgen05.wait::st.sync.aligned Wait for the completion of all prior async tcgen05.st operations.
- tvm.backend.cuda.op.ptx_tcgen05_commit(bar, cta_group=1, cta_mask=0, *, pred=None)
TVM intrinsic to call tcgen05.commit.cta_group
- Parameters:
bar (PrimExpr) – The pointer to mbarrier variable.
cta_group (int) – The number of CTA groups involved in previous tcgen05 operations.
cta_mask (int) – The mask of the CTAs in the cluster, used for multicast.
pred (Optional[PrimExpr]) – Runtime
uint32predicate. When given, emit@p tcgen05.commit...withp = (pred != 0). This preserves PTX-level instruction predicate semantics (single predicated instruction in SASS), distinct from a C-levelifbranch.
- Returns:
call – The call expression.
- Return type:
- tvm.backend.cuda.op.timer_init_cuda(profiler_buffer, profiler_tag, profiler_write_offset, num_groups, group_id)
TVM intrinsic for initializing the CUDA profiler, and store profiling result in a buffer.
- Parameters:
profiler_buffer (Var) – The buffer to store the profiling result.
profiler_tag (Var) – Buffer of length 1 storing the base tag of the current thread.
profiler_write_offset (Var) – Buffer of length 1 storing the offset in buffer to write the next profiling result for the current thread.
num_groups (int) – The number of groups in the profiler.
group_id (PrimExpr) – The group id of the current thread.
- Returns:
call – The call expression.
- Return type:
- tvm.backend.cuda.op.timer_start_cuda(event_type, profiler_buffer, profiler_tag, profiler_write_offset, profiler_write_stride, leader_cond)
TVM intrinsic for starting the timer for profiling a specific event, and storing profiling result in a buffer.
- Parameters:
event_type (Enum) – The event to profile.
profiler_buffer (Var) – The buffer to store the profiling result.
profiler_tag (Var) – Buffer of length 1 storing the base tag of the current thread.
profiler_write_offset (Var) – Buffer of length 1 storing the offset in buffer to write the next profiling result for the current thread.
profiler_write_stride (int) – The stride to advance in buffer in the next write.
leader_cond (PrimExpr) – The condition to check if the current thread is the leader.
- Returns:
call – The call expression.
- Return type:
- tvm.backend.cuda.op.timer_end_cuda(event_type, profiler_buffer, profiler_tag, profiler_write_offset, profiler_write_stride, leader_cond)
TVM intrinsic for ending the timer for profiling a specific event, and storing profiling result in a buffer.
- Parameters:
event_type (Enum) – The event to profile.
profiler_buffer (Var) – The buffer to store the profiling result.
profiler_tag (Var) – Buffer of length 1 storing the base tag of the current thread.
profiler_write_offset (Var) – Buffer of length 1 storing the offset in buffer to write the next profiling result for the current thread.
profiler_write_stride (int) – The stride to advance in buffer in the next write.
leader_cond (PrimExpr) – The condition to check if the current thread is the leader.
- Returns:
call – The call expression.
- Return type:
- tvm.backend.cuda.op.timer_finalize_cuda(profiler_buffer, profiler_tag, profiler_write_offset, profiler_write_stride, leader_cond)
TVM intrinsic for finalizing the CUDA profiler, and store profiling result in a buffer.
- Parameters:
profiler_buffer (Var) – The buffer to store the profiling result.
profiler_tag (Var) – Buffer of length 1 storing the base tag of the current thread.
profiler_write_offset (Var) – Buffer of length 1 storing the offset in buffer to write the next profiling result for the current thread.
profiler_write_stride (int) – The stride to advance in buffer in the next write.
leader_cond (PrimExpr) – The condition to check if the current thread is the leader.
- Returns:
call – The call expression.
- Return type:
- tvm.backend.cuda.op.cuda_atomic_add(res_addr, value)
TVM intrinsic to call cuda atomic add instruction
- tvm.backend.cuda.op.cuda_thread_fence()
TVM intrinsic to call cuda thread fence instruction
- Returns:
call – The call expression.
- Return type:
- tvm.backend.cuda.op.cuda_warpgroup_sync(bar_no)
TVM intrinsic to synchronize a CUDA warpgroup via a named barrier.
- Parameters:
bar_no (PrimExpr) – The named barrier id to use for the warpgroup.
Notes
Synchronizes 128 threads in a warpgroup using bar.sync bar_no, 128.
- Returns:
call – The call expression.
- Return type:
- tvm.backend.cuda.op.cuda_syncthreads_and(cond)
TVM intrinsic to call cuda syncthreads_and instruction
- tvm.backend.cuda.op.cuda_syncthreads_or(cond)
TVM intrinsic to call cuda syncthreads_or instruction
- tvm.backend.cuda.op.cuda_nano_sleep(time)
TVM intrinsic to call cuda nano sleep instruction
- tvm.backend.cuda.op.cuda_printf(fmt, *args)
TVM intrinsic to call cuda printf instruction
- tvm.backend.cuda.op.cuda_ldg(addr, dtype)
TVM intrinsic to call CUDA C++ __ldg() function
- tvm.backend.cuda.op.cuda_get_tmem_addr(addr, row_offset, col_offset)
TVM intrinsic to call cuda tmem address calculation
Convert a generic pointer to a shared-memory address (uint32).
Wraps
__cvta_generic_to_shared(ptr). Used by op-wrappers that precompute the shared-memory address at the wrapper layer instead of inside the asm helper body.
- tvm.backend.cuda.op.cuda_smem_addr_from_uint64(cluster_addr)
Narrow a 64-bit cluster-mapped SMEM address to a 32-bit SMEM address.
Wraps
static_cast<unsigned int>(cluster_addr). Used by cp.async.bulk.shared::cluster.* op-wrappers.
- tvm.backend.cuda.op.cuda_sm100_tma_2sm_mbarrier_addr(bar)
Compute the SM100 2SM TMA mbarrier shared-address operand.
- tvm.backend.cuda.op.ptx_exp2(x)
TVM intrinsic for PTX fast exp2 approximation (ex2.approx.ftz.f32)
- tvm.backend.cuda.op.ptx_rcp(x)
TVM intrinsic for PTX fast reciprocal approximation (rcp.approx.ftz.f32)
- tvm.backend.cuda.op.ptx_any_sync(mask, pred)
TVM intrinsic for PTX warp-wide any predicate (__any_sync)
- tvm.backend.cuda.op.ptx_reduce3_max_f32(a, b, c)
TVM intrinsic to call 3-input max.f32 PTX instruction (sm_100a+)
- tvm.backend.cuda.op.ptx_reduce3_min_f32(a, b, c)
TVM intrinsic to call 3-input min.f32 PTX instruction (sm_100a+)
- tvm.backend.cuda.op.ptx_add_f32(d_addr, a, b, *, rounding='rn', ftz=False, sat=False)
PTX
add{.rnd}{.ftz}{.sat}.f32 [d_addr], a, b— DPS form.
- tvm.backend.cuda.op.ptx_add_f32x2(d_addr, a, b, *, rounding='rn', ftz=False)
PTX
add{.rnd}{.ftz}.f32x2 [d_addr], a, b— DPS form.a, b are packed-as-uint64 register operands (2 fp32 each).
- tvm.backend.cuda.op.ptx_add_f64(d_addr, a, b, *, rounding='rn')
PTX
add{.rnd}.f64 [d_addr], a, b— DPS form (no .ftz / .sat).
- tvm.backend.cuda.op.ptx_sub_f32(d_addr, a, b, *, rounding='rn', ftz=False, sat=False)
PTX
sub{.rnd}{.ftz}{.sat}.f32 [d_addr], a, b— DPS form.
- tvm.backend.cuda.op.ptx_sub_f32x2(d_addr, a, b, *, rounding='rn', ftz=False)
PTX
sub{.rnd}{.ftz}.f32x2 [d_addr], a, b— DPS form.
- tvm.backend.cuda.op.ptx_sub_f64(d_addr, a, b, *, rounding='rn')
PTX
sub{.rnd}.f64 [d_addr], a, b— DPS form.
- tvm.backend.cuda.op.ptx_mul_f32(d_addr, a, b, *, rounding='rn', ftz=False, sat=False)
PTX
mul{.rnd}{.ftz}{.sat}.f32 [d_addr], a, b— DPS form.
- tvm.backend.cuda.op.ptx_mul_f32x2(d_addr, a, b, *, rounding='rn', ftz=False)
PTX
mul{.rnd}{.ftz}.f32x2 [d_addr], a, b— DPS form.
- tvm.backend.cuda.op.ptx_mul_f64(d_addr, a, b, *, rounding='rn')
PTX
mul{.rnd}.f64 [d_addr], a, b— DPS form.
- tvm.backend.cuda.op.ptx_fma_f32(d_addr, a, b, c, *, rounding='rn', ftz=False, sat=False)
PTX
fma{.rnd}{.ftz}{.sat}.f32 [d_addr], a, b, c— DPS form.
- tvm.backend.cuda.op.ptx_fma_f32x2(d_addr, a, b, c, *, rounding='rn', ftz=False)
PTX
fma{.rnd}{.ftz}.f32x2 [d_addr], a, b, c— DPS form.a, b, c are packed-as-uint64 register operands.
- tvm.backend.cuda.op.ptx_fma_f64(d_addr, a, b, c, *, rounding='rn')
PTX
fma{.rnd}.f64 [d_addr], a, b, c— DPS form.
- tvm.backend.cuda.op.ptx_max_f32(a, b, *, ftz=False, nan=False)
TVM intrinsic for PTX
max{.ftz}{.NaN}.f32 d, a, b.2-operand form (distinct from
ptx_reduce3_max_f32()which is the 3-operand SM_100+ form)..NaNqualifier propagates NaN inputs to the output; without it, NaN inputs are silently ignored.
- tvm.backend.cuda.op.ptx_griddepcontrol_wait()
TVM intrinsic for PTX
griddepcontrol.wait(sm_90+).Blocks the current grid until prerequisite grids signalled via
ptx_griddepcontrol_launch_dependents()have finished. Acts as a full memory barrier.
- tvm.backend.cuda.op.ptx_griddepcontrol_launch_dependents()
TVM intrinsic for PTX
griddepcontrol.launch_dependents(sm_90+).Signals that the current grid has reached a point where dependent grids may begin execution.
- tvm.backend.cuda.op.ptx_ld_acquire(addr, return_type, ptx_type, *, scope='gpu', space='global')
TVM intrinsic for scalar PTX
ld.acquire.scope{.ss}.typeloads.This wrapper covers the scalar no-cache-policy/no-vector instances of the PTX ISA
ld.acquireform.scope, statespace, PTXtypeand TVMreturn_typeare explicit so callers can request either raw-bit or typed loads.- Parameters:
- Returns:
call – The loaded value.
- Return type:
- tvm.backend.cuda.op.ptx_ld(addr, return_type, ptx_type, *, weak=False, space='global', cop='', cache_hint='', cache_policy=None)
TVM intrinsic for scalar PTX
ld{.weak}{.ss}{.cop}{.level::cache_hint}.type.This wrapper covers scalar no-prefetch/no-vector instances of the weak generic load form.
- tvm.backend.cuda.op.ptx_ld_volatile(addr, return_type, ptx_type, *, space='global')
TVM intrinsic for scalar PTX
ld.volatile{.ss}.typeloads.This wrapper covers scalar no-prefetch/no-vector instances.
- tvm.backend.cuda.op.ptx_ld_global_acquire(res, addr)
TVM intrinsic to call the legacy ptx ld.global.acquire helper.
TVM intrinsic to call ptx map_shared_rank instruction
- tvm.backend.cuda.op.ptx_mapa(ptr, rank, *, space='', ptx_type='u64', return_type='uint64')
TVM intrinsic for PTX
mapa{.space}.type d, a, b.
- tvm.backend.cuda.op.cuda_atomic_cas(ptr, old_val, new_val)
TVM intrinsic to call cuda atomic cas instruction
- tvm.backend.cuda.op.nvshmem_my_pe()
TVM intrinsic to call nvshmem_my_pe()
- Returns:
call – The call expression.
- Return type:
- tvm.backend.cuda.op.nvshmem_n_pes()
TVM intrinsic to call nvshmem_n_pes()
- Returns:
call – The call expression.
- Return type:
- tvm.backend.cuda.op.nvshmem_getmem_nbi(dst, src, nelems, pe)
TVM intrinsic to call nvshmem_getmem_nbi()
- Parameters:
- Returns:
call – The call expression.
- Return type:
- tvm.backend.cuda.op.nvshmem_putmem_nbi(dst, src, nelems, pe)
TVM intrinsic to call nvshmem_putmem_nbi()
- Parameters:
- Returns:
call – The call expression.
- Return type:
- tvm.backend.cuda.op.nvshmem_getmem_nbi_warp(dst, src, nelems, pe)
TVM intrinsic to call nvshmem_getmem_nbi_warp()
- Parameters:
- Returns:
call – The call expression.
- Return type:
- tvm.backend.cuda.op.nvshmem_putmem_nbi_warp(dst, src, nelems, pe)
TVM intrinsic to call nvshmem_putmem_nbi_warp()
- Parameters:
- Returns:
call – The call expression.
- Return type:
- tvm.backend.cuda.op.nvshmem_getmem_nbi_block(dst, src, nelems, pe)
TVM intrinsic to call nvshmem_getmem_nbi_block()
- Parameters:
- Returns:
call – The call expression.
- Return type:
- tvm.backend.cuda.op.nvshmem_putmem_nbi_block(dst, src, nelems, pe)
TVM intrinsic to call nvshmem_putmem_nbi_block()
- Parameters:
- Returns:
call – The call expression.
- Return type:
- tvm.backend.cuda.op.nvshmem_signal_op(sig_addr, signal, sig_op, pe)
TVM intrinsic to call nvshmem_signal_op()
- Parameters:
sig_addr (PrimExpr) – The pointer to the symmetric address of the signal word to be updated, must be uint64_t*.
signal (uint64_t) – The value used to update sig_addr.
sig_op (str) – Operation used to update sig_addr with signal, typical sig_op values are “set” and “add”.
pe (int) – The PE number of the remote PE.
- Returns:
call – The call expression.
- Return type:
- tvm.backend.cuda.op.nvshmem_wait_until(ivar, cmp, cmp_value, type='uint64_t')
TVM intrinsic to call nvshmem_wait_until()
- Parameters:
- Returns:
call – The call expression.
- Return type:
- tvm.backend.cuda.op.nvshmem_quiet()
TVM intrinsic to call nvshmem_quiet()
- Returns:
call – The call expression.
- Return type:
- tvm.backend.cuda.op.nvshmem_putmem_signal_nbi(dst, src, nelems, sig_addr, signal, sig_op, pe)
TVM intrinsic to call nvshmem_putmem_signal_nbi()
- Parameters:
dst (PrimExpr) – The pointer to the symmetric address of the data object to be updated on the remote PE.
src (PrimExpr) – The pointer to the symmetric address or host/device address of data object containing the data to be copied.
nelems (int) – The number of bytes to put per thread.
sig_addr (PrimExpr) – The pointer to the symmetric address of the signal data object to be updated on the remote PE as a signal, must be uint64_t*.
signal (uint64_t) – The unsigned 64-bit value that is used for updating the remote sig_addr signal data object.
sig_op (str) – Signal operator that represents the type of update to be performed on the remote sig_addr signal data object.
pe (int) – The PE number of the remote PE.
- Returns:
call – The call expression.
- Return type:
- tvm.backend.cuda.op.nvshmem_putmem_signal_nbi_warp(dst, src, nelems, sig_addr, signal, sig_op, pe)
TVM intrinsic to call nvshmem_putmem_signal_nbi_warp()
- Parameters:
dst (PrimExpr) – The pointer to the symmetric address of the data object to be updated on the remote PE.
src (PrimExpr) – The pointer to the symmetric address or host/device address of data object containing the data to be copied.
nelems (int) – The number of bytes to put per warp.
sig_addr (PrimExpr) – The pointer to the symmetric address of the signal data object to be updated on the remote PE as a signal, must be uint64_t*.
signal (uint64_t) – The unsigned 64-bit value that is used for updating the remote sig_addr signal data object.
sig_op (str) – Signal operator that represents the type of update to be performed on the remote sig_addr signal data object.
pe (int) – The PE number of the remote PE.
- Returns:
call – The call expression.
- Return type:
- tvm.backend.cuda.op.nvshmem_putmem_signal_nbi_block(dst, src, nelems, sig_addr, signal, sig_op, pe)
TVM intrinsic to call nvshmem_putmem_signal_nbi_block()
- Parameters:
dst (PrimExpr) – The pointer to the symmetric address of the data object to be updated on the remote PE.
src (PrimExpr) – The pointer to the symmetric address or host/device address of data object containing the data to be copied.
nelems (int) – The number of bytes to put per block.
sig_addr (PrimExpr) – The pointer to the symmetric address of the signal data object to be updated on the remote PE as a signal, must be uint64_t*.
signal (uint64_t) – The unsigned 64-bit value that is used for updating the remote sig_addr signal data object.
sig_op (str) – Signal operator that represents the type of update to be performed on the remote sig_addr signal data object.
pe (int) – The PE number of the remote PE.
- Returns:
call – The call expression.
- Return type:
- tvm.backend.cuda.op.nvshmem_fence()
TVM intrinsic to call nvshmem_fence()
- Returns:
call – The call expression.
- Return type:
tvm.backend.cuda.script
CUDA TVMScript namespaces.
- class tvm.backend.cuda.script.CUDANamespace
The CUDA intrinsics submodule.
- class tvm.backend.cuda.script.NVSHMEMNamespace
The NVSHMEM intrinsics submodule.
- class tvm.backend.cuda.script.PTXNamespace
The PTX instruction submodule.
tvm.backend.cuda.operator
CUDA backend operator registrations and helpers.