CUDA C++/PTX intrinsics
When no tile primitive covers what you need, two escape hatches reach the hardware
directly: call a backend intrinsic (the T.cuda.* / T.ptx.* namespaces
from tvm.backend.cuda), or inline raw CUDA source.
Calling backend intrinsics
T.cuda.* and T.ptx.* expose the CUDA backend’s device intrinsics directly —
synchronization, mbarriers, reductions, and the PTX data-movement / MMA families:
T.cuda.cta_sync() # block barrier (__syncthreads)
T.cuda.warp_sync() # __syncwarp
T.cuda.warpgroup_sync(8) # warpgroup barrier
T.cuda.cta_sum(val, num_warps, scratch.ptr_to([0])) # block-level reduction
bar = T.alloc_shared((1,), "uint64")
T.ptx.mbarrier.init(bar.data, 1) # mbarrier for async completion
T.ptx.mbarrier.try_wait(bar.data, phase)
A complete, runnable example — a warp all-reduce via T.tvm_warp_shuffle_xor:
@T.prim_func
def warp_reduce(A_ptr: T.handle):
A = T.match_buffer(A_ptr, (32,), "float32", align=16)
T.device_entry()
cta_id = T.cta_id([1]); warp_id = T.warp_id([1]); lane_id = T.lane_id([32])
v = T.alloc_local((1,), "float32"); i = T.alloc_local((1,), "int32")
v[0] = T.float32(31 - lane_id)
i[0] = 16
while i[0] >= 1:
v[0] += T.tvm_warp_shuffle_xor(0xFFFFFFFF, v[0], i[0], 32, 32)
i[0] = i[0] // 2
A[lane_id] = v[0]
The shuffle lowers straight to __shfl_xor_sync:
v_ptr[0] = v_ptr[0] + __shfl_xor_sync(0xFFFFFFFF, v_ptr[0], i_ptr[0], 32);
Other families under T.ptx.* / T.cuda.*: cp_async (LDGSTS),
cp_async.bulk.tensor (TMA), ldmatrix / stmatrix, tcgen05.*
(Blackwell MMA), atomic_add, fence … See tvm.backend.cuda for the
full tvm.backend.cuda reference.
Inlining raw CUDA
For something with no intrinsic at all, inject a __device__ function from a
source string with T.cuda.func_call(name, *args, source_code=..., return_type=...):
SRC = r"""
__device__ __forceinline__ float my_relu(float x) { return x > 0.f ? x : 0.f; }
"""
@T.prim_func
def k(A_ptr: T.handle, B_ptr: T.handle):
A = T.match_buffer(A_ptr, (256,), "float32")
B = T.match_buffer(B_ptr, (256,), "float32")
T.device_entry(); bx = T.cta_id([1]); tx = T.thread_id([256])
B[tx] = T.cuda.func_call("my_relu", A[tx], source_code=SRC, return_type="float32")
The source is emitted verbatim and the call is wired in:
__device__ __forceinline__ float my_relu(float x) { return x > 0.f ? x : 0.f; }
// ...
B_ptr[tx] = my_relu(A_ptr[tx]);