tvm
Enumerations | Functions
tvm::tirx::builtin Namespace Reference

Collection of builtin intrinsics as ops. More...

Enumerations

enum  TVMStructFieldKind : int {
  kDLTensorAddr , kDLTensorData , kDLTensorShape , kDLTensorStrides ,
  kDLTensorNDim , kDLTensorTypeCode , kDLTensorTypeBits , kDLTensorTypeLanes ,
  kDLTensorByteOffset , kDLTensorDeviceId , kDLTensorDeviceType , kDLTensorKindBound_ ,
  kTVMValueContent , kTVMFFIAnyTypeIndex , kTVMFFIAnyZeroPadding , kTVMFFIAnyUnionValue ,
  kTVMValueKindBound_ , kInt64ArrayElem
}
 The kind of structure field info used in intrinsic. More...
 

Functions

const Opret ()
 Return value. More...
 
const Opthread_return ()
 Return from a GPU thread. More...
 
const Opcontinue_loop ()
 Loop continue. More...
 
const Opbreak_loop ()
 Loop break. More...
 
const Opreinterpret ()
 Reinterpret the value using the target type. More...
 
const Oplikely ()
 Marks a condition is likely going to happen. More...
 
const Opfilter ()
 Thread-set filter predicate. Used as the condition of an IfThenElse to narrow the active thread set A for the then-branch. Two forms: filter(var, lo, hi) – range form, true iff var in [lo, hi) filter(var, cond) – predicate form (e.g. var == k); true iff cond var must be a ScopeIdDef-declared Var at parse time (Verifier Rule 2). More...
 
const Opselector ()
 Analysis-only active-thread selector. More...
 
const Opbitwise_and ()
 Bitwise and operator. More...
 
const Opbitwise_or ()
 Bitwise or operator. More...
 
const Opbitwise_xor ()
 Bitwise xor operator. More...
 
const Opbitwise_not ()
 Bitwise not operator. More...
 
const Opshift_left ()
 Left shift. More...
 
const Opshift_right ()
 Right shift. More...
 
const Oplarge_uint_imm ()
 See pesudo code. More...
 
const Opq_multiply_shift ()
 Execute a multiplication between two Q-numbers x and y followed by a right shift s The default rounding rule is to the nearest value, rounding half up (i.e., round(x.1) = x and round (x.5) = x+1) More...
 
const Opaddress_of ()
 Returns the address of an element in the buffer (see pseudocode below). More...
 
const Opif_then_else ()
 Same as select, used for unsafe memory access. More...
 
const Opisnullptr ()
 See pesudo code. More...
 
const Opisnan ()
 Check if value is nan. More...
 
const Oppopcount ()
 Popcount. More...
 
const Opfma ()
 Fused multiply add. More...
 
const Opcall_extern ()
 Call an extern C function with given name and signature from the types of args in the runtime environment. More...
 
const Opcall_pure_extern ()
 Call an pure extern C function with given name and signature from the types of args in the runtime environment. More...
 
const Opcall_llvm_intrin ()
 Call an LLVM intrinsic with a given intrinsic id and signature from the types of args in the runtime environment. More...
 
const Opcall_llvm_pure_intrin ()
 Call an LLVM pure intrinsic with a given intrinsic id and signature from the types of args in the runtime environment. More...
 
const Opcall_spirv_pure_glsl450 ()
 Call an SPIRV pure GLSL450 intrinsic. More...
 
const Opprefetch ()
 same signature as llvm.prefetch More...
 
const Optvm_access_ptr ()
 Get head access address with memory access pattern info. More...
 
const Optvm_static_handle ()
 Create a function local static handle that iniitalizes to nullptr. can be used to cache function local static resources. More...
 
const Optvm_context_id ()
 Return a unique context id, used for hint of workspace separation. Different context id ganrantees not having overlapping workspace. More...
 
const Optvm_tuple ()
 tvm_tuple is not an actual function and cannot codegen. It is used to represent tuple structure in value field of AttrStmt, for the sake of giving hint to optimization. More...
 
const Ophandle_add_byte_offset ()
 See pesudo code. More...
 
const Optvm_struct_get ()
 See pesudo code. More...
 
const Optvm_struct_set ()
 See pesudo code. More...
 
const Oplookup_param ()
 See pseudo code Type lookup_param(ffi::String param_name) { return __tvm_param__param_name; }. More...
 
const Optvm_throw_last_error ()
 See pesudo code. More...
 
const Optvm_stack_alloca ()
 See pesudo code. More...
 
const Optvm_stack_make_shape ()
 Allocate a shape tuple on stack, return the handle. More...
 
const Optvm_stack_make_array ()
 Allocate a Tensor(DLTensor) on stack, return the handle. More...
 
const Optvm_call_packed ()
 See pesudo code. More...
 
const Optvm_call_cpacked ()
 See pesudo code. More...
 
const Optvm_call_trace_packed ()
 See pesudo code. More...
 
const Optvm_thread_invariant ()
 Mark a condition to be thread invariant. This means the condition must be the same for all threads. More...
 
const Optvm_call_packed_lowered ()
 Lowered version of call packed, the space of value and type codes are explicitly allocated. More...
 
const Optvm_call_cpacked_lowered ()
 Lowered version of call c-packed, the space of value and type codes are explicitly allocated. More...
 
const Optvm_call_trace_packed_lowered ()
 Lowered version of trace intrinsic, the space of value and type codes are explicitly allocated. The return value is the (end - 1) value on the stack. More...
 
const Optvm_storage_sync ()
 See pseudo code. More...
 
const Optvm_warp_shuffle ()
 See pseudo code. More...
 
const Optvm_warp_shuffle_up ()
 
const Optvm_warp_shuffle_down ()
 
const Optvm_warp_shuffle_xor ()
 
const Optvm_warp_activemask ()
 
const Optvm_global_barrier_kinit ()
 Initialize the global barrier. Call this at beginning of kernel that need global barrier. More...
 
const Optvm_thread_allreduce ()
 See pesudo code. More...
 
const Opmake_filled_simdgroup_matrix ()
 tvm intrinsic for initializing and simdgroup with given value. More...
 
const Opsimdgroup_load ()
 tvm intrinsic for loading data from device memory or threadgroup memory to simdgroup. More...
 
const Opsimdgroup_store ()
 tvm intrinsic for storing data from simdgroup to device memory or threadgroup memory. More...
 
const Opsimdgroup_multiply_accumulate ()
 tvm intrinsic for multiply and accumulate two matrices in simdgroup More...
 
const Opcooperative_tensor_fill ()
 Fill a cooperative_tensor with a given value. More...
 
const Opcooperative_tensor_load ()
 Load data from device or threadgroup memory into a cooperative_tensor. More...
 
const Opcooperative_tensor_store ()
 Store data from a cooperative_tensor to device or threadgroup memory. More...
 
const Opcooperative_tensor_multiply_accumulate ()
 Multiply and accumulate two matrices using cooperative_tensor (MetalPerformancePrimitives matmul2d). More...
 
const Opvectorhigh ()
 Get the high level half of the vector. More...
 
const Opvectorlow ()
 Get the low-level half of the vector. More...
 
const Opvectorcombine ()
 Concat two vectors. More...
 
const Opdp4a ()
 Dot product of two int8x4 vectors and add an optional accumulator. More...
 
const Opatomic_add ()
 atomic add instruction, corresponding e.g. to atomicAdd in CUDA More...
 
const Opnd_mem_alloc_with_scope ()
 Create an Nd memory allocation with storage scope. More...
 
const Optexture2d_store ()
 Store to texture 2d memory. More...
 
const Optexture2d_load ()
 Load from texture 2d memory. More...
 
const Opdma_copy ()
 Initiate a non-blocking DMA copy from source to destination. More...
 
const Opdma_wait ()
 Wait until the number of DMA groups in flight is less than or equal to some maximum. More...
 
const Opdma_start_group ()
 Start a group of DMA copies. More...
 
const Opdma_end_group ()
 End a group of DMA copies. More...
 
const Opassume ()
 Provide a true statement that can be used for simplifications. More...
 
const Opundef ()
 Returns an initialized but arbitrary value. More...
 
const Opstart_profile_intrinsic ()
 Profiling intrinsic. More...
 
const Opend_profile_intrinsic ()
 Profiling intrinsic. More...
 
const Opanylist_getitem ()
 Get a item from any list and return it. More...
 
const Opanylist_resetitem ()
 Reset and clear a item in any list. More...
 
const Opanylist_setitem_call_packed ()
 Set an item into any list by running packed function call. More...
 
const Opanylist_setitem_call_cpacked ()
 Same as anylist_setitem_call_packed but use C calling convention. More...
 
const Opvscale ()
 Get the target's vscale value. It will be lowered to llvm.vscale intrinsic (https://llvm.org/docs/LangRef.html#llvm-vscale-intrinsic) More...
 
const Opget_active_lane_mask ()
 Calculate a predicate mask given an upper bound (limit) and a current value (base). More...
 
const Opignore_loop_partition ()
 Annotate a predicate not be considered as target condition of loop partition. More...
 
const Opbuffer_offset ()
 Get the element offset of a buffer given logical indices. More...
 
const Opprint_buffer ()
 Print the content of a buffer during runtime. More...
 
const Optimer_init_cuda ()
 tvm intrinsic for initializing the CUDA profiler, and store profiling result in a buffer. More...
 
const Optimer_start_cuda ()
 tvm intrinsic for starting the timer for profiling a specific event, and storing profiling result in a buffer. More...
 
const Optimer_end_cuda ()
 tvm intrinsic for ending the timer for profiling a specific event, and storing profiling result in a buffer. More...
 
const Optimer_finalize_cuda ()
 tvm intrinsic for finalize the timer for profiling, and storing profiling result in a buffer. More...
 
const Opcuda_atomic_add ()
 tvm intrinsic for cuda atomic add instruction More...
 
const Opcuda_thread_fence ()
 tvm intrinsic for cuda thread fence instruction More...
 
const Opcuda_warp_reduce ()
 Warp-level butterfly shuffle-XOR reduction. More...
 
const Opcuda_cta_reduce ()
 CTA-wide reduction via warp shuffle + shared memory. More...
 
const Opcuda_copy_bytes ()
 Typed load/store copy of num_bytes bytes. More...
 
const Opcuda_warp_sync ()
 tvm intrinsic for cuda warp sync instruction More...
 
const Opcuda_cta_sync ()
 tvm intrinsic for cuda block-wide sync (syncthreads) More...
 
const Opcuda_grid_sync ()
 tvm intrinsic for cuda grid-wide sync (cooperative groups) More...
 
const Opcuda_thread_rank ()
 tvm intrinsic that returns cooperative_groups::thread_rank() for the enclosing CTA (linear thread index within the block). More...
 
const Opcuda_half2float ()
 tvm intrinsic for cuda half to float conversion More...
 
const Opcuda_bfloat162float ()
 tvm intrinsic for cuda bfloat16 to float conversion More...
 
const Opcuda_float22half2 ()
 tvm intrinsic for a helper converting float2 to half2 with rounding More...
 
const Opcuda_trap_when_assert_failed ()
 tvm intrinsic to trap when an assertion failed (cond == false) More...
 
const Opcuda_runtime_instr_desc ()
 tvm intrinsic to modify runtime instruction descriptor More...
 
const Opcuda_half8tofloat8 ()
 tvm intrinsic to convert 8 half2 lanes to 8 float2 lanes More...
 
const Opcuda_float8tohalf8 ()
 tvm intrinsic to convert 8 float2 lanes to 8 half2 lanes with rounding More...
 
const Opcuda_syncthreads_and ()
 tvm intrinsic for cuda syncthreads_and instruction More...
 
const Opcuda_syncthreads_or ()
 tvm intrinsic for cuda syncthreads_or instruction More...
 
const Opcuda_nano_sleep ()
 tvm intrinsic for cuda nano sleep instruction More...
 
const Opcuda_atomic_cas ()
 tvm intrinsic for cuda atomic compare and swap instruction More...
 
const Opcuda_printf ()
 tvm intrinsic for cuda printf instruction More...
 
const Opcuda_ldg ()
 tvm intrinsic for cuda ldg instruction More...
 
const Opcuda_get_tmem_addr ()
 tvm intrinsic for cuda tmem address calculation More...
 
const Opptx_exp2 ()
 tvm intrinsic for PTX fast exp2 approximation (ex2.approx.ftz.f32) More...
 
const Opptx_rcp ()
 tvm intrinsic for PTX fast reciprocal approximation (rcp.approx.ftz.f32) More...
 
const Opptx_any_sync ()
 tvm intrinsic for PTX warp-wide any predicate (__any_sync) More...
 
const Opptx_reduce3_max_f32 ()
 tvm intrinsic for PTX 3-input max instruction (sm_100a+) More...
 
const Opptx_reduce3_min_f32 ()
 tvm intrinsic for PTX 3-input min instruction (sm_100a+) More...
 
const Opptx_add_packed_f32x2 ()
 tvm intrinsic for PTX packed add instruction (sm_100a+) More...
 
const Opptx_sub_packed_f32x2 ()
 tvm intrinsic for PTX packed subtract instruction (sm_100a+) More...
 
const Opptx_mul_packed_f32x2 ()
 tvm intrinsic for PTX packed multiply instruction (sm_100a+) More...
 
const Opptx_fma_packed_f32x2 ()
 tvm intrinsic for PTX packed FMA instruction (sm_100a+) More...
 
const Optvm_load_matrix_sync ()
 tvm intrinsic for tensor core load operators. More...
 
const Optvm_mma_sync ()
 tvm intrinsic for tensor core mma_sync operators. More...
 
const Optvm_bmma_sync ()
 tvm intrinsic for tensor core bmma_sync operators. More...
 
const Optvm_fill_fragment ()
 tvm intrinsic for tensor core fill_fragment operators. More...
 
const Optvm_store_matrix_sync ()
 tvm intrinsic for tensor core store operators. More...
 
const Opptx_mma ()
 tvm intrinsic for ptx tensor core mma instructions. More...
 
const Opptx_mma_legacy ()
 ptx mma / ldmatrix / mma_store / mma_fill variants that take (ptr_var, offset) pairs (not a folded access_ptr Call). Codegen emits ptr + offset C pointer arithmetic; lower_warp_memory rewrites the offset's group component to its thread-local index. More...
 
const Opptx_ldmatrix_legacy ()
 
const Opmma_store_legacy ()
 
const Opmma_fill_legacy ()
 
const Opptx_ldg32 ()
 tvm intrinsic for ptx predicate load with 32-bit data type. More...
 
const Opptx_mma_sp ()
 tvm intrinsic for sparse tensor core ptx instructions. More...
 
const Opptx_ldmatrix ()
 tvm intrinsic for ptx load matrix from shared memory. More...
 
const Opptx_cp_async ()
 tvm intrinsics for ptx async copy from global to shared memory using cp.async More...
 
const Opptx_cp_async_bulk ()
 tvm intrinsics for ptx async copy from global to shared memory using cp.async.bulk More...
 
const Opptx_cp_async_bulk_shared_to_cluster ()
 tvm intrinsics for ptx async bulk copy from shared::cta to shared::cluster More...
 
const Opptx_cp_async_commit_group ()
 tvm intrinsics for ptx async copy commit and wait. More...
 
const Opptx_cp_async_wait_group ()
 
const Opptx_cp_async_mbarrier_arrive ()
 tvm intrinsics for ptx async copy barrier using cp.async.mbarrier.arrive More...
 
const Opptx_fence ()
 PTX fence instruction: fence.{sem}.{scope}. More...
 
const Opptx_fence_proxy_async ()
 PTX fence.proxy.async instruction: fence.proxy.async[.{space}]. More...
 
const Opptx_mbarrier_init ()
 tvm instrinsics to call mbarrier.init.shared::cta.b64 More...
 
const Opptx_mbarrier_arrive ()
 tvm instrinsics to call mbarrier.arrive.shared::cta.b64 or mapa.shared::cluster.u32 mbarrier.arrive.shared::cluster.b64 More...
 
const Opptx_mbarrier_arrive_expect_tx ()
 tvm instrinsics to call mbarrier.arrive.expect_tx.shared.b64 or mapa.shared::cluster.u32 mbarrier.arrive.expect_tx.shared.b64 More...
 
const Opptx_mbarrier_try_wait ()
 tvm instrinsics to call mbarrier.try_wait.parity repeatedly until it returns true More...
 
const Opptx_bar_arrive ()
 tvm instrinsics to call bar.arrive a, b More...
 
const Opptx_bar_sync ()
 tvm instrinsics to call bar.sync a, {b} More...
 
const Opptx_cp_async_bulk_tensor_global_to_cluster ()
 tvm instrinsics to call cp.async.bulk.tensor.dim.shared::cluster.global.tile.mbarrier::complete_tx::bytes More...
 
const Opptx_cp_async_bulk_tensor_tile_gather4_global_to_cluster ()
 tvm intrinsic to call cp.async.bulk.tensor.dim.shared::cluster.global.tile::gather4.mbarrier::complete_tx::bytes More...
 
const Opptx_cp_async_bulk_tensor_shared_to_global ()
 tvm instrinsics to call cp.async.bulk.tensor.dim.global.shared::cta.tile。bulk_group More...
 
const Opptx_cp_async_bulk_tensor_global_to_cluster_prefetch ()
 tvm instrinsics to call cp.async.bulk.prefetch.tensor.dim.L2.global.tile More...
 
const Opptx_cp_async_bulk_tensor_shared_to_global_reduce ()
 tvm instrinsics to call cp.reduce.async.bulk.tensor.dim.dst.src.redOp More...
 
const Opptx_cp_async_bulk_commit_group ()
 tvm instrinsics to call cp.async.bulk.commit_group More...
 
const Opptx_cp_async_bulk_wait_group ()
 tvm instrinsics to call cp.async.bulk.wait_group{.read} N More...
 
const Opptx_barrier_cluster_arrive ()
 tvm instrinsics to call barrier.cluster.arrive{.sem}{.aligned} More...
 
const Opptx_barrier_cluster_wait ()
 tvm instrinsics to call barrier.cluster.wait.{acquire}{.aligned} More...
 
const Opptx_elect_sync ()
 tvm instrinsics to call elect.sync _|p, membermask and return the predicate More...
 
const Opptx_fence_mbarrier_init ()
 PTX fence.mbarrier_init.release.cluster instruction. More...
 
const Opptx_fetch_register ()
 tvm instrinsics to fetch PTX pre-defined registers More...
 
const Opmma_store ()
 tvm intrinsic for storing the result of PTX MMA into a destination pointer. For example, if each thread in a warp of size 32 has 4 elements from the result of m16xn8xk16 MMA in its registers, this intrinsic can be used to store the result in a 16x8 region in shared or global memory. More...
 
const Opmma_fill ()
 tvm intrinsic for zero-initializing an MMA accumulation register. For example, if each thread in a warp of size 32 has 8 elements from the A matrix in m16xn8xk16 MMA in its registers, this intrinsic can be used to zero-initialize its 4 accumulation registers. More...
 
const Opptx_wgmma_encode_matrix_descriptor ()
 tvm intrinsic to encode matrix descriptor for wgmma instructions. More...
 
const Opptx_wgmma_noop_barrier ()
 tvm intrinsic to call "" : "+r"(reg) :: "memory" More...
 
const Opptx_wgmma_mma_async_ss ()
 tvm intrinsic to call wgmma.mma_async.sync.aligned.shape.dtype.atype.btype where both A and B are in shared memory. More...
 
const Opptx_wgmma_mma_async_rs ()
 tvm intrinsic to call wgmma.mma_async.sync.aligned.shape.dtype.atype.btype where A is in register and B is in shared memory. More...
 
const Opptx_wgmma_fence ()
 tvm intrinsic to call wgmma.fence.sync.aligned; More...
 
const Opptx_wgmma_commit_group ()
 tvm intrinsic to call wgmma.commit_group.sync.aligned; More...
 
const Opptx_wgmma_wait_group ()
 tvm intrinsic to call wgmma.wait_group.sync.aligned; More...
 
const Opptx_stmatrix ()
 tvm intrinsic to call stmatrix.sync.aligned.m8n8.num{.trans}.shared.b16 [p], r; More...
 
const Opptx_setmaxnreg ()
 tvm intrinsic to call setmaxnreg.action.sync.aligned.u32 imm-reg-count More...
 
const Opptx_ld_global_acquire ()
 tvm intrinsic to call ld.global.acquire.gpu.b32 More...
 
const Opptx_tcgen05_alloc ()
 tvm instrinsics to call tcgen05.alloc.cta_group.sync.aligned; More...
 
const Opptx_tcgen05_dealloc ()
 tvm instrinsics to call tcgen05.dealloc.cta_group.sync.aligned; More...
 
const Opptx_tcgen05_relinquish_alloc_permit ()
 tvm instrinsics to call tcgen05.relinquish_alloc_permit.cta_group.sync.aligned; More...
 
const Opptx_tcgen05_fence_before_thread_sync ()
 tvm instrinsics to call tcgen05.fence::before_thread_sync; More...
 
const Opptx_tcgen05_fence_after_thread_sync ()
 tvm instrinsics to call tcgen05.fence::after_thread_sync; More...
 
const Opptx_tcgen05_ld ()
 tvm instrinsics to call tcgen05.ld.sync.aligned; More...
 
const Opptx_tcgen05_st ()
 tvm instrinsics to call tcgen05.st.sync.aligned; More...
 
const Opptx_tcgen05_wait_ld ()
 tvm instrinsics to call tcgen05.wait::ld.sync.aligned; More...
 
const Opptx_tcgen05_wait_st ()
 tvm instrinsics to call tcgen05.wait::st.sync.aligned; More...
 
const Opptx_tcgen05_encode_matrix_descriptor ()
 tvm intrinsic to encode matrix descriptor for tcgen05 instructions. More...
 
const Opptx_tcgen05_encode_instr_descriptor ()
 tvm intrinsic to encode instruction descriptor for tcgen05 MMA. More...
 
const Opptx_tcgen05_encode_instr_descriptor_block_scaled ()
 tvm intrinsic to encode instruction descriptor for tcgen05 MMA block scaled. More...
 
const Opptx_tcgen05_mma ()
 tvm intrinsic to call tcgen05.mma.cta_group.kind without block scaling. More...
 
const Opptx_tcgen05_mma_block_scale ()
 tvm intrinsic to call tcgen05.mma.cta_group.kind.block_scale{.scale_vec_size} More...
 
const Opptx_tcgen05_mma_sp ()
 tvm intrinsic to call tcgen05.mma.sp.cta_group.kind without block scaling. More...
 
const Opptx_tcgen05_mma_sp_block_scale ()
 tvm intrinsic to call tcgen05.mma.sp.cta_group.kind.block_scale{.scale_vec_size} More...
 
const Opptx_tcgen05_commit ()
 tvm instrinsics to call tcgen05.commit.cta_group More...
 
const Opptx_tcgen05_cp ()
 tvm instrinsics to call tcgen05.cp.cta_group More...
 
const Opptx_tcgen05_shift ()
 tvm instrinsics to call tcgen05.shift.cta_group.down More...
 
const Opptx_map_shared_rank ()
 tvm instrinsics to call map_shared_rank More...
 
const Opcuda_func_call ()
 tvm instrinsics to call a CUDA function. Source code is provided as a string. More...
 
const Opnvshmem_my_pe ()
 nvshmem intrinsics for nvshmem_my_pe() operation. More...
 
const Opnvshmem_n_pes ()
 nvshmem intrinsics for nvshmem_n_pes() operation. More...
 
const Opnvshmem_getmem_nbi ()
 nvshmem intrinsics for nvshmem_getmem_nbi() operation. More...
 
const Opnvshmem_putmem_nbi ()
 nvshmem intrinsics for nvshmem_putmem_nbi() operation. More...
 
const Opnvshmem_getmem_nbi_warp ()
 nvshmem intrinsics for nvshmemx_getmem_nbi_warp() operation. More...
 
const Opnvshmem_putmem_nbi_warp ()
 nvshmem intrinsics for nvshmemx_putmem_nbi_warp() operation. More...
 
const Opnvshmem_getmem_nbi_block ()
 nvshmem intrinsics for nvshmemx_getmem_nbi_block() operation. More...
 
const Opnvshmem_putmem_nbi_block ()
 nvshmem intrinsics for nvshmemx_putmem_nbi_block() operation. More...
 
const Opnvshmem_signal_op ()
 nvshmem intrinsics for nvshmemx_signal_op() operation. More...
 
const Opnvshmem_wait_until ()
 nvshmem intrinsics for nvshmem_FuncParam{TYPENAME}_wait_until() operation. More...
 
const Opnvshmem_quiet ()
 nvshmem intrinsics for nvshmem_quiet() operation. More...
 
const Opnvshmem_putmem_signal_nbi ()
 nvshmem intrinsics for nvshmemx_putmem_signal_nbi() operation. More...
 
const Opnvshmem_putmem_signal_nbi_warp ()
 nvshmem intrinsics for nvshmemx_putmem_signal_nbi_warp() operation. More...
 
const Opnvshmem_putmem_signal_nbi_block ()
 nvshmem intrinsics for nvshmemx_putmem_signal_nbi_block() operation. More...
 
const Opnvshmem_fence ()
 nvshmem intrinsics for nvshmem_fence() operation. More...
 
const Opnvshmem_barrier_all ()
 nvshmem intrinsics for nvshmem_barrier_all() operation. More...
 
const Opnki_load ()
 nki intrinsics for load operation. More...
 
const Opnki_store ()
 nki intrinsics for store operation. More...
 
const Opnki_tensor_copy ()
 nki intrinsics for tensor_copy operation. More...
 
const Opnki_matmul ()
 nki intrinsics for matmul operation. More...
 
const Opnki_activation ()
 nki intrinsics for activation operation. More...
 
const Opnki_reciprocal ()
 nki intrinsics for reciprocal operation. More...
 
const Opnki_tensortensor ()
 nki intrinsics for tensortensor operation. More...
 
const Opnki_tensorscalar ()
 nki intrinsics for tensorscalar operation. More...
 
const Opnki_tensorreduce ()
 nki intrinsics for tensorreduce operation. More...
 
const Opnki_memset ()
 nki intrinsics for memset operation. More...
 
const Opnki_activation_reduce ()
 nki intrinsics for activation reduce operation. More...
 
const Opnki_tensorscalar_reduce ()
 nki intrinsics for tensorscalar reduce operation. More...
 
const Opnki_identity ()
 nki intrinsics for initializing identity tensor. More...
 
const Opnki_scalar_tensor_tensor ()
 nki intrinsics for scalar tensor tensor operation. More...
 
const Opnki_scalar_tensor_scalar ()
 nki intrinsics for scalar tensor scalar operation. More...
 
const Opnki_affine_select ()
 nki intrinsics for affine_select operation. More...
 

Detailed Description

Collection of builtin intrinsics as ops.

Enumeration Type Documentation

◆ TVMStructFieldKind

The kind of structure field info used in intrinsic.

Enumerator
kDLTensorAddr 
kDLTensorData 
kDLTensorShape 
kDLTensorStrides 
kDLTensorNDim 
kDLTensorTypeCode 
kDLTensorTypeBits 
kDLTensorTypeLanes 
kDLTensorByteOffset 
kDLTensorDeviceId 
kDLTensorDeviceType 
kDLTensorKindBound_ 
kTVMValueContent 
kTVMFFIAnyTypeIndex 
kTVMFFIAnyZeroPadding 
kTVMFFIAnyUnionValue 
kTVMValueKindBound_ 
kInt64ArrayElem 

Function Documentation

◆ address_of()

const Op& tvm::tirx::builtin::address_of ( )

Returns the address of an element in the buffer (see pseudocode below).

The number of indices should match the dimensionality of the buffer being accessed. If this operation occurs after buffer flattening, the number of indices must be supported by the target (i.e. N>1 only on targets that support non-flat memory buffers).

Handle address_of(BufferLoad *op) { return &op->buffer_var[op->indices[0], op->indices[1], ..., op->indices[N-1]]; }

◆ anylist_getitem()

const Op& tvm::tirx::builtin::anylist_getitem ( )

Get a item from any list and return it.

Any anylist_getitem(Handle anylist, int index) return anylist[index]; }

Note
This intrinsic is only applicable when appearing in call_packed and anylist_setitem_call_packed.

◆ anylist_resetitem()

const Op& tvm::tirx::builtin::anylist_resetitem ( )

Reset and clear a item in any list.

void anylist_resetitem(Handle anylist, int index) anylist[index] = nullptr; }

Note
This intrinsic is only applicable when appearing in call_packed and anylist_setitem_call_packed.

◆ anylist_setitem_call_cpacked()

const Op& tvm::tirx::builtin::anylist_setitem_call_cpacked ( )

Same as anylist_setitem_call_packed but use C calling convention.

◆ anylist_setitem_call_packed()

const Op& tvm::tirx::builtin::anylist_setitem_call_packed ( )

Set an item into any list by running packed function call.

void anylist_setitem_call_packed(Handle anylist, int index, name, *args)

anylist[index] = call_packed(name, *args) }

Note
This intrinsic can be used in combination with anylist_getitem.

◆ assume()

const Op& tvm::tirx::builtin::assume ( )

Provide a true statement that can be used for simplifications.

Compile-time representation of known constraints about function inputs. This assumption is removed when lowering, and does not occur in codegen.

◆ atomic_add()

const Op& tvm::tirx::builtin::atomic_add ( )

atomic add instruction, corresponding e.g. to atomicAdd in CUDA

◆ bitwise_and()

const Op& tvm::tirx::builtin::bitwise_and ( )

Bitwise and operator.

◆ bitwise_not()

const Op& tvm::tirx::builtin::bitwise_not ( )

Bitwise not operator.

◆ bitwise_or()

const Op& tvm::tirx::builtin::bitwise_or ( )

Bitwise or operator.

◆ bitwise_xor()

const Op& tvm::tirx::builtin::bitwise_xor ( )

Bitwise xor operator.

◆ break_loop()

const Op& tvm::tirx::builtin::break_loop ( )

Loop break.

◆ buffer_offset()

const Op& tvm::tirx::builtin::buffer_offset ( )

Get the element offset of a buffer given logical indices.

The offset is determined by the layout of the buffer.

◆ call_extern()

const Op& tvm::tirx::builtin::call_extern ( )

Call an extern C function with given name and signature from the types of args in the runtime environment.

Type call_extern(name, args...) { return dlsym(name)(args...); }

Note
This intrinsic does not provide any type checking, and is main used for backward compatibility reasons. Always consider use pre-registered and typed tvm::Op first.

◆ call_llvm_intrin()

const Op& tvm::tirx::builtin::call_llvm_intrin ( )

Call an LLVM intrinsic with a given intrinsic id and signature from the types of args in the runtime environment.

Type call_llvm_pure_intrin(intrin_id, args...) { return dlsym(name)(args...); }

Note
This op does not provide any type checking.

◆ call_llvm_pure_intrin()

const Op& tvm::tirx::builtin::call_llvm_pure_intrin ( )

Call an LLVM pure intrinsic with a given intrinsic id and signature from the types of args in the runtime environment.

Type call_llvm_pure_intrin(intrin_id, args...) { return dlsym(name)(args...); }

Note
This op does not provide any type checking.

◆ call_pure_extern()

const Op& tvm::tirx::builtin::call_pure_extern ( )

Call an pure extern C function with given name and signature from the types of args in the runtime environment.

Type call_pure_extern(name, args...) { return dlsym(name)(args...); }

Note
This intrinsic does not provide any type checking, and is main used for backward compatibility reasons. Always consider use pre-registered and typed tvm::Op first.

◆ call_spirv_pure_glsl450()

const Op& tvm::tirx::builtin::call_spirv_pure_glsl450 ( )

Call an SPIRV pure GLSL450 intrinsic.

Type call_spirv_pure_glsl450(intrin_id, args...) { return dlsym(name)(args...); }

Note
This op does not provide any type checking.

◆ continue_loop()

const Op& tvm::tirx::builtin::continue_loop ( )

Loop continue.

◆ cooperative_tensor_fill()

const Op& tvm::tirx::builtin::cooperative_tensor_fill ( )

Fill a cooperative_tensor with a given value.

void cooperative_tensor_fill(Var d, PrimExpr index, PrimExpr value, int rows, int cols);

◆ cooperative_tensor_load()

const Op& tvm::tirx::builtin::cooperative_tensor_load ( )

Load data from device or threadgroup memory into a cooperative_tensor.

void cooperative_tensor_load(Var d, PrimExpr index, PrimExpr ptr, PrimExpr stride, int rows, int cols, bool transpose_matrix, int mma_M, int mma_N, int mma_K, int operand_role); operand_role: 0=left(A), 1=right(B), 2=destination(C)

◆ cooperative_tensor_multiply_accumulate()

const Op& tvm::tirx::builtin::cooperative_tensor_multiply_accumulate ( )

Multiply and accumulate two matrices using cooperative_tensor (MetalPerformancePrimitives matmul2d).

void cooperative_tensor_multiply_accumulate( Var d, PrimExpr index_d, Var a, PrimExpr index_a, Var b, PrimExpr index_b, Var c, PrimExpr index_c, int M, int N, int K, bool transpose_a, bool transpose_b);

◆ cooperative_tensor_store()

const Op& tvm::tirx::builtin::cooperative_tensor_store ( )

Store data from a cooperative_tensor to device or threadgroup memory.

void cooperative_tensor_store(Var d, PrimExpr index, PrimExpr ptr, PrimExpr stride, int rows, int cols, bool transpose_matrix, int mma_M, int mma_N, int mma_K, int operand_role); operand_role: 0=left(A), 1=right(B), 2=destination(C)

◆ cuda_atomic_add()

const Op& tvm::tirx::builtin::cuda_atomic_add ( )

tvm intrinsic for cuda atomic add instruction

◆ cuda_atomic_cas()

const Op& tvm::tirx::builtin::cuda_atomic_cas ( )

tvm intrinsic for cuda atomic compare and swap instruction

◆ cuda_bfloat162float()

const Op& tvm::tirx::builtin::cuda_bfloat162float ( )

tvm intrinsic for cuda bfloat16 to float conversion

◆ cuda_copy_bytes()

const Op& tvm::tirx::builtin::cuda_copy_bytes ( )

Typed load/store copy of num_bytes bytes.

cuda_copy_bytes(dst, src, num_bytes) copies num_bytes bytes from src to dst using a single typed load/store (uint4, uint2, unsigned int, etc.). num_bytes must be one of {1, 2, 4, 8, 16}.

◆ cuda_cta_reduce()

const Op& tvm::tirx::builtin::cuda_cta_reduce ( )

CTA-wide reduction via warp shuffle + shared memory.

cuda_cta_reduce(value, op, num_warps, scratch) reduces value across the entire CTA using the specified operation ("sum", "max", "min").

◆ cuda_cta_sync()

const Op& tvm::tirx::builtin::cuda_cta_sync ( )

tvm intrinsic for cuda block-wide sync (syncthreads)

◆ cuda_float22half2()

const Op& tvm::tirx::builtin::cuda_float22half2 ( )

tvm intrinsic for a helper converting float2 to half2 with rounding

◆ cuda_float8tohalf8()

const Op& tvm::tirx::builtin::cuda_float8tohalf8 ( )

tvm intrinsic to convert 8 float2 lanes to 8 half2 lanes with rounding

◆ cuda_func_call()

const Op& tvm::tirx::builtin::cuda_func_call ( )

tvm instrinsics to call a CUDA function. Source code is provided as a string.

cuda_func_call(String func_name, PrimExpr... args, String source_code)

◆ cuda_get_tmem_addr()

const Op& tvm::tirx::builtin::cuda_get_tmem_addr ( )

tvm intrinsic for cuda tmem address calculation

◆ cuda_grid_sync()

const Op& tvm::tirx::builtin::cuda_grid_sync ( )

tvm intrinsic for cuda grid-wide sync (cooperative groups)

◆ cuda_half2float()

const Op& tvm::tirx::builtin::cuda_half2float ( )

tvm intrinsic for cuda half to float conversion

◆ cuda_half8tofloat8()

const Op& tvm::tirx::builtin::cuda_half8tofloat8 ( )

tvm intrinsic to convert 8 half2 lanes to 8 float2 lanes

◆ cuda_ldg()

const Op& tvm::tirx::builtin::cuda_ldg ( )

tvm intrinsic for cuda ldg instruction

◆ cuda_nano_sleep()

const Op& tvm::tirx::builtin::cuda_nano_sleep ( )

tvm intrinsic for cuda nano sleep instruction

◆ cuda_printf()

const Op& tvm::tirx::builtin::cuda_printf ( )

tvm intrinsic for cuda printf instruction

◆ cuda_runtime_instr_desc()

const Op& tvm::tirx::builtin::cuda_runtime_instr_desc ( )

tvm intrinsic to modify runtime instruction descriptor

◆ cuda_syncthreads_and()

const Op& tvm::tirx::builtin::cuda_syncthreads_and ( )

tvm intrinsic for cuda syncthreads_and instruction

◆ cuda_syncthreads_or()

const Op& tvm::tirx::builtin::cuda_syncthreads_or ( )

tvm intrinsic for cuda syncthreads_or instruction

◆ cuda_thread_fence()

const Op& tvm::tirx::builtin::cuda_thread_fence ( )

tvm intrinsic for cuda thread fence instruction

◆ cuda_thread_rank()

const Op& tvm::tirx::builtin::cuda_thread_rank ( )

tvm intrinsic that returns cooperative_groups::thread_rank() for the enclosing CTA (linear thread index within the block).

◆ cuda_trap_when_assert_failed()

const Op& tvm::tirx::builtin::cuda_trap_when_assert_failed ( )

tvm intrinsic to trap when an assertion failed (cond == false)

◆ cuda_warp_reduce()

const Op& tvm::tirx::builtin::cuda_warp_reduce ( )

Warp-level butterfly shuffle-XOR reduction.

cuda_warp_reduce(value, op, width) reduces value across width adjacent lanes using the specified operation ("sum", "max", "min").

◆ cuda_warp_sync()

const Op& tvm::tirx::builtin::cuda_warp_sync ( )

tvm intrinsic for cuda warp sync instruction

◆ dma_copy()

const Op& tvm::tirx::builtin::dma_copy ( )

Initiate a non-blocking DMA copy from source to destination.

The copy is launched immediately.

If a dma_start_group() call is active, the copy will be added to the current group for tracking of in-flight group counts.

If no dma_start_group() call is active, the copy will be tracked individually i.e. as a group with size 1.

◆ dma_end_group()

const Op& tvm::tirx::builtin::dma_end_group ( )

End a group of DMA copies.

Track all calls to dma_copy() that occurred since the preceding dma_start_group() as a single group in-flight.

Calling dma_end_group() without an active group is unsupported.

Note: A group of DMA calls may be empty, and will still contribute to the count of in-flight groups used by dma_wait().

◆ dma_start_group()

const Op& tvm::tirx::builtin::dma_start_group ( )

Start a group of DMA copies.

Any call to dma_copy() that occurs after dma_start_group() will be added to the current group for tracking of in-flight group counts.

Only one DMA group may be active at a given time. Calling dma_start_group() while a group is active is unsupported.

◆ dma_wait()

const Op& tvm::tirx::builtin::dma_wait ( )

Wait until the number of DMA groups in flight is less than or equal to some maximum.

Calling dma_wait() while a group is active is unsupported.

◆ dp4a()

const Op& tvm::tirx::builtin::dp4a ( )

Dot product of two int8x4 vectors and add an optional accumulator.

◆ end_profile_intrinsic()

const Op& tvm::tirx::builtin::end_profile_intrinsic ( )

Profiling intrinsic.

◆ filter()

const Op& tvm::tirx::builtin::filter ( )

Thread-set filter predicate. Used as the condition of an IfThenElse to narrow the active thread set A for the then-branch. Two forms: filter(var, lo, hi) – range form, true iff var in [lo, hi) filter(var, cond) – predicate form (e.g. var == k); true iff cond var must be a ScopeIdDef-declared Var at parse time (Verifier Rule 2).

◆ fma()

const Op& tvm::tirx::builtin::fma ( )

Fused multiply add.

Type fma(a, b, c) { return a * b + c; }

◆ get_active_lane_mask()

const Op& tvm::tirx::builtin::get_active_lane_mask ( )

Calculate a predicate mask given an upper bound (limit) and a current value (base).

It will be lowered to the llvm.get.active.lane.mask intrinsic. (https://llvm.org/docs/LangRef.html#llvm-get-active-lane-mask-intrinsics)

◆ handle_add_byte_offset()

const Op& tvm::tirx::builtin::handle_add_byte_offset ( )

See pesudo code.

void* handle_add_byte_offset(void* handle, int offset) { return reinterpret_cast<v*>(reinterpret_cast<char*>(handle) + offset); }

◆ if_then_else()

const Op& tvm::tirx::builtin::if_then_else ( )

Same as select, used for unsafe memory access.

Type tvm_if_then_else(cond, a, b) { return cond ? a : b; }

◆ ignore_loop_partition()

const Op& tvm::tirx::builtin::ignore_loop_partition ( )

Annotate a predicate not be considered as target condition of loop partition.

◆ isnan()

const Op& tvm::tirx::builtin::isnan ( )

Check if value is nan.

◆ isnullptr()

const Op& tvm::tirx::builtin::isnullptr ( )

See pesudo code.

bool isnullptr(void* handle) { return handle == nullptr }

◆ large_uint_imm()

const Op& tvm::tirx::builtin::large_uint_imm ( )

See pesudo code.

Construct a big uint that may not be representable by int64

Expr large_uint_imm(uint32_t v0, uin32_t v1) { return (v1 << 32) | v0; }

◆ likely()

const Op& tvm::tirx::builtin::likely ( )

Marks a condition is likely going to happen.

◆ lookup_param()

const Op& tvm::tirx::builtin::lookup_param ( )

See pseudo code Type lookup_param(ffi::String param_name) { return __tvm_param__param_name; }.

◆ make_filled_simdgroup_matrix()

const Op& tvm::tirx::builtin::make_filled_simdgroup_matrix ( )

tvm intrinsic for initializing and simdgroup with given value.

Note
only 8x8 shape is supported by Metal Spec and TVM, but we still keep shape as params, keeping the similar interface with Metal Spec.

void make_filled_simdgroup_matrix(Var d, PrimExpr index, PrimExpr value, int col = 8, int row = 8);

◆ mma_fill()

const Op& tvm::tirx::builtin::mma_fill ( )

tvm intrinsic for zero-initializing an MMA accumulation register. For example, if each thread in a warp of size 32 has 8 elements from the A matrix in m16xn8xk16 MMA in its registers, this intrinsic can be used to zero-initialize its 4 accumulation registers.

There is no real PTX instruction that does that, but we introduce this intrinsic for the same reason as mma_store above.

void mma_fill(IntImm local_size, Var local_ptr, Expr offset);

◆ mma_fill_legacy()

const Op& tvm::tirx::builtin::mma_fill_legacy ( )

◆ mma_store()

const Op& tvm::tirx::builtin::mma_store ( )

tvm intrinsic for storing the result of PTX MMA into a destination pointer. For example, if each thread in a warp of size 32 has 4 elements from the result of m16xn8xk16 MMA in its registers, this intrinsic can be used to store the result in a 16x8 region in shared or global memory.

There is no real PTX instruction that does that, but we want to hide details of complex index manipulation behind this intrinsic to simplify TIR lowering passes (e.g. LowerWarpMemory).

void mma_store(IntImm m, IntImm n, Var dst_ptr, Var src_ptr, Expr src_offset, Var dst_stride);

◆ mma_store_legacy()

const Op& tvm::tirx::builtin::mma_store_legacy ( )

◆ nd_mem_alloc_with_scope()

const Op& tvm::tirx::builtin::nd_mem_alloc_with_scope ( )

Create an Nd memory allocation with storage scope.

◆ nki_activation()

const Op& tvm::tirx::builtin::nki_activation ( )

nki intrinsics for activation operation.

nki_activation(result, data, opcode, bias, scale)

◆ nki_activation_reduce()

const Op& tvm::tirx::builtin::nki_activation_reduce ( )

nki intrinsics for activation reduce operation.

nki_activation_reduce(reduce_res, act_res, data, opcode, reduce_opcode, bias, scale)

◆ nki_affine_select()

const Op& tvm::tirx::builtin::nki_affine_select ( )

nki intrinsics for affine_select operation.

nki_affine_select(result, pred, true_value, false_value)

◆ nki_identity()

const Op& tvm::tirx::builtin::nki_identity ( )

nki intrinsics for initializing identity tensor.

nki_identity(result, size)

◆ nki_load()

const Op& tvm::tirx::builtin::nki_load ( )

nki intrinsics for load operation.

nki_load(result, data)

◆ nki_matmul()

const Op& tvm::tirx::builtin::nki_matmul ( )

nki intrinsics for matmul operation.

nki_matmul(C, A, B, accum)

equivalent to C += A.T @ B (if accum is true), or C = A.T @ B (if accum is false)

◆ nki_memset()

const Op& tvm::tirx::builtin::nki_memset ( )

nki intrinsics for memset operation.

nki_memset(result, value)

◆ nki_reciprocal()

const Op& tvm::tirx::builtin::nki_reciprocal ( )

nki intrinsics for reciprocal operation.

nki_reciprocal(result, data)

◆ nki_scalar_tensor_scalar()

const Op& tvm::tirx::builtin::nki_scalar_tensor_scalar ( )

nki intrinsics for scalar tensor scalar operation.

(data op1 operand1) op2 (operand2) where op1 and op2 are tensor-scalar

nki_scalar_tensor_scalar(result, data, operand0, operand1, opcode0, opcode1, reverse0, reverse1)

◆ nki_scalar_tensor_tensor()

const Op& tvm::tirx::builtin::nki_scalar_tensor_tensor ( )

nki intrinsics for scalar tensor tensor operation.

(data op1 operand1) op2 (operand2) where op1 is tensor-scalar and op2 is tensor-tensor

nki_scalar_tensor_tensor(result, data, operand0, operand1, opcode0, opcode1, reverse0, reverse1)

◆ nki_store()

const Op& tvm::tirx::builtin::nki_store ( )

nki intrinsics for store operation.

nki_store(result, data)

◆ nki_tensor_copy()

const Op& tvm::tirx::builtin::nki_tensor_copy ( )

nki intrinsics for tensor_copy operation.

nki_tensor_copy(result, data)

◆ nki_tensorreduce()

const Op& tvm::tirx::builtin::nki_tensorreduce ( )

nki intrinsics for tensorreduce operation.

nki_tensorreduce(result, data, opcode, negate, axes)

◆ nki_tensorscalar()

const Op& tvm::tirx::builtin::nki_tensorscalar ( )

nki intrinsics for tensorscalar operation.

nki_tensorscalar(result, operand0, operand1, opcode, reverse)

◆ nki_tensorscalar_reduce()

const Op& tvm::tirx::builtin::nki_tensorscalar_reduce ( )

nki intrinsics for tensorscalar reduce operation.

nki_tensorscalar_reduce(reduce_res, tensorscalar_res, operand0, operand1, opcode, reduce_opcode, reverse)

◆ nki_tensortensor()

const Op& tvm::tirx::builtin::nki_tensortensor ( )

nki intrinsics for tensortensor operation.

nki_tensortensor(result, operand0, operand1, opcode)

◆ nvshmem_barrier_all()

const Op& tvm::tirx::builtin::nvshmem_barrier_all ( )

nvshmem intrinsics for nvshmem_barrier_all() operation.

void nvshmem_barrier_all()

◆ nvshmem_fence()

const Op& tvm::tirx::builtin::nvshmem_fence ( )

nvshmem intrinsics for nvshmem_fence() operation.

void nvshmem_fence()

◆ nvshmem_getmem_nbi()

const Op& tvm::tirx::builtin::nvshmem_getmem_nbi ( )

nvshmem intrinsics for nvshmem_getmem_nbi() operation.

void nvshmem_getmem_nbi(void *dest, const void *source, size_t nelems, int pe)

◆ nvshmem_getmem_nbi_block()

const Op& tvm::tirx::builtin::nvshmem_getmem_nbi_block ( )

nvshmem intrinsics for nvshmemx_getmem_nbi_block() operation.

void nvshmemx_getmem_nbi_block(void *dest, const void *source, size_t nelems, int pe)

◆ nvshmem_getmem_nbi_warp()

const Op& tvm::tirx::builtin::nvshmem_getmem_nbi_warp ( )

nvshmem intrinsics for nvshmemx_getmem_nbi_warp() operation.

void nvshmemx_getmem_nbi_warp(void *dest, const void *source, size_t nelems, int pe)

◆ nvshmem_my_pe()

const Op& tvm::tirx::builtin::nvshmem_my_pe ( )

nvshmem intrinsics for nvshmem_my_pe() operation.

int nvshmem_my_pe()

◆ nvshmem_n_pes()

const Op& tvm::tirx::builtin::nvshmem_n_pes ( )

nvshmem intrinsics for nvshmem_n_pes() operation.

int nvshmem_n_pes()

◆ nvshmem_putmem_nbi()

const Op& tvm::tirx::builtin::nvshmem_putmem_nbi ( )

nvshmem intrinsics for nvshmem_putmem_nbi() operation.

void nvshmem_putmem_nbi(void *dest, const void *source, size_t nelems, int pe)

◆ nvshmem_putmem_nbi_block()

const Op& tvm::tirx::builtin::nvshmem_putmem_nbi_block ( )

nvshmem intrinsics for nvshmemx_putmem_nbi_block() operation.

void nvshmemx_putmem_nbi_block(void *dest, const void *source, size_t nelems, int pe)

◆ nvshmem_putmem_nbi_warp()

const Op& tvm::tirx::builtin::nvshmem_putmem_nbi_warp ( )

nvshmem intrinsics for nvshmemx_putmem_nbi_warp() operation.

void nvshmemx_putmem_nbi_warp(void *dest, const void *source, size_t nelems, int pe)

◆ nvshmem_putmem_signal_nbi()

const Op& tvm::tirx::builtin::nvshmem_putmem_signal_nbi ( )

nvshmem intrinsics for nvshmemx_putmem_signal_nbi() operation.

void nvshmemx_putmem_signal_nbi(void *dest, const void *source, size_t nelems, uint64_t *sig_addr, uint64_t signal, int sig_op, int pe)

◆ nvshmem_putmem_signal_nbi_block()

const Op& tvm::tirx::builtin::nvshmem_putmem_signal_nbi_block ( )

nvshmem intrinsics for nvshmemx_putmem_signal_nbi_block() operation.

void nvshmemx_putmem_signal_nbi_block(void *dest, const void *source, size_t nelems, uint64_t *sig_addr, uint64_t signal, int sig_op, int pe)

◆ nvshmem_putmem_signal_nbi_warp()

const Op& tvm::tirx::builtin::nvshmem_putmem_signal_nbi_warp ( )

nvshmem intrinsics for nvshmemx_putmem_signal_nbi_warp() operation.

void nvshmemx_putmem_signal_nbi_warp(void *dest, const void *source, size_t nelems, uint64_t *sig_addr, uint64_t signal, int sig_op, int pe)

◆ nvshmem_quiet()

const Op& tvm::tirx::builtin::nvshmem_quiet ( )

nvshmem intrinsics for nvshmem_quiet() operation.

void nvshmem_quiet()

◆ nvshmem_signal_op()

const Op& tvm::tirx::builtin::nvshmem_signal_op ( )

nvshmem intrinsics for nvshmemx_signal_op() operation.

void nvshmemx_signal_op(uint64_t *sig_addr, uint64_t signal, int sig_op, int pe)

◆ nvshmem_wait_until()

const Op& tvm::tirx::builtin::nvshmem_wait_until ( )

nvshmem intrinsics for nvshmem_FuncParam{TYPENAME}_wait_until() operation.

void nvshmem_FuncParam{TYPENAME}_wait_until(TYPE *ivar, int cmp, TYPE cmp_value)

◆ popcount()

const Op& tvm::tirx::builtin::popcount ( )

Popcount.

◆ prefetch()

const Op& tvm::tirx::builtin::prefetch ( )

same signature as llvm.prefetch

◆ print_buffer()

const Op& tvm::tirx::builtin::print_buffer ( )

Print the content of a buffer during runtime.

◆ ptx_add_packed_f32x2()

const Op& tvm::tirx::builtin::ptx_add_packed_f32x2 ( )

tvm intrinsic for PTX packed add instruction (sm_100a+)

◆ ptx_any_sync()

const Op& tvm::tirx::builtin::ptx_any_sync ( )

tvm intrinsic for PTX warp-wide any predicate (__any_sync)

◆ ptx_bar_arrive()

const Op& tvm::tirx::builtin::ptx_bar_arrive ( )

tvm instrinsics to call bar.arrive a, b

bar_arrive(int name_bar_id, int thread_count)

◆ ptx_bar_sync()

const Op& tvm::tirx::builtin::ptx_bar_sync ( )

tvm instrinsics to call bar.sync a, {b}

bar_sync(int name_bar_id, int thread_count)

◆ ptx_barrier_cluster_arrive()

const Op& tvm::tirx::builtin::ptx_barrier_cluster_arrive ( )

tvm instrinsics to call barrier.cluster.arrive{.sem}{.aligned}

ptx_barrier_cluster_arrive(string sem, bool aligned)

◆ ptx_barrier_cluster_wait()

const Op& tvm::tirx::builtin::ptx_barrier_cluster_wait ( )

tvm instrinsics to call barrier.cluster.wait.{acquire}{.aligned}

ptx_barrier_cluster_wait(bool acquire, bool aligned)

◆ ptx_cp_async()

const Op& tvm::tirx::builtin::ptx_cp_async ( )

tvm intrinsics for ptx async copy from global to shared memory using cp.async

void ptx_cp_async(Var shared_ptr, Expr shared_offset, Var global_ptr, Expr global_offset, size_t bytes);

◆ ptx_cp_async_bulk()

const Op& tvm::tirx::builtin::ptx_cp_async_bulk ( )

tvm intrinsics for ptx async copy from global to shared memory using cp.async.bulk

void ptx_cp_async_bulk(Var shared_ptr, Expr shared_offset, Var global_ptr, Expr global_offset, size_t bytes, int barrier_arr_id, int barrier_id);

◆ ptx_cp_async_bulk_commit_group()

const Op& tvm::tirx::builtin::ptx_cp_async_bulk_commit_group ( )

tvm instrinsics to call cp.async.bulk.commit_group

ptx_cp_async_bulk_commit_group()

◆ ptx_cp_async_bulk_shared_to_cluster()

const Op& tvm::tirx::builtin::ptx_cp_async_bulk_shared_to_cluster ( )

tvm intrinsics for ptx async bulk copy from shared::cta to shared::cluster

void ptx_cp_async_bulk_shared_to_cluster(Expr dst_ptr, Expr src_ptr, Expr size, Expr mbar);

◆ ptx_cp_async_bulk_tensor_global_to_cluster()

const Op& tvm::tirx::builtin::ptx_cp_async_bulk_tensor_global_to_cluster ( )

tvm instrinsics to call cp.async.bulk.tensor.dim.shared::cluster.global.tile.mbarrier::complete_tx::bytes

TMA alignment requirement: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#table-alignment-multi-dim-tma

ptx_cp_async_bulk_tensor_global_to_cluster(int dim, PrimExpr dst_ptr, PrimExpr bar_ptr, PrimExpr tensormap_addr, int...coords, int cta_mask, int cta_group, string cache_hint)

◆ ptx_cp_async_bulk_tensor_global_to_cluster_prefetch()

const Op& tvm::tirx::builtin::ptx_cp_async_bulk_tensor_global_to_cluster_prefetch ( )

tvm instrinsics to call cp.async.bulk.prefetch.tensor.dim.L2.global.tile

TMA alignment requirement: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#table-alignment-multi-dim-tma

ptx_cp_async_bulk_tensor_global_to_cluster_prefetch(int dim, PrimExpr tensormap_addr, int...coords, string cache_hint)

◆ ptx_cp_async_bulk_tensor_shared_to_global()

const Op& tvm::tirx::builtin::ptx_cp_async_bulk_tensor_shared_to_global ( )

tvm instrinsics to call cp.async.bulk.tensor.dim.global.shared::cta.tile。bulk_group

TMA alignment requirement: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#table-alignment-multi-dim-tma

ptx_cp_async_bulk_tensor_shared_to_global(int dim, PrimExpr src_ptr, PrimExpr tensormap_addr, int...coords, string cache_hint)

◆ ptx_cp_async_bulk_tensor_shared_to_global_reduce()

const Op& tvm::tirx::builtin::ptx_cp_async_bulk_tensor_shared_to_global_reduce ( )

tvm instrinsics to call cp.reduce.async.bulk.tensor.dim.dst.src.redOp

ptx_cp_async_bulk_tensor_shared_to_global_reduce(int dim, PrimExpr src_ptr, PrimExpr tensormap_addr, int...coords, string cache_hint)

◆ ptx_cp_async_bulk_tensor_tile_gather4_global_to_cluster()

const Op& tvm::tirx::builtin::ptx_cp_async_bulk_tensor_tile_gather4_global_to_cluster ( )

tvm intrinsic to call cp.async.bulk.tensor.dim.shared::cluster.global.tile::gather4.mbarrier::complete_tx::bytes

TMA alignment requirement: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#table-alignment-multi-dim-tma

ptx_cp_async_bulk_tensor_tile_gather4_global_to_cluster(int dim, PrimExpr dst_ptr, PrimExpr bar_ptr, PrimExpr tensormap_addr, int...coords, int cta_mask, int cta_group, string cache_hint)

◆ ptx_cp_async_bulk_wait_group()

const Op& tvm::tirx::builtin::ptx_cp_async_bulk_wait_group ( )

tvm instrinsics to call cp.async.bulk.wait_group{.read} N

ptx_cp_async_bulk_wait_group(int N, bool read)

◆ ptx_cp_async_commit_group()

const Op& tvm::tirx::builtin::ptx_cp_async_commit_group ( )

tvm intrinsics for ptx async copy commit and wait.

void ptx_cp_async_commit_group(); void ptx_cp_async_wait_group(int num);

◆ ptx_cp_async_mbarrier_arrive()

const Op& tvm::tirx::builtin::ptx_cp_async_mbarrier_arrive ( )

tvm intrinsics for ptx async copy barrier using cp.async.mbarrier.arrive

ptx_cp_async_mbarrier_arrive(int barrier_arr_id, int barrier_id)

◆ ptx_cp_async_wait_group()

const Op& tvm::tirx::builtin::ptx_cp_async_wait_group ( )

◆ ptx_elect_sync()

const Op& tvm::tirx::builtin::ptx_elect_sync ( )

tvm instrinsics to call elect.sync _|p, membermask and return the predicate

elect_sync(membermask)

◆ ptx_exp2()

const Op& tvm::tirx::builtin::ptx_exp2 ( )

tvm intrinsic for PTX fast exp2 approximation (ex2.approx.ftz.f32)

◆ ptx_fence()

const Op& tvm::tirx::builtin::ptx_fence ( )

PTX fence instruction: fence.{sem}.{scope}.

ptx_fence(StringImm sem, StringImm scope)

◆ ptx_fence_mbarrier_init()

const Op& tvm::tirx::builtin::ptx_fence_mbarrier_init ( )

PTX fence.mbarrier_init.release.cluster instruction.

ptx_fence_mbarrier_init()

◆ ptx_fence_proxy_async()

const Op& tvm::tirx::builtin::ptx_fence_proxy_async ( )

PTX fence.proxy.async instruction: fence.proxy.async[.{space}].

ptx_fence_proxy_async(StringImm space)

◆ ptx_fetch_register()

const Op& tvm::tirx::builtin::ptx_fetch_register ( )

tvm instrinsics to fetch PTX pre-defined registers

ptx_fetch_register(int bits, string reg_name)

◆ ptx_fma_packed_f32x2()

const Op& tvm::tirx::builtin::ptx_fma_packed_f32x2 ( )

tvm intrinsic for PTX packed FMA instruction (sm_100a+)

◆ ptx_ld_global_acquire()

const Op& tvm::tirx::builtin::ptx_ld_global_acquire ( )

tvm intrinsic to call ld.global.acquire.gpu.b32

ptx_ld_global_acquire()

◆ ptx_ldg32()

const Op & tvm::tirx::builtin::ptx_ldg32 ( )

tvm intrinsic for ptx predicate load with 32-bit data type.

◆ ptx_ldmatrix()

const Op& tvm::tirx::builtin::ptx_ldmatrix ( )

tvm intrinsic for ptx load matrix from shared memory.

void ptx_ldmatrix(Bool trans, IntImm num, StringImm type, Var local_ptr, Expr local_offset, Var smem_ptr, Expr smem_offset);

◆ ptx_ldmatrix_legacy()

const Op& tvm::tirx::builtin::ptx_ldmatrix_legacy ( )

◆ ptx_map_shared_rank()

const Op& tvm::tirx::builtin::ptx_map_shared_rank ( )

tvm instrinsics to call map_shared_rank

ptx_map_shared_rank(PrimExpr ptr, int rank)

◆ ptx_mbarrier_arrive()

const Op& tvm::tirx::builtin::ptx_mbarrier_arrive ( )

tvm instrinsics to call mbarrier.arrive.shared::cta.b64 or mapa.shared::cluster.u32 mbarrier.arrive.shared::cluster.b64

◆ ptx_mbarrier_arrive_expect_tx()

const Op& tvm::tirx::builtin::ptx_mbarrier_arrive_expect_tx ( )

tvm instrinsics to call mbarrier.arrive.expect_tx.shared.b64 or mapa.shared::cluster.u32 mbarrier.arrive.expect_tx.shared.b64

ptx_mbarrier_arrive_expect_tx(uint64_t* bar_ptr, int byte_count)

◆ ptx_mbarrier_init()

const Op& tvm::tirx::builtin::ptx_mbarrier_init ( )

tvm instrinsics to call mbarrier.init.shared::cta.b64

ptx_mbarrier_init(uint64_t* bar_ptr, int thread_count)

◆ ptx_mbarrier_try_wait()

const Op& tvm::tirx::builtin::ptx_mbarrier_try_wait ( )

tvm instrinsics to call mbarrier.try_wait.parity repeatedly until it returns true

ptx_mbarrier_try_wait(uint64_t* bar_ptr, int phase)

◆ ptx_mma()

const Op& tvm::tirx::builtin::ptx_mma ( )

tvm intrinsic for ptx tensor core mma instructions.

void ptx_mma(StringImm shape, StringImm A_layout, StringImm B_layout, StringImm A_dtype, StringImm B_dtype, StringImm C_dtype, Var multiplicand_a, Expr a_index, Var multiplicand_b, Expr b_index, Var accumulator, Expr c_index, bool saturate);

◆ ptx_mma_legacy()

const Op& tvm::tirx::builtin::ptx_mma_legacy ( )

ptx mma / ldmatrix / mma_store / mma_fill variants that take (ptr_var, offset) pairs (not a folded access_ptr Call). Codegen emits ptr + offset C pointer arithmetic; lower_warp_memory rewrites the offset's group component to its thread-local index.

◆ ptx_mma_sp()

const Op& tvm::tirx::builtin::ptx_mma_sp ( )

tvm intrinsic for sparse tensor core ptx instructions.

void ptx_mma_sp(StringImm shape, StringImm A_layout, StringImm B_layout, StringImm A_dtype, StringImm B_dtype, StringImm C_dtype, Var multiplicand_a, Expr a_index, Var multiplicand_b, Expr b_index, Var accumulator, Expr c_index, Var metadata, Expr meta_index, Var sparse_selector, bool saturate);

◆ ptx_mul_packed_f32x2()

const Op& tvm::tirx::builtin::ptx_mul_packed_f32x2 ( )

tvm intrinsic for PTX packed multiply instruction (sm_100a+)

◆ ptx_rcp()

const Op& tvm::tirx::builtin::ptx_rcp ( )

tvm intrinsic for PTX fast reciprocal approximation (rcp.approx.ftz.f32)

◆ ptx_reduce3_max_f32()

const Op& tvm::tirx::builtin::ptx_reduce3_max_f32 ( )

tvm intrinsic for PTX 3-input max instruction (sm_100a+)

◆ ptx_reduce3_min_f32()

const Op& tvm::tirx::builtin::ptx_reduce3_min_f32 ( )

tvm intrinsic for PTX 3-input min instruction (sm_100a+)

◆ ptx_setmaxnreg()

const Op& tvm::tirx::builtin::ptx_setmaxnreg ( )

tvm intrinsic to call setmaxnreg.action.sync.aligned.u32 imm-reg-count

◆ ptx_stmatrix()

const Op& tvm::tirx::builtin::ptx_stmatrix ( )

tvm intrinsic to call stmatrix.sync.aligned.m8n8.num{.trans}.shared.b16 [p], r;

ptx_stmatrix(int num, bool trans, PrimExpr ptr, PrimExpr... vars)

◆ ptx_sub_packed_f32x2()

const Op& tvm::tirx::builtin::ptx_sub_packed_f32x2 ( )

tvm intrinsic for PTX packed subtract instruction (sm_100a+)

◆ ptx_tcgen05_alloc()

const Op& tvm::tirx::builtin::ptx_tcgen05_alloc ( )

tvm instrinsics to call tcgen05.alloc.cta_group.sync.aligned;

ptx_tcgen05_alloc(Var dst_ptr, int n_cols, int cta_group)

◆ ptx_tcgen05_commit()

const Op& tvm::tirx::builtin::ptx_tcgen05_commit ( )

tvm instrinsics to call tcgen05.commit.cta_group

ptx_tcgen05_commit()

◆ ptx_tcgen05_cp()

const Op& tvm::tirx::builtin::ptx_tcgen05_cp ( )

tvm instrinsics to call tcgen05.cp.cta_group

ptx_tcgen05_cp()

◆ ptx_tcgen05_dealloc()

const Op& tvm::tirx::builtin::ptx_tcgen05_dealloc ( )

tvm instrinsics to call tcgen05.dealloc.cta_group.sync.aligned;

ptx_tcgen05_dealloc(uint32_t taddr, int n_cols, int cta_group)

◆ ptx_tcgen05_encode_instr_descriptor()

const Op& tvm::tirx::builtin::ptx_tcgen05_encode_instr_descriptor ( )

tvm intrinsic to encode instruction descriptor for tcgen05 MMA.

ptx_tcgen05_encode_instr_descriptor(PrimExpr desc, string d_dtype, string a_dtype, string b_dtype, int M, int N, int K, bool trans_a, bool trans_b, int n_cta_groups, bool neg_a, bool neg_b, bool sat_d, bool is_sparse)

◆ ptx_tcgen05_encode_instr_descriptor_block_scaled()

const Op& tvm::tirx::builtin::ptx_tcgen05_encode_instr_descriptor_block_scaled ( )

tvm intrinsic to encode instruction descriptor for tcgen05 MMA block scaled.

ptx_tcgen05_encode_instr_descriptor_block_scaled(PrimExpr desc, string d_dtype, string a_dtype, string b_dtype, string sfa_dtype, string stb_dtype, int M, int N, int K, bool trans_a, bool trans_b, int n_cta_groups, bool neg_a, bool neg_b, bool is_sparse)

◆ ptx_tcgen05_encode_matrix_descriptor()

const Op& tvm::tirx::builtin::ptx_tcgen05_encode_matrix_descriptor ( )

tvm intrinsic to encode matrix descriptor for tcgen05 instructions.

ptx_tcgen05_encode_matrix_descriptor(PrimExpr ptr, PrimExpr ldo, PrimExpr sdo, int swizzle)

◆ ptx_tcgen05_fence_after_thread_sync()

const Op& tvm::tirx::builtin::ptx_tcgen05_fence_after_thread_sync ( )

tvm instrinsics to call tcgen05.fence::after_thread_sync;

ptx_tcgen05_fence_after_thread_sync()

◆ ptx_tcgen05_fence_before_thread_sync()

const Op& tvm::tirx::builtin::ptx_tcgen05_fence_before_thread_sync ( )

tvm instrinsics to call tcgen05.fence::before_thread_sync;

ptx_tcgen05_fence_before_thread_sync()

◆ ptx_tcgen05_ld()

const Op& tvm::tirx::builtin::ptx_tcgen05_ld ( )

tvm instrinsics to call tcgen05.ld.sync.aligned;

ptx_tcgen05_ld()

◆ ptx_tcgen05_mma()

const Op& tvm::tirx::builtin::ptx_tcgen05_mma ( )

tvm intrinsic to call tcgen05.mma.cta_group.kind without block scaling.

ptx_tcgen05_mma()

◆ ptx_tcgen05_mma_block_scale()

const Op& tvm::tirx::builtin::ptx_tcgen05_mma_block_scale ( )

tvm intrinsic to call tcgen05.mma.cta_group.kind.block_scale{.scale_vec_size}

ptx_tcgen05_mma_block_scale()

◆ ptx_tcgen05_mma_sp()

const Op& tvm::tirx::builtin::ptx_tcgen05_mma_sp ( )

tvm intrinsic to call tcgen05.mma.sp.cta_group.kind without block scaling.

ptx_tcgen05_mma_sp()

◆ ptx_tcgen05_mma_sp_block_scale()

const Op& tvm::tirx::builtin::ptx_tcgen05_mma_sp_block_scale ( )

tvm intrinsic to call tcgen05.mma.sp.cta_group.kind.block_scale{.scale_vec_size}

ptx_tcgen05_mma_sp_block_scale()

◆ ptx_tcgen05_relinquish_alloc_permit()

const Op& tvm::tirx::builtin::ptx_tcgen05_relinquish_alloc_permit ( )

tvm instrinsics to call tcgen05.relinquish_alloc_permit.cta_group.sync.aligned;

ptx_tcgen05_relinquish_alloc_permit(int cta_group)

◆ ptx_tcgen05_shift()

const Op& tvm::tirx::builtin::ptx_tcgen05_shift ( )

tvm instrinsics to call tcgen05.shift.cta_group.down

ptx_tcgen05_shift()

◆ ptx_tcgen05_st()

const Op& tvm::tirx::builtin::ptx_tcgen05_st ( )

tvm instrinsics to call tcgen05.st.sync.aligned;

ptx_tcgen05_st()

◆ ptx_tcgen05_wait_ld()

const Op& tvm::tirx::builtin::ptx_tcgen05_wait_ld ( )

tvm instrinsics to call tcgen05.wait::ld.sync.aligned;

ptx_tcgen05_wait_ld()

◆ ptx_tcgen05_wait_st()

const Op& tvm::tirx::builtin::ptx_tcgen05_wait_st ( )

tvm instrinsics to call tcgen05.wait::st.sync.aligned;

ptx_tcgen05_wait_st()

◆ ptx_wgmma_commit_group()

const Op& tvm::tirx::builtin::ptx_wgmma_commit_group ( )

tvm intrinsic to call wgmma.commit_group.sync.aligned;

ptx_wgmma_commit_group()

◆ ptx_wgmma_encode_matrix_descriptor()

const Op& tvm::tirx::builtin::ptx_wgmma_encode_matrix_descriptor ( )

tvm intrinsic to encode matrix descriptor for wgmma instructions.

ptx_wgmma_encode_matrix_descriptor(PrimExpr ptr, PrimExpr ldo, PrimExpr sdo, int swizzle)

◆ ptx_wgmma_fence()

const Op& tvm::tirx::builtin::ptx_wgmma_fence ( )

tvm intrinsic to call wgmma.fence.sync.aligned;

ptx_wgmma_fence()

◆ ptx_wgmma_mma_async_rs()

const Op& tvm::tirx::builtin::ptx_wgmma_mma_async_rs ( )

tvm intrinsic to call wgmma.mma_async.sync.aligned.shape.dtype.atype.btype where A is in register and B is in shared memory.

ptx_wgmma_mma_async_rs()

◆ ptx_wgmma_mma_async_ss()

const Op& tvm::tirx::builtin::ptx_wgmma_mma_async_ss ( )

tvm intrinsic to call wgmma.mma_async.sync.aligned.shape.dtype.atype.btype where both A and B are in shared memory.

ptx_wgmma_mma_async_ss()

◆ ptx_wgmma_noop_barrier()

const Op& tvm::tirx::builtin::ptx_wgmma_noop_barrier ( )

tvm intrinsic to call "" : "+r"(reg) :: "memory"

ptx_wgmma_noop_barrier()

◆ ptx_wgmma_wait_group()

const Op& tvm::tirx::builtin::ptx_wgmma_wait_group ( )

tvm intrinsic to call wgmma.wait_group.sync.aligned;

ptx_wgmma_wait_group(int N)

◆ q_multiply_shift()

const Op& tvm::tirx::builtin::q_multiply_shift ( )

Execute a multiplication between two Q-numbers x and y followed by a right shift s The default rounding rule is to the nearest value, rounding half up (i.e., round(x.1) = x and round (x.5) = x+1)

◆ reinterpret()

const Op& tvm::tirx::builtin::reinterpret ( )

Reinterpret the value using the target type.

◆ ret()

const Op& tvm::tirx::builtin::ret ( )

Return value.

◆ selector()

const Op& tvm::tirx::builtin::selector ( )

Analysis-only active-thread selector.

selector(var, pred) denotes the unique value of var in the current active domain for which pred is true. It is used only inside ExecContext/DispatchContext metadata, for predicates such as ptx.elect_sync() whose selected lane cannot be inferred structurally.

◆ shift_left()

const Op& tvm::tirx::builtin::shift_left ( )

Left shift.

◆ shift_right()

const Op& tvm::tirx::builtin::shift_right ( )

Right shift.

◆ simdgroup_load()

const Op& tvm::tirx::builtin::simdgroup_load ( )

tvm intrinsic for loading data from device memory or threadgroup memory to simdgroup.

Note
only 8x8 shape is supported by Metal Spec and TVM, but we still keep shape as params, keeping the similar interface with Metal Spec.

void simdgroup_load(Var d, PrimExpr index, PrimExpr ptr, PrimExpr stride, int col = 8, int row = 8, bool transpose_matrix = false);

◆ simdgroup_multiply_accumulate()

const Op& tvm::tirx::builtin::simdgroup_multiply_accumulate ( )

tvm intrinsic for multiply and accumulate two matrices in simdgroup

Note
only 8x8 shape is supported by Metal Spec and TVM, but we still keep shape as params, keeping the similar interface with Metal Spec.

void simdgroup_mma(Var d, PrimExpr index_d, Var a, PrimExpr index_a, Var b, PrimExpr index_b, Var c, PrimExpr index_c);

◆ simdgroup_store()

const Op& tvm::tirx::builtin::simdgroup_store ( )

tvm intrinsic for storing data from simdgroup to device memory or threadgroup memory.

Note
only 8x8 shape is supported by Metal Spec and TVM, but we still keep shape as params, keeping the similar interface with Metal Spec.

void simdgroup_store(Var d, PrimExpr index, PrimExpr ptr, PrimExpr stride, int col = 8, int row = 8, bool transpose_matrix = false);

◆ start_profile_intrinsic()

const Op& tvm::tirx::builtin::start_profile_intrinsic ( )

Profiling intrinsic.

◆ texture2d_load()

const Op& tvm::tirx::builtin::texture2d_load ( )

Load from texture 2d memory.

◆ texture2d_store()

const Op& tvm::tirx::builtin::texture2d_store ( )

Store to texture 2d memory.

◆ thread_return()

const Op& tvm::tirx::builtin::thread_return ( )

Return from a GPU thread.

◆ timer_end_cuda()

const Op& tvm::tirx::builtin::timer_end_cuda ( )

tvm intrinsic for ending the timer for profiling a specific event, and storing profiling result in a buffer.

void timer_end_cuda(IntImm event_type, Var profiler_buffer, Var profiler_tag, Var profiler_write_offset, IntImm profiler_write_stride, Expr leader_cond) { // each leader thread in warp group gets the time stamp and event type, combine with the tag // and write to corresponding offset in buffer // each leader thread advance offset by stride }

◆ timer_finalize_cuda()

const Op& tvm::tirx::builtin::timer_finalize_cuda ( )

tvm intrinsic for finalize the timer for profiling, and storing profiling result in a buffer.

void timer_finalize_cuda(Var profiler_buffer, Var profiler_tag, Var profiler_write_offset, IntImm profiler_write_stride, Expr leader_cond) { // each leader thread in warp group gets the time stamp and end signal, combine with the tag // and write to corresponding offset in buffer // each leader thread advance offset by stride }

◆ timer_init_cuda()

const Op& tvm::tirx::builtin::timer_init_cuda ( )

tvm intrinsic for initializing the CUDA profiler, and store profiling result in a buffer.

void timer_init_cuda(Var profiler_buffer, Var profiler_tag, Var profiler_write_offset, int num_groups, Expr group_id) { // initialize the tag and write to pos 0 in the buffer // initialize write offset for every leader thread in warp group across all blocks }

◆ timer_start_cuda()

const Op& tvm::tirx::builtin::timer_start_cuda ( )

tvm intrinsic for starting the timer for profiling a specific event, and storing profiling result in a buffer.

void timer_start_cuda(IntImm event_type, Var profiler_buffer, Var profiler_tag, Var profiler_write_offset, IntImm profiler_write_stride, Expr leader_cond) { // each leader thread in warp group gets the time stamp and event type, combine with the tag // and write to corresponding offset in buffer // each leader thread advance offset by stride }

◆ tvm_access_ptr()

const Op& tvm::tirx::builtin::tvm_access_ptr ( )

Get head access address with memory access pattern info.

This operator also marks range of the memory access The offset and extent are in unit of the DType(including vectorization factor). rw_mask is a bit_mask setting whether the access is a read(1) or write(2). The access is assume to happen in the current expression.

PtrType tvm_access_ptr(Expr dtype, DType* data, int offset, int extent, int rw_mask) { // DType == dtype.type(); return &data[offset]; }

◆ tvm_bmma_sync()

const Op& tvm::tirx::builtin::tvm_bmma_sync ( )

tvm intrinsic for tensor core bmma_sync operators.

void tvm_bmma_sync(Var fragment_d, Expr index_d, Var fragment_a, Expr index_a, Var fragment_b, Expr index_b, Var fragment_c, Expr index_c) { nvcuda::wmma::bmma_sync(fragment_d[index_d], fragment_a[index_a], fragment_b[index_b], fragment_c[index_c]); }

◆ tvm_call_cpacked()

const Op& tvm::tirx::builtin::tvm_call_cpacked ( )

See pesudo code.

return_type tvm_call_packed(fname, TVMFFIAny* args) { TVMFFIAny result; (*fname)(args, args, len(args), &result); return cast(return_type, result); }

◆ tvm_call_cpacked_lowered()

const Op& tvm::tirx::builtin::tvm_call_cpacked_lowered ( )

Lowered version of call c-packed, the space of value and type codes are explicitly allocated.

int tvm_call_packed_lowered(fname, TVMFFIAny* args_stack, int begin, int end, void* self) { fname(ffi::PackedArgs(value_stack[begin:end], tcode_stack[begin:end]), ffi::Any(value_stack + end, tcode_stack + end)); }

◆ tvm_call_packed()

const Op& tvm::tirx::builtin::tvm_call_packed ( )

See pesudo code.

return_type tvm_call_packed(name, TVMFFIAny* args) { TVMFFIAny result; ModuleNode* env = GetCurrentEnv(); const ffi::Function* f = env->GetFuncFromEnv(name); (*f)(args, args, len(args), &result); // return type can be int, float, handle. return cast(return_type, result); }

◆ tvm_call_packed_lowered()

const Op& tvm::tirx::builtin::tvm_call_packed_lowered ( )

Lowered version of call packed, the space of value and type codes are explicitly allocated.

return_type tvm_call_packed_lowered(name, TVMFFIAny* args_stack, int begin, int end) { ModuleNode* env = GetCurrentEnv(); const ffi::Function* f = env->GetFuncFromEnv(name); f->CallPacked(ffi::PackedArgs(args_stack[begin:end]), ffi::Any(args_stack + end)); // return type can be int, float, handle. return cast(return_type, load_return_from(args_stack + end)) }

◆ tvm_call_trace_packed()

const Op& tvm::tirx::builtin::tvm_call_trace_packed ( )

See pesudo code.

return_type tvm_call_trace_packed(name, TVMFFIAny* args) { ModuleNode* env = GetCurrentEnv(); const ffi::Function* f = env->GetFuncFromEnv(name); (*f)(args, args, len(args)); // return type can be int, float, handle. return cast(return_type, result); }

◆ tvm_call_trace_packed_lowered()

const Op& tvm::tirx::builtin::tvm_call_trace_packed_lowered ( )

Lowered version of trace intrinsic, the space of value and type codes are explicitly allocated. The return value is the (end - 1) value on the stack.

return_type tvm_call_trace_packed_lowered(name, TVMFFIAny* args_stack, int begin, int end) { ModuleNode* env = GetCurrentEnv(); const ffi::Function* f = env->GetFuncFromEnv(name); f->CallPacked(ffi::PackedArgs(args_stack[begin:end]), ffi::Any(args_stack + end)); // return type can be int, float, handle. return cast(return_type, load_return_from(args_stack + end)) }

◆ tvm_context_id()

const Op& tvm::tirx::builtin::tvm_context_id ( )

Return a unique context id, used for hint of workspace separation. Different context id ganrantees not having overlapping workspace.

◆ tvm_fill_fragment()

const Op& tvm::tirx::builtin::tvm_fill_fragment ( )

tvm intrinsic for tensor core fill_fragment operators.

void tvm_fill_fragment(Var fragment, UIntImm m, UIntImm, n, UIntImm k, Expr index, Expr value) { // m, n, k are the shape of wmma fragment // fragments must be in 'wmma.accumulator' scope. nvcuda::wmma::fill_fragment(fragment[index], value); }

◆ tvm_global_barrier_kinit()

const Op& tvm::tirx::builtin::tvm_global_barrier_kinit ( )

Initialize the global barrier. Call this at beginning of kernel that need global barrier.

◆ tvm_load_matrix_sync()

const Op& tvm::tirx::builtin::tvm_load_matrix_sync ( )

tvm intrinsic for tensor core load operators.

void tvm_load_matrix_sync(Var fragment, UIntImm m, UIntImm, n, UIntImm k, Expr index, Expr buffer_ptr, Expr stride, StringImm layout) { // m, n, k are the shape of wmma fragment. // Determine fragment layout(column-major or row major) by layout. // fragments must be in 'wmma.matrix_a' or 'wmma.matrix_b' scope. nvcuda::wmma::load_matrix_sync(fragment[index], buffer_ptr, stride); }

◆ tvm_mma_sync()

const Op& tvm::tirx::builtin::tvm_mma_sync ( )

tvm intrinsic for tensor core mma_sync operators.

void tvm_mma_sync(Var fragment_d, Expr index_d, Var fragment_a, Expr index_a, Var fragment_b, Expr index_b, Var fragment_c, Expr index_c) { nvcuda::wmma::mma_sync(fragment_d[index_d], fragment_a[index_a], fragment_b[index_b], fragment_c[index_c]); }

◆ tvm_stack_alloca()

const Op& tvm::tirx::builtin::tvm_stack_alloca ( )

See pesudo code.

dtype in {shape, array, arg_value, arg_tcode}

Handle tvm_stack_alloca(string dtype, int num) { return new on stack dtype[num]; }

◆ tvm_stack_make_array()

const Op& tvm::tirx::builtin::tvm_stack_make_array ( )

Allocate a Tensor(DLTensor) on stack, return the handle.

Type tvm_stack_make_array(Expr data, Expr shape, Expr strides, Expr ndim, Expr dtype, Expr elem_offset) { ret = alloca stack DLTensor(); ret->data = data; ret->shape = shape; ret->strides = strides != 0 ? strides : nullptr; ret->ndim = ndim; ret->dtype = dtype.type(); ret->byte_offset = elem_offset * sizeof(dtype); return ret; }

◆ tvm_stack_make_shape()

const Op& tvm::tirx::builtin::tvm_stack_make_shape ( )

Allocate a shape tuple on stack, return the handle.

Handle tvm_stack_make_shape(list args) { ret = alloca stack int64_t[len(args)]; for i in range(len(args)): ret[i] = args[i] return &ret[0]; }

◆ tvm_static_handle()

const Op& tvm::tirx::builtin::tvm_static_handle ( )

Create a function local static handle that iniitalizes to nullptr. can be used to cache function local static resources.

◆ tvm_storage_sync()

const Op& tvm::tirx::builtin::tvm_storage_sync ( )

See pseudo code.

int tvm_storage_sync(std::string storage_scope) { __sync(storage_scope); return 0; }

◆ tvm_store_matrix_sync()

const Op& tvm::tirx::builtin::tvm_store_matrix_sync ( )

tvm intrinsic for tensor core store operators.

void tvm_store_matrix_sync(Var fragment, UIntImm m, UIntImm, n, UIntImm k, Expr index, Expr buffer_ptr, Expr stride, StringImm layout) { // m, n, k are the shape of wmma fragment // fragments must be in 'wmma.accumulator' scope. nvcuda::wmma::store_matrix_sync(fragment[index], buffer_ptr, stride, layout); }

◆ tvm_struct_get()

const Op& tvm::tirx::builtin::tvm_struct_get ( )

See pesudo code.

Type tvm_struct_get(StructType* arr, int index, int field_id) { return arr[index]->field; }

See also
TVMStructFieldKind

◆ tvm_struct_set()

const Op& tvm::tirx::builtin::tvm_struct_set ( )

See pesudo code.

Handle tvm_struct_set(StructType* arr, int index, int field_id, value) { arr[index]->field = value; }

See also
TVMStructFieldKind

◆ tvm_thread_allreduce()

const Op& tvm::tirx::builtin::tvm_thread_allreduce ( )

See pesudo code.

void tvm_thread_allreduce(UIntImm size, Expr source0, ..., Expr cond, Var reduce_temp0, .., Var thread_idx1, ...) { // constraint by the other thread_idx remain the same. // reduce_temp is used to save intermediate result. reduce_temp0, ... = reduce(combiner, source0, ..., cond over [thread_idx1, thread_idx2] passed by any caller) }

◆ tvm_thread_invariant()

const Op& tvm::tirx::builtin::tvm_thread_invariant ( )

Mark a condition to be thread invariant. This means the condition must be the same for all threads.

◆ tvm_throw_last_error()

const Op& tvm::tirx::builtin::tvm_throw_last_error ( )

See pesudo code.

void tvm_throw_last_error() { throw TVMGetLastError(); }

◆ tvm_tuple()

const Op& tvm::tirx::builtin::tvm_tuple ( )

tvm_tuple is not an actual function and cannot codegen. It is used to represent tuple structure in value field of AttrStmt, for the sake of giving hint to optimization.

Handle tvm_tuple(value0, value1, ..., value_n);

◆ tvm_warp_activemask()

const Op& tvm::tirx::builtin::tvm_warp_activemask ( )

◆ tvm_warp_shuffle()

const Op& tvm::tirx::builtin::tvm_warp_shuffle ( )

See pseudo code.

Type tvm_warp_shuffle(mask, Type value, warp_id, width, warp_size) { return (value passed in by warp indicated by this_warp_id); }

Type tvm_warp_shuffle_up(mask, Type value, offset, width, warp_size) { return (value passed in by warp indicated by this_warp_id - offset); }

Type tvm_warp_shuffle_down(mask, Type value, offset, width, warp_size) { return (value passed in by warp indicated by this_warp_id + offset); }

unsigned tvm_warp_activemask() { return (32-bit mask of currently active threads in the calling warp); }

Parameter warp_id indicates the source thread ID in a warp.

Parameter offset indicates the relative distance to this_warp_id.

Parameter width indicates the number of threads involved in one shuffle. See CUDA document for __shfl_sync, __shfl_up_sync, __shfl_down_sync, __shfl_xor_sync and __activemask.

Parameter warp_size is the size of a warp, which helps a backend to determine whether the width parameter is legal.

◆ tvm_warp_shuffle_down()

const Op& tvm::tirx::builtin::tvm_warp_shuffle_down ( )

◆ tvm_warp_shuffle_up()

const Op& tvm::tirx::builtin::tvm_warp_shuffle_up ( )

◆ tvm_warp_shuffle_xor()

const Op& tvm::tirx::builtin::tvm_warp_shuffle_xor ( )

◆ undef()

const Op& tvm::tirx::builtin::undef ( )

Returns an initialized but arbitrary value.

Compile-time representation of memory locations whose values may be altered as a result of optimizations.

◆ vectorcombine()

const Op& tvm::tirx::builtin::vectorcombine ( )

Concat two vectors.

◆ vectorhigh()

const Op& tvm::tirx::builtin::vectorhigh ( )

Get the high level half of the vector.

◆ vectorlow()

const Op& tvm::tirx::builtin::vectorlow ( )

Get the low-level half of the vector.

◆ vscale()

const Op& tvm::tirx::builtin::vscale ( )

Get the target's vscale value. It will be lowered to llvm.vscale intrinsic (https://llvm.org/docs/LangRef.html#llvm-vscale-intrinsic)