|
tvm
|
Collection of builtin intrinsics as ops. More...
Enumerations | |
| enum | TVMStructFieldKind : int { kDLTensorAddr , kDLTensorData , kDLTensorShape , kDLTensorStrides , kDLTensorNDim , kDLTensorTypeCode , kDLTensorTypeBits , kDLTensorTypeLanes , kDLTensorByteOffset , kDLTensorDeviceId , kDLTensorDeviceType , kDLTensorKindBound_ , kTVMValueContent , kTVMFFIAnyTypeIndex , kTVMFFIAnyZeroPadding , kTVMFFIAnyUnionValue , kTVMValueKindBound_ , kInt64ArrayElem } |
| The kind of structure field info used in intrinsic. More... | |
Functions | |
| const Op & | ret () |
| Return value. More... | |
| const Op & | thread_return () |
| Return from a GPU thread. More... | |
| const Op & | continue_loop () |
| Loop continue. More... | |
| const Op & | break_loop () |
| Loop break. More... | |
| const Op & | reinterpret () |
| Reinterpret the value using the target type. More... | |
| const Op & | likely () |
| Marks a condition is likely going to happen. More... | |
| const Op & | filter () |
Thread-set filter predicate. Used as the condition of an IfThenElse to narrow the active thread set A for the then-branch. Two forms: filter(var, lo, hi) – range form, true iff var in [lo, hi) filter(var, cond) – predicate form (e.g. var == k); true iff cond var must be a ScopeIdDef-declared Var at parse time (Verifier Rule 2). More... | |
| const Op & | selector () |
| Analysis-only active-thread selector. More... | |
| const Op & | bitwise_and () |
| Bitwise and operator. More... | |
| const Op & | bitwise_or () |
| Bitwise or operator. More... | |
| const Op & | bitwise_xor () |
| Bitwise xor operator. More... | |
| const Op & | bitwise_not () |
| Bitwise not operator. More... | |
| const Op & | shift_left () |
| Left shift. More... | |
| const Op & | shift_right () |
| Right shift. More... | |
| const Op & | large_uint_imm () |
| See pesudo code. More... | |
| const Op & | q_multiply_shift () |
| Execute a multiplication between two Q-numbers x and y followed by a right shift s The default rounding rule is to the nearest value, rounding half up (i.e., round(x.1) = x and round (x.5) = x+1) More... | |
| const Op & | address_of () |
| Returns the address of an element in the buffer (see pseudocode below). More... | |
| const Op & | if_then_else () |
| Same as select, used for unsafe memory access. More... | |
| const Op & | isnullptr () |
| See pesudo code. More... | |
| const Op & | isnan () |
| Check if value is nan. More... | |
| const Op & | popcount () |
| Popcount. More... | |
| const Op & | fma () |
| Fused multiply add. More... | |
| const Op & | call_extern () |
| Call an extern C function with given name and signature from the types of args in the runtime environment. More... | |
| const Op & | call_pure_extern () |
| Call an pure extern C function with given name and signature from the types of args in the runtime environment. More... | |
| const Op & | call_llvm_intrin () |
| Call an LLVM intrinsic with a given intrinsic id and signature from the types of args in the runtime environment. More... | |
| const Op & | call_llvm_pure_intrin () |
| Call an LLVM pure intrinsic with a given intrinsic id and signature from the types of args in the runtime environment. More... | |
| const Op & | call_spirv_pure_glsl450 () |
| Call an SPIRV pure GLSL450 intrinsic. More... | |
| const Op & | prefetch () |
| same signature as llvm.prefetch More... | |
| const Op & | tvm_access_ptr () |
| Get head access address with memory access pattern info. More... | |
| const Op & | tvm_static_handle () |
| Create a function local static handle that iniitalizes to nullptr. can be used to cache function local static resources. More... | |
| const Op & | tvm_context_id () |
| Return a unique context id, used for hint of workspace separation. Different context id ganrantees not having overlapping workspace. More... | |
| const Op & | tvm_tuple () |
| tvm_tuple is not an actual function and cannot codegen. It is used to represent tuple structure in value field of AttrStmt, for the sake of giving hint to optimization. More... | |
| const Op & | handle_add_byte_offset () |
| See pesudo code. More... | |
| const Op & | tvm_struct_get () |
| See pesudo code. More... | |
| const Op & | tvm_struct_set () |
| See pesudo code. More... | |
| const Op & | lookup_param () |
| See pseudo code Type lookup_param(ffi::String param_name) { return __tvm_param__param_name; }. More... | |
| const Op & | tvm_throw_last_error () |
| See pesudo code. More... | |
| const Op & | tvm_stack_alloca () |
| See pesudo code. More... | |
| const Op & | tvm_stack_make_shape () |
| Allocate a shape tuple on stack, return the handle. More... | |
| const Op & | tvm_stack_make_array () |
| Allocate a Tensor(DLTensor) on stack, return the handle. More... | |
| const Op & | tvm_call_packed () |
| See pesudo code. More... | |
| const Op & | tvm_call_cpacked () |
| See pesudo code. More... | |
| const Op & | tvm_call_trace_packed () |
| See pesudo code. More... | |
| const Op & | tvm_thread_invariant () |
| Mark a condition to be thread invariant. This means the condition must be the same for all threads. More... | |
| const Op & | tvm_call_packed_lowered () |
| Lowered version of call packed, the space of value and type codes are explicitly allocated. More... | |
| const Op & | tvm_call_cpacked_lowered () |
| Lowered version of call c-packed, the space of value and type codes are explicitly allocated. More... | |
| const Op & | tvm_call_trace_packed_lowered () |
| Lowered version of trace intrinsic, the space of value and type codes are explicitly allocated. The return value is the (end - 1) value on the stack. More... | |
| const Op & | tvm_storage_sync () |
| See pseudo code. More... | |
| const Op & | tvm_warp_shuffle () |
| See pseudo code. More... | |
| const Op & | tvm_warp_shuffle_up () |
| const Op & | tvm_warp_shuffle_down () |
| const Op & | tvm_warp_shuffle_xor () |
| const Op & | tvm_warp_activemask () |
| const Op & | tvm_global_barrier_kinit () |
| Initialize the global barrier. Call this at beginning of kernel that need global barrier. More... | |
| const Op & | tvm_thread_allreduce () |
| See pesudo code. More... | |
| const Op & | make_filled_simdgroup_matrix () |
| tvm intrinsic for initializing and simdgroup with given value. More... | |
| const Op & | simdgroup_load () |
| tvm intrinsic for loading data from device memory or threadgroup memory to simdgroup. More... | |
| const Op & | simdgroup_store () |
| tvm intrinsic for storing data from simdgroup to device memory or threadgroup memory. More... | |
| const Op & | simdgroup_multiply_accumulate () |
| tvm intrinsic for multiply and accumulate two matrices in simdgroup More... | |
| const Op & | cooperative_tensor_fill () |
| Fill a cooperative_tensor with a given value. More... | |
| const Op & | cooperative_tensor_load () |
| Load data from device or threadgroup memory into a cooperative_tensor. More... | |
| const Op & | cooperative_tensor_store () |
| Store data from a cooperative_tensor to device or threadgroup memory. More... | |
| const Op & | cooperative_tensor_multiply_accumulate () |
| Multiply and accumulate two matrices using cooperative_tensor (MetalPerformancePrimitives matmul2d). More... | |
| const Op & | vectorhigh () |
| Get the high level half of the vector. More... | |
| const Op & | vectorlow () |
| Get the low-level half of the vector. More... | |
| const Op & | vectorcombine () |
| Concat two vectors. More... | |
| const Op & | dp4a () |
| Dot product of two int8x4 vectors and add an optional accumulator. More... | |
| const Op & | atomic_add () |
| atomic add instruction, corresponding e.g. to atomicAdd in CUDA More... | |
| const Op & | nd_mem_alloc_with_scope () |
| Create an Nd memory allocation with storage scope. More... | |
| const Op & | texture2d_store () |
| Store to texture 2d memory. More... | |
| const Op & | texture2d_load () |
| Load from texture 2d memory. More... | |
| const Op & | dma_copy () |
| Initiate a non-blocking DMA copy from source to destination. More... | |
| const Op & | dma_wait () |
| Wait until the number of DMA groups in flight is less than or equal to some maximum. More... | |
| const Op & | dma_start_group () |
| Start a group of DMA copies. More... | |
| const Op & | dma_end_group () |
| End a group of DMA copies. More... | |
| const Op & | assume () |
| Provide a true statement that can be used for simplifications. More... | |
| const Op & | undef () |
| Returns an initialized but arbitrary value. More... | |
| const Op & | start_profile_intrinsic () |
| Profiling intrinsic. More... | |
| const Op & | end_profile_intrinsic () |
| Profiling intrinsic. More... | |
| const Op & | anylist_getitem () |
| Get a item from any list and return it. More... | |
| const Op & | anylist_resetitem () |
| Reset and clear a item in any list. More... | |
| const Op & | anylist_setitem_call_packed () |
| Set an item into any list by running packed function call. More... | |
| const Op & | anylist_setitem_call_cpacked () |
| Same as anylist_setitem_call_packed but use C calling convention. More... | |
| const Op & | vscale () |
| Get the target's vscale value. It will be lowered to llvm.vscale intrinsic (https://llvm.org/docs/LangRef.html#llvm-vscale-intrinsic) More... | |
| const Op & | get_active_lane_mask () |
| Calculate a predicate mask given an upper bound (limit) and a current value (base). More... | |
| const Op & | ignore_loop_partition () |
| Annotate a predicate not be considered as target condition of loop partition. More... | |
| const Op & | buffer_offset () |
| Get the element offset of a buffer given logical indices. More... | |
| const Op & | print_buffer () |
| Print the content of a buffer during runtime. More... | |
| const Op & | timer_init_cuda () |
| tvm intrinsic for initializing the CUDA profiler, and store profiling result in a buffer. More... | |
| const Op & | timer_start_cuda () |
| tvm intrinsic for starting the timer for profiling a specific event, and storing profiling result in a buffer. More... | |
| const Op & | timer_end_cuda () |
| tvm intrinsic for ending the timer for profiling a specific event, and storing profiling result in a buffer. More... | |
| const Op & | timer_finalize_cuda () |
| tvm intrinsic for finalize the timer for profiling, and storing profiling result in a buffer. More... | |
| const Op & | cuda_atomic_add () |
| tvm intrinsic for cuda atomic add instruction More... | |
| const Op & | cuda_thread_fence () |
| tvm intrinsic for cuda thread fence instruction More... | |
| const Op & | cuda_warp_reduce () |
| Warp-level butterfly shuffle-XOR reduction. More... | |
| const Op & | cuda_cta_reduce () |
| CTA-wide reduction via warp shuffle + shared memory. More... | |
| const Op & | cuda_copy_bytes () |
| Typed load/store copy of num_bytes bytes. More... | |
| const Op & | cuda_warp_sync () |
| tvm intrinsic for cuda warp sync instruction More... | |
| const Op & | cuda_cta_sync () |
| tvm intrinsic for cuda block-wide sync (syncthreads) More... | |
| const Op & | cuda_grid_sync () |
| tvm intrinsic for cuda grid-wide sync (cooperative groups) More... | |
| const Op & | cuda_thread_rank () |
tvm intrinsic that returns cooperative_groups::thread_rank() for the enclosing CTA (linear thread index within the block). More... | |
| const Op & | cuda_half2float () |
| tvm intrinsic for cuda half to float conversion More... | |
| const Op & | cuda_bfloat162float () |
| tvm intrinsic for cuda bfloat16 to float conversion More... | |
| const Op & | cuda_float22half2 () |
| tvm intrinsic for a helper converting float2 to half2 with rounding More... | |
| const Op & | cuda_trap_when_assert_failed () |
| tvm intrinsic to trap when an assertion failed (cond == false) More... | |
| const Op & | cuda_runtime_instr_desc () |
| tvm intrinsic to modify runtime instruction descriptor More... | |
| const Op & | cuda_half8tofloat8 () |
| tvm intrinsic to convert 8 half2 lanes to 8 float2 lanes More... | |
| const Op & | cuda_float8tohalf8 () |
| tvm intrinsic to convert 8 float2 lanes to 8 half2 lanes with rounding More... | |
| const Op & | cuda_syncthreads_and () |
| tvm intrinsic for cuda syncthreads_and instruction More... | |
| const Op & | cuda_syncthreads_or () |
| tvm intrinsic for cuda syncthreads_or instruction More... | |
| const Op & | cuda_nano_sleep () |
| tvm intrinsic for cuda nano sleep instruction More... | |
| const Op & | cuda_atomic_cas () |
| tvm intrinsic for cuda atomic compare and swap instruction More... | |
| const Op & | cuda_printf () |
| tvm intrinsic for cuda printf instruction More... | |
| const Op & | cuda_ldg () |
| tvm intrinsic for cuda ldg instruction More... | |
| const Op & | cuda_get_tmem_addr () |
| tvm intrinsic for cuda tmem address calculation More... | |
| const Op & | ptx_exp2 () |
| tvm intrinsic for PTX fast exp2 approximation (ex2.approx.ftz.f32) More... | |
| const Op & | ptx_rcp () |
| tvm intrinsic for PTX fast reciprocal approximation (rcp.approx.ftz.f32) More... | |
| const Op & | ptx_any_sync () |
| tvm intrinsic for PTX warp-wide any predicate (__any_sync) More... | |
| const Op & | ptx_reduce3_max_f32 () |
| tvm intrinsic for PTX 3-input max instruction (sm_100a+) More... | |
| const Op & | ptx_reduce3_min_f32 () |
| tvm intrinsic for PTX 3-input min instruction (sm_100a+) More... | |
| const Op & | ptx_add_packed_f32x2 () |
| tvm intrinsic for PTX packed add instruction (sm_100a+) More... | |
| const Op & | ptx_sub_packed_f32x2 () |
| tvm intrinsic for PTX packed subtract instruction (sm_100a+) More... | |
| const Op & | ptx_mul_packed_f32x2 () |
| tvm intrinsic for PTX packed multiply instruction (sm_100a+) More... | |
| const Op & | ptx_fma_packed_f32x2 () |
| tvm intrinsic for PTX packed FMA instruction (sm_100a+) More... | |
| const Op & | tvm_load_matrix_sync () |
| tvm intrinsic for tensor core load operators. More... | |
| const Op & | tvm_mma_sync () |
| tvm intrinsic for tensor core mma_sync operators. More... | |
| const Op & | tvm_bmma_sync () |
| tvm intrinsic for tensor core bmma_sync operators. More... | |
| const Op & | tvm_fill_fragment () |
| tvm intrinsic for tensor core fill_fragment operators. More... | |
| const Op & | tvm_store_matrix_sync () |
| tvm intrinsic for tensor core store operators. More... | |
| const Op & | ptx_mma () |
| tvm intrinsic for ptx tensor core mma instructions. More... | |
| const Op & | ptx_mma_legacy () |
ptx mma / ldmatrix / mma_store / mma_fill variants that take (ptr_var, offset) pairs (not a folded access_ptr Call). Codegen emits ptr + offset C pointer arithmetic; lower_warp_memory rewrites the offset's group component to its thread-local index. More... | |
| const Op & | ptx_ldmatrix_legacy () |
| const Op & | mma_store_legacy () |
| const Op & | mma_fill_legacy () |
| const Op & | ptx_ldg32 () |
| tvm intrinsic for ptx predicate load with 32-bit data type. More... | |
| const Op & | ptx_mma_sp () |
| tvm intrinsic for sparse tensor core ptx instructions. More... | |
| const Op & | ptx_ldmatrix () |
| tvm intrinsic for ptx load matrix from shared memory. More... | |
| const Op & | ptx_cp_async () |
| tvm intrinsics for ptx async copy from global to shared memory using cp.async More... | |
| const Op & | ptx_cp_async_bulk () |
| tvm intrinsics for ptx async copy from global to shared memory using cp.async.bulk More... | |
| const Op & | ptx_cp_async_bulk_shared_to_cluster () |
| tvm intrinsics for ptx async bulk copy from shared::cta to shared::cluster More... | |
| const Op & | ptx_cp_async_commit_group () |
| tvm intrinsics for ptx async copy commit and wait. More... | |
| const Op & | ptx_cp_async_wait_group () |
| const Op & | ptx_cp_async_mbarrier_arrive () |
| tvm intrinsics for ptx async copy barrier using cp.async.mbarrier.arrive More... | |
| const Op & | ptx_fence () |
| PTX fence instruction: fence.{sem}.{scope}. More... | |
| const Op & | ptx_fence_proxy_async () |
| PTX fence.proxy.async instruction: fence.proxy.async[.{space}]. More... | |
| const Op & | ptx_mbarrier_init () |
| tvm instrinsics to call mbarrier.init.shared::cta.b64 More... | |
| const Op & | ptx_mbarrier_arrive () |
tvm instrinsics to call mbarrier.arrive.shared::cta.b64 or mapa.shared::cluster.u32 mbarrier.arrive.shared::cluster.b64 More... | |
| const Op & | ptx_mbarrier_arrive_expect_tx () |
tvm instrinsics to call mbarrier.arrive.expect_tx.shared.b64 or mapa.shared::cluster.u32 mbarrier.arrive.expect_tx.shared.b64 More... | |
| const Op & | ptx_mbarrier_try_wait () |
| tvm instrinsics to call mbarrier.try_wait.parity repeatedly until it returns true More... | |
| const Op & | ptx_bar_arrive () |
| tvm instrinsics to call bar.arrive a, b More... | |
| const Op & | ptx_bar_sync () |
| tvm instrinsics to call bar.sync a, {b} More... | |
| const Op & | ptx_cp_async_bulk_tensor_global_to_cluster () |
| tvm instrinsics to call cp.async.bulk.tensor.dim.shared::cluster.global.tile.mbarrier::complete_tx::bytes More... | |
| const Op & | ptx_cp_async_bulk_tensor_tile_gather4_global_to_cluster () |
| tvm intrinsic to call cp.async.bulk.tensor.dim.shared::cluster.global.tile::gather4.mbarrier::complete_tx::bytes More... | |
| const Op & | ptx_cp_async_bulk_tensor_shared_to_global () |
| tvm instrinsics to call cp.async.bulk.tensor.dim.global.shared::cta.tile。bulk_group More... | |
| const Op & | ptx_cp_async_bulk_tensor_global_to_cluster_prefetch () |
| tvm instrinsics to call cp.async.bulk.prefetch.tensor.dim.L2.global.tile More... | |
| const Op & | ptx_cp_async_bulk_tensor_shared_to_global_reduce () |
| tvm instrinsics to call cp.reduce.async.bulk.tensor.dim.dst.src.redOp More... | |
| const Op & | ptx_cp_async_bulk_commit_group () |
| tvm instrinsics to call cp.async.bulk.commit_group More... | |
| const Op & | ptx_cp_async_bulk_wait_group () |
| tvm instrinsics to call cp.async.bulk.wait_group{.read} N More... | |
| const Op & | ptx_barrier_cluster_arrive () |
| tvm instrinsics to call barrier.cluster.arrive{.sem}{.aligned} More... | |
| const Op & | ptx_barrier_cluster_wait () |
| tvm instrinsics to call barrier.cluster.wait.{acquire}{.aligned} More... | |
| const Op & | ptx_elect_sync () |
| tvm instrinsics to call elect.sync _|p, membermask and return the predicate More... | |
| const Op & | ptx_fence_mbarrier_init () |
| PTX fence.mbarrier_init.release.cluster instruction. More... | |
| const Op & | ptx_fetch_register () |
| tvm instrinsics to fetch PTX pre-defined registers More... | |
| const Op & | mma_store () |
| tvm intrinsic for storing the result of PTX MMA into a destination pointer. For example, if each thread in a warp of size 32 has 4 elements from the result of m16xn8xk16 MMA in its registers, this intrinsic can be used to store the result in a 16x8 region in shared or global memory. More... | |
| const Op & | mma_fill () |
| tvm intrinsic for zero-initializing an MMA accumulation register. For example, if each thread in a warp of size 32 has 8 elements from the A matrix in m16xn8xk16 MMA in its registers, this intrinsic can be used to zero-initialize its 4 accumulation registers. More... | |
| const Op & | ptx_wgmma_encode_matrix_descriptor () |
| tvm intrinsic to encode matrix descriptor for wgmma instructions. More... | |
| const Op & | ptx_wgmma_noop_barrier () |
| tvm intrinsic to call "" : "+r"(reg) :: "memory" More... | |
| const Op & | ptx_wgmma_mma_async_ss () |
| tvm intrinsic to call wgmma.mma_async.sync.aligned.shape.dtype.atype.btype where both A and B are in shared memory. More... | |
| const Op & | ptx_wgmma_mma_async_rs () |
| tvm intrinsic to call wgmma.mma_async.sync.aligned.shape.dtype.atype.btype where A is in register and B is in shared memory. More... | |
| const Op & | ptx_wgmma_fence () |
| tvm intrinsic to call wgmma.fence.sync.aligned; More... | |
| const Op & | ptx_wgmma_commit_group () |
| tvm intrinsic to call wgmma.commit_group.sync.aligned; More... | |
| const Op & | ptx_wgmma_wait_group () |
| tvm intrinsic to call wgmma.wait_group.sync.aligned; More... | |
| const Op & | ptx_stmatrix () |
| tvm intrinsic to call stmatrix.sync.aligned.m8n8.num{.trans}.shared.b16 [p], r; More... | |
| const Op & | ptx_setmaxnreg () |
| tvm intrinsic to call setmaxnreg.action.sync.aligned.u32 imm-reg-count More... | |
| const Op & | ptx_ld_global_acquire () |
| tvm intrinsic to call ld.global.acquire.gpu.b32 More... | |
| const Op & | ptx_tcgen05_alloc () |
| tvm instrinsics to call tcgen05.alloc.cta_group.sync.aligned; More... | |
| const Op & | ptx_tcgen05_dealloc () |
| tvm instrinsics to call tcgen05.dealloc.cta_group.sync.aligned; More... | |
| const Op & | ptx_tcgen05_relinquish_alloc_permit () |
| tvm instrinsics to call tcgen05.relinquish_alloc_permit.cta_group.sync.aligned; More... | |
| const Op & | ptx_tcgen05_fence_before_thread_sync () |
| tvm instrinsics to call tcgen05.fence::before_thread_sync; More... | |
| const Op & | ptx_tcgen05_fence_after_thread_sync () |
| tvm instrinsics to call tcgen05.fence::after_thread_sync; More... | |
| const Op & | ptx_tcgen05_ld () |
| tvm instrinsics to call tcgen05.ld.sync.aligned; More... | |
| const Op & | ptx_tcgen05_st () |
| tvm instrinsics to call tcgen05.st.sync.aligned; More... | |
| const Op & | ptx_tcgen05_wait_ld () |
| tvm instrinsics to call tcgen05.wait::ld.sync.aligned; More... | |
| const Op & | ptx_tcgen05_wait_st () |
| tvm instrinsics to call tcgen05.wait::st.sync.aligned; More... | |
| const Op & | ptx_tcgen05_encode_matrix_descriptor () |
| tvm intrinsic to encode matrix descriptor for tcgen05 instructions. More... | |
| const Op & | ptx_tcgen05_encode_instr_descriptor () |
| tvm intrinsic to encode instruction descriptor for tcgen05 MMA. More... | |
| const Op & | ptx_tcgen05_encode_instr_descriptor_block_scaled () |
| tvm intrinsic to encode instruction descriptor for tcgen05 MMA block scaled. More... | |
| const Op & | ptx_tcgen05_mma () |
| tvm intrinsic to call tcgen05.mma.cta_group.kind without block scaling. More... | |
| const Op & | ptx_tcgen05_mma_block_scale () |
| tvm intrinsic to call tcgen05.mma.cta_group.kind.block_scale{.scale_vec_size} More... | |
| const Op & | ptx_tcgen05_mma_sp () |
| tvm intrinsic to call tcgen05.mma.sp.cta_group.kind without block scaling. More... | |
| const Op & | ptx_tcgen05_mma_sp_block_scale () |
| tvm intrinsic to call tcgen05.mma.sp.cta_group.kind.block_scale{.scale_vec_size} More... | |
| const Op & | ptx_tcgen05_commit () |
| tvm instrinsics to call tcgen05.commit.cta_group More... | |
| const Op & | ptx_tcgen05_cp () |
| tvm instrinsics to call tcgen05.cp.cta_group More... | |
| const Op & | ptx_tcgen05_shift () |
| tvm instrinsics to call tcgen05.shift.cta_group.down More... | |
| const Op & | ptx_map_shared_rank () |
| tvm instrinsics to call map_shared_rank More... | |
| const Op & | cuda_func_call () |
| tvm instrinsics to call a CUDA function. Source code is provided as a string. More... | |
| const Op & | nvshmem_my_pe () |
| nvshmem intrinsics for nvshmem_my_pe() operation. More... | |
| const Op & | nvshmem_n_pes () |
| nvshmem intrinsics for nvshmem_n_pes() operation. More... | |
| const Op & | nvshmem_getmem_nbi () |
| nvshmem intrinsics for nvshmem_getmem_nbi() operation. More... | |
| const Op & | nvshmem_putmem_nbi () |
| nvshmem intrinsics for nvshmem_putmem_nbi() operation. More... | |
| const Op & | nvshmem_getmem_nbi_warp () |
| nvshmem intrinsics for nvshmemx_getmem_nbi_warp() operation. More... | |
| const Op & | nvshmem_putmem_nbi_warp () |
| nvshmem intrinsics for nvshmemx_putmem_nbi_warp() operation. More... | |
| const Op & | nvshmem_getmem_nbi_block () |
| nvshmem intrinsics for nvshmemx_getmem_nbi_block() operation. More... | |
| const Op & | nvshmem_putmem_nbi_block () |
| nvshmem intrinsics for nvshmemx_putmem_nbi_block() operation. More... | |
| const Op & | nvshmem_signal_op () |
| nvshmem intrinsics for nvshmemx_signal_op() operation. More... | |
| const Op & | nvshmem_wait_until () |
| nvshmem intrinsics for nvshmem_FuncParam{TYPENAME}_wait_until() operation. More... | |
| const Op & | nvshmem_quiet () |
| nvshmem intrinsics for nvshmem_quiet() operation. More... | |
| const Op & | nvshmem_putmem_signal_nbi () |
| nvshmem intrinsics for nvshmemx_putmem_signal_nbi() operation. More... | |
| const Op & | nvshmem_putmem_signal_nbi_warp () |
| nvshmem intrinsics for nvshmemx_putmem_signal_nbi_warp() operation. More... | |
| const Op & | nvshmem_putmem_signal_nbi_block () |
| nvshmem intrinsics for nvshmemx_putmem_signal_nbi_block() operation. More... | |
| const Op & | nvshmem_fence () |
| nvshmem intrinsics for nvshmem_fence() operation. More... | |
| const Op & | nvshmem_barrier_all () |
| nvshmem intrinsics for nvshmem_barrier_all() operation. More... | |
| const Op & | nki_load () |
| nki intrinsics for load operation. More... | |
| const Op & | nki_store () |
| nki intrinsics for store operation. More... | |
| const Op & | nki_tensor_copy () |
| nki intrinsics for tensor_copy operation. More... | |
| const Op & | nki_matmul () |
| nki intrinsics for matmul operation. More... | |
| const Op & | nki_activation () |
| nki intrinsics for activation operation. More... | |
| const Op & | nki_reciprocal () |
| nki intrinsics for reciprocal operation. More... | |
| const Op & | nki_tensortensor () |
| nki intrinsics for tensortensor operation. More... | |
| const Op & | nki_tensorscalar () |
| nki intrinsics for tensorscalar operation. More... | |
| const Op & | nki_tensorreduce () |
| nki intrinsics for tensorreduce operation. More... | |
| const Op & | nki_memset () |
| nki intrinsics for memset operation. More... | |
| const Op & | nki_activation_reduce () |
| nki intrinsics for activation reduce operation. More... | |
| const Op & | nki_tensorscalar_reduce () |
| nki intrinsics for tensorscalar reduce operation. More... | |
| const Op & | nki_identity () |
| nki intrinsics for initializing identity tensor. More... | |
| const Op & | nki_scalar_tensor_tensor () |
| nki intrinsics for scalar tensor tensor operation. More... | |
| const Op & | nki_scalar_tensor_scalar () |
| nki intrinsics for scalar tensor scalar operation. More... | |
| const Op & | nki_affine_select () |
| nki intrinsics for affine_select operation. More... | |
Collection of builtin intrinsics as ops.
| enum tvm::tirx::builtin::TVMStructFieldKind : int |
The kind of structure field info used in intrinsic.
| const Op& tvm::tirx::builtin::address_of | ( | ) |
Returns the address of an element in the buffer (see pseudocode below).
The number of indices should match the dimensionality of the buffer being accessed. If this operation occurs after buffer flattening, the number of indices must be supported by the target (i.e. N>1 only on targets that support non-flat memory buffers).
Handle address_of(BufferLoad *op) { return &op->buffer_var[op->indices[0], op->indices[1], ..., op->indices[N-1]]; }
| const Op& tvm::tirx::builtin::anylist_getitem | ( | ) |
Get a item from any list and return it.
Any anylist_getitem(Handle anylist, int index) return anylist[index]; }
| const Op& tvm::tirx::builtin::anylist_resetitem | ( | ) |
Reset and clear a item in any list.
void anylist_resetitem(Handle anylist, int index) anylist[index] = nullptr; }
| const Op& tvm::tirx::builtin::anylist_setitem_call_cpacked | ( | ) |
Same as anylist_setitem_call_packed but use C calling convention.
| const Op& tvm::tirx::builtin::anylist_setitem_call_packed | ( | ) |
Set an item into any list by running packed function call.
void anylist_setitem_call_packed(Handle anylist, int index, name, *args)
anylist[index] = call_packed(name, *args) }
| const Op& tvm::tirx::builtin::assume | ( | ) |
Provide a true statement that can be used for simplifications.
Compile-time representation of known constraints about function inputs. This assumption is removed when lowering, and does not occur in codegen.
| const Op& tvm::tirx::builtin::atomic_add | ( | ) |
atomic add instruction, corresponding e.g. to atomicAdd in CUDA
| const Op& tvm::tirx::builtin::bitwise_and | ( | ) |
Bitwise and operator.
| const Op& tvm::tirx::builtin::bitwise_not | ( | ) |
Bitwise not operator.
| const Op& tvm::tirx::builtin::bitwise_or | ( | ) |
Bitwise or operator.
| const Op& tvm::tirx::builtin::bitwise_xor | ( | ) |
Bitwise xor operator.
| const Op& tvm::tirx::builtin::break_loop | ( | ) |
Loop break.
| const Op& tvm::tirx::builtin::buffer_offset | ( | ) |
Get the element offset of a buffer given logical indices.
The offset is determined by the layout of the buffer.
| const Op& tvm::tirx::builtin::call_extern | ( | ) |
Call an extern C function with given name and signature from the types of args in the runtime environment.
Type call_extern(name, args...) { return dlsym(name)(args...); }
| const Op& tvm::tirx::builtin::call_llvm_intrin | ( | ) |
| const Op& tvm::tirx::builtin::call_llvm_pure_intrin | ( | ) |
| const Op& tvm::tirx::builtin::call_pure_extern | ( | ) |
Call an pure extern C function with given name and signature from the types of args in the runtime environment.
Type call_pure_extern(name, args...) { return dlsym(name)(args...); }
| const Op& tvm::tirx::builtin::call_spirv_pure_glsl450 | ( | ) |
| const Op& tvm::tirx::builtin::continue_loop | ( | ) |
Loop continue.
| const Op& tvm::tirx::builtin::cooperative_tensor_fill | ( | ) |
| const Op& tvm::tirx::builtin::cooperative_tensor_load | ( | ) |
Load data from device or threadgroup memory into a cooperative_tensor.
void cooperative_tensor_load(Var d, PrimExpr index, PrimExpr ptr, PrimExpr stride, int rows, int cols, bool transpose_matrix, int mma_M, int mma_N, int mma_K, int operand_role); operand_role: 0=left(A), 1=right(B), 2=destination(C)
| const Op& tvm::tirx::builtin::cooperative_tensor_multiply_accumulate | ( | ) |
Multiply and accumulate two matrices using cooperative_tensor (MetalPerformancePrimitives matmul2d).
void cooperative_tensor_multiply_accumulate( Var d, PrimExpr index_d, Var a, PrimExpr index_a, Var b, PrimExpr index_b, Var c, PrimExpr index_c, int M, int N, int K, bool transpose_a, bool transpose_b);
| const Op& tvm::tirx::builtin::cooperative_tensor_store | ( | ) |
Store data from a cooperative_tensor to device or threadgroup memory.
void cooperative_tensor_store(Var d, PrimExpr index, PrimExpr ptr, PrimExpr stride, int rows, int cols, bool transpose_matrix, int mma_M, int mma_N, int mma_K, int operand_role); operand_role: 0=left(A), 1=right(B), 2=destination(C)
| const Op& tvm::tirx::builtin::cuda_atomic_add | ( | ) |
tvm intrinsic for cuda atomic add instruction
| const Op& tvm::tirx::builtin::cuda_atomic_cas | ( | ) |
tvm intrinsic for cuda atomic compare and swap instruction
| const Op& tvm::tirx::builtin::cuda_bfloat162float | ( | ) |
tvm intrinsic for cuda bfloat16 to float conversion
| const Op& tvm::tirx::builtin::cuda_copy_bytes | ( | ) |
Typed load/store copy of num_bytes bytes.
cuda_copy_bytes(dst, src, num_bytes) copies num_bytes bytes from src to dst using a single typed load/store (uint4, uint2, unsigned int, etc.). num_bytes must be one of {1, 2, 4, 8, 16}.
| const Op& tvm::tirx::builtin::cuda_cta_reduce | ( | ) |
CTA-wide reduction via warp shuffle + shared memory.
cuda_cta_reduce(value, op, num_warps, scratch) reduces value across the entire CTA using the specified operation ("sum", "max", "min").
| const Op& tvm::tirx::builtin::cuda_cta_sync | ( | ) |
tvm intrinsic for cuda block-wide sync (syncthreads)
| const Op& tvm::tirx::builtin::cuda_float22half2 | ( | ) |
tvm intrinsic for a helper converting float2 to half2 with rounding
| const Op& tvm::tirx::builtin::cuda_float8tohalf8 | ( | ) |
tvm intrinsic to convert 8 float2 lanes to 8 half2 lanes with rounding
| const Op& tvm::tirx::builtin::cuda_func_call | ( | ) |
tvm instrinsics to call a CUDA function. Source code is provided as a string.
cuda_func_call(String func_name, PrimExpr... args, String source_code)
| const Op& tvm::tirx::builtin::cuda_get_tmem_addr | ( | ) |
tvm intrinsic for cuda tmem address calculation
| const Op& tvm::tirx::builtin::cuda_grid_sync | ( | ) |
tvm intrinsic for cuda grid-wide sync (cooperative groups)
| const Op& tvm::tirx::builtin::cuda_half2float | ( | ) |
tvm intrinsic for cuda half to float conversion
| const Op& tvm::tirx::builtin::cuda_half8tofloat8 | ( | ) |
tvm intrinsic to convert 8 half2 lanes to 8 float2 lanes
| const Op& tvm::tirx::builtin::cuda_ldg | ( | ) |
tvm intrinsic for cuda ldg instruction
| const Op& tvm::tirx::builtin::cuda_nano_sleep | ( | ) |
tvm intrinsic for cuda nano sleep instruction
| const Op& tvm::tirx::builtin::cuda_printf | ( | ) |
tvm intrinsic for cuda printf instruction
| const Op& tvm::tirx::builtin::cuda_runtime_instr_desc | ( | ) |
tvm intrinsic to modify runtime instruction descriptor
| const Op& tvm::tirx::builtin::cuda_syncthreads_and | ( | ) |
tvm intrinsic for cuda syncthreads_and instruction
| const Op& tvm::tirx::builtin::cuda_syncthreads_or | ( | ) |
tvm intrinsic for cuda syncthreads_or instruction
| const Op& tvm::tirx::builtin::cuda_thread_fence | ( | ) |
tvm intrinsic for cuda thread fence instruction
| const Op& tvm::tirx::builtin::cuda_thread_rank | ( | ) |
tvm intrinsic that returns cooperative_groups::thread_rank() for the enclosing CTA (linear thread index within the block).
| const Op& tvm::tirx::builtin::cuda_trap_when_assert_failed | ( | ) |
tvm intrinsic to trap when an assertion failed (cond == false)
| const Op& tvm::tirx::builtin::cuda_warp_reduce | ( | ) |
Warp-level butterfly shuffle-XOR reduction.
cuda_warp_reduce(value, op, width) reduces value across width adjacent lanes using the specified operation ("sum", "max", "min").
| const Op& tvm::tirx::builtin::cuda_warp_sync | ( | ) |
tvm intrinsic for cuda warp sync instruction
| const Op& tvm::tirx::builtin::dma_copy | ( | ) |
Initiate a non-blocking DMA copy from source to destination.
The copy is launched immediately.
If a dma_start_group() call is active, the copy will be added to the current group for tracking of in-flight group counts.
If no dma_start_group() call is active, the copy will be tracked individually i.e. as a group with size 1.
| const Op& tvm::tirx::builtin::dma_end_group | ( | ) |
End a group of DMA copies.
Track all calls to dma_copy() that occurred since the preceding dma_start_group() as a single group in-flight.
Calling dma_end_group() without an active group is unsupported.
Note: A group of DMA calls may be empty, and will still contribute to the count of in-flight groups used by dma_wait().
| const Op& tvm::tirx::builtin::dma_start_group | ( | ) |
Start a group of DMA copies.
Any call to dma_copy() that occurs after dma_start_group() will be added to the current group for tracking of in-flight group counts.
Only one DMA group may be active at a given time. Calling dma_start_group() while a group is active is unsupported.
| const Op& tvm::tirx::builtin::dma_wait | ( | ) |
Wait until the number of DMA groups in flight is less than or equal to some maximum.
Calling dma_wait() while a group is active is unsupported.
| const Op& tvm::tirx::builtin::dp4a | ( | ) |
Dot product of two int8x4 vectors and add an optional accumulator.
| const Op& tvm::tirx::builtin::end_profile_intrinsic | ( | ) |
Profiling intrinsic.
| const Op& tvm::tirx::builtin::filter | ( | ) |
Thread-set filter predicate. Used as the condition of an IfThenElse to narrow the active thread set A for the then-branch. Two forms: filter(var, lo, hi) – range form, true iff var in [lo, hi) filter(var, cond) – predicate form (e.g. var == k); true iff cond var must be a ScopeIdDef-declared Var at parse time (Verifier Rule 2).
| const Op& tvm::tirx::builtin::get_active_lane_mask | ( | ) |
Calculate a predicate mask given an upper bound (limit) and a current value (base).
It will be lowered to the llvm.get.active.lane.mask intrinsic. (https://llvm.org/docs/LangRef.html#llvm-get-active-lane-mask-intrinsics)
| const Op& tvm::tirx::builtin::handle_add_byte_offset | ( | ) |
See pesudo code.
void* handle_add_byte_offset(void* handle, int offset) { return reinterpret_cast<v*>(reinterpret_cast<char*>(handle) + offset); }
| const Op& tvm::tirx::builtin::if_then_else | ( | ) |
Same as select, used for unsafe memory access.
Type tvm_if_then_else(cond, a, b) { return cond ? a : b; }
| const Op& tvm::tirx::builtin::ignore_loop_partition | ( | ) |
Annotate a predicate not be considered as target condition of loop partition.
| const Op& tvm::tirx::builtin::isnan | ( | ) |
Check if value is nan.
| const Op& tvm::tirx::builtin::isnullptr | ( | ) |
See pesudo code.
bool isnullptr(void* handle) { return handle == nullptr }
| const Op& tvm::tirx::builtin::large_uint_imm | ( | ) |
See pesudo code.
Construct a big uint that may not be representable by int64
Expr large_uint_imm(uint32_t v0, uin32_t v1) { return (v1 << 32) | v0; }
| const Op& tvm::tirx::builtin::likely | ( | ) |
Marks a condition is likely going to happen.
| const Op& tvm::tirx::builtin::lookup_param | ( | ) |
See pseudo code Type lookup_param(ffi::String param_name) { return __tvm_param__param_name; }.
| const Op& tvm::tirx::builtin::make_filled_simdgroup_matrix | ( | ) |
| const Op& tvm::tirx::builtin::mma_fill | ( | ) |
tvm intrinsic for zero-initializing an MMA accumulation register. For example, if each thread in a warp of size 32 has 8 elements from the A matrix in m16xn8xk16 MMA in its registers, this intrinsic can be used to zero-initialize its 4 accumulation registers.
There is no real PTX instruction that does that, but we introduce this intrinsic for the same reason as mma_store above.
void mma_fill(IntImm local_size, Var local_ptr, Expr offset);
| const Op& tvm::tirx::builtin::mma_fill_legacy | ( | ) |
| const Op& tvm::tirx::builtin::mma_store | ( | ) |
tvm intrinsic for storing the result of PTX MMA into a destination pointer. For example, if each thread in a warp of size 32 has 4 elements from the result of m16xn8xk16 MMA in its registers, this intrinsic can be used to store the result in a 16x8 region in shared or global memory.
There is no real PTX instruction that does that, but we want to hide details of complex index manipulation behind this intrinsic to simplify TIR lowering passes (e.g. LowerWarpMemory).
void mma_store(IntImm m, IntImm n, Var dst_ptr, Var src_ptr, Expr src_offset, Var dst_stride);
| const Op& tvm::tirx::builtin::mma_store_legacy | ( | ) |
| const Op& tvm::tirx::builtin::nd_mem_alloc_with_scope | ( | ) |
Create an Nd memory allocation with storage scope.
| const Op& tvm::tirx::builtin::nki_activation | ( | ) |
nki intrinsics for activation operation.
nki_activation(result, data, opcode, bias, scale)
| const Op& tvm::tirx::builtin::nki_activation_reduce | ( | ) |
nki intrinsics for activation reduce operation.
nki_activation_reduce(reduce_res, act_res, data, opcode, reduce_opcode, bias, scale)
| const Op& tvm::tirx::builtin::nki_affine_select | ( | ) |
nki intrinsics for affine_select operation.
nki_affine_select(result, pred, true_value, false_value)
| const Op& tvm::tirx::builtin::nki_identity | ( | ) |
nki intrinsics for initializing identity tensor.
nki_identity(result, size)
| const Op& tvm::tirx::builtin::nki_load | ( | ) |
nki intrinsics for load operation.
nki_load(result, data)
| const Op& tvm::tirx::builtin::nki_matmul | ( | ) |
nki intrinsics for matmul operation.
nki_matmul(C, A, B, accum)
equivalent to C += A.T @ B (if accum is true), or C = A.T @ B (if accum is false)
| const Op& tvm::tirx::builtin::nki_memset | ( | ) |
nki intrinsics for memset operation.
nki_memset(result, value)
| const Op& tvm::tirx::builtin::nki_reciprocal | ( | ) |
nki intrinsics for reciprocal operation.
nki_reciprocal(result, data)
| const Op& tvm::tirx::builtin::nki_scalar_tensor_scalar | ( | ) |
nki intrinsics for scalar tensor scalar operation.
(data op1 operand1) op2 (operand2) where op1 and op2 are tensor-scalar
nki_scalar_tensor_scalar(result, data, operand0, operand1, opcode0, opcode1, reverse0, reverse1)
| const Op& tvm::tirx::builtin::nki_scalar_tensor_tensor | ( | ) |
nki intrinsics for scalar tensor tensor operation.
(data op1 operand1) op2 (operand2) where op1 is tensor-scalar and op2 is tensor-tensor
nki_scalar_tensor_tensor(result, data, operand0, operand1, opcode0, opcode1, reverse0, reverse1)
| const Op& tvm::tirx::builtin::nki_store | ( | ) |
nki intrinsics for store operation.
nki_store(result, data)
| const Op& tvm::tirx::builtin::nki_tensor_copy | ( | ) |
nki intrinsics for tensor_copy operation.
nki_tensor_copy(result, data)
| const Op& tvm::tirx::builtin::nki_tensorreduce | ( | ) |
nki intrinsics for tensorreduce operation.
nki_tensorreduce(result, data, opcode, negate, axes)
| const Op& tvm::tirx::builtin::nki_tensorscalar | ( | ) |
nki intrinsics for tensorscalar operation.
nki_tensorscalar(result, operand0, operand1, opcode, reverse)
| const Op& tvm::tirx::builtin::nki_tensorscalar_reduce | ( | ) |
nki intrinsics for tensorscalar reduce operation.
nki_tensorscalar_reduce(reduce_res, tensorscalar_res, operand0, operand1, opcode, reduce_opcode, reverse)
| const Op& tvm::tirx::builtin::nki_tensortensor | ( | ) |
nki intrinsics for tensortensor operation.
nki_tensortensor(result, operand0, operand1, opcode)
| const Op& tvm::tirx::builtin::nvshmem_barrier_all | ( | ) |
nvshmem intrinsics for nvshmem_barrier_all() operation.
| const Op& tvm::tirx::builtin::nvshmem_fence | ( | ) |
nvshmem intrinsics for nvshmem_fence() operation.
void nvshmem_fence()
| const Op& tvm::tirx::builtin::nvshmem_getmem_nbi | ( | ) |
nvshmem intrinsics for nvshmem_getmem_nbi() operation.
void nvshmem_getmem_nbi(void *dest, const void *source, size_t nelems, int pe)
| const Op& tvm::tirx::builtin::nvshmem_getmem_nbi_block | ( | ) |
nvshmem intrinsics for nvshmemx_getmem_nbi_block() operation.
void nvshmemx_getmem_nbi_block(void *dest, const void *source, size_t nelems, int pe)
| const Op& tvm::tirx::builtin::nvshmem_getmem_nbi_warp | ( | ) |
nvshmem intrinsics for nvshmemx_getmem_nbi_warp() operation.
void nvshmemx_getmem_nbi_warp(void *dest, const void *source, size_t nelems, int pe)
| const Op& tvm::tirx::builtin::nvshmem_my_pe | ( | ) |
nvshmem intrinsics for nvshmem_my_pe() operation.
int nvshmem_my_pe()
| const Op& tvm::tirx::builtin::nvshmem_n_pes | ( | ) |
nvshmem intrinsics for nvshmem_n_pes() operation.
int nvshmem_n_pes()
| const Op& tvm::tirx::builtin::nvshmem_putmem_nbi | ( | ) |
nvshmem intrinsics for nvshmem_putmem_nbi() operation.
void nvshmem_putmem_nbi(void *dest, const void *source, size_t nelems, int pe)
| const Op& tvm::tirx::builtin::nvshmem_putmem_nbi_block | ( | ) |
nvshmem intrinsics for nvshmemx_putmem_nbi_block() operation.
void nvshmemx_putmem_nbi_block(void *dest, const void *source, size_t nelems, int pe)
| const Op& tvm::tirx::builtin::nvshmem_putmem_nbi_warp | ( | ) |
nvshmem intrinsics for nvshmemx_putmem_nbi_warp() operation.
void nvshmemx_putmem_nbi_warp(void *dest, const void *source, size_t nelems, int pe)
| const Op& tvm::tirx::builtin::nvshmem_putmem_signal_nbi | ( | ) |
nvshmem intrinsics for nvshmemx_putmem_signal_nbi() operation.
void nvshmemx_putmem_signal_nbi(void *dest, const void *source, size_t nelems, uint64_t *sig_addr, uint64_t signal, int sig_op, int pe)
| const Op& tvm::tirx::builtin::nvshmem_putmem_signal_nbi_block | ( | ) |
nvshmem intrinsics for nvshmemx_putmem_signal_nbi_block() operation.
void nvshmemx_putmem_signal_nbi_block(void *dest, const void *source, size_t nelems, uint64_t *sig_addr, uint64_t signal, int sig_op, int pe)
| const Op& tvm::tirx::builtin::nvshmem_putmem_signal_nbi_warp | ( | ) |
nvshmem intrinsics for nvshmemx_putmem_signal_nbi_warp() operation.
void nvshmemx_putmem_signal_nbi_warp(void *dest, const void *source, size_t nelems, uint64_t *sig_addr, uint64_t signal, int sig_op, int pe)
| const Op& tvm::tirx::builtin::nvshmem_quiet | ( | ) |
nvshmem intrinsics for nvshmem_quiet() operation.
void nvshmem_quiet()
| const Op& tvm::tirx::builtin::nvshmem_signal_op | ( | ) |
nvshmem intrinsics for nvshmemx_signal_op() operation.
void nvshmemx_signal_op(uint64_t *sig_addr, uint64_t signal, int sig_op, int pe)
| const Op& tvm::tirx::builtin::nvshmem_wait_until | ( | ) |
nvshmem intrinsics for nvshmem_FuncParam{TYPENAME}_wait_until() operation.
void nvshmem_FuncParam{TYPENAME}_wait_until(TYPE *ivar, int cmp, TYPE cmp_value)
| const Op& tvm::tirx::builtin::popcount | ( | ) |
Popcount.
| const Op& tvm::tirx::builtin::prefetch | ( | ) |
same signature as llvm.prefetch
| const Op& tvm::tirx::builtin::print_buffer | ( | ) |
Print the content of a buffer during runtime.
| const Op& tvm::tirx::builtin::ptx_add_packed_f32x2 | ( | ) |
tvm intrinsic for PTX packed add instruction (sm_100a+)
| const Op& tvm::tirx::builtin::ptx_any_sync | ( | ) |
tvm intrinsic for PTX warp-wide any predicate (__any_sync)
| const Op& tvm::tirx::builtin::ptx_bar_arrive | ( | ) |
tvm instrinsics to call bar.arrive a, b
bar_arrive(int name_bar_id, int thread_count)
| const Op& tvm::tirx::builtin::ptx_bar_sync | ( | ) |
tvm instrinsics to call bar.sync a, {b}
bar_sync(int name_bar_id, int thread_count)
| const Op& tvm::tirx::builtin::ptx_barrier_cluster_arrive | ( | ) |
tvm instrinsics to call barrier.cluster.arrive{.sem}{.aligned}
ptx_barrier_cluster_arrive(string sem, bool aligned)
| const Op& tvm::tirx::builtin::ptx_barrier_cluster_wait | ( | ) |
tvm instrinsics to call barrier.cluster.wait.{acquire}{.aligned}
ptx_barrier_cluster_wait(bool acquire, bool aligned)
| const Op& tvm::tirx::builtin::ptx_cp_async | ( | ) |
| const Op& tvm::tirx::builtin::ptx_cp_async_bulk | ( | ) |
| const Op& tvm::tirx::builtin::ptx_cp_async_bulk_commit_group | ( | ) |
tvm instrinsics to call cp.async.bulk.commit_group
| const Op& tvm::tirx::builtin::ptx_cp_async_bulk_shared_to_cluster | ( | ) |
tvm intrinsics for ptx async bulk copy from shared::cta to shared::cluster
void ptx_cp_async_bulk_shared_to_cluster(Expr dst_ptr, Expr src_ptr, Expr size, Expr mbar);
| const Op& tvm::tirx::builtin::ptx_cp_async_bulk_tensor_global_to_cluster | ( | ) |
tvm instrinsics to call cp.async.bulk.tensor.dim.shared::cluster.global.tile.mbarrier::complete_tx::bytes
TMA alignment requirement: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#table-alignment-multi-dim-tma
ptx_cp_async_bulk_tensor_global_to_cluster(int dim, PrimExpr dst_ptr, PrimExpr bar_ptr, PrimExpr tensormap_addr, int...coords, int cta_mask, int cta_group, string cache_hint)
| const Op& tvm::tirx::builtin::ptx_cp_async_bulk_tensor_global_to_cluster_prefetch | ( | ) |
tvm instrinsics to call cp.async.bulk.prefetch.tensor.dim.L2.global.tile
TMA alignment requirement: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#table-alignment-multi-dim-tma
ptx_cp_async_bulk_tensor_global_to_cluster_prefetch(int dim, PrimExpr tensormap_addr, int...coords, string cache_hint)
| const Op& tvm::tirx::builtin::ptx_cp_async_bulk_tensor_shared_to_global | ( | ) |
tvm instrinsics to call cp.async.bulk.tensor.dim.global.shared::cta.tile。bulk_group
TMA alignment requirement: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#table-alignment-multi-dim-tma
ptx_cp_async_bulk_tensor_shared_to_global(int dim, PrimExpr src_ptr, PrimExpr tensormap_addr, int...coords, string cache_hint)
| const Op& tvm::tirx::builtin::ptx_cp_async_bulk_tensor_shared_to_global_reduce | ( | ) |
| const Op& tvm::tirx::builtin::ptx_cp_async_bulk_tensor_tile_gather4_global_to_cluster | ( | ) |
tvm intrinsic to call cp.async.bulk.tensor.dim.shared::cluster.global.tile::gather4.mbarrier::complete_tx::bytes
TMA alignment requirement: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#table-alignment-multi-dim-tma
ptx_cp_async_bulk_tensor_tile_gather4_global_to_cluster(int dim, PrimExpr dst_ptr, PrimExpr bar_ptr, PrimExpr tensormap_addr, int...coords, int cta_mask, int cta_group, string cache_hint)
| const Op& tvm::tirx::builtin::ptx_cp_async_bulk_wait_group | ( | ) |
tvm instrinsics to call cp.async.bulk.wait_group{.read} N
ptx_cp_async_bulk_wait_group(int N, bool read)
| const Op& tvm::tirx::builtin::ptx_cp_async_commit_group | ( | ) |
tvm intrinsics for ptx async copy commit and wait.
void ptx_cp_async_commit_group(); void ptx_cp_async_wait_group(int num);
| const Op& tvm::tirx::builtin::ptx_cp_async_mbarrier_arrive | ( | ) |
tvm intrinsics for ptx async copy barrier using cp.async.mbarrier.arrive
ptx_cp_async_mbarrier_arrive(int barrier_arr_id, int barrier_id)
| const Op& tvm::tirx::builtin::ptx_cp_async_wait_group | ( | ) |
| const Op& tvm::tirx::builtin::ptx_elect_sync | ( | ) |
tvm instrinsics to call elect.sync _|p, membermask and return the predicate
elect_sync(membermask)
| const Op& tvm::tirx::builtin::ptx_exp2 | ( | ) |
tvm intrinsic for PTX fast exp2 approximation (ex2.approx.ftz.f32)
| const Op& tvm::tirx::builtin::ptx_fence | ( | ) |
PTX fence instruction: fence.{sem}.{scope}.
ptx_fence(StringImm sem, StringImm scope)
| const Op& tvm::tirx::builtin::ptx_fence_mbarrier_init | ( | ) |
PTX fence.mbarrier_init.release.cluster instruction.
| const Op& tvm::tirx::builtin::ptx_fence_proxy_async | ( | ) |
PTX fence.proxy.async instruction: fence.proxy.async[.{space}].
ptx_fence_proxy_async(StringImm space)
| const Op& tvm::tirx::builtin::ptx_fetch_register | ( | ) |
tvm instrinsics to fetch PTX pre-defined registers
ptx_fetch_register(int bits, string reg_name)
| const Op& tvm::tirx::builtin::ptx_fma_packed_f32x2 | ( | ) |
tvm intrinsic for PTX packed FMA instruction (sm_100a+)
| const Op& tvm::tirx::builtin::ptx_ld_global_acquire | ( | ) |
tvm intrinsic to call ld.global.acquire.gpu.b32
| const Op & tvm::tirx::builtin::ptx_ldg32 | ( | ) |
tvm intrinsic for ptx predicate load with 32-bit data type.
| const Op& tvm::tirx::builtin::ptx_ldmatrix | ( | ) |
| const Op& tvm::tirx::builtin::ptx_ldmatrix_legacy | ( | ) |
| const Op& tvm::tirx::builtin::ptx_map_shared_rank | ( | ) |
tvm instrinsics to call map_shared_rank
ptx_map_shared_rank(PrimExpr ptr, int rank)
| const Op& tvm::tirx::builtin::ptx_mbarrier_arrive | ( | ) |
tvm instrinsics to call mbarrier.arrive.shared::cta.b64 or mapa.shared::cluster.u32 mbarrier.arrive.shared::cluster.b64
| const Op& tvm::tirx::builtin::ptx_mbarrier_arrive_expect_tx | ( | ) |
tvm instrinsics to call mbarrier.arrive.expect_tx.shared.b64 or mapa.shared::cluster.u32 mbarrier.arrive.expect_tx.shared.b64
ptx_mbarrier_arrive_expect_tx(uint64_t* bar_ptr, int byte_count)
| const Op& tvm::tirx::builtin::ptx_mbarrier_init | ( | ) |
tvm instrinsics to call mbarrier.init.shared::cta.b64
ptx_mbarrier_init(uint64_t* bar_ptr, int thread_count)
| const Op& tvm::tirx::builtin::ptx_mbarrier_try_wait | ( | ) |
tvm instrinsics to call mbarrier.try_wait.parity repeatedly until it returns true
ptx_mbarrier_try_wait(uint64_t* bar_ptr, int phase)
| const Op& tvm::tirx::builtin::ptx_mma | ( | ) |
| const Op& tvm::tirx::builtin::ptx_mma_legacy | ( | ) |
ptx mma / ldmatrix / mma_store / mma_fill variants that take (ptr_var, offset) pairs (not a folded access_ptr Call). Codegen emits ptr + offset C pointer arithmetic; lower_warp_memory rewrites the offset's group component to its thread-local index.
| const Op& tvm::tirx::builtin::ptx_mma_sp | ( | ) |
tvm intrinsic for sparse tensor core ptx instructions.
void ptx_mma_sp(StringImm shape, StringImm A_layout, StringImm B_layout, StringImm A_dtype, StringImm B_dtype, StringImm C_dtype, Var multiplicand_a, Expr a_index, Var multiplicand_b, Expr b_index, Var accumulator, Expr c_index, Var metadata, Expr meta_index, Var sparse_selector, bool saturate);
| const Op& tvm::tirx::builtin::ptx_mul_packed_f32x2 | ( | ) |
tvm intrinsic for PTX packed multiply instruction (sm_100a+)
| const Op& tvm::tirx::builtin::ptx_rcp | ( | ) |
tvm intrinsic for PTX fast reciprocal approximation (rcp.approx.ftz.f32)
| const Op& tvm::tirx::builtin::ptx_reduce3_max_f32 | ( | ) |
tvm intrinsic for PTX 3-input max instruction (sm_100a+)
| const Op& tvm::tirx::builtin::ptx_reduce3_min_f32 | ( | ) |
tvm intrinsic for PTX 3-input min instruction (sm_100a+)
| const Op& tvm::tirx::builtin::ptx_setmaxnreg | ( | ) |
tvm intrinsic to call setmaxnreg.action.sync.aligned.u32 imm-reg-count
| const Op& tvm::tirx::builtin::ptx_stmatrix | ( | ) |
tvm intrinsic to call stmatrix.sync.aligned.m8n8.num{.trans}.shared.b16 [p], r;
ptx_stmatrix(int num, bool trans, PrimExpr ptr, PrimExpr... vars)
| const Op& tvm::tirx::builtin::ptx_sub_packed_f32x2 | ( | ) |
tvm intrinsic for PTX packed subtract instruction (sm_100a+)
| const Op& tvm::tirx::builtin::ptx_tcgen05_alloc | ( | ) |
tvm instrinsics to call tcgen05.alloc.cta_group.sync.aligned;
ptx_tcgen05_alloc(Var dst_ptr, int n_cols, int cta_group)
| const Op& tvm::tirx::builtin::ptx_tcgen05_commit | ( | ) |
tvm instrinsics to call tcgen05.commit.cta_group
| const Op& tvm::tirx::builtin::ptx_tcgen05_cp | ( | ) |
tvm instrinsics to call tcgen05.cp.cta_group
| const Op& tvm::tirx::builtin::ptx_tcgen05_dealloc | ( | ) |
tvm instrinsics to call tcgen05.dealloc.cta_group.sync.aligned;
ptx_tcgen05_dealloc(uint32_t taddr, int n_cols, int cta_group)
| const Op& tvm::tirx::builtin::ptx_tcgen05_encode_instr_descriptor | ( | ) |
tvm intrinsic to encode instruction descriptor for tcgen05 MMA.
ptx_tcgen05_encode_instr_descriptor(PrimExpr desc, string d_dtype, string a_dtype, string b_dtype, int M, int N, int K, bool trans_a, bool trans_b, int n_cta_groups, bool neg_a, bool neg_b, bool sat_d, bool is_sparse)
| const Op& tvm::tirx::builtin::ptx_tcgen05_encode_instr_descriptor_block_scaled | ( | ) |
tvm intrinsic to encode instruction descriptor for tcgen05 MMA block scaled.
ptx_tcgen05_encode_instr_descriptor_block_scaled(PrimExpr desc, string d_dtype, string a_dtype, string b_dtype, string sfa_dtype, string stb_dtype, int M, int N, int K, bool trans_a, bool trans_b, int n_cta_groups, bool neg_a, bool neg_b, bool is_sparse)
| const Op& tvm::tirx::builtin::ptx_tcgen05_encode_matrix_descriptor | ( | ) |
tvm intrinsic to encode matrix descriptor for tcgen05 instructions.
ptx_tcgen05_encode_matrix_descriptor(PrimExpr ptr, PrimExpr ldo, PrimExpr sdo, int swizzle)
| const Op& tvm::tirx::builtin::ptx_tcgen05_fence_after_thread_sync | ( | ) |
tvm instrinsics to call tcgen05.fence::after_thread_sync;
| const Op& tvm::tirx::builtin::ptx_tcgen05_fence_before_thread_sync | ( | ) |
tvm instrinsics to call tcgen05.fence::before_thread_sync;
| const Op& tvm::tirx::builtin::ptx_tcgen05_ld | ( | ) |
tvm instrinsics to call tcgen05.ld.sync.aligned;
| const Op& tvm::tirx::builtin::ptx_tcgen05_mma | ( | ) |
tvm intrinsic to call tcgen05.mma.cta_group.kind without block scaling.
| const Op& tvm::tirx::builtin::ptx_tcgen05_mma_block_scale | ( | ) |
tvm intrinsic to call tcgen05.mma.cta_group.kind.block_scale{.scale_vec_size}
| const Op& tvm::tirx::builtin::ptx_tcgen05_mma_sp | ( | ) |
tvm intrinsic to call tcgen05.mma.sp.cta_group.kind without block scaling.
| const Op& tvm::tirx::builtin::ptx_tcgen05_mma_sp_block_scale | ( | ) |
tvm intrinsic to call tcgen05.mma.sp.cta_group.kind.block_scale{.scale_vec_size}
| const Op& tvm::tirx::builtin::ptx_tcgen05_relinquish_alloc_permit | ( | ) |
tvm instrinsics to call tcgen05.relinquish_alloc_permit.cta_group.sync.aligned;
ptx_tcgen05_relinquish_alloc_permit(int cta_group)
| const Op& tvm::tirx::builtin::ptx_tcgen05_shift | ( | ) |
tvm instrinsics to call tcgen05.shift.cta_group.down
| const Op& tvm::tirx::builtin::ptx_tcgen05_st | ( | ) |
tvm instrinsics to call tcgen05.st.sync.aligned;
| const Op& tvm::tirx::builtin::ptx_tcgen05_wait_ld | ( | ) |
tvm instrinsics to call tcgen05.wait::ld.sync.aligned;
| const Op& tvm::tirx::builtin::ptx_tcgen05_wait_st | ( | ) |
tvm instrinsics to call tcgen05.wait::st.sync.aligned;
| const Op& tvm::tirx::builtin::ptx_wgmma_commit_group | ( | ) |
tvm intrinsic to call wgmma.commit_group.sync.aligned;
| const Op& tvm::tirx::builtin::ptx_wgmma_encode_matrix_descriptor | ( | ) |
tvm intrinsic to encode matrix descriptor for wgmma instructions.
ptx_wgmma_encode_matrix_descriptor(PrimExpr ptr, PrimExpr ldo, PrimExpr sdo, int swizzle)
| const Op& tvm::tirx::builtin::ptx_wgmma_fence | ( | ) |
tvm intrinsic to call wgmma.fence.sync.aligned;
| const Op& tvm::tirx::builtin::ptx_wgmma_mma_async_rs | ( | ) |
tvm intrinsic to call wgmma.mma_async.sync.aligned.shape.dtype.atype.btype where A is in register and B is in shared memory.
| const Op& tvm::tirx::builtin::ptx_wgmma_mma_async_ss | ( | ) |
tvm intrinsic to call wgmma.mma_async.sync.aligned.shape.dtype.atype.btype where both A and B are in shared memory.
| const Op& tvm::tirx::builtin::ptx_wgmma_noop_barrier | ( | ) |
tvm intrinsic to call "" : "+r"(reg) :: "memory"
| const Op& tvm::tirx::builtin::ptx_wgmma_wait_group | ( | ) |
tvm intrinsic to call wgmma.wait_group.sync.aligned;
ptx_wgmma_wait_group(int N)
| const Op& tvm::tirx::builtin::q_multiply_shift | ( | ) |
Execute a multiplication between two Q-numbers x and y followed by a right shift s The default rounding rule is to the nearest value, rounding half up (i.e., round(x.1) = x and round (x.5) = x+1)
| const Op& tvm::tirx::builtin::reinterpret | ( | ) |
Reinterpret the value using the target type.
| const Op& tvm::tirx::builtin::ret | ( | ) |
Return value.
| const Op& tvm::tirx::builtin::selector | ( | ) |
Analysis-only active-thread selector.
selector(var, pred) denotes the unique value of var in the current active domain for which pred is true. It is used only inside ExecContext/DispatchContext metadata, for predicates such as ptx.elect_sync() whose selected lane cannot be inferred structurally.
| const Op& tvm::tirx::builtin::shift_left | ( | ) |
Left shift.
| const Op& tvm::tirx::builtin::shift_right | ( | ) |
Right shift.
| const Op& tvm::tirx::builtin::simdgroup_load | ( | ) |
tvm intrinsic for loading data from device memory or threadgroup memory to simdgroup.
void simdgroup_load(Var d, PrimExpr index, PrimExpr ptr, PrimExpr stride, int col = 8, int row = 8, bool transpose_matrix = false);
| const Op& tvm::tirx::builtin::simdgroup_multiply_accumulate | ( | ) |
tvm intrinsic for multiply and accumulate two matrices in simdgroup
void simdgroup_mma(Var d, PrimExpr index_d, Var a, PrimExpr index_a, Var b, PrimExpr index_b, Var c, PrimExpr index_c);
| const Op& tvm::tirx::builtin::simdgroup_store | ( | ) |
tvm intrinsic for storing data from simdgroup to device memory or threadgroup memory.
void simdgroup_store(Var d, PrimExpr index, PrimExpr ptr, PrimExpr stride, int col = 8, int row = 8, bool transpose_matrix = false);
| const Op& tvm::tirx::builtin::start_profile_intrinsic | ( | ) |
Profiling intrinsic.
| const Op& tvm::tirx::builtin::texture2d_load | ( | ) |
Load from texture 2d memory.
| const Op& tvm::tirx::builtin::texture2d_store | ( | ) |
Store to texture 2d memory.
| const Op& tvm::tirx::builtin::thread_return | ( | ) |
Return from a GPU thread.
| const Op& tvm::tirx::builtin::timer_end_cuda | ( | ) |
tvm intrinsic for ending the timer for profiling a specific event, and storing profiling result in a buffer.
void timer_end_cuda(IntImm event_type, Var profiler_buffer, Var profiler_tag, Var profiler_write_offset, IntImm profiler_write_stride, Expr leader_cond) { // each leader thread in warp group gets the time stamp and event type, combine with the tag // and write to corresponding offset in buffer // each leader thread advance offset by stride }
| const Op& tvm::tirx::builtin::timer_finalize_cuda | ( | ) |
tvm intrinsic for finalize the timer for profiling, and storing profiling result in a buffer.
void timer_finalize_cuda(Var profiler_buffer, Var profiler_tag, Var profiler_write_offset, IntImm profiler_write_stride, Expr leader_cond) { // each leader thread in warp group gets the time stamp and end signal, combine with the tag // and write to corresponding offset in buffer // each leader thread advance offset by stride }
| const Op& tvm::tirx::builtin::timer_init_cuda | ( | ) |
tvm intrinsic for initializing the CUDA profiler, and store profiling result in a buffer.
void timer_init_cuda(Var profiler_buffer, Var profiler_tag, Var profiler_write_offset, int num_groups, Expr group_id) { // initialize the tag and write to pos 0 in the buffer // initialize write offset for every leader thread in warp group across all blocks }
| const Op& tvm::tirx::builtin::timer_start_cuda | ( | ) |
tvm intrinsic for starting the timer for profiling a specific event, and storing profiling result in a buffer.
void timer_start_cuda(IntImm event_type, Var profiler_buffer, Var profiler_tag, Var profiler_write_offset, IntImm profiler_write_stride, Expr leader_cond) { // each leader thread in warp group gets the time stamp and event type, combine with the tag // and write to corresponding offset in buffer // each leader thread advance offset by stride }
| const Op& tvm::tirx::builtin::tvm_access_ptr | ( | ) |
Get head access address with memory access pattern info.
This operator also marks range of the memory access The offset and extent are in unit of the DType(including vectorization factor). rw_mask is a bit_mask setting whether the access is a read(1) or write(2). The access is assume to happen in the current expression.
PtrType tvm_access_ptr(Expr dtype, DType* data, int offset, int extent, int rw_mask) { // DType == dtype.type(); return &data[offset]; }
| const Op& tvm::tirx::builtin::tvm_bmma_sync | ( | ) |
tvm intrinsic for tensor core bmma_sync operators.
void tvm_bmma_sync(Var fragment_d, Expr index_d, Var fragment_a, Expr index_a, Var fragment_b, Expr index_b, Var fragment_c, Expr index_c) { nvcuda::wmma::bmma_sync(fragment_d[index_d], fragment_a[index_a], fragment_b[index_b], fragment_c[index_c]); }
| const Op& tvm::tirx::builtin::tvm_call_cpacked | ( | ) |
See pesudo code.
return_type tvm_call_packed(fname, TVMFFIAny* args) { TVMFFIAny result; (*fname)(args, args, len(args), &result); return cast(return_type, result); }
| const Op& tvm::tirx::builtin::tvm_call_cpacked_lowered | ( | ) |
Lowered version of call c-packed, the space of value and type codes are explicitly allocated.
int tvm_call_packed_lowered(fname, TVMFFIAny* args_stack, int begin, int end, void* self) { fname(ffi::PackedArgs(value_stack[begin:end], tcode_stack[begin:end]), ffi::Any(value_stack + end, tcode_stack + end)); }
| const Op& tvm::tirx::builtin::tvm_call_packed | ( | ) |
See pesudo code.
return_type tvm_call_packed(name, TVMFFIAny* args) { TVMFFIAny result; ModuleNode* env = GetCurrentEnv(); const ffi::Function* f = env->GetFuncFromEnv(name); (*f)(args, args, len(args), &result); // return type can be int, float, handle. return cast(return_type, result); }
| const Op& tvm::tirx::builtin::tvm_call_packed_lowered | ( | ) |
Lowered version of call packed, the space of value and type codes are explicitly allocated.
return_type tvm_call_packed_lowered(name, TVMFFIAny* args_stack, int begin, int end) { ModuleNode* env = GetCurrentEnv(); const ffi::Function* f = env->GetFuncFromEnv(name); f->CallPacked(ffi::PackedArgs(args_stack[begin:end]), ffi::Any(args_stack + end)); // return type can be int, float, handle. return cast(return_type, load_return_from(args_stack + end)) }
| const Op& tvm::tirx::builtin::tvm_call_trace_packed | ( | ) |
See pesudo code.
return_type tvm_call_trace_packed(name, TVMFFIAny* args) { ModuleNode* env = GetCurrentEnv(); const ffi::Function* f = env->GetFuncFromEnv(name); (*f)(args, args, len(args)); // return type can be int, float, handle. return cast(return_type, result); }
| const Op& tvm::tirx::builtin::tvm_call_trace_packed_lowered | ( | ) |
Lowered version of trace intrinsic, the space of value and type codes are explicitly allocated. The return value is the (end - 1) value on the stack.
return_type tvm_call_trace_packed_lowered(name, TVMFFIAny* args_stack, int begin, int end) { ModuleNode* env = GetCurrentEnv(); const ffi::Function* f = env->GetFuncFromEnv(name); f->CallPacked(ffi::PackedArgs(args_stack[begin:end]), ffi::Any(args_stack + end)); // return type can be int, float, handle. return cast(return_type, load_return_from(args_stack + end)) }
| const Op& tvm::tirx::builtin::tvm_context_id | ( | ) |
Return a unique context id, used for hint of workspace separation. Different context id ganrantees not having overlapping workspace.
| const Op& tvm::tirx::builtin::tvm_fill_fragment | ( | ) |
tvm intrinsic for tensor core fill_fragment operators.
void tvm_fill_fragment(Var fragment, UIntImm m, UIntImm, n, UIntImm k, Expr index, Expr value) { // m, n, k are the shape of wmma fragment // fragments must be in 'wmma.accumulator' scope. nvcuda::wmma::fill_fragment(fragment[index], value); }
| const Op& tvm::tirx::builtin::tvm_global_barrier_kinit | ( | ) |
Initialize the global barrier. Call this at beginning of kernel that need global barrier.
| const Op& tvm::tirx::builtin::tvm_load_matrix_sync | ( | ) |
tvm intrinsic for tensor core load operators.
void tvm_load_matrix_sync(Var fragment, UIntImm m, UIntImm, n, UIntImm k, Expr index, Expr buffer_ptr, Expr stride, StringImm layout) { // m, n, k are the shape of wmma fragment. // Determine fragment layout(column-major or row major) by layout. // fragments must be in 'wmma.matrix_a' or 'wmma.matrix_b' scope. nvcuda::wmma::load_matrix_sync(fragment[index], buffer_ptr, stride); }
| const Op& tvm::tirx::builtin::tvm_mma_sync | ( | ) |
| const Op& tvm::tirx::builtin::tvm_stack_alloca | ( | ) |
See pesudo code.
dtype in {shape, array, arg_value, arg_tcode}
Handle tvm_stack_alloca(string dtype, int num) { return new on stack dtype[num]; }
| const Op& tvm::tirx::builtin::tvm_stack_make_array | ( | ) |
Allocate a Tensor(DLTensor) on stack, return the handle.
Type tvm_stack_make_array(Expr data, Expr shape, Expr strides, Expr ndim, Expr dtype, Expr elem_offset) { ret = alloca stack DLTensor(); ret->data = data; ret->shape = shape; ret->strides = strides != 0 ? strides : nullptr; ret->ndim = ndim; ret->dtype = dtype.type(); ret->byte_offset = elem_offset * sizeof(dtype); return ret; }
| const Op& tvm::tirx::builtin::tvm_stack_make_shape | ( | ) |
Allocate a shape tuple on stack, return the handle.
Handle tvm_stack_make_shape(list args) { ret = alloca stack int64_t[len(args)]; for i in range(len(args)): ret[i] = args[i] return &ret[0]; }
| const Op& tvm::tirx::builtin::tvm_static_handle | ( | ) |
Create a function local static handle that iniitalizes to nullptr. can be used to cache function local static resources.
| const Op& tvm::tirx::builtin::tvm_storage_sync | ( | ) |
See pseudo code.
int tvm_storage_sync(std::string storage_scope) { __sync(storage_scope); return 0; }
| const Op& tvm::tirx::builtin::tvm_store_matrix_sync | ( | ) |
tvm intrinsic for tensor core store operators.
void tvm_store_matrix_sync(Var fragment, UIntImm m, UIntImm, n, UIntImm k, Expr index, Expr buffer_ptr, Expr stride, StringImm layout) { // m, n, k are the shape of wmma fragment // fragments must be in 'wmma.accumulator' scope. nvcuda::wmma::store_matrix_sync(fragment[index], buffer_ptr, stride, layout); }
| const Op& tvm::tirx::builtin::tvm_struct_get | ( | ) |
See pesudo code.
Type tvm_struct_get(StructType* arr, int index, int field_id) { return arr[index]->field; }
| const Op& tvm::tirx::builtin::tvm_struct_set | ( | ) |
See pesudo code.
Handle tvm_struct_set(StructType* arr, int index, int field_id, value) { arr[index]->field = value; }
| const Op& tvm::tirx::builtin::tvm_thread_allreduce | ( | ) |
See pesudo code.
void tvm_thread_allreduce(UIntImm size, Expr source0, ..., Expr cond, Var reduce_temp0, .., Var thread_idx1, ...) { // constraint by the other thread_idx remain the same. // reduce_temp is used to save intermediate result. reduce_temp0, ... = reduce(combiner, source0, ..., cond over [thread_idx1, thread_idx2] passed by any caller) }
| const Op& tvm::tirx::builtin::tvm_thread_invariant | ( | ) |
Mark a condition to be thread invariant. This means the condition must be the same for all threads.
| const Op& tvm::tirx::builtin::tvm_throw_last_error | ( | ) |
See pesudo code.
void tvm_throw_last_error() { throw TVMGetLastError(); }
| const Op& tvm::tirx::builtin::tvm_tuple | ( | ) |
tvm_tuple is not an actual function and cannot codegen. It is used to represent tuple structure in value field of AttrStmt, for the sake of giving hint to optimization.
Handle tvm_tuple(value0, value1, ..., value_n);
| const Op& tvm::tirx::builtin::tvm_warp_activemask | ( | ) |
| const Op& tvm::tirx::builtin::tvm_warp_shuffle | ( | ) |
See pseudo code.
Type tvm_warp_shuffle(mask, Type value, warp_id, width, warp_size) { return (value passed in by warp indicated by this_warp_id); }
Type tvm_warp_shuffle_up(mask, Type value, offset, width, warp_size) { return (value passed in by warp indicated by this_warp_id - offset); }
Type tvm_warp_shuffle_down(mask, Type value, offset, width, warp_size) { return (value passed in by warp indicated by this_warp_id + offset); }
unsigned tvm_warp_activemask() { return (32-bit mask of currently active threads in the calling warp); }
Parameter warp_id indicates the source thread ID in a warp.
Parameter offset indicates the relative distance to this_warp_id.
Parameter width indicates the number of threads involved in one shuffle. See CUDA document for __shfl_sync, __shfl_up_sync, __shfl_down_sync, __shfl_xor_sync and __activemask.
Parameter warp_size is the size of a warp, which helps a backend to determine whether the width parameter is legal.
| const Op& tvm::tirx::builtin::tvm_warp_shuffle_down | ( | ) |
| const Op& tvm::tirx::builtin::tvm_warp_shuffle_up | ( | ) |
| const Op& tvm::tirx::builtin::tvm_warp_shuffle_xor | ( | ) |
| const Op& tvm::tirx::builtin::undef | ( | ) |
Returns an initialized but arbitrary value.
Compile-time representation of memory locations whose values may be altered as a result of optimizations.
| const Op& tvm::tirx::builtin::vectorcombine | ( | ) |
Concat two vectors.
| const Op& tvm::tirx::builtin::vectorhigh | ( | ) |
Get the high level half of the vector.
| const Op& tvm::tirx::builtin::vectorlow | ( | ) |
Get the low-level half of the vector.
| const Op& tvm::tirx::builtin::vscale | ( | ) |
Get the target's vscale value. It will be lowered to llvm.vscale intrinsic (https://llvm.org/docs/LangRef.html#llvm-vscale-intrinsic)