tvm::tir::builtin Namespace Reference

Collection of builtin intrinsics as ops. More...

Enumerations

enum  TVMStructFieldKind : int {
  kArrAddr , kArrData , kArrShape , kArrStrides ,
  kArrNDim , kArrTypeCode , kArrTypeBits , kArrTypeLanes ,
  kArrByteOffset , kArrDeviceId , kArrDeviceType , kArrKindBound_ ,
  kTVMValueContent , kTVMValueKindBound_
}
 The kind of structure field info used in intrinsics. More...
 

Functions

const Op& ret ()
 Return value. More...
 
const Op& reinterpret ()
 Reinterpret the value using the target type. More...
 
const Op& likely ()
 Marks that a condition is likely going to happen. More...
 
const Op& bitwise_and ()
 Bitwise and operator. More...
 
const Op& bitwise_or ()
 Bitwise or operator. More...
 
const Op& bitwise_xor ()
 Bitwise xor operator. More...
 
const Op& bitwise_not ()
 Bitwise not operator. More...
 
const Op& shift_left ()
 Left shift. More...
 
const Op& shift_right ()
 Right shift. More...
 
const Op& large_uint_imm ()
 See pseudo code. More...
 
const Op& q_multiply_shift ()
 Execute a multiplication between two Q-numbers x and y followed by a right shift s. The default rounding rule is to the nearest value, rounding half up (i.e., round(x.1) = x and round(x.5) = x+1). More...
 
const Op& address_of ()
 Returns the address of an element in the buffer (see pseudocode below). More...
 
const Op& if_then_else ()
 Same as select, used for unsafe memory access. More...
 
const Op& isnullptr ()
 See pseudo code. More...
 
const Op& isnan ()
 Check if value is nan. More...
 
const Op& popcount ()
 Popcount. More...
 
const Op& fma ()
 Fused multiply add. More...
 
const Op& call_extern ()
 Call an extern C function with a given name and signature from the types of args in the runtime environment. More...
 
const Op& call_pure_extern ()
 Call a pure extern C function with a given name and signature from the types of args in the runtime environment. More...
 
const Op& call_llvm_intrin ()
 Call an LLVM intrinsic with a given intrinsic id and signature from the types of args in the runtime environment. More...
 
const Op& call_llvm_pure_intrin ()
 Call an LLVM pure intrinsic with a given intrinsic id and signature from the types of args in the runtime environment. More...
 
const Op& call_spirv_pure_glsl450 ()
 Call a SPIR-V pure GLSL450 intrinsic. More...
 
const Op& prefetch ()
 Prefetch a cacheline. More...
 
const Op& tvm_access_ptr ()
 Get head access address with memory access pattern info. More...
 
const Op& tvm_static_handle ()
 Create a function-local static handle that initializes to nullptr. Can be used to cache function-local static resources. More...
 
const Op& tvm_context_id ()
 Return a unique context id, used as a hint for workspace separation. Different context ids guarantee non-overlapping workspaces. More...
 
const Op& tvm_tuple ()
 tvm_tuple is not an actual function and cannot be code-generated. It is used to represent a tuple structure in the value field of AttrStmt, as a hint to optimization. More...
 
const Op& tvm_struct_get ()
 See pseudo code. More...
 
const Op& tvm_struct_set ()
 See pseudo code. More...
 
const Op& lookup_param ()
 See pseudo code: Type lookup_param(String param_name) { return __tvm_param__param_name; }. More...
 
const Op& tvm_throw_last_error ()
 See pseudo code. More...
 
const Op& tvm_stack_alloca ()
 See pseudo code. More...
 
const Op& tvm_stack_make_shape ()
 Allocate a shape tuple on the stack and return the handle. More...
 
const Op& tvm_stack_make_array ()
 Allocate an NDArray (DLTensor) on the stack and return the handle. More...
 
const Op& tvm_call_packed ()
 See pseudo code. More...
 
const Op& tvm_call_cpacked ()
 See pseudo code. More...
 
const Op& tvm_call_trace_packed ()
 See pseudo code. More...
 
const Op& tvm_check_return ()
 Checks that the return value of another call is as expected, or returns a given value otherwise. More...
 
const Op& tvm_thread_context ()
 See pseudo code. Marks the content as a thread-local context; can be optimized by issuing the call only once at thread start. More...
 
const Op& tvm_thread_invariant ()
 Mark a condition to be thread invariant. This means the condition must be the same for all threads. More...
 
const Op& tvm_call_packed_lowered ()
 Lowered version of call packed; the space for values and type codes is explicitly allocated. More...
 
const Op& tvm_call_cpacked_lowered ()
 Lowered version of call c-packed; the space for values and type codes is explicitly allocated. More...
 
const Op& tvm_call_trace_packed_lowered ()
 Lowered version of the trace intrinsic; the space for values and type codes is explicitly allocated. The return value is the (end - 1) value on the stack. More...
 
const Op& tvm_storage_sync ()
 See pseudo code. More...
 
const Op& tvm_warp_shuffle ()
 See pseudo code. More...
 
const Op& tvm_warp_shuffle_up ()
 
const Op& tvm_warp_shuffle_down ()
 
const Op& tvm_warp_activemask ()
 
const Op& tvm_global_barrier_kinit ()
 Initialize the global barrier. Call this at the beginning of a kernel that needs the global barrier. More...
 
const Op& tvm_thread_allreduce ()
 See pseudo code. More...
 
const Op& tvm_load_matrix_sync ()
 tvm intrinsic for tensor core load operators. More...
 
const Op& tvm_mma_sync ()
 tvm intrinsic for tensor core mma_sync operators. More...
 
const Op& tvm_bmma_sync ()
 tvm intrinsic for tensor core bmma_sync operators. More...
 
const Op& tvm_fill_fragment ()
 tvm intrinsic for tensor core fill_fragment operators. More...
 
const Op& tvm_store_matrix_sync ()
 tvm intrinsic for tensor core store operators. More...
 
const Op& ptx_mma ()
 tvm intrinsic for ptx tensor core mma instructions. More...
 
const Op& ptx_ldg32 ()
 tvm intrinsic for ptx predicated load with 32-bit data type. More...
 
const Op& ptx_mma_sp ()
 tvm intrinsic for sparse tensor core ptx instructions. More...
 
const Op& ptx_ldmatrix ()
 tvm intrinsic for ptx load matrix from shared memory. More...
 
const Op& ptx_cp_async ()
 tvm intrinsics for ptx async copy from global to shared memory using cp.async More...
 
const Op& ptx_cp_async_bulk ()
 tvm intrinsics for ptx async copy from global to shared memory using cp.async.bulk More...
 
const Op& ptx_commit_group ()
 tvm intrinsics for ptx async copy commit and wait. More...
 
const Op& ptx_wait_group ()
 
const Op& ptx_cp_async_barrier ()
 tvm intrinsics for ptx async copy barrier using cp.async.mbarrier.arrive More...
 
const Op& ptx_init_barrier_thread_count ()
 tvm intrinsics for ptx barrier initialization of thread count using mbarrier.init More...
 
const Op& ptx_arrive_barrier ()
 tvm intrinsics for ptx barrier arrival using mbarrier.arrive More...
 
const Op& ptx_arrive_barrier_expect_tx ()
 tvm intrinsic for ptx barrier arrival with expect tx using mbarrier.arrive.expect_tx More...
 
const Op& ptx_wait_barrier ()
 tvm intrinsics for ptx barrier wait using mbarrier.try_wait More...
 
const Op& create_barriers ()
 tvm intrinsics to create N barriers. More...
 
const Op& mma_store ()
 tvm intrinsic for storing the result of PTX MMA into a destination pointer. For example, if each thread in a warp of size 32 has 4 elements from the result of m16xn8xk16 MMA in its registers, this intrinsic can be used to store the result in a 16x8 region in shared or global memory. More...
 
const Op& mma_fill ()
 tvm intrinsic for zero-initializing an MMA accumulation register. For example, if each thread in a warp of size 32 has 8 elements from the A matrix in m16xn8xk16 MMA in its registers, this intrinsic can be used to zero-initialize its 4 accumulation registers. More...
 
const Op& make_filled_simdgroup_matrix ()
 tvm intrinsic for initializing a simdgroup with a given value. More...
 
const Op& simdgroup_load ()
 tvm intrinsic for loading data from device memory or threadgroup memory to a simdgroup. More...
 
const Op& simdgroup_store ()
 tvm intrinsic for storing data from a simdgroup to device memory or threadgroup memory. More...
 
const Op& simdgroup_multiply_accumulate ()
 tvm intrinsic for multiplying and accumulating two matrices in a simdgroup. More...
 
const Op& vectorhigh ()
 Get the high half of the vector. More...
 
const Op& vectorlow ()
 Get the low half of the vector. More...
 
const Op& vectorcombine ()
 Concatenate two vectors. More...
 
const Op& dp4a ()
 Dot product of two int8x4 vectors and add an optional accumulator. More...
 
const Op& atomic_add ()
 Atomic add instruction, corresponding e.g. to atomicAdd in CUDA. More...
 
const Op& nd_mem_alloc_with_scope ()
 Create an Nd memory allocation with storage scope. More...
 
const Op& texture2d_store ()
 Store to texture 2d memory. More...
 
const Op& texture2d_load ()
 Load from texture 2d memory. More...
 
const Op& dma_copy ()
 Initiate a non-blocking DMA copy from source to destination. More...
 
const Op& dma_wait ()
 Wait until the number of DMA groups in flight is less than or equal to some maximum. More...
 
const Op& dma_start_group ()
 Start a group of DMA copies. More...
 
const Op& dma_end_group ()
 End a group of DMA copies. More...
 
const Op& assume ()
 Provide a true statement that can be used for simplifications. More...
 
const Op& undef ()
 Returns an initialized but arbitrary value. More...
 
const Op& start_profile_intrinsic ()
 Profiling intrinsic. More...
 
const Op& end_profile_intrinsic ()
 Profiling intrinsic. More...
 
const Op& anylist_getitem ()
 Get an item from an anylist and return it. More...
 
const Op& anylist_resetitem ()
 Reset and clear an item in an anylist. More...
 
const Op& anylist_setitem_call_packed ()
 Set an item into an anylist by running a packed function call. More...
 
const Op& anylist_setitem_call_cpacked ()
 Same as anylist_setitem_call_packed but uses the C calling convention. More...
 
const Op& vscale ()
 Get the target's vscale value. It will be lowered to the llvm.vscale intrinsic (https://llvm.org/docs/LangRef.html#llvm-vscale-intrinsic). More...
 
const Op& get_active_lane_mask ()
 Calculate a predicate mask given an upper bound (limit) and a current value (base). More...
 

Detailed Description

Collection of builtin intrinsics as ops.

Enumeration Type Documentation

◆ TVMStructFieldKind

The kind of structure field info used in intrinsics.

Enumerator
kArrAddr 
kArrData 
kArrShape 
kArrStrides 
kArrNDim 
kArrTypeCode 
kArrTypeBits 
kArrTypeLanes 
kArrByteOffset 
kArrDeviceId 
kArrDeviceType 
kArrKindBound_ 
kTVMValueContent 
kTVMValueKindBound_ 

Function Documentation

◆ address_of()

const Op& tvm::tir::builtin::address_of ( )

Returns the address of an element in the buffer (see pseudocode below).

The number of indices should match the dimensionality of the buffer being accessed. If this operation occurs after buffer flattening, the number of indices must be supported by the target (i.e. N>1 only on targets that support non-flat memory buffers).

Handle address_of(BufferLoad *op) { return &op->buffer_var[op->indices[0], op->indices[1], ..., op->indices[N-1]]; }
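
As a minimal illustration (not part of the TVM docs; the buffer name, shape, and dtype are made up), a call to this builtin can be constructed from C++ roughly as follows:

#include <tvm/tir/buffer.h>
#include <tvm/tir/builtin.h>
#include <tvm/tir/expr.h>

using namespace tvm;
using namespace tvm::tir;

// Build &A[3, 4] as a handle-typed expression for a hypothetical 16x8 float32 buffer "A".
PrimExpr MakeAddressOf() {
  Buffer buf = decl_buffer({16, 8}, DataType::Float(32), "A");
  return Call(DataType::Handle(), builtin::address_of(), {BufferLoad(buf, {3, 4})});
}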

◆ anylist_getitem()

const Op& tvm::tir::builtin::anylist_getitem ( )

Get an item from an anylist and return it.

Any anylist_getitem(Handle anylist, int index) { return anylist[index]; }

Note
This intrinsic is only applicable when appearing in call_packed and anylist_setitem_call_packed.

◆ anylist_resetitem()

const Op& tvm::tir::builtin::anylist_resetitem ( )

Reset and clear an item in an anylist.

void anylist_resetitem(Handle anylist, int index) { anylist[index] = nullptr; }

Note
This intrinsic is only applicable when appearing in call_packed and anylist_setitem_call_packed.

◆ anylist_setitem_call_cpacked()

const Op& tvm::tir::builtin::anylist_setitem_call_cpacked ( )

Same as anylist_setitem_call_packed but uses the C calling convention.

◆ anylist_setitem_call_packed()

const Op& tvm::tir::builtin::anylist_setitem_call_packed ( )

Set an item into an anylist by running a packed function call.

void anylist_setitem_call_packed(Handle anylist, int index, name, *args) { anylist[index] = call_packed(name, *args); }

Note
This intrinsic can be used in combination with anylist_getitem.

◆ assume()

const Op& tvm::tir::builtin::assume ( )

Provide a true statement that can be used for simplifications.

Compile-time representation of known constraints about function inputs. This assumption is removed when lowering, and does not occur in codegen.

◆ atomic_add()

const Op& tvm::tir::builtin::atomic_add ( )

Atomic add instruction, corresponding e.g. to atomicAdd in CUDA.

◆ bitwise_and()

const Op& tvm::tir::builtin::bitwise_and ( )

Bitwise and operator.

◆ bitwise_not()

const Op& tvm::tir::builtin::bitwise_not ( )

Bitwise not operator.

◆ bitwise_or()

const Op& tvm::tir::builtin::bitwise_or ( )

Bitwise or operator.

◆ bitwise_xor()

const Op& tvm::tir::builtin::bitwise_xor ( )

Bitwise xor operator.

◆ call_extern()

const Op& tvm::tir::builtin::call_extern ( )

Call an extern C function with a given name and signature from the types of args in the runtime environment.

Type call_extern(name, args...) { return dlsym(name)(args...); }

Note
This intrinsic does not provide any type checking, and is mainly used for backward compatibility reasons. Always consider using a pre-registered and typed tvm::Op first.
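
As an illustration only (the extern symbol name "my_extern_fn" is hypothetical), such a call can be assembled from C++ by passing the function name as the first argument:

#include <tvm/tir/builtin.h>
#include <tvm/tir/expr.h>

using namespace tvm;
using namespace tvm::tir;

// call_extern takes the callee name as a StringImm, followed by the actual call arguments.
PrimExpr MakeExternCall(PrimExpr x, PrimExpr y) {
  return Call(DataType::Float(32), builtin::call_extern(), {StringImm("my_extern_fn"), x, y});
}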

◆ call_llvm_intrin()

const Op& tvm::tir::builtin::call_llvm_intrin ( )

Call an LLVM intrinsic with a given intrinsic id and signature from the types of args in the runtime environment.

Type call_llvm_intrin(intrin_id, args...) { return dlsym(name)(args...); }

Note
This op does not provide any type checking.

◆ call_llvm_pure_intrin()

const Op& tvm::tir::builtin::call_llvm_pure_intrin ( )

Call an LLVM pure intrinsic with a given intrinsic id and signature from the types of args in the runtime environment.

Type call_llvm_pure_intrin(intrin_id, args...) { return dlsym(name)(args...); }

Note
This op does not provide any type checking.

◆ call_pure_extern()

const Op& tvm::tir::builtin::call_pure_extern ( )

Call a pure extern C function with a given name and signature from the types of args in the runtime environment.

Type call_pure_extern(name, args...) { return dlsym(name)(args...); }

Note
This intrinsic does not provide any type checking, and is mainly used for backward compatibility reasons. Always consider using a pre-registered and typed tvm::Op first.

◆ call_spirv_pure_glsl450()

const Op& tvm::tir::builtin::call_spirv_pure_glsl450 ( )

Call a SPIR-V pure GLSL450 intrinsic.

Type call_spirv_pure_glsl450(intrin_id, args...) { return dlsym(name)(args...); }

Note
This op does not provide any type checking.

◆ create_barriers()

const Op& tvm::tir::builtin::create_barriers ( )

tvm intrinsics to create N barriers.

create_barriers(int barrier_count)

◆ dma_copy()

const Op& tvm::tir::builtin::dma_copy ( )

Initiate a non-blocking DMA copy from source to destination.

The copy is launched immediately.

If a dma_start_group() call is active, the copy will be added to the current group for tracking of in-flight group counts.

If no dma_start_group() call is active, the copy will be tracked individually i.e. as a group with size 1.

◆ dma_end_group()

const Op& tvm::tir::builtin::dma_end_group ( )

End a group of DMA copies.

Track all calls to dma_copy() that occurred since the preceding dma_start_group() as a single group in-flight.

Calling dma_end_group() without an active group is unsupported.

Note: A group of DMA calls may be empty, and will still contribute to the count of in-flight groups used by dma_wait().

◆ dma_start_group()

const Op& tvm::tir::builtin::dma_start_group ( )

Start a group of DMA copies.

Any call to dma_copy() that occurs after dma_start_group() will be added to the current group for tracking of in-flight group counts.

Only one DMA group may be active at a given time. Calling dma_start_group() while a group is active is unsupported.

◆ dma_wait()

const Op& tvm::tir::builtin::dma_wait ( )

Wait until the number of DMA groups in flight is less than or equal to some maximum.

Calling dma_wait() while a group is active is unsupported.

◆ dp4a()

const Op& tvm::tir::builtin::dp4a ( )

Dot product of two int8x4 vectors and add an optional accumulator.
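
For reference, a scalar sketch of the computation this intrinsic performs (the semantics only, not the lowered code):

#include <cstdint>

// acc + dot(a, b) over four int8 lanes, accumulated in int32.
int32_t Dp4aReference(const int8_t a[4], const int8_t b[4], int32_t acc) {
  for (int i = 0; i < 4; ++i) acc += static_cast<int32_t>(a[i]) * static_cast<int32_t>(b[i]);
  return acc;
}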

◆ end_profile_intrinsic()

const Op& tvm::tir::builtin::end_profile_intrinsic ( )

Profiling intrinsic.

◆ fma()

const Op& tvm::tir::builtin::fma ( )

Fused multiply add.

Type fma(a, b, c) { return a * b + c; }

◆ get_active_lane_mask()

const Op& tvm::tir::builtin::get_active_lane_mask ( )

Calculate a predicate mask given an upper bound (limit) and a current value (base).

It will be lowered to the llvm.get.active.lane.mask intrinsic (https://llvm.org/docs/LangRef.html#llvm-get-active-lane-mask-intrinsics).
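
A scalar sketch of the mask semantics, mirroring the LLVM intrinsic it lowers to (the lane count parameter here is purely illustrative):

#include <cstdint>
#include <vector>

// Lane i of the result is active iff base + i < limit.
std::vector<bool> ActiveLaneMaskReference(uint64_t base, uint64_t limit, int lanes) {
  std::vector<bool> mask(lanes);
  for (int i = 0; i < lanes; ++i) mask[i] = (base + i) < limit;
  return mask;
}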

◆ if_then_else()

const Op& tvm::tir::builtin::if_then_else ( )

Same as select, used for unsafe memory access.

Type tvm_if_then_else(cond, a, b) { return cond ? a : b; }
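
A minimal sketch of building such an expression, assuming the if_then_else helper declared in tvm/tir/op.h (the guarding logic and names are illustrative):

#include <tvm/tir/op.h>
#include <tvm/tir/var.h>

using namespace tvm;
using namespace tvm::tir;

// Pick a fallback index when i >= n; note that the untaken branch may still be evaluated.
PrimExpr GuardedIndex(Var i, PrimExpr n) {
  return if_then_else(i < n, i, n - 1);
}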

◆ isnan()

const Op& tvm::tir::builtin::isnan ( )

Check if value is nan.

◆ isnullptr()

const Op& tvm::tir::builtin::isnullptr ( )

See pseudo code.

bool isnullptr(void* handle) { return handle == nullptr; }

◆ large_uint_imm()

const Op& tvm::tir::builtin::large_uint_imm ( )

See pseudo code.

Construct a big uint that may not be representable by int64.

Expr large_uint_imm(uint32_t v0, uint32_t v1) { return (v1 << 32) | v0; }
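
For illustration, a 64-bit constant splits into the two 32-bit halves that this builtin expects (low word first):

#include <cstdint>

void SplitUint64(uint64_t value, uint32_t* v0, uint32_t* v1) {
  *v0 = static_cast<uint32_t>(value & 0xffffffffu);  // low 32 bits
  *v1 = static_cast<uint32_t>(value >> 32);          // high 32 bits
}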

◆ likely()

const Op& tvm::tir::builtin::likely ( )

Marks that a condition is likely going to happen.

◆ lookup_param()

const Op& tvm::tir::builtin::lookup_param ( )

See pseudo code: Type lookup_param(String param_name) { return __tvm_param__param_name; }.

◆ make_filled_simdgroup_matrix()

const Op& tvm::tir::builtin::make_filled_simdgroup_matrix ( )

tvm intrinsic for initializing a simdgroup with a given value.

Note
Only the 8x8 shape is supported by the Metal spec and TVM, but the shape is still kept as parameters to preserve a similar interface to the Metal spec.

void make_filled_simdgroup_matrix(Var d, PrimExpr index, PrimExpr value, int col = 8, int row = 8);

◆ mma_fill()

const Op& tvm::tir::builtin::mma_fill ( )

tvm intrinsic for zero-initializing an MMA accumulation register. For example, if each thread in a warp of size 32 has 8 elements from the A matrix in m16xn8xk16 MMA in its registers, this intrinsic can be used to zero-initialize its 4 accumulation registers.

There is no real PTX instruction that does that, but we introduce this intrinsic for the same reason as mma_store above.

void mma_fill(IntImm local_size, Var local_ptr, Expr offset);

◆ mma_store()

const Op& tvm::tir::builtin::mma_store ( )

tvm intrinsic for storing the result of PTX MMA into a destination pointer. For example, if each thread in a warp of size 32 has 4 elements from the result of m16xn8xk16 MMA in its registers, this intrinsic can be used to store the result in a 16x8 region in shared or global memory.

There is no real PTX instruction that does that, but we want to hide details of complex index manipulation behind this intrinsic to simplify TIR lowering passes (e.g. LowerWarpMemory).

void mma_store(IntImm m, IntImm n, Var dst_ptr, Var src_ptr, Expr src_offset, Var dst_stride);

◆ nd_mem_alloc_with_scope()

const Op& tvm::tir::builtin::nd_mem_alloc_with_scope ( )

Create an Nd memory allocation with storage scope.

◆ popcount()

const Op& tvm::tir::builtin::popcount ( )

Popcount.

◆ prefetch()

const Op& tvm::tir::builtin::prefetch ( )

Prefetch a cacheline.

◆ ptx_arrive_barrier()

const Op& tvm::tir::builtin::ptx_arrive_barrier ( )

tvm intrinsics for ptx barrier arrival using mbarrier.arrive

ptx_arrive_barrier(int barrier_id)

◆ ptx_arrive_barrier_expect_tx()

const Op& tvm::tir::builtin::ptx_arrive_barrier_expect_tx ( )

tvm intrinsic for ptx barrier arrival with expect tx using mbarrier.arrive.expect_tx

ptx_arrive_barrier_expect_tx(int barrier_id, int byte_count)

◆ ptx_commit_group()

const Op& tvm::tir::builtin::ptx_commit_group ( )

tvm intrinsics for ptx async copy commit and wait.

void ptx_commit_group(); void ptx_wait_group(int num);

◆ ptx_cp_async()

const Op& tvm::tir::builtin::ptx_cp_async ( )

tvm intrinsics for ptx async copy from global to shared memory using cp.async

void ptx_cp_async(Var shared_ptr, Expr shared_offset, Var global_ptr, Expr global_offset, size_t bytes);

◆ ptx_cp_async_barrier()

const Op& tvm::tir::builtin::ptx_cp_async_barrier ( )

tvm intrinsics for ptx async copy barrier using cp.async.mbarrier.arrive

ptx_cp_async_barrier(int barrier_id)

◆ ptx_cp_async_bulk()

const Op& tvm::tir::builtin::ptx_cp_async_bulk ( )

tvm intrinsics for ptx async copy from global to shared memory using cp.async.bulk

void ptx_cp_async_bulk(Var shared_ptr, Expr shared_offset, Var global_ptr, Expr global_offset, size_t bytes, int barrier_id);

◆ ptx_init_barrier_thread_count()

const Op& tvm::tir::builtin::ptx_init_barrier_thread_count ( )

tvm intrinsics for ptx barrier initialization of thread count using mbarrier.init

ptx_init_barrier_thread_count(int barrier_id, int thread_count)

◆ ptx_ldg32()

const Op & tvm::tir::builtin::ptx_ldg32 ( )

tvm intrinsic for ptx predicated load with 32-bit data type.

◆ ptx_ldmatrix()

const Op& tvm::tir::builtin::ptx_ldmatrix ( )

tvm intrinsic for ptx load matrix from shared memory.

void ptx_ldmatrix(Bool trans, IntImm num, StringImm type, Var local_ptr, Expr local_offset, Var smem_ptr, Expr smem_offset);

◆ ptx_mma()

const Op& tvm::tir::builtin::ptx_mma ( )

tvm intrinsic for ptx tensor core mma instructions.

void ptx_mma(StringImm shape, StringImm A_layout, StringImm B_layout, StringImm A_dtype, StringImm B_dtype, StringImm C_dtype, Var multiplicand_a, Expr a_index, Var multiplicand_b, Expr b_index, Var accumulator, Expr c_index, bool saturate);

◆ ptx_mma_sp()

const Op& tvm::tir::builtin::ptx_mma_sp ( )

tvm intrinsic for sparse tensor core ptx instructions.

void ptx_mma_sp(StringImm shape, StringImm A_layout, StringImm B_layout, StringImm A_dtype, StringImm B_dtype, StringImm C_dtype, Var multiplicand_a, Expr a_index, Var multiplicand_b, Expr b_index, Var accumulator, Expr c_index, Var metadata, Expr meta_index, Var sparse_selector, bool saturate);

◆ ptx_wait_barrier()

const Op& tvm::tir::builtin::ptx_wait_barrier ( )

tvm intrinsics for ptx barrier wait using mbarrier.try_wait

ptx_wait_barrier(int barrier_id)

◆ ptx_wait_group()

const Op& tvm::tir::builtin::ptx_wait_group ( )

◆ q_multiply_shift()

const Op& tvm::tir::builtin::q_multiply_shift ( )

Execute a multiplication between two Q-numbers x and y followed by a right shift s. The default rounding rule is to the nearest value, rounding half up (i.e., round(x.1) = x and round(x.5) = x+1).
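
An illustrative scalar reference of the rounding behaviour described above (this is not TVM's lowering; it assumes s > 0 and a compiler that provides __int128):

#include <cstdint>

int64_t QMultiplyShiftReference(int64_t x, int64_t y, int s) {
  __int128 prod = static_cast<__int128>(x) * static_cast<__int128>(y);
  __int128 half = static_cast<__int128>(1) << (s - 1);  // 0.5 in the shifted fixed-point scale
  return static_cast<int64_t>((prod + half) >> s);      // round half up, then shift
}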

◆ reinterpret()

const Op& tvm::tir::builtin::reinterpret ( )

Reinterpret the value using the target type.

◆ ret()

const Op& tvm::tir::builtin::ret ( )

Return value.

◆ shift_left()

const Op& tvm::tir::builtin::shift_left ( )

Left shift.

◆ shift_right()

const Op& tvm::tir::builtin::shift_right ( )

Right shift.

◆ simdgroup_load()

const Op& tvm::tir::builtin::simdgroup_load ( )

tvm intrinsic for loading data from device memory or threadgroup memory to a simdgroup.

Note
Only the 8x8 shape is supported by the Metal spec and TVM, but the shape is still kept as parameters to preserve a similar interface to the Metal spec.

void simdgroup_load(Var d, PrimExpr index, PrimExpr ptr, PrimExpr stride, int col = 8, int row = 8, bool transpose_matrix = false);

◆ simdgroup_multiply_accumulate()

const Op& tvm::tir::builtin::simdgroup_multiply_accumulate ( )

tvm intrinsic for multiplying and accumulating two matrices in a simdgroup.

Note
Only the 8x8 shape is supported by the Metal spec and TVM, but the shape is still kept as parameters to preserve a similar interface to the Metal spec.

void simdgroup_mma(Var d, PrimExpr index_d, Var a, PrimExpr index_a, Var b, PrimExpr index_b, Var c, PrimExpr index_c);

◆ simdgroup_store()

const Op& tvm::tir::builtin::simdgroup_store ( )

tvm intrinsic for storing data from a simdgroup to device memory or threadgroup memory.

Note
Only the 8x8 shape is supported by the Metal spec and TVM, but the shape is still kept as parameters to preserve a similar interface to the Metal spec.

void simdgroup_store(Var d, PrimExpr index, PrimExpr ptr, PrimExpr stride, int col = 8, int row = 8, bool transpose_matrix = false);

◆ start_profile_intrinsic()

const Op& tvm::tir::builtin::start_profile_intrinsic ( )

Profiling intrinsic.

◆ texture2d_load()

const Op& tvm::tir::builtin::texture2d_load ( )

Load from texture 2d memory.

◆ texture2d_store()

const Op& tvm::tir::builtin::texture2d_store ( )

Store to texture 2d memory.

◆ tvm_access_ptr()

const Op& tvm::tir::builtin::tvm_access_ptr ( )

Get head access address with memory access pattern info.

This operator also marks the range of the memory access. The offset and extent are in units of the DType (including the vectorization factor). rw_mask is a bit mask setting whether the access is a read (1) or a write (2). The access is assumed to happen in the current expression.

PtrType tvm_access_ptr(Expr dtype, DType* data, int offset, int extent, int rw_mask) {
  // DType == dtype.type();
  return &data[offset];
}
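
For illustration only, the rw_mask bit encoding described above can be written out as:

// Bit 0 marks the access as a read, bit 1 marks it as a write.
constexpr int kReadMask = 1;
constexpr int kWriteMask = 2;
constexpr int kReadWriteMask = kReadMask | kWriteMask;  // == 3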

◆ tvm_bmma_sync()

const Op& tvm::tir::builtin::tvm_bmma_sync ( )

tvm intrinsic for tensor core bmma_sync operators.

void tvm_bmma_sync(Var fragment_d, Expr index_d, Var fragment_a, Expr index_a, Var fragment_b, Expr index_b, Var fragment_c, Expr index_c) {
  nvcuda::wmma::bmma_sync(fragment_d[index_d], fragment_a[index_a], fragment_b[index_b], fragment_c[index_c]);
}

◆ tvm_call_cpacked()

const Op& tvm::tir::builtin::tvm_call_cpacked ( )

See pseudo code.

return_type tvm_call_cpacked(fname, TVMValue* args) {
  int ret_code;
  TVMValue ret_value;
  (*fname)(args, type_code_of(args), len(args), &ret_value, &ret_code);
  return cast(return_type, ret_value.v_return_type);
}

◆ tvm_call_cpacked_lowered()

const Op& tvm::tir::builtin::tvm_call_cpacked_lowered ( )

Lowered version of call c-packed; the space for values and type codes is explicitly allocated.

int tvm_call_cpacked_lowered(fname, TVMValue* value_stack, int* tcode_stack, int begin, int end) {
  fname(TVMArgs(value_stack[begin:end], tcode_stack[begin:end]),
        TVMRetValue(value_stack + end, tcode_stack + end));
}

◆ tvm_call_packed()

const Op& tvm::tir::builtin::tvm_call_packed ( )

See pseudo code.

return_type tvm_call_packed(name, TVMValue* args) {
  TVMValue ret_value;
  int ret_code;
  ModuleNode* env = GetCurrentEnv();
  const PackedFunc* f = env->GetFuncFromEnv(name);
  (*f)(args, type_code_of(args), len(args), &ret_value, &ret_code);
  // return type can be int, float, handle.
  return cast(return_type, ret_value.v_return_type);
}

◆ tvm_call_packed_lowered()

const Op& tvm::tir::builtin::tvm_call_packed_lowered ( )

Lowered version of call packed; the space for values and type codes is explicitly allocated.

return_type tvm_call_packed_lowered(name, TVMValue* value_stack, int* tcode_stack, int begin, int end) {
  ModuleNode* env = GetCurrentEnv();
  const PackedFunc* f = env->GetFuncFromEnv(name);
  f->CallPacked(TVMArgs(value_stack[begin:end], tcode_stack[begin:end]),
                TVMRetValue(value_stack + end, tcode_stack + end));
  // return type can be int, float, handle.
  return cast(return_type, load_return_from(tcode_stack + end));
}

◆ tvm_call_trace_packed()

const Op& tvm::tir::builtin::tvm_call_trace_packed ( )

See pseudo code.

return_type tvm_call_trace_packed(name, TVMValue* args) {
  ModuleNode* env = GetCurrentEnv();
  const PackedFunc* f = env->GetFuncFromEnv(name);
  (*f)(args, type_code_of(args), len(args));
  // return type can be int, float, handle.
  return cast(return_type, ret_value.v_return_type);
}

◆ tvm_call_trace_packed_lowered()

const Op& tvm::tir::builtin::tvm_call_trace_packed_lowered ( )

Lowered version of the trace intrinsic; the space for values and type codes is explicitly allocated. The return value is the (end - 1) value on the stack.

return_type tvm_call_trace_packed_lowered(name, TVMValue* value_stack, int* tcode_stack, int begin, int end) {
  ModuleNode* env = GetCurrentEnv();
  const PackedFunc* f = env->GetFuncFromEnv(name);
  f->CallPacked(TVMArgs(value_stack[begin:end], tcode_stack[begin:end]),
                TVMRetValue(value_stack + end, tcode_stack + end));
  // return type can be int, float, handle.
  return cast(return_type, load_return_from(tcode_stack + end));
}

◆ tvm_check_return()

const Op& tvm::tir::builtin::tvm_check_return ( )

Checks that the return value of another call is as expected, or returns a given value otherwise.

Note
This is meant to serve a specific case in the AOT code generator, which cannot be fully represented in TIR.

Type tvm_check_return(expected, return_unexpected, nested_call) { if (nested_call() != expected) { return return_unexpected; } }

◆ tvm_context_id()

const Op& tvm::tir::builtin::tvm_context_id ( )

Return a unique context id, used as a hint for workspace separation. Different context ids guarantee non-overlapping workspaces.

◆ tvm_fill_fragment()

const Op& tvm::tir::builtin::tvm_fill_fragment ( )

tvm intrinsic for tensor core fill_fragment operators.

void tvm_fill_fragment(Var fragment, UIntImm m, UIntImm n, UIntImm k, Expr index, Expr value) {
  // m, n, k are the shape of the wmma fragment.
  // Fragments must be in 'wmma.accumulator' scope.
  nvcuda::wmma::fill_fragment(fragment[index], value);
}

◆ tvm_global_barrier_kinit()

const Op& tvm::tir::builtin::tvm_global_barrier_kinit ( )

Initialize the global barrier. Call this at the beginning of a kernel that needs the global barrier.

◆ tvm_load_matrix_sync()

const Op& tvm::tir::builtin::tvm_load_matrix_sync ( )

tvm intrinsic for tensor core load operators.

void tvm_load_matrix_sync(Var fragment, UIntImm m, UIntImm n, UIntImm k, Expr index, Expr buffer_ptr, Expr stride, StringImm layout) {
  // m, n, k are the shape of the wmma fragment.
  // The fragment layout (column-major or row-major) is determined by layout.
  // Fragments must be in 'wmma.matrix_a' or 'wmma.matrix_b' scope.
  nvcuda::wmma::load_matrix_sync(fragment[index], buffer_ptr, stride);
}

◆ tvm_mma_sync()

const Op& tvm::tir::builtin::tvm_mma_sync ( )

tvm intrinsic for tensor core mma_sync operators.

void tvm_mma_sync(Var fragment_d, Expr index_d, Var fragment_a, Expr index_a, Var fragment_b, Expr index_b, Var fragment_c, Expr index_c) {
  nvcuda::wmma::mma_sync(fragment_d[index_d], fragment_a[index_a], fragment_b[index_b], fragment_c[index_c]);
}

◆ tvm_stack_alloca()

const Op& tvm::tir::builtin::tvm_stack_alloca ( )

See pseudo code.

dtype in {shape, array, arg_value, arg_tcode}

Handle tvm_stack_alloca(string dtype, int num) { return new on stack dtype[num]; }

◆ tvm_stack_make_array()

const Op& tvm::tir::builtin::tvm_stack_make_array ( )

Allocate an NDArray (DLTensor) on the stack and return the handle.

Type tvm_stack_make_array(Expr data, Expr shape, Expr strides, Expr ndim, Expr dtype, Expr elem_offset) {
  ret = alloca stack DLTensor();
  ret->data = data;
  ret->shape = shape;
  ret->strides = strides != 0 ? strides : nullptr;
  ret->ndim = ndim;
  ret->dtype = dtype.type();
  ret->byte_offset = elem_offset * sizeof(dtype);
  return ret;
}

◆ tvm_stack_make_shape()

const Op& tvm::tir::builtin::tvm_stack_make_shape ( )

Allocate a shape tuple on the stack and return the handle.

Handle tvm_stack_make_shape(list args) {
  ret = alloca stack int64_t[len(args)];
  for i in range(len(args)):
    ret[i] = args[i];
  return &ret[0];
}

◆ tvm_static_handle()

const Op& tvm::tir::builtin::tvm_static_handle ( )

Create a function-local static handle that initializes to nullptr. Can be used to cache function-local static resources.

◆ tvm_storage_sync()

const Op& tvm::tir::builtin::tvm_storage_sync ( )

See pseudo code.

int tvm_storage_sync(std::string storage_scope) { __sync(storage_scope); return 0; }

◆ tvm_store_matrix_sync()

const Op& tvm::tir::builtin::tvm_store_matrix_sync ( )

tvm intrinsic for tensor core store operators.

void tvm_store_matrix_sync(Var fragment, UIntImm m, UIntImm n, UIntImm k, Expr index, Expr buffer_ptr, Expr stride, StringImm layout) {
  // m, n, k are the shape of the wmma fragment.
  // Fragments must be in 'wmma.accumulator' scope.
  nvcuda::wmma::store_matrix_sync(fragment[index], buffer_ptr, stride, layout);
}

◆ tvm_struct_get()

const Op& tvm::tir::builtin::tvm_struct_get ( )

See pseudo code.

Type tvm_struct_get(StructType* arr, int index, int field_id) { return arr[index]->field; }

See also
TVMStructFieldKind
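
As a hedged sketch (the handle variable is illustrative), the data pointer field of a DLTensor handle can be read with the kArrData kind from the enumeration above:

#include <tvm/tir/builtin.h>
#include <tvm/tir/expr.h>
#include <tvm/tir/var.h>

using namespace tvm;
using namespace tvm::tir;

// tvm_struct_get(handle, index, field_id): read field kArrData of element 0.
PrimExpr LoadArrData(Var dltensor_handle) {
  return Call(DataType::Handle(), builtin::tvm_struct_get(),
              {dltensor_handle, IntImm(DataType::Int(32), 0),
               IntImm(DataType::Int(32), builtin::kArrData)});
}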

◆ tvm_struct_set()

const Op& tvm::tir::builtin::tvm_struct_set ( )

See pseudo code.

Handle tvm_struct_set(StructType* arr, int index, int field_id, value) { arr[index]->field = value; }

See also
TVMStructFieldKind

◆ tvm_thread_allreduce()

const Op& tvm::tir::builtin::tvm_thread_allreduce ( )

See pseudo code.

void tvm_thread_allreduce(UIntImm size, Expr source0, ..., Expr cond, Var reduce_temp0, .., Var thread_idx1, ...) {
  // Constraint: the other thread_idx remain the same.
  // reduce_temp is used to save the intermediate result.
  reduce_temp0, ... = reduce(combiner, source0, ..., cond
    over [thread_idx1, thread_idx2] passed by any caller)
}

◆ tvm_thread_context()

const Op& tvm::tir::builtin::tvm_thread_context ( )

See pseudo code. Marks the content as a thread-local context; can be optimized by issuing the call only once at thread start.

Do not allow nesting (getting a thread context from another).

Handle tvm_thread_context(Expr call) { return call; }

◆ tvm_thread_invariant()

const Op& tvm::tir::builtin::tvm_thread_invariant ( )

Mark a condition to be thread invariant. This means the condition must be the same for all threads.

◆ tvm_throw_last_error()

const Op& tvm::tir::builtin::tvm_throw_last_error ( )

See pseudo code.

void tvm_throw_last_error() { throw TVMGetLastError(); }

◆ tvm_tuple()

const Op& tvm::tir::builtin::tvm_tuple ( )

tvm_tuple is not an actual function and cannot be code-generated. It is used to represent a tuple structure in the value field of AttrStmt, as a hint to optimization.

Handle tvm_tuple(value0, value1, ..., value_n);

◆ tvm_warp_activemask()

const Op& tvm::tir::builtin::tvm_warp_activemask ( )

◆ tvm_warp_shuffle()

const Op& tvm::tir::builtin::tvm_warp_shuffle ( )

See pseudo code.

Type tvm_warp_shuffle(mask, Type value, warp_id, width, warp_size) { return (value passed in by warp indicated by this_warp_id); }

Type tvm_warp_shuffle_up(mask, Type value, offset, width, warp_size) { return (value passed in by warp indicated by this_warp_id - offset); }

Type tvm_warp_shuffle_down(mask, Type value, offset, width, warp_size) { return (value passed in by warp indicated by this_warp_id + offset); }

unsigned tvm_warp_activemask() { return (32-bit mask of currently active threads in the calling warp); }

Parameter warp_id indicates the source thread ID in a warp.

Parameter offset indicates the relative distance to this_warp_id.

Parameter width indicates the number of threads involved in one shuffle. See CUDA document for __shfl_sync, __shfl_up_sync, __shfl_down_sync and __activemask.

Parameter warp_size is the size of a warp, which helps a backend to determine whether the width parameter is legal.

◆ tvm_warp_shuffle_down()

const Op& tvm::tir::builtin::tvm_warp_shuffle_down ( )

◆ tvm_warp_shuffle_up()

const Op& tvm::tir::builtin::tvm_warp_shuffle_up ( )

◆ undef()

const Op& tvm::tir::builtin::undef ( )

Returns an initialized but arbitrary value.

Compile-time representation of memory locations whose values may be altered as a result of optimizations.

◆ vectorcombine()

const Op& tvm::tir::builtin::vectorcombine ( )

Concatenate two vectors.

◆ vectorhigh()

const Op& tvm::tir::builtin::vectorhigh ( )

Get the high half of the vector.

◆ vectorlow()

const Op& tvm::tir::builtin::vectorlow ( )

Get the low half of the vector.

◆ vscale()

const Op& tvm::tir::builtin::vscale ( )

Get the target's vscale value. It will be lowered to the llvm.vscale intrinsic (https://llvm.org/docs/LangRef.html#llvm-vscale-intrinsic).