tvm::tir::builtin
Collection of builtin intrinsics as ops. More...
Enumerations | |
enum | TVMStructFieldKind : int { kArrAddr, kArrData, kArrShape, kArrStrides, kArrNDim, kArrTypeCode, kArrTypeBits, kArrTypeLanes, kArrByteOffset, kArrDeviceId, kArrDeviceType, kArrKindBound_, kTVMValueContent, kTVMValueKindBound_ } |
The kind of structure field info used in intrinsic. More... | |
Functions | |
const Op & | ret () |
Return value. More... | |
const Op & | reinterpret () |
Reinterpret the value using the target type. More... | |
const Op & | likely () |
Marks a condition as likely to happen. More... | |
const Op & | bitwise_and () |
Bitwise and operator. More... | |
const Op & | bitwise_or () |
Bitwise or operator. More... | |
const Op & | bitwise_xor () |
Bitwise xor operator. More... | |
const Op & | bitwise_not () |
Bitwise not operator. More... | |
const Op & | shift_left () |
Left shift. More... | |
const Op & | shift_right () |
Right shift. More... | |
const Op & | large_uint_imm () |
See pseudo code. More... | |
const Op & | q_multiply_shift () |
Execute a multiplication between two Q-numbers x and y followed by a right shift s. The default rounding rule is to the nearest value, rounding half up (i.e., round(x.1) = x and round(x.5) = x+1). More... | |
const Op & | address_of () |
Returns the address of an element in the buffer (see pseudocode below). More... | |
const Op & | if_then_else () |
Same as select, used for unsafe memory access. More... | |
const Op & | isnullptr () |
See pseudo code. More... | |
const Op & | isnan () |
Check if value is nan. More... | |
const Op & | popcount () |
Popcount. More... | |
const Op & | fma () |
Fused multiply add. More... | |
const Op & | call_extern () |
Call an extern C function with given name and signature from the types of args in the runtime environment. More... | |
const Op & | call_pure_extern () |
Call a pure extern C function with given name and signature from the types of args in the runtime environment. More... | |
const Op & | call_llvm_intrin () |
Call an LLVM intrinsic with a given intrinsic id and signature from the types of args in the runtime environment. More... | |
const Op & | call_llvm_pure_intrin () |
Call an LLVM pure intrinsic with a given intrinsic id and signature from the types of args in the runtime environment. More... | |
const Op & | call_spirv_pure_glsl450 () |
Call an SPIRV pure GLSL450 intrinsic. More... | |
const Op & | prefetch () |
Prefetch a cacheline. More... | |
const Op & | tvm_access_ptr () |
Get head access address with memory access pattern info. More... | |
const Op & | tvm_static_handle () |
Create a function-local static handle that initializes to nullptr. Can be used to cache function-local static resources. More... | |
const Op & | tvm_context_id () |
Return a unique context id, used as a hint for workspace separation. Different context ids guarantee non-overlapping workspaces. More... | |
const Op & | tvm_tuple () |
tvm_tuple is not an actual function and cannot be code-generated. It is used to represent tuple structure in the value field of AttrStmt, for the sake of giving hints to optimization. More... | |
const Op & | tvm_struct_get () |
See pseudo code. More... | |
const Op & | tvm_struct_set () |
See pseudo code. More... | |
const Op & | lookup_param () |
See pseudo code Type lookup_param(String param_name) { return __tvm_param__param_name; }. More... | |
const Op & | tvm_throw_last_error () |
See pseudo code. More... | |
const Op & | tvm_stack_alloca () |
See pseudo code. More... | |
const Op & | tvm_stack_make_shape () |
Allocate a shape tuple on stack, return the handle. More... | |
const Op & | tvm_stack_make_array () |
Allocate a NDArray(DLTensor) on stack, return the handle. More... | |
const Op & | tvm_call_packed () |
See pseudo code. More... | |
const Op & | tvm_call_cpacked () |
See pseudo code. More... | |
const Op & | tvm_call_trace_packed () |
See pseudo code. More... | |
const Op & | tvm_check_return () |
Checks the return value of another call is correct or returns a given value. More... | |
const Op & | tvm_thread_context () |
See pseudo code. Marks the content as thread-local context; can be optimized by calling it only once at thread start. More... | |
const Op & | tvm_thread_invariant () |
Mark a condition to be thread invariant. This means the condition must be the same for all threads. More... | |
const Op & | tvm_call_packed_lowered () |
Lowered version of call packed, the space of value and type codes are explicitly allocated. More... | |
const Op & | tvm_call_cpacked_lowered () |
Lowered version of call c-packed, the space of value and type codes are explicitly allocated. More... | |
const Op & | tvm_call_trace_packed_lowered () |
Lowered version of trace intrinsic, the space of value and type codes are explicitly allocated. The return value is the (end - 1) value on the stack. More... | |
const Op & | tvm_storage_sync () |
See pseudo code. More... | |
const Op & | tvm_warp_shuffle () |
See pseudo code. More... | |
const Op & | tvm_warp_shuffle_up () |
const Op & | tvm_warp_shuffle_down () |
const Op & | tvm_warp_activemask () |
const Op & | tvm_global_barrier_kinit () |
Initialize the global barrier. Call this at the beginning of a kernel that needs the global barrier. More... | |
const Op & | tvm_thread_allreduce () |
See pseudo code. More... | |
const Op & | tvm_load_matrix_sync () |
tvm intrinsic for tensor core load operators. More... | |
const Op & | tvm_mma_sync () |
tvm intrinsic for tensor core mma_sync operators. More... | |
const Op & | tvm_bmma_sync () |
tvm intrinsic for tensor core bmma_sync operators. More... | |
const Op & | tvm_fill_fragment () |
tvm intrinsic for tensor core fill_fragment operators. More... | |
const Op & | tvm_store_matrix_sync () |
tvm intrinsic for tensor core store operators. More... | |
const Op & | ptx_mma () |
tvm intrinsic for ptx tensor core mma instructions. More... | |
const Op & | ptx_ldg32 () |
tvm intrinsic for ptx predicate load with 32-bit data type. More... | |
const Op & | ptx_mma_sp () |
tvm intrinsic for sparse tensor core ptx instructions. More... | |
const Op & | ptx_ldmatrix () |
tvm intrinsic for ptx load matrix from shared memory. More... | |
const Op & | ptx_cp_async () |
tvm intrinsics for ptx async copy from global to shared memory using cp.async More... | |
const Op & | ptx_cp_async_bulk () |
tvm intrinsics for ptx async copy from global to shared memory using cp.async.bulk More... | |
const Op & | ptx_commit_group () |
tvm intrinsics for ptx async copy commit and wait. More... | |
const Op & | ptx_wait_group () |
const Op & | ptx_cp_async_barrier () |
tvm intrinsics for ptx async copy barrier using cp.async.mbarrier.arrive More... | |
const Op & | ptx_init_barrier_thread_count () |
tvm intrinsics for ptx barrier initialization of thread count using mbarrier.init More... | |
const Op & | ptx_arrive_barrier () |
tvm intrinsics for ptx barrier arrival using mbarrier.arrive More... | |
const Op & | ptx_arrive_barrier_expect_tx () |
tvm intrinsic for ptx barrier arrival with expect tx using mbarrier.arrive.expect_tx More... | |
const Op & | ptx_wait_barrier () |
tvm intrinsics for ptx barrier wait using mbarrier.try_wait More... | |
const Op & | create_barriers () |
tvm intrinsics to create N barriers More... | |
const Op & | mma_store () |
tvm intrinsic for storing the result of PTX MMA into a destination pointer. For example, if each thread in a warp of size 32 has 4 elements from the result of m16xn8xk16 MMA in its registers, this intrinsic can be used to store the result in a 16x8 region in shared or global memory. More... | |
const Op & | mma_fill () |
tvm intrinsic for zero-initializing an MMA accumulation register. For example, if each thread in a warp of size 32 has 8 elements from the A matrix in m16xn8xk16 MMA in its registers, this intrinsic can be used to zero-initialize its 4 accumulation registers. More... | |
const Op & | make_filled_simdgroup_matrix () |
tvm intrinsic for initializing a simdgroup with a given value. More... | |
const Op & | simdgroup_load () |
tvm intrinsic for loading data from device memory or threadgroup memory to simdgroup. More... | |
const Op & | simdgroup_store () |
tvm intrinsic for storing data from simdgroup to device memory or threadgroup memory. More... | |
const Op & | simdgroup_multiply_accumulate () |
tvm intrinsic for multiply and accumulate two matrices in simdgroup More... | |
const Op & | vectorhigh () |
Get the high level half of the vector. More... | |
const Op & | vectorlow () |
Get the low half of the vector. More... | |
const Op & | vectorcombine () |
Concat two vectors. More... | |
const Op & | dp4a () |
Dot product of two int8x4 vectors and add an optional accumulator. More... | |
const Op & | atomic_add () |
atomic add instruction, corresponding e.g. to atomicAdd in CUDA More... | |
const Op & | nd_mem_alloc_with_scope () |
Create an Nd memory allocation with storage scope. More... | |
const Op & | texture2d_store () |
Store to texture 2d memory. More... | |
const Op & | texture2d_load () |
Load from texture 2d memory. More... | |
const Op & | dma_copy () |
Initiate a non-blocking DMA copy from source to destination. More... | |
const Op & | dma_wait () |
Wait until the number of DMA groups in flight is less than or equal to some maximum. More... | |
const Op & | dma_start_group () |
Start a group of DMA copies. More... | |
const Op & | dma_end_group () |
End a group of DMA copies. More... | |
const Op & | assume () |
Provide a true statement that can be used for simplifications. More... | |
const Op & | undef () |
Returns an initialized but arbitrary value. More... | |
const Op & | start_profile_intrinsic () |
Profiling intrinsic. More... | |
const Op & | end_profile_intrinsic () |
Profiling intrinsic. More... | |
const Op & | anylist_getitem () |
Get an item from any list and return it. More... | |
const Op & | anylist_resetitem () |
Reset and clear an item in any list. More... | |
const Op & | anylist_setitem_call_packed () |
Set an item into any list by running packed function call. More... | |
const Op & | anylist_setitem_call_cpacked () |
Same as anylist_setitem_call_packed but use C calling convention. More... | |
const Op & | vscale () |
Get the target's vscale value. It will be lowered to llvm.vscale intrinsic (https://llvm.org/docs/LangRef.html#llvm-vscale-intrinsic) More... | |
const Op & | get_active_lane_mask () |
Calculate a predicate mask given an upper bound (limit) and a current value (base). More... | |
Collection of builtin intrinsics as ops.
enum tvm::tir::builtin::TVMStructFieldKind : int |
const Op& tvm::tir::builtin::address_of | ( | ) |
Returns the address of an element in the buffer (see pseudocode below).
The number of indices should match the dimensionality of the buffer being accessed. If this operation occurs after buffer flattening, the number of indices must be supported by the target (i.e. N>1 only on targets that support non-flat memory buffers).
Handle address_of(BufferLoad *op) {
  return &op->buffer_var[op->indices[0], op->indices[1], ..., op->indices[N-1]];
}
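For illustration, a minimal C++ sketch of constructing this intrinsic as a TIR Call node; the buffer, its shape, and the indices are hypothetical.

#include <tvm/tir/buffer.h>
#include <tvm/tir/builtin.h>
#include <tvm/tir/expr.h>

using namespace tvm;
using namespace tvm::tir;

PrimExpr MakeAddressOf() {
  // A hypothetical 2-D float32 buffer and symbolic indices.
  Buffer buf = decl_buffer({128, 128}, DataType::Float(32), "A");
  Var i("i", DataType::Int(32)), j("j", DataType::Int(32));
  // address_of takes a BufferLoad and yields a handle (pointer) expression.
  return Call(DataType::Handle(), builtin::address_of(), {BufferLoad(buf, {i, j})});
}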
const Op& tvm::tir::builtin::anylist_getitem | ( | ) |
Get an item from any list and return it.
Any anylist_getitem(Handle anylist, int index) { return anylist[index]; }
const Op& tvm::tir::builtin::anylist_resetitem | ( | ) |
Reset and clear an item in any list.
void anylist_resetitem(Handle anylist, int index) { anylist[index] = nullptr; }
const Op& tvm::tir::builtin::anylist_setitem_call_cpacked | ( | ) |
Same as anylist_setitem_call_packed but use C calling convention.
const Op& tvm::tir::builtin::anylist_setitem_call_packed | ( | ) |
Set an item into any list by running packed function call.
void anylist_setitem_call_packed(Handle anylist, int index, name, *args) {
  anylist[index] = call_packed(name, *args);
}
const Op& tvm::tir::builtin::assume | ( | ) |
Provide a true statement that can be used for simplifications.
Compile-time representation of known constraints about function inputs. This assumption is removed when lowering, and does not occur in codegen.
const Op& tvm::tir::builtin::atomic_add | ( | ) |
atomic add instruction, corresponding e.g. to atomicAdd in CUDA
const Op& tvm::tir::builtin::bitwise_and | ( | ) |
Bitwise and operator.
const Op& tvm::tir::builtin::bitwise_not | ( | ) |
Bitwise not operator.
const Op& tvm::tir::builtin::bitwise_or | ( | ) |
Bitwise or operator.
const Op& tvm::tir::builtin::bitwise_xor | ( | ) |
Bitwise xor operator.
const Op& tvm::tir::builtin::call_extern | ( | ) |
Call an extern C function with given name and signature from the types of args in the runtime environment.
Type call_extern(name, args...) { return dlsym(name)(args...); }
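As an illustrative C++ sketch, such a call is built as a TIR Call whose first argument is the function name; "expf" here is just an example extern symbol.

#include <tvm/tir/builtin.h>
#include <tvm/tir/expr.h>

using namespace tvm;
using namespace tvm::tir;

PrimExpr MakeExternCall(PrimExpr x) {
  // Calls the C function "expf" on x; the result dtype must be given explicitly.
  return Call(DataType::Float(32), builtin::call_extern(), {StringImm("expf"), x});
}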
const Op& tvm::tir::builtin::call_llvm_intrin | ( | ) |
const Op& tvm::tir::builtin::call_llvm_pure_intrin | ( | ) |
const Op& tvm::tir::builtin::call_pure_extern | ( | ) |
Call a pure extern C function with given name and signature from the types of args in the runtime environment.
Type call_pure_extern(name, args...) { return dlsym(name)(args...); }
const Op& tvm::tir::builtin::call_spirv_pure_glsl450 | ( | ) |
const Op& tvm::tir::builtin::create_barriers | ( | ) |
tvm intrinsics to create N barriers
create_barriers(int barrier_count)
const Op& tvm::tir::builtin::dma_copy | ( | ) |
Initiate a non-blocking DMA copy from source to destination.
The copy is launched immediately.
If a dma_start_group() call is active, the copy will be added to the current group for tracking of in-flight group counts.
If no dma_start_group() call is active, the copy will be tracked individually, i.e. as a group with size 1.
const Op& tvm::tir::builtin::dma_end_group | ( | ) |
End a group of DMA copies.
Track all calls to dma_copy() that occurred since the preceding dma_start_group() as a single in-flight group.
Calling dma_end_group() without an active group is unsupported.
Note: A group of DMA calls may be empty, and will still contribute to the count of in-flight groups used by dma_wait().
const Op& tvm::tir::builtin::dma_start_group | ( | ) |
Start a group of DMA copies.
Any call to dma_copy() that occurs after dma_start_group() will be added to the current group for tracking of in-flight group counts.
Only one DMA group may be active at a given time. Calling dma_start_group() while a group is active is unsupported.
const Op& tvm::tir::builtin::dma_wait | ( | ) |
Wait until the number of DMA groups in flight is less than or equal to some maximum.
Calling dma_wait() while a group is active is unsupported.
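Taken together, the intended usage pattern of the four DMA intrinsics is roughly the following sketch; the argument lists shown are illustrative only, not the actual signatures.

dma_start_group(queue);            // open a group
dma_copy(queue, dst0, src0, n0);   // these copies join the active group
dma_copy(queue, dst1, src1, n1);
dma_end_group(queue);              // the copies above now count as one in-flight group
...
dma_wait(queue, max_in_flight);    // block until at most max_in_flight groups remain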
const Op& tvm::tir::builtin::dp4a | ( | ) |
Dot product of two int8x4 vectors and add an optional accumulator.
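A scalar reference sketch of the computation; the exact operand order of the intrinsic is not spelled out here and is assumed.

#include <cstdint>

int32_t dp4a_ref(const int8_t a[4], const int8_t b[4], int32_t acc) {
  // Pairwise products of the two int8x4 vectors, accumulated into a 32-bit sum.
  return acc + a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}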
const Op& tvm::tir::builtin::end_profile_intrinsic | ( | ) |
Profiling intrinsic.
const Op& tvm::tir::builtin::get_active_lane_mask | ( | ) |
Calculate a predicate mask given an upper bound (limit) and a current value (base).
It will be lowered to the llvm.get.active.lane.mask intrinsic. (https://llvm.org/docs/LangRef.html#llvm-get-active-lane-mask-intrinsics)
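Element-wise, the mask follows the LLVM intrinsic's semantics; a minimal reference sketch for a vector of lanes elements:

#include <cstdint>

void active_lane_mask_ref(uint64_t base, uint64_t limit, bool* mask, int lanes) {
  for (int i = 0; i < lanes; ++i) {
    // Lane i is active while base + i is still below the limit (unsigned compare).
    mask[i] = (base + static_cast<uint64_t>(i)) < limit;
  }
}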
const Op& tvm::tir::builtin::if_then_else | ( | ) |
Same as select, used for unsafe memory access.
Type tvm_if_then_else(cond, a, b) { return cond ? a : b; }
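In C++, the helper tvm::if_then_else builds this intrinsic. A common use is guarding a potentially out-of-bounds load; a minimal sketch in which the buffer, index, and bound are hypothetical:

#include <tvm/tir/buffer.h>
#include <tvm/tir/expr.h>
#include <tvm/tir/op.h>

using namespace tvm;
using namespace tvm::tir;

PrimExpr GuardedLoad(Buffer buf, Var i, PrimExpr n) {
  // Unlike select, the untaken branch does not need to be safe to evaluate.
  return if_then_else(i < n, BufferLoad(buf, {i}), make_zero(buf->dtype));
}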
const Op& tvm::tir::builtin::isnan | ( | ) |
Check if value is nan.
const Op& tvm::tir::builtin::isnullptr | ( | ) |
See pseudo code.
bool isnullptr(void* handle) { return handle == nullptr; }
const Op& tvm::tir::builtin::large_uint_imm | ( | ) |
See pseudo code.
Construct a big uint that may not be representable by int64.
Expr large_uint_imm(uint32_t v0, uint32_t v1) { return (v1 << 32) | v0; }
const Op& tvm::tir::builtin::likely | ( | ) |
Marks a condition as likely to happen.
const Op& tvm::tir::builtin::lookup_param | ( | ) |
See pseudo code Type lookup_param(String param_name) { return __tvm_param__param_name; }.
const Op& tvm::tir::builtin::make_filled_simdgroup_matrix | ( | ) |
const Op& tvm::tir::builtin::mma_fill | ( | ) |
tvm intrinsic for zero-initializing an MMA accumulation register. For example, if each thread in a warp of size 32 has 8 elements from the A matrix in m16xn8xk16 MMA in its registers, this intrinsic can be used to zero-initialize its 4 accumulation registers.
There is no real PTX instruction that does that, but we introduce this intrinsic for the same reason as mma_store above.
void mma_fill(IntImm local_size, Var local_ptr, Expr offset);
const Op& tvm::tir::builtin::mma_store | ( | ) |
tvm intrinsic for storing the result of PTX MMA into a destination pointer. For example, if each thread in a warp of size 32 has 4 elements from the result of m16xn8xk16 MMA in its registers, this intrinsic can be used to store the result in a 16x8 region in shared or global memory.
There is no real PTX instruction that does that, but we want to hide details of complex index manipulation behind this intrinsic to simplify TIR lowering passes (e.g. LowerWarpMemory).
void mma_store(IntImm m, IntImm n, Var dst_ptr, Var src_ptr, Expr src_offset, Var dst_stride);
const Op& tvm::tir::builtin::nd_mem_alloc_with_scope | ( | ) |
Create an Nd memory allocation with storage scope.
const Op& tvm::tir::builtin::popcount | ( | ) |
Popcount.
const Op& tvm::tir::builtin::ptx_arrive_barrier | ( | ) |
tvm intrinsics for ptx barrier arrival using mbarrier.arrive
ptx_arrive_barrier(int barrier_id)
const Op& tvm::tir::builtin::ptx_arrive_barrier_expect_tx | ( | ) |
tvm intrinsic for ptx barrier arrival with expect tx using mbarrier.arrive.expect_tx
ptx_arrive_barrier_expect_tx(int barrier_id, int byte_count)
const Op& tvm::tir::builtin::ptx_commit_group | ( | ) |
tvm intrinsics for ptx async copy commit and wait.
void ptx_commit_group(); void ptx_wait_group(int num);
const Op& tvm::tir::builtin::ptx_cp_async | ( | ) |
const Op& tvm::tir::builtin::ptx_cp_async_barrier | ( | ) |
tvm intrinsics for ptx async copy barrier using cp.async.mbarrier.arrive
ptx_cp_async_barrier(int barrier_id)
const Op& tvm::tir::builtin::ptx_cp_async_bulk | ( | ) |
const Op& tvm::tir::builtin::ptx_init_barrier_thread_count | ( | ) |
tvm intrinsics for ptx barrier initialization of thread count using mbarrier.init
ptx_init_barrier_thread_count(int barrier_id, int thread_count)
const Op & tvm::tir::builtin::ptx_ldg32 | ( | ) |
tvm intrinsic for ptx predicate load with 32-bit data type.
const Op& tvm::tir::builtin::ptx_ldmatrix | ( | ) |
const Op& tvm::tir::builtin::ptx_mma | ( | ) |
const Op& tvm::tir::builtin::ptx_mma_sp | ( | ) |
tvm intrinsic for sparse tensor core ptx instructions.
void ptx_mma_sp(StringImm shape, StringImm A_layout, StringImm B_layout,
                StringImm A_dtype, StringImm B_dtype, StringImm C_dtype,
                Var multiplicand_a, Expr a_index, Var multiplicand_b, Expr b_index,
                Var accumulator, Expr c_index, Var metadata, Expr meta_index,
                Var sparse_selector, bool saturate);
const Op& tvm::tir::builtin::ptx_wait_barrier | ( | ) |
tvm intrinsics for ptx barrier wait using mbarrier.try_wait
ptx_wait_barrier(int barrier_id)
const Op& tvm::tir::builtin::ptx_wait_group | ( | ) |
const Op& tvm::tir::builtin::q_multiply_shift | ( | ) |
Execute a multiplication between two Q-numbers x and y followed by a right shift s. The default rounding rule is to the nearest value, rounding half up (i.e., round(x.1) = x and round(x.5) = x+1).
const Op& tvm::tir::builtin::reinterpret | ( | ) |
Reinterpret the value using the target type.
const Op& tvm::tir::builtin::ret | ( | ) |
Return value.
const Op& tvm::tir::builtin::shift_left | ( | ) |
Left shift.
const Op& tvm::tir::builtin::shift_right | ( | ) |
Right shift.
const Op& tvm::tir::builtin::simdgroup_load | ( | ) |
tvm intrinsic for loading data from device memory or threadgroup memory to simdgroup.
void simdgroup_load(Var d, PrimExpr index, PrimExpr ptr, PrimExpr stride, int col = 8, int row = 8, bool transpose_matrix = false);
const Op& tvm::tir::builtin::simdgroup_multiply_accumulate | ( | ) |
tvm intrinsic for multiply and accumulate two matrices in simdgroup
void simdgroup_mma(Var d, PrimExpr index_d, Var a, PrimExpr index_a, Var b, PrimExpr index_b, Var c, PrimExpr index_c);
const Op& tvm::tir::builtin::simdgroup_store | ( | ) |
tvm intrinsic for storing data from simdgroup to device memory or threadgroup memory.
void simdgroup_store(Var d, PrimExpr index, PrimExpr ptr, PrimExpr stride, int col = 8, int row = 8, bool transpose_matrix = false);
const Op& tvm::tir::builtin::start_profile_intrinsic | ( | ) |
Profiling intrinsic.
const Op& tvm::tir::builtin::texture2d_load | ( | ) |
Load from texture 2d memory.
const Op& tvm::tir::builtin::texture2d_store | ( | ) |
Store to texture 2d memory.
const Op& tvm::tir::builtin::tvm_access_ptr | ( | ) |
Get head access address with memory access pattern info.
This operator also marks the range of the memory access. The offset and extent are in units of the DType (including the vectorization factor). rw_mask is a bit mask indicating whether the access is a read (1) or a write (2). The access is assumed to happen in the current expression.
PtrType tvm_access_ptr(Expr dtype, DType* data, int offset, int extent, int rw_mask) {
  // DType == dtype.type();
  return &data[offset];
}
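A minimal C++ sketch of constructing such an access pointer; it assumes the TypeAnnotation helper is used to produce the dtype placeholder argument, and the offset, extent, and mask values are hypothetical.

#include <tvm/tir/builtin.h>
#include <tvm/tir/expr.h>
#include <tvm/tir/op.h>

using namespace tvm;
using namespace tvm::tir;

PrimExpr MakeAccessPtr(Var data, DataType dtype, PrimExpr offset, PrimExpr extent) {
  const int kReadMask = 1;  // rw_mask: read(1), write(2)
  return Call(DataType::Handle(), builtin::tvm_access_ptr(),
              {TypeAnnotation(dtype), data, offset, extent,
               IntImm(DataType::Int(32), kReadMask)});
}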
const Op& tvm::tir::builtin::tvm_bmma_sync | ( | ) |
tvm intrinsic for tensor core bmma_sync operators.
void tvm_bmma_sync(Var fragment_d, Expr index_d, Var fragment_a, Expr index_a,
                   Var fragment_b, Expr index_b, Var fragment_c, Expr index_c) {
  nvcuda::wmma::bmma_sync(fragment_d[index_d], fragment_a[index_a],
                          fragment_b[index_b], fragment_c[index_c]);
}
const Op& tvm::tir::builtin::tvm_call_cpacked | ( | ) |
See pseudo code.
return_type tvm_call_cpacked(fname, TVMValue* args) {
  int ret_code;
  TVMValue ret_value;
  (*fname)(args, type_code_of(args), len(args), &ret_value, &ret_code);
  return cast(return_type, ret_value.v_return_type);
}
const Op& tvm::tir::builtin::tvm_call_cpacked_lowered | ( | ) |
Lowered version of call c-packed, the space of value and type codes are explicitly allocated.
int tvm_call_cpacked_lowered(fname, TVMValue* value_stack, int* tcode_stack,
                             int begin, int end) {
  fname(TVMArgs(value_stack[begin:end], tcode_stack[begin:end]),
        TVMRetValue(value_stack + end, tcode_stack + end));
}
const Op& tvm::tir::builtin::tvm_call_packed | ( | ) |
See pseudo code.
return_type tvm_call_packed(name, TVMValue* args) {
  TVMValue ret_value;
  int ret_code;
  ModuleNode* env = GetCurrentEnv();
  const PackedFunc* f = env->GetFuncFromEnv(name);
  (*f)(args, type_code_of(args), len(args), &ret_value, &ret_code);
  // return type can be int, float, handle.
  return cast(return_type, ret_value.v_return_type);
}
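At the TIR level (before lowering), such a call is expressed as a Call whose first argument is the function name; a minimal C++ sketch with a hypothetical packed function name:

#include <tvm/tir/builtin.h>
#include <tvm/tir/expr.h>

using namespace tvm;
using namespace tvm::tir;

PrimExpr MakePackedCall(PrimExpr arg0, PrimExpr arg1) {
  // "my_module.my_func" is a hypothetical packed function looked up from the module environment.
  return Call(DataType::Int(32), builtin::tvm_call_packed(),
              {StringImm("my_module.my_func"), arg0, arg1});
}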
const Op& tvm::tir::builtin::tvm_call_packed_lowered | ( | ) |
Lowered version of call packed, the space of value and type codes are explicitly allocated.
return_type tvm_call_packed_lowered(name, TVMValue* value_stack, int* tcode_stack,
                                    int begin, int end) {
  ModuleNode* env = GetCurrentEnv();
  const PackedFunc* f = env->GetFuncFromEnv(name);
  f->CallPacked(TVMArgs(value_stack[begin:end], tcode_stack[begin:end]),
                TVMRetValue(value_stack + end, tcode_stack + end));
  // return type can be int, float, handle.
  return cast(return_type, load_return_from(tcode_stack + end));
}
const Op& tvm::tir::builtin::tvm_call_trace_packed | ( | ) |
See pseudo code.
return_type tvm_call_trace_packed(name, TVMValue* args) {
  ModuleNode* env = GetCurrentEnv();
  const PackedFunc* f = env->GetFuncFromEnv(name);
  (*f)(args, type_code_of(args), len(args));
  // return type can be int, float, handle.
  return cast(return_type, ret_value.v_return_type);
}
const Op& tvm::tir::builtin::tvm_call_trace_packed_lowered | ( | ) |
Lowered version of trace intrinsic, the space of value and type codes are explicitly allocated. The return value is the (end - 1) value on the stack.
return_type tvm_call_trace_packed_lowered(name, TVMValue* value_stack, int* tcode_stack,
                                          int begin, int end) {
  ModuleNode* env = GetCurrentEnv();
  const PackedFunc* f = env->GetFuncFromEnv(name);
  f->CallPacked(TVMArgs(value_stack[begin:end], tcode_stack[begin:end]),
                TVMRetValue(value_stack + end, tcode_stack + end));
  // return type can be int, float, handle.
  return cast(return_type, load_return_from(tcode_stack + end));
}
const Op& tvm::tir::builtin::tvm_check_return | ( | ) |
Checks the return value of another call is correct or returns a given value.
Type tvm_check_return(expected, return_unexpected, nested_call) {
  if (nested_call() != expected) {
    return return_unexpected;
  }
}
const Op& tvm::tir::builtin::tvm_context_id | ( | ) |
Return a unique context id, used as a hint for workspace separation. Different context ids guarantee non-overlapping workspaces.
const Op& tvm::tir::builtin::tvm_fill_fragment | ( | ) |
tvm intrinsic for tensor core fill_fragment operators.
void tvm_fill_fragment(Var fragment, UIntImm m, UIntImm n, UIntImm k,
                       Expr index, Expr value) {
  // m, n, k are the shape of wmma fragment
  // fragments must be in 'wmma.accumulator' scope.
  nvcuda::wmma::fill_fragment(fragment[index], value);
}
const Op& tvm::tir::builtin::tvm_global_barrier_kinit | ( | ) |
Initialize the global barrier. Call this at the beginning of a kernel that needs the global barrier.
const Op& tvm::tir::builtin::tvm_load_matrix_sync | ( | ) |
tvm intrinsic for tensor core load operators.
void tvm_load_matrix_sync(Var fragment, UIntImm m, UIntImm n, UIntImm k,
                          Expr index, Expr buffer_ptr, Expr stride,
                          StringImm layout) {
  // m, n, k are the shape of wmma fragment.
  // Determine fragment layout (column-major or row-major) by layout.
  // fragments must be in 'wmma.matrix_a' or 'wmma.matrix_b' scope.
  nvcuda::wmma::load_matrix_sync(fragment[index], buffer_ptr, stride);
}
const Op& tvm::tir::builtin::tvm_mma_sync | ( | ) |
const Op& tvm::tir::builtin::tvm_stack_alloca | ( | ) |
See pseudo code.
dtype in {shape, array, arg_value, arg_tcode}
Handle tvm_stack_alloca(string dtype, int num) { return new on stack dtype[num]; }
const Op& tvm::tir::builtin::tvm_stack_make_array | ( | ) |
Allocate a NDArray(DLTensor) on stack, return the handle.
Type tvm_stack_make_array(Expr data, Expr shape, Expr strides, Expr ndim,
                          Expr dtype, Expr elem_offset) {
  ret = alloca stack DLTensor();
  ret->data = data;
  ret->shape = shape;
  ret->strides = strides != 0 ? strides : nullptr;
  ret->ndim = ndim;
  ret->dtype = dtype.type();
  ret->byte_offset = elem_offset * sizeof(dtype);
  return ret;
}
const Op& tvm::tir::builtin::tvm_stack_make_shape | ( | ) |
Allocate a shape tuple on stack, return the handle.
Handle tvm_stack_make_shape(list args) {
  ret = alloca stack int64_t[len(args)];
  for i in range(len(args)):
    ret[i] = args[i];
  return &ret[0];
}
const Op& tvm::tir::builtin::tvm_static_handle | ( | ) |
Create a function-local static handle that initializes to nullptr. Can be used to cache function-local static resources.
const Op& tvm::tir::builtin::tvm_storage_sync | ( | ) |
See pseudo code.
int tvm_storage_sync(std::string storage_scope) { __sync(storage_scope); return 0; }
const Op& tvm::tir::builtin::tvm_store_matrix_sync | ( | ) |
tvm intrinsic for tensor core store operators.
void tvm_store_matrix_sync(Var fragment, UIntImm m, UIntImm n, UIntImm k,
                           Expr index, Expr buffer_ptr, Expr stride,
                           StringImm layout) {
  // m, n, k are the shape of wmma fragment
  // fragments must be in 'wmma.accumulator' scope.
  nvcuda::wmma::store_matrix_sync(fragment[index], buffer_ptr, stride, layout);
}
const Op& tvm::tir::builtin::tvm_struct_get | ( | ) |
See pseudo code.
Type tvm_struct_get(StructType* arr, int index, int field_id) { return arr[index]->field; }
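field_id is one of the TVMStructFieldKind values listed above. A minimal C++ sketch that reads the data pointer out of a DLTensor handle; the handle variable is hypothetical.

#include <tvm/tir/builtin.h>
#include <tvm/tir/expr.h>

using namespace tvm;
using namespace tvm::tir;

PrimExpr LoadDataField(Var dltensor_handle) {
  // Reads the kArrData field (the data pointer) of element 0 of the DLTensor array.
  return Call(DataType::Handle(), builtin::tvm_struct_get(),
              {dltensor_handle, IntImm(DataType::Int(32), 0),
               IntImm(DataType::Int(32), builtin::kArrData)});
}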
const Op& tvm::tir::builtin::tvm_struct_set | ( | ) |
See pseudo code.
Handle tvm_struct_set(StructType* arr, int index, int field_id, value) { arr[index]->field = value; }
const Op& tvm::tir::builtin::tvm_thread_allreduce | ( | ) |
See pseudo code.
void tvm_thread_allreduce(UIntImm size, Expr source0, ..., Expr cond,
                          Var reduce_temp0, .., Var thread_idx1, ...) {
  // constrained such that the other thread_idx values remain the same.
  // reduce_temp is used to save intermediate result.
  reduce_temp0, ... = reduce(combiner, source0, ..., cond
                             over [thread_idx1, thread_idx2] passed by any caller)
}
const Op& tvm::tir::builtin::tvm_thread_context | ( | ) |
See pseudo code. Marks the content as thread-local context; can be optimized by calling it only once at thread start.
Do not allow nesting (getting a thread context from another).
Handle tvm_thread_context(Expr call) { return call; }
const Op& tvm::tir::builtin::tvm_thread_invariant | ( | ) |
Mark a condition to be thread invariant. This means the condition must be the same for all threads.
const Op& tvm::tir::builtin::tvm_throw_last_error | ( | ) |
See pseudo code.
void tvm_throw_last_error() { throw TVMGetLastError(); }
const Op& tvm::tir::builtin::tvm_tuple | ( | ) |
tvm_tuple is not an actual function and cannot be code-generated. It is used to represent tuple structure in the value field of AttrStmt, for the sake of giving hints to optimization.
Handle tvm_tuple(value0, value1, ..., value_n);
const Op& tvm::tir::builtin::tvm_warp_activemask | ( | ) |
const Op& tvm::tir::builtin::tvm_warp_shuffle | ( | ) |
See pseudo code.
Type tvm_warp_shuffle(mask, Type value, warp_id, width, warp_size) { return (value passed in by warp indicated by this_warp_id); }
Type tvm_warp_shuffle_up(mask, Type value, offset, width, warp_size) { return (value passed in by warp indicated by this_warp_id - offset); }
Type tvm_warp_shuffle_down(mask, Type value, offset, width, warp_size) { return (value passed in by warp indicated by this_warp_id + offset); }
unsigned tvm_warp_activemask() { return (32-bit mask of currently active threads in the calling warp); }
Parameter warp_id indicates the source thread ID in a warp.
Parameter offset indicates the relative distance to this_warp_id.
Parameter width indicates the number of threads involved in one shuffle. See CUDA document for __shfl_sync, __shfl_up_sync, __shfl_down_sync and __activemask.
Parameter warp_size is the size of a warp, which helps a backend to determine whether the width parameter is legal.
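A minimal C++ sketch that emits a shuffle reading a value from another lane; the width and warp_size constants (32) are illustrative.

#include <tvm/tir/builtin.h>
#include <tvm/tir/expr.h>

using namespace tvm;
using namespace tvm::tir;

PrimExpr ShuffleFromLane(PrimExpr value, PrimExpr src_lane) {
  // Mask of currently active threads in the warp.
  PrimExpr mask = Call(DataType::UInt(32), builtin::tvm_warp_activemask(), {});
  // Read `value` from the thread whose lane id is src_lane; width and warp_size are 32 here.
  return Call(value.dtype(), builtin::tvm_warp_shuffle(),
              {mask, value, src_lane, IntImm(DataType::Int(32), 32),
               IntImm(DataType::Int(32), 32)});
}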
const Op& tvm::tir::builtin::tvm_warp_shuffle_down | ( | ) |
const Op& tvm::tir::builtin::tvm_warp_shuffle_up | ( | ) |
const Op& tvm::tir::builtin::undef | ( | ) |
Returns an initialized but arbitrary value.
Compile-time representation of memory locations whose values may be altered as a result of optimizations.
const Op& tvm::tir::builtin::vectorcombine | ( | ) |
Concat two vectors.
const Op& tvm::tir::builtin::vectorhigh | ( | ) |
Get the high half of the vector.
const Op& tvm::tir::builtin::vectorlow | ( | ) |
Get the low half of the vector.
const Op& tvm::tir::builtin::vscale | ( | ) |
Get the target's vscale value. It will be lowered to llvm.vscale intrinsic (https://llvm.org/docs/LangRef.html#llvm-vscale-intrinsic)
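A brief C++ sketch; a scalable vector length is typically written as a multiple of vscale (the factor 4 is only an example).

#include <tvm/tir/builtin.h>
#include <tvm/tir/expr.h>
#include <tvm/tir/op.h>

using namespace tvm;
using namespace tvm::tir;

PrimExpr ScalableLanes() {
  // Number of lanes of a scalable vector: 4 x vscale.
  PrimExpr vs = Call(DataType::Int(32), builtin::vscale(), {});
  return vs * 4;
}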