tvm
Functions
tvm::s_tir::transform Namespace Reference

Functions

Pass VerifyGPUCode (ffi::Map< ffi::String, PrimExpr > constraints)
 Pass to verify GPU code constraints. More...
 
Pass VerifyVTCMLimit (ffi::Optional< Target > default_target=std::nullopt)
 Pass to check if VTCM usage is within limit. More...
 
Pass OOBChecker ()
 Statically check TIR code for out of bounds array access. More...
 
Pass CanonicalizeLoop ()
 Canonicalize loop to start from zero . More...
 
Pass LowerCrossThreadReduction ()
 Lower cross-thread reduction from thread bindings to intrinsic function calls. More...
 
Pass LowerInitBlock ()
 Lower block init stmt into IfThenElse stmts. More...
 
Pass PlanAndUpdateBufferAllocationLocation ()
 Locate the buffer allocation to the exact position (usually is the lca of buffer access). This pass will inject opaque block with alloc_buffers at the allocation site. More...
 
Pass ConvertBlocksToOpaque ()
 Substitute all the block vars with the PrimExprs they are bound to, indicated by the corresponding iter_values in BlockRealize, for opaque blocks by removing all . the iter_values in BlockRealize and iter_vars in Block. More...
 
Pass LiftThreadBinding ()
 Lift the same thread bindings to their LCA loops. More...
 
Pass CompactBufferAllocation (bool is_strict=true)
 Compact the buffer access region by removing the buffer regions that are not accessed, i.e. narrowing the buffer shape and adjust the access region if necessary. More...
 
Pass LowerMatchBuffer ()
 Remove match buffers inside the block. Also, it will validate the binding. More...
 
Pass InjectPermutedLayout ()
 Inject permuted layout for shared memory. More...
 
Pass TransformMmaBufferLayout ()
 Transform Mma scope (m16n8k8.matrixA/B/C) to local scope with layout transformation. More...
 
Pass LowerOpaqueBlock ()
 Remove the block to ensure that the TIR can not be scheduled again. More...
 
Pass UnifyThreadBinding ()
 Unify all the thread bindings for "blockIdx.x/y/z", "threadIdx.x/y/z", and "vthread.x/y/z". Before the unification, two vars that are bound to a thread axis (e.g., "threadIdx.x") use different IterVars and variables in their AttrStmts. After the unification, we use a consolidated IterVar and a variable for them. More...
 
Pass InjectSoftwarePipeline ()
 This pass transforms annotated loops into pipelined ones where producers and consumers are overlapped with the information provided in loop annotations, which enables optimization techniques like prefetching and pipeline parallelism. More...
 
Pass LowerAutoCopy ()
 Automatically do memory optimizations for auto copy blocks. More...
 
Pass ManifestSharedMemoryLocalStage ()
 Add the explicit local stage for the shared memory access on GPU. More...
 
Pass AnnotateIrregularLoop ()
 Annotate irregular loop mark. More...
 
Pass LoopPartition ()
 partition loops in the stmt. More...
 
Pass InjectVirtualThread ()
 Inject virtual thread loops. More...
 
Pass InjectDoubleBuffer ()
 Inject double buffer statements. More...
 
Pass HoistIfThenElse (tvm::ffi::String variant="")
 Hoist loop-invariant IfThenElse nodes to outside the eligible loops. More...
 
Pass HoistExpression ()
 Hoist loop-invariant expressions to outside the eligible loops. More...
 
Pass RenormalizeSplitPattern ()
 Renormalize the split pattern from floordiv(floormod()) to floormod(floordiv()). More...
 
Pass RewriteUnsafeSelect ()
 Detect and rewrite unsafe select that contains memory access. More...
 
Pass InstrumentBoundCheckers ()
 Instruments bound checkers. More...
 
Pass InjectPTXLDG32 (bool enable_inject=true)
 Rewrite global to local memory copy on CUDA with ldg32 instruction. More...
 
Pass InstrumentProfileIntrinsics ()
 Insert intrinsic calls to instrument function and loop level profiling. More...
 
Pass LowerVtcmAlloc ()
 Lower VTCM allocations. More...
 
Pass ThreadSync (tvm::ffi::String storage_scope)
 Insert sync between parallel read/write of shared buffers. More...
 
Pass InferFragment ()
 Infer the TensorCore fragment information using tensor intrinsics. More...
 
Pass LowerThreadAllreduce ()
 Lower cross thread allreduce. More...
 
Pass LowerAsyncDMA ()
 Lower Async TIR primitives to DMA copy and wait builtins. More...
 
Pass InjectPTXAsyncCopy ()
 Rewrite global to shared memory copy on CUDA with asynchronous copy. More...
 
Pass MergeSharedMemoryAllocations ()
 Merge multiple TIR-level shared memory allocations into one. More...
 
Pass DefaultGPUSchedule ()
 Set default thread bindings for GPU PrimFuncs. More...
 
Pass RemoveWeightLayoutRewriteBlock (bool skip_tensor_rewrite=false)
 Remove weight layout rewrite block before benchmark. More...
 
Pass RemoveStoreUndef ()
 Remove stores of tirx::builtin::undef. More...
 
Pass DecorateDeviceScope ()
 Decorate all the function's body as device function. More...
 
Pass UseAssumeToReduceBranches ()
 Eliminate branches by leveraging buffer assumptions (T.assume). More...
 

Function Documentation

◆ AnnotateIrregularLoop()

Pass tvm::s_tir::transform::AnnotateIrregularLoop ( )

Annotate irregular loop mark.

◆ CanonicalizeLoop()

Pass tvm::s_tir::transform::CanonicalizeLoop ( )

Canonicalize loop to start from zero .

Returns
The pass.

◆ CompactBufferAllocation()

Pass tvm::s_tir::transform::CompactBufferAllocation ( bool  is_strict = true)

Compact the buffer access region by removing the buffer regions that are not accessed, i.e. narrowing the buffer shape and adjust the access region if necessary.

Before narrowing, B is a [16, 16] buffer, but only a skinny vector B[i, 0:16] is accessed.

for i in range(0, 16):
with T.sblock():
B = T.alloc_buffer(16, 16)
for j in range(0, 16):
B[i, j] = A[i, j] + 1
for j in range(0, 16):
C[i, j] = B[i, j] + 1

This pass narrows the buffer shape and adjust its accessed region accordingly. In this particular case, because only a 1 * 16 vector of B is accessed, the pass narrows B to shape [1, 16], and changes the access to B[i, j] to B[0, j].

for i in range(0, 16):
with T.sblock():
B = T.alloc_buffer(1, 16)
for j in range(0, 16):
B[0, j] = A[i, j] + 1
for j in range(0, 16):
C[i, j] = B[0, j] + 1
Parameters
is_strictensure the compacted shape always smaller than the original shape. otherwise it allows to grow the shape to match actual accessed buffer regions.
Returns
The pass.

◆ ConvertBlocksToOpaque()

Pass tvm::s_tir::transform::ConvertBlocksToOpaque ( )

Substitute all the block vars with the PrimExprs they are bound to, indicated by the corresponding iter_values in BlockRealize, for opaque blocks by removing all . the iter_values in BlockRealize and iter_vars in Block.

Returns
The pass.

◆ DecorateDeviceScope()

Pass tvm::s_tir::transform::DecorateDeviceScope ( )

Decorate all the function's body as device function.

Returns
The pass.

◆ DefaultGPUSchedule()

Pass tvm::s_tir::transform::DefaultGPUSchedule ( )

Set default thread bindings for GPU PrimFuncs.

Returns
The pass.

◆ HoistExpression()

Pass tvm::s_tir::transform::HoistExpression ( )

Hoist loop-invariant expressions to outside the eligible loops.

Can hoist conditionals used in IfThenElse statements and expressions, bindings of variables in Let statements and expressions, or boolean expressions, configurable to enable/disable each hoistable type.

Returns
The pass.

◆ HoistIfThenElse()

Pass tvm::s_tir::transform::HoistIfThenElse ( tvm::ffi::String  variant = "")

Hoist loop-invariant IfThenElse nodes to outside the eligible loops.

Parameters
variantThe variant of the pass. variant can have any one of following values ["basic", ""(Default)].
Returns
The pass.

◆ InferFragment()

Pass tvm::s_tir::transform::InferFragment ( )

Infer the TensorCore fragment information using tensor intrinsics.

Returns
The pass.

◆ InjectDoubleBuffer()

Pass tvm::s_tir::transform::InjectDoubleBuffer ( )

Inject double buffer statements.

Returns
The pass.

◆ InjectPermutedLayout()

Pass tvm::s_tir::transform::InjectPermutedLayout ( )

Inject permuted layout for shared memory.

Returns
The pass.

◆ InjectPTXAsyncCopy()

Pass tvm::s_tir::transform::InjectPTXAsyncCopy ( )

Rewrite global to shared memory copy on CUDA with asynchronous copy.

Returns
The pass.

◆ InjectPTXLDG32()

Pass tvm::s_tir::transform::InjectPTXLDG32 ( bool  enable_inject = true)

Rewrite global to local memory copy on CUDA with ldg32 instruction.

Parameters
enable_injectWhether to enable injection.
Returns
The pass.

◆ InjectSoftwarePipeline()

Pass tvm::s_tir::transform::InjectSoftwarePipeline ( )

This pass transforms annotated loops into pipelined ones where producers and consumers are overlapped with the information provided in loop annotations, which enables optimization techniques like prefetching and pipeline parallelism.

The pipeline scope consists of the direct children of the annotated loop (ignoring SBlockRealize, SBlock, SeqStmt), and the number of children is denoted by n in the documentation.

The following annotations are used to guide the loop transformation:

1) Loop annotation software_pipeline_stage defines the pipeline stage. An array of n integers, and each element should be in range [0, max_stage], where max_stage is the maximum (inclusive) stage. 2) Loop annotation software_pipeline_order defines the pipeline order. An array of n integers, a permutation of [0, 1, ..., num_components - 1]; 3) SBlock annotation double_buffer_scope controls certain buffer sizes to allow decoupling of read/write dependency. It's an integer index of the write regions of the block.

Every annotated loop is transformed into a loop with three blocks as its direct children:

1) Prologue block, where components whose stage is less than max_stage is executed;

2) Body block, where all the components are executed;

3) Epilogue block, where only components whose stage is greater than 0 will be executed. The execution order is controlled by the annotation software_pipeline_order, and thus could be different than the original order.

Note: For nested software pipelines, the inner software pipeline will be generated first, which may affect the number of the direct children of the outer loop. In this case, the annotations for the outer software pipeline should include the result of the inner software pipeline, which is the three blocks as discussed above.

Returns
The IR transform pass.

◆ InjectVirtualThread()

Pass tvm::s_tir::transform::InjectVirtualThread ( )

Inject virtual thread loops.

Returns
The pass.

◆ InstrumentBoundCheckers()

Pass tvm::s_tir::transform::InstrumentBoundCheckers ( )

Instruments bound checkers.

Returns
The pass.

◆ InstrumentProfileIntrinsics()

Pass tvm::s_tir::transform::InstrumentProfileIntrinsics ( )

Insert intrinsic calls to instrument function and loop level profiling.

Returns
The pass.

◆ LiftThreadBinding()

Pass tvm::s_tir::transform::LiftThreadBinding ( )

Lift the same thread bindings to their LCA loops.

Returns
The pass.

◆ LoopPartition()

Pass tvm::s_tir::transform::LoopPartition ( )

partition loops in the stmt.

Returns
The pass.

◆ LowerAsyncDMA()

Pass tvm::s_tir::transform::LowerAsyncDMA ( )

Lower Async TIR primitives to DMA copy and wait builtins.

Returns
The pass.

◆ LowerAutoCopy()

Pass tvm::s_tir::transform::LowerAutoCopy ( )

Automatically do memory optimizations for auto copy blocks.

Returns
The pass.

◆ LowerCrossThreadReduction()

Pass tvm::s_tir::transform::LowerCrossThreadReduction ( )

Lower cross-thread reduction from thread bindings to intrinsic function calls.

Returns
The pass.

◆ LowerInitBlock()

Pass tvm::s_tir::transform::LowerInitBlock ( )

Lower block init stmt into IfThenElse stmts.

Returns
The pass.

◆ LowerMatchBuffer()

Pass tvm::s_tir::transform::LowerMatchBuffer ( )

Remove match buffers inside the block. Also, it will validate the binding.

Returns
The pass.

◆ LowerOpaqueBlock()

Pass tvm::s_tir::transform::LowerOpaqueBlock ( )

Remove the block to ensure that the TIR can not be scheduled again.

Returns
The pass.

◆ LowerThreadAllreduce()

Pass tvm::s_tir::transform::LowerThreadAllreduce ( )

Lower cross thread allreduce.

Returns
The pass.

◆ LowerVtcmAlloc()

Pass tvm::s_tir::transform::LowerVtcmAlloc ( )

Lower VTCM allocations.

Returns
The pass.

◆ ManifestSharedMemoryLocalStage()

Pass tvm::s_tir::transform::ManifestSharedMemoryLocalStage ( )

Add the explicit local stage for the shared memory access on GPU.

Returns
The pass.

◆ MergeSharedMemoryAllocations()

Pass tvm::s_tir::transform::MergeSharedMemoryAllocations ( )

Merge multiple TIR-level shared memory allocations into one.

Returns
The pass.

◆ OOBChecker()

Pass tvm::s_tir::transform::OOBChecker ( )

Statically check TIR code for out of bounds array access.

Returns
The pass.

◆ PlanAndUpdateBufferAllocationLocation()

Pass tvm::s_tir::transform::PlanAndUpdateBufferAllocationLocation ( )

Locate the buffer allocation to the exact position (usually is the lca of buffer access). This pass will inject opaque block with alloc_buffers at the allocation site.

Returns
The pass.

◆ RemoveStoreUndef()

Pass tvm::s_tir::transform::RemoveStoreUndef ( )

Remove stores of tirx::builtin::undef.

Returns
The pass.

◆ RemoveWeightLayoutRewriteBlock()

Pass tvm::s_tir::transform::RemoveWeightLayoutRewriteBlock ( bool  skip_tensor_rewrite = false)

Remove weight layout rewrite block before benchmark.

Parameters
skip_tensor_rewriteWhether to skip tensor rewrite.
Returns
The pass.

◆ RenormalizeSplitPattern()

Pass tvm::s_tir::transform::RenormalizeSplitPattern ( )

Renormalize the split pattern from floordiv(floormod()) to floormod(floordiv()).

Returns
The pass.

◆ RewriteUnsafeSelect()

Pass tvm::s_tir::transform::RewriteUnsafeSelect ( )

Detect and rewrite unsafe select that contains memory access.

Returns
The pass.

◆ ThreadSync()

Pass tvm::s_tir::transform::ThreadSync ( tvm::ffi::String  storage_scope)

Insert sync between parallel read/write of shared buffers.

Parameters
storage_scopeThe storage scope considered.
Returns
The pass.

◆ TransformMmaBufferLayout()

Pass tvm::s_tir::transform::TransformMmaBufferLayout ( )

Transform Mma scope (m16n8k8.matrixA/B/C) to local scope with layout transformation.

Returns
The pass.

◆ UnifyThreadBinding()

Pass tvm::s_tir::transform::UnifyThreadBinding ( )

Unify all the thread bindings for "blockIdx.x/y/z", "threadIdx.x/y/z", and "vthread.x/y/z". Before the unification, two vars that are bound to a thread axis (e.g., "threadIdx.x") use different IterVars and variables in their AttrStmts. After the unification, we use a consolidated IterVar and a variable for them.

Returns
The pass.
Note
vthread is a legacy behavior that will be deprecated, though thread bindings of vthread are still also unified in this pass. Please use vthread.x, vthread.y and vthread.z instead.

◆ UseAssumeToReduceBranches()

Pass tvm::s_tir::transform::UseAssumeToReduceBranches ( )

Eliminate branches by leveraging buffer assumptions (T.assume).

Returns
The pass.

◆ VerifyGPUCode()

Pass tvm::s_tir::transform::VerifyGPUCode ( ffi::Map< ffi::String, PrimExpr constraints)

Pass to verify GPU code constraints.

Parameters
constraintsThe dict to specify constraints.
Returns
The pass.

◆ VerifyVTCMLimit()

Pass tvm::s_tir::transform::VerifyVTCMLimit ( ffi::Optional< Target default_target = std::nullopt)

Pass to check if VTCM usage is within limit.

Parameters
default_targetThe default target for functions without target attribute.
Returns
The pass.