|
tvm
|
Functions | |
| Pass | CanonicalizeLoop () |
| Canonicalize loop to start from zero . More... | |
| Pass | LowerCrossThreadReduction () |
| Lower cross-thread reduction from thread bindings to intrinsic function calls. More... | |
| Pass | LowerInitBlock () |
| Lower block init stmt into IfThenElse stmts. More... | |
| Pass | PlanAndUpdateBufferAllocationLocation () |
| Locate the buffer allocation to the exact position (usually is the lca of buffer access). This pass will inject opaque block with alloc_buffers at the allocation site. More... | |
| Pass | ConvertBlocksToOpaque () |
| Substitute all the block vars with the PrimExprs they are bound to, indicated by the corresponding iter_values in BlockRealize, for opaque blocks by removing all . the iter_values in BlockRealize and iter_vars in Block. More... | |
| Pass | LiftThreadBinding () |
| Lift the same thread bindings to their LCA loops. More... | |
| Pass | CompactBufferAllocation (bool is_strict=true) |
| Compact the buffer access region by removing the buffer regions that are not accessed, i.e. narrowing the buffer shape and adjust the access region if necessary. More... | |
| Pass | LowerMatchBuffer () |
| Remove match buffers inside the block. Also, it will validate the binding. More... | |
| Pass | InjectPermutedLayout () |
| Inject permuted layout for shared memory. More... | |
| Pass | TransformMmaBufferLayout () |
| Transform Mma scope (m16n8k8.matrixA/B/C) to local scope with layout transformation. More... | |
| Pass | LowerOpaqueBlock () |
| Remove the block to ensure that the TIR can not be scheduled again. More... | |
| Pass | UnifyThreadBinding () |
| Unify all the thread bindings for "blockIdx.x/y/z", "threadIdx.x/y/z", and "vthread.x/y/z". Before the unification, two vars that are bound to a thread axis (e.g., "threadIdx.x") use different IterVars and variables in their AttrStmts. After the unification, we use a consolidated IterVar and a variable for them. More... | |
| Pass | InjectSoftwarePipeline () |
| This pass transforms annotated loops into pipelined ones where producers and consumers are overlapped with the information provided in loop annotations, which enables optimization techniques like prefetching and pipeline parallelism. More... | |
| Pass | LowerAutoCopy () |
| Automatically do memory optimizations for auto copy blocks. More... | |
| Pass | ManifestSharedMemoryLocalStage () |
| Add the explicit local stage for the shared memory access on GPU. More... | |
| Pass | AnnotateIrregularLoop () |
| Annotate irregular loop mark. More... | |
| Pass | LoopPartition () |
| partition loops in the stmt. More... | |
| Pass | InjectVirtualThread () |
| Inject virtual thread loops. More... | |
| Pass | InjectDoubleBuffer () |
| Inject double buffer statements. More... | |
| Pass tvm::s_tir::transform::AnnotateIrregularLoop | ( | ) |
Annotate irregular loop mark.
| Pass tvm::s_tir::transform::CanonicalizeLoop | ( | ) |
Canonicalize loop to start from zero .
| Pass tvm::s_tir::transform::CompactBufferAllocation | ( | bool | is_strict = true | ) |
Compact the buffer access region by removing the buffer regions that are not accessed, i.e. narrowing the buffer shape and adjust the access region if necessary.
Before narrowing, B is a [16, 16] buffer, but only a skinny vector B[i, 0:16] is accessed.
This pass narrows the buffer shape and adjust its accessed region accordingly. In this particular case, because only a 1 * 16 vector of B is accessed, the pass narrows B to shape [1, 16], and changes the access to B[i, j] to B[0, j].
| is_strict | ensure the compacted shape always smaller than the original shape. otherwise it allows to grow the shape to match actual accessed buffer regions. |
| Pass tvm::s_tir::transform::ConvertBlocksToOpaque | ( | ) |
Substitute all the block vars with the PrimExprs they are bound to, indicated by the corresponding iter_values in BlockRealize, for opaque blocks by removing all . the iter_values in BlockRealize and iter_vars in Block.
| Pass tvm::s_tir::transform::InjectDoubleBuffer | ( | ) |
Inject double buffer statements.
| Pass tvm::s_tir::transform::InjectPermutedLayout | ( | ) |
Inject permuted layout for shared memory.
| Pass tvm::s_tir::transform::InjectSoftwarePipeline | ( | ) |
This pass transforms annotated loops into pipelined ones where producers and consumers are overlapped with the information provided in loop annotations, which enables optimization techniques like prefetching and pipeline parallelism.
The pipeline scope consists of the direct children of the annotated loop (ignoring SBlockRealize, SBlock, SeqStmt), and the number of children is denoted by n in the documentation.
The following annotations are used to guide the loop transformation:
1) Loop annotation software_pipeline_stage defines the pipeline stage. An array of n integers, and each element should be in range [0, max_stage], where max_stage is the maximum (inclusive) stage. 2) Loop annotation software_pipeline_order defines the pipeline order. An array of n integers, a permutation of [0, 1, ..., num_components - 1]; 3) SBlock annotation double_buffer_scope controls certain buffer sizes to allow decoupling of read/write dependency. It's an integer index of the write regions of the block.
Every annotated loop is transformed into a loop with three blocks as its direct children:
1) Prologue block, where components whose stage is less than max_stage is executed;
2) Body block, where all the components are executed;
3) Epilogue block, where only components whose stage is greater than 0 will be executed. The execution order is controlled by the annotation software_pipeline_order, and thus could be different than the original order.
Note: For nested software pipelines, the inner software pipeline will be generated first, which may affect the number of the direct children of the outer loop. In this case, the annotations for the outer software pipeline should include the result of the inner software pipeline, which is the three blocks as discussed above.
| Pass tvm::s_tir::transform::InjectVirtualThread | ( | ) |
Inject virtual thread loops.
| Pass tvm::s_tir::transform::LiftThreadBinding | ( | ) |
Lift the same thread bindings to their LCA loops.
| Pass tvm::s_tir::transform::LoopPartition | ( | ) |
partition loops in the stmt.
| Pass tvm::s_tir::transform::LowerAutoCopy | ( | ) |
Automatically do memory optimizations for auto copy blocks.
| Pass tvm::s_tir::transform::LowerCrossThreadReduction | ( | ) |
Lower cross-thread reduction from thread bindings to intrinsic function calls.
| Pass tvm::s_tir::transform::LowerInitBlock | ( | ) |
Lower block init stmt into IfThenElse stmts.
| Pass tvm::s_tir::transform::LowerMatchBuffer | ( | ) |
Remove match buffers inside the block. Also, it will validate the binding.
| Pass tvm::s_tir::transform::LowerOpaqueBlock | ( | ) |
Remove the block to ensure that the TIR can not be scheduled again.
| Pass tvm::s_tir::transform::ManifestSharedMemoryLocalStage | ( | ) |
Add the explicit local stage for the shared memory access on GPU.
| Pass tvm::s_tir::transform::PlanAndUpdateBufferAllocationLocation | ( | ) |
Locate the buffer allocation to the exact position (usually is the lca of buffer access). This pass will inject opaque block with alloc_buffers at the allocation site.
| Pass tvm::s_tir::transform::TransformMmaBufferLayout | ( | ) |
Transform Mma scope (m16n8k8.matrixA/B/C) to local scope with layout transformation.
| Pass tvm::s_tir::transform::UnifyThreadBinding | ( | ) |
Unify all the thread bindings for "blockIdx.x/y/z", "threadIdx.x/y/z", and "vthread.x/y/z". Before the unification, two vars that are bound to a thread axis (e.g., "threadIdx.x") use different IterVars and variables in their AttrStmts. After the unification, we use a consolidated IterVar and a variable for them.
vthread is a legacy behavior that will be deprecated, though thread bindings of vthread are still also unified in this pass. Please use vthread.x, vthread.y and vthread.z instead.