copy
Synchronous element copy src → dst between global, shared, and register
(local) memory. Four CUDA variants are registered: three specialized variants
at priority 10 and a scalar fallback at priority 0. gmem_smem (global ↔ shared)
takes a memory-scope pair disjoint from those of the two register variants.
reg and ldstmatrix both cover register ↔ shared; ldstmatrix is tried first and,
unless the layouts are ldmatrix fragments, declines so that dispatch falls
through to reg. The scalar fallback runs only if all three specialized variants
decline.
| Variant | Pair | Lowering |
|---|---|---|
| gmem_smem | global ↔ shared | synthesized |
| reg | register ↔ shared/global | partition induced by the register layout’s thread axes |
| ldstmatrix | register ↔ shared | warp-collective |
| scalar fallback | any | scalar single-thread copy (priority 0, catch-all) |
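
The selection order described above can be illustrated with a small model in plain Python. This is only a sketch, not the actual dispatcher: the `Variant` class, the `accepts` predicate, the `dispatch` function, and the variant label `scalar` are hypothetical names introduced for illustration. It also assumes that ties at priority 10 are broken by registration order, which is one way ldstmatrix could end up being tried before reg.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List

Scopes = FrozenSet[str]


@dataclass(frozen=True)
class Variant:
    """Hypothetical stand-in for a registered copy variant."""
    name: str
    priority: int
    # Returns True to accept the copy, False to decline it.
    accepts: Callable[[Scopes, bool], bool]


# Registration order matters here: this sketch ASSUMES that variants with
# equal priority are tried in the order they were registered.
VARIANTS: List[Variant] = [
    Variant("gmem_smem", 10,
            lambda pair, frag: pair == frozenset({"global", "shared"})),
    Variant("ldstmatrix", 10,
            lambda pair, frag: pair == frozenset({"register", "shared"}) and frag),
    Variant("reg", 10,
            lambda pair, frag: pair in (frozenset({"register", "shared"}),
                                        frozenset({"register", "global"}))),
    # Catch-all: accepts any pair, but only runs after every other variant declines.
    Variant("scalar", 0, lambda pair, frag: True),
]


def dispatch(src: str, dst: str, ldmatrix_fragment: bool = False) -> str:
    """Return the name of the first variant, by descending priority, that accepts."""
    pair = frozenset({src, dst})
    # sorted() is stable, so equal-priority variants keep registration order.
    for variant in sorted(VARIANTS, key=lambda v: v.priority, reverse=True):
        if variant.accepts(pair, ldmatrix_fragment):
            return variant.name
    raise LookupError(f"no variant accepts {src} -> {dst}")


if __name__ == "__main__":
    print(dispatch("global", "shared"))
    print(dispatch("shared", "register", ldmatrix_fragment=True))
    print(dispatch("shared", "register"))
    print(dispatch("shared", "shared"))
```

In this model a register ↔ shared copy reaches reg only after ldstmatrix has declined, and a pair that no specialized variant covers (for example shared → shared) lands on the scalar catch-all.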
Each variant has its own deep walkthrough — accepted input, algorithm, a runnable demo program, the dispatch’s TIRx IR, and the generated CUDA: