copy

Synchronous element copy (src → dst) between global, shared, and register (local) memory. Four CUDA variants are registered. gmem_smem handles global ↔ shared, a memory-scope pair disjoint from those of the two register variants. reg and ldstmatrix overlap on register ↔ shared: ldstmatrix is tried first and, unless the layouts are ldmatrix fragments, declines and leaves the copy to reg. These three variants all have priority 10; the scalar fallback (priority 0) runs only if all of them decline.
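
The selection rule above can be modeled in a few lines of Python. This is a minimal sketch, not the library's dispatcher: `CopyRequest`, `Variant`, `dispatch`, and the `ldmatrix_fragment` flag are hypothetical names introduced for illustration, and breaking the priority-10 tie by registration order is an assumption made so that ldstmatrix is consulted before reg.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List


@dataclass(frozen=True)
class CopyRequest:
    src_scope: str  # "global", "shared", or "register"
    dst_scope: str
    ldmatrix_fragment: bool = False  # hypothetical: layouts form an ldmatrix fragment


@dataclass(frozen=True)
class Variant:
    name: str
    priority: int
    accepts: Callable[[CopyRequest], bool]  # False means the variant declines


def scopes(req: CopyRequest) -> FrozenSet[str]:
    # Direction does not matter for variant selection, so use an unordered pair.
    return frozenset((req.src_scope, req.dst_scope))


# Assumption: equal-priority variants are tried in registration order.
VARIANTS: List[Variant] = [
    Variant("ldstmatrix", 10,
            lambda r: scopes(r) == {"register", "shared"} and r.ldmatrix_fragment),
    Variant("reg", 10,
            lambda r: "register" in scopes(r) and len(scopes(r)) == 2),
    Variant("gmem_smem", 10,
            lambda r: scopes(r) == {"global", "shared"}),
    Variant("fallback", 0, lambda r: True),  # catch-all, never declines
]


def dispatch(req: CopyRequest) -> str:
    # sorted() is stable, so registration order is kept within a priority level.
    for variant in sorted(VARIANTS, key=lambda v: -v.priority):
        if variant.accepts(req):
            return variant.name
    raise LookupError("no variant accepted the request")


print(dispatch(CopyRequest("global", "shared")))          # gmem_smem
print(dispatch(CopyRequest("shared", "register", True)))  # ldstmatrix
print(dispatch(CopyRequest("shared", "register")))        # reg
print(dispatch(CopyRequest("global", "global")))          # fallback
```

In this model the fallback is reached only when every priority-10 variant has declined, which matches the rule stated above.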

| Variant | Pair | Lowering |
| --- | --- | --- |
| copy → gmem_smem | global ↔ shared | synthesized `[outer, threads, vec]` partition, vectorized `copy_Nb` |
| copy → reg | register ↔ shared/global | partition induced by the register layout’s thread axes |
| copy → ldstmatrix | register ↔ shared | warp-collective `ldmatrix` / `stmatrix` (m8n8 fragments) |
| copy → fallback | any | scalar single-thread copy (priority 0, catch-all) |
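
To make the `[outer, threads, vec]` shape in the gmem_smem row concrete, here is a rough sketch of how such a partition might be synthesized for a flat, contiguous buffer. Everything in it is an assumption for illustration: the helper names `synthesize_partition` and `partition_index` are hypothetical, the vector width is capped at 16 bytes (the widest per-thread vector access in CUDA), and the index order (thread varying faster than outer) is simply chosen so that neighboring threads touch neighboring vectors. The real lowering is described in the gmem_smem walkthrough.

```python
from typing import Tuple


def synthesize_partition(num_elems: int, elem_bytes: int, threads: int,
                         max_vec_bytes: int = 16) -> Tuple[int, int, int]:
    """Return (outer, threads, vec) such that outer * threads * vec == num_elems."""
    vec_bytes = max_vec_bytes
    # Halve the vector width until the buffer tiles evenly or one element is left.
    while (vec_bytes > elem_bytes
           and num_elems % (threads * (vec_bytes // elem_bytes)) != 0):
        vec_bytes //= 2
    vec = max(vec_bytes // elem_bytes, 1)
    if num_elems % (threads * vec) != 0:
        raise ValueError("buffer does not tile evenly over the threads")
    return num_elems // (threads * vec), threads, vec


def partition_index(outer: int, tid: int, v: int, threads: int, vec: int) -> int:
    # Assumed ordering: each thread owns `vec` consecutive elements per outer step.
    return (outer * threads + tid) * vec + v


# 4096 two-byte elements copied by 128 threads.
outer, threads, vec = synthesize_partition(4096, elem_bytes=2, threads=128)
print(outer, threads, vec)  # 4 128 8
```

Under these assumptions each thread moves `vec` contiguous elements per outer iteration, so one iteration of the whole thread block covers `threads * vec` contiguous elements.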

Each variant has its own detailed walkthrough covering accepted input, the algorithm, a runnable demo program, the dispatch’s TIRx IR, and the generated CUDA: