copy_async

Asynchronous copy. Every variant emits only the issue instruction; the caller is responsible for completion: cp.async.commit_group / cp.async.wait_group for ldgsts, mbarrier arrive/wait for the bulk-tensor and dsmem paths, and tcgen05.commit / tcgen05.wait for the tensor-memory paths. The variant is selected by the source/destination memory pair and scope.

| Variant | Pair | Prio | Issue instruction |
|---|---|---|---|
| copy_async → ldgsts | global → shared | 20 | cp.async (LDGSTS), per-thread vectorized |
| copy_async → tma | global ↔ shared | 10 | cp.async.bulk.tensor (TMA, descriptor-driven, single-thread) |
| copy_async → dsmem | shared → shared (cross-CTA) | 10 | cp.async.bulk shared::cluster (mapa remote address) |
| copy_async → tcgen05_cp | shared → tmem | 10 | tcgen05.cp.32x128b.warpx4 (matrix-descriptor driven) |
| copy_async → tcgen05_ldst | tmem ↔ register | 10 | tcgen05.ld / tcgen05.st (warpgroup, atom-matched) |
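
To make the selection rule concrete, the table can be read as a lookup keyed by memory pair and scope. The following is a minimal host-side C++ sketch of that reading, not the actual dispatch code: `Mem`, `CopyVariant`, `kVariants`, and `candidates` are hypothetical names, and scope is reduced to a single `cross_cta` flag as a simplifying assumption. Two rows match global → shared, so the Prio column presumably breaks the tie; this section does not state which direction wins, so the sketch returns every match in table order and leaves the tie-break to the caller.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical names throughout: this models the table as data and is
// not the library's real selection logic.
enum class Mem { Global, Shared, Tmem, Register };

struct CopyVariant {
    const char* name;    // variant name from the table
    Mem a, b;            // the memory pair
    bool bidirectional;  // true for rows written with a two-way arrow
    bool cross_cta;      // assumption: scope reduced to one flag (dsmem row only)
    int prio;            // copied from the Prio column; ordering not interpreted here
    const char* issue;   // issue instruction family
};

static const CopyVariant kVariants[] = {
    {"ldgsts",       Mem::Global, Mem::Shared,   false, false, 20, "cp.async"},
    {"tma",          Mem::Global, Mem::Shared,   true,  false, 10, "cp.async.bulk.tensor"},
    {"dsmem",        Mem::Shared, Mem::Shared,   false, true,  10, "cp.async.bulk shared::cluster"},
    {"tcgen05_cp",   Mem::Shared, Mem::Tmem,     false, false, 10, "tcgen05.cp"},
    {"tcgen05_ldst", Mem::Tmem,   Mem::Register, true,  false, 10, "tcgen05.ld / tcgen05.st"},
};

// Returns the name of every variant whose pair and scope match, in table order.
inline std::vector<std::string> candidates(Mem src, Mem dst, bool cross_cta) {
    std::vector<std::string> out;
    const std::size_t n = sizeof(kVariants) / sizeof(kVariants[0]);
    for (std::size_t i = 0; i < n; ++i) {
        const CopyVariant& v = kVariants[i];
        const bool forward = (v.a == src && v.b == dst);
        const bool reverse = v.bidirectional && (v.a == dst && v.b == src);
        if ((forward || reverse) && v.cross_cta == cross_cta) {
            out.push_back(v.name);
        }
    }
    return out;
}
```

Under these assumptions, `candidates(Mem::Global, Mem::Shared, false)` yields both `ldgsts` and `tma`, while `candidates(Mem::Shared, Mem::Global, false)` yields only `tma`, because only the bulk-tensor row is bidirectional.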