copy_async

Asynchronous copy. Every variant emits only the issue instruction; the caller is responsible for completion: `cp.async` commit/wait for ldgsts, mbarrier arrive/wait for the bulk-tensor (TMA) and dsmem paths, and `tcgen05.commit` / `tcgen05.wait` for the tensor-memory paths. The variant is selected by the source/destination memory pair and the scope.

Variant	Memory pair	Priority	Issue instruction
copy_async → ldgsts	global → shared	20	`cp.async` (LDGSTS), per-thread vectorized
copy_async → tma_auto / tma_explicit	global ↔ shared	10	`cp.async.bulk.tensor` (TMA, descriptor-driven, single-thread)
copy_async → dsmem	shared → shared (cross-CTA)	10	`cp.async.bulk` shared::cluster (`mapa` remote address)
copy_async → tcgen05_cp	shared → tmem	10	`tcgen05.cp.32x128b.warpx4` (matrix-descriptor driven)
copy_async → tcgen05_ldst	tmem ↔ register	10	`tcgen05.ld` / `tcgen05.st` (warpgroup, atom-matched)
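The table can be read as a small lookup keyed on memory pair and scope. The sketch below is illustrative only and is not the library's dispatcher: the `Variant` record, the `VARIANTS` list, and the `candidates` helper are hypothetical names, and the ordering rule (a larger priority value is tried first) is an assumption that the table itself does not state. The completion strings restate the pairing described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    name: str
    pairs: frozenset      # supported (source, destination) memory spaces
    cross_cta: bool       # True when the copy targets another CTA's shared memory
    prio: int             # value from the priority column
    issue: str            # instruction the variant emits
    completion: str       # what the caller must pair it with


# Rows transcribed from the table; "↔" is expanded into both directions.
VARIANTS = [
    Variant("ldgsts", frozenset({("global", "shared")}), False, 20,
            "cp.async", "cp.async commit/wait"),
    Variant("tma_auto / tma_explicit",
            frozenset({("global", "shared"), ("shared", "global")}), False, 10,
            "cp.async.bulk.tensor", "mbarrier arrive/wait"),
    Variant("dsmem", frozenset({("shared", "shared")}), True, 10,
            "cp.async.bulk shared::cluster", "mbarrier arrive/wait"),
    Variant("tcgen05_cp", frozenset({("shared", "tmem")}), False, 10,
            "tcgen05.cp.32x128b.warpx4", "tcgen05.commit / tcgen05.wait"),
    Variant("tcgen05_ldst",
            frozenset({("tmem", "register"), ("register", "tmem")}), False, 10,
            "tcgen05.ld / tcgen05.st", "tcgen05.commit / tcgen05.wait"),
]


def candidates(src, dst, cross_cta=False):
    """Return the variants matching the pair and scope.

    ASSUMPTION: a larger priority value is preferred, so results are
    sorted by descending priority.
    """
    matches = [v for v in VARIANTS
               if (src, dst) in v.pairs and v.cross_cta == cross_cta]
    return sorted(matches, key=lambda v: v.prio, reverse=True)


if __name__ == "__main__":
    for v in candidates("global", "shared"):
        print(f"{v.name}: issue {v.issue}, complete with {v.completion}")
```

Under that assumption, a global → shared copy lists ldgsts ahead of the TMA variants, and a pair that no row covers (for example register → global) yields an empty list. Whichever variant is chosen, the returned `completion` field is a reminder that the wait is still the caller's job.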

copy_async → ldgsts
copy_async → tma_auto / tma_explicit
copy_async → dsmem
copy_async → tcgen05_cp
copy_async → tcgen05_ldst