copy_async
Asynchronous copy. Every variant emits only the issue instruction; the caller is
responsible for completion: cp.async commit/wait for ldgsts, mbarrier
arrive/wait for the bulk-tensor and dsmem paths, and tcgen05.commit /
tcgen05.wait for the tensor-memory paths. The variant is selected by the
source/destination memory pair and scope.
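
The split between issuing a copy and completing it can be illustrated with a small host-side model. This is only a sketch: `AsyncCopyQueue`, `issue`, `commit_group`, and `wait_group` are hypothetical names chosen to mirror the cp.async commit/wait pattern, not an API from this document or from CUDA. In the model, a copy becomes visible in the destination only after the caller waits on its group.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <deque>
#include <utility>
#include <vector>

// Host-side toy model (hypothetical, not a real GPU API): a copy is only
// recorded at issue time and is performed when the caller waits on its group.
struct PendingCopy {
    void* dst;
    const void* src;
    std::size_t bytes;
};

class AsyncCopyQueue {
public:
    // "Issue": record the copy; dst is not written yet.
    void issue(void* dst, const void* src, std::size_t bytes) {
        open_.push_back({dst, src, bytes});
    }

    // "Commit": close all copies issued so far into one group.
    void commit_group() {
        groups_.push_back(std::move(open_));
        open_.clear();
    }

    // "Wait": complete the oldest groups until at most max_pending remain.
    void wait_group(std::size_t max_pending) {
        while (groups_.size() > max_pending) {
            for (const PendingCopy& c : groups_.front()) {
                std::memcpy(c.dst, c.src, c.bytes);
            }
            groups_.pop_front();
        }
    }

    std::size_t pending_groups() const { return groups_.size(); }

private:
    std::vector<PendingCopy> open_;               // issued, not yet committed
    std::deque<std::vector<PendingCopy>> groups_; // committed, not yet waited on
};
```

In this model, calling `issue` and `commit_group` leaves the destination untouched; only `wait_group(0)` guarantees that every committed copy has landed. That is the obligation the paragraph above places on the caller.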

| Variant | Pair | Prio | Issue instruction |
|---|---|---|---|
| | global → shared | 20 | |
| | global ↔ shared | 10 | |
| | shared → shared (cross-CTA) | 10 | |
| | shared → tmem | 10 | |
| | tmem ↔ register | 10 | |
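
How a lookup over the table might work can be sketched as follows. This sketch makes several assumptions that the source does not state: a larger Prio value wins when more than one row matches, a "↔" row matches both directions, and scope is reduced to a single cross-CTA flag. The labels `row1` through `row5` are placeholders, because the variant names are not given in the table; `Mem`, `Variant`, `default_table`, and `select_variant` are likewise hypothetical.

```cpp
#include <cassert>
#include <string>
#include <vector>

enum class Mem { Global, Shared, Tmem, Register };

struct Variant {
    std::string label;   // placeholder; the real variant names are not in the table
    Mem src;
    Mem dst;
    bool bidirectional;  // true for "<->" rows
    bool cross_cta;      // scope, reduced to one flag for this sketch
    int prio;            // assumption: larger value wins
};

// The five rows of the table, in order.
std::vector<Variant> default_table() {
    return {
        {"row1", Mem::Global, Mem::Shared,   false, false, 20},
        {"row2", Mem::Global, Mem::Shared,   true,  false, 10},
        {"row3", Mem::Shared, Mem::Shared,   false, true,  10},
        {"row4", Mem::Shared, Mem::Tmem,     false, false, 10},
        {"row5", Mem::Tmem,   Mem::Register, true,  false, 10},
    };
}

// Returns the highest-priority row matching the pair and scope, or nullptr.
const Variant* select_variant(const std::vector<Variant>& table,
                              Mem src, Mem dst, bool cross_cta) {
    const Variant* best = nullptr;
    for (const Variant& v : table) {
        const bool forward = (v.src == src && v.dst == dst);
        const bool reverse = v.bidirectional && v.src == dst && v.dst == src;
        if (!(forward || reverse) || v.cross_cta != cross_cta) {
            continue;
        }
        if (best == nullptr || v.prio > best->prio) {
            best = &v;
        }
    }
    return best;
}
```

Under these assumptions, a global → shared copy matches both of the first two rows and resolves to the priority-20 row, while a shared → global copy can only match the bidirectional priority-10 row.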