copy_async → ldgsts#

The ldgsts variant lowers copy_async for a global → shared transfer to the PTX cp.async (LDGSTS) instruction: each thread issues an asynchronous vectorized copy that the hardware completes in the background, so the warp can keep computing while the load is in flight. It reuses the exact [outer, threads, vec] partition of the synchronous copy → gmem_smem variant; the differences are all in what is emitted and when it completes. Source: python/tvm/backend/cuda/tile_primitive/copy_async/ldgsts.py.

What it accepts#

_LDGSTS_PAIRS = [("global", "shared*")]      # cp.async is unidirectional
_LDGSTS_VEC_BITS = (128, 64, 32)             # cp_size ∈ {16, 8, 4} bytes

def _is_ldgsts(op_call, sctx):
    if not sctx.is_target("cuda"):
        return False, "non-cuda target"
    if sctx.scope_kind not in ("thread", "warp", "warpgroup", "cta"):
        return False, f"unsupported exec_scope {sctx.scope_kind}"
    for check in (
        lambda: _all_threads_active(sctx),
        lambda: _is_valid_copy(op_call, sctx),
        lambda: _scope_allowed(op_call, sctx, allowed_pairs=_LDGSTS_PAIRS),
        lambda: _divides_thread_cnt_ldgsts(op_call, sctx),
    ):
        ok, msg = check()
        if not ok:
            return False, msg
    return True, None

Property	Requirement
target / scope	`cuda`; `thread` / `warp` / `warpgroup` / `cta`, all active
direction	global → shared only (`_LDGSTS_PAIRS`) — `cp.async` is unidirectional
dtype / shape	`_is_valid_copy`; region element count divides the thread count
vector width	`cp.async` only accepts cp_size ∈ {4, 8, 16} bytes, so the candidate set is restricted to `_LDGSTS_VEC_BITS` = `{128, 64, 32}` bits
priority	`20` — selected for `copy_async` global→shared (vs the bulk/TMA variants)

Demonstration program#

A CTA (128 threads) asynchronously loads a 128×32 float16 tile global → shared, then commits and waits before reading it back (from test_ldgsts.py):

shape, dtype = (128, 32), "float16"
s_layout = TileLayout(S[shape]); full = (slice(0, 128), slice(0, 32))

@T.prim_func
def copy_async(A_ptr: T.handle, B_ptr: T.handle):
    A = T.match_buffer(A_ptr, shape, dtype)
    B = T.match_buffer(B_ptr, shape, dtype)
    T.device_entry(); T.cta_id([1]); T.warp_id([4]); T.lane_id([32]); tid = T.thread_id([128])
    A_smem = T.alloc_buffer(shape, dtype, scope="shared", layout=s_layout)
    Tx.cta.copy_async(A_smem[full], A[full], dispatch="ldgsts")   # async global -> shared
    T.ptx.cp_async.commit_group()                                # caller commits ...
    T.ptx.cp_async.wait_group()                                  # ... and waits
    T.cuda.cta_sync()
    Tx.cta.copy(B[full], A_smem[full])

Algorithm#

1. Same partition as copy → gmem_smem. align_layouts_gs builds the [outer, threads, vec] split with the global side driving the canonical order — but the vector candidates are clamped to {128, 64, 32} bits so the byte size is a legal cp.async cp_size. For 128×32 = 4096 float16 over 128 threads the widest legal width is vec = 8 (8 × 2 B = 16 B), giving outer = 4.

2. Emit cp.async instead of a synchronous copy, and do not sync:

for f in range(total_outer):
    s_lin = s_p.apply(f, tid, v0, shape=apply_shape)["m"]
    g_lin = g_p.apply(f, tid, v0, shape=apply_shape)["m"]
    s_ptr = _ptr_off(s_buf.ptr_to(s_zero), _s_off(f, s_lin))
    g_ptr = _ptr_off(g_buf.ptr_to(g_zero), g_lin)
    T.evaluate(T.ptx.cp_async(s_ptr, g_ptr, cp_size))   # async; cp_size = vec_bits // 8
# NO cta_sync — commit_group / wait_group / cta_sync are the caller's job

Completion is the caller’s responsibility (cp_async.commit_group() then cp_async.wait_group()); the dispatch only issues the in-flight loads.

Generated TIRx IR#

for f in range(4):                                       # outer = 4
    s_ptr = pointer_offset(A_smem, ...)
    g_ptr = pointer_offset(A_1, ...)
    T.ptx.cp_async(s_ptr, g_ptr, 16, T.uint64(0), 0, -1, -1, "")   # cp_size = 16 B

Generated CUDA#

// cp.async.cg copies 16 bytes shared <- global, asynchronously
tvm_builtin_ptx_cp_async_cg_16(s_ptr, g_ptr, /*cache_policy=*/0);   // x4 (outer = 4)
// ...
tvm_builtin_ptx_cp_async_commit_group();    // asm: cp.async.commit_group;
tvm_builtin_ptx_cp_async_wait_group_0();    // asm: cp.async.wait_group 0;

where the helper is asm volatile("cp.async.cg.shared.global [%0], [%1], 16;"). Each thread issues 4 asynchronous 16-byte copies; nothing blocks until the caller’s wait_group.

How inputs change the algorithm#

The dtype/alignment set vec (hence cp_size and outer), but unlike the synchronous variant the width is capped at 16 B (cp.async maximum):

case	`vec`	`cp_size`	`outer = 4096 / (128 · vec)`
`float16`, aligned	8	16 B	4
`float32`, aligned	4	16 B	8
8-B-aligned only	(clamped)	8 B	(doubles)
4-B-aligned only	(clamped)	4 B	(×4)

If the region can’t satisfy even a 4-byte (32-bit) cp_size, align_layouts_gs finds no candidate and the variant declines. The direction is fixed: a shared → global copy_async is never ldgsts (hardware has no store form).

copy_async → ldgsts

Contents