copy_async → ldgsts
The ldgsts variant lowers copy_async for a global → shared transfer to
the PTX cp.async (LDGSTS) instruction: each thread issues an asynchronous
vectorized copy that the hardware completes in the background, so the warp can keep
computing while the load is in flight. It reuses the exact [outer, threads, vec]
partition of the synchronous copy → gmem_smem variant; the differences are
all in what is emitted and when it completes. Source:
python/tvm/backend/cuda/operator/tile_primitive/copy_async/ldgsts.py.
What it accepts
_LDGSTS_PAIRS = [("global", "shared*")] # cp.async is unidirectional
_LDGSTS_VEC_BITS = (128, 64, 32) # cp_size ∈ {16, 8, 4} bytes
def _is_ldgsts(op_call, sctx):
if not sctx.is_target("cuda"):
return False, "non-cuda target"
if sctx.scope_kind not in ("thread", "warp", "warpgroup", "cta"):
return False, f"unsupported exec_scope {sctx.scope_kind}"
for check in (
lambda: _all_threads_active(sctx),
lambda: _is_valid_copy(op_call, sctx),
lambda: _scope_allowed(op_call, sctx, allowed_pairs=_LDGSTS_PAIRS),
lambda: _divides_thread_cnt_ldgsts(op_call, sctx),
):
ok, msg = check()
if not ok:
return False, msg
return True, None
Property |
Requirement |
|---|---|
target / scope |
|
direction |
global → shared only ( |
dtype / shape |
|
vector width |
|
priority |
|
Demonstration program
A CTA (128 threads) asynchronously loads a 128×32 float16 tile global →
shared, then commits and waits before reading it back (from test_ldgsts.py):
shape, dtype = (128, 32), "float16"
s_layout = TileLayout(S[shape]); full = (slice(0, 128), slice(0, 32))
@T.prim_func
def copy_async(A_ptr: T.handle, B_ptr: T.handle):
A = T.match_buffer(A_ptr, shape, dtype)
B = T.match_buffer(B_ptr, shape, dtype)
T.device_entry(); T.cta_id([1]); T.warp_id([4]); T.lane_id([32]); tid = T.thread_id([128])
A_smem = T.alloc_buffer(shape, dtype, scope="shared", layout=s_layout)
Tx.cta.copy_async(A_smem[full], A[full], dispatch="ldgsts") # async global -> shared
T.ptx.cp_async.commit_group() # caller commits ...
T.ptx.cp_async.wait_group() # ... and waits
T.cuda.cta_sync()
Tx.cta.copy(B[full], A_smem[full])
Algorithm
1. Same partition as copy → gmem_smem. align_layouts_gs builds the
[outer, threads, vec] split with the global side driving the canonical order —
but the vector candidates are clamped to {128, 64, 32} bits so the byte size is
a legal cp.async cp_size. For 128×32 = 4096 float16 over 128 threads the
widest legal width is vec = 8 (8 × 2 B = 16 B), giving outer = 4.
2. Emit cp.async instead of a synchronous copy, and do not sync:
for f in range(total_outer):
s_lin = s_p.apply(f, tid, v0, shape=apply_shape)["m"]
g_lin = g_p.apply(f, tid, v0, shape=apply_shape)["m"]
s_ptr = _ptr_off(s_buf.ptr_to(s_zero), _s_off(f, s_lin))
g_ptr = _ptr_off(g_buf.ptr_to(g_zero), g_lin)
T.evaluate(T.ptx.cp_async(s_ptr, g_ptr, cp_size)) # async; cp_size = vec_bits // 8
# NO cta_sync — commit_group / wait_group / cta_sync are the caller's job
Completion is the caller’s responsibility (cp_async.commit_group() then
cp_async.wait_group()); the dispatch only issues the in-flight loads.
Generated TIRx IR
for f in range(4): # outer = 4
s_ptr = pointer_offset(A_smem, ...)
g_ptr = pointer_offset(A_1, ...)
T.ptx.cp_async(s_ptr, g_ptr, 16, T.uint64(0), 0, -1, -1, "") # cp_size = 16 B
Generated CUDA
// cp.async.cg copies 16 bytes shared <- global, asynchronously
tvm_builtin_ptx_cp_async_cg_16(s_ptr, g_ptr, /*cache_policy=*/0); // x4 (outer = 4)
// ...
tvm_builtin_ptx_cp_async_commit_group(); // asm: cp.async.commit_group;
tvm_builtin_ptx_cp_async_wait_group_0(); // asm: cp.async.wait_group 0;
where the helper is asm volatile("cp.async.cg.shared.global [%0], [%1], 16;").
Each thread issues 4 asynchronous 16-byte copies; nothing blocks until the caller’s
wait_group.
How inputs change the algorithm
The dtype/alignment set vec (hence cp_size and outer), but unlike the
synchronous variant the width is capped at 16 B (cp.async maximum):
case |
|
|
|
|---|---|---|---|
|
8 |
16 B |
4 |
|
4 |
16 B |
8 |
8-B-aligned only |
(clamped) |
8 B |
(doubles) |
4-B-aligned only |
(clamped) |
4 B |
(×4) |
If the region can’t satisfy even a 4-byte (32-bit) cp_size, align_layouts_gs
finds no candidate and the variant declines. The direction is fixed: a
shared → global copy_async is never ldgsts (hardware has no store form).