copy_async → tcgen05_cp
The tcgen05_cp variant lowers a copy_async from shared memory to tensor
memory (Blackwell tmem). One elected thread issues
tcgen05.cp.32x128b.warpx4: a shared matrix descriptor names the source tile,
and the warpx4 multicast routes 32 lanes × 128 bits into the tensor-memory lanes
owned by all four warps. The dispatch issues only the copy; the caller signals
completion with tcgen05.commit. Source:
python/tvm/backend/cuda/operator/tile_primitive/copy_async/tcgen05_cp.py.
What it accepts
Two predicates — a valid shared→tmem copy and a single-thread scope:
# register_dispatch(..., variant="smem->tmem", priority=10, when=[
predicate("validate_smem_tmem_copy", _is_valid_smem_tmem_copy),
predicate("exec_scope", _single_thread_exec), # exec_scope == "thread"
# ])
def _is_valid_smem_tmem_copy(op, sctx):
if not (src.scope().startswith("shared") and dst.scope() == "tmem"): ...
if not (src.layout and dst.layout): ...
if dst.allocated_addr is None: ...
rep = dst.layout.replica # the warpx4 router
if not (len(rep) == 1 and int(rep[0].extent) == 4
and int(rep[0].stride) == 32 and "TLane" in str(rep[0].axis)):
return False, f"requires R[4:32@TLane] on tmem, got {list(rep)}"
return True, None
Property |
Requirement |
|---|---|
target / priority |
|
scope |
single thread issues the copy |
memory pair |
source |
tmem layout |
the replica must be exactly |
dtype |
sets |
Demonstration program
A warpgroup allocates 16 tmem columns, fills a 32×16 uint8 shared tile, and
copies it into tmem with tcgen05_cp (from test_smem_tmem.py; the readback /
dealloc tail is elided):
from tvm.tirx.layout import R, S, TCol, TileLayout, TLane
A_smem = T.alloc_buffer([32, 16], "uint8", scope="shared",
layout=TileLayout(S[(32, 16) : (16, 1)]), align=1024)
tmem_addr = T.alloc_shared([1], "uint32")
cp_mbar = T.alloc_shared([1], "uint64")
if warp_id == 0:
T.ptx.tcgen05.alloc(T.address_of(tmem_addr), n_cols=16, cta_group=1)
# ... mbarrier.init, fence, cta_sync, fill A_smem from global ...
tmem = T.decl_buffer([32, 16], "uint8", scope="tmem", allocated_addr=tmem_addr[0],
layout=TileLayout(S[(32, 16) : (1 @ TLane, 1 @ TCol)] + R[4 : 32 @ TLane]))
if tid_in_wg == 0:
Tx.copy_async(tmem[0:32, 0:16], A_smem[0:32, 0:16], cta_group=1) # smem -> tmem
T.ptx.tcgen05.commit(cp_mbar.ptr_to([0]), cta_group=1) # caller signals
T.ptx.mbarrier.try_wait(cp_mbar.ptr_to([0]), 0)
# ... readback via tcgen05.ld, then tcgen05.dealloc ...
(The copy_async is auto-dispatched to the smem->tmem variant — the source is
shared and the destination is the R[4:32@TLane] tmem buffer.)
Algorithm
1. Verify the warpx4 router and re-order. After slicing both layouts to the
region, the dispatch confirms the tmem replica is R[4:32@TLane], permutes to
TLane-first / TCol-stride-descending, isolates the broadcast, and groups the
remaining iters into (32, middle, elem_per_128b) — the 32×128-bit atom plus a
list of middle tiles to loop over.
2. Encode the matrix descriptor once. A 64-bit shared descriptor (leading-dim
offset ldo, stride-dim offset sdo, swizzle mode) is encoded right after the
shared buffer is allocated, cached per (smem_buf, ldo, sdo, swizzle):
desc_buf = decl_buffer((1,), "uint64", scope="local")
T.ptx.tcgen05.encode_matrix_descriptor(desc_buf.data, s_buf.ptr_to([0, 0]), ldo, sdo, swizzle)
3. Issue the copy — one tcgen05.cp for a single atom, or an unrolled loop
that bumps the tmem column offset and the descriptor’s 16-byte shared offset per
middle tile:
if total == 1:
T.ptx.tcgen05.cp(t_addr[0] + t_col0,
smem_desc_add_16B_offset(desc_buf[0], init_off_16B),
shape="32x128b", cta_group=cta_group, multicast="warpx4")
else:
for flat in T.unroll(total):
t_off, s_off = T.meta_var(compute_offsets(flat))
T.ptx.tcgen05.cp(t_addr[0] + t_col0 + t_off,
smem_desc_add_16B_offset(desc_buf[0], init_off_16B + s_off),
shape="32x128b", cta_group=cta_group, multicast="warpx4")
The dispatch emits no tcgen05.commit / wait — the caller commits against
an mbarrier (as in the demo).
Generated TIRx IR
The 32×16 uint8 tile is a single atom (ldo=16, sdo=8, swizzle=0):
T.ptx.tcgen05.encode_matrix_descriptor(cp_desc.data, T.address_of(A_smem[0]), 16, 8, 0)
T.ptx.tcgen05.cp(tmem_addr[0],
smem_desc_add_16B_offset(cp_desc[0], 0),
shape="32x128b", cta_group=1, multicast="warpx4")
Generated CUDA
// one warpx4 copy: shared (named by the matrix descriptor) -> tensor memory
"tcgen05.cp.cta_group::1.32x128b.warpx4 [%0], %1;" // [%0]=tmem addr, %1=descriptor
(Compiled for sm_100a. End-to-end correctness — including the tmem readback —
is covered by test_smem_tmem.py.)
How inputs change the algorithm
input |
effect |
|---|---|
dtype |
sets |
number of tiles |
|
shared swizzle layout |
changes the encoded |
tmem layout (D vs F) / cta_group |
the permutation order sets per-tile column steps; |
atom shape |
fixed at |