copy → fallback
The fallback variant is the priority-0 catch-all: a scalar, single-thread
copy that runs only when every faster variant (copy → gmem_smem, copy → reg,
copy → ldstmatrix) has declined. It always works — for any valid copy at any scope
— and is intentionally slow, so it emits a UserWarning when chosen. Source:
python/tvm/backend/cuda/operator/tile_primitive/copy/fallback.py.
What it accepts
Any valid copy. The only gate is _is_valid_copy (layouts present, equal dtype,
equal non-unit extents); there is no scope, pair, or divisibility restriction. It is
registered at priority=0 so it is the last candidate the dispatcher tries:
Property |
Requirement |
|---|---|
priority |
|
target / scope |
any ( |
memory pair |
any (global / shared / local, either direction) |
dtype / shape |
|
A typical reason the fast variants all decline: the region’s element count does not
divide the thread count, so the [outer, threads, vec] split has no solution
(gmem_smem declines), and neither side is a register layout (reg /
ldstmatrix decline). Example: a warp (32 threads) copying a 4×6 = 24-element
tile (24 ∤ 32).
Demonstration program
A warp copies a 4×6 float32 tile global → shared and back. 24 is not
divisible by 32, so this falls through to fallback (from
test_fallback.py):
shape, dtype = (4, 6), "float32" # 24 elements, 32 threads -> 24 ∤ 32
s_layout = TileLayout(S[shape])
full = (slice(0, 4), slice(0, 6))
@T.prim_func
def kernel(A_ptr: T.handle, B_ptr: T.handle):
A = T.match_buffer(A_ptr, shape, dtype)
B = T.match_buffer(B_ptr, shape, dtype)
T.device_entry(); T.cta_id([1]); T.lane_id([32]); T.thread_id([32])
A_smem = T.alloc_buffer(shape, dtype, scope="shared", layout=s_layout)
Tx.warp.copy(A_smem[full], A[full]) # fallback
T.cuda.cta_sync()
Tx.warp.copy(B[full], A_smem[full]) # fallback
Algorithm
1. Elect one thread. For a multi-thread scope the copy is done entirely by the
first thread of the scope — its laneid base, plus the warp offset within a
warpgroup / cta (radix-32 composition of the per-axis offsets); every other thread
skips it. For thread scope there is only one thread, so no guard is emitted:
if scope_kind == "thread":
def impl():
_copy_body(dst, src) # the single thread copies everything
else:
first_tid = int(sctx.intra["laneid"][1]) # first thread of the scope:
if scope_kind == "warpgroup":
first_tid += 32 * int(sctx.intra["wid_in_wg"][1]) # + warp offset within the wg
elif scope_kind == "cta":
first_tid += 32 * int(sctx.intra["warpid"][1]) # + warp offset within the cta
def impl():
tid = _axis_decl(tid_axis_name, sctx)
if tid == first_tid: # only the scope's first thread copies
_copy_body(dst, src)
2. Scalar nested loop over the region. _copy_body iterates the non-unit
extents with T.grid and copies one element per step — no vectorization, no
partition:
with T.grid(*copy_extents) as lvs: # copy_extents = non-unit dst extents
dst[_dst_coord(lvs)] = src[_src_coord(lvs)]
Generated TIRx IR
LowerTIRx produces the guarded scalar grid (global → shared shown):
if tid == 0: # first_tid (lane 0 for this warp scope)
for v_3, v_4 in T.grid(4, 6):
A_smem[v_3, v_4] = A[v_3, v_4]
Generated CUDA
__shared__ alignas(64) float A_smem_ptr[24];
if (((int)threadIdx.x) == 0) { // lane 0 does the whole copy
for (int v_3 = 0; v_3 < 4; ++v_3)
for (int v_4 = 0; v_4 < 6; ++v_4)
A_smem_ptr[(v_3 * 6) + v_4] = A_ptr[(v_3 * 6) + v_4];
}
Lane 0 copies all 24 elements one at a time; the other 31 lanes do nothing. (At
lowering the dispatch also prints UserWarning: copy/fallback (scalar
single-thread) picked … all faster variants rejected.)
How inputs change the algorithm
The scope decides who runs the loop:
scope |
who copies |
|---|---|
|
the single thread, no |
|
only |
The shape sets the T.grid bounds (the non-unit extents); the loop body is
always one scalar element copy, regardless of dtype. There is no vectorization, so
performance does not depend on dtype width — this variant exists for correctness,
not speed.