reduction

Covers sum, max, and min (reductions over axes). There are three variants: local and shared (both priority 10, distinguished by operand storage scope) and sm100_packed (priority 20, so it pre-empts the other two for the thread-scope float32 case on Blackwell).
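The selection rule can be pictured with a small sketch: among the variants whose condition matches a call, the one with the highest priority wins. This is an illustration only, not the actual dispatcher. The `ReduceCall` fields, the `"sm100"` target tag, and `select_variant` are hypothetical names, and any size threshold on the packed variant is deliberately left out.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class ReduceCall:
    """Hypothetical description of one reduction call site."""
    op: str          # "sum", "max" or "min"
    scope: str       # operand storage scope: "local" or "shared"
    dtype: str       # element type, e.g. "float32"
    exec_scope: str  # execution scope, e.g. "thread" or "warp"
    target: str      # assumed target tag; "sm100" stands for Blackwell


@dataclass(frozen=True)
class Variant:
    name: str
    priority: int
    applies: Callable[[ReduceCall], bool]


# Priorities follow the text; the predicates are simplified assumptions.
VARIANTS = [
    Variant("local", 10, lambda c: c.scope == "local"),
    Variant("shared", 10, lambda c: c.scope == "shared"),
    Variant(
        "sm100_packed",
        20,
        lambda c: c.target == "sm100"
        and c.exec_scope == "thread"
        and c.dtype == "float32",
    ),
]


def select_variant(call: ReduceCall) -> Optional[str]:
    """Return the name of the highest-priority matching variant, if any."""
    candidates = [v for v in VARIANTS if v.applies(call)]
    if not candidates:
        return None
    return max(candidates, key=lambda v: v.priority).name
```

With these assumptions, a thread-scope float32 sum on the `"sm100"` tag resolves to `sm100_packed`, while the same call on any other target falls back to `local` or `shared` according to the operand scope.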

| Variant | Prio | Lowering |
| --- | --- | --- |
| reduction → local | 10 | register src/dst; sequential thread reduction (+ optional warp shuffle) |
| reduction → shared | 10 | shared src/dst; adaptive group-size __shfl_xor tree |
| reduction → sm100_packed | 20 | Blackwell thread-scope fp32 ≥8: packed add.f32x2 / max3/min3 |
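To make the xor tree concrete, the following host-side simulation models only the exchange pattern of an xor-shuffle reduction: at each step, lane `i` combines its value with that of lane `i ^ offset`, and the offset halves until it reaches 1. It is a sketch, not the generated kernel. The function name `xor_butterfly_reduce` is hypothetical, and `group_size` is a plain parameter here because the way the lowering adapts the group size is not described above.

```python
import operator
from typing import Callable, List, Sequence


def xor_butterfly_reduce(
    lanes: Sequence[float],
    combine: Callable[[float, float], float],
    group_size: int,
) -> List[float]:
    """Simulate an xor-shuffle tree over aligned groups of `group_size` lanes.

    After log2(group_size) steps, every lane holds the reduction of its group.
    """
    # The butterfly pattern needs a power-of-two group size and whole groups.
    assert group_size > 0 and group_size & (group_size - 1) == 0
    assert len(lanes) % group_size == 0

    vals = list(lanes)
    offset = group_size // 2
    while offset >= 1:
        # Each lane reads its partner (lane ^ offset), as a shuffle-xor would;
        # the partner always lies in the same aligned group.
        vals = [combine(vals[i], vals[i ^ offset]) for i in range(len(vals))]
        offset //= 2
    return vals


# Sum within groups of 4 lanes, then max across all 8 lanes.
group_sums = xor_butterfly_reduce(range(8), operator.add, group_size=4)
lane_max = xor_butterfly_reduce(range(8), max, group_size=8)
```

Here `group_sums` is `[6, 6, 6, 6, 22, 22, 22, 22]` and `lane_max` is `[7] * 8`. Because every lane ends up with the result, no separate broadcast step is needed after the tree.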