reduction

Covers `sum`, `max`, and `min`, each reducing over axes. There are three variants. `local` and `shared` both have priority 10 and are told apart by the operand storage scope. `sm100_packed` has priority 20, so it pre-empts the other two for the thread-scope float32 case on Blackwell.
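
The selection rule described above can be sketched as a small priority lookup. This is a minimal illustration, not the actual dispatcher: the `Operand` and `Variant` classes, their field names, and the `"sm100"` architecture string are all hypothetical, and any further conditions a variant may check (such as size thresholds) are left out.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Operand:
    """Hypothetical description of a reduction operand (not a real API)."""
    storage_scope: str  # "local" (registers) or "shared"
    exec_scope: str     # e.g. "thread" or "warp"
    dtype: str          # e.g. "float32"
    arch: str           # e.g. "sm100" for Blackwell (assumed spelling)


@dataclass(frozen=True)
class Variant:
    name: str
    priority: int
    applies: Callable[[Operand], bool]


# Priorities are taken from the text; the predicates are assumptions.
VARIANTS: List[Variant] = [
    Variant("local", 10, lambda op: op.storage_scope == "local"),
    Variant("shared", 10, lambda op: op.storage_scope == "shared"),
    Variant(
        "sm100_packed",
        20,
        lambda op: op.arch == "sm100"
        and op.exec_scope == "thread"
        and op.dtype == "float32",
    ),
]


def select_variant(op: Operand) -> str:
    """Return the name of the applicable variant with the highest priority."""
    candidates = [v for v in VARIANTS if v.applies(op)]
    if not candidates:
        raise ValueError(f"no reduction variant applies to {op}")
    return max(candidates, key=lambda v: v.priority).name
```

With this model, a thread-scope float32 operand in registers resolves to `sm100_packed` on the assumed `"sm100"` target and falls back to `local` elsewhere, because the priority-20 predicate no longer matches.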

| Variant | Priority | Lowering |
| --- | --- | --- |
| reduction → local | 10 | register src/dst; sequential thread reduction (+ optional warp shuffle) |
| reduction → shared | 10 | shared src/dst; adaptive group-size `__shfl_xor` tree |
| reduction → sm100_packed | 20 | Blackwell thread-scope fp32 ≥8: packed `add.f32x2` / `max3`/`min3` |
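
To make the `__shfl_xor` tree concrete, here is a host-side simulation of a butterfly reduction in plain Python. It is a sketch under stated assumptions: the function name `xor_butterfly_reduce` is hypothetical, the list `lanes` stands in for one value per thread, and `group_size` is passed in directly, so the adaptive choice of group size is not modeled.

```python
import operator
from typing import Callable, List, Sequence


def xor_butterfly_reduce(
    lanes: Sequence[float],
    op: Callable[[float, float], float],
    group_size: int,
) -> List[float]:
    """Simulate an XOR-shuffle reduction tree over aligned groups of lanes.

    After log2(group_size) steps every lane holds the reduction of its group.
    """
    if group_size <= 0 or group_size & (group_size - 1):
        raise ValueError("group_size must be a positive power of two")
    if len(lanes) % group_size:
        raise ValueError("lane count must be a multiple of group_size")

    vals = list(lanes)
    offset = group_size // 2
    while offset >= 1:
        # Each lane combines its value with the value of lane (i ^ offset),
        # read from the previous step, which is what an XOR shuffle exchanges.
        vals = [op(vals[i], vals[i ^ offset]) for i in range(len(vals))]
        offset //= 2
    return vals


# Eight lanes reduced in two groups of four.
sums = xor_butterfly_reduce(list(range(8)), operator.add, group_size=4)
# One group of eight, reduced with max.
maxima = xor_butterfly_reduce(list(range(8)), max, group_size=8)
```

Because the offsets stay below `group_size` and groups are aligned, `i ^ offset` never leaves the lane's own group, so several independent reductions can share one set of lanes.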

reduction → local
reduction → shared
reduction → sm100_packed
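
The operation grouping behind the packed variant can be illustrated on the host as follows. This sketch only models how operands are grouped: two independent accumulators updated together stand in for a packed two-wide add, and a three-operand `max` stands in for `max3`. The function names are hypothetical, Python floats are used instead of fp32, and nothing here reflects the actual generated instructions.

```python
from typing import Sequence


def packed_sum(values: Sequence[float]) -> float:
    """Sum with two accumulators updated together, then one final add."""
    if len(values) % 2:
        raise ValueError("expected an even number of values")
    lo, hi = 0.0, 0.0
    for i in range(0, len(values), 2):
        # One step models a single two-wide packed add on the pair (lo, hi).
        lo, hi = lo + values[i], hi + values[i + 1]
    return lo + hi


def max3_reduce(values: Sequence[float]) -> float:
    """Max reduction that folds in two new elements per three-operand step."""
    if not values:
        raise ValueError("expected at least one value")
    acc = values[0]
    i = 1
    while i + 1 < len(values):
        acc = max(acc, values[i], values[i + 1])
        i += 2
    if i < len(values):
        # Leftover element when the remaining count is odd.
        acc = max(acc, values[i])
    return acc


data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
total = packed_sum(data)
largest = max3_reduce(data)
```

Both forms roughly halve the number of steps compared with a one-element-at-a-time loop. The summation order also differs from a sequential loop, so floating-point results can differ in the last bits for inputs that are not exactly representable.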