reduction
Covers sum, max, and min, each reducing over one or more axes. There are
three variants: local and shared (both priority 10, distinguished by the
operand's storage scope) and sm100_packed (priority 20, so it pre-empts the
other two for the thread-scope float32 case on Blackwell).
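
The priority rule described above can be sketched as a small dispatcher. This is only an illustration, not the actual registration code: `ReduceOp`, `Variant`, `VARIANTS`, `select_variant`, and the field names (`scope`, `exec_scope`, `arch`) are hypothetical, and the predicates encode just the conditions stated in this section. Any size threshold for the packed variant is left out because its exact meaning is not given here.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class ReduceOp:
    """Hypothetical description of one reduction call site."""
    op: str          # "sum", "max" or "min"
    scope: str       # operand storage scope: "local" or "shared"
    dtype: str       # element type, e.g. "float32"
    exec_scope: str  # execution scope, e.g. "thread" or "warp"
    arch: str        # target, e.g. "sm90" or "sm100" (Blackwell)


@dataclass(frozen=True)
class Variant:
    name: str
    priority: int
    applies: Callable[[ReduceOp], bool]


# Assumed predicates: only what the text states, nothing more.
VARIANTS: List[Variant] = [
    Variant("local", 10, lambda r: r.scope == "local"),
    Variant("shared", 10, lambda r: r.scope == "shared"),
    Variant(
        "sm100_packed",
        20,
        lambda r: r.arch == "sm100"
        and r.exec_scope == "thread"
        and r.dtype == "float32",
    ),
]


def select_variant(req: ReduceOp) -> Optional[str]:
    """Return the name of the highest-priority variant whose predicate matches."""
    matches = [v for v in VARIANTS if v.applies(req)]
    if not matches:
        return None
    return max(matches, key=lambda v: v.priority).name
```

With these assumed predicates, a thread-scope float32 reduction on `sm100` resolves to `sm100_packed` even though `local` also matches, because 20 outranks 10. The two priority-10 variants never compete with each other, since their storage scopes are mutually exclusive.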
| Variant | Prio | Lowering |
|---|---|---|
| local | 10 | register src/dst; sequential thread reduction (+ optional warp shuffle) |
| shared | 10 | shared src/dst; adaptive group-size |
| sm100_packed | 20 | Blackwell thread-scope fp32 ≥8: packed |
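
To make the "sequential thread reduction (+ optional warp shuffle)" lowering concrete, here is a host-side simulation in plain Python. It is a sketch under assumptions, not the generated code: it assumes a 32-lane warp and a butterfly (XOR) exchange pattern, and the helper names `thread_reduce`, `warp_shuffle_reduce`, and `warp_reduce` are hypothetical. The actual lowering may use a different shuffle pattern.

```python
from functools import reduce
from typing import Callable, List, Sequence

WARP_SIZE = 32  # assumed warp width


def thread_reduce(regs: Sequence[int], combine: Callable[[int, int], int]) -> int:
    """Phase 1: one thread folds its own register values sequentially."""
    return reduce(combine, regs)


def warp_shuffle_reduce(
    partials: Sequence[int], combine: Callable[[int, int], int]
) -> List[int]:
    """Phase 2: butterfly exchange across lanes.

    At each step, lane i combines its value with lane (i XOR offset).
    After log2(WARP_SIZE) steps every lane holds the full result,
    provided `combine` is associative and commutative.
    """
    assert len(partials) == WARP_SIZE
    lanes = list(partials)
    offset = WARP_SIZE // 2
    while offset > 0:
        lanes = [combine(lanes[i], lanes[i ^ offset]) for i in range(WARP_SIZE)]
        offset //= 2
    return lanes


def warp_reduce(
    per_thread_regs: Sequence[Sequence[int]], combine: Callable[[int, int], int]
) -> List[int]:
    """Run both phases: per-thread sequential fold, then the warp exchange."""
    partials = [thread_reduce(regs, combine) for regs in per_thread_regs]
    return warp_shuffle_reduce(partials, combine)
```

Passing `operator.add`, `max`, or `min` as `combine` covers the three reductions in this section. Integers are used so the result does not depend on floating-point summation order.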