TIRx Basics: CUDA C++/PTX native level
Note
Native-level kernel authoring for the CUDA backend (the "cuda"
target): the thread hierarchy, memory scopes, the T.cuda.* / T.ptx.*
intrinsics, and the compile / run / inspect loop. The complete kernels in
these chapters (scale, add, smem_demo, block_sum, and the
warp all-reduce) are tested end-to-end on a CUDA GPU.
What “native level” means
A native-level TIRx kernel reads like a structured device kernel: you place threads yourself, allocate shared and register buffers, write loops and barriers, and call device intrinsics directly. There is no automatic scheduling: what you write is what is emitted. This is the foundation that the tile primitives (see Tile Primitives) are built on. Everything here is what those primitives ultimately lower to, so it is also where you go when a hardware feature does not have a primitive yet.
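To make "you place threads yourself" concrete, the sketch below emulates a per-block sum in plain Python. It is not TIRx code and uses no TIRx API: the function name `block_sum_emulated` and its parameters are hypothetical, and the sequential loops only stand in for the parallel threads and block-wide barriers a real kernel would use. It shows the decisions a native-level kernel spells out explicitly: the index arithmetic per thread, the shared buffer, and where the barriers fall.

```python
# Plain-Python emulation of a native-style per-block sum.
# Assumption: nothing here is TIRx API; all names are illustrative only.

def block_sum_emulated(data, threads_per_block):
    """Return one partial sum per block, using a shared-memory style tree reduction."""
    if threads_per_block < 1 or (threads_per_block & (threads_per_block - 1)) != 0:
        raise ValueError("threads_per_block must be a power of two")

    num_blocks = (len(data) + threads_per_block - 1) // threads_per_block
    out = [0.0] * num_blocks

    for block_idx in range(num_blocks):        # grid level: one iteration per block
        smem = [0.0] * threads_per_block       # shared scope: visible to the whole block

        # Each thread loads one element; out-of-range threads contribute 0.
        for thread_idx in range(threads_per_block):
            global_idx = block_idx * threads_per_block + thread_idx
            smem[thread_idx] = data[global_idx] if global_idx < len(data) else 0.0
        # The end of the loop over threads stands in for a block-wide barrier.

        # Tree reduction: halve the number of active threads at every step.
        stride = threads_per_block // 2
        while stride > 0:
            for thread_idx in range(stride):
                smem[thread_idx] += smem[thread_idx + stride]
            # Barrier again before the next, narrower step.
            stride //= 2

        out[block_idx] = smem[0]               # thread 0 writes the block's result
    return out


partial_sums = block_sum_emulated([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], threads_per_block=4)
print(partial_sums)
```

Every structural choice in the emulation (block size, the bounds check on the load, the barrier between reduction steps) is one the author makes by hand at the native level, since nothing is scheduled automatically.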