TIRx Basics: CUDA C++/PTX native level

Note

Native-level kernel authoring for the CUDA backend (the "cuda" target): the thread hierarchy, memory scopes, the T.cuda.* / T.ptx.* intrinsics, and the compile / run / inspect loop. The complete kernels in these chapters (scale, add, smem_demo, block_sum, and the warp all-reduce) are tested end-to-end on a CUDA GPU.

What “native level” means

A native-level TIRx kernel reads like a structured device kernel: you place threads yourself, allocate shared/register buffers, write loops and barriers, and call device intrinsics directly. There is no automatic scheduling — what you write is what is emitted. This is the foundation the tile primitives (Tile Primitives) are built on; everything here is what those primitives ultimately lower to, so it is also where you go when a hardware feature does not have a primitive yet.

The authoring model

  • @T.prim_func (or @T.jit for compile-time-specialized) kernels, written with from tvm.script import tirx as T;

  • T.device_entry() plus scope-id intrinsics for thread binding;

  • T.match_buffer parameters and T.alloc_* scratch buffers;

  • ordinary loops, branches, and scalar math;

  • tvm.compile(mod, target=..., tir_pipeline="tirx") to build, then call the result directly.

All native authoring uses these imports. The __future__ import lets @T.jit kernels reference compile-time parameters inside type annotations (see Defining a function); it is harmless for ordinary kernels:

from __future__ import annotations
import tvm
from tvm.script import tirx as T