Class CubinKernel#

Class Documentation#

class CubinKernel#

CUDA kernel handle for launching kernels.

This class represents a loaded CUDA kernel function and provides methods to launch it with specified grid/block dimensions, arguments, and stream configuration. Instances are obtained from a CubinModule by kernel name.

See also

CubinModule for loading CUBIN and getting kernels

See also

dim3 for grid/block dimension specification

Usage Pattern

// Get kernel from module
tvm::ffi::CubinKernel kernel = module["kernel_name"];

// Prepare arguments (must be pointers to actual values)
void* data_ptr = tensor.data_ptr();
int64_t size = tensor.size(0);
void* args[] = {&data_ptr, &size};

// Configure launch dimensions
tvm::ffi::dim3 grid(32);    // 32 blocks
tvm::ffi::dim3 block(256);  // 256 threads per block

// Launch on stream
cudaStream_t stream = ...;
cudaError_t result = kernel.Launch(args, grid, block, stream);
TVM_FFI_CHECK_CUDA_ERROR(result);

Note

This class is movable but not copyable.

Public Functions

inline CubinKernel(cudaLibrary_t library, const char *name)#

Construct a CubinKernel from a library and kernel name.

Parameters:
  • library – The cudaLibrary_t handle.

  • name – Name of the kernel function.

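A minimal sketch of constructing a kernel handle directly, assuming the CUDA 12+ runtime library-management API (cudaLibraryLoadFromFile); the file and kernel names are placeholders, and in typical use the handle comes from CubinModule instead:

#include <cuda_runtime.h>

// Load a CUBIN with the CUDA runtime library-management API (CUDA 12+).
// "my_kernels.cubin" and "vector_add" are placeholder names for this sketch.
cudaLibrary_t library = nullptr;
cudaError_t err = cudaLibraryLoadFromFile(&library, "my_kernels.cubin",
                                          nullptr, nullptr, 0,   // no JIT options
                                          nullptr, nullptr, 0);  // no library options
TVM_FFI_CHECK_CUDA_ERROR(err);

// Wrap one kernel from the library in a CubinKernel handle.
tvm::ffi::CubinKernel kernel(library, "vector_add");
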
~CubinKernel() = default#

Destructor (the kernel handle does not require explicit cleanup).

inline cudaError_t Launch(void **args, dim3 grid, dim3 block, cudaStream_t stream, uint32_t dyn_smem_bytes = 0)#

Launch the kernel with specified parameters.

This function launches the kernel on the current CUDA context/device using the CUDA Runtime API. The kernel executes asynchronously on the specified stream.

Argument Preparation

The args array must contain pointers to the actual argument values, not the values themselves. For example:

void* data_ptr = tensor.data_ptr();
int64_t size = 100;
void* args[] = {&data_ptr, &size};  // Note: addresses of the variables

Launch Configuration

Grid and block dimensions determine the kernel’s parallelism:

  • Grid: Number of thread blocks (can be 1D, 2D, or 3D)

  • Block: Number of threads per block (can be 1D, 2D, or 3D)

  • Total threads = grid.x * grid.y * grid.z * block.x * block.y * block.z
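
For example, a 2D launch configuration for a hypothetical H x W problem could be derived as follows (TILE, H, and W are illustrative values, and tvm::ffi::dim3 is assumed to accept x/y/z components like CUDA's dim3):

// Choose a 2D launch configuration for an H x W problem (illustrative sizes).
constexpr int TILE = 16;
const int H = 1024, W = 2048;

tvm::ffi::dim3 block(TILE, TILE);              // 16 * 16 = 256 threads per block
tvm::ffi::dim3 grid((W + TILE - 1) / TILE,     // ceil(W / TILE) blocks in x
                    (H + TILE - 1) / TILE);    // ceil(H / TILE) blocks in y
// Total threads = grid.x * grid.y * block.x * block.y = 128 * 64 * 16 * 16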

Error Checking

Always check the returned cudaError_t:

cudaError_t result = kernel.Launch(args, grid, block, stream);
TVM_FFI_CHECK_CUDA_ERROR(result);

Note

The kernel executes asynchronously. Use cudaStreamSynchronize() or cudaDeviceSynchronize() to wait for completion if needed.

Parameters:
  • args – Array of pointers to kernel arguments (must point to actual values).

  • grid – Grid dimensions (number of blocks in x, y, z).

  • block – Block dimensions (threads per block in x, y, z).

  • stream – CUDA stream to launch the kernel on (use 0 for default stream).

  • dyn_smem_bytes – Dynamic shared memory size in bytes (default: 0).

Returns:

cudaError_t error code from cudaLaunchKernel (cudaSuccess on success).
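
A short sketch of a launch that also passes dyn_smem_bytes and then waits for completion; the 4 KiB size is illustrative and assumes the kernel declares an extern __shared__ buffer:

// Launch with 4 KiB of dynamic shared memory (illustrative size); the kernel
// is assumed to declare `extern __shared__ float smem[];`.
uint32_t dyn_smem_bytes = 4 * 1024;
cudaError_t result = kernel.Launch(args, grid, block, stream, dyn_smem_bytes);
TVM_FFI_CHECK_CUDA_ERROR(result);

// Wait for completion on this stream if the results are needed immediately.
cudaError_t sync_err = cudaStreamSynchronize(stream);
TVM_FFI_CHECK_CUDA_ERROR(sync_err);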

inline cudaKernel_t GetHandle() const#

Get the underlying cudaKernel_t handle.
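
A hedged interop sketch: since Launch() returns the error code from cudaLaunchKernel, the raw handle can presumably be passed to that runtime call directly; casting the cudaKernel_t to const void* is an assumption here, reflecting how the CUDA 12 library-management API exposes kernel handles to the launch APIs:

// Use the raw handle with the CUDA runtime directly (assumes cudaLaunchKernel
// accepts a cudaKernel_t cast to const void*, per the CUDA 12 library API).
cudaKernel_t raw = kernel.GetHandle();
::dim3 grid(32), block(256);  // plain CUDA dim3 here, not tvm::ffi::dim3
cudaError_t err = cudaLaunchKernel(reinterpret_cast<const void *>(raw),
                                   grid, block, args,
                                   /*sharedMem=*/0, stream);
TVM_FFI_CHECK_CUDA_ERROR(err);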

CubinKernel(const CubinKernel&) = delete#

Copy constructor (deleted; CubinKernel is not copyable).

inline CubinKernel(CubinKernel &&other) noexcept#

Move constructor for CubinKernel.

Transfers ownership of the CUDA kernel handle from another CubinKernel instance.

Parameters:

other – The source CubinKernel to move from (will be left in an empty state).
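
A brief sketch of ownership transfer, since the class is movable but not copyable (the container and variable names are illustrative):

#include <utility>
#include <vector>

// Transfer ownership with std::move, e.g. to store kernels in a container.
tvm::ffi::CubinKernel kernel = module["kernel_name"];

std::vector<tvm::ffi::CubinKernel> kernels;
kernels.push_back(std::move(kernel));  // 'kernel' is left in an empty state

// tvm::ffi::CubinKernel copy = kernels.front();  // would not compile: copying is deleted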