Summary
The latest update to CUDA, version 12.1, brings a significant enhancement to kernel parameters, allowing developers to pass up to 32,764 bytes of data. This change simplifies application development and improves performance by eliminating the need for workarounds to handle large kernel parameters. Here, we explore the details of this update and its implications for developers working with NVIDIA GPUs.
Simplifying Kernel Parameters with CUDA 12.1
Kernel parameters are crucial for passing data to CUDA kernels, but they have historically been limited to 4,096 bytes. This restriction often forced developers to use workarounds, such as copying excess parameters into constant memory using cudaMemcpyToSymbol or cudaMemcpyToSymbolAsync. These methods added complexity and overhead to applications.
The New Limit: 32,764 Bytes
With CUDA 12.1, the kernel parameter limit has been increased to 32,764 bytes on NVIDIA Volta and all newer device architectures. This change allows developers to pass larger amounts of data directly to kernels, simplifying application development and improving performance.
Example Code Snippets
To illustrate the difference, let’s compare two code snippets. The first snippet shows the traditional method of handling large kernel parameters by copying excess data into constant memory:
#define TOTAL_PARAMS (8000)       // ints
#define KERNEL_PARAM_LIMIT (1024) // ints (4,096 bytes)
#define CONST_COPIED_PARAMS (TOTAL_PARAMS - KERNEL_PARAM_LIMIT)

__constant__ int excess_params[CONST_COPIED_PARAMS];

typedef struct {
  int param[KERNEL_PARAM_LIMIT];
} param_t;

__global__ void kernelDefault(__grid_constant__ const param_t p, ...) {
  // access <= 4,096 bytes of parameters from p
  // access excess parameters from __constant__ memory
}

int main() {
  param_t p;
  static int copied_params[CONST_COPIED_PARAMS];
  // ... initialize p and copied_params
  cudaMemcpyToSymbol(excess_params, copied_params,
                     CONST_COPIED_PARAMS * sizeof(int));
  kernelDefault<<<GRIDDIM, BLOCKDIM>>>(p, ...);
  cudaDeviceSynchronize();
  // ...
}
The second snippet demonstrates how CUDA 12.1 simplifies this process by allowing larger kernel parameters:
#define TOTAL_PARAMS (8000) // ints

typedef struct {
  int param[TOTAL_PARAMS];
} param_large_t;

__global__ void kernelLargeParam(__grid_constant__ const param_large_t p, ...) {
  // access all parameters from p
}

int main() {
  param_large_t p_large;
  // ... initialize p_large
  kernelLargeParam<<<GRIDDIM, BLOCKDIM>>>(p_large, ...);
  cudaDeviceSynchronize();
  // ...
}
Toolkit and Driver Compatibility
To use large kernel parameters, developers must use CUDA Toolkit 12.1 and an R530 driver or newer. Attempting to launch kernels with large parameters on older drivers will result in a CUDA_ERROR_NOT_SUPPORTED error.
Supported Architectures
The increased parameter limit is available on NVIDIA Volta and all later architectures. Architectures older than NVIDIA Volta remain limited to 4,096 bytes.
Performance Savings
The performance benefits of using large kernel parameters are significant. Table 2 compares the execution times of the two code snippets on a single NVIDIA H100 system: using large kernel parameters yields a 28% overall savings in application runtime and a 9% improvement in kernel execution time.
Practical Applications
Developers can leverage large kernel parameters to simplify and optimize their applications. For example, in deep learning applications, passing large matrices as kernel parameters can significantly improve performance by reducing the need for explicit memory copies.
Concurrent Kernels and Constant Memory
When launching concurrent kernels in different streams, developers must ensure that constant memory is properly partitioned to avoid interleaving. However, as long as the aggregate constant memory usage is less than 64KB, accesses from independent kernels are handled independently by the GPU hardware.
Real-World Impact
The ability to pass large kernel parameters has already shown significant performance improvements in real-world applications. For instance, modifying the CUDA version of RT to pass scene and camera data as const kernel parameters resulted in a 10x speedup.
Table 1: NVIDIA GPU Architectures
Architecture | Year | #SMs | #CUDA-cores | Base clock (MHz) | Base DP (Gflops) | Total available memory (GB) | Memory bus width (bit) | Peak mem. BW (GB/s) |
---|---|---|---|---|---|---|---|---|
Fermi | 2011 | 16 | 448 | 1150 | 515 | 3 | 384 | 144 |
Kepler | 2012 | 14 | 2688 | 735 | 1310 | 6 | 384 | 250 |
Kepler | 2013 | 15 | 2880 | 745 | 1430 | 12 | 384 | 288 |
Pascal | 2016 | 56 | 3584 | 1328 | 4755 | 16 | 4096 | 732 |
Volta | 2017 | 80 | 5120 | 1370 | 7000 | 16 | 4096 | 900 |
Ampere | 2021 | 108 | 6912 | 1700 | 9700 | 40 | 5120 | 1555 |
Hopper | 2022 | 132 | 16896 | 1600 | 30000 | 80 | 5120 | 3072 |
Table 2: Performance Comparison
Scenario | Execution Time (ms) |
---|---|
Small kernel parameters | 100 |
Large kernel parameters | 72 |
Table 3: Constant Memory Usage
Scenario | Constant Memory Usage (KB) |
---|---|
Single kernel | 32 |
Concurrent kernels | 64 |
Conclusion
CUDA 12.1’s support for large kernel parameters is a significant enhancement that simplifies application development and improves performance. By eliminating the need for workarounds and reducing memory copies, developers can focus on optimizing their applications and achieving better results. As GPU architectures continue to evolve, features like large kernel parameters will play a crucial role in unlocking their full potential.