cuda-api-wrappers
Thin C++-flavored wrappers for the CUDA Runtime API
cuda::kernel_t Class Reference

A non-owning wrapper class for CUDA __global__ functions.

#include <kernel.hpp>

Public Member Functions

const void * ptr () const noexcept
 
device_t device () const noexcept
 
 operator const void * () noexcept
 
kernel::attributes_t attributes () const
 
grid::dimension_t maximum_active_blocks_per_multiprocessor (grid::block_dimension_t num_threads_per_block, memory::shared::size_t dynamic_shared_memory_per_block, bool disable_caching_override=false)
 Calculates the number of grid blocks which may be "active" on a given GPU multiprocessor simultaneously (i.e. with warps from any of these blocks schedulable concurrently).
 
void set_attribute (kernel::attribute_t attribute, kernel::attribute_value_t value)
 
void opt_in_to_extra_dynamic_memory (cuda::memory::shared::size_t amount_required_by_kernel)
 Change the hardware resource carve-out between L1 cache and shared memory for launches of the kernel to allow for at least the specified amount of shared memory.
 
grid::complete_dimensions_t min_grid_params_for_max_occupancy (memory::shared::size_t dynamic_shared_memory_size=no_dynamic_shared_memory, grid::block_dimension_t block_size_limit=0, bool disable_caching_override=false) const
 
template<typename UnaryFunction >
grid::complete_dimensions_t min_grid_params_for_max_occupancy (UnaryFunction block_size_to_dynamic_shared_mem_size, grid::block_dimension_t block_size_limit=0, bool disable_caching_override=false) const
 
void set_preferred_shared_mem_fraction (unsigned shared_mem_percentage)
 Indicate the desired carve-out between shared memory and L1 cache when launching this kernel - with fine granularity.
 
void set_cache_preference (multiprocessor_cache_preference_t preference)
 Indicate the desired carve-out between shared memory and L1 cache when launching this kernel - with coarse granularity.
 
void set_shared_memory_bank_size (multiprocessor_shared_memory_bank_size_option_t config)
 Sets the kernel's shared memory bank size preference (presumably, for its associated device).
 

Detailed Description

A non-owning wrapper class for CUDA __global__ functions.

Note
The association of a kernel_t with an individual device is somewhat tenuous: the same function pointer could be used with any other device (provided the kernel was compiled appropriately). However, most of the features, attributes and settings are device-specific.

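A minimal usage sketch follows. The kernel name my_kernel is hypothetical, and the exact means of constructing a kernel_t may differ between library versions; check kernel.hpp for the constructor your version provides.

    #include <kernel.hpp>

    __global__ void my_kernel(float* data) { /* ... */ }

    void inspect()
    {
        auto device = cuda::device::current::get();
        // Assumption: a kernel_t can be constructed from a device and the
        // __global__ function's address, consistent with the device() and
        // ptr() accessors listed above; a cast to const void* may be needed.
        cuda::kernel_t kernel { device, reinterpret_cast<const void*>(my_kernel) };
        auto attrs = kernel.attributes(); // e.g. register usage, max block size
        (void) attrs;
    }
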
Member Function Documentation

◆ maximum_active_blocks_per_multiprocessor()

grid::dimension_t cuda::kernel_t::maximum_active_blocks_per_multiprocessor (
        grid::block_dimension_t  num_threads_per_block,
        memory::shared::size_t   dynamic_shared_memory_per_block,
        bool                     disable_caching_override = false )
inline

Calculates the number of grid blocks which may be "active" on a given GPU multiprocessor simultaneously, i.e. with warps from any of these blocks being schedulable concurrently.

Parameters
    num_threads_per_block            The intended number of threads in each block of the launch.
    dynamic_shared_memory_per_block  The amount of dynamic shared memory each block of the launch will require.
    disable_caching_override         On some GPUs, the choice of whether to cache memory reads affects occupancy - and this caching may leave a kernel with 0 potential occupancy. This flag controls what happens in that case: when set to false, the calculator assumes caching is off for the purposes of its work; when set to true, it returns 0 for such device functions. See also the "Unified L1/Texture Cache" section of the Maxwell tuning guide.

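For instance, a sketch of an occupancy query (the kernel object is assumed to have been obtained as in the class-level example above):

    // How many 256-thread blocks, each using 1 KiB of dynamic shared memory,
    // can be resident on a single multiprocessor at the same time?
    auto active_blocks = kernel.maximum_active_blocks_per_multiprocessor(
        256,    // num_threads_per_block
        1024);  // dynamic_shared_memory_per_block, in bytes;
                // disable_caching_override is left at its default, false
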
◆ min_grid_params_for_max_occupancy()

grid::complete_dimensions_t cuda::kernel_t::min_grid_params_for_max_occupancy (
        memory::shared::size_t   dynamic_shared_memory_size = no_dynamic_shared_memory,
        grid::block_dimension_t  block_size_limit = 0,
        bool                     disable_caching_override = false ) const
Parameters
    dynamic_shared_memory_size  The amount of dynamic shared memory each grid block will need.
    block_size_limit            Do not return a block size above this value; the default, 0, means no limit on the returned block size.
    disable_caching_override    On platforms where global caching affects occupancy, and when enabling caching would result in zero occupancy, the occupancy calculator will calculate the occupancy as if caching is disabled. Setting this to true makes the occupancy calculator return 0 in such cases. More information about this feature can be found in the "Unified L1/Texture Cache" section of the Maxwell tuning guide.
Returns
A pair, with the second element being the maximum achievable block size (1-dimensional), and the first element being the minimum number of such blocks necessary for keeping the GPU "busy" (again, in a 1-dimensional grid).

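A sketch of how the result might be used. The member names grid and block, and the make_launch_config helper, are assumptions to be checked against your library version; the documentation above only guarantees pair-like behavior.

    auto min_params = kernel.min_grid_params_for_max_occupancy();
    // Build a 1D launch configuration from the suggested dimensions;
    // your version may expose .first/.second rather than .grid/.block.
    auto launch_config = cuda::make_launch_config(min_params.grid, min_params.block);
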
◆ opt_in_to_extra_dynamic_memory()

void cuda::kernel_t::opt_in_to_extra_dynamic_memory ( cuda::memory::shared::size_t  amount_required_by_kernel)
inline

Change the hardware resource carve-out between L1 cache and shared memory for launches of the kernel to allow for at least the specified amount of shared memory.

On several NVIDIA GPU microarchitectures, the L1 cache and the shared memory in each streaming multiprocessor (i.e. each physical core) use the same hardware resources. The carve-out between the two uses has a device-wide value (which can be changed), but it can also be set at the level of an individual device function, by specifying the amount of shared memory the kernel may require.

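A sketch, with an illustrative figure (whether 64 KiB is actually available depends on the device's compute capability):

    // Allow launches of this kernel to request up to 64 KiB of dynamic
    // shared memory per block - beyond the default carve-out on some
    // architectures - by opting in first.
    cuda::memory::shared::size_t amount_required = 64 * 1024;
    kernel.opt_in_to_extra_dynamic_memory(amount_required);
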
◆ set_cache_preference()

void cuda::kernel_t::set_cache_preference ( multiprocessor_cache_preference_t  preference)
inline

Indicate the desired carve-out between shared memory and L1 cache when launching this kernel - with coarse granularity.

On several NVIDIA GPU microarchitectures, the L1 cache and the shared memory in each streaming multiprocessor (i.e. each physical core) use the same hardware resources. The carve-out between the two uses has a device-wide value (which can be changed), but the driver can set another value for a specific function. This function does not make a demand of the CUDA runtime (as opt_in_to_extra_dynamic_memory() does), but rather indicates what carve-out between L1 and shared memory it would like the kernel scheduler to apply.

Parameters
    preference  One of: as much shared memory as possible, as much L1 as possible, or no preference (i.e. using the device default).
Note
Similar to set_preferred_shared_mem_fraction(), but with coarser granularity.

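A sketch of setting the coarse-grained preference (the enumerator name is an assumption about multiprocessor_cache_preference_t; check its definition in your library version):

    // Hint that this kernel benefits from a larger L1 cache rather than
    // a larger shared memory carve-out.
    kernel.set_cache_preference(
        cuda::multiprocessor_cache_preference_t::prefer_l1_over_shared_memory);
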
◆ set_preferred_shared_mem_fraction()

void cuda::kernel_t::set_preferred_shared_mem_fraction ( unsigned  shared_mem_percentage)
inline

Indicate the desired carve-out between shared memory and L1 cache when launching this kernel - with fine granularity.

On several NVIDIA GPU microarchitectures, the L1 cache and the shared memory in each streaming multiprocessor (i.e. each physical core) use the same hardware resources. The carve-out between the two uses has a device-wide value (which can be changed), but the driver can set another value for a specific function. This function does not make a demand of the CUDA runtime (as opt_in_to_extra_dynamic_memory() does), but rather indicates what carve-out between L1 and shared memory it would like the kernel scheduler to apply.

Parameters
    shared_mem_percentage  The percentage, from 0 to 100, of the combined L1/shared memory space the user wishes to assign to shared memory.
Note
Similar to set_cache_preference(), but with finer granularity.

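A sketch of the fine-grained variant:

    // Ask for roughly three quarters of the combined L1/shared-memory
    // space to be assigned to shared memory for this kernel's launches.
    kernel.set_preferred_shared_mem_fraction(75); // a percentage, 0 through 100
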
◆ set_shared_memory_bank_size()

void cuda::kernel_t::set_shared_memory_bank_size ( multiprocessor_shared_memory_bank_size_option_t  config)
inline

Sets the kernel's shared memory bank size preference (presumably, for its associated device).

Parameters
    config  The bank size setting to apply.

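A sketch (the enumerator name is an assumption modeled on the CUDA runtime's cudaSharedMemBankSizeEightByte; check the library's definition of multiprocessor_shared_memory_bank_size_option_t):

    // Request eight-byte shared memory banks, which can help kernels that
    // mostly access double-precision or other 8-byte data in shared memory.
    kernel.set_shared_memory_bank_size(
        cuda::multiprocessor_shared_memory_bank_size_option_t::eight_bytes_per_bank);
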
The documentation for this class was generated from the following file: kernel.hpp