cuda-api-wrappers
Thin C++-flavored wrappers for the CUDA Runtime API
cuda::kernel_t Class Reference

A non-owning wrapper for CUDA kernels - whether they be __global__ functions compiled apriori, or the result of dynamic NVRTC compilation, or obtained in some other future way. More...

#include <kernel.hpp>

Inheritance diagram for cuda::kernel_t:

Public Member Functions

context_t context () const noexcept
 Get (a proxy for) the context in which this kernel is defined.
 
device_t device () const noexcept
 Get (a proxy for) the device for (a context of) which this kernel is defined.
 
device::id_t device_id () const noexcept
 Get the id of the device for (a context of) which this kernel is defined.
 
context::handle_t context_handle () const noexcept
 Get the raw handle of the context in which this kernel is defined.
 
kernel::handle_t handle () const
 Get the raw (intra-context) CUDA handle for this kernel. More...
 
kernel_t & operator= (const kernel_t &)=delete
 
kernel_t & operator= (kernel_t &&other) noexcept
 
VIRTUAL_UNLESS_CAN_GET_APRIORI_KERNEL_HANDLE kernel::attribute_value_t get_attribute (kernel::attribute_t attribute) const
 
VIRTUAL_UNLESS_CAN_GET_APRIORI_KERNEL_HANDLE cuda::device::compute_capability_t ptx_version () const
 
VIRTUAL_UNLESS_CAN_GET_APRIORI_KERNEL_HANDLE cuda::device::compute_capability_t binary_compilation_target_architecture () const
 
VIRTUAL_UNLESS_CAN_GET_APRIORI_KERNEL_HANDLE grid::block_dimension_t maximum_threads_per_block () const
 
VIRTUAL_UNLESS_CAN_GET_APRIORI_KERNEL_HANDLE grid::dimension_t max_active_blocks_per_multiprocessor (grid::block_dimension_t block_size_in_threads, memory::shared::size_t dynamic_shared_memory_per_block, bool disable_caching_override=false) const
 Calculates the number of grid blocks which may be "active" on a given GPU multiprocessor simultaneously (i.e. with warps from any of these blocks being schedulable concurrently). More...
 
VIRTUAL_UNLESS_CAN_GET_APRIORI_KERNEL_HANDLE void set_attribute (kernel::attribute_t attribute, kernel::attribute_value_t value) const
 
void set_maximum_dynamic_shared_memory_per_block (cuda::memory::shared::size_t amount_required_by_kernel) const
 Change the hardware resource carve-out between L1 cache and shared memory for launches of the kernel to allow for at least the specified amount of shared memory. More...
 
memory::shared::size_t get_maximum_dynamic_shared_memory_per_block () const
 
VIRTUAL_UNLESS_CAN_GET_APRIORI_KERNEL_HANDLE void set_cache_preference (multiprocessor_cache_preference_t preference) const
 Indicate the desired carve-out between shared memory and L1 cache when launching this kernel - with coarse granularity. More...
 
VIRTUAL_UNLESS_CAN_GET_APRIORI_KERNEL_HANDLE void set_shared_memory_bank_size (multiprocessor_shared_memory_bank_size_option_t config) const
 Sets a device function's preference of shared memory bank size. More...
 
 kernel_t (const kernel_t &other)
 
 kernel_t (kernel_t &&other)
 

Friends

kernel_t kernel::wrap (device::id_t, context::handle_t, kernel::handle_t, bool)
 

Detailed Description

A non-owning wrapper for CUDA kernels - whether they be __global__ functions compiled apriori, or the result of dynamic NVRTC compilation, or obtained in some other future way.

Note
The association of a kernel_t with an individual device or context is somewhat tenuous. That is, the same function could be used with any other device; however, many or most of the features, attributes and settings are context-specific or device-specific.
NVRTC-compiled kernels can only use this class, while apriori-compiled kernels can use their own subclass.
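
For illustration, here is a minimal sketch of inspecting this association, using only members documented on this page; how the kernel_t itself is obtained (from NVRTC or from an apriori-compiled __global__ function) is left out, and the wrapper is assumed to arrive as a parameter:

    #include <kernel.hpp>
    #include <iostream>

    // Report which device and context a kernel wrapper is associated with.
    // Only accessors documented on this page are used.
    void report_kernel_association(const cuda::kernel_t& kernel)
    {
        std::cout << "Kernel is defined in context " << kernel.context_handle()
                  << " of device " << kernel.device_id() << '\n';
    }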

Member Function Documentation

◆ binary_compilation_target_architecture()

VIRTUAL_UNLESS_CAN_GET_APRIORI_KERNEL_HANDLE cuda::device::compute_capability_t cuda::kernel_t::binary_compilation_target_architecture ( ) const
inline
Returns
the physical microarchitecture which this kernel was compiled to target

◆ handle()

kernel::handle_t cuda::kernel_t::handle ( ) const
inline

Get the raw (intra-context) CUDA handle for this kernel.

Note
for earlier versions of CUDA, and for apriori-compiled kernels, this handle may be unavailable.

◆ max_active_blocks_per_multiprocessor()

grid::dimension_t cuda::kernel_t::max_active_blocks_per_multiprocessor ( grid::block_dimension_t  block_size_in_threads,
memory::shared::size_t  dynamic_shared_memory_per_block,
bool  disable_caching_override = false 
) const
inline

Calculates the number of grid blocks which may be "active" on a given GPU multiprocessor simultaneously (i.e. with warps from any of these blocks being schedulable concurrently).

Parameters
block_size_in_threads
dynamic_shared_memory_per_block
disable_caching_override    On some GPUs, the choice of whether to cache memory reads affects occupancy. But what if this caching results in 0 potential occupancy for a kernel? There are two options, controlled by this flag. When it is set to false, the calculator will assume caching is off for the purposes of its work; when set to true, it will return 0 for such device functions.
See also
The "Unified L1/Texture Cache" section of the Maxwell tuning guide regarding caching override.
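
As a usage sketch - the block size and shared-memory figure below are arbitrary illustration values, and a kernel_t named kernel is assumed to be in scope:

    // How many 256-thread blocks, each using 4 KiB of dynamic shared memory,
    // can be resident on a single multiprocessor at the same time?
    cuda::grid::dimension_t resident_blocks =
        kernel.max_active_blocks_per_multiprocessor(
            256,    // block_size_in_threads
            4096,   // dynamic_shared_memory_per_block, in bytes
            false); // have the calculator assume caching is off rather than report 0
    if (resident_blocks == 0) {
        // this launch configuration cannot be satisfied by the device at all
    }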

◆ maximum_threads_per_block()

VIRTUAL_UNLESS_CAN_GET_APRIORI_KERNEL_HANDLE grid::block_dimension_t cuda::kernel_t::maximum_threads_per_block ( ) const
inline
Returns
the maximum number of threads per block for which the GPU device can satisfy this kernel's hardware requirements - typically, the limiting factor is the number of registers the kernel uses.
Note
the kernel may have other constraints, requiring a different number of threads per block; these cannot be determined using this method.
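
A short sketch of clamping a desired block size to what this method reports (a kernel_t named kernel is assumed to be in scope):

    #include <algorithm>

    // The desired size is illustrative; the method's result already accounts
    // for the kernel's per-thread resource usage on this device
    cuda::grid::block_dimension_t desired_block_size = 1024;
    cuda::grid::block_dimension_t block_size =
        std::min(desired_block_size, kernel.maximum_threads_per_block());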

◆ ptx_version()

VIRTUAL_UNLESS_CAN_GET_APRIORI_KERNEL_HANDLE cuda::device::compute_capability_t cuda::kernel_t::ptx_version ( ) const
inline
Returns
the PTX version used as the target for the compilation of this kernel
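
The two compilation-target queries documented on this page can be read side by side; note that the PTX target and the binary target may differ, e.g. when PTX is JIT-compiled for a newer device than it was originally compiled for. A minimal sketch, assuming a kernel_t named kernel is in scope:

    cuda::device::compute_capability_t ptx_target = kernel.ptx_version();
    cuda::device::compute_capability_t binary_target =
        kernel.binary_compilation_target_architecture();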

◆ set_cache_preference()

VIRTUAL_UNLESS_CAN_GET_APRIORI_KERNEL_HANDLE void cuda::kernel_t::set_cache_preference ( multiprocessor_cache_preference_t  preference) const
inline

Indicate the desired carve-out between shared memory and L1 cache when launching this kernel - with coarse granularity.

On several NVIDIA GPU microarchitectures, the L1 cache and the shared memory in each streaming multiprocessor (=physical core) use the same hardware resources. The carve-out between the two uses has a device-wide value (which can be changed), but the driver can set a different value for a specific function. This method doesn't make a demand of the CUDA runtime (as opt_in_to_extra_dynamic_memory does), but rather indicates what split between L1 and shared memory it would like the kernel scheduler to carve out.

Parameters
preferenceone of: as much shared memory as possible, as much L1 as possible, or no preference (i.e. using the device default).
Note
similar to set_preferred_shared_mem_fraction() - but with coarser granularity.
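
A sketch of a typical call; the enumerator name below is an assumption about multiprocessor_cache_preference_t and is not taken from this page:

    // Ask the scheduler to favor shared memory over L1 for this kernel's launches
    // (enumerator name assumed)
    kernel.set_cache_preference(
        cuda::multiprocessor_cache_preference_t::prefer_shared_memory_over_l1);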

◆ set_maximum_dynamic_shared_memory_per_block()

void cuda::kernel_t::set_maximum_dynamic_shared_memory_per_block ( cuda::memory::shared::size_t  amount_required_by_kernel) const
inline

Change the hardware resource carve-out between L1 cache and shared memory for launches of the kernel to allow for at least the specified amount of shared memory.

On several NVIDIA GPU microarchitectures, the L1 cache and the shared memory in each streaming multiprocessor (=physical core) use the same hardware resources. The carve-out between the two uses has a device-wide value (which can be changed), but it can also be set at the level of an individual device function, by specifying the amount of shared memory the kernel may require.
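
A sketch of a typical call (the figure is for illustration only; a kernel_t named kernel is assumed to be in scope):

    // Allow launches of this kernel to attach at least 48 KiB of dynamic
    // shared memory per block
    kernel.set_maximum_dynamic_shared_memory_per_block(48 * 1024);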

◆ set_shared_memory_bank_size()

VIRTUAL_UNLESS_CAN_GET_APRIORI_KERNEL_HANDLE void cuda::kernel_t::set_shared_memory_bank_size ( multiprocessor_shared_memory_bank_size_option_t  config) const
inline

Sets a device function's preference of shared memory bank size.

Parameters
config    bank size setting to make
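
A sketch of a typical call; the enumerator name below is an assumption about multiprocessor_shared_memory_bank_size_option_t and is not taken from this page:

    // Request eight-byte shared memory banks for this kernel's launches
    // (enumerator name assumed)
    kernel.set_shared_memory_bank_size(
        cuda::multiprocessor_shared_memory_bank_size_option_t::eight_bytes_per_bank);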

The documentation for this class was generated from the following file:
kernel.hpp