cuda-api-wrappers
Thin C++-flavored wrappers for the CUDA Runtime API
cuda::kernel_t Class Reference

A non-owning wrapper for CUDA kernels - whether they be __global__ functions compiled apriori, or the result of dynamic NVRTC compilation, or obtained in some other future way. More...

#include <kernel.hpp>

Inheritance diagram for cuda::kernel_t:

Public Member Functions

context_t context () const noexcept
 Get (a proxy for) the context in which this kernel is defined.
 
device_t device () const noexcept
 Get (a proxy for) the device for (a context of) which this kernel is defined.
 
device::id_t device_id () const noexcept
 Get the id of the device for (a context of) which this kernel is defined.
 
context::handle_t context_handle () const noexcept
 Get the raw handle of the context in which this kernel is defined.
 
kernel::handle_t handle () const
 Get the raw (intra-context) CUDA handle for this kernel. More...
 
kernel_t & operator= (const kernel_t &)=delete
 
kernel_t & operator= (kernel_t &&other) noexcept
 
VIRTUAL_UNLESS_CAN_GET_APRIORI_KERNEL_HANDLE kernel::attribute_value_t get_attribute (kernel::attribute_t attribute) const
 
VIRTUAL_UNLESS_CAN_GET_APRIORI_KERNEL_HANDLE cuda::device::compute_capability_t ptx_version () const
 
VIRTUAL_UNLESS_CAN_GET_APRIORI_KERNEL_HANDLE cuda::device::compute_capability_t binary_compilation_target_architecture () const
 
VIRTUAL_UNLESS_CAN_GET_APRIORI_KERNEL_HANDLE grid::block_dimension_t maximum_threads_per_block () const
 
VIRTUAL_UNLESS_CAN_GET_APRIORI_KERNEL_HANDLE grid::dimension_t max_active_blocks_per_multiprocessor (grid::block_dimension_t block_size_in_threads, memory::shared::size_t dynamic_shared_memory_per_block, bool disable_caching_override=false) const
 Calculates the number of grid blocks which may be "active" on a given GPU multiprocessor simultaneously (i.e. with warps from any of these blocks being schedulable concurrently). More...
 
VIRTUAL_UNLESS_CAN_GET_APRIORI_KERNEL_HANDLE void set_attribute (kernel::attribute_t attribute, kernel::attribute_value_t value) const
 
void set_maximum_dynamic_shared_memory_per_block (cuda::memory::shared::size_t amount_required_by_kernel) const
 Change the hardware resource carve-out between L1 cache and shared memory for launches of the kernel to allow for at least the specified amount of shared memory. More...
 
memory::shared::size_t get_maximum_dynamic_shared_memory_per_block () const
 
VIRTUAL_UNLESS_CAN_GET_APRIORI_KERNEL_HANDLE void set_cache_preference (multiprocessor_cache_preference_t preference) const
 Indicate the desired carve-out between shared memory and L1 cache when launching this kernel - with coarse granularity. More...
 
VIRTUAL_UNLESS_CAN_GET_APRIORI_KERNEL_HANDLE void set_shared_memory_bank_size (multiprocessor_shared_memory_bank_size_option_t config) const
 Sets a device function's preference of shared memory bank size. More...
 
 kernel_t (const kernel_t &other)
 
 kernel_t (kernel_t &&other)
 

Friends

kernel_t kernel::wrap (device::id_t, context::handle_t, kernel::handle_t, bool)
 

Detailed Description

A non-owning wrapper for CUDA kernels - whether they be __global__ functions compiled apriori, or the result of dynamic NVRTC compilation, or obtained in some other future way.

Note
The association of a kernel_t with an individual device or context is somewhat tenuous. That is, the same function could be used with any other device; however, many or most of the features, attributes and settings are context-specific or device-specific.
NVRTC-compiled kernels can only use this class, while apriori-compiled kernels can use their own subclass.
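
For illustration, here is a minimal sketch of inspecting this association, using only members documented on this page; how the kernel_t itself is obtained (from NVRTC or from an apriori-compiled __global__ function) is left out, and the wrapper is assumed to arrive as a parameter:

    #include <kernel.hpp>
    #include <iostream>

    // Report which device and context a kernel wrapper is associated with.
    // Only accessors documented on this page are used.
    void report_kernel_association(const cuda::kernel_t& kernel)
    {
        std::cout << "Kernel is defined in context " << kernel.context_handle()
                  << " of device " << kernel.device_id() << '\n';
    }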

Member Function Documentation

◆ binary_compilation_target_architecture()

VIRTUAL_UNLESS_CAN_GET_APRIORI_KERNEL_HANDLE cuda::device::compute_capability_t cuda::kernel_t::binary_compilation_target_architecture ( ) const
inline
Returns
the physical microarchitecture which this kernel was compiled to target

◆ handle()

kernel::handle_t cuda::kernel_t::handle ( ) const
inline

Get the raw (intra-context) CUDA handle for this kernel.

Note
for earlier versions of CUDA, and for apriori-compiled kernels, this handle may be unavailable.

◆ max_active_blocks_per_multiprocessor()

grid::dimension_t cuda::kernel_t::max_active_blocks_per_multiprocessor ( grid::block_dimension_t  block_size_in_threads,
memory::shared::size_t  dynamic_shared_memory_per_block,
bool  disable_caching_override = false 
) const
inline

Calculates the number of grid blocks which may be "active" on a given GPU multiprocessor simultaneously (i.e. with warps from any of these blocks being schedulable concurrently).

Parameters
block_size_in_threads
dynamic_shared_memory_per_block
disable_caching_override    On some GPUs, the choice of whether to cache memory reads affects occupancy. But what if this caching results in 0 potential occupancy for a kernel? There are two options, controlled by this flag. When it is set to false, the calculator will assume caching is off for the purposes of its work; when set to true, it will return 0 for such device functions.
See also
The "Unified L1/Texture Cache" section of the Maxwell tuning guide regarding caching override.
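
As a usage sketch - the block size and shared-memory figure below are arbitrary illustration values, and a kernel_t named kernel is assumed to be in scope:

    // How many 256-thread blocks, each using 4 KiB of dynamic shared memory,
    // can be resident on a single multiprocessor at the same time?
    cuda::grid::dimension_t resident_blocks =
        kernel.max_active_blocks_per_multiprocessor(
            256,    // block_size_in_threads
            4096,   // dynamic_shared_memory_per_block, in bytes
            false); // have the calculator assume caching is off rather than report 0
    if (resident_blocks == 0) {
        // this launch configuration cannot be satisfied by the device at all
    }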

◆ maximum_threads_per_block()

VIRTUAL_UNLESS_CAN_GET_APRIORI_KERNEL_HANDLE grid::block_dimension_t cuda::kernel_t::maximum_threads_per_block ( ) const
inline
Returns
the maximum number of threads per block for which the GPU device can satisfy this kernel's hardware requirements - typically, the limiting factor is the number of registers the kernel uses.
Note
the kernel may have other constraints, requiring a different number of threads per block; these cannot be determined using this method.
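
A short sketch of clamping a desired block size to what this method reports (a kernel_t named kernel is assumed to be in scope):

    #include <algorithm>

    // The desired size is illustrative; the method's result already accounts
    // for the kernel's per-thread resource usage on this device
    cuda::grid::block_dimension_t desired_block_size = 1024;
    cuda::grid::block_dimension_t block_size =
        std::min(desired_block_size, kernel.maximum_threads_per_block());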

◆ ptx_version()

VIRTUAL_UNLESS_CAN_GET_APRIORI_KERNEL_HANDLE cuda::device::compute_capability_t cuda::kernel_t::ptx_version ( ) const
inline
Returns
the PTX version used as the target for the compilation of this kernel
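
The two compilation-target queries documented on this page can be read side by side; note that the PTX target and the binary target may differ, e.g. when PTX is JIT-compiled for a newer device than it was originally compiled for. A minimal sketch, assuming a kernel_t named kernel is in scope:

    cuda::device::compute_capability_t ptx_target = kernel.ptx_version();
    cuda::device::compute_capability_t binary_target =
        kernel.binary_compilation_target_architecture();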

◆ set_cache_preference()

VIRTUAL_UNLESS_CAN_GET_APRIORI_KERNEL_HANDLE void cuda::kernel_t::set_cache_preference ( multiprocessor_cache_preference_t  preference) const
inline

Indicate the desired carve-out between shared memory and L1 cache when launching this kernel - with coarse granularity.

On several NVIDIA GPU microarchitectures, the L1 cache and the shared memory in each streaming multiprocessor (=physical core) use the same hardware resources. The carve-out between the two uses has a device-wide value (which can be changed), but the driver can set a different value for a specific function. This method doesn't make a demand of the CUDA runtime (as opt_in_to_extra_dynamic_memory does), but rather indicates what split between L1 and shared memory it would like the kernel scheduler to carve out.

Parameters
preferenceone of: as much shared memory as possible, as much L1 as possible, or no preference (i.e. using the device default).
Note
similar to set_preferred_shared_mem_fraction() - but with coarser granularity.
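
A sketch of a typical call; the enumerator name below is an assumption about multiprocessor_cache_preference_t and is not taken from this page:

    // Ask the scheduler to favor shared memory over L1 for this kernel's launches
    // (enumerator name assumed)
    kernel.set_cache_preference(
        cuda::multiprocessor_cache_preference_t::prefer_shared_memory_over_l1);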

◆ set_maximum_dynamic_shared_memory_per_block()

void cuda::kernel_t::set_maximum_dynamic_shared_memory_per_block ( cuda::memory::shared::size_t  amount_required_by_kernel) const
inline

Change the hardware resource carve-out between L1 cache and shared memory for launches of the kernel to allow for at least the specified amount of shared memory.

On several NVIDIA GPU microarchitectures, the L1 cache and the shared memory in each streaming multiprocessor (=physical core) use the same hardware resources. The carve-out between the two uses has a device-wide value (which can be changed), but it can also be set at the level of an individual device function, by specifying the amount of shared memory the kernel may require.
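
A sketch of a typical call (the figure is for illustration only; a kernel_t named kernel is assumed to be in scope):

    // Allow launches of this kernel to attach at least 48 KiB of dynamic
    // shared memory per block
    kernel.set_maximum_dynamic_shared_memory_per_block(48 * 1024);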

◆ set_shared_memory_bank_size()

VIRTUAL_UNLESS_CAN_GET_APRIORI_KERNEL_HANDLE void cuda::kernel_t::set_shared_memory_bank_size ( multiprocessor_shared_memory_bank_size_option_t  config) const
inline

Sets a device function's preference of shared memory bank size.

Parameters
config    bank size setting to make
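
A sketch of a typical call; the enumerator name below is an assumption about multiprocessor_shared_memory_bank_size_option_t and is not taken from this page:

    // Request eight-byte shared memory banks for this kernel's launches
    // (enumerator name assumed)
    kernel.set_shared_memory_bank_size(
        cuda::multiprocessor_shared_memory_bank_size_option_t::eight_bytes_per_bank);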

The documentation for this class was generated from the following file:
kernel.hpp