cuda-api-wrappers
Thin C++-flavored wrappers for the CUDA Runtime API
Namespaces | Classes | Typedefs | Enumerations | Functions
cuda Namespace Reference

All definitions and functionality wrapping the CUDA Runtime API. More...

Namespaces

 device
 Definitions and functionality related to CUDA devices (not including the device wrapper type device_t itself)
 
 event
 Definitions and functionality related to CUDA events (not including the event wrapper type event_t itself)
 
 memory
 Management and operations on memory in different CUDA-recognized spaces.
 
 stream
 Definitions and functionality related to CUDA streams (not including the device wrapper type stream_t itself)
 

Classes

class  array_t
 Owning wrapper for CUDA 2D and 3D arrays. More...
 
class  device_t
 Proxy class for a CUDA device. More...
 
class  event_t
 Proxy class for a CUDA event. More...
 
class  kernel_t
 A non-owning wrapper class for CUDA __global__ functions. More...
 
struct  launch_configuration_t
 Holds the parameters necessary to "launch" a CUDA kernel (i.e. More...
 
class  runtime_error
 A (base?) class for exceptions raised by CUDA code; these errors are thrown by essentially all CUDA Runtime API wrappers upon failure. More...
 
class  stream_t
 Proxy class for a CUDA stream. More...
 
struct  symbol_t
 Object-code symbols. More...
 
class  texture_view
 Use texture memory for optimized read-only cache access. More...
 
struct  version_t
 CUDA Runtime version. More...
 

Typedefs

using queue_t = stream_t
 
using queue_id_t = stream::id_t
 
using combined_version_t = int
 
using status_t = cudaError_t
 Indicates either the result (success or error index) of a CUDA Runtime API call, or the overall status of the Runtime API (which is typically the last triggered error).
 
using size_t = ::std::size_t
 
using dimensionality_t = unsigned
 The index or number of dimensions of an entity (as opposed to the extent in any dimension) - typically just 0, 1, 2 or 3.
 
using native_word_t = unsigned
 

Enumerations

enum  : native_word_t { warp_size = 32 }
 CUDA's NVCC allows the use of the warpSize identifier, without having to define it. More...
 
enum  : bool {
  thread_blocks_may_cooperate = true,
  thread_blocks_may_not_cooperate = false
}
 Thread block cooperativity control for kernel launches. More...
 
enum  : memory::shared::size_t { no_dynamic_shared_memory = 0 }
 
enum  : bool {
  do_take_ownership = true,
  do_not_take_ownership = false
}
 
enum  : bool {
  dont_clear_errors = false,
  do_clear_errors = true
}
 
enum  multiprocessor_cache_preference_t {
  multiprocessor_cache_preference_t::no_preference = cudaFuncCachePreferNone,
  multiprocessor_cache_preference_t::equal_l1_and_shared_memory = cudaFuncCachePreferEqual,
  multiprocessor_cache_preference_t::prefer_shared_memory_over_l1 = cudaFuncCachePreferShared,
  multiprocessor_cache_preference_t::prefer_l1_over_shared_memory = cudaFuncCachePreferL1,
  none = no_preference,
  equal = equal_l1_and_shared_memory,
  prefer_shared = prefer_shared_memory_over_l1,
  prefer_l1 = prefer_l1_over_shared_memory
}
 L1-vs-shared-memory balance option. More...
 
enum  multiprocessor_shared_memory_bank_size_option_t : ::std::underlying_type< cudaSharedMemConfig >::type {
  device_default = cudaSharedMemBankSizeDefault,
  four_bytes_per_bank = cudaSharedMemBankSizeFourByte,
  eight_bytes_per_bank = cudaSharedMemBankSizeEightByte
}
 A physical core (SM)'s shared memory has multiple "banks"; at most one datum per bank may be accessed simultaneously, while data in different banks can be accessed in parallel. More...
 
enum  host_thread_synch_scheduling_policy_t : unsigned int {
  heuristic = cudaDeviceScheduleAuto,
  spin = cudaDeviceScheduleSpin,
  block = cudaDeviceScheduleBlockingSync,
  yield = cudaDeviceScheduleYield,
  automatic = heuristic
}
 Scheduling policies the Runtime API may use when the host-side thread it is running in needs to wait for results from a certain device. More...
 

Functions

void synchronize (device_t &device)
 Suspends execution until all previously-scheduled tasks on the specified device (all contexts, all streams) have concluded. More...
 
detail_::all_devices devices ()
 
constexpr bool is_success (status_t status)
 Determine whether the API call returning the specified status had succeeded.
 
constexpr bool is_failure (status_t status)
 Determine whether the API call returning the specified status had failed.
 
inline ::std::string describe (status_t status)
 Obtain a brief textual explanation for a specified kind of CUDA Runtime API status or error code.
 
void throw_if_error (cuda::status_t status, const ::std::string &message) noexcept(false)
 Does nothing... More...
 
void throw_if_error (cuda::status_t status) noexcept(false)
 Does nothing - unless the status indicates an error, in which case a cuda::runtime_error exception is thrown. More...
 
void synchronize (const event_t &event)
 Waits for a specified event to conclude before returning control to the calling code. More...
 
constexpr grid::dimensions_t single_block ()
 A named constructor idiom for {grid::dimensions_t}, which, when used, will result in a grid with a single block.
 
constexpr grid::block_dimensions_t single_thread_per_block ()
 A named constructor idiom for {grid::block_dimensions_t}, which, when used, will result in a grid whose blocks have a single thread.
 
template<typename Kernel , typename... KernelParameters>
void enqueue_launch (bool thread_block_cooperation, Kernel kernel_function, const stream_t &stream, launch_configuration_t launch_configuration, KernelParameters &&... parameters)
 Enqueues a kernel on a stream (=queue) on the current CUDA device. More...
 
template<typename Kernel , typename... KernelParameters>
void enqueue_launch (Kernel kernel_function, const stream_t &stream, launch_configuration_t launch_configuration, KernelParameters &&... parameters)
 A variant of enqueue_launch which uses the default of no cooperation between thread blocks.
 
template<typename Kernel , typename... KernelParameters>
void launch (Kernel kernel, launch_configuration_t launch_configuration, KernelParameters &&... parameters)
 Variant of enqueue_launch for use with the default stream on the current device. More...
 
void force_runtime_initialization ()
 Ensures the CUDA runtime has fully initialized. More...
 
void synchronize (const stream_t &stream)
 
bool operator!= (const stream_t &lhs, const stream_t &rhs) noexcept
 
bool operator== (const texture_view &lhs, const texture_view &rhs) noexcept
 
bool operator!= (const texture_view &lhs, const texture_view &rhs) noexcept
 
constexpr launch_configuration_t make_launch_config (grid::dimensions_t grid_dimensions, grid::block_dimensions_t block_dimensions, memory::shared::size_t dynamic_shared_memory_size=0u) noexcept
 A named constructor idiom for launch_configuration_t
 
constexpr bool operator== (const launch_configuration_t lhs, const launch_configuration_t &rhs) noexcept
 

Detailed Description

All definitions and functionality wrapping the CUDA Runtime API.

Enumeration Type Documentation

◆ anonymous enum

anonymous enum : native_word_t

CUDA's NVCC allows the use of the warpSize identifier, without having to define it.

Un(?)fortunately, warpSize is not a compile-time constant; it is replaced at some point with the appropriate immediate value, which goes into the SASS instruction as a literal. This is apparently due to the theoretical possibility of different warp sizes in the future. However, it is useful - both for host-side and, more importantly, for device-side code - to have the warp size available at compile time. This allows all sorts of useful optimizations, as well as its use in constexpr code.

If nVIDIA comes out with 64-lanes-per-warp GPUs - we'll refactor this.
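
For illustration, a minimal sketch of compile-time use (the helper and kernel names are hypothetical, and the umbrella header path may differ between library versions):

#include <cuda/api_wrappers.hpp>  // assumed umbrella header; path may differ by version

// A hypothetical helper: how many warps does a block of a given size contain?
// Usable in constexpr contexts, which the built-in warpSize identifier is not.
__host__ __device__ constexpr unsigned warps_per_block(unsigned threads_per_block)
{
    return (threads_per_block + cuda::warp_size - 1) / cuda::warp_size;
}

static_assert(warps_per_block(256) == 8, "with 32-lane warps, 256 threads form 8 warps");

__global__ void per_warp_sums(const float* data, float* sums)
{
    // The shared-memory array extent must be a compile-time constant -
    // exactly what cuda::warp_size provides and warpSize does not.
    __shared__ float warp_sum[warps_per_block(256)];
    // ... reduction body elided ...
}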

◆ anonymous enum

anonymous enum : bool

Thread block cooperativity control for kernel launches.

Enumerator
thread_blocks_may_cooperate 

Thread groups may span multiple blocks, so that they can synchronize their actions.

thread_blocks_may_not_cooperate 

Thread blocks are not allowed to synchronize (the default, and likely faster, execution mode)

◆ host_thread_synch_scheduling_policy_t

Scheduling policies the Runtime API may use when the host-side thread it is running in needs to wait for results from a certain device.

Enumerator
heuristic 

Default behavior; yield or spin based on a heuristic.

Default behavior when the flags parameter is zero: CUDA uses a heuristic based on the number of active CUDA contexts in the process (C) and the number of logical processors in the system (P). If C > P, CUDA yields to other OS threads when waiting for the device; otherwise, it does not yield and actively spins on the processor.

spin 

Keep control and spin-check for result availability.

Instruct CUDA to actively spin when waiting for results from the device. This can decrease latency when waiting for the device, but may lower the performance of CPU threads if they are performing work in parallel with the CUDA thread.

block 

Block the thread until results are available.

Instruct CUDA to block the CPU thread on a synchronization primitive when waiting for the device to finish work.

yield 

Yield control while waiting for results.

Instruct CUDA to yield its thread when waiting for results from the device. This can increase latency when waiting for the device, but can increase the performance of CPU threads performing work in parallel with the device.

automatic 

see heuristic
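
As an illustration only - the setter name set_synch_scheduling_policy below is an assumption about the device_t interface and may differ between library versions:

#include <cuda/api_wrappers.hpp>  // assumed umbrella header

void prefer_blocking_waits()
{
    auto device = cuda::device::current::get();
    // Assumed setter name; the corresponding Runtime API call is cudaSetDeviceFlags.
    device.set_synch_scheduling_policy(cuda::host_thread_synch_scheduling_policy_t::block);
}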

◆ multiprocessor_cache_preference_t

L1-vs-shared-memory balance option.

In some GPU micro-architectures, it's possible to have the multiprocessors change the balance in the allocation of L1-cache-like resources between actual L1 cache and shared memory; these are the possible choices.

Enumerator
no_preference 

No preference for more L1 cache or for more shared memory; the API can do as it pleases.

equal_l1_and_shared_memory 

Divide the cache resources equally between actual L1 cache and shared memory.

prefer_shared_memory_over_l1 

Divide the cache resources to maximize available shared memory at the expense of L1 cache.

prefer_l1_over_shared_memory 

Divide the cache resources to maximize available L1 cache at the expense of shared memory.
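
A sketch of choosing a preference for a whole device; the setter name set_cache_preference is an assumption about the device_t interface:

#include <cuda/api_wrappers.hpp>  // assumed umbrella header

void favor_l1_cache(cuda::device_t& device)
{
    // Assumed setter name; the corresponding Runtime API call is cudaDeviceSetCacheConfig.
    device.set_cache_preference(
        cuda::multiprocessor_cache_preference_t::prefer_l1_over_shared_memory);
}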

◆ multiprocessor_shared_memory_bank_size_option_t

enum cuda::multiprocessor_shared_memory_bank_size_option_t : ::std::underlying_type< cudaSharedMemConfig >::type

A physical core (SM)'s shared memory has multiple "banks"; at most one datum per bank may be accessed simultaneously, while data in different banks can be accessed in parallel.

The number of banks and the bank size differ between GPU architecture generations; in some of them (e.g. Kepler) they are configurable, and you can trade the number of banks for bank size, in case that makes sense for your data access pattern, by using device_t::shared_memory_bank_size.
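
A sketch of switching to wide banks; device_t::shared_memory_bank_size is named above, while the setter name set_shared_memory_bank_size is an assumption:

#include <cuda/api_wrappers.hpp>  // assumed umbrella header

void use_eight_byte_banks(cuda::device_t& device)
{
    if (device.shared_memory_bank_size() != cuda::eight_bytes_per_bank) {
        // Assumed setter name; wraps cudaDeviceSetSharedMemConfig.
        device.set_shared_memory_bank_size(cuda::eight_bytes_per_bank);
    }
}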

Function Documentation

◆ enqueue_launch()

template<typename Kernel , typename... KernelParameters>
void cuda::enqueue_launch ( bool  thread_block_cooperation,
Kernel  kernel_function,
const stream_t &  stream,
launch_configuration_t  launch_configuration,
KernelParameters &&...  parameters 
)
inline

Enqueues a kernel on a stream (=queue) on the current CUDA device.

CUDA's 'chevron' kernel launch syntax cannot be compiled in proper C++. Thus, every kernel launch must at some point reach code compiled with CUDA's nvcc. Naively, every distinct kernel (perhaps up to template specialization) would require writing its own wrapper C++ function to launch it. This function, however, constitutes a single minimal wrapper around the CUDA kernel launch, which may be called from proper C++ code (across translation unit boundaries - the caller is compiled with a C++ compiler, the callee with nvcc).

This function is similar to C++17's ::std::apply, or to a beta-reduction in lambda calculus: it applies a function to its arguments; the difference is in the nature of the function (a CUDA kernel) and in that the function application requires setting CUDA-related launch parameters in addition to the function's own.

As kernels do not return values, neither does this function. It also contains no hooks, logging commands etc. - if you want those, write an additional wrapper (perhaps calling this one in turn).

Parameters
thread_block_cooperation: if true, use CUDA's "cooperative launch" mechanism, which enables more flexible synchronization capabilities (see CUDA C Programming Guide, C.3 Grid Synchronization). Note that this is a requirement of the kernel function rather than merely an arbitrary choice.
kernel_function: the kernel to apply. Pass it just as it is, as though it were any other function. Note: if the kernel is templated, you must pass it fully instantiated. Alternatively, you can pass a kernel_t wrapping the raw pointer to the function.
stream: the CUDA hardware command queue on which to place the command to launch the kernel (affects the scheduling of the launch and the execution)
launch_configuration: a kernel is launched on a grid of blocks of threads, with an allowance of shared memory per block in the grid; this defines how the grid will look and what the shared memory allowance will be (see {cuda::launch_configuration_t})
parameters: whatever parameters kernel_function takes
Note
If the Kernel type is kernel_t, it will already carry a thread_block_cooperation setting, so using this variant of enqueue_launch is somewhat redundant; at any rate, the value passed for thread_block_cooperation must match the kernel function's needs, which the kernel_t wrapper is assumed to indicate. Behavior on mismatch is undefined.
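
A minimal usage sketch (the kernel, its parameters, and the construction of grid and block dimensions from single extents are illustrative assumptions; this code belongs in a .cu translation unit since it defines the kernel itself):

#include <cuda/api_wrappers.hpp>  // assumed umbrella header

__global__ void scale(float* data, float factor, size_t n)
{
    auto i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) { data[i] *= factor; }
}

void scale_on_stream(const cuda::stream_t& stream, float* device_data, size_t n)
{
    auto config = cuda::make_launch_config(
        (n + 255) / 256,                   // blocks in the grid
        256,                               // threads per block
        cuda::no_dynamic_shared_memory);
    // This kernel needs no grid-wide synchronization, so cooperation is not requested.
    cuda::enqueue_launch(cuda::thread_blocks_may_not_cooperate,
                         scale, stream, config, device_data, 2.0f, n);
}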

◆ force_runtime_initialization()

void cuda::force_runtime_initialization ( )
inline

Ensures the CUDA runtime has fully initialized.

Note
The CUDA runtime uses lazy initialization, so that until you perform certain actions, the CUDA driver is not used to create a context, nothing is done on the device etc. This function forces this initialization to happen immediately, while not having any other effect.
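
For example (a trivial sketch, assuming the umbrella header below):

#include <cuda/api_wrappers.hpp>  // assumed umbrella header

int main()
{
    // Pay the one-time context-creation cost up front, so that later
    // timed sections measure only the work we actually care about.
    cuda::force_runtime_initialization();
    // ... benchmarking or latency-sensitive code would follow ...
}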

◆ launch()

template<typename Kernel , typename... KernelParameters>
void cuda::launch ( Kernel  kernel,
launch_configuration_t  launch_configuration,
KernelParameters &&...  parameters 
)
inline

Variant of enqueue_launch for use with the default stream on the current device.

Note
This isn't called enqueue since the default stream is synchronous.
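
A usage sketch for the default stream (the kernel and its dimensions are illustrative; construction of the launch configuration from single extents is assumed):

#include <cuda/api_wrappers.hpp>  // assumed umbrella header

__global__ void fill(int* data, int value, size_t n)
{
    auto i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) { data[i] = value; }
}

void fill_on_default_stream(int* device_data, size_t n)
{
    auto config = cuda::make_launch_config((n + 127) / 128, 128);
    // Runs on the current device's default stream; per the note above,
    // this is not named "enqueue", as the default stream is synchronous.
    cuda::launch(fill, config, device_data, 42, n);
}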

◆ synchronize() [1/2]

void cuda::synchronize ( const event_t &  event )
inline

Waits for a specified event to conclude before returning control to the calling code.

Todo:
Determine how this waiting takes place (as opposed to stream synchronization).
Parameters
event: the event for whose occurrence to wait; must be scheduled to occur on some stream (possibly a different stream)
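
A sketch of waiting on an event; the stream.enqueue.event(...) call used here to record the event is an assumption about the stream_t interface:

#include <cuda/api_wrappers.hpp>  // assumed umbrella header

void wait_for_marker(cuda::stream_t& stream, cuda::event_t& work_done)
{
    // Assumed recording interface; the event must be scheduled on some stream
    // before it makes sense to wait on it.
    stream.enqueue.event(work_done);
    cuda::synchronize(work_done);  // host waits until the event has occurred
}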

◆ synchronize() [2/2]

void cuda::synchronize ( device_t &  device )
inline

Suspends execution until all previously-scheduled tasks on the specified device (all contexts, all streams) have concluded.

Depending on the host_thread_synch_scheduling_policy_t set for this device, the thread calling this method will either yield, spin or block until this completion.

◆ throw_if_error() [1/2]

void cuda::throw_if_error ( cuda::status_t  status,
const ::std::string &  message 
)
inline noexcept

Does nothing...

unless the status indicates an error, in which case a cuda::runtime_error exception is thrown

Parameters
status: should be cuda::status::success - otherwise, an exception is thrown
message: an extra description message to add to the exception
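
For example, when checking a raw Runtime API call (cudaDeviceSynchronize is used purely as an illustration):

#include <cuda/api_wrappers.hpp>  // assumed umbrella header
#include <cuda_runtime.h>

void checked_device_synchronization()
{
    auto status = cudaDeviceSynchronize();
    // Throws cuda::runtime_error, with the message attached, on any failure status.
    cuda::throw_if_error(status, "Synchronizing the current device failed");
}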

◆ throw_if_error() [2/2]

void cuda::throw_if_error ( cuda::status_t  status)
inline noexcept

Does nothing - unless the status indicates an error, in which case a cuda::runtime_error exception is thrown.

Parameters
status: should be cuda::status::success - otherwise, an exception is thrown