cuda-api-wrappers
Thin C++-flavored wrappers for the CUDA Runtime API
Namespaces | Classes | Typedefs | Enumerations | Functions
cuda Namespace Reference

All definitions and functionality wrapping the CUDA Runtime API. More...

Namespaces

 device
 Definitions and functionality related to CUDA devices (not including the device wrapper type device_t itself)
 
 event
 Definitions and functionality related to CUDA events (not including the event wrapper type event_t itself)
 
 memory
 Management and operations on memory in different CUDA-recognized spaces.
 
 stream
 Definitions and functionality related to CUDA streams (not including the device wrapper type stream_t itself)
 

Classes

class  array_t
 Owning wrapper for CUDA 2D and 3D arrays. More...
 
class  device_t
 Proxy class for a CUDA device. More...
 
class  event_t
 Proxy class for a CUDA event. More...
 
class  kernel_t
 A non-owning wrapper class for CUDA __global__ functions. More...
 
struct  launch_configuration_t
 Holds the parameters necessary to "launch" a CUDA kernel (i.e. More...
 
class  runtime_error
 A (base?) class for exceptions raised by CUDA code; these errors are thrown by essentially all CUDA Runtime API wrappers upon failure. More...
 
class  stream_t
 Proxy class for a CUDA stream. More...
 
struct  symbol_t
 Object-code symbols. More...
 
class  texture_view
 Use texture memory for optimized read-only cache access. More...
 
struct  version_t
 CUDA Runtime version. More...
 

Typedefs

using queue_t = stream_t
 
using queue_id_t = stream::id_t
 
using combined_version_t = int
 
using status_t = cudaError_t
 Indicates either the result (success or error index) of a CUDA Runtime API call, or the overall status of the Runtime API (which is typically the last triggered error).
 
using size_t = ::std::size_t
 
using dimensionality_t = unsigned
 The index or number of dimensions of an entity (as opposed to the extent in any dimension) - typically just 0, 1, 2 or 3.
 
using native_word_t = unsigned
 

Enumerations

enum  : native_word_t { warp_size = 32 }
 CUDA's NVCC allows the use of the warpSize identifier, without having to define it. More...
 
enum  : bool {
  thread_blocks_may_cooperate = true,
  thread_blocks_may_not_cooperate = false
}
 Thread block cooperativity control for kernel launches. More...
 
enum  : memory::shared::size_t { no_dynamic_shared_memory = 0 }
 
enum  : bool {
  do_take_ownership = true,
  do_not_take_ownership = false
}
 
enum  : bool {
  dont_clear_errors = false,
  do_clear_errors = true
}
 
enum  multiprocessor_cache_preference_t {
  multiprocessor_cache_preference_t::no_preference = cudaFuncCachePreferNone,
  multiprocessor_cache_preference_t::equal_l1_and_shared_memory = cudaFuncCachePreferEqual,
  multiprocessor_cache_preference_t::prefer_shared_memory_over_l1 = cudaFuncCachePreferShared,
  multiprocessor_cache_preference_t::prefer_l1_over_shared_memory = cudaFuncCachePreferL1,
  none = no_preference,
  equal = equal_l1_and_shared_memory,
  prefer_shared = prefer_shared_memory_over_l1,
  prefer_l1 = prefer_l1_over_shared_memory
}
 L1-vs-shared-memory balance option. More...
 
enum  multiprocessor_shared_memory_bank_size_option_t : ::std::underlying_type< cudaSharedMemConfig >::type {
  device_default = cudaSharedMemBankSizeDefault,
  four_bytes_per_bank = cudaSharedMemBankSizeFourByte,
  eight_bytes_per_bank = cudaSharedMemBankSizeEightByte
}
 A physical core (SM)'s shared memory has multiple "banks"; at most one datum per bank may be accessed simultaneously, while data in different banks can be accessed in parallel. More...
 
enum  host_thread_synch_scheduling_policy_t : unsigned int {
  heuristic = cudaDeviceScheduleAuto,
  spin = cudaDeviceScheduleSpin,
  block = cudaDeviceScheduleBlockingSync,
  yield = cudaDeviceScheduleYield,
  automatic = heuristic
}
 Scheduling policies the Runtime API may use when the host-side thread it is running in needs to wait for results from a certain device. More...
 

Functions

void synchronize (device_t &device)
 Suspends execution until all previously-scheduled tasks on the specified device (all contexts, all streams) have concluded. More...
 
detail_::all_devices devices ()
 
constexpr bool is_success (status_t status)
 Determine whether the API call returning the specified status had succeeded.
 
constexpr bool is_failure (status_t status)
 Determine whether the API call returning the specified status had failed.
 
inline ::std::string describe (status_t status)
 Obtain a brief textual explanation for a specified kind of CUDA Runtime API status or error code.
 
void throw_if_error (cuda::status_t status, const ::std::string &message) noexcept(false)
 Does nothing... More...
 
void throw_if_error (cuda::status_t status) noexcept(false)
 Does nothing - unless the status indicates an error, in which case a cuda::runtime_error exception is thrown. More...
 
void synchronize (const event_t &event)
 Waits for a specified event to conclude before returning control to the calling code. More...
 
constexpr grid::dimensions_t single_block ()
 A named constructor idiom for {grid::dimensions_t}, which, when used, will result in a grid with a single block.
 
constexpr grid::block_dimensions_t single_thread_per_block ()
 A named constructor idiom for {grid::block_dimensions_t}, which, when used, will result in a grid whose blocks have a single thread.
 
template<typename Kernel , typename... KernelParameters>
void enqueue_launch (bool thread_block_cooperation, Kernel kernel_function, const stream_t &stream, launch_configuration_t launch_configuration, KernelParameters &&... parameters)
 Enqueues a kernel on a stream (=queue) on the current CUDA device. More...
 
template<typename Kernel , typename... KernelParameters>
void enqueue_launch (Kernel kernel_function, const stream_t &stream, launch_configuration_t launch_configuration, KernelParameters &&... parameters)
 A variant of enqueue_launch which uses the default of no cooperation between thread blocks.
 
template<typename Kernel , typename... KernelParameters>
void launch (Kernel kernel, launch_configuration_t launch_configuration, KernelParameters &&... parameters)
 Variant of enqueue_launch for use with the default stream on the current device. More...
 
void force_runtime_initialization ()
 Ensures the CUDA runtime has fully initialized. More...
 
void synchronize (const stream_t &stream)
 
bool operator!= (const stream_t &lhs, const stream_t &rhs) noexcept
 
bool operator== (const texture_view &lhs, const texture_view &rhs) noexcept
 
bool operator!= (const texture_view &lhs, const texture_view &rhs) noexcept
 
constexpr launch_configuration_t make_launch_config (grid::dimensions_t grid_dimensions, grid::block_dimensions_t block_dimensions, memory::shared::size_t dynamic_shared_memory_size=0u) noexcept
 A named constructor idiom for launch_configuration_t
 
constexpr bool operator== (const launch_configuration_t lhs, const launch_configuration_t &rhs) noexcept
 

Detailed Description

All definitions and functionality wrapping the CUDA Runtime API.

Enumeration Type Documentation

◆ anonymous enum

anonymous enum : native_word_t

CUDA's NVCC allows the use of the warpSize identifier, without having to define it.

Un(?)fortunately, warpSize is not a compile-time constant; it is replaced at some point with the appropriate immediate value, which goes into the SASS instruction as a literal. This is apparently due to the theoretical possibility of different warp sizes in the future. However, it is useful - both for host-side and, more importantly, for device-side code - to have the warp size available at compile time. This allows all sorts of useful optimizations, as well as its use in constexpr code.

If nVIDIA comes out with 64-lanes-per-warp GPUs - we'll refactor this.
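
For illustration, a minimal sketch of compile-time use (the helper and kernel names are hypothetical, and the umbrella header path may differ between library versions):

#include <cuda/api_wrappers.hpp>  // assumed umbrella header; path may differ by version

// A hypothetical helper: how many warps does a block of a given size contain?
// Usable in constexpr contexts, which the built-in warpSize identifier is not.
__host__ __device__ constexpr unsigned warps_per_block(unsigned threads_per_block)
{
    return (threads_per_block + cuda::warp_size - 1) / cuda::warp_size;
}

static_assert(warps_per_block(256) == 8, "with 32-lane warps, 256 threads form 8 warps");

__global__ void per_warp_sums(const float* data, float* sums)
{
    // The shared-memory array extent must be a compile-time constant -
    // exactly what cuda::warp_size provides and warpSize does not.
    __shared__ float warp_sum[warps_per_block(256)];
    // ... reduction body elided ...
}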

◆ anonymous enum

anonymous enum : bool

Thread block cooperativity control for kernel launches.

Enumerator
thread_blocks_may_cooperate 

Thread groups may span multiple blocks, so that they can synchronize their actions.

thread_blocks_may_not_cooperate 

Thread blocks are not allowed to synchronize (the default, and likely faster, execution mode)

◆ host_thread_synch_scheduling_policy_t

Scheduling policies the Runtime API may use when the host-side thread it is running in needs to wait for results from a certain device.

Enumerator
heuristic 

Default behavior; yield or spin based on a heuristic.

Default behavior when the flags parameter is zero: CUDA uses a heuristic based on the number of active CUDA contexts in the process (C) and the number of logical processors in the system (P). If C > P, CUDA yields to other OS threads when waiting for the device; otherwise, it does not yield and actively spins on the processor.

spin 

Keep control and spin-check for result availability.

Instruct CUDA to actively spin when waiting for results from the device. This can decrease latency when waiting for the device, but may lower the performance of CPU threads if they are performing work in parallel with the CUDA thread.

block 

Block the thread until results are available.

Instruct CUDA to block the CPU thread on a synchronization primitive when waiting for the device to finish work.

yield 

Yield control while waiting for results.

Instruct CUDA to yield its thread when waiting for results from the device. This can increase latency when waiting for the device, but can increase the performance of CPU threads performing work in parallel with the device.

automatic 

see heuristic
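
As an illustration only - the setter name set_synch_scheduling_policy below is an assumption about the device_t interface and may differ between library versions:

#include <cuda/api_wrappers.hpp>  // assumed umbrella header

void prefer_blocking_waits()
{
    auto device = cuda::device::current::get();
    // Assumed setter name; the corresponding Runtime API call is cudaSetDeviceFlags.
    device.set_synch_scheduling_policy(cuda::host_thread_synch_scheduling_policy_t::block);
}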

◆ multiprocessor_cache_preference_t

L1-vs-shared-memory balance option.

In some GPU micro-architectures, it's possible to have the multiprocessors change the balance in the allocation of L1-cache-like resources between actual L1 cache and shared memory; these are the possible choices.

Enumerator
no_preference 

No preference for more L1 cache or for more shared memory; the API can do as it pleases.

equal_l1_and_shared_memory 

Divide the cache resources equally between actual L1 cache and shared memory.

prefer_shared_memory_over_l1 

Divide the cache resources to maximize available shared memory at the expense of L1 cache.

prefer_l1_over_shared_memory 

Divide the cache resources to maximize available L1 cache at the expense of shared memory.
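
A sketch of choosing a preference for a whole device; the setter name set_cache_preference is an assumption about the device_t interface:

#include <cuda/api_wrappers.hpp>  // assumed umbrella header

void favor_l1_cache(cuda::device_t& device)
{
    // Assumed setter name; the corresponding Runtime API call is cudaDeviceSetCacheConfig.
    device.set_cache_preference(
        cuda::multiprocessor_cache_preference_t::prefer_l1_over_shared_memory);
}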

◆ multiprocessor_shared_memory_bank_size_option_t

enum cuda::multiprocessor_shared_memory_bank_size_option_t : ::std::underlying_type< cudaSharedMemConfig >::type

A physical core (SM)'s shared memory has multiple "banks"; at most one datum per bank may be accessed simultaneously, while data in different banks can be accessed in parallel.

The number of banks and the bank size differ between GPU architecture generations; in some of them (e.g. Kepler) they are configurable, and you can trade the number of banks for bank size, in case that makes sense for your data access pattern, by using device_t::shared_memory_bank_size.
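
A sketch of switching to wide banks; device_t::shared_memory_bank_size is named above, while the setter name set_shared_memory_bank_size is an assumption:

#include <cuda/api_wrappers.hpp>  // assumed umbrella header

void use_eight_byte_banks(cuda::device_t& device)
{
    if (device.shared_memory_bank_size() != cuda::eight_bytes_per_bank) {
        // Assumed setter name; wraps cudaDeviceSetSharedMemConfig.
        device.set_shared_memory_bank_size(cuda::eight_bytes_per_bank);
    }
}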

Function Documentation

◆ enqueue_launch()

template<typename Kernel , typename... KernelParameters>
void cuda::enqueue_launch ( bool  thread_block_cooperation,
Kernel  kernel_function,
const stream_t &  stream,
launch_configuration_t  launch_configuration,
KernelParameters &&...  parameters 
)
inline

Enqueues a kernel on a stream (=queue) on the current CUDA device.

CUDA's 'chevron' kernel launch syntax cannot be compiled in proper C++. Thus, every kernel launch must at some point reach code compiled with CUDA's nvcc. Naively, every distinct kernel (perhaps up to template specialization) would require writing its own wrapper C++ function to launch it. This function, however, constitutes a single minimal wrapper around the CUDA kernel launch, which may be called from proper C++ code (across translation unit boundaries - the caller is compiled with a C++ compiler, the callee with nvcc).

This function is similar to C++17's ::std::apply, or to a beta-reduction in lambda calculus: it applies a function to its arguments; the difference is in the nature of the function (a CUDA kernel) and in that the function application requires setting CUDA-related launch parameters in addition to the function's own.

As kernels do not return values, neither does this function. It also contains no hooks, logging commands etc. - if you want those, write an additional wrapper (perhaps calling this one in turn).

Parameters
thread_block_cooperation: if true, use CUDA's "cooperative launch" mechanism, which enables more flexible synchronization capabilities (see CUDA C Programming Guide, C.3 Grid Synchronization). Note that this is a requirement of the kernel function rather than merely an arbitrary choice.
kernel_function: the kernel to apply. Pass it just as it is, as though it were any other function. Note: if the kernel is templated, you must pass it fully instantiated. Alternatively, you can pass a kernel_t wrapping the raw pointer to the function.
stream: the CUDA hardware command queue on which to place the command to launch the kernel (affects the scheduling of the launch and the execution)
launch_configuration: a kernel is launched on a grid of blocks of threads, with an allowance of shared memory per block in the grid; this defines how the grid will look and what the shared memory allowance will be (see {cuda::launch_configuration_t})
parameters: whatever parameters kernel_function takes
Note
If the Kernel type is kernel_t, it will already carry a thread_block_cooperation setting, so using this variant of enqueue_launch is somewhat redundant; at any rate, the value passed for thread_block_cooperation must match the kernel function's needs, which the kernel_t wrapper is assumed to indicate. Behavior on mismatch is undefined.
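
A minimal usage sketch (the kernel, its parameters, and the construction of grid and block dimensions from single extents are illustrative assumptions; this code belongs in a .cu translation unit since it defines the kernel itself):

#include <cuda/api_wrappers.hpp>  // assumed umbrella header

__global__ void scale(float* data, float factor, size_t n)
{
    auto i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) { data[i] *= factor; }
}

void scale_on_stream(const cuda::stream_t& stream, float* device_data, size_t n)
{
    auto config = cuda::make_launch_config(
        (n + 255) / 256,                   // blocks in the grid
        256,                               // threads per block
        cuda::no_dynamic_shared_memory);
    // This kernel needs no grid-wide synchronization, so cooperation is not requested.
    cuda::enqueue_launch(cuda::thread_blocks_may_not_cooperate,
                         scale, stream, config, device_data, 2.0f, n);
}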

◆ force_runtime_initialization()

void cuda::force_runtime_initialization ( )
inline

Ensures the CUDA runtime has fully initialized.

Note
The CUDA runtime uses lazy initialization, so that until you perform certain actions, the CUDA driver is not used to create a context, nothing is done on the device etc. This function forces this initialization to happen immediately, while not having any other effect.
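
For example (a trivial sketch, assuming the umbrella header below):

#include <cuda/api_wrappers.hpp>  // assumed umbrella header

int main()
{
    // Pay the one-time context-creation cost up front, so that later
    // timed sections measure only the work we actually care about.
    cuda::force_runtime_initialization();
    // ... benchmarking or latency-sensitive code would follow ...
}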

◆ launch()

template<typename Kernel , typename... KernelParameters>
void cuda::launch ( Kernel  kernel,
launch_configuration_t  launch_configuration,
KernelParameters &&...  parameters 
)
inline

Variant of enqueue_launch for use with the default stream on the current device.

Note
This isn't called enqueue since the default stream is synchronous.
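
A usage sketch for the default stream (the kernel and its dimensions are illustrative; construction of the launch configuration from single extents is assumed):

#include <cuda/api_wrappers.hpp>  // assumed umbrella header

__global__ void fill(int* data, int value, size_t n)
{
    auto i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) { data[i] = value; }
}

void fill_on_default_stream(int* device_data, size_t n)
{
    auto config = cuda::make_launch_config((n + 127) / 128, 128);
    // Runs on the current device's default stream; per the note above,
    // this is not named "enqueue", as the default stream is synchronous.
    cuda::launch(fill, config, device_data, 42, n);
}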

◆ synchronize() [1/2]

void cuda::synchronize ( const event_t &  event )
inline

Waits for a specified event to conclude before returning control to the calling code.

Todo:
Determine how this waiting takes place (as opposed to stream synchronization).
Parameters
event: the event for whose occurrence to wait; must be scheduled to occur on some stream (possibly a different stream)
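
A sketch of waiting on an event; the stream.enqueue.event(...) call used here to record the event is an assumption about the stream_t interface:

#include <cuda/api_wrappers.hpp>  // assumed umbrella header

void wait_for_marker(cuda::stream_t& stream, cuda::event_t& work_done)
{
    // Assumed recording interface; the event must be scheduled on some stream
    // before it makes sense to wait on it.
    stream.enqueue.event(work_done);
    cuda::synchronize(work_done);  // host waits until the event has occurred
}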

◆ synchronize() [2/2]

void cuda::synchronize ( device_t &  device )
inline

Suspends execution until all previously-scheduled tasks on the specified device (all contexts, all streams) have concluded.

Depending on the host_thread_synch_scheduling_policy_t set for this device, the thread calling this method will either yield, spin or block until this completion.

◆ throw_if_error() [1/2]

void cuda::throw_if_error ( cuda::status_t  status,
const ::std::string &  message 
)
inline noexcept

Does nothing...

unless the status indicates an error, in which case a cuda::runtime_error exception is thrown

Parameters
status: should be cuda::status::success - otherwise, an exception is thrown
message: an extra description message to add to the exception
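
For example, when checking a raw Runtime API call (cudaDeviceSynchronize is used purely as an illustration):

#include <cuda/api_wrappers.hpp>  // assumed umbrella header
#include <cuda_runtime.h>

void checked_device_synchronization()
{
    auto status = cudaDeviceSynchronize();
    // Throws cuda::runtime_error, with the message attached, on any failure status.
    cuda::throw_if_error(status, "Synchronizing the current device failed");
}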

◆ throw_if_error() [2/2]

void cuda::throw_if_error ( cuda::status_t  status)
inline noexcept

Does nothing - unless the status indicates an error, in which case a cuda::runtime_error exception is thrown.

Parameters
status: should be cuda::status::success - otherwise, an exception is thrown