cuda-api-wrappers
Thin C++-flavored wrappers for the CUDA Runtime API
cuda Namespace Reference

All definitions and functionality wrapping the CUDA Runtime API. More...

Namespaces

 device
 Definitions and functionality related to CUDA devices (not including the device wrapper type device_t itself)
 
 event
 Definitions and functionality related to CUDA events (not including the event wrapper type event_t itself)
 
 outstanding_error
 Unlike the Runtime API, where every error is outstanding until cleared, the Driver API (which we mostly use) only remembers "sticky" errors - severe errors which corrupt contexts.
 
 rtc
 Run-time compilation of CUDA programs using the NVIDIA NVRTC library.
 
 stream
 Definitions and functionality related to CUDA streams (not including the stream wrapper type stream_t itself)
 

Classes

class  array_t
 Owning wrapper for CUDA 2D and 3D arrays. More...
 
struct  caching
 
struct  caching< memory_operation_t::load >
 
struct  caching< memory_operation_t::store >
 
class  context_t
 Wrapper class for a CUDA context. More...
 
class  event_t
 Wrapper class for a CUDA event. More...
 
class  launch_config_builder_t
 
struct  launch_configuration_t
 
class  link_t
 Wrapper class for a CUDA link (a process of linking compiled code together into an executable binary, using CUDA, at run-time) More...
 
class  module_t
 Wrapper class for a CUDA code module. More...
 
class  runtime_error
 A class for exceptions raised by CUDA code; these errors are thrown by essentially all CUDA Runtime API wrappers upon failure. More...
 
struct  span
 A "poor man's" span class. More...
 
class  stream_t
 Proxy class for a CUDA stream. More...
 
class  texture_view
 Use texture memory for optimized read-only cache access. More...
 
struct  version_t
 CUDA Runtime version. More...
 

Typedefs

template<memory_operation_t Op>
using caching_mode_t = typename caching< Op >::mode
 
using nullopt_t = detail_::no_value_t
 
template<typename T >
using optional = cuda::detail_::poor_mans_optional< T >
 
using status_t = CUresult
 Indicates either the result (success or error index) of a CUDA Runtime or Driver API call, or the overall status of the API (which is typically the last triggered error). More...
 
using size_t = ::std::size_t
 
using dimensionality_t = size_t
 The index or number of dimensions of an entity (as opposed to the extent in any dimension) - typically just 0, 1, 2 or 3.
 
using native_word_t = unsigned
 
using uuid_t = CUuuid
 
template<typename T >
using dynarray = ::std::vector< T >
 
using combined_version_t = int
 
using string_view = bpstd::string_view
 

Enumerations

enum  memory_operation_t {
  load,
  store
}
 
enum  : native_word_t { warp_size = 32 }
 CUDA's NVCC allows the use of the warpSize identifier, without having to define it. More...
 
enum  : bool {
  thread_blocks_may_cooperate = true,
  thread_blocks_may_not_cooperate = false
}
 Thread block cooperativity control for kernel launches. More...
 
enum  : memory::shared::size_t { no_dynamic_shared_memory = 0 }
 
enum  : bool {
  do_take_ownership = true,
  do_not_take_ownership = false
}
 
enum  : bool {
  do_hold_primary_context_refcount_unit = true,
  do_not_hold_primary_context_refcount_unit = false
}
 
enum  : bool {
  dont_clear_errors = false,
  do_clear_errors = true
}
 
enum  multiprocessor_cache_preference_t : ::std::underlying_type< CUfunc_cache_enum >::type {
  multiprocessor_cache_preference_t::no_preference = CU_FUNC_CACHE_PREFER_NONE,
  multiprocessor_cache_preference_t::equal_l1_and_shared_memory = CU_FUNC_CACHE_PREFER_EQUAL,
  multiprocessor_cache_preference_t::prefer_shared_memory_over_l1 = CU_FUNC_CACHE_PREFER_SHARED,
  multiprocessor_cache_preference_t::prefer_l1_over_shared_memory = CU_FUNC_CACHE_PREFER_L1,
  none = no_preference,
  equal = equal_l1_and_shared_memory,
  prefer_shared = prefer_shared_memory_over_l1,
  prefer_l1 = prefer_l1_over_shared_memory
}
 L1-vs-shared-memory balance option. More...
 
enum  multiprocessor_shared_memory_bank_size_option_t : ::std::underlying_type< CUsharedconfig >::type {
  device_default = CU_SHARED_MEM_CONFIG_DEFAULT_BANK_SIZE,
  four_bytes_per_bank = CU_SHARED_MEM_CONFIG_FOUR_BYTE_BANK_SIZE,
  eight_bytes_per_bank = CU_SHARED_MEM_CONFIG_EIGHT_BYTE_BANK_SIZE
}
 A physical core (SM)'s shared memory has multiple "banks"; at most one datum per bank may be accessed simultaneously, while data in different banks can be accessed in parallel. More...
 
enum  source_kind_t {
  cuda_cpp = 0,
  ptx = 1
}
 

Functions

template<memory_operation_t Op>
const char * name (caching_mode_t< Op > mode)
 
template<memory_operation_t Op>
inline ::std::ostream & operator<< (::std::ostream &os, caching_mode_t< Op > lcm)
 
void synchronize (const context_t &context)
 
bool operator== (const context_t &lhs, const context_t &rhs)
 
bool operator!= (const context_t &lhs, const context_t &rhs)
 
detail_::all_devices devices ()
 
void throw_if_error (status_t status, const ::std::string &message) noexcept(false)
 Does nothing... More...
 
void throw_if_error (cudaError_t status, const ::std::string &message) noexcept(false)
 
void throw_if_error (status_t status, ::std::string &&message) noexcept(false)
 
void throw_if_error (cudaError_t status, ::std::string &&message) noexcept(false)
 
void throw_if_error (status_t status) noexcept(false)
 Does nothing - unless the status indicates an error, in which case a cuda::runtime_error exception is thrown. More...
 
void throw_if_error (cudaError_t status) noexcept(false)
 
void wait (const event_t &event)
 Have the calling thread wait - either busy-waiting or blocking - and return only after this event has occurred (see event_t::has_occurred()). More...
 
void synchronize (const event_t &event)
 
launch_config_builder_t launch_config_builder ()
 
constexpr bool operator== (const launch_configuration_t lhs, const launch_configuration_t &rhs) noexcept
 
constexpr bool operator!= (const launch_configuration_t lhs, const launch_configuration_t &rhs) noexcept
 
void initialize_driver ()
 Initializes the CUDA driver. More...
 
void ensure_driver_is_initialized ()
 
void synchronize (const device_t &device)
 
template<typename Kernel , typename... KernelParameters>
void enqueue_launch (Kernel &&kernel, const stream_t &stream, launch_configuration_t launch_configuration, KernelParameters &&... parameters)
 
template<typename Kernel , typename... KernelParameters>
void launch (Kernel &&kernel, launch_configuration_t launch_configuration, KernelParameters &&... parameters)
 
template<typename SpanOfConstVoidPtrLike >
void launch_type_erased (const kernel_t &kernel, const stream_t &stream, launch_configuration_t launch_configuration, SpanOfConstVoidPtrLike marshalled_arguments)
 
void synchronize (const stream_t &stream)
 
bool operator!= (const stream_t &lhs, const stream_t &rhs) noexcept
 
bool operator== (const texture_view &lhs, const texture_view &rhs) noexcept
 
bool operator!= (const texture_view &lhs, const texture_view &rhs) noexcept
 
template<source_kind_t Kind>
constexpr bool is_failure (rtc::status_t< Kind > status)
 Determine whether the API call returning the specified status had failed.
 
template<source_kind_t Kind>
void throw_if_error (rtc::status_t< Kind > status, const ::std::string &message) noexcept(false)
 Does nothing... More...
 
template<source_kind_t Kind>
void throw_if_error (rtc::status_t< Kind > status) noexcept(false)
 Does nothing - unless the status indicates an error, in which case a cuda::runtime_error exception is thrown. More...
 
constexpr bool is_success (status_t status)
 Determine whether the API call returning the specified status had succeeded.
 
constexpr bool is_success (cudaError_t status)
 
constexpr bool is_failure (status_t status)
 Determine whether the API call returning the specified status had failed.
 
constexpr bool is_failure (cudaError_t status)
 
inline ::std::string describe (status_t status)
 Obtain a brief textual explanation for a specified kind of CUDA Runtime API status or error code.
 
inline ::std::string describe (cudaError_t status)
 
template<source_kind_t Kind>
constexpr bool is_success (rtc::status_t< Kind > status)
 Determine whether the API call returning the specified status had succeeded.
 
inline ::std::string describe (rtc::status_t< cuda_cpp > status)
 Obtain a brief textual explanation for a specified NVRTC status or error code.
 

Variables

constexpr nullopt_t nullopt {}
 

Detailed Description

All definitions and functionality wrapping the CUDA Runtime API.

Typedef Documentation

◆ status_t

using cuda::status_t = typedef CUresult

Indicates either the result (success or error index) of a CUDA Runtime or Driver API call, or the overall status of the API (which is typically the last triggered error).

Note
This single type really needs to double as both CUresult for driver API calls and cudaError_t for runtime API calls. These aren't actually the same type - but they are both enums, sharing most of the defined values. See also error.hpp where we unify the set of errors.
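
For instance, the same wrapper facility accepts a status of either origin. A minimal sketch (the raw calls are standard CUDA API; only throw_if_error is this library's):

    #include <cuda.h>              // Driver API
    #include <cuda_runtime_api.h>  // Runtime API

    int driver_version;
    cuda::status_t dr_status = cuDriverGetVersion(&driver_version); // Driver API call
    cuda::throw_if_error(dr_status, "obtaining the driver version");

    cudaError_t rt_status = cudaDeviceSynchronize();                // Runtime API call
    cuda::throw_if_error(rt_status, "synchronizing the current device");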

Enumeration Type Documentation

◆ anonymous enum

anonymous enum : native_word_t

CUDA's NVCC allows the use of the warpSize identifier, without having to define it.

Un(?)fortunately, warpSize is not a compile-time constant; it is replaced at some point with the appropriate immediate value, which goes into the SASS instruction as a literal. This is apparently due to the theoretical possibility of different warp sizes in the future. However, it is useful - both for host-side and, more importantly, for device-side code - to have the warp size available at compile time. This allows all sorts of useful optimizations, as well as its use in constexpr code.

If nVIDIA comes out with 64-lanes-per-warp GPUs - we'll refactor this.
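
Since cuda::warp_size is a genuine compile-time constant, it can appear where the built-in warpSize cannot; a small device-side sketch:

    // Not possible with warpSize; fine with cuda::warp_size:
    static_assert(cuda::warp_size == 32, "unexpected warp size");

    __device__ float warp_sum(float value)
    {
        // Full-warp tree reduction; the loop bound is resolved at compile time
        for (unsigned delta = cuda::warp_size / 2; delta > 0; delta /= 2) {
            value += __shfl_down_sync(0xFFFFFFFFu, value, delta);
        }
        return value;
    }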

◆ anonymous enum

anonymous enum : bool

Thread block cooperativity control for kernel launches.

Enumerator
thread_blocks_may_cooperate 

Thread groups may span multiple blocks, so that they can synchronize their actions.

thread_blocks_may_not_cooperate 

Thread blocks are not allowed to synchronize (the default, and likely faster, execution mode)
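
A sketch of requesting a cooperative launch via the configuration builder; note that the builder method names used here (overall_size, block_size, block_cooperation) are assumptions, not something this page documents:

    auto config = cuda::launch_config_builder()
        .overall_size(num_elements)   // total thread count (num_elements defined elsewhere)
        .block_size(256)
        .block_cooperation(cuda::thread_blocks_may_cooperate)
        .build();
    // The resulting launch_configuration_t may then be passed to cuda::launch()
    // or cuda::enqueue_launch() for a grid whose blocks may synchronize.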

◆ multiprocessor_cache_preference_t

enum cuda::multiprocessor_cache_preference_t : ::std::underlying_type< CUfunc_cache_enum >::type
strong

L1-vs-shared-memory balance option.

In some GPU micro-architectures, it's possible to have the multiprocessors change the balance in the allocation of L1-cache-like resources between actual L1 cache and shared memory; these are the possible choices.

Enumerator
no_preference 

No preference for more L1 cache or for more shared memory; the API can do as it pleases.

equal_l1_and_shared_memory 

Divide the cache resources equally between actual L1 cache and shared memory.

prefer_shared_memory_over_l1 

Divide the cache resources to maximize available shared memory at the expense of L1 cache.

prefer_l1_over_shared_memory 

Divide the cache resources to maximize available L1 cache at the expense of shared memory.
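
As an illustration, a device-wide preference might be set as follows; the setter name set_cache_preference is an assumption rather than something documented on this page:

    auto device = cuda::device::current::get();
    // Favor shared memory capacity over L1 cache on this device's multiprocessors
    device.set_cache_preference(
        cuda::multiprocessor_cache_preference_t::prefer_shared_memory_over_l1);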

◆ multiprocessor_shared_memory_bank_size_option_t

enum cuda::multiprocessor_shared_memory_bank_size_option_t : ::std::underlying_type< CUsharedconfig >::type

A physical core (SM)'s shared memory has multiple "banks"; at most one datum per bank may be accessed simultaneously, while data in different banks can be accessed in parallel.

The number of banks and the bank size differ between GPU architecture generations; but in some of them (e.g. Kepler), they are configurable: you can trade the number of banks for bank size, in case that makes sense for your data access pattern, by using device_t::shared_memory_bank_size.
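
A sketch of choosing wider banks, e.g. for double-precision-heavy access patterns; set_shared_memory_bank_size is an assumed setter counterpart of the shared_memory_bank_size getter mentioned above:

    auto device = cuda::device::current::get();
    // Trade bank count for bank width on architectures which support it (e.g. Kepler)
    device.set_shared_memory_bank_size(
        cuda::multiprocessor_shared_memory_bank_size_option_t::eight_bytes_per_bank);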

Function Documentation

◆ initialize_driver()

void cuda::initialize_driver ( )
inline

Initializes the CUDA driver, if it has not already been initialized.

Note
The driver need only be initialized once per process; calling this function after initialization has already taken place is harmless.

◆ throw_if_error() [1/4]

template<source_kind_t Kind>
void cuda::throw_if_error ( rtc::status_t< Kind >  status,
const ::std::string &  message 
)
inline noexcept

Does nothing...

unless the status indicates an error, in which case a cuda::runtime_error exception is thrown

Parameters
status   should be cuda::status::success - otherwise an exception is thrown
message   An extra description message to add to the exception

◆ throw_if_error() [2/4]

template<source_kind_t Kind>
void cuda::throw_if_error ( rtc::status_t< Kind >  status)
inline noexcept

Does nothing - unless the status indicates an error, in which case a cuda::runtime_error exception is thrown.

Parameters
status   should be cuda::status::success - otherwise an exception is thrown

◆ throw_if_error() [3/4]

void cuda::throw_if_error ( status_t  status,
const ::std::string &  message 
)
inline noexcept

Does nothing...

unless the status indicates an error, in which case a cuda::runtime_error exception is thrown

Parameters
status   should be cuda::status::success - otherwise an exception is thrown
message   An extra description message to add to the exception

◆ throw_if_error() [4/4]

void cuda::throw_if_error ( status_t  status)
inline noexcept

Does nothing - unless the status indicates an error, in which case a cuda::runtime_error exception is thrown.

Parameters
status   should be cuda::status::success - otherwise an exception is thrown
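
Typical usage, wrapping a raw Runtime API call:

    auto status = cudaSetDevice(0); // a plain Runtime API call returning cudaError_t
    cuda::throw_if_error(status, "setting device 0 as the current device");
    // If status was not cuda::status::success, a cuda::runtime_error carrying
    // both the status and the message has been thrown by this point.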

◆ wait()

void cuda::wait ( const event_t &  event)
inline

Have the calling thread wait - either busy-waiting or blocking - and return only after this event has occurred (see event_t::has_occurred()).

Todo:
figure out what happens if the event has not been recorded before this call is made.
Note
the waiting will occur either passively (e.g. like waiting for information on a file descriptor), or actively (by busy-waiting) - depending on the flag with which the event was created.
Parameters
event   the event for whose occurrence to wait; it must be scheduled to occur on some stream (possibly a different stream)
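
A minimal sketch of waiting on an event; the creation and enqueueing spellings used here (cuda::event::create, stream.enqueue.event, cuda::stream::async) are assumptions in the wrappers' general style, and enqueue_some_work is a hypothetical placeholder:

    auto device = cuda::device::current::get();
    auto stream = device.create_stream(cuda::stream::async);
    auto event  = cuda::event::create(device);
    enqueue_some_work(stream);    // hypothetical: schedule kernels/copies on the stream
    stream.enqueue.event(event);  // the event occurs once the preceding work is done
    cuda::wait(event);            // returns only after the event has occurred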