cuda-api-wrappers
Thin C++-flavored wrappers for the CUDA Runtime API
Definitions and functionality wrapping CUDA APIs. More...
Namespaces | |
array | |
CUDA facilities for interpolating access to multidimensional array objects, in particular via the array_t class. | |
device | |
Definitions and functionality related to CUDA devices (not including the device wrapper type cuda::device_t itself) | |
event | |
CUDA timing functionality, via events and their related code (not including the event wrapper type event_t itself) | |
link | |
Definitions related to CUDA linking-processes, captured by the link_t wrapper class. | |
memory | |
Representation, allocation and manipulation of CUDA-related memory, of different kinds. | |
outstanding_error | |
Unlike the Runtime API, where every error is outstanding until cleared, the Driver API, which we use mostly, only remembers "sticky" errors - severe errors which corrupt contexts. | |
profiling | |
Interaction with NVIDIA's profiler, particularly tagging, marking and indications of entities it will pick up and register/display. | |
rtc | |
Real-time compilation of programs using the NVIDIA libraries. | |
stream | |
Definitions and functionality related to CUDA streams (not including the stream wrapper type stream_t itself) | |
Classes | |
class | array_t |
Owning wrapper for CUDA 2D and 3D arrays. More... | |
struct | caching |
A helper struct for templatizing caching<Op>::mode. More... | |
struct | caching< memory_operation_t::load > |
Load operation caching settings. More... | |
struct | caching< memory_operation_t::store > |
Store operation caching settings. More... | |
class | context_t |
Wrapper class for a CUDA context. More... | |
class | device_t |
Wrapper class for a CUDA device. More... | |
class | event_t |
Wrapper class for a CUDA event. More... | |
class | kernel_t |
A non-owning wrapper for CUDA kernels - whether they be __global__ functions compiled a priori, or the result of dynamic NVRTC compilation, or obtained in some other future way. More... | |
class | launch_config_builder_t |
A convenience class for gradually constructing a launch_configuration_t instance, as per the "builder pattern". More... | |
struct | launch_configuration_t |
The full set of possible configuration parameters for launching a kernel on a GPU. More... | |
class | link_t |
Wrapper class for a CUDA link (a process of linking compiled code together into an executable binary, using CUDA, at run-time) More... | |
class | module_t |
Wrapper class for a CUDA code module. More... | |
class | runtime_error |
A (base?) class for exceptions raised by CUDA code; these errors are thrown by essentially all CUDA Runtime API wrappers upon failure. More... | |
class | stream_t |
Proxy class for a CUDA stream. More... | |
class | texture_view |
Use texture memory for optimized read only cache access. More... | |
struct | version_t |
A structure representing a CUDA release version. More... | |
Typedefs | |
template<memory_operation_t Op> | |
using | caching_mode_t = typename caching< Op >::mode |
template<typename T , size_t N> | |
using | c_array = T[N] |
using | status_t = CUresult |
Indicates either the result (success or error index) of a CUDA Runtime or Driver API call, or the overall status of the API (which is typically the last triggered error). More... | |
using | size_t = ::std::size_t |
A size type for use throughout the wrappers library (except when specific API functions limit the size further) | |
using | dimensionality_t = size_t |
The index or number of dimensions of an entity (as opposed to the extent in any dimension) - typically just 0, 1, 2 or 3. More... | |
using | uuid_t = CUuuid |
The CUDA-driver-specific representation of a UUID value; see also {device_t::uuid()}. | |
using | combined_version_t = int |
A combination of the major and minor version numbers for a CUDA release into a single integer, e.g. More... | |
Enumerations | |
enum | memory_operation_t { load, store } |
enum | { warp_size = 32 } |
CUDA's NVCC allows the use of the warpSize identifier, without having to define it. More... | |
enum | : bool { thread_blocks_may_cooperate = true, thread_blocks_may_not_cooperate = false } |
Thread block cooperativity control for kernel launches. More... | |
enum | : memory::shared::size_t { no_dynamic_shared_memory = 0 } |
enum | : bool { do_take_ownership = true, do_not_take_ownership = false } |
enum | : bool { do_hold_primary_context_refcount_unit = true, do_not_hold_primary_context_refcount_unit = false } |
enum | : bool { dont_clear_errors = false, do_clear_errors = true } |
enum | multiprocessor_cache_preference_t : ::std::underlying_type< CUfunc_cache_enum >::type { multiprocessor_cache_preference_t::no_preference = CU_FUNC_CACHE_PREFER_NONE, multiprocessor_cache_preference_t::equal_l1_and_shared_memory = CU_FUNC_CACHE_PREFER_EQUAL, multiprocessor_cache_preference_t::prefer_shared_memory_over_l1 = CU_FUNC_CACHE_PREFER_SHARED, multiprocessor_cache_preference_t::prefer_l1_over_shared_memory = CU_FUNC_CACHE_PREFER_L1, none = no_preference, equal = equal_l1_and_shared_memory, prefer_shared = prefer_shared_memory_over_l1, prefer_l1 = prefer_l1_over_shared_memory } |
L1-vs-shared-memory balance option. More... | |
enum | multiprocessor_shared_memory_bank_size_option_t : ::std::underlying_type< CUsharedconfig >::type { device_default = CU_SHARED_MEM_CONFIG_DEFAULT_BANK_SIZE, four_bytes_per_bank = CU_SHARED_MEM_CONFIG_FOUR_BYTE_BANK_SIZE, eight_bytes_per_bank = CU_SHARED_MEM_CONFIG_EIGHT_BYTE_BANK_SIZE } |
A physical core (SM)'s shared memory has multiple "banks"; at most one datum per bank may be accessed simultaneously, while data in different banks can be accessed in parallel. More... | |
enum | source_kind_t { cuda_cpp = 0, ptx = 1 } |
The API wrappers support different kinds of source code, accepted by different NVIDIA run-time compilation libraries. More... | |
Functions | |
void | synchronize (const context_t &context) |
Waits for all previously-scheduled tasks on all streams (= queues) in a CUDA context to conclude, before returning. More... | |
void | synchronize (const device_t &device) |
Waits for all previously-scheduled tasks on all streams (= queues) on a specified device to conclude. More... | |
detail_::all_devices | devices () |
void | throw_if_error (status_t status) noexcept(false) |
Does nothing - unless the status indicates an error, in which case a cuda::runtime_error exception is thrown. More... | |
void | throw_if_error (cudaError_t status) noexcept(false) |
void | wait (const event_t &event) |
Have the calling thread wait - either busy-waiting or blocking - and return only after this event has occurred (see event_t::has_occurred() More... | |
void | synchronize (const event_t &event) |
bool | operator== (const kernel_t &lhs, const kernel_t &rhs) noexcept |
bool | operator!= (const kernel_t &lhs, const kernel_t &rhs) noexcept |
constexpr grid::dimensions_t | single_block () |
A named constructor idiom for grid::dimensions_t, which, when used, will result in a grid with a single block. | |
constexpr grid::block_dimensions_t | single_thread_per_block () |
A named constructor idiom for grid::block_dimensions_t, which, when used, will result in a grid whose blocks have a single thread. | |
template<typename Kernel , typename... KernelParameters> | |
void | enqueue_launch (Kernel &&kernel, const stream_t &stream, launch_configuration_t launch_configuration, KernelParameters &&... parameters) |
Enqueues a kernel on a stream (=queue) on the current CUDA device. More... | |
template<typename Kernel , typename... KernelParameters> | |
void | launch (Kernel &&kernel, launch_configuration_t launch_configuration, KernelParameters &&... parameters) |
Variant of enqueue_launch for use with the default stream in the current context. More... | |
launch_config_builder_t | launch_config_builder () |
A slightly shorter-named construction idiom for launch_config_builder_t. | |
void | initialize_driver () |
Initializes the CUDA driver (ensures a cuInit() call has been made). More... | |
void | ensure_driver_is_initialized () |
A mechanism for ensuring a cuInit() call has been made, to use before making any other driver API calls. More... | |
void | synchronize (const stream_t &stream) |
Waits for all previously-scheduled tasks on a given stream to conclude, before returning. More... | |
template<source_kind_t Kind> | |
constexpr bool | is_failure (rtc::status_t< Kind > status) |
Determine whether the API call returning the specified status had failed. | |
template<source_kind_t Kind> | |
void | throw_if_error (rtc::status_t< Kind > status, const ::std::string &message) noexcept(false) |
Do nothing... More... | |
template<source_kind_t Kind> | |
void | throw_if_error (rtc::status_t< Kind > status) noexcept(false) |
Does nothing - unless the status indicates an error, in which case a cuda::runtime_error exception is thrown. More... | |
bool | operator== (const context_t &lhs, const context_t &rhs) noexcept |
bool | operator!= (const context_t &lhs, const context_t &rhs) noexcept |
constexpr bool | is_success (status_t status) |
Determine whether the API call returning the specified status had succeeded. | |
constexpr bool | is_success (cudaError_t status) |
constexpr bool | is_failure (status_t status) |
Determine whether the API call returning the specified status had failed. | |
constexpr bool | is_failure (cudaError_t status) |
inline ::std::string | describe (status_t status) |
Obtain a brief textual explanation for a specified kind of CUDA Runtime API status or error code. | |
inline ::std::string | describe (cudaError_t status) |
void | throw_if_error (status_t status, const ::std::string &message) noexcept(false) |
Do nothing... More... | |
void | throw_if_error (cudaError_t status, const ::std::string &message) noexcept(false) |
void | throw_if_error (status_t status, ::std::string &&message) noexcept(false) |
void | throw_if_error (cudaError_t status, ::std::string &&message) noexcept(false) |
template<typename SpanOfConstVoidPtrLike > | |
void | launch_type_erased (const kernel_t &kernel, const stream_t &stream, launch_configuration_t launch_configuration, SpanOfConstVoidPtrLike marshalled_arguments) |
Launch a kernel with the arguments pre-marshalled into the (main) form which the CUDA driver's launch primitive accepts variables in: A null-terminated sequence of (possibly const) void *'s to the argument values. More... | |
template<source_kind_t Kind> | |
constexpr bool | is_success (rtc::status_t< Kind > status) |
Determine whether the API call returning the specified status had succeeded. | |
inline ::std::string | describe (rtc::status_t< cuda_cpp > status) |
Obtain a brief textual explanation for a specified kind of CUDA Runtime API status or error code. | |
Definitions and functionality wrapping CUDA APIs.
using cuda::combined_version_t = typedef int |
A combination of the major and minor version numbers for a CUDA release into a single integer, e.g.
CUDA 11.3 is represented by the combined version number 11300. See also version_t.
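The encoding implied by the example above (CUDA 11.3 <-> 11300) can be sketched with a few hypothetical helpers. combine_version, major_of and minor_of are illustrative names, not part of the library; they only demonstrate the combined-number idea:

```cpp
#include <cassert>

// A single int combining major and minor release numbers, following the
// encoding suggested by the 11.3 <-> 11300 example: major * 1000 + minor * 100.
using combined_version_t = int;

constexpr combined_version_t combine_version(int major, int minor) {
    return major * 1000 + minor * 100;
}

constexpr int major_of(combined_version_t combined) { return combined / 1000; }
constexpr int minor_of(combined_version_t combined) { return (combined % 1000) / 100; }

static_assert(combine_version(11, 3) == 11300, "matches the documented example");
```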
using cuda::dimensionality_t = typedef size_t |
The index or number of dimensions of an entity (as opposed to the extent in any dimension) - typically just 0, 1, 2 or 3.
using cuda::status_t = typedef CUresult |
Indicates either the result (success or error index) of a CUDA Runtime or Driver API call, or the overall status of the API (which is typically the last triggered error).
anonymous enum |
CUDA's NVCC allows the use of the warpSize identifier, without having to define it.
Un(?)fortunately, warpSize is not a compile-time constant; it is replaced at some point with the appropriate immediate value, which goes into the SASS instruction as a literal. This is apparently due to the theoretical possibility of different warp sizes in the future. However, it is useful - both for host-side and, more importantly, for device-side code - to have the warp size available at compile time. This allows all sorts of useful optimizations, as well as its use in constexpr code.
If NVIDIA comes out with 64-lanes-per-warp GPUs - we'll refactor this.
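Because the wrappers expose the warp size as a compile-time constant, it can feed constexpr computations that the runtime warpSize identifier cannot. A standalone sketch, with a local constant standing in for cuda::warp_size:

```cpp
#include <cassert>

// Local stand-in for the wrappers' warp_size enum constant.
enum { warp_size = 32 };

// With a compile-time warp size, "how many warp slots does a block occupy?"
// becomes a constexpr question, usable in static_assert and template arguments.
constexpr unsigned warps_per_block(unsigned threads_per_block) {
    // Round up: a partially-filled warp still occupies a full warp slot
    return (threads_per_block + warp_size - 1) / warp_size;
}

static_assert(warps_per_block(256) == 8, "256 threads form 8 full warps");
static_assert(warps_per_block(100) == 4, "100 threads occupy 4 warp slots");
```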
anonymous enum : bool |
Thread block cooperativity control for kernel launches.
L1-vs-shared-memory balance option.
In some GPU micro-architectures, it's possible to have the multiprocessors change the balance in the allocation of L1-cache-like resources between actual L1 cache and shared memory; these are the possible choices.
enum cuda::multiprocessor_shared_memory_bank_size_option_t : ::std::underlying_type< CUsharedconfig >::type |
A physical core (SM)'s shared memory has multiple "banks"; at most one datum per bank may be accessed simultaneously, while data in different banks can be accessed in parallel.
The number of banks and bank sizes differ for different GPU architecture generations; but in some of them (e.g. Kepler), they are configurable - and you can trade the number of banks for bank size, in case that makes sense for your data access pattern - by using device_t::shared_memory_bank_size .
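The bank mapping described above can be illustrated with a small computation. num_banks and bank_of are illustrative, not library code, and the figure of 32 banks is an assumption (typical of recent architectures); the point is that two addresses conflict exactly when they map to the same bank index:

```cpp
#include <cassert>
#include <cstdint>

constexpr unsigned num_banks = 32;  // assumed; varies by architecture generation

// The bank serving a given byte offset into shared memory, for a configured
// bank width. Accesses to the same bank by different lanes are serialized;
// accesses to distinct banks proceed in parallel.
constexpr unsigned bank_of(std::uintptr_t byte_offset, unsigned bytes_per_bank) {
    return (byte_offset / bytes_per_bank) % num_banks;
}

// With four_bytes_per_bank, consecutive 4-byte words land in consecutive banks:
static_assert(bank_of(0, 4) == 0 && bank_of(4, 4) == 1, "no conflict");
// ...but two words 128 bytes apart share bank 0 - a potential conflict:
static_assert(bank_of(128, 4) == bank_of(0, 4), "same bank");
```

With eight_bytes_per_bank, consecutive 8-byte values (e.g. doubles) likewise spread across distinct banks, which is why the configuration may be worth changing to match the data access pattern.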
enum cuda::source_kind_t |
The API wrappers support different kinds of source code, accepted by different NVIDIA run-time compilation libraries.
detail_::all_devices cuda::devices()
Allows iteration over all CUDA devices; for example:
for(auto& dev : cuda::devices) { do_stuff_with(dev); }
void cuda::enqueue_launch | ( | Kernel && | kernel, |
const stream_t & | stream, | ||
launch_configuration_t | launch_configuration, | ||
KernelParameters &&... | parameters | ||
) |
Enqueues a kernel on a stream (=queue) on the current CUDA device.
CUDA's 'chevron' kernel launch syntax cannot be compiled in proper C++. Thus, every kernel launch must at some point reach code compiled with CUDA's nvcc. Naively, every single different kernel (perhaps up to template specialization) would require writing its own wrapper C++ function, launching it. This function, however, constitutes a single minimal wrapper around the CUDA kernel launch, which may be called from proper C++ code (across translation unit boundaries - the caller is compiled with a C++ compiler, the callee compiled by nvcc).
This function is similar to C++17's ::std::apply, or to a beta-reduction in Lambda calculus: It applies a function to its arguments; the difference is in the nature of the function (a CUDA kernel) and in that the function application requires setting additional CUDA-related launch parameters, beyond the function's own.
As kernels do not return values, neither does this function. It also contains no hooks, logging commands etc. - if you want those, write an additional wrapper (perhaps calling this one in turn).
kernel | the kernel to apply. Pass it just as-it-is, as though it were any other function. Note: If the kernel is templated, you must pass it fully-instantiated. Alternatively, you can pass a |
stream | the CUDA hardware command queue on which to place the command to launch the kernel (affects the scheduling of the launch and the execution) |
launch_configuration | not all launches of the same kernel are identical: The launch may be configured to use more or fewer blocks in the grid, to allow blocks dynamic shared memory, to control the blocks' dimensions etc.; this parameter defines that extra configuration outside the kernel's actual source. See also cuda::launch_configuration_t. |
parameters | whatever parameters kernel_function takes |
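The ::std::apply comparison drawn above can be illustrated in plain host-side C++17. saxpy_element and apply_demo are hypothetical stand-ins, with the tuple playing the role of the kernel parameters supplied alongside the launch configuration:

```cpp
#include <cassert>
#include <tuple>

// Stand-in for a kernel's computation: one element of a*x + y.
int saxpy_element(int a, int x, int y) { return a * x + y; }

int apply_demo() {
    // The arguments are packed ahead of time...
    auto args = std::make_tuple(2, 3, 4);
    // ...and the "function application" then happens in a single call -
    // analogous to handing a kernel plus its parameters to enqueue_launch.
    return std::apply(saxpy_element, args);
}
```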
A mechanism for ensuring a cuInit() call has been made, to use before making any other driver API calls.
Note: Differs from initialize_driver() in that repeated calls from the same thread will avoid an additional cuInit() call.
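The once-per-thread behavior described above might be sketched in ordinary C++ with a thread_local flag. ensure_initialized and init_call_count are illustrative names, and the counter stands in for the actual cuInit() call; this is not the library's implementation:

```cpp
#include <cassert>

// Counts how many times the (simulated) driver initialization has run.
inline int& init_call_count() { static int count = 0; return count; }

inline void ensure_initialized() {
    // Each thread remembers whether it has already triggered initialization,
    // so repeated calls from the same thread are cheap no-ops.
    thread_local bool done = false;
    if (!done) {
        ++init_call_count();  // stands in for the actual cuInit(0) call
        done = true;
    }
}
```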
void cuda::initialize_driver()
Initializes the CUDA driver (wraps a call to cuInit()).
void cuda::launch | ( | Kernel && | kernel, |
launch_configuration_t | launch_configuration, | ||
KernelParameters &&... | parameters | ||
) |
Variant of enqueue_launch for use with the default stream in the current context.
Note: This isn't called enqueue, since the default stream is synchronous.
Launch a kernel with the arguments pre-marshalled into the (main) form which the CUDA driver's launch primitive accepts variables in: A null-terminated sequence of (possibly const) void *'s to the argument values.
SpanOfConstVoidPtrLike | Type of the container for the marshalled arguments; typically, this would be span<const void*> - but it can be an ::std::vector , or have non-const void* elements etc. |
kernel | A wrapped GPU kernel |
stream | Proxy for the stream on which to enqueue the kernel launch; may be the default stream of a context. |
launch_configuration | The configuration information for the grid of blocks of threads and other configuration with which to configure the launch |
marshalled_arguments | A container of void or const void pointers to the argument values |
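The marshalled form the parameters above describe - one void* per argument value, in declaration order, followed by a null terminator - can be sketched in plain C++. marshal_demo is a hypothetical helper, not part of the library:

```cpp
#include <cassert>
#include <vector>

// Builds the marshalled-arguments container for a (hypothetical) kernel
// taking (float scale, int length): one pointer per parameter value, then a
// null terminator marking the end of the sequence. The argument objects must
// outlive the launch, since the driver reads each value through its pointer.
inline std::vector<void*> marshal_demo(float& scale, int& length) {
    return { &scale, &length, nullptr };
}
```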
Waits for all previously-scheduled tasks on all streams (= queues) on a specified device to conclude.
Depending on the host_thread_sync_scheduling_policy_t set for the specified device, the thread calling this method will either yield, spin or block until all tasks previously scheduled on this device have concluded.
Waits for all previously-scheduled tasks on all streams (= queues) in a CUDA context to conclude, before returning.
Avoid executing any additional instructions on this thread until all work on all streams in context
has been concluded.
Depending on the host_thread_sync_scheduling_policy_t
set for the specified context, the thread calling this method will either yield, spin or block until all tasks previously scheduled on streams within this context have concluded.
context | The context all of whose streams are synchronized when a call to this function is made. |
Waits for all previously-scheduled tasks on a given stream to conclude, before returning.
Depending on the host_thread_sync_scheduling_policy_t
set for the specified stream, the thread calling this method will either yield, spin or block until all tasks previously scheduled on the stream have concluded.
Does nothing - unless the status indicates an error, in which case a cuda::runtime_error exception is thrown.
status | should be cuda::status::success - otherwise an exception is thrown |
message | An extra description message to add to the exception |
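As a standalone sketch of the pattern these overloads implement - with status_t, success and describe() as local stand-ins rather than the library's actual definitions:

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Local stand-ins for the library's status type and success value.
using status_t = int;
constexpr status_t success = 0;

// Stand-in for cuda::describe(): a brief textual explanation of a status code.
inline std::string describe(status_t status) {
    return "error code " + std::to_string(status);
}

// The pattern: a no-op on success, an exception carrying both the caller's
// message and the status description otherwise.
inline void throw_if_error(status_t status, const std::string& message) {
    if (status != success) {
        throw std::runtime_error(message + ": " + describe(status));
    }
}
```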
Does nothing - unless the status indicates an error, in which case a cuda::runtime_error exception is thrown.
status | should be cuda::status::success - otherwise an exception is thrown |
Does nothing - unless the status indicates an error, in which case a cuda::runtime_error exception is thrown.
status | should be status::success - otherwise an exception is thrown |
message | An extra description message to add to the exception |
Does nothing - unless the status indicates an error, in which case a cuda::runtime_error exception is thrown.
status | should be cuda::status::success - otherwise an exception is thrown |
Have the calling thread wait - either busy-waiting or blocking - and return only after this event has occurred (see event_t::has_occurred()).
event | the event for whose occurrence to wait; must be scheduled to occur on some stream (possibly a different stream) |