Thin C++-flavored wrappers for the CUDA APIs: Runtime, Driver, NVRTC and NVTX
Table of contents
General description
- Key features
Detailed documentation
Using the library in your project
Coverage of the APIs
A taste of some features in play
Example programs
- Modified CUDA samples
- 'Coverage' programs - by API module
Bugs, suggestions, feedback

General description

This is a header-only library of integrated wrappers around the core parts of NVIDIA's CUDA execution ecosystem: the Runtime API, the Driver API, NVRTC and NVTX.

It is intended for those who would otherwise use these APIs directly, and aims to make working with them more intuitive and consistent by using modern C++ language capabilities, programming idioms and best practices. In a nutshell - making CUDA API work more fun :-)

Also, and importantly - while the wrappers seem to be "high-level", more "abstract" code - they are nothing more than a modern-C++-aesthetic arrangement of NVIDIA's own APIs. The wrapper library does not force any abstractions above CUDA's own, nor conventions regarding where to place data, how and when to perform synchronization, etc.; you have the complete range of expression of the underlying APIs.

Key features

In contrast to the above, this library provides:

There is one noteworthy caveat: The wrapper API calls cannot make assumptions about previous or later code of yours, which means some of them require more calls to obtain the current context handle or push a(n existing) context, then pop it. While these calls are cheap, they are still non-trivial and can't be optimized away.
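To make the caveat concrete, here is a minimal host-only sketch of the "push an existing context, do the work, pop it" pattern, written as an RAII guard. Everything in it is an assumption made for illustration: `context_handle`, `context_stack()`, `scoped_context_push` and `run_in_context()` are hypothetical names, and the vector merely stands in for the driver's per-thread context stack. It is not the library's implementation.

```cpp
#include <vector>

namespace sketch {

// Hypothetical stand-in for a driver context handle (the real one is an opaque pointer type).
using context_handle = int;

// Hypothetical stand-in for the per-thread stack of current contexts kept by the driver.
inline std::vector<context_handle>& context_stack()
{
    static thread_local std::vector<context_handle> stack;
    return stack;
}

// RAII guard: pushes an existing context on construction and pops it on destruction,
// so the caller's current context is restored even if the wrapped work throws.
class scoped_context_push {
public:
    explicit scoped_context_push(context_handle handle) { context_stack().push_back(handle); }
    ~scoped_context_push() { context_stack().pop_back(); }
    scoped_context_push(const scoped_context_push&) = delete;
    scoped_context_push& operator=(const scoped_context_push&) = delete;
};

// A wrapper call which cannot assume anything about the caller's current context:
// it always pays for one push and one pop around the actual work.
inline context_handle run_in_context(context_handle handle)
{
    scoped_context_push guard(handle);
    return context_stack().back(); // the underlying API call would be made here
}

} // namespace sketch
```

The push and the pop happen on every call, whether or not the caller had already made that context current; that is the extra cost the caveat refers to.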


NVIDIA provides two main APIs for using CUDA: The Runtime API and the Driver API. These suffer from several deficiencies:

You may have noticed this list reads like the opposite of the key features, listed above: The idea is to make this library overcome and rectify all of these deficiencies as much as possible.

Detailed documentation

Detailed Doxygen-generated documentation is available. It is mostly complete for the Runtime API wrappers, less so for the rest of the wrappers.


Using the library in your project

Use involving CMake:

Use not involving CMake:

Finally, if you've started using the library in a publicly-available (FOSS or commercial) project, please consider emailing me, or opening an issue, to announce this.

Coverage of the APIs

Most, but not all, API calls in the Runtime, Driver, NVTX and NVRTC APIs are covered by these wrappers. Specifically, the following are missing:

Support for textures, arrays and surfaces exists, but is partial: Not all relevant API functions are covered.

The Milestones indicate some features which aren't covered yet and are slated for future work. Since I am not currently working on anything graphics-related, there are no short-term plans to extend coverage to more graphics-related APIs; however - PRs are welcome.

A taste of some features in play

We've all dreamed of being able to type in:

    [&foo](cuda::stream_t stream, cuda::status_t status) {
        ::std::cout << "Hello " << foo << " world!\n";
    }

... and have that just work, right? Well, now it does!
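For readers unfamiliar with the idea behind that snippet, the following is a toy, host-only sketch of enqueuing a capturing lambda on a stream and having it run later, in order. It assumes nothing about the library's actual interface: `sketch::stream_t`, `enqueue_callback()`, `synchronize()` and `sketch::success` are hypothetical names invented here, and no GPU is involved.

```cpp
#include <functional>
#include <queue>
#include <utility>

namespace sketch {

using status_t = int;            // hypothetical stand-in for the API's status code type
constexpr status_t success = 0;  // hypothetical "no error" value

// A toy "stream": enqueued callables are run in order when the stream is synchronized.
class stream_t {
public:
    using callback_type = std::function<void(stream_t&, status_t)>;

    // Accepts any callable with the right signature, including lambdas with captures.
    void enqueue_callback(callback_type callback) { pending_.push(std::move(callback)); }

    // Runs all pending callbacks in the order in which they were enqueued.
    void synchronize()
    {
        while (!pending_.empty()) {
            callback_type callback = std::move(pending_.front());
            pending_.pop();
            callback(*this, success);
        }
    }

private:
    std::queue<callback_type> pending_;
};

} // namespace sketch
```

The point of the sketch is only that a type-erased callable lets the caller pass a lambda with captures, where a C-style API would take a plain function pointer plus a `void*` of user data.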

On a slightly more serious note, though, let's demonstrate the principles listed above:

Use of namespaces (and internal classes)

With this library, you would do cuda::memory::host::allocate() instead of cudaMallocHost() and cuda::device_t::memory::allocate() instead of setting the current device and then cudaMalloc(). Note, though, that device_t::memory::allocate() is not a freestanding function but a method of an internal class, so a call to it might be cuda::device::get(my_device_id).memory.allocate(my_size). The compiled version of this supposedly complicated construct will be nothing but the sequence of cudaSetDevice() and cudaMalloc() calls.
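The "method of an internal class" arrangement can be sketched without any CUDA at all. The code below is an illustrative mock under stated assumptions, not the library's source: `set_device()` and `raw_allocate()` are hypothetical host-only stand-ins for `cudaSetDevice()` and `cudaMalloc()`, and the class layout is a guess at the general shape of such a design.

```cpp
#include <cstddef>
#include <cstdlib>

namespace sketch {

// Hypothetical host-only stand-ins for cudaSetDevice() and cudaMalloc().
inline int& current_device() { static int id = 0; return id; }
inline void set_device(int id) { current_device() = id; }
inline void* raw_allocate(std::size_t num_bytes) { return std::malloc(num_bytes); }

class device_t {
public:
    // Internal class grouping the memory-related operations of one device.
    class memory_t {
    public:
        explicit memory_t(int device_id) : device_id_(device_id) {}

        // Nothing more than the two underlying calls, made in sequence.
        void* allocate(std::size_t num_bytes) const
        {
            set_device(device_id_);
            return raw_allocate(num_bytes);
        }

    private:
        int device_id_;
    };

    explicit device_t(int id) : memory(id), id_(id) {}
    int id() const { return id_; }

    memory_t memory; // lets callers write my_device.memory.allocate(my_size)

private:
    int id_;
};

namespace device {
// Obtains a lightweight proxy object for the device with the given ID.
inline device_t get(int id) { return device_t(id); }
} // namespace device

} // namespace sketch
```

With this mock, `sketch::device::get(my_device_id).memory.allocate(my_size)` reads like a namespaced call, while the proxy objects hold only an integer ID and the call chain inlines down to the two stand-in functions.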

Adorning POD structs with convenience methods

Comparing a device's compute capability against a constructed value, as in the condition >= cuda::make_compute_capability(50), is valid, and holds for all devices with a Maxwell-or-later micro-architecture. This is despite struct cuda::compute_capability_t being a POD type with two unsigned integer fields rather than a scalar. Note that struct cuda::device::properties_t (which is essentially the Runtime API's own struct cudaDeviceProp) does not have a compute_capability field.
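As a rough illustration of how a two-field POD can support such comparisons, consider the sketch below. It is a re-implementation written for this explanation and is not taken from the library: the field names are assumptions, and the convention that `make_compute_capability(50)` means "5.0" (tens digit is the major version, units digit the minor) is inferred from the Maxwell remark above.

```cpp
namespace sketch {

// A plain struct: no constructors, no virtual functions, just two unsigned fields.
struct compute_capability_t {
    unsigned major_version;
    unsigned minor_version;
};

// Builds a capability from a combined number: 50 is taken to mean 5.0, 75 to mean 7.5.
constexpr compute_capability_t make_compute_capability(unsigned combined)
{
    return compute_capability_t{ combined / 10, combined % 10 };
}

constexpr bool operator==(const compute_capability_t& lhs, const compute_capability_t& rhs)
{
    return lhs.major_version == rhs.major_version && lhs.minor_version == rhs.minor_version;
}

// Lexicographic ordering: compare major versions first, then minor versions.
constexpr bool operator<(const compute_capability_t& lhs, const compute_capability_t& rhs)
{
    return lhs.major_version < rhs.major_version ||
        (lhs.major_version == rhs.major_version && lhs.minor_version < rhs.minor_version);
}

constexpr bool operator>=(const compute_capability_t& lhs, const compute_capability_t& rhs)
{
    return !(lhs < rhs);
}

} // namespace sketch
```

The operators here are free functions found through argument-dependent lookup; adding non-virtual member functions instead would not affect the struct's POD status either.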

Meaningful naming

Instead of using

cudaError_t cudaEventCreateWithFlags(
cudaEvent_t* event,
unsigned int flags)

which requires you to remember what you need to specify as flags and how, you create a cuda::event_t proxy object, using the function:

cuda::event_t cuda::event::create(
cuda::device_t device,
bool uses_blocking_sync,
bool records_timing = cuda::event::do_record_timing,
bool interprocess = cuda::event::not_interprocess)

The default values here are enum : bool's, which you can use yourself when creating non-default-parameter events - to make the call more easily readable than with true or false.
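The `enum : bool` technique can be sketched in isolation as follows. Only `do_record_timing` and `not_interprocess` appear in the signature above; the other enumerator names, the `make_flags()` helper and the `flag_*` constants are assumptions introduced for this example (the constants mimic the Runtime API's event flag bits, and no CUDA calls are made).

```cpp
namespace sketch {
namespace event {

// Named boolean constants: unscoped enums with bool as their underlying type.
enum : bool { sync_by_busy_waiting = false, sync_by_blocking = true }; // hypothetical names
enum : bool { dont_record_timing = false, do_record_timing = true };   // first name is hypothetical
enum : bool { not_interprocess = false, is_interprocess = true };      // second name is hypothetical

// Stand-ins for the flag bits accepted by cudaEventCreateWithFlags().
constexpr unsigned flag_blocking_sync  = 0x1u;
constexpr unsigned flag_disable_timing = 0x2u;
constexpr unsigned flag_interprocess   = 0x4u;

// Hypothetical helper translating the readable booleans into the raw flags value.
// Note the inversion: the raw API has a "disable timing" bit, the wrapper a "records timing" bool.
inline unsigned make_flags(
    bool uses_blocking_sync,
    bool records_timing = do_record_timing,
    bool interprocess   = not_interprocess)
{
    return (uses_blocking_sync ? flag_blocking_sync : 0u)
         | (records_timing ? 0u : flag_disable_timing)
         | (interprocess ? flag_interprocess : 0u);
}

} // namespace event
} // namespace sketch
```

A call such as `make_flags(sync_by_blocking, dont_record_timing)` then documents itself at the call site, which `make_flags(true, false)` would not; the enumerators convert implicitly to `bool`, so the function signature stays plain.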

Example programs

In lieu of a full-fledged user's guide, I'm providing several kinds of example programs; by browsing their source, you'll learn most of what there is to know about the API wrappers. To build and run the examples (just as a sanity check), execute the following (in a Unix-style command shell):

cmake -S . -B build -DBUILD_EXAMPLES=ON
cmake --build build/
find build/examples/bin -type f -executable -exec "{}" ";"

The two main kinds of example programs are:

Modified CUDA samples

The CUDA distribution contains sample programs demonstrating various features and concepts. A few of these - which are not focused on device-side work - have been adapted to use the API wrappers, completely forgoing direct use of the CUDA Runtime API itself. You will find them in the modified CUDA samples example programs folder.

'Coverage' programs - by API module

Gradually, an example program is being added for each one of the CUDA Runtime API Modules, demonstrating how calls to that module's API can be replaced by use of the API wrappers. These per-module example programs are found alongside the other example programs.

Bugs, suggestions, feedback

If you notice a specific issue which needs addressing, especially any sort of bug, compatibility problem, or missing functionality - please file the issue here on GitHub. If you'd like to contribute, or give some less-specific or less public feedback - do write me. You can also write if you're interested in collaborating on related research or coding work.