cuda-api-wrappers
Thin C++-flavored wrappers for the CUDA Runtime API

NVIDIA's Runtime API for CUDA is intended for use in both C and C++ code. As such, it uses a C-style API, the lowest common denominator (with a few notable exceptions of templated function overloads).

This library of wrappers around the Runtime API is intended to allow us to embrace many of the features of C++ (including some C++11) for using the runtime API - but without reducing expressivity or increasing the level of abstraction (as in, e.g., the Thrust library). Using cuda-api-wrappers, you still have your devices, streams, events and so on - but they will be more convenient to work with in more C++-idiomatic ways.

Key features

Detailed documentation

Detailed, nearly complete Doxygen-generated documentation is available.

Requirements

Coverage of the Runtime API

Considering the list of Runtime API modules, the library currently has the following coverage (w.r.t. CUDA 8.x):

Coverage level | Modules
full | Error Handling, Stream Management, Event Management, Memory Management, Version Management, Peer Device Memory Access, Occupancy, Unified Addressing
almost full | Device Management (no cudaChooseDevice, cudaSetValidDevices), Execution Control (no support for working with parameter buffers)
no coverage | OpenGL Interoperability, Direct3D 9 Interoperability, Direct3D 10 Interoperability, Direct3D 11 Interoperability, VDPAU Interoperability, EGL Interoperability, Graphics Interoperability, Texture Reference Management, Surface Reference Management, Texture Object Management, Surface Object Management

CUDA 9.0 additions to the API are a WIP (see the issues page).

Since I am not currently working on anything graphics-related, there are no short-term plans to extend coverage to any of the graphics-related modules.

A taste of the key features in play

We've all dreamed of being able to type in:

my_stream.enqueue.callback(
    [&foo](cuda::stream::id_t stream_id, cuda::status_t status) {
        std::cout << "Hello " << foo << " world!\n";
    }
);

... and have that just work, right? Well, now it does!
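For readers wondering how a capturing lambda can travel through a C-style callback interface at all, here is a minimal, self-contained sketch of one common technique: a capture-less "trampoline" function plus a heap-allocated std::function passed as opaque user data. Everything in it (c_style_enqueue, c_style_run_all, enqueue_callback, the callback_sketch namespace) is a hypothetical stand-in written so the sketch compiles without CUDA; it is not the library's actual implementation.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <string>
#include <utility>
#include <vector>

namespace callback_sketch {

// Hypothetical stand-in for a C-style API: it accepts only a plain
// function pointer plus an opaque user-data pointer.
using c_callback_t = void (*)(int stream_id, int status, void* user_data);

struct pending_callback { c_callback_t fn; void* user_data; };

inline std::vector<pending_callback>& pending_callbacks() {
    static std::vector<pending_callback> queue;
    return queue;
}

inline void c_style_enqueue(c_callback_t fn, void* user_data) {
    pending_callbacks().push_back({fn, user_data});
}

// Stand-in for "the stream has reached the callback": run everything queued.
inline void c_style_run_all(int stream_id, int status) {
    std::vector<pending_callback> queue;
    queue.swap(pending_callbacks());
    for (auto& cb : queue) { cb.fn(stream_id, status, cb.user_data); }
}

using callable_t = std::function<void(int stream_id, int status)>;

// The trampoline captures nothing, so it is usable as a plain function
// pointer. It recovers the C++ callable from the opaque pointer, invokes it,
// and releases it when the unique_ptr goes out of scope.
inline void trampoline(int stream_id, int status, void* user_data) {
    assert(user_data != nullptr);
    std::unique_ptr<callable_t> callable(static_cast<callable_t*>(user_data));
    (*callable)(stream_id, status);
}

// The C++-flavored entry point: accepts any callable, including a lambda
// with captures, and hands a heap copy to the C-style interface.
inline void enqueue_callback(callable_t callable) {
    auto* heap_copy = new callable_t(std::move(callable));
    c_style_enqueue(&trampoline, heap_copy);
}

// Usage, mirroring the snippet above but collecting the greeting
// instead of printing it.
inline std::string greet_via_callback(const std::string& foo) {
    std::string greeting;
    enqueue_callback([&foo, &greeting](int /*stream_id*/, int /*status*/) {
        greeting = "Hello " + foo + " world!";
    });
    c_style_run_all(/*stream_id=*/0, /*status=*/0);
    return greeting;
}

} // namespace callback_sketch
```

The heap copy is needed because the C-style interface may invoke the callback long after the enqueuing call has returned; captured references must still outlive the callback, exactly as with any deferred lambda.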

On a slightly more serious note, though, let's demonstrate the principles listed above:

Use of namespaces (and internal classes)

With this library, you would call cuda::memory::host::allocate() instead of cudaMallocHost(), and cuda::device_t::memory::allocate() instead of setting the current device and then calling cudaMalloc(). Note, though, that device_t::memory::allocate() is not a freestanding function but a method of an internal class, so a call to it might look like cuda::device::get(my_device_id).memory.allocate(my_size). The compiled version of this seemingly complicated construct will be nothing but the sequence of cudaSetDevice() and cudaMalloc() calls.
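To make the "internal class" idea concrete, here is a rough sketch of how such a proxy could be laid out. It assumes stand-in functions (fake_runtime::set_device and fake_runtime::malloc_on_current_device) in place of cudaSetDevice() and cudaMalloc(), so that it builds without CUDA; the proxy_sketch namespace and all member names are illustrative assumptions, not the library's real definitions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-ins for the C-style runtime calls; each one logs
// its invocation so the call sequence can be inspected.
namespace fake_runtime {
inline std::vector<std::string>& call_log() {
    static std::vector<std::string> log;
    return log;
}
inline int set_device(int id) {
    call_log().push_back("set_device(" + std::to_string(id) + ")");
    return 0;
}
inline int malloc_on_current_device(void** ptr, std::size_t size_in_bytes) {
    call_log().push_back("malloc(" + std::to_string(size_in_bytes) + ")");
    *ptr = std::malloc(size_in_bytes);
    return (*ptr != nullptr) ? 0 : 1;
}
} // namespace fake_runtime

namespace proxy_sketch {

namespace device { using id_t = int; }

class device_t {
public:
    // The "internal class": groups memory operations under my_device.memory
    class memory_t {
    public:
        explicit memory_t(device::id_t id) : device_id_(id) {}

        void* allocate(std::size_t size_in_bytes) const {
            assert(size_in_bytes > 0);
            fake_runtime::set_device(device_id_);
            void* ptr = nullptr;
            if (fake_runtime::malloc_on_current_device(&ptr, size_in_bytes) != 0) {
                throw std::runtime_error("device memory allocation failed");
            }
            return ptr;
        }
    private:
        device::id_t device_id_;
    };

    explicit device_t(device::id_t id) : memory(id), id_(id) {}
    device::id_t id() const { return id_; }

    memory_t memory; // public member object, hence the `.memory.allocate()` syntax

private:
    device::id_t id_;
};

namespace device {
// Freestanding factory, so that device::get(id).memory.allocate(n) reads naturally
inline device_t get(id_t id) { return device_t(id); }
} // namespace device

} // namespace proxy_sketch
```

With this layout, proxy_sketch::device::get(1).memory.allocate(64) results in exactly two logged calls, set_device(1) followed by malloc(64), which is the point of the paragraph above: the nesting is purely a naming convenience.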

Adorning POD structs with convenience methods

The expression

my_device.properties().compute_capability() >= cuda::make_compute_capability(50)

is a valid comparison, true for all devices with a Maxwell-or-later micro-architecture. This holds despite the fact that struct cuda::compute_capability_t is a POD type with two unsigned integer fields, not a scalar. Note that struct cuda::device::properties_t (which is essentially a struct cudaDeviceProp of the Runtime API itself) does not have a compute_capability field.
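As an illustration of how a comparison like this can work on a POD type, here is a small stand-alone sketch. The field names (major_part, minor_part), the raw_device_prop stand-in for cudaDeviceProp, and the cc_sketch namespace are all assumptions made for this example; the library's actual definitions may differ.

```cpp
#include <cassert>
#include <type_traits>

namespace cc_sketch {

// A POD type with two unsigned fields. Non-virtual member functions
// do not affect its being trivial and standard-layout.
struct compute_capability_t {
    unsigned major_part;
    unsigned minor_part;

    // Convenience method: e.g. 5.2 becomes 52
    unsigned as_combined_number() const { return major_part * 10 + minor_part; }
};

static_assert(std::is_trivial<compute_capability_t>::value &&
              std::is_standard_layout<compute_capability_t>::value,
              "compute_capability_t should remain a POD type");

inline compute_capability_t make_compute_capability(unsigned major_part, unsigned minor_part) {
    return compute_capability_t{major_part, minor_part};
}

// Builds from a combined number such as 50 (meaning 5.0)
inline compute_capability_t make_compute_capability(unsigned combined) {
    return compute_capability_t{combined / 10, combined % 10};
}

// Lexicographic comparison: major part first, then minor part
inline bool operator==(const compute_capability_t& a, const compute_capability_t& b) {
    return a.major_part == b.major_part && a.minor_part == b.minor_part;
}
inline bool operator<(const compute_capability_t& a, const compute_capability_t& b) {
    return a.major_part < b.major_part ||
           (a.major_part == b.major_part && a.minor_part < b.minor_part);
}
inline bool operator!=(const compute_capability_t& a, const compute_capability_t& b) { return !(a == b); }
inline bool operator>=(const compute_capability_t& a, const compute_capability_t& b) { return !(a < b); }
inline bool operator>(const compute_capability_t& a, const compute_capability_t& b)  { return b < a; }
inline bool operator<=(const compute_capability_t& a, const compute_capability_t& b) { return !(b < a); }

// Hypothetical stand-in for the C struct holding separate version fields
struct raw_device_prop {
    int major_version;
    int minor_version;
};

// The adorned struct adds no data members, only a convenience method
// that derives the compute capability from the existing fields.
struct properties_t : raw_device_prop {
    compute_capability_t compute_capability() const {
        return make_compute_capability(static_cast<unsigned>(major_version),
                                       static_cast<unsigned>(minor_version));
    }
};

} // namespace cc_sketch
```

Because properties_t adds only a method, it stays layout-compatible with the struct it extends, which is what lets a wrapper hand it straight to C-style code.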

Meaningful naming

Instead of using

cudaError_t cudaEventCreateWithFlags(
    cudaEvent_t* event,
    unsigned int flags)

which requires you to remember what you need to specify as flags and how, you construct a cuda::event_t proxy object, using the constructor

cuda::event_t::event_t(
    bool uses_blocking_sync,
    bool records_timing = event::do_record_timing,
    bool interprocess = event::not_interprocess)

The default values here are enum : bool's, which you should also use when constructing an event_t with non-default parameters. There is also the no-arguments event_t() constructor which calls cudaEventCreate without flags.
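The following sketch shows how named enum : bool constants could be translated into the flags word a C-style creation function expects. Only do_record_timing and not_interprocess appear in the text above; the other enumerator names, the make_flags helper, the locally defined flag constants and the event_sketch namespace are assumptions for illustration, defined here so the example compiles without CUDA.

```cpp
#include <cassert>

namespace event_sketch {
namespace event {

// Named boolean constants: call sites read as intent rather than true/false.
// (sync_by_busy_waiting, sync_by_blocking, dont_record_timing and
// is_interprocess are hypothetical names.)
enum : bool { sync_by_busy_waiting = false, sync_by_blocking = true };
enum : bool { dont_record_timing = false, do_record_timing = true };
enum : bool { not_interprocess = false, is_interprocess = true };

// Local stand-ins for the C API's flag bits
const unsigned blocking_sync_flag  = 0x1;
const unsigned disable_timing_flag = 0x2;
const unsigned interprocess_flag   = 0x4;

// Translates the three booleans into a single flags word. Note the
// inversion for timing: the C-style flag *disables* timing, while the
// C++-side parameter states positively whether timing is recorded.
inline unsigned make_flags(
    bool uses_blocking_sync,
    bool records_timing = do_record_timing,
    bool interprocess   = not_interprocess)
{
    assert(blocking_sync_flag != disable_timing_flag); // sanity check on the stand-in constants
    return (uses_blocking_sync ? blocking_sync_flag  : 0u)
         | (records_timing     ? 0u                  : disable_timing_flag)
         | (interprocess       ? interprocess_flag   : 0u);
}

} // namespace event
} // namespace event_sketch
```

A call such as event::make_flags(event::sync_by_blocking, event::dont_record_timing) documents itself at the call site, whereas the equivalent raw flags expression requires knowing which bits exist and that timing is opt-out rather than opt-in.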

Example programs

More detailed documentation / feature walk-through is forthcoming. For now I'm providing two kinds of short example programs; browsing their source you'll know essentially all there is to know about the API wrappers.

To build and run the examples (just as a sanity check), execute the following:

[user@host:/path/to/cuda-api-wrappers/]$ cmake . && make examples && examples/scripts/run-all-examples

Modified CUDA samples

The CUDA distribution contains sample programs demonstrating various features and concepts. A few of these - which are not focused on device-side work - have been adapted to use the API wrappers, completely forgoing direct use of the CUDA Runtime API itself. You will find them in the modified CUDA samples example programs folder.

'Coverage' test programs - by module of the Runtime API

An example program is gradually being added for each of the CUDA Runtime API modules, demonstrating how calls to that module's API can be replaced with use of the API wrappers. These per-module example programs can be found here.

Bugs, suggestions, feedback

I would like some help with building up documentation and perhaps a Wiki here; if you can spare the time - do write me. You can also do so if you're interested in collaborating on some related project or for general comments/feedback/suggestions.

If you notice a specific issue which needs addressing, especially any sort of bug or compilation error, please file the issue here on GitHub.