cuda-api-wrappers
Thin C++-flavored wrappers for the CUDA Runtime API
cuda::stream_t::enqueue_t Class Reference

A gadget through which commands are enqueued on the stream. More...

#include <stream.hpp>

Public Member Functions

template<typename KernelFunction , typename... KernelParameters>
void kernel_launch (const KernelFunction &kernel_function, launch_configuration_t launch_configuration, KernelParameters &&... parameters) const
 Schedule a kernel launch on the associated stream. More...
 
void type_erased_kernel_launch (const kernel_t &kernel, launch_configuration_t launch_configuration, span< const void *> marshalled_arguments) const
 Schedule a kernel launch on the associated stream. More...
 
void memset (void *start, int byte_value, size_t num_bytes) const
 Set all bytes of a certain region in device memory (or unified memory, but using the CUDA device to do it) to a single fixed value. More...
 
void memset (memory::region_t region, int byte_value) const
 Set all bytes of a certain region in device memory (or unified memory, but using the CUDA device to do it) to a single fixed value. More...
 
void memzero (void *start, size_t num_bytes) const
 Set all bytes of a certain region in device memory (or unified memory, but using the CUDA device to do it) to zero. More...
 
void memzero (memory::region_t region) const
 Set all bytes of a certain region in device memory (or unified memory, but using the CUDA device to do it) to zero. More...
 
event_t & event (event_t &existing_event) const
 Have an existing event 'fire', i.e. be marked as having occurred, once all previously-scheduled work on this stream has completed. More...
 
event_t event (bool uses_blocking_sync=event::sync_by_busy_waiting, bool records_timing=event::do_record_timings, bool interprocess=event::not_interprocess) const
 Create an event and have it 'fire', i.e. be marked as having occurred, once all previously-scheduled work on this stream has completed. More...
 
template<typename Invokable >
void host_invokable (Invokable &invokable) const
 Enqueues a host-invokable object - typically a function or a closure object - to be called on the host once all previously-scheduled work on the stream has completed.
 
void attach_managed_region (const void *managed_region_start, memory::managed::attachment_t attachment=memory::managed::attachment_t::single_stream) const
 Sets the attachment of a region of managed memory (i.e. memory in the address space visible on all CUDA devices and the host) to one of several supported attachment modes. More...
 
void attach_managed_region (memory::region_t region, memory::managed::attachment_t attachment=memory::managed::attachment_t::single_stream) const
 Sets the attachment of a region of managed memory (i.e. memory in the address space visible on all CUDA devices and the host) to one of several supported attachment modes. More...
 
void wait (const event_t &event_) const
 Will pause all further activity on the stream until the specified event has occurred (i.e. 'fired'). More...
 
template<typename T >
void set_single_value (T *__restrict__ ptr, T value, bool with_memory_barrier=true) const
 Schedule writing a single value to global device memory after all previous work has concluded. More...
 
template<typename T >
void wait (const T *address, stream::wait_condition_t condition, T value, bool with_memory_barrier=false) const
 Wait for a value in device global memory to change so as to meet some condition. More...
 
void flush_remote_writes () const
 Guarantee all remote writes to the specified address are visible to subsequent operations scheduled on this stream.
 
void copy (void *destination, const void *source, size_t num_bytes) const
 Copy operations. More...
 
void copy (void *destination, memory::const_region_t source, size_t num_bytes) const
 Copy operations.
 
void copy (memory::region_t destination, memory::const_region_t source, size_t num_bytes) const
 Copy operations. More...
 
void copy (memory::region_t destination, memory::const_region_t source) const
 Copy operations.
 
void copy (void *destination, memory::const_region_t source) const
 Copy operations.
 
template<typename Iterator >
void single_value_operations_batch (Iterator ops_begin, Iterator ops_end) const
 Enqueue multiple single-value write, wait and flush operations to the device (avoiding the overhead of multiple enqueue calls). More...
 
template<typename Container >
void single_value_operations_batch (const Container &single_value_ops) const
 

Detailed Description

A gadget through which commands are enqueued on the stream.

Note
this class exists solely as a form of "syntactic sugar", allowing for code such as

my_stream.enqueue.copy(foo, bar, my_size)
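
For example, a host-to-device copy, a kernel launch and a device-to-host copy can all be scheduled in order through this gadget. The following is a sketch only: it assumes a CUDA device is present, the names my_kernel, d_data and h_data are illustrative rather than part of this class, and the umbrella header path differs between library versions:

```cpp
#include <cuda/runtime_api.hpp>  // header path may differ by library version

// Hypothetical kernel, used only for illustration
__global__ void my_kernel(int* data) { data[threadIdx.x] += 1; }

void example()
{
    auto device = cuda::device::current::get();
    auto stream = device.create_stream(cuda::stream::async);
    auto d_data = cuda::memory::device::make_unique<int[]>(device, 256);
    int h_data[256] = {};

    // All of these are enqueued asynchronously, in stream order
    stream.enqueue.copy(d_data.get(), h_data, sizeof(h_data));
    stream.enqueue.kernel_launch(
        my_kernel, cuda::make_launch_config(1, 256), d_data.get());
    stream.enqueue.copy(h_data, d_data.get(), sizeof(h_data));
    stream.synchronize();  // block the host until all enqueued work completes
}
```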

Member Function Documentation

◆ attach_managed_region() [1/2]

void cuda::stream_t::enqueue_t::attach_managed_region ( const void *  managed_region_start,
memory::managed::attachment_t  attachment = memory::managed::attachment_t::single_stream 
) const
inline

Sets the attachment of a region of managed memory (i.e. memory in the address space visible on all CUDA devices and the host) to one of several supported attachment modes.

Parameters
managed_region_start: a pointer to the beginning of the managed memory region. This cannot be a pointer to anywhere in the middle of an allocated region - you must pass whatever cuda::memory::managed::allocate() returned.

The attachment is actually a commitment vis-a-vis the CUDA driver and the GPU itself that it doesn't need to worry about accesses to this memory from devices other than its object of attachment, so that the driver can optimize scheduling accordingly.

Note
by default, the memory region is attached to this specific stream on its specific device. In this case, the host will be allowed to read from this memory region whenever no kernels are pending on this stream.
Attachment happens asynchronously, as an operation on this stream, i.e. the attachment takes effect (some time) after previously-scheduled actions have concluded.
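
A sketch of restricting a managed allocation to a single stream (the allocate() call's exact signature is an assumption and may differ between library versions):

```cpp
// Sketch: attach a managed-memory region to this stream only, letting the
// driver schedule on the assumption that no other device will access it.
auto region = cuda::memory::managed::allocate(num_bytes);  // signature assumed
stream.enqueue.attach_managed_region(
    region, cuda::memory::managed::attachment_t::single_stream);
// ... enqueue kernels using the region on this stream only ...
```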

◆ attach_managed_region() [2/2]

void cuda::stream_t::enqueue_t::attach_managed_region ( memory::region_t  region,
memory::managed::attachment_t  attachment = memory::managed::attachment_t::single_stream 
) const
inline

Sets the attachment of a region of managed memory (i.e. memory in the address space visible on all CUDA devices and the host) to one of several supported attachment modes.

Parameters
region: the entire managed memory region; note this must not be a sub-region - you must pass whatever the CUDA memory allocation or construction code provided you with, in full.

The attachment is actually a commitment vis-a-vis the CUDA driver and the GPU itself that it doesn't need to worry about accesses to this memory from devices other than its object of attachment, so that the driver can optimize scheduling accordingly.

Note
by default, the memory region is attached to this specific stream on its specific device. In this case, the host will be allowed to read from this memory region whenever no kernels are pending on this stream.
Attachment happens asynchronously, as an operation on this stream, i.e. the attachment takes effect (some time) after previously-scheduled actions have concluded.

◆ copy() [1/2]

void cuda::stream_t::enqueue_t::copy ( void *  destination,
const void *  source,
size_t  num_bytes 
) const
inline

Copy operations.

Schedule a copy of one region of memory to another. The source and destination memory regions may be anywhere the CUDA driver can map (e.g. the device's global memory, host/system memory, another device's global memory, constant memory, etc.).

◆ copy() [2/2]

void cuda::stream_t::enqueue_t::copy ( memory::region_t  destination,
memory::const_region_t  source,
size_t  num_bytes 
) const
inline

Copy operations.

Note
num_bytes may be smaller than the sizes of any of the regions
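
For instance, the region-based overload can copy just a prefix of a region (dst_region and src_region are hypothetical, previously-obtained memory::region_t and memory::const_region_t values):

```cpp
// Sketch: copy only the first half of the source region; num_bytes may
// be smaller than the size of either region.
stream.enqueue.copy(dst_region, src_region, src_region.size() / 2);
```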

◆ event() [1/2]

event_t & cuda::stream_t::enqueue_t::event ( event_t &  existing_event) const
inline

Have an event 'fire', i.e.

be marked as having occurred, after all previously-scheduled work on this stream has been completed. Threads which are waiting on the event (via the wait method) will become available for continued execution.

Parameters
existing_event: a pre-created CUDA event (for the stream's device); any existing "registration" of the event to occur elsewhere is overwritten.
Note
It is possible to wait for events across devices, but it is not possible to trigger events across devices.

◆ event() [2/2]

event_t cuda::stream_t::enqueue_t::event ( bool  uses_blocking_sync = event::sync_by_busy_waiting,
bool  records_timing = event::do_record_timings,
bool  interprocess = event::not_interprocess 
) const
inline

Have an event 'fire', i.e.

be marked as having occurred, after all previously-scheduled work on this stream has been completed. Threads which are waiting on the event (via the wait method) will become available for continued execution.

Note
the parameters are the same as for event::create()
It is possible to wait for events across devices, but it is not possible to trigger events across devices.
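
A sketch of firing an event at the current point in a stream's schedule, then having the host block on it (event_t::synchronize() is assumed to be the library's host-side wait):

```cpp
// Sketch: create-and-record an event after the work enqueued so far
auto ev = stream.enqueue.event();
// ... enqueue further, unrelated work on the stream ...
ev.synchronize();  // host blocks until the event has fired
```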

◆ kernel_launch()

template<typename KernelFunction , typename... KernelParameters>
void cuda::stream_t::enqueue_t::kernel_launch ( const KernelFunction &  kernel_function,
launch_configuration_t  launch_configuration,
KernelParameters &&...  parameters 
) const
inline

Schedule a kernel launch on the associated stream.

Parameters
kernel_function: the kernel to launch, or a wrapper around it
launch_configuration: a description of how to launch the kernel (e.g. block and grid dimensions)
parameters: the arguments to be passed to the kernel for this launch
Note
This function is cognizant of the types of all arguments passed to it; for a type-erased version, see type_erased_kernel_launch()

◆ memset() [1/2]

void cuda::stream_t::enqueue_t::memset ( void *  start,
int  byte_value,
size_t  num_bytes 
) const
inline

Set all bytes of a certain region in device memory (or unified memory, but using the CUDA device to do it) to a single fixed value.

Parameters
start: beginning of the region to fill
byte_value: the value with which to fill the region's bytes
num_bytes: size in bytes of the region to fill
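
For instance (d_buffer and buffer_size are hypothetical; d_buffer points into device memory):

```cpp
// Sketch: fill a device buffer with 0xFF, then zero it, both enqueued
// asynchronously in stream order.
stream.enqueue.memset(d_buffer, 0xFF, buffer_size);
stream.enqueue.memzero(d_buffer, buffer_size);
stream.synchronize();  // wait for both operations to complete
```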

◆ memset() [2/2]

void cuda::stream_t::enqueue_t::memset ( memory::region_t  region,
int  byte_value 
) const
inline

Set all bytes of a certain region in device memory (or unified memory, but using the CUDA device to do it) to a single fixed value.

Parameters
region: the memory region to fill
byte_value: the value with which to fill the region's bytes

◆ memzero() [1/2]

void cuda::stream_t::enqueue_t::memzero ( void *  start,
size_t  num_bytes 
) const
inline

Set all bytes of a certain region in device memory (or unified memory, but using the CUDA device to do it) to zero.

Note
this is a separate method, since the CUDA runtime has a separate API call for setting to zero; does that mean there are special facilities for zero'ing memory faster? Who knows.
Parameters
start: beginning of the region to fill
num_bytes: size in bytes of the region to fill

◆ memzero() [2/2]

void cuda::stream_t::enqueue_t::memzero ( memory::region_t  region) const
inline

Set all bytes of a certain region in device memory (or unified memory, but using the CUDA device to do it) to zero.

Note
this is a separate method, since the CUDA runtime has a separate API call for setting to zero; does that mean there are special facilities for zero'ing memory faster? Who knows.
Parameters
region: the memory region to fill

◆ set_single_value()

template<typename T >
void cuda::stream_t::enqueue_t::set_single_value ( T *__restrict__  ptr,
T  value,
bool  with_memory_barrier = true 
) const
inline

Schedule writing a single value to global device memory after all previous work has concluded.

Template Parameters
T: the type of the value to be written; can only be a raw uint32_t or uint64_t!
Parameters
ptr: the location in global device memory to write to at the appropriate time
value: the value to write at ptr
with_memory_barrier: if false, allows reordering of this write operation with writes scheduled before it
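
For instance, writing a "done" flag that a polling peer can observe (flag_ptr is a hypothetical device pointer to a uint32_t):

```cpp
// Sketch: after all previously-enqueued work on the stream, write 1 to
// *flag_ptr; the default memory barrier keeps this write ordered after
// earlier writes.
stream.enqueue.set_single_value(flag_ptr, uint32_t{1});
```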

◆ single_value_operations_batch() [1/2]

template<typename Iterator >
void cuda::stream_t::enqueue_t::single_value_operations_batch ( Iterator  ops_begin,
Iterator  ops_end 
) const
inline

Enqueue multiple single-value write, wait and flush operations to the device (avoiding the overhead of multiple enqueue calls).

Note
see wait(), set_single_value() and flush_remote_writes().
Parameters
ops_begin: beginning of a sequence of single-value operation specifications
ops_end: end of a sequence of single-value operation specifications

◆ single_value_operations_batch() [2/2]

template<typename Container >
void cuda::stream_t::enqueue_t::single_value_operations_batch ( const Container &  single_value_ops) const
inline
Parameters
single_value_ops: a sequence of single-value operation specifiers to enqueue together.

◆ type_erased_kernel_launch()

void cuda::stream_t::enqueue_t::type_erased_kernel_launch ( const kernel_t &  kernel,
launch_configuration_t  launch_configuration,
span< const void *>  marshalled_arguments 
) const
inline

Schedule a kernel launch on the associated stream.

Parameters
kernel: a wrapper around the kernel to launch
launch_configuration: a description of how to launch the kernel (e.g. block and grid dimensions)
marshalled_arguments: pointers to the arguments to be passed to the kernel for this launch
Note
This signature does not require any type information regarding the kernel function; see kernel_launch() for a type-observing version of the same scheduling operation.
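
A sketch of the type-erased launch path (my_kernel_wrapper is a hypothetical kernel_t; constructing the span from a pointer and a count is an assumption about this library's span type):

```cpp
// Sketch: marshal argument pointers manually, then launch without
// compile-time knowledge of the kernel's parameter types.
const void* args[] = { &d_data };  // d_data: a hypothetical device pointer
stream.enqueue.type_erased_kernel_launch(
    my_kernel_wrapper,
    cuda::make_launch_config(1, 256),
    { args, 1 });  // span over the marshalled argument pointers
```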

◆ wait() [1/2]

void cuda::stream_t::enqueue_t::wait ( const event_t &  event_) const
inline

Will pause all further activity on the stream until the specified event has occurred (i.e.

has 'fired': all work scheduled, before the event's recording, on the stream on which it was recorded, has completed).

Note
this call will not delay any already-enqueued work on the stream, only work enqueued after the call.
Parameters
event_: the event for whose occurrence to wait; the event would typically be recorded on another stream.

◆ wait() [2/2]

template<typename T >
void cuda::stream_t::enqueue_t::wait ( const T *  address,
stream::wait_condition_t  condition,
T  value,
bool  with_memory_barrier = false 
) const
inline

Wait for a value in device global memory to change so as to meet some condition.

Template Parameters
T: the type of the value to wait on; can only be a raw uint32_t or uint64_t!
Parameters
address: the location in global device memory whose value to check
condition: the kind of condition to check against the reference value, e.g. equal to 5, greater-or-equal to 5, non-zero bitwise-and with 5, etc.
value: the reference value the condition is checked against, e.g. waiting for the value at address to become greater-or-equal to this value
with_memory_barrier: if true, all remote writes guaranteed to have reached the device before the wait is performed will be visible to all operations on this stream/queue scheduled after the wait
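
For instance, stalling the stream until a device-memory flag reaches a threshold (flag_ptr is hypothetical, and the wait_condition_t enumerator name is an assumption - check stream.hpp for the actual names):

```cpp
// Sketch: no work enqueued after this call runs until *flag_ptr >= 1.
stream.enqueue.wait(
    flag_ptr,
    cuda::stream::wait_condition_t::greater_or_equal,  // enumerator name assumed
    uint32_t{1});
```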

The documentation for this class was generated from the following file: stream.hpp