cuda-api-wrappers
Thin C++-flavored wrappers for the CUDA Runtime API
cuda::stream_t::enqueue_t Class Reference

A gadget through which commands are enqueued on the stream. More...

#include <stream.hpp>

Public Member Functions

template<typename KernelFunction , typename... KernelParameters>
void kernel_launch (const KernelFunction &kernel_function, launch_configuration_t launch_configuration, KernelParameters &&... parameters) const
 Schedule a kernel launch on the associated stream. More...
 
void type_erased_kernel_launch (const kernel_t &kernel, launch_configuration_t launch_configuration, span< const void *> marshalled_arguments) const
 Schedule a kernel launch on the associated stream. More...
 
void memset (void *start, int byte_value, size_t num_bytes) const
 Set all bytes of a certain region in device memory (or unified memory, but using the CUDA device to do it) to a single fixed value. More...
 
void memset (memory::region_t region, int byte_value) const
 Set all bytes of a certain region in device memory (or unified memory, but using the CUDA device to do it) to a single fixed value. More...
 
void memzero (void *start, size_t num_bytes) const
 Set all bytes of a certain region in device memory (or unified memory, but using the CUDA device to do it) to zero. More...
 
void memzero (memory::region_t region) const
 Set all bytes of a certain region in device memory (or unified memory, but using the CUDA device to do it) to zero. More...
 
event_t & event (event_t &existing_event) const
 Have an existing event 'fire', i.e. be marked as having occurred, once all previously-scheduled work on this stream has completed. More...
 
event_t event (bool uses_blocking_sync=event::sync_by_busy_waiting, bool records_timing=event::do_record_timings, bool interprocess=event::not_interprocess) const
 Create an event and have it 'fire', i.e. be marked as having occurred, once all previously-scheduled work on this stream has completed. More...
 
template<typename Invokable >
void host_invokable (Invokable &invokable) const
 Enqueues a host-invokable object - typically a function or a closure object - to be called on the host once all previously-scheduled work on the stream has completed.
 
void attach_managed_region (const void *managed_region_start, memory::managed::attachment_t attachment=memory::managed::attachment_t::single_stream) const
 Sets the attachment of a region of managed memory (i.e. memory in the address space visible on all CUDA devices and the host) to one of several supported attachment modes. More...
 
void attach_managed_region (memory::region_t region, memory::managed::attachment_t attachment=memory::managed::attachment_t::single_stream) const
 Sets the attachment of a region of managed memory (i.e. memory in the address space visible on all CUDA devices and the host) to one of several supported attachment modes. More...
 
void wait (const event_t &event_) const
 Will pause all further activity on the stream until the specified event has occurred (i.e. 'fired'). More...
 
template<typename T >
void set_single_value (T *__restrict__ ptr, T value, bool with_memory_barrier=true) const
 Schedule writing a single value to global device memory after all previous work has concluded. More...
 
template<typename T >
void wait (const T *address, stream::wait_condition_t condition, T value, bool with_memory_barrier=false) const
 Wait for a value in device global memory to change so as to meet some condition. More...
 
void flush_remote_writes () const
 Guarantee all remote writes to the specified address are visible to subsequent operations scheduled on this stream.
 
void copy (void *destination, const void *source, size_t num_bytes) const
 Copy operations. More...
 
void copy (void *destination, memory::const_region_t source, size_t num_bytes) const
 Copy operations.
 
void copy (memory::region_t destination, memory::const_region_t source, size_t num_bytes) const
 Copy operations. More...
 
void copy (memory::region_t destination, memory::const_region_t source) const
 Copy operations.
 
void copy (void *destination, memory::const_region_t source) const
 Copy operations.
 
template<typename Iterator >
void single_value_operations_batch (Iterator ops_begin, Iterator ops_end) const
 Enqueue multiple single-value write, wait and flush operations to the device (avoiding the overhead of multiple enqueue calls). More...
 
template<typename Container >
void single_value_operations_batch (const Container &single_value_ops) const
 

Detailed Description

A gadget through which commands are enqueued on the stream.

Note
this class exists solely as a form of "syntactic sugar", allowing for code such as

my_stream.enqueue.copy(foo, bar, my_size)
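
For example, a host-to-device copy, a kernel launch and a device-to-host copy can all be scheduled in order through this gadget. The following is a sketch only: it assumes a CUDA device is present, the names my_kernel, d_data and h_data are illustrative rather than part of this class, and the umbrella header path differs between library versions:

```cpp
#include <cuda/runtime_api.hpp>  // header path may differ by library version

// Hypothetical kernel, used only for illustration
__global__ void my_kernel(int* data) { data[threadIdx.x] += 1; }

void example()
{
    auto device = cuda::device::current::get();
    auto stream = device.create_stream(cuda::stream::async);
    auto d_data = cuda::memory::device::make_unique<int[]>(device, 256);
    int h_data[256] = {};

    // All of these are enqueued asynchronously, in stream order
    stream.enqueue.copy(d_data.get(), h_data, sizeof(h_data));
    stream.enqueue.kernel_launch(
        my_kernel, cuda::make_launch_config(1, 256), d_data.get());
    stream.enqueue.copy(h_data, d_data.get(), sizeof(h_data));
    stream.synchronize();  // block the host until all enqueued work completes
}
```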

Member Function Documentation

◆ attach_managed_region() [1/2]

void cuda::stream_t::enqueue_t::attach_managed_region ( const void *  managed_region_start,
memory::managed::attachment_t  attachment = memory::managed::attachment_t::single_stream 
) const
inline

Sets the attachment of a region of managed memory (i.e. memory in the address space visible on all CUDA devices and the host) to one of several supported attachment modes.

Parameters
managed_region_start: a pointer to the beginning of the managed memory region. This cannot be a pointer to anywhere in the middle of an allocated region - you must pass whatever cuda::memory::managed::allocate() returned.

The attachment is actually a commitment vis-a-vis the CUDA driver and the GPU itself that it doesn't need to worry about accesses to this memory from devices other than its object of attachment, so that the driver can optimize scheduling accordingly.

Note
by default, the memory region is attached to this specific stream on its specific device. In this case, the host will be allowed to read from this memory region whenever no kernels are pending on this stream.
Attachment happens asynchronously, as an operation on this stream, i.e. the attachment takes effect (some time) after previously-scheduled actions have concluded.
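
A sketch of restricting a managed allocation to a single stream (the allocate() call's exact signature is an assumption and may differ between library versions):

```cpp
// Sketch: attach a managed-memory region to this stream only, letting the
// driver schedule on the assumption that no other device will access it.
auto region = cuda::memory::managed::allocate(num_bytes);  // signature assumed
stream.enqueue.attach_managed_region(
    region, cuda::memory::managed::attachment_t::single_stream);
// ... enqueue kernels using the region on this stream only ...
```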

◆ attach_managed_region() [2/2]

void cuda::stream_t::enqueue_t::attach_managed_region ( memory::region_t  region,
memory::managed::attachment_t  attachment = memory::managed::attachment_t::single_stream 
) const
inline

Sets the attachment of a region of managed memory (i.e. memory in the address space visible on all CUDA devices and the host) to one of several supported attachment modes.

Parameters
region: the entire managed memory region; note this must not be a sub-region - you must pass whatever the CUDA memory allocation or construction code provided you with, in full.

The attachment is actually a commitment vis-a-vis the CUDA driver and the GPU itself that it doesn't need to worry about accesses to this memory from devices other than its object of attachment, so that the driver can optimize scheduling accordingly.

Note
by default, the memory region is attached to this specific stream on its specific device. In this case, the host will be allowed to read from this memory region whenever no kernels are pending on this stream.
Attachment happens asynchronously, as an operation on this stream, i.e. the attachment takes effect (some time) after previously-scheduled actions have concluded.

◆ copy() [1/2]

void cuda::stream_t::enqueue_t::copy ( void *  destination,
const void *  source,
size_t  num_bytes 
) const
inline

Copy operations.

Schedule a copy of one region of memory to another. The source and destination memory regions may be anywhere the CUDA driver can map (e.g. the device's global memory, host/system memory, another device's global memory, constant memory, etc.).

◆ copy() [2/2]

void cuda::stream_t::enqueue_t::copy ( memory::region_t  destination,
memory::const_region_t  source,
size_t  num_bytes 
) const
inline

Copy operations.

Note
num_bytes may be smaller than the sizes of any of the regions
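
For instance, the region-based overload can copy just a prefix of a region (dst_region and src_region are hypothetical, previously-obtained memory::region_t and memory::const_region_t values):

```cpp
// Sketch: copy only the first half of the source region; num_bytes may
// be smaller than the size of either region.
stream.enqueue.copy(dst_region, src_region, src_region.size() / 2);
```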

◆ event() [1/2]

event_t & cuda::stream_t::enqueue_t::event ( event_t &  existing_event) const
inline

Have an event 'fire', i.e.

be marked as having occurred, after all previously-scheduled work on this stream has been completed. Threads which are waiting on the event (via the wait method) will become available for continued execution.

Parameters
existing_event: a pre-created CUDA event (for the stream's device); any existing "registration" of the event to occur elsewhere is overwritten.
Note
It is possible to wait for events across devices, but it is not possible to trigger events across devices.

◆ event() [2/2]

event_t cuda::stream_t::enqueue_t::event ( bool  uses_blocking_sync = event::sync_by_busy_waiting,
bool  records_timing = event::do_record_timings,
bool  interprocess = event::not_interprocess 
) const
inline

Have an event 'fire', i.e.

be marked as having occurred, after all previously-scheduled work on this stream has been completed. Threads which are waiting on the event (via the wait method) will become available for continued execution.

Note
the parameters are the same as for event::create()
It is possible to wait for events across devices, but it is not possible to trigger events across devices.
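
A sketch of firing an event at the current point in a stream's schedule, then having the host block on it (event_t::synchronize() is assumed to be the library's host-side wait):

```cpp
// Sketch: create-and-record an event after the work enqueued so far
auto ev = stream.enqueue.event();
// ... enqueue further, unrelated work on the stream ...
ev.synchronize();  // host blocks until the event has fired
```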

◆ kernel_launch()

template<typename KernelFunction , typename... KernelParameters>
void cuda::stream_t::enqueue_t::kernel_launch ( const KernelFunction &  kernel_function,
launch_configuration_t  launch_configuration,
KernelParameters &&...  parameters 
) const
inline

Schedule a kernel launch on the associated stream.

Parameters
kernel_function: the kernel to launch, or a wrapper around it
launch_configuration: a description of how to launch the kernel (e.g. block and grid dimensions)
parameters: the arguments to be passed to the kernel for this launch
Note
This function is cognizant of the types of all arguments passed to it; for a type-erased version, see type_erased_kernel_launch()

◆ memset() [1/2]

void cuda::stream_t::enqueue_t::memset ( void *  start,
int  byte_value,
size_t  num_bytes 
) const
inline

Set all bytes of a certain region in device memory (or unified memory, but using the CUDA device to do it) to a single fixed value.

Parameters
start: beginning of the region to fill
byte_value: the value with which to fill the region's bytes
num_bytes: size in bytes of the region to fill
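
For instance (d_buffer and buffer_size are hypothetical; d_buffer points into device memory):

```cpp
// Sketch: fill a device buffer with 0xFF, then zero it, both enqueued
// asynchronously in stream order.
stream.enqueue.memset(d_buffer, 0xFF, buffer_size);
stream.enqueue.memzero(d_buffer, buffer_size);
stream.synchronize();  // wait for both operations to complete
```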

◆ memset() [2/2]

void cuda::stream_t::enqueue_t::memset ( memory::region_t  region,
int  byte_value 
) const
inline

Set all bytes of a certain region in device memory (or unified memory, but using the CUDA device to do it) to a single fixed value.

Parameters
region: the memory region to fill
byte_value: the value with which to fill the region's bytes

◆ memzero() [1/2]

void cuda::stream_t::enqueue_t::memzero ( void *  start,
size_t  num_bytes 
) const
inline

Set all bytes of a certain region in device memory (or unified memory, but using the CUDA device to do it) to zero.

Note
this is a separate method, since the CUDA runtime has a separate API call for setting to zero; does that mean there are special facilities for zero'ing memory faster? Who knows.
Parameters
start: beginning of the region to fill
num_bytes: size in bytes of the region to fill

◆ memzero() [2/2]

void cuda::stream_t::enqueue_t::memzero ( memory::region_t  region) const
inline

Set all bytes of a certain region in device memory (or unified memory, but using the CUDA device to do it) to zero.

Note
this is a separate method, since the CUDA runtime has a separate API call for setting to zero; does that mean there are special facilities for zero'ing memory faster? Who knows.
Parameters
region: the memory region to fill

◆ set_single_value()

template<typename T >
void cuda::stream_t::enqueue_t::set_single_value ( T *__restrict__  ptr,
T  value,
bool  with_memory_barrier = true 
) const
inline

Schedule writing a single value to global device memory after all previous work has concluded.

Template Parameters
T: the type of the value to be written; can only be a raw uint32_t or uint64_t!
Parameters
ptr: the location in global device memory to write to at the appropriate time
value: the value to write at ptr
with_memory_barrier: if false, allows reordering of this write operation with writes scheduled before it
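
For instance, writing a "done" flag that a polling peer can observe (flag_ptr is a hypothetical device pointer to a uint32_t):

```cpp
// Sketch: after all previously-enqueued work on the stream, write 1 to
// *flag_ptr; the default memory barrier keeps this write ordered after
// earlier writes.
stream.enqueue.set_single_value(flag_ptr, uint32_t{1});
```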

◆ single_value_operations_batch() [1/2]

template<typename Iterator >
void cuda::stream_t::enqueue_t::single_value_operations_batch ( Iterator  ops_begin,
Iterator  ops_end 
) const
inline

Enqueue multiple single-value write, wait and flush operations to the device (avoiding the overhead of multiple enqueue calls).

Note
see wait(), set_single_value() and flush_remote_writes().
Parameters
ops_begin: beginning of a sequence of single-value operation specifications
ops_end: end of a sequence of single-value operation specifications

◆ single_value_operations_batch() [2/2]

template<typename Container >
void cuda::stream_t::enqueue_t::single_value_operations_batch ( const Container &  single_value_ops) const
inline
Parameters
single_value_ops: a sequence of single-value operation specifiers to enqueue together.

◆ type_erased_kernel_launch()

void cuda::stream_t::enqueue_t::type_erased_kernel_launch ( const kernel_t &  kernel,
launch_configuration_t  launch_configuration,
span< const void *>  marshalled_arguments 
) const
inline

Schedule a kernel launch on the associated stream.

Parameters
kernel: a wrapper around the kernel to launch
launch_configuration: a description of how to launch the kernel (e.g. block and grid dimensions)
marshalled_arguments: pointers to the arguments to be passed to the kernel for this launch
Note
This signature does not require any type information regarding the kernel function; see kernel_launch() for a type-observing version of the same scheduling operation.
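
A sketch of the type-erased launch path (my_kernel_wrapper is a hypothetical kernel_t; constructing the span from a pointer and a count is an assumption about this library's span type):

```cpp
// Sketch: marshal argument pointers manually, then launch without
// compile-time knowledge of the kernel's parameter types.
const void* args[] = { &d_data };  // d_data: a hypothetical device pointer
stream.enqueue.type_erased_kernel_launch(
    my_kernel_wrapper,
    cuda::make_launch_config(1, 256),
    { args, 1 });  // span over the marshalled argument pointers
```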

◆ wait() [1/2]

void cuda::stream_t::enqueue_t::wait ( const event_t &  event_) const
inline

Will pause all further activity on the stream until the specified event has occurred (i.e.

has 'fired': all work scheduled, before the event's recording, on the stream on which it was recorded, has completed).

Note
this call will not delay any already-enqueued work on the stream, only work enqueued after the call.
Parameters
event_: the event for whose occurrence to wait; the event would typically be recorded on another stream.

◆ wait() [2/2]

template<typename T >
void cuda::stream_t::enqueue_t::wait ( const T *  address,
stream::wait_condition_t  condition,
T  value,
bool  with_memory_barrier = false 
) const
inline

Wait for a value in device global memory to change so as to meet some condition.

Template Parameters
T: the type of the value to wait on; can only be a raw uint32_t or uint64_t!
Parameters
address: the location in global device memory whose value to check
condition: the kind of condition to check against the reference value, e.g. equal to 5, greater-or-equal to 5, non-zero bitwise-and with 5, etc.
value: the reference value the condition is checked against, e.g. waiting for the value at address to become greater-or-equal to this value
with_memory_barrier: if true, all remote writes guaranteed to have reached the device before the wait is performed will be visible to all operations on this stream/queue scheduled after the wait
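
For instance, stalling the stream until a device-memory flag reaches a threshold (flag_ptr is hypothetical, and the wait_condition_t enumerator name is an assumption - check stream.hpp for the actual names):

```cpp
// Sketch: no work enqueued after this call runs until *flag_ptr >= 1.
stream.enqueue.wait(
    flag_ptr,
    cuda::stream::wait_condition_t::greater_or_equal,  // enumerator name assumed
    uint32_t{1});
```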

The documentation for this class was generated from the following file: stream.hpp