rocPRIM
Warp module

Namespaces

 detail
 Deprecated: Configuration of device-level scan primitives.
 

Classes

class  warp_exchange< T, ItemsPerThread, WarpSize >
 The warp_exchange class is a warp level parallel primitive which provides methods for rearranging items partitioned across threads in a warp. More...
 
class  warp_load< T, ItemsPerThread, WarpSize, Method >
 The warp_load class is a warp level parallel primitive which provides methods for loading data from continuous memory into a blocked arrangement of items across a warp. More...
 
class  warp_load< T, ItemsPerThread, WarpSize, warp_load_method::warp_load_striped >
 
class  warp_load< T, ItemsPerThread, WarpSize, warp_load_method::warp_load_vectorize >
 
class  warp_load< T, ItemsPerThread, WarpSize, warp_load_method::warp_load_transpose >
 
class  warp_reduce< T, WarpSize, UseAllReduce >
 The warp_reduce class is a warp level parallel primitive which provides methods for performing reduction operations on items partitioned across threads in a hardware warp. More...
 
class  warp_scan< T, WarpSize >
 The warp_scan class is a warp level parallel primitive which provides methods for performing inclusive and exclusive scan operations of items partitioned across threads in a hardware warp; a usage sketch follows this class list. More...
 
class  warp_sort< Key, WarpSize, Value >
 The warp_sort class provides warp-wide methods for computing a parallel sort of items across thread warps. More...
 
class  warp_store< T, ItemsPerThread, WarpSize, Method >
 The warp_store class is a warp level parallel primitive which provides methods for storing an arrangement of items into a blocked/striped arrangement on continuous memory. More...
 
class  warp_store< T, ItemsPerThread, WarpSize, warp_store_method::warp_store_striped >
 
class  warp_store< T, ItemsPerThread, WarpSize, warp_store_method::warp_store_vectorize >
 
class  warp_store< T, ItemsPerThread, WarpSize, warp_store_method::warp_store_transpose >
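
The classes above share a common usage pattern: specialize the class template, allocate its storage_type in shared memory, and call the member function from every thread of the (logical) warp. As a minimal, non-authoritative sketch of that pattern (the one referenced in the warp_scan entry above), the kernel below performs an inclusive prefix sum with warp_scan. The block size of 64 threads, the logical warp size of 16, the kernel name and the umbrella header rocprim/rocprim.hpp are assumptions made for this example only.

#include <hip/hip_runtime.h>
#include <rocprim/rocprim.hpp>

// Illustrative kernel: each logical warp of 16 threads computes an inclusive
// prefix sum of one int per thread. Assumes a block size of 64 threads.
__global__ void inclusive_scan_kernel(const int* input, int* output)
{
    // Specialize warp_scan for int and a logical warp size of 16.
    using warp_scan_int = rocprim::warp_scan<int, 16>;
    // One storage_type instance per logical warp (64 / 16 = 4).
    __shared__ warp_scan_int::storage_type temp[4];

    const unsigned int flat_id         = blockIdx.x * blockDim.x + threadIdx.x;
    const unsigned int logical_warp_id = threadIdx.x / 16;

    int value = input[flat_id];
    // Inclusive scan within the logical warp; rocprim::plus<int> is the default operator.
    warp_scan_int().inclusive_scan(value, value, temp[logical_warp_id]);
    output[flat_id] = value;
}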
 

Enumerations

enum  warp_load_method {
  warp_load_method::warp_load_direct, warp_load_method::warp_load_striped, warp_load_method::warp_load_vectorize, warp_load_method::warp_load_transpose,
  warp_load_method::default_method = warp_load_direct
}
 warp_load_method enumerates the methods available to load data from continuous memory into a blocked/striped arrangement of items across the warp. More...
 
enum  warp_store_method {
  warp_store_method::warp_store_direct, warp_store_method::warp_store_striped, warp_store_method::warp_store_vectorize, warp_store_method::warp_store_transpose,
  warp_store_method::default_method = warp_store_direct
}
 warp_store_method enumerates the methods available to store a blocked/striped arrangement of items into a blocked/striped arrangement in continuous memory. More...
 

Functions

template<class T >
ROCPRIM_DEVICE ROCPRIM_INLINE T warp_shuffle (const T &input, const int src_lane, const int width=device_warp_size())
 Shuffle for any data type. More...
 
template<class T >
ROCPRIM_DEVICE ROCPRIM_INLINE T warp_shuffle_up (const T &input, const unsigned int delta, const int width=device_warp_size())
 Shuffle up for any data type. More...
 
template<class T >
ROCPRIM_DEVICE ROCPRIM_INLINE T warp_shuffle_down (const T &input, const unsigned int delta, const int width=device_warp_size())
 Shuffle down for any data type. More...
 
template<class T >
ROCPRIM_DEVICE ROCPRIM_INLINE T warp_shuffle_xor (const T &input, const int lane_mask, const int width=device_warp_size())
 Shuffle XOR for any data type. More...
 
template<typename T >
ROCPRIM_DEVICE ROCPRIM_INLINE T warp_permute (const T &input, const int dst_lane, const int width=device_warp_size())
 Permute items across the threads in a warp. More...
 


Enumeration Type Documentation

◆ warp_load_method

enum class warp_load_method

warp_load_method enumerates the methods available to load data from continuous memory into a blocked/striped arrangement of items across the warp.

Enumerator
warp_load_direct 

Data from continuous memory is loaded into a blocked arrangement of items.

Performance Notes:
  • Performance decreases with increasing number of items per thread (stride between reads), because of reduced memory coalescing.
warp_load_striped 

A striped arrangement of data is read directly from memory.

warp_load_vectorize 

Data from continuous memory is loaded into a blocked arrangement of items using vectorization as an optimization.

Performance Notes:
  • Performance remains high due to increased memory coalescing, provided that vectorization requirements are fulfilled. Otherwise, performance will default to warp_load_direct.
Requirements:
  • The input offset (block_input) must be quad-item aligned.
  • The following conditions will prevent vectorization and switch to default warp_load_direct:
    • ItemsPerThread is odd.
    • The datatype T is not a primitive or a HIP vector type (e.g. int2, int4, etc.).
warp_load_transpose 

A striped arrangement of data from continuous memory is locally transposed into a blocked arrangement of items.

Performance Notes:
  • Performance remains high due to increased memory coalescing, regardless of the number of items per thread.
  • Performance may be better compared to warp_load_direct and warp_load_vectorize due to reordering on local memory.
default_method 

Defaults to warp_load_direct.
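
As a hedged illustration of selecting one of these methods, the sketch below chooses warp_load_transpose through the Method template parameter of rocprim::warp_load. The block size (128 threads), logical warp size (32), items per thread (4) and the kernel name are assumptions made for this example only.

#include <hip/hip_runtime.h>
#include <rocprim/rocprim.hpp>

// Illustrative kernel: each warp of 32 threads loads 32 * 4 consecutive ints
// from global memory and ends up with a blocked arrangement (4 consecutive
// items per thread). Assumes a block size of 128 threads.
__global__ void load_kernel(const int* input)
{
    constexpr unsigned int warp_size        = 32;
    constexpr unsigned int items_per_thread = 4;
    constexpr unsigned int warps_per_block  = 128 / warp_size;

    using warp_load_int = rocprim::warp_load<int, items_per_thread, warp_size,
                                             rocprim::warp_load_method::warp_load_transpose>;
    // warp_load_transpose stages the data through shared memory, so it needs storage.
    __shared__ warp_load_int::storage_type storage[warps_per_block];

    const unsigned int warp_id = threadIdx.x / warp_size;
    const int* warp_input = input + blockIdx.x * blockDim.x * items_per_thread
                                  + warp_id * warp_size * items_per_thread;

    int items[items_per_thread];
    warp_load_int().load(warp_input, items, storage[warp_id]);
    // items[] now holds this thread's blocked portion of the warp's data.
}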

◆ warp_store_method

enum class warp_store_method

warp_store_method enumerates the methods available to store a blocked/striped arrangement of items into a blocked/striped arrangement in continuous memory.

Enumerator
warp_store_direct 

A blocked arrangement of items is stored into a blocked arrangement on continuous memory.

Performance Notes:
  • Performance decreases with increasing number of items per thread (stride between writes), because of reduced memory coalescing.
warp_store_striped 

A striped arrangement of items is stored into a blocked arrangement on continuous memory.

warp_store_vectorize 

A blocked arrangement of items is stored into a blocked arrangement on continuous memory using vectorization as an optimization.

Performance Notes:
  • Performance remains high due to increased memory coalescing, provided that vectorization requirements are fulfilled. Otherwise, performance will default to warp_store_direct.
Requirements:
  • The output offset (block_output) must be quad-item aligned.
  • The following conditions will prevent vectorization and switch to default warp_store_direct:
    • ItemsPerThread is odd.
    • The datatype T is not a primitive or a HIP vector type (e.g. int2, int4, etc.).
warp_store_transpose 

A blocked arrangement of items is locally transposed and stored as a striped arrangement of data on continuous memory.

Performance Notes:
  • Performance remains high due to increased memory coalescing, regardless of the number of items per thread.
  • Performance may be better compared to warp_store_direct and warp_store_vectorize due to reordering on local memory.
default_method 

Defaults to warp_store_direct.
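
A complementary sketch for the store direction (same assumed sizes as the load example above, illustrative kernel name) selects warp_store_transpose so that each thread's blocked items are written back to contiguous memory in a coalesced way.

#include <hip/hip_runtime.h>
#include <rocprim/rocprim.hpp>

// Illustrative kernel: each thread holds 4 items in a blocked arrangement and
// the warp of 32 threads stores them to contiguous global memory through
// shared memory. Assumes a block size of 128 threads.
__global__ void store_kernel(int* output)
{
    constexpr unsigned int warp_size        = 32;
    constexpr unsigned int items_per_thread = 4;
    constexpr unsigned int warps_per_block  = 128 / warp_size;

    using warp_store_int = rocprim::warp_store<int, items_per_thread, warp_size,
                                               rocprim::warp_store_method::warp_store_transpose>;
    __shared__ warp_store_int::storage_type storage[warps_per_block];

    const unsigned int warp_id = threadIdx.x / warp_size;

    // Example per-thread data in a blocked arrangement.
    int items[items_per_thread];
    for(unsigned int i = 0; i < items_per_thread; i++)
    {
        items[i] = static_cast<int>(threadIdx.x * items_per_thread + i);
    }

    int* warp_output = output + blockIdx.x * blockDim.x * items_per_thread
                              + warp_id * warp_size * items_per_thread;
    warp_store_int().store(warp_output, items, storage[warp_id]);
}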

Function Documentation

◆ warp_permute()

template<typename T >
ROCPRIM_DEVICE ROCPRIM_INLINE T warp_permute (const T & input, const int dst_lane, const int width = device_warp_size())

Permute items across the threads in a warp.

The value from this thread in the warp is permuted to the dst_lane-th thread in the warp. If multiple threads write to the same destination lane, the result is one of the source values, but which one is unspecified. If no thread writes to a particular lane, the value for that lane is 0. The destination index is taken modulo the logical warp size, so any value larger than the logical warp size wraps around.

Note: The optional width parameter must be a power of 2; results are undefined if it is not a power of 2, or it is greater than device_warp_size().

Parameters
input - input to pass to other threads
dst_lane - the destination lane to which the value from this thread is written
width - logical warp width
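
For example, the following sketch (illustrative kernel name, logical warp width of 32 assumed) reverses the lane order within each logical warp; because every destination lane is written exactly once, the result is well defined.

#include <hip/hip_runtime.h>
#include <rocprim/rocprim.hpp>

// Illustrative kernel: reverse the values held by the lanes of each logical
// warp of 32 threads.
__global__ void reverse_lanes_kernel(const int* input, int* output)
{
    const int width   = 32;                    // logical warp width (power of 2)
    const int lane    = threadIdx.x % width;   // logical lane id of this thread
    const int flat_id = blockIdx.x * blockDim.x + threadIdx.x;

    const int value = input[flat_id];
    // Each thread sends its value to the mirrored lane.
    const int reversed = rocprim::warp_permute(value, width - 1 - lane, width);
    output[flat_id] = reversed;
}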

◆ warp_shuffle()

template<class T >
ROCPRIM_DEVICE ROCPRIM_INLINE T warp_shuffle (const T & input, const int src_lane, const int width = device_warp_size())

Shuffle for any data type.

Each thread in the warp obtains input from the src_lane-th thread in the warp. If width is less than device_warp_size(), each subsection of the warp behaves as a separate entity with a starting logical lane id of 0. If src_lane is not in the [0; width) range, the returned value is the input passed by the thread with lane id src_lane modulo width.

Note: The optional width parameter must be a power of 2; results are undefined if it is not a power of 2, or it is greater than device_warp_size().

Parameters
input - input to pass to other threads
src_lane - lane id of the thread whose input should be returned
width - logical warp width
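
A common use is broadcasting one lane's value; the sketch below (illustrative kernel name, logical warp width of 16 assumed) lets every thread of each 16-wide sub-warp read the value held by that sub-warp's lane 0.

#include <hip/hip_runtime.h>
#include <rocprim/rocprim.hpp>

// Illustrative kernel: broadcast lane 0's value to all threads of each
// logical warp of 16 threads.
__global__ void broadcast_kernel(const int* input, int* output)
{
    const unsigned int flat_id = blockIdx.x * blockDim.x + threadIdx.x;
    const int value = input[flat_id];

    // Every thread reads the value held by logical lane 0 of its 16-wide sub-warp.
    const int broadcasted = rocprim::warp_shuffle(value, 0, 16);
    output[flat_id] = broadcasted;
}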

◆ warp_shuffle_down()

template<class T >
ROCPRIM_DEVICE ROCPRIM_INLINE T warp_shuffle_down (const T & input, const unsigned int delta, const int width = device_warp_size())

Shuffle down for any data type.

The i-th thread in the warp obtains input from the (i+delta)-th thread in the warp. If i+delta is not in the [0; width) range, the thread's own input is returned.

Note: The optional width parameter must be a power of 2; results are undefined if it is not a power of 2, or it is greater than device_warp_size().

Parameters
input - input to pass to other threads
delta - offset for calculating the source lane id
width - logical warp width
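
A typical use is a warp-wide reduction; the sketch below (illustrative kernel name, logical warp width of 32 assumed) sums one int per thread with halving offsets, leaving the total in lane 0 of each logical warp.

#include <hip/hip_runtime.h>
#include <rocprim/rocprim.hpp>

// Illustrative kernel: sum one int per thread across each logical warp of 32
// threads using repeated warp_shuffle_down with halving offsets.
__global__ void warp_sum_kernel(const int* input, int* output)
{
    const unsigned int flat_id = blockIdx.x * blockDim.x + threadIdx.x;
    int sum = input[flat_id];

    // Out-of-range lanes return their own input, which does not affect lane 0.
    for(unsigned int offset = 16; offset > 0; offset /= 2)
    {
        sum += rocprim::warp_shuffle_down(sum, offset, 32);
    }

    // Lane 0 of each logical warp now holds the warp-wide sum.
    if(threadIdx.x % 32 == 0)
    {
        output[flat_id / 32] = sum;
    }
}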

◆ warp_shuffle_up()

template<class T >
ROCPRIM_DEVICE ROCPRIM_INLINE T warp_shuffle_up (const T & input, const unsigned int delta, const int width = device_warp_size())

Shuffle up for any data type.

The i-th thread in the warp obtains input from the (i-delta)-th thread in the warp. If i-delta is not in the [0; width) range, the thread's own input is returned.

Note: The optional width parameter must be a power of 2; results are undefined if it is not a power of 2, or it is greater than device_warp_size().

Parameters
input - input to pass to other threads
delta - offset for calculating the source lane id
width - logical warp width
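
For example, the sketch below (illustrative kernel name, logical warp width of 32 assumed) gives every lane access to its predecessor's value and computes the difference; lane 0 receives its own value back, so its difference is 0.

#include <hip/hip_runtime.h>
#include <rocprim/rocprim.hpp>

// Illustrative kernel: each thread subtracts the value of the previous lane
// in its logical warp of 32 threads from its own value.
__global__ void neighbor_diff_kernel(const int* input, int* output)
{
    const unsigned int flat_id = blockIdx.x * blockDim.x + threadIdx.x;
    const int value = input[flat_id];

    // Lane i obtains the value of lane i-1; lane 0 gets its own value back.
    const int previous = rocprim::warp_shuffle_up(value, 1, 32);
    output[flat_id] = value - previous; // 0 for lane 0 of each logical warp
}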

◆ warp_shuffle_xor()

template<class T >
ROCPRIM_DEVICE ROCPRIM_INLINE T warp_shuffle_xor (const T & input, const int lane_mask, const int width = device_warp_size())

Shuffle XOR for any data type.

The i-th thread in the warp obtains input from the (i ^ lane_mask)-th thread in the warp.

Note: The optional width parameter must be a power of 2; results are undefined if it is not a power of 2, or it is greater than device_warp_size().

Parameters
input - input to pass to other threads
lane_mask - mask used for calculating the source lane id
width - logical warp width
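
A classic use is a butterfly reduction; the sketch below (illustrative kernel name, logical warp width of 32 assumed) exchanges partial sums with halving XOR masks so that, after the loop, every lane of the logical warp holds the same warp-wide sum.

#include <hip/hip_runtime.h>
#include <rocprim/rocprim.hpp>

// Illustrative kernel: butterfly (XOR) reduction over each logical warp of 32
// threads; afterwards every lane holds the warp-wide sum.
__global__ void butterfly_sum_kernel(const int* input, int* output)
{
    const unsigned int flat_id = blockIdx.x * blockDim.x + threadIdx.x;
    int sum = input[flat_id];

    for(int mask = 16; mask > 0; mask /= 2)
    {
        sum += rocprim::warp_shuffle_xor(sum, mask, 32);
    }

    output[flat_id] = sum; // identical for all lanes of the same logical warp
}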