rocPRIM
Warp module

Namespaces

 detail
 Deprecated: Configuration of device-level scan primitives.
 

Classes

class  warp_exchange< T, ItemsPerThread, WarpSize >
 The warp_exchange class is a warp level parallel primitive which provides methods for rearranging items partitioned across threads in a warp. More...
 
class  warp_load< T, ItemsPerThread, WarpSize, Method >
 The warp_load class is a warp level parallel primitive which provides methods for loading data from continuous memory into a blocked arrangement of items across a warp. More...
 
class  warp_load< T, ItemsPerThread, WarpSize, warp_load_method::warp_load_striped >
 
class  warp_load< T, ItemsPerThread, WarpSize, warp_load_method::warp_load_vectorize >
 
class  warp_load< T, ItemsPerThread, WarpSize, warp_load_method::warp_load_transpose >
 
class  warp_reduce< T, WarpSize, UseAllReduce >
 The warp_reduce class is a warp level parallel primitive which provides methods for performing reduction operations on items partitioned across threads in a hardware warp. More...
 
class  warp_scan< T, WarpSize >
 The warp_scan class is a warp level parallel primitive which provides methods for performing inclusive and exclusive scan operations of items partitioned across threads in a hardware warp; a usage sketch follows this class list. More...
 
class  warp_sort< Key, WarpSize, Value >
 The warp_sort class provides warp-wide methods for computing a parallel sort of items across thread warps. More...
 
class  warp_store< T, ItemsPerThread, WarpSize, Method >
 The warp_store class is a warp level parallel primitive which provides methods for storing an arrangement of items into a blocked/striped arrangement on continuous memory. More...
 
class  warp_store< T, ItemsPerThread, WarpSize, warp_store_method::warp_store_striped >
 
class  warp_store< T, ItemsPerThread, WarpSize, warp_store_method::warp_store_vectorize >
 
class  warp_store< T, ItemsPerThread, WarpSize, warp_store_method::warp_store_transpose >
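
The classes above share a common usage pattern: specialize the class template, allocate its storage_type in shared memory, and call the member function from every thread of the (logical) warp. As a minimal, non-authoritative sketch of that pattern (the one referenced in the warp_scan entry above), the kernel below performs an inclusive prefix sum with warp_scan. The block size of 64 threads, the logical warp size of 16, the kernel name and the umbrella header rocprim/rocprim.hpp are assumptions made for this example only.

#include <hip/hip_runtime.h>
#include <rocprim/rocprim.hpp>

// Illustrative kernel: each logical warp of 16 threads computes an inclusive
// prefix sum of one int per thread. Assumes a block size of 64 threads.
__global__ void inclusive_scan_kernel(const int* input, int* output)
{
    // Specialize warp_scan for int and a logical warp size of 16.
    using warp_scan_int = rocprim::warp_scan<int, 16>;
    // One storage_type instance per logical warp (64 / 16 = 4).
    __shared__ warp_scan_int::storage_type temp[4];

    const unsigned int flat_id         = blockIdx.x * blockDim.x + threadIdx.x;
    const unsigned int logical_warp_id = threadIdx.x / 16;

    int value = input[flat_id];
    // Inclusive scan within the logical warp; rocprim::plus<int> is the default operator.
    warp_scan_int().inclusive_scan(value, value, temp[logical_warp_id]);
    output[flat_id] = value;
}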
 

Enumerations

enum  warp_load_method {
  warp_load_method::warp_load_direct, warp_load_method::warp_load_striped, warp_load_method::warp_load_vectorize, warp_load_method::warp_load_transpose,
  warp_load_method::default_method = warp_load_direct
}
 warp_load_method enumerates the methods available to load data from continuous memory into a blocked/striped arrangement of items across the warp. More...
 
enum  warp_store_method {
  warp_store_method::warp_store_direct, warp_store_method::warp_store_striped, warp_store_method::warp_store_vectorize, warp_store_method::warp_store_transpose,
  warp_store_method::default_method = warp_store_direct
}
 warp_store_method enumerates the methods available to store a blocked/striped arrangement of items into a blocked/striped arrangement in continuous memory. More...
 

Functions

template<class T >
ROCPRIM_DEVICE ROCPRIM_INLINE T warp_shuffle (const T &input, const int src_lane, const int width=device_warp_size())
 Shuffle for any data type. More...
 
template<class T >
ROCPRIM_DEVICE ROCPRIM_INLINE T warp_shuffle_up (const T &input, const unsigned int delta, const int width=device_warp_size())
 Shuffle up for any data type. More...
 
template<class T >
ROCPRIM_DEVICE ROCPRIM_INLINE T warp_shuffle_down (const T &input, const unsigned int delta, const int width=device_warp_size())
 Shuffle down for any data type. More...
 
template<class T >
ROCPRIM_DEVICE ROCPRIM_INLINE T warp_shuffle_xor (const T &input, const int lane_mask, const int width=device_warp_size())
 Shuffle XOR for any data type. More...
 
template<typename T >
ROCPRIM_DEVICE ROCPRIM_INLINE T warp_permute (const T &input, const int dst_lane, const int width=device_warp_size())
 Permute items across the threads in a warp. More...
 


Enumeration Type Documentation

◆ warp_load_method

enum class warp_load_method

warp_load_method enumerates the methods available to load data from continuous memory into a blocked/striped arrangement of items across the warp.

Enumerator
warp_load_direct 

Data from continuous memory is loaded into a blocked arrangement of items.

Performance Notes:
  • Performance decreases with increasing number of items per thread (stride between reads), because of reduced memory coalescing.
warp_load_striped 

A striped arrangement of data is read directly from memory.

warp_load_vectorize 

Data from continuous memory is loaded into a blocked arrangement of items using vectorization as an optimization.

Performance Notes:
  • Performance remains high due to increased memory coalescing, provided that vectorization requirements are fulfilled. Otherwise, performance will default to warp_load_direct.
Requirements:
  • The input offset (block_input) must be quad-item aligned.
  • The following conditions will prevent vectorization and switch to default warp_load_direct:
    • ItemsPerThread is odd.
    • The datatype T is not a primitive or a HIP vector type (e.g. int2, int4, etc.).
warp_load_transpose 

A striped arrangement of data from continuous memory is locally transposed into a blocked arrangement of items.

Performance Notes:
  • Performance remains high due to increased memory coalescing, regardless of the number of items per thread.
  • Performance may be better compared to warp_load_direct and warp_load_vectorize due to reordering on local memory.
default_method 

Defaults to warp_load_direct.
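
As a hedged illustration of selecting one of these methods, the sketch below chooses warp_load_transpose through the Method template parameter of rocprim::warp_load. The block size (128 threads), logical warp size (32), items per thread (4) and the kernel name are assumptions made for this example only.

#include <hip/hip_runtime.h>
#include <rocprim/rocprim.hpp>

// Illustrative kernel: each warp of 32 threads loads 32 * 4 consecutive ints
// from global memory and ends up with a blocked arrangement (4 consecutive
// items per thread). Assumes a block size of 128 threads.
__global__ void load_kernel(const int* input)
{
    constexpr unsigned int warp_size        = 32;
    constexpr unsigned int items_per_thread = 4;
    constexpr unsigned int warps_per_block  = 128 / warp_size;

    using warp_load_int = rocprim::warp_load<int, items_per_thread, warp_size,
                                             rocprim::warp_load_method::warp_load_transpose>;
    // warp_load_transpose stages the data through shared memory, so it needs storage.
    __shared__ warp_load_int::storage_type storage[warps_per_block];

    const unsigned int warp_id = threadIdx.x / warp_size;
    const int* warp_input = input + blockIdx.x * blockDim.x * items_per_thread
                                  + warp_id * warp_size * items_per_thread;

    int items[items_per_thread];
    warp_load_int().load(warp_input, items, storage[warp_id]);
    // items[] now holds this thread's blocked portion of the warp's data.
}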

◆ warp_store_method

enum class warp_store_method

warp_store_method enumerates the methods available to store a blocked/striped arrangement of items into a blocked/striped arrangement in continuous memory.

Enumerator
warp_store_direct 

A blocked arrangement of items is stored into a blocked arrangement on continuous memory.

Performance Notes:
  • Performance decreases with increasing number of items per thread (stride between writes), because of reduced memory coalescing.
warp_store_striped 

A striped arrangement of items is stored into a blocked arrangement on continuous memory.

warp_store_vectorize 

A blocked arrangement of items is stored into a blocked arrangement on continuous memory using vectorization as an optimization.

Performance Notes:
  • Performance remains high due to increased memory coalescing, provided that vectorization requirements are fulfilled. Otherwise, performance will default to warp_store_direct.
Requirements:
  • The output offset (block_output) must be quad-item aligned.
  • The following conditions will prevent vectorization and switch to default warp_store_direct:
    • ItemsPerThread is odd.
    • The datatype T is not a primitive or a HIP vector type (e.g. int2, int4, etc.).
warp_store_transpose 

A blocked arrangement of items is locally transposed and stored as a striped arrangement of data on continuous memory.

Performance Notes:
  • Performance remains high due to increased memory coalescing, regardless of the number of items per thread.
  • Performance may be better compared to warp_store_direct and warp_store_vectorize due to reordering on local memory.
default_method 

Defaults to warp_store_direct.
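
A complementary sketch for the store direction (same assumed sizes as the load example above, illustrative kernel name) selects warp_store_transpose so that each thread's blocked items are written back to contiguous memory in a coalesced way.

#include <hip/hip_runtime.h>
#include <rocprim/rocprim.hpp>

// Illustrative kernel: each thread holds 4 items in a blocked arrangement and
// the warp of 32 threads stores them to contiguous global memory through
// shared memory. Assumes a block size of 128 threads.
__global__ void store_kernel(int* output)
{
    constexpr unsigned int warp_size        = 32;
    constexpr unsigned int items_per_thread = 4;
    constexpr unsigned int warps_per_block  = 128 / warp_size;

    using warp_store_int = rocprim::warp_store<int, items_per_thread, warp_size,
                                               rocprim::warp_store_method::warp_store_transpose>;
    __shared__ warp_store_int::storage_type storage[warps_per_block];

    const unsigned int warp_id = threadIdx.x / warp_size;

    // Example per-thread data in a blocked arrangement.
    int items[items_per_thread];
    for(unsigned int i = 0; i < items_per_thread; i++)
    {
        items[i] = static_cast<int>(threadIdx.x * items_per_thread + i);
    }

    int* warp_output = output + blockIdx.x * blockDim.x * items_per_thread
                              + warp_id * warp_size * items_per_thread;
    warp_store_int().store(warp_output, items, storage[warp_id]);
}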

Function Documentation

◆ warp_permute()

template<typename T >
ROCPRIM_DEVICE ROCPRIM_INLINE T warp_permute (const T & input, const int dst_lane, const int width = device_warp_size())

Permute items across the threads in a warp.

The value from this thread in the warp is permuted to the dst_lane-th thread in the warp. If multiple threads write to the same destination lane, the result is one of the source values, but which one is unspecified. If no thread writes to a particular lane, the value for that lane is 0. The destination index is taken modulo the logical warp size, so any value larger than the logical warp size wraps around.

Note: The optional width parameter must be a power of 2; results are undefined if it is not a power of 2, or it is greater than device_warp_size().

Parameters
input - input to pass to other threads
dst_lane - the destination lane to which the value from this thread is written
width - logical warp width
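
For example, the following sketch (illustrative kernel name, logical warp width of 32 assumed) reverses the lane order within each logical warp; because every destination lane is written exactly once, the result is well defined.

#include <hip/hip_runtime.h>
#include <rocprim/rocprim.hpp>

// Illustrative kernel: reverse the values held by the lanes of each logical
// warp of 32 threads.
__global__ void reverse_lanes_kernel(const int* input, int* output)
{
    const int width   = 32;                    // logical warp width (power of 2)
    const int lane    = threadIdx.x % width;   // logical lane id of this thread
    const int flat_id = blockIdx.x * blockDim.x + threadIdx.x;

    const int value = input[flat_id];
    // Each thread sends its value to the mirrored lane.
    const int reversed = rocprim::warp_permute(value, width - 1 - lane, width);
    output[flat_id] = reversed;
}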

◆ warp_shuffle()

template<class T >
ROCPRIM_DEVICE ROCPRIM_INLINE T warp_shuffle (const T & input, const int src_lane, const int width = device_warp_size())

Shuffle for any data type.

Each thread in the warp obtains input from the src_lane-th thread in the warp. If width is less than device_warp_size(), each subsection of the warp behaves as a separate entity with a starting logical lane id of 0. If src_lane is not in the [0; width) range, the returned value is the input passed by the thread with lane id src_lane modulo width.

Note: The optional width parameter must be a power of 2; results are undefined if it is not a power of 2, or it is greater than device_warp_size().

Parameters
input - input to pass to other threads
src_lane - lane id of the thread whose input should be returned
width - logical warp width
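
A common use is broadcasting one lane's value; the sketch below (illustrative kernel name, logical warp width of 16 assumed) lets every thread of each 16-wide sub-warp read the value held by that sub-warp's lane 0.

#include <hip/hip_runtime.h>
#include <rocprim/rocprim.hpp>

// Illustrative kernel: broadcast lane 0's value to all threads of each
// logical warp of 16 threads.
__global__ void broadcast_kernel(const int* input, int* output)
{
    const unsigned int flat_id = blockIdx.x * blockDim.x + threadIdx.x;
    const int value = input[flat_id];

    // Every thread reads the value held by logical lane 0 of its 16-wide sub-warp.
    const int broadcasted = rocprim::warp_shuffle(value, 0, 16);
    output[flat_id] = broadcasted;
}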

◆ warp_shuffle_down()

template<class T >
ROCPRIM_DEVICE ROCPRIM_INLINE T warp_shuffle_down (const T & input, const unsigned int delta, const int width = device_warp_size())

Shuffle down for any data type.

The i-th thread in the warp obtains input from the (i+delta)-th thread in the warp. If i+delta is not in the [0; width) range, the thread's own input is returned.

Note: The optional width parameter must be a power of 2; results are undefined if it is not a power of 2, or it is greater than device_warp_size().

Parameters
input - input to pass to other threads
delta - offset for calculating the source lane id
width - logical warp width
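
A typical use is a warp-wide reduction; the sketch below (illustrative kernel name, logical warp width of 32 assumed) sums one int per thread with halving offsets, leaving the total in lane 0 of each logical warp.

#include <hip/hip_runtime.h>
#include <rocprim/rocprim.hpp>

// Illustrative kernel: sum one int per thread across each logical warp of 32
// threads using repeated warp_shuffle_down with halving offsets.
__global__ void warp_sum_kernel(const int* input, int* output)
{
    const unsigned int flat_id = blockIdx.x * blockDim.x + threadIdx.x;
    int sum = input[flat_id];

    // Out-of-range lanes return their own input, which does not affect lane 0.
    for(unsigned int offset = 16; offset > 0; offset /= 2)
    {
        sum += rocprim::warp_shuffle_down(sum, offset, 32);
    }

    // Lane 0 of each logical warp now holds the warp-wide sum.
    if(threadIdx.x % 32 == 0)
    {
        output[flat_id / 32] = sum;
    }
}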

◆ warp_shuffle_up()

template<class T >
ROCPRIM_DEVICE ROCPRIM_INLINE T warp_shuffle_up (const T & input, const unsigned int delta, const int width = device_warp_size())

Shuffle up for any data type.

The i-th thread in the warp obtains input from the (i-delta)-th thread in the warp. If i-delta is not in the [0; width) range, the thread's own input is returned.

Note: The optional width parameter must be a power of 2; results are undefined if it is not a power of 2, or it is greater than device_warp_size().

Parameters
input - input to pass to other threads
delta - offset for calculating the source lane id
width - logical warp width
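
For example, the sketch below (illustrative kernel name, logical warp width of 32 assumed) gives every lane access to its predecessor's value and computes the difference; lane 0 receives its own value back, so its difference is 0.

#include <hip/hip_runtime.h>
#include <rocprim/rocprim.hpp>

// Illustrative kernel: each thread subtracts the value of the previous lane
// in its logical warp of 32 threads from its own value.
__global__ void neighbor_diff_kernel(const int* input, int* output)
{
    const unsigned int flat_id = blockIdx.x * blockDim.x + threadIdx.x;
    const int value = input[flat_id];

    // Lane i obtains the value of lane i-1; lane 0 gets its own value back.
    const int previous = rocprim::warp_shuffle_up(value, 1, 32);
    output[flat_id] = value - previous; // 0 for lane 0 of each logical warp
}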

◆ warp_shuffle_xor()

template<class T >
ROCPRIM_DEVICE ROCPRIM_INLINE T warp_shuffle_xor (const T & input, const int lane_mask, const int width = device_warp_size())

Shuffle XOR for any data type.

The i-th thread in the warp obtains input from the (i ^ lane_mask)-th thread in the warp.

Note: The optional width parameter must be a power of 2; results are undefined if it is not a power of 2, or it is greater than device_warp_size().

Parameters
input - input to pass to other threads
lane_mask - mask used for calculating the source lane id
width - logical warp width
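
A classic use is a butterfly reduction; the sketch below (illustrative kernel name, logical warp width of 32 assumed) exchanges partial sums with halving XOR masks so that, after the loop, every lane of the logical warp holds the same warp-wide sum.

#include <hip/hip_runtime.h>
#include <rocprim/rocprim.hpp>

// Illustrative kernel: butterfly (XOR) reduction over each logical warp of 32
// threads; afterwards every lane holds the warp-wide sum.
__global__ void butterfly_sum_kernel(const int* input, int* output)
{
    const unsigned int flat_id = blockIdx.x * blockDim.x + threadIdx.x;
    int sum = input[flat_id];

    for(int mask = 16; mask > 0; mask /= 2)
    {
        sum += rocprim::warp_shuffle_xor(sum, mask, 32);
    }

    output[flat_id] = sum; // identical for all lanes of the same logical warp
}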