rocPRIM
Block-wide

Classes

class  block_adjacent_difference< T, BlockSizeX, BlockSizeY, BlockSizeZ >
 The block_adjacent_difference class is a block-level parallel primitive which provides methods for flagging discontinuities within an ordered set of items across threads in a block.

class  block_discontinuity< T, BlockSizeX, BlockSizeY, BlockSizeZ >
 The block_discontinuity class is a block-level parallel primitive which provides methods for flagging discontinuities within an ordered set of items across threads in a block.

class  block_exchange< T, BlockSizeX, ItemsPerThread, BlockSizeY, BlockSizeZ >
 The block_exchange class is a block-level parallel primitive which provides methods for rearranging items partitioned across threads in a block.

class  block_histogram< T, BlockSizeX, ItemsPerThread, Bins, Algorithm, BlockSizeY, BlockSizeZ >
 The block_histogram class is a block-level parallel primitive which provides methods for constructing block-wide histograms from items partitioned across threads in a block.

class  block_load< T, BlockSizeX, ItemsPerThread, Method, BlockSizeY, BlockSizeZ >
 The block_load class is a block-level parallel primitive which provides methods for loading data from continuous memory into a blocked arrangement of items across the thread block.

class  block_radix_sort< Key, BlockSizeX, ItemsPerThread, Value, BlockSizeY, BlockSizeZ >
 The block_radix_sort class is a block-level parallel primitive which provides methods for sorting items (keys or key-value pairs) partitioned across threads in a block using a radix sort algorithm.

class  block_reduce< T, BlockSizeX, Algorithm, BlockSizeY, BlockSizeZ >
 The block_reduce class is a block-level parallel primitive which provides methods for performing reduction operations on items partitioned across threads in a block.

class  block_scan< T, BlockSizeX, Algorithm, BlockSizeY, BlockSizeZ >
 The block_scan class is a block-level parallel primitive which provides methods for performing inclusive and exclusive scan operations on items partitioned across threads in a block.

class  block_shuffle< T, BlockSizeX, BlockSizeY, BlockSizeZ >
 The block_shuffle class is a block-level parallel primitive which provides methods for shuffling data partitioned across a block.

class  block_sort< Key, BlockSizeX, Value, Algorithm, BlockSizeY, BlockSizeZ >
 The block_sort class is a block-level parallel primitive which provides methods for sorting items (keys or key-value pairs) partitioned across threads in a block using a comparison-based sort algorithm.

class  block_store< T, BlockSizeX, ItemsPerThread, Method, BlockSizeY, BlockSizeZ >
 The block_store class is a block-level parallel primitive which provides methods for storing an arrangement of items into a blocked/striped arrangement on continuous memory.
 

Enumerations

enum  block_histogram_algorithm { block_histogram_algorithm::using_atomic, block_histogram_algorithm::using_sort, block_histogram_algorithm::default_algorithm = using_atomic }
 Available algorithms for block_histogram primitive.
 
enum  block_load_method {
  block_load_method::block_load_direct, block_load_method::block_load_striped, block_load_method::block_load_vectorize, block_load_method::block_load_transpose,
  block_load_method::block_load_warp_transpose, block_load_method::default_method = block_load_direct
}
 block_load_method enumerates the methods available to load data from continuous memory into a blocked arrangement of items across the thread block.
 
enum  block_reduce_algorithm { block_reduce_algorithm::using_warp_reduce, block_reduce_algorithm::raking_reduce, block_reduce_algorithm::raking_reduce_commutative_only, block_reduce_algorithm::default_algorithm = using_warp_reduce }
 Available algorithms for block_reduce primitive.
 
enum  block_scan_algorithm { block_scan_algorithm::using_warp_scan, block_scan_algorithm::reduce_then_scan, block_scan_algorithm::default_algorithm = using_warp_scan }
 Available algorithms for block_scan primitive.
 
enum  block_sort_algorithm { block_sort_algorithm::bitonic_sort, block_sort_algorithm::default_algorithm = bitonic_sort }
 Available algorithms for block_sort primitive.
 
enum  block_store_method {
  block_store_method::block_store_direct, block_store_method::block_store_striped, block_store_method::block_store_vectorize, block_store_method::block_store_transpose,
  block_store_method::block_store_warp_transpose, block_store_method::default_method = block_store_direct
}
 block_store_method enumerates the methods available to store a striped arrangement of items into a blocked/striped arrangement on continuous memory.
 

Functions

template<class InputIterator , class T , unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_INLINE void block_load_direct_blocked (unsigned int flat_id, InputIterator block_input, T(&items)[ItemsPerThread])
 Loads data from continuous memory into a blocked arrangement of items across the thread block.
 
template<class InputIterator , class T , unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_INLINE void block_load_direct_blocked (unsigned int flat_id, InputIterator block_input, T(&items)[ItemsPerThread], unsigned int valid)
 Loads data from continuous memory into a blocked arrangement of items across the thread block, which is guarded by range valid.
 
template<class InputIterator , class T , unsigned int ItemsPerThread, class Default >
ROCPRIM_DEVICE ROCPRIM_INLINE void block_load_direct_blocked (unsigned int flat_id, InputIterator block_input, T(&items)[ItemsPerThread], unsigned int valid, Default out_of_bounds)
 Loads data from continuous memory into a blocked arrangement of items across the thread block, which is guarded by range valid, with a fall-back value for out-of-bound elements.
 
template<class T , class U , unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_INLINE auto block_load_direct_blocked_vectorized (unsigned int flat_id, T *block_input, U(&items)[ItemsPerThread]) -> typename std::enable_if< detail::is_vectorizable< T, ItemsPerThread >::value >::type
 Loads data from continuous memory into a blocked arrangement of items across the thread block.
 
template<unsigned int BlockSize, class InputIterator , class T , unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_INLINE void block_load_direct_striped (unsigned int flat_id, InputIterator block_input, T(&items)[ItemsPerThread])
 Loads data from continuous memory into a striped arrangement of items across the thread block.
 
template<unsigned int BlockSize, class InputIterator , class T , unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_INLINE void block_load_direct_striped (unsigned int flat_id, InputIterator block_input, T(&items)[ItemsPerThread], unsigned int valid)
 Loads data from continuous memory into a striped arrangement of items across the thread block, which is guarded by range valid.
 
template<unsigned int BlockSize, class InputIterator , class T , unsigned int ItemsPerThread, class Default >
ROCPRIM_DEVICE ROCPRIM_INLINE void block_load_direct_striped (unsigned int flat_id, InputIterator block_input, T(&items)[ItemsPerThread], unsigned int valid, Default out_of_bounds)
 Loads data from continuous memory into a striped arrangement of items across the thread block, which is guarded by range valid, with a fall-back value for out-of-bound elements.
 
template<unsigned int WarpSize = device_warp_size(), class InputIterator , class T , unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_INLINE void block_load_direct_warp_striped (unsigned int flat_id, InputIterator block_input, T(&items)[ItemsPerThread])
 Loads data from continuous memory into a warp-striped arrangement of items across the thread block.
 
template<unsigned int WarpSize = device_warp_size(), class InputIterator , class T , unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_INLINE void block_load_direct_warp_striped (unsigned int flat_id, InputIterator block_input, T(&items)[ItemsPerThread], unsigned int valid)
 Loads data from continuous memory into a warp-striped arrangement of items across the thread block, which is guarded by range valid.
 
template<unsigned int WarpSize = device_warp_size(), class InputIterator , class T , unsigned int ItemsPerThread, class Default >
ROCPRIM_DEVICE ROCPRIM_INLINE void block_load_direct_warp_striped (unsigned int flat_id, InputIterator block_input, T(&items)[ItemsPerThread], unsigned int valid, Default out_of_bounds)
 Loads data from continuous memory into a warp-striped arrangement of items across the thread block, which is guarded by range valid, with a fall-back value for out-of-bound elements.
 
template<class OutputIterator , class T , unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_INLINE void block_store_direct_blocked (unsigned int flat_id, OutputIterator block_output, T(&items)[ItemsPerThread])
 Stores a blocked arrangement of items from across the thread block into a blocked arrangement on continuous memory.
 
template<class OutputIterator , class T , unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_INLINE void block_store_direct_blocked (unsigned int flat_id, OutputIterator block_output, T(&items)[ItemsPerThread], unsigned int valid)
 Stores a blocked arrangement of items from across the thread block into a blocked arrangement on continuous memory, which is guarded by range valid.
 
template<class T , class U , unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_INLINE auto block_store_direct_blocked_vectorized (unsigned int flat_id, T *block_output, U(&items)[ItemsPerThread]) -> typename std::enable_if< detail::is_vectorizable< T, ItemsPerThread >::value >::type
 Stores a blocked arrangement of items from across the thread block into a blocked arrangement on continuous memory.
 
template<unsigned int BlockSize, class OutputIterator , class T , unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_INLINE void block_store_direct_striped (unsigned int flat_id, OutputIterator block_output, T(&items)[ItemsPerThread])
 Stores a striped arrangement of items from across the thread block into a blocked arrangement on continuous memory.
 
template<unsigned int BlockSize, class OutputIterator , class T , unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_INLINE void block_store_direct_striped (unsigned int flat_id, OutputIterator block_output, T(&items)[ItemsPerThread], unsigned int valid)
 Stores a striped arrangement of items from across the thread block into a blocked arrangement on continuous memory, which is guarded by range valid.
 
template<unsigned int WarpSize = device_warp_size(), class OutputIterator , class T , unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_INLINE void block_store_direct_warp_striped (unsigned int flat_id, OutputIterator block_output, T(&items)[ItemsPerThread])
 Stores a warp-striped arrangement of items from across the thread block into a blocked arrangement on continuous memory.
 
template<unsigned int WarpSize = device_warp_size(), class OutputIterator , class T , unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_INLINE void block_store_direct_warp_striped (unsigned int flat_id, OutputIterator block_output, T(&items)[ItemsPerThread], unsigned int valid)
 Stores a warp-striped arrangement of items from across the thread block into a blocked arrangement on continuous memory, which is guarded by range valid.
 

Enumeration Type Documentation

◆ block_histogram_algorithm

Available algorithms for block_histogram primitive.

Enumerator
using_atomic 

Atomic addition is used to update bin count directly.

Performance Notes:
  • Performance is dependent on hardware implementation of atomic addition.
  • Performance may decrease for non-uniform random input distributions where many concurrent updates may be made to the same bin counter.
using_sort 

A two-phase operation is used:

  • Data is sorted using radix sort.
  • "Runs" of same-valued keys are detected using discontinuity; run-lengths are bin counts.
Performance Notes:
  • Performance is consistent regardless of sample bin distribution.
default_algorithm 

Default block_histogram algorithm.
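
The two phases of using_sort can be illustrated on the host. The following is a minimal CPU sketch, not rocPRIM code: `histogram_by_sort` is a hypothetical name, and `std::sort` stands in for the radix sort mentioned above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side sketch (hypothetical helper, not part of rocPRIM):
// sort the samples, then treat each run of equal values as one bin
// whose count is the run length.
std::vector<unsigned int> histogram_by_sort(std::vector<unsigned int> samples,
                                            unsigned int bins)
{
    std::vector<unsigned int> counts(bins, 0);

    // Phase 1: sort. std::sort is a stand-in for a radix sort.
    std::sort(samples.begin(), samples.end());

    // Phase 2: a run ends where the next value differs (a discontinuity)
    // or where the input ends. The run length is the bin count.
    std::size_t run_start = 0;
    for (std::size_t i = 1; i <= samples.size(); ++i)
    {
        const bool run_ends = (i == samples.size()) || (samples[i] != samples[i - 1]);
        if (run_ends)
        {
            assert(samples[run_start] < bins); // each sample must name a valid bin
            counts[samples[run_start]] = static_cast<unsigned int>(i - run_start);
            run_start = i;
        }
    }
    return counts;
}
```

Because the work is dominated by the sort, the cost does not depend on how many samples fall into the same bin, which matches the performance note for using_sort.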

◆ block_load_method

enum class block_load_method

block_load_method enumerates the methods available to load data from continuous memory into a blocked arrangement of items across the thread block.

Enumerator
block_load_direct 

Data from continuous memory is loaded into a blocked arrangement of items.

Performance Notes:
  • Performance decreases with increasing number of items per thread (stride between reads), because of reduced memory coalescing.
block_load_striped 

A striped arrangement of data is read directly from memory.

block_load_vectorize 

Data from continuous memory is loaded into a blocked arrangement of items using vectorization as an optimization.

Performance Notes:
  • Performance remains high due to increased memory coalescing, provided that vectorization requirements are fulfilled. Otherwise, the method falls back to block_load_direct.
Requirements:
  • The input offset (block_input) must be quad-item aligned.
  • The following conditions will prevent vectorization and switch to default block_load_direct:
    • ItemsPerThread is odd.
    • The datatype T is not a primitive or a HIP vector type (e.g. int2, int4, etc.).
block_load_transpose 

A striped arrangement of data from continuous memory is locally transposed into a blocked arrangement of items.

Performance Notes:
  • Performance remains high due to increased memory coalescing, regardless of the number of items per thread.
  • Performance may be better compared to block_load_direct and block_load_vectorize due to reordering on local memory.
block_load_warp_transpose 

A warp-striped arrangement of data from continuous memory is locally transposed into a blocked arrangement of items.

Requirements:
  • The number of threads in the block must be a multiple of the hardware warp size.
Performance Notes:
  • Performance remains high due to increased memory coalescing, regardless of the number of items per thread.
  • Performance may be better compared to block_load_direct and block_load_vectorize due to reordering on local memory.
default_method 

Defaults to block_load_direct.
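
The coalescing remarks above come down to which memory offsets neighbouring threads touch in the same load step. The following host-side sketch only computes those offsets; `blocked_offsets` and `striped_offsets` are hypothetical helper names used for illustration and are not part of rocPRIM.

```cpp
#include <cassert>
#include <vector>

// Offsets (in items) read by threads 0..block_size-1 at load step `item`
// when every thread owns one contiguous chunk (blocked arrangement).
std::vector<unsigned int> blocked_offsets(unsigned int block_size,
                                          unsigned int items_per_thread,
                                          unsigned int item)
{
    assert(item < items_per_thread);
    std::vector<unsigned int> offsets(block_size);
    for (unsigned int t = 0; t < block_size; ++t)
    {
        // Neighbouring threads are items_per_thread items apart.
        offsets[t] = t * items_per_thread + item;
    }
    return offsets;
}

// Same question for a striped arrangement.
std::vector<unsigned int> striped_offsets(unsigned int block_size,
                                          unsigned int items_per_thread,
                                          unsigned int item)
{
    assert(item < items_per_thread);
    std::vector<unsigned int> offsets(block_size);
    for (unsigned int t = 0; t < block_size; ++t)
    {
        // Neighbouring threads read adjacent items.
        offsets[t] = item * block_size + t;
    }
    return offsets;
}
```

With 4 threads and 3 items per thread, the first blocked step touches offsets 0, 3, 6, 9, while the first striped step touches 0, 1, 2, 3. The growing gap in the blocked case is the "stride between reads" referred to in the performance notes.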

◆ block_reduce_algorithm

Available algorithms for block_reduce primitive.

Enumerator
using_warp_reduce 

A warp_reduce based algorithm.

raking_reduce 

An algorithm which limits calculations to a single hardware warp.

raking_reduce_commutative_only 

A raking reduce that supports only commutative operators.

default_algorithm 

Default block_reduce algorithm.

◆ block_scan_algorithm

enum class block_scan_algorithm

Available algorithms for block_scan primitive.

Enumerator
using_warp_scan 

A warp_scan based algorithm.

reduce_then_scan 

An algorithm which limits calculations to a single hardware warp.

default_algorithm 

Default block_scan algorithm.
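
This page does not describe the internals of reduce_then_scan. As general background, the sketch below shows the generic reduce-then-scan pattern on the host: reduce fixed-size segments, scan the segment totals, then scan each segment seeded with its prefix. The function name and the `segment_size` parameter are assumptions for illustration and do not reflect rocPRIM's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side sketch of a generic reduce-then-scan inclusive prefix sum.
// Hypothetical helper, not rocPRIM code.
std::vector<int> inclusive_scan_reduce_then_scan(const std::vector<int>& input,
                                                 std::size_t segment_size)
{
    assert(segment_size > 0);
    const std::size_t n        = input.size();
    const std::size_t segments = (n + segment_size - 1) / segment_size;

    // Step 1: reduce each segment to a single total.
    std::vector<int> totals(segments, 0);
    for (std::size_t i = 0; i < n; ++i)
    {
        totals[i / segment_size] += input[i];
    }

    // Step 2: exclusive scan over the (few) segment totals.
    std::vector<int> prefix(segments, 0);
    for (std::size_t s = 1; s < segments; ++s)
    {
        prefix[s] = prefix[s - 1] + totals[s - 1];
    }

    // Step 3: scan every segment locally, seeded with its prefix.
    std::vector<int> output(n);
    for (std::size_t s = 0; s < segments; ++s)
    {
        int running = prefix[s];
        for (std::size_t i = s * segment_size; i < n && i < (s + 1) * segment_size; ++i)
        {
            running += input[i];
            output[i] = running;
        }
    }
    return output;
}
```

Only step 2 has a dependency between segments, and it runs over one value per segment, so the cross-segment part of the computation stays small.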

◆ block_sort_algorithm

enum class block_sort_algorithm

Available algorithms for block_sort primitive.

Enumerator
bitonic_sort 

A bitonic sort based algorithm.

default_algorithm 

Default block_sort algorithm.
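
For readers unfamiliar with bitonic sort, the following is a minimal host-side sketch of the classic bitonic sorting network for power-of-two input sizes. It is a textbook formulation, not rocPRIM's implementation; in a block-level version, each iteration of the innermost loop would correspond to work done by one thread.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Textbook bitonic sorting network (ascending), sketched on the host.
// Assumption: keys.size() is zero or a power of two.
void bitonic_sort(std::vector<int>& keys)
{
    const std::size_t n = keys.size();
    assert((n & (n - 1)) == 0);

    for (std::size_t k = 2; k <= n; k <<= 1)         // size of the sequences being merged
    {
        for (std::size_t j = k >> 1; j > 0; j >>= 1) // compare distance within a merge
        {
            for (std::size_t i = 0; i < n; ++i)      // one compare-exchange per pair
            {
                const std::size_t partner = i ^ j;
                if (partner > i)
                {
                    // Halves alternate direction so that merged sequences stay bitonic.
                    const bool ascending    = (i & k) == 0;
                    const bool out_of_order = ascending ? keys[i] > keys[partner]
                                                        : keys[i] < keys[partner];
                    if (out_of_order)
                    {
                        std::swap(keys[i], keys[partner]);
                    }
                }
            }
        }
    }
}
```

The sequence of compare-exchange pairs depends only on the input size, never on the data, which is what makes the network straightforward to map onto a fixed set of threads.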

◆ block_store_method

enum class block_store_method

block_store_method enumerates the methods available to store a striped arrangement of items into a blocked/striped arrangement on continuous memory.

Enumerator
block_store_direct 

A blocked arrangement of items is stored into a blocked arrangement on continuous memory.

Performance Notes:
  • Performance decreases with increasing number of items per thread (stride between writes), because of reduced memory coalescing.
block_store_striped 

A striped arrangement of items is stored into a blocked arrangement on continuous memory.

block_store_vectorize 

A blocked arrangement of items is stored into a blocked arrangement on continuous memory using vectorization as an optimization.

Performance Notes:
  • Performance remains high due to increased memory coalescing, provided that vectorization requirements are fulfilled. Otherwise, the method falls back to block_store_direct.
Requirements:
  • The output offset (block_output) must be quad-item aligned.
  • The following conditions will prevent vectorization and switch to default block_store_direct:
    • ItemsPerThread is odd.
    • The datatype T is not a primitive or a HIP vector type (e.g. int2, int4, etc.).
block_store_transpose 

A blocked arrangement of items is locally transposed and stored as a striped arrangement of data on continuous memory.

Performance Notes:
  • Performance remains high due to increased memory coalescing, regardless of the number of items per thread.
  • Performance may be better compared to block_store_direct and block_store_vectorize due to reordering on local memory.
block_store_warp_transpose 

A blocked arrangement of items is locally transposed and stored as a warp-striped arrangement of data on continuous memory.

Requirements:
  • The number of threads in the block must be a multiple of the hardware warp size.
Performance Notes:
  • Performance remains high due to increased memory coalescing, regardless of the number of items per thread.
  • Performance may be better compared to block_store_direct and block_store_vectorize due to reordering on local memory.
default_method 

Defaults to block_store_direct.
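
The transpose step described for block_store_transpose can be pictured as a blocked-to-striped exchange followed by a striped store. The host-side sketch below emulates that with a plain vector standing in for local (shared) memory; `blocked_to_striped` and `store_striped` are hypothetical helper names, not rocPRIM functions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using items_per_thread_t = std::vector<std::vector<int>>; // [thread][item]

// Exchange through a staging buffer: thread t ends up holding the elements
// at logical positions i * block_size + t (striped arrangement).
items_per_thread_t blocked_to_striped(const items_per_thread_t& blocked)
{
    assert(!blocked.empty());
    const std::size_t block_size = blocked.size();
    const std::size_t ipt        = blocked[0].size();

    std::vector<int> staging(block_size * ipt); // stands in for local memory
    for (std::size_t t = 0; t < block_size; ++t)
        for (std::size_t i = 0; i < ipt; ++i)
            staging[t * ipt + i] = blocked[t][i];

    items_per_thread_t striped(block_size, std::vector<int>(ipt));
    for (std::size_t t = 0; t < block_size; ++t)
        for (std::size_t i = 0; i < ipt; ++i)
            striped[t][i] = staging[i * block_size + t];
    return striped;
}

// Striped store: at step i, neighbouring threads write neighbouring addresses.
std::vector<int> store_striped(const items_per_thread_t& striped)
{
    assert(!striped.empty());
    const std::size_t block_size = striped.size();
    const std::size_t ipt        = striped[0].size();

    std::vector<int> output(block_size * ipt);
    for (std::size_t i = 0; i < ipt; ++i)
        for (std::size_t t = 0; t < block_size; ++t)
            output[i * block_size + t] = striped[t][i];
    return output;
}
```

The data that reaches memory is in the same order as a direct blocked store would produce; only the pattern of writes per step changes.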

Function Documentation

◆ block_load_direct_blocked() [1/3]

template<class InputIterator , class T , unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_INLINE void block_load_direct_blocked (unsigned int flat_id, InputIterator block_input, T(&items)[ItemsPerThread])

Loads data from continuous memory into a blocked arrangement of items across the thread block.

The block arrangement is assumed to be (block-threads * ItemsPerThread) items across a thread block. Each thread uses a flat_id to load a range of ItemsPerThread into items.

Template Parameters
InputIterator- [inferred] an iterator type for input (can be a simple pointer)
T- [inferred] the data type
ItemsPerThread- [inferred] the number of items to be processed by each thread
Parameters
flat_id- a local flat 1D thread id in a block (tile) for the calling thread
block_input- the input iterator from the thread block to load from
items- array that data is loaded to
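
The index arithmetic of a blocked load can be emulated on the host. The sketch below is an assumption-laden illustration, not rocPRIM code: `load_blocked_host` is a hypothetical name, a `std::vector` replaces the device iterator, and the function is called once per emulated thread.

```cpp
#include <cassert>
#include <vector>

// Host-side emulation of a blocked load for one thread (hypothetical helper).
// Thread `flat_id` reads the contiguous range
// [flat_id * ItemsPerThread, (flat_id + 1) * ItemsPerThread).
template<class T, unsigned int ItemsPerThread>
void load_blocked_host(unsigned int flat_id,
                       const std::vector<T>& block_input,
                       T (&items)[ItemsPerThread])
{
    const unsigned int offset = flat_id * ItemsPerThread;
    assert(offset + ItemsPerThread <= block_input.size()); // unguarded: range must exist
    for (unsigned int i = 0; i < ItemsPerThread; ++i)
    {
        items[i] = block_input[offset + i];
    }
}
```

As in the signature above, ItemsPerThread is inferred from the array argument, so a call such as `load_blocked_host(1u, data, items)` with `int items[3]` reads elements 3, 4 and 5.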

◆ block_load_direct_blocked() [2/3]

template<class InputIterator , class T , unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_INLINE void block_load_direct_blocked (unsigned int flat_id, InputIterator block_input, T(&items)[ItemsPerThread], unsigned int valid)

Loads data from continuous memory into a blocked arrangement of items across the thread block, which is guarded by range valid.

The block arrangement is assumed to be (block-threads * ItemsPerThread) items across a thread block. Each thread uses a flat_id to load a range of ItemsPerThread into items.

Template Parameters
InputIterator- [inferred] an iterator type for input (can be a simple pointer)
T- [inferred] the data type
ItemsPerThread- [inferred] the number of items to be processed by each thread
Parameters
flat_id- a local flat 1D thread id in a block (tile) for the calling thread
block_input- the input iterator from the thread block to load from
items- array that data is loaded to
valid- maximum range of valid numbers to load

◆ block_load_direct_blocked() [3/3]

template<class InputIterator , class T , unsigned int ItemsPerThread, class Default >
ROCPRIM_DEVICE ROCPRIM_INLINE void block_load_direct_blocked (unsigned int flat_id, InputIterator block_input, T(&items)[ItemsPerThread], unsigned int valid, Default out_of_bounds)

Loads data from continuous memory into a blocked arrangement of items across the thread block, which is guarded by range valid, with a fall-back value for out-of-bound elements.

The block arrangement is assumed to be (block-threads * ItemsPerThread) items across a thread block. Each thread uses a flat_id to load a range of ItemsPerThread into items.

Template Parameters
InputIterator- [inferred] an iterator type for input (can be a simple pointer)
T- [inferred] the data type
ItemsPerThread- [inferred] the number of items to be processed by each thread
Default- [inferred] The data type of the default value
Parameters
flat_id- a local flat 1D thread id in a block (tile) for the calling thread
block_input- the input iterator from the thread block to load from
items- array that data is loaded to
valid- maximum range of valid numbers to load
out_of_bounds- default value assigned to out-of-bound items
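
The effect of valid and out_of_bounds can be emulated on the host as follows. This is a sketch under assumptions: `load_blocked_guarded_host` is a hypothetical name, and valid is interpreted as the number of loadable items counted from the start of block_input.

```cpp
#include <cassert>
#include <vector>

// Host-side emulation of a guarded blocked load for one thread
// (hypothetical helper). Positions at or beyond `valid` are not read;
// the corresponding items receive `out_of_bounds` instead.
template<class T, unsigned int ItemsPerThread, class Default>
void load_blocked_guarded_host(unsigned int flat_id,
                               const std::vector<T>& block_input,
                               T (&items)[ItemsPerThread],
                               unsigned int valid,
                               Default out_of_bounds)
{
    assert(valid <= block_input.size());
    const unsigned int offset = flat_id * ItemsPerThread;
    for (unsigned int i = 0; i < ItemsPerThread; ++i)
    {
        const unsigned int index = offset + i;
        items[i] = (index < valid) ? block_input[index]
                                   : static_cast<T>(out_of_bounds);
    }
}
```

This form is convenient for the last, partially filled tile of an input: every thread still ends up with ItemsPerThread initialized items, so later block-wide operations need no special casing.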

◆ block_load_direct_blocked_vectorized()

template<class T , class U , unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_INLINE auto block_load_direct_blocked_vectorized (unsigned int flat_id, T *block_input, U(&items)[ItemsPerThread]) -> typename std::enable_if< detail::is_vectorizable< T, ItemsPerThread >::value >::type

Loads data from continuous memory into a blocked arrangement of items across the thread block.

The block arrangement is assumed to be (block-threads * ItemsPerThread) items across a thread block. Each thread uses a flat_id to load a range of ItemsPerThread into items.

The input offset (block_input + offset) must be quad-item aligned.

The following conditions will prevent vectorization and switch to default block_load_direct_blocked:

  • ItemsPerThread is odd.
  • The datatype T is not a primitive or a HIP vector type (e.g. int2, int4, etc.).
Template Parameters
T- [inferred] the input data type
U- [inferred] the output data type
ItemsPerThread- [inferred] the number of items to be processed by each thread

The type T must be such that it can be implicitly converted to U.

Parameters
flat_id- a local flat 1D thread id in a block (tile) for the calling thread
block_input- the input iterator from the thread block to load from
items- array that data is loaded to

◆ block_load_direct_striped() [1/3]

template<unsigned int BlockSize, class InputIterator , class T , unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_INLINE void block_load_direct_striped (unsigned int flat_id, InputIterator block_input, T(&items)[ItemsPerThread])

Loads data from continuous memory into a striped arrangement of items across the thread block.

The striped arrangement is assumed to be (BlockSize * ItemsPerThread) items across a thread block. Each thread uses a flat_id to load a range of ItemsPerThread into items.

Template Parameters
BlockSize- the number of threads in a block
InputIterator- [inferred] an iterator type for input (can be a simple pointer)
T- [inferred] the data type
ItemsPerThread- [inferred] the number of items to be processed by each thread
Parameters
flat_id- a local flat 1D thread id in a block (tile) for the calling thread
block_input- the input iterator from the thread block to load from
items- array that data is loaded to
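
A striped load differs from a blocked load only in its index arithmetic. The host-side sketch below emulates it for one thread; `load_striped_host` is a hypothetical name and the `std::vector` input is a stand-in for the device iterator.

```cpp
#include <cassert>
#include <vector>

// Host-side emulation of a striped load for one thread (hypothetical helper).
// Item i of thread `flat_id` comes from position i * BlockSize + flat_id.
template<unsigned int BlockSize, class T, unsigned int ItemsPerThread>
void load_striped_host(unsigned int flat_id,
                       const std::vector<T>& block_input,
                       T (&items)[ItemsPerThread])
{
    assert(flat_id < BlockSize);
    assert(block_input.size() >= BlockSize * ItemsPerThread);
    for (unsigned int i = 0; i < ItemsPerThread; ++i)
    {
        items[i] = block_input[i * BlockSize + flat_id];
    }
}
```

BlockSize cannot be inferred from the arguments, which is why it is the first, explicitly specified template parameter, for example `load_striped_host<2>(1u, data, items)`.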

◆ block_load_direct_striped() [2/3]

template<unsigned int BlockSize, class InputIterator , class T , unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_INLINE void block_load_direct_striped (unsigned int flat_id, InputIterator block_input, T(&items)[ItemsPerThread], unsigned int valid)

Loads data from continuous memory into a striped arrangement of items across the thread block, which is guarded by range valid.

The striped arrangement is assumed to be (BlockSize * ItemsPerThread) items across a thread block. Each thread uses a flat_id to load a range of ItemsPerThread into items.

Template Parameters
BlockSize- the number of threads in a block
InputIterator- [inferred] an iterator type for input (can be a simple pointer)
T- [inferred] the data type
ItemsPerThread- [inferred] the number of items to be processed by each thread
Parameters
flat_id- a local flat 1D thread id in a block (tile) for the calling thread
block_input- the input iterator from the thread block to load from
items- array that data is loaded to
valid- maximum range of valid numbers to load

◆ block_load_direct_striped() [3/3]

template<unsigned int BlockSize, class InputIterator , class T , unsigned int ItemsPerThread, class Default >
ROCPRIM_DEVICE ROCPRIM_INLINE void block_load_direct_striped (unsigned int flat_id, InputIterator block_input, T(&items)[ItemsPerThread], unsigned int valid, Default out_of_bounds)

Loads data from continuous memory into a striped arrangement of items across the thread block, which is guarded by range valid, with a fall-back value for out-of-bound elements.

The striped arrangement is assumed to be (BlockSize * ItemsPerThread) items across a thread block. Each thread uses a flat_id to load a range of ItemsPerThread into items.

Template Parameters
BlockSize- the number of threads in a block
InputIterator- [inferred] an iterator type for input (can be a simple pointer)
T- [inferred] the data type
ItemsPerThread- [inferred] the number of items to be processed by each thread
Default- [inferred] The data type of the default value
Parameters
flat_id- a local flat 1D thread id in a block (tile) for the calling thread
block_input- the input iterator from the thread block to load from
items- array that data is loaded to
valid- maximum range of valid numbers to load
out_of_bounds- default value assigned to out-of-bound items

◆ block_load_direct_warp_striped() [1/3]

template<unsigned int WarpSize = device_warp_size(), class InputIterator , class T , unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_INLINE void block_load_direct_warp_striped (unsigned int flat_id, InputIterator block_input, T(&items)[ItemsPerThread])

Loads data from continuous memory into a warp-striped arrangement of items across the thread block.

The warp-striped arrangement is assumed to be (WarpSize * ItemsPerThread) items across a thread block. Each thread uses a flat_id to load a range of ItemsPerThread into items.

  • The number of threads in the block must be a multiple of WarpSize.
  • The default WarpSize is the hardware warp size, which is the optimal value.
  • WarpSize must be a power of two and less than or equal to the hardware warp size.
  • Using a WarpSize smaller than the hardware warp size could result in lower performance.
Template Parameters
WarpSize- [optional] the number of threads in a warp
InputIterator- [inferred] an iterator type for input (can be a simple pointer)
T- [inferred] the data type
ItemsPerThread- [inferred] the number of items to be processed by each thread
Parameters
flat_id- a local flat 1D thread id in a block (tile) for the calling thread
block_input- the input iterator from the thread block to load from
items- array that data is loaded to
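
In a warp-striped arrangement, each warp owns a contiguous tile of WarpSize * ItemsPerThread items and stripes its lanes across that tile. The host-side sketch below emulates the index arithmetic for one thread; `load_warp_striped_host` is a hypothetical name, and the lane and warp ids are derived from flat_id by plain division, which is an assumption about how a flat id maps to warps.

```cpp
#include <cassert>
#include <vector>

// Host-side emulation of a warp-striped load for one thread
// (hypothetical helper, not rocPRIM code).
template<unsigned int WarpSize, class T, unsigned int ItemsPerThread>
void load_warp_striped_host(unsigned int flat_id,
                            const std::vector<T>& block_input,
                            T (&items)[ItemsPerThread])
{
    static_assert(WarpSize > 0 && (WarpSize & (WarpSize - 1)) == 0,
                  "WarpSize must be a power of two");

    const unsigned int lane        = flat_id % WarpSize; // position inside the warp
    const unsigned int warp        = flat_id / WarpSize; // which warp in the block
    const unsigned int warp_offset = warp * WarpSize * ItemsPerThread;
    assert(warp_offset + WarpSize * ItemsPerThread <= block_input.size());

    for (unsigned int i = 0; i < ItemsPerThread; ++i)
    {
        // Lanes of one warp read adjacent items at every step.
        items[i] = block_input[warp_offset + i * WarpSize + lane];
    }
}
```

With WarpSize 2 and 2 items per thread, thread 3 is lane 1 of warp 1 and reads positions 5 and 7, while thread 0 reads positions 0 and 2.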

◆ block_load_direct_warp_striped() [2/3]

template<unsigned int WarpSize = device_warp_size(), class InputIterator , class T , unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_INLINE void block_load_direct_warp_striped (unsigned int flat_id, InputIterator block_input, T(&items)[ItemsPerThread], unsigned int valid)

Loads data from continuous memory into a warp-striped arrangement of items across the thread block, which is guarded by range valid.

The warp-striped arrangement is assumed to be (WarpSize * ItemsPerThread) items across a thread block. Each thread uses a flat_id to load a range of ItemsPerThread into items.

  • The number of threads in the block must be a multiple of WarpSize.
  • The default WarpSize is the hardware warp size, which is the optimal value.
  • WarpSize must be a power of two and less than or equal to the hardware warp size.
  • Using a WarpSize smaller than the hardware warp size could result in lower performance.
Template Parameters
WarpSize- [optional] the number of threads in a warp
InputIterator- [inferred] an iterator type for input (can be a simple pointer)
T- [inferred] the data type
ItemsPerThread- [inferred] the number of items to be processed by each thread
Parameters
flat_id- a local flat 1D thread id in a block (tile) for the calling thread
block_input- the input iterator from the thread block to load from
items- array that data is loaded to
valid- maximum range of valid numbers to load

◆ block_load_direct_warp_striped() [3/3]

template<unsigned int WarpSize = device_warp_size(), class InputIterator , class T , unsigned int ItemsPerThread, class Default >
ROCPRIM_DEVICE ROCPRIM_INLINE void block_load_direct_warp_striped (unsigned int flat_id, InputIterator block_input, T(&items)[ItemsPerThread], unsigned int valid, Default out_of_bounds)

Loads data from continuous memory into a warp-striped arrangement of items across the thread block, which is guarded by range valid, with a fall-back value for out-of-bound elements.

The warp-striped arrangement is assumed to be (WarpSize * ItemsPerThread) items across a thread block. Each thread uses a flat_id to load a range of ItemsPerThread into items.

  • The number of threads in the block must be a multiple of WarpSize.
  • The default WarpSize is the hardware warp size, which is the optimal value.
  • WarpSize must be a power of two and less than or equal to the hardware warp size.
  • Using a WarpSize smaller than the hardware warp size could result in lower performance.
Template Parameters
WarpSize- [optional] the number of threads in a warp
InputIterator- [inferred] an iterator type for input (can be a simple pointer)
T- [inferred] the data type
ItemsPerThread- [inferred] the number of items to be processed by each thread
Default- [inferred] The data type of the default value
Parameters
flat_id- a local flat 1D thread id in a block (tile) for the calling thread
block_input- the input iterator from the thread block to load from
items- array that data is loaded to
valid- maximum range of valid numbers to load
out_of_bounds- default value assigned to out-of-bound items

◆ block_store_direct_blocked() [1/2]

template<class OutputIterator , class T , unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_INLINE void block_store_direct_blocked (unsigned int flat_id, OutputIterator block_output, T(&items)[ItemsPerThread])

Stores a blocked arrangement of items from across the thread block into a blocked arrangement on continuous memory.

The block arrangement is assumed to be (block-threads * ItemsPerThread) items across a thread block. Each thread uses a flat_id to store a range of ItemsPerThread items to the thread block.

Template Parameters
OutputIterator- [inferred] an iterator type for output (can be a simple pointer)
T- [inferred] the data type
ItemsPerThread- [inferred] the number of items to be processed by each thread
Parameters
flat_id- a local flat 1D thread id in a block (tile) for the calling thread
block_output- the output iterator from the thread block to store to
items- array that data is stored from

◆ block_store_direct_blocked() [2/2]

template<class OutputIterator , class T , unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_INLINE void block_store_direct_blocked (unsigned int flat_id, OutputIterator block_output, T(&items)[ItemsPerThread], unsigned int valid)

Stores a blocked arrangement of items from across the thread block into a blocked arrangement on continuous memory, which is guarded by range valid.

The block arrangement is assumed to be (block-threads * ItemsPerThread) items across a thread block. Each thread uses a flat_id to store a range of ItemsPerThread items to the thread block.

Template Parameters
OutputIterator- [inferred] an iterator type for output (can be a simple pointer)
T- [inferred] the data type
ItemsPerThread- [inferred] the number of items to be processed by each thread
Parameters
flat_id- a local flat 1D thread id in a block (tile) for the calling thread
block_output- the output iterator from the thread block to store to
items- array that data is stored from
valid- maximum range of valid numbers to store
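
The guarded store can be emulated on the host in the same way as the guarded load. In this sketch `store_blocked_guarded_host` is a hypothetical name, and valid is interpreted as the number of writable positions counted from the start of block_output.

```cpp
#include <cassert>
#include <vector>

// Host-side emulation of a guarded blocked store for one thread
// (hypothetical helper). Positions at or beyond `valid` are left untouched.
template<class T, unsigned int ItemsPerThread>
void store_blocked_guarded_host(unsigned int flat_id,
                                std::vector<T>& block_output,
                                const T (&items)[ItemsPerThread],
                                unsigned int valid)
{
    assert(valid <= block_output.size());
    const unsigned int offset = flat_id * ItemsPerThread;
    for (unsigned int i = 0; i < ItemsPerThread; ++i)
    {
        const unsigned int index = offset + i;
        if (index < valid)
        {
            block_output[index] = items[i];
        }
    }
}
```

With 3 items per thread and valid equal to 5, thread 1 writes positions 3 and 4 and skips position 5.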

◆ block_store_direct_blocked_vectorized()

template<class T , class U , unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_INLINE auto block_store_direct_blocked_vectorized ( unsigned int  flat_id,
T *  block_output,
U(&)  items[ItemsPerThread] 
) -> typename std::enable_if<detail::is_vectorizable<T, ItemsPerThread>::value>::type

Stores a blocked arrangement of items from across the thread block into a blocked arrangement on continuous memory.

The blocked arrangement is assumed to be (block-threads * ItemsPerThread) items across a thread block. Each thread uses flat_id to store its range of ItemsPerThread items to block_output.

The output offset (block_output + offset) must be quad-item aligned.

The following conditions will prevent vectorization and switch to default block_store_direct_blocked:

  • ItemsPerThread is odd.
  • The datatype T is not a primitive or a HIP vector type (e.g. int2, int4, etc.).
Template Parameters
T- [inferred] the output data type
U- [inferred] the input data type
ItemsPerThread- [inferred] the number of items to be processed by each thread

The type U must be such that it can be implicitly converted to T.

Parameters
flat_id- a local flat 1D thread id in a block (tile) for the calling thread
block_output- the output pointer to store to
items- array of items that the calling thread stores to block_output
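
The enable_if-based selection between the vectorized path and the fallback can be illustrated with a small host-side sketch. Everything here is hypothetical: sketch_is_vectorizable is a simplified stand-in built only from the two documented conditions and is not rocprim::detail::is_vectorizable, HIP vector types are left out because they are not available in plain host C++, and sketch_select_path only reports which overload would be chosen instead of storing anything.

```cpp
#include <cassert>
#include <type_traits>

// Simplified stand-in for the documented conditions (assumption, not the
// library trait): ItemsPerThread must be even and T must be a primitive type.
template<class T, unsigned int ItemsPerThread>
struct sketch_is_vectorizable
    : std::integral_constant<bool,
                             (ItemsPerThread % 2 == 0) && std::is_arithmetic<T>::value>
{};

enum class sketch_store_path
{
    vectorized,
    fallback
};

// Overload selected when the conditions for vectorization hold.
template<class T, unsigned int ItemsPerThread>
auto sketch_select_path(const T (&)[ItemsPerThread]) ->
    typename std::enable_if<sketch_is_vectorizable<T, ItemsPerThread>::value,
                            sketch_store_path>::type
{
    return sketch_store_path::vectorized;
}

// Overload selected otherwise; it models the switch to the default
// (non-vectorized) blocked store.
template<class T, unsigned int ItemsPerThread>
auto sketch_select_path(const T (&)[ItemsPerThread]) ->
    typename std::enable_if<!sketch_is_vectorizable<T, ItemsPerThread>::value,
                            sketch_store_path>::type
{
    return sketch_store_path::fallback;
}

// A user-defined type that is neither primitive nor a HIP vector type.
struct sketch_custom_type
{
    int a;
    int b;
};
```

In this model an int array of 4 items selects the vectorized overload, while an int array of 3 items or an array of sketch_custom_type selects the fallback.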

◆ block_store_direct_striped() [1/2]

template<unsigned int BlockSize, class OutputIterator , class T , unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_INLINE void block_store_direct_striped ( unsigned int  flat_id,
OutputIterator  block_output,
T(&)  items[ItemsPerThread] 
)

Stores a striped arrangement of items from across the thread block into a blocked arrangement on continuous memory.

The striped arrangement is assumed to be (BlockSize * ItemsPerThread) items across a thread block. Each thread uses flat_id to store its ItemsPerThread items to block_output.

Template Parameters
BlockSize- the number of threads in a block
OutputIterator- [inferred] an iterator type for output (can be a simple pointer)
T- [inferred] the data type
ItemsPerThread- [inferred] the number of items to be processed by each thread
Parameters
flat_id- a local flat 1D thread id in a block (tile) for the calling thread
block_output- the output iterator to store to
items- array of items that the calling thread stores to block_output
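
The striped mapping can be modeled on the host as follows. This is a sketch under stated assumptions rather than the library code: sketch_store_striped and run_striped_example are hypothetical names, and thread flat_id is assumed to write item i to offset flat_id + i * BlockSize.

```cpp
#include <cassert>
#include <vector>

// Host-side model of the striped store (hypothetical helper): consecutive
// items of one thread land BlockSize elements apart in the output.
template<unsigned int BlockSize, class T, unsigned int ItemsPerThread>
void sketch_store_striped(unsigned int flat_id,
                          std::vector<T>& output,
                          const T (&items)[ItemsPerThread])
{
    for(unsigned int i = 0; i < ItemsPerThread; ++i)
    {
        output[flat_id + i * BlockSize] = items[i];
    }
}

// Simulates a block of 3 threads with 2 items each; thread t holds
// {10 * t, 10 * t + 1}.
std::vector<int> run_striped_example()
{
    constexpr unsigned int block_size       = 3;
    constexpr unsigned int items_per_thread = 2;
    std::vector<int> output(block_size * items_per_thread, -1);
    for(unsigned int t = 0; t < block_size; ++t)
    {
        int items[items_per_thread];
        for(unsigned int i = 0; i < items_per_thread; ++i)
        {
            items[i] = static_cast<int>(10 * t + i);
        }
        sketch_store_striped<block_size>(t, output, items);
    }
    return output;
}
```

Here the output is 0 10 20 1 11 21: the first items of all threads come first, followed by the second items of all threads.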

◆ block_store_direct_striped() [2/2]

template<unsigned int BlockSize, class OutputIterator , class T , unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_INLINE void block_store_direct_striped ( unsigned int  flat_id,
OutputIterator  block_output,
T(&)  items[ItemsPerThread],
unsigned int  valid 
)

Stores a striped arrangement of items from across the thread block into a blocked arrangement on continuous memory, which is guarded by range valid.

The striped arrangement is assumed to be (BlockSize * ItemsPerThread) items across a thread block. Each thread uses flat_id to store its ItemsPerThread items to block_output.

Template Parameters
BlockSize- the number of threads in a block
OutputIterator- [inferred] an iterator type for output (can be a simple pointer)
T- [inferred] the data type
ItemsPerThread- [inferred] the number of items to be processed by each thread
Parameters
flat_id- a local flat 1D thread id in a block (tile) for the calling thread
block_output- the output iterator to store to
items- array of items that the calling thread stores to block_output
valid- the number of valid items in block_output; items that would land at or beyond index valid are not stored
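
With a striped layout the valid guard drops different items than in the blocked case. The host-side sketch below illustrates this; sketch_store_striped_guarded and run_guarded_striped_example are hypothetical names, and the guard is assumed to compare the destination index flat_id + i * BlockSize against valid.

```cpp
#include <cassert>
#include <vector>

// Host-side model of the guarded striped store (hypothetical helper): an item
// is written only if its destination index is smaller than valid.
template<unsigned int BlockSize, class T, unsigned int ItemsPerThread>
void sketch_store_striped_guarded(unsigned int flat_id,
                                  std::vector<T>& output,
                                  const T (&items)[ItemsPerThread],
                                  unsigned int valid)
{
    for(unsigned int i = 0; i < ItemsPerThread; ++i)
    {
        const unsigned int index = flat_id + i * BlockSize;
        if(index < valid)
        {
            output[index] = items[i];
        }
    }
}

// Simulates a block of 3 threads with 2 items each; thread t holds
// {10 * t, 10 * t + 1}. Untouched output slots keep the sentinel -1.
std::vector<int> run_guarded_striped_example(unsigned int valid)
{
    constexpr unsigned int block_size       = 3;
    constexpr unsigned int items_per_thread = 2;
    std::vector<int> output(block_size * items_per_thread, -1);
    for(unsigned int t = 0; t < block_size; ++t)
    {
        int items[items_per_thread];
        for(unsigned int i = 0; i < items_per_thread; ++i)
        {
            items[i] = static_cast<int>(10 * t + i);
        }
        sketch_store_striped_guarded<block_size>(t, output, items, valid);
    }
    return output;
}
```

With valid equal to 4, the output is 0 10 20 1 -1 -1: the second items of threads 1 and 2 are dropped, while every thread still stores its first item.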

◆ block_store_direct_warp_striped() [1/2]

template<unsigned int WarpSize = device_warp_size(), class OutputIterator , class T , unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_INLINE void block_store_direct_warp_striped ( unsigned int  flat_id,
OutputIterator  block_output,
T(&)  items[ItemsPerThread] 
)

Stores a warp-striped arrangement of items from across the thread block into a blocked arrangement on continuous memory.

The warp-striped arrangement is assumed to be (WarpSize * ItemsPerThread) items across a thread block. Each thread uses flat_id to store its ItemsPerThread items to block_output.

  • The number of threads in the block must be a multiple of WarpSize.
  • The default WarpSize is the hardware warp size, which is the optimal value.
  • WarpSize must be a power of two and less than or equal to the hardware warp size.
  • Using a WarpSize smaller than the hardware warp size could result in lower performance.
Template Parameters
WarpSize- [optional] the number of threads in a warp
OutputIterator- [inferred] an iterator type for output (can be a simple pointer)
T- [inferred] the data type
ItemsPerThread- [inferred] the number of items to be processed by each thread
Parameters
flat_id- a local flat 1D thread id in a block (tile) for the calling thread
block_output- the output iterator to store to
items- array of items that the calling thread stores to block_output
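
The warp-striped mapping can be modeled on the host as well. This is a sketch under stated assumptions, not the rocPRIM implementation: sketch_store_warp_striped and run_warp_striped_example are hypothetical names, the lane is assumed to be flat_id % WarpSize and the warp index flat_id / WarpSize, and each warp is assumed to own a contiguous segment of WarpSize * ItemsPerThread output elements in which its items are striped.

```cpp
#include <cassert>
#include <vector>

// Host-side model of the warp-striped store (hypothetical helper): within the
// segment owned by a warp, consecutive items of one thread land WarpSize
// elements apart.
template<unsigned int WarpSize, class T, unsigned int ItemsPerThread>
void sketch_store_warp_striped(unsigned int flat_id,
                               std::vector<T>& output,
                               const T (&items)[ItemsPerThread])
{
    const unsigned int lane        = flat_id % WarpSize;
    const unsigned int warp_id     = flat_id / WarpSize;
    const unsigned int warp_offset = warp_id * WarpSize * ItemsPerThread;
    for(unsigned int i = 0; i < ItemsPerThread; ++i)
    {
        output[warp_offset + lane + i * WarpSize] = items[i];
    }
}

// Simulates a block of 4 threads split into warps of 2 threads, with 2 items
// per thread; thread t holds {10 * t, 10 * t + 1}.
std::vector<int> run_warp_striped_example()
{
    constexpr unsigned int warp_size        = 2;
    constexpr unsigned int block_size       = 4;
    constexpr unsigned int items_per_thread = 2;
    std::vector<int> output(block_size * items_per_thread, -1);
    for(unsigned int t = 0; t < block_size; ++t)
    {
        int items[items_per_thread];
        for(unsigned int i = 0; i < items_per_thread; ++i)
        {
            items[i] = static_cast<int>(10 * t + i);
        }
        sketch_store_warp_striped<warp_size>(t, output, items);
    }
    return output;
}
```

The output is 0 10 1 11 20 30 21 31: the first four elements belong to warp 0 (threads 0 and 1) and the last four to warp 1 (threads 2 and 3), each striped internally.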

◆ block_store_direct_warp_striped() [2/2]

template<unsigned int WarpSize = device_warp_size(), class OutputIterator , class T , unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_INLINE void block_store_direct_warp_striped ( unsigned int  flat_id,
OutputIterator  block_output,
T(&)  items[ItemsPerThread],
unsigned int  valid 
)

Stores a warp-striped arrangement of items from across the thread block into a blocked arrangement on continuous memory, which is guarded by range valid.

The warp-striped arrangement is assumed to be (WarpSize * ItemsPerThread) items across a thread block. Each thread uses flat_id to store its ItemsPerThread items to block_output.

  • The number of threads in the block must be a multiple of WarpSize.
  • The default WarpSize is the hardware warp size, which is the optimal value.
  • WarpSize must be a power of two and less than or equal to the hardware warp size.
  • Using a WarpSize smaller than the hardware warp size could result in lower performance.
Template Parameters
WarpSize- [optional] the number of threads in a warp
OutputIterator- [inferred] an iterator type for output (can be a simple pointer)
T- [inferred] the data type
ItemsPerThread- [inferred] the number of items to be processed by each thread
Parameters
flat_id- a local flat 1D thread id in a block (tile) for the calling thread
block_output- the output iterator to store to
items- array of items that the calling thread stores to block_output
valid- the number of valid items in block_output; items that would land at or beyond index valid are not stored
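
The guarded warp-striped store can be sketched by combining the warp-striped index with a bounds check. As before, this is a host-side model with hypothetical names (sketch_store_warp_striped_guarded, run_guarded_warp_striped_example); it assumes the lane is flat_id % WarpSize and that the guard compares the full destination index against valid.

```cpp
#include <cassert>
#include <vector>

// Host-side model of the guarded warp-striped store (hypothetical helper): an
// item is written only if its destination index is smaller than valid.
template<unsigned int WarpSize, class T, unsigned int ItemsPerThread>
void sketch_store_warp_striped_guarded(unsigned int flat_id,
                                       std::vector<T>& output,
                                       const T (&items)[ItemsPerThread],
                                       unsigned int valid)
{
    const unsigned int lane        = flat_id % WarpSize;
    const unsigned int warp_offset = (flat_id / WarpSize) * WarpSize * ItemsPerThread;
    for(unsigned int i = 0; i < ItemsPerThread; ++i)
    {
        const unsigned int index = warp_offset + lane + i * WarpSize;
        if(index < valid)
        {
            output[index] = items[i];
        }
    }
}

// Simulates a block of 4 threads split into warps of 2 threads, with 2 items
// per thread; thread t holds {10 * t, 10 * t + 1}. Untouched output slots keep
// the sentinel -1.
std::vector<int> run_guarded_warp_striped_example(unsigned int valid)
{
    constexpr unsigned int warp_size        = 2;
    constexpr unsigned int block_size       = 4;
    constexpr unsigned int items_per_thread = 2;
    std::vector<int> output(block_size * items_per_thread, -1);
    for(unsigned int t = 0; t < block_size; ++t)
    {
        int items[items_per_thread];
        for(unsigned int i = 0; i < items_per_thread; ++i)
        {
            items[i] = static_cast<int>(10 * t + i);
        }
        sketch_store_warp_striped_guarded<warp_size>(t, output, items, valid);
    }
    return output;
}
```

With valid equal to 5, warp 0 stores all of its items and warp 1 stores only the first item of thread 2, giving 0 10 1 11 20 -1 -1 -1.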