rocPRIM
Block module

Namespaces

 detail
 Deprecated: Configuration of device-level scan primitives.
 

Classes

class  block_adjacent_difference< T, BlockSizeX, BlockSizeY, BlockSizeZ >
 The block_adjacent_difference class is a block level parallel primitive which provides methods for applying binary functions to pairs of consecutive items partitioned across a thread block. More...
 
class  block_discontinuity< T, BlockSizeX, BlockSizeY, BlockSizeZ >
 The block_discontinuity class is a block level parallel primitive which provides methods for flagging discontinuities within an ordered set of items across threads in a block. More...
 
class  block_exchange< T, BlockSizeX, ItemsPerThread, BlockSizeY, BlockSizeZ >
 The block_exchange class is a block level parallel primitive which provides methods for rearranging items partitioned across threads in a block. More...
 
class  block_histogram< T, BlockSizeX, ItemsPerThread, Bins, Algorithm, BlockSizeY, BlockSizeZ >
 The block_histogram class is a block level parallel primitive which provides methods for constructing block-wide histograms from items partitioned across threads in a block. More...
 
class  block_load< T, BlockSizeX, ItemsPerThread, Method, BlockSizeY, BlockSizeZ >
 The block_load class is a block level parallel primitive which provides methods for loading data from continuous memory into a blocked arrangement of items across the thread block. More...
 
class  block_load< T, BlockSizeX, ItemsPerThread, block_load_method::block_load_striped, BlockSizeY, BlockSizeZ >
 
class  block_load< T, BlockSizeX, ItemsPerThread, block_load_method::block_load_vectorize, BlockSizeY, BlockSizeZ >
 
class  block_load< T, BlockSizeX, ItemsPerThread, block_load_method::block_load_transpose, BlockSizeY, BlockSizeZ >
 
class  block_load< T, BlockSizeX, ItemsPerThread, block_load_method::block_load_warp_transpose, BlockSizeY, BlockSizeZ >
 
class  block_radix_rank< BlockSizeX, RadixBits, Algorithm, BlockSizeY, BlockSizeZ >
 The block_radix_rank class is a block level parallel primitive that provides methods for ranking items partitioned across threads in a block. More...
 
class  block_radix_sort< Key, BlockSizeX, ItemsPerThread, Value, BlockSizeY, BlockSizeZ >
 The block_radix_sort class is a block level parallel primitive which provides methods for sorting items (keys or key-value pairs) partitioned across threads in a block using the radix sort algorithm. More...
 
class  block_reduce< T, BlockSizeX, Algorithm, BlockSizeY, BlockSizeZ >
 The block_reduce class is a block level parallel primitive which provides methods for performing reduction operations on items partitioned across threads in a block. More...
 
class  block_scan< T, BlockSizeX, Algorithm, BlockSizeY, BlockSizeZ >
 The block_scan class is a block level parallel primitive which provides methods for performing inclusive and exclusive scan operations of items partitioned across threads in a block. More...
 
class  block_shuffle< T, BlockSizeX, BlockSizeY, BlockSizeZ >
 The block_shuffle class is a block level parallel primitive which provides methods for shuffling data partitioned across a block. More...
 
class  block_sort< Key, BlockSizeX, ItemsPerThread, Value, Algorithm, BlockSizeY, BlockSizeZ >
 The block_sort class is a block level parallel primitive which provides methods for sorting items (keys or key-value pairs) partitioned across threads in a block using a comparison-based sort algorithm. More...
 
class  block_store< T, BlockSizeX, ItemsPerThread, Method, BlockSizeY, BlockSizeZ >
 The block_store class is a block level parallel primitive which provides methods for storing an arrangement of items into a blocked/striped arrangement on continuous memory. More...
 
class  block_store< T, BlockSizeX, ItemsPerThread, block_store_method::block_store_striped, BlockSizeY, BlockSizeZ >
 
class  block_store< T, BlockSizeX, ItemsPerThread, block_store_method::block_store_vectorize, BlockSizeY, BlockSizeZ >
 
class  block_store< T, BlockSizeX, ItemsPerThread, block_store_method::block_store_transpose, BlockSizeY, BlockSizeZ >
 
class  block_store< T, BlockSizeX, ItemsPerThread, block_store_method::block_store_warp_transpose, BlockSizeY, BlockSizeZ >
 

Enumerations

enum  block_histogram_algorithm { block_histogram_algorithm::using_atomic, block_histogram_algorithm::using_sort, block_histogram_algorithm::default_algorithm = using_atomic }
 Available algorithms for block_histogram primitive. More...
 
enum  block_load_method {
  block_load_method::block_load_direct, block_load_method::block_load_striped, block_load_method::block_load_vectorize, block_load_method::block_load_transpose,
  block_load_method::block_load_warp_transpose, block_load_method::default_method = block_load_direct
}
 block_load_method enumerates the methods available to load data from continuous memory into a blocked arrangement of items across the thread block. More...
 
enum  block_radix_rank_algorithm { block_radix_rank_algorithm::basic, block_radix_rank_algorithm::basic_memoize, block_radix_rank_algorithm::match, block_radix_rank_algorithm::default_algorithm = basic }
 Available algorithms for the block_radix_rank primitive. More...
 
enum  block_reduce_algorithm { block_reduce_algorithm::using_warp_reduce, block_reduce_algorithm::raking_reduce, block_reduce_algorithm::raking_reduce_commutative_only, block_reduce_algorithm::default_algorithm = using_warp_reduce }
 Available algorithms for block_reduce primitive. More...
 
enum  block_scan_algorithm { block_scan_algorithm::using_warp_scan, block_scan_algorithm::reduce_then_scan, block_scan_algorithm::default_algorithm = using_warp_scan }
 Available algorithms for block_scan primitive. More...
 
enum  block_sort_algorithm { block_sort_algorithm::bitonic_sort, block_sort_algorithm::merge_sort, block_sort_algorithm::stable_merge_sort, block_sort_algorithm::default_algorithm = bitonic_sort }
 Available algorithms for block_sort primitive. More...
 
enum  block_store_method {
  block_store_method::block_store_direct, block_store_method::block_store_striped, block_store_method::block_store_vectorize, block_store_method::block_store_transpose,
  block_store_method::block_store_warp_transpose, block_store_method::default_method = block_store_direct
}
 block_store_method enumerates the methods available to store a striped arrangement of items into a blocked/striped arrangement on continuous memory. More...
 

Functions

template<class InputIterator , class T , unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_INLINE void block_load_direct_blocked (unsigned int flat_id, InputIterator block_input, T(&items)[ItemsPerThread])
 Loads data from continuous memory into a blocked arrangement of items across the thread block. More...
 
template<class InputIterator , class T , unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_INLINE void block_load_direct_blocked (unsigned int flat_id, InputIterator block_input, T(&items)[ItemsPerThread], unsigned int valid)
 Loads data from continuous memory into a blocked arrangement of items across the thread block, which is guarded by range valid. More...
 
template<class InputIterator , class T , unsigned int ItemsPerThread, class Default >
ROCPRIM_DEVICE ROCPRIM_INLINE void block_load_direct_blocked (unsigned int flat_id, InputIterator block_input, T(&items)[ItemsPerThread], unsigned int valid, Default out_of_bounds)
 Loads data from continuous memory into a blocked arrangement of items across the thread block, which is guarded by range valid, with a fall-back value for out-of-bound elements. More...
 
template<class T , class U , unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_INLINE auto block_load_direct_blocked_vectorized (unsigned int flat_id, T *block_input, U(&items)[ItemsPerThread]) -> typename std::enable_if< detail::is_vectorizable< T, ItemsPerThread >::value >::type
 Loads data from continuous memory into a blocked arrangement of items across the thread block. More...
 
template<unsigned int BlockSize, class InputIterator , class T , unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_INLINE void block_load_direct_striped (unsigned int flat_id, InputIterator block_input, T(&items)[ItemsPerThread])
 Loads data from continuous memory into a striped arrangement of items across the thread block. More...
 
template<unsigned int BlockSize, class InputIterator , class T , unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_INLINE void block_load_direct_striped (unsigned int flat_id, InputIterator block_input, T(&items)[ItemsPerThread], unsigned int valid)
 Loads data from continuous memory into a striped arrangement of items across the thread block, which is guarded by range valid. More...
 
template<unsigned int BlockSize, class InputIterator , class T , unsigned int ItemsPerThread, class Default >
ROCPRIM_DEVICE ROCPRIM_INLINE void block_load_direct_striped (unsigned int flat_id, InputIterator block_input, T(&items)[ItemsPerThread], unsigned int valid, Default out_of_bounds)
 Loads data from continuous memory into a striped arrangement of items across the thread block, which is guarded by range valid, with a fall-back value for out-of-bound elements. More...
 
template<unsigned int WarpSize = device_warp_size(), class InputIterator , class T , unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_INLINE void block_load_direct_warp_striped (unsigned int flat_id, InputIterator block_input, T(&items)[ItemsPerThread])
 Loads data from continuous memory into a warp-striped arrangement of items across the thread block. More...
 
template<unsigned int WarpSize = device_warp_size(), class InputIterator , class T , unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_INLINE void block_load_direct_warp_striped (unsigned int flat_id, InputIterator block_input, T(&items)[ItemsPerThread], unsigned int valid)
 Loads data from continuous memory into a warp-striped arrangement of items across the thread block, which is guarded by range valid. More...
 
template<unsigned int WarpSize = device_warp_size(), class InputIterator , class T , unsigned int ItemsPerThread, class Default >
ROCPRIM_DEVICE ROCPRIM_INLINE void block_load_direct_warp_striped (unsigned int flat_id, InputIterator block_input, T(&items)[ItemsPerThread], unsigned int valid, Default out_of_bounds)
 Loads data from continuous memory into a warp-striped arrangement of items across the thread block, which is guarded by range valid, with a fall-back value for out-of-bound elements. More...
 
template<class OutputIterator , class T , unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_INLINE void block_store_direct_blocked (unsigned int flat_id, OutputIterator block_output, T(&items)[ItemsPerThread])
 Stores a blocked arrangement of items from across the thread block into a blocked arrangement on continuous memory. More...
 
template<class OutputIterator , class T , unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_INLINE void block_store_direct_blocked (unsigned int flat_id, OutputIterator block_output, T(&items)[ItemsPerThread], unsigned int valid)
 Stores a blocked arrangement of items from across the thread block into a blocked arrangement on continuous memory, which is guarded by range valid. More...
 
template<class T , class U , unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_INLINE auto block_store_direct_blocked_vectorized (unsigned int flat_id, T *block_output, U(&items)[ItemsPerThread]) -> typename std::enable_if< detail::is_vectorizable< T, ItemsPerThread >::value >::type
 Stores a blocked arrangement of items from across the thread block into a blocked arrangement on continuous memory. More...
 
template<unsigned int BlockSize, class OutputIterator , class T , unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_INLINE void block_store_direct_striped (unsigned int flat_id, OutputIterator block_output, T(&items)[ItemsPerThread])
 Stores a striped arrangement of items from across the thread block into a blocked arrangement on continuous memory. More...
 
template<unsigned int BlockSize, class OutputIterator , class T , unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_INLINE void block_store_direct_striped (unsigned int flat_id, OutputIterator block_output, T(&items)[ItemsPerThread], unsigned int valid)
 Stores a striped arrangement of items from across the thread block into a blocked arrangement on continuous memory, which is guarded by range valid. More...
 
template<unsigned int WarpSize = device_warp_size(), class OutputIterator , class T , unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_INLINE void block_store_direct_warp_striped (unsigned int flat_id, OutputIterator block_output, T(&items)[ItemsPerThread])
 Stores a warp-striped arrangement of items from across the thread block into a blocked arrangement on continuous memory. More...
 
template<unsigned int WarpSize = device_warp_size(), class OutputIterator , class T , unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_INLINE void block_store_direct_warp_striped (unsigned int flat_id, OutputIterator block_output, T(&items)[ItemsPerThread], unsigned int valid)
 Stores a warp-striped arrangement of items from across the thread block into a blocked arrangement on continuous memory, which is guarded by range valid. More...
 

Enumeration Type Documentation

◆ block_histogram_algorithm

Available algorithms for block_histogram primitive.

Enumerator
using_atomic 

Atomic addition is used to update bin count directly.

Performance Notes:
  • Performance is dependent on hardware implementation of atomic addition.
  • Performance may decrease for non-uniform random input distributions where many concurrent updates may be made to the same bin counter.
using_sort 

A two-phase operation is used:

  • Data is sorted using radix-sort.
  • "Runs" of same-valued keys are detected using discontinuity; run-lengths are bin counts.

Performance Notes:
  • Performance is consistent regardless of sample bin distribution.
default_algorithm 

Default block_histogram algorithm.
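
The two-phase idea behind using_sort can be sketched on the host in plain C++. This is only an illustration of "sort, then count runs of equal keys"; it is not the rocPRIM implementation, and the function name `histogram_by_sort` is hypothetical. It assumes every sample is already a valid bin index.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side sketch; not part of rocPRIM.
// Phase 1: sort the samples. Phase 2: detect runs of equal keys;
// the length of each run is the count for that bin.
std::vector<unsigned int> histogram_by_sort(std::vector<unsigned int> samples,
                                            unsigned int bins)
{
    std::vector<unsigned int> counts(bins, 0u);
    std::sort(samples.begin(), samples.end());

    std::size_t run_start = 0;
    for (std::size_t i = 1; i <= samples.size(); ++i)
    {
        // A change of key (a discontinuity) or the end of the data closes a run.
        if (i == samples.size() || samples[i] != samples[run_start])
        {
            assert(samples[run_start] < bins); // assumption: samples are bin indices
            counts[samples[run_start]] = static_cast<unsigned int>(i - run_start);
            run_start = i;
        }
    }
    return counts;
}
```

Because the work is dominated by the sort, the cost of this sketch does not depend on how the samples are spread over the bins, which mirrors the performance note above.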

◆ block_load_method

enum block_load_method
strong

block_load_method enumerates the methods available to load data from continuous memory into a blocked arrangement of items across the thread block.

Enumerator
block_load_direct 

Data from continuous memory is loaded into a blocked arrangement of items.

Performance Notes:
  • Performance decreases with increasing number of items per thread (stride between reads), because of reduced memory coalescing.
block_load_striped 

A striped arrangement of data is read directly from memory.

block_load_vectorize 

Data from continuous memory is loaded into a blocked arrangement of items using vectorization as an optimization.

Performance Notes:
  • Performance remains high due to increased memory coalescing, provided that vectorization requirements are fulfilled. Otherwise, performance will default to block_load_direct.
Requirements:
  • The input offset (block_input) must be quad-item aligned.
  • The following conditions will prevent vectorization and switch to default block_load_direct:
    • ItemsPerThread is odd.
    • The datatype T is not a primitive or a HIP vector type (e.g. int2, int4, etc.).
block_load_transpose 

A striped arrangement of data from continuous memory is locally transposed into a blocked arrangement of items.

Performance Notes:
  • Performance remains high due to increased memory coalescing, regardless of the number of items per thread.
  • Performance may be better compared to block_load_direct and block_load_vectorize due to reordering on local memory.
block_load_warp_transpose 

A warp-striped arrangement of data from continuous memory is locally transposed into a blocked arrangement of items.

Requirements:
  • The number of threads in the block must be a multiple of the size of hardware warp.
Performance Notes:
  • Performance remains high due to increased memory coalescing, regardless of the number of items per thread.
  • Performance may be better compared to block_load_direct and block_load_vectorize due to reordering on local memory.
default_method 

Defaults to block_load_direct.
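
The coalescing remarks above come down to which element each thread touches in a given step. The following host-side sketch in plain C++ makes the two index mappings explicit. The helper names `blocked_index` and `striped_index` are hypothetical and are not rocPRIM functions; the layouts are the usual blocked and striped arrangements as described in the function documentation of this module.

```cpp
#include <cassert>

// Hypothetical helpers for illustration; not part of rocPRIM.

// Blocked: each thread owns items_per_thread consecutive elements.
constexpr unsigned int blocked_index(unsigned int flat_id, unsigned int item,
                                     unsigned int items_per_thread)
{
    return flat_id * items_per_thread + item;
}

// Striped: neighbouring threads own neighbouring elements, and the items
// of one thread are block_size elements apart.
constexpr unsigned int striped_index(unsigned int flat_id, unsigned int item,
                                     unsigned int block_size)
{
    return flat_id + item * block_size;
}

// Distance between the elements read by threads 0 and 1 in the same step.
inline void check_strides()
{
    assert(blocked_index(1, 0, 4) - blocked_index(0, 0, 4) == 4);   // grows with items per thread
    assert(striped_index(1, 0, 64) - striped_index(0, 0, 64) == 1); // always adjacent
}
```

In the blocked mapping the distance between neighbouring threads equals the number of items per thread, which is why a direct blocked load coalesces less well as that number grows, while a striped read keeps neighbouring threads on neighbouring elements.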

◆ block_radix_rank_algorithm

Available algorithms for the block_radix_rank primitive.

Enumerator
basic 

The basic block radix rank algorithm. Keys and ranks are assumed in blocked order.

basic_memoize 

The basic block radix rank algorithm, configured to memoize intermediate values.

This trades register usage for fewer shared memory operations. Keys and ranks are assumed in blocked order.

match 

Warp-based radix ranking algorithm. Keys and ranks are assumed in warp-striped order for this algorithm.

default_algorithm 

The default radix ranking algorithm.

◆ block_reduce_algorithm

Available algorithms for block_reduce primitive.

Enumerator
using_warp_reduce 

A warp_reduce based algorithm.

raking_reduce 

An algorithm which limits calculations to a single hardware warp.

raking_reduce_commutative_only 

A raking reduce algorithm that supports only commutative operators.

default_algorithm 

Default block_reduce algorithm.

◆ block_scan_algorithm

enum block_scan_algorithm
strong

Available algorithms for block_scan primitive.

Enumerator
using_warp_scan 

A warp_scan based algorithm.

reduce_then_scan 

An algorithm which limits calculations to a single hardware warp.

default_algorithm 

Default block_scan algorithm.

◆ block_sort_algorithm

enum block_sort_algorithm
strong

Available algorithms for block_sort primitive.

Enumerator
bitonic_sort 

A bitonic sort based algorithm.

merge_sort 

A merge sort based algorithm.

stable_merge_sort 

A merge sort based algorithm which sorts stably.

default_algorithm 

Default block_sort algorithm.

◆ block_store_method

enum block_store_method
strong

block_store_method enumerates the methods available to store a striped arrangement of items into a blocked/striped arrangement on continuous memory.

Enumerator
block_store_direct 

A blocked arrangement of items is stored into a blocked arrangement on continuous memory.

Performance Notes:
  • Performance decreases with increasing number of items per thread (stride between writes), because of reduced memory coalescing.
block_store_striped 

A striped arrangement of items is stored into a blocked arrangement on continuous memory.

block_store_vectorize 

A blocked arrangement of items is stored into a blocked arrangement on continuous memory using vectorization as an optimization.

Performance Notes:
  • Performance remains high due to increased memory coalescing, provided that vectorization requirements are fulfilled. Otherwise, performance will default to block_store_direct.
Requirements:
  • The output offset (block_output) must be quad-item aligned.
  • The following conditions will prevent vectorization and switch to default block_store_direct:
    • ItemsPerThread is odd.
    • The datatype T is not a primitive or a HIP vector type (e.g. int2, int4, etc.).
block_store_transpose 

A blocked arrangement of items is locally transposed and stored as a striped arrangement of data on continuous memory.

Performance Notes:
  • Performance remains high due to increased memory coalescing, regardless of the number of items per thread.
  • Performance may be better compared to block_store_direct and block_store_vectorize due to reordering on local memory.
block_store_warp_transpose 

A blocked arrangement of items is locally transposed and stored as a warp-striped arrangement of data on continuous memory.

Requirements:
  • The number of threads in the block must be a multiple of the size of hardware warp.
Performance Notes:
  • Performance remains high due to increased memory coalescing, regardless of the number of items per thread.
  • Performance may be better compared to block_store_direct and block_store_vectorize due to reordering on local memory.
default_method 

Defaults to block_store_direct.

Function Documentation

◆ block_load_direct_blocked() [1/3]

template<class InputIterator , class T , unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_INLINE void block_load_direct_blocked ( unsigned int  flat_id,
InputIterator  block_input,
T(&)  items[ItemsPerThread] 
)

Loads data from continuous memory into a blocked arrangement of items across the thread block.

The block arrangement is assumed to be (block-threads * ItemsPerThread) items across a thread block. Each thread uses a flat_id to load a range of ItemsPerThread into items.

Template Parameters
InputIterator- [inferred] an iterator type for input (can be a simple pointer)
T- [inferred] the data type
ItemsPerThread- [inferred] the number of items to be processed by each thread
Parameters
flat_id- a local flat 1D thread id in a block (tile) for the calling thread
block_input- the input iterator from the thread block to load from
items- array that data is loaded to
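
As a rough illustration of the blocked arrangement described above, the following plain C++ function emulates on the host what a single thread would read. It is a sketch under stated assumptions, not rocPRIM code: the name `emulate_load_blocked` is hypothetical, and `std::vector` and `std::array` stand in for device memory and per-thread registers.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side emulation of a blocked load for one thread.
template<class T, std::size_t ItemsPerThread>
void emulate_load_blocked(unsigned int flat_id,
                          const std::vector<T>& block_input,
                          std::array<T, ItemsPerThread>& items)
{
    // Thread flat_id owns the consecutive range starting at flat_id * ItemsPerThread.
    const std::size_t offset = static_cast<std::size_t>(flat_id) * ItemsPerThread;

    // The unguarded variant assumes the whole range exists.
    assert(offset + ItemsPerThread <= block_input.size());

    for (std::size_t i = 0; i < ItemsPerThread; ++i)
    {
        items[i] = block_input[offset + i];
    }
}
```

With 2 items per thread and the input 0, 1, ..., 7, the thread with flat_id 2 ends up holding the elements at positions 4 and 5.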

◆ block_load_direct_blocked() [2/3]

template<class InputIterator , class T , unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_INLINE void block_load_direct_blocked ( unsigned int  flat_id,
InputIterator  block_input,
T(&)  items[ItemsPerThread],
unsigned int  valid 
)

Loads data from continuous memory into a blocked arrangement of items across the thread block, which is guarded by range valid.

The block arrangement is assumed to be (block-threads * ItemsPerThread) items across a thread block. Each thread uses a flat_id to load a range of ItemsPerThread into items.

Template Parameters
InputIterator- [inferred] an iterator type for input (can be a simple pointer)
T- [inferred] the data type
ItemsPerThread- [inferred] the number of items to be processed by each thread
Parameters
flat_id- a local flat 1D thread id in a block (tile) for the calling thread
block_input- the input iterator from the thread block to load from
items- array that data is loaded to
valid- maximum range of valid numbers to load
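
The effect of the valid guard can be sketched on the host as follows. This is an assumption-labelled emulation in plain C++, not the rocPRIM implementation; `emulate_load_blocked_guarded` is a hypothetical name. In this sketch, positions at or beyond valid are simply skipped, so the corresponding entries of items keep whatever value they had before the call.

```cpp
#include <cassert>

// Hypothetical host-side emulation of a guarded blocked load for one thread.
template<class T, unsigned int ItemsPerThread>
void emulate_load_blocked_guarded(unsigned int flat_id,
                                  const T* block_input,
                                  T (&items)[ItemsPerThread],
                                  unsigned int valid)
{
    assert(block_input != nullptr);
    const unsigned int offset = flat_id * ItemsPerThread;
    for (unsigned int i = 0; i < ItemsPerThread; ++i)
    {
        // Only positions inside [0, valid) are read; others leave items[i] untouched.
        if (offset + i < valid)
        {
            items[i] = block_input[offset + i];
        }
    }
}
```

A caller that needs defined values for the skipped entries has to initialise items beforehand or use the overload that takes a fall-back value.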

◆ block_load_direct_blocked() [3/3]

template<class InputIterator , class T , unsigned int ItemsPerThread, class Default >
ROCPRIM_DEVICE ROCPRIM_INLINE void block_load_direct_blocked ( unsigned int  flat_id,
InputIterator  block_input,
T(&)  items[ItemsPerThread],
unsigned int  valid,
Default  out_of_bounds 
)

Loads data from continuous memory into a blocked arrangement of items across the thread block, which is guarded by range valid, with a fall-back value for out-of-bound elements.

The block arrangement is assumed to be (block-threads * ItemsPerThread) items across a thread block. Each thread uses a flat_id to load a range of ItemsPerThread into items.

Template Parameters
InputIterator- [inferred] an iterator type for input (can be a simple pointer)
T- [inferred] the data type
ItemsPerThread- [inferred] the number of items to be processed by each thread
Default- [inferred] the data type of the default value
Parameters
flat_id- a local flat 1D thread id in a block (tile) for the calling thread
block_input- the input iterator from the thread block to load from
items- array that data is loaded to
valid- maximum range of valid numbers to load
out_of_bounds- default value assigned to out-of-bound items
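
A minimal host-side sketch of the fall-back behaviour, in plain C++ and with the hypothetical name `emulate_load_blocked_or_default` (not a rocPRIM function): every entry of items is written, either from the input or from out_of_bounds.

```cpp
#include <cassert>

// Hypothetical host-side emulation of a guarded blocked load with a default value.
template<class T, unsigned int ItemsPerThread, class Default>
void emulate_load_blocked_or_default(unsigned int flat_id,
                                     const T* block_input,
                                     T (&items)[ItemsPerThread],
                                     unsigned int valid,
                                     Default out_of_bounds)
{
    assert(block_input != nullptr);
    const unsigned int offset = flat_id * ItemsPerThread;
    for (unsigned int i = 0; i < ItemsPerThread; ++i)
    {
        // In-range positions are read; out-of-range positions get the fall-back value.
        items[i] = (offset + i < valid) ? block_input[offset + i]
                                        : static_cast<T>(out_of_bounds);
    }
}
```

This form is convenient when the items feed a reduction or scan, because the fall-back can be chosen as the identity of the operation (for example 0 for a sum).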

◆ block_load_direct_blocked_vectorized()

template<class T , class U , unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_INLINE auto block_load_direct_blocked_vectorized ( unsigned int  flat_id,
T *  block_input,
U(&)  items[ItemsPerThread] 
) -> typename std::enable_if<detail::is_vectorizable<T, ItemsPerThread>::value>::type

Loads data from continuous memory into a blocked arrangement of items across the thread block.

The block arrangement is assumed to be (block-threads * ItemsPerThread) items across a thread block. Each thread uses a flat_id to load a range of ItemsPerThread into items.

The input offset (block_input + offset) must be quad-item aligned.

The following conditions will prevent vectorization and switch to default block_load_direct_blocked:

  • ItemsPerThread is odd.
  • The datatype T is not a primitive or a HIP vector type (e.g. int2, int4, etc.).
Template Parameters
T- [inferred] the input data type
U- [inferred] the output data type
ItemsPerThread- [inferred] the number of items to be processed by each thread

The type T must be such that it can be implicitly converted to U.

Parameters
flat_id- a local flat 1D thread id in a block (tile) for the calling thread
block_input- the input iterator from the thread block to load from
items- array that data is loaded to

◆ block_load_direct_striped() [1/3]

template<unsigned int BlockSize, class InputIterator , class T , unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_INLINE void block_load_direct_striped ( unsigned int  flat_id,
InputIterator  block_input,
T(&)  items[ItemsPerThread] 
)

Loads data from continuous memory into a striped arrangement of items across the thread block.

The striped arrangement is assumed to be (BlockSize * ItemsPerThread) items across a thread block. Each thread uses a flat_id to load a range of ItemsPerThread into items.

Template Parameters
BlockSize- the number of threads in a block
InputIterator- [inferred] an iterator type for input (can be a simple pointer)
T- [inferred] the data type
ItemsPerThread- [inferred] the number of items to be processed by each thread
Parameters
flat_id- a local flat 1D thread id in a block (tile) for the calling thread
block_input- the input iterator from the thread block to load from
items- array that data is loaded to
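
For comparison with the blocked variant, the striped arrangement can be emulated on the host like this. The sketch is plain C++ with the hypothetical name `emulate_load_striped`; it is not rocPRIM code and only illustrates which elements a given thread would hold.

```cpp
#include <cassert>

// Hypothetical host-side emulation of a striped load for one thread.
template<unsigned int BlockSize, class T, unsigned int ItemsPerThread>
void emulate_load_striped(unsigned int flat_id,
                          const T* block_input,
                          T (&items)[ItemsPerThread])
{
    assert(flat_id < BlockSize);
    for (unsigned int i = 0; i < ItemsPerThread; ++i)
    {
        // The i-th item of a thread lies i * BlockSize elements after its first one.
        items[i] = block_input[flat_id + i * BlockSize];
    }
}
```

With BlockSize 4, 2 items per thread and the input 0, 1, ..., 7, the thread with flat_id 1 holds the elements at positions 1 and 5.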

◆ block_load_direct_striped() [2/3]

template<unsigned int BlockSize, class InputIterator , class T , unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_INLINE void block_load_direct_striped ( unsigned int  flat_id,
InputIterator  block_input,
T(&)  items[ItemsPerThread],
unsigned int  valid 
)

Loads data from continuous memory into a striped arrangement of items across the thread block, which is guarded by range valid.

The striped arrangement is assumed to be (BlockSize * ItemsPerThread) items across a thread block. Each thread uses a flat_id to load a range of ItemsPerThread into items.

Template Parameters
BlockSize- the number of threads in a block
InputIterator- [inferred] an iterator type for input (can be a simple pointer)
T- [inferred] the data type
ItemsPerThread- [inferred] the number of items to be processed by each thread
Parameters
flat_id- a local flat 1D thread id in a block (tile) for the calling thread
block_input- the input iterator from the thread block to load from
items- array that data is loaded to
valid- maximum range of valid numbers to load

◆ block_load_direct_striped() [3/3]

template<unsigned int BlockSize, class InputIterator , class T , unsigned int ItemsPerThread, class Default >
ROCPRIM_DEVICE ROCPRIM_INLINE void block_load_direct_striped ( unsigned int  flat_id,
InputIterator  block_input,
T(&)  items[ItemsPerThread],
unsigned int  valid,
Default  out_of_bounds 
)

Loads data from continuous memory into a striped arrangement of items across the thread block, which is guarded by range valid, with a fall-back value for out-of-bound elements.

The striped arrangement is assumed to be (BlockSize * ItemsPerThread) items across a thread block. Each thread uses a flat_id to load a range of ItemsPerThread into items.

Template Parameters
BlockSize- the number of threads in a block
InputIterator- [inferred] an iterator type for input (can be a simple pointer)
T- [inferred] the data type
ItemsPerThread- [inferred] the number of items to be processed by each thread
Default- [inferred] the data type of the default value
Parameters
flat_id- a local flat 1D thread id in a block (tile) for the calling thread
block_input- the input iterator from the thread block to load from
items- array that data is loaded to
valid- maximum range of valid numbers to load
out_of_bounds- default value assigned to out-of-bound items

◆ block_load_direct_warp_striped() [1/3]

template<unsigned int WarpSize = device_warp_size(), class InputIterator , class T , unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_INLINE void block_load_direct_warp_striped ( unsigned int  flat_id,
InputIterator  block_input,
T(&)  items[ItemsPerThread] 
)

Loads data from continuous memory into a warp-striped arrangement of items across the thread block.

The warp-striped arrangement is assumed to be (WarpSize * ItemsPerThread) items across a thread block. Each thread uses a flat_id to load a range of ItemsPerThread into items.

  • The number of threads in the block must be a multiple of WarpSize.
  • The default WarpSize is the hardware warp size, which is the optimal value.
  • WarpSize must be a power of two and less than or equal to the hardware warp size.
  • Using a WarpSize smaller than the hardware warp size could result in lower performance.
Template Parameters
WarpSize- [optional] the number of threads in a warp
InputIterator- [inferred] an iterator type for input (can be a simple pointer)
T- [inferred] the data type
ItemsPerThread- [inferred] the number of items to be processed by each thread
Parameters
flat_id- a local flat 1D thread id in a block (tile) for the calling thread
block_input- the input iterator from the thread block to load from
items- array that data is loaded to
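
The warp-striped arrangement applies the striped pattern inside each warp's own tile of WarpSize * ItemsPerThread elements. A host-side sketch in plain C++ follows. The name `emulate_load_warp_striped` is hypothetical, WarpSize has to be given explicitly here because there is no device to query, and the lane is computed from flat_id as an assumption of this sketch.

```cpp
#include <cassert>

// Hypothetical host-side emulation of a warp-striped load for one thread.
template<unsigned int WarpSize, class T, unsigned int ItemsPerThread>
void emulate_load_warp_striped(unsigned int flat_id,
                               const T* block_input,
                               T (&items)[ItemsPerThread])
{
    static_assert(WarpSize > 0 && (WarpSize & (WarpSize - 1)) == 0,
                  "WarpSize must be a power of two");
    assert(block_input != nullptr);

    const unsigned int lane        = flat_id % WarpSize; // position inside the warp
    const unsigned int warp_id     = flat_id / WarpSize; // which warp of the block
    const unsigned int warp_offset = warp_id * WarpSize * ItemsPerThread;

    for (unsigned int i = 0; i < ItemsPerThread; ++i)
    {
        // Striped access with stride WarpSize, restricted to this warp's tile.
        items[i] = block_input[warp_offset + lane + i * WarpSize];
    }
}
```

With WarpSize 2, 2 items per thread and the input 0, 1, ..., 7, the thread with flat_id 3 is lane 1 of warp 1, so it holds the elements at positions 5 and 7.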

◆ block_load_direct_warp_striped() [2/3]

template<unsigned int WarpSize = device_warp_size(), class InputIterator , class T , unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_INLINE void block_load_direct_warp_striped ( unsigned int  flat_id,
InputIterator  block_input,
T(&)  items[ItemsPerThread],
unsigned int  valid 
)

Loads data from continuous memory into a warp-striped arrangement of items across the thread block, which is guarded by range valid.

The warp-striped arrangement is assumed to be (WarpSize * ItemsPerThread) items across a thread block. Each thread uses a flat_id to load a range of ItemsPerThread into items.

  • The number of threads in the block must be a multiple of WarpSize.
  • The default WarpSize is the hardware warp size, which is the optimal value.
  • WarpSize must be a power of two and less than or equal to the hardware warp size.
  • Using a WarpSize smaller than the hardware warp size could result in lower performance.
Template Parameters
WarpSize- [optional] the number of threads in a warp
InputIterator- [inferred] an iterator type for input (can be a simple pointer)
T- [inferred] the data type
ItemsPerThread- [inferred] the number of items to be processed by each thread
Parameters
flat_id- a local flat 1D thread id in a block (tile) for the calling thread
block_input- the input iterator from the thread block to load from
items- array that data is loaded to
valid- maximum range of valid numbers to load

◆ block_load_direct_warp_striped() [3/3]

template<unsigned int WarpSize = device_warp_size(), class InputIterator , class T , unsigned int ItemsPerThread, class Default >
ROCPRIM_DEVICE ROCPRIM_INLINE void block_load_direct_warp_striped ( unsigned int  flat_id,
InputIterator  block_input,
T(&)  items[ItemsPerThread],
unsigned int  valid,
Default  out_of_bounds 
)

Loads data from continuous memory into a warp-striped arrangement of items across the thread block, which is guarded by range valid, with a fall-back value for out-of-bound elements.

The warp-striped arrangement is assumed to be (WarpSize * ItemsPerThread) items across a thread block. Each thread uses a flat_id to load a range of ItemsPerThread into items.

  • The number of threads in the block must be a multiple of WarpSize.
  • The default WarpSize is the hardware warp size, which is the optimal value.
  • WarpSize must be a power of two and less than or equal to the hardware warp size.
  • Using a WarpSize smaller than the hardware warp size could result in lower performance.
Template Parameters
WarpSize- [optional] the number of threads in a warp
InputIterator- [inferred] an iterator type for input (can be a simple pointer)
T- [inferred] the data type
ItemsPerThread- [inferred] the number of items to be processed by each thread
Default- [inferred] the data type of the default value
Parameters
flat_id- a local flat 1D thread id in a block (tile) for the calling thread
block_input- the input iterator from the thread block to load from
items- array that data is loaded to
valid- the number of valid items to load; only items at positions below valid are read
out_of_bounds- default value assigned to out-of-bound items
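
The guarded warp-striped load with a fall-back value can be illustrated with a small host-side emulation. This is a sketch in plain C++, not rocPRIM code: the function name emulate_warp_striped_load and its runtime warp_size and items_per_thread arguments are made up for illustration, and the index formula assumes each warp owns a contiguous tile of (WarpSize * ItemsPerThread) items in which consecutive lanes read consecutive elements.

```cpp
#include <cassert>
#include <vector>

// Host-side emulation (NOT rocPRIM code) of what one thread would receive
// from a guarded warp-striped load with a fall-back value.
// All names here are illustrative assumptions.
std::vector<int> emulate_warp_striped_load(unsigned int            flat_id,
                                           const std::vector<int>& block_input,
                                           unsigned int            warp_size,
                                           unsigned int            items_per_thread,
                                           unsigned int            valid,
                                           int                     out_of_bounds)
{
    // The documentation requires WarpSize to be a power of two.
    assert(warp_size != 0 && (warp_size & (warp_size - 1)) == 0);
    assert(valid <= block_input.size());

    const unsigned int lane    = flat_id % warp_size; // position inside the warp
    const unsigned int warp_id = flat_id / warp_size; // which warp of the block
    // Assumed layout: each warp owns warp_size * items_per_thread elements.
    const unsigned int warp_offset = warp_id * warp_size * items_per_thread;

    // Start from the fall-back value, then overwrite the in-range items.
    std::vector<int> items(items_per_thread, out_of_bounds);
    for(unsigned int i = 0; i < items_per_thread; ++i)
    {
        const unsigned int index = warp_offset + lane + i * warp_size;
        if(index < valid)
        {
            items[i] = block_input[index];
        }
    }
    return items;
}
```

With a 16-element input holding 0..15, warp_size = 4, items_per_thread = 2 and valid = 10, thread 5 (lane 1 of warp 1) would touch positions 9 and 13 under this layout, so it ends up with the value at position 9 followed by the fall-back value.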

◆ block_store_direct_blocked() [1/2]

template<class OutputIterator , class T , unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_INLINE void block_store_direct_blocked ( unsigned int  flat_id,
OutputIterator  block_output,
T(&)  items[ItemsPerThread] 
)

Stores a blocked arrangement of items from across the thread block into a blocked arrangement on continuous memory.

The block arrangement is assumed to be (block-threads * ItemsPerThread) items across a thread block. Each thread uses a flat_id to store a range of ItemsPerThread items to the thread block.

Template Parameters
OutputIterator- [inferred] an iterator type for output (can be a simple pointer)
T- [inferred] the data type
ItemsPerThread- [inferred] the number of items to be processed by each thread
Parameters
flat_id- a local flat 1D thread id in a block (tile) for the calling thread
block_output- the output iterator from the thread block to store to
items- array whose data is stored to the thread block
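
The blocked arrangement described above can be sketched on the host. The following is an illustrative emulation in plain C++, not rocPRIM code: emulate_store_blocked is a hypothetical name, and it assumes that thread flat_id owns the ItemsPerThread consecutive output positions starting at flat_id * ItemsPerThread.

```cpp
#include <cassert>
#include <vector>

// Host-side emulation (NOT rocPRIM code) of the blocked store performed by
// a single thread. The function name is an illustrative assumption.
void emulate_store_blocked(unsigned int            flat_id,
                           std::vector<int>&       block_output,
                           const std::vector<int>& items)
{
    const unsigned int items_per_thread = static_cast<unsigned int>(items.size());
    // Assumed layout: each thread owns a contiguous run of items_per_thread slots.
    const unsigned int offset = flat_id * items_per_thread;
    assert(offset + items_per_thread <= block_output.size());

    for(unsigned int i = 0; i < items_per_thread; ++i)
    {
        block_output[offset + i] = items[i];
    }
}
```

Under this layout, thread 1 with two items writes output positions 2 and 3 and leaves every other position untouched.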

◆ block_store_direct_blocked() [2/2]

template<class OutputIterator , class T , unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_INLINE void block_store_direct_blocked ( unsigned int  flat_id,
OutputIterator  block_output,
T(&)  items[ItemsPerThread],
unsigned int  valid 
)

Stores a blocked arrangement of items from across the thread block into a blocked arrangement on continuous memory, which is guarded by range valid.

The block arrangement is assumed to be (block-threads * ItemsPerThread) items across a thread block. Each thread uses a flat_id to store a range of ItemsPerThread items to the thread block.

Template Parameters
OutputIterator- [inferred] an iterator type for output (can be a simple pointer)
T- [inferred] the data type
ItemsPerThread- [inferred] the number of items to be processed by each thread
Parameters
flat_id- a local flat 1D thread id in a block (tile) for the calling thread
block_output- the output iterator from the thread block to store to
items- array whose data is stored to the thread block
valid- the number of valid items to store; only items at positions below valid are written
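
To show the effect of the valid guard on a partial tile, here is a host-side sketch in plain C++. It is not rocPRIM code: emulate_store_blocked_guarded is a hypothetical helper, it assumes the blocked layout in which thread flat_id starts at flat_id * ItemsPerThread, and the returned write count exists only to make the behaviour observable.

```cpp
#include <cassert>
#include <vector>

// Host-side emulation (NOT rocPRIM code) of a range-guarded blocked store
// for a single thread. Returns how many items the thread actually wrote.
unsigned int emulate_store_blocked_guarded(unsigned int            flat_id,
                                           std::vector<int>&       block_output,
                                           const std::vector<int>& items,
                                           unsigned int            valid)
{
    assert(valid <= block_output.size());
    const unsigned int items_per_thread = static_cast<unsigned int>(items.size());
    const unsigned int offset           = flat_id * items_per_thread;

    unsigned int written = 0;
    for(unsigned int i = 0; i < items_per_thread; ++i)
    {
        // Only positions below `valid` are written; the rest are skipped.
        if(offset + i < valid)
        {
            block_output[offset + i] = items[i];
            ++written;
        }
    }
    return written;
}
```

With two items per thread and valid = 5, thread 2 straddles the boundary and writes only its first item, while thread 3 writes nothing.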

◆ block_store_direct_blocked_vectorized()

template<class T , class U , unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_INLINE auto block_store_direct_blocked_vectorized ( unsigned int  flat_id,
T *  block_output,
U(&)  items[ItemsPerThread] 
) -> typename std::enable_if<detail::is_vectorizable<T, ItemsPerThread>::value>::type

Stores a blocked arrangement of items from across the thread block into a blocked arrangement on continuous memory.

The block arrangement is assumed to be (block-threads * ItemsPerThread) items across a thread block. Each thread uses a flat_id to store a range of ItemsPerThread items to the thread block.

The output offset (block_output + offset) must be quad-item aligned.

The following conditions will prevent vectorization and switch to the default block_store_direct_blocked:

  • ItemsPerThread is odd.
  • The datatype T is not a primitive or a HIP vector type (e.g. int2, int4, etc.)
Template Parameters
T- [inferred] the output data type
U- [inferred] the input data type
ItemsPerThread- [inferred] the number of items to be processed by each thread

The type U must be such that it can be implicitly converted to T.

Parameters
flat_id- a local flat 1D thread id in a block (tile) for the calling thread
block_output- the output pointer from the thread block to store to
items- array whose data is stored to the thread block
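
The two fall-back conditions listed above can be expressed as a compile-time check. The trait below is a hypothetical sketch in plain C++ and is not rocPRIM's detail::is_vectorizable: it covers only primitive (arithmetic) types, because HIP vector types such as int2 or int4 are not available in a host-only example, and it does not model the alignment requirement.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical trait (NOT rocPRIM's detail::is_vectorizable) that mirrors the
// documented conditions: T must be a primitive type and ItemsPerThread must
// be even. HIP vector types are deliberately left out of this host sketch.
template<class T, unsigned int ItemsPerThread>
struct can_vectorize_sketch
    : std::integral_constant<bool,
                             std::is_arithmetic<T>::value && (ItemsPerThread % 2 == 0)>
{
};

// An example of a type that is neither primitive nor a HIP vector type.
struct user_defined_item
{
    int   key;
    float value;
};

static_assert(can_vectorize_sketch<int, 4>::value, "even count of a primitive type");
static_assert(!can_vectorize_sketch<int, 3>::value, "odd ItemsPerThread falls back");
```

A trait of this shape is what an enable_if-based overload, like the one in the signature above, would typically select on.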

◆ block_store_direct_striped() [1/2]

template<unsigned int BlockSize, class OutputIterator , class T , unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_INLINE void block_store_direct_striped ( unsigned int  flat_id,
OutputIterator  block_output,
T(&)  items[ItemsPerThread] 
)

Stores a striped arrangement of items from across the thread block into a blocked arrangement on continuous memory.

The striped arrangement is assumed to be (BlockSize * ItemsPerThread) items across a thread block. Each thread uses a flat_id to store a range of ItemsPerThread items to the thread block.

Template Parameters
BlockSize- the number of threads in a block
OutputIterator- [inferred] an iterator type for output (can be a simple pointer)
T- [inferred] the data type
ItemsPerThread- [inferred] the number of items to be processed by each thread
Parameters
flat_id- a local flat 1D thread id in a block (tile) for the calling thread
block_output- the output iterator from the thread block to store to
items- array whose data is stored to the thread block
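
A host-side sketch of the striped store follows. It is plain C++ rather than rocPRIM code, the name emulate_store_striped is hypothetical, and it assumes the usual striped convention in which item i of thread flat_id lands at position flat_id + i * BlockSize. As in the signature above, BlockSize is passed explicitly and ItemsPerThread is inferred from the array.

```cpp
#include <cassert>

// Host-side emulation (NOT rocPRIM code) of the striped store performed by
// a single thread. BlockSize is explicit; ItemsPerThread is deduced.
template<unsigned int BlockSize, unsigned int ItemsPerThread>
void emulate_store_striped(unsigned int flat_id,
                           int*         block_output,
                           const int (&items)[ItemsPerThread])
{
    assert(flat_id < BlockSize);
    for(unsigned int i = 0; i < ItemsPerThread; ++i)
    {
        // Assumed layout: consecutive items of one thread are BlockSize apart.
        block_output[flat_id + i * BlockSize] = items[i];
    }
}
```

For BlockSize = 4 and two items per thread, thread 1 writes output positions 1 and 5.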

◆ block_store_direct_striped() [2/2]

template<unsigned int BlockSize, class OutputIterator , class T , unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_INLINE void block_store_direct_striped ( unsigned int  flat_id,
OutputIterator  block_output,
T(&)  items[ItemsPerThread],
unsigned int  valid 
)

Stores a striped arrangement of items from across the thread block into a blocked arrangement on continuous memory, which is guarded by range valid.

The striped arrangement is assumed to be (BlockSize * ItemsPerThread) items across a thread block. Each thread uses a flat_id to store a range of ItemsPerThread items to the thread block.

Template Parameters
BlockSize- the number of threads in a block
OutputIterator- [inferred] an iterator type for output (can be a simple pointer)
T- [inferred] the data type
ItemsPerThread- [inferred] the number of items to be processed by each thread
Parameters
flat_id- a local flat 1D thread id in a block (tile) for the calling thread
block_output- the output iterator from the thread block to store to
items- array whose data is stored to the thread block
valid- the number of valid items to store; only items at positions below valid are written

◆ block_store_direct_warp_striped() [1/2]

template<unsigned int WarpSize = device_warp_size(), class OutputIterator , class T , unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_INLINE void block_store_direct_warp_striped ( unsigned int  flat_id,
OutputIterator  block_output,
T(&)  items[ItemsPerThread] 
)

Stores a warp-striped arrangement of items from across the thread block into a blocked arrangement on continuous memory.

The warp-striped arrangement is assumed to be (WarpSize * ItemsPerThread) items across a thread block. Each thread uses a flat_id to store a range of ItemsPerThread items to the thread block.

  • The number of threads in the block must be a multiple of WarpSize.
  • The default WarpSize is the hardware warp size, which is the optimal value.
  • WarpSize must be a power of two and less than or equal to the hardware warp size.
  • Using a WarpSize smaller than the hardware warp size could result in lower performance.
Template Parameters
WarpSize- [optional] the number of threads in a warp
OutputIterator- [inferred] an iterator type for output (can be a simple pointer)
T- [inferred] the data type
ItemsPerThread- [inferred] the number of items to be processed by each thread
Parameters
flat_id- a local flat 1D thread id in a block (tile) for the calling thread
block_output- the output iterator from the thread block to store to
items- array whose data is stored to the thread block

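The requirement that the block size be a multiple of WarpSize can be checked with a host-side emulation of the whole block. This sketch is plain C++, not rocPRIM code: warp_striped_index and count_warp_striped_writes are hypothetical helpers, and the index formula assumes each warp owns a contiguous tile of (WarpSize * ItemsPerThread) output positions.

```cpp
#include <cassert>
#include <vector>

// Assumed warp-striped mapping (NOT taken from rocPRIM source): the output
// position written by `item` of thread `flat_id`.
unsigned int warp_striped_index(unsigned int flat_id,
                                unsigned int item,
                                unsigned int warp_size,
                                unsigned int items_per_thread)
{
    const unsigned int lane    = flat_id % warp_size;
    const unsigned int warp_id = flat_id / warp_size;
    return warp_id * warp_size * items_per_thread + lane + item * warp_size;
}

// Emulates every thread of a block and counts how often each output
// position is written. Each count is 1 when the mapping covers the tile.
std::vector<unsigned int> count_warp_striped_writes(unsigned int block_size,
                                                    unsigned int warp_size,
                                                    unsigned int items_per_thread)
{
    // Documented precondition: block size is a multiple of WarpSize.
    assert(block_size % warp_size == 0);

    std::vector<unsigned int> hits(block_size * items_per_thread, 0);
    for(unsigned int flat_id = 0; flat_id < block_size; ++flat_id)
    {
        for(unsigned int item = 0; item < items_per_thread; ++item)
        {
            ++hits[warp_striped_index(flat_id, item, warp_size, items_per_thread)];
        }
    }
    return hits;
}
```

For a block of 8 threads with warp_size = 4 and 3 items per thread, every one of the 24 output positions is written exactly once under this mapping.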
◆ block_store_direct_warp_striped() [2/2]

template<unsigned int WarpSize = device_warp_size(), class OutputIterator , class T , unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_INLINE void block_store_direct_warp_striped ( unsigned int  flat_id,
OutputIterator  block_output,
T(&)  items[ItemsPerThread],
unsigned int  valid 
)

Stores a warp-striped arrangement of items from across the thread block into a blocked arrangement on continuous memory, which is guarded by range valid.

The warp-striped arrangement is assumed to be (WarpSize * ItemsPerThread) items across a thread block. Each thread uses a flat_id to store a range of ItemsPerThread items to the thread block.

  • The number of threads in the block must be a multiple of WarpSize.
  • The default WarpSize is the hardware warp size, which is the optimal value.
  • WarpSize must be a power of two and less than or equal to the hardware warp size.
  • Using a WarpSize smaller than the hardware warp size could result in lower performance.
Template Parameters
WarpSize- [optional] the number of threads in a warp
OutputIterator- [inferred] an iterator type for output (can be a simple pointer)
T- [inferred] the data type
ItemsPerThread- [inferred] the number of items to be processed by each thread
Parameters
flat_id- a local flat 1D thread id in a block (tile) for the calling thread
block_output- the output iterator from the thread block to store to
items- array whose data is stored to the thread block
valid- the number of valid items to store; only items at positions below valid are written
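
The guarded warp-striped store can be sketched on the host as follows. This is an illustrative emulation in plain C++, not rocPRIM code: emulate_store_warp_striped_guarded is a hypothetical name, WarpSize is passed explicitly instead of defaulting to the hardware warp size, and the index formula assumes each warp owns a contiguous tile of (WarpSize * ItemsPerThread) output positions.

```cpp
#include <cassert>

// Host-side emulation (NOT rocPRIM code) of a range-guarded warp-striped
// store for a single thread. ItemsPerThread is deduced from the array.
template<unsigned int WarpSize, unsigned int ItemsPerThread>
void emulate_store_warp_striped_guarded(unsigned int flat_id,
                                        int*         block_output,
                                        const int (&items)[ItemsPerThread],
                                        unsigned int valid)
{
    static_assert(WarpSize != 0 && (WarpSize & (WarpSize - 1)) == 0,
                  "WarpSize must be a power of two");

    const unsigned int lane        = flat_id % WarpSize;
    const unsigned int warp_id     = flat_id / WarpSize;
    const unsigned int warp_offset = warp_id * WarpSize * ItemsPerThread;

    for(unsigned int i = 0; i < ItemsPerThread; ++i)
    {
        const unsigned int index = warp_offset + lane + i * WarpSize;
        // Positions at or beyond `valid` are left untouched.
        if(index < valid)
        {
            block_output[index] = items[i];
        }
    }
}
```

With WarpSize = 4, two items per thread and valid = 10, thread 5 maps to positions 9 and 13, so only its first item is written.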