Collaboration diagram for Blockmodule:

Namespaces
	detail
	Deprecated: Configuration of device-level scan primitives.

Classes
class	block_adjacent_difference< T, BlockSizeX, BlockSizeY, BlockSizeZ >
	The `block_adjacent_difference` class is a block level parallel primitive which provides methods for applying binary functions to pairs of consecutive items partitioned across a thread block. More...

class	block_discontinuity< T, BlockSizeX, BlockSizeY, BlockSizeZ >
	The `block_discontinuity` class is a block level parallel primitive which provides methods for flagging discontinuities within an ordered set of items partitioned across threads in a block. More...

class	block_exchange< T, BlockSizeX, ItemsPerThread, BlockSizeY, BlockSizeZ >
	The `block_exchange` class is a block level parallel primitive which provides methods for rearranging items partitioned across threads in a block. More...

class	block_histogram< T, BlockSizeX, ItemsPerThread, Bins, Algorithm, BlockSizeY, BlockSizeZ >
	The block_histogram class is a block level parallel primitive which provides methods for constructing block-wide histograms from items partitioned across threads in a block. More...

class	block_load< T, BlockSizeX, ItemsPerThread, Method, BlockSizeY, BlockSizeZ >
	The `block_load` class is a block level parallel primitive which provides methods for loading data from continuous memory into a blocked arrangement of items across the thread block. More...

class	block_load< T, BlockSizeX, ItemsPerThread, block_load_method::block_load_striped, BlockSizeY, BlockSizeZ >

class	block_load< T, BlockSizeX, ItemsPerThread, block_load_method::block_load_vectorize, BlockSizeY, BlockSizeZ >

class	block_load< T, BlockSizeX, ItemsPerThread, block_load_method::block_load_transpose, BlockSizeY, BlockSizeZ >

class	block_load< T, BlockSizeX, ItemsPerThread, block_load_method::block_load_warp_transpose, BlockSizeY, BlockSizeZ >

class	block_radix_rank< BlockSizeX, RadixBits, Algorithm, BlockSizeY, BlockSizeZ >
	The block_radix_rank class is a block level parallel primitive that provides methods for ranking items partitioned across threads in a block. More...

class	block_radix_sort< Key, BlockSizeX, ItemsPerThread, Value, BlockSizeY, BlockSizeZ >
	The block_radix_sort class is a block level parallel primitive which provides methods for sorting items (keys or key-value pairs) partitioned across threads in a block using the radix sort algorithm. More...

class	block_reduce< T, BlockSizeX, Algorithm, BlockSizeY, BlockSizeZ >
	The block_reduce class is a block level parallel primitive which provides methods for performing reduction operations on items partitioned across threads in a block. More...

class	block_scan< T, BlockSizeX, Algorithm, BlockSizeY, BlockSizeZ >
	The block_scan class is a block level parallel primitive which provides methods for performing inclusive and exclusive scan operations of items partitioned across threads in a block. More...

class	block_shuffle< T, BlockSizeX, BlockSizeY, BlockSizeZ >
	The block_shuffle class is a block level parallel primitive which provides methods for shuffling data partitioned across a block. More...

class	block_sort< Key, BlockSizeX, ItemsPerThread, Value, Algorithm, BlockSizeY, BlockSizeZ >
	The block_sort class is a block level parallel primitive which provides methods for sorting items (keys or key-value pairs) partitioned across threads in a block using a comparison-based sort algorithm. More...

class	block_store< T, BlockSizeX, ItemsPerThread, Method, BlockSizeY, BlockSizeZ >
	The `block_store` class is a block level parallel primitive which provides methods for storing an arrangement of items into a blocked/striped arrangement on continuous memory. More...

class	block_store< T, BlockSizeX, ItemsPerThread, block_store_method::block_store_striped, BlockSizeY, BlockSizeZ >

class	block_store< T, BlockSizeX, ItemsPerThread, block_store_method::block_store_vectorize, BlockSizeY, BlockSizeZ >

class	block_store< T, BlockSizeX, ItemsPerThread, block_store_method::block_store_transpose, BlockSizeY, BlockSizeZ >

class	block_store< T, BlockSizeX, ItemsPerThread, block_store_method::block_store_warp_transpose, BlockSizeY, BlockSizeZ >

Enumerations
enum	block_histogram_algorithm { block_histogram_algorithm::using_atomic, block_histogram_algorithm::using_sort, block_histogram_algorithm::default_algorithm = using_atomic }
	Available algorithms for block_histogram primitive. More...

enum	block_load_method { block_load_method::block_load_direct, block_load_method::block_load_striped, block_load_method::block_load_vectorize, block_load_method::block_load_transpose, block_load_method::block_load_warp_transpose, block_load_method::default_method = block_load_direct }
	`block_load_method` enumerates the methods available to load data from continuous memory into a blocked arrangement of items across the thread block. More...

enum	block_radix_rank_algorithm { block_radix_rank_algorithm::basic, block_radix_rank_algorithm::basic_memoize, block_radix_rank_algorithm::match, block_radix_rank_algorithm::default_algorithm = basic }
	Available algorithms for the block_radix_rank primitive. More...

enum	block_reduce_algorithm { block_reduce_algorithm::using_warp_reduce, block_reduce_algorithm::raking_reduce, block_reduce_algorithm::raking_reduce_commutative_only, block_reduce_algorithm::default_algorithm = using_warp_reduce }
	Available algorithms for block_reduce primitive. More...

enum	block_scan_algorithm { block_scan_algorithm::using_warp_scan, block_scan_algorithm::reduce_then_scan, block_scan_algorithm::default_algorithm = using_warp_scan }
	Available algorithms for block_scan primitive. More...

enum	block_sort_algorithm { block_sort_algorithm::bitonic_sort, block_sort_algorithm::merge_sort, block_sort_algorithm::stable_merge_sort, block_sort_algorithm::default_algorithm = bitonic_sort }
	Available algorithms for block_sort primitive. More...

enum	block_store_method { block_store_method::block_store_direct, block_store_method::block_store_striped, block_store_method::block_store_vectorize, block_store_method::block_store_transpose, block_store_method::block_store_warp_transpose, block_store_method::default_method = block_store_direct }
	`block_store_method` enumerates the methods available to store a striped arrangement of items into a blocked/striped arrangement on continuous memory. More...

Functions
template<class InputIterator , class T , unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_INLINE void	block_load_direct_blocked (unsigned int flat_id, InputIterator block_input, T(&items)[ItemsPerThread])
	Loads data from continuous memory into a blocked arrangement of items across the thread block. More...

template<class InputIterator , class T , unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_INLINE void	block_load_direct_blocked (unsigned int flat_id, InputIterator block_input, T(&items)[ItemsPerThread], unsigned int valid)
	Loads data from continuous memory into a blocked arrangement of items across the thread block, which is guarded by range `valid`. More...

template<class InputIterator , class T , unsigned int ItemsPerThread, class Default >
ROCPRIM_DEVICE ROCPRIM_INLINE void	block_load_direct_blocked (unsigned int flat_id, InputIterator block_input, T(&items)[ItemsPerThread], unsigned int valid, Default out_of_bounds)
	Loads data from continuous memory into a blocked arrangement of items across the thread block, which is guarded by range with a fall-back value for out-of-bound elements. More...

template<class T , class U , unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_INLINE auto	block_load_direct_blocked_vectorized (unsigned int flat_id, T *block_input, U(&items)[ItemsPerThread]) -> typename std::enable_if< detail::is_vectorizable< T, ItemsPerThread >::value >::type
	Loads data from continuous memory into a blocked arrangement of items across the thread block. More...

template<unsigned int BlockSize, class InputIterator , class T , unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_INLINE void	block_load_direct_striped (unsigned int flat_id, InputIterator block_input, T(&items)[ItemsPerThread])
	Loads data from continuous memory into a striped arrangement of items across the thread block. More...

template<unsigned int BlockSize, class InputIterator , class T , unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_INLINE void	block_load_direct_striped (unsigned int flat_id, InputIterator block_input, T(&items)[ItemsPerThread], unsigned int valid)
	Loads data from continuous memory into a striped arrangement of items across the thread block, which is guarded by range `valid`. More...

template<unsigned int BlockSize, class InputIterator , class T , unsigned int ItemsPerThread, class Default >
ROCPRIM_DEVICE ROCPRIM_INLINE void	block_load_direct_striped (unsigned int flat_id, InputIterator block_input, T(&items)[ItemsPerThread], unsigned int valid, Default out_of_bounds)
	Loads data from continuous memory into a striped arrangement of items across the thread block, which is guarded by range with a fall-back value for out-of-bound elements. More...

template<unsigned int WarpSize = device_warp_size(), class InputIterator , class T , unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_INLINE void	block_load_direct_warp_striped (unsigned int flat_id, InputIterator block_input, T(&items)[ItemsPerThread])
	Loads data from continuous memory into a warp-striped arrangement of items across the thread block. More...

template<unsigned int WarpSize = device_warp_size(), class InputIterator , class T , unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_INLINE void	block_load_direct_warp_striped (unsigned int flat_id, InputIterator block_input, T(&items)[ItemsPerThread], unsigned int valid)
	Loads data from continuous memory into a warp-striped arrangement of items across the thread block, which is guarded by range `valid`. More...

template<unsigned int WarpSize = device_warp_size(), class InputIterator , class T , unsigned int ItemsPerThread, class Default >
ROCPRIM_DEVICE ROCPRIM_INLINE void	block_load_direct_warp_striped (unsigned int flat_id, InputIterator block_input, T(&items)[ItemsPerThread], unsigned int valid, Default out_of_bounds)
	Loads data from continuous memory into a warp-striped arrangement of items across the thread block, which is guarded by range with a fall-back value for out-of-bound elements. More...

template<class OutputIterator , class T , unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_INLINE void	block_store_direct_blocked (unsigned int flat_id, OutputIterator block_output, T(&items)[ItemsPerThread])
	Stores a blocked arrangement of items from across the thread block into a blocked arrangement on continuous memory. More...

template<class OutputIterator , class T , unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_INLINE void	block_store_direct_blocked (unsigned int flat_id, OutputIterator block_output, T(&items)[ItemsPerThread], unsigned int valid)
	Stores a blocked arrangement of items from across the thread block into a blocked arrangement on continuous memory, which is guarded by range `valid`. More...

template<class T , class U , unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_INLINE auto	block_store_direct_blocked_vectorized (unsigned int flat_id, T *block_output, U(&items)[ItemsPerThread]) -> typename std::enable_if< detail::is_vectorizable< T, ItemsPerThread >::value >::type
	Stores a blocked arrangement of items from across the thread block into a blocked arrangement on continuous memory. More...

template<unsigned int BlockSize, class OutputIterator , class T , unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_INLINE void	block_store_direct_striped (unsigned int flat_id, OutputIterator block_output, T(&items)[ItemsPerThread])
	Stores a striped arrangement of items from across the thread block into a blocked arrangement on continuous memory. More...

template<unsigned int BlockSize, class OutputIterator , class T , unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_INLINE void	block_store_direct_striped (unsigned int flat_id, OutputIterator block_output, T(&items)[ItemsPerThread], unsigned int valid)
	Stores a striped arrangement of items from across the thread block into a blocked arrangement on continuous memory, which is guarded by range `valid`. More...

template<unsigned int WarpSize = device_warp_size(), class OutputIterator , class T , unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_INLINE void	block_store_direct_warp_striped (unsigned int flat_id, OutputIterator block_output, T(&items)[ItemsPerThread])
	Stores a warp-striped arrangement of items from across the thread block into a blocked arrangement on continuous memory. More...

template<unsigned int WarpSize = device_warp_size(), class OutputIterator , class T , unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_INLINE void	block_store_direct_warp_striped (unsigned int flat_id, OutputIterator block_output, T(&items)[ItemsPerThread], unsigned int valid)
	Stores a warp-striped arrangement of items from across the thread block into a blocked arrangement on continuous memory, which is guarded by range `valid`. More...

Detailed Description

Enumeration Type Documentation

◆ block_histogram_algorithm

enum block_histogram_algorithm

strong

Available algorithms for block_histogram primitive.

Enumerator

using_atomic

Atomic addition is used to update bin count directly.

Performance Notes:

Performance is dependent on hardware implementation of atomic addition.
Performance may decrease for non-uniform random input distributions where many concurrent updates may be made to the same bin counter.

using_sort

A two-phase operation is used:

Data is sorted using radix-sort.
"Runs" of same-valued keys are detected using discontinuity; run-lengths are bin counts.

Performance Notes:

Performance is consistent regardless of sample bin distribution.
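
The two phases described above can be illustrated on the host. The following is a minimal sketch in plain C++, not the rocPRIM device implementation: `std::sort` stands in for the block-wide radix sort, and the function name `histogram_by_sort` and its parameters are hypothetical. It assumes every sample is already a bin index smaller than `bins`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side illustration only (hypothetical helper, not part of rocPRIM).
// Counts how often each bin index occurs by sorting the samples and
// measuring the length of each run of equal values.
std::vector<unsigned int> histogram_by_sort(std::vector<unsigned int> samples,
                                            unsigned int bins)
{
    std::vector<unsigned int> counts(bins, 0u);

    // Phase 1: sort so that equal bin indices become adjacent.
    std::sort(samples.begin(), samples.end());

    // Assumption of this sketch: every sample is a valid bin index.
    assert(samples.empty() || samples.back() < bins);

    // Phase 2: a discontinuity (samples[i] != samples[i - 1]) ends a run;
    // the length of the run is the count for that bin.
    std::size_t run_start = 0;
    for(std::size_t i = 1; i <= samples.size(); ++i)
    {
        const bool run_ends = (i == samples.size()) || (samples[i] != samples[i - 1]);
        if(run_ends)
        {
            counts[samples[run_start]] = static_cast<unsigned int>(i - run_start);
            run_start = i;
        }
    }
    return counts;
}
```

Because the work is one sort followed by one linear pass, it does not depend on how many samples fall into the same bin, which matches the performance note above.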

default_algorithm

Default block_histogram algorithm.

◆ block_load_method

enum block_load_method

strong

block_load_method enumerates the methods available to load data from continuous memory into a blocked arrangement of items across the thread block.

Enumerator
block_load_direct	Data from continuous memory is loaded into a blocked arrangement of items. Performance Notes: Performance decreases with increasing number of items per thread (stride between reads), because of reduced memory coalescing.
block_load_striped	A striped arrangement of data is read directly from memory.
block_load_vectorize	Data from continuous memory is loaded into a blocked arrangement of items using vectorization as an optimization. Performance Notes: Performance remains high due to increased memory coalescing, provided that vectorization requirements are fulfilled. Otherwise, performance will default to `block_load_direct`. Requirements: The input offset (`block_input`) must be quad-item aligned. The following conditions will prevent vectorization and switch to default `block_load_direct`: `ItemsPerThread` is odd; the datatype `T` is not a primitive or a HIP vector type (e.g. int2, int4, etc.).
block_load_transpose	A striped arrangement of data from continuous memory is locally transposed into a blocked arrangement of items. Performance Notes: Performance remains high due to increased memory coalescing, regardless of the number of items per thread. Performance may be better compared to `block_load_direct` and `block_load_vectorize` due to reordering on local memory.
block_load_warp_transpose	A warp-striped arrangement of data from continuous memory is locally transposed into a blocked arrangement of items. Requirements: The number of threads in the block must be a multiple of the size of hardware warp. Performance Notes: Performance remains high due to increased memory coalescing, regardless of the number of items per thread. Performance may be better compared to `block_load_direct` and `block_load_vectorize` due to reordering on local memory.
default_method	Defaults to `block_load_direct`.

◆ block_radix_rank_algorithm

enum block_radix_rank_algorithm

strong

Available algorithms for the block_radix_rank primitive.

Enumerator
basic	The basic block radix rank algorithm. Keys and ranks are assumed in blocked order.
basic_memoize	The basic block radix rank algorithm, configured to memoize intermediate values. This trades register usage for fewer shared memory operations. Keys and ranks are assumed in blocked order.
match	Warp-based radix ranking algorithm. Keys and ranks are assumed in warp-striped order for this algorithm.
default_algorithm	The default radix ranking algorithm.

◆ block_reduce_algorithm

enum block_reduce_algorithm

strong

Available algorithms for block_reduce primitive.

Enumerator
using_warp_reduce	A warp_reduce based algorithm.
raking_reduce	An algorithm which limits calculations to a single hardware warp.
raking_reduce_commutative_only	A raking reduce algorithm that supports only commutative operators.
default_algorithm	Default block_reduce algorithm.

◆ block_scan_algorithm

enum block_scan_algorithm

strong

Available algorithms for block_scan primitive.

Enumerator
using_warp_scan	A warp_scan based algorithm.
reduce_then_scan	An algorithm which limits calculations to a single hardware warp.
default_algorithm	Default block_scan algorithm.

◆ block_sort_algorithm

enum block_sort_algorithm

strong

Available algorithms for block_sort primitive.

Enumerator
bitonic_sort	A bitonic sort based algorithm.
merge_sort	A merge sort based algorithm.
stable_merge_sort	A merge sort based algorithm which sorts stably.
default_algorithm	Default block_sort algorithm.

◆ block_store_method

enum block_store_method

strong

block_store_method enumerates the methods available to store a striped arrangement of items into a blocked/striped arrangement on continuous memory.

Enumerator
block_store_direct	A blocked arrangement of items is stored into a blocked arrangement on continuous memory. Performance Notes: Performance decreases with increasing number of items per thread (stride between writes), because of reduced memory coalescing.
block_store_striped	A striped arrangement of items is stored into a blocked arrangement on continuous memory.
block_store_vectorize	A blocked arrangement of items is stored into a blocked arrangement on continuous memory using vectorization as an optimization. Performance Notes: Performance remains high due to increased memory coalescing, provided that vectorization requirements are fulfilled. Otherwise, performance will default to `block_store_direct`. Requirements: The output offset (`block_output`) must be quad-item aligned. The following conditions will prevent vectorization and switch to default `block_store_direct`: `ItemsPerThread` is odd; the datatype `T` is not a primitive or a HIP vector type (e.g. int2, int4, etc.).
block_store_transpose	A blocked arrangement of items is locally transposed and stored as a striped arrangement of data on continuous memory. Performance Notes: Performance remains high due to increased memory coalescing, regardless of the number of items per thread. Performance may be better compared to `block_store_direct` and `block_store_vectorize` due to reordering on local memory.
block_store_warp_transpose	A blocked arrangement of items is locally transposed and stored as a warp-striped arrangement of data on continuous memory. Requirements: The number of threads in the block must be a multiple of the size of hardware warp. Performance Notes: Performance remains high due to increased memory coalescing, regardless of the number of items per thread. Performance may be better compared to `block_store_direct` and `block_store_vectorize` due to reordering on local memory.
default_method	Defaults to `block_store_direct`.

Function Documentation

◆ block_load_direct_blocked() [1/3]

template<class InputIterator , class T , unsigned int ItemsPerThread>

ROCPRIM_DEVICE ROCPRIM_INLINE void block_load_direct_blocked	(	unsigned int	flat_id,
		InputIterator	block_input,
		T(&)	items[ItemsPerThread]
	)

Loads data from continuous memory into a blocked arrangement of items across the thread block.

The block arrangement is assumed to be (block-threads * ItemsPerThread) items across a thread block. Each thread uses a flat_id to load a range of ItemsPerThread into items.

Template Parameters

InputIterator	- [inferred] an iterator type for input (can be a simple pointer)
T	- [inferred] the data type
ItemsPerThread	- [inferred] the number of items to be processed by each thread

Parameters

flat_id	- a local flat 1D thread id in a block (tile) for the calling thread
block_input	- the input iterator from the thread block to load from
items	- array that data is loaded to
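
To make the blocked arrangement concrete, the index mapping can be emulated on the host. This is only a sketch under stated assumptions: `emulate_load_blocked` is a hypothetical name, a `std::vector` stands in for the block's input range, and one function call plays the role of one thread.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side emulation of the blocked arrangement (hypothetical helper,
// not the rocPRIM device code). Thread `flat_id` owns the consecutive range
// [flat_id * ItemsPerThread, (flat_id + 1) * ItemsPerThread) of the input.
template<class T, unsigned int ItemsPerThread>
void emulate_load_blocked(unsigned int          flat_id,
                          const std::vector<T>& block_input,
                          T (&items)[ItemsPerThread])
{
    const std::size_t offset = static_cast<std::size_t>(flat_id) * ItemsPerThread;

    // The unguarded overload assumes the whole range is readable.
    assert(offset + ItemsPerThread <= block_input.size());

    for(unsigned int i = 0; i < ItemsPerThread; ++i)
    {
        items[i] = block_input[offset + i];
    }
}
```

With an input of `0, 1, ..., 7` and two items per thread, the call for thread 1 fills `items` with `2, 3`. Neighbouring threads read addresses that are `ItemsPerThread` elements apart, which is why coalescing degrades as `ItemsPerThread` grows.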

◆ block_load_direct_blocked() [2/3]

template<class InputIterator , class T , unsigned int ItemsPerThread>

ROCPRIM_DEVICE ROCPRIM_INLINE void block_load_direct_blocked	(	unsigned int	flat_id,
		InputIterator	block_input,
		T(&)	items[ItemsPerThread],
		unsigned int	valid
	)

Loads data from continuous memory into a blocked arrangement of items across the thread block, which is guarded by range valid.

The block arrangement is assumed to be (block-threads * ItemsPerThread) items across a thread block. Each thread uses a flat_id to load a range of ItemsPerThread into items.

Template Parameters

InputIterator	- [inferred] an iterator type for input (can be a simple pointer)
T	- [inferred] the data type
ItemsPerThread	- [inferred] the number of items to be processed by each thread

Parameters

flat_id	- a local flat 1D thread id in a block (tile) for the calling thread
block_input	- the input iterator from the thread block to load from
items	- array that data is loaded to
valid	- maximum range of valid numbers to load

◆ block_load_direct_blocked() [3/3]

template<class InputIterator , class T , unsigned int ItemsPerThread, class Default >

ROCPRIM_DEVICE ROCPRIM_INLINE void block_load_direct_blocked	(	unsigned int	flat_id,
		InputIterator	block_input,
		T(&)	items[ItemsPerThread],
		unsigned int	valid,
		Default	out_of_bounds
	)

Loads data from continuous memory into a blocked arrangement of items across the thread block, which is guarded by range with a fall-back value for out-of-bound elements.

The block arrangement is assumed to be (block-threads * ItemsPerThread) items across a thread block. Each thread uses a flat_id to load a range of ItemsPerThread into items.

Template Parameters

InputIterator	- [inferred] an iterator type for input (can be a simple pointer)
T	- [inferred] the data type
ItemsPerThread	- [inferred] the number of items to be processed by each thread
Default	- [inferred] The data type of the default value

Parameters

flat_id	- a local flat 1D thread id in a block (tile) for the calling thread
block_input	- the input iterator from the thread block to load from
items	- array that data is loaded to
valid	- maximum range of valid numbers to load
out_of_bounds	- default value assigned to out-of-bound items
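
The effect of `valid` and `out_of_bounds` can be sketched the same way. The helper below is a hypothetical host-side emulation, not the library code; it assumes that `valid` counts the readable items from the start of the block's input and that every slot at or beyond `valid` receives the fall-back value.

```cpp
#include <cassert>
#include <vector>

// Host-side emulation of the guarded blocked load with a fall-back value
// (hypothetical helper, not the rocPRIM device code).
template<class T, unsigned int ItemsPerThread, class Default>
void emulate_load_blocked_guarded(unsigned int          flat_id,
                                  const std::vector<T>& block_input,
                                  T (&items)[ItemsPerThread],
                                  unsigned int          valid,
                                  Default               out_of_bounds)
{
    // Assumption of this sketch: `valid` never exceeds the readable input.
    assert(valid <= block_input.size());

    const unsigned int offset = flat_id * ItemsPerThread;
    for(unsigned int i = 0; i < ItemsPerThread; ++i)
    {
        const unsigned int index = offset + i;
        // In range: read the input. Out of range: use the fall-back value.
        items[i] = (index < valid) ? block_input[index]
                                   : static_cast<T>(out_of_bounds);
    }
}
```

For five valid items and two items per thread, thread 2 receives the last input element followed by the fall-back value, and thread 3 receives only fall-back values.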

◆ block_load_direct_blocked_vectorized()

template<class T , class U , unsigned int ItemsPerThread>

ROCPRIM_DEVICE ROCPRIM_INLINE auto block_load_direct_blocked_vectorized	(	unsigned int	flat_id,
		T *	block_input,
		U(&)	items[ItemsPerThread]
	)		-> typename std::enable_if<detail::is_vectorizable<T, ItemsPerThread>::value>::type

Loads data from continuous memory into a blocked arrangement of items across the thread block.

The block arrangement is assumed to be (block-threads * ItemsPerThread) items across a thread block. Each thread uses a flat_id to load a range of ItemsPerThread into items.

The input offset (block_input + offset) must be quad-item aligned.

The following conditions will prevent vectorization and switch to default block_load_direct_blocked:

ItemsPerThread is odd.
The datatype T is not a primitive or a HIP vector type (e.g. int2, int4, etc.).

Template Parameters

T	- [inferred] the input data type
U	- [inferred] the output data type
ItemsPerThread	- [inferred] the number of items to be processed by each thread

The type T must be such that it can be implicitly converted to U.

Parameters

flat_id	- a local flat 1D thread id in a block (tile) for the calling thread
block_input	- the input iterator from the thread block to load from
items	- array that data is loaded to
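
The two conditions listed above can be written as a compile-time predicate. The trait below is a hypothetical stand-in and not the real `detail::is_vectorizable`, whose definition is not shown on this page. It assumes that "primitive" means an arithmetic type and, being host-only C++, it does not recognise HIP vector types such as int2 or int4.

```cpp
#include <type_traits>

// Hypothetical sketch of the documented conditions; the real
// detail::is_vectorizable trait may differ.
// Vectorization is considered possible when T is an arithmetic type
// and ItemsPerThread is even.
template<class T, unsigned int ItemsPerThread>
struct can_vectorize_sketch
    : std::integral_constant<bool,
                             std::is_arithmetic<T>::value && (ItemsPerThread % 2 == 0)>
{};

// An even item count with a primitive type passes the check ...
static_assert(can_vectorize_sketch<int, 4>::value, "int x 4 should qualify");
// ... while an odd item count does not.
static_assert(!can_vectorize_sketch<int, 3>::value, "odd ItemsPerThread should not qualify");
```

A predicate of this kind is what the `std::enable_if` in the signature above selects on, so that the vectorized overload only participates when the conditions hold.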

◆ block_load_direct_striped() [1/3]

template<unsigned int BlockSize, class InputIterator , class T , unsigned int ItemsPerThread>

ROCPRIM_DEVICE ROCPRIM_INLINE void block_load_direct_striped	(	unsigned int	flat_id,
		InputIterator	block_input,
		T(&)	items[ItemsPerThread]
	)

Loads data from continuous memory into a striped arrangement of items across the thread block.

The striped arrangement is assumed to be (BlockSize * ItemsPerThread) items across a thread block. Each thread uses a flat_id to load a range of ItemsPerThread into items.

Template Parameters

BlockSize	- the number of threads in a block
InputIterator	- [inferred] an iterator type for input (can be a simple pointer)
T	- [inferred] the data type
ItemsPerThread	- [inferred] the number of items to be processed by each thread

Parameters

flat_id	- a local flat 1D thread id in a block (tile) for the calling thread
block_input	- the input iterator from the thread block to load from
items	- array that data is loaded to
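
For comparison with the blocked arrangement, the striped index mapping can be emulated on the host as follows. This is a sketch with a hypothetical name (`emulate_load_striped`), using a `std::vector` in place of the block's input range; it is not the rocPRIM device code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side emulation of the striped arrangement (hypothetical helper).
// Item i of thread `flat_id` comes from position flat_id + i * BlockSize,
// so consecutive threads read consecutive addresses for each i.
template<unsigned int BlockSize, class T, unsigned int ItemsPerThread>
void emulate_load_striped(unsigned int          flat_id,
                          const std::vector<T>& block_input,
                          T (&items)[ItemsPerThread])
{
    assert(flat_id < BlockSize);
    // The unguarded overload assumes BlockSize * ItemsPerThread readable items.
    assert(block_input.size() >= static_cast<std::size_t>(BlockSize) * ItemsPerThread);

    for(unsigned int i = 0; i < ItemsPerThread; ++i)
    {
        items[i] = block_input[flat_id + i * BlockSize];
    }
}
```

With `BlockSize` 4, two items per thread and an input of `0, 1, ..., 7`, thread 1 receives `1, 5`, whereas in the blocked arrangement it would receive `2, 3`.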

◆ block_load_direct_striped() [2/3]

template<unsigned int BlockSize, class InputIterator , class T , unsigned int ItemsPerThread>

ROCPRIM_DEVICE ROCPRIM_INLINE void block_load_direct_striped	(	unsigned int	flat_id,
		InputIterator	block_input,
		T(&)	items[ItemsPerThread],
		unsigned int	valid
	)

Loads data from continuous memory into a striped arrangement of items across the thread block, which is guarded by range valid.

The striped arrangement is assumed to be (BlockSize * ItemsPerThread) items across a thread block. Each thread uses a flat_id to load a range of ItemsPerThread into items.

Template Parameters

BlockSize	- the number of threads in a block
InputIterator	- [inferred] an iterator type for input (can be a simple pointer)
T	- [inferred] the data type
ItemsPerThread	- [inferred] the number of items to be processed by each thread

Parameters

flat_id	- a local flat 1D thread id in a block (tile) for the calling thread
block_input	- the input iterator from the thread block to load from
items	- array that data is loaded to
valid	- maximum range of valid numbers to load

◆ block_load_direct_striped() [3/3]

template<unsigned int BlockSize, class InputIterator , class T , unsigned int ItemsPerThread, class Default >

ROCPRIM_DEVICE ROCPRIM_INLINE void block_load_direct_striped	(	unsigned int	flat_id,
		InputIterator	block_input,
		T(&)	items[ItemsPerThread],
		unsigned int	valid,
		Default	out_of_bounds
	)

Loads data from continuous memory into a striped arrangement of items across the thread block, which is guarded by range with a fall-back value for out-of-bound elements.

The striped arrangement is assumed to be (BlockSize * ItemsPerThread) items across a thread block. Each thread uses a flat_id to load a range of ItemsPerThread into items.

Template Parameters

BlockSize	- the number of threads in a block
InputIterator	- [inferred] an iterator type for input (can be a simple pointer)
T	- [inferred] the data type
ItemsPerThread	- [inferred] the number of items to be processed by each thread
Default	- [inferred] The data type of the default value

Parameters

flat_id	- a local flat 1D thread id in a block (tile) for the calling thread
block_input	- the input iterator from the thread block to load from
items	- array that data is loaded to
valid	- maximum range of valid numbers to load
out_of_bounds	- default value assigned to out-of-bound items

◆ block_load_direct_warp_striped() [1/3]

template<unsigned int WarpSize = device_warp_size(), class InputIterator , class T , unsigned int ItemsPerThread>

ROCPRIM_DEVICE ROCPRIM_INLINE void block_load_direct_warp_striped	(	unsigned int	flat_id,
		InputIterator	block_input,
		T(&)	items[ItemsPerThread]
	)

Loads data from continuous memory into a warp-striped arrangement of items across the thread block.

The warp-striped arrangement is assumed to be (WarpSize * ItemsPerThread) items across a thread block. Each thread uses a flat_id to load a range of ItemsPerThread into items.

The number of threads in the block must be a multiple of WarpSize.
The default WarpSize is the hardware warp size, which is the optimal value.
WarpSize must be a power of two and less than or equal to the hardware warp size.
Using a WarpSize smaller than the hardware warp size could result in lower performance.

Template Parameters

WarpSize	- [optional] the number of threads in a warp
InputIterator	- [inferred] an iterator type for input (can be a simple pointer)
T	- [inferred] the data type
ItemsPerThread	- [inferred] the number of items to be processed by each thread

Parameters

flat_id	- a local flat 1D thread id in a block (tile) for the calling thread
block_input	- the input iterator from the thread block to load from
items	- array that data is loaded to
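
The warp-striped arrangement applies the striped pattern separately within each warp's segment of the input. The host-side sketch below illustrates the index arithmetic; `emulate_load_warp_striped` is a hypothetical name, the lane is derived here as `flat_id % WarpSize`, and `WarpSize` must be passed explicitly because no device is available to supply a default.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side emulation of the warp-striped arrangement (hypothetical helper,
// not the rocPRIM device code). Each warp owns a segment of
// WarpSize * ItemsPerThread items and reads it with a stride of WarpSize.
template<unsigned int WarpSize, class T, unsigned int ItemsPerThread>
void emulate_load_warp_striped(unsigned int          flat_id,
                               const std::vector<T>& block_input,
                               T (&items)[ItemsPerThread])
{
    static_assert(WarpSize != 0 && (WarpSize & (WarpSize - 1)) == 0,
                  "WarpSize must be a power of two");

    const unsigned int lane        = flat_id % WarpSize; // position within the warp
    const unsigned int warp_id     = flat_id / WarpSize; // which warp in the block
    const unsigned int warp_offset = warp_id * WarpSize * ItemsPerThread;

    // The unguarded overload assumes the warp's whole segment is readable.
    assert(static_cast<std::size_t>(warp_offset) + WarpSize * ItemsPerThread
           <= block_input.size());

    for(unsigned int i = 0; i < ItemsPerThread; ++i)
    {
        items[i] = block_input[warp_offset + lane + i * WarpSize];
    }
}
```

With `WarpSize` 2, two items per thread and an input of `0, 1, ..., 7`, thread 3 belongs to warp 1 at lane 1 and receives `5, 7`.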

◆ block_load_direct_warp_striped() [2/3]

template<unsigned int WarpSize = device_warp_size(), class InputIterator , class T , unsigned int ItemsPerThread>

ROCPRIM_DEVICE ROCPRIM_INLINE void block_load_direct_warp_striped	(	unsigned int	flat_id,
		InputIterator	block_input,
		T(&)	items[ItemsPerThread],
		unsigned int	valid
	)

Loads data from continuous memory into a warp-striped arrangement of items across the thread block, which is guarded by range valid.

The warp-striped arrangement is assumed to be (WarpSize * ItemsPerThread) items across a thread block. Each thread uses a flat_id to load a range of ItemsPerThread into items.

The number of threads in the block must be a multiple of WarpSize.
The default WarpSize is the hardware warp size, which is the optimal value.
WarpSize must be a power of two and less than or equal to the hardware warp size.
Using a WarpSize smaller than the hardware warp size could result in lower performance.

Template Parameters

WarpSize	- [optional] the number of threads in a warp
InputIterator	- [inferred] an iterator type for input (can be a simple pointer)
T	- [inferred] the data type
ItemsPerThread	- [inferred] the number of items to be processed by each thread

Parameters

flat_id	- a local flat 1D thread id in a block (tile) for the calling thread
block_input	- the input iterator from the thread block to load from
items	- array that data is loaded to
valid	- maximum range of valid numbers to load

◆ block_load_direct_warp_striped() [3/3]

template<unsigned int WarpSize = device_warp_size(), class InputIterator , class T , unsigned int ItemsPerThread, class Default >

ROCPRIM_DEVICE ROCPRIM_INLINE void block_load_direct_warp_striped	(	unsigned int	flat_id,
		InputIterator	block_input,
		T(&)	items[ItemsPerThread],
		unsigned int	valid,
		Default	out_of_bounds
	)

Loads data from continuous memory into a warp-striped arrangement of items across the thread block, which is guarded by range with a fall-back value for out-of-bound elements.

The warp-striped arrangement is assumed to be (WarpSize * ItemsPerThread) items across a thread block. Each thread uses a flat_id to load a range of ItemsPerThread into items.

The number of threads in the block must be a multiple of WarpSize.
The default WarpSize is the hardware warp size, which is the optimal value.
WarpSize must be a power of two and less than or equal to the hardware warp size.
Using a WarpSize smaller than the hardware warp size could result in lower performance.

Template Parameters

WarpSize	- [optional] the number of threads in a warp
InputIterator	- [inferred] an iterator type for input (can be a simple pointer)
T	- [inferred] the data type
ItemsPerThread	- [inferred] the number of items to be processed by each thread
Default	- [inferred] The data type of the default value

Parameters

flat_id	- a local flat 1D thread id in a block (tile) for the calling thread
block_input	- the input iterator from the thread block to load from
items	- array that data is loaded to
valid	- maximum number of valid items to load; positions at or beyond valid receive out_of_bounds
out_of_bounds	- default value assigned to out-of-bound items
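
The index arithmetic behind the guarded warp-striped load can be illustrated on the host. The following is a minimal sketch, not the rocPRIM implementation: `emulate_load_warp_striped` is a hypothetical helper, it assumes the lane is `flat_id % WarpSize`, and it assumes each warp owns a segment of `WarpSize * ItemsPerThread` consecutive elements.

```cpp
#include <array>
#include <vector>

// Hypothetical host-side emulation (not part of rocPRIM) of a warp-striped,
// range-guarded load with a fall-back value, for a single thread.
template<unsigned int WarpSize, unsigned int ItemsPerThread, class T>
std::array<T, ItemsPerThread> emulate_load_warp_striped(unsigned int          flat_id,
                                                        const std::vector<T>& block_input,
                                                        unsigned int          valid,
                                                        T                     out_of_bounds)
{
    std::array<T, ItemsPerThread> items;
    items.fill(out_of_bounds); // every slot starts with the fall-back value

    // Assumed mapping: each warp owns WarpSize * ItemsPerThread elements.
    const unsigned int lane        = flat_id % WarpSize;
    const unsigned int warp_id     = flat_id / WarpSize;
    const unsigned int warp_offset = warp_id * WarpSize * ItemsPerThread;

    for(unsigned int i = 0; i < ItemsPerThread; ++i)
    {
        const unsigned int index = warp_offset + lane + i * WarpSize;
        if(index < valid) // guard: only positions below valid are read
        {
            items[i] = block_input[index];
        }
    }
    return items;
}
```

With WarpSize = 4, ItemsPerThread = 2 and valid = 12, the thread with flat_id 5 would read position 9 and keep the fall-back value for position 13.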

◆ block_store_direct_blocked() [1/2]

template<class OutputIterator , class T , unsigned int ItemsPerThread>

ROCPRIM_DEVICE ROCPRIM_INLINE void block_store_direct_blocked	(	unsigned int	flat_id,
		OutputIterator	block_output,
		T(&)	items[ItemsPerThread]
	)

Stores a blocked arrangement of items from across the thread block into a blocked arrangement in contiguous memory.

The blocked arrangement is assumed to be (block-threads * ItemsPerThread) items across a thread block. Each thread uses its flat_id to store a range of ItemsPerThread items to block_output.

Template Parameters

OutputIterator	- [inferred] an iterator type for output (can be a simple pointer)
T	- [inferred] the data type
ItemsPerThread	- [inferred] the number of items to be processed by each thread

Parameters

flat_id	- a local flat 1D thread id in a block (tile) for the calling thread
block_output	- the output iterator of the thread block to store to
items	- array of items that is stored to block_output
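
To make the blocked arrangement concrete, here is a small host-side sketch of where a single thread's items land. It is an illustration under assumptions, not the library code: `emulate_store_blocked` is a hypothetical name, and the mapping `flat_id * ItemsPerThread + i` is the assumed blocked layout.

```cpp
#include <vector>

// Hypothetical host-side emulation (not part of rocPRIM) of a blocked store
// for one thread: the thread writes ItemsPerThread consecutive elements.
template<unsigned int ItemsPerThread, class T>
void emulate_store_blocked(unsigned int    flat_id,
                           std::vector<T>& block_output,
                           const T (&items)[ItemsPerThread])
{
    // Assumed mapping: thread flat_id owns [offset, offset + ItemsPerThread).
    const unsigned int offset = flat_id * ItemsPerThread;
    for(unsigned int i = 0; i < ItemsPerThread; ++i)
    {
        block_output[offset + i] = items[i];
    }
}
```

For ItemsPerThread = 2, the thread with flat_id 3 would write positions 6 and 7 of the output.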

◆ block_store_direct_blocked() [2/2]

template<class OutputIterator , class T , unsigned int ItemsPerThread>

ROCPRIM_DEVICE ROCPRIM_INLINE void block_store_direct_blocked	(	unsigned int	flat_id,
		OutputIterator	block_output,
		T(&)	items[ItemsPerThread],
		unsigned int	valid
	)

Stores a blocked arrangement of items from across the thread block into a blocked arrangement in contiguous memory, guarded by the range valid.

The blocked arrangement is assumed to be (block-threads * ItemsPerThread) items across a thread block. Each thread uses its flat_id to store a range of ItemsPerThread items to block_output.

Template Parameters

OutputIterator	- [inferred] an iterator type for output (can be a simple pointer)
T	- [inferred] the data type
ItemsPerThread	- [inferred] the number of items to be processed by each thread

Parameters

flat_id	- a local flat 1D thread id in a block (tile) for the calling thread
block_output	- the output iterator of the thread block to store to
items	- array of items that is stored to block_output
valid	- maximum number of valid items to store; positions at or beyond valid are not written
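
The effect of the valid guard on a blocked store can be sketched on the host as follows. This is an assumption-laden illustration rather than the rocPRIM source: `emulate_store_blocked_guarded` is a hypothetical helper, and the guard is assumed to compare the absolute output position against valid.

```cpp
#include <vector>

// Hypothetical host-side emulation (not part of rocPRIM) of a guarded
// blocked store for one thread.
template<unsigned int ItemsPerThread, class T>
void emulate_store_blocked_guarded(unsigned int    flat_id,
                                   std::vector<T>& block_output,
                                   const T (&items)[ItemsPerThread],
                                   unsigned int    valid)
{
    const unsigned int offset = flat_id * ItemsPerThread;
    for(unsigned int i = 0; i < ItemsPerThread; ++i)
    {
        // Assumed guard: skip every position at or beyond valid.
        if(offset + i < valid)
        {
            block_output[offset + i] = items[i];
        }
    }
}
```

With ItemsPerThread = 2 and valid = 5, the thread with flat_id 2 would write position 4 and leave position 5 untouched.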

◆ block_store_direct_blocked_vectorized()

template<class T , class U , unsigned int ItemsPerThread>

ROCPRIM_DEVICE ROCPRIM_INLINE auto block_store_direct_blocked_vectorized	(	unsigned int	flat_id,
		T *	block_output,
		U(&)	items[ItemsPerThread]
	)		-> typename std::enable_if<detail::is_vectorizable<T, ItemsPerThread>::value>::type

Stores a blocked arrangement of items from across the thread block into a blocked arrangement in contiguous memory.

The blocked arrangement is assumed to be (block-threads * ItemsPerThread) items across a thread block. Each thread uses its flat_id to store a range of ItemsPerThread items to block_output.

The output offset (block_output + offset) must be quad-item aligned.

The following conditions will prevent vectorization and switch to the default block_store_direct_blocked:

ItemsPerThread is odd.
The datatype T is not a primitive or a HIP vector type (e.g. int2, int4, etc.).

Template Parameters

T	- [inferred] the output data type
U	- [inferred] the input data type
ItemsPerThread	- [inferred] the number of items to be processed by each thread

The type U must be such that it can be implicitly converted to T.

Parameters

flat_id	- a local flat 1D thread id in a block (tile) for the calling thread
block_output	- the output pointer of the thread block to store to
items	- array of items that is stored to block_output

◆ block_store_direct_striped() [1/2]

template<unsigned int BlockSize, class OutputIterator , class T , unsigned int ItemsPerThread>

ROCPRIM_DEVICE ROCPRIM_INLINE void block_store_direct_striped	(	unsigned int	flat_id,
		OutputIterator	block_output,
		T(&)	items[ItemsPerThread]
	)

Stores a striped arrangement of items from across the thread block into a blocked arrangement in contiguous memory.

The striped arrangement is assumed to be (BlockSize * ItemsPerThread) items across a thread block. Each thread uses its flat_id to store ItemsPerThread items to block_output, with consecutive items of a thread placed BlockSize elements apart.

Template Parameters

BlockSize	- the number of threads in a block
OutputIterator	- [inferred] an iterator type for output (can be a simple pointer)
T	- [inferred] the data type
ItemsPerThread	- [inferred] the number of items to be processed by each thread

Parameters

flat_id	- a local flat 1D thread id in a block (tile) for the calling thread
block_output	- the output iterator of the thread block to store to
items	- array of items that is stored to block_output
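
The striped layout differs from the blocked one only in its index mapping, which the following host-side sketch illustrates. It is not the library implementation: `emulate_store_striped` is a hypothetical helper, and the mapping `flat_id + i * BlockSize` is the assumed striped layout.

```cpp
#include <vector>

// Hypothetical host-side emulation (not part of rocPRIM) of a striped store
// for one thread: consecutive items of a thread are BlockSize elements apart.
template<unsigned int BlockSize, unsigned int ItemsPerThread, class T>
void emulate_store_striped(unsigned int    flat_id,
                           std::vector<T>& block_output,
                           const T (&items)[ItemsPerThread])
{
    for(unsigned int i = 0; i < ItemsPerThread; ++i)
    {
        // Assumed mapping: item i of thread flat_id goes to flat_id + i * BlockSize.
        block_output[flat_id + i * BlockSize] = items[i];
    }
}
```

With BlockSize = 4 and ItemsPerThread = 2, the thread with flat_id 1 would write positions 1 and 5, so neighbouring threads write neighbouring addresses in each step.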

◆ block_store_direct_striped() [2/2]

template<unsigned int BlockSize, class OutputIterator , class T , unsigned int ItemsPerThread>

ROCPRIM_DEVICE ROCPRIM_INLINE void block_store_direct_striped	(	unsigned int	flat_id,
		OutputIterator	block_output,
		T(&)	items[ItemsPerThread],
		unsigned int	valid
	)

Stores a striped arrangement of items from across the thread block into a blocked arrangement in contiguous memory, guarded by the range valid.

The striped arrangement is assumed to be (BlockSize * ItemsPerThread) items across a thread block. Each thread uses its flat_id to store ItemsPerThread items to block_output, with consecutive items of a thread placed BlockSize elements apart.

Template Parameters

BlockSize	- the number of threads in a block
OutputIterator	- [inferred] an iterator type for output (can be a simple pointer)
T	- [inferred] the data type
ItemsPerThread	- [inferred] the number of items to be processed by each thread

Parameters

flat_id	- a local flat 1D thread id in a block (tile) for the calling thread
block_output	- the output iterator of the thread block to store to
items	- array of items that is stored to block_output
valid	- maximum number of valid items to store; positions at or beyond valid are not written

◆ block_store_direct_warp_striped() [1/2]

template<unsigned int WarpSize = device_warp_size(), class OutputIterator , class T , unsigned int ItemsPerThread>

ROCPRIM_DEVICE ROCPRIM_INLINE void block_store_direct_warp_striped	(	unsigned int	flat_id,
		OutputIterator	block_output,
		T(&)	items[ItemsPerThread]
	)

Stores a warp-striped arrangement of items from across the thread block into a blocked arrangement in contiguous memory.

The warp-striped arrangement is assumed to be (WarpSize * ItemsPerThread) items across a thread block. Each thread uses its flat_id to store ItemsPerThread items to block_output, with consecutive items of a thread placed WarpSize elements apart.

The number of threads in the block must be a multiple of WarpSize.
The default WarpSize is the hardware warp size, which is the optimal value.
WarpSize must be a power of two and less than or equal to the hardware warp size.
Using a WarpSize smaller than the hardware warp size could result in lower performance.

Template Parameters

WarpSize	- [optional] the number of threads in a warp
OutputIterator	- [inferred] an iterator type for output (can be a simple pointer)
T	- [inferred] the data type
ItemsPerThread	- [inferred] the number of items to be processed by each thread

Parameters

flat_id	- a local flat 1D thread id in a block (tile) for the calling thread
block_output	- the output iterator of the thread block to store to
items	- array of items that is stored to block_output
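
A host-side sketch of the warp-striped index mapping may help here. It is illustrative only: `emulate_store_warp_striped` is a hypothetical helper, the lane is assumed to be `flat_id % WarpSize`, and each warp is assumed to own a segment of `WarpSize * ItemsPerThread` consecutive output elements.

```cpp
#include <vector>

// Hypothetical host-side emulation (not part of rocPRIM) of a warp-striped
// store for one thread.
template<unsigned int WarpSize, unsigned int ItemsPerThread, class T>
void emulate_store_warp_striped(unsigned int    flat_id,
                                std::vector<T>& block_output,
                                const T (&items)[ItemsPerThread])
{
    // Assumed mapping: striping happens inside each warp's own segment.
    const unsigned int lane        = flat_id % WarpSize;
    const unsigned int warp_id     = flat_id / WarpSize;
    const unsigned int warp_offset = warp_id * WarpSize * ItemsPerThread;

    for(unsigned int i = 0; i < ItemsPerThread; ++i)
    {
        block_output[warp_offset + lane + i * WarpSize] = items[i];
    }
}
```

With WarpSize = 4 and ItemsPerThread = 2, the thread with flat_id 5 (lane 1 of warp 1) would write positions 9 and 13.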

◆ block_store_direct_warp_striped() [2/2]

template<unsigned int WarpSize = device_warp_size(), class OutputIterator , class T , unsigned int ItemsPerThread>

ROCPRIM_DEVICE ROCPRIM_INLINE void block_store_direct_warp_striped	(	unsigned int	flat_id,
		OutputIterator	block_output,
		T(&)	items[ItemsPerThread],
		unsigned int	valid
	)

Stores a warp-striped arrangement of items from across the thread block into a blocked arrangement in contiguous memory, guarded by the range valid.

The warp-striped arrangement is assumed to be (WarpSize * ItemsPerThread) items across a thread block. Each thread uses its flat_id to store ItemsPerThread items to block_output, with consecutive items of a thread placed WarpSize elements apart.

The number of threads in the block must be a multiple of WarpSize.
The default WarpSize is the hardware warp size, which is the optimal value.
WarpSize must be a power of two and less than or equal to the hardware warp size.
Using a WarpSize smaller than the hardware warp size could result in lower performance.

Template Parameters

WarpSize	- [optional] the number of threads in a warp
OutputIterator	- [inferred] an iterator type for output (can be a simple pointer)
T	- [inferred] the data type
ItemsPerThread	- [inferred] the number of items to be processed by each thread

Parameters

flat_id	- a local flat 1D thread id in a block (tile) for the calling thread
block_output	- the output iterator of the thread block to store to
items	- array of items that is stored to block_output
valid	- maximum number of valid items to store; positions at or beyond valid are not written
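
To see the guard acting on a whole block, the sketch below runs every thread of a block in a host-side loop and records which thread would write each output position. It is an illustration under the same assumptions as the warp-striped mapping described above (lane = `flat_id % WarpSize`, one segment of `WarpSize * ItemsPerThread` elements per warp); `emulate_block_store_warp_striped_guarded` is a hypothetical name, not a rocPRIM function.

```cpp
#include <vector>

// Hypothetical host-side emulation (not part of rocPRIM): returns, for each
// output position, the flat_id of the thread that writes it, or -1 if the
// position is skipped by the valid guard.
template<unsigned int WarpSize, unsigned int ItemsPerThread>
std::vector<int> emulate_block_store_warp_striped_guarded(unsigned int block_size,
                                                          unsigned int valid)
{
    std::vector<int> block_output(block_size * ItemsPerThread, -1);
    for(unsigned int flat_id = 0; flat_id < block_size; ++flat_id)
    {
        const unsigned int lane        = flat_id % WarpSize;
        const unsigned int warp_offset = (flat_id / WarpSize) * WarpSize * ItemsPerThread;
        for(unsigned int i = 0; i < ItemsPerThread; ++i)
        {
            const unsigned int index = warp_offset + lane + i * WarpSize;
            if(index < valid) // assumed guard on the absolute output position
            {
                block_output[index] = static_cast<int>(flat_id);
            }
        }
    }
    return block_output;
}
```

For a block of 8 threads with WarpSize = 4, ItemsPerThread = 2 and valid = 10, positions 0 to 9 would be written and positions 10 to 15 would keep their initial value.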
