rocPRIM
Namespaces | |
detail | |
Deprecated: Configuration of device-level scan primitives. | |
Classes | |
class | block_adjacent_difference< T, BlockSizeX, BlockSizeY, BlockSizeZ > |
The block_adjacent_difference class is a block level parallel primitive which provides methods for applying binary functions to pairs of consecutive items partitioned across a thread block. More... | |
class | block_discontinuity< T, BlockSizeX, BlockSizeY, BlockSizeZ > |
The block_discontinuity class is a block level parallel primitive which provides methods for flagging discontinuities within an ordered set of items across threads in a block. More... | |
class | block_exchange< T, BlockSizeX, ItemsPerThread, BlockSizeY, BlockSizeZ > |
The block_exchange class is a block level parallel primitive which provides methods for rearranging items partitioned across threads in a block. More... | |
class | block_histogram< T, BlockSizeX, ItemsPerThread, Bins, Algorithm, BlockSizeY, BlockSizeZ > |
The block_histogram class is a block level parallel primitive which provides methods for constructing block-wide histograms from items partitioned across threads in a block. More... | |
class | block_load< T, BlockSizeX, ItemsPerThread, Method, BlockSizeY, BlockSizeZ > |
The block_load class is a block level parallel primitive which provides methods for loading data from continuous memory into a blocked arrangement of items across the thread block. More... | |
class | block_load< T, BlockSizeX, ItemsPerThread, block_load_method::block_load_striped, BlockSizeY, BlockSizeZ > |
class | block_load< T, BlockSizeX, ItemsPerThread, block_load_method::block_load_vectorize, BlockSizeY, BlockSizeZ > |
class | block_load< T, BlockSizeX, ItemsPerThread, block_load_method::block_load_transpose, BlockSizeY, BlockSizeZ > |
class | block_load< T, BlockSizeX, ItemsPerThread, block_load_method::block_load_warp_transpose, BlockSizeY, BlockSizeZ > |
class | block_radix_rank< BlockSizeX, RadixBits, Algorithm, BlockSizeY, BlockSizeZ > |
The block_radix_rank class is a block level parallel primitive that provides methods for ranking items partitioned across threads in a block. More... | |
class | block_radix_sort< Key, BlockSizeX, ItemsPerThread, Value, BlockSizeY, BlockSizeZ > |
The block_radix_sort class is a block level parallel primitive which provides methods for sorting items (keys or key-value pairs) partitioned across threads in a block using the radix sort algorithm. More... | |
class | block_reduce< T, BlockSizeX, Algorithm, BlockSizeY, BlockSizeZ > |
The block_reduce class is a block level parallel primitive which provides methods for performing reduction operations on items partitioned across threads in a block. More... | |
class | block_scan< T, BlockSizeX, Algorithm, BlockSizeY, BlockSizeZ > |
The block_scan class is a block level parallel primitive which provides methods for performing inclusive and exclusive scan operations of items partitioned across threads in a block. More... | |
class | block_shuffle< T, BlockSizeX, BlockSizeY, BlockSizeZ > |
The block_shuffle class is a block level parallel primitive which provides methods for shuffling data partitioned across a block. More... | |
class | block_sort< Key, BlockSizeX, ItemsPerThread, Value, Algorithm, BlockSizeY, BlockSizeZ > |
The block_sort class is a block level parallel primitive which provides methods for sorting items (keys or key-value pairs) partitioned across threads in a block using a comparison-based sort algorithm. More... | |
class | block_store< T, BlockSizeX, ItemsPerThread, Method, BlockSizeY, BlockSizeZ > |
The block_store class is a block level parallel primitive which provides methods for storing an arrangement of items into a blocked/striped arrangement on continuous memory. More... | |
class | block_store< T, BlockSizeX, ItemsPerThread, block_store_method::block_store_striped, BlockSizeY, BlockSizeZ > |
class | block_store< T, BlockSizeX, ItemsPerThread, block_store_method::block_store_vectorize, BlockSizeY, BlockSizeZ > |
class | block_store< T, BlockSizeX, ItemsPerThread, block_store_method::block_store_transpose, BlockSizeY, BlockSizeZ > |
class | block_store< T, BlockSizeX, ItemsPerThread, block_store_method::block_store_warp_transpose, BlockSizeY, BlockSizeZ > |
Functions | |
template<class InputIterator , class T , unsigned int ItemsPerThread> | |
ROCPRIM_DEVICE ROCPRIM_INLINE void | block_load_direct_blocked (unsigned int flat_id, InputIterator block_input, T(&items)[ItemsPerThread]) |
Loads data from continuous memory into a blocked arrangement of items across the thread block. More... | |
template<class InputIterator , class T , unsigned int ItemsPerThread> | |
ROCPRIM_DEVICE ROCPRIM_INLINE void | block_load_direct_blocked (unsigned int flat_id, InputIterator block_input, T(&items)[ItemsPerThread], unsigned int valid) |
Loads data from continuous memory into a blocked arrangement of items across the thread block, which is guarded by range valid. More... | |
template<class InputIterator , class T , unsigned int ItemsPerThread, class Default > | |
ROCPRIM_DEVICE ROCPRIM_INLINE void | block_load_direct_blocked (unsigned int flat_id, InputIterator block_input, T(&items)[ItemsPerThread], unsigned int valid, Default out_of_bounds) |
Loads data from continuous memory into a blocked arrangement of items across the thread block, which is guarded by range with a fall-back value for out-of-bound elements. More... | |
template<class T , class U , unsigned int ItemsPerThread> | |
ROCPRIM_DEVICE ROCPRIM_INLINE auto | block_load_direct_blocked_vectorized (unsigned int flat_id, T *block_input, U(&items)[ItemsPerThread]) -> typename std::enable_if< detail::is_vectorizable< T, ItemsPerThread >::value >::type |
Loads data from continuous memory into a blocked arrangement of items across the thread block. More... | |
template<unsigned int BlockSize, class InputIterator , class T , unsigned int ItemsPerThread> | |
ROCPRIM_DEVICE ROCPRIM_INLINE void | block_load_direct_striped (unsigned int flat_id, InputIterator block_input, T(&items)[ItemsPerThread]) |
Loads data from continuous memory into a striped arrangement of items across the thread block. More... | |
template<unsigned int BlockSize, class InputIterator , class T , unsigned int ItemsPerThread> | |
ROCPRIM_DEVICE ROCPRIM_INLINE void | block_load_direct_striped (unsigned int flat_id, InputIterator block_input, T(&items)[ItemsPerThread], unsigned int valid) |
Loads data from continuous memory into a striped arrangement of items across the thread block, which is guarded by range valid. More... | |
template<unsigned int BlockSize, class InputIterator , class T , unsigned int ItemsPerThread, class Default > | |
ROCPRIM_DEVICE ROCPRIM_INLINE void | block_load_direct_striped (unsigned int flat_id, InputIterator block_input, T(&items)[ItemsPerThread], unsigned int valid, Default out_of_bounds) |
Loads data from continuous memory into a striped arrangement of items across the thread block, which is guarded by range with a fall-back value for out-of-bound elements. More... | |
template<unsigned int WarpSize = device_warp_size(), class InputIterator , class T , unsigned int ItemsPerThread> | |
ROCPRIM_DEVICE ROCPRIM_INLINE void | block_load_direct_warp_striped (unsigned int flat_id, InputIterator block_input, T(&items)[ItemsPerThread]) |
Loads data from continuous memory into a warp-striped arrangement of items across the thread block. More... | |
template<unsigned int WarpSize = device_warp_size(), class InputIterator , class T , unsigned int ItemsPerThread> | |
ROCPRIM_DEVICE ROCPRIM_INLINE void | block_load_direct_warp_striped (unsigned int flat_id, InputIterator block_input, T(&items)[ItemsPerThread], unsigned int valid) |
Loads data from continuous memory into a warp-striped arrangement of items across the thread block, which is guarded by range valid. More... | |
template<unsigned int WarpSize = device_warp_size(), class InputIterator , class T , unsigned int ItemsPerThread, class Default > | |
ROCPRIM_DEVICE ROCPRIM_INLINE void | block_load_direct_warp_striped (unsigned int flat_id, InputIterator block_input, T(&items)[ItemsPerThread], unsigned int valid, Default out_of_bounds) |
Loads data from continuous memory into a warp-striped arrangement of items across the thread block, which is guarded by range with a fall-back value for out-of-bound elements. More... | |
template<class OutputIterator , class T , unsigned int ItemsPerThread> | |
ROCPRIM_DEVICE ROCPRIM_INLINE void | block_store_direct_blocked (unsigned int flat_id, OutputIterator block_output, T(&items)[ItemsPerThread]) |
Stores a blocked arrangement of items from across the thread block into a blocked arrangement on continuous memory. More... | |
template<class OutputIterator , class T , unsigned int ItemsPerThread> | |
ROCPRIM_DEVICE ROCPRIM_INLINE void | block_store_direct_blocked (unsigned int flat_id, OutputIterator block_output, T(&items)[ItemsPerThread], unsigned int valid) |
Stores a blocked arrangement of items from across the thread block into a blocked arrangement on continuous memory, which is guarded by range valid. More... | |
template<class T , class U , unsigned int ItemsPerThread> | |
ROCPRIM_DEVICE ROCPRIM_INLINE auto | block_store_direct_blocked_vectorized (unsigned int flat_id, T *block_output, U(&items)[ItemsPerThread]) -> typename std::enable_if< detail::is_vectorizable< T, ItemsPerThread >::value >::type |
Stores a blocked arrangement of items from across the thread block into a blocked arrangement on continuous memory. More... | |
template<unsigned int BlockSize, class OutputIterator , class T , unsigned int ItemsPerThread> | |
ROCPRIM_DEVICE ROCPRIM_INLINE void | block_store_direct_striped (unsigned int flat_id, OutputIterator block_output, T(&items)[ItemsPerThread]) |
Stores a striped arrangement of items from across the thread block into a blocked arrangement on continuous memory. More... | |
template<unsigned int BlockSize, class OutputIterator , class T , unsigned int ItemsPerThread> | |
ROCPRIM_DEVICE ROCPRIM_INLINE void | block_store_direct_striped (unsigned int flat_id, OutputIterator block_output, T(&items)[ItemsPerThread], unsigned int valid) |
Stores a striped arrangement of items from across the thread block into a blocked arrangement on continuous memory, which is guarded by range valid. More... | |
template<unsigned int WarpSize = device_warp_size(), class OutputIterator , class T , unsigned int ItemsPerThread> | |
ROCPRIM_DEVICE ROCPRIM_INLINE void | block_store_direct_warp_striped (unsigned int flat_id, OutputIterator block_output, T(&items)[ItemsPerThread]) |
Stores a warp-striped arrangement of items from across the thread block into a blocked arrangement on continuous memory. More... | |
template<unsigned int WarpSize = device_warp_size(), class OutputIterator , class T , unsigned int ItemsPerThread> | |
ROCPRIM_DEVICE ROCPRIM_INLINE void | block_store_direct_warp_striped (unsigned int flat_id, OutputIterator block_output, T(&items)[ItemsPerThread], unsigned int valid) |
Stores a warp-striped arrangement of items from across the thread block into a blocked arrangement on continuous memory, which is guarded by range valid. More... | |
|
strong |
Available algorithms for block_histogram primitive.
Enumerator | |
---|---|
using_atomic | Atomic addition is used to update bin count directly. |
using_sort | A two-phase operation is used. |
default_algorithm | Default block_histogram algorithm. |
|
strong |
block_load_method enumerates the methods available to load data from continuous memory into a blocked arrangement of items across the thread block.
|
strong |
Available algorithms for the block_radix_rank primitive.
|
strong |
Available algorithms for block_reduce primitive.
Enumerator | |
---|---|
using_warp_reduce | A warp_reduce based algorithm. |
raking_reduce | An algorithm which limits calculations to a single hardware warp. |
raking_reduce_commutative_only | raking reduce that supports only commutative operators |
default_algorithm | Default block_reduce algorithm. |
|
strong |
Available algorithms for block_scan primitive.
Enumerator | |
---|---|
using_warp_scan | A warp_scan based algorithm. |
reduce_then_scan | An algorithm which limits calculations to a single hardware warp. |
default_algorithm | Default block_scan algorithm. |
|
strong |
Available algorithms for block_sort primitive.
Enumerator | |
---|---|
bitonic_sort | A bitonic sort based algorithm. |
merge_sort | A merge sort based algorithm. |
stable_merge_sort | A merge sort based algorithm which sorts stably. |
default_algorithm | Default block_sort algorithm. |
|
strong |
block_store_method enumerates the methods available to store a striped arrangement of items into a blocked/striped arrangement on continuous memory.
ROCPRIM_DEVICE ROCPRIM_INLINE void | block_load_direct_blocked (unsigned int flat_id, InputIterator block_input, T(&items)[ItemsPerThread]) |
Loads data from continuous memory into a blocked arrangement of items across the thread block.
The block arrangement is assumed to be (block-threads * ItemsPerThread) items across a thread block. Each thread uses a flat_id to load a range of ItemsPerThread into items.
InputIterator | - [inferred] an iterator type for input (can be a simple pointer) |
T | - [inferred] the data type |
ItemsPerThread | - [inferred] the number of items to be processed by each thread |
flat_id | - a local flat 1D thread id in a block (tile) for the calling thread |
block_input | - the input iterator from the thread block to load from |
items | - array that data is loaded to |
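
To make the blocked arrangement concrete, the host-side C++ sketch below computes which elements a single thread would end up holding. It is not rocPRIM code: `emulate_load_blocked` is a hypothetical helper, `std::vector` and `std::array` stand in for device memory and the per-thread `items` array, and the index formula `flat_id * ItemsPerThread + i` is the mapping this sketch assumes for a blocked arrangement.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side stand-in (not a rocPRIM function): fills `items`
// with what one thread would hold after a blocked load.
template <class T, std::size_t ItemsPerThread>
void emulate_load_blocked(unsigned int flat_id,
                          const std::vector<T>& block_input,
                          std::array<T, ItemsPerThread>& items)
{
    // Assumed mapping: thread flat_id owns one consecutive range of
    // ItemsPerThread elements starting at this offset.
    const std::size_t offset = static_cast<std::size_t>(flat_id) * ItemsPerThread;
    assert(offset + ItemsPerThread <= block_input.size());
    for (std::size_t i = 0; i < ItemsPerThread; ++i)
    {
        items[i] = block_input[offset + i];
    }
}
```

With eight input elements and two items per thread, the thread with `flat_id` 3 would read the elements at positions 6 and 7 under this mapping.
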
ROCPRIM_DEVICE ROCPRIM_INLINE void | block_load_direct_blocked (unsigned int flat_id, InputIterator block_input, T(&items)[ItemsPerThread], unsigned int valid) |
Loads data from continuous memory into a blocked arrangement of items across the thread block, which is guarded by range valid.
The block arrangement is assumed to be (block-threads * ItemsPerThread) items across a thread block. Each thread uses a flat_id to load a range of ItemsPerThread into items.
InputIterator | - [inferred] an iterator type for input (can be a simple pointer) |
T | - [inferred] the data type |
ItemsPerThread | - [inferred] the number of items to be processed by each thread |
flat_id | - a local flat 1D thread id in a block (tile) for the calling thread |
block_input | - the input iterator from the thread block to load from |
items | - array that data is loaded to |
valid | - maximum number of valid items to load |
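
The effect of the `valid` guard can be sketched on the host as follows. This is an illustration only: `emulate_load_blocked_guarded` is a hypothetical helper, the containers stand in for device memory, and the sketch assumes that positions at or beyond `valid` are simply skipped, so the corresponding slots of `items` keep whatever they held before the call.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side stand-in (not a rocPRIM function): blocked load
// that only touches input positions strictly below `valid`.
template <class T, std::size_t ItemsPerThread>
void emulate_load_blocked_guarded(unsigned int flat_id,
                                  const std::vector<T>& block_input,
                                  std::array<T, ItemsPerThread>& items,
                                  unsigned int valid)
{
    assert(valid <= block_input.size());
    const std::size_t offset = static_cast<std::size_t>(flat_id) * ItemsPerThread;
    for (std::size_t i = 0; i < ItemsPerThread; ++i)
    {
        // Assumption of this sketch: out-of-range slots are left untouched.
        if (offset + i < valid)
        {
            items[i] = block_input[offset + i];
        }
    }
}
```

A guard of this kind is typically useful for the last tile of an input whose size is not a multiple of the tile size.
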
ROCPRIM_DEVICE ROCPRIM_INLINE void | block_load_direct_blocked (unsigned int flat_id, InputIterator block_input, T(&items)[ItemsPerThread], unsigned int valid, Default out_of_bounds) |
Loads data from continuous memory into a blocked arrangement of items across the thread block, which is guarded by range with a fall-back value for out-of-bound elements.
The block arrangement is assumed to be (block-threads * ItemsPerThread) items across a thread block. Each thread uses a flat_id to load a range of ItemsPerThread into items.
InputIterator | - [inferred] an iterator type for input (can be a simple pointer) |
T | - [inferred] the data type |
ItemsPerThread | - [inferred] the number of items to be processed by each thread |
Default | - [inferred] the data type of the default value |
flat_id | - a local flat 1D thread id in a block (tile) for the calling thread |
block_input | - the input iterator from the thread block to load from |
items | - array that data is loaded to |
valid | - maximum number of valid items to load |
out_of_bounds | - default value assigned to out-of-bound items |
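
The fall-back behaviour can be pictured with another host-side sketch. Again this is not rocPRIM code: `emulate_load_blocked_or_default` is a hypothetical name, and the sketch assumes that every slot whose input position is at or beyond `valid` receives `out_of_bounds` converted to `T`.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side stand-in (not a rocPRIM function): blocked load
// that substitutes a default value for out-of-bound positions.
template <class T, std::size_t ItemsPerThread, class Default>
void emulate_load_blocked_or_default(unsigned int flat_id,
                                     const std::vector<T>& block_input,
                                     std::array<T, ItemsPerThread>& items,
                                     unsigned int valid,
                                     Default out_of_bounds)
{
    assert(valid <= block_input.size());
    const std::size_t offset = static_cast<std::size_t>(flat_id) * ItemsPerThread;
    for (std::size_t i = 0; i < ItemsPerThread; ++i)
    {
        const std::size_t index = offset + i;
        // In-range positions are read; the rest get the fall-back value.
        items[i] = (index < valid) ? block_input[index]
                                   : static_cast<T>(out_of_bounds);
    }
}
```

Choosing the identity of a later operation as the fall-back (for example 0 before a sum) lets the following steps treat every slot uniformly.
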
ROCPRIM_DEVICE ROCPRIM_INLINE auto | block_load_direct_blocked_vectorized (unsigned int flat_id, T *block_input, U(&items)[ItemsPerThread]) -> typename std::enable_if< detail::is_vectorizable< T, ItemsPerThread >::value >::type |
Loads data from continuous memory into a blocked arrangement of items across the thread block.
The block arrangement is assumed to be (block-threads * ItemsPerThread) items across a thread block. Each thread uses a flat_id to load a range of ItemsPerThread into items.
The input offset (block_input + offset) must be quad-item aligned.
The following conditions will prevent vectorization and switch to default block_load_direct_blocked:
- ItemsPerThread is odd.
- T is not a primitive or a HIP vector type (e.g. int2, int4, etc.).

T | - [inferred] the input data type |
U | - [inferred] the output data type |
ItemsPerThread | - [inferred] the number of items to be processed by each thread |
The type T must be such that it can be implicitly converted to U.
flat_id | - a local flat 1D thread id in a block (tile) for the calling thread |
block_input | - the input iterator from the thread block to load from |
items | - array that data is loaded to |
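
The two conditions listed for vectorization can be expressed as a compile-time check. The trait below is a hypothetical illustration and is not `detail::is_vectorizable`: it treats "primitive" as `std::is_arithmetic`, which is an assumption, and it does not model HIP vector types such as int2 or int4 because they are unavailable on the host.

```cpp
#include <type_traits>

// Hypothetical trait (not part of rocPRIM): true only when ItemsPerThread
// is even and T is an arithmetic type. HIP vector types are not modeled.
template <class T, unsigned int ItemsPerThread>
struct can_vectorize_sketch
    : std::integral_constant<bool,
                             std::is_arithmetic<T>::value
                                 && (ItemsPerThread % 2 == 0)>
{
};

// A user-defined type used only to show the negative case.
struct not_primitive
{
    int payload;
};

static_assert(can_vectorize_sketch<int, 4>::value, "even count of a primitive type");
static_assert(!can_vectorize_sketch<int, 3>::value, "odd ItemsPerThread");
static_assert(!can_vectorize_sketch<not_primitive, 4>::value, "non-primitive type");
```
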
ROCPRIM_DEVICE ROCPRIM_INLINE void | block_load_direct_striped (unsigned int flat_id, InputIterator block_input, T(&items)[ItemsPerThread]) |
Loads data from continuous memory into a striped arrangement of items across the thread block.
The striped arrangement is assumed to be (BlockSize * ItemsPerThread) items across a thread block. Each thread uses a flat_id to load a range of ItemsPerThread into items.
BlockSize | - the number of threads in a block |
InputIterator | - [inferred] an iterator type for input (can be a simple pointer) |
T | - [inferred] the data type |
ItemsPerThread | - [inferred] the number of items to be processed by each thread |
flat_id | - a local flat 1D thread id in a block (tile) for the calling thread |
block_input | - the input iterator from the thread block to load from |
items | - array that data is loaded to |
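
For comparison with the blocked case, the sketch below shows the striped access pattern on the host. It is an illustration rather than rocPRIM code: `emulate_load_striped` is a hypothetical helper, and the index formula `flat_id + i * BlockSize` is the mapping this sketch assumes for a striped arrangement, in which consecutive threads read consecutive elements and each thread's items are `BlockSize` elements apart.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side stand-in (not a rocPRIM function): fills `items`
// with what one thread would hold after a striped load.
template <unsigned int BlockSize, class T, std::size_t ItemsPerThread>
void emulate_load_striped(unsigned int flat_id,
                          const std::vector<T>& block_input,
                          std::array<T, ItemsPerThread>& items)
{
    assert(flat_id < BlockSize);
    assert(block_input.size() >= BlockSize * ItemsPerThread);
    for (std::size_t i = 0; i < ItemsPerThread; ++i)
    {
        // Assumed mapping: the i-th item of a thread sits i strides of
        // BlockSize elements after the thread's first element.
        items[i] = block_input[flat_id + i * BlockSize];
    }
}
```

With `BlockSize` 4 and two items per thread, the thread with `flat_id` 1 would read positions 1 and 5, whereas a blocked arrangement would give it positions 2 and 3.
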
ROCPRIM_DEVICE ROCPRIM_INLINE void | block_load_direct_striped (unsigned int flat_id, InputIterator block_input, T(&items)[ItemsPerThread], unsigned int valid) |
Loads data from continuous memory into a striped arrangement of items across the thread block, which is guarded by range valid.
The striped arrangement is assumed to be (BlockSize * ItemsPerThread) items across a thread block. Each thread uses a flat_id to load a range of ItemsPerThread into items.
BlockSize | - the number of threads in a block |
InputIterator | - [inferred] an iterator type for input (can be a simple pointer) |
T | - [inferred] the data type |
ItemsPerThread | - [inferred] the number of items to be processed by each thread |
flat_id | - a local flat 1D thread id in a block (tile) for the calling thread |
block_input | - the input iterator from the thread block to load from |
items | - array that data is loaded to |
valid | - maximum number of valid items to load |
ROCPRIM_DEVICE ROCPRIM_INLINE void | block_load_direct_striped (unsigned int flat_id, InputIterator block_input, T(&items)[ItemsPerThread], unsigned int valid, Default out_of_bounds) |
Loads data from continuous memory into a striped arrangement of items across the thread block, which is guarded by range with a fall-back value for out-of-bound elements.
The striped arrangement is assumed to be (BlockSize * ItemsPerThread) items across a thread block. Each thread uses a flat_id to load a range of ItemsPerThread into items.
BlockSize | - the number of threads in a block |
InputIterator | - [inferred] an iterator type for input (can be a simple pointer) |
T | - [inferred] the data type |
ItemsPerThread | - [inferred] the number of items to be processed by each thread |
Default | - [inferred] the data type of the default value |
flat_id | - a local flat 1D thread id in a block (tile) for the calling thread |
block_input | - the input iterator from the thread block to load from |
items | - array that data is loaded to |
valid | - maximum number of valid items to load |
out_of_bounds | - default value assigned to out-of-bound items |
ROCPRIM_DEVICE ROCPRIM_INLINE void | block_load_direct_warp_striped (unsigned int flat_id, InputIterator block_input, T(&items)[ItemsPerThread]) |
Loads data from continuous memory into a warp-striped arrangement of items across the thread block.
The warp-striped arrangement is assumed to be (WarpSize * ItemsPerThread) items across a thread block. Each thread uses a flat_id to load a range of ItemsPerThread into items.
- The number of threads in the block must be a multiple of WarpSize.
- The default WarpSize is a hardware warp size and is an optimal value.
- WarpSize must be a power of two and equal to or less than the size of a hardware warp.
- Using a WarpSize smaller than the hardware warp size could result in lower performance.

WarpSize | - [optional] the number of threads in a warp |
InputIterator | - [inferred] an iterator type for input (can be a simple pointer) |
T | - [inferred] the data type |
ItemsPerThread | - [inferred] the number of items to be processed by each thread |
flat_id | - a local flat 1D thread id in a block (tile) for the calling thread |
block_input | - the input iterator from the thread block to load from |
items | - array that data is loaded to |
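
The warp-striped pattern can be sketched on the host as a striped access inside each warp's own tile of `WarpSize * ItemsPerThread` elements. This is an illustration under stated assumptions, not rocPRIM code: `emulate_load_warp_striped` is a hypothetical helper, the lane is taken as `flat_id % WarpSize` and the warp index as `flat_id / WarpSize`, and `WarpSize` has no default here because `device_warp_size()` is not available on the host.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side stand-in (not a rocPRIM function): fills `items`
// with what one thread would hold after a warp-striped load.
template <unsigned int WarpSize, class T, std::size_t ItemsPerThread>
void emulate_load_warp_striped(unsigned int flat_id,
                               const std::vector<T>& block_input,
                               std::array<T, ItemsPerThread>& items)
{
    static_assert(WarpSize != 0 && (WarpSize & (WarpSize - 1)) == 0,
                  "WarpSize must be a power of two");
    const std::size_t lane        = flat_id % WarpSize; // position inside the warp
    const std::size_t warp        = flat_id / WarpSize; // which warp of the block
    const std::size_t warp_offset = warp * WarpSize * ItemsPerThread;
    assert(warp_offset + WarpSize * ItemsPerThread <= block_input.size());
    for (std::size_t i = 0; i < ItemsPerThread; ++i)
    {
        // Striped access restricted to the warp's own tile.
        items[i] = block_input[warp_offset + lane + i * WarpSize];
    }
}
```

With `WarpSize` 2 and two items per thread, thread 3 is lane 1 of warp 1, so under these assumptions it would read positions 5 and 7.
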
ROCPRIM_DEVICE ROCPRIM_INLINE void | block_load_direct_warp_striped (unsigned int flat_id, InputIterator block_input, T(&items)[ItemsPerThread], unsigned int valid) |
Loads data from continuous memory into a warp-striped arrangement of items across the thread block, which is guarded by range valid.
The warp-striped arrangement is assumed to be (WarpSize * ItemsPerThread) items across a thread block. Each thread uses a flat_id to load a range of ItemsPerThread into items.
- The number of threads in the block must be a multiple of WarpSize.
- The default WarpSize is a hardware warp size and is an optimal value.
- WarpSize must be a power of two and equal to or less than the size of a hardware warp.
- Using a WarpSize smaller than the hardware warp size could result in lower performance.

WarpSize | - [optional] the number of threads in a warp |
InputIterator | - [inferred] an iterator type for input (can be a simple pointer) |
T | - [inferred] the data type |
ItemsPerThread | - [inferred] the number of items to be processed by each thread |
flat_id | - a local flat 1D thread id in a block (tile) for the calling thread |
block_input | - the input iterator from the thread block to load from |
items | - array that data is loaded to |
valid | - maximum number of valid items to load |
ROCPRIM_DEVICE ROCPRIM_INLINE void | block_load_direct_warp_striped (unsigned int flat_id, InputIterator block_input, T(&items)[ItemsPerThread], unsigned int valid, Default out_of_bounds) |
Loads data from continuous memory into a warp-striped arrangement of items across the thread block, which is guarded by range with a fall-back value for out-of-bound elements.
The warp-striped arrangement is assumed to be (WarpSize * ItemsPerThread) items across a thread block. Each thread uses a flat_id to load a range of ItemsPerThread into items.
- The number of threads in the block must be a multiple of WarpSize.
- The default WarpSize is a hardware warp size and is an optimal value.
- WarpSize must be a power of two and equal to or less than the size of a hardware warp.
- Using a WarpSize smaller than the hardware warp size could result in lower performance.

WarpSize | - [optional] the number of threads in a warp |
InputIterator | - [inferred] an iterator type for input (can be a simple pointer) |
T | - [inferred] the data type |
ItemsPerThread | - [inferred] the number of items to be processed by each thread |
Default | - [inferred] the data type of the default value |
flat_id | - a local flat 1D thread id in a block (tile) for the calling thread |
block_input | - the input iterator from the thread block to load from |
items | - array that data is loaded to |
valid | - maximum number of valid items to load |
out_of_bounds | - default value assigned to out-of-bound items |
ROCPRIM_DEVICE ROCPRIM_INLINE void | block_store_direct_blocked (unsigned int flat_id, OutputIterator block_output, T(&items)[ItemsPerThread]) |
Stores a blocked arrangement of items from across the thread block into a blocked arrangement on continuous memory.
The block arrangement is assumed to be (block-threads * ItemsPerThread) items across a thread block. Each thread uses a flat_id to store a range of ItemsPerThread items to the thread block.
OutputIterator | - [inferred] an iterator type for output (can be a simple pointer) |
T | - [inferred] the data type |
ItemsPerThread | - [inferred] the number of items to be processed by each thread |
flat_id | - a local flat 1D thread id in a block (tile) for the calling thread |
block_output | - the output iterator from the thread block to store to |
items | - array of items that is stored to the thread block |
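
The store direction mirrors the load. The host-side sketch below writes one thread's items to the positions a blocked arrangement would assign to it. It is not rocPRIM code: `emulate_store_blocked` is a hypothetical helper, a `std::vector` stands in for the output memory, and the offset `flat_id * ItemsPerThread` is the mapping assumed by this sketch.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side stand-in (not a rocPRIM function): writes one
// thread's items into its consecutive slice of the output.
template <class T, std::size_t ItemsPerThread>
void emulate_store_blocked(unsigned int flat_id,
                           std::vector<T>& block_output,
                           const std::array<T, ItemsPerThread>& items)
{
    const std::size_t offset = static_cast<std::size_t>(flat_id) * ItemsPerThread;
    assert(offset + ItemsPerThread <= block_output.size());
    for (std::size_t i = 0; i < ItemsPerThread; ++i)
    {
        block_output[offset + i] = items[i];
    }
}
```

Calling the helper once per `flat_id` reproduces on the host what the whole block would write, since the slices of different threads do not overlap.
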
ROCPRIM_DEVICE ROCPRIM_INLINE void | block_store_direct_blocked (unsigned int flat_id, OutputIterator block_output, T(&items)[ItemsPerThread], unsigned int valid) |
Stores a blocked arrangement of items from across the thread block into a blocked arrangement on continuous memory, which is guarded by range valid.
The block arrangement is assumed to be (block-threads * ItemsPerThread) items across a thread block. Each thread uses a flat_id to store a range of ItemsPerThread items to the thread block.
OutputIterator | - [inferred] an iterator type for output (can be a simple pointer) |
T | - [inferred] the data type |
ItemsPerThread | - [inferred] the number of items to be processed by each thread |
flat_id | - a local flat 1D thread id in a block (tile) for the calling thread |
block_output | - the output iterator from the thread block to store to |
items | - array of items that is stored to the thread block |
valid | - maximum number of valid items to store |
ROCPRIM_DEVICE ROCPRIM_INLINE auto | block_store_direct_blocked_vectorized (unsigned int flat_id, T *block_output, U(&items)[ItemsPerThread]) -> typename std::enable_if< detail::is_vectorizable< T, ItemsPerThread >::value >::type |
Stores a blocked arrangement of items from across the thread block into a blocked arrangement on continuous memory.
The block arrangement is assumed to be (block-threads * ItemsPerThread) items across a thread block. Each thread uses a flat_id to store a range of ItemsPerThread items to the thread block.
The output offset (block_output + offset) must be quad-item aligned.
The following conditions will prevent vectorization and switch to default block_store_direct_blocked:
- ItemsPerThread is odd.
- T is not a primitive or a HIP vector type (e.g. int2, int4, etc.).

T | - [inferred] the output data type |
U | - [inferred] the input data type |
ItemsPerThread | - [inferred] the number of items to be processed by each thread |
The type U must be such that it can be implicitly converted to T.
flat_id | - a local flat 1D thread id in a block (tile) for the calling thread |
block_output | - the output iterator from the thread block to store to |
items | - array of items that is stored to the thread block |
ROCPRIM_DEVICE ROCPRIM_INLINE void | block_store_direct_striped (unsigned int flat_id, OutputIterator block_output, T(&items)[ItemsPerThread]) |
Stores a striped arrangement of items from across the thread block into a blocked arrangement on continuous memory.
The striped arrangement is assumed to be (BlockSize * ItemsPerThread) items across a thread block. Each thread uses a flat_id to store a range of ItemsPerThread items to the thread block.
BlockSize | - the number of threads in a block |
OutputIterator | - [inferred] an iterator type for output (can be a simple pointer) |
T | - [inferred] the data type |
ItemsPerThread | - [inferred] the number of items to be processed by each thread |
flat_id | - a local flat 1D thread id in a block (tile) for the calling thread |
block_output | - the output iterator from the thread block to store to |
items | - array of items that is stored to the thread block |
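
A host-side sketch of the striped store is given below. It is an illustration only: `emulate_store_striped` is a hypothetical helper, and it assumes that the i-th item of thread `flat_id` is written to position `flat_id + i * BlockSize` of the output.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side stand-in (not a rocPRIM function): writes one
// thread's items to output positions that are BlockSize elements apart.
template <unsigned int BlockSize, class T, std::size_t ItemsPerThread>
void emulate_store_striped(unsigned int flat_id,
                           std::vector<T>& block_output,
                           const std::array<T, ItemsPerThread>& items)
{
    assert(flat_id < BlockSize);
    assert(block_output.size() >= BlockSize * ItemsPerThread);
    for (std::size_t i = 0; i < ItemsPerThread; ++i)
    {
        // Assumed mapping: item i of this thread goes i strides further on.
        block_output[flat_id + i * BlockSize] = items[i];
    }
}
```

With `BlockSize` 4, the thread with `flat_id` 1 would write its two items to positions 1 and 5, so neighbouring threads write neighbouring addresses for each value of i.
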
ROCPRIM_DEVICE ROCPRIM_INLINE void | block_store_direct_striped (unsigned int flat_id, OutputIterator block_output, T(&items)[ItemsPerThread], unsigned int valid) |
Stores a striped arrangement of items from across the thread block into a blocked arrangement on continuous memory, which is guarded by range valid.
The striped arrangement is assumed to be (BlockSize * ItemsPerThread) items across a thread block. Each thread uses a flat_id to store a range of ItemsPerThread items to the thread block.
BlockSize | - the number of threads in a block |
OutputIterator | - [inferred] an iterator type for output (can be a simple pointer) |
T | - [inferred] the data type |
ItemsPerThread | - [inferred] the number of items to be processed by each thread |
flat_id | - a local flat 1D thread id in a block (tile) for the calling thread |
block_output | - the output iterator from the thread block to store to |
items | - array of items that is stored to the thread block |
valid | - maximum number of valid items to store |
ROCPRIM_DEVICE ROCPRIM_INLINE void | block_store_direct_warp_striped (unsigned int flat_id, OutputIterator block_output, T(&items)[ItemsPerThread]) |
Stores a warp-striped arrangement of items from across the thread block into a blocked arrangement on continuous memory.
The warp-striped arrangement is assumed to be (WarpSize * ItemsPerThread) items across a thread block. Each thread uses a flat_id to store a range of ItemsPerThread items to the thread block.
- The number of threads in the block must be a multiple of WarpSize.
- The default WarpSize is a hardware warp size and is an optimal value.
- WarpSize must be a power of two and equal to or less than the size of a hardware warp.
- Using a WarpSize smaller than the hardware warp size could result in lower performance.

WarpSize | - [optional] the number of threads in a warp |
OutputIterator | - [inferred] an iterator type for output (can be a simple pointer) |
T | - [inferred] the data type |
ItemsPerThread | - [inferred] the number of items to be processed by each thread |
flat_id | - a local flat 1D thread id in a block (tile) for the calling thread |
block_output | - the output iterator from the thread block to store to |
items | - array of items that is stored to the thread block |
ROCPRIM_DEVICE ROCPRIM_INLINE void | block_store_direct_warp_striped (unsigned int flat_id, OutputIterator block_output, T(&items)[ItemsPerThread], unsigned int valid) |
Stores a warp-striped arrangement of items from across the thread block into a blocked arrangement on continuous memory, which is guarded by range valid.
The warp-striped arrangement is assumed to be (WarpSize * ItemsPerThread) items across a thread block. Each thread uses a flat_id to store a range of ItemsPerThread items to the thread block.
- The number of threads in the block must be a multiple of WarpSize.
- The default WarpSize is a hardware warp size and is an optimal value.
- WarpSize must be a power of two and equal to or less than the size of a hardware warp.
- Using a WarpSize smaller than the hardware warp size could result in lower performance.

WarpSize | - [optional] the number of threads in a warp |
OutputIterator | - [inferred] an iterator type for output (can be a simple pointer) |
T | - [inferred] the data type |
ItemsPerThread | - [inferred] the number of items to be processed by each thread |
flat_id | - a local flat 1D thread id in a block (tile) for the calling thread |
block_output | - the output iterator from the thread block to store to |
items | - array of items that is stored to the thread block |
valid | - maximum number of valid items to store |