rocPRIM
Namespaces | |
| detail | |
| Deprecated: Configuration of device-level scan primitives. | |
Classes | |
| class | block_adjacent_difference< T, BlockSizeX, BlockSizeY, BlockSizeZ > |
| The block_adjacent_difference class is a block level parallel primitive which provides methods for applying binary functions to pairs of consecutive items partitioned across a thread block. More... | |
| class | block_discontinuity< T, BlockSizeX, BlockSizeY, BlockSizeZ > |
| The block_discontinuity class is a block level parallel primitive which provides methods for flagging discontinuities within an ordered set of items across threads in a block. More... | |
| class | block_exchange< T, BlockSizeX, ItemsPerThread, BlockSizeY, BlockSizeZ > |
The block_exchange class is a block level parallel primitive which provides methods for rearranging items partitioned across threads in a block. More... | |
| class | block_histogram< T, BlockSizeX, ItemsPerThread, Bins, Algorithm, BlockSizeY, BlockSizeZ > |
| The block_histogram class is a block level parallel primitive which provides methods for constructing block-wide histograms from items partitioned across threads in a block. More... | |
| class | block_load< T, BlockSizeX, ItemsPerThread, Method, BlockSizeY, BlockSizeZ > |
The block_load class is a block level parallel primitive which provides methods for loading data from continuous memory into a blocked arrangement of items across the thread block. More... | |
| class | block_load< T, BlockSizeX, ItemsPerThread, block_load_method::block_load_striped, BlockSizeY, BlockSizeZ > |
| class | block_load< T, BlockSizeX, ItemsPerThread, block_load_method::block_load_vectorize, BlockSizeY, BlockSizeZ > |
| class | block_load< T, BlockSizeX, ItemsPerThread, block_load_method::block_load_transpose, BlockSizeY, BlockSizeZ > |
| class | block_load< T, BlockSizeX, ItemsPerThread, block_load_method::block_load_warp_transpose, BlockSizeY, BlockSizeZ > |
| class | block_radix_rank< BlockSizeX, RadixBits, Algorithm, BlockSizeY, BlockSizeZ > |
| The block_radix_rank class is a block level parallel primitive that provides methods for ranking items partitioned across threads in a block. More... | |
| class | block_radix_sort< Key, BlockSizeX, ItemsPerThread, Value, BlockSizeY, BlockSizeZ > |
| The block_radix_sort class is a block level parallel primitive which provides methods for sorting items (keys or key-value pairs) partitioned across threads in a block using the radix sort algorithm. More... | |
| class | block_reduce< T, BlockSizeX, Algorithm, BlockSizeY, BlockSizeZ > |
| The block_reduce class is a block level parallel primitive which provides methods for performing reduction operations on items partitioned across threads in a block. More... | |
| class | block_scan< T, BlockSizeX, Algorithm, BlockSizeY, BlockSizeZ > |
| The block_scan class is a block level parallel primitive which provides methods for performing inclusive and exclusive scan operations of items partitioned across threads in a block. More... | |
| class | block_shuffle< T, BlockSizeX, BlockSizeY, BlockSizeZ > |
| The block_shuffle class is a block level parallel primitive which provides methods for shuffling data partitioned across a block. More... | |
| class | block_sort< Key, BlockSizeX, ItemsPerThread, Value, Algorithm, BlockSizeY, BlockSizeZ > |
| The block_sort class is a block level parallel primitive which provides methods for sorting items (keys or key-value pairs) partitioned across threads in a block using a comparison-based sort algorithm. More... | |
| class | block_store< T, BlockSizeX, ItemsPerThread, Method, BlockSizeY, BlockSizeZ > |
| The block_store class is a block level parallel primitive which provides methods for storing an arrangement of items into a blocked/striped arrangement on continuous memory. More... | |
| class | block_store< T, BlockSizeX, ItemsPerThread, block_store_method::block_store_striped, BlockSizeY, BlockSizeZ > |
| class | block_store< T, BlockSizeX, ItemsPerThread, block_store_method::block_store_vectorize, BlockSizeY, BlockSizeZ > |
| class | block_store< T, BlockSizeX, ItemsPerThread, block_store_method::block_store_transpose, BlockSizeY, BlockSizeZ > |
| class | block_store< T, BlockSizeX, ItemsPerThread, block_store_method::block_store_warp_transpose, BlockSizeY, BlockSizeZ > |
Functions | |
| template<class InputIterator , class T , unsigned int ItemsPerThread> | |
| ROCPRIM_DEVICE ROCPRIM_INLINE void | block_load_direct_blocked (unsigned int flat_id, InputIterator block_input, T(&items)[ItemsPerThread]) |
| Loads data from continuous memory into a blocked arrangement of items across the thread block. More... | |
| template<class InputIterator , class T , unsigned int ItemsPerThread> | |
| ROCPRIM_DEVICE ROCPRIM_INLINE void | block_load_direct_blocked (unsigned int flat_id, InputIterator block_input, T(&items)[ItemsPerThread], unsigned int valid) |
Loads data from continuous memory into a blocked arrangement of items across the thread block, which is guarded by range valid. More... | |
| template<class InputIterator , class T , unsigned int ItemsPerThread, class Default > | |
| ROCPRIM_DEVICE ROCPRIM_INLINE void | block_load_direct_blocked (unsigned int flat_id, InputIterator block_input, T(&items)[ItemsPerThread], unsigned int valid, Default out_of_bounds) |
| Loads data from continuous memory into a blocked arrangement of items across the thread block, which is guarded by range valid, with a fall-back value for out-of-bound elements. More... | |
| template<class T , class U , unsigned int ItemsPerThread> | |
| ROCPRIM_DEVICE ROCPRIM_INLINE auto | block_load_direct_blocked_vectorized (unsigned int flat_id, T *block_input, U(&items)[ItemsPerThread]) -> typename std::enable_if< detail::is_vectorizable< T, ItemsPerThread >::value >::type |
| Loads data from continuous memory into a blocked arrangement of items across the thread block. More... | |
| template<unsigned int BlockSize, class InputIterator , class T , unsigned int ItemsPerThread> | |
| ROCPRIM_DEVICE ROCPRIM_INLINE void | block_load_direct_striped (unsigned int flat_id, InputIterator block_input, T(&items)[ItemsPerThread]) |
| Loads data from continuous memory into a striped arrangement of items across the thread block. More... | |
| template<unsigned int BlockSize, class InputIterator , class T , unsigned int ItemsPerThread> | |
| ROCPRIM_DEVICE ROCPRIM_INLINE void | block_load_direct_striped (unsigned int flat_id, InputIterator block_input, T(&items)[ItemsPerThread], unsigned int valid) |
Loads data from continuous memory into a striped arrangement of items across the thread block, which is guarded by range valid. More... | |
| template<unsigned int BlockSize, class InputIterator , class T , unsigned int ItemsPerThread, class Default > | |
| ROCPRIM_DEVICE ROCPRIM_INLINE void | block_load_direct_striped (unsigned int flat_id, InputIterator block_input, T(&items)[ItemsPerThread], unsigned int valid, Default out_of_bounds) |
| Loads data from continuous memory into a striped arrangement of items across the thread block, which is guarded by range valid, with a fall-back value for out-of-bound elements. More... | |
| template<unsigned int WarpSize = device_warp_size(), class InputIterator , class T , unsigned int ItemsPerThread> | |
| ROCPRIM_DEVICE ROCPRIM_INLINE void | block_load_direct_warp_striped (unsigned int flat_id, InputIterator block_input, T(&items)[ItemsPerThread]) |
| Loads data from continuous memory into a warp-striped arrangement of items across the thread block. More... | |
| template<unsigned int WarpSize = device_warp_size(), class InputIterator , class T , unsigned int ItemsPerThread> | |
| ROCPRIM_DEVICE ROCPRIM_INLINE void | block_load_direct_warp_striped (unsigned int flat_id, InputIterator block_input, T(&items)[ItemsPerThread], unsigned int valid) |
Loads data from continuous memory into a warp-striped arrangement of items across the thread block, which is guarded by range valid. More... | |
| template<unsigned int WarpSize = device_warp_size(), class InputIterator , class T , unsigned int ItemsPerThread, class Default > | |
| ROCPRIM_DEVICE ROCPRIM_INLINE void | block_load_direct_warp_striped (unsigned int flat_id, InputIterator block_input, T(&items)[ItemsPerThread], unsigned int valid, Default out_of_bounds) |
| Loads data from continuous memory into a warp-striped arrangement of items across the thread block, which is guarded by range valid, with a fall-back value for out-of-bound elements. More... | |
| template<class OutputIterator , class T , unsigned int ItemsPerThread> | |
| ROCPRIM_DEVICE ROCPRIM_INLINE void | block_store_direct_blocked (unsigned int flat_id, OutputIterator block_output, T(&items)[ItemsPerThread]) |
| Stores a blocked arrangement of items from across the thread block into a blocked arrangement on continuous memory. More... | |
| template<class OutputIterator , class T , unsigned int ItemsPerThread> | |
| ROCPRIM_DEVICE ROCPRIM_INLINE void | block_store_direct_blocked (unsigned int flat_id, OutputIterator block_output, T(&items)[ItemsPerThread], unsigned int valid) |
Stores a blocked arrangement of items from across the thread block into a blocked arrangement on continuous memory, which is guarded by range valid. More... | |
| template<class T , class U , unsigned int ItemsPerThread> | |
| ROCPRIM_DEVICE ROCPRIM_INLINE auto | block_store_direct_blocked_vectorized (unsigned int flat_id, T *block_output, U(&items)[ItemsPerThread]) -> typename std::enable_if< detail::is_vectorizable< T, ItemsPerThread >::value >::type |
| Stores a blocked arrangement of items from across the thread block into a blocked arrangement on continuous memory. More... | |
| template<unsigned int BlockSize, class OutputIterator , class T , unsigned int ItemsPerThread> | |
| ROCPRIM_DEVICE ROCPRIM_INLINE void | block_store_direct_striped (unsigned int flat_id, OutputIterator block_output, T(&items)[ItemsPerThread]) |
| Stores a striped arrangement of items from across the thread block into a blocked arrangement on continuous memory. More... | |
| template<unsigned int BlockSize, class OutputIterator , class T , unsigned int ItemsPerThread> | |
| ROCPRIM_DEVICE ROCPRIM_INLINE void | block_store_direct_striped (unsigned int flat_id, OutputIterator block_output, T(&items)[ItemsPerThread], unsigned int valid) |
Stores a striped arrangement of items from across the thread block into a blocked arrangement on continuous memory, which is guarded by range valid. More... | |
| template<unsigned int WarpSize = device_warp_size(), class OutputIterator , class T , unsigned int ItemsPerThread> | |
| ROCPRIM_DEVICE ROCPRIM_INLINE void | block_store_direct_warp_striped (unsigned int flat_id, OutputIterator block_output, T(&items)[ItemsPerThread]) |
| Stores a warp-striped arrangement of items from across the thread block into a blocked arrangement on continuous memory. More... | |
| template<unsigned int WarpSize = device_warp_size(), class OutputIterator , class T , unsigned int ItemsPerThread> | |
| ROCPRIM_DEVICE ROCPRIM_INLINE void | block_store_direct_warp_striped (unsigned int flat_id, OutputIterator block_output, T(&items)[ItemsPerThread], unsigned int valid) |
Stores a warp-striped arrangement of items from across the thread block into a blocked arrangement on continuous memory, which is guarded by range valid. More... | |
Strongly typed enumeration (enum class).
Available algorithms for the block_histogram primitive.
| Enumerator | |
|---|---|
| using_atomic | Atomic addition is used to update bin count directly. |
| using_sort | A two-phase, sort-based operation is used. |
| default_algorithm | Default block_histogram algorithm. |
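The idea behind `using_atomic` can be sketched on the host: every item increments the counter of its bin with an atomic addition. This is plain C++ built on `std::atomic`, not rocPRIM code; `histogram_atomic` and its parameters are hypothetical names, and the items are assumed to already be valid bin indices.

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

// Hypothetical host-side sketch of atomic bin updates (not the rocPRIM
// implementation). Every item must be a valid bin index in [0, bins).
inline std::vector<unsigned int> histogram_atomic(const std::vector<unsigned int>& items,
                                                  std::size_t bins)
{
    std::vector<std::atomic<unsigned int>> counts(bins);
    for (auto& c : counts)
        c.store(0u); // explicit zeroing; default-constructed atomics may be uninitialized

    for (unsigned int item : items)
        counts[item].fetch_add(1u); // stays correct even if several threads hit one bin

    std::vector<unsigned int> result(bins);
    for (std::size_t b = 0; b < bins; ++b)
        result[b] = counts[b].load();
    return result;
}
```

The loop here runs on a single thread; the atomic addition is what would keep the counts correct if many threads updated the same bin at once.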
Strongly typed enumeration (enum class).
block_load_method enumerates the methods available to load data from continuous memory into a blocked arrangement of items across the thread block.
Strongly typed enumeration (enum class).
Available algorithms for the block_radix_rank primitive.
Strongly typed enumeration (enum class).
Available algorithms for the block_reduce primitive.
| Enumerator | |
|---|---|
| using_warp_reduce | A warp_reduce based algorithm. |
| raking_reduce | An algorithm which limits calculations to a single hardware warp. |
| raking_reduce_commutative_only | A raking_reduce variant that supports only commutative operators. |
| default_algorithm | Default block_reduce algorithm. |
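Why a restriction to commutative operators matters can be illustrated on the host: if the order in which items are combined changes, a commutative operator still yields the same result, while a non-commutative one does not. The sketch below does not reproduce any rocPRIM algorithm; `reduce_in_order` is a hypothetical helper that folds values in a caller-supplied visiting order.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: folds `values` in the visiting order given by `order`,
// which must be a non-empty list of valid indices into `values`.
template <class T, class BinaryOp>
T reduce_in_order(const std::vector<T>& values,
                  const std::vector<std::size_t>& order,
                  BinaryOp op)
{
    T result = values[order[0]];
    for (std::size_t k = 1; k < order.size(); ++k)
        result = op(result, values[order[k]]);
    return result;
}
```

With integer addition the result does not depend on `order`; with string concatenation it does, so concatenation would not be a suitable operator for an algorithm limited to commutative operators.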
Strongly typed enumeration (enum class).
Available algorithms for the block_scan primitive.
| Enumerator | |
|---|---|
| using_warp_scan | A warp_scan based algorithm. |
| reduce_then_scan | An algorithm which limits calculations to a single hardware warp. |
| default_algorithm | Default block_scan algorithm. |
Strongly typed enumeration (enum class).
Available algorithms for the block_sort primitive.
| Enumerator | |
|---|---|
| bitonic_sort | A bitonic sort based algorithm. |
| merge_sort | A merge sort based algorithm. |
| stable_merge_sort | A merge sort based algorithm which sorts stably. |
| default_algorithm | Default block_sort algorithm. |
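What "sorts stably" guarantees can be shown with a small host-side example: items with equal keys keep their input order. The sketch uses `std::stable_sort` from the C++ standard library rather than rocPRIM; `key_value` and `sort_by_key_stable` are hypothetical names.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

using key_value = std::pair<int, char>; // (key, value)

// Sorts by key only. std::stable_sort keeps pairs with equal keys in their
// original relative order, which is the property a stable sort provides.
inline std::vector<key_value> sort_by_key_stable(std::vector<key_value> pairs)
{
    std::stable_sort(pairs.begin(), pairs.end(),
                     [](const key_value& a, const key_value& b) { return a.first < b.first; });
    return pairs;
}
```

Sorting `(2,a) (1,b) (2,c) (1,d)` this way gives `(1,b) (1,d) (2,a) (2,c)`: within each key the values stay in input order.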
Strongly typed enumeration (enum class).
block_store_method enumerates the methods available to store a striped arrangement of items into a blocked/striped arrangement on continuous memory.
| ROCPRIM_DEVICE ROCPRIM_INLINE void | block_load_direct_blocked (unsigned int flat_id, InputIterator block_input, T(&items)[ItemsPerThread]) |
Loads data from continuous memory into a blocked arrangement of items across the thread block.
The block arrangement is assumed to be (block-threads * ItemsPerThread) items across a thread block. Each thread uses a flat_id to load a range of ItemsPerThread items into items.
Template Parameters
| InputIterator | - [inferred] an iterator type for input (can be a simple pointer) |
| T | - [inferred] the data type |
| ItemsPerThread | - [inferred] the number of items to be processed by each thread |
Parameters
| flat_id | - a local flat 1D thread id in a block (tile) for the calling thread |
| block_input | - the input iterator from the thread block to load from |
| items | - array that data is loaded to |
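The blocked arrangement can be pictured with a host-side emulation of a single thread: thread `flat_id` reads `ItemsPerThread` consecutive elements starting at `flat_id * ItemsPerThread`. This is a sketch in standard C++, not the device function; `emulate_load_blocked` is a hypothetical name and a `std::vector` stands in for `block_input`.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical host-side emulation of one thread's blocked load.
// Thread `flat_id` owns the consecutive range
// [flat_id * ItemsPerThread, (flat_id + 1) * ItemsPerThread).
template <class T, std::size_t ItemsPerThread>
std::array<T, ItemsPerThread> emulate_load_blocked(unsigned int flat_id,
                                                   const std::vector<T>& block_input)
{
    std::array<T, ItemsPerThread> items{};
    const std::size_t offset = static_cast<std::size_t>(flat_id) * ItemsPerThread;
    for (std::size_t i = 0; i < ItemsPerThread; ++i)
        items[i] = block_input[offset + i];
    return items;
}
```

For an input `0..7` and two items per thread, the emulated thread 1 ends up with `{2, 3}` and thread 3 with `{6, 7}`.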
| ROCPRIM_DEVICE ROCPRIM_INLINE void | block_load_direct_blocked (unsigned int flat_id, InputIterator block_input, T(&items)[ItemsPerThread], unsigned int valid) |
Loads data from continuous memory into a blocked arrangement of items across the thread block, which is guarded by range valid.
The block arrangement is assumed to be (block-threads * ItemsPerThread) items across a thread block. Each thread uses a flat_id to load a range of ItemsPerThread items into items.
Template Parameters
| InputIterator | - [inferred] an iterator type for input (can be a simple pointer) |
| T | - [inferred] the data type |
| ItemsPerThread | - [inferred] the number of items to be processed by each thread |
Parameters
| flat_id | - a local flat 1D thread id in a block (tile) for the calling thread |
| block_input | - the input iterator from the thread block to load from |
| items | - array that data is loaded to |
| valid | - maximum range of valid items to load |
| ROCPRIM_DEVICE ROCPRIM_INLINE void | block_load_direct_blocked (unsigned int flat_id, InputIterator block_input, T(&items)[ItemsPerThread], unsigned int valid, Default out_of_bounds) |
Loads data from continuous memory into a blocked arrangement of items across the thread block, which is guarded by range valid, with a fall-back value for out-of-bound elements.
The block arrangement is assumed to be (block-threads * ItemsPerThread) items across a thread block. Each thread uses a flat_id to load a range of ItemsPerThread items into items.
Template Parameters
| InputIterator | - [inferred] an iterator type for input (can be a simple pointer) |
| T | - [inferred] the data type |
| ItemsPerThread | - [inferred] the number of items to be processed by each thread |
| Default | - [inferred] the data type of the default value |
Parameters
| flat_id | - a local flat 1D thread id in a block (tile) for the calling thread |
| block_input | - the input iterator from the thread block to load from |
| items | - array that data is loaded to |
| valid | - maximum range of valid items to load |
| out_of_bounds | - default value assigned to out-of-bound items |
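A guarded load with a fall-back value can be emulated on the host as follows: every slot starts at `out_of_bounds`, and only positions below `valid` are overwritten from the input. The function name `emulate_load_blocked_or_default` is hypothetical, and the sketch assumes `valid` counts elements from the start of `block_input`.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical host-side emulation of a guarded blocked load with a default.
// Assumption: `valid` is the number of readable elements in `block_input`.
template <class T, std::size_t ItemsPerThread>
std::array<T, ItemsPerThread> emulate_load_blocked_or_default(unsigned int flat_id,
                                                              const std::vector<T>& block_input,
                                                              unsigned int valid,
                                                              T out_of_bounds)
{
    std::array<T, ItemsPerThread> items;
    items.fill(out_of_bounds); // every slot starts at the fall-back value

    const std::size_t offset = static_cast<std::size_t>(flat_id) * ItemsPerThread;
    for (std::size_t i = 0; i < ItemsPerThread; ++i)
    {
        if (offset + i < valid) // only positions inside the valid range are read
            items[i] = block_input[offset + i];
    }
    return items;
}
```

With five valid elements and two items per thread, the emulated thread 2 receives one real element followed by the fall-back value, and thread 3 receives only fall-back values.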
| ROCPRIM_DEVICE ROCPRIM_INLINE auto | block_load_direct_blocked_vectorized (unsigned int flat_id, T *block_input, U(&items)[ItemsPerThread]) -> typename std::enable_if< detail::is_vectorizable< T, ItemsPerThread >::value >::type |
Loads data from continuous memory into a blocked arrangement of items across the thread block.
The block arrangement is assumed to be (block-threads * ItemsPerThread) items across a thread block. Each thread uses a flat_id to load a range of ItemsPerThread items into items.
The input offset (block_input + offset) must be quad-item aligned.
The following conditions will prevent vectorization and switch to default block_load_direct_blocked:
- ItemsPerThread is odd.
- T is not a primitive or a HIP vector type (e.g. int2, int4, etc.).
Template Parameters
| T | - [inferred] the input data type |
| U | - [inferred] the output data type |
| ItemsPerThread | - [inferred] the number of items to be processed by each thread |
The type T must be such that it can be implicitly converted to U.
Parameters
| flat_id | - a local flat 1D thread id in a block (tile) for the calling thread |
| block_input | - the input iterator from the thread block to load from |
| items | - array that data is loaded to |
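The alignment requirement can be checked on the host before launching a kernel. The helper below is a sketch under one assumption: "quad-item aligned" is read here as "the address is a multiple of four items, i.e. `4 * sizeof(T)` bytes". `is_quad_item_aligned` is a hypothetical name and not part of rocPRIM.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical check. Assumption: "quad-item aligned" means the address is a
// multiple of 4 * sizeof(T) bytes.
template <class T>
bool is_quad_item_aligned(const T* ptr)
{
    const auto address = reinterpret_cast<std::uintptr_t>(ptr);
    return address % (4 * sizeof(T)) == 0;
}
```

Under this reading, a pointer to the start of a suitably aligned buffer passes the check, a pointer one element further fails it, and a pointer four elements further passes again.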
| ROCPRIM_DEVICE ROCPRIM_INLINE void | block_load_direct_striped (unsigned int flat_id, InputIterator block_input, T(&items)[ItemsPerThread]) |
Loads data from continuous memory into a striped arrangement of items across the thread block.
The striped arrangement is assumed to be (BlockSize * ItemsPerThread) items across a thread block. Each thread uses a flat_id to load a range of ItemsPerThread items into items.
Template Parameters
| BlockSize | - the number of threads in a block |
| InputIterator | - [inferred] an iterator type for input (can be a simple pointer) |
| T | - [inferred] the data type |
| ItemsPerThread | - [inferred] the number of items to be processed by each thread |
Parameters
| flat_id | - a local flat 1D thread id in a block (tile) for the calling thread |
| block_input | - the input iterator from the thread block to load from |
| items | - array that data is loaded to |
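In a striped arrangement, consecutive items of one thread are `BlockSize` elements apart in memory. A host-side emulation of one thread might look like the sketch below; `emulate_load_striped` is a hypothetical name and a `std::vector` stands in for `block_input`.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical host-side emulation of one thread's striped load.
// Item i of thread `flat_id` comes from index flat_id + i * BlockSize.
template <unsigned int BlockSize, class T, std::size_t ItemsPerThread>
std::array<T, ItemsPerThread> emulate_load_striped(unsigned int flat_id,
                                                   const std::vector<T>& block_input)
{
    std::array<T, ItemsPerThread> items{};
    for (std::size_t i = 0; i < ItemsPerThread; ++i)
        items[i] = block_input[flat_id + i * BlockSize];
    return items;
}
```

For an input `0..7`, a block size of 4 and two items per thread, the emulated thread 1 reads `{1, 5}` and thread 3 reads `{3, 7}`.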
| ROCPRIM_DEVICE ROCPRIM_INLINE void | block_load_direct_striped (unsigned int flat_id, InputIterator block_input, T(&items)[ItemsPerThread], unsigned int valid) |
Loads data from continuous memory into a striped arrangement of items across the thread block, which is guarded by range valid.
The striped arrangement is assumed to be (BlockSize * ItemsPerThread) items across a thread block. Each thread uses a flat_id to load a range of ItemsPerThread items into items.
Template Parameters
| BlockSize | - the number of threads in a block |
| InputIterator | - [inferred] an iterator type for input (can be a simple pointer) |
| T | - [inferred] the data type |
| ItemsPerThread | - [inferred] the number of items to be processed by each thread |
Parameters
| flat_id | - a local flat 1D thread id in a block (tile) for the calling thread |
| block_input | - the input iterator from the thread block to load from |
| items | - array that data is loaded to |
| valid | - maximum range of valid items to load |
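The range guard combines with the striped index as sketched below: each candidate index is compared against `valid` before it is read. In this host-side emulation, slots whose index falls outside the range are simply left untouched, which is an assumption of the sketch; `emulate_load_striped_guarded` is a hypothetical name.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical host-side emulation of one thread's guarded striped load.
// Assumption: slots whose index is >= valid keep their previous content.
template <unsigned int BlockSize, class T, std::size_t ItemsPerThread>
void emulate_load_striped_guarded(unsigned int flat_id,
                                  const std::vector<T>& block_input,
                                  std::array<T, ItemsPerThread>& items,
                                  unsigned int valid)
{
    for (std::size_t i = 0; i < ItemsPerThread; ++i)
    {
        const std::size_t index = flat_id + i * BlockSize;
        if (index < valid)
            items[i] = block_input[index];
    }
}
```

With six valid elements and a block size of 4, the emulated thread 2 reads index 2 but skips index 6, so its second slot keeps whatever value the caller placed there.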
| ROCPRIM_DEVICE ROCPRIM_INLINE void | block_load_direct_striped (unsigned int flat_id, InputIterator block_input, T(&items)[ItemsPerThread], unsigned int valid, Default out_of_bounds) |
Loads data from continuous memory into a striped arrangement of items across the thread block, which is guarded by range valid, with a fall-back value for out-of-bound elements.
The striped arrangement is assumed to be (BlockSize * ItemsPerThread) items across a thread block. Each thread uses a flat_id to load a range of ItemsPerThread items into items.
Template Parameters
| BlockSize | - the number of threads in a block |
| InputIterator | - [inferred] an iterator type for input (can be a simple pointer) |
| T | - [inferred] the data type |
| ItemsPerThread | - [inferred] the number of items to be processed by each thread |
| Default | - [inferred] the data type of the default value |
Parameters
| flat_id | - a local flat 1D thread id in a block (tile) for the calling thread |
| block_input | - the input iterator from the thread block to load from |
| items | - array that data is loaded to |
| valid | - maximum range of valid items to load |
| out_of_bounds | - default value assigned to out-of-bound items |
| ROCPRIM_DEVICE ROCPRIM_INLINE void | block_load_direct_warp_striped (unsigned int flat_id, InputIterator block_input, T(&items)[ItemsPerThread]) |
Loads data from continuous memory into a warp-striped arrangement of items across the thread block.
The warp-striped arrangement is assumed to be (WarpSize * ItemsPerThread) items across a thread block. Each thread uses a flat_id to load a range of ItemsPerThread items into items.
- The default WarpSize is the hardware warp size, which is an optimal value.
- WarpSize must be a power of two and less than or equal to the hardware warp size.
- A WarpSize smaller than the hardware warp size could result in lower performance.
Template Parameters
| WarpSize | - [optional] the number of threads in a warp |
| InputIterator | - [inferred] an iterator type for input (can be a simple pointer) |
| T | - [inferred] the data type |
| ItemsPerThread | - [inferred] the number of items to be processed by each thread |
Parameters
| flat_id | - a local flat 1D thread id in a block (tile) for the calling thread |
| block_input | - the input iterator from the thread block to load from |
| items | - array that data is loaded to |
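A warp-striped arrangement can be read as "striped within each warp": every warp owns a consecutive tile of `WarpSize * ItemsPerThread` elements, and inside that tile the items of one thread are `WarpSize` elements apart. The host-side sketch below emulates one thread under that reading; `emulate_load_warp_striped` is a hypothetical name, and the lane is assumed to be `flat_id % WarpSize`.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical host-side emulation of one thread's warp-striped load.
// Assumptions: lane = flat_id % WarpSize, and warp w owns the tile starting
// at w * WarpSize * ItemsPerThread.
template <unsigned int WarpSize, class T, std::size_t ItemsPerThread>
std::array<T, ItemsPerThread> emulate_load_warp_striped(unsigned int flat_id,
                                                        const std::vector<T>& block_input)
{
    static_assert(WarpSize != 0 && (WarpSize & (WarpSize - 1)) == 0,
                  "WarpSize must be a power of two");

    const std::size_t lane        = flat_id % WarpSize;
    const std::size_t warp_offset = (flat_id / WarpSize) * WarpSize * ItemsPerThread;

    std::array<T, ItemsPerThread> items{};
    for (std::size_t i = 0; i < ItemsPerThread; ++i)
        items[i] = block_input[warp_offset + lane + i * WarpSize];
    return items;
}
```

With an emulated warp size of 2, two items per thread and an input `0..7`, thread 1 reads `{1, 3}` from the first tile and thread 2 reads `{4, 6}` from the second.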
| ROCPRIM_DEVICE ROCPRIM_INLINE void | block_load_direct_warp_striped (unsigned int flat_id, InputIterator block_input, T(&items)[ItemsPerThread], unsigned int valid) |
Loads data from continuous memory into a warp-striped arrangement of items across the thread block, which is guarded by range valid.
The warp-striped arrangement is assumed to be (WarpSize * ItemsPerThread) items across a thread block. Each thread uses a flat_id to load a range of ItemsPerThread items into items.
- The default WarpSize is the hardware warp size, which is an optimal value.
- WarpSize must be a power of two and less than or equal to the hardware warp size.
- A WarpSize smaller than the hardware warp size could result in lower performance.
Template Parameters
| WarpSize | - [optional] the number of threads in a warp |
| InputIterator | - [inferred] an iterator type for input (can be a simple pointer) |
| T | - [inferred] the data type |
| ItemsPerThread | - [inferred] the number of items to be processed by each thread |
Parameters
| flat_id | - a local flat 1D thread id in a block (tile) for the calling thread |
| block_input | - the input iterator from the thread block to load from |
| items | - array that data is loaded to |
| valid | - maximum range of valid items to load |
| ROCPRIM_DEVICE ROCPRIM_INLINE void | block_load_direct_warp_striped (unsigned int flat_id, InputIterator block_input, T(&items)[ItemsPerThread], unsigned int valid, Default out_of_bounds) |
Loads data from continuous memory into a warp-striped arrangement of items across the thread block, which is guarded by range valid, with a fall-back value for out-of-bound elements.
The warp-striped arrangement is assumed to be (WarpSize * ItemsPerThread) items across a thread block. Each thread uses a flat_id to load a range of ItemsPerThread items into items.
- The default WarpSize is the hardware warp size, which is an optimal value.
- WarpSize must be a power of two and less than or equal to the hardware warp size.
- A WarpSize smaller than the hardware warp size could result in lower performance.
Template Parameters
| WarpSize | - [optional] the number of threads in a warp |
| InputIterator | - [inferred] an iterator type for input (can be a simple pointer) |
| T | - [inferred] the data type |
| ItemsPerThread | - [inferred] the number of items to be processed by each thread |
| Default | - [inferred] the data type of the default value |
Parameters
| flat_id | - a local flat 1D thread id in a block (tile) for the calling thread |
| block_input | - the input iterator from the thread block to load from |
| items | - array that data is loaded to |
| valid | - maximum range of valid items to load |
| out_of_bounds | - default value assigned to out-of-bound items |
| ROCPRIM_DEVICE ROCPRIM_INLINE void | block_store_direct_blocked (unsigned int flat_id, OutputIterator block_output, T(&items)[ItemsPerThread]) |
Stores a blocked arrangement of items from across the thread block into a blocked arrangement on continuous memory.
The block arrangement is assumed to be (block-threads * ItemsPerThread) items across a thread block. Each thread uses a flat_id to store a range of ItemsPerThread items to the thread block.
Template Parameters
| OutputIterator | - [inferred] an iterator type for output (can be a simple pointer) |
| T | - [inferred] the data type |
| ItemsPerThread | - [inferred] the number of items to be processed by each thread |
Parameters
| flat_id | - a local flat 1D thread id in a block (tile) for the calling thread |
| block_output | - the output iterator from the thread block to store to |
| items | - array of items that is stored to the thread block |
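A blocked store mirrors the blocked load: thread `flat_id` writes its items to consecutive positions starting at `flat_id * ItemsPerThread`. The host-side sketch below emulates one thread; `emulate_store_blocked` is a hypothetical name and a `std::vector` stands in for `block_output`.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical host-side emulation of one thread's blocked store.
// Item i of thread `flat_id` goes to index flat_id * ItemsPerThread + i.
template <class T, std::size_t ItemsPerThread>
void emulate_store_blocked(unsigned int flat_id,
                           std::vector<T>& block_output,
                           const std::array<T, ItemsPerThread>& items)
{
    const std::size_t offset = static_cast<std::size_t>(flat_id) * ItemsPerThread;
    for (std::size_t i = 0; i < ItemsPerThread; ++i)
        block_output[offset + i] = items[i];
}
```

Storing `{7, 8}` as thread 1 with two items per thread fills positions 2 and 3 of the output and leaves every other position unchanged.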
| ROCPRIM_DEVICE ROCPRIM_INLINE void | block_store_direct_blocked (unsigned int flat_id, OutputIterator block_output, T(&items)[ItemsPerThread], unsigned int valid) |
Stores a blocked arrangement of items from across the thread block into a blocked arrangement on continuous memory, which is guarded by range valid.
The block arrangement is assumed to be (block-threads * ItemsPerThread) items across a thread block. Each thread uses a flat_id to store a range of ItemsPerThread items to the thread block.
Template Parameters
| OutputIterator | - [inferred] an iterator type for output (can be a simple pointer) |
| T | - [inferred] the data type |
| ItemsPerThread | - [inferred] the number of items to be processed by each thread |
Parameters
| flat_id | - a local flat 1D thread id in a block (tile) for the calling thread |
| block_output | - the output iterator from the thread block to store to |
| items | - array of items that is stored to the thread block |
| valid | - maximum range of valid items to store |
| ROCPRIM_DEVICE ROCPRIM_INLINE auto | block_store_direct_blocked_vectorized (unsigned int flat_id, T *block_output, U(&items)[ItemsPerThread]) -> typename std::enable_if< detail::is_vectorizable< T, ItemsPerThread >::value >::type |
Stores a blocked arrangement of items from across the thread block into a blocked arrangement on continuous memory.
The block arrangement is assumed to be (block-threads * ItemsPerThread) items across a thread block. Each thread uses a flat_id to store a range of ItemsPerThread items to the thread block.
The output offset (block_output + offset) must be quad-item aligned.
The following conditions will prevent vectorization and switch to default block_store_direct_blocked:
- ItemsPerThread is odd.
- T is not a primitive or a HIP vector type (e.g. int2, int4, etc.).
Template Parameters
| T | - [inferred] the output data type |
| U | - [inferred] the input data type |
| ItemsPerThread | - [inferred] the number of items to be processed by each thread |
The type U must be such that it can be implicitly converted to T.
Parameters
| flat_id | - a local flat 1D thread id in a block (tile) for the calling thread |
| block_output | - the output iterator from the thread block to store to |
| items | - array of items that is stored to the thread block |
| ROCPRIM_DEVICE ROCPRIM_INLINE void | block_store_direct_striped (unsigned int flat_id, OutputIterator block_output, T(&items)[ItemsPerThread]) |
Stores a striped arrangement of items from across the thread block into a blocked arrangement on continuous memory.
The striped arrangement is assumed to be (BlockSize * ItemsPerThread) items across a thread block. Each thread uses a flat_id to store a range of ItemsPerThread items to the thread block.
Template Parameters
| BlockSize | - the number of threads in a block |
| OutputIterator | - [inferred] an iterator type for output (can be a simple pointer) |
| T | - [inferred] the data type |
| ItemsPerThread | - [inferred] the number of items to be processed by each thread |
Parameters
| flat_id | - a local flat 1D thread id in a block (tile) for the calling thread |
| block_output | - the output iterator from the thread block to store to |
| items | - array of items that is stored to the thread block |
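When every thread of a block performs a striped store, the per-thread items interleave in memory. The host-side sketch below emulates one thread, assuming item `i` of thread `flat_id` goes to index `flat_id + i * BlockSize`; `emulate_store_striped` is a hypothetical name and a `std::vector` stands in for `block_output`.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical host-side emulation of one thread's striped store.
// Assumption: item i of thread `flat_id` goes to index flat_id + i * BlockSize.
template <unsigned int BlockSize, class T, std::size_t ItemsPerThread>
void emulate_store_striped(unsigned int flat_id,
                           std::vector<T>& block_output,
                           const std::array<T, ItemsPerThread>& items)
{
    for (std::size_t i = 0; i < ItemsPerThread; ++i)
        block_output[flat_id + i * BlockSize] = items[i];
}
```

Calling it for both threads of an emulated two-thread block, with thread 0 holding `{0, 2, 4}` and thread 1 holding `{1, 3, 5}`, produces the output `0, 1, 2, 3, 4, 5`.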
| ROCPRIM_DEVICE ROCPRIM_INLINE void | block_store_direct_striped (unsigned int flat_id, OutputIterator block_output, T(&items)[ItemsPerThread], unsigned int valid) |
Stores a striped arrangement of items from across the thread block into a blocked arrangement on continuous memory, which is guarded by range valid.
The striped arrangement is assumed to be (BlockSize * ItemsPerThread) items across a thread block. Each thread uses a flat_id to store a range of ItemsPerThread items to the thread block.
Template Parameters
| BlockSize | - the number of threads in a block |
| OutputIterator | - [inferred] an iterator type for output (can be a simple pointer) |
| T | - [inferred] the data type |
| ItemsPerThread | - [inferred] the number of items to be processed by each thread |
Parameters
| flat_id | - a local flat 1D thread id in a block (tile) for the calling thread |
| block_output | - the output iterator from the thread block to store to |
| items | - array of items that is stored to the thread block |
| valid | - maximum range of valid items to store |
| ROCPRIM_DEVICE ROCPRIM_INLINE void | block_store_direct_warp_striped (unsigned int flat_id, OutputIterator block_output, T(&items)[ItemsPerThread]) |
Stores a warp-striped arrangement of items from across the thread block into a blocked arrangement on continuous memory.
The warp-striped arrangement is assumed to be (WarpSize * ItemsPerThread) items across a thread block. Each thread uses a flat_id to store a range of ItemsPerThread items to the thread block.
- The default WarpSize is the hardware warp size, which is an optimal value.
- WarpSize must be a power of two and less than or equal to the hardware warp size.
- A WarpSize smaller than the hardware warp size could result in lower performance.
Template Parameters
| WarpSize | - [optional] the number of threads in a warp |
| OutputIterator | - [inferred] an iterator type for output (can be a simple pointer) |
| T | - [inferred] the data type |
| ItemsPerThread | - [inferred] the number of items to be processed by each thread |
Parameters
| flat_id | - a local flat 1D thread id in a block (tile) for the calling thread |
| block_output | - the output iterator from the thread block to store to |
| items | - array of items that is stored to the thread block |
| ROCPRIM_DEVICE ROCPRIM_INLINE void | block_store_direct_warp_striped (unsigned int flat_id, OutputIterator block_output, T(&items)[ItemsPerThread], unsigned int valid) |
Stores a warp-striped arrangement of items from across the thread block into a blocked arrangement on continuous memory, which is guarded by range valid.
The warp-striped arrangement is assumed to be (WarpSize * ItemsPerThread) items across a thread block. Each thread uses a flat_id to store a range of ItemsPerThread items to the thread block.
- The default WarpSize is the hardware warp size, which is an optimal value.
- WarpSize must be a power of two and less than or equal to the hardware warp size.
- A WarpSize smaller than the hardware warp size could result in lower performance.
Template Parameters
| WarpSize | - [optional] the number of threads in a warp |
| OutputIterator | - [inferred] an iterator type for output (can be a simple pointer) |
| T | - [inferred] the data type |
| ItemsPerThread | - [inferred] the number of items to be processed by each thread |
Parameters
| flat_id | - a local flat 1D thread id in a block (tile) for the calling thread |
| block_output | - the output iterator from the thread block to store to |
| items | - array of items that is stored to the thread block |
| valid | - maximum range of valid items to store |
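The guarded warp-striped store can be emulated on the host by computing each destination index and skipping those at or beyond `valid`. The sketch assumes that the lane is `flat_id % WarpSize` and that warp `w` owns the tile starting at `w * WarpSize * ItemsPerThread`; `emulate_store_warp_striped_guarded` is a hypothetical name.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical host-side emulation of one thread's guarded warp-striped store.
// Assumptions: lane = flat_id % WarpSize; warp w owns the tile starting at
// w * WarpSize * ItemsPerThread; indices >= valid are never written.
template <unsigned int WarpSize, class T, std::size_t ItemsPerThread>
void emulate_store_warp_striped_guarded(unsigned int flat_id,
                                        std::vector<T>& block_output,
                                        const std::array<T, ItemsPerThread>& items,
                                        unsigned int valid)
{
    const std::size_t lane        = flat_id % WarpSize;
    const std::size_t warp_offset = (flat_id / WarpSize) * WarpSize * ItemsPerThread;

    for (std::size_t i = 0; i < ItemsPerThread; ++i)
    {
        const std::size_t index = warp_offset + lane + i * WarpSize;
        if (index < valid)
            block_output[index] = items[i];
    }
}
```

With an emulated warp size of 2, two items per thread and `valid` equal to 5, thread 2 writes its first item to index 4 and skips its second item, whose destination index 6 lies outside the valid range.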