|
| template<class BinaryFunction = ::rocprim::plus<T>> |
| ROCPRIM_DEVICE ROCPRIM_INLINE void | reduce (T input, T &output, storage_type &storage, BinaryFunction reduce_op=BinaryFunction()) |
| | Performs reduction across threads in a block. More...
|
| |
| template<class BinaryFunction = ::rocprim::plus<T>> |
| ROCPRIM_DEVICE ROCPRIM_FORCE_INLINE void | reduce (T input, T &output, BinaryFunction reduce_op=BinaryFunction()) |
| | This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts. More...
|
| |
| template<unsigned int ItemsPerThread, class BinaryFunction = ::rocprim::plus<T>> |
| ROCPRIM_DEVICE ROCPRIM_INLINE void | reduce (T(&input)[ItemsPerThread], T &output, storage_type &storage, BinaryFunction reduce_op=BinaryFunction()) |
| | Performs reduction across threads in a block. More...
|
| |
| template<unsigned int ItemsPerThread, class BinaryFunction = ::rocprim::plus<T>> |
| ROCPRIM_DEVICE ROCPRIM_FORCE_INLINE void | reduce (T(&input)[ItemsPerThread], T &output, BinaryFunction reduce_op=BinaryFunction()) |
| | This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts. More...
|
| |
| template<class BinaryFunction = ::rocprim::plus<T>> |
| ROCPRIM_DEVICE ROCPRIM_INLINE void | reduce (T input, T &output, unsigned int valid_items, storage_type &storage, BinaryFunction reduce_op=BinaryFunction()) |
| | Performs reduction across threads in a block. More...
|
| |
| template<class BinaryFunction = ::rocprim::plus<T>> |
| ROCPRIM_DEVICE ROCPRIM_FORCE_INLINE void | reduce (T input, T &output, unsigned int valid_items, BinaryFunction reduce_op=BinaryFunction()) |
| | This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts. More...
|
| |
template<class T, unsigned int BlockSizeX, block_reduce_algorithm Algorithm = block_reduce_algorithm::default_algorithm, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
class block_reduce< T, BlockSizeX, Algorithm, BlockSizeY, BlockSizeZ >
The block_reduce class is a block level parallel primitive which provides methods for performing reductions operations on items partitioned across threads in a block.
- Template Parameters
-
- Overview
- Supports non-commutative reduce operators. However, a reduce operator should be associative. When used with non-associative functions the results may be non-deterministic and/or vary in precision.
- Computation can more efficient when:
ItemsPerThread is greater than one,
T is an arithmetic type,
- reduce operation is simple addition operator, and
- the number of threads in the block is a multiple of the hardware warp size (see rocprim::device_warp_size()).
- block_reduce has three alternative implementations:
block_reduce_algorithm::using_warp_reduce, block_reduce_algorithm::raking_reduce and block_reduce_algorithm::raking_reduce_commutative_only.
- If the block sizes less than 64 only one warp reduction is used. The block reduction algorithm stores the result only in the first thread(lane_id = 0 warp_id = 0), when the block size is larger then the warp size.
- Examples
In the examples reduce operation is performed on block of 192 threads, each provides one int value, result is returned using the same variable as for input.
__global__ void example_kernel(...)
{
using block_reduce_int = rocprim::block_reduce<int, 192>;
__shared__ block_reduce_int::storage_type storage;
int value = ...;
block_reduce_int().reduce(
value,
value,
storage
);
...
}
template<class T , unsigned int BlockSizeX, block_reduce_algorithm Algorithm = block_reduce_algorithm::default_algorithm, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
Struct used to allocate a temporary memory that is required for thread communication during operations provided by related parallel primitive.
Depending on the implemention the operations exposed by parallel primitive may require a temporary storage for thread communication. The storage should be allocated using keywords __shared__. It can be aliased to an externally allocated memory, or be a part of a union type with other storage types to increase shared memory reusability.
template<class T , unsigned int BlockSizeX, block_reduce_algorithm Algorithm = block_reduce_algorithm::default_algorithm, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
template<class BinaryFunction = ::rocprim::plus<T>>
| ROCPRIM_DEVICE ROCPRIM_INLINE void block_reduce< T, BlockSizeX, Algorithm, BlockSizeY, BlockSizeZ >::reduce |
( |
T |
input, |
|
|
T & |
output, |
|
|
storage_type & |
storage, |
|
|
BinaryFunction |
reduce_op = BinaryFunction() |
|
) |
| |
|
inline |
Performs reduction across threads in a block.
- Template Parameters
-
| BinaryFunction | - type of binary function used for reduce. Default type is rocprim::plus<T>. |
- Parameters
-
| [in] | input | - thread input value. |
| [out] | output | - reference to a thread output value. May be aliased with input. |
| [in] | storage | - reference to a temporary storage object of type storage_type. |
| [in] | reduce_op | - binary operation function object that will be used for reduce. The signature of the function should be equivalent to the following: T f(const T &a, const T &b);. The signature does not need to have const &, but function object must not modify the objects passed to it. |
- Storage reusage
- Synchronization barrier should be placed before
storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().
- Examples
The examples present min reduce operations performed on a block of 256 threads, each provides one float value.
__global__ void example_kernel(...)
{
using block_reduce_f = rocprim::block_reduce<float, 256>;
__shared__ block_reduce_float::storage_type storage;
float input = ...;
float output;
block_reduce_float().reduce(
input,
output,
storage,
rocprim::minimum<float>()
);
...
}
If the input values across threads in a block are {1, -2, 3, -4, ..., 255, -256}, then output value will be {-256}.
template<class T , unsigned int BlockSizeX, block_reduce_algorithm Algorithm = block_reduce_algorithm::default_algorithm, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
template<class BinaryFunction = ::rocprim::plus<T>>
| ROCPRIM_DEVICE ROCPRIM_FORCE_INLINE void block_reduce< T, BlockSizeX, Algorithm, BlockSizeY, BlockSizeZ >::reduce |
( |
T |
input, |
|
|
T & |
output, |
|
|
BinaryFunction |
reduce_op = BinaryFunction() |
|
) |
| |
|
inline |
This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
Performs reduction across threads in a block.
- This overload does not accept storage argument. Required shared memory is allocated by the method itself.
- Template Parameters
-
| BinaryFunction | - type of binary function used for reduce. Default type is rocprim::plus<T>. |
- Parameters
-
| [in] | input | - thread input value. |
| [out] | output | - reference to a thread output value. May be aliased with input. |
| [in] | reduce_op | - binary operation function object that will be used for reduce. The signature of the function should be equivalent to the following: T f(const T &a, const T &b);. The signature does not need to have const &, but function object must not modify the objects passed to it. |
template<class T , unsigned int BlockSizeX, block_reduce_algorithm Algorithm = block_reduce_algorithm::default_algorithm, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
template<unsigned int ItemsPerThread, class BinaryFunction = ::rocprim::plus<T>>
| ROCPRIM_DEVICE ROCPRIM_INLINE void block_reduce< T, BlockSizeX, Algorithm, BlockSizeY, BlockSizeZ >::reduce |
( |
T(&) |
input[ItemsPerThread], |
|
|
T & |
output, |
|
|
storage_type & |
storage, |
|
|
BinaryFunction |
reduce_op = BinaryFunction() |
|
) |
| |
|
inline |
Performs reduction across threads in a block.
- Template Parameters
-
| ItemsPerThread | - number of items in the input array. |
| BinaryFunction | - type of binary function used for reduce. Default type is rocprim::plus<T>. |
- Parameters
-
| [in] | input | - reference to an array containing thread input values. |
| [out] | output | - reference to a thread output array. May be aliased with input. |
| [in] | storage | - reference to a temporary storage object of type storage_type. |
| [in] | reduce_op | - binary operation function object that will be used for reduce. The signature of the function should be equivalent to the following: T f(const T &a, const T &b);. The signature does not need to have const &, but function object must not modify the objects passed to it. |
- Storage reusage
- Synchronization barrier should be placed before
storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().
- Examples
The examples present maximum reduce operations performed on a block of 128 threads, each provides two long value.
__global__ void example_kernel(...)
{
using block_reduce_f = rocprim::block_reduce<long, 128>;
__shared__ block_reduce_long::storage_type storage;
long input[2] = ...;
long output[2];
block_reduce_long().reduce(
input,
output,
storage,
rocprim::maximum<long>()
);
...
}
If the input values across threads in a block are {-1, 2, -3, 4, ..., -255, 256}, then output value will be {256}.
template<class T , unsigned int BlockSizeX, block_reduce_algorithm Algorithm = block_reduce_algorithm::default_algorithm, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
template<unsigned int ItemsPerThread, class BinaryFunction = ::rocprim::plus<T>>
| ROCPRIM_DEVICE ROCPRIM_FORCE_INLINE void block_reduce< T, BlockSizeX, Algorithm, BlockSizeY, BlockSizeZ >::reduce |
( |
T(&) |
input[ItemsPerThread], |
|
|
T & |
output, |
|
|
BinaryFunction |
reduce_op = BinaryFunction() |
|
) |
| |
|
inline |
This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
Performs reduction across threads in a block.
- This overload does not accept storage argument. Required shared memory is allocated by the method itself.
- Template Parameters
-
| ItemsPerThread | - number of items in the input array. |
| BinaryFunction | - type of binary function used for reduce. Default type is rocprim::plus<T>. |
- Parameters
-
| [in] | input | - reference to an array containing thread input values. |
| [out] | output | - reference to a thread output array. May be aliased with input. |
| [in] | reduce_op | - binary operation function object that will be used for reduce. The signature of the function should be equivalent to the following: T f(const T &a, const T &b);. The signature does not need to have const &, but function object must not modify the objects passed to it. |
template<class T , unsigned int BlockSizeX, block_reduce_algorithm Algorithm = block_reduce_algorithm::default_algorithm, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
template<class BinaryFunction = ::rocprim::plus<T>>
| ROCPRIM_DEVICE ROCPRIM_INLINE void block_reduce< T, BlockSizeX, Algorithm, BlockSizeY, BlockSizeZ >::reduce |
( |
T |
input, |
|
|
T & |
output, |
|
|
unsigned int |
valid_items, |
|
|
storage_type & |
storage, |
|
|
BinaryFunction |
reduce_op = BinaryFunction() |
|
) |
| |
|
inline |
Performs reduction across threads in a block.
- Template Parameters
-
| BinaryFunction | - type of binary function used for reduce. Default type is rocprim::plus<T>. |
- Parameters
-
| [in] | input | - thread input value. |
| [out] | output | - reference to a thread output value. May be aliased with input. |
| [in] | valid_items | - number of items that will be reduced in the block. |
| [in] | storage | - reference to a temporary storage object of type storage_type. |
| [in] | reduce_op | - binary operation function object that will be used for reduce. The signature of the function should be equivalent to the following: T f(const T &a, const T &b);. The signature does not need to have const &, but function object must not modify the objects passed to it. |
- Storage reusage
- Synchronization barrier should be placed before
storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().
- Examples
The examples present min reduce operations performed on a block of 256 threads, each provides one float value.
__global__ void example_kernel(...)
{
using block_reduce_f = rocprim::block_reduce<float, 256>;
__shared__ block_reduce_float::storage_type storage;
float input = ...;
unsigned int valid_items = 250;
float output;
block_reduce_float().reduce(
input,
output,
valid_items,
storage,
rocprim::minimum<float>()
);
...
}
template<class T , unsigned int BlockSizeX, block_reduce_algorithm Algorithm = block_reduce_algorithm::default_algorithm, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
template<class BinaryFunction = ::rocprim::plus<T>>
| ROCPRIM_DEVICE ROCPRIM_FORCE_INLINE void block_reduce< T, BlockSizeX, Algorithm, BlockSizeY, BlockSizeZ >::reduce |
( |
T |
input, |
|
|
T & |
output, |
|
|
unsigned int |
valid_items, |
|
|
BinaryFunction |
reduce_op = BinaryFunction() |
|
) |
| |
|
inline |
This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
Performs reduction across threads in a block.
- This overload does not accept storage argument. Required shared memory is allocated by the method itself.
- Template Parameters
-
| ItemsPerThread | - number of items in the input array. |
| BinaryFunction | - type of binary function used for reduce. Default type is rocprim::plus<T>. |
- Parameters
-
| [in] | input | - reference to an array containing thread input values. |
| [out] | output | - reference to a thread output array. May be aliased with input. |
| [in] | valid_items | - number of items that will be reduced in the block. |
| [in] | reduce_op | - binary operation function object that will be used for reduce. The signature of the function should be equivalent to the following: T f(const T &a, const T &b);. The signature does not need to have const &, but function object must not modify the objects passed to it. |