|
template<class BinaryFunction = ::rocprim::plus<T>> |
ROCPRIM_DEVICE ROCPRIM_INLINE void | reduce (T input, T &output, storage_type &storage, BinaryFunction reduce_op=BinaryFunction()) |
| Performs reduction across threads in a block.
|
|
template<class BinaryFunction = ::rocprim::plus<T>> |
ROCPRIM_DEVICE ROCPRIM_FORCE_INLINE void | reduce (T input, T &output, BinaryFunction reduce_op=BinaryFunction()) |
| This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
|
|
template<unsigned int ItemsPerThread, class BinaryFunction = ::rocprim::plus<T>> |
ROCPRIM_DEVICE ROCPRIM_INLINE void | reduce (T(&input)[ItemsPerThread], T &output, storage_type &storage, BinaryFunction reduce_op=BinaryFunction()) |
| Performs reduction across threads in a block.
|
|
template<unsigned int ItemsPerThread, class BinaryFunction = ::rocprim::plus<T>> |
ROCPRIM_DEVICE ROCPRIM_FORCE_INLINE void | reduce (T(&input)[ItemsPerThread], T &output, BinaryFunction reduce_op=BinaryFunction()) |
| This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
|
|
template<class BinaryFunction = ::rocprim::plus<T>> |
ROCPRIM_DEVICE ROCPRIM_INLINE void | reduce (T input, T &output, unsigned int valid_items, storage_type &storage, BinaryFunction reduce_op=BinaryFunction()) |
| Performs reduction across threads in a block.
|
|
template<class BinaryFunction = ::rocprim::plus<T>> |
ROCPRIM_DEVICE ROCPRIM_FORCE_INLINE void | reduce (T input, T &output, unsigned int valid_items, BinaryFunction reduce_op=BinaryFunction()) |
| This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
|
|
template<class T, unsigned int BlockSizeX, block_reduce_algorithm Algorithm = block_reduce_algorithm::default_algorithm, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
class block_reduce< T, BlockSizeX, Algorithm, BlockSizeY, BlockSizeZ >
The block_reduce class is a block-level parallel primitive that provides methods for performing reduction operations on items partitioned across threads in a block.
- Template Parameters
-
- Overview
- Supports non-commutative reduce operators. However, a reduce operator should be associative. When used with non-associative functions the results may be non-deterministic and/or vary in precision.
- Computation can be more efficient when:
  - ItemsPerThread is greater than one,
  - T is an arithmetic type,
  - the reduce operation is a simple addition operator, and
  - the number of threads in the block is a multiple of the hardware warp size (see rocprim::device_warp_size()).
- block_reduce has three alternative implementations: block_reduce_algorithm::using_warp_reduce, block_reduce_algorithm::raking_reduce and block_reduce_algorithm::raking_reduce_commutative_only.
- If the block size is less than 64, only one warp reduction is used. When the block size is larger than the warp size, the block reduction algorithm stores the result only in the first thread (lane_id = 0, warp_id = 0).
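The associativity requirement above can be illustrated on the host, without any GPU code. The following is a minimal sketch in plain C++, not rocPRIM code: the helpers `left_to_right` and `pairwise` are hypothetical names introduced here. It combines the same values under two different groupings, which is roughly what happens when a parallel reduction merges partial results in a different order than a sequential loop would.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: combines values strictly left to right,
// ((v0 + v1) + v2) + ...
double left_to_right(const std::vector<double>& v)
{
    assert(!v.empty());
    double acc = v[0];
    for (std::size_t i = 1; i < v.size(); ++i)
        acc = acc + v[i];
    return acc;
}

// Hypothetical helper: combines neighbouring pairs level by level,
// (v0 + v1) + (v2 + v3), ... The operand order is preserved, so the
// two helpers differ only in grouping, i.e. in associativity.
double pairwise(std::vector<double> v)
{
    assert(!v.empty());
    while (v.size() > 1)
    {
        std::vector<double> next;
        for (std::size_t i = 0; i + 1 < v.size(); i += 2)
            next.push_back(v[i] + v[i + 1]);
        if (v.size() % 2 == 1)
            next.push_back(v.back()); // odd element is carried over unchanged
        v = next;
    }
    return v[0];
}
```

For the values 1.0, 1e30, -1e30, 1.0 the left-to-right grouping yields 1.0 and the pairwise grouping yields 0.0, because floating-point addition is not associative. For integer addition both groupings agree.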
- Examples
In the example, a reduce operation is performed on a block of 192 threads. Each thread provides one int value, and the result is returned in the same variable that was used for input.
__global__ void example_kernel(...)
{
    using block_reduce_int = rocprim::block_reduce<int, 192>;
    __shared__ block_reduce_int::storage_type storage;
    int value = ...;
    block_reduce_int().reduce(
        value,
        value,
        storage
    );
    ...
}
template<class T , unsigned int BlockSizeX, block_reduce_algorithm Algorithm = block_reduce_algorithm::default_algorithm, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
storage_type is a struct used to allocate the temporary memory required for thread communication during operations provided by the related parallel primitive.
Depending on the implementation, the operations exposed by the parallel primitive may require temporary storage for thread communication. The storage should be allocated using the __shared__ keyword. It can be aliased to externally allocated memory, or be part of a union type with other storage types to increase shared memory reusability.
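The union technique mentioned above can be sketched on the host. The two structs below are hypothetical stand-ins for the storage_type of two block-level primitives, and their sizes are made up for illustration. The sketch only shows why a union saves memory compared with separate allocations.

```cpp
#include <cstddef>

// Hypothetical stand-ins for the storage_type of two different
// block-level primitives; the array sizes are arbitrary.
struct reduce_storage_sketch { int data[64]; };
struct scan_storage_sketch   { int data[256]; };

// Storages that are used one after another can share the same bytes.
union shared_storage_sketch
{
    reduce_storage_sketch reduce;
    scan_storage_sketch   scan;
};

// Separate allocation of both storages, for comparison.
struct separate_storage_sketch
{
    reduce_storage_sketch reduce;
    scan_storage_sketch   scan;
};

// The union is as large as its largest member, while the struct
// pays for both members.
constexpr std::size_t union_bytes    = sizeof(shared_storage_sketch);
constexpr std::size_t separate_bytes = sizeof(separate_storage_sketch);
```

In a kernel, such a union would be declared with __shared__, and a synchronization barrier would be needed between the use of one member and the next, since both occupy the same memory.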
template<class T , unsigned int BlockSizeX, block_reduce_algorithm Algorithm = block_reduce_algorithm::default_algorithm, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
template<class BinaryFunction = ::rocprim::plus<T>>
ROCPRIM_DEVICE ROCPRIM_INLINE void block_reduce< T, BlockSizeX, Algorithm, BlockSizeY, BlockSizeZ >::reduce (T input, T &output, storage_type &storage, BinaryFunction reduce_op=BinaryFunction())
inline
Performs reduction across threads in a block.
- Template Parameters
-
BinaryFunction | - type of binary function used for reduce. Default type is rocprim::plus<T>. |
- Parameters
-
[in] | input | - thread input value. |
[out] | output | - reference to a thread output value. May be aliased with input. |
[in] | storage | - reference to a temporary storage object of type storage_type. |
[in] | reduce_op | - binary operation function object that will be used for reduce. The signature of the function should be equivalent to the following: T f(const T &a, const T &b). The signature does not need to have const &, but the function object must not modify the objects passed to it. |
- Storage reuse
- A synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().
- Examples
The example presents a min reduce operation performed on a block of 256 threads, each providing one float value.
__global__ void example_kernel(...)
{
    // Specialize block_reduce for float and a block of 256 threads
    using block_reduce_float = rocprim::block_reduce<float, 256>;
    // Allocate the temporary storage in shared memory
    __shared__ block_reduce_float::storage_type storage;
    float input = ...;
    float output;
    // Execute the min reduce
    block_reduce_float().reduce(
        input,
        output,
        storage,
        rocprim::minimum<float>()
    );
    ...
}
If the input values across threads in a block are {1, -2, 3, -4, ..., 255, -256}, then the output value will be {-256}.
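The stated result can be checked with a sequential reference on the host. This is a minimal sketch in plain C++ rather than device code, and the helper names `alternating_inputs` and `reference_min` are hypothetical. It rebuilds the input sequence from the example and reduces it with a minimum operator.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Builds the input sequence from the example: 1, -2, 3, -4, ...
// Element i (0-based) stands for the value held by thread i.
std::vector<float> alternating_inputs(unsigned int count)
{
    std::vector<float> values(count);
    for (unsigned int i = 0; i < count; ++i)
    {
        const float magnitude = static_cast<float>(i + 1);
        values[i] = (i % 2 == 0) ? magnitude : -magnitude;
    }
    return values;
}

// Sequential reference for a min reduce over all per-thread values.
float reference_min(const std::vector<float>& values)
{
    assert(!values.empty());
    float result = values[0];
    for (float v : values)
        result = std::min(result, v);
    return result;
}
```

Calling `reference_min(alternating_inputs(256))` returns -256, which matches the output value given for the block of 256 threads.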
template<class T , unsigned int BlockSizeX, block_reduce_algorithm Algorithm = block_reduce_algorithm::default_algorithm, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
template<class BinaryFunction = ::rocprim::plus<T>>
ROCPRIM_DEVICE ROCPRIM_FORCE_INLINE void block_reduce< T, BlockSizeX, Algorithm, BlockSizeY, BlockSizeZ >::reduce (T input, T &output, BinaryFunction reduce_op=BinaryFunction())
inline
This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
Performs reduction across threads in a block.
- This overload does not accept a storage argument. The required shared memory is allocated by the method itself.
- Template Parameters
-
BinaryFunction | - type of binary function used for reduce. Default type is rocprim::plus<T>. |
- Parameters
-
[in] | input | - thread input value. |
[out] | output | - reference to a thread output value. May be aliased with input. |
[in] | reduce_op | - binary operation function object that will be used for reduce. The signature of the function should be equivalent to the following: T f(const T &a, const T &b). The signature does not need to have const &, but the function object must not modify the objects passed to it. |
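The requirements on reduce_op can be made concrete with a user-defined function object. The sketch below is plain host C++: `abs_max_sketch` and `host_reduce` are hypothetical names, and `host_reduce` is only a sequential stand-in for the block-wide operation. In a kernel, the call operator would additionally have to be callable on the device, for example by marking it `__device__`.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical reduce operator: keeps the value with the larger magnitude.
// It matches T f(const T &a, const T &b) and does not modify its arguments.
// On equal magnitudes the left operand wins, so the operator is associative
// but not commutative.
struct abs_max_sketch
{
    float operator()(const float& a, const float& b) const
    {
        return std::fabs(a) < std::fabs(b) ? b : a;
    }
};

// Sequential host-side stand-in for a reduction: applies reduce_op
// from left to right over the per-thread values.
template<class T, class BinaryFunction>
T host_reduce(const std::vector<T>& values, BinaryFunction reduce_op)
{
    assert(!values.empty());
    T result = values[0];
    for (std::size_t i = 1; i < values.size(); ++i)
        result = reduce_op(result, values[i]);
    return result;
}
```

For the values 1, -5, 3, 5 this sketch returns -5, because -5 is the leftmost value with the largest magnitude.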
template<class T , unsigned int BlockSizeX, block_reduce_algorithm Algorithm = block_reduce_algorithm::default_algorithm, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
template<unsigned int ItemsPerThread, class BinaryFunction = ::rocprim::plus<T>>
ROCPRIM_DEVICE ROCPRIM_INLINE void block_reduce< T, BlockSizeX, Algorithm, BlockSizeY, BlockSizeZ >::reduce (T(&input)[ItemsPerThread], T &output, storage_type &storage, BinaryFunction reduce_op=BinaryFunction())
inline
Performs reduction across threads in a block.
- Template Parameters
-
ItemsPerThread | - number of items in the input array. |
BinaryFunction | - type of binary function used for reduce. Default type is rocprim::plus<T>. |
- Parameters
-
[in] | input | - reference to an array containing thread input values. |
[out] | output | - reference to a thread output value. May be aliased with input. |
[in] | storage | - reference to a temporary storage object of type storage_type. |
[in] | reduce_op | - binary operation function object that will be used for reduce. The signature of the function should be equivalent to the following: T f(const T &a, const T &b). The signature does not need to have const &, but the function object must not modify the objects passed to it. |
- Storage reuse
- A synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().
- Examples
The example presents a maximum reduce operation performed on a block of 128 threads, each providing two long values.
__global__ void example_kernel(...)
{
    // Specialize block_reduce for long and a block of 128 threads
    using block_reduce_long = rocprim::block_reduce<long, 128>;
    // Allocate the temporary storage in shared memory
    __shared__ block_reduce_long::storage_type storage;
    long input[2] = ...;
    // The result is a single value, matching the T &output parameter
    long output;
    // Execute the maximum reduce
    block_reduce_long().reduce(
        input,
        output,
        storage,
        rocprim::maximum<long>()
    );
    ...
}
If the input values across threads in a block are {-1, 2, -3, 4, ..., -255, 256}, then the output value will be {256}.
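One way to picture the multi-item overload is as two stages: each thread first reduces its own items, and the per-thread partial results are then reduced across the block. The host-side sketch below models that idea in plain C++. It is an illustration and not rocPRIM's implementation: all names are hypothetical, and the assignment of consecutive values to each thread is an assumption.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr unsigned int items_per_thread = 2;
using thread_items = std::array<long, items_per_thread>;

// Stage 1: a simulated thread reduces its own items to one partial result.
long thread_max(const thread_items& items)
{
    long result = items[0];
    for (unsigned int i = 1; i < items_per_thread; ++i)
        result = std::max(result, items[i]);
    return result;
}

// Stage 2: the partial results are combined across all simulated threads.
long block_max_sketch(const std::vector<thread_items>& block)
{
    assert(!block.empty());
    long result = thread_max(block[0]);
    for (std::size_t t = 1; t < block.size(); ++t)
        result = std::max(result, thread_max(block[t]));
    return result;
}

// Builds the example input -1, 2, -3, 4, ... so that thread t holds
// two consecutive values (an assumed arrangement).
std::vector<thread_items> example_block(unsigned int threads)
{
    std::vector<thread_items> block(threads);
    long next = 1;
    for (auto& items : block)
        for (auto& item : items)
        {
            item = (next % 2 == 1) ? -next : next;
            ++next;
        }
    return block;
}
```

With 128 simulated threads, `block_max_sketch(example_block(128))` returns 256, in line with the output value stated in the example.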
template<class T , unsigned int BlockSizeX, block_reduce_algorithm Algorithm = block_reduce_algorithm::default_algorithm, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
template<unsigned int ItemsPerThread, class BinaryFunction = ::rocprim::plus<T>>
ROCPRIM_DEVICE ROCPRIM_FORCE_INLINE void block_reduce< T, BlockSizeX, Algorithm, BlockSizeY, BlockSizeZ >::reduce (T(&input)[ItemsPerThread], T &output, BinaryFunction reduce_op=BinaryFunction())
inline
This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
Performs reduction across threads in a block.
- This overload does not accept a storage argument. The required shared memory is allocated by the method itself.
- Template Parameters
-
ItemsPerThread | - number of items in the input array. |
BinaryFunction | - type of binary function used for reduce. Default type is rocprim::plus<T>. |
- Parameters
-
[in] | input | - reference to an array containing thread input values. |
[out] | output | - reference to a thread output value. May be aliased with input. |
[in] | reduce_op | - binary operation function object that will be used for reduce. The signature of the function should be equivalent to the following: T f(const T &a, const T &b). The signature does not need to have const &, but the function object must not modify the objects passed to it. |
template<class T , unsigned int BlockSizeX, block_reduce_algorithm Algorithm = block_reduce_algorithm::default_algorithm, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
template<class BinaryFunction = ::rocprim::plus<T>>
ROCPRIM_DEVICE ROCPRIM_INLINE void block_reduce< T, BlockSizeX, Algorithm, BlockSizeY, BlockSizeZ >::reduce (T input, T &output, unsigned int valid_items, storage_type &storage, BinaryFunction reduce_op=BinaryFunction())
inline
Performs reduction across threads in a block.
- Template Parameters
-
BinaryFunction | - type of binary function used for reduce. Default type is rocprim::plus<T>. |
- Parameters
-
[in] | input | - thread input value. |
[out] | output | - reference to a thread output value. May be aliased with input. |
[in] | valid_items | - number of items that will be reduced in the block. |
[in] | storage | - reference to a temporary storage object of type storage_type. |
[in] | reduce_op | - binary operation function object that will be used for reduce. The signature of the function should be equivalent to the following: T f(const T &a, const T &b). The signature does not need to have const &, but the function object must not modify the objects passed to it. |
- Storage reuse
- A synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().
- Examples
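The example presents a min reduce operation performed on a block of 256 threads, each providing one float value. Only the first 250 items are reduced.
__global__ void example_kernel(...)
{
    // Specialize block_reduce for float and a block of 256 threads
    using block_reduce_float = rocprim::block_reduce<float, 256>;
    // Allocate the temporary storage in shared memory
    __shared__ block_reduce_float::storage_type storage;
    float input = ...;
    // Number of items that take part in the reduction
    unsigned int valid_items = 250;
    float output;
    // Execute the min reduce over the valid items
    block_reduce_float().reduce(
        input,
        output,
        valid_items,
        storage,
        rocprim::minimum<float>()
    );
    ...
}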
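The effect of valid_items can be modelled on the host. The sketch below assumes that the values of the first valid_items threads (in thread order) take part in the reduction and that the remaining inputs are ignored. That reading of the parameter is an assumption made for illustration, and `min_of_valid_sketch` is a hypothetical helper, not part of rocPRIM.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side model of a partial min reduce: only the inputs of the
// first valid_items simulated threads are combined (assumed semantics).
float min_of_valid_sketch(const std::vector<float>& per_thread_input,
                          unsigned int valid_items)
{
    assert(valid_items > 0 && valid_items <= per_thread_input.size());
    float result = per_thread_input[0];
    for (std::size_t i = 1; i < valid_items; ++i)
        result = std::min(result, per_thread_input[i]);
    return result;
}
```

Taking the input sequence 1, -2, 3, -4, ..., 255, -256 from the earlier min example as a hypothetical input, this model gives -250 for valid_items = 250 and -256 when all 256 items are valid.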
template<class T , unsigned int BlockSizeX, block_reduce_algorithm Algorithm = block_reduce_algorithm::default_algorithm, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
template<class BinaryFunction = ::rocprim::plus<T>>
ROCPRIM_DEVICE ROCPRIM_FORCE_INLINE void block_reduce< T, BlockSizeX, Algorithm, BlockSizeY, BlockSizeZ >::reduce (T input, T &output, unsigned int valid_items, BinaryFunction reduce_op=BinaryFunction())
inline
This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
Performs reduction across threads in a block.
- This overload does not accept a storage argument. The required shared memory is allocated by the method itself.
- Template Parameters
-
BinaryFunction | - type of binary function used for reduce. Default type is rocprim::plus<T>. |
- Parameters
-
[in] | input | - thread input value. |
[out] | output | - reference to a thread output value. May be aliased with input. |
[in] | valid_items | - number of items that will be reduced in the block. |
[in] | reduce_op | - binary operation function object that will be used for reduce. The signature of the function should be equivalent to the following: T f(const T &a, const T &b). The signature does not need to have const &, but the function object must not modify the objects passed to it. |