rocPRIM
Public Types | Public Member Functions | List of all members
block_reduce< T, BlockSizeX, Algorithm, BlockSizeY, BlockSizeZ > Class Template Reference

The block_reduce class is a block level parallel primitive which provides methods for performing reductions operations on items partitioned across threads in a block. More...

#include <block_reduce.hpp>

Inheritance diagram for block_reduce< T, BlockSizeX, Algorithm, BlockSizeY, BlockSizeZ >:
Inheritance graph
[legend]
Collaboration diagram for block_reduce< T, BlockSizeX, Algorithm, BlockSizeY, BlockSizeZ >:
Collaboration graph
[legend]

Public Types

using storage_type = typename base_type::storage_type
 Struct used to allocate a temporary memory that is required for thread communication during operations provided by related parallel primitive. More...
 

Public Member Functions

template<class BinaryFunction = ::rocprim::plus<T>>
ROCPRIM_DEVICE ROCPRIM_INLINE void reduce (T input, T &output, storage_type &storage, BinaryFunction reduce_op=BinaryFunction())
 Performs reduction across threads in a block. More...
 
template<class BinaryFunction = ::rocprim::plus<T>>
ROCPRIM_DEVICE ROCPRIM_FORCE_INLINE void reduce (T input, T &output, BinaryFunction reduce_op=BinaryFunction())
 This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts. More...
 
template<unsigned int ItemsPerThread, class BinaryFunction = ::rocprim::plus<T>>
ROCPRIM_DEVICE ROCPRIM_INLINE void reduce (T(&input)[ItemsPerThread], T &output, storage_type &storage, BinaryFunction reduce_op=BinaryFunction())
 Performs reduction across threads in a block. More...
 
template<unsigned int ItemsPerThread, class BinaryFunction = ::rocprim::plus<T>>
ROCPRIM_DEVICE ROCPRIM_FORCE_INLINE void reduce (T(&input)[ItemsPerThread], T &output, BinaryFunction reduce_op=BinaryFunction())
 This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts. More...
 
template<class BinaryFunction = ::rocprim::plus<T>>
ROCPRIM_DEVICE ROCPRIM_INLINE void reduce (T input, T &output, unsigned int valid_items, storage_type &storage, BinaryFunction reduce_op=BinaryFunction())
 Performs reduction across threads in a block. More...
 
template<class BinaryFunction = ::rocprim::plus<T>>
ROCPRIM_DEVICE ROCPRIM_FORCE_INLINE void reduce (T input, T &output, unsigned int valid_items, BinaryFunction reduce_op=BinaryFunction())
 This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts. More...
 

Detailed Description

template<class T, unsigned int BlockSizeX, block_reduce_algorithm Algorithm = block_reduce_algorithm::default_algorithm, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
class block_reduce< T, BlockSizeX, Algorithm, BlockSizeY, BlockSizeZ >

The block_reduce class is a block level parallel primitive which provides methods for performing reductions operations on items partitioned across threads in a block.

Template Parameters
T- the input/output type.
BlockSize- the number of threads in a block.
Algorithm- selected reduce algorithm, block_reduce_algorithm::default_algorithm by default.
Overview
  • Supports non-commutative reduce operators. However, a reduce operator should be associative. When used with non-associative functions the results may be non-deterministic and/or vary in precision.
  • Computation can more efficient when:
    • ItemsPerThread is greater than one,
    • T is an arithmetic type,
    • reduce operation is simple addition operator, and
    • the number of threads in the block is a multiple of the hardware warp size (see rocprim::device_warp_size()).
  • block_reduce has three alternative implementations: block_reduce_algorithm::using_warp_reduce, block_reduce_algorithm::raking_reduce and block_reduce_algorithm::raking_reduce_commutative_only.
  • If the block sizes less than 64 only one warp reduction is used. The block reduction algorithm stores the result only in the first thread(lane_id = 0 warp_id = 0), when the block size is larger then the warp size.
Examples

In the examples reduce operation is performed on block of 192 threads, each provides one int value, result is returned using the same variable as for input.

__global__ void example_kernel(...)
{
// specialize warp_reduce for int and logical warp of 192 threads
using block_reduce_int = rocprim::block_reduce<int, 192>;
// allocate storage in shared memory
__shared__ block_reduce_int::storage_type storage;
int value = ...;
// execute reduce
block_reduce_int().reduce(
value, // input
value, // output
storage
);
...
}

Member Typedef Documentation

◆ storage_type

template<class T , unsigned int BlockSizeX, block_reduce_algorithm Algorithm = block_reduce_algorithm::default_algorithm, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
using block_reduce< T, BlockSizeX, Algorithm, BlockSizeY, BlockSizeZ >::storage_type = typename base_type::storage_type

Struct used to allocate a temporary memory that is required for thread communication during operations provided by related parallel primitive.

Depending on the implemention the operations exposed by parallel primitive may require a temporary storage for thread communication. The storage should be allocated using keywords __shared__. It can be aliased to an externally allocated memory, or be a part of a union type with other storage types to increase shared memory reusability.

Member Function Documentation

◆ reduce() [1/6]

template<class T , unsigned int BlockSizeX, block_reduce_algorithm Algorithm = block_reduce_algorithm::default_algorithm, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
template<class BinaryFunction = ::rocprim::plus<T>>
ROCPRIM_DEVICE ROCPRIM_INLINE void block_reduce< T, BlockSizeX, Algorithm, BlockSizeY, BlockSizeZ >::reduce ( input,
T &  output,
storage_type storage,
BinaryFunction  reduce_op = BinaryFunction() 
)
inline

Performs reduction across threads in a block.

Template Parameters
BinaryFunction- type of binary function used for reduce. Default type is rocprim::plus<T>.
Parameters
[in]input- thread input value.
[out]output- reference to a thread output value. May be aliased with input.
[in]storage- reference to a temporary storage object of type storage_type.
[in]reduce_op- binary operation function object that will be used for reduce. The signature of the function should be equivalent to the following: T f(const T &a, const T &b);. The signature does not need to have const &, but function object must not modify the objects passed to it.
Storage reusage
Synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().
Examples

The examples present min reduce operations performed on a block of 256 threads, each provides one float value.

__global__ void example_kernel(...) // blockDim.x = 256
{
// specialize block_reduce for float and block of 256 threads
using block_reduce_f = rocprim::block_reduce<float, 256>;
// allocate storage in shared memory for the block
__shared__ block_reduce_float::storage_type storage;
float input = ...;
float output;
// execute min reduce
block_reduce_float().reduce(
input,
output,
storage,
rocprim::minimum<float>()
);
...
}

If the input values across threads in a block are {1, -2, 3, -4, ..., 255, -256}, then output value will be {-256}.

◆ reduce() [2/6]

template<class T , unsigned int BlockSizeX, block_reduce_algorithm Algorithm = block_reduce_algorithm::default_algorithm, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
template<class BinaryFunction = ::rocprim::plus<T>>
ROCPRIM_DEVICE ROCPRIM_FORCE_INLINE void block_reduce< T, BlockSizeX, Algorithm, BlockSizeY, BlockSizeZ >::reduce ( input,
T &  output,
BinaryFunction  reduce_op = BinaryFunction() 
)
inline

This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.

Performs reduction across threads in a block.

  • This overload does not accept storage argument. Required shared memory is allocated by the method itself.
Template Parameters
BinaryFunction- type of binary function used for reduce. Default type is rocprim::plus<T>.
Parameters
[in]input- thread input value.
[out]output- reference to a thread output value. May be aliased with input.
[in]reduce_op- binary operation function object that will be used for reduce. The signature of the function should be equivalent to the following: T f(const T &a, const T &b);. The signature does not need to have const &, but function object must not modify the objects passed to it.

◆ reduce() [3/6]

template<class T , unsigned int BlockSizeX, block_reduce_algorithm Algorithm = block_reduce_algorithm::default_algorithm, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
template<unsigned int ItemsPerThread, class BinaryFunction = ::rocprim::plus<T>>
ROCPRIM_DEVICE ROCPRIM_INLINE void block_reduce< T, BlockSizeX, Algorithm, BlockSizeY, BlockSizeZ >::reduce ( T(&)  input[ItemsPerThread],
T &  output,
storage_type storage,
BinaryFunction  reduce_op = BinaryFunction() 
)
inline

Performs reduction across threads in a block.

Template Parameters
ItemsPerThread- number of items in the input array.
BinaryFunction- type of binary function used for reduce. Default type is rocprim::plus<T>.
Parameters
[in]input- reference to an array containing thread input values.
[out]output- reference to a thread output array. May be aliased with input.
[in]storage- reference to a temporary storage object of type storage_type.
[in]reduce_op- binary operation function object that will be used for reduce. The signature of the function should be equivalent to the following: T f(const T &a, const T &b);. The signature does not need to have const &, but function object must not modify the objects passed to it.
Storage reusage
Synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().
Examples

The examples present maximum reduce operations performed on a block of 128 threads, each provides two long value.

__global__ void example_kernel(...) // blockDim.x = 128
{
// specialize block_reduce for long and block of 128 threads
using block_reduce_f = rocprim::block_reduce<long, 128>;
// allocate storage in shared memory for the block
__shared__ block_reduce_long::storage_type storage;
long input[2] = ...;
long output[2];
// execute max reduce
block_reduce_long().reduce(
input,
output,
storage,
rocprim::maximum<long>()
);
...
}

If the input values across threads in a block are {-1, 2, -3, 4, ..., -255, 256}, then output value will be {256}.

◆ reduce() [4/6]

template<class T , unsigned int BlockSizeX, block_reduce_algorithm Algorithm = block_reduce_algorithm::default_algorithm, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
template<unsigned int ItemsPerThread, class BinaryFunction = ::rocprim::plus<T>>
ROCPRIM_DEVICE ROCPRIM_FORCE_INLINE void block_reduce< T, BlockSizeX, Algorithm, BlockSizeY, BlockSizeZ >::reduce ( T(&)  input[ItemsPerThread],
T &  output,
BinaryFunction  reduce_op = BinaryFunction() 
)
inline

This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.

Performs reduction across threads in a block.

  • This overload does not accept storage argument. Required shared memory is allocated by the method itself.
Template Parameters
ItemsPerThread- number of items in the input array.
BinaryFunction- type of binary function used for reduce. Default type is rocprim::plus<T>.
Parameters
[in]input- reference to an array containing thread input values.
[out]output- reference to a thread output array. May be aliased with input.
[in]reduce_op- binary operation function object that will be used for reduce. The signature of the function should be equivalent to the following: T f(const T &a, const T &b);. The signature does not need to have const &, but function object must not modify the objects passed to it.

◆ reduce() [5/6]

template<class T , unsigned int BlockSizeX, block_reduce_algorithm Algorithm = block_reduce_algorithm::default_algorithm, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
template<class BinaryFunction = ::rocprim::plus<T>>
ROCPRIM_DEVICE ROCPRIM_INLINE void block_reduce< T, BlockSizeX, Algorithm, BlockSizeY, BlockSizeZ >::reduce ( input,
T &  output,
unsigned int  valid_items,
storage_type storage,
BinaryFunction  reduce_op = BinaryFunction() 
)
inline

Performs reduction across threads in a block.

Template Parameters
BinaryFunction- type of binary function used for reduce. Default type is rocprim::plus<T>.
Parameters
[in]input- thread input value.
[out]output- reference to a thread output value. May be aliased with input.
[in]valid_items- number of items that will be reduced in the block.
[in]storage- reference to a temporary storage object of type storage_type.
[in]reduce_op- binary operation function object that will be used for reduce. The signature of the function should be equivalent to the following: T f(const T &a, const T &b);. The signature does not need to have const &, but function object must not modify the objects passed to it.
Storage reusage
Synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().
Examples

The examples present min reduce operations performed on a block of 256 threads, each provides one float value.

__global__ void example_kernel(...) // blockDim.x = 256
{
// specialize block_reduce for float and block of 256 threads
using block_reduce_f = rocprim::block_reduce<float, 256>;
// allocate storage in shared memory for the block
__shared__ block_reduce_float::storage_type storage;
float input = ...;
unsigned int valid_items = 250;
float output;
// execute min reduce
block_reduce_float().reduce(
input,
output,
valid_items,
storage,
rocprim::minimum<float>()
);
...
}

◆ reduce() [6/6]

template<class T , unsigned int BlockSizeX, block_reduce_algorithm Algorithm = block_reduce_algorithm::default_algorithm, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
template<class BinaryFunction = ::rocprim::plus<T>>
ROCPRIM_DEVICE ROCPRIM_FORCE_INLINE void block_reduce< T, BlockSizeX, Algorithm, BlockSizeY, BlockSizeZ >::reduce ( input,
T &  output,
unsigned int  valid_items,
BinaryFunction  reduce_op = BinaryFunction() 
)
inline

This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.

Performs reduction across threads in a block.

  • This overload does not accept storage argument. Required shared memory is allocated by the method itself.
Template Parameters
ItemsPerThread- number of items in the input array.
BinaryFunction- type of binary function used for reduce. Default type is rocprim::plus<T>.
Parameters
[in]input- reference to an array containing thread input values.
[out]output- reference to a thread output array. May be aliased with input.
[in]valid_items- number of items that will be reduced in the block.
[in]reduce_op- binary operation function object that will be used for reduce. The signature of the function should be equivalent to the following: T f(const T &a, const T &b);. The signature does not need to have const &, but function object must not modify the objects passed to it.

The documentation for this class was generated from the following file: