The block_reduce class is a block-level parallel primitive that provides methods for performing reduction operations on items partitioned across threads in a block.

#include <block_reduce.hpp>


Public Types
using	storage_type = typename base_type::storage_type
	Struct used to allocate temporary memory that is required for thread communication during operations provided by the related parallel primitive.

Public Member Functions
template<class BinaryFunction = ::rocprim::plus<T>>
ROCPRIM_DEVICE ROCPRIM_INLINE void	reduce (T input, T &output, storage_type &storage, BinaryFunction reduce_op=BinaryFunction())
	Performs reduction across threads in a block.

template<class BinaryFunction = ::rocprim::plus<T>>
ROCPRIM_DEVICE ROCPRIM_FORCE_INLINE void	reduce (T input, T &output, BinaryFunction reduce_op=BinaryFunction())
	This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.

template<unsigned int ItemsPerThread, class BinaryFunction = ::rocprim::plus<T>>
ROCPRIM_DEVICE ROCPRIM_INLINE void	reduce (T(&input)[ItemsPerThread], T &output, storage_type &storage, BinaryFunction reduce_op=BinaryFunction())
	Performs reduction across threads in a block.

template<unsigned int ItemsPerThread, class BinaryFunction = ::rocprim::plus<T>>
ROCPRIM_DEVICE ROCPRIM_FORCE_INLINE void	reduce (T(&input)[ItemsPerThread], T &output, BinaryFunction reduce_op=BinaryFunction())
	This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.

template<class BinaryFunction = ::rocprim::plus<T>>
ROCPRIM_DEVICE ROCPRIM_INLINE void	reduce (T input, T &output, unsigned int valid_items, storage_type &storage, BinaryFunction reduce_op=BinaryFunction())
	Performs reduction across threads in a block.

template<class BinaryFunction = ::rocprim::plus<T>>
ROCPRIM_DEVICE ROCPRIM_FORCE_INLINE void	reduce (T input, T &output, unsigned int valid_items, BinaryFunction reduce_op=BinaryFunction())
	This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.

Detailed Description

template<class T, unsigned int BlockSizeX, block_reduce_algorithm Algorithm = block_reduce_algorithm::default_algorithm, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
class block_reduce< T, BlockSizeX, Algorithm, BlockSizeY, BlockSizeZ >

The block_reduce class is a block-level parallel primitive that provides methods for performing reduction operations on items partitioned across threads in a block.

Template Parameters

T	- the input/output type.
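BlockSizeX	- the number of threads in the block's x dimension.
Algorithm	- selected reduce algorithm, block_reduce_algorithm::default_algorithm by default.
BlockSizeY	- the number of threads in the block's y dimension, 1 by default.
BlockSizeZ	- the number of threads in the block's z dimension, 1 by default.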

Overview

Supports non-commutative reduce operators. However, a reduce operator should be associative. When used with non-associative functions the results may be non-deterministic and/or vary in precision.
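To illustrate why associativity matters while commutativity does not, the following host-side C++ sketch compares a sequential left-to-right reduction with a pairwise (tree-shaped) reduction that keeps the operand order. It does not use rocPRIM: `sequential_reduce`, `pairwise_reduce`, `concat` and `minus_op` are hypothetical helpers, and the pairwise grouping is only one of many groupings a parallel reduction might apply.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: reduces left to right, as a single thread would.
// The input must not be empty.
template <class T, class BinaryFunction>
T sequential_reduce(const std::vector<T>& items, BinaryFunction op)
{
    T result = items[0];
    for (std::size_t i = 1; i < items.size(); ++i)
        result = op(result, items[i]);
    return result;
}

// Hypothetical helper: combines neighbouring pairs level by level. The
// left/right operand order is kept, only the grouping changes.
template <class T, class BinaryFunction>
T pairwise_reduce(std::vector<T> items, BinaryFunction op)
{
    while (items.size() > 1)
    {
        std::vector<T> next;
        for (std::size_t i = 0; i + 1 < items.size(); i += 2)
            next.push_back(op(items[i], items[i + 1]));
        if (items.size() % 2 == 1)
            next.push_back(items.back()); // the odd element is carried over
        items = next;
    }
    return items[0];
}

// Associative but not commutative: string concatenation.
struct concat
{
    std::string operator()(const std::string& a, const std::string& b) const
    {
        return a + b;
    }
};

// Not associative: subtraction.
struct minus_op
{
    int operator()(const int& a, const int& b) const { return a - b; }
};
```

With these helpers, concatenating {"a", "b", "c", "d", "e"} yields "abcde" under both groupings, whereas subtracting over {8, 4, 2, 1} gives 1 sequentially and 3 pairwise, so the result of a non-associative operator depends on how the reduction is grouped.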
Computation can be more efficient when:
- ItemsPerThread is greater than one,
- T is an arithmetic type,
- the reduce operation is a simple addition operator, and
- the number of threads in the block is a multiple of the hardware warp size (see rocprim::device_warp_size()).
block_reduce has three alternative implementations: block_reduce_algorithm::using_warp_reduce, block_reduce_algorithm::raking_reduce and block_reduce_algorithm::raking_reduce_commutative_only.
If the block size is less than 64, only one warp reduction is used. When the block size is larger than the warp size, the block reduction algorithm stores the result only in the first thread (lane_id = 0, warp_id = 0).
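As a rough mental model of a reduction built from warp reductions, the host-side C++ sketch below reduces each warp-sized chunk first and then reduces the per-warp partial results. This is an illustration under stated assumptions, not rocPRIM's implementation: `two_stage_reduce` and `plus_int` are hypothetical names, and the warp size is passed explicitly instead of being queried from the device.

```cpp
#include <cstddef>
#include <vector>

// Host-side model only: stage 1 reduces every warp-sized chunk, stage 2
// reduces the per-warp partial results. Operand order is kept in both stages.
// The input must not be empty and warp_size must be at least 1.
template <class T, class BinaryFunction>
T two_stage_reduce(const std::vector<T>& thread_values,
                   std::size_t warp_size, // assumption: supplied by the caller
                   BinaryFunction op)
{
    std::vector<T> warp_partials;
    for (std::size_t begin = 0; begin < thread_values.size(); begin += warp_size)
    {
        std::size_t end = begin + warp_size;
        if (end > thread_values.size())
            end = thread_values.size(); // last chunk may be partial
        T partial = thread_values[begin];
        for (std::size_t i = begin + 1; i < end; ++i)
            partial = op(partial, thread_values[i]);
        warp_partials.push_back(partial);
    }

    T result = warp_partials[0];
    for (std::size_t w = 1; w < warp_partials.size(); ++w)
        result = op(result, warp_partials[w]);
    return result; // in this model, the value the first thread would hold
}

// Hypothetical addition functor for int values.
struct plus_int
{
    int operator()(const int& a, const int& b) const { return a + b; }
};
```

For 192 values of 1 and a warp size of 64, the model produces three partial sums of 64 and a final result of 192. In a kernel, code that consumes the result would typically be guarded so that only the first thread reads `output`.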

Examples: In the example below, a reduce operation is performed on a block of 192 threads. Each thread provides one int value, and the result is returned in the same variable that was used for input.

__global__ void example_kernel(...)
{
    // specialize block_reduce for int and block of 192 threads
    using block_reduce_int = rocprim::block_reduce<int, 192>;
    // allocate storage in shared memory
    __shared__ block_reduce_int::storage_type storage;
    int value = ...;
    // execute reduce
    block_reduce_int().reduce(
        value, // input
        value, // output
        storage
    );
    ...
}

Member Typedef Documentation

◆ storage_type

template<class T , unsigned int BlockSizeX, block_reduce_algorithm Algorithm = block_reduce_algorithm::default_algorithm, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>

using block_reduce< T, BlockSizeX, Algorithm, BlockSizeY, BlockSizeZ >::storage_type = typename base_type::storage_type

Struct used to allocate temporary memory that is required for thread communication during operations provided by the related parallel primitive.

Depending on the implementation, the operations exposed by the parallel primitive may require temporary storage for thread communication. The storage should be allocated using the __shared__ keyword. It can be aliased to externally allocated memory, or be part of a union type with other storage types to increase shared memory reusability.
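The union approach can be sketched on the host with stand-in types. `reduce_storage`, `scan_storage` and `shared_storage` are hypothetical placeholders, not rocPRIM types; in a kernel the members would be the `storage_type` of each primitive in use and the union object would be declared with `__shared__`.

```cpp
#include <cstddef>

// Hypothetical stand-ins for the storage_type structs of two primitives.
struct reduce_storage
{
    int partials[64];
};

struct scan_storage
{
    float prefixes[256];
};

// Members of a union occupy the same bytes, so the union is only as large as
// its largest member instead of the sum of all members.
union shared_storage
{
    reduce_storage reduce;
    scan_storage   scan;
};

// Bytes needed if both storages were allocated side by side.
constexpr std::size_t separate_bytes = sizeof(reduce_storage) + sizeof(scan_storage);
// Bytes needed when both storages are overlaid in one union.
constexpr std::size_t overlaid_bytes = sizeof(shared_storage);
```

Because both members occupy the same memory, a synchronization barrier is needed between the last use of one member and the first use of the other.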

Member Function Documentation

◆ reduce() [1/6]

template<class T , unsigned int BlockSizeX, block_reduce_algorithm Algorithm = block_reduce_algorithm::default_algorithm, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>

template<class BinaryFunction = ::rocprim::plus<T>>

ROCPRIM_DEVICE ROCPRIM_INLINE void block_reduce< T, BlockSizeX, Algorithm, BlockSizeY, BlockSizeZ >::reduce	(	T	input,
		T &	output,
		storage_type &	storage,
		BinaryFunction	reduce_op = `BinaryFunction()`
	)

inline

Performs reduction across threads in a block.

Template Parameters

BinaryFunction - type of binary function used for reduce. Default type is rocprim::plus<T>.

Parameters

[in]	input	- thread input value.
[out]	output	- reference to a thread output value. May be aliased with `input`.
[in]	storage	- reference to a temporary storage object of type storage_type.
[in]	reduce_op	- binary operation function object that will be used for reduce. The signature of the function should be equivalent to the following: `T f(const T &a, const T &b);`. The signature does not need to have `const &`, but function object must not modify the objects passed to it.

Storage reuse: A synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().

Examples

The example presents a min reduce operation performed on a block of 256 threads, where each thread provides one float value.

__global__ void example_kernel(...) // blockDim.x = 256
{
    // specialize block_reduce for float and block of 256 threads
    using block_reduce_float = rocprim::block_reduce<float, 256>;
    // allocate storage in shared memory for the block
    __shared__ block_reduce_float::storage_type storage;
    float input = ...;
    float output;
    // execute min reduce
    block_reduce_float().reduce(
        input,
        output,
        storage,
        rocprim::minimum<float>()
    );
    ...
}

If the input values across threads in a block are {1, -2, 3, -4, ..., 255, -256}, then the output value will be {-256}.
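The `reduce_op` requirements and the expected result can be checked on the host before moving to a kernel. The sketch below defines a hypothetical functor `min_op` with the documented signature and applies it with `std::accumulate` to the input pattern from the example. It only models the expected result and does not call rocPRIM; `make_alternating_input` and `reference_min` are likewise hypothetical helpers.

```cpp
#include <numeric>
#include <vector>

// Hypothetical functor matching the documented signature
// T f(const T& a, const T& b); it does not modify its arguments.
struct min_op
{
    float operator()(const float& a, const float& b) const
    {
        return b < a ? b : a;
    }
};

// Builds {1, -2, 3, -4, ...} with `count` entries: one value per thread.
std::vector<float> make_alternating_input(int count)
{
    std::vector<float> values(count);
    for (int i = 0; i < count; ++i)
    {
        int magnitude = i + 1;
        values[i] = (magnitude % 2 == 1) ? float(magnitude) : -float(magnitude);
    }
    return values;
}

// Host-side reference for a block-wide reduction of a non-empty `values`.
float reference_min(const std::vector<float>& values)
{
    return std::accumulate(values.begin() + 1, values.end(), values[0], min_op());
}
```

Calling `reference_min(make_alternating_input(256))` returns -256, which matches the output stated for the kernel example.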

◆ reduce() [2/6]

template<class T , unsigned int BlockSizeX, block_reduce_algorithm Algorithm = block_reduce_algorithm::default_algorithm, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>

template<class BinaryFunction = ::rocprim::plus<T>>

ROCPRIM_DEVICE ROCPRIM_FORCE_INLINE void block_reduce< T, BlockSizeX, Algorithm, BlockSizeY, BlockSizeZ >::reduce	(	T	input,
		T &	output,
		BinaryFunction	reduce_op = `BinaryFunction()`
	)

inline

This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.

Performs reduction across threads in a block.

This overload does not accept a storage argument. The required shared memory is allocated by the method itself.

Template Parameters

BinaryFunction - type of binary function used for reduce. Default type is rocprim::plus<T>.

Parameters

[in]	input	- thread input value.
[out]	output	- reference to a thread output value. May be aliased with `input`.
[in]	reduce_op	- binary operation function object that will be used for reduce. The signature of the function should be equivalent to the following: `T f(const T &a, const T &b);`. The signature does not need to have `const &`, but function object must not modify the objects passed to it.

◆ reduce() [3/6]

template<class T , unsigned int BlockSizeX, block_reduce_algorithm Algorithm = block_reduce_algorithm::default_algorithm, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>

template<unsigned int ItemsPerThread, class BinaryFunction = ::rocprim::plus<T>>

ROCPRIM_DEVICE ROCPRIM_INLINE void block_reduce< T, BlockSizeX, Algorithm, BlockSizeY, BlockSizeZ >::reduce	(	T(&)	input[ItemsPerThread],
		T &	output,
		storage_type &	storage,
		BinaryFunction	reduce_op = `BinaryFunction()`
	)

inline

Performs reduction across threads in a block.

Template Parameters

ItemsPerThread	- number of items in the `input` array.
BinaryFunction	- type of binary function used for reduce. Default type is rocprim::plus<T>.

Parameters

[in]	input	- reference to an array containing thread input values.
[out]	output	- reference to a thread output value. May be aliased with `input`.
[in]	storage	- reference to a temporary storage object of type storage_type.
[in]	reduce_op	- binary operation function object that will be used for reduce. The signature of the function should be equivalent to the following: `T f(const T &a, const T &b);`. The signature does not need to have `const &`, but function object must not modify the objects passed to it.

Storage reuse: A synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().

Examples

The example presents a maximum reduce operation performed on a block of 128 threads, where each thread provides two long values.

__global__ void example_kernel(...) // blockDim.x = 128
{
    // specialize block_reduce for long and block of 128 threads
    using block_reduce_long = rocprim::block_reduce<long, 128>;
    // allocate storage in shared memory for the block
    __shared__ block_reduce_long::storage_type storage;
    long input[2] = ...;
    long output;
    // execute max reduce
    block_reduce_long().reduce(
        input,
        output,
        storage,
        rocprim::maximum<long>()
    );
    ...
}

If the input values across threads in a block are {-1, 2, -3, 4, ..., -255, 256}, then the output value will be {256}.
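A host-side model can make the role of ItemsPerThread concrete. In the sketch below every modelled thread first reduces its own items and the per-thread results are then reduced across threads. This is an assumption about one plausible ordering, not a description of rocPRIM's internals; `model_block_reduce`, `make_example_input` and `max_op` are hypothetical names, and the input is assumed to be in a blocked arrangement (thread t holds items 2t and 2t + 1).

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical maximum functor with the documented reduce_op signature.
struct max_op
{
    long operator()(const long& a, const long& b) const { return a < b ? b : a; }
};

// Host-side model: each "thread" owns ItemsPerThread values. Every thread
// reduces its own items first, then the per-thread results are reduced.
// The input must contain at least one thread.
template <std::size_t ItemsPerThread, class BinaryFunction>
long model_block_reduce(const std::vector<std::array<long, ItemsPerThread>>& per_thread,
                        BinaryFunction op)
{
    std::vector<long> thread_results;
    for (const auto& items : per_thread)
    {
        long local = items[0];
        for (std::size_t i = 1; i < ItemsPerThread; ++i)
            local = op(local, items[i]);
        thread_results.push_back(local);
    }

    long result = thread_results[0];
    for (std::size_t t = 1; t < thread_results.size(); ++t)
        result = op(result, thread_results[t]);
    return result;
}

// Builds {-1, 2, -3, 4, ...} with two consecutive items per thread.
std::vector<std::array<long, 2>> make_example_input(std::size_t threads)
{
    std::vector<std::array<long, 2>> per_thread(threads);
    for (std::size_t t = 0; t < threads; ++t)
    {
        long first = static_cast<long>(2 * t + 1);
        per_thread[t] = {-first, first + 1};
    }
    return per_thread;
}
```

For 128 modelled threads, `model_block_reduce(make_example_input(128), max_op())` returns 256, in agreement with the output stated for the kernel example.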

◆ reduce() [4/6]

template<class T , unsigned int BlockSizeX, block_reduce_algorithm Algorithm = block_reduce_algorithm::default_algorithm, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>

template<unsigned int ItemsPerThread, class BinaryFunction = ::rocprim::plus<T>>

ROCPRIM_DEVICE ROCPRIM_FORCE_INLINE void block_reduce< T, BlockSizeX, Algorithm, BlockSizeY, BlockSizeZ >::reduce	(	T(&)	input[ItemsPerThread],
		T &	output,
		BinaryFunction	reduce_op = `BinaryFunction()`
	)

inline

This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.

Performs reduction across threads in a block.

This overload does not accept a storage argument. The required shared memory is allocated by the method itself.

Template Parameters

ItemsPerThread	- number of items in the `input` array.
BinaryFunction	- type of binary function used for reduce. Default type is rocprim::plus<T>.

Parameters

[in]	input	- reference to an array containing thread input values.
[out]	output	- reference to a thread output value. May be aliased with `input`.
[in]	reduce_op	- binary operation function object that will be used for reduce. The signature of the function should be equivalent to the following: `T f(const T &a, const T &b);`. The signature does not need to have `const &`, but function object must not modify the objects passed to it.

◆ reduce() [5/6]

template<class T , unsigned int BlockSizeX, block_reduce_algorithm Algorithm = block_reduce_algorithm::default_algorithm, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>

template<class BinaryFunction = ::rocprim::plus<T>>

ROCPRIM_DEVICE ROCPRIM_INLINE void block_reduce< T, BlockSizeX, Algorithm, BlockSizeY, BlockSizeZ >::reduce	(	T	input,
		T &	output,
		unsigned int	valid_items,
		storage_type &	storage,
		BinaryFunction	reduce_op = `BinaryFunction()`
	)

inline

Performs reduction across threads in a block.

Template Parameters

BinaryFunction - type of binary function used for reduce. Default type is rocprim::plus<T>.

Parameters

[in]	input	- thread input value.
[out]	output	- reference to a thread output value. May be aliased with `input`.
[in]	valid_items	- number of items that will be reduced in the block.
[in]	storage	- reference to a temporary storage object of type storage_type.
[in]	reduce_op	- binary operation function object that will be used for reduce. The signature of the function should be equivalent to the following: `T f(const T &a, const T &b);`. The signature does not need to have `const &`, but function object must not modify the objects passed to it.

Storage reuse: A synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().

Examples: The example presents a min reduce operation performed on a block of 256 threads, where each thread provides one float value and valid_items is set to 250.

__global__ void example_kernel(...) // blockDim.x = 256
{
    // specialize block_reduce for float and block of 256 threads
    using block_reduce_float = rocprim::block_reduce<float, 256>;
    // allocate storage in shared memory for the block
    __shared__ block_reduce_float::storage_type storage;
    float input = ...;
    unsigned int valid_items = 250;
    float output;
    // execute min reduce
    block_reduce_float().reduce(
        input,
        output,
        valid_items,
        storage,
        rocprim::minimum<float>()
    );
    ...
}
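The effect of `valid_items` can be modelled on the host. The sketch below rests on an assumption that the source does not spell out: that only the inputs of the first `valid_items` threads, in thread order, take part in the reduction. `model_partial_reduce` and `min_op` are hypothetical names and the code does not call rocPRIM.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical minimum functor with the documented reduce_op signature.
struct min_op
{
    float operator()(const float& a, const float& b) const
    {
        return b < a ? b : a;
    }
};

// Host-side model. Assumption: only the inputs of the first `valid_items`
// threads are reduced and the remaining inputs are ignored.
// Requires a non-empty input and valid_items >= 1.
template <class BinaryFunction>
float model_partial_reduce(const std::vector<float>& thread_values,
                           std::size_t valid_items,
                           BinaryFunction op)
{
    if (valid_items > thread_values.size())
        valid_items = thread_values.size(); // clamp to the block size
    float result = thread_values[0];
    for (std::size_t i = 1; i < valid_items; ++i)
        result = op(result, thread_values[i]);
    return result;
}
```

With the inputs {5, 3, 9, 1}, the model returns 3 for `valid_items = 2`, because the trailing 1 is excluded, and 1 for `valid_items = 4`.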

◆ reduce() [6/6]

template<class T , unsigned int BlockSizeX, block_reduce_algorithm Algorithm = block_reduce_algorithm::default_algorithm, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>

template<class BinaryFunction = ::rocprim::plus<T>>

ROCPRIM_DEVICE ROCPRIM_FORCE_INLINE void block_reduce< T, BlockSizeX, Algorithm, BlockSizeY, BlockSizeZ >::reduce	(	T	input,
		T &	output,
		unsigned int	valid_items,
		BinaryFunction	reduce_op = `BinaryFunction()`
	)

inline

This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.

Performs reduction across threads in a block.

This overload does not accept a storage argument. The required shared memory is allocated by the method itself.

Template Parameters

BinaryFunction	- type of binary function used for reduce. Default type is rocprim::plus<T>.

Parameters

[in]	input	- thread input value.
[out]	output	- reference to a thread output value. May be aliased with `input`.
[in]	valid_items	- number of items that will be reduced in the block.
[in]	reduce_op	- binary operation function object that will be used for reduce. The signature of the function should be equivalent to the following: `T f(const T &a, const T &b);`. The signature does not need to have `const &`, but function object must not modify the objects passed to it.

The documentation for this class was generated from the following file:

rocprim/include/rocprim/block/block_reduce.hpp
