rocPRIM
warp_reduce< T, WarpSize, UseAllReduce > Class Template Reference

The warp_reduce class is a warp level parallel primitive which provides methods for performing reduction operations on items partitioned across threads in a hardware warp. More...

#include <warp_reduce.hpp>

Inheritance diagram for warp_reduce< T, WarpSize, UseAllReduce >.
Collaboration diagram for warp_reduce< T, WarpSize, UseAllReduce >.

Public Types

using storage_type = typename base_type::storage_type
 Struct used to allocate temporary memory required for thread communication during operations provided by the related parallel primitive. More...
 

Public Member Functions

template<class BinaryFunction = ::rocprim::plus<T>, unsigned int FunctionWarpSize = WarpSize>
ROCPRIM_DEVICE ROCPRIM_INLINE auto reduce (T input, T &output, storage_type &storage, BinaryFunction reduce_op=BinaryFunction()) -> typename std::enable_if<(FunctionWarpSize<=__AMDGCN_WAVEFRONT_SIZE), void >::type
 Performs reduction across threads in a logical warp. More...
 
template<class BinaryFunction = ::rocprim::plus<T>, unsigned int FunctionWarpSize = WarpSize>
ROCPRIM_DEVICE ROCPRIM_INLINE auto reduce (T, T &, storage_type &, BinaryFunction reduce_op=BinaryFunction()) -> typename std::enable_if<(FunctionWarpSize > __AMDGCN_WAVEFRONT_SIZE), void >::type
 Performs reduction across threads in a logical warp. More...
 
template<class BinaryFunction = ::rocprim::plus<T>, unsigned int FunctionWarpSize = WarpSize>
ROCPRIM_DEVICE ROCPRIM_INLINE auto reduce (T input, T &output, int valid_items, storage_type &storage, BinaryFunction reduce_op=BinaryFunction()) -> typename std::enable_if<(FunctionWarpSize<=__AMDGCN_WAVEFRONT_SIZE), void >::type
 Performs reduction across threads in a logical warp. More...
 
template<class BinaryFunction = ::rocprim::plus<T>, unsigned int FunctionWarpSize = WarpSize>
ROCPRIM_DEVICE ROCPRIM_INLINE auto reduce (T, T &, int, storage_type &, BinaryFunction reduce_op=BinaryFunction()) -> typename std::enable_if<(FunctionWarpSize > __AMDGCN_WAVEFRONT_SIZE), void >::type
 Performs reduction across threads in a logical warp. More...
 
template<class Flag , class BinaryFunction = ::rocprim::plus<T>, unsigned int FunctionWarpSize = WarpSize>
ROCPRIM_DEVICE ROCPRIM_INLINE auto head_segmented_reduce (T input, T &output, Flag flag, storage_type &storage, BinaryFunction reduce_op=BinaryFunction()) -> typename std::enable_if<(FunctionWarpSize<=__AMDGCN_WAVEFRONT_SIZE), void >::type
 Performs head-segmented reduction across threads in a logical warp. More...
 
template<class Flag , class BinaryFunction = ::rocprim::plus<T>, unsigned int FunctionWarpSize = WarpSize>
ROCPRIM_DEVICE ROCPRIM_INLINE auto head_segmented_reduce (T, T &, Flag, storage_type &, BinaryFunction reduce_op=BinaryFunction()) -> typename std::enable_if<(FunctionWarpSize > __AMDGCN_WAVEFRONT_SIZE), void >::type
 Performs head-segmented reduction across threads in a logical warp. More...
 
template<class Flag , class BinaryFunction = ::rocprim::plus<T>, unsigned int FunctionWarpSize = WarpSize>
ROCPRIM_DEVICE ROCPRIM_INLINE auto tail_segmented_reduce (T input, T &output, Flag flag, storage_type &storage, BinaryFunction reduce_op=BinaryFunction()) -> typename std::enable_if<(FunctionWarpSize<=__AMDGCN_WAVEFRONT_SIZE), void >::type
 Performs tail-segmented reduction across threads in a logical warp. More...
 
template<class Flag , class BinaryFunction = ::rocprim::plus<T>, unsigned int FunctionWarpSize = WarpSize>
ROCPRIM_DEVICE ROCPRIM_INLINE auto tail_segmented_reduce (T, T &, Flag, storage_type &, BinaryFunction reduce_op=BinaryFunction()) -> typename std::enable_if<(FunctionWarpSize > __AMDGCN_WAVEFRONT_SIZE), void >::type
 Performs tail-segmented reduction across threads in a logical warp. More...
 

Detailed Description

template<class T, unsigned int WarpSize = device_warp_size(), bool UseAllReduce = false>
class warp_reduce< T, WarpSize, UseAllReduce >

The warp_reduce class is a warp level parallel primitive which provides methods for performing reduction operations on items partitioned across threads in a hardware warp.

Template Parameters
T - the input/output type.
WarpSize - the size of the logical warp, which must be equal to or less than the size of the hardware warp (see rocprim::device_warp_size()). Reduce operations are performed separately within groups determined by WarpSize.
UseAllReduce - whether to broadcast the final reduction value to all threads (default is false).
Overview
  • WarpSize must be equal to or less than the size of the hardware warp (see rocprim::device_warp_size()). If it is less, the reduce is performed separately within groups determined by WarpSize.
    For example, if WarpSize is 4 and the hardware warp is 64, the reduction is performed in logical warps grouped like this: { {0, 1, 2, 3}, {4, 5, 6, 7}, ..., {60, 61, 62, 63} } (each thread is represented here by its id within the hardware warp).
  • A logical warp is a group of WarpSize consecutive threads from the same hardware warp.
  • Supports non-commutative reduce operators. However, the reduce operator should be associative. When used with non-associative functions the results may be non-deterministic and/or vary in precision.
  • The number of threads executing warp_reduce's functions must be a multiple of WarpSize.
  • All threads from a logical warp must be in the same hardware warp.
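The effect of the UseAllReduce parameter described above can be sketched as follows (a sketch only, assuming the same configuration as the examples below: logical warps of 16 threads, hardware warp size 64):

```cpp
__global__ void example_kernel(...)
{
    // specialize warp_reduce with UseAllReduce = true
    using warp_allreduce_int = rocprim::warp_reduce<int, 16, true>;
    __shared__ warp_allreduce_int::storage_type temp[4];
    int logical_warp_id = threadIdx.x/16;
    int value = ...;
    warp_allreduce_int().reduce(
        value, // input
        value, // output
        temp[logical_warp_id]
    );
    // here every thread of the logical warp holds the reduction result,
    // not only the first thread (as would be the case with the default
    // UseAllReduce = false)
    ...
}
```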
Examples

In this example, a reduce operation is performed on groups of 16 threads; each thread provides one int value, and the result is returned in the same variable that was used for input. The hardware warp size is 64 and the block (tile) size is 64.

__global__ void example_kernel(...)
{
    // specialize warp_reduce for int and logical warp of 16 threads
    using warp_reduce_int = rocprim::warp_reduce<int, 16>;
    // allocate storage in shared memory
    __shared__ warp_reduce_int::storage_type temp[4];
    int logical_warp_id = threadIdx.x/16;
    int value = ...;
    // execute reduce
    warp_reduce_int().reduce(
        value, // input
        value, // output
        temp[logical_warp_id]
    );
    ...
}

Member Typedef Documentation

◆ storage_type

template<class T, unsigned int WarpSize = device_warp_size(), bool UseAllReduce = false>
using warp_reduce< T, WarpSize, UseAllReduce >::storage_type = typename base_type::storage_type

Struct used to allocate temporary memory required for thread communication during operations provided by the related parallel primitive.

Depending on the implementation, the operations exposed by the parallel primitive may require temporary storage for thread communication. The storage should be allocated using the __shared__ keyword. It can be aliased to externally allocated memory, or be part of a union with other storage types to increase shared memory reusability.
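The union aliasing mentioned above can be sketched as follows (a sketch only; it assumes a second primitive such as rocprim::warp_scan is also used in the kernel, and the same 16-thread logical warp configuration as the examples in this page):

```cpp
__global__ void example_kernel(...)
{
    using warp_reduce_int = rocprim::warp_reduce<int, 16>;
    using warp_scan_int = rocprim::warp_scan<int, 16>;
    // the two storage arrays occupy the same shared memory, because the
    // two primitives are never in flight at the same time
    __shared__ union {
        warp_reduce_int::storage_type reduce[4];
        warp_scan_int::storage_type scan[4];
    } temp;
    int logical_warp_id = threadIdx.x/16;
    int value = ...;
    warp_reduce_int().reduce(value, value, temp.reduce[logical_warp_id]);
    // barrier before the storage is repurposed for the scan
    rocprim::syncthreads();
    warp_scan_int().inclusive_scan(value, value, temp.scan[logical_warp_id]);
    ...
}
```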

Member Function Documentation

◆ head_segmented_reduce() [1/2]

template<class T, unsigned int WarpSize = device_warp_size(), bool UseAllReduce = false>
template<class Flag , class BinaryFunction = ::rocprim::plus<T>, unsigned int FunctionWarpSize = WarpSize>
ROCPRIM_DEVICE ROCPRIM_INLINE auto warp_reduce< T, WarpSize, UseAllReduce >::head_segmented_reduce (
    T input,
    T & output,
    Flag flag,
    storage_type & storage,
    BinaryFunction reduce_op = BinaryFunction()
) -> typename std::enable_if<(FunctionWarpSize <= __AMDGCN_WAVEFRONT_SIZE), void>::type
inline

Performs head-segmented reduction across threads in a logical warp.

Template Parameters
Flag - type of head flags. Must be contextually convertible to bool.
BinaryFunction - type of binary function used for reduce. Default type is rocprim::plus<T>.
Parameters
[in] input - thread input value.
[out] output - reference to a thread output value. May be aliased with input.
[in] flag - thread head flag; true flags mark the beginnings of segments.
[in] storage - reference to a temporary storage object of type storage_type.
[in] reduce_op - binary operation function object that will be used for the reduce. The signature of the function should be equivalent to the following: T f(const T &a, const T &b);. The signature does not need to have const &, but the function object must not modify the objects passed to it.
Storage reuse
A synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().
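The head-segmented semantics can be illustrated with a short host-side reference (a hypothetical helper written for illustration, not part of the rocPRIM API; it assumes, following the flag description above, that each segment-head lane accumulates the values of its segment while other lanes keep their input value):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side reference for head-segmented reduce over one logical warp.
// Lanes whose head flag is set start a new segment; lane 0 implicitly
// starts the first segment. Each segment-head lane accumulates the
// values of its segment; other lanes keep their input value.
std::vector<int> head_segmented_reduce_ref(const std::vector<int>& input,
                                           const std::vector<bool>& flags)
{
    std::vector<int> output(input);
    std::size_t head = 0; // lane 0 implicitly heads the first segment
    for(std::size_t i = 0; i < input.size(); ++i)
    {
        if(flags[i]) head = i;           // a set flag opens a new segment
        if(i != head) output[head] += input[i]; // fold value into its head lane
    }
    return output;
}
```

For example, with inputs {1, 2, 3, 4, 5, 6, 7, 8} and a head flag set only on lane 4, lane 0 receives 1+2+3+4 and lane 4 receives 5+6+7+8.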

◆ head_segmented_reduce() [2/2]

template<class T, unsigned int WarpSize = device_warp_size(), bool UseAllReduce = false>
template<class Flag , class BinaryFunction = ::rocprim::plus<T>, unsigned int FunctionWarpSize = WarpSize>
ROCPRIM_DEVICE ROCPRIM_INLINE auto warp_reduce< T, WarpSize, UseAllReduce >::head_segmented_reduce (
    T,
    T &,
    Flag,
    storage_type &,
    BinaryFunction reduce_op = BinaryFunction()
) -> typename std::enable_if<(FunctionWarpSize > __AMDGCN_WAVEFRONT_SIZE), void>::type
inline

Performs head-segmented reduction across threads in a logical warp.

Invalid Warp Size
This overload is selected only when FunctionWarpSize is greater than the hardware wavefront size (see the enable_if condition in the signature).

◆ reduce() [1/4]

template<class T, unsigned int WarpSize = device_warp_size(), bool UseAllReduce = false>
template<class BinaryFunction = ::rocprim::plus<T>, unsigned int FunctionWarpSize = WarpSize>
ROCPRIM_DEVICE ROCPRIM_INLINE auto warp_reduce< T, WarpSize, UseAllReduce >::reduce (
    T input,
    T & output,
    storage_type & storage,
    BinaryFunction reduce_op = BinaryFunction()
) -> typename std::enable_if<(FunctionWarpSize <= __AMDGCN_WAVEFRONT_SIZE), void>::type
inline

Performs reduction across threads in a logical warp.

Template Parameters
BinaryFunction - type of binary function used for reduce. Default type is rocprim::plus<T>.
Parameters
[in] input - thread input value.
[out] output - reference to a thread output value. May be aliased with input.
[in] storage - reference to a temporary storage object of type storage_type.
[in] reduce_op - binary operation function object that will be used for the reduce. The signature of the function should be equivalent to the following: T f(const T &a, const T &b);. The signature does not need to have const &, but the function object must not modify the objects passed to it.
Storage reuse
A synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().
Examples

In this example, a reduce operation is performed on groups of 16 threads; each thread provides one int value, and the result is returned in the same variable that was used for input. The hardware warp size is 64 and the block (tile) size is 64.

__global__ void example_kernel(...)
{
    // specialize warp_reduce for int and logical warp of 16 threads
    using warp_reduce_int = rocprim::warp_reduce<int, 16>;
    // allocate storage in shared memory
    __shared__ warp_reduce_int::storage_type temp[4];
    int logical_warp_id = threadIdx.x/16;
    int value = ...;
    // execute reduction with a custom binary operator
    warp_reduce_int().reduce(
        value, // input
        value, // output
        temp[logical_warp_id],
        rocprim::minimum<int>()
    );
    ...
}

◆ reduce() [2/4]

template<class T, unsigned int WarpSize = device_warp_size(), bool UseAllReduce = false>
template<class BinaryFunction = ::rocprim::plus<T>, unsigned int FunctionWarpSize = WarpSize>
ROCPRIM_DEVICE ROCPRIM_INLINE auto warp_reduce< T, WarpSize, UseAllReduce >::reduce (
    T,
    T &,
    storage_type &,
    BinaryFunction reduce_op = BinaryFunction()
) -> typename std::enable_if<(FunctionWarpSize > __AMDGCN_WAVEFRONT_SIZE), void>::type
inline

Performs reduction across threads in a logical warp.

Invalid Warp Size
This overload is selected only when FunctionWarpSize is greater than the hardware wavefront size (see the enable_if condition in the signature).

◆ reduce() [3/4]

template<class T, unsigned int WarpSize = device_warp_size(), bool UseAllReduce = false>
template<class BinaryFunction = ::rocprim::plus<T>, unsigned int FunctionWarpSize = WarpSize>
ROCPRIM_DEVICE ROCPRIM_INLINE auto warp_reduce< T, WarpSize, UseAllReduce >::reduce (
    T input,
    T & output,
    int valid_items,
    storage_type & storage,
    BinaryFunction reduce_op = BinaryFunction()
) -> typename std::enable_if<(FunctionWarpSize <= __AMDGCN_WAVEFRONT_SIZE), void>::type
inline

Performs reduction across threads in a logical warp.

Template Parameters
BinaryFunction - type of binary function used for reduce. Default type is rocprim::plus<T>.
Parameters
[in] input - thread input value.
[out] output - reference to a thread output value. May be aliased with input.
[in] valid_items - number of items that will be reduced in the warp.
[in] storage - reference to a temporary storage object of type storage_type.
[in] reduce_op - binary operation function object that will be used for the reduce. The signature of the function should be equivalent to the following: T f(const T &a, const T &b);. The signature does not need to have const &, but the function object must not modify the objects passed to it.
Storage reuse
A synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().
Examples

In this example, a reduce operation is performed on groups of 16 threads; each thread provides one int value, and the result is returned in the same variable that was used for input. The hardware warp size is 64 and the block (tile) size is 64.

__global__ void example_kernel(...)
{
    // specialize warp_reduce for int and logical warp of 16 threads
    using warp_reduce_int = rocprim::warp_reduce<int, 16>;
    // allocate storage in shared memory
    __shared__ warp_reduce_int::storage_type temp[4];
    int logical_warp_id = threadIdx.x/16;
    int value = ...;
    int valid_items = 4;
    // execute reduction
    warp_reduce_int().reduce(
        value, // input
        value, // output
        valid_items,
        temp[logical_warp_id]
    );
    ...
}

◆ reduce() [4/4]

template<class T, unsigned int WarpSize = device_warp_size(), bool UseAllReduce = false>
template<class BinaryFunction = ::rocprim::plus<T>, unsigned int FunctionWarpSize = WarpSize>
ROCPRIM_DEVICE ROCPRIM_INLINE auto warp_reduce< T, WarpSize, UseAllReduce >::reduce (
    T,
    T &,
    int,
    storage_type &,
    BinaryFunction reduce_op = BinaryFunction()
) -> typename std::enable_if<(FunctionWarpSize > __AMDGCN_WAVEFRONT_SIZE), void>::type
inline

Performs reduction across threads in a logical warp.

Invalid Warp Size
This overload is selected only when FunctionWarpSize is greater than the hardware wavefront size (see the enable_if condition in the signature).

◆ tail_segmented_reduce() [1/2]

template<class T, unsigned int WarpSize = device_warp_size(), bool UseAllReduce = false>
template<class Flag , class BinaryFunction = ::rocprim::plus<T>, unsigned int FunctionWarpSize = WarpSize>
ROCPRIM_DEVICE ROCPRIM_INLINE auto warp_reduce< T, WarpSize, UseAllReduce >::tail_segmented_reduce (
    T input,
    T & output,
    Flag flag,
    storage_type & storage,
    BinaryFunction reduce_op = BinaryFunction()
) -> typename std::enable_if<(FunctionWarpSize <= __AMDGCN_WAVEFRONT_SIZE), void>::type
inline

Performs tail-segmented reduction across threads in a logical warp.

Template Parameters
Flag - type of tail flags. Must be contextually convertible to bool.
BinaryFunction - type of binary function used for reduce. Default type is rocprim::plus<T>.
Parameters
[in] input - thread input value.
[out] output - reference to a thread output value. May be aliased with input.
[in] flag - thread tail flag; true flags mark the ends of segments.
[in] storage - reference to a temporary storage object of type storage_type.
[in] reduce_op - binary operation function object that will be used for the reduce. The signature of the function should be equivalent to the following: T f(const T &a, const T &b);. The signature does not need to have const &, but the function object must not modify the objects passed to it.
Storage reuse
A synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().
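A device-side usage sketch, mirroring the reduce() examples elsewhere on this page (a sketch only, assuming 16-thread logical warps within a 64-thread hardware warp; here the last thread of every logical warp is flagged as a segment tail):

```cpp
__global__ void example_kernel(...)
{
    // specialize warp_reduce for int and logical warp of 16 threads
    using warp_reduce_int = rocprim::warp_reduce<int, 16>;
    __shared__ warp_reduce_int::storage_type temp[4];
    int logical_warp_id = threadIdx.x/16;
    int value = ...;
    // mark the last thread of each logical warp as a segment tail
    bool flag = (threadIdx.x % 16) == 15;
    warp_reduce_int().tail_segmented_reduce(
        value, // input
        value, // output
        flag,
        temp[logical_warp_id]
    );
    ...
}
```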

◆ tail_segmented_reduce() [2/2]

template<class T, unsigned int WarpSize = device_warp_size(), bool UseAllReduce = false>
template<class Flag , class BinaryFunction = ::rocprim::plus<T>, unsigned int FunctionWarpSize = WarpSize>
ROCPRIM_DEVICE ROCPRIM_INLINE auto warp_reduce< T, WarpSize, UseAllReduce >::tail_segmented_reduce (
    T,
    T &,
    Flag,
    storage_type &,
    BinaryFunction reduce_op = BinaryFunction()
) -> typename std::enable_if<(FunctionWarpSize > __AMDGCN_WAVEFRONT_SIZE), void>::type
inline

Performs tail-segmented reduction across threads in a logical warp.

Invalid Warp Size
This overload is selected only when FunctionWarpSize is greater than the hardware wavefront size (see the enable_if condition in the signature).


The documentation for this class was generated from the following file: