rocPRIM
warp_reduce< T, WarpSize, UseAllReduce > Class Template Reference

The warp_reduce class is a warp level parallel primitive which provides methods for performing reduction operations on items partitioned across threads in a hardware warp. More...

#include <warp_reduce.hpp>

Inheritance diagram for warp_reduce< T, WarpSize, UseAllReduce >.
Collaboration diagram for warp_reduce< T, WarpSize, UseAllReduce >.

Public Types

using storage_type = typename base_type::storage_type
 Struct used to allocate temporary memory required for thread communication during operations provided by the related parallel primitive. More...
 

Public Member Functions

template<class BinaryFunction = ::rocprim::plus<T>, unsigned int FunctionWarpSize = WarpSize>
ROCPRIM_DEVICE ROCPRIM_INLINE auto reduce (T input, T &output, storage_type &storage, BinaryFunction reduce_op=BinaryFunction()) -> typename std::enable_if<(FunctionWarpSize<=__AMDGCN_WAVEFRONT_SIZE), void >::type
 Performs reduction across threads in a logical warp. More...
 
template<class BinaryFunction = ::rocprim::plus<T>, unsigned int FunctionWarpSize = WarpSize>
ROCPRIM_DEVICE ROCPRIM_INLINE auto reduce (T, T &, storage_type &, BinaryFunction reduce_op=BinaryFunction()) -> typename std::enable_if<(FunctionWarpSize > __AMDGCN_WAVEFRONT_SIZE), void >::type
 Performs reduction across threads in a logical warp. More...
 
template<class BinaryFunction = ::rocprim::plus<T>, unsigned int FunctionWarpSize = WarpSize>
ROCPRIM_DEVICE ROCPRIM_INLINE auto reduce (T input, T &output, int valid_items, storage_type &storage, BinaryFunction reduce_op=BinaryFunction()) -> typename std::enable_if<(FunctionWarpSize<=__AMDGCN_WAVEFRONT_SIZE), void >::type
 Performs reduction across threads in a logical warp. More...
 
template<class BinaryFunction = ::rocprim::plus<T>, unsigned int FunctionWarpSize = WarpSize>
ROCPRIM_DEVICE ROCPRIM_INLINE auto reduce (T, T &, int, storage_type &, BinaryFunction reduce_op=BinaryFunction()) -> typename std::enable_if<(FunctionWarpSize > __AMDGCN_WAVEFRONT_SIZE), void >::type
 Performs reduction across threads in a logical warp. More...
 
template<class Flag , class BinaryFunction = ::rocprim::plus<T>, unsigned int FunctionWarpSize = WarpSize>
ROCPRIM_DEVICE ROCPRIM_INLINE auto head_segmented_reduce (T input, T &output, Flag flag, storage_type &storage, BinaryFunction reduce_op=BinaryFunction()) -> typename std::enable_if<(FunctionWarpSize<=__AMDGCN_WAVEFRONT_SIZE), void >::type
 Performs head-segmented reduction across threads in a logical warp. More...
 
template<class Flag , class BinaryFunction = ::rocprim::plus<T>, unsigned int FunctionWarpSize = WarpSize>
ROCPRIM_DEVICE ROCPRIM_INLINE auto head_segmented_reduce (T, T &, Flag, storage_type &, BinaryFunction reduce_op=BinaryFunction()) -> typename std::enable_if<(FunctionWarpSize > __AMDGCN_WAVEFRONT_SIZE), void >::type
 Performs head-segmented reduction across threads in a logical warp. More...
 
template<class Flag , class BinaryFunction = ::rocprim::plus<T>, unsigned int FunctionWarpSize = WarpSize>
ROCPRIM_DEVICE ROCPRIM_INLINE auto tail_segmented_reduce (T input, T &output, Flag flag, storage_type &storage, BinaryFunction reduce_op=BinaryFunction()) -> typename std::enable_if<(FunctionWarpSize<=__AMDGCN_WAVEFRONT_SIZE), void >::type
 Performs tail-segmented reduction across threads in a logical warp. More...
 
template<class Flag , class BinaryFunction = ::rocprim::plus<T>, unsigned int FunctionWarpSize = WarpSize>
ROCPRIM_DEVICE ROCPRIM_INLINE auto tail_segmented_reduce (T, T &, Flag, storage_type &, BinaryFunction reduce_op=BinaryFunction()) -> typename std::enable_if<(FunctionWarpSize > __AMDGCN_WAVEFRONT_SIZE), void >::type
 Performs tail-segmented reduction across threads in a logical warp. More...
 

Detailed Description

template<class T, unsigned int WarpSize = device_warp_size(), bool UseAllReduce = false>
class warp_reduce< T, WarpSize, UseAllReduce >

The warp_reduce class is a warp level parallel primitive which provides methods for performing reduction operations on items partitioned across threads in a hardware warp.

Template Parameters
T - the input/output type.
WarpSize - the size of the logical warp, which must be equal to or less than the size of the hardware warp (see rocprim::device_warp_size()). Reduce operations are performed separately within groups determined by WarpSize.
UseAllReduce - whether to broadcast the final reduction value to all threads (default is false).
Overview
  • WarpSize must be equal to or less than the size of the hardware warp (see rocprim::device_warp_size()). If it is less, the reduce is performed separately within groups determined by WarpSize.
    For example, if WarpSize is 4 and the hardware warp is 64, the reduction is performed in logical warps grouped like this: { {0, 1, 2, 3}, {4, 5, 6, 7}, ..., {60, 61, 62, 63} } (each thread is represented here by its id within the hardware warp).
  • A logical warp is a group of WarpSize consecutive threads from the same hardware warp.
  • Supports non-commutative reduce operators. However, the reduce operator should be associative. When used with non-associative functions the results may be non-deterministic and/or vary in precision.
  • The number of threads executing warp_reduce's functions must be a multiple of WarpSize.
  • All threads from a logical warp must be in the same hardware warp.
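The effect of the UseAllReduce parameter described above can be sketched as follows (a sketch only, assuming the same configuration as the examples below: logical warps of 16 threads, hardware warp size 64):

```cpp
__global__ void example_kernel(...)
{
    // specialize warp_reduce with UseAllReduce = true
    using warp_allreduce_int = rocprim::warp_reduce<int, 16, true>;
    __shared__ warp_allreduce_int::storage_type temp[4];
    int logical_warp_id = threadIdx.x/16;
    int value = ...;
    warp_allreduce_int().reduce(
        value, // input
        value, // output
        temp[logical_warp_id]
    );
    // here every thread of the logical warp holds the reduction result,
    // not only the first thread (as would be the case with the default
    // UseAllReduce = false)
    ...
}
```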
Examples

In this example, a reduce operation is performed on groups of 16 threads; each thread provides one int value, and the result is returned in the same variable that was used for input. The hardware warp size is 64 and the block (tile) size is 64.

__global__ void example_kernel(...)
{
    // specialize warp_reduce for int and logical warp of 16 threads
    using warp_reduce_int = rocprim::warp_reduce<int, 16>;
    // allocate storage in shared memory
    __shared__ warp_reduce_int::storage_type temp[4];
    int logical_warp_id = threadIdx.x/16;
    int value = ...;
    // execute reduce
    warp_reduce_int().reduce(
        value, // input
        value, // output
        temp[logical_warp_id]
    );
    ...
}

Member Typedef Documentation

◆ storage_type

template<class T, unsigned int WarpSize = device_warp_size(), bool UseAllReduce = false>
using warp_reduce< T, WarpSize, UseAllReduce >::storage_type = typename base_type::storage_type

Struct used to allocate temporary memory required for thread communication during operations provided by the related parallel primitive.

Depending on the implementation, the operations exposed by the parallel primitive may require temporary storage for thread communication. The storage should be allocated using the __shared__ keyword. It can be aliased to externally allocated memory, or be part of a union with other storage types to increase shared memory reusability.
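The union aliasing mentioned above can be sketched as follows (a sketch only; it assumes a second primitive such as rocprim::warp_scan is also used in the kernel, and the same 16-thread logical warp configuration as the examples in this page):

```cpp
__global__ void example_kernel(...)
{
    using warp_reduce_int = rocprim::warp_reduce<int, 16>;
    using warp_scan_int = rocprim::warp_scan<int, 16>;
    // the two storage arrays occupy the same shared memory, because the
    // two primitives are never in flight at the same time
    __shared__ union {
        warp_reduce_int::storage_type reduce[4];
        warp_scan_int::storage_type scan[4];
    } temp;
    int logical_warp_id = threadIdx.x/16;
    int value = ...;
    warp_reduce_int().reduce(value, value, temp.reduce[logical_warp_id]);
    // barrier before the storage is repurposed for the scan
    rocprim::syncthreads();
    warp_scan_int().inclusive_scan(value, value, temp.scan[logical_warp_id]);
    ...
}
```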

Member Function Documentation

◆ head_segmented_reduce() [1/2]

template<class T, unsigned int WarpSize = device_warp_size(), bool UseAllReduce = false>
template<class Flag , class BinaryFunction = ::rocprim::plus<T>, unsigned int FunctionWarpSize = WarpSize>
ROCPRIM_DEVICE ROCPRIM_INLINE auto warp_reduce< T, WarpSize, UseAllReduce >::head_segmented_reduce (
    T input,
    T & output,
    Flag flag,
    storage_type & storage,
    BinaryFunction reduce_op = BinaryFunction()
) -> typename std::enable_if<(FunctionWarpSize <= __AMDGCN_WAVEFRONT_SIZE), void>::type
inline

Performs head-segmented reduction across threads in a logical warp.

Template Parameters
Flag - type of head flags. Must be contextually convertible to bool.
BinaryFunction - type of binary function used for reduce. Default type is rocprim::plus<T>.
Parameters
[in] input - thread input value.
[out] output - reference to a thread output value. May be aliased with input.
[in] flag - thread head flag; true flags mark the beginnings of segments.
[in] storage - reference to a temporary storage object of type storage_type.
[in] reduce_op - binary operation function object that will be used for the reduce. The signature of the function should be equivalent to the following: T f(const T &a, const T &b);. The signature does not need to have const &, but the function object must not modify the objects passed to it.
Storage reuse
A synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().
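The head-segmented semantics can be illustrated with a short host-side reference (a hypothetical helper written for illustration, not part of the rocPRIM API; it assumes, following the flag description above, that each segment-head lane accumulates the values of its segment while other lanes keep their input value):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side reference for head-segmented reduce over one logical warp.
// Lanes whose head flag is set start a new segment; lane 0 implicitly
// starts the first segment. Each segment-head lane accumulates the
// values of its segment; other lanes keep their input value.
std::vector<int> head_segmented_reduce_ref(const std::vector<int>& input,
                                           const std::vector<bool>& flags)
{
    std::vector<int> output(input);
    std::size_t head = 0; // lane 0 implicitly heads the first segment
    for(std::size_t i = 0; i < input.size(); ++i)
    {
        if(flags[i]) head = i;           // a set flag opens a new segment
        if(i != head) output[head] += input[i]; // fold value into its head lane
    }
    return output;
}
```

For example, with inputs {1, 2, 3, 4, 5, 6, 7, 8} and a head flag set only on lane 4, lane 0 receives 1+2+3+4 and lane 4 receives 5+6+7+8.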

◆ head_segmented_reduce() [2/2]

template<class T, unsigned int WarpSize = device_warp_size(), bool UseAllReduce = false>
template<class Flag , class BinaryFunction = ::rocprim::plus<T>, unsigned int FunctionWarpSize = WarpSize>
ROCPRIM_DEVICE ROCPRIM_INLINE auto warp_reduce< T, WarpSize, UseAllReduce >::head_segmented_reduce (
    T,
    T &,
    Flag,
    storage_type &,
    BinaryFunction reduce_op = BinaryFunction()
) -> typename std::enable_if<(FunctionWarpSize > __AMDGCN_WAVEFRONT_SIZE), void>::type
inline

Performs head-segmented reduction across threads in a logical warp.

Invalid Warp Size
This overload is selected only when FunctionWarpSize is greater than the hardware wavefront size (see the enable_if condition in the signature).

◆ reduce() [1/4]

template<class T, unsigned int WarpSize = device_warp_size(), bool UseAllReduce = false>
template<class BinaryFunction = ::rocprim::plus<T>, unsigned int FunctionWarpSize = WarpSize>
ROCPRIM_DEVICE ROCPRIM_INLINE auto warp_reduce< T, WarpSize, UseAllReduce >::reduce (
    T input,
    T & output,
    storage_type & storage,
    BinaryFunction reduce_op = BinaryFunction()
) -> typename std::enable_if<(FunctionWarpSize <= __AMDGCN_WAVEFRONT_SIZE), void>::type
inline

Performs reduction across threads in a logical warp.

Template Parameters
BinaryFunction - type of binary function used for reduce. Default type is rocprim::plus<T>.
Parameters
[in] input - thread input value.
[out] output - reference to a thread output value. May be aliased with input.
[in] storage - reference to a temporary storage object of type storage_type.
[in] reduce_op - binary operation function object that will be used for the reduce. The signature of the function should be equivalent to the following: T f(const T &a, const T &b);. The signature does not need to have const &, but the function object must not modify the objects passed to it.
Storage reuse
A synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().
Examples

In this example, a reduce operation is performed on groups of 16 threads; each thread provides one int value, and the result is returned in the same variable that was used for input. The hardware warp size is 64 and the block (tile) size is 64.

__global__ void example_kernel(...)
{
    // specialize warp_reduce for int and logical warp of 16 threads
    using warp_reduce_int = rocprim::warp_reduce<int, 16>;
    // allocate storage in shared memory
    __shared__ warp_reduce_int::storage_type temp[4];
    int logical_warp_id = threadIdx.x/16;
    int value = ...;
    // execute reduction with a custom binary operator
    warp_reduce_int().reduce(
        value, // input
        value, // output
        temp[logical_warp_id],
        rocprim::minimum<int>()
    );
    ...
}

◆ reduce() [2/4]

template<class T, unsigned int WarpSize = device_warp_size(), bool UseAllReduce = false>
template<class BinaryFunction = ::rocprim::plus<T>, unsigned int FunctionWarpSize = WarpSize>
ROCPRIM_DEVICE ROCPRIM_INLINE auto warp_reduce< T, WarpSize, UseAllReduce >::reduce (
    T,
    T &,
    storage_type &,
    BinaryFunction reduce_op = BinaryFunction()
) -> typename std::enable_if<(FunctionWarpSize > __AMDGCN_WAVEFRONT_SIZE), void>::type
inline

Performs reduction across threads in a logical warp.

Invalid Warp Size
This overload is selected only when FunctionWarpSize is greater than the hardware wavefront size (see the enable_if condition in the signature).

◆ reduce() [3/4]

template<class T, unsigned int WarpSize = device_warp_size(), bool UseAllReduce = false>
template<class BinaryFunction = ::rocprim::plus<T>, unsigned int FunctionWarpSize = WarpSize>
ROCPRIM_DEVICE ROCPRIM_INLINE auto warp_reduce< T, WarpSize, UseAllReduce >::reduce (
    T input,
    T & output,
    int valid_items,
    storage_type & storage,
    BinaryFunction reduce_op = BinaryFunction()
) -> typename std::enable_if<(FunctionWarpSize <= __AMDGCN_WAVEFRONT_SIZE), void>::type
inline

Performs reduction across threads in a logical warp.

Template Parameters
BinaryFunction - type of binary function used for reduce. Default type is rocprim::plus<T>.
Parameters
[in] input - thread input value.
[out] output - reference to a thread output value. May be aliased with input.
[in] valid_items - number of items that will be reduced in the warp.
[in] storage - reference to a temporary storage object of type storage_type.
[in] reduce_op - binary operation function object that will be used for the reduce. The signature of the function should be equivalent to the following: T f(const T &a, const T &b);. The signature does not need to have const &, but the function object must not modify the objects passed to it.
Storage reuse
A synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().
Examples

In this example, a reduce operation is performed on groups of 16 threads; each thread provides one int value, and the result is returned in the same variable that was used for input. The hardware warp size is 64 and the block (tile) size is 64.

__global__ void example_kernel(...)
{
    // specialize warp_reduce for int and logical warp of 16 threads
    using warp_reduce_int = rocprim::warp_reduce<int, 16>;
    // allocate storage in shared memory
    __shared__ warp_reduce_int::storage_type temp[4];
    int logical_warp_id = threadIdx.x/16;
    int value = ...;
    int valid_items = 4;
    // execute reduction
    warp_reduce_int().reduce(
        value, // input
        value, // output
        valid_items,
        temp[logical_warp_id]
    );
    ...
}

◆ reduce() [4/4]

template<class T, unsigned int WarpSize = device_warp_size(), bool UseAllReduce = false>
template<class BinaryFunction = ::rocprim::plus<T>, unsigned int FunctionWarpSize = WarpSize>
ROCPRIM_DEVICE ROCPRIM_INLINE auto warp_reduce< T, WarpSize, UseAllReduce >::reduce (
    T,
    T &,
    int,
    storage_type &,
    BinaryFunction reduce_op = BinaryFunction()
) -> typename std::enable_if<(FunctionWarpSize > __AMDGCN_WAVEFRONT_SIZE), void>::type
inline

Performs reduction across threads in a logical warp.

Invalid Warp Size
This overload is selected only when FunctionWarpSize is greater than the hardware wavefront size (see the enable_if condition in the signature).

◆ tail_segmented_reduce() [1/2]

template<class T, unsigned int WarpSize = device_warp_size(), bool UseAllReduce = false>
template<class Flag , class BinaryFunction = ::rocprim::plus<T>, unsigned int FunctionWarpSize = WarpSize>
ROCPRIM_DEVICE ROCPRIM_INLINE auto warp_reduce< T, WarpSize, UseAllReduce >::tail_segmented_reduce (
    T input,
    T & output,
    Flag flag,
    storage_type & storage,
    BinaryFunction reduce_op = BinaryFunction()
) -> typename std::enable_if<(FunctionWarpSize <= __AMDGCN_WAVEFRONT_SIZE), void>::type
inline

Performs tail-segmented reduction across threads in a logical warp.

Template Parameters
Flag - type of tail flags. Must be contextually convertible to bool.
BinaryFunction - type of binary function used for reduce. Default type is rocprim::plus<T>.
Parameters
[in] input - thread input value.
[out] output - reference to a thread output value. May be aliased with input.
[in] flag - thread tail flag; true flags mark the ends of segments.
[in] storage - reference to a temporary storage object of type storage_type.
[in] reduce_op - binary operation function object that will be used for the reduce. The signature of the function should be equivalent to the following: T f(const T &a, const T &b);. The signature does not need to have const &, but the function object must not modify the objects passed to it.
Storage reuse
A synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().
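A device-side usage sketch, mirroring the reduce() examples elsewhere on this page (a sketch only, assuming 16-thread logical warps within a 64-thread hardware warp; here the last thread of every logical warp is flagged as a segment tail):

```cpp
__global__ void example_kernel(...)
{
    // specialize warp_reduce for int and logical warp of 16 threads
    using warp_reduce_int = rocprim::warp_reduce<int, 16>;
    __shared__ warp_reduce_int::storage_type temp[4];
    int logical_warp_id = threadIdx.x/16;
    int value = ...;
    // mark the last thread of each logical warp as a segment tail
    bool flag = (threadIdx.x % 16) == 15;
    warp_reduce_int().tail_segmented_reduce(
        value, // input
        value, // output
        flag,
        temp[logical_warp_id]
    );
    ...
}
```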

◆ tail_segmented_reduce() [2/2]

template<class T, unsigned int WarpSize = device_warp_size(), bool UseAllReduce = false>
template<class Flag , class BinaryFunction = ::rocprim::plus<T>, unsigned int FunctionWarpSize = WarpSize>
ROCPRIM_DEVICE ROCPRIM_INLINE auto warp_reduce< T, WarpSize, UseAllReduce >::tail_segmented_reduce (
    T,
    T &,
    Flag,
    storage_type &,
    BinaryFunction reduce_op = BinaryFunction()
) -> typename std::enable_if<(FunctionWarpSize > __AMDGCN_WAVEFRONT_SIZE), void>::type
inline

Performs tail-segmented reduction across threads in a logical warp.

Invalid Warp Size
This overload is selected only when FunctionWarpSize is greater than the hardware wavefront size (see the enable_if condition in the signature).


The documentation for this class was generated from the following file: