rocPRIM
Classes | Public Types | Public Member Functions | List of all members
block_shuffle< T, BlockSizeX, BlockSizeY, BlockSizeZ > Class Template Reference

The block_shuffle class is a block level parallel primitive which provides methods for shuffling data partitioned across a block. More...

#include <block_shuffle.hpp>

Public Types

using storage_type = detail::raw_storage< storage_type_ >
 Struct used to allocate a temporary memory that is required for thread communication during operations provided by related parallel primitive. More...
 

Public Member Functions

ROCPRIM_DEVICE ROCPRIM_FORCE_INLINE void offset (T input, T &output, int distance=1)
 Shuffles data across threads in a block, offseted by the distance value. More...
 
ROCPRIM_DEVICE ROCPRIM_FORCE_INLINE void offset (const size_t &flat_id, T input, T &output, int distance)
 Shuffles data across threads in a block, offseted by the distance value. More...
 
ROCPRIM_DEVICE ROCPRIM_INLINE void offset (const size_t &flat_id, T input, T &output, int distance, storage_type &storage)
 Shuffles data across threads in a block, offseted by the distance value, using temporary storage. More...
 
ROCPRIM_DEVICE ROCPRIM_FORCE_INLINE void rotate (T input, T &output, unsigned int distance=1)
 Shuffles data across threads in a block, offseted by the distance value. More...
 
ROCPRIM_DEVICE ROCPRIM_FORCE_INLINE void rotate (const size_t &flat_id, T input, T &output, unsigned int distance)
 Shuffles data across threads in a block, offseted by the distance value. More...
 
ROCPRIM_DEVICE ROCPRIM_INLINE void rotate (const size_t &flat_id, T input, T &output, unsigned int distance, storage_type &storage)
 Shuffles data across threads in a block, offseted by the distance value, using temporary storage. More...
 
template<unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_FORCE_INLINE void up (T(&input)[ItemsPerThread], T(&prev)[ItemsPerThread])
 The thread block rotates a blocked arrange of input items, shifting it up by one item. More...
 
template<unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_FORCE_INLINE void up (const size_t &flat_id, T(&input)[ItemsPerThread], T(&prev)[ItemsPerThread])
 The thread block rotates a blocked arrange of input items, shifting it up by one item. More...
 
template<unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_INLINE void up (const size_t &flat_id, T(&input)[ItemsPerThread], T(&prev)[ItemsPerThread], storage_type &storage)
 The thread block rotates a blocked arrange of input items, shifting it up by one item, using temporary storage. More...
 
template<unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_FORCE_INLINE void up (T(&input)[ItemsPerThread], T(&prev)[ItemsPerThread], T &block_suffix)
 The thread block rotates a blocked arrange of input items, shifting it up by one item. More...
 
template<unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_FORCE_INLINE void up (const size_t &flat_id, T(&input)[ItemsPerThread], T(&prev)[ItemsPerThread], T &block_suffix)
 The thread block rotates a blocked arrange of input items, shifting it up by one item. More...
 
template<int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_INLINE void up (const size_t &flat_id, T(&input)[ItemsPerThread], T(&prev)[ItemsPerThread], T &block_suffix, storage_type &storage)
 The thread block rotates a blocked arrange of input items, shifting it up by one item, using temporary storage. More...
 
template<unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_FORCE_INLINE void down (T(&input)[ItemsPerThread], T(&next)[ItemsPerThread])
 The thread block rotates a blocked arrange of input items, shifting it down by one item. More...
 
template<unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_FORCE_INLINE void down (const size_t &flat_id, T(&input)[ItemsPerThread], T(&next)[ItemsPerThread])
 The thread block rotates a blocked arrange of input items, shifting it down by one item. More...
 
template<unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_INLINE void down (const size_t &flat_id, T(&input)[ItemsPerThread], T(&next)[ItemsPerThread], storage_type &storage)
 The thread block rotates a blocked arrange of input items, shifting it down by one item, using temporary storage. More...
 
template<unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_FORCE_INLINE void down (T(&input)[ItemsPerThread], T(&next)[ItemsPerThread], T &block_prefix)
 The thread block rotates a blocked arrange of input items, shifting it down by one item. More...
 
template<unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_FORCE_INLINE void down (const size_t &flat_id, T(&input)[ItemsPerThread], T(&next)[ItemsPerThread], T &block_prefix)
 The thread block rotates a blocked arrange of input items, shifting it down by one item. More...
 
template<unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_INLINE void down (const size_t &flat_id, T(&input)[ItemsPerThread], T(&next)[ItemsPerThread], T &block_prefix, storage_type &storage)
 The thread block rotates a blocked arrange of input items, shifting it down by one item, using temporary storage. More...
 

Detailed Description

template<class T, unsigned int BlockSizeX, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
class block_shuffle< T, BlockSizeX, BlockSizeY, BlockSizeZ >

The block_shuffle class is a block level parallel primitive which provides methods for shuffling data partitioned across a block.

Template Parameters
T- the input/output type.
BlockSizeX- the number of threads in a block's x dimension, it has no defaults value.
BlockSizeY- the number of threads in a block's y dimension, defaults to 1.
BlockSizeZ- the number of threads in a block's z dimension, defaults to 1.
Overview
It is commonplace for blocks of threads to rearrange data items between threads. The BlockShuffle abstraction allows threads to efficiently shift items either (a) up to their successor or (b) down to their predecessor.
  • Computation can more efficient when:
    • ItemsPerThread is greater than one,
    • T is an arithmetic type,
    • the number of threads in the block is a multiple of the hardware warp size (see rocprim::warp_size()).
Examples

In the examples shuffle operation is performed on block of 192 threads, each provides one int value, result is returned using the same variable as for input.

__global__ void example_kernel(...)
{
// specialize block__shuffle_int for int and logical warp of 192 threads
using block__shuffle_int = rocprim::block_shuffle<int, 192>;
// allocate storage in shared memory
__shared__ block_shuffle::storage_type storage;
int value = ...;
// execute block shuffle
block__shuffle_int().inclusive_up(
value, // input
value, // output
storage
);
...
}

Member Typedef Documentation

◆ storage_type

template<class T , unsigned int BlockSizeX, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
using block_shuffle< T, BlockSizeX, BlockSizeY, BlockSizeZ >::storage_type = detail::raw_storage<storage_type_>

Struct used to allocate a temporary memory that is required for thread communication during operations provided by related parallel primitive.

Depending on the implemention the operations exposed by parallel primitive may require a temporary storage for thread communication. The storage should be allocated using keywords __shared__. It can be aliased to an externally allocated memory, or be a part of a union type with other storage types to increase shared memory reusability.

Member Function Documentation

◆ down() [1/6]

template<class T , unsigned int BlockSizeX, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
template<unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_FORCE_INLINE void block_shuffle< T, BlockSizeX, BlockSizeY, BlockSizeZ >::down ( T(&)  input[ItemsPerThread],
T(&)  next[ItemsPerThread] 
)
inline

The thread block rotates a blocked arrange of input items, shifting it down by one item.

Parameters
[in]input- The calling thread's input items
[out]next- The corresponding successor items (may be aliased to input). The item prev[0] is not updated for threadBlockSize - 1.
Example.
__global__ void example_kernel(...)
{
// specialize block__shuffle_int for int and logical warp of 192 threads
using block__shuffle_int = rocprim::block_shuffle<int, 192>;
int value = ...;
// execute block shuffle
block__shuffle_int().down(
value, // input
value // output
);
...
}

◆ down() [2/6]

template<class T , unsigned int BlockSizeX, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
template<unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_FORCE_INLINE void block_shuffle< T, BlockSizeX, BlockSizeY, BlockSizeZ >::down ( const size_t &  flat_id,
T(&)  input[ItemsPerThread],
T(&)  next[ItemsPerThread] 
)
inline

The thread block rotates a blocked arrange of input items, shifting it down by one item.

Parameters
[in]flat_id- flat thread ID obtained from rocprim::flat_block_thread_id
[in]input- The calling thread's input items
[out]next- The corresponding successor items (may be aliased to input). The item prev[0] is not updated for threadBlockSize - 1.

◆ down() [3/6]

template<class T , unsigned int BlockSizeX, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
template<unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_INLINE void block_shuffle< T, BlockSizeX, BlockSizeY, BlockSizeZ >::down ( const size_t &  flat_id,
T(&)  input[ItemsPerThread],
T(&)  next[ItemsPerThread],
storage_type storage 
)
inline

The thread block rotates a blocked arrange of input items, shifting it down by one item, using temporary storage.

Parameters
[in]flat_id- flat thread ID obtained from rocprim::flat_block_thread_id
[in]input- The calling thread's input items
[out]next- The corresponding successor items (may be aliased to input). The item prev[0] is not updated for threadBlockSize - 1.
[in]storage- reference to a temporary storage object of type storage_type.

◆ down() [4/6]

template<class T , unsigned int BlockSizeX, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
template<unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_FORCE_INLINE void block_shuffle< T, BlockSizeX, BlockSizeY, BlockSizeZ >::down ( T(&)  input[ItemsPerThread],
T(&)  next[ItemsPerThread],
T &  block_prefix 
)
inline

The thread block rotates a blocked arrange of input items, shifting it down by one item.

Parameters
[in]input- The calling thread's input items
[out]next- The corresponding successor items (may be aliased to input). The item prev[0] is not updated for threadBlockSize - 1.
[out]block_prefix- The item input[0] from thread0, provided to all threads

◆ down() [5/6]

template<class T , unsigned int BlockSizeX, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
template<unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_FORCE_INLINE void block_shuffle< T, BlockSizeX, BlockSizeY, BlockSizeZ >::down ( const size_t &  flat_id,
T(&)  input[ItemsPerThread],
T(&)  next[ItemsPerThread],
T &  block_prefix 
)
inline

The thread block rotates a blocked arrange of input items, shifting it down by one item.

Parameters
[in]flat_id- flat thread ID obtained from rocprim::flat_block_thread_id
[in]input- The calling thread's input items
[out]next- The corresponding successor items (may be aliased to input). The item prev[0] is not updated for threadBlockSize - 1.
[out]block_prefix- The item input[0] from thread0, provided to all threads

◆ down() [6/6]

template<class T , unsigned int BlockSizeX, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
template<unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_INLINE void block_shuffle< T, BlockSizeX, BlockSizeY, BlockSizeZ >::down ( const size_t &  flat_id,
T(&)  input[ItemsPerThread],
T(&)  next[ItemsPerThread],
T &  block_prefix,
storage_type storage 
)
inline

The thread block rotates a blocked arrange of input items, shifting it down by one item, using temporary storage.

Parameters
[in]flat_id- flat thread ID obtained from rocprim::flat_block_thread_id
[in]input- The calling thread's input items
[out]next- The corresponding successor items (may be aliased to input). The item prev[0] is not updated for threadBlockSize - 1.
[out]block_prefix- The item input[0] from thread0, provided to all threads
[in]storage- reference to a temporary storage object of type storage_type.

◆ offset() [1/3]

template<class T , unsigned int BlockSizeX, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
ROCPRIM_DEVICE ROCPRIM_FORCE_INLINE void block_shuffle< T, BlockSizeX, BlockSizeY, BlockSizeZ >::offset ( input,
T &  output,
int  distance = 1 
)
inline

Shuffles data across threads in a block, offseted by the distance value.

A thread with threadIdx i receives data from a thread with threadIdx (i - distance), where distance may be a negative value.
Any shuffle operation with invalid input or output threadIds are not carried out, i.e. threadId < 0 || threadId >= BlockSize.
Parameters
[in]input- input data to be shuffled to another thread.
[out]output- reference to a output value, that receives data from another thread
[in]distance- The input threadId + distance = output threadId.
Example.
__global__ void example_kernel(...)
{
// specialize block__shuffle_int for int and logical warp of 192 threads
using block__shuffle_int = rocprim::block_shuffle<int, 192>;
int value = ...;
// execute block shuffle
block__shuffle_int().offset(
value, // input
value // output
);
...
}

◆ offset() [2/3]

template<class T , unsigned int BlockSizeX, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
ROCPRIM_DEVICE ROCPRIM_FORCE_INLINE void block_shuffle< T, BlockSizeX, BlockSizeY, BlockSizeZ >::offset ( const size_t &  flat_id,
input,
T &  output,
int  distance 
)
inline

Shuffles data across threads in a block, offseted by the distance value.

A thread with threadIdx i receives data from a thread with threadIdx (i - distance), where distance may be a negative value.
Any shuffle operation with invalid input or output threadIds are not carried out, i.e. threadId < 0 || threadId >= BlockSize.
Parameters
[in]flat_id- flat thread ID obtained from rocprim::flat_block_thread_id
[in]input- input data to be shuffled to another thread.
[out]output- reference to a output value, that receives data from another thread
[in]distance- The input threadId + distance = output threadId.

◆ offset() [3/3]

template<class T , unsigned int BlockSizeX, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
ROCPRIM_DEVICE ROCPRIM_INLINE void block_shuffle< T, BlockSizeX, BlockSizeY, BlockSizeZ >::offset ( const size_t &  flat_id,
input,
T &  output,
int  distance,
storage_type storage 
)
inline

Shuffles data across threads in a block, offseted by the distance value, using temporary storage.

A thread with threadIdx i receives data from a thread with threadIdx (i - distance), where distance may be a negative value.
Any shuffle operation with invalid input or output threadIds are not carried out, i.e. threadId < 0 || threadId >= BlockSize.
Parameters
[in]flat_id- flat thread ID obtained from rocprim::flat_block_thread_id
[in]input- input data to be shuffled to another thread.
[out]output- reference to a output value, that receives data from another thread
[in]distance- The input threadId + distance = output threadId.
[in]storage- reference to a temporary storage object of type storage_type.

◆ rotate() [1/3]

template<class T , unsigned int BlockSizeX, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
ROCPRIM_DEVICE ROCPRIM_FORCE_INLINE void block_shuffle< T, BlockSizeX, BlockSizeY, BlockSizeZ >::rotate ( input,
T &  output,
unsigned int  distance = 1 
)
inline

Shuffles data across threads in a block, offseted by the distance value.

A thread with threadIdx i receives data from a thread with threadIdx (i - distance) % BlockSize, where distance may be a negative value.
Data is rotated around the block, using (input_threadId + distance) modulous BlockSize to ensure valid threadIds.
Parameters
[in]input- input data to be shuffled to another thread.
[out]output- reference to a output value, that receives data from another thread
[in]distance- The input threadId + distance = output threadId.
Example.
__global__ void example_kernel(...)
{
// specialize block__shuffle_int for int and logical warp of 192 threads
using block__shuffle_int = rocprim::block_shuffle<int, 192>;
int value = ...;
// execute block shuffle
block__shuffle_int().rotate(
value, // input
value // output
);
...
}

◆ rotate() [2/3]

template<class T , unsigned int BlockSizeX, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
ROCPRIM_DEVICE ROCPRIM_FORCE_INLINE void block_shuffle< T, BlockSizeX, BlockSizeY, BlockSizeZ >::rotate ( const size_t &  flat_id,
input,
T &  output,
unsigned int  distance 
)
inline

Shuffles data across threads in a block, offseted by the distance value.

A thread with threadIdx i receives data from a thread with threadIdx (i - distance) % BlockSize, where distance may be a negative value.
Data is rotated around the block, using (input_threadId + distance) modulous BlockSize to ensure valid threadIds.
Parameters
[in]flat_id- flat thread ID obtained from rocprim::flat_block_thread_id
[in]input- input data to be shuffled to another thread.
[out]output- reference to a output value, that receives data from another thread
[in]distance- The input threadId + distance = output threadId.

◆ rotate() [3/3]

template<class T , unsigned int BlockSizeX, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
ROCPRIM_DEVICE ROCPRIM_INLINE void block_shuffle< T, BlockSizeX, BlockSizeY, BlockSizeZ >::rotate ( const size_t &  flat_id,
input,
T &  output,
unsigned int  distance,
storage_type storage 
)
inline

Shuffles data across threads in a block, offseted by the distance value, using temporary storage.

A thread with threadIdx i receives data from a thread with threadIdx (i - distance) % BlockSize, where distance may be a negative value.
Data is rotated around the block, using (input_threadId + distance) modulous BlockSize to ensure valid threadIds.
Parameters
[in]flat_id- flat thread ID obtained from rocprim::flat_block_thread_id
[in]input- input data to be shuffled to another thread.
[out]output- reference to a output value, that receives data from another thread
[in]distance- The input threadId + distance = output threadId.
[in]storage- reference to a temporary storage object of type storage_type.

◆ up() [1/6]

template<class T , unsigned int BlockSizeX, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
template<unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_FORCE_INLINE void block_shuffle< T, BlockSizeX, BlockSizeY, BlockSizeZ >::up ( T(&)  input[ItemsPerThread],
T(&)  prev[ItemsPerThread] 
)
inline

The thread block rotates a blocked arrange of input items, shifting it up by one item.

Parameters
[in]input- The calling thread's input items
[out]prev- The corresponding predecessor items (may be aliased to input). The item prev[0] is not updated for thread0.
Example.
__global__ void example_kernel(...)
{
// specialize block__shuffle_int for int and logical warp of 192 threads
using block__shuffle_int = rocprim::block_shuffle<int, 192>;
int value = ...;
// execute block shuffle
block__shuffle_int().up(
value, // input
value // output
);
...
}

◆ up() [2/6]

template<class T , unsigned int BlockSizeX, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
template<unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_FORCE_INLINE void block_shuffle< T, BlockSizeX, BlockSizeY, BlockSizeZ >::up ( const size_t &  flat_id,
T(&)  input[ItemsPerThread],
T(&)  prev[ItemsPerThread] 
)
inline

The thread block rotates a blocked arrange of input items, shifting it up by one item.

Parameters
[in]flat_id- flat thread ID obtained from rocprim::flat_block_thread_id
[in]input- The calling thread's input items
[out]prev- The corresponding predecessor items (may be aliased to input). The item prev[0] is not updated for thread0.

◆ up() [3/6]

template<class T , unsigned int BlockSizeX, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
template<unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_INLINE void block_shuffle< T, BlockSizeX, BlockSizeY, BlockSizeZ >::up ( const size_t &  flat_id,
T(&)  input[ItemsPerThread],
T(&)  prev[ItemsPerThread],
storage_type storage 
)
inline

The thread block rotates a blocked arrange of input items, shifting it up by one item, using temporary storage.

Parameters
[in]flat_id- flat thread ID obtained from rocprim::flat_block_thread_id
[in]input- The calling thread's input items
[out]prev- The corresponding predecessor items (may be aliased to input).
[in]storage- reference to a temporary storage object of type storage_type. The item prev[0] is not updated for thread0.

◆ up() [4/6]

template<class T , unsigned int BlockSizeX, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
template<unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_FORCE_INLINE void block_shuffle< T, BlockSizeX, BlockSizeY, BlockSizeZ >::up ( T(&)  input[ItemsPerThread],
T(&)  prev[ItemsPerThread],
T &  block_suffix 
)
inline

The thread block rotates a blocked arrange of input items, shifting it up by one item.

Parameters
[in]input- The calling thread's input items
[out]prev- The corresponding predecessor items (may be aliased to input). The item prev[0] is not updated for thread0.
[out]block_suffix- The item input[ItemsPerThread-1] from threadBlockSize-1, provided to all threads

◆ up() [5/6]

template<class T , unsigned int BlockSizeX, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
template<unsigned int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_FORCE_INLINE void block_shuffle< T, BlockSizeX, BlockSizeY, BlockSizeZ >::up ( const size_t &  flat_id,
T(&)  input[ItemsPerThread],
T(&)  prev[ItemsPerThread],
T &  block_suffix 
)
inline

The thread block rotates a blocked arrange of input items, shifting it up by one item.

Parameters
[in]flat_id- flat thread ID obtained from rocprim::flat_block_thread_id
[in]input- The calling thread's input items
[out]prev- The corresponding predecessor items (may be aliased to input). The item prev[0] is not updated for thread0.
[out]block_suffix- The item input[ItemsPerThread-1] from threadBlockSize-1, provided to all threads

◆ up() [6/6]

template<class T , unsigned int BlockSizeX, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
template<int ItemsPerThread>
ROCPRIM_DEVICE ROCPRIM_INLINE void block_shuffle< T, BlockSizeX, BlockSizeY, BlockSizeZ >::up ( const size_t &  flat_id,
T(&)  input[ItemsPerThread],
T(&)  prev[ItemsPerThread],
T &  block_suffix,
storage_type storage 
)
inline

The thread block rotates a blocked arrange of input items, shifting it up by one item, using temporary storage.

Parameters
[in]flat_id- flat thread ID obtained from rocprim::flat_block_thread_id
[in]input- The calling thread's input items
[out]prev- The corresponding predecessor items (may be aliased to input). The item prev[0] is not updated for thread0.
[out]block_suffix- The item input[ItemsPerThread-1] from threadBlockSize-1, provided to all threads
[in]storage- reference to a temporary storage object of type storage_type.

The documentation for this class was generated from the following file: