rocPRIM
Public Types | Public Member Functions | Static Public Attributes | List of all members
block_radix_rank< BlockSizeX, RadixBits, Algorithm, BlockSizeY, BlockSizeZ > Class Template Reference

The block_radix_rank class is a block level parallel primitive that provides methods for ranking items partitioned across threads in a block. More...

#include <block_radix_rank.hpp>

Inheritance diagram for block_radix_rank< BlockSizeX, RadixBits, Algorithm, BlockSizeY, BlockSizeZ >:
Inheritance graph
[legend]
Collaboration diagram for block_radix_rank< BlockSizeX, RadixBits, Algorithm, BlockSizeY, BlockSizeZ >:
Collaboration graph
[legend]

Public Types

using storage_type = typename base_type::storage_type
 Struct used to allocate a temporary memory that is required for thread communication during operations provided by related parallel primitive. More...
 

Public Member Functions

template<typename Key , unsigned ItemsPerThread>
ROCPRIM_DEVICE void rank_keys (const Key(&keys)[ItemsPerThread], unsigned int(&ranks)[ItemsPerThread], storage_type &storage, unsigned int begin_bit=0, unsigned int pass_bits=RadixBits)
 Perform ascending radix rank over keys partitioned across threads in a block. More...
 
template<typename Key , unsigned ItemsPerThread>
ROCPRIM_DEVICE void rank_keys (const Key(&keys)[ItemsPerThread], unsigned int(&ranks)[ItemsPerThread], unsigned int begin_bit=0, unsigned int pass_bits=RadixBits)
 Perform ascending radix rank over keys partitioned across threads in a block. More...
 
template<typename Key , unsigned ItemsPerThread>
ROCPRIM_DEVICE void rank_keys_desc (const Key(&keys)[ItemsPerThread], unsigned int(&ranks)[ItemsPerThread], storage_type &storage, unsigned int begin_bit=0, unsigned int pass_bits=RadixBits)
 Perform ascending radix rank over bit keys partitioned across threads in a block. More...
 
template<typename Key , unsigned ItemsPerThread>
ROCPRIM_DEVICE void rank_keys_desc (const Key(&keys)[ItemsPerThread], unsigned int(&ranks)[ItemsPerThread], unsigned int begin_bit=0, unsigned int pass_bits=RadixBits)
 Perform descending radix rank over keys partitioned across threads in a block. More...
 
template<typename Key , unsigned ItemsPerThread, typename DigitExtractor >
ROCPRIM_DEVICE void rank_keys (const Key(&keys)[ItemsPerThread], unsigned int(&ranks)[ItemsPerThread], storage_type &storage, DigitExtractor digit_extractor)
 Perform ascending radix rank over bit keys partitioned across threads in a block. More...
 
template<typename Key , unsigned ItemsPerThread, typename DigitExtractor >
ROCPRIM_DEVICE void rank_keys (const Key(&keys)[ItemsPerThread], unsigned int(&ranks)[ItemsPerThread], DigitExtractor digit_extractor)
 Perform ascending radix rank over bit keys partitioned across threads in a block. More...
 
template<typename Key , unsigned ItemsPerThread, typename DigitExtractor >
ROCPRIM_DEVICE void rank_keys_desc (const Key(&keys)[ItemsPerThread], unsigned int(&ranks)[ItemsPerThread], storage_type &storage, DigitExtractor digit_extractor)
 Perform descending radix rank over bit keys partitioned across threads in a block. More...
 
template<typename Key , unsigned ItemsPerThread, typename DigitExtractor >
ROCPRIM_DEVICE void rank_keys_desc (const Key(&keys)[ItemsPerThread], unsigned int(&ranks)[ItemsPerThread], DigitExtractor digit_extractor)
 Perform descending radix rank over bit keys partitioned across threads in a block. More...
 
template<typename Key , unsigned ItemsPerThread, typename DigitExtractor >
ROCPRIM_DEVICE void rank_keys (const Key(&keys)[ItemsPerThread], unsigned int(&ranks)[ItemsPerThread], storage_type &storage, DigitExtractor digit_extractor, unsigned int(&prefix)[digits_per_thread], unsigned int(&counts)[digits_per_thread])
 Perform ascending radix rank over bit keys partitioned across threads in a block. More...
 

Static Public Attributes

static constexpr unsigned int digits_per_thread = base_type::digits_per_thread
 The number of digits each thread will process.
 

Detailed Description

template<unsigned int BlockSizeX, unsigned int RadixBits, block_radix_rank_algorithm Algorithm = block_radix_rank_algorithm::default_algorithm, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
class block_radix_rank< BlockSizeX, RadixBits, Algorithm, BlockSizeY, BlockSizeZ >

The block_radix_rank class is a block level parallel primitive that provides methods for ranking items partitioned across threads in a block.

This algorithm associates each item with the index it would gain if the keys were sorted into an array, according to a radix comparison. Ranking is performed in a stable manner.

Template Parameters
BlockSizeX- the number of threads in a block's x dimension.
RadixBits- the maximum number of radix digit bits that comparisons are performed by.
MemoizeOuterScan- whether to cache digit counters in local memory. This omits loading the same values from shared memory twice, at the expense of more register usage.
BlockSizeY- the number of threads in a block's y dimension, defaults to 1.
BlockSizeZ- the number of threads in a block's z dimension, defaults to 1.
Overview
  • Key type must be an arithmetic type (that is, an integral type or a floating point type).
  • Performance depends on the block size and the number of items that will be sorted per thread.
    • It is usually better if the block size is a multiple of the size of the hardware warp.
    • It is usually increased when there are more than one item per thread. However, when there are too many items per thread, each thread may need so many registers and/or shared memory that it impedes performance.
  • Shared memory usage depends on the block size and the maximum number of radix bits that will be considered when comparing keys.
    • The storage requirement increases when more bits are considered.
  • The maximum amount of keys that can be ranked in a block is 2**16.
Examples
In the example, radix rank is performed on a block of 128 threads. Each thread provides three float values, which are ranked according to bits 10 through 14. The results are written back in a separate array of three unsigned int values.
__global__ void example_kernel(...)
{
// specialize the block_radix_rank for float, block of 128 threads.
using block_rank_float = rocprim::block_radix_rank<float, 128>;
// allocate storage in shared memory
__shared__ block_rank_float::storage_type storage;
float input[3] = ...;
unsigned int output[3] = ...;
// execute the block radix rank (ascending)
block_rank_float().rank_keys(input,
output,
storage,
10,
4);
...
}

Member Typedef Documentation

◆ storage_type

template<unsigned int BlockSizeX, unsigned int RadixBits, block_radix_rank_algorithm Algorithm = block_radix_rank_algorithm::default_algorithm, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
using block_radix_rank< BlockSizeX, RadixBits, Algorithm, BlockSizeY, BlockSizeZ >::storage_type = typename base_type::storage_type

Struct used to allocate a temporary memory that is required for thread communication during operations provided by related parallel primitive.

Depending on the implemention the operations exposed by parallel primitive may require a temporary storage for thread communication. The storage should be allocated using keywords shared. It can be aliased to an externally allocated memory, or be a part of a union with other storage types to increase shared memory reusability.

Member Function Documentation

◆ rank_keys() [1/5]

template<unsigned int BlockSizeX, unsigned int RadixBits, block_radix_rank_algorithm Algorithm = block_radix_rank_algorithm::default_algorithm, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
template<typename Key , unsigned ItemsPerThread>
ROCPRIM_DEVICE void block_radix_rank< BlockSizeX, RadixBits, Algorithm, BlockSizeY, BlockSizeZ >::rank_keys ( const Key(&)  keys[ItemsPerThread],
unsigned int(&)  ranks[ItemsPerThread],
storage_type storage,
unsigned int  begin_bit = 0,
unsigned int  pass_bits = RadixBits 
)
inline

Perform ascending radix rank over keys partitioned across threads in a block.

Template Parameters
Key- the key type.
ItemsPerThread- the number of items contributed by each thread in the block.
Parameters
[in]keys- reference to an array of keys provided by a thread.
[out]ranks- reference to an array where the final ranks are written to.
[in]storage- reference to a temporary storage object of type storage_type.
[in]begin_bit- index of the first (least significant) bit used in key comparison. Must be in range [0; 8 * sizeof(Key)).
[in]pass_bits- [optional] the number of bits used in key comparison. Must be in the range (0; RadixBits]. Default value: RadixBits.
Storage reusage
A synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().
Examples

In the example, radix rank is performed on a block of 128 threads. Each thread provides three float values, which are ranked according to bits 10 through 14. The results are written back in a separate array of three unsigned int values.

__global__ void example_kernel(...)
{
// specialize the block_radix_rank for float, block of 128 threads, and a maximum of 4 bits.
using block_rank_float = rocprim::block_radix_rank<float, 128, 4>;
// allocate storage in shared memory
__shared__ block_rank_float::storage_type storage;
float input[3] = ...;
unsigned int output[3];
// execute the block radix rank (ascending)
block_rank_float().rank_keys(input,
output,
storage,
10,
4);
...
}

◆ rank_keys() [2/5]

template<unsigned int BlockSizeX, unsigned int RadixBits, block_radix_rank_algorithm Algorithm = block_radix_rank_algorithm::default_algorithm, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
template<typename Key , unsigned ItemsPerThread>
ROCPRIM_DEVICE void block_radix_rank< BlockSizeX, RadixBits, Algorithm, BlockSizeY, BlockSizeZ >::rank_keys ( const Key(&)  keys[ItemsPerThread],
unsigned int(&)  ranks[ItemsPerThread],
unsigned int  begin_bit = 0,
unsigned int  pass_bits = RadixBits 
)
inline

Perform ascending radix rank over keys partitioned across threads in a block.

  • This overload does not accept storage argument. Required shared memory is allocated by the method itself.
Template Parameters
Key- the key type.
ItemsPerThread- the number of items contributed by each thread in the block.
Parameters
[in]keys- reference to an array of keys provided by a thread.
[out]ranks- reference to an array where the final ranks are written to.
[in]begin_bit- index of the first (least significant) bit used in key comparison. Must be in range [0; 8 * sizeof(Key)).
[in]pass_bits- [optional] the number of bits used in key comparison. Must be in the range (0; RadixBits]. Default value: RadixBits.

◆ rank_keys() [3/5]

template<unsigned int BlockSizeX, unsigned int RadixBits, block_radix_rank_algorithm Algorithm = block_radix_rank_algorithm::default_algorithm, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
template<typename Key , unsigned ItemsPerThread, typename DigitExtractor >
ROCPRIM_DEVICE void block_radix_rank< BlockSizeX, RadixBits, Algorithm, BlockSizeY, BlockSizeZ >::rank_keys ( const Key(&)  keys[ItemsPerThread],
unsigned int(&)  ranks[ItemsPerThread],
storage_type storage,
DigitExtractor  digit_extractor 
)
inline

Perform ascending radix rank over bit keys partitioned across threads in a block.

This overload accepts a callback used to extract the radix digit from a key.

Template Parameters
Key- the key type.
ItemsPerThread- the number of items contributed by each thread in the block.
DigitExtractor- type of the unary function object used to extract a digit from a key.
Parameters
[in]keys- reference to an array of keys provided by a thread.
[out]ranks- reference to an array where the final ranks are written to.
[in]storage- reference to a temporary storage object of type storage_type.
[in]digit_extractor- function object used to convert a key to a digit. The signature of the digit_extractor should be equivalent to the following: unsigned int f(const Key &key);. The signature does not need to have const &, but function object must not modify the objects passed to it. This function will be used during ranking to extract the digit that indicates the key's value. Values return by this function object must be in range [0; 1 << RadixBits).
Storage reusage
A synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().
Examples

In the example, radix rank is performed on a block of 128 threads. Each thread provides three int values, which are ranked according to a digit callback that extracts digits 0 through 4. Results written back in a separate array of three unsigned int values.

__global__ void example_kernel(...)
{
// specialize the block_radix_rank for int, block of 128 threads, and a maximum of 4 bits.
using block_rank_float = rocprim::block_radix_rank<int, 128, 4>;
// allocate storage in shared memory
__shared__ block_rank_float::storage_type storage;
int input[3] = ...;
unsigned int output[3];
// execute the block radix rank (ascending)
block_rank_float().rank_keys(input,
output,
storage,
[](const int& key)
{
// Rank the keys by the lower 4 bits
return key & 0xF;
});
...
}

◆ rank_keys() [4/5]

template<unsigned int BlockSizeX, unsigned int RadixBits, block_radix_rank_algorithm Algorithm = block_radix_rank_algorithm::default_algorithm, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
template<typename Key , unsigned ItemsPerThread, typename DigitExtractor >
ROCPRIM_DEVICE void block_radix_rank< BlockSizeX, RadixBits, Algorithm, BlockSizeY, BlockSizeZ >::rank_keys ( const Key(&)  keys[ItemsPerThread],
unsigned int(&)  ranks[ItemsPerThread],
DigitExtractor  digit_extractor 
)
inline

Perform ascending radix rank over bit keys partitioned across threads in a block.

This overload accepts a callback used to extract the radix digit from a key.

  • This overload does not accept storage argument. Required shared memory is allocated by the method itself.
Template Parameters
Key- the key type.
ItemsPerThread- the number of items contributed by each thread in the block.
DigitExtractor- type of the unary function object used to extract a digit from a key.
Parameters
[in]keys- reference to an array of keys provided by a thread.
[out]ranks- reference to an array where the final ranks are written to.
[in]digit_extractor- function object used to convert a key to a digit. The signature of the digit_extractor should be equivalent to the following: unsigned int f(const Key &key);. The signature does not need to have const &, but function object must not modify the objects passed to it. This function will be used during ranking to extract the digit that indicates the key's value. Values return by this function object must be in range [0; 1 << RadixBits).

◆ rank_keys() [5/5]

template<unsigned int BlockSizeX, unsigned int RadixBits, block_radix_rank_algorithm Algorithm = block_radix_rank_algorithm::default_algorithm, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
template<typename Key , unsigned ItemsPerThread, typename DigitExtractor >
ROCPRIM_DEVICE void block_radix_rank< BlockSizeX, RadixBits, Algorithm, BlockSizeY, BlockSizeZ >::rank_keys ( const Key(&)  keys[ItemsPerThread],
unsigned int(&)  ranks[ItemsPerThread],
storage_type storage,
DigitExtractor  digit_extractor,
unsigned int(&)  prefix[digits_per_thread],
unsigned int(&)  counts[digits_per_thread] 
)
inline

Perform ascending radix rank over bit keys partitioned across threads in a block.

This overload accepts a callback used to extract the radix digit from a key, and provides the counts of each digit and a prefix scan thereof in a blocked arrangement.

Template Parameters
Key- the key type.
ItemsPerThread- the number of items contributed by each thread in the block.
DigitExtractor- type of the unary function object used to extract a digit from a key.
Parameters
[in]keys- reference to an array of keys provided by a thread. Keys are expected in warp-striped arrangement.
[out]ranks- reference to an array where the final ranks are written to. Ranks are provided in warp-striped arrangement.
[in]storage- reference to a temporary storage object of type storage_type.
[in]digit_extractor- function object used to convert a key to a digit. The signature of the digit_extractor should be equivalent to the following: unsigned int f(const Key &key);. The signature does not need to have const &, but function object must not modify the objects passed to it. This function will be used during ranking to extract the digit that indicates the key's value. Values return by this function object must be in range [0; 1 << RadixBits).
[in]prefix- An exclusive prefix scan of the counts per digit.
[in]counts- The number of keys with a particular digit in the input, per digit.
Storage reusage
A synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().
__global__ void example_kernel(...)
{
// specialize the block_radix_rank for int, block of 128 threads, and a maximum of 4 bits.
using block_rank_float = rocprim::block_radix_rank<int, 128, 4>;
// allocate storage in shared memory
__shared__ block_rank_float::storage_type storage;
int input[3] = ...;
unsigned int output[3];
unsinged int digit_prefix[block_rank_float::digits_per_thread];
unsinged int digit_counts[block_rank_float::digits_per_thread];
// execute the block radix rank (ascending)
block_rank_float().rank_keys(input,
output,
storage,
[](const int& key)
{
// Rank the keys by the lower 4 bits
return key & 0xF;
},
digit_prefix,
digit_counts);
...
}

◆ rank_keys_desc() [1/4]

template<unsigned int BlockSizeX, unsigned int RadixBits, block_radix_rank_algorithm Algorithm = block_radix_rank_algorithm::default_algorithm, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
template<typename Key , unsigned ItemsPerThread>
ROCPRIM_DEVICE void block_radix_rank< BlockSizeX, RadixBits, Algorithm, BlockSizeY, BlockSizeZ >::rank_keys_desc ( const Key(&)  keys[ItemsPerThread],
unsigned int(&)  ranks[ItemsPerThread],
storage_type storage,
unsigned int  begin_bit = 0,
unsigned int  pass_bits = RadixBits 
)
inline

Perform ascending radix rank over bit keys partitioned across threads in a block.

This overload accepts a callback used to extract the radix digit from a key.

Template Parameters
ItemsPerThread- the number of items contributed by each thread in the block.
Parameters
[in]keys- reference to an array of keys provided by a thread.
[out]ranks- reference to an array where the final ranks are written to.
[in]storage- reference to a temporary storage object of type storage_type.
[in]begin_bit- index of the first (least significant) bit used in key comparison. Must be in range [0; 8 * sizeof(Key)).
[in]pass_bits- [optional] the number of bits used in key comparison. Must be in the range (0; RadixBits]. Default value: RadixBits.
Storage reusage
A synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().
Examples

In the example, radix rank is performed on a block of 128 threads. Each thread provides three float values, which are ranked according to bits 10 through 14. The results are written back in a separate array of three unsigned int values.

__global__ void example_kernel(...)
{
// specialize the block_radix_rank for float, block of 128 threads, and a maximum of 4 bits.
using block_rank_float = rocprim::block_radix_rank<float, 128, 4>;
// allocate storage in shared memory
__shared__ block_rank_float::storage_type storage;
float input[3] = ...;
unsigned int output[3];
// execute the block radix rank (descending)
block_rank_float().rank_keys_desc(input,
output,
storage,
10,
4);
...
}

◆ rank_keys_desc() [2/4]

template<unsigned int BlockSizeX, unsigned int RadixBits, block_radix_rank_algorithm Algorithm = block_radix_rank_algorithm::default_algorithm, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
template<typename Key , unsigned ItemsPerThread>
ROCPRIM_DEVICE void block_radix_rank< BlockSizeX, RadixBits, Algorithm, BlockSizeY, BlockSizeZ >::rank_keys_desc ( const Key(&)  keys[ItemsPerThread],
unsigned int(&)  ranks[ItemsPerThread],
unsigned int  begin_bit = 0,
unsigned int  pass_bits = RadixBits 
)
inline

Perform descending radix rank over keys partitioned across threads in a block.

  • This overload does not accept storage argument. Required shared memory is allocated by the method itself.
Template Parameters
Key- the key type.
ItemsPerThread- the number of items contributed by each thread in the block.
Parameters
[in]keys- reference to an array of keys provided by a thread.
[out]ranks- reference to an array where the final ranks are written to.
[in]begin_bit- index of the first (least significant) bit used in key comparison. Must be in range [0; 8 * sizeof(Key)).
[in]pass_bits- [optional] the number of bits used in key comparison. Must be in the range (0; RadixBits]. Default value: RadixBits.

◆ rank_keys_desc() [3/4]

template<unsigned int BlockSizeX, unsigned int RadixBits, block_radix_rank_algorithm Algorithm = block_radix_rank_algorithm::default_algorithm, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
template<typename Key , unsigned ItemsPerThread, typename DigitExtractor >
ROCPRIM_DEVICE void block_radix_rank< BlockSizeX, RadixBits, Algorithm, BlockSizeY, BlockSizeZ >::rank_keys_desc ( const Key(&)  keys[ItemsPerThread],
unsigned int(&)  ranks[ItemsPerThread],
storage_type storage,
DigitExtractor  digit_extractor 
)
inline

Perform descending radix rank over bit keys partitioned across threads in a block.

This overload accepts a callback used to extract the radix digit from a key.

Template Parameters
Key- the key type.
ItemsPerThread- the number of items contributed by each thread in the block.
DigitExtractor- type of the unary function object used to extract a digit from a key.
Parameters
[in]keys- reference to an array of keys provided by a thread.
[out]ranks- reference to an array where the final ranks are written to.
[in]storage- reference to a temporary storage object of type storage_type.
[in]digit_extractor- function object used to convert a key to a digit. The signature of the digit_extractor should be equivalent to the following: unsigned int f(const Key &key);. The signature does not need to have const &, but function object must not modify the objects passed to it. This function will be used during ranking to extract the digit that indicates the key's value. Values return by this function object must be in range [0; 1 << RadixBits).
Storage reusage
A synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().
Examples

In the example, radix rank is performed on a block of 128 threads. Each thread provides three int values, which are ranked according to a digit callback that extracts digits 0 through 4. Results written back in a separate array of three unsigned int values.

__global__ void example_kernel(...)
{
// specialize the block_radix_rank for int, block of 128 threads, and a maximum of 4 bits.
using block_rank_float = rocprim::block_radix_rank<int, 128, 4>;
// allocate storage in shared memory
__shared__ block_rank_float::storage_type storage;
int input[3] = ...;
unsigned int output[3];
// execute the block radix rank (descending))
block_rank_float().rank_keys_desc(input,
output,
storage,
[](const int& key)
{
// Rank the keys by the lower 4 bits
return key & 0xF;
});
...
}

◆ rank_keys_desc() [4/4]

template<unsigned int BlockSizeX, unsigned int RadixBits, block_radix_rank_algorithm Algorithm = block_radix_rank_algorithm::default_algorithm, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
template<typename Key , unsigned ItemsPerThread, typename DigitExtractor >
ROCPRIM_DEVICE void block_radix_rank< BlockSizeX, RadixBits, Algorithm, BlockSizeY, BlockSizeZ >::rank_keys_desc ( const Key(&)  keys[ItemsPerThread],
unsigned int(&)  ranks[ItemsPerThread],
DigitExtractor  digit_extractor 
)
inline

Perform descending radix rank over bit keys partitioned across threads in a block.

This overload accepts a callback used to extract the radix digit from a key.

  • This overload does not accept storage argument. Required shared memory is allocated by the method itself.
Template Parameters
Key- the key type.
ItemsPerThread- the number of items contributed by each thread in the block.
DigitExtractor- type of the unary function object used to extract a digit from a key.
Parameters
[in]keys- reference to an array of keys provided by a thread.
[out]ranks- reference to an array where the final ranks are written to.
[in]digit_extractor- function object used to convert a key to a digit. The signature of the digit_extractor should be equivalent to the following: unsinged int f(const Key &key);. The signature does not need to have const &, but function object must not modify the objects passed to it. This function will be used during ranking to extract the digit that indicates the key's value. Values return by this function object must be in range [0; 1 << RadixBits).

The documentation for this class was generated from the following file: