|
template<typename Key , unsigned ItemsPerThread> |
ROCPRIM_DEVICE void | rank_keys (const Key(&keys)[ItemsPerThread], unsigned int(&ranks)[ItemsPerThread], storage_type &storage, unsigned int begin_bit=0, unsigned int pass_bits=RadixBits) |
| Perform ascending radix rank over keys partitioned across threads in a block. More...
|
|
template<typename Key , unsigned ItemsPerThread> |
ROCPRIM_DEVICE void | rank_keys (const Key(&keys)[ItemsPerThread], unsigned int(&ranks)[ItemsPerThread], unsigned int begin_bit=0, unsigned int pass_bits=RadixBits) |
| Perform ascending radix rank over keys partitioned across threads in a block. More...
|
|
template<typename Key , unsigned ItemsPerThread> |
ROCPRIM_DEVICE void | rank_keys_desc (const Key(&keys)[ItemsPerThread], unsigned int(&ranks)[ItemsPerThread], storage_type &storage, unsigned int begin_bit=0, unsigned int pass_bits=RadixBits) |
| Perform ascending radix rank over bit keys partitioned across threads in a block. More...
|
|
template<typename Key , unsigned ItemsPerThread> |
ROCPRIM_DEVICE void | rank_keys_desc (const Key(&keys)[ItemsPerThread], unsigned int(&ranks)[ItemsPerThread], unsigned int begin_bit=0, unsigned int pass_bits=RadixBits) |
| Perform descending radix rank over keys partitioned across threads in a block. More...
|
|
template<typename Key , unsigned ItemsPerThread, typename DigitExtractor > |
ROCPRIM_DEVICE void | rank_keys (const Key(&keys)[ItemsPerThread], unsigned int(&ranks)[ItemsPerThread], storage_type &storage, DigitExtractor digit_extractor) |
| Perform ascending radix rank over bit keys partitioned across threads in a block. More...
|
|
template<typename Key , unsigned ItemsPerThread, typename DigitExtractor > |
ROCPRIM_DEVICE void | rank_keys (const Key(&keys)[ItemsPerThread], unsigned int(&ranks)[ItemsPerThread], DigitExtractor digit_extractor) |
| Perform ascending radix rank over bit keys partitioned across threads in a block. More...
|
|
template<typename Key , unsigned ItemsPerThread, typename DigitExtractor > |
ROCPRIM_DEVICE void | rank_keys_desc (const Key(&keys)[ItemsPerThread], unsigned int(&ranks)[ItemsPerThread], storage_type &storage, DigitExtractor digit_extractor) |
| Perform descending radix rank over bit keys partitioned across threads in a block. More...
|
|
template<typename Key , unsigned ItemsPerThread, typename DigitExtractor > |
ROCPRIM_DEVICE void | rank_keys_desc (const Key(&keys)[ItemsPerThread], unsigned int(&ranks)[ItemsPerThread], DigitExtractor digit_extractor) |
| Perform descending radix rank over bit keys partitioned across threads in a block. More...
|
|
template<typename Key , unsigned ItemsPerThread, typename DigitExtractor > |
ROCPRIM_DEVICE void | rank_keys (const Key(&keys)[ItemsPerThread], unsigned int(&ranks)[ItemsPerThread], storage_type &storage, DigitExtractor digit_extractor, unsigned int(&prefix)[digits_per_thread], unsigned int(&counts)[digits_per_thread]) |
| Perform ascending radix rank over bit keys partitioned across threads in a block. More...
|
|
template<unsigned int BlockSizeX, unsigned int RadixBits, block_radix_rank_algorithm Algorithm = block_radix_rank_algorithm::default_algorithm, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
class block_radix_rank< BlockSizeX, RadixBits, Algorithm, BlockSizeY, BlockSizeZ >
The block_radix_rank class is a block level parallel primitive that provides methods for ranking items partitioned across threads in a block.
This algorithm associates each item with the index it would gain if the keys were sorted into an array, according to a radix comparison. Ranking is performed in a stable manner.
- Template Parameters
-
BlockSizeX | - the number of threads in a block's x dimension. |
RadixBits | - the maximum number of radix digit bits that comparisons are performed by. |
MemoizeOuterScan | - whether to cache digit counters in local memory. This omits loading the same values from shared memory twice, at the expense of more register usage. |
BlockSizeY | - the number of threads in a block's y dimension, defaults to 1. |
BlockSizeZ | - the number of threads in a block's z dimension, defaults to 1. |
- Overview
- Key type must be an arithmetic type (that is, an integral type or a floating point type).
- Performance depends on the block size and the number of items that will be sorted per thread.
- It is usually better if the block size is a multiple of the size of the hardware warp.
- It is usually increased when there are more than one item per thread. However, when there are too many items per thread, each thread may need so many registers and/or shared memory that it impedes performance.
- Shared memory usage depends on the block size and the maximum number of radix bits that will be considered when comparing keys.
- The storage requirement increases when more bits are considered.
- The maximum amount of keys that can be ranked in a block is
2**16
.
- Examples
- In the example, radix rank is performed on a block of 128 threads. Each thread provides three
float
values, which are ranked according to bits 10 through 14. The results are written back in a separate array of three unsigned int
values.
__global__ void example_kernel(...)
{
using block_rank_float = rocprim::block_radix_rank<float, 128>;
__shared__ block_rank_float::storage_type storage;
float input[3] = ...;
unsigned int output[3] = ...;
block_rank_float().rank_keys(input,
output,
storage,
10,
4);
...
}
template<unsigned int BlockSizeX, unsigned int RadixBits, block_radix_rank_algorithm Algorithm = block_radix_rank_algorithm::default_algorithm, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
Struct used to allocate a temporary memory that is required for thread communication during operations provided by related parallel primitive.
Depending on the implemention the operations exposed by parallel primitive may require a temporary storage for thread communication. The storage should be allocated using keywords shared
. It can be aliased to an externally allocated memory, or be a part of a union with other storage types to increase shared memory reusability.
template<unsigned int BlockSizeX, unsigned int RadixBits, block_radix_rank_algorithm Algorithm = block_radix_rank_algorithm::default_algorithm, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
template<typename Key , unsigned ItemsPerThread>
ROCPRIM_DEVICE void block_radix_rank< BlockSizeX, RadixBits, Algorithm, BlockSizeY, BlockSizeZ >::rank_keys |
( |
const Key(&) |
keys[ItemsPerThread], |
|
|
unsigned int(&) |
ranks[ItemsPerThread], |
|
|
storage_type & |
storage, |
|
|
unsigned int |
begin_bit = 0 , |
|
|
unsigned int |
pass_bits = RadixBits |
|
) |
| |
|
inline |
Perform ascending radix rank over keys partitioned across threads in a block.
- Template Parameters
-
Key | - the key type. |
ItemsPerThread | - the number of items contributed by each thread in the block. |
- Parameters
-
[in] | keys | - reference to an array of keys provided by a thread. |
[out] | ranks | - reference to an array where the final ranks are written to. |
[in] | storage | - reference to a temporary storage object of type storage_type . |
[in] | begin_bit | - index of the first (least significant) bit used in key comparison. Must be in range [0; 8 * sizeof(Key)) . |
[in] | pass_bits | - [optional] the number of bits used in key comparison. Must be in the range (0; RadixBits] . Default value: RadixBits. |
- Storage reusage
- A synchronization barrier should be placed before
storage
is reused or repurposed: __syncthreads()
or rocprim::syncthreads()
.
- Examples
In the example, radix rank is performed on a block of 128 threads. Each thread provides three float
values, which are ranked according to bits 10 through 14. The results are written back in a separate array of three unsigned int
values.
__global__ void example_kernel(...)
{
using block_rank_float = rocprim::block_radix_rank<float, 128, 4>;
__shared__ block_rank_float::storage_type storage;
float input[3] = ...;
unsigned int output[3];
block_rank_float().rank_keys(input,
output,
storage,
10,
4);
...
}
template<unsigned int BlockSizeX, unsigned int RadixBits, block_radix_rank_algorithm Algorithm = block_radix_rank_algorithm::default_algorithm, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
template<typename Key , unsigned ItemsPerThread>
ROCPRIM_DEVICE void block_radix_rank< BlockSizeX, RadixBits, Algorithm, BlockSizeY, BlockSizeZ >::rank_keys |
( |
const Key(&) |
keys[ItemsPerThread], |
|
|
unsigned int(&) |
ranks[ItemsPerThread], |
|
|
unsigned int |
begin_bit = 0 , |
|
|
unsigned int |
pass_bits = RadixBits |
|
) |
| |
|
inline |
Perform ascending radix rank over keys partitioned across threads in a block.
- This overload does not accept storage argument. Required shared memory is allocated by the method itself.
- Template Parameters
-
Key | - the key type. |
ItemsPerThread | - the number of items contributed by each thread in the block. |
- Parameters
-
[in] | keys | - reference to an array of keys provided by a thread. |
[out] | ranks | - reference to an array where the final ranks are written to. |
[in] | begin_bit | - index of the first (least significant) bit used in key comparison. Must be in range [0; 8 * sizeof(Key)) . |
[in] | pass_bits | - [optional] the number of bits used in key comparison. Must be in the range (0; RadixBits] . Default value: RadixBits. |
template<unsigned int BlockSizeX, unsigned int RadixBits, block_radix_rank_algorithm Algorithm = block_radix_rank_algorithm::default_algorithm, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
template<typename Key , unsigned ItemsPerThread, typename DigitExtractor >
ROCPRIM_DEVICE void block_radix_rank< BlockSizeX, RadixBits, Algorithm, BlockSizeY, BlockSizeZ >::rank_keys |
( |
const Key(&) |
keys[ItemsPerThread], |
|
|
unsigned int(&) |
ranks[ItemsPerThread], |
|
|
storage_type & |
storage, |
|
|
DigitExtractor |
digit_extractor |
|
) |
| |
|
inline |
Perform ascending radix rank over bit keys partitioned across threads in a block.
This overload accepts a callback used to extract the radix digit from a key.
- Template Parameters
-
Key | - the key type. |
ItemsPerThread | - the number of items contributed by each thread in the block. |
DigitExtractor | - type of the unary function object used to extract a digit from a key. |
- Parameters
-
[in] | keys | - reference to an array of keys provided by a thread. |
[out] | ranks | - reference to an array where the final ranks are written to. |
[in] | storage | - reference to a temporary storage object of type storage_type . |
[in] | digit_extractor | - function object used to convert a key to a digit. The signature of the digit_extractor should be equivalent to the following: unsigned int f(const Key &key); . The signature does not need to have const & , but function object must not modify the objects passed to it. This function will be used during ranking to extract the digit that indicates the key's value. Values return by this function object must be in range [0; 1 << RadixBits). |
- Storage reusage
- A synchronization barrier should be placed before
storage
is reused or repurposed: __syncthreads()
or rocprim::syncthreads()
.
- Examples
In the example, radix rank is performed on a block of 128 threads. Each thread provides three int
values, which are ranked according to a digit callback that extracts digits 0 through 4. Results written back in a separate array of three unsigned int
values.
__global__ void example_kernel(...)
{
using block_rank_float = rocprim::block_radix_rank<int, 128, 4>;
__shared__ block_rank_float::storage_type storage;
int input[3] = ...;
unsigned int output[3];
block_rank_float().rank_keys(input,
output,
storage,
[](const int& key)
{
return key & 0xF;
});
...
}
template<unsigned int BlockSizeX, unsigned int RadixBits, block_radix_rank_algorithm Algorithm = block_radix_rank_algorithm::default_algorithm, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
template<typename Key , unsigned ItemsPerThread, typename DigitExtractor >
ROCPRIM_DEVICE void block_radix_rank< BlockSizeX, RadixBits, Algorithm, BlockSizeY, BlockSizeZ >::rank_keys |
( |
const Key(&) |
keys[ItemsPerThread], |
|
|
unsigned int(&) |
ranks[ItemsPerThread], |
|
|
DigitExtractor |
digit_extractor |
|
) |
| |
|
inline |
Perform ascending radix rank over bit keys partitioned across threads in a block.
This overload accepts a callback used to extract the radix digit from a key.
- This overload does not accept storage argument. Required shared memory is allocated by the method itself.
- Template Parameters
-
Key | - the key type. |
ItemsPerThread | - the number of items contributed by each thread in the block. |
DigitExtractor | - type of the unary function object used to extract a digit from a key. |
- Parameters
-
[in] | keys | - reference to an array of keys provided by a thread. |
[out] | ranks | - reference to an array where the final ranks are written to. |
[in] | digit_extractor | - function object used to convert a key to a digit. The signature of the digit_extractor should be equivalent to the following: unsigned int f(const Key &key); . The signature does not need to have const & , but function object must not modify the objects passed to it. This function will be used during ranking to extract the digit that indicates the key's value. Values return by this function object must be in range [0; 1 << RadixBits). |
template<unsigned int BlockSizeX, unsigned int RadixBits, block_radix_rank_algorithm Algorithm = block_radix_rank_algorithm::default_algorithm, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
template<typename Key , unsigned ItemsPerThread, typename DigitExtractor >
ROCPRIM_DEVICE void block_radix_rank< BlockSizeX, RadixBits, Algorithm, BlockSizeY, BlockSizeZ >::rank_keys |
( |
const Key(&) |
keys[ItemsPerThread], |
|
|
unsigned int(&) |
ranks[ItemsPerThread], |
|
|
storage_type & |
storage, |
|
|
DigitExtractor |
digit_extractor, |
|
|
unsigned int(&) |
prefix[digits_per_thread], |
|
|
unsigned int(&) |
counts[digits_per_thread] |
|
) |
| |
|
inline |
Perform ascending radix rank over bit keys partitioned across threads in a block.
This overload accepts a callback used to extract the radix digit from a key, and provides the counts of each digit and a prefix scan thereof in a blocked arrangement.
- Template Parameters
-
Key | - the key type. |
ItemsPerThread | - the number of items contributed by each thread in the block. |
DigitExtractor | - type of the unary function object used to extract a digit from a key. |
- Parameters
-
[in] | keys | - reference to an array of keys provided by a thread. Keys are expected in warp-striped arrangement. |
[out] | ranks | - reference to an array where the final ranks are written to. Ranks are provided in warp-striped arrangement. |
[in] | storage | - reference to a temporary storage object of type storage_type . |
[in] | digit_extractor | - function object used to convert a key to a digit. The signature of the digit_extractor should be equivalent to the following: unsigned int f(const Key &key); . The signature does not need to have const & , but function object must not modify the objects passed to it. This function will be used during ranking to extract the digit that indicates the key's value. Values return by this function object must be in range [0; 1 << RadixBits). |
[in] | prefix | - An exclusive prefix scan of the counts per digit. |
[in] | counts | - The number of keys with a particular digit in the input, per digit. |
- Storage reusage
- A synchronization barrier should be placed before
storage
is reused or repurposed: __syncthreads()
or rocprim::syncthreads()
.
__global__ void example_kernel(...)
{
using block_rank_float = rocprim::block_radix_rank<int, 128, 4>;
__shared__ block_rank_float::storage_type storage;
int input[3] = ...;
unsigned int output[3];
unsinged int digit_prefix[block_rank_float::digits_per_thread];
unsinged int digit_counts[block_rank_float::digits_per_thread];
block_rank_float().rank_keys(input,
output,
storage,
[](const int& key)
{
return key & 0xF;
},
digit_prefix,
digit_counts);
...
}
template<unsigned int BlockSizeX, unsigned int RadixBits, block_radix_rank_algorithm Algorithm = block_radix_rank_algorithm::default_algorithm, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
template<typename Key , unsigned ItemsPerThread>
ROCPRIM_DEVICE void block_radix_rank< BlockSizeX, RadixBits, Algorithm, BlockSizeY, BlockSizeZ >::rank_keys_desc |
( |
const Key(&) |
keys[ItemsPerThread], |
|
|
unsigned int(&) |
ranks[ItemsPerThread], |
|
|
storage_type & |
storage, |
|
|
unsigned int |
begin_bit = 0 , |
|
|
unsigned int |
pass_bits = RadixBits |
|
) |
| |
|
inline |
Perform ascending radix rank over bit keys partitioned across threads in a block.
This overload accepts a callback used to extract the radix digit from a key.
- Template Parameters
-
ItemsPerThread | - the number of items contributed by each thread in the block. |
- Parameters
-
[in] | keys | - reference to an array of keys provided by a thread. |
[out] | ranks | - reference to an array where the final ranks are written to. |
[in] | storage | - reference to a temporary storage object of type storage_type . |
[in] | begin_bit | - index of the first (least significant) bit used in key comparison. Must be in range [0; 8 * sizeof(Key)) . |
[in] | pass_bits | - [optional] the number of bits used in key comparison. Must be in the range (0; RadixBits] . Default value: RadixBits. |
- Storage reusage
- A synchronization barrier should be placed before
storage
is reused or repurposed: __syncthreads()
or rocprim::syncthreads()
.
- Examples
In the example, radix rank is performed on a block of 128 threads. Each thread provides three float
values, which are ranked according to bits 10 through 14. The results are written back in a separate array of three unsigned int
values.
__global__ void example_kernel(...)
{
using block_rank_float = rocprim::block_radix_rank<float, 128, 4>;
__shared__ block_rank_float::storage_type storage;
float input[3] = ...;
unsigned int output[3];
block_rank_float().rank_keys_desc(input,
output,
storage,
10,
4);
...
}
template<unsigned int BlockSizeX, unsigned int RadixBits, block_radix_rank_algorithm Algorithm = block_radix_rank_algorithm::default_algorithm, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
template<typename Key , unsigned ItemsPerThread>
ROCPRIM_DEVICE void block_radix_rank< BlockSizeX, RadixBits, Algorithm, BlockSizeY, BlockSizeZ >::rank_keys_desc |
( |
const Key(&) |
keys[ItemsPerThread], |
|
|
unsigned int(&) |
ranks[ItemsPerThread], |
|
|
unsigned int |
begin_bit = 0 , |
|
|
unsigned int |
pass_bits = RadixBits |
|
) |
| |
|
inline |
Perform descending radix rank over keys partitioned across threads in a block.
- This overload does not accept storage argument. Required shared memory is allocated by the method itself.
- Template Parameters
-
Key | - the key type. |
ItemsPerThread | - the number of items contributed by each thread in the block. |
- Parameters
-
[in] | keys | - reference to an array of keys provided by a thread. |
[out] | ranks | - reference to an array where the final ranks are written to. |
[in] | begin_bit | - index of the first (least significant) bit used in key comparison. Must be in range [0; 8 * sizeof(Key)) . |
[in] | pass_bits | - [optional] the number of bits used in key comparison. Must be in the range (0; RadixBits] . Default value: RadixBits. |
template<unsigned int BlockSizeX, unsigned int RadixBits, block_radix_rank_algorithm Algorithm = block_radix_rank_algorithm::default_algorithm, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
template<typename Key , unsigned ItemsPerThread, typename DigitExtractor >
ROCPRIM_DEVICE void block_radix_rank< BlockSizeX, RadixBits, Algorithm, BlockSizeY, BlockSizeZ >::rank_keys_desc |
( |
const Key(&) |
keys[ItemsPerThread], |
|
|
unsigned int(&) |
ranks[ItemsPerThread], |
|
|
storage_type & |
storage, |
|
|
DigitExtractor |
digit_extractor |
|
) |
| |
|
inline |
Perform descending radix rank over bit keys partitioned across threads in a block.
This overload accepts a callback used to extract the radix digit from a key.
- Template Parameters
-
Key | - the key type. |
ItemsPerThread | - the number of items contributed by each thread in the block. |
DigitExtractor | - type of the unary function object used to extract a digit from a key. |
- Parameters
-
[in] | keys | - reference to an array of keys provided by a thread. |
[out] | ranks | - reference to an array where the final ranks are written to. |
[in] | storage | - reference to a temporary storage object of type storage_type . |
[in] | digit_extractor | - function object used to convert a key to a digit. The signature of the digit_extractor should be equivalent to the following: unsigned int f(const Key &key); . The signature does not need to have const & , but function object must not modify the objects passed to it. This function will be used during ranking to extract the digit that indicates the key's value. Values return by this function object must be in range [0; 1 << RadixBits). |
- Storage reusage
- A synchronization barrier should be placed before
storage
is reused or repurposed: __syncthreads()
or rocprim::syncthreads()
.
- Examples
In the example, radix rank is performed on a block of 128 threads. Each thread provides three int
values, which are ranked according to a digit callback that extracts digits 0 through 4. Results written back in a separate array of three unsigned int
values.
__global__ void example_kernel(...)
{
using block_rank_float = rocprim::block_radix_rank<int, 128, 4>;
__shared__ block_rank_float::storage_type storage;
int input[3] = ...;
unsigned int output[3];
block_rank_float().rank_keys_desc(input,
output,
storage,
[](const int& key)
{
return key & 0xF;
});
...
}
template<unsigned int BlockSizeX, unsigned int RadixBits, block_radix_rank_algorithm Algorithm = block_radix_rank_algorithm::default_algorithm, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
template<typename Key , unsigned ItemsPerThread, typename DigitExtractor >
ROCPRIM_DEVICE void block_radix_rank< BlockSizeX, RadixBits, Algorithm, BlockSizeY, BlockSizeZ >::rank_keys_desc |
( |
const Key(&) |
keys[ItemsPerThread], |
|
|
unsigned int(&) |
ranks[ItemsPerThread], |
|
|
DigitExtractor |
digit_extractor |
|
) |
| |
|
inline |
Perform descending radix rank over bit keys partitioned across threads in a block.
This overload accepts a callback used to extract the radix digit from a key.
- This overload does not accept storage argument. Required shared memory is allocated by the method itself.
- Template Parameters
-
Key | - the key type. |
ItemsPerThread | - the number of items contributed by each thread in the block. |
DigitExtractor | - type of the unary function object used to extract a digit from a key. |
- Parameters
-
[in] | keys | - reference to an array of keys provided by a thread. |
[out] | ranks | - reference to an array where the final ranks are written to. |
[in] | digit_extractor | - function object used to convert a key to a digit. The signature of the digit_extractor should be equivalent to the following: unsinged int f(const Key &key); . The signature does not need to have const & , but function object must not modify the objects passed to it. This function will be used during ranking to extract the digit that indicates the key's value. Values return by this function object must be in range [0; 1 << RadixBits). |