cuda-kat
CUDA kernel author's tools
#include "warp.cuh"
#include <kat/on_device/common.cuh>
#include <kat/on_device/math.cuh>
#include <kat/on_device/grid_info.cuh>
#include <type_traits>
Functions
template<typename Function, typename Size = size_t>
KAT_FD void kat::linear_grid::collaborative::grid::at_grid_stride (Size length, const Function &f)
Have all kernel threads perform some action over the linear range of 0..length-1, at strides equal to the grid length.
template<typename Function, typename Size = unsigned>
KAT_FD void kat::linear_grid::collaborative::grid::warp_per_input_element::at_grid_stride (Size length, const Function &f)
A variant of the one-position-per-thread applicator, collaborative::grid::at_grid_stride(): here each warp works on one input position, advancing by 'grid stride' in the sense of total warps in the grid.
template<typename Function, typename Size = size_t, bool AssumeLengthIsDivisibleByBlockSize = false, bool GridMayFullyCoverLength = true, typename SerializationFactor = unsigned>
KAT_FD void kat::linear_grid::collaborative::grid::at_block_stride (Size length, const Function &f, SerializationFactor serialization_factor)
Have all grid threads perform some action over the linear range of 0..length-1, with each thread acting on a fixed number of items (the serialization_factor) at a stride of the block length.
CUDA device computation grid-level primitives, i.e. those involving interaction of threads from different blocks in the grid.
KAT_FD void kat::linear_grid::collaborative::grid::at_block_stride (Size length, const Function &f, SerializationFactor serialization_factor)
Have all grid threads perform some action over the linear range of 0..length-1, with each thread acting on a fixed number of items (the serialization_factor) at a stride of the block length, i.e. a thread with index i_t in block with index i_b, where block lengths are n_b, will perform the action on elements
n_b * i_b * serialization_factor + i_t, n_b * (i_b * serialization_factor + 1) + i_t, n_b * (i_b * serialization_factor + 2) + i_t,
and so on, for serialization_factor elements in total. For lengths which are not divisible by n_b * serialization_factor, threads in the last block will work on fewer items.
Thus, if in the following chart the rectangles represent consecutive segments of n_b integers, the numbers indicate which blocks work on which elements in "block stride":
| 1 | 1 | 2 | 2 | 3 | 3 | 4 |
(A block strides from one block's worth of indices to the next.) This is unlike at_grid_stride(), for which instead of 1, 1, 2, 2, 3, 3, 4 we would have 1, 2, 3, 1, 2, 3, 1 (if the grid has 3 blocks) or 1, 2, 3, 4, 1, 2, 3 (if the grid has 4 blocks).
Note: the serialization_factor value could be computed by this function itself. This is avoided, on the assumption that it has been taken care of before the call. Specifically, we assume that serialization_factor is no higher than it absolutely must be.
Note: overflow of the Size type is not necessarily handled, e.g. if your Size is uint32_t and length is close to 2^32 - 1, the function may fail.
length | The length of the range (of integers) on which to act |
serialization_factor | the number of elements each thread is to handle (serially) |
f | The callable to execute for each element of the sequence. |
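The block-stride pattern described above can be checked on the host with a small model. This is only a sketch of the index arithmetic, not cuda-kat code: the function name `block_stride_owners` and its parameters are hypothetical, and it assumes each thread starts at n_b * i_b * serialization_factor + i_t and then advances by n_b.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side model (not part of cuda-kat): for each position in
// 0..length-1, report the index of the block whose threads would act on it
// under the block-stride pattern. Positions no thread reaches stay at -1.
std::vector<int> block_stride_owners(
    std::size_t length,
    std::size_t num_blocks,
    std::size_t block_length,          // n_b
    std::size_t serialization_factor)
{
    assert(block_length > 0 && serialization_factor > 0);
    std::vector<int> owner(length, -1);
    for (std::size_t i_b = 0; i_b < num_blocks; ++i_b) {
        for (std::size_t i_t = 0; i_t < block_length; ++i_t) {
            // First position of this thread: start of the block's chunk, plus i_t.
            std::size_t pos = block_length * i_b * serialization_factor + i_t;
            // Each thread handles at most serialization_factor positions,
            // one block length apart; the last block may run out early.
            for (std::size_t k = 0; k < serialization_factor && pos < length; ++k) {
                owner[pos] = static_cast<int>(i_b);
                pos += block_length;
            }
        }
    }
    return owner;
}
```

Calling `block_stride_owners(14, 4, 2, 2)` models 4 blocks of 2 threads with a serialization factor of 2 over 14 positions. Read per segment of n_b = 2 positions, with blocks numbered from 1, the result is the 1, 1, 2, 2, 3, 3, 4 pattern of the chart; the last block covers only one segment because the length runs out.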
KAT_FD void kat::linear_grid::collaborative::grid::at_grid_stride (Size length, const Function &f)
Have all kernel threads perform some action over the linear range of 0..length-1, at strides equal to the grid length, i.e. a thread with index i_t in block with index i_b, where block lengths are n_b and the grid has n_g threads overall, will perform the action on elements n_b * i_b + i_t, n_b * i_b + i_t + n_g, n_b * i_b + i_t + 2*n_g, and so on.
Thus, if in the following chart the rectangles represent consecutive segments of n_b integers, the numbers indicate which blocks work on which elements in "grid stride":
| 1 | 2 | 3 | 1 | 2 | 3 | 1 |
(The grid is 3 blocks' worth, so block 1 strides 3 blocks from one sequence of indices it processes to the next.) This is unlike at_block_stride(), for which instead of 1, 2, 3, 1, 2, 3, 1 we would have 1, 1, 1, 2, 2, 2, 3 (or 1, 1, 2, 2, 3, 3, 4 if the grid has 4 blocks).
length | The length of the range (of integers) on which to act |
f | The callable to call for each element of the sequence. |
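For comparison, the grid-stride pattern can be modeled on the host in the same way. Again this is a sketch rather than cuda-kat code: `grid_stride_owners` is a hypothetical name, and the model assumes a thread's first position is its global index n_b * i_b + i_t, after which it advances by the total number of threads in the grid.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side model (not part of cuda-kat): for each position in
// 0..length-1, report the index of the block whose threads would act on it
// under the grid-stride pattern.
std::vector<int> grid_stride_owners(
    std::size_t length,
    std::size_t num_blocks,
    std::size_t block_length)  // n_b
{
    assert(num_blocks > 0 && block_length > 0);
    const std::size_t grid_length = num_blocks * block_length;  // threads in the grid
    std::vector<int> owner(length, -1);
    for (std::size_t i_b = 0; i_b < num_blocks; ++i_b) {
        for (std::size_t i_t = 0; i_t < block_length; ++i_t) {
            // Start at the thread's global index, then jump a whole grid at a time.
            for (std::size_t pos = block_length * i_b + i_t; pos < length; pos += grid_length) {
                owner[pos] = static_cast<int>(i_b);
            }
        }
    }
    return owner;
}
```

Calling `grid_stride_owners(14, 3, 2)` models 3 blocks of 2 threads over 14 positions. Read per segment of n_b = 2 positions, with blocks numbered from 1, the result is the 1, 2, 3, 1, 2, 3, 1 pattern of the chart.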
KAT_FD void kat::linear_grid::collaborative::grid::warp_per_input_element::at_grid_stride (Size length, const Function &f)
A variant of the one-position-per-thread applicator, collaborative::grid::at_grid_stride(): here each warp works on one input position, advancing by 'grid stride' in the sense of total warps in the grid.
Note: this at_grid_stride is specific to linear grids, even though the text of its code looks the same as that of kat::grid_info::collaborative::warp::at_grid_stride.
length | The length of the range of positions on which to act |
f | The callable for warps to apply to each position in the sequence |
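The warp-per-position variant can be modeled on the host too. This sketch is not cuda-kat code: `warp_grid_stride_owners` is a hypothetical name, and it assumes warps are numbered globally, block by block, so that the warp with index i_w in block i_b has global index i_b * warps_per_block + i_w.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side model (not part of cuda-kat): for each position in
// 0..length-1, report the global index of the warp that would work on it when
// every warp takes one position and advances by the total number of warps.
std::vector<int> warp_grid_stride_owners(
    std::size_t length,
    std::size_t num_blocks,
    std::size_t warps_per_block)
{
    assert(num_blocks > 0 && warps_per_block > 0);
    const std::size_t total_warps = num_blocks * warps_per_block;
    std::vector<int> owner(length, -1);
    for (std::size_t i_b = 0; i_b < num_blocks; ++i_b) {
        for (std::size_t i_w = 0; i_w < warps_per_block; ++i_w) {
            // Assumed numbering: warps are counted block by block.
            const std::size_t global_warp = i_b * warps_per_block + i_w;
            for (std::size_t pos = global_warp; pos < length; pos += total_warps) {
                owner[pos] = static_cast<int>(global_warp);
            }
        }
    }
    return owner;
}
```

Calling `warp_grid_stride_owners(10, 2, 2)` models a grid of 2 blocks with 2 warps each: positions 0 to 3 go to warps 0 to 3, after which the assignment wraps around, so position 4 goes back to warp 0.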