cuda-kat
CUDA kernel author's tools
grid.cuh File Reference

CUDA device computation grid-level primitives, i.e. those involving interaction of threads from different blocks in the grid.

#include "warp.cuh"
#include <kat/on_device/common.cuh>
#include <kat/on_device/math.cuh>
#include <kat/on_device/grid_info.cuh>
#include <type_traits>

Functions

template<typename Function , typename Size = size_t>
KAT_FD void kat::linear_grid::collaborative::grid::at_grid_stride (Size length, const Function &f)
 Have all kernel threads perform some action over the linear range of 0..length-1, at strides equal to the grid length.
 
template<typename Function , typename Size = unsigned>
KAT_FD void kat::linear_grid::collaborative::grid::warp_per_input_element::at_grid_stride (Size length, const Function &f)
 A variant of the one-position-per-thread applicator, collaborative::grid::at_grid_stride(): Here each warp works on one input position, advancing by 'grid stride' in the sense of total warps in the grid.
 
template<typename Function , typename Size = size_t, bool AssumeLengthIsDivisibleByBlockSize = false, bool GridMayFullyCoverLength = true, typename SerializationFactor = unsigned>
KAT_FD void kat::linear_grid::collaborative::grid::at_block_stride (Size length, const Function &f, SerializationFactor serialization_factor)
 Have all grid threads perform some action over the linear range of 0..length-1, with each thread acting on a fixed number of items (the serialization_factor) at a stride of the block length.
 

Detailed Description

CUDA device computation grid-level primitives, i.e. those involving interaction of threads from different blocks in the grid.

Function Documentation

§ at_block_stride()

template<typename Function , typename Size = size_t, bool AssumeLengthIsDivisibleByBlockSize = false, bool GridMayFullyCoverLength = true, typename SerializationFactor = unsigned>
KAT_FD void kat::linear_grid::collaborative::grid::at_block_stride ( Size  length,
const Function &  f,
SerializationFactor  serialization_factor 
)

Have all grid threads perform some action over the linear range of 0..length-1, with each thread acting on a fixed number of items (the serialization_factor) at a stride of the block length, i.e. a thread with index i_t in block with index i_b, where block lengths are n_b, will perform the action on elements

n_b * i_b * serialization_factor + i_t, n_b * (i_b * serialization_factor + 1) + i_t, n_b * (i_b * serialization_factor + 2) + i_t,

and so on, up to serialization_factor elements per thread. For lengths which are not divisible by n_b * serialization_factor, threads in the last block will work on fewer items.

Thus, if in the following chart the rectangles represent consecutive segments of n_b integers, the numbers indicate which blocks work on which elements in "block stride":


| 1 | 1 | 2 | 2 | 3 | 3 | 4 |

(A block strides from one block's worth of indices to the next.) This is unlike at_grid_stride(), for which instead of 1, 1, 2, 2, 3, 3, 4 we would have 1, 2, 3, 1, 2, 3, 1 (if the grid has 3 blocks) or 1, 2, 3, 4, 1, 2, 3 (if the grid has 4 blocks).
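As a rough illustration of the block-to-segment assignment described above, here is a host-side C++ model (not device code, and not part of cuda-kat). The function name block_stride_owners and its parameters are hypothetical; the model assumes each thread starts at n_b * i_b * serialization_factor + i_t and advances by n_b, stopping at the end of the range.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side model only: no CUDA or cuda-kat calls are made here.
// For every position in 0..length-1, returns the 1-based number of the block
// whose threads would act on it at block stride (0 means "not covered").
std::vector<unsigned> block_stride_owners(
    std::size_t length, unsigned num_blocks, unsigned block_length,
    unsigned serialization_factor)
{
    assert(block_length > 0);
    std::vector<unsigned> owner(length, 0);
    for (unsigned i_b = 0; i_b < num_blocks; ++i_b) {
        for (unsigned i_t = 0; i_t < block_length; ++i_t) {
            // First position for this thread: start of the block's chunk, plus i_t
            std::size_t pos =
                std::size_t(block_length) * i_b * serialization_factor + i_t;
            for (unsigned k = 0; k < serialization_factor && pos < length; ++k) {
                owner[pos] = i_b + 1;
                pos += block_length; // advance by one block length
            }
        }
    }
    return owner;
}
```

With a length of 28, 4 blocks of 4 threads each and a serialization factor of 2, the seven 4-element segments come out owned by blocks 1, 1, 2, 2, 3, 3, 4, which is the sequence given above.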

Note
Theoretically, the serialization_factor value could be computed by this function itself. This is avoided, assuming that's been taken care of before. Specifically, we assume that the serialization_factor is no higher than it absolutely must be.
Note
There's a block-level variant of this primitive, but there, each block applies f to the same range of elements, rather than covering part of a larger range.
This implementation does not handle cases of overflow of the Size type, e.g. if your Size is uint32_t and length is close to 2^32 - 1, the function may fail.
Note
There's a tricky tradeoff here between avoiding per-iteration checks for whether we're past the end, and avoiding too many initial checks. Two of the template parameters, AssumeLengthIsDivisibleByBlockSize and GridMayFullyCoverLength, help us avoid this tradeoff in certain cases, by letting the caller state facts that would otherwise have to be checked explicitly.
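To see how the Size overflow mentioned in the notes could arise, consider the following host-side sketch. It assumes, purely for illustration, that a position is advanced by a stride in the same unsigned Size type; the helper names advance_position and advance_would_wrap are hypothetical and say nothing about how cuda-kat actually iterates.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <type_traits>

// Hypothetical helper: advance a position by a stride, staying in type Size.
// For unsigned types the result wraps around modulo 2^(bits in Size).
template <typename Size>
Size advance_position(Size pos, Size stride)
{
    static_assert(std::is_unsigned<Size>::value,
        "this sketch models unsigned size types only");
    assert(stride > 0);
    return static_cast<Size>(pos + stride);
}

// Hypothetical helper: would advancing pos by stride wrap around in Size?
template <typename Size>
bool advance_would_wrap(Size pos, Size stride)
{
    static_assert(std::is_unsigned<Size>::value,
        "this sketch models unsigned size types only");
    return pos > std::numeric_limits<Size>::max() - stride;
}
```

For example, with Size = uint32_t, advancing position 4294967040 (that is, 2^32 - 256) by 1024 wraps around to 768, which again compares as smaller than a length close to 2^32 - 1. Under that assumption, an end-of-range check done in the same type would not stop the thread.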
Parameters
length    The length of the range (of integers) on which to act
serialization_factor    The number of elements each thread is to handle (serially)
f    The callable to execute for each element of the sequence.
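Since the serialization_factor is expected to be computed by the caller, a caller might derive it along these lines. This is a minimal sketch under one reading of "no higher than it absolutely must be", namely the smallest factor with which the grid's threads can cover the whole range; the function name minimal_serialization_factor is hypothetical and not part of cuda-kat.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: the smallest number of elements per thread such that
// num_blocks blocks of block_length threads each cover 0..length-1,
// i.e. the ceiling of length / (num_blocks * block_length).
std::size_t minimal_serialization_factor(
    std::size_t length, std::size_t num_blocks, std::size_t block_length)
{
    assert(num_blocks > 0 && block_length > 0);
    const std::size_t grid_length = num_blocks * block_length;
    // Written as quotient plus remainder test, to avoid overflowing length + grid_length - 1
    return length / grid_length + (length % grid_length != 0 ? 1 : 0);
}
```

For a range of 28 elements and blocks of 4 threads, this gives 2 for a grid of 4 blocks and 3 for a grid of 3 blocks.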

§ at_grid_stride() [1/2]

template<typename Function , typename Size = size_t>
KAT_FD void kat::linear_grid::collaborative::grid::at_grid_stride ( Size  length,
const Function &  f 
)

Have all kernel threads perform some action over the linear range of 0..length-1, at strides equal to the grid length, i.e. a thread with index i_t in block with index i_b, where block lengths are n_b and the grid has n_g blocks, will perform the action on elements n_b * i_b + i_t, n_b * (i_b + n_g) + i_t, n_b * (i_b + 2 * n_g) + i_t, and so on.

Thus, if in the following chart the rectangles represent consecutive segments of n_b integers, the numbers indicate which blocks work on which elements in "grid stride":


| 1 | 2 | 3 | 1 | 2 | 3 | 1 |

(The grid is 3 blocks' worth, so block 1 strides 3 blocks from one sequence of indices it processes to the next.) This is unlike at_block_stride(), for which instead of 1, 2, 3, 1, 2, 3, 1 we would have 1, 1, 1, 2, 2, 2, 3 (or 1, 1, 2, 2, 3, 3, 4 if the grid has 4 blocks).
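The grid-stride pattern can be modeled on the host in the same spirit. This is a sketch only, not device code and not cuda-kat's implementation; the function name grid_stride_owners is hypothetical, and the model assumes each thread starts at its global index (block index times block length, plus its index in the block) and advances by the total number of threads in the grid.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side model only: no CUDA or cuda-kat calls are made here.
// For every position in 0..length-1, returns the 1-based number of the block
// whose threads would act on it at grid stride.
std::vector<unsigned> grid_stride_owners(
    std::size_t length, unsigned num_blocks, unsigned block_length)
{
    assert(num_blocks > 0 && block_length > 0);
    std::vector<unsigned> owner(length, 0);
    const std::size_t grid_length = std::size_t(num_blocks) * block_length;
    for (unsigned i_b = 0; i_b < num_blocks; ++i_b) {
        for (unsigned i_t = 0; i_t < block_length; ++i_t) {
            // Start at the thread's global index, then jump a whole grid at a time
            for (std::size_t pos = std::size_t(block_length) * i_b + i_t;
                 pos < length; pos += grid_length) {
                owner[pos] = i_b + 1;
            }
        }
    }
    return owner;
}
```

With a length of 28 and 3 blocks of 4 threads each, the seven 4-element segments come out owned by blocks 1, 2, 3, 1, 2, 3, 1.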

Note
Assumes the number of grid threads is fixed (does that always hold, even with dynamic parallelism?)
Parameters
length    The length of the range (of integers) on which to act
f    The callable to call for each element of the sequence.

§ at_grid_stride() [2/2]

template<typename Function , typename Size = unsigned>
KAT_FD void kat::linear_grid::collaborative::grid::warp_per_input_element::at_grid_stride ( Size  length,
const Function &  f 
)

A variant of the one-position-per-thread applicator, collaborative::grid::at_grid_stride(): Here each warp works on one input position, advancing by 'grid stride' in the sense of total warps in the grid.

Note
It is assumed the grid only has fully-active warps; any possibly-inactive threads are not given consideration.
This version of at_grid_stride is specific to linear grids, even though the text of its code looks the same as that of kat::grid_info::collaborative::warp::at_grid_stride.
Parameters
length    The length of the range of positions on which to act
f    The callable for warps to apply to each position in the sequence
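The warp-per-position variant can be pictured with a similar host-side model. It is a sketch under stated assumptions: the function name positions_for_warp is hypothetical, warps are numbered globally by block and then by position within the block, and the warp size defaults to 32 threads. How the lanes of a warp share the work on a single position is left to f and is not modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side model only: no CUDA or cuda-kat calls are made here.
// Returns the positions in 0..length-1 which the warp with the given global
// index would act on, advancing by the total number of warps in the grid.
std::vector<std::size_t> positions_for_warp(
    std::size_t length, unsigned num_blocks, unsigned block_length,
    std::size_t global_warp_index, unsigned warp_size = 32)
{
    // Only fully-active warps are considered, as in the note above
    assert(warp_size > 0 && block_length % warp_size == 0);
    const std::size_t total_warps =
        std::size_t(num_blocks) * (block_length / warp_size);
    assert(global_warp_index < total_warps);
    std::vector<std::size_t> positions;
    for (std::size_t pos = global_warp_index; pos < length; pos += total_warps) {
        positions.push_back(pos);
    }
    return positions;
}
```

For a range of 10 positions and a grid of 2 blocks with 64 threads each (4 warps in total), warp 1 handles positions 1, 5 and 9, while warp 3 handles positions 3 and 7.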