CUDA device computation grid-level primitives, i.e. those involving interaction of threads from different blocks in the grid.

#include "warp.cuh"
#include <kat/on_device/common.cuh>
#include <kat/on_device/math.cuh>
#include <kat/on_device/grid_info.cuh>
#include <type_traits>

Functions
template<typename Function , typename Size = size_t>
KAT_FD void	kat::linear_grid::collaborative::grid::at_grid_stride (Size length, const Function &f)
	Have all kernel threads perform some action over the linear range of 0..length-1, at strides equal to the grid length.

template<typename Function , typename Size = unsigned>
KAT_FD void	kat::linear_grid::collaborative::grid::warp_per_input_element::at_grid_stride (Size length, const Function &f)
	A variant of the one-position-per-thread applicator, `collaborative::grid::at_grid_stride()`: here each warp works on one input position, advancing by "grid stride" in the sense of total warps in the grid.

template<typename Function , typename Size = size_t, bool AssumeLengthIsDivisibleByBlockSize = false, bool GridMayFullyCoverLength = true, typename SerializationFactor = unsigned>
KAT_FD void	kat::linear_grid::collaborative::grid::at_block_stride (Size length, const Function &f, SerializationFactor serialization_factor)
	Have all grid threads perform some action over the linear range of 0..length-1, with each thread acting on a fixed number of items (the `serialization_factor`) at a stride of the block length.

Detailed Description

CUDA device computation grid-level primitives, i.e. those involving interaction of threads from different blocks in the grid.

Function Documentation

§ at_block_stride()

template<typename Function , typename Size = size_t, bool AssumeLengthIsDivisibleByBlockSize = false, bool GridMayFullyCoverLength = true, typename SerializationFactor = unsigned>

KAT_FD void kat::linear_grid::collaborative::grid::at_block_stride	(	Size	length,
		const Function &	f,
		SerializationFactor	serialization_factor
	)

Have all grid threads perform some action over the linear range of 0..length-1, with each thread acting on a fixed number of items (the serialization_factor) at a stride of the block length; i.e. a thread with index i_t in the block with index i_b, where block lengths are n_b, will perform the action on elements

n_b * (i_b * serialization_factor) + i_t, n_b * (i_b * serialization_factor + 1) + i_t, n_b * (i_b * serialization_factor + 2) + i_t,

and so on, up to n_b * (i_b * serialization_factor + serialization_factor - 1) + i_t. For lengths which are not divisible by n_b * serialization_factor, threads in the last block will work on fewer items.

Thus, if in the following chart the rectangles represent consecutive segments of n_b integers, the numbers indicate which blocks work on which elements in "block stride":

| 1 | 1 | 2 | 2 | 3 | 3 | 4 |

(A block strides from one block's worth of indices to the next.) This is unlike at_grid_stride(), for which instead of 1, 1, 2, 2, 3, 3, 4 we would have 1, 2, 3, 1, 2, 3, 1 (if the grid has 3 blocks) or 1, 2, 3, 4, 1, 2, 3 (if the grid has 4 blocks).
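The block-stride pattern can be checked on the host with a small model. This is only a sketch and not the library's code: the function and parameter names are illustrative, indices are 0-based (the chart above numbers blocks from 1), and it assumes each thread advances by exactly one block length between its items, as the chart suggests.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side model (NOT library code): positions visited by one thread
// under the block-stride scheme. All names here are illustrative.
std::vector<std::size_t> block_stride_positions(
    std::size_t length,
    std::size_t block_length,          // n_b
    std::size_t block_index,           // i_b (0-based)
    std::size_t thread_index,          // i_t (0-based, within the block)
    std::size_t serialization_factor)
{
    assert(thread_index < block_length);
    std::vector<std::size_t> positions;
    // Skip the segments owned by all earlier blocks, then offset by the thread.
    std::size_t pos = block_length * block_index * serialization_factor + thread_index;
    for (std::size_t k = 0; k < serialization_factor && pos < length; ++k) {
        positions.push_back(pos);
        pos += block_length;           // advance by one block's worth of indices
    }
    return positions;
}

// Which block (0-based) handles the given segment of n_b consecutive indices?
std::size_t block_stride_owner(std::size_t segment, std::size_t serialization_factor)
{
    assert(serialization_factor > 0);
    return segment / serialization_factor;
}
```

With a serialization factor of 2, `block_stride_owner` maps segments 0 through 6 to blocks 0, 0, 1, 1, 2, 2, 3, which is the chart's 1, 1, 2, 2, 3, 3, 4 shifted to 0-based numbering.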

Note: Theoretically, the serialization_factor value could be computed by this function itself. This is avoided, assuming that's been taken care of before. Specifically, we assume that the serialization_factor is no higher than it absolutely must be.

Note: There's a block-level variant of this primitive, but there, each block applies f to the same range of elements, rather than covering part of a larger range.

Note: This implementation does not handle cases of overflow of the Size type, e.g. if your Size is uint32_t and length is close to 2^32 - 1, the function may fail.
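To illustrate the overflow concern, here is a minimal host-side sketch. It assumes, hypothetically, that a position of type Size is advanced by a stride in a loop; the helper names below are illustrative and are not part of the library. Unsigned arithmetic wraps around modulo 2^N, so a position just below the maximum value can wrap to a small number that again passes a `pos < length` test.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Illustrative helper (NOT library code): would pos + stride wrap around
// in the unsigned type Size?
template <typename Size>
bool advance_would_wrap(Size pos, Size stride)
{
    static_assert(std::is_unsigned<Size>::value, "Size must be an unsigned integer type");
    // After wrap-around, the truncated sum is smaller than the starting position.
    return static_cast<Size>(pos + stride) < pos;
}

// Advance a position, with a debug-time guard against wrap-around.
template <typename Size>
Size checked_advance(Size pos, Size stride)
{
    assert(!advance_would_wrap(pos, stride) && "position overflowed Size");
    return static_cast<Size>(pos + stride);
}
```

One way to sidestep the problem, under the same assumptions, is to instantiate with a wider Size (e.g. a 64-bit type) whenever length approaches the maximum of the narrower type.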

Note: There's a tricky tradeoff here between avoiding per-iteration checks for whether we're past the end, and avoiding too many initial checks. Two of the template parameters help us avoid this tradeoff in certain cases by not having to check explicitly for things.

Parameters

length	The length of the range (of integers) on which to act
serialization_factor	the number of elements each thread is to handle (serially)
f	The callable to execute for each element of the sequence.

§ at_grid_stride() [1/2]

template<typename Function , typename Size = size_t>

KAT_FD void kat::linear_grid::collaborative::grid::at_grid_stride	(	Size	length,
		const Function &	f
	)

Have all kernel threads perform some action over the linear range of 0..length-1, at strides equal to the grid length; i.e. a thread with index i_t in the block with index i_b, where block lengths are n_b and the grid has n_g blocks, will perform the action on elements i_b * n_b + i_t, i_b * n_b + i_t + n_g * n_b, i_b * n_b + i_t + 2 * n_g * n_b, and so on.

Thus, if in the following chart the rectangles represent consecutive segments of n_b integers, the numbers indicate which blocks work on which elements in "grid stride":

| 1 | 2 | 3 | 1 | 2 | 3 | 1 |

(The grid is 3 blocks' worth, so block 1 strides 3 blocks from one sequence of indices it processes to the next.) This is unlike at_block_stride(), for which instead of 1, 2, 3, 1, 2, 3, 1 we would have 1, 1, 1, 2, 2, 2, 3 (or 1, 1, 2, 2, 3, 3, 4 if the grid has 4 blocks).
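The grid-stride pattern can likewise be modelled on the host. The sketch below is not the library's code; its names are illustrative, indices are 0-based, and it assumes the stride is the total number of threads in the grid (block length times number of blocks), as the chart indicates.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side model (NOT library code): positions visited by one thread
// under the grid-stride scheme. All names here are illustrative.
std::vector<std::size_t> grid_stride_positions(
    std::size_t length,
    std::size_t block_length,   // n_b
    std::size_t num_blocks,     // number of blocks in the grid
    std::size_t block_index,    // i_b (0-based)
    std::size_t thread_index)   // i_t (0-based, within the block)
{
    assert(thread_index < block_length && block_index < num_blocks);
    const std::size_t grid_length = block_length * num_blocks; // threads in the grid
    std::vector<std::size_t> positions;
    // Start at the thread's global index, then jump a whole grid at a time.
    for (std::size_t pos = block_length * block_index + thread_index;
         pos < length;
         pos += grid_length) {
        positions.push_back(pos);
    }
    return positions;
}

// Which block (0-based) handles the given segment of n_b consecutive indices?
std::size_t grid_stride_owner(std::size_t segment, std::size_t num_blocks)
{
    assert(num_blocks > 0);
    return segment % num_blocks;
}
```

For a grid of 3 blocks, `grid_stride_owner` maps segments 0 through 6 to blocks 0, 1, 2, 0, 1, 2, 0, matching the chart's 1, 2, 3, 1, 2, 3, 1 in 0-based numbering.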

Note: assumes the number of grid threads is fixed (does that always hold? even with dynamic parallelism?)

Parameters

length	The length of the range (of integers) on which to act
f	The callable to call for each element of the sequence.

§ at_grid_stride() [2/2]

template<typename Function , typename Size = unsigned>

KAT_FD void kat::linear_grid::collaborative::grid::warp_per_input_element::at_grid_stride	(	Size	length,
		const Function &	f
	)

A variant of the one-position-per-thread applicator, collaborative::grid::at_grid_stride(): Here each warp works on one input position, advancing by 'grid stride' in the sense of total warps in the grid.

Note: It is assumed the grid only has fully-active warps; any possibly-inactive threads are not given consideration.

Note: This version of at_grid_stride is specific to linear grids, even though the text of its code looks the same as that of kat::grid_info::collaborative::warp::at_grid_stride.

Parameters

length	The length of the range of positions on which to act
f	The callable for warps to apply to each position in the sequence
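A host-side sketch of the warp-per-position pattern follows. It is not the library's code: the names are illustrative, a warp size of 32 is assumed, and, in line with the note above, the block length is assumed to be a multiple of the warp size so that all warps are fully active.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side model (NOT library code). All names here are illustrative.

// Index of a thread's warp within the whole (linear) grid.
std::size_t global_warp_index(
    std::size_t block_index,
    std::size_t thread_index,
    std::size_t block_length,
    std::size_t warp_size = 32)
{
    assert(block_length % warp_size == 0); // only fully-active warps
    assert(thread_index < block_length);
    return (block_index * block_length + thread_index) / warp_size;
}

// Positions handled by one warp: its own index, then steps of the total
// number of warps in the grid.
std::vector<std::size_t> warp_grid_stride_positions(
    std::size_t length,
    std::size_t warps_in_grid,
    std::size_t warp_index)
{
    assert(warp_index < warps_in_grid);
    std::vector<std::size_t> positions;
    for (std::size_t pos = warp_index; pos < length; pos += warps_in_grid) {
        positions.push_back(pos);
    }
    return positions;
}
```

In this model every lane of a warp sees the same position; how the lanes divide the work for that position is left to the callable f.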
