cuda-kat
CUDA kernel author's tools
Functions
template<typename T >
KAT_FD T * contiguous (unsigned num_elements_per_warp, offset_t base_offset=0)
Accesses the calling thread's warp-specific dynamic shared memory, assuming the warps voluntarily divvy up the shared memory beyond some point amongst themselves into contiguous areas.
template<typename T >
KAT_FD T * strided (offset_t base_offset=0)
Accesses the calling thread's warp-specific dynamic shared memory, assuming the warps voluntarily divvy up the shared memory beyond some point amongst themselves, using striding.
KAT_FD T* kat::shared_memory::dynamic::warp_specific::contiguous ( unsigned num_elements_per_warp, offset_t base_offset = 0 )
Accesses the calling thread's warp-specific dynamic shared memory, assuming the warps voluntarily divvy up the shared memory beyond some point amongst themselves into contiguous areas.
The partitioning pattern is for each warp to get a contiguous sequence of num_elements_per_warp elements in memory.
T | The element type assumed for all shared memory (or at least for alignment and for the warp-specific shared memory)
base_offset | How far into the block's overall shared memory to start partitioning the memory into warp-specific sequences
num_elements_per_warp | Size, in elements, of the area agreed to be specific to each warp
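The contiguous partitioning pattern documented on this page can be illustrated with a small host-side model. This is only a sketch, not the cuda-kat implementation: the helper name `contiguous_element_index` is hypothetical, and it assumes that `base_offset` is measured in elements of `T` and that warp `w`'s area begins at `base_offset + w * num_elements_per_warp`.

```cpp
#include <cstddef>

// Hypothetical host-side model (not part of cuda-kat).
// Returns the index, within the block's shared memory viewed as an array
// of T, of the k-th element of a warp's area, under an ASSUMED contiguous
// layout: each warp owns num_elements_per_warp consecutive elements.
constexpr std::size_t contiguous_element_index(
    std::size_t base_offset,            // assumed to be in elements of T
    std::size_t num_elements_per_warp,  // size of each warp's area
    std::size_t warp_index,             // index of the warp within its block
    std::size_t k)                      // position within the warp's area
{
    return base_offset + warp_index * num_elements_per_warp + k;
}
```

Under this model, the last element of one warp's area is immediately followed by the first element of the next warp's area, which is what makes the `num_elements_per_warp` argument necessary for this variant.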
KAT_FD T* kat::shared_memory::dynamic::warp_specific::strided ( offset_t base_offset = 0 )
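Accesses the calling thread's warp-specific dynamic shared memory, assuming the warps voluntarily divvy up the shared memory beyond some point amongst themselves, using striding.
The partitioning pattern is for each warp to get elements at a fixed stride rather than a contiguous set of elements. This pattern ensures that different warps are never in a bank conflict when accessing their "private" shared memory, provided the number of warps divides 32 or is a multiple of 32. The downside is that different lanes accessing different elements in their warp's shared memory will likely be in bank conflict (and will certainly be in conflict if there are 32 warps).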
T | The element type assumed for all shared memory (or at least for alignment and for the warp-specific shared memory)
base_offset | How far into the block's overall shared memory to start partitioning the memory into warp-specific sequences
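The fixed-stride partitioning pattern documented on this page, and its bank-conflict trade-off, can be illustrated with a small host-side model. This is only a sketch, not the cuda-kat implementation: the helper names `strided_element_index` and `bank_of` are hypothetical, and the model assumes that the stride equals the number of warps in the block, that `base_offset` is measured in elements of `T`, that `T` is 4 bytes wide, and that shared memory has 32 banks of 4-byte words.

```cpp
#include <cstddef>

// Hypothetical host-side model (not part of cuda-kat).
// Returns the index of the k-th element belonging to a warp under an
// ASSUMED strided layout: warp w owns elements w, w + W, w + 2W, ...
// past base_offset, where W is the number of warps in the block.
constexpr std::size_t strided_element_index(
    std::size_t base_offset,  // assumed to be in elements of T
    std::size_t num_warps,    // W: number of warps in the block (the stride)
    std::size_t warp_index,   // index of the warp within its block
    std::size_t k)            // position within the warp's sequence
{
    return base_offset + warp_index + k * num_warps;
}

// Bank holding a 4-byte element, ASSUMING 32 banks of 4-byte words.
constexpr std::size_t bank_of(std::size_t element_index)
{
    return element_index % 32;
}
```

With 32 warps, every element of a given warp's sequence maps to the same bank in this model (for example, elements 5 and 37 for warp 5 both fall in bank 5), so lanes of that warp reading different elements would conflict. With 4 warps, the k-th elements of warps 0 through 3 fall in four different banks.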