cuda-kat: CUDA kernel author's tools
 
Functions | |
| template<typename T > | |
| KAT_FD T * | contiguous (unsigned num_elements_per_warp, offset_t base_offset=0) | 
| Accesses the calling thread's warp-specific dynamic shared memory, assuming the warps voluntarily divvy up the shared memory beyond some point amongst themselves into contiguous areas. | |
| template<typename T > | |
| KAT_FD T * | strided (offset_t base_offset=0) | 
| Accesses the calling thread's warp-specific dynamic shared memory, assuming the warps voluntarily divvy up the shared memory beyond some point amongst themselves, using striding. | |
| template<typename T> KAT_FD T* kat::shared_memory::dynamic::warp_specific::contiguous(unsigned num_elements_per_warp, offset_t base_offset = 0) |
Accesses the calling thread's warp-specific dynamic shared memory, assuming the warps voluntarily divvy up the shared memory beyond some point amongst themselves into contiguous areas.
The partitioning pattern is for each warp to get a contiguous sequence of num_elements_per_warp elements in memory.
| T | the element type assumed for all shared memory (or at least for alignment and for the warp-specific shared memory) | 
| base_offset | How far into the block's overall shared memory to start partitioning the memory into warp-specific sequences | 
| num_elements_per_warp | Size in elements of the area agreed to be specific to each warp | 
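The index arithmetic implied by a contiguous partitioning can be sketched on the host side as follows. This is an illustrative model only, not cuda-kat code: the helper names `contiguous_area_start` and `contiguous_element` are hypothetical, and the sketch assumes that `base_offset` is counted in elements of `T` and that warps are numbered from 0 within the block.

```cpp
#include <cstddef>

// Hypothetical host-side model (not part of cuda-kat): the element offset,
// relative to the start of the block's dynamic shared memory, at which the
// contiguous area of a given warp begins.
// Assumption: base_offset is measured in elements, not bytes.
constexpr std::size_t contiguous_area_start(
    std::size_t warp_index_in_block,
    std::size_t num_elements_per_warp,
    std::size_t base_offset = 0)
{
    return base_offset + warp_index_in_block * num_elements_per_warp;
}

// Element offset of the i-th element inside that warp's area
// (valid for i < num_elements_per_warp).
constexpr std::size_t contiguous_element(
    std::size_t warp_index_in_block,
    std::size_t num_elements_per_warp,
    std::size_t i,
    std::size_t base_offset = 0)
{
    return contiguous_area_start(
        warp_index_in_block, num_elements_per_warp, base_offset) + i;
}

// With 64 elements per warp and partitioning starting 16 elements in:
static_assert(contiguous_area_start(0, 64, 16) == 16,  "warp 0 starts at the base offset");
static_assert(contiguous_area_start(3, 64, 16) == 208, "warp 3 starts 3 * 64 elements later");
static_assert(contiguous_element(3, 64, 5, 16) == 213, "consecutive elements are adjacent");
```

Under these assumptions, the areas of neighbouring warps sit back to back, so a block with W warps uses W * num_elements_per_warp elements beyond `base_offset`.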
| template<typename T> KAT_FD T* kat::shared_memory::dynamic::warp_specific::strided(offset_t base_offset = 0) |
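Accesses the calling thread's warp-specific dynamic shared memory, assuming the warps voluntarily divvy up the shared memory beyond some point amongst themselves, using striding.
The partitioning pattern is for each warp to get elements at a fixed stride rather than a contiguous set of elements. This pattern ensures that different warps are never in a bank conflict when accessing their "private" shared memory, provided the number of warps divides 32 or is a multiple of 32. The downside is that different lanes accessing different elements in a warp's shared memory will likely be in bank conflict (and will certainly be in conflict if there are 32 warps).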
| T | the element type assumed for all shared memory (or at least for alignment and for the warp-specific shared memory) | 
| base_offset | How far into the block's overall shared memory to start partitioning the memory into warp-specific sequences | 
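Going by the function's name, a strided partitioning interleaves the warps' elements, with a stride equal to the number of warps in the block. The host-side sketch below models that arithmetic and its bank behaviour. It is not cuda-kat code: `strided_element` and `bank_of` are hypothetical helpers, and the sketch assumes `base_offset` is counted in elements, elements are 4 bytes wide, and shared memory has 32 banks with one 4-byte word per bank.

```cpp
#include <cstddef>

// Hypothetical host-side model (not part of cuda-kat): element offset of the
// i-th element belonging to a given warp when warps are interleaved with a
// stride of num_warps_in_block.
// Assumption: base_offset is measured in elements, not bytes.
constexpr std::size_t strided_element(
    std::size_t warp_index_in_block,
    std::size_t num_warps_in_block,
    std::size_t i,
    std::size_t base_offset = 0)
{
    return base_offset + warp_index_in_block + i * num_warps_in_block;
}

// Assumption: 32 banks, each 4-byte element occupying exactly one bank word.
constexpr std::size_t bank_of(std::size_t element_offset)
{
    return element_offset % 32;
}

// 8 warps: warp 3 owns elements 3, 11, 19, ...
static_assert(strided_element(3, 8, 0) == 3,  "first element of warp 3");
static_assert(strided_element(3, 8, 2) == 19, "third element of warp 3");

// Different warps touching their own i-th element land in different banks.
static_assert(bank_of(strided_element(0, 8, 5)) != bank_of(strided_element(1, 8, 5)),
              "neighbouring warps use distinct banks");

// 32 warps: every element of a given warp maps to the same bank, so lanes of
// that warp accessing different elements conflict.
static_assert(bank_of(strided_element(3, 32, 0)) == bank_of(strided_element(3, 32, 1)),
              "all of warp 3's elements share bank 3");
```

The last assertion illustrates the trade-off: with 32 warps, each warp's "private" elements all fall in a single bank, which is the worst case for lanes within that warp.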