cuda-kat
CUDA kernel author's tools
Todo List
File block.cuh
Some of these assume linear grids, others do not - sort them out
Member kat::builtins::bit_field::extract_bits (I bit_field, unsigned int start_pos, unsigned int num_bits)=delete
CUB 1.5.2's BFE wrapper seems kind of fishy. Why does Duane Merrill not use PTX for extraction from 64-bit fields? For now, only adopting his implementation for the 32-bit case.
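For reference, the result a bit-field extraction is expected to produce, for 32-bit and 64-bit fields alike, can be written in portable C++. This is a host-side sketch only, not cuda-kat's or CUB's implementation and not a PTX wrapper; the name `extract_bits_portable` is hypothetical, and the sketch assumes an unsigned field type with `start_pos` inside the field.

```cpp
#include <cassert>
#include <climits>
#include <cstdint>
#include <type_traits>

// Hypothetical helper (not part of cuda-kat): returns `num_bits` bits of
// `bit_field`, starting at bit `start_pos` (0 = least significant bit),
// shifted down so that the extracted bits begin at bit 0 of the result.
template <typename I>
inline I extract_bits_portable(I bit_field, unsigned start_pos, unsigned num_bits)
{
    static_assert(std::is_unsigned<I>::value, "this sketch handles unsigned fields only");
    const unsigned width = sizeof(I) * CHAR_BIT;
    assert(start_pos < width && "start position must lie inside the field");
    if (num_bits == 0) { return I{0}; }
    const I shifted = static_cast<I>(bit_field >> start_pos);
    // Shifting by the full width is undefined, so handle that case without a mask.
    if (num_bits >= width) { return shifted; }
    const I mask = static_cast<I>((I{1} << num_bits) - 1);
    return static_cast<I>(shifted & mask);
}
```

A device-side 64-bit version could be checked against this behaviour before it replaces the deleted overload.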
Class kat::collaborative::detail::elements_per_lane_in_full_warp_write< T >
Can't we assume that T is a POD type, and just have lanes not write complete T's?
Member kat::collaborative::warp::active_lanes_atomically_increment (T *counter)
extend this to other atomic operations
Member kat::collaborative::warp::elementwise_accumulate_n (AccumulatingOperation op, D *__restrict__ destination, RandomAccessIterator __restrict__ source, Size length)

consider taking a GSL-span-like parameter instead of a ptr+length

Some inclusions in the block-primitives might only be relevant to the functions here; double-check.

consider using elementwise_apply for this.
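The span-like interface mentioned above might look roughly as follows. This is a sequential, host-side sketch under stated assumptions: `span_sketch` and `elementwise_accumulate_sketch` are hypothetical names, the operation is assumed to be callable as `op(destination_element, source_element)` and to update the destination in place, and the loop stands in for the work that the lanes of a warp would divide among themselves.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical minimal stand-in for a GSL-style span: a pointer plus a length.
template <typename T>
struct span_sketch {
    T*          data;
    std::size_t size;

    T& operator[](std::size_t i) const
    {
        assert(i < size);
        return data[i];
    }
};

// Sequential stand-in for the collaborative function: every destination
// element is updated with the source element at the same index.
template <typename AccumulatingOperation, typename D, typename S>
void elementwise_accumulate_sketch(
    AccumulatingOperation op,
    span_sketch<D>        destination,
    span_sketch<const S>  source)
{
    // The length now travels with the data, so it can be validated here.
    assert(source.size >= destination.size);
    for (std::size_t i = 0; i < destination.size; ++i) {
        op(destination[i], source[i]);
    }
}
```

One trade-off of this design is that the `__restrict__` qualifiers of the current signature would have to move onto the span's pointer member or be dropped.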

Member kat::collaborative::warp::reduce (T value, AccumulationOp op)
offer both an inclusive and an exclusive version
Class kat::dimensions_t
consider templating this on the number of dimensions.
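A dimensions type templated on the number of dimensions could be sketched as below. The name `dimensions_sketch` and its members are hypothetical and do not reflect the existing `kat::dimensions_t` interface; the sketch is host-side C++ and leaves out the `__host__ __device__` annotations a real version would need.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical sketch: the extent along each axis is stored in an array whose
// length is the template parameter, instead of in fixed x, y and z members.
template <std::size_t NumDimensions>
struct dimensions_sketch {
    std::array<unsigned, NumDimensions> extents;

    unsigned& operator[](std::size_t axis)
    {
        assert(axis < NumDimensions);
        return extents[axis];
    }

    unsigned operator[](std::size_t axis) const
    {
        assert(axis < NumDimensions);
        return extents[axis];
    }

    // Total number of elements: the product of the extents along all axes.
    std::size_t volume() const
    {
        std::size_t result = 1;
        for (unsigned extent : extents) { result *= extent; }
        return result;
    }

    bool empty() const { return volume() == 0; }
};
```

For example, `dimensions_sketch<3> d{{4u, 2u, 8u}};` describes a 4 x 2 x 8 box, while `dimensions_sketch<1>` would cover the linear-grid case.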
Member kat::lane_mask_t
Consider using a 32-bit bit field
Member kat::linear_grid::collaborative::block::elementwise_accumulate_n (AccumulatingOperation op, D *__restrict__ destination, RandomAccessIterator __restrict__ source, Size length)

consider taking a GSL-span-like parameter instead of a ptr+length

Some inclusions in the block-primitives might only be relevant to the functions here; double-check.

consider using elementwise_apply for this.

Member kat::linear_grid::collaborative::block::scan_and_reduce (T *__restrict__ scratch, T value, AccumulationOp op, T &scan_result, T &reduction_result)

consider returning a pair rather than using non-const references

lots of code duplication with just-scan

add a bool template param allowing the code to assume the block is full (this saves a few ops)
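The first item above, returning a pair instead of writing through non-const references, could take roughly this shape. It is a sequential, host-side sketch under stated assumptions: `scan_and_reduce_sketch` is a hypothetical name, the scan is assumed to be inclusive, the operation is assumed to be callable as `op(a, b)` and to return the combined value, and the `index` argument stands in for the calling thread's position within the block.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Hypothetical sequential stand-in: returns {scan result at `index`,
// reduction over all `count` values} as a pair, rather than taking
// `T& scan_result` and `T& reduction_result` output parameters.
template <typename T, typename AccumulationOp>
std::pair<T, T> scan_and_reduce_sketch(
    const T*       values,
    std::size_t    count,
    std::size_t    index,
    AccumulationOp op)
{
    assert(count > 0 && index < count);
    T running     = values[0];
    T scan_result = running; // already correct when index == 0
    for (std::size_t i = 1; i < count; ++i) {
        running = op(running, values[i]);
        if (i == index) { scan_result = running; }
    }
    return { scan_result, running };
}
```

A caller can then unpack both results at once, for example with `std::tie(scan_result, reduction_result) = scan_and_reduce_sketch(...)`.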

Member kat::linear_grid::collaborative::warp::multisearch (const T &lane_needle, const T &lane_hay_straw)

Does it matter if the needles, as opposed to the hay straws, are sorted? I wonder.

consider specializing for non-full warps

Specialize for smaller and larger data types: For larger ones, compare 4-byte parts of the datum separately (assuming

Member kat::reinterpret (Original &x)
Would it be better to return a reference?
Member kat::swap (T &a, T &b) noexcept(std::is_nothrow_move_constructible< T >::value &&std::is_nothrow_move_assignable< T >::value)
How does EASTL swap work? Should I incorporate its specializations?
File warp.cuh
Some inclusions in the warp-primitives might only be relevant to the functions here; double-check.
File warp.cuh
  1. Some of these assume linear grids, others do not - sort them out.
  2. Use a lane index type