Some of these assume linear grids, others do not - sort them out

Member kat::builtins::bit_field::extract_bits (I bit_field, unsigned int start_pos, unsigned int num_bits)=delete

CUB 1.5.2's BFE wrapper seems kind of fishy. Why does Duane Merrill not use PTX for extraction from 64-bit fields? For now, only his implementation for the 32-bit case is adopted.
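One possible fallback for the 64-bit case is a plain shift-and-mask. The following is a minimal host-side sketch of that idea, not CUB's or kat's implementation; the name `extract_bits_64` is hypothetical, and a device version would presumably add the appropriate `__device__` qualifier.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not part of cuda-kat or CUB: extracts `num_bits` bits
// starting at bit `start_pos` (0 = least significant) from a 64-bit field.
inline std::uint64_t extract_bits_64(std::uint64_t bit_field,
                                     unsigned start_pos,
                                     unsigned num_bits)
{
    // Assumed precondition: the requested range lies within the 64-bit field.
    assert(start_pos < 64 && num_bits <= 64 - start_pos);
    if (num_bits == 0) { return 0; }
    // Build the mask without ever shifting by 64, which would be undefined.
    const std::uint64_t mask = (num_bits == 64)
        ? ~std::uint64_t{0}
        : ((std::uint64_t{1} << num_bits) - 1);
    return (bit_field >> start_pos) & mask;
}
```

Whether this compiles to something as efficient as a PTX `bfe` instruction is exactly the open question of the todo item above.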

Class kat::collaborative::detail::elements_per_lane_in_full_warp_write< T >

Can't we assume that T is a POD type, and just have lanes not write complete T's?

Member kat::collaborative::warp::active_lanes_atomically_increment (T *counter)

extend this to other atomic operations

Member kat::collaborative::warp::elementwise_accumulate_n (AccumulatingOperation op, D *__restrict__ destination, RandomAccessIterator __restrict__ source, Size length)

consider taking a GSL-span-like parameter instead of a ptr+length

Some inclusions in the block-primitives might only be relevant to the functions here; double-check.

consider using elementwise_apply for this.
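To illustrate the span-like parameter idea, here is a minimal sequential sketch. `simple_span` is a hypothetical stand-in (neither a kat nor a GSL type), the function is a host-side simplification rather than the collaborative primitive, and the assumption that `op` takes the destination element by reference is mine.

```cpp
#include <cstddef>

// Hypothetical minimal span-like view; not a kat or GSL type.
template <typename T>
struct simple_span {
    T*          data;
    std::size_t size;
};

// Sequential stand-in for the collaborative primitive: applies
// op(destination[i], source[i]) for every element covered by the span.
// Assumes `source` has at least destination.size readable elements.
template <typename D, typename S, typename AccumulatingOperation>
void elementwise_accumulate(AccumulatingOperation op,
                            simple_span<D> destination,
                            const S* source)
{
    for (std::size_t i = 0; i < destination.size; ++i) {
        op(destination.data[i], source[i]);
    }
}
```

Bundling the pointer and the length into one parameter keeps the two from going out of sync at call sites, which seems to be the motivation for the suggestion.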

Member kat::collaborative::warp::reduce (T value, AccumulationOp op)

offer both an inclusive and an exclusive version

Class kat::dimensions_t

consider templating this on the number of dimensions.

Member kat::lane_mask_t

Consider using a 32-bit bit field

Member kat::linear_grid::collaborative::block::elementwise_accumulate_n (AccumulatingOperation op, D *__restrict__ destination, RandomAccessIterator __restrict__ source, Size length)

consider taking a GSL-span-like parameter instead of a ptr+length

Some inclusions in the block-primitives might only be relevant to the functions here; double-check.

consider using elementwise_apply for this.

Member kat::linear_grid::collaborative::block::scan_and_reduce (T *__restrict__ scratch, T value, AccumulationOp op, T &scan_result, T &reduction_result)

consider returning a pair rather than using non-const references

lots of code duplication with just-scan

add a bool template param allowing the code to assume the block is full (this saves a few ops)
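The first item, returning the results by value, might look like the sketch below. It is a sequential host-side stand-in, not the block-collaborative implementation: `scan_and_reduction` is a hypothetical result type, `index` plays the role of the calling thread's position, and an inclusive scan is assumed.

```cpp
#include <cstddef>

// Hypothetical result type; std::pair<T, T> would serve equally well.
template <typename T>
struct scan_and_reduction {
    T scan_result;
    T reduction_result;
};

// Sequential stand-in: returns the inclusive scan value at `index` together
// with the reduction over all `count` values.
// Assumed preconditions: count > 0 and index < count.
template <typename T, typename AccumulationOp>
scan_and_reduction<T> scan_and_reduce(const T* values,
                                      std::size_t count,
                                      std::size_t index,
                                      AccumulationOp op)
{
    T running = values[0];
    T scan    = values[0];
    for (std::size_t i = 1; i < count; ++i) {
        running = op(running, values[i]);
        if (i == index) { scan = running; }
    }
    return { scan, running };
}
```

A named struct keeps the two results distinguishable at the call site, whereas a pair would rely on callers remembering which of `first` and `second` is the scan.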

Member kat::linear_grid::collaborative::warp::multisearch (const T &lane_needle, const T &lane_hay_straw)

Does it matter if the needles, as opposed to the hay straws, are sorted? I wonder.

consider specializing for non-full warps

Specialize for smaller and larger data types: For larger ones, compare 4-byte parts of the datum separately (assuming

Member kat::reinterpret (Original &x)

Would it be better to return a reference?

Member kat::swap (T &a, T &b) noexcept(std::is_nothrow_move_constructible< T >::value &&std::is_nothrow_move_assignable< T >::value)

How does EASTL swap work? Should I incorporate its specializations?

File warp.cuh

Some inclusions in the warp-primitives might only be relevant to the functions here; double-check.

File warp.cuh

Some of these assume linear grids, others do not - sort them out.
Use a lane index type