cuda-kat
CUDA kernel author's tools
Templated, uniformly-named C++ functions wrapping single PTX instructions (in a dedicated builtins
namespace).
Namespaces | |
kat::builtins | |
Uniformly-named, templated-when-relevant wrappers of single PTX instructions. | |
kat::builtins::special_registers | |
Special register getter wrappers. | |
Enumerations | |
enum | kat::builtins::funnel_shift_amount_resolution_mode_t { kat::builtins::funnel_shift_amount_resolution_mode_t::take_lower_bits_of_amount, kat::builtins::funnel_shift_amount_resolution_mode_t::cap_at_full_word_size } |
Use this to select which variant of the funnel shift intrinsic to use. | |
Functions | |
template<typename I > | |
KAT_FD I | kat::builtins::multiplication_high_bits (I x, I y) |
When multiplying two n-bit numbers, the result may take up to 2n bits. | |
template<typename F > | |
KAT_FD F | kat::builtins::divide (F dividend, F divisor) |
Division that becomes faster and less precise than the regular "/" when --use_fast_math is specified; otherwise it is the same as the regular "/". | |
template<typename F > | |
KAT_FD F | kat::builtins::clamp_to_unit_segment (F x) |
Clamps the input value to the unit segment [0.0,+1.0]. | |
template<typename T > | |
KAT_FD T | kat::builtins::absolute_value (T x) |
template<typename T > | |
KAT_FD T | kat::builtins::minimum (T x, T y)=delete |
template<typename T > | |
KAT_FD T | kat::builtins::maximum (T x, T y)=delete |
template<typename I > | |
KAT_FD std::make_unsigned< I >::type | kat::builtins::sum_with_absolute_difference (I x, I y, typename std::make_unsigned< I >::type addend) |
Computes addend + |x - y|. | |
template<typename I > | |
KAT_FD int | kat::builtins::population_count (I x) |
template<typename I > | |
KAT_FD I | kat::builtins::bit_reverse (I x)=delete |
template<typename I > | |
KAT_FD unsigned | kat::builtins::find_leading_non_sign_bit (I x)=delete |
Find the most-significant, i.e. More... | |
template<typename I > | |
KAT_FD int | kat::builtins::count_leading_zeros (I x)=delete |
Returns the number of consecutive 0 bits, beginning from the most-significant bit ("leading" zeros). | |
template<typename I > | |
KAT_FD I | kat::builtins::bit_field::extract_bits (I bit_field, unsigned int start_pos, unsigned int num_bits)=delete |
Extracts the bits with 0-based indices start_pos ... start_pos + num_bits - 1. | |
template<typename I > | |
KAT_FD I | kat::builtins::bit_field::replace_bits (I original_bit_field, I bits_to_insert, unsigned int start_pos, unsigned int num_bits)=delete |
KAT_FD unsigned | kat::builtins::permute_bytes (unsigned first, unsigned second, unsigned byte_selectors) |
See the relevant section of the CUDA PTX reference for an explanation of exactly what this does. | |
template<funnel_shift_amount_resolution_mode_t AmountResolutionMode = funnel_shift_amount_resolution_mode_t::cap_at_full_word_size> | |
KAT_FD uint32_t | kat::builtins::funnel_shift_right (uint32_t low_word, uint32_t high_word, uint32_t shift_amount) |
Performs a right-shift on the two arguments combined into a single, double-length value. | |
template<funnel_shift_amount_resolution_mode_t AmountResolutionMode = funnel_shift_amount_resolution_mode_t::cap_at_full_word_size> | |
KAT_FD uint32_t | kat::builtins::funnel_shift_left (uint32_t low_word, uint32_t high_word, uint32_t shift_amount) |
Performs a left-shift on the two arguments combined into a single, double-length value. | |
template<typename I > | |
I KAT_FD | kat::builtins::average (I x, I y)=delete |
Computes the average of two integer values, rounding down, without needing special accounting for overflow. | |
template<typename I > | |
I KAT_FD | kat::builtins::average_rounded_up (I x, I y)=delete |
Computes the average of two values, rounding up, without needing special accounting for overflow. | |
KAT_FD unsigned | kat::builtins::special_registers::lane_index () |
KAT_FD unsigned | kat::builtins::special_registers::symmetric_multiprocessor_index () |
KAT_FD unsigned long long | kat::builtins::special_registers::grid_index () |
KAT_FD unsigned int | kat::builtins::special_registers::dynamic_shared_memory_size () |
KAT_FD unsigned int | kat::builtins::special_registers::total_shared_memory_size () |
KAT_FD lane_mask_t | kat::builtins::warp::ballot (int condition) |
KAT_FD int | kat::builtins::warp::all_lanes_satisfy (int condition) |
Checks whether a condition holds for an entire warp of threads. | |
KAT_FD int | kat::builtins::warp::any_lanes_satisfy (int condition) |
KAT_FD unsigned int | kat::builtins::warp::mask_of_lanes::preceding () |
KAT_FD unsigned int | kat::builtins::warp::mask_of_lanes::preceding_and_self () |
KAT_FD unsigned int | kat::builtins::warp::mask_of_lanes::self () |
KAT_FD unsigned int | kat::builtins::warp::mask_of_lanes::succeeding_and_self () |
KAT_FD unsigned int | kat::builtins::warp::mask_of_lanes::succeeding () |
template<typename T > | |
KAT_FD T | kat::builtins::warp::shuffle::arbitrary (T x, int source_lane, int width=warp_size) |
template<typename T > | |
KAT_FD T | kat::builtins::warp::shuffle::down (T x, unsigned delta, int width=warp_size) |
template<typename T > | |
KAT_FD T | kat::builtins::warp::shuffle::up (T x, unsigned delta, int width=warp_size) |
template<typename T > | |
KAT_FD T | kat::builtins::warp::shuffle::xor_ (T x, int lane_id_xoring_mask, int width=warp_size) |
CUDA provides C functions corresponding to many PTX instructions which are not otherwise easy, obvious or possible to generate with plain C or C++ code. However, it does not provide such functions for all PTX instructions, nor does it provide them in a type-generic way for use in templated C++ code.
ptx/ directory, which are not templated.
KAT_FD int kat::builtins::warp::all_lanes_satisfy (int condition)
Checks whether a condition holds for an entire warp of threads.
condition | A boolean value (passed as an integer, since that is what NVIDIA GPUs actually check with the hardware instruction) |
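To illustrate the relationship between the warp-wide vote functions, here is a minimal host-side model in plain C++. It is a sketch only and not the library's implementation: the names `ballot_model`, `all_lanes_satisfy_model` and `any_lanes_satisfy_model` are hypothetical, a warp is assumed to have 32 lanes, and `lane_mask_t` is assumed to be a 32-bit unsigned type. On a GPU each lane supplies its own `condition`; here the per-lane conditions are passed in as an array.

```cpp
#include <array>
#include <cstdint>

// Assumptions for this sketch: 32 lanes per warp, one mask bit per lane.
constexpr int warp_size = 32;
using lane_mask_t = std::uint32_t;

// Model of a ballot: bit i of the result is set iff lane i's condition is non-zero.
inline lane_mask_t ballot_model(const std::array<int, warp_size>& conditions)
{
    lane_mask_t mask = 0;
    for (int lane = 0; lane < warp_size; ++lane) {
        if (conditions[lane] != 0) {
            mask |= (lane_mask_t{1} << lane);
        }
    }
    return mask;
}

// Non-zero iff the condition holds for every lane of the (modeled) warp.
inline int all_lanes_satisfy_model(const std::array<int, warp_size>& conditions)
{
    return ballot_model(conditions) == 0xFFFFFFFFu ? 1 : 0;
}

// Non-zero iff the condition holds for at least one lane of the (modeled) warp.
inline int any_lanes_satisfy_model(const std::array<int, warp_size>& conditions)
{
    return ballot_model(conditions) != 0u ? 1 : 0;
}
```

In this model, a single lane with a zero condition is enough to make `all_lanes_satisfy_model` return 0, while `any_lanes_satisfy_model` still returns 1 as long as some other lane's condition is non-zero.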
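template<typename I > KAT_FD I kat::builtins::bit_field::extract_bits (I bit_field, unsigned int start_pos, unsigned int num_bits)=delete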
Extracts the bits with 0-based indices start_pos ... start_pos + num_bits - 1, counting from least to most significant, from a bit field. Has sign-extension semantics for signed inputs, which are a bit tricky; see the PTX ISA guide:
http://docs.nvidia.com/cuda/parallel-thread-execution/index.html
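The sign-extension behaviour is easier to see in a host-side reference model. The sketch below is plain C++ for 32-bit values, written under the assumption that the wrapper follows the PTX `bfe` instruction as described in the PTX ISA guide: only the low 8 bits of the position and length are used, bits outside the requested range read as 0 for unsigned inputs, and for signed inputs they are filled with the topmost extracted bit. The function names are hypothetical and are not part of cuda-kat.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical reference model for unsigned 32-bit extraction.
inline std::uint32_t extract_bits_unsigned_ref(
    std::uint32_t field, unsigned start_pos, unsigned num_bits)
{
    const unsigned pos = start_pos & 0xffu;  // assumed: only the low 8 bits are used
    const unsigned len = num_bits & 0xffu;
    std::uint32_t result = 0;
    for (unsigned i = 0; i < 32; ++i) {
        // Bits past the requested length, or past the top of the word, read as 0.
        const bool in_range = (i < len) && (pos + i <= 31);
        const std::uint32_t bit = in_range ? ((field >> (pos + i)) & 1u) : 0u;
        result |= bit << i;
    }
    return result;
}

// Hypothetical reference model for signed 32-bit extraction (sign-extending).
inline std::int32_t extract_bits_signed_ref(
    std::int32_t field, unsigned start_pos, unsigned num_bits)
{
    const std::uint32_t a = static_cast<std::uint32_t>(field);
    const unsigned pos = start_pos & 0xffu;
    const unsigned len = num_bits & 0xffu;
    // The fill bit is the topmost extracted bit, clamped to bit 31; 0 for an empty field.
    std::uint32_t sign_bit = 0;
    if (len != 0) {
        const unsigned sign_pos = std::min(pos + len - 1, 31u);
        sign_bit = (a >> sign_pos) & 1u;
    }
    std::uint32_t result = 0;
    for (unsigned i = 0; i < 32; ++i) {
        const bool in_range = (i < len) && (pos + i <= 31);
        const std::uint32_t bit = in_range ? ((a >> (pos + i)) & 1u) : sign_bit;
        result |= bit << i;
    }
    // Conversion of values above INT32_MAX wraps on mainstream compilers
    // (guaranteed from C++20 onward).
    return static_cast<std::int32_t>(result);
}
```

Under this model, extracting 4 bits starting at position 4 from `0xF0` gives 15 with the unsigned variant but -1 with the signed one, because the topmost extracted bit (bit 7) is replicated into all higher bits of the result.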