cuda-kat
CUDA kernel author's tools
builtins.cuh File Reference

Templated, uniformly-named C++ functions wrapping single PTX instructions (in a dedicated builtins namespace).

#include <kat/on_device/common.cuh>
#include "detail/builtins.cuh"

Namespaces

 kat::builtins
 Uniformly-named, templated-when-relevant wrappers of single PTX instructions.
 
 kat::builtins::special_registers
 Special register getter wrappers.
 

Enumerations

enum  kat::builtins::funnel_shift_amount_resolution_mode_t {
  kat::builtins::funnel_shift_amount_resolution_mode_t::take_lower_bits_of_amount,
  kat::builtins::funnel_shift_amount_resolution_mode_t::cap_at_full_word_size
}
 Use this to select which variant of the funnel shift intrinsic to use.
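
As a rough illustration of what the two modes mean, here is a host-side C++ model of a 32-bit funnel shift. It is only a sketch of the arithmetic, not the library's device implementation: the `amount_mode` enumeration and the `*_model` function names are hypothetical stand-ins, and the mapping of the two modes to "keep the low 5 bits of the amount" and "cap the amount at 32" is an assumption based on the enumerator names.

```cpp
#include <cstdint>

// Hypothetical stand-in for the documented enumeration (names shortened on purpose).
enum class amount_mode { take_lower_bits, cap_at_word_size };

// Resolve the effective shift amount under the chosen mode.
template <amount_mode Mode>
constexpr std::uint32_t resolve_amount(std::uint32_t amount)
{
    return (Mode == amount_mode::take_lower_bits)
        ? (amount & 31u)                   // wrap: keep only the 5 low bits
        : (amount < 32u ? amount : 32u);   // clamp: cap at the 32-bit word size
}

// Concatenate the two words into one 64-bit value, high word on top.
constexpr std::uint64_t concatenate(std::uint32_t low_word, std::uint32_t high_word)
{
    return (static_cast<std::uint64_t>(high_word) << 32) | low_word;
}

// Shift [high_word:low_word] right and keep the 32 least-significant bits.
template <amount_mode Mode = amount_mode::cap_at_word_size>
constexpr std::uint32_t funnel_shift_right_model(
    std::uint32_t low_word, std::uint32_t high_word, std::uint32_t shift_amount)
{
    return static_cast<std::uint32_t>(
        concatenate(low_word, high_word) >> resolve_amount<Mode>(shift_amount));
}

// Shift [high_word:low_word] left and keep the 32 most-significant bits.
template <amount_mode Mode = amount_mode::cap_at_word_size>
constexpr std::uint32_t funnel_shift_left_model(
    std::uint32_t low_word, std::uint32_t high_word, std::uint32_t shift_amount)
{
    return static_cast<std::uint32_t>(
        (concatenate(low_word, high_word) << resolve_amount<Mode>(shift_amount)) >> 32);
}

// The modes only differ once the amount reaches the word size:
static_assert(funnel_shift_right_model<amount_mode::take_lower_bits>(0u, 1u, 36u) == 0x10000000u,
              "36 wraps to 4");
static_assert(funnel_shift_right_model<amount_mode::cap_at_word_size>(0u, 1u, 36u) == 1u,
              "36 is capped at 32, leaving the high word");
```

For amounts below 32 the two modes give the same result, so the choice only matters when the shift amount can reach or exceed the word size.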
 

Functions

template<typename I >
KAT_FD I kat::builtins::multiplication_high_bits (I x, I y)
 When multiplying two n-bit numbers, the result may take up to 2n bits.
 
template<typename F >
KAT_FD F kat::builtins::divide (F dividend, F divisor)
 Division which becomes faster and less precise than regular "/" when --use_fast_math is specified; otherwise it's the same as regular "/".
 
template<typename F >
KAT_FD F kat::builtins::clamp_to_unit_segment (F x)
 Clamps the input value to the unit segment [0.0,+1.0].
 
template<typename T >
KAT_FD T kat::builtins::absolute_value (T x)
 
template<typename T >
KAT_FD T kat::builtins::minimum (T x, T y)=delete
 
template<typename T >
KAT_FD T kat::builtins::maximum (T x, T y)=delete
 
template<typename I >
KAT_FD std::make_unsigned< I >::type kat::builtins::sum_with_absolute_difference (I x, I y, typename std::make_unsigned< I >::type addend)
 Computes addend + |x - y|.
 
template<typename I >
KAT_FD int kat::builtins::population_count (I x)
 
template<typename I >
KAT_FD I kat::builtins::bit_reverse (I x)=delete
 
template<typename I >
KAT_FD unsigned kat::builtins::find_leading_non_sign_bit (I x)=delete
 Find the most-significant, i.e. ...
 
template<typename I >
KAT_FD int kat::builtins::count_leading_zeros (I x)=delete
 Return the number of consecutive bits, beginning from the most-significant, which are all 0 ("leading" zeros).
 
template<typename I >
KAT_FD I kat::builtins::bit_field::extract_bits (I bit_field, unsigned int start_pos, unsigned int num_bits)=delete
 Extracts the bits with 0-based indices start_pos ... start_pos + num_bits - 1 from a bit field.
 
template<typename I >
KAT_FD I kat::builtins::bit_field::replace_bits (I original_bit_field, I bits_to_insert, unsigned int start_pos, unsigned int num_bits)=delete
 
KAT_FD unsigned kat::builtins::permute_bytes (unsigned first, unsigned second, unsigned byte_selectors)
 See the relevant section of the CUDA PTX reference for an explanation of exactly what this does.
 
template<funnel_shift_amount_resolution_mode_t AmountResolutionMode = funnel_shift_amount_resolution_mode_t::cap_at_full_word_size>
KAT_FD uint32_t kat::builtins::funnel_shift_right (uint32_t low_word, uint32_t high_word, uint32_t shift_amount)
 Performs a right-shift on the combination of the two arguments into a single, double-the-length, value.
 
template<funnel_shift_amount_resolution_mode_t AmountResolutionMode = funnel_shift_amount_resolution_mode_t::cap_at_full_word_size>
KAT_FD uint32_t kat::builtins::funnel_shift_left (uint32_t low_word, uint32_t high_word, uint32_t shift_amount)
 Performs a left-shift on the combination of the two arguments into a single, double-the-length, value.
 
template<typename I >
I KAT_FD kat::builtins::average (I x, I y)=delete
 Computes the average of two integer values, rounding down, without needing special accounting for overflow.
 
template<typename I >
I KAT_FD kat::builtins::average_rounded_up (I x, I y)=delete
 Computes the average of two values, rounding up, without needing special accounting for overflow.
 
KAT_FD unsigned kat::builtins::special_registers::lane_index ()
 
KAT_FD unsigned kat::builtins::special_registers::symmetric_multiprocessor_index ()
 
KAT_FD unsigned long long kat::builtins::special_registers::grid_index ()
 
KAT_FD unsigned int kat::builtins::special_registers::dynamic_shared_memory_size ()
 
KAT_FD unsigned int kat::builtins::special_registers::total_shared_memory_size ()
 
KAT_FD lane_mask_t kat::builtins::warp::ballot (int condition)
 
KAT_FD int kat::builtins::warp::all_lanes_satisfy (int condition)
 Checks whether a condition holds for an entire warp of threads.
 
KAT_FD int kat::builtins::warp::any_lanes_satisfy (int condition)
 
KAT_FD unsigned int kat::builtins::warp::mask_of_lanes::preceding ()
 
KAT_FD unsigned int kat::builtins::warp::mask_of_lanes::preceding_and_self ()
 
KAT_FD unsigned int kat::builtins::warp::mask_of_lanes::self ()
 
KAT_FD unsigned int kat::builtins::warp::mask_of_lanes::succeeding_and_self ()
 
KAT_FD unsigned int kat::builtins::warp::mask_of_lanes::succeeding ()
 
template<typename T >
KAT_FD T kat::builtins::warp::shuffle::arbitrary (T x, int source_lane, int width=warp_size)
 
template<typename T >
KAT_FD T kat::builtins::warp::shuffle::down (T x, unsigned delta, int width=warp_size)
 
template<typename T >
KAT_FD T kat::builtins::warp::shuffle::up (T x, unsigned delta, int width=warp_size)
 
template<typename T >
KAT_FD T kat::builtins::warp::shuffle::xor_ (T x, int lane_id_xoring_mask, int width=warp_size)
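
To make the arithmetic behind a few of the integer built-ins listed above concrete, here is a host-side C++ sketch for 32-bit unsigned operands. The `*_model` names are hypothetical, and these are plain-C++ reference computations under that assumption, not the device functions themselves (which wrap single instructions).

```cpp
#include <cstdint>

// High 32 bits of the full 64-bit product of two 32-bit operands.
constexpr std::uint32_t multiplication_high_bits_model(std::uint32_t x, std::uint32_t y)
{
    return static_cast<std::uint32_t>((static_cast<std::uint64_t>(x) * y) >> 32);
}

// addend + |x - y|, computed without ever forming a negative difference.
constexpr std::uint32_t sum_with_absolute_difference_model(
    std::uint32_t x, std::uint32_t y, std::uint32_t addend)
{
    return addend + (x > y ? x - y : y - x);
}

// floor((x + y) / 2) without forming the possibly-overflowing sum x + y:
// the shared bits count fully, the differing bits count half.
constexpr std::uint32_t average_model(std::uint32_t x, std::uint32_t y)
{
    return (x & y) + ((x ^ y) >> 1);
}

// ceil((x + y) / 2), again without forming x + y.
constexpr std::uint32_t average_rounded_up_model(std::uint32_t x, std::uint32_t y)
{
    return (x | y) - ((x ^ y) >> 1);
}

static_assert(multiplication_high_bits_model(0xFFFFFFFFu, 0xFFFFFFFFu) == 0xFFFFFFFEu,
              "0xFFFFFFFF squared is 0xFFFFFFFE00000001");
static_assert(average_model(0xFFFFFFFFu, 0xFFFFFFFDu) == 0xFFFFFFFEu,
              "no overflow even near the top of the range");
```

The averaging identities matter because the naive `(x + y) / 2` wraps around when the sum exceeds 32 bits.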
 

Detailed Description

Templated, uniformly-named C++ functions wrapping single PTX instructions (in a dedicated builtins namespace).

CUDA provides C functions corresponding to many PTX instructions, which are not otherwise easy, obvious or possible to generate with plain C or C++ code. However, it doesn't provide such functions for all PTX instructions; nor does it provide them in a type-generic way, for use in templated C++ code.

Note
  1. This obviously doesn't include those built-ins which are inherent operators in C++ as a language, i.e. % + / * - << >> and so on.
  2. PTX collaboration/single instructions don't always translate to single SASS (GPU assembly) instructions, since PTX is an intermediate representation (IR) common to multiple GPU microarchitectures.
  3. No function here performs any computation beyond a single PTX instruction; non-built-in operations belong in other files. But this is only almost true. The functions here do have:
     3.1 Type casts
     3.2 Substitutions of a constant for an instruction parameter, especially via default arguments.
  4. The templated builtins are only available for a subset of the fundamental C++ types (and never for aggregate types); other files utilize these actual built-ins to generalize them to a richer set of types.
  5. This file (and its implementation) has no PTX code. PTX instructions are wrapped in functions under the ptx/ directory, which are not templated.
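
Notes 4 and 5 describe a pattern that can be sketched in plain C++: a primary template that is deleted (as the `=delete` entries in the function list suggest), with specializations for the supported types that forward to non-templated, fixed-width functions. The following is only an illustration of that pattern on the host. The `sketch` and `fixed_width` namespaces and the `count_leading_zeros_32`/`count_leading_zeros_64` helpers are hypothetical stand-ins for the non-templated wrappers, not names from the library.

```cpp
#include <cstdint>

namespace sketch {

// Hypothetical stand-ins for non-templated, fixed-width wrappers.
// On the device each would map to one instruction; here they are plain C++.
namespace fixed_width {

constexpr int count_leading_zeros_32(std::uint32_t x)
{
    return x == 0 ? 32 : ((x & 0x80000000u) != 0 ? 0 : 1 + count_leading_zeros_32(x << 1));
}

constexpr int count_leading_zeros_64(std::uint64_t x)
{
    return x == 0 ? 64 : ((x >> 63) != 0 ? 0 : 1 + count_leading_zeros_64(x << 1));
}

} // namespace fixed_width

// Primary template is deleted: using an unsupported type is a compile-time error.
template <typename I>
constexpr int count_leading_zeros(I x) = delete;

// Only a subset of the fundamental types gets a specialization.
template <>
constexpr int count_leading_zeros<std::uint32_t>(std::uint32_t x)
{
    return fixed_width::count_leading_zeros_32(x);
}

template <>
constexpr int count_leading_zeros<std::uint64_t>(std::uint64_t x)
{
    return fixed_width::count_leading_zeros_64(x);
}

} // namespace sketch

// Same name, different widths; the count starts at the most-significant bit.
static_assert(sketch::count_leading_zeros(std::uint32_t{1}) == 31, "31 zero bits above bit 0");
static_assert(sketch::count_leading_zeros(std::uint64_t{1}) == 63, "63 zero bits above bit 0");
```

With this arrangement, a call such as `sketch::count_leading_zeros('a')` fails to compile rather than silently converting the argument, which matches the "subset of the fundamental C++ types" restriction in note 4.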

Function Documentation

§ all_lanes_satisfy()

KAT_FD int kat::builtins::warp::all_lanes_satisfy ( int  condition)

Checks whether a condition holds for an entire warp of threads.

Parameters
condition  A boolean value (passed as an integer, since that's what NVIDIA GPUs actually check with the HW instruction)
Returns
true if condition is non-zero for all threads
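
The relationship between this function and the other warp vote built-ins in the list (`ballot`, `any_lanes_satisfy`) can be sketched with a host-side C++ model (C++14 or later). It assumes a 32-lane warp and represents each lane's condition as a function of the lane index. All names here (`lane_condition`, the `*_model` functions and the sample conditions) are hypothetical. On the device, each thread would instead pass its own `condition` value.

```cpp
#include <cstdint>

constexpr int warp_size = 32;            // assumed warp width
using lane_mask_t = std::uint32_t;       // one bit per lane
using lane_condition = int (*)(int lane);  // hypothetical: a lane's condition, by lane index

// Bit i of the result is set if and only if lane i's condition is non-zero.
constexpr lane_mask_t ballot_model(lane_condition condition)
{
    lane_mask_t mask = 0;
    for (int lane = 0; lane < warp_size; ++lane) {
        if (condition(lane) != 0) { mask |= lane_mask_t{1} << lane; }
    }
    return mask;
}

// Non-zero only when every lane's condition is non-zero.
constexpr int all_lanes_satisfy_model(lane_condition condition)
{
    return ballot_model(condition) == 0xFFFFFFFFu;
}

// Non-zero when at least one lane's condition is non-zero.
constexpr int any_lanes_satisfy_model(lane_condition condition)
{
    return ballot_model(condition) != 0u;
}

// Sample per-lane conditions for illustration.
constexpr int lane_is_even(int lane)     { return lane % 2 == 0; }
constexpr int lane_is_in_warp(int lane)  { return lane >= 0 && lane < warp_size; }
constexpr int lane_is_negative(int lane) { return lane < 0; }

static_assert(ballot_model(lane_is_even) == 0x55555555u, "even lanes set bits 0, 2, 4, ...");
static_assert(all_lanes_satisfy_model(lane_is_in_warp) == 1, "holds for every lane");
static_assert(all_lanes_satisfy_model(lane_is_even) == 0, "fails for odd lanes");
```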

§ extract_bits()

template<typename I >
KAT_FD I kat::builtins::bit_field::extract_bits ( I  bit_field,
unsigned int  start_pos,
unsigned int  num_bits 
)
delete

Extracts the bits with 0-based indices start_pos ... start_pos + num_bits - 1, counting from least to most significant, from a bit field. Has sign-extension semantics for signed inputs, which are a bit tricky; see the PTX ISA guide:

http://docs.nvidia.com/cuda/parallel-thread-execution/index.html

Todo:
CUB 1.5.2's BFE wrapper seems kind of fishy. Why does Duane Merrill not use PTX for extraction from 64-bit fields? For now only adopting his implementation for the 32-bit case.
Note
This method is more "strict" in its specialization than others.
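
To illustrate the sign-extension behaviour mentioned above, here is a host-side C++ sketch (C++14 or later) of 32-bit bit-field extraction. It assumes `num_bits >= 1` and `start_pos + num_bits <= 32`; behaviour outside that range is specified in the PTX ISA guide and is deliberately not modelled here. The `*_model` names are hypothetical and this is not the library's implementation.

```cpp
#include <cstdint>

// Unsigned extraction: bits above the field are filled with zeros.
// Assumes num_bits >= 1 and start_pos + num_bits <= 32.
constexpr std::uint32_t extract_bits_unsigned_model(
    std::uint32_t bit_field, unsigned start_pos, unsigned num_bits)
{
    return (bit_field >> start_pos)
         & (num_bits >= 32 ? 0xFFFFFFFFu : ((1u << num_bits) - 1u));
}

// Signed extraction: the field's own top bit is replicated into all higher bits,
// i.e. the field is read as a num_bits-wide two's-complement number.
// Same assumptions on start_pos and num_bits as above.
constexpr std::int32_t extract_bits_signed_model(
    std::int32_t bit_field, unsigned start_pos, unsigned num_bits)
{
    const std::uint32_t field = extract_bits_unsigned_model(
        static_cast<std::uint32_t>(bit_field), start_pos, num_bits);
    const bool negative = ((field >> (num_bits - 1)) & 1u) != 0;
    // Subtract 2^num_bits when the field's top bit is set; done in 64 bits to avoid overflow.
    const std::int64_t value =
        static_cast<std::int64_t>(field) - (negative ? (std::int64_t{1} << num_bits) : 0);
    return static_cast<std::int32_t>(value);
}

// The same four bits (0xF at positions 4..7) read differently by signedness:
static_assert(extract_bits_unsigned_model(0xF0u, 4, 4) == 0xFu, "zero-filled");
static_assert(extract_bits_signed_model(0xF0, 4, 4) == -1, "sign-extended");
```

The sign of the result depends on the top bit of the extracted field, not on the sign of the whole input word, which is the part that tends to surprise.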