Templated, uniformly-named C++ functions wrapping single PTX instructions (in a dedicated builtins namespace).

#include <kat/on_device/common.cuh>
#include "detail/builtins.cuh"

Namespaces
	kat::builtins
	Uniformly-named, templated-when-relevant wrappers of single PTX instructions.

	kat::builtins::special_registers
	Special register getter wrappers.

Enumerations
enum	kat::builtins::funnel_shift_amount_resolution_mode_t { kat::builtins::funnel_shift_amount_resolution_mode_t::take_lower_bits_of_amount, kat::builtins::funnel_shift_amount_resolution_mode_t::cap_at_full_word_size }
	Use this to select which variant of the funnel shift intrinsic to use.

Functions
template<typename I >
KAT_FD I	kat::builtins::multiplication_high_bits (I x, I y)
	When multiplying two n-bit numbers, the result may take up to 2n bits.

template<typename F >
KAT_FD F	kat::builtins::divide (F dividend, F divisor)
	Division which becomes faster and less precise than regular "/" when --use_fast_math is specified; otherwise it is the same as regular "/".

template<typename F >
KAT_FD F	kat::builtins::clamp_to_unit_segment (F x)
	Clamps the input value to the unit segment [0.0, +1.0].

template<typename T >
KAT_FD T	kat::builtins::absolute_value (T x)

template<typename T >
KAT_FD T	kat::builtins::minimum (T x, T y)=delete

template<typename T >
KAT_FD T	kat::builtins::maximum (T x, T y)=delete

template<typename I >
KAT_FD std::make_unsigned< I >::type	kat::builtins::sum_with_absolute_difference (I x, I y, typename std::make_unsigned< I >::type addend)
	Computes `addend` + |`x` - `y`|.

template<typename I >
KAT_FD int	kat::builtins::population_count (I x)

template<typename I >
KAT_FD I	kat::builtins::bit_reverse (I x)=delete

template<typename I >
KAT_FD unsigned	kat::builtins::find_leading_non_sign_bit (I x)=delete
	Find the most-significant, i.e. More...

template<typename I >
KAT_FD int	kat::builtins::count_leading_zeros (I x)=delete
	Returns the number of consecutive bits, beginning from the most-significant, which are all 0 ("leading" zeros).

template<typename I >
KAT_FD I	kat::builtins::bit_field::extract_bits (I bit_field, unsigned int start_pos, unsigned int num_bits)=delete
	Extracts the bits with 0-based indices `start_pos` ... More...

template<typename I >
KAT_FD I	kat::builtins::bit_field::replace_bits (I original_bit_field, I bits_to_insert, unsigned int start_pos, unsigned int num_bits)=delete

KAT_FD unsigned	kat::builtins::permute_bytes (unsigned first, unsigned second, unsigned byte_selectors)
	See the relevant section of the CUDA PTX reference for an explanation of exactly what this does.

template<funnel_shift_amount_resolution_mode_t AmountResolutionMode = funnel_shift_amount_resolution_mode_t::cap_at_full_word_size>
KAT_FD uint32_t	kat::builtins::funnel_shift_right (uint32_t low_word, uint32_t high_word, uint32_t shift_amount)
	Performs a right-shift on the two arguments combined into a single, double-length value.

template<funnel_shift_amount_resolution_mode_t AmountResolutionMode = funnel_shift_amount_resolution_mode_t::cap_at_full_word_size>
KAT_FD uint32_t	kat::builtins::funnel_shift_left (uint32_t low_word, uint32_t high_word, uint32_t shift_amount)
	Performs a left-shift on the two arguments combined into a single, double-length value.

template<typename I >
I KAT_FD	kat::builtins::average (I x, I y)=delete
	Computes the average of two integer values, rounding down, without needing special accounting for overflow.

template<typename I >
I KAT_FD	kat::builtins::average_rounded_up (I x, I y)=delete
	Computes the average of two integer values, rounding up, without needing special accounting for overflow.

KAT_FD unsigned	kat::builtins::special_registers::lane_index ()

KAT_FD unsigned	kat::builtins::special_registers::symmetric_multiprocessor_index ()

KAT_FD unsigned long long	kat::builtins::special_registers::grid_index ()

KAT_FD unsigned int	kat::builtins::special_registers::dynamic_shared_memory_size ()

KAT_FD unsigned int	kat::builtins::special_registers::total_shared_memory_size ()

KAT_FD lane_mask_t	kat::builtins::warp::ballot (int condition)

KAT_FD int	kat::builtins::warp::all_lanes_satisfy (int condition)
	Checks whether a condition holds for an entire warp of threads.

KAT_FD int	kat::builtins::warp::any_lanes_satisfy (int condition)

KAT_FD unsigned int	kat::builtins::warp::mask_of_lanes::preceding ()

KAT_FD unsigned int	kat::builtins::warp::mask_of_lanes::preceding_and_self ()

KAT_FD unsigned int	kat::builtins::warp::mask_of_lanes::self ()

KAT_FD unsigned int	kat::builtins::warp::mask_of_lanes::succeeding_and_self ()

KAT_FD unsigned int	kat::builtins::warp::mask_of_lanes::succeeding ()

template<typename T >
KAT_FD T	kat::builtins::warp::shuffle::arbitrary (T x, int source_lane, int width=warp_size)

template<typename T >
KAT_FD T	kat::builtins::warp::shuffle::down (T x, unsigned delta, int width=warp_size)

template<typename T >
KAT_FD T	kat::builtins::warp::shuffle::up (T x, unsigned delta, int width=warp_size)

template<typename T >
KAT_FD T	kat::builtins::warp::shuffle::xor_ (T x, int lane_id_xoring_mask, int width=warp_size)

Detailed Description

Templated, uniformly-named C++ functions wrapping single PTX instructions (in a dedicated builtins namespace).

CUDA provides C functions corresponding to many PTX instructions which are not otherwise easy, obvious or possible to generate with plain C or C++ code. However, it doesn't provide such functions for all PTX instructions, nor does it provide them in a type-generic way for use in templated C++ code.

Note

1. This obviously doesn't include those built-ins which are inherent operators in C++ as a language, e.g. % + / * - << >> and so on.
2. A single PTX instruction, collaborative or not, does not always translate to a single SASS (GPU assembly) instruction, since PTX is an intermediate representation (IR) common to multiple GPU microarchitectures.
3. No function here performs any computation beyond a single PTX instruction; non-built-in operations belong in other files. This is only almost true, though. The functions here do have:
   3.1 Type casts
   3.2 Substitutions of a constant for an instruction parameter, especially via default arguments.
4. The templated builtins are only available for a subset of the fundamental C++ types (and never for aggregate types); other files use these actual built-ins to generalize them to a richer set of types.
5. This file (and its implementation) has no PTX code. PTX instructions are wrapped in functions under the ptx/ directory, which are not templated.
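
The notes on type casts and on the supported subset of types can be illustrated with a small host-side sketch. It is not the library's code: the `sketch` and `sketch_detail` namespaces and the `popc32`/`popc64` helpers are hypothetical stand-ins for the non-templated per-width wrappers, emulated here in plain C++ so that the pattern compiles without a GPU. The primary template is deleted, mirroring the `=delete` declarations in the function list, and each supported type gets a specialization that adds at most a cast.

```cpp
#include <cstdint>

// Hypothetical stand-ins for non-templated, fixed-width wrappers
// (host-side emulation only; not the library's implementation).
namespace sketch_detail {
inline int popc32(std::uint32_t x) { int n = 0; for (; x != 0; x &= x - 1) ++n; return n; }
inline int popc64(std::uint64_t x) { int n = 0; for (; x != 0; x &= x - 1) ++n; return n; }
}  // namespace sketch_detail

namespace sketch {

// Deleted primary template: any type outside the supported subset
// is rejected at compile time instead of silently converting.
template <typename I>
int population_count(I x) = delete;

// Supported unsigned types forward directly to the fixed-width helper.
template <>
inline int population_count<std::uint32_t>(std::uint32_t x) { return sketch_detail::popc32(x); }

template <>
inline int population_count<std::uint64_t>(std::uint64_t x) { return sketch_detail::popc64(x); }

// A signed type adds nothing but a type cast on top of the same helper.
template <>
inline int population_count<std::int32_t>(std::int32_t x)
{
    return sketch_detail::popc32(static_cast<std::uint32_t>(x));
}

}  // namespace sketch
```

With these definitions, `sketch::population_count(std::uint32_t{0xF0})` yields 4, while a call with an unsupported type such as `short` fails to compile because it selects the deleted primary template.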

Function Documentation

§ all_lanes_satisfy()

KAT_FD int kat::builtins::warp::all_lanes_satisfy ( int condition )

Checks whether a condition holds for an entire warp of threads.

Parameters

condition A boolean value (passed as an integer, since that's what NVIDIA GPUs actually check with the HW instruction)

Returns: true if condition is non-zero for all threads
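
As an illustration of these semantics, the following host-side model treats a warp as an array of 32 per-lane condition values. It is a sketch under assumptions: the `warp_model` namespace, the array-based interface and the 32-bit `lane_mask_t` alias are hypothetical, and real device code evaluates the condition once per thread rather than over an array.

```cpp
#include <array>
#include <cstdint>

namespace warp_model {

constexpr int warp_size = 32;
using lane_mask_t = std::uint32_t;  // assumption: one bit per lane
constexpr lane_mask_t full_warp_mask = 0xFFFFFFFFu;

using lane_conditions = std::array<int, warp_size>;

// Bit i of the result is set iff lane i's condition is non-zero.
inline lane_mask_t ballot(const lane_conditions& conditions)
{
    lane_mask_t mask = 0;
    for (int lane = 0; lane < warp_size; ++lane) {
        if (conditions[lane] != 0) { mask |= (lane_mask_t{1} << lane); }
    }
    return mask;
}

// Non-zero iff every lane's condition is non-zero.
inline int all_lanes_satisfy(const lane_conditions& conditions)
{
    return ballot(conditions) == full_warp_mask ? 1 : 0;
}

// Non-zero iff at least one lane's condition is non-zero.
inline int any_lanes_satisfy(const lane_conditions& conditions)
{
    return ballot(conditions) != 0 ? 1 : 0;
}

}  // namespace warp_model
```

Any non-zero value counts as true, so a warp in which every lane passes 7 satisfies the condition just as one in which every lane passes 1.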

§ extract_bits()

template<typename I >

KAT_FD I kat::builtins::bit_field::extract_bits	(	I	bit_field,
		unsigned int	start_pos,
		unsigned int	num_bits
	)

delete

Extracts the bits with 0-based indices start_pos ... start_pos + num_bits - 1, counting from least to most significant, from a bit field. Has sign extension semantics for signed inputs which are a bit tricky; see the PTX ISA guide:

http://docs.nvidia.com/cuda/parallel-thread-execution/index.html

Todo: CUB 1.5.2's BFE wrapper seems kind of fishy. Why does Duane Merrill not use PTX for extraction from 64-bit fields? For now only adopting his implementation for the 32-bit case.

Note: This method is more "strict" in its specialization than others.
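
To make the sign extension behavior concrete, here is a host-side model of 32-bit extraction that follows the bit-by-bit description of the `bfe` instruction in the PTX ISA guide. It is a sketch, not the library's implementation: the `bfe_model` namespace and the `extract_bits_raw` helper are hypothetical names, and the final conversion to `std::int32_t` assumes two's complement wrap-around, which mainstream compilers provide.

```cpp
#include <algorithm>
#include <cstdint>

namespace bfe_model {

// Bit-by-bit model of a 32-bit bit-field extract (hypothetical helper).
inline std::uint32_t extract_bits_raw(
    std::uint32_t a, unsigned start_pos, unsigned num_bits, bool is_signed)
{
    constexpr unsigned msb = 31;
    const unsigned pos = start_pos & 0xffu;  // only the low 8 bits are used
    const unsigned len = num_bits & 0xffu;
    // Fill bit: 0 for unsigned or empty fields, otherwise the top bit of
    // the field (capped at bit 31 when the field runs past the word).
    const unsigned sbit = (!is_signed || len == 0)
        ? 0u
        : (a >> std::min(pos + len - 1, msb)) & 1u;

    std::uint32_t d = 0;
    for (unsigned i = 0; i <= msb; ++i) {
        const unsigned bit = (i < len && pos + i <= msb)
            ? (a >> (pos + i)) & 1u
            : sbit;
        d |= static_cast<std::uint32_t>(bit) << i;
    }
    return d;
}

// Unsigned fields are zero-extended.
inline std::uint32_t extract_bits(std::uint32_t bit_field, unsigned start_pos, unsigned num_bits)
{
    return extract_bits_raw(bit_field, start_pos, num_bits, false);
}

// Signed fields are sign-extended from the field's most significant bit.
inline std::int32_t extract_bits(std::int32_t bit_field, unsigned start_pos, unsigned num_bits)
{
    return static_cast<std::int32_t>(
        extract_bits_raw(static_cast<std::uint32_t>(bit_field), start_pos, num_bits, true));
}

}  // namespace bfe_model
```

In this model, extracting 4 bits at position 4 from the unsigned value 0xF0 gives 15, whereas the same extraction from a signed 0xF0 gives -1, because the top bit of the extracted field is replicated into all higher bits of the result.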
