Code exposing CUDA's PTX intermediate representation instructions to C++ code.

Namespaces
	special_registers
	Wrappers for instructions obtaining the value of one of the special hardware registers on NVIDIA GPUs.

Functions
KAT_FD void	trap ()
	Aborts execution (of the entire kernel grid) and generates an interrupt to the host CPU.

KAT_FD void	exit ()
	Ends execution of the current thread of this kernel/grid.

	DEFINE_IS_IN_MEMORY_SPACE (const)

	DEFINE_IS_IN_MEMORY_SPACE (global)

	DEFINE_IS_IN_MEMORY_SPACE (local)

	DEFINE_IS_IN_MEMORY_SPACE (shared)

	DEFINE_BFIND (s32)

	DEFINE_BFIND (s64)

	DEFINE_BFIND (u32)

	DEFINE_BFIND (u64)

	DEFINE_PRMT_WITH_MODE (forward_4_extract, f4e)

	DEFINE_PRMT_WITH_MODE (backward_4_extract, b4e)

	DEFINE_PRMT_WITH_MODE (replicate_8, rc8)

	DEFINE_PRMT_WITH_MODE (replicate_16, rc16)

	DEFINE_PRMT_WITH_MODE (edge_clam_left, ecl)

	DEFINE_PRMT_WITH_MODE (edge_clam_right, ecl)

KAT_FD uint32_t	prmt (uint32_t first, uint32_t second, uint32_t byte_selectors)
	See the description of the prmt instruction in the CUDA PTX reference for exactly what this does.

	asm ("prmt.b32 %0, %1, %2, %3;" :"=r"(result) :"r"(first), "r"(second), "r"(byte_selectors))

	DEFINE_BFE (s32)

	DEFINE_BFE (s64)

	DEFINE_BFE (u32)

	DEFINE_BFE (u64)

KAT_FD uint32_t	bfi (uint32_t bits_to_insert, uint32_t existing_bit_field, uint32_t start_position, uint32_t num_bits)

	asm ("bfi.b32 %0, %1, %2, %3, %4;" :"=r"(ret) :"r"(bits_to_insert), "r"(existing_bit_field), "r"(start_position), "r"(num_bits))

KAT_FD uint64_t	bfi (uint64_t bits_to_insert, uint64_t existing_bit_field, uint32_t start_position, uint32_t num_bits)

	DEFINE_SAD_ (u16)

	DEFINE_SAD_ (u32)

	DEFINE_SAD_ (u64)

	DEFINE_SAD_ (s16)

	DEFINE_SAD_ (s32)

	DEFINE_SAD_ (s64)

	DEFINE_SHIFT_AND_OP (l, add)

	DEFINE_SHIFT_AND_OP (l, min)

	DEFINE_SHIFT_AND_OP (l, max)

	DEFINE_SHIFT_AND_OP (r, add)

	DEFINE_SHIFT_AND_OP (r, min)

	DEFINE_SHIFT_AND_OP (r


Detailed Description

Code exposing CUDA's PTX intermediate representation instructions to C++ code.

With CUDA, device-side code is compiled from a C++-like language to an intermediate representation (IR) called PTX. No GPU executes PTX directly, but it is easy to compile PTX into the native instructions of a specific GPU.

Occasionally, a developer wants to use a specific PTX instruction, e.g. to optimize some code. CUDA's headers expose some of these instructions, but not all of them. Also, the exposed instructions are not templated on their argument types, whereas PTX instructions are effectively parameterized by type. This library fills both gaps.
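To illustrate what "templated on the arguments" can mean in practice, here is a minimal host-side C++ sketch. It is not taken from this library: the trait `ptx_type_suffix` and the helper `typed_mnemonic` are hypothetical names, used only to show how a C++ type could be mapped to the type suffix a PTX instruction carries (as in `bfind.u32` or `bfind.s64`).

```cpp
#include <cstdint>
#include <string>

// Hypothetical trait (an assumption, not part of the library): maps a C++
// integer type to the PTX type suffix used for operands of that type.
// The primary template is left undefined, so unsupported types fail to compile.
template <typename T> struct ptx_type_suffix;

template <> struct ptx_type_suffix<std::int32_t>  { static constexpr const char* value() { return "s32"; } };
template <> struct ptx_type_suffix<std::int64_t>  { static constexpr const char* value() { return "s64"; } };
template <> struct ptx_type_suffix<std::uint32_t> { static constexpr const char* value() { return "u32"; } };
template <> struct ptx_type_suffix<std::uint64_t> { static constexpr const char* value() { return "u64"; } };

// Hypothetical helper: builds the text of a typed mnemonic from an opcode
// name and the operand type T.
template <typename T>
std::string typed_mnemonic(const std::string& opcode)
{
    return opcode + "." + ptx_type_suffix<T>::value();
}
```

With these definitions, `typed_mnemonic<std::uint32_t>("bfind")` yields `"bfind.u32"`. A device-side wrapper could use the same kind of type-based dispatch to pick the right instruction variant, although the macros listed above may well achieve this differently.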

Function Documentation

§ prmt()

KAT_FD uint32_t kat::ptx::prmt	(	uint32_t	first,
		uint32_t	second,
		uint32_t	byte_selectors
	)

See the description of the prmt instruction in the CUDA PTX reference for exactly what this does.

Parameters

first	a first value from which to potentially use bytes
second	a second value from which to potentially use bytes
byte_selectors	a packing of four 4-bit selectors, one per output byte. In each selector, the lower 3 bits specify which of the input bytes to use (there are 8 bytes overall in `first` and `second`), and the remaining bit specifies whether to copy that byte as-is or, instead, to replicate its sign (interpreting it as an int8_t) across the whole target byte.

Returns: the four bytes selected from first and/or second (or their replicated signs), as indicated by byte_selectors

Note: Only the lower 16 bits of byte_selectors are used. "prmt" stands for "permute".
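The selector encoding described above can be made concrete with a host-side reference model. This is a sketch under stated assumptions, not the library's implementation (which issues the PTX instruction directly): the name `prmt_reference` is hypothetical, and the model covers only the default, mode-less form of the instruction, with bytes 0 to 3 taken from `first` and bytes 4 to 7 from `second`.

```cpp
#include <cstdint>

// Hypothetical host-side model of the default (mode-less) prmt behavior;
// the function name is an assumption made for this illustration.
std::uint32_t prmt_reference(
    std::uint32_t first, std::uint32_t second, std::uint32_t byte_selectors)
{
    // The 8 candidate bytes: indices 0..3 come from first, 4..7 from second.
    const std::uint64_t pool = (static_cast<std::uint64_t>(second) << 32) | first;

    std::uint32_t result = 0;
    for (int i = 0; i < 4; ++i) {
        // Selector i occupies bits [4*i, 4*i + 3]; bits 16 and up are ignored.
        const std::uint32_t selector = (byte_selectors >> (4 * i)) & 0xFu;
        const std::uint32_t source_index = selector & 0x7u;
        std::uint32_t byte =
            static_cast<std::uint32_t>((pool >> (8 * source_index)) & 0xFFu);

        // Top selector bit set: replicate the byte's sign bit over all 8 bits.
        if (selector & 0x8u) {
            byte = (byte & 0x80u) ? 0xFFu : 0x00u;
        }
        result |= byte << (8 * i);
    }
    return result;
}
```

For example, with `first = 0x33221100` and `second = 0x77665544`, the selectors `0x3210` return `first` unchanged, `0x7654` return `second`, and `0x0123` reverse the bytes of `first`, giving `0x00112233`.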
