cuda-kat
CUDA kernel author's tools
Code exposing CUDA's PTX intermediate representation instructions to C++ code.
Namespaces | |
special_registers | |
Wrappers for instructions obtaining the value of one of the special hardware registers on nVIDIA GPUs. | |
Functions | |
KAT_FD void | trap () |
Aborts execution (of the entire kernel grid) and generates an interrupt to the host CPU. | |
KAT_FD void | exit () |
Ends execution of the current thread of this kernel/grid. | |
DEFINE_IS_IN_MEMORY_SPACE (const) | |
DEFINE_IS_IN_MEMORY_SPACE (global) | |
DEFINE_IS_IN_MEMORY_SPACE (local) | |
DEFINE_IS_IN_MEMORY_SPACE (shared) | |
DEFINE_BFIND (s32) | |
DEFINE_BFIND (s64) | |
DEFINE_BFIND (u32) | |
DEFINE_BFIND (u64) | |
DEFINE_PRMT_WITH_MODE (forward_4_extract, f4e) | |
DEFINE_PRMT_WITH_MODE (backward_4_extract, b4e) | |
DEFINE_PRMT_WITH_MODE (replicate_8, rc8) | |
DEFINE_PRMT_WITH_MODE (replicate_16, rc16) | |
DEFINE_PRMT_WITH_MODE (edge_clam_left, ecl) | |
DEFINE_PRMT_WITH_MODE (edge_clam_right, ecl) | |
KAT_FD uint32_t | prmt (uint32_t first, uint32_t second, uint32_t byte_selectors) |
See: relevant section of the CUDA PTX reference for an explanation of what this does exactly. | |
asm ("prmt.b32 %0, %1, %2, %3;" : "=r"(result) : "r"(first), "r"(second), "r"(byte_selectors)) | |
DEFINE_BFE (s32) | |
DEFINE_BFE (s64) | |
DEFINE_BFE (u32) | |
DEFINE_BFE (u64) | |
KAT_FD uint32_t | bfi (uint32_t bits_to_insert, uint32_t existing_bit_field, uint32_t start_position, uint32_t num_bits) |
asm ("bfi.b32 %0, %1, %2, %3, %4;" : "=r"(ret) : "r"(bits_to_insert), "r"(existing_bit_field), "r"(start_position), "r"(num_bits)) | |
KAT_FD uint64_t | bfi (uint64_t bits_to_insert, uint64_t existing_bit_field, uint32_t start_position, uint32_t num_bits) |
DEFINE_SAD_ (u16) | |
DEFINE_SAD_ (u32) | |
DEFINE_SAD_ (u64) | |
DEFINE_SAD_ (s16) | |
DEFINE_SAD_ (s32) | |
DEFINE_SAD_ (s64) | |
DEFINE_SHIFT_AND_OP (l, add) | |
DEFINE_SHIFT_AND_OP (l, min) | |
DEFINE_SHIFT_AND_OP (l, max) | |
DEFINE_SHIFT_AND_OP (r, add) | |
DEFINE_SHIFT_AND_OP (r, min) | |
DEFINE_SHIFT_AND_OP (r | |
With CUDA, device-side code is compiled from a C++-like language to an intermediate representation (IR), PTX, which no GPU executes directly but which is easy to compile further into a GPU's native code.
Occasionally, a developer wants to use a specific PTX instruction, e.g. to optimize some code. CUDA's headers expose some of the opcodes for these instructions, but not all of them. Also, the exposed instructions are not templated on their arguments, whereas PTX instructions are parameterized in this way. This library fills both gaps.
KAT_FD uint32_t kat::ptx::prmt (uint32_t first, uint32_t second, uint32_t byte_selectors)
See: relevant section of the CUDA PTX reference for an explanation of what this does exactly.
first | a first value from which to potentially use bytes |
second | a second value from which to potentially use bytes |
byte_selectors | a packing of 4 selector structures; each selector structure has 3 bits specifying which of the input bytes is to be used (there are 8 bytes overall in first and second), and one more bit specifying whether the byte is copied as-is or whether the sign of the byte (interpreted as an int8_t) is replicated to fill the target byte. |
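To make the selector encoding concrete, the following is a minimal host-side C++ sketch that mimics the default-mode behaviour of prmt.b32 as described above. It is an illustration only: the name prmt_reference is hypothetical and is not part of cuda-kat, and the sketch assumes (following the PTX reference) that the four selectors sit in the low 16 bits of byte_selectors, 4 bits each, with first supplying source bytes 0 to 3 and second supplying bytes 4 to 7.

```cpp
#include <cstdint>

// Hypothetical host-side reference for default-mode prmt.b32; not a cuda-kat function.
constexpr std::uint32_t prmt_reference(
    std::uint32_t first, std::uint32_t second, std::uint32_t byte_selectors)
{
    // The 8 candidate source bytes: bytes 0..3 come from first, bytes 4..7 from second.
    const std::uint64_t source = (static_cast<std::uint64_t>(second) << 32) | first;
    std::uint32_t result = 0;
    for (int i = 0; i < 4; ++i) {
        // Selector i occupies bits [4*i, 4*i + 3] of byte_selectors.
        const std::uint32_t selector = (byte_selectors >> (4 * i)) & 0xFu;
        const std::uint32_t index = selector & 0x7u;  // which of the 8 source bytes
        std::uint8_t byte = static_cast<std::uint8_t>(source >> (8 * index));
        if (selector & 0x8u) {
            // Sign-replication bit set: fill the target byte with the source byte's sign bit.
            byte = static_cast<std::uint8_t>((byte & 0x80u) ? 0xFFu : 0x00u);
        }
        result |= static_cast<std::uint32_t>(byte) << (8 * i);
    }
    return result;
}

// Selectors 0x3210 pick bytes 0..3 in order, i.e. reproduce first.
static_assert(prmt_reference(0x33221100u, 0x77665544u, 0x3210u) == 0x33221100u, "identity");
// Selectors 0x0123 reverse the byte order of first.
static_assert(prmt_reference(0x33221100u, 0x77665544u, 0x0123u) == 0x00112233u, "byte swap");
// Selector 0x8 in the lowest position replicates the sign of byte 0 (0x80 is negative).
static_assert(prmt_reference(0x00000080u, 0u, 0x3218u) == 0x000000FFu, "sign replicate");
```

A reference like this can serve as an oracle when testing device code that calls the real wrapper; it does not cover the f4e, b4e, rc8, ecl or rc16 modes.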
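The bfi functions listed among the members wrap the PTX bfi instruction, which inserts a bit field into an existing value. As a rough illustration of those semantics, here is a host-side C++ sketch of the 32-bit variant. The name bfi_reference is hypothetical and not part of cuda-kat; the sketch assumes, per the PTX reference, that only the low 8 bits of the position and length operands are used and that any bits which would land beyond bit 31 are dropped.

```cpp
#include <cstdint>

// Hypothetical host-side reference for bfi.b32; not a cuda-kat function.
constexpr std::uint32_t bfi_reference(
    std::uint32_t bits_to_insert, std::uint32_t existing_bit_field,
    std::uint32_t start_position, std::uint32_t num_bits)
{
    const std::uint32_t pos = start_position & 0xFFu;  // only the low 8 bits are used
    const std::uint32_t len = num_bits & 0xFFu;        // only the low 8 bits are used
    std::uint32_t result = existing_bit_field;
    // Copy the low len bits of bits_to_insert into result, starting at bit pos;
    // stop early once the target position would exceed bit 31.
    for (std::uint32_t i = 0; i < len && pos + i <= 31u; ++i) {
        const std::uint32_t bit = (bits_to_insert >> i) & 1u;
        result = (result & ~(1u << (pos + i))) | (bit << (pos + i));
    }
    return result;
}

// Insert the 8-bit field 0xAB at bit 8 of 0xFFFF0000.
static_assert(bfi_reference(0xABu, 0xFFFF0000u, 8u, 8u) == 0xFFFFAB00u, "plain insert");
// A 4-bit field starting at bit 30: only bits 30 and 31 fit.
static_assert(bfi_reference(0xFu, 0u, 30u, 4u) == 0xC0000000u, "clipped at bit 31");
```

The bit-by-bit loop mirrors the pseudocode style of the PTX reference rather than aiming for speed; a mask-based version would be the natural choice outside of a reference implementation.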