vocabulary_strings("CIvocab_terms.bin", "w+b"),
vocabulary("CIvocab.bin", "w+b"),
postings("CIpostings.bin", "w+b"),
primary_keys("CIdoclist.bin", "w+b"),
impact_ordered(documents, memory),
encoder(get_compressor(codex, compressor_name, compressor_d_ness)),
compressed_buffer(allocator),
compressed_segments(allocator),

postings.write(&codex, 1);

virtual void finish(void);
file vocabulary
Details about the term (including a pointer to the term, a pointer to the postings, and the quantum count).
Definition: serialise_jass_v1.h:147
Non-thread-safe object that accumulates a single postings list during indexing.
Definition: index_postings.h:40
std::vector< vocab_tripple > index_key
The entry point into the JASS v1 index is CIvocab.bin, the index key.
Definition: serialise_jass_v1.h:150
static bool strict_weak_order_less_than(const slice &me, const slice &with)
Return true if this < with.
Definition: slice.h:313
virtual void operator()(const slice &term, const index_postings &postings, compress_integer::integer document_frequency, compress_integer::integer *document_ids, index_postings_impact::impact_type *term_frequencies)
The callback function to serialise the postings (given the term) is operator().
Definition: serialise_jass_v1.cpp:199
C++ slices (string-descriptors)
Definition: slice.h:27
Non-thread-safe object that holds a single postings list during indexing.
std::vector< uint64_t > primary_key_offsets
A list of locations (on disk) of each primary key.
Definition: serialise_jass_v1.h:151
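A minimal sketch of how a list of on-disk key locations like primary_key_offsets might be built: each key's offset is recorded just before the key is appended. The function name `append_primary_keys`, the in-memory `file_image` buffer standing in for the file, and the '\0' terminator after each key are assumptions made for illustration, not the layout JASS actually uses.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: append each key to file_image (a stand-in for the
// primary key file) and return the byte offset at which each key starts.
std::vector<uint64_t> append_primary_keys(const std::vector<std::string> &keys, std::string &file_image)
{
    std::vector<uint64_t> offsets;
    offsets.reserve(keys.size());
    for (const std::string &key : keys)
    {
        offsets.push_back(file_image.size());   // where this key begins
        file_image.append(key);
        file_image.push_back('\0');             // assumed terminator
    }
    return offsets;
}
```

With the offsets in hand, a reader can seek directly to the n-th primary key without scanning the keys before it.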
virtual size_t write_postings(const index_postings &postings, size_t &number_of_impacts, compress_integer::integer document_frequency, compress_integer::integer *document_ids, index_postings_impact::impact_type *term_frequencies)
Convert the postings list to the JASS v1 format and serialise it to disk.
Definition: serialise_jass_v1.cpp:76
Base class for the indexer object that stores the actual index during indexing.
Compression codexes for integer sequences.
Definition: compress_integer.h:34
The triple used in CIvocab.bin.
Definition: serialise_jass_v1.h:80
virtual ~serialise_jass_v1()
Destructor.
Definition: serialise_jass_v1.h:239
std::string compressor_name
The name of the compression algorithm.
Definition: serialise_jass_v1.h:154
uint32_t integer
This class and descendants will work on integers of this size. Do not change without also changing JA...
Definition: compress_integer.h:40
uint64_t offset
The pointer to the postings stored in the CIpostings.bin file.
Definition: serialise_jass_v1.h:85
allocator_pool memory
Memory used to store the impact-ordered postings list.
Definition: serialise_jass_v1.h:152
Postings are compressed using Elias gamma SIMD encoding with variable byte endings.
Definition: serialise_jass_v1.h:141
vocab_tripple(const slice &string, uint64_t term, uint64_t offset, uint64_t impacts)
Constructor.
Definition: serialise_jass_v1.h:100
Partial file and whole file based I/O methods.
Postings are not compressed.
Definition: serialise_jass_v1.h:134
virtual void serialise_primary_keys(void)
Serialise the primary keys (or any extra stuff at the end of the primary key file).
Definition: serialise_jass_v1.cpp:61
Postings are compressed using ATIRE's variable byte encoding.
Definition: serialise_jass_v1.h:135
size_t write(const void *data, size_t size)
Write size bytes of data to the given file at the current cursor position.
Definition: file.h:315
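The listing writes a single byte (the codex) to the postings file with postings.write(&codex, 1). The sketch below shows the same pattern with the C standard library rather than the JASS file class: one byte is written to a file opened for binary update and then read back. The function name `round_trip_codex_byte` is hypothetical, and std::tmpfile() is used only so that the example leaves no file behind.

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical helper: write one byte at the start of a fresh binary file,
// then read it back. Returns the byte read, or -1 on any failure.
int round_trip_codex_byte(uint8_t codex)
{
    std::FILE *fp = std::tmpfile();          // temporary file opened in "wb+" mode
    if (fp == nullptr)
        return -1;

    int result = -1;
    if (std::fwrite(&codex, 1, 1, fp) == 1)  // write 1 byte at the cursor
    {
        std::rewind(fp);                     // move the cursor back to the start
        uint8_t read_back = 0;
        if (std::fread(&read_back, 1, 1, fp) == 1)
            result = read_back;
    }
    std::fclose(fp);
    return result;
}
```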
Postings are compressed using ATIRE's simple-8b encoding.
Definition: serialise_jass_v1.h:136
Serialise an index in the experimental JASS-CI format used (by JASS version 1) in the RIGOR workshop...
Definition: serialise_jass_v1.h:70
virtual void serialise_vocabulary_pointers(void)
Serialise the pointers that point between the vocab and the postings (the CIvocab.bin file).
Definition: serialise_jass_v1.cpp:39
allocator_cpp< uint8_t > allocator
C++ allocator between memory object and std::vector object.
Definition: serialise_jass_v1.h:157
static compress_integer * get_compressor(jass_v1_codex codex, std::string &name, int32_t &d_ness)
Return a reference to a compressor/decompressor that can be used with this index. ...
Definition: serialise_jass_v1.cpp:241
Postings are compressed using QMX with Lemire's D4 delta encoding.
Definition: serialise_jass_v1.h:138
index_postings_impact impact_ordered
The re-used impact ordered postings list.
Definition: serialise_jass_v1.h:153
C++11 allocator class that uses a C allocator. See here: https://msdn.microsoft.com/en-us/library/aa9...
Postings are compressed using Elias delta SIMD encoding.
Definition: serialise_jass_v1.h:142
Holder class for an impact ordered postings list.
Definition: index_postings_impact.h:31
static constexpr size_t largest_impact
The largest allowable impact score (255 is a good value).
Definition: index_postings_impact.h:42
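One way to keep a value within the largest allowable impact is to clamp it, as in the sketch below. The `impact_type` alias and the value 255 mirror the documentation above, but the function `clamp_to_impact` and the clamping policy itself are assumptions for illustration; the documentation does not say how JASS treats larger values.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

using impact_type = uint16_t;                 // mirrors the documented impact_type
constexpr std::size_t largest_impact = 255;   // the documented "good value"

// Hypothetical helper: limit a raw term frequency to the largest impact.
impact_type clamp_to_impact(std::size_t term_frequency)
{
    return static_cast<impact_type>(std::min(term_frequency, largest_impact));
}
```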
slice token
The term as a string (needed for sorting the std::vector vocab_tripple array later) ...
Definition: serialise_jass_v1.h:83
compress_integer * encoder
The integer encoder used to compress postings lists.
Definition: serialise_jass_v1.h:156
Simple block-allocator that internally allocates a large chunk then allocates smaller blocks from thi...
Definition: allocator_pool.h:61
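The idea of allocating one large chunk and handing out smaller blocks from it can be sketched as a bump allocator. The class `bump_pool` and its methods are hypothetical and far simpler than allocator_pool: there is a single fixed-size chunk, no alignment handling, and no thread safety.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical single-chunk pool: blocks are carved off the front of one
// large buffer and are all released together by rewind().
class bump_pool
{
    std::vector<uint8_t> chunk;   // the one large allocation
    std::size_t used = 0;         // bytes handed out so far

public:
    explicit bump_pool(std::size_t bytes) : chunk(bytes) {}

    // Return the next block of the requested size, or nullptr if the chunk is exhausted.
    void *allocate(std::size_t bytes)
    {
        if (bytes > chunk.size() - used)
            return nullptr;
        void *block = chunk.data() + used;
        used += bytes;
        return block;
    }

    // Make the whole chunk available again without freeing it.
    void rewind() { used = 0; }
};
```

Because individual blocks are never freed, allocation is a bounds check and an addition, which suits objects such as postings lists that are built and then discarded together.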
uint8_t alignment
Postings lists are padded to this alignment (used for codexes that require word alignment).
Definition: serialise_jass_v1.h:160
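Padding a postings list to an alignment amounts to rounding its length up to the next multiple of that alignment. A minimal sketch, with the hypothetical helper name `padded_length`:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: round length up to the next multiple of alignment.
// An alignment of 1 leaves the length unchanged.
std::size_t padded_length(std::size_t length, std::size_t alignment)
{
    assert(alignment != 0);
    std::size_t remainder = length % alignment;
    return remainder == 0 ? length : length + (alignment - remainder);
}
```

For example, a 13-byte list padded to a 4-byte alignment occupies 16 bytes, while the constructor's default alignment of 1 adds no padding.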
serialise_jass_v1(size_t documents, jass_v1_codex codex=jass_v1_codex::elias_gamma_simd, int8_t alignment=1)
Constructor.
Definition: serialise_jass_v1.h:189
uint64_t term
The pointer to the \0 terminated string in the CIvocab_terms.bin file.
Definition: serialise_jass_v1.h:84
uint64_t impacts
The number of impacts that exist for this term.
Definition: serialise_jass_v1.h:86
delegate(size_t documents)
Constructor.
Definition: index_manager.h:60
file vocabulary_strings
The concatenation of UTF-8 encoded unique tokens in the collection.
Definition: serialise_jass_v1.h:146
file primary_keys
The list of external identifiers (document primary keys).
Definition: serialise_jass_v1.h:149
Definition: document_id.h:16
uint16_t impact_type
An impact value (i.e. a term frequency value) is of this type.
Definition: index_postings_impact.h:41
bool operator<(const vocab_tripple &other) const
Compare (using strcmp() collating sequence) this object with another for less than.
Definition: serialise_jass_v1.h:118
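The sketch below illustrates how a less-than operator built on strcmp() lets std::sort order the vocabulary by term. The struct `vocab_entry` is a hypothetical stand-in for vocab_tripple: it keeps the same four fields but stores the token as a std::string rather than a slice.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical stand-in for vocab_tripple (the real class holds a slice).
struct vocab_entry
{
    std::string token;   // the term as a string, used only for sorting
    uint64_t term;       // offset of the term's string in the terms file
    uint64_t offset;     // offset of the postings list in the postings file
    uint64_t impacts;    // number of impacts for this term

    // Order entries by the strcmp() collating sequence of their tokens.
    bool operator<(const vocab_entry &other) const
    {
        return std::strcmp(token.c_str(), other.token.c_str()) < 0;
    }
};

// Return the entries sorted by token; std::sort uses operator< above.
std::vector<vocab_entry> sorted_vocabulary(std::vector<vocab_entry> entries)
{
    std::sort(entries.begin(), entries.end());
    return entries;
}
```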
virtual void finish(void)
Finish up any serialising that needs to be done.
Definition: serialise_jass_v1.cpp:22
Postings are compressed using JASS v1's variant of QMX (with difference (D1) encoding).
Definition: serialise_jass_v1.h:137
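Difference (D1) encoding stores each value as its distance from the previous value, so an ascending sequence of document ids becomes a sequence of small gaps that compress well. The sketch below shows only that transformation, not QMX itself; the function names `d1_encode` and `d1_decode` are hypothetical, and the input is assumed to be in ascending order.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: replace each id with its difference from the previous id.
// The first id is stored as its difference from 0.
std::vector<uint32_t> d1_encode(const std::vector<uint32_t> &ids)
{
    std::vector<uint32_t> deltas;
    deltas.reserve(ids.size());
    uint32_t previous = 0;
    for (uint32_t id : ids)
    {
        deltas.push_back(id - previous);
        previous = id;
    }
    return deltas;
}

// Hypothetical helper: rebuild the original ids with a running sum of the differences.
std::vector<uint32_t> d1_decode(const std::vector<uint32_t> &deltas)
{
    std::vector<uint32_t> ids;
    ids.reserve(deltas.size());
    uint32_t running = 0;
    for (uint32_t delta : deltas)
    {
        running += delta;
        ids.push_back(running);
    }
    return ids;
}
```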
std::vector< uint8_t, allocator_cpp< uint8_t > > compressed_buffer
The buffer used to compress postings into.
Definition: serialise_jass_v1.h:158
jass_v1_codex
The compression scheme that is active.
Definition: serialise_jass_v1.h:132
int compressor_d_ness
The d-ness of the compression algorithm.
Definition: serialise_jass_v1.h:155
QMX version compatible with JASS v1.
file postings
The postings lists.
Definition: serialise_jass_v1.h:148
Slices (also known as string-descriptors) for C++.
size_t documents
The number of documents in the collection.
Definition: index_manager.h:50
File based I/O methods including whole file and partial files.
Definition: file.h:45
Base class for holding the index during indexing.
Definition: index_manager.h:33
std::vector< slice, allocator_cpp< slice > > compressed_segments
Vector of pointers (and lengths) to the compressed postings.
Definition: serialise_jass_v1.h:159
Base class for the callback function called by iterate.
Definition: index_manager.h:47
static void unittest(void)
Unit test this class.
Definition: serialise_jass_v1.cpp:273
Definition: compress_integer_elias_delta_simd.c:23
Pack 32-bit integers into 512-bit SIMD words using Elias gamma encoding.
Postings are compressed using Elias gamma SIMD encoding.
Definition: serialise_jass_v1.h:140
Postings are compressed using QMX without delta encoding.
Definition: serialise_jass_v1.h:139