|
JASSv2
|
Serialise an index in the experimental JASS-CI format used (by JASS version 1) in the RIGOR workshop.
#include <serialise_jass_v1.h>


Classes | |
| class | vocab_tripple |
| The triple used in CIvocab.bin. | |
Public Types | |
| enum | jass_v1_codex { uncompressed = 's', variable_byte = 'c', simple_8b = '8', qmx = 'q', qmx_d4 = 'Q', qmx_d0 = 'R', elias_gamma_simd = 'G', elias_gamma_simd_vb = 'g', elias_delta_simd = 'D' } |
| The compression scheme that is active. | |
Public Member Functions | |
| serialise_jass_v1 (size_t documents, jass_v1_codex codex=jass_v1_codex::elias_gamma_simd, int8_t alignment=1) | |
| Constructor. | |
| virtual | ~serialise_jass_v1 () |
| Destructor. | |
| virtual void | finish (void) |
| Finish up any serialising that needs to be done. | |
| virtual void | serialise_vocabulary_pointers (void) |
| Serialise the pointers that point between the vocab and the postings (the CIvocab.bin file). | |
| virtual void | serialise_primary_keys (void) |
| Serialise the primary keys (or any extra stuff at the end of the primary key file). | |
| virtual void | operator() (const slice &term, const index_postings &postings, compress_integer::integer document_frequency, compress_integer::integer *document_ids, index_postings_impact::impact_type *term_frequencies) |
| The callback function to serialise the postings (given the term) is operator(). | |
| virtual void | operator() (size_t document_id, const slice &primary_key) |
| The callback function to serialise the primary keys (external document ids) is operator(). | |
Public Member Functions inherited from JASS::index_manager::delegate | |
| delegate (size_t documents) | |
| Constructor. | |
| virtual | ~delegate () |
| Destructor. | |
Static Public Member Functions | |
| static compress_integer * | get_compressor (jass_v1_codex codex, std::string &name, int32_t &d_ness) |
| Return a reference to a compressor/decompressor that can be used with this index. | |
| static void | unittest (void) |
| Unit test this class. | |
Protected Member Functions | |
| virtual size_t | write_postings (const index_postings &postings, size_t &number_of_impacts, compress_integer::integer document_frequency, compress_integer::integer *document_ids, index_postings_impact::impact_type *term_frequencies) |
| Convert the postings list to the JASS v1 format and serialise it to disk. | |
Protected Attributes | |
| file | vocabulary_strings |
| The concatenation of UTF-8 encoded unique tokens in the collection. | |
| file | vocabulary |
| Details about the term (including a pointer to the term, a pointer to the postings, and the quantum count). | |
| file | postings |
| The postings lists. | |
| file | primary_keys |
| The list of external identifiers (document primary keys). | |
| std::vector< vocab_tripple > | index_key |
| The entry point into the JASS v1 index is CIvocab.bin, the index key. | |
| std::vector< uint64_t > | primary_key_offsets |
| A list of locations (on disk) of each primary key. | |
| allocator_pool | memory |
| Memory used to store the impact-ordered postings list. | |
| index_postings_impact | impact_ordered |
| The re-used impact ordered postings list. | |
| std::string | compressor_name |
| The name of the compression algorithm. | |
| int | compressor_d_ness |
| The d-ness of the compression algorithm. | |
| compress_integer * | encoder |
| The integer encoder used to compress postings lists. | |
| allocator_cpp< uint8_t > | allocator |
| C++ allocator between memory object and std::vector object. | |
| std::vector< uint8_t, allocator_cpp< uint8_t > > | compressed_buffer |
| The buffer used to compress postings into. | |
| std::vector< slice, allocator_cpp< slice > > | compressed_segments |
| Vector of pointers (and lengths) to the compressed postings. | |
| uint8_t | alignment |
| Postings lists are padded to this alignment (used for codexes that require word alignment). | |
Additional Inherited Members | |
Public Attributes inherited from JASS::index_manager::delegate | |
| size_t | documents |
| The number of documents in the collection. | |
Serialise an index in the experimental JASS-CI format used (by JASS version 1) in the RIGOR workshop.
The original version of JASS was an experimental hack at reducing the complexity of the ATIRE search engine; it resulted in an index that was large but easy to process. The intent was to go back and "fix" the index to be smaller and faster. That never happened. Instead, it was used as the basis of other work. To help bring up this rewrite of ATIRE and JASS, compatibility with the hack (known as JASS version 1) is maintained so that the indexer can be checked without writing the search engine itself (i.e. this JASS is being bootstrapped from JASS version 1).
The paper comparing JASS version 1 to other search engines (including ATIRE) is here: J. Lin, M. Crane, A. Trotman, J. Callan, I. Chattopadhyaya, J. Foley, G. Ingersoll, C. Macdonald, S. Vigna (2016), Toward Reproducible Baselines: The Open-Source IR Reproducibility Challenge, Proceedings of the European Conference on Information Retrieval (ECIR 2016), pp. 408-420.
The JASS version 1 index is made up of four files: CIvocab_terms.bin, CIvocab.bin, CIpostings.bin, and CIdoclist.bin.
CIdoclist.bin: The list of document identifiers (each '\0' terminated), followed by an index to each of the document identifiers (stored as a table of uint64_t). The final 8 bytes of the file are a uint64_t storing the total number of unique documents in the collection.
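The CIdoclist.bin layout described above can be illustrated with a small in-memory sketch. It is not the JASS implementation: `build_doclist` and `read_doclist` are hypothetical helpers, and the sketch assumes that each table entry is a byte offset from the start of the file, that there is one entry per document, and that integers are stored in host byte order.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: build an in-memory image laid out like CIdoclist.bin.
std::string build_doclist(const std::vector<std::string> &keys)
{
    std::string image;
    std::vector<uint64_t> offsets;

    // 1. The primary keys, each '\0' terminated.
    for (const auto &key : keys)
    {
        offsets.push_back(image.size());
        image.append(key);
        image.push_back('\0');
    }

    // 2. The table of uint64_t offsets (assumed: one per key, from file start).
    for (uint64_t offset : offsets)
        image.append(reinterpret_cast<const char *>(&offset), sizeof(offset));

    // 3. The final 8 bytes: the number of documents.
    uint64_t count = keys.size();
    image.append(reinterpret_cast<const char *>(&count), sizeof(count));

    return image;
}

// Hypothetical helper: recover the primary keys by reading the image backwards.
std::vector<std::string> read_doclist(const std::string &image)
{
    uint64_t count = 0;
    if (image.size() < sizeof(count))
        return {};
    std::memcpy(&count, image.data() + image.size() - sizeof(count), sizeof(count));

    // Reject a count that cannot fit in the image.
    if (count > (image.size() - sizeof(count)) / sizeof(uint64_t))
        return {};

    const char *table = image.data() + image.size() - sizeof(count) - count * sizeof(uint64_t);
    std::vector<std::string> keys;
    for (uint64_t which = 0; which < count; which++)
    {
        uint64_t offset;
        std::memcpy(&offset, table + which * sizeof(offset), sizeof(offset));
        if (offset >= image.size())
            return {};
        keys.emplace_back(image.data() + offset);    // copies up to the '\0'
    }
    return keys;
}
```

Reading from the end of the file first (count, then table, then strings) means a reader never has to scan the strings to find where the table starts.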
CIvocab_terms.bin: This is a list of all the unique terms in the collection (the closure of the vocabulary). It is stored as a sequence of '\0' terminated UTF-8 strings. So, if the vocabulary contains three terms, "a", "bb" and "cc", then the contents of CIvocab_terms.bin will be "a\0bb\0cc\0". This file does not need to be sorted in alphabetical (or similar) order.
CIvocab.bin: This is a list of triples (term, offset, impacts). Term is a pointer to the string in the CIvocab_terms.bin file (i.e. a byte offset within the file). Offset is the offset (in CIpostings.bin) of the start of the postings list. Impacts is the number of impacts in the impact ordered postings list. JASS v1 assumes this file is sorted in alphabetical order by the term string (i.e. where term points to) when using strcmp().
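The relationship between CIvocab_terms.bin and CIvocab.bin can be sketched as follows. This is an illustration under assumptions rather than the JASS code: `triple_sketch`, `add_term`, `sort_triples` and `find_term` are hypothetical names, and the three fields are assumed to be 64-bit integers (the text above does not give their widths). The term strings are stored in arrival order, while the triples are sorted with strcmp() so a reader can binary search them.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical stand-in for one CIvocab.bin entry (field widths are assumed).
struct triple_sketch
{
    uint64_t term;       // byte offset of the term string in the terms blob
    uint64_t offset;     // byte offset of the postings list in CIpostings.bin
    uint64_t impacts;    // number of impacts in the impact-ordered postings list
};

// Append a '\0' terminated term to the blob and return where it starts.
uint64_t add_term(std::string &terms, const std::string &term)
{
    uint64_t at = terms.size();
    terms.append(term);
    terms.push_back('\0');
    return at;
}

// Sort the triples by the string each one points to, using strcmp().
void sort_triples(std::vector<triple_sketch> &triples, const std::string &terms)
{
    std::sort(triples.begin(), triples.end(),
        [&](const triple_sketch &a, const triple_sketch &b)
        {
            return std::strcmp(terms.c_str() + a.term, terms.c_str() + b.term) < 0;
        });
}

// Binary search the sorted triples for a term; nullptr if it is absent.
const triple_sketch *find_term(const std::vector<triple_sketch> &triples, const std::string &terms, const char *query)
{
    auto found = std::lower_bound(triples.begin(), triples.end(), query,
        [&](const triple_sketch &entry, const char *key)
        {
            return std::strcmp(terms.c_str() + entry.term, key) < 0;
        });

    if (found == triples.end() || std::strcmp(terms.c_str() + found->term, query) != 0)
        return nullptr;
    return &*found;
}
```

Because only the triples are sorted, the indexer can append term strings as it encounters them and defer ordering until the vocabulary is complete.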
CIpostings.bin: This file contains all the postings lists, compressed using the same codex. This is different from ATIRE, which allows each postings list to be encoded using a different codex. The first byte of this file specifies the codex, where s=uncompressed, c=VarByte, 8=Simple8, q=QMX, Q=QMX4D, R=QMX0D. This is followed by the postings lists.
A postings list is: a list of 64-bit pointers to headers. Each header is (uint16_t impact_score, uint64_t start, uint64_t end, uint32_t impact_frequency) where impact_score is the impact value, start and end are pointers to the compressed docids, and impact_frequency is the number of document_ids in the list. The headers are terminated with a row of all 0s (i.e. 22 consecutive 0-bytes). This is followed by the list of docids for each segment, each compressed separately. These lists do not have the impact score stored at the start and do not have 0 terminators on them.
This means score-at-a-time processing is the only paradigm, even if term-at-a-time processing is done score-at-a-time for each term. ATIRE could do either (but it was a compile time flag).
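The header rows described above can be sketched as a 22-byte packed record followed by an all-zero terminator row. This is a minimal illustration, not the JASS serialiser: `impact_header_sketch`, `append_header`, `append_terminator` and `read_headers` are hypothetical names, the fields are assumed to be tightly packed in host byte order, and the leading list of 64-bit header pointers and the compressed docid segments are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical in-memory form of one header (the on-disk form is packed).
struct impact_header_sketch
{
    uint16_t impact_score;        // the impact value
    uint64_t start;               // where the compressed docids start
    uint64_t end;                 // where the compressed docids end
    uint32_t impact_frequency;    // number of document ids in the segment
};

constexpr size_t header_bytes = 2 + 8 + 8 + 4;    // 22 bytes, no padding

// Append one header as 22 packed bytes (memcpy avoids struct padding).
void append_header(std::vector<uint8_t> &out, const impact_header_sketch &header)
{
    uint8_t packed[header_bytes];
    std::memcpy(packed, &header.impact_score, 2);
    std::memcpy(packed + 2, &header.start, 8);
    std::memcpy(packed + 10, &header.end, 8);
    std::memcpy(packed + 18, &header.impact_frequency, 4);
    out.insert(out.end(), packed, packed + header_bytes);
}

// Append the terminating row of 22 zero bytes.
void append_terminator(std::vector<uint8_t> &out)
{
    out.insert(out.end(), header_bytes, uint8_t{0});
}

// Read headers until the all-zero row (or the end of the buffer) is reached.
std::vector<impact_header_sketch> read_headers(const std::vector<uint8_t> &in)
{
    static const uint8_t zeros[header_bytes] = {};
    std::vector<impact_header_sketch> headers;

    for (size_t at = 0; at + header_bytes <= in.size(); at += header_bytes)
    {
        const uint8_t *row = in.data() + at;
        if (std::memcmp(row, zeros, header_bytes) == 0)
            break;

        impact_header_sketch header;
        std::memcpy(&header.impact_score, row, 2);
        std::memcpy(&header.start, row + 2, 8);
        std::memcpy(&header.end, row + 10, 8);
        std::memcpy(&header.impact_frequency, row + 18, 4);
        headers.push_back(header);
    }
    return headers;
}
```

Copying field by field keeps the on-disk row at exactly 22 bytes; writing the struct directly would typically include compiler padding and produce a larger row.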
The compression scheme that is active.
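Since each enumerator's value is a printable character, and the first byte of CIpostings.bin identifies the codex, a reader could map that byte back to a scheme with a simple switch. The sketch below mirrors the enumerators listed for jass_v1_codex in a stand-alone enum; `codex_sketch` and `codex_label` are hypothetical names and are not part of JASS.

```cpp
#include <cassert>
#include <cstring>

// Stand-alone mirror of the jass_v1_codex values (for illustration only).
enum class codex_sketch : char
{
    uncompressed = 's',
    variable_byte = 'c',
    simple_8b = '8',
    qmx = 'q',
    qmx_d4 = 'Q',
    qmx_d0 = 'R',
    elias_gamma_simd = 'G',
    elias_gamma_simd_vb = 'g',
    elias_delta_simd = 'D'
};

// Map a codex byte to the enumerator's name; nullptr for an unknown byte.
const char *codex_label(char tag)
{
    switch (static_cast<codex_sketch>(tag))
    {
        case codex_sketch::uncompressed:        return "uncompressed";
        case codex_sketch::variable_byte:       return "variable_byte";
        case codex_sketch::simple_8b:           return "simple_8b";
        case codex_sketch::qmx:                 return "qmx";
        case codex_sketch::qmx_d4:              return "qmx_d4";
        case codex_sketch::qmx_d0:              return "qmx_d0";
        case codex_sketch::elias_gamma_simd:    return "elias_gamma_simd";
        case codex_sketch::elias_gamma_simd_vb: return "elias_gamma_simd_vb";
        case codex_sketch::elias_delta_simd:    return "elias_delta_simd";
    }
    return nullptr;
}
```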
|
inline |
Constructor.
| documents | [in] The number of documents in the collection (used to allocate re-usable buffers). |
| codex | [in] The codex responsible for performing the compression of postings lists (default = jass_v1_codex::elias_gamma_simd). |
| alignment | [in] The start address of a postings list is padded to start on these boundaries (needed for compress_integer_QMX_jass_v1, which uses 16, and others). Default = 1. |
|
static |
Return a reference to a compressor/decompressor that can be used with this index.
| codex | [in] The codex to use |
| name | [out] The name of the compression codex |
| d_ness | [out] Whether the codex requires D0, D1, etc decoding (-1 if it supports decode_and_process via decode_none) |
|
virtual |
The callback function to serialise the postings (given the term) is operator().
| term | [in] The term name. |
| postings | [in] The postings lists. |
| document_frequency | [in] The document frequency of the term |
| document_ids | [in] An array (of length document_frequency) of document ids. |
| term_frequencies | [in] An array (of length document_frequency) of term frequencies (corresponding to document_ids). |
Implements JASS::index_manager::delegate.
Reimplemented in JASS::serialise_jass_v2.
|
virtual |
The callback function to serialise the primary keys (external document ids) is operator().
| document_id | [in] The internal document identifier. |
| primary_key | [in] This document's primary key (external document identifier). |
Implements JASS::index_manager::delegate.
|
protected virtual |
Convert the postings list to the JASS v1 format and serialise it to disk.
| postings | [in] The postings list to serialise. |
| number_of_impacts | [out] The number of distinct impact scores seen in the postings list. |
| document_frequency | [in] The document frequency of the term |
| document_ids | [in] An array (of length document_frequency) of document ids. |
| term_frequencies | [in] An array (of length document_frequency) of term frequencies (corresponding to document_ids). |
Reimplemented in JASS::serialise_jass_v2.