JASSv2
JASS::serialise_jass_v1 Class Reference

Serialise an index in the experimental JASS-CI format used (by JASS version 1) in the RIGOR workshop.

#include <serialise_jass_v1.h>


Classes

class  vocab_tripple
 The triple used in CIvocab.bin.
 

Public Types

enum  jass_v1_codex {
  uncompressed = 's', variable_byte = 'c', simple_8b = '8', qmx = 'q',
  qmx_d4 = 'Q', qmx_d0 = 'R', elias_gamma_simd = 'G', elias_gamma_simd_vb = 'g',
  elias_delta_simd = 'D'
}
 The compression scheme that is active.
 

Public Member Functions

 serialise_jass_v1 (size_t documents, jass_v1_codex codex=jass_v1_codex::elias_gamma_simd, int8_t alignment=1)
 Constructor.
 
virtual ~serialise_jass_v1 ()
 Destructor.
 
virtual void finish (void)
 Finish up any serialising that needs to be done.
 
virtual void serialise_vocabulary_pointers (void)
 Serialise the pointers that point between the vocab and the postings (the CIvocab.bin file).
 
virtual void serialise_primary_keys (void)
 Serialise the primary keys (or any extra stuff at the end of the primary key file).
 
virtual void operator() (const slice &term, const index_postings &postings, compress_integer::integer document_frequency, compress_integer::integer *document_ids, index_postings_impact::impact_type *term_frequencies)
 The callback function to serialise the postings (given the term) is operator().
 
virtual void operator() (size_t document_id, const slice &primary_key)
 The callback function to serialise the primary keys (external document ids) is operator().
 
- Public Member Functions inherited from JASS::index_manager::delegate
 delegate (size_t documents)
 Constructor.
 
virtual ~delegate ()
 Destructor.
 

Static Public Member Functions

static compress_integer * get_compressor (jass_v1_codex codex, std::string &name, int32_t &d_ness)
 Return a reference to a compressor/decompressor that can be used with this index.
 
static void unittest (void)
 Unit test this class.
 

Protected Member Functions

virtual size_t write_postings (const index_postings &postings, size_t &number_of_impacts, compress_integer::integer document_frequency, compress_integer::integer *document_ids, index_postings_impact::impact_type *term_frequencies)
 Convert the postings list to the JASS v1 format and serialise it to disk.
 

Protected Attributes

file vocabulary_strings
 The concatenation of UTF-8 encoded unique tokens in the collection.
 
file vocabulary
 Details about the term (including a pointer to the term, a pointer to the postings, and the quantum count).
 
file postings
 The postings lists.
 
file primary_keys
 The list of external identifiers (document primary keys).
 
std::vector< vocab_tripple > index_key
 The entry point into the JASS v1 index is CIvocab.bin, the index key.
 
std::vector< uint64_t > primary_key_offsets
 A list of locations (on disk) of each primary key.
 
allocator_pool memory
 Memory used to store the impact-ordered postings list.
 
index_postings_impact impact_ordered
 The re-used impact ordered postings list.
 
std::string compressor_name
 The name of the compression algorithm.
 
int compressor_d_ness
 The d-ness of the compression algorithm.
 
compress_integer * encoder
 The integer encoder used to compress postings lists.
 
allocator_cpp< uint8_t > allocator
 C++ allocator between memory object and std::vector object.
 
std::vector< uint8_t, allocator_cpp< uint8_t > > compressed_buffer
 The buffer used to compress postings into.
 
std::vector< slice, allocator_cpp< slice > > compressed_segments
 Vector of pointers (and lengths) to the compressed postings.
 
uint8_t alignment
 Postings lists are padded to this alignment (used for codexes that require word alignment).
 

Additional Inherited Members

- Public Attributes inherited from JASS::index_manager::delegate
size_t documents
 The number of documents in the collection.
 

Detailed Description

Serialise an index in the experimental JASS-CI format used (by JASS version 1) in the RIGOR workshop.

The original version of JASS was an experimental hack aimed at reducing the complexity of the ATIRE search engine, and it resulted in an index that was large but easy to process. The intent was to go back and "fix" the index to be smaller and faster. That never happened. Instead it was used as the basis of other work. To help bring up this re-write of ATIRE and JASS, compatibility with the hack (known as JASS version 1) is maintained so that the indexer can be checked without writing the search engine itself (i.e. this JASS is being bootstrapped from JASS version 1).

The paper comparing JASS version 1 to other search engines (including ATIRE) is here: J. Lin, M. Crane, A. Trotman, J. Callan, I. Chattopadhyaya, J. Foley, G. Ingersoll, C. Macdonald, S. Vigna (2016), Toward Reproducible Baselines: The Open-Source IR Reproducibility Challenge, Proceedings of the European Conference on Information Retrieval (ECIR 2016), pp. 408-420.

The JASS version 1 index is made up of 4 files: CIvocab_terms.bin, CIvocab.bin, CIpostings.bin, and CIdoclist.bin.

CIdoclist.bin: The list of document identifiers (each '\0' terminated), followed by an index to each of the document identifiers (stored as a table of uint64_t). The final 8 bytes of the file are a uint64_t storing the total number of unique documents in the collection.
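
The layout described above can be sketched in memory as follows. This is a minimal illustration rather than the JASS implementation: the helper names `append_u64`, `build_doclist` and `doclist_document_count` are hypothetical, the index table is assumed to hold the byte offset at which each identifier starts, and integers are assumed to be written in host byte order.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: append a uint64_t in host byte order (an assumption).
static void append_u64(std::vector<uint8_t> &out, uint64_t value) {
    uint8_t bytes[sizeof(value)];
    std::memcpy(bytes, &value, sizeof(value));
    out.insert(out.end(), bytes, bytes + sizeof(value));
}

// Hypothetical helper: lay out primary keys in the CIdoclist.bin shape.
std::vector<uint8_t> build_doclist(const std::vector<std::string> &primary_keys) {
    std::vector<uint8_t> out;
    std::vector<uint64_t> offsets;

    // 1. Each primary key, '\0' terminated, remembering where it starts.
    for (const auto &key : primary_keys) {
        offsets.push_back(out.size());
        out.insert(out.end(), key.begin(), key.end());
        out.push_back('\0');
    }

    // 2. The table of uint64_t offsets, one per document.
    for (uint64_t offset : offsets)
        append_u64(out, offset);

    // 3. The final 8 bytes: the number of documents.
    append_u64(out, primary_keys.size());
    return out;
}

// Read the document count back from the final 8 bytes of the file image.
uint64_t doclist_document_count(const std::vector<uint8_t> &file) {
    assert(file.size() >= sizeof(uint64_t));
    uint64_t count;
    std::memcpy(&count, file.data() + file.size() - sizeof(count), sizeof(count));
    return count;
}
```

Because the count sits at the very end, a reader can find the offset table by stepping back 8 bytes for the count and then a further 8 bytes per document.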

CIvocab_terms.bin: This is a list of all the unique terms in the collection (the closure of the vocabulary). It is stored as a sequence of '\0' terminated UTF-8 strings. So, if the vocabulary contains three terms, "a", "bb" and "cc", then the contents of CIvocab_terms.bin will be "a\0bb\0cc\0". This file does not need to be sorted in alphabetical (or similar) order.
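
As a small sketch of that layout (the function name `build_vocab_terms` is hypothetical and this is not the JASS code), the terms can be concatenated while recording the byte offset of each one, which is the value a vocabulary entry would later point at:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: concatenate terms as '\0' terminated strings and
// record the byte offset at which each term starts.
std::string build_vocab_terms(const std::vector<std::string> &terms, std::vector<size_t> &offsets) {
    std::string out;
    offsets.clear();
    for (const auto &term : terms) {
        // A term containing '\0' would be indistinguishable from two terms.
        assert(term.find('\0') == std::string::npos);
        offsets.push_back(out.size());
        out += term;
        out.push_back('\0');
    }
    return out;
}
```

For the three terms "a", "bb" and "cc" this yields the 8 bytes "a\0bb\0cc\0" with offsets 0, 2 and 5.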

CIvocab.bin: This is a list of triples (term, offset, impacts). Term is a pointer to the string in the CIvocab_terms.bin file (i.e. a byte offset within the file). Offset is the offset (in CIpostings.bin) of the start of the postings list. Impacts is the number of impacts in the impact ordered postings list. JASS v1 assumes this file is sorted in alphabetical order by the term string (i.e. where term points to) when using strcmp().
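
The sort order matters because it allows a binary search over the triples. The sketch below illustrates the idea under stated assumptions: `vocab_entry`, `sort_vocab` and `find_term` are hypothetical names, and the three fields are assumed to be uint64_t (the text above does not give their widths).

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical in-memory mirror of one CIvocab.bin triple (field widths assumed).
struct vocab_entry {
    uint64_t term;     // byte offset of the term string in CIvocab_terms.bin
    uint64_t offset;   // byte offset of the postings list in CIpostings.bin
    uint64_t impacts;  // number of impacts in the postings list
};

// Sort the triples by the string each term offset points at, using strcmp().
void sort_vocab(std::vector<vocab_entry> &entries, const char *terms) {
    std::sort(entries.begin(), entries.end(),
        [terms](const vocab_entry &a, const vocab_entry &b) {
            return std::strcmp(terms + a.term, terms + b.term) < 0;
        });
}

// Binary search for a term in the sorted triples; returns nullptr if absent.
const vocab_entry *find_term(const std::vector<vocab_entry> &entries, const char *terms, const char *query) {
    assert(terms != nullptr && query != nullptr);
    auto found = std::lower_bound(entries.begin(), entries.end(), query,
        [terms](const vocab_entry &entry, const char *key) {
            return std::strcmp(terms + entry.term, key) < 0;
        });
    if (found == entries.end() || std::strcmp(terms + found->term, query) != 0)
        return nullptr;
    return &*found;
}
```

Note that the strings file itself stays unsorted; only the triples are reordered.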

CIpostings.bin: This file contains all the postings lists compressed using the same codex. This is different from ATIRE which allows each postings list to be encoded using a different codex. The first byte of this file specifies the codex where s=uncompressed, c=VarByte, 8=Simple8, q=QMX, Q=QMX4D, R=QMX0D. This is followed by the postings lists. A postings list is: a list of 64-bit pointers to headers. Each header is (uint16_t impact_score, uint64_t start, uint64_t end, uint32_t impact_frequency) where impact_score is the impact value, start and end are pointers to the compressed docids, and impact_frequency is the number of document ids in the list. The header is terminated with a row of all 0s (i.e. 22 consecutive 0-bytes). This is followed by the list of docids for each segment, each compressed separately. These lists do not have the impact score stored at the start and do not have 0 terminators on them. This means score-at-a-time processing is the only paradigm, even if term-at-a-time processing is done score-at-a-time for each term. ATIRE could do either (but it was a compile time flag).
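
The figure of 22 bytes follows from the header fields: 2 + 8 + 8 + 4. A minimal sketch of writing one header and the terminator is shown here. The names `impact_header`, `append_field`, `append_header`, `append_terminator` and `is_terminator` are hypothetical, and host byte order is an assumption; this is not the JASS serialiser.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical in-memory form of one impact header (not the JASS type).
struct impact_header {
    uint16_t impact_score;      // the impact value
    uint64_t start;             // where the compressed docids begin
    uint64_t end;               // where the compressed docids end
    uint32_t impact_frequency;  // number of document ids in the segment
};

// On-disk size when written field by field: 2 + 8 + 8 + 4 = 22 bytes.
constexpr size_t header_bytes = sizeof(uint16_t) + 2 * sizeof(uint64_t) + sizeof(uint32_t);
static_assert(header_bytes == 22, "a header occupies 22 bytes on disk");

// Append the raw bytes of one field (host byte order, an assumption).
template <typename T>
void append_field(std::vector<uint8_t> &out, T value) {
    uint8_t bytes[sizeof(T)];
    std::memcpy(bytes, &value, sizeof(T));
    out.insert(out.end(), bytes, bytes + sizeof(T));
}

// Write the fields one at a time so that no compiler padding reaches the file.
void append_header(std::vector<uint8_t> &out, const impact_header &header) {
    append_field(out, header.impact_score);
    append_field(out, header.start);
    append_field(out, header.end);
    append_field(out, header.impact_frequency);
}

// The header list ends with a row of 22 zero bytes.
void append_terminator(std::vector<uint8_t> &out) {
    out.insert(out.end(), header_bytes, uint8_t(0));
}

// True if the 22 bytes at `header` are the all-zero terminator.
bool is_terminator(const uint8_t *header) {
    assert(header != nullptr);
    for (size_t byte = 0; byte < header_bytes; byte++)
        if (header[byte] != 0)
            return false;
    return true;
}
```

Writing the struct with a single memcpy would usually be wrong here, since most compilers pad `impact_header` beyond 22 bytes.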

Member Enumeration Documentation

◆ jass_v1_codex

The compression scheme that is active.

Enumerator
uncompressed 

Postings are not compressed.

variable_byte 

Postings are compressed using ATIRE's variable byte encoding.

simple_8b 

Postings are compressed using ATIRE's simple-8b encoding.

qmx 

Postings are compressed using JASS v1's variant of QMX (with difference (D1) encoding).

qmx_d4 

Postings are compressed using QMX with Lemire's D4 delta encoding.

qmx_d0 

Postings are compressed using QMX without delta encoding.

elias_gamma_simd 

Postings are compressed using Elias gamma SIMD encoding.

elias_gamma_simd_vb 

Postings are compressed using Elias gamma SIMD encoding with variable byte endings.

elias_delta_simd 

Postings are compressed using Elias delta SIMD encoding.
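
Since each enumerator's value is the character stored as the first byte of CIpostings.bin, a reader can identify the codex from that byte alone. The following sketch mirrors the enumeration locally; `codex` and `describe_codex` are hypothetical names used for illustration and are not part of the JASS API.

```cpp
#include <cassert>
#include <string>

// Local mirror of the codex identifiers listed above (not the JASS header itself).
enum class codex : char {
    uncompressed = 's', variable_byte = 'c', simple_8b = '8', qmx = 'q',
    qmx_d4 = 'Q', qmx_d0 = 'R', elias_gamma_simd = 'G', elias_gamma_simd_vb = 'g',
    elias_delta_simd = 'D'
};

// Hypothetical helper: describe the codex named by the first byte of CIpostings.bin.
std::string describe_codex(char first_byte) {
    switch (static_cast<codex>(first_byte)) {
        case codex::uncompressed:        return "uncompressed";
        case codex::variable_byte:       return "variable byte";
        case codex::simple_8b:           return "simple-8b";
        case codex::qmx:                 return "QMX (D1)";
        case codex::qmx_d4:              return "QMX (D4)";
        case codex::qmx_d0:              return "QMX (D0)";
        case codex::elias_gamma_simd:    return "Elias gamma SIMD";
        case codex::elias_gamma_simd_vb: return "Elias gamma SIMD with variable byte endings";
        case codex::elias_delta_simd:    return "Elias delta SIMD";
    }
    return "unknown";  // any byte that is not one of the listed identifiers
}
```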

Constructor & Destructor Documentation

◆ serialise_jass_v1()

JASS::serialise_jass_v1::serialise_jass_v1 ( size_t  documents,
jass_v1_codex  codex = jass_v1_codex::elias_gamma_simd,
int8_t  alignment = 1 
)
inline

Constructor.

Parameters
documents[in] The number of documents in the collection (used to allocate re-usable buffers).
codex[in] The codex responsible for compressing the postings lists (default = jass_v1_codex::elias_gamma_simd).
alignment[in] The start address of a postings list is padded to start on these boundaries (needed for compress_integer_QMX_jass_v1 (use 16), and others). Default = 1.
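
To illustrate what the alignment parameter implies, the number of padding bytes needed before a postings list can be computed from the current file offset. This is a sketch of the arithmetic only; `padding_for` is a hypothetical helper and not a member of this class.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: the number of padding bytes needed so that a postings
// list starting at `offset` begins on an `alignment`-byte boundary.
size_t padding_for(uint64_t offset, uint8_t alignment) {
    assert(alignment != 0);
    return static_cast<size_t>((alignment - offset % alignment) % alignment);
}
```

With an alignment of 1 the result is always 0, so no padding is written; with an alignment of 16, an offset of 17 needs 15 bytes of padding to reach 32.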

Member Function Documentation

◆ get_compressor()

compress_integer * JASS::serialise_jass_v1::get_compressor ( jass_v1_codex  codex,
std::string &  name,
int32_t &  d_ness 
)
static

Return a reference to a compressor/decompressor that can be used with this index.

Parameters
codex[in] The codex to use
name[out] The name of the compression codex
d_ness[out] Whether the codex requires D0, D1, etc decoding (-1 if it supports decode_and_process via decode_none)
Returns
A reference to a compress_integer that can decode the given codex

◆ operator()() [1/2]

void JASS::serialise_jass_v1::operator() ( const slice &  term,
const index_postings &  postings,
compress_integer::integer  document_frequency,
compress_integer::integer *  document_ids,
index_postings_impact::impact_type *  term_frequencies 
)
virtual

The callback function to serialise the postings (given the term) is operator().

Parameters
term[in] The term name.
postings[in] The postings lists.
document_frequency[in] The document frequency of the term
document_ids[in] An array (of length document_frequency) of document ids.
term_frequencies[in] An array (of length document_frequency) of term frequencies (corresponding to document_ids).

Implements JASS::index_manager::delegate.

Reimplemented in JASS::serialise_jass_v2.

◆ operator()() [2/2]

void JASS::serialise_jass_v1::operator() ( size_t  document_id,
const slice &  primary_key 
)
virtual

The callback function to serialise the primary keys (external document ids) is operator().

Parameters
document_id[in] The internal document identifier.
primary_key[in] This document's primary key (external document identifier).

Implements JASS::index_manager::delegate.

◆ write_postings()

size_t JASS::serialise_jass_v1::write_postings ( const index_postings &  postings,
size_t &  number_of_impacts,
compress_integer::integer  document_frequency,
compress_integer::integer *  document_ids,
index_postings_impact::impact_type *  term_frequencies 
)
protected virtual

Convert the postings list to the JASS v1 format and serialise it to disk.

Parameters
postings[in] The postings list to serialise.
number_of_impacts[out] The number of distinct impact scores seen in the postings list.
document_frequency[in] The document frequency of the term
document_ids[in] An array (of length document_frequency) of document ids.
term_frequencies[in] An array (of length document_frequency) of term frequencies (corresponding to document_ids).
Returns
The location (in CIpostings.bin) of the start of the serialised postings list.

Reimplemented in JASS::serialise_jass_v2.
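
The core of the conversion is regrouping a document-ordered postings list into one segment per distinct impact score. The sketch below shows that grouping step only, under assumptions that should be checked against the source: the per-document values are treated directly as impact scores, segments are ordered from highest impact to lowest, and `segment_map` and `impact_order` are hypothetical names. Compression and writing to disk are omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <vector>

// Hypothetical container: impact score -> document ids, highest impact first.
using segment_map = std::map<uint16_t, std::vector<uint32_t>, std::greater<uint16_t>>;

// Group a document-ordered postings list into one segment per distinct impact.
segment_map impact_order(size_t document_frequency, const uint32_t *document_ids,
                         const uint16_t *impacts, size_t &number_of_impacts) {
    assert(document_frequency == 0 || (document_ids != nullptr && impacts != nullptr));
    segment_map segments;
    for (size_t which = 0; which < document_frequency; which++)
        segments[impacts[which]].push_back(document_ids[which]);

    // The number of distinct impact scores seen in the postings list.
    number_of_impacts = segments.size();
    return segments;
}
```

Within each segment the document ids keep their original order, which matters for codexes that store differences between successive ids.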

