JASSv2
JASS::ciff_lin Class Reference

Reader for Jimmy Lin's shared index format.

#include <ciff_lin.h>

Classes

class  doc_record
 A document record object containing document lengths and primary keys.
 
class  docrecords_foreach
 An object used to allow iteration over document records.
 
class  docrecords_iterator
 Iterator class for iterating over an index.
 
class  header
 The header of the CIFF file. It appears first in the file and describes how many postings lists and document details are included.
 
class  postings_foreach
 An object used to allow iteration over postings lists.
 
class  postings_list
 A postings list with a term, df, cf, and postings list of <d,tf> pairs.
 
class  postings_list_iterator
 Iterator class for iterating over an index.
 

Public Types

enum  error_code { OK = 0, FAIL = 1 }
 Success or failure.
 

Public Member Functions

 ciff_lin (const uint8_t *source_file)
 Constructor.
 
postings_foreach postings (void)
 Return an object capable of being an iterator for postings lists. Assumes the "file pointer" is in the right place.
 
docrecords_foreach docrecords (void)
 Return an object capable of being an iterator for document details. Assumes the "file pointer" is in the right place.
 
header & get_header (void)
 Return the header object.
 

Public Attributes

error_code status
 OK or FAIL (FAIL only on error in input stream)
 

Protected Member Functions

error_code read_header (header &header)
 Read the CIFF header containing details about how many postings lists, etc.
 

Private Attributes

const uint8_t * source_file
 The CIFF file in memory.
 
const uint8_t * stream
 Where in the CIFF we currently are.
 
header ciff_header
 The header from the CIFF file.
 

Detailed Description

Reader for Jimmy Lin's shared index format.

Jimmy uses Anserini to index and then exports using Google protocol buffers. The protocol buffer format is specified by:

syntax = "proto3";
package io.osirrc.ciff;

// An index stored in CIFF is a single file comprised of exactly the following:
// - A Header protobuf message,
// - Exactly the number of PostingsList messages specified in the num_postings_lists field of the Header
// - Exactly the number of DocRecord messages specified in the num_doc_records field of the Header
// The protobuf messages are defined below.

// This is the CIFF header. It always comes first.
message Header {
  int32 version = 1; // Version.
  int32 num_postings_lists = 2; // Exactly the number of PostingsList messages that follow the Header.
  int32 num_docs = 3; // Exactly the number of DocRecord messages that follow the PostingsList messages.
  // The total number of postings lists in the collection; the vocabulary size. This might differ from
  // num_postings_lists, for example, because we only export the postings lists of query terms.
  int32 total_postings_lists = 4;
  // The total number of documents in the collection; might differ from num_doc_records for a similar reason as above.
  int32 total_docs = 5;
  // The total number of terms in the entire collection. This is the sum of all document lengths of all documents in
  // the collection.
  int64 total_terms_in_collection = 6;
  // The average document length. We store this value explicitly in case the exporting application wants a particular
  // level of precision.
  double average_doclength = 7;
  // Description of this index, meant for human consumption. Describing, for example, the exporting application,
  // document processing and tokenization pipeline, etc.
  string description = 8;
}

// An individual posting.
message Posting {
  int32 docid = 1;
  int32 tf = 2;
}

// A postings list, comprised of one or more postings.
message PostingsList {
  string term = 1; // The term.
  int64 df = 2; // The document frequency.
  int64 cf = 3; // The collection frequency.
  repeated Posting postings = 4;
}

// A record containing metadata about an individual document.
message DocRecord {
  int32 docid = 1; // Refers to the docid in the postings lists.
  string collection_docid = 2; // Refers to a docid in the external collection.
  int32 doclength = 3; // Length of this document.
}

Each PostingsList is written using writeDelimitedTo(), so each postings list is prefixed by its length, encoded as a varint.

This code provides an iterator over a file of this format (once read into memory).

For details of the encoding see: https://developers.google.com/protocol-buffers/docs/encoding
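
As a rough illustration of the length-prefixed layout described above, the following sketch decodes a base-128 varint, uses it as the length prefix of one message, and then reads the two varint fields of a Posting. It is not the JASS implementation: the names read_varint, posting and decode_posting are hypothetical, the input is assumed to be well formed (there is no bounds checking), and the Posting is assumed to be framed only by a length, whereas inside a real PostingsList it would also be preceded by its field tag.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not part of JASS): decode one base-128 varint and
// advance the stream past it. Assumes well-formed input.
inline uint64_t read_varint(const uint8_t *&stream)
	{
	uint64_t value = 0;
	for (int shift = 0; shift < 64; shift += 7)
		{
		uint8_t byte = *stream++;
		value |= static_cast<uint64_t>(byte & 0x7F) << shift;   // low 7 bits carry data
		if ((byte & 0x80) == 0)                                  // high bit clear: last byte
			break;
		}
	return value;
	}

// Hypothetical stand-in for the Posting message (docid = field 1, tf = field 2).
struct posting
	{
	int32_t docid = 0;
	int32_t tf = 0;
	};

// Decode one length-prefixed Posting: a varint length, then (tag, value) pairs.
// Only wire type 0 (varint) is handled, which is all the Posting message uses.
inline posting decode_posting(const uint8_t *&stream)
	{
	posting result;
	uint64_t length = read_varint(stream);
	const uint8_t *end = stream + length;
	while (stream < end)
		{
		uint64_t tag = read_varint(stream);
		uint64_t value = read_varint(stream);
		switch (tag >> 3)                 // the field number is in the upper bits of the tag
			{
			case 1: result.docid = static_cast<int32_t>(value); break;
			case 2: result.tf = static_cast<int32_t>(value); break;
			default: break;               // ignore unknown varint fields
			}
		}
	return result;
	}
```

For instance, the six bytes 05 08 AC 02 10 03 decode as a length of 5, followed by docid 300 and tf 3, and leave the stream pointer just past the message, ready for the next length prefix.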

Member Enumeration Documentation

◆ error_code

success or failure.

Enumerator
OK 

Method completed successfully.

FAIL 

Method did not complete successfully.

Constructor & Destructor Documentation

◆ ciff_lin()

JASS::ciff_lin::ciff_lin ( const uint8_t *  source_file)
inline

Constructor.

Parameters
source_file[in] A pointer to the protobuf file, already read into memory.
source_file_length[in] The length (in bytes) of the source file once in memory.

Member Function Documentation

◆ docrecords()

docrecords_foreach JASS::ciff_lin::docrecords ( void  )
inline

Return an object capable of being an iterator for document details. Assumes the "file pointer" is in the right place.

Returns
an object for use in a for statement thus: for (const auto &docrecord : ciff.docrecords())

◆ get_header()

header& JASS::ciff_lin::get_header ( void  )
inline

Return the header object.

Returns
a ciff_lin::header object from the start of the file

◆ postings()

postings_foreach JASS::ciff_lin::postings ( void  )
inline

Return an object capable of being an iterator for postings lists. Assumes the "file pointer" is in the right place.

Returns
an object for use in a for statement thus: for (const auto &postings_list : ciff.postings())
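
A range-based for statement of this kind only needs the returned object to supply begin() and end(). The following is a minimal sketch of that pattern using hypothetical stand-in types (toy_postings_list, toy_iterator, toy_foreach, toy_index). It does not reproduce the real postings_foreach or postings_list_iterator classes, whose members are not shown here, and it iterates over an in-memory vector rather than a CIFF byte stream.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a postings list; the real class carries more data.
struct toy_postings_list
	{
	std::string term;
	std::size_t document_frequency;
	};

// Iterator over the stand-in lists: range-based for needs !=, ++ and *.
class toy_iterator
	{
	private:
		const std::vector<toy_postings_list> *lists;
		std::size_t position;

	public:
		toy_iterator(const std::vector<toy_postings_list> &source, std::size_t start) :
			lists(&source), position(start) {}

		bool operator!=(const toy_iterator &other) const { return position != other.position; }
		toy_iterator &operator++() { ++position; return *this; }
		const toy_postings_list &operator*() const { return (*lists)[position]; }
	};

// The "foreach" object: its only job is to supply begin() and end().
class toy_foreach
	{
	private:
		const std::vector<toy_postings_list> &lists;

	public:
		explicit toy_foreach(const std::vector<toy_postings_list> &source) : lists(source) {}
		toy_iterator begin() const { return toy_iterator(lists, 0); }
		toy_iterator end() const { return toy_iterator(lists, lists.size()); }
	};

// Stand-in for the reader: postings() hands back the foreach object by value.
class toy_index
	{
	private:
		std::vector<toy_postings_list> lists;

	public:
		explicit toy_index(std::vector<toy_postings_list> source) : lists(std::move(source)) {}
		toy_foreach postings() const { return toy_foreach(lists); }
	};

// Example use: sum the document frequencies of every list in the index.
inline std::size_t total_document_frequency(const toy_index &index)
	{
	std::size_t total = 0;
	for (const auto &postings_list : index.postings())
		total += postings_list.document_frequency;
	return total;
	}
```

Returning a small foreach object by value keeps the reader class itself free of begin() and end(), so one reader can offer several independent iterations (here postings, in the documented class also document records).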

◆ read_header()

error_code JASS::ciff_lin::read_header ( header &  header)
inline protected

Read the CIFF header containing details about how many postings lists, etc.

Parameters
header[out] The header once read.
Returns
OK on success, FAIL on failure.

The documentation for this class was generated from the following file: