JASSv2
ciff_lin.h File Reference

Reader for Jimmy Lin's shared index format.

#include <stdint.h>
#include <vector>
#include <limits>
#include "posting.h"
#include "protobuf.h"

Classes

class  JASS::ciff_lin
 Reader for Jimmy Lin's shared index format.
 
class  JASS::ciff_lin::header
 The header of the CIFF file. It appears first in the file and describes how many postings lists and document records are included.
 
class  JASS::ciff_lin::postings_list
 A postings list with a term, df, cf, and a list of <docid, tf> pairs.
 
class  JASS::ciff_lin::doc_record
 A document record object containing document lengths and primary keys.
 
class  JASS::ciff_lin::postings_list_iterator
 Iterator class for iterating over the postings lists of an index.
 
class  JASS::ciff_lin::postings_foreach
 An object used to allow iteration over postings lists.
 
class  JASS::ciff_lin::docrecords_iterator
 Iterator class for iterating over the document records of an index.
 
class  JASS::ciff_lin::docrecords_foreach
 An object used to allow iteration over document records.
 

Detailed Description

Reader for Jimmy Lin's shared index format.

Author
Andrew Trotman

Jimmy uses Anserini to build the index and then exports it using Google Protocol Buffers. The protocol buffer format is specified by:

syntax = "proto3";
package io.anserini.cidxf;
message Posting {
    int32 docid = 1;
    int32 tf = 2;
}
message PostingsList {
    string term = 1;
    int64 df = 2;
    int64 cf = 3;
    repeated Posting posting = 4;
}

Each PostingsList is written using writeDelimitedTo(), so each postings list in the file is prefixed by its length in bytes, encoded as a varint.
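
The length-prefixed framing can be sketched as follows. This is a minimal illustration, not the JASS implementation: read_varint() and split_delimited() are hypothetical helpers invented for this example, and the sketch assumes the whole file has already been read into a std::vector<uint8_t>.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper (not part of JASS): decode one base-128 varint starting
// at data[offset]. On success, advance offset past the varint and return true.
bool read_varint(const std::vector<uint8_t> &data, size_t &offset, uint64_t &value)
	{
	uint64_t result = 0;
	size_t position = offset;

	for (int shift = 0; shift < 64 && position < data.size(); shift += 7)
		{
		uint8_t byte = data[position++];
		result |= static_cast<uint64_t>(byte & 0x7F) << shift;		// low 7 bits carry the payload
		if ((byte & 0x80) == 0)										// high bit clear marks the last byte
			{
			value = result;
			offset = position;
			return true;
			}
		}

	return false;		// truncated or over-long varint
	}

// Hypothetical helper: split a buffer of length-prefixed messages into the raw
// bytes of each message. Stops at the first malformed or truncated record.
std::vector<std::string> split_delimited(const std::vector<uint8_t> &data)
	{
	std::vector<std::string> messages;
	size_t offset = 0;
	uint64_t length = 0;

	while (offset < data.size() && read_varint(data, offset, length))
		{
		if (length > data.size() - offset)
			break;		// the length prefix runs past the end of the buffer
		messages.emplace_back(data.begin() + offset, data.begin() + offset + length);
		offset += length;
		}

	return messages;
	}
```

Each returned string holds the undecoded bytes of one message; decoding the fields inside it is a separate step.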

This code provides an iterator over a file of this format, once the file has been read into memory.

For details of the encoding see: https://developers.google.com/protocol-buffers/docs/encoding
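
As an illustration of that encoding, the sketch below decodes the bytes of a single Posting message. It is a sketch under stated assumptions, not the JASS code: the posting struct, next_varint() and decode_posting() are hypothetical names, and the decoder only handles the varint wire type (0), which is the wire type of both fields of Posting. Each field starts with a key equal to (field_number << 3) | wire_type, followed by the value.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical in-memory form of the Posting message (names are illustrative).
struct posting
	{
	int32_t docid = 0;
	int32_t tf = 0;
	};

// Hypothetical helper: decode one varint from [at, end), advancing at.
static bool next_varint(const uint8_t *&at, const uint8_t *end, uint64_t &value)
	{
	value = 0;
	for (int shift = 0; shift < 64 && at < end; shift += 7)
		{
		uint8_t byte = *at++;
		value |= static_cast<uint64_t>(byte & 0x7F) << shift;
		if ((byte & 0x80) == 0)
			return true;
		}
	return false;		// truncated or over-long varint
	}

// Decode the body of one Posting message (without any length prefix).
// Returns false on malformed input or on a wire type this sketch does not handle.
bool decode_posting(const uint8_t *at, size_t length, posting &into)
	{
	const uint8_t *end = at + length;

	while (at < end)
		{
		uint64_t key, value;

		if (!next_varint(at, end, key))
			return false;
		if ((key & 0x07) != 0)
			return false;		// only wire type 0 (varint) is handled here
		if (!next_varint(at, end, value))
			return false;

		switch (key >> 3)		// the field number
			{
			case 1:
				into.docid = static_cast<int32_t>(value);
				break;
			case 2:
				into.tf = static_cast<int32_t>(value);
				break;
			default:
				break;			// unknown varint field, ignored
			}
		}

	return true;
	}
```

Because proto3 omits fields that hold their default value, an empty message decodes to a posting with docid and tf both zero.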