JASSv2
JASS::instream_document_trec Class Reference

Child class of instream for creating documents from TREC pre-web (i.e. news articles) data.

#include <instream_document_trec.h>


Public Member Functions

 instream_document_trec (const instream_document_trec &previous)=delete
 Copy constructor (not available).
 
 instream_document_trec (std::shared_ptr< instream > &source, const std::string &document_tag="DOC", const std::string &document_primary_key_tag="DOCNO")
 Constructor.
 
virtual ~instream_document_trec ()
 Destructor.
 
virtual void read (document &buffer)
 Read the next document from the source instream into document.
 
- Public Member Functions inherited from JASS::instream
 instream (std::shared_ptr< instream > &source, std::shared_ptr< allocator > &memory)
 Constructor.
 
 instream (std::shared_ptr< instream > &source)
 Constructor.
 
 instream (void)
 Constructor.
 
virtual ~instream ()
 Destructor.
 
size_t fetch (void *buffer, size_t bytes)
 fetch() generates a document object, sets its contents to the passed buffer, calls read(), and returns the number of bytes of data read.
 

Static Public Member Functions

static void unittest (void)
 Unit test this class.
 

Protected Member Functions

 instream_document_trec (std::shared_ptr< instream > &source, size_t buffer_size, const std::string &document_tag, const std::string &document_primary_key_tag)
 Protected constructor used to set the size of the internal buffer in the unittest.
 
void set_tags (const std::string &document_tag, const std::string &primary_key_tag)
 Register the document tag and the primary key tag. Used to set up internal data structures.
 
void fetch (void *buffer, size_t bytes)
 Fetch another block of data from the source.
 

Protected Attributes

size_t buffer_size
 Size of the disk read buffer. Normally 16MB.
 
uint8_t * buffer
 Pointer to the internal buffer from which documents are extracted. Filled by calling source.read().
 
uint8_t * buffer_end
 Pointer to the end of the buffer (used to prevent read past EOF).
 
size_t buffer_used
 The number of bytes of buffer that have already been used (buffer + buffer_used points to the unused data in buffer).
 
std::string document_start_tag
 The start tag used to delineate documents ("<DOC>" by default).
 
std::string document_end_tag
 The end tag used to mark the end of a document ("</DOC>" by default).
 
std::string primary_key_start_tag
 The primary key's start tag ("<DOCNO>" by default)
 
std::string primary_key_end_tag
 The primary key's end tag ("</DOCNO>" by default)
 
- Protected Attributes inherited from JASS::instream
std::shared_ptr< instream > source
 If this object is reading from another instream then this is that instream.
 
std::shared_ptr< allocator > memory
 Any and all memory allocation must happen using this object.
 

Detailed Description

Child class of instream for creating documents from TREC pre-web (i.e. news articles) data.

Connect an object of this class to an input stream and each call to read() will return one document in TREC news-article format. Documents are found by looking for <DOC> and </DOC> tags in the source stream. Document primary keys are assumed to be between <DOCNO> and </DOCNO> tags.

Examples:
parser_use.cpp.
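
The tag scanning described above can be illustrated with a minimal, self-contained sketch. It is not the JASS implementation: trec_record, trec_splitter, and next() are hypothetical names, the whole input is held in a std::string rather than read through an instream, and treating an unterminated document as end of input is an assumption made for the sketch only.

```cpp
#include <cstddef>
#include <string>

// Hypothetical record type: one document plus its primary key.
struct trec_record
{
    std::string contents;     // everything from the start tag to the end tag, inclusive
    std::string primary_key;  // text between the primary key tags, whitespace trimmed
};

// Hypothetical splitter (not part of JASS): yields one document per call to next().
class trec_splitter
{
private:
    std::string text;
    std::size_t position;
    std::string doc_start, doc_end, key_start, key_end;

public:
    trec_splitter(const std::string &input,
                  const std::string &document_tag = "DOC",
                  const std::string &primary_key_tag = "DOCNO") :
        text(input), position(0),
        doc_start("<" + document_tag + ">"), doc_end("</" + document_tag + ">"),
        key_start("<" + primary_key_tag + ">"), key_end("</" + primary_key_tag + ">")
    {
    }

    // Fill record with the next document. Returns false, leaving record empty,
    // when no complete document remains.
    bool next(trec_record &record)
    {
        record.contents.clear();
        record.primary_key.clear();

        std::size_t start = text.find(doc_start, position);
        if (start == std::string::npos)
            return false;
        std::size_t end = text.find(doc_end, start + doc_start.size());
        if (end == std::string::npos)
            return false;  // assumption: an unterminated document ends the input
        end += doc_end.size();

        record.contents = text.substr(start, end - start);
        position = end;

        // Look for the primary key inside this document only.
        std::size_t key_from = record.contents.find(key_start);
        if (key_from == std::string::npos)
            return true;
        key_from += key_start.size();
        std::size_t key_to = record.contents.find(key_end, key_from);
        if (key_to == std::string::npos)
            return true;

        std::string key = record.contents.substr(key_from, key_to - key_from);
        std::size_t first = key.find_first_not_of(" \t\r\n");
        std::size_t last = key.find_last_not_of(" \t\r\n");
        if (first != std::string::npos)
            record.primary_key = key.substr(first, last - first + 1);
        return true;
    }
};
```

Calling next() in a loop until it returns false visits every document in order; any text outside the document tags is skipped.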

Constructor & Destructor Documentation

◆ instream_document_trec() [1/3]

JASS::instream_document_trec::instream_document_trec ( std::shared_ptr< instream > &  source,
size_t  buffer_size,
const std::string &  document_tag,
const std::string &  document_primary_key_tag 
)
protected

Protected constructor used to set the size of the internal buffer in the unittest.

Parameters
source[in] The instream responsible for providing data to this class.
buffer_size[in] The size of the internal buffer filled from source.
document_tag[in] The name of the tag used to delineate documents.
document_primary_key_tag[in] The name of the element that contains the document's primary key.

◆ instream_document_trec() [2/3]

JASS::instream_document_trec::instream_document_trec ( const instream_document_trec &  previous )
delete

Copy constructor (not available).

Parameters
previous[in] The instance to copy.

◆ instream_document_trec() [3/3]

JASS::instream_document_trec::instream_document_trec ( std::shared_ptr< instream > &  source,
const std::string &  document_tag = "DOC",
const std::string &  document_primary_key_tag = "DOCNO" 
)

Constructor.

Parameters
source[in] The instream responsible for providing data to this class.
document_tag[in] The name of the tag used to delineate documents (default = "DOC").
document_primary_key_tag[in] The name of the element that contains the document's primary key (default = "DOCNO").

Member Function Documentation

◆ fetch()

void JASS::instream_document_trec::fetch ( void *  buffer,
size_t  bytes 
)
inline protected

Fetch another block of data from the source.

Parameters
buffer[out] Write bytes amount of data into this memory location.
bytes[in] Read this amount of data from the source.
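
This page does not document how the internal buffer is refilled once part of it has been consumed. One common approach is compact-and-refill: slide the unread bytes to the front of the buffer, then fetch enough new data to fill the remainder. The sketch below illustrates that general technique only and should not be read as a description of what JASS does. string_source, read_buffer, and refill() are hypothetical; unlike the method documented above, the stand-in fetch() returns a byte count; and buffer_end is stored as an offset rather than a pointer.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical stand-in for the source instream: hands out bytes from a string.
struct string_source
{
    std::string data;
    std::size_t offset;

    // Copy up to `bytes` bytes into `into`; return how many were copied (0 at end of data).
    std::size_t fetch(void *into, std::size_t bytes)
    {
        std::size_t available = std::min(bytes, data.size() - offset);
        std::memcpy(into, data.data() + offset, available);
        offset += available;
        return available;
    }
};

// Hypothetical buffer state, named after the protected attributes listed on this page.
struct read_buffer
{
    std::vector<std::uint8_t> buffer;
    std::size_t buffer_used;  // bytes at the front that have already been consumed
    std::size_t buffer_end;   // number of valid bytes in buffer (an offset in this sketch)

    explicit read_buffer(std::size_t buffer_size) :
        buffer(buffer_size), buffer_used(0), buffer_end(0)
    {
    }

    // Slide the unread bytes to the front, then top the buffer up from the source.
    // Returns the number of new bytes fetched.
    std::size_t refill(string_source &source)
    {
        std::size_t unread = buffer_end - buffer_used;
        std::memmove(buffer.data(), buffer.data() + buffer_used, unread);
        buffer_used = 0;
        std::size_t fetched = source.fetch(buffer.data() + unread, buffer.size() - unread);
        buffer_end = unread + fetched;
        return fetched;
    }
};
```

Keeping the unread tail in place means a document that straddles two fetches is still contiguous in memory after the refill, provided the buffer is at least as large as the largest document.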

◆ read()

void JASS::instream_document_trec::read ( document &  buffer )
virtual

Read the next document from the source instream into document.

Parameters
buffer[out] The next document in the source instream.

Implements JASS::instream.

Examples:
parser_use.cpp.

◆ set_tags()

void JASS::instream_document_trec::set_tags ( const std::string &  document_tag,
const std::string &  primary_key_tag 
)
protected

Register the document tag and the primary key tag. Used to set up internal data structures.

Parameters
document_tag[in] The name of the tag used to delineate documents.
primary_key_tag[in] The name of the element that contains the document's primary key.
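
The relationship between the bare tag names passed in and the four tag strings listed under Protected Attributes can be sketched as follows. The defaults quoted on this page ("DOC" giving "<DOC>" and "</DOC>", "DOCNO" giving "<DOCNO>" and "</DOCNO>") suggest simple bracket wrapping, which is what this sketch assumes. tag_set and make_tags() are hypothetical names, and the real method may set up further internal state that is not shown here.

```cpp
#include <string>

// Hypothetical holder for the four tag strings (field names follow the protected attributes).
struct tag_set
{
    std::string document_start_tag;
    std::string document_end_tag;
    std::string primary_key_start_tag;
    std::string primary_key_end_tag;
};

// Hypothetical helper: derive start and end tags from bare element names.
tag_set make_tags(const std::string &document_tag, const std::string &primary_key_tag)
{
    tag_set tags;
    tags.document_start_tag = "<" + document_tag + ">";
    tags.document_end_tag = "</" + document_tag + ">";
    tags.primary_key_start_tag = "<" + primary_key_tag + ">";
    tags.primary_key_end_tag = "</" + primary_key_tag + ">";
    return tags;
}
```

Building the full tag strings once, up front, lets each later search compare against a ready-made string instead of reassembling it for every document.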
