Child class of instream for creating documents from TREC pre-web (i.e. news articles) data.
#include <instream_document_trec.h>
|
|
static void | unittest (void) |
| | Unit test this class.
|
| |
|
| | instream_document_trec (std::shared_ptr< instream > &source, size_t buffer_size, const std::string &document_tag, const std::string &document_primary_key_tag) |
| | Protected constructor used to set the size of the internal buffer in the unittest.
|
| |
| void | set_tags (const std::string &document_tag, const std::string &primary_key_tag) |
| | Register the document tag and the primary key tag. Used to set up internal data structures.
|
| |
| void | fetch (void *buffer, size_t bytes) |
| | Fetch another block of data from the source.
|
| |
|
|
size_t | buffer_size |
| | Size of the disk read buffer. Normally 16MB.
|
| |
|
uint8_t * | buffer |
| | Pointer to the internal buffer from which documents are extracted. Filled by calling source.read().
|
| |
|
uint8_t * | buffer_end |
| | Pointer to the end of the buffer (used to prevent read past EOF).
|
| |
|
size_t | buffer_used |
| | The number of bytes of buffer that have already been consumed (buffer + buffer_used points to the unused data in buffer).
|
| |
|
std::string | document_start_tag |
| | The start tag used to delineate documents ("<DOC>" by default)
|
| |
|
std::string | document_end_tag |
| | The end tag used to mark the end of a document ("</DOC>" by default)
|
| |
|
std::string | primary_key_start_tag |
| | The primary key's start tag ("<DOCNO>" by default)
|
| |
|
std::string | primary_key_end_tag |
| | The primary key's end tag ("</DOCNO>" by default)
|
| |
|
std::shared_ptr< instream > | source |
| | If this object is reading from another instream then this is that instream.
|
| |
|
std::shared_ptr< allocator > | memory |
| | Any and all memory allocation must happen using this object.
|
| |
Connect an object of this class to an input stream and each call to read() returns one TREC news-article formatted document. Documents are located by looking for <DOC> and </DOC> tags in the source stream. Document primary keys are assumed to lie between <DOCNO> and </DOCNO> tags.
- Examples:
- parser_use.cpp.
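The tag-scanning approach described above can be illustrated with a minimal, self-contained sketch. It is not the JASS implementation: it works on a whole std::string rather than a buffered instream, and the names trec_record, between(), trim() and split_trec() are hypothetical helpers introduced only for this illustration. Trimming whitespace around the primary key is also an assumption of the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record type for this sketch; not part of JASS.
struct trec_record
    {
    std::string primary_key;    // text found between the primary key tags
    std::string contents;       // the document, from start tag to end tag inclusive
    };

// Return the text between start_tag and end_tag in text, or "" if either tag is missing.
static std::string between(const std::string &text, const std::string &start_tag, const std::string &end_tag)
    {
    size_t from = text.find(start_tag);
    if (from == std::string::npos)
        return "";
    from += start_tag.size();
    size_t to = text.find(end_tag, from);
    if (to == std::string::npos)
        return "";
    return text.substr(from, to - from);
    }

// Strip leading and trailing whitespace (an assumption of this sketch).
static std::string trim(const std::string &text)
    {
    const char *whitespace = " \t\r\n";
    size_t first = text.find_first_not_of(whitespace);
    if (first == std::string::npos)
        return "";
    size_t last = text.find_last_not_of(whitespace);
    return text.substr(first, last - first + 1);
    }

// Split data into documents delimited by <document_tag> ... </document_tag>.
std::vector<trec_record> split_trec(const std::string &data, const std::string &document_tag = "DOC", const std::string &primary_key_tag = "DOCNO")
    {
    assert(!document_tag.empty() && !primary_key_tag.empty());
    const std::string document_start = "<" + document_tag + ">";
    const std::string document_end = "</" + document_tag + ">";
    const std::string key_start = "<" + primary_key_tag + ">";
    const std::string key_end = "</" + primary_key_tag + ">";

    std::vector<trec_record> records;
    size_t position = 0;
    while (true)
        {
        size_t start = data.find(document_start, position);
        if (start == std::string::npos)
            break;
        size_t end = data.find(document_end, start + document_start.size());
        if (end == std::string::npos)
            break;              // incomplete final document, so stop
        end += document_end.size();

        trec_record record;
        record.contents = data.substr(start, end - start);
        record.primary_key = trim(between(record.contents, key_start, key_end));
        records.push_back(record);
        position = end;
        }
    return records;
    }
```

Calling split_trec() on a string holding two complete documents yields two records, each carrying its own primary key. Any text outside the document tags, and any unterminated trailing document, is skipped.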
◆ instream_document_trec() [1/3]
| JASS::instream_document_trec::instream_document_trec |
( |
std::shared_ptr< instream > & |
source, |
|
|
size_t |
buffer_size, |
|
|
const std::string & |
document_tag, |
|
|
const std::string & |
document_primary_key_tag |
|
) |
| |
|
protected |
Protected constructor used to set the size of the internal buffer in the unittest.
- Parameters
-
| source | [in] The instream responsible for providing data to this class. |
| buffer_size | [in] The size of the internal buffer filled from source. |
| document_tag | [in] The name of the tag used to delineate documents. |
| document_primary_key_tag | [in] The name of the element that contains the document's primary key. |
◆ instream_document_trec() [2/3]
Copy constructor (not available).
- Parameters
-
| previous | [in] The instance to copy. |
◆ instream_document_trec() [3/3]
| JASS::instream_document_trec::instream_document_trec |
( |
std::shared_ptr< instream > & |
source, |
|
|
const std::string & |
document_tag = "DOC", |
|
|
const std::string & |
document_primary_key_tag = "DOCNO" |
|
) |
| |
Constructor.
- Parameters
-
| source | [in] The instream responsible for providing data to this class. |
| document_tag | [in] The name of the tag used to delineate documents (default = "DOC"). |
| document_primary_key_tag | [in] The name of the element that contains the document's primary key (default = "DOCNO"). |
◆ fetch()
| void JASS::instream_document_trec::fetch |
( |
void * |
buffer, |
|
|
size_t |
bytes |
|
) |
| |
|
inline protected
Fetch another block of data from the source.
- Parameters
-
| buffer | [out] Write bytes amount of data into this memory location. |
| bytes | [in] Read this amount of data from the source. |
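As a rough illustration of how a fetch-style call and a buffer_used counter can work together, the sketch below refills a fixed-size buffer by sliding the unconsumed bytes to the front and topping up from the source. This is an assumed technique, not a description of the JASS internals. The types memory_source and buffered_reader, the member buffer_filled, and the refill() method are hypothetical. Unlike the documented fetch(), the sketch's fetch() returns the number of bytes copied so that the end of input can be detected.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical in-memory source standing in for an instream; not part of JASS.
struct memory_source
    {
    std::string data;
    size_t position = 0;

    // Copy up to bytes bytes into buffer and return how many were copied (0 at end of data).
    size_t fetch(void *buffer, size_t bytes)
        {
        size_t available = std::min(bytes, data.size() - position);
        std::memcpy(buffer, data.data() + position, available);
        position += available;
        return available;
        }
    };

// Hypothetical reader with a fixed-size buffer (buffer_size is assumed to be non-zero).
struct buffered_reader
    {
    memory_source &source;
    std::vector<uint8_t> buffer;
    size_t buffer_used = 0;      // bytes at the front of buffer already consumed
    size_t buffer_filled = 0;    // bytes of buffer holding valid data

    buffered_reader(memory_source &source, size_t buffer_size) :
        source(source),
        buffer(buffer_size)
        {
        }

    // Slide the unconsumed tail to the front, top up from the source,
    // and return the number of valid bytes now in buffer.
    size_t refill()
        {
        size_t remaining = buffer_filled - buffer_used;
        std::memmove(buffer.data(), buffer.data() + buffer_used, remaining);
        buffer_used = 0;
        buffer_filled = remaining + source.fetch(buffer.data() + remaining, buffer.size() - remaining);
        return buffer_filled;
        }
    };
```

A caller would advance buffer_used as it extracts documents and call refill() when the remaining data does not contain a complete document. A return value of 0 means both the buffer and the source are exhausted.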
◆ read()
| void JASS::instream_document_trec::read |
( |
document & |
buffer | ) |
|
|
virtual |
Read the next document from the source instream into buffer.
- Parameters
-
| buffer | [out] The next document in the source instream. |
Implements JASS::instream.
- Examples:
- parser_use.cpp.
◆ set_tags()
| void JASS::instream_document_trec::set_tags |
( |
const std::string & |
document_tag, |
|
|
const std::string & |
primary_key_tag |
|
) |
| |
|
protected |
Register the document tag and the primary key tag. Used to set up internal data structures.
- Parameters
-
| document_tag | [in] The name of the tag used to delineate documents. |
| primary_key_tag | [in] The name of the element that contains the document's primary key. |
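The relationship between the tag names passed in and the four tag members documented above (for example "DOC" giving "<DOC>" and "</DOC>") can be sketched as follows. This is only an illustration of that mapping. The struct tag_set and the function make_tags() are hypothetical, and the real set_tags() may set up additional internal state.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the four tag members of the class; not the JASS class itself.
struct tag_set
    {
    std::string document_start_tag;
    std::string document_end_tag;
    std::string primary_key_start_tag;
    std::string primary_key_end_tag;
    };

// Build the start and end tags from bare tag names, using the documented defaults.
tag_set make_tags(const std::string &document_tag = "DOC", const std::string &primary_key_tag = "DOCNO")
    {
    assert(!document_tag.empty() && !primary_key_tag.empty());

    tag_set tags;
    tags.document_start_tag = "<" + document_tag + ">";
    tags.document_end_tag = "</" + document_tag + ">";
    tags.primary_key_start_tag = "<" + primary_key_tag + ">";
    tags.primary_key_end_tag = "</" + primary_key_tag + ">";
    return tags;
    }
```

With the defaults, make_tags() produces "<DOC>", "</DOC>", "<DOCNO>" and "</DOCNO>". Passing other names, such as "article" and "id", produces the corresponding tags for a collection that uses different markup.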
The documentation for this class was generated from the following files: