JASSv2
JASS::parser Class Reference

Simple, but fast, XML parser.

#include <parser.h>

Classes

class  token
 A token as returned by the parser.
 

Public Member Functions

 parser ()
 Constructor.
 
virtual ~parser ()
 Destructor.
 
virtual void set_document (const class document &document)
 Start parsing from the start of this document.
 
virtual void set_document (const std::string &document)
 Parse a string (rather than a document).
 
virtual const class parser::token & get_next_token (void)
 Continue parsing the input looking for the next token.
 

Static Public Member Functions

static size_t unittest_count (const char *string)
 Count the number of tokens in the given string.
 
static void unittest (void)
 Unit test this class.
 

Protected Member Functions

void build_unicode_alphabetic_token (uint32_t codepoint, size_t bytes, uint8_t *&buffer_pos, uint8_t *buffer_end)
 Helper function used to build an alphabetic token from UTF-8.
 
void build_unicode_numeric_token (uint32_t codepoint, size_t bytes, uint8_t *&buffer_pos, uint8_t *buffer_end)
 Helper function used to build a numeric token from UTF-8.
 

Protected Attributes

token eof_token
 Sentinel returned when reading past end of document.
 
const document * the_document
 The document that is currently being parsed.
 
const uint8_t * current
 The current location within the document.
 
const uint8_t * end_of_document
 Pointer to the end of the document, used to avoid reading past the end of the buffer.
 
token current_token
 The token that is currently being built. A reference to this is returned when the token is complete.
 

Private Attributes

document build_document
 A document used when a string is passed into this object.
 

Detailed Description

Simple, but fast, XML parser.

This is the parser (tokenizer, or lexical analyser) that is most likely to get used for most documents, especially TREC collections. It does not manage entity references (it strips the '&' and the ';') and it does not manage attributes, which are ignored. It does, however, manage start tags, end tags, alphabetic tokens, alphanumeric tokens, comments, and many other XML characteristics.
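To make the behaviour described above concrete, here is a minimal, self-contained sketch of a tokenizer in the same spirit. It is not the JASS implementation: the names `kind`, `simple_token`, and `tokenize` are hypothetical, it handles only ASCII, and it does not treat comments or other XML constructs specially. It shows tags reduced to their element name (attributes ignored) and the '&' and ';' of an entity reference being dropped.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token kinds for this sketch; the names are illustrative, not JASS's.
enum class kind { alpha, numeric, start_tag, end_tag };

struct simple_token
{
    kind type;
    std::string text;
};

// True if the character at position i of input is an ASCII letter or digit.
static bool is_alnum_at(const std::string &input, size_t i)
{
    return i < input.size() && std::isalnum(static_cast<unsigned char>(input[i]));
}

// Split input into tags, words, and numbers; everything else is skipped.
std::vector<simple_token> tokenize(const std::string &input)
{
    std::vector<simple_token> out;
    size_t i = 0;
    while (i < input.size())
    {
        unsigned char c = static_cast<unsigned char>(input[i]);
        if (c == '<')
        {
            // A tag: keep only the element name and ignore any attributes.
            ++i;
            bool closing = i < input.size() && input[i] == '/';
            if (closing)
                ++i;
            std::string name;
            while (is_alnum_at(input, i))
                name += input[i++];
            while (i < input.size() && input[i] != '>')
                ++i;                    // skip attributes up to the closing '>'
            if (i < input.size())
                ++i;                    // step over the '>'
            if (!name.empty())
                out.push_back({closing ? kind::end_tag : kind::start_tag, name});
        }
        else if (std::isalpha(c))
        {
            // A word starts with a letter and continues over letters and digits.
            std::string word;
            while (is_alnum_at(input, i))
                word += input[i++];
            out.push_back({kind::alpha, word});
        }
        else if (std::isdigit(c))
        {
            // A number is a run of digits.
            std::string number;
            while (i < input.size() && std::isdigit(static_cast<unsigned char>(input[i])))
                number += input[i++];
            out.push_back({kind::numeric, number});
        }
        else
            ++i;                        // whitespace, punctuation, '&' and ';' are dropped
    }
    return out;
}
```

In this sketch the input `<DOC id="7">AT&amp;T 42</DOC>` produces a start tag, the words `AT`, `amp`, and `T`, the number `42`, and an end tag; the attribute and the entity delimiters leave no tokens of their own.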

An example tying documents, instreams, and parsing together to count the number of documents and non-unique symbols is:

/*
    PARSER_USE.CPP
    --------------
    Copyright (c) 2016 Andrew Trotman
    Released under the 2-clause BSD license (See:https://en.wikipedia.org/wiki/BSD_licenses)
*/
#include <cstdio>
#include <cstdlib>
#include <memory>
#include <string>

#include "parser.h"
#include "instream_file.h"
#include "instream_document_trec.h"

/*
    MAIN()
    ------
*/
int main(int argc, char *argv[])
{
    /*
        allocate a document object and a parser object.
    */
    JASS::document document;
    JASS::parser parser;

    /*
        build a pipeline - recall that deletes cascade so file is deleted when source goes out of scope.
        The TREC document reader used for source is an assumption; substitute the reader that matches the collection.
    */
    if (argc < 2)
        exit(printf("Cannot parse filename\n"));
    std::string filename = argv[1];
    std::shared_ptr<JASS::instream> file(new JASS::instream_file(filename));
    JASS::instream_document_trec source(file);

    /*
        this program counts documents and alphabetic tokens in those documents.
    */
    size_t total_documents = 0;
    size_t alphas = 0;

    /*
        read documents, then parse them.
    */
    do
    {
        /*
            read the next document into the same memory the last document used.
        */
        document.rewind();
        source.read(document);

        /*
            eof is signaled as an empty document.
        */
        if (document.isempty())
            break;

        /*
            count documents.
        */
        total_documents++;

        /*
            now parse the document.
        */
        parser.set_document(document);
        bool finished = false;
        do
        {
            /*
                get the next token
            */
            const auto &token = parser.get_next_token();

            /*
                what type is that token (the case labels assume token::eof and token::alpha
                name the end-of-document and alphabetic token types)
            */
            switch (token.type)
            {
                case JASS::parser::token::eof:
                    /*
                        At end of document so signal to leave the loop.
                    */
                    finished = true;
                    break;
                case JASS::parser::token::alpha:
                    /*
                        Count the number of alphabetic tokens.
                    */
                    alphas++;
                    break;
                default:
                    /*
                        else ignore the token.
                    */
                    break;
            }
        }
        while (!finished);
    }
    while (!document.isempty());

    /*
        Dump out the number of documents and the number of tokens.
    */
    printf("Documents:%lld\n", (long long)total_documents);
    printf("alphas :%lld\n", (long long)alphas);
    return 0;
}
Examples:
parser_use.cpp.

Member Function Documentation

◆ build_unicode_alphabetic_token()

void JASS::parser::build_unicode_alphabetic_token ( uint32_t  codepoint,
size_t  bytes,
uint8_t *&  buffer_pos,
uint8_t *  buffer_end 
)
protected

Helper function used to build an alphabetic token from UTF-8.

Parameters
codepoint [in] The Unicode codepoint of the first character in the token (which must, by definition, be alphabetic).
bytes [in] The length of the UTF-8 representation of codepoint.
buffer_pos [in/out] Where the UTF-8 representation of the token should be written.
buffer_end [in] The end of the buffer_pos buffer (used to prevent writing past the end of the buffer).
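The parameter list suggests a common pattern: a write cursor passed by reference so the caller sees how far it advanced, plus an end pointer that bounds every write. The sketch below illustrates only that pattern for a single codepoint. `append_utf8` is a hypothetical helper, not a JASS function, and it makes no claim about how build_unicode_alphabetic_token() itself is implemented.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not part of JASS): write the UTF-8 encoding of codepoint
// at buffer_pos and advance buffer_pos past it. Returns the number of bytes
// written, or 0 (leaving buffer_pos untouched) if the codepoint is not a valid
// Unicode scalar value or the encoding would not fit before buffer_end.
size_t append_utf8(uint32_t codepoint, uint8_t *&buffer_pos, uint8_t *buffer_end)
{
    // Reject values outside Unicode and the UTF-16 surrogate range.
    if (codepoint > 0x10FFFF || (codepoint >= 0xD800 && codepoint <= 0xDFFF))
        return 0;

    // Length of the UTF-8 representation of this codepoint.
    size_t bytes = codepoint < 0x80 ? 1 : codepoint < 0x800 ? 2 : codepoint < 0x10000 ? 3 : 4;

    // Bounds check before any write so a failed call has no side effects.
    if (buffer_pos > buffer_end || static_cast<size_t>(buffer_end - buffer_pos) < bytes)
        return 0;

    switch (bytes)
    {
        case 1:
            *buffer_pos++ = static_cast<uint8_t>(codepoint);
            break;
        case 2:
            *buffer_pos++ = static_cast<uint8_t>(0xC0 | (codepoint >> 6));
            *buffer_pos++ = static_cast<uint8_t>(0x80 | (codepoint & 0x3F));
            break;
        case 3:
            *buffer_pos++ = static_cast<uint8_t>(0xE0 | (codepoint >> 12));
            *buffer_pos++ = static_cast<uint8_t>(0x80 | ((codepoint >> 6) & 0x3F));
            *buffer_pos++ = static_cast<uint8_t>(0x80 | (codepoint & 0x3F));
            break;
        default:
            *buffer_pos++ = static_cast<uint8_t>(0xF0 | (codepoint >> 18));
            *buffer_pos++ = static_cast<uint8_t>(0x80 | ((codepoint >> 12) & 0x3F));
            *buffer_pos++ = static_cast<uint8_t>(0x80 | ((codepoint >> 6) & 0x3F));
            *buffer_pos++ = static_cast<uint8_t>(0x80 | (codepoint & 0x3F));
            break;
    }
    return bytes;
}
```

Because buffer_pos is a reference to a pointer, repeated calls append consecutive characters to the same token buffer, and a return value of 0 tells the caller the buffer is full without having corrupted it.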

◆ build_unicode_numeric_token()

void JASS::parser::build_unicode_numeric_token ( uint32_t  codepoint,
size_t  bytes,
uint8_t *&  buffer_pos,
uint8_t *  buffer_end 
)
protected

Helper function used to build a numeric token from UTF-8.

Parameters
codepoint [in] The Unicode codepoint of the first character in the token (which must, by definition, be numeric).
bytes [in] The length of the UTF-8 representation of codepoint.
buffer_pos [in/out] Where the UTF-8 representation of the token should be written.
buffer_end [in] The end of the buffer_pos buffer (used to prevent writing past the end of the buffer).

◆ get_next_token()

const class parser::token & JASS::parser::get_next_token ( void  )
virtual

Continue parsing the input looking for the next token.

Returns
A reference to a token object that remains valid until the next call to get_next_token() or until the parser is destroyed, whichever comes first.
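The lifetime rule above means a caller that needs a token after asking for the next one must copy it first. The following self-contained sketch illustrates this with a hypothetical stand-in class, `reusing_parser`, which reuses one internal token between calls. It is an assumption-laden model of the contract, not JASS code, and it uses an empty string rather than an eof token type to signal the end.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a parser that reuses a single internal token
// object between calls, so each call overwrites what the last one returned.
class reusing_parser
{
public:
    explicit reusing_parser(std::vector<std::string> words) : words(std::move(words)) {}

    // Returns a reference to internal storage; the next call overwrites it.
    // An empty string plays the role of the end-of-document sentinel here.
    const std::string &get_next_token()
    {
        current_token = next < words.size() ? words[next++] : std::string();
        return current_token;
    }

private:
    std::vector<std::string> words;
    size_t next = 0;
    std::string current_token;
};

// Collect every token by copying it before asking for the next one.
std::vector<std::string> collect(reusing_parser &parser)
{
    std::vector<std::string> result;
    for (;;)
    {
        const std::string &token = parser.get_next_token();
        if (token.empty())
            break;
        result.push_back(token);    // copy now: the referenced storage is about to be reused
    }
    return result;
}
```

Holding on to the reference instead of copying is not a crash in this model, but the reference silently shows the newer token after the next call, which is the kind of bug the documented lifetime warns about.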

Reimplemented in JASS::parser_fasta, and JASS::parser_unicoil_json.

Examples:
parser_use.cpp.

◆ set_document() [1/2]

virtual void JASS::parser::set_document ( const class document &  document )
inline virtual

Start parsing from the start of this document.

The document must be a '\0' terminated string.

Parameters
document [in] The document to parse.

Reimplemented in JASS::parser_fasta.

Examples:
parser_use.cpp.

◆ set_document() [2/2]

virtual void JASS::parser::set_document ( const std::string &  document)
inline virtual

Parse a string (rather than a document).

Parameters
document [in] The document to parse. Must remain in scope for the entire parsing process (a copy is not taken).
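Since no copy is taken, the string must outlive every subsequent call on the parser. The sketch below models that with a hypothetical `borrowing_parser` that stores pointers into the caller's string; the class and the `count_in` helper are illustrative assumptions and are not JASS code.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical stand-in for a parser that keeps pointers into the caller's
// string rather than copying it.
class borrowing_parser
{
public:
    // Remember where the text starts and ends; the bytes are not copied.
    void set_document(const std::string &document)
    {
        current = document.data();
        end_of_document = current + document.size();
    }

    // Count the runs of ASCII letters and digits from current to the end.
    size_t count_tokens()
    {
        size_t count = 0;
        while (current < end_of_document)
        {
            // Skip separators.
            while (current < end_of_document && !std::isalnum(static_cast<unsigned char>(*current)))
                ++current;
            if (current < end_of_document)
                ++count;
            // Skip the body of the token just counted.
            while (current < end_of_document && std::isalnum(static_cast<unsigned char>(*current)))
                ++current;
        }
        return count;
    }

private:
    const char *current = nullptr;
    const char *end_of_document = nullptr;
};

// Safe pattern: text is alive for the whole time the parser reads from it.
size_t count_in(const std::string &text)
{
    borrowing_parser parser;
    parser.set_document(text);
    return parser.count_tokens();
}
```

The unsafe variant would be passing a temporary, for example the result of a function returning std::string by value, directly to set_document() and parsing on a later line: the temporary is destroyed at the end of that statement and the stored pointers dangle.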

◆ unittest_count()

size_t JASS::parser::unittest_count ( const char *  string)
static

Count the number of tokens in the given string.

Parameters
string [in] The string to count tokens in.
Returns
The number of tokens in string.

The documentation for this class was generated from the following files: