JASSv2
Simple, but fast, XML parser.
#include <parser.h>


Classes | |
| class | token |
| A token as returned by the parser. | |
Public Member Functions | |
| parser () | |
| Constructor. | |
| virtual | ~parser () |
| Destructor. | |
| virtual void | set_document (const class document &document) |
| Start parsing from the start of this document. | |
| virtual void | set_document (const std::string &document) |
| Parse a string (rather than a document). | |
| virtual const class parser::token & | get_next_token (void) |
| Continue parsing the input looking for the next token. | |
Static Public Member Functions | |
| static size_t | unittest_count (const char *string) |
| Count the number of tokens in the given string. | |
| static void | unittest (void) |
| Unit test this class. | |
Protected Member Functions | |
| void | build_unicode_alphabetic_token (uint32_t codepoint, size_t bytes, uint8_t *&buffer_pos, uint8_t *buffer_end) |
| Helper function used to build alphabetic token from UTF-8. | |
| void | build_unicode_numeric_token (uint32_t codepoint, size_t bytes, uint8_t *&buffer_pos, uint8_t *buffer_end) |
| Helper function used to build numeric token from UTF-8. | |
Protected Attributes | |
| token | eof_token |
| Sentinel returned when reading past end of document. | |
| const document * | the_document |
| The document that is currently being parsed. | |
| const uint8_t * | current |
| The current location within the document. | |
| const uint8_t * | end_of_document |
| Pointer to the end of the document, used to avoid reading past the end of the buffer. | |
| token | current_token |
| The token that is currently being built. A reference to this is returned when the token is complete. | |
Private Attributes | |
| document | build_document |
| A document used when a string is passed into this object. | |
Simple, but fast, XML parser.
This is the parser (tokenizer, or lexical analyser) most likely to be used for most documents, especially TREC collections. It does not manage entity references (it strips the '&' and the ';'). It does not manage attributes, which are ignored. It does, however, manage start tags, end tags, alphabetic tokens, alphanumeric tokens, comments, and many other XML characteristics.
A typical use ties documents, instreams, and parsing together to count the number of documents and non-unique symbols.
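The counting loop behind such a use can be sketched without the JASS headers. The block below is a self-contained stand-in, not the JASS implementation: `sketch_parser`, its `token::kind` values, and `count_tokens` are hypothetical names, and the tokenization rules are simplified assumptions drawn from the description above (tag names are kept, attributes are skipped, '&' and ';' are dropped as separators, comments are not handled). What it illustrates is the calling pattern: call set_document once, then call get_next_token until the end-of-document sentinel comes back.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical stand-in for the interface described above; NOT the JASS implementation.
class sketch_parser
{
public:
    struct token
    {
        enum kind { eof, alpha, numeric, start_tag, end_tag };
        kind type = eof;
        std::string text;
    };

    // Borrow the caller's string: only pointers are stored, no copy is taken.
    void set_document(const std::string &document)
    {
        current = document.data();
        end_of_document = current + document.size();
    }

    // Return the next token, or the eof sentinel once the input is exhausted.
    const token &get_next_token()
    {
        assert(current != nullptr);   // set_document() must have been called first

        // Skip separators; this also drops the '&' and ';' of entity references.
        while (current < end_of_document && *current != '<' && !is_alnum(*current))
            ++current;
        if (current >= end_of_document)
            return eof_token;

        current_token.text.clear();
        if (*current == '<')
        {
            ++current;
            current_token.type = token::start_tag;
            if (current < end_of_document && *current == '/')
            {
                current_token.type = token::end_tag;
                ++current;
            }
            while (current < end_of_document && is_alnum(*current))
                current_token.text += *current++;
            // Attributes are ignored: skip everything up to and including '>'.
            while (current < end_of_document && *current != '>')
                ++current;
            if (current < end_of_document)
                ++current;
            return current_token;
        }

        // A leading digit starts a numeric token; any other character here is a letter.
        const bool starts_with_digit = is_digit(*current);
        current_token.type = starts_with_digit ? token::numeric : token::alpha;
        while (current < end_of_document &&
               (starts_with_digit ? is_digit(*current) : is_alnum(*current)))
            current_token.text += *current++;
        return current_token;
    }

private:
    static bool is_alnum(char c) { return std::isalnum(static_cast<unsigned char>(c)) != 0; }
    static bool is_digit(char c) { return std::isdigit(static_cast<unsigned char>(c)) != 0; }

    token eof_token;                        // sentinel returned past end of input
    token current_token;                    // reused for every returned token
    const char *current = nullptr;          // read position
    const char *end_of_document = nullptr;  // one past the last byte
};

// Count every token up to, but not including, the eof sentinel.
inline std::size_t count_tokens(const std::string &text)
{
    sketch_parser parser;
    parser.set_document(text);
    std::size_t count = 0;
    while (parser.get_next_token().type != sketch_parser::token::eof)
        ++count;
    return count;
}
```

Under these assumptions, `count_tokens("<DOC id=\"1\">abc 123 &amp; x</DOC>")` yields 6: the start tag, `abc`, `123`, `amp`, `x`, and the end tag.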
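| void build_unicode_alphabetic_token (uint32_t codepoint, size_t bytes, uint8_t *&buffer_pos, uint8_t *buffer_end) | protected |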
Helper function used to build alphabetic token from UTF-8.
| codepoint | [in] The Unicode codepoint of the first character in the token (which must, by definition, be alphabetic). |
| bytes | [in] The length of the UTF-8 representation of codepoint. |
| buffer_pos | [in/out] Where the UTF-8 representation of the token should be written. |
| buffer_end | [in] The end of the buffer_pos buffer (used to prevent write past end of buffer). |
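| void build_unicode_numeric_token (uint32_t codepoint, size_t bytes, uint8_t *&buffer_pos, uint8_t *buffer_end) | protected |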
Helper function used to build numeric token from UTF-8.
| codepoint | [in] The Unicode codepoint of the first character in the token (which must, by definition, be numeric). |
| bytes | [in] The length of the UTF-8 representation of codepoint. |
| buffer_pos | [in/out] Where the UTF-8 representation of the token should be written. |
| buffer_end | [in] The end of the buffer_pos buffer (used to prevent write past end of buffer). |
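Both helpers share the same parameter shape: a codepoint, an in/out write pointer, and an end pointer that bounds the write. A minimal sketch of that pattern follows. `utf8_length` and `append_utf8` are hypothetical names introduced for illustration and are not part of JASS; the sketch only shows how a `uint8_t *&buffer_pos` can be advanced safely against `buffer_end`, not how JASS builds a whole token.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper, not part of JASS: bytes UTF-8 needs for one codepoint.
inline std::size_t utf8_length(std::uint32_t codepoint)
{
    if (codepoint < 0x80) return 1;
    if (codepoint < 0x800) return 2;
    if (codepoint < 0x10000) return 3;
    return 4;
}

// Hypothetical helper, not part of JASS: encode one codepoint at buffer_pos.
// On success buffer_pos is advanced past the written bytes and true is returned.
// If the codepoint is out of range or does not fit before buffer_end, nothing is
// written and buffer_pos is left unchanged. Surrogates are not rejected here.
inline bool append_utf8(std::uint32_t codepoint, std::uint8_t *&buffer_pos, std::uint8_t *buffer_end)
{
    const std::size_t bytes = utf8_length(codepoint);
    if (codepoint > 0x10FFFF || buffer_end - buffer_pos < static_cast<std::ptrdiff_t>(bytes))
        return false;

    switch (bytes)
    {
    case 1:
        *buffer_pos++ = static_cast<std::uint8_t>(codepoint);
        break;
    case 2:
        *buffer_pos++ = static_cast<std::uint8_t>(0xC0 | (codepoint >> 6));
        *buffer_pos++ = static_cast<std::uint8_t>(0x80 | (codepoint & 0x3F));
        break;
    case 3:
        *buffer_pos++ = static_cast<std::uint8_t>(0xE0 | (codepoint >> 12));
        *buffer_pos++ = static_cast<std::uint8_t>(0x80 | ((codepoint >> 6) & 0x3F));
        *buffer_pos++ = static_cast<std::uint8_t>(0x80 | (codepoint & 0x3F));
        break;
    default:
        *buffer_pos++ = static_cast<std::uint8_t>(0xF0 | (codepoint >> 18));
        *buffer_pos++ = static_cast<std::uint8_t>(0x80 | ((codepoint >> 12) & 0x3F));
        *buffer_pos++ = static_cast<std::uint8_t>(0x80 | ((codepoint >> 6) & 0x3F));
        *buffer_pos++ = static_cast<std::uint8_t>(0x80 | (codepoint & 0x3F));
        break;
    }
    return true;
}
```

Passing the write pointer by reference lets the caller chain several writes into one buffer without recomputing offsets, and checking against the end pointer before writing is what keeps a long token from overrunning it.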
| const class parser::token & get_next_token (void) | virtual |
Continue parsing the input looking for the next token.
Reimplemented in JASS::parser_fasta, and JASS::parser_unicoil_json.
| void set_document (const class document &document) | inline virtual |
Start parsing from the start of this document.
The document must be a '\0' terminated string.
| document | [in] The document to parse. |
Reimplemented in JASS::parser_fasta.
| void set_document (const std::string &document) | inline virtual |
Parse a string (rather than a document).
| document | [in] The document to parse. Must remain in scope for the entire parsing process (a copy is not taken). |
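Because no copy is taken, the string has to outlive every later call that reads from it. A minimal sketch of that borrowing contract, using a hypothetical `borrowed_text` type rather than the JASS parser itself (the member names are borrowed from the attribute list above purely for readability):

```cpp
#include <cstddef>
#include <string>

// Hypothetical non-owning view; it mirrors "a copy is not taken" and is not JASS code.
struct borrowed_text
{
    const char *current = nullptr;
    const char *end_of_document = nullptr;

    // Store pointers into the caller's string; the bytes themselves stay with the caller.
    void set_document(const std::string &document)
    {
        current = document.data();
        end_of_document = current + document.size();
    }

    // Number of bytes still available between the two stored pointers.
    std::size_t remaining() const
    {
        return static_cast<std::size_t>(end_of_document - current);
    }
};

// Safe: `text` is declared before the view and is still alive at every read.
inline std::size_t safe_use()
{
    std::string text = "one two three";
    borrowed_text view;
    view.set_document(text);
    return view.remaining();
}

// Unsafe pattern, shown only as a comment because it leaves dangling pointers:
//     view.set_document(std::string("one two three"));   // temporary dies at the ';'
```

The practical rule is to declare the string in a scope at least as wide as the parsing loop, and never to pass a temporary.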
| size_t unittest_count (const char *string) | static |
Count the number of tokens in the given string.
| string | [in] The string to count tokens in. |