Simple, but fast, XML parser.

#include <parser.h>

Classes
class	token
	A token as returned by the parser.

Public Member Functions
	parser ()
	Constructor.

virtual	~parser ()
	Destructor.

virtual void	set_document (const class document &document)
	Start parsing from the start of this document.

virtual void	set_document (const std::string &document)
	Parse a string (rather than a document).

virtual const class parser::token &	get_next_token (void)
	Continue parsing the input looking for the next token.

Static Public Member Functions
static size_t	unittest_count (const char *string)
	Count the number of tokens in the given string.

static void	unittest (void)
	Unit test this class.

Protected Member Functions
void	build_unicode_alphabetic_token (uint32_t codepoint, size_t bytes, uint8_t *&buffer_pos, uint8_t *buffer_end)
	Helper function used to build an alphabetic token from UTF-8.

void	build_unicode_numeric_token (uint32_t codepoint, size_t bytes, uint8_t *&buffer_pos, uint8_t *buffer_end)
	Helper function used to build a numeric token from UTF-8.

Protected Attributes
token	eof_token
	Sentinel returned when reading past the end of the document.

const document *	the_document
	The document that is currently being parsed.

const uint8_t *	current
	The current location within the document.

const uint8_t *	end_of_document
	Pointer to the end of the document, used to avoid reading past the end of the buffer.

token	current_token
	The token that is currently being built. A reference to this is returned when the token is complete.

Private Attributes
document	build_document
	A document used when a string is passed into this object.

Detailed Description

Simple, but fast, XML parser.

This is the parser (tokenizer, or lexical analyser) most likely to be used for most documents, especially TREC collections. It does not manage entity references (it strips the '&' and the ';'). It does not manage attributes, which are ignored. It does, however, manage start tags, end tags, alphabetic tokens, alphanumeric tokens, comments, and many other XML characteristics.

An example that ties together documents, instreams, and parsing to count the number of documents and the number of (non-unique) alphabetic tokens is:

/*
    PARSER_USE.CPP
    --------------
    Copyright (c) 2016 Andrew Trotman
    Released under the 2-clause BSD license (See:https://en.wikipedia.org/wiki/BSD_licenses)
*/
#include <stdio.h>
#include <stdlib.h>
#include <memory>
#include <string>
#include "parser.h"
#include "instream_file.h"
#include "instream_document_trec.h"
/*
    MAIN()
    ------
*/
int main(int argc, char *argv[])
    {
    /*
        allocate a document object and a parser object.
    */
    JASS::document document;
    JASS::parser parser;
    
    /*
        build a pipeline - recall that deletes cascade so file is deleted when source goes out of scope.
    */
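    /*
        assigning a NULL argv[1] to a std::string is undefined behaviour (it does not reliably throw), so check argc first.
    */
    if (argc < 2)
        exit(printf("Usage:%s <filename>\n", argv[0]));
    std::string filename = argv[1];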
    std::shared_ptr<JASS::instream> file(new JASS::instream_file(filename));
    JASS::instream_document_trec source(file);
    /*
        this program counts documents and the alphabetic tokens in those documents.
    */
    size_t total_documents = 0;
    size_t alphas = 0;
    /*
        read documents, then parse them.
    */
    do
        {
        /*
            read the next document into the same memory the last document used.
        */
        document.rewind();
        source.read(document);
        /*
            eof is signaled as an empty document.
        */
        if (document.isempty())
            break;
        /*
            count documents.
        */
        total_documents++;
        /*
            now parse the document.
        */
        parser.set_document(document);
        bool finished = false;
        do
            {
            /*
                get the next token
            */
            const auto &token = parser.get_next_token();
            
            /*
                what type is that token
            */
            switch (token.type)
                {
                case JASS::parser::token::eof:
                    /*
                        At end of document so signal to leave the loop.
                    */
                    finished = true;
                    break;
                case JASS::parser::token::alpha:
                    /*
                        Count the number of alphabetic tokens.
                    */
                    alphas++;
                    break;
                default:
                    /*
                        else ignore the token.
                    */
                    break;
                }
            }
        while (!finished);
        }
    while (!document.isempty());
    
    /*
        Dump out the number of documents and the number of alphabetic tokens.
    */
    printf("Documents:%lld\n", (long long)total_documents);
    printf("alphas   :%lld\n", (long long)alphas);
    return 0;
    }

Examples: parser_use.cpp.

Member Function Documentation

◆ build_unicode_alphabetic_token()

void JASS::parser::build_unicode_alphabetic_token	(	uint32_t	codepoint,
		size_t	bytes,
		uint8_t *&	buffer_pos,
		uint8_t *	buffer_end
	)

protected

Helper function used to build an alphabetic token from UTF-8.

Parameters

codepoint	[in] The Unicode codepoint of the first character in the token (which must, by definition, be alphabetic).
bytes	[in] The length of the UTF-8 representation of codepoint.
buffer_pos	[in/out] Where the UTF-8 representation of the token should be written.
buffer_end	[in] The end of the buffer_pos buffer (used to prevent writing past the end of the buffer).
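
To illustrate how buffer_pos and buffer_end cooperate, here is a minimal sketch of a bounds-checked UTF-8 write of a single codepoint. It is not the JASS implementation: utf8_bytes() and append_utf8() are hypothetical helpers introduced only for this example, and the sketch covers writing one codepoint rather than building a whole token.

```cpp
#include <stddef.h>
#include <stdint.h>

/*
    UTF8_BYTES()
    ------------
    Hypothetical helper (not part of JASS): the number of bytes needed to encode codepoint in UTF-8.
*/
size_t utf8_bytes(uint32_t codepoint)
    {
    if (codepoint < 0x80)
        return 1;
    if (codepoint < 0x800)
        return 2;
    if (codepoint < 0x10000)
        return 3;
    return 4;
    }

/*
    APPEND_UTF8()
    -------------
    Hypothetical helper (not part of JASS): write codepoint at buffer_pos and advance buffer_pos.
    Returns false, leaving buffer_pos untouched, if the write would pass buffer_end.
*/
bool append_utf8(uint32_t codepoint, size_t bytes, uint8_t *&buffer_pos, uint8_t *buffer_end)
    {
    /*
        refuse to write past the end of the buffer.
    */
    if (buffer_pos > buffer_end || static_cast<size_t>(buffer_end - buffer_pos) < bytes)
        return false;

    switch (bytes)
        {
        case 1:
            *buffer_pos++ = static_cast<uint8_t>(codepoint);
            break;
        case 2:
            *buffer_pos++ = static_cast<uint8_t>(0xC0 | (codepoint >> 6));
            *buffer_pos++ = static_cast<uint8_t>(0x80 | (codepoint & 0x3F));
            break;
        case 3:
            *buffer_pos++ = static_cast<uint8_t>(0xE0 | (codepoint >> 12));
            *buffer_pos++ = static_cast<uint8_t>(0x80 | ((codepoint >> 6) & 0x3F));
            *buffer_pos++ = static_cast<uint8_t>(0x80 | (codepoint & 0x3F));
            break;
        case 4:
            *buffer_pos++ = static_cast<uint8_t>(0xF0 | (codepoint >> 18));
            *buffer_pos++ = static_cast<uint8_t>(0x80 | ((codepoint >> 12) & 0x3F));
            *buffer_pos++ = static_cast<uint8_t>(0x80 | ((codepoint >> 6) & 0x3F));
            *buffer_pos++ = static_cast<uint8_t>(0x80 | (codepoint & 0x3F));
            break;
        default:
            return false;
        }
    return true;
    }
```

Passing buffer_pos by reference lets the caller see how far the write advanced, while buffer_end stays fixed as the limit.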

◆ build_unicode_numeric_token()

void JASS::parser::build_unicode_numeric_token	(	uint32_t	codepoint,
		size_t	bytes,
		uint8_t *&	buffer_pos,
		uint8_t *	buffer_end
	)

protected

Helper function used to build a numeric token from UTF-8.

Parameters

codepoint	[in] The Unicode codepoint of the first character in the token (which must, by definition, be numeric).
bytes	[in] The length of the UTF-8 representation of codepoint.
buffer_pos	[in/out] Where the UTF-8 representation of the token should be written.
buffer_end	[in] The end of the buffer_pos buffer (used to prevent writing past the end of the buffer).

◆ get_next_token()

const class parser::token & JASS::parser::get_next_token ( void )

virtual

Continue parsing the input looking for the next token.

Returns: A reference to a token object that remains valid only until the next call to get_next_token() or until the parser is destroyed, whichever comes first.

Reimplemented in JASS::parser_fasta, and JASS::parser_unicoil_json.

Examples: parser_use.cpp.
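
The lifetime of the returned reference matters whenever a token is needed after the next call. The sketch below uses a hypothetical stand-in class, toy_parser, whose token layout (eof, text) is an assumption made for this example and is not the layout of JASS::parser::token. It only demonstrates the pattern of copying out of the reference before asking for the next token.

```cpp
#include <stddef.h>
#include <string>
#include <vector>

/*
    CLASS TOY_PARSER
    ----------------
    Hypothetical stand-in (not part of JASS): splits on spaces and, like the
    description above, returns a reference to one token object that it reuses.
*/
class toy_parser
    {
    public:
        struct token
            {
            bool eof;
            std::string text;
            };

    private:
        std::string source;
        size_t position = 0;
        token current;

    public:
        explicit toy_parser(const std::string &document) : source(document)
            {
            }

        const token &get_next_token()
            {
            while (position < source.size() && source[position] == ' ')
                position++;
            size_t start = position;
            while (position < source.size() && source[position] != ' ')
                position++;
            /*
                overwrite the one shared token; earlier references now see this value.
            */
            current.eof = (start == position);
            current.text = source.substr(start, position - start);
            return current;
            }
    };

/*
    COLLECT()
    ---------
    Copy each token's text before the next call invalidates what the reference refers to.
*/
std::vector<std::string> collect(toy_parser &parser)
    {
    std::vector<std::string> answer;
    for (;;)
        {
        const auto &token = parser.get_next_token();
        if (token.eof)
            break;
        answer.push_back(token.text);
        }
    return answer;
    }
```

Holding on to the reference itself, rather than a copy, would leave every saved entry observing whatever the most recent call produced.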

◆ set_document() [1/2]

virtual void JASS::parser::set_document ( const class document & document )

inlinevirtual

Start parsing from the start of this document.

The document must be a '\0' terminated string.

Parameters

document [in] The document to parse.

Reimplemented in JASS::parser_fasta.

Examples: parser_use.cpp.

◆ set_document() [2/2]

virtual void JASS::parser::set_document ( const std::string & document )

inlinevirtual

Parse a string (rather than a document).

Parameters

document [in] The document to parse. Must remain in scope for the entire parsing process (a copy is not taken).
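
Because no copy is taken, the string has to outlive the parsing. The following sketch shows the safe pattern with a hypothetical non-owning reader, borrowed_reader, which is an assumption made for illustration and is not the JASS class. It keeps only pointers into the caller's string, so the caller keeps the string alive until parsing has finished.

```cpp
#include <stddef.h>
#include <string>

/*
    CLASS BORROWED_READER
    ---------------------
    Hypothetical stand-in (not part of JASS): remembers pointers into the
    caller's string rather than taking a copy of it.
*/
class borrowed_reader
    {
    private:
        const char *current = nullptr;
        const char *end = nullptr;

    public:
        void set_document(const std::string &document)
            {
            current = document.data();
            end = current + document.size();
            }

        size_t count_words()
            {
            size_t words = 0;
            while (current < end)
                {
                while (current < end && *current == ' ')
                    current++;
                if (current < end)
                    words++;
                while (current < end && *current != ' ')
                    current++;
                }
            return words;
            }
    };

/*
    COUNT_WORDS_BORROWED()
    ----------------------
    Safe: text is a named object that outlives every use of reader.
    Unsafe (not done here): reader.set_document(std::string("temporary")) followed by
    a later count_words(), because the temporary is destroyed at the end of that statement.
*/
size_t count_words_borrowed(const std::string &text)
    {
    borrowed_reader reader;
    reader.set_document(text);
    return reader.count_words();
    }
```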

◆ unittest_count()

size_t JASS::parser::unittest_count ( const char * string )

static

Count the number of tokens in the given string.

Parameters

string [in] The string to count tokens in.

Returns: The number of tokens in string.

The documentation for this class was generated from the following files:

source/parser.h
source/parser.cpp
