Parser to turn DNA sequences in FASTA format into k-mers for indexing.

#include <parser_fasta.h>

Public Member Functions
	parser_fasta (size_t kmer_length)
	Constructor.

virtual	~parser_fasta ()
	Destructor.

virtual void	set_document (const class document &document)
	Start parsing from the start of this document.

virtual const class parser::token &	get_next_token (void)
	Continue parsing the input looking for the next token.

Public Member Functions inherited from JASS::parser
	parser ()
	Constructor.

virtual	~parser ()
	Destructor.

virtual void	set_document (const std::string &document)
	Parse a string (rather than a document).

Static Public Member Functions
static void	unittest (void)
	Unit test this class.

Static Public Member Functions inherited from JASS::parser
static size_t	unittest_count (const char *string)
	Count the number of tokens in the given string.

static void	unittest (void)
	Unit test this class.

Protected Member Functions
const class parser::token &	get_next_token_dna (void)
	Continue parsing the input looking for the next DNA k-mer token.

Protected Member Functions inherited from JASS::parser
void	build_unicode_alphabetic_token (uint32_t codepoint, size_t bytes, uint8_t &buffer_pos, uint8_t buffer_end)
	Helper function used to build an alphabetic token from UTF-8.

void	build_unicode_numeric_token (uint32_t codepoint, size_t bytes, uint8_t &buffer_pos, uint8_t buffer_end)
	Helper function used to build a numeric token from UTF-8.

Private Types
enum	parser_mode { TEXT, DNA }

Private Attributes
size_t	kmer_length
	The length of the k-mers to compute from the DNA sequences.

parser_mode	mode
	The mode (TEXT or DNA) of the tokenizer.

uint8_t *	end_of_fasta_document
	Pointer to the end of the FASTA document. In this class, end_of_document points to the end of the first line (the primary key), before the DNA starts.

Additional Inherited Members
Protected Attributes inherited from JASS::parser
token	eof_token
	Sentinel returned when reading past the end of the document.

const document *	the_document
	The document that is currently being parsed.

const uint8_t *	current
	The current location within the document.

const uint8_t *	end_of_document
	Pointer to the end of the document, used to avoid reading past the end of the buffer.

token	current_token
	The token that is currently being built. A reference to this is returned when the token is complete.

Detailed Description

Parser to turn DNA sequences in FASTA format into k-mers for indexing.

k-mers are character n-grams. This parser takes the input document, strips the header from the FASTA record (i.e. drops the first line), then returns the remainder of the document as a set of character n-grams starting at the first DNA character. It assumes the document is syntactically correct; if it is not, the n-grams are computed from the whole document. If a non-base is seen, the parser skips that token (it is invalid) and finds the next valid token.
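
The sliding-window behaviour described above can be illustrated with a minimal standalone sketch. It does not use the JASS classes: is_base() and kmers() are hypothetical helpers, and the sketch assumes that only the letters A, C, G and T (in either case) count as bases and that any window containing another character is skipped. The real parser may treat ambiguity codes and line breaks differently.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of JASS): true for the four unambiguous DNA bases.
inline bool is_base(char c)
{
    switch (c)
    {
        case 'A': case 'C': case 'G': case 'T':
        case 'a': case 'c': case 'g': case 't':
            return true;
        default:
            return false;
    }
}

// Hypothetical helper: return every length-k window of `sequence` that
// consists solely of bases, in the order the windows occur.
std::vector<std::string> kmers(const std::string &sequence, std::size_t k)
{
    assert(k > 0);
    std::vector<std::string> result;
    std::size_t run = 0;    // number of consecutive bases ending at position i

    for (std::size_t i = 0; i < sequence.size(); i++)
    {
        // A non-base resets the run, so no window spanning it is emitted.
        run = is_base(sequence[i]) ? run + 1 : 0;
        if (run >= k)
            result.push_back(sequence.substr(i + 1 - k, k));
    }
    return result;
}
```

Under these assumptions, kmers("ACGTA", 3) yields ACG, CGT, GTA, while kmers("ACNGTA", 2) yields AC, GT, TA because every window touching the N is dropped.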

Member Enumeration Documentation

◆ parser_mode

enum JASS::parser_fasta::parser_mode

private

The first line of a document is text and should be indexed as such. The remainder of the document is DNA and should be converted into k-mers.

Member Function Documentation

◆ get_next_token()

virtual const class parser::token& JASS::parser_fasta::get_next_token ( void )

inlinevirtual

Continue parsing the input looking for the next token.

Returns: A reference to a token object that is valid until either the next call to get_next_token() or the parser is destroyed.
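
The lifetime rule above means a caller that needs a token beyond the next call should copy it first. The sketch below shows why, using a hypothetical stand-in class (tiny_parser, with a token that holds only a std::string lexeme) rather than the JASS classes; the real token type is not reproduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical stand-in (not the JASS class): a parser that reuses one
// internal token object and hands out a reference to it.
class tiny_parser
{
public:
    struct token
    {
        std::string lexeme;    // empty once the input is exhausted
    };

    tiny_parser(std::string text, std::size_t k) :
        text(std::move(text)), k(k), position(0)
    {
        assert(k > 0);
    }

    // Overwrites the internal token and returns a reference to it.
    const token &get_next_token()
    {
        if (position + k <= text.size())
            current_token.lexeme = text.substr(position, k);
        else
            current_token.lexeme.clear();
        position++;
        return current_token;
    }

private:
    std::string text;
    std::size_t k;
    std::size_t position;
    token current_token;
};

// Copy the first lexeme, then advance; the copy is unaffected by the second call.
inline std::string first_lexeme_after_advancing(tiny_parser &parser)
{
    const tiny_parser::token &first = parser.get_next_token();
    std::string saved = first.lexeme;    // copy while the reference is still valid
    parser.get_next_token();             // the internal token now holds the next k-mer
    return saved;
}
```

In this sketch the reference returned by the first call observes the second k-mer after the second call, which is why the copy is taken beforehand.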

Reimplemented from JASS::parser.

◆ get_next_token_dna()

const class parser::token & JASS::parser_fasta::get_next_token_dna ( void )

protected

Continue parsing the input looking for the next DNA k-mer token.

Returns: A reference to a token object that is valid until either the next call to get_next_token() or the parser is destroyed.

◆ set_document()

virtual void JASS::parser_fasta::set_document ( const class document & document )

inlinevirtual

Start parsing from the start of this document.

The document must be a '\0' terminated string. Syntactically correct FASTA is assumed; if the necessary parts are not found, n-grams from the start of the whole document are used.
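
The fallback described above can be sketched as a small standalone function. This is not the JASS implementation: dna_start() is a hypothetical helper, and it assumes the "necessary parts" are a leading '>' and a newline that ends the header line. When either is missing, it returns offset 0 so that k-mers are taken from the start of the whole document.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not part of JASS): offset of the first DNA character
// in a FASTA record, or 0 if the record has no recognisable header line.
std::size_t dna_start(const std::string &record)
{
    // Assumption: a well-formed record starts with '>' ...
    if (record.empty() || record[0] != '>')
        return 0;

    // ... and its header runs up to the first newline.
    std::size_t newline = record.find('\n');
    if (newline == std::string::npos)
        return 0;

    std::size_t start = newline + 1;
    assert(start <= record.size());
    return start;
}
```

For the record ">id\nACGT" this returns 4, the offset of the first base; for "ACGT" or ">id only" it returns 0.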

Parameters

document [in] The document to parse.

Reimplemented from JASS::parser.

The documentation for this class was generated from the following files:

source/parser_fasta.h
source/parser_fasta.cpp
