JASSv2
Parser to turn DNA sequences in FASTA format into k-mers for indexing.
#include <parser_fasta.h>


Public Member Functions | |
| parser_fasta (size_t kmer_length) | |
| Constructor. | |
| virtual | ~parser_fasta () |
| Destructor. | |
| virtual void | set_document (const class document &document) |
| Start parsing from the start of this document. | |
| virtual const class parser::token & | get_next_token (void) |
| Continue parsing the input looking for the next token. | |
Public Member Functions inherited from JASS::parser | |
| parser () | |
| Constructor. | |
| virtual | ~parser () |
| Destructor. | |
| virtual void | set_document (const std::string &document) |
| Parse a string (rather than a document). | |
Static Public Member Functions | |
| static void | unittest (void) |
| Unit test this class. | |
Static Public Member Functions inherited from JASS::parser | |
| static size_t | unittest_count (const char *string) |
| Count the number of tokens in the given string. | |
| static void | unittest (void) |
| Unit test this class. | |
Protected Member Functions | |
| const class parser::token & | get_next_token_dna (void) |
| Continue parsing the input looking for the next DNA k-mer token. | |
Protected Member Functions inherited from JASS::parser | |
| void | build_unicode_alphabetic_token (uint32_t codepoint, size_t bytes, uint8_t *&buffer_pos, uint8_t *buffer_end) |
| Helper function used to build alphabetic token from UTF-8. | |
| void | build_unicode_numeric_token (uint32_t codepoint, size_t bytes, uint8_t *&buffer_pos, uint8_t *buffer_end) |
| Helper function used to build numeric token from UTF-8. | |
Private Types | |
| enum | parser_mode { TEXT, DNA } |
Private Attributes | |
| size_t | kmer_length |
| The length of the k-mers to compute from the DNA sequences. | |
| parser_mode | mode |
| The mode (TEXT or DNA) of the tokenizer. | |
| uint8_t * | end_of_fasta_document |
| Pointer to the end of the FASTA document; end_of_document points to the end of the first line (the primary key) before the DNA starts. | |
Additional Inherited Members | |
Protected Attributes inherited from JASS::parser | |
| token | eof_token |
| Sentinel returned when reading past end of document. | |
| const document * | the_document |
| The document that is currently being parsed. | |
| const uint8_t * | current |
| The current location within the document. | |
| const uint8_t * | end_of_document |
| Pointer to the end of the document, used to avoid read past end of buffer. | |
| token | current_token |
| The token that is currently being built. A reference to this is returned when the token is complete. | |
k-mers are character n-grams. This parser takes the input document, strips the header from the FASTA record (i.e. drops the first line), then returns the remainder of the document as a set of character n-grams starting at the first DNA character. It assumes the document is syntactically correct; if it is not, the n-grams are computed from the whole document. If a non-base character is seen, the parser skips that token (it is invalid) and finds the next valid token.
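The behaviour described above can be illustrated with a minimal, standalone sketch. It is not the JASS implementation: the names `is_base` and `fasta_kmers` are hypothetical, and the sketch assumes that a well-formed record starts with `>` and contains a newline, that only `A`, `C`, `G` and `T` count as bases, and that k-mers are produced by sliding a window one character at a time. Any window containing a non-base character (including a line break) is skipped.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper, not part of JASS.
// Assumption: only the four unambiguous upper-case bases are valid.
static bool is_base(char c)
{
    return c == 'A' || c == 'C' || c == 'G' || c == 'T';
}

// Hypothetical helper, not part of JASS.
// Returns every k-mer of the DNA part of a FASTA record, in order.
std::vector<std::string> fasta_kmers(const std::string &record, std::size_t k)
{
    std::vector<std::string> kmers;
    if (k == 0)
        return kmers;

    // Skip the header line if the record looks well formed;
    // otherwise fall back to scanning the whole document.
    std::size_t start = 0;
    if (!record.empty() && record[0] == '>')
    {
        std::size_t end_of_line = record.find('\n');
        if (end_of_line != std::string::npos)
            start = end_of_line + 1;
    }

    // run = number of consecutive valid bases ending at pos.
    // A k-mer ends at pos whenever the run is at least k long.
    std::size_t run = 0;
    for (std::size_t pos = start; pos < record.size(); pos++)
    {
        run = is_base(record[pos]) ? run + 1 : 0;
        if (run >= k)
            kmers.push_back(record.substr(pos + 1 - k, k));
    }
    return kmers;
}
```

Under these assumptions, `fasta_kmers(">seq1\nACGTAC", 3)` yields `ACG`, `CGT`, `GTA`, `TAC`, and a record whose sequence is `ACNGT` with k = 2 yields only `AC` and `GT`, because every window touching `N` is discarded.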
enum parser_mode | private |
The first line of a document is text and should be indexed as such. The remainder of the document is DNA and should be converted into k-mers.
virtual const class parser::token & get_next_token (void) | inline virtual |
Continue parsing the input looking for the next token.
Reimplemented from JASS::parser.
const class parser::token & get_next_token_dna (void) | protected |
Continue parsing the input looking for the next DNA k-mer token.
virtual void set_document (const class document &document) | inline virtual |
Start parsing from the start of this document.
The document must be a '\0' terminated string. Syntactically correct FASTA is assumed; if the necessary parts are not found, n-grams are computed from the start of the whole document.
| document | [in] The document to parse. |
Reimplemented from JASS::parser.
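As a rough illustration of how a '\0' terminated FASTA record might be split into its header line and its DNA, the sketch below locates the three boundaries mentioned in this class: the end of the first line, the start of the DNA, and the end of the record. The type `fasta_bounds` and the function `split_fasta` are hypothetical and not part of JASS. The test for "necessary parts" (a leading `>` and a newline) is an assumption, as is the fallback of treating the whole document as DNA.

```cpp
#include <cstring>

// Hypothetical type, not part of JASS: the boundaries of one FASTA record.
struct fasta_bounds
{
    const char *header_end;    // end of the first line (the primary key)
    const char *dna_start;     // first character after the header line
    const char *document_end;  // the terminating '\0' of the record
};

// Hypothetical helper, not part of JASS.
// Assumption: a well-formed record starts with '>' and contains a newline.
fasta_bounds split_fasta(const char *document)
{
    const char *document_end = document + std::strlen(document);
    const char *newline = std::strchr(document, '\n');

    // Malformed record: no header is recognised, so the DNA
    // is taken to start at the beginning of the document.
    if (*document != '>' || newline == nullptr)
        return {document, document, document_end};

    return {newline, newline + 1, document_end};
}
```

In this sketch the header occupies the range from `document` to `header_end` and the DNA occupies the range from `dna_start` to `document_end`, which mirrors the roles the attribute descriptions give to `end_of_document` and `end_of_fasta_document`.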