crawlserv++
[under development]
Application for crawling and analyzing textual content of websites.
RAII wrapper for documents used by the tidy-html5 API.
#include <TidyDoc.hpp>
Classes
| class | Exception |
| Class for tidy-html5 document exceptions. |
Construction and Destruction
| TidyDoc () |
| Constructor creating an empty tidy-html5 document. |
| virtual | ~TidyDoc () |
| Destructor releasing the underlying tidy-html5 document. |
Getter
| std::string | getOutput (std::queue< std::string > &warningsTo) |
| Gets the processed text from the tidy-html5 document. |
Setters
| void | setOption (TidyOptionId option, bool value) |
| Sets a boolean option. |
| void | setOption (TidyOptionId option, int value) |
| Sets an integer option. |
| void | setOption (TidyOptionId option, ulong value) |
| Sets a ulong option. |
| void | setOption (TidyOptionId option, const std::string &value) |
| Sets a string option. |
Parsing and Cleanup
| void | parse (const std::string &in, std::queue< std::string > &warningsTo) |
| Parses the given markup. |
| void | cleanAndRepair (std::queue< std::string > &warningsTo) |
| Cleans and repairs the previously parsed content of the underlying tidy-html5 document. |
Copy and Move
| TidyDoc (TidyDoc &)=delete |
| Deleted copy constructor. |
| TidyDoc & | operator= (TidyDoc &)=delete |
| Deleted copy assignment operator. |
| TidyDoc (TidyDoc &&)=delete |
| Deleted move constructor. |
| TidyDoc & | operator= (TidyDoc &&)=delete |
| Deleted move assignment operator. |
RAII wrapper for documents used by the tidy-html5 API.
Creates a Tidy document on construction and automatically releases it on destruction, avoiding memory leaks.
The class encapsulates functionality to configure the API, to parse, clean, and repair markup, and to retrieve a stringified copy of the resulting tree from the underlying document.
At the moment, this class is used exclusively by Parsing::HTML::tidyAndConvert().
For more information about the tidy-html5 API, see its GitHub repository.
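The RAII idea described above can be sketched in isolation. This is a minimal illustration and not the actual TidyDoc implementation: `RawDoc`, `rawCreate()`, `rawRelease()`, `liveDocs()` and `DocGuard` are hypothetical names standing in for the tidy-html5 handle and its create/release calls, so that the sketch compiles without the library.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical stand-in for a C library handle (not the tidy-html5 API).
struct RawDoc {};

// Counts handles that have been created but not yet released.
inline int& liveDocs() {
    static int count = 0;
    return count;
}

inline RawDoc* rawCreate() {
    ++liveDocs();
    return new RawDoc;
}

inline void rawRelease(RawDoc* doc) {
    --liveDocs();
    delete doc;
}

// RAII guard: acquires the handle on construction, releases it on destruction.
class DocGuard {
public:
    DocGuard() : doc(rawCreate()) {}
    ~DocGuard() { rawRelease(this->doc); }

    // Neither copyable nor movable, so the handle has exactly one owner.
    DocGuard(const DocGuard&) = delete;
    DocGuard& operator=(const DocGuard&) = delete;
    DocGuard(DocGuard&&) = delete;
    DocGuard& operator=(DocGuard&&) = delete;

    RawDoc* get() noexcept { return this->doc; }

private:
    RawDoc* doc;
};

static_assert(!std::is_copy_constructible<DocGuard>::value, "not copyable");
static_assert(!std::is_move_constructible<DocGuard>::value, "not movable");

// Usage: the handle lives exactly as long as the guard's scope.
inline void scopeExample() {
    assert(liveDocs() == 0);
    {
        DocGuard guard;
        assert(guard.get() != nullptr);
        assert(liveDocs() == 1);
    }
    assert(liveDocs() == 0); // released by the destructor
}
```

Deleting all four copy and move operations, as TidyDoc does, rules out double release by construction, at the cost of not being able to store the wrapper by value in containers that need to relocate their elements.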
| TidyDoc::TidyDoc () | inline |
Constructor creating an empty tidy-html5 document.
Also sets the internal error buffer of the newly created document.
| TidyDoc::Exception | if the error buffer could not be set. |
References crawlservpp::Wrapper::TidyBuffer::get().
| virtual TidyDoc::~TidyDoc () | inline virtual |
Destructor releasing the underlying tidy-html5 document.
| TidyDoc::TidyDoc (TidyDoc &) | delete |
Deleted copy constructor.
| TidyDoc::TidyDoc (TidyDoc &&) | delete |
Deleted move constructor.
| void TidyDoc::cleanAndRepair (std::queue< std::string > &warningsTo) | inline |
Cleans and repairs the previously parsed content of the underlying tidy-html5 document.
The parsed content will be cleaned and repaired according to the options that have previously been set by calls to the different setOption() functions.
The result will be stored in the underlying document and can be accessed via the getOutput() function. The raw output from the parsing process will be overwritten.
An exception is thrown only if a fatal error occurred.
| warningsTo | The reference to a queue of strings into which to push errors and warnings that occurred while cleaning and repairing. |
| TidyDoc::Exception | if the parsed markup could not be cleaned and repaired. |
References crawlservpp::Wrapper::TidyBuffer::clear(), crawlservpp::Wrapper::TidyBuffer::empty(), crawlservpp::Wrapper::TidyBuffer::getString(), crawlservpp::Helper::Utf8::length(), crawlservpp::Helper::Strings::splitToQueue(), and crawlservpp::Wrapper::TidyBuffer::valid().
Referenced by crawlservpp::Parsing::HTML::tidyAndConvert().
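The `warningsTo` parameter receives one string per diagnostic. As a rough sketch of how error-buffer text might be turned into queue entries, the hypothetical helper `pushLines()` below splits text at newlines and skips empty lines. It is not the actual Helper::Strings::splitToQueue() implementation, whose exact behavior is not described on this page, and the sample messages are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <string>

// Hypothetical helper: appends every non-empty line of `text` to `to`.
inline void pushLines(const std::string& text, std::queue<std::string>& to) {
    std::size_t begin = 0;

    while (begin <= text.size()) {
        std::size_t end = text.find('\n', begin);

        if (end == std::string::npos) {
            end = text.size(); // last line without a trailing newline
        }

        if (end > begin) {
            to.emplace(text, begin, end - begin); // copy the line [begin, end)
        }

        begin = end + 1;
    }
}

// Usage: two diagnostics separated by an empty line yield two entries.
inline void warningsExample() {
    std::queue<std::string> warnings;

    pushLines("first diagnostic\n\nsecond diagnostic\n", warnings);

    assert(warnings.size() == 2);
    assert(warnings.front() == "first diagnostic");
    assert(warnings.back() == "second diagnostic");
}
```

Because the queue is passed by reference and only appended to, the same queue can be handed to parse(), cleanAndRepair() and getOutput() in turn and drained once at the end.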
| std::string TidyDoc::getOutput (std::queue< std::string > &warningsTo) | inline |
Gets the processed text from the tidy-html5 document.
If the buffer received from the underlying document is invalid (or empty), an empty string will be returned.
An exception is thrown only if a fatal error occurred.
| warningsTo | The reference to a queue of strings into which to push errors and warnings that occurred while saving the output. |
| TidyDoc::Exception | if writing to the output buffer failed. |
References crawlservpp::Wrapper::TidyBuffer::clear(), crawlservpp::Wrapper::TidyBuffer::empty(), crawlservpp::Wrapper::TidyBuffer::get(), crawlservpp::Wrapper::TidyBuffer::getString(), crawlservpp::Helper::Strings::splitToQueue(), and crawlservpp::Wrapper::TidyBuffer::valid().
Referenced by crawlservpp::Parsing::HTML::tidyAndConvert().
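The "only on a fatal error" rule shared by cleanAndRepair(), getOutput() and parse() could look roughly like the sketch below. It assumes the tidy-html5 convention in which the C calls return 0 on success, 1 if warnings were issued, 2 if errors were issued, and a negative value on a severe error. Whether TidyDoc maps return values in exactly this way is not stated on this page. `Outcome`, `checkStatus()` and `isFatal()` are hypothetical names, and `std::runtime_error` stands in for TidyDoc::Exception.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical classification of a non-fatal status code.
enum class Outcome { Clean, Warnings, Errors };

// Throws only for negative (severe) status codes; everything else is
// reported to the caller and would end up in the warnings queue.
inline Outcome checkStatus(int status, const std::string& action) {
    if (status < 0) {
        throw std::runtime_error("Could not " + action);
    }

    if (status == 0) {
        return Outcome::Clean;
    }

    return status == 1 ? Outcome::Warnings : Outcome::Errors;
}

// Returns true if the given status code would cause an exception.
inline bool isFatal(int status) {
    try {
        checkStatus(status, "parse markup");
    }
    catch (const std::runtime_error&) {
        return true;
    }

    return false;
}

// Usage: recoverable problems do not throw, severe errors do.
inline void statusExample() {
    assert(checkStatus(0, "parse markup") == Outcome::Clean);
    assert(checkStatus(1, "parse markup") == Outcome::Warnings);
    assert(checkStatus(2, "parse markup") == Outcome::Errors);
    assert(isFatal(-1));
}
```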
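| TidyDoc & TidyDoc::operator= (TidyDoc &) | delete |
Deleted copy assignment operator.
| TidyDoc & TidyDoc::operator= (TidyDoc &&) | delete |
Deleted move assignment operator.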
| void TidyDoc::parse (const std::string &in, std::queue< std::string > &warningsTo) | inline |
Parses the given markup.
The given markup will be parsed according to the options that have previously been set by calls to the different setOption() functions.
The underlying API will correct syntax errors while parsing.
The result will be stored in the underlying document for possible further processing and can be accessed via the getOutput() function. Any previous output will be overwritten.
An exception is thrown only if a fatal error occurred.
| in | A const reference to the string containing the markup to be parsed. |
| warningsTo | The reference to a queue of strings into which to push errors and warnings that occurred while parsing the given input. |
| TidyDoc::Exception | if the given markup could not be parsed. |
References crawlservpp::Wrapper::TidyBuffer::clear(), crawlservpp::Wrapper::TidyBuffer::empty(), crawlservpp::Wrapper::TidyBuffer::getString(), crawlservpp::Helper::Strings::splitToQueue(), and crawlservpp::Wrapper::TidyBuffer::valid().
Referenced by crawlservpp::Parsing::HTML::tidyAndConvert().
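The call order implied by the descriptions of parse(), cleanAndRepair() and getOutput() can be sketched as a small helper. The template `tidyMarkup()` is hypothetical and works with any type offering the three member functions with the signatures listed on this page. `EchoDoc` is a made-up stand-in that only exists so the sketch compiles without tidy-html5; it does not clean anything. This is not necessarily how Parsing::HTML::tidyAndConvert() is written.

```cpp
#include <cassert>
#include <queue>
#include <string>

// Hypothetical helper: parse, then clean and repair, then fetch the result.
// All three steps append their diagnostics to the same queue.
template<class Doc>
std::string tidyMarkup(
        Doc& doc,
        const std::string& markup,
        std::queue<std::string>& warnings
) {
    doc.parse(markup, warnings);
    doc.cleanAndRepair(warnings);

    return doc.getOutput(warnings);
}

// Made-up stand-in mirroring the TidyDoc interface; it echoes its input.
struct EchoDoc {
    std::string content;

    void parse(const std::string& in, std::queue<std::string>& warningsTo) {
        this->content = in;

        if (in.empty()) {
            warningsTo.push("empty input");
        }
    }

    void cleanAndRepair(std::queue<std::string>&) {}

    std::string getOutput(std::queue<std::string>&) {
        return this->content;
    }
};

// Usage: one document, one warnings queue, three calls in a fixed order.
inline void pipelineExample() {
    EchoDoc doc;
    std::queue<std::string> warnings;

    const std::string out = tidyMarkup(doc, "<p>text</p>", warnings);

    assert(out == "<p>text</p>");
    assert(warnings.empty());
}
```

With the real class, options would be set via setOption() before the first call, since they only affect subsequent parsing, cleaning and repairing.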
| void TidyDoc::setOption (TidyOptionId option, bool value) | inline |
Sets a boolean option.
Once successfully set, it will take effect for all subsequent parsing, cleaning and repairing of any given input.
| option | The ID of the option to be set. |
| value | The value which the option will be set to. |
| TidyDoc::Exception | if the option could not be set. |
Referenced by crawlservpp::Parsing::HTML::tidyAndConvert().
| void TidyDoc::setOption (TidyOptionId option, int value) | inline |
Sets an integer option.
Once successfully set, it will take effect for all subsequent parsing, cleaning and repairing of any given input.
| option | The ID of the option to be set. |
| value | The value which the option will be set to. |
| TidyDoc::Exception | if the option could not be set. |
| void TidyDoc::setOption (TidyOptionId option, ulong value) | inline |
Sets a ulong option.
Once successfully set, it will take effect for all subsequent parsing, cleaning and repairing of any given input.
| option | The ID of the option to be set. |
| value | The value which the option will be set to. |
| TidyDoc::Exception | if the option could not be set. |
| void TidyDoc::setOption (TidyOptionId option, const std::string &value) | inline |
Sets a string option.
Once successfully set, it will take effect for all subsequent parsing, cleaning and repairing of any given input.
| option | The ID of the option to be set. |
| value | The value which the option will be set to. |
| TidyDoc::Exception | if the option could not be set. |
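One caveat follows from standard C++ overload resolution rather than from anything specific to crawlserv++: with value types of `bool`, `int`, `ulong` and `const std::string &`, a plain string literal selects the `bool` overload. The sketch below demonstrates this with the hypothetical free function `chosen()`, which mirrors only the value parameter of the four setOption() overloads. It assumes that `ulong` is `unsigned long` and that no further overloads exist beyond those listed on this page.

```cpp
#include <cassert>
#include <string>

// Hypothetical overload set mirroring the value types of setOption().
// Each overload reports which one was selected.
inline std::string chosen(bool) { return "bool"; }
inline std::string chosen(int) { return "int"; }
inline std::string chosen(unsigned long) { return "ulong"; }
inline std::string chosen(const std::string&) { return "string"; }

inline void overloadExample() {
    // Exact matches select the expected overload.
    assert(chosen(true) == "bool");
    assert(chosen(42) == "int");
    assert(chosen(42UL) == "ulong");
    assert(chosen(std::string("utf8")) == "string");

    // A string literal decays to const char*, and the built-in
    // pointer-to-bool conversion outranks the user-defined conversion
    // to std::string, so the bool overload wins.
    assert(chosen("utf8") == "bool");
}
```

Under these assumptions, passing string values as `std::string` objects rather than as literals avoids silently setting a boolean option.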