crawlserv++
[under development]
Application for crawling and analyzing textual content of websites.
Parses and cleans HTML markup.
#include <HTML.hpp>
Classes | |
| class | Exception |
| Class for HTML exceptions. | |
Construction and Destruction | |
| HTML ()=default | |
| Default constructor. | |
| virtual | ~HTML ()=default |
| Default destructor. | |
Functionality | |
| void | tidyAndConvert (std::string &inOut, bool warnings, ulong numOfErrors, std::queue< std::string > &warningsTo) |
| Parse and tidy the given HTML markup and convert the result to XML. | |
Copy and Move | |
| HTML (HTML &)=delete | |
| Deleted copy constructor. | |
| HTML (HTML &&)=delete | |
| Deleted move constructor. | |
| HTML & | operator= (HTML &)=delete |
| Deleted copy operator. | |
| HTML & | operator= (HTML &&)=delete |
| Deleted move operator. | |
Parses and cleans HTML markup.
Parses the provided HTML markup, tidies it up and converts it into XML using the tidy-html5 library via Wrapper::TidyDoc.
At the moment, this class is used exclusively by Parsing::XML::parse().
For more information about the tidy-html5 API, see its GitHub repository.
HTML () [default]
Default constructor.
virtual ~HTML () [virtual, default]
Default destructor.
HTML (HTML &) [delete]
Deleted copy constructor.
HTML (HTML &&) [delete]
Deleted move constructor.
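The effect of the deleted copy and move members can be checked at compile time. The following is a minimal sketch, not the crawlserv++ source: `HtmlLike` is a hypothetical stand-in that only repeats the special member functions listed above, so the snippet compiles without the library.

```cpp
#include <type_traits>

// Hypothetical stand-in (not part of crawlserv++): it declares the same
// special member functions that this page documents for the HTML class.
class HtmlLike {
public:
    HtmlLike() = default;
    virtual ~HtmlLike() = default;

    // Copy and move are deleted, as documented for HTML.
    HtmlLike(HtmlLike&) = delete;
    HtmlLike(HtmlLike&&) = delete;
    HtmlLike& operator=(HtmlLike&) = delete;
    HtmlLike& operator=(HtmlLike&&) = delete;
};

// The type can still be default-constructed ...
static_assert(std::is_default_constructible<HtmlLike>::value,
              "default construction is available");

// ... but it can be neither copied nor moved.
static_assert(!std::is_copy_constructible<HtmlLike>::value,
              "copy construction is deleted");
static_assert(!std::is_move_constructible<HtmlLike>::value,
              "move construction is deleted");
static_assert(!std::is_copy_assignable<HtmlLike>::value,
              "copy assignment is deleted");
static_assert(!std::is_move_assignable<HtmlLike>::value,
              "move assignment is deleted");
```

In practice this means an object of such a type is created where it is needed and passed on by reference or pointer, never by value.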
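void tidyAndConvert (std::string &inOut, bool warnings, ulong numOfErrors, std::queue< std::string > &warningsTo) [inline]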
Parse and tidy the given HTML markup and convert the result to XML.
The markup will be parsed, cleaned and repaired by tidy-html5 with the following options set:
Additionally, show-warnings and show-errors will be set according to the arguments passed to the function.
Parameters
| inOut | Reference to a string containing the HTML markup to parse, tidy up and convert. The string will be replaced with the resulting XML output, unless the result would be an empty string. |
| warnings | Specifies whether to add minor warnings to the given queue. |
| numOfErrors | Specifies the number used "to determine if further errors should be added" to the queue. If set to zero, no errors will be added. |
| warningsTo | Reference to a queue of strings to which the reported warnings and errors will be added. |
Exceptions
| HTML::Exception | if a TidyDoc::Exception has been thrown. |
References crawlservpp::Wrapper::TidyDoc::cleanAndRepair(), crawlservpp::Wrapper::TidyDoc::getOutput(), crawlservpp::Wrapper::TidyDoc::parse(), crawlservpp::Wrapper::TidyDoc::setOption(), and crawlservpp::Parsing::tidyEncoding.
Referenced by crawlservpp::Parsing::XML::parse().
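The calling convention described above (in-place string, message queue, exception on failure) might be used roughly as follows. This is only a sketch under stated assumptions: `HtmlStandIn` is a hypothetical placeholder that mirrors the documented signature but does not call tidy-html5, `unsigned long` is used where the real signature has `ulong`, and `convertAndReport()` and `demo()` are illustrative helpers that are not part of crawlserv++.

```cpp
#include <cassert>
#include <ostream>
#include <queue>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical stand-in mirroring the documented signature of
// HTML::tidyAndConvert(). It does NOT parse or repair anything.
class HtmlStandIn {
public:
    class Exception : public std::runtime_error {
    public:
        using std::runtime_error::runtime_error;
    };

    void tidyAndConvert(std::string& inOut, bool warnings,
                        unsigned long numOfErrors,
                        std::queue<std::string>& warningsTo) {
        const bool blank =
            inOut.find_first_not_of(" \t\r\n") == std::string::npos;

        if (warnings) {
            warningsTo.push("stand-in warning: markup was not really parsed");
        }
        if (numOfErrors > 0 && blank) {
            warningsTo.push("stand-in error: document is empty");
        }
        // Documented contract: the input is only replaced if the result
        // would not be an empty string.
        if (!blank) {
            inOut = "<?xml version=\"1.0\"?>" + inOut;
        }
    }
};

// Caller-side pattern: convert in place, then drain the message queue.
// The stand-in never throws; the catch block shows where the documented
// HTML::Exception would be handled with the real class.
bool convertAndReport(std::string& markup, std::ostream& log) {
    HtmlStandIn html;
    std::queue<std::string> messages;

    try {
        html.tidyAndConvert(markup, true, 10, messages);
    } catch (const HtmlStandIn::Exception& e) {
        log << "tidying failed: " << e.what() << '\n';
        return false;
    }

    while (!messages.empty()) {
        log << messages.front() << '\n';
        messages.pop();
    }
    return true;
}

// Exercises both branches of the stand-in.
void demo() {
    std::ostringstream log;

    std::string markup = "<p>unclosed paragraph";
    assert(convertAndReport(markup, log));
    assert(markup.rfind("<?xml", 0) == 0); // input was replaced

    std::string blank = "   ";
    assert(convertAndReport(blank, log));
    assert(blank == "   "); // empty result: input left untouched
}
```

Because the input string doubles as the output, a caller that still needs the original markup would have to copy it before the call.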