|
crawlserv++
[under development]
Application for crawling and analyzing textual content of websites.
|
Parses HTML markup into clean XML.
#include <XML.hpp>
Classes | |
| class | Exception |
| Class for XML exceptions. | |
Friends | |
| class | Query::XPath |
Construction and Destruction | |
| XML ()=default | |
| Default constructor. | |
| XML (const pugi::xml_node &node) | |
| Constructor creating a new XML document from an existing XML node. | |
| virtual | ~XML () |
| Destructor clearing the underlying XML document. | |
Getters | |
| bool | valid () const |
| Returns whether the underlying document is valid. | |
| void | getContent (std::string &resultTo) const |
| Gets the stringified content inside the underlying document. | |
Setter | |
| void | setOptions (bool showWarnings, std::uint32_t numOfErrors) noexcept |
| Sets logging options. | |
Parsing | |
| void | parse (std::string_view content, bool repairCData, bool repairComments, bool removeXmlInstructions, std::queue< std::string > &warningsTo) |
| Parses the given HTML markup into the underlying XML document. | |
Cleanup | |
| void | clear () |
| Clears the content of the underlying XML document. | |
Copy and Move | |
| XML (const XML &)=delete | |
| Deleted copy constructor. | |
| XML (XML &&) noexcept=default | |
| Default move constructor. | |
| XML & | operator= (const XML &)=delete |
| Deleted copy assignment operator. | |
| XML & | operator= (XML &&) noexcept=default |
| Default move assignment operator. | |
Parses HTML markup into clean XML.
Uses the tidy-html5 library (via Parsing::HTML) and the pugixml library to parse, tidy up, and clean the given HTML markup, and to convert it into clean XML markup.
For more information about pugixml, see its GitHub repository.
|
default |
Default constructor.
|
inline explicit |
Constructor creating a new XML document from an existing XML node.
| node | The node which should be added as root node to the new XML document. |
|
inline virtual |
Destructor clearing the underlying XML document.
|
delete |
Deleted copy constructor.
|
default noexcept |
Default move constructor.
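The combination of deleted copy operations and defaulted `noexcept` move operations makes the class move-only. The following is a minimal sketch of that pattern, not the crawlserv++ source: `MoveOnlyDocument` and its `content` member are hypothetical stand-ins for XML and its underlying document.

```cpp
#include <string>
#include <type_traits>
#include <utility>

// Hypothetical stand-in declaring the same copy and move operations
// as documented for XML; it is NOT part of crawlserv++.
class MoveOnlyDocument {
public:
    MoveOnlyDocument() = default;
    virtual ~MoveOnlyDocument() = default;

    // copying is disabled
    MoveOnlyDocument(const MoveOnlyDocument&) = delete;
    MoveOnlyDocument& operator=(const MoveOnlyDocument&) = delete;

    // moving is allowed and must not throw
    MoveOnlyDocument(MoveOnlyDocument&&) noexcept = default;
    MoveOnlyDocument& operator=(MoveOnlyDocument&&) noexcept = default;

    std::string content; // assumption: stands in for the underlying document
};

// The compiler confirms the resulting properties.
static_assert(!std::is_copy_constructible_v<MoveOnlyDocument>);
static_assert(!std::is_copy_assignable_v<MoveOnlyDocument>);
static_assert(std::is_nothrow_move_constructible_v<MoveOnlyDocument>);
static_assert(std::is_nothrow_move_assignable_v<MoveOnlyDocument>);

// Ownership can only be handed over explicitly, e.g. when storing a document.
inline MoveOnlyDocument takeOwnership(MoveOnlyDocument&& source) {
    return MoveOnlyDocument(std::move(source));
}
```

A move-only type of this kind can still be stored in standard containers such as `std::vector`, and the `noexcept` move operations allow the container to move elements instead of copying them when it reallocates.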
|
inline |
Clears the content of the underlying XML document.
Does not have any effect if no content has been parsed.
References crawlservpp::Parsing::cDataBegin, crawlservpp::Parsing::cDataEnd, crawlservpp::Parsing::commentCharsReplaceBy, crawlservpp::Parsing::commentCharsToReplace, crawlservpp::Parsing::conditionalBegin, crawlservpp::Parsing::conditionalEnd, crawlservpp::Parsing::conditionalInsert, crawlservpp::Parsing::conditionalInsertOffsetBegin, crawlservpp::Parsing::conditionalInsertOffsetEnd, crawlservpp::Parsing::conditionalInsertOffsetStrayEnd, crawlservpp::Parsing::invalidBegin, crawlservpp::Parsing::invalidEnd, crawlservpp::Parsing::invalidInsertBegin, crawlservpp::Parsing::invalidInsertEnd, crawlservpp::Parsing::invalidInsertOffsetBegin, crawlservpp::Parsing::numDebugCharacters, crawlservpp::Parsing::xmlInstructionBegin, crawlservpp::Parsing::xmlInstructionEnd, and crawlservpp::Parsing::xmlTags.
Referenced by crawlservpp::Query::Container::clearQueryTarget(), and crawlservpp::Query::Container::reserveForSubSets().
|
inline |
Gets the stringified content inside the underlying document.
The result will be indented with tabs (\t).
The output string will be overwritten if no exception is thrown.
| resultTo | A reference to the string that will be replaced with the content from the underlying document. |
| XML::Exception | if no content is available. |
References crawlservpp::Helper::Memory::freeIf().
Referenced by crawlservpp::Query::Container::getXml().
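The contract described above (the function throws if no content is available, and the output string is replaced only on success) can be sketched as follows. This is an illustration under stated assumptions, not the actual implementation: `FakeDocument` and its `content` member are hypothetical, and `std::runtime_error` stands in for XML::Exception.

```cpp
#include <optional>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical stand-in for the XML class; it is NOT part of crawlserv++.
struct FakeDocument {
    std::optional<std::string> content; // empty if nothing has been parsed

    // Throws if no content is available; resultTo is only replaced
    // after the result has been built completely.
    void getContent(std::string& resultTo) const {
        if(!content) {
            // std::runtime_error stands in for XML::Exception
            throw std::runtime_error("No content has been parsed.");
        }

        std::string result(*content); // build the result first

        resultTo = std::move(result); // overwrite the output on success only
    }
};
```

With this pattern, a caller that catches the exception can keep using the previous value of the output string, because it is left untouched on failure.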
Default move assignment operator.
|
inline |
Parses the given HTML markup into the underlying XML document.
A copy of the given markup will be created, and ASCII whitespace at the beginning of the input will be removed.
| content | A view into the HTML markup to be parsed. |
| repairCData | Specifies whether the class should try to repair broken CDATA elements in the input. |
| repairComments | Specifies whether the class should try to replace broken comments in the input. |
| removeXmlInstructions | Specifies whether the class should remove XML processing instructions before parsing HTML content. |
| warningsTo | A reference to a queue of strings to which warnings and errors will be added according to the specified options. |
| XML::Exception | if an HTML::Exception has been thrown. |
References crawlservpp::Parsing::HTML::tidyAndConvert(), crawlservpp::Main::Exception::view(), and crawlservpp::Parsing::xmlBegin.
Referenced by crawlservpp::Query::Container::reserveForSubSets(), and crawlservpp::Main::Server::tick().
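Two details of the description above lend themselves to a short sketch: copying the input while dropping leading ASCII whitespace, and collecting what was written to the warnings queue. Both helpers below, `copyWithoutLeadingWhitespace` and `drainWarnings`, are hypothetical and not part of crawlserv++. The set of characters treated as ASCII whitespace is also an assumption.

```cpp
#include <cstddef>
#include <queue>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical helper: copies the markup behind the view,
// skipping ASCII whitespace at its beginning.
inline std::string copyWithoutLeadingWhitespace(std::string_view content) {
    // assumption: the six whitespace characters of the "C" locale
    constexpr std::string_view asciiWhitespace{" \t\n\v\f\r"};

    const std::size_t first{content.find_first_not_of(asciiWhitespace)};

    if(first == std::string_view::npos) {
        return std::string{}; // input is empty or consists of whitespace only
    }

    return std::string(content.substr(first));
}

// Hypothetical helper: moves all queued messages into a vector,
// leaving the queue empty for the next call to parse().
inline std::vector<std::string> drainWarnings(std::queue<std::string>& warnings) {
    std::vector<std::string> drained;

    while(!warnings.empty()) {
        drained.push_back(std::move(warnings.front()));
        warnings.pop();
    }

    return drained;
}
```

Because the copy owns its data, the caller's buffer behind the `std::string_view` does not need to outlive the parsing step in this sketch.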
|
inline noexcept |
Sets logging options.
Forwards the given values to the underlying Parsing::HTML document.
| showWarnings | Specify whether to report simple warnings. The default is false. |
| numOfErrors | Set the number of errors to be reported. The default is zero. |
Referenced by crawlservpp::Query::Container::setTidyErrorsAndWarnings(), and crawlservpp::Main::Server::tick().
|
inline |
Returns whether the underlying document is valid.
|
friend |