crawlserv++  [under development]
Application for crawling and analyzing textual content of websites.
crawlservpp::Parsing::XML Class Reference

Parses HTML markup into clean XML.

#include <XML.hpp>

Classes

class  Exception
 Class for XML exceptions.
 

Friends

class Query::XPath
 

Construction and Destruction

 XML ()=default
 Default constructor.
 
 XML (const pugi::xml_node &node)
 Constructor creating a new XML document from an existing XML node.
 
virtual ~XML ()
 Destructor clearing the underlying XML document.
 

Getters

bool valid () const
 Returns whether the underlying document is valid.
 
void getContent (std::string &resultTo) const
 Gets the stringified content inside the underlying document.
 

Setter

void setOptions (bool showWarnings, std::uint32_t numOfErrors) noexcept
 Sets logging options.
 

Parsing

void parse (std::string_view content, bool repairCData, bool repairComments, bool removeXmlInstructions, std::queue< std::string > &warningsTo)
 Parses the given HTML markup into the underlying XML document.
 

Cleanup

void clear ()
 Clears the content of the underlying XML document.
 

Copy and Move

The class is not copyable, only moveable.

 XML (const XML &)=delete
 Deleted copy constructor.
 
 XML (XML &&) noexcept=default
 Default move constructor.
 
XML & operator= (const XML &)=delete
 Deleted copy assignment operator.
 
XML & operator= (XML &&) noexcept=default
 Default move assignment operator.
 

Detailed Description

Parses HTML markup into clean XML.

Uses the tidy-html5 library (via Parsing::HTML) and the pugixml library to parse and tidy up the given HTML markup and to convert it into clean XML markup.

For more information about pugixml, see its GitHub repository.

Constructor & Destructor Documentation

◆ XML() [1/4]

crawlservpp::Parsing::XML::XML ( )
default

Default constructor.

◆ XML() [2/4]

crawlservpp::Parsing::XML::XML ( const pugi::xml_node &  node)
inline explicit

Constructor creating a new XML document from an existing XML node.

Parameters
node: The node which should be added as the root node of the new XML document.

◆ ~XML()

crawlservpp::Parsing::XML::~XML ( )
inline virtual

Destructor clearing the underlying XML document.

◆ XML() [3/4]

crawlservpp::Parsing::XML::XML ( const XML &  )
delete

Deleted copy constructor.

◆ XML() [4/4]

crawlservpp::Parsing::XML::XML ( XML &&  )
default noexcept

Default move constructor.

Member Function Documentation

◆ clear()

void crawlservpp::Parsing::XML::clear ( )

Clears the content of the underlying XML document.

◆ getContent()

void crawlservpp::Parsing::XML::getContent ( std::string &  resultTo) const
inline

Gets the stringified content inside the underlying document.

The result will be indented with tabs (\t).

The output string will only be overwritten if no exception is thrown.

Warning
Should only be called if XML markup has been successfully parsed.
Parameters
resultTo: A reference to the string that will be replaced with the content from the underlying document.
Exceptions
XML::Exception: if no content is available.

References crawlservpp::Helper::Memory::freeIf().

Referenced by crawlservpp::Query::Container::getXml().
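
The contract described for getContent() (call it only after a successful parse, and expect the output string to be replaced only when no exception is thrown) can be illustrated with a small stand-in. The sketch below does not use crawlserv++ itself: `DocumentStandIn`, its `content` member, the helper `tryGetContent`, and the use of `std::runtime_error` in place of XML::Exception are all assumptions made so that the example compiles on its own.

```cpp
#include <optional>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical stand-in for a parsed document; NOT the crawlserv++ class.
struct DocumentStandIn {
    std::optional<std::string> content; // empty until "parsing" has succeeded

    // Mirrors the documented meaning of valid(): true once content exists.
    bool valid() const noexcept { return content.has_value(); }

    // Mirrors the documented contract of getContent(): throws if there is
    // no content, otherwise replaces resultTo.
    void getContent(std::string& resultTo) const {
        if(!content) {
            // std::runtime_error stands in for XML::Exception here.
            throw std::runtime_error("getContent(): no content available");
        }

        std::string copy(*content);  // may throw; resultTo is still untouched
        resultTo = std::move(copy);  // move assignment of std::string does not throw
    }
};

// Guarded call: returns false and leaves 'out' unchanged if nothing was parsed.
bool tryGetContent(const DocumentStandIn& doc, std::string& out) {
    if(!doc.valid()) {
        return false;
    }

    doc.getContent(out);

    return true;
}
```

Checking valid() first avoids relying on the exception for ordinary control flow; the exception then only signals a genuine misuse.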

◆ operator=() [1/2]

XML& crawlservpp::Parsing::XML::operator= ( const XML &  )
delete

Deleted copy assignment operator.

◆ operator=() [2/2]

XML& crawlservpp::Parsing::XML::operator= ( XML &&  )
default noexcept

Default move assignment operator.
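
The combination of deleted copy operations and defaulted noexcept move operations documented above makes the class move-only. A minimal sketch of the same pattern follows; `MoveOnlyDocument` and its `text()` accessor are hypothetical and only mirror the special member functions listed on this page, not the actual implementation.

```cpp
#include <string>
#include <type_traits>
#include <utility>

// Hypothetical move-only type following the same pattern; NOT crawlserv++ code.
class MoveOnlyDocument {
public:
    MoveOnlyDocument() = default;
    explicit MoveOnlyDocument(std::string text) : text_(std::move(text)) {}
    virtual ~MoveOnlyDocument() = default;

    // Copying is disabled ...
    MoveOnlyDocument(const MoveOnlyDocument&) = delete;
    MoveOnlyDocument& operator=(const MoveOnlyDocument&) = delete;

    // ... while moving is allowed and promises not to throw.
    MoveOnlyDocument(MoveOnlyDocument&&) noexcept = default;
    MoveOnlyDocument& operator=(MoveOnlyDocument&&) noexcept = default;

    const std::string& text() const noexcept { return text_; }

private:
    std::string text_;
};

// Compile-time checks of the properties described above (requires C++17).
static_assert(!std::is_copy_constructible_v<MoveOnlyDocument>);
static_assert(!std::is_copy_assignable_v<MoveOnlyDocument>);
static_assert(std::is_nothrow_move_constructible_v<MoveOnlyDocument>);
static_assert(std::is_nothrow_move_assignable_v<MoveOnlyDocument>);
```

Objects of such a type have to be handed over with `std::move`, for example when they are stored in a container or returned from a factory function.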

◆ parse()

void crawlservpp::Parsing::XML::parse ( std::string_view  content,
bool  repairCData,
bool  repairComments,
bool  removeXmlInstructions,
std::queue< std::string > &  warningsTo 
)
inline

Parses the given HTML markup into the underlying XML document.

A copy of the given markup will be created, and ASCII whitespace at the beginning of the input will be removed.

Parameters
content: A view into the HTML markup to be parsed.
repairCData: Specifies whether the class should try to repair broken CDATA elements in the input.
repairComments: Specifies whether the class should try to replace broken comments in the input.
removeXmlInstructions: Specifies whether the class should remove XML processing instructions before parsing HTML content.
warningsTo: A reference to a queue of strings to which warnings and errors will be added according to the specified options.
Exceptions
XML::Exception: if a HTML::Exception has been thrown.
See also
setOptions, getContent

References crawlservpp::Parsing::HTML::tidyAndConvert(), crawlservpp::Main::Exception::view(), and crawlservpp::Parsing::xmlBegin.

Referenced by crawlservpp::Query::Container::reserveForSubSets(), and crawlservpp::Main::Server::tick().
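
The first preprocessing step described for parse(), copying the input and dropping leading ASCII whitespace, could look roughly like the sketch below. The helper name `copyWithoutLeadingWhitespace` and the exact set of characters treated as whitespace are assumptions for illustration; the actual implementation may differ.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper, not part of crawlserv++: returns an owned copy of
// 'content' without leading ASCII whitespace.
std::string copyWithoutLeadingWhitespace(std::string_view content) {
    // Assumed set: space, tab, line feed, vertical tab, form feed, carriage return.
    constexpr std::string_view asciiWhitespace{" \t\n\v\f\r"};

    const std::size_t first = content.find_first_not_of(asciiWhitespace);

    if(first == std::string_view::npos) {
        return std::string(); // input was empty or consisted of whitespace only
    }

    content.remove_prefix(first); // only shrinks the view, the source is untouched

    return std::string(content);  // the copy is made here
}
```

Because the parameter is a `std::string_view`, the caller's buffer is never modified; only the returned copy lacks the leading whitespace.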

◆ setOptions()

void crawlservpp::Parsing::XML::setOptions ( bool  showWarnings,
std::uint32_t  numOfErrors 
)
inline noexcept

Sets logging options.

Forwards the given values to the underlying Parsing::HTML document.

Parameters
showWarnings: Specify whether to report simple warnings. The default is false.
numOfErrors: Set the number of errors to be reported. The default is zero.
See also
Parsing::HTML::tidyAndConvert

Referenced by crawlservpp::Query::Container::setTidyErrorsAndWarnings(), and crawlservpp::Main::Server::tick().
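
The options set here determine which messages parse() adds to its warningsTo queue. How a caller consumes that queue is not specified on this page; one possible approach, sketched with a hypothetical helper `drainWarnings`, is to move the messages out in the order in which they were added.

```cpp
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper, not part of crawlserv++: moves all queued messages
// into a vector (oldest first) and leaves the queue empty.
std::vector<std::string> drainWarnings(std::queue<std::string>& warnings) {
    std::vector<std::string> result;

    result.reserve(warnings.size());

    while(!warnings.empty()) {
        result.emplace_back(std::move(warnings.front()));

        warnings.pop();
    }

    return result;
}
```

Emptying the queue after each call to parse() keeps messages from different documents from being mixed when the same queue is reused.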

◆ valid()

bool crawlservpp::Parsing::XML::valid ( ) const
inline

Returns whether the underlying document is valid.

Returns
True if the underlying document is valid, i.e. XML content has been successfully parsed. False otherwise.

Friends And Related Function Documentation

◆ Query::XPath

friend class Query::XPath
friend

The documentation for this class was generated from the following file: