crawlserv++  [under development]
Application for crawling and analyzing textual content of websites.
crawlservpp::Parsing::HTML Class Reference

Parses and cleans HTML markup.

#include <HTML.hpp>

Classes

class  Exception
 Class for HTML exceptions.
 

Construction and Destruction

 HTML ()=default
 Default constructor.
 
virtual ~HTML ()=default
 Default destructor.
 

Functionality

void tidyAndConvert (std::string &inOut, bool warnings, ulong numOfErrors, std::queue< std::string > &warningsTo)
 Parse and tidy the given HTML markup and convert the result to XML.
 

Copy and Move

The class is not copyable and not moveable.

 HTML (HTML &)=delete
 Deleted copy constructor.
 
 HTML (HTML &&)=delete
 Deleted move constructor.
 
HTML & operator= (HTML &)=delete
 Deleted copy operator.
 
HTML & operator= (HTML &&)=delete
 Deleted move operator.
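
The effect of the deleted members listed above can be checked at compile time. The following is a minimal sketch, not code from crawlserv++: HtmlLike is a hypothetical stand-in that mirrors only the special member functions declared on this page, so the example compiles without the library headers.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical stand-in (assumption): mirrors only the special member
// functions listed for crawlservpp::Parsing::HTML, nothing else.
class HtmlLike {
public:
    HtmlLike() = default;
    virtual ~HtmlLike() = default;

    HtmlLike(HtmlLike&) = delete;             // deleted copy constructor
    HtmlLike(HtmlLike&&) = delete;            // deleted move constructor
    HtmlLike& operator=(HtmlLike&) = delete;  // deleted copy operator
    HtmlLike& operator=(HtmlLike&&) = delete; // deleted move operator
};

// The class can still be default-constructed...
static_assert(std::is_default_constructible<HtmlLike>::value,
              "default construction is available");

// ...but it can be neither copied nor moved.
static_assert(!std::is_copy_constructible<HtmlLike>::value, "not copy-constructible");
static_assert(!std::is_move_constructible<HtmlLike>::value, "not move-constructible");
static_assert(!std::is_copy_assignable<HtmlLike>::value, "not copy-assignable");
static_assert(!std::is_move_assignable<HtmlLike>::value, "not move-assignable");

// Consequence for callers: hand instances around by reference (or pointer).
inline bool isSameObject(const HtmlLike& a, const HtmlLike& b) {
    return &a == &b;
}
```

In practice this means an instance has to be constructed where it is used and passed by reference, since it cannot be returned by value from a pre-C++17 factory function or stored by value in a container that relocates its elements.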
 

Detailed Description

Parses and cleans HTML markup.

Parses the provided HTML markup, tidies it up and converts it into XML using tidy-html5 via Wrapper::TidyDoc.

At the moment, this class is used exclusively by Parsing::XML::parse().

For more information about the tidy-html5 API, see its GitHub repository.

Constructor & Destructor Documentation

◆ HTML() [1/3]

crawlservpp::Parsing::HTML::HTML ( )
default

Default constructor.

◆ ~HTML()

virtual crawlservpp::Parsing::HTML::~HTML ( )
virtual default

Default destructor.

◆ HTML() [2/3]

crawlservpp::Parsing::HTML::HTML ( HTML &  )
delete

Deleted copy constructor.

◆ HTML() [3/3]

crawlservpp::Parsing::HTML::HTML ( HTML &&  )
delete

Deleted move constructor.

Member Function Documentation

◆ operator=() [1/2]

HTML& crawlservpp::Parsing::HTML::operator= ( HTML &  )
delete

Deleted copy operator.

◆ operator=() [2/2]

HTML& crawlservpp::Parsing::HTML::operator= ( HTML &&  )
delete

Deleted move operator.

◆ tidyAndConvert()

void crawlservpp::Parsing::HTML::tidyAndConvert ( std::string &  inOut,
bool  warnings,
ulong  numOfErrors,
std::queue< std::string > &  warningsTo 
)
inline

Parse and tidy the given HTML markup and convert the result to XML.

The markup will be parsed, cleaned and repaired by tidy-html5 with the following options set:

  • output-xml=yes
  • quiet=yes
  • numeric-entities=yes
  • tidy-mark=no
  • force-output=yes
  • drop-empty-elements=no
  • output-encoding=utf8 [default]

Additionally, show-warnings and show-errors will be set according to the arguments passed to the function.

Note
If the output returned from the underlying TidyDoc is empty, the given markup will not be changed.
Parameters
inOut  Reference to a string containing the HTML markup to parse, tidy up and convert. The string will be replaced with the resulting XML output, unless the result would be an empty string.
warnings  Specifies whether to add minor warnings to the given queue.
numOfErrors  Specifies the number used "to determine if further errors should be added" to the queue. If set to zero, no errors will be added.
warningsTo  Reference to a queue of strings to which the reported warnings and errors will be added.
Exceptions
HTML::Exception  if a TidyDoc::Exception has been thrown.
See also
Options Quick Reference

References crawlservpp::Wrapper::TidyDoc::cleanAndRepair(), crawlservpp::Wrapper::TidyDoc::getOutput(), crawlservpp::Wrapper::TidyDoc::parse(), crawlservpp::Wrapper::TidyDoc::setOption(), and crawlservpp::Parsing::tidyEncoding.

Referenced by crawlservpp::Parsing::XML::parse().
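
A caller of tidyAndConvert() typically passes the markup by reference and then drains the diagnostics queue. The sketch below illustrates that calling pattern only. The helper convertAndCollect() and the type StubParser are hypothetical names introduced for this example; StubParser does not call tidy-html5 and merely imitates the contract documented above (the string is replaced unless the output is empty, and diagnostics are pushed to the queue). The parameter type ulong is written as unsigned long here, which is an assumption about the underlying typedef.

```cpp
#include <cassert>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (assumption): works with any type that offers a
// member function with the signature documented for tidyAndConvert().
// The real function may throw HTML::Exception; that is not handled here.
template<class Parser>
std::vector<std::string> convertAndCollect(
        Parser& parser,
        std::string& markup,
        bool warnings,
        unsigned long numOfErrors) {
    std::queue<std::string> diagnostics;

    // markup is modified in place, diagnostics are appended to the queue
    parser.tidyAndConvert(markup, warnings, numOfErrors, diagnostics);

    // drain the queue so the caller can log or inspect the messages
    std::vector<std::string> drained;

    while(!diagnostics.empty()) {
        drained.push_back(std::move(diagnostics.front()));
        diagnostics.pop();
    }

    return drained;
}

// Hypothetical stub (assumption): imitates the documented contract
// without tidy-html5, so that the sketch runs stand-alone.
struct StubParser {
    std::string cannedOutput; // pretend output of the underlying TidyDoc

    void tidyAndConvert(
            std::string& inOut,
            bool warnings,
            unsigned long numOfErrors,
            std::queue<std::string>& warningsTo) {
        if(warnings) {
            warningsTo.push("stub warning");
        }

        if(numOfErrors > 0) {
            warningsTo.push("stub error");
        }

        // empty output leaves the given markup unchanged
        if(!cannedOutput.empty()) {
            inOut = cannedOutput;
        }
    }
};
```

With the real class, the same helper could presumably be called with a crawlservpp::Parsing::HTML instance in place of the stub, provided ulong is compatible with unsigned long on the target platform.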


The documentation for this class was generated from the following file: