crawlserv++  [under development]
Application for crawling and analyzing textual content of websites.
crawlservpp::Parsing::URI Class Reference

Parser for RFC 3986 URIs that can also analyze their relationships with each other.

#include <URI.hpp>

Classes

class  Exception
 Class for URI exceptions.
 

Getters

bool isSameDomain () const
 Checks whether the parsed URI links to the current domain.
 
std::string getSubUri () const
 Gets the sub-URI for the current URI.
 
std::string getSubUri (const std::vector< std::string > &args, bool whiteList) const
 Gets the sub-URI for the current URI, filtering its query list.
 

Setters

void setCurrentDomain (std::string_view currentDomain)
 Sets the current domain.
 
void setCurrentOrigin (std::string_view baseUri)
 Sets the current origin.
 

Parsing

bool parseLink (std::string_view uriToParse)
 Parses a link, either as an absolute URI or into a sub-URI.
 

Static Helpers

static std::string escape (std::string_view string, bool plusSpace)
 Public static helper function URI-escaping a string.
 
static std::string unescape (std::string_view string, bool plusSpace)
 Public static helper function URI-unescaping a string.
 
static std::string escapeUri (std::string_view uriToEscape)
 Public static helper function escaping a URI, but leaving reserved characters intact.
 
static void makeAbsolute (std::string_view uriBase, std::vector< std::string > &uris)
 Public static helper function making a set of (possibly) relative URIs absolute.
 

Detailed Description

Parser for RFC 3986 URIs that can also analyze their relationships with each other.

Parses URIs, analyzes their relationship to other URIs and provides encoding (escaping) functionality.

Member Function Documentation

◆ escape()

std::string crawlservpp::Parsing::URI::escape ( std::string_view  string,
bool  plusSpace 
)
inline static

Public static helper function URI-escaping a string.

Parameters
string  View of the string to escape.
plusSpace  Specifies whether spaces should be escaped as plus signs.
Returns
A copy of the escaped string.

References crawlservpp::Parsing::maxEscapedCharLength.

Referenced by escapeUri().
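The effect of plusSpace can be illustrated with a minimal, self-contained sketch. The function escapeSketch below is a hypothetical stand-in written for this example, not the crawlserv++ implementation; it assumes that every byte outside the RFC 3986 unreserved set is percent-encoded, and the library's exact behavior may differ.

```cpp
#include <string>
#include <string_view>

// Hypothetical stand-in for illustration only (not URI::escape itself).
// Percent-encodes every byte that is not an RFC 3986 unreserved character.
std::string escapeSketch(std::string_view in, bool plusSpace) {
    constexpr char hex[] = "0123456789ABCDEF";

    std::string out;
    out.reserve(in.size() * 3); // worst case: every byte becomes "%XX"

    for (const char c : in) {
        const auto byte = static_cast<unsigned char>(c);
        const bool unreserved =
            (byte >= 'A' && byte <= 'Z') || (byte >= 'a' && byte <= 'z')
            || (byte >= '0' && byte <= '9')
            || byte == '-' || byte == '.' || byte == '_' || byte == '~';

        if (unreserved) {
            out.push_back(c);
        }
        else if (byte == ' ' && plusSpace) {
            out.push_back('+'); // form-style encoding of spaces
        }
        else {
            out.push_back('%');
            out.push_back(hex[byte >> 4]);
            out.push_back(hex[byte & 0x0F]);
        }
    }

    return out;
}
```

With this sketch, escapeSketch("a b", true) yields "a+b", while escapeSketch("a b", false) yields "a%20b".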

◆ escapeUri()

std::string crawlservpp::Parsing::URI::escapeUri ( std::string_view  uriToEscape)
inlinestatic

Public static helper function escaping a URI, but leaving reserved characters intact.

The following characters will be left intact: ; / ? : @ = & # %

Parameters
uriToEscape  View of the URI to be escaped.
Returns
A copy of the escaped URI.

References crawlservpp::Helper::Strings::encodePercentage(), and escape().

Referenced by parseLink(), setCurrentDomain(), and setCurrentOrigin().
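The difference from escape() can be sketched as follows. The function escapeUriSketch is a hypothetical helper for illustration; it assumes that the listed characters and the RFC 3986 unreserved characters are copied unchanged while everything else is percent-encoded. How the library treats stray percent signs (see encodePercentage()) is not modeled here.

```cpp
#include <string>
#include <string_view>

// Hypothetical stand-in for illustration only (not URI::escapeUri itself).
// Leaves the characters ; / ? : @ = & # % and unreserved characters intact.
std::string escapeUriSketch(std::string_view uri) {
    constexpr std::string_view keep{";/?:@=&#%"};
    constexpr char hex[] = "0123456789ABCDEF";

    std::string out;
    out.reserve(uri.size() * 3); // worst case: every byte becomes "%XX"

    for (const char c : uri) {
        const auto byte = static_cast<unsigned char>(c);
        const bool unreserved =
            (byte >= 'A' && byte <= 'Z') || (byte >= 'a' && byte <= 'z')
            || (byte >= '0' && byte <= '9')
            || byte == '-' || byte == '.' || byte == '_' || byte == '~';

        if (unreserved || keep.find(c) != std::string_view::npos) {
            out.push_back(c); // structural characters survive unchanged
        }
        else {
            out.push_back('%');
            out.push_back(hex[byte >> 4]);
            out.push_back(hex[byte & 0x0F]);
        }
    }

    return out;
}
```

In this sketch, "/a b?x=1&y=2#top" becomes "/a%20b?x=1&y=2#top": only the space is encoded, and the path, query, and fragment delimiters keep their meaning.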

◆ getSubUri() [1/2]

std::string crawlservpp::Parsing::URI::getSubUri ( ) const
inline

Gets the sub-URI for the current URI.

The current domain and origin need to be set before getting the sub-URI. A URI needs to be parsed as well.

Returns
A copy of the sub-URI, including the domain if the current website is cross-domain.
Exceptions
URI::Exception  if no URI has been parsed, or no domain has been specified or parsed.
See also
setCurrentDomain, setCurrentOrigin, parseLink

Referenced by crawlservpp::Module::Crawler::Thread::onReset().

◆ getSubUri() [2/2]

std::string crawlservpp::Parsing::URI::getSubUri ( const std::vector< std::string > &  args,
bool  whiteList 
) const
inline

Gets the sub-URI for the current URI, filtering its query list.

The current domain and origin need to be set before getting the sub-URI. A URI needs to be parsed as well.

Parameters
args  Vector containing the names of query list parameters to either ignore (if whiteList is false) or keep (if whiteList is true).
whiteList  Specifies whether args is a white list or a black list of query list names.
Returns
A copy of the resulting sub-URI, including the domain if the current website is cross-domain.
Exceptions
URI::Exception  if no domain has been specified or parsed, or no URI has been parsed.
See also
setCurrentDomain, setCurrentOrigin, parseLink

References crawlservpp::Wrapper::URIQueryList::getc(), crawlservpp::Wrapper::URI::getc(), crawlservpp::Wrapper::URIQueryList::getPtr(), unescape(), and crawlservpp::Wrapper::URI::valid().
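The white list and black list semantics of args can be sketched on a raw query string. The function filterQuerySketch is a hypothetical helper written for this example; unlike the library, which works on a parsed query list and references unescape(), it compares parameter names as they appear, without unescaping them.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical stand-in for illustration only (not URI::getSubUri itself).
// Keeps the "name=value" pairs whose name is listed in args (white list)
// or those whose name is NOT listed in args (black list).
std::string filterQuerySketch(
        std::string_view query,
        const std::vector<std::string>& args,
        bool whiteList
) {
    std::string out;
    std::size_t pos = 0;

    while (pos <= query.size()) {
        // the current pair ends at the next '&' or at the end of the query
        const std::size_t end = std::min(query.find('&', pos), query.size());
        const std::string_view pair = query.substr(pos, end - pos);
        const std::string_view name = pair.substr(0, pair.find('='));

        const bool listed = std::any_of(
                args.begin(),
                args.end(),
                [&name](const std::string& arg) {
                    return std::string_view(arg) == name;
                }
        );

        // white list: keep listed names; black list: keep unlisted names
        if (!pair.empty() && listed == whiteList) {
            if (!out.empty()) {
                out.push_back('&');
            }

            out.append(pair);
        }

        pos = end + 1;
    }

    return out;
}
```

For the query "id=7&sid=abc&page=2", passing {"sid"} with whiteList set to false drops the session parameter and yields "id=7&page=2"; passing the same vector with whiteList set to true yields "sid=abc".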

◆ isSameDomain()

bool crawlservpp::Parsing::URI::isSameDomain ( ) const
inline

Checks whether the parsed URI links to the current domain.

Returns
True, if the parsed URI links to the current domain or the current website is cross-domain. False otherwise.
Exceptions
URI::Exception  if no URI has been parsed.

References crawlservpp::Wrapper::URI::getc(), and crawlservpp::Wrapper::URI::valid().

Referenced by crawlservpp::Module::Crawler::Thread::onReset().

◆ makeAbsolute()

void crawlservpp::Parsing::URI::makeAbsolute ( std::string_view  uriBase,
std::vector< std::string > &  uris 
)
inline static

Public static helper function making a set of (possibly) relative URIs absolute.

Note
Errors for single URIs will be ignored and those URIs will be quietly removed.
Parameters
uriBase  View of the base URI. Only its host name will be used.
uris  Reference to a vector containing the (possibly) relative URIs to be made absolute in-situ.
Exceptions
URI::Exception  if the given base URI could not be parsed.

References crawlservpp::Wrapper::URI::create(), crawlservpp::Wrapper::URI::get(), crawlservpp::Wrapper::URI::getc(), and crawlservpp::Wrapper::URI::valid().

Referenced by crawlservpp::Module::Crawler::Thread::onReset().

◆ parseLink()

bool crawlservpp::Parsing::URI::parseLink ( std::string_view  uriToParse)
inline

Parses a link, either as an absolute URI or into a sub-URI.

Both domain and current origin need to be set before parsing a link.

The new sub-URI will be saved in-class.

Parameters
uriToParse  View of the URI to parse into a sub-URI (beginning with a slash).
Returns
True, if the parsing was successful. False, if the given string is empty.
Exceptions
URI::Exception  if no domain has been specified or parsed, no sub-URI has been previously parsed, an error occurred while parsing the URI, reference resolving failed, or the normalization of the URI failed.
See also
setCurrentDomain, setCurrentOrigin

References crawlservpp::Wrapper::URI::clear(), crawlservpp::Wrapper::URI::create(), escapeUri(), crawlservpp::Wrapper::URI::get(), crawlservpp::Wrapper::URI::getc(), and crawlservpp::Helper::Strings::trim().

Referenced by crawlservpp::Module::Crawler::Thread::onReset().

◆ setCurrentDomain()

void crawlservpp::Parsing::URI::setCurrentDomain ( std::string_view  currentDomain)
inline

Sets the current domain.

Parameters
currentDomain  View of the domain currently being used or of an empty string if the current website is cross-domain.

References escapeUri().

Referenced by crawlservpp::Module::Crawler::Thread::onReset(), and setCurrentOrigin().

◆ setCurrentOrigin()

void crawlservpp::Parsing::URI::setCurrentOrigin ( std::string_view  baseUri)
inline

Sets the current origin.

Links will be parsed originating from this URI. A domain needs to be set before setting the URI.

Parameters
baseUri  View of the URI to be used as current origin. It should begin with a slash if it is a sub-URI or with the domain if the current website is cross-domain.
Exceptions
URI::Exception  if no domain has been specified or parsed, the sub-URI is empty, the sub-URI does not start with a slash, or an error occurred during URI parsing.
See also
setCurrentDomain

References crawlservpp::Wrapper::URI::create(), escapeUri(), crawlservpp::Wrapper::URI::get(), and setCurrentDomain().

Referenced by crawlservpp::Module::Crawler::Thread::onReset().

◆ unescape()

std::string crawlservpp::Parsing::URI::unescape ( std::string_view  string,
bool  plusSpace 
)
inline static

Public static helper function URI-unescaping a string.

Parameters
string  View of the string to unescape.
plusSpace  Specifies whether plus signs should be unescaped to spaces.
Returns
A copy of the unescaped string.

Referenced by getSubUri(), and crawlservpp::Module::Crawler::Thread::onReset().
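The reverse direction can be sketched in the same way. The function unescapeSketch and its helper hexValueSketch are hypothetical names introduced for this example, not the crawlserv++ implementation; the sketch assumes that malformed escape sequences are copied through unchanged, which may differ from the library's behavior.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: returns the value of a hex digit, or -1 if c is none.
int hexValueSketch(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;

    return -1;
}

// Hypothetical stand-in for illustration only (not URI::unescape itself).
// Decodes "%XX" sequences and, if plusSpace is set, turns '+' into ' '.
std::string unescapeSketch(std::string_view in, bool plusSpace) {
    std::string out;
    out.reserve(in.size()); // the result is never longer than the input

    for (std::size_t i = 0; i < in.size(); ++i) {
        const char c = in[i];

        if (c == '%' && i + 2 < in.size()) {
            const int hi = hexValueSketch(in[i + 1]);
            const int lo = hexValueSketch(in[i + 2]);

            if (hi >= 0 && lo >= 0) {
                out.push_back(static_cast<char>((hi << 4) | lo));
                i += 2; // skip the two hex digits just consumed

                continue;
            }
        }

        // not a valid escape sequence: copy, optionally mapping '+' to ' '
        out.push_back((c == '+' && plusSpace) ? ' ' : c);
    }

    return out;
}
```

In this sketch, "a+b%20c" decodes to "a b c" when plusSpace is true and to "a+b c" when it is false.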


The documentation for this class was generated from the following file: