crawlserv++
[under development]
Application for crawling and analyzing textual content of websites.
Parser for RFC 3986 URIs that can also analyze their relationships with each other.
#include <URI.hpp>
Classes | |
class | Exception |
Class for URI exceptions. | |
Getters | |
bool | isSameDomain () const |
Checks whether the parsed URI links to the current domain. | |
std::string | getSubUri () const |
Gets the sub-URI for the current URI. | |
std::string | getSubUri (const std::vector< std::string > &args, bool whiteList) const |
Gets the sub-URI for the current URI, filtering its query list. | |
Setters | |
void | setCurrentDomain (std::string_view currentDomain) |
Sets the current domain. | |
void | setCurrentOrigin (std::string_view baseUri) |
Sets the current origin. | |
Parsing | |
bool | parseLink (std::string_view uriToParse) |
Parses a link, either absolute or relative, into a sub-URI. | |
Static Helpers | |
static std::string | escape (std::string_view string, bool plusSpace) |
Public static helper function URI-escaping a string. | |
static std::string | unescape (std::string_view string, bool plusSpace) |
Public static helper function URI-unescaping a string. | |
static std::string | escapeUri (std::string_view uriToEscape) |
Public static helper function escaping a URI, but leaving reserved characters intact. | |
static void | makeAbsolute (std::string_view uriBase, std::vector< std::string > &uris) |
Public static helper function making a set of (possibly) relative URIs absolute. | |
Parses URIs, analyzes their relationship to other URIs and provides encoding (escaping) functionality.
static std::string escape (std::string_view string, bool plusSpace) |
inline static |
Public static helper function URI-escaping a string.
string | View of the string to escape. |
plusSpace | Specifies whether spaces should be escaped as plusses. |
References crawlservpp::Parsing::maxEscapedCharLength.
Referenced by escapeUri().
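To illustrate what the plusSpace flag changes, here is a minimal standalone sketch of percent-encoding in standard C++17. It is not the crawlserv++ implementation: the name escapeSketch, the set of characters left unescaped (the RFC 3986 unreserved characters), the upper-case hex digits, and the reserve factor of three are all assumptions of this sketch.

```cpp
#include <string>
#include <string_view>

// Hypothetical stand-in for URI::escape(); NOT the crawlserv++ implementation.
// Percent-encodes every byte that is not an RFC 3986 unreserved character.
std::string escapeSketch(std::string_view in, bool plusSpace) {
    static constexpr char hex[] = "0123456789ABCDEF";

    std::string out;
    out.reserve(in.size() * 3); // assumed worst case: every byte becomes "%XX"

    for(const char c : in) {
        const auto byte = static_cast<unsigned char>(c);
        const bool unreserved =
            (byte >= 'A' && byte <= 'Z') || (byte >= 'a' && byte <= 'z')
            || (byte >= '0' && byte <= '9')
            || byte == '-' || byte == '.' || byte == '_' || byte == '~';

        if(unreserved) {
            out.push_back(c);           // keep as-is
        }
        else if(byte == ' ' && plusSpace) {
            out.push_back('+');         // space written as a plus sign
        }
        else {
            out.push_back('%');         // generic "%XX" encoding
            out.push_back(hex[byte >> 4]);
            out.push_back(hex[byte & 0x0F]);
        }
    }

    return out;
}
```

Under these assumptions, escapeSketch("a b&c", true) yields "a+b%26c", while passing false for plusSpace yields "a%20b%26c".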
static std::string escapeUri (std::string_view uriToEscape) |
inline static |
Public static helper function escaping a URI, but leaving reserved characters intact.
The following characters will be left intact: ; / ? : @ = & # %
uriToEscape | View of the URI to be escaped. |
References crawlservpp::Helper::Strings::encodePercentage(), and escape().
Referenced by parseLink(), setCurrentDomain(), and setCurrentOrigin().
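The difference from escape() can be sketched as follows: the characters listed above pass through unchanged, so an already structured URI keeps its separators and existing percent-escapes. This is a standalone illustration in standard C++17, not the library's code; the names escapeUriSketch and isUnreservedSketch, and the assumption that everything else outside the RFC 3986 unreserved set is percent-encoded, are hypothetical.

```cpp
#include <string>
#include <string_view>

// Hypothetical helper: true for RFC 3986 unreserved characters.
bool isUnreservedSketch(unsigned char c) {
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
        || (c >= '0' && c <= '9')
        || c == '-' || c == '.' || c == '_' || c == '~';
}

// Hypothetical stand-in for URI::escapeUri(); NOT the crawlserv++ implementation.
std::string escapeUriSketch(std::string_view uri) {
    // the characters documented as being left intact
    constexpr std::string_view intact{";/?:@=&#%"};
    constexpr char hex[] = "0123456789ABCDEF";

    std::string out;
    out.reserve(uri.size());

    for(const char c : uri) {
        const auto byte = static_cast<unsigned char>(c);

        if(isUnreservedSketch(byte) || intact.find(c) != std::string_view::npos) {
            out.push_back(c); // separators and existing escapes survive
        }
        else {
            out.push_back('%');
            out.push_back(hex[byte >> 4]);
            out.push_back(hex[byte & 0x0F]);
        }
    }

    return out;
}
```

With this sketch, "/a b/?q=1&r=%20#x y" becomes "/a%20b/?q=1&r=%20#x%20y": only the two literal spaces are encoded.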
std::string getSubUri () const |
inline |
Gets the sub-URI for the current URI.
The current domain and origin need to be set before getting the sub-URI. A URI needs to be parsed as well.
URI::Exception | if no URI has been parsed or no domain has been either specified or parsed. |
Referenced by crawlservpp::Module::Crawler::Thread::onReset().
std::string getSubUri (const std::vector< std::string > &args, bool whiteList) const |
inline |
Gets the sub-URI for the current URI, filtering its query list.
The current domain and origin need to be set before getting the sub-URI. A URI needs to be parsed as well.
args | Vector containing the names of query list parameters to either ignore (if whiteList is false) or keep (if whiteList is true). |
whiteList | Specifies whether args is a white list or a black list of query list names. |
URI::Exception | if no domain has been specified or parsed, or no URI has been parsed. |
References crawlservpp::Wrapper::URIQueryList::getc(), crawlservpp::Wrapper::URI::getc(), crawlservpp::Wrapper::URIQueryList::getPtr(), unescape(), and crawlservpp::Wrapper::URI::valid().
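The effect of args and whiteList on a query list can be sketched in isolation. The following standalone C++17 function is a hypothetical illustration, not the library's code: the name filterQuerySketch is invented, it operates on a raw query string with '&' separators, and it skips the unescaping step that the real member function references.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical illustration of white-/black-list filtering of query parameters.
// Keeps a "name=value" pair if its name is listed (whiteList == true)
// or if its name is NOT listed (whiteList == false).
std::string filterQuerySketch(
        std::string_view query,
        const std::vector<std::string>& args,
        bool whiteList
) {
    std::string out;
    std::size_t pos = 0;

    while(pos <= query.size()) {
        std::size_t end = query.find('&', pos);

        if(end == std::string_view::npos) {
            end = query.size();
        }

        const std::string_view pair = query.substr(pos, end - pos);
        const std::string_view name = pair.substr(0, pair.find('='));

        const bool listed = std::any_of(
                args.begin(),
                args.end(),
                [name](const std::string& arg) {
                    return std::string_view(arg) == name;
                }
        );

        if(!pair.empty() && listed == whiteList) {
            if(!out.empty()) {
                out.push_back('&');
            }

            out.append(pair);
        }

        pos = end + 1; // continue after the separator
    }

    return out;
}
```

For the query "id=5&utm=x&page=2" and args containing "id" and "page", the sketch returns "id=5&page=2" when whiteList is true and "utm=x" when it is false.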
bool isSameDomain () const |
inline |
Checks whether the parsed URI links to the current domain.
URI::Exception | if no URI has been parsed. |
References crawlservpp::Wrapper::URI::getc(), and crawlservpp::Wrapper::URI::valid().
Referenced by crawlservpp::Module::Crawler::Thread::onReset().
static void makeAbsolute (std::string_view uriBase, std::vector< std::string > &uris) |
inline static |
Public static helper function making a set of (possibly) relative URIs absolute.
uriBase | View of the base URI. Only its host name will be used. |
uris | Reference to a vector containing the (possibly) relative URIs to be made absolute in-situ. |
URI::Exception | if the given base URI could not be parsed. |
References crawlservpp::Wrapper::URI::create(), crawlservpp::Wrapper::URI::get(), crawlservpp::Wrapper::URI::getc(), and crawlservpp::Wrapper::URI::valid().
Referenced by crawlservpp::Module::Crawler::Thread::onReset().
bool parseLink (std::string_view uriToParse) |
inline |
Parses a link, either absolute or relative, into a sub-URI.
Both domain and current origin need to be set before parsing a link.
The new sub-URI will be saved in-class.
uriToParse | View of the URI to parse into a sub-URI (beginning with a slash). |
URI::Exception | if no domain has been specified or parsed, no sub-URI has been previously parsed, an error occurred during parsing the URI, reference resolving failed, or the normalization of the URI failed. |
References crawlservpp::Wrapper::URI::clear(), crawlservpp::Wrapper::URI::create(), escapeUri(), crawlservpp::Wrapper::URI::get(), crawlservpp::Wrapper::URI::getc(), and crawlservpp::Helper::Strings::trim().
Referenced by crawlservpp::Module::Crawler::Thread::onReset().
void setCurrentDomain (std::string_view currentDomain) |
inline |
Sets the current domain.
currentDomain | View of the domain currently being used or of an empty string if the current website is cross-domain. |
References escapeUri().
Referenced by crawlservpp::Module::Crawler::Thread::onReset(), and setCurrentOrigin().
void setCurrentOrigin (std::string_view baseUri) |
inline |
Sets the current origin.
Links will be parsed originating from this URI. A domain needs to be set before setting the URI.
baseUri | View of the URI to be used as current origin. It should begin with a slash if it is a sub-URI or with the domain if the current website is cross-domain. |
URI::Exception | if no domain has been specified or parsed, the sub-URI is empty, the sub-URI does not start with a slash, or an error occurred during URI parsing. |
References crawlservpp::Wrapper::URI::create(), escapeUri(), crawlservpp::Wrapper::URI::get(), and setCurrentDomain().
Referenced by crawlservpp::Module::Crawler::Thread::onReset().
static std::string unescape (std::string_view string, bool plusSpace) |
inline static |
Public static helper function URI-unescaping a string.
string | View of the string to unescape. |
plusSpace | Specifies whether plusses should be unescaped to spaces. |
Referenced by getSubUri(), and crawlservpp::Module::Crawler::Thread::onReset().
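As a counterpart to escaping, the decoding direction can be sketched in standalone C++17. This is a hypothetical illustration rather than the crawlserv++ implementation: the names unescapeSketch and hexValueSketch are invented, and copying malformed "%" sequences through unchanged is an assumption of this sketch.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: value of a hex digit, or -1 if c is not a hex digit.
int hexValueSketch(char c) {
    if(c >= '0' && c <= '9') return c - '0';
    if(c >= 'a' && c <= 'f') return c - 'a' + 10;
    if(c >= 'A' && c <= 'F') return c - 'A' + 10;

    return -1;
}

// Hypothetical stand-in for URI::unescape(); NOT the crawlserv++ implementation.
std::string unescapeSketch(std::string_view in, bool plusSpace) {
    std::string out;
    out.reserve(in.size()); // decoding never makes the string longer

    for(std::size_t i = 0; i < in.size(); ++i) {
        if(in[i] == '%' && i + 2 < in.size()) {
            const int hi = hexValueSketch(in[i + 1]);
            const int lo = hexValueSketch(in[i + 2]);

            if(hi >= 0 && lo >= 0) {
                out.push_back(static_cast<char>(hi * 16 + lo));
                i += 2; // skip the two hex digits just consumed

                continue;
            }
        }

        // plus signs become spaces only if requested;
        // malformed escapes are copied through unchanged (assumption)
        out.push_back((in[i] == '+' && plusSpace) ? ' ' : in[i]);
    }

    return out;
}
```

With this sketch, unescapeSketch("a+b%26c", true) returns "a b&c", whereas with plusSpace set to false the plus sign is kept and the result is "a+b&c".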