|
crawlserv++
[under development]
Application for crawling and analyzing textual content of websites.
|
Parser for RFC 3986 URIs that can also analyze their relationships with each other.
#include <URI.hpp>
Classes | |
| class | Exception |
| Class for URI exceptions. | |
Getters | |
| bool | isSameDomain () const |
| Checks whether the parsed URI links to the current domain. | |
| std::string | getSubUri () const |
| Gets the sub-URI for the current URI. | |
| std::string | getSubUri (const std::vector< std::string > &args, bool whiteList) const |
| Gets the sub-URI for the current URI, filtering its query list. | |
Setters | |
| void | setCurrentDomain (std::string_view currentDomain) |
| Sets the current domain. | |
| void | setCurrentOrigin (std::string_view baseUri) |
| Sets the current origin. | |
Parsing | |
| bool | parseLink (std::string_view uriToParse) |
| Parses a link, either absolute or into a sub-URI. | |
Static Helpers | |
| static std::string | escape (std::string_view string, bool plusSpace) |
| Public static helper function URI-escaping a string. | |
| static std::string | unescape (std::string_view string, bool plusSpace) |
| Public static helper function URI-unescaping a string. | |
| static std::string | escapeUri (std::string_view uriToEscape) |
| Public static helper function escaping a URI, but leaving reserved characters intact. | |
| static void | makeAbsolute (std::string_view uriBase, std::vector< std::string > &uris) |
| Public static helper function making a set of (possibly) relative URIs absolute. | |
Parses URIs, analyzes their relationship to other URIs and provides encoding (escaping) functionality.
|
static std::string escape (std::string_view string, bool plusSpace) | inline, static |
Public static helper function URI-escaping a string.
| string | View of the string to escape. |
| plusSpace | Specifies whether spaces should be escaped as plusses. |
References crawlservpp::Parsing::maxEscapedCharLength.
Referenced by escapeUri().
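The effect of the plusSpace parameter can be illustrated with a minimal, standalone sketch of percent-encoding. The function escapeSketch() and its helper are hypothetical names used for illustration only; they are not the crawlserv++ implementation, and the set of characters left unescaped (the unreserved characters of RFC 3986) is an assumption.

```cpp
#include <string>
#include <string_view>

// Hypothetical helper: true for RFC 3986 "unreserved" characters (assumed pass-through set).
bool isUnreservedSketch(char c) {
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9')
            || c == '-' || c == '.' || c == '_' || c == '~';
}

// Hypothetical stand-in for an escaping helper; NOT the library's code.
std::string escapeSketch(std::string_view in, bool plusSpace) {
    constexpr char hexDigits[] = "0123456789ABCDEF";
    std::string out;

    out.reserve(in.size() * 3); // worst case: every byte becomes "%XX"

    for(const char c : in) {
        const auto byte = static_cast<unsigned char>(c);

        if(isUnreservedSketch(c)) {
            out.push_back(c);   // copy unreserved characters unchanged
        }
        else if(c == ' ' && plusSpace) {
            out.push_back('+'); // form-style encoding of spaces
        }
        else {
            out.push_back('%'); // percent-encode everything else
            out.push_back(hexDigits[byte >> 4]);
            out.push_back(hexDigits[byte & 0x0F]);
        }
    }

    return out;
}
```

With this sketch, a space becomes "+" when plusSpace is true and "%20" otherwise, while reserved characters such as "/" and "?" are always percent-encoded.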
|
static std::string escapeUri (std::string_view uriToEscape) | inline, static |
Public static helper function escaping a URI, but leaving reserved characters intact.
The following characters will be left intact: ; / ? : @ = & # %
| uriToEscape | View of the URI to be escaped. |
References crawlservpp::Helper::Strings::encodePercentage(), and escape().
Referenced by parseLink(), setCurrentDomain(), and setCurrentOrigin().
|
std::string getSubUri () const | inline |
Gets the sub-URI for the current URI.
The current domain and origin need to be set before getting the sub-URI. A URI needs to be parsed as well.
| URI::Exception | if no URI has been parsed or no domain has been either specified or parsed. |
Referenced by crawlservpp::Module::Crawler::Thread::onReset().
|
std::string getSubUri (const std::vector< std::string > &args, bool whiteList) const | inline |
Gets the sub-URI for the current URI, filtering its query list.
The current domain and origin need to be set before getting the sub-URI. A URI needs to be parsed as well.
| args | Vector containing the names of query list parameters to either ignore (if whiteList is false) or keep (if whiteList is true). |
| whiteList | Specifies whether args is a white list or a black list of query list names. |
| URI::Exception | if no domain has been specified or parsed, or no URI has been parsed. |
References crawlservpp::Wrapper::URIQueryList::getc(), crawlservpp::Wrapper::URI::getc(), crawlservpp::Wrapper::URIQueryList::getPtr(), unescape(), and crawlservpp::Wrapper::URI::valid().
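How args and whiteList interact can be shown with a small standalone sketch that filters a raw query string. The function filterQuerySketch() is a hypothetical name; it works on plain text, whereas the documented function operates on a parsed query list (crawlservpp::Wrapper::URIQueryList) and unescapes the names, which this sketch does not attempt.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical stand-in: keeps or drops "name=value" pairs by parameter name; NOT the library's code.
std::string filterQuerySketch(std::string_view query, const std::vector<std::string>& args, bool whiteList) {
    std::string out;
    std::size_t pos = 0;

    while(pos <= query.size()) {
        // find the end of the current pair (next '&' or end of the query)
        const std::size_t end = std::min(query.find('&', pos), query.size());
        const std::string_view pair = query.substr(pos, end - pos);
        const std::string_view name = pair.substr(0, pair.find('='));

        const bool listed = std::any_of(args.begin(), args.end(), [name](const std::string& arg) {
            return std::string_view(arg) == name;
        });

        // white list: keep listed names; black list: keep unlisted names
        if(!pair.empty() && listed == whiteList) {
            if(!out.empty()) {
                out.push_back('&');
            }

            out.append(pair);
        }

        pos = end + 1;
    }

    return out;
}
```

For example, with args containing only "utm_source", a black list (whiteList set to false) drops that parameter and keeps all others, while a white list keeps only that parameter.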
|
bool isSameDomain () const | inline |
Checks whether the parsed URI links to the current domain.
| URI::Exception | if no URI has been parsed. |
References crawlservpp::Wrapper::URI::getc(), and crawlservpp::Wrapper::URI::valid().
Referenced by crawlservpp::Module::Crawler::Thread::onReset().
|
static void makeAbsolute (std::string_view uriBase, std::vector< std::string > &uris) | inline, static |
Public static helper function making a set of (possibly) relative URIs absolute.
| uriBase | View of the base URI. Only its host name will be used. |
| uris | Reference to a vector containing the (possibly) relative URIs to be made absolute in-situ. |
| URI::Exception | if the given base URI could not be parsed. |
References crawlservpp::Wrapper::URI::create(), crawlservpp::Wrapper::URI::get(), crawlservpp::Wrapper::URI::getc(), and crawlservpp::Wrapper::URI::valid().
Referenced by crawlservpp::Module::Crawler::Thread::onReset().
|
bool parseLink (std::string_view uriToParse) | inline |
Parses a link, either absolute or into a sub-URI.
Both domain and current origin need to be set before parsing a link.
The new sub-URI will be saved in-class.
| uriToParse | View of the URI to parse into a sub-URI (beginning with a slash). |
| URI::Exception | if no domain has been specified or parsed, no sub-URI has been previously parsed, an error occurred while parsing the URI, reference resolving failed, or the normalization of the URI failed. |
References crawlservpp::Wrapper::URI::clear(), crawlservpp::Wrapper::URI::create(), escapeUri(), crawlservpp::Wrapper::URI::get(), crawlservpp::Wrapper::URI::getc(), and crawlservpp::Helper::Strings::trim().
Referenced by crawlservpp::Module::Crawler::Thread::onReset().
|
void setCurrentDomain (std::string_view currentDomain) | inline |
Sets the current domain.
| currentDomain | View of the domain currently being used or of an empty string if the current website is cross-domain. |
References escapeUri().
Referenced by crawlservpp::Module::Crawler::Thread::onReset(), and setCurrentOrigin().
|
void setCurrentOrigin (std::string_view baseUri) | inline |
Sets the current origin.
Links will be parsed originating from this URI. A domain needs to be set before setting the URI.
| baseUri | View of the URI to be used as current origin. It should begin with a slash if it is a sub-URI or with the domain if the current website is cross-domain. |
| URI::Exception | if no domain has been specified or parsed, the sub-URI is empty, the sub-URI does not start with a slash, or an error occurred during URI parsing. |
References crawlservpp::Wrapper::URI::create(), escapeUri(), crawlservpp::Wrapper::URI::get(), and setCurrentDomain().
Referenced by crawlservpp::Module::Crawler::Thread::onReset().
|
static std::string unescape (std::string_view string, bool plusSpace) | inline, static |
Public static helper function URI-unescaping a string.
| string | View of the string to unescape. |
| plusSpace | Specifies whether plusses should be unescaped to spaces. |
Referenced by getSubUri(), and crawlservpp::Module::Crawler::Thread::onReset().
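The reverse operation can be sketched in the same standalone fashion. The names unescapeSketch() and hexValueSketch() are hypothetical, and copying malformed escape sequences through unchanged is an assumption of this sketch, not documented behavior of the library.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: value of a hexadecimal digit, or -1 if the character is not one.
int hexValueSketch(char c) {
    if(c >= '0' && c <= '9') { return c - '0'; }
    if(c >= 'a' && c <= 'f') { return c - 'a' + 10; }
    if(c >= 'A' && c <= 'F') { return c - 'A' + 10; }

    return -1;
}

// Hypothetical stand-in for an unescaping helper; NOT the library's code.
std::string unescapeSketch(std::string_view in, bool plusSpace) {
    std::string out;

    out.reserve(in.size()); // the result is never longer than the input

    for(std::size_t i = 0; i < in.size(); ++i) {
        if(in[i] == '%' && i + 2 < in.size()) {
            const int high = hexValueSketch(in[i + 1]);
            const int low = hexValueSketch(in[i + 2]);

            if(high >= 0 && low >= 0) {
                out.push_back(static_cast<char>(high * 16 + low)); // decode "%XX"
                i += 2;                                            // skip both hex digits

                continue;
            }
        }

        if(in[i] == '+' && plusSpace) {
            out.push_back(' ');   // form-style decoding of plus signs
        }
        else {
            out.push_back(in[i]); // copy everything else, including malformed escapes
        }
    }

    return out;
}
```

In this sketch, a plus sign is only turned into a space when plusSpace is true; otherwise it is copied unchanged.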