|
crawlserv++
[under development]
Application for crawling and analyzing textual content of websites.
|
Provides an interface to the libcurl library for sending and receiving data over the network.
#include <Curl.hpp>
Classes | |
| class | Exception |
Class for libcurl exceptions. More... | |
Construction and Destruction | |
| Curl (std::string_view cookieDirectory, const NetworkSettings &setNetworkSettings) | |
| Constructor setting the cookie directory and the network options. More... | |
| virtual | ~Curl ()=default |
| Default destructor. More... | |
Setters | |
| void | setConfigGlobal (const Config &globalConfig, bool limited, std::queue< std::string > &warningsTo) |
| Sets the network options for the connection according to the given configuration. More... | |
| void | setConfigCurrent (const Config &currentConfig) |
| Sets temporary network options for the connection according to the given configuration. More... | |
| void | setCookies (const std::string &cookies) |
| Sets custom cookies. More... | |
| void | setHeaders (const std::vector< std::string > &customHeaders) |
| Sets custom HTTP headers. More... | |
| void | setVerbose (bool isVerbose) |
| Forces libcurl into or out of verbose mode. More... | |
| void | unsetCookies () |
| Unsets custom cookies previously set. More... | |
| void | unsetHeaders () |
| Unsets custom HTTP headers previously set. More... | |
Getters | |
| void | getContent (std::string_view url, bool usePost, std::string &contentTo, const std::vector< std::uint32_t > &errors) |
| Uses the connection to get content by sending an HTTP request to the specified URL. More... | |
| std::uint32_t | getResponseCode () const noexcept |
| Gets the response code of the HTTP reply received last. More... | |
| std::string | getContentType () const noexcept |
| Gets the content type of the HTTP reply received last. More... | |
| CURLcode | getCurlCode () const noexcept |
| Gets the libcurl return code received from the last API call. More... | |
| std::string | getPublicIp () |
| Uses the connection to determine its public IP address. More... | |
Reset | |
| void | resetConnection (std::uint64_t sleepForMilliseconds, const IsRunningCallback &isRunningCallback) |
| Resets the connection. More... | |
URL Encoding | |
| std::string | escape (const std::string &stringToEscape, bool usePlusForSpace) |
| URL encodes the given string. More... | |
| std::string | unescape (const std::string &escapedString, bool usePlusForSpace) |
| URL decodes the given string. More... | |
| std::string | escapeUrl (std::string_view urlToEscape) |
| URL encodes the given string while leaving reserved characters (; / ? : @ = & #) intact. More... | |
Copy and Move | |
| Curl (Curl &)=delete | |
| Deleted copy constructor. More... | |
| Curl & | operator= (Curl &)=delete |
| Deleted copy assignment operator. More... | |
| Curl (Curl &&)=delete | |
| Deleted move constructor. More... | |
| Curl & | operator= (Curl &&)=delete |
| Deleted move assignment operator. More... | |
Helper | |
| static std::string | curlStringToString (char *curlString) |
| Copies the given libcurl string into a std::string and releases its memory. More... | |
Header Handling | |
| static int | header (char *data, std::size_t size, std::size_t nitems, void *thisPtr) |
| Static header function to handle incoming header data. More... | |
| int | headerInClass (char *data, std::size_t size) |
| In-class header function to handle incoming header data. More... | |
Writers | |
| static int | writer (char *data, std::size_t size, std::size_t nmemb, void *thisPtr) |
| Static writer function to handle incoming network data. More... | |
| int | writerInClass (char *data, std::size_t size) |
| In-class writer function to handle incoming network data. More... | |
Provides an interface to the libcurl library for sending and receiving data over the network.
This class is used by both the crawler and the extractor.
It is not thread-safe, so each thread needs its own instance.
Internally, the class uses Wrapper::Curl to interface with the libcurl library.
For more information about the libcurl library, see its website.
|
inline |
Constructor setting the cookie directory and the network options.
Initializes libcurl and sets some basic global default options like the write function, which is used to handle incoming network traffic (and is provided by the class).
| cookieDirectory | The path to the directory in which cookies will be saved. |
| setNetworkSettings | The network options for the connection represented by this instance. |
| Curl::Exception | if the API could not be initialized, the used libcurl library does not support SSL, or the initial options could not be set. |
References CURL_VERSION_SSL, crawlservpp::Wrapper::Curl::get(), header(), crawlservpp::Wrapper::Curl::valid(), and writer().
|
virtualdefault |
Default destructor.
|
delete |
Deleted copy constructor.
|
delete |
Deleted move constructor.
|
inlinestaticprotected |
Copies the given libcurl string into a std::string and releases its memory.
Afterwards curlString will be invalid and its memory freed.
If curlString is a nullptr it will be ignored.
curlString must be nullptr or a valid libcurl string. Otherwise the program may crash and the memory may be corrupted.
| curlString | A pointer to a valid libcurl string or nullptr. |
Referenced by escape(), escapeUrl(), and unescape().
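The copy-then-release pattern described for curlStringToString() can be sketched without libcurl. This is only an illustration under stated assumptions: std::malloc() and std::free() stand in for libcurl's allocation and curl_free(), the names takeOwnership(), releaseRaw() and duplicateRaw() are hypothetical, and returning an empty string for nullptr is an assumption of the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <memory>
#include <string>

// Stand-in for curl_free(): the raw strings in this sketch come from std::malloc().
inline void releaseRaw(char* raw) noexcept {
    std::free(raw);
}

// Copies a raw C string into a std::string and releases the original buffer.
// A nullptr is ignored and yields an empty string (assumption of this sketch).
inline std::string takeOwnership(char* raw) {
    if (raw == nullptr) {
        return std::string();
    }

    // The guard releases the buffer even if constructing the std::string throws.
    const std::unique_ptr<char, decltype(&releaseRaw)> guard(raw, &releaseRaw);

    return std::string(raw);
}

// Hypothetical helper that allocates a raw string the way a C library might.
inline char* duplicateRaw(const char* text) {
    assert(text != nullptr);

    const std::size_t length = std::strlen(text) + 1;
    char* copy = static_cast<char*>(std::malloc(length));

    if (copy != nullptr) {
        std::memcpy(copy, text, length);
    }

    return copy;
}
```

After takeOwnership(duplicateRaw("a%20b")) returns, the raw buffer has been released and must not be used again, which mirrors the warning above.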
|
inline |
URL encodes the given string.
The libcurl library needs to be successfully initialized for URL encoding, except for an empty string.
| stringToEscape | Const reference to the string to be encoded. |
| usePlusForSpace | States whether to convert spaces to + instead of %20. |
| Curl::Exception | if the libcurl library has not been initialized. |
References curlStringToString(), crawlservpp::Network::encodedSpace, crawlservpp::Network::encodedSpaceLength, crawlservpp::Wrapper::Curl::get(), and crawlservpp::Wrapper::Curl::valid().
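One way to obtain the usePlusForSpace behaviour is to post-process an already percent-encoded string. The sketch below assumes that the input was produced by an encoder that escapes literal + and % characters (as libcurl's curl_easy_escape() does), so every remaining %20 really is a space. The function name encodedSpacesToPlus() is hypothetical and this is not necessarily how the class implements the option.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Replaces every "%20" in an already percent-encoded string with "+".
inline std::string encodedSpacesToPlus(std::string encoded) {
    constexpr std::string_view encodedSpace{"%20"};
    std::size_t pos = 0;

    while ((pos = encoded.find(encodedSpace, pos)) != std::string::npos) {
        encoded.replace(pos, encodedSpace.size(), "+");

        ++pos; // continue searching after the inserted "+"
    }

    return encoded;
}
```

Because a literal plus sign has already been encoded as %2B at this point, the replacement cannot be confused with original + characters.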
|
inline |
URL encodes the given string while leaving reserved characters (; / ? : @ = & #) intact.
The function will copy those parts of the string that need to be escaped and use the libcurl library to escape them.
Leaves the characters ; / ? : @ = & # unchanged in the resulting string.
The libcurl library needs to be successfully initialized for URL encoding, except for an empty string (or a string containing only reserved characters).
| urlToEscape | A view of the string containing the URL to be encoded. |
| Curl::Exception | if the libcurl library has not been initialized. |
References curlStringToString(), crawlservpp::Wrapper::Curl::get(), crawlservpp::Network::reservedCharacters, and crawlservpp::Wrapper::Curl::valid().
Referenced by writerInClass().
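The approach described for escapeUrl(), encoding only the parts between reserved characters, can be sketched as follows. This is a minimal sketch, not the class's implementation: a small hand-written percent-encoder (appendEncoded(), hypothetical) stands in for the libcurl call, and escapeUrlSketch() is likewise a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// RFC 3986 unreserved characters are never percent-encoded.
inline bool isUnreserved(unsigned char c) {
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
            || (c >= '0' && c <= '9')
            || c == '-' || c == '.' || c == '_' || c == '~';
}

// Percent-encodes one segment and appends it (stand-in for the libcurl call).
inline void appendEncoded(std::string_view segment, std::string& to) {
    constexpr char hex[] = "0123456789ABCDEF";

    for (const char ch : segment) {
        const auto c = static_cast<unsigned char>(ch);

        if (isUnreserved(c)) {
            to.push_back(ch);
        }
        else {
            to.push_back('%');
            to.push_back(hex[c >> 4]);
            to.push_back(hex[c & 0x0F]);
        }
    }
}

// Encodes a URL segment by segment, copying the reserved characters unchanged.
inline std::string escapeUrlSketch(std::string_view url) {
    constexpr std::string_view reserved{";/?:@=&#"};
    std::string result;
    std::size_t pos = 0;

    while (pos < url.size()) {
        const std::size_t end = url.find_first_of(reserved, pos);

        if (end == std::string_view::npos) {
            appendEncoded(url.substr(pos), result); // last segment
            break;
        }

        appendEncoded(url.substr(pos, end - pos), result);
        result.push_back(url[end]); // reserved character stays intact

        pos = end + 1;
    }

    return result;
}
```

Note that this sketch encodes a literal % as %25, so passing it an already encoded URL would encode it twice.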
|
inline |
Uses the connection to get content by sending an HTTP request to the specified URL.
When using HTTP POST, the data to be sent is determined in the same way as for an HTTP GET request: it is taken from the part of the given URL after the first question mark (?).
If no question mark is present, no additional data will be sent along with the HTTP POST request.
Before sending the request, the given URL will be encoded while keeping reserved characters intact.
The response code and the content type of the reply are saved and can be retrieved with getResponseCode() and getContentType().
After a successful request, replies encoded in ISO-8859-1 will be converted to UTF-8 and invalid UTF-8 sequences will be removed.
| url | A view of the string containing the URL to request. |
| usePost | States whether to use HTTP POST instead of HTTP GET on this request. |
| contentTo | Reference to a string in which the received content will be stored. |
| errors | Vector of HTTP error codes which will be handled by throwing an exception, except if the error code is also present in the X-ts header returned by the host. |
| Curl::Exception | if setting the necessary options failed, the HTTP request could not be sent, information about the reply could not be retrieved or any of the specified HTTP error codes has been received. |
References crawlservpp::Wrapper::Curl::get().
Referenced by getPublicIp(), crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Crawler::Thread::onReset().
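The rule that POST data is taken from behind the first question mark can be illustrated with a small helper. This is a sketch only: the name splitAtQuery() is hypothetical, and whether the class sends the request to the URL with or without the query part is not stated above and is not implied here.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <utility>

// Splits a URL at the first '?' into the part before it and the data behind it.
// Without a question mark, the data part is empty.
inline std::pair<std::string_view, std::string_view> splitAtQuery(std::string_view url) {
    const std::size_t mark = url.find('?');

    if (mark == std::string_view::npos) {
        return {url, std::string_view{}};
    }

    return {url.substr(0, mark), url.substr(mark + 1)};
}
```

Only the first question mark is significant; any further question marks remain part of the data.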
|
inlinenoexcept |
Gets the content type of the HTTP reply received last.
Referenced by crawlservpp::Module::Crawler::Thread::onReset().
|
inlinenoexcept |
Gets the libcurl return code received from the last API call.
Use this function to determine which error occurred after another call to this class failed.
Returns the libcurl return code. See also the libcurl error codes.
Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Crawler::Thread::onReset().
|
inline |
Uses the connection to determine its public IP address.
Requests the public IP address of the connection from an external URL defined inside this function.
References getContent(), crawlservpp::Network::getPublicIpErrors, crawlservpp::Network::getPublicIpFrom, and crawlservpp::Main::Exception::view().
Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Crawler::Thread::onReset().
|
inlinenoexcept |
Gets the response code of the HTTP reply received last.
Referenced by crawlservpp::Module::Crawler::Thread::onReset().
|
inlinestaticprotected |
Static header function to handle incoming header data.
If thisPtr is not nullptr, the function will forward the incoming header data without change to the headerInClass function.
| data | Pointer to the incoming header data. |
| size | Always 1. |
| nitems | The size of the incoming header data. |
| thisPtr | Pointer to the instance of the Curl class. |
References headerInClass().
Referenced by Curl(), and resetConnection().
|
inlineprotected |
In-class header function to handle incoming header data.
The function will check for an X-ts header and save its value.
| data | Pointer to the incoming data. |
| size | The size of the incoming header data. |
References crawlservpp::Network::xTsHeaderName, and crawlservpp::Network::xTsHeaderNameLen.
Referenced by header().
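Checking a single incoming header line for X-ts might look like the sketch below. The details are assumptions: the class's actual matching (via xTsHeaderName and xTsHeaderNameLen) is not shown on this page, the sketch compares the name case-insensitively and trims surrounding whitespace, and parseXTs() is a hypothetical name.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Returns the value of an "X-ts" header line, or std::nullopt for other lines.
inline std::optional<std::string> parseXTs(std::string_view line) {
    constexpr std::string_view name{"x-ts:"};

    if (line.size() < name.size()) {
        return std::nullopt;
    }

    // Compare the header name case-insensitively (assumption of this sketch).
    for (std::size_t i = 0; i < name.size(); ++i) {
        if (std::tolower(static_cast<unsigned char>(line[i])) != name[i]) {
            return std::nullopt;
        }
    }

    std::string_view value = line.substr(name.size());
    const auto isSpace = [](char c) {
        return c == ' ' || c == '\t' || c == '\r' || c == '\n';
    };

    // Trim leading whitespace and the trailing line break.
    while (!value.empty() && isSpace(value.front())) {
        value.remove_prefix(1);
    }

    while (!value.empty() && isSpace(value.back())) {
        value.remove_suffix(1);
    }

    return std::string(value);
}
```

Inside a callback with the (char *data, std::size_t size) signature, the line could be passed as std::string_view(data, size), since header data is not guaranteed to be null-terminated.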
|
inline |
Resets the connection.
After cleaning up the connection, the function waits for the specified sleep time, but regularly checks whether the application is still running so that its shutdown is not delayed considerably.
It then resets the configuration passed to setConfigGlobal().
| sleepForMilliseconds | Time to wait in milliseconds before re-establishing the connection. |
| isRunningCallback | Constant reference to a callback function (or lambda) which returns whether the application is still running. |
| Curl::Exception | if any connection option could not be (re-)set. |
References crawlservpp::Network::checkEveryMilliseconds, crawlservpp::Wrapper::Curl::clear(), crawlservpp::Wrapper::CurlList::clear(), crawlservpp::Wrapper::Curl::get(), header(), crawlservpp::Wrapper::Curl::init(), crawlservpp::Helper::DateTime::now(), setConfigGlobal(), and writer().
Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Crawler::Thread::onReset().
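The waiting behaviour described for resetConnection() can be sketched as a sleep that is split into short intervals. Assumptions: IsRunningCallback is modelled as std::function<bool()>, the check interval is passed as a parameter (the actual value of checkEveryMilliseconds is not given here), and sleepWhileRunning() is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <functional>
#include <thread>

// Sleeps for the given time, but checks the callback at regular intervals.
// Returns true if the full time has passed, false if the callback reported
// that the application is no longer running.
inline bool sleepWhileRunning(
        std::uint64_t sleepForMilliseconds,
        std::uint64_t checkEveryMilliseconds,
        const std::function<bool()>& isRunning
) {
    assert(static_cast<bool>(isRunning)); // an empty callback cannot be called

    using Clock = std::chrono::steady_clock;
    using Milliseconds = std::chrono::milliseconds;

    // Use at least one millisecond per step to avoid a busy loop.
    const Milliseconds step(static_cast<Milliseconds::rep>(
            std::max<std::uint64_t>(checkEveryMilliseconds, 1)
    ));
    const auto deadline = Clock::now()
            + Milliseconds(static_cast<Milliseconds::rep>(sleepForMilliseconds));

    while (true) {
        if (!isRunning()) {
            return false; // shutdown requested: stop waiting immediately
        }

        const Clock::time_point now = Clock::now();

        if (now >= deadline) {
            return true;
        }

        // Sleep until the next check or the deadline, whichever comes first.
        std::this_thread::sleep_until(std::min(deadline, now + step));
    }
}
```

With this structure, a shutdown request is noticed after at most one check interval instead of after the full sleep time.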
|
inline |
Sets temporary network options for the connection according to the given configuration.
Only uses Config::cookiesOverwrite from the given configuration to add or manipulate cookies already set.
| currentConfig | The network configuration to be used. |
| Curl::Exception |
References crawlservpp::Network::Config::Entries::cookiesOverwrite, and crawlservpp::Network::Config::networkConfig.
Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Crawler::Thread::onReset().
|
inline |
Sets the network options for the connection according to the given configuration.
Warnings might include options set, but not supported by the available version of the libcurl library.
| globalConfig | A network configuration. |
| limited | Indicates whether the settings will have only limited effect (see below). |
| warningsTo | Reference to a queue of strings that will be filled with warnings if they occur. |
| Curl::Exception | if any of the options could not be set. |
References crawlservpp::Wrapper::CurlList::append(), crawlservpp::Network::authTypeTlsSrp, crawlservpp::Network::Config::Entries::connectionsMax, crawlservpp::Helper::FileSystem::contains(), crawlservpp::Network::Config::Entries::contentLengthIgnore, crawlservpp::Network::Config::Entries::cookies, crawlservpp::Network::Config::Entries::cookiesLoad, crawlservpp::Network::Config::Entries::cookiesSave, crawlservpp::Network::Config::Entries::cookiesSession, crawlservpp::Network::Config::Entries::cookiesSet, CURL_HET_DEFAULT, CURL_HTTP_VERSION_2_0, CURL_HTTP_VERSION_2_PRIOR_KNOWLEDGE, CURL_HTTP_VERSION_2TLS, CURL_HTTP_VERSION_3, CURL_VERSION_BROTLI, CURL_VERSION_HTTP2, CURL_VERSION_HTTP3, CURL_VERSION_LIBZ, CURL_VERSION_TLSAUTH_SRP, CURL_VERSION_ZSTD, CURLOPT_DNS_SHUFFLE_ADDRESSES, CURLOPT_DOH_URL, CURLOPT_HAPPY_EYEBALLS_TIMEOUT_MS, CURLOPT_PRE_PROXY, CURLOPT_PROXY_SSL_VERIFYHOST, CURLOPT_PROXY_SSL_VERIFYPEER, CURLOPT_PROXY_TLSAUTH_PASSWORD, CURLOPT_PROXY_TLSAUTH_TYPE, CURLOPT_PROXY_TLSAUTH_USERNAME, CURLOPT_TCP_FASTOPEN, crawlservpp::Struct::NetworkSettings::defaultProxy, crawlservpp::Network::Config::Entries::dnsCacheTimeOut, crawlservpp::Network::Config::Entries::dnsDoH, crawlservpp::Network::Config::Entries::dnsInterface, crawlservpp::Network::Config::Entries::dnsResolves, crawlservpp::Network::Config::Entries::dnsServers, crawlservpp::Network::Config::Entries::dnsShuffle, crawlservpp::Network::Config::Entries::encodingBr, crawlservpp::Network::Config::Entries::encodingDeflate, crawlservpp::Network::Config::Entries::encodingGZip, crawlservpp::Network::Config::Entries::encodingIdentity, crawlservpp::Network::Config::Entries::encodingTransfer, crawlservpp::Network::Config::Entries::encodingZstd, crawlservpp::Helper::FileSystem::getPathSeparator(), crawlservpp::Network::Config::Entries::headers, crawlservpp::Network::Config::Entries::http200Aliases, crawlservpp::Network::Config::Entries::httpVersion, crawlservpp::Network::httpVersion1, 
crawlservpp::Network::httpVersion11, crawlservpp::Network::httpVersion2, crawlservpp::Network::httpVersion2Only, crawlservpp::Network::httpVersion2Tls, crawlservpp::Network::httpVersion3Only, crawlservpp::Network::httpVersionAny, crawlservpp::Network::Config::Entries::localInterface, crawlservpp::Network::Config::Entries::localPort, crawlservpp::Network::Config::Entries::localPortRange, crawlservpp::Network::Config::networkConfig, crawlservpp::Network::Config::Entries::noReUse, crawlservpp::Network::Config::Entries::protocol, crawlservpp::Network::Config::Entries::proxy, crawlservpp::Network::Config::Entries::proxyAuth, crawlservpp::Network::Config::Entries::proxyHeaders, crawlservpp::Network::Config::Entries::proxyPre, crawlservpp::Network::Config::Entries::proxyTlsSrpPassword, crawlservpp::Network::Config::Entries::proxyTlsSrpUser, crawlservpp::Network::Config::Entries::proxyTunnelling, crawlservpp::Network::Config::Entries::redirect, crawlservpp::Network::Config::Entries::redirectMax, crawlservpp::Network::Config::Entries::redirectPost301, crawlservpp::Network::Config::Entries::redirectPost302, crawlservpp::Network::Config::Entries::redirectPost303, crawlservpp::Network::Config::Entries::referer, crawlservpp::Network::Config::Entries::refererAutomatic, setCookies(), crawlservpp::Network::Config::Entries::speedDownLimit, crawlservpp::Network::Config::Entries::speedLowLimit, crawlservpp::Network::Config::Entries::speedLowTime, crawlservpp::Network::Config::Entries::speedUpLimit, crawlservpp::Network::Config::Entries::sslVerifyHost, crawlservpp::Network::Config::Entries::sslVerifyPeer, crawlservpp::Network::Config::Entries::sslVerifyProxyHost, crawlservpp::Network::Config::Entries::sslVerifyProxyPeer, crawlservpp::Network::Config::Entries::sslVerifyStatus, crawlservpp::Network::Config::Entries::tcpFastOpen, crawlservpp::Network::Config::Entries::tcpKeepAlive, crawlservpp::Network::Config::Entries::tcpKeepAliveIdle, 
crawlservpp::Network::Config::Entries::tcpKeepAliveInterval, crawlservpp::Network::Config::Entries::tcpNagle, crawlservpp::Network::Config::Entries::timeOut, crawlservpp::Network::Config::Entries::timeOutHappyEyeballs, crawlservpp::Network::Config::Entries::timeOutRequest, crawlservpp::Network::Config::Entries::tlsSrpPassword, crawlservpp::Network::Config::Entries::tlsSrpUser, crawlservpp::Network::Config::Entries::userAgent, crawlservpp::Wrapper::Curl::valid(), crawlservpp::Network::Config::Entries::verbose, crawlservpp::Network::versionBrotli, crawlservpp::Network::versionDnsShuffle, crawlservpp::Network::versionDoH, crawlservpp::Network::versionHappyEyeballs, crawlservpp::Network::versionHttp2, crawlservpp::Network::versionHttp2Only, crawlservpp::Network::versionHttp2Tls, crawlservpp::Network::versionHttp3Only, crawlservpp::Network::versionPreProxy, crawlservpp::Network::versionProxySslVerify, crawlservpp::Network::versionProxyTlsAuth, crawlservpp::Network::versionTcpFastOpen, and crawlservpp::Network::versionZstd.
Referenced by crawlservpp::Module::Extractor::Thread::onReset(), crawlservpp::Module::Crawler::Thread::onReset(), and resetConnection().
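How a warning about an unsupported option could end up in the warningsTo queue is sketched below. Everything here is illustrative: packVersion() and isSupported() are hypothetical names, the warning text is made up, and the option name used in the usage note is hypothetical. The only library fact relied on is that libcurl encodes its version number as 0xXXYYZZ (major, minor, patch).

```cpp
#include <cassert>
#include <cstdint>
#include <queue>
#include <string>
#include <string_view>
#include <utility>

// Packs a version as 0xXXYYZZ, the layout libcurl uses for its version number.
constexpr std::uint32_t packVersion(std::uint32_t major, std::uint32_t minor, std::uint32_t patch) {
    return (major << 16) | (minor << 8) | patch;
}

// Returns whether an option is supported by the available library version.
// If it is not, a warning is added to the queue instead of throwing.
inline bool isSupported(
        std::string_view option,
        std::uint32_t requiredVersion,
        std::uint32_t availableVersion,
        std::queue<std::string>& warningsTo
) {
    if (availableVersion >= requiredVersion) {
        return true;
    }

    std::string warning("'");

    warning.append(option);
    warning.append("' is set, but not supported by the available libcurl version.");

    warningsTo.push(std::move(warning));

    return false;
}
```

For example, libcurl documents CURLOPT_DOH_URL as added in version 7.62.0, so isSupported("dns.doh", packVersion(7, 62, 0), available, warnings) would queue a warning when available is packVersion(7, 58, 0).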
|
inline |
Sets custom cookies.
These cookies will be sent along with all subsequent HTTP requests as long as the connection is not reset.
If a reference to an empty string is given, the function will unset cookies previously set through this function.
This function works independently from the internal libcurl cookie engine.
| cookies | Const reference to a string containing the cookies to send in the same format as in the corresponding HTTP header, for example "name1=content1; name2=content2;". |
| Curl::Exception | if the cookies could not be set. |
Referenced by crawlservpp::Module::Extractor::Thread::onReset(), crawlservpp::Module::Crawler::Thread::onReset(), and setConfigGlobal().
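A string in the format expected by setCookies() could be assembled from name and content pairs as in the following sketch. The helper makeCookieString() is hypothetical and not part of the class, and it does not validate or escape the names and contents.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Builds a cookie string of the form "name1=content1; name2=content2;".
inline std::string makeCookieString(
        const std::vector<std::pair<std::string, std::string>>& cookies
) {
    std::string result;

    for (const auto& [name, content] : cookies) {
        if (!result.empty()) {
            result += ' '; // separate cookies by a space after the semicolon
        }

        result += name;
        result += '=';
        result += content;
        result += ';';
    }

    return result;
}
```

An empty list yields an empty string, which, according to the description above, would unset cookies previously set through setCookies().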
|
inline |
Sets custom HTTP headers.
These headers will be sent along with all subsequent HTTP requests as long as the connection is not reset.
| customHeaders | A vector of strings providing the custom HTTP headers to be set. |
| Curl::Exception | if the headers could not be set. |
References crawlservpp::Wrapper::CurlList::append(), and crawlservpp::Wrapper::CurlList::clear().
Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Crawler::Thread::onReset().
|
inline |
Forces libcurl into or out of verbose mode.
In verbose mode, extensive connection information will be written to stdout.
| isVerbose | If true, libcurl will be forced into verbose mode. If false, libcurl will be forced out of verbose mode. |
| Curl::Exception | if the verbose mode could not be set. |
|
inline |
URL decodes the given string.
The libcurl library needs to be successfully initialized for URL decoding, except for an empty string.
| escapedString | Const reference to the string to be decoded. |
| usePlusForSpace | States whether plus signs should be decoded to spaces. |
| Curl::Exception | if the libcurl library has not been initialized. |
References curlStringToString(), crawlservpp::Wrapper::Curl::get(), and crawlservpp::Wrapper::Curl::valid().
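The interaction between usePlusForSpace and percent-decoding can be shown with a hand-written decoder. This is a sketch, not the class's implementation (which uses libcurl): the names hexValue() and unescapeSketch() are hypothetical, and leaving malformed escape sequences untouched is a choice made for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Returns the value of a hexadecimal digit, or -1 if the character is none.
inline int hexValue(char c) {
    if (c >= '0' && c <= '9') {
        return c - '0';
    }

    if (c >= 'a' && c <= 'f') {
        return c - 'a' + 10;
    }

    if (c >= 'A' && c <= 'F') {
        return c - 'A' + 10;
    }

    return -1;
}

// Decodes %XX sequences and, optionally, literal plus signs to spaces.
inline std::string unescapeSketch(std::string_view escaped, bool usePlusForSpace) {
    std::string result;

    result.reserve(escaped.size());

    for (std::size_t i = 0; i < escaped.size(); ++i) {
        const char c = escaped[i];

        if (c == '%' && i + 2 < escaped.size()) {
            const int high = hexValue(escaped[i + 1]);
            const int low = hexValue(escaped[i + 2]);

            if (high >= 0 && low >= 0) {
                result.push_back(static_cast<char>(high * 16 + low));

                i += 2; // skip the two hexadecimal digits

                continue;
            }
        }

        // Only literal plus signs become spaces, "%2B" stays a plus sign.
        result.push_back((usePlusForSpace && c == '+') ? ' ' : c);
    }

    return result;
}
```

Handling both cases in a single pass avoids turning an encoded plus sign (%2B) into a space by mistake.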
|
inline |
Unsets custom cookies previously set.
All cookies set by setCookies() will be discarded.
This function works independently from the internal libcurl cookie engine.
| Curl::Exception | if the cookies could not be unset. |
Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Crawler::Thread::onReset().
|
inline |
Unsets custom HTTP headers previously set.
All HTTP headers set by setHeaders() will be discarded.
| Curl::Exception | if the headers could not be unset. |
References crawlservpp::Wrapper::CurlList::clear().
Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Crawler::Thread::onReset().
|
inlinestaticprotected |
Static writer function to handle incoming network data.
If thisPtr is not nullptr, the function will forward the incoming data without change to the writerInClass function.
| data | Pointer to the incoming data. |
| size | Always 1. |
| nmemb | The size of the incoming data. |
| thisPtr | Pointer to the instance of the Curl class. |
References writerInClass().
Referenced by Curl(), and resetConnection().
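The forwarding from a static callback to an in-class function follows a common pattern for C-style callbacks, sketched here without libcurl. Assumptions: ContentCollector and collectChunks() are hypothetical, returning 0 for a nullptr instance is a choice of this sketch, and the int return type mirrors the signature listed above (libcurl's own documentation declares write callbacks as returning size_t). With libcurl, such a function and the instance pointer would be registered through CURLOPT_WRITEFUNCTION and CURLOPT_WRITEDATA.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical class that collects incoming data through a C-style callback.
class ContentCollector {
public:
    // Static callback: forwards the data unchanged to the instance behind thisPtr.
    static int writer(char* data, std::size_t size, std::size_t nmemb, void* thisPtr) {
        if (thisPtr == nullptr) {
            return 0; // no instance to forward to (assumption of this sketch)
        }

        return static_cast<ContentCollector*>(thisPtr)->writerInClass(data, size * nmemb);
    }

    const std::string& content() const noexcept {
        return this->buffer;
    }

private:
    // In-class function: appends the data to the content collected so far.
    int writerInClass(char* data, std::size_t size) {
        this->buffer.append(data, size);

        return static_cast<int>(size);
    }

    std::string buffer;
};

// Simulates a library that delivers the content in several chunks.
inline std::string collectChunks(std::vector<std::string> chunks) {
    ContentCollector collector;

    for (std::string& chunk : chunks) {
        const int handled = ContentCollector::writer(chunk.data(), 1, chunk.size(), &collector);

        if (handled != static_cast<int>(chunk.size())) {
            break; // the callback did not accept all of the data
        }
    }

    return collector.content();
}
```

The static function is needed because a C library cannot call a non-static member function directly; the instance travels through the untyped pointer instead.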
|
inlineprotected |
In-class writer function to handle incoming network data.
The function will append the data to the currently processed content.
| data | Pointer to the incoming data. |
| size | The size of the incoming data. |
References crawlservpp::Data::Compression::Gzip::decompress(), escapeUrl(), crawlservpp::Wrapper::Curl::get(), crawlservpp::Wrapper::CurlList::get(), crawlservpp::Network::gzipMagicNumber, crawlservpp::Helper::Utf8::iso88591ToUtf8(), and crawlservpp::Helper::Utf8::repairUtf8().
Referenced by writer().
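The references above include Helper::Utf8::iso88591ToUtf8(), and getContent() states that replies encoded in ISO-8859-1 are converted to UTF-8. What such a conversion involves can be sketched as follows; latin1ToUtf8() is a hypothetical stand-in, not the project's helper, and it does not cover the repair of invalid UTF-8 sequences.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Converts ISO-8859-1 (Latin-1) bytes to UTF-8.
// Bytes below 0x80 are plain ASCII; all others become two-byte sequences.
inline std::string latin1ToUtf8(std::string_view in) {
    std::string out;

    out.reserve(in.size() * 2); // worst case: every byte needs two bytes

    for (const char ch : in) {
        const auto c = static_cast<unsigned char>(ch);

        if (c < 0x80) {
            out.push_back(ch);
        }
        else {
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));   // leading byte
            out.push_back(static_cast<char>(0x80 | (c & 0x3F))); // continuation byte
        }
    }

    return out;
}
```

The conversion is this simple because the 256 ISO-8859-1 code points coincide with the first 256 Unicode code points.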