crawlserv++
[under development]
Application for crawling and analyzing textual content of websites.
Crawler thread. More...
#include <Thread.hpp>
Classes | |
class | Exception |
Class for crawler exceptions. More... | |
Configuration Loader | |
void | loadConfig (const std::string &configJson, LogQueue &warningsTo) |
Loads a configuration. More... | |
Parsing Options | |
enum | StringParsingOption { Default = 0, SQL, SubURL, URL, Trim } |
Options for parsing strings. More... | |
enum | CharParsingOption { FromNumber = 0, FromString } |
Options for parsing char values. More... | |
Configuration Parsing | |
void | category (const std::string &category) |
Sets the category of the subsequent configuration items to be checked for. More... | |
void | option (const std::string &name, bool &target) |
Checks for a configuration option of type bool. More... | |
void | option (const std::string &name, std::vector< bool > &target) |
Checks for a configuration option of type array of bool values. More... | |
void | option (const std::string &name, char &target, CharParsingOption opt) |
Checks for a configuration option of type char. More... | |
void | option (const std::string &name, std::vector< char > &target, CharParsingOption opt) |
Checks for a configuration option of type array of char values. More... | |
void | option (const std::string &name, std::int16_t &target) |
Checks for a configuration option of type 16-bit integer. More... | |
void | option (const std::string &name, std::vector< std::int16_t > &target) |
Checks for a configuration option of type array of 16-bit integers. More... | |
void | option (const std::string &name, std::int32_t &target) |
Checks for a configuration option of type 32-bit integer. More... | |
void | option (const std::string &name, std::vector< std::int32_t > &target) |
Checks for a configuration option of type array of 32-bit integers. More... | |
void | option (const std::string &name, std::int64_t &target) |
Checks for a configuration option of type 64-bit integer. More... | |
void | option (const std::string &name, std::vector< std::int64_t > &target) |
Checks for a configuration option of type array of 64-bit integers. More... | |
void | option (const std::string &name, std::uint8_t &target) |
Checks for a configuration option of type unsigned 8-bit integer. More... | |
void | option (const std::string &name, std::vector< std::uint8_t > &target) |
Checks for a configuration option of type array of unsigned 8-bit integers. More... | |
void | option (const std::string &name, std::uint16_t &target) |
Checks for a configuration option of type unsigned 16-bit integer. More... | |
void | option (const std::string &name, std::vector< std::uint16_t > &target) |
Checks for a configuration option of type array of unsigned 16-bit integers. More... | |
void | option (const std::string &name, std::uint32_t &target) |
Checks for a configuration option of type unsigned 32-bit integer. More... | |
void | option (const std::string &name, std::vector< std::uint32_t > &target) |
Checks for a configuration option of type array of unsigned 32-bit integers. More... | |
void | option (const std::string &name, std::uint64_t &target) |
Checks for a configuration option of type unsigned 64-bit integer. More... | |
void | option (const std::string &name, std::vector< std::uint64_t > &target) |
Checks for a configuration option of type array of unsigned 64-bit integers. More... | |
void | option (const std::string &name, float &target) |
Checks for a configuration option of type floating-point number. More... | |
void | option (const std::string &name, std::vector< float > &target) |
Checks for a configuration option of type array of floating-point numbers. More... | |
void | option (const std::string &name, std::string &target, StringParsingOption opt=Default) |
Checks for a configuration option of type string. More... | |
void | option (const std::string &name, std::vector< std::string > &target, StringParsingOption opt=Default) |
Checks for a configuration option of type array of strings. More... | |
void | warning (const std::string &warning) |
Adds a warning to the logging queue. More... | |
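The category() and option() members listed above suggest a pattern in which a category is selected first and each subsequent option() call fills its target only if a matching item exists. The following is a minimal sketch of that pattern, assuming a flat key-value store in place of the JSON configuration. `MiniConfig`, `Items` and the warning text are hypothetical and not part of crawlserv++.

```cpp
#include <cassert>
#include <map>
#include <queue>
#include <string>
#include <utility>

// Hypothetical stand-in for a configuration parser; not the crawlserv++ implementation.
class MiniConfig {
public:
    // Items are keyed by (category, name); values are kept as strings for simplicity.
    using Items = std::map<std::pair<std::string, std::string>, std::string>;

    explicit MiniConfig(Items source) : items(std::move(source)) {}

    // Selects the category that subsequent option() calls will look in.
    void category(const std::string& name) { current = name; }

    // Overwrites the target only if the option exists and holds a valid boolean.
    void option(const std::string& name, bool& target) {
        const auto it = items.find({current, name});
        if (it == items.end()) return; // option not set: keep the default
        if (it->second == "true") target = true;
        else if (it->second == "false") target = false;
        else warnings.push("'" + current + "." + name + "' ignored (not bool).");
    }

    // Overwrites the target only if the option exists.
    void option(const std::string& name, std::string& target) {
        const auto it = items.find({current, name});
        if (it != items.end()) target = it->second;
    }

    std::queue<std::string> warnings; // problems are collected instead of thrown

private:
    Items items;
    std::string current;
};
```

In this sketch, targets are initialized with their defaults before parsing, so a missing or invalid item simply leaves the default in place.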
Setter | |
void | setCrossDomain (bool isCrossDomain) |
Sets whether the corresponding website is cross-domain. More... | |
Configuration | |
struct crawlservpp::Module::Crawler::Config::Entries | config |
Configuration of the crawler. More... | |
Crawler-Specific Configuration Parsing | |
void | parseOption () override |
Parses a crawler-specific configuration option. More... | |
void | checkOptions () override |
Checks the crawler-specific configuration options. More... | |
void | reset () override |
Resets the crawler-specific configuration options. More... | |
Construction | |
Thread (Main::Database &dbBase, std::string_view cookieDirectory, const ThreadOptions &threadOptions, const NetworkSettings &networkSettings, const ThreadStatus &threadStatus) | |
Constructor initializing a previously interrupted crawler thread. More... | |
Thread (Main::Database &dbBase, std::string_view cookieDirectory, const ThreadOptions &threadOptions, const NetworkSettings &networkSettings) | |
Constructor initializing a new crawler thread. More... | |
Database Connection | |
Database | database |
Database connection for the crawler thread. More... | |
Networking | |
const NetworkSettings | networkOptions |
Network settings for the crawler thread. More... | |
Network::Curl | networking |
Networking for the crawler thread. More... | |
Network::TorControl | torControl |
TOR control for the crawler thread. More... | |
Implemented Thread Functions | |
void | onInit () override |
Initializes the crawler. More... | |
void | onTick () override |
Performs a crawler tick. More... | |
void | onPause () override |
Pauses the crawler. More... | |
void | onUnpause () override |
Unpauses the crawler. More... | |
void | onClear () override |
Clears the crawler. More... | |
void | onReset () override |
Resets the crawler. More... | |
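The overridden functions above read like hooks that a base class calls at fixed points in the life of a thread. The following is a minimal sketch of such a hook order, assuming a simple loop with a fixed number of ticks. `MiniThread`, `RecordingThread` and `run()` are hypothetical, onReset() is left out, and the actual call order in crawlserv++ may differ.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical base class illustrating one possible hook order; not the crawlserv++ implementation.
class MiniThread {
public:
    virtual ~MiniThread() = default;

    // Calls onInit() once, onTick() the given number of times, then onClear().
    // If pauseAfter is non-zero, the thread is paused and unpaused after that tick.
    void run(unsigned ticks, unsigned pauseAfter = 0) {
        onInit();
        for (unsigned tick = 1; tick <= ticks; ++tick) {
            onTick();
            if (tick == pauseAfter) {
                onPause();
                onUnpause();
            }
        }
        onClear();
    }

protected:
    virtual void onInit() = 0;
    virtual void onTick() = 0;
    virtual void onPause() = 0;
    virtual void onUnpause() = 0;
    virtual void onClear() = 0;
};

// Records the order of the hook calls instead of crawling anything.
class RecordingThread : public MiniThread {
public:
    std::vector<std::string> calls;

protected:
    void onInit() override { calls.emplace_back("init"); }
    void onTick() override { calls.emplace_back("tick"); }
    void onPause() override { calls.emplace_back("pause"); }
    void onUnpause() override { calls.emplace_back("unpause"); }
    void onClear() override { calls.emplace_back("clear"); }
};
```

Calling `run(2, 1)` on a `RecordingThread` records init, tick, pause, unpause, tick, clear, in that order.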
Getters | |
std::uint64_t | getId () const |
Gets the ID of the thread. More... | |
std::uint64_t | getWebsite () const |
Gets the ID of the website used by the thread. More... | |
std::uint64_t | getUrlList () const |
Gets the ID of the URL list used by the thread. More... | |
std::uint64_t | getConfig () const |
Gets the ID of the configuration used by the thread. More... | |
bool | isShutdown () const |
Checks whether the thread is shutting down or has shut down. More... | |
bool | isRunning () const |
Checks whether the thread is still supposed to run. More... | |
bool | isFinished () const |
Checks whether the shutdown of the thread has been finished. More... | |
bool | isPaused () const |
Checks whether the thread has been paused. More... | |
Thread Control | |
void | end () |
Waits for the thread until shutdown is completed. More... | |
void | reset () |
Will reset the thread before the next tick. More... | |
Time Travel | |
void | warpTo (std::uint64_t target) |
Jumps to the specified target ID ("time travel"). More... | |
Configuration | |
std::string | websiteNamespace |
Namespace of the website used by the thread. More... | |
std::string | urlListNamespace |
Namespace of the URL list used by the thread. More... | |
std::string | configuration |
JSON string of the configuration used by the thread. More... | |
Protected Getters | |
bool | isInterrupted () const |
Checks whether the thread has been interrupted. More... | |
std::string | getStatusMessage () const |
Gets the current status message. More... | |
float | getProgress () const |
Gets the current progress, in percent. More... | |
std::uint64_t | getLast () const |
Gets the value of the last ID processed by the thread. More... | |
std::int64_t | getWarpedOverAndReset () |
Gets the number of IDs that have been jumped over, and resets them. More... | |
Protected Setters | |
void | setStatusMessage (const std::string &statusMessage) |
Sets the status message of the thread. More... | |
void | setProgress (float newProgress) |
Sets the progress of the thread. More... | |
void | setLast (std::uint64_t lastId) |
Sets the last ID processed by the thread. More... | |
void | incrementLast () |
Increments the last ID processed by the thread. More... | |
void | incrementProcessed () |
Increments the number of IDs processed by the thread. More... | |
Protected Thread Control | |
void | sleep (std::uint64_t ms) const |
Lets the thread sleep for the specified number of milliseconds. More... | |
void | allowPausing () |
Allows the thread to be paused. More... | |
void | disallowPausing () |
Prevents the thread from being paused. More... | |
void | pauseByThread () |
Forces the thread to pause. More... | |
Logging | |
bool | isLogLevel (std::uint8_t level) const |
Checks whether a certain logging level is enabled. More... | |
void | log (std::uint8_t level, const std::string &logEntry) |
Adds a thread-specific log entry to the database, if the current logging level is high enough. More... | |
void | log (std::uint8_t level, std::queue< std::string > &logEntries) |
Adds multiple thread-specific log entries to the database, if the current logging level is high enough. More... | |
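The logging functions above write an entry only if the current logging level is high enough. The following is a minimal sketch of that check, assuming that a higher number means more verbose logging and that the queue overload consumes its entries. `MiniLogger` is hypothetical, and its `entries` vector stands in for the database.

```cpp
#include <cassert>
#include <cstdint>
#include <queue>
#include <string>
#include <vector>

// Hypothetical logger; the 'entries' vector stands in for the database log.
class MiniLogger {
public:
    explicit MiniLogger(std::uint8_t currentLevel) : level(currentLevel) {}

    // An entry is written only if the current level is at least the required one.
    bool isLogLevel(std::uint8_t required) const { return level >= required; }

    // Writes a single entry if the logging level is high enough.
    void log(std::uint8_t required, const std::string& logEntry) {
        if (isLogLevel(required)) entries.push_back(logEntry);
    }

    // Empties the queue (an assumption of this sketch), writing its entries
    // only if the logging level is high enough.
    void log(std::uint8_t required, std::queue<std::string>& logEntries) {
        while (!logEntries.empty()) {
            if (isLogLevel(required)) entries.push_back(logEntries.front());
            logEntries.pop();
        }
    }

    std::vector<std::string> entries;

private:
    std::uint8_t level;
};
```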
Configuration | |
struct crawlservpp::Network::Config::Entries | networkConfig |
Configuration for networking. More... | |
Parsing (Network Configuration) | |
void | parseBasicOption () override |
Parses basic network configuration options. More... | |
void | resetBase () override |
Resets basic network configuration options. More... | |
Helper (Network Configuration) | |
const std::string & | getProtocol () const |
Gets the protocol to be used for networking. More... | |
Public Getter | |
bool | isQueryUsed (std::uint64_t queryId) const |
Checks whether the specified query is used by the container. More... | |
Setters | |
void | setRepairCData (bool isRepairCData) |
Sets whether to try to repair CData when parsing XML. More... | |
void | setRepairComments (bool isRepairComments) |
Sets whether to try to repair broken HTML/XML comments. More... | |
void | setRemoveXmlInstructions (bool isRemoveXmlInstructions) |
Sets whether to remove XML processing instructions (<?xml:...>) before parsing HTML/XML content. More... | |
void | setMinimizeMemory (bool isMinimizeMemory) |
Sets whether to minimize memory usage. More... | |
void | setTidyErrorsAndWarnings (bool warnings, std::uint32_t numOfErrors) |
Sets how tidy-html5 reports errors and warnings. More... | |
void | setQueryTarget (const std::string &content, const std::string &source) |
Sets the content to use the managed queries on. More... | |
Getters | |
std::size_t | getNumberOfSubSets () const |
Gets the number of subsets currently acquired. More... | |
bool | getTarget (std::string &targetTo) |
Gets the current query target, if available, and writes it to the given string. More... | |
bool | getXml (std::string &resultTo, std::queue< std::string > &warningsTo) |
Parses the current query target as tidied XML and writes it to the given string. More... | |
Queries | |
QueryStruct | addQuery (std::uint64_t id, const QueryProperties &properties) |
Adds a query with the given query properties to the container. More... | |
void | clearQueries () |
Clears all queries currently managed by the container and frees the associated memory. More... | |
void | clearQueryTarget () |
Clears the current query target and frees the associated memory. More... | |
Subsets | |
bool | nextSubSet () |
Requests the next subset for all subsequent queries. More... | |
Results | |
bool | getBoolFromRegEx (const QueryStruct &query, const std::string &target, bool &resultTo, std::queue< std::string > &warningsTo) const |
Gets a boolean result from a RegEx query on a separate string. More... | |
bool | getSingleFromRegEx (const QueryStruct &query, const std::string &target, std::string &resultTo, std::queue< std::string > &warningsTo) const |
Gets a single result from a RegEx query on a separate string. More... | |
bool | getMultiFromRegEx (const QueryStruct &query, const std::string &target, std::vector< std::string > &resultTo, std::queue< std::string > &warningsTo) const |
Gets multiple results from a RegEx query on a separate string. More... | |
bool | getBoolFromQuery (const QueryStruct &query, bool &resultTo, std::queue< std::string > &warningsTo) |
Gets a boolean result from a query of any type on the current query target. More... | |
bool | getBoolFromQueryOnSubSet (const QueryStruct &query, bool &resultTo, std::queue< std::string > &warningsTo) |
Gets a boolean result from a query of any type on the current subset. More... | |
bool | getSingleFromQuery (const QueryStruct &query, std::string &resultTo, std::queue< std::string > &warningsTo) |
Gets a single result from a query of any type on the current query target. More... | |
bool | getSingleFromQueryOnSubSet (const QueryStruct &query, std::string &resultTo, std::queue< std::string > &warningsTo) |
Gets a single result from a query of any type on the current subset. More... | |
bool | getMultiFromQuery (const QueryStruct &query, std::vector< std::string > &resultTo, std::queue< std::string > &warningsTo) |
Gets multiple results from a query of any type on the current query target. More... | |
bool | getMultiFromQueryOnSubSet (const QueryStruct &query, std::vector< std::string > &resultTo, std::queue< std::string > &warningsTo) |
Gets multiple results from a query of any type on the current subset. More... | |
bool | setSubSetsFromQuery (const QueryStruct &query, std::queue< std::string > &warningsTo) |
Sets subsets for subsequent queries using a query of any type. More... | |
bool | addSubSetsFromQueryOnSubSet (const QueryStruct &query, std::queue< std::string > &warningsTo) |
Inserts more subsets after the current one based on a query on the current subset. More... | |
Memory | |
void | reserveForSubSets (const QueryStruct &query, std::size_t n) |
Reserves memory for a specific number of subsets. More... | |
Crawler thread.
crawlservpp::Module::Crawler::Thread::Thread | ( | Main::Database & | dbBase, |
std::string_view | cookieDirectory, | ||
const ThreadOptions & | threadOptions, | ||
const NetworkSettings & | networkSettings, | ||
const ThreadStatus & | threadStatus | ||
) |
Constructor initializing a previously interrupted crawler thread.
dbBase | Reference to the main database connection. |
cookieDirectory | View of a string containing the (sub-)directory for storing cookie files. |
threadOptions | Constant reference to a structure containing the options for the thread. |
networkSettings | Network settings. |
threadStatus | Constant reference to a structure containing the last known status of the thread. |
crawlservpp::Module::Crawler::Thread::Thread | ( | Main::Database & | dbBase, |
std::string_view | cookieDirectory, | ||
const ThreadOptions & | threadOptions, | ||
const NetworkSettings & | networkSettings | ||
) |
Constructor initializing a new crawler thread.
QueryStruct addQuery (std::uint64_t id, const QueryProperties &properties) |
inline protected inherited |
Adds a query with the given query properties to the container.
id | The ID of the query. It will be saved in a thread-safe way and only be used by Container::isQueryUsed. |
properties | Constant reference to the properties of the query to add to the container. |
Container::Exception | if an error occurred while creating a query with the given properties or the specified type of the query is unknown. |
References crawlservpp::Struct::QueryStruct::index, crawlservpp::Struct::QueryProperties::resultBool, crawlservpp::Struct::QueryStruct::resultBool, crawlservpp::Struct::QueryProperties::resultMulti, crawlservpp::Struct::QueryStruct::resultMulti, crawlservpp::Struct::QueryProperties::resultSingle, crawlservpp::Struct::QueryStruct::resultSingle, crawlservpp::Struct::QueryProperties::resultSubSets, crawlservpp::Struct::QueryStruct::resultSubSets, crawlservpp::Struct::QueryProperties::text, crawlservpp::Struct::QueryProperties::textOnly, crawlservpp::Struct::QueryProperties::type, crawlservpp::Struct::QueryStruct::type, crawlservpp::Struct::QueryStruct::typeJsonPath, crawlservpp::Struct::QueryStruct::typeJsonPointer, crawlservpp::Struct::QueryStruct::typeRegEx, crawlservpp::Struct::QueryStruct::typeXPath, crawlservpp::Struct::QueryStruct::typeXPathJsonPath, crawlservpp::Struct::QueryStruct::typeXPathJsonPointer, and crawlservpp::Main::Exception::view().
Referenced by crawlservpp::Module::Analyzer::Thread::addOptionalQuery(), crawlservpp::Module::Analyzer::Thread::addQueries(), crawlservpp::Module::Parser::Thread::onReset(), crawlservpp::Module::Extractor::Thread::onReset(), and onReset().
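bool addSubSetsFromQueryOnSubSet (const QueryStruct &query, std::queue< std::string > &warningsTo) |
inline protected inherited |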
Inserts more subsets after the current one based on a query on the current subset.
This function is used for recursive extraction.
query | A constant reference to a structure identifying the query that will be performed to acquire the subset. |
warningsTo | A reference to a queue of strings to which all warnings will be appended that occur during the execution of the query. |
Container::Exception | if no query target or no subset has been specified, the current subset is invalid, or the given query is of an unknown type. |
References crawlservpp::Struct::QueryStruct::index, crawlservpp::Helper::Json::parseCons(), crawlservpp::Helper::Json::parseRapid(), crawlservpp::Struct::QueryStruct::resultSubSets, crawlservpp::Struct::QueryStruct::type, crawlservpp::Struct::QueryStruct::typeJsonPath, crawlservpp::Struct::QueryStruct::typeJsonPointer, crawlservpp::Struct::QueryStruct::typeNone, crawlservpp::Struct::QueryStruct::typeRegEx, crawlservpp::Struct::QueryStruct::typeXPath, crawlservpp::Struct::QueryStruct::typeXPathJsonPath, crawlservpp::Struct::QueryStruct::typeXPathJsonPointer, and crawlservpp::Main::Exception::view().
Referenced by crawlservpp::Module::Extractor::Thread::onReset().
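Inserting subsets after the current one means that the newly found subsets are visited next, before the remaining original ones, which is what makes recursive extraction possible. The following is a minimal sketch of that behavior, assuming subsets are plain strings held in a vector. `SubSetCursor` and its members are hypothetical and not the crawlserv++ container.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical cursor over subsets; not the crawlserv++ container.
class SubSetCursor {
public:
    // Replaces all subsets and rewinds the cursor to before the first one.
    void set(std::vector<std::string> newSubSets) {
        subSets = std::move(newSubSets);
        current = 0;
    }

    // Advances to the next subset; returns false if there is none left.
    bool next() {
        if (current >= subSets.size()) return false;
        ++current;
        return true;
    }

    // Returns the subset the cursor points to (next() must have returned true).
    const std::string& get() const {
        assert(current > 0 && current <= subSets.size());
        return subSets[current - 1];
    }

    // Inserts more subsets directly after the current one, so they are visited next.
    void addAfterCurrent(const std::vector<std::string>& more) {
        assert(current > 0 && current <= subSets.size());
        subSets.insert(
                subSets.begin() + static_cast<std::ptrdiff_t>(current),
                more.begin(),
                more.end()
        );
    }

private:
    std::vector<std::string> subSets;
    std::size_t current{0}; // 1-based position; 0 means "before the first subset"
};
```

With the subsets a and b, adding a1 and a2 while a is current yields the visiting order a, a1, a2, b.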
void allowPausing () |
protected inherited |
Allows the thread to be paused.
Threads are pausable by default. Use this function if pausing has been disallowed via disallowPausing().
Thread-safe: Can be used by both the module and the main thread.
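The interplay of allowPausing(), disallowPausing() and a pause request can be sketched with two atomic flags. This is only an illustration under the assumption that a pause request is refused while pausing is disallowed. `PauseGate` is hypothetical and not the crawlserv++ implementation.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical pause gate; not the crawlserv++ implementation.
class PauseGate {
public:
    void allowPausing() { pausable.store(true); }
    void disallowPausing() { pausable.store(false); }

    // Requests a pause; succeeds only while pausing is allowed.
    // Note: the check and the update are two separate atomic steps, so a real
    // implementation would likely need a lock to make them one operation.
    bool pause() {
        if (!pausable.load()) return false;
        paused.store(true);
        return true;
    }

    void unpause() { paused.store(false); }

    bool isPaused() const { return paused.load(); }

private:
    std::atomic<bool> pausable{true}; // pausable by default
    std::atomic<bool> paused{false};
};
```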
void clearQueries () |
inline protected inherited |
Clears all queries currently managed by the container and frees the associated memory.
References crawlservpp::Helper::Memory::free().
Referenced by crawlservpp::Module::Analyzer::Thread::cleanUpQueries(), crawlservpp::Module::Parser::Thread::onClear(), crawlservpp::Module::Extractor::Thread::onClear(), and onClear().
void clearQueryTarget () |
inline protected inherited |
Clears the current query target and frees the associated memory.
References crawlservpp::Parsing::XML::clear(), crawlservpp::Helper::Memory::free(), and crawlservpp::Helper::Json::free().
Referenced by crawlservpp::Module::Parser::Thread::onReset(), crawlservpp::Module::Extractor::Thread::onReset(), onReset(), and crawlservpp::Query::Container::setQueryTarget().
void disallowPausing () |
protected inherited |
Prevents the thread from being paused.
Thread-safe: Can be used by both the module and the main thread.
Referenced by crawlservpp::Module::Analyzer::Algo::AllTokens::AllTokens(), crawlservpp::Module::Analyzer::Algo::Assoc::Assoc(), crawlservpp::Module::Analyzer::Algo::AssocOverTime::AssocOverTime(), crawlservpp::Module::Analyzer::Algo::CorpusGenerator::CorpusGenerator(), crawlservpp::Module::Analyzer::Algo::ExtractIds::ExtractIds(), crawlservpp::Module::Analyzer::Algo::SentimentOverTime::SentimentOverTime(), crawlservpp::Module::Analyzer::Algo::TermsOverTime::TermsOverTime(), crawlservpp::Module::Analyzer::Algo::TopicModelling::TopicModelling(), and crawlservpp::Module::Analyzer::Algo::WordsOverTime::WordsOverTime().
void end () |
inherited |
Waits for the thread until shutdown is completed.
References crawlservpp::Main::Database::deleteThread().
Referenced by onReset(), and crawlservpp::Module::Analyzer::Algo::SentimentOverTime::resetAlgo().
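bool getBoolFromQuery (const QueryStruct &query, bool &resultTo, std::queue< std::string > &warningsTo) |
inline protected inherited |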
Gets a boolean result from a query of any type on the current query target.
query | A constant reference to a structure identifying the query that will be performed. |
resultTo | A reference to a boolean variable which will be set according to the result of the query. |
warningsTo | A reference to a queue of strings to which all warnings will be appended that occur during the execution of the query. |
Container::Exception | if no query target has been specified or the query is of an unknown type. |
References crawlservpp::Struct::QueryStruct::index, crawlservpp::Helper::Json::parseCons(), crawlservpp::Helper::Json::parseRapid(), crawlservpp::Struct::QueryStruct::resultBool, crawlservpp::Struct::QueryStruct::type, crawlservpp::Struct::QueryStruct::typeJsonPath, crawlservpp::Struct::QueryStruct::typeJsonPointer, crawlservpp::Struct::QueryStruct::typeNone, crawlservpp::Struct::QueryStruct::typeRegEx, crawlservpp::Struct::QueryStruct::typeXPath, crawlservpp::Struct::QueryStruct::typeXPathJsonPath, crawlservpp::Struct::QueryStruct::typeXPathJsonPointer, and crawlservpp::Main::Exception::view().
Referenced by crawlservpp::Module::Analyzer::Thread::cleanUpQueries(), crawlservpp::Module::Parser::Thread::onReset(), crawlservpp::Module::Extractor::Thread::onReset(), and onReset().
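bool getBoolFromQueryOnSubSet (const QueryStruct &query, bool &resultTo, std::queue< std::string > &warningsTo) |
inline protected inherited |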
Gets a boolean result from a query of any type on the current subset.
query | A constant reference to a structure identifying the query that will be performed. |
resultTo | A reference to a boolean variable which will be set according to the result of the query. |
warningsTo | A reference to a queue of strings to which all warnings will be appended that occur during the execution of the query. |
Container::Exception | if no query target or no subset has been specified, the current subset is invalid, or the given query is of an unknown type. |
References crawlservpp::Struct::QueryStruct::index, crawlservpp::Helper::Json::parseCons(), crawlservpp::Helper::Json::parseRapid(), crawlservpp::Struct::QueryStruct::resultBool, crawlservpp::Struct::QueryStruct::type, crawlservpp::Struct::QueryStruct::typeJsonPath, crawlservpp::Struct::QueryStruct::typeJsonPointer, crawlservpp::Struct::QueryStruct::typeNone, crawlservpp::Struct::QueryStruct::typeRegEx, crawlservpp::Struct::QueryStruct::typeXPath, crawlservpp::Struct::QueryStruct::typeXPathJsonPath, crawlservpp::Struct::QueryStruct::typeXPathJsonPointer, and crawlservpp::Main::Exception::view().
Referenced by crawlservpp::Module::Extractor::Thread::onReset().
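bool getBoolFromRegEx (const QueryStruct &query, const std::string &target, bool &resultTo, std::queue< std::string > &warningsTo) const |
inline protected inherited |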
Gets a boolean result from a RegEx query on a separate string.
query | A constant reference to a structure identifying the RegEx query that will be performed. |
target | A constant reference to a string containing the target on which the query will be performed. |
resultTo | A reference to a boolean variable which will be set according to the result of the query. |
warningsTo | A reference to a queue of strings to which all warnings will be appended that occur during the execution of the query. |
References crawlservpp::Struct::QueryStruct::index, crawlservpp::Struct::QueryStruct::resultBool, crawlservpp::Struct::QueryStruct::type, crawlservpp::Struct::QueryStruct::typeNone, crawlservpp::Struct::QueryStruct::typeRegEx, and crawlservpp::Main::Exception::view().
Referenced by crawlservpp::Module::Parser::Thread::onReset(), crawlservpp::Module::Extractor::Thread::onReset(), onReset(), crawlservpp::Module::Analyzer::Algo::Assoc::resetAlgo(), crawlservpp::Module::Analyzer::Algo::AssocOverTime::resetAlgo(), and crawlservpp::Module::Analyzer::Algo::SentimentOverTime::resetAlgo().
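The shape of a boolean RegEx query on a separate string, with the result written to a reference and warnings appended to a queue, can be sketched as follows. The sketch assumes `std::regex` as the engine and takes the expression as a string instead of a `QueryStruct`; `boolFromRegEx` is hypothetical, and crawlserv++ may use a different engine and different error handling.

```cpp
#include <cassert>
#include <queue>
#include <regex>
#include <string>

// Hypothetical helper mirroring the shape of a boolean RegEx query.
// Returns false and queues a warning if the expression could not be used.
bool boolFromRegEx(
        const std::string& expression,
        const std::string& target,
        bool& resultTo,
        std::queue<std::string>& warningsTo
) {
    try {
        const std::regex compiled(expression);

        // The result is true if the expression matches anywhere in the target.
        resultTo = std::regex_search(target, compiled);

        return true;
    }
    catch (const std::regex_error& e) {
        warningsTo.push(std::string("RegEx error: ") + e.what());

        return false;
    }
}
```

Keeping the success flag (the return value) separate from the query result (`resultTo`) lets the caller distinguish "no match" from "query failed".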
std::uint64_t getConfig () const |
inherited |
Gets the ID of the configuration used by the thread.
Thread-safe: Can be used by both the module and the main thread, because the configuration is not changed after starting the thread.
References crawlservpp::Struct::ThreadOptions::config.
Referenced by crawlservpp::Module::Thread::Thread().
std::uint64_t getId () const |
inherited |
Gets the ID of the thread.
Thread-safe: Can be used by both the module and the main thread.
std::uint64_t getLast () const |
protected inherited |
Gets the value of the last ID processed by the thread.
Referenced by onInit(), crawlservpp::Module::Parser::Thread::onReset(), crawlservpp::Module::Extractor::Thread::onReset(), onReset(), crawlservpp::Module::Parser::Thread::onTick(), and crawlservpp::Module::Extractor::Thread::onTick().
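bool getMultiFromQuery (const QueryStruct &query, std::vector< std::string > &resultTo, std::queue< std::string > &warningsTo) |
inline protected inherited |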
Gets multiple results from a query of any type on the current query target.
query | A constant reference to a structure identifying the query that will be performed. |
resultTo | A reference to a vector to which the results of the query will be appended. |
warningsTo | A reference to a queue of strings to which all warnings will be appended that occur during the execution of the query. |
Container::Exception | if no query target has been specified or the query is of an unknown type. |
References crawlservpp::Struct::QueryStruct::index, crawlservpp::Helper::Json::parseCons(), crawlservpp::Helper::Json::parseRapid(), crawlservpp::Struct::QueryStruct::resultMulti, crawlservpp::Struct::QueryStruct::type, crawlservpp::Struct::QueryStruct::typeJsonPath, crawlservpp::Struct::QueryStruct::typeJsonPointer, crawlservpp::Struct::QueryStruct::typeNone, crawlservpp::Struct::QueryStruct::typeRegEx, crawlservpp::Struct::QueryStruct::typeXPath, crawlservpp::Struct::QueryStruct::typeXPathJsonPath, crawlservpp::Struct::QueryStruct::typeXPathJsonPointer, and crawlservpp::Main::Exception::view().
Referenced by crawlservpp::Module::Parser::Thread::onReset(), crawlservpp::Module::Extractor::Thread::onReset(), and onReset().
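bool getMultiFromQueryOnSubSet (const QueryStruct &query, std::vector< std::string > &resultTo, std::queue< std::string > &warningsTo) |
inline protected inherited |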
Gets multiple results from a query of any type on the current subset.
query | A constant reference to a structure identifying the query that will be performed. |
resultTo | A reference to a vector to which the results of the query will be appended. |
warningsTo | A reference to a queue of strings to which all warnings will be appended that occur during the execution of the query. |
Container::Exception | if no query target or no subset has been specified, the current subset is invalid, or the given query is of an unknown type. |
References crawlservpp::Struct::QueryStruct::index, crawlservpp::Helper::Json::parseCons(), crawlservpp::Helper::Json::parseRapid(), crawlservpp::Struct::QueryStruct::resultMulti, crawlservpp::Struct::QueryStruct::type, crawlservpp::Struct::QueryStruct::typeJsonPath, crawlservpp::Struct::QueryStruct::typeJsonPointer, crawlservpp::Struct::QueryStruct::typeNone, crawlservpp::Struct::QueryStruct::typeRegEx, crawlservpp::Struct::QueryStruct::typeXPath, crawlservpp::Struct::QueryStruct::typeXPathJsonPath, crawlservpp::Struct::QueryStruct::typeXPathJsonPointer, and crawlservpp::Main::Exception::view().
Referenced by crawlservpp::Module::Extractor::Thread::onReset().
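bool getMultiFromRegEx (const QueryStruct &query, const std::string &target, std::vector< std::string > &resultTo, std::queue< std::string > &warningsTo) const |
inline protected inherited |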
Gets multiple results from a RegEx query on a separate string.
query | A constant reference to a structure identifying the RegEx query that will be performed. |
target | A constant reference to a string containing the target on which the query will be performed. |
resultTo | A reference to a vector to which the results of the query will be appended. |
warningsTo | A reference to a queue of strings to which all warnings will be appended that occur during the execution of the query. |
References crawlservpp::Struct::QueryStruct::index, crawlservpp::Struct::QueryStruct::resultMulti, crawlservpp::Struct::QueryStruct::type, crawlservpp::Struct::QueryStruct::typeNone, crawlservpp::Struct::QueryStruct::typeRegEx, and crawlservpp::Main::Exception::view().
Referenced by crawlservpp::Module::Parser::Thread::onReset().
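Because the results of a multi-result query are appended to the given vector, existing elements are kept. The following is a minimal sketch of that behavior, assuming `std::regex` as the engine and the expression passed as a string instead of a `QueryStruct`. `multiFromRegEx` is hypothetical and not the crawlserv++ implementation.

```cpp
#include <cassert>
#include <queue>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: appends every match of the expression to resultTo.
// Returns false and queues a warning if the expression could not be used.
bool multiFromRegEx(
        const std::string& expression,
        const std::string& target,
        std::vector<std::string>& resultTo,
        std::queue<std::string>& warningsTo
) {
    try {
        const std::regex compiled(expression);
        const auto end = std::sregex_iterator();

        // Iterate over all non-overlapping matches, from left to right.
        for (auto it = std::sregex_iterator(target.begin(), target.end(), compiled); it != end; ++it) {
            resultTo.push_back(it->str());
        }

        return true;
    }
    catch (const std::regex_error& e) {
        warningsTo.push(std::string("RegEx error: ") + e.what());

        return false;
    }
}
```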
std::size_t getNumberOfSubSets () const |
inline protected inherited |
Gets the number of subsets currently acquired.
Referenced by crawlservpp::Module::Extractor::Thread::onReset().
float getProgress () const |
protected inherited |
Gets the current progress, in percent.
Thread-safe: Can be used by both the module and the main thread.
Returns the current progress as a value between 0.F (none) and 1.F (done).
Referenced by crawlservpp::Module::Parser::Thread::onReset(), and crawlservpp::Module::Extractor::Thread::onReset().
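bool getSingleFromQuery (const QueryStruct &query, std::string &resultTo, std::queue< std::string > &warningsTo) |
inline protected inherited |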
Gets a single result from a query of any type on the current query target.
query | A constant reference to a structure identifying the query that will be performed. |
resultTo | A reference to a string to which the result of the query will be written. |
warningsTo | A reference to a queue of strings to which all warnings will be appended that occur during the execution of the query. |
Container::Exception | if no query target has been specified or the query is of an unknown type. |
References crawlservpp::Struct::QueryStruct::index, crawlservpp::Helper::Json::parseCons(), crawlservpp::Helper::Json::parseRapid(), crawlservpp::Struct::QueryStruct::resultSingle, crawlservpp::Struct::QueryStruct::type, crawlservpp::Struct::QueryStruct::typeJsonPath, crawlservpp::Struct::QueryStruct::typeJsonPointer, crawlservpp::Struct::QueryStruct::typeNone, crawlservpp::Struct::QueryStruct::typeRegEx, crawlservpp::Struct::QueryStruct::typeXPath, crawlservpp::Struct::QueryStruct::typeXPathJsonPath, crawlservpp::Struct::QueryStruct::typeXPathJsonPointer, and crawlservpp::Main::Exception::view().
Referenced by crawlservpp::Module::Parser::Thread::onReset(), crawlservpp::Module::Extractor::Thread::onReset(), and onReset().
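bool getSingleFromQueryOnSubSet (const QueryStruct &query, std::string &resultTo, std::queue< std::string > &warningsTo) |
inline protected inherited |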
Gets a single result from a query of any type on the current subset.
query | A constant reference to a structure identifying the query that will be performed. |
resultTo | A reference to a string to which the result of the query will be written. |
warningsTo | A reference to a queue of strings to which all warnings will be appended that occur during the execution of the query. |
Container::Exception | if no query target or no subset has been specified, the current subset is invalid, or the given query is of an unknown type. |
References crawlservpp::Struct::QueryStruct::index, crawlservpp::Helper::Json::parseCons(), crawlservpp::Helper::Json::parseRapid(), crawlservpp::Struct::QueryStruct::resultSingle, crawlservpp::Struct::QueryStruct::type, crawlservpp::Struct::QueryStruct::typeJsonPath, crawlservpp::Struct::QueryStruct::typeJsonPointer, crawlservpp::Struct::QueryStruct::typeNone, crawlservpp::Struct::QueryStruct::typeRegEx, crawlservpp::Struct::QueryStruct::typeXPath, crawlservpp::Struct::QueryStruct::typeXPathJsonPath, crawlservpp::Struct::QueryStruct::typeXPathJsonPointer, and crawlservpp::Main::Exception::view().
Referenced by crawlservpp::Module::Extractor::Thread::onReset().
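bool getSingleFromRegEx (const QueryStruct &query, const std::string &target, std::string &resultTo, std::queue< std::string > &warningsTo) const |
inline protected inherited |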
Gets a single result from a RegEx query on a separate string.
query | A constant reference to a structure identifying the RegEx query that will be performed. |
target | A constant reference to a string containing the target on which the query will be performed. |
resultTo | A reference to a string to which the result of the query will be written. |
warningsTo | A reference to a queue of strings to which all warnings will be appended that occur during the execution of the query. |
References crawlservpp::Struct::QueryStruct::index, crawlservpp::Struct::QueryStruct::resultSingle, crawlservpp::Struct::QueryStruct::type, crawlservpp::Struct::QueryStruct::typeNone, crawlservpp::Struct::QueryStruct::typeRegEx, and crawlservpp::Main::Exception::view().
Referenced by crawlservpp::Module::Parser::Thread::onReset(), crawlservpp::Module::Extractor::Thread::onReset(), and onReset().
std::string getStatusMessage () const |
protected inherited |
Gets the current status message.
Thread-safe: Can be used by both the module and the main thread.
Referenced by crawlservpp::Module::Thread::log(), crawlservpp::Module::Parser::Thread::onClear(), crawlservpp::Module::Extractor::Thread::onClear(), crawlservpp::Module::Parser::Thread::onReset(), crawlservpp::Module::Extractor::Thread::onReset(), and onReset().
bool getTarget (std::string &targetTo) |
inline protected inherited |
Gets the current query target, if available, and writes it to the given string.
targetTo | Reference to a string the query target will be written to, if one is available. Its content will not be changed if no query target is available. |
Referenced by crawlservpp::Module::Extractor::Thread::onReset().
|
inherited |
Gets the ID of the URL list used by the thread.
Thread-safe: Can be used by both the module and the main thread, because the URL list is not changed after starting the thread.
References crawlservpp::Struct::ThreadOptions::urlList.
Referenced by crawlservpp::Module::Thread::Thread().
|
protectedinherited |
Gets the number of IDs that have been jumped over, and resets it.
Resets the number of IDs jumped over to zero.
Referenced by onReset(), crawlservpp::Module::Parser::Thread::onTick(), and crawlservpp::Module::Extractor::Thread::onTick().
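A "get and reset" counter of this kind can be sketched with a single atomic exchange. The class name `WarpCounterSketch`, the method `addWarpedOver()`, the signed counter type, and the use of `std::atomic` are all assumptions made for illustration; the section does not describe how the counter is stored.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical counter of IDs that have been jumped over.
class WarpCounterSketch {
public:
    // Records a jump; negative values would represent a jump backwards
    // (an assumption, the documentation does not say whether that occurs).
    void addWarpedOver(std::int64_t n) {
        warpedOver.fetch_add(n);
    }

    // Returns the accumulated number and resets it to zero in one step.
    std::int64_t getWarpedOverAndReset() {
        return warpedOver.exchange(0);
    }

private:
    std::atomic<std::int64_t> warpedOver{0};
};

// Usage example: two jumps are reported once, then the counter is empty.
inline void warpCounterExample() {
    WarpCounterSketch counter;

    counter.addWarpedOver(3);
    counter.addWarpedOver(-1);

    assert(counter.getWarpedOverAndReset() == 2);
    assert(counter.getWarpedOverAndReset() == 0);
}
```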
|
inherited |
Gets the ID of the website used by the thread.
Thread-safe: Can be used by both the module and the main thread, because the website is not changed after starting the thread.
References crawlservpp::Struct::ThreadOptions::website.
Referenced by onReset(), and crawlservpp::Module::Thread::Thread().
|
inlineprotectedinherited |
Parses the current query target as tidied XML and writes it to the given string.
resultTo | Reference to a string the parsed query target will be written to. |
warningsTo | Reference to a vector of strings to which warnings that occurred during parsing will be appended. |
References crawlservpp::Parsing::XML::getContent().
Referenced by onReset().
|
protectedinherited |
Increments the last ID processed by the thread.
Also sets the number of processed IDs. Make sure to increment that number beforehand if the ID has been processed.
References crawlservpp::Module::Thread::database, and crawlservpp::Module::Database::setThreadLast().
|
protectedinherited |
Increments the number of IDs processed by the thread.
Referenced by crawlservpp::Module::Parser::Thread::onReset(), crawlservpp::Module::Extractor::Thread::onReset(), and onReset().
|
inherited |
Checks whether the shutdown of the thread has been finished.
Thread-safe: Can be used by both the module and the main thread.
|
protectedinherited |
Checks whether the thread has been interrupted.
Thread-safe: Can be used by both the module and the main thread.
|
protectedinherited |
Checks whether a certain logging level is enabled.
level | The logging level to be checked for. |
References crawlservpp::Module::Thread::database, and crawlservpp::Module::Database::isLogLevel().
Referenced by crawlservpp::Module::Parser::Thread::onReset(), crawlservpp::Module::Extractor::Thread::onReset(), onReset(), crawlservpp::Module::Parser::Thread::onTick(), and crawlservpp::Module::Extractor::Thread::onTick().
|
inherited |
Checks whether the thread has been paused.
Thread-safe: Can be used by both the module and the main thread.
|
inlineinherited |
Checks whether the specified query is used by the container.
Thread-safe. This function can be used by any thread.
queryId | ID of the query to be checked. |
|
inherited |
Checks whether the thread is still supposed to run.
Thread-safe: Can be used by both the module and the main thread.
Referenced by crawlservpp::Module::Analyzer::Thread::cleanUpQueries(), crawlservpp::Module::Analyzer::Algo::TermsOverTime::onAlgoInit(), crawlservpp::Module::Analyzer::Algo::ExtractIds::onAlgoInit(), crawlservpp::Module::Analyzer::Algo::CorpusGenerator::onAlgoInit(), crawlservpp::Module::Analyzer::Algo::WordsOverTime::onAlgoInit(), crawlservpp::Module::Analyzer::Algo::Assoc::onAlgoInit(), crawlservpp::Module::Analyzer::Algo::AssocOverTime::onAlgoInit(), crawlservpp::Module::Analyzer::Algo::AllTokens::onAlgoInit(), crawlservpp::Module::Analyzer::Algo::Empty::onAlgoInit(), crawlservpp::Module::Analyzer::Algo::SentimentOverTime::onAlgoInit(), crawlservpp::Module::Analyzer::Algo::TopicModelling::onAlgoInit(), crawlservpp::Module::Analyzer::Thread::onInit(), crawlservpp::Module::Parser::Thread::onInit(), crawlservpp::Module::Extractor::Thread::onInit(), onInit(), crawlservpp::Module::Parser::Thread::onReset(), crawlservpp::Module::Extractor::Thread::onReset(), onReset(), crawlservpp::Module::Parser::Thread::onTick(), crawlservpp::Module::Analyzer::Algo::ExtractIds::resetAlgo(), crawlservpp::Module::Analyzer::Algo::WordsOverTime::resetAlgo(), crawlservpp::Module::Analyzer::Algo::Assoc::resetAlgo(), crawlservpp::Module::Analyzer::Algo::AssocOverTime::resetAlgo(), crawlservpp::Module::Analyzer::Algo::AllTokens::resetAlgo(), crawlservpp::Module::Analyzer::Algo::SentimentOverTime::resetAlgo(), and crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo().
|
inherited |
Checks whether the thread is shutting down or has shut down.
Thread-safe: Can be used by both the module and the main thread.
|
protectedinherited |
Adds a thread-specific log entry to the database, if the current logging level is high enough.
Removes invalid UTF-8 characters if necessary.
If debug logging is active, the entry will be written to the logging file as well.
The log entry will not be written to the database if the current logging level is lower than the specified logging level. The logging level does not affect whether entries are written to the logging file when debug logging is active.
level | The logging level for the entry. The entry will only be written to the database if the current logging level is at least the logging level for the entry. |
logEntry | Constant reference to a string containing the log entry. |
References crawlservpp::Module::Thread::database, and crawlservpp::Module::Database::log().
Referenced by crawlservpp::Module::Analyzer::Thread::addCorpora(), crawlservpp::Module::Analyzer::Algo::TopicModelling::checkAlgoOptions(), crawlservpp::Module::Analyzer::Thread::cleanUpQueries(), crawlservpp::Module::Analyzer::Thread::finished(), crawlservpp::Module::Thread::log(), crawlservpp::Module::Analyzer::Algo::TermsOverTime::onAlgoInit(), crawlservpp::Module::Analyzer::Algo::ExtractIds::onAlgoInit(), crawlservpp::Module::Analyzer::Algo::CorpusGenerator::onAlgoInit(), crawlservpp::Module::Analyzer::Algo::WordsOverTime::onAlgoInit(), crawlservpp::Module::Analyzer::Algo::Assoc::onAlgoInit(), crawlservpp::Module::Analyzer::Algo::AssocOverTime::onAlgoInit(), crawlservpp::Module::Analyzer::Algo::AllTokens::onAlgoInit(), crawlservpp::Module::Analyzer::Algo::Empty::onAlgoInit(), crawlservpp::Module::Analyzer::Algo::SentimentOverTime::onAlgoInit(), crawlservpp::Module::Analyzer::Algo::TopicModelling::onAlgoInit(), crawlservpp::Module::Analyzer::Algo::AllTokens::onAlgoTick(), crawlservpp::Module::Parser::Thread::onClear(), crawlservpp::Module::Extractor::Thread::onClear(), onClear(), crawlservpp::Module::Analyzer::Thread::onReset(), crawlservpp::Module::Parser::Thread::onReset(), crawlservpp::Module::Extractor::Thread::onReset(), onReset(), crawlservpp::Module::Parser::Thread::onTick(), crawlservpp::Module::Extractor::Thread::onTick(), crawlservpp::Module::Analyzer::Algo::TermsOverTime::resetAlgo(), crawlservpp::Module::Analyzer::Algo::ExtractIds::resetAlgo(), crawlservpp::Module::Analyzer::Algo::WordsOverTime::resetAlgo(), crawlservpp::Module::Analyzer::Algo::Assoc::resetAlgo(), crawlservpp::Module::Analyzer::Algo::AssocOverTime::resetAlgo(), crawlservpp::Module::Analyzer::Algo::AllTokens::resetAlgo(), crawlservpp::Module::Analyzer::Algo::SentimentOverTime::resetAlgo(), crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo(), and crawlservpp::Module::Analyzer::Thread::uploadResult().
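The two independent conditions described above (database output gated by the logging level, file output gated only by debug logging) can be sketched as follows. `LoggerSketch` and its members are hypothetical, the logging level is assumed to be a small unsigned number, and two vectors stand in for the database and the logging file. The removal of invalid UTF-8 is not covered.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical logger illustrating the level check; not crawlserv++'s code.
struct LoggerSketch {
    std::uint8_t currentLevel{1};       // assumed numeric logging level
    bool debugToFile{false};            // assumed flag for debug logging
    std::vector<std::string> database;  // stands in for the database log
    std::vector<std::string> file;      // stands in for the logging file

    void log(std::uint8_t level, const std::string& logEntry) {
        // the file is written whenever debug logging is active
        if(debugToFile) {
            file.push_back(logEntry);
        }

        // the database is written only if the current level is high enough
        if(currentLevel >= level) {
            database.push_back(logEntry);
        }
    }
};

// Usage example: the second entry reaches the file, but not the database.
inline void loggerExample() {
    LoggerSketch logger;

    logger.debugToFile = true;

    logger.log(1, "crawled one URL");
    logger.log(2, "verbose detail");

    assert(logger.database.size() == 1);
    assert(logger.file.size() == 2);
}
```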
|
protectedinherited |
Adds multiple thread-specific log entries to the database, if the current logging level is high enough.
Removes invalid UTF-8 characters if necessary.
If debug logging is active, the entries will be written to the logging file as well.
The log entries will not be written to the database if the current logging level is lower than the specified logging level. The logging level does not affect whether entries are written to the logging file when debug logging is active.
level | The logging level for the entries. The entries will only be written to the database if the current logging level is at least the logging level for the entries. |
logEntries | Reference to a queue of strings containing the log entries to be written. It will be emptied regardless of whether the log entries are written to the database. |
References crawlservpp::Main::Database::connect(), crawlservpp::Module::Thread::database, crawlservpp::Module::Thread::getStatusMessage(), crawlservpp::Main::Database::getThreadPauseTime(), crawlservpp::Main::Database::getThreadRunTime(), crawlservpp::Module::Database::log(), crawlservpp::Module::Thread::log(), crawlservpp::Helper::DateTime::now(), crawlservpp::Module::Thread::onClear(), crawlservpp::Module::Thread::onInit(), crawlservpp::Module::Thread::onPause(), crawlservpp::Module::Thread::onReset(), crawlservpp::Module::Thread::onTick(), crawlservpp::Module::Thread::onUnpause(), crawlservpp::Module::Thread::pause(), crawlservpp::Module::Thread::pauseByThread(), crawlservpp::Module::Database::prepare(), crawlservpp::Helper::DateTime::secondsToString(), crawlservpp::Module::Thread::setLast(), crawlservpp::Module::Thread::setStatusMessage(), crawlservpp::Main::Database::setThreadPauseTime(), crawlservpp::Main::Database::setThreadRunTime(), and crawlservpp::Module::sleepOnConnectionErrorS.
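The point that the queue is emptied in either case is easy to miss, so here is a minimal sketch of it. The function name `logAllSketch`, the numeric level type, and the vector standing in for the database are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: drains logEntries and writes the entries to
// databaseTo only if the current logging level is high enough.
// Returns the number of entries written.
inline std::size_t logAllSketch(
        std::uint8_t currentLevel,
        std::uint8_t level,
        std::queue<std::string>& logEntries,
        std::vector<std::string>& databaseTo
) {
    const bool write{currentLevel >= level};
    std::size_t written{0};

    while(!logEntries.empty()) {
        if(write) {
            databaseTo.push_back(std::move(logEntries.front()));

            ++written;
        }

        logEntries.pop();   // removed whether or not it has been written
    }

    assert(logEntries.empty());

    return written;
}
```

Callers therefore cannot reuse the queue to retry entries that were filtered out by the logging level.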
|
inlineprotectedinherited |
Requests the next subset for all subsequent queries.
Container::Exception | if an invalid subset had previously been selected. |
References crawlservpp::Helper::Memory::free(), crawlservpp::Helper::Json::free(), crawlservpp::Helper::Memory::freeIf(), crawlservpp::Struct::QueryStruct::typeJsonPath, crawlservpp::Struct::QueryStruct::typeJsonPointer, and crawlservpp::Struct::QueryStruct::typeXPath.
Referenced by crawlservpp::Module::Extractor::Thread::onReset().
|
overrideprotectedvirtual |
Clears the crawler.
Implements crawlservpp::Module::Thread.
References crawlservpp::Query::Container::clearQueries(), crawlservpp::Module::Crawler::crawlerLoggingDefault, crawlservpp::Helper::DotLocale::locale(), crawlservpp::Helper::CommaLocale::locale(), crawlservpp::Module::Thread::log(), and crawlservpp::Helper::DateTime::now().
Referenced by onReset().
|
overrideprotectedvirtual |
Initializes the crawler.
Module::Crawler::Thread::Exception | if no query for link extraction has been specified. |
Implements crawlservpp::Module::Thread.
References crawlservpp::Module::Thread::getLast(), and crawlservpp::Module::Thread::isRunning().
Referenced by onReset().
|
overrideprotectedvirtual |
Pauses the crawler.
Stores the current time to keep track of how long the crawler is paused.
Implements crawlservpp::Module::Thread.
References crawlservpp::Helper::DateTime::now().
|
overrideprotectedvirtual |
Resets the crawler.
Implements crawlservpp::Module::Thread.
References crawlservpp::Network::TorControl::active(), crawlservpp::Query::Container::addQuery(), crawlservpp::Module::Crawler::Database::addUrlIfNotExists(), crawlservpp::Module::Crawler::Database::addUrlsIfNotExist(), crawlservpp::Helper::Container::append(), crawlservpp::Module::Crawler::archiveMementoContentType, crawlservpp::Module::Crawler::archiveRefString, crawlservpp::Module::Crawler::archiveRefTimeStampLength, crawlservpp::Module::Crawler::archiveRenewUrlLockEveryMs, crawlservpp::Struct::CrawlTimersTick::archives, crawlservpp::Struct::CrawlStatsTick::checkedUrls, crawlservpp::Struct::CrawlStatsTick::checkedUrlsArchive, crawlservpp::Query::Container::clearQueryTarget(), crawlservpp::Module::Crawler::Config::config, crawlservpp::Helper::DateTime::convertLongDateTimeToSQLTimeStamp(), crawlservpp::Helper::DateTime::convertSQLTimeStampToTimeStamp(), crawlservpp::Helper::DateTime::convertTimeStampToSQLTimeStamp(), crawlservpp::Module::Crawler::Config::Entries::crawlerArchives, crawlservpp::Module::Crawler::Config::Entries::crawlerArchivesNames, crawlservpp::Module::Crawler::Config::Entries::crawlerArchivesUrlsMemento, crawlservpp::Module::Crawler::Config::Entries::crawlerArchivesUrlsSkip, crawlservpp::Module::Crawler::Config::Entries::crawlerArchivesUrlsTimemap, crawlservpp::Module::Crawler::Config::Entries::crawlerLogging, crawlservpp::Module::Crawler::crawlerLoggingDefault, crawlservpp::Module::Crawler::crawlerLoggingExtended, crawlservpp::Module::Crawler::crawlerLoggingVerbose, crawlservpp::Module::Crawler::Config::Entries::crawlerMaxBatchSize, crawlservpp::Module::Crawler::Config::Entries::crawlerParamsAdd, crawlservpp::Module::Crawler::Config::Entries::crawlerParamsBlackList, crawlservpp::Module::Crawler::Config::Entries::crawlerParamsWhiteList, crawlservpp::Module::Crawler::Config::Entries::crawlerQueriesBlackListContent, crawlservpp::Module::Crawler::Config::Entries::crawlerQueriesBlackListTypes, 
crawlservpp::Module::Crawler::Config::Entries::crawlerQueriesBlackListUrls, crawlservpp::Module::Crawler::Config::Entries::crawlerQueriesLinks, crawlservpp::Module::Crawler::Config::Entries::crawlerQueriesLinksBlackListContent, crawlservpp::Module::Crawler::Config::Entries::crawlerQueriesLinksBlackListTypes, crawlservpp::Module::Crawler::Config::Entries::crawlerQueriesLinksBlackListUrls, crawlservpp::Module::Crawler::Config::Entries::crawlerQueriesLinksWhiteListContent, crawlservpp::Module::Crawler::Config::Entries::crawlerQueriesLinksWhiteListTypes, crawlservpp::Module::Crawler::Config::Entries::crawlerQueriesLinksWhiteListUrls, crawlservpp::Module::Crawler::Config::Entries::crawlerQueriesWhiteListContent, crawlservpp::Module::Crawler::Config::Entries::crawlerQueriesWhiteListTypes, crawlservpp::Module::Crawler::Config::Entries::crawlerQueriesWhiteListUrls, crawlservpp::Module::Crawler::Config::Entries::crawlerReCrawl, crawlservpp::Module::Crawler::Config::Entries::crawlerReCrawlStart, crawlservpp::Module::Crawler::Config::Entries::crawlerRemoveXmlInstructions, crawlservpp::Module::Crawler::Config::Entries::crawlerRepairCData, crawlservpp::Module::Crawler::Config::Entries::crawlerRepairComments, crawlservpp::Module::Crawler::Config::Entries::crawlerRestartAfter, crawlservpp::Module::Crawler::Config::Entries::crawlerReTries, crawlservpp::Module::Crawler::Config::Entries::crawlerRetryArchive, crawlservpp::Module::Crawler::Config::Entries::crawlerRetryEmpty, crawlservpp::Module::Crawler::Config::Entries::crawlerRetryHttp, crawlservpp::Module::Crawler::Config::Entries::crawlerSleepError, crawlservpp::Module::Crawler::Config::Entries::crawlerSleepHttp, crawlservpp::Module::Crawler::Config::Entries::crawlerSleepIdle, crawlservpp::Module::Crawler::Config::Entries::crawlerSleepMySql, crawlservpp::Module::Crawler::Config::Entries::crawlerStart, crawlservpp::Module::Crawler::Config::Entries::crawlerStartIgnore, 
crawlservpp::Module::Crawler::Config::Entries::crawlerTidyWarnings, crawlservpp::Module::Crawler::Config::Entries::crawlerTiming, crawlservpp::Module::Crawler::Config::Entries::crawlerUrlCaseSensitive, crawlservpp::Module::Crawler::Config::Entries::crawlerUrlChunks, crawlservpp::Module::Crawler::Config::Entries::crawlerUrlDebug, crawlservpp::Module::Crawler::Config::Entries::crawlerUrlMaxLength, crawlservpp::Module::Crawler::Config::Entries::crawlerUrlStartupCheck, crawlservpp::Module::Crawler::Config::Entries::crawlerWarningsFile, crawlservpp::Module::Crawler::Config::Entries::crawlerXml, crawlservpp::Module::Crawler::Config::Entries::customCounters, crawlservpp::Module::Crawler::Config::Entries::customCountersAlias, crawlservpp::Module::Crawler::Config::Entries::customCountersAliasAdd, crawlservpp::Module::Crawler::Config::Entries::customCountersEnd, crawlservpp::Module::Crawler::Config::Entries::customCountersGlobal, crawlservpp::Module::Crawler::Config::Entries::customCountersStart, crawlservpp::Module::Crawler::Config::Entries::customCountersStep, crawlservpp::Module::Crawler::Config::Entries::customReCrawl, crawlservpp::Module::Crawler::Config::Entries::customRobots, crawlservpp::Module::Crawler::Config::Entries::customTokenHeaders, crawlservpp::Module::Crawler::Config::Entries::customTokens, crawlservpp::Module::Crawler::Config::Entries::customTokensCookies, crawlservpp::Module::Crawler::Config::Entries::customTokensKeep, crawlservpp::Module::Crawler::Config::Entries::customTokensRequired, crawlservpp::Module::Crawler::Config::Entries::customTokensSource, crawlservpp::Module::Crawler::Config::Entries::customTokensUsePost, crawlservpp::Module::Crawler::Config::Entries::customUrls, crawlservpp::Module::Crawler::Config::Entries::customUsePost, database, crawlservpp::Module::Thread::end(), crawlservpp::Module::Crawler::Config::Entries::expectedErrorIfLarger, crawlservpp::Module::Crawler::Config::Entries::expectedErrorIfSmaller, 
crawlservpp::Module::Crawler::Config::Entries::expectedQuery, crawlservpp::Query::Container::getBoolFromQuery(), crawlservpp::Query::Container::getBoolFromRegEx(), crawlservpp::Wrapper::Database::getConfiguration(), crawlservpp::Network::Curl::getContent(), crawlservpp::Network::Curl::getContentType(), crawlservpp::Network::Curl::getCurlCode(), crawlservpp::Module::Thread::getLast(), crawlservpp::Query::Container::getMultiFromQuery(), crawlservpp::Module::Crawler::Database::getNextUrl(), crawlservpp::Module::Crawler::Database::getNumberOfUrls(), crawlservpp::Network::Config::getProtocol(), crawlservpp::Network::Curl::getPublicIp(), crawlservpp::Wrapper::Database::getQueryProperties(), crawlservpp::Network::Curl::getResponseCode(), crawlservpp::Query::Container::getSingleFromQuery(), crawlservpp::Query::Container::getSingleFromRegEx(), crawlservpp::Module::Thread::getStatusMessage(), crawlservpp::Parsing::URI::getSubUri(), crawlservpp::Module::Crawler::Database::getUrlId(), crawlservpp::Module::Crawler::Database::getUrlPosition(), crawlservpp::Module::Thread::getWarpedOverAndReset(), crawlservpp::Module::Thread::getWebsite(), crawlservpp::Wrapper::Database::getWebsiteDomain(), crawlservpp::Query::Container::getXml(), crawlservpp::Struct::CrawlTimersContent::http, crawlservpp::Module::Crawler::httpIgnoreString, crawlservpp::Module::Crawler::httpResponseCodeIgnore, crawlservpp::Module::Crawler::httpResponseCodeMax, crawlservpp::Module::Crawler::httpResponseCodeMin, crawlservpp::Module::Crawler::httpsIgnoreString, crawlservpp::Module::Crawler::httpsString, crawlservpp::Module::Crawler::httpString, crawlservpp::Module::Thread::incrementProcessed(), crawlservpp::Wrapper::DatabaseTryLock< DB >::isActive(), crawlservpp::Module::Crawler::Database::isArchivedContentExists(), crawlservpp::Module::Thread::isLogLevel(), crawlservpp::Module::Thread::isRunning(), crawlservpp::Parsing::URI::isSameDomain(), crawlservpp::Module::Crawler::Database::isUrlCrawled(), 
crawlservpp::Helper::Utf8::isValidUtf8(), crawlservpp::Module::Config::loadConfig(), crawlservpp::Helper::CommaLocale::locale(), crawlservpp::Module::Crawler::Database::lockUrlIfOk(), crawlservpp::Module::Thread::log(), crawlservpp::Parsing::URI::makeAbsolute(), crawlservpp::Network::Config::networkConfig, networking, networkOptions, crawlservpp::Network::TorControl::newIdentity(), crawlservpp::Struct::CrawlStatsTick::newUrls, crawlservpp::Struct::CrawlStatsTick::newUrlsArchive, crawlservpp::Helper::DateTime::now(), onClear(), onInit(), crawlservpp::Struct::CrawlTimersContent::parse, crawlservpp::Parsing::URI::parseLink(), crawlservpp::Module::Thread::pauseByThread(), crawlservpp::Module::Crawler::Database::prepare(), crawlservpp::Module::Crawler::Config::Entries::redirectCookies, crawlservpp::Module::Crawler::Config::Entries::redirectHeaders, crawlservpp::Module::Crawler::Config::Entries::redirectQueryContent, crawlservpp::Module::Crawler::Config::Entries::redirectQueryUrl, crawlservpp::Module::Crawler::redirectSourceContent, crawlservpp::Module::Crawler::redirectSourceUrl, crawlservpp::Module::Crawler::Config::Entries::redirectTo, crawlservpp::Module::Crawler::Config::Entries::redirectUsePost, crawlservpp::Module::Crawler::Config::Entries::redirectVarNames, crawlservpp::Module::Crawler::Config::Entries::redirectVarSources, crawlservpp::Helper::Strings::replaceAll(), crawlservpp::Network::Config::resetBase(), crawlservpp::Network::Curl::resetConnection(), crawlservpp::Network::Config::Entries::resetTor, crawlservpp::Network::Config::Entries::resetTorAfter, crawlservpp::Network::Config::Entries::resetTorOnlyAfter, crawlservpp::Module::Crawler::robotsFirstLetters, crawlservpp::Module::Crawler::robotsMinLineLength, crawlservpp::Module::Crawler::robotsRelativeUrl, crawlservpp::Module::Crawler::robotsSitemapBegin, crawlservpp::Module::Crawler::Database::saveArchivedContent(), crawlservpp::Module::Crawler::Database::saveContent(), 
crawlservpp::Struct::CrawlTimersTick::select, crawlservpp::Network::Curl::setConfigCurrent(), crawlservpp::Network::Curl::setConfigGlobal(), crawlservpp::Network::Curl::setCookies(), crawlservpp::Module::Crawler::Config::setCrossDomain(), crawlservpp::Parsing::URI::setCurrentDomain(), crawlservpp::Parsing::URI::setCurrentOrigin(), crawlservpp::Network::Curl::setHeaders(), crawlservpp::Module::Thread::setLast(), crawlservpp::Wrapper::Database::setLogging(), crawlservpp::Module::Crawler::Database::setMaxBatchSize(), crawlservpp::Network::TorControl::setNewIdentityMax(), crawlservpp::Network::TorControl::setNewIdentityMin(), crawlservpp::Module::Thread::setProgress(), crawlservpp::Query::Container::setQueryTarget(), crawlservpp::Module::Crawler::Database::setRecrawl(), crawlservpp::Query::Container::setRemoveXmlInstructions(), crawlservpp::Query::Container::setRepairCData(), crawlservpp::Query::Container::setRepairComments(), crawlservpp::Wrapper::Database::setSleepOnError(), crawlservpp::Module::Thread::setStatusMessage(), crawlservpp::Query::Container::setTidyErrorsAndWarnings(), crawlservpp::Module::Crawler::Database::setUrlCaseSensitive(), crawlservpp::Module::Crawler::Database::setUrlDebug(), crawlservpp::Module::Crawler::Database::setUrlFinishedIfOk(), crawlservpp::Module::Crawler::Database::setUrlStartupCheck(), crawlservpp::Struct::CrawlTimersContent::sleep, crawlservpp::Module::Thread::sleep(), crawlservpp::Helper::Strings::sortAndRemoveDuplicates(), crawlservpp::Timer::StartStop::start(), crawlservpp::Timer::StartStop::stop(), crawlservpp::Timer::Simple::tick(), torControl, crawlservpp::Struct::CrawlTimersTick::total, crawlservpp::Timer::StartStop::totalStr(), crawlservpp::Helper::Strings::trim(), crawlservpp::Struct::QueryStruct::typeNone, crawlservpp::Parsing::URI::unescape(), crawlservpp::Module::Crawler::Database::unLockUrlIfOk(), crawlservpp::Network::Curl::unsetCookies(), crawlservpp::Network::Curl::unsetHeaders(), 
crawlservpp::Struct::CrawlTimersContent::update, crawlservpp::Module::Crawler::updateCustomUrlCountEvery, crawlservpp::Module::Crawler::Database::urlDuplicationCheck(), crawlservpp::Module::Crawler::Database::urlEmptyCheck(), crawlservpp::Module::Crawler::Database::urlHashCheck(), crawlservpp::Module::Thread::urlListNamespace, crawlservpp::Struct::CrawlStatsTick::urlLockTimeArchiveMs, crawlservpp::Main::Exception::view(), crawlservpp::Module::Thread::websiteNamespace, and crawlservpp::Module::Crawler::wwwString.
|
overrideprotectedvirtual |
Performs a crawler tick.
If successful, this will crawl one URL. If not, the URL will either be skipped, or retried in the next tick.
Implements crawlservpp::Module::Thread.
References crawlservpp::Module::Crawler::Config::config, crawlservpp::Module::Crawler::Config::Entries::crawlerTiming, crawlservpp::Struct::CrawlTimersTick::select, crawlservpp::Timer::StartStop::start(), crawlservpp::Network::TorControl::tick(), torControl, and crawlservpp::Struct::CrawlTimersTick::total.
|
overrideprotectedvirtual |
Unpauses the crawler.
Calculates how long the crawler was paused.
Implements crawlservpp::Module::Thread.
References crawlservpp::Helper::DateTime::now().
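The bookkeeping shared by the pause and unpause handlers (store a time stamp on pause, add the difference on unpause) can be sketched as follows. `PauseClockSketch` is hypothetical; it assumes `std::chrono::steady_clock` time points passed in by the caller, whereas the documented functions reference `Helper::DateTime::now()`, whose return type is not shown in this section.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper that accumulates the total time spent paused.
class PauseClockSketch {
public:
    using Clock = std::chrono::steady_clock;

    // Remembers when the pause started.
    void onPause(Clock::time_point now) {
        pausedAt = now;
        paused = true;
    }

    // Adds the length of the pause that has just ended to the total.
    void onUnpause(Clock::time_point now) {
        assert(paused);     // unpausing requires a preceding pause

        totalPaused += now - pausedAt;
        paused = false;
    }

    // Returns the accumulated pause time, truncated to milliseconds.
    std::chrono::milliseconds pausedFor() const {
        return std::chrono::duration_cast<std::chrono::milliseconds>(totalPaused);
    }

private:
    Clock::time_point pausedAt{};
    Clock::duration totalPaused{Clock::duration::zero()};
    bool paused{false};
};
```

Passing the time in, instead of reading the clock inside the handlers, keeps the sketch deterministic; a caller would typically pass `PauseClockSketch::Clock::now()`.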
|
protectedinherited |
Forces the thread to pause.
References crawlservpp::Module::Thread::database, and crawlservpp::Main::Database::setThreadStatus().
Referenced by crawlservpp::Module::Thread::log(), crawlservpp::Module::Parser::Thread::onReset(), crawlservpp::Module::Extractor::Thread::onReset(), onReset(), and crawlservpp::Module::Analyzer::Thread::pause().
|
inlineprotectedinherited |
Reserves memory for a specific number of subsets.
query | A constant reference to a structure identifying the query for whose type memory will be specifically reserved. |
n | The number of subsets for which memory will be reserved. |
References crawlservpp::Parsing::XML::clear(), crawlservpp::Helper::Memory::free(), crawlservpp::Helper::Json::free(), crawlservpp::Helper::Container::moveInto(), crawlservpp::Parsing::XML::parse(), crawlservpp::Helper::Json::parseCons(), crawlservpp::Helper::Json::parseRapid(), crawlservpp::Helper::Json::stringify(), crawlservpp::Struct::QueryStruct::type, crawlservpp::Struct::QueryStruct::typeJsonPath, crawlservpp::Struct::QueryStruct::typeJsonPointer, crawlservpp::Struct::QueryStruct::typeNone, crawlservpp::Struct::QueryStruct::typeRegEx, crawlservpp::Struct::QueryStruct::typeXPath, crawlservpp::Struct::QueryStruct::typeXPathJsonPath, crawlservpp::Struct::QueryStruct::typeXPathJsonPointer, and crawlservpp::Main::Exception::view().
Referenced by crawlservpp::Module::Extractor::Thread::onReset().
|
inherited |
Will reset the thread before the next tick.
|
protectedinherited |
Sets the last ID processed by the thread.
Also sets the number of processed IDs. Make sure to increment that number beforehand if the ID has been processed.
lastId | The last ID processed by the thread. |
References crawlservpp::Module::Thread::database, and crawlservpp::Module::Database::setThreadLast().
Referenced by crawlservpp::Module::Thread::log(), crawlservpp::Module::Parser::Thread::onReset(), crawlservpp::Module::Extractor::Thread::onReset(), and onReset().
|
inlineprotectedinherited |
Sets whether to minimize memory usage.
isMinimizeMemory | Sets whether to minimize memory usage, prioritizing low memory usage over performance. |
Referenced by crawlservpp::Module::Extractor::Thread::onReset().
|
protectedinherited |
Sets the progress of the thread.
newProgress | The new progress of the thread, between 0.f (none), and 1.f (done). |
References crawlservpp::Module::Thread::database, and crawlservpp::Module::Database::setThreadProgress().
Referenced by crawlservpp::Module::Analyzer::Thread::finished(), crawlservpp::Module::Analyzer::Algo::TermsOverTime::onAlgoInit(), crawlservpp::Module::Analyzer::Algo::ExtractIds::onAlgoInit(), crawlservpp::Module::Analyzer::Algo::CorpusGenerator::onAlgoInit(), crawlservpp::Module::Analyzer::Algo::WordsOverTime::onAlgoInit(), crawlservpp::Module::Analyzer::Algo::Assoc::onAlgoInit(), crawlservpp::Module::Analyzer::Algo::AssocOverTime::onAlgoInit(), crawlservpp::Module::Analyzer::Algo::AllTokens::onAlgoInit(), crawlservpp::Module::Analyzer::Algo::Empty::onAlgoInit(), crawlservpp::Module::Analyzer::Algo::SentimentOverTime::onAlgoInit(), crawlservpp::Module::Analyzer::Algo::TopicModelling::onAlgoInit(), crawlservpp::Module::Parser::Thread::onReset(), crawlservpp::Module::Extractor::Thread::onReset(), onReset(), crawlservpp::Module::Analyzer::Thread::onTick(), crawlservpp::Module::Parser::Thread::onTick(), crawlservpp::Module::Extractor::Thread::onTick(), crawlservpp::Module::Analyzer::Algo::TermsOverTime::resetAlgo(), crawlservpp::Module::Analyzer::Algo::ExtractIds::resetAlgo(), crawlservpp::Module::Analyzer::Algo::WordsOverTime::resetAlgo(), crawlservpp::Module::Analyzer::Algo::Assoc::resetAlgo(), crawlservpp::Module::Analyzer::Algo::AssocOverTime::resetAlgo(), crawlservpp::Module::Analyzer::Algo::AllTokens::resetAlgo(), crawlservpp::Module::Analyzer::Algo::SentimentOverTime::resetAlgo(), and crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo().
|
inlineprotectedinherited |
Sets the content to use the managed queries on.
The old query target referencing the old content will be cleared.
content | Constant reference to a string containing the content to use the managed queries on. |
source | Constant reference to a string containing the source (URL) of the content. It will be used for logging and error reporting purposes only. |
References crawlservpp::Query::Container::clearQueryTarget().
Referenced by crawlservpp::Module::Analyzer::Thread::cleanUpQueries(), crawlservpp::Module::Parser::Thread::onReset(), crawlservpp::Module::Extractor::Thread::onReset(), and onReset().
|
inlineprotectedinherited |
Sets whether to remove XML processing instructions (<?xml:...>) before parsing HTML/XML content.
isRemoveXmlInstructions | Sets whether to remove XML processing instructions. |
Referenced by crawlservpp::Module::Parser::Thread::onReset(), crawlservpp::Module::Extractor::Thread::onReset(), and onReset().
|
inlineprotectedinherited |
Sets whether to try to repair CData when parsing XML.
isRepairCData | Set whether to try to repair CData when parsing XML. |
Referenced by crawlservpp::Module::Parser::Thread::onReset(), crawlservpp::Module::Extractor::Thread::onReset(), and onReset().
|
inlineprotectedinherited |
Sets whether to try to repair broken HTML/XML comments.
isRepairComments | Set whether to try to repair broken HTML/XML comments. |
Referenced by crawlservpp::Module::Parser::Thread::onReset(), crawlservpp::Module::Extractor::Thread::onReset(), and onReset().
|
protectedinherited |
Sets the status message of the thread.
statusMessage | Constant reference to a string containing the new status message to be set. |
References crawlservpp::Module::Thread::database, and crawlservpp::Main::Database::setThreadStatus().
Referenced by crawlservpp::Module::Analyzer::Thread::cleanUpQueries(), crawlservpp::Module::Analyzer::Thread::finished(), crawlservpp::Module::Thread::log(), crawlservpp::Module::Analyzer::Algo::TermsOverTime::onAlgoInit(), crawlservpp::Module::Analyzer::Algo::ExtractIds::onAlgoInit(), crawlservpp::Module::Analyzer::Algo::CorpusGenerator::onAlgoInit(), crawlservpp::Module::Analyzer::Algo::WordsOverTime::onAlgoInit(), crawlservpp::Module::Analyzer::Algo::Assoc::onAlgoInit(), crawlservpp::Module::Analyzer::Algo::AssocOverTime::onAlgoInit(), crawlservpp::Module::Analyzer::Algo::AllTokens::onAlgoInit(), crawlservpp::Module::Analyzer::Algo::Empty::onAlgoInit(), crawlservpp::Module::Analyzer::Algo::SentimentOverTime::onAlgoInit(), crawlservpp::Module::Analyzer::Algo::TopicModelling::onAlgoInit(), crawlservpp::Module::Analyzer::Algo::CorpusGenerator::onAlgoTick(), crawlservpp::Module::Analyzer::Algo::AllTokens::onAlgoTick(), crawlservpp::Module::Analyzer::Algo::TopicModelling::onAlgoTick(), crawlservpp::Module::Parser::Thread::onClear(), crawlservpp::Module::Extractor::Thread::onClear(), crawlservpp::Module::Parser::Thread::onReset(), crawlservpp::Module::Extractor::Thread::onReset(), onReset(), crawlservpp::Module::Parser::Thread::onTick(), crawlservpp::Module::Extractor::Thread::onTick(), crawlservpp::Module::Analyzer::Algo::TermsOverTime::resetAlgo(), crawlservpp::Module::Analyzer::Algo::ExtractIds::resetAlgo(), crawlservpp::Module::Analyzer::Algo::WordsOverTime::resetAlgo(), crawlservpp::Module::Analyzer::Algo::Assoc::resetAlgo(), crawlservpp::Module::Analyzer::Algo::AssocOverTime::resetAlgo(), crawlservpp::Module::Analyzer::Algo::AllTokens::resetAlgo(), crawlservpp::Module::Analyzer::Algo::SentimentOverTime::resetAlgo(), crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo(), and crawlservpp::Module::Analyzer::Thread::uploadResult().
|
inlineprotectedinherited |
Sets subsets for subsequent queries using a query of any type.
The subsets resulting from the query will be saved in-class. Previous subsets will be overwritten.
query | A constant reference to a structure identifying the query that will be performed to acquire the subset. |
warningsTo | A reference to a vector of strings to which all warnings will be appended that occur during the execution of the query. |
Container::Exception | if no query target has been specified or the query is of an unknown type. |
References crawlservpp::Struct::QueryStruct::index, crawlservpp::Helper::Json::parseCons(), crawlservpp::Helper::Json::parseRapid(), crawlservpp::Struct::QueryStruct::resultSubSets, crawlservpp::Struct::QueryStruct::type, crawlservpp::Struct::QueryStruct::typeJsonPath, crawlservpp::Struct::QueryStruct::typeJsonPointer, crawlservpp::Struct::QueryStruct::typeNone, crawlservpp::Struct::QueryStruct::typeRegEx, crawlservpp::Struct::QueryStruct::typeXPath, crawlservpp::Struct::QueryStruct::typeXPathJsonPath, crawlservpp::Struct::QueryStruct::typeXPathJsonPointer, and crawlservpp::Main::Exception::view().
Referenced by crawlservpp::Module::Extractor::Thread::onReset().
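The dispatch on the query type described above can be sketched as follows. This is a minimal illustration, not the crawlserv++ implementation: the enum `QueryType` and the function `requiredTarget` are hypothetical, the enumerators only mirror the `QueryStruct` type constants listed in the references, and `std::runtime_error` stands in for `Container::Exception`. The mapping of JSONPointer queries to `parseRapid()` and of JSONPath queries to `parseCons()` is an assumption based on the helper names, as is the handling of `typeNone` as a no-op.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical mirror of the QueryStruct type constants named in the references.
enum class QueryType {
    None, RegEx, XPath, JsonPointer, JsonPath, XPathJsonPointer, XPathJsonPath
};

// Returns a description of the target representation a query type would need.
// Throws if no target has been set or the type is unknown (assumed behavior).
std::string requiredTarget(QueryType type, bool targetSet) {
    if(!targetSet) {
        throw std::runtime_error("no query target specified");
    }

    switch(type) {
    case QueryType::None:
        return "none"; // assumption: nothing to do for an empty query
    case QueryType::RegEx:
        return "raw content";
    case QueryType::XPath:
        return "parsed XML";
    case QueryType::JsonPointer:
        return "JSON parsed with RapidJSON";
    case QueryType::JsonPath:
        return "JSON parsed with jsoncons";
    case QueryType::XPathJsonPointer:
        return "parsed XML, then JSON parsed with RapidJSON";
    case QueryType::XPathJsonPath:
        return "parsed XML, then JSON parsed with jsoncons";
    }

    // Reached only for values outside the enumerators above.
    throw std::runtime_error("unknown query type");
}
```

Throwing after the `switch` rather than in a `default` branch keeps compiler warnings about unhandled enumerators active while still rejecting out-of-range values.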
|
inline protected inherited |
Sets how tidy-html5 reports errors and warnings.
The reporting of both errors and warnings is deactivated by default.
For more information about tidy-html5, see its GitHub repository.
warnings | Specify whether to report simple warnings. |
numOfErrors | Set the number of errors to be reported. Set to zero to deactivate error reporting. |
References crawlservpp::Parsing::XML::setOptions().
Referenced by crawlservpp::Module::Parser::Thread::onReset(), crawlservpp::Module::Extractor::Thread::onReset(), and onReset().
|
protected inherited |
Lets the thread sleep for the specified number of milliseconds.
The sleep will be interrupted if the thread is stopped.
Thread-safe: Can be used by both the module and the main thread.
ms | The number of milliseconds for the thread to sleep, if it is not stopped. |
References crawlservpp::Module::sleepMs.
Referenced by crawlservpp::Module::Analyzer::Algo::CorpusGenerator::onAlgoTick(), onReset(), crawlservpp::Module::Analyzer::Thread::onTick(), crawlservpp::Module::Parser::Thread::onTick(), and crawlservpp::Module::Extractor::Thread::onTick().
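An interruptible sleep of the kind described above is commonly built by sleeping in short slices and checking a stop flag between them. The sketch below illustrates that pattern only. The class `SleepSketch`, its members, and the slice length of 10 ms are hypothetical; that `crawlservpp::Module::sleepMs` plays the role of the slice length is an assumption, not something the documentation states.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <thread>

// Hypothetical stand-in for a module thread; not the crawlserv++ class.
class SleepSketch {
public:
    // Sleeps for up to `ms` milliseconds in short slices and returns early
    // once stop() has been called. Returns the nominal milliseconds slept.
    std::uint64_t sleep(std::uint64_t ms) const {
        constexpr std::uint64_t sliceMs{10}; // assumed slice length

        std::uint64_t slept{0};

        while(slept < ms && this->running.load()) {
            const std::uint64_t step{std::min(sliceMs, ms - slept)};

            std::this_thread::sleep_for(std::chrono::milliseconds(step));

            slept += step;
        }

        return slept;
    }

    // May be called from another thread to interrupt a running sleep.
    void stop() {
        this->running.store(false);
    }

private:
    std::atomic<bool> running{true};
};
```

Because the flag is a `std::atomic<bool>`, `stop()` can safely be called from the main thread while the module thread is inside `sleep()`; the sleep then ends after at most one further slice.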
|
inherited |
Jumps to the specified target ID ("time travel").
Skips the normal process of determining the next ID once the current ID has been processed.
Thread-safe: Can be used by both the module and the main thread.
target | The target ID that should be processed next. |
Module::Thread::Exception | if no target is specified, i.e. the target ID is zero. |
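The jump described above can be pictured as an atomically stored override that is consumed when the next ID is determined. This is a sketch under assumptions, not the library's code: the class `JumpSketch`, the member `overwriteId`, and the `next()` helper are hypothetical, "next ID" is simplified to `current + 1`, and `std::invalid_argument` stands in for `Module::Thread::Exception`.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Hypothetical stand-in for a module thread; not the crawlserv++ class.
class JumpSketch {
public:
    // Requests that `target` be processed next. Zero means "no target" and is rejected.
    void jump(std::uint64_t target) {
        if(target == 0) {
            throw std::invalid_argument("jump(): no target ID specified");
        }

        this->overwriteId.store(target);
    }

    // Determines the ID to process after `current`: a pending jump target
    // takes precedence and is consumed, otherwise the normal successor is used.
    std::uint64_t next(std::uint64_t current) {
        const std::uint64_t target{this->overwriteId.exchange(0)};

        return target > 0 ? target : current + 1;
    }

private:
    std::atomic<std::uint64_t> overwriteId{0};
};
```

Using `exchange(0)` reads and clears the pending target in one atomic step, so a jump requested from the main thread is applied exactly once.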
|
protected inherited |
JSON string of the configuration used by the thread.
Referenced by crawlservpp::Module::Thread::Thread().
|
protected |
Database connection for the crawler thread.
Referenced by onReset().
|
protected |
Networking for the crawler thread.
Referenced by onReset().
|
protected |
Network settings for the crawler thread.
Referenced by onReset().
|
protected |
|
protected inherited |
Namespace of the URL list used by the thread.
Referenced by crawlservpp::Module::Analyzer::Thread::getTargetTableName(), crawlservpp::Module::Parser::Thread::onReset(), crawlservpp::Module::Extractor::Thread::onReset(), onReset(), and crawlservpp::Module::Thread::Thread().
|
protected inherited |
Namespace of the website used by the thread.
Referenced by crawlservpp::Module::Analyzer::Thread::getTargetTableName(), crawlservpp::Module::Parser::Thread::onReset(), crawlservpp::Module::Extractor::Thread::onReset(), onReset(), and crawlservpp::Module::Thread::Thread().