crawlserv++
[under development]
Application for crawling and analyzing textual content of websites.
Namespace for extractor classes.
Classes
class | Config | Configuration for extractors.
class | Database | Class providing database functionality for extractor threads by implementing Wrapper::Database.
class | Thread | Extractor thread.
Constants
constexpr std::uint8_t | crawlerLoggingVerbose {0} | Logging is disabled.
constexpr std::uint8_t | generalLoggingDefault {1} | Default logging is enabled.
constexpr std::uint8_t | generalLoggingExtended {2} | Extended logging is enabled.
constexpr std::uint8_t | generalLoggingVerbose {3} | Verbose logging is enabled.
constexpr std::uint8_t | variablesSourcesParsed {0} | Extract variable value from parsed data.
constexpr std::uint8_t | variablesSourcesContent {1} | Extract variable value from the content of a crawled web page.
constexpr std::uint8_t | variablesSourcesUrl {2} | Extract variable value from the URL of a crawled web page.
constexpr std::uint8_t | expectedSourceExtracting {0} | Extract data from other extracted data.
constexpr std::uint8_t | expectedSourceParsed {1} | Extract data from parsed data.
constexpr std::uint8_t | expectedSourceContent {2} | Extract data from the content of a crawled web page.
constexpr std::array | defaultRetryHttpStatusCodes {429, 502, 503, 504} | HTTP status codes to retry by default.
constexpr std::array | protocolsToRemove {"http://"sv, "https://"sv} | Protocols to remove from URLs.
constexpr std::uint64_t | defaultCacheSize {2500} | Default cache size.
constexpr std::uint32_t | defaultLockS {300} | Default locking time, in seconds.
constexpr std::uint16_t | defaultMaxBatchSize {500} | Default number of URLs and results to be processed in one MySQL query.
constexpr std::int64_t | defaultReTries {720} | Default re-tries on connection error.
constexpr std::uint64_t | defaultSleepErrorMs {10000} | Default sleeping time on connection errors, in milliseconds.
constexpr std::uint64_t | defaultSleepHttpMs {0} | Default time that will be waited between HTTP requests, in milliseconds.
constexpr std::uint64_t | defaultSleepIdleMs {5000} | Default time to wait before checking for new URLs when all URLs have been processed, in milliseconds.
constexpr std::uint64_t | defaultSleepMySqlS {60} | Default time to wait before last try to re-connect to MySQL server, in seconds.
constexpr auto | defaultPagingVariable {"$p"sv} | Default name of the paging variable.
constexpr std::uint64_t | defaultRecursiveMaxDepth {100} | Default maximum depth of recursive extracting.
constexpr auto | minTargetColumns {4} | Minimum number of columns in the target table.
constexpr auto | minLinkedColumns {2} | Minimum number of columns in the linked target table.
constexpr auto | maxContentSize {1073741824} | Maximum size of database content (= 1 GiB).
constexpr auto | maxContentSizeString {"1 GiB"sv} | Maximum size of database content as string.
constexpr auto | httpResponseCodeMin {400} | Minimum HTTP error code.
constexpr auto | httpResponseCodeMax {599} | Maximum HTTP error code.
constexpr auto | httpResponseCodeIgnore {200} | HTTP response code to be ignored when checking for errors.
Constants for MySQL Queries
constexpr auto | oneAtOnce {1} | Process one value at once.
constexpr auto | nAtOnce10 {10} | Process ten values at once.
constexpr auto | nAtOnce100 {100} | Process one hundred values at once.
constexpr auto | sqlArg1 {1} | First argument in a SQL query.
constexpr auto | sqlArg2 {2} | Second argument in a SQL query.
constexpr auto | sqlArg3 {3} | Third argument in a SQL query.
constexpr auto | sqlArg4 {4} | Fourth argument in a SQL query.
constexpr auto | sqlArg5 {5} | Fifth argument in a SQL query.
constexpr auto | extractingTableAlias {"a"sv} | Alias, used in SQL queries, for the extracting table.
constexpr auto | targetTableAlias {"b"sv} | Alias, used in SQL queries, for the target table.
constexpr auto | linkedTableAlias {"c"sv} | Alias, used in SQL queries, for the linked target table.
constexpr auto | parsedDataTableAlias {"a"sv} | Alias, used in SQL queries, for the parsed data table.
constexpr auto | crawledDataTableAlias {"b"sv} | Alias, used in SQL queries, for the crawled data table.
constexpr auto | urlListTableAlias {"c"sv} | Alias, used in SQL queries, for the URL list table.
constexpr auto | numArgsLockUrl {3} | Number of arguments to lock one URL.
constexpr auto | numArgsAddUpdateData {4} | Number of arguments to add or update one data entry (without custom columns).
constexpr auto | numArgsLinked {2} | Number of additional arguments when data is linked.
constexpr auto | numArgsOverwriteData {3} | Number of additional arguments when overwriting existing data.
constexpr auto | numArgsAddUpdateLinkedData {2} | Number of arguments to add or update one linked data entry.
constexpr auto | numArgsOverwriteLinkedData {2} | Number of additional arguments when overwriting existing linked data.
constexpr auto | numArgsFinishUrl {2} | Number of arguments to set a URL to finished.
Namespace for extractor classes.
inline constexpr auto crawledDataTableAlias {"b"sv}
Alias, used in SQL queries, for the crawled data table.
Referenced by crawlservpp::Module::Extractor::Database::prepare().
inline constexpr std::uint8_t crawlerLoggingVerbose {0}
Logging is disabled.
inline constexpr std::uint64_t defaultCacheSize {2500}
Default cache size.
inline constexpr std::uint32_t defaultLockS {300}
Default locking time, in seconds.
inline constexpr std::uint16_t defaultMaxBatchSize {500}
Default number of URLs and results to be processed in one MySQL query.
inline constexpr auto defaultPagingVariable {"$p"sv}
Default name of the paging variable.
To be used in Extractor::Config::Entries::sourceUrl, Extractor::Config::Entries::sourceCookies, and Extractor::Config::Entries::sourceHeaders. It will be replaced with either the number or the name of the current page.
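The substitution described above can be sketched as follows. This is a minimal C++17 illustration, not the extractor's actual code: replacePagingVariable() is a hypothetical helper, and only the value of defaultPagingVariable is taken from this page.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

using namespace std::string_view_literals;

// Value copied from the constant documented above.
inline constexpr auto defaultPagingVariable{"$p"sv};

// Hypothetical helper (not part of crawlserv++): replaces every occurrence
// of `variable` in `source` with `page`, e.g. a page number or a page name.
inline std::string replacePagingVariable(
        std::string source,
        std::string_view variable,
        std::string_view page
) {
    if(variable.empty()) {
        return source; // nothing to search for
    }

    std::size_t pos{0};

    while((pos = source.find(variable, pos)) != std::string::npos) {
        source.replace(pos, variable.length(), page);

        // continue after the inserted text, so that it is not scanned again
        pos += page.length();
    }

    return source;
}
```

For example, calling replacePagingVariable("example.com/list?page=$p", defaultPagingVariable, "3") yields "example.com/list?page=3".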
inline constexpr std::uint64_t defaultRecursiveMaxDepth {100}
Default maximum depth of recursive extracting.
inline constexpr std::int64_t defaultReTries {720}
Default re-tries on connection error.
inline constexpr std::array defaultRetryHttpStatusCodes {429, 502, 503, 504}
HTTP status codes to retry by default.
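A membership test against defaultRetryHttpStatusCodes might look like the sketch below (C++17). The helper isRetryStatusCode() is hypothetical and only illustrates how such a list can be consulted. The values are copied from this page.

```cpp
#include <array>

// Values copied from the constant documented above.
inline constexpr std::array defaultRetryHttpStatusCodes{429, 502, 503, 504};

// Hypothetical helper (not part of crawlserv++): checks whether a request
// that returned `statusCode` would be retried under the default list.
constexpr bool isRetryStatusCode(int statusCode) noexcept {
    for(const auto code : defaultRetryHttpStatusCodes) {
        if(code == statusCode) {
            return true;
        }
    }

    return false;
}

static_assert(isRetryStatusCode(503));
static_assert(!isRetryStatusCode(404));
```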
inline constexpr std::uint64_t defaultSleepErrorMs {10000}
Default sleeping time on connection errors, in milliseconds.
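To get a feeling for the defaults, the sketch below (C++17) multiplies defaultReTries by defaultSleepErrorMs. It assumes that every re-try is preceded by exactly one sleep of that length, which this page does not state. The helper totalSleepTime() is hypothetical, and treating non-positive re-try counts as zero is a choice of this sketch only.

```cpp
#include <chrono>
#include <cstdint>

// Values copied from the constants documented on this page.
inline constexpr std::int64_t defaultReTries{720};
inline constexpr std::uint64_t defaultSleepErrorMs{10000};

// Hypothetical helper (not part of crawlserv++): total sleeping time if
// each re-try is preceded by one sleep of `sleepMs` milliseconds.
constexpr std::chrono::milliseconds totalSleepTime(
        std::int64_t reTries,
        std::uint64_t sleepMs
) noexcept {
    if(reTries <= 0) {
        return std::chrono::milliseconds{0};
    }

    return std::chrono::milliseconds{
        reTries * static_cast<std::int64_t>(sleepMs)
    };
}

// 720 re-tries * 10,000 ms = 7,200,000 ms = 2 hours
static_assert(
    totalSleepTime(defaultReTries, defaultSleepErrorMs)
    == std::chrono::hours{2}
);
```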
inline constexpr std::uint64_t defaultSleepHttpMs {0}
Default time that will be waited between HTTP requests, in milliseconds.
inline constexpr std::uint64_t defaultSleepIdleMs {5000}
Default time to wait before checking for new URLs when all URLs have been processed, in milliseconds.
inline constexpr std::uint64_t defaultSleepMySqlS {60}
Default time to wait before last try to re-connect to MySQL server, in seconds.
inline constexpr std::uint8_t expectedSourceContent {2}
Extract data from the content of a crawled web page.
inline constexpr std::uint8_t expectedSourceExtracting {0}
Extract data from other extracted data.
inline constexpr std::uint8_t expectedSourceParsed {1}
Extract data from parsed data.
inline constexpr auto extractingTableAlias {"a"sv}
Alias, used in SQL queries, for the extracting table.
Referenced by crawlservpp::Module::Extractor::Database::updateTargetTable().
inline constexpr std::uint8_t generalLoggingDefault {1}
Default logging is enabled.
Referenced by crawlservpp::Module::Extractor::Thread::onClear(), crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Thread::onTick().
inline constexpr std::uint8_t generalLoggingExtended {2}
Extended logging is enabled.
Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Thread::onTick().
inline constexpr std::uint8_t generalLoggingVerbose {3}
Verbose logging is enabled.
Referenced by crawlservpp::Module::Extractor::Thread::onReset().
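Because the logging levels are ordered (0 = disabled, 1 = default, 2 = extended, 3 = verbose), a level check can be reduced to a comparison. The sketch below (C++17) assumes that a message is written whenever the configured level is at least the level the message requires. isLogged() is a hypothetical helper and not part of the documented interface.

```cpp
#include <cstdint>

// Values copied from the constants documented on this page.
inline constexpr std::uint8_t generalLoggingDefault{1};
inline constexpr std::uint8_t generalLoggingExtended{2};
inline constexpr std::uint8_t generalLoggingVerbose{3};

// Hypothetical helper (not part of crawlserv++): a message is written if
// the configured level is at least the level required by the message.
constexpr bool isLogged(
        std::uint8_t configuredLevel,
        std::uint8_t requiredLevel
) noexcept {
    return configuredLevel >= requiredLevel;
}

static_assert(isLogged(generalLoggingVerbose, generalLoggingExtended));
static_assert(!isLogged(generalLoggingDefault, generalLoggingExtended));
static_assert(!isLogged(0, generalLoggingDefault)); // 0: logging is disabled
```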
inline constexpr auto httpResponseCodeIgnore {200}
HTTP response code to be ignored when checking for errors.
Referenced by crawlservpp::Module::Extractor::Thread::onReset().
inline constexpr auto httpResponseCodeMax {599}
Maximum HTTP error code.
Referenced by crawlservpp::Module::Extractor::Thread::onReset().
inline constexpr auto httpResponseCodeMin {400}
Minimum HTTP error code.
Referenced by crawlservpp::Module::Extractor::Thread::onReset().
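One plausible way to combine httpResponseCodeMin, httpResponseCodeMax, and httpResponseCodeIgnore is a validation of configured error codes, sketched below in C++17. How Thread::onReset() actually uses these constants is not described on this page. The CodeCheck enumeration and checkErrorCode() are hypothetical names introduced for illustration.

```cpp
// Values copied from the constants documented on this page.
inline constexpr auto httpResponseCodeMin{400};
inline constexpr auto httpResponseCodeMax{599};
inline constexpr auto httpResponseCodeIgnore{200};

// Hypothetical result type (not part of crawlserv++).
enum class CodeCheck { ignored, valid, invalid };

// Hypothetical helper: classifies a code that was configured as an error
// code. 200 is skipped silently, 400 to 599 is accepted, the rest rejected.
constexpr CodeCheck checkErrorCode(int code) noexcept {
    if(code == httpResponseCodeIgnore) {
        return CodeCheck::ignored;
    }

    if(code < httpResponseCodeMin || code > httpResponseCodeMax) {
        return CodeCheck::invalid;
    }

    return CodeCheck::valid;
}

static_assert(checkErrorCode(200) == CodeCheck::ignored);
static_assert(checkErrorCode(503) == CodeCheck::valid);
static_assert(checkErrorCode(302) == CodeCheck::invalid);
```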
inline constexpr auto linkedTableAlias {"c"sv}
Alias, used in SQL queries, for the linked target table.
Referenced by crawlservpp::Module::Extractor::Database::updateTargetTable().
inline constexpr auto maxContentSize {1073741824}
Maximum size of database content (= 1 GiB).
Referenced by crawlservpp::Module::Extractor::Database::updateTargetTable().
inline constexpr auto maxContentSizeString {"1 GiB"sv}
Maximum size of database content as string.
Referenced by crawlservpp::Module::Extractor::Database::updateTargetTable().
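The two constants maxContentSize and maxContentSizeString belong together: one is used for the comparison, the other for a readable message. The C++17 sketch below shows such a pairing. fitsIntoDatabase() and tooLargeMessage() are hypothetical helpers, and the wording of the message is an assumption.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

using namespace std::string_view_literals;

// Values copied from the constants documented on this page.
inline constexpr auto maxContentSize{1073741824};
inline constexpr auto maxContentSizeString{"1 GiB"sv};

// 1 GiB = 1024 * 1024 * 1024 bytes
static_assert(maxContentSize == 1024 * 1024 * 1024);

// Hypothetical helper (not part of crawlserv++): size check in bytes.
constexpr bool fitsIntoDatabase(std::size_t size) noexcept {
    return size <= static_cast<std::size_t>(maxContentSize);
}

// Hypothetical helper: builds a message for content that is too large.
inline std::string tooLargeMessage(std::size_t size) {
    std::string message{"Content of "};

    message += std::to_string(size);
    message += " bytes exceeds the maximum of ";
    message += maxContentSizeString;

    return message;
}
```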
inline constexpr auto minLinkedColumns {2}
Minimum number of columns in the linked target table.
Referenced by crawlservpp::Module::Extractor::Database::initTargetTables(), and crawlservpp::Module::Extractor::Database::updateOrAddLinked().
inline constexpr auto minTargetColumns {4}
Minimum number of columns in the target table.
Referenced by crawlservpp::Module::Extractor::Database::initTargetTables(), and crawlservpp::Module::Extractor::Database::updateOrAddEntries().
inline constexpr auto nAtOnce10 {10}
Process ten values at once.
Referenced by crawlservpp::Module::Extractor::Database::fetchUrls(), crawlservpp::Module::Extractor::Database::prepare(), crawlservpp::Module::Extractor::Database::setUrlsFinishedIfLockOk(), crawlservpp::Module::Extractor::Database::updateOrAddEntries(), and crawlservpp::Module::Extractor::Database::updateOrAddLinked().
inline constexpr auto nAtOnce100 {100}
Process one hundred values at once.
Referenced by crawlservpp::Module::Extractor::Database::fetchUrls(), crawlservpp::Module::Extractor::Database::prepare(), crawlservpp::Module::Extractor::Database::setUrlsFinishedIfLockOk(), crawlservpp::Module::Extractor::Database::updateOrAddEntries(), and crawlservpp::Module::Extractor::Database::updateOrAddLinked().
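The constants nAtOnce100, nAtOnce10, and oneAtOnce suggest that a set of values is processed by statements handling 100, 10, or 1 value(s) each. The C++17 sketch below shows how a total could be split under that assumption. BatchPlan and planBatches() are hypothetical and do not appear in the documented interface.

```cpp
#include <cstddef>

// Values copied from the constants documented on this page.
inline constexpr auto oneAtOnce{1};
inline constexpr auto nAtOnce10{10};
inline constexpr auto nAtOnce100{100};

// Hypothetical result type: how often each statement size would be used.
struct BatchPlan {
    std::size_t hundreds{0};
    std::size_t tens{0};
    std::size_t ones{0};
};

// Hypothetical helper (not part of crawlserv++): uses the largest
// statement size as often as possible, then the next smaller one.
constexpr BatchPlan planBatches(std::size_t total) noexcept {
    BatchPlan plan;

    plan.hundreds = total / nAtOnce100;
    total %= nAtOnce100;

    plan.tens = total / nAtOnce10;
    total %= nAtOnce10;

    plan.ones = total / oneAtOnce;

    return plan;
}

// 237 values: 2 * 100 + 3 * 10 + 7 * 1, i.e. 12 statements instead of 237
static_assert(planBatches(237).hundreds == 2);
static_assert(planBatches(237).tens == 3);
static_assert(planBatches(237).ones == 7);
```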
inline constexpr auto numArgsAddUpdateData {4}
Number of arguments to add or update one data entry (without custom columns).
Referenced by crawlservpp::Module::Extractor::Database::updateOrAddEntries().
inline constexpr auto numArgsAddUpdateLinkedData {2}
Number of arguments to add or update one linked data entry.
Referenced by crawlservpp::Module::Extractor::Database::updateOrAddLinked().
inline constexpr auto numArgsFinishUrl {2}
Number of arguments to set a URL to finished.
Referenced by crawlservpp::Module::Extractor::Database::setUrlsFinishedIfLockOk().
inline constexpr auto numArgsLinked {2}
Number of additional arguments when data is linked.
Referenced by crawlservpp::Module::Extractor::Database::updateOrAddEntries().
inline constexpr auto numArgsLockUrl {3}
Number of arguments to lock one URL.
Referenced by crawlservpp::Module::Extractor::Database::fetchUrls().
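When several URLs are handled by one prepared statement, the numArgs... constants and the sqlArg... constants can be combined into a placeholder index. The C++17 sketch below assumes 1-based placeholders and a fixed number of arguments per entry. argIndex() is a hypothetical helper and not part of the documented interface.

```cpp
// Values copied from the constants documented on this page.
inline constexpr auto sqlArg1{1};
inline constexpr auto sqlArg3{3};
inline constexpr auto numArgsLockUrl{3};

// Hypothetical helper (not part of crawlserv++): 1-based placeholder index
// of argument `arg` belonging to the zero-based entry `entry`.
constexpr int argIndex(int entry, int argsPerEntry, int arg) noexcept {
    return entry * argsPerEntry + arg;
}

// first URL: placeholders 1 to 3, second URL: 4 to 6, third URL: 7 to 9
static_assert(argIndex(0, numArgsLockUrl, sqlArg1) == 1);
static_assert(argIndex(1, numArgsLockUrl, sqlArg1) == 4);
static_assert(argIndex(2, numArgsLockUrl, sqlArg3) == 9);
```

Under the same assumption, a statement that locks ten URLs at once would need 10 * numArgsLockUrl = 30 placeholders.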
inline constexpr auto numArgsOverwriteData {3}
Number of additional arguments when overwriting existing data.
Referenced by crawlservpp::Module::Extractor::Database::updateOrAddEntries().
inline constexpr auto numArgsOverwriteLinkedData {2}
Number of additional arguments when overwriting existing linked data.
Referenced by crawlservpp::Module::Extractor::Database::updateOrAddLinked().
inline constexpr auto oneAtOnce {1}
Process one value at once.
Referenced by crawlservpp::Module::Extractor::Database::prepare().
inline constexpr auto parsedDataTableAlias {"a"sv}
Alias, used in SQL queries, for the parsed data table.
Referenced by crawlservpp::Module::Extractor::Database::prepare().
inline constexpr std::array protocolsToRemove {"http://"sv, "https://"sv}
Protocols to remove from URLs.
Referenced by crawlservpp::Module::Extractor::Config::reset().
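Removing one of the protocolsToRemove from the beginning of a URL can be sketched as below (C++17). Which configuration entries Config::reset() applies this to is not stated on this page. removeProtocol() is a hypothetical helper.

```cpp
#include <array>
#include <string_view>

using namespace std::string_view_literals;

// Values copied from the constant documented above.
inline constexpr std::array protocolsToRemove{"http://"sv, "https://"sv};

// Hypothetical helper (not part of crawlserv++): returns the URL without a
// leading protocol from the list, or the URL unchanged if none matches.
constexpr std::string_view removeProtocol(std::string_view url) noexcept {
    for(const auto protocol : protocolsToRemove) {
        // substr() clamps the length, so short URLs are handled safely
        if(url.substr(0, protocol.length()) == protocol) {
            return url.substr(protocol.length());
        }
    }

    return url;
}

static_assert(removeProtocol("https://example.com/a"sv) == "example.com/a"sv);
static_assert(removeProtocol("example.com/a"sv) == "example.com/a"sv);
```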
inline constexpr auto sqlArg1 {1}
First argument in a SQL query.
Referenced by crawlservpp::Module::Extractor::Database::fetchUrls(), crawlservpp::Module::Extractor::Database::getContent(), crawlservpp::Module::Extractor::Database::getLatestParsedData(), crawlservpp::Module::Extractor::Database::getLockTime(), crawlservpp::Module::Extractor::Database::getUrlLockTime(), crawlservpp::Module::Extractor::Database::getUrlPosition(), crawlservpp::Module::Extractor::Database::renewUrlLockIfOk(), crawlservpp::Module::Extractor::Database::setUrlsFinishedIfLockOk(), crawlservpp::Module::Extractor::Database::unLockUrlIfOk(), crawlservpp::Module::Extractor::Database::unLockUrlsIfOk(), crawlservpp::Module::Extractor::Database::updateOrAddEntries(), and crawlservpp::Module::Extractor::Database::updateOrAddLinked().
inline constexpr auto sqlArg2 {2}
Second argument in a SQL query.
Referenced by crawlservpp::Module::Extractor::Database::fetchUrls(), crawlservpp::Module::Extractor::Database::renewUrlLockIfOk(), crawlservpp::Module::Extractor::Database::setUrlsFinishedIfLockOk(), crawlservpp::Module::Extractor::Database::unLockUrlIfOk(), crawlservpp::Module::Extractor::Database::updateOrAddEntries(), and crawlservpp::Module::Extractor::Database::updateOrAddLinked().
inline constexpr auto sqlArg3 {3}
Third argument in a SQL query.
Referenced by crawlservpp::Module::Extractor::Database::fetchUrls(), crawlservpp::Module::Extractor::Database::renewUrlLockIfOk(), and crawlservpp::Module::Extractor::Database::updateOrAddEntries().
inline constexpr auto sqlArg4 {4}
Fourth argument in a SQL query.
Referenced by crawlservpp::Module::Extractor::Database::renewUrlLockIfOk(), and crawlservpp::Module::Extractor::Database::updateOrAddEntries().
inline constexpr auto sqlArg5 {5}
Fifth argument in a SQL query.
inline constexpr auto targetTableAlias {"b"sv}
Alias, used in SQL queries, for the target table.
Referenced by crawlservpp::Module::Extractor::Database::updateTargetTable().
inline constexpr auto urlListTableAlias {"c"sv}
Alias, used in SQL queries, for the URL list table.
Referenced by crawlservpp::Module::Extractor::Database::prepare().
inline constexpr std::uint8_t variablesSourcesContent {1}
Extract variable value from the content of a crawled web page.
Referenced by crawlservpp::Module::Extractor::Thread::onReset().
inline constexpr std::uint8_t variablesSourcesParsed {0}
Extract variable value from parsed data.
Referenced by crawlservpp::Module::Extractor::Thread::onReset().
inline constexpr std::uint8_t variablesSourcesUrl {2}
Extract variable value from the URL of a crawled web page.
Referenced by crawlservpp::Module::Extractor::Thread::onReset().