crawlserv++  [under development]
Application for crawling and analyzing textual content of websites.
crawlservpp::Module::Extractor Namespace Reference

Namespace for extractor classes.

Classes

class  Config
 Configuration for extractors.

class  Database
 Class providing database functionality for extractor threads by implementing Wrapper::Database.

class  Thread
 Extractor thread.
 

Constants

constexpr std::uint8_t crawlerLoggingVerbose {0}
 Logging is disabled.

constexpr std::uint8_t generalLoggingDefault {1}
 Default logging is enabled.

constexpr std::uint8_t generalLoggingExtended {2}
 Extended logging is enabled.

constexpr std::uint8_t generalLoggingVerbose {3}
 Verbose logging is enabled.

constexpr std::uint8_t variablesSourcesParsed {0}
 Extract variable value from parsed data.

constexpr std::uint8_t variablesSourcesContent {1}
 Extract variable value from the content of a crawled web page.

constexpr std::uint8_t variablesSourcesUrl {2}
 Extract variable value from the URL of a crawled web page.

constexpr std::uint8_t expectedSourceExtracting {0}
 Extract data from other extracted data.

constexpr std::uint8_t expectedSourceParsed {1}
 Extract data from parsed data.

constexpr std::uint8_t expectedSourceContent {2}
 Extract data from the content of a crawled web page.

constexpr std::array defaultRetryHttpStatusCodes {429, 502, 503, 504}
 HTTP status codes to retry by default.

constexpr std::array protocolsToRemove {"http://"sv, "https://"sv}
 Protocols to remove from URLs.

constexpr std::uint64_t defaultCacheSize {2500}
 Default cache size.

constexpr std::uint32_t defaultLockS {300}
 Default locking time, in seconds.

constexpr std::uint16_t defaultMaxBatchSize {500}
 Default number of URLs and results to be processed in one MySQL query.

constexpr std::int64_t defaultReTries {720}
 Default re-tries on connection error.

constexpr std::uint64_t defaultSleepErrorMs {10000}
 Default sleeping time on connection errors, in milliseconds.

constexpr std::uint64_t defaultSleepHttpMs {0}
 Default time that will be waited between HTTP requests, in milliseconds.

constexpr std::uint64_t defaultSleepIdleMs {5000}
 Default time to wait before checking for new URLs when all URLs have been processed, in milliseconds.

constexpr std::uint64_t defaultSleepMySqlS {60}
 Default time to wait before last try to re-connect to MySQL server, in seconds.

constexpr auto defaultPagingVariable {"$p"sv}
 Default name of the paging variable.

constexpr std::uint64_t defaultRecursiveMaxDepth {100}
 Default maximum depth of recursive extracting.

constexpr auto minTargetColumns {4}
 Minimum number of columns in the target table.

constexpr auto minLinkedColumns {2}
 Minimum number of columns in the linked target table.

constexpr auto maxContentSize {1073741824}
 Maximum size of database content (= 1 GiB).

constexpr auto maxContentSizeString {"1 GiB"sv}
 Maximum size of database content as string.

constexpr auto httpResponseCodeMin {400}
 Minimum HTTP error code.

constexpr auto httpResponseCodeMax {599}
 Maximum HTTP error code.

constexpr auto httpResponseCodeIgnore {200}
 HTTP response code to be ignored when checking for errors.
 

Constants for MySQL Queries

constexpr auto oneAtOnce {1}
 Process one value at once.

constexpr auto nAtOnce10 {10}
 Process ten values at once.

constexpr auto nAtOnce100 {100}
 Process one hundred values at once.

constexpr auto sqlArg1 {1}
 First argument in a SQL query.

constexpr auto sqlArg2 {2}
 Second argument in a SQL query.

constexpr auto sqlArg3 {3}
 Third argument in a SQL query.

constexpr auto sqlArg4 {4}
 Fourth argument in a SQL query.

constexpr auto sqlArg5 {5}
 Fifth argument in a SQL query.

constexpr auto extractingTableAlias {"a"sv}
 Alias, used in SQL queries, for the extracting table.

constexpr auto targetTableAlias {"b"sv}
 Alias, used in SQL queries, for the target table.

constexpr auto linkedTableAlias {"c"sv}
 Alias, used in SQL queries, for the linked target table.

constexpr auto parsedDataTableAlias {"a"sv}
 Alias, used in SQL queries, for the parsed data table.

constexpr auto crawledDataTableAlias {"b"sv}
 Alias, used in SQL queries, for the crawled data table.

constexpr auto urlListTableAlias {"c"sv}
 Alias, used in SQL queries, for the URL list table.

constexpr auto numArgsLockUrl {3}
 Number of arguments to lock one URL.

constexpr auto numArgsAddUpdateData {4}
 Number of arguments to add or update one data entry (without custom columns).

constexpr auto numArgsLinked {2}
 Number of additional arguments when data is linked.

constexpr auto numArgsOverwriteData {3}
 Number of additional arguments when overwriting existing data.

constexpr auto numArgsAddUpdateLinkedData {2}
 Number of arguments to add or update one linked data entry.

constexpr auto numArgsOverwriteLinkedData {2}
 Number of additional arguments when overwriting existing linked data.

constexpr auto numArgsFinishUrl {2}
 Number of arguments to set a URL to finished.
 

Detailed Description

Namespace for extractor classes.

Variable Documentation

◆ crawledDataTableAlias

constexpr auto crawlservpp::Module::Extractor::crawledDataTableAlias {"b"sv}
inline

Alias, used in SQL queries, for the crawled data table.

Referenced by crawlservpp::Module::Extractor::Database::prepare().

◆ crawlerLoggingVerbose

constexpr std::uint8_t crawlservpp::Module::Extractor::crawlerLoggingVerbose {0}
inline

Logging is disabled.

◆ defaultCacheSize

constexpr std::uint64_t crawlservpp::Module::Extractor::defaultCacheSize {2500}
inline

Default cache size.

◆ defaultLockS

constexpr std::uint32_t crawlservpp::Module::Extractor::defaultLockS {300}
inline

Default locking time, in seconds.

◆ defaultMaxBatchSize

constexpr std::uint16_t crawlservpp::Module::Extractor::defaultMaxBatchSize {500}
inline

Default number of URLs and results to be processed in one MySQL query.
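
To illustrate what this limit means for the number of queries, the following sketch computes how many batches a given number of entries would need. The helper numberOfBatches is hypothetical and not part of crawlserv++; only the constant's name and value are taken from this page.

```cpp
#include <cstdint>

// Value copied from the documentation above.
constexpr std::uint16_t defaultMaxBatchSize{500};

// Hypothetical helper: number of queries needed to process `total` entries
// when each query handles at most `batchSize` of them (rounded up).
constexpr std::uint64_t numberOfBatches(std::uint64_t total, std::uint64_t batchSize) {
    return batchSize == 0 ? 0 : (total + batchSize - 1) / batchSize;
}
```

With the default of 500, processing 1,200 entries would take three queries under this assumption.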

◆ defaultPagingVariable

constexpr auto crawlservpp::Module::Extractor::defaultPagingVariable {"$p"sv}
inline

Default name of the paging variable.

To be used in Extractor::Config::Entries::sourceUrl, Extractor::Config::Entries::sourceCookies, and Extractor::Config::Entries::sourceHeaders. It will be replaced with either the number or the name of the current page.
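
The substitution described above could look roughly like the following sketch. The helper replacePagingVariable and the example URL in the usage note are assumptions made for illustration; the actual replacement logic of the extractor is not shown on this page.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

using namespace std::string_view_literals;

// Value copied from the documentation above.
constexpr auto defaultPagingVariable{"$p"sv};

// Hypothetical helper: returns a copy of `source` in which every occurrence
// of `variable` has been replaced with `page` (a page number or page name).
inline std::string replacePagingVariable(
        std::string source,
        std::string_view variable,
        std::string_view page
) {
    if(variable.empty()) {
        return source; // nothing to replace
    }

    std::size_t pos{0};

    while((pos = source.find(variable, pos)) != std::string::npos) {
        source.replace(pos, variable.length(), page);

        pos += page.length(); // continue searching after the inserted text
    }

    return source;
}
```

For example, calling replacePagingVariable("https://example.com/list?page=$p", defaultPagingVariable, "3") yields the string with "page=3" at its end.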

◆ defaultRecursiveMaxDepth

constexpr std::uint64_t crawlservpp::Module::Extractor::defaultRecursiveMaxDepth {100}
inline

Default maximum depth of recursive extracting.

◆ defaultReTries

constexpr std::int64_t crawlservpp::Module::Extractor::defaultReTries {720}
inline

Default re-tries on connection error.

◆ defaultRetryHttpStatusCodes

constexpr std::array crawlservpp::Module::Extractor::defaultRetryHttpStatusCodes {429, 502, 503, 504}
inline

HTTP status codes to retry by default.
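
A minimal sketch of how such a list might be consulted follows. The function isRetryStatus is hypothetical; how the extractor actually evaluates its configured status codes is not documented on this page.

```cpp
#include <array>

// Values copied from the documentation above.
constexpr std::array defaultRetryHttpStatusCodes{429, 502, 503, 504};

// Hypothetical helper: checks whether a request that returned `statusCode`
// should be retried according to the default list.
constexpr bool isRetryStatus(int statusCode) {
    for(const auto code : defaultRetryHttpStatusCodes) {
        if(code == statusCode) {
            return true;
        }
    }

    return false;
}
```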

◆ defaultSleepErrorMs

constexpr std::uint64_t crawlservpp::Module::Extractor::defaultSleepErrorMs {10000}
inline

Default sleeping time on connection errors, in milliseconds.
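
Combined with defaultReTries, this value bounds how long a thread keeps trying. The following arithmetic sketch assumes, hypothetically, that the thread sleeps exactly once before each re-try; the helper maxTotalRetryWait is not part of crawlserv++.

```cpp
#include <chrono>
#include <cstdint>

// Values copied from the documentation above.
constexpr std::int64_t defaultReTries{720};
constexpr std::uint64_t defaultSleepErrorMs{10000};

// Hypothetical helper: total time spent sleeping if every re-try is
// preceded by one sleep of `sleepMs` milliseconds.
constexpr std::chrono::milliseconds maxTotalRetryWait(std::int64_t retries, std::uint64_t sleepMs) {
    if(retries <= 0) {
        // this sketch treats non-positive counts as "no waiting"
        return std::chrono::milliseconds{0};
    }

    return std::chrono::milliseconds{static_cast<std::int64_t>(sleepMs) * retries};
}
```

Under that assumption, the defaults add up to 720 × 10 s = 7,200 s, i.e. two hours.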

◆ defaultSleepHttpMs

constexpr std::uint64_t crawlservpp::Module::Extractor::defaultSleepHttpMs {0}
inline

Default time that will be waited between HTTP requests, in milliseconds.

◆ defaultSleepIdleMs

constexpr std::uint64_t crawlservpp::Module::Extractor::defaultSleepIdleMs {5000}
inline

Default time to wait before checking for new URLs when all URLs have been processed, in milliseconds.

◆ defaultSleepMySqlS

constexpr std::uint64_t crawlservpp::Module::Extractor::defaultSleepMySqlS {60}
inline

Default time to wait before last try to re-connect to MySQL server, in seconds.

◆ expectedSourceContent

constexpr std::uint8_t crawlservpp::Module::Extractor::expectedSourceContent {2}
inline

Extract data from the content of a crawled web page.

◆ expectedSourceExtracting

constexpr std::uint8_t crawlservpp::Module::Extractor::expectedSourceExtracting {0}
inline

Extract data from other extracted data.

◆ expectedSourceParsed

constexpr std::uint8_t crawlservpp::Module::Extractor::expectedSourceParsed {1}
inline

Extract data from parsed data.

◆ extractingTableAlias

constexpr auto crawlservpp::Module::Extractor::extractingTableAlias {"a"sv}
inline

Alias, used in SQL queries, for the extracting table.

Referenced by crawlservpp::Module::Extractor::Database::updateTargetTable().

◆ generalLoggingDefault

constexpr std::uint8_t crawlservpp::Module::Extractor::generalLoggingDefault {1}
inline

Default logging is enabled.

◆ generalLoggingExtended

constexpr std::uint8_t crawlservpp::Module::Extractor::generalLoggingExtended {2}
inline

Extended logging is enabled.

◆ generalLoggingVerbose

constexpr std::uint8_t crawlservpp::Module::Extractor::generalLoggingVerbose {3}
inline

Verbose logging is enabled.

Referenced by crawlservpp::Module::Extractor::Thread::onReset().
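
The logging constants increase with verbosity, which suggests a simple threshold comparison. The sketch below assumes that the levels are cumulative, i.e. that a higher configured level also includes the messages of all lower levels; the helper shouldLog is hypothetical.

```cpp
#include <cstdint>

// Values copied from the documentation above.
constexpr std::uint8_t generalLoggingDefault{1};
constexpr std::uint8_t generalLoggingExtended{2};
constexpr std::uint8_t generalLoggingVerbose{3};

// Hypothetical helper: a message is written if the configured level is at
// least as high as the level required by the message.
constexpr bool shouldLog(std::uint8_t configuredLevel, std::uint8_t messageLevel) {
    return configuredLevel >= messageLevel;
}
```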

◆ httpResponseCodeIgnore

constexpr auto crawlservpp::Module::Extractor::httpResponseCodeIgnore {200}
inline

HTTP response code to be ignored when checking for errors.

Referenced by crawlservpp::Module::Extractor::Thread::onReset().

◆ httpResponseCodeMax

constexpr auto crawlservpp::Module::Extractor::httpResponseCodeMax {599}
inline

Maximum HTTP error code.

Referenced by crawlservpp::Module::Extractor::Thread::onReset().

◆ httpResponseCodeMin

constexpr auto crawlservpp::Module::Extractor::httpResponseCodeMin {400}
inline

Minimum HTTP error code.

Referenced by crawlservpp::Module::Extractor::Thread::onReset().
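
The three HTTP response code constants describe a range of error codes plus one code that is never treated as an error. A minimal sketch of such a check follows, assuming an inclusive range; the function isHttpError is hypothetical and does not reflect the actual implementation in Thread::onReset().

```cpp
// Values copied from the documentation above.
constexpr auto httpResponseCodeMin{400};
constexpr auto httpResponseCodeMax{599};
constexpr auto httpResponseCodeIgnore{200};

// Hypothetical helper: a response counts as an error if its code lies in
// [httpResponseCodeMin, httpResponseCodeMax] and is not the ignored code.
constexpr bool isHttpError(int code) {
    return code != httpResponseCodeIgnore
            && code >= httpResponseCodeMin
            && code <= httpResponseCodeMax;
}
```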

◆ linkedTableAlias

constexpr auto crawlservpp::Module::Extractor::linkedTableAlias {"c"sv}
inline

Alias, used in SQL queries, for the linked target table.

Referenced by crawlservpp::Module::Extractor::Database::updateTargetTable().

◆ maxContentSize

constexpr auto crawlservpp::Module::Extractor::maxContentSize {1073741824}
inline

Maximum size of database content (= 1 GiB).

Referenced by crawlservpp::Module::Extractor::Database::updateTargetTable().

◆ maxContentSizeString

constexpr auto crawlservpp::Module::Extractor::maxContentSizeString {"1 GiB"sv}
inline

Maximum size of database content as string.

Referenced by crawlservpp::Module::Extractor::Database::updateTargetTable().
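
The numeric limit and its string representation describe the same value: 1 GiB is 1024³ bytes. The following sketch verifies this at compile time and shows a possible size check; the helper exceedsMaxContentSize is hypothetical.

```cpp
#include <cstdint>

// Value copied from the documentation above.
constexpr auto maxContentSize{1073741824};

// 1 GiB = 1024 * 1024 * 1024 bytes.
static_assert(maxContentSize == 1024 * 1024 * 1024, "1 GiB is 1024^3 bytes");

// Hypothetical helper: checks whether content of the given size (in bytes)
// is larger than the maximum size of database content.
constexpr bool exceedsMaxContentSize(std::uint64_t bytes) {
    return bytes > static_cast<std::uint64_t>(maxContentSize);
}
```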

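◆ minLinkedColumns

constexpr auto crawlservpp::Module::Extractor::minLinkedColumns {2}
inline

Minimum number of columns in the linked target table.

◆ minTargetColumns

constexpr auto crawlservpp::Module::Extractor::minTargetColumns {4}
inline

Minimum number of columns in the target table.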

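◆ nAtOnce10

constexpr auto crawlservpp::Module::Extractor::nAtOnce10 {10}
inline

Process ten values at once.

◆ nAtOnce100

constexpr auto crawlservpp::Module::Extractor::nAtOnce100 {100}
inline

Process one hundred values at once.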

◆ numArgsAddUpdateData

constexpr auto crawlservpp::Module::Extractor::numArgsAddUpdateData {4}
inline

Number of arguments to add or update one data entry (without custom columns).

Referenced by crawlservpp::Module::Extractor::Database::updateOrAddEntries().

◆ numArgsAddUpdateLinkedData

constexpr auto crawlservpp::Module::Extractor::numArgsAddUpdateLinkedData {2}
inline

Number of arguments to add or update one linked data entry.

Referenced by crawlservpp::Module::Extractor::Database::updateOrAddLinked().

◆ numArgsFinishUrl

constexpr auto crawlservpp::Module::Extractor::numArgsFinishUrl {2}
inline

Number of arguments to set a URL to finished.

Referenced by crawlservpp::Module::Extractor::Database::setUrlsFinishedIfLockOk().

◆ numArgsLinked

constexpr auto crawlservpp::Module::Extractor::numArgsLinked {2}
inline

Number of additional arguments when data is linked.

Referenced by crawlservpp::Module::Extractor::Database::updateOrAddEntries().

◆ numArgsLockUrl

constexpr auto crawlservpp::Module::Extractor::numArgsLockUrl {3}
inline

Number of arguments to lock one URL.

Referenced by crawlservpp::Module::Extractor::Database::fetchUrls().

◆ numArgsOverwriteData

constexpr auto crawlservpp::Module::Extractor::numArgsOverwriteData {3}
inline

Number of additional arguments when overwriting existing data.

Referenced by crawlservpp::Module::Extractor::Database::updateOrAddEntries().
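
The descriptions of numArgsAddUpdateData, numArgsLinked, and numArgsOverwriteData suggest that the number of SQL arguments per entry is a base count plus optional additions. The sketch below is one possible reading; the helper numArgsPerEntry and the assumption that each custom column contributes exactly one argument are not taken from the source.

```cpp
#include <cstddef>

// Values copied from the documentation above.
constexpr auto numArgsAddUpdateData{4};
constexpr auto numArgsLinked{2};
constexpr auto numArgsOverwriteData{3};

// Hypothetical helper: number of SQL arguments needed for one data entry.
// Assumes one additional argument per custom column.
constexpr std::size_t numArgsPerEntry(bool linked, bool overwrite, std::size_t customColumns) {
    std::size_t result{numArgsAddUpdateData + customColumns};

    if(linked) {
        result += numArgsLinked;
    }

    if(overwrite) {
        result += numArgsOverwriteData;
    }

    return result;
}
```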

◆ numArgsOverwriteLinkedData

constexpr auto crawlservpp::Module::Extractor::numArgsOverwriteLinkedData {2}
inline

Number of additional arguments when overwriting existing linked data.

Referenced by crawlservpp::Module::Extractor::Database::updateOrAddLinked().

◆ oneAtOnce

constexpr auto crawlservpp::Module::Extractor::oneAtOnce {1}
inline

Process one value at once.

Referenced by crawlservpp::Module::Extractor::Database::prepare().
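
The constants oneAtOnce, nAtOnce10, and nAtOnce100 suggest queries that handle fixed numbers of values. As an illustration only, the following sketch splits an arbitrary number of values into groups of one hundred, ten, and one. Both the struct QueryPlan and the function planQueries are hypothetical, and this page does not confirm that the database class combines the constants in this way.

```cpp
#include <cstddef>

// Values copied from the documentation above.
constexpr auto oneAtOnce{1};
constexpr auto nAtOnce10{10};
constexpr auto nAtOnce100{100};

// Hypothetical result type: how often each kind of query would be executed.
struct QueryPlan {
    std::size_t hundreds; // queries processing one hundred values at once
    std::size_t tens;     // queries processing ten values at once
    std::size_t ones;     // queries processing one value at once
};

// Hypothetical helper: greedy split of `values` into the largest groups first.
constexpr QueryPlan planQueries(std::size_t values) {
    return QueryPlan{
        values / nAtOnce100,
        (values % nAtOnce100) / nAtOnce10,
        (values % nAtOnce10) / oneAtOnce
    };
}
```

With this split, 237 values would be handled by two queries of one hundred, three queries of ten, and seven single-value queries.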

◆ parsedDataTableAlias

constexpr auto crawlservpp::Module::Extractor::parsedDataTableAlias {"a"sv}
inline

Alias, used in SQL queries, for the parsed data table.

Referenced by crawlservpp::Module::Extractor::Database::prepare().

◆ protocolsToRemove

constexpr std::array crawlservpp::Module::Extractor::protocolsToRemove {"http://"sv, "https://"sv}
inline

Protocols to remove from URLs.

Referenced by crawlservpp::Module::Extractor::Config::reset().
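
Removing one of these protocols from the beginning of a URL could be sketched as follows. The function removeProtocol is hypothetical; what Config::reset() actually does with the list is not shown on this page.

```cpp
#include <array>
#include <string_view>

using namespace std::string_view_literals;

// Values copied from the documentation above.
constexpr std::array protocolsToRemove{"http://"sv, "https://"sv};

// Hypothetical helper: returns `url` without a leading "http://" or
// "https://"; URLs without such a prefix are returned unchanged.
constexpr std::string_view removeProtocol(std::string_view url) {
    for(const auto protocol : protocolsToRemove) {
        if(url.substr(0, protocol.length()) == protocol) {
            return url.substr(protocol.length());
        }
    }

    return url;
}
```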

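◆ sqlArg1

constexpr auto crawlservpp::Module::Extractor::sqlArg1 {1}
inline

First argument in a SQL query.

◆ sqlArg2

constexpr auto crawlservpp::Module::Extractor::sqlArg2 {2}
inline

Second argument in a SQL query.

◆ sqlArg3

constexpr auto crawlservpp::Module::Extractor::sqlArg3 {3}
inline

Third argument in a SQL query.

◆ sqlArg4

constexpr auto crawlservpp::Module::Extractor::sqlArg4 {4}
inline

Fourth argument in a SQL query.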

◆ sqlArg5

constexpr auto crawlservpp::Module::Extractor::sqlArg5 {5}
inline

Fifth argument in a SQL query.

◆ targetTableAlias

constexpr auto crawlservpp::Module::Extractor::targetTableAlias {"b"sv}
inline

Alias, used in SQL queries, for the target table.

Referenced by crawlservpp::Module::Extractor::Database::updateTargetTable().

◆ urlListTableAlias

constexpr auto crawlservpp::Module::Extractor::urlListTableAlias {"c"sv}
inline

Alias, used in SQL queries, for the URL list table.

Referenced by crawlservpp::Module::Extractor::Database::prepare().
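
Table aliases such as parsedDataTableAlias, crawledDataTableAlias, and urlListTableAlias let a query that joins several tables refer to their columns unambiguously. The following sketch shows how an alias might be combined with a column name. The helper qualify and the column name "id" are assumptions for illustration and do not reflect the actual queries built in Database::prepare().

```cpp
#include <string>
#include <string_view>

using namespace std::string_view_literals;

// Values copied from the documentation above.
constexpr auto parsedDataTableAlias{"a"sv};
constexpr auto crawledDataTableAlias{"b"sv};
constexpr auto urlListTableAlias{"c"sv};

// Hypothetical helper: qualifies a column name with a table alias,
// producing a string of the form "<alias>.<column>".
inline std::string qualify(std::string_view alias, std::string_view column) {
    std::string result{alias};

    result += '.';
    result += column;

    return result;
}
```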

◆ variablesSourcesContent

constexpr std::uint8_t crawlservpp::Module::Extractor::variablesSourcesContent {1}
inline

Extract variable value from the content of a crawled web page.

Referenced by crawlservpp::Module::Extractor::Thread::onReset().

◆ variablesSourcesParsed

constexpr std::uint8_t crawlservpp::Module::Extractor::variablesSourcesParsed {0}
inline

Extract variable value from parsed data.

Referenced by crawlservpp::Module::Extractor::Thread::onReset().

◆ variablesSourcesUrl

constexpr std::uint8_t crawlservpp::Module::Extractor::variablesSourcesUrl {2}
inline

Extract variable value from the URL of a crawled web page.

Referenced by crawlservpp::Module::Extractor::Thread::onReset().
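
The three variablesSources constants form a small enumeration of where a variable's value is taken from. A minimal sketch of dispatching on such a value follows; the function describeVariableSource and its fallback string are hypothetical.

```cpp
#include <cstdint>
#include <string_view>

// Values copied from the documentation above.
constexpr std::uint8_t variablesSourcesParsed{0};
constexpr std::uint8_t variablesSourcesContent{1};
constexpr std::uint8_t variablesSourcesUrl{2};

// Hypothetical helper: returns a human-readable description of the source
// from which the value of a variable is extracted.
constexpr std::string_view describeVariableSource(std::uint8_t source) {
    switch(source) {
    case variablesSourcesParsed:
        return "parsed data";

    case variablesSourcesContent:
        return "content of the crawled web page";

    case variablesSourcesUrl:
        return "URL of the crawled web page";

    default:
        return "unknown source";
    }
}
```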