crawlserv++  [under development]
Application for crawling and analyzing textual content of websites.
crawlservpp::Module::Extractor::Config::Entries Struct Reference

Configuration entries for extractor threads. More...

#include <Config.hpp>

Extractor Configuration

std::uint64_t generalCacheSize {defaultCacheSize}
 Number of URLs to fetch and extract data from before saving the results. More...
 
bool generalExtractCustom {false}
 Specifies whether to include custom URLs when extracting. More...
 
std::uint32_t generalLock {defaultLockS}
 URL locking time, in seconds. More...
 
std::uint8_t generalLogging {generalLoggingDefault}
 Level of logging activity. More...
 
std::uint16_t generalMaxBatchSize {defaultMaxBatchSize}
 Maximum number of URLs and results processed in one MySQL query. More...
 
bool generalMinimizeMemory {false}
 Specifies whether to free small amounts of unused memory more often, at the expense of performance. More...
 
bool generalReExtract {false}
 Specifies whether to re-extract data from already processed URLs. More...
 
std::string generalTargetTable
 Name of table to save extracted data to. More...
 
std::int64_t generalReTries {defaultReTries}
 Number of re-tries on connection errors. More...
 
std::vector< std::uint32_t > generalRetryHttp
 HTTP errors that will be handled like connection errors. More...
 
std::uint64_t generalSleepError {defaultSleepErrorMs}
 Sleeping time on connection errors, in milliseconds. More...
 
std::uint64_t generalSleepHttp {defaultSleepHttpMs}
 Time to wait between HTTP requests, in milliseconds. More...
 
std::uint64_t generalSleepIdle {defaultSleepIdleMs}
 Time to wait before checking for new URLs when all URLs have been processed, in milliseconds. More...
 
std::uint64_t generalSleepMySql {defaultSleepMySqlS}
 Time to wait before the last attempt to reconnect to the MySQL server, in seconds. More...
 
std::uint32_t generalTidyErrors {}
 Number of tidyhtml errors to write to the log. More...
 
bool generalTidyWarnings {false}
 Specifies whether to write tidyhtml warnings to the log. More...
 
bool generalTiming {false}
 Specifies whether to calculate timing statistics for the extractor. More...
 

Variables

std::vector< std::string > variablesAlias
 Alias for the variable with same array index. More...
 
std::vector< std::int64_t > variablesAliasAdd
 Value to add to the variable alias with the same array index. More...
 
std::vector< std::string > variablesDateTimeFormat
 Date/time format to be used for the variable with the same array index. More...
 
std::vector< std::string > variablesDateTimeLocale
 Date/time locale to be used for the variable with the same array index. More...
 
std::vector< std::uint64_t > variablesSkipQuery
 Queries to be used on the value of the variable with the same array index to determine whether to skip the current URL. More...
 
std::vector< std::string > variablesName
 Variable names. More...
 
std::vector< std::string > variablesParsedColumn
 Parsed column for the value of the variable with the same array index. More...
 
std::vector< std::string > variablesParsedTable
 Name of the table containing the parsed data for the variable with the same array index. More...
 
std::vector< std::uint64_t > variablesQuery
 Query on the content or URL for the variable with the same array index. More...
 
std::vector< std::uint8_t > variablesSource
 Source of the variable with the same array index. More...
 
std::vector< std::string > variablesTokens
 List of token variables. More...
 
std::vector< std::string > variablesTokensCookies
 Custom HTTP Cookie header for the token variable with the same array index. More...
 
std::vector< std::uint64_t > variablesTokensQuery
 Query to extract token variable with the same array index. More...
 
std::vector< std::string > variablesTokensSource
 Source URL for the token variable with the same array index. More...
 
std::vector< bool > variablesTokensUsePost
 Specifies whether to use HTTP POST instead of GET for the token variable with the same array index. More...
 
std::vector< std::string > variablesTokenHeaders
 Custom HTTP headers to be used for ALL token variables. More...
 

Paging

std::string pagingAlias
 Alias for the paging variable. More...
 
std::int64_t pagingAliasAdd {}
 Value to add to the alias for the paging variable. More...
 
std::int64_t pagingFirst {}
 Number of the first page. More...
 
std::string pagingFirstString
 Name of the first page. More...
 
std::uint64_t pagingIsNextFrom {}
 Query on page content to determine whether there is another page. More...
 
std::uint64_t pagingNextFrom {}
 Query on page content to find the number(s) or name(s) of additional pages. More...
 
std::uint64_t pagingNumberFrom {}
 Query to determine the total number of pages from the content of the first page. More...
 
std::int64_t pagingStep {1}
 Number to add to page variable for retrieving the next page, if a page number is used. More...
 
std::string pagingVariable {defaultPagingVariable}
 Name of the paging variable. More...
 

Source

std::string sourceCookies
 Custom HTTP Cookie header used when retrieving data. More...
 
std::vector< std::string > sourceHeaders
 Custom HTTP headers used when retrieving data. More...
 
std::string sourceUrl
 URL to retrieve data from. More...
 
std::string sourceUrlFirst
 URL of the first page to retrieve data from. More...
 
bool sourceUsePost {false}
 Specifies whether to use HTTP POST instead of HTTP GET for extracting data. More...
 

Extracting

std::vector< std::uint64_t > extractingDatasetQueries
 Queries to extract datasets. More...
 
std::vector< std::string > extractingDateTimeFormats
 Format of date/time to be extracted by the date/time query with the same array index. More...
 
std::vector< std::string > extractingDateTimeLocales
 Locale used by the date/time query with the same array index for extracting date and time. More...
 
std::vector< std::uint64_t > extractingDateTimeQueries
 Queries used for extracting date/time from the dataset. More...
 
std::vector< std::uint64_t > extractingErrorFail
 Queries to detect fatal errors in the data. More...
 
std::vector< std::uint64_t > extractingErrorRetry
 Queries to detect temporary errors in the data. More...
 
std::vector< std::string > extractingFieldDateTimeFormats
 Date/time format of the field with the same array index. More...
 
std::vector< std::string > extractingFieldDateTimeLocales
 Locale used when converting the field with the same array index to a date/time. More...
 
std::vector< char > extractingFieldDelimiters
 Delimiter between multiple results for the field with the same array index, if not saved as JSON. More...
 
std::vector< bool > extractingFieldIgnoreEmpty
 Specifies whether to ignore empty values when parsing multiple results for the field with the same array index. More...
 
std::vector< bool > extractingFieldJSON
 Save the value of the field with the same array index as a JSON array. More...
 
std::vector< std::string > extractingFieldNames
 The names of the custom fields to extract. More...
 
std::vector< std::uint64_t > extractingFieldQueries
 The query used to extract the custom field with the same array index from the data. More...
 
std::vector< bool > extractingFieldTidyTexts
 Specifies whether to remove line breaks and unnecessary whitespaces when extracting the field with the same array index. More...
 
std::vector< bool > extractingFieldWarningsEmpty
 Specifies whether to write a warning to the log when the field with the same array index is empty. More...
 
std::vector< std::string > extractingIdIgnore
 Extracted IDs to be ignored. More...
 
std::vector< std::uint64_t > extractingIdQueries
 Queries to extract the ID from the dataset. More...
 
bool extractingOverwrite {true}
 Specifies whether to overwrite an existing dataset that has the same ID. More...
 
std::vector< std::uint64_t > extractingRecursive
 Queries for extracting more datasets from a dataset. More...
 
std::uint64_t extractingRecursiveMaxDepth {defaultRecursiveMaxDepth}
 Maximum depth of recursive extracting. More...
 
bool extractingRemoveDuplicates {true}
 Specifies whether to remove duplicate datasets over multiple pages before checking the expected number of datasets. More...
 
bool extractingRepairCData {true}
 Specifies whether to (try to) repair CData when parsing HTML/XML. More...
 
bool extractingRepairComments {true}
 Specifies whether to (try to) repair broken HTML/XML comments. More...
 
bool extractingRemoveXmlInstructions {true}
 Specifies whether to remove XML processing instructions (<?xml:...>) before parsing HTML content. More...
 
std::uint64_t extractingSkipQuery {}
 Extracting will proceed to the next URL if the current page fulfills this query. More...
 

Linked Data

std::vector< std::uint64_t > linkedDatasetQueries
 Queries to extract linked datasets. More...
 
std::vector< std::string > linkedDateTimeFormats
 Date/time format of the linked field with the same array index. More...
 
std::vector< std::string > linkedDateTimeLocales
 Date/time locale of the linked field with the same array index. More...
 
std::vector< char > linkedDelimiters
 Delimiter between multiple results for the field with the same array index, if not saved as JSON. More...
 
std::vector< std::string > linkedFieldNames
 Names of the linked data fields. More...
 
std::vector< std::uint64_t > linkedFieldQueries
 Query used to extract the custom field with the same array index from the dataset. More...
 
std::vector< std::string > linkedIdIgnore
 IDs of linked data to be ignored. More...
 
std::vector< std::uint64_t > linkedIdQueries
 Queries to extract the linked ID from the dataset. More...
 
std::vector< bool > linkedIgnoreEmpty
 Specifies whether to ignore empty values when parsing multiple results for the field with the same array index. More...
 
std::vector< bool > linkedJSON
 Specifies whether to save the value of the field with the same array index as a JSON array. More...
 
std::string linkedLink
 Name of the extracted field that links an extracted dataset to the ID of a linked dataset. More...
 
bool linkedOverwrite {true}
 Specifies whether to overwrite an existing linked dataset that has the same ID. More...
 
std::string linkedTargetTable
 Name of the table to save linked data to. More...
 
std::vector< bool > linkedTidyTexts
 Specifies whether to remove line breaks and unnecessary whitespaces when extracting the linked field with the same array index. More...
 
std::vector< bool > linkedWarningsEmpty
 Specifies whether to write a warning to the log when the field with the same array index is empty. More...
 

Expected Number of Results

bool expectedErrorIfLarger {false}
 Specifies whether to throw an exception when the number of expected datasets is exceeded. More...
 
bool expectedErrorIfSmaller {false}
 Specifies whether to throw an exception when the number of datasets falls short of the expected number. More...
 
std::string expectedParsedColumn
 Parsed column containing the expected number of datasets. More...
 
std::string expectedParsedTable
 Name of the table containing the expected number of datasets. More...
 
std::uint64_t expectedQuery {}
 Query to be performed to retrieve the expected number of datasets. More...
 
std::uint8_t expectedSource {expectedSourceExtracting}
 Source of the query to retrieve the expected number of datasets. More...
 

Detailed Description

Configuration entries for extractor threads.

Warning
Changing the configuration requires updating json/extractor.json in crawlserv_frontend!

Member Data Documentation

◆ expectedErrorIfLarger

bool crawlservpp::Module::Extractor::Config::Entries::expectedErrorIfLarger {false}

Specifies whether to throw an exception when the number of expected datasets is exceeded.

Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().

◆ expectedErrorIfSmaller

bool crawlservpp::Module::Extractor::Config::Entries::expectedErrorIfSmaller {false}

Specifies whether to throw an exception when the number of datasets falls short of the expected number.

Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
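
The interplay of expectedErrorIfLarger and expectedErrorIfSmaller can be pictured with the following minimal sketch. The function checkExpected and its parameters are hypothetical and not part of crawlserv++; how the extractor actually reports the mismatch is not documented on this page.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of crawlserv++): compares the number of
// extracted datasets with the expected number and throws only if the
// corresponding option is enabled.
void checkExpected(
        std::size_t extracted,
        std::size_t expected,
        bool errorIfLarger,
        bool errorIfSmaller
) {
    if(errorIfLarger && extracted > expected) {
        throw std::runtime_error(
                "extracted " + std::to_string(extracted)
                + " datasets, expected only " + std::to_string(expected)
        );
    }

    if(errorIfSmaller && extracted < expected) {
        throw std::runtime_error(
                "extracted only " + std::to_string(extracted)
                + " datasets, expected " + std::to_string(expected)
        );
    }
}
```

With both options disabled (the defaults shown above), a mismatch never throws in this sketch.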

◆ expectedParsedColumn

std::string crawlservpp::Module::Extractor::Config::Entries::expectedParsedColumn

Parsed column containing the expected number of datasets.

Note
Will only be used if parsed data is the source of the expected number of datasets.
See also
expectedSource

Referenced by crawlservpp::Module::Extractor::Config::parseOption().

◆ expectedParsedTable

std::string crawlservpp::Module::Extractor::Config::Entries::expectedParsedTable

Name of the table containing the expected number of datasets.

Note
Will only be used if parsed data is the source of the expected number of datasets.
See also
expectedSource

Referenced by crawlservpp::Module::Extractor::Config::parseOption().

◆ expectedQuery

std::uint64_t crawlservpp::Module::Extractor::Config::Entries::expectedQuery {}

Query to be performed to retrieve the expected number of datasets.

Note
Will only be used if the content or the URL is the source of the expected number of datasets.
See also
expectedSource

Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().

◆ expectedSource

std::uint8_t crawlservpp::Module::Extractor::Config::Entries::expectedSource {expectedSourceExtracting}

Source of the query to retrieve the expected number of datasets.

◆ extractingDatasetQueries

std::vector<std::uint64_t> crawlservpp::Module::Extractor::Config::Entries::extractingDatasetQueries

Queries to extract datasets.

The first query that returns a non-empty result will be used.

Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
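
The "first non-empty result wins" rule can be sketched as follows. This is an illustration only: the option itself stores query IDs, whereas the sketch uses callables (the hypothetical Query alias) as stand-ins for already resolved queries, and firstNonEmptyResult is not a crawlserv++ function.

```cpp
#include <functional>
#include <string>
#include <vector>

// Stand-in for a resolved query: takes the content, returns its results.
// (Assumption: the real option only holds the IDs of such queries.)
using Query = std::function<std::vector<std::string>(const std::string&)>;

// Runs the queries in order and returns the first non-empty result.
// Returns an empty vector if no query yields anything.
std::vector<std::string> firstNonEmptyResult(
        const std::vector<Query>& queries,
        const std::string& content
) {
    for(const auto& query : queries) {
        auto result = query(content);

        if(!result.empty()) {
            return result;
        }
    }

    return {};
}
```

Later queries in the list are only evaluated when all earlier ones returned nothing, so the order of the IDs acts as a priority list.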

◆ extractingDateTimeFormats

std::vector<std::string> crawlservpp::Module::Extractor::Config::Entries::extractingDateTimeFormats

Format of date/time to be extracted by the date/time query with the same array index.

If not specified, the format %F %T, i.e. YYYY-MM-DD HH:MM:SS, will be used.

See Howard E. Hinnant's C++ date.h library documentation for details.

Set a string to UNIX to parse Unix timestamps, i.e. seconds since the Unix epoch, instead.

See also
extractingDateTimeQueries, extractingDateTimeLocales, Helper::DateTime::convertCustomDateTimeToSQLTimeStamp

Referenced by crawlservpp::Module::Extractor::Config::checkOptions(), crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
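
To illustrate the two kinds of format values (a format string or the keyword UNIX), here is a minimal sketch. It deliberately swaps in std::get_time from the C++ standard library instead of the date.h library named above, so it does not reproduce the behaviour of Helper::DateTime::convertCustomDateTimeToSQLTimeStamp. The function name toSqlTimeStamp is hypothetical, and the default format is spelled out as %Y-%m-%d %H:%M:%S in the usage rather than relying on %F %T.

```cpp
#include <ctime>
#include <exception>
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: converts a value to "YYYY-MM-DD HH:MM:SS".
// Returns an empty string if the value cannot be parsed.
std::string toSqlTimeStamp(const std::string& value, const std::string& format) {
    std::tm tm{};

    if(format == "UNIX") {
        // interpret the value as seconds since the Unix epoch (UTC)
        std::time_t seconds{};

        try {
            seconds = static_cast<std::time_t>(std::stoll(value));
        }
        catch(const std::exception&) {
            return "";
        }

        const std::tm* utc = std::gmtime(&seconds);

        if(utc == nullptr) {
            return "";
        }

        tm = *utc;
    }
    else {
        // parse the value according to the given format string
        std::istringstream in(value);

        in >> std::get_time(&tm, format.c_str());

        if(in.fail()) {
            return "";
        }
    }

    std::ostringstream out;

    out << std::put_time(&tm, "%Y-%m-%d %H:%M:%S");

    return out.str();
}
```

For example, toSqlTimeStamp("86400", "UNIX") takes the Unix timestamp branch, while any other format string is passed to the parser unchanged.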

◆ extractingDateTimeLocales

std::vector<std::string> crawlservpp::Module::Extractor::Config::Entries::extractingDateTimeLocales

Locale used by the date/time query with the same array index for extracting date and time.

◆ extractingDateTimeQueries

std::vector<std::uint64_t> crawlservpp::Module::Extractor::Config::Entries::extractingDateTimeQueries

Queries used for extracting date/time from the dataset.

The first query that returns a non-empty result will be used.

See also
extractingDateTimeFormats, extractingDateTimeLocales

Referenced by crawlservpp::Module::Extractor::Config::checkOptions(), crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().

◆ extractingErrorFail

std::vector<std::uint64_t> crawlservpp::Module::Extractor::Config::Entries::extractingErrorFail

Queries to detect fatal errors in the data.

The extraction will fail if any of these queries returns true.

Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().

◆ extractingErrorRetry

std::vector<std::uint64_t> crawlservpp::Module::Extractor::Config::Entries::extractingErrorRetry

Queries to detect temporary errors in the data.

The extraction will be retried as long as any of these queries returns true.

Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
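
The difference between extractingErrorFail and extractingErrorRetry can be sketched as a three-way classification of the retrieved content. Everything named here (ContentStatus, ErrorQuery, checkContent) is hypothetical, callables stand in for the configured query IDs, and checking the fatal queries before the temporary ones is an assumption of this sketch, not documented behaviour.

```cpp
#include <algorithm>
#include <functional>
#include <string>
#include <vector>

enum class ContentStatus { ok, retry, fail };

// Stand-in for a resolved boolean query on the content.
using ErrorQuery = std::function<bool(const std::string&)>;

// Classifies the content: any matching fatal query fails the extraction,
// otherwise any matching temporary query requests a retry.
// (Assumption: fatal errors take precedence over temporary ones.)
ContentStatus checkContent(
        const std::string& content,
        const std::vector<ErrorQuery>& errorFail,
        const std::vector<ErrorQuery>& errorRetry
) {
    const auto matches = [&content](const ErrorQuery& query) {
        return query(content);
    };

    if(std::any_of(errorFail.begin(), errorFail.end(), matches)) {
        return ContentStatus::fail;
    }

    if(std::any_of(errorRetry.begin(), errorRetry.end(), matches)) {
        return ContentStatus::retry;
    }

    return ContentStatus::ok;
}
```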

◆ extractingFieldDateTimeFormats

std::vector<std::string> crawlservpp::Module::Extractor::Config::Entries::extractingFieldDateTimeFormats

Date/time format of the field with the same array index.

If empty, no date/time conversion will be performed.

See Howard E. Hinnant's C++ date.h library documentation for details.

Set a string to UNIX to parse Unix timestamps, i.e. seconds since the Unix epoch, instead.

See also
extractingFieldNames, extractingFieldDateTimeLocales, Helper::DateTime::convertCustomDateTimeToSQLTimeStamp

Referenced by crawlservpp::Module::Extractor::Config::checkOptions(), crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().

◆ extractingFieldDateTimeLocales

std::vector<std::string> crawlservpp::Module::Extractor::Config::Entries::extractingFieldDateTimeLocales

Locale used when converting the field with the same array index to a date/time.

Will be ignored if no date/time format has been specified for the field.

See also
extractingFieldNames, extractingFieldDateTimeFormats, Helper::DateTime::convertCustomDateTimeToSQLTimeStamp

Referenced by crawlservpp::Module::Extractor::Config::checkOptions(), crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().

◆ extractingFieldDelimiters

std::vector<char> crawlservpp::Module::Extractor::Config::Entries::extractingFieldDelimiters

Delimiter between multiple results for the field with the same array index, if not saved as JSON.

Only the first character of the string is used, unless the string is one of the escape sequences \n (default), \t, or \\.

See also
extractingFieldNames

Referenced by crawlservpp::Module::Extractor::Config::checkOptions(), crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
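
One way to read the delimiter rule is sketched below. The function parseDelimiter is hypothetical, and the sketch assumes that the option arrives as a string in which the escape sequences are written literally (a backslash followed by n, t, or another backslash) and that an empty string selects the default.

```cpp
#include <string>

// Hypothetical helper: turns a delimiter option string into one character.
// Assumption: escape sequences arrive as two literal characters,
// e.g. a backslash followed by 'n'.
char parseDelimiter(const std::string& option) {
    if(option.empty()) {
        return '\n'; // default delimiter
    }

    if(option.size() > 1 && option[0] == '\\') {
        switch(option[1]) {
        case 'n':
            return '\n';

        case 't':
            return '\t';

        case '\\':
            return '\\';

        default:
            break;
        }
    }

    // otherwise only the first character counts
    return option[0];
}
```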

◆ extractingFieldIgnoreEmpty

std::vector<bool> crawlservpp::Module::Extractor::Config::Entries::extractingFieldIgnoreEmpty

Specifies whether to ignore empty values when parsing multiple results for the field with the same array index.

Enabled by default.

See also
extractingFieldNames

Referenced by crawlservpp::Module::Extractor::Config::checkOptions(), crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
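
The effect of ignoring empty values when a field holds multiple results can be sketched like this. The function splitValues is hypothetical; it only illustrates splitting at a single-character delimiter and is not taken from the extractor.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: splits a value at the delimiter and optionally
// drops empty parts.
std::vector<std::string> splitValues(
        const std::string& value,
        char delimiter,
        bool ignoreEmpty
) {
    std::vector<std::string> result;
    std::size_t begin = 0;

    while(true) {
        const std::size_t end = value.find(delimiter, begin);
        std::string part = value.substr(
                begin,
                end == std::string::npos ? std::string::npos : end - begin
        );

        if(!ignoreEmpty || !part.empty()) {
            result.push_back(std::move(part));
        }

        if(end == std::string::npos) {
            break;
        }

        begin = end + 1;
    }

    return result;
}
```

With ignoreEmpty disabled, consecutive or trailing delimiters produce empty entries, which would then also end up in the stored result.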

◆ extractingFieldJSON

std::vector<bool> crawlservpp::Module::Extractor::Config::Entries::extractingFieldJSON

Save the value of the field with the same array index as a JSON array.

◆ extractingFieldNames

std::vector<std::string> crawlservpp::Module::Extractor::Config::Entries::extractingFieldNames

The names of the custom fields to extract.

These fields will be extracted from the content of the current page, using the queries specified in Extractor::Config::Entries::extractingFieldQueries.

Field options are matched via the array index in the respective vectors.

If Extractor::Config::Entries::extractingFieldDateTimeFormats contains a non-empty string, a date/time will be parsed for the respective field, using the locale defined in Extractor::Config::Entries::extractingFieldDateTimeLocales.

Multiple values for one field will be detected via the delimiter in Extractor::Config::Entries::extractingFieldDelimiters. Extractor::Config::Entries::extractingFieldIgnoreEmpty determines whether to ignore empty values, and Extractor::Config::Entries::extractingFieldJSON determines whether to store them as a JSON array.

If the value of a field is empty, Extractor::Config::Entries::extractingFieldWarningsEmpty determines whether to write a warning to the log.

Extractor::Config::Entries::extractingFieldTidyTexts specifies whether to tidy up the resulting text before being stored to the respective field.

Referenced by crawlservpp::Module::Extractor::Config::checkOptions(), crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
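
Matching field options "via the array index" means that the vectors are read in parallel. The following sketch gathers them into one record per field. FieldOptions, valueAt, and collectFieldOptions are hypothetical names, and falling back to a default when a vector is shorter than the list of names is an assumption of this sketch (the defaults for the delimiter and for ignoring empty values follow the member descriptions on this page, the others are guesses).

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical per-field view of the parallel option vectors.
struct FieldOptions {
    std::string name;
    std::string dateTimeFormat; // empty: no date/time conversion
    char delimiter{'\n'};
    bool ignoreEmpty{true};
    bool json{false};
};

// Returns the entry at the index, or the fallback if the vector is too short.
template<typename T>
T valueAt(const std::vector<T>& values, std::size_t index, T fallback) {
    return index < values.size() ? values[index] : fallback;
}

// Collects the options that share an array index into one record per field.
std::vector<FieldOptions> collectFieldOptions(
        const std::vector<std::string>& names,
        const std::vector<std::string>& dateTimeFormats,
        const std::vector<char>& delimiters,
        const std::vector<bool>& ignoreEmpty,
        const std::vector<bool>& json
) {
    std::vector<FieldOptions> fields;

    fields.reserve(names.size());

    for(std::size_t i = 0; i < names.size(); ++i) {
        FieldOptions field;

        field.name = names[i];
        field.dateTimeFormat = valueAt(dateTimeFormats, i, std::string{});
        field.delimiter = valueAt(delimiters, i, '\n');
        field.ignoreEmpty = valueAt(ignoreEmpty, i, true);
        field.json = valueAt(json, i, false);

        fields.push_back(field);
    }

    return fields;
}
```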

◆ extractingFieldQueries

std::vector<std::uint64_t> crawlservpp::Module::Extractor::Config::Entries::extractingFieldQueries

The query used to extract the custom field with the same array index from the data.

See also
extractingFieldNames

Referenced by crawlservpp::Module::Extractor::Config::checkOptions(), and crawlservpp::Module::Extractor::Config::parseOption().

◆ extractingFieldTidyTexts

std::vector<bool> crawlservpp::Module::Extractor::Config::Entries::extractingFieldTidyTexts

Specifies whether to remove line breaks and unnecessary whitespaces when extracting the field with the same array index.

See also
extractingFieldNames

Referenced by crawlservpp::Module::Extractor::Config::checkOptions(), crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
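
What "tidying" a text could look like is sketched below. The function tidyText is hypothetical; it collapses every run of whitespace (including line breaks) into a single space and trims both ends, which is one plausible reading of the description and not necessarily what the extractor does in detail.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: replaces runs of whitespace (including line breaks)
// by single spaces and removes leading and trailing whitespace.
std::string tidyText(const std::string& text) {
    std::string result;
    bool pendingSpace = false;

    for(const unsigned char c : text) {
        if(std::isspace(c) != 0) {
            // remember the gap, but never start the result with a space
            pendingSpace = !result.empty();

            continue;
        }

        if(pendingSpace) {
            result.push_back(' ');

            pendingSpace = false;
        }

        result.push_back(static_cast<char>(c));
    }

    return result;
}
```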

◆ extractingFieldWarningsEmpty

std::vector<bool> crawlservpp::Module::Extractor::Config::Entries::extractingFieldWarningsEmpty

Specifies whether to write a warning to the log when the field with the same array index is empty.

Note
Logging needs to be enabled in order for this option to have any effect.
See also
extractingFieldNames

Referenced by crawlservpp::Module::Extractor::Config::checkOptions(), crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().

◆ extractingIdIgnore

std::vector<std::string> crawlservpp::Module::Extractor::Config::Entries::extractingIdIgnore

Extracted IDs to be ignored.

◆ extractingIdQueries

std::vector<std::uint64_t> crawlservpp::Module::Extractor::Config::Entries::extractingIdQueries

Queries to extract the ID from the dataset.

The first query that returns a non-empty result will be used. Datasets with duplicate or empty IDs will not be extracted.

Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
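
The rule that datasets with duplicate or empty IDs are not extracted, together with the ignore list in extractingIdIgnore, can be sketched as a simple filter. The function filterIds is hypothetical, and keeping the first occurrence of a duplicate ID is an assumption of this sketch.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper: returns the IDs that would be kept, in order.
// Empty IDs, ignored IDs, and repeated IDs are dropped.
// (Assumption: the first occurrence of a duplicate ID is the one kept.)
std::vector<std::string> filterIds(
        const std::vector<std::string>& ids,
        const std::vector<std::string>& ignored
) {
    // pre-fill the set with the ignored IDs so they are never accepted
    std::unordered_set<std::string> seen(ignored.begin(), ignored.end());
    std::vector<std::string> kept;

    for(const auto& id : ids) {
        if(id.empty()) {
            continue;
        }

        if(seen.insert(id).second) {
            kept.push_back(id);
        }
    }

    return kept;
}
```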

◆ extractingOverwrite

bool crawlservpp::Module::Extractor::Config::Entries::extractingOverwrite {true}

Specifies whether to overwrite an existing dataset that has the same ID.

Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().

◆ extractingRecursive

std::vector<std::uint64_t> crawlservpp::Module::Extractor::Config::Entries::extractingRecursive

Queries for extracting more datasets from a dataset.

The first query that returns a non-empty result will be used.

Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().

◆ extractingRecursiveMaxDepth

std::uint64_t crawlservpp::Module::Extractor::Config::Entries::extractingRecursiveMaxDepth {defaultRecursiveMaxDepth}

Maximum depth of recursive extracting.

Referenced by crawlservpp::Module::Extractor::Config::parseOption().

◆ extractingRemoveDuplicates

bool crawlservpp::Module::Extractor::Config::Entries::extractingRemoveDuplicates {true}

Specifies whether to remove duplicate datasets over multiple pages before checking the expected number of datasets.

Referenced by crawlservpp::Module::Extractor::Thread::onReset(), crawlservpp::Module::Extractor::Thread::onTick(), and crawlservpp::Module::Extractor::Config::parseOption().

◆ extractingRemoveXmlInstructions

bool crawlservpp::Module::Extractor::Config::Entries::extractingRemoveXmlInstructions {true}

Specifies whether to remove XML processing instructions (<?xml:...>) before parsing HTML content.

Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().

◆ extractingRepairCData

bool crawlservpp::Module::Extractor::Config::Entries::extractingRepairCData {true}

Specifies whether to (try to) repair CData when parsing HTML/XML.

Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().

◆ extractingRepairComments

bool crawlservpp::Module::Extractor::Config::Entries::extractingRepairComments {true}

Specifies whether to (try to) repair broken HTML/XML comments.

Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().

◆ extractingSkipQuery

std::uint64_t crawlservpp::Module::Extractor::Config::Entries::extractingSkipQuery {}

Extracting will proceed to the next URL if the current page fulfills this query.

Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().

◆ generalCacheSize

std::uint64_t crawlservpp::Module::Extractor::Config::Entries::generalCacheSize {defaultCacheSize}

Number of URLs to fetch and extract data from before saving the results.

Set to zero to cache all URLs at once.

Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
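
The special meaning of zero can be illustrated with a small sketch. The function nextCacheFill is hypothetical, and reading "all URLs" as "all URLs that remain to be processed" is an assumption.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper: number of URLs to fetch for the next cache fill.
// A cache size of zero means no limit, i.e. all remaining URLs at once.
std::uint64_t nextCacheFill(std::uint64_t cacheSize, std::uint64_t urlsRemaining) {
    if(cacheSize == 0) {
        return urlsRemaining;
    }

    return std::min(cacheSize, urlsRemaining);
}
```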

◆ generalExtractCustom

bool crawlservpp::Module::Extractor::Config::Entries::generalExtractCustom {false}

Specifies whether to include custom URLs when extracting.

Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().

◆ generalLock

std::uint32_t crawlservpp::Module::Extractor::Config::Entries::generalLock {defaultLockS}

URL locking time, in seconds.

◆ generalLogging

std::uint8_t crawlservpp::Module::Extractor::Config::Entries::generalLogging {generalLoggingDefault}

Level of logging activity.

◆ generalMaxBatchSize

std::uint16_t crawlservpp::Module::Extractor::Config::Entries::generalMaxBatchSize {defaultMaxBatchSize}

Maximum number of URLs and results processed in one MySQL query.

Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().

◆ generalMinimizeMemory

bool crawlservpp::Module::Extractor::Config::Entries::generalMinimizeMemory {false}

Specifies whether to free small amounts of unused memory more often, at the expense of performance.

Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().

◆ generalReExtract

bool crawlservpp::Module::Extractor::Config::Entries::generalReExtract {false}

Specifies whether to re-extract data from already processed URLs.

Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().

◆ generalReTries

std::int64_t crawlservpp::Module::Extractor::Config::Entries::generalReTries {defaultReTries}

Number of re-tries on connection errors.

Set to -1 to retry indefinitely on connection errors.

Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
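
A retry loop that honours the special value -1 can be sketched as follows. The function runWithRetries is hypothetical; the callable stands in for a single connection attempt, and the waiting time between attempts (see generalSleepError) is left out to keep the sketch short.

```cpp
#include <cstdint>
#include <functional>

// Hypothetical helper: calls attempt() until it succeeds or the number of
// re-tries is used up. A negative number of re-tries means "no limit".
// Returns whether an attempt finally succeeded.
bool runWithRetries(const std::function<bool()>& attempt, std::int64_t reTries) {
    std::int64_t retriesDone = 0;

    while(true) {
        if(attempt()) {
            return true;
        }

        if(reTries >= 0 && retriesDone >= reTries) {
            return false; // re-tries exhausted
        }

        ++retriesDone;
    }
}
```

In this sketch, a value of 2 allows up to three attempts in total: the initial one plus two re-tries.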

◆ generalRetryHttp

std::vector<std::uint32_t> crawlservpp::Module::Extractor::Config::Entries::generalRetryHttp
Initial value:

HTTP errors that will be handled like connection errors.

Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().

◆ generalSleepError

std::uint64_t crawlservpp::Module::Extractor::Config::Entries::generalSleepError {defaultSleepErrorMs}

Sleeping time on connection errors, in milliseconds.

Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().

◆ generalSleepHttp

std::uint64_t crawlservpp::Module::Extractor::Config::Entries::generalSleepHttp {defaultSleepHttpMs}

Time to wait between HTTP requests, in milliseconds.

Referenced by crawlservpp::Module::Extractor::Config::parseOption().

◆ generalSleepIdle

std::uint64_t crawlservpp::Module::Extractor::Config::Entries::generalSleepIdle {defaultSleepIdleMs}

Time to wait before checking for new URLs when all URLs have been processed, in milliseconds.

Referenced by crawlservpp::Module::Extractor::Thread::onTick(), and crawlservpp::Module::Extractor::Config::parseOption().

◆ generalSleepMySql

std::uint64_t crawlservpp::Module::Extractor::Config::Entries::generalSleepMySql {defaultSleepMySqlS}

Time to wait before the last attempt to reconnect to the MySQL server, in seconds.

Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().

◆ generalTargetTable

std::string crawlservpp::Module::Extractor::Config::Entries::generalTargetTable

Name of the table to save extracted data to.

◆ generalTidyErrors

std::uint32_t crawlservpp::Module::Extractor::Config::Entries::generalTidyErrors {}

Number of tidyhtml errors to write to the log.

Note
Logging needs to be enabled in order for this option to have any effect.

Referenced by crawlservpp::Module::Extractor::Config::parseOption().

◆ generalTidyWarnings

bool crawlservpp::Module::Extractor::Config::Entries::generalTidyWarnings {false}

Specifies whether to write tidyhtml warnings to the log.

Note
Logging needs to be enabled in order for this option to have any effect.

Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().

◆ generalTiming

bool crawlservpp::Module::Extractor::Config::Entries::generalTiming {false}

Specifies whether to calculate timing statistics for the extractor.

◆ linkedDatasetQueries

std::vector<std::uint64_t> crawlservpp::Module::Extractor::Config::Entries::linkedDatasetQueries

Queries to extract linked datasets.

The first query that returns a non-empty result will be used.

See also
linkedFieldNames

Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().

◆ linkedDateTimeFormats

std::vector<std::string> crawlservpp::Module::Extractor::Config::Entries::linkedDateTimeFormats

Date/time format of the linked field with the same array index.

If empty, no date/time conversion will be performed.

See Howard E. Hinnant's C++ date.h library documentation for details.

Set a string to UNIX to parse Unix timestamps, i.e. seconds since the Unix epoch, instead.

See also
linkedFieldNames, linkedDateTimeLocales, Helper::DateTime::convertCustomDateTimeToSQLTimeStamp

Referenced by crawlservpp::Module::Extractor::Config::checkOptions(), crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().

◆ linkedDateTimeLocales

std::vector<std::string> crawlservpp::Module::Extractor::Config::Entries::linkedDateTimeLocales

Date/time locale of the linked field with the same array index.

Will be ignored if no corresponding date/time format is given.

See also
linkedFieldNames, linkedDateTimeFormats, Helper::DateTime::convertCustomDateTimeToSQLTimeStamp

Referenced by crawlservpp::Module::Extractor::Config::checkOptions(), crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().

◆ linkedDelimiters

std::vector<char> crawlservpp::Module::Extractor::Config::Entries::linkedDelimiters

Delimiter between multiple results for the field with the same array index, if not saved as JSON.

Only the first character of the string is used, unless the string is one of the escape sequences \n (default), \t, or \\.

See also
linkedFieldNames

Referenced by crawlservpp::Module::Extractor::Config::checkOptions(), crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().

◆ linkedFieldNames

std::vector<std::string> crawlservpp::Module::Extractor::Config::Entries::linkedFieldNames

Names of the linked data fields.

Linked data is additionally extracted data that is linked via its ID field to one of the originally extracted data fields, as specified in Extractor::Config::Entries::linkedLink.

The ID field, as well as the additional data fields, will be extracted from the subset retrieved by using the query in Extractor::Config::Entries::linkedDatasetQueries on the content of the current page, using the queries specified in Extractor::Config::Entries::linkedIdQueries for the ID, and Extractor::Config::Entries::linkedFieldQueries for each of the other fields.

Linked data with the IDs specified in Extractor::Config::Entries::linkedIdIgnore will be ignored.

Linked field options are matched via the array index in the respective vectors.

If Extractor::Config::Entries::linkedFieldDateTimeFormats contains a non-empty string, a date/time will be parsed for the respective field, using the locale defined in Extractor::Config::Entries::linkedFieldDateTimeLocale.

Multiple values for one field will be detected via the delimiter in Extractor::Config::Entries::linkedDelimiters. Extractor::Config::Entries::linkedIgnoreEmpty determines whether to ignore empty values, and Extractor::Config::Entries::linkedJSON whether to store them as a JSON array.

If the value of a field is empty, Extractor::Config::Entries::linkedWarningsEmpty determines whether to write a warning to the log.

Extractor::Config::Entries::linkedTidyTexts specifies whether to tidy up the resulting text before it is stored in the respective field.

Referenced by crawlservpp::Module::Extractor::Config::checkOptions(), crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().

◆ linkedFieldQueries

std::vector<std::uint64_t> crawlservpp::Module::Extractor::Config::Entries::linkedFieldQueries

Query used to extract the custom field with the same array index from the dataset.

See also
linkedFieldNames

Referenced by crawlservpp::Module::Extractor::Config::checkOptions(), and crawlservpp::Module::Extractor::Config::parseOption().

◆ linkedIdIgnore

std::vector<std::string> crawlservpp::Module::Extractor::Config::Entries::linkedIdIgnore

◆ linkedIdQueries

std::vector<std::uint64_t> crawlservpp::Module::Extractor::Config::Entries::linkedIdQueries

Queries to extract the linked ID from the dataset.

The first query that returns a non-empty result will be used.

Datasets with duplicate or empty IDs will not be extracted.

See also
linkedFieldNames

Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
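
The selection rules above can be sketched as follows. This is an illustration under assumptions, not the code used by the extractor: firstNonEmpty and acceptId are hypothetical helpers, and the results of the ID queries are assumed to be available as strings in the order of the configured queries.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical helper: return the first non-empty query result,
// or an empty string if every query came back empty.
std::string firstNonEmpty(const std::vector<std::string>& queryResults) {
    for(const auto& result : queryResults) {
        if(!result.empty()) {
            return result;
        }
    }

    return std::string{};
}

// Hypothetical helper: decide whether a dataset with the given ID should be
// extracted. Empty IDs and IDs that have been seen before are rejected;
// accepted IDs are remembered in 'seenIds'.
bool acceptId(const std::string& id, std::set<std::string>& seenIds) {
    if(id.empty()) {
        return false;
    }

    // std::set::insert reports in '.second' whether the element was new
    return seenIds.insert(id).second;
}
```

A caller would keep one set of seen IDs per page (or per run, depending on the intended scope of "duplicate") and skip every dataset for which acceptId returns false.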

◆ linkedIgnoreEmpty

std::vector<bool> crawlservpp::Module::Extractor::Config::Entries::linkedIgnoreEmpty

Specifies whether to ignore empty values when parsing multiple results for the field with the same array index.

Enabled by default.

See also
linkedFieldNames

Referenced by crawlservpp::Module::Extractor::Config::checkOptions(), crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().

◆ linkedJSON

std::vector<bool> crawlservpp::Module::Extractor::Config::Entries::linkedJSON

Specifies whether to save the value of the field with the same array index as a JSON array.

See also
linkedFieldNames

Referenced by crawlservpp::Module::Extractor::Config::checkOptions(), crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().

◆ linkedLink

std::string crawlservpp::Module::Extractor::Config::Entries::linkedLink

Name of the extracted field that links an extracted dataset to the ID of a linked dataset.

See also
linkedFieldNames

Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().

◆ linkedOverwrite

bool crawlservpp::Module::Extractor::Config::Entries::linkedOverwrite {true}

Specifies whether, if a linked dataset with the same ID already exists, it will be overwritten.

Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().

◆ linkedTargetTable

std::string crawlservpp::Module::Extractor::Config::Entries::linkedTargetTable

◆ linkedTidyTexts

std::vector<bool> crawlservpp::Module::Extractor::Config::Entries::linkedTidyTexts

Specifies whether to remove line breaks and unnecessary whitespace when extracting the linked field with the same array index.

See also
linkedFieldNames

Referenced by crawlservpp::Module::Extractor::Config::checkOptions(), crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().

◆ linkedWarningsEmpty

std::vector<bool> crawlservpp::Module::Extractor::Config::Entries::linkedWarningsEmpty

Specifies whether to write a warning to the log when the field with the same array index is empty.

Note
Logging needs to be enabled in order for this option to have any effect.
See also
linkedFieldNames

Referenced by crawlservpp::Module::Extractor::Config::checkOptions(), crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().

◆ pagingAlias

std::string crawlservpp::Module::Extractor::Config::Entries::pagingAlias

Alias for the paging variable.

A paging alias allows additions to (and subtractions from, via negative values) the current value of the paging variable. The name of the alias will be replaced with the resulting value.

See also
pagingAliasAdd

Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().

◆ pagingAliasAdd

std::int64_t crawlservpp::Module::Extractor::Config::Entries::pagingAliasAdd {}

Value to add to the alias for the paging variable.

Use negative values to subtract from the original value.

See also
pagingAlias

Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
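
As an illustration of how the paging variable and its alias might be substituted, consider the sketch below. The helpers replaceAll and applyPaging, as well as the names $p and $q used in the usage note, are assumptions made for this example. The sketch also assumes that neither name is a substring of the other.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Replace every occurrence of 'needle' in 'target' with 'replacement'.
void replaceAll(
        std::string& target,
        const std::string& needle,
        const std::string& replacement
) {
    if(needle.empty()) {
        return;
    }

    std::size_t pos{};

    while((pos = target.find(needle, pos)) != std::string::npos) {
        target.replace(pos, needle.length(), replacement);

        // continue after the inserted text to avoid re-matching it
        pos += replacement.length();
    }
}

// Hypothetical helper: insert the current page number for the paging variable
// and (page number + aliasAdd) for the paging alias.
std::string applyPaging(
        std::string text,
        const std::string& variable,
        const std::string& alias,
        std::int64_t page,
        std::int64_t aliasAdd
) {
    replaceAll(text, variable, std::to_string(page));

    if(!alias.empty()) {
        replaceAll(text, alias, std::to_string(page + aliasAdd));
    }

    return text;
}
```

For example, applyPaging("example.com/list?page=$p&prev=$q", "$p", "$q", 5, -1) would produce example.com/list?page=5&prev=4.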

◆ pagingFirst

std::int64_t crawlservpp::Module::Extractor::Config::Entries::pagingFirst {}

◆ pagingFirstString

std::string crawlservpp::Module::Extractor::Config::Entries::pagingFirstString

Name of the first page.

If not empty, this string will overwrite Extractor::Config::Entries::pagingFirst. When a page name is used instead of a page number, Extractor::Config::Entries::pagingStep will not be used to determine the next page either.

Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().

◆ pagingIsNextFrom

std::uint64_t crawlservpp::Module::Extractor::Config::Entries::pagingIsNextFrom {}

Query on page content to determine whether there is another page.

Will be ignored if no query is set, i.e. the value is zero.

Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().

◆ pagingNextFrom

std::uint64_t crawlservpp::Module::Extractor::Config::Entries::pagingNextFrom {}

Query on page content to find the number(s) or name(s) of additional pages.

Will be ignored if no query is set, i.e. the value is zero.

If a query is set, it will overwrite Extractor::Config::Entries::pagingStep, which will no longer be used to determine the number of the next page.

Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().

◆ pagingNumberFrom

std::uint64_t crawlservpp::Module::Extractor::Config::Entries::pagingNumberFrom {}

Query to determine the total number of pages from the content of the first page.

Will be ignored if no query is set, i.e. the value is zero.

If a query is set, it will overwrite Extractor::Config::Entries::pagingStep and Extractor::Config::Entries::pagingNextFrom, which will no longer be used to determine the number of the next page.

Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().

◆ pagingStep

std::int64_t crawlservpp::Module::Extractor::Config::Entries::pagingStep {1}

Number to add to page variable for retrieving the next page, if a page number is used.

See also
pagingFirst, pagingNextFrom, pagingNumberFrom, pagingFirstString

Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
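
A minimal sketch of numeric paging, assuming that pages are requested starting at the value of pagingFirst, that pagingStep is added for each following page, and that the total number of pages is known in advance (for example from a query result). The function pageNumbers is a hypothetical helper; how the extractor combines these options internally may differ.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: list the page numbers to request, starting at 'first',
// advancing by 'step', for 'total' pages altogether.
std::vector<std::int64_t> pageNumbers(
        std::int64_t first,
        std::int64_t step,
        std::uint64_t total
) {
    std::vector<std::int64_t> pages;
    std::int64_t current{first};

    for(std::uint64_t n{}; n < total; ++n) {
        pages.push_back(current);

        current += step;
    }

    return pages;
}
```

A step larger than one is useful for sources that page by offset instead of by page number, e.g. a first value of 0 and a step of 10 for result lists with ten entries per page.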

◆ pagingVariable

std::string crawlservpp::Module::Extractor::Config::Entries::pagingVariable {defaultPagingVariable}

Name of the paging variable.

To be used in Extractor::Config::Entries::sourceUrl, Extractor::Config::Entries::sourceCookies, and Extractor::Config::Entries::sourceHeaders. Will be overwritten with either the number, or the name of the current page.

Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().

◆ sourceCookies

std::string crawlservpp::Module::Extractor::Config::Entries::sourceCookies

Custom HTTP Cookie header used when retrieving data.

Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().

◆ sourceHeaders

std::vector<std::string> crawlservpp::Module::Extractor::Config::Entries::sourceHeaders

Custom HTTP headers used when retrieving data.

Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().

◆ sourceUrl

std::string crawlservpp::Module::Extractor::Config::Entries::sourceUrl

URL to retrieve data from.

Note
The URL needs to be absolute, but without protocol, e.g. en.wikipedia.org/wiki/Main_Page.

Referenced by crawlservpp::Module::Extractor::Config::checkOptions(), crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().

◆ sourceUrlFirst

std::string crawlservpp::Module::Extractor::Config::Entries::sourceUrlFirst

URL of the first page to retrieve data from.

Note
The URL needs to be absolute, but without protocol, e.g. en.wikipedia.org/wiki/Main_Page.

Will be ignored when empty.

Referenced by crawlservpp::Module::Extractor::Config::checkOptions(), crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().

◆ sourceUsePost

bool crawlservpp::Module::Extractor::Config::Entries::sourceUsePost {false}

Specifies whether to use HTTP POST instead of HTTP GET for extracting data.

Note
When HTTP POST is used, arguments attached to the URL (e.g. ?var1&var2=valueOfVar2) will be sent as arguments of the HTTP POST request instead of parts of the URL.

Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
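
The note above can be illustrated with a short sketch that separates the arguments from the URL. The struct PostRequest and the function splitForPost are hypothetical names chosen for this example. The sketch assumes that everything after the first question mark is passed on unchanged as the body of the POST request.

```cpp
#include <string>

// Hypothetical result type: URL without arguments, plus the POST arguments.
struct PostRequest {
    std::string url;
    std::string arguments;
};

// Hypothetical helper: split a URL at the first '?' so that the arguments
// can be sent with the HTTP POST request instead of as part of the URL.
PostRequest splitForPost(const std::string& url) {
    const auto pos{url.find('?')};

    if(pos == std::string::npos) {
        // no arguments attached to the URL
        return PostRequest{url, std::string{}};
    }

    return PostRequest{url.substr(0, pos), url.substr(pos + 1)};
}
```

For example.com/search?var1&var2=valueOfVar2, this gives the URL example.com/search and the arguments var1&var2=valueOfVar2.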

◆ variablesAlias

std::vector<std::string> crawlservpp::Module::Extractor::Config::Entries::variablesAlias

Alias for the variable with the same array index.

Variable aliases allow additions to (and subtractions from, via negative values) the value of variables. The name of the variable alias will be replaced with the resulting value.

See also
variablesName, variablesAliasAdd

Referenced by crawlservpp::Module::Extractor::Config::checkOptions(), crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().

◆ variablesAliasAdd

std::vector<std::int64_t> crawlservpp::Module::Extractor::Config::Entries::variablesAliasAdd

Value to add to the variable alias with the same array index.

Use negative values to subtract from the original value.

See also
variablesName, variablesAlias

Referenced by crawlservpp::Module::Extractor::Config::checkOptions(), crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().

◆ variablesDateTimeFormat

std::vector<std::string> crawlservpp::Module::Extractor::Config::Entries::variablesDateTimeFormat

Date/time format to be used for the variable with the same array index.

If empty, no date/time conversion will be performed.

See Howard E. Hinnant's C++ date.h library documentation for details.

Set a string to UNIX to parse Unix timestamps, i.e. seconds since the Unix epoch, instead.

See also
variablesName, variablesDateTimeLocale, Helper::DateTime::convertCustomDateTimeToSQLTimeStamp

Referenced by crawlservpp::Module::Extractor::Config::checkOptions(), crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
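
To illustrate the UNIX case, the sketch below converts a string containing seconds since the Unix epoch into a timestamp of the form YYYY-MM-DD HH:MM:SS. It uses only the C++ standard library and is not the implementation behind Helper::DateTime::convertCustomDateTimeToSQLTimeStamp. The function name unixToSqlTimeStamp, the use of UTC, and the exact output format are assumptions.

```cpp
#include <ctime>
#include <string>

// Hypothetical helper: convert a Unix timestamp (seconds since the epoch,
// given as a string) into "YYYY-MM-DD HH:MM:SS", interpreted as UTC.
// Returns an empty string if the time cannot be represented.
// Note: std::stoll throws if the string does not start with a number.
std::string unixToSqlTimeStamp(const std::string& value) {
    const auto seconds{static_cast<std::time_t>(std::stoll(value))};

    // std::gmtime is not thread-safe; acceptable for a single-threaded sketch
    const std::tm* utc{std::gmtime(&seconds)};

    if(utc == nullptr) {
        return std::string{};
    }

    char buffer[20]{}; // 19 characters plus terminating zero

    if(std::strftime(buffer, sizeof(buffer), "%Y-%m-%d %H:%M:%S", utc) == 0) {
        return std::string{};
    }

    return std::string{buffer};
}
```

Under these assumptions, the input 0 results in 1970-01-01 00:00:00. Other formats would be parsed according to the format string and the locale with the same array index.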

◆ variablesDateTimeLocale

std::vector<std::string> crawlservpp::Module::Extractor::Config::Entries::variablesDateTimeLocale

Date/time locale to be used for the variable with the same array index.

Will be ignored if no corresponding date/time format is given.

See also
variablesName, variablesDateTimeFormat, Helper::DateTime::convertCustomDateTimeToSQLTimeStamp

Referenced by crawlservpp::Module::Extractor::Config::checkOptions(), crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().

◆ variablesName

std::vector<std::string> crawlservpp::Module::Extractor::Config::Entries::variablesName

Variable names.

Strings to be replaced by the respective variable values in Extractor::Config::Entries::variablesTokensSource, Extractor::Config::Entries::variablesTokenHeaders, Extractor::Config::Entries::sourceUrl, Extractor::Config::Entries::sourceCookies, and Extractor::Config::Entries::sourceHeaders.

Referenced by crawlservpp::Module::Extractor::Config::checkOptions(), crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().

◆ variablesParsedColumn

std::vector<std::string> crawlservpp::Module::Extractor::Config::Entries::variablesParsedColumn

Parsed column for the value of the variable with the same array index.

Note
Will only be used if parsed data is the source of the variable.
See also
variablesSource

Referenced by crawlservpp::Module::Extractor::Config::checkOptions(), crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().

◆ variablesParsedTable

std::vector<std::string> crawlservpp::Module::Extractor::Config::Entries::variablesParsedTable

Name of the table containing the parsed data for the variable with the same array index.

Note
Will only be used if parsed data is the source of the variable.
See also
variablesSource, variablesQuery

Referenced by crawlservpp::Module::Extractor::Config::checkOptions(), crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().

◆ variablesQuery

std::vector<std::uint64_t> crawlservpp::Module::Extractor::Config::Entries::variablesQuery

Query on the content or URL for the variable with the same array index.

Note
Will only be used if the content or the URL is the source of the variable.
See also
variablesSource, variablesQuery

Referenced by crawlservpp::Module::Extractor::Config::checkOptions(), crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().

◆ variablesSkipQuery

std::vector<std::uint64_t> crawlservpp::Module::Extractor::Config::Entries::variablesSkipQuery

Queries to be used on the value of the variable with the same array index to determine whether to skip the current URL.

Referenced by crawlservpp::Module::Extractor::Config::checkOptions(), crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().

◆ variablesSource

std::vector<std::uint8_t> crawlservpp::Module::Extractor::Config::Entries::variablesSource

◆ variablesTokenHeaders

std::vector<std::string> crawlservpp::Module::Extractor::Config::Entries::variablesTokenHeaders

Custom HTTP headers to be used for ALL token variables.

See also
variablesTokensSource

Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().

◆ variablesTokens

std::vector<std::string> crawlservpp::Module::Extractor::Config::Entries::variablesTokens

List of token variables.

Strings to be replaced with the value of the respective token variable in Extractor::Config::Entries::sourceUrl, Extractor::Config::Entries::sourceCookies, and Extractor::Config::Entries::sourceHeaders.

The values of token variables are determined by requesting data from external sources.

See also
variablesTokensSource

Referenced by crawlservpp::Module::Extractor::Config::checkOptions(), crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().

◆ variablesTokensCookies

std::vector<std::string> crawlservpp::Module::Extractor::Config::Entries::variablesTokensCookies

◆ variablesTokensQuery

std::vector<std::uint64_t> crawlservpp::Module::Extractor::Config::Entries::variablesTokensQuery

◆ variablesTokensSource

std::vector<std::string> crawlservpp::Module::Extractor::Config::Entries::variablesTokensSource

Source URL for the token variable with the same array index.

Note
The URL needs to be absolute, but without protocol, e.g. en.wikipedia.org/wiki/Main_Page.

To retrieve the content of the URL, the headers specified in Extractor::Config::Entries::variablesTokenHeaders, and the cookies specified in the string with the same array index in Extractor::Config::Entries::variablesTokensCookies will be used.

Extractor::Config::Entries::variablesTokensUsePost specifies whether to use HTTP POST, instead of HTTP GET, when retrieving the content. When HTTP POST is used, arguments attached to the URL (e.g. ?var1&var2=valueOfVar2) will be sent as arguments of the HTTP POST request instead of parts of the URL.

Afterwards, the query with the same array index in Extractor::Config::Entries::variablesTokensQuery will be used to determine the value of the respective token variable.

Referenced by crawlservpp::Module::Extractor::Config::checkOptions(), crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().

◆ variablesTokensUsePost

std::vector<bool> crawlservpp::Module::Extractor::Config::Entries::variablesTokensUsePost

Specifies whether to use HTTP POST instead of GET for the token variable with the same array index.

See also
variablesTokensSource

Referenced by crawlservpp::Module::Extractor::Config::checkOptions(), crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().


The documentation for this struct was generated from the following file: