crawlserv++
[under development]
Application for crawling and analyzing textual content of websites.
Configuration entries for extractor threads. More...
#include <Config.hpp>
Extractor Configuration | |
| std::uint64_t | generalCacheSize {defaultCacheSize} |
| Number of URLs fetched and extracted from before saving results. More... | |
| bool | generalExtractCustom {false} |
| Specifies whether to include custom URLs when extracting. More... | |
| std::uint32_t | generalLock {defaultLockS} |
| URL locking time, in seconds. More... | |
| std::uint8_t | generalLogging {generalLoggingDefault} |
| Level of logging activity. More... | |
| std::uint16_t | generalMaxBatchSize {defaultMaxBatchSize} |
| Maximum number of URLs and results processed in one MySQL query. More... | |
| bool | generalMinimizeMemory {false} |
| Specifies whether to free small amounts of unused memory more often, at the expense of performance. More... | |
| bool | generalReExtract {false} |
| Specifies whether to re-extract data from already processed URLs. More... | |
| std::string | generalTargetTable |
| Name of table to save extracted data to. More... | |
| std::int64_t | generalReTries {defaultReTries} |
| Number of re-tries on connection errors. More... | |
| std::vector< std::uint32_t > | generalRetryHttp |
| HTTP errors that will be handled like connection errors. More... | |
| std::uint64_t | generalSleepError {defaultSleepErrorMs} |
| Sleeping time on connection errors, in milliseconds. More... | |
| std::uint64_t | generalSleepHttp {defaultSleepHttpMs} |
| Time that will be waited between HTTP requests, in milliseconds. More... | |
| std::uint64_t | generalSleepIdle {defaultSleepIdleMs} |
| Time to wait before checking for new URLs when all URLs have been processed, in milliseconds. More... | |
| std::uint64_t | generalSleepMySql {defaultSleepMySqlS} |
| Time to wait before the last attempt to re-connect to the MySQL server, in seconds. More... | |
| std::uint32_t | generalTidyErrors {} |
| Number of tidyhtml errors to write to the log. More... | |
| bool | generalTidyWarnings {false} |
| Specifies whether to write tidyhtml warnings to the log. More... | |
| bool | generalTiming {false} |
| Specifies whether to calculate timing statistics for the extractor. More... | |
Variables | |
| std::vector< std::string > | variablesAlias |
| Alias for the variable with same array index. More... | |
| std::vector< std::int64_t > | variablesAliasAdd |
| Value to add to the variable alias with the same array index. More... | |
| std::vector< std::string > | variablesDateTimeFormat |
| Date/time format to be used for the variable with the same array index. More... | |
| std::vector< std::string > | variablesDateTimeLocale |
| Date/time locale to be used for the variable with the same array index. More... | |
| std::vector< std::uint64_t > | variablesSkipQuery |
| Queries to be used on the value of the variable with the same array index to determine whether to skip the current URL. More... | |
| std::vector< std::string > | variablesName |
| Variable names. More... | |
| std::vector< std::string > | variablesParsedColumn |
| Parsed column for the value of the variable with the same array index. More... | |
| std::vector< std::string > | variablesParsedTable |
| Name of the table containing the parsed data for the variable with the same array index. More... | |
| std::vector< std::uint64_t > | variablesQuery |
| Query on the content or URL for the variable with the same array index. More... | |
| std::vector< std::uint8_t > | variablesSource |
| Source of the variable with the same array index. More... | |
| std::vector< std::string > | variablesTokens |
| List of token variables. More... | |
| std::vector< std::string > | variablesTokensCookies |
| Custom HTTP Cookie header for the token variable with the same array index. More... | |
| std::vector< std::uint64_t > | variablesTokensQuery |
| Query to extract token variable with the same array index. More... | |
| std::vector< std::string > | variablesTokensSource |
| Source URL for the token variable with the same array index. More... | |
| std::vector< bool > | variablesTokensUsePost |
| Specifies whether to use HTTP POST instead of GET for the token variable with the same array index. More... | |
| std::vector< std::string > | variablesTokenHeaders |
| Custom HTTP headers to be used for ALL token variables. More... | |
Paging | |
| std::string | pagingAlias |
| Alias for the paging variable. More... | |
| std::int64_t | pagingAliasAdd {} |
| Value to add to the alias for the paging variable. More... | |
| std::int64_t | pagingFirst {} |
| Number of the first page. More... | |
| std::string | pagingFirstString |
| Name of the first page. More... | |
| std::uint64_t | pagingIsNextFrom {} |
| Query on page content to determine whether there is another page. More... | |
| std::uint64_t | pagingNextFrom {} |
| Query on page content to find the number(s) or name(s) of additional pages. More... | |
| std::uint64_t | pagingNumberFrom {} |
| Query to determine the total number of pages from the content of the first page. More... | |
| std::int64_t | pagingStep {1} |
| Number to add to page variable for retrieving the next page, if a page number is used. More... | |
| std::string | pagingVariable {defaultPagingVariable} |
| Name of the paging variable. More... | |
Source | |
| std::string | sourceCookies |
| Custom HTTP Cookie header used when retrieving data. More... | |
| std::vector< std::string > | sourceHeaders |
| Custom HTTP headers used when retrieving data. More... | |
| std::string | sourceUrl |
| URL to retrieve data from. More... | |
| std::string | sourceUrlFirst |
| URL of the first page to retrieve data from. More... | |
| bool | sourceUsePost {false} |
| Specifies whether to use HTTP POST instead of HTTP GET for extracting data. More... | |
Extracting | |
| std::vector< std::uint64_t > | extractingDatasetQueries |
| Queries to extract datasets. More... | |
| std::vector< std::string > | extractingDateTimeFormats |
| Format of date/time to be extracted by the date/time query with the same array index. More... | |
| std::vector< std::string > | extractingDateTimeLocales |
| Locale used by the date/time query with the same array index for extracting date and time. More... | |
| std::vector< std::uint64_t > | extractingDateTimeQueries |
| Queries used for extracting date/time from the dataset. More... | |
| std::vector< std::uint64_t > | extractingErrorFail |
| Queries to detect fatal errors in the data. More... | |
| std::vector< std::uint64_t > | extractingErrorRetry |
| Queries to detect temporary errors in the data. More... | |
| std::vector< std::string > | extractingFieldDateTimeFormats |
| Date/time format of the field with the same array index. More... | |
| std::vector< std::string > | extractingFieldDateTimeLocales |
| Locale used when converting the field with the same array index to a date/time. More... | |
| std::vector< char > | extractingFieldDelimiters |
| Delimiter between multiple results for the field with the same array index, if not saved as JSON. More... | |
| std::vector< bool > | extractingFieldIgnoreEmpty |
| Specifies whether to ignore empty values when parsing multiple results for the field with the same array index. More... | |
| std::vector< bool > | extractingFieldJSON |
| Specifies whether to save the value of the field with the same array index as a JSON array. More... | |
| std::vector< std::string > | extractingFieldNames |
| The names of the custom fields to extract. More... | |
| std::vector< std::uint64_t > | extractingFieldQueries |
| The query used to extract the custom field with the same array index from the data. More... | |
| std::vector< bool > | extractingFieldTidyTexts |
| Specifies whether to remove line breaks and unnecessary whitespaces when extracting the field with the same array index. More... | |
| std::vector< bool > | extractingFieldWarningsEmpty |
| Specifies whether to write a warning to the log when the field with the same array index is empty. More... | |
| std::vector< std::string > | extractingIdIgnore |
| Extracted IDs to be ignored. More... | |
| std::vector< std::uint64_t > | extractingIdQueries |
| Queries to extract the ID from the dataset. More... | |
| bool | extractingOverwrite {true} |
| Specifies whether, if a dataset with the same ID already exists, it will be overwritten. More... | |
| std::vector< std::uint64_t > | extractingRecursive |
| Queries for extracting more datasets from a dataset. More... | |
| std::uint64_t | extractingRecursiveMaxDepth {defaultRecursiveMaxDepth} |
| Maximum depth of recursive extracting. More... | |
| bool | extractingRemoveDuplicates {true} |
| Specifies whether to remove duplicate datasets over multiple pages before checking the expected number of datasets. More... | |
| bool | extractingRepairCData {true} |
| Specifies whether to (try to) repair CData when parsing HTML/XML. More... | |
| bool | extractingRepairComments {true} |
| Specifies whether to (try to) repair broken HTML/XML comments. More... | |
| bool | extractingRemoveXmlInstructions {true} |
| Specifies whether to remove XML processing instructions (<?xml:...>) before parsing HTML content. More... | |
| std::uint64_t | extractingSkipQuery {} |
| Extracting will proceed to the next URL if the current page fulfills this query. More... | |
Linked Data | |
| std::vector< std::uint64_t > | linkedDatasetQueries |
| Queries to extract linked datasets. More... | |
| std::vector< std::string > | linkedDateTimeFormats |
| Date/time format of the linked field with the same array index. More... | |
| std::vector< std::string > | linkedDateTimeLocales |
| Date/time locale of the linked field with the same array index. More... | |
| std::vector< char > | linkedDelimiters |
| Delimiter between multiple results for the field with the same array index, if not saved as JSON. More... | |
| std::vector< std::string > | linkedFieldNames |
| Names of the linked data fields. More... | |
| std::vector< std::uint64_t > | linkedFieldQueries |
| Query used to extract the custom field with the same array index from the dataset. More... | |
| std::vector< std::string > | linkedIdIgnore |
| IDs of linked data to be ignored. More... | |
| std::vector< std::uint64_t > | linkedIdQueries |
| Queries to extract the linked ID from the dataset. More... | |
| std::vector< bool > | linkedIgnoreEmpty |
| Specifies whether to ignore empty values when parsing multiple results for the field with the same array index. More... | |
| std::vector< bool > | linkedJSON |
| Specifies whether to save the value of the field with the same array index as a JSON array. More... | |
| std::string | linkedLink |
| Name of the extracted field that links an extracted dataset to the ID of a linked dataset. More... | |
| bool | linkedOverwrite {true} |
| Specifies whether, if a linked dataset with the same ID already exists, it will be overwritten. More... | |
| std::string | linkedTargetTable |
| Name of the table to save linked data to. More... | |
| std::vector< bool > | linkedTidyTexts |
| Specifies whether to remove line breaks and unnecessary whitespaces when extracting the linked field with the same array index. More... | |
| std::vector< bool > | linkedWarningsEmpty |
| Specifies whether to write a warning to the log when the field with the same array index is empty. More... | |
Expected Number of Results | |
| bool | expectedErrorIfLarger {false} |
| Specifies whether to throw an exception when the number of expected datasets is exceeded. More... | |
| bool | expectedErrorIfSmaller {false} |
| Specifies whether to throw an exception when the number of datasets falls short of the expected number. More... | |
| std::string | expectedParsedColumn |
| Parsed column containing the expected number of datasets. More... | |
| std::string | expectedParsedTable |
| Name of the table containing the expected number of datasets. More... | |
| std::uint64_t | expectedQuery {} |
| Query to be performed to retrieve the expected number of datasets. More... | |
| std::uint8_t | expectedSource {expectedSourceExtracting} |
| Source of the query to retrieve the expected number of datasets. More... | |
Configuration entries for extractor threads.
json/extractor.json in crawlserv_frontend!
| bool crawlservpp::Module::Extractor::Config::Entries::expectedErrorIfLarger {false} |
Specifies whether to throw an exception when the number of expected datasets is exceeded.
Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
| bool crawlservpp::Module::Extractor::Config::Entries::expectedErrorIfSmaller {false} |
Specifies whether to throw an exception when the number of datasets falls short of the expected number.
Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
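As a rough illustration of how the two flags might be applied once the expected and the actual number of datasets are known, consider the following sketch. The function `checkExpected` and its parameters are hypothetical and not part of the documented interface; what happens when no exception is thrown is left open here.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of crawlserv++): compares the number of
// extracted datasets with the expected number and throws only if the
// corresponding flag is set.
void checkExpected(
        std::size_t expected,
        std::size_t extracted,
        bool errorIfLarger,  // mirrors expectedErrorIfLarger
        bool errorIfSmaller  // mirrors expectedErrorIfSmaller
) {
    if(errorIfLarger && extracted > expected) {
        throw std::runtime_error(
                "more datasets extracted (" + std::to_string(extracted)
                + ") than expected (" + std::to_string(expected) + ")"
        );
    }

    if(errorIfSmaller && extracted < expected) {
        throw std::runtime_error(
                "fewer datasets extracted (" + std::to_string(extracted)
                + ") than expected (" + std::to_string(expected) + ")"
        );
    }
}
```

With both flags left at their default of `false`, a mismatch never throws in this sketch.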
| std::string crawlservpp::Module::Extractor::Config::Entries::expectedParsedColumn |
Parsed column containing the expected number of datasets.
Referenced by crawlservpp::Module::Extractor::Config::parseOption().
| std::string crawlservpp::Module::Extractor::Config::Entries::expectedParsedTable |
Name of the table containing the expected number of datasets.
Referenced by crawlservpp::Module::Extractor::Config::parseOption().
| std::uint64_t crawlservpp::Module::Extractor::Config::Entries::expectedQuery {} |
Query to be performed to retrieve the expected number of datasets.
Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
| std::uint8_t crawlservpp::Module::Extractor::Config::Entries::expectedSource {expectedSourceExtracting} |
Source of the query to retrieve the expected number of datasets.
Referenced by crawlservpp::Module::Extractor::Config::parseOption().
| std::vector<std::uint64_t> crawlservpp::Module::Extractor::Config::Entries::extractingDatasetQueries |
Queries to extract datasets.
The first query that returns a non-empty result will be used.
Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
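The "first non-empty result wins" rule, which also applies to several other query lists on this page, could be sketched as follows. `RunQuery` is a hypothetical stand-in for executing a stored query by its ID; the actual query interface is not shown in this section.

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for running the query with the given ID on some
// content and returning its results.
using RunQuery = std::function<std::vector<std::string>(std::uint64_t queryId)>;

// Tries the queries in the configured order and returns the results of the
// first one that yields anything; returns an empty vector if none does.
std::vector<std::string> firstNonEmpty(
        const std::vector<std::uint64_t>& queryIds,
        const RunQuery& run
) {
    for(const auto id : queryIds) {
        auto results = run(id);

        if(!results.empty()) {
            return results;
        }
    }

    return {};
}
```

Under this reading, the order of the IDs in the vector acts as a priority list: later queries are only run if all earlier ones returned nothing.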
| std::vector<std::string> crawlservpp::Module::Extractor::Config::Entries::extractingDateTimeFormats |
Format of date/time to be extracted by the date/time query with the same array index.
If not specified, the format %F %T, i.e. YYYY-MM-DD HH:MM:SS, will be used.
See Howard E. Hinnant's C++ date.h library documentation for details.
Set a string to UNIX to parse Unix timestamps, i.e. seconds since the Unix epoch, instead.
Referenced by crawlservpp::Module::Extractor::Config::checkOptions(), crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
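To make the three cases concrete (default format, explicit format, and `UNIX`), here is a minimal sketch. It uses `std::get_time` from the standard library as a stand-in for the date.h parser referenced above, so the set of supported format flags differs; the function `parseDateTime` is hypothetical.

```cpp
#include <cstddef>
#include <ctime>
#include <exception>
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: parses `value` according to `format` and writes the
// broken-down time to `out`. Returns false if the value does not match.
bool parseDateTime(const std::string& value, std::string format, std::tm& out) {
    if(format == "UNIX") {
        // the value is interpreted as seconds since the Unix epoch
        std::size_t parsed = 0;
        long long seconds = 0;

        try {
            seconds = std::stoll(value, &parsed);
        }
        catch(const std::exception&) {
            return false;
        }

        if(parsed != value.size()) {
            return false; // trailing garbage after the number
        }

        const std::time_t time = static_cast<std::time_t>(seconds);
        const std::tm* utc = std::gmtime(&time); // note: not thread-safe

        if(utc == nullptr) {
            return false;
        }

        out = *utc;

        return true;
    }

    if(format.empty()) {
        format = "%Y-%m-%d %H:%M:%S"; // expanded form of "%F %T"
    }

    std::tm result{};
    std::istringstream in(value);

    in >> std::get_time(&result, format.c_str());

    if(in.fail()) {
        return false;
    }

    out = result;

    return true;
}
```

For example, `parseDateTime("86400", "UNIX", tm)` fills `tm` with 1970-01-02 00:00:00 (UTC), while an empty format expects input such as `2021-03-04 05:06:07`.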
| std::vector<std::string> crawlservpp::Module::Extractor::Config::Entries::extractingDateTimeLocales |
Locale used by the date/time query with the same array index for extracting date and time.
Referenced by crawlservpp::Module::Extractor::Config::checkOptions(), and crawlservpp::Module::Extractor::Config::parseOption().
| std::vector<std::uint64_t> crawlservpp::Module::Extractor::Config::Entries::extractingDateTimeQueries |
Queries used for extracting date/time from the dataset.
The first query that returns a non-empty result will be used.
Referenced by crawlservpp::Module::Extractor::Config::checkOptions(), crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
| std::vector<std::uint64_t> crawlservpp::Module::Extractor::Config::Entries::extractingErrorFail |
Queries to detect fatal errors in the data.
The extraction will fail if any of these queries returns true.
Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
| std::vector<std::uint64_t> crawlservpp::Module::Extractor::Config::Entries::extractingErrorRetry |
Queries to detect temporary errors in the data.
The extraction will be retried as long as any of these queries returns true.
Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
| std::vector<std::string> crawlservpp::Module::Extractor::Config::Entries::extractingFieldDateTimeFormats |
Date/time format of the field with the same array index.
If empty, no date/time conversion will be performed.
See Howard E. Hinnant's C++ date.h library documentation for details.
Set a string to UNIX to parse Unix timestamps, i.e. seconds since the Unix epoch, instead.
Referenced by crawlservpp::Module::Extractor::Config::checkOptions(), crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
| std::vector<std::string> crawlservpp::Module::Extractor::Config::Entries::extractingFieldDateTimeLocales |
Locale used when converting the field with the same array index to a date/time.
Will be ignored if no date/time format has been specified for the field.
Referenced by crawlservpp::Module::Extractor::Config::checkOptions(), crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
| std::vector<char> crawlservpp::Module::Extractor::Config::Entries::extractingFieldDelimiters |
Delimiter between multiple results for the field with the same array index, if not saved as JSON.
Only the first character of the string will be used, unless the string is one of the escape sequences \n (default), \t, or \\.
Referenced by crawlservpp::Module::Extractor::Config::checkOptions(), crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
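One way to read the delimiter rule is sketched below. The function `delimiterFromString` is hypothetical, and treating an empty string as the default `\n` is an assumption rather than documented behaviour.

```cpp
#include <string>

// Hypothetical helper: converts a configured delimiter string into the single
// character stored in the std::vector<char>.
char delimiterFromString(const std::string& configured) {
    if(configured.empty()) {
        return '\n'; // assumption: fall back to the default delimiter
    }

    // recognize the three escape sequences named in the description
    if(configured.size() > 1 && configured[0] == '\\') {
        switch(configured[1]) {
        case 'n':
            return '\n';

        case 't':
            return '\t';

        case '\\':
            return '\\';

        default:
            break;
        }
    }

    // otherwise, only the first character counts
    return configured[0];
}
```

Under these assumptions, the two-character string `\t` yields a tab, and `;abc` yields `;`.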
| std::vector<bool> crawlservpp::Module::Extractor::Config::Entries::extractingFieldIgnoreEmpty |
Specifies whether to ignore empty values when parsing multiple results for the field with the same array index.
Enabled by default.
Referenced by crawlservpp::Module::Extractor::Config::checkOptions(), crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
| std::vector<bool> crawlservpp::Module::Extractor::Config::Entries::extractingFieldJSON |
Specifies whether to save the value of the field with the same array index as a JSON array.
Referenced by crawlservpp::Module::Extractor::Config::checkOptions(), crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
| std::vector<std::string> crawlservpp::Module::Extractor::Config::Entries::extractingFieldNames |
The names of the custom fields to extract.
These fields will be extracted from the content of the current page, using the queries specified in Extractor::Config::Entries::extractingFieldQueries.
Field options are matched via the array index in the respective vectors.
If Extractor::Config::Entries::extractingFieldDateTimeFormats contains a non-empty string, a date/time will be parsed for the respective field, using the locale defined in Extractor::Config::Entries::extractingFieldDateTimeLocales.
Multiple values for one field will be detected via the delimiter in Extractor::Config::Entries::extractingFieldDelimiters. Extractor::Config::Entries::extractingFieldIgnoreEmpty determines whether to ignore empty values, and Extractor::Config::Entries::extractingFieldJSON determines whether to store them as a JSON array.
If the value of a field is empty, Extractor::Config::Entries::extractingFieldWarningsEmpty determines whether to write a warning to the log.
Extractor::Config::Entries::extractingFieldTidyTexts specifies whether to tidy up the resulting text before being stored to the respective field.
Referenced by crawlservpp::Module::Extractor::Config::checkOptions(), crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
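The handling of multiple values per field described above might look roughly like this. Both functions are hypothetical helpers written for illustration; the JSON serialization escapes only a few characters and is not a complete JSON encoder.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Splits a raw query result at `delimiter` (cf. extractingFieldDelimiters),
// optionally dropping empty values (cf. extractingFieldIgnoreEmpty).
std::vector<std::string> splitValues(
        const std::string& raw,
        char delimiter,
        bool ignoreEmpty
) {
    std::vector<std::string> values;
    std::size_t begin = 0;

    while(true) {
        const std::size_t end = raw.find(delimiter, begin);
        std::string value = raw.substr(
                begin,
                end == std::string::npos ? std::string::npos : end - begin
        );

        if(!ignoreEmpty || !value.empty()) {
            values.push_back(std::move(value));
        }

        if(end == std::string::npos) {
            break;
        }

        begin = end + 1;
    }

    return values;
}

// Stores the values as a JSON array of strings (cf. extractingFieldJSON).
std::string toJsonArray(const std::vector<std::string>& values) {
    std::string json = "[";

    for(std::size_t i = 0; i < values.size(); ++i) {
        if(i > 0) {
            json += ',';
        }

        json += '"';

        for(const char c : values[i]) {
            switch(c) {
            case '"':  json += "\\\""; break;
            case '\\': json += "\\\\"; break;
            case '\n': json += "\\n";  break;
            case '\t': json += "\\t";  break;
            default:   json += c;
            }
        }

        json += '"';
    }

    json += ']';

    return json;
}
```

For a field at index `i`, the delimiter, the ignore-empty flag, and the JSON flag would each be taken from position `i` of their respective vectors.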
| std::vector<std::uint64_t> crawlservpp::Module::Extractor::Config::Entries::extractingFieldQueries |
The query used to extract the custom field with the same array index from the data.
Referenced by crawlservpp::Module::Extractor::Config::checkOptions(), and crawlservpp::Module::Extractor::Config::parseOption().
| std::vector<bool> crawlservpp::Module::Extractor::Config::Entries::extractingFieldTidyTexts |
Specifies whether to remove line breaks and unnecessary whitespaces when extracting the field with the same array index.
Referenced by crawlservpp::Module::Extractor::Config::checkOptions(), crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
| std::vector<bool> crawlservpp::Module::Extractor::Config::Entries::extractingFieldWarningsEmpty |
Specifies whether to write a warning to the log when the field with the same array index is empty.
Referenced by crawlservpp::Module::Extractor::Config::checkOptions(), crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
| std::vector<std::string> crawlservpp::Module::Extractor::Config::Entries::extractingIdIgnore |
Extracted IDs to be ignored.
Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
| std::vector<std::uint64_t> crawlservpp::Module::Extractor::Config::Entries::extractingIdQueries |
Queries to extract the ID from the dataset.
The first query that returns a non-empty result will be used. Datasets with duplicate or empty IDs will not be extracted.
Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
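The ID rules (no empty IDs, no duplicates, and the ignore list from `extractingIdIgnore`) can be pictured with the following sketch. The `Dataset` type and the function `filterDatasets` are hypothetical, and keeping the first of several datasets sharing an ID is an assumption.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical representation of an extracted dataset.
struct Dataset {
    std::string id;
    std::string content;
};

// Keeps only datasets with a non-empty ID that has not been seen before and
// is not listed in `ignoredIds`.
std::vector<Dataset> filterDatasets(
        const std::vector<Dataset>& datasets,
        const std::vector<std::string>& ignoredIds
) {
    // pre-filling the set with the ignored IDs rejects them like duplicates
    std::unordered_set<std::string> seen(ignoredIds.begin(), ignoredIds.end());
    std::vector<Dataset> kept;

    for(const auto& dataset : datasets) {
        if(dataset.id.empty()) {
            continue; // empty ID
        }

        if(!seen.insert(dataset.id).second) {
            continue; // duplicate or ignored ID
        }

        kept.push_back(dataset);
    }

    return kept;
}
```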
| bool crawlservpp::Module::Extractor::Config::Entries::extractingOverwrite {true} |
Specifies whether, if a dataset with the same ID already exists, it will be overwritten.
Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
| std::vector<std::uint64_t> crawlservpp::Module::Extractor::Config::Entries::extractingRecursive |
Queries for extracting more datasets from a dataset.
The first query that returns a non-empty result will be used.
Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
| std::uint64_t crawlservpp::Module::Extractor::Config::Entries::extractingRecursiveMaxDepth {defaultRecursiveMaxDepth} |
Maximum depth of recursive extracting.
Referenced by crawlservpp::Module::Extractor::Config::parseOption().
| bool crawlservpp::Module::Extractor::Config::Entries::extractingRemoveDuplicates {true} |
Specifies whether to remove duplicate datasets over multiple pages before checking the expected number of datasets.
Referenced by crawlservpp::Module::Extractor::Thread::onReset(), crawlservpp::Module::Extractor::Thread::onTick(), and crawlservpp::Module::Extractor::Config::parseOption().
| bool crawlservpp::Module::Extractor::Config::Entries::extractingRemoveXmlInstructions {true} |
Specifies whether to remove XML processing instructions (<?xml:...>) before parsing HTML content.
Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
| bool crawlservpp::Module::Extractor::Config::Entries::extractingRepairCData {true} |
Specifies whether to (try to) repair CData when parsing HTML/XML.
Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
| bool crawlservpp::Module::Extractor::Config::Entries::extractingRepairComments {true} |
Specifies whether to (try to) repair broken HTML/XML comments.
Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
| std::uint64_t crawlservpp::Module::Extractor::Config::Entries::extractingSkipQuery {} |
Extracting will proceed to the next URL if the current page fulfills this query.
Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
| std::uint64_t crawlservpp::Module::Extractor::Config::Entries::generalCacheSize {defaultCacheSize} |
Number of URLs fetched and extracted from before saving results.
Set to zero to cache all URLs at once.
Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
| bool crawlservpp::Module::Extractor::Config::Entries::generalExtractCustom {false} |
Specifies whether to include custom URLs when extracting.
Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
| std::uint32_t crawlservpp::Module::Extractor::Config::Entries::generalLock {defaultLockS} |
URL locking time, in seconds.
Referenced by crawlservpp::Module::Extractor::Thread::onReset(), crawlservpp::Module::Extractor::Thread::onTick(), and crawlservpp::Module::Extractor::Config::parseOption().
| std::uint8_t crawlservpp::Module::Extractor::Config::Entries::generalLogging {generalLoggingDefault} |
Level of logging activity.
Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
| std::uint16_t crawlservpp::Module::Extractor::Config::Entries::generalMaxBatchSize {defaultMaxBatchSize} |
Maximum number of URLs and results processed in one MySQL query.
Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
| bool crawlservpp::Module::Extractor::Config::Entries::generalMinimizeMemory {false} |
Specifies whether to free small amounts of unused memory more often, at the expense of performance.
Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
| bool crawlservpp::Module::Extractor::Config::Entries::generalReExtract {false} |
Specifies whether to re-extract data from already processed URLs.
Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
| std::int64_t crawlservpp::Module::Extractor::Config::Entries::generalReTries {defaultReTries} |
Number of re-tries on connection errors.
Set to -1 to re-try indefinitely on connection errors.
Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
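How a re-try count with `-1` as "no limit" could interact with the sleeping time on errors is sketched below. The function `withRetries` and its callbacks are hypothetical; in particular, counting re-tries separately from the initial attempt is an assumption.

```cpp
#include <cstdint>
#include <functional>

// Calls `attempt` until it succeeds or the re-tries are used up.
// A negative `maxRetries` (e.g. -1) means there is no limit.
// `sleepOnError` stands in for waiting generalSleepError milliseconds.
bool withRetries(
        std::int64_t maxRetries,
        const std::function<bool()>& attempt,
        const std::function<void()>& sleepOnError
) {
    std::int64_t retries = 0;

    while(!attempt()) {
        if(maxRetries >= 0 && retries >= maxRetries) {
            return false; // give up: all re-tries failed
        }

        ++retries;

        sleepOnError();
    }

    return true;
}
```

With `maxRetries` set to 2, this sketch makes at most three attempts in total: the initial one plus two re-tries.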
| std::vector<std::uint32_t> crawlservpp::Module::Extractor::Config::Entries::generalRetryHttp |
HTTP errors that will be handled like connection errors.
Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
| std::uint64_t crawlservpp::Module::Extractor::Config::Entries::generalSleepError {defaultSleepErrorMs} |
Sleeping time on connection errors, in milliseconds.
Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
| std::uint64_t crawlservpp::Module::Extractor::Config::Entries::generalSleepHttp {defaultSleepHttpMs} |
Time that will be waited between HTTP requests, in milliseconds.
Referenced by crawlservpp::Module::Extractor::Config::parseOption().
| std::uint64_t crawlservpp::Module::Extractor::Config::Entries::generalSleepIdle {defaultSleepIdleMs} |
Time to wait before checking for new URLs when all URLs have been processed, in milliseconds.
Referenced by crawlservpp::Module::Extractor::Thread::onTick(), and crawlservpp::Module::Extractor::Config::parseOption().
| std::uint64_t crawlservpp::Module::Extractor::Config::Entries::generalSleepMySql {defaultSleepMySqlS} |
Time to wait before the last attempt to re-connect to the MySQL server, in seconds.
Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
| std::string crawlservpp::Module::Extractor::Config::Entries::generalTargetTable |
Name of table to save extracted data to.
Referenced by crawlservpp::Module::Extractor::Config::checkOptions(), crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
| std::uint32_t crawlservpp::Module::Extractor::Config::Entries::generalTidyErrors {} |
Number of tidyhtml errors to write to the log.
Referenced by crawlservpp::Module::Extractor::Config::parseOption().
| bool crawlservpp::Module::Extractor::Config::Entries::generalTidyWarnings {false} |
Specifies whether to write tidyhtml warnings to the log.
Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
| bool crawlservpp::Module::Extractor::Config::Entries::generalTiming {false} |
Specifies whether to calculate timing statistics for the extractor.
Referenced by crawlservpp::Module::Extractor::Thread::onReset(), crawlservpp::Module::Extractor::Thread::onTick(), and crawlservpp::Module::Extractor::Config::parseOption().
| std::vector<std::uint64_t> crawlservpp::Module::Extractor::Config::Entries::linkedDatasetQueries |
Queries to extract linked datasets.
The first query that returns a non-empty result will be used.
Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
| std::vector<std::string> crawlservpp::Module::Extractor::Config::Entries::linkedDateTimeFormats |
Date/time format of the linked field with the same array index.
If empty, no date/time conversion will be performed.
See Howard E. Hinnant's C++ date.h library documentation for details.
Set a string to UNIX to parse Unix timestamps, i.e. seconds since the Unix epoch, instead.
Referenced by crawlservpp::Module::Extractor::Config::checkOptions(), crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
| std::vector<std::string> crawlservpp::Module::Extractor::Config::Entries::linkedDateTimeLocales |
Date/time locale of the linked field with the same array index.
Will be ignored if no corresponding date/time format is given.
Referenced by crawlservpp::Module::Extractor::Config::checkOptions(), crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
| std::vector<char> crawlservpp::Module::Extractor::Config::Entries::linkedDelimiters |
Delimiter between multiple results for the field with the same array index, if not saved as JSON.
Only the first character will be used, unless the string is one of the escape sequences \n (default), \t, or \\.
Referenced by crawlservpp::Module::Extractor::Config::checkOptions(), crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
| std::vector<std::string> crawlservpp::Module::Extractor::Config::Entries::linkedFieldNames |
Names of the linked data fields.
Linked data is additionally extracted data that is linked via its ID field to one of the originally extracted data fields, as specified in Extractor::Config::Entries::linkedLink.
The ID field and the additional data fields will be extracted from the subset retrieved by running the queries in Extractor::Config::Entries::linkedDatasetQueries on the content of the current page, using the queries specified in Extractor::Config::Entries::linkedIdQueries for the ID and Extractor::Config::Entries::linkedFieldQueries for each of the other fields.
Linked data with the IDs specified in Extractor::Config::Entries::linkedIdIgnore will be ignored.
Linked field options are matched via the array index in the respective vectors.
If Extractor::Config::Entries::linkedDateTimeFormats contains a non-empty string, a date/time will be parsed for the respective field, using the locale defined in Extractor::Config::Entries::linkedDateTimeLocales.
Multiple values for one field will be detected via the delimiter in Extractor::Config::Entries::linkedDelimiters. Extractor::Config::Entries::linkedIgnoreEmpty determines whether to ignore empty values, and Extractor::Config::Entries::linkedJSON determines whether to store them as a JSON array.
If the value of a field is empty, Extractor::Config::Entries::linkedWarningsEmpty determines whether to write a warning to the log.
Extractor::Config::Entries::linkedTidyTexts specifies whether to tidy up the resulting text before being stored to the respective field.
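How a delimiter, the option to ignore empty values, and the JSON option might interact for a single linked field is sketched below. Both function names are hypothetical, and the JSON escaping is deliberately minimal (quotes and backslashes only); the actual implementation in crawlserv++ may differ.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: splits a raw field value at the delimiter,
// optionally dropping empty values.
std::vector<std::string> splitValues(
        const std::string& raw,
        char delimiter,
        bool ignoreEmpty
) {
    std::vector<std::string> values;
    std::string::size_type begin = 0;

    while (true) {
        const std::string::size_type end = raw.find(delimiter, begin);
        const std::string value = raw.substr(
                begin,
                end == std::string::npos ? std::string::npos : end - begin
        );

        if (!ignoreEmpty || !value.empty()) {
            values.push_back(value);
        }

        if (end == std::string::npos) {
            break; // no further delimiter: the last value has been handled
        }

        begin = end + 1;
    }

    return values;
}

// Hypothetical helper: writes the values as a JSON array of strings.
// Only quotes and backslashes are escaped; control characters are
// not handled in this sketch.
std::string toJsonArray(const std::vector<std::string>& values) {
    std::string json = "[";

    for (std::size_t i = 0; i < values.size(); ++i) {
        if (i > 0) {
            json += ',';
        }

        json += '"';

        for (const char c : values[i]) {
            if (c == '"' || c == '\\') {
                json += '\\';
            }

            json += c;
        }

        json += '"';
    }

    json += ']';

    return json;
}
```

For a raw value consisting of `a`, an empty line, and `b`, splitting at the newline with empty values ignored yields two values, which `toJsonArray` writes as `["a","b"]`.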
Referenced by crawlservpp::Module::Extractor::Config::checkOptions(), crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
| std::vector<std::uint64_t> crawlservpp::Module::Extractor::Config::Entries::linkedFieldQueries |
Query used to extract the custom field with the same array index from the dataset.
Referenced by crawlservpp::Module::Extractor::Config::checkOptions(), and crawlservpp::Module::Extractor::Config::parseOption().
| std::vector<std::string> crawlservpp::Module::Extractor::Config::Entries::linkedIdIgnore |
IDs of linked data to be ignored.
Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
| std::vector<std::uint64_t> crawlservpp::Module::Extractor::Config::Entries::linkedIdQueries |
Queries to extract the linked ID from the dataset.
The first query that returns a non-empty result will be used.
Datasets with duplicate or empty IDs will not be extracted.
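The selection and filtering described above can be sketched as follows. The function names are hypothetical, and the query results are assumed to be available as plain strings in the order of the configured queries.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper: returns the first non-empty query result,
// or an empty string if all results are empty.
std::string firstNonEmpty(const std::vector<std::string>& queryResults) {
    for (const auto& result : queryResults) {
        if (!result.empty()) {
            return result;
        }
    }

    return {};
}

// Hypothetical helper: returns true if a dataset with this ID should be
// extracted, i.e. the ID is neither empty nor already known.
bool acceptId(const std::string& id, std::unordered_set<std::string>& seenIds) {
    if (id.empty()) {
        return false;
    }

    // insert() reports via .second whether the ID was newly added
    return seenIds.insert(id).second;
}
```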
Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
| std::vector<bool> crawlservpp::Module::Extractor::Config::Entries::linkedIgnoreEmpty |
Specifies whether to ignore empty values when parsing multiple results for the field with the same array index.
Enabled by default.
Referenced by crawlservpp::Module::Extractor::Config::checkOptions(), crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
| std::vector<bool> crawlservpp::Module::Extractor::Config::Entries::linkedJSON |
Specifies whether to save the value of the field with the same array index as a JSON array.
Referenced by crawlservpp::Module::Extractor::Config::checkOptions(), crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
| std::string crawlservpp::Module::Extractor::Config::Entries::linkedLink |
Name of the extracted field that links an extracted dataset to the ID of a linked dataset.
Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
| bool crawlservpp::Module::Extractor::Config::Entries::linkedOverwrite {true} |
Specifies whether, if a linked dataset with the same ID already exists, it will be overwritten.
Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
| std::string crawlservpp::Module::Extractor::Config::Entries::linkedTargetTable |
Name of the table to save linked data to.
Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
| std::vector<bool> crawlservpp::Module::Extractor::Config::Entries::linkedTidyTexts |
Specifies whether to remove line breaks and unnecessary whitespace when extracting the linked field with the same array index.
Referenced by crawlservpp::Module::Extractor::Config::checkOptions(), crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
| std::vector<bool> crawlservpp::Module::Extractor::Config::Entries::linkedWarningsEmpty |
Specifies whether to write a warning to the log when the field with the same array index is empty.
Referenced by crawlservpp::Module::Extractor::Config::checkOptions(), crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
| std::string crawlservpp::Module::Extractor::Config::Entries::pagingAlias |
Alias for the paging variable.
A paging alias allows additions to (and subtractions from, via negative values) the current value of the paging variable. The name of the alias will be replaced with the resulting value.
Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
| std::int64_t crawlservpp::Module::Extractor::Config::Entries::pagingAliasAdd {} |
Value to add to the alias for the paging variable.
Use negative values to subtract from the original value.
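A minimal sketch of how the paging variable and its alias might be substituted into a URL follows. The names `replaceAll` and `applyPaging`, as well as the placeholder strings used in the usage note, are assumptions made for illustration. The sketch also assumes that neither placeholder is a substring of the other.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: replaces every occurrence of needle in target.
void replaceAll(
        std::string& target,
        const std::string& needle,
        const std::string& replacement
) {
    if (needle.empty()) {
        return; // nothing to search for
    }

    std::string::size_type pos = 0;

    while ((pos = target.find(needle, pos)) != std::string::npos) {
        target.replace(pos, needle.length(), replacement);

        pos += replacement.length(); // continue after the inserted text
    }
}

// Hypothetical helper: inserts the current page number for the paging
// variable, and the page number plus aliasAdd for the alias.
std::string applyPaging(
        std::string url,
        const std::string& variable,
        const std::string& alias,
        std::int64_t page,
        std::int64_t aliasAdd
) {
    replaceAll(url, variable, std::to_string(page));
    replaceAll(url, alias, std::to_string(page + aliasAdd));

    return url;
}
```

For example, with the variable `$p`, the alias `$q`, the page number 3, and an added value of -1, the URL `example.com/list?page=$p&prev=$q` becomes `example.com/list?page=3&prev=2`.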
Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
| std::int64_t crawlservpp::Module::Extractor::Config::Entries::pagingFirst {} |
Number of the first page.
Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
| std::string crawlservpp::Module::Extractor::Config::Entries::pagingFirstString |
Name of the first page.
If not empty, this string will overwrite Extractor::Config::Entries::pagingFirst. When a page name is used instead of a page number, Extractor::Config::Entries::pagingStep will not be used to determine the next page.
Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
| std::uint64_t crawlservpp::Module::Extractor::Config::Entries::pagingIsNextFrom {} |
Query on page content to determine whether there is another page.
Will be ignored, if no query is set, i.e. the value is zero.
Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
| std::uint64_t crawlservpp::Module::Extractor::Config::Entries::pagingNextFrom {} |
Query on page content to find the number(s) or name(s) of additional pages.
Will be ignored, if no query is set, i.e. the value is zero.
If a query is set, it will overwrite Extractor::Config::Entries::pagingStep, which will no longer be used to determine the number of the next page.
Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
| std::uint64_t crawlservpp::Module::Extractor::Config::Entries::pagingNumberFrom {} |
Query to determine the total number of pages from the content of the first page.
Will be ignored, if no query is set, i.e. the value is zero.
If a query is set, it will overwrite Extractor::Config::Entries::pagingStep, which will no longer be used to determine the number of the next page.
Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
| std::int64_t crawlservpp::Module::Extractor::Config::Entries::pagingStep {1} |
Number to add to page variable for retrieving the next page, if a page number is used.
Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
| std::string crawlservpp::Module::Extractor::Config::Entries::pagingVariable {defaultPagingVariable} |
Name of the paging variable.
To be used in Extractor::Config::Entries::sourceUrl, Extractor::Config::Entries::sourceCookies, and Extractor::Config::Entries::sourceHeaders. Will be overwritten with either the number, or the name of the current page.
Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
| std::string crawlservpp::Module::Extractor::Config::Entries::sourceCookies |
Custom HTTP Cookie header used when retrieving data.
Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
| std::vector<std::string> crawlservpp::Module::Extractor::Config::Entries::sourceHeaders |
Custom HTTP headers used when retrieving data.
Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
| std::string crawlservpp::Module::Extractor::Config::Entries::sourceUrl |
URL to retrieve data from.
Example: en.wikipedia.org/wiki/Main_Page.
Referenced by crawlservpp::Module::Extractor::Config::checkOptions(), crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
| std::string crawlservpp::Module::Extractor::Config::Entries::sourceUrlFirst |
URL of the first page to retrieve data from.
Example: en.wikipedia.org/wiki/Main_Page.
Will be ignored when empty.
Referenced by crawlservpp::Module::Extractor::Config::checkOptions(), crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
| bool crawlservpp::Module::Extractor::Config::Entries::sourceUsePost {false} |
Specifies whether to use HTTP POST instead of HTTP GET for extracting data.
When HTTP POST is used, arguments attached to the URL (e.g. ?var1&var2=valueOfVar2) will be sent as arguments of the HTTP POST request instead of parts of the URL.
Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
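The handling of URL arguments for HTTP POST can be sketched as follows. The function `splitPostArguments` is hypothetical; it assumes that everything after the first question mark is moved from the URL into the body of the POST request.

```cpp
#include <string>
#include <utility>

// Hypothetical helper: splits a URL into the part to request (first)
// and the arguments to send as the POST body (second).
std::pair<std::string, std::string> splitPostArguments(const std::string& url) {
    const std::string::size_type pos = url.find('?');

    if (pos == std::string::npos) {
        return {url, std::string{}}; // no arguments attached to the URL
    }

    return {url.substr(0, pos), url.substr(pos + 1)};
}
```

For `example.com/search?var1&var2=valueOfVar2`, this yields `example.com/search` as the URL and `var1&var2=valueOfVar2` as the POST arguments.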
| std::vector<std::string> crawlservpp::Module::Extractor::Config::Entries::variablesAlias |
Alias for the variable with same array index.
Variable aliases allow additions to (and subtractions from, via negative values) the value of variables. The name of the variable alias will be replaced with the resulting value.
Referenced by crawlservpp::Module::Extractor::Config::checkOptions(), crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
| std::vector<std::int64_t> crawlservpp::Module::Extractor::Config::Entries::variablesAliasAdd |
Value to add to the variable alias with the same array index.
Use negative values to subtract from the original value.
Referenced by crawlservpp::Module::Extractor::Config::checkOptions(), crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
| std::vector<std::string> crawlservpp::Module::Extractor::Config::Entries::variablesDateTimeFormat |
Date/time format to be used for the variable with the same array index.
If empty, no date/time conversion will be performed.
See Howard E. Hinnant's C++ date.h library documentation for details.
Set a string to UNIX to parse Unix timestamps, i.e. seconds since the Unix epoch, instead.
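What parsing a Unix timestamp amounts to is sketched below, using only the C++ standard library rather than date.h. The function name and the output format (`YYYY-MM-DD HH:MM:SS`, in UTC) are assumptions made for illustration, not necessarily what crawlserv++ produces.

```cpp
#include <cstddef>
#include <ctime>
#include <string>

// Hypothetical helper: converts seconds since the Unix epoch, given as
// a string, into a UTC date/time string.
// Throws std::invalid_argument or std::out_of_range (from std::stoll)
// if the value is not a valid number.
std::string unixToDateTime(const std::string& value) {
    const std::time_t seconds = static_cast<std::time_t>(std::stoll(value));

    // note: std::gmtime is not thread-safe, which is acceptable for a sketch
    const std::tm* utc = std::gmtime(&seconds);

    if (utc == nullptr) {
        return {}; // timestamp could not be converted
    }

    char buffer[20]; // "YYYY-MM-DD HH:MM:SS" plus terminating zero

    const std::size_t written = std::strftime(
            buffer,
            sizeof(buffer),
            "%Y-%m-%d %H:%M:%S",
            utc
    );

    return std::string(buffer, written);
}
```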
Referenced by crawlservpp::Module::Extractor::Config::checkOptions(), crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
| std::vector<std::string> crawlservpp::Module::Extractor::Config::Entries::variablesDateTimeLocale |
Date/time locale to be used for the variable with the same array index.
Will be ignored, if no corresponding date/time format is given.
Referenced by crawlservpp::Module::Extractor::Config::checkOptions(), crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
| std::vector<std::string> crawlservpp::Module::Extractor::Config::Entries::variablesName |
Variable names.
Strings to be replaced by the respective variable values in Extractor::Config::Entries::variablesTokensSource, Extractor::Config::Entries::variablesTokenHeaders, Extractor::Config::Entries::sourceUrl, Extractor::Config::Entries::sourceCookies, and Extractor::Config::Entries::sourceHeaders.
Referenced by crawlservpp::Module::Extractor::Config::checkOptions(), crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
| std::vector<std::string> crawlservpp::Module::Extractor::Config::Entries::variablesParsedColumn |
Parsed column for the value of the variable with the same array index.
Referenced by crawlservpp::Module::Extractor::Config::checkOptions(), crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
| std::vector<std::string> crawlservpp::Module::Extractor::Config::Entries::variablesParsedTable |
Name of the table containing the parsed data for the variable with the same array index.
Referenced by crawlservpp::Module::Extractor::Config::checkOptions(), crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
| std::vector<std::uint64_t> crawlservpp::Module::Extractor::Config::Entries::variablesQuery |
Query on the content or URL for the variable with the same array index.
Referenced by crawlservpp::Module::Extractor::Config::checkOptions(), crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
| std::vector<std::uint64_t> crawlservpp::Module::Extractor::Config::Entries::variablesSkipQuery |
Queries to be used on the value of the variable with the same array index to determine whether to skip the current URL.
Referenced by crawlservpp::Module::Extractor::Config::checkOptions(), crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
| std::vector<std::uint8_t> crawlservpp::Module::Extractor::Config::Entries::variablesSource |
Source of the variable with the same array index.
Determines whether to use the table column stored in Extractor::Config::Entries::variablesParsedTable and Extractor::Config::Entries::variablesParsedColumn, or the query stored in Extractor::Config::Entries::variablesQuery to determine the value of the variable with the same array index.
Referenced by crawlservpp::Module::Extractor::Config::checkOptions(), crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
| std::vector<std::string> crawlservpp::Module::Extractor::Config::Entries::variablesTokenHeaders |
Custom HTTP headers to be used for ALL token variables.
Referenced by crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
| std::vector<std::string> crawlservpp::Module::Extractor::Config::Entries::variablesTokens |
List of token variables.
Strings to be replaced with the value of the respective token variable in Extractor::Config::Entries::sourceUrl, Extractor::Config::Entries::sourceCookies, and Extractor::Config::Entries::sourceHeaders.
The values of token variables are determined by requesting data from external sources.
Referenced by crawlservpp::Module::Extractor::Config::checkOptions(), crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
| std::vector<std::string> crawlservpp::Module::Extractor::Config::Entries::variablesTokensCookies |
Custom HTTP Cookie header for the token variable with the same array index.
Referenced by crawlservpp::Module::Extractor::Config::checkOptions(), crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
| std::vector<std::uint64_t> crawlservpp::Module::Extractor::Config::Entries::variablesTokensQuery |
Query to extract token variable with the same array index.
Referenced by crawlservpp::Module::Extractor::Config::checkOptions(), crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
| std::vector<std::string> crawlservpp::Module::Extractor::Config::Entries::variablesTokensSource |
Source URL for the token variable with the same array index.
Example: en.wikipedia.org/wiki/Main_Page.
To retrieve the content of the URL, the headers specified in Extractor::Config::Entries::variablesTokenHeaders, and the cookies specified in the string with the same array index in Extractor::Config::Entries::variablesTokensCookies will be used.
Extractor::Config::Entries::variablesTokensUsePost specifies whether to use HTTP POST, instead of HTTP GET, when retrieving the content. When HTTP POST is used, arguments attached to the URL (e.g. ?var1&var2=valueOfVar2) will be sent as arguments of the HTTP POST request instead of parts of the URL.
Afterwards, the query with the same array index in Extractor::Config::Entries::variablesTokensQuery will be used to determine the value of the respective token variable.
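The sequence described above (retrieve the source, then run the query on its content) can be sketched with injected callbacks. The struct `TokenSettings`, the type aliases, and `resolveToken` are hypothetical; the actual networking and query code of crawlserv++ is not shown, and the shared token headers are omitted for brevity.

```cpp
#include <functional>
#include <string>

// Hypothetical bundle of the per-token options with the same array index.
struct TokenSettings {
    std::string source;  // entry from variablesTokensSource
    std::string cookies; // entry from variablesTokensCookies
    bool usePost;        // entry from variablesTokensUsePost
};

// Assumed callback that retrieves the content of a URL.
using Fetch = std::function<
        std::string(const std::string& url, const std::string& cookies, bool usePost)
>;

// Assumed callback that runs the token query on the retrieved content.
using Query = std::function<std::string(const std::string& content)>;

// Hypothetical helper: determines the value of one token variable.
std::string resolveToken(
        const TokenSettings& token,
        const Fetch& fetch,
        const Query& query
) {
    // first retrieve the content, then extract the token value from it
    const std::string content = fetch(token.source, token.cookies, token.usePost);

    return query(content);
}
```

Passing the callbacks in keeps the sketch independent of any particular HTTP or query library, and allows the flow to be tested with simple lambdas.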
Referenced by crawlservpp::Module::Extractor::Config::checkOptions(), crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().
| std::vector<bool> crawlservpp::Module::Extractor::Config::Entries::variablesTokensUsePost |
Specifies whether to use HTTP POST instead of GET for the token variable with the same array index.
Referenced by crawlservpp::Module::Extractor::Config::checkOptions(), crawlservpp::Module::Extractor::Thread::onReset(), and crawlservpp::Module::Extractor::Config::parseOption().