crawlserv++  [under development]
Application for crawling and analyzing textual content of websites.
crawlservpp::Module::Parser::Config::Entries Struct Reference

Configuration entries for parser threads.

#include <Config.hpp>

Parser Configuration

std::uint64_t generalCacheSize {defaultCacheSize}
 Number of URLs fetched and parsed before saving results.

std::uint64_t generalDbTimeOut {}
 Timeout on MySQL query execution, in milliseconds.

std::uint32_t generalLock {defaultLockS}
 URL locking time, in seconds.

std::uint8_t generalLogging {generalLoggingDefault}
 Level of logging activity.

std::uint16_t generalMaxBatchSize {defaultMaxBatchSize}
 Maximum number of URLs processed in one MySQL query.

bool generalNewestOnly {true}
 Specifies whether to parse only the newest content for each URL.

bool generalParseCustom {false}
 Specifies whether to include custom URLs when parsing.

bool generalReParse {false}
 Specifies whether to re-parse already parsed URLs.

std::string generalResultTable
 Table name to save parsed data to.

std::vector< std::uint64_t > generalSkip
 Queries on URLs that will not be parsed.

std::uint64_t generalSleepIdle {defaultSleepIdleMs}
 Time to wait before checking for new URLs when all URLs have been parsed, in milliseconds.

std::uint64_t generalSleepMySql {defaultSleepMySqlS}
 Time to wait before the last attempt to re-connect to the MySQL server, in seconds.

bool generalTiming {false}
 Specifies whether to calculate timing statistics.
 

Parsing

std::vector< std::uint64_t > parsingContentIgnoreQueries
 Content matching one of these queries will be excluded from parsing.

std::vector< std::string > parsingDateTimeFormats
 Format of the date/time to be parsed by the date/time query with the same array index.

std::vector< std::string > parsingDateTimeLocales
 Locale to be used by the date/time query with the same array index.

std::vector< std::uint64_t > parsingDateTimeQueries
 Queries used for parsing the date/time.

std::vector< std::uint16_t > parsingDateTimeSources
 Where to parse the date/time from – the URL itself, or the crawled content belonging to the URL.

bool parsingDateTimeWarningEmpty {true}
 Specifies whether to write a warning to the log if no date/time could be parsed although a query is specified.

std::vector< std::string > parsingFieldDateTimeFormats
 Date/time format of the field with the same array index.

std::vector< std::string > parsingFieldDateTimeLocales
 Locale to be used by the query with the same array index.

std::vector< char > parsingFieldDelimiters
 Delimiter between multiple results for the field with the same array index, if not saved as JSON.

std::vector< bool > parsingFieldIgnoreEmpty
 Specifies whether to ignore empty values when parsing multiple results for the field with the same array index.

std::vector< bool > parsingFieldJSON
 Specifies whether to save the value of the field with the same array index as a JSON array.

std::vector< std::string > parsingFieldNames
 Name of the field with the same array index.

std::vector< std::uint64_t > parsingFieldQueries
 Query for the field with the same array index.

std::vector< std::uint8_t > parsingFieldSources
 Source of the field with the same array index – the URL itself, or the crawled content belonging to the URL.

std::vector< bool > parsingFieldTidyTexts
 Specifies whether to remove line breaks and unnecessary whitespace when parsing the field with the same array index.

std::vector< bool > parsingFieldWarningsEmpty
 Specifies whether to write a warning to the log if the field with the same array index is empty.

std::vector< std::string > parsingIdIgnore
 Parsed IDs to be ignored.

std::vector< std::uint64_t > parsingIdQueries
 Queries to parse the ID.

std::vector< std::uint8_t > parsingIdSources
 Where to parse the ID from when using the ID query with the same array index – the URL itself, or the crawled content belonging to the URL.

bool parsingRepairCData {true}
 Specifies whether to (try to) repair CData when parsing HTML/XML.

bool parsingRepairComments {true}
 Specifies whether to (try to) repair broken HTML/XML comments.

bool parsingRemoveXmlInstructions {true}
 Specifies whether to remove XML processing instructions (<?xml:...>) before parsing HTML content.

std::uint16_t parsingTidyErrors {}
 Number of tidyhtml errors to write to the log.

bool parsingTidyWarnings {false}
 Specifies whether to write tidyhtml warnings to the log.
 

Detailed Description

Configuration entries for parser threads.

Warning
Changing the configuration requires updating json/parser.json in crawlserv_frontend!

Member Data Documentation

◆ generalCacheSize

std::uint64_t crawlservpp::Module::Parser::Config::Entries::generalCacheSize {defaultCacheSize}

Number of URLs fetched and parsed before saving results.

Set to zero to cache all URLs at once.

Referenced by crawlservpp::Module::Parser::Thread::onReset(), and crawlservpp::Module::Parser::Config::parseOption().

◆ generalDbTimeOut

std::uint64_t crawlservpp::Module::Parser::Config::Entries::generalDbTimeOut {}

Timeout on MySQL query execution, in milliseconds.

Referenced by crawlservpp::Module::Parser::Thread::onReset(), and crawlservpp::Module::Parser::Config::parseOption().

◆ generalLock

std::uint32_t crawlservpp::Module::Parser::Config::Entries::generalLock {defaultLockS}

URL locking time, in seconds.

◆ generalLogging

std::uint8_t crawlservpp::Module::Parser::Config::Entries::generalLogging {generalLoggingDefault}

Level of logging activity.

◆ generalMaxBatchSize

std::uint16_t crawlservpp::Module::Parser::Config::Entries::generalMaxBatchSize {defaultMaxBatchSize}

Maximum number of URLs processed in one MySQL query.

Referenced by crawlservpp::Module::Parser::Thread::onReset(), and crawlservpp::Module::Parser::Config::parseOption().

◆ generalNewestOnly

bool crawlservpp::Module::Parser::Config::Entries::generalNewestOnly {true}

Specifies whether to parse only the newest content for each URL.

Referenced by crawlservpp::Module::Parser::Thread::onReset(), and crawlservpp::Module::Parser::Config::parseOption().

◆ generalParseCustom

bool crawlservpp::Module::Parser::Config::Entries::generalParseCustom {false}

Specifies whether to include custom URLs when parsing.

Referenced by crawlservpp::Module::Parser::Thread::onReset(), and crawlservpp::Module::Parser::Config::parseOption().

◆ generalReParse

bool crawlservpp::Module::Parser::Config::Entries::generalReParse {false}

Specifies whether to re-parse already parsed URLs.

Referenced by crawlservpp::Module::Parser::Thread::onReset(), and crawlservpp::Module::Parser::Config::parseOption().

◆ generalResultTable

std::string crawlservpp::Module::Parser::Config::Entries::generalResultTable

Table name to save parsed data to.

◆ generalSkip

std::vector<std::uint64_t> crawlservpp::Module::Parser::Config::Entries::generalSkip

Queries on URLs that will not be parsed.

Referenced by crawlservpp::Module::Parser::Thread::onReset(), and crawlservpp::Module::Parser::Config::parseOption().

◆ generalSleepIdle

std::uint64_t crawlservpp::Module::Parser::Config::Entries::generalSleepIdle {defaultSleepIdleMs}

Time to wait before checking for new URLs when all URLs have been parsed, in milliseconds.

Referenced by crawlservpp::Module::Parser::Thread::onTick(), and crawlservpp::Module::Parser::Config::parseOption().

◆ generalSleepMySql

std::uint64_t crawlservpp::Module::Parser::Config::Entries::generalSleepMySql {defaultSleepMySqlS}

Time to wait before the last attempt to re-connect to the MySQL server, in seconds.

Referenced by crawlservpp::Module::Parser::Thread::onReset(), and crawlservpp::Module::Parser::Config::parseOption().

◆ generalTiming

bool crawlservpp::Module::Parser::Config::Entries::generalTiming {false}

Specifies whether to calculate timing statistics.

◆ parsingContentIgnoreQueries

std::vector<std::uint64_t> crawlservpp::Module::Parser::Config::Entries::parsingContentIgnoreQueries

Content matching one of these queries will be excluded from parsing.

Referenced by crawlservpp::Module::Parser::Thread::onReset(), and crawlservpp::Module::Parser::Config::parseOption().

◆ parsingDateTimeFormats

std::vector<std::string> crawlservpp::Module::Parser::Config::Entries::parsingDateTimeFormats

Format of the date/time to be parsed by the date/time query with the same array index.

If not specified, the format %F %T, i.e. YYYY-MM-DD HH:MM:SS, will be used.

See Howard E. Hinnant's C++ date.h library documentation for details.

Set a string to UNIX to parse Unix timestamps, i.e. seconds since the Unix epoch, instead.

See also
parsingDateTimeSources, parsingDateTimeQueries, parsingDateTimeLocales, Helper::DateTime::convertCustomDateTimeToSQLTimeStamp

Referenced by crawlservpp::Module::Parser::Config::checkOptions(), crawlservpp::Module::Parser::Thread::onReset(), and crawlservpp::Module::Parser::Config::parseOption().
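
The format handling described above can be illustrated with a minimal sketch. It assumes only the C++ standard library instead of date.h, so it accepts just the conversion specifiers that std::get_time understands and ignores locales. The function name toSqlTimeStamp is hypothetical and is not part of crawlserv++.

```cpp
#include <cassert>
#include <ctime>
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper (not the crawlserv++ API): converts `value` to
// "YYYY-MM-DD HH:MM:SS" according to `format`; returns false on failure.
bool toSqlTimeStamp(const std::string& value, std::string format, std::string& out) {
    std::tm tm{};

    if(format == "UNIX") {
        // interpret the value as seconds since the Unix epoch (UTC)
        std::istringstream in(value);
        long long raw{};

        if(!(in >> raw)) {
            return false;
        }

        const std::time_t seconds{static_cast<std::time_t>(raw)};
        const std::tm* utc{std::gmtime(&seconds)};

        if(utc == nullptr) {
            return false;
        }

        tm = *utc;
    }
    else {
        // default corresponds to %F %T, spelled out here because
        // std::get_time is not specified to accept %F
        if(format.empty()) {
            format = "%Y-%m-%d %H:%M:%S";
        }

        std::istringstream in(value);

        in >> std::get_time(&tm, format.c_str());

        if(in.fail()) {
            return false;
        }
    }

    std::ostringstream result;

    result << std::put_time(&tm, "%Y-%m-%d %H:%M:%S");

    out = result.str();

    return true;
}
```

Called with an empty format, the sketch falls back to the default layout; called with UNIX, it treats the value as a number of seconds.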

◆ parsingDateTimeLocales

std::vector<std::string> crawlservpp::Module::Parser::Config::Entries::parsingDateTimeLocales

Locale to be used by the date/time query with the same array index.

◆ parsingDateTimeQueries

std::vector<std::uint64_t> crawlservpp::Module::Parser::Config::Entries::parsingDateTimeQueries

Queries used for parsing the date/time.

The first query that returns a non-empty result will be used.

See also
parsingDateTimeSources

Referenced by crawlservpp::Module::Parser::Config::checkOptions(), crawlservpp::Module::Parser::Thread::onReset(), and crawlservpp::Module::Parser::Config::parseOption().

◆ parsingDateTimeSources

std::vector<std::uint16_t> crawlservpp::Module::Parser::Config::Entries::parsingDateTimeSources

Where to parse the date/time from – the URL itself, or the crawled content belonging to the URL.

See also
parsingSourceUrl, parsingSourceContent, parsingDateTimeQueries

Referenced by crawlservpp::Module::Parser::Config::checkOptions(), crawlservpp::Module::Parser::Thread::onReset(), and crawlservpp::Module::Parser::Config::parseOption().

◆ parsingDateTimeWarningEmpty

bool crawlservpp::Module::Parser::Config::Entries::parsingDateTimeWarningEmpty {true}

Specifies whether to write a warning to the log if no date/time could be parsed although a query is specified.

Note
Logging needs to be enabled in order for this option to have any effect.

Referenced by crawlservpp::Module::Parser::Thread::onReset(), and crawlservpp::Module::Parser::Config::parseOption().

◆ parsingFieldDateTimeFormats

std::vector<std::string> crawlservpp::Module::Parser::Config::Entries::parsingFieldDateTimeFormats

Date/time format of the field with the same array index.

If not specified, no date/time conversion will be performed.

See Howard E. Hinnant's C++ date.h library documentation for details.

Set a string to UNIX to parse Unix timestamps, i.e. seconds since the Unix epoch, instead.

See also
parsingFieldQueries, parsingFieldDateTimeLocales, Helper::DateTime::convertCustomDateTimeToSQLTimeStamp

Referenced by crawlservpp::Module::Parser::Config::checkOptions(), crawlservpp::Module::Parser::Thread::onReset(), and crawlservpp::Module::Parser::Config::parseOption().

◆ parsingFieldDateTimeLocales

std::vector<std::string> crawlservpp::Module::Parser::Config::Entries::parsingFieldDateTimeLocales

Locale to be used by the query with the same array index.

◆ parsingFieldDelimiters

std::vector<char> crawlservpp::Module::Parser::Config::Entries::parsingFieldDelimiters

Delimiter between multiple results for the field with the same array index, if not saved as JSON.

Only the first character of the string is used, unless the string is one of the escape sequences \n (default), \t, or \\.

Referenced by crawlservpp::Module::Parser::Config::checkOptions(), crawlservpp::Module::Parser::Thread::onReset(), and crawlservpp::Module::Parser::Config::parseOption().
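
One way to read the delimiter rule is sketched below. The function resolveDelimiter is hypothetical, and treating an empty string as the default \n is an assumption; the actual parsing in Config::parseOption() may differ.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: maps a delimiter option string to a single character.
char resolveDelimiter(const std::string& option) {
    // assumption: an empty option falls back to the default, a line break
    if(option.empty() || option == "\\n") {
        return '\n';
    }

    if(option == "\\t") {
        return '\t';
    }

    if(option == "\\\\") {
        return '\\';
    }

    // otherwise, only the first character is used
    return option.front();
}
```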

◆ parsingFieldIgnoreEmpty

std::vector<bool> crawlservpp::Module::Parser::Config::Entries::parsingFieldIgnoreEmpty

Specifies whether to ignore empty values when parsing multiple results for the field with the same array index.

Enabled by default.

Referenced by crawlservpp::Module::Parser::Config::checkOptions(), crawlservpp::Module::Parser::Thread::onReset(), and crawlservpp::Module::Parser::Config::parseOption().
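
Because the field options are parallel arrays matched by index, they only make sense if all of them have one entry per field. The following sketch shows one possible way to align them to the number of field names while filling in defaults. The struct FieldOptions and the function alignFieldOptions are hypothetical, and this page does not state whether Config::checkOptions() behaves this way.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical subset of the per-field options, for illustration only.
struct FieldOptions {
    std::vector<std::string> names;
    std::vector<char> delimiters;
    std::vector<bool> ignoreEmpty;
    std::vector<bool> json;
};

// Resizes every per-field array to the number of field names.
// Missing entries receive a default value, surplus entries are dropped.
void alignFieldOptions(FieldOptions& options) {
    const std::size_t count{options.names.size()};

    options.delimiters.resize(count, '\n'); // default delimiter: line break
    options.ignoreEmpty.resize(count, true); // enabled by default
    options.json.resize(count, false);       // assumption: disabled by default
}
```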

◆ parsingFieldJSON

std::vector<bool> crawlservpp::Module::Parser::Config::Entries::parsingFieldJSON

Specifies whether to save the value of the field with the same array index as a JSON array.

Referenced by crawlservpp::Module::Parser::Config::checkOptions(), crawlservpp::Module::Parser::Thread::onReset(), and crawlservpp::Module::Parser::Config::parseOption().
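
How the delimiter, the empty-value handling, and the JSON option might interact for a single field is sketched below. The names toJsonString and combineResults are hypothetical, and the JSON escaping is deliberately minimal; it is not the serialization used by crawlserv++.

```cpp
#include <cassert>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helper: encodes a value as a JSON string literal.
std::string toJsonString(const std::string& value) {
    std::string out{"\""};

    for(const char c : value) {
        switch(c) {
        case '"': out += "\\\""; break;
        case '\\': out += "\\\\"; break;
        case '\n': out += "\\n"; break;
        case '\r': out += "\\r"; break;
        case '\t': out += "\\t"; break;
        default:
            if(static_cast<unsigned char>(c) < 0x20) {
                // escape remaining control characters as \u00XX
                char buffer[7];

                std::snprintf(
                        buffer,
                        sizeof(buffer),
                        "\\u%04x",
                        static_cast<unsigned>(static_cast<unsigned char>(c))
                );

                out += buffer;
            }
            else {
                out += c;
            }
        }
    }

    out += '"';

    return out;
}

// Hypothetical helper: combines multiple results of one field into one value,
// either as a JSON array or joined by the given delimiter.
std::string combineResults(
        const std::vector<std::string>& values,
        bool asJson,
        char delimiter,
        bool ignoreEmpty
) {
    std::string out;
    bool first{true};

    for(const auto& value : values) {
        if(ignoreEmpty && value.empty()) {
            continue; // skip empty values if requested
        }

        if(!first) {
            out += asJson ? ',' : delimiter;
        }

        out += asJson ? toJsonString(value) : value;
        first = false;
    }

    return asJson ? "[" + out + "]" : out;
}
```

In this sketch the delimiter is ignored whenever the JSON option is set, matching the remark that delimiters apply only if the field is not saved as JSON.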

◆ parsingFieldNames

std::vector<std::string> crawlservpp::Module::Parser::Config::Entries::parsingFieldNames

Name of the field with the same array index.

◆ parsingFieldQueries

std::vector<std::uint64_t> crawlservpp::Module::Parser::Config::Entries::parsingFieldQueries

Query for the field with the same array index.

◆ parsingFieldSources

std::vector<std::uint8_t> crawlservpp::Module::Parser::Config::Entries::parsingFieldSources

Source of the field with the same array index – the URL itself, or the crawled content belonging to the URL.

See also
parsingSourceUrl, parsingSourceContent, parsingFieldQueries

Referenced by crawlservpp::Module::Parser::Config::checkOptions(), crawlservpp::Module::Parser::Thread::onReset(), and crawlservpp::Module::Parser::Config::parseOption().

◆ parsingFieldTidyTexts

std::vector<bool> crawlservpp::Module::Parser::Config::Entries::parsingFieldTidyTexts

Specifies whether to remove line breaks and unnecessary whitespace when parsing the field with the same array index.

Referenced by crawlservpp::Module::Parser::Config::checkOptions(), crawlservpp::Module::Parser::Thread::onReset(), and crawlservpp::Module::Parser::Config::parseOption().
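
A minimal sketch of what such tidying could look like follows. It assumes that "unnecessary" means leading, trailing, and repeated whitespace, which this page does not define; the function tidyText is hypothetical.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helper: replaces every run of whitespace (including line
// breaks) with a single space and trims both ends of the text.
std::string tidyText(const std::string& text) {
    std::string out;
    bool pendingSpace{false};

    for(const char c : text) {
        if(std::isspace(static_cast<unsigned char>(c))) {
            // remember the gap, but never start the output with a space
            pendingSpace = !out.empty();
        }
        else {
            if(pendingSpace) {
                out += ' ';
            }

            out += c;
            pendingSpace = false;
        }
    }

    return out;
}
```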

◆ parsingFieldWarningsEmpty

std::vector<bool> crawlservpp::Module::Parser::Config::Entries::parsingFieldWarningsEmpty

Specifies whether to write a warning to the log if the field with the same array index is empty.

Note
Logging needs to be enabled in order for this option to have any effect.

Referenced by crawlservpp::Module::Parser::Config::checkOptions(), crawlservpp::Module::Parser::Thread::onReset(), and crawlservpp::Module::Parser::Config::parseOption().

◆ parsingIdIgnore

std::vector<std::string> crawlservpp::Module::Parser::Config::Entries::parsingIdIgnore

Parsed IDs to be ignored.

◆ parsingIdQueries

std::vector<std::uint64_t> crawlservpp::Module::Parser::Config::Entries::parsingIdQueries

Queries to parse the ID.

The first query that returns a non-empty result will be used. Datasets with duplicate or empty IDs will not be parsed.

See also
parsingIdSources

Referenced by crawlservpp::Module::Parser::Config::checkOptions(), crawlservpp::Module::Parser::Thread::onReset(), and crawlservpp::Module::Parser::Config::parseOption().
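
The selection rule above (first non-empty result wins, empty or duplicate IDs are rejected) can be sketched as follows. The function selectId is hypothetical, and it assumes the query results have already been computed, one string per ID query; the handling of parsingIdIgnore is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper: results[i] is the result of the ID query with index i.
// Returns false if the dataset should not be parsed.
bool selectId(
        const std::vector<std::string>& results,
        std::set<std::string>& seenIds,
        std::string& id
) {
    // the first query that returned a non-empty result is used
    const auto it = std::find_if(
            results.begin(),
            results.end(),
            [](const std::string& result) {
                return !result.empty();
            }
    );

    if(it == results.end()) {
        return false; // empty ID
    }

    if(!seenIds.insert(*it).second) {
        return false; // duplicate ID
    }

    id = *it;

    return true;
}
```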

◆ parsingIdSources

std::vector<std::uint8_t> crawlservpp::Module::Parser::Config::Entries::parsingIdSources

Where to parse the ID from when using the ID query with the same array index – the URL itself, or the crawled content belonging to the URL.

See also
parsingSourceUrl, parsingSourceContent, parsingIdQueries

Referenced by crawlservpp::Module::Parser::Config::checkOptions(), crawlservpp::Module::Parser::Thread::onReset(), and crawlservpp::Module::Parser::Config::parseOption().

◆ parsingRemoveXmlInstructions

bool crawlservpp::Module::Parser::Config::Entries::parsingRemoveXmlInstructions {true}

Specifies whether to remove XML processing instructions (<?xml:...>) before parsing HTML content.

Referenced by crawlservpp::Module::Parser::Thread::onReset(), and crawlservpp::Module::Parser::Config::parseOption().
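
As an illustration of this clean-up step, the sketch below removes everything from each "<?xml:" up to the next ">". That matching rule and the function name removeXmlInstructions are assumptions made for this example; the parser's own implementation may recognize such instructions differently.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: strips "<?xml:...>" instructions from the content.
std::string removeXmlInstructions(std::string content) {
    const std::string begin{"<?xml:"};
    std::string::size_type pos{0};

    while((pos = content.find(begin, pos)) != std::string::npos) {
        const auto end = content.find('>', pos + begin.size());

        if(end == std::string::npos) {
            break; // unterminated instruction: leave the rest untouched
        }

        content.erase(pos, end - pos + 1);
    }

    return content;
}
```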

◆ parsingRepairCData

bool crawlservpp::Module::Parser::Config::Entries::parsingRepairCData {true}

Specifies whether to (try to) repair CData when parsing HTML/XML.

Referenced by crawlservpp::Module::Parser::Thread::onReset(), and crawlservpp::Module::Parser::Config::parseOption().

◆ parsingRepairComments

bool crawlservpp::Module::Parser::Config::Entries::parsingRepairComments {true}

Specifies whether to (try to) repair broken HTML/XML comments.

Referenced by crawlservpp::Module::Parser::Thread::onReset(), and crawlservpp::Module::Parser::Config::parseOption().

◆ parsingTidyErrors

std::uint16_t crawlservpp::Module::Parser::Config::Entries::parsingTidyErrors {}

Number of tidyhtml errors to write to the log.

Note
Logging needs to be enabled in order for this option to have any effect.

Referenced by crawlservpp::Module::Parser::Config::parseOption().

◆ parsingTidyWarnings

bool crawlservpp::Module::Parser::Config::Entries::parsingTidyWarnings {false}

Specifies whether to write tidyhtml warnings to the log.

Note
Logging needs to be enabled in order for this option to have any effect.

Referenced by crawlservpp::Module::Parser::Thread::onReset(), and crawlservpp::Module::Parser::Config::parseOption().


The documentation for this struct was generated from the following file: