crawlserv++
[under development]
Application for crawling and analyzing textual content of websites.
Configuration entries for parser threads.
#include <Config.hpp>
Parser Configuration
| std::uint64_t | generalCacheSize {defaultCacheSize} | Number of URLs fetched and parsed before saving results. |
| std::uint64_t | generalDbTimeOut {} | Timeout on MySQL query execution, in milliseconds. |
| std::uint32_t | generalLock {defaultLockS} | URL locking time, in seconds. |
| std::uint8_t | generalLogging {generalLoggingDefault} | Level of logging activity. |
| std::uint16_t | generalMaxBatchSize {defaultMaxBatchSize} | Maximum number of URLs processed in one MySQL query. |
| bool | generalNewestOnly {true} | Specifies whether to parse only the newest content for each URL. |
| bool | generalParseCustom {false} | Specifies whether to include custom URLs when parsing. |
| bool | generalReParse {false} | Specifies whether to re-parse already parsed URLs. |
| std::string | generalResultTable | Table name to save parsed data to. |
| std::vector< std::uint64_t > | generalSkip | Queries on URLs that will not be parsed. |
| std::uint64_t | generalSleepIdle {defaultSleepIdleMs} | Time to wait before checking for new URLs when all URLs have been parsed, in milliseconds. |
| std::uint64_t | generalSleepMySql {defaultSleepMySqlS} | Time to wait before the last attempt to reconnect to the MySQL server, in seconds. |
| bool | generalTiming {false} | Specifies whether to calculate timing statistics. |
Parsing
| std::vector< std::uint64_t > | parsingContentIgnoreQueries | Content matching one of these queries will be excluded from parsing. |
| std::vector< std::string > | parsingDateTimeFormats | Format of the date/time to be parsed by the date/time query with the same array index. |
| std::vector< std::string > | parsingDateTimeLocales | Locale to be used by the date/time query with the same array index. |
| std::vector< std::uint64_t > | parsingDateTimeQueries | Queries used for parsing the date/time. |
| std::vector< std::uint16_t > | parsingDateTimeSources | Where to parse the date/time from – the URL itself, or the crawled content belonging to the URL. |
| bool | parsingDateTimeWarningEmpty {true} | Specifies whether to write a warning to the log if no date/time could be parsed although a query is specified. |
| std::vector< std::string > | parsingFieldDateTimeFormats | Date/time format of the field with the same array index. |
| std::vector< std::string > | parsingFieldDateTimeLocales | Locale to be used by the query with the same array index. |
| std::vector< char > | parsingFieldDelimiters | Delimiter between multiple results for the field with the same array index, if not saved as JSON. |
| std::vector< bool > | parsingFieldIgnoreEmpty | Specifies whether to ignore empty values when parsing multiple results for the field with the same array index. |
| std::vector< bool > | parsingFieldJSON | Specifies whether to save the value of the field with the same array index as a JSON array. |
| std::vector< std::string > | parsingFieldNames | Name of the field with the same array index. |
| std::vector< std::uint64_t > | parsingFieldQueries | Query for the field with the same array index. |
| std::vector< std::uint8_t > | parsingFieldSources | Source of the field with the same array index – the URL itself, or the crawled content belonging to the URL. |
| std::vector< bool > | parsingFieldTidyTexts | Specifies whether to remove line breaks and unnecessary whitespace when parsing the field with the same array index. |
| std::vector< bool > | parsingFieldWarningsEmpty | Specifies whether to write a warning to the log if the field with the same array index is empty. |
| std::vector< std::string > | parsingIdIgnore | Parsed IDs to be ignored. |
| std::vector< std::uint64_t > | parsingIdQueries | Queries to parse the ID. |
| std::vector< std::uint8_t > | parsingIdSources | Where to parse the ID from when using the ID query with the same array index – the URL itself, or the crawled content belonging to the URL. |
| bool | parsingRepairCData {true} | Specifies whether to (try to) repair CData when parsing HTML/XML. |
| bool | parsingRepairComments {true} | Specifies whether to (try to) repair broken HTML/XML comments. |
| bool | parsingRemoveXmlInstructions {true} | Specifies whether to remove XML processing instructions (<?xml:...>) before parsing HTML content. |
| std::uint16_t | parsingTidyErrors {} | Number of tidyhtml errors to write to the log. |
| bool | parsingTidyWarnings {false} | Specifies whether to write tidyhtml warnings to the log. |
Configuration entries for parser threads.
json/parser.json in crawlserv_frontend!
| std::uint64_t crawlservpp::Module::Parser::Config::Entries::generalCacheSize {defaultCacheSize} |
Number of URLs fetched and parsed before saving results.
Set to zero to cache all URLs at once.
Referenced by crawlservpp::Module::Parser::Thread::onReset(), and crawlservpp::Module::Parser::Config::parseOption().
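One way to read the cache size semantics described above is sketched below. The function shouldSaveResults and its parameters are hypothetical and are not taken from crawlserv++; the sketch assumes that a cache size of zero means results are only saved once all URLs have been processed.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not part of crawlserv++): decides whether the
// results cached so far should be written to the database now.
//
// Assumption: cacheSize == 0 means "cache all URLs at once", i.e. results
// are only saved after every available URL has been parsed.
bool shouldSaveResults(
        std::uint64_t cacheSize,
        std::size_t cachedResults,
        bool allUrlsDone
) {
    if (allUrlsDone) {
        // nothing more to parse: flush whatever is left in the cache
        return cachedResults > 0;
    }

    // a limited cache is flushed as soon as it is full
    return cacheSize > 0 && cachedResults >= cacheSize;
}
```

With a cache size of 10, this sketch saves after every tenth result; with a cache size of 0, it saves only when allUrlsDone is true.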
| std::uint64_t crawlservpp::Module::Parser::Config::Entries::generalDbTimeOut {} |
Timeout on MySQL query execution, in milliseconds.
Referenced by crawlservpp::Module::Parser::Thread::onReset(), and crawlservpp::Module::Parser::Config::parseOption().
| std::uint32_t crawlservpp::Module::Parser::Config::Entries::generalLock {defaultLockS} |
URL locking time, in seconds.
Referenced by crawlservpp::Module::Parser::Thread::onReset(), and crawlservpp::Module::Parser::Thread::onTick().
| std::uint8_t crawlservpp::Module::Parser::Config::Entries::generalLogging {generalLoggingDefault} |
Level of logging activity.
Referenced by crawlservpp::Module::Parser::Thread::onReset(), and crawlservpp::Module::Parser::Config::parseOption().
| std::uint16_t crawlservpp::Module::Parser::Config::Entries::generalMaxBatchSize {defaultMaxBatchSize} |
Maximum number of URLs processed in one MySQL query.
Referenced by crawlservpp::Module::Parser::Thread::onReset(), and crawlservpp::Module::Parser::Config::parseOption().
| bool crawlservpp::Module::Parser::Config::Entries::generalNewestOnly {true} |
Specifies whether to parse only the newest content for each URL.
Referenced by crawlservpp::Module::Parser::Thread::onReset(), and crawlservpp::Module::Parser::Config::parseOption().
| bool crawlservpp::Module::Parser::Config::Entries::generalParseCustom {false} |
Specifies whether to include custom URLs when parsing.
Referenced by crawlservpp::Module::Parser::Thread::onReset(), and crawlservpp::Module::Parser::Config::parseOption().
| bool crawlservpp::Module::Parser::Config::Entries::generalReParse {false} |
Specifies whether to re-parse already parsed URLs.
Referenced by crawlservpp::Module::Parser::Thread::onReset(), and crawlservpp::Module::Parser::Config::parseOption().
| std::string crawlservpp::Module::Parser::Config::Entries::generalResultTable |
Table name to save parsed data to.
Referenced by crawlservpp::Module::Parser::Config::checkOptions(), crawlservpp::Module::Parser::Thread::onReset(), and crawlservpp::Module::Parser::Config::parseOption().
| std::vector<std::uint64_t> crawlservpp::Module::Parser::Config::Entries::generalSkip |
Queries on URLs that will not be parsed.
Referenced by crawlservpp::Module::Parser::Thread::onReset(), and crawlservpp::Module::Parser::Config::parseOption().
| std::uint64_t crawlservpp::Module::Parser::Config::Entries::generalSleepIdle {defaultSleepIdleMs} |
Time to wait before checking for new URLs when all URLs have been parsed, in milliseconds.
Referenced by crawlservpp::Module::Parser::Thread::onTick(), and crawlservpp::Module::Parser::Config::parseOption().
| std::uint64_t crawlservpp::Module::Parser::Config::Entries::generalSleepMySql {defaultSleepMySqlS} |
Time to wait before the last attempt to reconnect to the MySQL server, in seconds.
Referenced by crawlservpp::Module::Parser::Thread::onReset(), and crawlservpp::Module::Parser::Config::parseOption().
| bool crawlservpp::Module::Parser::Config::Entries::generalTiming {false} |
Specifies whether to calculate timing statistics.
Referenced by crawlservpp::Module::Parser::Thread::onReset(), crawlservpp::Module::Parser::Thread::onTick(), and crawlservpp::Module::Parser::Config::parseOption().
| std::vector<std::uint64_t> crawlservpp::Module::Parser::Config::Entries::parsingContentIgnoreQueries |
Content matching one of these queries will be excluded from parsing.
Referenced by crawlservpp::Module::Parser::Thread::onReset(), and crawlservpp::Module::Parser::Config::parseOption().
| std::vector<std::string> crawlservpp::Module::Parser::Config::Entries::parsingDateTimeFormats |
Format of the date/time to be parsed by the date/time query with the same array index.
If not specified, the format %F %T (i.e. YYYY-MM-DD HH:MM:SS) will be used.
See Howard E. Hinnant's C++ date.h library documentation for details.
Set a string to UNIX to parse Unix timestamps, i.e. seconds since the Unix epoch, instead.
Referenced by crawlservpp::Module::Parser::Config::checkOptions(), crawlservpp::Module::Parser::Thread::onReset(), and crawlservpp::Module::Parser::Config::parseOption().
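The format rules above (default format, and the special value UNIX) can be illustrated with a minimal sketch. Note that it uses std::get_time from the standard library instead of Howard E. Hinnant's date.h, so it is not how crawlserv++ itself parses dates; the function parseDateTime is hypothetical. Because std::get_time does not support %F, the default is spelled out as %Y-%m-%d %H:%M:%S.

```cpp
#include <cstddef>
#include <ctime>
#include <iomanip>
#include <optional>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of crawlserv++): parses a date/time string
// according to a format option as described above.
std::optional<std::tm> parseDateTime(
        const std::string& value,
        const std::string& format
) {
    if (format == "UNIX") {
        // interpret the value as seconds since the Unix epoch
        try {
            std::size_t parsedChars{};
            const long long seconds{std::stoll(value, &parsedChars)};

            if (parsedChars != value.size()) {
                return std::nullopt; // trailing garbage after the number
            }

            const auto time{static_cast<std::time_t>(seconds)};
            const std::tm* utc{std::gmtime(&time)}; // note: not thread-safe

            if (utc == nullptr) {
                return std::nullopt;
            }

            return *utc;
        }
        catch (const std::exception&) {
            return std::nullopt; // not a number, or out of range
        }
    }

    // empty format: fall back to the default, equivalent to "%F %T"
    const std::string effective{
        format.empty() ? "%Y-%m-%d %H:%M:%S" : format
    };

    std::tm result{};
    std::istringstream in(value);

    in >> std::get_time(&result, effective.c_str());

    if (in.fail()) {
        return std::nullopt;
    }

    return result;
}
```

Locale handling (see parsingDateTimeLocales) is left out of this sketch.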
| std::vector<std::string> crawlservpp::Module::Parser::Config::Entries::parsingDateTimeLocales |
Locale to be used by the date/time query with the same array index.
Referenced by crawlservpp::Module::Parser::Config::checkOptions(), and crawlservpp::Module::Parser::Config::parseOption().
| std::vector<std::uint64_t> crawlservpp::Module::Parser::Config::Entries::parsingDateTimeQueries |
Queries used for parsing the date/time.
The first query that returns a non-empty result will be used.
Referenced by crawlservpp::Module::Parser::Config::checkOptions(), crawlservpp::Module::Parser::Thread::onReset(), and crawlservpp::Module::Parser::Config::parseOption().
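The rule that the first query returning a non-empty result wins can be sketched as follows. In crawlserv++, queries are referenced by their IDs; here they are modelled as plain callables purely for illustration, and the names Query and firstNonEmptyResult are hypothetical.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for a query: takes the source text (URL or content)
// and returns the matched string, or an empty string if nothing matched.
using Query = std::function<std::string(const std::string&)>;

// Runs the queries in order and returns the first non-empty result.
// Queries after the first successful one are not executed.
std::string firstNonEmptyResult(
        const std::vector<Query>& queries,
        const std::string& source
) {
    for (const auto& query : queries) {
        std::string result{query(source)};

        if (!result.empty()) {
            return result;
        }
    }

    return {}; // no query matched
}
```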
| std::vector<std::uint16_t> crawlservpp::Module::Parser::Config::Entries::parsingDateTimeSources |
Where to parse the date/time from – the URL itself, or the crawled content belonging to the URL.
Referenced by crawlservpp::Module::Parser::Config::checkOptions(), crawlservpp::Module::Parser::Thread::onReset(), and crawlservpp::Module::Parser::Config::parseOption().
| bool crawlservpp::Module::Parser::Config::Entries::parsingDateTimeWarningEmpty {true} |
Specifies whether to write a warning to the log if no date/time could be parsed although a query is specified.
Referenced by crawlservpp::Module::Parser::Thread::onReset(), and crawlservpp::Module::Parser::Config::parseOption().
| std::vector<std::string> crawlservpp::Module::Parser::Config::Entries::parsingFieldDateTimeFormats |
Date/time format of the field with the same array index.
If not specified, no date/time conversion will be performed.
See Howard E. Hinnant's C++ date.h library documentation for details.
Set a string to UNIX to parse Unix timestamps, i.e. seconds since the Unix epoch, instead.
Referenced by crawlservpp::Module::Parser::Config::checkOptions(), crawlservpp::Module::Parser::Thread::onReset(), and crawlservpp::Module::Parser::Config::parseOption().
| std::vector<std::string> crawlservpp::Module::Parser::Config::Entries::parsingFieldDateTimeLocales |
Locale to be used by the query with the same array index.
Referenced by crawlservpp::Module::Parser::Config::checkOptions(), crawlservpp::Module::Parser::Thread::onReset(), and crawlservpp::Module::Parser::Config::parseOption().
| std::vector<char> crawlservpp::Module::Parser::Config::Entries::parsingFieldDelimiters |
Delimiter between multiple results for the field with the same array index, if not saved as JSON.
Only the first character of the string is used, unless the string is one of the escape sequences \n (default), \t, or \\.
Referenced by crawlservpp::Module::Parser::Config::checkOptions(), crawlservpp::Module::Parser::Thread::onReset(), and crawlservpp::Module::Parser::Config::parseOption().
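A minimal sketch of how such a delimiter option could be turned into a single character is given below. The function parseDelimiter is hypothetical, and the treatment of an empty string as the default (\n) is an assumption.

```cpp
#include <string>

// Hypothetical helper (not part of crawlserv++): converts a delimiter
// option string into the single character used between results.
char parseDelimiter(const std::string& option) {
    if (option.empty()) {
        return '\n'; // assumption: an empty option selects the default
    }

    // recognise the three escape sequences named in the documentation
    if (option.size() >= 2 && option[0] == '\\') {
        switch (option[1]) {
        case 'n':
            return '\n';

        case 't':
            return '\t';

        case '\\':
            return '\\';

        default:
            break;
        }
    }

    // otherwise, only the first character is used
    return option.front();
}
```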
| std::vector<bool> crawlservpp::Module::Parser::Config::Entries::parsingFieldIgnoreEmpty |
Specifies whether to ignore empty values when parsing multiple results for the field with the same array index.
Enabled by default.
Referenced by crawlservpp::Module::Parser::Config::checkOptions(), crawlservpp::Module::Parser::Thread::onReset(), and crawlservpp::Module::Parser::Config::parseOption().
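The interplay between the delimiter and the option to ignore empty values can be illustrated with a short sketch. The function joinResults is hypothetical and only covers the non-JSON case.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not part of crawlserv++): joins multiple results
// of one field into a single string, separated by the given delimiter.
std::string joinResults(
        const std::vector<std::string>& results,
        char delimiter,
        bool ignoreEmpty
) {
    std::string joined;
    bool first{true};

    for (const auto& result : results) {
        if (ignoreEmpty && result.empty()) {
            continue; // skip empty values entirely
        }

        if (!first) {
            joined.push_back(delimiter);
        }

        joined += result;
        first = false;
    }

    return joined;
}
```

For the results "a", "" and "b" with the delimiter ';', this sketch yields "a;b" when empty values are ignored and "a;;b" when they are kept.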
| std::vector<bool> crawlservpp::Module::Parser::Config::Entries::parsingFieldJSON |
Specifies whether to save the value of the field with the same array index as a JSON array.
Referenced by crawlservpp::Module::Parser::Config::checkOptions(), crawlservpp::Module::Parser::Thread::onReset(), and crawlservpp::Module::Parser::Config::parseOption().
| std::vector<std::string> crawlservpp::Module::Parser::Config::Entries::parsingFieldNames |
Name of the field with the same array index.
Referenced by crawlservpp::Module::Parser::Config::checkOptions(), crawlservpp::Module::Parser::Thread::onReset(), and crawlservpp::Module::Parser::Config::parseOption().
| std::vector<std::uint64_t> crawlservpp::Module::Parser::Config::Entries::parsingFieldQueries |
Query for the field with the same array index.
Referenced by crawlservpp::Module::Parser::Config::checkOptions(), and crawlservpp::Module::Parser::Config::parseOption().
| std::vector<std::uint8_t> crawlservpp::Module::Parser::Config::Entries::parsingFieldSources |
Source of the field with the same array index – the URL itself, or the crawled content belonging to the URL.
Referenced by crawlservpp::Module::Parser::Config::checkOptions(), crawlservpp::Module::Parser::Thread::onReset(), and crawlservpp::Module::Parser::Config::parseOption().
| std::vector<bool> crawlservpp::Module::Parser::Config::Entries::parsingFieldTidyTexts |
Specifies whether to remove line breaks and unnecessary whitespace when parsing the field with the same array index.
Referenced by crawlservpp::Module::Parser::Config::checkOptions(), crawlservpp::Module::Parser::Thread::onReset(), and crawlservpp::Module::Parser::Config::parseOption().
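What "tidying" a text might look like is sketched below. This is an assumption about the intended effect (line breaks removed, runs of whitespace collapsed to one space, leading and trailing whitespace dropped), not the actual crawlserv++ implementation; the function tidyText is hypothetical.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper (not part of crawlserv++): collapses every run of
// whitespace (including line breaks) into a single space and trims the
// beginning and the end of the text.
std::string tidyText(const std::string& text) {
    std::string out;
    bool pendingSpace{false};

    for (const unsigned char c : text) {
        if (std::isspace(c) != 0) {
            // remember the gap, but never emit leading whitespace
            pendingSpace = !out.empty();

            continue;
        }

        if (pendingSpace) {
            out.push_back(' ');

            pendingSpace = false;
        }

        out.push_back(static_cast<char>(c));
    }

    // a pending space at the end is dropped, which trims the text
    return out;
}
```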
| std::vector<bool> crawlservpp::Module::Parser::Config::Entries::parsingFieldWarningsEmpty |
Specifies whether to write a warning to the log if the field with the same array index is empty.
Referenced by crawlservpp::Module::Parser::Config::checkOptions(), crawlservpp::Module::Parser::Thread::onReset(), and crawlservpp::Module::Parser::Config::parseOption().
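Most parsingField* entries are parallel arrays: the values at the same array index together describe one field. The following sketch shows one way to combine them into per-field records. The struct Field and the function zipFields are hypothetical; dropping fields that lack a name or a query is an assumption, while the defaults for the delimiter (\n) and for ignoring empty values (enabled) follow the descriptions above.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical record combining the options that share one array index.
struct Field {
    std::string name;
    std::uint64_t query{};
    char delimiter{'\n'};   // documented default delimiter
    bool ignoreEmpty{true}; // documented as enabled by default
};

// Hypothetical helper (not part of crawlserv++): zips the parallel option
// arrays into one record per field, filling in defaults where the optional
// arrays are shorter than the list of fields.
std::vector<Field> zipFields(
        const std::vector<std::string>& names,
        const std::vector<std::uint64_t>& queries,
        const std::vector<char>& delimiters,
        const std::vector<bool>& ignoreEmpty
) {
    // assumption: a field needs both a name and a query to be usable
    const std::size_t count{std::min(names.size(), queries.size())};

    std::vector<Field> fields;

    fields.reserve(count);

    for (std::size_t index{}; index < count; ++index) {
        Field field;

        field.name = names[index];
        field.query = queries[index];

        if (index < delimiters.size()) {
            field.delimiter = delimiters[index];
        }

        if (index < ignoreEmpty.size()) {
            field.ignoreEmpty = ignoreEmpty[index];
        }

        fields.push_back(field);
    }

    return fields;
}
```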
| std::vector<std::string> crawlservpp::Module::Parser::Config::Entries::parsingIdIgnore |
Parsed IDs to be ignored.
Referenced by crawlservpp::Module::Parser::Thread::onReset(), and crawlservpp::Module::Parser::Config::parseOption().
| std::vector<std::uint64_t> crawlservpp::Module::Parser::Config::Entries::parsingIdQueries |
Queries to parse the ID.
The first query that returns a non-empty result will be used. Datasets with duplicate or empty IDs will not be parsed.
Referenced by crawlservpp::Module::Parser::Config::checkOptions(), crawlservpp::Module::Parser::Thread::onReset(), and crawlservpp::Module::Parser::Config::parseOption().
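The rules for parsed IDs (empty and duplicate IDs are not parsed, and IDs listed in parsingIdIgnore are skipped) can be sketched as a simple filter. The function acceptId is hypothetical, and tracking already seen IDs in an in-memory set is an assumption made for illustration only.

```cpp
#include <algorithm>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper (not part of crawlserv++): returns true if a dataset
// with the given ID should be parsed, and remembers the ID if so.
bool acceptId(
        const std::string& id,
        const std::vector<std::string>& ignoredIds,
        std::unordered_set<std::string>& seenIds
) {
    if (id.empty()) {
        return false; // datasets with empty IDs are not parsed
    }

    if (
            std::find(ignoredIds.cbegin(), ignoredIds.cend(), id)
            != ignoredIds.cend()
    ) {
        return false; // the ID is explicitly ignored
    }

    // insert() reports whether the ID was new; duplicates are rejected
    return seenIds.insert(id).second;
}
```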
| std::vector<std::uint8_t> crawlservpp::Module::Parser::Config::Entries::parsingIdSources |
Where to parse the ID from when using the ID query with the same array index – the URL itself, or the crawled content belonging to the URL.
Referenced by crawlservpp::Module::Parser::Config::checkOptions(), crawlservpp::Module::Parser::Thread::onReset(), and crawlservpp::Module::Parser::Config::parseOption().
| bool crawlservpp::Module::Parser::Config::Entries::parsingRemoveXmlInstructions {true} |
Specifies whether to remove XML processing instructions (<?xml:...>) before parsing HTML content.
Referenced by crawlservpp::Module::Parser::Thread::onReset(), and crawlservpp::Module::Parser::Config::parseOption().
| bool crawlservpp::Module::Parser::Config::Entries::parsingRepairCData {true} |
Specifies whether to (try to) repair CData when parsing HTML/XML.
Referenced by crawlservpp::Module::Parser::Thread::onReset(), and crawlservpp::Module::Parser::Config::parseOption().
| bool crawlservpp::Module::Parser::Config::Entries::parsingRepairComments {true} |
Specifies whether to (try to) repair broken HTML/XML comments.
Referenced by crawlservpp::Module::Parser::Thread::onReset(), and crawlservpp::Module::Parser::Config::parseOption().
| std::uint16_t crawlservpp::Module::Parser::Config::Entries::parsingTidyErrors {} |
Number of tidyhtml errors to write to the log.
Referenced by crawlservpp::Module::Parser::Config::parseOption().
| bool crawlservpp::Module::Parser::Config::Entries::parsingTidyWarnings {false} |
Specifies whether to write tidyhtml warnings to the log.
Referenced by crawlservpp::Module::Parser::Thread::onReset(), and crawlservpp::Module::Parser::Config::parseOption().