crawlserv++
[under development]
Application for crawling and analyzing textual content of websites.
Configuration entries for analyzer threads.
#include <Config.hpp>
Analyzer Configuration
bool | generalCorpusChecks {true} |
Check the consistency of text corpora.
std::uint8_t | generalCorpusSlicing {defaultPercentageCorpusSlices} |
Corpus chunk size in percent of the maximum packet size allowed by the MySQL server.
std::vector< std::string > | generalInputFields |
Columns to be used from the input tables.
std::vector< std::uint8_t > | generalInputSources |
Types of tables to be used as input.
std::vector< std::string > | generalInputTables |
Names of tables to be used as input.
std::uint8_t | generalLogging {generalLoggingDefault} |
Level of logging activity.
std::int32_t | generalRestartAfter {defaultRestartAfter} |
Time (in s) after which to restart analysis once it has been completed (-1=deactivated).
std::uint64_t | generalSleepMySql {defaultSleepMySqlS} |
Time (in s) to wait before the last attempt to reconnect to the MySQL server.
std::uint64_t | generalSleepWhenFinished {defaultSleepWhenFinishedMs} |
Time (in ms) to wait each tick when finished.
std::string | generalTargetTable |
Table name to save analyzed data to.
Group by Date
bool | groupDateFillGaps {true} |
Enables filling the gaps between dates.
std::uint8_t | groupDateResolution {} |
The resolution to be used when grouping dates.
Filter by Date
bool | filterDateEnable {false} |
Enable filtering source data by date (only applies to parsed data).
std::string | filterDateFrom |
The date from which to filter the parsed data.
std::string | filterDateTo |
The date until which to filter the parsed data.
Filter by Query
std::vector< std::uint64_t > | filterQueryQueries |
Queries which need to be fulfilled for at least one token in an article in order to keep it.
bool | filterQueryAll {false} |
Specifies whether articles must contain a word fulfilling all of the queries instead of only one of them.
Corpus Tokenization
std::vector< std::string > | tokenizerDicts |
Dictionary for the (token-based) manipulator with the same array index.
std::uint64_t | tokenizerFreeMemoryEvery {defaultFreeMemoryEvery} |
Number of processed bytes in a continuous corpus after which memory will be freed.
std::vector< std::string > | tokenizerLanguages |
Language for the (token-based aspell) manipulator with the same array index.
std::vector< std::uint16_t > | tokenizerManipulators |
Manipulators used on the text corpus.
std::vector< std::string > | tokenizerModels |
Model for the (sentence-based) manipulator with the same array index.
std::vector< std::uint16_t > | tokenizerSavePoints {} |
Steps after which the corpus will be stored in the database.
std::string | uploadFTP |
URL to upload a JSON file containing the results to.
std::string | uploadProxy |
URL of proxy to use while uploading a JSON file containing the results.
std::string | uploadTargetColumn |
Name of the column in the target table from which to create the JSON file for uploading.
bool | uploadVerbose {false} |
Specifies whether FTP network information will be printed to the server console while uploading the results.
Configuration entries for analyzer threads.
json/analyzer.json in crawlserv_frontend!
bool crawlservpp::Module::Analyzer::Config::Entries::filterDateEnable {false} |
Enable filtering source data by date (only applies to parsed data).
Referenced by crawlservpp::Module::Analyzer::Thread::cleanUpQueries(), and crawlservpp::Module::Analyzer::Config::parseOption().
std::string crawlservpp::Module::Analyzer::Config::Entries::filterDateFrom |
The date from which to filter the parsed data.
Referenced by crawlservpp::Module::Analyzer::Config::parseOption().
std::string crawlservpp::Module::Analyzer::Config::Entries::filterDateTo |
The date until which to filter the parsed data.
Referenced by crawlservpp::Module::Analyzer::Thread::cleanUpQueries(), and crawlservpp::Module::Analyzer::Config::parseOption().
bool crawlservpp::Module::Analyzer::Config::Entries::filterQueryAll {false} |
Specifies whether articles must contain a word fulfilling all of the queries instead of only one of them.
Referenced by crawlservpp::Module::Analyzer::Thread::cleanUpQueries(), and crawlservpp::Module::Analyzer::Config::parseOption().
std::vector<std::uint64_t> crawlservpp::Module::Analyzer::Config::Entries::filterQueryQueries |
Queries which need to be fulfilled for at least one token in an article in order to keep it.
If no queries are given, no filtering will take place.
Referenced by crawlservpp::Module::Analyzer::Thread::cleanUpQueries(), and crawlservpp::Module::Analyzer::Config::parseOption().
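The interplay of filterQueryQueries and filterQueryAll can be illustrated with a minimal sketch. It is not the code used by the analyzer: TokenQuery and keepArticle() are hypothetical names, and a plain callable stands in for the stored queries that the configuration references by ID.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for a stored query: true if the token fulfills it.
using TokenQuery = std::function<bool(const std::string&)>;

// Decides whether an article is kept, following the documented semantics:
//  - no queries given: no filtering, the article is kept
//  - filterQueryAll == false: one token must fulfill at least one query
//  - filterQueryAll == true: one token must fulfill all of the queries
bool keepArticle(
    const std::vector<std::string>& tokens,
    const std::vector<TokenQuery>& queries,
    bool filterQueryAll
) {
    if (queries.empty()) {
        return true; // no queries given: no filtering takes place
    }

    for (const auto& token : tokens) {
        std::size_t fulfilled{0};

        for (const auto& query : queries) {
            assert(query && "every query must be callable");

            if (query(token)) {
                ++fulfilled;
            }
        }

        const bool matches{
            filterQueryAll ? fulfilled == queries.size() : fulfilled > 0
        };

        if (matches) {
            return true;
        }
    }

    return false;
}
```

With the tokens "alpha" and "beta" and two queries "contains a" and "contains b", the article is kept in both modes, because "beta" alone fulfills both queries.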
bool crawlservpp::Module::Analyzer::Config::Entries::generalCorpusChecks {true} |
Check the consistency of text corpora.
Referenced by crawlservpp::Module::Analyzer::Thread::cleanUpQueries(), and crawlservpp::Module::Analyzer::Config::parseOption().
std::uint8_t crawlservpp::Module::Analyzer::Config::Entries::generalCorpusSlicing {defaultPercentageCorpusSlices} |
Corpus chunk size in percent of the maximum packet size allowed by the MySQL server.
Referenced by crawlservpp::Module::Analyzer::Config::checkOptions(), crawlservpp::Module::Analyzer::Thread::cleanUpQueries(), and crawlservpp::Module::Analyzer::Config::parseOption().
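As a rough illustration of what a percentage of the maximum packet size means in bytes, the following sketch computes the chunk size. The function name corpusSliceBytes() and the accepted range of 1 to 100 are assumptions made for this example; the range actually enforced by checkOptions() is not documented here.

```cpp
#include <cassert>
#include <cstdint>

// Computes a chunk size in bytes from the server's maximum packet size
// and a percentage such as generalCorpusSlicing.
// Assumption: valid percentages are 1 to 100.
std::uint64_t corpusSliceBytes(std::uint64_t maxPacketBytes, std::uint8_t percent) {
    assert(percent >= 1 && percent <= 100);

    // Split the packet size into whole hundreds and a remainder, so that
    // the multiplication cannot overflow for very large packet sizes.
    const std::uint64_t hundreds{maxPacketBytes / 100};
    const std::uint64_t remainder{maxPacketBytes % 100};

    return hundreds * percent + remainder * percent / 100;
}
```

For a maximum packet size of 64 MiB (67108864 bytes) and a value of 30, this yields 20132659 bytes per chunk.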
std::vector<std::string> crawlservpp::Module::Analyzer::Config::Entries::generalInputFields |
Columns to be used from the input tables.
Referenced by crawlservpp::Module::Analyzer::Config::checkOptions(), crawlservpp::Module::Analyzer::Thread::cleanUpQueries(), crawlservpp::Module::Analyzer::Algo::CorpusGenerator::onAlgoInit(), and crawlservpp::Module::Analyzer::Config::parseOption().
std::vector<std::uint8_t> crawlservpp::Module::Analyzer::Config::Entries::generalInputSources |
Types of tables to be used as input.
Referenced by crawlservpp::Module::Analyzer::Thread::addCorpora(), crawlservpp::Module::Analyzer::Thread::checkCorpusSources(), crawlservpp::Module::Analyzer::Config::checkOptions(), crawlservpp::Module::Analyzer::Thread::cleanUpQueries(), crawlservpp::Module::Analyzer::Algo::CorpusGenerator::onAlgoInit(), and crawlservpp::Module::Analyzer::Config::parseOption().
std::vector<std::string> crawlservpp::Module::Analyzer::Config::Entries::generalInputTables |
Names of tables to be used as input.
Referenced by crawlservpp::Module::Analyzer::Config::checkOptions(), crawlservpp::Module::Analyzer::Thread::cleanUpQueries(), crawlservpp::Module::Analyzer::Algo::CorpusGenerator::onAlgoInit(), and crawlservpp::Module::Analyzer::Config::parseOption().
std::uint8_t crawlservpp::Module::Analyzer::Config::Entries::generalLogging {generalLoggingDefault} |
Level of logging activity.
Referenced by crawlservpp::Module::Analyzer::Thread::cleanUpQueries(), and crawlservpp::Module::Analyzer::Config::parseOption().
std::int32_t crawlservpp::Module::Analyzer::Config::Entries::generalRestartAfter {defaultRestartAfter} |
Time (in s) after which to restart analysis once it has been completed (-1=deactivated).
Referenced by crawlservpp::Module::Analyzer::Thread::onTick(), and crawlservpp::Module::Analyzer::Config::parseOption().
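The sentinel value -1 can be handled as in the following sketch. It is only an illustration, not the logic of Thread::onTick(): shouldRestart() is a hypothetical helper, and treating every negative value like -1 is an assumption.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Returns true if a finished analysis should be restarted.
// restartAfterS mirrors generalRestartAfter: seconds to wait, -1 = deactivated.
// Assumption: any negative value is treated like -1.
bool shouldRestart(std::int32_t restartAfterS, std::chrono::seconds sinceFinished) {
    assert(sinceFinished.count() >= 0);

    if (restartAfterS < 0) {
        return false; // restarting is deactivated
    }

    return sinceFinished >= std::chrono::seconds{restartAfterS};
}
```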
std::uint64_t crawlservpp::Module::Analyzer::Config::Entries::generalSleepMySql {defaultSleepMySqlS} |
Time (in s) to wait before the last attempt to reconnect to the MySQL server.
Referenced by crawlservpp::Module::Analyzer::Thread::cleanUpQueries(), and crawlservpp::Module::Analyzer::Config::parseOption().
std::uint64_t crawlservpp::Module::Analyzer::Config::Entries::generalSleepWhenFinished {defaultSleepWhenFinishedMs} |
Time (in ms) to wait each tick when finished.
Referenced by crawlservpp::Module::Analyzer::Thread::onTick(), and crawlservpp::Module::Analyzer::Config::parseOption().
std::string crawlservpp::Module::Analyzer::Config::Entries::generalTargetTable |
Table name to save analyzed data to.
Referenced by crawlservpp::Module::Analyzer::Thread::cleanUpQueries(), crawlservpp::Module::Analyzer::Thread::getTargetTableName(), crawlservpp::Module::Analyzer::Config::parseOption(), and crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo().
bool crawlservpp::Module::Analyzer::Config::Entries::groupDateFillGaps {true} |
Enables filling the gaps between dates.
Referenced by crawlservpp::Module::Analyzer::Config::parseOption(), crawlservpp::Module::Analyzer::Algo::WordsOverTime::resetAlgo(), crawlservpp::Module::Analyzer::Algo::AssocOverTime::resetAlgo(), and crawlservpp::Module::Analyzer::Algo::SentimentOverTime::resetAlgo().
std::uint8_t crawlservpp::Module::Analyzer::Config::Entries::groupDateResolution {} |
The resolution to be used when grouping dates.
Referenced by crawlservpp::Module::Analyzer::Config::parseOption(), crawlservpp::Module::Analyzer::Algo::WordsOverTime::resetAlgo(), crawlservpp::Module::Analyzer::Algo::AssocOverTime::resetAlgo(), and crawlservpp::Module::Analyzer::Algo::SentimentOverTime::resetAlgo().
std::vector<std::string> crawlservpp::Module::Analyzer::Config::Entries::tokenizerDicts |
Dictionary for the (token-based) manipulator with the same array index.
Empty strings will be ignored.
Preprocessing of the corpus will fail if no dictionary is set for a manipulator that requires one.
Referenced by crawlservpp::Module::Analyzer::Config::checkOptions(), crawlservpp::Module::Analyzer::Thread::cleanUpQueries(), and crawlservpp::Module::Analyzer::Config::parseOption().
std::uint64_t crawlservpp::Module::Analyzer::Config::Entries::tokenizerFreeMemoryEvery {defaultFreeMemoryEvery} |
Number of processed bytes in a continuous corpus after which memory will be freed.
If zero, memory will only be freed after processing is complete.
Referenced by crawlservpp::Module::Analyzer::Thread::cleanUpQueries(), and crawlservpp::Module::Analyzer::Config::parseOption().
std::vector<std::string> crawlservpp::Module::Analyzer::Config::Entries::tokenizerLanguages |
Language for the (token-based aspell) manipulator with the same array index.
Empty strings will be ignored.
If not set, the default language of the server's aspell configuration will be used.
Referenced by crawlservpp::Module::Analyzer::Config::checkOptions(), crawlservpp::Module::Analyzer::Thread::cleanUpQueries(), and crawlservpp::Module::Analyzer::Config::parseOption().
std::vector<std::uint16_t> crawlservpp::Module::Analyzer::Config::Entries::tokenizerManipulators |
Manipulators used on the text corpus.
Referenced by crawlservpp::Module::Analyzer::Config::checkOptions(), crawlservpp::Module::Analyzer::Thread::cleanUpQueries(), and crawlservpp::Module::Analyzer::Config::parseOption().
std::vector<std::string> crawlservpp::Module::Analyzer::Config::Entries::tokenizerModels |
Model for the (sentence-based) manipulator with the same array index.
Empty strings will be ignored.
Preprocessing of the corpus will fail if no model is set for a manipulator that requires one.
Referenced by crawlservpp::Module::Analyzer::Config::checkOptions(), crawlservpp::Module::Analyzer::Thread::cleanUpQueries(), and crawlservpp::Module::Analyzer::Config::parseOption().
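tokenizerManipulators, tokenizerDicts, tokenizerLanguages, and tokenizerModels are parallel arrays: the entries at the same index belong to the same manipulator. The following sketch pairs them up by index. ManipulatorSetup, valueAt(), and pairByIndex() are hypothetical names, and treating a missing entry in a shorter array like an empty string is an assumption, not documented behavior.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper type: one manipulator with its per-index settings.
struct ManipulatorSetup {
    std::uint16_t id;
    std::string dict;     // empty = not set
    std::string language; // empty = not set
    std::string model;    // empty = not set
};

// Returns the entry at the given index, or an empty string if the array is shorter.
std::string valueAt(const std::vector<std::string>& values, std::size_t index) {
    return index < values.size() ? values[index] : std::string{};
}

// Combines the parallel arrays into one entry per manipulator.
std::vector<ManipulatorSetup> pairByIndex(
    const std::vector<std::uint16_t>& manipulators,
    const std::vector<std::string>& dicts,
    const std::vector<std::string>& languages,
    const std::vector<std::string>& models
) {
    std::vector<ManipulatorSetup> result;

    result.reserve(manipulators.size());

    for (std::size_t index{0}; index < manipulators.size(); ++index) {
        result.push_back({
            manipulators[index],
            valueAt(dicts, index),
            valueAt(languages, index),
            valueAt(models, index)
        });
    }

    assert(result.size() == manipulators.size());

    return result;
}
```

A caller could then check each entry and report an error when a manipulator that requires a dictionary or model has an empty one.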
std::vector<std::uint16_t> crawlservpp::Module::Analyzer::Config::Entries::tokenizerSavePoints {} |
Steps after which the corpus will be stored in the database.
If zero, the unmanipulated corpus will be stored. Starting from one, the number corresponds to the manipulators used.
Referenced by crawlservpp::Module::Analyzer::Thread::cleanUpQueries(), and crawlservpp::Module::Analyzer::Config::parseOption().
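The numbering of save points can be made concrete with a short sketch. stepsToStore() is a hypothetical helper, and skipping save points that exceed the number of manipulators is an assumption made for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the steps at which the corpus would be stored, in processing order.
// Step 0 is the unmanipulated corpus; step n is the corpus after the n-th manipulator.
// Assumption: save points beyond the number of manipulators are skipped.
std::vector<std::uint16_t> stepsToStore(
    std::size_t numManipulators,
    const std::vector<std::uint16_t>& savePoints
) {
    // manipulator steps must fit into std::uint16_t
    assert(numManipulators <= 65535);

    std::vector<std::uint16_t> steps;

    for (std::size_t step{0}; step <= numManipulators; ++step) {
        if (std::find(savePoints.begin(), savePoints.end(), step) != savePoints.end()) {
            steps.push_back(static_cast<std::uint16_t>(step));
        }
    }

    return steps;
}
```

With three manipulators and the save points 0, 2, and 7, the corpus would be stored before any manipulation and after the second manipulator; the save point 7 has no matching step.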
std::string crawlservpp::Module::Analyzer::Config::Entries::uploadFTP |
URL to upload a JSON file containing the results to.
Needs to start with 'ftp://' or 'sftp://'. May include a username, a password, and a path on the FTP server.
If empty, no result will be uploaded.
Referenced by crawlservpp::Module::Analyzer::Config::parseOption(), and crawlservpp::Module::Analyzer::Thread::uploadResult().
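The documented rules for uploadFTP (empty means no upload, otherwise one of two prefixes is required) could be checked as in the following sketch. UploadTarget, hasPrefix(), and classifyUploadUrl() are hypothetical names, and the case-sensitive comparison is an assumption.

```cpp
#include <string>

// Hypothetical classification of an uploadFTP value.
enum class UploadTarget { None, Ftp, Sftp, Invalid };

// Returns true if the string starts with the given prefix (case-sensitive).
bool hasPrefix(const std::string& value, const std::string& prefix) {
    return value.compare(0, prefix.size(), prefix) == 0;
}

// Classifies a URL according to the documented rules for uploadFTP.
UploadTarget classifyUploadUrl(const std::string& url) {
    if (url.empty()) {
        return UploadTarget::None; // no result will be uploaded
    }

    if (hasPrefix(url, "ftp://")) {
        return UploadTarget::Ftp;
    }

    if (hasPrefix(url, "sftp://")) {
        return UploadTarget::Sftp;
    }

    return UploadTarget::Invalid;
}
```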
std::string crawlservpp::Module::Analyzer::Config::Entries::uploadProxy |
URL of proxy to use while uploading a JSON file containing the results.
If empty, no proxy will be used.
Referenced by crawlservpp::Module::Analyzer::Config::parseOption().
std::string crawlservpp::Module::Analyzer::Config::Entries::uploadTargetColumn |
Name of the column in the target table from which to create the JSON file for uploading.
Must not include the prefix ('analyzed_' or 'analyzed__').
Referenced by crawlservpp::Module::Analyzer::Config::parseOption(), and crawlservpp::Module::Analyzer::Thread::uploadResult().
bool crawlservpp::Module::Analyzer::Config::Entries::uploadVerbose {false} |
Specifies whether FTP network information will be printed to the server console while uploading the results.
Referenced by crawlservpp::Module::Analyzer::Config::parseOption().