crawlserv++  [under development]
Application for crawling and analyzing textual content of websites.
crawlservpp::Module::Analyzer::Config::Entries Struct Reference

Configuration entries for analyzer threads.

#include <Config.hpp>

Analyzer Configuration

bool generalCorpusChecks {true}
 Check the consistency of text corpora.

std::uint8_t generalCorpusSlicing {defaultPercentageCorpusSlices}
 Corpus chunk size in percent of the maximum packet size allowed by the MySQL server.

std::vector< std::string > generalInputFields
 Columns to be used from the input tables.

std::vector< std::uint8_t > generalInputSources
 Types of tables to be used as input.

std::vector< std::string > generalInputTables
 Names of tables to be used as input.

std::uint8_t generalLogging {generalLoggingDefault}
 Level of logging activity.

std::int32_t generalRestartAfter {defaultRestartAfter}
 Time (in s) after which to restart analysis once it has been completed (-1=deactivated).

std::uint64_t generalSleepMySql {defaultSleepMySqlS}
 Time (in s) to wait before the last attempt to reconnect to the MySQL server.

std::uint64_t generalSleepWhenFinished {defaultSleepWhenFinishedMs}
 Time (in ms) to wait each tick when finished.

std::string generalTargetTable
 Name of the table to save analyzed data to.
 

Group by Date

bool groupDateFillGaps {true}
 Enables filling the gaps between dates.

std::uint8_t groupDateResolution {}
 The resolution to be used when grouping dates.
 

Filter by Date

bool filterDateEnable {false}
 Enable filtering source data by date (only applies to parsed data).

std::string filterDateFrom
 The date from which to filter the parsed data.

std::string filterDateTo
 The date until which to filter the parsed data.
 

Filter by Query

std::vector< std::uint64_t > filterQueryQueries
 Queries which need to be fulfilled for at least one token in an article in order to keep it.

bool filterQueryAll {false}
 Specifies whether articles must contain a word fulfilling all of the queries instead of only one of them.
 

Corpus Tokenization

std::vector< std::string > tokenizerDicts
 Dictionary for the (token-based) manipulator with the same array index.

std::uint64_t tokenizerFreeMemoryEvery {defaultFreeMemoryEvery}
 Number of processed bytes in a continuous corpus after which memory will be freed.

std::vector< std::string > tokenizerLanguages
 Language for the (token-based aspell) manipulator with the same array index.

std::vector< std::uint16_t > tokenizerManipulators
 Manipulators used on the text corpus.

std::vector< std::string > tokenizerModels
 Model for the (sentence-based) manipulator with the same array index.

std::vector< std::uint16_t > tokenizerSavePoints {}
 Steps after which the corpus will be stored in the database.

std::string uploadFTP
 URL to upload a JSON file containing the results to.

std::string uploadProxy
 URL of proxy to use while uploading a JSON file containing the results.

std::string uploadTargetColumn
 Name of the column in the target table to create the JSON file for uploading from.

bool uploadVerbose {false}
 Specifies whether FTP network information will be printed to the server console while uploading the results.
 

Detailed Description

Configuration entries for analyzer threads.

Warning
Changing the configuration requires updating json/analyzer.json in crawlserv_frontend!

Member Data Documentation

◆ filterDateEnable

bool crawlservpp::Module::Analyzer::Config::Entries::filterDateEnable {false}

Enable filtering source data by date (only applies to parsed data).

Referenced by crawlservpp::Module::Analyzer::Thread::cleanUpQueries(), and crawlservpp::Module::Analyzer::Config::parseOption().

◆ filterDateFrom

std::string crawlservpp::Module::Analyzer::Config::Entries::filterDateFrom

The date from which to filter the parsed data.

Referenced by crawlservpp::Module::Analyzer::Config::parseOption().

◆ filterDateTo

std::string crawlservpp::Module::Analyzer::Config::Entries::filterDateTo

The date until which to filter the parsed data.

◆ filterQueryAll

bool crawlservpp::Module::Analyzer::Config::Entries::filterQueryAll {false}

Specifies whether articles must contain a word fulfilling all of the queries instead of only one of them.

Referenced by crawlservpp::Module::Analyzer::Thread::cleanUpQueries(), and crawlservpp::Module::Analyzer::Config::parseOption().

◆ filterQueryQueries

std::vector<std::uint64_t> crawlservpp::Module::Analyzer::Config::Entries::filterQueryQueries

Queries which need to be fulfilled for at least one token in an article in order to keep it.

If no queries are given, no filtering will take place.

Referenced by crawlservpp::Module::Analyzer::Thread::cleanUpQueries(), and crawlservpp::Module::Analyzer::Config::parseOption().
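
The interplay of filterQueryQueries and filterQueryAll can be illustrated with a minimal sketch. It is not the analyzer's implementation: the stored query IDs are replaced here by a hypothetical TokenQuery predicate, and keepArticle is an illustrative name that does not appear in the code base.

```cpp
#include <algorithm>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for a stored query: a predicate on a single token.
using TokenQuery = std::function<bool(const std::string&)>;

// Returns true if the article should be kept.
bool keepArticle(
    const std::vector<std::string>& tokens,
    const std::vector<TokenQuery>& queries,
    bool filterQueryAll
) {
    if(queries.empty()) {
        return true; // no queries given: no filtering takes place
    }

    // The article is kept as soon as one of its tokens qualifies.
    return std::any_of(tokens.begin(), tokens.end(), [&](const std::string& token) {
        const auto matches = [&](const TokenQuery& query) { return query(token); };

        return filterQueryAll
            ? std::all_of(queries.begin(), queries.end(), matches)  // token fulfils every query
            : std::any_of(queries.begin(), queries.end(), matches); // token fulfils at least one
    });
}
```

With filterQueryAll set, a single token has to fulfil every query; two different tokens that each fulfil one query are not sufficient in this reading of the option.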

◆ generalCorpusChecks

bool crawlservpp::Module::Analyzer::Config::Entries::generalCorpusChecks {true}

Check the consistency of text corpora.

◆ generalCorpusSlicing

std::uint8_t crawlservpp::Module::Analyzer::Config::Entries::generalCorpusSlicing {defaultPercentageCorpusSlices}

Corpus chunk size in percent of the maximum packet size allowed by the MySQL server.

Referenced by crawlservpp::Module::Analyzer::Config::checkOptions(), crawlservpp::Module::Analyzer::Thread::cleanUpQueries(), and crawlservpp::Module::Analyzer::Config::parseOption().
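
As a rough illustration of the percentage, the following sketch derives a chunk size in bytes from the server's packet limit. The function name corpusChunkSize and the parameter maxAllowedPacket are assumptions made for this example; how the analyzer obtains the limit and rounds the result is not documented here.

```cpp
#include <cstdint>

// Chunk size in bytes for a given packet limit (in bytes) and a percentage.
// Integer division truncates, so the result never exceeds the requested share.
constexpr std::uint64_t corpusChunkSize(std::uint64_t maxAllowedPacket, std::uint8_t percent) {
    return maxAllowedPacket * percent / 100;
}
```

For a packet limit of 1,048,576 bytes and a value of 50, the sketch yields chunks of 524,288 bytes.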

◆ generalInputFields

std::vector<std::string> crawlservpp::Module::Analyzer::Config::Entries::generalInputFields

Columns to be used from the input tables.

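◆ generalInputSources

std::vector<std::uint8_t> crawlservpp::Module::Analyzer::Config::Entries::generalInputSources

Types of tables to be used as input.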

◆ generalInputTables

std::vector<std::string> crawlservpp::Module::Analyzer::Config::Entries::generalInputTables

Names of tables to be used as input.

◆ generalLogging

std::uint8_t crawlservpp::Module::Analyzer::Config::Entries::generalLogging {generalLoggingDefault}

Level of logging activity.

◆ generalRestartAfter

std::int32_t crawlservpp::Module::Analyzer::Config::Entries::generalRestartAfter {defaultRestartAfter}

Time (in s) after which to restart analysis once it has been completed (-1=deactivated).

Referenced by crawlservpp::Module::Analyzer::Thread::onTick(), and crawlservpp::Module::Analyzer::Config::parseOption().
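
The sentinel value can be read as in the sketch below. The helper shouldRestart is hypothetical and only illustrates the documented meaning of -1; treating every negative value as deactivated and restarting immediately for a value of zero are assumptions.

```cpp
#include <chrono>
#include <cstdint>

// Decides whether a finished analysis should be restarted.
// restartAfterS corresponds to generalRestartAfter (in seconds).
bool shouldRestart(std::int32_t restartAfterS, std::chrono::seconds sinceFinished) {
    if(restartAfterS < 0) {
        return false; // -1 deactivates restarting (other negative values: assumption)
    }

    return sinceFinished >= std::chrono::seconds{restartAfterS};
}
```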

◆ generalSleepMySql

std::uint64_t crawlservpp::Module::Analyzer::Config::Entries::generalSleepMySql {defaultSleepMySqlS}

Time (in s) to wait before the last attempt to reconnect to the MySQL server.

Referenced by crawlservpp::Module::Analyzer::Thread::cleanUpQueries(), and crawlservpp::Module::Analyzer::Config::parseOption().

◆ generalSleepWhenFinished

std::uint64_t crawlservpp::Module::Analyzer::Config::Entries::generalSleepWhenFinished {defaultSleepWhenFinishedMs}

Time (in ms) to wait each tick when finished.

Referenced by crawlservpp::Module::Analyzer::Thread::onTick(), and crawlservpp::Module::Analyzer::Config::parseOption().

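◆ generalTargetTable

std::string crawlservpp::Module::Analyzer::Config::Entries::generalTargetTable

Name of the table to save analyzed data to.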

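◆ groupDateFillGaps

bool crawlservpp::Module::Analyzer::Config::Entries::groupDateFillGaps {true}

Enables filling the gaps between dates.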

◆ groupDateResolution

std::uint8_t crawlservpp::Module::Analyzer::Config::Entries::groupDateResolution {}

The resolution to be used when grouping dates.

◆ tokenizerDicts

std::vector<std::string> crawlservpp::Module::Analyzer::Config::Entries::tokenizerDicts

Dictionary for the (token-based) manipulator with the same array index.

Empty strings will be ignored.

Preprocessing of the corpus will fail if no dictionary is set for a manipulator that requires one.

Referenced by crawlservpp::Module::Analyzer::Config::checkOptions(), crawlservpp::Module::Analyzer::Thread::cleanUpQueries(), and crawlservpp::Module::Analyzer::Config::parseOption().
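
tokenizerDicts, tokenizerLanguages and tokenizerModels are all matched to tokenizerManipulators by array index. A minimal sketch of such a lookup follows. The helper optionFor is hypothetical, and treating an array that is shorter than the manipulator list as "not set" is an assumption rather than documented behaviour.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Returns the option value belonging to the manipulator at `index`, if one is set.
// Empty strings are ignored; a missing entry is treated the same way (assumption).
std::optional<std::string> optionFor(
    const std::vector<std::string>& values,
    std::size_t index
) {
    if(index >= values.size() || values[index].empty()) {
        return std::nullopt;
    }

    return values[index];
}
```

A caller could use the empty result to report an error for a manipulator that requires a dictionary or model, and to fall back to a default where the option is optional, as with the aspell language.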

◆ tokenizerFreeMemoryEvery

std::uint64_t crawlservpp::Module::Analyzer::Config::Entries::tokenizerFreeMemoryEvery {defaultFreeMemoryEvery}

Number of processed bytes in a continuous corpus after which memory will be freed.

If zero, memory will only be freed after processing is complete.

Referenced by crawlservpp::Module::Analyzer::Thread::cleanUpQueries(), and crawlservpp::Module::Analyzer::Config::parseOption().

◆ tokenizerLanguages

std::vector<std::string> crawlservpp::Module::Analyzer::Config::Entries::tokenizerLanguages

Language for the (token-based aspell) manipulator with the same array index.

Empty strings will be ignored.

If not set, the default language of the server's aspell configuration will be used.

Referenced by crawlservpp::Module::Analyzer::Config::checkOptions(), crawlservpp::Module::Analyzer::Thread::cleanUpQueries(), and crawlservpp::Module::Analyzer::Config::parseOption().

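◆ tokenizerManipulators

std::vector<std::uint16_t> crawlservpp::Module::Analyzer::Config::Entries::tokenizerManipulators

Manipulators used on the text corpus.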

◆ tokenizerModels

std::vector<std::string> crawlservpp::Module::Analyzer::Config::Entries::tokenizerModels

Model for the (sentence-based) manipulator with the same array index.

Empty strings will be ignored.

Preprocessing of the corpus will fail if no model is set for a manipulator that requires one.

Referenced by crawlservpp::Module::Analyzer::Config::checkOptions(), crawlservpp::Module::Analyzer::Thread::cleanUpQueries(), and crawlservpp::Module::Analyzer::Config::parseOption().

◆ tokenizerSavePoints

std::vector<std::uint16_t> crawlservpp::Module::Analyzer::Config::Entries::tokenizerSavePoints {}

Steps after which the corpus will be stored in the database.

If zero, the unmanipulated corpus will be stored. Starting from one, the number corresponds to the manipulators used.

Note
Savepoints will not be stored if a suitable savepoint already exists beyond them.

Referenced by crawlservpp::Module::Analyzer::Thread::cleanUpQueries(), and crawlservpp::Module::Analyzer::Config::parseOption().
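
The step numbering can be sketched as follows: step 0 denotes the unmanipulated corpus, step n the corpus after the n-th manipulator. The function effectiveSavePoints is a hypothetical helper. Dropping duplicates and steps beyond the number of configured manipulators is an assumption, and the check for existing savepoints mentioned in the note is not modelled.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the steps at which the corpus would be stored, in processing order.
std::vector<std::uint16_t> effectiveSavePoints(
    std::vector<std::uint16_t> savePoints,
    std::size_t numManipulators
) {
    // Process save points in ascending order and only once each.
    std::sort(savePoints.begin(), savePoints.end());
    savePoints.erase(std::unique(savePoints.begin(), savePoints.end()), savePoints.end());

    // Steps that no manipulator corresponds to cannot be reached (assumption: drop them).
    savePoints.erase(
        std::remove_if(
            savePoints.begin(),
            savePoints.end(),
            [numManipulators](std::uint16_t step) { return step > numManipulators; }
        ),
        savePoints.end()
    );

    return savePoints;
}
```

With four manipulators and the save points 3, 0, 3 and 7, the sketch stores the unmanipulated corpus (step 0) and the corpus after the third manipulator.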

◆ uploadFTP

std::string crawlservpp::Module::Analyzer::Config::Entries::uploadFTP

URL to upload a JSON file containing the results to.

Needs to start with 'ftp://' or 'sftp://'. Might include username, password, and path on the FTP server.

If empty, no result will be uploaded.

Referenced by crawlservpp::Module::Analyzer::Config::parseOption(), and crawlservpp::Module::Analyzer::Thread::uploadResult().
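
The constraints on the URL can be checked with a few lines of code. The sketch below is illustrative only: UploadTarget and classifyUploadUrl are hypothetical names, and whether the scheme is compared case-sensitively is not specified above, so the comparison here is an assumption.

```cpp
#include <string_view>

enum class UploadTarget {
    none,   // empty URL: no result will be uploaded
    valid,  // URL starts with an accepted scheme
    invalid // anything else
};

// Classifies a value of uploadFTP by its scheme prefix only.
UploadTarget classifyUploadUrl(std::string_view url) {
    constexpr std::string_view ftp{"ftp://"};
    constexpr std::string_view sftp{"sftp://"};

    if(url.empty()) {
        return UploadTarget::none;
    }

    // substr() clamps the length, so short inputs are handled safely.
    if(url.substr(0, ftp.size()) == ftp || url.substr(0, sftp.size()) == sftp) {
        return UploadTarget::valid;
    }

    return UploadTarget::invalid;
}
```

A URL such as ftp://user:secret@example.com/results/ (a made-up address) is classified as valid, since user name, password and path may follow the scheme.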

◆ uploadProxy

std::string crawlservpp::Module::Analyzer::Config::Entries::uploadProxy

URL of proxy to use while uploading a JSON file containing the results.

If empty, no proxy will be used.

Referenced by crawlservpp::Module::Analyzer::Config::parseOption().

◆ uploadTargetColumn

std::string crawlservpp::Module::Analyzer::Config::Entries::uploadTargetColumn

Name of the column in the target table to create the JSON file for uploading from.

Must not include the prefix ('analyzed_' or 'analyzed__').

Referenced by crawlservpp::Module::Analyzer::Config::parseOption(), and crawlservpp::Module::Analyzer::Thread::uploadResult().

◆ uploadVerbose

bool crawlservpp::Module::Analyzer::Config::Entries::uploadVerbose {false}

Specifies whether FTP network information will be printed to the server console while uploading the results.

Referenced by crawlservpp::Module::Analyzer::Config::parseOption().


The documentation for this struct was generated from the following file: