crawlserv++  [under development]
Application for crawling and analyzing textual content of websites.
crawlservpp::Module::Analyzer::Algo Namespace Reference

Namespace for algorithm classes.

Classes

class  AllTokens
 Counts all tokens in a corpus.
 
class  Assoc
 Empty algorithm template.
 
class  AssocOverTime
 Empty algorithm template.
 
class  CorpusGenerator
 Algorithm building a text corpus and creating corpus statistics from the input data.
 
class  Empty
 Empty algorithm template.
 
class  ExtractIds
 Extracts the parsed IDs from a filtered corpus.
 
class  SentimentOverTime
 Sentiment analysis using the VADER algorithm.
 
class  TopicModelling
 Topic Modeller.
 
class  WordsOverTime
 Counts the occurrence of articles, sentences, and tokens in a corpus over time.
 

Typedefs

using AlgoThreadProperties = Struct::AlgoThreadProperties
 
using AlgoThreadPtr = std::unique_ptr< Module::Analyzer::Thread >
 

Registration

AlgoThreadPtr initAlgo (const AlgoThreadProperties &thread)
 Creates an algorithm thread.
 

Constants

constexpr auto allTokensColumns {2}
 The number of columns in the tokens table.
 
constexpr auto allTokensUpdateEveryDate {100U}
 Indicates after how many dates the status will be updated, if a date map is available.
 
constexpr auto allTokensUpdateEveryArticle {1000U}
 Indicates after how many articles the status will be updated, if no date map, but an article map is available.
 
constexpr auto allTokensUpdateEveryToken {10000U}
 Indicates after how many tokens the status will be updated, if neither a date map nor an article map is available.
 
constexpr auto allTokensUpdateEveryRow {1000U}
 Indicates after how many rows the status will be updated while saving the results to the database.
 
constexpr auto assocUpdateProgressEvery {1000}
 Indicates, while saving, after how many articles the progress of the thread will be updated.
 
constexpr auto assocAddColumns {2}
 Number of extra columns included in a dataset (except date).
 
constexpr auto assocMinColumns {assocAddColumns + 1 }
 Minimum number of columns included in a dataset (including date).
 
constexpr auto assocOverTimeUpdateProgressEvery {100}
 Indicates, while saving, after how many rows the progress of the thread will be updated.
 
constexpr auto assocOverTimeAddColumns {2}
 Number of extra columns included in a dataset (except date).
 
constexpr auto assocOverTimeMinColumns {assocOverTimeAddColumns + 1 }
 Minimum number of columns included in a dataset (including date).
 
constexpr auto corpusNumFields {9}
 Number of target fields.
 
constexpr auto extractIdsUpdateProgressEvery {1000}
 Indicates after how many articles the progress of the thread will be updated.
 
constexpr auto sentimentUpdateCalculateProgressEvery {250000}
 Indicates, while calculating, after how many sentences the progress of the thread will be updated.
 
constexpr auto sentimentUpdateSavingProgressEvery {10}
 Indicates, while saving, after how many rows the progress of the thread will be updated.
 
constexpr auto sentimentMinNumColumns {1}
 Number of default columns to be written to the target table.
 
constexpr auto sentimentMinColumnsPerCategory {2}
 Number of columns per category if article-based sentiment is deactivated.
 
constexpr auto sentimentArticleColumnsPerCategory {4}
 Number of columns per category if article-based sentiment is activated.
 
constexpr auto sentimentDefaultThreshold {10}
 The default threshold (sentiments lower than that number will be ignored).
 
constexpr auto sentimentDictionary {"sentiment-en"sv}
 The default sentiment dictionary to be used.
 
constexpr auto sentimentEmojis {"emojis-en"sv}
 The default emoji dictionary to be used.
 
constexpr auto sentimentPercentageFactor {100.F}
 Factor to convert value to percentage.
 
constexpr auto topicModellingDirectory {"mdl"sv}
 The directory for model files.
 
constexpr auto topicModellingDefaultNumberOfTopics {2}
 The default number of initial topics.
 
constexpr auto topicModellingDefaultNumberOfTopicTokens {5}
 The default number of most-probable tokens for each detected topic.
 
constexpr auto topicModellingDefaultBurnIn {100}
 The default number of burn-in iterations.
 
constexpr auto topicModellingDefaultIterations {1000}
 The default number of iterations to train the model.
 
constexpr auto topicModellingDefaultIterationsAtOnce {25}
 The default number of iterations to train the model at once.
 
constexpr auto topicModellingDefaultMinCf {1}
 The default minimum frequency of a token in the corpus.
 
constexpr auto topicModellingDefaultMinDf {1}
 The default minimum document frequency of a token.
 
constexpr auto topicModellingDefaultOptimizeEvery {10}
 The default optimization interval for the model parameters, in training iterations.
 
constexpr auto topicModellingDefaultRemoveTopN {0}
 The default number of most-common tokens to ignore.
 
constexpr auto topicModellingDefaultNumberOfThreads {1}
 The default number of threads for training the model.
 
constexpr auto topicModellingDefaultAlpha {0.1F}
 The default initial hyperparameter for the Dirichlet distribution for document–table.
 
constexpr auto topicModellingDefaultConversionThreshold {0.F}
 The default threshold for topics to be included when converting an HDP to an LDA model.
 
constexpr auto topicModellingDefaultEta {0.01F}
 The default initial hyperparameter for the Dirichlet distribution for topic–token.
 
constexpr auto topicModellingDefaultGamma {0.1F}
 The default initial concentration coefficient of the Dirichlet Process for table–topic.
 
constexpr auto topicModellingDefaultDocIterations {100}
 The default maximum number of iterations for classifying a document.
 
constexpr auto topicModellingDefaultNumberOfWorkers {0}
 The default number of worker threads for inferring the topics of articles.
 
constexpr auto topicModellingDefaultMinLabelCf {1}
 The default minimum frequency of a topic label in the corpus.
 
constexpr auto topicModellingDefaultMinLabelDf {1}
 The default minimum document frequency of a topic label.
 
constexpr auto topicModellingDefaultMinLabelLength {2}
 The default minimum length of topic labels, in tokens.
 
constexpr auto topicModellingDefaultMaxLabelLength {5}
 The default maximum length of topic labels, in tokens.
 
constexpr auto topicModellingDefaultMaxLabelCandidates {10000}
 The default maximum number of topic label candidates to be extracted from the training data.
 
constexpr auto topicModellingDefaultLabelSmoothing {.1F}
 The default Laplace smoothing for the automated detection of topic labels.
 
constexpr auto topicModellingDefaultLabelMu {.25F}
 The default discriminative coefficient for the automated detection of topic labels.
 
constexpr auto topicModellingUpdateProgressEvery {1000}
 The number of added/saved articles after which the progress will be updated.
 
constexpr auto topicModellingUpdateProgressEveryDocs {25}
 The number of classified documents after which the progress will be updated.
 
constexpr auto topicModellingPrecisionLL {6}
 The number of digits of the log-likelihood to be logged.
 
constexpr auto topicModellingTargetColumns {2}
 The number of additional columns in the target table.
 
constexpr auto topicModellingTopicColumns {2}
 The number of additional columns in the topic table.
 
constexpr auto topicModellingColumnsPerLabel {2}
 The number of columns per top label.
 
constexpr auto topicModellingColumnsPerToken {2}
 The number of columns per top token.
 
constexpr auto topicModellingPrecisionUlp {5}
 Precision used when testing topic probabilities for equality, in ULPs (units in the last place).
 
constexpr auto wordsUpdateProgressEvery {100}
 Indicates after how many date groups the progress of the thread will be updated.
 
constexpr auto wordsNumberOfColumns {4}
 The number of columns to write to the target table.
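
All constants above are declared with `constexpr auto` and brace initialization, so their types follow from the literal used. The following sketch checks the deduced types for a handful of them under C++17 rules. The `sketch` namespace is hypothetical and only holds copies of the declarations; it is not part of crawlserv++.

```cpp
#include <string_view>
#include <type_traits>

using namespace std::literals::string_view_literals; // enables the "..."sv suffix

// Copies of a few declarations from the list above, in a hypothetical namespace.
namespace sketch {
    inline constexpr auto allTokensColumns{2};
    inline constexpr auto allTokensUpdateEveryDate{100U};
    inline constexpr auto sentimentPercentageFactor{100.F};
    inline constexpr auto topicModellingDirectory{"mdl"sv};
    inline constexpr auto assocAddColumns{2};
    inline constexpr auto assocMinColumns{assocAddColumns + 1};
}

// decltype yields the const-qualified type of each variable.
static_assert(std::is_same_v<decltype(sketch::allTokensColumns), const int>);
static_assert(std::is_same_v<decltype(sketch::allTokensUpdateEveryDate), const unsigned int>);
static_assert(std::is_same_v<decltype(sketch::sentimentPercentageFactor), const float>);
static_assert(std::is_same_v<decltype(sketch::topicModellingDirectory), const std::string_view>);

// Derived constants are evaluated at compile time as well.
static_assert(sketch::assocMinColumns == 3);
```

The `U` and `F` suffixes matter here: without them, the counters would be signed `int` and the factor a `double`.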
 

Detailed Description

Namespace for algorithm classes.

Typedef Documentation

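◆ AlgoThreadProperties

using crawlservpp::Module::Analyzer::Algo::AlgoThreadProperties = Struct::AlgoThreadProperties

◆ AlgoThreadPtr

using crawlservpp::Module::Analyzer::Algo::AlgoThreadPtr = std::unique_ptr< Module::Analyzer::Thread >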

Function Documentation

◆ initAlgo()

AlgoThreadPtr crawlservpp::Module::Analyzer::Algo::initAlgo (const AlgoThreadProperties &thread)

Creates an algorithm thread.

Use the REGISTER_ALGORITHM(ID, CLASS) macro to register an algorithm class.

The macro will check the algorithm ID inside the given properties and return the pointer to a new algorithm thread if it matches the algorithm that has been registered using the macro.

Parameters
thread  Constant reference to the properties of the algorithm thread to create.
Returns
The pointer to a new algorithm thread or nullptr if the algorithm ID specified in the given structure has not been registered.

References REGISTER_ALGORITHM.

Referenced by crawlservpp::Main::Server::tick().
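
The lookup described above can be pictured with a small sketch. It is not the project's actual macro, whose definition is not shown on this page. The `Thread` base class, the `algoId` field of `AlgoThreadProperties`, the two sample algorithm classes, and the `SKETCH_REGISTER_ALGORITHM` macro are all hypothetical stand-ins that only illustrate the documented behaviour: return a new thread on a matching ID, otherwise `nullptr`.

```cpp
#include <cstdint>
#include <memory>
#include <string_view>

// Hypothetical stand-ins; the real types live in crawlserv++ and are not reproduced here.
struct AlgoThreadProperties {
    std::uint32_t algoId; // assumed field holding the algorithm ID
};

struct Thread {
    virtual ~Thread() = default;
    virtual std::string_view name() const = 0;
};

struct CountTokens final : Thread {
    static constexpr std::uint32_t id{1};
    explicit CountTokens(const AlgoThreadProperties&) {}
    std::string_view name() const override { return "CountTokens"; }
};

struct DoNothing final : Thread {
    static constexpr std::uint32_t id{2};
    explicit DoNothing(const AlgoThreadProperties&) {}
    std::string_view name() const override { return "DoNothing"; }
};

using AlgoThreadPtr = std::unique_ptr<Thread>;

// Assumed shape of a registration macro: return a new thread if the ID matches.
#define SKETCH_REGISTER_ALGORITHM(ID, CLASS) \
    if (thread.algoId == (ID)) { \
        return std::make_unique<CLASS>(thread); \
    }

AlgoThreadPtr initAlgoSketch(const AlgoThreadProperties& thread) {
    SKETCH_REGISTER_ALGORITHM(CountTokens::id, CountTokens)
    SKETCH_REGISTER_ALGORITHM(DoNothing::id, DoNothing)

    return nullptr; // the algorithm ID has not been registered
}
```

With this shape, adding an algorithm means adding one macro line to the function body, and callers only need to check the returned pointer for `nullptr`.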

Variable Documentation

◆ allTokensColumns

constexpr auto crawlservpp::Module::Analyzer::Algo::allTokensColumns {2}
inline

The number of columns in the tokens table.

Referenced by crawlservpp::Module::Analyzer::Algo::AllTokens::resetAlgo().

◆ allTokensUpdateEveryArticle

constexpr auto crawlservpp::Module::Analyzer::Algo::allTokensUpdateEveryArticle {1000U}
inline

Indicates after how many articles the status will be updated, if no date map, but an article map is available.

Referenced by crawlservpp::Module::Analyzer::Algo::AllTokens::onAlgoTick().

◆ allTokensUpdateEveryDate

constexpr auto crawlservpp::Module::Analyzer::Algo::allTokensUpdateEveryDate {100U}
inline

Indicates after how many dates the status will be updated, if a date map is available.

Referenced by crawlservpp::Module::Analyzer::Algo::AllTokens::onAlgoTick().

◆ allTokensUpdateEveryRow

constexpr auto crawlservpp::Module::Analyzer::Algo::allTokensUpdateEveryRow {1000U}
inline

Indicates after how many rows the status will be updated while saving the results to the database.

Referenced by crawlservpp::Module::Analyzer::Algo::AllTokens::resetAlgo().

◆ allTokensUpdateEveryToken

constexpr auto crawlservpp::Module::Analyzer::Algo::allTokensUpdateEveryToken {10000U}
inline

Indicates after how many tokens the status will be updated, if neither a date map nor an article map is available.

Referenced by crawlservpp::Module::Analyzer::Algo::AllTokens::onAlgoTick().
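
The `...UpdateEvery...` constants all describe the same pattern: do the work item by item, but report the status only at a fixed interval. A minimal sketch of that pattern follows, assuming a plain counting loop. The function `processTokensSketch` and its `report` callback are hypothetical and do not reflect how `AllTokens::onAlgoTick()` is actually written.

```cpp
#include <cstddef>
#include <functional>

// Value copied from the documentation above.
inline constexpr auto allTokensUpdateEveryToken{10000U};

// Hypothetical helper: walks over `total` tokens and calls `report` with the
// fraction done after every `allTokensUpdateEveryToken` tokens.
// Returns the number of progress updates that were issued.
std::size_t processTokensSketch(std::size_t total, const std::function<void(float)>& report) {
    std::size_t updates{0};

    for (std::size_t n{1}; n <= total; ++n) {
        // ... per-token work would go here ...

        if (n % allTokensUpdateEveryToken == 0) {
            report(static_cast<float>(n) / static_cast<float>(total));
            ++updates;
        }
    }

    return updates;
}
```

A larger interval means fewer status writes at the cost of a coarser progress display, which is presumably why the interval grows from dates (100) to articles (1000) to tokens (10000).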

◆ assocAddColumns

constexpr auto crawlservpp::Module::Analyzer::Algo::assocAddColumns {2}
inline

Number of extra columns included in a dataset (except date).

◆ assocMinColumns

constexpr auto crawlservpp::Module::Analyzer::Algo::assocMinColumns {assocAddColumns + 1 }
inline

Minimum number of columns included in a dataset (including date).

Referenced by crawlservpp::Module::Analyzer::Algo::Assoc::onAlgoInitTarget(), and crawlservpp::Module::Analyzer::Algo::Assoc::resetAlgo().

◆ assocOverTimeAddColumns

constexpr auto crawlservpp::Module::Analyzer::Algo::assocOverTimeAddColumns {2}
inline

Number of extra columns included in a dataset (except date).

Referenced by crawlservpp::Module::Analyzer::Algo::AssocOverTime::resetAlgo().

◆ assocOverTimeMinColumns

constexpr auto crawlservpp::Module::Analyzer::Algo::assocOverTimeMinColumns {assocOverTimeAddColumns + 1 }
inline

Minimum number of columns included in a dataset (including date).

Referenced by crawlservpp::Module::Analyzer::Algo::AssocOverTime::onAlgoInitTarget(), and crawlservpp::Module::Analyzer::Algo::AssocOverTime::resetAlgo().

◆ assocOverTimeUpdateProgressEvery

constexpr auto crawlservpp::Module::Analyzer::Algo::assocOverTimeUpdateProgressEvery {100}
inline

Indicates, while saving, after how many rows the progress of the thread will be updated.

Referenced by crawlservpp::Module::Analyzer::Algo::AssocOverTime::resetAlgo().

◆ assocUpdateProgressEvery

constexpr auto crawlservpp::Module::Analyzer::Algo::assocUpdateProgressEvery {1000}
inline

Indicates, while saving, after how many articles the progress of the thread will be updated.

Referenced by crawlservpp::Module::Analyzer::Algo::Assoc::resetAlgo().

◆ corpusNumFields

constexpr auto crawlservpp::Module::Analyzer::Algo::corpusNumFields {9}
inline

Number of target fields.

◆ extractIdsUpdateProgressEvery

constexpr auto crawlservpp::Module::Analyzer::Algo::extractIdsUpdateProgressEvery {1000}
inline

Indicates after how many articles the progress of the thread will be updated.

Referenced by crawlservpp::Module::Analyzer::Algo::ExtractIds::resetAlgo().

◆ sentimentArticleColumnsPerCategory

constexpr auto crawlservpp::Module::Analyzer::Algo::sentimentArticleColumnsPerCategory {4}
inline

Number of columns per category if article-based sentiment is activated.

Referenced by crawlservpp::Module::Analyzer::Algo::SentimentOverTime::onAlgoInitTarget(), and crawlservpp::Module::Analyzer::Algo::SentimentOverTime::resetAlgo().

◆ sentimentDefaultThreshold

constexpr auto crawlservpp::Module::Analyzer::Algo::sentimentDefaultThreshold {10}
inline

The default threshold (sentiments lower than that number will be ignored).

◆ sentimentDictionary

constexpr auto crawlservpp::Module::Analyzer::Algo::sentimentDictionary {"sentiment-en"sv}
inline

The default sentiment dictionary to be used.

◆ sentimentEmojis

constexpr auto crawlservpp::Module::Analyzer::Algo::sentimentEmojis {"emojis-en"sv}
inline

The default emoji dictionary to be used.

◆ sentimentMinColumnsPerCategory

constexpr auto crawlservpp::Module::Analyzer::Algo::sentimentMinColumnsPerCategory {2}
inline

Number of columns per category if article-based sentiment is deactivated.

Referenced by crawlservpp::Module::Analyzer::Algo::SentimentOverTime::onAlgoInitTarget(), and crawlservpp::Module::Analyzer::Algo::SentimentOverTime::resetAlgo().

◆ sentimentMinNumColumns

constexpr auto crawlservpp::Module::Analyzer::Algo::sentimentMinNumColumns {1}
inline

Number of default columns to be written to the target table.

◆ sentimentPercentageFactor

constexpr auto crawlservpp::Module::Analyzer::Algo::sentimentPercentageFactor {100.F}
inline

Factor to convert value to percentage.

Referenced by crawlservpp::Module::Analyzer::Algo::SentimentOverTime::resetAlgo().
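
The sentiment column constants suggest a simple layout for the target table. The sketch below spells out one plausible reading. The formula (default columns plus a fixed number of columns per category) is an assumption drawn from the descriptions, not from the `SentimentOverTime` source, and both helper functions are hypothetical.

```cpp
#include <cstddef>

// Values copied from the documentation above.
inline constexpr auto sentimentMinNumColumns{1};
inline constexpr auto sentimentMinColumnsPerCategory{2};
inline constexpr auto sentimentArticleColumnsPerCategory{4};
inline constexpr auto sentimentPercentageFactor{100.F};

// ASSUMPTION: total columns = default columns + categories * columns per category.
constexpr std::size_t targetColumnsSketch(std::size_t categories, bool articleBased) {
    const std::size_t perCategory = articleBased
            ? sentimentArticleColumnsPerCategory
            : sentimentMinColumnsPerCategory;

    return sentimentMinNumColumns + categories * perCategory;
}

// Converts a fraction (e.g. 0.25) into a percentage (25).
constexpr float toPercentSketch(float fraction) {
    return fraction * sentimentPercentageFactor;
}

static_assert(targetColumnsSketch(3, false) == 7);
static_assert(targetColumnsSketch(3, true) == 13);
```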

◆ sentimentUpdateCalculateProgressEvery

constexpr auto crawlservpp::Module::Analyzer::Algo::sentimentUpdateCalculateProgressEvery {250000}
inline

Indicates, while calculating, after how many sentences the progress of the thread will be updated.

Referenced by crawlservpp::Module::Analyzer::Algo::SentimentOverTime::resetAlgo().

◆ sentimentUpdateSavingProgressEvery

constexpr auto crawlservpp::Module::Analyzer::Algo::sentimentUpdateSavingProgressEvery {10}
inline

Indicates, while saving, after how many rows the progress of the thread will be updated.

Referenced by crawlservpp::Module::Analyzer::Algo::SentimentOverTime::resetAlgo().

◆ topicModellingColumnsPerLabel

constexpr auto crawlservpp::Module::Analyzer::Algo::topicModellingColumnsPerLabel {2}
inline

The number of columns per top label.

Referenced by crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo().

◆ topicModellingColumnsPerToken

constexpr auto crawlservpp::Module::Analyzer::Algo::topicModellingColumnsPerToken {2}
inline

The number of columns per top token.

Referenced by crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo().

◆ topicModellingDefaultAlpha

constexpr auto crawlservpp::Module::Analyzer::Algo::topicModellingDefaultAlpha {0.1F}
inline

The default initial hyperparameter for the Dirichlet distribution for document–table.

◆ topicModellingDefaultBurnIn

constexpr auto crawlservpp::Module::Analyzer::Algo::topicModellingDefaultBurnIn {100}
inline

The default number of burn-in iterations.

"Burned in" iterations will be skipped before starting to train the model.

◆ topicModellingDefaultConversionThreshold

constexpr auto crawlservpp::Module::Analyzer::Algo::topicModellingDefaultConversionThreshold {0.F}
inline

The default threshold for topics to be included when converting an HDP to an LDA model.

◆ topicModellingDefaultDocIterations

constexpr auto crawlservpp::Module::Analyzer::Algo::topicModellingDefaultDocIterations {100}
inline

The default maximum number of iterations for classifying a document.

◆ topicModellingDefaultEta

constexpr auto crawlservpp::Module::Analyzer::Algo::topicModellingDefaultEta {0.01F}
inline

The default initial hyperparameter for the Dirichlet distribution for topic–token.

◆ topicModellingDefaultGamma

constexpr auto crawlservpp::Module::Analyzer::Algo::topicModellingDefaultGamma {0.1F}
inline

The default initial concentration coefficient of the Dirichlet Process for table–topic.

Will be ignored if the LDA algorithm is used instead of the HDP algorithm, i.e. if a fixed number of topics is set.

◆ topicModellingDefaultIterations

constexpr auto crawlservpp::Module::Analyzer::Algo::topicModellingDefaultIterations {1000}
inline

The default number of iterations to train the model.

◆ topicModellingDefaultIterationsAtOnce

constexpr auto crawlservpp::Module::Analyzer::Algo::topicModellingDefaultIterationsAtOnce {25}
inline

The default number of iterations to train the model at once.
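
The two iteration defaults relate a total to a batch size: with 1000 iterations and 25 at once, training would take 40 batches. The sketch below shows such a batched loop. It assumes that training is split this way so that other work (for example, status updates) can happen between batches. `trainInChunksSketch` and its `trainStep` callback are hypothetical and stand in for whatever actually trains the model.

```cpp
#include <algorithm>
#include <cstddef>

// Values copied from the documentation above.
inline constexpr auto topicModellingDefaultIterations{1000};
inline constexpr auto topicModellingDefaultIterationsAtOnce{25};

// Hypothetical training loop: `trainStep(n)` is called with the number of
// iterations to run in each batch. Returns the number of batches.
template<typename TrainStep>
std::size_t trainInChunksSketch(
        TrainStep trainStep,
        int total = topicModellingDefaultIterations,
        int atOnce = topicModellingDefaultIterationsAtOnce
) {
    std::size_t chunks{0};

    if (atOnce <= 0) {
        return chunks; // guard against a loop that never advances
    }

    for (int done{0}; done < total;) {
        const int n{std::min(atOnce, total - done)}; // last batch may be smaller

        trainStep(n);

        done += n;
        ++chunks;
    }

    return chunks;
}
```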

◆ topicModellingDefaultLabelMu

constexpr auto crawlservpp::Module::Analyzer::Algo::topicModellingDefaultLabelMu {.25F}
inline

The default discriminative coefficient for the automated detection of topic labels.

◆ topicModellingDefaultLabelSmoothing

constexpr auto crawlservpp::Module::Analyzer::Algo::topicModellingDefaultLabelSmoothing {.1F}
inline

The default Laplace smoothing for the automated detection of topic labels.

◆ topicModellingDefaultMaxLabelCandidates

constexpr auto crawlservpp::Module::Analyzer::Algo::topicModellingDefaultMaxLabelCandidates {10000}
inline

The default maximum number of topic label candidates to be extracted from the training data.

◆ topicModellingDefaultMaxLabelLength

constexpr auto crawlservpp::Module::Analyzer::Algo::topicModellingDefaultMaxLabelLength {5}
inline

The default maximum length of topic labels, in tokens.

◆ topicModellingDefaultMinCf

constexpr auto crawlservpp::Module::Analyzer::Algo::topicModellingDefaultMinCf {1}
inline

The default minimum frequency of a token in the corpus.

◆ topicModellingDefaultMinDf

constexpr auto crawlservpp::Module::Analyzer::Algo::topicModellingDefaultMinDf {1}
inline

The default minimum document frequency of a token.

◆ topicModellingDefaultMinLabelCf

constexpr auto crawlservpp::Module::Analyzer::Algo::topicModellingDefaultMinLabelCf {1}
inline

The default minimum frequency of a topic label in the corpus.

◆ topicModellingDefaultMinLabelDf

constexpr auto crawlservpp::Module::Analyzer::Algo::topicModellingDefaultMinLabelDf {1}
inline

The default minimum document frequency of a topic label.

◆ topicModellingDefaultMinLabelLength

constexpr auto crawlservpp::Module::Analyzer::Algo::topicModellingDefaultMinLabelLength {2}
inline

The default minimum length of topic labels, in tokens.

◆ topicModellingDefaultNumberOfThreads

constexpr auto crawlservpp::Module::Analyzer::Algo::topicModellingDefaultNumberOfThreads {1}
inline

The default number of threads for training the model.

◆ topicModellingDefaultNumberOfTopics

constexpr auto crawlservpp::Module::Analyzer::Algo::topicModellingDefaultNumberOfTopics {2}
inline

The default number of initial topics.

Will be changed according to the data if the HDP (and not the LDA) algorithm is used, i.e. if the number of topics is not set to be fixed.

◆ topicModellingDefaultNumberOfTopicTokens

constexpr auto crawlservpp::Module::Analyzer::Algo::topicModellingDefaultNumberOfTopicTokens {5}
inline

The default number of most-probable tokens for each detected topic.

This number of most-probable tokens for each detected topic will be written to the provided topic table.

◆ topicModellingDefaultNumberOfWorkers

constexpr auto crawlservpp::Module::Analyzer::Algo::topicModellingDefaultNumberOfWorkers {0}
inline

The default number of worker threads for inferring the topics of articles.

◆ topicModellingDefaultOptimizeEvery

constexpr auto crawlservpp::Module::Analyzer::Algo::topicModellingDefaultOptimizeEvery {10}
inline

The default optimization interval for the model parameters, in training iterations.

◆ topicModellingDefaultRemoveTopN

constexpr auto crawlservpp::Module::Analyzer::Algo::topicModellingDefaultRemoveTopN {0}
inline

The default number of most-common tokens to ignore.

◆ topicModellingDirectory

constexpr auto crawlservpp::Module::Analyzer::Algo::topicModellingDirectory {"mdl"sv}
inline

The directory for model files.

Referenced by crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo().

◆ topicModellingPrecisionLL

constexpr auto crawlservpp::Module::Analyzer::Algo::topicModellingPrecisionLL {6}
inline

The number of digits of the log-likelihood to be logged.

Referenced by crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo().
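
A precision constant like this is typically passed to `std::setprecision` when writing to a stream. The sketch below assumes that "digits" means significant digits in the default floating-point format; the page does not say whether the project uses that or a fixed number of decimals. `formatLogLikelihoodSketch` is a hypothetical helper.

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Value copied from the documentation above.
inline constexpr auto topicModellingPrecisionLL{6};

// ASSUMPTION: "digits" means significant digits in the default float format.
inline std::string formatLogLikelihoodSketch(double ll) {
    std::ostringstream out;

    out << std::setprecision(topicModellingPrecisionLL) << ll;

    return out.str();
}
```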

◆ topicModellingPrecisionUlp

constexpr auto crawlservpp::Module::Analyzer::Algo::topicModellingPrecisionUlp {5}
inline

Precision used when testing topic probabilities for equality, in ULPs (units in the last place).

Referenced by crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo().
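
Comparing floating-point probabilities with `==` is fragile, which is why a tolerance in ULPs is useful. The sketch below uses a common epsilon-scaling approach: the machine epsilon is scaled by the magnitude of the operands and by the desired number of ULPs. Whether crawlserv++ uses exactly this test is an assumption, and `almostEqualSketch` is a hypothetical helper.

```cpp
#include <cmath>
#include <limits>

// Value copied from the documentation above.
inline constexpr auto topicModellingPrecisionUlp{5};

// Hypothetical helper: true if a and b differ by at most `ulp` units in the
// last place, scaled to their magnitude, or if the difference is subnormal.
inline bool almostEqualSketch(float a, float b, int ulp = topicModellingPrecisionUlp) {
    const float diff{std::fabs(a - b)};

    return diff <= std::numeric_limits<float>::epsilon()
                    * std::fabs(a + b) * static_cast<float>(ulp)
            || diff < std::numeric_limits<float>::min();
}
```

Two topics whose probabilities differ only by rounding noise then count as tied, while a visible difference such as 0.30 versus 0.31 does not.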

◆ topicModellingTargetColumns

constexpr auto crawlservpp::Module::Analyzer::Algo::topicModellingTargetColumns {2}
inline

The number of additional columns in the target table.

Referenced by crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo().

◆ topicModellingTopicColumns

constexpr auto crawlservpp::Module::Analyzer::Algo::topicModellingTopicColumns {2}
inline

The number of additional columns in the topic table.

Referenced by crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo().

◆ topicModellingUpdateProgressEvery

constexpr auto crawlservpp::Module::Analyzer::Algo::topicModellingUpdateProgressEvery {1000}
inline

The number of added/saved articles after which the progress will be updated.

Referenced by crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo().

◆ topicModellingUpdateProgressEveryDocs

constexpr auto crawlservpp::Module::Analyzer::Algo::topicModellingUpdateProgressEveryDocs {25}
inline

The number of classified documents after which the progress will be updated.

Referenced by crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo().

◆ wordsNumberOfColumns

constexpr auto crawlservpp::Module::Analyzer::Algo::wordsNumberOfColumns {4}
inline

The number of columns to write to the target table.

◆ wordsUpdateProgressEvery

constexpr auto crawlservpp::Module::Analyzer::Algo::wordsUpdateProgressEvery {100}
inline

Indicates after how many date groups the progress of the thread will be updated.

Referenced by crawlservpp::Module::Analyzer::Algo::WordsOverTime::resetAlgo().