crawlserv++
[under development]
Application for crawling and analyzing textual content of websites.
Namespace for algorithm classes. More...
Classes | |
| class | AllTokens |
| Counts all tokens in a corpus. More... | |
| class | Assoc |
| Empty algorithm template. More... | |
| class | AssocOverTime |
| Empty algorithm template. More... | |
| class | CorpusGenerator |
| Algorithm building a text corpus and creating corpus statistics from the input data. More... | |
| class | Empty |
| Empty algorithm template. More... | |
| class | ExtractIds |
| Extracts the parsed IDs from a filtered corpus. More... | |
| class | SentimentOverTime |
| Sentiment analysis using the VADER algorithm. More... | |
| class | TermsOverTime |
| Algorithm counting specific terms in a text corpus over time. More... | |
| class | TopicModelling |
| Topic Modeller. More... | |
| class | WordsOverTime |
| Counts the occurrence of articles, sentences, and tokens in a corpus over time. More... | |
Typedefs | |
| using | AlgoThreadProperties = Struct::AlgoThreadProperties |
| using | AlgoThreadPtr = std::unique_ptr< Module::Analyzer::Thread > |
Registration | |
| AlgoThreadPtr | initAlgo (const AlgoThreadProperties &thread) |
| Creates an algorithm thread. More... | |
Constants | |
| constexpr auto | allTokensColumns {2} |
| The number of columns in the tokens table. More... | |
| constexpr auto | allTokensUpdateEveryDate {100U} |
| Indicates after how many dates the status will be updated, if a date map is available. More... | |
| constexpr auto | allTokensUpdateEveryArticle {1000U} |
| Indicates after how many articles the status will be updated, if no date map, but an article map is available. More... | |
| constexpr auto | allTokensUpdateEveryToken {10000U} |
| Indicates after how many tokens the status will be updated, if no date and no article map is available. More... | |
| constexpr auto | allTokensUpdateEveryRow {1000U} |
| Indicates after how many rows the status will be updated while saving the results to the database. More... | |
| constexpr auto | assocUpdateProgressEvery {1000} |
| Indicates, while saving, after how many articles the progress of the thread will be updated. More... | |
| constexpr auto | assocAddColumns {2} |
| Number of extra columns included in a dataset (except date). More... | |
| constexpr auto | assocMinColumns {assocAddColumns + 1 } |
| Minimum number of columns included in a dataset (including date). More... | |
| constexpr auto | assocOverTimeUpdateProgressEvery {100} |
| Indicates, while saving, after how many rows the progress of the thread will be updated. More... | |
| constexpr auto | assocOverTimeAddColumns {2} |
| Number of extra columns included in a dataset (except date). More... | |
| constexpr auto | assocOverTimeMinColumns {assocOverTimeAddColumns + 1 } |
| Minimum number of columns included in a dataset (including date). More... | |
| constexpr auto | corpusNumFields {9} |
| Number of target fields. More... | |
| constexpr auto | extractIdsUpdateProgressEvery {1000} |
| Indicates after how many articles the progress of the thread will be updated. More... | |
| constexpr auto | sentimentUpdateCalculateProgressEvery {250000} |
| Indicates, while calculating, after how many sentences the progress of the thread will be updated. More... | |
| constexpr auto | sentimentUpdateSavingProgressEvery {10} |
| Indicates, while saving, after how many rows the progress of the thread will be updated. More... | |
| constexpr auto | sentimentMinNumColumns {1} |
| Number of default columns to be written to the target table. More... | |
| constexpr auto | sentimentMinColumnsPerCategory {2} |
| Number of columns per category if article-based sentiment is deactivated. More... | |
| constexpr auto | sentimentArticleColumnsPerCategory {4} |
| Number of columns per category if article-based sentiment is activated. More... | |
| constexpr auto | sentimentDefaultThreshold {10} |
| The default threshold (sentiments lower than that number will be ignored). More... | |
| constexpr auto | sentimentDictionary {"sentiment-en"sv} |
| The default sentiment dictionary to be used. More... | |
| constexpr auto | sentimentEmojis {"emojis-en"sv} |
| The default emoji dictionary to be used. More... | |
| constexpr auto | sentimentPercentageFactor {100.F} |
| Factor to convert value to percentage. More... | |
| constexpr auto | topicModellingDirectory {"mdl"sv} |
| The directory for model files. More... | |
| constexpr auto | topicModellingDefaultNumberOfTopics {2} |
| The default number of initial topics. More... | |
| constexpr auto | topicModellingDefaultNumberOfTopicTokens {5} |
| The default number of most-probable tokens for each detected topic. More... | |
| constexpr auto | topicModellingDefaultBurnIn {100} |
| The default number of burn-in iterations. More... | |
| constexpr auto | topicModellingDefaultIterations {1000} |
| The default number of iterations to train the model. More... | |
| constexpr auto | topicModellingDefaultIterationsAtOnce {25} |
| The default number of iterations to train the model at once. More... | |
| constexpr auto | topicModellingDefaultMinCf {1} |
| The default number of a token's minimum frequency in the corpus. More... | |
| constexpr auto | topicModellingDefaultMinDf {1} |
| The default number of a token's minimum document frequency. More... | |
| constexpr auto | topicModellingDefaultOptimizeEvery {10} |
| The default optimization interval for the model parameters, in training iterations. More... | |
| constexpr auto | topicModellingDefaultRemoveTopN {0} |
| The default number of most-common tokens to ignore. More... | |
| constexpr auto | topicModellingDefaultNumberOfThreads {1} |
| The default number of threads for training the model. More... | |
| constexpr auto | topicModellingDefaultAlpha {0.1F} |
| The default initial hyperparameter for the Dirichlet distribution for document–table. More... | |
| constexpr auto | topicModellingDefaultConversionThreshold {0.F} |
| The default threshold for topics to be included when converting an HDP to an LDA model. More... | |
| constexpr auto | topicModellingDefaultEta {0.01F} |
| The default initial hyperparameter for the Dirichlet distribution for topic–token. More... | |
| constexpr auto | topicModellingDefaultGamma {0.1F} |
| The default initial concentration coefficient of the Dirichlet Process for table–topic. More... | |
| constexpr auto | topicModellingDefaultDocIterations {100} |
| The default number of maximum iterations to classify a document. More... | |
| constexpr auto | topicModellingDefaultNumberOfWorkers {0} |
| The default number of worker threads for inferring the topics of articles. More... | |
| constexpr auto | topicModellingDefaultMinLabelCf {1} |
| The default number of a topic label's minimum frequency in the corpus. More... | |
| constexpr auto | topicModellingDefaultMinLabelDf {1} |
| The default number of a topic label's minimum document frequency. More... | |
| constexpr auto | topicModellingDefaultMinLabelLength {2} |
| The default minimum length of topic labels, in tokens. More... | |
| constexpr auto | topicModellingDefaultMaxLabelLength {5} |
| The default maximum length of topic labels, in tokens. More... | |
| constexpr auto | topicModellingDefaultMaxLabelCandidates {10000} |
| The default maximum number of topic label candidates to be extracted from the training data. More... | |
| constexpr auto | topicModellingDefaultLabelSmoothing {.1F} |
| The default Laplace smoothing for the automated detection of topic labels. More... | |
| constexpr auto | topicModellingDefaultLabelMu {.25F} |
| The default discriminative coefficient for the automated detection of topic labels. More... | |
| constexpr auto | topicModellingUpdateProgressEvery {1000} |
| The number of added/saved articles after which the progress will be updated. More... | |
| constexpr auto | topicModellingUpdateProgressEveryDocs {25} |
| The number of classified documents after which the progress will be updated. More... | |
| constexpr auto | topicModellingPrecisionLL {6} |
| The number of digits of the log-likelihood to be logged. More... | |
| constexpr auto | topicModellingTargetColumns {2} |
| The number of additional columns in the target table. More... | |
| constexpr auto | topicModellingTopicColumns {2} |
| The number of additional columns in the topic table. More... | |
| constexpr auto | topicModellingColumnsPerLabel {2} |
| The number of columns per top label. More... | |
| constexpr auto | topicModellingColumnsPerToken {2} |
| The number of columns per top token. More... | |
| constexpr auto | topicModellingPrecisionUlp {5} |
| Precision used when testing topic probabilities for equality, in ULPs (units in the last place). More... | |
| constexpr auto | wordsUpdateProgressEvery {100} |
| Indicates after how many date groups the progress of the thread will be updated. More... | |
| constexpr auto | wordsNumberOfColumns {4} |
| The number of columns to write to the target table. More... | |
Namespace for algorithm classes.
| using crawlservpp::Module::Analyzer::Algo::AlgoThreadProperties = Struct::AlgoThreadProperties |
| using crawlservpp::Module::Analyzer::Algo::AlgoThreadPtr = std::unique_ptr<Module::Analyzer::Thread> |
| AlgoThreadPtr crawlservpp::Module::Analyzer::Algo::initAlgo (const AlgoThreadProperties &thread) |
Creates an algorithm thread.
Use the REGISTER_ALGORITHM macro to register an algorithm class.
The macro will check the algorithm ID inside the given properties and return the pointer to a new algorithm thread if it matches the algorithm that has been registered using the macro.
Parameters
| thread | Constant reference to the properties of the algorithm thread to create. |
Returns
Pointer to the new algorithm thread, or nullptr if the algorithm ID specified in the given structure has not been registered.
References REGISTER_ALGORITHM.
Referenced by crawlservpp::Main::Server::tick().
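The registration mechanism described above can be illustrated with a minimal, self-contained sketch. It is not the actual definition of REGISTER_ALGORITHM: the types `Thread` and `AlgoThreadProperties`, the member `algoId`, the static `id` constants, and the names `REGISTER_ALGORITHM_SKETCH` and `initAlgoSketch` are hypothetical stand-ins chosen for illustration only.

```cpp
#include <cstdint>
#include <memory>
#include <string_view>

// Hypothetical stand-in for Module::Analyzer::Thread.
struct Thread {
    virtual ~Thread() = default;
    virtual std::string_view name() const = 0;
};

// Hypothetical stand-in for Struct::AlgoThreadProperties.
struct AlgoThreadProperties {
    std::uint32_t algoId{};
};

using AlgoThreadPtr = std::unique_ptr<Thread>;

// Two hypothetical algorithm classes, each with a unique ID (assumed design).
struct AllTokensSketch : Thread {
    static constexpr std::uint32_t id{1};
    explicit AllTokensSketch(const AlgoThreadProperties&) {}
    std::string_view name() const override { return "AllTokens"; }
};

struct WordsOverTimeSketch : Thread {
    static constexpr std::uint32_t id{2};
    explicit WordsOverTimeSketch(const AlgoThreadProperties&) {}
    std::string_view name() const override { return "WordsOverTime"; }
};

// Sketch of a registration macro: if the requested ID matches the class,
// return a pointer to a new thread of that class.
#define REGISTER_ALGORITHM_SKETCH(CLASS) \
    if(thread.algoId == CLASS::id) { \
        return std::make_unique<CLASS>(thread); \
    }

AlgoThreadPtr initAlgoSketch(const AlgoThreadProperties& thread) {
    REGISTER_ALGORITHM_SKETCH(AllTokensSketch)
    REGISTER_ALGORITHM_SKETCH(WordsOverTimeSketch)

    return nullptr; // the algorithm ID has not been registered
}
```

A caller would check the returned pointer against nullptr before using the thread, since an unknown algorithm ID yields no thread.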
| constexpr auto crawlservpp::Module::Analyzer::Algo::allTokensColumns {2} | inline |
The number of columns in the tokens table.
Referenced by crawlservpp::Module::Analyzer::Algo::AllTokens::resetAlgo().
| constexpr auto crawlservpp::Module::Analyzer::Algo::allTokensUpdateEveryArticle {1000U} | inline |
Indicates after how many articles the status will be updated, if no date map, but an article map is available.
Referenced by crawlservpp::Module::Analyzer::Algo::AllTokens::onAlgoTick().
| constexpr auto crawlservpp::Module::Analyzer::Algo::allTokensUpdateEveryDate {100U} | inline |
Indicates after how many dates the status will be updated, if a date map is available.
Referenced by crawlservpp::Module::Analyzer::Algo::AllTokens::onAlgoTick().
| constexpr auto crawlservpp::Module::Analyzer::Algo::allTokensUpdateEveryRow {1000U} | inline |
Indicates after how many rows the status will be updated while saving the results to the database.
Referenced by crawlservpp::Module::Analyzer::Algo::AllTokens::resetAlgo().
| constexpr auto crawlservpp::Module::Analyzer::Algo::allTokensUpdateEveryToken {10000U} | inline |
Indicates after how many tokens the status will be updated, if no date and no article map is available.
Referenced by crawlservpp::Module::Analyzer::Algo::AllTokens::onAlgoTick().
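How such an update interval is typically applied can be shown with a short sketch. Only the value 10000U is taken from this page; the function `processTokens` and its `setProgress` callback are hypothetical and do not reflect the actual implementation of AllTokens::onAlgoTick().

```cpp
#include <cstddef>
#include <functional>

// Value as documented on this page.
constexpr auto allTokensUpdateEveryToken{10000U};

// Hypothetical helper: walks over `total` tokens and reports the progress
// once every `allTokensUpdateEveryToken` tokens.
// Returns the number of progress updates that were sent.
std::size_t processTokens(
        std::size_t total,
        const std::function<void(float)>& setProgress
) {
    std::size_t updates{};

    for(std::size_t n{1}; n <= total; ++n) {
        // ... the token with the (one-based) index n would be counted here ...

        if(n % allTokensUpdateEveryToken == 0) {
            setProgress(static_cast<float>(n) / static_cast<float>(total));

            ++updates;
        }
    }

    return updates;
}
```

Updating the status only at fixed intervals keeps the cost of status reporting small compared to the counting itself.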
| constexpr auto crawlservpp::Module::Analyzer::Algo::assocAddColumns {2} | inline |
Number of extra columns included in a dataset (except date).
| constexpr auto crawlservpp::Module::Analyzer::Algo::assocMinColumns {assocAddColumns + 1} | inline |
Minimum number of columns included in a dataset (including date).
Referenced by crawlservpp::Module::Analyzer::Algo::Assoc::onAlgoInitTarget(), and crawlservpp::Module::Analyzer::Algo::Assoc::resetAlgo().
| constexpr auto crawlservpp::Module::Analyzer::Algo::assocOverTimeAddColumns {2} | inline |
Number of extra columns included in a dataset (except date).
Referenced by crawlservpp::Module::Analyzer::Algo::AssocOverTime::resetAlgo().
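| constexpr auto crawlservpp::Module::Analyzer::Algo::assocOverTimeMinColumns {assocOverTimeAddColumns + 1} | inline |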
Minimum number of columns included in a dataset (including date).
Referenced by crawlservpp::Module::Analyzer::Algo::AssocOverTime::onAlgoInitTarget(), and crawlservpp::Module::Analyzer::Algo::AssocOverTime::resetAlgo().
| constexpr auto crawlservpp::Module::Analyzer::Algo::assocOverTimeUpdateProgressEvery {100} | inline |
Indicates, while saving, after how many rows the progress of the thread will be updated.
Referenced by crawlservpp::Module::Analyzer::Algo::AssocOverTime::resetAlgo().
| constexpr auto crawlservpp::Module::Analyzer::Algo::assocUpdateProgressEvery {1000} | inline |
Indicates, while saving, after how many articles the progress of the thread will be updated.
Referenced by crawlservpp::Module::Analyzer::Algo::Assoc::resetAlgo().
| constexpr auto crawlservpp::Module::Analyzer::Algo::corpusNumFields {9} | inline |
Number of target fields.
Referenced by crawlservpp::Module::Analyzer::Algo::CorpusGenerator::onAlgoInit(), and crawlservpp::Module::Analyzer::Algo::CorpusGenerator::onAlgoInitTarget().
| constexpr auto crawlservpp::Module::Analyzer::Algo::extractIdsUpdateProgressEvery {1000} | inline |
Indicates after how many articles the progress of the thread will be updated.
Referenced by crawlservpp::Module::Analyzer::Algo::ExtractIds::resetAlgo().
| constexpr auto crawlservpp::Module::Analyzer::Algo::sentimentArticleColumnsPerCategory {4} | inline |
Number of columns per category if article-based sentiment is activated.
Referenced by crawlservpp::Module::Analyzer::Algo::SentimentOverTime::onAlgoInitTarget(), and crawlservpp::Module::Analyzer::Algo::SentimentOverTime::resetAlgo().
| constexpr auto crawlservpp::Module::Analyzer::Algo::sentimentDefaultThreshold {10} | inline |
The default threshold (sentiments lower than that number will be ignored).
| constexpr auto crawlservpp::Module::Analyzer::Algo::sentimentDictionary {"sentiment-en"sv} | inline |
The default sentiment dictionary to be used.
| constexpr auto crawlservpp::Module::Analyzer::Algo::sentimentEmojis {"emojis-en"sv} | inline |
The default emoji dictionary to be used.
| constexpr auto crawlservpp::Module::Analyzer::Algo::sentimentMinColumnsPerCategory {2} | inline |
Number of columns per category if article-based sentiment is deactivated.
Referenced by crawlservpp::Module::Analyzer::Algo::SentimentOverTime::onAlgoInitTarget(), and crawlservpp::Module::Analyzer::Algo::SentimentOverTime::resetAlgo().
| constexpr auto crawlservpp::Module::Analyzer::Algo::sentimentMinNumColumns {1} | inline |
Number of default columns to be written to the target table.
Referenced by crawlservpp::Module::Analyzer::Algo::SentimentOverTime::onAlgoInitTarget(), and crawlservpp::Module::Analyzer::Algo::SentimentOverTime::resetAlgo().
| constexpr auto crawlservpp::Module::Analyzer::Algo::sentimentPercentageFactor {100.F} | inline |
Factor to convert value to percentage.
Referenced by crawlservpp::Module::Analyzer::Algo::SentimentOverTime::resetAlgo().
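| constexpr auto crawlservpp::Module::Analyzer::Algo::sentimentUpdateCalculateProgressEvery {250000} | inline |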
Indicates, while calculating, after how many sentences the progress of the thread will be updated.
Referenced by crawlservpp::Module::Analyzer::Algo::SentimentOverTime::resetAlgo().
| constexpr auto crawlservpp::Module::Analyzer::Algo::sentimentUpdateSavingProgressEvery {10} | inline |
Indicates, while saving, after how many rows the progress of the thread will be updated.
Referenced by crawlservpp::Module::Analyzer::Algo::SentimentOverTime::resetAlgo().
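The sentiment column constants documented above suggest how the width of the target table could be derived. The following sketch is an assumption about how they combine, not the code of SentimentOverTime: the helper names `numberOfColumns` and `toPercentage` are hypothetical, and only the constant values are taken from this page.

```cpp
#include <cstddef>

// Values as documented on this page.
constexpr auto sentimentMinNumColumns{1};
constexpr auto sentimentMinColumnsPerCategory{2};
constexpr auto sentimentArticleColumnsPerCategory{4};
constexpr auto sentimentPercentageFactor{100.F};

// Hypothetical helper: default columns plus a fixed number of columns
// for each category, depending on whether article-based sentiment is active.
constexpr std::size_t numberOfColumns(std::size_t categories, bool articleBased) {
    const std::size_t perCategory = articleBased
            ? sentimentArticleColumnsPerCategory
            : sentimentMinColumnsPerCategory;

    return sentimentMinNumColumns + categories * perCategory;
}

// Hypothetical helper: converts a share between 0 and 1 to a percentage.
constexpr float toPercentage(float share) {
    return share * sentimentPercentageFactor;
}
```

Under these assumptions, three categories would need 7 columns without and 13 columns with article-based sentiment.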
| constexpr auto crawlservpp::Module::Analyzer::Algo::topicModellingColumnsPerLabel {2} | inline |
The number of columns per top label.
Referenced by crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo().
| constexpr auto crawlservpp::Module::Analyzer::Algo::topicModellingColumnsPerToken {2} | inline |
The number of columns per top token.
Referenced by crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo().
| constexpr auto crawlservpp::Module::Analyzer::Algo::topicModellingDefaultAlpha {0.1F} | inline |
The default initial hyperparameter for the Dirichlet distribution for document–table.
| constexpr auto crawlservpp::Module::Analyzer::Algo::topicModellingDefaultBurnIn {100} | inline |
The default number of burn-in iterations.
"Burned in" iterations will be skipped before starting to train the model.
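| constexpr auto crawlservpp::Module::Analyzer::Algo::topicModellingDefaultConversionThreshold {0.F} | inline |
The default threshold for topics to be included when converting an HDP to an LDA model.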
| constexpr auto crawlservpp::Module::Analyzer::Algo::topicModellingDefaultDocIterations {100} | inline |
The default number of maximum iterations to classify a document.
| constexpr auto crawlservpp::Module::Analyzer::Algo::topicModellingDefaultEta {0.01F} | inline |
The default initial hyperparameter for the Dirichlet distribution for topic–token.
| constexpr auto crawlservpp::Module::Analyzer::Algo::topicModellingDefaultGamma {0.1F} | inline |
The default initial concentration coefficient of the Dirichlet Process for table–topic.
Will be ignored if the LDA algorithm is used instead of the HDP algorithm, i.e. if a fixed number of topics is set.
| constexpr auto crawlservpp::Module::Analyzer::Algo::topicModellingDefaultIterations {1000} | inline |
The default number of iterations to train the model.
| constexpr auto crawlservpp::Module::Analyzer::Algo::topicModellingDefaultIterationsAtOnce {25} | inline |
The default number of iterations to train the model at once.
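The relation between the total number of iterations and the number of iterations trained at once can be sketched as follows. The function `trainingSteps` is hypothetical and merely splits the total into steps; the actual training loop of TopicModelling may differ, and only the two default values are taken from this page.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Values as documented on this page.
constexpr auto topicModellingDefaultIterations{1000};
constexpr auto topicModellingDefaultIterationsAtOnce{25};

// Hypothetical helper: splits the total number of training iterations into
// steps of at most `atOnce` iterations and returns the size of each step.
std::vector<std::size_t> trainingSteps(std::size_t iterations, std::size_t atOnce) {
    std::vector<std::size_t> steps;

    if(atOnce == 0) {
        return steps; // avoid an endless loop on invalid input
    }

    std::size_t done{};

    while(done < iterations) {
        const auto step{std::min(atOnce, iterations - done)};

        // ... the model would be trained for `step` iterations here,
        //  followed by a status update ...

        steps.push_back(step);

        done += step;
    }

    return steps;
}
```

With the documented defaults, this yields 40 steps of 25 iterations each, so the thread could report its status or react to interruption between steps.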
| constexpr auto crawlservpp::Module::Analyzer::Algo::topicModellingDefaultLabelMu {.25F} | inline |
The default discriminative coefficient for the automated detection of topic labels.
| constexpr auto crawlservpp::Module::Analyzer::Algo::topicModellingDefaultLabelSmoothing {.1F} | inline |
The default Laplace smoothing for the automated detection of topic labels.
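| constexpr auto crawlservpp::Module::Analyzer::Algo::topicModellingDefaultMaxLabelCandidates {10000} | inline |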
The default maximum number of topic label candidates to be extracted from the training data.
| constexpr auto crawlservpp::Module::Analyzer::Algo::topicModellingDefaultMaxLabelLength {5} | inline |
The default maximum length of topic labels, in tokens.
| constexpr auto crawlservpp::Module::Analyzer::Algo::topicModellingDefaultMinCf {1} | inline |
The default number of a token's minimum frequency in the corpus.
| constexpr auto crawlservpp::Module::Analyzer::Algo::topicModellingDefaultMinDf {1} | inline |
The default number of a token's minimum document frequency.
| constexpr auto crawlservpp::Module::Analyzer::Algo::topicModellingDefaultMinLabelCf {1} | inline |
The default number of a topic label's minimum frequency in the corpus.
| constexpr auto crawlservpp::Module::Analyzer::Algo::topicModellingDefaultMinLabelDf {1} | inline |
The default number of a topic label's minimum document frequency.
| constexpr auto crawlservpp::Module::Analyzer::Algo::topicModellingDefaultMinLabelLength {2} | inline |
The default minimum length of topic labels, in tokens.
| constexpr auto crawlservpp::Module::Analyzer::Algo::topicModellingDefaultNumberOfThreads {1} | inline |
The default number of threads for training the model.
| constexpr auto crawlservpp::Module::Analyzer::Algo::topicModellingDefaultNumberOfTopics {2} | inline |
The default number of initial topics.
Will be changed according to the data if the HDP (and not the LDA) algorithm is used, i.e. if the number of topics is not set to be fixed.
| constexpr auto crawlservpp::Module::Analyzer::Algo::topicModellingDefaultNumberOfTopicTokens {5} | inline |
The default number of most-probable tokens for each detected topic.
This number of most-probable tokens for each detected topic will be written to the provided topic table.
| constexpr auto crawlservpp::Module::Analyzer::Algo::topicModellingDefaultNumberOfWorkers {0} | inline |
The default number of worker threads for inferring the topics of articles.
| constexpr auto crawlservpp::Module::Analyzer::Algo::topicModellingDefaultOptimizeEvery {10} | inline |
The default optimization interval for the model parameters, in training iterations.
| constexpr auto crawlservpp::Module::Analyzer::Algo::topicModellingDefaultRemoveTopN {0} | inline |
The default number of most-common tokens to ignore.
| constexpr auto crawlservpp::Module::Analyzer::Algo::topicModellingDirectory {"mdl"sv} | inline |
The directory for model files.
Referenced by crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo().
| constexpr auto crawlservpp::Module::Analyzer::Algo::topicModellingPrecisionLL {6} | inline |
The number of digits of the log-likelihood to be logged.
Referenced by crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo().
| constexpr auto crawlservpp::Module::Analyzer::Algo::topicModellingPrecisionUlp {5} | inline |
Precision used when testing topic probabilities for equality, in ULPs (units in the last place).
Referenced by crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo().
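A comparison of floating-point values with a tolerance expressed in ULPs can be sketched with a commonly used epsilon-scaling approach. Whether TopicModelling uses exactly this formula is an assumption; the function `almostEqual` is hypothetical, and only the value 5 is taken from this page.

```cpp
#include <cmath>
#include <limits>

// Value as documented on this page.
constexpr auto topicModellingPrecisionUlp{5};

// Hypothetical helper: returns true if x and y differ by no more than roughly
// `ulp` units in the last place, scaled to the magnitude of the values.
bool almostEqual(float x, float y, int ulp) {
    const float diff{std::fabs(x - y)};

    // the machine epsilon is scaled to the magnitude of the compared values
    const float tolerance{
            std::numeric_limits<float>::epsilon()
            * std::fabs(x + y)
            * static_cast<float>(ulp)
    };

    // differences below the smallest normal value are treated as equal
    return diff <= tolerance || diff < std::numeric_limits<float>::min();
}
```

Such a tolerance avoids treating two topic probabilities as different merely because of rounding errors in their computation.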
| constexpr auto crawlservpp::Module::Analyzer::Algo::topicModellingTargetColumns {2} | inline |
The number of additional columns in the target table.
Referenced by crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo().
| constexpr auto crawlservpp::Module::Analyzer::Algo::topicModellingTopicColumns {2} | inline |
The number of additional columns in the topic table.
Referenced by crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo().
| constexpr auto crawlservpp::Module::Analyzer::Algo::topicModellingUpdateProgressEvery {1000} | inline |
The number of added/saved articles after which the progress will be updated.
Referenced by crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo().
| constexpr auto crawlservpp::Module::Analyzer::Algo::topicModellingUpdateProgressEveryDocs {25} | inline |
The number of classified documents after which the progress will be updated.
Referenced by crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo().
| constexpr auto crawlservpp::Module::Analyzer::Algo::wordsNumberOfColumns {4} | inline |
The number of columns to write to the target table.
Referenced by crawlservpp::Module::Analyzer::Algo::WordsOverTime::onAlgoInitTarget(), and crawlservpp::Module::Analyzer::Algo::WordsOverTime::resetAlgo().
| constexpr auto crawlservpp::Module::Analyzer::Algo::wordsUpdateProgressEvery {100} | inline |
Indicates after how many date groups the progress of the thread will be updated.
Referenced by crawlservpp::Module::Analyzer::Algo::WordsOverTime::resetAlgo().