crawlserv++  [under development]
Application for crawling and analyzing textual content of websites.
crawlservpp::Data::TopicModel Class Reference

Topic modeller. More...

#include <TopicModel.hpp>

Classes

class  Exception
 Class for topic modelling-specific exceptions. More...
 

Getters

std::size_t getNumberOfDocuments () const
 Gets the number of added documents after training has begun. More...
 
std::unordered_map< std::string, std::size_t > getDocuments () const
 Gets a map with the documents and their indices from the model. More...
 
std::size_t getVocabularySize () const
 Gets the number of distinct tokens after training has begun. More...
 
std::size_t getOriginalVocabularySize () const
 Gets the number of distinct tokens before training. More...
 
const std::vector< std::string > & getVocabulary () const
 Gets the complete dictionary used by the model. More...
 
std::size_t getNumberOfTokens () const
 Gets the number of tokens after training has begun. More...
 
std::size_t getBurnInIterations () const
 Gets the number of skipped iterations. More...
 
std::size_t getIterations () const
 Gets the number of training iterations performed so far. More...
 
std::size_t getParameterOptimizationInterval () const
 Gets the interval for parameter optimization, in iterations. More...
 
std::size_t getRandomNumberGenerationSeed () const
 Gets the seed used for random number generation. More...
 
std::string_view getModelName () const
 Gets the name of the current model. More...
 
std::string_view getTermWeighting () const
 Gets the term weighting mode of the current model. More...
 
std::size_t getDocumentId (const std::string &name) const
 Gets the ID of the document with the specified name. More...
 
std::vector< std::string > getRemovedTokens () const
 Gets the most common tokens (i.e. stopwords) that have been removed. More...
 
std::size_t getNumberOfTopics () const
 Gets the number of topics. More...
 
std::vector< std::size_t > getTopics () const
 Gets the IDs of the topics. More...
 
std::vector< std::pair< std::size_t, std::uint64_t > > getTopicsSorted () const
 Gets the IDs and counts of the topics, sorted by count. More...
 
double getLogLikelihoodPerToken () const
 Gets the log-likelihood per token. More...
 
double getTokenEntropy () const
 Gets the token entropy after training. More...
 
std::vector< std::pair< std::string, float > > getTopicTopNTokens (std::size_t topic, std::size_t n) const
 Gets the top N tokens for the specified topic. More...
 
std::vector< std::pair< std::string, float > > getTopicTopNLabels (std::size_t topic, std::size_t n) const
 Gets the top N labels for the specified topic. More...
 
std::vector< std::pair< std::string, std::vector< float > > > getDocumentsTopics (std::unordered_set< std::string > &done) const
 Gets the topic distributions of all documents the model has been trained on, if available. More...
 
std::vector< std::vector< float > > getDocumentsTopics (const std::vector< std::vector< std::string >> &documents, std::size_t maxIterations, std::size_t numberOfWorkers) const
 Infers the topic distributions for previously unseen documents. More...
 
TopicModelInfo getModelInfo () const
 Gets information about the model after training. More...
 

Setters

void setFixedNumberOfTopics (std::size_t k)
 Sets the fixed number of topics. More...
 
void setUseIdf (bool idf)
 Sets whether to use IDF term weighting. More...
 
void setBurnInIteration (std::size_t skipIterations)
 Sets the number of iterations that will be skipped at the beginning of training. More...
 
void setTokenRemoval (std::size_t collectionFrequency, std::size_t documentFrequency, std::size_t fixedNumberOfTopTokens)
 Sets which (un)common tokens to remove before training. More...
 
void setInitialParameters (std::size_t initialTopics, float alpha, float eta, float gamma)
 Sets the initial parameters for the model. More...
 
void setParameterOptimizationInterval (std::size_t interval)
 Sets the interval for parameter optimization, in iterations. More...
 
void setRandomNumberGenerationSeed (std::size_t newSeed)
 Sets the seed for random number generation. More...
 
void setLabelingOptions (bool activate, std::size_t minCf, std::size_t minDf, std::size_t minLength, std::size_t maxLength, std::size_t maxCandidates, float smoothing, float mu, std::size_t windowSize)
 Sets the options for automated topic labeling. More...
 

Topic Modelling

void addDocument (const std::string &name, const std::vector< std::string > &tokens, std::size_t firstToken, std::size_t numTokens)
 Adds a document from a tokenized corpus. More...
 
void startTraining ()
 Starts training without performing any iteration. More...
 
void train (std::size_t iterations, std::size_t threads)
 Trains the underlying HDP or LDA model. More...
 
void label (std::size_t threads)
 Labels the resulting topics. More...
 

Load and Save

std::size_t load (const std::string &fileName)
 Loads a model from a file. More...
 
std::size_t save (const std::string &fileName, bool full) const
 Writes the model to a file. More...
 

Cleanup

void clear (bool labelingOptions)
 Clears the model, resets its settings and frees memory. More...
 

Detailed Description

Topic modeller.

Uses the Hierarchical Dirichlet Process (HDP) and Latent Dirichlet Allocation (LDA) algorithms.

The former is used if no fixed number of topics is given; the latter is used if a fixed number of topics is given.

Uses tomoto, the underlying C++ API of tomotopy; see: https://bab2min.github.io/tomotopy/

If you use the HDP topic modelling algorithm, please cite:

Teh, Y. W., Jordan, M. I., Beal, M. J., & Blei, D. M. (2005). Sharing clusters among related groups: Hierarchical Dirichlet processes. In Advances in neural information processing systems, 1385–1392.

Newman, D., Asuncion, A., Smyth, P., & Welling, M. (2009). Distributed algorithms for topic models. Journal of Machine Learning Research, 10 (Aug), 1801–1828.

If you use the LDA topic modelling algorithm, please cite:

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3 (Jan), 993–1022.

Newman, D., Asuncion, A., Smyth, P., & Welling, M. (2009). Distributed algorithms for topic models. Journal of Machine Learning Research, 10 (Aug), 1801–1828.
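The order in which the documented member functions are typically called can be sketched as follows. This is only an illustration under stated assumptions: trainOnCorpus is a hypothetical helper that is not part of crawlserv++, and Model stands for any type that offers the member functions listed on this page with the documented signatures (for example crawlservpp::Data::TopicModel). Settings come first, then the documents, then training.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper, not part of crawlserv++. 'Model' is assumed to
// expose setFixedNumberOfTopics(), addDocument(), startTraining(),
// train() and getNumberOfTopics() as documented on this page.
template<class Model>
std::size_t trainOnCorpus(
        Model& model,
        const std::vector<std::string>& tokens, // all tokens of the corpus
        const std::vector<std::pair<std::string, std::size_t>>& documents, // (name, number of tokens)
        std::size_t k,          // zero: HDP, non-zero: LDA with k topics
        std::size_t iterations
) {
    // Settings must be made before the model is initialized.
    model.setFixedNumberOfTopics(k);

    // Documents are consecutive slices of the flat token vector.
    std::size_t firstToken = 0;

    for(const auto& document : documents) {
        model.addDocument(document.first, tokens, firstToken, document.second);

        firstToken += document.second;
    }

    // After training has been started, no documents can be added.
    model.startTraining();
    model.train(iterations, 1); // one thread, i.e. single threading

    return model.getNumberOfTopics();
}
```

Writing the helper as a template keeps the sketch independent of the crawlserv++ headers; with the real class, the same calls would be made directly on a TopicModel object.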

Member Function Documentation

◆ addDocument()

void crawlservpp::Data::TopicModel::addDocument ( const std::string &  name,
const std::vector< std::string > &  tokens,
std::size_t  firstToken,
std::size_t  numTokens 
)
inline

Adds a document from a tokenized corpus.

A copy of the document will be created, i.e. the corpus can be cleared after all documents have been added.

Parameters
name  The name of the document.
tokens  Constant reference to all tokens in the corpus.
firstToken  Index of the document's first token.
numTokens  Number of tokens in the document.
Exceptions
TopicModel::Exception  if the model has already been trained.
Note
It is recommended to stem (or lemmatize) the tokens in the document before adding it to the model.

References DATA_TOPICMODEL_CALL.

Referenced by crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo().
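How firstToken and numTokens select one document from the flat token vector of a corpus, and why a copy allows the corpus to be cleared afterwards, can be illustrated with a small stand-alone sketch. The function extractDocument is hypothetical and is not how crawlserv++ implements addDocument(); the bounds check is an assumption added for safety.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper, not part of crawlserv++: copies the tokens
// [firstToken, firstToken + numTokens) out of a flat corpus.
std::vector<std::string> extractDocument(
        const std::vector<std::string>& tokens,
        std::size_t firstToken,
        std::size_t numTokens
) {
    // Reject ranges that would run past the end of the corpus.
    if(firstToken > tokens.size() || numTokens > tokens.size() - firstToken) {
        throw std::out_of_range("document exceeds the corpus");
    }

    const auto first = tokens.begin() + static_cast<std::ptrdiff_t>(firstToken);

    // The returned vector owns its own copy of the tokens.
    return std::vector<std::string>(first, first + static_cast<std::ptrdiff_t>(numTokens));
}
```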

◆ clear()

◆ getBurnInIterations()

std::size_t crawlservpp::Data::TopicModel::getBurnInIterations ( ) const
inline

Gets the number of skipped iterations.

Returns
The number of skipped iterations before training.
Exceptions
TopicModel::Exception  if no documents have been added, the topic modeller has been cleared, or the model has not been trained yet.

References DATA_TOPICMODEL_RETURN.

Referenced by getModelInfo().

◆ getDocumentId()

std::size_t crawlservpp::Data::TopicModel::getDocumentId ( const std::string &  name) const
inline

Gets the ID of the document with the specified name.

Parameters
name  The name of the document.
Returns
The ID of the document with the specified name.
Exceptions
TopicModel::Exception  if no documents have been added, the topic modeller has been cleared, or no document with the specified name has been added to the model.

References DATA_TOPICMODEL_RETRIEVE.

◆ getDocuments()

std::unordered_map< std::string, std::size_t > crawlservpp::Data::TopicModel::getDocuments ( ) const
inline

Gets a map with the documents and their indices from the model.

Returns
An unordered map with the document names as keys and the document indices as values.
Exceptions
TopicModel::Exception  if no documents have been added, the topic modeller has been cleared, or the model has not been trained yet.

References DATA_TOPICMODEL_RETRIEVE, and getNumberOfDocuments().

◆ getDocumentsTopics() [1/2]

std::vector< std::pair< std::string, std::vector< float > > > crawlservpp::Data::TopicModel::getDocumentsTopics ( std::unordered_set< std::string > &  done) const
inline

Gets the topic distributions of all documents the model has been trained on, if available.

Unnamed documents inside the model will be ignored.

Parameters
done  An unordered set which will be used to not classify any article twice. All articles with an ID contained in this set will be ignored. The IDs of all articles that will be returned will be added to the set.
Returns
A vector containing pairs of a string containing the name of the document and a vector of floating-point numbers indicating the topic distribution for that document. An empty vector if the model does not contain any named documents, e.g. if a model has been loaded that has not been saved together with all its training data.
Exceptions
TopicModel::Exception  if no documents have been added, the topic modeller has been cleared, or the model has not been trained yet.

References DATA_TOPICMODEL_RETRIEVE, and getNumberOfDocuments().

Referenced by crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo().
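The documented role of done (skip what is already in the set, record what is returned, ignore unnamed documents) can be sketched in isolation. The function skipDone and the alias NamedTopics are hypothetical names chosen for this illustration; the real member function reads the distributions from the trained model instead of taking them as an argument.

```cpp
#include <string>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical alias mirroring the documented return type.
using NamedTopics = std::vector<std::pair<std::string, std::vector<float>>>;

// Hypothetical filter, not part of crawlserv++.
NamedTopics skipDone(const NamedTopics& all, std::unordered_set<std::string>& done) {
    NamedTopics result;

    for(const auto& entry : all) {
        if(entry.first.empty()) {
            continue; // unnamed documents are ignored
        }

        // insert() reports whether the name was new to the set,
        // so each document is returned at most once.
        if(done.insert(entry.first).second) {
            result.push_back(entry);
        }
    }

    return result;
}
```

Calling such a function twice with the same set returns nothing the second time, which is what makes the set usable across repeated runs.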

◆ getDocumentsTopics() [2/2]

std::vector< std::vector< float > > crawlservpp::Data::TopicModel::getDocumentsTopics ( const std::vector< std::vector< std::string >> &  documents,
std::size_t  maxIterations,
std::size_t  numberOfWorkers 
) const
inline

Infers the topic distributions for previously unseen documents.

Parameters
documents  A constant reference to a vector containing vectors with the processed tokens of the documents to infer the topics for.
maxIterations  The maximum number of iterations to perform for inferring the topic distributions.
numberOfWorkers  The number of working threads to be used for inferring the topic distributions.
Returns
A vector containing vectors of floating-point numbers indicating the topic distribution for each of the given documents.
Exceptions
TopicModel::Exception  if no documents have been added, the topic modeller has been cleared, the model has not been trained yet, or a document could not be created.

References DATA_TOPICMODEL_CALL, and DATA_TOPICMODEL_RETRIEVE.
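A common next step after inference is to pick the most probable topic of each document from its distribution. A minimal sketch, assuming one distribution vector per document as described above; dominantTopic is a hypothetical helper and returns a position in the vector, which need not be identical to a topic ID as returned by getTopics().

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical helper, not part of crawlserv++: position of the largest
// probability in one topic distribution (zero for an empty vector).
std::size_t dominantTopic(const std::vector<float>& distribution) {
    const auto best = std::max_element(distribution.begin(), distribution.end());

    return static_cast<std::size_t>(std::distance(distribution.begin(), best));
}
```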

◆ getIterations()

std::size_t crawlservpp::Data::TopicModel::getIterations ( ) const
inline

Gets the number of training iterations performed so far.

Returns
The number of iterations performed during training.
Exceptions
TopicModel::Exception  if no documents have been added, the topic modeller has been cleared, or the model has not been trained yet.

References DATA_TOPICMODEL_RETURN.

Referenced by getModelInfo().

◆ getLogLikelihoodPerToken()

double crawlservpp::Data::TopicModel::getLogLikelihoodPerToken ( ) const
inline

Gets the log-likelihood per token.

Returns
The current log-likelihood per token.
Exceptions
TopicModel::Exception  if no documents have been added, the topic modeller has been cleared, or the model has not been trained yet.

References DATA_TOPICMODEL_RETURN.

Referenced by getModelInfo(), and crawlservpp::Module::Analyzer::Algo::TopicModelling::onAlgoTick().
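The log-likelihood per token is often reported as perplexity, which is the exponential of its negation. A minimal sketch, assuming the value is a natural-logarithm likelihood; perplexityFromLogLikelihood is a hypothetical helper, not a member of this class.

```cpp
#include <cmath>

// Hypothetical helper, not part of crawlserv++.
// Assumes a natural-log likelihood per token; lower perplexity is better.
double perplexityFromLogLikelihood(double logLikelihoodPerToken) {
    return std::exp(-logLikelihoodPerToken);
}
```

Since the value can be queried between calls to train(), it lends itself to monitoring convergence over successive training runs.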

◆ getModelInfo()

Struct::TopicModelInfo crawlservpp::Data::TopicModel::getModelInfo ( ) const
inline

Gets information about the model after training.

Returns
A structure containing all available information about the trained model.
Exceptions
TopicModel::Exception  if no documents have been added, the topic modeller has been cleared, or the model has not been trained yet.

References crawlservpp::Struct::TopicModelInfo::alpha, crawlservpp::Struct::TopicModelInfo::alphas, DATA_TOPICMODEL_RETRIEVE_NOARGS, crawlservpp::Struct::TopicModelInfo::eta, crawlservpp::Struct::TopicModelInfo::gamma, getBurnInIterations(), getIterations(), getLogLikelihoodPerToken(), getModelName(), getNumberOfDocuments(), getNumberOfTokens(), getNumberOfTopics(), getOriginalVocabularySize(), getParameterOptimizationInterval(), getRemovedTokens(), getTermWeighting(), getTokenEntropy(), crawlservpp::Helper::Versions::getTomotoVersion(), getVocabularySize(), crawlservpp::Struct::TopicModelInfo::initialAlpha, crawlservpp::Struct::TopicModelInfo::initialEta, crawlservpp::Struct::TopicModelInfo::initialGamma, crawlservpp::Struct::TopicModelInfo::logLikelihoodPerToken, crawlservpp::Struct::TopicModelInfo::minCollectionFrequency, crawlservpp::Struct::TopicModelInfo::minDocumentFrequency, crawlservpp::Struct::TopicModelInfo::modelName, crawlservpp::Struct::TopicModelInfo::modelVersion, crawlservpp::Struct::TopicModelInfo::numberOfBurnInSteps, crawlservpp::Struct::TopicModelInfo::numberOfDocuments, crawlservpp::Struct::TopicModelInfo::numberOfInitialTopics, crawlservpp::Struct::TopicModelInfo::numberOfIterations, crawlservpp::Struct::TopicModelInfo::numberOfTables, crawlservpp::Struct::TopicModelInfo::numberOfTokens, crawlservpp::Struct::TopicModelInfo::numberOfTopics, crawlservpp::Struct::TopicModelInfo::numberOfTopTokensToBeRemoved, crawlservpp::Struct::TopicModelInfo::optimizationInterval, crawlservpp::Struct::TopicModelInfo::removedTokens, crawlservpp::Struct::TopicModelInfo::seed, crawlservpp::Struct::TopicModelInfo::sizeOfVocabulary, crawlservpp::Struct::TopicModelInfo::sizeOfVocabularyUsed, crawlservpp::Struct::TopicModelInfo::tokenEntropy, crawlservpp::Struct::TopicModelInfo::trainedWithVersion, and crawlservpp::Struct::TopicModelInfo::weighting.

Referenced by crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo().

◆ getModelName()

std::string_view crawlservpp::Data::TopicModel::getModelName ( ) const
inline

Gets the name of the current model.

Returns
A view of a string containing the name of the current model.
Exceptions
TopicModel::Exception  if no documents have been added or the topic modeller has already been cleared, i.e. no model is available.

References crawlservpp::Data::hdpModelName, and crawlservpp::Data::ldaModelName.

Referenced by getModelInfo().

◆ getNumberOfDocuments()

std::size_t crawlservpp::Data::TopicModel::getNumberOfDocuments ( ) const
inline

Gets the number of added documents after training has begun.

Returns
The number of added documents.
Exceptions
TopicModel::Exception  if no documents have been added, the topic modeller has been cleared, or the model has not been trained yet.

References DATA_TOPICMODEL_RETURN.

Referenced by getDocuments(), getDocumentsTopics(), and getModelInfo().

◆ getNumberOfTokens()

std::size_t crawlservpp::Data::TopicModel::getNumberOfTokens ( ) const
inline

Gets the number of tokens after training has begun.

Returns
The number of tokens after stopwords have been removed.
Exceptions
TopicModel::Exception  if no documents have been added, the topic modeller has been cleared, or the model has not been trained yet.

References DATA_TOPICMODEL_RETURN.

Referenced by getModelInfo().

◆ getNumberOfTopics()

std::size_t crawlservpp::Data::TopicModel::getNumberOfTopics ( ) const
inline

Gets the number of topics.

Returns
Number of topics that are alive after training. Returns the fixed number of topics (k) if it is non-zero, i.e. when the LDA algorithm is being used.
Exceptions
TopicModel::Exception  if no documents have been added, the topic modeller has been cleared, or the model has not been trained yet.

Referenced by getModelInfo(), crawlservpp::Module::Analyzer::Algo::TopicModelling::onAlgoTick(), and crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo().

◆ getOriginalVocabularySize()

std::size_t crawlservpp::Data::TopicModel::getOriginalVocabularySize ( ) const
inline

Gets the number of distinct tokens before training.

Returns
The number of distinct tokens before stopwords have been removed.
Exceptions
TopicModel::Exception  if no documents have been added, the topic modeller has been cleared, or the model has not been trained yet.

Referenced by getModelInfo().

◆ getParameterOptimizationInterval()

std::size_t crawlservpp::Data::TopicModel::getParameterOptimizationInterval ( ) const
inline

Gets the interval for parameter optimization, in iterations.

Returns
The interval after which the parameters of the model will be optimized, in iterations.
Exceptions
TopicModel::Exception  if no documents have been added, the topic modeller has been cleared, or the model has not been trained yet.

References DATA_TOPICMODEL_RETURN.

Referenced by getModelInfo().

◆ getRandomNumberGenerationSeed()

std::size_t crawlservpp::Data::TopicModel::getRandomNumberGenerationSeed ( ) const
inline

Gets the seed used for random number generation.

Returns
The seed used for random number generation by the model.
Exceptions
TopicModel::Exception  if no documents have been added, the topic modeller has been cleared, or the model has not been trained yet.

◆ getRemovedTokens()

std::vector< std::string > crawlservpp::Data::TopicModel::getRemovedTokens ( ) const
inline

Gets the most common tokens (i.e. stopwords) that have been removed.

Returns
A vector of strings containing the removed tokens. The vector is empty if no tokens have been removed.
Exceptions
TopicModel::Exception  if no documents have been added, the topic modeller has been cleared, or the model has not been trained yet.

Referenced by getModelInfo().

◆ getTermWeighting()

std::string_view crawlservpp::Data::TopicModel::getTermWeighting ( ) const
inline

Gets the term weighting mode of the current model.

Returns
A view of a string containing the term weighting mode of the current model.
Exceptions
TopicModel::Exception  if no documents have been added or the topic modeller has already been cleared, i.e. no model is available.

Referenced by getModelInfo().

◆ getTokenEntropy()

double crawlservpp::Data::TopicModel::getTokenEntropy ( ) const
inline

Gets the token entropy after training.

Returns
The token entropy for the whole corpus.
Exceptions
TopicModel::Exception  if no documents have been added, the topic modeller has been cleared, or the model has not been trained yet.

References DATA_TOPICMODEL_RETRIEVE_NOARGS.

Referenced by getModelInfo().

◆ getTopics()

std::vector< std::size_t > crawlservpp::Data::TopicModel::getTopics ( ) const
inline

Gets the IDs of the topics.

Returns
Vector with the IDs of the topics that are alive after training. Will return [0, 1, ..., k-1] if the number of topics is fixed, i.e. the LDA algorithm is being used.
Exceptions
TopicModel::Exception  if no documents have been added, the topic modeller has been cleared, or the model has not been trained yet.

Referenced by crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo().

◆ getTopicsSorted()

std::vector< std::pair< std::size_t, std::uint64_t > > crawlservpp::Data::TopicModel::getTopicsSorted ( ) const
inline

Gets the IDs and counts of the topics, sorted by count.

Returns
Vector of pairs with the IDs and counts of the topics, sorted by the latter.
Exceptions
TopicModel::Exception  if no documents have been added, the topic modeller has been cleared, or the model has not been trained yet.

References DATA_TOPICMODEL_RETRIEVE_NOARGS.

Referenced by crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo().
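Sorting pairs of topic IDs and counts by count can be sketched as below. The page does not state the sort direction, so the descending order (largest topic first) is an assumption; sortByCount and TopicCount are hypothetical names, not part of crawlserv++.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical alias mirroring the documented element type: (topic ID, count).
using TopicCount = std::pair<std::size_t, std::uint64_t>;

// Hypothetical helper: sorts by count, largest first (assumed direction).
// stable_sort keeps topics with equal counts in their original order.
std::vector<TopicCount> sortByCount(std::vector<TopicCount> topics) {
    std::stable_sort(
            topics.begin(),
            topics.end(),
            [](const TopicCount& a, const TopicCount& b) {
                return a.second > b.second;
            }
    );

    return topics;
}
```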

◆ getTopicTopNLabels()

std::vector< std::pair< std::string, float > > crawlservpp::Data::TopicModel::getTopicTopNLabels ( std::size_t  topic,
std::size_t  n 
) const
inline

Gets the top N labels for the specified topic.

Parameters
topic  The ID of the topic.
n  The number of labels to retrieve from the topic, i.e. N.
Returns
A vector containing pairs of the top N labels of the specified topic and their probabilities, sorted by the latter.
Exceptions
TopicModel::Exception  if no documents have been added, the topic modeller has been cleared, the model has not been trained yet, or automated topic labelling has not been activated.

Referenced by crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo().

◆ getTopicTopNTokens()

std::vector< std::pair< std::string, float > > crawlservpp::Data::TopicModel::getTopicTopNTokens ( std::size_t  topic,
std::size_t  n 
) const
inline

Gets the top N tokens for the specified topic.

Parameters
topic  The ID of the topic.
n  The number of top tokens to retrieve from the topic, i.e. N.
Returns
A vector containing pairs of the top N tokens of the specified topic and their probabilities, sorted by the latter.
Exceptions
TopicModel::Exception  if no documents have been added, the topic modeller has been cleared, or the model has not been trained yet.

References DATA_TOPICMODEL_RETRIEVE.

Referenced by crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo().
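Selecting the N most probable tokens from a list of (token, probability) pairs can be sketched with std::partial_sort, which only orders the part of the range that is actually needed. The function topNTokens is a hypothetical stand-alone helper; the real member function obtains its result from the underlying model.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical alias mirroring the documented element type.
using TokenProbability = std::pair<std::string, float>;

// Hypothetical helper, not part of crawlserv++: the n most probable
// tokens, most probable first. Returns fewer if fewer are available.
std::vector<TokenProbability> topNTokens(std::vector<TokenProbability> tokens, std::size_t n) {
    n = std::min(n, tokens.size());

    const auto middle = tokens.begin() + static_cast<std::ptrdiff_t>(n);

    std::partial_sort(
            tokens.begin(),
            middle,
            tokens.end(),
            [](const TokenProbability& a, const TokenProbability& b) {
                return a.second > b.second;
            }
    );

    tokens.erase(middle, tokens.end());

    return tokens;
}
```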

◆ getVocabulary()

const std::vector< std::string > & crawlservpp::Data::TopicModel::getVocabulary ( ) const
inline

Gets the complete dictionary used by the model.

Note
Includes tokens removed during training.
Returns
Constant reference to a vector of strings containing the complete dictionary of the model.
Exceptions
TopicModel::Exception  if no documents have been added, the topic modeller has been cleared, or the model has not been trained yet.

◆ getVocabularySize()

std::size_t crawlservpp::Data::TopicModel::getVocabularySize ( ) const
inline

Gets the number of distinct tokens after training has begun.

Returns
The number of distinct tokens after stopwords have been removed.
Exceptions
TopicModel::Exception  if no documents have been added, the topic modeller has been cleared, or the model has not been trained yet.

References DATA_TOPICMODEL_RETURN.

Referenced by getModelInfo().

◆ label()

void crawlservpp::Data::TopicModel::label ( std::size_t  threads)
inline

Labels the resulting topics.

Does nothing, except clearing any existing labeling, if labeling has not been activated or has been deactivated.

Parameters
threads  Number of threads. One for single threading. Zero for guessing the number of concurrent threads supported by the hardware.
Exceptions
TopicModel::Exception  if no documents have been added, the topic modeller has been cleared, the model has not been trained yet, or the file cannot be read.
See also
setLabelingOptions

Referenced by crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo(), and setLabelingOptions().

◆ load()

size_t crawlservpp::Data::TopicModel::load ( const std::string &  fileName)
inline

Loads a model from a file.

Clears all previous data before trying to load the new model, if applicable.

Parameters
fileName  Name of the file to load the model from.
Returns
The number of bytes read from the model file (best guess).
Exceptions
TopicModel::Exception  if the model could not be loaded from the specified file, e.g. because the file does not exist or the file format is unsupported.

References clear(), DATA_TOPICMODEL_CALL, crawlservpp::Data::defaultNumberOfInitialTopics, and setUseIdf().

Referenced by crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo().

◆ save()

std::size_t crawlservpp::Data::TopicModel::save ( const std::string &  fileName,
bool  full 
) const
inline

Writes the model to a file.

Parameters
fileName  Name of the file to write the model to.
full  Sets whether to save all documents with the model so that the training can be continued. If false, the saved model can only be used for topic classification.
Returns
The number of bytes written to the model file (best guess).
Exceptions
TopicModel::Exception  if no documents have been added, the topic modeller has been cleared, the model has not been trained yet, or the file cannot be written.

References DATA_TOPICMODEL_CALL.

Referenced by crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo().

◆ setBurnInIteration()

void crawlservpp::Data::TopicModel::setBurnInIteration ( std::size_t  skipIterations)
inline

Sets the number of iterations that will be skipped at the beginning of training.

Parameters
skipIterations  The number of iterations to be skipped at the beginning of the training.
Exceptions
TopicModel::Exception  if the model has already been trained.

References DATA_TOPICMODEL_CALL.

Referenced by crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo().

◆ setFixedNumberOfTopics()

void crawlservpp::Data::TopicModel::setFixedNumberOfTopics ( std::size_t  k)
inline

Sets the fixed number of topics.

Parameters
k  The fixed number of topics, or zero for using the HDP algorithm to determine the number of topics from the data.
Exceptions
TopicModel::Exception  if the model has already been initialized.

Referenced by crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo().
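The rule by which the value of k decides between the two algorithms can be written down in a few lines. The enumeration Algorithm and the function selectAlgorithm are hypothetical names for this illustration only; the class itself reports the active model through getModelName().

```cpp
#include <cstddef>

// Hypothetical enumeration, not part of crawlserv++.
enum class Algorithm { HDP, LDA };

// Zero topics requested: HDP infers the number of topics from the data.
// Any other value: LDA is used with exactly k topics.
constexpr Algorithm selectAlgorithm(std::size_t k) noexcept {
    return k == 0 ? Algorithm::HDP : Algorithm::LDA;
}
```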

◆ setInitialParameters()

void crawlservpp::Data::TopicModel::setInitialParameters ( std::size_t  initialTopics,
float  alpha,
float  eta,
float  gamma 
)
inline

Sets the initial parameters for the model.

Parameters
initialTopics  The initial number of topics between 2 and 32767. The number of topics will be adjusted for the data during training, if the HDP algorithm is used. Will be ignored if a fixed number of topics is set, i.e. the LDA algorithm is used.
alpha  The initial concentration coefficient of the Dirichlet Process for document-table.
eta  The Dirichlet prior on the per-topic token distribution.
gamma  The initial concentration coefficient of the Dirichlet Process for table-topic. Will be ignored if LDA is used, i.e. the number of fixed topics is non-zero.
Exceptions
TopicModel::Exception  if the model has already been initialized.

Referenced by crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo().

◆ setLabelingOptions()

void crawlservpp::Data::TopicModel::setLabelingOptions ( bool  activate,
std::size_t  minCf,
std::size_t  minDf,
std::size_t  minLength,
std::size_t  maxLength,
std::size_t  maxCandidates,
float  smoothing,
float  mu,
std::size_t  windowSize 
)
inline

Sets the options for automated topic labeling.

Re-labels the topics if they have already been labeled.

Parameters
activate  Sets whether to activate automated topic labeling.
minCf  The minimum total occurrence of a collocation to be used as a topic label.
minDf  The minimum number of documents in which a collocation needs to occur to be used as a topic label.
minLength  The minimum length of a topic label, in words.
maxLength  The maximum length of a topic label, in words. If set to one, single words will be included in possible labels, although they are excluded in counting the maximum number of candidates.
maxCandidates  Sets the maximum number of label candidates to extract from the topics.
smoothing  A small value greater than zero for Laplace smoothing.
mu  A discriminative coefficient. Candidates with a high score on a specific topic and with a low score on other topics get a higher final score when this value is larger.
windowSize  The size of the sliding window for calculating co-occurrence. If it equals or exceeds the length of a document, the whole document is used. Should be between 50 and 100 for long documents.
See also
label

References label().

Referenced by crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo().

◆ setParameterOptimizationInterval()

void crawlservpp::Data::TopicModel::setParameterOptimizationInterval ( std::size_t  interval)
inline

Sets the interval for parameter optimization, in iterations.

Parameters
interval  The interval after which the parameters of the model will be optimized, in iterations.
Exceptions
TopicModel::Exception  if the model has already been initialized.

Referenced by crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo().

◆ setRandomNumberGenerationSeed()

void crawlservpp::Data::TopicModel::setRandomNumberGenerationSeed ( std::size_t  newSeed)
inline

Sets the seed for random number generation.

Parameters
newSeed  The seed used by the model for the generation of random numbers.
Exceptions
TopicModel::Exception  if the model has already been initialized.

Referenced by crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo().

◆ setTokenRemoval()

void crawlservpp::Data::TopicModel::setTokenRemoval ( std::size_t  collectionFrequency,
std::size_t  documentFrequency,
std::size_t  fixedNumberOfTopTokens 
)
inline

Sets which (un)common tokens to remove before training.

Parameters
collectionFrequency  The minimum number of occurrences in the corpus required for a token to be kept. Zero or one to not use this criterion.
documentFrequency  The minimum number of documents in which a token is required to occur in order to be kept. Zero or one to not use this criterion.
fixedNumberOfTopTokens  The number of most-occurring tokens that will be classified as stopwords and ignored. Zero to not define any stopwords.
Exceptions
TopicModel::Exception  if the model has already been trained.

Referenced by crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo().
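The combined effect of the three criteria can be illustrated with a stand-alone sketch over precomputed token statistics. TokenStats and keptTokens are hypothetical names, and the order of the two steps (rare tokens first, then the most frequent ones) is an assumption of this sketch rather than a statement about the underlying library.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical per-token statistics, not a crawlserv++ type.
struct TokenStats {
    std::string token;
    std::size_t collectionFrequency; // occurrences in the whole corpus
    std::size_t documentFrequency;   // number of documents containing the token
};

// Hypothetical helper: returns the tokens that survive the removal.
std::vector<std::string> keptTokens(
        std::vector<TokenStats> stats,
        std::size_t minCf,
        std::size_t minDf,
        std::size_t removeTop
) {
    // Step 1: drop tokens that are too rare. Thresholds of zero or one
    // remove nothing, because every token occurs at least once.
    stats.erase(
            std::remove_if(stats.begin(), stats.end(), [=](const TokenStats& s) {
                return s.collectionFrequency < minCf || s.documentFrequency < minDf;
            }),
            stats.end()
    );

    // Step 2: order by collection frequency, most frequent first ...
    std::stable_sort(stats.begin(), stats.end(), [](const TokenStats& a, const TokenStats& b) {
        return a.collectionFrequency > b.collectionFrequency;
    });

    // ... and skip the top tokens, which are treated as stopwords.
    std::vector<std::string> kept;

    for(std::size_t i = std::min(removeTop, stats.size()); i < stats.size(); ++i) {
        kept.push_back(stats[i].token);
    }

    return kept;
}
```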

◆ setUseIdf()

void crawlservpp::Data::TopicModel::setUseIdf ( bool  idf)
inline

Sets whether to use IDF term weighting.

Parameters
idf  If true, IDF term weighting is used. If false, every term is weighted the same.
Exceptions
TopicModel::Exception  if the model has already been initialized.

Referenced by load(), and crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo().
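For orientation, the textbook definition of inverse document frequency is the logarithm of the number of documents divided by the number of documents containing the term, so that terms occurring almost everywhere receive a weight close to zero. The sketch below uses that definition; whether the underlying library applies exactly this formula is an assumption, and idfWeight is a hypothetical helper.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical helper, not part of crawlserv++: textbook IDF weight
// log(N / df). Returns zero for a term that occurs in no document.
double idfWeight(std::size_t numberOfDocuments, std::size_t documentFrequency) {
    if(documentFrequency == 0) {
        return 0.;
    }

    return std::log(
            static_cast<double>(numberOfDocuments)
            / static_cast<double>(documentFrequency)
    );
}
```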

◆ startTraining()

void crawlservpp::Data::TopicModel::startTraining ( )
inline

Starts training without performing any iteration.

Can be used to retrieve general information about the training data afterwards.

Exceptions
TopicModel::Exception  if no documents have been added, the topic modeller has been cleared, or an invalid token ID is encountered while removing stopwords.

References crawlservpp::Helper::Versions::getTomotoVersion().

Referenced by clear(), and crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo().

◆ train()

void crawlservpp::Data::TopicModel::train ( std::size_t  iterations,
std::size_t  threads 
)
inline

Trains the underlying HDP or LDA model.

Training can be performed multiple times, but after training has been started no additional documents can be added to the model.

Parameters
iterations  The number of iterations for modelling the topics.
threads  Number of threads. One for single threading. Zero for guessing the number of concurrent threads supported by the hardware.
Exceptions
TopicModel::Exception  if no documents have been added or the topic modeller has been cleared.

Referenced by crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo().
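How a threads value of zero might be resolved to a concrete number can be sketched with the standard library. This is an assumption about the general approach, not the code used by crawlserv++ or tomoto; resolveThreads is a hypothetical helper. Note that std::thread::hardware_concurrency() is allowed to return zero when the value cannot be determined, hence the fallback.

```cpp
#include <cstddef>
#include <thread>

// Hypothetical helper, not part of crawlserv++.
std::size_t resolveThreads(std::size_t requested) {
    if(requested > 0) {
        return requested; // explicit request, e.g. one for single threading
    }

    // Zero: ask the hardware; fall back to one thread if unknown.
    const unsigned int detected = std::thread::hardware_concurrency();

    return detected > 0 ? detected : 1;
}
```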


The documentation for this class was generated from the following file: