Topic modeller.

#include <TopicModel.hpp>

Classes
class	Exception
	Class for topic modelling-specific exceptions.

Getters
std::size_t	getNumberOfDocuments () const
	Gets the number of added documents after training has begun.

std::unordered_map< std::string, std::size_t >	getDocuments () const
	Gets a map with the documents and their indices from the model.

std::size_t	getVocabularySize () const
	Gets the number of distinct tokens after training has begun.

std::size_t	getOriginalVocabularySize () const
	Gets the number of distinct tokens before training.

const std::vector< std::string > &	getVocabulary () const
	Gets the complete dictionary used by the model.

std::size_t	getNumberOfTokens () const
	Gets the number of tokens after training has begun.

std::size_t	getBurnInIterations () const
	Gets the number of skipped iterations.

std::size_t	getIterations () const
	Gets the number of training iterations performed so far.

std::size_t	getParameterOptimizationInterval () const
	Gets the interval for parameter optimization, in iterations.

std::size_t	getRandomNumberGenerationSeed () const
	Gets the seed used for random number generation.

std::string_view	getModelName () const
	Gets the name of the current model.

std::string_view	getTermWeighting () const
	Gets the term weighting mode of the current model.

std::size_t	getDocumentId (const std::string &name) const
	Gets the ID of the document with the specified name.

std::vector< std::string >	getRemovedTokens () const
	Gets the most common tokens (i.e. stopwords) that have been removed.

std::size_t	getNumberOfTopics () const
	Gets the number of topics.

std::vector< std::size_t >	getTopics () const
	Gets the IDs of the topics.

std::vector< std::pair< std::size_t, std::uint64_t > >	getTopicsSorted () const
	Gets the IDs and counts of the topics, sorted by count.

double	getLogLikelihoodPerToken () const
	Gets the log-likelihood per token.

double	getTokenEntropy () const
	Gets the token entropy after training.

std::vector< std::pair< std::string, float > >	getTopicTopNTokens (std::size_t topic, std::size_t n) const
	Gets the top `N` tokens for the specified topic.

std::vector< std::pair< std::string, float > >	getTopicTopNLabels (std::size_t topic, std::size_t n) const
	Gets the top `N` labels for the specified topic.

std::vector< std::pair< std::string, std::vector< float > > >	getDocumentsTopics (std::unordered_set< std::string > &done) const
	Gets the topic distributions of all documents the model has been trained on, if available.

std::vector< std::vector< float > >	getDocumentsTopics (const std::vector< std::vector< std::string >> &documents, std::size_t maxIterations, std::size_t numberOfWorkers) const
	Infers the topic distributions for previously unseen documents.

TopicModelInfo	getModelInfo () const
	Gets information about the model after training.

Setters
void	setFixedNumberOfTopics (std::size_t k)
	Sets the fixed number of topics.

void	setUseIdf (bool idf)
	Sets whether to use IDF term weighting.

void	setBurnInIteration (std::size_t skipIterations)
	Sets the number of iterations that will be skipped at the beginning of training.

void	setTokenRemoval (std::size_t collectionFrequency, std::size_t documentFrequency, std::size_t fixedNumberOfTopTokens)
	Sets which (un)common tokens to remove before training.

void	setInitialParameters (std::size_t initialTopics, float alpha, float eta, float gamma)
	Sets the initial parameters for the model.

void	setParameterOptimizationInterval (std::size_t interval)
	Sets the interval for parameter optimization, in iterations.

void	setRandomNumberGenerationSeed (std::size_t newSeed)
	Sets the seed for random number generation.

void	setLabelingOptions (bool activate, std::size_t minCf, std::size_t minDf, std::size_t minLength, std::size_t maxLength, std::size_t maxCandidates, float smoothing, float mu, std::size_t windowSize)
	Sets the options for automated topic labeling.

Topic Modelling
void	addDocument (const std::string &name, const std::vector< std::string > &tokens, std::size_t firstToken, std::size_t numTokens)
	Adds a document from a tokenized corpus.

void	startTraining ()
	Starts training without performing any iteration.

void	train (std::size_t iterations, std::size_t threads)
	Trains the underlying HDP or LDA model.

void	label (std::size_t threads)
	Labels the resulting topics.

Load and Save
std::size_t	load (const std::string &fileName)
	Loads a model from a file.

std::size_t	save (const std::string &fileName, bool full) const
	Writes the model to a file.

Cleanup
void	clear (bool labelingOptions)
	Clears the model, resets its settings and frees memory.

Detailed Description

Topic modeller.

Uses the Hierarchical Dirichlet Process (HDP) and Latent Dirichlet Allocation (LDA) algorithms.

The former will be used if no fixed number of topics is given, the latter will be used if a fixed number of topics is given.
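
The selection rule above can be sketched as follows. This is only an illustration: `Algorithm` and `selectAlgorithm` are hypothetical names that do not exist in TopicModel.hpp.

```cpp
#include <cstddef>

// Hypothetical helper, not part of TopicModel: mirrors the documented rule
// that a fixed number of topics (k > 0) selects LDA, while k == 0 selects HDP.
enum class Algorithm { HDP, LDA };

constexpr Algorithm selectAlgorithm(std::size_t fixedNumberOfTopics) {
    return fixedNumberOfTopics == 0 ? Algorithm::HDP : Algorithm::LDA;
}
```

With this rule, passing zero to setFixedNumberOfTopics() corresponds to `Algorithm::HDP`, and any other value to `Algorithm::LDA`.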

Uses tomoto, the underlying C++ API of tomotopy; see: https://bab2min.github.io/tomotopy/

If you use the HDP topic modelling algorithm, please cite:

Teh, Y. W., Jordan, M. I., Beal, M. J., & Blei, D. M. (2005). Sharing clusters among related groups: Hierarchical Dirichlet processes. In Advances in Neural Information Processing Systems, 1385–1392.

Newman, D., Asuncion, A., Smyth, P., & Welling, M. (2009). Distributed algorithms for topic models. Journal of Machine Learning Research, 10 (Aug), 1801–1828.

If you use the LDA topic modelling algorithm, please cite:

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3 (Jan), 993–1022.

Newman, D., Asuncion, A., Smyth, P., & Welling, M. (2009). Distributed algorithms for topic models. Journal of Machine Learning Research, 10 (Aug), 1801–1828.

Member Function Documentation

◆ addDocument()

void crawlservpp::Data::TopicModel::addDocument	(	const std::string &	name,
		const std::vector< std::string > &	tokens,
		std::size_t	firstToken,
		std::size_t	numTokens
	)

inline

Adds a document from a tokenized corpus.

A copy of the document will be created, i.e. the corpus can be cleared after all documents have been added.

Parameters

name	The name of the document.
tokens	Constant reference to all tokens in the corpus.
firstToken	Index of the document's first token.
numTokens	Number of tokens in the document.

Exceptions

TopicModel::Exception if the model has already been trained.

Note: It is recommended to stem (or lemmatize) the tokens in the document before adding it to the model.
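
How firstToken and numTokens select one document from a flat corpus can be illustrated with a small sketch. `sliceDocument` is a hypothetical helper, not a member of TopicModel, and the bounds check is an assumption about sensible behaviour rather than a description of the actual implementation.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not part of TopicModel): copies the tokens of one
// document out of a flat corpus, given its first token and its length.
std::vector<std::string> sliceDocument(
        const std::vector<std::string>& tokens,
        std::size_t firstToken,
        std::size_t numTokens
) {
    // reject ranges that reach beyond the end of the corpus
    if(firstToken > tokens.size() || numTokens > tokens.size() - firstToken) {
        throw std::out_of_range("document exceeds the corpus");
    }

    const auto begin = tokens.begin() + static_cast<std::ptrdiff_t>(firstToken);

    // the returned vector is an independent copy of the selected tokens
    return std::vector<std::string>(begin, begin + static_cast<std::ptrdiff_t>(numTokens));
}
```

Because the result is a copy, the original corpus can be cleared afterwards without affecting it, which matches the behaviour described for addDocument().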

References DATA_TOPICMODEL_CALL.

Referenced by crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo().

◆ clear()

void crawlservpp::Data::TopicModel::clear ( bool labelingOptions )

inline

Clears the model, resets its settings and frees memory.

Parameters

labelingOptions If true, labeling options will also be cleared.

Referenced by load(), and crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo().

◆ getBurnInIterations()

std::size_t crawlservpp::Data::TopicModel::getBurnInIterations ( ) const

inline

Gets the number of skipped iterations.

Returns: The number of iterations skipped at the beginning of training.

Exceptions

TopicModel::Exception if no documents have been added, the topic modeller has been cleared, or the model has not been trained yet.

References DATA_TOPICMODEL_RETURN.

Referenced by getModelInfo().

◆ getDocumentId()

std::size_t crawlservpp::Data::TopicModel::getDocumentId ( const std::string & name ) const

inline

Gets the ID of the document with the specified name.

Parameters

name	The name of the document.

Returns: The ID of the document with the specified name.

Exceptions

TopicModel::Exception if no documents have been added, the topic modeller has been cleared, or no document with the specified name has been added to the model.

References DATA_TOPICMODEL_RETRIEVE.

◆ getDocuments()

std::unordered_map< std::string, std::size_t > crawlservpp::Data::TopicModel::getDocuments ( ) const

inline

Gets a map with the documents and their indices from the model.

Returns: An unordered map with the document names as keys and the document indices as values.

Exceptions

TopicModel::Exception if no documents have been added, the topic modeller has been cleared, or the model has not been trained yet.

References DATA_TOPICMODEL_RETRIEVE, and getNumberOfDocuments().

◆ getDocumentsTopics() [1/2]

std::vector< std::pair< std::string, std::vector< float > > > crawlservpp::Data::TopicModel::getDocumentsTopics ( std::unordered_set< std::string > & done ) const

inline

Gets the topic distributions of all documents the model has been trained on, if available.

Unnamed documents inside the model will be ignored.

Parameters

done	An unordered set used to avoid classifying any document twice. All documents whose names are contained in this set will be ignored, and the names of all returned documents will be added to it.

Returns: A vector containing pairs of a string containing the name of the document and a vector of floating-point numbers indicating the topic distribution for that document. An empty vector if the model does not contain any named documents, e.g. if a model has been loaded that has not been saved together with all its training data.
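
The role of `done` can be sketched with a stand-alone filter. `takeUnseen` and `DocumentTopics` are hypothetical names introduced only for this illustration; the real member function reads the distributions from the model instead of taking them as an argument.

```cpp
#include <string>
#include <unordered_set>
#include <utility>
#include <vector>

using DocumentTopics = std::pair<std::string, std::vector<float>>;

// Hypothetical helper: keeps only documents whose names are not yet in `done`
// and records the names of the documents it returns.
std::vector<DocumentTopics> takeUnseen(
        const std::vector<DocumentTopics>& all,
        std::unordered_set<std::string>& done
) {
    std::vector<DocumentTopics> result;

    for(const auto& entry : all) {
        if(entry.first.empty()) {
            continue; // unnamed documents are ignored
        }

        // insert() reports whether the name was new to the set
        if(done.insert(entry.first).second) {
            result.push_back(entry);
        }
    }

    return result;
}
```

Calling the function twice with the same set returns each named document only once, which is the purpose of passing `done` by reference.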

Exceptions

TopicModel::Exception if no documents have been added, the topic modeller has been cleared, or the model has not been trained yet.

References DATA_TOPICMODEL_RETRIEVE, and getNumberOfDocuments().

Referenced by crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo().

◆ getDocumentsTopics() [2/2]

std::vector< std::vector< float > > crawlservpp::Data::TopicModel::getDocumentsTopics	(	const std::vector< std::vector< std::string >> &	documents,
		std::size_t	maxIterations,
		std::size_t	numberOfWorkers
	)		const

inline

Infers the topic distributions for previously unseen documents.

Parameters

documents	A constant reference to a vector containing vectors with the processed tokens of the documents to infer the topics for.
maxIterations	The maximum number of iterations to perform for inferring the topic distributions.
numberOfWorkers	The number of worker threads to be used for inferring the topic distributions.

Returns: A vector containing vectors of floating-point numbers indicating the topic distribution for each of the given documents.

Exceptions

TopicModel::Exception if no documents have been added, the topic modeller has been cleared, the model has not been trained yet, or a document could not be created.

References DATA_TOPICMODEL_CALL, and DATA_TOPICMODEL_RETRIEVE.

◆ getIterations()

std::size_t crawlservpp::Data::TopicModel::getIterations ( ) const

inline

Gets the number of training iterations performed so far.

Returns: The number of iterations performed during training.

Exceptions

TopicModel::Exception if no documents have been added, the topic modeller has been cleared, or the model has not been trained yet.

References DATA_TOPICMODEL_RETURN.

Referenced by getModelInfo().

◆ getLogLikelihoodPerToken()

double crawlservpp::Data::TopicModel::getLogLikelihoodPerToken ( ) const

inline

Gets the log-likelihood per token.

Returns: The current log-likelihood per token.
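
A per-token log-likelihood is often reported as perplexity. The sketch below uses the common definition, perplexity = exp(-log-likelihood per token), and assumes that the log-likelihood uses the natural logarithm. `perplexityFromLogLikelihood` is a hypothetical helper and not part of TopicModel.

```cpp
#include <cmath>

// Perplexity as commonly derived from the per-token log-likelihood:
// perplexity = exp(-logLikelihoodPerToken). Lower values are better.
// ASSUMPTION: the log-likelihood is given in natural-log units.
inline double perplexityFromLogLikelihood(double logLikelihoodPerToken) {
    return std::exp(-logLikelihoodPerToken);
}
```

Since the log-likelihood per token is usually negative, a value closer to zero yields a lower perplexity.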

Exceptions

TopicModel::Exception if no documents have been added, the topic modeller has been cleared, or the model has not been trained yet.

References DATA_TOPICMODEL_RETURN.

Referenced by getModelInfo(), and crawlservpp::Module::Analyzer::Algo::TopicModelling::onAlgoTick().

◆ getModelInfo()

Struct::TopicModelInfo crawlservpp::Data::TopicModel::getModelInfo ( ) const

inline

Gets information about the model after training.

Returns: A structure containing all available information about the trained model.

Exceptions

TopicModel::Exception if no documents have been added, the topic modeller has been cleared, or the model has not been trained yet.

Referenced by crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo().

◆ getModelName()

std::string_view crawlservpp::Data::TopicModel::getModelName ( ) const

inline

Gets the name of the current model.

Returns: A view of a string containing the name of the current model.

Exceptions

TopicModel::Exception if no documents have been added or the topic modeller has already been cleared, i.e. no model is available.

References crawlservpp::Data::hdpModelName, and crawlservpp::Data::ldaModelName.

Referenced by getModelInfo().

◆ getNumberOfDocuments()

std::size_t crawlservpp::Data::TopicModel::getNumberOfDocuments ( ) const

inline

Gets the number of added documents after training has begun.

Returns: The number of added documents.

Exceptions

TopicModel::Exception if no documents have been added, the topic modeller has been cleared, or the model has not been trained yet.

References DATA_TOPICMODEL_RETURN.

Referenced by getDocuments(), getDocumentsTopics(), and getModelInfo().

◆ getNumberOfTokens()

std::size_t crawlservpp::Data::TopicModel::getNumberOfTokens ( ) const

inline

Gets the number of tokens after training has begun.

Returns: The number of tokens after stopwords have been removed.

Exceptions

TopicModel::Exception if no documents have been added, the topic modeller has been cleared, or the model has not been trained yet.

References DATA_TOPICMODEL_RETURN.

Referenced by getModelInfo().

◆ getNumberOfTopics()

std::size_t crawlservpp::Data::TopicModel::getNumberOfTopics ( ) const

inline

Gets the number of topics.

Returns: Number of topics that are alive after training. Returns the fixed number of topics (k) if it is non-zero, i.e. when the LDA algorithm is being used.

Exceptions

TopicModel::Exception if no documents have been added, the topic modeller has been cleared, or the model has not been trained yet.

Referenced by getModelInfo(), crawlservpp::Module::Analyzer::Algo::TopicModelling::onAlgoTick(), and crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo().

◆ getOriginalVocabularySize()

std::size_t crawlservpp::Data::TopicModel::getOriginalVocabularySize ( ) const

inline

Gets the number of distinct tokens before training.

Returns: The number of distinct tokens before stopwords have been removed.

Exceptions

TopicModel::Exception if no documents have been added, the topic modeller has been cleared, or the model has not been trained yet.

Referenced by getModelInfo().

◆ getParameterOptimizationInterval()

std::size_t crawlservpp::Data::TopicModel::getParameterOptimizationInterval ( ) const

inline

Gets the interval for parameter optimization, in iterations.

Returns: The interval after which the parameters of the model will be optimized, in iterations.

Exceptions

TopicModel::Exception if no documents have been added, the topic modeller has been cleared, or the model has not been trained yet.

References DATA_TOPICMODEL_RETURN.

Referenced by getModelInfo().

◆ getRandomNumberGenerationSeed()

std::size_t crawlservpp::Data::TopicModel::getRandomNumberGenerationSeed ( ) const

inline

Gets the seed used for random number generation.

Returns: The seed used for random number generation by the model.

Exceptions

TopicModel::Exception if no documents have been added, the topic modeller has been cleared, or the model has not been trained yet.

◆ getRemovedTokens()

std::vector< std::string > crawlservpp::Data::TopicModel::getRemovedTokens ( ) const

inline

Gets the most common tokens (i.e. stopwords) that have been removed.

Returns: A vector of strings containing the removed tokens. The vector is empty if no tokens have been removed.

Exceptions

TopicModel::Exception if no documents have been added, the topic modeller has been cleared, or the model has not been trained yet.

Referenced by getModelInfo().

◆ getTermWeighting()

std::string_view crawlservpp::Data::TopicModel::getTermWeighting ( ) const

inline

Gets the term weighting mode of the current model.

Returns: A view of a string containing the term weighting mode of the current model.

Exceptions

TopicModel::Exception if no documents have been added or the topic modeller has already been cleared, i.e. no model is available.

Referenced by getModelInfo().

◆ getTokenEntropy()

double crawlservpp::Data::TopicModel::getTokenEntropy ( ) const

inline

Gets the token entropy after training.

Returns: The token entropy for the whole corpus.

Exceptions

TopicModel::Exception if no documents have been added, the topic modeller has been cleared, or the model has not been trained yet.

References DATA_TOPICMODEL_RETRIEVE_NOARGS.

Referenced by getModelInfo().

◆ getTopics()

std::vector< std::size_t > crawlservpp::Data::TopicModel::getTopics ( ) const

inline

Gets the IDs of the topics.

Returns: Vector with the IDs of the topics that are alive after training. Will return [0, 1, ..., k-1] if the number of topics is fixed, i.e. the LDA algorithm is being used.

Exceptions

TopicModel::Exception if no documents have been added, the topic modeller has been cleared, or the model has not been trained yet.

Referenced by crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo().

◆ getTopicsSorted()

std::vector< std::pair< std::size_t, std::uint64_t > > crawlservpp::Data::TopicModel::getTopicsSorted ( ) const

inline

Gets the IDs and counts of the topics, sorted by count.

Returns: Vector of pairs with the IDs and counts of the topics, sorted by the latter.

Exceptions

TopicModel::Exception if no documents have been added, the topic modeller has been cleared, or the model has not been trained yet.

References DATA_TOPICMODEL_RETRIEVE_NOARGS.

Referenced by crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo().

◆ getTopicTopNLabels()

std::vector< std::pair< std::string, float > > crawlservpp::Data::TopicModel::getTopicTopNLabels	(	std::size_t	topic,
		std::size_t	n
	)		const

inline

Gets the top N labels for the specified topic.

Parameters

topic	The ID of the topic.
n	The number of labels to retrieve from the topic, i.e. `N`.

Returns: A vector containing pairs of the top N labels of the specified topic and their probabilities, sorted by the latter.

Exceptions

TopicModel::Exception if no documents have been added, the topic modeller has been cleared, the model has not been trained yet, or automated topic labelling has not been activated.

Referenced by crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo().

◆ getTopicTopNTokens()

std::vector< std::pair< std::string, float > > crawlservpp::Data::TopicModel::getTopicTopNTokens	(	std::size_t	topic,
		std::size_t	n
	)		const

inline

Gets the top N tokens for the specified topic.

Parameters

topic	The ID of the topic.
n	The number of top tokens to retrieve from the topic, i.e. `N`.

Returns: A vector containing pairs of the top N tokens of the specified topic and their probabilities, sorted by the latter.
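
Selecting the top N entries from (token, probability) pairs can be sketched as follows. `topN` and `TokenProbability` are hypothetical names for this illustration only, and ordering from most to least probable is an assumption, since the sort direction is not stated here.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using TokenProbability = std::pair<std::string, float>;

// Hypothetical helper: returns the n most probable tokens.
// ASSUMPTION: the most probable token comes first.
std::vector<TokenProbability> topN(std::vector<TokenProbability> tokens, std::size_t n) {
    // never request more entries than are available
    n = std::min(n, tokens.size());

    // only the first n positions need to be in sorted order
    std::partial_sort(
            tokens.begin(),
            tokens.begin() + static_cast<std::ptrdiff_t>(n),
            tokens.end(),
            [](const TokenProbability& a, const TokenProbability& b) {
                return a.second > b.second;
            }
    );

    tokens.resize(n);

    return tokens;
}
```

Using std::partial_sort avoids sorting the whole vocabulary when only a few top tokens are needed.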

Exceptions

TopicModel::Exception if no documents have been added, the topic modeller has been cleared, or the model has not been trained yet.

References DATA_TOPICMODEL_RETRIEVE.

Referenced by crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo().

◆ getVocabulary()

const std::vector< std::string > & crawlservpp::Data::TopicModel::getVocabulary ( ) const

inline

Gets the complete dictionary used by the model.

Note: Includes tokens removed during training.

Returns: Constant reference to a vector of strings containing the complete dictionary of the model.

Exceptions

TopicModel::Exception if no documents have been added, the topic modeller has been cleared, or the model has not been trained yet.

◆ getVocabularySize()

std::size_t crawlservpp::Data::TopicModel::getVocabularySize ( ) const

inline

Gets the number of distinct tokens after training has begun.

Returns: The number of distinct tokens after stopwords have been removed.

Exceptions

TopicModel::Exception if no documents have been added, the topic modeller has been cleared, or the model has not been trained yet.

References DATA_TOPICMODEL_RETURN.

Referenced by getModelInfo().

◆ label()

void crawlservpp::Data::TopicModel::label ( std::size_t threads )

inline

Labels the resulting topics.

Does nothing, except clearing any existing labeling, if labeling has not been activated or has been deactivated.

Parameters

threads Number of threads. One for single threading. Zero for guessing the number of concurrent threads supported by the hardware.
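
The meaning of a zero `threads` argument can be sketched with the standard library. `resolveThreads` is a hypothetical helper, and the fallback to a single thread is an assumption made for this sketch, not documented behaviour.

```cpp
#include <cstddef>
#include <thread>

// Hypothetical helper: resolves the documented meaning of a `threads` argument.
// Zero asks for the number of concurrent threads supported by the hardware.
// std::thread::hardware_concurrency() may return zero if that number cannot
// be determined, so the sketch falls back to one thread (ASSUMPTION).
inline std::size_t resolveThreads(std::size_t threads) {
    if(threads > 0) {
        return threads;
    }

    const unsigned int detected = std::thread::hardware_concurrency();

    return detected > 0 ? detected : 1;
}
```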

Exceptions

TopicModel::Exception if no documents have been added, the topic modeller has been cleared, the model has not been trained yet, or the file cannot be read.

See also: setLabelingOptions

Referenced by crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo(), and setLabelingOptions().

◆ load()

std::size_t crawlservpp::Data::TopicModel::load ( const std::string & fileName )

inline

Loads a model from a file.

Clears all previous data before trying to load the new model, if applicable.

Parameters

fileName Name of the file to load the model from.

Returns: The number of bytes read from the model file (best guess).

Exceptions

TopicModel::Exception if the model could not be loaded from the specified file, e.g. because the file does not exist or the file format is unsupported.

References clear(), DATA_TOPICMODEL_CALL, crawlservpp::Data::defaultNumberOfInitialTopics, and setUseIdf().

Referenced by crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo().

◆ save()

std::size_t crawlservpp::Data::TopicModel::save	(	const std::string &	fileName,
		bool	full
	)		const

inline

Writes the model to a file.

Parameters

fileName	Name of the file to write the model to.
full	Sets whether to save all documents with the model so that the training can be continued. If false, the saved model can only be used for topic classification.

Returns: The number of bytes written to the model file (best guess).

Exceptions

TopicModel::Exception if no documents have been added, the topic modeller has been cleared, the model has not been trained yet, or the file cannot be written.

References DATA_TOPICMODEL_CALL.

Referenced by crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo().

◆ setBurnInIteration()

void crawlservpp::Data::TopicModel::setBurnInIteration ( std::size_t skipIterations )

inline

Sets the number of iterations that will be skipped at the beginning of training.

Parameters

skipIterations The number of iterations to be skipped at the beginning of the training.

Exceptions

TopicModel::Exception if the model has already been trained.

References DATA_TOPICMODEL_CALL.

Referenced by crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo().

◆ setFixedNumberOfTopics()

void crawlservpp::Data::TopicModel::setFixedNumberOfTopics ( std::size_t k )

inline

Sets the fixed number of topics.

Parameters

k	The fixed number of topics, or zero for using the HDP algorithm to determine the number of topics from the data.

Exceptions

TopicModel::Exception if the model has already been initialized.

Referenced by crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo().

◆ setInitialParameters()

void crawlservpp::Data::TopicModel::setInitialParameters	(	std::size_t	initialTopics,
		float	alpha,
		float	eta,
		float	gamma
	)

inline

Sets the initial parameters for the model.

Parameters

initialTopics	The initial number of topics between 2 and 32767. The number of topics will be adjusted for the data during training, if the HDP algorithm is used. Will be ignored if a fixed number of topics is set, i.e. the LDA algorithm is used.
alpha	The initial concentration coefficient of the Dirichlet Process for document-table.
eta	The Dirichlet prior on the per-topic token distribution.
gamma	The initial concentration coefficient of the Dirichlet Process for table-topic. Will be ignored if LDA is used, i.e. the fixed number of topics is non-zero.

Exceptions

TopicModel::Exception if the model has already been initialized.

Referenced by crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo().

◆ setLabelingOptions()

void crawlservpp::Data::TopicModel::setLabelingOptions	(	bool	activate,
		std::size_t	minCf,
		std::size_t	minDf,
		std::size_t	minLength,
		std::size_t	maxLength,
		std::size_t	maxCandidates,
		float	smoothing,
		float	mu,
		std::size_t	windowSize
	)

inline

Sets the options for automated topic labeling.

Re-labels the topics if they have already been labeled.

Parameters

activate	Sets whether to activate automated topic labeling.
minCf	The minimum total occurrence of a collocation to be used as a topic label.
minDf	The minimum number of documents in which a collocation needs to occur to be used as a topic label.
minLength	The minimum length of a topic label, in words.
maxLength	The maximum length of a topic label, in words. If set to one, single words will be included in possible labels, although they are excluded in counting the maximum number of candidates.
maxCandidates	Sets the maximum number of label candidates to extract from the topics.
smoothing	A small value greater than zero for Laplace smoothing.
mu	A discriminative coefficient. Candidates with a high score on a specific topic and with a low score on other topics get a higher final score when this value is larger.
windowSize	The size of the sliding window for calculating co-occurrence. If it equals or exceeds the length of a document, the whole document is used. Should be between 50 and 100 for long documents.

See also: label

References label().

Referenced by crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo().

◆ setParameterOptimizationInterval()

void crawlservpp::Data::TopicModel::setParameterOptimizationInterval ( std::size_t interval )

inline

Sets the interval for parameter optimization, in iterations.

Parameters

interval The interval after which the parameters of the model will be optimized, in iterations.

Exceptions

TopicModel::Exception if the model has already been initialized.

Referenced by crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo().

◆ setRandomNumberGenerationSeed()

void crawlservpp::Data::TopicModel::setRandomNumberGenerationSeed ( std::size_t newSeed )

inline

Sets the seed for random number generation.

Parameters

newSeed The seed used by the model for the generation of random numbers.

Exceptions

TopicModel::Exception if the model has already been initialized.

Referenced by crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo().

◆ setTokenRemoval()

void crawlservpp::Data::TopicModel::setTokenRemoval	(	std::size_t	collectionFrequency,
		std::size_t	documentFrequency,
		std::size_t	fixedNumberOfTopTokens
	)

inline

Sets which (un)common tokens to remove before training.

Parameters

collectionFrequency	The minimum number of occurrences in the corpus required for a token to be kept. Zero or one to not use this criterion.
documentFrequency	The minimum number of documents in which a token is required to occur in order to be kept. Zero or one to not use this criterion.
fixedNumberOfTopTokens	The number of most-occurring tokens that will be classified as stopwords and ignored. Zero to not define any stopwords.
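
The interplay of the three criteria can be sketched on pre-computed token statistics. `TokenStats` and `keptTokens` are hypothetical names; ranking the "most-occurring" tokens by collection frequency is an assumption of this sketch, and the underlying library may rank them differently.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical per-token statistics; the names are illustrative only.
struct TokenStats {
    std::string token;
    std::size_t collectionFrequency; // occurrences in the whole corpus
    std::size_t documentFrequency;   // number of documents containing the token
};

// Returns the tokens that survive the three documented criteria.
std::vector<std::string> keptTokens(
        std::vector<TokenStats> stats,
        std::size_t minCollectionFrequency,
        std::size_t minDocumentFrequency,
        std::size_t fixedNumberOfTopTokens
) {
    // most frequent tokens first, so that the top N can be skipped as stopwords
    // (ASSUMPTION: "most-occurring" refers to the collection frequency)
    std::stable_sort(stats.begin(), stats.end(), [](const TokenStats& a, const TokenStats& b) {
        return a.collectionFrequency > b.collectionFrequency;
    });

    std::vector<std::string> kept;

    for(std::size_t i{fixedNumberOfTopTokens}; i < stats.size(); ++i) {
        // thresholds of zero or one never exclude a token that occurs at all
        if(
                stats[i].collectionFrequency >= minCollectionFrequency
                && stats[i].documentFrequency >= minDocumentFrequency
        ) {
            kept.push_back(stats[i].token);
        }
    }

    return kept;
}
```

In this sketch, the most frequent token is dropped as a stopword when fixedNumberOfTopTokens is one, and rare tokens are dropped when they fall below either minimum.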

Exceptions

TopicModel::Exception if the model has already been trained.

Referenced by crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo().

◆ setUseIdf()

void crawlservpp::Data::TopicModel::setUseIdf ( bool idf )

inline

Sets whether to use IDF term weighting.

Parameters

idf	If true, IDF term weighting is used. If false, every term is weighted the same.
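
For orientation, the textbook definition of inverse document frequency is sketched below. `idf` here is a hypothetical free function; the exact weighting formula applied by the underlying library is not given in this documentation and may differ.

```cpp
#include <cmath>
#include <cstddef>

// Textbook inverse document frequency: log(N / df).
// ASSUMPTION: the formula used by the underlying library may differ.
// Precondition: documentFrequency must be greater than zero.
inline double idf(std::size_t numberOfDocuments, std::size_t documentFrequency) {
    return std::log(
            static_cast<double>(numberOfDocuments)
            / static_cast<double>(documentFrequency)
    );
}
```

Under this definition, a token occurring in every document gets a weight of zero, while tokens occurring in only a few documents get a high weight.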

Exceptions

TopicModel::Exception if the model has already been initialized.

Referenced by load(), and crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo().

◆ startTraining()

void crawlservpp::Data::TopicModel::startTraining ( )

inline

Starts training without performing any iteration.

Can be used to retrieve general information about the training data afterwards.

Exceptions

TopicModel::Exception if no documents have been added, the topic modeller has been cleared, or an invalid token ID is encountered while removing stopwords.

References crawlservpp::Helper::Versions::getTomotoVersion().

Referenced by clear(), and crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo().

◆ train()

void crawlservpp::Data::TopicModel::train	(	std::size_t	iterations,
		std::size_t	threads
	)

inline

Trains the underlying HDP or LDA model.

Training can be performed multiple times, but after training has been started no additional documents can be added to the model.
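
The lifecycle rule above can be sketched with a toy class. `ToyModel` is not the real TopicModel: it models only the documented constraints, throws std::logic_error instead of TopicModel::Exception, and assumes that repeated calls to train() accumulate the iteration count.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Toy stand-in, NOT the real TopicModel: it only models the documented rule
// that documents cannot be added once training has started.
class ToyModel {
public:
    void addDocument(const std::string& name) {
        if(this->trainingStarted) {
            throw std::logic_error("cannot add '" + name + "': training has started");
        }

        this->names.push_back(name);
    }

    void train(std::size_t iterations) {
        if(this->names.empty()) {
            throw std::logic_error("no documents have been added");
        }

        this->trainingStarted = true;
        this->iterationsDone += iterations; // ASSUMPTION: repeated calls accumulate
    }

    std::size_t getIterations() const {
        return this->iterationsDone;
    }

private:
    std::vector<std::string> names;
    bool trainingStarted{false};
    std::size_t iterationsDone{0};
};
```

In this sketch, training in several smaller steps is allowed, which makes it possible to check a quality measure between the steps, whereas adding a document after the first step fails.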

Parameters

iterations	The number of iterations for modelling the topics.
threads	Number of threads. One for single threading. Zero for guessing the number of concurrent threads supported by the hardware.

Exceptions

TopicModel::Exception if no documents have been added or the topic modeller has been cleared.

Referenced by crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo().

The documentation for this class was generated from the following file:

Data/TopicModel.hpp
