crawlserv++
[under development]
Application for crawling and analyzing textual content of websites.
|
Topic modeller. More...
#include <TopicModel.hpp>
Classes | |
class | Exception |
Class for topic modelling-specific exceptions. More... | |
Getters | |
std::size_t | getNumberOfDocuments () const |
Gets the number of added documents after training has begun. More... | |
std::unordered_map< std::string, std::size_t > | getDocuments () const |
Gets a map with the documents and their indices from the model. More... | |
std::size_t | getVocabularySize () const |
Gets the number of distinct tokens after training has begun. More... | |
std::size_t | getOriginalVocabularySize () const |
Gets the number of distinct tokens before training. More... | |
const std::vector< std::string > & | getVocabulary () const |
Gets the complete dictionary used by the model. More... | |
std::size_t | getNumberOfTokens () const |
Gets the number of tokens after training has begun. More... | |
std::size_t | getBurnInIterations () const |
Gets the number of skipped iterations. More... | |
std::size_t | getIterations () const |
Gets the number of training iterations performed so far. More... | |
std::size_t | getParameterOptimizationInterval () const |
Gets the interval for parameter optimization, in iterations. More... | |
std::size_t | getRandomNumberGenerationSeed () const |
Gets the seed used for random number generation. More... | |
std::string_view | getModelName () const |
Gets the name of the current model. More... | |
std::string_view | getTermWeighting () const |
Gets the term weighting mode of the current model. More... | |
std::size_t | getDocumentId (const std::string &name) const |
Gets the ID of the document with the specified name. More... | |
std::vector< std::string > | getRemovedTokens () const |
Gets the most common tokens (i.e. stopwords) that have been removed. More... | |
std::size_t | getNumberOfTopics () const |
Gets the number of topics. More... | |
std::vector< std::size_t > | getTopics () const |
Gets the IDs of the topics. More... | |
std::vector< std::pair< std::size_t, std::uint64_t > > | getTopicsSorted () const |
Gets the IDs and counts of the topics, sorted by count. More... | |
double | getLogLikelihoodPerToken () const |
Gets the log-likelihood per token. More... | |
double | getTokenEntropy () const |
Gets the token entropy after training. More... | |
std::vector< std::pair< std::string, float > > | getTopicTopNTokens (std::size_t topic, std::size_t n) const |
Gets the top N tokens for the specified topic. More... | |
std::vector< std::pair< std::string, float > > | getTopicTopNLabels (std::size_t topic, std::size_t n) const |
Gets the top N labels for the specified topic. More... | |
std::vector< std::pair< std::string, std::vector< float > > > | getDocumentsTopics (std::unordered_set< std::string > &done) const |
Gets the topic distributions of all documents the model has been trained on, if available. More... | |
std::vector< std::vector< float > > | getDocumentsTopics (const std::vector< std::vector< std::string >> &documents, std::size_t maxIterations, std::size_t numberOfWorkers) const |
Infers the topic distributions for previously unseen documents. More... | |
TopicModelInfo | getModelInfo () const |
Gets information about the model after training. More... | |
Setters | |
void | setFixedNumberOfTopics (std::size_t k) |
Sets the fixed number of topics. More... | |
void | setUseIdf (bool idf) |
Sets whether to use IDF term weighting. More... | |
void | setBurnInIteration (std::size_t skipIterations) |
Sets the number of iterations that will be skipped at the beginning of training. More... | |
void | setTokenRemoval (std::size_t collectionFrequency, std::size_t documentFrequency, std::size_t fixedNumberOfTopTokens) |
Sets which (un)common tokens to remove before training. More... | |
void | setInitialParameters (std::size_t initialTopics, float alpha, float eta, float gamma) |
Sets the initial parameters for the model. More... | |
void | setParameterOptimizationInterval (std::size_t interval) |
Sets the interval for parameter optimization, in iterations. More... | |
void | setRandomNumberGenerationSeed (std::size_t newSeed) |
Sets the seed for random number generation. More... | |
void | setLabelingOptions (bool activate, std::size_t minCf, std::size_t minDf, std::size_t minLength, std::size_t maxLength, std::size_t maxCandidates, float smoothing, float mu, std::size_t windowSize) |
Sets the options for automated topic labeling. More... | |
Topic Modelling | |
void | addDocument (const std::string &name, const std::vector< std::string > &tokens, std::size_t firstToken, std::size_t numTokens) |
Adds a document from a tokenized corpus. More... | |
void | startTraining () |
Starts training without performing any iteration. More... | |
void | train (std::size_t iterations, std::size_t threads) |
Trains the underlying HDP or LDA model. More... | |
void | label (std::size_t threads) |
Labels the resulting topics. More... | |
Load and Save | |
std::size_t | load (const std::string &fileName) |
Loads a model from a file. More... | |
std::size_t | save (const std::string &fileName, bool full) const |
Writes the model to a file. More... | |
Cleanup | |
void | clear (bool labelingOptions) |
Clears the model, resets its settings and frees memory. More... | |
Topic modeller.
Uses the Hierarchical Dirichlet Process (HDP) and Latent Dirichlet Allocation (LDA) algorithms.
The former will be used if no fixed number of topics is given, the latter will be used if a fixed number of topics is given.
Uses tomoto, the underlying C++ API of tomotopy, see: https://bab2min.github.io/tomotopy/
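The selection rule described above can be sketched as follows. This is only an illustration of the documented behaviour, not the class's implementation; the names `Algorithm` and `selectAlgorithm` are hypothetical and do not appear in TopicModel.hpp.

```cpp
#include <cstddef>

// Hypothetical names for illustration only; not part of TopicModel.hpp.
enum class Algorithm { HDP, LDA };

// Zero means "no fixed number of topics": HDP determines the number from the data.
// Any non-zero value selects LDA with exactly that many topics.
constexpr Algorithm selectAlgorithm(std::size_t fixedNumberOfTopics) {
    return fixedNumberOfTopics == 0 ? Algorithm::HDP : Algorithm::LDA;
}
```

For example, `selectAlgorithm(0)` yields `Algorithm::HDP`, while `selectAlgorithm(20)` yields `Algorithm::LDA`.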
If you use the HDP topic modelling algorithm, please cite:
Teh, Y. W., Jordan, M. I., Beal, M. J., & Blei, D. M. (2005). Sharing clusters among related groups: Hierarchical Dirichlet processes. In Advances in neural information processing systems, 1385–1392.
Newman, D., Asuncion, A., Smyth, P., & Welling, M. (2009). Distributed algorithms for topic models. Journal of Machine Learning Research, 10 (Aug), 1801–1828.
If you use the LDA topic modelling algorithm, please cite:
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan), 993–1022.
Newman, D., Asuncion, A., Smyth, P., & Welling, M. (2009). Distributed algorithms for topic models. Journal of Machine Learning Research, 10 (Aug), 1801–1828.
|
inline |
Adds a document from a tokenized corpus.
A copy of the document will be created, i.e. the corpus can be cleared after all documents have been added.
name | The name of the document. |
tokens | Constant reference to all tokens in the corpus. |
firstToken | Index of the document's first token. |
numTokens | Number of tokens in the document. |
TopicModel::Exception | if the model has already been trained. |
References DATA_TOPICMODEL_CALL.
Referenced by crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo().
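The relationship between `tokens`, `firstToken` and `numTokens` can be pictured as copying a slice out of a corpus-wide token vector. The following is a minimal sketch under that reading; the helper `copyDocumentTokens` is hypothetical, and the bounds check is an assumption made for the example rather than documented behaviour.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: copies the tokens in [firstToken, firstToken + numTokens)
// out of the corpus, so that the corpus can be cleared afterwards.
std::vector<std::string> copyDocumentTokens(
        const std::vector<std::string>& tokens,
        std::size_t firstToken,
        std::size_t numTokens
) {
    // Assumption for this sketch: out-of-range slices are rejected.
    if(firstToken > tokens.size() || numTokens > tokens.size() - firstToken) {
        throw std::out_of_range("document range exceeds the corpus");
    }

    const auto first = tokens.begin() + static_cast<std::ptrdiff_t>(firstToken);

    return std::vector<std::string>(first, first + static_cast<std::ptrdiff_t>(numTokens));
}
```

Because the result is an independent copy, clearing the original corpus vector afterwards does not affect it.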
|
inline |
Clears the model, resets its settings and frees memory.
labelingOptions | If true, labeling options will also be cleared. |
References DATA_TOPICMODEL_CALL, DATA_TOPICMODEL_RETRIEVE_NOARGS, DATA_TOPICMODEL_RETURN, crawlservpp::Data::defaultAlpha, crawlservpp::Data::defaultEta, crawlservpp::Data::defaultGamma, crawlservpp::Data::defaultNumberOfInitialTopics, crawlservpp::Data::defaultOptimizationInterval, crawlservpp::Helper::Memory::free(), crawlservpp::Data::PickleDict::getFloat(), crawlservpp::Data::PickleDict::getNumber(), crawlservpp::Data::PickleDict::getString(), crawlservpp::Data::modelFileHead, crawlservpp::Data::modelFileTermWeightingIdf, crawlservpp::Data::modelFileTermWeightingLen, crawlservpp::Data::modelFileTermWeightingOne, crawlservpp::Data::modelFileType, crawlservpp::Data::PickleDict::setFloat(), crawlservpp::Data::PickleDict::setNumber(), crawlservpp::Data::PickleDict::setString(), startTraining(), and crawlservpp::Data::PickleDict::writeTo().
Referenced by load(), and crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo().
|
inline |
Gets the number of skipped iterations.
TopicModel::Exception | if no documents have been added, the topic modeller has been cleared, or the model has not been trained yet. |
References DATA_TOPICMODEL_RETURN.
Referenced by getModelInfo().
|
inline |
Gets the ID of the document with the specified name.
name | The name of the document. |
TopicModel::Exception | if no documents have been added, the topic modeller has been cleared, or no document with the specified name has been added to the model. |
References DATA_TOPICMODEL_RETRIEVE.
|
inline |
Gets a map with the documents and their indices from the model.
TopicModel::Exception | if no documents have been added, the topic modeller has been cleared, or the model has not been trained yet. |
References DATA_TOPICMODEL_RETRIEVE, and getNumberOfDocuments().
|
inline |
Gets the topic distributions of all documents the model has been trained on, if available.
Unnamed documents inside the model will be ignored.
done | An unordered set used to avoid classifying any article twice. All articles with an ID contained in this set will be ignored. The IDs of all articles that will be returned will be added to the set. |
TopicModel::Exception | if no documents have been added, the topic modeller has been cleared, or the model has not been trained yet. |
References DATA_TOPICMODEL_RETRIEVE, and getNumberOfDocuments().
Referenced by crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo().
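The contract of the `done` argument can be illustrated in isolation. The sketch below is not the member function itself; `DocTopics` and `collectNew` are hypothetical names, and the input vector stands in for whatever the model stores internally.

```cpp
#include <string>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical alias: a document name paired with its topic distribution.
using DocTopics = std::pair<std::string, std::vector<float>>;

// Returns only those documents that are named and not yet contained in 'done',
// and records every returned document in 'done'.
std::vector<DocTopics> collectNew(
        const std::vector<DocTopics>& all,
        std::unordered_set<std::string>& done
) {
    std::vector<DocTopics> result;

    for(const auto& doc : all) {
        if(doc.first.empty()) {
            continue; // unnamed documents are ignored
        }

        if(!done.insert(doc.first).second) {
            continue; // already classified earlier
        }

        result.push_back(doc);
    }

    return result;
}
```

Calling the function twice with the same set returns nothing the second time, which is what prevents an article from being classified twice.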
|
inline |
Infers the topic distributions for previously unseen documents.
documents | A constant reference to a vector containing vectors with the processed tokens of the documents to infer the topics for. |
maxIterations | The maximum number of iterations to perform for inferring the topic distributions. |
numberOfWorkers | The number of worker threads to be used for inferring the topic distributions. |
TopicModel::Exception | if no documents have been added, the topic modeller has been cleared, the model has not been trained yet, or a document could not be created. |
References DATA_TOPICMODEL_CALL, and DATA_TOPICMODEL_RETRIEVE.
|
inline |
Gets the number of training iterations performed so far.
TopicModel::Exception | if no documents have been added, the topic modeller has been cleared, or the model has not been trained yet. |
References DATA_TOPICMODEL_RETURN.
Referenced by getModelInfo().
|
inline |
Gets the log-likelihood per token.
TopicModel::Exception | if no documents have been added, the topic modeller has been cleared, or the model has not been trained yet. |
References DATA_TOPICMODEL_RETURN.
Referenced by getModelInfo(), and crawlservpp::Module::Analyzer::Algo::TopicModelling::onAlgoTick().
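A common way to read the log-likelihood per token is to convert it into perplexity. The sketch below assumes that the value is a natural-logarithm likelihood; the function name is hypothetical and not part of the class.

```cpp
#include <cmath>

// Perplexity is the exponential of the negative per-token log-likelihood.
// Assumption: the log-likelihood uses the natural logarithm.
inline double perplexityFromLogLikelihood(double logLikelihoodPerToken) {
    return std::exp(-logLikelihoodPerToken);
}
```

A log-likelihood closer to zero gives a lower perplexity, so a model whose value rises from -8 to -7 during training has improved by this measure.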
|
inline |
Gets information about the model after training.
TopicModel::Exception | if no documents have been added, the topic modeller has been cleared, or the model has not been trained yet. |
References crawlservpp::Struct::TopicModelInfo::alpha, crawlservpp::Struct::TopicModelInfo::alphas, DATA_TOPICMODEL_RETRIEVE_NOARGS, crawlservpp::Struct::TopicModelInfo::eta, crawlservpp::Struct::TopicModelInfo::gamma, getBurnInIterations(), getIterations(), getLogLikelihoodPerToken(), getModelName(), getNumberOfDocuments(), getNumberOfTokens(), getNumberOfTopics(), getOriginalVocabularySize(), getParameterOptimizationInterval(), getRemovedTokens(), getTermWeighting(), getTokenEntropy(), crawlservpp::Helper::Versions::getTomotoVersion(), getVocabularySize(), crawlservpp::Struct::TopicModelInfo::initialAlpha, crawlservpp::Struct::TopicModelInfo::initialEta, crawlservpp::Struct::TopicModelInfo::initialGamma, crawlservpp::Struct::TopicModelInfo::logLikelihoodPerToken, crawlservpp::Struct::TopicModelInfo::minCollectionFrequency, crawlservpp::Struct::TopicModelInfo::minDocumentFrequency, crawlservpp::Struct::TopicModelInfo::modelName, crawlservpp::Struct::TopicModelInfo::modelVersion, crawlservpp::Struct::TopicModelInfo::numberOfBurnInSteps, crawlservpp::Struct::TopicModelInfo::numberOfDocuments, crawlservpp::Struct::TopicModelInfo::numberOfInitialTopics, crawlservpp::Struct::TopicModelInfo::numberOfIterations, crawlservpp::Struct::TopicModelInfo::numberOfTables, crawlservpp::Struct::TopicModelInfo::numberOfTokens, crawlservpp::Struct::TopicModelInfo::numberOfTopics, crawlservpp::Struct::TopicModelInfo::numberOfTopTokensToBeRemoved, crawlservpp::Struct::TopicModelInfo::optimizationInterval, crawlservpp::Struct::TopicModelInfo::removedTokens, crawlservpp::Struct::TopicModelInfo::seed, crawlservpp::Struct::TopicModelInfo::sizeOfVocabulary, crawlservpp::Struct::TopicModelInfo::sizeOfVocabularyUsed, crawlservpp::Struct::TopicModelInfo::tokenEntropy, crawlservpp::Struct::TopicModelInfo::trainedWithVersion, and crawlservpp::Struct::TopicModelInfo::weighting.
Referenced by crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo().
|
inline |
Gets the name of the current model.
TopicModel::Exception | if no documents have been added or the topic modeller has already been cleared, i.e. no model is available. |
References crawlservpp::Data::hdpModelName, and crawlservpp::Data::ldaModelName.
Referenced by getModelInfo().
|
inline |
Gets the number of added documents after training has begun.
TopicModel::Exception | if no documents have been added, the topic modeller has been cleared, or the model has not been trained yet. |
References DATA_TOPICMODEL_RETURN.
Referenced by getDocuments(), getDocumentsTopics(), and getModelInfo().
|
inline |
Gets the number of tokens after training has begun.
TopicModel::Exception | if no documents have been added, the topic modeller has been cleared, or the model has not been trained yet. |
References DATA_TOPICMODEL_RETURN.
Referenced by getModelInfo().
|
inline |
Gets the number of topics.
Returns the fixed number of topics (k) if it is non-zero, i.e. when the LDA algorithm is being used.
TopicModel::Exception | if no documents have been added, the topic modeller has been cleared, or the model has not been trained yet. |
Referenced by getModelInfo(), crawlservpp::Module::Analyzer::Algo::TopicModelling::onAlgoTick(), and crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo().
|
inline |
Gets the number of distinct tokens before training.
TopicModel::Exception | if no documents have been added, the topic modeller has been cleared, or the model has not been trained yet. |
Referenced by getModelInfo().
|
inline |
Gets the interval for parameter optimization, in iterations.
TopicModel::Exception | if no documents have been added, the topic modeller has been cleared, or the model has not been trained yet. |
References DATA_TOPICMODEL_RETURN.
Referenced by getModelInfo().
|
inline |
Gets the seed used for random number generation.
TopicModel::Exception | if no documents have been added, the topic modeller has been cleared, or the model has not been trained yet. |
|
inline |
Gets the most common tokens (i.e. stopwords) that have been removed.
TopicModel::Exception | if no documents have been added, the topic modeller has been cleared, or the model has not been trained yet. |
Referenced by getModelInfo().
|
inline |
Gets the term weighting mode of the current model.
TopicModel::Exception | if no documents have been added or the topic modeller has already been cleared, i.e. no model is available. |
Referenced by getModelInfo().
|
inline |
Gets the token entropy after training.
TopicModel::Exception | if no documents have been added, the topic modeller has been cleared, or the model has not been trained yet. |
References DATA_TOPICMODEL_RETRIEVE_NOARGS.
Referenced by getModelInfo().
|
inline |
Gets the IDs of the topics.
Returns [0,1,...k] if the number of topics is fixed, i.e. the LDA algorithm is being used.
TopicModel::Exception | if no documents have been added, the topic modeller has been cleared, or the model has not been trained yet. |
Referenced by crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo().
|
inline |
Gets the IDs and counts of the topics, sorted by count.
TopicModel::Exception | if no documents have been added, the topic modeller has been cleared, or the model has not been trained yet. |
References DATA_TOPICMODEL_RETRIEVE_NOARGS.
Referenced by crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo().
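The shape of the returned data can be illustrated with a small sorting sketch. It assumes that "sorted by count" means descending order, which the documentation does not state explicitly; `TopicCount` and `sortTopicsByCount` are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical alias: a topic ID paired with the topic's count.
using TopicCount = std::pair<std::size_t, std::uint64_t>;

// Sorts topics by count, largest first (assumed order).
// std::stable_sort keeps topics with equal counts in their original order.
std::vector<TopicCount> sortTopicsByCount(std::vector<TopicCount> topics) {
    std::stable_sort(
            topics.begin(),
            topics.end(),
            [](const TopicCount& a, const TopicCount& b) {
                return a.second > b.second;
            }
    );

    return topics;
}
```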
|
inline |
Gets the top N labels for the specified topic.
topic | The ID of the topic. |
n | The number of labels to retrieve from the topic, i.e. N. |
The top N labels of the specified topic and their probabilities, sorted by the latter.
TopicModel::Exception | if no documents have been added, the topic modeller has been cleared, the model has not been trained yet, or automated topic labelling has not been activated. |
Referenced by crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo().
|
inline |
Gets the top N tokens for the specified topic.
topic | The ID of the topic. |
n | The number of top tokens to retrieve from the topic, i.e. N. |
The top N tokens of the specified topic and their probabilities, sorted by the latter.
TopicModel::Exception | if no documents have been added, the topic modeller has been cleared, or the model has not been trained yet. |
References DATA_TOPICMODEL_RETRIEVE.
Referenced by crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo().
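Selecting the N most probable entries from a list of token/probability pairs can be sketched with `std::partial_sort`. This illustrates the idea only and is not how the class retrieves the values from the underlying model; `topN` is a hypothetical helper, and descending order by probability is assumed.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: keeps the n entries with the highest probability,
// sorted by probability in descending order (assumed order).
std::vector<std::pair<std::string, float>> topN(
        std::vector<std::pair<std::string, float>> items,
        std::size_t n
) {
    n = std::min(n, items.size()); // never request more than is available

    const auto middle = items.begin() + static_cast<std::ptrdiff_t>(n);

    std::partial_sort(
            items.begin(),
            middle,
            items.end(),
            [](const auto& a, const auto& b) { return a.second > b.second; }
    );

    items.erase(middle, items.end());

    return items;
}
```

If `n` exceeds the number of available entries, all entries are returned in sorted order.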
|
inline |
Gets the complete dictionary used by the model.
TopicModel::Exception | if no documents have been added, the topic modeller has been cleared, or the model has not been trained yet. |
|
inline |
Gets the number of distinct tokens after training has begun.
TopicModel::Exception | if no documents have been added, the topic modeller has been cleared, or the model has not been trained yet. |
References DATA_TOPICMODEL_RETURN.
Referenced by getModelInfo().
|
inline |
Labels the resulting topics.
Does nothing, except clearing any existing labeling, if labeling has not been activated or has been deactivated.
threads | Number of threads. One for single threading. Zero for guessing the number of concurrent threads supported by the hardware. |
TopicModel::Exception | if no documents have been added, the topic modeller has been cleared, the model has not been trained yet, or the file cannot be read. |
Referenced by crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo(), and setLabelingOptions().
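The convention for the `threads` argument (one for single threading, zero for guessing from the hardware) can be sketched with the standard library. The helper name is hypothetical, and how the underlying library resolves zero internally may differ.

```cpp
#include <cstddef>
#include <thread>

// Hypothetical helper: resolves the 'threads' argument to an actual thread count.
inline std::size_t resolveNumberOfThreads(std::size_t threads) {
    if(threads > 0) {
        return threads; // explicit request, e.g. 1 for single threading
    }

    // hardware_concurrency() may return 0 if the value cannot be determined
    const unsigned int detected = std::thread::hardware_concurrency();

    return detected > 0 ? detected : 1;
}
```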
|
inline |
Loads a model from a file.
Clears all previous data before trying to load the new model, if applicable.
fileName | Name of the file to load the model from. |
TopicModel::Exception | if the model could not be loaded from the specified file, e.g. because the file does not exist or the file format is unsupported. |
References clear(), DATA_TOPICMODEL_CALL, crawlservpp::Data::defaultNumberOfInitialTopics, and setUseIdf().
Referenced by crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo().
|
inline |
Writes the model to a file.
fileName | Name of the file to write the model to. |
full | Sets whether to save all documents with the model so that the training can be continued. If false, the saved model can only be used for topic classification. |
TopicModel::Exception | if no documents have been added, the topic modeller has been cleared, the model has not been trained yet, or the file cannot be written. |
References DATA_TOPICMODEL_CALL.
Referenced by crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo().
|
inline |
Sets the number of iterations that will be skipped at the beginning of training.
skipIterations | The number of iterations to be skipped at the beginning of the training. |
TopicModel::Exception | if the model has already been trained. |
References DATA_TOPICMODEL_CALL.
Referenced by crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo().
|
inline |
Sets the fixed number of topics.
k | The fixed number of topics, or zero for using the HDP algorithm to determine the number of topics from the data. |
TopicModel::Exception | if the model has already been initialized. |
Referenced by crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo().
|
inline |
Sets the initial parameters for the model.
initialTopics | The initial number of topics between 2 and 32767. The number of topics will be adjusted for the data during training, if the HDP algorithm is used. Will be ignored if a fixed number of topics is set, i.e. the LDA algorithm is used. |
alpha | The initial concentration coefficient of the Dirichlet Process for document-table. |
eta | The Dirichlet prior on the per-topic token distribution. |
gamma | The initial concentration coefficient of the Dirichlet Process for table-topic. Will be ignored if LDA will be used, i.e. the number of fixed topics is non-zero. |
TopicModel::Exception | if the model has already been initialized. |
Referenced by crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo().
|
inline |
Sets the options for automated topic labeling.
Re-labels the topics if they have already been labeled.
activate | Sets whether to activate automated topic labeling. |
minCf | The minimum total occurrence of a collocation to be used as a topic label. |
minDf | The minimum number of documents in which a collocation needs to occur to be used as a topic label. |
minLength | The minimum length of a topic label, in words. |
maxLength | The maximum length of a topic label, in words. If set to one, single words will be included in possible labels, although they are excluded in counting the maximum number of candidates. |
maxCandidates | Sets the maximum number of label candidates to extract from the topics. |
smoothing | A small value greater than zero for Laplace smoothing. |
mu | A discriminative coefficient. Candidates with a high score on a specific topic and with a low score on other topics get a higher final score when this value is larger. |
windowSize | The size of the sliding window for calculating co-occurrence. If it equals or exceeds the length of a document, the whole document is used. Should be between 50 and 100 for long documents. |
References label().
Referenced by crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo().
|
inline |
Sets the interval for parameter optimization, in iterations.
interval | The interval after which the parameters of the model will be optimized, in iterations. |
TopicModel::Exception | if the model has already been initialized. |
Referenced by crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo().
|
inline |
Sets the seed for random number generation.
newSeed | The seed used by the model for the generation of random numbers. |
TopicModel::Exception | if the model has already been initialized. |
Referenced by crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo().
|
inline |
Sets which (un)common tokens to remove before training.
collectionFrequency | The minimum number of occurrences in the corpus required for a token to be kept. Zero or one to not use this criterion. |
documentFrequency | The minimum number of documents in which a token is required to occur in order to be kept. Zero or one to not use this criterion. |
fixedNumberOfTopTokens | The number of most-occurring tokens that will be classified as stopwords and ignored. Zero to not define any stopwords. |
TopicModel::Exception | if the model has already been trained. |
Referenced by crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo().
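The three criteria can be pictured with a small stand-alone filter over tokenized documents. This is a sketch of the documented semantics only: the function `tokensToKeep` is hypothetical, and the order in which the criteria are applied (frequency thresholds first, then removal of the most frequent tokens) as well as the alphabetical tie-breaking are assumptions made for the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <utility>
#include <vector>

using Document = std::vector<std::string>;

// Hypothetical helper: returns the tokens that survive all three criteria.
std::unordered_set<std::string> tokensToKeep(
        const std::vector<Document>& documents,
        std::size_t collectionFrequency,
        std::size_t documentFrequency,
        std::size_t fixedNumberOfTopTokens
) {
    std::unordered_map<std::string, std::size_t> cf; // occurrences in the corpus
    std::unordered_map<std::string, std::size_t> df; // documents containing the token

    for(const auto& document : documents) {
        std::unordered_set<std::string> seen;

        for(const auto& token : document) {
            ++cf[token];

            if(seen.insert(token).second) {
                ++df[token];
            }
        }
    }

    // Thresholds of zero or one keep every token, i.e. disable the criterion.
    std::vector<std::pair<std::string, std::size_t>> kept;

    for(const auto& entry : cf) {
        if(entry.second >= collectionFrequency && df.at(entry.first) >= documentFrequency) {
            kept.emplace_back(entry.first, entry.second);
        }
    }

    // Most frequent first; ties broken alphabetically to keep the sketch deterministic.
    std::sort(kept.begin(), kept.end(), [](const auto& a, const auto& b) {
        return a.second != b.second ? a.second > b.second : a.first < b.first;
    });

    // Skip the most frequent tokens, which are treated as stopwords.
    const std::size_t removed = std::min(fixedNumberOfTopTokens, kept.size());
    std::unordered_set<std::string> result;

    for(std::size_t i = removed; i < kept.size(); ++i) {
        result.insert(kept[i].first);
    }

    return result;
}
```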
|
inline |
Sets whether to use IDF term weighting.
idf | If true, IDF term weighting is used. If false, every term is weighted the same. |
TopicModel::Exception | if the model has already been initialized. |
Referenced by load(), and crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo().
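As a reminder of what IDF weighting does, the textbook inverse document frequency can be written as a one-line function. This is the standard definition, given here as an assumption; the exact formula applied by the underlying library may differ, and the function name is hypothetical.

```cpp
#include <cmath>
#include <cstddef>

// Textbook IDF: log(N / df).
// Assumes 0 < documentFrequency <= numberOfDocuments.
inline double inverseDocumentFrequency(
        std::size_t numberOfDocuments,
        std::size_t documentFrequency
) {
    return std::log(
            static_cast<double>(numberOfDocuments)
            / static_cast<double>(documentFrequency)
    );
}
```

Under this definition, a token that occurs in every document gets a weight of zero, while a token that occurs in only a few documents gets a high weight.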
|
inline |
Starts training without performing any iteration.
Can be used to retrieve general information about the training data afterwards.
TopicModel::Exception | if no documents have been added, the topic modeller has been cleared, or an invalid token ID is encountered while removing stopwords. |
References crawlservpp::Helper::Versions::getTomotoVersion().
Referenced by clear(), and crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo().
|
inline |
Trains the underlying HDP or LDA model.
Training can be performed multiple times, but after training has been started no additional documents can be added to the model.
iterations | The number of iterations for modelling the topics. |
threads | Number of threads. One for single threading. Zero for guessing the number of concurrent threads supported by the hardware. |
TopicModel::Exception | if no documents have been added or the topic modeller has been cleared. |
Referenced by crawlservpp::Module::Analyzer::Algo::TopicModelling::resetAlgo().
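Since training can be performed multiple times, one possible pattern is to train in steps and record the log-likelihood per token after each step. The sketch below relies only on the documented signatures of `train()` and `getLogLikelihoodPerToken()`; `trainInSteps` and `MockModel` are hypothetical, and the mock exists solely so that the example compiles without the tomoto library. Its return values are arbitrary placeholders, not real likelihoods.

```cpp
#include <cstddef>
#include <vector>

// Works with any type that offers train(iterations, threads) and
// getLogLikelihoodPerToken(), as documented for the topic modeller.
template<typename Model>
std::vector<double> trainInSteps(
        Model& model,
        std::size_t steps,
        std::size_t iterationsPerStep,
        std::size_t threads
) {
    std::vector<double> history;

    history.reserve(steps);

    for(std::size_t step = 0; step < steps; ++step) {
        model.train(iterationsPerStep, threads);

        history.push_back(model.getLogLikelihoodPerToken());
    }

    return history;
}

// Hypothetical stand-in for the real class, for illustration only.
struct MockModel {
    std::size_t iterations{0};

    void train(std::size_t n, std::size_t /*threads*/) {
        iterations += n;
    }

    // Arbitrary placeholder value that rises with the number of iterations.
    double getLogLikelihoodPerToken() const {
        return -10.0 + static_cast<double>(iterations) * 0.01;
    }
};
```

With the real class, the recorded values could be used to decide when additional training steps no longer improve the model.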