|
crawlserv++
[under development]
Application for crawling and analyzing textual content of websites.
|
Corpus properties containing the type, table, and column name of its source.
#include <CorpusProperties.hpp>
Properties | |
| std::uint16_t | sourceType {} |
| The type of the source from which the corpus is created (see below). | |
| std::string | sourceTable |
| The name of the table from which the corpus is created. | |
| std::string | sourceColumn |
| The name of the table column from which the corpus is created. | |
| std::vector< std::uint16_t > | manipulators |
| The IDs of manipulators for preprocessing the corpus. | |
| std::vector< std::string > | models |
| The models used by the manipulators with the same array index. | |
| std::vector< std::string > | dictionaries |
| The dictionaries used by the manipulators with the same array index. | |
| std::vector< std::string > | languages |
| The languages used by the manipulators with the same array index. | |
| std::vector< std::uint16_t > | savePoints {{}} |
| List of savepoints. | |
| std::uint64_t | freeMemoryEvery {} |
| Number of processed bytes in a continuous corpus after which memory will be freed. | |
| bool | tokenize {false} |
| Tokenization. | |
Construction | |
| CorpusProperties ()=default | |
| Default constructor. | |
| CorpusProperties (std::uint16_t setSourceType, const std::string &setSourceTable, const std::string &setSourceColumn, const std::vector< std::uint16_t > &setManipulators, const std::vector< std::string > &setModels, const std::vector< std::string > &setDictionaries, const std::vector< std::string > &setLanguages, const std::vector< std::uint16_t > &setSavePoints, std::uint64_t setFreeMemoryEvery) | |
| Constructor setting properties for a tokenized corpus. | |
| CorpusProperties (std::uint16_t setSourceType, const std::string &setSourceTable, const std::string &setSourceColumn, std::uint64_t setFreeMemoryEvery) | |
| Constructor setting properties for a continuous corpus. | |
Corpus properties containing the type, table, and column name of its source.
| crawlservpp::Struct::CorpusProperties::CorpusProperties () | default |
Default constructor.
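| crawlservpp::Struct::CorpusProperties::CorpusProperties (std::uint16_t setSourceType, const std::string &setSourceTable, const std::string &setSourceColumn, const std::vector< std::uint16_t > &setManipulators, const std::vector< std::string > &setModels, const std::vector< std::string > &setDictionaries, const std::vector< std::string > &setLanguages, const std::vector< std::uint16_t > &setSavePoints, std::uint64_t setFreeMemoryEvery) | inline |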
Constructor setting properties for a tokenized corpus.
| setSourceType | The type of the source from which the corpus is created (see below). |
| setSourceTable | Constant reference to a string containing the name of the table from which the corpus is created. |
| setSourceColumn | Constant reference to a string containing the name of the table column from which the corpus is created. |
| setManipulators | Constant reference to a vector containing the manipulators to be applied when preprocessing the corpus. |
| setModels | Constant reference to a vector of strings, containing a model for each manipulator, or an empty string if no model is required by the manipulator. |
| setDictionaries | Constant reference to a vector of strings, containing a dictionary for each manipulator, or an empty string if no dictionary is required by the manipulator. |
| setLanguages | Constant reference to a vector of strings, containing a language for each manipulator, or an empty string if no language is required by the manipulator, or its default language should be used. |
| setSavePoints | Constant reference to a vector containing the save points to be generated. A value of zero indicates that the unmanipulated corpus will be saved. Starting from one, the number corresponds to the manipulator used on the corpus. |
| setFreeMemoryEvery | Number of processed bytes in a continuous corpus after which memory will be freed. If zero, memory will only be freed after processing is complete. |
References freeMemoryEvery, savePoints, and tokenize.
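As a rough illustration of how the parameters listed above line up, the following sketch uses a stand-in struct, `CorpusPropertiesSketch`, that mirrors the documented members and the parameter order of this constructor. It is not the real class from `CorpusProperties.hpp`. The source type, table name, column name, manipulator IDs, and model name are made-up values, and setting `tokenize` to true inside the constructor is an assumption based on the description "for a tokenized corpus".

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in mirroring the documented members; NOT the real class.
struct CorpusPropertiesSketch {
    std::uint16_t sourceType{};
    std::string sourceTable;
    std::string sourceColumn;
    std::vector<std::uint16_t> manipulators;
    std::vector<std::string> models;
    std::vector<std::string> dictionaries;
    std::vector<std::string> languages;
    std::vector<std::uint16_t> savePoints{{}};
    std::uint64_t freeMemoryEvery{};
    bool tokenize{false};

    // Same parameter order as the documented constructor for a tokenized corpus.
    CorpusPropertiesSketch(
            std::uint16_t setSourceType,
            const std::string& setSourceTable,
            const std::string& setSourceColumn,
            const std::vector<std::uint16_t>& setManipulators,
            const std::vector<std::string>& setModels,
            const std::vector<std::string>& setDictionaries,
            const std::vector<std::string>& setLanguages,
            const std::vector<std::uint16_t>& setSavePoints,
            std::uint64_t setFreeMemoryEvery
    ) : sourceType(setSourceType),
        sourceTable(setSourceTable),
        sourceColumn(setSourceColumn),
        manipulators(setManipulators),
        models(setModels),
        dictionaries(setDictionaries),
        languages(setLanguages),
        savePoints(setSavePoints),
        freeMemoryEvery(setFreeMemoryEvery),
        tokenize(true) /* assumption: this constructor marks the corpus as tokenized */ {}
};

// Builds example properties; every value below is made up for illustration.
inline CorpusPropertiesSketch makeTokenizedExample() {
    const std::vector<std::uint16_t> manipulators{10, 20};      // hypothetical manipulator IDs
    const std::vector<std::string> models{"", "example-model"}; // "" = no model required
    const std::vector<std::string> dictionaries{"", ""};        // "" = no dictionary required
    const std::vector<std::string> languages{"", "en"};         // "" = none required, or default
    const std::vector<std::uint16_t> savePoints{0, 2};          // unmanipulated + after 2nd manipulator

    // One entry per manipulator in each of the parallel vectors.
    assert(models.size() == manipulators.size());
    assert(dictionaries.size() == manipulators.size());
    assert(languages.size() == manipulators.size());

    return CorpusPropertiesSketch(
        1,                // hypothetical source type
        "example_table",  // hypothetical table name
        "example_column", // hypothetical column name
        manipulators, models, dictionaries, languages, savePoints,
        0                 // zero: free memory only after processing is complete
    );
}
```

Calling `makeTokenizedExample()` yields properties with two manipulators and two save points. With the real class, the same argument order would apply, but the valid source types and manipulator IDs have to be taken from the crawlserv++ sources.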
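| crawlservpp::Struct::CorpusProperties::CorpusProperties (std::uint16_t setSourceType, const std::string &setSourceTable, const std::string &setSourceColumn, std::uint64_t setFreeMemoryEvery) | inline |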
Constructor setting properties for a continuous corpus.
| setSourceType | The type of the source from which the corpus is created (see below). |
| setSourceTable | Constant reference to a string containing the name of the table from which the corpus is created. |
| setSourceColumn | Constant reference to a string containing the name of the table column from which the corpus is created. |
| setFreeMemoryEvery | Number of processed bytes in a continuous corpus after which memory will be freed. If zero, memory will only be freed after processing is complete. |
References freeMemoryEvery.
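For comparison, a continuous corpus only needs the source and the memory threshold. The sketch below again uses a hypothetical stand-in (`ContinuousPropertiesSketch`) reduced to the members that matter here. The source type, table name, and column name are invented, and the 100 MiB threshold is an arbitrary example value.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in with only the members relevant here; NOT the real class.
struct ContinuousPropertiesSketch {
    std::uint16_t sourceType{};
    std::string sourceTable;
    std::string sourceColumn;
    std::vector<std::uint16_t> savePoints{{}}; // one element with value zero
    std::uint64_t freeMemoryEvery{};
    bool tokenize{false};

    // Same parameter order as the documented constructor for a continuous corpus.
    ContinuousPropertiesSketch(
            std::uint16_t setSourceType,
            const std::string& setSourceTable,
            const std::string& setSourceColumn,
            std::uint64_t setFreeMemoryEvery
    ) : sourceType(setSourceType),
        sourceTable(setSourceTable),
        sourceColumn(setSourceColumn),
        freeMemoryEvery(setFreeMemoryEvery) {}
};

// Builds example properties; every value below is made up for illustration.
inline ContinuousPropertiesSketch makeContinuousExample() {
    constexpr std::uint64_t hundredMiB{100ULL * 1024 * 1024};

    const ContinuousPropertiesSketch properties(
        1,                // hypothetical source type
        "example_table",  // hypothetical table name
        "example_column", // hypothetical column name
        hundredMiB        // free memory after every 100 MiB processed
    );

    // The members not set by this constructor keep their default initializers.
    assert(!properties.tokenize);

    return properties;
}
```

In this sketch `tokenize` stays false and `savePoints` keeps its default, because the four-argument constructor does not touch them.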
| std::vector<std::string> crawlservpp::Struct::CorpusProperties::dictionaries |
The dictionaries used by the manipulators with the same array index.
Referenced by crawlservpp::Module::Analyzer::Database::checkSources().
| std::uint64_t crawlservpp::Struct::CorpusProperties::freeMemoryEvery {} |
Number of processed bytes in a continuous corpus after which memory will be freed.
If zero, memory will only be freed after processing is complete.
Referenced by crawlservpp::Module::Analyzer::Database::checkSources(), and CorpusProperties().
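The threshold rule described above can be sketched as a small predicate. `shouldFreeMemory` and its `bytesSinceLastFree` parameter are hypothetical names, not part of crawlserv++, and treating "exactly at the threshold" as a trigger is an assumption.

```cpp
#include <cstdint>

// Hypothetical helper: returns true once at least `freeMemoryEvery` bytes
// have been processed since memory was last freed.
// A threshold of zero disables intermediate freeing entirely, i.e. memory
// would only be freed after processing is complete.
constexpr bool shouldFreeMemory(
        std::uint64_t bytesSinceLastFree,
        std::uint64_t freeMemoryEvery
) noexcept {
    return freeMemoryEvery > 0 && bytesSinceLastFree >= freeMemoryEvery;
}
```

A processing loop could call such a predicate after each chunk and reset its byte counter whenever it returns true.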
| std::vector<std::string> crawlservpp::Struct::CorpusProperties::languages |
The languages used by the manipulators with the same array index.
Referenced by crawlservpp::Module::Analyzer::Database::checkSources().
| std::vector<std::uint16_t> crawlservpp::Struct::CorpusProperties::manipulators |
The IDs of manipulators for preprocessing the corpus.
Referenced by crawlservpp::Module::Analyzer::Database::checkSources().
| std::vector<std::string> crawlservpp::Struct::CorpusProperties::models |
The models used by the manipulators with the same array index.
Referenced by crawlservpp::Module::Analyzer::Database::checkSources().
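Because `manipulators`, `models`, `dictionaries`, and `languages` are matched by array index, it can help to picture them joined into one record per manipulator. The sketch below shows one way to do that. `ManipulatorSettings` and `zipManipulators` are hypothetical names, and the requirement that all four vectors have the same length is an assumption derived from the "same array index" wording.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical bundle of the settings that belong to one manipulator.
struct ManipulatorSettings {
    std::uint16_t id;
    std::string model;      // empty if the manipulator needs no model
    std::string dictionary; // empty if the manipulator needs no dictionary
    std::string language;   // empty if none is needed or the default applies
};

// Joins the four parallel vectors by index into one record per manipulator.
inline std::vector<ManipulatorSettings> zipManipulators(
        const std::vector<std::uint16_t>& manipulators,
        const std::vector<std::string>& models,
        const std::vector<std::string>& dictionaries,
        const std::vector<std::string>& languages
) {
    // Assumption: every vector holds exactly one entry per manipulator.
    assert(models.size() == manipulators.size());
    assert(dictionaries.size() == manipulators.size());
    assert(languages.size() == manipulators.size());

    std::vector<ManipulatorSettings> result;

    result.reserve(manipulators.size());

    for(std::size_t index{0}; index < manipulators.size(); ++index) {
        result.push_back(
            ManipulatorSettings{
                manipulators[index],
                models[index],
                dictionaries[index],
                languages[index]
            }
        );
    }

    return result;
}
```

With manipulators `{10, 20}` and models `{"", "example-model"}`, the second record would carry ID 20 together with `"example-model"`, while the first record has an empty model.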
| std::vector<std::uint16_t> crawlservpp::Struct::CorpusProperties::savePoints {{}} |
List of savepoints.
Manipulation steps after which the result will be stored in the database. A value of zero stores the unmanipulated corpus. Values starting from one refer to the manipulators applied to the corpus, in order: one stores the result of the first manipulator, two the result of the second, and so on.
By default, only the unmanipulated corpus will be stored.
Referenced by crawlservpp::Module::Analyzer::Database::checkSources(), and CorpusProperties().
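The numbering of save points can be sketched with a few helpers. All function names below are hypothetical, and the range check in `isValidSavePoint` is an assumption (the documentation does not say how out-of-range values are handled). The last helper shows why the default initializer `{{}}` results in a single save point of zero: the inner braces value-initialize one element.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: true if the save point refers to the unmanipulated corpus.
constexpr bool isUnmanipulated(std::uint16_t savePoint) noexcept {
    return savePoint == 0;
}

// Hypothetical helper: zero-based index into `manipulators` for a save point >= 1.
inline std::size_t manipulatorIndex(std::uint16_t savePoint) {
    assert(savePoint > 0); // zero has no manipulator associated with it

    return static_cast<std::size_t>(savePoint) - 1;
}

// Assumption: a save point is meaningful only if it is zero
// or refers to one of the configured manipulators.
inline bool isValidSavePoint(
        std::uint16_t savePoint,
        const std::vector<std::uint16_t>& manipulators
) {
    return static_cast<std::size_t>(savePoint) <= manipulators.size();
}

// Mirrors the documented default initializer `savePoints {{}}`:
// a vector containing one value-initialized element, i.e. a single zero.
inline std::vector<std::uint16_t> defaultSavePoints() {
    return std::vector<std::uint16_t>{{}};
}
```

For two configured manipulators, save points 0, 1, and 2 would pass the check in this sketch, and save point 2 would map to index 1 of `manipulators`.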
| std::string crawlservpp::Struct::CorpusProperties::sourceColumn |
The name of the table column from which the corpus is created.
Referenced by crawlservpp::Module::Analyzer::Database::checkSources(), and crawlservpp::Module::Analyzer::Database::getCorpus().
| std::string crawlservpp::Struct::CorpusProperties::sourceTable |
The name of the table from which the corpus is created.
Referenced by crawlservpp::Module::Analyzer::Database::checkSources(), and crawlservpp::Module::Analyzer::Database::getCorpus().
| std::uint16_t crawlservpp::Struct::CorpusProperties::sourceType {} |
The type of the source from which the corpus is created (see below).
Referenced by crawlservpp::Module::Analyzer::Database::checkSources(), and crawlservpp::Module::Analyzer::Database::getCorpus().
| bool crawlservpp::Struct::CorpusProperties::tokenize {false} |
Tokenization.
True if the corpus will be tokenized, false otherwise.
Referenced by CorpusProperties().