crawlserv++  [under development]
Application for crawling and analyzing textual content of websites.
crawlservpp::Struct::CorpusProperties Struct Reference

Corpus properties containing the type, table, and column name of its source.

#include <CorpusProperties.hpp>

Properties

std::uint16_t sourceType {}
 The type of the source from which the corpus is created (see below).

std::string sourceTable
 The name of the table from which the corpus is created.

std::string sourceColumn
 The name of the table column from which the corpus is created.

std::vector< std::uint16_t > manipulators
 The IDs of manipulators for preprocessing the corpus.

std::vector< std::string > models
 The models used by the manipulators with the same array index.

std::vector< std::string > dictionaries
 The dictionaries used by the manipulators with the same array index.

std::vector< std::string > languages
 The languages used by the manipulators with the same array index.

std::vector< std::uint16_t > savePoints {{}}
 List of save points.

std::uint64_t freeMemoryEvery {}
 Number of processed bytes in a continuous corpus after which memory will be freed.

bool tokenize {false}
 Tokenization.
 

Construction

 CorpusProperties ()=default
 Default constructor.

 CorpusProperties (std::uint16_t setSourceType, const std::string &setSourceTable, const std::string &setSourceColumn, const std::vector< std::uint16_t > &setManipulators, const std::vector< std::string > &setModels, const std::vector< std::string > &setDictionaries, const std::vector< std::string > &setLanguages, const std::vector< std::uint16_t > &setSavePoints, std::uint64_t setFreeMemoryEvery)
 Constructor setting properties for a tokenized corpus.

 CorpusProperties (std::uint16_t setSourceType, const std::string &setSourceTable, const std::string &setSourceColumn, std::uint64_t setFreeMemoryEvery)
 Constructor setting properties for a continuous corpus.
 

Detailed Description

Corpus properties containing the type, table, and column name of its source.

Constructor & Destructor Documentation

◆ CorpusProperties() [1/3]

crawlservpp::Struct::CorpusProperties::CorpusProperties ( )
default

Default constructor.

◆ CorpusProperties() [2/3]

crawlservpp::Struct::CorpusProperties::CorpusProperties ( std::uint16_t  setSourceType,
const std::string &  setSourceTable,
const std::string &  setSourceColumn,
const std::vector< std::uint16_t > &  setManipulators,
const std::vector< std::string > &  setModels,
const std::vector< std::string > &  setDictionaries,
const std::vector< std::string > &  setLanguages,
const std::vector< std::uint16_t > &  setSavePoints,
std::uint64_t  setFreeMemoryEvery 
)
inline

Constructor setting properties for a tokenized corpus.

Parameters
setSourceType: The type of the source from which the corpus is created (see below).
setSourceTable: Constant reference to a string containing the name of the table from which the corpus is created.
setSourceColumn: Constant reference to a string containing the name of the table column from which the corpus is created.
setManipulators: Constant reference to a vector containing the manipulators to be applied when preprocessing the corpus.
setModels: Constant reference to a vector of strings containing a model for each manipulator, or an empty string if the manipulator requires no model.
setDictionaries: Constant reference to a vector of strings containing a dictionary for each manipulator, or an empty string if the manipulator requires no dictionary.
setLanguages: Constant reference to a vector of strings containing a language for each manipulator, or an empty string if the manipulator requires no language or should use its default language.
setSavePoints: Constant reference to a vector containing the save points to be generated. A value of zero indicates that the unmanipulated corpus will be saved. Starting from one, the number corresponds to the manipulator used on the corpus.
setFreeMemoryEvery: Number of processed bytes in a continuous corpus after which memory will be freed. If zero, memory will only be freed after processing is complete.
See also
Module::Analyzer::generalInputSourcesParsing, Module::Analyzer::generalInputSourcesExtracting, Module::Analyzer::generalInputSourcesAnalyzing, Module::Analyzer::generalInputSourcesCrawling, Data::Corpus::corpusManipNone, Data::Corpus::corpusManipTagger, Data::Corpus::corpusManipTaggerPosterior, Data::Corpus::corpusManipEnglishStemmer, Data::Corpus::corpusManipGermanStemmer, Data::Corpus::corpusManipLemmatizer, Data::Corpus::corpusManipRemove, Data::Corpus::corpusManipCorrect

References freeMemoryEvery, savePoints, and tokenize.
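
A call to the nine-argument overload might look like the sketch below. To stay self-contained it uses a stand-in struct (TokenizedCorpusPropertiesSketch) that mirrors the documented members rather than including CorpusProperties.hpp. The constructor body, the assumption that this overload sets tokenize to true, and every concrete value (source type, manipulator IDs, table, column, and model names) are hypothetical placeholders, not values taken from crawlserv++.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Stand-in mirroring the documented members. This is NOT the real
// crawlserv++ header; the constructor body is an assumption.
struct TokenizedCorpusPropertiesSketch {
    std::uint16_t sourceType{};
    std::string sourceTable;
    std::string sourceColumn;
    std::vector<std::uint16_t> manipulators;
    std::vector<std::string> models;
    std::vector<std::string> dictionaries;
    std::vector<std::string> languages;
    std::vector<std::uint16_t> savePoints{{}};
    std::uint64_t freeMemoryEvery{};
    bool tokenize{false};

    // Same parameter order as the documented tokenized-corpus constructor.
    TokenizedCorpusPropertiesSketch(
        std::uint16_t setSourceType,
        const std::string& setSourceTable,
        const std::string& setSourceColumn,
        const std::vector<std::uint16_t>& setManipulators,
        const std::vector<std::string>& setModels,
        const std::vector<std::string>& setDictionaries,
        const std::vector<std::string>& setLanguages,
        const std::vector<std::uint16_t>& setSavePoints,
        std::uint64_t setFreeMemoryEvery
    ) : sourceType(setSourceType),
        sourceTable(setSourceTable),
        sourceColumn(setSourceColumn),
        manipulators(setManipulators),
        models(setModels),
        dictionaries(setDictionaries),
        languages(setLanguages),
        savePoints(setSavePoints),
        freeMemoryEvery(setFreeMemoryEvery),
        tokenize(true) {} // assumption: this overload enables tokenization
};

// Builds example properties; all concrete values are hypothetical placeholders.
inline TokenizedCorpusPropertiesSketch makeTokenizedExample() {
    const std::uint16_t sourceType{1};                    // placeholder source type
    const std::vector<std::uint16_t> manipulators{1, 2};  // placeholder manipulator IDs
    const std::vector<std::string> models{"example-model", ""};
    const std::vector<std::string> dictionaries{"", ""};
    const std::vector<std::string> languages{"", "en"};
    const std::vector<std::uint16_t> savePoints{0, 2};    // raw corpus and result of step 2

    return TokenizedCorpusPropertiesSketch(
        sourceType, "example_table", "example_column",
        manipulators, models, dictionaries, languages, savePoints,
        0 // zero: free memory only after processing is complete
    );
}
```

The per-manipulator vectors are given the same length here, with empty strings where a manipulator needs no model, dictionary, or language, as the parameter descriptions suggest.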

◆ CorpusProperties() [3/3]

crawlservpp::Struct::CorpusProperties::CorpusProperties ( std::uint16_t  setSourceType,
const std::string &  setSourceTable,
const std::string &  setSourceColumn,
std::uint64_t  setFreeMemoryEvery 
)
inline

Constructor setting properties for a continuous corpus.

Parameters
setSourceType: The type of the source from which the corpus is created (see below).
setSourceTable: Constant reference to a string containing the name of the table from which the corpus is created.
setSourceColumn: Constant reference to a string containing the name of the table column from which the corpus is created.
setFreeMemoryEvery: Number of processed bytes in a continuous corpus after which memory will be freed. If zero, memory will only be freed after processing is complete.
See also
Module::Analyzer::generalInputSourcesParsing, Module::Analyzer::generalInputSourcesExtracting, Module::Analyzer::generalInputSourcesAnalyzing, Module::Analyzer::generalInputSourcesCrawling

References freeMemoryEvery.
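
For comparison, the four-argument overload can be sketched in the same way. The stand-in struct below (ContinuousCorpusPropertiesSketch) copies only the members relevant to a continuous corpus. It assumes that this overload leaves tokenize and savePoints at their documented defaults; the source type, table and column names, and the one-GiB threshold are hypothetical.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Reduced stand-in for the documented struct; NOT the real crawlserv++ header.
struct ContinuousCorpusPropertiesSketch {
    std::uint16_t sourceType{};
    std::string sourceTable;
    std::string sourceColumn;
    std::vector<std::uint16_t> savePoints{{}}; // one entry with the value zero
    std::uint64_t freeMemoryEvery{};
    bool tokenize{false};

    ContinuousCorpusPropertiesSketch() = default;

    // Same parameter order as the documented continuous-corpus constructor.
    // Assumption: tokenize and savePoints keep their default values.
    ContinuousCorpusPropertiesSketch(
        std::uint16_t setSourceType,
        const std::string& setSourceTable,
        const std::string& setSourceColumn,
        std::uint64_t setFreeMemoryEvery
    ) : sourceType(setSourceType),
        sourceTable(setSourceTable),
        sourceColumn(setSourceColumn),
        freeMemoryEvery(setFreeMemoryEvery) {}
};

// Builds example properties; all concrete values are hypothetical placeholders.
inline ContinuousCorpusPropertiesSketch makeContinuousExample() {
    constexpr std::uint64_t oneGiB{1024ULL * 1024ULL * 1024ULL};

    return ContinuousCorpusPropertiesSketch(
        1, "example_table", "example_column", oneGiB
    );
}
```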

Member Data Documentation

◆ dictionaries

std::vector<std::string> crawlservpp::Struct::CorpusProperties::dictionaries

The dictionaries used by the manipulators with the same array index.

Referenced by crawlservpp::Module::Analyzer::Database::checkSources().

◆ freeMemoryEvery

std::uint64_t crawlservpp::Struct::CorpusProperties::freeMemoryEvery {}

Number of processed bytes in a continuous corpus after which memory will be freed.

If zero, memory will only be freed after processing is complete.

Referenced by crawlservpp::Module::Analyzer::Database::checkSources(), and CorpusProperties().
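
How the threshold is applied inside crawlserv++ is not described on this page. The sketch below shows one plausible reading: bytes are counted while chunks are processed, and memory is released each time the running count reaches freeMemoryEvery. The function name countIntermediateFrees and the chunk-based processing model are assumptions made for illustration only.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Counts how often memory would be released during processing, assuming
// that the byte counter is reset after each release. A threshold of zero
// means memory is only freed after processing is complete, so no
// intermediate release is counted.
inline std::size_t countIntermediateFrees(
    const std::vector<std::uint64_t>& chunkSizes,
    std::uint64_t freeMemoryEvery
) {
    if (freeMemoryEvery == 0) {
        return 0;
    }

    std::size_t frees{};
    std::uint64_t bytesSinceLastFree{};

    for (const auto size : chunkSizes) {
        bytesSinceLastFree += size;

        if (bytesSinceLastFree >= freeMemoryEvery) {
            ++frees;                 // memory would be released here
            bytesSinceLastFree = 0;  // start counting again
        }
    }

    return frees;
}
```

A smaller threshold trades more frequent releases for lower peak memory use; zero defers all freeing to the end.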

◆ languages

std::vector<std::string> crawlservpp::Struct::CorpusProperties::languages

The languages used by the manipulators with the same array index.

Referenced by crawlservpp::Module::Analyzer::Database::checkSources().

◆ manipulators

std::vector<std::uint16_t> crawlservpp::Struct::CorpusProperties::manipulators

The IDs of manipulators for preprocessing the corpus.

Referenced by crawlservpp::Module::Analyzer::Database::checkSources().

◆ models

std::vector<std::string> crawlservpp::Struct::CorpusProperties::models

The models used by the manipulators with the same array index.

Referenced by crawlservpp::Module::Analyzer::Database::checkSources().
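
Because models, dictionaries, and languages are matched to manipulators by array index, the four vectors behave like parallel arrays. The sketch below combines them into one record per manipulator after checking that their sizes agree. ManipulatorSettingsSketch and both helper functions are hypothetical; whether crawlserv++ performs such a check itself is not stated here.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical record combining the entries that share one array index.
struct ManipulatorSettingsSketch {
    std::uint16_t id;
    std::string model;
    std::string dictionary;
    std::string language;
};

// Returns true if every per-manipulator vector has one entry per manipulator.
inline bool perManipulatorVectorsAligned(
    const std::vector<std::uint16_t>& manipulators,
    const std::vector<std::string>& models,
    const std::vector<std::string>& dictionaries,
    const std::vector<std::string>& languages
) {
    return models.size() == manipulators.size()
        && dictionaries.size() == manipulators.size()
        && languages.size() == manipulators.size();
}

// Zips the parallel vectors; returns an empty result if they are misaligned.
inline std::vector<ManipulatorSettingsSketch> zipManipulatorSettings(
    const std::vector<std::uint16_t>& manipulators,
    const std::vector<std::string>& models,
    const std::vector<std::string>& dictionaries,
    const std::vector<std::string>& languages
) {
    std::vector<ManipulatorSettingsSketch> result;

    if (!perManipulatorVectorsAligned(manipulators, models, dictionaries, languages)) {
        return result;
    }

    result.reserve(manipulators.size());

    for (std::size_t i{}; i < manipulators.size(); ++i) {
        result.push_back({manipulators[i], models[i], dictionaries[i], languages[i]});
    }

    return result;
}
```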

◆ savePoints

std::vector<std::uint16_t> crawlservpp::Struct::CorpusProperties::savePoints {{}}

List of save points.

Manipulation steps after which the result will be stored in the database. If zero, the unmanipulated corpus will be stored. Starting with one, the save points correspond to the manipulators used on the corpus.

Only the unmanipulated corpus will be stored by default.

Referenced by crawlservpp::Module::Analyzer::Database::checkSources(), and CorpusProperties().
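
The numbering convention can be made concrete with two small helpers: zero names the unmanipulated corpus, and a value n from one upwards names the result after the n-th manipulator. Both functions are hypothetical illustrations. The range check in particular is an assumption about what a valid list looks like, not a documented crawlserv++ rule.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Describes one save point following the documented numbering convention.
inline std::string describeSavePoint(std::uint16_t savePoint) {
    if (savePoint == 0) {
        return "unmanipulated corpus";
    }

    return "corpus after manipulator #" + std::to_string(savePoint);
}

// Hypothetical check: each save point is either zero or refers to one of
// the manipulators, which are numbered from 1 to manipulatorCount.
inline bool savePointsInRange(
    const std::vector<std::uint16_t>& savePoints,
    std::size_t manipulatorCount
) {
    for (const auto savePoint : savePoints) {
        if (savePoint > manipulatorCount) {
            return false;
        }
    }

    return true;
}
```

With the default value of savePoints (a single zero), only the unmanipulated corpus is described, matching the default behavior stated above.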

◆ sourceColumn

std::string crawlservpp::Struct::CorpusProperties::sourceColumn

The name of the table column from which the corpus is created.

Referenced by crawlservpp::Module::Analyzer::Database::checkSources(), and crawlservpp::Module::Analyzer::Database::getCorpus().

◆ sourceTable

std::string crawlservpp::Struct::CorpusProperties::sourceTable

The name of the table from which the corpus is created.

Referenced by crawlservpp::Module::Analyzer::Database::checkSources(), and crawlservpp::Module::Analyzer::Database::getCorpus().

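◆ sourceType

std::uint16_t crawlservpp::Struct::CorpusProperties::sourceType {}

The type of the source from which the corpus is created.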

◆ tokenize

bool crawlservpp::Struct::CorpusProperties::tokenize {false}

Tokenization.

True if the corpus will be tokenized, false otherwise.

Referenced by CorpusProperties().


The documentation for this struct was generated from the following file: