crawlserv++  [under development]
Application for crawling and analyzing textual content of websites.
crawlservpp::Data Namespace Reference

Namespace for different types of data. More...

Namespaces

 Compression
 Namespace for data compression.
 
 File
 Namespace for functions accessing files.
 
 ImportExport
 Namespace for the import and export of data.
 
 Stemmer
 Namespace for linguistic stemmers.
 

Classes

class  Corpus
 Class representing a text corpus. More...
 
struct  GetColumn
 Structure for retrieving the values in a table column. More...
 
struct  GetColumns
 Structure for retrieving multiple table columns of the same type. More...
 
struct  GetColumnsMixed
 Structure for retrieving multiple table columns of different types. More...
 
struct  GetFields
 Structure for retrieving multiple values of the same type from a table column. More...
 
struct  GetFieldsMixed
 Structure for getting multiple values of different types from a table column. More...
 
struct  GetValue
 Structure for retrieving one value from a table column. More...
 
struct  InsertFields
 Structure for inserting multiple values of the same type into a table. More...
 
struct  InsertFieldsMixed
 Structure for inserting multiple values of different types into a row. More...
 
struct  InsertValue
 Structure for inserting one value into a table. More...
 
class  Lemmatizer
 Lemmatizer. More...
 
class  PickleDict
 Simple Python pickle dictionary. More...
 
class  Sentiment
 Implementation of the VADER sentiment analysis algorithm. More...
 
struct  SentimentScores
 Structure for VADER sentiment scores. More...
 
class  Tagger
 Multilingual POS (part of speech) tagger using Wapiti by Thomas Lavergne. More...
 
class  TokenCorrect
 Corrects tokens using an aspell dictionary. More...
 
class  TokenRemover
 Token remover and trimmer. More...
 
class  TopicModel
 Topic modeller. More...
 
struct  UpdateFields
 Structure for updating multiple values of the same type in a table. More...
 
struct  UpdateFieldsMixed
 Structure for updating multiple values of different types in a table. More...
 
struct  UpdateValue
 Structure for updating one value in a table. More...
 
struct  Value
 A generic value. More...
 

Typedefs

using Bytes = std::vector< std::uint8_t >
 

Enumerations

enum  Type {
  _unknown, _bool, _int32, _uint32,
  _int64, _uint64, _double, _string
}
 Data types. More...
 

Constants

constexpr auto dateLength {10}
 The length of a date string in the format YYYY-MM-DD. More...
 
constexpr std::uint8_t utf8MaxBytes {4}
 Maximum number of bytes used by one UTF-8-encoded multibyte character. More...
 
constexpr auto mergeUpdateEvery {10000}
 After how many sentences the status is updated when merging corpora. More...
 
constexpr auto tokenizeUpdateEvery {10000}
 After how many sentences the status is updated when tokenizing a corpus. More...
 
constexpr auto filterUpdateEvery {10000}
 After how many articles the status is updated when filtering a corpus (by queries). More...
 
constexpr auto minSingleUtf8CharSize {2}
 Minimum length of single UTF-8 code points to remove. More...
 
constexpr auto maxSingleUtf8CharSize {4}
 Maximum length of single UTF-8 code points to remove. More...
 
constexpr auto bytes32bit {4}
 The number of bytes of a 32-bit value. More...
 
constexpr auto bytes64bit {8}
 The number of bytes of a 64-bit value. More...
 
constexpr auto dictDir {"dict"sv}
 Directory for dictionaries. More...
 
constexpr auto colLemma {1}
 Column containing the lemma in a dictionary file. More...
 
constexpr auto colTag {2}
 Column containing the tag in a dictionary file. More...
 
constexpr auto colCount {3}
 Column containing the number of occurrences in a dictionary file. More...
 
constexpr auto pickleOneByte {1}
 One byte. More...
 
constexpr auto pickleTwoBytes {2}
 Two bytes. More...
 
constexpr auto pickleFourBytes {4}
 Four bytes. More...
 
constexpr auto pickleEightBytes {8}
 Eight bytes. More...
 
constexpr auto pickleNineBytes {9}
 Nine bytes (eight bytes and an op-code). More...
 
constexpr auto pickleMinSize {11}
 The minimum size of a Python pickle to extract a frame. More...
 
constexpr auto pickleProtocolVersion {4}
 The protocol version of Python pickles used. More...
 
constexpr auto pickleProtoByte {0}
 The position of the protocol byte in a Python pickle. More...
 
constexpr auto pickleVersionByte {1}
 The position of the version byte in a Python pickle. More...
 
constexpr auto pickleHeadSize {2}
 The size of the Python pickle header, in bytes. More...
 
constexpr auto pickleMinFrameSize {9}
 The minimum size of a Python pickle frame. More...
 
constexpr std::uint8_t pickleMaxUOneByteNumber {255}
 Maximum number in unsigned one-byte number. More...
 
constexpr std::uint16_t pickleMaxUTwoByteNumber {65535}
 Maximum number in unsigned two-byte number. More...
 
constexpr std::uint32_t pickleMaxUFourByteNumber {4294967295}
 Maximum number in unsigned four-byte number. More...
 
constexpr auto pickleBase {10}
 The base used for converting strings to numbers. More...
 
constexpr auto VaderZero {0}
 Zero. More...
 
constexpr auto VaderOne {1}
 One. More...
 
constexpr auto VaderTwo {2}
 Two. More...
 
constexpr auto VaderThree {3}
 Three. More...
 
constexpr auto VaderFour {4}
 Four. More...
 
constexpr auto VaderFOne {1.F}
 Factor of One. More...
 
constexpr auto VaderDampOne {0.95F}
 Factor by which the scalar modifier of immediately preceding tokens is dampened. More...
 
constexpr auto VaderDampTwo {0.9F}
 Factor by which the scalar modifier of previously preceding tokens is dampened. More...
 
constexpr auto VaderButFactorBefore {0.5F}
 Factor by which the modifier is dampened before a "but". More...
 
constexpr auto VaderButFactorAfter {1.5F}
 Factor by which the modifier is heightened after a "but". More...
 
constexpr auto VaderNeverFactor {1.25F}
 Factor by which the modifier is heightened after a "never". More...
 
constexpr auto VaderB_INCR {0.293F}
 Empirically derived mean sentiment intensity rating increase for booster tokens. More...
 
constexpr auto VaderB_DECR {-0.293F}
 Empirically derived mean sentiment intensity rating decrease for negative booster tokens. More...
 
constexpr auto VaderC_INCR {0.733F}
 Empirically derived mean sentiment intensity rating increase for using ALLCAPs to emphasize a token. More...
 
constexpr auto VaderN_SCALAR {-0.74F}
 Negation factor. More...
 
constexpr auto hdpModelName {"HDPModel"sv}
 The name of the HDP model. More...
 
constexpr auto ldaModelName {"LDAModel"sv}
 The name of the LDA model. More...
 
constexpr auto defaultNumberOfInitialTopics {2}
 The initial number of topics by default. More...
 
constexpr auto defaultAlpha {0.1F}
 The default concentration coefficient of the Dirichlet Process for document-table. More...
 
constexpr auto defaultEta {0.01F}
 The default hyperparameter for the Dirichlet distribution for topic-token. More...
 
constexpr auto defaultGamma {0.1F}
 The default concentration coefficient of the Dirichlet Process for table-topic. More...
 
constexpr auto defaultOptimizationInterval {10}
 The default interval for optimizing the parameters, in iterations. More...
 
constexpr auto modelFileHead {"LDA\0\0"sv}
 The beginning of a valid model file containing an LDA (or HDP) model. More...
 
constexpr auto modelFileTermWeightingLen {5}
 The number of bytes determining the term weighting scheme in a model file. More...
 
constexpr auto modelFileTermWeightingOne {"one\0\0"sv}
 The term weighting scheme ONE as saved in a model file. More...
 
constexpr auto modelFileTermWeightingIdf {"idf\0\0"sv}
 The term weighting scheme IDF (tf-idf) as saved in a model file. More...
 
constexpr auto modelFileType {"TPTK"sv}
 The tomoto file format as saved in a model file (after model head and term weighting scheme). More...
 

Sentence and Token Manipulation

constexpr std::uint16_t corpusManipNone {0}
 Do not manipulate anything. More...
 
constexpr std::uint16_t corpusManipTagger {1}
 The POS (part of speech) tagger based on Wapiti by Thomas Lavergne. More...
 
constexpr std::uint16_t corpusManipTaggerPosterior {2}
 The posterior POS tagger based on Wapiti by Thomas Lavergne (slow, but more accurate). More...
 
constexpr std::uint16_t corpusManipEnglishStemmer {3}
 The porter2_stemmer algorithm for English only, implemented by Sean Massung. More...
 
constexpr std::uint16_t corpusManipGermanStemmer {4}
 Simple stemmer for German only, based on CISTEM by Leonie Weißweiler and Alexander Fraser. More...
 
constexpr std::uint16_t corpusManipLemmatizer {5}
 Multilingual lemmatizer. More...
 
constexpr std::uint16_t corpusManipRemove {6}
 Remove single tokens found in a dictionary. More...
 
constexpr std::uint16_t corpusManipTrim {7}
 Trim tokens by tokens found in a dictionary. More...
 
constexpr std::uint16_t corpusManipCorrect {8}
 Correct single tokens using an aspell dictionary. More...
 

Helper Function

Type parseSQLType (std::string sqlType)
 Parses the given SQL data type. More...
 

Template Functions

template<int >
Type getTypeOfSizeT ()
 Resolves std::size_t into the appropriate data type. More...
 
template<>
Type getTypeOfSizeT< bytes32bit > ()
 Identifies std::size_t as a 32-bit integer. More...
 
template<>
Type getTypeOfSizeT< bytes64bit > ()
 Identifies std::size_t as a 64-bit integer. More...
 

Detailed Description

Namespace for different types of data.

Typedef Documentation

◆ Bytes

using crawlservpp::Data::Bytes = std::vector<std::uint8_t>

Enumeration Type Documentation

◆ Type

Data types.

Enumerator
_unknown 

Unknown data type.

_bool 

Boolean value.

_int32 

32-bit integer.

_uint32 

Unsigned 32-bit integer.

_int64 

64-bit integer.

_uint64 

Unsigned 64-bit integer.

_double 

Floating point value (with double precision).

_string 

String.

Function Documentation

◆ getTypeOfSizeT()

template<int >
Type crawlservpp::Data::getTypeOfSizeT ( )
inline

Resolves std::size_t into the appropriate data type.

Referenced by parseSQLType().

◆ getTypeOfSizeT< bytes32bit >()

Identifies std::size_t as a 32-bit integer.

References _uint32.

◆ getTypeOfSizeT< bytes64bit >()

Identifies std::size_t as a 64-bit integer.

References _uint64.
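
The following is a minimal sketch of how such a size-based lookup can be written with explicit template specializations. All names prefixed with sketch are hypothetical stand-ins for the documented ones, and the behaviour of the primary template (returning an unknown type) is an assumption, since it is not described here.

```cpp
#include <cstddef>

// Hypothetical stand-in for the subset of Type that is relevant here.
enum class SketchType { Unknown, UInt32, UInt64 };

// Stand-ins for bytes32bit and bytes64bit.
constexpr int sketchBytes32bit{4};
constexpr int sketchBytes64bit{8};

// Primary template: any other size is reported as unknown (assumption).
template<int>
SketchType sketchTypeOfSizeT() {
    return SketchType::Unknown;
}

// A 4-byte std::size_t maps to an unsigned 32-bit integer.
template<>
inline SketchType sketchTypeOfSizeT<sketchBytes32bit>() {
    return SketchType::UInt32;
}

// An 8-byte std::size_t maps to an unsigned 64-bit integer.
template<>
inline SketchType sketchTypeOfSizeT<sketchBytes64bit>() {
    return SketchType::UInt64;
}

// Selects the specialization matching the platform at compile time.
inline SketchType sketchResolveSizeT() {
    return sketchTypeOfSizeT<static_cast<int>(sizeof(std::size_t))>();
}
```

Because the template argument is a compile-time constant, the choice between the two specializations costs nothing at run time.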

◆ parseSQLType()

Type crawlservpp::Data::parseSQLType ( std::string  sqlType)
inline

Parses the given SQL data type.

Parameters
sqlType  Constant reference to a string containing the SQL data type to parse.
Returns
The parsed data type.
See also
Type

References _bool, _double, _int32, _int64, _string, _uint32, _uint64, _unknown, and getTypeOfSizeT().

Referenced by crawlservpp::Module::Analyzer::Thread::uploadResult().
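
To illustrate the general idea of mapping an SQL column type to one of the enumerators above, here is a rough sketch. It is not the actual implementation: the function name, the enumeration and the recognized type names (bool, int, bigint, double, text, varchar, and the unsigned suffix) are all assumptions made for this example.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical mirror of the Type enumeration.
enum class SketchSqlType { Unknown, Bool, Int32, UInt32, Int64, UInt64, Double, String };

// Maps a few common SQL type names to SketchSqlType (assumed rules).
inline SketchSqlType sketchParseSqlType(std::string sqlType) {
    // Compare case-insensitively by lower-casing the local copy.
    std::transform(sqlType.begin(), sqlType.end(), sqlType.begin(),
            [](unsigned char c) { return static_cast<char>(std::tolower(c)); });

    const bool isUnsigned{sqlType.find("unsigned") != std::string::npos};

    // rfind(prefix, 0) can only match at position zero, i.e. a prefix test.
    const auto startsWith = [&sqlType](const char* prefix) {
        return sqlType.rfind(prefix, 0) == 0;
    };

    if(startsWith("bool")) {
        return SketchSqlType::Bool;
    }
    if(startsWith("bigint")) {
        return isUnsigned ? SketchSqlType::UInt64 : SketchSqlType::Int64;
    }
    if(startsWith("int")) {
        return isUnsigned ? SketchSqlType::UInt32 : SketchSqlType::Int32;
    }
    if(startsWith("double")) {
        return SketchSqlType::Double;
    }
    if(startsWith("text") || startsWith("varchar")) {
        return SketchSqlType::String;
    }

    return SketchSqlType::Unknown;
}
```

For example, sketchParseSqlType("BIGINT UNSIGNED") yields the unsigned 64-bit enumerator in this sketch, while an unrecognized name falls through to the unknown type.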

Variable Documentation

◆ bytes32bit

constexpr auto crawlservpp::Data::bytes32bit {4}
inline

The number of bytes of a 32-bit value.

◆ bytes64bit

constexpr auto crawlservpp::Data::bytes64bit {8}
inline

The number of bytes of a 64-bit value.

◆ colCount

constexpr auto crawlservpp::Data::colCount {3}
inline

Column containing the number of occurrences in a dictionary file.

Column numbers start at zero. Columns are separated by tabs.

Referenced by crawlservpp::Data::Lemmatizer::clear().

◆ colLemma

constexpr auto crawlservpp::Data::colLemma {1}
inline

Column containing the lemma in a dictionary file.

Column numbers start at zero. Columns are separated by tabs.

Referenced by crawlservpp::Data::Lemmatizer::clear().

◆ colTag

constexpr auto crawlservpp::Data::colTag {2}
inline

Column containing the tag in a dictionary file.

Column numbers start at zero. Columns are separated by tabs.

Referenced by crawlservpp::Data::Lemmatizer::clear().
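
As a sketch of how colLemma, colTag and colCount index into one line of a dictionary file, the example below splits a tab-separated line into zero-based columns. The helper name and the sample line are hypothetical, and the assumption that column zero holds the inflected word form is not stated in this reference.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Stand-ins for the documented column constants.
constexpr std::size_t sketchColLemma{1};
constexpr std::size_t sketchColTag{2};
constexpr std::size_t sketchColCount{3};

// Splits one dictionary line at tabs; the views point into `line`.
inline std::vector<std::string_view> sketchSplitColumns(std::string_view line) {
    std::vector<std::string_view> columns;
    std::size_t begin{0};

    while(true) {
        const auto end{line.find('\t', begin)};

        if(end == std::string_view::npos) {
            // Last column: everything up to the end of the line.
            columns.push_back(line.substr(begin));

            break;
        }

        columns.push_back(line.substr(begin, end - begin));

        begin = end + 1;
    }

    return columns;
}
```

For a hypothetical line "houses\thouse\tNN\t42", the entry at sketchColLemma would be "house", the entry at sketchColTag "NN" and the entry at sketchColCount "42". The returned views stay valid only as long as the original line does.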

◆ corpusManipCorrect

constexpr std::uint16_t crawlservpp::Data::corpusManipCorrect {8}
inline

Correct single tokens using an aspell dictionary.

Referenced by crawlservpp::Data::Corpus::tokenize().

◆ corpusManipEnglishStemmer

constexpr std::uint16_t crawlservpp::Data::corpusManipEnglishStemmer {3}
inline

The porter2_stemmer algorithm for English only, implemented by Sean Massung.

Referenced by crawlservpp::Data::Corpus::tokenize().

◆ corpusManipGermanStemmer

constexpr std::uint16_t crawlservpp::Data::corpusManipGermanStemmer {4}
inline

Simple stemmer for German only, based on CISTEM by Leonie Weißweiler and Alexander Fraser.

Referenced by crawlservpp::Data::Corpus::tokenize().

◆ corpusManipLemmatizer

constexpr std::uint16_t crawlservpp::Data::corpusManipLemmatizer {5}
inline

Multilingual lemmatizer.

Referenced by crawlservpp::Data::Corpus::tokenize().

◆ corpusManipNone

constexpr std::uint16_t crawlservpp::Data::corpusManipNone {0}
inline

Do not manipulate anything.

Referenced by crawlservpp::Data::Corpus::tokenize().

◆ corpusManipRemove

constexpr std::uint16_t crawlservpp::Data::corpusManipRemove {6}
inline

Remove single tokens found in a dictionary.

Referenced by crawlservpp::Data::Corpus::tokenize().

◆ corpusManipTagger

constexpr std::uint16_t crawlservpp::Data::corpusManipTagger {1}
inline

The POS (part of speech) tagger based on Wapiti by Thomas Lavergne.

Referenced by crawlservpp::Data::Corpus::tokenize().

◆ corpusManipTaggerPosterior

constexpr std::uint16_t crawlservpp::Data::corpusManipTaggerPosterior {2}
inline

The posterior POS tagger based on Wapiti by Thomas Lavergne (slow, but more accurate).

Referenced by crawlservpp::Data::Corpus::tokenize().

◆ corpusManipTrim

constexpr std::uint16_t crawlservpp::Data::corpusManipTrim {7}
inline

Trim tokens by tokens found in a dictionary.

Referenced by crawlservpp::Data::Corpus::tokenize().

◆ dateLength

constexpr auto crawlservpp::Data::dateLength {10}
inline

The length of a date string in the format YYYY-MM-DD.

Referenced by crawlservpp::Data::Corpus::clear(), crawlservpp::Data::Corpus::getDate(), and crawlservpp::Data::Corpus::getDateTokenized().
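
The YYYY-MM-DD format always occupies exactly ten characters, which is where the value comes from. The check below is only an illustration of that format, not how Corpus uses the constant; the function name is hypothetical.

```cpp
#include <cctype>
#include <cstddef>
#include <string_view>

// Stand-in for dateLength.
constexpr std::size_t sketchDateLength{10};

// Returns true if `date` has the shape YYYY-MM-DD (digits and two dashes).
// The calendar validity of the date (e.g. month 13) is not checked.
inline bool sketchLooksLikeDate(std::string_view date) {
    if(date.size() != sketchDateLength) {
        return false;
    }

    for(std::size_t i{0}; i < date.size(); ++i) {
        const bool isSeparator{i == 4 || i == 7};

        if(isSeparator) {
            if(date[i] != '-') {
                return false;
            }
        }
        else if(std::isdigit(static_cast<unsigned char>(date[i])) == 0) {
            return false;
        }
    }

    return true;
}
```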

◆ defaultAlpha

constexpr auto crawlservpp::Data::defaultAlpha {0.1F}
inline

The default concentration coefficient of the Dirichlet Process for document-table.

Referenced by crawlservpp::Data::TopicModel::clear().

◆ defaultEta

constexpr auto crawlservpp::Data::defaultEta {0.01F}
inline

The default hyperparameter for the Dirichlet distribution for topic-token.

Referenced by crawlservpp::Data::TopicModel::clear().

◆ defaultGamma

constexpr auto crawlservpp::Data::defaultGamma {0.1F}
inline

The default concentration coefficient of the Dirichlet Process for table-topic.

Not used by LDA models, i.e. when a fixed number of topics is set.

Referenced by crawlservpp::Data::TopicModel::clear().

◆ defaultNumberOfInitialTopics

constexpr auto crawlservpp::Data::defaultNumberOfInitialTopics {2}
inline

The initial number of topics by default.

Referenced by crawlservpp::Data::TopicModel::clear(), and crawlservpp::Data::TopicModel::load().

◆ defaultOptimizationInterval

constexpr auto crawlservpp::Data::defaultOptimizationInterval {10}
inline

The default interval for optimizing the parameters, in iterations.

Referenced by crawlservpp::Data::TopicModel::clear().

◆ dictDir

constexpr auto crawlservpp::Data::dictDir {"dict"sv}
inline

◆ filterUpdateEvery

constexpr auto crawlservpp::Data::filterUpdateEvery {10000}
inline

After how many articles the status is updated when filtering a corpus (by queries).

Referenced by crawlservpp::Data::Corpus::filterArticles().

◆ hdpModelName

constexpr auto crawlservpp::Data::hdpModelName {"HDPModel"sv}
inline

The name of the HDP model.

Referenced by crawlservpp::Data::TopicModel::getModelName().

◆ ldaModelName

constexpr auto crawlservpp::Data::ldaModelName {"LDAModel"sv}
inline

The name of the LDA model.

Referenced by crawlservpp::Data::TopicModel::getModelName().

◆ maxSingleUtf8CharSize

constexpr auto crawlservpp::Data::maxSingleUtf8CharSize {4}
inline

Maximum length of single UTF-8 code points to remove.

◆ mergeUpdateEvery

constexpr auto crawlservpp::Data::mergeUpdateEvery {10000}
inline

After how many sentences the status is updated when merging corpora.

Referenced by crawlservpp::Data::Corpus::clear().

◆ minSingleUtf8CharSize

constexpr auto crawlservpp::Data::minSingleUtf8CharSize {2}
inline

Minimum length of single UTF-8 code points to remove.

◆ modelFileHead

constexpr auto crawlservpp::Data::modelFileHead {"LDA\0\0"sv}
inline

The beginning of a valid model file containing an LDA (or HDP) model.

Referenced by crawlservpp::Data::TopicModel::clear().

◆ modelFileTermWeightingIdf

constexpr auto crawlservpp::Data::modelFileTermWeightingIdf {"idf\0\0"sv}
inline

The term weighting scheme IDF (tf-idf) as saved in a model file.

Referenced by crawlservpp::Data::TopicModel::clear().

◆ modelFileTermWeightingLen

constexpr auto crawlservpp::Data::modelFileTermWeightingLen {5}
inline

The number of bytes determining the term weighting scheme in a model file.

Referenced by crawlservpp::Data::TopicModel::clear().

◆ modelFileTermWeightingOne

constexpr auto crawlservpp::Data::modelFileTermWeightingOne {"one\0\0"sv}
inline

The term weighting scheme ONE as saved in a model file.

Referenced by crawlservpp::Data::TopicModel::clear().

◆ modelFileType

constexpr auto crawlservpp::Data::modelFileType {"TPTK"sv}
inline

The tomoto file format as saved in a model file (after model head and term weighting scheme).

Referenced by crawlservpp::Data::TopicModel::clear().
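
Taken together, the model file constants describe a header of the form: model head, then a five-byte term weighting scheme, then the file type. The sketch below checks a buffer against that layout. The function name is hypothetical, and rejecting any weighting scheme other than the two listed in this reference is an assumption of the example.

```cpp
#include <cstddef>
#include <string_view>

using namespace std::literals::string_view_literals;

// Stand-ins for the documented constants; the sv suffix keeps the
// embedded null bytes, so each of the first three views is five bytes long.
constexpr auto sketchModelFileHead{"LDA\0\0"sv};
constexpr std::size_t sketchTermWeightingLen{5};
constexpr auto sketchTermWeightingOne{"one\0\0"sv};
constexpr auto sketchTermWeightingIdf{"idf\0\0"sv};
constexpr auto sketchModelFileType{"TPTK"sv};

// Returns true if `data` starts with head, term weighting and file type.
inline bool sketchHasValidHeader(std::string_view data) {
    // 1. model head
    if(data.substr(0, sketchModelFileHead.size()) != sketchModelFileHead) {
        return false;
    }

    data.remove_prefix(sketchModelFileHead.size());

    // 2. term weighting scheme (only ONE and IDF are accepted here)
    const auto weighting{data.substr(0, sketchTermWeightingLen)};

    if(weighting != sketchTermWeightingOne && weighting != sketchTermWeightingIdf) {
        return false;
    }

    data.remove_prefix(sketchTermWeightingLen);

    // 3. file type
    return data.substr(0, sketchModelFileType.size()) == sketchModelFileType;
}
```

Using string views with the sv literal matters here: a plain C string comparison would stop at the first null byte and ignore the rest of the marker.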

◆ pickleBase

constexpr auto crawlservpp::Data::pickleBase {10}
inline

The base used for converting strings to numbers.

Referenced by crawlservpp::Data::PickleDict::writeTo().

◆ pickleEightBytes

constexpr auto crawlservpp::Data::pickleEightBytes {8}
inline

Eight bytes.

Referenced by crawlservpp::Data::PickleDict::writeTo().

◆ pickleFourBytes

constexpr auto crawlservpp::Data::pickleFourBytes {4}
inline

Four bytes.

Referenced by crawlservpp::Data::PickleDict::writeTo().

◆ pickleHeadSize

constexpr auto crawlservpp::Data::pickleHeadSize {2}
inline

The size of the Python pickle header, in bytes.

Referenced by crawlservpp::Data::PickleDict::writeTo().

◆ pickleMaxUFourByteNumber

constexpr std::uint32_t crawlservpp::Data::pickleMaxUFourByteNumber {4294967295}
inline

Maximum number in unsigned four-byte number.

Referenced by crawlservpp::Data::PickleDict::writeTo().

◆ pickleMaxUOneByteNumber

constexpr std::uint8_t crawlservpp::Data::pickleMaxUOneByteNumber {255}
inline

Maximum number in unsigned one-byte number.

Referenced by crawlservpp::Data::PickleDict::writeTo().

◆ pickleMaxUTwoByteNumber

constexpr std::uint16_t crawlservpp::Data::pickleMaxUTwoByteNumber {65535}
inline

Maximum number in unsigned two-byte number.

Referenced by crawlservpp::Data::PickleDict::writeTo().

◆ pickleMinFrameSize

constexpr auto crawlservpp::Data::pickleMinFrameSize {9}
inline

The minimum size of a Python pickle frame.

Referenced by crawlservpp::Data::PickleDict::writeTo().

◆ pickleMinSize

constexpr auto crawlservpp::Data::pickleMinSize {11}
inline

The minimum size of a Python pickle to extract a frame.

Referenced by crawlservpp::Data::PickleDict::writeTo().

◆ pickleNineBytes

constexpr auto crawlservpp::Data::pickleNineBytes {9}
inline

Nine bytes (eight bytes and an op-code).

Referenced by crawlservpp::Data::PickleDict::writeTo().

◆ pickleOneByte

constexpr auto crawlservpp::Data::pickleOneByte {1}
inline

One byte.

Referenced by crawlservpp::Data::PickleDict::writeTo().

◆ pickleProtoByte

constexpr auto crawlservpp::Data::pickleProtoByte {0}
inline

The position of the protocol byte in a Python pickle.

Referenced by crawlservpp::Data::PickleDict::writeTo().

◆ pickleProtocolVersion

constexpr auto crawlservpp::Data::pickleProtocolVersion {4}
inline

The protocol version of Python pickles used.

Referenced by crawlservpp::Data::PickleDict::writeTo().

◆ pickleTwoBytes

constexpr auto crawlservpp::Data::pickleTwoBytes {2}
inline

Two bytes.

Referenced by crawlservpp::Data::PickleDict::writeTo().

◆ pickleVersionByte

constexpr auto crawlservpp::Data::pickleVersionByte {1}
inline

The position of the version byte in a Python pickle.

Referenced by crawlservpp::Data::PickleDict::writeTo().
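
The pickle size and position constants fit the layout of a protocol 4 pickle: a two-byte header (the PROTO op-code 0x80 followed by the version), then a nine-byte frame header (the FRAME op-code 0x95 followed by an eight-byte little-endian length), which adds up to the minimum size of eleven bytes. The sketch below reads the frame length under that layout. The function and constant names are hypothetical; the two op-code values come from the pickle format itself, not from this reference.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

using SketchBytes = std::vector<std::uint8_t>;

// Stand-ins for the documented constants.
constexpr std::size_t sketchPickleMinSize{11};
constexpr std::size_t sketchPickleProtoByte{0};
constexpr std::size_t sketchPickleVersionByte{1};
constexpr std::size_t sketchPickleHeadSize{2};
constexpr std::size_t sketchPickleEightBytes{8};
constexpr std::uint8_t sketchPickleProtocolVersion{4};

// Op-codes of the pickle format (not part of this reference).
constexpr std::uint8_t sketchOpProto{0x80};
constexpr std::uint8_t sketchOpFrame{0x95};

// Returns the frame length, or nothing if the header is not as expected.
inline std::optional<std::uint64_t> sketchReadFrameLength(const SketchBytes& data) {
    if(data.size() < sketchPickleMinSize) {
        return std::nullopt;
    }

    if(data[sketchPickleProtoByte] != sketchOpProto
            || data[sketchPickleVersionByte] != sketchPickleProtocolVersion
            || data[sketchPickleHeadSize] != sketchOpFrame) {
        return std::nullopt;
    }

    // The eight length bytes follow the FRAME op-code, least significant first.
    std::uint64_t length{0};

    for(std::size_t i{0}; i < sketchPickleEightBytes; ++i) {
        length |= static_cast<std::uint64_t>(data[sketchPickleHeadSize + 1 + i]) << (8 * i);
    }

    return length;
}
```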

◆ tokenizeUpdateEvery

constexpr auto crawlservpp::Data::tokenizeUpdateEvery {10000}
inline

After how many sentences the status is updated when tokenizing a corpus.

Referenced by crawlservpp::Data::Corpus::clear().

◆ utf8MaxBytes

constexpr std::uint8_t crawlservpp::Data::utf8MaxBytes {4}
inline

Maximum number of bytes used by one UTF-8-encoded multibyte character.

Referenced by crawlservpp::Data::Corpus::clear().
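
The limit of four bytes follows from the UTF-8 encoding itself, in which the leading byte announces the length of the sequence. A small sketch of that rule, with hypothetical names:

```cpp
#include <cstdint>

// Stand-in for utf8MaxBytes.
constexpr std::uint8_t sketchUtf8MaxBytes{4};

// Returns the length of the UTF-8 sequence started by `lead`,
// or 0 if `lead` is a continuation byte or not a valid leading byte.
constexpr std::uint8_t sketchUtf8SequenceLength(std::uint8_t lead) {
    if((lead & 0x80U) == 0x00U) {
        return 1; // 0xxxxxxx: single-byte (ASCII) character
    }

    if((lead & 0xE0U) == 0xC0U) {
        return 2; // 110xxxxx
    }

    if((lead & 0xF0U) == 0xE0U) {
        return 3; // 1110xxxx
    }

    if((lead & 0xF8U) == 0xF0U) {
        return 4; // 11110xxx
    }

    return 0;
}
```

No leading byte yields a value above sketchUtf8MaxBytes, so a buffer of that size is enough to hold any single encoded character.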

◆ VaderB_DECR

constexpr auto crawlservpp::Data::VaderB_DECR {-0.293F}
inline

Empirically derived mean sentiment intensity rating decrease for negative booster tokens.

◆ VaderB_INCR

constexpr auto crawlservpp::Data::VaderB_INCR {0.293F}
inline

Empirically derived mean sentiment intensity rating increase for booster tokens.

◆ VaderButFactorAfter

constexpr auto crawlservpp::Data::VaderButFactorAfter {1.5F}
inline

Factor by which the modifier is heightened after a "but".

Referenced by crawlservpp::Data::Sentiment::analyze().

◆ VaderButFactorBefore

constexpr auto crawlservpp::Data::VaderButFactorBefore {0.5F}
inline

Factor by which the modifier is dampened before a "but".

Referenced by crawlservpp::Data::Sentiment::analyze().
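
The two "but" factors work as a pair: token sentiments before the conjunction are scaled down, those after it are scaled up. The following sketch shows this on a plain vector of per-token sentiment values. The function name and data layout are assumptions for illustration and need not match Sentiment::analyze().

```cpp
#include <cstddef>
#include <vector>

// Stand-ins for VaderButFactorBefore and VaderButFactorAfter.
constexpr auto sketchButFactorBefore{0.5F};
constexpr auto sketchButFactorAfter{1.5F};

// Rescales per-token sentiments around the token at `butIndex`.
inline void sketchApplyBut(std::vector<float>& sentiments, std::size_t butIndex) {
    for(std::size_t i{0}; i < sentiments.size(); ++i) {
        if(i < butIndex) {
            sentiments[i] *= sketchButFactorBefore; // dampen what came before
        }
        else if(i > butIndex) {
            sentiments[i] *= sketchButFactorAfter; // heighten what comes after
        }
    }
}
```

With sentiments {2, 0, 2} and the "but" at index 1, the values become {1, 0, 3}: the clause after the conjunction dominates the overall score.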

◆ VaderC_INCR

constexpr auto crawlservpp::Data::VaderC_INCR {0.733F}
inline

Empirically derived mean sentiment intensity rating increase for using ALLCAPs to emphasize a token.

Referenced by crawlservpp::Data::Sentiment::analyze().

◆ VaderDampOne

constexpr auto crawlservpp::Data::VaderDampOne {0.95F}
inline

Factor by which the scalar modifier of immediately preceding tokens is dampened.

Referenced by crawlservpp::Data::Sentiment::analyze().

◆ VaderDampTwo

constexpr auto crawlservpp::Data::VaderDampTwo {0.9F}
inline

Factor by which the scalar modifier of previously preceding tokens is dampened.

Referenced by crawlservpp::Data::Sentiment::analyze().

◆ VaderFOne

constexpr auto crawlservpp::Data::VaderFOne {1.F}
inline

Factor of One.

Referenced by crawlservpp::Data::Sentiment::analyze().

◆ VaderFour

constexpr auto crawlservpp::Data::VaderFour {4}
inline

Four.

◆ VaderN_SCALAR

constexpr auto crawlservpp::Data::VaderN_SCALAR {-0.74F}
inline

Negation factor.

Referenced by crawlservpp::Data::Sentiment::analyze().

◆ VaderNeverFactor

constexpr auto crawlservpp::Data::VaderNeverFactor {1.25F}
inline

Factor by which the modifier is heightened after a "never".

Referenced by crawlservpp::Data::Sentiment::analyze().

◆ VaderOne

constexpr auto crawlservpp::Data::VaderOne {1}
inline

◆ VaderThree

constexpr auto crawlservpp::Data::VaderThree {3}
inline

◆ VaderTwo

constexpr auto crawlservpp::Data::VaderTwo {2}
inline

◆ VaderZero

constexpr auto crawlservpp::Data::VaderZero {0}
inline