crawlserv++
[under development]
Application for crawling and analyzing textual content of websites.
Namespace for different types of data. More...
Namespaces | |
| Compression | |
| Namespace for data compression. | |
| File | |
| Namespace for functions accessing files. | |
| ImportExport | |
| Namespace for the import and export of data. | |
| Stemmer | |
| Namespace for linguistic stemmers. | |
Classes | |
| class | Corpus |
| Class representing a text corpus. More... | |
| struct | GetColumn |
| Structure for retrieving the values in a table column. More... | |
| struct | GetColumns |
| Structure for retrieving multiple table columns of the same type. More... | |
| struct | GetColumnsMixed |
| Structure for retrieving multiple table columns of different types. More... | |
| struct | GetFields |
| Structure for retrieving multiple values of the same type from a table column. More... | |
| struct | GetFieldsMixed |
| Structure for getting multiple values of different types from a table column. More... | |
| struct | GetValue |
| Structure for retrieving one value from a table column. More... | |
| struct | InsertFields |
| Structure for inserting multiple values of the same type into a table. More... | |
| struct | InsertFieldsMixed |
| Structure for inserting multiple values of different types into a row. More... | |
| struct | InsertValue |
| Structure for inserting one value into a table. More... | |
| class | Lemmatizer |
| Lemmatizer. More... | |
| class | PickleDict |
| Simple Python pickle dictionary. More... | |
| class | Sentiment |
| Implementation of the VADER sentiment analysis algorithm. More... | |
| struct | SentimentScores |
| Structure for VADER sentiment scores. More... | |
| class | Tagger |
| Multilingual POS (part of speech) tagger using Wapiti by Thomas Lavergne. More... | |
| class | TokenCorrect |
| Corrects tokens using an Aspell dictionary. More... | |
| class | TokenRemover |
| Token remover and trimmer. More... | |
| class | TopicModel |
| Topic modeller. More... | |
| struct | UpdateFields |
| Structure for updating multiple values of the same type in a table. More... | |
| struct | UpdateFieldsMixed |
| Structure for updating multiple values of different types in a table. More... | |
| struct | UpdateValue |
| Structure for updating one value in a table. More... | |
| struct | Value |
| A generic value. More... | |
Typedefs | |
| using | Bytes = std::vector< std::uint8_t > |
Enumerations | |
| enum | Type { _unknown, _bool, _int32, _uint32, _int64, _uint64, _double, _string } |
| Data types. More... | |
Constants | |
| constexpr auto | dateLength {10} |
| The length of a date string in the format YYYY-MM-DD. More... | |
| constexpr std::uint8_t | utf8MaxBytes {4} |
| Maximum number of bytes used by one UTF-8-encoded multibyte character. More... | |
| constexpr auto | mergeUpdateEvery {10000} |
| After how many sentences the status is updated when merging corpora. More... | |
| constexpr auto | tokenizeUpdateEvery {10000} |
| After how many sentences the status is updated when tokenizing a corpus. More... | |
| constexpr auto | filterUpdateEvery {10000} |
| After how many articles the status is updated when filtering a corpus (by queries). More... | |
| constexpr auto | minSingleUtf8CharSize {2} |
| Minimum length of single UTF-8 code points to remove. More... | |
| constexpr auto | maxSingleUtf8CharSize {4} |
| Maximum length of single UTF-8 code points to remove. More... | |
| constexpr auto | bytes32bit {4} |
| The number of bytes of a 32-bit value. More... | |
| constexpr auto | bytes64bit {8} |
| The number of bytes of a 64-bit value. More... | |
| constexpr auto | dictDir {"dict"sv} |
| Directory for dictionaries. More... | |
| constexpr auto | colLemma {1} |
| Column containing the lemma in a dictionary file. More... | |
| constexpr auto | colTag {2} |
| Column containing the tag in a dictionary file. More... | |
| constexpr auto | colCount {3} |
| Column containing the number of occurrences in a dictionary file. More... | |
| constexpr auto | pickleOneByte {1} |
| One byte. More... | |
| constexpr auto | pickleTwoBytes {2} |
| Two bytes. More... | |
| constexpr auto | pickleFourBytes {4} |
| Four bytes. More... | |
| constexpr auto | pickleEightBytes {8} |
| Eight bytes. More... | |
| constexpr auto | pickleNineBytes {9} |
| Nine bytes (eight bytes and an op-code). More... | |
| constexpr auto | pickleMinSize {11} |
| The minimum size of a Python pickle to extract a frame. More... | |
| constexpr auto | pickleProtocolVersion {4} |
| The protocol version of Python pickles used. More... | |
| constexpr auto | pickleProtoByte {0} |
| The position of the protocol byte in a Python pickle. More... | |
| constexpr auto | pickleVersionByte {1} |
| The position of the version byte in a Python pickle. More... | |
| constexpr auto | pickleHeadSize {2} |
| The size of the Python pickle header, in bytes. More... | |
| constexpr auto | pickleMinFrameSize {9} |
| The minimum size of a Python pickle frame. More... | |
| constexpr std::uint8_t | pickleMaxUOneByteNumber {255} |
| Maximum number in unsigned one-byte number. More... | |
| constexpr std::uint16_t | pickleMaxUTwoByteNumber {65535} |
| Maximum number in unsigned two-byte number. More... | |
| constexpr std::uint32_t | pickleMaxUFourByteNumber {4294967295} |
| Maximum number in unsigned four-byte number. More... | |
| constexpr auto | pickleBase {10} |
| The base used for converting strings to numbers. More... | |
| constexpr auto | VaderZero {0} |
| Zero. More... | |
| constexpr auto | VaderOne {1} |
| One. More... | |
| constexpr auto | VaderTwo {2} |
| Two. More... | |
| constexpr auto | VaderThree {3} |
| Three. More... | |
| constexpr auto | VaderFour {4} |
| Four. More... | |
| constexpr auto | VaderFOne {1.F} |
| Factor of One. More... | |
| constexpr auto | VaderDampOne {0.95F} |
| Factor by which the scalar modifier of immediately preceding tokens is dampened. More... | |
| constexpr auto | VaderDampTwo {0.9F} |
| Factor by which the scalar modifier of previously preceding tokens is dampened. More... | |
| constexpr auto | VaderButFactorBefore {0.5F} |
| Factor by which the modifier is dampened before a "but". More... | |
| constexpr auto | VaderButFactorAfter {1.5F} |
| Factor by which the modifier is heightened after a "but". More... | |
| constexpr auto | VaderNeverFactor {1.25F} |
| Factor by which the modifier is heightened after a "never". More... | |
| constexpr auto | VaderB_INCR {0.293F} |
| Empirically derived mean sentiment intensity rating increase for booster tokens. More... | |
| constexpr auto | VaderB_DECR {-0.293F} |
| Empirically derived mean sentiment intensity rating decrease for negative booster tokens. More... | |
| constexpr auto | VaderC_INCR {0.733F} |
| Empirically derived mean sentiment intensity rating increase for using ALLCAPs to emphasize a token. More... | |
| constexpr auto | VaderN_SCALAR {-0.74F} |
| Negation factor. More... | |
| constexpr auto | hdpModelName {"HDPModel"sv} |
| The name of the HDP model. More... | |
| constexpr auto | ldaModelName {"LDAModel"sv} |
| The name of the LDA model. More... | |
| constexpr auto | defaultNumberOfInitialTopics {2} |
| The initial number of topics by default. More... | |
| constexpr auto | defaultAlpha {0.1F} |
| The default concentration coefficient of the Dirichlet Process for document-table. More... | |
| constexpr auto | defaultEta {0.01F} |
| The default hyperparameter for the Dirichlet distribution for topic-token. More... | |
| constexpr auto | defaultGamma {0.1F} |
| The default concentration coefficient of the Dirichlet Process for table-topic. More... | |
| constexpr auto | defaultOptimizationInterval {10} |
| The default interval for optimizing the parameters, in iterations. More... | |
| constexpr auto | modelFileHead {"LDA\0\0"sv} |
| The beginning of a valid model file containing a LDA (or HDP) model. More... | |
| constexpr auto | modelFileTermWeightingLen {5} |
| The number of bytes determining the term weighting scheme in a model file. More... | |
| constexpr auto | modelFileTermWeightingOne {"one\0\0"sv} |
| The term weighting scheme ONE as saved in a model file. More... | |
| constexpr auto | modelFileTermWeightingIdf {"idf\0\0"sv} |
| The term weighting scheme IDF (tf-idf) as saved in a model file. More... | |
| constexpr auto | modelFileType {"TPTK"sv} |
| The tomoto file format as saved in a model file (after model head and term weighting scheme). More... | |
Sentence and Token Manipulation | |
| constexpr std::uint16_t | corpusManipNone {0} |
| Do not manipulate anything. More... | |
| constexpr std::uint16_t | corpusManipTagger {1} |
| The POS (part of speech) tagger based on Wapiti by Thomas Lavergne. More... | |
| constexpr std::uint16_t | corpusManipTaggerPosterior {2} |
| The posterior POS tagger based on Wapiti by Thomas Lavergne (slow, but more accurate). More... | |
| constexpr std::uint16_t | corpusManipEnglishStemmer {3} |
| The porter2_stemmer algorithm for English only, implemented by Sean Massung. More... | |
| constexpr std::uint16_t | corpusManipGermanStemmer {4} |
| Simple stemmer for German only, based on CISTEM by Leonie Weißweiler and Alexander Fraser. More... | |
| constexpr std::uint16_t | corpusManipLemmatizer {5} |
| Multilingual lemmatizer. More... | |
| constexpr std::uint16_t | corpusManipRemove {6} |
| Remove single tokens found in a dictionary. More... | |
| constexpr std::uint16_t | corpusManipTrim {7} |
| Trim tokens by tokens found in a dictionary. More... | |
| constexpr std::uint16_t | corpusManipCorrect {8} |
| Correct single tokens using an Aspell dictionary. More... | |
Helper Function | |
| Type | parseSQLType (std::string sqlType) |
| Parses the given SQL data type. More... | |
Template Functions | |
| template<int > | |
| Type | getTypeOfSizeT () |
| Resolves std::size_t into the appropriate data type. More... | |
| template<> | |
| Type | getTypeOfSizeT< bytes32bit > () |
| Identifies std::size_t as a 32-bit integer. More... | |
| template<> | |
| Type | getTypeOfSizeT< bytes64bit > () |
| Identifies std::size_t as a 64-bit integer. More... | |
Namespace for different types of data.
Bytes
using crawlservpp::Data::Bytes = std::vector< std::uint8_t >

Type
Data types.
getTypeOfSizeT() [inline]
Resolves std::size_t into the appropriate data type.
Referenced by parseSQLType().

getTypeOfSizeT< bytes32bit >() [inline]
Identifies std::size_t as a 32-bit integer.
References _uint32.

getTypeOfSizeT< bytes64bit >() [inline]
Identifies std::size_t as a 64-bit integer.
References _uint64.

parseSQLType() [inline]
Parses the given SQL data type.
| sqlType | String containing the SQL data type to parse. |
References _bool, _double, _int32, _int64, _string, _uint32, _uint64, _unknown, and getTypeOfSizeT().
Referenced by crawlservpp::Module::Analyzer::Thread::uploadResult().
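The exact matching rules of parseSQLType() are not documented here. As an illustration only, a hypothetical mapping from common MySQL type names to the Type enumeration might look as follows (parseSqlTypeSketch and its rules are assumptions, not the project's implementation):

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Stand-in for the Type enumeration documented above.
enum class Type { _unknown, _bool, _int32, _uint32, _int64, _uint64, _double, _string };

// Hypothetical mapping from a SQL type string to Type; the real
// parseSQLType() may use different rules.
inline Type parseSqlTypeSketch(std::string sqlType) {
    // normalize to upper case for comparison
    std::transform(sqlType.begin(), sqlType.end(), sqlType.begin(),
            [](unsigned char c) { return static_cast<char>(std::toupper(c)); });

    const bool isUnsigned{sqlType.find("UNSIGNED") != std::string::npos};

    if (sqlType.rfind("BOOL", 0) == 0) return Type::_bool;
    if (sqlType.rfind("BIGINT", 0) == 0) return isUnsigned ? Type::_uint64 : Type::_int64;
    if (sqlType.rfind("INT", 0) == 0) return isUnsigned ? Type::_uint32 : Type::_int32;
    if (sqlType.rfind("DOUBLE", 0) == 0 || sqlType.rfind("FLOAT", 0) == 0) return Type::_double;
    if (sqlType.rfind("VARCHAR", 0) == 0 || sqlType.rfind("TEXT", 0) == 0) return Type::_string;

    return Type::_unknown;
}
```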
bytes32bit [inline]
The number of bytes of a 32-bit value.

bytes64bit [inline]
The number of bytes of a 64-bit value.

colCount [inline]
Column containing the number of occurrences in a dictionary file.
Column numbers start at zero. Columns are separated by tab characters.
Referenced by crawlservpp::Data::Lemmatizer::clear().

colLemma [inline]
Column containing the lemma in a dictionary file.
Column numbers start at zero. Columns are separated by tab characters.
Referenced by crawlservpp::Data::Lemmatizer::clear().

colTag [inline]
Column containing the tag in a dictionary file.
Column numbers start at zero. Columns are separated by tab characters.
Referenced by crawlservpp::Data::Lemmatizer::clear().
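A hypothetical reader for one such dictionary line, using the zero-based, tab-separated columns described above (column 0 is assumed to hold the token itself, which the reference does not state explicitly):

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Zero-based column indices as documented above; column 0 is assumed
// to hold the token itself (not stated explicitly in the reference).
constexpr std::size_t colLemma{1};
constexpr std::size_t colTag{2};
constexpr std::size_t colCount{3};

// Split one tab-separated dictionary line into its columns.
inline std::vector<std::string> splitDictLine(const std::string& line) {
    std::vector<std::string> columns;
    std::istringstream in{line};
    std::string field;

    while (std::getline(in, field, '\t')) {
        columns.push_back(field);
    }

    return columns;
}
```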
corpusManipCorrect [inline]
Correct single tokens using an Aspell dictionary.
Referenced by crawlservpp::Data::Corpus::tokenize().

corpusManipEnglishStemmer [inline]
The porter2_stemmer algorithm for English only, implemented by Sean Massung.
Referenced by crawlservpp::Data::Corpus::tokenize().

corpusManipGermanStemmer [inline]
Simple stemmer for German only, based on CISTEM by Leonie Weißweiler and Alexander Fraser.
Referenced by crawlservpp::Data::Corpus::tokenize().

corpusManipLemmatizer [inline]
Multilingual lemmatizer.
Referenced by crawlservpp::Data::Corpus::tokenize().

corpusManipNone [inline]
Do not manipulate anything.
Referenced by crawlservpp::Data::Corpus::tokenize().

corpusManipRemove [inline]
Remove single tokens found in a dictionary.
Referenced by crawlservpp::Data::Corpus::tokenize().

corpusManipTagger [inline]
The POS (part of speech) tagger based on Wapiti by Thomas Lavergne.
Referenced by crawlservpp::Data::Corpus::tokenize().

corpusManipTaggerPosterior [inline]
The posterior POS tagger based on Wapiti by Thomas Lavergne (slow, but more accurate).
Referenced by crawlservpp::Data::Corpus::tokenize().

corpusManipTrim [inline]
Trim tokens by tokens found in a dictionary.
Referenced by crawlservpp::Data::Corpus::tokenize().
dateLength [inline]
The length of a date string in the format YYYY-MM-DD.
Referenced by crawlservpp::Data::Corpus::clear(), crawlservpp::Data::Corpus::getDate(), and crawlservpp::Data::Corpus::getDateTokenized().

defaultAlpha [inline]
The default concentration coefficient of the Dirichlet Process for document-table.
Referenced by crawlservpp::Data::TopicModel::clear().

defaultEta [inline]
The default hyperparameter for the Dirichlet distribution for topic-token.
Referenced by crawlservpp::Data::TopicModel::clear().

defaultGamma [inline]
The default concentration coefficient of the Dirichlet Process for table-topic.
Not used by LDA models, i.e. when a fixed number of topics is set.
Referenced by crawlservpp::Data::TopicModel::clear().

defaultNumberOfInitialTopics [inline]
The initial number of topics by default.
Referenced by crawlservpp::Data::TopicModel::clear(), and crawlservpp::Data::TopicModel::load().

defaultOptimizationInterval [inline]
The default interval for optimizing the parameters, in iterations.
Referenced by crawlservpp::Data::TopicModel::clear().

dictDir [inline]
Directory for dictionaries.
Referenced by crawlservpp::Data::TokenRemover::clear(), crawlservpp::Data::Lemmatizer::clear(), and crawlservpp::Module::Analyzer::Algo::SentimentOverTime::onAlgoInit().
filterUpdateEvery [inline]
After how many articles the status is updated when filtering a corpus (by queries).
Referenced by crawlservpp::Data::Corpus::filterArticles().

hdpModelName [inline]
The name of the HDP model.
Referenced by crawlservpp::Data::TopicModel::getModelName().

ldaModelName [inline]
The name of the LDA model.
Referenced by crawlservpp::Data::TopicModel::getModelName().

maxSingleUtf8CharSize [inline]
Maximum length of single UTF-8 code points to remove.

mergeUpdateEvery [inline]
After how many sentences the status is updated when merging corpora.
Referenced by crawlservpp::Data::Corpus::clear().

minSingleUtf8CharSize [inline]
Minimum length of single UTF-8 code points to remove.

modelFileHead [inline]
The beginning of a valid model file containing a LDA (or HDP) model.
Referenced by crawlservpp::Data::TopicModel::clear().

modelFileTermWeightingIdf [inline]
The term weighting scheme IDF (tf-idf) as saved in a model file.
Referenced by crawlservpp::Data::TopicModel::clear().

modelFileTermWeightingLen [inline]
The number of bytes determining the term weighting scheme in a model file.
Referenced by crawlservpp::Data::TopicModel::clear().

modelFileTermWeightingOne [inline]
The term weighting scheme ONE as saved in a model file.
Referenced by crawlservpp::Data::TopicModel::clear().

modelFileType [inline]
The tomoto file format as saved in a model file (after model head and term weighting scheme).
Referenced by crawlservpp::Data::TopicModel::clear().
pickleBase [inline]
The base used for converting strings to numbers.
Referenced by crawlservpp::Data::PickleDict::writeTo().

pickleEightBytes [inline]
Eight bytes.
Referenced by crawlservpp::Data::PickleDict::writeTo().

pickleFourBytes [inline]
Four bytes.
Referenced by crawlservpp::Data::PickleDict::writeTo().

pickleHeadSize [inline]
The size of the Python pickle header, in bytes.
Referenced by crawlservpp::Data::PickleDict::writeTo().

pickleMaxUFourByteNumber [inline]
Maximum number in unsigned four-byte number.
Referenced by crawlservpp::Data::PickleDict::writeTo().

pickleMaxUOneByteNumber [inline]
Maximum number in unsigned one-byte number.
Referenced by crawlservpp::Data::PickleDict::writeTo().

pickleMaxUTwoByteNumber [inline]
Maximum number in unsigned two-byte number.
Referenced by crawlservpp::Data::PickleDict::writeTo().

pickleMinFrameSize [inline]
The minimum size of a Python pickle frame.
Referenced by crawlservpp::Data::PickleDict::writeTo().

pickleMinSize [inline]
The minimum size of a Python pickle to extract a frame.
Referenced by crawlservpp::Data::PickleDict::writeTo().

pickleNineBytes [inline]
Nine bytes (eight bytes and an op-code).
Referenced by crawlservpp::Data::PickleDict::writeTo().

pickleOneByte [inline]
One byte.
Referenced by crawlservpp::Data::PickleDict::writeTo().

pickleProtoByte [inline]
The position of the protocol byte in a Python pickle.
Referenced by crawlservpp::Data::PickleDict::writeTo().

pickleProtocolVersion [inline]
The protocol version of Python pickles used.
Referenced by crawlservpp::Data::PickleDict::writeTo().

pickleTwoBytes [inline]
Two bytes.
Referenced by crawlservpp::Data::PickleDict::writeTo().

pickleVersionByte [inline]
The position of the version byte in a Python pickle.
Referenced by crawlservpp::Data::PickleDict::writeTo().
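The pickle constants above line up with the layout of a protocol-4 pickle: a two-byte header (PROTO op-code plus version byte) followed by a nine-byte FRAME op-code with a little-endian 64-bit length, i.e. eleven bytes (pickleMinSize) before the payload. A sketch of emitting that prefix (the op-code values 0x80 and 0x95 come from the Python pickle format itself, not from this project):

```cpp
#include <cstdint>
#include <vector>

// Op-codes from the Python pickle format.
constexpr std::uint8_t opProto{0x80}; // PROTO: next byte is the version
constexpr std::uint8_t opFrame{0x95}; // FRAME: next 8 bytes are the length

// Emit the two-byte header (pickleHeadSize) followed by the nine-byte
// FRAME op (pickleMinFrameSize): eleven bytes (pickleMinSize) in total.
inline std::vector<std::uint8_t> pickleFrameHeader(std::uint64_t frameLength) {
    std::vector<std::uint8_t> out;

    out.push_back(opProto);
    out.push_back(4); // pickleProtocolVersion

    out.push_back(opFrame);

    for (int shift{0}; shift < 64; shift += 8) { // little-endian length
        out.push_back(static_cast<std::uint8_t>(frameLength >> shift));
    }

    return out;
}
```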
tokenizeUpdateEvery [inline]
After how many sentences the status is updated when tokenizing a corpus.
Referenced by crawlservpp::Data::Corpus::clear().

utf8MaxBytes [inline]
Maximum number of bytes used by one UTF-8-encoded multibyte character.
Referenced by crawlservpp::Data::Corpus::clear().
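utf8MaxBytes reflects that a UTF-8 sequence is at most four bytes long, and the length of a sequence can be read off its lead byte. A small standalone helper illustrating this (not part of the project's API):

```cpp
#include <cstdint>

// Number of bytes in a UTF-8 sequence, determined from its lead byte;
// returns 0 for a continuation or invalid lead byte.
inline std::uint8_t utf8SequenceLength(std::uint8_t leadByte) {
    if ((leadByte & 0x80U) == 0x00U) return 1; // 0xxxxxxx: ASCII
    if ((leadByte & 0xE0U) == 0xC0U) return 2; // 110xxxxx
    if ((leadByte & 0xF0U) == 0xE0U) return 3; // 1110xxxx
    if ((leadByte & 0xF8U) == 0xF0U) return 4; // 11110xxx: utf8MaxBytes
    return 0;
}
```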
VaderB_DECR [inline]
Empirically derived mean sentiment intensity rating decrease for negative booster tokens.

VaderB_INCR [inline]
Empirically derived mean sentiment intensity rating increase for booster tokens.

VaderButFactorAfter [inline]
Factor by which the modifier is heightened after a "but".
Referenced by crawlservpp::Data::Sentiment::analyze().

VaderButFactorBefore [inline]
Factor by which the modifier is dampened before a "but".
Referenced by crawlservpp::Data::Sentiment::analyze().

VaderC_INCR [inline]
Empirically derived mean sentiment intensity rating increase for using ALLCAPs to emphasize a token.
Referenced by crawlservpp::Data::Sentiment::analyze().

VaderDampOne [inline]
Factor by which the scalar modifier of immediately preceding tokens is dampened.
Referenced by crawlservpp::Data::Sentiment::analyze().

VaderDampTwo [inline]
Factor by which the scalar modifier of previously preceding tokens is dampened.
Referenced by crawlservpp::Data::Sentiment::analyze().

VaderFOne [inline]
Factor of One.
Referenced by crawlservpp::Data::Sentiment::analyze().

VaderFour [inline]
Four.

VaderN_SCALAR [inline]
Negation factor.
Referenced by crawlservpp::Data::Sentiment::analyze().

VaderNeverFactor [inline]
Factor by which the modifier is heightened after a "never".
Referenced by crawlservpp::Data::Sentiment::analyze().

VaderOne [inline]
One.
Referenced by crawlservpp::Data::Sentiment::analyze().

VaderThree [inline]
Three.
Referenced by crawlservpp::Data::Sentiment::analyze().

VaderTwo [inline]
Two.
Referenced by crawlservpp::Data::Sentiment::analyze().

VaderZero [inline]
Zero.
Referenced by crawlservpp::Data::Sentiment::analyze().
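The booster and damping constants above combine roughly as in the original VADER algorithm: a booster token contributes B_INCR (plus C_INCR when written in ALLCAPs), dampened the further it sits before the rated token. A simplified sketch under those assumptions; the real Sentiment::analyze() applies further rules such as negation, "but" handling, and punctuation emphasis:

```cpp
#include <cmath>

// Constants mirroring those documented above.
constexpr float vaderB_INCR{0.293F};  // booster increase
constexpr float vaderC_INCR{0.733F};  // ALLCAPs emphasis increase
constexpr float vaderDampOne{0.95F};  // damping, one token further back
constexpr float vaderDampTwo{0.9F};   // damping, two tokens further back

// Scalar contributed by a booster token sitting 1-3 positions before
// the rated token; a sketch only, not Sentiment::analyze() itself.
inline float boosterScalar(int distance, bool allCaps) {
    float scalar{vaderB_INCR};

    if (allCaps) scalar += vaderC_INCR;

    if (distance == 2) scalar *= vaderDampOne;
    else if (distance == 3) scalar *= vaderDampTwo;

    return scalar;
}
```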