crawlserv++
[under development]
Application for crawling and analyzing textual content of websites.
Namespace for linguistic stemmers.
Functions
| void | stemEnglish (std::string &token) |
| Stems a token in English. |
| void | stemGerman (std::string &token) |
| Stems a token in German. |
Variables
| constexpr auto | minLengthStrip2 {6} |
| Minimum length of a token to strip two letters from the end or the beginning. |
| constexpr auto | minLengthStrip1 {4} |
| Minimum length of a token to strip one letter from the end. |
| constexpr auto | binInv {0xff} |
| Literal for binary inversion. |
| constexpr auto | toLowerCase {32} |
| Number to add to make uppercase ASCII letters lowercase. |
| constexpr auto | utf8mb2 {0xC3} |
| First byte of 2-byte UTF-8 characters for umlauts and sharp s. |
| constexpr auto | utf8mb3 {0xE1} |
| First byte of 3-byte UTF-8 character for capital sharp s. |
| constexpr auto | umlautA2sm {0xA4} |
| Second byte of UTF-8 umlaut ä. |
| constexpr auto | umlautA2l {0x84} |
| Second byte of UTF-8 umlaut Ä. |
| constexpr auto | umlautO2sm {0xB6} |
| Second byte of UTF-8 umlaut ö. |
| constexpr auto | umlautO2l {0x96} |
| Second byte of UTF-8 umlaut Ö. |
| constexpr auto | umlautU2sm {0xBC} |
| Second byte of UTF-8 umlaut ü. |
| constexpr auto | umlautU2l {0x9C} |
| Second byte of UTF-8 umlaut Ü. |
| constexpr auto | sharpS2sm {0x9F} |
| Second byte of UTF-8 sharp s. |
| constexpr auto | sharpS2l {0xBA} |
| Second byte of UTF-8 capital sharp s. |
| constexpr auto | sharpS3l {0x9E} |
| Third byte of UTF-8 capital sharp s. |
| void stemEnglish (std::string &token) | inline |
Stems a token in English.
| token | The token to be stemmed in situ. |
References crawlservpp::Helper::Strings::trim().
Referenced by crawlservpp::Data::Corpus::tokenize().
| void stemGerman (std::string &token) | inline |
Stems a token in German.
| token | The token to be stemmed in situ. |
References binInv, minLengthStrip1, minLengthStrip2, sharpS2l, sharpS2sm, sharpS3l, toLowerCase, umlautA2l, umlautA2sm, umlautO2l, umlautO2sm, umlautU2l, umlautU2sm, utf8mb2, and utf8mb3.
Referenced by crawlservpp::Data::Corpus::tokenize().
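Both functions take the token by non-const reference and return nothing, so the stemmed result replaces the caller's string. The following is a minimal sketch of that calling pattern only. `StemFunction`, `dropTrailingS()` and `stemAll()` are hypothetical names introduced for illustration; `dropTrailingS()` is a stand-in and does not reproduce the algorithm of `stemEnglish()` or `stemGerman()`.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Same shape as stemEnglish() and stemGerman(): the token is modified in place.
using StemFunction = void (*)(std::string&);

// Stand-in for illustration only (assumption): NOT the library's algorithm.
// Drops one trailing 's' from tokens longer than one character.
inline void dropTrailingS(std::string& token) {
    if (token.size() > 1 && token.back() == 's') {
        token.pop_back();
    }
}

// Hypothetical helper: applies an in-place stemmer to every token.
inline void stemAll(std::vector<std::string>& tokens, StemFunction stem) {
    for (auto& token : tokens) {
        stem(token); // no return value; the token itself is overwritten
    }
}
```

Because the functions work in place, a caller that still needs the unstemmed token has to copy it before the call.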
| constexpr auto binInv {0xff} | inline |
Literal for binary inversion.
Referenced by stemGerman().
| constexpr auto minLengthStrip1 {4} | inline |
Minimum length of a token to strip one letter from the end.
Referenced by stemGerman().
| constexpr auto minLengthStrip2 {6} | inline |
Minimum length of a token to strip two letters from the end or the beginning.
Referenced by stemGerman().
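The two minimum lengths act as guards: a token shorter than the threshold is left alone. The sketch below shows one way such guards could be applied. It makes several assumptions: the thresholds are inclusive, length is counted in bytes, and the helper names `stripSuffixIfLongEnough()` and `stripPrefixIfLongEnough()` are hypothetical. Which letters `stemGerman()` actually strips is not stated on this page, so the affix is passed in by the caller.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Values as documented on this page (typed as std::size_t for this sketch).
constexpr std::size_t minLengthStrip1{4};
constexpr std::size_t minLengthStrip2{6};

// Hypothetical helper: removes `suffix` (expected to be one or two bytes)
// from the end of `token` if the token is long enough and ends with it.
// Returns true if the token was shortened.
inline bool stripSuffixIfLongEnough(std::string& token, const std::string& suffix) {
    const std::size_t minLength{suffix.size() == 1 ? minLengthStrip1 : minLengthStrip2};

    if (token.size() < minLength || token.size() < suffix.size()) {
        return false; // too short: leave the token untouched
    }
    if (token.compare(token.size() - suffix.size(), suffix.size(), suffix) != 0) {
        return false; // token does not end with the suffix
    }

    token.erase(token.size() - suffix.size());
    return true;
}

// Hypothetical helper: removes a two-byte `prefix` from the beginning of
// `token`, guarded by minLengthStrip2.
inline bool stripPrefixIfLongEnough(std::string& token, const std::string& prefix) {
    if (token.size() < minLengthStrip2 || token.size() < prefix.size()) {
        return false;
    }
    if (token.compare(0, prefix.size(), prefix) != 0) {
        return false;
    }

    token.erase(0, prefix.size());
    return true;
}
```

The guards keep short tokens from being reduced to one or two letters, at the cost of leaving some short inflected forms unstemmed.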
| constexpr auto sharpS2l {0xBA} | inline |
Second byte of UTF-8 capital sharp s.
Referenced by stemGerman().
| constexpr auto sharpS2sm {0x9F} | inline |
Second byte of UTF-8 sharp s.
Referenced by stemGerman().
| constexpr auto sharpS3l {0x9E} | inline |
Third byte of UTF-8 capital sharp s.
Referenced by stemGerman().
| constexpr auto toLowerCase {32} | inline |
Number to add to make uppercase ASCII letters lowercase.
Referenced by stemGerman().
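In ASCII, each uppercase letter `A` to `Z` (65 to 90) lies exactly 32 below its lowercase counterpart (97 to 122), which is why adding this constant lowercases a letter. A minimal sketch follows; `lowerAscii()` and `lowerAsciiInPlace()` are hypothetical helpers and not part of the documented namespace.

```cpp
#include <cassert>
#include <string>

// Value as documented on this page.
constexpr int toLowerCase{32};

// Hypothetical helper: lowercases one ASCII letter, leaves every other
// byte (digits, punctuation, bytes of multi-byte UTF-8 characters) as is.
constexpr char lowerAscii(char c) {
    return (c >= 'A' && c <= 'Z') ? static_cast<char>(c + toLowerCase) : c;
}

// Hypothetical helper: lowercases all ASCII letters of a token in place.
inline void lowerAsciiInPlace(std::string& token) {
    for (char& c : token) {
        c = lowerAscii(c);
    }
}

static_assert(lowerAscii('A') == 'a', "A maps to a");
static_assert(lowerAscii('z') == 'z', "lowercase letters are unchanged");
```

The range check matters: adding 32 to a byte outside `A` to `Z` would corrupt digits, punctuation, and the bytes of umlauts.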
| constexpr auto umlautA2l {0x84} | inline |
Second byte of UTF-8 umlaut Ä.
Referenced by stemGerman().
| constexpr auto umlautA2sm {0xA4} | inline |
Second byte of UTF-8 umlaut ä.
Referenced by stemGerman().
| constexpr auto umlautO2l {0x96} | inline |
Second byte of UTF-8 umlaut Ö.
Referenced by stemGerman().
| constexpr auto umlautO2sm {0xB6} | inline |
Second byte of UTF-8 umlaut ö.
Referenced by stemGerman().
| constexpr auto umlautU2l {0x9C} | inline |
Second byte of UTF-8 umlaut Ü.
Referenced by stemGerman().
| constexpr auto umlautU2sm {0xBC} | inline |
Second byte of UTF-8 umlaut ü.
Referenced by stemGerman().
| constexpr auto utf8mb2 {0xC3} | inline |
First byte of 2-byte UTF-8 characters for umlauts and sharp s.
Referenced by stemGerman().
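The German umlauts and the lowercase sharp s are each encoded in UTF-8 as the lead byte `0xC3` followed by one of the second bytes documented above. The sketch below shows how these byte pairs can be recognized and how capital umlauts could be rewritten to lowercase. `lowerUmlautsInPlace()` is a hypothetical helper; whether `stemGerman()` handles umlauts this way is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Values as documented on this page (typed as unsigned char for this sketch).
constexpr unsigned char utf8mb2{0xC3};
constexpr unsigned char umlautA2sm{0xA4};
constexpr unsigned char umlautA2l{0x84};
constexpr unsigned char umlautO2sm{0xB6};
constexpr unsigned char umlautO2l{0x96};
constexpr unsigned char umlautU2sm{0xBC};
constexpr unsigned char umlautU2l{0x9C};

// Hypothetical helper: rewrites Ä, Ö, Ü to ä, ö, ü in place.
// Lowercase umlauts and sharp s (0xC3 0x9F) share the lead byte
// and are left unchanged.
inline void lowerUmlautsInPlace(std::string& token) {
    for (std::size_t i{0}; i + 1 < token.size(); ++i) {
        // char may be signed, so compare as unsigned bytes.
        if (static_cast<unsigned char>(token[i]) != utf8mb2) {
            continue;
        }

        switch (static_cast<unsigned char>(token[i + 1])) {
        case umlautA2l:
            token[i + 1] = static_cast<char>(umlautA2sm);
            break;
        case umlautO2l:
            token[i + 1] = static_cast<char>(umlautO2sm);
            break;
        case umlautU2l:
            token[i + 1] = static_cast<char>(umlautU2sm);
            break;
        default:
            break;
        }

        ++i; // skip the second byte of the 2-byte sequence
    }
}
```

The casts to `unsigned char` are needed because a plain `char` holding `0xC3` compares as a negative number on platforms where `char` is signed.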
| constexpr auto utf8mb3 {0xE1} | inline |
First byte of 3-byte UTF-8 character for capital sharp s.
Referenced by stemGerman().
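The capital sharp s (U+1E9E) is the only character on this page that needs three bytes in UTF-8: `0xE1 0xBA 0x9E`. The sketch below detects that sequence and replaces it with the 2-byte lowercase sharp s (`0xC3 0x9F`). `lowerCapitalSharpSInPlace()` is a hypothetical helper, and the replacement target is an assumption; this page does not state what `stemGerman()` does with the character.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Values as documented on this page (typed as unsigned char for this sketch).
constexpr unsigned char utf8mb2{0xC3};
constexpr unsigned char utf8mb3{0xE1};
constexpr unsigned char sharpS2sm{0x9F};
constexpr unsigned char sharpS2l{0xBA};
constexpr unsigned char sharpS3l{0x9E};

// Hypothetical helper: replaces every capital sharp s (3 bytes)
// with a lowercase sharp s (2 bytes), shortening the token by one byte each.
inline void lowerCapitalSharpSInPlace(std::string& token) {
    const std::string lower{static_cast<char>(utf8mb2), static_cast<char>(sharpS2sm)};
    std::size_t i{0};

    while (i + 2 < token.size()) {
        const bool isCapitalSharpS{
            static_cast<unsigned char>(token[i]) == utf8mb3
            && static_cast<unsigned char>(token[i + 1]) == sharpS2l
            && static_cast<unsigned char>(token[i + 2]) == sharpS3l
        };

        if (isCapitalSharpS) {
            token.replace(i, 3, lower);
            i += lower.size(); // continue after the inserted 2-byte sequence
        } else {
            ++i;
        }
    }
}
```

All three bytes have to be checked, because the lead byte `0xE1` also starts many unrelated 3-byte characters.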