crawlserv++  [under development]
Application for crawling and analyzing textual content of websites.
crawlservpp::Data::Stemmer Namespace Reference

Namespace for linguistic stemmers.

Functions

void stemEnglish (std::string &token)
 Stems a token in English.
 
void stemGerman (std::string &token)
 Stems a token in German.
 

Variables

constexpr auto minLengthStrip2 {6}
 Minimum length of a token to strip two letters from the end or the beginning.
 
constexpr auto minLengthStrip1 {4}
 Minimum length of a token to strip one letter from the end.
 
constexpr auto binInv {0xff}
 Literal for binary inversion.
 
constexpr auto toLowerCase {32}
 Number to add to make uppercase ASCII letters lowercase.
 
constexpr auto utf8mb2 {0xC3}
 First byte of 2-byte UTF-8 characters for umlauts and sharp s.
 
constexpr auto utf8mb3 {0xE1}
 First byte of 3-byte UTF-8 character for capital sharp s.
 
constexpr auto umlautA2sm {0xA4}
 Second byte of UTF-8 umlaut ä.
 
constexpr auto umlautA2l {0x84}
 Second byte of UTF-8 umlaut Ä.
 
constexpr auto umlautO2sm {0xB6}
 Second byte of UTF-8 umlaut ö.
 
constexpr auto umlautO2l {0x96}
 Second byte of UTF-8 umlaut Ö.
 
constexpr auto umlautU2sm {0xBC}
 Second byte of UTF-8 umlaut ü.
 
constexpr auto umlautU2l {0x9C}
 Second byte of UTF-8 umlaut Ü.
 
constexpr auto sharpS2sm {0x9F}
 Second byte of UTF-8 sharp s.
 
constexpr auto sharpS2l {0xBA}
 Second byte of UTF-8 capital sharp s.
 
constexpr auto sharpS3l {0x9E}
 Third byte of UTF-8 capital sharp s.
 

Detailed Description

Namespace for linguistic stemmers.

Function Documentation

◆ stemEnglish()

void crawlservpp::Data::Stemmer::stemEnglish (std::string &token)
inline

Stems a token in English.

Parameters
token    The token to be stemmed in situ.

References crawlservpp::Helper::Strings::trim().

Referenced by crawlservpp::Data::Corpus::tokenize().

◆ stemGerman()

void crawlservpp::Data::Stemmer::stemGerman (std::string &token)
inline

Stems a token in German.

Parameters
token    The token to be stemmed in situ.

References binInv, minLengthStrip1, minLengthStrip2, sharpS2l, sharpS2sm, sharpS3l, toLowerCase, umlautA2l, umlautA2sm, umlautO2l, umlautO2sm, umlautU2l, umlautU2sm, utf8mb2, and utf8mb3.

Referenced by crawlservpp::Data::Corpus::tokenize().

Variable Documentation

◆ binInv

constexpr auto crawlservpp::Data::Stemmer::binInv {0xff}
inline

Literal for binary inversion.

Referenced by stemGerman().
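
A value of 0xff is the usual mask for flipping all eight bits of a byte with XOR. The sketch below only illustrates that operation. This page does not state how stemGerman() applies the constant, and invertByte() and checkInvertByte() are hypothetical helpers, not part of crawlserv++.

```cpp
#include <cassert>
#include <cstdint>

// Same value as the documented constant, redeclared with an explicit
// byte type so that the sketch compiles on its own.
constexpr std::uint8_t binInv{0xff};

// Hypothetical helper: inverts all eight bits of a byte via XOR with 0xff.
constexpr std::uint8_t invertByte(std::uint8_t byte) {
    return static_cast<std::uint8_t>(byte ^ binInv);
}

// Quick self-check; call it from main() to run the example.
inline void checkInvertByte() {
    assert(invertByte(0x0F) == 0xF0);             // 0000 1111 -> 1111 0000
    assert(invertByte(invertByte(0xA4)) == 0xA4); // inverting twice restores the byte
}
```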

◆ minLengthStrip1

constexpr auto crawlservpp::Data::Stemmer::minLengthStrip1 {4}
inline

Minimum length of a token to strip one letter from the end.

Referenced by stemGerman().

◆ minLengthStrip2

constexpr auto crawlservpp::Data::Stemmer::minLengthStrip2 {6}
inline

Minimum length of a token to strip two letters from the end or the beginning.

Referenced by stemGerman().
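
The two length thresholds suggest that letters are only stripped from tokens that are long enough. The following is a minimal sketch of such a guard, not the rules stemGerman() actually applies: the helper names, the suffixes "er" and "e", and the inclusive comparison (>=) are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Same values as the documented constants, redeclared so the sketch compiles on its own.
constexpr std::size_t minLengthStrip2{6};
constexpr std::size_t minLengthStrip1{4};

// Hypothetical helper: checks whether the token ends with the given suffix.
inline bool endsWith(const std::string& token, const std::string& suffix) {
    return token.size() >= suffix.size()
        && token.compare(token.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Hypothetical helper: strips at most one suffix in place, guarded by token length.
// The suffixes "er" and "e" are illustrative assumptions.
inline void stripGuarded(std::string& token) {
    if (token.size() >= minLengthStrip2 && endsWith(token, "er")) {
        token.erase(token.size() - 2); // strip two letters from the end
    } else if (token.size() >= minLengthStrip1 && endsWith(token, "e")) {
        token.pop_back();              // strip one letter from the end
    }
}

// Quick self-check; call it from main() to run the example.
inline void checkStripGuarded() {
    std::string longToken{"kinder"};
    stripGuarded(longToken);
    assert(longToken == "kind");

    std::string shortToken{"see"};   // too short for either rule
    stripGuarded(shortToken);
    assert(shortToken == "see");
}
```

Note that std::string::size() counts bytes, so in this sketch each UTF-8 umlaut contributes two to the length.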

◆ sharpS2l

constexpr auto crawlservpp::Data::Stemmer::sharpS2l {0xBA}
inline

Second byte of UTF-8 capital sharp s.

Referenced by stemGerman().

◆ sharpS2sm

constexpr auto crawlservpp::Data::Stemmer::sharpS2sm {0x9F}
inline

Second byte of UTF-8 sharp s.

Referenced by stemGerman().

◆ sharpS3l

constexpr auto crawlservpp::Data::Stemmer::sharpS3l {0x9E}
inline

Third byte of UTF-8 capital sharp s.

Referenced by stemGerman().

◆ toLowerCase

constexpr auto crawlservpp::Data::Stemmer::toLowerCase {32}
inline

Number to add to make uppercase ASCII letters lowercase.

Referenced by stemGerman().
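
In ASCII, each uppercase letter A-Z lies exactly 32 code points below its lowercase counterpart ('A' is 65, 'a' is 97), which is why adding this constant lowercases a letter. A minimal sketch, in which asciiToLower() and checkAsciiToLower() are hypothetical helpers and not part of crawlserv++:

```cpp
#include <cassert>
#include <string>

// Same value as the documented constant, redeclared so the sketch compiles on its own.
constexpr int toLowerCase{32};

// Hypothetical helper: lowercases the ASCII letters A-Z in place.
// All other bytes, including the bytes of UTF-8 umlauts, are left untouched.
inline void asciiToLower(std::string& token) {
    for (auto& c : token) {
        if (c >= 'A' && c <= 'Z') {
            c = static_cast<char>(c + toLowerCase);
        }
    }
}

// Quick self-check; call it from main() to run the example.
inline void checkAsciiToLower() {
    std::string token{"Haus"};
    asciiToLower(token);
    assert(token == "haus");
}
```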

◆ umlautA2l

constexpr auto crawlservpp::Data::Stemmer::umlautA2l {0x84}
inline

Second byte of UTF-8 umlaut Ä.

Referenced by stemGerman().

◆ umlautA2sm

constexpr auto crawlservpp::Data::Stemmer::umlautA2sm {0xA4}
inline

Second byte of UTF-8 umlaut ä.

Referenced by stemGerman().

◆ umlautO2l

constexpr auto crawlservpp::Data::Stemmer::umlautO2l {0x96}
inline

Second byte of UTF-8 umlaut Ö.

Referenced by stemGerman().

◆ umlautO2sm

constexpr auto crawlservpp::Data::Stemmer::umlautO2sm {0xB6}
inline

Second byte of UTF-8 umlaut ö.

Referenced by stemGerman().

◆ umlautU2l

constexpr auto crawlservpp::Data::Stemmer::umlautU2l {0x9C}
inline

Second byte of UTF-8 umlaut Ü.

Referenced by stemGerman().

◆ umlautU2sm

constexpr auto crawlservpp::Data::Stemmer::umlautU2sm {0xBC}
inline

Second byte of UTF-8 umlaut ü.

Referenced by stemGerman().

◆ utf8mb2

constexpr auto crawlservpp::Data::Stemmer::utf8mb2 {0xC3}
inline

First byte of 2-byte UTF-8 characters for umlauts and sharp s.

Referenced by stemGerman().

◆ utf8mb3

constexpr auto crawlservpp::Data::Stemmer::utf8mb3 {0xE1}
inline

First byte of 3-byte UTF-8 character for capital sharp s.

Referenced by stemGerman().
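
Taken together, the byte constants describe the UTF-8 encodings of the German special characters: ä, Ä, ö, Ö, ü, Ü and ß are two bytes starting with 0xC3, while capital sharp s (U+1E9E) is the three bytes 0xE1 0xBA 0x9E. The sketch below shows one way to recognize these sequences in a std::string. specialCharLength() and checkSpecialCharLength() are hypothetical helpers, and the constants are redeclared as unsigned char as an assumption of this sketch. How stemGerman() processes the bytes is not described on this page.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Same values as the documented constants, typed as unsigned char for byte comparisons.
constexpr unsigned char utf8mb2{0xC3};
constexpr unsigned char utf8mb3{0xE1};
constexpr unsigned char umlautA2sm{0xA4};
constexpr unsigned char umlautA2l{0x84};
constexpr unsigned char umlautO2sm{0xB6};
constexpr unsigned char umlautO2l{0x96};
constexpr unsigned char umlautU2sm{0xBC};
constexpr unsigned char umlautU2l{0x9C};
constexpr unsigned char sharpS2sm{0x9F};
constexpr unsigned char sharpS2l{0xBA};
constexpr unsigned char sharpS3l{0x9E};

// Hypothetical helper: returns the byte length (2 or 3) of an umlaut or sharp s
// that starts at position pos, or 0 if no such character starts there.
inline std::size_t specialCharLength(const std::string& s, std::size_t pos) {
    // std::string stores (possibly signed) char, so convert before comparing bytes.
    const auto byteAt = [&s](std::size_t i) {
        return static_cast<unsigned char>(s[i]);
    };

    if (pos + 1 < s.size() && byteAt(pos) == utf8mb2) {
        switch (byteAt(pos + 1)) {
        case umlautA2sm:
        case umlautA2l:
        case umlautO2sm:
        case umlautO2l:
        case umlautU2sm:
        case umlautU2l:
        case sharpS2sm:
            return 2;
        default:
            break; // another 2-byte character, e.g. é (0xC3 0xA9)
        }
    }

    if (pos + 2 < s.size() && byteAt(pos) == utf8mb3
            && byteAt(pos + 1) == sharpS2l && byteAt(pos + 2) == sharpS3l) {
        return 3;
    }

    return 0;
}

// Quick self-check; call it from main() to run the example.
inline void checkSpecialCharLength() {
    const std::string word{"\xC3\x84pfel"}; // "Äpfel" written as explicit bytes
    assert(specialCharLength(word, 0) == 2); // Ä occupies two bytes
    assert(specialCharLength(word, 2) == 0); // 'p' is plain ASCII
}
```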