crawlserv++  [under development]
Application for crawling and analyzing textual content of websites.
crawlservpp::Helper::Utf8 Namespace Reference

Namespace for global UTF-8 encoding functions.

Classes

class  Exception
 Class for UTF-8 exceptions.
 

Functions

bool isLastCharValidUtf8 (const std::string &stringToCheck)
 Checks the last character (i.e. up to four bytes at the end) of the given string for valid UTF-8.
 

Constants

constexpr auto utf8MemoryFactor {2}
 Factor for guessing the maximum amount of memory used for UTF-8 compared to ISO-8859-1.
 
constexpr auto bitmaskTopBit {0x80}
 Bit mask to extract the first bit of a multibyte character.
 
constexpr auto bitmaskTopTwoBits {0xc0}
 Bit mask to extract the top two bits of a multibyte character.
 
constexpr auto shiftSixBits {6}
 Shift six bits.
 
constexpr auto bitmaskLastSixBits0b000001 {0x3F}
 Bit mask to extract the last six bits of a byte (0x3F is 0b111111).
 
constexpr auto oneByte {1}
 One byte.
 
constexpr auto twoBytes {2}
 Two bytes.
 
constexpr auto threeBytes {3}
 Three bytes.
 
constexpr auto fourBytes {4}
 Four bytes.
 

Conversion

std::string iso88591ToUtf8 (std::string_view strIn)
 Converts a string from ISO-8859-1 to UTF-8.
 

Validation

bool isValidUtf8 (std::string_view stringToCheck, std::string &errTo)
 Checks whether a string contains valid UTF-8.
 
bool isLastCharValidUtf8 (std::string_view stringToCheck)
 
bool isSingleUtf8Char (std::string_view stringToCheck)
 Returns whether the given string contains exactly one UTF-8 code point.
 

Repair

bool repairUtf8 (std::string_view strIn, std::string &strOut)
 Replaces invalid UTF-8 characters in the given string and returns whether invalid characters occurred.
 

Length

std::size_t length (std::string_view str)
 Returns the number of UTF-8 code points in the given string.
 

Detailed Description

Namespace for global UTF-8 encoding functions.

Function Documentation

◆ isLastCharValidUtf8() [1/2]

bool crawlservpp::Helper::Utf8::isLastCharValidUtf8 ( std::string_view  stringToCheck)

◆ isLastCharValidUtf8() [2/2]

bool crawlservpp::Helper::Utf8::isLastCharValidUtf8 ( const std::string &  stringToCheck)
inline

Checks the last character (i.e. up to four bytes at the end) of the given string for valid UTF-8.

Uses the UTF8-CPP library for UTF-8 validation. See its GitHub repository for more information.

Parameters
stringToCheck  Constant reference to the string whose last character will be checked for valid UTF-8.
Returns
True if the last character of the given string is valid UTF-8 or the given string is empty. False otherwise.

References fourBytes, oneByte, threeBytes, and twoBytes.
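
The "up to four bytes at the end" rule can be illustrated with a standalone sketch. It is not the library's implementation, which delegates to UTF8-CPP. The names announcedLengthSketch and lastCharLooksValidSketch are hypothetical, and the check is structural only: it walks back from the end to the nearest lead byte and compares the announced sequence length with the number of bytes actually present, without rejecting overlong encodings or surrogates.

```cpp
#include <algorithm>
#include <cstddef>
#include <string_view>

// Hypothetical helper: sequence length announced by a lead byte,
// or 0 if the byte cannot start a UTF-8 character.
constexpr std::size_t announcedLengthSketch(unsigned char lead) {
    if(lead < 0x80) return 1;            // 0xxxxxxx (ASCII)
    if((lead & 0xe0) == 0xc0) return 2;  // 110xxxxx
    if((lead & 0xf0) == 0xe0) return 3;  // 1110xxxx
    if((lead & 0xf8) == 0xf0) return 4;  // 11110xxx
    return 0;                            // continuation or invalid byte
}

// Hypothetical stand-in for isLastCharValidUtf8(); structural check only.
bool lastCharLooksValidSketch(std::string_view str) {
    // An empty string counts as valid, as documented above.
    if(str.empty()) return true;

    constexpr std::size_t fourBytes{4};
    const std::size_t maxLength{std::min(fourBytes, str.size())};

    // Walk back from the end, at most four bytes, to the nearest non-continuation byte.
    for(std::size_t length{1}; length <= maxLength; ++length) {
        const auto byte{static_cast<unsigned char>(str[str.size() - length])};

        if((byte & 0xc0) != 0x80) {
            // The trailing bytes form one character only if the lengths agree.
            return announcedLengthSketch(byte) == length;
        }
    }

    // Only continuation bytes were found within the last four bytes.
    return false;
}
```

A check of this kind is useful when text arrives in chunks, because a chunk boundary may fall in the middle of a multibyte character.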

◆ iso88591ToUtf8()

std::string crawlservpp::Helper::Utf8::iso88591ToUtf8 ( std::string_view  strIn)
inline

Converts a string from ISO-8859-1 to UTF-8.

Parameters
strIn  View of the string to be converted.
Returns
A copy of the converted string.

References bitmaskLastSixBits0b000001, bitmaskTopBit, bitmaskTopTwoBits, shiftSixBits, and utf8MemoryFactor.

Referenced by crawlservpp::Network::Curl::writerInClass().
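
The constants referenced above suggest the usual byte-wise ISO-8859-1 to UTF-8 mapping. The following is a minimal sketch under that assumption, not the library's source. The function name latin1ToUtf8Sketch and the constant name bitmaskLastSixBits are hypothetical; the constant values are copied from this page.

```cpp
#include <string>
#include <string_view>

// Local stand-ins that mirror the values of the namespace constants on this page.
constexpr auto utf8MemoryFactor{2};
constexpr auto bitmaskTopBit{0x80};
constexpr auto bitmaskTopTwoBits{0xc0};
constexpr auto shiftSixBits{6};
constexpr auto bitmaskLastSixBits{0x3F};

// Hypothetical re-implementation for illustration; NOT the library's actual code.
std::string latin1ToUtf8Sketch(std::string_view strIn) {
    std::string strOut;

    // Every ISO-8859-1 byte becomes at most two UTF-8 bytes.
    strOut.reserve(strIn.size() * utf8MemoryFactor);

    for(const char c : strIn) {
        const auto byte{static_cast<unsigned char>(c)};

        if(byte < bitmaskTopBit) {
            // 0x00..0x7F: ASCII, copied unchanged.
            strOut.push_back(static_cast<char>(byte));
        }
        else {
            // 0x80..0xFF: lead byte 110000xx followed by continuation byte 10xxxxxx.
            strOut.push_back(static_cast<char>(bitmaskTopTwoBits | (byte >> shiftSixBits)));
            strOut.push_back(static_cast<char>(bitmaskTopBit | (byte & bitmaskLastSixBits)));
        }
    }

    return strOut;
}
```

For example, the ISO-8859-1 byte 0xE9 (é) maps to the two UTF-8 bytes 0xC3 0xA9, which is why a memory factor of 2 is a safe upper bound for the reservation.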

◆ isSingleUtf8Char()

bool crawlservpp::Helper::Utf8::isSingleUtf8Char ( std::string_view  stringToCheck)
inline

Returns whether the given string contains exactly one UTF-8 code point.

Parameters
stringToCheck  View of the string to check for exactly one UTF-8 code point.
Returns
True if the given string contains exactly one UTF-8 code point. False otherwise.
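
What "exactly one UTF-8 code point" means at the byte level can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: the names sequenceLengthSketch and isSingleCodePointSketch are hypothetical, an empty string is assumed to yield false, and overlong encodings and surrogates are not rejected.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: expected sequence length for a lead byte, 0 if it is not a lead byte.
constexpr std::size_t sequenceLengthSketch(unsigned char lead) {
    if(lead < 0x80) return 1;            // 0xxxxxxx
    if((lead & 0xe0) == 0xc0) return 2;  // 110xxxxx
    if((lead & 0xf0) == 0xe0) return 3;  // 1110xxxx
    if((lead & 0xf8) == 0xf0) return 4;  // 11110xxx
    return 0;                            // continuation or invalid lead byte
}

// Hypothetical stand-in for isSingleUtf8Char(); structural check only.
bool isSingleCodePointSketch(std::string_view str) {
    // Assumption: an empty string does not contain a code point.
    if(str.empty()) return false;

    // The first byte must announce exactly as many bytes as the string holds.
    const auto expected{sequenceLengthSketch(static_cast<unsigned char>(str.front()))};

    if(expected == 0 || str.size() != expected) return false;

    // All remaining bytes must be continuation bytes (10xxxxxx).
    for(std::size_t i{1}; i < str.size(); ++i) {
        if((static_cast<unsigned char>(str[i]) & 0xc0) != 0x80) return false;
    }

    return true;
}
```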

◆ isValidUtf8()

bool crawlservpp::Helper::Utf8::isValidUtf8 ( std::string_view  stringToCheck,
std::string &  errTo 
)
inline

Checks whether a string contains valid UTF-8.

Uses the UTF8-CPP library for UTF-8 validation. See its GitHub repository for more information.

Parameters
stringToCheck  View of the string to check for valid UTF-8.
errTo  Reference to a string to which a UTF-8 error will be written.
Returns
True if the given string contains valid UTF-8. False otherwise.

Referenced by crawlservpp::Module::Crawler::Thread::onReset(), and crawlservpp::Module::Crawler::Database::urlUtf8Check().

◆ length()

std::size_t crawlservpp::Helper::Utf8::length ( std::string_view  str)
inline

Returns the number of UTF-8 code points in the given string.

Parameters
str  The string whose UTF-8 code points will be counted.
Returns
The number of UTF-8 code points found in the string.
Exceptions
Utf8::Exception  if the string contains invalid UTF-8 code points.

References crawlservpp::Helper::Container::bytes().

Referenced by crawlservpp::Wrapper::TidyDoc::cleanAndRepair(), crawlservpp::Data::Lemmatizer::clear(), crawlservpp::Data::Corpus::clear(), crawlservpp::Module::Analyzer::Algo::CorpusGenerator::onAlgoInit(), crawlservpp::Helper::Json::stringify(), crawlservpp::Main::Server::tick(), and crawlservpp::Data::PickleDict::writeTo().
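
The difference between the byte size of a string and its length in code points can be shown with a short sketch. It assumes the input is already valid UTF-8 and, unlike length(), neither validates nor throws. The name countCodePointsSketch is hypothetical and the code is not taken from the library.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: counts code points by counting every byte that is
// not a continuation byte (10xxxxxx). Assumes valid UTF-8 input.
std::size_t countCodePointsSketch(std::string_view str) {
    constexpr unsigned char topTwoBits{0xc0};
    constexpr unsigned char continuation{0x80};

    std::size_t count{0};

    for(const char c : str) {
        if((static_cast<unsigned char>(c) & topTwoBits) != continuation) {
            ++count;
        }
    }

    return count;
}
```

With this sketch, the five-byte string "caf\xC3\xA9" (café) counts as four code points, while std::string_view::size() reports five bytes.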

◆ repairUtf8()

bool crawlservpp::Helper::Utf8::repairUtf8 ( std::string_view  strIn,
std::string &  strOut 
)
inline

Replaces invalid UTF-8 characters in the given string and returns whether invalid characters occurred.

Uses the UTF8-CPP library for UTF-8 validation and replacement. See its GitHub repository for more information.

Parameters
strIn  View of the string in which invalid UTF-8 characters will be replaced.
strOut  Reference to a string that will be replaced with the resulting string if invalid UTF-8 characters have been encountered.
Returns
True if the given string contains invalid UTF-8 characters that have been replaced in the resulting string.
Exceptions
Utf8::Exception  if invalid characters could not be replaced.

Referenced by crawlservpp::Module::Database::log(), crawlservpp::Main::Database::log(), and crawlservpp::Network::Curl::writerInClass().

Variable Documentation

◆ bitmaskLastSixBits0b000001

constexpr auto crawlservpp::Helper::Utf8::bitmaskLastSixBits0b000001 {0x3F}
inline

Bit mask to extract the last six bits of a byte (0x3F is 0b111111).

Referenced by iso88591ToUtf8().

◆ bitmaskTopBit

constexpr auto crawlservpp::Helper::Utf8::bitmaskTopBit {0x80}
inline

Bit mask to extract the first bit of a multibyte character.

Referenced by iso88591ToUtf8().

◆ bitmaskTopTwoBits

constexpr auto crawlservpp::Helper::Utf8::bitmaskTopTwoBits {0xc0}
inline

Bit mask to extract the top two bits of a multibyte character.

Referenced by iso88591ToUtf8().

◆ fourBytes

constexpr auto crawlservpp::Helper::Utf8::fourBytes {4}
inline

Four bytes.

Referenced by isLastCharValidUtf8().

◆ oneByte

constexpr auto crawlservpp::Helper::Utf8::oneByte {1}
inline

One byte.

Referenced by isLastCharValidUtf8().

◆ shiftSixBits

constexpr auto crawlservpp::Helper::Utf8::shiftSixBits {6}
inline

Shift six bits.

Referenced by iso88591ToUtf8().

◆ threeBytes

constexpr auto crawlservpp::Helper::Utf8::threeBytes {3}
inline

Three bytes.

Referenced by isLastCharValidUtf8().

◆ twoBytes

constexpr auto crawlservpp::Helper::Utf8::twoBytes {2}
inline

Two bytes.

Referenced by isLastCharValidUtf8().

◆ utf8MemoryFactor

constexpr auto crawlservpp::Helper::Utf8::utf8MemoryFactor {2}
inline

Factor for guessing the maximum amount of memory used for UTF-8 compared to ISO-8859-1.

Referenced by iso88591ToUtf8().