|
crawlserv++
[under development]
Application for crawling and analyzing textual content of websites.
|
Namespace for global UTF-8 encoding functions.
Classes | |
| class | Exception |
| Class for UTF-8 exceptions. | |
Functions | |
| bool | isLastCharValidUtf8 (const std::string &stringToCheck) |
| Checks the last character (i.e. up to four bytes at the end) of the given string for valid UTF-8. | |
Constants | |
| constexpr auto | utf8MemoryFactor {2} |
| Factor for guessing the maximum amount of memory used for UTF-8 compared to ISO-8859-1. | |
| constexpr auto | bitmaskTopBit {0x80} |
| Bit mask to extract the first bit of a multibyte character. | |
| constexpr auto | bitmaskTopTwoBits {0xc0} |
| Bit mask to extract the top two bits of a multibyte character. | |
| constexpr auto | shiftSixBits {6} |
| Shift six bits. | |
| constexpr auto | bitmaskLastSixBits0b000001 {0x3F} |
| Bit mask to check the last six bits for 0b000001. | |
| constexpr auto | oneByte {1} |
| One byte. | |
| constexpr auto | twoBytes {2} |
| Two bytes. | |
| constexpr auto | threeBytes {3} |
| Three bytes. | |
| constexpr auto | fourBytes {4} |
| Four bytes. | |
Conversion | |
| std::string | iso88591ToUtf8 (std::string_view strIn) |
| Converts a string from ISO-8859-1 to UTF-8. | |
Validation | |
| bool | isValidUtf8 (std::string_view stringToCheck, std::string &errTo) |
| Checks whether a string contains valid UTF-8. | |
| bool | isLastCharValidUtf8 (std::string_view stringToCheck) |
| bool | isSingleUtf8Char (std::string_view stringToCheck) |
| Returns whether the given string contains exactly one UTF-8 code point. | |
Repair | |
| bool | repairUtf8 (std::string_view strIn, std::string &strOut) |
| Replaces invalid UTF-8 characters in the given string and returns whether invalid characters occurred. | |
Length | |
| std::size_t | length (std::string_view str) |
Namespace for global UTF-8 encoding functions.
| bool crawlservpp::Helper::Utf8::isLastCharValidUtf8 | ( | std::string_view | stringToCheck | ) |
Referenced by crawlservpp::Data::Corpus::clear().
|
inline |
Checks the last character (i.e. up to four bytes at the end) of the given string for valid UTF-8.
Uses the UTF8-CPP library for UTF-8 validation. See its GitHub repository for more information.
| stringToCheck | Constant reference to the string whose last character will be checked for valid UTF-8. |
References fourBytes, oneByte, threeBytes, and twoBytes.
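The check described above can be illustrated without the UTF8-CPP library. The following is a minimal sketch, not the crawlserv++ implementation: the name `lastCharLooksLikeUtf8` is hypothetical, the function inspects only the byte structure (lead byte plus continuation bytes) of the last up to four bytes, and treating an empty string as valid is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical helper (not part of crawlserv++): structural check of the
// last character only; overlong encodings and surrogates are NOT detected.
inline bool lastCharLooksLikeUtf8(std::string_view s) {
    if (s.empty()) {
        return true; // assumption: nothing to check in an empty string
    }

    // walk back over at most three continuation bytes (0b10xxxxxx)
    std::size_t pos{s.size() - 1};
    std::size_t continuation{0};

    while (pos > 0 && continuation < 3
            && (static_cast<unsigned char>(s[pos]) & 0xC0) == 0x80) {
        --pos;
        ++continuation;
    }

    // derive the expected sequence length from the lead byte
    const auto lead{static_cast<unsigned char>(s[pos])};
    std::size_t expected{0};

    if (lead < 0x80) expected = 1;                 // 0xxxxxxx
    else if ((lead & 0xE0) == 0xC0) expected = 2;  // 110xxxxx
    else if ((lead & 0xF0) == 0xE0) expected = 3;  // 1110xxxx
    else if ((lead & 0xF8) == 0xF0) expected = 4;  // 11110xxx
    else return false;                             // stray continuation or invalid lead byte

    return expected == continuation + 1;
}

// usage: a truncated two-byte sequence at the end is rejected
inline void lastCharExample() {
    assert(lastCharLooksLikeUtf8("a\xC3\xA4")); // ends with a complete two-byte character
    assert(!lastCharLooksLikeUtf8("a\xC3"));     // lead byte without its continuation byte
}
```

A check of this kind is useful when a buffer may have been cut in the middle of a multibyte character, because only the tail of the string needs to be inspected.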
| std::string crawlservpp::Helper::Utf8::iso88591ToUtf8 | ( | std::string_view | strIn | ) | inline |
Converts a string from ISO-8859-1 to UTF-8.
| strIn | View of the string to be converted. |
References bitmaskLastSixBits0b000001, bitmaskTopBit, bitmaskTopTwoBits, shiftSixBits, and utf8MemoryFactor.
Referenced by crawlservpp::Network::Curl::writerInClass().
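How the referenced constants can fit together is shown in the sketch below. It is an illustration under assumptions, not the crawlserv++ source: the function name `latin1ToUtf8Sketch` and the `k...` constant names are hypothetical, and only the constant values (2, 0x80, 0xC0, 6, 0x3F) are taken from this page. Every ISO-8859-1 byte below 0x80 is copied unchanged, and every other byte is split into two UTF-8 bytes.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical names; the values mirror the constants listed on this page.
constexpr auto kMemoryFactor{2};   // at most two UTF-8 bytes per ISO-8859-1 byte
constexpr auto kTopBit{0x80};      // 0b10000000
constexpr auto kTopTwoBits{0xC0};  // 0b11000000
constexpr auto kShiftSixBits{6};
constexpr auto kLastSixBits{0x3F}; // 0b00111111

inline std::string latin1ToUtf8Sketch(std::string_view in) {
    std::string out;

    // reserve the worst case up front to avoid reallocations
    out.reserve(in.size() * kMemoryFactor);

    for (const char c : in) {
        const auto byte{static_cast<unsigned char>(c)};

        if (byte < kTopBit) {
            // ASCII range: identical in ISO-8859-1 and UTF-8
            out.push_back(c);
        }
        else {
            // lead byte: 110000xx, carrying the top two bits of the input
            out.push_back(static_cast<char>(kTopTwoBits | (byte >> kShiftSixBits)));

            // continuation byte: 10xxxxxx, carrying the last six bits
            out.push_back(static_cast<char>(kTopBit | (byte & kLastSixBits)));
        }
    }

    return out;
}

// usage: 0xE4 (a-umlaut in ISO-8859-1) becomes 0xC3 0xA4 in UTF-8
inline void latin1ToUtf8Example() {
    assert(latin1ToUtf8Sketch("\xE4") == "\xC3\xA4");
    assert(latin1ToUtf8Sketch("abc") == "abc");
}
```

Because ISO-8859-1 maps each byte directly to the Unicode code point of the same value, no lookup table is needed and the output never exceeds twice the input size.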
| bool crawlservpp::Helper::Utf8::isSingleUtf8Char | ( | std::string_view | stringToCheck | ) | inline |
Returns whether the given string contains exactly one UTF-8 code point.
| stringToCheck | String view to a string that will be checked for containing exactly one UTF-8 code point. |
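One way to test for exactly one code point is to compare the sequence length announced by the lead byte with the size of the string. The sketch below assumes this approach; the name `looksLikeSingleCodePoint` is hypothetical, and the check is structural only (it does not reject overlong encodings or surrogates), so it may differ from the actual crawlserv++ behaviour.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical helper (not part of crawlserv++): true if the string consists
// of one lead byte followed by exactly the continuation bytes it announces.
inline bool looksLikeSingleCodePoint(std::string_view s) {
    if (s.empty()) {
        return false; // zero code points
    }

    const auto lead{static_cast<unsigned char>(s[0])};
    std::size_t expected{0};

    if (lead < 0x80) expected = 1;                 // 0xxxxxxx
    else if ((lead & 0xE0) == 0xC0) expected = 2;  // 110xxxxx
    else if ((lead & 0xF0) == 0xE0) expected = 3;  // 1110xxxx
    else if ((lead & 0xF8) == 0xF0) expected = 4;  // 11110xxx
    else return false;                             // not a valid lead byte

    if (s.size() != expected) {
        return false; // too few or too many bytes for one code point
    }

    // all remaining bytes must be continuation bytes (0b10xxxxxx)
    for (std::size_t i{1}; i < s.size(); ++i) {
        if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80) {
            return false;
        }
    }

    return true;
}

// usage: the three-byte euro sign is one code point, "ab" is two
inline void singleCodePointExample() {
    assert(looksLikeSingleCodePoint("\xE2\x82\xAC"));
    assert(!looksLikeSingleCodePoint("ab"));
}
```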
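| bool crawlservpp::Helper::Utf8::isValidUtf8 | ( | std::string_view stringToCheck, std::string & errTo | ) | inline |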
Checks whether a string contains valid UTF-8.
Uses the UTF8-CPP library for UTF-8 validation. See its GitHub repository for more information.
| stringToCheck | View of the string to check for valid UTF-8. |
| errTo | Reference to a string to which a UTF-8 error will be written. |
Referenced by crawlservpp::Module::Crawler::Thread::onReset(), and crawlservpp::Module::Crawler::Database::urlUtf8Check().
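The calling convention (boolean result plus an error string written through a reference) can be sketched with a hand-written validator. This is an illustration that does not use UTF8-CPP: the name `validateUtf8Sketch` and the wording of the error messages are hypothetical, and leaving `errTo` untouched on success is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>

// Hypothetical validator (not the crawlserv++ implementation): decodes each
// sequence and rejects bad lead bytes, truncation, missing continuation
// bytes, overlong encodings, surrogates, and code points above U+10FFFF.
inline bool validateUtf8Sketch(std::string_view s, std::string& errTo) {
    std::size_t i{0};

    while (i < s.size()) {
        const auto lead{static_cast<unsigned char>(s[i])};
        std::size_t len{0};
        std::uint32_t cp{0};    // decoded code point
        std::uint32_t minCp{0}; // smallest code point allowed for this length

        if (lead < 0x80) { len = 1; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; minCp = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; minCp = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; minCp = 0x10000; }
        else {
            errTo = "invalid lead byte at offset " + std::to_string(i);
            return false;
        }

        if (i + len > s.size()) {
            errTo = "truncated sequence at offset " + std::to_string(i);
            return false;
        }

        for (std::size_t k{1}; k < len; ++k) {
            const auto next{static_cast<unsigned char>(s[i + k])};

            if ((next & 0xC0) != 0x80) {
                errTo = "missing continuation byte at offset " + std::to_string(i + k);
                return false;
            }

            cp = (cp << 6) | (next & 0x3F);
        }

        if (cp < minCp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            errTo = "invalid code point at offset " + std::to_string(i);
            return false;
        }

        i += len;
    }

    return true;
}

// usage: the error string is only filled when validation fails
inline void validateUtf8Example() {
    std::string err;

    assert(validateUtf8Sketch("a\xC3\xA4", err) && err.empty());
    assert(!validateUtf8Sketch("a\xC3", err));
    assert(err == "truncated sequence at offset 1");
}
```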
| std::size_t crawlservpp::Helper::Utf8::length | ( | std::string_view | str | ) | inline |
Returns the number of UTF-8 codepoints in the given string.
| str | The string to be checked for UTF-8 codepoints. |
| Utf8::Exception | if the string contains invalid UTF-8 codepoints. |
References crawlservpp::Helper::Container::bytes().
Referenced by crawlservpp::Wrapper::TidyDoc::cleanAndRepair(), crawlservpp::Data::Lemmatizer::clear(), crawlservpp::Data::Corpus::clear(), crawlservpp::Module::Analyzer::Algo::CorpusGenerator::onAlgoInit(), crawlservpp::Helper::Json::stringify(), crawlservpp::Main::Server::tick(), and crawlservpp::Data::PickleDict::writeTo().
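The difference between the byte length and the number of code points can be illustrated as follows. The sketch assumes the input is already valid UTF-8 and simply counts every byte that is not a continuation byte; unlike the function documented here, it performs no validation and throws no exception. The name `countCodePointsSketch` is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical helper (not part of crawlserv++): counts code points in a
// string that is ASSUMED to be valid UTF-8. Each code point contributes
// exactly one byte that is not a continuation byte (0b10xxxxxx).
inline std::size_t countCodePointsSketch(std::string_view s) {
    return static_cast<std::size_t>(
            std::count_if(s.begin(), s.end(), [](char c) {
                return (static_cast<unsigned char>(c) & 0xC0) != 0x80;
            })
    );
}

// usage: "a" plus the three-byte euro sign is four bytes but two code points
inline void countCodePointsExample() {
    constexpr std::string_view text{"a\xE2\x82\xAC"};

    assert(text.size() == 4);
    assert(countCodePointsSketch(text) == 2);
}
```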
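| bool crawlservpp::Helper::Utf8::repairUtf8 | ( | std::string_view strIn, std::string & strOut | ) | inline |
Replaces invalid UTF-8 characters in the given string and returns whether invalid characters occurred.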
Uses the UTF8-CPP library for UTF-8 validation and replacement. See its GitHub repository for more information.
| strIn | View of the string in which invalid UTF-8 characters will be replaced. |
| strOut | Reference to a string that will be replaced with the resulting string, if invalid UTF-8 characters have been encountered. |
| Utf8::Exception | if invalid characters could not be replaced. |
Referenced by crawlservpp::Module::Database::log(), crawlservpp::Main::Database::log(), and crawlservpp::Network::Curl::writerInClass().
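The contract described above (the output string is only written when something had to be replaced) can be sketched without UTF8-CPP. The names `validSequenceLength` and `repairUtf8Sketch` are hypothetical, and replacing each invalid byte with one U+FFFD replacement character is an assumption that may differ from the behaviour of the library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <utility>

// Hypothetical helper: length of the valid UTF-8 sequence starting at s[i],
// or 0 if the bytes at that position do not form a valid sequence.
inline std::size_t validSequenceLength(std::string_view s, std::size_t i) {
    const auto lead{static_cast<unsigned char>(s[i])};
    std::size_t len{0};
    std::uint32_t cp{0};
    std::uint32_t minCp{0};

    if (lead < 0x80) return 1;
    if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; minCp = 0x80; }
    else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; minCp = 0x800; }
    else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; minCp = 0x10000; }
    else return 0; // stray continuation byte or invalid lead byte

    if (i + len > s.size()) return 0; // truncated sequence

    for (std::size_t k{1}; k < len; ++k) {
        const auto next{static_cast<unsigned char>(s[i + k])};

        if ((next & 0xC0) != 0x80) return 0;

        cp = (cp << 6) | (next & 0x3F);
    }

    // reject overlong encodings, out-of-range values, and surrogates
    if (cp < minCp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return 0;

    return len;
}

// Hypothetical sketch: returns true and writes to 'out' only if at least one
// invalid byte was replaced by U+FFFD (encoded as 0xEF 0xBF 0xBD).
inline bool repairUtf8Sketch(std::string_view in, std::string& out) {
    std::string repaired;
    bool replaced{false};
    std::size_t i{0};

    while (i < in.size()) {
        const auto len{validSequenceLength(in, i)};

        if (len == 0) {
            repaired += "\xEF\xBF\xBD";
            replaced = true;
            ++i; // skip one invalid byte and resynchronize
        }
        else {
            repaired.append(in.data() + i, len);
            i += len;
        }
    }

    if (replaced) {
        out = std::move(repaired);
    }

    return replaced;
}

// usage: the output is left untouched for valid input
inline void repairUtf8Example() {
    std::string out{"unchanged"};

    assert(!repairUtf8Sketch("a\xC3\xA4", out) && out == "unchanged");
    assert(repairUtf8Sketch("a\xFFz", out) && out == "a\xEF\xBF\xBDz");
}
```

A caller would typically continue with the original string when the function returns false and switch to the output string otherwise.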
| constexpr auto crawlservpp::Helper::Utf8::bitmaskLastSixBits0b000001 {0x3F} | inline |
Bit mask to check the last six bits for 0b000001.
Referenced by iso88591ToUtf8().
| constexpr auto crawlservpp::Helper::Utf8::bitmaskTopBit {0x80} | inline |
Bit mask to extract the first bit of a multibyte character.
Referenced by iso88591ToUtf8().
| constexpr auto crawlservpp::Helper::Utf8::bitmaskTopTwoBits {0xc0} | inline |
Bit mask to extract the top two bits of a multibyte character.
Referenced by iso88591ToUtf8().
| constexpr auto crawlservpp::Helper::Utf8::fourBytes {4} | inline |
Four bytes.
Referenced by isLastCharValidUtf8().
| constexpr auto crawlservpp::Helper::Utf8::oneByte {1} | inline |
One byte.
Referenced by isLastCharValidUtf8().
| constexpr auto crawlservpp::Helper::Utf8::shiftSixBits {6} | inline |
Shift six bits.
Referenced by iso88591ToUtf8().
| constexpr auto crawlservpp::Helper::Utf8::threeBytes {3} | inline |
Three bytes.
Referenced by isLastCharValidUtf8().
| constexpr auto crawlservpp::Helper::Utf8::twoBytes {2} | inline |
Two bytes.
Referenced by isLastCharValidUtf8().
| constexpr auto crawlservpp::Helper::Utf8::utf8MemoryFactor {2} | inline |
Factor for guessing the maximum amount of memory used for UTF-8 compared to ISO-8859-1.
Referenced by iso88591ToUtf8().