Generate C source code for is() methods for Unicode UTF-8.

#include <time.h>
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <iostream>
#include <vector>
#include <map>
#include "file.h"
#include "bitstring.h"

[Include dependency graph for unicode_database_to_c.cpp]

Functions
void	usage (char *filename)
	Tell the user how to use this program.

void	serialise (const std::string &operation, const std::vector< size_t > &list)
	Walk through each of the lists turning it into C code.

unsigned	process_unicodedata (const char *line, unsigned last_codepoint)
	Given a line from the UnicodeData.txt file, work out which functions should know about this codepoint.

void	process_proplist (const char *line)
	Given a line from the PropList.txt file, work out which functions should know about this codepoint.

void	make_codepoint_isalnum (void)
	Construct the codepoint_isalnum[] map for determining whether or not a given codepoint is alphanumeric.

void	foldcase (std::vector< int > &destination, int codepoint)

void	process_normalisation_recursively (std::vector< int > &answer, int head_codepoint)
	Given a codepoint, apply the normalisation rules recursively to get an expansion.

void	process_JASS_normalization (const char *line)
	Process a single line of UnicodeData.txt and extract the normalisation of that codepoint.

void	normalize (void)

void	process_casefolding (const char *line)
	Process a single line of CaseFolding.txt and extract the full case folding data (that is, the "C+F" subset).

int	main_event (int argc, char *argv[])

int	main (int argc, char *argv[])

Variables
static const size_t	MAX_CODEPOINT = 0x10FFFF

std::vector< size_t >	alpha
	list of alphabetical characters

std::vector< size_t >	uppercase
	list of uppercase characters

std::vector< size_t >	lowercase
	list of lowercase characters

std::vector< size_t >	digit
	list of digits

std::vector< size_t >	alphanumeric
	list of alphanumeric characters

std::vector< size_t >	punc
	list of punctuation symbols

std::vector< size_t >	space
	list of Unicode space characters (not a superset of C's isspace())

std::vector< size_t >	whitespace
	list of space characters (a superset of C's isspace())

std::vector< size_t >	mark
	list of diacritic marks

std::vector< size_t >	symbol
	list of symbols

std::vector< size_t >	control
	list of control characters

std::vector< size_t >	graphical
	list of graphical (printable) characters

std::vector< size_t >	xdigit
	list of Unicode defined hexadecimal characters

std::map< int, std::vector< int > >	JASS_normalisation
	JASS normalisation rules (one codepoint can become more than one codepoint)

std::map< int, std::vector< int > >	casefold
	The casefolded version of the codepoint.

std::map< size_t, bool >	codepoint_isalnum

std::vector< size_t >	xmlnamestartchar

std::vector< size_t >	xmlnamechar

Detailed Description

Generate C source code for is() methods for Unicode UTF-8.

Author: Andrew Trotman

Copyright: 2016 Andrew Trotman

Function Documentation

◆ normalize()

void normalize ( void )

Compute the normalisation for the entire Unicode database.

◆ process_casefolding()

void process_casefolding ( const char * line )

Process a single line of CaseFolding.txt and extract the full case folding data (that is, the "C+F" subset).

Parameters

line	[in] A single line from CaseFolding.txt
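
As a rough illustration of what extracting the "C+F" subset involves, the sketch below parses one CaseFolding.txt line of the form "code; status; mapping; # name" and keeps only entries whose status is C (common) or F (full). The names parse_casefolding_line and casefold_sketch are hypothetical and are not taken from this file; the real process_casefolding() may work differently.

```cpp
#include <cstdlib>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for the casefold map documented on this page.
static std::map<int, std::vector<int>> casefold_sketch;

// Parse one line of CaseFolding.txt: "<code>; <status>; <mapping>; # <name>".
// Only status C (common) and F (full) are stored; S (simple) and T (Turkic) are skipped.
// Returns true if a mapping was stored.
bool parse_casefolding_line(const std::string &line)
{
    if (line.empty() || line[0] == '#')
        return false;                       // blank line or comment

    std::istringstream fields(line);
    std::string code, status, mapping;
    if (!std::getline(fields, code, ';') ||
        !std::getline(fields, status, ';') ||
        !std::getline(fields, mapping, ';'))
        return false;                       // not a data line

    // The status field is padded with spaces, so search for the letter itself.
    if (status.find('C') == std::string::npos && status.find('F') == std::string::npos)
        return false;

    int codepoint = static_cast<int>(std::strtol(code.c_str(), nullptr, 16));

    // The mapping is one or more space-separated hexadecimal codepoints.
    std::vector<int> folded;
    std::istringstream targets(mapping);
    std::string hex;
    while (targets >> hex)
        folded.push_back(static_cast<int>(std::strtol(hex.c_str(), nullptr, 16)));

    casefold_sketch[codepoint] = folded;
    return true;
}
```

With this sketch, the line "00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S" stores the two-codepoint folding U+0073 U+0073 for U+00DF, while an S or T line is ignored.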

The JASS normalisation process is: apply Unicode NFKD normalisation, remove all non-alphanumerics, then case fold.
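
The three steps can be sketched on a single codepoint as follows. This is a minimal illustration under stated assumptions: the tables nfkd_sketch, alnum_sketch and fold_sketch are hypothetical hand-filled miniatures (the generator builds its tables from the Unicode data files), and jass_normalise_sketch is not a function in this file.

```cpp
#include <map>
#include <set>
#include <vector>

// Hypothetical miniature tables, filled by hand for illustration only.
static const std::map<int, std::vector<int>> nfkd_sketch = {
    {0x00C5, {0x0041, 0x030A}}      // A WITH RING ABOVE -> A + COMBINING RING ABOVE
};
static const std::set<int> alnum_sketch = {0x0041, 0x0061};
static const std::map<int, std::vector<int>> fold_sketch = {
    {0x0041, {0x0061}}              // A -> a
};

std::vector<int> jass_normalise_sketch(int codepoint)
{
    // Step 1: decompose (a codepoint with no known decomposition maps to itself).
    std::vector<int> decomposed = {codepoint};
    auto rule = nfkd_sketch.find(codepoint);
    if (rule != nfkd_sketch.end())
        decomposed = rule->second;

    std::vector<int> answer;
    for (int point : decomposed) {
        // Step 2: remove non-alphanumerics (this drops the combining mark).
        if (alnum_sketch.count(point) == 0)
            continue;

        // Step 3: case fold, which may produce more than one codepoint.
        auto folded = fold_sketch.find(point);
        if (folded != fold_sketch.end())
            answer.insert(answer.end(), folded->second.begin(), folded->second.end());
        else
            answer.push_back(point);
    }
    return answer;
}
```

Under these tables, U+00C5 decomposes to U+0041 U+030A, loses the combining mark, and folds to the single codepoint U+0061.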

◆ process_JASS_normalization()

void process_JASS_normalization ( const char * line )

Process a single line of UnicodeData.txt and extract the normalisation of that codepoint.

Parameters

line	[in] A single line from UnicodeData.txt.

◆ process_normalisation_recursively()

void process_normalisation_recursively ( std::vector< int > & answer, int head_codepoint )

Given a codepoint, apply the normalisation rules recursively to get an expansion.

Parameters

answer	[out] The recursively expanded (but not re-ordered) answer to the expansion.
head_codepoint	[in] The codepoint to compute the expansion for.
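
Recursive expansion matters because a decomposition in UnicodeData.txt can itself contain codepoints that decompose further. The sketch below shows the idea with a hypothetical two-entry table (decomposition_sketch) and a hypothetical function name (expand_recursively); it assumes the table has no cycles and, like the function documented here, does not re-order the result.

```cpp
#include <map>
#include <vector>

// Hypothetical single-step decomposition table holding two UnicodeData.txt entries.
static const std::map<int, std::vector<int>> decomposition_sketch = {
    {0x01D5, {0x00DC, 0x0304}},     // U WITH DIAERESIS AND MACRON -> U WITH DIAERESIS + COMBINING MACRON
    {0x00DC, {0x0055, 0x0308}}      // U WITH DIAERESIS -> U + COMBINING DIAERESIS
};

// Append the full expansion of head_codepoint to answer, in left-to-right order.
void expand_recursively(std::vector<int> &answer, int head_codepoint)
{
    auto rule = decomposition_sketch.find(head_codepoint);
    if (rule == decomposition_sketch.end()) {
        answer.push_back(head_codepoint);   // no rule: the codepoint stands for itself
        return;
    }
    for (int part : rule->second)
        expand_recursively(answer, part);   // each part may decompose again
}
```

Expanding U+01D5 with this table yields U+0055 U+0308 U+0304: the first step produces U+00DC U+0304, and the second step expands U+00DC.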

◆ process_proplist()

void process_proplist ( const char * line )

Given a line from the PropList.txt file, work out which functions should know about this codepoint.

Parameters

line	[in] The line to process.
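
For orientation, a PropList.txt data line holds either a single codepoint or a "first..last" range, then a semicolon, a property name, and a trailing "#" comment. The sketch below splits such a line into its parts; property_range and parse_proplist_line are hypothetical names, and how the real process_proplist() dispatches on the property name is not shown.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>

// Hypothetical holder for one parsed PropList.txt entry.
struct property_range {
    std::size_t first;
    std::size_t last;
    std::string property;
};

// Parse "XXXX ; Property # comment" or "XXXX..YYYY ; Property # comment".
// Returns false for comment lines, blank lines, and lines that do not parse.
bool parse_proplist_line(const std::string &line, property_range &into)
{
    std::string data = line.substr(0, line.find('#'));   // drop the trailing comment
    std::size_t semicolon = data.find(';');
    if (semicolon == std::string::npos)
        return false;

    std::string codes = data.substr(0, semicolon);
    std::string property = data.substr(semicolon + 1);

    // Trim whitespace around the property name.
    const char *blank = " \t\r\n";
    std::size_t start = property.find_first_not_of(blank);
    if (start == std::string::npos)
        return false;
    std::size_t end = property.find_last_not_of(blank);
    into.property = property.substr(start, end - start + 1);

    // Read the first codepoint, then an optional "..last".
    char *after;
    into.first = into.last = std::strtoul(codes.c_str(), &after, 16);
    if (after == codes.c_str())
        return false;                                     // no hexadecimal digits found
    if (after[0] == '.' && after[1] == '.')
        into.last = std::strtoul(after + 2, nullptr, 16);
    return true;
}
```

For the line "0009..000D    ; White_Space # Cc   [5] <control-0009>..<control-000D>" the sketch reports first = 0x9, last = 0xD and property "White_Space".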

◆ process_unicodedata()

unsigned process_unicodedata ( const char * line, unsigned last_codepoint )

Given a line from the UnicodeData.txt file, work out which functions should know about this codepoint.

Parameters

line	[in] The line to process.
last_codepoint	[in] The start of the current Unicode range

Returns: The last Unicode codepoint we've seen (which must be passed back next call for ranges).
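
The reason the previous codepoint is threaded through each call is that UnicodeData.txt describes large blocks with a pair of lines whose names end in ", First>" and ", Last>" rather than listing every codepoint. The sketch below illustrates that pattern. The names parse_unicodedata_line and alpha_sketch are hypothetical, and treating every general category beginning with "L" as alphabetic is an assumption made for the example, not a statement of what this file does.

```cpp
#include <cstddef>
#include <cstdlib>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for the alpha list documented on this page.
static std::vector<std::size_t> alpha_sketch;

// Parse one UnicodeData.txt line ("code;name;general category;...") and return its codepoint,
// which the caller passes back on the next call so that ranges can be filled in.
unsigned parse_unicodedata_line(const std::string &line, unsigned last_codepoint)
{
    std::istringstream fields(line);
    std::string code, name, category;
    std::getline(fields, code, ';');
    std::getline(fields, name, ';');
    std::getline(fields, category, ';');

    unsigned codepoint = static_cast<unsigned>(std::strtoul(code.c_str(), nullptr, 16));

    // A "<..., Last>" line closes the range opened by the preceding "<..., First>" line.
    const std::string marker = ", Last>";
    bool closes_range = name.size() >= marker.size() &&
        name.compare(name.size() - marker.size(), marker.size(), marker) == 0;
    unsigned from = closes_range ? last_codepoint + 1 : codepoint;

    // Assumption for this sketch: general categories starting with 'L' count as alphabetic.
    if (!category.empty() && category[0] == 'L')
        for (unsigned point = from; point <= codepoint; point++)
            alpha_sketch.push_back(point);

    return codepoint;
}
```

Feeding the sketch the "AC00;<Hangul Syllable, First>;Lo;..." line followed by the "D7A3;<Hangul Syllable, Last>;Lo;..." line adds all 11172 Hangul syllables, because the second call fills in everything after the codepoint returned by the first.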

◆ serialise()

void serialise ( const std::string & operation, const std::vector< size_t > & list )

Walk through each of the lists turning it into C code.

Parameters

operation	[in] The name of the "is" operation.
list	[in] The list of which codepoints are valid for this "is" condition.
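
One plausible way to turn a codepoint list into C code is to emit a bitmap with one bit per codepoint. The sketch below does that; it is an assumption about the general approach, not a description of the output format this program actually writes. The function name serialise_sketch, the generated array name, and the extra max_codepoint parameter are all hypothetical.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Build C source for a bitmap in which bit (c % 8) of byte (c / 8) is set
// when codepoint c is in the list. Codepoints above max_codepoint are ignored.
std::string serialise_sketch(const std::string &operation,
                             const std::vector<std::size_t> &list,
                             std::size_t max_codepoint)
{
    std::vector<unsigned char> bits(max_codepoint / 8 + 1, 0);
    for (std::size_t codepoint : list)
        if (codepoint <= max_codepoint)
            bits[codepoint / 8] |= static_cast<unsigned char>(1u << (codepoint % 8));

    // Write the bitmap out as a C array definition.
    std::ostringstream code;
    code << "static const unsigned char unicode_" << operation << "_data[] = {";
    for (std::size_t byte = 0; byte < bits.size(); byte++)
        code << (byte == 0 ? "" : ", ") << static_cast<unsigned>(bits[byte]);
    code << "};\n";
    return code.str();
}
```

Generated code could then test membership with an expression such as (data[c >> 3] >> (c & 7)) & 1. For the full Unicode range, max_codepoint would be the MAX_CODEPOINT value of 0x10FFFF listed above.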

◆ usage()

void usage ( char * filename )

Tell the user how to use this program.

Parameters

filename	[in] The name of this executable (normally argv[0]).

Variable Documentation

◆ JASS_normalisation

std::map<int, std::vector<int> > JASS_normalisation

JASS normalisation rules (one codepoint can become more than one codepoint)

An array of pointers to JASS normalised codepoints for the given codepoint.
