utf8 (GNU Dico Manual)

D.10 UTF-8

This section describes functions for handling UTF-8 strings. A UTF-8 character can be represented either as a multi-byte character or a wide character.

A multi-byte character is a char * pointing to one or more bytes representing the UTF-8 character.

A wide character is an unsigned value identifying the character.

In the discussion below, a sequence of one or more multi-byte characters is called a multi-byte string. Multi-byte strings terminate with a single ‘nul’ (0) character.

A sequence of one or more wide characters is called a wide character string. Such strings terminate with a single 0 value.

D.10.1 Character sizes

Function: size_t utf8_char_width (const unsigned char *cp): Returns the length in bytes of the UTF-8 character representation pointed to by cp.

Function: size_t utf8_strlen (const char *str): Returns the number of UTF-8 characters (not bytes) in str.

Function: size_t utf8_wc_strlen (const unsigned *s): Returns the number of wide characters in the wide character string s.
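
To illustrate the difference between byte length and character count, here is a minimal sketch of how such helpers might work. The names sketch_char_width and sketch_strlen are hypothetical and this is not the libdico implementation; the sketch assumes well-formed input of at most four bytes per character.

```c
#include <stddef.h>

/* Hypothetical helper: byte length of the UTF-8 sequence starting at cp,
   judged from the lead byte alone.  Returns 0 for a byte that cannot
   start a sequence (a continuation byte or 0xF8..0xFF). */
static size_t sketch_char_width(const unsigned char *cp)
{
    if (*cp < 0x80) return 1;            /* 0xxxxxxx: ASCII */
    if ((*cp & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((*cp & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((*cp & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 0;
}

/* Hypothetical helper: count characters by counting every byte that is
   not a continuation byte (10xxxxxx). */
static size_t sketch_strlen(const char *str)
{
    size_t n = 0;
    for (const unsigned char *p = (const unsigned char *)str; *p; p++)
        if ((*p & 0xC0) != 0x80)
            n++;
    return n;
}
```

For the three-character string "a", U+00E9, U+20AC, which occupies six bytes, sketch_strlen yields 3 while strlen(3) would yield 6.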

D.10.2 Iterating over UTF-8 strings

struct: utf8_iterator

A data type for iterating over a string of UTF-8 characters. Defined as:

struct utf8_iterator {
    char *string;
    char *curptr;
    unsigned curwidth;
};

When iterating over characters in string, curptr points to the current character, and curwidth holds its length in bytes.

Function: int utf8_iter_isascii (struct utf8_iterator itr): Returns ‘true’ if itr points to an ASCII character.

Function: int utf8_iter_end_p (struct utf8_iterator *itr): Returns ‘true’ if itr has reached the end of the input string.

Function: int utf8_iter_first (struct utf8_iterator *itr, unsigned char *str): Initializes itr for iterating over the string str. On success, positions itr.curptr at the first character of the input string, sets itr.curwidth to the length of that character in bytes, and returns ‘0’. If str is an empty string, returns ‘1’.

Function: int utf8_iter_next (struct utf8_iterator *itr): Positions itr.curptr at the next character of the input string. Sets itr.curwidth to the length of that character in bytes.
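
The iteration protocol described above can be sketched with a self-contained stand-in. Everything named sketch_* below is hypothetical and only mirrors the documented behaviour; in particular, the return value of sketch_iter_next (1 at end of input) is an assumption, since the manual does not state what utf8_iter_next returns.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct utf8_iterator, for illustration only. */
struct sketch_iter {
    const char *string;   /* start of the input */
    const char *curptr;   /* current character */
    unsigned curwidth;    /* its length in bytes */
};

/* Width of the character at p.  Only continuation bytes (10xxxxxx) are
   accepted after the lead byte, so a truncated sequence never makes the
   iterator step past the terminating nul. */
static unsigned sketch_width(const char *p)
{
    unsigned char c = (unsigned char)*p;
    unsigned want = c >= 0xF0 ? 4 : c >= 0xE0 ? 3 : c >= 0xC0 ? 2 : 1;
    unsigned w = 1;
    while (w < want && ((unsigned char)p[w] & 0xC0) == 0x80)
        w++;
    return w;
}

static int sketch_iter_end_p(const struct sketch_iter *it)
{
    return *it->curptr == '\0';
}

/* Returns 0 on success, 1 if str is empty. */
static int sketch_iter_first(struct sketch_iter *it, const char *str)
{
    it->string = it->curptr = str;
    if (*str == '\0') {
        it->curwidth = 0;
        return 1;
    }
    it->curwidth = sketch_width(str);
    return 0;
}

/* Returns 0 on success, 1 when the end of input is reached (assumption). */
static int sketch_iter_next(struct sketch_iter *it)
{
    if (sketch_iter_end_p(it))
        return 1;
    it->curptr += it->curwidth;
    if (sketch_iter_end_p(it)) {
        it->curwidth = 0;
        return 1;
    }
    it->curwidth = sketch_width(it->curptr);
    return 0;
}
```

A typical loop then reads: for (sketch_iter_first (&it, s); !sketch_iter_end_p (&it); sketch_iter_next (&it)), with it.curptr and it.curwidth describing the current character inside the loop body.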

D.10.3 Conversions

The following functions convert between the two string representations.

Function: int utf8_mbtowc_internal (void *data, int (*read) (void*), unsigned int *pwc): Internal function for converting a single UTF-8 character to a corresponding wide character representation. The character to convert is obtained by calling the function pointed to by read with data as its only argument. If that call returns a non-positive value, the function sets errno to ‘ENODATA’ and returns -1.

Function: int utf8_mbtowc (unsigned int *pwc, const char *r, size_t len)

Converts the first len characters from the multi-byte string r to wide character representation. On success, returns 0 and stores the result in pwc. The result pointer is allocated using malloc(3).

On error (invalid multi-byte sequence encountered), returns -1 and sets errno to ‘EILSEQ’.
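
As an illustration of what decoding a single multi-byte character involves, the following sketch decodes one UTF-8 sequence by hand. The function sketch_decode is hypothetical and is not libdico code: unlike the function documented above, it returns the number of bytes consumed, it does not set errno, and it assumes the modern four-byte limit of UTF-8.

```c
#include <stddef.h>

/* Hypothetical decoder: reads one UTF-8 sequence of at most len bytes
   from s, stores the code point in *pwc and returns the number of bytes
   consumed.  Returns -1 on a malformed, truncated or overlong sequence,
   on a surrogate, and on values above U+10FFFF. */
static int sketch_decode(unsigned *pwc, const unsigned char *s, size_t len)
{
    /* Smallest code point that legitimately needs n bytes. */
    static const unsigned min[] = { 0, 0, 0x80, 0x800, 0x10000 };
    unsigned wc;
    int n, i;

    if (len == 0)
        return -1;
    if (s[0] < 0x80)                { wc = s[0];        n = 1; }
    else if ((s[0] & 0xE0) == 0xC0) { wc = s[0] & 0x1F; n = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { wc = s[0] & 0x0F; n = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { wc = s[0] & 0x07; n = 4; }
    else
        return -1;                  /* not a valid lead byte */
    if ((size_t)n > len)
        return -1;                  /* truncated input */
    for (i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return -1;              /* expected a continuation byte */
        wc = (wc << 6) | (s[i] & 0x3F);
    }
    if (wc < min[n] || wc > 0x10FFFF || (wc >= 0xD800 && wc <= 0xDFFF))
        return -1;
    *pwc = wc;
    return n;
}
```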

Function: int utf8_wctomb (unsigned char *r, unsigned int wc): Stores the UTF-8 representation of the Unicode character wc in r[0..5]. Returns the number of bytes stored. If wc is out of range, returns -1 and sets errno to ‘EILSEQ’.
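
The reverse direction, from a code point to bytes, can be sketched as follows. The function sketch_encode is hypothetical. It follows the four-byte limit of current UTF-8 (values up to U+10FFFF), which is an assumption: the r[0..5] buffer documented above suggests that libdico also accepts the older five- and six-byte forms.

```c
/* Hypothetical encoder: writes the UTF-8 form of wc into r, which must
   have room for 4 bytes, and returns the number of bytes written.
   Returns -1 for surrogates and for values above U+10FFFF. */
static int sketch_encode(unsigned char *r, unsigned wc)
{
    if (wc < 0x80) {
        r[0] = (unsigned char)wc;
        return 1;
    }
    if (wc < 0x800) {
        r[0] = (unsigned char)(0xC0 | (wc >> 6));
        r[1] = (unsigned char)(0x80 | (wc & 0x3F));
        return 2;
    }
    if (wc < 0x10000) {
        if (wc >= 0xD800 && wc <= 0xDFFF)
            return -1;              /* UTF-16 surrogates are not characters */
        r[0] = (unsigned char)(0xE0 | (wc >> 12));
        r[1] = (unsigned char)(0x80 | ((wc >> 6) & 0x3F));
        r[2] = (unsigned char)(0x80 | (wc & 0x3F));
        return 3;
    }
    if (wc <= 0x10FFFF) {
        r[0] = (unsigned char)(0xF0 | (wc >> 18));
        r[1] = (unsigned char)(0x80 | ((wc >> 12) & 0x3F));
        r[2] = (unsigned char)(0x80 | ((wc >> 6) & 0x3F));
        r[3] = (unsigned char)(0x80 | (wc & 0x3F));
        return 4;
    }
    return -1;
}
```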

Function: int utf8_wc_to_mbstr (const unsigned *word, size_t wordlen, char **retptr)

Converts the first wordlen characters of the wide character string word to multi-byte representation. The result is returned in retptr. It is allocated using malloc(3).

Returns 0 on success. On error, returns -1 and sets errno to one of the following values:

ENOMEM: Not enough memory to allocate the return buffer.
EILSEQ: An invalid wide character is encountered.

Function: int utf8_mbstr_to_wc (const char *str, unsigned **wptr, size_t *plen)

Converts a multi-byte string from str to its wide character representation.

The result is returned in wptr. It is allocated using malloc(3).

Returns 0 on success. On error, returns -1 and sets errno to one of the following values:

ENOMEM: Not enough memory to allocate the return buffer.
EILSEQ: An invalid multi-byte sequence is encountered.

Function: int utf8_mbstr_to_norm_wc (const char *str, unsigned **wptr, size_t *plen)

Converts a multi-byte string from str to its wide character representation, replacing each run of one or more whitespace characters with a single space character (ASCII 32).

The result is returned in wptr. It is allocated using malloc(3).

Returns 0 on success. On error, returns -1 and sets errno to one of the following values:

ENOMEM: Not enough memory to allocate the return buffer.
EILSEQ: An invalid multi-byte sequence is encountered.
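
The whitespace normalization step can be sketched on an already converted wide character string. The names below are hypothetical, and two points are assumptions rather than documented libdico behaviour: only the ASCII whitespace characters are recognized, and leading or trailing runs are collapsed to one space rather than removed.

```c
#include <stddef.h>

/* Assumption: only ASCII space, tab, newline, vertical tab, form feed
   and carriage return count as whitespace in this sketch. */
static int sketch_is_space(unsigned wc)
{
    return wc == ' ' || (wc >= '\t' && wc <= '\r');
}

/* Hypothetical helper: collapses every run of whitespace in the
   0-terminated wide character string ws into a single space (ASCII 32),
   in place.  Returns the new length in characters. */
static size_t sketch_normalize(unsigned *ws)
{
    size_t in, out = 0;

    for (in = 0; ws[in] != 0; in++) {
        if (sketch_is_space(ws[in])) {
            /* Emit a space only if the previous output is not one. */
            if (out == 0 || ws[out - 1] != ' ')
                ws[out++] = ' ';
        } else
            ws[out++] = ws[in];
    }
    ws[out] = 0;
    return out;
}
```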

D.10.4 Comparing UTF-8 strings

Function: int utf8_symcmp (unsigned char *a, unsigned char *b): Compares the first UTF-8 characters from a and b.

Function: int utf8_symcasecmp (unsigned char *a, unsigned char *b): Compares the first UTF-8 characters from a and b, ignoring the case.

Function: int utf8_strcasecmp (unsigned char *a, unsigned char *b): Compares the two UTF-8 strings a and b, ignoring the case of the characters.

Function: int utf8_strncasecmp (unsigned char *a, unsigned char *b, size_t maxlen): Compares at most the first maxlen characters from the two UTF-8 strings a and b, ignoring the case of the characters.

Function: int utf8_wc_strcmp (const unsigned *a, const unsigned *b): Compares the two wide character strings a and b.

Function: int utf8_wc_strncmp (const unsigned *a, const unsigned *b, size_t n): Compares at most the first n characters from the wide character strings a and b.

Function: int utf8_wc_strcasecmp (const unsigned *a, const unsigned *b): Compares the two wide character strings a and b, ignoring the case of the characters.

Function: int utf8_wc_strncasecmp (const unsigned *a, const unsigned *b, size_t n): Compares at most the first n characters of the two wide character strings a and b, ignoring the case.
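
The manual does not spell out the return values of these functions. The sketch below assumes the usual strcmp(3) convention (negative, zero or positive) and shows how a bounded comparison of two wide character strings might look. The names are hypothetical and the code is not taken from libdico.

```c
#include <stddef.h>

/* Hypothetical helper: compares at most n characters of the 0-terminated
   wide character strings a and b.  Returns a negative value, 0 or a
   positive value, in the manner of strncmp(3) (assumed convention). */
static int sketch_wc_strncmp(const unsigned *a, const unsigned *b, size_t n)
{
    for (; n > 0; a++, b++, n--) {
        if (*a != *b)
            return *a < *b ? -1 : 1;
        if (*a == 0)
            break;                  /* both strings ended together */
    }
    return 0;
}

/* Unbounded variant, expressed through the bounded one. */
static int sketch_wc_strcmp(const unsigned *a, const unsigned *b)
{
    return sketch_wc_strncmp(a, b, (size_t)-1);
}
```

A case-insensitive variant would apply the same case mapping to *a and *b before comparing them.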

D.10.5 Character lookups

Function: unsigned * utf8_wc_strchr (const unsigned *str, unsigned chr): Returns a pointer to the first occurrence of wide character chr in string str, or ‘NULL’, if no such character is encountered.

Function: unsigned * utf8_wc_strchr_ci (const unsigned *str, unsigned chr): Returns a pointer to the first occurrence of wide character chr (case-insensitive) in string str, or ‘NULL’, if no such character is encountered.

Function: const unsigned * utf8_wc_strstr (const unsigned *text, const unsigned *pattern): Finds the first occurrence of pattern in text. Returns a pointer to the beginning of that occurrence within text, or ‘NULL’ if no occurrence was found.
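
A substring search over wide character strings can be sketched with a straightforward scan. The function sketch_wc_strstr is hypothetical; treating an empty pattern as a match at the start of text follows strstr(3) and is an assumption about the libdico function.

```c
#include <stddef.h>

/* Hypothetical helper: naive search for pattern in text, both being
   0-terminated wide character strings. */
static const unsigned *sketch_wc_strstr(const unsigned *text,
                                        const unsigned *pattern)
{
    if (*pattern == 0)
        return text;                /* assumption: as strstr(3) does */
    for (; *text; text++) {
        size_t i = 0;
        /* Stops at the first mismatch, which includes the end of text. */
        while (pattern[i] && text[i] == pattern[i])
            i++;
        if (pattern[i] == 0)
            return text;            /* whole pattern matched here */
    }
    return NULL;
}
```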

D.10.6 Functions for converting UTF-8 characters

Function: unsigned utf8_wc_toupper (unsigned wc): Converts wide character wc to upper case, if possible. Returns wc, if it cannot be converted.

Function: int utf8_toupper (char *s, size_t len): Converts the first len bytes of the UTF-8 string s to upper case, if possible.

Function: unsigned utf8_wc_tolower (unsigned wc): Converts wide character wc to lower case, if possible. Returns wc, if it cannot be converted.

Function: int utf8_tolower (char *s, size_t len): Converts the first len bytes of the UTF-8 string s to lower case, if possible.

Function: void utf8_wc_strupper (unsigned *str): Converts each character from the wide character string str to uppercase, if applicable.

Function: void utf8_wc_strlower (unsigned *str): Converts each character from the wide character string str to lowercase, if applicable.

D.10.7 Additional functions

Function: unsigned * utf8_wc_strdup (const unsigned *s): Returns a pointer to a new wide character string which is a duplicate of the string s. Memory for the new string is obtained with malloc(3), and can be freed with free(3).

Function: unsigned * utf8_wc_quote (const unsigned *s): Quotes occurrences of backslash and double-quote in s by prefixing each of them with a backslash. The return value is allocated using malloc(3).

Function: int utf8_quote (const char *str, char **sptr)

Quotes occurrences of backslash and double-quote in str by prefixing each of them with a backslash. On success, stores the result (allocated with malloc(3)) in sptr and returns 0. On error, returns -1 and sets errno to one of the following:

ENOMEM: Not enough memory to allocate the return buffer.
EILSEQ: An invalid wide character is encountered.
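
The quoting rule itself is simple enough to sketch directly on bytes: backslash and double-quote are ASCII characters, and ASCII bytes never occur inside a multi-byte UTF-8 sequence. The function sketch_quote is hypothetical; unlike utf8_quote it returns the new string directly, returns NULL on allocation failure, and does not validate the input.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a malloc(3)-allocated copy of str in which
   every backslash and double-quote is prefixed with a backslash, or NULL
   if memory cannot be allocated.  The caller frees the result. */
static char *sketch_quote(const char *str)
{
    size_t extra = 0;
    const char *p;
    char *res, *q;

    /* First pass: count the characters that need a prefix. */
    for (p = str; *p; p++)
        if (*p == '\\' || *p == '"')
            extra++;

    res = malloc(strlen(str) + extra + 1);
    if (!res)
        return NULL;

    /* Second pass: copy, inserting a backslash where needed. */
    for (p = str, q = res; *p; p++) {
        if (*p == '\\' || *p == '"')
            *q++ = '\\';
        *q++ = *p;
    }
    *q = '\0';
    return res;
}
```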

Function: size_t utf8_wc_hash_string (const unsigned *ws, size_t n): Computes a hash code of ws for a symbol table of n buckets.

Function: int dico_levenshtein_distance (const char *a, const char *b, int flags)

Computes the Levenshtein distance between UTF-8 strings a and b. The flags argument is a bitwise OR of one or more flags:

0: Default. Compute the Levenshtein distance, treating both arguments literally.
DICO_LEV_NORM: Treat runs of one or more whitespace characters as a single space character (ASCII 32).
DICO_LEV_DAMERAU: Compute Damerau-Levenshtein distance. This distance takes into account transpositions.
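
To show what the transposition flag changes, here is a minimal sketch of the classic dynamic-programming computation over wide character strings. The names sketch_distance and SKETCH_LEV_DAMERAU are hypothetical. With the flag set, the sketch computes the restricted form of the Damerau-Levenshtein distance (optimal string alignment); whether libdico uses the restricted or the unrestricted form is not stated in this manual.

```c
#include <stdlib.h>

#define SKETCH_LEV_DAMERAU 1    /* hypothetical flag value */

static int sketch_min3(int a, int b, int c)
{
    int m = a < b ? a : b;
    return m < c ? m : c;
}

/* Hypothetical helper: edit distance between the 0-terminated wide
   character strings a and b.  Returns -1 if memory cannot be allocated. */
static int sketch_distance(const unsigned *a, const unsigned *b, int flags)
{
    size_t la = 0, lb = 0, i, j;
    int *d, res;

    while (a[la]) la++;
    while (b[lb]) lb++;
    d = malloc((la + 1) * (lb + 1) * sizeof d[0]);
    if (!d)
        return -1;
#define D(i, j) d[(i) * (lb + 1) + (j)]
    for (i = 0; i <= la; i++) D(i, 0) = (int)i;   /* delete i characters */
    for (j = 0; j <= lb; j++) D(0, j) = (int)j;   /* insert j characters */
    for (i = 1; i <= la; i++)
        for (j = 1; j <= lb; j++) {
            int cost = a[i - 1] != b[j - 1];
            int v = sketch_min3(D(i - 1, j) + 1,          /* deletion */
                                D(i, j - 1) + 1,          /* insertion */
                                D(i - 1, j - 1) + cost);  /* substitution */
            /* Transposition of two adjacent characters. */
            if ((flags & SKETCH_LEV_DAMERAU) && i > 1 && j > 1
                && a[i - 1] == b[j - 2] && a[i - 2] == b[j - 1]
                && D(i - 2, j - 2) + 1 < v)
                v = D(i - 2, j - 2) + 1;
            D(i, j) = v;
        }
    res = D(la, lb);
#undef D
    free(d);
    return res;
}
```

For the pair "ab" and "ba" the plain distance is 2 (two substitutions), whereas with the transposition flag it is 1.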

Function: int dico_soundex (const char *word, char code[DICO_SOUNDEX_SIZE])

Computes the Soundex code for the given word. The code is stored in code. Returns 0 on success, -1 if word is not a valid UTF-8 string. Note that this function silently ignores all characters except Latin letters.

Define: DICO_SOUNDEX_SIZE: This macro definition expands to the size of the Soundex code buffer, including the terminating zero.
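
For readers unfamiliar with Soundex, the following sketch implements the common American Soundex rules: keep the first letter, encode the remaining consonants as digits, collapse adjacent equal digits, let h and w pass without breaking a run, and pad with zeros. All names are hypothetical, the buffer size of 5 is an assumption about DICO_SOUNDEX_SIZE, and libdico may implement a different variant. Unlike dico_soundex, the sketch returns -1 when the word contains no ASCII letter and does not validate UTF-8.

```c
#include <ctype.h>

#define SKETCH_SOUNDEX_SIZE 5   /* assumption: letter, 3 digits, nul */

/* Soundex digit for each letter a..z; '0' marks letters with no code. */
static const char sketch_soundex_tab[] = "01230120022455012623010202";

/* Hypothetical helper: stores the Soundex code of word in code.
   Returns 0 on success, -1 if word contains no ASCII letter. */
static int sketch_soundex(const char *word, char code[SKETCH_SOUNDEX_SIZE])
{
    int n = 0;
    char prev = '0';

    for (; *word && n < SKETCH_SOUNDEX_SIZE - 1; word++) {
        unsigned char c = (unsigned char)*word;
        char digit;

        if (c >= 0x80 || !isalpha(c))
            continue;               /* ignore all but ASCII letters */
        c = (unsigned char)tolower(c);
        digit = sketch_soundex_tab[c - 'a'];
        if (n == 0)
            code[n++] = (char)toupper(c);   /* first letter kept as is */
        else if (c == 'h' || c == 'w')
            continue;               /* h and w do not break a run */
        else if (digit != '0' && digit != prev)
            code[n++] = digit;
        prev = digit;               /* vowels reset the run to '0' */
    }
    if (n == 0)
        return -1;
    while (n < SKETCH_SOUNDEX_SIZE - 1)
        code[n++] = '0';            /* pad short codes with zeros */
    code[n] = '\0';
    return 0;
}
```

Under these rules "Robert" and "Rupert" both map to R163, which is the kind of match a Soundex-based search relies on.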

This document was generated on September 4, 2020 using makeinfo.

Verbatim copying and distribution of this entire article is permitted in any medium, provided this notice is preserved.
