Dico
GNU Dictionary Server
Sergey Poznyakoff
D.10 UTF-8
This section describes functions for handling UTF-8 strings. A UTF-8 character can be represented either as a multi-byte character or a wide character.
A multi-byte character is a char * pointing to one or more bytes representing the UTF-8 character.
A wide character is an unsigned value identifying the character.
In the discussion below, a sequence of one or more multi-byte characters is called a multi-byte string. Multi-byte strings terminate with a single ‘nul’ (0) character.
A sequence of one or more wide characters is called a wide character string. Such strings terminate with a single 0 value.
D.10.1 Character sizes
- Function: size_t utf8_char_width (const unsigned char *cp)
Returns length in bytes of the UTF-8 character representation pointed to by cp.
- Function: size_t utf8_strlen (const char *str)
Returns number of UTF-8 characters (not bytes) in str.
- Function: size_t utf8_wc_strlen (const unsigned *s)
Returns number of wide characters in the wide character string s.
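The difference between counting bytes and counting characters can be illustrated with a short, self-contained sketch. The names sketch_char_width and sketch_strlen are hypothetical stand-ins, not the library's implementation. The sketch assumes well-formed UTF-8 input and derives each character's length from its lead byte, as the standard encoding allows.

```c
#include <stddef.h>

/* Hypothetical stand-in for utf8_char_width: length in bytes of the
   UTF-8 sequence starting at cp, derived from its lead byte.
   Falls back to 1 for bytes that cannot start a sequence. */
size_t
sketch_char_width(const unsigned char *cp)
{
    if (*cp < 0x80)
        return 1;               /* 0xxxxxxx: ASCII */
    if ((*cp & 0xE0) == 0xC0)
        return 2;               /* 110xxxxx */
    if ((*cp & 0xF0) == 0xE0)
        return 3;               /* 1110xxxx */
    if ((*cp & 0xF8) == 0xF0)
        return 4;               /* 11110xxx */
    return 1;
}

/* Hypothetical stand-in for utf8_strlen: counts characters, not bytes. */
size_t
sketch_strlen(const char *str)
{
    const unsigned char *p = (const unsigned char *) str;
    size_t n = 0;

    while (*p) {
        size_t w = sketch_char_width(p);
        /* Advance byte by byte so that a truncated sequence
           never moves the pointer past the terminating nul. */
        while (w > 0 && *p) {
            p++;
            w--;
        }
        n++;
    }
    return n;
}
```

For the six-byte string "na\xc3\xafve" (the word naïve), sketch_strlen counts five characters, whereas strlen(3) reports six bytes.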
D.10.2 Iterating over UTF-8 strings
- struct: utf8_iterator
A data type for iterating over a string of UTF-8 characters. Defined as:
struct utf8_iterator {
    char *string;
    char *curptr;
    unsigned curwidth;
};
When iterating over characters in string, curptr points to the current character, and curwidth holds its length in bytes.
- Function: int utf8_iter_isascii (struct utf8_iterator itr)
Returns ‘true’ if itr points to an ASCII character.
- Function: int utf8_iter_end_p (struct utf8_iterator *itr)
Returns ‘true’ if itr has reached the end of the input string.
- Function: int utf8_iter_first (struct utf8_iterator *itr, unsigned char *str)
Initializes itr for iterating over the string str. On success, positions itr.curptr to the first character of the input string, sets itr.curwidth to the length of that character in bytes, and returns ‘0’. If str is an empty string, returns ‘1’.
- Function: int utf8_iter_next (struct utf8_iterator *itr)
Positions itr.curptr to the next character from the input string. Sets itr.curwidth to the length of that character in bytes.
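The iteration protocol described above can be sketched in a self-contained form. The struct below mirrors the documented layout of utf8_iterator, but every sketch_ name is a hypothetical stand-in, not the library's code. The return value of sketch_iter_next is an assumption, since the text above does not specify one, and the input is assumed to be well-formed UTF-8.

```c
#include <stddef.h>

/* Mirrors the documented layout of struct utf8_iterator. */
struct sketch_iterator {
    char *string;       /* string being iterated over */
    char *curptr;       /* current character */
    unsigned curwidth;  /* its length in bytes */
};

/* Length in bytes of the character at p, taken from its lead byte. */
static unsigned
lead_width(const char *p)
{
    unsigned char c = (unsigned char) *p;

    if (c < 0x80)
        return 1;
    if ((c & 0xE0) == 0xC0)
        return 2;
    if ((c & 0xF0) == 0xE0)
        return 3;
    if ((c & 0xF8) == 0xF0)
        return 4;
    return 1;
}

/* Returns 0 on success, 1 if str is empty (as documented for
   utf8_iter_first). */
int
sketch_iter_first(struct sketch_iterator *itr, char *str)
{
    itr->string = str;
    itr->curptr = str;
    itr->curwidth = *str ? lead_width(str) : 0;
    return *str == 0;
}

int
sketch_iter_end_p(struct sketch_iterator *itr)
{
    return *itr->curptr == 0;
}

/* Assumption: returns 0 if a character is available, 1 at end of string. */
int
sketch_iter_next(struct sketch_iterator *itr)
{
    if (sketch_iter_end_p(itr))
        return 1;
    itr->curptr += itr->curwidth;
    itr->curwidth = *itr->curptr ? lead_width(itr->curptr) : 0;
    return *itr->curptr == 0;
}

/* Example loop: count the characters that occupy more than one byte. */
size_t
sketch_count_non_ascii(char *str)
{
    struct sketch_iterator itr;
    size_t n = 0;

    for (sketch_iter_first(&itr, str);
         !sketch_iter_end_p(&itr);
         sketch_iter_next(&itr))
        if (itr.curwidth > 1)
            n++;
    return n;
}
```

The for loop in sketch_count_non_ascii shows the intended shape of a traversal: initialize, test for the end, advance.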
D.10.3 Conversions
The following functions convert between the two string representations.
- Function: int utf8_mbtowc_internal (void *data, int (*read) (void*), unsigned int *pwc)
Internal function for converting a single UTF-8 character to a corresponding wide character representation. The character to convert is obtained by calling the function pointed to by read with data as its only argument. If that call returns a non-positive value, the function sets errno to ‘ENODATA’ and returns -1.
- Function: int utf8_mbtowc (unsigned int *pwc, const char *r, size_t len)
Converts first len characters from the multi-byte string r to wide character representation. On success, returns 0 and stores the result in pwc. The result pointer is allocated using malloc(3).
On error (invalid multi-byte sequence encountered), returns -1 and sets errno to ‘EILSEQ’.
- Function: int utf8_wctomb (unsigned char *r, unsigned int wc)
Stores the UTF-8 representation of the Unicode character wc in r[0..5]. Returns the number of bytes stored. If wc is out of range, returns -1 and sets errno to ‘EILSEQ’.
- Function: int utf8_wc_to_mbstr (const unsigned *word, size_t wordlen, char **retptr)
Converts the first wordlen characters of the wide character string word to multi-byte representation. The result is returned in retptr. It is allocated using malloc(3).
Returns 0 on success. On error, returns -1 and sets errno to one of the following values:
- ENOMEM
Not enough memory to allocate the return buffer.
- EILSEQ
An invalid wide character is encountered.
- Function: int utf8_mbstr_to_wc (const char *str, unsigned **wptr, size_t *plen)
Converts a multi-byte string from str to its wide character representation.
The result is returned in wptr. It is allocated using malloc(3).
Returns 0 on success. On error, returns -1 and sets errno to one of the following values:
- ENOMEM
Not enough memory to allocate the return buffer.
- EILSEQ
An invalid multi-byte sequence is encountered.
- Function: int utf8_mbstr_to_norm_wc (const char *str, unsigned **wptr, size_t *plen)
Converts a multi-byte string from str to its wide character representation, replacing each run of one or more whitespace characters with a single space character (ASCII 32).
The result is returned in wptr. It is allocated using malloc(3).
Returns 0 on success. On error, returns -1 and sets errno to one of the following values:
- ENOMEM
Not enough memory to allocate the return buffer.
- EILSEQ
An invalid multi-byte sequence is encountered.
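What converting between the two representations involves for a single character can be shown with a minimal sketch. The functions sketch_wctomb and sketch_mbtowc are hypothetical stand-ins, not the library's code. They assume the standard UTF-8 encoding of at most four bytes per character (the documented utf8_wctomb reserves r[0..5]), and, unlike the documented utf8_mbtowc, the decoder here returns the number of bytes consumed. Overlong forms and surrogates are not rejected, to keep the sketch short.

```c
#include <errno.h>
#include <stddef.h>

/* Encodes wc into r (at least 4 bytes). Returns the number of bytes
   stored, or -1 with errno set to EILSEQ if wc is out of range. */
int
sketch_wctomb(unsigned char *r, unsigned wc)
{
    if (wc < 0x80) {
        r[0] = (unsigned char) wc;
        return 1;
    }
    if (wc < 0x800) {
        r[0] = (unsigned char) (0xC0 | (wc >> 6));
        r[1] = (unsigned char) (0x80 | (wc & 0x3F));
        return 2;
    }
    if (wc < 0x10000) {
        r[0] = (unsigned char) (0xE0 | (wc >> 12));
        r[1] = (unsigned char) (0x80 | ((wc >> 6) & 0x3F));
        r[2] = (unsigned char) (0x80 | (wc & 0x3F));
        return 3;
    }
    if (wc < 0x110000) {
        r[0] = (unsigned char) (0xF0 | (wc >> 18));
        r[1] = (unsigned char) (0x80 | ((wc >> 12) & 0x3F));
        r[2] = (unsigned char) (0x80 | ((wc >> 6) & 0x3F));
        r[3] = (unsigned char) (0x80 | (wc & 0x3F));
        return 4;
    }
    errno = EILSEQ;
    return -1;
}

/* Decodes one character from at most len bytes at r into *pwc.
   Returns the number of bytes consumed (an assumption of this sketch),
   or -1 with errno set to EILSEQ on an invalid or truncated sequence. */
int
sketch_mbtowc(unsigned *pwc, const unsigned char *r, size_t len)
{
    unsigned wc;
    int n, i;

    if (len == 0) {
        errno = EILSEQ;
        return -1;
    }
    if (r[0] < 0x80) {
        *pwc = r[0];
        return 1;
    } else if ((r[0] & 0xE0) == 0xC0) {
        wc = r[0] & 0x1F;
        n = 2;
    } else if ((r[0] & 0xF0) == 0xE0) {
        wc = r[0] & 0x0F;
        n = 3;
    } else if ((r[0] & 0xF8) == 0xF0) {
        wc = r[0] & 0x07;
        n = 4;
    } else {
        errno = EILSEQ;         /* stray continuation or invalid lead byte */
        return -1;
    }
    if (len < (size_t) n) {
        errno = EILSEQ;         /* sequence is cut short */
        return -1;
    }
    for (i = 1; i < n; i++) {
        if ((r[i] & 0xC0) != 0x80) {
            errno = EILSEQ;     /* not a continuation byte */
            return -1;
        }
        wc = (wc << 6) | (r[i] & 0x3F);
    }
    *pwc = wc;
    return n;
}
```

Encoding U+20AC (the euro sign) with sketch_wctomb stores the three bytes E2 82 AC, and decoding those bytes yields 0x20AC again.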
D.10.4 Comparing UTF-8 strings
- Function: int utf8_symcmp (unsigned char *a, unsigned char *b)
Compares the first UTF-8 characters of a and b.
- Function: int utf8_symcasecmp (unsigned char *a, unsigned char *b)
Compares the first UTF-8 characters of a and b, ignoring the case.
- Function: int utf8_strcasecmp (unsigned char *a, unsigned char *b)
Compares the two UTF-8 strings a and b, ignoring the case of the characters.
- Function: int utf8_strncasecmp (unsigned char *a, unsigned char *b, size_t maxlen)
Compares at most the first maxlen characters of the two UTF-8 strings a and b, ignoring the case of the characters.
- Function: int utf8_wc_strcmp (const unsigned *a, const unsigned *b)
Compares the two wide character strings a and b.
- Function: int utf8_wc_strncmp (const unsigned *a, const unsigned *b, size_t n)
Compares at most the first n characters of the wide character strings a and b.
- Function: int utf8_wc_strcasecmp (const unsigned *a, const unsigned *b)
Compares the two wide character strings a and b, ignoring the case of the characters.
- Function: int utf8_wc_strncasecmp (const unsigned *a, const unsigned *b, size_t n)
Compares at most the first n characters of the two wide character strings a and b, ignoring the case.
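Comparison of wide character strings amounts to walking two zero-terminated arrays of unsigned values in step. The sketch below uses hypothetical sketch_ names and is not the library's code. It assumes the strcmp(3) convention for the result (negative, zero, or positive), which the descriptions above do not spell out.

```c
#include <stddef.h>

/* Compares at most the first n wide characters of a and b.
   Assumed result convention: <0, 0 or >0, as with strncmp(3). */
int
sketch_wc_strncmp(const unsigned *a, const unsigned *b, size_t n)
{
    for (; n > 0; a++, b++, n--) {
        if (*a != *b)
            return *a < *b ? -1 : 1;
        if (*a == 0)
            break;              /* both strings ended together */
    }
    return 0;
}

/* Compares two whole wide character strings. */
int
sketch_wc_strcmp(const unsigned *a, const unsigned *b)
{
    return sketch_wc_strncmp(a, b, (size_t) -1);
}
```

Because the comparison is made on code point values, the ordering is by numeric value of each wide character rather than by any locale-specific collation.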
D.10.5 Character lookups
- Function: unsigned * utf8_wc_strchr (const unsigned *str, unsigned chr)
Returns a pointer to the first occurrence of wide character chr in string str, or ‘NULL’, if no such character is encountered.
- Function: unsigned * utf8_wc_strchr_ci (const unsigned *str, unsigned chr)
Returns a pointer to the first occurrence of wide character chr (case-insensitive) in string str, or ‘NULL’, if no such character is encountered.
- Function: const unsigned * utf8_wc_strstr (const unsigned *text, const unsigned *pattern)
Finds the first occurrence of pattern in text. Returns a pointer to the beginning of pattern in text. Returns NULL if no occurrence was found.
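A substring search over wide character strings can be sketched with a straightforward scan. The function sketch_wc_strstr is a hypothetical stand-in, not the library's implementation, and its treatment of an empty pattern (a match at the start of text, as with strstr(3)) is an assumption.

```c
#include <stddef.h>

/* Returns a pointer to the first occurrence of pattern in text,
   or NULL if there is none. Both arguments are 0-terminated. */
const unsigned *
sketch_wc_strstr(const unsigned *text, const unsigned *pattern)
{
    if (*pattern == 0)
        return text;            /* assumption: empty pattern matches at once */

    for (; *text; text++) {
        const unsigned *t = text;
        const unsigned *p = pattern;

        /* Compare pattern against the text starting at this position.
           The loop stops at a mismatch, which includes reaching the
           terminating 0 of text before the pattern is exhausted. */
        while (*p && *t == *p) {
            t++;
            p++;
        }
        if (*p == 0)
            return text;        /* whole pattern matched */
    }
    return NULL;
}
```

The returned pointer refers into text itself, so the offset of the match is the difference between the result and text.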
D.10.6 Functions for converting UTF-8 characters
- Function: unsigned utf8_wc_toupper (unsigned wc)
Converts wide character wc to upper case, if possible. Returns wc, if it cannot be converted.
- Function: int utf8_toupper (char *s, size_t len)
Converts first len bytes of the UTF-8 string s to upper case, if possible.
- Function: unsigned utf8_wc_tolower (unsigned wc)
Converts wide character wc to lower case, if possible. Returns wc, if it cannot be converted.
- Function: int utf8_tolower (char *s, size_t len)
Converts first len bytes of the UTF-8 string s to lower case, if possible.
- Function: void utf8_wc_strupper (unsigned *str)
Converts each character from the wide character string str to uppercase, if applicable.
- Function: void utf8_wc_strlower (unsigned *str)
Converts each character from the wide character string str to lowercase, if applicable.
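The contract of these functions (convert where a mapping exists, return the character unchanged otherwise) can be illustrated with a deliberately small sketch. The sketch_ names are hypothetical, and the mapping covers only ASCII letters and the basic Cyrillic range U+0430 to U+044F. The library's actual case tables are not described in this section and are presumably far more complete.

```c
/* Maps a lowercase wide character to upper case if this sketch knows
   a mapping; otherwise returns wc unchanged. */
unsigned
sketch_wc_toupper(unsigned wc)
{
    if (wc >= 0x61u && wc <= 0x7Au)     /* ASCII a..z */
        return wc - 0x20u;
    if (wc >= 0x430u && wc <= 0x44Fu)   /* basic Cyrillic lowercase */
        return wc - 0x20u;
    return wc;                          /* no known mapping */
}

/* Converts a 0-terminated wide character string in place. */
void
sketch_wc_strupper(unsigned *str)
{
    for (; *str; str++)
        *str = sketch_wc_toupper(*str);
}
```

Working on wide characters keeps the conversion a simple per-element mapping. On a multi-byte string the same operation has to decode and re-encode each character.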
D.10.7 Additional functions
- Function: unsigned * utf8_wc_strdup (const unsigned *s)
Returns a pointer to a new wide character string which is a duplicate of the string s. Memory for the new string is obtained with malloc(3), and can be freed with free(3).
- Function: unsigned * utf8_wc_quote (const unsigned *s)
Quotes occurrences of backslash and double-quote in s by prefixing each of them with a backslash. The return value is allocated using malloc(3).
- Function: int utf8_quote (const char *str, char **sptr)
Quotes occurrences of backslash and double-quote in str by prefixing each of them with a backslash. On success, stores the result (allocated with malloc(3)) in sptr, and returns 0. On error, returns -1 and sets errno to one of the following:
- ENOMEM
Not enough memory to allocate the return buffer.
- EILSEQ
An invalid wide character is encountered.
- Function: size_t utf8_wc_hash_string (const unsigned *ws, size_t n)
Compute a hash code of ws for a symbol table of n buckets.
- Function: int dico_levenshtein_distance (const char *a, const char *b, int flags)
Computes Levenshtein distance between UTF-8 strings a and b. The flags argument is a bitwise or of one or more flags:
0
Default - compute Levenshtein distance, treating both arguments literally.
DICO_LEV_NORM
Treat runs of one or more whitespace characters as a single space character (ASCII 32).
DICO_LEV_DAMERAU
Compute Damerau-Levenshtein distance. This distance takes into account transpositions.
- Function: int dico_soundex (const char *word, char code[DICO_SOUNDEX_SIZE])
Computes the Soundex code for the given word. The code is stored in code. Returns 0 on success, -1 if word is not a valid UTF-8 string.
Note that this function silently ignores all characters except Latin letters.
- Define: DICO_SOUNDEX_SIZE
This macro definition expands to the size of the Soundex code buffer, including the terminating zero.
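The difference between the plain Levenshtein distance and the variant selected by DICO_LEV_DAMERAU, described earlier in this subsection, can be seen in a minimal sketch over wide character strings. The function sketch_wc_levenshtein and its damerau argument are hypothetical and not the library's code. The transposition handling is the optimal string alignment variant; whether the library uses this or the unrestricted Damerau-Levenshtein algorithm is not stated here.

```c
#include <stddef.h>
#include <stdlib.h>

static size_t
wlen(const unsigned *s)
{
    size_t n = 0;

    while (s[n])
        n++;
    return n;
}

static int
min3(int a, int b, int c)
{
    int m = a < b ? a : b;

    return m < c ? m : c;
}

/* Edit distance between wide strings a and b. If damerau is non-zero,
   a swap of two adjacent characters counts as a single edit.
   Returns -1 if memory cannot be allocated. */
int
sketch_wc_levenshtein(const unsigned *a, const unsigned *b, int damerau)
{
    size_t la = wlen(a), lb = wlen(b), cols = lb + 1, i, j;
    int *d = malloc((la + 1) * cols * sizeof d[0]);
    int result;

    if (!d)
        return -1;
    /* Distance from or to the empty prefix is the prefix length. */
    for (i = 0; i <= la; i++)
        d[i * cols] = (int) i;
    for (j = 0; j <= lb; j++)
        d[j] = (int) j;

    for (i = 1; i <= la; i++) {
        for (j = 1; j <= lb; j++) {
            int cost = a[i - 1] != b[j - 1];
            int v = min3(d[(i - 1) * cols + j] + 1,         /* deletion */
                         d[i * cols + j - 1] + 1,           /* insertion */
                         d[(i - 1) * cols + j - 1] + cost); /* substitution */

            if (damerau && i > 1 && j > 1
                && a[i - 1] == b[j - 2] && a[i - 2] == b[j - 1]) {
                int t = d[(i - 2) * cols + j - 2] + 1;      /* transposition */
                if (t < v)
                    v = t;
            }
            d[i * cols + j] = v;
        }
    }
    result = d[la * cols + lb];
    free(d);
    return result;
}
```

For the strings "ab" and "ba" the plain distance is 2 (two substitutions), while with transpositions enabled it is 1.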
This document was generated on September 4, 2020 using makeinfo.
Verbatim copying and distribution of this entire article is permitted in any medium, provided this notice is preserved.