什么茶降血脂最好| 一个人自言自语的说话是什么病| nk是什么意思| 掌眼什么意思| 憧憬未来是什么意思| 618是什么| 拆台是什么意思| hpv亚临床感染是什么意思| 痧是什么| 这个是什么表情| 心率低于60说明什么| 梦见生女孩是什么征兆| 小翅膀车标是什么车| 慢阻肺吃什么药最有效最好| 什么手| 什么水果可以泡酒| 指甲长的快是什么原因| 潜血是什么意思| 五一年属什么生肖| 心律不齐吃什么药最快| 身份证借给别人有什么危害性| 男人性功能不好吃什么药| 细菌性阴道炎用什么药好| 古代的天花是现代的什么病| ace是什么意思| 平日是什么意思| 夜排是什么意思| 霸王硬上弓是什么意思| 戊土是什么意思| 卵巢早衰吃什么药调理最好| 拉黑色的屎是什么原因| 井井有条是什么意思| 蝎子长什么样| 海为什么是蓝色的| 风油精有什么作用| 3月份是什么季节| 为什么不要看电焊火花| 苏州有什么特产可以带回家| 强心剂是什么意思| 内膜厚是什么原因| 立秋那天吃什么| 00年属龙的是什么命| 蓝莓不能和什么一起吃| 备孕前准备吃什么叶酸| 狗肚子有虫子吃什么药| 膝盖疼挂什么科| 有什么好用的vpn| 酸麻胀痛痒各代表什么| 日照香炉生紫烟是什么意思| 袁崇焕为什么杀毛文龙| 举足轻重是什么生肖| 草头是什么菜| 喉咙嘶哑是什么原因| 空气缸是什么意思| 属马跟什么属相犯冲| 24属什么| 打乒乓球有什么好处| 荨麻疹吃什么药好得快| 沉住气是什么意思| 伏天是什么时候| 金玉其外败絮其中是什么意思| 阿僧只劫是什么意思| 什么是磁共振检查| 虫草是什么| 冰丝和天丝有什么区别| 腋下有异味是什么原因导致的| 贫嘴是什么意思| 吃糖醋蒜有什么好处和坏处| 梦见死了人是什么征兆| 看甲状腺去医院挂什么科| 王八吃什么食物| 石见读什么| 茯苓的作用是什么| 支气管扩张什么症状| 肺部纤维灶什么意思| 淋巴细胞百分比高是什么原因| 1111是什么意思| leonardo是什么牌子| 大步向前走永远不回头是什么歌| 钙果是什么水果| 黄精长什么样| 尿常规3个加号什么意思| 什么是冰种翡翠| 膛目结舌是什么意思| 去医院看头发挂什么科| 六月初一是什么日子| 变异是什么意思| 脾胃不好吃什么调理| 2022年属虎的是什么命| 眼睛浮肿是什么原因引起的| 药流后吃什么药| 瓦特发明了什么| 24小时动态脑电图能查出什么| fisherman是什么意思| 77年属什么生肖| 头陀是什么意思| 蓝脸的窦尔敦盗御马是什么歌| 血糖偏高能吃什么水果和食物最好| 生蚝补什么| 什么是阳历| ab型血可以接受什么血型| pr是什么职位| 脚掌发红是什么原因| 河南专升本考什么| 女生什么时候容易怀孕| 欧豪资源为什么这么好| 怀孕失眠是什么原因| 是什么意思| dunhill是什么品牌| 儿女双全什么意思| 孩子结膜炎用什么眼药水| 一个斤一个页念什么| 赤是什么意思| 扁桃体发炎用什么药| 脱发是什么原因| 面部痉挛吃什么药| 11月24是什么星座| 湿疹是什么意思| 涂防晒霜之前要涂什么| 燊念什么| 两个方一个土是什么字| 眼睛发红是什么原因| 脚肿了是什么原因引起的| 西游记是什么生肖| 血脂高吃什么| 为什么会得尿毒症| 酒石酸是什么| 卷饼里面配什么菜好吃| 桃符指的是什么| 喉咙有异物感挂什么科| 什么死法不痛苦| ebay是什么| 冲任失调是什么意思| 吃什么帮助消化通便| 为什么插几下就射了| 怀孕建档是什么意思| 田野里有什么| 宫内膜回声不均匀是什么意思| mic什么意思| 抑郁吃什么药可以缓解情绪| 老觉得饿是什么原因| 打酱油是什么意思| 元神是什么意思| 采字五行属什么| 席梦思床垫什么牌子好| 尾货是什么意思| 幽冥是什么意思| 眉中间有竖纹代表什么| 非典型鳞状细胞意义不明确是什么意思| 宫颈光滑说明什么| 2024是什么年生肖| 吴优为什么叫大胸姐| 忌入宅是什么意思| 肌酐激酶高是什么原因| 农历七月是什么星座| lynn是什么意思| 大家闺秀是什么生肖| 身份证号最后一位代表什么| 丝瓜不能和什么食物一起吃| 隔岸观火是什么意思| 枭神夺食会发生什么| 养胃是什么意思| 胰腺炎吃什么药见效快| 三月是什么生肖| 为什么不能叫醒梦游的人| 吃高血压药有什么副作用| 肺结节影是什么意思啊| 鼻炎不能吃什么食物| 眼睛周围长斑是什么原因引起的| 纷至沓来什么意思| 梦见龙是什么预兆| 强高是什么意思| neighborhood什么意思| 铅是什么东西| 首鼠两端什么意思| ifa是什么意思| 过期的啤酒能干什么| 鬼冢虎为什么很少人穿| hpv用什么药| 牙龈为什么会肿| 有什么好看的动漫| 焯水什么意思| psa是什么| 小孩掉头发是什么原因| 拖什么东西最轻松| na医学上是什么意思| 风团是什么| 宫颈管少量积液是什么意思| 可小刀是什么意思| 蜈蚣最怕什么| 尿道口有灼热感是什么原因| 乐不思蜀什么意思| 经常尿路感染是什么原因| 七月22号是什么星座| 囊肿里面是什么东西| 前列腺炎要吃什么药| 1967年出生属什么| 站姐是什么职业| 商朝之后是什么朝代| 疳积是什么| 来月经为什么会肚子痛| 暗送秋波什么意思| 月什么意思| 疱疹用什么药可以根治| 上不来气吃什么药好使| 湿疹抹什么药| 81年属什么生肖| 紫色是什么颜色调出来的| 容易长口腔溃疡是什么原因| 素女经是什么| 什么是骨癌| 扑尔敏是什么药| 剖腹产坐月子可以吃什么水果| 卵巢增大是什么原因引起的| 皮肤瘙痒是什么病的前兆| 姜字五行属什么| 番茄和蕃茄有什么区别| 正局级是什么级别| 高材生是什么意思| 一个月一个泉是什么字| 血淋是什么意思| 心里发慌什么原因| 大便不正常是什么原因造成的| 梦见拉麦子是什么预兆| 房产证改名字需要什么手续| 血热是什么原因引起的| 秦二世叫什么名字| 男人断眉有什么说法| 女性乳房痒是什么原因| 脾大对身体有什么影响| ve是什么意思| 1.5是什么星座| 阳字属于五行属什么| 夏至吃什么好| 人吸了甲醛有什么症状| 尼麦角林片治什么病| 胃寒吃什么药好| 冬阴功是什么意思| 股癣是什么原因引起的| 什么生花| 黄疸挂什么科| 心肌病吃什么药| 梦见小老虎是什么预兆| 男孩取什么名字好听又有贵气| 酸角是什么| 为什么用英语怎么说| 产前筛查是检查什么| 饿了么什么时候成立的| 离婚的女人叫什么| 臼是什么意思| 半夜脚抽筋是什么原因| 柠檬苦是什么原因| 相貌是什么意思| 你什么我什么成语| 代谢不好是什么原因| 老人流口水是什么原因| 后脑勺疼什么原因| 出火是什么意思| 什么原因会导致尿路感染| 催丹香是什么意思| afar是什么意思| 多巴胺什么意思| 休眠是什么意思| 半永久是什么意思| 百度


国台办:继续推动和深化两岸语言文字交流合作

by Jeff Bezanson

If you write an email in Russian and send it to somebody in Russia, it is depressingly unlikely that he or she will be able to read it. If you write software, the burden of this sad state of affairs rests on your shoulders. 百度 与此同时,苏宁易购APP月活跃用户数较年初增长%,2017年12月苏宁易购APP订单数量占线上整体比例提升至89%。

Given modern hardware resources, it is unacceptable that we can't yet routinely communicate text in different scripts or containing technical symbols. Fortunately, we are getting there.

After reading a lot on the subject and incorporating Unicode compatibility into some of my software, I decided to prepare this quick and highly pragmatic guide to digital text in the 21st century (for C programmers, of course). I don't mind adding my voice to the numerous articles that already exist on this subject, since the world needs as many programmers as possible to pick up these skills as soon as possible.

I. Encoding text

Given the variety of human languages on this planet, text is a complex subject. Many are scared away from dealing with world scripts, because they think of the numerous related software problems in the area instead of focusing on what they can actually do with their code to help.

The first thing to know is that you do not have to worry about most problems with digital text. The most difficult work is handled below the application layer, in OSes, UI libraries, and the C library. To give you an idea of what goes on though, here is a summary of software problems surrounding text:

  • Encoding
    Mapping characters to numbers. Many such mappings exist; once you know the encoding of a piece of text, you know what character is meant by a particular number. Unicode is one such mapping, and a popular one since it incorporates more characters than any other at this time.
  • Display
    Once you know what character is meant, you have to find a font that has the character and render it. This task is much complicated by the need to display both left-to-right and right-to-left text, the existence of combining characters that modify previous characters and have zero width, the fact that some languages require wider character cells than others, and context-sensitive letterforms.
  • Input
    An input method is a way to map keystrokes (most likely several keystrokes on a typical keyboard) to characters. Input is also complicated by bidirectional text.
  • Internationalization (i18n)
    This refers to the practice of translating a program into multiple languages, effectively by translating all of the program's strings.
  • Lexicography
    Code that processes text as more than just binary data might have to become a lot smarter. The problems of searching, sorting, and modifying letter case (upper/lower) vary per-language. If your application doesn't need to perform such tasks, consider yourself lucky. If you do need these operations, you can probably find a UI toolkit or i18n library that already implements them.
If you are savvy with just the first issue (encoding), then OS-vendor-supplied input methods and display routines should magically work with your program. Whether you want to or are able to translate your software is another matter, and compared to proper handling of character encodings it is almost optional (corrupting data is worse than having an unintelligible UI).

The encoding I'll talk about is called Unicode. Unicode officially encodes 1,114,112 characters, from 0x000000 to 0x10FFFF. (The idea that Unicode is a 16-bit encoding is completely wrong.) For maximum compatibility, individual Unicode values are usually passed around as 32-bit integers (4 bytes per character), even though this is more than necessary. For convenience, the first 128 Unicode characters are the same as those in the familiar ASCII encoding.

The consensus is that storing four bytes per character is wasteful, so a variety of representations have sprung up for Unicode characters. The most interesting one for C programmers is called UTF-8. UTF-8 is a "multi-byte" encoding scheme, meaning that it requires a variable number of bytes to represent a single Unicode value. Given a so-called "UTF-8 sequence", you can convert it to a Unicode value that refers to a character.

UTF-8 has the property that all existing 7-bit ASCII strings are still valid. UTF-8 only affects the meaning of bytes greater than 127, which it uses to represent higher Unicode characters. A character might require 1, 2, 3, or 4 bytes of storage depending on its value; more bytes are needed as values get larger. To store the full range of possible 32-bit characters, UTF-8 would require a whopping 6 bytes. But again, Unicode only defines characters up to 0x10FFFF, so this should never happen in practice.

UTF-8 is a specific scheme for mapping a sequence of 1-4 bytes to a number from 0x000000 to 0x10FFFF:

00000000 -- 0000007F: 	0xxxxxxx
00000080 -- 000007FF: 	110xxxxx 10xxxxxx
00000800 -- 0000FFFF: 	1110xxxx 10xxxxxx 10xxxxxx
00010000 -- 001FFFFF: 	11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
The x's are bits to be extracted from the sequence and glued together to form the final number.

It is fair to say that UTF-8 is taking over the world. It is already used for filenames in Linux and is supported by all mainstream web browsers. This is not surprising considering its many nice properties:

  1. It can represent all 1,114,112 Unicode characters.
  2. Most C code that deals with strings on a byte-by-byte basis still works, since UTF-8 is fully compatible with 7-bit ASCII.
  3. Characters usually require fewer than four bytes.
  4. String sort order is preserved. In other words, sorting UTF-8 strings per-byte yields the same order as sorting them per-character by logical Unicode value.
  5. A missing or corrupt byte in transmission can only affect a single character—you can always find the start of the sequence for the next character just by scanning a couple bytes.
  6. There are no byte-order/endianness issues, since UTF-8 data is a byte stream.
The only price to pay for all this is that there is no longer a one-to-one correspondence between bytes and characters in a string. Finding the nth character of a string requires iterating over the string from the beginning.

See What is UTF-8? for more information about UTF-8.

Side note: Some consider UTF-8 to be discriminatory, since it allows English text to be stored efficiently at one byte per character while other world scripts require two bytes or more. This is a troublesome point, but it should not get in the way of Unicode adoption. First of all, UTF-8 was not really designed to preferentially encode English text. It was designed to preserve compatibility with the large body of existing code that scans for special characters such as line breaks, spaces, NUL terminators, and so on. Furthermore, the encoding used internally by a program has little impact on the user as long as it is able to represent their data without loss. UTF-8 is a great boon, especially for C programming. Think of it this way: if it allows you to internationalize an application that would have been difficult to convert otherwise, it is much less discriminatory than the alternative.

II. The C library

All recent implementations of the standard C library have lots of functions for manipulating international strings. Before reading up on them, it helps to know some vocabulary:

"Multibyte character" or "multibyte string" refers to text in one of the many (possibly language-specific) encodings that exist throughout the world. A multibyte character does not necessarily require more than one byte to store; the term is merely intended to be broad enough to encompass encodings where this is the case. UTF-8 is in fact only one such encoding; the actual encoding of user input is determined by the user's current locale setting (selected as an option in a system dialog or stored as an environment variable in UNIX). Strings you get from the user will be in this encoding, and strings you pass to printf() are supposed to be as well. Strings within your program can of course be in any encoding you want, but you might have to convert them for proper display.

"Wide character" or "wide character string" refers to text where each character is the same size (usually a 32-bit integer) and simply represents a Unicode character value ("code point"). This format is a known common currency that allows you to get at character values if you want to. The wprintf() family is able to work with wide character format strings, and the "%ls" format specifier for normal printf() will print wide character strings (converting them to the correct locale-specific multibyte encoding on the way out).

The C library also provides functions like towupper() that can convert a wide character from any language to uppercase (if applicable). strftime() can format a date and time string appropriately for the current locale, and strcoll() can do international sorting. These and other functions that depend on locale must be initialized at the beginning of your program using

#include <locale.h>

int main()
{
    char *locale;

    locale = setlocale(LC_ALL, "");
    ...
}
You don't have to do anything with the locale string returned by setlocale(), but you can use it to query your user's locale settings (more on this later).

The C library pretty much assumes you will be using multibyte strings throughout your program (since that's what you get as input). Since multibyte strings are opaque, a lot of functions beginning with "mb" are provided to deal with them. Personally, I don't like not knowing what encoding my strings use. One concrete problem with the multibyte thing is file I/O— a given file could be in any encoding, independent of locale. When you write a file or send data over a network, keeping the multibyte encoding might be a bad idea. (Even if all software uses only the proper locale-independent C library functions, and all platforms support all encodings internally, there is still no single standard for communicating the encoding of a piece of text; email messages and HTML tags do it in various ways.) You also might be able to do more efficient processing, or avoid rewriting code, if you knew the encoding your strings used.

Your encoding options

You are free to choose a string encoding for internal use in your program. The choice pretty much boils down to either UTF-8, wide (4-byte) characters, or multibyte. Each has its advantages and disadvantages:

    UTF-8
    • Pro: compatible with all existing strings and most existing code
    • Pro: takes less space
    • Pro: widely used as an interchange format (e.g. in XML)
    • Con: more complex processing, O(n) string indexing
    Wide characters
    • Pro: easy to process
    • Con: wastes space
    • Pro/Con: although you can use the syntax
      L"Hello, world."
      to easily include wide-character strings in C programs, the size of wide characters is not consistent across platforms (some incorrectly use 2-byte wide characters)
    • Con: should not be used for output, since spurious zero bytes and other low-ASCII characters with common meanings (such as '/' and '\n') will likely be sprinkled throughout the data.
    Multibyte
    • Pro: no conversions ever needed on input and output
    • Pro: built-in C library support
    • Pro: provides the widest possible internationalization, since in rare cases conversion between local encodings and Unicode does not work well
    • Con: strings are opaque
    • Con: perpetuates incompatibilities. For example, there are three major encodings for Russian. If one Russian sends data to another through your program, the recipient will not be able to read the message if his or her computer is configured for a different Russian encoding. But if your program always converts to UTF-8, the text is effectively normalized so that it will be widely legible (especially in the future) no matter what encoding it started in.

In this article I will advocate and give explicit instruction on using UTF-8 as an internal string encoding. Many Linux users already set their environment to a UTF-8 locale, in which case you won't even have to do any conversions. Otherwise you will have to convert multibyte to wide to UTF-8 on input, and back to multibyte on output. Nevertheless, UTF-8 has its advantages.

III. What to do right now

Below I'll outline concrete steps any C programmer could take to bring his or her code up to date with respect to text encoding. I'll also be presenting a simple C library that provides the routines you need to manipulate UTF-8.

Here's your to-do list:

  1. "char" no longer means character
    I hereby recommend referring to character codes in C programs using a 32-bit unsigned integer type. Many platforms provide a "wchar_t" (wide character) type, but unfortunately it is to be avoided since some compilers allot it only 16 bits—not enough to represent Unicode. Wherever you need to pass around an individual character, change "char" to "unsigned int" or similar. The only remaining use for the "char" type is to mean "byte".

  2. Get UTF-8-clean
    To take advantage of UTF-8, you'll have to treat bytes higher than 127 as perfectly ordinary characters. For example, say you have a routine that recognizes valid identifier names for a programming language. Your existing standard might be that identifiers begin with a letter:
    int valid_identifier_start(char ch)
    {
        return ((ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z'));
    }
    
    If you use UTF-8, you can extend this to allow letters from other languages as follows:
    int valid_identifier_start(char ch)
    {
        return ((ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z') ||
                ((unsigned char)ch >= 0xC0));
    }
    
    A UTF-8 sequence can only start with values 0xC0 or greater, so that's what I used for checking the start of an identifier. Within an identifier, you would also want to allow characters >= 0x80, which is the range of UTF-8 continuation bytes.

    Most C string library routines still work with UTF-8, since they only scan for terminating NUL characters. A notable exception is strchr(), which in this context is more aptly named "strbyte()". Since you will be passing character codes around as 32-bit integers, you need to replace this with a routine such as my u8_strchr() that can scan UTF-8 for a given character. The traditional strchr() returns a pointer to the location of the found character, and u8_strchr() follows suit. However, you might want to know the index of the found character, and since u8_strchr() has to scan through the string anyway, it keeps a count and returns a character index as well.

    With the old strchr(), you could use pointer arithmetic to determine the character index. Now, any use of pointer arithmetic on strings is likely to be broken since characters are no longer bytes. You'll have to find and fix any code that assumes "(char*)b - (char*)a" is the number of characters between a and b (though it is still of course the number of bytes between a and b).

  3. Interface with your environment
    Using UTF-8 as an internal encoding is now widespread among C programmers. However, the environment your program runs in will not necessarily be nice enough to feed you UTF-8, or expect UTF-8 output.

    The functions mbstowcs() and wcstombs() convert from and to locale-specific encodings, respectively. "mbs" means multibyte string (i.e. the locale-specific string), and "wcs" means wide character string (universal 4-byte characters). Clearly, if you use wide characters internally, you are in luck here. If you use UTF-8, there is a chance that the user's locale will be set to UTF-8 and you won't have to do any conversion at all. To take advantage of that situation, you will have to specifically detect it (I'll provide a function for it). Otherwise, you will have to convert from multibyte to wide to UTF-8.

    Version 1.6 (1.5.x while in development) of the FOX toolkit uses UTF-8 internally, giving your program a nice all-UTF-8-all-the-time environment. GTK2 and Qt also support UTF-8.

  4. Modify APIs to discourage O(n^2) string processing
    The idea of non-constant-time string indexing may worry you. But when you think about it, you rarely need to specifically access the nth character of a string. Algorithms almost never need to make requests like "Quick! Get me the 6th character of this piece of text!" Typically, if you're accessing characters you're iterating over the whole string or most of it. UTF-8 is simple enough to process that iterating over characters takes essentially the same time as iterating over bytes.

    In your own code, you can use my u8_inc() and u8_dec() to move through strings. If you develop libraries or languages, be sure to expose some kind of inc() and dec() API so nobody has to move through a string by repeatedly requesting the nth character.

IV. Some UTF-8 routines

Various libraries are available for internationalization and converting between different text encodings. However, I couldn't find a straightforward set of C routines providing the minimal support needed for using UTF-8 as an internal encoding (although this functionality is often embedded in large UI toolkits and such). I decided to create a small library that could be used to bring UTF-8 to arbitrary C programs.

This library is quite incomplete; you might want to look at related FSF offerings and libutf8. libutf8 provides the multibyte and wide character C library routines mentioned above, in case your C library doesn't have them.

Since performance is sometimes a concern with UTF-8, I made my routines as fast and lightweight as possible. They perform minimal error checking— in particular, they do not bother to determine whether a sequence is valid UTF-8, which can actually be a security problem. I justify this decision by reiterating that the intention of the library is to manipulate an internal encoding; you can enforce that all strings you store in memory be valid UTF-8, enabling the library to make that assumption. Routines for validating and converting from/to UTF-8 are available free from Unicode, Inc.

Note that my routines do not need to support the many encodings of the world—the C library can handle that. If the current locale is not UTF-8, you call mbstowcs() on user input to convert any encoding (whatever it is) to a wide character string, then use my u8_toutf8() to convert it to the UTF-8 your program is comfortable with. Here's an example input routine wrapping readline():

char *get_utf8_input()
{
    char *line, *u8s;
    unsigned int *wcs;
    int len;

    line = readline("");
    if (locale_is_utf8) {
        return line;
    }
    else {
        len = mbstowcs(NULL, line, 0)+1;
        wcs = malloc(len * sizeof(int));
        mbstowcs(wcs, line, len);
        u8s = malloc(len * sizeof(int));
        u8_toutf8(u8s, len*sizeof(int), wcs, len);
        free(line);
        free(wcs);
        return u8s;
    }
The first call to mbstowcs() uses the special parameter value NULL to find the number of characters in the opaque multibyte string.

Anyway, on with the routines. They are divided into four groups:

Group 1: conversions

/* is c the start of a utf8 sequence? */
#define isutf(c) (((c)&0xC0)!=0x80)

/* convert UTF-8 to UCS-4 (4-byte wide characters)
   srcsz = source size in bytes, or -1 if 0-terminated
   sz = dest size in # of wide characters
   returns # characters converted */
int u8_toucs(unsigned int *dest, int sz, char *src, int srcsz);

/* convert UCS-4 to UTF-8
   srcsz = number of source characters, or -1 if 0-terminated
   sz = size of dest buffer in bytes
   returns # characters converted */
int u8_toutf8(char *dest, int sz, unsigned int *src, int srcsz);

/* single character to UTF-8 */
int u8_wc_toutf8(char *dest, wchar_t ch);
Note that the library uses "unsigned int" as its wide character type.
You can convert a known number of bytes, or a NUL-terminated string. The length of a UTF-8 string is often communicated as a byte count, since that's what really matters. Recall that you can usually treat a UTF-8 string like a normal C-string with N characters (where N is the number of bytes in the UTF-8 sequence), with the possibility that some characters are >127.

Group 2: moving through UTF-8 strings

/* character number to byte offset */
int u8_offset(char *str, int charnum);

/* byte offset to character number */
int u8_charnum(char *s, int offset);

/* return next character, updating a byte-index variable */
unsigned int u8_nextchar(char *s, int *i);

/* move to next character */
void u8_inc(char *s, int *i);

/* move to previous character */
void u8_dec(char *s, int *i);

Group 3: Unicode escape sequences
In the absence of Unicode input methods, Unicode characters are often notated using special escape sequences beginning with \u or \U. \u expects up to four hexadecimal digits, and \U expects up to eight. With these routines your program can accept input and give output using such sequences if necessary.

/* assuming src points to the character after a backslash, read an
   escape sequence, storing the result in dest and returning the number of
   input characters processed */
int u8_read_escape_sequence(char *src, unsigned int *dest);

/* given a wide character, convert it to an ASCII escape sequence stored in
   buf, where buf is "sz" bytes. returns the number of characters output. */
int u8_escape_wchar(char *buf, int sz, unsigned int ch);

/* convert a string "src" containing escape sequences to UTF-8 */
int u8_unescape(char *buf, int sz, char *src);

/* convert UTF-8 "src" to ASCII with escape sequences.
   if escape_quotes is nonzero, quote characters will be preceded by
   backslashes as well. */
int u8_escape(char *buf, int sz, char *src, int escape_quotes);

/* utility predicates used by the above */
int octal_digit(char c);
int hex_digit(char c);

Group 4: replacements for standard functions

/* return a pointer to the first occurrence of ch in s, or NULL if not
   found. character index of found character returned in *charn. */
char *u8_strchr(char *s, unsigned int ch, int *charn);

/* same as the above, but searches a buffer of a given size instead of
   a NUL-terminated string. */
char *u8_memchr(char *s, unsigned int ch, size_t sz, int *charn);

/* count the number of characters in a UTF-8 string */
int u8_strlen(char *s);

/* given the string returned by setlocale(), determine whether the current
   locale speaks UTF-8 */
int u8_is_locale_utf8(char *locale);

/* these functions can print from UTF-8 strings. they make no assumptions
   about locale; you can circumvent them if is_locale_utf8 */
int u8_vprintf(char *fmt, va_list ap);
int u8_printf(char *fmt, ...);
utf8.c
utf8.h


甲醛是什么气味 易举易泄是什么原因 菊花和什么一起泡最好 黑天鹅是什么意思 南宁晚上有什么好玩的地方
金的部首是什么 肤如凝脂是什么意思 查生育能力挂什么科 早上六点是什么时辰 骶椎腰化什么意思
招商是什么工作 5月23日是什么星座 彻夜难眠什么意思 尿泡沫多吃什么药 为什么说有钱难买孕妇B
欺凌是什么意思 乙肝病毒是什么 乙肝通过什么途径传染 夏天适合种植什么蔬菜 白细胞高是什么原因
属鼠的幸运色是什么颜色hcv9jop3ns6r.cn 高丽参适合什么人吃hcv7jop6ns3r.cn 儿童办理身份证需要什么材料hcv8jop9ns0r.cn 指甲黑是什么原因hcv9jop0ns3r.cn 舌质是什么hcv9jop1ns5r.cn
眩晕症是什么症状hcv9jop4ns2r.cn 上火了吃什么水果降火最快ff14chat.com 葵花宝典是什么意思hlguo.com 阴虚火旺有什么表现症状hcv8jop5ns6r.cn 6月20日是什么节日hcv9jop1ns7r.cn
为什么一到晚上就咳嗽hcv7jop9ns9r.cn 肾阴虚的表现是什么hcv9jop4ns1r.cn hivab是什么检测onlinewuye.com fat是什么意思hcv8jop7ns2r.cn 维生素b6主治什么jingluanji.com
qid是什么意思hcv8jop1ns8r.cn 三拜九叩是什么意思hcv8jop1ns6r.cn 蜗牛是什么动物hcv8jop2ns3r.cn 言音读什么hcv8jop8ns9r.cn 白醋泡脚有什么好处hcv9jop5ns6r.cn
百度