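Character encoding, simply put, establishes a mapping from the characters we use every day to numbers. Anyone who studied finite state machines in a digital circuits course will remember that the first step in solving a problem with a digital circuit is to assign a binary number to each state in the problem domain, and only then to derive the logic expressions with the help of state machines and state diagrams. Character encoding does roughly the same job: it assigns a binary representation to each character that people use. Compared with encoding standards such as ASCII and GB2312, the strength of Unicode is that it covers the characters of almost every language in use. The official Unicode website provides a set of charts that specify which character corresponds to which binary representation, that is, to which Unicode code point: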
http://www.unicode.org/charts/PDF/
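The Unicode code space runs from U+0000 to U+10FFFF, which is 1,114,112 code points. Excluding the 2,048 surrogate code points, 1,112,064 of them are available for mapping characters. The code space is divided into 17 planes, each containing 2^16 (65,536) code points. The code points of a plane can be written as U+xx0000 through U+xxFFFF, where xx is a hexadecimal value from 0x00 to 0x10, giving 17 planes in total. The first plane is called the Basic Multilingual Plane (BMP), or Plane 0. The other planes are called Supplementary Planes. Within the BMP, the range U+D800 through U+DFFF is permanently reserved and never mapped to characters, so UTF-16 uses the code unit values 0xD800-0xDFFF to encode the code points of characters in the supplementary planes.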
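The plane layout described above reduces to simple bit arithmetic. The following is a minimal sketch in C; the function names unicode_plane and is_surrogate_code_point are made up for this illustration and do not come from Android or Harfbuzz.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: returns the plane number (0..16) of a code point. */
static unsigned unicode_plane(uint32_t cp) {
    assert(cp <= 0x10FFFF);      /* anything larger is outside the code space */
    return (unsigned)(cp >> 16); /* each plane holds 2^16 code points */
}

/* Hypothetical helper: returns 1 if cp lies in the reserved surrogate
 * range U+D800..U+DFFF of the BMP, 0 otherwise. */
static int is_surrogate_code_point(uint32_t cp) {
    return cp >= 0xD800 && cp <= 0xDFFF;
}
```

For example, unicode_plane(0x4E2D) is 0 (the BMP) and unicode_plane(0x1F600) is 1 (the first supplementary plane).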
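The first Unicode plane (code points U+0000 through U+FFFF) contains the most commonly used characters; this is the Basic Multilingual Plane, abbreviated BMP. Both UTF-16 and UCS-2 encode a code point in this range as a single 16-bit code unit whose value equals the code point. These BMP code points are the only ones that UCS-2 can represent.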
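A code point in the Supplementary Planes is encoded in UTF-16 as a pair of 16-bit code units (32 bits, 4 bytes in total), called a surrogate pair. The procedure is:
1. Subtract 0x10000 from the code point, which leaves a 20-bit value in the range 0x00000 to 0xFFFFF.
2. Add the high 10 bits of that value to 0xD800. The result, in the range 0xD800 to 0xDBFF, is the first code unit, called the high surrogate.
3. Add the low 10 bits of that value to 0xDC00. The result, in the range 0xDC00 to 0xDFFF, is the second code unit, called the low surrogate.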
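The surrogate pair encoding of a supplementary-plane code point can be sketched in C as follows. This is only an illustration of the standard UTF-16 procedure; the function name utf16_encode_supplementary is hypothetical and is not taken from the Android sources.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: encodes a supplementary-plane code point
 * (U+10000..U+10FFFF) as a UTF-16 surrogate pair. */
static void utf16_encode_supplementary(uint32_t cp, uint16_t *high, uint16_t *low) {
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    uint32_t v = cp - 0x10000;                /* 20-bit value, 0..0xFFFFF */
    *high = (uint16_t)(0xD800 + (v >> 10));   /* top 10 bits: 0xD800..0xDBFF */
    *low  = (uint16_t)(0xDC00 + (v & 0x3FF)); /* low 10 bits: 0xDC00..0xDFFF */
}
```

For U+10437, the subtraction gives 0x00437; its top 10 bits are 0x001 and its low 10 bits are 0x037, so the pair is 0xD801 0xDC37.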
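Because the ranges of high surrogates, low surrogates, and valid BMP characters do not overlap, searching is simple: part of one character's encoding can never be mistaken for a different part of another character's encoding. This means UTF-16 is self-synchronizing: whether a code unit starts a character can be determined by examining that code unit alone. UTF-8 has a similar advantage, but many earlier encoding schemes did not, and the text had to be parsed from the beginning to find the boundaries between characters.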
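As a small illustration of self-synchronization, the sketch below finds the start of the character that contains an arbitrary index without scanning from the beginning of the string. The function name utf16_char_start is hypothetical, and the code assumes the input is well-formed UTF-16.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns the index of the first code unit of the
 * character that contains position i in a UTF-16 string of len units. */
static size_t utf16_char_start(const uint16_t *s, size_t len, size_t i) {
    assert(i < len);
    /* A low surrogate (0xDC00..0xDFFF) is always the second half of a
     * pair, so the character starts one unit earlier. */
    if ((s[i] & 0xFC00) == 0xDC00 && i > 0 && (s[i - 1] & 0xFC00) == 0xD800)
        return i - 1;
    /* A BMP code unit or a high surrogate starts a character itself. */
    return i;
}
```

At most one neighbouring code unit is inspected, no matter how long the string is.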
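Since Android 4.0 (ICS), the text rendering flow in Android has been implemented mainly in frameworks/base/core/jni/android/graphics/TextLayoutCache.cpp.
One step of that flow splits each BiDi run into script runs. The code that does this looks roughly as follows: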
```cpp
while ((isRTL) ?
        hb_utf16_script_run_prev(&numCodePoints, &mShaperItem.item, mShaperItem.string,
                mShaperItem.stringLength, &indexFontRun) :
        hb_utf16_script_run_next(&numCodePoints, &mShaperItem.item, mShaperItem.string,
                mShaperItem.stringLength, &indexFontRun)) {
```
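Here the Harfbuzz functions do the actual work of splitting out script runs. mShaperItem.string and mShaperItem.stringLength describe the string to be split: the first is the string itself and the second is its length. indexFontRun is both an input and an output parameter. When iterating forward, its value on entry is the start position of the current script run in the string, and its value on return is the start position of the next script run. numCodePoints and mShaperItem.item are output parameters: on return, the former holds the number of Unicode code points in the script run, and the latter holds the run's start position in the string, its length, and its script.
Next, look at the implementations of hb_utf16_script_run_next() and hb_utf16_script_run_prev(), and at how Harfbuzz decodes a UTF-16 encoded string: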
```c
char
hb_utf16_script_run_next(unsigned *num_code_points, HB_ScriptItem *output,
                         const uint16_t *chars, size_t len, ssize_t *iter) {
  if (*iter == len)
    return 0;

  output->pos = *iter;
  const uint32_t init_cp = utf16_to_code_point(chars, len, iter);
  unsigned cps = 1;
  if (init_cp == HB_InvalidCodePoint)
    return 0;
  const HB_Script init_script = code_point_to_script(init_cp);
  HB_Script current_script = init_script;
  output->script = init_script;

  for (;;) {
    if (*iter == len)
      break;
    const ssize_t prev_iter = *iter;
    const uint32_t cp = utf16_to_code_point(chars, len, iter);
    if (cp == HB_InvalidCodePoint)
      return 0;
    cps++;
    const HB_Script script = code_point_to_script(cp);

    if (script != current_script) {
        /* BEGIN android-changed
           The condition was not correct by doing "a == b == constant"
           END android-changed */
        if (current_script == HB_Script_Inherited && init_script == HB_Script_Inherited) {
          // If we started off as inherited, we take whatever we can find.
          output->script = script;
          current_script = script;
          continue;
        } else if (script == HB_Script_Inherited) {
          continue;
        } else {
          *iter = prev_iter;
          cps--;
          break;
        }
    }
  }

  if (output->script == HB_Script_Inherited)
    output->script = HB_Script_Common;

  output->length = *iter - output->pos;
  if (num_code_points)
    *num_code_points = cps;
  return 1;
}

char
hb_utf16_script_run_prev(unsigned *num_code_points, HB_ScriptItem *output,
                         const uint16_t *chars, size_t len, ssize_t *iter) {
  if (*iter == (size_t) -1)
    return 0;

  const size_t ending_index = *iter;
  const uint32_t init_cp = utf16_to_code_point_prev(chars, len, iter);
  unsigned cps = 1;
  if (init_cp == HB_InvalidCodePoint)
    return 0;
  const HB_Script init_script = code_point_to_script(init_cp);
  HB_Script current_script = init_script;
  output->script = init_script;

  for (;;) {
    if (*iter < 0)
      break;
    const ssize_t prev_iter = *iter;
    const uint32_t cp = utf16_to_code_point_prev(chars, len, iter);
    if (cp == HB_InvalidCodePoint)
      return 0;
    cps++;
    const HB_Script script = code_point_to_script(cp);

    if (script != current_script) {
      if (current_script == HB_Script_Inherited && init_script == HB_Script_Inherited) {
        // If we started off as inherited, we take whatever we can find.
        output->script = script;
        current_script = script;
        continue;
      } else if (script == HB_Script_Inherited) {
        /* BEGIN android-changed
           We apply the same fix for Chrome to Android.
           Chrome team will talk with upsteam about it.
           Just assume that whatever follows this combining character is within
           the same script.  This is incorrect if you had language1 + combining
           char + language 2, but that is rare and this code is suspicious
           anyway.
           END android-changed */
        continue;
      } else {
        *iter = prev_iter;
        cps--;
        break;
      }
    }
  }

  if (output->script == HB_Script_Inherited)
    output->script = HB_Script_Common;

  output->pos = *iter + 1;
  output->length = ending_index - *iter;
  if (num_code_points)
    *num_code_points = cps;
  return 1;
}
```
Note the calls to utf16_to_code_point() and utf16_to_code_point_prev(), made once before each loop and once per iteration. These two functions do the real work in Harfbuzz of decoding one Unicode code point from the UTF-16 string and storing it in a 32-bit integer.
Here are the implementations of these two functions:
```c
uint32_t
utf16_to_code_point(const uint16_t *chars, size_t len, ssize_t *iter) {
  const uint16_t v = chars[(*iter)++];
  if (HB_IsHighSurrogate(v)) {
    // surrogate pair
    if (*iter >= len) {
      // the surrogate is incomplete.
      return HB_InvalidCodePoint;
    }
    const uint16_t v2 = chars[(*iter)++];
    if (!HB_IsLowSurrogate(v2)) {
      // invalidate surrogate pair.
      return HB_InvalidCodePoint;
    }

    return HB_SurrogateToUcs4(v, v2);
  }

  if (HB_IsLowSurrogate(v)) {
    // this isn't a valid code point
    return HB_InvalidCodePoint;
  }

  return v;
}

uint32_t
utf16_to_code_point_prev(const uint16_t *chars, size_t len, ssize_t *iter) {
  const uint16_t v = chars[(*iter)--];
  if (HB_IsLowSurrogate(v)) {
    // surrogate pair
    if (*iter < 0) {
      // the surrogate is incomplete.
      return HB_InvalidCodePoint;
    }
    const uint16_t v2 = chars[(*iter)--];
    if (!HB_IsHighSurrogate(v2)) {
      // invalidate surrogate pair.
      return HB_InvalidCodePoint;
    }

    return HB_SurrogateToUcs4(v2, v);
  }

  if (HB_IsHighSurrogate(v)) {
    // this isn't a valid code point
    return HB_InvalidCodePoint;
  }

  return v;
}
```
Read against the UTF-16 encoding method described earlier, this code is fairly clear. The helper macros it uses are defined as follows:
```c
#define HB_IsHighSurrogate(ucs) \
    (((ucs) & 0xfc00) == 0xd800)

#define HB_IsLowSurrogate(ucs) \
    (((ucs) & 0xfc00) == 0xdc00)

#define HB_SurrogateToUcs4(high, low) \
    (((HB_UChar32)(high))<<10) + (low) - 0x35fdc00;
```
That covers the essentials of how the text rendering code in Android decodes UTF-16 into Unicode code points.
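The constant 0x35fdc00 in HB_SurrogateToUcs4 is not explained in the source, but it appears to be the three offsets of the UTF-16 scheme folded into one: (0xD800 << 10) + 0xDC00 - 0x10000 = 0x35FDC00. The sketch below checks that reading by comparing a step-by-step decoder with the folded form. Both function names are hypothetical; unlike the macro, which has no outer parentheses and ends in a semicolon (so it is only safe as the whole expression of a statement such as a return), these are ordinary functions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: decodes a surrogate pair step by step, undoing
 * each offset of the encoding procedure separately. */
static uint32_t decode_pair_stepwise(uint16_t high, uint16_t low) {
    assert((high & 0xFC00) == 0xD800 && (low & 0xFC00) == 0xDC00);
    uint32_t top10 = (uint32_t)high - 0xD800; /* high 10 bits of the value */
    uint32_t low10 = (uint32_t)low - 0xDC00;  /* low 10 bits of the value */
    return ((top10 << 10) | low10) + 0x10000; /* add back the 0x10000 offset */
}

/* Hypothetical helper: the same computation with the three constants
 * folded together, as in the macro:
 *   (0xD800 << 10) + 0xDC00 - 0x10000 == 0x35FDC00 */
static uint32_t decode_pair_folded(uint16_t high, uint16_t low) {
    return ((uint32_t)high << 10) + (uint32_t)low - 0x35FDC00u;
}
```

For the pair 0xD801 0xDC37, both functions return 0x10437.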