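I recently came across code points and code units in a book and found the explanation hazy, so I searched Google and found a very good, very detailed article (I could not follow parts of the UTF-16 section, but I got the gist). Below are some brief notes based on my own understanding; see the original sources for the details.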
Code points and code units: citation: http://www.cnblogs.com/zhangzl419/archive/2013/05/21/3090601.html
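Earlier, while reading Core Java, Volume I, I came to section 3.6.5 on code points and code units and read it several times without fully understanding it. It was not until I recently found an online article, https://github.com/acmerfight/insight_python/blob/master/Unicode_and_Character_Sets.md, which reviews the history of encodings and explains why code points and code units came about, that I understood both concepts clearly.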
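Code point: Unicode falls within the scope of coded character sets (CCS). What Unicode does is map each character in the repertoire we need to represent to a number; that number is called the character's code point. For example, the code point of the character "严" in Unicode is U+4E25.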
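A code point is a concept that appears once a character set has been encoded. A character set is a set whose elements are characters; the ASCII character set, for example, contains characters such as 'A' and 'B'. To process a character set on a computer it has to be digitized, that is, every character in the set is given a number, and a program that needs a character simply uses that number. The result is a coded character set (Coded Character Set), in which every character corresponds to one number. That number is the code point.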
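The character-to-number mapping described above can be sketched in Python, whose built-in `ord` and `chr` convert between a character and its code point. The helper name `code_point_label` is made up for this illustration.

```python
def code_point_label(ch):
    """Format the code point of a single character in U+XXXX notation."""
    return f"U+{ord(ch):04X}"

print(code_point_label("严"))   # U+4E25
print(code_point_label("A"))    # U+0041
print(chr(0x4E25) == "严")      # True: chr() maps the number back to the character
```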
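Code unit: the minimal bit combination that can represent a unit of encoded text. In UTF-8 a code unit is 8 bits long; in UTF-16 it is 16 bits long. Put another way, the smallest unit of UTF-8 is one byte, and the smallest unit of UTF-16 is two bytes.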
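To make the difference concrete, here is a small sketch that lists the code units of the same character under UTF-8 and UTF-16. The two helper functions are hypothetical names chosen for this note; they rely only on Python's standard `str.encode`.

```python
def utf8_code_units(text):
    """Return the 8-bit code units of text in UTF-8, as hex strings."""
    return [f"{b:02X}" for b in text.encode("utf-8")]

def utf16_code_units(text):
    """Return the 16-bit code units of text in UTF-16, as hex strings."""
    # Big-endian without a BOM, so each pair of bytes reads as one code unit.
    data = text.encode("utf-16-be")
    return [data[i:i + 2].hex().upper() for i in range(0, len(data), 2)]

print(utf8_code_units("严"))    # ['E4', 'B8', 'A5']  three 8-bit code units
print(utf16_code_units("严"))   # ['4E25']            one 16-bit code unit
```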
Unicode glossary:http://www.unicode.org/glossary/
code unit : The minimal bit combination that can represent a unit of encoded text for processing or interchange. The Unicode Standard uses 8-bit code units in the UTF-8 encoding form, 16-bit code units in the UTF-16 encoding form, and 32-bit code units in the UTF-32 encoding form.
code point : (1) Any value in the Unicode codespace; that is, the range of integers from 0 to 10FFFF₁₆. Not all code points are assigned to encoded characters. (2) A value, or position, for a character, in any coded character set.
citation:Oracle doc
The Java platform uses the UTF-16 representation in char arrays and in the String and StringBuffer classes.
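One consequence is that a length measured in UTF-16 code units can be larger than the number of code points. The sketch below imitates that count in Python (used here only for illustration, not Java itself): `len()` on a Python string counts code points, while the hypothetical helper `utf16_length` counts 16-bit code units, which is what Java's `String.length()` reports.

```python
def utf16_length(text):
    """Number of 16-bit code units needed to store text as UTF-16."""
    # "utf-16-le" is used so that no byte order mark is prepended.
    return len(text.encode("utf-16-le")) // 2

sample = "a\U0001F600"          # 'a' followed by U+1F600, which lies above U+FFFF
print(len(sample))              # 2  code points
print(utf16_length(sample))     # 3  UTF-16 code units
```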
citation 1:http://msdn.microsoft.com/en-us/library/ms225454(v=vs.80).aspx
Code points and code units
In each encoding, the code points are mapped to one or more code units.
A "code unit" is a single unit within each encoding form. The code unit size is equivalent to the bit measurement for the particular encoding:
1. A code unit in UTF-8 consists of 8 bits.
2. A code unit in UTF-16 consists of 16 bits.
3. A code unit in UTF-32 consists of 32 bits.
4. In GB18030, a code unit consists of 8 bits.
Number of code units in each code point
The number of code units required to be mapped to a code point varies across encoding forms:
Multiple code units per code point are common in UTF-8 because of the smaller code units. The code points will be mapped to one, two, three, or four code units.
UTF-16 code units are twice as large as 8-bit code units. Therefore, any code point with a scalar value less than U+10000 is encoded with a single code unit.
For code points with a scalar value of U+10000 or higher, two code units are required per code point. These pairs of code units have a unique term in UTF-16: "Unicode surrogate pairs".
The 32-bit code unit used in UTF-32 is large enough that every code point is encoded as a single code unit.
Multiple code units per code point are common in GB18030 because of the smaller code units. The code points will be mapped to one, two, or four code units.
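The counts quoted above can be checked with a short sketch. It assumes Python's built-in codecs for the four encodings; `CODE_UNIT_BYTES`, `code_unit_count`, and `surrogate_pair` are names invented for this note. The surrogate pair is derived from the standard UTF-16 arithmetic: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves.

```python
# Code unit size in bytes for each encoding (the "-le" variants avoid a BOM).
CODE_UNIT_BYTES = {"utf-8": 1, "utf-16-le": 2, "utf-32-le": 4, "gb18030": 1}

def code_unit_count(ch, encoding):
    """Number of code units one character occupies in the given encoding."""
    return len(ch.encode(encoding)) // CODE_UNIT_BYTES[encoding]

def surrogate_pair(code_point):
    """Split a code point in U+10000..U+10FFFF into its UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low

for ch in ("A", "严", "\U0001F600"):
    counts = {enc: code_unit_count(ch, enc) for enc in CODE_UNIT_BYTES}
    print(f"U+{ord(ch):04X}", counts)

high, low = surrogate_pair(0x1F600)
print(f"{high:04X} {low:04X}")          # D83D DE00
```

For U+1F600 the loop reports four code units in UTF-8, two in UTF-16, one in UTF-32, and four in GB18030, matching the ranges listed in the quoted text.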
citation 2:What is a Unicode code unit and a Unicode code point?
citation 3:Unicode 4.0 support in J2SE 1.5
citation 4:counting Characters
Character:a character can usefully be defined as the smallest atomic unit of text with semantic value.
citation 5: On the Goodness of Unicode
citation 6: comparing a char to a code-point?
citation 7: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
It does not make sense to have a string without knowing what encoding it uses. If you have a string, in memory, in a file, or in an email message, you have to know what encoding it is in or you cannot interpret it or display it to users correctly.
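A minimal sketch of that point: the same three bytes yield different text depending on which encoding the reader assumes. The byte string here is simply "严" encoded as UTF-8; the variable names are illustrative.

```python
raw = "严".encode("utf-8")
print(raw)                              # b'\xe4\xb8\xa5'

# Decoded with the right encoding, the bytes give back one character.
print(raw.decode("utf-8") == "严")      # True

# Decoded as Latin-1, every byte becomes its own character: three wrong
# characters instead of one right one, and no error is raised.
wrong = raw.decode("latin-1")
print(len(wrong), ascii(wrong))         # 3 '\xe4\xb8\xa5'

# Decoded as ASCII, the bytes are rejected outright.
try:
    raw.decode("ascii")
except UnicodeDecodeError:
    print("not valid ASCII")
```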
BOM: byte order mark
citation 8: Wikipedia
citation 9: The Unicode signature BOM (Byte Order Mark) problem in UTF-8 files
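As a rough illustration of the UTF-8 BOM issue these citations discuss, the sketch below uses Python's standard `codecs` constants and the `utf-8-sig` codec. `strip_utf8_bom` is a hypothetical helper written for this note, not something taken from the cited articles.

```python
import codecs

def strip_utf8_bom(data):
    """Return data without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

with_bom = "hi".encode("utf-8-sig")          # this codec writes the BOM first
print(with_bom)                              # b'\xef\xbb\xbfhi'

# A plain UTF-8 decode keeps the BOM as an invisible U+FEFF character.
print(ascii(with_bom.decode("utf-8")))       # '\ufeffhi'

# The "utf-8-sig" codec, or the helper above, removes it.
print(with_bom.decode("utf-8-sig"))          # hi
print(strip_utf8_bom(with_bom))              # b'hi'
```

The leftover U+FEFF is why a file saved with a BOM can show a stray character or fail a string comparison when it is read as plain UTF-8.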
citation 10: Regex Tutorial - Unicode Characters and Properties