This is a translation of Joel Spolsky's classic article "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets".
[A few small changes were made during translation to improve readability.]
Original: http://www.joelonsoftware.com/articles/Unicode.html
Have you ever been puzzled by the Content-Type tag in an HTML file? For example:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
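You see it all the time, but do you know what it is actually for?
Have you ever received an email from abroad (from Bulgaria, say) whose subject line did not display properly and came out as "???? ?????? ??? ????"?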
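Then I looked at another commercial library and, sure enough, its handling of character encodings was just as bad. I got in touch with its developer, who said there was nothing that could be done about it. Like many programmers, he simply hoped the problem would go away on its own, which baffles me.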
Problems like this do not go away on their own. When I discovered that the popular web development tool PHP has almost complete ignorance of character encoding issues, blithely using 8 bits for characters, making it darn near impossible to develop good international web applications, I thought, enough is enough.
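So I have an announcement to make: if you are a working programmer and you do not know the basics of characters, character sets, encodings, and Unicode, and I catch you, I swear I will put you in a submarine and make you peel onions for 6 months straight.
And I have one more thing to tell you: this seemingly arcane material is not that hard to understand.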
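This article covers the bare minimum that every programmer should know. Everything you have heard along the lines of "plain text = ascii = characters are 8 bits" is not just wrong, it is hopelessly wrong. If you still program that way, you are like a doctor who does not believe in germs. So please do not write another line of code until you have finished reading this article.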
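Before we start, I should warn you that if you are one of the few people who already understand internationalization, you may find the discussion below a little oversimplified. I am trying to set the bar as low as possible, so that everyone can grasp the general principles and write code that works with text in any language, not just English strings. I should also tell you that character handling is only a small part of what it takes to build software that works internationally. I can only write about one thing at a time, so today it is character sets and character encodings.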
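A Historical Perspective
The easiest way to understand this stuff is to follow its development in chronological order.
We will not go as far back as EBCDIC, though. It is simply too old, and that part of the history is not worth covering.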
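Back in the semi-olden days, when Unix was being invented and K&R were writing The C Programming Language, everything was very simple. EBCDIC was on its way out. The only characters that mattered were good old unaccented English letters (some European languages put accent marks on top of letters), and we had a code for them called ASCII, which could represent every character using a number between 32 and 127. Space was 32, the letter "A" was 65, and so on. These characters could conveniently be stored in 7 bits. Most computers in those days used 8-bit bytes, so a byte could hold every possible ASCII character and still have one bit to spare. If you were wicked, you could use that bit for your own devious purposes: the word processor WordStar turned on the high bit to mark the last letter of a word.
Codes below 32 were called unprintable. They were used for cussing. Just kidding. They were actually used as control characters: 7 made your computer beep, and 12 made the printer eject the current page and feed in a new one.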
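To make the numbers concrete, here is a small Python sketch of the ASCII facts above. The function names (fits_in_7_bits, mark_last_letter, strip_high_bit) are made up for this illustration, and the high-bit trick only imitates the idea attributed to WordStar; it is not WordStar's actual file format.

```python
# Illustrative sketch: ASCII codes fit in 7 bits, leaving the eighth bit spare.

def fits_in_7_bits(text: str) -> bool:
    """Return True if every character's code number is below 128."""
    return all(ord(ch) < 128 for ch in text)

def mark_last_letter(word: bytes) -> bytes:
    """Hypothetical WordStar-like trick: set the high bit on the final byte."""
    if not word:
        return word
    return word[:-1] + bytes([word[-1] | 0x80])

def strip_high_bit(data: bytes) -> bytes:
    """Clear the eighth bit of every byte to get back 7-bit ASCII."""
    return bytes(b & 0x7F for b in data)

print(ord(" "), ord("A"))        # code numbers for space and "A"
print(ord("\a"), ord("\f"))      # control codes: bell (beep) and form feed
marked = mark_last_letter(b"cat")
print(marked)                    # last byte now has its high bit set
print(strip_high_bit(marked))    # masking with 0x7F recovers the original
print(fits_in_7_bits("resume"), fits_in_7_bits("résumé"))
```

Any program that assumes the eighth bit is free will mangle text that uses it for something else, which is where the trouble described next begins.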
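As long as the language you used was American English, all seemed well.
Because a byte still had room left over, lots of people got to thinking, "gosh, we can use the codes 128 to 255 for our own purposes." The trouble was that lots of people had this idea at the same time, and each of them filled the space from 128 to 255 according to their own wishes.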
The IBM-PC had something that came to be known as the OEM character set which provided some accented characters for European languages and a bunch of line drawing characters... horizontal bars, vertical bars, horizontal bars with little dingle-dangles dangling off the right side, etc., and you could use these line drawing characters to make spiffy boxes and lines on the screen, which you can still see running on the 8088 computer at your dry cleaners'. In fact as soon as people started buying PCs outside of America all kinds of different OEM character sets were dreamed up, which all used the top 128 characters for their own purposes. For example on some PCs the character code 130 would display as é, but on computers sold in Israel it was the Hebrew letter Gimel (ג), so when Americans would send their résumés to Israel they would arrive as rגsumגs. In many cases, such as Russian, there were lots of different ideas of what to do with the upper-128 characters, so you couldn't even reliably interchange Russian documents.
Eventually this OEM free-for-all got codified in the ANSI standard. In the ANSI standard, everybody agreed on what to do below 128, which was pretty much the same as ASCII, but there were lots of different ways to handle the characters from 128 and on up, depending on where you lived. These different systems were called code pages. So for example in Israel DOS used a code page called 862, while Greek users used 737. They were the same below 128 but different from 128 up, where all the funny letters resided. The national versions of MS-DOS had dozens of these code pages, handling everything from English to Icelandic and they even had a few "multilingual" code pages that could do Esperanto and Galician on the same computer! Wow! But getting, say, Hebrew and Greek on the same computer was a complete impossibility unless you wrote your own custom program that displayed everything using bitmapped graphics, because Hebrew and Greek required different code pages with different interpretations of the high numbers.
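The code page problem is easy to reproduce today. The sketch below uses Python's built-in codecs cp437 (the original IBM PC OEM set), cp862 (Hebrew DOS) and cp737 (Greek DOS) to decode the same byte three ways. The helper name decode_with is made up for this example.

```python
# One byte value, three different characters, depending on the code page.
BYTE_130 = bytes([130])

def decode_with(code_page: str, data: bytes) -> str:
    """Interpret raw bytes according to the named code page."""
    return data.decode(code_page)

for code_page in ("cp437", "cp862", "cp737"):
    # Below 128 the code pages agree; from 128 up they do not.
    print(code_page, decode_with(code_page, b"A"), decode_with(code_page, BYTE_130))

# A resume written on a US PC and opened on a PC using the Hebrew code page.
resume_bytes = "résumé".encode("cp437")
as_seen_in_israel = decode_with("cp862", resume_bytes)
print(resume_bytes)
print(as_seen_in_israel)
```

Nothing in the bytes themselves says which code page they were written in, so the receiving machine can only guess or fall back on its own default.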
Meanwhile, in Asia, even more crazy things were going on to take into account the fact that Asian alphabets have thousands of letters, which were never going to fit into 8 bits. This was usually solved by the messy system called DBCS, the "double byte character set" in which some letters were stored in one byte and others took two. It was easy to move forward in a string, but dang near impossible to move backwards. Programmers were encouraged not to use s++ and s-- to move backwards and forwards, but instead to call functions such as Windows' AnsiNext and AnsiPrev which knew how to deal with the whole mess.
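Here is a rough sketch of why walking backwards through a DBCS string is so awkward, using Shift JIS as the example encoding (my choice; the text above does not name one). The lead-byte ranges and the helper names is_lead_byte, char_starts and prev_char are assumptions made for this illustration, and they are not how AnsiNext or AnsiPrev are actually implemented.

```python
def is_lead_byte(b: int) -> bool:
    """Assumed Shift JIS lead-byte ranges: 0x81-0x9F and 0xE0-0xFC."""
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC

def char_starts(data: bytes) -> list:
    """Walk forward from the start and record where each character begins."""
    starts = []
    i = 0
    while i < len(data):
        starts.append(i)
        # A lead byte means this character occupies two bytes.
        i += 2 if is_lead_byte(data[i]) else 1
    return starts

def prev_char(data: bytes, pos: int) -> int:
    """Find the start of the character before pos by rescanning from offset 0."""
    before = [s for s in char_starts(data) if s < pos]
    return before[-1] if before else 0

data = "aソb".encode("shift_jis")
print(data)                 # the second byte of the katakana is 0x5C, same as "\"
print(char_starts(data))    # offsets where characters really start
print(prev_char(data, 3))   # correct step back from "b"; a naive pos - 1 lands mid-character
```

Going forward, the lead byte tells you how far to jump. Going backward, the byte at pos - 1 could be a whole character or the tail of a two-byte one, so this sketch plays it safe and rescans from the beginning.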
But still, most people just pretended that a byte was a character and a character was 8 bits and as long as you never moved a string from one computer to another, or spoke more than one language, it would sort of always work. But of course, as soon as the Internet happened, it became quite commonplace to move strings from one computer to another, and the whole mess came tumbling down. Luckily, Unicode had been invented.