每个程序员都必须知道的Unicode与字符集最基础的知识

原文出处:the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses 

Ever wonder about that mysterious Content-Type tag? You know, the one you’re supposed to put in HTML and you never quite know what it should be?

你有没有好奇过那个神秘的Content-Type标签?你知道的,就是那个你应该放在HTML里、却永远不太清楚它的值应该是什么的标签?

Did you ever get an email from your friends in Bulgaria with the subject line “???? ?????? ??? ????”?

你有没有从你保加利亚的朋友那里收到主题为“???? ?????? ??? ????”的邮件?

I’ve been dismayed to discover just how many software developers aren’t really completely up to speed on the mysterious world of character sets, encodings, Unicode, all that stuff. A couple of years ago, a beta tester for FogBUGZ was wondering whether it could handle incoming email in Japanese. Japanese? They have email in Japanese? I had no idea. When I looked closely at the commercial ActiveX control we were using to parse MIME email messages, we discovered it was doing exactly the wrong thing with character sets, so we actually had to write heroic code to undo the wrong conversion it had done and redo it correctly. When I looked into another commercial library, it, too, had a completely broken character code implementation. I corresponded with the developer of that package and he sort of thought they “couldn’t do anything about it.” Like many programmers, he just wished it would all blow over somehow.

我一直很沮丧地发现,有那么多软件开发人员并没有真正完全掌握字符集、编码、Unicode这些神秘的东西。几年前,一位FogBUGZ的beta测试人员想知道它能否处理日语的邮件。日语?他们有日语的邮件吗?我完全不知道。当我仔细查看我们用来解析MIME电子邮件的商业ActiveX控件时,我们发现它在字符集上做的事情完全是错的,因此我们不得不写一些“英勇”的代码来撤销它做的错误转换,再正确地重做一遍。当我查看另一个商业库时,它的字符编码实现也完全是坏的。我和那个软件包的开发人员交流过,他有点觉得他们“无能为力”。跟许多程序员一样,他只是希望这件事能以某种方式自己过去。

But it won’t. When I discovered that the popular web development tool PHP has almost complete ignorance of character encoding issues, blithely using 8 bits for characters, making it darn near impossible to develop good international web applications, I thought, enough is enough.

但是它不会。当我发现流行的web开发工具PHP几乎完全无视字符编码问题,轻率地用8位来表示字符,使得开发好的国际化web应用几乎不可能时,我想,够了,到此为止。

So I have an announcement to make: if you are a programmer working in 2003 and you don’t know the basics of characters, character sets, encodings, and Unicode, and I catch you, I’m going to punish you by making you peel onions for 6 months in a submarine. I swear I will.

因此我要宣布:如果你是一个在2003年工作的程序员,却不知道字符、字符集、编码和Unicode的基础知识,而且被我抓到了,我会惩罚你去潜艇里剥6个月的洋葱。我发誓我会的。

IT’S NOT THAT HARD.

这并不难。

In this article I’ll fill you in on exactly what every working programmer should know. All that stuff about “plain text = ascii = characters are 8 bits” is not only wrong, it’s hopelessly wrong, and if you’re still programming that way, you’re not much better than a medical doctor who doesn’t believe in germs. Please do not write another line of code until you finish reading this article.

在这篇文章中,我将告诉你每个正在工作的程序员究竟应该知道些什么。所有关于“纯文本=ASCII=字符是8位的”的说法,不仅是错的,而且错得无可救药。如果你还在以这种方式编程,你比一个不相信细菌存在的医生好不到哪里去。在读完这篇文章之前,请不要再写任何一行代码。

Before I get started, I should warn you that if you are one of those rare people who knows about internationalization, you are going to find my entire discussion a little bit oversimplified. I’m really just trying to set a minimum bar here so that everyone can understand what’s going on and can write code that has a hope of working with text in any language other than the subset of English that doesn’t include words with accents. And I should warn you that character handling is only a tiny portion of what it takes to create software that works internationally, but I can only write about one thing at a time so today it’s character sets.

在开始之前,我应该提醒你:如果你是那些少数了解国际化的人之一,你会发现我的整个讨论有点过于简化。我在这里只是想设置一个最低的门槛,让每个人都能明白到底发生了什么,并且能写出有希望处理任何语言文本的代码,而不是只能处理“不含重音词的英语子集”。我还要提醒你,字符处理只是创建国际化软件所需工作中很小的一部分,但我一次只能写一件事,所以今天只谈字符集。

A Historical Perspective

The easiest way to understand this stuff is to go chronologically.

You probably think I’m going to talk about very old character sets like EBCDIC here. Well, I won’t. EBCDIC is not relevant to your life. We don’t have to go that far back in time.

理解这些东西最容易的方式是按时间顺序来讲。

也许你会认为我要从EBCDIC这样非常古老的字符集说起。我不会的。EBCDIC跟你的生活没有关系。我们不必回溯到那么久远。

Back in the semi-olden days, when Unix was being invented and K&R were writing The C Programming Language, everything was very simple. EBCDIC was on its way out. The only characters that mattered were good old unaccented English letters, and we had a code for them called ASCII which was able to represent every character using a number between 32 and 127. Space was 32, the letter “A” was 65, etc. This could conveniently be stored in 7 bits. Most computers in those days were using 8-bit bytes, so not only could you store every possible ASCII character, but you had a whole bit to spare, which, if you were wicked, you could use for your own devious purposes: the dim bulbs at WordStar actually turned on the high bit to indicate the last letter in a word, condemning WordStar to English text only. Codes below 32 were called unprintable and were used for cussing. Just kidding. They were used for control characters, like 7 which made your computer beep and 12 which caused the current page of paper to go flying out of the printer and a new one to be fed in.

早在半古老的时代,Unix刚刚被发明出来,K&R还在写《C程序设计语言》,一切都非常简单。EBCDIC正在退出历史舞台。唯一重要的字符是优良的、不带重音的英文字母,我们有一个称为ASCII的编码来表示它们,它能用32到127之间的数字表示每一个字符。例如,32表示空格,65表示字母“A”。这用7个比特就可以方便地存下。那个时代大部分电脑使用8位的字节,因此你不仅可以存储每一个可能的ASCII字符,还能空出整整一个比特。如果你心术不正,还可以把它用在自己的歪门邪道上:WordStar的那些糊涂蛋就真的用最高位来标记一个单词的最后一个字母,这让WordStar注定只能处理英文文本。小于32的编码被称为不可打印字符,是用来骂人的。开个玩笑。它们被用作控制字符,比如7会让你的电脑发出哔的一声,12会让打印机把当前这页纸送出来、再进一张新的。


And all was good, assuming you were an English speaker.

Because bytes have room for up to eight bits, lots of people got to thinking, “gosh, we can use the codes 128-255 for our own purposes.” The trouble was, lots of people had this idea at the same time, and they had their own ideas of what should go where in the space from 128 to 255. The IBM-PC had something that came to be known as the OEM character set which provided some accented characters for European languages and a bunch of line drawing characters… horizontal bars, vertical bars, horizontal bars with little dingle-dangles dangling off the right side, etc., and you could use these line drawing characters to make spiffy boxes and lines on the screen, which you can still see running on the 8088 computer at your dry cleaners’. In fact  as soon as people started buying PCs outside of America all kinds of different OEM character sets were dreamed up, which all used the top 128 characters for their own purposes. For example on some PCs the character code 130 would display as é, but on computers sold in Israel it was the Hebrew letter Gimel (ג), so when Americans would send their résumés to Israel they would arrive as rגsumגs. In many cases, such as Russian, there were lots of different ideas of what to do with the upper-128 characters, so you couldn’t even reliably interchange Russian documents.

假设你是一位说英语的人,一切都很好。

因为一个字节最多有8位的空间,所以很多人就想,“我们可以把128-255这些编码用于我们自己的目的”。问题是,很多人同时有这个想法,而且对于128到255这段空间里该放什么,各有各的主意。IBM-PC上有某种后来被称为OEM字符集的东西,它为欧洲语言提供了一些重音字符,还有一堆画线字符……水平线、垂直线、右侧带着小突起的水平线等等。你可以用这些画线字符在屏幕上画出漂亮的方框和线条,如今你仍然能在干洗店的8088电脑上看到它们。事实上,一旦人们开始在美国以外的地区购买PC,各种各样的OEM字符集就被设想了出来,它们都把高位的128个字符用于自己的目的。例如,在一些电脑上字符编码130显示为é,但在以色列出售的电脑上,它是希伯来字母Gimel(ג)。因此当美国人把résumés(简历)发往以色列时,到那边就成了rגsumגs。在许多情况下,例如俄语,对于如何处理高位的128个字符有很多种不同的想法,所以你甚至无法可靠地交换俄语文档。
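上面“130在不同电脑上显示为不同字母”的现象,可以用一个小例子演示(这里用Python示意,并非原文的代码;cp437和cp862分别是IBM-PC和以色列DOS所用的代码页):

```python
# 同一个字节 130(0x82)在不同的 OEM 代码页下是不同的字母
raw = bytes([130])
print(raw.decode("cp437"))   # IBM-PC 的 OEM 代码页:é
print(raw.decode("cp862"))   # 以色列 DOS 的代码页:希伯来字母 Gimel(ג)

# "résumés" 按 cp437 编码、再按 cp862 解码,就变成了原文所说的乱码
print("résumés".encode("cp437").decode("cp862"))   # rגsumגs
```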

[图:OEM字符集]


Eventually this OEM free-for-all got codified in the ANSI standard. In the ANSI standard, everybody agreed on what to do below 128, which was pretty much the same as ASCII, but there were lots of different ways to handle the characters from 128 and on up, depending on where you lived. These different systems were called code pages. So for example in Israel DOS used a code page called 862, while Greek users used 737. They were the same below 128 but different from 128 up, where all the funny letters resided. The national versions of MS-DOS had dozens of these code pages, handling everything from English to Icelandic and they even had a few “multilingual” code pages that could do Esperanto and Galician on the same computer! Wow! But getting, say, Hebrew and Greek on the same computer was a complete impossibility unless you wrote your own custom program that displayed everything using bitmapped graphics, because Hebrew and Greek required different code pages with different interpretations of the high numbers.

最终,这场OEM的混战被编入了ANSI标准。在ANSI标准中,对于128以下该做什么,所有人达成了一致,基本上就是ASCII;但是对于128及以上的字符,有很多种不同的处理方式,取决于你住在哪里。这些不同的系统称为代码页。例如在以色列,DOS使用一个叫862的代码页,而希腊用户使用737。它们在128以下都一样,但从128开始就不同了,所有有趣的字母都住在那一段里。MS-DOS的各国版本有几十个这样的代码页,覆盖从英语到冰岛语的一切,他们甚至还有几个“多语言”代码页,可以在同一台计算机上处理世界语和加利西亚语!哇!但是,想在同一台机器上同时使用希伯来语和希腊语是完全不可能的,除非你编写了自己的定制程序、用位图图形来显示所有内容,因为希伯来语和希腊语需要不同的代码页,对高位数字有不同的解释。

Meanwhile, in Asia, even more crazy things were going on to take into account the fact that Asian alphabets have thousands of letters, which were never going to fit into 8 bits. This was usually solved by the messy system called DBCS, the “double byte character set” in which some letters were stored in one byte and others took two. It was easy to move forward in a string, but dang near impossible to move backwards. Programmers were encouraged not to use s++ and s-- to move backwards and forwards, but instead to call functions such as Windows’ AnsiNext and AnsiPrev which knew how to deal with the whole mess.

与此同时,在亚洲发生着更疯狂的事情:亚洲文字有成千上万个字符,8位根本装不下。这通常由一个叫DBCS(“双字节字符集”)的混乱系统来解决,在这个系统中,有些字符存成一个字节,有些占两个字节。在字符串中向前移动很容易,但向后移动几乎不可能。程序员们被鼓励不要用s++和s--来前后移动,而是调用Windows的AnsiNext和AnsiPrev之类知道如何对付这摊烂事的函数。
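双字节字符集里“一个字符占一到两个字节”的情况可以这样演示(Python示意;这里用GBK举例,原文并未指定具体的DBCS编码):

```python
# GBK 是一种典型的 DBCS:ASCII 字母占 1 个字节,汉字占 2 个字节
data = "a中b".encode("gbk")
print(len(data))   # 4 = 1 + 2 + 1
# 向前扫描很容易;但随便指着中间某个字节,你无法判断它是
# 一个双字节字符的前一半还是后一半,所以向后移动几乎不可能,
# 这正是需要 AnsiNext/AnsiPrev 这类函数的原因
```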

But still, most people just pretended that a byte was a character and a character was 8 bits and as long as you never moved a string from one computer to another, or spoke more than one language, it would sort of always work. But of course, as soon as the Internet happened, it became quite commonplace to move strings from one computer to another, and the whole mess came tumbling down. Luckily, Unicode had been invented.

但是,大多数人仍然假装一个字节就是一个字符、一个字符就是8位。只要你从不把字符串从一台电脑搬到另一台电脑,也不说一种以上的语言,这大体上总能工作。当然,互联网一出现,把字符串从一台电脑搬到另一台电脑就变得稀松平常,整个烂摊子轰然倒塌。幸运的是,Unicode已经被发明出来了。

Unicode

Unicode was a brave effort to create a single character set that included every reasonable writing system on the planet and some make-believe ones like Klingon, too. Some people are under the misconception that Unicode is simply a 16-bit code where each character takes 16 bits and therefore there are 65,536 possible characters. This is not, actually, correct. It is the single most common myth about Unicode, so if you thought that, don’t feel bad.

Unicode是一次勇敢的尝试:创建一个单一的字符集,包含地球上每一个合理的书写系统,还有一些像克林贡语这样虚构的书写系统。有些人误以为Unicode只是一种16位的编码,每个字符占16位,因此最多有65,536个可能的字符。实际上,这是不正确的。这是关于Unicode最常见的误解,所以如果你也这么想,不必难过。

In fact, Unicode has a different way of thinking about characters, and you have to understand the Unicode way of thinking of things or nothing will make sense.

事实上,Unicode用一种不同的方式来理解字符,你必须理解Unicode的这种思路,否则一切都说不通。

Until now, we’ve assumed that a letter maps to some bits which you can store on disk or in memory:

A -> 0100 0001

In Unicode, a letter maps to something called a code point which is still just a theoretical concept. How that code point is represented in memory or on disk is a whole nuther story.

到目前为止,我们假设一个字母对应你能够存储在磁盘或者内存中的一些位:

A -> 0100 0001

在Unicode中,一个字母映射到一个被称为代码点的东西,代码点仍然只是一个理论上的概念。至于这个代码点在内存或磁盘上如何表示,则完全是另一回事。

In Unicode, the letter A is a platonic ideal. It’s just floating in heaven:

                                                                                                             A

This platonic A is different than B, and different from a, but the same as A and A and A. The idea that A in a Times New Roman font is the same character as the A in a Helvetica font, but different from “a” in lower case, does not seem very controversial, but in some languages just figuring out what a letter is can cause controversy. Is the German letter ß a real letter or just a fancy way of writing ss? If a letter’s shape changes at the end of the word, is that a different letter? Hebrew says yes, Arabic says no. Anyway, the smart people at the Unicode consortium have been figuring this out for the last decade or so, accompanied by a great deal of highly political debate, and you don’t have to worry about it. They’ve figured it all out already.

在Unicode中,字母A是一个柏拉图式的理念,它只是飘在天上:

                                                                                                             A

这个柏拉图式的A不同于B,也不同于a,但与A、A和A相同。Times New Roman字体中的A和Helvetica字体中的A是同一个字符、却不同于小写的“a”,这个想法看起来没什么争议,但在某些语言中,弄清楚一个字母到底是什么就会引起争议。德语字母ß是一个真正的字母,还是只是ss的一种花哨写法?如果一个字母的形状在词尾会变化,那它是不是另一个字母?希伯来语说是,阿拉伯语说不是。无论如何,Unicode联盟的聪明人在过去十年左右一直在搞清楚这些问题,伴随着大量高度政治化的争论,而你不必为此操心。他们已经全都搞清楚了。

Every platonic letter in every alphabet is assigned a magic number by the Unicode consortium which is written like this: U+0639. This magic number is called a code point. The U+ means “Unicode” and the numbers are hexadecimal. U+0639 is the Arabic letter Ain. The English letter A would be U+0041. You can find them all using the charmap utility on Windows 2000/XP or visiting the Unicode web site.

每个字母表中的每个柏拉图式字母都由Unicode联盟分配了一个魔法数字(magic number),写成类似U+0639这样。这个魔法数字被称为代码点。“U+”代表“Unicode”,数字是十六进制的。U+0639是阿拉伯字母Ain。英文字母A是U+0041。你可以在Windows 2000/XP上用charmap工具找到它们,或者访问Unicode官网(the Unicode web site)。
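“代码点只是一个数字”这一点可以用Python标准库验证一下(示意代码,非原文内容):

```python
import unicodedata

# 英文字母 A 的代码点是 U+0041,阿拉伯字母 Ain 是 U+0639
print(hex(ord("A")))                 # 0x41
print(unicodedata.name("\u0639"))    # ARABIC LETTER AIN
```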

There is no real limit on the number of letters that Unicode can define and in fact they have gone beyond 65,536 so not every unicode letter can really be squeezed into two bytes, but that was a myth anyway.

Unicode能定义的字母数量没有实际的上限,事实上它们已经超过了65,536,因此并不是每个Unicode字母都真的能塞进两个字节里。不过,“两个字节”本来就是个误解。

OK, so say we have a string:

Hello

which, in Unicode, corresponds to these five code points:

U+0048 U+0065 U+006C U+006C U+006F.

Just a bunch of code points. Numbers, really. We haven’t yet said anything about how to store this in memory or represent it in an email message.

在Unicode中,字符串Hello对应五个代码点:U+0048 U+0065 U+006C U+006C U+006F。仅仅是一堆代码点,实际上就是一堆数字。到目前为止,我们还没有谈到如何把它们存储在内存里,或者如何在电子邮件中表示它们。
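把Hello拆成代码点的过程可以写成一行(Python示意,非原文内容):

```python
# 字符串 Hello 对应的五个代码点:只是数字,还没有涉及任何存储方式
code_points = ["U+%04X" % ord(c) for c in "Hello"]
print(" ".join(code_points))   # U+0048 U+0065 U+006C U+006C U+006F
```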

Encodings

That’s where encodings come in.

The earliest idea for Unicode encoding, which led to the myth about the two bytes, was, hey, let’s just store those numbers in two bytes each. So Hello becomes

00 48 00 65 00 6C 00 6C 00 6F

Right? Not so fast! Couldn’t it also be:

48 00 65 00 6C 00 6C 00 6F 00 ?

Well, technically, yes, I do believe it could, and, in fact, early implementors wanted to be able to store their Unicode code points in high-endian or low-endian mode, whichever their particular CPU was fastest at, and lo, it was evening and it was morning and there were already two ways to store Unicode. So the people were forced to come up with the bizarre convention of storing a FE FF at the beginning of every Unicode string; this is called a Unicode Byte Order Mark and if you are swapping your high and low bytes it will look like a FF FE and the person reading your string will know that they have to swap every other byte. Phew. Not every Unicode string in the wild has a byte order mark at the beginning.

Unicode编码最早的想法,也就是导致“两个字节”这个误解的想法是:嘿,我们就把这些数字每个都存成两个字节吧。于是Hello变成了00 48 00 65 00 6C 00 6C 00 6F。对吗?别急!它难道不能存成48 00 65 00 6C 00 6C 00 6F 00吗?从技术上说,我相信确实可以。事实上,早期的实现者希望能以大端(high-endian)或小端(low-endian)两种模式存储Unicode代码点,取决于他们特定的CPU哪种更快。这样一来,存储Unicode就已经有了两种方式。因此人们不得不想出一个古怪的惯例:在每个Unicode字符串的开头存上FE FF,这被称为Unicode字节顺序标记(Byte Order Mark)。如果高低字节交换了,它看起来就是FF FE,读取字符串的人就知道必须把后面的每两个字节都交换过来。唉。但是,现实中并非每个Unicode字符串的开头都有字节顺序标记。
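两种字节序以及字节顺序标记(BOM)可以这样演示(Python示意,非原文内容):

```python
import codecs

# 同样的代码点,两种字节序
print("Hello".encode("utf-16-be").hex(" "))   # 00 48 00 65 00 6c 00 6c 00 6f
print("Hello".encode("utf-16-le").hex(" "))   # 48 00 65 00 6c 00 6c 00 6f 00

# 字节顺序标记:大端是 FE FF;如果高低字节交换了,看到的就是 FF FE
print(codecs.BOM_UTF16_BE.hex(" "))   # fe ff
print(codecs.BOM_UTF16_LE.hex(" "))   # ff fe
```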

For a while it seemed like that might be good enough, but programmers were complaining. “Look at all those zeros!” they said, since they were Americans and they were looking at English text which rarely used code points above U+00FF. Also they were liberal hippies in California who wanted to conserve (sneer). If they were Texans they wouldn’t have minded guzzling twice the number of bytes. But those Californian wimps couldn’t bear the idea of doubling the amount of storage it took for strings, and anyway, there were already all these doggone documents out there using various ANSI and DBCS character sets and who’s going to convert them all? Moi? For this reason alone most people decided to ignore Unicode for several years and in the meantime things got worse.

有一段时间,这看起来似乎够好了,但程序员们开始抱怨。“看看那些0!”他们说。因为他们是美国人,看的是很少用到U+00FF以上代码点的英文文本。他们也是加州的自由派嬉皮士,崇尚节约(冷笑)。如果他们是德克萨斯人,才不会介意吞掉两倍的字节数。但那些加利福尼亚的软蛋无法忍受字符串存储量翻倍的想法,况且,世上已经有那么多用各种ANSI和DBCS字符集写的文档了,谁去把它们全部转换?我吗?仅仅因为这个原因,大多数人决定在接下来的几年里无视Unicode,与此同时,事情变得更糟了。

Thus was invented the brilliant concept of UTF-8. UTF-8 was another system for storing your string of Unicode code points, those magic U+ numbers, in memory using 8 bit bytes. In UTF-8, every code point from 0-127 is stored in a single byte. Only code points 128 and above are stored using 2, 3, in fact, up to 6 bytes.

于是,UTF-8这个天才的概念被发明了出来。UTF-8是另一个用8位的字节在内存中存储Unicode代码点字符串(那些魔法的U+数字)的系统。在UTF-8中,0-127的每个代码点都存储在单个字节里。只有128及以上的代码点才用2个、3个、实际上最多6个字节来存储。

This has the neat side effect that English text looks exactly the same in UTF-8 as it did in ASCII, so Americans don’t even notice anything wrong. Only the rest of the world has to jump through hoops. Specifically, Hello, which was U+0048 U+0065 U+006C U+006C U+006F, will be stored as 48 65 6C 6C 6F, which, behold! is the same as it was stored in ASCII, and ANSI, and every OEM character set on the planet. Now, if you are so bold as to use accented letters or Greek letters or Klingon letters, you’ll have to use several bytes to store a single code point, but the Americans will never notice. (UTF-8 also has the nice property that ignorant old string-processing code that wants to use a single 0 byte as the null-terminator will not truncate strings).

这带来一个很妙的副作用:英文文本在UTF-8中看起来和在ASCII中一模一样,所以美国人甚至不会察觉有任何问题。只有世界上其他地方的人不得不费尽周折。具体来说,Hello,即U+0048 U+0065 U+006C U+006C U+006F,会被存储为48 65 6C 6C 6F。看!这和它在ASCII、ANSI以及地球上每一个OEM字符集中的存储方式都一样。现在,如果你大胆地使用重音字母、希腊字母或者克林贡字母,就得用好几个字节来存储一个代码点,但美国人永远不会注意到。(UTF-8还有一个很好的特性:那些想把单个0字节当作字符串结束符的无知的老式字符串处理代码,不会把字符串截断。)
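“ASCII范围的文本在UTF-8下原样不动、代码点越大占的字节越多”这两点可以直接验证(Python示意,非原文内容):

```python
# Hello 的 UTF-8 字节和 ASCII 完全一样
print("Hello".encode("utf-8").hex(" "))   # 48 65 6c 6c 6f
# 代码点越大,需要的字节越多
print(len("é".encode("utf-8")))    # 2 个字节
print(len("中".encode("utf-8")))   # 3 个字节
```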

So far I’ve told you three ways of encoding Unicode. The traditional store-it-in-two-byte methods are called UCS-2 (because it has two bytes) or UTF-16 (because it has 16 bits), and you still have to figure out if it’s high-endian UCS-2 or low-endian UCS-2. And there’s the popular new UTF-8 standard which has the nice property of also working respectably if you have the happy coincidence of English text and braindead programs that are completely unaware that there is anything other than ASCII.

到目前为止,我已经告诉了你三种编码Unicode的方式。传统的“存成两个字节”的方法被称为UCS-2(因为它有两个字节)或者UTF-16(因为它有16位),而且你还得弄清楚它是大端的UCS-2还是小端的UCS-2。还有流行的新标准UTF-8,它有一个很好的性质:如果你碰巧只处理英文文本,配上那些除了ASCII什么都不认识的脑残程序,它也能体面地工作。

There are actually a bunch of other ways of encoding Unicode. There’s something called UTF-7, which is a lot like UTF-8 but guarantees that the high bit will always be zero, so that if you have to pass Unicode through some kind of draconian police-state email system that thinks 7 bits are quite enough, thank you it can still squeeze through unscathed. There’s UCS-4, which stores each code point in 4 bytes, which has the nice property that every single code point can be stored in the same number of bytes, but, golly, even the Texans wouldn’t be so bold as to waste that much memory.

事实上,还有一堆其他的Unicode编码方式。有一种叫UTF-7的东西,它很像UTF-8,但保证最高位始终为0。这样,如果你不得不让Unicode穿过某种严苛的、认为7位就足够了的警察国家式电子邮件系统,它也能毫发无损地挤过去。还有UCS-4,它用4个字节存储每个代码点,好处是每一个代码点都可以用相同的字节数存储,但是,天哪,就连德克萨斯人也不敢这么浪费内存。
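UTF-7“最高位始终为0”和UCS-4“每个代码点固定4个字节”的性质都可以验证(Python示意,非原文内容;Python中UCS-4对应的编码名是utf-32):

```python
# UTF-7:编码结果里的每个字节都小于 128(最高位为 0)
data7 = "Héllo".encode("utf-7")
print(all(b < 128 for b in data7))   # True

# UTF-32(即 UCS-4):每个代码点固定占 4 个字节
print(len("Hello".encode("utf-32-be")))   # 20 = 5 * 4
```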

And in fact now that you’re thinking of things in terms of platonic ideal letters which are represented by Unicode code points, those unicode code points can be encoded in any old-school encoding scheme, too! For example, you could encode the Unicode string for Hello (U+0048 U+0065 U+006C U+006C U+006F) in ASCII, or the old OEM Greek Encoding, or the Hebrew ANSI Encoding, or any of several hundred encodings that have been invented so far, with one catch: some of the letters might not show up! If there’s no equivalent for the Unicode code point you’re trying to represent in the encoding you’re trying to represent it in, you usually get a little question mark: ? or, if you’re reallygood, a box. Which did you get? -> �

事实上,既然你现在已经在用“由Unicode代码点表示的柏拉图式理念字母”来思考问题,那么这些Unicode代码点也可以用任何老式的编码方案来编码!例如,你可以把Hello的Unicode字符串(U+0048 U+0065 U+006C U+006C U+006F)编码成ASCII,或者旧的OEM希腊语编码,或者希伯来语ANSI编码,或者迄今为止已经发明的几百种编码中的任何一种。但有一个问题:某些字母可能显示不出来!如果你要表示的Unicode代码点在目标编码里没有对应物,你通常会得到一个小问号:?或者,如果你真的很厉害,会得到一个方框。你得到的是哪个?-> �

There are hundreds of traditional encodings which can only store some code points correctly and change all the other code points into question marks. Some popular encodings of English text are Windows-1252 (the Windows 9x standard for Western European languages) and ISO-8859-1, aka Latin-1 (also useful for any Western European language). But try to store Russian or Hebrew letters in these encodings and you get a bunch of question marks. UTF 7, 8, 16, and 32 all have the nice property of being able to store any code point correctly.

有数百种传统编码只能正确存储一部分代码点,并把所有其他代码点变成问号。一些流行的英文文本编码是Windows-1252(西欧语言的Windows 9x标准)和ISO-8859-1,也称Latin-1(对任何西欧语言也都有用)。但是,试着用这些编码存储俄语或希伯来语字母,你会得到一堆问号。UTF 7、8、16和32都具有能够正确存储任何代码点的良好性质。
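“传统编码把存不下的字母变成问号”可以这样演示(Python示意,非原文内容;encode的errors="replace"参数就是用问号替换无法表示的字符):

```python
# Latin-1 存不了俄语字母,只剩一串问号
print("Привет".encode("latin-1", errors="replace"))   # b'??????'
# UTF-8 可以正确存储任何代码点,来回转换不丢信息
assert "Привет".encode("utf-8").decode("utf-8") == "Привет"
```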

The Single Most Important Fact About Encodings

If you completely forget everything I just explained, please remember one extremely important fact. It does not make sense to have a string without knowing what encoding it uses. You can no longer stick your head in the sand and pretend that “plain” text is ASCII.

There Ain’t No Such Thing As Plain Text.

If you have a string, in memory, in a file, or in an email message, you have to know what encoding it is in or you cannot interpret it or display it to users correctly.

如果你完全忘记了我刚刚解释的,请记住一个非常重要的事实。在不知道它使用什么编码的情况下拥有一个字符串是没有意义的。你不能再把头埋在沙子里,假装“普通”文字是ASCII。

根本不存在纯文本这种东西。

如果你有一个字符串,不管是在内存中、文件中还是电子邮件里,你必须知道它用的是什么编码,否则你无法解释它,也无法把它正确地显示给用户。

Almost every stupid “my website looks like gibberish” or “she can’t read my emails when I use accents” problem comes down to one naive programmer who didn’t understand the simple fact that if you don’t tell me whether a particular string is encoded using UTF-8 or ASCII or ISO 8859-1 (Latin 1) or Windows 1252 (Western European), you simply cannot display it correctly or even figure out where it ends. There are over a hundred encodings and above code point 127, all bets are off.

几乎每一个“我的网站看起来像乱码”或者“我一用重音字符她就看不懂我的邮件”之类的愚蠢问题,都可以归结到一个天真的程序员身上。他不明白一个简单的事实:如果你不告诉我某个字符串是用UTF-8、ASCII、ISO 8859-1(Latin-1)还是Windows 1252(西欧)编码的,你根本无法正确显示它,甚至找不到它在哪里结束。编码有一百多种,而一旦超过代码点127,一切都无法保证。
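“不知道编码就无法解释字符串”可以用两个字节演示(Python示意,非原文内容):

```python
# 同样两个字节,按不同编码解释,得到完全不同的文本
data = b"\xc3\xa9"
print(data.decode("utf-8"))     # é
print(data.decode("latin-1"))   # Ã©(典型的乱码样子)
```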

How do we preserve this information about what encoding a string uses? Well, there are standard ways to do this. For an email message, you are expected to have a string in the header of the form

Content-Type: text/plain; charset="UTF-8"

For a web page, the original idea was that the web server would return a similar Content-Type http header along with the web page itself — not in the HTML itself, but as one of the response headers that are sent before the HTML page.

那么,我们如何保留“这个字符串用的是什么编码”这一信息呢?有一些标准的方式来做这件事。对于电子邮件,你应该在邮件头部包含这样一个字符串:

Content-Type: text/plain; charset="UTF-8"

对于网页,最初的想法是web服务器随网页本身一起返回一个类似的Content-Type HTTP头。不是放在HTML里,而是作为在HTML页面之前发送的响应头之一。
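从这样的头部字符串里解析出charset,可以用Python标准库的email模块示意(非原文内容):

```python
from email.message import Message

# 解析 Content-Type 头,取出 charset 参数
msg = Message()
msg["Content-Type"] = 'text/plain; charset="UTF-8"'
print(msg.get_content_charset())   # utf-8(返回值统一为小写)
```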

This causes problems. Suppose you have a big web server with lots of sites and hundreds of pages contributed by lots of people in lots of different languages and all using whatever encoding their copy of Microsoft FrontPage saw fit to generate. The web server itself wouldn’t really know what encoding each file was written in, so it couldn’t send the Content-Type header.

但是这会导致问题。假设你有一个承载许多站点的大型web服务器,使用许多不同语言的许多人向它贡献了数以百计的页面,而这些页面用的都是各自的Microsoft FrontPage认为合适的随便什么编码。web服务器本身并不真正知道每个文件是用什么编码写的,因此它没法发送Content-Type头。

It would be convenient if you could put the Content-Type of the HTML file right in the HTML file itself, using some kind of special tag. Of course this drove purists crazy… how can you read the HTML file until you know what encoding it’s in?! Luckily, almost every encoding in common use does the same thing with characters between 32 and 127, so you can always get this far on the HTML page without starting to use funny letters:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
But that meta tag really has to be the very first thing in the <head> section because as soon as the web browser sees this tag it’s going to stop parsing the page and start over after reinterpreting the whole page using the encoding you specified.

如果你能用某种特殊的标签把HTML文件的Content-Type直接放进HTML文件本身,那会很方便。当然,这让纯粹主义者们抓狂……在知道HTML文件的编码之前,你怎么读它?!幸运的是,几乎所有常用的编码对32到127之间的字符的处理都是一样的,所以在用到特殊字母之前,你总能在HTML页面上先读到这么多内容:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
但是,这个meta标签真的必须是head部分里的第一个东西,因为浏览器一看到这个标签,就会停止解析页面,然后用你指定的编码重新解释整个页面,从头再来。

What do web browsers do if they don’t find any Content-Type, either in the http headers or the meta tag? Internet Explorer actually does something quite interesting: it tries to guess, based on the frequency in which various bytes appear in typical text in typical encodings of various languages, what language and encoding was used. Because the various old 8 bit code pages tended to put their national letters in different ranges between 128 and 255, and because every human language has a different characteristic histogram of letter usage, this actually has a chance of working. It’s truly weird, but it does seem to work often enough that naïve web-page writers who never knew they needed a Content-Type header look at their page in a web browser and it looks ok, until one day, they write something that doesn’t exactly conform to the letter-frequency-distribution of their native language, and Internet Explorer decides it’s Korean and displays it thusly, proving, I think, the point that Postel’s Law about being “conservative in what you emit and liberal in what you accept” is quite frankly not a good engineering principle. Anyway, what does the poor reader of this website, which was written in Bulgarian but appears to be Korean (and not even cohesive Korean), do? He uses the View | Encoding menu and tries a bunch of different encodings (there are at least a dozen for Eastern European languages) until the picture comes in clearer. If he knew to do that, which most people don’t.

如果浏览器不管是在HTTP头还是meta标签里都没有找到Content-Type,它们会怎么做呢?Internet Explorer实际上做了一件相当有趣的事情:它根据各种字节在各种语言的典型编码的典型文本中出现的频率,来猜测用的是什么语言和编码。因为各种旧的8位代码页倾向于把本国字母放在128到255之间的不同区段,而且每种人类语言的字母使用频率直方图都各有特点,所以这实际上有机会奏效。这真的很怪异,但它似乎确实经常奏效,以至于那些从来不知道自己需要Content-Type头的天真的网页作者,在浏览器里查看自己的页面,看起来还不错。直到有一天,他们写的东西不完全符合母语的字母频率分布,Internet Explorer判定它是韩语并如此显示。我认为,这证明了Postel定律(所谓“发送时要保守,接收时要开放”)坦率地说并不是一个好的工程原则。无论如何,这个用保加利亚语写的、却显示成韩语(甚至不是连贯的韩语)的网站的可怜读者该怎么办呢?他会使用“查看|编码”菜单,尝试一堆不同的编码(光东欧语言就至少有十几种),直到显示内容变清楚为止。前提是他知道可以这么做,而大多数人并不知道。

For the latest version of CityDesk, the web site management software published by my company, we decided to do everything internally in UCS-2 (two byte) Unicode, which is what Visual Basic, COM, and Windows NT/2000/XP use as their native string type. In C++ code we just declare strings as wchar_t (“wide char”) instead of char and use the wcs functions instead of the str functions (for example wcscat and wcslen instead of strcat and strlen). To create a literal UCS-2 string in C code you just put an L before it as so: L"Hello".

对于最新版本的CityDesk(我公司发布的网站管理软件),我们决定在内部全部使用UCS-2(双字节)Unicode,这正是Visual Basic、COM和Windows NT/2000/XP的原生字符串类型。在C++代码中,我们把字符串声明为wchar_t(“宽字符”)而不是char,并用wcs系列函数代替str系列函数(比如用wcscat和wcslen代替strcat和strlen)。在C代码中创建一个UCS-2字符串字面量,只需在它前面加一个L,像这样:L"Hello"。

When CityDesk publishes the web page, it converts it to UTF-8 encoding, which has been well supported by web browsers for many years. That’s the way all 29 language versions of Joel on Software are encoded and I have not yet heard a single person who has had any trouble viewing them.

当CityDesk发布网页时,它会将其转换为UTF-8编码,这种编码多年来一直受到Web浏览器的良好支持。这就是所有29种语言版本的Joel on Software编码的方式,我还没有听过一个人在查看它们时遇到任何问题。

This article is getting rather long, and I can’t possibly cover everything there is to know about character encodings and Unicode, but I hope that if you’ve read this far, you know enough to go back to programming, using antibiotics instead of leeches and spells, a task to which I will leave you now.

这篇文章已经相当长了,我不可能涵盖关于字符编码和Unicode的所有知识,但我希望,如果你读到了这里,你知道的已经足够多,可以回去编程了,用抗生素,而不是水蛭和咒语。这个任务现在就交给你了。

