每个程序员都必须知道的Unicode与字符集最基础的知识

原文出处:the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses 

Ever wonder about that mysterious Content-Type tag? You know, the one you’re supposed to put in HTML and you never quite know what it should be?

你有没有好奇过那个神秘的Content-Type标签?你知道的,就是那个你应该放在HTML里、却永远不太清楚它的值应该是什么的标签?

Did you ever get an email from your friends in Bulgaria with the subject line “???? ?????? ??? ????”?

你有没有从你保加利亚的朋友那里收到主题为“???? ?????? ??? ????”的邮件?

I’ve been dismayed to discover just how many software developers aren’t really completely up to speed on the mysterious world of character sets, encodings, Unicode, all that stuff. A couple of years ago, a beta tester for FogBUGZ was wondering whether it could handle incoming email in Japanese. Japanese? They have email in Japanese? I had no idea. When I looked closely at the commercial ActiveX control we were using to parse MIME email messages, we discovered it was doing exactly the wrong thing with character sets, so we actually had to write heroic code to undo the wrong conversion it had done and redo it correctly. When I looked into another commercial library, it, too, had a completely broken character code implementation. I corresponded with the developer of that package and he sort of thought they “couldn’t do anything about it.” Like many programmers, he just wished it would all blow over somehow.

我一直很沮丧地发现,有那么多软件开发人员并没有真正完全掌握字符集、编码、Unicode这些神秘的东西。几年前,一位FogBUGZ的beta测试人员想知道它能否处理日语的邮件。日语?他们有日语的邮件吗?我完全不知道。当我仔细查看我们用来解析MIME电子邮件的商业ActiveX控件时,我们发现它在字符集上做的事情完全是错的,因此我们不得不写一些“英勇”的代码来撤销它做的错误转换,再正确地重做一遍。当我查看另一个商业库时,它的字符编码实现也完全是坏的。我和那个软件包的开发人员交流过,他有点觉得他们“无能为力”。跟许多程序员一样,他只是希望这件事能以某种方式自己过去。

But it won’t. When I discovered that the popular web development tool PHP has almost complete ignorance of character encoding issues, blithely using 8 bits for characters, making it darn near impossible to develop good international web applications, I thought, enough is enough.

但是它不会。当我发现流行的web开发工具PHP几乎完全无视字符编码问题,轻率地用8位来表示字符,使得开发好的国际化web应用几乎不可能时,我想,够了,到此为止。

So I have an announcement to make: if you are a programmer working in 2003 and you don’t know the basics of characters, character sets, encodings, and Unicode, and I catch you, I’m going to punish you by making you peel onions for 6 months in a submarine. I swear I will.

因此我要宣布:如果你是一个在2003年工作的程序员,却不知道字符、字符集、编码和Unicode的基础知识,而且被我抓到了,我会惩罚你去潜艇里剥6个月的洋葱。我发誓我会的。

IT’S NOT THAT HARD.

这并不难。

In this article I’ll fill you in on exactly what every working programmer should know. All that stuff about “plain text = ascii = characters are 8 bits” is not only wrong, it’s hopelessly wrong, and if you’re still programming that way, you’re not much better than a medical doctor who doesn’t believe in germs. Please do not write another line of code until you finish reading this article.

在这篇文章中,我将告诉你每个正在工作的程序员究竟应该知道些什么。所有关于“纯文本=ASCII=字符是8位的”的说法,不仅是错的,而且错得无可救药。如果你还在以这种方式编程,你比一个不相信细菌存在的医生好不到哪里去。在读完这篇文章之前,请不要再写任何一行代码。

Before I get started, I should warn you that if you are one of those rare people who knows about internationalization, you are going to find my entire discussion a little bit oversimplified. I’m really just trying to set a minimum bar here so that everyone can understand what’s going on and can write code that has a hope of working with text in any language other than the subset of English that doesn’t include words with accents. And I should warn you that character handling is only a tiny portion of what it takes to create software that works internationally, but I can only write about one thing at a time so today it’s character sets.

在开始之前,我应该提醒你:如果你是那些少数了解国际化的人之一,你会发现我的整个讨论有点过于简化。我在这里只是想设置一个最低的门槛,让每个人都能明白到底发生了什么,并且能写出有希望处理任何语言文本的代码,而不是只能处理“不含重音词的英语子集”。我还要提醒你,字符处理只是创建国际化软件所需工作中很小的一部分,但我一次只能写一件事,所以今天只谈字符集。

A Historical Perspective

The easiest way to understand this stuff is to go chronologically.

You probably think I’m going to talk about very old character sets like EBCDIC here. Well, I won’t. EBCDIC is not relevant to your life. We don’t have to go that far back in time.

理解这些东西最容易的方式是按时间顺序来讲。

也许你会认为我要从EBCDIC这样非常古老的字符集说起。我不会的。EBCDIC跟你的生活没有关系。我们不必回溯到那么久远。

Back in the semi-olden days, when Unix was being invented and K&R were writing The C Programming Language, everything was very simple. EBCDIC was on its way out. The only characters that mattered were good old unaccented English letters, and we had a code for them called ASCII which was able to represent every character using a number between 32 and 127. Space was 32, the letter “A” was 65, etc. This could conveniently be stored in 7 bits. Most computers in those days were using 8-bit bytes, so not only could you store every possible ASCII character, but you had a whole bit to spare, which, if you were wicked, you could use for your own devious purposes: the dim bulbs at WordStar actually turned on the high bit to indicate the last letter in a word, condemning WordStar to English text only. Codes below 32 were called unprintable and were used for cussing. Just kidding. They were used for control characters, like 7 which made your computer beep and 12 which caused the current page of paper to go flying out of the printer and a new one to be fed in.

早在半古老的时代,Unix刚刚被发明出来,K&R还在写《C程序设计语言》,一切都非常简单。EBCDIC正在退出历史舞台。唯一重要的字符是优良的、不带重音的英文字母,我们有一个称为ASCII的编码来表示它们,它能用32到127之间的数字表示每一个字符。例如,32表示空格,65表示字母“A”。这用7个比特就可以方便地存下。那个时代大部分电脑使用8位的字节,因此你不仅可以存储每一个可能的ASCII字符,还能空出整整一个比特。如果你心术不正,还可以把它用在自己的歪门邪道上:WordStar的那些糊涂蛋就真的用最高位来标记一个单词的最后一个字母,这让WordStar注定只能处理英文文本。小于32的编码被称为不可打印字符,是用来骂人的。开个玩笑。它们被用作控制字符,比如7会让你的电脑发出哔的一声,12会让打印机把当前这页纸送出来、再进一张新的。


And all was good, assuming you were an English speaker.

Because bytes have room for up to eight bits, lots of people got to thinking, “gosh, we can use the codes 128-255 for our own purposes.” The trouble was, lots of people had this idea at the same time, and they had their own ideas of what should go where in the space from 128 to 255. The IBM-PC had something that came to be known as the OEM character set which provided some accented characters for European languages and a bunch of line drawing characters… horizontal bars, vertical bars, horizontal bars with little dingle-dangles dangling off the right side, etc., and you could use these line drawing characters to make spiffy boxes and lines on the screen, which you can still see running on the 8088 computer at your dry cleaners’. In fact  as soon as people started buying PCs outside of America all kinds of different OEM character sets were dreamed up, which all used the top 128 characters for their own purposes. For example on some PCs the character code 130 would display as é, but on computers sold in Israel it was the Hebrew letter Gimel (ג), so when Americans would send their résumés to Israel they would arrive as rגsumגs. In many cases, such as Russian, there were lots of different ideas of what to do with the upper-128 characters, so you couldn’t even reliably interchange Russian documents.

假设你是一位说英语的人,一切都很好。

因为一个字节最多有8位的空间,所以很多人就想,“我们可以把128-255这些编码用于我们自己的目的”。问题是,很多人同时有这个想法,而且对于128到255这段空间里该放什么,各有各的主意。IBM-PC上有某种后来被称为OEM字符集的东西,它为欧洲语言提供了一些重音字符,还有一堆画线字符……水平线、垂直线、右侧带着小突起的水平线等等。你可以用这些画线字符在屏幕上画出漂亮的方框和线条,如今你仍然能在干洗店的8088电脑上看到它们。事实上,一旦人们开始在美国以外的地区购买PC,各种各样的OEM字符集就被设想了出来,它们都把高位的128个字符用于自己的目的。例如,在一些电脑上字符编码130显示为é,但在以色列出售的电脑上,它是希伯来字母Gimel(ג)。因此当美国人把résumés(简历)发往以色列时,到那边就成了rגsumגs。在许多情况下,例如俄语,对于如何处理高位的128个字符有很多种不同的想法,所以你甚至无法可靠地交换俄语文档。
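上面“130在不同电脑上显示为不同字母”的现象,可以用一个小例子演示(这里用Python示意,并非原文的代码;cp437和cp862分别是IBM-PC和以色列DOS所用的代码页):

```python
# 同一个字节 130(0x82)在不同的 OEM 代码页下是不同的字母
raw = bytes([130])
print(raw.decode("cp437"))   # IBM-PC 的 OEM 代码页:é
print(raw.decode("cp862"))   # 以色列 DOS 的代码页:希伯来字母 Gimel(ג)

# "résumés" 按 cp437 编码、再按 cp862 解码,就变成了原文所说的乱码
print("résumés".encode("cp437").decode("cp862"))   # rגsumגs
```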

[图:OEM字符集]


Eventually this OEM free-for-all got codified in the ANSI standard. In the ANSI standard, everybody agreed on what to do below 128, which was pretty much the same as ASCII, but there were lots of different ways to handle the characters from 128 and on up, depending on where you lived. These different systems were called code pages. So for example in Israel DOS used a code page called 862, while Greek users used 737. They were the same below 128 but different from 128 up, where all the funny letters resided. The national versions of MS-DOS had dozens of these code pages, handling everything from English to Icelandic and they even had a few “multilingual” code pages that could do Esperanto and Galician on the same computer! Wow! But getting, say, Hebrew and Greek on the same computer was a complete impossibility unless you wrote your own custom program that displayed everything using bitmapped graphics, because Hebrew and Greek required different code pages with different interpretations of the high numbers.

最终,这场OEM的混战被编入了ANSI标准。在ANSI标准中,对于128以下该做什么,所有人达成了一致,基本上就是ASCII;但是对于128及以上的字符,有很多种不同的处理方式,取决于你住在哪里。这些不同的系统称为代码页。例如在以色列,DOS使用一个叫862的代码页,而希腊用户使用737。它们在128以下都一样,但从128开始就不同了,所有有趣的字母都住在那一段里。MS-DOS的各国版本有几十个这样的代码页,覆盖从英语到冰岛语的一切,他们甚至还有几个“多语言”代码页,可以在同一台计算机上处理世界语和加利西亚语!哇!但是,想在同一台机器上同时使用希伯来语和希腊语是完全不可能的,除非你编写了自己的定制程序、用位图图形来显示所有内容,因为希伯来语和希腊语需要不同的代码页,对高位数字有不同的解释。

Meanwhile, in Asia, even more crazy things were going on to take into account the fact that Asian alphabets have thousands of letters, which were never going to fit into 8 bits. This was usually solved by the messy system called DBCS, the “double byte character set” in which some letters were stored in one byte and others took two. It was easy to move forward in a string, but dang near impossible to move backwards. Programmers were encouraged not to use s++ and s-- to move backwards and forwards, but instead to call functions such as Windows’ AnsiNext and AnsiPrev which knew how to deal with the whole mess.

与此同时,在亚洲发生着更疯狂的事情:亚洲文字有成千上万个字符,8位根本装不下。这通常由一个叫DBCS(“双字节字符集”)的混乱系统来解决,在这个系统中,有些字符存成一个字节,有些占两个字节。在字符串中向前移动很容易,但向后移动几乎不可能。程序员们被鼓励不要用s++和s--来前后移动,而是调用Windows的AnsiNext和AnsiPrev之类知道如何对付这摊烂事的函数。
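双字节字符集里“一个字符占一到两个字节”的情况可以这样演示(Python示意;这里用GBK举例,原文并未指定具体的DBCS编码):

```python
# GBK 是一种典型的 DBCS:ASCII 字母占 1 个字节,汉字占 2 个字节
data = "a中b".encode("gbk")
print(len(data))   # 4 = 1 + 2 + 1
# 向前扫描很容易;但随便指着中间某个字节,你无法判断它是
# 一个双字节字符的前一半还是后一半,所以向后移动几乎不可能,
# 这正是需要 AnsiNext/AnsiPrev 这类函数的原因
```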

But still, most people just pretended that a byte was a character and a character was 8 bits and as long as you never moved a string from one computer to another, or spoke more than one language, it would sort of always work. But of course, as soon as the Internet happened, it became quite commonplace to move strings from one computer to another, and the whole mess came tumbling down. Luckily, Unicode had been invented.

但是,大多数人仍然假装一个字节就是一个字符、一个字符就是8位。只要你从不把字符串从一台电脑搬到另一台电脑,也不说一种以上的语言,这大体上总能工作。当然,互联网一出现,把字符串从一台电脑搬到另一台电脑就变得稀松平常,整个烂摊子轰然倒塌。幸运的是,Unicode已经被发明出来了。

Unicode

Unicode was a brave effort to create a single character set that included every reasonable writing system on the planet and some make-believe ones like Klingon, too. Some people are under the misconception that Unicode is simply a 16-bit code where each character takes 16 bits and therefore there are 65,536 possible characters. This is not, actually, correct. It is the single most common myth about Unicode, so if you thought that, don’t feel bad.

Unicode是一次勇敢的尝试:创建一个单一的字符集,包含地球上每一个合理的书写系统,还有一些像克林贡语这样虚构的书写系统。有些人误以为Unicode只是一种16位的编码,每个字符占16位,因此最多有65,536个可能的字符。实际上,这是不正确的。这是关于Unicode最常见的误解,所以如果你也这么想,不必难过。

In fact, Unicode has a different way of thinking about characters, and you have to understand the Unicode way of thinking of things or nothing will make sense.

事实上,Unicode用一种不同的方式来理解字符,你必须理解Unicode的这种思路,否则一切都说不通。

Until now, we’ve assumed that a letter maps to some bits which you can store on disk or in memory:

A -> 0100 0001

In Unicode, a letter maps to something called a code point which is still just a theoretical concept. How that code point is represented in memory or on disk is a whole nuther story.

到目前为止,我们假设一个字母对应你能够存储在磁盘或者内存中的一些位:

A -> 0100 0001

在Unicode中,一个字母映射到一个被称为代码点的东西,代码点仍然只是一个理论上的概念。至于这个代码点在内存或磁盘上如何表示,则完全是另一回事。

In Unicode, the letter A is a platonic ideal. It’s just floating in heaven:

                                                                                                             A

This platonic A is different than B, and different from a, but the same as A and A and A. The idea that A in a Times New Roman font is the same character as the A in a Helvetica font, but different from “a” in lower case, does not seem very controversial, but in some languages just figuring out what a letter is can cause controversy. Is the German letter ß a real letter or just a fancy way of writing ss? If a letter’s shape changes at the end of the word, is that a different letter? Hebrew says yes, Arabic says no. Anyway, the smart people at the Unicode consortium have been figuring this out for the last decade or so, accompanied by a great deal of highly political debate, and you don’t have to worry about it. They’ve figured it all out already.

在Unicode中,字母A是一个柏拉图式的理念,它只是飘在天上:

                                                                                                             A

这个柏拉图式的A不同于B,也不同于a,但与A、A和A相同。Times New Roman字体中的A和Helvetica字体中的A是同一个字符、却不同于小写的“a”,这个想法看起来没什么争议,但在某些语言中,弄清楚一个字母到底是什么就会引起争议。德语字母ß是一个真正的字母,还是只是ss的一种花哨写法?如果一个字母的形状在词尾会变化,那它是不是另一个字母?希伯来语说是,阿拉伯语说不是。无论如何,Unicode联盟的聪明人在过去十年左右一直在搞清楚这些问题,伴随着大量高度政治化的争论,而你不必为此操心。他们已经全都搞清楚了。

Every platonic letter in every alphabet is assigned a magic number by the Unicode consortium which is written like this: U+0639. This magic number is called a code point. The U+ means “Unicode” and the numbers are hexadecimal. U+0639 is the Arabic letter Ain. The English letter A would be U+0041. You can find them all using the charmap utility on Windows 2000/XP or visiting the Unicode web site.

每个字母表中的每个柏拉图式字母都由Unicode联盟分配了一个魔法数字(magic number),写成类似U+0639这样。这个魔法数字被称为代码点。“U+”代表“Unicode”,数字是十六进制的。U+0639是阿拉伯字母Ain。英文字母A是U+0041。你可以在Windows 2000/XP上用charmap工具找到它们,或者访问Unicode官网(the Unicode web site)。
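“代码点只是一个数字”这一点可以用Python标准库验证一下(示意代码,非原文内容):

```python
import unicodedata

# 英文字母 A 的代码点是 U+0041,阿拉伯字母 Ain 是 U+0639
print(hex(ord("A")))                 # 0x41
print(unicodedata.name("\u0639"))    # ARABIC LETTER AIN
```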

There is no real limit on the number of letters that Unicode can define and in fact they have gone beyond 65,536 so not every unicode letter can really be squeezed into two bytes, but that was a myth anyway.

Unicode能定义的字母数量没有实际的上限,事实上它们已经超过了65,536,因此并不是每个Unicode字母都真的能塞进两个字节里。不过,“两个字节”本来就是个误解。

OK, so say we have a string:

Hello

which, in Unicode, corresponds to these five code points:

U+0048 U+0065 U+006C U+006C U+006F.

Just a bunch of code points. Numbers, really. We haven’t yet said anything about how to store this in memory or represent it in an email message.

在Unicode中,字符串Hello对应五个代码点:U+0048 U+0065 U+006C U+006C U+006F。仅仅是一堆代码点,实际上就是一堆数字。到目前为止,我们还没有谈到如何把它们存储在内存里,或者如何在电子邮件中表示它们。
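把Hello拆成代码点的过程可以写成一行(Python示意,非原文内容):

```python
# 字符串 Hello 对应的五个代码点:只是数字,还没有涉及任何存储方式
code_points = ["U+%04X" % ord(c) for c in "Hello"]
print(" ".join(code_points))   # U+0048 U+0065 U+006C U+006C U+006F
```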

Encodings

That’s where encodings come in.

The earliest idea for Unicode encoding, which led to the myth about the two bytes, was, hey, let’s just store those numbers in two bytes each. So Hello becomes

00 48 00 65 00 6C 00 6C 00 6F

Right? Not so fast! Couldn’t it also be:

48 00 65 00 6C 00 6C 00 6F 00 ?

Well, technically, yes, I do believe it could, and, in fact, early implementors wanted to be able to store their Unicode code points in high-endian or low-endian mode, whichever their particular CPU was fastest at, and lo, it was evening and it was morning and there were already two ways to store Unicode. So the people were forced to come up with the bizarre convention of storing a FE FF at the beginning of every Unicode string; this is called a Unicode Byte Order Mark and if you are swapping your high and low bytes it will look like a FF FE and the person reading your string will know that they have to swap every other byte. Phew. Not every Unicode string in the wild has a byte order mark at the beginning.

Unicode编码最早的想法,也就是导致“两个字节”这个误解的想法是:嘿,我们就把这些数字每个都存成两个字节吧。于是Hello变成了00 48 00 65 00 6C 00 6C 00 6F。对吗?别急!它难道不能存成48 00 65 00 6C 00 6C 00 6F 00吗?从技术上说,我相信确实可以。事实上,早期的实现者希望能以大端(high-endian)或小端(low-endian)两种模式存储Unicode代码点,取决于他们特定的CPU哪种更快。这样一来,存储Unicode就已经有了两种方式。因此人们不得不想出一个古怪的惯例:在每个Unicode字符串的开头存上FE FF,这被称为Unicode字节顺序标记(Byte Order Mark)。如果高低字节交换了,它看起来就是FF FE,读取字符串的人就知道必须把后面的每两个字节都交换过来。唉。但是,现实中并非每个Unicode字符串的开头都有字节顺序标记。
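两种字节序以及字节顺序标记(BOM)可以这样演示(Python示意,非原文内容):

```python
import codecs

# 同样的代码点,两种字节序
print("Hello".encode("utf-16-be").hex(" "))   # 00 48 00 65 00 6c 00 6c 00 6f
print("Hello".encode("utf-16-le").hex(" "))   # 48 00 65 00 6c 00 6c 00 6f 00

# 字节顺序标记:大端是 FE FF;如果高低字节交换了,看到的就是 FF FE
print(codecs.BOM_UTF16_BE.hex(" "))   # fe ff
print(codecs.BOM_UTF16_LE.hex(" "))   # ff fe
```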

For a while it seemed like that might be good enough, but programmers were complaining. “Look at all those zeros!” they said, since they were Americans and they were looking at English text which rarely used code points above U+00FF. Also they were liberal hippies in California who wanted to conserve (sneer). If they were Texans they wouldn’t have minded guzzling twice the number of bytes. But those Californian wimps couldn’t bear the idea of doubling the amount of storage it took for strings, and anyway, there were already all these doggone documents out there using various ANSI and DBCS character sets and who’s going to convert them all? Moi? For this reason alone most people decided to ignore Unicode for several years and in the meantime things got worse.

有一段时间,这看起来似乎够好了,但程序员们开始抱怨。“看看那些0!”他们说。因为他们是美国人,看的是很少用到U+00FF以上代码点的英文文本。他们也是加州的自由派嬉皮士,崇尚节约(冷笑)。如果他们是德克萨斯人,才不会介意吞掉两倍的字节数。但那些加利福尼亚的软蛋无法忍受字符串存储量翻倍的想法,况且,世上已经有那么多用各种ANSI和DBCS字符集写的文档了,谁去把它们全部转换?我吗?仅仅因为这个原因,大多数人决定在接下来的几年里无视Unicode,与此同时,事情变得更糟了。

Thus was invented the brilliant concept of UTF-8. UTF-8 was another system for storing your string of Unicode code points, those magic U+ numbers, in memory using 8 bit bytes. In UTF-8, every code point from 0-127 is stored in a single byte. Only code points 128 and above are stored using 2, 3, in fact, up to 6 bytes.

于是,UTF-8这个天才的概念被发明了出来。UTF-8是另一个用8位的字节在内存中存储Unicode代码点字符串(那些魔法的U+数字)的系统。在UTF-8中,0-127的每个代码点都存储在单个字节里。只有128及以上的代码点才用2个、3个、实际上最多6个字节来存储。

This has the neat side effect that English text looks exactly the same in UTF-8 as it did in ASCII, so Americans don’t even notice anything wrong. Only the rest of the world has to jump through hoops. Specifically, Hello, which was U+0048 U+0065 U+006C U+006C U+006F, will be stored as 48 65 6C 6C 6F, which, behold! is the same as it was stored in ASCII, and ANSI, and every OEM character set on the planet. Now, if you are so bold as to use accented letters or Greek letters or Klingon letters, you’ll have to use several bytes to store a single code point, but the Americans will never notice. (UTF-8 also has the nice property that ignorant old string-processing code that wants to use a single 0 byte as the null-terminator will not truncate strings).

这带来一个很妙的副作用:英文文本在UTF-8中看起来和在ASCII中一模一样,所以美国人甚至不会察觉有任何问题。只有世界上其他地方的人不得不费尽周折。具体来说,Hello,即U+0048 U+0065 U+006C U+006C U+006F,会被存储为48 65 6C 6C 6F。看!这和它在ASCII、ANSI以及地球上每一个OEM字符集中的存储方式都一样。现在,如果你大胆地使用重音字母、希腊字母或者克林贡字母,就得用好几个字节来存储一个代码点,但美国人永远不会注意到。(UTF-8还有一个很好的特性:那些想把单个0字节当作字符串结束符的无知的老式字符串处理代码,不会把字符串截断。)
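“ASCII范围的文本在UTF-8下原样不动、代码点越大占的字节越多”这两点可以直接验证(Python示意,非原文内容):

```python
# Hello 的 UTF-8 字节和 ASCII 完全一样
print("Hello".encode("utf-8").hex(" "))   # 48 65 6c 6c 6f
# 代码点越大,需要的字节越多
print(len("é".encode("utf-8")))    # 2 个字节
print(len("中".encode("utf-8")))   # 3 个字节
```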

So far I’ve told you three ways of encoding Unicode. The traditional store-it-in-two-byte methods are called UCS-2 (because it has two bytes) or UTF-16 (because it has 16 bits), and you still have to figure out if it’s high-endian UCS-2 or low-endian UCS-2. And there’s the popular new UTF-8 standard which has the nice property of also working respectably if you have the happy coincidence of English text and braindead programs that are completely unaware that there is anything other than ASCII.

到目前为止,我已经告诉了你三种编码Unicode的方式。传统的“存成两个字节”的方法被称为UCS-2(因为它有两个字节)或者UTF-16(因为它有16位),而且你还得弄清楚它是大端的UCS-2还是小端的UCS-2。还有流行的新标准UTF-8,它有一个很好的性质:如果你碰巧只处理英文文本,配上那些除了ASCII什么都不认识的脑残程序,它也能体面地工作。

There are actually a bunch of other ways of encoding Unicode. There’s something called UTF-7, which is a lot like UTF-8 but guarantees that the high bit will always be zero, so that if you have to pass Unicode through some kind of draconian police-state email system that thinks 7 bits are quite enough, thank you it can still squeeze through unscathed. There’s UCS-4, which stores each code point in 4 bytes, which has the nice property that every single code point can be stored in the same number of bytes, but, golly, even the Texans wouldn’t be so bold as to waste that much memory.

事实上,还有一堆其他的Unicode编码方式。有一种叫UTF-7的东西,它很像UTF-8,但保证最高位始终为0。这样,如果你不得不让Unicode穿过某种严苛的、认为7位就足够了的警察国家式电子邮件系统,它也能毫发无损地挤过去。还有UCS-4,它用4个字节存储每个代码点,好处是每一个代码点都可以用相同的字节数存储,但是,天哪,就连德克萨斯人也不敢这么浪费内存。
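UTF-7“最高位始终为0”和UCS-4“每个代码点固定4个字节”的性质都可以验证(Python示意,非原文内容;Python中UCS-4对应的编码名是utf-32):

```python
# UTF-7:编码结果里的每个字节都小于 128(最高位为 0)
data7 = "Héllo".encode("utf-7")
print(all(b < 128 for b in data7))   # True

# UTF-32(即 UCS-4):每个代码点固定占 4 个字节
print(len("Hello".encode("utf-32-be")))   # 20 = 5 * 4
```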

And in fact now that you’re thinking of things in terms of platonic ideal letters which are represented by Unicode code points, those unicode code points can be encoded in any old-school encoding scheme, too! For example, you could encode the Unicode string for Hello (U+0048 U+0065 U+006C U+006C U+006F) in ASCII, or the old OEM Greek Encoding, or the Hebrew ANSI Encoding, or any of several hundred encodings that have been invented so far, with one catch: some of the letters might not show up! If there’s no equivalent for the Unicode code point you’re trying to represent in the encoding you’re trying to represent it in, you usually get a little question mark: ? or, if you’re reallygood, a box. Which did you get? -> �

事实上,既然你现在已经在用“由Unicode代码点表示的柏拉图式理念字母”来思考问题,那么这些Unicode代码点也可以用任何老式的编码方案来编码!例如,你可以把Hello的Unicode字符串(U+0048 U+0065 U+006C U+006C U+006F)编码成ASCII,或者旧的OEM希腊语编码,或者希伯来语ANSI编码,或者迄今为止已经发明的几百种编码中的任何一种。但有一个问题:某些字母可能显示不出来!如果你要表示的Unicode代码点在目标编码里没有对应物,你通常会得到一个小问号:?或者,如果你真的很厉害,会得到一个方框。你得到的是哪个?-> �

There are hundreds of traditional encodings which can only store some code points correctly and change all the other code points into question marks. Some popular encodings of English text are Windows-1252 (the Windows 9x standard for Western European languages) and ISO-8859-1, aka Latin-1 (also useful for any Western European language). But try to store Russian or Hebrew letters in these encodings and you get a bunch of question marks. UTF 7, 8, 16, and 32 all have the nice property of being able to store any code point correctly.

有数百种传统编码只能正确存储一部分代码点,并把所有其他代码点变成问号。一些流行的英文文本编码是Windows-1252(西欧语言的Windows 9x标准)和ISO-8859-1,也称Latin-1(对任何西欧语言也都有用)。但是,试着用这些编码存储俄语或希伯来语字母,你会得到一堆问号。UTF 7、8、16和32都具有能够正确存储任何代码点的良好性质。
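“传统编码把存不下的字母变成问号”可以这样演示(Python示意,非原文内容;encode的errors="replace"参数就是用问号替换无法表示的字符):

```python
# Latin-1 存不了俄语字母,只剩一串问号
print("Привет".encode("latin-1", errors="replace"))   # b'??????'
# UTF-8 可以正确存储任何代码点,来回转换不丢信息
assert "Привет".encode("utf-8").decode("utf-8") == "Привет"
```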

The Single Most Important Fact About Encodings

If you completely forget everything I just explained, please remember one extremely important fact. It does not make sense to have a string without knowing what encoding it uses. You can no longer stick your head in the sand and pretend that “plain” text is ASCII.

There Ain’t No Such Thing As Plain Text.

If you have a string, in memory, in a file, or in an email message, you have to know what encoding it is in or you cannot interpret it or display it to users correctly.

如果你完全忘记了我刚刚解释的,请记住一个非常重要的事实。在不知道它使用什么编码的情况下拥有一个字符串是没有意义的。你不能再把头埋在沙子里,假装“普通”文字是ASCII。

根本不存在纯文本这种东西。

如果你有一个字符串,不管是在内存中、文件中还是电子邮件里,你必须知道它用的是什么编码,否则你无法解释它,也无法把它正确地显示给用户。

Almost every stupid “my website looks like gibberish” or “she can’t read my emails when I use accents” problem comes down to one naive programmer who didn’t understand the simple fact that if you don’t tell me whether a particular string is encoded using UTF-8 or ASCII or ISO 8859-1 (Latin 1) or Windows 1252 (Western European), you simply cannot display it correctly or even figure out where it ends. There are over a hundred encodings and above code point 127, all bets are off.

几乎每一个“我的网站看起来像乱码”或者“我一用重音字符她就看不懂我的邮件”之类的愚蠢问题,都可以归结到一个天真的程序员身上。他不明白一个简单的事实:如果你不告诉我某个字符串是用UTF-8、ASCII、ISO 8859-1(Latin-1)还是Windows 1252(西欧)编码的,你根本无法正确显示它,甚至找不到它在哪里结束。编码有一百多种,而一旦超过代码点127,一切都无法保证。
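“不知道编码就无法解释字符串”可以用两个字节演示(Python示意,非原文内容):

```python
# 同样两个字节,按不同编码解释,得到完全不同的文本
data = b"\xc3\xa9"
print(data.decode("utf-8"))     # é
print(data.decode("latin-1"))   # Ã©(典型的乱码样子)
```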

How do we preserve this information about what encoding a string uses? Well, there are standard ways to do this. For an email message, you are expected to have a string in the header of the form

Content-Type: text/plain; charset="UTF-8"

For a web page, the original idea was that the web server would return a similar Content-Type http header along with the web page itself — not in the HTML itself, but as one of the response headers that are sent before the HTML page.

那么,我们如何保留“这个字符串用的是什么编码”这一信息呢?有一些标准的方式来做这件事。对于电子邮件,你应该在邮件头部包含这样一个字符串:

Content-Type: text/plain; charset="UTF-8"

对于网页,最初的想法是web服务器随网页本身一起返回一个类似的Content-Type HTTP头。不是放在HTML里,而是作为在HTML页面之前发送的响应头之一。
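从这样的头部字符串里解析出charset,可以用Python标准库的email模块示意(非原文内容):

```python
from email.message import Message

# 解析 Content-Type 头,取出 charset 参数
msg = Message()
msg["Content-Type"] = 'text/plain; charset="UTF-8"'
print(msg.get_content_charset())   # utf-8(返回值统一为小写)
```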

This causes problems. Suppose you have a big web server with lots of sites and hundreds of pages contributed by lots of people in lots of different languages and all using whatever encoding their copy of Microsoft FrontPage saw fit to generate. The web server itself wouldn’t really know what encoding each file was written in, so it couldn’t send the Content-Type header.

但是这会导致问题。假设你有一个承载许多站点的大型web服务器,使用许多不同语言的许多人向它贡献了数以百计的页面,而这些页面用的都是各自的Microsoft FrontPage认为合适的随便什么编码。web服务器本身并不真正知道每个文件是用什么编码写的,因此它没法发送Content-Type头。

It would be convenient if you could put the Content-Type of the HTML file right in the HTML file itself, using some kind of special tag. Of course this drove purists crazy… how can you read the HTML file until you know what encoding it’s in?! Luckily, almost every encoding in common use does the same thing with characters between 32 and 127, so you can always get this far on the HTML page without starting to use funny letters:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
But that meta tag really has to be the very first thing in the <head> section because as soon as the web browser sees this tag it’s going to stop parsing the page and start over after reinterpreting the whole page using the encoding you specified.

如果你能用某种特殊的标签把HTML文件的Content-Type直接放进HTML文件本身,那会很方便。当然,这让纯粹主义者们抓狂……在知道HTML文件的编码之前,你怎么读它?!幸运的是,几乎所有常用的编码对32到127之间的字符的处理都是一样的,所以在用到特殊字母之前,你总能在HTML页面上先读到这么多内容:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
但是,这个meta标签真的必须是head部分里的第一个东西,因为浏览器一看到这个标签,就会停止解析页面,然后用你指定的编码重新解释整个页面,从头再来。

What do web browsers do if they don’t find any Content-Type, either in the http headers or the meta tag? Internet Explorer actually does something quite interesting: it tries to guess, based on the frequency in which various bytes appear in typical text in typical encodings of various languages, what language and encoding was used. Because the various old 8 bit code pages tended to put their national letters in different ranges between 128 and 255, and because every human language has a different characteristic histogram of letter usage, this actually has a chance of working. It’s truly weird, but it does seem to work often enough that naïve web-page writers who never knew they needed a Content-Type header look at their page in a web browser and it looks ok, until one day, they write something that doesn’t exactly conform to the letter-frequency-distribution of their native language, and Internet Explorer decides it’s Korean and displays it thusly, proving, I think, the point that Postel’s Law about being “conservative in what you emit and liberal in what you accept” is quite frankly not a good engineering principle. Anyway, what does the poor reader of this website, which was written in Bulgarian but appears to be Korean (and not even cohesive Korean), do? He uses the View | Encoding menu and tries a bunch of different encodings (there are at least a dozen for Eastern European languages) until the picture comes in clearer. If he knew to do that, which most people don’t.

如果浏览器不管是在HTTP头还是meta标签里都没有找到Content-Type,它们会怎么做呢?Internet Explorer实际上做了一件相当有趣的事情:它根据各种字节在各种语言的典型编码的典型文本中出现的频率,来猜测用的是什么语言和编码。因为各种旧的8位代码页倾向于把本国字母放在128到255之间的不同区段,而且每种人类语言的字母使用频率直方图都各有特点,所以这实际上有机会奏效。这真的很怪异,但它似乎确实经常奏效,以至于那些从来不知道自己需要Content-Type头的天真的网页作者,在浏览器里查看自己的页面,看起来还不错。直到有一天,他们写的东西不完全符合母语的字母频率分布,Internet Explorer判定它是韩语并如此显示。我认为,这证明了Postel定律(所谓“发送时要保守,接收时要开放”)坦率地说并不是一个好的工程原则。无论如何,这个用保加利亚语写的、却显示成韩语(甚至不是连贯的韩语)的网站的可怜读者该怎么办呢?他会使用“查看|编码”菜单,尝试一堆不同的编码(光东欧语言就至少有十几种),直到显示内容变清楚为止。前提是他知道可以这么做,而大多数人并不知道。

For the latest version of CityDesk, the web site management software published by my company, we decided to do everything internally in UCS-2 (two byte) Unicode, which is what Visual Basic, COM, and Windows NT/2000/XP use as their native string type. In C++ code we just declare strings as wchar_t (“wide char”) instead of char and use the wcs functions instead of the str functions (for example wcscat and wcslen instead of strcat and strlen). To create a literal UCS-2 string in C code you just put an L before it as so: L"Hello".

对于最新版本的CityDesk(我公司发布的网站管理软件),我们决定在内部全部使用UCS-2(双字节)Unicode,这正是Visual Basic、COM和Windows NT/2000/XP的原生字符串类型。在C++代码中,我们把字符串声明为wchar_t(“宽字符”)而不是char,并用wcs系列函数代替str系列函数(比如用wcscat和wcslen代替strcat和strlen)。在C代码中创建一个UCS-2字符串字面量,只需在它前面加一个L,像这样:L"Hello"。

When CityDesk publishes the web page, it converts it to UTF-8 encoding, which has been well supported by web browsers for many years. That’s the way all 29 language versions of Joel on Software are encoded and I have not yet heard a single person who has had any trouble viewing them.

当CityDesk发布网页时,它会将其转换为UTF-8编码,这种编码多年来一直受到Web浏览器的良好支持。这就是所有29种语言版本的Joel on Software编码的方式,我还没有听过一个人在查看它们时遇到任何问题。

This article is getting rather long, and I can’t possibly cover everything there is to know about character encodings and Unicode, but I hope that if you’ve read this far, you know enough to go back to programming, using antibiotics instead of leeches and spells, a task to which I will leave you now.

这篇文章已经相当长了,我不可能涵盖关于字符编码和Unicode的所有知识,但我希望,如果你读到了这里,你知道的已经足够多,可以回去编程了,用抗生素,而不是水蛭和咒语。这个任务现在就交给你了。

