Character sets
Contributed by Ken Fowles, Personal Systems Division, Microsoft.
This page starts with a summary, then digs into ASCII, OEM, ANSI, DBCS, and Unicode character sets, and how character sets affect technology at Microsoft.
Summary
Character sets affect two fundamental parts of your code: how characters are stored and encoded in your data, and how every algorithm that touches text must process strings.
Character sets do not solve the rest of international engineering: sorting and user-interface translation, for example, still have to be handled separately.
In the dark ages, developers generally ignored character sets. Since one ANSI character set can handle Western European languages like English, French, German, Italian and Spanish, other languages were considered special cases or not handled at all.
Many, but not all, of the world's major writing systems can be represented within 256 characters, using individual 8-bit character sets. It's important to note there isn't an 8-bit character set which can represent all of these languages at once, or even just the languages required by the European Union.
Languages which require more than 256 characters include Chinese (Traditional and Simplified), Japanese, and Korean (Hangeul). It is a requirement, not an option: any application which touches text in these languages must correctly handle DBCS or Unicode string processing and data. Unless you enjoy throwing away a lot of code and algorithms, it's best to implement this from day one in all your text-handling code.
ASCII
ASCII is a 7-bit character set: 2 to the 7th power, or 128, characters. There's room in ASCII for upper- and lowercase English, American English punctuation, the base-10 digits, a few control characters, and not much else. Although very primitive, ASCII is the one common denominator contained in all the other common character sets, so the only means of interchanging data across all major languages (without risk of character-mapping loss) is to use ASCII (or have all sides understand Unicode). For example, the safest way to store filenames on a typical network today is to use the ASCII subset of characters. If you manually log into CompuServe, they require a 7-bit instead of 8-bit modem protocol, since their servers were originally ASCII-based.
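As a quick illustration, here is a minimal C sketch that tests whether a buffer stays within the 7-bit ASCII range and is therefore safe for this kind of interchange (the function name is_ascii is mine, not from any standard library):

    #include <stddef.h>

    /* Returns 1 if every byte is in the 7-bit ASCII range (0x00-0x7F),
       making the buffer safe to interchange across the character sets
       described here; returns 0 otherwise. */
    int is_ascii(const unsigned char *buf, size_t len)
    {
        size_t i;
        for (i = 0; i < len; i++) {
            if (buf[i] > 0x7F)
                return 0;
        }
        return 1;
    }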
OEM 8-bit characters
Back in the DOS days, separate Original Equipment Manufacturer (OEM) code pages were created so that text-mode PCs could display and print line-draw characters. They're still used today for direct FAT access, and for accessing data files created by MS-DOS based applications. OEM code pages typically have a 3-digit label, such as CP 437 for American English.
The emphasis with OEM code pages was line-draw characters. That was a good idea at the time, since the standard video for the original IBM PC was a monochrome text card with 2K of RAM, connected to an attractive green monitor. However, the line-draw characters took up a lot of space in the 256-character map, leaving very little room for international characters. And since each hardware OEM was free to set its own character standards, some situations continue today where characters can be scrambled or lost even within the same language, if two OEM code pages assign different code points. For example, a few characters were mapped differently between Russian MS-DOS and Russian IBM PC-DOS, so data movement between them is unreliable, or software has to be written to map each special case.
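For illustration, a minimal Win32 sketch of querying the system's code pages and mapping a string from the ANSI code page to the OEM code page (GetACP, GetOEMCP, and CharToOemA are the standard Win32 calls for this; the sample string is my own):

    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        char ansi[] = "Text in the ANSI code page";
        char oem[sizeof(ansi)];

        /* Report which code pages this system is using. */
        printf("ANSI code page: %u, OEM code page: %u\n",
               GetACP(), GetOEMCP());

        /* Map an ANSI string to the OEM code page, e.g. before writing
           a FAT filename or data for an MS-DOS based application.
           Characters with no OEM equivalent map to a best-fit default. */
        CharToOemA(ansi, oem);
        printf("OEM form: %s\n", oem);
        return 0;
    }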
Users aren't going to suddenly erase all their old data and reformat all their disks. The raw data and FAT filenames created with OEM code pages will be around for a long time.
Windows ANSI
Since Windows GDI eliminates the need for text-based line-draw characters, the old OEM line-draw code points could be freed up for something more useful, like international characters and publishing symbols. An assortment of 256-character Windows ANSI character sets covers all the 8-bit languages targeted by Windows.
You can think of Windows ANSI as a lower 128 and an upper 128. The lower 128 is identical to ASCII; the upper 128 is different for each ANSI character set, and is where the various international characters are parked.
code page | 1250           | 1251     | 1252           | 1253  | 1254    | etc.
upper 128 | Eastern Europe | Cyrillic | West Euro ANSI | Greek | Turkish | etc.
lower 128 | ASCII          | ASCII    | ASCII          | ASCII | ASCII   | etc.
The European Union includes more languages than Code Page 1252 can cover - specifically, Greek is missing, and there's no way to fit it all into 256 characters. Switching entirely to Unicode would allow coverage of all EU languages (and a lot more) in one character set, but that conversion is not automatic, and requires that every algorithm which touches text be inspected or rewritten. So an interim solution is available which allows spanning multiple ANSI code pages within one document: Multilingual Content I/O. Remember this is for multilingual document content, not user interface - two separate issues.
DBCS
DBCS stands for Double-Byte Character Set, but DBCS encodings are actually multi-byte: a mix of 8-bit and 16-bit characters. Modern writing systems used in the Far East region typically require thousands of characters - roughly 3,000 to 15,000 at a minimum.
There are several DBCS character sets supported by Far East editions of Microsoft Windows. Leadbytes signal that the following byte is the trailbyte of a 16-bit character unit, instead of the start of the next character. Each DBCS code page has a different leadbyte and trailbyte range. No leadbytes fall within the lower 128 (ASCII) range, but some trailbytes do.
The main rules for DBCS-enabling (a summary of the usual guidance; see the sketch after this list):
- Never assume a character is one byte; a string's length in bytes is not its length in characters.
- Walk strings with calls like CharNext, CharPrev, or IsDBCSLeadByte instead of plain pointer increments, so a trailbyte is never misread as a character of its own.
- When truncating or splitting a buffer, never separate a leadbyte from its trailbyte.
- Size buffers in bytes, but count and report lengths in characters.
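Here is a minimal sketch of these rules in C, assuming the Win32 IsDBCSLeadByte call and the active system code page (the function name dbcs_char_count is illustrative):

    #include <windows.h>
    #include <stddef.h>

    /* Counts characters (not bytes) in a DBCS string by honoring
       leadbytes, so a trailbyte is never treated as its own character. */
    size_t dbcs_char_count(const char *s)
    {
        size_t count = 0;
        while (*s != '\0') {
            if (IsDBCSLeadByte((BYTE)*s) && *(s + 1) != '\0')
                s += 2;   /* leadbyte + trailbyte = one character */
            else
                s += 1;   /* single-byte character (includes ASCII) */
            count++;
        }
        return count;
    }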
What is Unicode / ISO 10646 ?
Unicode is a 16-bit character set which contains all of the characters commonly used in information processing. Approximately one third of the 64K possible code points are still unassigned, to leave room for additional characters in the future.
Unicode is not a technology in itself. Sometimes people misunderstand Unicode and expect it to 'solve' international engineering, which it doesn't. Unicode is an agreed-upon way to store characters, a standard supported by the members of the Unicode Consortium.
The fundamental idea behind Unicode is to be language-independent, which helps conserve space in the character map: no single character is assumed to identify a language by itself. Just as the character "a" can be a French, German, or English "a" even when it carries different meanings, a particular Han ideograph might map to a character used in Chinese, Japanese, and Korean. Native speakers sometimes misunderstand this and complain that Unicode doesn't "look" correct in, say, Japanese, but that's intentional: appearance should reside in the font as an artistic issue, not in the code point as an engineering issue. Although it's technically possible to ship one font which covers all Unicode characters, it would have very limited commercial use, since end users in Asia will expect fonts dedicated to, and designed to look correct in, their language.
This language independence also means Unicode does not imply any sort order. The older 8-bit and DBCS character sets usually embed a sort order, but that meant a new character set had to be created whenever a different sort order was needed, which makes a mess of data interchange between languages. Instead, Unicode expects the host operating system to handle sorting, as the Win32 NLS APIs do.
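As a sketch of what that looks like on Win32, CompareString applies the locale's sorting rules that the character set itself doesn't carry (the sample strings and the NORM_IGNORECASE flag are illustrative choices):

    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        /* Let the operating system sort, rather than the character set:
           CompareString applies the user locale's rules to Unicode text. */
        int r = CompareStringW(LOCALE_USER_DEFAULT, NORM_IGNORECASE,
                               L"Apple", -1, L"apple", -1);

        if (r == CSTR_LESS_THAN)
            printf("first sorts before second\n");
        else if (r == CSTR_EQUAL)
            printf("strings sort as equal\n");
        else if (r == CSTR_GREATER_THAN)
            printf("first sorts after second\n");
        return 0;
    }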
Data interchange between languages
This is where Unicode has the clearest advantage over code pages. Unicode is essentially a superset of every Windows ANSI, Windows DBCS, and DOS OEM character set. So, for example, a Unicode-based Internet browser could let its user simultaneously view Web pages containing text in practically any language, as long as the appropriate fonts are on the machine.
Unicode is even useful for products which don't rely on Unicode for string processing, since it makes a good common denominator for mapping characters between code pages. Instead of manually creating a separate mapping table for every possible pair of code pages, it's easier to map from one code page to Unicode, and then back out to the other code page. The Win32 SDK sample UCONVERT shows how to use the system's *.nls tables to accomplish part of this task.
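A minimal sketch of that round trip using the Win32 conversion APIs (the helper name remap and the fixed intermediate buffer size are my own choices, not from the SDK sample):

    #include <windows.h>

    /* Maps a string from one code page to another by pivoting through
       Unicode, as described above. Pass code page numbers such as 1251
       (Cyrillic ANSI) and 866 (Russian OEM); returns the byte count
       written to dst, or 0 on failure. */
    int remap(const char *src, UINT from_cp, UINT to_cp,
              char *dst, int dst_bytes)
    {
        WCHAR wide[512];
        int n;

        /* Step 1: source code page -> Unicode. */
        n = MultiByteToWideChar(from_cp, 0, src, -1, wide, 512);
        if (n == 0)
            return 0;

        /* Step 2: Unicode -> destination code page. */
        return WideCharToMultiByte(to_cp, 0, wide, -1,
                                   dst, dst_bytes, NULL, NULL);
    }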
Impact on your project
Unicode-enabling is not an automatic process. Since Unicode requires 16-bit characters, many of the same coding assumptions that break ANSI code on DBCS will also break it on Unicode: your pointer math can't assume 8-bit characters, and you will need to test for correct string handling in every place your code directly touches text. Fortunately there are some shortcuts.
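One common shortcut on Win32 is the generic-text approach from TCHAR.H, which lets a single source tree compile as either ANSI or Unicode; a minimal sketch:

    #include <tchar.h>
    #include <stdio.h>

    int main(void)
    {
        /* TCHAR compiles to a wide character when UNICODE/_UNICODE is
           defined, and to char otherwise; _T() does the same for literals. */
        TCHAR greeting[] = _T("Hello, world");

        /* _tcslen counts characters, not bytes, in either build. */
        _tprintf(_T("%u characters\n"), (unsigned)_tcslen(greeting));
        return 0;
    }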
Twelve steps to Unicode-enabling
from Developing International Applications, pages 109-111, Microsoft Press. One detail from those pages worth calling out is how the printf family interprets string format specifiers:
Specifier | printf expects | wprintf expects
%s        | SBCS or MBCS   | Unicode
%S        | Unicode        | SBCS or MBCS
%hs       | SBCS or MBCS   | SBCS or MBCS
%ls       | Unicode        | Unicode
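A short illustration of the table, assuming Microsoft's C runtime (where these specifiers behave as shown, and mixing printf and wprintf on one stream is tolerated):

    #include <stdio.h>
    #include <wchar.h>

    int main(void)
    {
        char    narrow[] = "narrow";
        wchar_t wide[]   = L"wide";

        /* printf: %s takes narrow strings, %S takes wide ones. */
        printf("%s %S\n", narrow, wide);

        /* wprintf: the meanings flip, as the table shows. */
        wprintf(L"%S %s\n", narrow, wide);

        /* %hs and %ls are unambiguous in both. */
        printf("%hs %ls\n", narrow, wide);
        return 0;
    }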
For samples of Unicode-enabled programming, UCONVERT shows character conversion using the system's *.nls tables, and GRIDFONT shows how font enumeration and display needs to keep track of character encoding.