New Chinese Encoding GB-18030
Over the last few months, I have received many questions about the new Chinese character encoding scheme, GB 18030, and how Microsoft products will support it. With the help of many colleagues, I now have some answers for you, including a link to the new GB 18030 support package.
•
Will GB18030 have a code page identifier?
•
Will GB18030 have an associated System Locale?
•
Will GB18030 have Command Shell support?
•
Which Microsoft Platforms support GB18030?
•
Which Microsoft Platforms have been approved for use in China?
•
How can I obtain the Windows 2000 GB18030 support package?
•
How can I obtain the Windows XP support package?
•
What is in the GB18030 support package?
•
Will the support package work on Windows 2000?
•
Will the support package work on the US English version of Windows XP?
•
Will the support package work with MUI?
•
How do Windows Millennium, Windows 98 and Windows 95 support GB18030?
•
Will the GB18030 files be in Windows XP SP1 or Windows 2000 SP3?
•
Will existing applications run on a system that supports GB18030?
•
Are there differences between the Windows 2000 and Windows XP support?
•
Is GB18030 replacing the Windows Simplified Chinese code page (CP936)?
•
Can I input GB18030 characters on Windows 2000 and Windows XP?
•
Does Internet Explorer support GB18030?
•
Does the GB18030 support package support Tibetan, Mongolian, Yi and Uyghur?
•
Must application XXX conform to the new Chinese law?
•
What are the new APIs for GB18030?
•
How does Unicode handle the four-byte sequence conversion in GB18030?
•
Who should I contact if I have additional questions?
GB18030–2000 is a new Chinese character encoding standard. The standard contains many characters and has some tough new conformance requirements. GB18030-2000 encodes characters in sequences of one, two, or four bytes. These sequences are defined as follows:
1.
Single-byte: 0x00-0x7f
2.
Two-byte: 0x81-0xfe + 0x40-0x7e, 0x80-0xfe
3.
Four-byte: 0x81-0xfe + 0x30-0x39 + 0x81-0xfe + 0x30-0x39
The single-byte section applies the standard GB 11383 coding structure and principles by using the code points 0x00 through 0x7f. GB 11383 is identical to ISO 4873:1986.
The two-byte section uses two eight-bit bytes to express a character, much as most DBCS (double-byte character sets) do. The leading byte ranges from 0x81 through 0xfe. The trailing byte ranges from 0x40 through 0x7e and from 0x80 through 0xfe. This section has the same problem as most DBCS: some byte values can be either a leading or a trailing byte, which makes character delimitation more complicated.
The four-byte section uses the byte values 0x30 through 0x39 in the second and fourth positions to extend the two-byte encodings, so the four-byte code points range from 0x81308130 through 0xfe39fe39.
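The byte ranges above are enough to decide how long a sequence is by reading forward from a known character boundary. The following is a minimal sketch of that check in portable C, not code from the support package; the function name gb18030_seq_len is hypothetical, and the function only validates byte ranges, not whether a code point is actually assigned.

```c
#include <stddef.h>

/* Hypothetical helper: returns the length (1, 2, or 4) of the GB18030
   sequence that starts at s, or 0 if the bytes do not match any of the
   three forms or the buffer ends too early. n is the number of bytes
   available at s. Only byte ranges are checked. */
static int gb18030_seq_len(const unsigned char *s, size_t n)
{
    if (n == 0)
        return 0;
    if (s[0] <= 0x7f)
        return 1;                              /* single-byte form */
    if (s[0] < 0x81 || s[0] > 0xfe)
        return 0;                              /* 0x80 and 0xff are not lead bytes */
    if (n < 2)
        return 0;
    if ((s[1] >= 0x40 && s[1] <= 0x7e) || (s[1] >= 0x80 && s[1] <= 0xfe))
        return 2;                              /* two-byte form */
    if (s[1] >= 0x30 && s[1] <= 0x39) {        /* a digit byte signals four-byte form */
        if (n < 4)
            return 0;
        if (s[2] >= 0x81 && s[2] <= 0xfe && s[3] >= 0x30 && s[3] <= 0x39)
            return 4;
    }
    return 0;
}
```

Note that the check only works from the start of a character: the second byte decides between the two-byte and four-byte forms, so a parser has to scan forward from a known boundary.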
GB18030 is noteworthy in several ways:
1.
It is very large – with room for 1.6 million characters. This is larger than any other code page Windows supports, including Unicode.
2.
GB18030 characters can be one, two, or four bytes. Windows has never supported a four-byte code page before, and many system interfaces and data structures will not work with four-byte characters. Moreover, it is impossible to tell from a particular byte whether it is the first, second, third, or fourth byte of a character, which makes processing more difficult than for other code pages.
3.
Products sold in China must conform to this standard. The government defines and determines conformance.
4.
GB18030 conformance includes requirements to support some previously unsupported languages, for example Tibetan, Mongolian, Yi, and Uyghur.
It is now illegal to sell products in China that do not conform to the standard.
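The "1.6 million" figure in the list above follows directly from the byte ranges given earlier. As a rough check, the sketch below counts the code positions in each form; the function names are hypothetical, and the counts are positions the encoding makes room for, not assigned characters.

```c
/* Count the code positions implied by the GB18030 byte ranges. */

static long gb18030_single_count(void)
{
    return 0x7f - 0x00 + 1;                                /* 128 */
}

static long gb18030_double_count(void)
{
    long lead  = 0xfe - 0x81 + 1;                          /* 126 lead bytes */
    long trail = (0x7e - 0x40 + 1) + (0xfe - 0x80 + 1);    /* 63 + 127 = 190 trail bytes */
    return lead * trail;                                   /* 23,940 */
}

static long gb18030_quad_count(void)
{
    long b13 = 0xfe - 0x81 + 1;                            /* 126 values for bytes 1 and 3 */
    long b24 = 0x39 - 0x30 + 1;                            /* 10 values for bytes 2 and 4 */
    return b13 * b24 * b13 * b24;                          /* 1,587,600 */
}
```

The three counts add up to 1,611,668 positions. For comparison, Unicode's 17 planes of 65,536 code points give 1,114,112.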
Yes, the Microsoft code page identifier is 54936.
No, it only has a code page identifier, as mentioned previously, to allow for conversions to and from Unicode.
No, it is not supported in text mode I/O. Only applications performing windowed mode I/O can use this encoding.
Product                                Support
Windows XP                             Yes (with add-on)
Windows 2000                           Yes (with add-on)
Windows 95, 98, Millennium, and NT4    No
Internet Explorer 6.0                  Yes
Internet Explorer 5.5 and earlier      No
Product               Status
Windows XP            Approved by the Chinese testing agency.
Windows 2000          Approved by the Chinese testing agency.
Windows NT4           Exempt from the law because released before the standard was published.
Windows 98            Exempt from the law because released before the standard was published.
Windows 95            Exempt from the law because released before the standard was published.
Windows Millennium    Does not conform and may no longer be sold in China.
The GB18030 support package will be bundled on a CD with the Simplified Chinese version of Windows 2000 sold in China. The SKU is being updated. The package is also available for download from the website of the Microsoft subsidiary in China at http://www.microsoft.com/china/windows2000/downloads/18030.asp.
If you don't read Chinese, click on the URL near the bottom of the page that says:
then on the next page, click on the URL that begins with GBSETUP.EXE. (If you would like to see the English translation of the first Chinese page, click here.)
The Windows XP support package is bundled with the Simplified Chinese version of Windows XP sold in China.
The contents of the support package are identical to the Windows 2000 download listed above.
This GB18030 support package contains:
1.
SimSun18030.ttc: a font for GB18030.
2.
c_g18030.dll: a system library that extends the MultiByteToWideChar and WideCharToMultiByte program interfaces to support GB18030.
3.
Gbunicnv.exe: an end-user applet that converts between GB18030 plain text and HTML files and Unicode.
4.
Ms4bsp.dll: a system library that provides a few program interfaces useful for applications that use GB18030 as their character type.
It also contains the EULA (End–User License Agreement). The entire package is approximately 12 MB to download.
Yes, the contents of the CD are identical to the contents of the Windows 2000 support package.
Yes, Windows XP is a single, world–wide binary–the GB18030 support package will work on any version of Windows XP, including English. However, the system must have the optional East Asian language support installed.
Yes, MUI is identical to the English version of Windows XP. The optional East Asian language support must be installed.
They do not, and there are no plans in the near future to add support.
No. Some of the GB18030 files are large; the font is 12 MB. It has been decided this is too big to add to an SP (Service Pack).
The Windows XP CHS version will bundle the GB18030 support package. Windows 2000 CHS SKU will also be updated with the support package.
Yes, all of the GB18030 support is in addition to the regular system support. Existing applications are completely unaffected.
Yes, there is one difference:
Windows XP supports programmatic and end–user conversion of GB18030 files even without the add–on. In other words, c_g18030.dll and gbunicnv.exe are present on every version of Windows XP. These files are installed on Windows 2000 only when the GB18030 support is added.
No. Windows code pages must be either one byte (SBCS) or a mix of one and two bytes (DBCS). This requirement is reflected throughout our code, e.g., in data structures, program interfaces, network protocols, and applications. The existing code page for Simplified Chinese, CP936, is a double-byte code page. GB18030 is a four-byte code page, i.e., every character is represented by one, two, or four bytes. Replacing CP936 with GB18030 would require rewriting much of the system. Even if we were to do this, such a system would neither run regular applications nor interoperate with regular Windows.
Yes, currently you can use the Unicode mode of the CHS input method editor to enter the code point values directly.
Yes, IE6 recognizes GB18030 as a charset and will display such web pages, but only when it is run on Windows XP. Older versions of Internet Explorer do not.
The font, SimSun18030.ttc, contains the glyphs for these characters. However, normal typographic standards for these languages require additional processing that Windows does not currently support.
The Chinese standard body has said that the ability to display the glyphs is sufficient for now.
We cannot speculate on the government's decisions. The government has said that it will enforce the standard first for platforms, and later for tools and applications.
The Microsoft 4–byte Character Set Encoding Support Package (MS4BSP) provides six functions that support multiple byte encodings that can be up to four bytes long. The API set is drawn from the set of WCHAR (Unicode) functions provided by Windows 95/98/Me. In each case, the function name is identical to the ANSI and WCHAR function except with an 'L' suffix instead of 'A' or 'W'.
•
ExtTextOutL
•
GetTextExtentExPointL
•
GetTextExtentPoint32L
•
MessageBoxL
•
MessageBoxExL
•
TextOutL
The function parameters are identical to the 'A' version of the interface. The package is designed to expedite conversion of a code page 936 based application to GB18030 or other 4–byte encoding.
The Windows XP and Windows 2000 implementation of MS4BSP is delivered as a single DLL, ms4bsp.dll. Each function invokes the MultiByteToWideChar function to convert any multibyte string inputs to UTF-16. It then invokes the 'W' version of the function and returns the output parameters of that function. The DLL assumes the fonts, IMEs, and registry settings for the encoding are present on the system.
BOOL ExtTextOutL(
    HDC hdc,            // handle to DC
    int X,              // x-coordinate of reference point
    int Y,              // y-coordinate of reference point
    UINT fuOptions,     // text-output options
    CONST RECT* lprc,   // optional dimensions
    LPCSTR lpString,    // string
    UINT cbCount,       // number of characters in string
    CONST INT* lpDx     // array of spacing values
);

BOOL GetTextExtentExPointL(
    HDC hdc,            // handle to DC
    LPCSTR lpszStr,     // character string
    int cchString,      // number of characters
    int nMaxExtent,     // maximum width of formatted string
    LPINT lpnFit,       // maximum number of characters
    LPINT alpDx,        // array of partial string widths
    LPSIZE lpSize       // string dimensions
);

BOOL GetTextExtentPoint32L(
    HDC hdc,            // handle to DC
    LPCSTR lpString,    // text string
    int cbString,       // characters in string
    LPSIZE lpSize       // string size
);

int MessageBoxL(
    HWND hWnd,          // handle to owner window
    LPCSTR lpText,      // text in message box
    LPCSTR lpCaption,   // message box title
    UINT uType          // message box style
);

int MessageBoxExL(
    HWND hWnd,          // handle to owner window
    LPCSTR lpText,      // text in message box
    LPCSTR lpCaption,   // message box title
    UINT uType,         // message box style
    WORD wLanguageId    // language identifier
);

BOOL TextOutL(
    HDC hdc,            // handle to DC
    int nXStart,        // x-coordinate of starting position
    int nYStart,        // y-coordinate of starting position
    LPCSTR lpString,    // character string
    int cbString        // number of characters
);
Unicode is capable of addressing more than 1.1 million code points, and the standard has provisions for 8-bit, 16-bit, and 32-bit encoding forms. The 16-bit form is used as the default encoding; its million-plus code points are distributed across 17 "planes," each addressing 65,536 code points.
The characters in Plane 0 -- or as it is commonly called the "Basic Multilingual Plane" (BMP) -- are used to represent most of the world's written scripts, characters used in publishing, mathematical and technical symbols, geometric shapes, basic dingbats (including all level-100 Zapf Dingbats), and punctuation marks.
But in addition to the support for characters in modern languages and for the symbols and shapes just mentioned, Unicode also offers coverage for other characters, such as less commonly used Chinese, Japanese, and Korean (CJK) ideographs, Arabic presentation forms, and musical symbols. Many of these additional characters are mapped beyond the original plane using an extension mechanism called "surrogate pairs." It is these surrogate pairs that allow the mapping of the GB18030 four-byte sequences to Unicode.
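To make the surrogate-pair mapping concrete, here is a minimal sketch in portable C. It rests on an assumption that is not stated in this column: in the published GB18030 mapping as I understand it, the four-byte sequences from 0x90308130 through 0xE3329A35 map linearly onto U+10000 through U+10FFFF. All function names are hypothetical. Four-byte sequences that map into the BMP need a lookup table and are not handled here.

```c
#include <stdint.h>

/* Position of a four-byte sequence within the four-byte space.
   Assumes the bytes were already validated against the ranges
   0x81-0xfe, 0x30-0x39, 0x81-0xfe, 0x30-0x39. */
static uint32_t gb18030_linear(const unsigned char b[4])
{
    return (((uint32_t)(b[0] - 0x81) * 10 + (uint32_t)(b[1] - 0x30)) * 126
            + (uint32_t)(b[2] - 0x81)) * 10 + (uint32_t)(b[3] - 0x30);
}

/* Assumed linear mapping: 0x90308130 corresponds to U+10000.
   Returns 0 for sequences outside the supplementary range. */
static uint32_t gb18030_supplementary_to_codepoint(const unsigned char b[4])
{
    static const unsigned char first[4] = {0x90, 0x30, 0x81, 0x30};
    uint32_t lin  = gb18030_linear(b);
    uint32_t base = gb18030_linear(first);
    if (lin < base || lin - base > 0xFFFFF)
        return 0;
    return 0x10000 + (lin - base);
}

/* Standard UTF-16 split of a code point in U+10000..U+10FFFF. */
static void codepoint_to_surrogates(uint32_t cp, uint16_t *high, uint16_t *low)
{
    uint32_t v = cp - 0x10000;
    *high = (uint16_t)(0xD800 + (v >> 10));     /* top 10 bits */
    *low  = (uint16_t)(0xDC00 + (v & 0x3FF));   /* bottom 10 bits */
}
```

Under that assumption, the bytes 0x95 0x32 0x82 0x36 yield U+20000, the first CJK Extension B character, which UTF-16 stores as the pair 0xD840 0xDC00.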
You should already know the answer to this: Dr. International.
•
I am having difficulties entering and displaying characters represented in GB18030 encoding. Why?
•
How do I enter CJK Extension A and B characters?
•
What do I need to get a functional GB18030 platform?
Fonts and IMEs that support characters in Unicode CJK Extension A and B sets that are needed to represent characters in GB18030 encoding are still in development. Problems concerning display can be attributed mostly to lack of glyphs in fonts. Here's a summary:
•
The SimSun18030 font (simsun18030.ttc) in the GB18030 support package does not have the complete GB18030 repertoire. It is missing the entire CJK Extension B set, as well as other characters. It does have, however, CJK Extension A.
•
Character Map (charmap.exe) and other applications (e.g., Internet Explorer 6.0) can only show character ranges that were defined at the time the operating system was released. CJK Extension A was added to the Unicode standard after Microsoft released Windows 2000 but before Windows XP, so Charmap on Windows 2000 will not display CJK Extension A characters. On Windows XP, the Character Map applet will display CJK Extension A characters but not CJK Extension B characters, because it does not yet handle UTF-16 surrogate pairs.
A font that contains Simplified Chinese glyphs from both CJK Extension A and B sets is "SimSun (Founder Extended)" (SurSong.ttf in the system), or 宋体–方正超大字符集 (in Chinese). It is currently available in the Simplified Chinese (CHS) version of Office XP, or the Microsoft Office Proofing Tools. Click the link for more information and how to buy.
As for IMEs that support the input of GB18030 characters, the NeiMa IME that ships in Windows XP and the enhanced Unicode IME that ships in the Microsoft Office Proofing Tools both support the input of CJK Extension A and B characters on Windows XP using surrogate code points. One of the above-mentioned fonts must be present, or you'll just see blanks or squares. For more information about surrogate code points, see Dr. International #18.
Microsoft is working to obtain a font that contains both CJK Extension A and B for Windows.
Once the system fonts have been upgraded to include these glyphs, the automatic system font-linking will make the glyphs appear without explicitly selecting a GB18030-capable font such as SimSun-18030 (SimSun18030.ttc), and input will also be more intuitive.
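When diagnosing a missing glyph, it can help to check which extension block a code point falls in, since the SimSun18030 font covers CJK Extension A but not Extension B. The sketch below is illustrative only: the names are hypothetical, and the block boundaries are the Unicode block ranges as I know them (U+3400 through U+4DBF for Extension A, U+20000 through U+2A6DF for Extension B), not values taken from this column.

```c
#include <stdint.h>

typedef enum { CJK_OTHER, CJK_EXT_A, CJK_EXT_B } cjk_block;

/* Classify a Unicode code point by CJK extension block. */
static cjk_block classify_cjk_extension(uint32_t cp)
{
    if (cp >= 0x3400 && cp <= 0x4DBF)
        return CJK_EXT_A;       /* in the BMP; SimSun18030 has these glyphs */
    if (cp >= 0x20000 && cp <= 0x2A6DF)
        return CJK_EXT_B;       /* Plane 2; needs SimSun (Founder Extended) */
    return CJK_OTHER;
}

/* Code points above U+FFFF are stored as a surrogate pair in UTF-16. */
static int needs_surrogate_pair(uint32_t cp)
{
    return cp > 0xFFFF;
}
```

Every Extension B code point also needs a surrogate pair, which is why applications that do not handle surrogates cannot show those characters even with the right font installed.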
The NeiMa IME that is included in Windows 2000 and XP supports the input of CJK Extension A and Extension B characters via surrogate pairs when set to Unicode mode. When inputting surrogate pairs, you may ignore what seems to be a strange behavior. Upon inputting the first half of a surrogate pair, a square box will appear (this is the system's way to let you know that you have entered something, but that the current font does not have a glyph to represent it). After you input the second half, the box will go away and the character will display correctly, provided that you have the corresponding font. This works on both Windows 2000 and Windows XP.
The Microsoft Office Proofing Tools includes an "Enhanced Unicode IME" that addresses the above–mentioned flaw. Click on the link to find out more. Here's how to set it up after purchase:
•
From the Proofing Tools setup, select Custom setup to install Simplified Chinese Proofing Tools only. This will load the necessary IME and the “SurSong.ttf” font.
•
Upon completion, load the desired IME (NeiMa or Enhanced Unicode) from Regional Options' Languages tab in Windows XP, or Input Locales tab on Windows 2000.
•
Open an Office application, say Word. Set the font to "SimSun (Founder Extended)" font (SurSong.ttf) from the font drop–down list in Word.
•
Switch to your selected IME. Once it is enabled, click on the center part of its toolbar until it displays Unicode. And off you go.
NOTE: You can search for help on the Simplified Chinese version of Office XP, or the Office XP with MUI Pack (with the user interface set to CHS) for how to get the Unicode value of a given character with the help of KangXi dictionary in order to input characters in GB18030. Use “Surrogate” or “大字符集” as your keyword.
To obtain a functional GB18030 platform, the Dr. recommends:
1.
Microsoft Windows XP, English or Simplified Chinese
2.
Microsoft GB18030 Support Package (provided on a separate CD–ROM if you purchase CHS Windows XP, or via a download on other systems. See the question “How can I obtain the Windows 2000 GB18030 support package?” above).
3.
Microsoft Office XP Simplified Chinese Version with Office Proofing Tools, OR Microsoft Office XP English version with Office Proofing Tools to get the Enhanced Unicode IME and SimSun (Founder Extended) font.
To enable the display of CJK Extension A characters in IE 6.0 on Windows 2000 or XP, go to Internet Explorer/Tools/Internet Options/Fonts and select NSimSun (the GB18030 font) as the default for Simplified Chinese; if you create the HTML content, you can set this font explicitly instead. You will then see CJK Extension A characters if the web page contains them. As mentioned earlier, the font in the GB18030 package does not contain CJK Extension B glyphs and therefore cannot display them. IE 6.0 can also display CJK Extension B characters if the SimSun (Founder Extended) font is present.
The capability of displaying characters in IE is tied to the OS and the availability of fonts and the availability of glyphs in a font. Windows XP supports GB18030, CJK Extension A, and UTF–16 Surrogate Pairs, but fonts containing all these characters are still hard to find.
I hope this helps.
Dr. International
Windows International Division