Delphi in a Unicode World Part I

---

Delphi in a Unicode World Part I: What is Unicode, Why do you need it, and How do you work with it in Delphi?

By: Nick Hodges

Original link: http://dn.codegear.com/article/38437

Abstract: This article discusses Unicode, how Delphi developers can benefit from using Unicode, and how Unicode will be implemented in Delphi 2009.

Introduction

The Internet has broken down geographical barriers, enabling world-wide software distribution. As a result, applications can no longer live in a purely ANSI-based environment. The world has embraced Unicode as the standard means of transferring text and data. Because it supports virtually any writing system in the world, Unicode text is now the norm throughout the global technological ecosystem.

What is Unicode?

Unicode is a character encoding standard that allows virtually all alphabets to be encoded in a single character set. It allows computers to manage and represent text in most of the world’s writing systems. Unicode is managed by The Unicode Consortium and codified in a standard. More simply put, Unicode is a system for enabling everyone to use each other’s alphabets. Heck, there is even a Unicode version of Klingon.

This series of articles isn’t meant to give you a full rundown of exactly what Unicode is and how it works; instead it is meant to get you going on using Unicode within Delphi 2009. If you want a good overview of Unicode, Joel Spolsky has a great article entitled “The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)” which is highly recommended reading. As Joel clearly points out “IT’S NOT THAT HARD”. This article, Part I of III, will discuss why Unicode is important, and how Delphi will implement the new UnicodeString type.

Why Unicode?

Among the many new features found in Delphi 2009 is the imbuing of Unicode throughout the product. The default string in Delphi is now a Unicode-based string. Since Delphi is largely built with Delphi, the IDE, the compiler, the RTL, and the VCL all are fully Unicode-enabled.

The move to Unicode in Delphi is a natural one. Windows itself is fully Unicode-aware, so it is only natural that applications built for it use a Unicode string as the default string. And for Delphi developers, the benefits don’t stop merely at being able to use the same string type as Windows.

The addition of Unicode support provides Delphi developers with a great opportunity. Delphi developers can now read, write, accept, produce, display, and deal with Unicode data – and it’s all built right into the product. With only a few, or in some cases zero, code changes, your applications can be ready for any kind of data you, your customers, or your users can throw at them. Applications that were previously restricted to ANSI-encoded data can be easily modified to handle almost any character set in the world.

Delphi developers will now be able to serve a global market with their applications -- even if they don’t do anything special to localize or internationalize their applications. Windows itself supports many different localized versions, and Delphi applications need to be able to adapt and work on machines running any of the large number of locales that Windows supports, including the Japanese, Chinese, Greek, or Russian versions of Windows. Users of your software may be entering non-ANSI text into your application or using non-ANSI based path names. ANSI-based applications won’t always work as desired in those scenarios. Windows applications built with a fully Unicode-enabled Delphi will be able to handle and work in those situations. Even if you don’t translate your application into any other spoken languages, your application still needs to be able to work properly -- no matter what the end user’s locale is.

For existing ANSI-based Delphi applications, the opportunity to localize them and expand their reach into Unicode-based markets is potentially huge. And if you do want to localize your applications, Delphi makes that very easy, especially now at design-time. The Integrated Translation Environment (ITE) enables you to translate, compile, and deploy an application right in the IDE. If you require external translation services, the IDE can export your project in a form that translators can use in conjunction with the deployable External Translation Manager. These tools work together with the IDE for both Delphi and C++Builder to make localizing your applications a smooth and easy-to-manage process.

The world is Unicode-based, and now Delphi developers can be a part of that in a native, organic way. So if you want to be able to handle Unicode data, or if you want to sell your applications to emerging and global markets, you can do it with Delphi 2009.

A Word about Terminology

Unicode encourages the use of some new terms. For instance, the idea of a “character” is a bit less precise in the world of Unicode than you might be used to. In Unicode, the more precise term is “code point”. In Delphi 2009, SizeOf(Char) is 2, but even that doesn’t tell the whole story. Depending on the encoding, a given character can take up more than two bytes: in UTF-16, a code point outside the Basic Multilingual Plane is encoded as two 16-bit code units, and such a sequence is called a “surrogate pair”. So a code point is a unique code assigned to an element defined by the Unicode Consortium. Most commonly that is a “character”, but not always.
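The difference between a Char and a code point can be seen in a small console program. This is only a sketch, assuming Delphi 2009 or later; the program name and the choice of U+1D11E (MUSICAL SYMBOL G CLEF) as the sample code point are illustrative, and the surrogate checks are written as plain range tests rather than RTL calls.

```delphi
program SurrogateDemo;
{$APPTYPE CONSOLE}

var
  S: string;
begin
  // U+1D11E lies outside the Basic Multilingual Plane, so UTF-16
  // encodes it as the surrogate pair D834 DD1E (two Char elements).
  S := Char($D834) + Char($DD1E);

  Writeln('SizeOf(Char)  = ', SizeOf(Char));
  Writeln('Length(S)     = ', Length(S));                 // counts Chars, not code points
  Writeln('Bytes in S    = ', Length(S) * SizeOf(Char));
  Writeln('High surrogate: ', (Ord(S[1]) >= $D800) and (Ord(S[1]) <= $DBFF));
  Writeln('Low surrogate : ', (Ord(S[2]) >= $DC00) and (Ord(S[2]) <= $DFFF));
end.
```

Here Length reports 2 even though the string holds a single code point, which is why code that walks a string element by element should not assume that each element is a complete character.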

Another term you will see in relation to Unicode is “BOM”, or Byte Order Mark: a very short prefix at the beginning of a text file that indicates the encoding used for that file. MSDN has a nice article on what a BOM is. The new TEncoding class (to be discussed in Part II) has a method called GetPreamble which returns the BOM for a given encoding.
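As a rough illustration of GetPreamble, the following sketch prints the BOM bytes for three of the standard encodings exposed by TEncoding. The program name and the ShowPreamble helper are made up for this example; it assumes the SysUtils unit as shipped with Delphi 2009.

```delphi
program BomDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: prints the preamble (BOM) bytes of an encoding in hex.
procedure ShowPreamble(const Name: string; Encoding: TEncoding);
var
  Bom: TBytes;
  I: Integer;
begin
  Bom := Encoding.GetPreamble;
  Write(Name, ': ');
  for I := 0 to Length(Bom) - 1 do
    Write(IntToHex(Bom[I], 2), ' ');
  Writeln;
end;

begin
  ShowPreamble('UTF-8', TEncoding.UTF8);                  // EF BB BF
  ShowPreamble('UTF-16 LE', TEncoding.Unicode);           // FF FE
  ShowPreamble('UTF-16 BE', TEncoding.BigEndianUnicode);  // FE FF
end.
```

A program that reads a text file can compare the first bytes of the file against these preambles to guess the encoding before decoding the rest.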

Now that all that has been explained, we’ll look at how Delphi 2009 implements a Unicode-based string.

The New UnicodeString Type

The default string in Delphi 2009 is the new UnicodeString type. By default, the UnicodeString type will have an affinity for UTF-16, the same encoding used by Windows. This is a change from previous versions which had AnsiString as the default type. The Delphi RTL has in the past included the WideString type to handle Unicode data, but this type is not reference-counted as the AnsiString type is, and thus isn’t as full-featured as Delphi developers expect the default string to be.

For Delphi 2009, a new UnicodeString type has been designed that incorporates the capabilities of both the AnsiString and WideString types. A UnicodeString can contain either a Unicode-sized character or an ANSI byte-sized character. (Note that both the AnsiString and WideString types will remain in place.) The Char and PChar types will map to WideChar and PWideChar, respectively. Note, as well, that no string types have disappeared. All the types that developers are used to still exist and work as before.

However, for Delphi 2009, the default string type will be equivalent to UnicodeString. In addition, the default Char type is WideChar, and the default PChar type is PWideChar.

That is, the compiler effectively makes the following declarations:

type
  string = UnicodeString;
  Char = WideChar;
  PChar = PWideChar;

UnicodeString is assignment compatible with all other string types; however, assignments between AnsiStrings and UnicodeStrings will do type conversions as appropriate. Thus, an assignment of a UnicodeString type to an AnsiString type could result in data-loss. That is, if a UnicodeString contains high-order byte data, a conversion of that string to AnsiString will result in a loss of that high-order byte data.
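The possible data loss can be checked with a round trip through AnsiString. This is a minimal sketch, assuming Delphi 2009 or later; the program and variable names are illustrative, and the result depends on the ANSI code page of the machine it runs on, so no particular output is claimed.

```delphi
program AnsiLossDemo;
{$APPTYPE CONSOLE}

var
  U, RoundTrip: string;   // string = UnicodeString in Delphi 2009
  A: AnsiString;
begin
  // Two CJK characters (U+4E16, U+754C) written as code points, so the
  // encoding of the source file does not matter.
  U := Char($4E16) + Char($754C);

  // Explicit casts make both conversions visible in the code.
  A := AnsiString(U);       // UTF-16 -> system ANSI code page
  RoundTrip := string(A);   // ANSI code page -> UTF-16

  if RoundTrip = U then
    Writeln('Round trip preserved the text')
  else
    Writeln('Round trip lost data');
end.
```

On a system whose ANSI code page can represent these characters the text should survive; on one that cannot, the converted string will typically contain substitute characters instead.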

The important thing to note here is that this new UnicodeString behaves pretty much like strings always have (with the notable exception of their ability to hold Unicode data, of course). You can still add any string data to them, you can index them, you can concatenate them with the ‘+’ sign, etc.

For example, instances of a UnicodeString will still be able to index characters. Consider the following code:

 var
   MyChar: Char;
   MyString: string;
 begin
   MyString := 'This is a string';
   MyChar := MyString[1];
 end;

The variable MyChar will still hold the character found at the first index position, i.e. ‘T’. The functionality of this code hasn’t changed at all. Similarly, if we are handling Unicode data:

 var
   MyChar: Char;
   MyString: string;
 begin
   MyString := '世界您好';
   MyChar := MyString[1];
 end;

The variable MyChar will still hold the character found at the first index position, i.e. ‘世’.

The RTL provides helper functions for explicit conversions between code pages and between element sizes. Code that calls the Move function on a string’s character array can no longer assume that each element is one byte; byte counts should be computed from SizeOf(Char) instead.
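The point about Move can be sketched as follows. This is an illustrative example rather than code from the RTL, assuming Delphi 2009 or later; the program name, variable names, and sample text are invented for the demonstration.

```delphi
program MoveDemo;
{$APPTYPE CONSOLE}

var
  Source: string;
  Buffer: array of Char;
  ByteCount: Integer;
begin
  Source := 'Unicode';
  SetLength(Buffer, Length(Source));

  // Move counts bytes, not characters, so scale the length by the
  // element size instead of assuming one byte per Char.
  ByteCount := Length(Source) * SizeOf(Char);
  Move(Source[1], Buffer[0], ByteCount);

  Writeln('Characters copied: ', Length(Buffer));
  Writeln('Bytes copied:      ', ByteCount);
  Writeln('Last character:    ', Buffer[High(Buffer)]);
end.
```

Older code that passed Length(Source) directly as the byte count would copy only half of the string once Char became two bytes wide.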

As you can imagine, this new string type has ramifications for existing code. With Unicode, it is no longer true that one Char represents one byte. In fact, it isn’t even always true that one character is equal to two bytes! As a result, you may have to make some adjustments to your code. However, we’ve worked very hard to make the transition a smooth one, and we are confident that you’ll be able to be up and running quite quickly. Parts II and III of this series will discuss the new UnicodeString type further, talk about some of the new features of the RTL that support Unicode enablement, and then discuss specific coding idioms that you’ll want to look for in your code. This series should help make your transition to Unicode a smooth and painless endeavor.

Conclusion

With the addition of Unicode as the default string, Delphi can accept, process, and display virtually any alphabet or code page in the world. Applications you build with Delphi 2009 will be able to accept, display, and handle Unicode text with ease, and they will work much better in almost any Windows locale. Delphi developers can now easily localize and translate their applications to enter markets that have previously been more difficult to enter. It’s a Unicode world out there, and now your Delphi apps can live in it.

In Part II, we’ll discuss the changes and updates to the Delphi Runtime Library that will enable you to work easily with Unicode strings.


