Delphi in a Unicode World Part II
By: Nick Hodges
Original link: http://dn.codegear.com/article/38498
Abstract: This article will cover the new features of the Tiburon Runtime Library that will help handle Unicode strings.
Introduction
In Part I, we saw how Unicode support is a huge benefit for Delphi developers by enabling communication with all character sets in the Unicode universe. We saw the basics of the UnicodeString type and how it will be used in Delphi.
In Part II, we’ll look at some of the new features of the Delphi Runtime Library that support Unicode and general string handling.
TCharacter Class
The Tiburon RTL includes a new class called TCharacter, which is found in the Character unit. It is a sealed class that consists entirely of static class functions. Developers should not create instances of TCharacter, but rather merely call its static class methods directly. Those class functions do a number of things, including:
- Convert characters to upper or lower case
- Determine whether a given character is of a certain type, i.e. is the character a letter, a number, a punctuation mark, etc.
TCharacter uses the standards set forth by the Unicode Consortium.
Developers can use the TCharacter class to do many things previously done with sets of chars. For instance, this code:
    uses
      Character;
    begin
      if MyChar in ['a'..'z', 'A'..'Z'] then
      begin
        ...
      end;
    end;
can be easily replaced with
    uses
      Character;
    begin
      if TCharacter.IsLetter(MyChar) then
      begin
        ...
      end;
    end;
The Character unit also contains a number of standalone functions that wrap up the functionality of each class function from TCharacter, so if you prefer a simple function call, the above can be written as:
    uses
      Character;
    begin
      if IsLetter(MyChar) then
      begin
        ...
      end;
    end;
Thus the TCharacter class can be used to do almost any manipulation or checking of characters that you might care to do.
In addition, TCharacter contains class methods to determine if a given character is a high or low surrogate of a surrogate pair.
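As a minimal sketch of the classification and surrogate checks described above, the console program below counts letters and surrogate pairs in a string. The helper names CountLetters and CountSurrogatePairs are hypothetical and exist only for this illustration; only the TCharacter class methods come from the Character unit.

```pascal
program CharacterSketch;

{$APPTYPE CONSOLE}

uses
  SysUtils, Character;

// Hypothetical helper: counts the UTF-16 elements of S that
// TCharacter classifies as letters.
function CountLetters(const S: string): Integer;
var
  I: Integer;
begin
  Result := 0;
  for I := 1 to Length(S) do
    if TCharacter.IsLetter(S[I]) then
      Inc(Result);
end;

// Hypothetical helper: counts positions where a high surrogate is
// immediately followed by a low surrogate.
function CountSurrogatePairs(const S: string): Integer;
var
  I: Integer;
begin
  Result := 0;
  for I := 1 to Length(S) - 1 do
    if TCharacter.IsHighSurrogate(S[I]) and
       TCharacter.IsLowSurrogate(S[I + 1]) then
      Inc(Result);
end;

begin
  // Six letters; the space and the digits are not counted.
  Writeln(CountLetters('Delphi 2009'));
  // #$D835#$DD04 is one character outside the Basic Multilingual
  // Plane, stored as a surrogate pair, so one pair is found.
  Writeln(CountSurrogatePairs('A' + #$D835#$DD04 + 'B'));
end.
```

Note that Length still reports two elements for the surrogate pair, which is why code that walks a string element by element may need these checks.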
TEncoding Class
The Tiburon RTL also includes a new class called TEncoding. Its purpose is to define a specific type of character encoding so that you can tell the VCL what type of encoding you want used in specific situations.
For instance, you may have a TStringList instance that contains text that you want to write out to a file. Previously, you would have written:
    begin
      ...
      MyStringList.SaveToFile('SomeFilename.txt');
      ...
    end;
and the file would have been written out using the default ANSI encoding. That code will still work fine: it will write out the file using ANSI string encoding as it always has. Now that Delphi supports Unicode string data, however, developers may want to write out string data using a specific encoding. Thus, SaveToFile (as well as LoadFromFile) now takes an optional second parameter that defines the encoding to be used:
    begin
      ...
      MyStringList.SaveToFile('SomeFilename.txt', TEncoding.Unicode);
      ...
    end;
Execute the above code and the file will be written out as a Unicode (UTF-16) encoded text file.
TEncoding will also convert a given set of bytes from one encoding to another, retrieve information about the bytes and/or characters in a given string or array of characters, convert any string into an array of byte (TBytes), and other functionality that you may need with regard to the specific encoding of a given string or array of chars.
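The byte-level conversions mentioned above might look like the following sketch. It assumes the GetBytes, GetByteCount, GetString, and Convert members of TEncoding as they appear in the SysUtils unit; the helper name ToUtf8Bytes is hypothetical.

```pascal
program EncodingSketch;

{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: takes the UTF-16 bytes of S and re-encodes
// them as UTF-8 using TEncoding.Convert.
function ToUtf8Bytes(const S: string): TBytes;
var
  Utf16Bytes: TBytes;
begin
  Utf16Bytes := TEncoding.Unicode.GetBytes(S);
  Result := TEncoding.Convert(TEncoding.Unicode, TEncoding.UTF8, Utf16Bytes);
end;

var
  S: string;
  Utf8Bytes: TBytes;
begin
  S := 'caf' + #$00E9;  // "cafe" with an acute accent on the final e
  Utf8Bytes := ToUtf8Bytes(S);

  // Four UTF-16 code units of two bytes each.
  Writeln(TEncoding.Unicode.GetByteCount(S));
  // Three ASCII bytes plus a two-byte sequence for U+00E9.
  Writeln(Length(Utf8Bytes));
  // Decoding the UTF-8 bytes yields the original string again.
  Writeln(TEncoding.UTF8.GetString(Utf8Bytes) = S);
end.
```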
The TEncoding class includes the following class properties that give you singleton access to a TEncoding instance of the given encoding:
    class property ASCII: TEncoding read GetASCII;
    class property BigEndianUnicode: TEncoding read GetBigEndianUnicode;
    class property Default: TEncoding read GetDefault;
    class property Unicode: TEncoding read GetUnicode;
    class property UTF7: TEncoding read GetUTF7;
    class property UTF8: TEncoding read GetUTF8;
The Default property refers to the ANSI active codepage. The Unicode property refers to UTF-16.
TEncoding also includes the following class function, which returns an instance of TEncoding with an affinity for the code page passed as its parameter:
    class function TEncoding.GetEncoding(CodePage: Integer): TEncoding;
In addition, it includes the following function, which returns the correct byte order mark (BOM) for the given encoding:
    function GetPreamble: TBytes;
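To make the preamble concrete, here is a small sketch that prints the BOM bytes of three of the standard encodings in hexadecimal. The helper name PreambleHex is hypothetical; the byte values shown in the comments are the standard Unicode byte order marks.

```pascal
program PreambleSketch;

{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: formats the preamble (BOM) of an encoding as
// space-separated hexadecimal byte values.
function PreambleHex(Encoding: TEncoding): string;
var
  Bytes: TBytes;
  I: Integer;
begin
  Result := '';
  Bytes := Encoding.GetPreamble;
  for I := 0 to Length(Bytes) - 1 do
  begin
    if I > 0 then
      Result := Result + ' ';
    Result := Result + IntToHex(Bytes[I], 2);
  end;
end;

begin
  Writeln(PreambleHex(TEncoding.UTF8));              // EF BB BF
  Writeln(PreambleHex(TEncoding.Unicode));           // FF FE
  Writeln(PreambleHex(TEncoding.BigEndianUnicode));  // FE FF
end.
```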
TEncoding is also interface compatible with the .NET class called Encoding.
TStringBuilder
The RTL now includes a class called TStringBuilder. Its purpose is revealed in its name: it is a class designed to “build up” strings. TStringBuilder contains a number of overloaded functions for adding, replacing, and inserting content into a given string. The string builder class makes it easy to create single strings out of a variety of different data types. All of the Append, Insert, and Replace functions return an instance of TStringBuilder, so they can easily be chained together to create a single string.
For example, you might choose to use a TStringBuilder in place of a complicated Format statement, and write the following code:
    procedure TForm86.Button2Click(Sender: TObject);
    var
      MyStringBuilder: TStringBuilder;
      Price: Double;
    begin
      MyStringBuilder := TStringBuilder.Create('');
      try
        Price := 1.49;
        Label1.Caption := MyStringBuilder.Append('The apples are $').Append(Price).
          Append(' a pound.').ToString;
      finally
        MyStringBuilder.Free;
      end;
    end;
TStringBuilder is also interface compatible with the .NET class called StringBuilder.
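Because Insert and Replace return the builder just as Append does, they can take part in the same chain. The following is a sketch only; the function name BuildGreeting and its parameters are hypothetical.

```pascal
program StringBuilderSketch;

{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: builds a message by chaining Append, Replace,
// and Insert on a single TStringBuilder instance.
function BuildGreeting(const Name: string; Count: Integer): string;
var
  SB: TStringBuilder;
begin
  SB := TStringBuilder.Create;
  try
    Result := SB.Append('Hello, ')
      .Append(Name)
      .Append('! You have ')
      .Append(Count)                 // Integer overload of Append
      .Append(' new messages.')
      .Replace('Hello', 'Welcome')   // replaces text already in the builder
      .Insert(0, '>> ')              // index 0 is the start of the buffer
      .ToString;
  finally
    SB.Free;
  end;
end;

begin
  // Prints: >> Welcome, Nick! You have 3 new messages.
  Writeln(BuildGreeting('Nick', 3));
end.
```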
Declaring New String Types
Tiburon’s compiler enables you to declare your own string type with an affinity for a given code page. Many code pages are available. (MSDN has a nice rundown of available code pages.) For instance, if you require a string type with an affinity for ANSI-Cyrillic, you can declare:
    type
      // The code page for ANSI-Cyrillic is 1251
      CyrillicString = type AnsiString(1251);
And the new String type will be a string with an affinity for the Cyrillic code page.
Additional RTL Support for Unicode
The RTL adds a number of routines that support the use of Unicode strings.
StringElementSize
StringElementSize returns the size in bytes of a single element (code unit) of a given string. Consider the following code:
    procedure TForm88.Button3Click(Sender: TObject);
    var
      A: AnsiString;
      U: UnicodeString;
    begin
      A := 'This is an AnsiString';
      Memo1.Lines.Add('The ElementSize for an AnsiString is: ' + IntToStr(StringElementSize(A)));
      U := 'This is a UnicodeString';
      Memo1.Lines.Add('The ElementSize for an UnicodeString is: ' + IntToStr(StringElementSize(U)));
    end;
The result of the code above will be:
    The ElementSize for an AnsiString is: 1
    The ElementSize for an UnicodeString is: 2
StringCodePage
StringCodePage will return the Word value that corresponds to the codepage for a given string.
Consider the following code:
    procedure TForm88.Button2Click(Sender: TObject);
    type
      // The code page for ANSI-Cyrillic is 1251
      CyrillicString = type AnsiString(1251);
    var
      A: AnsiString;
      U: UnicodeString;
      U8: UTF8String;
      C: CyrillicString;
    begin
      A := 'This is an AnsiString';
      Memo1.Lines.Add('AnsiString Codepage: ' + IntToStr(StringCodePage(A)));
      U := 'This is a UnicodeString';
      Memo1.Lines.Add('UnicodeString Codepage: ' + IntToStr(StringCodePage(U)));
      U8 := 'This is a UTF8string';
      Memo1.Lines.Add('UTF8string Codepage: ' + IntToStr(StringCodePage(U8)));
      C := 'This is a CyrillicString';
      Memo1.Lines.Add('CyrillicString Codepage: ' + IntToStr(StringCodePage(C)));
    end;
On a system whose active ANSI code page is 1252, the above code will result in the following output:
    AnsiString Codepage: 1252
    UnicodeString Codepage: 1200
    UTF8string Codepage: 65001
    CyrillicString Codepage: 1251
Other RTL Features for Unicode
There are a number of other routines for converting strings from one encoding to another, including:
- UnicodeStringToUCS4String
- UCS4StringToUnicodeString
- UnicodeToUtf8
- Utf8ToUnicode
In addition, the RTL declares a type called RawByteString, which is a string type with no encoding affiliated with it:
    RawByteString = type AnsiString($FFFF);
The purpose of the RawByteString type is to enable the passing of string data of any code page without doing any code page conversions. This is most useful for routines that do not care about a specific encoding, such as byte-oriented string searches. Normally, this means that parameters of routines that process strings without regard for the string's code page should be of type RawByteString. Declaring variables of type RawByteString should rarely, if ever, be done, as this can lead to undefined behavior and potential data loss.
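A byte-oriented routine of the kind described above could be sketched as follows. The function name CountByte is hypothetical; the point is only that the RawByteString parameter accepts an AnsiString and a UTF8String alike, and that the variables themselves keep their specific types.

```pascal
program RawByteSketch;

{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: counts how often a byte value occurs in the
// payload of S, whatever code page S carries.
function CountByte(const S: RawByteString; Value: Byte): Integer;
var
  I: Integer;
begin
  Result := 0;
  for I := 1 to Length(S) do
    if Ord(S[I]) = Value then
      Inc(Result);
end;

var
  A: AnsiString;
  U8: UTF8String;
begin
  A := 'banana';
  // UTF-8 payload: 63 61 66 C3 A9 ("caf" followed by U+00E9).
  U8 := UTF8Encode('caf' + #$00E9);

  Writeln(CountByte(A, Ord('a')));  // three occurrences of the byte $61
  Writeln(CountByte(U8, $C3));      // one lead byte of the two-byte sequence
end.
```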
In general, string types are assignment compatible with each other.
For instance:
MyUnicodeString := MyAnsiString;
will perform as expected – it will take the contents of the AnsiString and place them into a UnicodeString. You should in general be able to assign one string type to another, and the compiler will do the work needed to make the conversions, if possible.
Some conversions, however, can result in data loss, and one must watch out for this when moving from a string type that includes Unicode data to one that does not. For instance, you can assign a UnicodeString to an AnsiString, but if the UnicodeString contains characters that have no mapping in the active ANSI code page at runtime, those characters will be lost in the conversion. Consider the following code:
    procedure TForm88.Button4Click(Sender: TObject);
    var
      U: UnicodeString;
      A: AnsiString;
    begin
      U := 'This is a UnicodeString';
      A := U;
      Memo1.Lines.Add(A);
      U := 'Добро пожаловать в мир Юникода с использованием Дельфи 2009!!';
      A := U;
      Memo1.Lines.Add(A);
    end;
The output of the above when the current OS code page is 1252 is:
    This is a UnicodeString
    ????? ?????????? ? ??? ??????? ? ?????????????? ?????? 2009!!
As you can see, because Cyrillic characters have no mapping in Windows-1252, information was lost when assigning this UnicodeString to an AnsiString. Every character that the AnsiString's code page cannot represent was replaced by a question mark, so the result is gibberish.
SetCodePage
SetCodePage, declared in the System.pas unit as
    procedure SetCodePage(var S: AnsiString; CodePage: Word; Convert: Boolean);
is a new RTL procedure that sets a new code page for a given AnsiString. The optional Convert parameter determines if the payload itself of the string should be converted to the given code page. If the Convert parameter is False, then the code page for the string is merely altered. If the Convert parameter is True, then the payload of the passed string will be converted to the given code page.
SetCodePage should be used sparingly and with great care. Note that if Convert is set to False and the new code page doesn’t actually match the existing payload, then unpredictable results can occur. Also, if the existing data in the string is converted and the new code page doesn’t have a representation for a given original character, data loss can occur.
Getting TBytes from Strings
The RTL also includes a set of overloaded routines for extracting an array of bytes from a string. As we’ll see in Part III, it is recommended that you use TBytes rather than a string as a data buffer. The RTL makes this easy by providing overloaded versions of BytesOf() that take the different string types as a parameter.
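A brief sketch of BytesOf follows. It sticks to plain ASCII text so that the byte counts do not depend on the active code page. The comment about the UnicodeString overload converting through the default ANSI encoding, and the use of WideBytesOf to get the raw UTF-16 bytes instead, reflect the shipping SysUtils unit as best understood here and should be treated as assumptions.

```pascal
program BytesOfSketch;

{$APPTYPE CONSOLE}

uses
  SysUtils;

var
  A: AnsiString;
  U: UnicodeString;
  Bytes: TBytes;
begin
  A := 'Delphi';
  U := 'Delphi';

  // AnsiString: one byte per character.
  Bytes := BytesOf(A);
  Writeln(Length(Bytes));  // six bytes
  Writeln(Bytes[0]);       // 68, the byte value of 'D'

  // UnicodeString: assumed to be converted via the default ANSI
  // encoding, so ASCII text again yields one byte per character.
  Writeln(Length(BytesOf(U)));

  // WideBytesOf (assumed available in SysUtils) keeps the UTF-16
  // payload: two bytes per character for this text.
  Writeln(Length(WideBytesOf(U)));
end.
```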
Conclusion
Tiburon’s Runtime Library is now completely capable of supporting the new UnicodeString. It includes new classes and routines for handling, processing, and converting Unicode strings, for managing codepages, and for ensuring an easy migration from earlier versions.
In Part III, we’ll cover the specific code constructs that you’ll need to look out for in ensuring that your code is Unicode ready.