Chapter 2 – An Introduction to Unicode
·Unicode is an extension of the ASCII character encoding.
·Strict ASCII uses 7 bits per character, usually stored in an 8-bit byte, whereas Unicode uses a full 16 bits per character.
·This allows Unicode to represent all the letters, ideographs, and other symbols of the world's written languages that are likely to be used in computer communication.
·Unicode was initially intended to supplement ASCII and, with any luck, eventually replace it.
·The C programming language as formalized by ANSI inherently supports Unicode through its support of wide characters.
A brief history of character sets
Character set | Period introduced | Features
Morse code (telegraph) | 1838 to 1854 | ·Each letter in the alphabet corresponds to a series of short and long pulses (dots and dashes) ·No distinction between uppercase and lowercase letters, but numbers and punctuation marks have their own codes
Braille | 1821 to 1824 | ·Essentially a 6-bit code that encodes letters, common letter combinations, common words, and punctuation
Telex codes | Standardized in 1931 | ·5-bit codes that include letter shifts and figure shifts
BCDIC | | ·Binary-Coded Decimal Interchange Code
8-bit EBCDIC | 1960s |
ASCII | Originated in the late 1950s; finalized in 1967 | ·A total of 128 codes ·The 26 letter codes are contiguous ·The codes for the 10 digits are easily derived from the values of the digits
ANSI character set | 1985 |
Double-Byte Character Sets (DBCS) | | ·Support character sets for many languages ·Introduce the code-page concept ·Mix 1-byte and 2-byte characters, so they do not fit ASCII's one-byte-per-character model ·Insufficient and awkward
Unicode | | ·16-bit codes allow 65,536 characters to be represented ·Intended to be sufficient for all characters and ideographs ·Compatible with ASCII (the first 128 codes are the same) ·No ambiguity, because there is only one character set
Wide character set in C
ANSI C also supports multibyte character sets, and wide characters aren't necessarily Unicode.
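The char Data Type
The char data type occupies one byte. It is used like this:
char c = 'A';                /* 1 byte */
char *p = "Hello, World!";   /* the string takes 14 bytes (13 characters plus the terminating '\0'); the pointer needs its own storage */
char a[] = "Hello, World!";  /* sizeof(a) is 14 bytes, including the terminating '\0' */
char b[10];                  /* sizeof(b) is 10 bytes */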
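The sizes above can be checked with sizeof and strlen. The following is a minimal sketch; the function names are hypothetical and exist only for this illustration.

```c
#include <stddef.h>
#include <string.h>

/* Bytes occupied by an array initialised from the literal,
   including the terminating '\0'. */
size_t hello_array_bytes(void)
{
    char a[] = "Hello, World!";
    return sizeof(a);
}

/* Number of characters in the same string, not counting '\0'. */
size_t hello_length(void)
{
    const char *p = "Hello, World!";
    return strlen(p);
}

/* The size of an uninitialised array is fixed by its declaration. */
size_t fixed_array_bytes(void)
{
    char b[10];
    return sizeof(b);
}
```

hello_array_bytes() returns 14, hello_length() returns 13, and fixed_array_bytes() returns 10.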
Wide characters
The wide-character type in C is wchar_t, which is defined in <wchar.h>. Microsoft's compilers define it as a 16-bit type:
typedef unsigned short wchar_t;
We can use the following statements to define wide characters and wide strings:
wchar_t c = 'A';                 /* 2 bytes; equivalent to wchar_t c = L'A'; */
wchar_t *p = L"Hello, World!";   /* the string takes 28 bytes (14 wide characters, including the terminator) */
wchar_t a[] = L"Hello, World!";  /* sizeof(a) is 28 bytes, including the terminating L'\0' */
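The byte counts quoted for wide strings assume a 2-byte wchar_t, as on Windows. On many other platforms wchar_t is 4 bytes, so a portable check is better written in terms of sizeof(wchar_t). The sketch below uses hypothetical function names.

```c
#include <stddef.h>
#include <wchar.h>

/* Bytes occupied by the wide array, including the terminating L'\0'.
   This is 28 when wchar_t is 2 bytes wide. */
size_t wide_array_bytes(void)
{
    wchar_t a[] = L"Hello, World!";
    return sizeof(a);
}

/* Number of wchar_t elements: 13 characters plus the terminator. */
size_t wide_array_elements(void)
{
    wchar_t a[] = L"Hello, World!";
    return sizeof(a) / sizeof(a[0]);
}
```

Dividing sizeof(a) by sizeof(a[0]) gives the element count regardless of how wide wchar_t is on the target platform.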
Wide-character library functions
The original char functions are used as shown below:
char *pc = "Hello!";
wchar_t *pw = L"Hello!";
int iLength = strlen(pc);
The call iLength = strlen(pw) is a type mismatch: strlen() is declared as strlen(const char *), while pw is a wchar_t * (that is, an unsigned short *). The compiler reports an error or a warning for this statement.
How the string is stored in memory:
The 6 characters of the character string "Hello!" have the 16-bit values:
0x0048 0x0065 0x006C 0x006C 0x006F 0x0021
and an Intel processor stores them in memory, low byte first, as:
48 00 65 00 6C 00 6C 00 6F 00 21 00
If iLength = strlen(pw) were compiled anyway, iLength would be set to 1, because strlen() stops at the first zero byte, which immediately follows 0x48.
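The result of 1 can be reproduced on any platform by writing out the byte layout by hand and giving it to strlen. This is only a sketch: the array and function names are hypothetical, and the bytes are typed in explicitly so the example does not depend on the size of wchar_t.

```c
#include <stddef.h>
#include <string.h>

/* "Hello!" as 16-bit values stored low byte first, followed by a
   16-bit zero terminator, written out byte by byte. */
static const unsigned char hello_16bit_le[] = {
    0x48, 0x00, 0x65, 0x00, 0x6C, 0x00, 0x6C, 0x00,
    0x6F, 0x00, 0x21, 0x00, 0x00, 0x00
};

/* What strlen() reports when it is handed the wide string's bytes:
   it stops at the zero byte right after 0x48. */
size_t bytes_seen_by_strlen(void)
{
    return strlen((const char *)hello_16bit_le);
}

/* Count 16-bit units up to the 16-bit zero terminator. */
size_t units_before_terminator(void)
{
    size_t n = 0;
    while (hello_16bit_le[2 * n] != 0 || hello_16bit_le[2 * n + 1] != 0)
        n++;
    return n;
}
```

bytes_seen_by_strlen() returns 1, while units_before_terminator() returns 6, the length a wide-character function would report.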
Wide-character functions in C
The 1-byte character functions have wide-character alternatives that take wchar_t arguments. These functions are declared both in <wchar.h> and in the header file where the normal function is declared.
1-byte char functions | Wide-character functions
strlen(const char *) | wcslen(const wchar_t *)
printf(const char *, …) | wprintf(const wchar_t *, …)
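The pairs in the table can be used side by side. The sketch below wraps each pair in a small function; the wrapper names are hypothetical, and only strlen, wcslen, and wprintf come from the standard library.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Length of a 1-byte character string. */
size_t narrow_length(const char *s)
{
    return strlen(s);
}

/* Length of a wide-character string, counted in wchar_t units. */
size_t wide_length(const wchar_t *s)
{
    return wcslen(s);
}

/* Print a wide string and its length with wprintf.
   Returns whatever wprintf returns. */
int print_wide(const wchar_t *s)
{
    return wprintf(L"%ls: %lu characters\n", s, (unsigned long)wcslen(s));
}
```

Both narrow_length("Hello!") and wide_length(L"Hello!") return 6, even though the two strings occupy different numbers of bytes.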
Maintaining a single source
·One obvious approach is to keep two versions of the source code: one compiled for ASCII char strings and the other compiled for wide characters.
·A better approach is to keep a single version of the source and use the <TCHAR.H> header file, which Microsoft supplies with Visual C++. It is not part of the ANSI C standard.
How to use TCHAR.H?
There are some very useful definitions in TCHAR.H :
#ifdef _UNICODE
typedef wchar_t TCHAR;
#define __T(x) L##x
#define _tcslen wcslen
#else
typedef char TCHAR;
#define __T(x) x
#define _tcslen strlen
#endif /* _UNICODE */
#define _T(x) __T(x)
#define _TEXT(x) __T(x)
So we can call _tcslen on a string whether it consists of char or wide characters. The preprocessor maps it to wcslen or strlen automatically. To use the wide-character functions in our program, we only need to pass the option "-D _UNICODE" to the compiler.
We can make declarations like this:
TCHAR *pstr = _TEXT("Hello, World!");
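Because TCHAR.H is specific to Microsoft's compiler, the mechanism can be tried elsewhere with stand-in definitions. The sketch below is an assumption-laden imitation, not the real header: MY_UNICODE, MY_TCHAR, MY_T, and my_tcslen are hypothetical names that mirror _UNICODE, TCHAR, _T, and _tcslen.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical stand-ins for the TCHAR.H definitions, so the sketch
   builds with any C compiler. Define MY_UNICODE for the wide build. */
#ifdef MY_UNICODE
typedef wchar_t MY_TCHAR;
#define MY_T_(x) L##x
#define my_tcslen wcslen
#else
typedef char MY_TCHAR;
#define MY_T_(x) x
#define my_tcslen strlen
#endif
#define MY_T(x) MY_T_(x)

/* The same source lines work in both builds. */
size_t greeting_length(void)
{
    const MY_TCHAR *pstr = MY_T("Hello, World!");
    return my_tcslen(pstr);
}

/* Size in bytes of one character in the current build. */
size_t char_size(void)
{
    return sizeof(MY_TCHAR);
}
```

greeting_length() returns 13 in either build. Only char_size() changes, from 1 to sizeof(wchar_t), when the file is compiled with -DMY_UNICODE.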
Wide Characters and Windows
Windows NT supports both the ASCII character set and Unicode, so it accepts both 8-bit and 16-bit character strings.
Windows 98 has much less Unicode support than Windows NT. Only a few Windows 98 function calls support wide-character strings.
Windows Header File Types
A Windows program includes the header file WINDOWS.H. This file includes a number of other header files, including WINDEF.H, which has many of the basic type definitions used in Windows and which itself includes WINNT.H. WINNT.H handles the basic Unicode support.
WINNT.H defines some new data types and useful macros.
These definitions let you mix ASCII and Unicode character strings in the same program, or write a single program that can be compiled for either ASCII or Unicode:
typedef char CHAR ;
typedef wchar_t WCHAR ; // wc
typedef CHAR * PCHAR, * LPCH, * PCH, * NPSTR, * LPSTR, * PSTR ;
typedef CONST CHAR * LPCCH, * PCCH, * LPCSTR, * PCSTR ;
typedef WCHAR * PWCHAR, * LPWCH, * PWCH, * NWPSTR, * LPWSTR, * PWSTR ;
typedef CONST WCHAR * LPCWCH, * PCWCH, * LPCWSTR, * PCWSTR ;
#ifdef UNICODE
typedef WCHAR TCHAR, * PTCHAR ;
typedef LPWSTR LPTCH, PTCH, PTSTR, LPTSTR ;
typedef LPCWSTR LPCTSTR ;
#define __TEXT(quote) L##quote
#else
typedef char TCHAR, * PTCHAR ;
typedef LPSTR LPTCH, PTCH, PTSTR, LPTSTR ;
typedef LPCSTR LPCTSTR ;
#define __TEXT(quote) quote
#endif
#define TEXT(quote) __TEXT(quote)
8-bit character variables and strings | use CHAR, PCHAR (or one of the others)
Explicit 16-bit character variables and strings | use WCHAR, PWCHAR, and put an L before quotation marks
8-bit or 16-bit, depending on whether the UNICODE identifier is defined | use TCHAR, PTCHAR, and the TEXT macro
Windows' String Functions
Microsoft C includes wide-character and generic versions of all C run-time library functions that require character string arguments. Windows also provides its own set of string functions:
iLength = lstrlen (pString) ;
pString = lstrcpy (pString1, pString2) ;
pString = lstrcpyn (pString1, pString2, iCount) ;
pString = lstrcat (pString1, pString2) ;
iComp = lstrcmp (pString1, pString2) ;
iComp = lstrcmpi (pString1, pString2) ;
These work much the same as their C library equivalents. They accept wide-character strings if the UNICODE identifier is defined and regular strings if not.
Using printf in Windows
The printf() function cannot be used in a graphical Windows program, because such a program has no standard output to write to.
Use fprintf() to write output to files.
Use sprintf() to format a string, which can then be passed to MessageBox().
char szBuffer [100] ;
sprintf (szBuffer, "The sum of %i and %i is %i", 5, 3, 5+3) ;
puts (szBuffer) ;
sprintf itself could be implemented in terms of vsprintf like this:
int sprintf (char * szBuffer, const char * szFormat, ...)
{
     int     iReturn ;
     va_list pArgs ;
     va_start (pArgs, szFormat) ;
     iReturn = vsprintf (szBuffer, szFormat, pArgs) ;
     va_end (pArgs) ;
     return iReturn ;
}
The va_start macro sets pArgs to point to the variable on the stack right above the szFormat argument on the stack.
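The same pattern, forwarding a variable argument list to a v-function, works for any formatting helper of your own. The sketch below is a hypothetical example: format_into and sum_message are made-up names, and it swaps in the standard vsnprintf so that the buffer size is respected.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: format into a caller-supplied buffer of known
   size, forwarding the variable arguments to vsnprintf. */
int format_into(char *buffer, size_t size, const char *format, ...)
{
    int     result;
    va_list args;

    va_start(args, format);     /* args now refers to the arguments after format */
    result = vsnprintf(buffer, size, format, args);
    va_end(args);

    return result;              /* length the full output needs, excluding '\0' */
}

/* Example call that formats the sentence used in the text above. */
const char *sum_message(void)
{
    static char buffer[100];
    format_into(buffer, sizeof buffer, "The sum of %i and %i is %i", 5, 3, 5 + 3);
    return buffer;
}
```

sum_message() returns the string "The sum of 5 and 3 is 8", which could then be handed to any function that expects a finished string.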
Version | ASCII | Wide-Character | Generic
Variable Number of Arguments | | |
Standard Version | sprintf | swprintf | _stprintf
Max-Length Version | _snprintf | _snwprintf | _sntprintf
Windows Version | wsprintfA | wsprintfW | wsprintf
Pointer to Array of Arguments | | |
Standard Version | vsprintf | vswprintf | _vstprintf
Max-Length Version | _vsnprintf | _vsnwprintf | _vsntprintf
Windows Version | wvsprintfA | wvsprintfW | wvsprintf
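To see the ASCII and wide-character columns side by side without Microsoft's underscore-prefixed names, the sketch below uses the standard C functions snprintf and swprintf, both of which take a maximum length. This is a substitution made for portability, not the functions listed in the table, and the wrapper names are hypothetical.

```c
#include <stddef.h>
#include <stdio.h>
#include <wchar.h>

/* Narrow, length-limited formatting with the standard snprintf. */
const char *narrow_sum(void)
{
    static char buffer[32];
    snprintf(buffer, sizeof buffer, "%i + %i = %i", 5, 3, 5 + 3);
    return buffer;
}

/* Wide formatting with the standard swprintf, whose second argument
   is the buffer size counted in wide characters, not bytes. */
const wchar_t *wide_sum(void)
{
    static wchar_t buffer[32];
    swprintf(buffer, sizeof buffer / sizeof buffer[0],
             L"%i + %i = %i", 5, 3, 5 + 3);
    return buffer;
}
```

Both functions produce the 9-character text "5 + 3 = 8". The wide version divides by sizeof buffer[0] for the same reason the SCRNSIZE program divides by sizeof (TCHAR): the limit is a character count.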
A Formatting Message Box
SCRNSIZE.C
#include <windows.h>
#include <tchar.h>
#include <stdio.h>

int CDECL MessageBoxPrintf (TCHAR * szCaption, TCHAR * szFormat, ...)
{
     TCHAR   szBuffer [1024] ;
     va_list pArgList ;

     // The va_start macro (defined in STDARG.H) is usually equivalent to:
     // pArgList = (char *) &szFormat + sizeof (szFormat) ;

     va_start (pArgList, szFormat) ;

     // The last argument to _vsntprintf points to the arguments

     _vsntprintf (szBuffer, sizeof (szBuffer) / sizeof (TCHAR),
                  szFormat, pArgList) ;

     // The va_end macro just zeroes out pArgList for no good reason

     va_end (pArgList) ;

     return MessageBox (NULL, szBuffer, szCaption, 0) ;
}

int WINAPI WinMain (HINSTANCE hInstance, HINSTANCE hPrevInstance,
                    PSTR szCmdLine, int iCmdShow)
{
     int cxScreen, cyScreen ;

     cxScreen = GetSystemMetrics (SM_CXSCREEN) ;
     cyScreen = GetSystemMetrics (SM_CYSCREEN) ;

     MessageBoxPrintf (TEXT ("ScrnSize"),
                       TEXT ("The screen is %i pixels wide by %i pixels high."),
                       cxScreen, cyScreen) ;
     return 0 ;
}