Oracle Character Set

Character Set Encoding

A code point (also called a code value) is the numeric encoding assigned to a character.

A group of characters (for example, alphabetic characters, ideographs, symbols, punctuation marks, and control characters) can be encoded as a character set. An encoded character set assigns a unique numeric code to each character in the character set. The numeric codes are called code points or encoded values. 
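As a quick illustration (using Python's Unicode functions, not an Oracle API), `ord()` returns the code point assigned to a character and `chr()` maps a code point back to its character:

```python
# A code point is the numeric value an encoded character set
# assigns to a character. Python exposes Unicode code points directly.
print(ord("A"))    # 65: code point of uppercase A
print(ord("中"))   # 20013 (U+4E2D): code point of a CJK ideograph
print(chr(0x41))   # "A": back from code point to character
```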

A single character set can support multiple languages, but it is limited by its character repertoire.

Different character sets support different character repertoires. Because character sets are typically based on a particular writing script, they can support multiple languages. When character sets were first developed, they had a limited character repertoire. Even now there can be problems using certain characters across platforms.

The characters below can be represented in any Oracle database character set; if you use characters outside this set, check whether your chosen database character set supports them.

The following CHAR and VARCHAR characters are represented in all Oracle Database character sets and can be transported to any platform:

  1. Uppercase and lowercase English characters A through Z and a through z
  2. Arabic digits 0 through 9
  3. The following punctuation marks: % ' ( ) * + - , . / \ : ; < > = ! _ & ~ { } | ^ ? $ # @ " [ ]
  4. The following control characters: space, horizontal tab, vertical tab, form feed

If you are using characters outside this set, then take care that your data is supported in the database character set that you have chosen.

How are Characters Encoded?

Single-Byte Encoding Schemes

每个字符均使用1byte存储

Single-byte encoding schemes are efficient. They take up the least amount of space to represent characters and are easy to process and program with because one character can be represented in one byte.

Single-byte encoding schemes are classified as one of the following types:

  1. 7-bit encoding schemes

Single-byte 7-bit encoding schemes can define up to 128 characters and normally support just one language. One of the most common single-byte character sets, used since the early days of computing, is ASCII (American Standard Code for Information Interchange).

  2. 8-bit encoding schemes

Single-byte 8-bit encoding schemes can define up to 256 characters and often support a group of related languages. One example is ISO 8859-1, which supports many Western European languages. The following figure shows the ISO 8859-1 8-bit encoding scheme.
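A small Python sketch (illustrative only, using the standard ASCII and Latin-1 codecs) shows the two repertoire sizes: a 7-bit set covers code points 0-127, while an 8-bit set such as ISO 8859-1 covers 0-255 and includes Western European letters like ä:

```python
# ASCII is a 7-bit set: at most 128 characters.
ascii_repertoire = [chr(i) for i in range(128)]
print(len(ascii_repertoire))          # 128

# ISO 8859-1 (Latin-1) is an 8-bit set: up to 256 characters,
# including Western European letters such as ä (code 0xE4).
print("ä".encode("iso-8859-1"))       # b'\xe4'
print(len("ä".encode("iso-8859-1")))  # 1 byte per character
```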

Multibyte Encoding Schemes

Multibyte encoding schemes are used in Asian languages like Chinese or Japanese because these languages use thousands of characters. These encoding schemes use either a fixed number or a variable number of bytes to represent each character.

  1. Fixed-width multibyte encoding schemes

In a fixed-width multibyte encoding scheme, each character is represented by a fixed number of bytes. The number of bytes is at least two in a multibyte encoding scheme.

  2. Variable-width multibyte encoding schemes

A variable-width encoding scheme uses one or more bytes to represent a single character. Some multibyte encoding schemes use certain bits to indicate the number of bytes that represents a character. For example, if two bytes is the maximum number of bytes used to represent a character, then the most significant bit can be used to indicate whether that byte is a single-byte character or the first byte of a double-byte character.
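The most-significant-bit test described above can be sketched in Python. This models a hypothetical two-byte variable-width scheme, not any specific Oracle character set:

```python
def char_byte_lengths(data: bytes):
    """Walk a byte string in a toy variable-width scheme: if the most
    significant bit of a byte is 0, it is a single-byte character;
    if it is 1, it is the first byte of a double-byte character."""
    lengths = []
    i = 0
    while i < len(data):
        if data[i] & 0x80:      # high bit set -> lead byte of a pair
            lengths.append(2)
            i += 2
        else:                   # high bit clear -> single-byte character
            lengths.append(1)
            i += 1
    return lengths

print(char_byte_lengths(b"\x41\x8f\x42\x43"))  # [1, 2, 1]
```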

  3. Shift-sensitive variable-width multibyte encoding schemes

Some variable-width encoding schemes use control codes to differentiate between single-byte and multibyte characters with the same code values. A shift-out code indicates that the following character is multibyte. A shift-in code indicates that the following character is single-byte. Shift-sensitive encoding schemes are used primarily on IBM platforms. Note that ISO-2022 character sets cannot be used as database character sets, but they can be used for applications such as a mail server.
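A toy decoder (hypothetical code values, not a real IBM or ISO-2022 character set) makes the shift-out/shift-in mechanism concrete:

```python
SO, SI = 0x0E, 0x0F  # shift-out / shift-in control codes

def count_chars(data: bytes) -> int:
    """Count characters in a toy shift-sensitive encoding: after a
    shift-out code, characters are double-byte; after a shift-in
    code, they are single-byte again (the default)."""
    count, i, double = 0, 0, False
    while i < len(data):
        b = data[i]
        if b == SO:
            double = True
            i += 1
        elif b == SI:
            double = False
            i += 1
        else:
            count += 1
            i += 2 if double else 1
    return count

# "AB", then two double-byte characters, then "C"
print(count_chars(b"AB\x0e\x41\x42\x43\x44\x0fC"))  # 5
```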

Naming Convention for Oracle Database Character Sets

Oracle Database uses the following naming convention for its character set names:

<region><number of bits used to represent a character><standard character set name>[S|C]

The optional S or C suffix distinguishes character sets that can be used only on the server (S) or only on the client (C).

Keep in mind that:

  1. You should use the server character set (S) on the Macintosh platform. The Macintosh client character sets are obsolete. On EBCDIC platforms, use the server character set (S) on the server and the client character set (C) on the client.
  2. UTF8 and UTFE are exceptions to the naming convention.

Examples of Oracle Database character set names include US7ASCII (United States, 7-bit, ASCII) and WE8ISO8859P1 (Western European, 8-bit, ISO 8859 Part 1).

Subsets and Supersets

The terms subset and superset, used without the adjective "binary", are defined as follows:

Character set A is a superset of character set B if A supports all characters that B supports. Character set B is a subset of character set A if A is a superset of B.

The terms binary subset and binary superset are defined as follows:

Character set A is a binary superset of character set B if A supports all characters that B supports and all these characters have the same binary representation in A and B. Character set B is a binary subset of character set A if A is a binary superset of B.

When character set A is a binary superset of character set B, any text value encoded in B is at the same time valid in A without need for character set conversion. When A is a non-binary superset of B, a text value encoded in B can be represented in A without loss of data but may require character set conversion to transform the binary representation.

Oracle Database does not maintain a list of all subset-superset pairs, but it does maintain a list of binary subset-superset pairs that it recognizes in various situations, such as checking compatibility of a transportable tablespace or a pluggable database.
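Python's standard codecs illustrate the distinction (ASCII, Latin-1, and UTF-8 here stand in for character sets like US7ASCII, WE8ISO8859P1, and AL32UTF8): ASCII is a binary subset of UTF-8, so ASCII-encoded bytes are already valid UTF-8, whereas Latin-1 is only a non-binary subset and requires conversion of the byte representation:

```python
text = "café"

# Binary subset: every ASCII-encoded byte sequence is already valid
# UTF-8, so no conversion is needed.
ascii_bytes = "cafe".encode("ascii")
assert ascii_bytes.decode("utf-8") == "cafe"

# Non-binary superset: the same text is representable without loss,
# but the byte representation must be converted.
latin1_bytes = text.encode("iso-8859-1")   # b'caf\xe9'
utf8_bytes = text.encode("utf-8")          # b'caf\xc3\xa9'
print(latin1_bytes != utf8_bytes)          # True: bytes differ
print(latin1_bytes.decode("iso-8859-1").encode("utf-8") == utf8_bytes)  # True
```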

Length Semantics

In single-byte character sets, the number of bytes and the number of characters in a string are the same. In multibyte character sets, a character or code point consists of one or more bytes. Calculating the number of characters based on byte lengths can be difficult in a variable-width character set. Calculating column lengths in bytes is called byte semantics, while measuring column lengths in characters is called character semantics.

Character semantics is useful for defining the storage requirements for multibyte strings of varying widths.

The following expressions use byte semantics:

VARCHAR2(20 BYTE)

SUBSTRB(string, 1, 20)

Note the BYTE qualifier in the VARCHAR2 expression and the B suffix in the SQL function name.

The following expressions use character semantics:

VARCHAR2(10 CHAR)

SUBSTR(string, 1, 10)

Note the CHAR qualifier in the VARCHAR2 expression.
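The difference between the two measurements matters only for multibyte data. A Python analogy (not Oracle SQL) of byte versus character semantics:

```python
s = "日本語テキスト"          # 7 characters of Japanese text
print(len(s))                  # character semantics: 7
print(len(s.encode("utf-8")))  # byte semantics: 21 (3 bytes each in UTF-8)
```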

The length semantics of character data type columns, user-defined type attributes, and PL/SQL variables can be specified explicitly in their definitions with the BYTE or CHAR qualifier. 

If no qualifier is specified, the value of the NLS_LENGTH_SEMANTICS initialization parameter is used; its default value is BYTE.

If you create database objects with legacy scripts that are too large and complex to be updated to include explicit BYTE and/or CHAR qualifiers, execute an explicit ALTER SESSION SET NLS_LENGTH_SEMANTICS statement before running each of the scripts to assure the scripts create objects in the expected semantics.

Note: you cannot specify CHAR or BYTE for NCHAR data types; NCHAR always uses character semantics.

Byte semantics is the default for the database character set. Character length semantics is the default and the only allowable kind of length semantics for NCHAR data types.

The user cannot specify the CHAR or BYTE qualifier for NCHAR definitions.

Consider the following example:

CREATE TABLE employees
( employee_id   NUMBER(4)
, last_name     NVARCHAR2(10)
, job_id        NVARCHAR2(9)
, manager_id    NUMBER(4)
, hire_date     DATE
, salary        NUMBER(7,2)
, department_id NUMBER(2)
);

Oracle Database Character Set

Oracle has two character sets, the database character set and the national character set; both are specified when the database is created.

The database character set is used for the following data:

  1. Data stored in SQL CHAR data types (CHAR, VARCHAR2, CLOB, and LONG)
  2. Identifiers such as table names, column names, and PL/SQL variables
  3. Entering and storing SQL and PL/SQL source code

All SQL CHAR data type columns (CHAR, CLOB, VARCHAR2, and LONG), including columns in the data dictionary, have their data stored in the database character set. 

The national character set is used for the following data:

SQL NCHAR data type columns (NCHAR, NCLOB, and NVARCHAR2) 

Database Character Sets

The database character set must be able to represent SQL and PL/SQL source code, so it must contain EBCDIC or ASCII as a subset. Consequently, a fixed-width multibyte character set cannot serve as the database character set, which is why AL16UTF16 cannot be used as the database character set.

The database character set is used to identify SQL and PL/SQL source code. In order to do this, it must have either EBCDIC or 7-bit ASCII as a subset, whichever is native to the platform. Therefore, it is not possible to use a fixed-width, multibyte character set as the database character set. Currently, only the AL16UTF16 character set cannot be used as a database character set.

The following table lists the restrictions on the character sets that can be used to express names.

Starting with 11g Release 1, installing a database through OUI or DBCA lists only the recommended character sets, though the custom installation path still allows a non-recommended character set. Oracle discourages this, as non-recommended character sets may be desupported in a later release.

Oracle maintains a list of character sets that it strongly recommends for use as the database character set. Other Oracle-supported character sets that do not appear on this list can continue to be used in Oracle Database 12c, but may be desupported in a future release.

Starting with Oracle Database 11g Release 1, the choice for the database character set is limited to this list of recommended character sets in common installation paths of Oracle Universal Installer and Oracle Database Configuration Assistant.

Customers are still able to create new databases using custom installation paths and migrate their existing databases even if the character set is not on the recommended list. However, Oracle suggests that customers migrate to a recommended character set as soon as possible.

Starting with 12c Release 2, the default database character set is AL32UTF8; Oracle recommends not changing it.

Starting from Oracle Database 12c Release 2, if you use Oracle Universal Installer or Oracle Database Configuration Assistant (DBCA) to create a database, then the default database character set used is AL32UTF8. 

The AL32UTF8 character set is Oracle's implementation of the industry standard UTF-8 encoding, which supports most of the written languages of the world. Making the AL32UTF8 character set the default character set for new database deployments enables the database to support multilingual globalized applications out-of-the-box.

Oracle recommends using Unicode for all new system deployments. Migrating legacy systems to Unicode is also recommended. Deploying your systems today in Unicode offers many advantages in usability, compatibility, and extensibility. Oracle Database enables you to deploy high-performing systems faster and more easily while utilizing the advantages of Unicode. Even if you do not need to support multilingual data today, nor have any requirement for Unicode, it is still likely to be the best choice for a new system in the long run and will ultimately save you time and money as well as give you competitive advantages in the long term.

National Character Set

The national character set is mainly used to store Unicode data in a database whose database character set is not Unicode (the NCHAR, NVARCHAR2, and NCLOB data types support Unicode data only).

Originally the national character set was an additional Oracle character set intended to improve support for Asian fixed-width multibyte encodings. Now that AL32UTF8 is the recommended database character set, the national character set is rarely needed; its default is AL16UTF16.

The term national character set refers to an alternative character set that enables you to store Unicode character data in a database that does not have a Unicode database character set. Another reason for choosing a national character set is that the properties of a different character encoding scheme may be more desirable for extensive character processing operations.

SQL NCHAR, NVARCHAR2, and NCLOB data types support Unicode data only. You can use either the UTF8 or the AL16UTF16 character set. The default is AL16UTF16.

Oracle no longer recommends the NCHAR, NVARCHAR2, and NCLOB data types, because some database features do not support them; in particular, Oracle Text and XML DB do not support these types.

Oracle recommends using SQL CHAR, VARCHAR2, and CLOB data types in AL32UTF8 database to store Unicode character data. SQL NCHAR, NVARCHAR2, and NCLOB data types are not supported by some database features. Most notably, Oracle Text and XML DB do not support these data types.

Use of SQL NCHAR, NVARCHAR2, and NCLOB should be considered only if you must use a database whose database character set is not AL32UTF8.

For the NCHAR, NVARCHAR2, and NCLOB types, the length unit is always characters, never bytes.

When you use NCHAR and NVARCHAR2 data types for storing multilingual data, the column size specified for a column is defined in number of characters. (This number of characters means the number of encoded Unicode code points, except that supplementary Unicode characters represented through surrogate pairs count as two characters.)
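The surrogate-pair rule can be checked against UTF-16 directly. This is a Python sketch; AL16UTF16 length accounting corresponds to the UTF-16 code unit count:

```python
# 𝄞 (MUSICAL SYMBOL G CLEF, U+1D11E) is a supplementary character.
ch = "\U0001d11e"
utf16 = ch.encode("utf-16-be")
print(len(utf16))        # 4 bytes: a surrogate pair
print(len(utf16) // 2)   # 2 UTF-16 code units -> counts as 2 characters
```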

Table 6-2 Maximum Data Type Size for the AL16UTF16 and UTF8 National Character Sets

National Character Set | NCHAR Maximum Column Size | NVARCHAR2 Maximum (MAX_STRING_SIZE = STANDARD) | NVARCHAR2 Maximum (MAX_STRING_SIZE = EXTENDED)
AL16UTF16              | 1000 characters           | 2000 characters                                | 16383 characters
UTF8                   | 2000 characters           | 4000 characters                                | 32767 characters

The maximum character counts above are constraints, not guaranteed capacity; the actual capacity of these types is limited by the byte sizes given below.

This maximum size in characters is a constraint, not guaranteed capacity of the data type. The maximum capacity is expressed in bytes.

For the NCHAR data type, the maximum capacity is 2000 bytes. 

For NVARCHAR2, it is 4000 bytes if the initialization parameter MAX_STRING_SIZE is set to STANDARD, and 32767 bytes if MAX_STRING_SIZE is set to EXTENDED.

When the national character set is AL16UTF16, the maximum number of characters never occupies more bytes than the maximum capacity, as each character (in an Oracle sense) occupies exactly 2 bytes. However, if the national character set is UTF8, the maximum number of characters can be stored only if all these characters are from the Unicode Basic Latin range, which corresponds to the ASCII standard.

Other Unicode characters occupy more than one byte each in UTF8 and presence of such characters in a 4000 character string makes the string longer than the maximum 4000 bytes. If you want national character set columns to be able to hold the declared number of characters in any national character set, do not declare NCHAR columns longer than 2000/3=666 characters and NVARCHAR2 columns longer than 4000/3=1333 or 32767/3=10922 characters, depending on the MAX_STRING_SIZE initialization parameter.
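The divide-by-three arithmetic follows from UTF-8 byte widths, as a Python sketch shows (BMP characters outside ASCII take up to 3 bytes in the UTF8 character set):

```python
# A CJK character occupies 3 bytes in UTF-8, so sizing columns for
# the worst case divides the byte capacity by 3.
print(len("中".encode("utf-8")))   # 3 bytes
print(2000 // 3)                   # 666  (NCHAR worst case)
print(4000 // 3)                   # 1333 (NVARCHAR2, STANDARD)
print(32767 // 3)                  # 10922 (NVARCHAR2, EXTENDED)
```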

Scenario 1: Unicode Solution with a Unicode Standard-Enabled Database

An American company running a Java application would like to add German and French support in the next release of the application. They would like to add Japanese support at a later time. The company currently has the following system configuration:

  1. The existing database has a database character set of US7ASCII.
  2. All character data in the existing database is composed of ASCII characters.
  3. PL/SQL stored procedures are used in the database.
  4. The database is about 300 GB, with very little data stored in CLOB columns.
  5. There is a nightly downtime of 4 hours.

In this case, a typical solution is to choose AL32UTF8 for the database character set because of the following reasons:

  1. The database is very large and the scheduled downtime is short. Fast migration of the database to a Unicode character set is vital. Because the database is in US7ASCII, the easiest and fastest way of enabling the database to support the Unicode Standard is to switch the database character set to AL32UTF8 by using the Database Migration Assistant for Unicode (DMU). No data conversion is required for columns other than CLOB because US7ASCII is a subset of AL32UTF8.
  2. Because most of the code is written in Java and PL/SQL, changing the database character set to AL32UTF8 is unlikely to break existing code. Unicode support is automatically enabled in the application.

Scenario 2: Unicode Solution with Unicode Data Types

A European company that runs its legacy applications mainly on Windows platforms wants to add a new small Windows application written in Visual C/C++. The new application will use the existing database to support Japanese and Chinese customer names. The company currently has the following system configuration:

  1. The existing database has a database character set of WE8MSWIN1252.
  2. All character data in the existing database is composed of Western European characters.
  3. The database is around 500 GB with a lot of CLOB columns.
  4. Support for full-text search and XML storage is not required in the new application.

A typical solution is to take the following actions:

  1. Use NCHAR and NVARCHAR2 data types to store Unicode characters
  2. Keep WE8MSWIN1252 as the database character set
  3. Use AL16UTF16 as the national character set

The reasons for this solution are:

  1. Migrating the existing database to a Unicode database requires data conversion because the database character set is WE8MSWIN1252 (a Windows Latin-1 character set), which is not a subset of AL32UTF8. Also, a lot of data is stored in CLOB columns. All CLOB values in a database, even if they contain only ASCII characters, must be converted when migrating from a single-byte database character set, such as US7ASCII or WE8MSWIN1252 to AL32UTF8. As a result, there will be a lot of overhead in converting the data to AL32UTF8.
  2. The additional languages are supported in the new application only. It does not depend on the existing applications or schemas. It is simpler to use the Unicode data type in the new schema and keep the existing schemas unchanged.
  3. Only customer name columns require Unicode character set support. Using a single NCHAR column meets the customer's requirements without migrating the entire database.
  4. The new application does not need database features that do not support SQL NCHAR data types.
  5. The lengths of the SQL NCHAR data types are defined as number of characters. This is the same as how they are treated when using wchar_t strings in Windows C/C++ programs. This reduces programming complexity.
  6. Existing applications using the existing schemas are unaffected.

Summary of Supported Data Types

The following table lists the data types that are supported for different encoding schemes.

BLOBs process characters as a series of byte sequences. The data is not subject to any NLS-sensitive operations.

You can create an abstract data type with the NCHAR attribute as follows:

SQL> CREATE TYPE tp1 AS OBJECT (a NCHAR(10));

Type created.

SQL> CREATE TABLE t1 (a tp1);

Table created.

Character Set Conversion Between Clients and the Server

If you choose a database character set that is different from the character set on the client operating system, then the Oracle Database can convert the operating system character set to the database character set. Character set conversion has the following disadvantages:

  1. Potential data loss
  2. Increased overhead

Character set conversions can sometimes cause data loss. For example, if you are converting from character set A to character set B, then the destination character set B must have the same character set repertoire as A. Any characters that are not available in character set B are converted to a replacement character. The replacement character is often specified as a question mark or as a linguistically related character. For example, ä (a with an umlaut) may be converted to a. If you have distributed environments, then consider using character sets with similar character repertoires to avoid loss of data.
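Python's codec error handling mimics the replacement behavior (here the default '?' replacement; Oracle's conversion tables may instead choose a linguistically related character such as a for ä):

```python
text = "Müßig"
# Converting to a character set that lacks these characters substitutes
# a replacement character, here '?', losing the original data.
converted = text.encode("ascii", errors="replace")
print(converted)   # b'M??ig'
```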

Character set conversion may require copying strings between buffers several times before the data reaches the client. The database character set should always be a superset or equivalent of the native character set of the client's operating system. The character sets used by client applications that access the database usually determine which superset is the best choice.

If all client applications use the same character set, then that character set is usually the best choice for the database character set. When client applications use different character sets, the database character set should be a superset of all the client character sets. This ensures that every character is represented when converting from a client character set to the database character set.

The character set that is specified by the NLS_LANG parameter should reflect the setting for the client operating system. Setting NLS_LANG correctly enables proper conversion from the client operating system character encoding to the database character set.

Monolingual Database Scenario

The simplest example of a database configuration is a client and a server that run in the same language environment and use the same character set.


Character data passed between client and server must be converted between the two encoding schemes. Character conversion occurs automatically and transparently through Oracle Net.

The following figure shows a server and one client with the JA16EUC Japanese character set. The other client uses the JA16SJIS Japanese character set.


When a target character set does not contain all of the characters in the source data, replacement characters are used. If, for example, a server uses US7ASCII and a German client uses WE8ISO8859P1, then the German character ß is replaced with ? and ä is replaced with a.

Replacement characters may be defined for specific characters as part of a character set definition. When a specific replacement character is not defined, a default replacement character is used. To avoid the use of replacement characters when converting from a client character set to a database character set, the server character set should be a superset of all the client character sets.

The following figure shows that data loss occurs when the database character set does not include all of the characters in the client character set. The database character set is US7ASCII. The client's character set is WE8MSWIN1252, and the language used by the client is German. When the client inserts a string that contains ß, the database replaces ß with ?, resulting in lost data.

Multilingual Database Scenario

If you need multilingual support, then use Unicode AL32UTF8 for the server database character set.

Unicode has two major encoding schemes:

UTF-16: Each character is either 2 or 4 bytes long.

UTF-8: Each character takes 1 to 4 bytes to store.

Oracle Database provides support for UTF-8 as a database character set and both UTF-8 and UTF-16 as national character sets.
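These widths are easy to verify with the standard UTF-8 and UTF-16 codecs (Python sketch):

```python
# Byte widths of one character in each encoding scheme.
for ch in ["a", "é", "中", "\U0001F600"]:
    print(ch, len(ch.encode("utf-8")), len(ch.encode("utf-16-be")))
# a:     1 byte  UTF-8, 2 bytes UTF-16
# é:     2 bytes UTF-8, 2 bytes UTF-16
# 中:    3 bytes UTF-8, 2 bytes UTF-16
# emoji: 4 bytes UTF-8, 4 bytes UTF-16 (surrogate pair)
```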

Character set conversion between a UTF-8 database and any single-byte character set introduces very little overhead. Conversion between UTF-8 and any multibyte character set has some overhead. There is no data loss from conversion, with the following exceptions:

  1. Some multibyte character sets do not support user-defined characters during character set conversion to and from UTF-8.
  2. Some Unicode characters are mapped to more than one character in another character set. For example, one Unicode character is mapped to three characters in the JA16SJIS character set. This means that a round-trip conversion may not result in the original JA16SJIS character.

The following figure shows a server that uses the AL32UTF8 Oracle Database character set that is based on the Unicode UTF-8 character set.


Character conversion takes place between each client and the server except for the AL32UTF8 client, but there is no data loss because AL32UTF8 is a universal character set. If the German client tries to retrieve data from one of the Japanese clients, then all of the Japanese characters in the data are lost during the character set conversion.

Note: The database, the application server, and each client use the AL32UTF8 character set. This eliminates the need for character conversion even though the clients are French, German, and Japanese.
