Unicode Bidirectional Algorithm

http://www.unicode.org/reports/tr9/tr9-27.html


When working with bidirectional text, the characters are still interpreted in logical order; only the display is affected.


The directional types left-to-right and right-to-left are called strong types, and characters of those types are called strong directional characters. The directional types associated with numbers are called weak types, and characters of those types are called weak directional characters.
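The strong/weak distinction can be checked per character from its Bidi_Class value. The following is a minimal sketch using Python's standard `unicodedata` module; the helper name `strength` is hypothetical, and the two class sets are an assumption taken from the Bidi_Class groupings in UAX #9 rather than from the text quoted here.

```python
import unicodedata

# Assumed groupings of Bidi_Class values (not listed in the excerpt above).
STRONG = {"L", "R", "AL"}
WEAK = {"EN", "ES", "ET", "AN", "CS", "NSM", "BN"}


def strength(ch: str) -> str:
    """Return 'strong', 'weak', or 'other' for a single character."""
    bidi_class = unicodedata.bidirectional(ch)
    if bidi_class in STRONG:
        return "strong"
    if bidi_class in WEAK:
        return "weak"
    # Neutrals, explicit formatting codes, and unassigned code points.
    return "other"


if __name__ == "__main__":
    for ch in ["a", "\u05d0", "1", " "]:
        print(repr(ch), unicodedata.bidirectional(ch), strength(ch))
```

Which classes a given character belongs to depends on the Unicode version bundled with the Python build, so results for recently added characters may differ between installations.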


Although the term embedding is used for some explicit codes, the text within the scope of the codes is not independent of the surrounding text. Characters within an embedding can affect the ordering of characters outside, and vice versa. The algorithm is designed so that the use of explicit codes can be equivalently represented by out-of-line information, such as stylesheet information. However, any alternative representation will be defined by reference to the behavior of the explicit codes in this algorithm.



2.1 Explicit Directional Embedding

Abbr. Code Chart Name Description
LRE U+202A http://www.unicode.org/cgi-bin/refglyph?24-202A LEFT-TO-RIGHT EMBEDDING Treat the following text as embedded left-to-right.
RLE U+202B http://www.unicode.org/cgi-bin/refglyph?24-202B RIGHT-TO-LEFT EMBEDDING Treat the following text as embedded right-to-left.

The effect of right-left line direction, for example, can be accomplished by embedding the text with RLE...PDF.


2.2 Explicit Directional Overrides

Abbr. Code Chart Name Description
LRO U+202D LEFT-TO-RIGHT OVERRIDE Force following characters to be treated as strong left-to-right characters.
RLO U+202E http://www.unicode.org/cgi-bin/refglyph?24-202E RIGHT-TO-LEFT OVERRIDE Force following characters to be treated as strong right-to-left characters.


The right-to-left override, for example, can be used to force a part number made of mixed English, digits and Hebrew letters to be written from right to left.



2.3 Terminating Explicit Directional Code

Abbr. Code Chart Name Description
PDF U+202C http://www.unicode.org/cgi-bin/refglyph?24-202C POP DIRECTIONAL FORMATTING Restore the bidirectional state to what it was before the last LRE, RLE, RLO, or LRO.
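As an illustration of how the explicit codes from sections 2.1 to 2.3 are paired, the sketch below wraps a string in an embedding or override code and closes it with PDF. The helper names `embed` and `override` are hypothetical; only the code points come from the tables above.

```python
# Code points from the tables in sections 2.1, 2.2 and 2.3.
LRE = "\u202a"  # LEFT-TO-RIGHT EMBEDDING
RLE = "\u202b"  # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING
LRO = "\u202d"  # LEFT-TO-RIGHT OVERRIDE
RLO = "\u202e"  # RIGHT-TO-LEFT OVERRIDE


def embed(text: str, rtl: bool) -> str:
    """Wrap text in an embedding (RLE or LRE) terminated by PDF."""
    return (RLE if rtl else LRE) + text + PDF


def override(text: str, rtl: bool) -> str:
    """Wrap text in an override (RLO or LRO) terminated by PDF."""
    return (RLO if rtl else LRO) + text + PDF


if __name__ == "__main__":
    # A part number forced right-to-left, as in the example in section 2.2.
    print(repr(override("AB-123", rtl=True)))
```

The helpers only insert the control characters into the logical string; how the result is displayed depends on the renderer applying the algorithm.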



2.4 Implicit Directional Marks



Abbr. Code Chart Name Description
LRM U+200E http://www.unicode.org/cgi-bin/refglyph?24-200E LEFT-TO-RIGHT MARK Left-to-right zero-width character
RLM U+200F http://www.unicode.org/cgi-bin/refglyph?24-200F RIGHT-TO-LEFT MARK Right-to-left zero-width character


There is no special mention of the implicit directional marks in the following algorithm. That is because their effect on bidirectional ordering is exactly the same as a corresponding strong directional character; the only difference is that they do not appear in the display.
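This can be observed directly in the character properties. A small sketch, assuming Python's standard `unicodedata` module, shows that LRM and RLM carry the same Bidi_Class as ordinary strong letters while being format characters (general category Cf):

```python
import unicodedata

LRM = "\u200e"  # LEFT-TO-RIGHT MARK
RLM = "\u200f"  # RIGHT-TO-LEFT MARK


def describe(ch: str) -> tuple:
    """Return (Bidi_Class, General_Category) for one character."""
    return unicodedata.bidirectional(ch), unicodedata.category(ch)


if __name__ == "__main__":
    print("LRM:", describe(LRM))
    print("RLM:", describe(RLM))
    print("Latin a:", describe("a"))
    print("Hebrew alef:", describe("\u05d0"))
```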



3 Basic Display Algorithm

The Unicode Bidirectional Algorithm (UBA) takes a stream of text as input and proceeds in four main phases:

  • Separation into paragraphs. The rest of the algorithm is applied separately to the text within each paragraph.
  • Initialization. A list of directional character types is initialized, with one entry for each character in the original text. The value of each entry is the Bidi_Class property of the respective character. After this point, the original characters are no longer referenced until the reordering phase. A list of embedding levels, with one level per character, is then initialized.
  • Resolution of the embedding levels. A series of rules are applied to the lists of embedding levels and directional character types. Each rule is based on the current values of those lists, and can modify those values. Each rule is applied to each of the values in sequence before continuing to the next rule. The result of this phase is a modified list of embedding levels; the list of directional character types is no longer needed.
  • Reordering. The text within each paragraph is reordered for display: first, the text in the paragraph is broken into lines, then the resolved embedding levels are used to reorder the text of each line for display.
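The first two phases can be sketched as follows. This is a simplified illustration, not the reference implementation: the function names are hypothetical, paragraphs are assumed to end at any character whose Bidi_Class is B (with the separator kept in the preceding paragraph), CR LF pairs are not merged into one separator, and the paragraph level is assumed to be supplied by the caller rather than derived from the text.

```python
import unicodedata


def split_paragraphs(text: str) -> list:
    """Phase 1: split text after each paragraph separator (Bidi_Class B)."""
    paragraphs, current = [], []
    for ch in text:
        current.append(ch)
        if unicodedata.bidirectional(ch) == "B":
            paragraphs.append("".join(current))
            current = []
    if current:
        paragraphs.append("".join(current))
    return paragraphs


def initialize(paragraph: str, paragraph_level: int = 0) -> tuple:
    """Phase 2: build the per-character type list and level list."""
    types = [unicodedata.bidirectional(ch) for ch in paragraph]
    # Assumption: every level starts at the paragraph level.
    levels = [paragraph_level] * len(paragraph)
    return types, levels


if __name__ == "__main__":
    for para in split_paragraphs("abc 12\n\u05d0\u05d1"):
        print(repr(para), initialize(para))
```

The two lists produced by `initialize` are what the resolution phase would then modify rule by rule; that phase is not sketched here.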

 

3.1 Definitions

BD1. The bidirectional character types are values assigned to each Unicode character, including unassigned characters. The formal property name in the Unicode Character Database [UCD] is Bidi_Class.

BD2. Embedding levels are numbers that indicate how deeply the text is nested, and the default direction of text on that level. The minimum embedding level of text is zero, and the maximum explicit depth is level 61.

Embedding levels are explicitly set by both override format codes and by embedding format codes; higher numbers mean the text is more deeply nested. The reason for having a limitation is to provide a precise stack limit for implementations to guarantee the same results. Sixty-one levels is far more than sufficient for ordering, even with mechanically generated formatting; the display becomes rather muddied with more than a small number of embeddings.

BD3. The default direction of the current embedding level (for the character in question) is called the embedding direction. It is L if the embedding level is even, and R if the embedding level is odd.
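BD3 reduces to a parity check on the level. A minimal sketch, with a hypothetical function name:

```python
def embedding_direction(level: int) -> str:
    """Return 'L' for even embedding levels and 'R' for odd ones (BD3)."""
    if level < 0:
        raise ValueError("embedding levels are non-negative")
    return "L" if level % 2 == 0 else "R"


if __name__ == "__main__":
    for level in range(4):
        print(level, embedding_direction(level))
```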

For example, in a particular piece of text, Level 0 is plain English text. Level 1 is plain Arabic text, possibly embedded within English level 0 text. Level 2 is English text, possibly embedded within Arabic level 1 text, and so on. Unless their direction is overridden, English text and numbers will always be an even level; Arabic text (excluding numbers) will always be an odd level. The exact meaning of the embedding level will become clear when the reordering algorithm is discussed, but the following provides an example of how the algorithm works.
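To make the role of the levels concrete, here is a sketch of how resolved levels could drive reordering of a single line. It is based on rule L2 of UAX #9, which this excerpt does not quote: from the highest level down to the lowest odd level, reverse every contiguous run of characters at that level or higher. The function name is hypothetical, the levels are supplied by hand rather than computed, and uppercase ASCII letters stand in for right-to-left letters.

```python
def reorder_line(chars: str, levels: list) -> str:
    """Reorder one line for display from its resolved embedding levels."""
    if len(chars) != len(levels):
        raise ValueError("need exactly one level per character")
    odd_levels = [lv for lv in levels if lv % 2 == 1]
    if not odd_levels:
        return chars  # no right-to-left level, nothing to reverse
    order = list(range(len(chars)))  # display position -> logical index
    for level in range(max(levels), min(odd_levels) - 1, -1):
        i = 0
        while i < len(order):
            if levels[order[i]] >= level:
                # Find the end of the run at this level or higher.
                j = i
                while j < len(order) and levels[order[j]] >= level:
                    j += 1
                order[i:j] = reversed(order[i:j])
                i = j
            else:
                i += 1
    return "".join(chars[k] for k in order)


if __name__ == "__main__":
    # Level 1 text containing a level 2 number, entirely right-to-left.
    print(reorder_line("AB12CD", [1, 1, 2, 2, 1, 1]))
    # Level 0 text followed by level 1 text.
    print(reorder_line("abcDEF", [0, 0, 0, 1, 1, 1]))
```

In the first call the digits end up reversed twice, so they keep their left-to-right order inside the reversed level 1 run, which matches the nesting described above.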

BD4. The paragraph embedding level is the embedding level that determines the default bidirectional orientation of the text in that paragraph.

BD5. The direction of the paragraph embedding level is called the paragraph direction.

  • In some contexts the paragraph direction is also known as the base direction.

BD6. The directional override status determines whether the bidirectional type of characters is to be reset. The override status is set by using explicit directional controls. This status has three states, as shown in Table 2.

Table 2. Directional Override Status

Status Interpretation
Neutral No override is currently active
Right-to-left Characters are to be reset to R
Left-to-right Characters are to be reset to L

BD7. A level run is a maximal substring of characters that have the same embedding level. It is maximal in that no character immediately before or after the substring has the same level (a level run is also known as a directional run).
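Given a list of embedding levels, the level runs of BD7 can be found by grouping equal adjacent values. A minimal sketch, with a hypothetical function name and a (start, end, level) tuple format chosen only for illustration:

```python
from itertools import groupby


def level_runs(levels: list) -> list:
    """Return maximal runs as (start, end_exclusive, level) tuples."""
    runs = []
    start = 0
    for level, group in groupby(levels):
        length = len(list(group))
        runs.append((start, start + length, level))
        start += length
    return runs


if __name__ == "__main__":
    print(level_runs([0, 0, 1, 1, 1, 0]))
```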

 

 

 

5.1 Reference Code

There are two versions of BIDI reference code available. Both have been tested to produce identical results. One version is written in Java, and the other is written in C++. The Java version is designed to closely follow the steps of the algorithm as described below. The C++ code is designed to show one of the optimization methods that can be applied to the algorithm, using a state table for one phase.

One of the most effective optimizations is to first test for right-to-left characters and not invoke the Bidirectional Algorithm unless they are present.
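A pre-check of this kind might look like the sketch below. The function name and the exact set of triggering classes are assumptions: the set here includes the strong right-to-left classes, Arabic numbers, and the right-to-left explicit codes, and the check only makes sense when the paragraph direction is left-to-right.

```python
import unicodedata

# Assumed set of Bidi_Class values that require the full algorithm.
RTL_TRIGGERS = {"R", "AL", "AN", "RLE", "RLO"}


def needs_bidi(text: str) -> bool:
    """Return True if text contains any character that may need reordering."""
    return any(unicodedata.bidirectional(ch) in RTL_TRIGGERS for ch in text)


if __name__ == "__main__":
    print(needs_bidi("hello 123"))
    print(needs_bidi("abc \u05d0\u05d1"))
```

Text for which `needs_bidi` returns False can be displayed in logical order without running the resolution and reordering phases.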

There are two directories containing source code for reference implementations at [Code9]. Implementers are encouraged to use this resource to test their implementations. There is an online demo of bidi code at http://unicode.org/cldr/utility/bidi.jsp, which shows the results, plus the levels and the rules invoked for each character.


 
