http://www.unicode.org/reports/tr9/tr9-27.html
When working with bidirectional text, the characters are still interpreted in logical order; only the display is affected.
The directional types left-to-right and right-to-left are called strong types, and characters of those types are called strong directional characters. The directional types associated with numbers are called weak types, and characters of those types are called weak directional characters.
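The strong/weak distinction can be inspected with Python's standard `unicodedata` module. The following is a minimal sketch, not part of the annex: the function name `directional_strength` is hypothetical, and the grouping of class codes into strong (L, R, AL) and weak (EN, ES, ET, AN, CS, NSM, BN) is assumed from the Bidi_Class table of the full annex rather than from the text above.

```python
import unicodedata

# Assumed grouping of Bidi_Class values, taken from the full annex's type table.
STRONG = {"L", "R", "AL"}
WEAK = {"EN", "ES", "ET", "AN", "CS", "NSM", "BN"}


def directional_strength(ch: str) -> str:
    """Return 'strong', 'weak', or 'other' for a single character."""
    bidi_class = unicodedata.bidirectional(ch)
    if bidi_class in STRONG:
        return "strong"
    if bidi_class in WEAK:
        return "weak"
    # Neutrals, separators, and explicit formatting codes end up here.
    return "other"
```

For instance, a Latin or Hebrew letter is reported as strong, a European digit as weak, and a space as neither.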
Although the term embedding is used for some explicit codes, the text within the scope of the codes is not independent of the surrounding text. Characters within an embedding can affect the ordering of characters outside, and vice versa. The algorithm is designed so that the use of explicit codes can be equivalently represented by out-of-line information, such as stylesheet information. However, any alternative representation will be defined by reference to the behavior of the explicit codes in this algorithm.
Abbr. | Code | Name | Description |
---|---|---|---|
LRE | U+202A | LEFT-TO-RIGHT EMBEDDING | Treat the following text as embedded left-to-right. |
RLE | U+202B | RIGHT-TO-LEFT EMBEDDING | Treat the following text as embedded right-to-left. |
The effect of right-left line direction, for example, can be accomplished by embedding the text with RLE...PDF.
Abbr. | Code | Name | Description |
---|---|---|---|
LRO | U+202D | LEFT-TO-RIGHT OVERRIDE | Force following characters to be treated as strong left-to-right characters. |
RLO | U+202E | RIGHT-TO-LEFT OVERRIDE | Force following characters to be treated as strong right-to-left characters. |
The right-to-left override, for example, can be used to force a part number made of mixed English, digits and Hebrew letters to be written from right to left.
Abbr. | Code | Name | Description |
---|---|---|---|
PDF | U+202C | POP DIRECTIONAL FORMATTING | Restore the bidirectional state to what it was before the last LRE, RLE, RLO, or LRO. |
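As an illustration of how the explicit codes are used in practice, the sketch below wraps a string in an embedding or an override and terminates it with PDF. The helper names `embed`, `override`, and `is_balanced` are hypothetical; the balance check is only a lint-style convenience and is not something the algorithm requires.

```python
LRE, RLE = "\u202A", "\u202B"  # embeddings
LRO, RLO = "\u202D", "\u202E"  # overrides
PDF = "\u202C"                 # pop directional formatting


def embed(text: str, rtl: bool) -> str:
    """Wrap text in RLE...PDF (rtl=True) or LRE...PDF (rtl=False)."""
    return (RLE if rtl else LRE) + text + PDF


def override(text: str, rtl: bool) -> str:
    """Wrap text in RLO...PDF (rtl=True) or LRO...PDF (rtl=False)."""
    return (RLO if rtl else LRO) + text + PDF


def is_balanced(text: str) -> bool:
    """Check that every explicit code has a later PDF and vice versa."""
    depth = 0
    for ch in text:
        if ch in (LRE, RLE, LRO, RLO):
            depth += 1
        elif ch == PDF:
            if depth == 0:
                return False  # PDF with nothing to pop
            depth -= 1
    return depth == 0
```

A part number that must read right to left could then be stored as `override(part_number, rtl=True)`, where `part_number` is whatever string the application holds.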
Abbr. | Code | Name | Description |
---|---|---|---|
LRM | U+200E | LEFT-TO-RIGHT MARK | Left-to-right zero-width character |
RLM | U+200F | RIGHT-TO-LEFT MARK | Right-to-left zero-width character |
There is no special mention of the implicit directional marks in the following algorithm. That is because their effect on bidirectional ordering is exactly the same as a corresponding strong directional character; the only difference is that they do not appear in the display.
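This equivalence can be checked against the character database. The sketch below uses Python's standard `unicodedata` module; the helper name `describe` is hypothetical. It returns the Bidi_Class and General_Category of a character, so the marks can be compared with ordinary strong letters.

```python
import unicodedata

LRM = "\u200E"  # LEFT-TO-RIGHT MARK
RLM = "\u200F"  # RIGHT-TO-LEFT MARK


def describe(ch: str) -> tuple:
    """Return (Bidi_Class, General_Category) for one character."""
    return unicodedata.bidirectional(ch), unicodedata.category(ch)
```

LRM reports the same Bidi_Class as a Latin letter (L) and RLM the same as a Hebrew letter (R); both marks have the format category Cf, which is consistent with their not appearing in the display.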
The Unicode Bidirectional Algorithm (UBA) takes a stream of text as input and proceeds in four main phases.
BD1. The bidirectional character types are values assigned to each Unicode character, including unassigned characters. The formal property name in the Unicode Character Database [UCD] is Bidi_Class.
BD2. Embedding levels are numbers that indicate how deeply the text is nested, and the default direction of text on that level. The minimum embedding level of text is zero, and the maximum explicit depth is level 61.
Embedding levels are explicitly set by both override format codes and by embedding format codes; higher numbers mean the text is more deeply nested. The reason for having a limitation is to provide a precise stack limit for implementations to guarantee the same results. Sixty-one levels is far more than sufficient for ordering, even with mechanically generated formatting; the display becomes rather muddied with more than a small number of embeddings.
BD3. The default direction of the current embedding level (for the character in question) is called the embedding direction. It is L if the embedding level is even, and R if the embedding level is odd.
For example, in a particular piece of text, Level 0 is plain English text. Level 1 is plain Arabic text, possibly embedded within English level 0 text. Level 2 is English text, possibly embedded within Arabic level 1 text, and so on. Unless their direction is overridden, English text and numbers will always be an even level; Arabic text (excluding numbers) will always be an odd level. The exact meaning of the embedding level will become clear when the reordering algorithm is discussed, but the following provides an example of how the algorithm works.
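The parity rule in BD3 and the depth limit in BD2 can be written down directly. This is a minimal sketch under stated assumptions: `embedding_direction` and `least_greater_level` are hypothetical names, and the choice to return `None` when the limit would be exceeded is an illustrative convention, not something the definitions prescribe.

```python
MAX_EXPLICIT_LEVEL = 61  # maximum explicit depth stated in BD2


def embedding_direction(level: int) -> str:
    """BD3: even levels are left-to-right, odd levels are right-to-left."""
    return "L" if level % 2 == 0 else "R"


def least_greater_level(level: int, rtl: bool):
    """Smallest level above `level` with the requested direction.

    Returns None when that level would exceed MAX_EXPLICIT_LEVEL.
    """
    candidate = level + 1
    if (candidate % 2 == 1) != rtl:
        candidate += 1  # wrong parity, so step to the next level
    return candidate if candidate <= MAX_EXPLICIT_LEVEL else None
```

Starting from English text at level 0, `least_greater_level(0, rtl=True)` gives level 1 for embedded Arabic, and `least_greater_level(1, rtl=False)` gives level 2 for English nested inside it, matching the nesting described above.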
BD4. The paragraph embedding level is the embedding level that determines the default bidirectional orientation of the text in that paragraph.
BD5. The direction of the paragraph embedding level is called the paragraph direction.
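One common way to obtain the paragraph embedding level, when no higher-level protocol supplies it, is to look at the first strong character. The sketch below assumes that approach (it corresponds to rules P2 and P3 of the full annex, which are not reproduced in this section); the function names and the `default` parameter are hypothetical.

```python
import unicodedata


def paragraph_level(paragraph: str, default: int = 0) -> int:
    """Return 0 or 1 based on the first strong character, else `default`."""
    for ch in paragraph:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return 0
        if bidi_class in ("R", "AL"):
            return 1
    return default  # no strong character found


def paragraph_direction(paragraph: str) -> str:
    """BD5: the direction of the paragraph embedding level."""
    return "L" if paragraph_level(paragraph) % 2 == 0 else "R"
```

Digits and punctuation at the start of a paragraph are skipped, so a paragraph that opens with a number followed by Hebrew text still resolves to level 1.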
BD6. The directional override status determines whether the bidirectional type of characters is to be reset. The override status is set by using explicit directional controls. This status has three states, as shown in Table 2.
Status | Interpretation |
---|---|
Neutral | No override is currently active |
Right-to-left | Characters are to be reset to R |
Left-to-right | Characters are to be reset to L |
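To show how the embedding level and the override status interact, here is a simplified sketch of a stack-based tracker for the explicit codes. It is an assumption-laden illustration, not the reference implementation: the names `Override`, `EXPLICIT`, and `explicit_levels` are hypothetical, the overflow handling is a simple counter that ignores all further codes until the overflowing ones are popped, and the explicit codes themselves are dropped from the output.

```python
import unicodedata
from enum import Enum


class Override(Enum):
    NEUTRAL = None        # no override is currently active
    RIGHT_TO_LEFT = "R"   # characters are reset to R
    LEFT_TO_RIGHT = "L"   # characters are reset to L


MAX_LEVEL = 61
# code -> (is right-to-left, override status the code establishes)
EXPLICIT = {
    "\u202A": (False, Override.NEUTRAL),        # LRE
    "\u202B": (True, Override.NEUTRAL),         # RLE
    "\u202D": (False, Override.LEFT_TO_RIGHT),  # LRO
    "\u202E": (True, Override.RIGHT_TO_LEFT),   # RLO
}
PDF = "\u202C"


def explicit_levels(text: str, paragraph_level: int = 0) -> list:
    """Return (char, level, type) triples after applying explicit codes."""
    level, status = paragraph_level, Override.NEUTRAL
    stack, overflow, out = [], 0, []
    for ch in text:
        if ch in EXPLICIT:
            rtl, new_status = EXPLICIT[ch]
            # Least greater odd level for RTL, least greater even for LTR.
            new_level = (level + 1) | 1 if rtl else (level + 2) & ~1
            if new_level <= MAX_LEVEL and overflow == 0:
                stack.append((level, status))
                level, status = new_level, new_status
            else:
                overflow += 1  # simplification: count codes that were ignored
        elif ch == PDF:
            if overflow:
                overflow -= 1
            elif stack:
                level, status = stack.pop()
            # an unmatched PDF is ignored
        else:
            bidi_type = status.value or unicodedata.bidirectional(ch)
            out.append((ch, level, bidi_type))
    return out
```

With an RLE the enclosed letters keep their own types at the raised level, whereas with an RLO a Latin letter inside the override is reported with type R.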
BD7. A level run is a maximal substring of characters that have the same embedding level. It is maximal in that no character immediately before or after the substring has the same level (a level run is also known as a directional run).
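Given a list of resolved embedding levels, one per character, level runs can be found by grouping equal adjacent values. The sketch below is illustrative; the function name `level_runs` and the `(start, end, level)` tuple layout with an exclusive end index are assumptions.

```python
from itertools import groupby


def level_runs(levels: list) -> list:
    """Split a list of per-character levels into maximal runs.

    Returns (start, end, level) tuples, with `end` exclusive.
    """
    runs = []
    start = 0
    for level, group in groupby(levels):
        length = len(list(group))
        runs.append((start, start + length, level))
        start += length
    return runs
```

Because `groupby` only merges adjacent equal values, each run is maximal in the sense of BD7: the characters immediately before and after it have a different level.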
There are two versions of BIDI reference code available. Both have been tested to produce identical results. One version is written in Java, and the other is written in C++. The Java version is designed to closely follow the steps of the algorithm as described below. The C++ code is designed to show one of the optimization methods that can be applied to the algorithm, using a state table for one phase.
One of the most effective optimizations is to first test for right-to-left characters and not invoke the Bidirectional Algorithm unless they are present.
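A pre-check of this kind might look like the following sketch. The function name `needs_bidi` is hypothetical, and the set of classes treated as triggers (R, AL, AN, plus the right-to-left explicit codes RLE and RLO) is a conservative assumption. It also assumes the paragraph direction is left-to-right; text forced into a right-to-left paragraph needs the full algorithm regardless.

```python
import unicodedata

# Bidi_Class values assumed to indicate right-to-left content.
RTL_CLASSES = {"R", "AL", "AN", "RLE", "RLO"}


def needs_bidi(text: str) -> bool:
    """Return True if any character suggests right-to-left processing."""
    return any(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in text)
```

A caller would run the full Bidirectional Algorithm only when `needs_bidi` returns True and otherwise display the text in logical order.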
There are two directories containing source code for reference implementations at [Code9]. Implementers are encouraged to use this resource to test their implementations. There is an online demo of bidi code at http://unicode.org/cldr/utility/bidi.jsp, which shows the results, plus the levels and the rules invoked for each character.