BiDi Hyperlinks

More precisely, this post is about BiDi Internationalized Resource Identifiers (IRIs). These objects are a generalization of Uniform Resource Identifiers (URIs) that can contain a large variety of non-ASCII characters, such as most alphabetic characters and Chinese characters. Complications occur when BiDi characters such as Arabic and Hebrew letters are used in IRIs, especially when the IRIs are displayed in a right-to-left (RTL) context. As the IRI reference discusses, using the Unicode BiDi Algorithm (UBA) makes the display of such IRIs consistent in plain text, but unfortunately some RTL IRI displays are nearly unreadable. This post illustrates the display problems and offers several ways to remedy them. The same approaches can be used with file paths.

The remedies place a higher-level BiDi protocol on top of the UBA. A number of such protocols are discussed in an earlier post. One set of scenarios there deals with matched parentheses, which were displayed in confusing ways according to the UBA up through Unicode 6.2. Solutions for those problems were incorporated into the UBA in Unicode 6.3. That version also adds the “BiDi isolate” format characters, which allow text to be inserted into a document without reordering the display of the surrounding text. These improvements have become integral parts of the UBA, which assumes the text is plain text. In contrast, the BiDi IRI protocols have a rich-text property in that they assume the IRI text is, in fact, an IRI. Exactly how the text becomes identified as an IRI is not part of the protocol, although that identification is an interesting subject in its own right. Suffice it to say that many programs identify IRIs automagically.

First, consider a couple of examples of the BiDi IRI display problems. According to the UBA, https://شس.يب.ثق displays in a right-to-left paragraph as

https://شس.يب.ثق

Or more confusing yet, https://exchange.شس.ثق displays in an RTL paragraph as

https://exchange.شس.ثق

The potential for spoofed IRIs is large. The IRI reference does recommend that IRIs always be displayed in the UBA LTR display order. But that requires being able to recognize an IRI in the first place and it also has some anomalous displays.

Assuming an IRI can be recognized or is used in an unambiguous URL setting such as a browser address bar, we can enforce a more readable display order. Namely, we force the URL delimiters '#', '.', '/', ':', '?', '@', '[', ']' to be treated either as strong LTR or as strong RTL characters. Treated as strong RTL characters, https://exchange.شس.ثق displays in an RTL context as

ثق.شس.exchange//:https

i.e., the alphanumeric spans in between the URL delimiters appear in the reverse order from the way they appear in an LTR paragraph and the UBA is not used to resolve the neutrality of the slash, colon and period. The alphanumeric spans themselves (in effect the “leaves” of the structure) are still displayed in the order given by the UBA.
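
The span ordering just described can be sketched in a few lines of Python. This is only a simplified model, not a UBA implementation: it splits the IRI at the delimiters listed above and returns the tokens in the left-to-right order in which they would appear on screen, leaving each leaf span in logical order on the assumption that a UBA-aware renderer draws the leaf itself. The function names `tokenize_iri` and `visual_token_order` are made up for this illustration.

```python
import re

# Delimiter set from the post: # . / : ? @ [ ]
_DELIMS = r"#./:?@\[\]"
# A token is either one delimiter character or a maximal run of non-delimiters.
_TOKEN_RE = re.compile(rf"[{_DELIMS}]|[^{_DELIMS}]+")


def tokenize_iri(iri: str) -> list[str]:
    """Split an IRI into leaf spans and single delimiter characters, in logical order."""
    return _TOKEN_RE.findall(iri)


def visual_token_order(iri: str, delimiter_direction: str) -> list[str]:
    """Return the tokens in left-to-right screen order.

    delimiter_direction is "ltr" or "rtl" (hypothetical values for this sketch).
    With "rtl" the spans and delimiters appear in reverse order; the characters
    inside each leaf are left untouched for the renderer's UBA pass.
    """
    tokens = tokenize_iri(iri)
    if delimiter_direction == "ltr":
        return tokens
    if delimiter_direction == "rtl":
        return tokens[::-1]
    raise ValueError(f"unknown direction: {delimiter_direction!r}")


# The second example IRI, written with escapes so the source is unambiguous.
iri = "https://exchange.\u0634\u0633.\u062b\u0642"
print(visual_token_order(iri, "rtl"))
```

For the RTL case, the scheme ends up as the rightmost token and the top-level domain as the leftmost one, which matches the display shown above.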

The question arises as to which directionality to assign to the URL delimiters. As one might expect, different people have different preferences. Accordingly, the directionality of the URL neutrals could be made a user setting. In most locales, the default value would be LTR. But in some Arabic locales the default could be given by the embedding level or RTL. It might be worthwhile to allow users to change the setting as desired.

The setting could have four values:

LTR

RTL

Paragraph direction

First strong character following scheme identifier
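
As a rough sketch, the four values could be modeled as an enumeration together with a function that resolves the setting to a concrete direction. The names `DelimiterDirection` and `resolve_delimiter_direction` are hypothetical, and the "first strong" rule here simply scans the text after the first colon using the bidirectional classes from Python's `unicodedata` module; a real implementation might define the scheme boundary differently.

```python
import unicodedata
from enum import Enum


class DelimiterDirection(Enum):
    LTR = "ltr"
    RTL = "rtl"
    PARAGRAPH = "paragraph"
    FIRST_STRONG_AFTER_SCHEME = "first-strong-after-scheme"


def first_strong_direction(text: str, default: str) -> str:
    """Return "ltr" or "rtl" for the first strong character in text, else default."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):  # Hebrew-type and Arabic-type letters
            return "rtl"
    return default


def resolve_delimiter_direction(
    setting: DelimiterDirection, iri: str, paragraph_direction: str
) -> str:
    """Map the user setting to "ltr" or "rtl" for one IRI in one paragraph."""
    if setting is DelimiterDirection.LTR:
        return "ltr"
    if setting is DelimiterDirection.RTL:
        return "rtl"
    if setting is DelimiterDirection.PARAGRAPH:
        return paragraph_direction
    # FIRST_STRONG_AFTER_SCHEME: skip the scheme (text before the first ':').
    _scheme, colon, rest = iri.partition(":")
    return first_strong_direction(rest if colon else iri, paragraph_direction)
```

With this sketch, an IRI whose host begins with "exchange" resolves to LTR under the first-strong option, while one whose host begins with an Arabic label resolves to RTL.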

The paragraph direction option is the easiest to implement and is probably the most natural from a BiDi perspective. Furthermore, it agrees with the UBA layout in the limiting cases of fully LTR text in an LTR paragraph and fully RTL text in an RTL paragraph. Accordingly, I think the best solution is to have BiDi IRIs lay out with the directionality of the paragraph containing them and not bother to offer a user setting.

The question also arises as to how the IRI affects its surroundings. It seems that the most reasonable choice is to make it act like an "other neutral" character as defined in the UBA. That way it goes along with the flow of the text surrounding it.
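
One way to approximate this neutral behavior in plain text, offered here as an assumption rather than as part of the protocol, is to wrap the IRI in the BiDi isolate characters mentioned earlier: the UBA treats an isolated run like a single neutral object as far as the surrounding text is concerned. The helper name `isolate_iri` is made up for this sketch, and note that isolates alone do not change how the delimiters inside the IRI are resolved.

```python
# Unicode 6.3 isolate format characters.
LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
RLI = "\u2067"  # RIGHT-TO-LEFT ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE


def isolate_iri(iri: str, direction: str) -> str:
    """Wrap an IRI in an isolate so it does not reorder the text around it.

    direction is "ltr" or "rtl" (hypothetical values for this sketch) and
    sets the base direction used inside the isolate.
    """
    openers = {"ltr": LRI, "rtl": RLI}
    if direction not in openers:
        raise ValueError(f"unknown direction: {direction!r}")
    return f"{openers[direction]}{iri}{PDI}"


# Example: embed an IRI in a sentence with the paragraph's own direction.
paragraph_direction = "rtl"
sentence = "See " + isolate_iri("https://example.com/a", paragraph_direction) + " for details."
```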