[DESIGN] Update bidi design document to show proposed design #871

Merged
merged 2 commits into from
Sep 2, 2024
71 changes: 49 additions & 22 deletions exploration/bidi-usability.md
@@ -273,6 +273,39 @@ Not allowing these to mix could produce annoying parse errors.

_Describe the proposed solution. Consider syntax, formatting, errors, registry, tooling, interchange._

I propose adopting a hybrid approach in which we permit "super-loose isolation".
This allows users to include isolates and strongly directional characters in the whitespace
portions of the syntax in order to make messages appear correctly.

The second part of the hybrid approach would be to recommend ("SHOULD") the "strict isolation"
design for serializers.
(Note that "strict" and "super-loose" use non-identical productions with the name `bidi`.
These serve different purposes, and they are consistent in that the strict production is narrower than the super-loose one.)
This syntax is a subset of the super-loose syntax and can be applied selectively to messages that
have RTL sequences or which have problematic display.
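
As an illustration of applying the strict form selectively, a serializer could first check whether a message contains any strongly right-to-left characters. The following is a minimal sketch under stated assumptions: `has_rtl` is a hypothetical helper name, not something defined by this design, and it only detects RTL sequences (bidi classes `R` and `AL`). It cannot detect messages that merely "have problematic display".

```python
import unicodedata

# Bidi classes of strongly right-to-left characters (Hebrew, Arabic, etc.).
RTL_CLASSES = {"R", "AL"}


def has_rtl(text: str) -> bool:
    """Return True if any character in text is strongly right-to-left.

    Hypothetical helper: a serializer might use a check like this to decide
    whether to emit the strict isolation form for a given message.
    """
    return any(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in text)


# A placeholder whose variable name is written in Arabic letters.
rtl_message = "Hello {$\u0645\u0635\u0631}"
ltr_message = "Hello {$name}"
```

A tool following this sketch would leave `ltr_message` untouched and only consider adding isolates to `rtl_message`.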


## Alternatives Considered

_What other solutions are available?_
_How do they compare against the requirements?_
_What other properties do they have?_

### Nothing
We could do nothing.

A likely outcome of doing nothing is that RTL users would insert bidi controls into
_messages_ in an attempt to make the _pattern_ and/or _placeholders_ display correctly.
These controls would become part of the output of the _message_,
showing up inappropriately at runtime.
Because these characters are invisible, users might be very frustrated trying to manage
the results or debug what is wrong with their messages.

By contrast, if users insert too many or the wrong controls using the recommended design,
the _message_ would still be functional and would emit no undesired characters.

### LTR Messages with isolating sequences

The syntax of a _message_ assumes a left-to-right base direction
both for the complete text of the _message_ and for each line (paragraph)
contained therein.
@@ -383,7 +416,7 @@ ns-separator = [bidi] ":"
bidi = [ %x200E-200F / %x061C ]
```

### Open Issues with Proposed Design
**Open Issues**

The ABNF changes found above put isolates and strongly directional marks into specific locations,
such as directly next to `{`/`}`/`{{`/`}}` markers
@@ -393,32 +426,24 @@ A more permissive design would add the isolates and strongly directional marks t
whitespace in the syntax and depend on users/editors to appropriately pair or position the marks
to get optimal display.

## Alternatives Considered

_What other solutions are available?_
_How do they compare against the requirements?_
_What other properties do they have?_

### Nothing
We could do nothing.

A likely outcome of doing nothing is that RTL users would insert bidi controls into
_messages_ in an attempt to make the _pattern_ and/or _placeholders_ display correctly.
These controls would become part of the output of the _message_,
showing up inappropriately at runtime.
Because these characters are invisible, users might be very frustrated trying to manage
the results or debug what is wrong with their messages.

By contrast, if users insert too many or the wrong controls using the recommended design,
the _message_ would still be functional and would emit no undesired characters.

### Super-loose isolation

Add isolates and strongly directional marks to required and optional whitespace in the syntax.
This would permit users to get the effects described by the above design,
as long as they use isolates/marks in a "responsible" way.

(Omitting other changes found in #673)
The exception to this is the namespace separator, used in `identifier`.
This requires the ability to insert isolates or strongly directional marks
between the namespace and name portions, where whitespace is not permitted.
This is the only location in the syntax where such characters might be needed
but whitespace is not at least optional.
This could be defined as:
```abnf
ns-separator = [bidi] ":" [bidi]
Collaborator

Why allow for the bidi after : if we're not allowing it to show up elsewhere after a starting sigil?

Member Author

Your example $_مصر‎⁩ can be made to render correctly using LRI/PDI around the identifier (which the syntax allows and the strict form encourages). The [bidi] production allows starting or ending strongly directional marks to prevent spillover in the middle of a namespaced identifier, e.g. $_م1صر:‎_م2صر‎⁩

I'll admit that it's generally better practice to put the strongly directional character at the end (before the : separator), but I didn't want to make the syntax ultra-fussy: whatever stew of strongly directional characters and a colon are not part of either the namespace or the name.

The separator is different from sigils because the sigils are all at token-start, whereas the namespace separator is embedded into the word-token.

code points for the first example: \u2066$_\u0645\u0635\u0631\u200e\u2069
code points for the second example: \u2066$_\u06451\u0635\u0631:\u200e_\u06452\u0635\u0631\u200e\u2069

Collaborator

Your example $_مصر‎⁩ can be made to render correctly using LRI/PDI around the identifier (which the syntax allows and the strict form encourages). [...]

code points for the first example: \u2066$_\u0645\u0635\u0631\u200e\u2069

That's rendering for me with the _ next to the $, which is wrong? The _ here has neutral directionality, and it's the first character of a name which as a whole is RTL. So we ought to have some way to render it as the right-most character, but that's not possible without some bidi control after the $.

Member Author

Fair enough, although a display like $⁧_مصر⁩⁩ is in itself super-weird (and it's namespaced friend $⁧_م1صر⁩:⁧_م2صر⁩⁩ is even weirder) from the point of view that it's an RTL isolated run inside an LTR isolated run (or, in the latter case, two RTL runs inside an LTR run). This makes the name tokens display correctly RTL while still forcing the overall token (including the sigil) to display LTR. That is, this sequence:

\u2066$\u2067_\u06451\u0635\u0631\u2069:\u2067_\u06452\u0635\u0631\u2069\u2069

... with 6 of the 18 code points being bidi isolates.

There's a similar problem with options, where we're trying to force them to be LTR-ordered (option = value/م1صر‎=م2صر⁩) when RTL really really wants that to be displayed as ⁦م1صر=م2صر⁩.

So you're right: my proposal does compromise some aspects of RTL display as part of our insistence that messages are intended to work in an LTR editing environment with LTR syntax. Either way, logical order is all that matters and users should be careful about using mixed direction for non-literal values. We give them the tools to make their bidi literals and patterns work RTL-normally as well as the tools to let non-RTL speakers read placeholders LTR (like most developers debugging messages). It is not perfect. Do we need that last bit of cruft to allow the sigil and identifier to be separate? (Not asking rhetorically, btw. What do people think? Where can we get developer/translator feedback?)

Note: It took a minute of fiddling to get the namespaced example to not get entangled with the markdown in this comment--all in service of getting the underscore on the right side.

Member

I think the proposed solution solves the 99% case, and trying to solve the remaining 1% will be tricky. Moreover, the people who really have to deal with messages are the translators, and they will need tooling to effectively do their jobs at scale — and that tooling has far more freedom to deal with bidi issues than we have in plain text. So I don't think we need to go any further down this path.

Member Author

Tbf, _ has neutral direction and it is relatively common to use it as a prefix for identifiers.

Yes, I know. So do all of the sigils (with some of them being enclosing punctuation).

I guess my argument would be that the sigil is, from a user perspective, "part of the name", e.g. $foo, not $ with foo being separate. In an RTL display context, the sigil, being a neutral, will naturally reverse sides to be at the start, (e.g. ⁧$_مصر⁩). For us to maintain an LTR bias while making only the name token isolated is a lot more work--for tools and users.

The counterargument would be that identifier and name are their own things (and many are generated from the user's environment). The various sigils and such in MF2 are thus "not part of the name" and should be treated separately. Allowing users to insert these characters is not the same as obligating them to use them and I could see us making the affordance.

In the screenshot below, the display is set to RTL. The placeholders have an LRI/PDI inside the {/}.

  • The first line, lacking any other bidi controls, is illegible.
  • The second line uses RLI/PDI around each token. This presents sigils on the right, which an RTL speaker would understand, but the tokens are in LTR order. I don't specifically object to this presentation.
  • The third line uses LRI/PDI around each token and approximates what I'm suggesting above.
  • The fourth line uses the markup you're suggesting, which is RLI/PDI around each name and LRI/PDI around the sigil+identifier.

image

Here's the data used to produce the screenshot. Note that I do use a U+200E next to the : separator.

Example: {\u2066$_\u06351\u0636\u0637 :_\u06352\u0636\u0637:_\u06353\u0636\u0637 _\u06354\u0636\u0637=_\u06355\u0636\u0637\u2069}<br>
Example RLI: {\u2066\u2067$_\u06351\u0636\u0637\u2069 \u2067:_\u06352\u0636\u0637:_\u06353\u0636\u0637\u2069 \u2067_\u06354\u0636\u0637\u2069=\u2067_\u06355\u0636\u0637\u2069\u2069}<br>
Example LRI: {\u2066\u2066$_\u06351\u0636\u0637\u2069 \u2066:_\u06352\u0636\u0637\u200e:_\u06353\u0636\u0637\u2069 \u2066_\u06354\u0636\u0637\u2069=\u2066_\u06355\u0636\u0637\u2069\u2069}<br>
Example EAO: {\u2066\u2066$\u2067_\u06351\u0636\u0637\u2069\u2069 \u2066:\u2067_\u06352\u0636\u0637\u2069\u200e:\u2067_\u06353\u0636\u0637\u2069\u2069 \u2067_\u06354\u0636\u0637\u2069=\u2067_\u06355\u0636\u0637\u2069\u2069}<br>

Collaborator

The various sigils and such in MF2 are thus "not part of the name" and should be treated separately.

That would be my position. In particular for variables, AFAIK all current implementations would require parameters like { foo: 42 } in order to resolve a $foo, and not e.g. { $foo: 42 }.

In case it matters, here's a slightly shorter way to get the same results as in the last example (The LR isolation around the sigil+identifier isn't required as the placeholder already isolates its contents, and the LRM next to the : has no effect):

Example EAO: {\u2066$\u2067_\u06351\u0636\u0637\u2069 :\u2067_\u06352\u0636\u0637\u2069:\u2067_\u06353\u0636\u0637\u2069 \u2067_\u06354\u0636\u0637\u2069=\u2067_\u06355\u0636\u0637\u2069\u2069}

Member Author

In case it matters, here's a slightly shorter way to get the same results as in the last example (The LR isolation around the sigil+identifier isn't required as the placeholder already isolates its contents, and the LRM next to the : has no effect):

Your formulation is correct. Note that the LRM is necessary if we don't permit isolates inside the identifier production (which is not what you are proposing), because the : is a neutral and wants to extend whatever run is to either side of it.

Collaborator

I think something got damaged somewhere in the several copy/paste operations.

To make things readable I replaced:

\u0635 A
\u0636 B
\u0637 C

\u2066 [LRI]
\u2069 [PDI]
\u200E [LRM]
\u2067 [RLI]

And the resulting strings above are:

Example: {[LRI]$_A1BC :_A2BC:_A3BC _A4BC=_A5BC[PDI]}<br>
Example RLI: {[LRI][RLI]$_A1BC[PDI] [RLI]:_A2BC:_A3BC[PDI] [RLI]_A4BC[PDI]=[RLI]_A5BC[PDI][PDI]}<br>
Example LRI: {[LRI][LRI]$_A1BC[PDI] [LRI]:_A2BC[LRM]:_A3BC[PDI] [LRI]_A4BC[PDI]=[LRI]_A5BC[PDI][PDI]}<br>
Example EAO: {[LRI][LRI]$[RLI]_A1BC[PDI][PDI] [LRI]:[RLI]_A2BC[PDI][LRM]:[RLI]_A3BC[PDI][PDI] [RLI]_A4BC[PDI]=[RLI]_A5BC[PDI][PDI]}<br>

Example EAO: {[LRI]$[RLI]_A1BC[PDI] :[RLI]_A2BC[PDI]:[RLI]_A3BC[PDI] [RLI]_A4BC[PDI]=[RLI]_A5BC[PDI][PDI]}

Taking out the bidi control characters we are left with

{$_A1BC :_A2BC:_A3BC _A4BC=_A5BC}

That is not valid MF2 syntax.
So I don't know what this is trying to fix.


But yesterday I played a bit and put together a web page that one can use to interactively play with this.

https://mihai-nita.net/tmp/mf2bidi.html

And I argue that:

  • isolates are enough (FSI, LRI, RLI, PDI)
  • there is no need for any control character between $ and the name proper, even when there is an _ there.

Member Author

That is not valid MF2 syntax.

Um... how is it not valid?

$_A1BC is a valid operand.
:_A2BC is a valid namespace. :_A3BC is a valid function name.
_A4BC is a valid option name.
_A5BC is a valid unquoted literal.

It's even a valid message, in that a simple message might consist solely of a placeholder.

What am I missing?

```
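
The substitution used in the review thread (Arabic letters replaced by `A`/`B`/`C`, bidi controls replaced by bracketed labels) is easy to reproduce when debugging invisible characters. The sketch below is illustrative only: the function names are hypothetical, and the labels for FSI, RLM, and ALM are additions that the thread did not use.

```python
# Visible labels for bidi controls. LRI, RLI, PDI, and LRM come from the
# review thread; FSI, RLM, and ALM are added here as an assumption.
BIDI_LABELS = {
    "\u2066": "[LRI]",
    "\u2067": "[RLI]",
    "\u2068": "[FSI]",
    "\u2069": "[PDI]",
    "\u200e": "[LRM]",
    "\u200f": "[RLM]",
    "\u061c": "[ALM]",
}

# Stand-ins for the three Arabic letters used in the thread's examples.
LETTER_LABELS = {"\u0635": "A", "\u0636": "B", "\u0637": "C"}


def label_controls(text: str) -> str:
    """Replace bidi controls and the example letters with visible labels."""
    table = {**BIDI_LABELS, **LETTER_LABELS}
    return "".join(table.get(ch, ch) for ch in text)


def strip_controls(text: str) -> str:
    """Remove bidi controls, leaving only the logical-order syntax."""
    return "".join(ch for ch in text if ch not in BIDI_LABELS)


# The shorter "EAO" example from the thread, written with escapes.
example = (
    "{\u2066$\u2067_\u06351\u0636\u0637\u2069 "
    ":\u2067_\u06352\u0636\u0637\u2069:\u2067_\u06353\u0636\u0637\u2069 "
    "\u2067_\u06354\u0636\u0637\u2069=\u2067_\u06355\u0636\u0637\u2069\u2069}"
)

labeled = label_controls(example)
logical = label_controls(strip_controls(example))
```

Here `labeled` reproduces the bracketed form of the shorter EAO example, and `logical` is the placeholder with all controls taken out.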

Here are the other ABNF changes:

```abnf
; strongly directional marks and bidi isolates
@@ -447,7 +472,7 @@ s = ( SP / HTAB / CR / LF / %x3000 )
### Strict isolation all the time

Apply bidi isolates in a strict way.
The main differences to the proposed solution is:
In this design:
1. The open/close isolate characters are syntactically required to be paired.
This introduces parse errors for unpaired invisible characters,
which could lead to bad user experiences.
@@ -467,7 +492,7 @@ markup = "{" [s] "#" identifier [bidi] *(s option) *(s attribute) [s] ["
/ "{" [s] "/" identifier [bidi] *(s option) *(s attribute) [s] "}" ; close
/ "{" LRI [s] "/" identifier [bidi] *(s option) *(s attribute) [s] close-isolate "}" ; close
identifier = [(namespace ns-separator)] name
ns-separator = [bidi] ":"
ns-separator = [bidi] ":" [bidi]
bidi = [ %x200E-200F / %x061C ]
```
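
To make the revised `ns-separator` production concrete, here is a rough sketch of how a parser might recognize it with a regular expression. This is an illustration, not the reference implementation: `split_identifier` is a hypothetical name, and the sketch does not validate the `namespace` or `name` parts against their own productions.

```python
import re

# bidi = [ %x200E-200F / %x061C ]  (LRM, RLM, ALM)
BIDI = "[\u200e\u200f\u061c]"

# ns-separator = [bidi] ":" [bidi]
NS_SEPARATOR = re.compile(f"{BIDI}?:{BIDI}?")


def split_identifier(identifier: str):
    """Split 'namespace:name' at the first ns-separator.

    Marks adjacent to the colon are treated as part of the separator,
    so they end up in neither the namespace nor the name.
    Returns (namespace or None, name).
    """
    parts = NS_SEPARATOR.split(identifier, maxsplit=1)
    if len(parts) == 2:
        return parts[0], parts[1]
    return None, parts[0]
```

With this reading, `ns:name`, `ns\u200e:name`, and `ns\u200e:\u200ename` all yield the same namespace and name, which matches the intent that the marks around the colon belong to the separator.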

@@ -610,6 +635,8 @@ adherence to the stricter grammar.
syntax errors
- Provides a foundation for tools to claim strict conformance and message normalization
as well as guidance to implementers to make them want to adopt it
- Messages are valid while being edited (such as when the open or close isolate has been
inserted but the corresponding opposite isolate hasn't been entered yet)
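
The point about messages staying valid during editing can be illustrated with a simple pairing check. The sketch below is a simplification and an assumption, not the strict grammar itself: it only tests whether isolate characters are balanced, whereas the strict productions also constrain where isolates may appear. `isolates_balanced` is a hypothetical name.

```python
OPEN_ISOLATES = {"\u2066", "\u2067", "\u2068"}  # LRI, RLI, FSI
PDI = "\u2069"


def isolates_balanced(text: str) -> bool:
    """Return True if every open isolate has a matching PDI, and vice versa."""
    depth = 0
    for ch in text:
        if ch in OPEN_ISOLATES:
            depth += 1
        elif ch == PDI:
            if depth == 0:
                return False  # PDI with no open isolate before it
            depth -= 1
    return depth == 0


finished = "{\u2066$name\u2069}"   # LRI ... PDI, both typed
mid_edit = "{\u2066$name}"         # LRI typed, PDI not entered yet
```

A parser that required pairing would reject `mid_edit`, while the looser grammar accepts it as a message. A linter or serializer could still run a check like this to flag or repair the unbalanced state.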

**Cons**
- Requires additional effort to maintain the grammar