[DESIGN] Update bidi design document to show proposed design #871

Merged 2 commits on Sep 2, 2024

aphillips
Member

The design I actually think we should adopt is the "hybrid approaches" one. This is a necessary first step on the highway to UAX31 compliance and I think is responsibly contained/managed. It is a hybrid approach, in that it permits testable strict implementations to be created (particularly for message serialization).

This PR consists of moving text around. I added one "pro" to one option also.
@aphillips aphillips added syntax Issues related with MF Syntax design Design principles, decisions labels Aug 26, 2024
Collaborator

@eemeli eemeli left a comment

I have some reservations about taking this direction, but I don't oppose it provided that we express clearly the recommended isolation/marking style.

I also continue to advocate for bidi controls to be available around name, so that it can be isolated from prefix sigils like :, $, @.

But all of that can be addressed separately; this request for changes is about the inline comment below.

Comment on lines 280 to 283
The second part of the hybrid approach would be to recommend ("SHOULD") the "strict isolation"
design for serializers.
This syntax is a subset of the super-loose syntax and can be applied selectively to messages that
have RTL sequences or which have problematic display.
Collaborator

Presuming that this is referring to the "Strict isolation all the time" alternative, its proposed syntax is not a subset of the "Super-loose isolation" syntax, as the former includes at least this rule, which proposes bidi control characters in a non-whitespace position:

identifier     = [(namespace ns-separator)] name
ns-separator   = [bidi] ":"

Member Author

Good catch. I'll add appropriate notes about this and the larger issue of names.

@aphillips aphillips requested a review from eemeli August 26, 2024 23:59
Collaborator

@eemeli eemeli left a comment

This probably works for most real-world cases.

It might be good to explicitly note that with the proposed solution RTL names that don't start with a strongly RTL-directional character are not well supported in positions that use a leading sigil.

This means that when using a name like ⁧_مصر⁩ for a variable (where the _ is the first character of the name), this can only be rendered incorrectly as either $_مصر or ⁧$_مصر⁩.
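A minimal Python sketch of why such a name is awkward, using only the standard `unicodedata` module (the helper name `first_strong` is made up for illustration): the name's first character is not strongly directional, so nothing at the start of the name establishes RTL direction next to the sigil.

```python
import unicodedata

STRONG = {"L", "R", "AL"}  # the strong bidi classes in UAX #9

def first_strong(text):
    """Return (index, bidi class) of the first strongly directional character, or None."""
    for i, ch in enumerate(text):
        cls = unicodedata.bidirectional(ch)
        if cls in STRONG:
            return i, cls
    return None

name = "_\u0645\u0635\u0631"  # the example name, without any isolates
print(unicodedata.bidirectional(name[0]))  # ON: the underscore is a neutral
print(first_strong(name))                  # (1, 'AL'): the first strong character comes second
```

A name that begins with an Arabic or Hebrew letter would report index 0 here, which is the well-supported case.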

but whitespace is not at least optional.
This could be defined as:
```abnf
ns-separator = [bidi] ":" [bidi]
```
Collaborator

Why allow for the bidi after : if we're not allowing it to show up elsewhere after a starting sigil?

Member Author

Your example $_مصر‎⁩ can be made to render correctly using LRI/PDI around the identifier (which the syntax allows and the strict form encourages). The [bidi] production allows starting or ending strongly directional marks to prevent spillover in the middle of a namespaced identifier, e.g. $_م1صر:‎_م2صر‎⁩

I'll admit that it's generally better practice to put the strongly directional character at the end (before the : separator), but I didn't want to make the syntax ultra-fussy: whatever stew of strongly directional characters surrounds the colon, none of it is part of either the namespace or the name.

The separator is different from sigils because the sigils are all at token-start, whereas the namespace separator is embedded into the word-token.

code points for the first example: \u2066$_\u0645\u0635\u0631\u200e\u2069
code points for the second example: \u2066$_\u06451\u0635\u0631:\u200e_\u06452\u0635\u0631\u200e\u2069
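To make the invisible characters in these two strings easier to inspect, here is a small sketch using Python's standard `unicodedata` module. The function name `describe` is hypothetical. Note that `\u06451` is U+0645 followed by the digit 1, since a `\u` escape takes exactly four hex digits.

```python
import unicodedata

def describe(text):
    """Return (code point, bidi class, character name) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.bidirectional(ch), unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]

first = "\u2066$_\u0645\u0635\u0631\u200e\u2069"
second = "\u2066$_\u06451\u0635\u0631:\u200e_\u06452\u0635\u0631\u200e\u2069"

for row in describe(first):
    print(*row)
```

The first row printed is `U+2066 LRI LEFT-TO-RIGHT ISOLATE`, and the LRM before the closing PDI shows up with bidi class `L`.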

Collaborator

Your example $_مصر‎⁩ can be made to render correctly using LRI/PDI around the identifier (which the syntax allows and the strict form encourages). [...]

code points for the first example: \u2066$_\u0645\u0635\u0631\u200e\u2069

That's rendering for me with the _ next to the $, which is wrong? The _ here has neutral directionality, and it's the first character of a name which as a whole is RTL. So we ought to have some way to render it as the right-most character, but that's not possible without some bidi control after the $.

Member Author

Fair enough, although a display like $⁧_مصر⁩⁩ is in itself super-weird (and its namespaced friend $⁧_م1صر⁩:⁧_م2صر⁩⁩ is even weirder) from the point of view that it's an RTL isolated run inside an LTR isolated run (or, in the latter case, two RTL runs inside an LTR run). This makes the name tokens display correctly RTL while still forcing the overall token (including the sigil) to display LTR. That is, this sequence:

\u2066$\u2067_\u06451\u0635\u0631\u2069:\u2067_\u06452\u0635\u0631\u2069\u2069

... with 6 of the 18 code points being bidi isolates.
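The count can be checked with a few lines of Python. This is only a sketch; the variable names are illustrative and the set below covers just the four isolate controls.

```python
ISOLATES = {"\u2066", "\u2067", "\u2068", "\u2069"}  # LRI, RLI, FSI, PDI

seq = "\u2066$\u2067_\u06451\u0635\u0631\u2069:\u2067_\u06452\u0635\u0631\u2069\u2069"
isolate_count = sum(ch in ISOLATES for ch in seq)

# prints: 6 of 18 code points are isolate controls
print(f"{isolate_count} of {len(seq)} code points are isolate controls")
```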

There's a similar problem with options, where we're trying to force them to be LTR-ordered (option = value/م1صر‎=م2صر⁩) when RTL really really wants that to be displayed as ⁦م1صر=م2صر⁩.

So you're right: my proposal does compromise some aspects of RTL display as part of our insistence that messages are intended to work in an LTR editing environment with LTR syntax. Either way, logical order is all that matters and users should be careful about using mixed direction for non-literal values. We give them the tools to make their bidi literals and patterns work RTL-normally as well as the tools to let non-RTL speakers read placeholders LTR (like most developers debugging messages). It is not perfect. Do we need that last bit of cruft to allow the sigil and identifier to be separate? (Not asking rhetorically, btw. What do people think? Where can we get developer/translator feedback?)

Note: It took a minute of fiddling to get the namespaced example to not get entangled with the markdown in this comment--all in service of getting the underscore on the right side.

Member

I think the proposed solution solves the 99% case, and trying to solve the remaining 1% will be tricky. Moreover, the people who really have to deal with messages are the translators, and they will need tooling to effectively do their jobs at scale — and that tooling has far more freedom to deal with bidi issues than we have in plain text. So I don't think we need to go any further down this path.

Member Author

Tbf, _ has neutral direction and it is relatively common to use it as a prefix for identifiers.

Yes, I know. So do all of the sigils (with some of them being enclosing punctuation).

I guess my argument would be that the sigil is, from a user perspective, "part of the name", e.g. $foo, not $ with foo being separate. In an RTL display context, the sigil, being a neutral, will naturally reverse sides to be at the start, (e.g. ⁧$_مصر⁩). For us to maintain an LTR bias while making only the name token isolated is a lot more work--for tools and users.

The counterargument would be that identifier and name are their own things (and many are generated from the user's environment). The various sigils and such in MF2 are thus "not part of the name" and should be treated separately. Allowing users to insert these characters is not the same as obligating them to use them and I could see us making the affordance.
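For reference, the bidi classes of the sigils and the underscore can be looked up with Python's standard `unicodedata` module (a sketch; the `classes` variable is just for illustration). Strictly speaking, UAX #9 classifies `$` as ET and `:` as CS, which are weak types rather than neutrals, but when they are not adjacent to a number they resolve to ON (rule W6), so in these identifiers they behave like neutrals. None of them is strongly directional.

```python
import unicodedata

# Bidi class of each sigil-like character discussed in this thread
classes = {ch: unicodedata.bidirectional(ch) for ch in "$:@_"}

for ch, cls in classes.items():
    print(repr(ch), cls)
```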

In the screenshot below, the display is set to RTL. The placeholders have an LRI/PDI inside the {/}.

  • The first line, lacking any other bidi controls, is illegible.
  • The second line uses RLI/PDI around each token. This presents sigils on the right, which an RTL speaker would understand, but the tokens are in LTR order. I don't specifically object to this presentation.
  • The third line uses LRI/PDI around each token and approximates what I'm suggesting above.
  • The fourth line uses the markup you're suggesting, which is RLI/PDI around each name and LRI/PDI around the sigil+identifier.

[image: screenshot of the four example lines rendered in an RTL display]

Here's the data used to produce the screenshot. Note that I do use a U+200E (LRM) next to the : separator.

Example: {\u2066$_\u06351\u0636\u0637 :_\u06352\u0636\u0637:_\u06353\u0636\u0637 _\u06354\u0636\u0637=_\u06355\u0636\u0637\u2069}<br>
Example RLI: {\u2066\u2067$_\u06351\u0636\u0637\u2069 \u2067:_\u06352\u0636\u0637:_\u06353\u0636\u0637\u2069 \u2067_\u06354\u0636\u0637\u2069=\u2067_\u06355\u0636\u0637\u2069\u2069}<br>
Example LRI: {\u2066\u2066$_\u06351\u0636\u0637\u2069 \u2066:_\u06352\u0636\u0637\u200e:_\u06353\u0636\u0637\u2069 \u2066_\u06354\u0636\u0637\u2069=\u2066_\u06355\u0636\u0637\u2069\u2069}<br>
Example EAO: {\u2066\u2066$\u2067_\u06351\u0636\u0637\u2069\u2069 \u2066:\u2067_\u06352\u0636\u0637\u2069\u200e:\u2067_\u06353\u0636\u0637\u2069\u2069 \u2067_\u06354\u0636\u0637\u2069=\u2067_\u06355\u0636\u0637\u2069\u2069}<br>
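As a rough illustration of how a serializer might generate the "RLI" and "LRI" variants above, here is a Python sketch. The helper names (`isolate`, `name`, `placeholder`) are hypothetical and are not part of any MessageFormat implementation; the token layout simply mirrors the test data.

```python
LRI, RLI, PDI, LRM = "\u2066", "\u2067", "\u2069", "\u200e"

def isolate(text, rtl):
    """Wrap text in RLI...PDI (rtl=True) or LRI...PDI (rtl=False)."""
    return (RLI if rtl else LRI) + text + PDI

def name(n):
    # Test names from the data above: "_", SAD, a digit, DAD, TAH
    return f"_\u0635{n}\u0636\u0637"

def placeholder(rtl, separator=":"):
    """Build the test placeholder with every token isolated in one direction."""
    tokens = (
        isolate("$" + name(1), rtl)
        + " " + isolate(":" + name(2) + separator + name(3), rtl)
        + " " + isolate(name(4), rtl) + "=" + isolate(name(5), rtl)
    )
    # The placeholder contents are always wrapped in LRI/PDI inside the braces
    return "{" + isolate(tokens, rtl=False) + "}"

example_rli = placeholder(rtl=True)
example_lri = placeholder(rtl=False, separator=LRM + ":")
```

The "EAO" variant nests an RLI/PDI pair around each name inside the LRI-isolated token, so it would need a second level of wrapping that this sketch leaves out.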

Collaborator

The various sigils and such in MF2 are thus "not part of the name" and should be treated separately.

That would be my position. In particular for variables, AFAIK all current implementations would require parameters like { foo: 42 } in order to resolve a $foo, and not e.g. { $foo: 42 }.

In case it matters, here's a slightly shorter way to get the same results as in the last example (The LR isolation around the sigil+identifier isn't required as the placeholder already isolates its contents, and the LRM next to the : has no effect):

Example EAO: {\u2066$\u2067_\u06351\u0636\u0637\u2069 :\u2067_\u06352\u0636\u0637\u2069:\u2067_\u06353\u0636\u0637\u2069 \u2067_\u06354\u0636\u0637\u2069=\u2067_\u06355\u0636\u0637\u2069\u2069}

Member Author

In case it matters, here's a slightly shorter way to get the same results as in the last example (The LR isolation around the sigil+identifier isn't required as the placeholder already isolates its contents, and the LRM next to the : has no effect):

Your formulation is correct. Note that the LRM is necessary if we don't permit isolates inside the identifier production (which is not what you are proposing), because the : is a neutral and wants to extend whatever run is to either side of it.

Collaborator

I think something got damaged somewhere in the several copy/paste operations.

To make things readable I replaced:

\u0635 A
\u0636 B
\u0637 C

\u2066 [LRI]
\u2069 [PDI]
\u200E [LRM]
\u2067 [RLI]

And the resulting strings above are:

Example: {[LRI]$_A1BC :_A2BC:_A3BC _A4BC=_A5BC[PDI]}<br>
Example RLI: {[LRI][RLI]$_A1BC[PDI] [RLI]:_A2BC:_A3BC[PDI] [RLI]_A4BC[PDI]=[RLI]_A5BC[PDI][PDI]}<br>
Example LRI: {[LRI][LRI]$_A1BC[PDI] [LRI]:_A2BC[LRM]:_A3BC[PDI] [LRI]_A4BC[PDI]=[LRI]_A5BC[PDI][PDI]}<br>
Example EAO: {[LRI][LRI]$[RLI]_A1BC[PDI][PDI] [LRI]:[RLI]_A2BC[PDI][LRM]:[RLI]_A3BC[PDI][PDI] [RLI]_A4BC[PDI]=[RLI]_A5BC[PDI][PDI]}<br>

Example EAO: {[LRI]$[RLI]_A1BC[PDI] :[RLI]_A2BC[PDI]:[RLI]_A3BC[PDI] [RLI]_A4BC[PDI]=[RLI]_A5BC[PDI][PDI]}

Taking out the bidi control characters we are left with

{$_A1BC :_A2BC:_A3BC _A4BC=_A5BC}
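The substitution described above is easy to reproduce. A minimal Python sketch (the function names `annotate` and `strip_controls` are illustrative), applied to the shorter EAO string:

```python
LABELS = {"\u2066": "[LRI]", "\u2067": "[RLI]", "\u2069": "[PDI]", "\u200e": "[LRM]"}
LETTERS = {"\u0635": "A", "\u0636": "B", "\u0637": "C"}

def annotate(text):
    """Replace bidi controls with bracketed labels and the Arabic letters with A, B, C."""
    return "".join(LABELS.get(ch, LETTERS.get(ch, ch)) for ch in text)

def strip_controls(text):
    """Drop the bidi controls entirely, keeping the letter substitution."""
    return "".join(LETTERS.get(ch, ch) for ch in text if ch not in LABELS)

eao = (
    "{\u2066$\u2067_\u06351\u0636\u0637\u2069 "
    ":\u2067_\u06352\u0636\u0637\u2069:\u2067_\u06353\u0636\u0637\u2069 "
    "\u2067_\u06354\u0636\u0637\u2069=\u2067_\u06355\u0636\u0637\u2069\u2069}"
)

print(annotate(eao))
print(strip_controls(eao))  # {$_A1BC :_A2BC:_A3BC _A4BC=_A5BC}
```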

That is not valid MF2 syntax.
So I don't know what this is trying to fix.


But yesterday I played a bit and put together a web page that one can use to interactively play with this.

https://mihai-nita.net/tmp/mf2bidi.html

And I argue that:

  • isolates are enough (FSI, LRI, RLI, PDI)
  • there is no need for any control character between $ and the name proper, even when there is an _ there.

Member Author

That is not valid MF2 syntax.

Um... how is it not valid?

$_A1BC is a valid operand.
:_A2BC is a valid namespace. :_A3BC is a valid function name.
_A4BC is a valid option name.
_A5BC is a valid unquoted literal.

It's even a valid message, in that a simple message might consist solely of a placeholder.

What am I missing?
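One way to sanity-check the breakdown above is a deliberately simplified matcher. This Python sketch is an assumption-laden approximation, not the MF2 ABNF: `NAME` is an ASCII-only stand-in for the real `name` production, only a variable operand followed by a function annotation is handled, and option values are limited to names or variables.

```python
import re

NAME = r"[A-Za-z_][A-Za-z0-9_.-]*"   # ASCII-only stand-in for the name production
IDENT = rf"(?:{NAME}:)?{NAME}"        # optional namespace, then name

PLACEHOLDER = re.compile(
    rf"\{{\s*\$(?P<var>{NAME})"                       # variable operand
    rf"\s+:(?P<func>{IDENT})"                         # function annotation
    rf"(?P<opts>(?:\s+{IDENT}\s*=\s*\$?{NAME})*)"     # zero or more options
    rf"\s*\}}"
)

m = PLACEHOLDER.fullmatch("{$_A1BC :_A2BC:_A3BC _A4BC=_A5BC}")
if m:
    print(m.group("var"), m.group("func"), m.group("opts").split())
```

Under these assumptions the string splits into the operand `_A1BC`, the namespaced function `_A2BC:_A3BC`, and one option `_A4BC=_A5BC`.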

@mihnita mihnita added the blocker-candidate The submitter thinks this might be a block for the Technology Preview label Aug 28, 2024
@aphillips
Member Author

Notes from 2024-09-02 call: need to add text about ALM (U+061C ARABIC LETTER MARK); look up ALM end-effects.

@aphillips aphillips merged commit 6657b80 into main Sep 2, 2024
1 check passed
@aphillips aphillips deleted the aphillips-bidi-design-proposal branch September 2, 2024 16:42
Labels
blocker-candidate The submitter thinks this might be a block for the Technology Preview design Design principles, decisions syntax Issues related with MF Syntax
4 participants