Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add bidi support and address UAX31/UTS55 requirements #884

Merged
merged 28 commits into from
Sep 16, 2024

Conversation

aphillips
Copy link
Member

@aphillips aphillips commented Sep 11, 2024

Adds the bidi strong marks ALM, RLM, and LRM plus the bidi isolate controls LRI, RLI, FSI, and PDI to the syntax.

Formally defines optional vs. non-optional whitespace.

Non-optional whitespace must include at least one whitespace character. Optional whitespace may contain only bidi marks (which are invisible).

This replaces PR#673. This fixes #661, #847.

TODO: Add guidance on "strict" bidi.

Adds the bidi strong marks ALM, RLM, and LRM plus the bidi isolate controls LRI, RLI, FSI, and PDI to the syntax.

Formally defines optional vs. non-optional whitespace.

Non-optional whitespace must include at least one whitespace character. Optional whitespace may contain only bidi marks (which are invisible)
Include ALM and better specify how to use the marks.
Add optional whitespace at the start of `variant`

Add optional whitespace around `quoted-pattern`

These changes result in allowing bidi around keys and quoted patterns as intended.
@aphillips aphillips marked this pull request as ready for review September 11, 2024 16:39
@aphillips aphillips added syntax Issues related with MF Syntax normative LDML46 LDML46 Release (Tech Preview - October 2024) labels Sep 11, 2024
@aphillips
Copy link
Member Author

@eggrobin Pinging you to look at this, which reimplements the text you previously reviewed in our #673

Copy link
Member

@eggrobin eggrobin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good. I see you are referring to UTS #‌55 in the PR title; in case you want to mention that you are following its recommendation in some informative part of the text, the relevant bit is Section 3.2, Whitespace and Syntax.

- Add a note about the difference between formatting and message syntax.
- Clarify the sentence about message directionality.
Copy link
Collaborator

@eemeli eemeli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed previously e.g. in #871 (comment), I believe that we need to also allow for a bidi character before name and before the namespace delimiter.

Theses control characters would not be considered as a part of the parsed value of the name, much like the | delimiters are not a part of the parsed value of a quoted-literal.

-identifier = [namespace ":"] name
+identifier = [namespace [bidi] ":"] name
 namespace  = name
-name = name-start *name-char
+name = [bidi] name-start *name-char

spec/message.abnf Outdated Show resolved Hide resolved
spec/syntax.md Outdated Show resolved Hide resolved
spec/message.abnf Show resolved Hide resolved
spec/syntax.md Outdated
the character `U+3000 IDEOGRAPHIC SPACE`
_is_ interpreted as whitespace,
and the directional isolates U+2066..U+2069
are treated as ignorable format controls.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So is U+061C, I think?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, and LRM/RLM

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, but LRM and RLM are part of R3a-1, see the first note under https://www.unicode.org/reports/tr31/#R3a.

The profile includes adding U+061C to the ignorable format controls, not just the isolates.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The second note talks about ALM, including:

If it is added to the set of whitespace characters by a profile, it is interpreted as an ignorable format control.

In any case, I now have a list of ignorable format controls, which might be overkill, but saves reading rule R3a 😉.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it is added to the set of whitespace characters by a profile, it is interpreted as an ignorable format control.

Indeed, but you still need to say that the profile adds it!

I agree that listing the set is probably better than the diff at this point.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could also not allow ALM as a bidi character, as there should be no place in the syntax where an RLM couldn't be used just as well.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel uncomfortable removing bidi marks. The ALM was added many years after RLM/LRM and its differences with RLM are minor. But we want bidi language users to have the tools they need to make things look right (and still be functional). I would hate to remove it because we need to add a couple of words to the spec.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My main concern here is that treating it as an ignorable format control requires us to deviate further from the XML name production.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this change is the right thing to do.

ALM is an invisible, default-ignorable, non-spacing code point. As noted elsewhere, it was added to Unicode after XML/XMLName were defined. According to XML's rules, an ALM all-by-itself is a valid identifier. That seems like a bug, not a feature. Maybe we should call out the deviation more clearly and maybe (wearing my other chair hat) W3C should be called on to do an erratum.

@macchiati Any thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I strongly agree; it should be added.

spec/syntax.md Outdated Show resolved Hide resolved
spec/syntax.md Outdated Show resolved Hide resolved
spec/syntax.md Outdated
Comment on lines 840 to 841
_Messages_ that contain right-to-left (aka RTL) characters SHOULD use one of the
following mechanisms to make messages display intelligibly in plain-text editors:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before getting much deeper into discussing the mechanisms, could you clarify if this is intended to be the canonical/recommended way of isolating or marking messages with RTL contents that we've discussed, or is that something that'll be provided separately?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, that's separate.

spec/syntax.md Outdated Show resolved Hide resolved
spec/message.abnf Outdated Show resolved Hide resolved
spec/message.abnf Outdated Show resolved Hide resolved
Copy link
Collaborator

@eemeli eemeli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See inline for a few final nitpicks, but I think this is good to go.

spec/message.abnf Outdated Show resolved Hide resolved
spec/syntax.md Outdated Show resolved Hide resolved
spec/syntax.md Outdated Show resolved Hide resolved
aphillips and others added 3 commits September 12, 2024 15:58
Co-authored-by: Eemeli Aro <[email protected]>
Co-authored-by: Eemeli Aro <[email protected]>
Co-authored-by: Eemeli Aro <[email protected]>
@aphillips aphillips merged commit c3d06cf into main Sep 16, 2024
1 check passed
@aphillips aphillips deleted the aphillips-implement-bidi branch September 16, 2024 17:20
eemeli added a commit to messageformat/messageformat that referenced this pull request Sep 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
LDML46 LDML46 Release (Tech Preview - October 2024) normative specification syntax Issues related with MF Syntax
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEEDBACK] Improving bidi layout
5 participants