Add bidi support and address UAX31/UTS55 requirements #884
Adds the bidi strong marks ALM, RLM, and LRM plus the bidi isolate controls LRI, RLI, FSI, and PDI to the syntax. Formally defines optional vs. non-optional whitespace. Non-optional whitespace must include at least one whitespace character. Optional whitespace may contain only bidi marks (which are invisible).
Include ALM and better specify how to use the marks.
Add optional whitespace at the start of `variant`. Add optional whitespace around `quoted-pattern`. These changes allow bidi around keys and quoted patterns as intended.
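The distinction between optional and non-optional whitespace described above can be sketched roughly as follows. This is an illustration only, not the spec's ABNF: the helper names are hypothetical, and the whitespace set (space, tab, CR, LF, and `U+3000`) is an assumption; only `U+3000` and the bidi characters are named in this thread.

```python
# Bidi marks and isolates named in the PR description:
# ALM, LRM, RLM, and LRI, RLI, FSI, PDI.
BIDI = {"\u061C", "\u200E", "\u200F", "\u2066", "\u2067", "\u2068", "\u2069"}

# Assumed whitespace set for illustration; the real set is defined by the spec.
WS = {" ", "\t", "\r", "\n", "\u3000"}


def is_optional_ws(run: str) -> bool:
    """Optional whitespace: any mix of whitespace and bidi characters,
    including the empty string or a run made only of bidi marks."""
    return all(ch in WS or ch in BIDI for ch in run)


def is_required_ws(run: str) -> bool:
    """Non-optional whitespace: same characters allowed, but at least
    one of them must be an actual whitespace character."""
    return is_optional_ws(run) and any(ch in WS for ch in run)
```

Under these assumptions, a lone RLM is acceptable where whitespace is optional, but it does not satisfy a position where whitespace is required.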
This looks good. I see you are referring to UTS #55 in the PR title; in case you want to mention that you are following its recommendation in some informative part of the text, the relevant bit is Section 3.2, Whitespace and Syntax.
- Add a note about the difference between formatting and message syntax.
- Clarify the sentence about message directionality.
As discussed previously, e.g. in #871 (comment), I believe that we also need to allow for a bidi character before `name` and before the namespace delimiter.
These control characters would not be considered part of the parsed value of the `name`, much like the `|` delimiters are not part of the parsed value of a `quoted-literal`.
-identifier = [namespace ":"] name
+identifier = [namespace [bidi] ":"] name
namespace = name
-name = name-start *name-char
+name = [bidi] name-start *name-char
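A minimal sketch of how a parser might treat the suggested productions, assuming the bidi characters are consumed but dropped from the parsed value. The `NAME` pattern is a simplified ASCII-only stand-in for `name-start *name-char`, and `parse_identifier` is a hypothetical helper, not part of any implementation.

```python
import re

# ALM, LRM, RLM, and the isolates U+2066..U+2069, as a character-class body.
BIDI = "\u061C\u200E\u200F\u2066-\u2069"

# Simplified stand-in for `name-start *name-char` (ASCII only, an assumption).
NAME = r"[A-Za-z_][A-Za-z0-9_.-]*"

# identifier = [namespace [bidi] ":"] name
# name       = [bidi] name-start *name-char
IDENTIFIER = re.compile(
    rf"(?:[{BIDI}]?(?P<namespace>{NAME})[{BIDI}]?:)?[{BIDI}]?(?P<name>{NAME})"
)


def parse_identifier(text: str):
    """Return (namespace, name) without bidi controls, or None if no match."""
    match = IDENTIFIER.fullmatch(text)
    if match is None:
        return None
    return match["namespace"], match["name"]
```

Because the capture groups cover only the name characters, the controls never reach the parsed value, in the same way the `|` delimiters are not part of a `quoted-literal`.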
spec/syntax.md
Outdated
the character `U+3000 IDEOGRAPHIC SPACE`
_is_ interpreted as whitespace,
and the directional isolates U+2066..U+2069
are treated as ignorable format controls.
So is U+061C, I think?
Yes, and LRM/RLM
Right, but LRM and RLM are part of R3a-1, see the first note under https://www.unicode.org/reports/tr31/#R3a.
The profile includes adding U+061C to the ignorable format controls, not just the isolates.
The second note talks about ALM, including:
> If it is added to the set of whitespace characters by a profile, it is interpreted as an ignorable format control.
In any case, I now have a list of ignorable format controls, which might be overkill, but saves reading rule R3a 😉.
> If it is added to the set of whitespace characters by a profile, it is interpreted as an ignorable format control.
Indeed, but you still need to say that the profile adds it!
I agree that listing the set is probably better than the diff at this point.
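For reference, the set being discussed in this thread could be listed and sanity-checked roughly like this. It is a sketch that only confirms the code points, names, and general category against Python's bundled Unicode database; it says nothing about what the spec text should contain, and `all_are_format_controls` is a hypothetical helper.

```python
import unicodedata

# Code points discussed in this thread: ALM, LRM, RLM, and the isolates.
IGNORABLE_FORMAT_CONTROLS = {
    0x061C: "ARABIC LETTER MARK",
    0x200E: "LEFT-TO-RIGHT MARK",
    0x200F: "RIGHT-TO-LEFT MARK",
    0x2066: "LEFT-TO-RIGHT ISOLATE",
    0x2067: "RIGHT-TO-LEFT ISOLATE",
    0x2068: "FIRST STRONG ISOLATE",
    0x2069: "POP DIRECTIONAL ISOLATE",
}


def all_are_format_controls() -> bool:
    """Check that every listed code point has the expected name and
    the general category Cf (format)."""
    return all(
        unicodedata.name(chr(cp)) == name
        and unicodedata.category(chr(cp)) == "Cf"
        for cp, name in IGNORABLE_FORMAT_CONTROLS.items()
    )
```

By contrast, `U+3000 IDEOGRAPHIC SPACE` has general category Zs, which fits its treatment as whitespace rather than as a format control.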
We could also not allow ALM as a `bidi` character, as there should be no place in the syntax where an RLM couldn't be used just as well.
I feel uncomfortable removing bidi marks. The ALM was added many years after RLM/LRM and its differences with RLM are minor. But we want bidi language users to have the tools they need to make things look right (and still be functional). I would hate to remove it because we need to add a couple of words to the spec.
My main concern here is that treating it as an ignorable format control requires us to deviate further from the XML name production.
I think this change is the right thing to do.
ALM is an invisible, default-ignorable, non-spacing code point. As noted elsewhere, it was added to Unicode after XML/XMLName were defined. According to XML's rules, an ALM all-by-itself is a valid identifier. That seems like a bug, not a feature. Maybe we should call out the deviation more clearly and maybe (wearing my other chair hat) W3C should be called on to do an erratum.
@macchiati Any thoughts?
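The point about XML can be illustrated with a small check against the `NameStartChar` ranges from XML 1.0 (Fifth Edition). This is a sketch for illustration; `is_xml_name_start` is a hypothetical helper, and the ranges are transcribed from the XML production rather than taken from this PR.

```python
# NameStartChar ranges from the XML 1.0 (Fifth Edition) Name production.
XML_NAME_START_RANGES = [
    (0x3A, 0x3A), (0x41, 0x5A), (0x5F, 0x5F), (0x61, 0x7A),
    (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x2FF), (0x370, 0x37D),
    (0x37F, 0x1FFF), (0x200C, 0x200D), (0x2070, 0x218F),
    (0x2C00, 0x2FEF), (0x3001, 0xD7FF), (0xF900, 0xFDCF),
    (0xFDF0, 0xFFFD), (0x10000, 0xEFFFF),
]


def is_xml_name_start(ch: str) -> bool:
    """True if the single character ch may start an XML Name."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in XML_NAME_START_RANGES)
```

ALM (`U+061C`) falls inside the broad `U+037F..U+1FFF` range, so it can start an XML name on its own, while LRM, RLM, and the isolates fall outside every range.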
I strongly agree; it should be added.
spec/syntax.md
Outdated
_Messages_ that contain right-to-left (aka RTL) characters SHOULD use one of the
following mechanisms to make messages display intelligibly in plain-text editors:
Before getting much deeper into discussing the mechanisms, could you clarify if this is intended to be the canonical/recommended way of isolating or marking messages with RTL contents that we've discussed, or is that something that'll be provided separately?
No, that's separate.
Co-authored-by: Eemeli Aro <[email protected]>
See inline for a few final nitpicks, but I think this is good to go.
Co-authored-by: Eemeli Aro <[email protected]>
Adds the bidi strong marks ALM, RLM, and LRM plus the bidi isolate controls LRI, RLI, FSI, and PDI to the syntax.
Formally defines optional vs. non-optional whitespace.
Non-optional whitespace must include at least one whitespace character. Optional whitespace may contain only bidi marks (which are invisible).
This replaces PR#673. This fixes #661, #847.
TODO: Add guidance on "strict" bidi.