Is there a sanctioned way to reference a code point with its official name? #91

Closed
gibson042 opened this issue Jun 15, 2023 · 7 comments · Fixed by #92

Comments

@gibson042
Contributor

The README mentions two ways to reference a Unicode code point, but fails to adequately specify them:

  • An abbreviation for a Unicode Code point, of the form <NBSP>
  • A Unicode code point, of the form U+00A0

grammarkdown.grammar doesn't mention the latter at all, and implicitly defines the former as one or more non-< non-> non-|LineTerminator| code points in between < and >. As for the implementation, scanner.ts uses scanString(CharacterCodes.GreaterThan, …), which pays special attention only to line terminators and >—and in particular allows < when represented as a character reference like &lt; in e.g.

Nonterminal :::
  &lt;foo&lt;bar&gt;

scanner.ts also handles the second form upon encountering "U+" or "u+" followed by four hexadecimal digits (and notably not working for supplementary-plane characters such as U+1D306 TETRAGRAM FOR CENTRE "𝌆").
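To make the supplementary-plane limitation concrete, here is a minimal sketch of a scan that stops after exactly four hexadecimal digits. It is a hypothetical stand-in for the behaviour described above, not the actual scanner.ts code; the function name `scanFourDigitCodePoint` and its return shape are assumptions made for illustration.

```typescript
// Hypothetical illustration only: mimics a scan that accepts "U+" or "u+"
// followed by exactly four hexadecimal digits, as described in the issue.
function scanFourDigitCodePoint(
  text: string
): { codePoint: number; length: number } | undefined {
  const match = /^[Uu]\+([0-9A-Fa-f]{4})/.exec(text);
  const digits = match?.[1];
  if (digits === undefined) return undefined;
  // length counts the "U+" prefix plus the four digits that were consumed.
  return { codePoint: parseInt(digits, 16), length: 2 + digits.length };
}

// For a supplementary-plane input, only "U+1D30" is consumed, so the scan
// yields U+1D30 and leaves the final "6" behind as unrelated text.
const truncated = scanFourDigitCodePoint("U+1D306");
console.log(truncated?.codePoint.toString(16), truncated?.length);
```

Under that assumption, a value such as U+1D306 cannot round-trip, which is why a four-to-six digit form is needed.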


This is relevant because I want to express a nonterminal like <U+2212 MINUS SIGN>. The documentation here does not make clear whether that form is valid, and in practice it is accepted by ecma262 build:spec but rejected by esmeta (cf. tc39/ecma262@cc5e203 and https://github.com/tc39/ecma262/actions/runs/5270397258/jobs/9529840136?pr=3098).

Ideally, we'd end up with alignment between documentation and implementation on a form that represents a single code point in any Unicode plane by its hexadecimal value plus descriptive explanatory text (generally its name in the Unicode Character Database), e.g.

A single Unicode code point may be specified using one of the following forms:

  • U+ followed by four to six non-lowercase hexadecimal digits with no leading zeroes other than those necessary for padding to a minimum of four digits, in accordance with The Unicode Standard, Version 15.0.0, Appendix A, Notational Conventions (i.e., matching Unicode extended BNF pattern "U+" ( [1-9 A-F] | "10" )? H H H H or regular expression pattern ^U[+]([1-9A-F]|10)?[0-9A-F]{4}$ as in U+00A0 or U+1D306)
  • The preceding representation followed by a space and a printable ASCII prose explanation (such as a character name) free of < and > and line terminators, all wrapped in < and > (i.e., matching Unicode extended BNF pattern "<" "U+" ( [1-9 A-F] | "10" )? H H H H " " [\u0020-\u007E -- [<>]]+ ">" or regular expression pattern ^<U[+]([1-9A-F]|10)?[0-9A-F]{4} [\x20-\x3b\x3d\x3f-\x7e]+>$ as in <U+2212 MINUS SIGN>)
  • An abbreviation defined somewhere outside the grammar as an ASCII identifier name (i.e., matching Unicode extended BNF pattern [A-Z a-z _] [A-Z a-z _ 0-9]* or regular expression pattern ^[A-Za-z_][A-Za-z_0-9]*$ as in <NBSP>)
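The three proposed forms can be checked mechanically. The sketch below reuses the regular expression patterns quoted in the list above; the constant names, the `CodePointForm` type, and the `classifyCodePointReference` function are hypothetical and are not part of grammarkdown. It assumes the abbreviation pattern applies to the text between `<` and `>`.

```typescript
// Patterns copied from the proposal; all identifiers here are illustrative.
const BARE_CODE_POINT = /^U[+]([1-9A-F]|10)?[0-9A-F]{4}$/;
const DESCRIBED_CODE_POINT =
  /^<U[+]([1-9A-F]|10)?[0-9A-F]{4} [\x20-\x3b\x3d\x3f-\x7e]+>$/;
const ABBREVIATION_NAME = /^[A-Za-z_][A-Za-z_0-9]*$/;

type CodePointForm = "bare" | "described" | "abbreviation" | "invalid";

function classifyCodePointReference(text: string): CodePointForm {
  if (BARE_CODE_POINT.test(text)) return "bare";
  if (DESCRIBED_CODE_POINT.test(text)) return "described";
  // Assumption: an abbreviation is an ASCII identifier wrapped in < and >.
  if (
    text.startsWith("<") &&
    text.endsWith(">") &&
    ABBREVIATION_NAME.test(text.slice(1, -1))
  ) {
    return "abbreviation";
  }
  return "invalid";
}

for (const sample of ["U+00A0", "U+1D306", "<U+2212 MINUS SIGN>", "<NBSP>", "u+00a0", "<foo<bar>"]) {
  console.log(sample, "->", classifyCodePointReference(sample));
}
```

Because each form has its own anchored pattern, a consumer can tell the forms apart without guessing at the contents of a `<…>` literal.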
@rbuckton
Owner

I have a fix inbound for 5 and 6 digit hexadecimal codes, as well as a fix to correctly disallow < (including HTML entity encoded <) inside of a <>-wrapped character literal.

Currently the <>-wrapped format just indicates some prose to dictate the character and there are no other restrictions to content aside from disallowing < and > without escaping using \. I left the <> format open-ended to be flexible with any other future consumer of grammarkdown that may choose to define characters in some other way.

@rbuckton
Owner

The grammarkdown.grammar file is fairly out of date, unfortunately, and is only used as part of tests at the moment. I'll look into updating the README.md to be a bit clearer on what is supported for Unicode characters.

@gibson042
Contributor Author

> I have a fix inbound for 5 and 6 digit hexadecimal codes, as well as a fix to correctly disallow < (including HTML entity encoded <) inside of a <>-wrapped character literal.

👍

> Currently the <>-wrapped format just indicates some prose to dictate the character and there are no other restrictions to content aside from disallowing < and > without escaping using \. I left the <> format open-ended to be flexible with any other future consumer of grammarkdown that may choose to define characters in some other way.

Thanks for the clarification (which probably belongs in the README as well). That kind of generality can certainly be convenient, but it comes with costs (such as there being no reliable way for machines to consume <…> nonterminals). It also doesn't address what I'm looking for, which is a way to express a specific code point while also providing an explanation for those poor souls who have not memorized the Unicode Character Database.

@rbuckton
Owner

I can probably make <> a little stricter since a > on its own line can be used for any prose. That will probably end up as a separate fix from the one I'm working on now, though.

@rbuckton
Owner

rbuckton commented Jun 15, 2023

I have a fix for this in #92, although it does not include the more stringent parsing for <> just yet. I've called out the expected format that you suggested above in the README.md, however.

@rbuckton
Owner

The fix in #92 now includes some validation for <U+HHHH description>

@rbuckton
Owner

#92 now includes all of the rules you suggested.
