Is there a sanctioned way to reference a code point with its official name? #91

gibson042 · 2023-06-15T02:05:40Z

The README mentions two ways to reference a Unicode code point, but fails to adequately specify them:

An abbreviation for a Unicode Code point, of the form <NBSP>

A Unicode code point, of the form U+00A0

grammarkdown.grammar doesn't mention the latter at all, and implicitly defines the former as one or more non-< non-> non-|LineTerminator| code points in between < and >. As for the implementation, scanner.ts uses scanString(CharacterCodes.GreaterThan, …), which pays special attention only to line terminators and >, and in particular allows < when represented as a character reference like &lt; in e.g.

Nonterminal :::
  &lt;foo&lt;bar&gt;

scanner.ts also handles the second form upon encountering "U+" or "u+" followed by four hexadecimal digits (and notably does not work for supplementary-plane characters such as U+1D306 TETRAGRAM FOR CENTRE "𝌆").
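To illustrate the limitation, here is a minimal sketch that approximates the described behavior with a regular expression. It is not the actual scanner.ts code: the function name `scanFourHexDigits` and the pattern are hypothetical, and accepting lowercase hexadecimal digits is an assumption.

```typescript
// Hypothetical approximation of "U+" or "u+" followed by exactly four hex digits.
// Assumption: hex digits are accepted in either case.
const FOUR_HEX_DIGITS = /^[Uu]\+[0-9A-Fa-f]{4}/;

interface ScanResult {
  matched: string; // the text consumed as a code point reference
  rest: string;    // whatever is left over after the match
}

function scanFourHexDigits(text: string): ScanResult | undefined {
  const m = FOUR_HEX_DIGITS.exec(text);
  if (!m) return undefined;
  return { matched: m[0], rest: text.slice(m[0].length) };
}

// A BMP code point is consumed in full.
console.log(scanFourHexDigits("U+00A0"));  // matched "U+00A0", rest ""
// A supplementary-plane code point is cut off after four digits.
console.log(scanFourHexDigits("U+1D306")); // matched "U+1D30", rest "6"
```

Under these assumptions, U+1D306 would be read as U+1D30 followed by a stray "6".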

This is relevant because I want to express a nonterminal like <U+2212 MINUS SIGN>, which the documentation here neither clearly allows nor clearly forbids, and which is accepted by ecma262 build:spec while being rejected by esmeta (cf. tc39/ecma262@cc5e203 and https://github.com/tc39/ecma262/actions/runs/5270397258/jobs/9529840136?pr=3098 ).

Ideally, we'd end up with alignment between documentation and implementation on a form that represents a single code point in any Unicode plane by its hexadecimal value plus descriptive explanatory text (generally its name in the Unicode Character Database), e.g.

A single Unicode code point may be specified using one of the following forms:

U+ followed by four to six non-lowercase hexadecimal digits with no leading zeroes other than those necessary for padding to a minimum of four digits, in accordance with The Unicode Standard, Version 15.0.0, Appendix A, Notational Conventions (i.e., matching Unicode extended BNF pattern "U+" ( [1-9 A-F] | "10" )? H H H H or regular expression pattern ^U[+]([1-9A-F]|10)?[0-9A-F]{4}$ as in U+00A0 or U+1D306)

The preceding representation followed by a space and a printable ASCII prose explanation (such as a character name) free of < and > and line terminators, all wrapped in < and > (i.e., matching Unicode extended BNF pattern "<" "U+" ( [1-9 A-F] | "10" )? H H H H " " [\u0020-\u007E -- [<>]]+ ">" or regular expression pattern ^<U[+]([1-9A-F]|10)?[0-9A-F]{4} [\x20-\x3b\x3d\x3f-\x7e]+>$ as in <U+2212 MINUS SIGN>)

An abbreviation defined somewhere outside the grammar as an ASCII identifier name (i.e., matching Unicode extended BNF pattern [A-Z a-z _] [A-Z a-z _ 0-9]* or regular expression pattern ^[A-Za-z_][A-Za-z_0-9]*$ as in <NBSP>)
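The three forms above could be checked with something like the following sketch. The first two regular expressions are copied from the proposed wording. The third wraps the identifier pattern in < and > to match the <NBSP> example, which is my assumption. The names `classifyCodePointReference` and `CodePointForm` are illustrative only and do not exist in grammarkdown.

```typescript
// Patterns taken from the proposed wording above.
const BARE_CODE_POINT = /^U[+]([1-9A-F]|10)?[0-9A-F]{4}$/;
const DESCRIBED_CODE_POINT =
  /^<U[+]([1-9A-F]|10)?[0-9A-F]{4} [\x20-\x3b\x3d\x3f-\x7e]+>$/;
// Assumption: the abbreviation identifier appears wrapped in < and >, as in <NBSP>.
const ABBREVIATION = /^<[A-Za-z_][A-Za-z_0-9]*>$/;

type CodePointForm =
  | "code-point"
  | "described-code-point"
  | "abbreviation"
  | "invalid";

function classifyCodePointReference(text: string): CodePointForm {
  if (BARE_CODE_POINT.test(text)) return "code-point";
  if (DESCRIBED_CODE_POINT.test(text)) return "described-code-point";
  if (ABBREVIATION.test(text)) return "abbreviation";
  return "invalid";
}

console.log(classifyCodePointReference("U+00A0"));              // "code-point"
console.log(classifyCodePointReference("U+1D306"));             // "code-point"
console.log(classifyCodePointReference("<U+2212 MINUS SIGN>")); // "described-code-point"
console.log(classifyCodePointReference("<NBSP>"));              // "abbreviation"
console.log(classifyCodePointReference("u+00a0"));              // "invalid" (lowercase)
console.log(classifyCodePointReference("U+0041A"));             // "invalid" (leading zero)
```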


rbuckton · 2023-06-15T22:03:12Z

I have a fix inbound for 5 and 6 digit hexadecimal codes, as well as a fix to correctly disallow < (including HTML entity encoded <) inside of a <>-wrapped character literal.

Currently the <>-wrapped format just indicates some prose to dictate the character and there are no other restrictions to content aside from disallowing < and > without escaping using \. I left the <> format open-ended to be flexible with any other future consumer of grammarkdown that may choose to define characters in some other way.

rbuckton · 2023-06-15T22:10:36Z

The grammarkdown.grammar file is fairly out of date, unfortunately, and is only used as part of tests at the moment. I'll look into updating the README.md to be a bit clearer on what is supported for unicode characters.

gibson042 · 2023-06-15T22:17:46Z

I have a fix inbound for 5 and 6 digit hexadecimal codes, as well as a fix to correctly disallow < (including HTML entity encoded <) inside of a <>-wrapped character literal.

👍

Currently the <>-wrapped format just indicates some prose to dictate the character and there are no other restrictions to content aside from disallowing < and > without escaping using \. I left the <> format open-ended to be flexible with any other future consumer of grammarkdown that may choose to define characters in some other way.

Thanks for the clarification (which probably belongs in the README as well). That kind of generality can certainly be convenient, but it comes with costs (such as there being no reliable way for machines to consume <…> nonterminals). It also doesn't address what I'm looking for, which is a way to express a specific code point while also providing an explanation for those poor souls who have not memorized the Unicode Character Database.
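As a rough sketch of what machine consumption could look like if the stricter <U+HHHH description> form were adopted, a consumer might extract both the code point and the explanation. The helper `parseDescribedCodePoint` and the `DescribedCodePoint` interface are hypothetical and not part of grammarkdown or esmeta. The pattern is the one proposed earlier in this issue, with capture groups added.

```typescript
interface DescribedCodePoint {
  codePoint: number;   // numeric value of the code point
  description: string; // explanatory text, e.g. the character name
}

// Proposed pattern with capture groups for the hex digits and the description.
const DESCRIBED_FORM =
  /^<U[+]((?:[1-9A-F]|10)?[0-9A-F]{4}) ([\x20-\x3b\x3d\x3f-\x7e]+)>$/;

function parseDescribedCodePoint(text: string): DescribedCodePoint | undefined {
  const m = DESCRIBED_FORM.exec(text);
  if (!m) return undefined;
  return { codePoint: parseInt(m[1], 16), description: m[2] };
}

const parsed = parseDescribedCodePoint("<U+2212 MINUS SIGN>");
if (parsed) {
  // The code point is recovered exactly; the description is for human readers.
  console.log(parsed.codePoint.toString(16), parsed.description); // 2212 MINUS SIGN
  console.log(String.fromCodePoint(parsed.codePoint) === "\u2212"); // true
}
```

Free-form prose such as <some character> yields `undefined` here, which is the ambiguity a stricter format would avoid.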

rbuckton · 2023-06-15T22:23:28Z

I can probably make <> a little stricter since a > on its own line can be used for any prose. That will probably end up as a separate fix from the one I'm working on now, though.

rbuckton · 2023-06-15T22:38:21Z

I have a fix for this in #92, although it does not include the more stringent parsing for <> just yet. I've called out the expected format that you suggested above in the README.md, however.

rbuckton · 2023-06-16T23:10:10Z

The fix in #92 now includes some validation for <U+HHHH description>.

rbuckton · 2023-06-17T01:34:01Z

#92 now includes all of the rules you suggested.

gibson042 mentioned this issue Jun 15, 2023

Unicode terminal parsing does not match grammarkdown es-meta/esmeta#147

Closed

This was referenced Jun 15, 2023

Extend Unicode terminal parsing es-meta/esmeta#148

Merged

Meta: Upgrade ESMeta to v0.3.2 tc39/ecma262#3100

Merged

rbuckton mentioned this issue Jun 15, 2023

Fix unicode literal parsing to allow 5-6 digit U+ sequences and disallow < in abbreviations #92

Merged

rbuckton closed this as completed in #92 Jun 17, 2023

gibson042 mentioned this issue Jun 17, 2023

Fix README explanation of <…> literals #95

Merged
