Allow integer suffixes starting with `e`. #111628

nnethercote · 2023-05-16T03:12:28Z

Integers with arbitrary suffixes are allowed as inputs to proc macros. A number of real-world crates use this capability in interesting ways, as seen in #103872. For example:

- Suffixes representing units, such as `8bits`, `100px`, `20ns`, `30GB`
- CSS hex colours such as `#7CFC00` (LawnGreen)
- UUIDs, e.g. `785ada2c-f2d0-11fd-3839-b3104db0cb68`

The hex cases may be surprising.

- `#7CFC00` is tokenized as a `#` followed by a `7` integer with a `CFC00` suffix.
- `785ada2c` is tokenized as a `785` integer with an `ada2c` suffix.
- `f2d0` is tokenized as an identifier.
- `3839` is tokenized as an integer literal.
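
The digits/suffix split described above can be illustrated with a toy function. This is only a sketch, not rustc's lexer: `split_int_suffix` is a hypothetical name, and it deliberately ignores `_` separators, radix prefixes, floats, and the exponent rule discussed below.

```rust
/// Toy model: split a token into its leading decimal digits and the
/// remaining suffix. Returns `None` if the token does not start with a
/// digit (such a token would lex as an identifier instead).
fn split_int_suffix(token: &str) -> Option<(&str, &str)> {
    // Byte index of the first character that is not an ASCII digit.
    let end = token
        .find(|c: char| !c.is_ascii_digit())
        .unwrap_or(token.len());
    if end == 0 {
        return None;
    }
    Some((&token[..end], &token[end..]))
}
```

Under this model `785ada2c` splits into `785` and `ada2c`, `3839` has an empty suffix, and `f2d0` is not an integer literal at all.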

A proc macro will immediately stringify such tokens and reparse them itself, and so won't care that the token types vary. All suffixes must be consumed by the proc macro, of course; the only suffixes allowed after macro expansion are the numeric ones like u8, i32, and f64.

Currently there is an annoying inconsistency in how integer literal suffixes are handled: no suffix starting with `e` is allowed, because it is interpreted as the start of a float literal exponent. For example:

- Units: `1eV` and `1em`
- CSS colours: `#90EE90` (LightGreen)
- UUIDs: `785ada2c-f2d0-11ed-3839-b3104db0cb68`

In each case, a sequence of digits followed by an 'e' or 'E' followed by a letter results in an "expected at least one digit in exponent" error. This is an annoying inconsistency in general, and a problem in practice. It's likely that some users haven't realized this inconsistency because they've gotten lucky and never used a token with an 'e' that causes problems. Other users have noticed; it's causing problems when embedding DSLs into proc macros, as seen in #111615, where the CSS colours case is causing problems for two different UI frameworks (Slint and Makepad).

We can do better. This commit changes the lexer so that, when it hits a possible exponent, it looks ahead and only produces an exponent if a valid one is present. Otherwise, it produces a non-exponent form, which may be a single token (e.g. 1eV) or multiple tokens (e.g. 1e+a).

Consequences of this:

- All the proc macro problem cases mentioned above are fixed.
- The "expected at least one digit in exponent" error is no longer possible. A few tests that only worked in the presence of that error have been removed.
- The lexer requires unbounded lookahead due to the presence of '_' chars in exponents. E.g. to distinguish `1e+_______3` (a float literal with exponent) from `1e+_______a` (previously invalid, but now tokenised as `1e`, `+`, `_______a`).
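
The lookahead rule described above can be sketched as a small predicate. This is a minimal illustration, not the code in this PR: `starts_with_valid_exponent` is a hypothetical name, and it only inspects the text that follows the integer digits.

```rust
/// Returns true if `rest` (the text right after the integer digits) begins
/// with a valid exponent: `e` or `E`, an optional sign, any number of `_`,
/// and then at least one decimal digit.
fn starts_with_valid_exponent(rest: &str) -> bool {
    let mut chars = rest.chars().peekable();
    match chars.next() {
        Some('e') | Some('E') => {}
        _ => return false,
    }
    // Optional sign.
    if matches!(chars.peek(), Some(&'+') | Some(&'-')) {
        chars.next();
    }
    // Skipping `_` is what makes the lookahead unbounded.
    while chars.peek() == Some(&'_') {
        chars.next();
    }
    // A digit must follow, otherwise this is not an exponent.
    matches!(chars.peek(), Some(c) if c.is_ascii_digit())
}
```

With this rule, the text `e+_______3` following `1` is an exponent, while `e+_______a` and `eV` are not, so the lexer would fall back to the non-exponent form for them.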

This is a backwards compatible language change: all existing valid programs will be treated in the same way, and some previously invalid programs will become valid. The tokens chapter of the language reference (https://doc.rust-lang.org/reference/tokens.html) will need changing to account for this. In particular, the "Reserved forms similar to number literals" section will need updating, and grammar rules involving the SUFFIX_NO_E nonterminal will need adjusting.

Fixes #111615.

r? @ghost

nnethercote · 2023-05-16T03:18:24Z

This will need buy-in from the lang team. I have started a Zulip thread for discussion.

nnethercote · 2023-05-16T07:54:57Z

cc @rust-lang/lang, for obvious reasons.

cc @matklad, in case there are any rust-analyzer considerations.

matklad · 2023-05-16T11:04:44Z

~~No IDE concerns here. Unbounded look ahead in the lexer looks suspicious, but I think it’s actually fine.~~

Actually, no, I think the lookahead would require a small adjustment in the code for incremental relexing:

https://github.com/rust-lang/rust-analyzer/blob/2f8cd66fb4c98026d2bdbdf17270e3472e1ca42a/crates/syntax/src/parsing/reparsing.rs#L35

This is not super-precisely formulated (and probably buggy as-is), but the idea here is that a lot of edits modify just a single token (e.g. the user appending a letter to an identifier), so we should take advantage of that and modify the syntax tree without incremental reparsing, by just replacing a single token.

We do have access to previous token there, so running this lookahead logic there should be possible, just more code.
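
The single-token fast path described above can be sketched roughly as follows. This is a hypothetical illustration, not rust-analyzer's code: `relex_single_token`, `is_word`, and the toy "word token" rule are all assumptions made for the example.

```rust
use std::ops::Range;

/// Toy token rule: a token is a run of ASCII alphanumerics or `_`.
fn is_word(c: char) -> bool {
    c.is_ascii_alphanumeric() || c == '_'
}

/// True if the text starts with a digit, i.e. is number-like in this toy model.
fn starts_with_digit(s: &str) -> bool {
    s.chars().next().map_or(false, |c| c.is_ascii_digit())
}

/// Apply an edit that falls inside one token and return the new token text
/// if the result is still a single token of the same kind. `None` means the
/// fast path does not apply and a slower relex/reparse is needed.
fn relex_single_token(old: &str, range: Range<usize>, insert: &str) -> Option<String> {
    // Reject ranges that do not lie inside the token on char boundaries.
    if range.start > range.end
        || range.end > old.len()
        || !old.is_char_boundary(range.start)
        || !old.is_char_boundary(range.end)
    {
        return None;
    }
    let mut new_text = old.to_string();
    new_text.replace_range(range, insert);

    // The edited text must still be exactly one word-like token.
    if new_text.is_empty() || !new_text.chars().all(is_word) {
        return None;
    }
    // The token kind (number-like vs identifier-like) must not change.
    if starts_with_digit(old) != starts_with_digit(&new_text) {
        return None;
    }
    Some(new_text)
}
```

Appending `d` to `foo` stays on the fast path, while inserting ` bar` does not. With the lookahead proposed in this PR, a real implementation would also have to inspect the preceding tokens, since editing `_a` into `_3` after `1e+` changes how the earlier text lexes.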

It is perhaps worth it to move this incremental re-lexing logic over to rustc code base (with suitable unit tests), to encode the core constraint an IDE needs: “re-lexing can be done incrementally”.

ogoffart · 2023-05-16T12:26:47Z

proc_macro2 will probably need to be adjusted as well.

nnethercote · 2023-05-16T12:29:57Z

> proc_macro2 will probably need to be adjusted as well.

How so? I'm no proc_macro2 expert, but won't the newly accepted tokens just be more tokens, not really any different to existing tokens? E.g. 1eV doesn't seem particularly different to 1mm, once the lexer accepts it.

matklad · 2023-05-16T12:45:15Z

proc macro 2 has another copy of the lexer:

https://github.com/dtolnay/proc-macro2/blob/2c1b1021cff64aa6c29dd2c82bcb87b369013d00/src/parse.rs#L325

nnethercote · 2023-05-16T23:06:25Z

> proc macro 2 has another copy of the lexer:

Looks like its own implementation of the lexer, right?

cc @dtolnay, in that case, for the proc_macro2 perspective.

bors · 2023-05-26T07:08:35Z

☔ The latest upstream changes (presumably #111858) made this pull request unmergeable. Please resolve the merge conflicts.

petrochenkov · 2023-05-29T18:23:51Z

The main requirement from me here is for this change to be compatible with the lexer producing finer-grained tokens for floats (possibly suffixed integers, idents, and punctuation instead of whole floats), as I described on the Zulip thread and in #71322.

Step 1

So I suggest actually implementing that new behavior in the lexer first.

1e2 -> Int(1e2)
1. -> Int(1) Punct(.)
1.2 -> Int(1) Punct(.) Int(2)
1.2e3 -> Int(1) Punct(.) Int(2e3)
1e+2 -> Int(1e) Punct(+) Int(2)
1e+_2 -> Int(1e) Punct(+) Ident(_2)
1.2e+3 -> Int(1) Punct(.) Int(2e) Punct(+) Int(3)
1.2e+_3 -> Int(1) Punct(.) Int(2e) Punct(+) Ident(_3)
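
The mappings above are consistent with a very simple rule: a run of word characters starting with a digit is an `Int`, a run starting with a letter or `_` is an `Ident`, and anything else is a single-character `Punct`. A toy sketch of that rule, assuming ASCII-only input (`Tok` and `lex_fine` are hypothetical names, not part of `rustc_lexer`):

```rust
#[derive(Debug, PartialEq)]
enum Tok {
    Int(String),
    Punct(char),
    Ident(String),
}

/// Characters that may continue an integer (with suffix) or an identifier.
fn is_word(c: char) -> bool {
    c.is_ascii_alphanumeric() || c == '_'
}

/// Fine-grained toy lexer: never produces a float token.
fn lex_fine(src: &str) -> Vec<Tok> {
    let chars: Vec<char> = src.chars().collect();
    let mut toks = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        let c = chars[i];
        if c.is_whitespace() {
            i += 1;
        } else if is_word(c) {
            // Consume the whole run of word characters.
            let start = i;
            while i < chars.len() && is_word(chars[i]) {
                i += 1;
            }
            let text: String = chars[start..i].iter().collect();
            if c.is_ascii_digit() {
                toks.push(Tok::Int(text));
            } else {
                toks.push(Tok::Ident(text));
            }
        } else {
            // Everything else is a single punctuation character.
            toks.push(Tok::Punct(c));
            i += 1;
        }
    }
    toks
}
```

For example, `lex_fine("1.2e+_3")` yields `Int(1) Punct(.) Int(2e) Punct(+) Ident(_3)`, matching the last row above.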

That would be of great help for any future work, and we could publicly expose this lexing mode from rustc_lexer even if rustc_parser is not using it right now.

For compatibility we'd also provide a mode that would immediately glue everything we've just lexed back into a Float token.

Step 2

Then we'd just choose in some cases to not glue everything back, thus fixing #111615.

petrochenkov · 2023-05-30T23:14:11Z

> 1e+_2 -> Int(1e) Punct(+) Ident(_2)
> 1.2e+_3 -> Int(1) Punct(.) Int(2e) Punct(+) Ident(_3)

I strongly suspect that we can unsupport underscores after +/-, thus removing Idents from the equation and leaving only punctuation and (possibly suffixed) integers.
It would be interesting to run this change through crater.

JohnCSimon · 2023-10-01T03:14:47Z

@nnethercote
ping from triage - can you post your status on this PR? There hasn't been an update in a few months. Thanks!

nnethercote · 2023-10-01T06:35:17Z

waiting-on-author is still appropriate. Vadim's suggestion above is for a completely different approach, one that requires much larger changes, and I haven't gotten around to trying it.

Dylan-DPC · 2024-07-28T06:50:28Z

@nnethercote any updates on this? thanks

nnethercote · 2024-08-01T00:20:51Z

I'd still like to fix this, but it's fair to say progress is stalled enough that closing this is reasonable.

rustbot added the S-waiting-on-review and T-compiler labels May 16, 2023

nnethercote marked this pull request as draft May 16, 2023 03:12

nnethercote force-pushed the allow-e-suffixes branch from 6635303 to 6f5d2f6 Compare May 16, 2023 04:25

nnethercote mentioned this pull request May 16, 2023

Allow numeric tokens containing 'e' that aren't exponents be passed to proc macros #111615

Open

nnethercote force-pushed the allow-e-suffixes branch from 6f5d2f6 to 9fa6652 Compare May 16, 2023 07:06

petrochenkov self-assigned this May 16, 2023

nnethercote force-pushed the allow-e-suffixes branch from 9fa6652 to e51cfe6 Compare May 16, 2023 23:01

petrochenkov marked this pull request as ready for review May 29, 2023 18:07

petrochenkov added the S-waiting-on-author label and removed the S-waiting-on-review label May 29, 2023

petrochenkov mentioned this pull request Aug 11, 2023

Disallow leading underscores in float exponents. #114567

Closed

guofoo mentioned this pull request Jan 5, 2024

Error defining colors containing hex digit 'e' in live_design! macro makepad/makepad#343

Open

nnethercote closed this Aug 1, 2024
