Change lexer to treat 'e' after number as suffix unless it is followed by a valid exponent. #79912

richard-uk1 · 2020-12-10T18:55:39Z

This fixes #67544. There will be some regression in diagnostics that need fixing before merge. I want to get feedback before I sink time into this though that the patch might be accepted.

It will fail CI because some compile-fail messages have changed, but I would still like feedback.

I expect if it does get merged, it will be after considerable discussion.

Also requires tests before merge.

Change lexer to treat 'e' after number as part of a suffix unless it is followed by a valid exponent.

rust-highfive · 2020-12-10T18:55:42Z

r? @lcnr

(rust-highfive has picked a reviewer for you, use r? to override)

lcnr · 2020-12-10T19:03:03Z

r? @petrochenkov maybe, they are more knowledgeable about this.

this also requires t-lang signoff

richard-uk1 · 2020-12-11T13:14:20Z

It would be nice to move the exponent parsing into the parser, so all tokens in proc macros get a number followed by a suffix, but this would be a conflagration of concerns.

This isn't really ready for review so I'll close and re-open once it's further along.

TODOs for me

the token 1e+ (1, e, +, <non-digit>) should be the integer 1with the suffixe, a +` symbol, and then the lexer should start parsing another token.

petrochenkov · 2020-12-11T13:28:59Z

@derekdreery

It would be nice to move the exponent parsing into the parser

That's pretty much the plan, see the discussion in #71322.
It's also a part of a larger plan on lexer & parser librarification and code sharing with rust-analyzer on which @matklad started working sometime around #76170, but stopped quickly after that. It would be great if someone could pick up that work.

matklad · 2020-12-11T13:45:05Z

Yeah, sorry about over promising on that front -- life happened. I am intending to pick that work eventually, no promises as to when this'll actually happen this time :)

richard-uk1 · 2020-12-11T18:02:29Z

@petrochenkov @matklad cool I will have a look. It would probably be better to work from sone existing code, so I can get a feel for how to write stuff in line with the rest of the code.

richard-uk1 · 2020-12-12T11:22:26Z

I'm going to write a little plan here for people to comment on:

The lexer will be unaware of exponential notation. 1e6 will be lexed as {integer} 1 with suffix e6. 1e-2 will be lexed as {integer} 1 with suffix e followed by - followed by int 2. 1.2e3 will lex as {float} 1.2 with suffix e3 and so on.
An early stage in the parser will match and combine valid exponentials (e.g. convert 3e2 into {float} 300, covert {float} 3.2 with suffix e, +, {int} 1 into {float} 32), but leave any invalid exponentials untouched (e.g. leave 3e as {int} 3 with suffix e).
This is what proc macros get - rather than failing at the lexer stage if an e does not lead to a valid exponential, proc macros will get the unparsed tokens. If an e does match a valid exponential, then the proc macro gets a parsed float as is currently the case, to maintain backwards compatibility and to strike a balance between convenience and flexibility (parsing exponents is more convenient but less flexible, in any case we have to do this to be backwards compat).
I believe there is also an issue with a token like 1.2ff - rather than getting the token the proc_macro will get an error about expected f32 or f64. I need to check if this is actually the case, as it's based on a vague memory ATM. I should also see if this kind of thing affects u{8, 16, ...} for integers.
Ensure that diagnostics are the same or improved. Specifically, if a token like 1e, or tokens like [1e, +, something] are found, emit a "did you mean to use an exponent....".

EDIT

TODO: think about pathological cases like 1e+2ee (which should be parsed as {float} 100 with suffix ee), 1e+1e+1 would be {float} 10 with suffix e, +, {integer} 1.

Concern: How to handle 1e + 2e, since this would end up parsed as {float} 100 with suffix e, which is probably not intended. How to handle this correctly? Because whitespace matters, this is a problem that comes from moving the task out of the lexer.

petrochenkov · 2020-12-12T19:31:22Z

With "fine-grained" tokens lexer will not produce float tokens at all, only integer tokens (possibly suffixed) and punctuation.
So 1.2e3 is [int(1), punct(.), int(2e3)] (with spacing information kept).

petrochenkov · 2020-12-12T20:07:31Z

I think we right now we can change behavior for cases returning rustc_lexer::LiteralKind::Float { empty_exponent: true } from lexer, because they are unconditionally reported as errors currently, so it will be strictly a language extension rather than a change.

For this fn is_next_exponent should behave identical to existing fn eat_float_exponent.
So instead of introducing is_next_exponent, eat_float_exponent could return a three-variant enum instead of bool, something like

enum {
    GoodFloatExponentIndeed, // e123, e+123, e-123
    BadFloatExponent, // e+, e-
    JustASuffixStartingWithE, // e, e___
}

(Note that eat_decimal_digits eats underscores in addition to digits, so we need to be careful with cases with no digits, for example.)

richard-uk1 · 2020-12-13T14:51:25Z

@petrochenkov @matklad is there any documentation on how you want the lexer/parser to look after the update? If there is, I could help work towards it.

petrochenkov · 2020-12-27T12:32:01Z

@derekdreery
I think @matklad made a tracking issue for this (or described the plan in on of the PRs/issues), but I can't find it right now.
The minimal changes to e suffix (#79912 (comment)) are not blocked on any big plans though.

richard-uk1 · 2020-12-27T16:18:28Z

@petrochenkov Ok that's what I'll make this PR 😄

petrochenkov · 2021-02-23T10:16:13Z

Closing due to inactivity.

richard-uk1 · 2021-02-23T16:50:51Z

If I pick this up again I'll make a new PR.

Change lexer to treat 'e' after number as suffix

3ee510a

Change lexer to treat 'e' after number as part of a suffix unless it is followed by a valid exponent.

rust-highfive assigned lcnr Dec 10, 2020

rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Dec 10, 2020

jonas-schievink added needs-fcp This change is insta-stable, so needs a completed FCP to proceed. T-lang Relevant to the language team, which will review and decide on the PR/issue. labels Dec 10, 2020

rust-highfive assigned petrochenkov and unassigned lcnr Dec 10, 2020

petrochenkov mentioned this pull request Dec 12, 2020

Properly capture trailing 'unglued' token #79978

Merged

richard-uk1 marked this pull request as draft December 27, 2020 16:18

petrochenkov closed this Feb 23, 2021

richard-uk1 mentioned this pull request Oct 13, 2024

move some invalid exponent detection into rustc_session #131656

Open

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Change lexer to treat 'e' after number as suffix unless it is followed by a valid exponent. #79912

Change lexer to treat 'e' after number as suffix unless it is followed by a valid exponent. #79912

richard-uk1 commented Dec 10, 2020 •

edited

Loading

rust-highfive commented Dec 10, 2020

lcnr commented Dec 10, 2020

richard-uk1 commented Dec 11, 2020 •

edited

Loading

petrochenkov commented Dec 11, 2020

matklad commented Dec 11, 2020

richard-uk1 commented Dec 11, 2020

richard-uk1 commented Dec 12, 2020 •

edited

Loading

petrochenkov commented Dec 12, 2020

petrochenkov commented Dec 12, 2020

richard-uk1 commented Dec 13, 2020

petrochenkov commented Dec 27, 2020

richard-uk1 commented Dec 27, 2020

petrochenkov commented Feb 23, 2021

richard-uk1 commented Feb 23, 2021

Change lexer to treat 'e' after number as suffix unless it is followed by a valid exponent. #79912

Change lexer to treat 'e' after number as suffix unless it is followed by a valid exponent. #79912

Conversation

richard-uk1 commented Dec 10, 2020 • edited Loading

rust-highfive commented Dec 10, 2020

lcnr commented Dec 10, 2020

richard-uk1 commented Dec 11, 2020 • edited Loading

TODOs for me

petrochenkov commented Dec 11, 2020

matklad commented Dec 11, 2020

richard-uk1 commented Dec 11, 2020

richard-uk1 commented Dec 12, 2020 • edited Loading

petrochenkov commented Dec 12, 2020

petrochenkov commented Dec 12, 2020

richard-uk1 commented Dec 13, 2020

petrochenkov commented Dec 27, 2020

richard-uk1 commented Dec 27, 2020

petrochenkov commented Feb 23, 2021

richard-uk1 commented Feb 23, 2021

richard-uk1 commented Dec 10, 2020 •

edited

Loading

richard-uk1 commented Dec 11, 2020 •

edited

Loading

richard-uk1 commented Dec 12, 2020 •

edited

Loading