Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change lexer to treat 'e' after number as suffix unless it is followed by a valid exponent. #79912

Conversation

richard-uk1
Copy link
Contributor

@richard-uk1 richard-uk1 commented Dec 10, 2020

This fixes #67544. There will be some regression in diagnostics that need fixing before merge. I want to get feedback before I sink time into this though that the patch might be accepted.

It will fail CI because some compile-fail messages have changed, but I would still like feedback.

I expect if it does get merged, it will be after considerable discussion.

Also requires tests before merge.

Change lexer to treat 'e' after number as part of a suffix unless it is
followed by a valid exponent.
@rust-highfive
Copy link
Collaborator

r? @lcnr

(rust-highfive has picked a reviewer for you, use r? to override)

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Dec 10, 2020
@jonas-schievink jonas-schievink added needs-fcp This change is insta-stable, so needs a completed FCP to proceed. T-lang Relevant to the language team, which will review and decide on the PR/issue. labels Dec 10, 2020
@lcnr
Copy link
Contributor

lcnr commented Dec 10, 2020

r? @petrochenkov maybe, they are more knowledgeable about this.

this also requires t-lang signoff

@rust-highfive rust-highfive assigned petrochenkov and unassigned lcnr Dec 10, 2020
@richard-uk1
Copy link
Contributor Author

richard-uk1 commented Dec 11, 2020

It would be nice to move the exponent parsing into the parser, so all tokens in proc macros get a number followed by a suffix, but this would be a conflagration of concerns.

This isn't really ready for review so I'll close and re-open once it's further along.

TODOs for me

  • the token 1e+ (1, e, +, <non-digit>) should be the integer 1with the suffixe, a +` symbol, and then the lexer should start parsing another token.

@petrochenkov
Copy link
Contributor

@derekdreery

It would be nice to move the exponent parsing into the parser

That's pretty much the plan, see the discussion in #71322.
It's also a part of a larger plan on lexer & parser librarification and code sharing with rust-analyzer on which @matklad started working sometime around #76170, but stopped quickly after that. It would be great if someone could pick up that work.

@matklad
Copy link
Member

matklad commented Dec 11, 2020

Yeah, sorry about over promising on that front -- life happened. I am intending to pick that work eventually, no promises as to when this'll actually happen this time :)

@richard-uk1
Copy link
Contributor Author

@petrochenkov @matklad cool I will have a look. It would probably be better to work from sone existing code, so I can get a feel for how to write stuff in line with the rest of the code.

@richard-uk1
Copy link
Contributor Author

richard-uk1 commented Dec 12, 2020

I'm going to write a little plan here for people to comment on:

  • The lexer will be unaware of exponential notation. 1e6 will be lexed as {integer} 1 with suffix e6. 1e-2 will be lexed as {integer} 1 with suffix e followed by - followed by int 2. 1.2e3 will lex as {float} 1.2 with suffix e3 and so on.
  • An early stage in the parser will match and combine valid exponentials (e.g. convert 3e2 into {float} 300, covert {float} 3.2 with suffix e, +, {int} 1 into {float} 32), but leave any invalid exponentials untouched (e.g. leave 3e as {int} 3 with suffix e).
  • This is what proc macros get - rather than failing at the lexer stage if an e does not lead to a valid exponential, proc macros will get the unparsed tokens. If an e does match a valid exponential, then the proc macro gets a parsed float as is currently the case, to maintain backwards compatibility and to strike a balance between convenience and flexibility (parsing exponents is more convenient but less flexible, in any case we have to do this to be backwards compat).
  • I believe there is also an issue with a token like 1.2ff - rather than getting the token the proc_macro will get an error about expected f32 or f64. I need to check if this is actually the case, as it's based on a vague memory ATM. I should also see if this kind of thing affects u{8, 16, ...} for integers.
  • Ensure that diagnostics are the same or improved. Specifically, if a token like 1e, or tokens like [1e, +, something] are found, emit a "did you mean to use an exponent....".

EDIT

TODO: think about pathological cases like 1e+2ee (which should be parsed as {float} 100 with suffix ee), 1e+1e+1 would be {float} 10 with suffix e, +, {integer} 1.

Concern: How to handle 1e + 2e, since this would end up parsed as {float} 100 with suffix e, which is probably not intended. How to handle this correctly? Because whitespace matters, this is a problem that comes from moving the task out of the lexer.

@petrochenkov
Copy link
Contributor

With "fine-grained" tokens lexer will not produce float tokens at all, only integer tokens (possibly suffixed) and punctuation.
So 1.2e3 is [int(1), punct(.), int(2e3)] (with spacing information kept).

@petrochenkov
Copy link
Contributor

I think we right now we can change behavior for cases returning rustc_lexer::LiteralKind::Float { empty_exponent: true } from lexer, because they are unconditionally reported as errors currently, so it will be strictly a language extension rather than a change.

For this fn is_next_exponent should behave identical to existing fn eat_float_exponent.
So instead of introducing is_next_exponent, eat_float_exponent could return a three-variant enum instead of bool, something like

enum {
    GoodFloatExponentIndeed, // e123, e+123, e-123
    BadFloatExponent, // e+, e-
    JustASuffixStartingWithE, // e, e___
}

(Note that eat_decimal_digits eats underscores in addition to digits, so we need to be careful with cases with no digits, for example.)

@petrochenkov petrochenkov added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. needs-fcp This change is insta-stable, so needs a completed FCP to proceed. labels Dec 12, 2020
@richard-uk1
Copy link
Contributor Author

@petrochenkov @matklad is there any documentation on how you want the lexer/parser to look after the update? If there is, I could help work towards it.

@petrochenkov
Copy link
Contributor

@derekdreery
I think @matklad made a tracking issue for this (or described the plan in on of the PRs/issues), but I can't find it right now.
The minimal changes to e suffix (#79912 (comment)) are not blocked on any big plans though.

@richard-uk1
Copy link
Contributor Author

@petrochenkov Ok that's what I'll make this PR 😄

@richard-uk1 richard-uk1 marked this pull request as draft December 27, 2020 16:18
@petrochenkov
Copy link
Contributor

Closing due to inactivity.

@richard-uk1
Copy link
Contributor Author

If I pick this up again I'll make a new PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. T-lang Relevant to the language team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

In the lexer, accept number suffixes that start with e.
6 participants