Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unable to parse expression a as usize < b #22644

Closed
defuz opened this issue Feb 21, 2015 · 23 comments
Closed

Unable to parse expression a as usize < b #22644

defuz opened this issue Feb 21, 2015 · 23 comments
Labels
A-diagnostics Area: Messages for errors, warnings, and lints A-parser Area: The parsing of Rust source code to an AST

Comments

@defuz
Copy link
Contributor

defuz commented Feb 21, 2015

It seems like the parser tries to read the type where it isn't:

fn main() {
    let a : u32 = 0;
    let b : usize = 0;

    a as usize > b; // ok
    a as usize < b; // error: expected one of `(`, `+`, `,`, `::`, `<`, or `>`, found `;`
}

I think it's rather strange that the operator > works, but the operator < doesn't at the same place.

rustc 1.0.0-nightly (522d09dfe 2015-02-19) (built 2015-02-20)

@kmcallister
Copy link
Contributor

usize<T> is a syntactically valid (though nonsensical) type-expression. This ambiguity on < has plagued C++ for decades :(

@kmcallister kmcallister added A-parser Area: The parsing of Rust source code to an AST A-diagnostics Area: Messages for errors, warnings, and lints labels Feb 21, 2015
@kmcallister
Copy link
Contributor

Maybe we can at least detect the ambiguous cases and provide a hint.

@defuz
Copy link
Contributor Author

defuz commented Feb 21, 2015

At least Scala, Nemerle, Nim and Boo began to use square brackets for type parameters. Is there any reason why Rust can not do the same? I think it would significant simplify parsing and get rid of ambiguities.

@kmcallister
Copy link
Contributor

That is my preference as well, but the discussion already happened and a decision was made (before my time, I believe).

@huonw
Copy link
Member

huonw commented Feb 21, 2015

We use [] for array indexing so there are still ambiguities.

@defuz
Copy link
Contributor Author

defuz commented Feb 22, 2015

@huonw, actually no, because unlike < and > characters, square brackets (and parens) always respect the pairing. The ambiguity lies in using < and > like brackets, although they also can be used separately.

@huonw
Copy link
Member

huonw commented Feb 22, 2015

I don't thinks that's totally the reason.

The token sequence x as T[y] has two options for parses with [] as the generic syntax: an index expression (x as T)[y] or a cast to a generic type x as (T[y]). This is essentially the same as what is occurring here, the token sequence x as T < y represents a prefix of two options for parses: a comparison (x as T) < y and a cast to a generic type x as (T<y ... >). You're wanting the former but the compiler is giving the latter.

Of course, it is probably rarer that one wishes to index a cast expression, so using [] for generics may reduce how often this ambiguity occurs in real code, but it definitely does not get rid of it.

In any case, the generic syntax is here to stay. There's been more than enough discussion about it (e.g. rust-lang/rfcs#148).

@bombless
Copy link
Contributor

Is x as (T<y ...>) actually available here?
I guess we only accept x as (&T<y ...>) for now? (T is trait here)

If we don't accept x as (T<y ...>) as legal expression (yet), I think we can parse it as OP wishes, and if we accept something like x as (T<y ...>) as valid expression one day, we just write that parenthesis out.

I'd love to write a PoC for this.
Correct me if I miss anything.

And if my idea's totally wrong, I guess I'll find out soon. (like in make check or even earlier stage)

@defuz
Copy link
Contributor Author

defuz commented Feb 22, 2015

The question of how to properly parsing sequence x as T[y ... is a question about the operator precedence. Indexing is weaker than the as operator, so there is only one way of the correct parsing it: x as (T[y ...). In the other hand, the question of parsing x as T<y ... is a question about what is < at all. It could be a start of generic type parameters or just comparison operator. In this case parser can't determine what is < and it is the reason of ambiguity.

@bombless
Copy link
Contributor

At least I don't see the possibility we change the syntax on 1.x

@dgrunwald
Copy link
Contributor

The ambiguity on < also exists in C++ and C#, and doesn't cause such problems in those languages. (although the way this syntax is disambiguated is dramatically different between C++ and C#)

I believe fixing this issue means we can also drop the :: when calling generic functions: f::<T>() becomes f<T>().
Note that with sufficient lookahead, most programs are not actually ambiguous, because RFC 558 prevents the use of both < and > in an expression.
Real ambiguities arise only with multiple or nested type arguments (and possibly also with UFCS syntax, I haven't yet looked at the details of that).
f(a<b,c>(d)) could be f((a<b), (c>d)) or f(a::<b,c>(d)).
a as b < c < d >> - e could be (a as b<c<d>>) - e or (a as b) < c < (d >> (-e)) (with python-style chained comparison)

I think introducing an C#-style disambiguation rule could possibly solve this issue with only very minor breaking changes. (these were also breaking changes in C# when generics were introduced in version 2.0, and affected very little code)
I have an idea on how to do this, but I didn't write a concrete proposal because people didn't seem to like parser rules that require arbitrary lookahead. And because it would make it difficult to extend Rust's type syntax later without introducing additional minor breaking changes due to the disambiguation.

@kmcallister
Copy link
Contributor

The ambiguity on < also exists in C++ and C#, and doesn't cause such problems in those languages.

I don't know about C#, but it's responsible for some of the ugliest special cases in C++ syntax, like

void f() {
    T::template g<int>();
}

@bombless
Copy link
Contributor

C++ needs a symbol table for parsing.
And people feel okay about it just because C already needs symbol table for parsing.

@dgrunwald
Copy link
Contributor

I agree that we don't want to copy C++'s mistakes here.
The C# rules are much more sane: no weird "typename"/"template" keywords for disambiguating, no need for a symbol table during parsing.

The idea is: when the parser encounters a < that is ambiguous, it looks ahead in the token stream and attempts to parse the type argument list. If the type argument list is not valid, backtrack and parse the < as less-than operator. If the type argument list is valid, look at the next token after the matching > to determine whether to parse the < as type argument list or less-than operator:

  • Next token signals end of expression, e.g. one of ; , ) ] }: use interpretation as type argument list.
  • Next token is identifier or integer literal: use interpretation as less-than operator.
  • Next token is { or (: use interpretation as type argument list (struct initializer / function call). (this is a breaking change for the current interpretation of f(a<b,c>(d)))
  • For other tokens, we'd have to think about which interpretation makes more sense. In general, tokens in FIRST(Expression) should use the interpretation as less-than operator and tokens in FOLLOW(Expression) should use the interpretation as type argument list, but quite a few tokens are in both of those sets.

Note that if this heuristic is wrong, the user can always choose force one or the other interpretation: Putting parentheses around the less-than operator (e.g. f((a<b), (c>d))) makes the interpretation as type argument list invalid and thus enforces the parse as less-than operator.
If a type argument list is desired, enforcing this depends on which context the type argument list is used in:

  • for a struct initializer or function call, the { or ( heuristic always pick correctly.
  • where type arguments are used at the end of an expression (in as expression, nullary struct, ...), the expression can be enclosed in parentheses so that > is followed by ) and the heuristic thus picks correctly.
  • We'd need to review all other possible places where a type argument list might appear. In general, it seems like a good idea to have the heuristic err on the side of picking the type argument list.

We can also exploit the fact that RFC 558 makes a < b > c invalid in expression context to always pick the type argument list interpretation (without looking at the next token after >) if the type argument list does not contain any comma and the > is not part of a >> token.

I think this heuristic will work well enough that stuff will "just work" in most cases.
In C# it's basically perfect and most C# programmers don't even realize there is an ambiguity -- but Rust syntax a bit more flexible and may cause more problems in this regard.

Upsides:

  • Fixes this issue
  • Allows us to remove :: for generic function calls.

Downsides:

  • Introduces arbitrary lookahead into the language
  • Complicated rule, users may get confused where the heuristic fails.
    • Counterpoint: users are already confused where the current "heuristic" fails (see: this issue)
  • Any future extension to type syntax would expand the set of valid type argument lists, and thus might change how existing expressions are interpreted. This may make it difficult to add new type-level syntax post-1.0 without breaking existing code.
  • In particular, I have no idea how this would work with type-level integers.
    • Counterpoint: I don't know how those would work syntactically without this idea, either -- putting a constant expression where currently a type is expected seems troublesome in general.

@SimonSapin
Copy link
Contributor

I think #20078 is the same underlying issue.

@benaryorg
Copy link
Contributor

I wonder why in this example the comparison is prioritized but in a line such as

thread_local!(static LOCAL:Cell<u32>=Cell::new(0));

the comparison operator is resolved first.

The above code yields:

src/main.rs:3:35: 3:37 error: no rules expected the token `>=`
src/main.rs:3 thread_local!(static TEST:Cell<u32>=Cell::new(0));
                                                ^~

@arielb1
Copy link
Contributor

arielb1 commented Sep 1, 2015

@benaryorg

Because >= is eagerly tokenized

@arielb1
Copy link
Contributor

arielb1 commented Sep 1, 2015

anyway, foo as Rc < fmt::Debug > is perfectly valid (and useful!) Rust. If we don't want to play games with infinite lookahead, this is how we go.

@benaryorg
Copy link
Contributor

@arielb1 Thanks.

I guess I now have a real reason to change my coding style.

Regarding the other problem, wish you good luck fixing that….

@bstrie
Copy link
Contributor

bstrie commented Sep 10, 2015

@dgrunwald We had this discussion back when RFC 558 was accepted and determined that such a scheme wouldn't allow us to get rid of ::<> entirely; it would only allow us to omit the :: for things that take a single type parameter (i.e. arity one) and the full ::<> form would be required for typaram lists of higher arity. The upside is that this probably constitutes the majority of ::<> usage, and so lots of code would still be cleaned up in practice. The downsides are that this would prevent us from realizing the original intent of RFC 558 (Python-style chained comparisons) while also introducing the notion of infinite lookahead into the parser (currently our lookahead is always bounded). I think it's still probably worth doing, but it was sufficiently controversial to postpone the discussion to post-1.0.

@dgrunwald
Copy link
Contributor

@bstrie: The C#-style lookahead heuristic in my comment works for arity > 1; it just requires adding parentheses to disambiguate some rare cases (it never requires ::<>). It would be a breaking change, though I would be surprised if there's much (any?) real-world code that would be broken.

It doesn't require exploiting RFC 558 for the disambiguation. RFC 558 could theoretically be used to improve the heuristic (no need to look at the token following >), but only in the comma-free cases (arity 1 isn't sufficient, it could also be ambiguous if the single type argument is another generic type instantiation containing commas). But I don't think we should use it; adding an additional type argument to an existing generic call should always be valid.

The ugliness of ::<> was why I started looking into that part of the syntax in the first place (I kinda forgot about that when I wrote RFC 558 a few days later). If we add support for Python-style chained comparisons in the future, we should probably only add monotonic comparisons (allow a < b < c and a > b > c, but not a < b > c) to avoid conflicts with generic syntax.

The main downside of the C#-style heuristic is that future additions to the type-level syntax would be breaking changes. So I think this should be postponed until type-level integer syntax is worked out.

@birkenfeld
Copy link
Contributor

Duplicate of #11962?

@SimonSapin
Copy link
Contributor

Yes, though there is more discussion here.

bors added a commit that referenced this issue Jun 16, 2017
Learn to parse `a as usize < b`

Parsing `a as usize > b` always works, but `a as usize < b` was a
parsing error because the parser would think the `<` started a generic
type argument for `usize`. The parser now attempts to parse as before,
and if a DiagnosticError is returned, try to parse again as a type with
no generic arguments. If this fails, return the original
`DiagnosticError`.

Fix #22644.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-diagnostics Area: Messages for errors, warnings, and lints A-parser Area: The parsing of Rust source code to an AST
Projects
None yet
Development

No branches or pull requests

10 participants