Operator tokens #601

Merged
merged 10 commits into from
Jul 8, 2021
1 change: 1 addition & 0 deletions proposals/README.md
@@ -57,5 +57,6 @@ request:
- [0447 - Generics terminology](p0447.md)
- [0538 - `return` with no argument](p0538.md)
- [0555 - Operator precedence](p0555.md)
- [0601 - Operator tokens](p0601.md)

<!-- endproposals -->
351 changes: 351 additions & 0 deletions proposals/p0601.md
@@ -0,0 +1,351 @@
# Operator tokens

<!--
Part of the Carbon Language project, under the Apache License v2.0 with LLVM
Exceptions. See /LICENSE for license information.
SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
-->

[Pull request](https://github.com/carbon-language/carbon-lang/pull/601)

<!-- toc -->

## Table of contents

- [Problem](#problem)
- [Background](#background)
- [Proposal](#proposal)
- [Details](#details)
    - [Two kinds of operator tokens](#two-kinds-of-operator-tokens)
    - [Symbolic token list](#symbolic-token-list)
    - [Whitespace](#whitespace)
- [Rationale based on Carbon's goals](#rationale-based-on-carbons-goals)
- [Alternatives considered](#alternatives-considered)

<!-- tocstop -->

## Problem

Carbon needs a set of tokens to represent operators.

## Background

Some languages have a fixed set of operator tokens. For example:

- [C++ operators](https://eel.is/c++draft/lex.operators)
    - The keyword operators `and`, `or`, etc. are lexical synonyms for
      corresponding symbolic operators `&&`, `||`, etc.
- [Rust operators](https://doc.rust-lang.org/book/appendix-02-operators.html)

Other languages have extensible rules for defining operators, including the
facility for a developer to define operators that aren't part of the base
language. For example:

- [Swift operator rules](https://docs.swift.org/swift-book/ReferenceManual/LexicalStructure.html#ID418)
- [Haskell operator rules](https://www.haskell.org/onlinereport/haskell2010/haskellch2.html#dx7-18008)

Operator tokens can be formed by various rules, for example:

- At each lexing step, form the longest known operator token possible from the
remaining character sequence. For example, in C++, `a += b` is three tokens and
`a =+ b` is four tokens, because there are `+`, `=`, and `+=` operators, but
there is no `=+` operator. This approach is sometimes known as "max munch".
- At each lexing step, treat the longest sequence of operator-like characters
possible as an operator. The program is invalid if there is no such
operator. For example, in a C++-like language using this approach, `a =+ b`
would be invalid instead of meaning `a = (+b)`.
- Use semantic information to determine how to split a sequence of operator
characters into one or more operators, for example based on the types of the
operands.
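
The first two rules can be contrasted with a small sketch. It is illustrative only: the operator set (`+`, `=`, `+=`) is the one from the C++ example above, and the names `lex`, `longest_known_prefix`, and `whole_run` are hypothetical helpers, not part of any real lexer.

```python
# Hypothetical operator set taken from the C++ example: `+`, `=`, and `+=`.
KNOWN_OPERATORS = {"+", "=", "+="}
OPERATOR_CHARS = set("+=")


def longest_known_prefix(run):
    """Max munch: pick the longest known operator at the start of `run`."""
    for length in range(len(run), 0, -1):
        if run[:length] in KNOWN_OPERATORS:
            return run[:length]
    raise ValueError(f"no known operator at start of {run!r}")


def whole_run(run):
    """Longest-sequence rule: the entire run must be a known operator."""
    if run not in KNOWN_OPERATORS:
        raise ValueError(f"unknown operator {run!r}")
    return run


def lex(text, pick_operator):
    """Split `text` into word and operator tokens using `pick_operator`."""
    tokens, i = [], 0
    while i < len(text):
        if text[i].isspace():
            i += 1
        elif text[i].isalnum():
            j = i
            while j < len(text) and text[j].isalnum():
                j += 1
            tokens.append(text[i:j])
            i = j
        elif text[i] in OPERATOR_CHARS:
            # Collect the run of operator-like characters, then let the
            # chosen rule decide how much of it forms the next token.
            j = i
            while j < len(text) and text[j] in OPERATOR_CHARS:
                j += 1
            token = pick_operator(text[i:j])
            tokens.append(token)
            i += len(token)
        else:
            raise ValueError(f"unexpected character {text[i]!r}")
    return tokens
```

With `longest_known_prefix`, `a =+ b` lexes as four tokens; with `whole_run`, the same input is rejected because `=+` is not a known operator.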

## Proposal

Carbon has a fixed set of tokens that represent operators, defined by the
language specification. Developers cannot define new tokens to represent new
operators; there may be facilities to overload operators, but that is outside
the scope of this proposal. There are two kinds of tokens that represent
operators:

- _Symbolic tokens_ consist of one or more symbol characters. In particular,
such a token contains no characters that are valid in identifiers, no quote
characters, and no whitespace.
- _Keywords_ follow the lexical rules for words.

Symbolic tokens are lexed using a "max munch" rule: at each lexing step, the
longest symbolic token defined by the language specification that appears
starting at the current input position is lexed, if any.

Not all uses of symbolic tokens within the Carbon grammar will be as operators.
For example, we will have `(` and `)` tokens that serve to delimit various
grammar productions, and we may not want to consider `.` to be an operator,
because its right "operand" is not an expression.

When a symbolic token is used as an operator, we use the presence or absence of
whitespace around the symbolic token to determine its fixity, in the same way we
expect a human reader to recognize it. For example, we want `a* - 4` to treat
the `*` as a unary operator and the `-` as a binary operator, while `a * -4`
results in the reverse. This largely requires whitespace on only one side of a
unary operator and on both sides of a binary operator. However, we'd also like
to support binary operators where a lack of whitespace reflects precedence, such
as `2*x*x + 3*x + 1`, where doing so is straightforward. The rules we use to
achieve this are:

- There can be no whitespace between a unary operator and its operand.
- The whitespace around a binary operator must be consistent: either there is
whitespace on both sides or on neither side.
- If there is whitespace on neither side of a binary operator, the token
before the operator must be an identifier, a literal, or any kind of closing
bracket (for example, `)`, `]`, or `}`), and the token after the operator
must be an identifier, a literal, or any kind of opening bracket (for
example, `(`, `[`, or `{`).

This proposal includes an initial set of symbolic tokens covering only the
grammar productions that have been approved so far. This list should be extended
by proposals that use additional symbolic tokens.

## Details

### Two kinds of operator tokens

Two kinds of operator tokens are proposed. These two kinds are intended for
different uses, not as alternate spellings of the same functionality:

- Symbolic tokens are intended to be used for widely-recognized operators,
  such as the mathematical operators `+`, `*`, `<`, and so on.
    - Symbolic tokens used as operators would generally be expected to also be
      meaningful for some user-defined types, and should be candidates for
      being made overloadable once we support operator overloading.
- Keywords are intended to be used for cases such as the following:
    - Operators that perform flow control, such as `and`, `or`, `throw`,
      `yield`, and operators closely connected to these, such as `not`. It is
      important that these stand out from other operators as they have action
      that goes beyond evaluating their operands and computing a value.
    - Operators that are rare and that we do not want to spend our finite
      symbolic token budget on, such as perhaps xor or bit rotate.
    - Operators with very low precedence, and perhaps certain operators with
      very high precedence.
    - Special-purpose operators for which there is no conventional established
      symbol and for which we do not want to invent one, such as `as`.

The example operators in this section are included only to motivate the two
kinds of operator token; those specific operators are not proposed as part of
this proposal.

### Symbolic token list

The following is the initial list of symbolic tokens recognized in a Carbon
source file:

| | | | | | |
| --- | ---- | ---- | --- | --- | --- |
| `(` | `)` | `{` | `}` | `[` | `]` |
| `,` | `.` | `;` | `:` | `*` | `&` |
| `=` | `->` | `=>` | | | |

This list is expected to grow over time as more symbolic tokens are required by
language proposals.

### Whitespace

We wish to support the use of the same symbolic token as a prefix operator, an
infix operator, and a postfix operator, in some cases. In particular, we have
[decided in #523](https://github.com/carbon-language/carbon-lang/issues/523)
that the `*` operator should support all three uses; this operator will be
introduced in a future proposal. In order to support such usage, we want a rule
that allows us to simply and unambiguously parse operators that might have all
three fixities.

For example, given the expression `a * - b`, there are two possible parses:

- As `a * (- b)`, multiplying `a` by the negation of `b`.
- As `(a *) - b`, subtracting `b` from the pointer type `a *`.

Our chosen rule to distinguish such cases is to consider the presence or absence
of whitespace, as we think this strikes a good balance between simplicity and
expressiveness for the programmer and simplicity and good support for error
recovery in the implementation. `a * -b` uses the first interpretation, `a* - b`
uses the second interpretation, and other combinations (`a*-b`, `a *- b`,
`a* -b`, `a * - b`, `a*- b`, `a *-b`) are rejected as errors.
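
The accept/reject outcome for all eight spacings can be checked mechanically. The following is a rough sketch, not the toolchain's implementation: it encodes the three whitespace rules from the Proposal section as two hypothetical predicates (`binary_ok`, `unary_ok`) and tests both candidate parses against each spacing of `a`, `*`, `-`, `b`.

```python
from itertools import product


def binary_ok(ws_left, ws_right, prev_ends_operand, next_starts_operand):
    """Binary operator: whitespace on both sides, or on neither side with
    operand-boundary tokens as immediate neighbors."""
    if ws_left != ws_right:
        return False
    return ws_left or (prev_ends_operand and next_starts_operand)


def unary_ok(ws_to_operand):
    """Unary operator: no whitespace between the operator and its operand."""
    return not ws_to_operand


def allowed_parses(ws1, ws2, ws3):
    """Parses permitted for `a`, `*`, `-`, `b`, where ws1..ws3 say whether
    whitespace is present in each of the three gaps, left to right."""
    parses = []
    # Parse 1: binary `*`, prefix `-`. The token before `*` is the identifier
    # `a`; the token after it is `-`, which does not start an operand.
    if binary_ok(ws1, ws2, True, False) and unary_ok(ws3):
        parses.append("a * (-b)")
    # Parse 2: postfix `*`, binary `-`. The token before `-` is `*`, which
    # does not end an operand; the token after it is the identifier `b`.
    if unary_ok(ws1) and binary_ok(ws2, ws3, False, True):
        parses.append("(a*) - b")
    return parses


def spell(ws1, ws2, ws3):
    """Render one spacing of the expression as source text."""
    gap = lambda ws: " " if ws else ""
    return f"a{gap(ws1)}*{gap(ws2)}-{gap(ws3)}b"


results = {spell(*ws): allowed_parses(*ws)
           for ws in product([False, True], repeat=3)}
accepted = {text: parses for text, parses in results.items() if parses}
```

Under these assumptions, `accepted` contains only `a * -b` and `a* - b`, each with a single parse; the other six spacings have no permitted parse.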

In general, we require whitespace to be present or absent around the operator to
indicate its fixity, as this is a cue that a human reader would use to
understand the code: binary operators have whitespace on both sides, and unary
operators lack whitespace between the operator and its operand. We also make
allowance for omitting the whitespace around a binary operator in cases where it
aids readability to do so, such as in expressions like `2*x*x + 3*x + 1`: for an
operator with whitespace on neither side, if the token immediately before the
operator indicates it is the end of an operand, and the token immediately after
the operator indicates it is the beginning of an operand, the operator is
treated as binary.

We define the set of tokens that constitutes the beginning or end of an operand
as:

- Identifiers, as in `x*x + y*y`.
- Literals, as in `3*x + 4*y` or `"foo"+s`.
- Brackets of any kind, facing away from the operator, as in `f()*(n + 3)` or
`args[3]*{.real=4, .imag=1}`.

For error recovery purposes, this rule functions best if no expression context
can be preceded by a token that looks like the end of an operand and no
expression context can be followed by a token that looks like the start of an
operand. One known exception to this is in function definitions:

```
fn F(p: Int *) -> Int * { return p; }
```

Both occurrences of `Int *` here are erroneous. The first is easy to detect and
diagnose, but the second is more challenging, if `{...}` is a valid expression
form. We expect to be able to easily distinguish between code blocks starting
with `{` and expressions starting with `{` for all cases other than `{}`.
However, the code block `{}` is not a reasonable body for a function with a
return type, so we expect errors involving a combination of misplaced whitespace
and `{}` to be rare, and we should be able to recover well from the remaining
cases.

From the perspective of token formation, the whitespace rule means that there
are four _variants_ of each symbolic token:

- A symbolic token with whitespace on both sides is a _binary_ variant of the
token.
- A symbolic token with whitespace on neither side, where the preceding token
is an identifier, literal, or closing bracket, and the following token is an
identifier, literal, or opening bracket, is also a _binary_ variant of the token.
- A symbolic token with whitespace on neither side that does not satisfy the
preceding rule is a _unary_ variant of the token.
- A symbolic token with whitespace on the left side only is a _prefix_ variant
of the token.
- A symbolic token with whitespace on the right side only is a _postfix_
variant of the token.

When used in non-operator contexts, any variant of a symbolic token is
acceptable. When used in operator contexts, only a binary variant of a token can
be used as a binary operator, only a prefix or unary variant of a token can be
used as a prefix operator, and only a postfix or unary variant of a token can be
used as a postfix operator.
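
As a minimal sketch of this classification, the following assigns a variant to each operator token in a line of source. Several things here are assumptions made for illustration: the tiny token grammar (words, brackets, and the operators `*`, `+`, `-`), the treatment of the start and end of input as whitespace, and the names `tokenize`, `variants`, and `ALLOWED_ROLES`. Following the rule in the Proposal section, any kind of opening bracket counts as the start of an operand.

```python
import re

# Toy token grammar: optional whitespace, then a word (identifier or integer
# literal), a bracket, or one of three operator symbols.
TOKEN = re.compile(r"(\s*)(\w+|[()\[\]{}]|[-+*])")
OPERATORS = {"*", "+", "-"}
OPENERS, CLOSERS = set("([{"), set(")]}")

# Which operator roles each variant may fill.
ALLOWED_ROLES = {
    "binary": {"binary"},
    "prefix": {"prefix"},
    "postfix": {"postfix"},
    "unary": {"prefix", "postfix"},
}


def tokenize(text):
    """Return (token, has_leading_whitespace) pairs."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected input at offset {pos}")
        tokens.append((match.group(2), bool(match.group(1))))
        pos = match.end()
    return tokens


def is_word(token):
    return re.fullmatch(r"\w+", token) is not None


def variants(text):
    """Return (operator, variant) for each operator token in `text`."""
    tokens = tokenize(text)
    result = []
    for i, (token, ws_before) in enumerate(tokens):
        if token not in OPERATORS:
            continue
        # Assumption: the edges of the input behave like whitespace.
        ws_left = ws_before or i == 0
        ws_right = i + 1 == len(tokens) or tokens[i + 1][1]
        if ws_left and ws_right:
            variant = "binary"
        elif ws_left:
            variant = "prefix"
        elif ws_right:
            variant = "postfix"
        else:
            prev_token, next_token = tokens[i - 1][0], tokens[i + 1][0]
            ends_operand = is_word(prev_token) or prev_token in CLOSERS
            starts_operand = is_word(next_token) or next_token in OPENERS
            variant = "binary" if ends_operand and starts_operand else "unary"
        result.append((token, variant))
    return result
```

For example, `variants("a* - b")` classifies `*` as a postfix variant and `-` as a binary variant, while in `a*-b` both tokens are unary variants, so neither can serve as the binary operator the expression would need.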

This whitespace rule has been
[implemented in the Carbon toolchain](https://github.com/carbon-language/carbon-lang/pull/576)
for all operators by tracking the presence or absence of trailing whitespace as
part of a token, and
[in executable semantics](https://github.com/carbon-language/carbon-lang/commit/04d3a885ae01a779aadb19f51ec7a5a12ffe295c)
for the `*` operator by forming four different token variants as described
above.

The choice to disallow whitespace between a unary operator and its operand is
_experimental_.

## Rationale based on Carbon's goals

- Software and language evolution

    - By not allowing user-defined operators, we reduce the possibility that
      operators added to the language later will conflict with existing uses
      in programs. Due to the use of a max munch rule, we might add an
      operator that causes existing code to be interpreted differently, but
      such problems will be easy to detect and resolve, because we know the
      operator set in advance.

- Code that is easy to read, understand, and write

    - The fixed operator set means that developers don't need to understand an
      unbounded and extensible number of operators and precedence rules. The
      fixed operator set encourages functionality that does not correspond to
      a well-known operator symbol to be exposed by way of a named operation
      instead of a symbol, improving readability among developers not familiar
      with a codebase.
    - Requiring whitespace to be used consistently around operators reduces
      the possibility for confusing formatting.
    - Permitting whitespace either on both sides of a binary operator or on
      neither side allows expressions such as `2*x*x + 3*x + 1` to use the
      absence of whitespace to improve readability. Because the language
      officially sanctions both choices, the formatting tool can be expected
      to preserve the user's choice.
    - The choice to lex the longest known symbolic token rather than the
      longest sequence of symbolic characters makes it easier to write
      expressions involving a series of prefix or postfix operators, such as
      `x = -*p;`.

- Interoperability with and migration from existing C++ code

    - The fixed operator set makes a mapping between Carbon operators and C++
      operators easier, by avoiding any desire to map arbitrary user-defined
      Carbon operators into a C++ form.
    - The choice of a fixed operator set and a "max munch" rule will be
      familiar to C++ developers, as it is the same approach taken by C++.
    - The whitespace rule permits the `*` operator to be used for all of
      multiplication, dereference, and pointer type formation, as in C++,
      while still permitting Carbon to treat type expressions as expressions.

## Alternatives considered

We could lex the longest sequence of symbolic characters rather than lexing only
the longest known operator.

Advantages:

- Adding new operators could be done without any change to the lexing rules.
- If unknown operators are rejected, adding new operators would carry no risk
of changing the meaning of existing valid code.

Disadvantages:

- Sequences of prefix or postfix operators would require parentheses or
whitespace. For example, `Int**` would lex as `Int` followed by a single
`**` token, and `**p` would lex as a single `**` token followed by `p`, if
there is no `**` operator. While we could define `**`, `***`, and so on as
operators, doing so would add complexity and inconsistency to the language
rules.

We could support an extensible operator set, giving the developer the option to
add new operators.

Advantages:

- This would increase expressivity, especially for embedded domain-specific
languages.

Disadvantages:

- This would harm readability, at least for those unfamiliar with the code
using the operators.
- This could harm our ability to evolve the language, by admitting the
possibility of a custom operator colliding with a newly-introduced standard
operator, although this risk could be reduced by providing a separate
lexical syntax for custom operators.
- We would need to either lex the longest sequence of symbolic characters we
can, which has the same disadvantage discussed for that approach above, or
use a more sophisticated rule to determine how to split operators -- perhaps
based on what operator overloads are in scope -- increasing complexity.

We could apply different whitespace restrictions or no whitespace restrictions.
See [#520](https://github.com/carbon-language/carbon-lang/issues/520) for
discussion of the alternatives and the leads' decision.

We could require whitespace around a binary operator followed by `[` or `{`. In
particular, for examples such as:

```
fn F() -> Int*{ return Null; }
var n: Int = pointer_to_array^[i];
```

... this would allow us to form a unary operator instead of a binary operator,
which is likely to be more in line with the developer's expectations.

Advantages:

- Room to add a postfix `^` dereference operator, or similarly any other
postfix operator producing an array, without creating surprises for pointers
to arrays.
- Allows the whitespace before the `{` of a function body to be consistently
omitted if desired.

Disadvantages:

- The rule would be more complex, and would be asymmetric: we must allow
closing square brackets before unspaced binary operators to permit things
like `arr[i]*3`.
- Would interact badly with expression forms that begin with a `[` or `{`, for
example `Time.Now()+{.seconds = 3}` or `names+["Lrrr"]`.