Operator tokens (#601)

Proposal: lexical rules for operator tokens.

Co-authored-by: Chandler Carruth
carbon-language · Jul 8, 2021 · db2efea
1 parent f499c3c
commit db2efea

Showing 2 changed files with 352 additions and 0 deletions.
diff --git a/proposals/README.md b/proposals/README.md
@@ -59,5 +59,6 @@ request:
 -   [0538 - `return` with no argument](p0538.md)
 -   [0540 - Remove `Void`](p0540.md)
 -   [0555 - Operator precedence](p0555.md)
+-   [0601 - Operator tokens](p0601.md)
 
 <!-- endproposals -->
diff --git a/proposals/p0601.md b/proposals/p0601.md
@@ -0,0 +1,351 @@
+# Operator tokens
+
+<!--
+Part of the Carbon Language project, under the Apache License v2.0 with LLVM
+Exceptions. See /LICENSE for license information.
+SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+-->
+
+[Pull request](https://github.com/carbon-language/carbon-lang/pull/601)
+
+<!-- toc -->
+
+## Table of contents
+
+-   [Problem](#problem)
+-   [Background](#background)
+-   [Proposal](#proposal)
+-   [Details](#details)
+    -   [Two kinds of operator tokens](#two-kinds-of-operator-tokens)
+    -   [Symbolic token list](#symbolic-token-list)
+    -   [Whitespace](#whitespace)
+-   [Rationale based on Carbon's goals](#rationale-based-on-carbons-goals)
+-   [Alternatives considered](#alternatives-considered)
+
+<!-- tocstop -->
+
+## Problem
+
+Carbon needs a set of tokens to represent operators.
+
+## Background
+
+Some languages have a fixed set of operator tokens. For example:
+
+-   [C++ operators](https://eel.is/c++draft/lex.operators)
+    -   The keyword operators `and`, `or`, etc. are lexical synonyms for
+        corresponding symbolic operators `&&`, `||`, etc.
+-   [Rust operators](https://doc.rust-lang.org/book/appendix-02-operators.html)
+
+Other languages have extensible rules for defining operators, including the
+facility for a developer to define operators that aren't part of the base
+language. For example:
+
+-   [Swift operator rules](https://docs.swift.org/swift-book/ReferenceManual/LexicalStructure.html#ID418)
+-   [Haskell operator rules](https://www.haskell.org/onlinereport/haskell2010/haskellch2.html#dx7-18008)
+
+Operator tokens can be formed by various rules, for example:
+
+-   At each lexing step, form the longest known operator token possible from the
+    remaining character sequence. For example, in C++, `a += b` is three tokens
+    and `a =+ b` is four tokens, because there are `+`, `=`, and `+=` operators,
+    but there is no `=+` operator. This approach is sometimes known as "max
+    munch".
+-   At each lexing step, treat the longest sequence of operator-like characters
+    possible as an operator. The program is invalid if there is no such
+    operator. For example, in a C++-like language using this approach, `a =+ b`
+    would be invalid instead of meaning `a = (+b)`.
+-   Use semantic information to determine how to split a sequence of operator
+    characters into one or more operators, for example based on the types of the
+    operands.
+
+## Proposal
+
+Carbon has a fixed set of tokens that represent operators, defined by the
+language specification. Developers cannot define new tokens to represent new
+operators; there may be facilities to overload operators, but that is outside
+the scope of this proposal. There are two kinds of tokens that represent
+operators:
+
+-   _Symbolic tokens_ consist of one or more symbol characters. In particular,
+    such a token contains no characters that are valid in identifiers, no quote
+    characters, and no whitespace.
+-   _Keywords_ follow the lexical rules for words.
+
+Symbolic tokens are lexed using a "max munch" rule: at each lexing step, the
+longest symbolic token defined by the language specification that appears
+starting at the current input position is lexed, if any.
+
+Not all uses of symbolic tokens within the Carbon grammar will be as operators.
+For example, we will have `(` and `)` tokens that serve to delimit various
+grammar productions, and we may not want to consider `.` to be an operator,
+because its right "operand" is not an expression.
+
+When a symbolic token is used as an operator, we use the presence or absence of
+whitespace around the symbolic token to determine its fixity, in the same way we
+expect a human reader to recognize it. For example, we want `a* - 4` to treat
+the `*` as a unary operator and the `-` as a binary operator, while `a * -4`
+results in the reverse. This largely requires whitespace on only one side of a
+unary operator and on both sides of a binary operator. However, we'd also like
+to support binary operators where a lack of whitespace reflects precedence, such
+as `2*x*x + 3*x + 1`, where doing so is straightforward. The rules we use to
+achieve this are:
+
+-   There can be no whitespace between a unary operator and its operand.
+-   The whitespace around a binary operator must be consistent: either there is
+    whitespace on both sides or on neither side.
+-   If there is whitespace on neither side of a binary operator, the token
+    before the operator must be an identifier, a literal, or any kind of closing
+    bracket (for example, `)`, `]`, or `}`), and the token after the operator
+    must be an identifier, a literal, or any kind of opening bracket (for
+    example, `(`, `[`, or `{`).
+
+This proposal includes an initial set of symbolic tokens covering only the
+grammar productions that have been approved so far. This list should be extended
+by proposals that use additional symbolic tokens.
+
+## Details
+
+### Two kinds of operator tokens
+
+Two kinds of operator tokens are proposed. These two kinds are intended for
+different uses, not as alternate spellings of the same functionality:
+
+-   Symbolic tokens are intended to be used for widely-recognized operators,
+    such as the mathematical operators `+`, `*`, `<`, and so on.
+    -   Symbolic tokens used as operators would generally be expected to also be
+        meaningful for some user-defined types, and should be candidates for
+        being made overloadable once we support operator overloading.
+-   Keywords are intended to be used for cases such as the following:
+    -   Operators that perform flow control, such as `and`, `or`, `throw`,
+        `yield`, and operators closely connected to these, such as `not`. It is
+        important that these stand out from other operators because their
+        effect goes beyond evaluating their operands and computing a value.
+    -   Operators that are rare and that we do not want to spend our finite
+        symbolic token budget on, such as perhaps xor or bit rotate.
+    -   Operators with very low precedence, and perhaps certain operators with
+        very high precedence.
+    -   Special-purpose operators for which there is no conventional established
+        symbol and for which we do not want to invent one, such as `as`.
+
+The example operators in this section are included only to motivate the two
+kinds of operator token; those specific operators are not proposed as part of
+this proposal.
+
+### Symbolic token list
+
+The following is the initial list of symbolic tokens recognized in a Carbon
+source file:
+
+|     |      |      |     |     |     |
+| --- | ---- | ---- | --- | --- | --- |
+| `(` | `)`  | `{`  | `}` | `[` | `]` |
+| `,` | `.`  | `;`  | `:` | `*` | `&` |
+| `=` | `->` | `=>` |     |     |     |
+
+This list is expected to grow over time as more symbolic tokens are required by
+language proposals.
+
+### Whitespace
+
+We wish to support the use of the same symbolic token as a prefix operator, an
+infix operator, and a postfix operator, in some cases. In particular, we have
+[decided in #523](https://github.com/carbon-language/carbon-lang/issues/523)
+that the `*` operator should support all three uses; this operator will be
+introduced in a future proposal. In order to support such usage, we want a rule
+that allows us to simply and unambiguously parse operators that might have all
+three fixities.
+
+For example, given the expression `a * - b`, there are two possible parses:
+
+-   As `a * (- b)`, multiplying `a` by the negation of `b`.
+-   As `(a *) - b`, subtracting `b` from the pointer type `a *`.
+
+Our chosen rule to distinguish such cases is to consider the presence or absence
+of whitespace, as we think this strikes a good balance between simplicity and
+expressiveness for the programmer and simplicity and good support for error
+recovery in the implementation. `a * -b` uses the first interpretation, `a* - b`
+uses the second interpretation, and other combinations (`a*-b`, `a *- b`,
+`a* -b`, `a * - b`, `a*- b`, `a *-b`) are rejected as errors.
+
+In general, we require whitespace to be present or absent around the operator to
+indicate its fixity, as this is a cue that a human reader would use to
+understand the code: binary operators have whitespace on both sides, and unary
+operators lack whitespace between the operator and its operand. We also make
+allowance for omitting the whitespace around a binary operator in cases where it
+aids readability to do so, such as in expressions like `2*x*x + 3*x + 1`: for an
+operator with whitespace on neither side, if the token immediately before the
+operator indicates it is the end of an operand, and the token immediately after
+the operator indicates it is the beginning of an operand, the operator is
+treated as binary.
+
+We define the set of tokens that constitutes the beginning or end of an operand
+as:
+
+-   Identifiers, as in `x*x + y*y`.
+-   Literals, as in `3*x + 4*y` or `"foo"+s`.
+-   Brackets of any kind, facing away from the operator, as in `f()*(n + 3)` or
+    `args[3]*{.real=4, .imag=1}`.
+
+For error recovery purposes, this rule functions best if no expression context
+can be preceded by a token that looks like the end of an operand and no
+expression context can be followed by a token that looks like the start of an
+operand. One known exception to this is in function definitions:
+
+```
+fn F(p: Int *) -> Int * { return p; }
+```
+
+Both occurrences of `Int *` here are erroneous. The first is easy to detect and
+diagnose, but the second is more challenging, if `{...}` is a valid expression
+form. We expect to be able to easily distinguish between code blocks starting
+with `{` and expressions starting with `{` for all cases other than `{}`.
+However, the code block `{}` is not a reasonable body for a function with a
+return type, so we expect errors involving a combination of misplaced whitespace
+and `{}` to be rare, and we should be able to recover well from the remaining
+cases.
+
+From the perspective of token formation, the whitespace rule means that there
+are four _variants_ of each symbolic token:
+
+-   A symbolic token with whitespace on both sides is a _binary_ variant of the
+    token.
+-   A symbolic token with whitespace on neither side, where the preceding token
+    is an identifier, literal, or closing bracket, and the following token is an
+    identifier, literal, or opening bracket, is also a _binary_ variant of the
+    token.
+-   A symbolic token with whitespace on neither side that does not satisfy the
+    preceding rule is a _unary_ variant of the token.
+-   A symbolic token with whitespace on the left side only is a _prefix_ variant
+    of the token.
+-   A symbolic token with whitespace on the right side only is a _postfix_
+    variant of the token.
+
+When used in non-operator contexts, any variant of a symbolic token is
+acceptable. When used in operator contexts, only a binary variant of a token can
+be used as a binary operator, only a prefix or unary variant of a token can be
+used as a prefix operator, and only a postfix or unary variant of a token can be
+used as a postfix operator.
+
+This whitespace rule has been
+[implemented in the Carbon toolchain](https://github.com/carbon-language/carbon-lang/pull/576)
+for all operators by tracking the presence or absence of trailing whitespace as
+part of a token, and
+[in executable semantics](https://github.com/carbon-language/carbon-lang/commit/04d3a885ae01a779aadb19f51ec7a5a12ffe295c)
+for the `*` operator by forming four different token variants as described
+above.
+
+The choice to disallow whitespace between a unary operator and its operand is
+_experimental_.
+
+## Rationale based on Carbon's goals
+
+-   Software and language evolution
+
+    -   By not allowing user-defined operators, we reduce the possibility that
+        operators added to the language later will conflict with existing uses
+        in programs. Due to the use of a max munch rule, we might add an
+        operator that causes existing code to be interpreted differently, but
+        such problems will be easy to detect and resolve, because we know the
+        operator set in advance.
+
+-   Code that is easy to read, understand, and write
+
+    -   The fixed operator set means that developers don't need to understand an
+        unbounded and extensible number of operators and precedence rules. The
+        fixed operator set encourages functionality that does not correspond to
+        a well-known operator symbol to be exposed by way of a named operation
+        instead of a symbol, improving readability among developers not familiar
+        with a codebase.
+    -   Requiring whitespace to be used consistently around operators reduces
+        the possibility for confusing formatting.
+    -   Permitting whitespace on either both sides of a binary operator or on
+        neither side allows expressions such as `2*x*x + 3*x + 1` to use the
+        absence of whitespace to improve readability. Because the language
+        officially sanctions both choices, the formatting tool can be expected
+        to preserve the user's choice.
+    -   The choice to lex the longest known symbolic token rather than the
+        longest sequence of symbolic characters makes it easier to write
+        expressions involving a series of prefix or postfix operators, such as
+        `x = -*p;`.
+
+-   Interoperability with and migration from existing C++ code
+
+    -   The fixed operator set makes a mapping between Carbon operators and C++
+        operators easier, by avoiding any desire to map arbitrary user-defined
+        Carbon operators into a C++ form.
+    -   The choice of a fixed operator set and a "max munch" rule will be
+        familiar to C++ developers, as it is the same approach taken by C++.
+    -   The whitespace rule permits the `*` operator to be used for all of
+        multiplication, dereference, and pointer type formation, as in C++,
+        while still permitting Carbon to treat type expressions as expressions.
+
+## Alternatives considered
+
+We could lex the longest sequence of symbolic characters rather than lexing only
+the longest known operator.
+
+Advantages:
+
+-   Adding new operators could be done without any change to the lexing rules.
+-   If unknown operators are rejected, adding new operators would carry no risk
+    of changing the meaning of existing valid code.
+
+Disadvantages:
+
+-   Sequences of prefix or postfix operators would require parentheses or
+    whitespace. For example, `Int**` would lex as `Int` followed by a single
+    `**` token, and `**p` would lex as a single `**` token followed by `p`, if
+    there is no `**` operator. While we could define `**`, `***`, and so on as
+    operators, doing so would add complexity and inconsistency to the language
+    rules.
+
+We could support an extensible operator set, giving the developer the option to
+add new operators.
+
+Advantages:
+
+-   This would increase expressivity, especially for embedded domain-specific
+    languages.
+
+Disadvantages:
+
+-   This would harm readability, at least for those unfamiliar with the code
+    using the operators.
+-   This could harm our ability to evolve the language, by admitting the
+    possibility of a custom operator colliding with a newly-introduced standard
+    operator, although this risk could be reduced by providing a separate
+    lexical syntax for custom operators.
+-   We would need to either lex the longest sequence of symbolic characters we
+    can, which has the same disadvantage discussed for that approach above, or
+    use a more sophisticated rule to determine how to split operators -- perhaps
+    based on what operator overloads are in scope -- increasing complexity.
+
+We could apply different whitespace restrictions or no whitespace restrictions.
+See [#520](https://github.com/carbon-language/carbon-lang/issues/520) for
+discussion of the alternatives and the leads' decision.
+
+We could require whitespace around a binary operator followed by `[` or `{`. In
+particular, for examples such as:
+
+```
+fn F() -> Int*{ return Null; }
+var n: Int = pointer_to_array^[i];
+```
+
+... this would allow us to form a unary operator instead of a binary operator,
+which is likely to be more in line with the developer's expectations.
+
+Advantages:
+
+-   Room to add a postfix `^` dereference operator, or similarly any other
+    postfix operator producing an array, without creating surprises for pointers
+    to arrays.
+-   Allows the whitespace before the `{` of a function body to be consistently
+    omitted if desired.
+
+Disadvantages:
+
+-   The rule would be more complex, and would be asymmetric: we must allow
+    closing square brackets before unspaced binary operators to permit things
+    like `arr[i]*3`.
+-   Would interact badly with expression forms that begin with a `[` or `{`, for
+    example `Time.Now()+{.seconds = 3}` or `names+["Lrrr"]`.
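
The max munch rule and the four whitespace variants described in the proposal can be sketched outside Carbon. The Python below is an illustration only, not the toolchain or executable semantics implementation. The names `lex` and `variants` are hypothetical. The `+` and `-` tokens are added to the proposal's initial list only so that its own examples lex. The start and end of input are assumed to count as whitespace. Words and integer literals are both treated as operand tokens, and string literals are not handled.

```python
import re

# Initial symbolic tokens from the proposal, plus "+" and "-", which are
# ASSUMED here only so that the proposal's own examples can be lexed.
SYMBOLS = ["(", ")", "{", "}", "[", "]", ",", ".", ";", ":",
           "*", "&", "=", "->", "=>", "+", "-"]
OPENING, CLOSING = {"(", "[", "{"}, {")", "]", "}"}
BRACKETS = OPENING | CLOSING
# Simplification: identifiers/keywords and integer literals only.
WORD_OR_LITERAL = re.compile(r"[A-Za-z_][A-Za-z0-9_]*|[0-9]+")


def lex(source):
    """Split source into (text, is_symbol, ws_before, ws_after) tuples."""
    tokens, pos = [], 0
    while pos < len(source):
        if source[pos].isspace():
            pos += 1
            continue
        match = WORD_OR_LITERAL.match(source, pos)
        if match:
            text, is_symbol = match.group(), False
        else:
            # Max munch: take the longest known symbolic token at this position.
            candidates = [s for s in SYMBOLS if source.startswith(s, pos)]
            if not candidates:
                raise ValueError(f"unknown character {source[pos]!r}")
            text, is_symbol = max(candidates, key=len), True
        end = pos + len(text)
        # ASSUMPTION: the start and end of input count as whitespace.
        ws_before = pos == 0 or source[pos - 1].isspace()
        ws_after = end == len(source) or source[end].isspace()
        tokens.append((text, is_symbol, ws_before, ws_after))
        pos = end
    return tokens


def ends_operand(token):
    """Identifier, literal, or closing bracket."""
    text, is_symbol, _, _ = token
    return not is_symbol or text in CLOSING


def starts_operand(token):
    """Identifier, literal, or opening bracket."""
    text, is_symbol, _, _ = token
    return not is_symbol or text in OPENING


def variants(source):
    """Return (text, variant) for each non-bracket symbolic token."""
    tokens = lex(source)
    result = []
    for i, (text, is_symbol, ws_before, ws_after) in enumerate(tokens):
        if not is_symbol or text in BRACKETS:
            continue
        if ws_before and ws_after:
            variant = "binary"
        elif ws_before:
            variant = "prefix"
        elif ws_after:
            variant = "postfix"
        else:
            # No whitespace on either side, so both neighbors exist.
            is_binary = (ends_operand(tokens[i - 1])
                         and starts_operand(tokens[i + 1]))
            variant = "binary" if is_binary else "unary"
        result.append((text, variant))
    return result


for example in ["a * -b", "a* - b", "a*-b", "2*x*x + 3*x + 1"]:
    print(example, "->", variants(example))
```

Under these assumptions, `a * -b` yields a binary `*` and a prefix `-`, while `a* - b` yields a postfix `*` and a binary `-`. In `a*-b` both tokens come out as unary variants, so neither can serve as the binary operator, which is consistent with the proposal rejecting that spelling.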