Skip to content
This repository has been archived by the owner on Aug 31, 2023. It is now read-only.

☂️ Rome Parser Architecture Changes #1718

Closed
3 tasks done
jamiebuilds opened this issue Oct 27, 2021 · 9 comments
Closed
3 tasks done

☂️ Rome Parser Architecture Changes #1718

jamiebuilds opened this issue Oct 27, 2021 · 9 comments
Labels
umbrella Issue to track a collection of other issues

Comments

@jamiebuilds
Copy link
Contributor

jamiebuilds commented Oct 27, 2021

Description

Last week, we spent our time discussing a number of different architectural changes that we would like to make to the parser and CST/AST. We also talked to some of the maintainers about RSLint about how we would like to work together in the future.

Immediately, we've already started working on these changes in our locally checked in fork of the RSLint parser, but in the coming weeks/months we're going to collaborate to ensure that the needs of both projects are being met, and in the future Rome will be exporting its own crate to be used in RSLint with collaborators from both projects.

cc @RDambrosio016 @Stupremee

Note: I'll create a separate issue at some point to talk about adding RSLint contributors to the rome/tools repo and what we want the complete public API of the parser crate to be. I think there might be some open questions if we should be providing the AST to RSLint or if using Ungrammar we can allow both projects to generate their own AST optimized for their use cases and only really export a tree-sink. Either way, a separate issue would be better probably.

Store leading/trailing trivia for each token

We explored two options on how to store trivia like spaces, tabs, newlines, single and multiline comments. Let's use the following example to illustrate how RSLint works today and how we want to store trivia moving onwards.

[
	1,
	2, // test
	3
]

RSLint stores the trivia as a token of the kind Whitespace or Comment and attaches it to its enclosing node. The CST for the array example would look like this (whitespace and comments are highlighted):

script
  - array
    - l-bracket
    - whitespace: \n\t       # <<<
    - element
      - literal
        - number_token: 1 
      - comma-token
      - comment: // test     # <<<
    - whitespace: \n\t       # <<<
    - comment: // test       # <<<
    - element
      - literal
        - number_token: 2
      - comma-token
    - whitespace: " "        # <<<
    - comment: // test       # <<<
    - whitespace: \n\t       # <<<
    - element
      - literal
        - number_token: 3
    - whitespace: \n         # <<<
    - r-bracket
  - whitespace: \n           # <<<

There are a number of concerns that we have about this design that we'd like to resolve:

  • It prevents using fixed offsets for accessing a specific child of a node, and, therefore, O(1) lookups for child nodes. The problem is that an unspecified number of trivia may exist between any two non-trivia elements. Normally, the condition of an if statement is the 3rd element: 1st: if keyword, 2nd: whitespace, 3rd: condition. However, this isn't guaranteed as if \n // some comment\ncondition\n... illustrates, where the condition is the 5th element because of the added comment that is followed by a \n.
  • The leading or trailing comments for a node or token can't be accessed using the AST facade. Resolving the comments would require additional helpers that lookup the comments in the parent's children.

That's why we believe that storing leading/trailing trivia on the token is better for our use case. We would use the same rule as Roslyn and Swift to decide to which token a trivia belongs:

  1. A token owns all of its trailing trivia up to, but not including, the next newline character.
  2. Looking backward in the text, a token owns all of the leading trivia up to and including the first newline character.

Applying these rules to our initial array example creates the following CST:

script:
  - array:
    - l-bracket:
      - leading_trivia: ""            # <<<
      - trailing_trivia: ""           # <<<
    - element: # (that's not how RSlint represents array elements today but probably is as how it should)
      - literal:
        - number_token(1):
          - leading_trivia: "\n\t"    # <<<
          - trailing_trivia: ""       # <<<
      - comma-token:
        - leading_trivia: ""          # <<<
        - trailing_trivia: ""         # <<<
    - element:
      - literal:
        - number_token(2):
          - leading_trivia: "\n\t"    # <<<
          - trailing_trivia: ""       # <<<
      - comma-token: 
        - leading_trivia: " "
        - trailing_trivia: "// test"  # <<<
    - element:
      - literal:
        - number_token(3):
          - leading_trivia: "\n\t"    # <<<
          - trailing_trivia: ""       # <<<
    - r-bracket:
      - leading_trivia: "\n"          # <<<
      - trailing_trivia:              # <<<
  - leading_trivia:                   # <<<
  - trailing_trivia: "\n"             # <<<

We should be mindful of:

  • The CST's mutation API to ensure they're explicit about dropping or retaining comments (or other trivia).
  • The CST's structure to ensure that every token can be accessed using the AstNode API. For example, the array expression in RSLint contains the expressions for the elements as well as the commas separating them. The problem is, that it isn't possible to access a specific comma. The CST structure should, therefore, be changed so that elements are wrapped by an ArrayElement node that stores the expression together with an optional trailing comma (array[element[literal, comma], element[literal, comma]] instead of array[literal, comma, literal, comma]).

Allow missing children

RSLint already supports missing children but the decision is listed here to be explicit that this is the behaviour we intended. Let's use the following example where the body of the else branch is missing.

if (test) {
  test
} else 

There are two options to parse the if statement

  • The first option is to parse the whole IfStmt into an UnknownStmt (or Error) because it misses some mandatory children. The UnkownStmt is a simple wrapper node that gives access to its children but otherwise gives no more guarantees about its structure. Parsing this as an UnknownStmt has the benefit that it's known that all children are present if you get any other Stmt but comes at the cost that the parser discards the information that this almost certainly is an if statement. Handling the UnknownStmt in a meaningful way inside of a CST pass would require re-inferring that the UnknownStmt is very likely an IfStmt.
  • Parse the if-statement but leave the else branch missing. This enables CST passes to work correctly on the existing nodes and allows them to make use of the additional knowledge that this is an if statement.

We believe that parsing the example as an IfStmt and leaving the alt branch missing is better for our use case because it conveys the most information of the source code, allowing us to provide better guidance. Analysis that depend on the presence of the alt branch can manually test its existence.

Use fixed slots to store the children

RSLint and rust-analyzer only store nodes or tokens if they've been present in the source file. Let's take the following example of a for statement that only has a condition but misses the init and update nodes.

for (;i < 10;) {
}

RSLint generates the following CST (ignoring trivia):

script:
  - for-stmt:
    - for-keyword
    - l-paren    #// <-- No ForInit after the l-paren
    - semicolon  
    - for-test:
      - binary-expression:
        - name-ref:
          - ident(test)
        - l-angle 
        - literal
          - number(10)
    - semicolon 
    - r-paren
    - block

The downside of not storing missing (even optional) nodes is that accessing a particular field requires iterating over the children of a node to find its position. For example, the for-test is the 4th element if the for-loop has no for-init but is the 5th element otherwise.

The CST can guarantee fixed positions of its children if missing optional or mandatory children are marked as missing inside its children collection. There are multiple ways to do so, some options are:

  • Storing None for missing children
  • Using a fake missing node/token
  • Creating the missing nodes and tokens and flag them as "missing"
  • Use an additional bitmask to encode which nodes are present or missing
script:
  - for-stmt:
    - for-keyword
    - l-paren
    - None # <<<
    - semicolon
    - for-test:
      - binary-expression:
        - name-ref:
          - ident(test)
        - l-angle 
        - literal:
          - number(10)
    - semicolon 
    - None # <<<
    - r-paren
    - block

Storing missing elements increases memory consumption because the data structure must encode the information whatever an element is present or missing. We explored different solutions, including bitflags, to mark present and missing elements to keep the overhead to a minimum.

AST Façade

RSLint's and Rust-analyzer's child accessors defined on the AST nodes return Option<TNode>. For example, the IfStmt child accessors are defined as follow (implementations omitted for brevity):

impl IfStmt {
	pub fn if_token(&self) -> Option<SyntaxToken>;
	pub fn condition(&self) -> Option<Condition>;
	pub fn cons(&self) -> Option<Stmt>;
	pub fn else_token(&self) -> Option<SyntaxToken>;
	pub fn alt(&self) -> Option<Stmt>;
}

We identified that such an API has two major shortcomings for our use case. Let's explore these using the following example:

if (test)) else {}

RSLint creates the following CST for this script:

  1. The condition of the IfStmt has one right parenthesis too many. RSLint excellently recovers from this error by inserting an Error in the place of the if's consequence. The problem with the API is that accessing if_stmt.cons() returns None because Error isn't a Stmt. This has the advantage that a linter will not bother and try to dive into the body but a formatter needs the Error node to format its content (even if that just means printing it as it was in the original source).
  2. It's not clear from the API what are the mandatory and optional children. Correctly handling the case where a mandatory child is missing is important where, on the other hand, it's safe to skip missing optional children. That's why the API should guide users to handle missing mandatory children.

Because of this, we plan to make the following two changes to the facade:

  • Change the return type for mandatory children to Result<TNode, MissinElementError> (the name and shape of the error are still up for discussion). Changing the return type to Result encodes that this is a mandatory child and guides the user to handle a missing child appropriately.
  • Change our grammar to be explicit about where the parser inserts Error elements to recover from syntax errors. For example, the parser may create an Error element for an unknown statement. That's why the Stmt union should be extended to include a Stmt::Unknown case. Adding the Unknown case allows the facade to return an Error element in places of Stmts because it's now part of the Stmt union.
    Designing the grammar requires finding the right balance between recover-granularity and API ergonomics. For instance, the parser must generate an Error statement for --; if the grammar only allows errors on the Stmt level but it can parse expression_statement(error) if errors are also allowed in places of expressions. However, allowing errors in too many places increases the complexity of writing CST passes because they must be handled by the authors.
    The places where a parser wants to insert errors to recover from syntax errors are language-specific.
  • We may want to split the Error syntax kind into different Error kinds, for example, UnkownStmt, UnknownExpr, and so on. Splitting the Errors into different kinds allows to query for a specific Unknown kind and implement custom behaviour per kind using methods or implementing the same trait differently for each kind.

The definition of the IfStmt and Stmt changes as follows:

impl IfStmt {
	pub fn if_token(&self) -> **MandatoryElementResult<SyntaxToken>**;
	pub fn condition(&self) -> **MandatoryElementResult<Condition>**;
	pub fn cons(&self) -> **MandatoryElementResult<Stmt>**; // union includes the `UnknownStmt`. 
	pub fn else_clause(&self) -> Option<ElseClause>;
}

enum Stmt {
	If(IfStmt),
	Empty(EmptyStmt),
	...,
	**Unknown(UnknownStmt)**,
}

// other cases...
// 
// import { %%%% } from "./foo"
// Import { namedImports: [ImportSpecifier { binding: Err }] }
//
// let { %%% } = init
// ObjectBinding { properties: [BindingPropertyIdentifier { binding: Err }] }
//
// class { %%% } 
//
// let id = %%%
// VariableDeclaration { declarators: [VariableDeclarator { id: Id, init: Some(Expr::UnknownExpression) }

Considered Alternatives

Own try type

We decided against running our own Try type because we don't want to depend on nightly language features for such a central piece of our architecture.

Using Result<Option<T>> for mandatory children

enum MandatoryElementError {
  MissingElement,
	Error(ErrorNode),
}

type MandatoryElementResult<T> = Result<T, MandatoryElementError>;
type OptionalElementResult<T> = Result<Option<T>, ErrorNode>;

impl IfStmt {
	pub fn if_token(&self) -> Result<SyntaxToken, MandatoryElementError>;
	pub fn condition(&self) -> Result<Condition, MandatoryElementError>;
	pub fn cons(&self) -> Result<Stmt, MandatoryElementError>; 
	pub fn else_clause(&self) -> Result<Option<ElseClause>, ErrorNode>;
}

We decided to not use this approach because:

  • Many use cases, for example, formatting or even linting, must handle Error elements and only want to break out of the operation if a mandatory node is missing. These passes can't use the try operator on Result<T, MandatoryElementError> because it would break if the element is missing or if it's an error element. They instead must match on the Result and handle all three cases, leading to more verbose code.
    You could argue that the return type for mandatory fields (and optional fields?) should be Option<Result<T, ErrorNode>> instead. Doing so may mitigate this problem but the Option<Result> nesting makes it harder to reason about what the try operator usages. Is it testing for missing or if the node is correct?
    That's why we want to use different Unknown* elements. It avoids the nesting and has the additional benefit that the parent node is "correct", as far as it's concerned, even if, for example, the cons contains an UnknownStmt. The code working with the cons statement can handle the UnknownStmt case.
  • This API requires that the consuming code handles Errors even in places where the parser won't ever insert an error. For example, the parser will never insert an Error in the ElseClause because it's either absent (in which case the Error is attached to the next statement) or present if there's an else token (in which case the Error is attached to the alt. It would be possible to change our syntax generator to take this into consideration but that would increase the complexity of the source code generator and adding UnknownStmt to the Stmt union encodes the same information.
  • It's not immediately clear what the try operator tests for when working on optional nodes because they use a nested Result<Option>. For example, if_stmt.else_clause()? means that the else_clause contains no error node, but another ? is required to get the actual node.
  • It requires extra care that Errors are not accidentally filtered out or implicitly skipped. For example, would Errors be returned when iterating over all Stmts in a program if that specific Error appears in a position of a Stmt ?

The main advantage of this approach is that the parser has more flexibility when it comes to error recovering because the parser is free to insert Error elements in any place. This can somewhat be mitigated by introducing new Unknown nodes if it's important to get better error recovery.

A second advantage is that there are two different types:

  • Stmt represents a valid statement
  • Result<Stmt, ErrorNode> represents a potentially invalid statement

Having two different types can help to do the error handling in a central place and then assume valid statements in the rest of the code.

Insert missing mandatory children

This is the approach taken by Roslyn. It inserts any missing child into the AST but marks it as missing and uses explicit Unkonw* nodes.

/// <summary>
/// Determines whether the node represents a language construct that was actually parsed
/// from the source code. Missing nodes are generated by the parser in error scenarios to
/// represent constructs that should have been present in the source code in order to
/// compile successfully but were actually missing.
/// </summary>
public bool IsMissing
{
    get
    {
        return this.Green.IsMissing;
    }
}

The main advantage of this approach is that the accessors for mandatory children don't have to return a Result nor an Option.

impl IfStmt {
	pub fn if_token(&self) -> SyntaxToken;
	pub fn condition(&self) -> Condition;
	pub fn cons(&self) -> Stmt; 
	pub fn else_clause(&self) -> Option<ElseClause>;
}

We decided against this approach because users must know that they may have to explicitly check if the node was missing. The API doesn't guide them to handle the missing case and decide what the appropriate behaviour is.

Resources

Action Items

  • @xunilrj Update with details and links to trivia-in-tokens work
  • @ematipico Update with details and links to Ungrammar work
  • @MichaReiser Update with details and links to AST Facade work

PS -- @Stupremee and @RDambrosio016 we're happy to talk more about these changes, these are just where we landed in our own discussions, but we'd really value any feedback you could provide.

@jamiebuilds jamiebuilds added the umbrella Issue to track a collection of other issues label Oct 27, 2021
@RDambrosio016
Copy link

I agree with most of these changes, i dont however think that formatters should even attempt to format invalid syntax, it is probably unhelpful in most cases, it complicates the formatter, and is likely to yield horrible results for somewhat broken code, unless the error recovery is basically perfect, which is hard to achieve, especially in js/ts.

@MichaReiser
Copy link
Contributor

@RDambrosio016 I share your concerns. Formatting invalid code is challenging and there's a high risk that we mess the code up rather than improve the formatting. Having something to play around with will show us how well full formatting, partial formatting (before any error, only format whitespace, ...), etc. work. Going from full to not formatting files with syntax errors shouldn't be a significant change (it mainly means opting out earlier).

Regarding Use fixed slots to store the children. Something we may have overlooked is how to handle lists inside of nodes. The difficulty with lists is that they may contain an arbitrary number of elements, making it impossible to use fixed offsets if these children are stored inline.

One option I see is that we create artificial nodes for lists, e.g. ImportSpecificers, JsArguments, etc. However, that is probably rather limiting.

Another alternative that might be to build on top of GreenChild and extend it so that it stores one of Token, Node, List, and Missing (except if we model Missing with bitflags). I would probably rename Child to Slot because it gives us virtual slots over children. The downside is, that the list's children must be allocated on the heap, requiring one more de-referencing.

We may be able to model slots more cleverly to avoid that additional heap allocation but I haven't found a way how to do that yet (Ideally, we have two lists, a list with slots and a list with children: Downside, getting the static offset of a child requires calculating the offset from all slots before the slot you're looking for).

@ematipico
Copy link
Contributor

Regarding Use fixed slots to store the children. Something we may have overlooked is how to handle lists inside of nodes. The difficulty with lists is that they may contain an arbitrary number of elements, making it impossible to use fixed offsets if these children are stored inline.

That's what I thought too. My understanding was to store a fixed number of children only nodes that have a predetermined number of sub nodes. For example, a JsIfStatement has two children: JsIfBlock (mandatory) and JsElseBlock (optional). In this case it's valid to have access to the child nodes using offsets.

Although we can't do this with, for example, for JsSwitch, where it can have an infinite number of JsCaseClause.

@MichaReiser
Copy link
Contributor

MichaReiser commented Oct 27, 2021

@ematipico exactly.

What I started outlining in the previous comment with slots could be implemented with a node format as follow, so that we can store the children inline (downside, paying an additional 8 bytes per slot)

kind | Slice<arity, usize> | Slice<child_count, GreenChild> 

A slot is a number specifying how many children it contains. This count could also be used to model missing (count = 0). The children offset can then be calculated by adding the counts of all slots up to the requested slot. I don't think this is ideal because it's still a linear search.

That's why I took another slot at Roslyn and the way they implement it is that SyntaxList (list of children) implements GreenNode. So, a child can be a Node, Token, or a List (what I proposed above). I guess that makes sense and all it would require is that the parser signals the tree sink when a sub-list starts and ends, so that the sink can allocate a GreenList (a GreenList is what GreenNode currently stores, a length and a list of Arc ptrs). I think this is certainly worth trying out and doesn't seem to be too much effort.

@xunilrj
Copy link
Contributor

xunilrj commented Oct 27, 2021

about trivia: #1720

@MichaReiser
Copy link
Contributor

I believe the approach taken by Roslyn is beneficial besides just getting absolute offsets. Grouping fields with many children in a GreenList would allow us

  • Build specialised views on top of GreenList. For example, a SeparatedList. It's a list over nodes separated by tokens (e.g. arrays, object properties).
  • Expose additional mutation APIs for lists. For example, the SeparatedList can automatically insert the separator tokens for you.

A GreenList can either be its own type or we can use Roslyn's approach and re-use a green node and give it a specific syntax kind.

@MichaReiser
Copy link
Contributor

I think I finally got my head around how this could work:

  • Introduce a new SyntaxKind::List
  • The parser generates a list node whenever the grammar has a T* or (...)* rule (the latter is required for lists that are separated by a token). That's all the changes needed in the parser and sink
  • Create an AstList<N> which is a lightweight view over a SyntaxNode that guarantees that the children are nodes of type N. It mainly exposes a different API (also limited) compared to the SyntaxNode. Possible operations are iter() -> Iterable<N>, len() , as well as mutation APIs.
  • Create an AstSeparatedList<N> which is a special list representing nodes that are separated by a specific token. Array elements are one use case for a separated list. The separated list is also a thin wrapper around a syntax node and defines operations like iter() -> Iterable<Element<N>> nodes() -> Iterable<Option<N>>, separators() -> Iterable<SyntaxToken>, trailing_separator() -> Option<SyntaxToken>, etc.
  • The source gen uses AstList if the grammar defines a T* child, and AstSeparatedList when the grammar has a (T (',' T)* ','?) child

One thing left to verify is if SyntaxNodes are aware of their absolute position. That would be problematic because it requires to compute the full span of all preceding children... (which requires iterating over all children)

@MichaReiser
Copy link
Contributor

I created a "discussion" to discuss the changes around storing children in fixed slots. #1721

@MichaReiser
Copy link
Contributor

The parser architecture changes have been completed. We can track the formatter in its own task (#1726)

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
umbrella Issue to track a collection of other issues
Projects
None yet
Development

No branches or pull requests

5 participants