Skip to content
This repository has been archived by the owner on Aug 31, 2023. It is now read-only.

☂️ AST Façade Improvements #1725

Closed
43 of 47 tasks
MichaReiser opened this issue Oct 29, 2021 · 11 comments · Fixed by #1719 or #1767
Closed
43 of 47 tasks

☂️ AST Façade Improvements #1725

MichaReiser opened this issue Oct 29, 2021 · 11 comments · Fixed by #1719 or #1767
Assignees
Labels
A-Parser Area: parser umbrella Issue to track a collection of other issues

Comments

@MichaReiser
Copy link
Contributor

MichaReiser commented Oct 29, 2021

Description

RSLint's and Rust-analyzer's child accessors defined on the AST nodes return Option<TNode>. For example, the IfStmt child accessors are defined as follow (implementations omitted for brevity):

impl IfStmt {
	pub fn if_token(&self) -> Option<SyntaxToken>;
	pub fn condition(&self) -> Option<Condition>;
	pub fn cons(&self) -> Option<Stmt>;
	pub fn else_token(&self) -> Option<SyntaxToken>;
	pub fn alt(&self) -> Option<Stmt>;
}

We identified that such an API has two major shortcomings for our use case. Let's explore these using the following example:

if (test)) else {}

RSLint creates the following CST for this script:

  1. The condition of the IfStmt has one right parenthesis too many. RSLint excellently recovers from this error by inserting an Error in the place of the if's consequence. The problem with the API is that accessing if_stmt.cons() returns None because Error isn't a Stmt. This has the advantage that a linter will not bother and try to dive into the body but a formatter needs the Error node to format its content (even if that just means printing it as it was in the original source).
  2. It's not clear from the API what are the mandatory and optional children. Correctly handling the case where a mandatory child is missing is important where, on the other hand, it's safe to skip missing optional children. That's why the API should guide users to handle missing mandatory children.

Because of this, we plan to make the following two changes to the facade:

  • Change the return type for mandatory children to Result<TNode, MissinElementError> (the name and shape of the error are still up for discussion). Changing the return type to Result encodes that this is a mandatory child and guides the user to handle a missing child appropriately.
  • Change our grammar to be explicit about where the parser inserts Error elements to recover from syntax errors. For example, the parser may create an Error element for an unknown statement. That's why the Stmt union should be extended to include a Stmt::Unknown case. Adding the Unknown case allows the facade to return an Error element in places of Stmts because it's now part of the Stmt union.
    Designing the grammar requires finding the right balance between recover-granularity and API ergonomics. For instance, the parser must generate an Error statement for --; if the grammar only allows errors on the Stmt level but it can parse expression_statement(error) if errors are also allowed in places of expressions. However, allowing errors in too many places increases the complexity of writing CST passes because they must be handled by the authors.
    The places where a parser wants to insert errors to recover from syntax errors are language-specific.
  • We may want to split the Error syntax kind into different Error kinds, for example, UnkownStmt, UnknownExpr, and so on. Splitting the Errors into different kinds allows to query for a specific Unknown kind and implement custom behaviour per kind using methods or implementing the same trait differently for each kind.

The definition of the IfStmt and Stmt changes as follows:

impl IfStmt {
	pub fn if_token(&self) -> **MandatoryElementResult<SyntaxToken>**;
	pub fn condition(&self) -> **MandatoryElementResult<Condition>**;
	pub fn cons(&self) -> **MandatoryElementResult<Stmt>**; // union includes the `UnknownStmt`. 
	pub fn else_clause(&self) -> Option<ElseClause>;
}

enum Stmt {
	If(IfStmt),
	Empty(EmptyStmt),
	...,
	**Unknown(UnknownStmt)**,
}

// other cases...
// 
// import { %%%% } from "./foo"
// Import { namedImports: [ImportSpecifier { binding: Err }] }
//
// let { %%% } = init
// ObjectBinding { properties: [BindingPropertyIdentifier { binding: Err }] }
//
// class { %%% } 
//
// let id = %%%
// VariableDeclaration { declarators: [VariableDeclarator { id: Id, init: Some(Expr::UnknownExpression) }

Part of #1718

AST Facade for lists

Introduce the AstList<N> and AstSeparatedList<N> and change the codegen to reduce an AstList for T*and an AstSeparatedList for (T (',' T)* ','?) where ',' could be any token. See this discussion for more details.

Replace Error with Unknown* nodes

A possible way of defining the Unknown* nodes in Ungrammar could look as follow:

SyntaxNode = SyntaxNode
SyntaxToken = SyntaxToken
SyntaxElement = SyntaxNode | SyntaxToken

JsUnknownStatement = SyntaxElement*
JsUnknownExpression = SyntaxElement*
JsUnknownPattern = SyntaxElement*
JsUnknownMember = SyntaxElement*
JsUnknownBinding = SyntaxElement*
JsUnknownAssignmentTarget = SyntaxElement*

Change return type of mandatory nodes

The ungrammar encodes the optionality of the children and the codegen should take that into consideration when generating the field accessors.

Mandatory children should now return Results instead of Option because it's an error if they're missing. Following a draft of the API

type SyntaxResult<T> = Result<T, SyntaxError>;

// Do we need the enum?
enum SyntaxError {
  MissingElement; // Should we store the kind? 
}

impl IfStmt {
  pub fn cons(&self) -> SyntaxResult<Stmt> {
    match self.get_child(5) {
      None => Err(SyntaxError::MissingElement),
      Some(element) => Ok(Stmt::cast(element).expect("Expected a 'Stmt' in the 'cons' slot of an if stmt'")) // Create helper on AstNode that creates a nice error message?
    }
  }
}

Codegen extensions

  • Add support for labels so that we can influence how the child accessors are named. For example, we want to name the left and right expressions of BinaryExpression left and right and not expression and expression. Ungrammar supports the label: Type` syntax. All we need to do is to respect the naming when generating the fields.

  • Implode alternatives of tokens #1741
    Support inline token-union types. For example, the binary expression operator can be any of ==, +, -, .... That's why it's defined as operator: ( '==' | '+' | '-' ...). The source gen should only generate a single field for the operator that returns a SyntaxToken

  • Flatten unions of unions during the code generation #1743
    It's convenient to define unions in respect of other unions. For example, JsClassMemberName = JsObjectMemberName | JsPrivateClassMemberName where JsObjectMemberName is an union type as well. This is convenient to maintain the grammar but makes the Facade more awkward to use because it requires two matches: first on the outer union and then on the inner union. We can avoid this if we flatten unions inside of the source gen and automatically generate From<InnerUnion> implementations.

New AST Structure and naming

Requirements to AST

Documentation the AST

  • Documenting error recovery general rules
  • Documenting specific error recoveries in nodes
  • Listing differences to Rome Classic AST (no FunctionHead etc) @ematipico
@MichaReiser MichaReiser added the umbrella Issue to track a collection of other issues label Oct 29, 2021
@MichaReiser
Copy link
Contributor Author

@ematipico can you take a look at the codegen extensions and tick off any that you've already implemented as part of #1722

@MichaReiser MichaReiser linked a pull request Oct 29, 2021 that will close this issue
@ematipico
Copy link
Contributor

In #1715, I created a README.md where I wanted to write down all the conventions that we follow to write the grammar. This will be really helpful for future contributions.

I am going to prepare now a list of differences in the AST that I found.

@ematipico
Copy link
Contributor

ematipico commented Nov 1, 2021

Regarding the SyntaxResult bit, I am thinking what we could possible have other then something missing (MissingResult).

Maybe we could have SyntaxResult::Unknown in case we find some Unknown* syntax kind?

@MichaReiser
Copy link
Contributor Author

Regarding the SyntaxResult bit, I am thinking what we could possible have other then something missing (MissingResult).

Maybe we could have SyntaxResult::Unknown in case we find some Unknown* syntax kind?

The approach we decided on uses ‘Unknown *’ nodes that are part of the corresponding union type. But there might be non_unique, as you outlined for the default clause tough I’m not sure if we should just parse this as an unknown case

@ematipico
Copy link
Contributor

OK. I will start the implementation with an enum, and we can revisit it later if we think it's not needed.

@MichaReiser
Copy link
Contributor Author

Do we already have a solution for the union types? Would be nice to get the code gen as close to done as possible

@ematipico
Copy link
Contributor

I think the best approach is to explode the enum during the code gen.

Let's take this example:

Foo = 'foo' 
Bar = 'bar'
Lorem = 'lorem'
Ipsum = 'ipsum'
AltOne = Foo | Bar
AltTwo = Lorem | Ipsum


MyNode = AltOne | AltTwo

In this example MyNode will result into an enum that contains two enum. As we said before, enum are just a convenience inside the code, and the green tree will only contain the kind that matter. In this example we might have [email protected] "foo", [email protected] "bar", etc.

During the code gen we can detect enums of enum and make a substitution and explode the enum AltOne and AltTwo in respectively Foo | Bar and Lorem | Ipsum , resulting into something like this.

// before
impl AstNode for MyNode {
	fn cast(syntax: SyntaxNode) -> Option<Self> {
		// this here would fail because kind is "FOO"
		let res = match syntax.kind() {
			ALT_ONE => => MyNode::AltOne(AltOne { syntax }),
			ALT_TWO => => MyNode::AltTwo(AltTwo { syntax }),
			_ => None
		}
	}
}

// after
impl AstNode for MyNode {
	fn cast(syntax: SyntaxNode) -> Option<Self> {
		// this here would fail because kind is "FOO"
		let res = match syntax.kind() {
			FOO => => MyNode::Foo(Foo { syntax }),
			BAR => => MyNode::Bar(Bar { syntax }),
			// lorem and ipsum too
			_ => None
		}
	}
}

@MichaReiser
Copy link
Contributor Author

That makes sense. I was initially worried that it will make it hard to convert a statement to a declaration and then pass it to a function but that’s easy, just call ‘Decl::cast(stmt)’

@MichaReiser
Copy link
Contributor Author

MichaReiser commented Nov 2, 2021

Support inline token-union types. For example, the binary expression operator can be any of ==, +, -, .... That's why it's defined as operator: ( '==' | '+' | '-' ...). The source gen should only generate a single field for the operator that returns a SyntaxToken

This is not yet implemented. For example, the code gen generates

impl BinExpr {
	pub fn l_angle_token(&self) -> Option<SyntaxToken> { support::token(&self.syntax, T ! [<]) }
	pub fn r_angle_token(&self) -> Option<SyntaxToken> { support::token(&self.syntax, T ! [>]) }
	pub fn less_than_equal_token(&self) -> Option<SyntaxToken> {
		support::token(&self.syntax, T ! [<=])
	}
	pub fn greater_than_equal_token(&self) -> Option<SyntaxToken> {
		support::token(&self.syntax, T ! [>=])
	}
	pub fn equality_token(&self) -> Option<SyntaxToken> { support::token(&self.syntax, T ! [==]) }
	pub fn strict_equality_token(&self) -> Option<SyntaxToken> {
		support::token(&self.syntax, T ! [===])
	}
	pub fn inequality_token(&self) -> Option<SyntaxToken> { support::token(&self.syntax, T ! [!=]) }
	pub fn strict_inequality_token(&self) -> Option<SyntaxToken> {
		support::token(&self.syntax, T ! [!==])
	}
	pub fn plus_token(&self) -> Option<SyntaxToken> { support::token(&self.syntax, T ! [+]) }
	pub fn minus_token(&self) -> Option<SyntaxToken> { support::token(&self.syntax, T ! [-]) }
	pub fn star_token(&self) -> Option<SyntaxToken> { support::token(&self.syntax, T ! [*]) }
	pub fn divide_token(&self) -> Option<SyntaxToken> { support::token(&self.syntax, T ! [/]) }
	pub fn reminder_token(&self) -> Option<SyntaxToken> { support::token(&self.syntax, T ! [%]) }
	pub fn exponent_token(&self) -> Option<SyntaxToken> { support::token(&self.syntax, T ! [**]) }
	pub fn left_shift_token(&self) -> Option<SyntaxToken> { support::token(&self.syntax, T ! [<<]) }
	pub fn right_shift_token(&self) -> Option<SyntaxToken> {
		support::token(&self.syntax, T ! [>>])
	}
	pub fn unsigned_right_shift_token(&self) -> Option<SyntaxToken> {
		support::token(&self.syntax, T ! [>>>])
	}
	pub fn amp_token(&self) -> Option<SyntaxToken> { support::token(&self.syntax, T ! [&]) }
	pub fn bitwise_or_token(&self) -> Option<SyntaxToken> { support::token(&self.syntax, T ! [|]) }
	pub fn bitwise_xor_token(&self) -> Option<SyntaxToken> { support::token(&self.syntax, T ! [^]) }
	pub fn nullish_coalescing_token(&self) -> Option<SyntaxToken> {
		support::token(&self.syntax, T ! [??])
	}
	pub fn logical_or_token(&self) -> Option<SyntaxToken> { support::token(&self.syntax, T ! [||]) }
	pub fn logical_and_token(&self) -> Option<SyntaxToken> {
		support::token(&self.syntax, T ! [&&])
	}
	pub fn in_token(&self) -> Option<SyntaxToken> { support::token(&self.syntax, T![in]) }
	pub fn instanceof_token(&self) -> Option<SyntaxToken> {
		support::token(&self.syntax, T![instanceof])
	}
}

instead of

impl BinExpr {
	pub fn operator(&self) -> Option<SyntaxToken> {
		support::token(&self.syntax)
	}
}

@ematipico
Copy link
Contributor

ematipico commented Nov 2, 2021

@MichaReiser I am tackling the issue in your last comment and I believe the implementation would need to be different.

The function support::token needs to know what token to retrieve. I think the final result should be something like this:

impl BinExpr {
	pub fn operator(&self) -> Option<SyntaxToken> {
		// new function find_token
		support::find_token(&self.syntax, [T![&&], T![==]]) // an array of all the possible expected tokens
	}
}

And the function find_token might be something like this:

	pub(super) fn find_token(
		parent: SyntaxNode,
		possible_kinds: &[SyntaxKind],
	) -> Option<SyntaxToken> {
		parent
			.children_with_tokens()
			.filter_map(|it| it.into_token())
			.find(|it| {
				possible_kinds
					.iter()
					.any(|possible_kind| *possible_kind == it.kind())
			})
	}

MichaReiser added a commit that referenced this issue Nov 12, 2021
Refines the AST Facade for literals to match the grammar proposed in #1719. This change is part of #1725.

* Introduce new `JsAnyLiteral`
* Split `Literal` into `JsStringLiteral`, `JsBooleanLiteral`, `JsNullLiteral`, `JsRegexLiteral`, and `JsBigIntLiteral`. This allows to implement custom methods on the corresponding literal nodes.
* Renames the `number` and kinds to `JS_NUMBER_LITERAL_TOKEN` `JS_STRING_LITERAL_TOKEN` to avoid conflicts with the TS `number` keyword (and keep symmetry).
* Removed some unused keywords and special handling inside of the code gen.
MichaReiser added a commit that referenced this issue Nov 12, 2021
Refines the AST Facade for literals to match the grammar proposed in #1719. This change is part of #1725.

* Introduce new `JsAnyLiteral`
* Split `Literal` into `JsStringLiteral`, `JsBooleanLiteral`, `JsNullLiteral`, `JsRegexLiteral`, and `JsBigIntLiteral`. This allows to implement custom methods on the corresponding literal nodes.
* Renames the `number` and kinds to `JS_NUMBER_LITERAL_TOKEN` `JS_STRING_LITERAL_TOKEN` to avoid conflicts with the TS `number` keyword (and keep symmetry).
* Removed some unused keywords and special handling inside of the code gen.
MichaReiser added a commit that referenced this issue Nov 12, 2021
Changes the js grammar of the variable declaration to match our new AST facade as defined in #1725 (and proposed in #1719)

* Split `VarDecl` into `JsVariableDeclarationStatement` and `JsVariableDeclaration` to correctly account for variable declarations inside for loops (that aren't statements).
* Change the parser to emit a `let` token instead of an ident if let is used inside a variable declaration (let is a valid identifier which is why the lexer returns an `ident` token)
* Rename `Declarator` to `JsVariableDeclarator`
* Split out the `init` into a `JsEqualValueClause` that holds the `=` token and the initializer expression.
MichaReiser added a commit that referenced this issue Nov 12, 2021
Refines the AST Facade for literals to match the grammar proposed in #1719. This change is part of #1725.

* Introduce new `JsAnyLiteral`
* Split `Literal` into `JsStringLiteral`, `JsBooleanLiteral`, `JsNullLiteral`, `JsRegexLiteral`, and `JsBigIntLiteral`. This allows to implement custom methods on the corresponding literal nodes.
* Renames the `number` and kinds to `JS_NUMBER_LITERAL_TOKEN` `JS_STRING_LITERAL_TOKEN` to avoid conflicts with the TS `number` keyword (and keep symmetry).
* Removed some unused keywords and special handling inside of the code gen.
MichaReiser added a commit that referenced this issue Nov 12, 2021
Refines the AST Facade for literals to match the grammar proposed in #1719. This change is part of #1725.

* Introduce new `JsAnyLiteral`
* Split `Literal` into `JsStringLiteral`, `JsBooleanLiteral`, `JsNullLiteral`, `JsRegexLiteral`, and `JsBigIntLiteral`. This allows to implement custom methods on the corresponding literal nodes.
* Renames the `number` and kinds to `JS_NUMBER_LITERAL_TOKEN` `JS_STRING_LITERAL_TOKEN` to avoid conflicts with the TS `number` keyword (and keep symmetry).
* Removed some unused keywords and special handling inside of the code gen.
MichaReiser added a commit that referenced this issue Nov 17, 2021
This restructures the AST structure of object expressions and related nodes (members, etc.) and is part of #1725.

* Renames `ObjectExpr` to `JsObjectExpression`
* Renames `props` to `members`
* Renames the `key` of all members to `name`
* Renames `IdentProp` to `JsShorthandPropertyObjectMember` that directly stores a `ident` token
* Renames `LiteralProp` to `JsPropertyMember` and changes the `key` from `Name` to `JsStaticObjectMemberName`
* Introduces `JsGetterObjectMember` (unclear if it should be shared with classes in the future)
* Introduces `JsSetterObjectMember` (unclear if it should be shared with classes in the future)
* Introduces `JsMethodObjectMember` (unclear if it should be shared with classes in the future)
* Renames `SpreadProp` to `JsSpread`

The PR fixes some issues when it comes to `setter/getter` parsing.

The main concern around sharing `Js*Member`s with classes is that objects are more restrictive when it comes to naming because classes support private properties which object members don't. I think the better approach is to introduce a `JsAnyGetter = JsGetterObjectMember | JsGetterClassMember` union that implements helpers to retrieve the shared properties.
MichaReiser added a commit that referenced this issue Nov 18, 2021
This PR refactors the AST structures for class declarations and class expressions as defined in #1725. 

* Class declaration & Expression
  * Rename from `ClassDecl`/`ClassExpr` to `JsClassDeclaration`/`JsClassExpression`
  * Renamed `name` to `id`
  * Introduced a `JsExtendsClause` that wraps the `extends` keyword and the parent class
  * Renamed `body` to `members` and removed the `ClassBody` node
* Constructor
  * Rename from `Constructor` to `JsConstructorClassMember` 
  * Rename `name` to `id`
  * Parse "constructor" and 'constructor' methods as `JsConstructorClassMember`s
* Method
  * Renamed `Method` to `JsMethodClassMember`
* Properties
  * Renamed `ClassProp` to `JsPropertyClassMember`
* Getter
  * Renamed `Getter` to `JsGetterClassMember`
  * Removed the parameter list (getters can't have parameters)
* Setter
  * Renamed `Setter` to `JsSetterClassMember`
  * Replaced the parameter list with a single `parameter` 
* Private properties
  * Deleted the `PrivateProp` and instead introduce a `JsPrivatePropertyClassMemberName` type since all members can be private (except constructors)
* Replaced `JsEmptyStatement` with `JsEmptyMember` (these aren't proper statements)
* TS Accessibility: Rename to access modifier to match typescripts class documentation
* Rename `JsComputedObjectMemberName` and `JsStaticObjectMemberName` to `JsComputedMemberName` and `JsLiteralMember` because they can be shared between classes and objects
* Rename `TsReturnType` to `TsTypeAnnotation` so that it can also be used for properties, parameters, etc.  


This PR also adds a set of new tests for classes to verify the member parsing. I did so because I had to refactor much of the parsing logic to make sense of it (and reduce some repetitive code and fix some parsing issues related to semicolons). 

I further extracted the class parsing from the `decl.rs` into its own `class.rs`. You can see my refactorings if you look at the individual commits.
MichaReiser added a commit that referenced this issue Nov 22, 2021
Refactors the AST tree structure for MemberExpressions as part of #1725 

* Renames `BracketExpr` to `JsComputedMemberExpression`
* Renames `DotExpr` to `JsStaticMemberExpression`
* Introduces the new `JsReferenceIdentifierMember` and `JsReferencePrivateMember` which are references to member names (analog to `JsReferenceIdentifierExpression` that references an identifier and `JsBindingIdentifier` that defines an identifier)
* Merge `PrivatePropAccess` into `JsStaticMemberExpression` (enabled by introducing `JsReferenceMember`
* Introduce `SuperExpression` so that `super.test()` works 
* Add new check that verifies that calling `super()` only works inside of constructor (I leave checking if we're inside of a subclass to another PR).
* Delete `SuperCall` as this can now be modelled using `CallExpr`
* Deleted some no longer used nodes/kinds
@ematipico ematipico pinned this issue Nov 30, 2021
MichaReiser added a commit that referenced this issue Dec 16, 2021
Implementing the shape validation described in #1867 requires that it's possible to tell the required shape of a node by just looking at its kind. This hasn't been possible so far for list nodes that all use the same ` LIST` kind. The kind doesn't expose any information of what the expected type of the child nodes is nor if it is a separated or a node list. 

This PR replaces the generic `LIST` kind with specific list types, for example, it introduces the `JsArrayElementList` for the list storing the elements of an array expression. 

A consequence of having explicit `LIST` nodes is that these should now also implement the `AstNode` interface, they are nodes after all. The consequences of doing so are:

* `AstSeparatedList` and `AstNodeList` must be traits that each specific list implement together with the `AstNode` interface
* It's no longer possible to use a `missing` placeholder to represent an empty list because a concrete node is required to implement `AstNode.syntax()`. 

Overall, the change seems to nicely align with how the rest of the AST facade works and makes list nodes less special, they're just regular nodes that on top implement additional methods that allow iterating over their elements. 


The main downside of this change is that it's now required to implement the `AstSeparatedList` and/or `AstNodeList` traits if someone wants to use any of the list methods. I guess, that's just how Rust works and shouldn't be too much of a surprise.

Part of #1725 #1867
@MichaReiser
Copy link
Contributor Author

Closing this because most AST changes have been implemented. There are some more proposed changes that were discussed in #1868 and are probably worth doing at some point but implementing these changes isn't of the utmost importance at the time being.

The unknown nodes work isn't finished yet. The main outstanding questions are:

  • Do we want to stick to unknown nodes or use an alternative approach, for example, setting a has_diagnostic flag
  • When is a node's kind changed to an unknown kind. Only for grammatical errors or also in cases of semantic errors?

The next steps are discussed in #1848 and I propose to first write a few analyzers to get some real-world experiences with potential issues before making any decision.

@MichaReiser MichaReiser unpinned this issue Jan 19, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A-Parser Area: parser umbrella Issue to track a collection of other issues
Projects
None yet
2 participants