Modern Parser Generator
Hi! During the last couple of years, I’ve spent a lot of time writing parsers and parser generators, and I want to write down my thoughts about this topic. Specifically, I want to describe some properties of a parser generator that I would enjoy using. Note that this is not an “introduction to parsing” blog post, some prior knowledge is assumed.
Why do I care about this at all? The broad reason is that today a lot of tools and even most editors use regular expressions to approximately parse programming languages, and I find this outright b҉a͡rb̢ari͞c͘. I understand that in practice parsing is not as easy as it is in theory:
Law: You can’t check code you can’t parse. Checking code deeply requires understanding the code’s semantics. The most basic requirement is that you parse it. Parsing is considered a solved problem. Unfortunately, this view is naïve, rooted in the widely believed myth that programming languages exist.
A Few Billion Lines of Code Later
However, I do believe we could do better if we use better tools!
The specific reason is that I care way too much about the Rust programming language and
I think today it is the best language for writing compiler-like stuff (yes, better than OCaml!),
I’d love to see an awesome parser generator written in and targeting Rust,
I want to write a Rust parser in a slightly better way. I’ve done it twice already :) (update: thrice)
I’ve used various parser generators, implemented one, fall, and still haven’t met a parser generator that I love.
The post is split into three major chapters:
UX — how to make using a parser generator easy, enjoyable and fun?
API — what API the generated parser should have.
Parsing Techniques — how exactly do we get from text to the parsed tree?
I’ll be using a rather direct and assertive language in the following, but the fact is I am totally not sure about anything written here, and would love to know more about alternatives!
Although this text is written in Emacs, I strongly believe that semantic-based, reliable, and fast support from tooling is a great boon to learnability and productivity. Great IDE support is a must for a modern parser generator, and this chapter talks mostly about IDE-related features.
The most important productivity boost of a parser generator is the ability to fiddle with the grammar interactively. The UI for this might look like a three-pane view, where the grammar is in the first pane, example code to parse is in the second pane and the resulting parse tree is in the third one. Editing the first two panes should reactively update the last one. This is difficult to implement with most yacc-like parser generators; I’ll talk more about it in the next section.
The second most important feature is inline tests: for complex grammars it could be really hard to map from a particular rule specification to actual code that is parsed by the rule. Having a test written alongside the rule is invaluable! The test should be just a snippet of code in the target language. The “gold” value of the parse tree for the snippet should be saved in the file alongside the grammar and should be updated automatically when the grammar changes. Having inline tests makes it possible to fit the “three pane UI” from the previous section into two panes, because you can just use the test as your second pane.
Here’s a video that shows how it works in fall: https://youtu.be/gb1MJnTcvds.
Note that even if you write your parser by hand, you should still use such “inline tests”. To do so, write them as comments with special markers, and write a small script which extracts such comments and turns them into tests proper. Here’s an example from one experimental hand-written parser of mine. Having such examples of “what does this if parse?” greatly simplifies reading the parser’s code!
Here’s the list of important misc IDE features, from super important to very important. They are not specific to parser generators, so, if you are using a parser generator to implement IDE support for your language, look into these first!
Extend selection to the enclosing syntactic structure (and not just to a braced block). A super simple feature, but this combined with multiple cursors is arguably more powerful than vim’s text objects, and most definitely easier to use.
Fuzzy search of symbols in the current file/in the project: super handy for navigation, both more important and easier to implement than goto definition.
Precise syntax highlighting. Highlighting is not a super-important feature and actually works ok even with regex approximations, but if you already have the syntax tree, then why not use it?
Go to definition/find references.
Errors and warnings inline, with fixes if available.
Extract rule refactoring, pairs well with extend selection.
Code formatting.
Smart typing: indenting code on Enter, adding/removing trailing commas when joining/splitting lines, and in general automagically fixing punctuation.
Code completion: although for parser generators dumb word-based completion tends to work OK.
Here’s a short demo of some of these features in fall: https://youtu.be/WRWmwfBLf7o.
I want to emphasize that most of these features are ridiculously easy to implement, if you have a parse tree for your language. Take, for example, “fuzzy search of symbols in the project”. This is a super awesome feature for navigation. Basically, it is CTAGS done right: first, you parse each file (in parallel) and build a list of symbols for it. Then, as the user types, you incrementally update the changed files. Using fall, I’ve implemented this feature for Rust, and it took me three small files:
find_symbols.rs to extract symbols from a single file, 21(!) lines.
indxr.rs, a generic infra to watch files for changes and recompute the index incrementally, 155 lines.
symbol_index.rs glues the previous two together, and adds fst by the ever-awesome BurntSushi on top for fuzzy search, 122 lines.
This is actually practical: initial indexing of the rust-lang/rust repo takes about 30 seconds using a single core and fall’s ridiculously slow parser, and after that everything just works:
https://youtu.be/KyUUDcnOvUw
A small note on how to pack all this IDE functionality: make a library. That way, anyone could use it anywhere. For example, as a web-assembly module in the online version. On top of the library you could implement whatever protocol you like, Microsoft’s LSP, or some custom one. If you go the protocol-first way, using your code outside of certain editors could be harder.
Traditionally, parser generators work by allowing the user to specify custom code for each rule, which is then copy-pasted into the generated parser. This is typically used to construct an abstract syntax tree, but could be used, for example, to evaluate arithmetic expressions during parsing.
I don’t think this is the right API for a parser generator, though, for three reasons.
It feels like a layering violation because it allows intermixing parsing with basically everything else. You can literally do code generation during parsing. It makes things like the lexer hack possible.
It would be very hard to implement reactive rendering of the parse tree if the result of parsing is some user-defined type.
Most importantly, I don’t think that producing an abstract syntax tree as a result of parsing is the right choice. The problem with an AST is that it, by definition, loses information. The most commonly lost things are whitespace and comments. While they are not important for a command-line batch compiler, they are crucial for IDEs, which work very close to the original source code. Another important IDE-specific aspect is support for incomplete code. If a function is missing a body and a closing parenthesis on the parameter list, it should still be recognized as a function. It’s difficult to support such missing pieces in a traditional AST.
I am pretty confident that a better API for the generated parser is to produce a parse tree which losslessly represents both the input text and associated tree structure. Losslessness is a very important property: it guarantees that we could implement anything in principle.
I’ve outlined one possible design of such lossless representation in the libsyntax2 RFC, the simplified version looks like this:
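(Roughly, in illustrative Rust; the RFC’s actual definitions differ in detail and the names here are made up.)

struct File {
    text: String,
    root: Node,
}

struct Node {
    // What syntactic construct this is: a function definition, a parameter, a comment, ...
    kind: NodeKind,
    // The region of `text` covered by this node, as start and end byte offsets.
    range: (usize, usize),
    // Child nodes; a token is simply a node without children.
    children: Vec<Node>,
}

// In a real implementation this would be a proper enum or an interned id.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct NodeKind(u16);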
That is, the result of parsing is a homogeneous tree, with nodes having two bits of information besides the children:
Type of a node: is it a function definition, a parameter, a comment?
Region of the source text covered by the node.
A cool thing about such a representation is that every language uses the same type of syntax tree. In fall, features like extend selection are implemented once and work for all languages.
If you need it, you can do the conversion to AST in a separate pass. Alternatively, it’s possible to layer an AST on top of the homogeneous tree, using newtype wrappers over the untyped nodes.
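A minimal sketch of such a wrapper, reusing the Node type from the snippet above (FnDef, FN_DEF and NAME are illustrative names, not fall’s actual API):

const FN_DEF: NodeKind = NodeKind(1); // illustrative kind values
const NAME: NodeKind = NodeKind(2);

// A typed, zero-cost view over an untyped homogeneous node.
struct FnDef<'f>(&'f Node);

impl<'f> FnDef<'f> {
    fn cast(node: &'f Node) -> Option<FnDef<'f>> {
        if node.kind == FN_DEF { Some(FnDef(node)) } else { None }
    }

    // Accessors just search the children of the underlying node.
    fn name(&self) -> Option<&'f Node> {
        self.0.children.iter().find(|child| child.kind == NAME)
    }
}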
The parser generator should automatically generate such AST wrappers. However, it shouldn’t directly infer them from the grammar: not every node kind needs an AST wrapper, and method names are important. Better to let the user specify the AST structure separately, and check that the AST and the parse tree agree. As an example from fall, here is the grammar rule for Rust paths, the corresponding ast definition, and the generated code.
Another important feature for a modern parser generator is support for incremental reparsing, which is obviously useful for IDEs.
One thing that greatly helps here is the split between parser and lexer phases.
It is much simpler (and more efficient) to make lexing incremental. When lexing, almost any change affects at most a couple of tokens, so in theory incremental lexing could be pretty efficient. Beware though that worst-case relexing still has to be linear, because insertion of an unclosed quote changes all the following tokens.
In contrast, it is much easier to change the tree structure significantly with a small edit, which places an upper bound on incremental reparsing effectiveness. Besides, making parsing incremental is more complicated because you have to deal with trees instead of a linear structure.
An interesting middle ground here is an incremental lexer combined with a fast non-incremental parser.
Traditional lex-style lexers struggle with special cases like ML-style properly nested comments or Rust raw literals, which are not even context-free. The problem is typically solved by injecting custom code into the lexer, which maintains some sort of state, like the nesting level of comments. In my experience, making this work properly is very frustrating.
These two tricks may make writing a lexer simpler.
Instead of supporting lexer states and injecting custom code, allow pairing a regex, which defines a token, with a function which takes a string slice and outputs a usize. If the lexer matches such an external token, it then calls the supplied function to determine the other end of the token. Here’s an example from fall: external token, custom functions.
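For instance, for the ML-style nested comments mentioned above, such a supplied function could look roughly like this (the exact signature of fall’s extension point may differ; this is just the shape of the idea):

// Called once the lexer has matched the opening `(*`; returns how many bytes
// the whole comment occupies, tracking nesting by hand.
fn nested_comment_len(text: &str) -> usize {
    let mut depth = 0usize;
    let mut i = 0usize;
    while i < text.len() {
        if text[i..].starts_with("(*") {
            depth += 1;
            i += 2;
        } else if text[i..].starts_with("*)") {
            depth -= 1;
            i += 2;
            if depth == 0 {
                return i;
            }
        } else {
            i += text[i..].chars().next().map_or(1, char::len_utf8);
        }
    }
    i // unterminated comment: the token extends to the end of the input
}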
Often it is better to use layered languages instead of lexer states. Parsing string literals is a great example of this. String literals usually have some notion of a well-formed escape sequence. The traditional approach to parsing string literals is to switch to a separate lexer state after ", which handles escapes. This is bad for error recovery: if there’s a typo in an escape sequence, it should still be possible to recognize the literal correctly. So an alternative approach is to parse a string literal as, basically, “anything between two quotes”, and then use a separate lexer for escapes specifically later in the compiler pipeline.
Another interesting lexing problem which arises in practice is context-sensitivity: things like contextual keywords or >> can represent different token types, depending on the surrounding code. To deal with this case nicely, the parser should support token remapping. While most of the tokens appear in the final parse tree as is, the parser should be able to, for example, substitute two > tokens with a single >>, so that later stages of compilation need not handle this special case.
A nice trick to make the parser more general and fast is not to construct the parse tree directly, but to emit a stream of events like “start internal node”, “eat token”, “finish internal node”. That way, parsing does not itself allocate and, for example, you can use the stream of events to patch an existing tree, doing minimal allocations. This also divorces the parser from a particular tree structure, so it is easier to plug in different tree backends.
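A sketch of what such an event stream could look like, reusing the NodeKind from the earlier snippet (the exact shape is up to the implementation):

// The parser produces a flat list of these; a separate tree builder consumes
// the list to construct (or patch) an actual tree.
enum Event {
    // Open an internal node of the given kind; matched by a later `Finish`.
    Start { kind: NodeKind },
    // Attach the next token from the lexer to the currently open node.
    Token,
    // Close the most recently opened node.
    Finish,
}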
Events also help with reshuffling the tree structure. For example, during event processing we can turn left-leaning trees into right-leaning ones or flatten them into lists. Another interesting form of tree reshuffling is attachment of comments. If a comment immediately precedes some definition, it should be a part of this definition. This is not specified by the language, but it is the result that a human would expect. With events, we can hand only significant tokens to the parser and deal with attaching comments and whitespace when reconstructing the tree from a flat list of events.
To properly implement incremental reparsing, we should start with a data structure for text which is more efficient to update than String. While we do have quite a few extremely high-quality implementations of ropes, the ecosystem is critically missing a way to talk about them generically. That is, there’s nothing like Java’s CharSequence in Rust (which would need a much more involved design in Rust to avoid unnecessary overhead).
Luckily, the parse tree needs to remember only the offsets, so we can avoid hard-coding a particular text representation, and we don’t even need a generic parameter for that.
Homogeneous trees make reactive testing of the grammar possible in theory because you can always produce a text representation of a tree from them. But in practice reactivity requires that the “read grammar, compile parser, run it on input” loop is fast. Literally generating source code of the parser and then compiling it would be too slow, so some kind of interpreted mode is required. However, this conflicts with the need to be able to extend the lexer with custom code. I don’t know of a great solution here, but something like this would work:
require that all lexer extensions are specified in the verbatim block of the grammar file and don’t have external dependencies,
for IDE support, compile the lexer, and only the lexer, in a temp dir and communicate with it via IPC.
A possible alternative is to use a different, approximate lexer for interactive testing of the grammar. In my experience this makes such testing almost useless because you get different results in interesting cases, and interesting cases are what is important for this feature.
In IDEs, a surprisingly complicated problem is managing a list of open and modified files, synchronizing them with the file system, providing consistent file-system snapshots and making sure that things like in-memory buffers are also possible. For parser generators, all this complexity might be dodged by requiring that all of the grammar needs to be specified in a single file.
So we want to write a parser generator that produces lossless parse trees and which has awesome IDE support. How do we actually parse a text into a tree? Unfortunately, while there are many ways to parse text, there’s no accepted best one. I’ll try to do a broad survey of various options.
I’d love to discuss the challenges of the textbook approach of just using a context-free grammar/BNF notation. However, let’s start with a simpler, “solved” case: regular expressions.
Languages which could be described by regular expressions are called regular. They are exactly the same languages which could be recognized by finite state machines. These two definition mechanisms have nice properties which explain the usefulness of regular languages in real life:
Regular expressions map closely to our thinking and are easy for humans to understand. Note that there are equivalent in power, but much less “natural”, meta-languages for describing regular languages: raw finite state machines or regular grammars.
Finite state machines are easy for computers to execute. An FSM is just a program which is guaranteed to use a constant amount of memory.
Regular languages are rather inexpressive, but they work great for lexers. On the opposite side of the expressivity spectrum are Turing machines. For them, we also have a number of meta-languages (like Rust), which work great for humans. It’s interesting that a Turing machine is equivalent to a finite state machine with a pair of stacks: to get two stacks from a tape, cut the tape in half where the head is. Moving the head then corresponds to popping from one stack and pushing to another.
And the context-free languages, which are described by CFGs, are exactly in between languages recognized by finite state machines and languages recognized by Turing machines. You need a push-down automaton, or a state machine with one stack, to recognize a context-free language.
CFGs are powerful enough to describe arbitrary nesting structures and seem to be a good fit for describing programming languages. However, there are a couple of problems with CFGs. Let’s write a grammar for arithmetic expressions with addition, multiplication, parentheses and numbers. The obvious answer,
E -> E + E | E * E | (E) | number
has a problem. It is underspecified and does not tell if 1 + 2 * 3 is (1 + 2) * 3 or 1 + (2 * 3). We need to tweak the grammar to get rid of this ambiguity:
E -> F | E + F
F -> T | F * T
T -> number | (E)
I think the necessity of such transformations is a problem! Humans don’t think like this: it took me three or four courses in formal grammars to really internalize this transformation. And if we look at language references, we’ll typically see a precedence table instead of BNF.
Another problem here is that we can’t even work around ambiguity by plainly forbidding it: checking if a CFG is unambiguous is undecidable.
So CFGs turn out to be much less practical and simple than regular expressions. What options do we have then?
The first choice is to parse something, not necessarily a context-free language. A good way to do it is to write a parser by hand. A hand-written parser is usually called a recursive descent parser, but in reality it includes two crucial techniques in addition to just recursive descent. Pure recursive descent works by translating grammar rules like T -> A B into a set of recursive functions:
fn parse_t() {
    parse_a();
    parse_b();
}
The theoretical problem here is that it can’t deal with left recursion. That is, rules like Statements -> Statements ';' OneStatement make a recursive descent parser loop infinitely. In theory, this problem is solved by rewriting the grammar and eliminating the left recursion. If you had a formal grammars class, you probably have done this! In practice, this is a completely non-existent problem, because we have loops:
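Something along these lines, with a toy Parser type whose names are all made up for illustration:

// A toy cursor over tokens, just enough to show the shape of the code.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Token { Number, Plus, Star, LParen, RParen, Semicolon, Eof }

struct Parser {
    tokens: Vec<Token>,
    pos: usize,
}

impl Parser {
    fn current(&self) -> Token {
        self.tokens.get(self.pos).copied().unwrap_or(Token::Eof)
    }
    fn bump(&mut self) {
        self.pos += 1;
    }
    fn eat(&mut self, t: Token) -> bool {
        if self.current() == t { self.bump(); true } else { false }
    }
}

// `Statements -> Statements ';' OneStatement` becomes a plain loop:
fn parse_statements(p: &mut Parser) {
    parse_one_statement(p);
    while p.eat(Token::Semicolon) {
        parse_one_statement(p);
    }
}

fn parse_one_statement(p: &mut Parser) {
    // For the sketch, a statement is just a single number.
    p.eat(Token::Number);
}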
The next problem with recursive descent is that parsing expressions with precedence requires that weird grammar rewriting. Luckily, there’s a simpler technique to deal with expressions. Suppose you want to parse 1 + 2 * 3. One way to do that would be to parse it with a loop as a list of atoms separated by operators and then reconstruct a tree separately. If you fuse these two stages together, you get a loop, which could recursively call itself and nest, a Pratt parser. Understanding it for the first time is hard, but you only need to do it once :)
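A recognizer-style sketch of that fused loop, reusing the toy Parser from above (the binding powers are just example numbers; tree building is omitted):

fn parse_expr(p: &mut Parser, min_bp: u8) {
    // Left-hand side: a number or a parenthesized expression.
    match p.current() {
        Token::Number => p.bump(),
        Token::LParen => {
            p.bump();
            parse_expr(p, 0);
            p.eat(Token::RParen);
        }
        _ => return, // error reporting omitted
    }
    // The "list of atoms separated by operators" loop; recursion handles nesting.
    loop {
        let (l_bp, r_bp) = match p.current() {
            Token::Plus => (1, 2),
            Token::Star => (3, 4),
            _ => break,
        };
        if l_bp < min_bp {
            break;
        }
        p.bump(); // consume the operator
        parse_expr(p, r_bp); // `*` binds tighter, so 1 + 2 * 3 nests as 1 + (2 * 3)
    }
}

The entry point is parse_expr(p, 0); each operator’s right binding power decides how much of the remaining input the recursive call is allowed to swallow.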
The most important feature of hand-written parsers is great support for error recovery and partial parses. It boils down to two simple tricks.
If you are parsing a homogeneous sequence of things (i.e., you are inside the loop), and the current token does not look like it can begin a new element, you just skip over it and start the next iteration of the loop. Here’s an example from Kotlin. At this line, we’ll get null if the current token could not begin a class member declaration. Here we just skip over it.
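In terms of the toy Parser from above, the first trick looks roughly like this (error reporting is stubbed out):

impl Parser {
    fn error(&mut self, _msg: &str) {
        // Record the diagnostic; a real parser would attach it to a source position.
    }
}

// Inside the loop over a homogeneous sequence, skip over tokens that cannot
// begin the next element instead of giving up on the whole sequence.
fn parse_block(p: &mut Parser) {
    while p.current() != Token::Eof {
        if p.current() != Token::Number {
            p.error("expected a statement");
            p.bump(); // skip the stray token and try the next one
            continue;
        }
        parse_one_statement(p);
        p.eat(Token::Semicolon);
    }
}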
If you are parsing a particular thing T, and you expect token foo, but see bar, then, roughly:
if bar is not in the FOLLOW(T), you skip over it and emit an error,
if bar is in FOLLOW(T), you emit an error, but don’t skip the token.
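A rough shape of that rule as an expect helper, again with the toy Parser (in a real parser the FOLLOW set would come from the grammar rather than being passed by hand):

// Emit an error on a mismatch, and skip the offending token only if it could
// not follow the current construct anyway.
fn expect(p: &mut Parser, expected: Token, follow: &[Token]) -> bool {
    if p.eat(expected) {
        return true;
    }
    p.error("unexpected token");
    if !follow.contains(&p.current()) {
        p.bump(); // the token is garbage in this context, drop it
    }
    false
}

A caller parsing a parenthesized expression might invoke expect(p, Token::RParen, &[Token::Semicolon]), so that a stray semicolon ends the expression instead of being swallowed.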
That way, parsing something like
fn foo(
struct S { f: u32 }
would correctly recognize the incomplete function foo (again, it’s easier to represent such an incomplete function with homogeneous parse trees than with an AST), and a complete struct S. Here’s another example from Kotlin.
Although hand-written parsers are good at producing high-quality error messages as well, I don’t think that this is important. In the IDE context, for syntax errors it is much more important and beneficial to get a red squiggly under the error immediately after you’ve typed invalid code. Instantaneous feedback and precise location are, in my personal experience, enough to fix syntax errors. The error message can be just “Syntax error”, and more elaborate messages often make things worse, because mapping from an error message to what is actually wrong is harder than just typing and deleting stuff and checking if it works.
It is possible to simplify authoring of this style of parsers by generating all the recursive functions, loops and Pratt parsers from a declarative BNF/PEG-style description. This is what Grammar Kit and fall do.
Another choice is to stay within the CFG class but avoid dealing with ambiguity by producing all possible parse trees for a given input. This is typically achieved using non-determinism and memoization, using GLR- and GLL-style techniques.
Here I’d like to call out the tree-sitter project, which actually ticks quite a few boxes outlined in this blog post. In particular, it uses homogeneous trees, is fully incremental and has surprisingly good support for error recovery (though not quite as good as hand-written style parsers, at least when I last checked).
Yet another choice is to give up full generality and restrict the parser generator to a subset of unambiguous grammars, for which we actually could verify the absence of ambiguity. This is how traditional parser generators like yacc, happy, menhir or LALRPOP work.
The very important advantage of these parsers is that you get a strong guarantee that the grammar works and does not have nasty surprises. The price you have to pay, though, is that sometimes it is necessary to tweak an already unambiguous grammar to make the stupid tool understand that there’s no ambiguity.
I also haven’t seen deterministic LR parsers with great support for error recovery, but it looks like it should be possible in theory? Recursive descent parsers, which are more or less LL(1), recover from errors splendidly, and an LR(1) parser has strictly more information than an LL(1) one.
So, what is the best choice for writing a parser/parser generator?
It seems to me that the two extremes are the most promising: a hand-written parser gives you utmost control over everything, which is important when you need to parse some language, not designed by you, which is hostile to the usual parsing techniques. On the other hand, classical LR-style parsers give you a proof that the grammar is unambiguous, which is very useful if you are creating your own language. Ultimately, I think that being able to produce lossless parse trees supporting partial parses is more important than any particular parsing technique, so perhaps supporting both approaches with a single API is the right choice?
This turned out to be a quite lengthy post, hope it was interesting! These are the main points:
IDE support is important, for the parser generator itself as well as for the target language.
Lossless parse trees are more general than ASTs and custom action code, and are a better fit for IDEs.
Interactivity matters! Reactive grammar repl and inline tests rock!
Parsing is an unsolved problem :)
Discussion on /r/rust.