feat: implement Pratt parsing #614

39555 · 2024-11-12T00:15:54Z

Closes: #131

This is the initial implementation of the operator precedence parser based on the Pratt algorithm https://en.wikipedia.org/wiki/Operator-precedence_parser#Pratt_parsing.

The interface:

precedence(
    digit1.try_map(|d: &str| d.parse::<i32>()),
    (
        "-".value(2).prefix(|x| -1 * x),
        "+".value(2).prefix(|x| x),
        "!".value(2).postfix(|x| factorial(x)),
        "+".value(0).infix(|a, b| a + b),
        "-".value(0).infix(|a, b| a - b),
        "*".value(1).infix(|a, b| a * b),
        "/".value(1).infix(|a, b| a / b),
    ),
)
.parse_next(i)

Any parser that returns a number can be treated as an operator, with that number as its binding power. To turn such a parser into a prefix, infix, or postfix operator, we use an extension trait on top of impl Parser.
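As a rough illustration of the extension-trait idea, here is a minimal standalone sketch. It does not use winnow's Parser trait: the "parser" is any function from &str to a binding power plus the remaining input, and the names OperatorExt, Prefix, Infix, and minus are hypothetical, not the ones in this PR.

```rust
/// A parser tagged as a prefix operator, together with its fold function.
pub struct Prefix<P, F> {
    pub parser: P,
    pub fold: F,
}

/// A parser tagged as an infix operator, together with its fold function.
pub struct Infix<P, F> {
    pub parser: P,
    pub fold: F,
}

/// Extension trait (hypothetical name) that adds `.prefix(..)` and `.infix(..)`
/// to anything that behaves like a binding-power parser.
pub trait OperatorExt: Sized {
    fn prefix<F>(self, fold: F) -> Prefix<Self, F> {
        Prefix { parser: self, fold }
    }
    fn infix<F>(self, fold: F) -> Infix<Self, F> {
        Infix { parser: self, fold }
    }
}

// Blanket impl: every function that parses a binding power gets the methods.
// This stands in for "impl Parser" and is an assumption of this sketch.
impl<P> OperatorExt for P where P: Fn(&str) -> Option<(usize, &str)> {}

/// Recognizes a leading `-` and reports binding power 2.
pub fn minus(input: &str) -> Option<(usize, &str)> {
    input.strip_prefix('-').map(|rest| (2, rest))
}
```

With this in place, minus.prefix(|x: i32| -x) yields a value whose parser field recognizes the token and whose fold field processes the operand, which is why these methods sit between grammar-level constructs and value processing.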

Implementation details

Binding power, associativity, Pratt and Dijkstra

The Pratt algorithm only uses the binding power. Left and right associativity are used in Dijkstra's Shunting Yard algorithm. (It seems like chumsky incorrectly names its implementation pratt while using shunting yard internally.) EDIT: I am wrong here. We need associativity for infix operators to determine the evaluation order of operators with equal binding power.
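To make the point of the EDIT concrete, here is a minimal standalone sketch of a Pratt loop in plain Rust (not the winnow API from this PR). Each infix operator carries a left and a right binding power, and the relation between the two encodes associativity. The function names and the specific power values are assumptions chosen for illustration.

```rust
use std::iter::Peekable;
use std::str::Chars;

/// (left, right) binding power of an infix operator; hypothetical values.
/// left < right gives left associativity, left > right gives right associativity.
fn infix_power(op: char) -> Option<(u8, u8)> {
    match op {
        '+' | '-' => Some((1, 2)),
        '*' | '/' => Some((3, 4)),
        '^' => Some((6, 5)),
        _ => None,
    }
}

/// Parses and evaluates operators whose left power is at least `min_power`.
fn expr(input: &mut Peekable<Chars<'_>>, min_power: u8) -> Option<i64> {
    // Operand, optionally preceded by a prefix minus with power 5.
    let mut lhs = match input.next()? {
        '-' => -expr(input, 5)?,
        c if c.is_ascii_digit() => {
            let mut n = i64::from(c.to_digit(10)?);
            while let Some(d) = input.peek().and_then(|c| c.to_digit(10)) {
                n = n * 10 + i64::from(d);
                input.next();
            }
            n
        }
        _ => return None,
    };
    while let Some(&op) = input.peek() {
        let Some((left, right)) = infix_power(op) else { break };
        if left < min_power {
            break;
        }
        input.next();
        // The right power decides how much of the rest belongs to this operator.
        let rhs = expr(input, right)?;
        lhs = match op {
            '+' => lhs + rhs,
            '-' => lhs - rhs,
            '*' => lhs * rhs,
            '/' => lhs.checked_div(rhs)?,
            '^' => lhs.checked_pow(u32::try_from(rhs).ok()?)?,
            _ => return None,
        };
    }
    Some(lhs)
}

/// Evaluates a whole expression, rejecting trailing input.
pub fn eval(s: &str) -> Option<i64> {
    let mut it = s.chars().peekable();
    let value = expr(&mut it, 0)?;
    if it.next().is_some() {
        return None;
    }
    Some(value)
}
```

In this sketch eval("2-4-7") groups as (2-4)-7 because the right power of - is higher than its left power, while ^ groups to the right because its powers are reversed.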

I named the combinator precedence, with the aliases pratt, shunting yard, precedence climbing, and separated. If any of the names are confusing, we may consider better alternatives.

The implementation is partially inspired by https://gist.github.com/ilonachan/3d92577265846e5327a3011f7aa30770.
But it doesn't use any Rc<...> allocations for closures; instead, it uses &RefCell<&dyn ...> references.

Operators from the tuple are statically dispatched by the requested affix, so prefix.as_postfix() returns fail while prefix.as_prefix() returns the actual parser. This may cause a performance issue because it iterates through the whole tuple each time a parser is requested for any affix.

See the trace:

> postfix                                                                                      | "+4"∅
 > opt                                                                                         | "+4"∅
  > alt                                                                                        | "+4"∅
   > fail                                                                                      | "+4"∅
   < fail                                                                                      | backtrack
   > fail                                                                                      | "+4"∅
   < fail                                                                                      | backtrack
   > "!"                                                                                       | "+4"∅
   < "!"                                                                                       | backtrack
   > fail                                                                                      | "+4"∅
   < fail                                                                                      | backtrack
   > fail                                                                                      | "+4"∅
   < fail                                                                                      | backtrack
   > fail                                                                                      | "+4"∅
   < fail                                                                                      | backtrack
   > fail                                                                                      | "+4"∅
   < fail                                                                                      | backtrack
  < alt                                                                                        | backtrack
 < opt                                                                                         | +0
< postfix

(I tried to make some sort of linked list where each parser holds an index to the next one with the same affix, but it quickly became a nightmare of lifetimes, complexity, and compilation errors, so it was rejected.)

Performance

A quick criterion benchmark with a simple input &str 1-2*4+12-561-5-6*6/9-3+1*-2*4-758*3 shows:

winnow   time: [191.25 ns 191.56 ns 191.94 ns]
chumsky time: [335.63 ns 335.93 ns 336.42 ns]

TODO

Add more tests from here https://gist.github.com/ilonachan/3d92577265846e5327a3011f7aa30770 and here https://github.com/rust-bakery/nom/pull/1362/files
Implement a full example with ternary and other fun operators
Performance benchmarks for the example
Documentation with examples
Implement the api for Vec<..> and &[...]
Streaming support
Hide behind a feature flag unstable-pratt

epage

Wow, this looks great. You put a lot of work into this and I see care is taken in cases like the handling of infinite loops and such

epage · 2024-11-12T02:11:47Z

src/combinator/precedence.rs

+#[doc(alias = "shunting yard")]
+#[doc(alias = "precedence climbing")]
+#[inline(always)]
+pub fn precedence<I, ParseOperand, Operators, Operand: 'static, E>(


I like names to make their role in the grammar clear. I'm trying to decide how well this name does that and whether there is anything better. We use separated for this kind of thing, but I'm unsure how to tie it in without it being overly verbose.

Maybe expression? Similar to how combine names this parser https://docs.rs/combine-language/latest/combine_language/fn.expression_parser.html

Oh interesting, I wasn't aware of that crate. We should at least add expression_parser to our aliases.

We likely should pick one and deal with it and rename it if a better name comes up. expression seems just as good as any other. It fits with how this would be used in a language grammar.

epage · 2024-11-12T02:13:16Z

src/combinator/precedence.rs

+};
+
+/// An adapter for the [`Parser`] trait to enable its use in the [`precedence`] parser.
+pub trait PrecedenceParserExt<I, E> {


Our rule of thumb is grammar-level concepts are standalone functions and value processing are trait functions. These seem grammar-level.

Thanks for reviewing! Do you suggest making prefix, infix.. standalone functions? I followed the concern in the attached issue to make <parser>.prefix(...) work:

So I assume we'd go with

let calc = pratt(
    digits1.map(Expr::Int),
    (
        '-'.prefix(Right(1), |r| unary(r, Op::Neg)),
        '+'.infix(Left(0), |l, r| binary(l, r, Op::Add)),
        '!'.prefix(Right(3), |r| unary(r, Op::Fact)),
    ),
);

These prefix, infix, postfix are strange beasts in this implementation. They are both grammar-level constructs and value processing functions because the data processing is required by the parser to advance further.

I thought about an interface like prefix(0, "-").map(|(a, b)| a - b). It looks clean, but what should the default operator be when we don't call map? For a unary operator it is easy, just return the argument, but for a binary operator it is unclear how to merge the operands.

Ha! I previously looked at this and viewed the associativity as not grammar level and now looked at it and viewed it as grammar level. Normally trait methods are reserved for purely operating on the data, like map. This is actually changing how we parse the results. The one weird middle ground we have is cut_err. Unsure if this should be treated like that or not.

I also feel like a bunch of free functions floating around might not be the easiest for discovery and use.

I also suspect the ultimate answer will be dependent on the answer to other questions as I alluded to the interplay of my comments in #614 (comment)

epage · 2024-11-12T14:58:06Z

src/combinator/precedence.rs

+#[doc(alias = "shunting yard")]
+#[doc(alias = "precedence climbing")]
+#[inline(always)]
+pub fn precedence<I, ParseOperand, Operators, Operand: 'static, E>(


The list of parsers will need to be updated

epage · 2024-11-12T14:59:11Z

src/combinator/precedence.rs

+#[doc(alias = "shunting yard")]
+#[doc(alias = "precedence climbing")]


Don't know if aliases with spaces work

epage · 2024-11-12T15:00:25Z

src/combinator/precedence.rs

+}
+impl<I, E, T: Parser<I, usize, E>> PrecedenceParserExt<I, E> for T where I: Stream {}


Suggested change

}

impl<I, E, T: Parser<I, usize, E>> PrecedenceParserExt<I, E> for T where I: Stream {}


epage · 2024-11-12T18:48:20Z

src/combinator/precedence.rs

+/// Type-erased unary predicate that folds an expression into a new expression.
+/// Useful for supporting not only closures but also arbitrary types as operator predicates within the [`precedence`] parser.
+pub trait UnaryOp<O> {
+    /// Invokes the [`UnaryOp`] predicate.
+    fn fold_unary(&mut self, o: O) -> O;
+}
+/// Type-erased binary predicate that folds two expressions into a new expression similar to
+/// [`UnaryOp`] within the [`precedence`] parser.
+pub trait BinaryOp<O> {
+    /// Invokes the [`BinaryOp`] predicate.
+    fn fold_binary(&mut self, lhs: O, rhs: O) -> O;
+}
+
+impl<O, F> UnaryOp<O> for F
+where
+    F: Fn(O) -> O,
+{
+    #[inline(always)]
+    fn fold_unary(&mut self, o: O) -> O {
+        (self)(o)
+    }
+}
+impl<O, F> BinaryOp<O> for F
+where


I tend to prefer that impls be closely organized with their trait

epage · 2024-11-12T18:49:44Z

src/combinator/mod.rs

@@ -174,6 +175,7 @@ pub use self::core::*;
 pub use self::debug::*;
 pub use self::multi::*;
 pub use self::parser::*;
+pub use self::precedence::*;


We are dumping a lot of stray types into combinator. The single-line summaries should make it very easy to tell they are related to precedence (maybe as the first word) and somehow help the user know which, if any, they should care about.

epage · 2024-11-12T18:55:56Z

src/combinator/precedence.rs

+    }
+}
+
+macro_rules! impl_parser_for_tuple {


So we find the right "mode" by falling through an alt of parsers for that mode and then get the weight and fold function from the first success.

Looking at the example

precedence(
    digit1.try_map(|d: &str| d.parse::<i32>()),
    (
        "-".value(2).prefix(|x| -1 * x),
        "+".value(2).prefix(|x| x),
        "!".value(2).postfix(|x| factorial(x)),
        "+".value(0).infix(|a, b| a + b),
        "-".value(0).infix(|a, b| a - b),
        "*".value(1).infix(|a, b| a * b),
        "/".value(1).infix(|a, b| a / b),
    ),
)
.parse_next(i)

the alt seems unfortunate for performance reasons. dispatch can be used in this case and I suspect a lot of other cases. It would be a big help to find a way to set this up so dispatch can be used as well, e.g.

precedence(
    digit1.try_map(|d: &str| d.parse::<i32>()),
    dispatch!{any;
        '-' => prefix(2, |x| -1 * x),
        '+' => prefix(2, |x| x),
        '!' => postfix(2, |x| factorial(x)),
        '+' => infix(0, |a, b| a + b),
        '-' => infix(0, |a, b| a - b),
        '*' => infix(1, |a, b| a * b),
        '/' => infix(1, |a, b| a / b),
        _ => fail,
    },
)
.parse_next(i)

(note: I might give other suggestions that run counter to this, we'll need to weigh out each one and figure out how we want to balance the different interests)

Another idea. The implementation in nom rust-bakery/nom#1362 uses three parsers in the function signature, one for each affix. It has the advantage that you can use dispatch and there are no prefix, postfix, ... functions. The downside is that from the function signature alone it's hard to remember which parser is which. Also, fail has to be passed when a parser for some affix is not needed.

Maybe a builder pattern would be ideal, e.g.:

precedence(digit1.parse_to::<i32>())
    .prefix(dispatch!{any;
        "+" => empty.map(|_| move |a, b| a + b),
        "-" => empty.map(|_| move |a, b| a + b),
    })
    .postfix(dispatch!{any; ...})
    .infix(dispatch!{any; ...})
    .build()
    .parse_next(i)

Methods are similar to repeat().fold. This approach would eliminate stray functions, reduce the performance impact of iterating over each parser for each affix, and allow users to use dispatch or any custom logic they want.

I'm not sure yet how this would interact with binding power. I'm going to research whether we can drop the explicit index argument and rely on array indexing, how binding power interacts between affixes, etc.

I came across another idea. Instead of dispatching by the parser type, maybe we could dispatch by the input type:

enum Input<'i, I> {
    Prefix(&'i mut I),
    Postfix(&'i mut I),
    Infix(&'i mut I),
}

then it would be possible to (I'm not sure yet if Deref would work or not)

dispatch!{any;
    '-' => prefix(2, |x| -1 * x),
    '+' => prefix(2, |x| x),
    '!' => postfix(2, |x| factorial(x)),
    '+' => infix(0, |a, b| a + b),
    '-' => infix(0, |a, b| a - b),
    '*' => infix(1, |a, b| a * b),
    '/' => infix(1, |a, b| a / b),
    _ => fail,
}

then prefix parsers would do

if let Prefix(_) = input { ... } else { fail }

This is nice because we only run the needed parsers (less alt, no dispatch just to get an error).
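The idea of selecting an operator by both the token and the requested affix can be sketched without any parser library. The following is a hypothetical illustration only: Affix, Fold, and lookup are made-up names, and a plain match stands in for dispatch. It shows how one branch replaces the chain of fail parsers visible in the trace above.

```rust
/// Position in which the parser is currently looking for an operator.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Affix {
    Prefix,
    Infix,
    Postfix,
}

/// Fold function attached to an operator.
pub enum Fold {
    Unary(fn(i32) -> i32),
    Binary(fn(i32, i32) -> i32),
}

fn factorial(x: i32) -> i32 {
    (1..=x).product()
}

/// Returns the binding power and fold function for `token` in position `affix`.
/// A single match on (token, affix) replaces trying every operator in order.
pub fn lookup(token: char, affix: Affix) -> Option<(usize, Fold)> {
    match (token, affix) {
        ('-', Affix::Prefix) => Some((2, Fold::Unary(|x| -x))),
        ('+', Affix::Prefix) => Some((2, Fold::Unary(|x| x))),
        ('!', Affix::Postfix) => Some((2, Fold::Unary(factorial))),
        ('+', Affix::Infix) => Some((0, Fold::Binary(|a, b| a + b))),
        ('-', Affix::Infix) => Some((0, Fold::Binary(|a, b| a - b))),
        ('*', Affix::Infix) => Some((1, Fold::Binary(|a, b| a * b))),
        ('/', Affix::Infix) => Some((1, Fold::Binary(|a, b| a / b))),
        // Operator used in the wrong position, or not an operator at all.
        _ => None,
    }
}
```

The final arm is also where the error-reporting question shows up: a known operator in the wrong position and an unknown token both end in the same None unless they are distinguished explicitly.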

I agree about the opacity of parameter order. It's also not great if a parameter is unused (e.g. no postfix).

If we go with the "builder" approach, we are again violating the API design guidelines as discussed in #614 (comment). However, it does make it nice for naming the parameters and only requiring what is needed.

Maybe one way of looking at this is the error reporting. What happens if I use an operator in the wrong location?

I'm not quite getting the input type idea, and dealing with custom input types in the middle of a parser adds a lot of complexity.


epage · 2024-11-12T18:58:17Z

src/combinator/precedence.rs

+#[doc(alias = "shunting yard")]
+#[doc(alias = "precedence climbing")]
+#[inline(always)]
+pub fn precedence<I, ParseOperand, Operators, Operand: 'static, E>(


Organizationally, I prefer the "top level" thing going first and then branching out from there. In this case, precedence is core.

epage · 2024-11-12T18:59:37Z

src/combinator/mod.rs

Streaming support

This looks to be agnostic of streaming support like separated is

epage · 2024-11-12T19:00:10Z

src/combinator/mod.rs

Hide behind a feature flag unstable-pratt

If this is going to start off unstable, then it's fine noting most of my feedback in the "tracking" issue and not resolving all of it here

epage · 2024-11-12T22:08:17Z

src/combinator/precedence.rs

+// recursive function
+fn precedence_impl<I, ParseOperand, Operators, Operand: 'static, E>(


Is there a way we can do this without recursion? Parser authors are generally cautious around recursion due to blowing up the stack.

Hmm, seems all of the other ones use recursion
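For reference, recursion can be avoided with the shunting-yard approach, which keeps pending operands and operators on two heap-allocated stacks. The sketch below is a standalone illustration under simplifying assumptions (only binary, left-associative operators, no prefix or postfix operators, no parentheses); the function names and the precedence table are hypothetical and not taken from this PR.

```rust
/// Precedence of a binary operator; all operators here are left associative.
fn binary_precedence(op: char) -> Option<u8> {
    match op {
        '+' | '-' => Some(0),
        '*' | '/' => Some(1),
        _ => None,
    }
}

/// Pops one operator and two operands, then pushes the result.
fn reduce(operands: &mut Vec<i64>, operators: &mut Vec<char>) -> Option<()> {
    let op = operators.pop()?;
    let rhs = operands.pop()?;
    let lhs = operands.pop()?;
    operands.push(match op {
        '+' => lhs + rhs,
        '-' => lhs - rhs,
        '*' => lhs * rhs,
        '/' => lhs.checked_div(rhs)?,
        _ => return None,
    });
    Some(())
}

/// Evaluates `number (operator number)*` with explicit stacks instead of recursion.
pub fn eval_iterative(s: &str) -> Option<i64> {
    let mut operands: Vec<i64> = Vec::new();
    let mut operators: Vec<char> = Vec::new();
    let mut chars = s.chars().peekable();
    let mut expect_operand = true;
    while let Some(&c) = chars.peek() {
        if expect_operand {
            let mut n: i64 = 0;
            let mut seen_digit = false;
            while let Some(d) = chars.peek().and_then(|c| c.to_digit(10)) {
                n = n * 10 + i64::from(d);
                chars.next();
                seen_digit = true;
            }
            if !seen_digit {
                return None;
            }
            operands.push(n);
        } else {
            let p = binary_precedence(c)?;
            // Left associativity: reduce while the stacked operator binds at least as tightly.
            while let Some(&top) = operators.last() {
                if binary_precedence(top)? < p {
                    break;
                }
                reduce(&mut operands, &mut operators)?;
            }
            operators.push(c);
            chars.next();
        }
        expect_operand = !expect_operand;
    }
    if expect_operand {
        return None; // empty input or trailing operator
    }
    while !operators.is_empty() {
        reduce(&mut operands, &mut operators)?;
    }
    operands.pop()
}
```

Stack depth stays constant regardless of expression length; the cost moves to the two Vec buffers, which grow with the number of pending operators.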


epage · 2024-11-13T21:33:51Z

src/combinator/precedence.rs

+}
+
+// recursive function
+fn precedence_impl<I, ParseOperand, Operators, Operand: 'static, E>(


From #614 (comment)

While looking at it I found a potential error. My current implementation is unsound. Consider 2 + 4 + 7. When operators have equal binding power, the current parser evaluates it as (4 + 7) + 2 instead of (2 + 4) + 7. That is why the algorithm uses associativity. While + and * are commutative, someone might want to parse function calls, for example, where the grouping matters.
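The grouping problem can be reproduced in a few lines. This is a hypothetical sketch, not the PR's code: all operators share binding power 1, and the bump parameter is the amount added to the minimum power when parsing the right operand. With bump = 0 the right operand swallows the following operators; with bump = 1 the chain becomes left associative. It uses - instead of + so that the difference is observable, and assumes well-formed input of single-character operands separated by -.

```rust
/// Binding power shared by every `-` in this sketch.
const POWER: u8 = 1;

/// Renders the grouping chosen for a chain of `-` operators.
/// Assumes well-formed input; indexing panics otherwise.
fn group(tokens: &[char], pos: &mut usize, min_power: u8, bump: u8) -> String {
    let mut lhs = tokens[*pos].to_string();
    *pos += 1;
    while *pos < tokens.len() && tokens[*pos] == '-' && POWER >= min_power {
        *pos += 1;
        // The right operand is parsed with minimum power POWER + bump.
        let rhs = group(tokens, pos, POWER + bump, bump);
        lhs = format!("({lhs} - {rhs})");
    }
    lhs
}

/// Returns the fully parenthesized form of `src` for the given `bump`.
pub fn grouping(src: &str, bump: u8) -> String {
    let tokens: Vec<char> = src.chars().collect();
    let mut pos = 0;
    group(&tokens, &mut pos, 0, bump)
}
```

Here grouping("2-4-7", 0) produces (2 - (4 - 7)), which groups 4 and 7 first as described above, while grouping("2-4-7", 1) produces ((2 - 4) - 7).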

feat: implement Pratt parser

fed8c90


epage mentioned this pull request Nov 12, 2024

Pratt parsing support #131


39555 mentioned this pull request Nov 14, 2024

PoC: Pratt parsing with shunting yard algorithm #618
