feat: implement Pratt parsing #614

Draft. 39555 wants to merge 1 commit into main.


@39555 39555 commented Nov 12, 2024

Closes: #131

This is the initial implementation of the operator precedence parser based on the Pratt algorithm https://en.wikipedia.org/wiki/Operator-precedence_parser#Pratt_parsing.

The interface:

precedence(
    digit1.try_map(|d: &str| d.parse::<i32>()),
    (
        "-".value(2).prefix(|x| -1 * x),
        "+".value(2).prefix(|x| x),
        "!".value(2).postfix(|x| factorial(x)),
        "+".value(0).infix(|a, b| a + b),
        "-".value(0).infix(|a, b| a - b),
        "*".value(1).infix(|a, b| a * b),
        "/".value(1).infix(|a, b| a / b),
    ),
)
.parse_next(i)

Any parser that returns a number can be treated as an operator, with that number as its binding power. To turn such a parser into a prefix, infix, or postfix operator, we use an extension trait on top of impl Parser.
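
To make the role of the binding powers concrete, here is a minimal stand-alone sketch of the same calculator written against plain `std`. It does not use winnow or the API proposed in this PR; every name in it (`eval`, `expr`, `infix_power`, `UNARY_POWER`) is hypothetical, and the powers simply mirror the example above (0 for infix `+`/`-`, 1 for `*`/`/`, 2 for prefix and postfix).

```rust
use std::iter::Peekable;
use std::str::Chars;

/// Binding power shared by the prefix and postfix operators (assumed, as in the example).
const UNARY_POWER: u32 = 2;

/// Binding power of an infix operator, or `None` if `op` is not one.
fn infix_power(op: char) -> Option<u32> {
    match op {
        '+' | '-' => Some(0),
        '*' | '/' => Some(1),
        _ => None,
    }
}

fn factorial(n: i64) -> i64 {
    (1..=n).product()
}

/// Parses a run of ASCII digits as the operand.
fn operand(input: &mut Peekable<Chars<'_>>) -> Option<i64> {
    let mut digits = String::new();
    while let Some(c) = input.peek().copied().filter(|c| c.is_ascii_digit()) {
        digits.push(c);
        input.next();
    }
    digits.parse().ok()
}

/// Pratt loop: only consumes operators whose power is at least `min_power`.
fn expr(input: &mut Peekable<Chars<'_>>, min_power: u32) -> Option<i64> {
    // A prefix operator recurses with its own power; otherwise read an operand.
    let mut lhs = match input.peek().copied() {
        Some('-') => {
            input.next();
            -expr(input, UNARY_POWER)?
        }
        Some('+') => {
            input.next();
            expr(input, UNARY_POWER)?
        }
        _ => operand(input)?,
    };
    loop {
        let Some(op) = input.peek().copied() else { break };
        if op == '!' {
            if UNARY_POWER < min_power {
                break;
            }
            input.next();
            lhs = factorial(lhs);
            continue;
        }
        let Some(power) = infix_power(op) else { break };
        if power < min_power {
            break;
        }
        input.next();
        // `power + 1` makes operators of equal power group to the left.
        let rhs = expr(input, power + 1)?;
        lhs = match op {
            '+' => lhs + rhs,
            '-' => lhs - rhs,
            '*' => lhs * rhs,
            '/' => lhs.checked_div(rhs)?,
            _ => return None,
        };
    }
    Some(lhs)
}

/// Evaluates `src`, returning `None` on a parse error or trailing input.
pub fn eval(src: &str) -> Option<i64> {
    let mut input = src.chars().peekable();
    let value = expr(&mut input, 0)?;
    input.peek().is_none().then_some(value)
}
```

With these assumed powers, `eval("1-2*4")` groups as `1-(2*4)` and `eval("-3!")` as `-(3!)`. Overflow handling is left out to keep the sketch short.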

Implementation details

Binding power, associativity, Pratt and Dijkstra

The Pratt algorithm only uses the binding power. Left and right associativity are used in Dijkstra's shunting-yard algorithm. (It seems like chumsky incorrectly names its implementation pratt while using shunting yard internally.) EDIT: I was wrong here. We need associativity for infix operators to determine the grouping of operators with equal binding power.
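
One common way to fold associativity into a Pratt loop is to let it choose the minimum power passed to the recursive call for the right-hand side. The sketch below is only an illustration of that idea, not the code in this PR; `Assoc`, `lookup`, `group`, and `grouping` are hypothetical names, operands are single digits, and the result is a fully parenthesized string so the grouping is visible.

```rust
#[derive(Clone, Copy)]
pub enum Assoc {
    Left,
    Right,
}

/// Hypothetical operator table: (binding power, associativity).
fn lookup(op: char) -> Option<(u32, Assoc)> {
    match op {
        '+' | '-' => Some((1, Assoc::Left)),
        '^' => Some((2, Assoc::Right)),
        _ => None,
    }
}

/// Parses single-digit operands joined by infix operators and returns
/// the expression with explicit parentheses.
fn group(tokens: &[char], pos: &mut usize, min_power: u32) -> Option<String> {
    let mut lhs = tokens.get(*pos).filter(|c| c.is_ascii_digit())?.to_string();
    *pos += 1;
    while let Some(&op) = tokens.get(*pos) {
        let (power, assoc) = lookup(op)?;
        if power < min_power {
            break;
        }
        *pos += 1;
        // Left: an equal-power operator to the right must NOT be absorbed.
        // Right: an equal-power operator to the right IS absorbed.
        let next_min = match assoc {
            Assoc::Left => power + 1,
            Assoc::Right => power,
        };
        let rhs = group(tokens, pos, next_min)?;
        lhs = format!("({lhs} {op} {rhs})");
    }
    Some(lhs)
}

pub fn grouping(src: &str) -> Option<String> {
    let tokens: Vec<char> = src.chars().collect();
    let mut pos = 0;
    group(&tokens, &mut pos, 0)
}
```

Under these assumptions `2-4-7` groups to the left and `2^3^2` groups to the right, even though each pair of operators has the same binding power.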

I named the combinator precedence with aliases to pratt, shunting yard, precedence climbing, separated. If any of the names are confusing, we may consider better alternatives.

The implementation is partially inspired by https://gist.github.com/ilonachan/3d92577265846e5327a3011f7aa30770.
But it doesn't use any Rc<...> allocations for closures; instead, it uses & RefCell<& dyn ...> references.

Operators from the tuple are statically dispatched by the requested affix, so prefix.as_postfix() will return fail while prefix.as_prefix() will return the actual parser. This may be a performance issue because it iterates through the whole tuple each time the parser is requested for any affix:

See the trace:

> postfix                                                                                      | "+4"∅
 > opt                                                                                         | "+4"∅
  > alt                                                                                        | "+4"∅
   > fail                                                                                      | "+4"∅
   < fail                                                                                      | backtrack
   > fail                                                                                      | "+4"∅
   < fail                                                                                      | backtrack
   > "!"                                                                                       | "+4"∅
   < "!"                                                                                       | backtrack
   > fail                                                                                      | "+4"∅
   < fail                                                                                      | backtrack
   > fail                                                                                      | "+4"∅
   < fail                                                                                      | backtrack
   > fail                                                                                      | "+4"∅
   < fail                                                                                      | backtrack
   > fail                                                                                      | "+4"∅
   < fail                                                                                      | backtrack
  < alt                                                                                        | backtrack
 < opt                                                                                         | +0
< postfix

(I tried to make some sort of linked list where each parser holds an index to the next one with the same affix, but it quickly became a nightmare of lifetimes, complexity, and compilation errors, so it was rejected.)
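
The cost visible in the trace can be modeled with a small stand-alone sketch. It is a simplification and not the tuple-based implementation in this PR: `Affix`, `Operator`, `OPERATORS`, and `probe` are hypothetical names, and an operator whose affix differs from the requested one simply counts as a failed alternative, like `fail` in the trace.

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
pub enum Affix {
    Prefix,
    Postfix,
    Infix,
}

pub struct Operator {
    pub symbol: char,
    pub affix: Affix,
    pub power: u32,
}

/// The seven operators from the example interface, in the same order.
pub const OPERATORS: [Operator; 7] = [
    Operator { symbol: '-', affix: Affix::Prefix, power: 2 },
    Operator { symbol: '+', affix: Affix::Prefix, power: 2 },
    Operator { symbol: '!', affix: Affix::Postfix, power: 2 },
    Operator { symbol: '+', affix: Affix::Infix, power: 0 },
    Operator { symbol: '-', affix: Affix::Infix, power: 0 },
    Operator { symbol: '*', affix: Affix::Infix, power: 1 },
    Operator { symbol: '/', affix: Affix::Infix, power: 1 },
];

/// Looks for an operator with the wanted affix that matches `next`.
/// Returns the binding power (if any) and how many alternatives were tried.
pub fn probe(next: char, wanted: Affix) -> (Option<u32>, usize) {
    let mut tried = 0;
    for op in OPERATORS.iter() {
        tried += 1;
        // An operator with a different affix behaves like `fail` here.
        if op.affix == wanted && op.symbol == next {
            return (Some(op.power), tried);
        }
    }
    (None, tried)
}
```

Asking for a postfix operator on `+4` walks all seven entries before giving up, which matches the seven backtracks in the trace.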

Performance

A quick criterion benchmark with a simple input &str 1-2*4+12-561-5-6*6/9-3+1*-2*4-758*3 shows:

winnow   time: [191.25 ns 191.56 ns 191.94 ns]
chumsky time: [335.63 ns 335.93 ns 336.42 ns]

TODO

Collaborator

@epage epage left a comment

Wow, this looks great. You put a lot of work into this, and I see care was taken in cases like the handling of infinite loops.

#[doc(alias = "shunting yard")]
#[doc(alias = "precedence climbing")]
#[inline(always)]
pub fn precedence<I, ParseOperand, Operators, Operand: 'static, E>(
Collaborator

I like names to make their role in the grammar clear. I'm trying to decide how this does and if there is anything better. We use separated for this kind of thing but unsure how to tie it in without it being overly verbose

Author

Maybe expression? Similar to how combine names this parser https://docs.rs/combine-language/latest/combine_language/fn.expression_parser.html

Collaborator

Oh interesting, I wasn't aware of that crate. We should at least add expression_parser to our aliases.

We likely should pick one and deal with it and rename it if a better name comes up. expression seems just as good as any other. It fits with how this would be used in a language grammar.

};

/// An adapter for the [`Parser`] trait to enable its use in the [`precedence`] parser.
pub trait PrecedenceParserExt<I, E> {
Collaborator

Our rule of thumb is grammar-level concepts are standalone functions and value processing are trait functions. These seem grammar-level.

Author

Thanks for reviewing! Do you suggest making prefix, infix.. standalone functions? I followed the concern in the attached issue to make <parser>.prefix(...) work:

So I assume we'd go with

let calc = pratt(
    digits1.map(Expr::Int),
    (
        '-'.prefix(Right(1), |r| unary(r, Op::Neg)),
        '+'.infix(Left(0), |l, r| binary(l, r, Op::Add)),
        '!'.prefix(Right(3), |r| unary(r, Op::Fact)),
    )
);

These prefix, infix, postfix are strange beasts in this implementation. They are both grammar-level constructs and value processing functions because the data processing is required by the parser to advance further.

I thought about an interface like prefix(0, "-").map(|(a, b)| a - b). It looks clean, but what should the default operator be when we don't call map? For a unary it is easy -- just return the argument, but for a binary it is unclear how to merge the operands.

Collaborator

Ha! I previously looked at this and viewed the associativity as not grammar level and now looked at it and viewed it as grammar level. Normally trait methods are reserved for purely operating on the data, like map. This is actually changing how we parse the results. The one weird middle ground we have is cut_err. Unsure if this should be treated like that or not.

I also feel like a bunch of free functions floating around might not be the easiest for discovery and use.

I also suspect the ultimate answer will be dependent on the answer to other questions as I alluded to the interplay of my comments in #614 (comment)

#[doc(alias = "shunting yard")]
#[doc(alias = "precedence climbing")]
#[inline(always)]
pub fn precedence<I, ParseOperand, Operators, Operand: 'static, E>(
Collaborator

The list of parsers will need to be updated

Comment on lines +344 to +345
#[doc(alias = "shunting yard")]
#[doc(alias = "precedence climbing")]
Collaborator

Don't know if aliases with spaces work

Comment on lines +63 to +64
}
impl<I, E, T: Parser<I, usize, E>> PrecedenceParserExt<I, E> for T where I: Stream {}
Collaborator

Suggested change
}
impl<I, E, T: Parser<I, usize, E>> PrecedenceParserExt<I, E> for T where I: Stream {}
}
impl<I, E, T: Parser<I, usize, E>> PrecedenceParserExt<I, E> for T where I: Stream {}

Comment on lines +105 to +128
/// Type-erased unary predicate that folds an expression into a new expression.
/// Useful for supporting not only closures but also arbitrary types as operator predicates within the [`precedence`] parser.
pub trait UnaryOp<O> {
    /// Invokes the [`UnaryOp`] predicate.
    fn fold_unary(&mut self, o: O) -> O;
}
/// Type-erased binary predicate that folds two expressions into a new expression similar to
/// [`UnaryOp`] within the [`precedence`] parser.
pub trait BinaryOp<O> {
    /// Invokes the [`BinaryOp`] predicate.
    fn fold_binary(&mut self, lhs: O, rhs: O) -> O;
}

impl<O, F> UnaryOp<O> for F
where
    F: Fn(O) -> O,
{
    #[inline(always)]
    fn fold_unary(&mut self, o: O) -> O {
        (self)(o)
    }
}
impl<O, F> BinaryOp<O> for F
where
Collaborator

I tend to prefer that impls be closely organized with their trait

@@ -174,6 +175,7 @@ pub use self::core::*;
pub use self::debug::*;
pub use self::multi::*;
pub use self::parser::*;
pub use self::precedence::*;
Collaborator

We are dumping a lot of stray types into combinator. The single-line summaries should make it very easy to tell they are related to precedence (maybe as the first word) and somehow help the user know which, if any, they should care about.

}
}

macro_rules! impl_parser_for_tuple {
Collaborator

So we find the right "mode" by falling through an alt of parsers for that mode and then get the weight and fold function from the first success.

Looking at the example

precedence(
    digit1.try_map(|d: &str| d.parse::<i32>()),
    (
        "-".value(2).prefix(|x| -1 * x),
        "+".value(2).prefix(|x| x),
        "!".value(2).postfix(|x| factorial(x)),
        "+".value(0).infix(|a, b| a + b),
        "-".value(0).infix(|a, b| a - b),
        "*".value(1).infix(|a, b| a * b),
        "/".value(1).infix(|a, b| a / b),
    ),
)
.parse_next(i)

the alt seems unfortunate for performance reasons. dispatch can be used in this case and I suspect a lot of other cases. It would be a big help to find a way to set this up so dispatch can be used as well, e.g.

precedence(
    digit1.try_map(|d: &str| d.parse::<i32>()),
    dispatch!{any;
        '-' => prefix(2, |x| -1 * x),
        '+' => prefix(2, |x| x),
        '!' => postfix(2, |x| factorial(x)),
        '+' => infix(0, |a, b| a + b),
        '-' => infix(0, |a, b| a - b),
        '*' => infix(1, |a, b| a * b),
        '/' => infix(1, |a, b| a / b),
        _ => fail,
    },
)
.parse_next(i)

(note: I might give other suggestions that run counter to this, we'll need to weigh out each one and figure out how we want to balance the different interests)
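
One detail the sketch above glosses over is that `+` and `-` appear twice, once as prefix and once as infix; in a plain Rust `match` the second arm for the same character would never be reached. A minimal way to picture a dispatch that still resolves in a single step is to key it on the requested affix as well. This is only an assumption-laden illustration with hypothetical names (`Affix`, `binding_power`), not winnow's `dispatch!` macro.

```rust
#[derive(Clone, Copy)]
pub enum Affix {
    Prefix,
    Postfix,
    Infix,
}

/// Resolves an operator in one `match`, without trying alternatives in turn.
/// The powers mirror the example: unary operators 2, `*` `/` 1, `+` `-` 0.
pub fn binding_power(affix: Affix, symbol: char) -> Option<u32> {
    match (affix, symbol) {
        (Affix::Prefix, '-' | '+') => Some(2),
        (Affix::Postfix, '!') => Some(2),
        (Affix::Infix, '+' | '-') => Some(0),
        (Affix::Infix, '*' | '/') => Some(1),
        // Plays the role of the `_ => fail` arm.
        _ => None,
    }
}
```

The same character can then mean different things depending on position: `binding_power(Affix::Prefix, '+')` and `binding_power(Affix::Infix, '+')` return different powers.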

Author

Another idea. The implementation in nom (rust-bakery/nom#1362) uses 3 parsers, one for each affix, in the function signature. It has the advantage that you can use dispatch and there are no prefix, postfix, ... functions. The downside is that from the function signature alone it's hard to remember which parser is which. Also, fail has to be used if a parser is not provided.

Maybe a builder pattern would be ideal, e.g.:

precedence(digit1.parse_to::<i32>())
    .prefix(dispatch!{any; 
         "+" => empty.map(|_| move |a, b| a + b)
         "-" => empty.map(|_| move |a, b| a + b)
   })
   .postfix(dispatch!{any; ...})
   .infix(dispatch!{any; ...})
   .build()
.parse_next(i)

Methods are similar to repeat().fold. This approach would eliminate stray functions, reduce the performance impact of iterating over each parser for each affix, and allow users to use dispatch or any custom logic they want.

I’m not sure yet how this would interact with binding power. I'm going to research whether we can drop the explicit index argument and rely on array indexing, how binding power interacts between affixes, etc.
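
To get a feel for the builder shape, here is a stand-alone sketch over single-digit operands. It is not the proposed winnow API: `Precedence`, `Table`, `calculator`, and the helper functions are hypothetical, each affix gets one lookup function returning a binding power and a fold function, and an affix that is never configured falls back to a table that always fails.

```rust
type Unary = fn(i64) -> i64;
type Binary = fn(i64, i64) -> i64;
/// One lookup per affix: operator character -> (binding power, fold function).
type Table<F> = fn(char) -> Option<(u32, F)>;

/// Default table for an affix that was not configured; always fails.
fn none<F>(_: char) -> Option<(u32, F)> {
    None
}

pub struct Precedence {
    prefix: Table<Unary>,
    postfix: Table<Unary>,
    infix: Table<Binary>,
}

impl Precedence {
    pub fn new() -> Self {
        Self { prefix: none::<Unary>, postfix: none::<Unary>, infix: none::<Binary> }
    }
    pub fn prefix(mut self, table: Table<Unary>) -> Self {
        self.prefix = table;
        self
    }
    pub fn postfix(mut self, table: Table<Unary>) -> Self {
        self.postfix = table;
        self
    }
    pub fn infix(mut self, table: Table<Binary>) -> Self {
        self.infix = table;
        self
    }
    /// Parses single-digit operands; `None` on error or trailing input.
    pub fn parse(&self, src: &str) -> Option<i64> {
        let tokens: Vec<char> = src.chars().collect();
        let mut pos = 0;
        let value = self.expr(&tokens, &mut pos, 0)?;
        (pos == tokens.len()).then_some(value)
    }
    fn expr(&self, tokens: &[char], pos: &mut usize, min: u32) -> Option<i64> {
        let first = *tokens.get(*pos)?;
        *pos += 1;
        let mut lhs = if let Some((power, fold)) = (self.prefix)(first) {
            fold(self.expr(tokens, pos, power)?)
        } else {
            i64::from(first.to_digit(10)?)
        };
        while let Some(&op) = tokens.get(*pos) {
            if let Some((power, fold)) = (self.postfix)(op) {
                if power < min {
                    break;
                }
                *pos += 1;
                lhs = fold(lhs);
            } else if let Some((power, fold)) = (self.infix)(op) {
                if power < min {
                    break;
                }
                *pos += 1;
                // `power + 1` keeps equal-power infix operators left-associative.
                lhs = fold(lhs, self.expr(tokens, pos, power + 1)?);
            } else {
                break;
            }
        }
        Some(lhs)
    }
}

fn neg(x: i64) -> i64 { -x }
fn sub(a: i64, b: i64) -> i64 { a - b }
fn mul(a: i64, b: i64) -> i64 { a * b }

fn prefix_ops(c: char) -> Option<(u32, Unary)> {
    match c {
        '-' => Some((2, neg as Unary)),
        _ => None,
    }
}

fn infix_ops(c: char) -> Option<(u32, Binary)> {
    match c {
        '-' => Some((0, sub as Binary)),
        '*' => Some((1, mul as Binary)),
        _ => None,
    }
}

/// Example configuration: no postfix table is provided at all.
pub fn calculator() -> Precedence {
    Precedence::new().prefix(prefix_ops).infix(infix_ops)
}
```

In this sketch only the table for the requested affix is consulted at each position, and leaving out `.postfix(...)` simply means postfix operators are rejected.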

Author

I came across another idea. Instead of dispatching by the parser type, maybe we could dispatch by the input type:

enum Input<'i, I>{
   Prefix(&'i mut I),
   Postfix(&'i mut I),
   Infix(&'i mut I)
}

then it would be possible to (I'm not sure yet if Deref would work or not)

dispatch!{any;
        '-' => prefix(2, |x| -1 * x),
        '+' => prefix(2, |x| x),
        '!' => postfix(2, |x| factorial(x)),
        '+' => infix(0, |a, b| a + b),
        '-' => infix(0, |a, b| a - b),
        '*' => infix(1, |a, b| a * b),
        '/' => infix(1, |a, b| a / b),
        _ => fail,
    }

then prefix parsers would do

if let Prefix() = input {
   ...
} else {
  fail
}

Collaborator

This is nice because we only run the needed parsers (less alt, no dispatch just to get an error).

I agree about the opacity of parameter order. It's also not great if a parameter is unused (e.g. no postfix).

If we go with the "builder" approach, we are again violating the API design guidelines as discussed in #614 (comment). However, it does make it nice for naming the parameters and only requiring what is needed.

Maybe one way of looking at this is the error reporting. What happens if I use an operator in the wrong location?

Collaborator

Not quite getting the input type and dealing with custom input types in the middle of a parser adds a lot of complexity.

src/combinator/precedence.rs
#[doc(alias = "shunting yard")]
#[doc(alias = "precedence climbing")]
#[inline(always)]
pub fn precedence<I, ParseOperand, Operators, Operand: 'static, E>(
Collaborator

Organizationally, I prefer the "top level" thing going first and then branching out from there. In this case, precedence is core.

Collaborator

Streaming support

This looks to be agnostic of streaming support like separated is

Collaborator

Hide behind a feature flag: unstable-pratt

If this is going to start off unstable, then it's fine noting most of my feedback in the "tracking" issue and not resolving all of it here.

@epage epage mentioned this pull request Nov 12, 2024
2 tasks
Comment on lines +363 to +364
// recursive function
fn precedence_impl<I, ParseOperand, Operators, Operand: 'static, E>(
Collaborator

Is there a way we can do this without recursion? Parser authors are generally cautious around recursion due to blowing up the stack.

Hmm, seems all of the other ones use recursion
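
For reference, the usual non-recursive alternative is a shunting-yard style loop with explicit stacks, so nesting depth lives on the heap instead of the call stack. The sketch below is that different technique, not a rewrite of `precedence_impl`; it handles only left-associative infix operators over unsigned integers, and `power`, `apply`, and `eval_iterative` are hypothetical names.

```rust
/// Binding power of an infix operator, or `None` for anything else.
fn power(op: char) -> Option<u32> {
    match op {
        '+' | '-' => Some(0),
        '*' | '/' => Some(1),
        _ => None,
    }
}

/// Pops two operands, applies `op`, and pushes the result.
fn apply(values: &mut Vec<i64>, op: char) -> Option<()> {
    let rhs = values.pop()?;
    let lhs = values.pop()?;
    values.push(match op {
        '+' => lhs + rhs,
        '-' => lhs - rhs,
        '*' => lhs * rhs,
        '/' => lhs.checked_div(rhs)?,
        _ => return None,
    });
    Some(())
}

/// Evaluates `src` with two explicit stacks and no recursion.
pub fn eval_iterative(src: &str) -> Option<i64> {
    let mut values: Vec<i64> = Vec::new();
    let mut ops: Vec<char> = Vec::new();
    let mut chars = src.chars().peekable();
    let mut expect_operand = true;
    while let Some(&c) = chars.peek() {
        if expect_operand {
            // Read one run of digits as the operand.
            let mut n: i64 = 0;
            let mut seen = false;
            while let Some(d) = chars.peek().and_then(|c| c.to_digit(10)) {
                n = n * 10 + i64::from(d);
                seen = true;
                chars.next();
            }
            if !seen {
                return None;
            }
            values.push(n);
        } else {
            let p = power(c)?;
            // Reduce while the operator on top binds at least as tightly,
            // which gives left associativity for equal powers.
            while let Some(&top) = ops.last() {
                if power(top)? < p {
                    break;
                }
                ops.pop();
                apply(&mut values, top)?;
            }
            ops.push(c);
            chars.next();
        }
        expect_operand = !expect_operand;
    }
    // Empty input or a trailing operator leaves us still expecting an operand.
    if expect_operand {
        return None;
    }
    while let Some(op) = ops.pop() {
        apply(&mut values, op)?;
    }
    values.pop()
}
```

The trade-off is that prefix and postfix operators need extra bookkeeping in this style, which is part of why recursive formulations are common.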

}

// recursive function
fn precedence_impl<I, ParseOperand, Operators, Operand: 'static, E>(
Collaborator

From #614 (comment)

While looking at it I found a potential error. My current implementation is unsound. Consider 2 + 4 + 7. The current parser evaluates it as (4 + 7) + 2 instead of (2 + 4) + 7 when operators have equal binding power. That is why the algorithm uses associativity. While + and * are commutative, someone might want to parse function calls, for example.
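
The difference is invisible for `+` but shows up as soon as the operator is not commutative or associative. As a small worked example, independent of the parser itself, the two groupings can be compared with subtraction over the same operands; `fold_left` and `fold_right` are hypothetical helper names.

```rust
/// Left-to-right grouping: ((a - b) - c) ...
pub fn fold_left(xs: &[i64]) -> Option<i64> {
    let (first, rest) = xs.split_first()?;
    Some(rest.iter().fold(*first, |acc, x| acc - x))
}

/// Right-to-left grouping: a - (b - (c ...))
pub fn fold_right(xs: &[i64]) -> Option<i64> {
    let (last, init) = xs.split_last()?;
    Some(init.iter().rfold(*last, |acc, x| x - acc))
}
```

For the operands 2, 4, 7 the left grouping `(2 - 4) - 7` gives -9 while the right grouping `2 - (4 - 7)` gives 5, so the choice of grouping for equal binding powers changes the result.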

Successfully merging this pull request may close these issues.

Pratt parsing support
2 participants