-
Notifications
You must be signed in to change notification settings - Fork 44
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: implement Pratt parsing #614
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow, this looks great. You put a lot of work into this and I see care is taken in cases like the handling of infinite loops and such
#[doc(alias = "shunting yard")] | ||
#[doc(alias = "precedence climbing")] | ||
#[inline(always)] | ||
pub fn precedence<I, ParseOperand, Operators, Operand: 'static, E>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like names to make their role in the grammar clear. I'm trying to decide how this does and if there is anything better. We use separated
for this kind of thing but unsure how to tie it in without it being overly verbose
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe expression
? Similar to how combine
names this parser https://docs.rs/combine-language/latest/combine_language/fn.expression_parser.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh interesting, I wasn't aware of that crate. We should at least add expression_parser
to our aliases.
We likely should pick one and deal with it and rename it if a better name comes up. expression
seems just as good as any other. It fits with how this would be used in a language grammar.
}; | ||
|
||
/// An adapter for the [`Parser`] trait to enable its use in the [`precedence`] parser. | ||
pub trait PrecedenceParserExt<I, E> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Our rule of thumb is grammar-level concepts are standalone functions and value processing are trait functions. These seem grammar-level.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for reviewing! Do you suggest making prefix
, infix
.. standalone functions? I followed the concern in the attached issue to make <parser>.prefix(...)
work:
So I assume we'd go with
let calc = pratt( digits1.map(Expr::Int), ( '-'.prefix(Right(1), |r| unary(r, Op::Neg)); '+'.infix(Left(0), |l, r| binary(l, r, Op::Add)); '!'.prefix(Right(3), |r| unary(r, Op::Fact)); ) );
These prefix
, infix
, postfix
are strange beasts in this implementation. They are both grammar-level constructs and value processing functions because the data processing is required by the parser to advance further.
I though about the interface like prefix(0, "-").map(|(a, b)| a - b)
. It looks clean but what should be the default operator
when we don't call map
. For a unary
it is easy -- just return the argument, but for a binary
it is unclear how to merge operands.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ha! I previously looked at this and viewed the associativity as not grammar level and now looked at it and viewed it as grammar level. Normally trait methods are reserved for purely operating on the data, like map
. This is actually changing how we parse the results. The one weird middle ground we have is cut_err
. Unsure if this should be treated like that or not.
I also feel like a bunch of free functions floating around might not be the easiest for discovery and use.
I also suspect the ultimate answer will be dependent on the answer to other questions as I alluded to the interplay of my comments in #614 (comment)
#[doc(alias = "shunting yard")] | ||
#[doc(alias = "precedence climbing")] | ||
#[inline(always)] | ||
pub fn precedence<I, ParseOperand, Operators, Operand: 'static, E>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The list of parsers will need to be updated
#[doc(alias = "shunting yard")] | ||
#[doc(alias = "precedence climbing")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't know if aliases with spaces works
} | ||
impl<I, E, T: Parser<I, usize, E>> PrecedenceParserExt<I, E> for T where I: Stream {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
} | |
impl<I, E, T: Parser<I, usize, E>> PrecedenceParserExt<I, E> for T where I: Stream {} | |
} | |
impl<I, E, T: Parser<I, usize, E>> PrecedenceParserExt<I, E> for T where I: Stream {} |
/// Type-erased unary predicate that folds an expression into a new expression. | ||
/// Useful for supporting not only closures but also arbitrary types as operator predicates within the [`precedence`] parser. | ||
pub trait UnaryOp<O> { | ||
/// Invokes the [`UnaryOp`] predicate. | ||
fn fold_unary(&mut self, o: O) -> O; | ||
} | ||
/// Type-erased binary predicate that folds two expressions into a new expression similar to | ||
/// [`UnaryOp`] within the [`precedence`] parser. | ||
pub trait BinaryOp<O> { | ||
/// Invokes the [`BinaryOp`] predicate. | ||
fn fold_binary(&mut self, lhs: O, rhs: O) -> O; | ||
} | ||
|
||
impl<O, F> UnaryOp<O> for F | ||
where | ||
F: Fn(O) -> O, | ||
{ | ||
#[inline(always)] | ||
fn fold_unary(&mut self, o: O) -> O { | ||
(self)(o) | ||
} | ||
} | ||
impl<O, F> BinaryOp<O> for F | ||
where |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tend to prefer that impls be closely organized with their trait
@@ -174,6 +175,7 @@ pub use self::core::*; | |||
pub use self::debug::*; | |||
pub use self::multi::*; | |||
pub use self::parser::*; | |||
pub use self::precedence::*; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are dumping a lot of stray types into combinator
. The single-line summaries should make it very easy to tell they are related to precedence
(maybe be the first word) and somehow help the user know what, if any, they should care about
} | ||
} | ||
|
||
macro_rules! impl_parser_for_tuple { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we find the right "mode" by falling through an alt
of parsers for that mode and then get the weight and fold function from the first success.
Looking at the example
precedence(
digit1.try_map(|d: &str| d.parse::<i32>()),
(
"-".value(2).prefix(|x| -1 * x),
"+".value(2).prefix(|x| x),
"!".value(2).postfix(|x| factorial(x)),
"+".value(0).infix(|a, b| a + b),
"-".value(0).infix(|a, b| a + b),
"*".value(1).infix(|a, b| a * b),
"/".value(1).infix(|a, b| a / b),
),
)
.parse_next(i)
the alt seems unfortunate for performance reasons. dispatch
can be used in this case and I suspect a lot of other cases. It would be a big help to find a way to set this up so dispatch
can be used as well, e.g.
precedence(
digit1.try_map(|d: &str| d.parse::<i32>()),
dispatch!{any;
'-' => prefix(2, |x| -1 * x),
'+' => prefix(2, |x| x),
'!' => postfix(2, |x| factorial(x)),
'+' => infix(0, |a, b| a + b),
'-' => infix(0, |a, b| a + b),
'*' => infix(1, |a, b| a * b),
'/' => infix(1, |a, b| a / b),
_ => fail,
),
)
.parse_next(i)
(note: I might give other suggestions that run counter to this, we'll need to weigh out each one and figure out how we want to balance the different interests)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another idea. The implementation in nom
rust-bakery/nom#1362 uses 3 parsers one for each affix in the function signature. It has an advantage that you can use dispatch
and there is no prefix
, postfix
.. functions. The downside is that from just the function signature it's hard to remember which parser is which. And also fail
is used if the parser is not provided.
Maybe abuilder
pattern would be ideal e.g:
precedence(digit1.parse_to::<i32>())
.prefix(dispatch!{any;
"+" => empty.map(|_| move |a, b| a + b)
"-" => empty.map(|_| move |a, b| a + b)
})
.postfix(dispatch!{any; ...})
.infix(dispatch!{any; ...})
.build()
.parse_next(i)
Methods are similar to repeat().fold
. This approach would eliminate stray functions, reduce the performance impact of iterating over each parser for each affix, and allow users to use dispatch or any custom logic they want.
I’m not sure yet how this would interact with binding power. I'm going to research whether we can drop the explicit index argument and rely on array indexing, how binding power interact between affixes .etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I came across another idea. Instead of dispathing by the parser type maybe we could dispatch by the input type:
enum Input<'i, I>{
Prefix(&'i mut I),
Postfix(&'i mut I),
Infix(&'i mut I)
}
then it would be possible to (I'm not sure yet if Deref would work or not)
dispatch!{any;
'-' => prefix(2, |x| -1 * x),
'+' => prefix(2, |x| x),
'!' => postfix(2, |x| factorial(x)),
'+' => infix(0, |a, b| a + b),
'-' => infix(0, |a, b| a + b),
'*' => infix(1, |a, b| a * b),
'/' => infix(1, |a, b| a / b),
_ => fail,
)
then prefix
parsers would do
if let Prefix() = input {
...
} else {
fail
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is nice because we only run the needed parsers (less alt
, no dispatch
just to get an error).
I agree about the opacity of parameter order. Its also not great if a parameter is unused (e.g. no postfix).
If we go with the "builder" approach, we are again violating the API design guidelines as discussed in #614 (comment). However, it does make it nice for naming the parameters and only requiring what is needed.
Maybe one way of looking at this is the error reporting. What happens if I use an operator in the wrong location?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not quite getting the input type and dealing with custom input types in the middle of a parser adds a lot of complexity.
#[doc(alias = "shunting yard")] | ||
#[doc(alias = "precedence climbing")] | ||
#[inline(always)] | ||
pub fn precedence<I, ParseOperand, Operators, Operand: 'static, E>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Organizationally, O prefer the "top level" thing going first and then branching out from there. In this case, precedence
is core.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Streaming support
This looks to be agnostic of streaming support like separated
is
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hide behind a feature flagunstable-pratt
If this is going to start off unstable, then its fine noting most of my feedback in the "tracking" issue and not resolving all of it here
// recursive function | ||
fn precedence_impl<I, ParseOperand, Operators, Operand: 'static, E>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a way we can do this without recursion? Parser authors are generally cautious around recursion due to blowing up the stack.
Hmm, seems all of the other ones use recursion
} | ||
|
||
// recursive function | ||
fn precedence_impl<I, ParseOperand, Operators, Operand: 'static, E>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From #614 (comment)
While looking at it I found a potential error. My current implementation is unsound. Consider the 2 + 4 + 7. In the current parser it evaluates as (4 + 7) + 2 instead of (2 + 4) + 7 when operators have equal binding power . That is why the algorithm uses associativity. While + or * are commutative someone might want to parse function calls for example
Closes: #131
This is the initial implementation of the
operator precedence
parser based on the Pratt algorithm https://en.wikipedia.org/wiki/Operator-precedence_parser#Pratt_parsing.The interface:
Any parser that returns a number can be considered as operator with its
binding power
. To convert the parser intoprefix
,infix
orpostfix
, we use the extension trait on top of theimpl Parser
.Implementation details
Binding power, associativity, Pratt and Dijkstra
The Pratt algorithm only uses theEDIT: I am wrong here. We need to use associativity forbinding power
.Left
andright
associativity are used in Dijkstra'sShunting Yard
algorithm. (It seems likechumsky
incorrectly names its implementation aspratt
while usingshunting yard
internally).infix
operators to determine the order of the operators with equal binding power.I named the combinator
precedence
with aliases topratt
,shunting yard
,precedence climbing
,separated
. If any of the names are confusing, we may consider better alternatives.The implementation is partially inspired by https://gist.github.com/ilonachan/3d92577265846e5327a3011f7aa30770.
But it doesn't use any
Rc<...>
allocations for closures; instead, it uses uses& RefCell<& dyn ...>
references.Operators from the tuple are statically dispatched by the requested affix so
prefix.as_postfix()
will returnfail
whileprefix.as_prefix()
with return the actual parser. This may have a performance issue because it iterates through the whole tuple each time the parser is requested for any affix:See the trace:
(I tried to make a some sort of a linked list where each parser holds an index to the next one with the same affix but, it quickly became a nightmare of lifetimes, complexity, and compilation errors so it was rejected).
Performance
A quick criterion benchmark with a simple input &str
1-2*4+12-561-5-6*6/9-3+1*-2*4-758*3
shows:TODO
unstable-pratt