Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Building a stable and extensible parser API (and hiding the rougher implementation details behind sealed traits) #354

Open
zesterer opened this issue Mar 7, 2023 · 9 comments
Labels
1.0 Features that should be implemented for the 1.0 release api A problem with the design of an API feature help wanted Extra attention is needed interop An issue that affects interoperability with other crates

Comments

@zesterer
Copy link
Owner

zesterer commented Mar 7, 2023

Some things me might want to hide:

  • Mode
  • Check/Emit/etc.
  • Parser::go/Parser::go_check/Parser::go_emit/etc.
  • InputRef
  • Most of Input (Input::next_maybe, for example)

Hiding these details does not necessarily make the crate non-extensible! A viable route to recovering extensibility is to have a more simplistic set of traits for common interfaces (like Parser, Input, etc.) that provide a more limited API for which we can guarantee long-term stability, and then provide blanket implementations of the more complex sealed traits for implementers of the simpler traits.

The goal here is not to hurt downstream developers that may want to build upon chumsky, but to protect them from changes to volatile implementation details that are likely to happen as we find more ways to aggressively optimise parsers. This has a twofold benefit:

  1. Developers can extend the trait without fear that the ground is suddenly going to shift under them, providing room for an ecosystem to build up around a stable core interface

  2. We, as developers of said core, get the change to aggressively optimise the core and alter the way in which parsers work to provide users with better performance without worrying that we're hurting downstream extensions with regular breaking changes to APIs

How you can help

If you're thinking about building on top of chumsky in the future - perhaps to add new parsers, combinators, errors, etc - then knowing what requirements you have would be enormously helpful! What interfaces would you like to see around backtracking, error generation, token processing, etc.?

@zesterer zesterer added 1.0 Features that should be implemented for the 1.0 release help wanted Extra attention is needed api A problem with the design of an API feature labels Mar 7, 2023
@zesterer zesterer changed the title Hide implementation details within sealed trait(s) Building a stable and extensible parser API (and hiding the rougher implementation details behind sealed traits) Mar 8, 2023
@zesterer zesterer added the interop An issue that affects interoperability with other crates label Mar 8, 2023
@stefnotch
Copy link
Contributor

stefnotch commented Mar 8, 2023

One thing that I would find interesting would be a combinator that takes multiple parsers, and "merges" them.
It would apply all parsers to an input, and return whichever one leads to the longest result.

This would be convenient for writing grammars in really simple ways and not having to manually merge them correctly. The boring example would be

Statement → 
  "if" Condition "{" Statement "}" |
  "if" Condition "{" Statement "}" else "{" Statement "}"

While the way more interesting example is parsing decimal and hexadecimal numbers.

Number → 
  ("0" | "1" | "2" | ...)+ |
  "0x" ("0" | "1" | ... | "a" | "b" ...)+

The or combinator is similar to that, but it has a different behavior.

I am, sadly, no expert on parsing, so it's entirely possible that there's a better way of getting the same results.

@zesterer
Copy link
Owner Author

zesterer commented Mar 8, 2023

That's an interesting idea. We'd have to be careful to avoid exponential blowup (if you nest such parsers, even the happy path becomes exponential). Sort of like choice, except the parser that left the input the furthest progressed wins.

This is absolutely possible with the extension API I implemented today, and actually quite easy to implement.

@PizzaCrust
Copy link

PizzaCrust commented Mar 19, 2023

It would be really nice to have Mode publicly available instead of splitting it into two functions.
I don't really want to copy the same code without the output or otherwise leave potential performance on the table.

@zesterer
Copy link
Owner Author

zesterer commented Mar 19, 2023

It would be really nice to have Mode publicly available

I too would like it to be public... in time. The reason the current extension API doesn't reveal it is that I'm not even entirely convinced that it's the way we should be going long-term: it might be that another formulation pops up (maybe Rust implements HKTs, or we find a more expressive way to control parsing artefacts).

It's this uncertainty that makes me want to avoid making it public, at least for the initial 1.0 release: I really want to avoid introducing breaking changes after 1.0 if we can help it so that users have a stable core upon which to build.

Note that introducing it later would most definitely be possible without breaking changes: an additional method on ExtParser with a default implementation that calls out to either ExtParser::parse or ExtParser::check depending on the mode could easily be introduced.

For what it's worth, you can (with a bit of extra work) already emulate Mode yourself, like so:

struct MyParser { ... }

impl MyParser {
    fn parse_inner<M: MyMode, E>(&self, inp: &mut InputRef<'a, '_, I, E>) -> Result<M::Output<O>, E> {
        ...
    }
}

impl<'a, I, E> ExtParser for MyParser
where
    I: Input<'a>,
    E: ParserExtra<'a, I>,
{
    fn parse(&self, inp: &mut InputRef<'a, '_, I, E>) -> Result<O, E::Error> {
        self.parse_inner::<MyEmit>(inp)
    }

    fn check(&self, inp: &mut InputRef<'a, '_, I, E>) -> Result<(), E::Error> {
        self.parse_inner::<MyCheck>(inp)
    }
}

trait MyMode { type Output<T>; }

struct MyCheck;
impl MyMode for MyCheck { type Output<T> = (); }

struct MyEmit;
impl MyMode for MyEmit { type Output<T> = T; }

If you're implementing a lot of parsers, this 'extra work' isn't particularly enormous when compared to everything else, and can be easily be removed if we do decide to make Mode public in a future release.

Apologies if this seems like an unsatisfactory answer from the perspective of a consumer of chumsky. The 1.0 release has already required an enormous amount of work to prepare, and right now my priority is minimising the API surface area in service of releasing a polished core API. This isn't the end of the story though: successive releases will be building upon this minimal central core (in particular, I'm looking forward to a visitor API that allows writing 'static analysis' passes for parsers, such as #295).

@stefnotch
Copy link
Contributor

stefnotch commented Aug 20, 2023

That's an interesting idea. We'd have to be careful to avoid exponential blowup (if you nest such parsers, even the happy path becomes exponential). Sort of like choice, except the parser that left the input the furthest progressed wins.

This is absolutely possible with the extension API I implemented today, and actually quite easy to implement.

Any pointers on how I'd get started with implementing this for myself? I'm using the latest version straight from GitHub.
Edit: Ah, I found it https://docs.rs/chumsky/1.0.0-alpha.4/chumsky/extension/index.html

I'm mainly using it for tasks where a regex-based lexer could also do the job. So at I'll be able to avoid the exponential blowup for the most part.

@zesterer
Copy link
Owner Author

If you look at the public API of InputRef, you should have everything you need there. You can invoke parsers and do backtracking on each one, choosing the output that left the input offset in the furthest state.

@stefnotch
Copy link
Contributor

choosing the output that left the input offset in the furthest state.

I attempted this, and haven't quite found a good way of doing that. I can query the InputRef.offset, however that struct doesn't implement Ord. So I have offsets that cannot be compared.

P.S. As a workaround, I tried constraining the output type, but that makes it incompatible with the pratt parser API for reasons that I don't quite understand yet.

@zesterer
Copy link
Owner Author

Oh, we should definitely implement Ord for it in that case. I'll try to get time to do this in the next few days.

@stefnotch
Copy link
Contributor

stefnotch commented Aug 26, 2023

I implemented it, hopefully that saves you some time #511

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
1.0 Features that should be implemented for the 1.0 release api A problem with the design of an API feature help wanted Extra attention is needed interop An issue that affects interoperability with other crates
Projects
None yet
Development

No branches or pull requests

3 participants