
Is it possible to parse parts of the input? #1424

Open
jjalowie opened this issue Jun 17, 2024 · 12 comments
@jjalowie

What is your question?

I'm interested in writing a transformer for certain parts of input programs, one that wouldn't require a full language grammar. So let's say I have the sample C++ code below:

void function(int a, int b) {
  // body
}

I want to parse just the function signatures and transform them, e.g. by prepending '_' to all argument names. I.e. the output for the above input program should be:

void function(int _a, int _b) {
  // body (unchanged)
}

How could I possibly achieve this using Lark?

@MegaIng
Member

MegaIng commented Jun 17, 2024

See #1371

It's not exactly easy, and especially for C-like syntax, which is already tricky to parse with Lark, it might be quite annoying.

@jjalowie
Author

Wow, thanks for the quick response.
I'm thinking about moving forward with Earley using the start: /.*/ rule /.*/ approach (my inputs aren't very big, so it should be fine performance-wise).
I ran into the following problem, though, when experimenting:

import lark
grammar = """
    start: /.*/ expr /.*/
    expr: "qwerty"
    %import common.WS
    %ignore WS
"""
parser = lark.Lark(grammar, start='start', parser='earley', lexer='dynamic_complete')

ast = parser.parse("""asdf qwerty poiuy""")
print(ast.pretty())

This gives the following error:

[...]
  File "/home/jjalowie/.bin/mlir_parser/venv/lib/python3.10/site-packages/lark/parser_frontends.py", line 215, in create_earley_parser
    return f(lexer_conf, parser_conf, resolve_ambiguity=resolve_ambiguity,
  File "/home/jjalowie/.bin/mlir_parser/venv/lib/python3.10/site-packages/lark/parser_frontends.py", line 192, in create_earley_parser__dynamic
    earley_matcher = EarleyRegexpMatcher(lexer_conf)
  File "/home/jjalowie/.bin/mlir_parser/venv/lib/python3.10/site-packages/lark/parser_frontends.py", line 178, in __init__
    raise GrammarError("Dynamic Earley doesn't allow zero-width regexps", t)
lark.exceptions.GrammarError: ("Dynamic Earley doesn't allow zero-width regexps", TerminalDef('__ANON_0', '.*'))

I'm using lark 1.1.9.

@MegaIng
Member

MegaIng commented Jun 17, 2024

As it says in the error message, the terminal can't be zero width. Use /.+/? instead.

@jjalowie
Author

Hm, I can't figure out how to proceed.
Here is an example:

import lark
grammar = """
    start: /.+/? expr+ /.+/?
    expr: "qwerty" | "asdf"
"""
parser = lark.Lark(grammar, start='start', parser='earley', lexer='dynamic_complete')
ast = parser.parse("""asdf qwerty""")
print(ast.pretty())
# output:
# start
#   expr
#    qwerty

I can't force asdf to be parsed as an expr, because it gets consumed greedily by /.+/?. Any tips on how to deal with that?

@MegaIng
Member

MegaIng commented Jun 17, 2024

Use ambiguity='explicit' and manually select the correct parse, see https://github.com/lark-parser/lark/blob/master/examples/advanced/dynamic_complete.py

Or choose the easier and quicker route and use the scan function I coded up in the other issue.

@jjalowie
Author

jjalowie commented Jun 17, 2024

From what I can see, MegaIng@4975608 doesn't expose a way to use transformers on the matched code. Could this be improved so I can also use transformers? What would the needed steps be?

For now I went with the Earley parser. I'm stuck on the code below. How should it behave?

import lark

grammar = """
    start: /.+/? expr+ /.+/?
    expr: "asdf" | "qwerty"
"""

parser = lark.Lark(grammar, parser='earley', lexer='dynamic_complete', ambiguity='explicit')
ast = parser.parse("""asdf qwerty""")
print(ast.pretty())

It yields this output:

start
  expr
   qwerty

Shouldn't there be an ambiguity between /.+/? and expr visible in the AST this time, because of ambiguity='explicit'? Isn't this a bug in Lark?

@MegaIng
Member

MegaIng commented Jun 17, 2024

Yeah, that does indeed look like a bug, maybe @erezsh has an idea what is going on.

With regard to scan: you have to do a bit of extra work right now, but you get the start and end positions of the segments that match, which you can then replace in the original text. This would make a good recipe to add to the documentation when scan gets officially added.

@MegaIng
Member

MegaIng commented Jun 17, 2024

This function should do what you want using the scan method:

from typing import Callable, Optional

import lark


def scan_and_replace(parser: lark.Lark, text: str, replacement: Callable[[lark.ParseTree], str],
                     start: Optional[str] = None) -> str:
    """
    Scans `text` and replaces every match found by `parser` with the value returned by
    `replacement` for the corresponding tree.

    `start` is for passing in the start rule if required, not the starting position of the scanning.
    """
    last = 0
    res = ""
    for (start_pos, end_pos), tree in parser.scan(text, start=start):
        res += text[last:start_pos]
        res += replacement(tree)
        last = end_pos
    res += text[last:]  # keep the text after the final match
    return res
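Since scan only exists on that branch, the splicing logic itself can be exercised in isolation with hard-coded match spans standing in for scan's results (a sketch; fake_matches is made up):

```python
def splice(text, matches, replacement):
    # `matches` stands in for parser.scan(): ((start, end), payload) pairs.
    last = 0
    res = ""
    for (start_pos, end_pos), payload in matches:
        res += text[last:start_pos]
        res += replacement(payload)
        last = end_pos
    res += text[last:]  # without this line, text after the last match is lost
    return res

fake_matches = [((5, 11), None)]  # pretend "qwerty" was matched at [5, 11)
print(splice("asdf qwerty poiuy", fake_matches, lambda _: "QWERTY"))
# asdf QWERTY poiuy
```

Note the final `res += text[last:]`: it keeps whatever follows the last match, which is easy to drop by accident.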

@jjalowie
Author

jjalowie commented Jun 17, 2024

This function should do what you want using the scan method:
[...]

That looks very promising. I will pick up from there. Thanks a lot!

@erezsh
Member

erezsh commented Jun 18, 2024

For now I went with the Earley parser. I'm stuck on the code below. How should it behave?

import lark

grammar = """
    start: /.+/? expr+ /.+/?
    expr: "asdf" | "qwerty"
"""

It's not a bug: you have a space in your text, and only ANY can handle it.

When I change to this grammar:

    !start: any? expr+ any?
    any: /.+/
    !expr: "asdf" | "qwerty"

    %ignore " "

I get:

_ambig
  start
    expr        asdf
    any  qwerty
  start
    expr        asdf
    any qwerty
  start
    expr        asdf
    expr        qwerty

(this is without dynamic_complete, which adds more derivations)

@MegaIng
Member

MegaIng commented Jun 18, 2024

@erezsh No, what actually fixed it is moving /.+/? into a separate rule. These two grammars should have the same behavior, but don't:

grammar = """
    start: any? expr+ any?
    any: /.+/
    !expr: "asdf" | "qwerty"
"""
grammar = """
    start: /.+/? expr+ /.+/?
    !expr: "asdf" | "qwerty"
"""

The latter doesn't produce any ambiguities with dynamic_complete; the former does.

Similarly for your grammar: with dynamic_complete, it produces fewer derivations when any is inlined.

@erezsh
Member

erezsh commented Jun 30, 2024

I tested @MegaIng's example again on the latest master, and it looks like this bug is fixed!
