
Is it possible to parse parts of the input? #1424

Open
jjalowie opened this issue Jun 17, 2024 · 12 comments
@jjalowie

What is your question?

I'm interested in writing a transformer for certain parts of input programs, one that wouldn't require a full language grammar. So let's say I have the sample C++ code below:

void function(int a, int b) {
  // body
}

I want to parse just the function signatures and transform them, e.g. by prepending '_' to all argument names. I.e. the output for the above input program should be:

void function(int _a, int _b) {
  // body (unchanged)
}

How could I possibly achieve this using Lark?

@MegaIng
Member

MegaIng commented Jun 17, 2024

See #1371

It's not exactly easy, and especially for C-like syntax, which is already tricky to parse with Lark, it might be quite annoying.

@jjalowie
Author

Wow, thanks for the quick response.
I'm thinking about moving forward with Earley using the start: /.*/ rule /.*/ approach (my inputs aren't very big, so it should be fine performance-wise).
I ran into the following problem, though, when experimenting:

import lark
grammar = """
    start: /.*/ expr /.*/
    expr: "qwerty"
    %import common.WS
    %ignore WS
"""
parser = lark.Lark(grammar, start='start', parser='earley', lexer='dynamic_complete')

ast = parser.parse("""asdf qwerty poiuy""")
print(ast.pretty())

This gives the following error:

[...]
  File "/home/jjalowie/.bin/mlir_parser/venv/lib/python3.10/site-packages/lark/parser_frontends.py", line 215, in create_earley_parser
    return f(lexer_conf, parser_conf, resolve_ambiguity=resolve_ambiguity,
  File "/home/jjalowie/.bin/mlir_parser/venv/lib/python3.10/site-packages/lark/parser_frontends.py", line 192, in create_earley_parser__dynamic
    earley_matcher = EarleyRegexpMatcher(lexer_conf)
  File "/home/jjalowie/.bin/mlir_parser/venv/lib/python3.10/site-packages/lark/parser_frontends.py", line 178, in __init__
    raise GrammarError("Dynamic Earley doesn't allow zero-width regexps", t)
lark.exceptions.GrammarError: ("Dynamic Earley doesn't allow zero-width regexps", TerminalDef('__ANON_0', '.*'))

I'm using lark 1.1.9.

@MegaIng
Member

MegaIng commented Jun 17, 2024

As it says in the error message, the terminal can't be zero width. Use /.+/? instead.

@jjalowie
Author

Hm, I can't figure out how to proceed.
Here is an example:

import lark
grammar = """
    start: /.+/? expr+ /.+/?
    expr: "qwerty" | "asdf"
"""
parser = lark.Lark(grammar, start='start', parser='earley', lexer='dynamic_complete')
ast = parser.parse("""asdf qwerty""")
print(ast.pretty())
# output:
# start
#   expr
#    qwerty

I can't force asdf to be parsed as an expr, because it gets consumed greedily by /.+/?. Any tips on how to deal with that?

@MegaIng
Member

MegaIng commented Jun 17, 2024

Use ambiguity='explicit' and manually select the correct parse, see https://github.com/lark-parser/lark/blob/master/examples/advanced/dynamic_complete.py

Or choose the easier and quicker route and use the scan function I coded up in the other issue.

@jjalowie
Author

jjalowie commented Jun 17, 2024

From what I can see, MegaIng@4975608 doesn't expose a way to use transformers on the matched code. Could this be improved so I can also use transformers? What would the needed steps be?

For now I went with the Earley parser. I'm stuck on the code below. How should it behave?

import lark

grammar = """
    start: /.+/? expr+ /.+/?
    expr: "asdf" | "qwerty"
"""

parser = lark.Lark(grammar, parser='earley', lexer='dynamic_complete', ambiguity='explicit')
ast = parser.parse("""asdf qwerty""")
print(ast.pretty())

It yields this output:

start
  expr
   qwerty

Shouldn't there be an ambiguity between /.+/? and expr visible in the AST this time, because of ambiguity='explicit'? Isn't this a bug in Lark?

@MegaIng
Member

MegaIng commented Jun 17, 2024

Yeah, that does indeed look like a bug, maybe @erezsh has an idea what is going on.

With regard to scan: you have to do a bit of extra work right now, but you get the start and end positions of the segments that match, which you can then replace in the original text. This would make a good recipe to add to the documentation when scan gets officially added.

@MegaIng
Member

MegaIng commented Jun 17, 2024

This function should do what you want using the scan method:

from typing import Callable, Optional

import lark


def scan_and_replace(parser: lark.Lark, text: str, replacement: Callable[[lark.ParseTree], str],
                     start: Optional[str] = None) -> str:
    """
    Scans `text` and replaces every match found by `parser` with the value returned by
    `replacement` for the corresponding tree.

    `start` is for passing in the start rule if required, not the starting position of the scanning.
    """
    last = 0
    res = ""
    for (start_pos, end_pos), tree in parser.scan(text, start=start):
        res += text[last:start_pos]
        res += replacement(tree)
        last = end_pos
    res += text[last:]  # keep the text after the final match
    return res
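Since scan only exists on that branch, the splicing logic itself can be exercised in isolation with hard-coded match spans standing in for scan's results (a sketch; fake_matches is made up):

```python
def splice(text, matches, replacement):
    # `matches` stands in for parser.scan(): ((start, end), payload) pairs.
    last = 0
    res = ""
    for (start_pos, end_pos), payload in matches:
        res += text[last:start_pos]
        res += replacement(payload)
        last = end_pos
    res += text[last:]  # without this line, text after the last match is lost
    return res

fake_matches = [((5, 11), None)]  # pretend "qwerty" was matched at [5, 11)
print(splice("asdf qwerty poiuy", fake_matches, lambda _: "QWERTY"))
# asdf QWERTY poiuy
```

Note the final `res += text[last:]`: it keeps whatever follows the last match, which is easy to drop by accident.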

@jjalowie
Author

jjalowie commented Jun 17, 2024

This function should do what you want using the scan method:
[...]

That looks very promising. I will pick up from there. Thanks a lot!

@erezsh
Member

erezsh commented Jun 18, 2024

For now I went with the Earley parser. I'm stuck on the code below. How should it behave?

import lark

grammar = """
    start: /.+/? expr+ /.+/?
    expr: "asdf" | "qwerty"
"""

It's not a bug: you have a space in your text, and only ANY can handle it.

When I change to this grammar:

    !start: any? expr+ any?
    any: /.+/
    !expr: "asdf" | "qwerty"

    %ignore " "

I get:

_ambig
  start
    expr        asdf
    any  qwerty
  start
    expr        asdf
    any qwerty
  start
    expr        asdf
    expr        qwerty

(this is without dynamic_complete, which adds more derivations)

@MegaIng
Member

MegaIng commented Jun 18, 2024

@erezsh No, what actually fixed it is moving /.+/? into a separate rule. These two grammars should have the same behavior, but don't:

grammar = """
    start: any? expr+ any?
    any: /.+/
    !expr: "asdf" | "qwerty"
"""
grammar = """
    start: /.+/? expr+ /.+/?
    !expr: "asdf" | "qwerty"
"""

The latter doesn't produce any ambiguities with dynamic_complete; the former does.

Similarly for your grammar: with dynamic_complete, it produces fewer derivations when any is inlined.

@erezsh
Member

erezsh commented Jun 30, 2024

I tested @MegaIng's example again on the latest master, and it looks like this bug is fixed!
