Porting from pyparsing match_previous_literal #1437

Open
kevemueller opened this issue Jul 8, 2024 · 4 comments
@kevemueller

What is your question?
I am porting from pyparsing to Lark in the expectation of better performance. Initial tests look very promising.

One of the few constructs I could not work out how to express in Lark is match_previous_literal, which dynamically matches a repeat of a previously matched literal.

# pyparsing example
from pyparsing import Word, nums, match_previous_literal

first = Word(nums)
# matches e.g. "1:1" but not "1:2"
match_expr = first + ":" + match_previous_literal(first)

(see pyparsing-docs.readthedocs.io)

I need this functionality to match a sed-like search-and-replace construct:
${var:S/from/to/g}, including all of its equivalents such as ${var:S#from#to#g}
I would like to match the character after the signaling token :S and use it as the delimiter for the expression. The delimited content must be parsed as well, since it may contain constructs like ${var2}.

If you're having trouble with your code or grammar

Currently I use a workaround based on templates that simply lists some commonly used separators, but this is neither exhaustive nor generic, since almost any character can be used as the separator.

?sc_template{x, sep}: sep x sep x sep /[1gW]+/?
expansion_modifier_sc: /[SC]/ (sc_template{token, "/"} | sc_template{token, "%"})

I wonder whether there is any way to define this generically in the grammar. I could live with an interim solution that retrieves the delimited content and runs a parse on it in a post-processing step.

@erezsh
Member

erezsh commented Jul 8, 2024

LALR only parses context-free grammars. However, a function like match_previous_literal() is context-sensitive. So the parser can't help you there.

But a regular expression should be capable of matching this construct.

It sounds like the best solution is to create a regexp terminal that matches the entire expression, and then, in post-processing, run the regexp again with groups to split the string into its parts.
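
A minimal sketch of that two-step idea, using only Python's standard `re` module and leaving Lark out of the picture. The pattern, the group names (`sep`, `old`, `new`, `flags`) and the helper `split_modifier` are illustrative assumptions, not part of Lark or of the original grammar; the flag set `[1gW]` is taken from the grammar above.

```python
import re

# Hypothetical pattern: the first character is captured as "sep" and then
# reused as the delimiter through the named back-reference (?P=sep).
SC_RE = re.compile(
    r"(?P<sep>.)(?P<old>.*?)(?P=sep)(?P<new>.*?)(?P=sep)(?P<flags>[1gW]*)"
)

def split_modifier(text):
    """Split a string like '/from/to/g' into (sep, old, new, flags)."""
    m = SC_RE.fullmatch(text)
    if m is None:
        raise ValueError(f"not a search/replace modifier: {text!r}")
    return m.group("sep"), m.group("old"), m.group("new"), m.group("flags")

print(split_modifier("/from/to/g"))
print(split_modifier("#from#to#g"))
```

The `old` and `new` parts come back as plain strings, so anything nested in them (such as `${var2}`) would still need a second parse.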

@kevemueller
Author

Hi Erez,

thanks for pointing me in the right direction with regular expressions.

I confirm that it is as easy as

expansion_modifier_sc: /[SC]/ /(.).*?\1.*?\1[1gW]*/

which back-references the first matched character within the RE. The complete expression is handed to the post-processor, which still needs to take it apart using Python code.
To complete the exercise I still need to add escaping to the RE, but that is beyond Lark.
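
One possible way to handle escaping, sketched with plain `re`. It assumes that a backslash escapes the following character (so `\/` is a literal slash inside a `/`-delimited field); the thread does not state which escaping convention applies, so treat that as an assumption. `FIELD`, `ESCAPED_RE` and `split_escaped` are hypothetical names.

```python
import re

# A field is a run of either an escaped character (backslash + any char)
# or any single character that is not the delimiter captured as "sep".
FIELD = r"(?:\\.|(?!(?P=sep)).)*"

ESCAPED_RE = re.compile(
    r"(?P<sep>.)(?P<old>" + FIELD + r")(?P=sep)"
    r"(?P<new>" + FIELD + r")(?P=sep)(?P<flags>[1gW]*)"
)

def split_escaped(text):
    """Split the modifier and turn escaped delimiters back into literals."""
    m = ESCAPED_RE.fullmatch(text)
    if m is None:
        raise ValueError(f"not a search/replace modifier: {text!r}")
    sep = m.group("sep")
    # Assumption: only the escaped delimiter is unescaped; other
    # backslash sequences are left untouched.
    old = m.group("old").replace("\\" + sep, sep)
    new = m.group("new").replace("\\" + sep, sep)
    return sep, old, new, m.group("flags")

print(split_escaped(r"/a\/b/c/g"))
```

This sketch does not cover a backslash used as the delimiter itself.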

Thx for this nice tool.

@kevemueller
Author

kevemueller commented Jul 10, 2024

Update:
This works with earley/dynamic, but not with lalr/contextual.
lalr/contextual adds additional groups, which makes it impossible to predict the back-reference index (sometimes \1, sometimes \2).

The error message is re.error: cannot refer to an open group at position

The workaround is to use a named capturing group, which results in the following grammar:

expansion_modifier_sc: /[SC]/ /(?P<sep>.).*?(?P=sep).*?(?P=sep)[1gW]*/

This is agnostic to the lexer adding additional groups.
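
The effect can be reproduced with plain `re`, without Lark. In this sketch, `wrap` is a hypothetical stand-in for a lexer that encloses each terminal's pattern in a group of its own; how Lark actually combines patterns is not shown in this thread, so this is an assumption made purely for illustration.

```python
import re

NUMBERED = r"(.).*?\1.*?\1[1gW]*"
NAMED = r"(?P<sep>.).*?(?P=sep).*?(?P=sep)[1gW]*"

def wrap(pattern, name="BODY"):
    # Hypothetical: enclose the terminal in an outer named group. The outer
    # group becomes group 1, so "\1" inside now points at a group that is
    # still open while the pattern is being parsed.
    return f"(?P<{name}>{pattern})"

def compiles(pattern):
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

print(compiles(NUMBERED))        # True: on its own, \1 is the separator
print(compiles(wrap(NUMBERED)))  # False: \1 now refers to the open outer group
print(compiles(wrap(NAMED)))     # True: (?P=sep) does not depend on numbering
```

The named form keeps working because the reference is resolved by name, regardless of how many groups are added in front of it.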

@erezsh
Member

erezsh commented Jul 10, 2024

That makes sense. Maybe there's a way we could make it work, for example by adjusting the group indexes.

Although requiring names for back-references almost sounds like a win :)
