Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Zero length regular expression match misbehaviour #2565

Closed
weeble opened this issue Mar 30, 2023 · 4 comments · Fixed by #2677
Closed

Zero length regular expression match misbehaviour #2565

weeble opened this issue Mar 30, 2023 · 4 comments · Fixed by #2677
Labels
Milestone

Comments

@weeble
Copy link

weeble commented Mar 30, 2023

Description

Some regular expressions behave weirdly when matches are zero length - they miss matches or match too often.

Reproduction

Case 1

Input:

jq --null-input --compact-output '"xx"|match("";"g")'

Output:

{"offset":0,"length":0,"string":"","captures":[]}
{"offset":1,"length":0,"string":"","captures":[]}

Expected:

{"offset":0,"length":0,"string":"","captures":[]}
{"offset":1,"length":0,"string":"","captures":[]}
{"offset":2,"length":0,"string":"","captures":[]}

There should be a match at the end of the string too.

Case 2

Input:

$ jq --null-input --compact-output '"xx"|match("$";"g")'

Output:

{"offset":2,"length":0,"string":"","captures":[]}
{"offset":2,"length":0,"string":"","captures":[]}

Expected:

{"offset":2,"length":0,"string":"","captures":[]}

The same match shouldn't show up twice. (Extra matches seem to be returned in number equal to the length of the string, for some reason. But they're all the same match at the same offset.)

Environment

  • OS and Version: Ubuntu
  • jq version: master branch, commit cff5336

Additional context

I stumbled on this trying to fix #2148 and found that my attempted solution didn't behave like I expected it to. I tried to isolate the problem and ended up with these.

@weeble
Copy link
Author

weeble commented Mar 31, 2023

Okay, having looked at f_match, in particular this handling for zero-width matches and the loop termination condition, I see some issues. Here's another test case:

jq --null-input --compact-output '"xx"|match("|x";"g")'

This outputs:

{"offset":0,"length":0,"string":"","captures":[]}
{"offset":1,"length":0,"string":"","captures":[]}

But it should output:

{"offset":0,"length":0,"string":"","captures":[]}
{"offset":0,"length":1,"string":"x","captures":[]}
{"offset":1,"length":0,"string":"","captures":[]}
{"offset":1,"length":1,"string":"x","captures":[]}
{"offset":2,"length":0,"string":"","captures":[]}

Example using PCRE on regex101

@weeble
Copy link
Author

weeble commented Mar 31, 2023

Hmmm, oniguruma doesn't make this easy. There doesn't seem to be any way to call onig_search to get the next match that isn't a zero-width match at the start position. It seems you can either call it with a callback and get all the matches, which could be very inefficient, or you call it with a start offset and get only one match. This also makes it clear that my approach in #2564 is going to be way more costly than I thought, because I was assuming that match(...; "g") returns results lazily, and it doesn't look like it does.

I wonder if you would get the right result if you called it with the callback and filtered out every match that overlaps with the previously replaced match.

@weeble
Copy link
Author

weeble commented Mar 31, 2023

This seems very thorny. It might be worth reviewing what other users of oniguruma do to implement this. It doesn't even seem like there's universal agreement between regex engines, especially with awkward cases like "xx"|match("x?";"g") and "xx"|match("|x";"g")

@itchyny itchyny added the bug label Jun 3, 2023
@pkoppstein
Copy link
Contributor

pkoppstein commented Jul 4, 2023

Please note that Case 2 in the original post has been resolved by 83f375c :

jqMaster -nM '"xx"|[match("$";"g")] | length'
1

Note also that in this repaired version of jq, the following is nevertheless incorrect:

jqMaster -nM '"x"|sub("";"y"; "g")'
"yx"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants