Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WASM: Some builtins don't work with strings >64 characters as inputs #6376

Closed
sandhose opened this issue Nov 2, 2023 · 4 comments · Fixed by #6395
Closed

WASM: Some builtins don't work with strings >64 characters as inputs #6376

sandhose opened this issue Nov 2, 2023 · 4 comments · Fixed by #6395
Labels

Comments

@sandhose
Copy link

sandhose commented Nov 2, 2023

Short description

Some builtins, at least regex.find_all_string_submatch_n, don't work for input strings longer than 64 characters

Steps To Reproduce

repro.rego:

package repro

const_c64 := "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
const_c65 := "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"

test.input_no_dollar_c64 { match_no_dollar(input.c64) }
test.input_no_dollar_c65 { match_no_dollar(input.c65) }
test.input_dollar_c64 { match_dollar(input.c64) }
test.input_dollar_c65 { match_dollar(input.c65) }
test.const_no_dollar_c64 { match_no_dollar(const_c64) }
test.const_no_dollar_c65 { match_no_dollar(const_c65) }
test.const_dollar_c64 { match_dollar(const_c64) }
test.const_dollar_c65 { match_dollar(const_c65) }

match_no_dollar(s) = m {
	[m] = regex.find_all_string_submatch_n("^.*", s, 1)
}

match_dollar(s) = m {
	[m] = regex.find_all_string_submatch_n("^.*$", s, 1)
}

input.json:

{
  "c64": "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",
  "c65": "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
}

Running with -t rego

They all evaluate truthy:

$ opa eval --format pretty -t rego -d ./repro.rego 'data.repro.test' -i input.json 
[
  "const_dollar_c64",
  "const_dollar_c65",
  "const_no_dollar_c64",
  "const_no_dollar_c65",
  "input_dollar_c64",
  "input_dollar_c65",
  "input_no_dollar_c64",
  "input_no_dollar_c65"
]

Running with -t wasm

They all evaluate truthy:

$ opa eval --format pretty -t wasm -d ./repro.rego 'data.repro.test' -i input.json 
[
  "const_dollar_c64",
  "const_dollar_c65",
  "const_no_dollar_c64",
  "const_no_dollar_c65",
  "input_dollar_c64",
  "input_no_dollar_c64",
  "input_no_dollar_c65"
]

Notice how input_dollar_c65 is the only test evaluating as undefined.
So this issue is only when:

  • the string comes from the input
  • it is more than 64 characters long
  • the regex matches until the end of the string

Expected behavior

That it correctly evaluates, both in wasm and rego mode

Additional context

Tested both on the latest version and old version, as far back as v0.36.0

@sandhose sandhose added the bug label Nov 2, 2023
@srenatus
Copy link
Contributor

srenatus commented Nov 3, 2023

I tried to see if it affected any natively implemented builtin or is specific to the regexp one. I did this is a simplified repro rego:

package repro

c65 := count(input.c65)
c64 := count(input.c65)

and it turns out that the result of this is the same for -t wasm and -t rego:

{
  "c64": 63,
  "c65": 64
}

Surprise! The input data is one character short, it seems. But that only slightly varies the bug report, not in any material way.

Also, I guess you've already ruled out that it's a more general problem, since without the trailing $ the outcomes are as expected.

So I'll keep digging.

@srenatus
Copy link
Contributor

srenatus commented Nov 3, 2023

OK at this point, I'm not sure what works and what doesn't and what shouldn't 😅 -- We're passing UNANCHORED to re2 here and I guess that makes sense for submatches, but it might not square with anchored regexps fed into find_all_string_submatch_n.

So, despite this being somewhat buggy -- Where did the need arise for you to use an anchored RE with find_all_string_submatch_n?

@srenatus
Copy link
Contributor

srenatus commented Nov 9, 2023

Coming back to this! 🔍

srenatus added a commit to srenatus/opa that referenced this issue Nov 9, 2023
When feeding a `char *` into `re->Match()`, it was converted to a StringPiece,
taking its size as `strlen()`. For our (long) input, that wasn't resulting in
the correct size, and did then freak out the re2 match input validation if the
regular expression has an end anchor, but the endpos wasn't the same as its
length. Since the endpos was taken from `s->len`, and the "length" taken via
the mentioned StringPiece's strlen() call, they did indeed not match.

Worked around by feeding it a properly-constructed std::string instead. I'm a
C++ novice at best, but it does the trick, and I'm reasonable certain it's less
wrong than before.

Fixes open-policy-agent#6376.

Signed-off-by: Stephan Renatus <[email protected]>
srenatus added a commit to srenatus/opa that referenced this issue Nov 9, 2023
When feeding a `char *` into `re->Match()`, it was converted to a StringPiece,
taking its size as `strlen()`. For our (long) input, that wasn't resulting in
the correct size, and did then freak out the re2 match input validation if the
regular expression has an end anchor, but the endpos wasn't the same as its
length. Since the endpos was taken from `s->len`, and the "length" taken via
the mentioned StringPiece's strlen() call, they did indeed not match.

Worked around by feeding it a properly-constructed std::string instead. I'm a
C++ novice at best, but it does the trick, and I'm reasonable certain it's less
wrong than before.

Fixes open-policy-agent#6376.

Signed-off-by: Stephan Renatus <[email protected]>
@srenatus
Copy link
Contributor

srenatus commented Nov 9, 2023

☝️ PR is ready. Thanks again for helping uncover this ✨

ashutosh-narkar pushed a commit that referenced this issue Nov 9, 2023
When feeding a `char *` into `re->Match()`, it was converted to a StringPiece,
taking its size as `strlen()`. For our (long) input, that wasn't resulting in
the correct size, and did then freak out the re2 match input validation if the
regular expression has an end anchor, but the endpos wasn't the same as its
length. Since the endpos was taken from `s->len`, and the "length" taken via
the mentioned StringPiece's strlen() call, they did indeed not match.

Worked around by feeding it a properly-constructed std::string instead. I'm a
C++ novice at best, but it does the trick, and I'm reasonable certain it's less
wrong than before.

Fixes #6376.

Signed-off-by: Stephan Renatus <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants