Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"contains" filter behaves improperly with NUL characters #1732

Closed
wchargin opened this issue Oct 2, 2018 · 12 comments
Closed

"contains" filter behaves improperly with NUL characters #1732

wchargin opened this issue Oct 2, 2018 · 12 comments

Comments

@wchargin
Copy link
Contributor

wchargin commented Oct 2, 2018

Description

The contains(needle) filter does not match an input that contains
needle only after a NUL character. In JSON (and Unicode), NUL is a
normal character, not an end-of-string marker.

To reproduce

jqplay link: https://jqplay.org/s/ufUZAtLeHn

Filter: [contains("x"), contains("x\u0000"), contains("x\u0000y"), contains("y")]

JSON: "x\u0000y"

Expected behavior

Output should be [true, true, true, true].

Actual behavior

Output is [true, true, true, false].

Environment

  • OS and Version: Linux Mint 18.2 (Ubuntu 16.04)
  • jq-1.5-1-a5b5cbe
@pkoppstein
Copy link
Contributor

I can confirm that this is a bug that persists in the current (October 2, 2018) "master" version.

@haguenau
Copy link
Contributor

This is interesting; I will look into this if I find the time (if not, I'll try to at least contribute some tests that show variations on this issue).

haguenau added a commit to haguenau/jq that referenced this issue Jan 19, 2019
This is meant to fix jqlang#1732. In the existing code, the libc function
underlying `contains` on strings was `strstr`, which only works
properly on C strings (i.e., arrays of characters, where the first
null is an end marker). This is not suitable for JSON strings, which
can embed null bytes; for example `"xx\u0000yy"` is considered to
include `"\u0000aa"` as a substring, since the latter is interpreted
as the empty string.

This changeset uses `memset` instead of `strstr`.
@nicowilliams
Copy link
Contributor

jv string values are both, counted-length, and C strings. They are stored with all escapes unescaped. That means that the string "\n\u0000\n", for example, is stored in memory as 0x0A 0x00 0x0A 0x00, and the length is understood to be 3 (not 1, and nor 4). If you care to understand how the string literal gets parsed, you need to look at src/parser.y and src/lexer.l:

In src/parser.y we have the following, which generates strings as a sequence of additions of sub-strings which later get constant-folded (by constant_fold()) into a single LOADK with the final string value as the concatenation of those substrings:

583 String:
584 QQSTRING_START { $<literal>$ = jv_string("text"); } QQString QQSTRING_END {
585   $$ = $3;
586   jv_free($<literal>2);
587 } |
588 FORMAT QQSTRING_START { $<literal>$ = $1; } QQString QQSTRING_END {
589   $$ = $4;
590   jv_free($<literal>3);
591 }
592
593
594 QQString:
595 %empty {
596   $$ = gen_const(jv_string(""));
597 } |
598 QQString QQSTRING_TEXT {
599   $$ = gen_binop($1, gen_const($2), '+');
600 } |
601 QQString QQSTRING_INTERP_START Exp QQSTRING_INTERP_END {
602   $$ = gen_binop($1, gen_format($3, jv_copy($<literal>0)), '+');
603 }

And this is where escapes are handled in the lexer, which as you can see, leverages jv_parse_sized() to do the actual work of parsing a JSON string (this only works because the lexer chunks string literals so that the string which is wrapped with double-quotes at line 108 cannot contain double-quotes):

 98 <IN_QQSTRING>{
 99   "\\(" {
100     return enter(QQSTRING_INTERP_START, YY_START, yyscanner);
101   }
102   "\"" {
103     yy_pop_state(yyscanner);
104     return QQSTRING_END;
105   }
106   (\\[^u(]|\\u[a-zA-Z0-9]{0,4})+ {
107     /* pass escapes to the json parser */
108     jv escapes = jv_string_fmt("\"%.*s\"", (int)yyleng, yytext);
109     yylval->literal = jv_parse_sized(jv_string_value(escapes), jv_string_length_bytes(jv_copy(escapes)));
110     jv_free(escapes);
111     return QQSTRING_TEXT;
112   }
113   [^\\\"]+ {
114     yylval->literal = jv_string_sized(yytext, yyleng);
115     return QQSTRING_TEXT;
116   }
117   . {
118     return INVALID_CHARACTER;
119   }
120 }

By the time the jq program is compiled we have something like:

$ jq --debug-dump-disasm -n '"abc\u0000def"|contains("de")'
0000 TOP
0001 LOADK "abc\u0000def"
0003 PUSHK_UNDER "de"
0005 CALL_BUILTIN contains
0008 RET

false

and the string constant is of length 7 (bytes), with an embedded NUL.

The contains builtin is implemented by f_contains() in src/builtin.c, which calls jv_contains().

The bug is that jv_contains() uses the C library's strstr() function to check if one string is contained in another, but strstr() deals in C strings, which means that embedded NULs terminate the strings, which means that strstr() will think the haystack is "abc", not "abc\u0000def", therefore the haystack does not contain the needle "de".

The fix is to not use strstr() but _jq_memmem() (which uses memem() if available):

diff --git a/src/jv.c b/src/jv.c
index 979d188..f051d73 100644
--- a/src/jv.c
+++ b/src/jv.c
@@ -1342,7 +1342,8 @@ int jv_contains(jv a, jv b) {
   } else if (jv_get_kind(a) == JV_KIND_ARRAY) {
     r = jv_array_contains(jv_copy(a), jv_copy(b));
   } else if (jv_get_kind(a) == JV_KIND_STRING) {
-    r = strstr(jv_string_value(a), jv_string_value(b)) != 0;
+    r = _jq_memmem(jv_string_value(a), jv_string_length_bytes(jv_copy(a)),
+                   jv_string_value(b), jv_string_length_bytes(jv_copy(b))) != 0;
   } else {
     r = jv_equal(jv_copy(a), jv_copy(b));
   }

@wchargin
Copy link
Contributor Author

Thanks, @nicowilliams. This is consistent with my cursory analysis, but
I wasn’t familiar enough with the code to be confident that it was the
full explanation. Appreciate the detailed account.

@haguenau
Copy link
Contributor

Woah, I was writing the text for my pull request while you were posting the explanation, @nicowilliams; talk about synchronicity :).

@nicowilliams
Copy link
Contributor

Fixed with 61cd6db.

@nicowilliams
Copy link
Contributor

@haguenau Oy, sorry... I should have let you fix it because it's always good to have more contributors! And this was both, easy, and tricky, which makes it a great bug for a contribution :(

@haguenau
Copy link
Contributor

Heh. Things like that happen, that's OK. Perhaps consider using the tests from #1793?
The code changes themselves are almost exactly identical between our two versions (I was more verbose than strictly necessary).

@nicowilliams
Copy link
Contributor

@haguenau Nice! Yeah, so, can you a) rebase so it's now just the tests, and b) try to reduce the number of tests by following the one I pushed as an example? (b) is not strictly necessary, naturally -- I suspect that the following pattern

[contains("foo"), contains("\u0000b"), contains("\u0000z"),  contains("bar"), contains("baz")]
"foo\u0000bar"
[true, true, false, true, false]

is faster than

contains("foo")
"foo\u0000bar"
true

contains("\u0000b")
"foo\u0000bar"
true

...

though I've never actually measured this sort of thing, so maybe that's wrong, and anyways, I find it easier to read. Since the [containts(...), ...] expression could get long, you can split that up into several tests.

Also, remove this test:

0
0
0

@nicowilliams
Copy link
Contributor

FYI, I've to pack and take a very early morning flight, so I might not communicate further or push your PR until tomorrow or Sunday :(

@nicowilliams
Copy link
Contributor

Anyways, thanks both of you, @wchargin and @haguenau, for a neat bug report and fix!

@haguenau
Copy link
Contributor

Thanks for the detailed review, Nico; I will probably have time to submit test cases that match your expectations early this week. Even though you fixed the code itself at the same time I did, this is not lost; I learned a bit more about jq's internals.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants