JSON.dump: avoid redundant UTF-8 validation #595
Conversation
While profiling JSON.dump I noticed a large amount of time is spent validating UTF-8. Given that we already call rb_enc_str_asciionly_p, we can know very cheaply whether a string is valid UTF-8 by checking its encoding and the coderange Ruby just computed, rather than doing the validation ourselves. Also, Ruby might have already computed that coderange earlier.

cc @hsbt
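The PR does this in C; below is a minimal Ruby-level sketch of the same logic (the method name is mine, not from the PR). Both predicates are cheap after the first scan because Ruby caches the coderange on the string object:

# Ruby-level sketch of the check; not the PR's actual C code.
def cheaply_known_valid_utf8?(str)
  return true if str.ascii_only? # 7-bit ASCII is always valid UTF-8
  # The scan above cached the coderange, so valid_encoding? below
  # does not rescan the bytes.
  str.encoding == Encoding::UTF_8 && str.valid_encoding?
end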
👋 Just wanted to let you know that I encountered some breakage from this PR. Reduced (the actual code involves networking somewhere):

require "stringio"
require "json"
foo = StringIO.new("".b)
foo << '{"foo":"♥"}'
str = foo.string
pp str.encoding # => #<Encoding:BINARY (ASCII-8BIT)>
pp str.valid_encoding? # => true
pp str.bytes # => [123, 34, 102, 111, 111, 34, 58, 34, 226, 153, 165, 34, 125]
JSON.generate(str)
# lib/json/common.rb:306:in 'JSON::Ext::Generator::State#generate': source sequence is illegal/malformed utf-8 (JSON::GeneratorError)

This no longer occurs since c96351f (or maybe 0819553, but that doesn't build for me). I'm very unfamiliar with this, but here is what I found from this PR:
This all makes sense, I think. If I understand it correctly, the string would actually need to declare itself as UTF-8, even if the bytes already happen to be valid UTF-8. Is this supposed to work (should a test be added?), or am I relying on unspecified behavior?
Well, you can always debate this, but you are passing a BINARY string to JSON. Previously it would inspect the string to figure out whether that binary happened to be valid UTF-8, but it no longer does. That behavior made sense back in Ruby 1.8, but now that Ruby strings have an associated encoding, it no longer does. Given the cost of that check, I think it's an acceptable regression. When getting data from the network, as you say you do, you should ensure strings have the proper encoding at the boundaries.
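For example (my sketch, not code from the thread): tagging the bytes with their actual encoding at the boundary makes the reduced example above work again.

require "stringio"
require "json"

io = StringIO.new("".b)
io << '{"foo":"♥"}'
str = io.string.force_encoding(Encoding::UTF_8) # relabel only, no transcoding
raise "invalid UTF-8 from the network" unless str.valid_encoding?
JSON.generate(str) # fine: the string now declares itself as UTF-8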
I'm good with this changing; the fix for me would be simple. It just took some effort to find out where the string was coming from. Anyway, I wonder if your optimization is still in place. With c96351f I no longer encounter the exception, even without changing my code. Here it seems to check every byte again?

ext/json/ext/generator/generator.c line 48 in e2dd834
Edit: I also found ext/json/ext/generator/generator.c lines 760–765 in e2dd834, which include binary. I suspect binary is the only encoding that behaves this way; the others seem to just be reencoded(?)
Yes, I decided to merge the big PR that rewrites a lot of the code, to restart from a clean base. I may have to redo this PR.
FWIW this is what the pure-Ruby backend does:

lib/json/pure/generator.rb line 458 in e2dd834
That will work on BINARY strings only if they are ascii_only?; otherwise it will raise.
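Assuming the pure generator transcodes with String#encode (an assumption on my part; I haven't checked the exact call at that line), the behavior is easy to reproduce:

"abc".b.encode(Encoding::UTF_8) # => "abc"; ascii_only?, so conversion is trivial
"♥".b.encode(Encoding::UTF_8)   # raises Encoding::UndefinedConversionError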
@Earlopain rb_usascii_encoding() is not the BINARY encoding, it's the US-ASCII encoding.
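For anyone else mixing the two up, they differ exactly on bytes above 0x7F:

"\xFF".force_encoding(Encoding::US_ASCII).valid_encoding? # => false, US-ASCII is 7-bit only
"\xFF".force_encoding(Encoding::BINARY).valid_encoding?   # => true, BINARY accepts any bytes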
That sounds sensible. I think it's what this PR was doing? Definitely worth adding a test for that.
@eregon yes, my bad. I was confusing it with the BINARY encoding.