Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JSON.dump: avoid redundant UTF-8 validation #595

Merged
merged 1 commit into from
Oct 3, 2024

Conversation

casperisfine
Copy link

While profiling JSON.dump I noticed a large amount of time is spent validating UTF-8:

Capture d’écran 2024-09-02 à 10 25 23

Given that we called rb_enc_str_asciionly_p, if the string encoding isn't valid UTF-8, we can't know it very cheaply by checking the encoding and coderange that was just computed by Ruby, rather than to do it ourselves.

Also Ruby might have already computed that earlier.

cc @hsbt

Given that we called `rb_enc_str_asciionly_p`, if the string encoding
isn't valid UTF-8, we can't know it very cheaply by checking the
encoding and coderange that was just computed by Ruby, rather than
to do it ourselves.

Also Ruby might have already computed that earlier.
@hsbt hsbt merged commit 9513e46 into ruby:master Oct 3, 2024
75 checks passed
@Earlopain
Copy link

Earlopain commented Oct 8, 2024

👋 Just wanted to let you know that I encountered some breakage from this PR:

Reduced (actual code involves networking somewhere):

require "stringio"
require "json"

foo = StringIO.new("".b)
foo << '{"foo":"♥"}'
str = foo.string

pp str.encoding # => #<Encoding:BINARY (ASCII-8BIT)>
pp str.valid_encoding? # => true
pp str.bytes # => [123, 34, 102, 111, 111, 34, 58, 34, 226, 153, 165, 34, 125]
JSON.generate(str)
# lib/json/common.rb:306:in 'JSON::Ext::Generator::State#generate': source sequence is illegal/malformed utf-8 (JSON::GeneratorError)

This already doesn't occur anymore since c96351f (or maybe 0819553 but that doesn't build for me)

I'm very unfamiliar with this but here is what I found from this PR:
Ref: https://github.com/ruby/json/pull/595/files#diff-2bb51be932dec14923f6eb515f24b1b593737f0d3f8e76eeecf58cff3052819fR206-R213

  • String is (obviously) not ascii only
  • RB_ENC_CODERANGE(string) == RUBY_ENC_CODERANGE_VALID
  • RB_ENCODING_GET_INLINED(string) == 0, however rb_utf8_encindex() == 1

This all makes sense, I think. If I understand this, the string would actually need to declare itself as utf8, even if the bytes already happen to be valid utf8.

Is this supposed to work (should a test be added?) or am I relying on unspecified behavior?

@casperisfine
Copy link
Author

or am I relying on unspecified behavior?

Well, you can always debate this, but you are passing a BINARY string to JSON.

previously it would inspect the string to figure out if that binary was valid UTF-8 by chance, but no longer does. This made sense back in Ruby 1.8, but now that Ruby strings have an associated encoding, it no longer does.

Given the cost of checking for that, I think it's an acceptable regression.

When getting data from the network like you suggest you do, you should ensure strings have the proper encoding at the boundaries.

@Earlopain
Copy link

Earlopain commented Oct 8, 2024

I'm good with this changing, the fix for me would be simple. It just took some effort to find out where the string was coming from.

Anyways, I wonder if your optimization is still in place. With c96351f I no longer encounter the exception, even without changing my code. Here it seems to check every byte again?

for (pos = 0; pos < in_utf8_len;) {

Edit:

I also found

static int enc_utf8_compatible_p(rb_encoding *enc)
{
if (enc == rb_usascii_encoding()) return 1;
if (enc == rb_utf8_encoding()) return 1;
return 0;
}

which includes binary. I suspect binary is the only encoding that behaves this way, the other seem to just be reencoded (?)

@casperisfine
Copy link
Author

I wonder if your optimization is still in place.

Yes, I decided to merge the big PR that rewrite a lot to restart from a clean base. I may have to re-do this PR.

@eregon
Copy link
Member

eregon commented Oct 8, 2024

FWIW this is what the pure-Ruby backend does:

string = encode(::Encoding::UTF_8)

That will work on BINARY strings only if they are ascii_only?, otherwise it will raise.

@Earlopain rb_usascii_encoding() is not the BINARY encoding, it's the US-ASCII encoding.

@casperisfine
Copy link
Author

That will work on BINARY strings only if they are ascii_only?, otherwise it will raise.

That sounds sensible. I think it's what this PR was doing? Definitely worth adding a test for that.

@Earlopain
Copy link

@eregon yes, my bad. I was confusing it with ASCII_8BIT. This makes more sense that way.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants