JSON.dump: avoid redundant UTF-8 validation #595

casperisfine · 2024-09-02T08:27:11Z

While profiling JSON.dump I noticed a large amount of time is spent validating UTF-8:

Given that we called rb_enc_str_asciionly_p, if the string encoding isn't valid UTF-8, we can't know it very cheaply by checking the encoding and coderange that was just computed by Ruby, rather than to do it ourselves.

Also Ruby might have already computed that earlier.

cc @hsbt

Given that we called `rb_enc_str_asciionly_p`, if the string encoding isn't valid UTF-8, we can't know it very cheaply by checking the encoding and coderange that was just computed by Ruby, rather than to do it ourselves. Also Ruby might have already computed that earlier.

Earlopain · 2024-10-08T17:13:48Z

👋 Just wanted to let you know that I encountered some breakage from this PR:

Reduced (actual code involves networking somewhere):

require "stringio"
require "json"

foo = StringIO.new("".b)
foo << '{"foo":"♥"}'
str = foo.string

pp str.encoding # => #<Encoding:BINARY (ASCII-8BIT)>
pp str.valid_encoding? # => true
pp str.bytes # => [123, 34, 102, 111, 111, 34, 58, 34, 226, 153, 165, 34, 125]
JSON.generate(str)
# lib/json/common.rb:306:in 'JSON::Ext::Generator::State#generate': source sequence is illegal/malformed utf-8 (JSON::GeneratorError)

This already doesn't occur anymore since c96351f (or maybe 0819553 but that doesn't build for me)

I'm very unfamiliar with this but here is what I found from this PR:
Ref: https://github.com/ruby/json/pull/595/files#diff-2bb51be932dec14923f6eb515f24b1b593737f0d3f8e76eeecf58cff3052819fR206-R213

String is (obviously) not ascii only
RB_ENC_CODERANGE(string) == RUBY_ENC_CODERANGE_VALID
RB_ENCODING_GET_INLINED(string) == 0, however rb_utf8_encindex() == 1

This all makes sense, I think. If I understand this, the string would actually need to declare itself as utf8, even if the bytes already happen to be valid utf8.

Is this supposed to work (should a test be added?) or am I relying on unspecified behavior?

casperisfine · 2024-10-08T17:25:00Z

or am I relying on unspecified behavior?

Well, you can always debate this, but you are passing a BINARY string to JSON.

previously it would inspect the string to figure out if that binary was valid UTF-8 by chance, but no longer does. This made sense back in Ruby 1.8, but now that Ruby strings have an associated encoding, it no longer does.

Given the cost of checking for that, I think it's an acceptable regression.

When getting data from the network like you suggest you do, you should ensure strings have the proper encoding at the boundaries.

Earlopain · 2024-10-08T17:42:02Z

I'm good with this changing, the fix for me would be simple. It just took some effort to find out where the string was coming from.

Anyways, I wonder if your optimization is still in place. With c96351f I no longer encounter the exception, even without changing my code. Here it seems to check every byte again?

json/ext/json/ext/generator/generator.c

Line 48 in e2dd834

for (pos = 0; pos < in_utf8_len;) {

Edit:

I also found

json/ext/json/ext/generator/generator.c

Lines 760 to 765 in e2dd834

    
           static int enc_utf8_compatible_p(rb_encoding *enc) 
        
           { 
        
               if (enc == rb_usascii_encoding()) return 1; 
        
               if (enc == rb_utf8_encoding()) return 1; 
        
               return 0; 
        
           }

which includes binary. I suspect binary is the only encoding that behaves this way, the other seem to just be reencoded (?)

casperisfine · 2024-10-08T18:02:01Z

I wonder if your optimization is still in place.

Yes, I decided to merge the big PR that rewrite a lot to restart from a clean base. I may have to re-do this PR.

eregon · 2024-10-08T19:38:33Z

FWIW this is what the pure-Ruby backend does:

json/lib/json/pure/generator.rb

Line 458 in e2dd834

string = encode(::Encoding::UTF_8)

That will work on BINARY strings only if they are ascii_only?, otherwise it will raise.

@Earlopain rb_usascii_encoding() is not the BINARY encoding, it's the US-ASCII encoding.

casperisfine · 2024-10-08T19:40:02Z

That will work on BINARY strings only if they are ascii_only?, otherwise it will raise.

That sounds sensible. I think it's what this PR was doing? Definitely worth adding a test for that.

Earlopain · 2024-10-08T19:43:02Z

@eregon yes, my bad. I was confusing it with ASCII_8BIT. This makes more sense that way.

hsbt merged commit 9513e46 into ruby:master Oct 3, 2024
75 checks passed

casperisfine mentioned this pull request Oct 14, 2024

Restore support for improperly encoded strings #609

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

JSON.dump: avoid redundant UTF-8 validation #595

JSON.dump: avoid redundant UTF-8 validation #595

casperisfine commented Sep 2, 2024

Earlopain commented Oct 8, 2024 •

edited

Loading

casperisfine commented Oct 8, 2024

Earlopain commented Oct 8, 2024 •

edited

Loading

casperisfine commented Oct 8, 2024

eregon commented Oct 8, 2024

casperisfine commented Oct 8, 2024

Earlopain commented Oct 8, 2024

JSON.dump: avoid redundant UTF-8 validation #595

JSON.dump: avoid redundant UTF-8 validation #595

Conversation

casperisfine commented Sep 2, 2024

Earlopain commented Oct 8, 2024 • edited Loading

casperisfine commented Oct 8, 2024

Earlopain commented Oct 8, 2024 • edited Loading

casperisfine commented Oct 8, 2024

eregon commented Oct 8, 2024

casperisfine commented Oct 8, 2024

Earlopain commented Oct 8, 2024

Earlopain commented Oct 8, 2024 •

edited

Loading

Earlopain commented Oct 8, 2024 •

edited

Loading